discuss.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49

\section{Discussion}
The experiment on \textsc{IntroClass} reveals some limitations
of \psychic{}.  First, the tool has trouble scaling with growing path spaces.
While this is inherent to the path explosion problem, not all paths
need to be explored.  Therefore, more aggressive path pruning~\cite{homi,hydiff}
and scheduling heuristics~\cite{directed,rsc,paradyse} could prove
to be beneficial.  The use of \textsc{DiffTest} on any (undifferentiated)
state pair in algorithm~\ref{diffgen} further hurts performance.

Next, while in theory, \klee{} can execute any LLVM bitcode,
the programming interface sets the limit on logical interpretation.
More specifically, to leave rooms for optimization, C and POSIX standards
leave a lot of behaviors unspecified.  For example, function \texttt{atoi(3)},
which converts string to integer, may return any number upon
a parsing error~\cite{atoi}, lumping this implementation-defined number
with error cases.

Some others like \texttt{scanf(3)} and \texttt{printf(3)}
are prohibitively expensive due to heavy pattern matching, and since
they operate on input/output, their calls are likely dominators
or post-dominators in the control-flow graph, requiring manual interventions.
Failing to replace these calls could either significantly degrade efficiency
or force the theorem prover to give up on \gls{smt} solving.  In case
of \textsc{IntroClass}, ignoring \texttt{printf(3)} calls disabled
the use of output files capturing.

On more positive notes, capturing local function return values
and mutable arguments has proved its usability.  Together
with the use of eager concrete execution to skip symbolic test generation,
promising preliminaries have been obtained.

\subsection{Threats to Validity}
The small scope of experiments on toy programs challenges
the practical generalizibility of the tool.  Since the \textsc{Prophet}-based
\gls{apr} tool MSV mines for tokens within the codebase~\cite{prophet},
patches are not diverse in semantics and the overall result
is not statistically strong.

\subsection{Related Works}
There exists multiple differential test generators for a pair
of program revisions, such as \textsc{Nezha} which uses black- or gray-box
fuzzing~\cite{nezha}, \textsc{Shadow} based on symbolic execution~\cite{shadow},
and \textsc{HyDiff} combining both~\cite{hydiff}.  Some recent work targets
the opposite problem of detecting semantic code clone~\cite{semclone}.

For interpreting multiple program variants symbolically at the same time,
alternative to \psychic{} keeping independent states, other approaches
may choose shadow execution~\cite{shadow}, or to join~\cite{join-split}
or merge~\cite{merge} states at higher-order branches.