eval.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103

\section{Experiments}
To evaluate our approach, we perform experiments with our implementation
\psychic{} on C meta programs to generate decision trees of patches.

\subsection{Experiment setup}
Experiments are run on a machine with AMD Ryzen 3700X at \SI{3.6}{\giga\hertz}
and \SI{16}{\giga\byte} of memory.  Software dependencies and tooling
are provided by a declarative Nix environment~\cite{nix} for reproducibility.

We evaluate \psychic{} on patches generated by MSV~\cite{msv}, a fork
of the pattern-based \gls{apr} tool \textsc{Prophet}~\cite{prophet}
that adds extra repair templates and generate meta programs.
Patches are generated for \textsc{IntroClass}, a set of attempts for solving
homework problems~\cite{introclass}.

In addition to the test cases provided by \textsc{IntroClass},
which are non-exhaustive of the input domains, we use
property tests~\cite{hypothesis} with the respective
ground truth program\footnote{\url{https://github.com/McSinyx/IntroClass}}
to select buggy programs.  MSV then generate fixes in the form of meta programs,
which are then post-processed into \psychic{}-compatible format.

Due to the limitation of \klee{}'s POSIX runtime, \texttt{libc} calls
such as \texttt{scanf} has to be replaced by more direct access
of standard input or command-line arguments.  Unless the logic is outside
of the \texttt{main} function, an output value is manually annotated.

A selection of 10 meta programs with at least two semantically different
revisions (including the original buggy version) are then fed to \psychic{}
to generate differential tests and decision trees with a timeout of 10 minutes.
The vast majority of these bugs are either simple logical mistakes
or missing initializations that trigger \glspl{ub}.

\subsection{Results}
Experiment results on \textsc{IntroClass} is summarized in table~\ref{result}.
In case of \texttt{digits} \texttt{1b31@2} and \texttt{smallest} \texttt{d25c@1},
\psychic{} failed to generate differential tests.  In other cases,
except for the timeout in \texttt{digits} \texttt{0cdf@3},
the tool successfully differentiate all semantically different patches.

\begin{table}[h]
  \begin{center}
    \begin{tabular}{l c l r r r r}
      \toprule
      Task & Attempt & Bug type & $n$ & Tree height & Largest cluster & Time (s)\\
      \midrule
      \texttt{checksum} & \texttt{3b23@3} & Arithmetic & 2 & 1 & 1 & 1.2\\
      \texttt{checksum} & \texttt{cb24@3} & Logic & 3 & 1 & 2 & 1.7\\
      \texttt{digits} & \texttt{0cdf@3} & Logic & 16 & 1 & 5 & 600.0\\
      \texttt{digits} & \texttt{1b31@2} & Logic & 5 & 0 & 5 & 0.9\\
      \texttt{grade} & \texttt{1b31@3} & \gls{ub} & 2 & 1 & 1 & 0.8\\
      \texttt{grade} & \texttt{317a@3} & \gls{ub} & 8 & 3 & 5 & 1.0\\
      \texttt{grade} & \texttt{b192@3} & \gls{ub} & 2 & 1 & 1 & 0.5\\
      \texttt{median} & \texttt{97f6@3} & \gls{ub} & 7 & 1 & 6 & 1.6\\
      \texttt{smallest} & \texttt{0704@2} & \gls{ub} & 5 & 1 & 3 & 87.0\\
      \texttt{smallest} & \texttt{d25c@1} & \gls{ub} & 5 & 0 & 5 & 1.3\\
      \bottomrule
    \end{tabular}
  \end{center}
  \caption{Decision tree quality on the patch pool generated by MSV
    for \textsc{IntroClass} programs.  Respective to each meta program
    with $n$ revisions are the height of the patch decision tree
    and the maximum number of undistinguished revisions.}\label{result}
\end{table}

\psychic{} took orders of magnitude more time on \texttt{digits} \texttt{0cdf@3}
and \texttt{smallest} \texttt{d25c@1}, which respectively include a loop
and heavily-nested conditions.

\begin{figure}
  \begin{tikzpicture}[mymatrix/.style={matrix of nodes,
                                       nodes=typetag,row sep=1ex},
                      typetag/.style={draw,anchor=west},
                      title/.style={draw=none,inner sep=0pt},
                      line width=0.2ex]
    \node[draw,diamond,aspect=4] (root) {Input: 6 3 2 1 0};
    \node[draw,diamond,aspect=4,below left=6em of root]
      (left) {Input: 66 7 6 2 1};
    \draw[-angle 45] (root) to node[above,sloped]{Output: \gls{ub}} (left);
    \node[draw,diamond,aspect=4,below right=6em of root]
      (right) {Input: 6 3 2 1 1};
    \draw[-angle 45] (root) to node[above,sloped]{Output: F} (right);
    \node[draw,below left=6em of left] (i) {Original};
    \node[draw,right=5em of i] (d) {Patches 108, 147, 187, 227, 267};
    \node[draw,right=5em of d] (a) {Patch 10};
    \node[draw,right=5em of a] (b) {Patch 307};
    \draw[-angle 45] (left) to node[above,sloped]{Output: \gls{ub}} (i);
    \draw[-angle 45] (left) to node[above,sloped]{Output: F} (d);
    \draw[-angle 45] (right) to node[above,sloped]{Output: D} (a);
    \draw[-angle 45] (right) to node[above,sloped]{Output: F} (b);
  \end{tikzpicture}
  \caption{Decision tree generated from patches of \texttt{grade}
    \texttt{317a@3}, with outputs minimized.}\label{317a}
\end{figure}

To demonstrate the decision tree, we examine the result for \texttt{grade}
\texttt{317a@3}.  The task is to calculate the letter grade,
given the threshold for each grade from A to D, followed by a percentage grade.
This attempt is buggy because it misses the initialization the result
grade variable, which is inserted in various locations by MSV.
The decision tree generated from these patches is illustrated
in figure~\ref{317a}, from which patch number 10 can be deduced to be
the most correct one.