# Technical "whitepaper" for afl-fuzz

This document provides a quick overview of the guts of American Fuzzy Lop.
See README for the general instruction manual; and for a discussion of
motivations and design goals behind AFL, see historical_notes.md.

## 0. Design statement

American Fuzzy Lop does its best not to focus on any singular principle of
operation and not be a proof-of-concept for any specific theory. The tool can
be thought of as a collection of hacks that have been tested in practice,
found to be surprisingly effective, and have been implemented in the simplest,
most robust way I could think of at the time.

Many of the resulting features are made possible thanks to the availability of
lightweight instrumentation that served as a foundation for the tool, but this
mechanism should be thought of merely as a means to an end. The only true
governing principles are speed, reliability, and ease of use.

## 1. Coverage measurements

The instrumentation injected into compiled programs captures branch (edge)
coverage, along with coarse branch-taken hit counts. The code injected at
branch points is essentially equivalent to:

```c
  cur_location = <COMPILE_TIME_RANDOM>;
  shared_mem[cur_location ^ prev_location]++;
  prev_location = cur_location >> 1;
```

The `cur_location` value is generated randomly to simplify the process of
linking complex projects and keep the XOR output distributed uniformly.

The `shared_mem[]` array is a 64 kB SHM region passed to the instrumented
binary by the caller. Every byte set in the output map can be thought of as a
hit for a particular (`branch_src`, `branch_dst`) tuple in the instrumented
code.

The size of the map is chosen so that collisions are sporadic with almost all
of the intended targets, which usually sport between 2k and 10k discoverable
branch points:

```
   Branch cnt | Colliding tuples | Example targets
  ------------+------------------+-----------------
        1,000 | 0.75%            | giflib, lzo
        2,000 | 1.5%             | zlib, tar, xz
        5,000 | 3.5%             | libpng, libwebp
       10,000 | 7%               | libxml
       20,000 | 14%              | sqlite
       50,000 | 30%              | -
```

At the same time, its size is small enough to allow the map to be analyzed
in a matter of microseconds on the receiving end, and to effortlessly fit
within L2 cache.

This form of coverage provides considerably more insight into the execution
path of the program than simple block coverage. In particular, it trivially
distinguishes between the following execution traces:

```
  A -> B -> C -> D -> E  (tuples: AB, BC, CD, DE)
  A -> B -> D -> C -> E  (tuples: AB, BD, DC, CE)
```

This aids the discovery of subtle fault conditions in the underlying code,
because security vulnerabilities are more often associated with unexpected
or incorrect state transitions than with merely reaching a new basic block.

The reason for the shift operation in the last line of the pseudocode shown
earlier in this section is to preserve the directionality of tuples (without
this, A ^ B would be indistinguishable from B ^ A) and to retain the identity
of tight loops (otherwise, A ^ A would be obviously equal to B ^ B).

The absence of simple saturating arithmetic opcodes on Intel CPUs means that
the hit counters can sometimes wrap around to zero. Since this is a fairly
unlikely and localized event, it's seen as an acceptable performance
trade-off.

## 2. Detecting new behaviors

The fuzzer maintains a global map of tuples seen in previous executions; this
data can be rapidly compared with individual traces and updated in just a
couple of dword- or qword-wide instructions and a simple loop.
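To make the mechanics concrete, here is a minimal sketch of such a
word-at-a-time comparison. The names mirror the real code (`trace_bits`,
`virgin_bits`, `has_new_bits`), but this is an illustration rather than the
exact afl-fuzz routine, which additionally distinguishes brand-new tuples from
mere hit-count changes:

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE (1 << 16)   /* 64 kB trace map, as described above */

/* Minimal sketch: virgin_bits starts out all-ones and has bits cleared as
   behaviors are observed. Any overlap between the current trace and the
   still-virgin bits signals something new; those bits are then cleared so
   that the same find is only reported once. */
static int has_new_bits(uint8_t *trace_bits, uint8_t *virgin_bits) {

  uint64_t *cur = (uint64_t *)trace_bits;
  uint64_t *vir = (uint64_t *)virgin_bits;
  int ret = 0;

  for (size_t i = 0; i < MAP_SIZE / 8; i++) {

    if (cur[i] & vir[i]) {   /* Previously-unseen tuple or hit bucket. */
      vir[i] &= ~cur[i];     /* Mark it as seen from now on.           */
      ret = 1;
    }

  }

  return ret;

}
```

Keeping the check to a single qword-wide pass over 64 kB is what makes it
cheap enough to run after every single execution.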
When a mutated input produces an execution trace containing new tuples, the
corresponding input file is preserved and routed for additional processing
later on (see section #3). Inputs that do not trigger new local-scale state
transitions in the execution trace (i.e., produce no new tuples) are
discarded, even if their overall control flow sequence is unique.

This approach allows for a very fine-grained and long-term exploration of
program state while not having to perform any computationally intensive and
fragile global comparisons of complex execution traces, and while avoiding the
scourge of path explosion.

To illustrate the properties of the algorithm, consider that the second trace
shown below would be considered substantially new because of the presence of
new tuples (CA, AE):

```
  #1: A -> B -> C -> D -> E
  #2: A -> B -> C -> A -> E
```

At the same time, with #2 processed, the following pattern will not be seen
as unique, despite having a markedly different overall execution path:

```
  #3: A -> B -> C -> A -> B -> C -> A -> B -> C -> D -> E
```

In addition to detecting new tuples, the fuzzer also considers coarse tuple
hit counts. These are divided into several buckets:

```
  1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+
```

To some extent, the number of buckets is an implementation artifact: it allows
an in-place mapping of an 8-bit counter generated by the instrumentation to
an 8-position bitmap relied on by the fuzzer executable to keep track of the
already-seen execution counts for each tuple.
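A sketch of this bucketing step is shown below. afl-fuzz implements it with a
precomputed 256-entry lookup table applied in place over the whole trace map;
the function here is an equivalent illustration, and the exact bucket-to-bit
assignment is one plausible choice:

```c
#include <stddef.h>
#include <stdint.h>

/* Replace each raw 8-bit hit count with a single set bit identifying its
   bucket, so the buckets observed for a tuple can simply be OR-ed into
   the global map. */
static uint8_t bucket_for(uint8_t cnt) {

  if (cnt == 0)   return 0;
  if (cnt == 1)   return 1;    /* bucket "1"      */
  if (cnt == 2)   return 2;    /* bucket "2"      */
  if (cnt == 3)   return 4;    /* bucket "3"      */
  if (cnt <= 7)   return 8;    /* bucket "4-7"    */
  if (cnt <= 15)  return 16;   /* bucket "8-15"   */
  if (cnt <= 31)  return 32;   /* bucket "16-31"  */
  if (cnt <= 127) return 64;   /* bucket "32-127" */
  return 128;                  /* bucket "128+"   */

}

/* Applied in place over the trace map after every execution. */
static void classify_counts(uint8_t *trace_bits, size_t map_size) {

  for (size_t i = 0; i < map_size; i++)
    trace_bits[i] = bucket_for(trace_bits[i]);

}
```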
Changes within the range of a single bucket are ignored; transition from one
bucket to another is flagged as an interesting change in program control flow,
and is routed to the evolutionary process outlined in the section below.

The hit count behavior provides a way to distinguish between potentially
interesting control flow changes, such as a block of code being executed
twice when it was normally hit only once. At the same time, it is fairly
insensitive to empirically less notable changes, such as a loop going from
47 cycles to 48. The counters also provide some degree of "accidental"
immunity against tuple collisions in dense trace maps.

The execution is policed fairly heavily through memory and execution time
limits; by default, the timeout is set at 5x the initially-calibrated
execution speed, rounded up to 20 ms. The aggressive timeouts are meant to
prevent dramatic fuzzer performance degradation by descending into tarpits
that, say, improve coverage by 1% while being 100x slower; we pragmatically
reject them and hope that the fuzzer will find a less expensive way to reach
the same code. Empirical testing strongly suggests that more generous time
limits are not worth the cost.

## 3. Evolving the input queue

Mutated test cases that produced new state transitions within the program are
added to the input queue and used as a starting point for future rounds of
fuzzing. They supplement, but do not automatically replace, existing finds.

In contrast to more greedy genetic algorithms, this approach allows the tool
to progressively explore various disjoint and possibly mutually incompatible
features of the underlying data format, as shown in this image:

*(image omitted)*

Several practical examples of the results of this algorithm are discussed
here:

  http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html
  http://lcamtuf.blogspot.com/2014/11/afl-fuzz-nobody-expects-cdata-sections.html

The synthetic corpus produced by this process is essentially a compact
collection of "hmm, this does something new!" input files, and can be used to
seed any other testing processes down the line (for example, to manually
stress-test resource-intensive desktop apps).

With this approach, the queue for most targets grows to somewhere between 1k
and 10k entries; approximately 10-30% of this is attributable to the discovery
of new tuples, and the remainder is associated with changes in hit counts.

The following table compares the relative ability to discover file syntax and
explore program states when using several different approaches to guided
fuzzing. The instrumented target was GNU patch 2.7.3 compiled with `-O3` and
seeded with a dummy text file; the session consisted of a single pass over the
input queue with afl-fuzz:

```
  Fuzzer guidance   | Blocks  | Edges   | Edge hit | Highest-coverage
  strategy used     | reached | reached | cnt var  | test case generated
  ------------------+---------+---------+----------+---------------------------
  (Initial file)    | 156     | 163     | 1.00     | (none)
                    |         |         |          |
  Blind fuzzing S   | 182     | 205     | 2.23     | First 2 B of RCS diff
  Blind fuzzing L   | 228     | 265     | 2.23     | First 4 B of -c mode diff
  Block coverage    | 855     | 1,130   | 1.57     | Almost-valid RCS diff
  Edge coverage     | 1,452   | 2,070   | 2.18     | One-chunk -c mode diff
  AFL model         | 1,765   | 2,597   | 4.99     | Four-chunk -c mode diff
```

The first entry for blind fuzzing ("S") corresponds to executing just a single
round of testing; the second set of figures ("L") shows the fuzzer running in
a loop for a number of execution cycles comparable with that of the
instrumented runs, which required more time to fully process the growing
queue.

Roughly similar results have been obtained in a separate experiment where the
fuzzer was modified to compile out all the random fuzzing stages and leave
just a series of rudimentary, sequential operations such as walking bit flips.
Because this mode would be incapable of altering the size of the input file,
the sessions were seeded with a valid unified diff:

```
  Queue extension   | Blocks  | Edges   | Edge hit | Number of unique
  strategy used     | reached | reached | cnt var  | crashes found
  ------------------+---------+---------+----------+------------------
  (Initial file)    | 624     | 717     | 1.00     | -
                    |         |         |          |
  Blind fuzzing     | 1,101   | 1,409   | 1.60     | 0
  Block coverage    | 1,255   | 1,649   | 1.48     | 0
  Edge coverage     | 1,259   | 1,734   | 1.72     | 0
  AFL model         | 1,452   | 2,040   | 3.16     | 1
```

As noted earlier on, some of the prior work on genetic fuzzing relied on
maintaining a single test case and evolving it to maximize coverage. At least
in the tests described above, this "greedy" approach appears to confer no
substantial benefits over blind fuzzing strategies.

## 4. Culling the corpus

The progressive state exploration approach outlined above means that some of
the test cases synthesized later on in the game may have edge coverage that
is a strict superset of the coverage provided by their ancestors.

To optimize the fuzzing effort, AFL periodically re-evaluates the queue using
a fast algorithm that selects a smaller subset of test cases that still cover
every tuple seen so far, and whose characteristics make them particularly
favorable to the tool.

The algorithm works by assigning every queue entry a score proportional to its
execution latency and file size, and then selecting the lowest-scoring
candidate for each tuple.

The tuples are then processed sequentially using a simple workflow (a sketch
in code follows the list):

  1) Find next tuple not yet in the temporary working set,
  2) Locate the winning queue entry for this tuple,
  3) Register *all* tuples present in that entry's trace in the working set,
  4) Go to #1 if there are any missing tuples in the set.
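The sketch below expresses this greedy workflow in code. The `queue_entry`
layout and the `top_rated[]` winners array (mapping each tuple to its
lowest-scoring candidate) are hypothetical scaffolding for the illustration,
not the actual afl-fuzz data structures:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAP_SIZE (1 << 16)

/* Hypothetical queue entry: its trace map, a score proportional to
   execution latency and file size (used elsewhere to pick the winners),
   and a favored flag. */
struct queue_entry {
  uint8_t  trace[MAP_SIZE];
  uint64_t score;
  int      favored;
};

/* Greedy culling sketch: top_rated[i] is assumed to hold the
   lowest-scoring entry covering tuple i (or NULL). For every tuple not
   yet in the working set, pick its winner and register all tuples in
   that winner's trace, exactly as in steps 1-4 above. */
static void cull_queue(struct queue_entry *top_rated[MAP_SIZE]) {

  static uint8_t in_set[MAP_SIZE];
  memset(in_set, 0, MAP_SIZE);

  for (size_t i = 0; i < MAP_SIZE; i++) {

    if (in_set[i] || !top_rated[i]) continue;

    struct queue_entry *q = top_rated[i];
    q->favored = 1;

    for (size_t j = 0; j < MAP_SIZE; j++)
      if (q->trace[j]) in_set[j] = 1;

  }

}
```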
The generated corpus of "favored" entries is usually 5-10x smaller than the
starting data set. Non-favored entries are not discarded, but they are skipped
with varying probabilities when encountered in the queue:

  - If there are new, yet-to-be-fuzzed favorites present in the queue, 99%
    of non-favored entries will be skipped to get to the favored ones.
  - If there are no new favorites:
    * If the current non-favored entry was fuzzed before, it will be skipped
      95% of the time.
    * If it hasn't gone through any fuzzing rounds yet, the odds of skipping
      drop down to 75%.

Based on empirical testing, this provides a reasonable balance between queue
cycling speed and test case diversity.

Slightly more sophisticated but much slower culling can be performed on input
or output corpora with `afl-cmin`. This tool permanently discards the
redundant entries and produces a smaller corpus suitable for use with
`afl-fuzz` or external tools.

## 5. Trimming input files

File size has a dramatic impact on fuzzing performance, both because large
files make the target binary slower, and because they reduce the likelihood
that a mutation would touch important format control structures, rather than
redundant data blocks. This is discussed in more detail in perf_tips.md.

Setting aside the possibility that the user will provide a low-quality
starting corpus, some types of mutations can have the effect of iteratively
increasing the size of the generated files, so it is important to counter this
trend.

Luckily, the instrumentation feedback provides a simple way to automatically
trim down input files while ensuring that the changes made to the files have
no impact on the execution path.

The built-in trimmer in afl-fuzz attempts to sequentially remove blocks of
data with variable length and stepover; any deletion that doesn't affect the
checksum of the trace map is committed to disk. The trimmer is not designed to
be particularly thorough; instead, it tries to strike a balance between
precision and the number of `execve()` calls spent on the process, selecting
the block size and stepover to match. The average per-file gains are around
5-20%. A simplified sketch of this loop is shown below.
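This sketch illustrates the idea under simplifying assumptions: a hypothetical
`run_target()` helper that returns a checksum of the trace map, and block
sizes that are simply halved each round rather than following the exact
stepover schedule used by afl-fuzz:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: runs the target on buf[0..len) and returns a
   checksum of the resulting trace map. */
uint32_t run_target(uint8_t *buf, size_t len);

/* Try progressively smaller deletion blocks; commit any deletion that
   leaves the trace checksum unchanged. */
static size_t trim_case(uint8_t *in, size_t len, uint8_t *scratch) {

  uint32_t orig_cksum = run_target(in, len);

  for (size_t bsize = len / 16 ? len / 16 : 1; ; bsize /= 2) {

    size_t pos = 0;

    while (pos + bsize <= len) {

      /* Build a candidate with in[pos .. pos + bsize) removed. */
      memcpy(scratch, in, pos);
      memcpy(scratch + pos, in + pos + bsize, len - pos - bsize);

      if (run_target(scratch, len - bsize) == orig_cksum) {
        memcpy(in, scratch, len - bsize);   /* Same path: commit.         */
        len -= bsize;
      } else {
        pos += bsize;                       /* Block matters: step over.  */
      }

    }

    if (bsize == 1) break;

  }

  return len;

}
```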
The standalone `afl-tmin` tool uses a more exhaustive, iterative algorithm,
and also attempts to perform alphabet normalization on the trimmed files. The
operation of `afl-tmin` is as follows.

First, the tool automatically selects the operating mode. If the initial input
crashes the target binary, afl-tmin will run in non-instrumented mode, simply
keeping any tweaks that produce a simpler file but still crash the target. If
the target is non-crashing, the tool uses an instrumented mode and keeps only
the tweaks that produce exactly the same execution path.

The actual minimization algorithm is:

  1) Attempt to zero large blocks of data with large stepovers. Empirically,
     this is shown to reduce the number of execs by preempting finer-grained
     efforts later on.
  2) Perform a block deletion pass with decreasing block sizes and stepovers,
     binary-search-style.
  3) Perform alphabet normalization by counting unique characters and trying
     to bulk-replace each with a zero value.
  4) As a last resort, perform byte-by-byte normalization on non-zero bytes.

Instead of zeroing with a 0x00 byte, `afl-tmin` uses the ASCII digit '0'. This
is done because such a modification is much less likely to interfere with
text parsing, so it is more likely to result in successful minimization of
text files.

The algorithm used here is less involved than some other test case
minimization approaches proposed in academic work, but requires far fewer
executions and tends to produce comparable results in most real-world
applications.

## 6. Fuzzing strategies

The feedback provided by the instrumentation makes it easy to understand the
value of various fuzzing strategies and optimize their parameters so that they
work equally well across a wide range of file types. The strategies used by
afl-fuzz are generally format-agnostic and are discussed in more detail here:

  http://lcamtuf.blogspot.com/2014/08/binary-fuzzing-strategies-what-works.html

It is somewhat notable that especially early on, most of the work done by
`afl-fuzz` is actually highly deterministic, and progresses to random stacked
modifications and test case splicing only at a later stage. The deterministic
strategies include:

  - Sequential bit flips with varying lengths and stepovers,
  - Sequential addition and subtraction of small integers,
  - Sequential insertion of known interesting integers (`0`, `1`, `INT_MAX`,
    etc).

The purpose of opening with deterministic steps is related to their tendency
to produce compact test cases and small diffs between the non-crashing and
crashing inputs.
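As an illustration, the simplest of these stages - a walking single-bit flip -
can be sketched as follows; `exec_and_check()` is a hypothetical stand-in for
the run-and-evaluate logic described in section 2:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: run the target on the buffer and queue the input
   if it produced new tuples or hit-count buckets. */
void exec_and_check(uint8_t *buf, size_t len);

/* Flip bit b of the buffer (most significant bit of each byte first). */
#define FLIP_BIT(buf, b) ((buf)[(b) >> 3] ^= (128 >> ((b) & 7)))

/* Walking single-bit flip with a one-bit stepover: flip each bit in
   turn, run the target, then flip it back. The wider 2- and 4-bit flips
   and byte-level operations follow the same walking pattern. */
static void stage_flip1(uint8_t *buf, size_t len) {

  for (size_t bit = 0; bit < (len << 3); bit++) {

    FLIP_BIT(buf, bit);
    exec_and_check(buf, len);
    FLIP_BIT(buf, bit);   /* Restore the original input. */

  }

}
```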
With deterministic fuzzing out of the way, the non-deterministic steps include
stacked bit flips, insertions, deletions, arithmetic, and splicing of
different test cases.

The relative yields and `execve()` costs of all these strategies have been
investigated and are discussed in the aforementioned blog post.

For the reasons discussed in historical_notes.md (chiefly, performance,
simplicity, and reliability), AFL generally does not try to reason about the
relationship between specific mutations and program states; the fuzzing steps
are nominally blind, and are guided only by the evolutionary design of the
input queue.

That said, there is one (trivial) exception to this rule: when a new queue
entry goes through the initial set of deterministic fuzzing steps, and tweaks
to some regions in the file are observed to have no effect on the checksum of
the execution path, they may be excluded from the remaining phases of
deterministic fuzzing - and the fuzzer may proceed straight to random tweaks.
Especially for verbose, human-readable data formats, this can reduce the
number of execs by 10-40% or so without an appreciable drop in coverage. In
extreme cases, such as normally block-aligned tar archives, the gains can be
as high as 90%.

Because the underlying "effector maps" are local to every queue entry and
remain in force only during deterministic stages that do not alter the size or
the general layout of the underlying file, this mechanism appears to work very
reliably and proved to be simple to implement.

## 7. Dictionaries

The feedback provided by the instrumentation makes it easy to automatically
identify syntax tokens in some types of input files, and to detect that
certain combinations of predefined or auto-detected dictionary terms
constitute a valid grammar for the tested parser.

A discussion of how these features are implemented within afl-fuzz can be
found here:

  http://lcamtuf.blogspot.com/2015/01/afl-fuzz-making-up-grammar-with.html

In essence, when basic, typically easily-obtained syntax tokens are combined
together in a purely random manner, the instrumentation and the evolutionary
design of the queue together provide a feedback mechanism to differentiate
between meaningless mutations and ones that trigger new behaviors in the
instrumented code - and to incrementally build more complex syntax on top of
this discovery.

The dictionaries have been shown to enable the fuzzer to rapidly reconstruct
the grammar of highly verbose and complex languages such as JavaScript, SQL,
or XML; several examples of generated SQL statements are given in the blog
post mentioned above.

Interestingly, the AFL instrumentation also allows the fuzzer to automatically
isolate syntax tokens already present in an input file. It can do so by
looking for runs of bytes that, when flipped, produce a consistent change to
the program's execution path; this is suggestive of an underlying atomic
comparison to a predefined value baked into the code. The fuzzer relies on
this signal to build compact "auto dictionaries" that are then used in
conjunction with other fuzzing strategies.

## 8. De-duping crashes

De-duplication of crashes is one of the more important problems for any
competent fuzzing tool. Many of the naive approaches run into problems; in
particular, looking just at the faulting address may lead to completely
unrelated issues being clustered together if the fault happens in a common
library function (say, `strcmp`, `strcpy`); while checksumming call stack
backtraces can lead to extreme crash count inflation if the fault can be
reached through a number of different, possibly recursive code paths.

The solution implemented in `afl-fuzz` considers a crash unique if either of
two conditions is met:

  - The crash trace includes a tuple not seen in any of the previous crashes,
  - The crash trace is missing a tuple that was always present in earlier
    faults.

The approach is vulnerable to some path count inflation early on, but exhibits
a very strong self-limiting effect, similar to the execution path analysis
logic that is the cornerstone of `afl-fuzz`. A sketch of the uniqueness check
follows.
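The sketch below spells out the two-condition test. The bookkeeping arrays and
names are hypothetical; the actual afl-fuzz implementation is a leaner variant
built on the same virgin-map machinery described in section 2:

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE (1 << 16)

static uint8_t seen_in_any[MAP_SIZE];  /* tuples seen in any crash so far   */
static uint8_t seen_in_all[MAP_SIZE];  /* tuples seen in every crash so far */
static int     n_crashes;

/* A crash is unique if its trace has a tuple never seen in a crash
   before, or lacks a tuple that every earlier crash exhibited. */
static int crash_is_unique(uint8_t *trace) {

  int unique = (n_crashes == 0);   /* The very first crash is always kept. */

  for (size_t i = 0; i < MAP_SIZE; i++) {

    if (trace[i] && !seen_in_any[i]) unique = 1;              /* new tuple  */
    if (!trace[i] && n_crashes && seen_in_all[i]) unique = 1; /* lost tuple */

  }

  /* Update the bookkeeping regardless of the verdict. */
  for (size_t i = 0; i < MAP_SIZE; i++) {
    if (trace[i]) {
      seen_in_any[i] = 1;
      if (!n_crashes) seen_in_all[i] = 1;
    } else {
      seen_in_all[i] = 0;
    }
  }

  n_crashes++;
  return unique;

}
```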
## 9. Investigating crashes

The exploitability of many types of crashes can be ambiguous; afl-fuzz tries
to address this by providing a crash exploration mode where a known-faulting
test case is fuzzed in a manner very similar to the normal operation of the
fuzzer, but with a constraint that causes any non-crashing mutations to be
thrown away.

A detailed discussion of the value of this approach can be found here:

  http://lcamtuf.blogspot.com/2014/11/afl-fuzz-crash-exploration-mode.html

The method uses instrumentation feedback to explore the state of the crashing
program to get past the ambiguous faulting condition and then isolate the
newly-found inputs for human review.

On the subject of crashes, it is worth noting that in contrast to normal
queue entries, crashing inputs are *not* trimmed; they are kept exactly as
discovered to make it easier to compare them to the parent, non-crashing entry
in the queue. That said, `afl-tmin` can be used to shrink them at will.

## 10. The fork server

To improve performance, `afl-fuzz` uses a "fork server", where the fuzzed
process goes through `execve()`, linking, and libc initialization only once,
and is then cloned from a stopped process image by leveraging copy-on-write.
The implementation is described in more detail here:

  http://lcamtuf.blogspot.com/2014/10/fuzzing-binaries-without-execve.html

The fork server is an integral aspect of the injected instrumentation and
simply stops at the first instrumented function to await commands from
`afl-fuzz`.
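A minimal sketch of the injected fork-server loop is shown below. The
descriptor numbers follow the convention described in the blog post above (a
control pipe on fd 198, a status pipe on fd 199), but the function itself is
illustrative rather than a copy of the real shim:

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define FORKSRV_FD 198   /* control pipe; FORKSRV_FD + 1 carries status */

/* The target stops here after initialization, then clones itself once
   per requested execution, leveraging copy-on-write. */
static void fork_server_loop(void) {

  uint32_t msg = 0;
  int      status;
  pid_t    child;

  /* Phone home to tell afl-fuzz the fork server is up. */
  if (write(FORKSRV_FD + 1, &msg, 4) != 4) return;

  while (1) {

    /* Block until afl-fuzz requests another run. */
    if (read(FORKSRV_FD, &msg, 4) != 4) _exit(1);

    child = fork();
    if (child < 0) _exit(1);

    if (!child) {
      /* Child: close the server fds and resume normal execution. */
      close(FORKSRV_FD);
      close(FORKSRV_FD + 1);
      return;
    }

    /* Parent: report the child PID, wait, then report its status. */
    if (write(FORKSRV_FD + 1, &child, 4) != 4) _exit(1);
    if (waitpid(child, &status, 0) < 0) _exit(1);
    if (write(FORKSRV_FD + 1, &status, 4) != 4) _exit(1);

  }

}
```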
With fast targets, the fork server can offer considerable performance gains,
usually between 1.5x and 2x. It is also possible to:

  - Use the fork server in manual ("deferred") mode, skipping over larger,
    user-selected chunks of initialization code. This requires very modest
    code changes to the targeted program and, with some targets, can produce
    10x+ performance gains.
  - Enable "persistent" mode, where a single process is used to try out
    multiple inputs, greatly limiting the overhead of repetitive `fork()`
    calls. This generally requires some code changes to the targeted program,
    but can improve the performance of fast targets by a factor of 5 or more,
    approximating the benefits of in-process fuzzing jobs while still
    maintaining very robust isolation between the fuzzer process and the
    targeted binary.

## 11. Parallelization

The parallelization mechanism relies on periodically examining the queues
produced by independently-running instances on other CPU cores or on remote
machines, and then selectively pulling in the test cases that, when tried
out locally, produce behaviors not yet seen by the fuzzer at hand.

This allows for extreme flexibility in fuzzer setup, including running synced
instances against different parsers of a common data format, often with
synergistic effects.

For more information about this design, see parallel_fuzzing.md.

## 12. Binary-only instrumentation

Instrumentation of black-box, binary-only targets is accomplished with the
help of a separately-built version of QEMU in "user emulation" mode. This also
allows the execution of cross-architecture code - say, ARM binaries on x86.

QEMU uses basic blocks as translation units; the instrumentation is
implemented on top of this and uses a model roughly analogous to the
compile-time hooks:

```c
  if (block_address > elf_text_start && block_address < elf_text_end) {

    cur_location = (block_address >> 4) ^ (block_address << 8);
    shared_mem[cur_location ^ prev_location]++;
    prev_location = cur_location >> 1;

  }
```

The shift-and-XOR-based scrambling in the second line is used to mask the
effects of instruction alignment.

The start-up of binary translators such as QEMU, DynamoRIO, and PIN is fairly
slow; to counter this, the QEMU mode leverages a fork server similar to that
used for compiler-instrumented code, effectively spawning copies of an
already-initialized process paused at `_start`.

First-time translation of a new basic block also incurs substantial latency.
To eliminate this problem, the AFL fork server is extended by providing a
channel between the running emulator and the parent process. The channel is
used to notify the parent about the addresses of any newly-encountered blocks
and to add them to the translation cache that will be replicated for future
child processes.

As a result of these two optimizations, the overhead of the QEMU mode is
roughly 2-5x, compared to 100x+ for PIN.

## 13. The `afl-analyze` tool

The file format analyzer is a simple extension of the minimization algorithm
discussed earlier on; instead of attempting to remove no-op blocks, the tool
performs a series of walking byte flips and then annotates runs of bytes
in the input file.

It uses the following classification scheme (a heuristic sketch follows the
list):

  - "No-op blocks" - segments where bit flips cause no apparent changes to
    control flow. Common examples may be comment sections, pixel data within
    a bitmap file, etc.
  - "Superficial content" - segments where some, but not all, bitflips
    produce some control flow changes. Examples may include strings in rich
    documents (e.g., XML, RTF).
  - "Critical stream" - a sequence of bytes where all bit flips alter control
    flow in different but correlated ways. This may be compressed data,
    non-atomically compared keywords or magic values, etc.
  - "Suspected length field" - small, atomic integer that, when touched in
    any way, causes a consistent change to program control flow, suggestive
    of a failed length check.
  - "Suspected cksum or magic int" - an integer that behaves similarly to a
    length field, but has a numerical value that makes the length explanation
    unlikely. This is suggestive of a checksum or other "magic" integer.
  - "Suspected checksummed block" - a long block of data where any change
    always triggers the same new execution path. Likely caused by failing
    a checksum or a similar integrity check before any subsequent parsing
    takes place.
  - "Magic value section" - a generic token where changes cause the type
    of binary behavior outlined earlier, but that doesn't meet any of the
    other criteria - for example, an atomically compared keyword.
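To tie this back to the minimization machinery, here is a heuristic sketch of
how a per-byte classification might be derived. The probe values, helper
names, and thresholds are all hypothetical, and the real tool reasons about
runs of correlated bytes rather than single positions in isolation:

```c
#include <stddef.h>
#include <stdint.h>

enum byte_class { NO_OP, SUPERFICIAL, CRITICAL };

/* Hypothetical helper (as in the earlier sketches): runs the target and
   returns a checksum of the resulting trace map. */
uint32_t run_target(uint8_t *buf, size_t len);

/* Probe one byte by XOR-ing it with a few constants and count how many
   probes change the execution path. Runs of correlated CRITICAL bytes
   are then refined into the length/checksum/magic categories above by
   looking at the consistency of the resulting paths. */
static enum byte_class classify_byte(uint8_t *buf, size_t len, size_t pos,
                                     uint32_t base_cksum) {

  static const uint8_t probes[] = { 0xff, 0x01, 0x80 };
  int changed = 0;

  for (int i = 0; i < 3; i++) {

    buf[pos] ^= probes[i];
    if (run_target(buf, len) != base_cksum) changed++;
    buf[pos] ^= probes[i];   /* Restore the byte. */

  }

  if (!changed) return NO_OP;
  if (changed < 3) return SUPERFICIAL;
  return CRITICAL;

}
```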