
Automated Whitebox Fuzz Testing

Patrice Godefroid Michael Y. Levin David Molnar∗


Microsoft (Research) Microsoft (CSE) UC Berkeley
[email protected] [email protected] [email protected]

∗ The work of this author was done while visiting Microsoft.

Abstract

Fuzz testing is an effective technique for finding security vulnerabilities in software. Traditionally, fuzz testing tools apply random mutations to well-formed inputs of a program and test the resulting values. We present an alternative whitebox fuzz testing approach inspired by recent advances in symbolic execution and dynamic test generation. Our approach records an actual run of the program under test on a well-formed input, symbolically evaluates the recorded trace, and gathers constraints on inputs capturing how the program uses these. The collected constraints are then negated one by one and solved with a constraint solver, producing new inputs that exercise different control paths in the program. This process is repeated with the help of a code-coverage maximizing heuristic designed to find defects as fast as possible. We have implemented this algorithm in SAGE (Scalable, Automated, Guided Execution), a new tool employing x86 instruction-level tracing and emulation for whitebox fuzzing of arbitrary file-reading Windows applications. We describe key optimizations needed to make dynamic test generation scale to large input files and long execution traces with hundreds of millions of instructions. We then present detailed experiments with several Windows applications. Notably, without any format-specific knowledge, SAGE detects the MS07-017 ANI vulnerability, which was missed by extensive blackbox fuzzing and static analysis tools. Furthermore, while still in an early stage of development, SAGE has already discovered 30+ new bugs in large shipped Windows applications including image processors, media players, and file decoders. Several of these bugs are potentially exploitable memory access violations.

1 Introduction

Since the "Month of Browser Bugs" released a new bug each day of July 2006 [25], fuzz testing has leapt to prominence as a quick and cost-effective method for finding serious security defects in large applications. Fuzz testing is a form of blackbox random testing which randomly mutates well-formed inputs and tests the program on the resulting data [13, 30, 1, 4]. In some cases, grammars are used to generate the well-formed inputs, which also allows encoding application-specific knowledge and test heuristics. Although fuzz testing can be remarkably effective, the limitations of blackbox testing approaches are well-known. For instance, the then branch of the conditional statement "if (x==10) then" has only one in 2^32 chances of being exercised if x is a randomly chosen 32-bit input value. This intuitively explains why random testing usually provides low code coverage [28]. In the security context, these limitations mean that potentially serious security bugs, such as buffer overflows, may be missed because the code that contains the bug is not even exercised.

We propose a conceptually simple but different approach of whitebox fuzz testing. This work is inspired by recent advances in systematic dynamic test generation [16, 7]. Starting with a fixed input, our algorithm symbolically executes the program, gathering input constraints from conditional statements encountered along the way. The collected constraints are then systematically negated and solved with a constraint solver, yielding new inputs that exercise different execution paths in the program. This process is repeated using a novel search algorithm with a coverage-maximizing heuristic designed to find defects as fast as possible. For example, symbolic execution of the above fragment on the input x = 0 generates the constraint x ≠ 10. Once this constraint is negated and solved, it yields x = 10, which gives us a new input that causes the program to follow the then branch of the given conditional statement. This allows us to exercise and test additional code for security bugs, even without specific knowledge of the input format. Furthermore, this approach automatically discovers and tests "corner cases" where programmers may fail to properly allocate memory or manipulate buffers, leading to security vulnerabilities.
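To make the negate-and-solve step concrete, the following minimal Python sketch replays the "if (x==10)" fragment on the input x = 0, records the branch constraint, negates it, and searches for an input that flips the branch. The brute-force loop is an illustrative stand-in for a constraint solver, not the solver SAGE uses.

    # Sketch: collect the branch constraint observed on a concrete run, negate it,
    # and "solve" it to obtain an input that drives the program down the other branch.
    def run_and_collect(x):
        taken = (x == 10)              # outcome of "if (x==10)" on this run
        return taken                   # the recorded constraint is "x == 10 is <taken>"

    def solve_negated(taken):
        # Brute-force stand-in for a constraint solver over a 32-bit input;
        # it returns as soon as a satisfying value is found.
        for candidate in range(2**32):
            if (candidate == 10) != taken:
                return candidate
        return None

    taken = run_and_collect(0)         # initial run with x = 0: branch not taken
    print(solve_negated(taken))        # -> 10, a new input exercising the then branch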
In theory, systematic dynamic test generation can lead to full program path coverage, i.e., program verification [16]. In practice, however, the search is typically incomplete both
because the number of execution paths in the program under test is huge and because symbolic execution, constraint generation, and constraint solving are necessarily imprecise. (See Section 2 for various reasons of why the latter is the case.) Therefore, we are forced to explore practical tradeoffs, and this paper presents what we believe is a particular sweet spot. Indeed, our specific approach has been remarkably effective in finding new defects in large applications that were previously well-tested. In fact, our algorithm finds so many defect occurrences that we must address the defect triage problem (see Section 4), which is common in static program analysis and blackbox fuzzing, but has not been faced until now in the context of dynamic test generation [16, 7, 31, 24, 22, 18]. Another novelty of our approach is that we test larger applications than previously done in dynamic test generation [16, 7, 31].

We have implemented this approach in SAGE, short for Scalable, Automated, Guided Execution, a whole-program whitebox file fuzzing tool for x86 Windows applications. While our current tool focuses on file-reading applications, the principles also apply to network-facing applications. As argued above, SAGE is capable of finding bugs that are beyond the reach of blackbox fuzzers. For instance, without any format-specific knowledge, SAGE detects the critical MS07-017 ANI vulnerability, which was missed by extensive blackbox fuzzing and static analysis. Our work makes three main contributions:

• Section 2 introduces a new search algorithm for systematic test generation that is optimized for large applications with large input files and exhibiting long execution traces where the search is bound to be incomplete;

• Section 3 discusses the implementation of SAGE: the engineering choices behind its symbolic execution algorithm and the key optimization techniques enabling it to scale to program traces with hundreds of millions of instructions;

• Section 4 describes our experience with SAGE: we give examples of discovered defects and discuss the results of various experiments.

2 A Whitebox Fuzzing Algorithm

2.1 Background: Dynamic Test Generation

    void top(char input[4]) {
      int cnt = 0;
      if (input[0] == 'b') cnt++;
      if (input[1] == 'a') cnt++;
      if (input[2] == 'd') cnt++;
      if (input[3] == '!') cnt++;
      if (cnt >= 3) abort(); // error
    }

Figure 1. Example of program.

Consider the program shown in Figure 1. This program takes 4 bytes as input and contains an error when the value of the variable cnt is greater than or equal to 3 at the end of the function top. Running the program with random values for the 4 input bytes is unlikely to discover the error: there are 5 values leading to the error out of 2^(8*4) possible values for 4 bytes, i.e., a probability of about 1/2^30 to hit the error with random testing, including blackbox fuzzing. This problem is typical of random testing: it is difficult to generate input values that will drive the program through all its possible execution paths.

In contrast, whitebox dynamic test generation can easily find the error in this program: it consists in executing the program starting with some initial inputs, performing a dynamic symbolic execution to collect constraints on inputs gathered from predicates in branch statements along the execution, and then using a constraint solver to infer variants of the previous inputs in order to steer the next executions of the program towards alternative program branches. This process is repeated until a given specific program statement or path is executed [22, 18], or until all (or many) feasible program paths of the program are exercised [16, 7].

For the example above, assume we start running the function top with the initial 4-letter string good. Figure 2 shows the set of all feasible program paths for the function top. The leftmost path represents the first run of the program on input good and corresponds to the program path ρ including all 4 else-branches of all conditional if-statements in the program. The leaf for that path is labeled with 0 to denote the value of the variable cnt at the end of the run. Intertwined with the normal execution, a symbolic execution collects the predicates i0 ≠ b, i1 ≠ a, i2 ≠ d and i3 ≠ ! according to how the conditionals evaluate, and where i0, i1, i2 and i3 are symbolic variables that represent the values of the memory locations of the input variables input[0], input[1], input[2] and input[3], respectively.

The path constraint φρ = ⟨i0 ≠ b, i1 ≠ a, i2 ≠ d, i3 ≠ !⟩ represents an equivalence class of input vectors, namely all the input vectors that drive the program through the path that was just executed. To force the program through a different equivalence class, one can calculate a solution to a different path constraint, say, ⟨i0 ≠ b, i1 ≠ a, i2 ≠ d, i3 = !⟩ obtained by negating the last predicate of the current path constraint. A solution to this path constraint is (i0 = g, i1 = o, i2 = o, i3 = !). Running the program top with this new input goo! exercises a new program path depicted by the second leftmost path in Figure 2.
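This single negate-and-solve step can be written out as a small runnable Python sketch (not SAGE itself): it records the path constraint of top on "good", negates the last predicate, and looks for a satisfying 4-byte input. The byte-wise "solver" is a deliberate simplification for illustration.

    # One dynamic test generation step on the function top of Figure 1.
    def path_constraint(inp):
        # One predicate per conditional: (byte index, compared char, outcome on this run).
        return [(0, 'b', inp[0] == 'b'), (1, 'a', inp[1] == 'a'),
                (2, 'd', inp[2] == 'd'), (3, '!', inp[3] == '!')]

    def solve(constraints, seed):
        # Stand-in solver: keep the seed's bytes except where a predicate forces a value.
        out = list(seed)
        for idx, ch, equal in constraints:
            if equal:
                out[idx] = ch          # predicate i_idx == ch
            elif out[idx] == ch:
                return None            # keeping the seed byte would violate i_idx != ch
        return "".join(out)

    pc = path_constraint("good")                 # <i0 != b, i1 != a, i2 != d, i3 != !>
    negated = pc[:-1] + [(3, '!', True)]         # negate the last predicate: i3 == '!'
    print(solve(negated, "good"))                # -> "goo!"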
Figure 2. Search space for the example of Figure 1 with the value of the variable cnt at the end of each run and the corresponding input string:
cnt:   0    1    1    2    1    2    2    3    1    2    2    3    2    3    3    4
input: good goo! godd god! gaod gao! gadd gad! bood boo! bodd bod! baod bao! badd bad!

By repeating this process, the set of all 16 possible execution paths of this program can be exercised. If this systematic search is performed in depth-first order, these 16 executions are explored from left to right on the Figure. The error is then reached for the first time with cnt==3 during the 8th run, and full branch/block coverage is achieved after the 9th run.

2.2 Limitations

Systematic dynamic test generation [16, 7] as briefly described above has two main limitations.

Path explosion: systematically executing all feasible program paths does not scale to large, realistic programs. Path explosion can be alleviated by performing dynamic test generation compositionally [14], by testing functions in isolation, encoding test results as function summaries expressed using function input preconditions and output postconditions, and then re-using those summaries when testing higher-level functions. Although the use of summaries in software testing seems promising, achieving full path coverage when testing large applications with hundreds of millions of instructions is still problematic within a limited search period, say, one night, even when using summaries.

Imperfect symbolic execution: symbolic execution of large programs is bound to be imprecise due to complex program statements (pointer manipulations, arithmetic operations, etc.) and calls to operating-system and library functions that are hard or impossible to reason about symbolically with good enough precision at a reasonable cost. Whenever symbolic execution is not possible, concrete values can be used to simplify constraints and carry on with a simplified, partial symbolic execution [16]. Randomization can also help by suggesting concrete values whenever automated reasoning is difficult. Whenever an actual execution path does not match the program path predicted by symbolic execution for a given input vector, we say that a divergence has occurred. A divergence can be detected by recording a predicted execution path as a bit vector (one bit for each conditional branch outcome) and checking that the expected path is actually taken in the subsequent test run.
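A hedged sketch of this divergence check, assuming the predicted and observed branch outcomes are available as simple boolean lists; the encoding is illustrative and differs from SAGE's actual trace format.

    # Compare the predicted path (one bit per conditional branch outcome) with the
    # path actually taken by the test run; report the first branch that diverged.
    def divergence_index(predicted, observed):
        for i, (p, o) in enumerate(zip(predicted, observed)):
            if p != o:
                return i          # divergence detected at branch i
        return None               # the run followed the predicted path

    predicted = [False, False, False, True]   # expected outcomes of four branches
    observed  = [False, False, True, True]    # outcomes taken by the subsequent test run
    print(divergence_index(predicted, observed))   # -> 2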
     1  Search(inputSeed){
     2    inputSeed.bound = 0;
     3    workList = {inputSeed};
     4    Run&Check(inputSeed);
     5    while (workList not empty) { //new children
     6      input = PickFirstItem(workList);
     7      childInputs = ExpandExecution(input);
     8      while (childInputs not empty) {
     9        newInput = PickOneItem(childInputs);
    10        Run&Check(newInput);
    11        Score(newInput);
    12        workList = workList + newInput;
    13      }
    14    }
    15  }

Figure 3. Search algorithm.

2.3 Generational Search

We now present a new search algorithm that is designed to address these fundamental practical limitations. Specifically, our algorithm has the following prominent features:

• it is designed to systematically yet partially explore the state spaces of large applications executed with large inputs (thousands of symbolic variables) and with very deep paths (hundreds of millions of instructions);

• it maximizes the number of new tests generated from each symbolic execution (which are long and expensive in our context) while avoiding any redundancy in the search;

• it uses heuristics to maximize code coverage as quickly as possible, with the goal of finding bugs faster;

• it is resilient to divergences: whenever divergences occur, the search is able to recover and continue.

This new search algorithm is presented in two parts in Figures 3 and 4. The main Search procedure of Figure 3 is mostly standard. It places the initial input inputSeed in a workList (line 3) and runs the program to check whether any bugs are detected during the first execution (line 4). The inputs in the workList are then processed (line 5) by selecting an element (line 6) and expanding it (line 7) to generate new inputs with the function
     1  ExpandExecution(input) {
     2    childInputs = {};
     3    // symbolically execute (program,input)
     4    PC = ComputePathConstraint(input);
     5    for (j=input.bound; j < |PC|; j++) {
     6      if ((PC[0..(j-1)] and not(PC[j])) has a solution I){
     7        newInput = input + I;
     8        newInput.bound = j;
     9        childInputs = childInputs + newInput;
    10      }
    11    return childInputs;
    12  }

Figure 4. Computing new children.

ExpandExecution described later in Figure 4. For each of those childInputs, the program under test is run with that input. This execution is checked for errors (line 10) and is assigned a Score (line 11), as discussed below, before being added to the workList (line 12) which is sorted by those scores.

The main originality of our search algorithm is in the way children are expanded as shown in Figure 4. Given an input (line 1), the function ExpandExecution symbolically executes the program under test with that input and generates a path constraint PC (line 4) as defined earlier. PC is a conjunction of |PC| constraints, each corresponding to a conditional statement in the program and expressed using symbolic variables representing values of input parameters (see [16, 7]). Then, our algorithm attempts to expand every constraint in the path constraint (at a position j greater than or equal to a parameter called input.bound which is initially 0). This is done by checking whether the conjunction of the part of the path constraint prior to the jth constraint PC[0..(j-1)] and of the negation of the jth constraint not(PC[j]) is satisfiable. If so, a solution I to this new path constraint is used to update the previous solution input while values of input parameters not involved in the path constraint are preserved (this update is denoted by input + I on line 7). The resulting new input value is saved for future evaluation (line 9).

In other words, starting with an initial input inputSeed and initial path constraint PC, the new search algorithm depicted in Figures 3 and 4 will attempt to expand all |PC| constraints in PC, instead of just the last one with a depth-first search, or the first one with a breadth-first search. To prevent these child sub-searches from redundantly exploring overlapping parts of the search space, a parameter bound is used to limit the backtracking of each sub-search above the branch where the sub-search started off its parent. Because each execution is typically expanded with many children, we call such a search order a generational search.

Consider again the program shown in Figure 1. Assuming the initial input is the 4-letter string good, the leftmost path in the tree of Figure 2 represents the first run of the program on that input. From this parent run, a generational search generates four first-generation children which correspond to the four paths whose leaves are labeled with 1. Indeed, those four paths each correspond to negating one constraint in the original path constraint of the leftmost parent run. Each of those first-generation execution paths can in turn be expanded by the procedure of Figure 4 to generate (zero or more) second-generation children. There are six of those and each one is depicted with a leaf label of 2 to the right of their (first-generation) parent in Figure 2. By repeating this process, all feasible execution paths of the function top are eventually generated exactly once. For this example, the value of the variable cnt denotes exactly the generation number of each run.

Since the procedure ExpandExecution of Figure 4 expands all constraints in the current path constraint (below the current bound) instead of just one, it maximizes the number of new test inputs generated from each symbolic execution. Although this optimization is perhaps not significant when exhaustively exploring all execution paths of small programs like the one of Figure 1, it is important when symbolic execution takes a long time, as is the case for large applications where exercising all execution paths is virtually hopeless anyway. This point will be further discussed in Section 3 and illustrated with the experiments reported in Section 4.

In this scenario, we want to exploit as much as possible the first symbolic execution performed with an initial input and to systematically explore all its first-generation children. This search strategy works best if that initial input is well formed. Indeed, it will be more likely to exercise more of the program's code and hence generate more constraints to be negated, thus more children, as will be shown with experiments in Section 4. The importance given to the first input is similar to what is done with traditional, blackbox fuzz testing, hence our use of the term whitebox fuzzing for the search technique introduced in this paper.

The expansion of the children of the first parent run is itself prioritized by using a heuristic to attempt to maximize block coverage as quickly as possible, with the hope of finding more bugs faster. The function Score (line 11 of Figure 3) computes the incremental block coverage obtained by executing the newInput compared to all previous runs. For instance, a newInput that triggers an execution uncovering 100 new blocks would be assigned a score of 100. Next, (line 12), the newInput is inserted into the workList according to its score, with the highest scores placed at the head of the list.
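Taken together, Figures 3 and 4 can be rendered as a small runnable Python sketch of a generational search, specialized here to the top function of Figure 1. The names mirror the pseudocode, but the path-constraint representation and the byte-wise stand-in solver are simplifications for illustration and the Score-based ordering of the work list is omitted; this is not SAGE's implementation.

    # Generational search (Figures 3 and 4) specialized to the program of Figure 1.
    def run_and_check(inp):
        cnt = sum(inp[i] == c for i, c in enumerate("bad!"))
        return cnt >= 3                       # True means the abort() "error" was reached

    def compute_path_constraint(inp):
        # One constraint per conditional: (byte index, compared char, outcome on this run).
        return [(i, c, inp[i] == c) for i, c in enumerate("bad!")]

    def solve(prefix, negated):
        # Stand-in solver: the prefix stays satisfied by keeping the parent's bytes,
        # so only the negated constraint dictates which byte to overwrite.
        idx, ch, equal = negated
        return {idx: ch} if not equal else None   # this sketch only flips "!=" into "=="

    def expand_execution(inp, bound):
        children = []
        pc = compute_path_constraint(inp)
        for j in range(bound, len(pc)):
            solution = solve(pc[:j], pc[j])
            if solution is not None:
                child = "".join(solution.get(k, inp[k]) for k in range(len(inp)))
                children.append((child, j))       # the child keeps bound = j
        return children

    def search(input_seed):
        crashes, work_list = [], [(input_seed, 0)]
        if run_and_check(input_seed):
            crashes.append(input_seed)
        while work_list:
            inp, bound = work_list.pop(0)         # Score-based ordering omitted here
            for child, child_bound in expand_execution(inp, bound):
                if run_and_check(child):
                    crashes.append(child)
                work_list.append((child, child_bound))
        return crashes

    print(search("good"))   # -> the five inputs of Figure 2 that reach the error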
Note that all children compete with each other to be expanded next, regardless of their generation number.

Our block-coverage heuristic is related to the "Best-First Search" of EXE [7]. However, the overall search strategy is different: while EXE uses a depth-first search that occasionally picks the next child to explore using a block-coverage heuristic, a generational search tests all children of each expanded execution, and scores their entire runs before picking the best one from the resulting workList.

The block-coverage heuristics computed with the function Score also help in dealing with divergences as defined in the previous section, i.e., executions diverging from the expected path constraint to be taken next. The occurrence of a single divergence compromises the completeness of the search, but this is not the main issue in practice since the search is bound to be incomplete for very large search spaces anyway. A more worrisome issue is that divergences may prevent the search from making any progress. For instance, a depth-first search which diverges from a path p to a previously explored path p′ would cycle forever between that path p′ and the subsequent divergent run p. In contrast, our generational search tolerates divergences and can recover from this pathological case. Indeed, each run spawns many children, instead of a single one as with a depth-first search, and, if a child run p diverges to a previous one p′, that child p will have a zero score and hence be placed at the end of the workList without hampering the expansion of other, non-divergent children. Dealing with divergences is another important feature of our algorithm for handling large applications for which symbolic execution is bound to be imperfect/incomplete, as will be demonstrated in Section 4.

Finally, we note that a generational search parallelizes well, since children can be checked and scored independently; only the work list and overall block coverage need to be shared.

3 The SAGE System

The generational search algorithm presented in the previous section has been implemented in a new tool named SAGE, which stands for Scalable, Automated, Guided Execution. SAGE can test any file-reading program running on Windows by treating bytes read from files as symbolic inputs. Another key novelty of SAGE is that it performs symbolic execution of program traces at the x86 binary level. This section justifies this design choice by arguing how it allows SAGE to handle a wide variety of large production applications. This design decision raises challenges that are different from those faced by source-code level symbolic execution. We describe these challenges and show how they are addressed in our implementation. Finally, we outline key optimizations that are crucial in scaling to large programs.

3.1 System Architecture

SAGE performs a generational search by repeating four different types of tasks. The Tester task implements the function Run&Check by executing a program under test on a test input and looking for unusual events such as access violation exceptions and extreme memory consumption. The subsequent tasks proceed only if the Tester task did not encounter any such errors. If Tester detects an error, it saves the test case and performs automated triage as discussed in Section 4.

The Tracer task runs the target program on the same input file again, this time recording a log of the run which will be used by the following tasks to replay the program execution offline. This task uses the iDNA framework [3] to collect complete execution traces at the machine-instruction level.

The CoverageCollector task replays the recorded execution to compute which basic blocks were executed during the run. SAGE uses this information to implement the function Score discussed in the previous section.

Lastly, the SymbolicExecutor task implements the function ExpandExecution of Section 2.3 by replaying the recorded execution once again, this time to collect input-related constraints and generate new inputs using the constraint solver Disolver [19].

Both the CoverageCollector and SymbolicExecutor tasks are built on top of the trace replay framework TruScan [26] which consumes trace files generated by iDNA and virtually re-executes the recorded runs. TruScan offers several features that substantially simplify symbolic execution. These include instruction decoding, providing an interface to program symbol information, monitoring various input/output system calls, keeping track of heap and stack frame allocations, and tracking the flow of data through the program structures.
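The per-input pipeline described above can be summarized in a short orchestration sketch. The task names follow the paper, but the function bodies are trivial placeholders standing in for the real components rather than SAGE's actual interfaces.

    # Sketch of SAGE's four-task pipeline for one test input.
    def tester(test_input):              # Run&Check: execute and watch for crashes, hangs, etc.
        return None                      # None means "no unusual event observed"

    def tracer(test_input):              # record an instruction-level trace of the run
        return {"input": test_input, "trace": []}

    def coverage_collector(trace):       # replay the trace, return the basic blocks covered
        return set()

    def symbolic_executor(trace):        # replay again, collect constraints, solve them
        return []                        # child inputs produced from negated constraints

    def process_input(test_input, known_blocks):
        crash = tester(test_input)
        if crash is not None:
            return ("crash", crash)      # save the test case and triage it (Section 4)
        trace = tracer(test_input)
        new_blocks = coverage_collector(trace) - known_blocks   # incremental coverage = Score
        children = symbolic_executor(trace)
        return ("children", children, len(new_blocks))

    print(process_input(b"seed bytes", set()))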
3.2 Trace-based x86 Constraint Generation

SAGE's constraint generation differs from previous dynamic test generation implementations [16, 31, 7] in two main ways. First, instead of a source-based instrumentation, SAGE adopts a machine-code-based approach for three main reasons:

Multitude of languages and build processes. Source-based instrumentation must support the specific language, compiler, and build process for the program under test. There is a large upfront cost for adapting the instrumentation to a new language, compiler, or build tool.
Covering many applications developed in a large company with a variety of incompatible build processes and compiler versions is a logistical nightmare. In contrast, a machine-code based symbolic-execution engine, while complicated, need be implemented only once per architecture. As we will see in Section 4, this choice has let us apply SAGE to a large spectrum of production software applications.

Compiler and post-build transformations. By performing symbolic execution on the binary code that actually ships, SAGE makes it possible to catch bugs not only in the target program but also in the compilation and post-processing tools, such as code obfuscators and basic block transformers, that may introduce subtle differences between the semantics of the source and the final product.

Unavailability of source. It might be difficult to obtain source code of third-party components, or even components from different groups of the same organization. Source-based instrumentation may also be difficult for self-modifying or JITed code. SAGE avoids these issues by working at the machine-code level. While source code does have information about types and structure not immediately visible at the machine code level, we do not need this information for SAGE's path exploration.

Second, instead of an online instrumentation, SAGE adopts an offline trace-based constraint generation. With online generation, constraints are generated as the program is executed, either by statically injected instrumentation code or with the help of dynamic binary instrumentation tools such as Nirvana [3] or Valgrind [27] (Catchconv is an example of the latter approach [24]). SAGE adopts offline trace-based constraint generation for two reasons. First, a single program may involve a large number of binary components, some of which may be protected by the operating system or obfuscated, making it hard to replace them with instrumented versions. Second, inherent nondeterminism in large target programs makes debugging online constraint generation difficult. If something goes wrong in the constraint generation engine, we are unlikely to reproduce the environment leading to the problem. In contrast, constraint generation in SAGE is completely deterministic because it works from the execution trace that captures the outcome of all nondeterministic events encountered during the recorded run.

3.3 Constraint Generation

SAGE maintains the concrete and symbolic state of the program, represented by a pair of stores associating every memory location and register to a byte-sized value and a symbolic tag respectively. A symbolic tag is an expression representing either an input value or a function of some input value. SAGE supports several kinds of tags: input(m) represents the mth byte of the input; c represents a constant; t1 op t2 denotes the result of some arithmetic or bitwise operation op on the values represented by the tags t1 and t2; the sequence tag ⟨t0 . . . tn⟩ where n = 1 or n = 3 describes a word- or double-word-sized value obtained by grouping byte-sized values represented by tags t0 . . . tn together; subtag(t, i) where i ∈ {0 . . . 3} corresponds to the i-th byte in the word- or double-word-sized value represented by t. Note that SAGE does not currently reason about symbolic pointer dereferences. SAGE defines a fresh symbolic variable for each non-constant symbolic tag. Provided there is no confusion, we do not distinguish a tag from its associated symbolic variable in the rest of this section.

As SAGE replays the recorded program trace, it updates the concrete and symbolic stores according to the semantics of each visited instruction.

In addition to performing symbolic tag propagation, SAGE also generates constraints on input values. Constraints are relations over symbolic variables; for example, given a variable x that corresponds to the tag input(4), the constraint x < 10 denotes the fact that the fifth byte of the input is less than 10.

When the algorithm encounters an input-dependent conditional jump, it creates a constraint modeling the outcome of the branch and adds it to the path constraint composed of the constraints encountered so far.

The following simple example illustrates the process of tracking symbolic tags and collecting constraints.

    # read 10 byte file into a
    # buffer beginning at address 1000
    mov ebx, 1005
    mov al, byte [ebx]
    dec al                  # Decrement al
    jz LabelForIfZero       # Jump if al == 0

The beginning of this fragment uses a system call to read a 10 byte file into the memory range starting from address 1000. For brevity, we omit the actual instruction sequence. As a result of replaying these instructions, SAGE updates the symbolic store by associating addresses 1000 . . . 1009 with symbolic tags input(0) . . . input(9) respectively. The two mov instructions have the effect of loading the fifth input byte into register al. After replaying these instructions, SAGE updates the symbolic store with a mapping of al to input(5). The effect of the last two instructions is to decrement al and to make a conditional jump to LabelForIfZero if the decremented value is 0. As a result of replaying these instructions, depending on the outcome of the branch, SAGE will add one of two constraints t = 0 or t ≠ 0 where t = input(5) − 1. The former constraint is added if the branch is taken; the latter if the branch is not taken.
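A small Python sketch of this tag-propagation and constraint-collection step on the fragment above; the tag representation and store updates are illustrative simplifications of what the paper describes, not SAGE's data structures.

    # Replay of the fragment above, tracking symbolic tags and emitting a branch constraint.
    sym_mem = {1000 + i: ("input", i) for i in range(10)}   # file bytes -> input(0)..input(9)
    sym_reg = {}
    path_constraint = []

    sym_reg["ebx"] = None                      # mov ebx, 1005: a constant, no symbolic tag
    sym_reg["al"] = sym_mem[1005]              # mov al, byte [ebx]: al -> input(5)
    sym_reg["al"] = ("sub", sym_reg["al"], 1)  # dec al: al -> input(5) - 1

    branch_taken = False                       # outcome observed on the concrete run
    t = sym_reg["al"]
    path_constraint.append((t, "==", 0) if branch_taken else (t, "!=", 0))
    print(path_constraint)                     # [ (('sub', ('input', 5), 1), '!=', 0) ]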
This leads us to one of the key difficulties in generating constraints from a stream of x86 machine instructions—dealing with the two-stage nature of conditional expressions. When a comparison is made, it is not known how it will be used until a conditional jump instruction is executed later. The processor has a special register EFLAGS that packs a collection of status flags such as CF, SF, AF, PF, OF, and ZF. How these flags are set is determined by the outcome of various instructions. For example, CF—the first bit of EFLAGS—is the carry flag that is influenced by various arithmetic operations. In particular, it is set to 1 by a subtraction instruction whose first argument is less than the second. ZF is the zero flag located at the seventh bit of EFLAGS; it is set by a subtraction instruction if its arguments are equal. Complicating matters even further, some instructions such as sete and pushf access EFLAGS directly.

For sound handling of EFLAGS, SAGE defines bitvector tags of the form ⟨f0 . . . fn−1⟩ describing an n-bit value whose bits are set according to the constraints f0 . . . fn−1. In the example above, when SAGE replays the dec instruction, it updates the symbolic store mapping for al and for EFLAGS. The former becomes mapped to input(5) − 1; the latter—to the bitvector tag ⟨t < 0 . . . t = 0 . . .⟩ where t = input(5) − 1 and the two shown constraints are located at offsets 0 and 6 of the bitvector—the offsets corresponding to the positions of CF and ZF in the EFLAGS register.

Another pervasive x86 practice involves casting between byte, word, and double word objects. Even if the main code of the program under test does not contain explicit casts, it will invariably invoke some run-time library function such as atol, malloc, or memcpy that does. SAGE implements sound handling of casts with the help of subtag and sequence tags. This is illustrated by the following example.

    mov ch, byte [...]
    mov cl, byte [...]
    inc cx            # Increment cx

Let us assume that the two mov instructions read addresses associated with the symbolic tags t1 and t2. After SAGE replays these instructions, it updates the symbolic store with the mappings cl ↦ t1 and ch ↦ t2. The next instruction increments cx—the 16-bit register containing cl and ch as the low and high bytes respectively. Right before the increment, the contents of cx can be represented by the sequence tag ⟨t1, t2⟩. The result of the increment then is the word-sized tag t = (⟨t1, t2⟩ + 1). To finalize the effect of the inc instruction, SAGE updates the symbolic store with the byte-sized mappings cl ↦ subtag(t, 0) and ch ↦ subtag(t, 1). SAGE encodes the subtag relation by the constraint x = x′ + 256 ∗ x′′ where the word-sized symbolic variable x corresponds to t and the two byte-sized symbolic variables x′ and x′′ correspond to subtag(t, 0) and subtag(t, 1) respectively.
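The subtag relation can be checked numerically. The sketch below just illustrates the x = x′ + 256 · x′′ encoding for the cx example with concrete byte values; the tag names are the paper's, the arithmetic check is ours.

    # Concrete illustration of the subtag constraint x = x' + 256 * x'' for a 16-bit
    # register built from a low byte (cl) and a high byte (ch).
    cl, ch = 0x34, 0x12          # example concrete values for t1 and t2
    cx = (ch << 8) | cl          # the sequence tag <t1, t2> as a word-sized value
    x = cx + 1                   # effect of "inc cx": the word-sized tag t
    x_low, x_high = x & 0xFF, (x >> 8) & 0xFF   # subtag(t, 0) and subtag(t, 1)
    assert x == x_low + 256 * x_high            # the constraint SAGE emits for subtags
    print(hex(x), hex(x_low), hex(x_high))      # 0x1235 0x35 0x12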
3.4 Constraint Optimization

SAGE employs a number of optimization techniques whose goal is to improve the speed and memory usage of constraint generation: tag caching ensures that structurally equivalent tags are mapped to the same physical object; unrelated constraint elimination reduces the size of constraint solver queries by removing the constraints which do not share symbolic variables with the negated constraint; local constraint caching skips a constraint if it has already been added to the path constraint; flip count limit establishes the maximum number of times a constraint generated from a particular program instruction can be flipped; concretization reduces the symbolic tags involving bitwise and multiplicative operators into their corresponding concrete values.

These optimizations are fairly standard in dynamic test generation. The rest of this section describes constraint subsumption, an optimization we found particularly useful for analyzing structured-file parsing applications.

The constraint subsumption optimization keeps track of the constraints generated from a given branch instruction. When a new constraint f is created, SAGE uses a fast syntactic check to determine whether f definitely implies or is definitely implied by another constraint generated from the same instruction. If this is the case, the implied constraint is removed from the path constraint.

The subsumption optimization has a critical impact on many programs processing structured files such as various image parsers and media players. For example, in one of the Media 2 searches described in Section 4, we have observed a ten-fold decrease in the number of constraints because of subsumption. Without this optimization, SAGE runs out of memory and overwhelms the constraint solver with a huge number of redundant queries.

Let us look at the details of the constraint subsumption optimization with the help of the following example:

    mov cl, byte [...]
    dec cl            # Decrement cl
    ja 2              # Jump if cl > 0

This code fragment loads a byte into cl and decrements it in a loop until it becomes 0. Assuming that the byte read by the mov instruction is mapped to a symbolic tag t0, the algorithm outlined in Section 3.3 will generate constraints t1 > 0, . . ., tk−1 > 0, and tk ≤ 0 where k is the concrete value of the loaded byte and ti+1 = ti − 1 for i ∈ {1 . . . k}. Here, the memory cost is linear in the number of loop iterations because each iteration produces a new constraint and a new symbolic tag.
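A minimal sketch of the subsumption check for the loop above, assuming the constraints emitted by one branch instruction all have the shape "t0 − i > 0" or "t0 − i ≤ 0"; the syntactic-implication test and the instruction address shown are deliberately simplistic stand-ins for SAGE's.

    # Constraint subsumption: a new, stronger constraint from the same branch
    # instruction replaces the constraints it implies.
    def add_with_subsumption(per_branch, branch_pc, constraint):
        kept = per_branch.setdefault(branch_pc, [])
        op, offset = constraint                     # e.g. (">", 3) means t0 - 3 > 0
        if op == ">":
            # t0 - i > 0 implies t0 - j > 0 for every j < i, so drop the implied ones.
            kept[:] = [c for c in kept if not (c[0] == ">" and c[1] < offset)]
        kept.append(constraint)

    per_branch = {}
    for i in range(1, 5):                           # loop iterations: t0-1 > 0 ... t0-4 > 0
        add_with_subsumption(per_branch, 0x401000, (">", i))
    add_with_subsumption(per_branch, 0x401000, ("<=", 5))   # final iteration: t0-5 <= 0
    print(per_branch[0x401000])                     # [('>', 4), ('<=', 5)]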
The subsumption technique allows us to remove the first k − 2 constraints because they are implied by the following constraints. We still have to hold on to a linear number of symbolic tags because each one is defined in terms of the preceding tag. To achieve constant space behavior, constraint subsumption must be performed in conjunction with constant folding during tag creation: (t − c) − 1 = t − (c + 1). The net effect of the algorithm with constraint subsumption and constant folding on the above fragment is the path constraint with two constraints t0 − (k − 1) > 0 and t0 − k ≤ 0.

Another hurdle arises from multi-byte tags. Consider the following loop which is similar to the loop above except that the byte-sized register cl is replaced by the word-sized register cx.

    mov cx, word [...]
    dec cx            # Decrement cx
    ja 2              # Jump if cx > 0

Assuming that the two bytes read by the mov instruction are mapped to tags t′0 and t′′0, this fragment yields constraints s1 > 0, . . ., sk−1 > 0, and sk ≤ 0 where si+1 = ⟨t′i, t′′i⟩ − 1 with t′i = subtag(si, 0) and t′′i = subtag(si, 1) for i ∈ {1 . . . k}. Constant folding becomes hard because each loop iteration introduces syntactically unique but semantically redundant word-size sequence tags. SAGE solves this with the help of sequence tag simplification which rewrites ⟨subtag(t, 0), subtag(t, 1)⟩ into t, avoiding duplicating equivalent tags and enabling constant folding.

Constraint subsumption, constant folding, and sequence tag simplification are sufficient to guarantee constant space replay of the above fragment, generating constraints ⟨t′0, t′′0⟩ − (k − 1) > 0 and ⟨t′0, t′′0⟩ − k ≤ 0. More generally, these three simple techniques enable SAGE to effectively fuzz real-world structured-file-parsing applications in which the input-bound loop pattern is pervasive.

4 Experiments

We first describe our initial experiences with SAGE, including several bugs found by SAGE that were missed by blackbox fuzzing efforts. Inspired by these experiences, we pursue a more systematic study of SAGE's behavior on two media-parsing applications. In particular, we focus on the importance of the starting input file for the search, the effect of our generational search vs. depth-first search, and the impact of our block coverage heuristic. In some cases, we withhold details concerning the exact application tested because the bugs are still in the process of being fixed.

4.1 Initial Experiences

MS07-017. On 3 April 2007, Microsoft released an out of band critical security patch for code that parses ANI format animated cursors. The vulnerability was originally reported to Microsoft in December 2006 by Alex Sotirov of Determina Security Research, then made public after exploit code appeared in the wild [32]. This was only the third such out-of-band patch released by Microsoft since January 2006, indicating the seriousness of the bug. The Microsoft SDL Policy Weblog states that extensive blackbox fuzz testing of this code failed to uncover the bug, and that existing static analysis tools are not capable of finding the bug without excessive false positives [20]. SAGE, in contrast, synthesizes a new input file exhibiting the bug within hours of starting from a well-formed ANI file.

In more detail, the vulnerability results from an incomplete patch to MS05-006, which also concerned ANI parsing code. The root cause of this bug was a failure to validate a size parameter read from an anih record in an ANI file. Unfortunately, the patch for MS05-006 is incomplete. Only the length of the first anih record is checked. If a file has an initial anih record of 36 bytes or less, the check is satisfied but then an icon loading function is called on all anih records. The length fields of the second and subsequent records are not checked, so any of these records can trigger memory corruption.

Therefore, a test case needs at least two anih records to trigger the MS07-017 bug. The SDL Policy Weblog attributes the failure of blackbox fuzz testing to find MS07-017 to the fact that all of the seed files used for blackbox testing had only one anih record, and so none of the test cases generated would break the MS05-006 patch. While of course one could write a grammar that generates such test cases for blackbox fuzzing, this requires effort and does not generalize beyond the single ANI format.

    RIFF...ACONLIST    RIFF...ACONB
    B...INFOINAM....   B...INFOINAM....
    3D Blue Alternat   3D Blue Alternat
    e v1.1..IART....   e v1.1..IART....
    ................   ................
    1996..anih$...$.   1996..anih$...$.
    ................   ................
    ................   ................
    ..rate..........   ..rate..........
    ..........seq ..   ..........seq ..
    ................   ................
    ..LIST....framic   ..anih....framic
    on......... ..     on......... ..

Figure 5. On the left, an ASCII rendering of a prefix of the seed ANI file used for our search. On the right, the SAGE-generated crash for MS07-017. Note how the SAGE test case changes the LIST to an additional anih record on the next-to-last line.
Test              # SymExec   SymExecT   Init. |PC|   # Tests   Mean Depth   Mean # Instr.   Mean Size
ANI                     808      19099          341     11468          178         2066087        5400
Media 1                 564       5625           71      6890           73         3409376       65536
Media 2                   3       3457         3202      1045         1100       271432489       27335
Media 3                  17       3117         1666      2266          608        54644652       30833
Media 4                   7       3108         1598       909          883       133685240       22209
Compressed File          47       1495          111      1527           65          480435         634
OfficeApp                 1       3108        15745      3008         6502       923731248       45064

Figure 6. Statistics from 10-hour searches on seven test applications, each seeded with a well-formed input file. We report the number of SymbolicExecutor tasks during the search, the total time spent in all SymbolicExecutor tasks in seconds, the number of constraints generated from the seed file, the total number of test cases generated, the mean depth per test case in number of constraints, the mean number of instructions executed after reading the input file, and the mean size of the symbolic input in bytes.

In contrast, SAGE can generate a crash exhibiting MS07-017 starting from a well-formed ANI file with one anih record, despite having no knowledge of the ANI format. Our seed file was picked arbitrarily from a library of well-formed ANI files, and we used a small test driver that called user32.dll to parse test case ANI files. The initial test case generated a path constraint with 341 branch constraints after parsing 1279939 total instructions over 10072 symbolic input bytes. SAGE then created a crashing ANI file at depth 72 after 7 hours 36 minutes of search and 7706 test cases, using one core of a 2 GHz AMD Opteron 270 dual-core processor running 32-bit Windows Vista with 4 GB of RAM. Figure 5 shows a prefix of our seed file side by side with the crashing SAGE-generated test case. Figure 6 shows further statistics from this test run.

Compressed File Format. We released an alpha version of SAGE to an internal testing team to look for bugs in code that handles a compressed file format. The parsing code for this file format had been extensively tested with blackbox fuzzing tools, yet SAGE found two serious new bugs. The first bug was a stack overflow. The second bug was an infinite loop that caused the processing application to consume nearly 100% of the CPU. Both bugs were fixed within a week of filing, showing that the product team considered these bugs important. Figure 6 shows statistics from a SAGE run on this test code, seeded with a well-formed compressed file. SAGE also uncovered two separate crashes due to read access violations while parsing malformed files of a different format tested by the same team; the corresponding bugs were also fixed within a week of filing.

Media File Parsing. We applied SAGE to parsers for four widely used media file formats, which we will refer to as "Media 1," "Media 2," "Media 3," and "Media 4." Through several testing sessions, SAGE discovered crashes in each of these media files that resulted in nine distinct bug reports. For example, SAGE discovered a read violation due to the program copying zero bytes into a buffer and then reading from a non-zero offset. In addition, starting from a seed file of 100 zero bytes, SAGE synthesized a crashing Media 1 test case after 1403 test cases, demonstrating the power of SAGE to infer file structure from code. Figure 6 shows statistics on the size of the SAGE search for each of these parsers, when starting from a well-formed file.

Office 2007 Application. We have used SAGE to successfully synthesize crashing test cases for a large application shipped as part of Office 2007. Over the course of two 10-hour searches seeded with two different well-formed files, SAGE generated 4548 test cases, of which 43 crashed the application. The crashes we have investigated so far are NULL pointer dereference errors, and they show how SAGE can successfully reason about programs on a large scale. Figure 6 shows statistics from the SAGE search on one of the well-formed files.

Image Parsing. We used SAGE to exercise the image parsing code in a media player included with a variety of other applications. While our initial run did not find crashes, we used an internal tool to scan traces from SAGE-generated test cases and found several uninitialized value use errors. We reported these errors to the testing team, who expanded the result into a reproducible crash. This experience shows that SAGE can uncover serious bugs that do not immediately lead to crashes.

4.2 Experiment Setup

Test Plan. We focused on the Media 1 and Media 2 parsers because they are widely used. We ran a SAGE search for the Media 1 parser with five "well-formed" media files, chosen from a library of test media files. We also tested Media 1 with five "bogus" files:
Figure 7. SAGE found 12 distinct stack hashes (shown left) from 357 Media 1 crashing files and 7 distinct stack hashes (shown right) from 88 Media 2 crashing files. (Each table row is a stack hash; the columns — wff-1 through wff-5 and bogus-1 for Media 1, and wff-1, wff-3, wff-4, wff-5 for Media 2 — mark which seed files produced a crash in that bucket.)

bogus-1 consisting of 100 zero bytes, bogus-2 consisting of 800 zero bytes, bogus-3 consisting of 25600 zero bytes, bogus-4 consisting of 100 randomly generated bytes, and bogus-5 consisting of 800 randomly generated bytes. For each of these 10 files, we ran a 10-hour SAGE search seeded with the file to establish a baseline number of crashes found by SAGE. If a task was in progress at the end of 10 hours, we allowed it to finish, leading to search times slightly longer than 10 hours in some cases. For searches that found crashes, we then re-ran the SAGE search for 10 hours, but disabled our block coverage heuristic. We repeated the process for the Media 2 parser with five "well-formed" Media 2 files and the bogus-1 file.

Each SAGE search used AppVerifier [8] configured to check for heap memory errors. Whenever such an error occurs, AppVerifier forces a "crash" in the application under test. We then collected crashing test cases, the absolute number of code blocks covered by the seed input, and the number of code blocks added over the course of the search. We performed our experiments on four machines, each with two dual-core AMD Opteron 270 processors running at 2 GHz. During our experiments, however, we used only one core to reduce the effect of nondeterministic task scheduling on the search results. Each machine ran 32-bit Windows Vista, with 4 GB of RAM and a 250 GB hard drive.

Triage. Because a SAGE search can generate many different test cases that exhibit the same bug, we "bucket" crashing files by the stack hash of the crash, which includes the address of the faulting instruction. It is possible for the same bug to be reachable by program paths with different stack hashes for the same root cause. Our experiments always report the distinct stack hashes.
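A hedged sketch of this bucketing step, assuming a crash is summarized by its faulting instruction address and a list of return addresses; the hashing scheme and the addresses shown are generic stand-ins, not the exact stack hash SAGE computes.

    # Bucket crashing test cases by a hash of the faulting address and call stack.
    import hashlib
    from collections import defaultdict

    def stack_hash(faulting_address, return_addresses):
        digest = hashlib.sha1()
        digest.update(faulting_address.to_bytes(8, "little"))
        for addr in return_addresses:
            digest.update(addr.to_bytes(8, "little"))
        return int.from_bytes(digest.digest()[:4], "little")   # 32-bit bucket id

    buckets = defaultdict(list)
    crashes = [("crash-0001.ani", 0x77001234, [0x401000, 0x402000]),
               ("crash-0002.ani", 0x77001234, [0x401000, 0x402000])]   # same stack -> same bucket
    for name, fault, stack in crashes:
        buckets[stack_hash(fault, stack)].append(name)
    print(len(buckets), "distinct stack hash(es)")              # -> 1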
Nondeterminism in Coverage Results. As part of our experiments, we measured the absolute number of blocks covered during a test run. We observed that running the same input on the same program can lead to slightly different initial coverage, even on the same machine. We believe this is due to nondeterminism associated with loading and initializing DLLs used by our test applications.

4.3 Results and Observations

The Appendix shows a table of results from our experiments. Here we comment on some general observations. We stress that these observations are from a limited sample size of two applications and should be taken with caution.

Symbolic execution is slow. We measured the total amount of time spent performing symbolic execution during each search. We observe that a single symbolic execution task is many times slower than testing or tracing a program. For example, the mean time for a symbolic execution task in the Media 2 search seeded with wff-3 was 25 minutes 30 seconds, while testing a Media 2 file took seconds. At the same time, we can also observe that only a small portion of the search time was spent performing symbolic execution, because each task generated many test cases; in the Media 2 wff-3 case, only 25% of the search time was spent in symbolic execution. This shows how a generational search effectively leverages the expensive symbolic execution task. This also shows the benefit of separating the Tester task from the more expensive SymbolicExecutor task.

Generational search is better than depth-first search. We performed several runs with depth-first search. First, we discovered that the SAGE search on Media 1 when seeded with the bogus-1 file exhibited a pathological divergence (see Section 2) leading to premature termination of the search after 18 minutes. Upon further inspection, this divergence proved to be due to concretizing an AND operator in the path constraint. We did observe depth-first search runs for 10 hours for Media 2 searches seeded with wff-2 and wff-3. Neither depth-first search found crashes.
Figure 8. Histograms of test cases and of crashes by generation for Media 1 seeded with wff-4.

In contrast, while a generational search seeded with wff-2 found no crashes, a generational search seeded with wff-3 found 15 crashing files in 4 buckets. Furthermore, the depth-first searches were inferior to the generational searches in code coverage: the wff-2 generational search started at 51217 blocks and added 12329, while the depth-first search started with 51476 and added only 398. For wff-3, a generational search started at 41726 blocks and added 9564, while the depth-first search started at 41703 blocks and added 244. These different initial block coverages stem from the nondeterminism noted above, but the difference in blocks added is much larger than the difference in starting coverage. The limitations of depth-first search regarding code coverage are well known (e.g., [23]) and are due to the search being too localized. In contrast, a generational search explores alternative execution branches at all depths, simultaneously exploring all the layers of the program. Finally, we saw that a much larger percentage of the search time is spent in symbolic execution for depth-first search than for generational search, because each test case requires a new symbolic execution task. For example, for the Media 2 search seeded with wff-3, a depth-first search spent 10 hours and 27 minutes in symbolic execution for 18 test cases generated, out of a total of 10 hours and 35 minutes. Note that any other search algorithm that generates a single new test from each symbolic execution (like a breadth-first search) has a similar execution profile where expensive symbolic executions are poorly leveraged, hence resulting in relatively few tests being executed given a fixed time budget.

Divergences are common. Our basic test setup did not measure divergences, so we ran several instrumented test cases to measure the divergence rate. In these cases, we often observed divergence rates of over 60%. This may be due to several reasons: in our experimental setup, we concretize all non-linear operations (such as multiplication, division, and bitwise arithmetic) for efficiency, there are several x86 instructions we still do not emulate, we do not model symbolic dereferences of pointers, tracking symbolic variables may be incomplete, and we do not control all sources of nondeterminism as mentioned above. Despite this, SAGE was able to find many bugs in real applications, showing that our search technique is tolerant of such divergences.

Bogus files find few bugs. We collected crash data from our well-formed and bogus seeded SAGE searches. The bugs found by each seed file are shown, bucketed by stack hash, in Figure 7. Out of the 10 files used as seeds for SAGE searches on Media 1, 6 found at least one crashing test case during the search, and 5 of these 6 seeds were well-formed. Furthermore, all the bugs found in the search seeded with bogus-1 were also found by at least one well-formed file. For SAGE searches on Media 2, out of the 6 seed files tested, 4 found at least one crashing test case, and all were well-formed. Hence, the conventional wisdom that well-formed files should be used as a starting point for fuzz testing applies to our whitebox approach as well.

Different files find different bugs. Furthermore, we observed that no single well-formed file found all distinct bugs for either Media 1 or Media 2. This suggests that using a wide variety of well-formed files is important for finding distinct bugs as each search is incomplete.

Bugs found are shallow. For each seed file, we collected the maximum generation reached by the search. We then looked at which generation the search found the last of its unique crash buckets. For the Media 1 searches, crash-finding searches seeded with well-formed files found all unique bugs within 4 generations, with a maximum number of generations between 5 and 7. Therefore, most of the bugs found by these searches are shallow — they are reachable in a small number of generations. The crash-finding Media 2 searches reached a maximum generation of 3, so we did not observe a trend here.

Figure 8 shows histograms of both crashing and non-crashing ("NoIssues") test cases by generation for Media 1 seeded with wff-4. We can see that most tests executed were of generations 4 to 6, yet all unique bugs can be found in generations 1 to 4. The number of test cases tested with no issues in later generations is high, but these new test cases do not discover distinct new bugs.
Figure 8 shows histograms of both crashing and non-crashing (“NoIssues”) test cases by generation for Media 1 seeded with wff-4. We can see that most tests executed were of generations 4 to 6, yet all unique bugs can be found in generations 1 to 4. The number of test cases tested with no issues in later generations is high, but these new test cases do not discover distinct new bugs. This behavior was consistently observed in almost all our experiments, especially the “bell curve” shown in the histograms. This generational search did not go beyond generation 7 since it still has many candidate input tests to expand in smaller generations and since many tests in later generations have lower incremental-coverage scores.

No clear correlation between coverage and crashes. We measured the absolute number of blocks covered after running each test, and we compared this with the locations of the first test case to exhibit each distinct stack hash for a crash. Figure 9 shows the result for a Media 1 search seeded with wff-4; the vertical bars mark where in the search crashes with new stack hashes were discovered. While this graph suggests that an increase in coverage correlates with finding new bugs, we did not observe this universally. Several other searches follow the trends shown by the graph for wff-2: they found all unique bugs early on, even if code coverage increased later. We found this surprising, because we expected there to be a consistent correlation between new code explored and new bugs discovered. In both cases, the last unique bug is found partway through the search, even though crashing test cases continue to be generated.

Figure 9. Coverage and initial discovery of stack hashes for Media 1 seeded with wff-4 and wff-2. The leftmost bar represents multiple distinct crashes found early in the search; all other bars represent a single distinct crash first found at this position in the search.
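For readers unfamiliar with crash bucketing, the sketch below illustrates the general idea behind grouping crashes by a stack hash. It is a simplification of ours; the exact hashing used in these experiments may differ (frame depth, normalization of addresses, and so on).

import hashlib

def stack_hash(frames, depth=5):
    """Hash the top `depth` frames of a faulting call stack (e.g., strings of
    the form 'module+offset'); crashes with equal hashes fall in one bucket."""
    top = "|".join(frames[:depth])
    return hashlib.sha1(top.encode()).hexdigest()[:16]

def bucket_crashes(crashes):
    """crashes: iterable of (test_id, frames). Returns a map from stack hash
    to the first test case that exhibited it, analogous to the positions
    marked in Figure 9."""
    first_seen = {}
    for test_id, frames in crashes:
        first_seen.setdefault(stack_hash(frames), test_id)
    return first_seen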
Effect of block coverage heuristic. We compared the number of blocks added during the search between test runs that used our block coverage heuristic to pick the next child from the pool, and runs that did not. We observed only a weak trend in favor of the heuristic. For example, the Media 2 wff-1 search added 10407 blocks starting from 48494 blocks covered, while the non-heuristic case started with 48486 blocks and added 10633, almost a dead heat. In contrast, the Media 1 wff-1 search started with 27659 blocks and added 701, while the non-heuristic case started with 26962 blocks and added only 50. Out of 10 total search pairs, in 3 cases the heuristic added many more blocks, while in the others the numbers are close enough to be almost a tie. As noted above, however, this data is noisy due to nondeterminism observed with code coverage.
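The block coverage heuristic can be illustrated by a small scoring sketch (a simplification of ours; SAGE's actual scoring and pool management differ in detail): each candidate is scored by the number of basic blocks it covers that the search has not yet seen, and the highest-scoring child is expanded next.

def incremental_score(test_blocks, seen_blocks):
    """Number of basic blocks covered by this test that the search has not
    seen before (its incremental coverage)."""
    return len(test_blocks - seen_blocks)

def pick_next_child(pool, seen_blocks):
    """pool: list of (test_case, set_of_covered_blocks).
    Return the entry with the highest incremental-coverage score."""
    return max(pool, key=lambda entry: incremental_score(entry[1], seen_blocks))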
5 Other Related Work

Other extensions of fuzz testing have recently been developed. Most of those consist of using grammars for representing sets of possible inputs [30, 33]. Probabilistic weights can be assigned to production rules and used as heuristics for random test input generation. Those weights can also be defined or modified automatically using coverage data collected via lightweight dynamic program instrumentation [34]. These grammars can also include rules for corner cases to test for common pitfalls in input validation code (such as very long strings, zero values, etc.). The use of input grammars makes it possible to encode application-specific knowledge about the application under test, as well as testing guidelines to favor testing specific areas of the input space compared to others. In practice, they are often key to enable blackbox fuzzing to find interesting bugs, since the probability of finding those using pure random testing is usually very small. But writing grammars manually is tedious, expensive and scales poorly. In contrast, our whitebox fuzzing approach does not require an input grammar specification to be effective. However, the experiments of the previous section highlight the importance of the initial seed file for a given search. Those seed files could be generated using grammars used for blackbox fuzzing to increase their diversity. Also, note that blackbox fuzzing can generate and run new tests faster than whitebox fuzzing due to the cost of symbolic execution and constraint solving. As a result, it may be able to expose new paths that would not be exercised with whitebox fuzzing because of the imprecision of symbolic execution.
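As a generic illustration of such grammar-based generation (not the implementation of any particular tool cited above), production rules can carry weights that bias expansion toward common cases while still exercising corner cases:

import random

# Toy grammar: nonterminal -> list of (weight, expansion); an expansion is a
# list of symbols, each either a terminal string or another nonterminal.
grammar = {
    "header":  [(0.8, ["MAGIC", "version", "length"]),
                (0.2, ["MAGIC", "length"])],            # corner case: field missing
    "version": [(0.9, ["1"]), (0.1, ["4294967295"])],   # corner case: huge value
    "length":  [(0.7, ["16"]), (0.3, ["0"])],           # corner case: zero
}

def generate(symbol):
    if symbol not in grammar:                           # terminal
        return symbol
    weights = [w for w, _ in grammar[symbol]]
    expansions = [e for _, e in grammar[symbol]]
    chosen = random.choices(expansions, weights=weights, k=1)[0]
    return " ".join(generate(s) for s in chosen)

print(generate("header"))   # e.g., "MAGIC 1 16" or "MAGIC 0"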
As previously discussed, our approach builds upon recent work on systematic dynamic test generation, introduced in [16, 6] and extended in [15, 31, 7, 14, 29]. The main differences are that we use a generational search algorithm with heuristics to find bugs as fast as possible in an incomplete search, and that we test large applications instead of unit-testing small ones, the latter being enabled by a trace-based x86-binary symbolic execution instead of a source-based approach. Those differences may explain why we have found more bugs than previously reported with dynamic test generation.

Our work also differs from tools such as [11], which are based on dynamic taint analysis and do not generate or solve constraints, but instead simply force branches to be taken or not taken without regard to the program state. While useful for a human auditor, this can lead to false positives in the form of spurious program crashes with data that “can’t happen” in a real execution. Symbolic execution is also a key component of static program analysis, which has been applied to x86 binaries [2, 10]. Static analysis is usually more efficient but less precise than dynamic analysis and testing, and their complementarity is well known [12, 15]. They can also be combined [15, 17]. Static test generation [21] consists of analyzing a program statically to attempt to compute input values to drive it along specific program paths without ever executing the program. In contrast, dynamic test generation extends static test generation with additional runtime information, and is therefore more general and powerful [16, 14]. Symbolic execution has also been proposed in the context of generating vulnerability signatures, either statically [5] or dynamically [9].
6 Conclusion

We introduced a new search algorithm, the generational search, for dynamic test generation that tolerates divergences and better leverages expensive symbolic execution tasks.
Our system, SAGE, applied this search algorithm to find bugs in a variety of production x86 machine-code programs running on Windows. We then ran experiments to better understand the behavior of SAGE on two media parsing applications. We found that using a wide variety of well-formed input files is important for finding distinct bugs. We also observed that the number of generations explored is a better predictor than block coverage of whether a test case will find a unique new bug. In particular, most unique bugs are found within a small number of generations.

While these observations must be treated with caution, coming from a limited sample size, they suggest a new search strategy: instead of running for a set number of hours, one could systematically search a small number of generations starting from an initial seed file and, once these test cases are exhausted, move on to a new seed file. The promise of this strategy is that it may cut off the “tail” of a generational search that only finds new instances of previously seen bugs, and thus might find more distinct bugs in the same amount of time. Future work should experiment with this search method, possibly combining it with our block-coverage heuristic applied over different seed files to avoid re-exploring the same code multiple times. The key point to investigate is whether generation depth combined with code coverage is a better indicator of when to stop testing than code coverage alone.
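A rough sketch of this strategy follows (our illustration only; the two helpers stand in for running a test and for symbolic execution plus constraint solving, and max_gen is a tunable bound):

def run_and_check(test):
    """Run the target on `test`; return the set of crash buckets observed."""
    return set()                                   # stand-in

def expand_children(test):
    """Symbolically execute `test`, negate constraints, solve: new inputs."""
    return []                                      # stand-in

def generation_bounded_search(seed_files, max_gen=4):
    bugs = set()
    for seed in seed_files:
        frontier = [(seed, 0)]                     # (test case, generation)
        while frontier:
            test, gen = frontier.pop(0)
            bugs |= run_and_check(test)
            if gen < max_gen:                      # bound the search depth
                frontier.extend((child, gen + 1)
                                for child in expand_children(test))
        # this seed's bounded search is exhausted; move on to the next seed
    return bugs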
Finally, we plan to enhance the precision of SAGE’s symbolic execution and the power of SAGE’s constraint solving capability. This will enable SAGE to find bugs that are currently out of reach.

Acknowledgments

We are indebted to Chris Marsh and Dennis Jeffries for important contributions to SAGE, and to Hunter Hudson for championing this project from the very beginning. SAGE builds on the work of the TruScan team, including Andrew Edwards and Jordan Tigani, and the Disolver team, including Youssf Hamadi and Lucas Bordeaux, for which we are grateful. We thank Tom Ball, Manuvir Das and Jim Larus for their support and feedback on this project. Various internal test teams provided valuable feedback during the development of SAGE, including some of the bugs described in Section 4.1, for which we thank them. We thank Derrick Coetzee, Ben Livshits and David Wagner for their comments on drafts of our paper, and Nikolaj Bjorner and Leonardo de Moura for discussions on constraint solving. We thank Chris Walker for helpful discussions regarding security.

References

[1] D. Aitel. The advantages of block-based protocol analysis for security testing, 2002. http://www.immunitysec.com/downloads/advantages_of_block_based_analysis.html.

[2] G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In Proc. Int. Conf. on Compiler Construction, 2004. http://www.cs.wisc.edu/wpis/papers/cc04.ps.

[3] S. Bhansali, W. Chen, S. De Jong, A. Edwards, and M. Drinic. Framework for instruction-level tracing and analysis of programs. In Second International Conference on Virtual Execution Environments VEE, 2006.
[4] D. Bird and C. Munoz. Automatic Generation of Random Self-Checking Test Cases. IBM Systems Journal, 22(3):229–245, 1983.

[5] D. Brumley, T. Chieh, R. Johnson, H. Lin, and D. Song. RICH: Automatically protecting against integer-based vulnerabilities. In NDSS (Symp. on Network and Distributed System Security), 2007.

[6] C. Cadar and D. Engler. Execution Generated Test Cases: How to Make Systems Code Crash Itself. In Proceedings of SPIN’2005 (12th International SPIN Workshop on Model Checking of Software), volume 3639 of Lecture Notes in Computer Science, San Francisco, August 2005. Springer-Verlag.

[7] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically Generating Inputs of Death. In ACM CCS, 2006.

[8] Microsoft Corporation. AppVerifier, 2007. http://www.microsoft.com/technet/prodtechnol/windows/appcompatibility/appverifier.mspx.

[9] M. Costa, J. Crowcroft, M. Castro, A. Rowstron, L. Zhou, L. Zhang, and P. Barham. Vigilante: End-to-end containment of internet worms. In Symposium on Operating Systems Principles (SOSP), 2005.

[10] M. Cova, V. Felmetsger, G. Banks, and G. Vigna. Static detection of vulnerabilities in x86 executables. In Proceedings of the Annual Computer Security Applications Conference (ACSAC), 2006.

[11] W. Drewry and T. Ormandy. Flayer: Exposing application internals. In First Workshop On Offensive Technologies (WOOT), 2007.

[12] M. D. Ernst. Static and dynamic analysis: synergy and duality. In Proceedings of WODA’2003 (ICSE Workshop on Dynamic Analysis), Portland, May 2003.

[13] J. E. Forrester and B. P. Miller. An Empirical Study of the Robustness of Windows NT Applications Using Random Testing. In Proceedings of the 4th USENIX Windows System Symposium, Seattle, August 2000.

[14] P. Godefroid. Compositional Dynamic Test Generation. In Proceedings of POPL’2007 (34th ACM Symposium on Principles of Programming Languages), pages 47–54, Nice, January 2007.

[15] P. Godefroid and N. Klarlund. Software Model Checking: Searching for Computations in the Abstract or the Concrete (Invited Paper). In Proceedings of IFM’2005 (Fifth International Conference on Integrated Formal Methods), volume 3771 of Lecture Notes in Computer Science, pages 20–32, Eindhoven, November 2005. Springer-Verlag.

[16] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed Automated Random Testing. In Proceedings of PLDI’2005 (ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation), pages 213–223, Chicago, June 2005.

[17] B. S. Gulavani, T. A. Henzinger, Y. Kannan, A. V. Nori, and S. K. Rajamani. Synergy: A new algorithm for property checking. In Proceedings of the 14th Annual Symposium on Foundations of Software Engineering (FSE), 2006.

[18] N. Gupta, A. P. Mathur, and M. L. Soffa. Generating Test Data for Branch Coverage. In Proceedings of the 15th IEEE International Conference on Automated Software Engineering, pages 219–227, September 2000.

[19] Y. Hamadi. Disolver: A Distributed Constraint Solver. Technical Report MSR-TR-2003-91, Microsoft Research, December 2003.

[20] M. Howard. Lessons learned from the animated cursor security bug, 2007. http://blogs.msdn.com/sdl/archive/2007/04/26/lessons-learned-from-the-animated-cursor-security-bug.aspx.

[21] J. C. King. Symbolic Execution and Program Testing. Journal of the ACM, 19(7):385–394, 1976.

[22] B. Korel. A Dynamic Approach of Test Data Generation. In IEEE Conference on Software Maintenance, pages 311–317, San Diego, November 1990.

[23] R. Majumdar and K. Sen. Hybrid Concolic testing. In Proceedings of ICSE’2007 (29th International Conference on Software Engineering), Minneapolis, May 2007. ACM.

[24] D. Molnar and D. Wagner. Catchconv: Symbolic execution and run-time type inference for integer conversion errors, 2007. UC Berkeley EECS, 2007-23.

[25] Month of Browser Bugs, July 2006. Web page: http://browserfun.blogspot.com/.

[26] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data races using replay analysis. In Programming Languages Design and Implementation (PLDI), 2007.
[27] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI, 2007.

[28] J. Offutt and J. Hayes. A Semantic Model of Program Faults. In Proceedings of ISSTA’96 (International Symposium on Software Testing and Analysis), pages 195–200, San Diego, January 1996.

[29] Pex. Web page: http://research.microsoft.com/Pex.

[30] Protos. Web page: http://www.ee.oulu.fi/research/ouspg/protos/.

[31] K. Sen, D. Marinov, and G. Agha. CUTE: A Concolic Unit Testing Engine for C. In Proceedings of FSE’2005 (13th International Symposium on the Foundations of Software Engineering), Lisbon, September 2005.

[32] A. Sotirov. Windows animated cursor stack overflow vulnerability, 2007. http://www.determina.com/security.research/vulnerabilities/ani-header.html.

[33] Spike. Web page: http://www.immunitysec.com/resources-freesoftware.shtml.

[34] M. Vuagnoux. Autodafe: An act of software torture. In 22nd Chaos Communications Congress, Berlin, Germany, 2005. autodafe.sourceforge.net.

A Additional Search Statistics
Media 1: wff-1 wff-1nh wff-2 wff-2nh wff-3 wff-3nh wff-4 wff-4nh
NULL 1 (46) 1 (32) 1(23) 1(12) 1(32) 1(26) 1(13) 1(1)
ReadAV 1 (40) 1 (16) 2(32) 2(13) 7(94) 4(74) 6(15) 5(45)
WriteAV 0 0 0 0 0 1(1) 1(3) 1(1)
SearchTime 10h7s 10h11s 10h4s 10h20s 10h7s 10h12s 10h34s 9h29m2s
AnalysisTime(s) 5625 4388 16565 11729 5082 6794 5545 7671
AnalysisTasks 564 545 519 982 505 752 674 878
BlocksAtStart 27659 26962 27635 26955 27626 27588 26812 26955
BlocksAdded 701 50 865 111 96 804 910 96
NumTests 6890 7252 6091 14400 6573 10669 8668 15280
TestsToLastCrash 6845 7242 5315 13616 6571 10563 6847 15279
TestsToLastUnique 168 5860 266 13516 5488 2850 2759 1132
MaxGen 6 6 6 8 6 7 7 8
GenToLastUnique 3 (50%) 5 (83%) 2 (33%) 7 (87.5%) 4 (66%) 3 (43%) 4 (57%) 3 (37.5%)
Mean Changes 1 1 1 1 1 1 1 1
Media 1: wff-5 wff-5nh bogus-1 bogus-1nh bogus-2 bogus-3 bogus-4 bogus-5
NULL 1(25) 1(15) 0 0 0 0 0 0
ReadAV 3(44) 3(56) 3(3) 1(1) 0 0 0 0
WriteAV 0 0 0 0 0 0 0 0
SearchTime 10h8s 10h4s 10h8s 10h14s 10h29s 9h47m15s 5m23s 5m39s
AnalysisTime(s) 21614 22005 11640 13156 3885 4480 214 234
AnalysisTasks 515 394 1546 1852 502 495 35 35
BlocksAtStart 27913 27680 27010 26965 27021 27022 24691 24692
BlocksAdded 109 113 130 60 61 74 57 41
NumTests 4186 2994 12190 15594 13945 13180 35 35
TestsToLastCrash 4175 2942 1403 11474 NA NA NA NA
TestsToLastUnique 1504 704 1403 11474 NA NA NA NA
MaxGen 5 4 14 13 8 9 9 9
GenToLastUnique 3 (60%) 3 (75%) 10 (71%) 11 (84%) NA NA NA NA
Mean Changes 1 1 1 1 1 1 1 1
Media 2: wff-1 wff-1nh wff-2 wff-3 wff-3nh wff-4 wff-4nh wff-5 wff-5nh bogus1
NULL 0 0 0 0 0 0 0 0 0 0
ReadAV 4(9) 4(9) 0 4(15) 4(14) 4(6) 3(3) 5(14) 4(12) 0
WriteAV 0 0 0 0 0 0 0 1(1) 0 0
SearchTime 10h12s 10h5s 10h6s 10h17s 10h1s 10h3s 10h7s 10h3s 10h6s 10h13s
AnalysisTime(s) 3457 3564 1517 9182 8513 1510 2195 10522 14386 14454
AnalysisTasks 3 3 1 6 7 2 2 6 6 1352
BlocksAtStart 48494 48486 51217 41726 41746 48729 48778 41917 42041 20008
BlocksAdded 10407 10633 12329 9564 8643 10379 10022 8980 8746 14743
NumTests 1045 1014 777 1253 1343 1174 948 1360 980 4165
TestsToLastCrash 1042 989 NA 1143 1231 1148 576 1202 877 NA
TestsToLastUnique 461 402 NA 625 969 658 576 619 877 NA
MaxGen 2 2 1 3 2 2 2 3 2 14
GenToLastUnique 2 (100%) 2 (100%) NA 2 (66%) 2 (100%) 2 (100%) 1 (50%) 2 2 NA
Mean Changes 3 3 4 4 3.5 5 5.5 4 4 2.9
Figure 10. Search statistics. For each search, we report the number of crashes of each type: the
first number is the number of distinct buckets, while the number in parentheses is the total number
of crashing test cases. We also report the total search time (SearchTime), the total time spent in
symbolic execution (AnalysisTime), the number of symbolic execution tasks (AnalysisTasks), blocks
covered by the initial file (BlocksAtStart), new blocks discovered during the search (BlocksAdded),
the total number of tests (NumTests), the test at which the last crash was found (TestsToLastCrash),
the test at which the last unique bucket was found (TestsToLastUnique), the maximum generation
reached (MaxGen), the generation at which the last unique bucket was found (GenToLastUnique),
and the mean number of file positions changed for each generated test case (Mean Changes).