Assuming that the two bytes read by the mov instruction are mapped to tags t′_0 and t′′_0, this fragment yields constraints s_1 > 0, ..., s_{k−1} > 0, and s_k ≤ 0, where s_{i+1} = ⟨t′_i, t′′_i⟩ − 1 with t′_i = subtag(s_i, 0) and t′′_i = subtag(s_i, 1) for i ∈ {1, ..., k}. Constant folding becomes hard because each loop iteration introduces syntactically unique but semantically redundant word-size sequence tags. SAGE solves this with the help of sequence tag simplification, which rewrites ⟨subtag(t, 0), subtag(t, 1)⟩ into t, avoiding duplication of equivalent tags and enabling constant folding.

Constraint subsumption, constant folding, and sequence tag simplification are sufficient to guarantee constant-space replay of the above fragment, generating the constraints ⟨t′_0, t′′_0⟩ − (k − 1) > 0 and ⟨t′_0, t′′_0⟩ − k ≤ 0. More generally, these three simple techniques enable SAGE to effectively fuzz real-world structured-file-parsing applications in which the input-bound loop pattern is pervasive.
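To make the rewrite concrete, here is a small Python sketch of sequence tag simplification on the loop above. It is our own illustration rather than SAGE code; the tuple-based tag encoding and helper names are assumptions. A symbolic word value is kept as a (tag, constant offset) pair so that, once the re-concatenated subtags collapse back to the original tag, constant folding reduces every iteration's guard to the same canonical form.

    # Tags: ("input", name), ("subtag", parent, index), or ("seq", lo, hi).
    def simplify(tag):
        # Sequence tag simplification: <subtag(t, 0), subtag(t, 1)>  ->  t
        if tag[0] == "seq":
            lo, hi = tag[1], tag[2]
            if (lo[0] == "subtag" and hi[0] == "subtag"
                    and lo[1] == hi[1] and lo[2] == 0 and hi[2] == 1):
                return lo[1]
        return tag

    t0 = ("input", "t0")          # word-size tag for the two bytes read by mov
    value = (t0, 0)               # symbolic counter as a (tag, constant offset) pair
    constraints = []
    for i in range(3):            # three loop iterations, for illustration
        constraints.append((value, "> 0"))                 # loop guard s_i > 0
        lo, hi = ("subtag", value[0], 0), ("subtag", value[0], 1)
        value = (simplify(("seq", lo, hi)), value[1] - 1)  # decrement the counter
    print(constraints)            # every guard stays of the form (t0, -i) > 0

Without the simplify step, each iteration would introduce a syntactically new ("seq", ...) tag and the constant offsets could not be folded.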
Figure 5. On the left, an ASCII rendering of a prefix of the seed ANI file used for our search. On the right, the SAGE-generated crash for MS07-017. Note how the SAGE test case changes the LIST to an additional anih record on the next-to-last line.

4 Experiments

We first describe our initial experiences with SAGE, including several bugs found by SAGE that were missed by blackbox fuzzing efforts. Inspired by these experiences, we pursue a more systematic study of SAGE’s behavior on two media-parsing applications. In particular, we focus on the importance of the starting input file for the search, the effect of our generational search vs. depth-first search, and the impact of our block coverage heuristic. In some cases, we withhold details concerning the exact application tested because the bugs are still in the process of being fixed.

4.1 Initial Experiences

MS07-017. On 3 April 2007, Microsoft released an out-of-band critical security patch for code that parses ANI format animated cursors. The vulnerability was originally reported to Microsoft in December 2006 by Alex Sotirov of Determina Security Research, then made public after exploit code appeared in the wild [32]. This was only the third such out-of-band patch released by Microsoft since January 2006, indicating the seriousness of the bug. The Microsoft SDL Policy Weblog states that extensive blackbox fuzz testing of this code failed to uncover the bug, and that existing static analysis tools are not capable of finding the bug without excessive false positives [20]. SAGE, in contrast, synthesizes a new input file exhibiting the bug within hours of starting from a well-formed ANI file.

In more detail, the vulnerability results from an incomplete patch to MS05-006, which also concerned ANI parsing code. The root cause of this bug was a failure to validate a size parameter read from an anih record in an ANI file. Unfortunately, the patch for MS05-006 is incomplete: only the length of the first anih record is checked. If a file has an initial anih record of 36 bytes or less, the check is satisfied, but then an icon loading function is called on all anih records. The length fields of the second and subsequent records are not checked, so any of these records can trigger memory corruption.

Therefore, a test case needs at least two anih records to trigger the MS07-017 bug. The SDL Policy Weblog attributes the failure of blackbox fuzz testing to find MS07-017 to the fact that all of the seed files used for blackbox testing had only one anih record, and so none of the test cases generated would break the MS05-006 patch. While one could of course write a grammar that generates such test cases for blackbox fuzzing, this requires effort and does not generalize beyond the single ANI format.
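The flawed pattern is easy to see in miniature. The following Python sketch is our own schematic of the incomplete check described above, not the actual Windows parsing code; the record representation, the load_cursor/load_icon names, and the fixed 36-byte buffer are illustrative assumptions based only on the description in this section.

    def load_icon(payload, length):
        buf = bytearray(36)                      # fixed-size header structure
        # The native code performs this copy without a bound check; here an
        # assert stands in for the resulting memory corruption.
        assert length <= len(buf), "memory corruption"
        buf[:length] = payload[:length]

    def load_cursor(records):
        # records: (record_type, declared_length, payload) triples from the file.
        anih = [r for r in records if r[0] == b"anih"]
        if not anih:
            return
        if anih[0][1] > 36:                      # the MS05-006 fix checks only the first record
            raise ValueError("bad anih header")
        for _, length, payload in anih:          # ...but every record is then processed
            load_icon(payload, length)

    # A file whose first anih record is small passes the check, yet an oversized
    # second record still reaches the unchecked copy, which is why at least two
    # anih records are needed to trigger the bug.
    try:
        load_cursor([(b"anih", 36, b"\x00" * 36), (b"anih", 400, b"A" * 400)])
    except AssertionError:
        print("second anih record reached the unchecked copy")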
Test # SymExec SymExecT Init. |PC| # Tests Mean Depth Mean # Instr. Mean Size
ANI 808 19099 341 11468 178 2066087 5400
Media 1 564 5625 71 6890 73 3409376 65536
Media 2 3 3457 3202 1045 1100 271432489 27335
Media 3 17 3117 1666 2266 608 54644652 30833
Media 4 7 3108 1598 909 883 133685240 22209
Compressed File 47 1495 111 1527 65 480435 634
OfficeApp 1 3108 15745 3008 6502 923731248 45064
Figure 6. Statistics from 10-hour searches on seven test applications, each seeded with a well-formed
input file. We report the number of SymbolicExecutor tasks during the search, the total time spent
in all SymbolicExecutor tasks in seconds, the number of constraints generated from the seed file,
the total number of test cases generated, the mean depth per test case in number of constraints, the
mean number of instructions executed after reading the input file, and the mean size of the symbolic
input in bytes.
In contrast, SAGE can generate a crash exhibiting MS07-017 starting from a well-formed ANI file with one anih record, despite having no knowledge of the ANI format. Our seed file was picked arbitrarily from a library of well-formed ANI files, and we used a small test driver that called user32.dll to parse test case ANI files. The initial test case generated a path constraint with 341 branch constraints after parsing 1279939 total instructions over 10072 symbolic input bytes. SAGE then created a crashing ANI file at depth 72 after 7 hours 36 minutes of search and 7706 test cases, using one core of a 2 GHz AMD Opteron 270 dual-core processor running 32-bit Windows Vista with 4 GB of RAM. Figure 5 shows a prefix of our seed file side by side with the crashing SAGE-generated test case. Figure 6 shows further statistics from this test run.

Compressed File Format. We released an alpha version of SAGE to an internal testing team to look for bugs in code that handles a compressed file format. The parsing code for this file format had been extensively tested with blackbox fuzzing tools, yet SAGE found two serious new bugs. The first bug was a stack overflow. The second bug was an infinite loop that caused the processing application to consume nearly 100% of the CPU. Both bugs were fixed within a week of filing, showing that the product team considered these bugs important. Figure 6 shows statistics from a SAGE run on this test code, seeded with a well-formed compressed file. SAGE also uncovered two separate crashes due to read access violations while parsing malformed files of a different format tested by the same team; the corresponding bugs were also fixed within a week of filing.

Media File Parsing. We applied SAGE to parsers for four widely used media file formats, which we will refer to as “Media 1,” “Media 2,” “Media 3,” and “Media 4.” Through several testing sessions, SAGE discovered crashes in each of these media formats that resulted in nine distinct bug reports. For example, SAGE discovered a read violation due to the program copying zero bytes into a buffer and then reading from a non-zero offset. In addition, starting from a seed file of 100 zero bytes, SAGE synthesized a crashing Media 1 test case after 1403 test cases, demonstrating the power of SAGE to infer file structure from code. Figure 6 shows statistics on the size of the SAGE search for each of these parsers, when starting from a well-formed file.

Office 2007 Application. We have used SAGE to successfully synthesize crashing test cases for a large application shipped as part of Office 2007. Over the course of two 10-hour searches seeded with two different well-formed files, SAGE generated 4548 test cases, of which 43 crashed the application. The crashes we have investigated so far are NULL pointer dereference errors, and they show how SAGE can successfully reason about programs on a large scale. Figure 6 shows statistics from the SAGE search on one of the well-formed files.

Image Parsing. We used SAGE to exercise the image parsing code in a media player included with a variety of other applications. While our initial run did not find crashes, we used an internal tool to scan traces from SAGE-generated test cases and found several uninitialized value use errors. We reported these errors to the testing team, who expanded the result into a reproducible crash. This experience shows that SAGE can uncover serious bugs that do not immediately lead to crashes.

4.2 Experiment Setup

Test Plan. We focused on the Media 1 and Media 2 parsers because they are widely used. We ran a SAGE search for the Media 1 parser with five “well-formed” media files, chosen from a library of test media files.
[Figure 7 tables: for Media 1, 12 distinct stack hashes, each marked with the seed files (wff-1 through wff-5 and bogus-1) whose searches found a crash in that bucket; for Media 2, 7 distinct stack hashes, marked against the seeds wff-1, wff-3, wff-4, and wff-5.]

Figure 7. SAGE found 12 distinct stack hashes (shown left) from 357 Media 1 crashing files and 7 distinct stack hashes (shown right) from 88 Media 2 crashing files.
We also tested Media 1 with five “bogus” files: bogus-1 consisting of 100 zero bytes, bogus-2 consisting of 800 zero bytes, bogus-3 consisting of 25600 zero bytes, bogus-4 consisting of 100 randomly generated bytes, and bogus-5 consisting of 800 randomly generated bytes. For each of these 10 files, we ran a 10-hour SAGE search seeded with the file to establish a baseline number of crashes found by SAGE. If a task was in progress at the end of 10 hours, we allowed it to finish, leading to search times slightly longer than 10 hours in some cases. For searches that found crashes, we then re-ran the SAGE search for 10 hours, but disabled our block coverage heuristic. We repeated the process for the Media 2 parser with five “well-formed” Media 2 files and the bogus-1 file.
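For concreteness, the five bogus seed files described above are trivial to reproduce. This short Python sketch is our own, since the paper does not say how the files were materialized; the file names and random seed are arbitrary.

    import random

    random.seed(0)                                    # arbitrary fixed seed
    seeds = {
        "bogus-1": bytes(100),                        # 100 zero bytes
        "bogus-2": bytes(800),                        # 800 zero bytes
        "bogus-3": bytes(25600),                      # 25600 zero bytes
        "bogus-4": bytes([random.randrange(256) for _ in range(100)]),   # 100 random bytes
        "bogus-5": bytes([random.randrange(256) for _ in range(800)]),   # 800 random bytes
    }
    for name, data in seeds.items():
        with open(name, "wb") as f:                   # write each seed to disk
            f.write(data)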
Each SAGE search used AppVerifier [8] configured to check for heap memory errors. Whenever such an error occurs, AppVerifier forces a “crash” in the application under test. We then collected crashing test cases, the absolute number of code blocks covered by the seed input, and the number of code blocks added over the course of the search. We performed our experiments on four machines, each with two dual-core AMD Opteron 270 processors running at 2 GHz. During our experiments, however, we used only one core to reduce the effect of nondeterministic task scheduling on the search results. Each machine ran 32-bit Windows Vista, with 4 GB of RAM and a 250 GB hard drive.

Triage. Because a SAGE search can generate many different test cases that exhibit the same bug, we “bucket” crashing files by the stack hash of the crash, which includes the address of the faulting instruction. It is possible for the same bug to be reachable by program paths with different stack hashes for the same root cause. Our experiments always report the distinct stack hashes.
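As an illustration of this bucketing scheme, the sketch below hashes the faulting instruction address together with the top frames of the crash's call stack; the exact fields SAGE hashes beyond the faulting address, the frame count, and the hash function are assumptions on our part.

    import hashlib

    def stack_hash(faulting_address, call_stack, frames=5):
        # Bucket key: faulting instruction address plus the return addresses of
        # the innermost frames. Crashes with the same key count as the same bug.
        key = [faulting_address] + list(call_stack[:frames])
        digest = hashlib.sha1(",".join(hex(a) for a in key).encode()).hexdigest()
        return int(digest[:8], 16)        # small integer id, like those in Figure 7

    buckets = {}
    def record_crash(test_case, faulting_address, call_stack):
        buckets.setdefault(stack_hash(faulting_address, call_stack), []).append(test_case)

    # Two crashes at the same location with the same stack land in one bucket.
    record_crash("test-0041.ani", 0x77001234, [0x77001200, 0x00401000])
    record_crash("test-0099.ani", 0x77001234, [0x77001200, 0x00401000])
    print(len(buckets))   # 1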
Nondeterminism in Coverage Results. As part of our experiments, we measured the absolute number of blocks covered during a test run. We observed that running the same input on the same program can lead to slightly different initial coverage, even on the same machine. We believe this is due to nondeterminism associated with loading and initializing DLLs used by our test applications.

4.3 Results and Observations

The Appendix shows a table of results from our experiments. Here we comment on some general observations. We stress that these observations are from a limited sample size of two applications and should be taken with caution.

Symbolic execution is slow. We measured the total amount of time spent performing symbolic execution during each search. We observe that a single symbolic execution task is many times slower than testing or tracing a program. For example, the mean time for a symbolic execution task in the Media 2 search seeded with wff-3 was 25 minutes 30 seconds, while testing a Media 2 file took seconds. At the same time, we can also observe that only a small portion of the search time was spent performing symbolic execution, because each task generated many test cases; in the Media 2 wff-3 case, only 25% of the search time was spent in symbolic execution. This shows how a generational search effectively leverages the expensive symbolic execution task. This also shows the benefit of separating the Tester task from the more expensive SymbolicExecutor task.

Generational search is better than depth-first search. We performed several runs with depth-first search. First, we discovered that the SAGE search on Media 1, when seeded with the bogus-1 file, exhibited a pathological divergence (see Section 2) leading to premature termination of the search after 18 minutes. Upon further inspection, this divergence proved to be due to concretizing an AND operator in the path constraint. We did observe depth-first search runs for 10 hours for Media 2 searches seeded with wff-2 and wff-3. Neither depth-first search found crashes.
Figure 8. Histograms of test cases and of crashes by generation for Media 1 seeded with wff-4.
In contrast, while a generational search seeded with wff-2 found no crashes, a generational search seeded with wff-3 found 15 crashing files in 4 buckets. Furthermore, the depth-first searches were inferior to the generational searches in code coverage: the wff-2 generational search started at 51217 blocks and added 12329, while the depth-first search started with 51476 and added only 398. For wff-3, a generational search started at 41726 blocks and added 9564, while the depth-first search started at 41703 blocks and added 244. These different initial block coverages stem from the nondeterminism noted above, but the difference in blocks added is much larger than the difference in starting coverage. The limitations of depth-first search regarding code coverage are well known (e.g., [23]) and are due to the search being too localized. In contrast, a generational search explores alternative execution branches at all depths, simultaneously exploring all the layers of the program. Finally, we saw that a much larger percentage of the search time is spent in symbolic execution for depth-first search than for generational search, because each test case requires a new symbolic execution task. For example, for the Media 2 search seeded with wff-3, a depth-first search spent 10 hours and 27 minutes in symbolic execution for 18 test cases generated, out of a total of 10 hours and 35 minutes. Note that any other search algorithm that generates a single new test from each symbolic execution (like a breadth-first search) has a similar execution profile where expensive symbolic executions are poorly leveraged, hence resulting in relatively few tests being executed given a fixed time budget.
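To see why, compare the number of new tests each strategy obtains from a single symbolic execution. The Python sketch below is a simplified model rather than SAGE's actual scheduler: a generational expansion negates each constraint in the path condition past the test's bound and asks the solver for one new input per constraint, whereas a depth-first or breadth-first step derives only one new input per trace. The negate and solve helpers are stand-ins for the real constraint machinery.

    def negate(constraint):
        expr, taken = constraint          # flip the recorded branch outcome
        return (expr, not taken)

    def expand_generational(path_constraint, bound, solve):
        # One symbolic execution yields up to len(path_constraint) - bound children:
        # negate each constraint past the bound, keeping the prefix before it.
        children = []
        for j in range(bound, len(path_constraint)):
            prefix = path_constraint[:j] + [negate(path_constraint[j])]
            new_input = solve(prefix)
            if new_input is not None:
                children.append((new_input, j))   # remember the flip position
        return children

    def expand_depth_first(path_constraint, solve):
        # Same cost (one symbolic execution), but only one child is produced.
        prefix = path_constraint[:-1] + [negate(path_constraint[-1])]
        new_input = solve(prefix)
        return [] if new_input is None else [new_input]

    dummy_solve = lambda prefix: tuple(prefix)    # stand-in for the real solver
    pc = [("b%d" % i, True) for i in range(5)]
    print(len(expand_generational(pc, 0, dummy_solve)))   # 5 new tests
    print(len(expand_depth_first(pc, dummy_solve)))       # 1 new test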
Divergences are common. Our basic test setup did not measure divergences, so we ran several instrumented test cases to measure the divergence rate. In these cases, we often observed divergence rates of over 60%. This may be due to several reasons: in our experimental setup, we concretize all non-linear operations (such as multiplication, division, and bitwise arithmetic) for efficiency, there are several x86 instructions we still do not emulate, we do not model symbolic dereferences of pointers, tracking symbolic variables may be incomplete, and we do not control all sources of nondeterminism as mentioned above. Despite this, SAGE was able to find many bugs in real applications, showing that our search technique is tolerant of such divergences.

Bogus files find few bugs. We collected crash data from our well-formed and bogus seeded SAGE searches. The bugs found by each seed file are shown, bucketed by stack hash, in Figure 7. Out of the 10 files used as seeds for SAGE searches on Media 1, 6 found at least one crashing test case during the search, and 5 of these 6 seeds were well-formed. Furthermore, all the bugs found in the search seeded with bogus-1 were also found by at least one well-formed file. For SAGE searches on Media 2, out of the 6 seed files tested, 4 found at least one crashing test case, and all were well-formed. Hence, the conventional wisdom that well-formed files should be used as a starting point for fuzz testing applies to our whitebox approach as well.

Different files find different bugs. Furthermore, we observed that no single well-formed file found all distinct bugs for either Media 1 or Media 2. This suggests that using a wide variety of well-formed files is important for finding distinct bugs as each search is incomplete.

Bugs found are shallow. For each seed file, we collected the maximum generation reached by the search. We then looked at which generation the search found the last of its unique crash buckets. For the Media 1 searches, crash-finding searches seeded with well-formed files found all unique bugs within 4 generations, with a maximum number of generations between 5 and 7. Therefore, most of the bugs found by these searches are shallow: they are reachable in a small number of generations. The crash-finding Media 2 searches reached a maximum generation of 3, so we did not observe a trend here.

Figure 8 shows histograms of both crashing and non-crashing (“NoIssues”) test cases by generation for Media 1 seeded with wff-4. We can see that most tests executed were of generations 4 to 6, yet all unique bugs can be found in generations 1 to 4. The number of test cases tested with no issues in later generations is high, but these new test cases do not discover distinct new bugs.
This behavior was consistently observed in almost all our experiments, especially the “bell curve” shown in the histograms. This generational search did not go beyond generation 7 since it still has many candidate input tests to expand in smaller generations and since many tests in later generations have lower incremental-coverage scores.

No clear correlation between coverage and crashes. We measured the absolute number of blocks covered after running each test, and we compared this with the locations of the first test case to exhibit each distinct stack hash for a crash. Figure 9 shows the result for a Media 1 search seeded with wff-4; the vertical bars mark where in the search crashes with new stack hashes were discovered. While this graph suggests that an increase in coverage correlates with finding new bugs, we did not observe this universally. Several other searches follow the trends shown by the graph for wff-2: they found all unique bugs early on, even if code coverage increased later. We found this surprising, because we expected there to be a consistent correlation between new code explored and new bugs discovered. In both cases, the last unique bug is found partway through the search, even though crashing test cases continue to be generated.

Effect of block coverage heuristic. We compared the number of blocks added during the search between test runs that used our block coverage heuristic to pick the next child from the pool, and runs that did not. We observed only a weak trend in favor of the heuristic. For example, the Media 2 wff-1 search added 10407 blocks starting from 48494 blocks covered, while the non-heuristic case started with 48486 blocks and added 10633, almost a dead heat. In contrast, the Media 1 wff-1 search started with 27659 blocks and added 701, while the non-heuristic case started with 26962 blocks and added only 50. Out of 10 total search pairs, in 3 cases the heuristic added many more blocks, while in the others the numbers are close enough to be almost a tie. As noted above, however, this data is noisy due to nondeterminism observed with code coverage.
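As a reminder of what the heuristic does, the sketch below scores each pending child by its incremental block coverage and picks the highest-scoring one. It is a schematic of the idea rather than SAGE's exact implementation, and the data structures are our own.

    def incremental_score(blocks_covered, global_coverage):
        # Number of previously unseen blocks reached by this child's trace.
        return len(blocks_covered - global_coverage)

    def pick_next_child(pool, global_coverage):
        # pool entries are (test_case, blocks_covered_by_its_trace) pairs.
        best = max(pool, key=lambda e: incremental_score(e[1], global_coverage))
        pool.remove(best)
        global_coverage |= best[1]        # the chosen child's blocks are now "seen"
        return best[0]

    # Example: child-b reaches two unseen blocks and is expanded first.
    coverage = {1, 2, 3}
    pool = [("child-a", {1, 2}), ("child-b", {3, 4, 5}), ("child-c", {2, 4})]
    print(pick_next_child(pool, coverage))   # child-b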
5 Other Related Work

Other extensions of fuzz testing have recently been developed. Most of those consist of using grammars for representing sets of possible inputs [30, 33]. Probabilistic weights can be assigned to production rules and used as heuristics for random test input generation. Those weights can also be defined or modified automatically using coverage data collected using lightweight dynamic program instrumentation [34]. These grammars can also include rules for corner cases to test for common pitfalls in input validation code (such as very long strings, zero values, etc.). The use of input grammars makes it possible to encode application-specific knowledge about the application under test, as well as testing guidelines to favor testing specific areas of the input space compared to others. In practice, they are often key to enable blackbox fuzzing to find interesting bugs, since the probability of finding those using pure random testing is usually very small. But writing grammars manually is tedious, expensive, and scales poorly. In contrast, our whitebox fuzzing approach does not require an input grammar specification to be effective. However, the experiments of the previous section highlight the importance of the initial seed file for a given search. Those seed files could be generated using grammars used for blackbox fuzzing to increase their diversity. Also, note that blackbox fuzzing can generate and run new tests faster than whitebox fuzzing due to the cost of symbolic execution and constraint solving. As a result, it may be able to expose new paths that would not be exercised with whitebox fuzzing because of the imprecision of symbolic execution.

As previously discussed, our approach builds upon recent work on systematic dynamic test generation, introduced in [16, 6] and extended in [15, 31, 7, 14, 29]. The main differences are that we use a generational search algorithm using heuristics to find bugs as fast as possible in an incomplete search, and that we test large applications instead of unit testing small ones, the latter being enabled by a trace-based x86-binary symbolic execution instead of a source-based approach. Those differences may explain why we have found more bugs than previously reported with dynamic test generation.

Our work also differs from tools such as [11], which are based on dynamic taint analysis: they do not generate or solve constraints, but instead simply force branches to be taken or not taken without regard to the program state. While useful for a human auditor, this can lead to false positives in the form of spurious program crashes with data that “can’t happen” in a real execution. Symbolic execution is also a key component of static program analysis, which has been applied to x86 binaries [2, 10]. Static analysis is usually more efficient but less precise than dynamic analysis and testing, and their complementarity is well known [12, 15]. They can also be combined [15, 17]. Static test generation [21] consists of analyzing a program statically to attempt to compute input values to drive it along specific program paths without ever executing the program. In contrast, dynamic test generation extends static test generation with additional runtime information, and is therefore more general and powerful [16, 14]. Symbolic execution has also been proposed in the context of generating vulnerability signatures, either statically [5] or dynamically [9].
Figure 9. Coverage and initial discovery of stack hashes for Media 1 seeded with wff-4 and wff-2. The leftmost bar represents multiple distinct crashes found early in the search; all other bars represent a single distinct crash first found at this position in the search.
6 Conclusion

We introduced a new search algorithm, the generational search, for dynamic test generation that tolerates divergences and better leverages expensive symbolic execution tasks. Our system, SAGE, applied this search algorithm to find bugs in a variety of production x86 machine-code programs running on Windows. We then ran experiments to better understand the behavior of SAGE on two media-parsing applications. We found that using a wide variety of well-formed input files is important for finding distinct bugs. We also observed that the number of generations explored is a better predictor than block coverage of whether a test case will find a unique new bug. In particular, most unique bugs found are found within a small number of generations.

While these observations must be treated with caution, coming from a limited sample size, they suggest a new search strategy: instead of running for a set number of hours, one could systematically search a small number of generations starting from an initial seed file and, once these test cases are exhausted, move on to a new seed file. The promise of this strategy is that it may cut off the “tail” of a generational search that only finds new instances of previously seen bugs, and thus might find more distinct bugs in the same amount of time. Future work should experiment with this search method, possibly combining it with our block-coverage heuristic applied over different seed files to avoid re-exploring the same code multiple times. The key point to investigate is whether generation depth combined with code coverage is a better indicator of when to stop testing than code coverage alone.
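As a rough sketch of what that strategy could look like (this is our own illustration of the proposal above, not an implemented SAGE mode; run_bounded_search is a hypothetical stand-in for a SAGE search capped at a generation depth):

    def multi_seed_strategy(seed_files, max_generation, run_bounded_search):
        # Instead of one long search per seed, cap each search at a small number
        # of generations and rotate through many seed files, since most unique
        # bugs appear within the first few generations.
        distinct_buckets = set()
        for seed in seed_files:
            buckets = run_bounded_search(seed, max_generation)  # returns stack hashes
            distinct_buckets |= buckets
        return distinct_buckets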
Finally, we plan to enhance the precision of SAGE’s symbolic execution and the power of SAGE’s constraint solving capability. This will enable SAGE to find bugs that are currently out of reach.

Acknowledgments

We are indebted to Chris Marsh and Dennis Jeffries for important contributions to SAGE, and to Hunter Hudson for championing this project from the very beginning. SAGE builds on the work of the TruScan team, including Andrew Edwards and Jordan Tigani, and the Disolver team, including Youssef Hamadi and Lucas Bordeaux, for which we are grateful. We thank Tom Ball, Manuvir Das and Jim Larus for their support and feedback on this project. Various internal test teams provided valuable feedback during the development of SAGE, including some of the bugs described in Section 4.1, for which we thank them. We thank Derrick Coetzee, Ben Livshits and David Wagner for their comments on drafts of our paper, and Nikolaj Bjorner and Leonardo de Moura for discussions on constraint solving. We thank Chris Walker for helpful discussions regarding security.

References

[1] D. Aitel. The advantages of block-based protocol analysis for security testing, 2002. http://www.immunitysec.com/downloads/advantages_of_block_based_analysis.html.

[2] G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In Proc. Int. Conf. on Compiler Construction, 2004. http://www.cs.wisc.edu/wpis/papers/cc04.ps.

[3] S. Bhansali, W. Chen, S. De Jong, A. Edwards, and M. Drinic. Framework for instruction-level tracing and analysis of programs. In Second International Conference on Virtual Execution Environments VEE, 2006.
[4] D. Bird and C. Munoz. Automatic Generation of Random Self-Checking Test Cases. IBM Systems Journal, 22(3):229–245, 1983.

[5] D. Brumley, T. Chieh, R. Johnson, H. Lin, and D. Song. RICH: Automatically protecting against integer-based vulnerabilities. In NDSS (Symp. on Network and Distributed System Security), 2007.

[6] C. Cadar and D. Engler. Execution Generated Test Cases: How to Make Systems Code Crash Itself. In Proceedings of SPIN’2005 (12th International SPIN Workshop on Model Checking of Software), volume 3639 of Lecture Notes in Computer Science, San Francisco, August 2005. Springer-Verlag.

[7] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically Generating Inputs of Death. In ACM CCS, 2006.

[8] Microsoft Corporation. AppVerifier, 2007. http://www.microsoft.com/technet/prodtechnol/windows/appcompatibility/appverifier.mspx.

[9] M. Costa, J. Crowcroft, M. Castro, A. Rowstron, L. Zhou, L. Zhang, and P. Barham. Vigilante: End-to-end containment of internet worms. In Symposium on Operating Systems Principles (SOSP), 2005.

[10] M. Cova, V. Felmetsger, G. Banks, and G. Vigna. Static detection of vulnerabilities in x86 executables. In Proceedings of the Annual Computer Security Applications Conference (ACSAC), 2006.

[11] W. Drewry and T. Ormandy. Flayer: Exposing application internals. In First Workshop On Offensive Technologies (WOOT), 2007.

[12] M. D. Ernst. Static and dynamic analysis: synergy and duality. In Proceedings of WODA’2003 (ICSE Workshop on Dynamic Analysis), Portland, May 2003.

[13] J. E. Forrester and B. P. Miller. An Empirical Study of the Robustness of Windows NT Applications Using Random Testing. In Proceedings of the 4th USENIX Windows System Symposium, Seattle, August 2000.

[14] P. Godefroid. Compositional Dynamic Test Generation. In Proceedings of POPL’2007 (34th ACM Symposium on Principles of Programming Languages), pages 47–54, Nice, January 2007.

[15] P. Godefroid and N. Klarlund. Software Model Checking: Searching for Computations in the Abstract or the Concrete (Invited Paper). In Proceedings of IFM’2005 (Fifth International Conference on Integrated Formal Methods), volume 3771 of Lecture Notes in Computer Science, pages 20–32, Eindhoven, November 2005. Springer-Verlag.

[16] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed Automated Random Testing. In Proceedings of PLDI’2005 (ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation), pages 213–223, Chicago, June 2005.

[17] B. S. Gulavani, T. A. Henzinger, Y. Kannan, A. V. Nori, and S. K. Rajamani. Synergy: A new algorithm for property checking. In Proceedings of the 14th Annual Symposium on Foundations of Software Engineering (FSE), 2006.

[18] N. Gupta, A. P. Mathur, and M. L. Soffa. Generating Test Data for Branch Coverage. In Proceedings of the 15th IEEE International Conference on Automated Software Engineering, pages 219–227, September 2000.

[19] Y. Hamadi. Disolver: A Distributed Constraint Solver. Technical Report MSR-TR-2003-91, Microsoft Research, December 2003.

[20] M. Howard. Lessons learned from the animated cursor security bug, 2007. http://blogs.msdn.com/sdl/archive/2007/04/26/lessons-learned-from-the-animated-cursor-security-bug.aspx.

[21] J. C. King. Symbolic Execution and Program Testing. Journal of the ACM, 19(7):385–394, 1976.

[22] B. Korel. A Dynamic Approach of Test Data Generation. In IEEE Conference on Software Maintenance, pages 311–317, San Diego, November 1990.

[23] R. Majumdar and K. Sen. Hybrid Concolic testing. In Proceedings of ICSE’2007 (29th International Conference on Software Engineering), Minneapolis, May 2007. ACM.

[24] D. Molnar and D. Wagner. Catchconv: Symbolic execution and run-time type inference for integer conversion errors, 2007. UC Berkeley EECS, 2007-23.

[25] Month of Browser Bugs, July 2006. Web page: http://browserfun.blogspot.com/.

[26] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data races using replay analysis. In Programming Languages Design and Implementation (PLDI), 2007.

[27] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI, 2007.
Figure 10. Search statistics. For each search, we report the number of crashes of each type: the
first number is the number of distinct buckets, while the number in parentheses is the total number
of crashing test cases. We also report the total search time (SearchTime), the total time spent in
symbolic execution (AnalysisTime), the number of symbolic execution tasks (AnalysisTasks), blocks
covered by the initial file (BlocksAtStart), new blocks discovered during the search (BlocksAdded),
the total number of tests (NumTests), the test at which the last crash was found (TestsToLastCrash),
the test at which the last unique bucket was found (TestsToLastUnique), the maximum generation
reached (MaxGen), the generation at which the last unique bucket was found (GenToLastUnique),
and the mean number of file positions changed for each generated test case (Mean Changes).