


BugBench: Benchmarks for Evaluating Bug Detection Tools
Shan Lu, Zhenmin Li, Feng Qin, Lin Tan, Pin Zhou and Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign, Urbana, IL 61801

ABSTRACT

Benchmarking provides an effective way to evaluate different tools. Unfortunately, so far there is no good benchmark suite to systematically evaluate software bug detection tools. As a result, it is difficult to quantitatively compare the strengths and limitations of existing or newly proposed bug detection tools.

In this paper, we share our experience of building a bug benchmark suite called BugBench. Specifically, we first summarize general guidelines on the criteria for selecting representative bug benchmarks and the metrics for evaluating a bug detection tool. Second, we present a set of buggy applications we have collected, covering various types of software bugs. Third, we conduct a preliminary study of the application and bug characteristics in the context of software bug detection. Finally, we evaluate several existing bug detection tools, including Purify, Valgrind, and CCured, to validate the selection of our benchmarks.

1 Introduction

1.1 Motivation

Software bugs account for more than 40% of system failures [20], which makes software bug detection an increasingly important research topic. Recently, many bug detection tools have been proposed, with many more expected to appear in the near future. Facing so many tools, programmers need guidance to select the tools that are most suitable for their programs and the failures they encounter, and researchers need a unified evaluation method to demonstrate the strengths and weaknesses of their tools relative to others. All these needs strongly motivate a representative, fair, and comprehensive benchmark suite for evaluating software bug detection tools.

A benchmark is a standard of measurement or evaluation, and an effective and affordable way of conducting experiments [28]. A good, community-accepted benchmark suite has both technical and social impact. On the technical side, evaluations with standard benchmarks are more rigorous and convincing, alternative ideas can be compared objectively, and problems overlooked in previous research may be exposed by benchmarking. On the social side, building a benchmark fosters collaboration within the community and helps the community form a common understanding of the problem it faces [26]. There are many successful benchmark examples, such as the SPEC (Standard Performance Evaluation Corporation) benchmarks [27] and the TPC (Transaction Processing Council) series [29], both of which have been widely used by the corresponding research and product development communities.

However, in the software bug detection area, there is no widely accepted benchmark suite for evaluating existing or newly proposed methods. As a result, many previous studies either use synthetic toy applications or borrow benchmarks (such as SPEC and Siemens) from other research areas. While such evaluation may be appropriate as a proof of concept, it hardly provides a solid demonstration of the unique strengths and shortcomings of the proposed method. Aware of this problem, some studies [5, 23, 32] use real buggy applications for evaluation, which makes the proposed tools much more convincing. Unfortunately, based on our previous experience of evaluating our own bug detection tools [18, 24, 32, 33], finding real applications with real bugs is a time-consuming process, especially since many bug report databases are not well documented for our purposes, i.e., they report only the symptoms but not the root causes. Furthermore, different tools are evaluated with different applications, making it hard to cross-compare tools with similar functionality.

Besides benchmarking, the evaluation criteria for software bug detection tools are also not standardized. Some work evaluated only the execution overhead using SPEC benchmarks, completely overlooking the bug detection functionality. In contrast, some work [16, 18] did a much more thorough evaluation, reporting not only false positives and/or false negatives but also the ranking of reported bugs.

As the research area of software bug detection starts booming with many innovative ideas, the urgency of a unified evaluation method with a standard benchmark suite has been recognized, as indicated by the presence of this workshop. For example, researchers at IBM Haifa [6] advocate building benchmarks for testing and debugging concurrent programs. Similarly, although not formally announced as a benchmark, a Java application set, HEDC, used in [31], is shared by a few laboratories to compare the effectiveness of data race detection methods.

1.2 Our Work

Building a benchmark suite is a long-term, iterative process that needs cooperation from the whole community. In this paper, we share our experience of building a bug benchmark suite as a vehicle to solicit feedback. We plan to release the current collection of buggy applications to the research community soon. Specifically, this paper reports our work on bug benchmark design and collection in the following aspects:

(1) General guidelines on bug benchmark selection criteria and evaluation metrics: By learning from successful benchmarks in other areas and from prior unsuccessful bug benchmark attempts, we summarize several criteria that we follow when selecting a buggy application for our benchmark suite. In addition, based on previous research experience and a survey of the software bug detection literature, we summarize a set of quantitative and qualitative metrics for evaluating bug detection tools.

(2) A C/C++ bug benchmark suite, BugBench: So far, we have collected 17 C/C++ applications for BugBench, and we are still looking for more applications to enrich the suite. All of the applications come from the open source community and contain various software defects, including buffer overflows, stack smashing, double frees, uninitialized reads, memory leaks, data races, atomicity violations, semantic bugs, etc. Some of these buggy applications have been used in our previous work [18, 24, 32, 33] and have also been forwarded by us to a few other research groups, e.g., at UCSD and Purdue, for their studies [7, 22].

(3) A preliminary study of benchmark and bug characteristics: We have studied the characteristics of several benchmarks that contain memory-related bugs, including memory access frequencies, malloc frequencies, and crash latencies (the distance from the root cause to the manifestation point), which affect the overhead and bug-detection capability of a dynamic bug detection tool. To the best of our knowledge, ours is among the first studies of buggy application characteristics in the context of software bug detection.

(4) A preliminary evaluation of several existing tools: To validate our selection of benchmarks and characteristics, we have conducted a preliminary evaluation using several existing tools, including Purify [12], Valgrind [25], and CCured [23]. Our preliminary results show that our benchmarks can effectively differentiate the strengths and limitations of these tools.
2 Lessons from Prior Work

2.1 Successful Benchmarks in Other Areas

SPEC (Standard Performance Evaluation Cooperative) was founded by several major computer vendors in order to "provide the industry with a realistic yardstick to measure the performance of advanced computer systems" [3]. To achieve this purpose, SPEC has a very strict application selection process. First, candidates are picked from programs that have significant use in their fields, e.g., gcc from the compiler field and weather prediction from the scientific computing field. Then, candidates are checked for clarity and portability across different architecture platforms. Qualified candidates are analyzed for detailed dynamic characteristics, such as instruction mix and memory usage. Based on these characteristics, the SPEC committee decides whether there is enough diversity and little redundancy in the benchmark suite. After several iterations of the above steps, a SPEC benchmark is finally announced.

TPC (Transaction Processing Council) was founded in the mid-1980s to satisfy the demand for comparing numerous database management systems. The TPC benchmarks share common properties with SPEC, i.e., they are representative, diverse, and portable. Take TPC-C (an OLTP benchmark) as an example [17]. To be representative, TPC-C uses five popular real-world transactions: new order, payment, delivery, order status, and stock level. In terms of diversity, these transactions cover almost all important database operations. In addition, TPC-C has a comprehensive evaluation metric set: it adopts two standard metrics, the new-order transaction rate and price/performance, together with additional tests for ACID properties, e.g., whether the database can recover from failure. All of these contribute to the great success of the TPC-C benchmark.

2.2 Prior Benchmarks in Software Engineering and Bug Detection Areas

Recently, much progress has been made on benchmarking in software engineering-related areas. CppETS [26] is a benchmark suite in reverse engineering for evaluating "factor extractors". It provides a collection of C++ programs, each associated with a question file. Evaluated tools answer the questions based on their extraction results and earn points for their answers; the final score across all test programs indicates the performance of the tool. This benchmark suite is a good vehicle for objectively evaluating and comparing factor extractors.

The benchmark suites most closely related to bug detection are the Siemens suite [11] and the PEST suite [15] for software testing. In these suites, each application is associated with a number of buggy versions, and better testing tools can distinguish more buggy versions from correct ones. Although these suites provide a relatively large bug pool, most of the bugs are semantic bugs; there are almost no memory-related bugs and definitely no multi-threading bugs. Furthermore, the benchmark applications are very small (some are less than 100 lines of code), so they cannot represent real bug detection scenarios and can hardly be used to measure time overhead. Therefore, they are not suitable to serve as bug detection benchmarks.

In the bug detection community, not much work has been done on benchmarking. Recently, researchers at IBM Haifa [14] proposed building multithreaded program benchmarks. However, their effort was unsuccessful, as acknowledged in their follow-up paper [6], because they relied on students to purposely write buggy programs instead of using real ones.

3 Benchmarking Guidelines

3.1 Classification of Software Bugs

In order to build good bug benchmarks, we first need to classify software bugs. There are different ways to classify bugs [1, 15]; in this section we classify them by the different challenges a bug poses to detection tools. Since our benchmark suite cannot cover all bug types, in the following we list only the bug types that are most security-critical and most common. They are also the design focus of most bug detection tools.

Memory-related bugs: Memory-related bugs are caused by improper handling of memory objects, and they are often exploited to launch security attacks. According to the US-CERT Vulnerability Notes Database [30], they contribute the most to all reported vulnerabilities since 1991. Memory-related bugs can be further classified into: (1) Buffer overflow: an illegal access beyond the buffer boundary. (2) Stack smashing: illegally overwriting the function return address. (3) Memory leak: dynamically allocated memory that has no remaining reference to it and hence can never be freed. (4) Uninitialized read: reading memory data before it is initialized, so the value obtained is meaningless. (5) Double free: one memory location freed twice.
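To make these five classes concrete, the following C sketch shows a minimal instance of each; the code is illustrative only, and the function and variable names are ours, not taken from any BugBench application.

    #include <stdlib.h>
    #include <string.h>

    char cfg[8];                       /* small global buffer */

    void buffer_overflow(const char *s) {
        strcpy(cfg, s);                /* (1) writes past cfg if strlen(s) >= 8 */
    }

    void stack_smash(const char *s) {
        char local[8];
        strcpy(local, s);              /* (2) a long s overwrites the saved return address */
    }

    char *memory_leak(void) {
        char *p = malloc(64);
        p = malloc(64);                /* (3) the first block loses its last reference */
        return p;
    }

    int uninitialized_read(void) {
        int flag;                      /* never assigned */
        return flag ? 1 : 0;           /* (4) reads an uninitialized value */
    }

    void double_free(void) {
        char *p = malloc(16);
        free(p);
        free(p);                       /* (5) the same location freed twice */
    }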
Concurrent bugs: Concurrent bugs are those that occur only in a multi-threaded (or multi-process) environment. They are caused by ill-synchronized operations from multiple threads. Concurrent bugs can be further divided into the following groups: (1) Data race: conflicting accesses from concurrent threads touch shared data in arbitrary order. (2) Atomicity violation: a group of operations from one thread is unexpectedly interrupted by conflicting operations from other threads. (3) Deadlock: in resource sharing, one or more processes wait permanently for some resource and can never proceed.

An important property of concurrent bugs is non-determinism, which makes them hard to reproduce. Such timing sensitivity adds extra difficulty to bug detection.
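A minimal pthreads sketch of the first two classes follows; the shared counter and thread functions are hypothetical, not drawn from the benchmark applications. The first thread updates the shared variable with no synchronization (a data race), while the second locks each access individually but splits the check and the update across two critical sections (an atomicity violation).

    #include <pthread.h>
    #include <stdio.h>

    static long balance = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Data race: updates the shared counter with no synchronization,
     * so its accesses conflict with the locked accesses below. */
    static void *racy_deposit(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            balance = balance + 1;      /* unsynchronized read-modify-write */
        return NULL;
    }

    /* Atomicity violation: every access is locked, but the check and the
     * update sit in two separate critical sections, so another thread can
     * change balance in between and the decision becomes stale. */
    static void *non_atomic_withdraw(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        long current = balance;
        pthread_mutex_unlock(&lock);
        if (current >= 1) {
            pthread_mutex_lock(&lock);
            balance = current - 1;      /* may overwrite concurrent deposits */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, racy_deposit, NULL);
        pthread_create(&t2, NULL, non_atomic_withdraw, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("final balance: %ld\n", balance);   /* varies across runs */
        return 0;
    }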
Semantic bugs: A large family of software bugs are semantic bugs, i.e., bugs that are inconsistent with the original design and the programmers' intention. Semantic information is often needed to detect these bugs.

3.2 Classification of Bug Detection Tools

Different tools detect bugs using different methods, and a good benchmark suite should be able to demonstrate the strengths and weaknesses of each tool. Therefore, in this section we study the classification of bug detection tools, taking a few tools as examples and classifying them by two criteria in Table 1.

                               | Static                         | Dynamic                                  | Model Checking
Programming-rule-based tools   | PREfix [2], RacerX [4]         | Purify [12], Valgrind [25]               | VeriSoft [9], JPFinder [13]
Statistic-rule-based tools     | CP-Miner [18], D. Engler's [5] | DIDUCE [10], AccMon [32], Liblit's [19]  | CMC [21]
Annotation-based tools         | ESC/Java [8]                   |                                          |

Table 1: Classification of a few detection tools
As shown in Table 1, one way to classify tools is by the rules they use to detect bugs. Most detection tools hold some "rules" in mind: code violating the rules is reported as a bug. Programming-rule-based tools use rules that should be followed in programming, such as "an array pointer cannot move out of bounds". Statistic-rule-based approaches learn statistically correct rules (invariants) from successful runs in a training phase. Annotation-based tools use programmer-written annotations to check semantic bugs.

We can also divide tools into static, dynamic, and model checking. Static tools detect bugs by static analysis, without requiring code execution. Dynamic tools are used during execution, analyzing run-time information to detect bugs on the fly; they add run-time overhead but are more accurate. Model checking is a formal verification method. It has usually been grouped with static detection tools, but recently model checking has also been applied during program execution.

3.3 Benchmark Selection Criteria

Based on the studies in sections 2, 3.1, and 3.2, we summarize the following selection criteria for bug detection benchmarks. (1) Representative: The applications in our benchmark suite should represent real buggy applications. That means, first, the application should be real, implemented by experienced programmers rather than novices; it is also desirable if the application has significant use in practice. Second, the bug should also be real and naturally introduced, not purposely injected. (2) Diverse: In order to cover a wide range of real cases, the applications in the benchmark should be diverse in the state space of several important characteristics, including bug types; dynamic execution characteristics, such as heap and stack usage, the frequency of dynamic allocations, memory access properties, and pointer dereference frequency; and the complexity of the bugs and applications, including the bug's crash latency and the application's code size and data structure complexity. Some of these characteristics are discussed in detail in section 4.2. (3) Portable: The benchmark should be able to evaluate tools designed for different architecture platforms, so it is better to choose hardware-independent applications. (4) Accessible: Benchmark suites are most useful when everybody can easily access them and use them in evaluation. Obviously, proprietary applications cannot meet this requirement, so we consider only open source code for our benchmark. (5) Fair: The benchmark should not be biased toward any detection tool. Applying the above criteria, we can easily see that benchmarks like SPEC and Siemens are not suitable in our context: many SPEC applications are not buggy at all, and the Siemens benchmarks are not diverse enough in code size, bug types, and other characteristics.

In addition to the above five criteria for selecting applications into the bug benchmark suite, application inputs also need careful selection. A good input set should contain both correct inputs and bug-triggering inputs. Bug-triggering inputs expose the bug, while correct inputs can be used to calculate false positives and to measure overhead in both buggy and correct runs. Additionally, a set of correct inputs can be used to unify the training phase of invariant-based tools.

3.4 Evaluation Metrics

The effectiveness of a bug detection tool has many aspects, and a complete evaluation and comparison should be based on a set of metrics that reflect the most important factors. As shown in Table 2, our metric set is composed of four groups of metrics, each representing an important aspect of bug detection.

Functionality Metrics: Bug Detection False Positive; Bug Detection False Negative
Overhead Metrics: Time Overhead; Space Overhead; Static Analysis Time; Training Overhead; Dynamic Detection Overhead
Easy to Use Metrics: Reliance on Manual Effort; Reliance on New Hardware
Helpful to Users Metrics: Bug Report Ranking; Pinpoint Root Cause?

Table 2: Evaluation metric set

Most metrics can be measured quantitatively. Even a traditionally subjective metric, such as "pinpoint root cause", can be measured quantitatively by the distance from the bug's root cause to the bug detection position in terms of dynamic and/or static instruction counts (we call this the Detection Latency). Some metrics, such as manual effort and reliance on new hardware, will be measured qualitatively.

We should also note that the same metric may have different meanings for different types of tools. That is why we list three different types of overhead together with the time and space overhead metrics: we measure only static analysis time for static tools, both training and dynamic detection overhead for statistic-rule-based tools, and only dynamic detection overhead for most programming-rule-based tools. For some metrics, comparison among tools of the same category is more appropriate; when comparing tools of different categories, we should keep these differences in mind.

4 Benchmark

4.1 Benchmark Suite

Based on the criteria in section 3.3, we have collected 17 buggy C/C++ programs from open source repositories. These programs contain various bugs, including 13 memory-related bugs, 4 concurrent bugs, and 2 semantic bugs (some applications contain more bugs than we describe in Table 3). We have also prepared different test cases, both bug-triggering and non-triggering ones, for each application. We are still in the process of collecting more buggy applications.

Table 3 shows that all applications are real open-source applications with real bugs, and most of them have significant use in their domains. They have different code sizes and cover the most important bug types.

As we can see from the table, the benchmark suite for memory-related bugs is already semi-complete; we conduct a more detailed analysis of these applications in the following sections. The other bug types, however, are still incomplete. Enriching BugBench with more applications for other types of bugs, and with more analysis of large applications, remains future work.

4.2 Preliminary Characteristics Analysis

An important criterion for a good benchmark suite is its diversity in important characteristics, as described in section 3.3. In this section, we focus on a subset of our benchmarks (the memory-related bug applications) and analyze the characteristics that would affect dynamic memory bug detection tools.

Dynamic memory allocation and memory access behaviors are the characteristics with the most significant impact on the overheads of dynamic memory-related bug detection tools, because many such tools intercept memory allocation functions and monitor most memory accesses. In Table 4, we use frequency and size to represent dynamic allocation properties. As we can see, across the 8 applications the memory allocation frequency ranges from 0 to 769 per million instructions and the allocated size ranges from 0 to 6.0 MBytes. Such a large range of memory allocation behaviors will lead to different overheads in dynamic bug detection tools such as Valgrind and Purify.
Name Program Source Description Line of Code Bug Type
NCOM ncompress-4.2.4 Red Hat Linux file (de)compression 1.9K Stack Smash
POLY polymorph-0.4.0 GNU file system ”unixier” 0.7K Stack Smash &
(Win32 to Unix filename converter) Global Buffer Overflow
GZIP gzip-1.2.4 GNU file (de)compression 8.2K Global Buffer Overflow
MAN man-1.5h1 Red Hat Linux documentation tools 4.7K Global Buffer Overflow
GO 099.go SPEC95 game playing (Artificial Intelligent) 29.6K Global Buffer Overflow
COMP 129.compress SPEC95 file compression 2.0K Global Buffer Overflow
BC bc-1.06 GNU interactive algebraic language 17.0K Heap Buffer Overflow
SQUD squid-2.3 Squid web proxy cache server 93.5K Heap Buffer Overflow
CALB cachelib UIUC cache management library 6.6K Uninitialized Read
CVS cvs-1.11.4 GNU version control 114.5K Double Free
YPSV ypserv-2.2 Linux NIS NIS server 11.4K Memory Leak
PFTP proftpd-1.2.9 ProFTPD ftp server 68.9K Memory Leak
SQUD2 squid-2.4 Squid web proxy cache 104.6K Memory Leak
HTPD1 httpd-2.0.49 Apache HTTP server 224K Data Race
MSQL1 msql-4.1.1 MySQL database 1028K Data Race
MSQL2 msql-3.23.56 MySQL database 514K Atomicity
MSQL3 msql-4.1.1 MySQL database 1028K Atomicity
PSQL postgresql-7.4.2 PostgreSQL database 559K Semantic Bug
HTPD2 httpd2.0.49 Apache HTTP server 224K Semantic Bug
Table 3: Benchmark suite
Name | Malloc Freq. (# per MInst) | Allocated Memory Size | Heap vs. Stack Usage | Memory Access (# per Inst) | Memory Read vs. Write | Symptom | Crash Latency (# of Inst)
NCOM | 0.003 | 8B | 0.4% vs. 99.5% | 0.848 | 78.4% vs. 21.6% | No Crash | NA
POLY | 7.14 | 10272B | 23.9% vs. 76.0% | 0.479 | 72.6% vs. 27.4% | Varies on Input* | 9040K*
GZIP | 0 | 0B | 0.0% vs. 100% | 0.688 | 80.1% vs. 19.9% | Crash | 15K
MAN | 480 | 175064B | 85.1% vs. 14.8% | 0.519 | 70.9% vs. 20.1% | Crash | 29500K
GO | 0.006 | 364B | 1.6% vs. 98.3% | 0.622 | 82.7% vs. 17.3% | No Crash | NA
COMP | 0 | 0B | 0.0% vs. 100% | 0.653 | 79.1% vs. 20.9% | No Crash | NA
BC | 769 | 58951B | 76.6% vs. 23.2% | 0.554 | 71.4% vs. 28.6% | Crash | 189K
SQUD | 138 | 5981371B | 99.0% vs. 0.9% | 0.504 | 54.2% vs. 45.8% | Crash | 0
Table 4: Overview of the applications and their characteristics (*: the crash latency is based on the input that causes the crash.)
In general, the more frequent the memory allocations, the larger the overhead imposed by such tools. To reflect memory access behavior, we use the access frequency, the read/write ratio, and the heap/stack usage ratio. Intuitively, access frequency directly influences dynamic memory bug detection overhead: the more frequent the memory accesses, the larger the checking overhead. Some tools use different policies to check read and write accesses, and some tools differentiate stack and heap accesses, so all of these ratios are important for understanding the overhead. In Table 4, the access frequencies of our benchmark applications range from 0.479 to 0.848 accesses per instruction and the heap usage ratio from 0 to 99.0%; both show good coverage. Only the read/write ratio does not vary much across the 8 applications, which indicates a direction in which our benchmark suite still needs to be improved.
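As a rough illustration of why allocation frequency matters for such tools, the sketch below interposes on malloc/free via LD_PRELOAD and counts allocation events. This is our own simplification, not the mechanism actually used by Valgrind or Purify (which also monitor individual memory accesses), and it ignores glibc-specific corner cases such as dlsym itself allocating; still, it shows that every intercepted call adds bookkeeping work, so an allocation-heavy program like BC pays proportionally more than GZIP or COMP, which allocate nothing.

    /* count_alloc.c
     * build: gcc -shared -fPIC count_alloc.c -o count_alloc.so -ldl
     * run:   LD_PRELOAD=./count_alloc.so ./application
     * A simplified interposer that only counts allocation calls; real
     * detectors do far more work per allocation and per access. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned long n_malloc, n_free;
    static void *(*real_malloc)(size_t);
    static void (*real_free)(void *);

    void *malloc(size_t size) {
        if (!real_malloc)
            real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        n_malloc++;                     /* per-allocation bookkeeping */
        return real_malloc(size);
    }

    void free(void *ptr) {
        if (!real_free)
            real_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");
        n_free++;
        real_free(ptr);
    }

    __attribute__((destructor))
    static void report(void) {
        fprintf(stderr, "mallocs=%lu frees=%lu\n", n_malloc, n_free);
    }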
The bug complexity directly affects the false negatives of bug detection tools. In addition, the more difficult a bug is to detect, the more benefit a detection tool can provide to programmers. While complexity could be measured in many ways (which we will do in the future), here we use the symptom and the crash latency to measure this property. Crash latency is the latency from the root cause of a bug to the place where the application finally crashes due to the propagation of the bug. If the crash latency is very short, for example right at the root cause, programmers may be able to catch the bug immediately from the crash position even without any detection tool. On the other hand, if the bug does not manifest until after a long chain of error propagation, detecting the bug's root cause is much more challenging for both programmers and bug detection tools. As shown in Table 4, the bugs in our benchmarks manifest in different ways: crashes or silent errors. For the applications that crash, the crash latency varies from zero to 29 million instructions.
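A small hypothetical C example (not taken from the suite) contrasts the two extremes: in the first function the illegal access faults right at the root cause, while in the second the overflow silently corrupts a pointer that is only dereferenced much later, so the crash site reveals little about the origin.

    #include <string.h>

    struct record { char name[8]; char *note; };
    struct record r;

    /* Zero crash latency: the illegal access faults right at the root cause,
     * so the crash position already identifies the bug. */
    void crash_at_root_cause(void) {
        char *p = NULL;
        *p = 'x';
    }

    /* Long crash latency: the overflow silently corrupts the adjacent pointer
     * field; the program keeps running and only crashes much later, at a
     * point far from the root cause. */
    void root_cause(const char *input) {
        strcpy(r.name, input);         /* overflowing name corrupts r.note */
    }

    void manifestation_point(void) {
        r.note[0] = '!';               /* crash happens here, long after the bug */
    }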
5 Preliminary Evaluation

In order to validate the selection of our bug benchmark suite, in this section we use BugBench to evaluate three popular bug detection tools: Valgrind [25], Purify [12], and CCured [23]. All three are designed to detect memory-related bugs, so we choose the 8 memory-related buggy applications from our benchmarks. The evaluation results are shown in Table 5.

In terms of time overhead, CCured always has the lowest overhead among the three tools, because it performs its analysis statically beforehand. Purify and Valgrind have overheads of similar magnitude. Since Valgrind is based on instruction emulation and Purify on binary code instrumentation, we do not compare the overheads of these two tools against each other; instead, we show how an application's characteristics affect each tool's overhead. Since our bug benchmark suite exhibits a wide range of characteristics, the overheads imposed by these tools also vary widely, from more than 100 times down to less than 20%. For example, the application BC has the largest overhead under both Valgrind and Purify, as high as 119 times, because BC has a very high memory allocation frequency, as shown in Table 4. On the other hand, POLY has very small overhead due to its smallest memory access frequency as well as its small allocation frequency.

In terms of bug detection functionality, CCured successfully catches all the bugs in our applications and also points out the root cause in most cases. Both Valgrind and Purify fail to catch the bugs in NCOM and COMP; the former is a stack buffer overflow and the latter is a one-byte global buffer overflow. Valgrind also fails to catch another global buffer overflow in GO and has long detection latencies in the other three global buffer overflow applications: POLY, GZIP, and MAN.
Name | Catch Bug? | False Positive | Pinpoint The Root Cause (Detection Latency (KInst))^1 | Overhead | Easy to Use
NCOM | No / No / Yes | 0 / 0 / 0 | N/A / N/A / Yes | 6.44X / 13.5X / 18.5% | Easiest / Easy / Moderate
POLY | Vary^2 / Yes / Yes | 0 / 0 / 0 | No (9040K)^2 / Yes / Yes | 11.0X / 27.5% / 4.03% | Easiest / Easy / Moderate
GZIP | Yes / Yes / Yes | 0 / 0 / 0 | No (15K) / Yes / Yes | 20.5X / 46.1X / 3.71X | Easiest / Easy / Moderate
MAN | Yes / Yes / Yes | 0 / 0 / 0 | No (29500K) / Yes / Yes | 115.6X / 7.36X / 68.7% | Easiest / Easy / Hard
GO | No / Yes / Yes | 0 / 0 / 0 | N/A / Yes / Yes | 87.5X / 36.7X / 1.69X | Easiest / Easy / Moderate
COMP | No / No / Yes | 0 / 0 / 0 | N/A / N/A / Yes | 29.2X / 40.6X / 1.73X | Easiest / Easy / Moderate
BC | Yes / Yes / Yes | 0 / 0 / 0 | Yes / Yes / Yes | 119X / 76.0X / 1.35X | Easiest / Easy / Hardest
SQUD | Yes / Yes / N/A^3 | 0 / 0 / N/A^3 | Yes / Yes / N/A^3 | 24.21X / 18.26X / N/A^3 | Easiest / Easy / Hardest
Table 5: Evaluation of memory bug detection tools. Each cell lists Valgrind / Purify / CCured. (1: Detection latency is reported only when the tool fails to pinpoint the root cause; 2: Valgrind's detection result varies with the input; here we use an input for which Valgrind fails to pinpoint the root cause; 3: We failed to apply CCured to Squid.)
The results indicate that Valgrind and Purify handle heap objects much better than they do stack and global objects.

As for POLY, we tried different buggy inputs for Valgrind, with interesting results: if the buffer is not overflowed significantly, Valgrind misses the bug; with a moderate overflow, Valgrind catches the bug only after a long chain of error propagation, not at the root cause; only with a significant overflow can Valgrind detect the root cause. The different outcomes are due to POLY's particular bug type: first a global corruption and later a stack corruption.
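The contrast can be illustrated with a hedged sketch (our own code, not the actual NCOM or COMP defect): a heap off-by-one lands in the padding that checkers such as Purify and Valgrind place around malloc'd blocks and is reported at the faulting access, whereas the same off-by-one on a global array simply writes into whatever the linker happens to place next to it, memory the tools consider valid, so the error can go unreported.

    #include <stdlib.h>

    char table[16];
    char neighbor;     /* a global the linker may place right after table */

    void global_off_by_one(void) {
        for (int i = 0; i <= 16; i++)  /* <= walks one byte past the end */
            table[i] = 0;              /* lands in ordinary, "valid" memory */
    }

    void heap_off_by_one(void) {
        char *buf = malloc(16);
        buf[16] = 0;                   /* lands in the checker's red zone */
        free(buf);
    }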
using verisoft. In ISSTA, pages 124–133, 1998.
Although CCured performs much better than Valgrind and Pu- [10] S. Hangal and M. S. Lam. Tracking down software bugs using automatic
rify in both overhead and functionality evaluation, the tradeoff is anomaly detection. In ICSE ’02, May 2002.
its high reliance on manual effort in code preprocessing. As shown [11] M. J. Harrold and G. Rothermel. Siemens Programs, HR Variants. URL:
in the “Easy to Use” column of Table 5, among all these tools, Val- http://www.cc.gatech.edu/aristotle/Tools/subjects/.
[12] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and
grind is the easiest to use and requires no re-compilation. Purify access errors. In Usenix Winter Technical Conference, Jan. 1992.
is also fairly easy to use, but requires re-compilation. CCured is [13] K. Havelund and J. U. Skakkebæk. Applying model checking in java
the most difficult to use. It often requires fairly amount of source verification. In SPIN, 1999.
code modification. For example, in order to use CCured to check [14] K. Havelund, S. D. Stoller, and S. Ur. Benchmark and framework for
encouraging research on multi-threaded testing tools. In IPDPS, 2003.
BC, we have worked about 3 to 4 days to study the CCured policy [15] James R. Lyle, Mary T. Laamanen, and Neva M. Carlson. PEST:
and BC’s source code to make it satisfy the CCured’s language Programs to evaluate software testing tools and techniques. URL:
requirement. Moreover, we fail to apply CCured on a more com- www.nist.gov/itl/div897/sqg/pest/pest.html.
plicated server application: SQUD. [16] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler. Correlation exploita-
tion in error ranking. In SIGSOFT ’04/FSE-12, pages 83–93, 2004.
[17] C. Levine. TPC-C: an OLTP benchmark. URL:
http://www.tpc.org/information/sessions/sigmod/sigmod97.ppt, 1997.
6 Current Status & Future Work [18] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding
copy-paste and related bugs in operating system code. In OSDI, 2004.
Our BugBench is an ongoing project. We will release these appli- [19] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via
remote program sampling. In PLDI, pages 141–154, 2003.
cations together with documents and input sets through our web [20] E. Marcus and H. Stern. Blueprints for high availability. John Willey
page soon. We welcome feedbacks to refine our benchmark. and Sons, 2000.
In the future, we plan to extend our work in several dimensions. [21] M. Musuvathi, D. Park, A. Chou, D. Engler, and D. L. Dill. CMC: A
First, we will enrich the benchmark suite with more applications, pragmatic approach to model checking real code. In OSDI, Dec. 2002.
[22] S. Narayanasamy, G. Pokam, and B. Calder. BugNet: Continuously
more types of bugs based on our selection criteria and characteris- recording program execution for deterministic replay debugging. In
tic analysis (the characteristics in Table 4 show that some impor- ISCA’05, 2005.
tant benchmark design space is not covered yet). We are also in [23] G. C. Necula, S. McPeak, and W. Weimer. CCured: Type-safe
the plan of designing tools to automatically extract bugs from bug retrofitting of legacy code. In POPL, Jan. 2002.
[24] F. Qin, S. Lu, and Y. Zhou. SafeMem: Exploiting ECC-memory for
databases (e.g. Bugzilla) maintained by programmers, so that we detecting memory leaks and memory corruption during production runs.
can not only get many real bugs but also gain deeper insight into In HPCA ’05, 2005.
real large buggy applications. Second, we will evaluate more bug [25] J. Seward. Valgrind. URL: http://www.valgrind.org/.
detection tools, which will help us enhance our BugBench. Third, [26] S. E. Sim, S. Easterbrook, and R. C. Holt. Using benchmarking to ad-
vance research: a challenge to software engineering. In ICSE ’03, pages
we intend to add some supplemental tools, for example, program 74–83. IEEE Computer Society, 2003.
annotation for static tools, and scheduler and record-replay tools [27] Standard Performance Evaluation Corporation. SPEC benchmarks.
for concurrent bug detection tools. URL: http://www.spec.org/.
[28] W. F. Tichy. Should computer scientists experiment more? Computer,
31(5):32–40, 1998.
REFERENCES [29] Transaction Processing Council. TPC benchmarks. URL:
http://www.tpc.org/.
[1] B. Beizer. Software testing techniques (2nd ed.). Van Nostrand Reinhold [30] US-CERT. US-CERT vulnerability notes database. URL:
Co., 1990. http://www.kb.cert.org/vuls.
[2] W. R. Bush, J. D. Pincus, and D. J. Sielaff. A static analyzer for find- [31] C. von Praun and T. R. Gross. Object race detection. In OOPSLA, 2001.
ing dynamic programming errors. Softw. Pract. Exper., 30(7):775–802, [32] P. Zhou, W. Liu, F. Long, S. Lu, F. Qin, Y. Zhou, S. Midkiff, and J. Tor-
2000. rellas. AccMon: Automatically Detecting Memory-Related Bugs via
[3] K. M. Dixit. The spec benchmarks. Parallel Computing, 17(10-11), Program Counter-based Invariants. In MICRO ’04, Dec. 2004.
1991. [33] P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iWatcher: Efficient
[4] D. Engler and K. Ashcraft. RacerX: Effective, static detection of race
conditions and deadlocks. In SOSP, Oct. 2003. Architecture Support for Software Debugging. In ISCA ’04, June 2004.
