2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)

Hunting for Bugs in Code Coverage Tools via Randomized Differential Testing

Yibiao Yang∗ , Yuming Zhou∗ , Hao Sun† , Zhendong Su‡§ , Zhiqiang Zuo∗ , Lei Xu∗ , and Baowen Xu∗
∗ State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, China
† Unaffiliated
‡ Department of Computer Science, ETH Zurich, Switzerland
§ Computer Science Department, UC Davis, USA

Abstract—Reliable code coverage tools are critically important, as they are heavily used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging. However, little attention has been devoted to assessing the reliability of code coverage tools. In this study, we propose a randomized differential testing approach to hunting for bugs in the most widely used C code coverage tools. Specifically, by generating random input programs, our approach looks for inconsistencies in code coverage reports produced by different code coverage tools, and then identifies inconsistencies as potential code coverage bugs. To effectively report code coverage bugs, we addressed three specific challenges: (1) how to filter out duplicate test programs, as many of them trigger the same bugs in code coverage tools; (2) how to automatically reduce large test programs to much smaller ones that have the same properties; and (3) how to determine which code coverage tools have bugs. The extensive evaluations validate the effectiveness of our approach, resulting in 42 and 28 confirmed/fixed bugs for gcov and llvm-cov, respectively. This case study indicates that code coverage tools are not as reliable as might have been envisaged. It not only demonstrates the effectiveness of our approach, but also highlights the need to continue improving the reliability of code coverage tools. This work opens up a new direction in code coverage validation which calls for more attention in this area.

Index Terms—Code Coverage; Differential Testing; Coverage Tools; Bug Detection.

1 |   | #include <stdio.h>          #include <stdio.h>
2 |   | int main()                  int main()
3 | 1 | {                           {
4 | 1 |   int g = 0, v = 1;           int g = 0, v = 1;
5 | 0 |   g = v || !v;                // g = v || !v;
6 | 1 |   printf("%d\n", g);          printf("%d\n", g);
7 | 1 | }                           }
          (a)                         (b)
Fig. 1. (a) Bug #33465 of llvm-cov and (b) the "equivalent" program obtained by pruning the unexecuted code (line 5) of the program in (a)

I. INTRODUCTION

Code coverage [1] refers to which code in a program is executed, and how many times, when the program runs on particular test cases. The code coverage information produced by code coverage tools is widely used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging [1]–[15]. For example, researchers recently introduced an EMI ("Equivalence Modulo Inputs") based compiler testing technique [7]. The equivalent programs are obtained by stochastically pruning the unexecuted code of a given program according to the code coverage information given by the code coverage tools (e.g., llvm-cov, gcov). Therefore, the correctness of "equivalence" relies on the reliability of code coverage tools.

In spite of the prevalent adoption in practice and extensive testing of code coverage tools, a variety of defects still remain. Fig. 1(a) shows a buggy code coverage report produced by llvm-cov [16], a C code coverage tool of Clang [17]. Note that all the test cases have been reformatted for presentation in this study. The coverage report is an annotated version of the source code, where the first and second columns list the line number and the execution frequency, respectively. We can see that the code at line 5 is marked incorrectly as unexecuted by llvm-cov. Given the program p and its corresponding code coverage as shown in Fig. 1(a), EMI compiler testing [7] generates its "equivalent" program p' as shown in Fig. 1(b) by removing unexecuted code (statement 5). The programs p and p' will be compiled by a compiler under testing and then executed to obtain two different outputs, i.e., 1 and 0, resulting in a bug reported by the EMI approach. However, this is obviously not a real compiler bug. The incorrect code coverage report leads to the false positive in compiler testing. As code coverage tools offer fundamental information needed during the whole process of software development, it is essential to validate the correctness of code coverage. Unfortunately, to the best of our knowledge, little attention has been devoted to assessing the reliability of code coverage tools.

This work makes the first attempt in this direction. We devise a practical randomized differential testing approach to discovering bugs in code coverage tools. Our approach first leverages programs generated by a random generator to look for inconsistencies between the code coverage reports produced by different code coverage tools, and identifies inconsistencies as potential code coverage bugs. Second, because many inconsistency-triggering test programs are reported and a large portion of the code within these test programs is irrelevant, reporting these inconsistency-triggering tests directly is hardly beneficial to debugging. Before reporting them, these test programs must be reduced [18]. However, it is usually very costly and thus unrealistic to reduce each and every test program [18]. We observe that many test programs trigger the same coverage bugs. Thus, we can filter out many duplicate test programs. Note that 'duplicate test programs' in this study means multiple test programs triggering the same code coverage bug. Overall, to effectively report coverage bugs, we need to address the following key challenges:

Challenge 1: Filtering Out Test Programs. To filter out potential test programs triggering the same code coverage bugs, the most intuitive way is to calculate similarities between programs using the whole text. However, we use Csmith [19] as the random program generator, and two Csmith-generated programs are not meaningfully comparable as they diverge in many ways [20]. In addition, calculating similarities between programs using the whole text is expensive. To tackle this challenge, only the lines of code triggering inconsistencies are used for computing similarities between programs.

Challenge 2: Reducing Test Programs. Reducing test programs for code coverage bugs is much more complex than reducing test programs for compiler bugs, as the latter only requires testing the behavior of the compiled executables or the exit code of compilers [18]. However, reducing test programs for code coverage bugs involves processing textual code coverage reports and identifying inconsistencies. After each iteration of reduction, we need to specify the inconsistency of interest we would like to preserve. In this study, we design a set of inconsistency types over coverage reports as the interest.

Challenge 3: Inspecting Coverage Bugs. With the reduced test program, we need to inspect which code coverage tools have bugs before reporting bugs. In practice, this is usually done manually [21]. In other words, developers manually inspect the coverage reports to determine which coverage tools are buggy. To relieve the burden of manual intervention, we summarize a number of rules that code coverage reports must follow. With those rules, we develop a tool to examine part of the inconsistent coverage reports and determine which tools have bugs automatically.

We implemented a differential testing prototype called C2V ("Code Coverage Validation") for code coverage tools. In order to evaluate the effectiveness of our approach, we have applied C2V to gcov and llvm-cov, two widely used C code coverage tools integrated in the production compilers GCC [22] and Clang [17], respectively. Our evaluation confirms that C2V is very effective in finding code coverage bugs: 46 bugs were found (42 bugs confirmed/fixed) for gcov, while 37 bugs were found (28 bugs confirmed/fixed) for llvm-cov.

Contributions. We made the following contributions:
• We introduce an effective testing approach to validating code coverage tools, and have implemented it as a practical tool C2V for testing C coverage tools. C2V mainly consists of a random program generator, a comparer to identify inconsistencies between coverage reports, a filter to remove test programs triggering the same coverage bugs, a test program reducer, and an inspector to automatically determine which coverage tools have bugs for bug reporting.
• We adopted C2V to uncover 46 and 37 bugs for gcov and llvm-cov, respectively, both of which are widely used and extensively tested C code coverage tools. Specifically, for gcov, 42 bugs have already been confirmed/fixed; for llvm-cov, 28 bugs have been confirmed/fixed.
• Our evaluation indicates that code coverage tools are not as reliable as might have been envisaged. It opens up a new research direction to improve the reliability of code coverage tools, which calls for more attention in this area. Besides, there is a need to examine the influence of those bugs on other techniques which depend on code coverage.

Organization. The rest of this paper is structured as follows. Section II introduces the background on code coverage. Section III describes our approach for code coverage validation. Section IV reports the experimental results in detail. Section V surveys related work. Section VI concludes the paper and outlines the direction for future work.

II. CODE COVERAGE

In this section, we introduce the preliminary knowledge, the importance, and the bug categories of code coverage.

A. Preliminaries

Code coverage is a quality assurance metric used to describe the degree to which the code of a given program is executed when a particular test suite executes [1]. It is suggested that a program with a high test coverage will have a lower chance of containing undetected software bugs compared to a program with a low test coverage.

The code coverage analysis process is generally divided into three tasks: code instrumentation, data gathering, and coverage analysis. Specifically, code instrumentation inserts additional code instructions or probes to monitor whether a specific program chunk is executed or not at runtime. The instrumentation can be done at the source level in a separate pre-processing phase or at runtime by instrumenting byte code. Data gathering aims to collect the coverage data produced at test runtime. Finally, coverage analysis aims to analyze the collected results and to provide test strategy recommendations in order to reduce, to feed, or to modify the relevant test suite.

Currently, many code coverage tools are available [16], [23]–[31], which support different languages (e.g., C/C++ and Java), instrumentation levels (e.g., source code and byte code level), or coverage criteria. Coverage criteria are the rules or requirements that a test suite needs to satisfy [32]. In the literature, many coverage criteria have been proposed, including statement coverage, branch coverage, path coverage, condition coverage, decision coverage, and data-flow coverage [33]. These criteria can be used to guide the generation of a test suite or to evaluate the effectiveness of a test suite [33].

B. Importance of Code Coverage

Code coverage is widely used in, but not limited to, the following software techniques:
• Coverage-based regression testing. In the context of regression testing, test case prioritization and test suite augmentation are the two widely used techniques [2]–[4], [10], [15], [34]–[36]. The former aims to improve the ability of test cases to find bugs by scheduling test cases in a specific order [10], [15], [37]. One common practice is to achieve a high code coverage as fast as possible [38]. The latter is to generate new test cases to strengthen the ability of a test suite to find bugs [6], [14], [36], [39]. In practice, this often means generating new test cases to cover the source code affected by code changes.

• Coverage-based compiler testing. Recent years have seen an increasing interest in compiler testing, which aims to validate the correctness of compilers. One of the most attractive compiler testing techniques is based on the code coverage of a program's execution to generate equivalence modulo inputs by stochastically pruning its unexecuted code [7], [40], [41]. With the equivalence modulo inputs, we can differentially test compilers.
• Coverage-based fuzzing. Fuzzing is one of the most effective testing techniques to find vulnerabilities in software. In practice, coverage-based fuzzing has been widely used [8], [42]. Based on the code coverage information of test cases, it aims to determine which test cases should be retained for fuzzing. In general, a test case is retained if a new basic block is covered.
• Coverage-based debugging. Debugging is a common activity in software development which aims to find the root cause of a fault. Spectrum-Based Fault Localization (SBFL) is one of the most extensively studied debugging techniques, and it is heavily based on code coverage [5], [43]–[47]. Under a specific test suite, SBFL leverages the code coverage and the corresponding failed/passed information to infer which code is the root cause of a fault.

As can be seen, the above-mentioned software techniques heavily depend on the code coverage information produced by code coverage tools. Therefore, it is of great importance to guarantee the correctness of code coverage reports. Otherwise, these techniques might inevitably produce incorrect results or lead to suboptimal decisions.

C. Categories of Code Coverage Bugs

Code coverage bugs can be categorized into the following three general classes:
• Spurious Marking. Spurious marking denotes that a program chunk is unexecuted at runtime but is wrongly marked as executed by a code coverage tool. Figure 2 gives such an example. In the main function, the two return-statements at line 9 and line 12 cannot be executed at the same time. Thus, one of them must be wrongly marked by llvm-cov. At runtime, function main returns "1", indicating that line 9 was wrongly marked. A code coverage tool with spurious marking bugs will cause non-executed code to be wrongly marked as executed. Consequently, for a program under test, a part of its elements are in fact untested and hence may have latent bugs undetected.
• Missing Marking. Missing marking denotes that a program chunk is actually executed at runtime but is wrongly marked as unexecuted by a code coverage tool. The coverage bug shown in Figure 1 belongs to this class. A code coverage tool with missing marking bugs will cause a predefined coverage goal to never be achieved, regardless of how much testing time and resource are allocated.
• Wrong Frequency. Wrong frequency denotes that a program chunk is executed m times at runtime but is wrongly marked as executed n times by a code coverage tool {(m ≠ n) ∧ (m > 0) ∧ (n > 0)}. Figure 3 shows a gcov coverage report, in which the first column lists the execution frequency and the second column lists the line number. As can be seen, the code at line 11 is wrongly marked as executed twice, but it was actually executed only once. A code coverage tool with wrong frequency bugs might lead to suboptimal decisions in many coverage-based software engineering activities.

 1 |   | #include <stdlib.h>
 2 |   |
 3 |   | int main(void)
 4 | 1 | {
 5 | 1 |   static int *p;
 6 | 1 |
 7 | 1 |   p = malloc(sizeof(int));
 8 | 1 |   if (p == NULL)
 9 | 1 |     return 0;
10 | 1 |
11 | 1 |   *p = 7;
12 | 1 |   return 1;
13 | 1 | }
Fig. 2. Spurious marking (Bug #37092 of llvm-cov)

        1:    1:int func(int i)
        -:    2:{
        1:    3:  if (i > 0)
        1:    4:    return 1;
    #####:    5:  return 0;
        -:    6:}
        -:    7:
        1:    8:int main()
        -:    9:{
        1:   10:  int i = 1, j = 2;
        2:   11:  func((i == 0) || (i && j, 1));
        1:   12:  return 0;
        -:   13:}
Fig. 3. Wrong frequency (Bug #85163 of gcov)

III. METHODOLOGY

Figure 4 presents our framework for code coverage validation. In the following, we describe the key steps in this framework. In particular, we will use gcov and llvm-cov as the subject code coverage tools to illustrate our approach to hunting for code coverage bugs where necessary.

A. Generating Test Programs

Our approach starts with a random program generator. The Generator randomly generates a large program set P. Each program p ∈ P will be used as a test program for differentially testing the two code coverage tools under examination. In other words, each program will be fed to the two code coverage tools to obtain two raw coverage reports.

In our study, test programs refer to a collection of compilable C programs with the corresponding inputs. We choose Csmith to generate the test programs for the following reasons: (1) it supports many C language features and can avoid generating programs with undefined behaviors [48], thus outperforming some random C program generators [49], [50]; (2) each generated program is a single, self-contained file that does not require extra inputs; and (3) it is fast enough to generate programs with tens of thousands of lines within a few seconds.
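To make this step concrete, the following is a minimal driver sketch, not the authors' C2V code. It assumes csmith, gcc, clang, gcov, llvm-profdata, and llvm-cov are on the PATH, that Csmith accepts -s (seed) and -o (output) options (the exact flags vary across Csmith versions), and it reuses the two coverage pipelines listed in Section IV-A. In practice, compiling a generated program also needs Csmith's runtime headers on the include path, and each execution needs a timeout (C2V used 10 seconds).

/* Illustrative driver sketch for the generation step (not the authors' code). */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char cmd[1024];
    for (int i = 1; i <= 10; ++i) {   /* C2V used 1,000,000 Csmith programs */
        /* 1. Generate a random, self-contained C test program. */
        snprintf(cmd, sizeof(cmd), "csmith -s %d -o test_%d.c", i, i);
        if (system(cmd) != 0) { fprintf(stderr, "csmith failed\n"); continue; }

        /* 2. gcov pipeline: produces test_i.c.gcov (Section IV-A). */
        snprintf(cmd, sizeof(cmd),
                 "gcc -O0 --coverage test_%d.c && ./a.out; gcov test_%d.c", i, i);
        system(cmd);

        /* 3. llvm-cov pipeline: produces test_i.c.lcov (Section IV-A). */
        snprintf(cmd, sizeof(cmd),
                 "clang -fprofile-instr-generate -fcoverage-mapping -O0 test_%d.c "
                 "&& ./a.out; llvm-profdata merge default.profraw -o t.profdata; "
                 "llvm-cov show -instr-profile=t.profdata ./a.out test_%d.c "
                 "> test_%d.c.lcov", i, i, i);
        system(cmd);

        /* The two raw reports are then parsed and compared (Sections III-B, III-C). */
        printf("test_%d.c: coverage reports generated\n", i);
    }
    return 0;
}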

[Fig. 4. The framework for code coverage validation. Flowchart: the Generator produces a program set P; for i = 1 .. |P|, each p = P[i] is fed to coverage tool t1 and coverage tool t2; their reports r1 and r2 are transformed by the Parser into unified reports ur1 and ur2 and handed to the Comparer; if the reports are inconsistent and the program is not a duplicate (Filter), the tuple (p, ic, <ur1, ur2>) is added to IPS (the Inconsistency-triggering Program Set); the Reducer turns IPS into R-IPS (Reduced IPS), which the Inspector uses to produce bug reports.]

B. Parsing Code Coverage Reports

For each test program p in the generated program set P, we obtain the raw code coverage reports r1 and r2 emitted by the coverage tools t1 and t2, respectively. However, the raw coverage reports cannot be directly compared since they are presented in different formats. We thus develop a parser to transform ri into uri in a unified format (1 ≤ i ≤ 2). Specifically, uri is a sequence of two-tuples, each consisting of the monitored program chunk and the corresponding execution frequency. Given a program p with N lines of source code, uri is a sequence of N two-tuples (nj, fj) in ascending order according to the value of nj, 1 ≤ j ≤ N. Here, nj is the line number and fj is the corresponding execution frequency. If line nj is not an instrumentation site for a particular coverage tool, fj is assigned −1.

In our study, gcov and llvm-cov are selected as the subject code coverage tools, which are integrated with the mature production compilers GCC and Clang, respectively. In the following, we give an illustrating example to demonstrate how the parser works in the real world. Figure 5(a) shows the code coverage report r1 produced by gcov for a program, and Figure 5(b) shows the code coverage report r2 produced by llvm-cov for the same program. r1 and r2 are two annotated versions of the source file. However, there are three differences between r1 and r2. First, they have different instrumentation sites. On the one hand, lines 4, 6∼8, 12, 19, and 22 are treated as instrumentation sites by llvm-cov but not by gcov. On the other hand, lines 3 and 18 are treated as instrumentation sites by gcov but not by llvm-cov. Hence, only the common instrumentation sites used by gcov and llvm-cov need to be compared later. Second, gcov and llvm-cov have different annotation formats for execution frequency. A non-instrumentation site is noted as a hyphen "-" in r1 but as a null string in r2 (e.g., line 3). Besides, an unexecuted line is noted as hashes "#####" in r1 (e.g., line 15), while it is noted as "0" in r2 (e.g., line 9). Third, coverage statistics are included in r1 but are not available in r2. Figure 5(c) lists the unified coverage reports produced by our parser for gcov and llvm-cov, respectively. As can be seen, there are 9 common instrumentation sites between them: lines 5, 9, 10, 11, 13, 14, 15, 20, and 21.

C. Comparing Unified Coverage Reports

After obtaining the unified coverage reports ur1 and ur2, we use a tool Comparer to determine whether they are consistent. If not, the corresponding p and associated coverage reports will be added to a set named IPS (Inconsistency-triggering test Program Set). When comparing the unified coverage reports, Comparer only takes into account the execution frequencies of the common instrumentation sites among the n coverage tools. During the comparison of ur1 and ur2, the following types of cases might occur:
• Type A: one line of code is marked as executed by t1 but as unexecuted by t2;
• Type B: one line of code is marked as unexecuted by t1 but as executed by t2;
• Type C: one line of code is marked as executed k times by t1 and as executed l times by t2, while k ≠ l.
Consequently, the inconsistent unified reports can be divided into seven categories (C001: Type C; C010: Type B; C100: Type A; C011: Type B + C; C101: Type A + C; C110: Type A + B; C111: Type A + B + C). If the unified coverage reports corresponding to the test program p are found to have an inconsistency category ic, it will be handled by the filter.

Considering the two unified coverage reports shown in Fig. 5, Comparer will compare the execution frequencies of the 9 common instrumentation sites to determine whether ur1 and ur2 are consistent. As can be seen, there are four inconsistent execution frequencies in the unified coverage reports between gcov and llvm-cov: Type C at line 5, Type C at line 13, Type A at line 9, and Type B at line 15. Consequently, the inconsistency category of ur1 and ur2 is found to be C111. In other words, the inconsistency introduced by the test program p belongs to the C111 category.
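The classification above can be captured in a few lines of code. The sketch below is illustrative only and is not the authors' Comparer: it assumes the unified reports are already available as arrays of per-line frequencies, with −1 denoting a non-instrumentation site as defined in Section III-B, and derives the category name (C001 through C111) from the Type A/B/C flags. The hard-coded values follow the example of Fig. 5, so the program prints category C111.

/* Illustrative sketch of the Comparer step (not the authors' implementation).
 * ur1[i]/ur2[i] are the unified execution frequencies of line i+1;
 * -1 marks a non-instrumentation site (Section III-B). */
#include <stdio.h>

int main(void) {
    /* Unified reports for the 22-line program of Fig. 5 (gcov vs. llvm-cov). */
    int ur1[] = {-1,-1, 2,-1, 6,-1,-1,-1, 1, 2, 1,-1, 4, 2, 0,-1,-1, 1,-1, 1, 1,-1};
    int ur2[] = {-1,-1,-1, 2, 3, 3, 2, 2, 0, 2, 1, 2, 2, 2, 2, 2,-1,-1, 1, 1, 1, 1};
    int n = 22;

    int typeA = 0, typeB = 0, typeC = 0;
    for (int i = 0; i < n; ++i) {
        if (ur1[i] < 0 || ur2[i] < 0)   /* compare common instrumentation sites only */
            continue;
        if (ur1[i] > 0 && ur2[i] == 0)        typeA = 1;  /* executed by t1, not t2 */
        else if (ur1[i] == 0 && ur2[i] > 0)   typeB = 1;  /* executed by t2, not t1 */
        else if (ur1[i] != ur2[i])            typeC = 1;  /* both executed, k != l  */
    }

    if (typeA || typeB || typeC)
        printf("inconsistent, category C%d%d%d\n", typeA, typeB, typeC);
    else
        printf("consistent\n");
    return 0;
}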
[Fig. 5. An illustrating example: (a) raw coverage report by gcov 7.3.0 and unified report ur1; (b) raw coverage report by llvm-cov 6.0.0 and unified report ur2. For the 22-line test program, gcov reports counts for lines 3, 5, 9, 10, 11, 13, 14, 15, 18, 20, and 21 (line 5: 6, line 9: 1, line 13: 4, line 14: 2, line 15: "#####", i.e., unexecuted) and marks the remaining lines as non-instrumentation sites ("-"); llvm-cov reports counts for lines 4 through 16 and 19 through 22 (line 5: 3, line 9: 0, line 13: 2, line 15: 2) and leaves the other lines blank. The unified reports pair each line number with its execution frequency and use −1 for non-instrumentation sites.]
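To illustrate how raw report lines such as those summarized in Fig. 5(a) can be mapped to the (line number, frequency) tuples of Section III-B, here is a small, self-contained sketch, not the authors' parser. It handles only the gcov text format described above: a count, a hyphen for non-instrumentation sites, or "#####" for unexecuted lines, followed by the line number.

/* Illustrative parser sketch for gcov-style report lines (not the authors' code). */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* Parse one line of a .gcov report, e.g.
 *   "        2:    3:int f(int arg)"
 *   "    #####:   15:  return 0;"
 *   "        -:    4:{"
 * On success, store (line number, frequency), using -1 for non-instrumentation
 * sites and 0 for unexecuted lines, and return 1; return 0 otherwise. */
static int parse_gcov_line(const char *s, int *lineno, int *freq) {
    while (isspace((unsigned char)*s)) s++;

    if (*s == '-') {                  /* non-instrumentation site */
        *freq = -1;
        s++;
    } else if (*s == '#') {           /* "#####": instrumented but never executed */
        *freq = 0;
        while (*s == '#') s++;
    } else if (isdigit((unsigned char)*s)) {
        char *end;
        *freq = (int)strtol(s, &end, 10);
        s = end;
    } else {
        return 0;
    }

    while (isspace((unsigned char)*s)) s++;
    if (*s != ':') return 0;
    s++;
    *lineno = (int)strtol(s, NULL, 10);
    return 1;
}

int main(void) {
    const char *sample[] = {
        "        2:    3:int f(int arg)",
        "        -:    4:{",
        "    #####:   15:  return 0;",
    };
    for (size_t i = 0; i < sizeof(sample) / sizeof(sample[0]); ++i) {
        int ln, fq;
        if (parse_gcov_line(sample[i], &ln, &fq))
            printf("(%d, %d)\n", ln, fq);   /* prints (3, 2), (4, -1), (15, 0) */
    }
    return 0;
}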

D. Filtering out Test Programs

Intuitively, using a test program set P with a larger size gives a higher chance of discovering code coverage bugs. Therefore, in practice, we prefer to generate a very large number of test programs, which in turn results in a large number of inconsistency-triggering test programs. This leads to the following two problems. On the one hand, it may be unrealistic to inspect all the inconsistency-triggering test programs, as the reduction is very costly [18] and inspection resources are often limited. On the other hand, we observe that many test programs trigger the same code coverage bugs. Therefore, we can filter out duplicate test programs.

In Fig. 4, we use Filter to determine whether an inconsistency-triggering test program is a "duplicate", i.e., whether it triggers the same code coverage bug as other test programs. To filter out potential "duplicate" test programs, the most intuitive way is to calculate similarities between programs using the whole text. However, we use Csmith [19] as the random program generator, and two Csmith-generated programs are not meaningfully comparable as they diverge in many ways [20]. In addition, calculating similarities between programs using the whole text is expensive. To tackle this challenge, only the lines of code triggering inconsistencies are used for computing similarities between programs. More specifically, we first transform the inconsistency-triggering lines of code in each program into a token-based string that is insensitive to variable names, and then use a token-based similarity algorithm to calculate similarities between programs. We compute program similarity insensitive to variable names since the variable names in Csmith-generated programs have no specific meaning. For example, assume "int i = 0; i = 1;" are the inconsistency-triggering lines of code in program p. We first transform these two statements into "int identifier equal numeric_constant semi identifier equal numeric_constant semi" with the help of the Clang compiler. After that, we calculate its similarities with other programs. In this study, we use the Jaccard index, as it is the most widely used token-based similarity algorithm [51] (a minimal sketch of this computation is given after Section III-E below). If all the similarity coefficients between p and each of the programs in IPS are less than 0.8, a tuple (p, ic, <ur1, ur2>) will be added to the set IPS. Otherwise, p will be filtered out.

E. Reducing Test Programs

An inconsistency-triggering test program only indicates that at least one of the two code coverage tools is buggy; it does not tell which one contains bugs. Therefore, there is a need to inspect the inconsistency-triggering test program to obtain the correct coverage report (as the coverage oracle). After that, for each code coverage tool, we can easily determine whether it is buggy by comparing the corresponding coverage report with the coverage oracle. Since a test program may have a large code size, it is time-consuming and error-prone to determine the corresponding coverage oracle. Furthermore, if a large test program is submitted as a bug report, it is also difficult for developers to determine the location of bugs in the code coverage tool(s). To tackle this problem, we use Reducer to reduce each inconsistency-triggering test program in IPS by removing the code irrelevant to the inconsistency.

For each inconsistency-triggering test program p in IPS, Reducer takes the following steps to obtain the reduced test program. At step 1, the tool C-Reduce [18] is applied to p to obtain the reduced version p'. If p' has a smaller size than p, go to step 2; otherwise, stop the reduction. At step 2, feed p' to the n coverage tools, collect the coverage reports, and determine the inconsistency category. If the inconsistency category triggered by p' is the same as the inconsistency category triggered by p, assign p' to p. The above process is repeated until p cannot be further reduced. Consequently, we reduce IPS to obtain R-IPS, the set of reduced test programs.
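As a concrete illustration of the Filter's similarity check (Section III-D), the following self-contained sketch computes a Jaccard index over the token sets of two token strings of the kind shown above. It is only a sketch under simplifying assumptions: it treats each string as a set of whitespace-separated tokens, whereas the actual C2V implementation may tokenize and compare differently; the 0.8 threshold is the one quoted above, and the second token string is a made-up example.

/* Illustrative Jaccard-index sketch for the Filter step (not the authors' code).
 * Similarity = |T1 intersect T2| / |T1 union T2| over whitespace-separated tokens. */
#include <stdio.h>
#include <string.h>

#define MAX_TOKENS 128
#define MAX_TOKEN_LEN 64

static int tokenize(const char *s, char toks[][MAX_TOKEN_LEN]) {
    char buf[1024];
    int n = 0;
    strncpy(buf, s, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *t = strtok(buf, " "); t && n < MAX_TOKENS; t = strtok(NULL, " ")) {
        int dup = 0;                     /* keep each distinct token once (a set) */
        for (int i = 0; i < n; ++i)
            if (strcmp(toks[i], t) == 0) { dup = 1; break; }
        if (!dup) {
            strncpy(toks[n], t, MAX_TOKEN_LEN - 1);
            toks[n][MAX_TOKEN_LEN - 1] = '\0';
            n++;
        }
    }
    return n;
}

static double jaccard(const char *a, const char *b) {
    char ta[MAX_TOKENS][MAX_TOKEN_LEN], tb[MAX_TOKENS][MAX_TOKEN_LEN];
    int na = tokenize(a, ta), nb = tokenize(b, tb), common = 0;
    for (int i = 0; i < na; ++i)
        for (int j = 0; j < nb; ++j)
            if (strcmp(ta[i], tb[j]) == 0) { common++; break; }
    int uni = na + nb - common;
    return uni > 0 ? (double)common / (double)uni : 1.0;
}

int main(void) {
    /* Token strings of inconsistency-triggering lines (variable names abstracted). */
    const char *p = "int identifier equal numeric_constant semi "
                    "identifier equal numeric_constant semi";
    const char *q = "int identifier equal identifier plus numeric_constant semi";
    double sim = jaccard(p, q);
    printf("Jaccard similarity = %.2f -> %s\n", sim,
           sim >= 0.8 ? "filter out as duplicate" : "keep");
    return 0;
}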

F. Inspecting Reduced Test Programs

With the reduced test program, we need to inspect which code coverage tools have bugs before reporting. In practice, this is usually done manually [21]. In other words, developers manually inspect the coverage reports to determine which coverage tools are buggy. To relieve the burden of manual intervention, we summarize the following rules that code coverage reports must comply with:
• Identical Coverage: Assume statements s1 and s2 are in the same block: {s1; s2;}. If s1 is not a jump statement (i.e., a break, goto, return, exit, or abort statement) and s2 is neither a label statement nor a loop (for or while) statement, then s1 and s2 should have identical coverage.
• Unexecuted Coverage: Assume statements s1 and s2 are in the same block: {s1; s2;}. If s1 is a return, break, goto, or exit statement and s2 is not a labeled statement, then s2 should never be executed.
• Ordered Coverage: Assume statements s1 and s2 form: s1; if (...) {s2; ...}. If s2 is not a labeled statement, the execution count of s1 should be no less than that of s2.
With the above rules, we develop the tool Inspector to examine the inconsistent coverage reports and determine which tools have bugs automatically. There are still some inconsistent coverage reports in R-IPS that cannot be inspected automatically by our tool. We inspect those coverage reports manually. This process does not require too much human inspection effort, as the reduced test programs only have a few lines (usually fewer than 13 lines in our study).
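The three rules lend themselves to simple mechanical checks. The sketch below is illustrative only and is not the authors' Inspector: it models a reduced test program as a list of statements, each with a coarse classification and a per-line count taken from one coverage report, and it flags violations of only the Unexecuted Coverage and Ordered Coverage rules. The statement classification and the toy counts are assumptions made for the example.

/* Illustrative rule checks in the spirit of the Inspector (Section III-F);
 * not the authors' implementation. */
#include <stdio.h>

enum kind { PLAIN, JUMP, LABEL };

struct stmt {
    int line;        /* source line number                               */
    enum kind k;     /* coarse statement classification                  */
    int count;       /* execution frequency reported by a coverage tool  */
    int guarded_by;  /* line of the statement preceding the if, or 0     */
};

int main(void) {
    /* Toy report: { s1; return 0; s2; }  and  s3; if (...) { s4; } */
    struct stmt prog[] = {
        { 3, PLAIN, 1, 0 },   /* s1                               */
        { 4, JUMP,  1, 0 },   /* return 0;                        */
        { 5, PLAIN, 1, 0 },   /* s2, right after a return         */
        { 7, PLAIN, 1, 0 },   /* s3                               */
        { 8, PLAIN, 2, 7 },   /* s4, inside the if guarded by s3  */
    };
    int n = (int)(sizeof(prog) / sizeof(prog[0]));

    /* Unexecuted Coverage: a non-label statement right after a jump in the
     * same block must never be executed. */
    for (int i = 1; i < n; ++i)
        if (prog[i - 1].k == JUMP && prog[i].k != LABEL && prog[i].count != 0)
            printf("line %d violates Unexecuted Coverage (count %d, expected 0)\n",
                   prog[i].line, prog[i].count);

    /* Ordered Coverage: a statement inside an if body must not execute more
     * often than the statement preceding the if. */
    for (int i = 0; i < n; ++i)
        if (prog[i].guarded_by != 0)
            for (int j = 0; j < n; ++j)
                if (prog[j].line == prog[i].guarded_by && prog[i].count > prog[j].count)
                    printf("line %d violates Ordered Coverage (%d > %d at line %d)\n",
                           prog[i].line, prog[i].count, prog[j].count, prog[j].line);
    return 0;
}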
G. Reporting Test Programs to Developers

For each test program in R-IPS, this step simply generates bug reports for the corresponding buggy tool(s). A bug report mainly consists of the reduced test program and the affected versions. If a test program triggers multiple bugs, multiple separate bug reports will be generated.

IV. EVALUATION

In this section, we first present the subject coverage tools and the testing environment. Then, we describe our experimental results in detail.

A. Subject Code Coverage Tools

In this study, we select gcov and llvm-cov as our subject code coverage tools. We choose these two code coverage tools because: (1) they have been widely used in the software engineering community; and (2) they have been integrated with the most widely-used production compilers, i.e., GCC and Clang. More specifically, we chose gcov in the latest development trunk of GCC and llvm-cov in the latest LLVM development trunk.

For gcov, the command we used to compile a given source file, e.g., t.c, and produce the corresponding coverage report is as follows:
# gcc -O0 --coverage t.c; ./a.out; gcov t.c
The gcov-generated coverage report is stored in a file named t.c.gcov. For llvm-cov, we use the following command:
# clang -fprofile-instr-generate -fcoverage-mapping -O0 t.c; ./a.out; llvm-profdata merge default.profraw -o t.profdata; llvm-cov show -instr-profile=t.profdata ./a.out t.c > t.c.lcov
The llvm-cov-generated coverage report is stored in a file named t.c.lcov. These two coverage reports are then parsed into the unified format and compared to detect inconsistencies. It is worth noting that we use the -O0 option to turn off compiler optimizations, which makes it sensible to compare the coverage reports produced in this way.

B. Testing Environment Setup

Our evaluation was conducted on a Linux server with an Intel(R) Xeon(R) CPU (60 cores) and 32GB RAM. The server runs Ubuntu 17.10 (x86_64). We spent four non-continuous months, of which over one month was devoted to developing the various tools. The rest of the time was spent testing the two code coverage tools, filtering out test programs, reducing test programs, and inspecting test programs. Initially, we only used Csmith-generated programs as the test programs. Later, we also made use of programs selected from GCC's and Clang's test-suites, since they may cover many C semantics that Csmith does not cover. This is further confirmed in Section IV-C, as a number of bugs are detected by programs from these test-suites.

C. Experimental Results

Inconsistent Coverage Reports. Table I shows the statistics of inconsistency-triggering test programs over Csmith-generated programs, GCC's test-suite, and Clang's test-suite. Column 2 shows the total number of test programs and Column 3 shows the number of test programs which run out of time (10 seconds in our experiment). We used 1 million Csmith-generated programs and collected 2,756 and 106 compilable C programs respectively from GCC's and Clang's test-suites. Note that there are more than tens of thousands of test programs in GCC's and Clang's test-suites; only those C files that can be compiled independently are considered here. Among them, 182,927 programs executed for more than 10 seconds and hence were excluded from further analysis. The remaining test programs were fed to C2V. Column 4 is in the form of 'a / b', where 'a' refers to the total number of test programs which lead to inconsistencies and 'b' refers to the percentage of 'a' over the number of all the test programs C2V analyzed (i.e., Column 2 − Column 3). We found 261,347 programs leading to inconsistent coverage reports (261,065, 262, and 20 respectively from Csmith-generated programs, GCC's test-suite, and Clang's test-suite). About 31.95% of the Csmith-generated programs caused inconsistent coverage reports, much higher than the proportions from GCC's test-suite and Clang's test-suite. Columns 5 to 11 display the distributions of inconsistency-triggering test programs over the 7 different categories. In the third row of these columns, the number in parentheses indicates the number of test programs remaining after filtering out potential test programs that trigger the same code coverage bugs. Most of the inconsistent reports fell into the C010 category, indicating that the majority of inconsistencies belong

TABLE I
S TATISTICS OF I NCONSISTENCY- TRIGGERING T EST P ROGRAMS .

Inconsistency-triggering Test Programs(After filtering)


#Test Programs #Time out
#Num(After filtering) / %P C001 C010 C100 C011 C101 C110 C111
Csmith-generated 1,000,000 182,927 261,065 (758) / 31.95% 3,331 (119) 238,554 (251) 1,625 (36) 4,547 (115) 87 (33) 12,470 (151) 451 (53)
GCC’s test-suite 2,756 15 262 / 9.56% 81 124 15 30 8 1 3
Clang’s test-suite 106 15 20 / 21.98% 9 5 4 1 0 1 0

TABLE II in latest version. It is worth noting that “won’t fix” indicates a


S TATISTICS OF LINES OF CODE FOR THE ORIGINAL AND REDUCED confirmed bug but will not be fixed by developers. There are
VERSION OF C SMITH - GENERATED TEST PROGRAMS
total 6 bugs are pending for developers’ responses which are
min mean median max not listed in Table V. Consistent with C. Sun et al’s [21] and
Original 41 210 167 874 V. Le et al’s [52] studies, due to the bug management process
Reduced 2 10 8 27
of LLVM is not organized as that of GCC, if a llvm-cov bug
Relative – 4.76% 4.79% –
report has been CCed by Clang developers and there is no
TABLE III objection in the comments, we label the bug as confirmed. In
I NFORMATION OF A LL R EPORTED B UGS
addition, as stated by developers, if someone does not close
gcov llvm-cov Total the reported bug as “invalid”, then the bug is real in LLVM
Confirmed 42 28 70 Bugzilla. Another 4 reported bugs were marked as duplicate
Pending 1 5 6 by developers, since similar test programs trigger the same
Duplicate 0 4 4
Rejected 3 0 3 coverage bug inside gcov or llvm-cov. The remaining 3 reports
Reported 46 37 83 were rejected by GCC developers. Two of them is GCC’s
TABLE IV
default optimization strategy and the other is with invalid code.
B UG T YPES OF C ONFIRMED B UGS Table V lists all the confirmed bugs in detail, including the
identity, priority, current report status, bug types, the origins of
gcov llvm-cov Total the bug-triggering test programs (i.e. Csmith-generated, GCC’s
Spurious Marking 18 16 34
Missing Marking 13 5 18 or Clang’s test-suite), and affected versions. Note that ‘New’
Wrong Frequency 11 7 18 indicates confirmed in GCC’s Bugzilla, and ‘Assigned’ refers
Total 42 28 70 to that the confirmed bugs are under the process of fixing.
Bug type. We categorize coverage bugs into three classes as
to the cases where some code is marked as unexecuted by gcov mentioned in Section II-C: Spurious Marking, Missing Mark-
but as executed by llmv-cov. C101 category has the minimal ing, and Wrong Frequency. Table IV shows the breakdown
number of inconsistent reports, indicating that inconsistencies of the bug types of all the confirmed bugs. Most of the bugs
of T ype A and T ype C rarely occur simultaneously. In are spurious marking bugs, i.e. unexecuted code is wrongly
addition, we can found that our method is very efficient for marked as executed.
filtering potential “duplicate” test programs. Bug importance. As shown in Column 4 of Table V, all
For those Csmith-generated inconsistency-triggering pro- our confirmed bugs have the default priority P3 except 14 of
grams, our filtering strategy led to 758 test programs for re- them are reset to P5 by developers. Besides, 13 of our reported
duction and inspection. Table II lists their code size before and coverage bugs have been fixed by developers. Note that 3 bugs
after reduction. We can see that the mean lines of code drops are confirmed as ‘Works’ which means that they are fixed in
from 210 to 10, thus helping effectively report code coverage developers’ version. Thus, we consider these three bugs as
bugs. For those inconsistency-triggering test programs from fixed by developers. This together shows that our reported
GCC’s and Clang’s test-suites, we intended to inspect all of bugs are important and worth the effort.
them. The reasons are two-fold. First, programs in these test- Program source. As shown in Column 7 of Table V,
suites are usually unique in program structures (unlike the test programs from all the three main sources (i.e. Csmith-
randomly generated programs produced by Csmith). Second, generated, GCC’s test-suite, and Clang’s test-suite) can trigger
the total number is not large (i.e. 262 from GCC’s test-suite coverage bugs. Two programs from Clang’ test-suite trigger
and 20 from Clang’ test-suite). In summary, we analyzed about coverage bugs of gcov, and a number of programs from
1000 inconsistency-triggering test programs. GCC’s test-suite can also help find coverage bugs for llvm-
Bug Count. We filed a total of 83 bug reports for gcov and cov as well. It is worth noting that test programs from
llvm-cov during our testing period. They can be found un- different sources may induce the same coverage bugs. It indeed
der “[email protected]” in GCC’s and LLVM’s happened in our experiment, and we only reported once.
Bugzilla databases. Table III shows the details of all the bugs Affected versions. We only tested gcov and llvm-cov inside
we have reported so far. As shown in Column 4, till April the latest development trunks of GCC and Clang respectively.
16, 2018, we have reported 83 bugs, of which 70 bugs are When we find a test case that reveals a bug in gcov or llvm-cov,
confirmed by developers. Of the 70 confirmed bugs, 11 are we also check the corresponding compiler’s stable releases
resolved as “fixed”, 17 as “won’t fix”, and 2 as “works for me” against the same test case. Column 8 of Table V shows the

TABLE V gcov llvm-cov #N Source Code
C ONFIRMED B UGS
10 : 10 | 1 void func ( i n t i ) {
#ID Prio. Status Type Source Affected Ver.
10 : 10 | 2 switch ( i ) {
1 : 10 | 3 c a s e 1 : break ;
1 gcov 83434 P4 New Missing Csmith v7,trunk
9 : 10 | 4 case 2: ;
2 gcov 83465 P3 Won’t fix Missing Csmith v4∼7,trunk
9 : 9 | 5 d e f a u l t : break ;
3 gcov 83486 P3 Won’t fix Missing Csmith v4∼7,trunk
− : 10 | 6 }
4 gcov 83505 P5 New Spurious Csmith v4∼7,trunk
10 : 10 | 7 }
5 gcov 83587 P5 New Missing Csmith v7,trunk
− : | 8
6 gcov 83616 P4 Assigned Missing Csmith v7,trunk
1 : 1 | 9 i n t main ( ) {
7 gcov 83617 P3 Works Wrong Freq. Csmith v4∼7
11 : 11 | 10 f o r ( i n t i = 0 ; i <10;++ i )
8 gcov 83678 P3 Won’t fix Spurious Csmith v4∼7,trunk
10 : 10 | 11 func ( i ) ;
9 gcov 83813 P3 Fixed Missing Cmsith trunk
1 : 1 | 12 return 0;
10 gcov 85163 P5 New Wrong Freq. Csmith trunk
− : 1 | 13 }
11 gcov 85178 P3 Won’t fix Missing Csmith v4∼7,trunk
12 gcov 85179 P4 Assigned Missing Csmith trunk Fig. 6. One program triggering multiple bugs
13 gcov 85188 P3 Assigned Spurious Csmith v4∼7,trunk
14 gcov 85197 P3 Won’t fix Wrong Freq. Csmith v4∼7,trunk
15 gcov 85199 P4 Assigned Missing Csmith v4∼7,trunk affected various versions of GCC and Clang for each bug.
16 gcov 85201 P5 Assigned Wrong Freq. Csmith trunk Note that as for GCC, we select GCC-4.8.0, GCC-5.4.1, GCC-
17 gcov 85202 P3 Won’t fix Spurious Csmith v4∼7,trunk 6.4.0, GCC-7.2.0, and the latest trunk version (GCC-8.0), and
18 gcov 85217 P3 Fixed Spurious Csmith v4∼7,trunk
19 gcov 85218 P3 Won’t fix Spurious Csmith v4∼7,trunk for Clang, we select Clang-3.8, Clang-4.0, Clang-5.0, and the
20 gcov 85219 P3 Won’t fix Spurious Csmith v4∼7,trunk latest trunk version (Clang-7.0). As can be seen, 42 out of all
21 gcov 85225 P5 Assigned Wrong Freq. Csmith v4∼7,trunk the bugs can affect stable releases, and a considerable number
22 gcov 85243 P3 Won’t fix Spurious Csmith v4∼7,trunk
23 gcov 85245 P3 Won’t fix Spurious Csmith v4∼7,trunk of bugs have been latent in the old stable releases (such as
24 gcov 85272 P3 Won’t fix Spurious Csmith v4∼7,trunk GCC-4.8.0 and Clang-3.8) for many years.
25 gcov 85273 P3 Won’t fix Spurious Csmith v4∼7,trunk
26 gcov 85274 P3 Won’t fix Spurious Csmith v4∼7,trunk D. Interesting Bugs Found
27 gcov 85276 P5 New Wrong Freq. Csmith trunk
28 gcov 85294 P3 Won’t fix Spurious Csmith v4∼7,trunk One program triggering multiple bugs. In Figure 6, Col-
29 gcov 85297 P3 Won’t fix Spurious Csmith v4∼7,trunk umn 4 lists a code snippet and Column 3 shows the line
30 gcov 85299 P3 Won’t fix Spurious Csmith v4∼7,trunk numbers. Column 1 and Column 2 display the coverage results
31 gcov 85332 P3 Fixed Wrong Freq. GCC v7,trunk
32 gcov 85333 P3 Won’t fix Missing GCC v4∼7,trunk by gcov and llvm-cov respectively. This code snippet is a
33 gcov 85336 P4 New Wrong Freq. GCC v4∼7,trunk simple switch-statement inside a for-loop from the caller. The
34 gcov 85337 P5 New Wrong Freq. Clang v7,trunk code at line 3 and line 4 actually executes only one time,
35 gcov 85338 P3 Fixed Wrong Freq. Clang v4∼7,trunk
36 gcov 85349 P5 New Spurious GCC v6∼7,trunk however, they are wrongly marked as executed 10 times by
37 gcov 85350 P3 Fixed Spurious GCC v4∼7,trunk llvm-cov and line 4 is wrongly marked as executed 9 times
38 gcov 85351 P5 Assigned Missing GCC v4∼7,trunk by gcov. We have reported this case as Bug #85337 of gcov
39 gcov 85367 P3 Works Wrong Freq. GCC v5∼7,trunk
40 gcov 85370 P3 Fixed Spurious GCC trunk and Bug #37124 of llvm-cov. This case is quite simple and
41 gcov 85372 P3 Fixed Missing GCC trunk common in real world. Unfortunately, this single case triggers
42 gcov 85377 P4 New Missing GCC v4∼7,trunk coverage bugs for both gcov and llvm-cov, indicating that
43 llvm-cov 33465 P3 Fixed Missing Csmith v3∼5,trunk
44 llvm-cov 35404 P3 Confirmed Spurious Csmith trunk coverage bugs are subtle and prevalent.
45 llvm-cov 35426 P3 Fixed Spurious Csmith trunk Coverage bugs or compiler bugs? Figure 7 shows the
46 llvm-cov 35495 P3 Fixed Missing Csmith trunk coverage report for Bug #37082 of Clang. A definition of
47 llvm-cov 35556 P3 Fixed Missing Csmith trunk
48 llvm-cov 37001 P3 Confirmed Spurious Csmith trunk strcmp function is first given at line 2, while strcmp is
49 llvm-cov 37003 P3 Confirmed Spurious Csmith trunk redefined as the libc function. As a result, the calling of
50 llvm-cov 37012 P3 Confirmed Spurious Csmith trunk strcmp at line 6 will invoke the libc function with internal
51 llvm-cov 37017 P3 Confirmed Spurious Csmith trunk
52 llvm-cov 37043 P3 Confirmed Spurious Csmith trunk linkage, instead of the self-defined function (line 2). However,
53 llvm-cov 37046 P3 Confirmed Spurious Csmith trunk in our test, GCC outputs “−1” whereas Clang outputs “20” for
54 llvm-cov 37070 P3 Confirmed Missing Clang v3∼5,trunk this case. That means, GCC invokes the correct libc function
55 llvm-cov 37071 P3 Confirmed Spurious Clang trunk
56 llvm-cov 37072 P3 Confirmed Wrong Freq. GCC trunk but Clang still invokes the self-defined function at line 2. That
57 llvm-cov 37081 P3 Confirmed Missing Clang v3∼5,trunk is why line 2 is reported as executed one time by llvm-cov but
58 llvm-cov 37083 P3 Confirmed Wrong Freq. GCC trunk as not executed by gcov. As for this case, the root cause of
59 llvm-cov 37084 P3 Confirmed Wrong Freq. GCC trunk
60 llvm-cov 37085 P3 Confirmed Spurious GCC trunk producing the incorrect coverage report is not the bugs inside
61 llvm-cov 37090 P3 Confirmed Spurious GCC trunk llvm-cov, but the bugs inside Clang instead.
62 llvm-cov 37092 P3 Confirmed Spurious GCC trunk Code formatting problem. Figure 8 shows the coverage
63 llvm-cov 37099 P3 Confirmed Spurious GCC trunk
64 llvm-cov 37102 P3 Confirmed Spurious GCC trunk report for Bug #37102 of Clang. Line 8 is marked as executed
65 llvm-cov 37103 P3 Confirmed Wrong Freq. GCC v3∼5,trunk twice by gcov, but wrongly marked as executed only once in
66 llvm-cov 37105 P3 Confirmed Spurious GCC trunk llvm-cov. Along the execution, we can see that each of the four
67 llvm-cov 37107 P3 Confirmed Wrong Freq. GCC v3∼5,trunk
68 llvm-cov 37124 P3 Confirmed Wrong Freq. GCC trunk statements at line 8 (i.e. foo(); goto L2; L1: bar())
69 llvm-cov 37125 P3 Confirmed Wrong Freq. Clang v3∼5,trunk executes actually only one time before the main function
70 llvm-cov 37126 P3 Confirmed Spurious GCC v3∼5,trunk finishes. However, from the perspective of line coverage, the

1 | | # i n c l u d e < s t d i o . h> gcov llvm-cov #N Source Code
2 | 1 | s t a t i c i n t strcmp (){ return 2;} − : | 1 # i n c l u d e < s t d i o . h>
3 | 1 | # define strcmp builtin strcmp − : | 2 # i n c l u d e <s e t j m p . h>
4 | | i n t main ( ) − : | 3
5 | 1 | { 1 : 1 | 4 i n t foo ( jmp buf b ) { longjmp ( b , 1 ) ; }
6 | 1 | int r e t = strcmp ( ”a” , ”b” ) ; − : | 5
7 | 1 | p r i n t f ( ”%d\n ” , r e t ) ; 1 : | 6 i n t main ( )
8 | 1 | return 0; − : 1 | 7 {
9 | 1 | } − : 1 | 8 int ret ;
Fig. 7. Coverage or compiler bug (Bug #37082 of llvm-cov) − : 1 | 9 jmp buf buf ;
− : 1 | 10
1 | | int a = 0; 1 : 1 | 11 i f ( setjmp ( buf ) ) {
2 | | # #### : 0 | 12 foo ( buf ) ;
3 | 1 | v o i d f o o ( ) { a ++; } − : 0 | 13 }
4 | 1 | v o i d b a r ( ) { a ++; } − : 1 | 14
5 | | 1 : 1 | 15 i f ( setjmp ( buf )!= 0) {
6 | | i n t main ( ) 1 : 1 | 16 r e t = 0;
7 | 1 | { − : 1 | 17 } else {
8 | 1 | f o o ( ) ; g o t o L2 ; L1 : b a r ( ) ; 1 : 0 | 18 r e t = 1;
9 | 1 | 1 : 0 | 19 foo ( buf ) ;
10 | 2 | L2 : − : 0 | 20 }
11 | 2 | i f ( a == 1 ) − : 1 | 21
12 | 1 | g o t o L1 ; 1 : 1 | 22 p r i n t f ( ”%d ” , r e t ) ;
13 | 1 | } − : 1 | 23 }
Fig. 8. Code formatting problem (Bug #37102 of llvm-cov) Fig. 9. Non-trivial inspection (Bug #37081 of Clang)
code at line 8 executes twice. For the first time, the first two programs from GCC’s and Clang’s test-suites, and fed them to
statements, i.e. foo() and goto L2, get executed and then C2V, which can mitigate this problem to some extent. Second,
the control flow jumps to line 10. After executing the goto- we only take account inconsistency-triggering lines of code for
statement at line 12, the control flow jumps back to line 8. computing program similarities to filter out test programs that
Then the last two statements, i.e. L1: bar();, get executed. potentially trigger the same code coverage bugs. In this study,
Note that the coverage result will be correct if we put the four a large number of inconsistency-triggering test programs are
statements at line 8 into four separate lines. filtered out which may miss a number of quality test pro-
Non-trivial inspection. Figure 9 follows the same notions grams. If we can inspect all Csmith-generated inconsistency-
with Figure 6. Lines 18 and 19 are marked as executed by gcov triggering test programs, it is reasonable to expect that more
but as unexecuted by llvm-cov. Human inspection is conducted code coverage bugs would be found. Third, it is possible
to determine whether gcov or llvm-cov produces the incorrect that both code coverage tools may have the same bugs. In
coverage report. But this process is non-trivial. Intuitively, the other words, these two coverage tools might produce same
two branches of the if -statement at line 15 should not be but incorrect code coverage reports for a given program. Our
executed simultaneously, implying that the coverage report by approach can not identify any inconsistencies from such paired
gcov is incorrect. Besides, this code outputs “0” instead of coverage reports and further miss this kind of bugs. Therefore,
“1” at runtime, further supporting the implication. However, in the future, more research efforts should be paid in this area
it is actually llvm-cov that produces the incorrect report (Bug to improve the quality of code coverage tools. Forth, different
#37081 of Clang). Function setjmp and function longjmp code coverage tools having different implementations may
are provided to perform complex flow-of-control in C. Due to make the coverage reports difficult to be compared. To mitigate
their existence, the execution flows for this code are: (1) the this problem, we have taken the following steps: (1) we
if -statement at line 11 takes the false branch, (2) then the reformatted the generated test programs before feeding them
if -statement at line 15 also takes the false branch, assigning to the coverage tools, which led to formatted and comparable
variable ret as 1 and calling function foo at line 4, (3) coverage reports; (2) before comparing coverage reports, we
function longjmp restores the program state when setjmp identified and excluded specific behavioral differences of the
at line 15 are called and returns 1, hence taking the true branch coverage tools; and (3) before reporting bugs, we inspected
at line 15. As a result, variable ret is assigned as 0. (4) the inconsistent coverage reports to determine which tools are
main function returns after printing the value of variable ret. buggy. During inspection, false alarms are manually identified.
Therefore, we have taken careful steps to reduce false positives
E. Limitations resulting from the variability among different tools. Besides,
In this study, we assess the reliability of code coverage it is also interesting to develop more accurate techniques for
tools via differential testing. This is a first effort towards this coverage reports comparison in the future.
direction. However, our technique has a number of limitations.
First, most of the test programs we used were generated by V. R ELATED W ORK
Csmith. The Csmith-generated programs only cover a subset This section introduces the related work on randomized
of C semantics, which might cause C2V to miss a number of differential testing, coverage-directed differential testing, and
code coverage defects. As a complement, we collected 2862 C testing via equivalence modulo inputs.

A. Randomized Differential Testing C compilers, GCC and LLVM. Based on this idea, Athena
Differential testing is originally introduced by McKee- [59] and Hermes [60] are developed subsequently. Athena [59]
man [53] which attempt to detect bugs by checking in- generates EMI by randomly inserting code into and removing
consistent behaviors across different comparable software or statements from dead code regions. Hermes [60] complements
different software versions. Randomized differential testing mutation strategies by operating on live code regions, which
is a widely-used black-box differential testing technique in overcomes the limitations of mutating dead code regions.
which the inputs are randomly generated [19], [54]. Yang et Le et al. [52] first used Csmith to generate single-file test
al. [19] developed Csmith, a randomized test case generation programs and transformed each single-file test program into
tool that can support a large subset of C features and avoid multiple compilation units. Then, they stochastically assigned
introducing undefined and unspecified behaviors, to find C each unit an optimization level to thoroughly exercise link-
compiler bugs. Lidbury et al. [40] developed CLsmith, a tool time-optimizers. They discovered and reported 37 LTO bugs
built on top of Csmith, to validate OpenCL compilers based for GCC and LLVM in 11 months. These techniques heavily
on differential testing and testing via equivalence modulo depend on the code coverage information.
inputs (EMI). They presented several strategies for random
generation of OpenCL kernels and an injection mechanism VI. C ONCLUSION AND F UTURE W ORK
which allowed EMI testing to be applied to kernel in order to
avoid little or no dynamically-dead code. Their study revealed We proposed a randomized differential testing approach to
a significant number of OpenCL compiler bugs in commercial hunting code coverage bugs and implemented a tool named
implementations. Sun et al. [21] applied randomized differen- C2V to test two C code coverage tools, gcov and llvm-cov.
tial testing to find and analyze compiler warning defects across Our evaluations where 42 and 28 bugs confirmed from gcov
GCC and LLVM. In less than six months, they successfully and llvm-cov respectively in a short few months provided a
found 52 confirmed/fixed bugs. Different from prior studies, strong evidence that code coverage tools are not as reliable as
we apply randomized differential testing to find code coverage they might have been envisaged. Overall, our approach has the
bugs which we believe is an important topic. following main advantages: (1) it simplifies the difficult code
B. Coverage-based Differential Testing coverage validation problem as a simple comparison problem;
(2) the comparison between code coverage reports not only
A number of recent studies leverage coverage to improve the checks whether a program chunk gets executed or not, but
effectiveness of differential testing. Chen et al. [55] proposed also the exact execution frequencies. Any discrepancy in these
a coverage-directed fuzzing approach to detecting inconsis- dimensions would alert a potential bug report, which helps find
tencies between different implementations of Java Virtual subtle but deep semantic bugs in code coverage tools; and (3)
Machine (JVM). They mutated seeding classfiles, executed our approach is simple, straightforward, and general. It can
mutants on a reference JVM implementation, and used cov- be easily applied to validate different code coverage tools,
erage uniqueness as a discipline for accepting representative under various programming languages and coverage criteria.
mutants. The accepted mutants were then used as the inputs In the future, more efforts should be paid on this area and
to differentially test different JVM implementations. Pei et there is a need to examine the influence of those bugs on
al. [56] proposed DeepXplore, a whitebox coverage-directed other techniques which depend on code coverage.
C. Testing via Equivalence Modulo Inputs

Testing via equivalence modulo inputs (EMI) is a testing technique proposed in recent years. In nature, EMI testing is a kind of metamorphic testing, which modifies a program to generate variants that produce the same outputs as the original program [57], [58]. Le et al. [7] proposed to generate equivalent versions of a program by profiling the program's execution and pruning its unexecuted code. Once a program and its equivalent variant are constructed, both are used as inputs of the compiler under test, checking for inconsistencies in their results. So far, this method has been used to detect 147 confirmed bugs in two open source compilers, GCC and LLVM. Le et al. [52] applied a similar idea to stress-test link-time optimizers, splitting a program into multiple compilation units and randomly assigning each unit an optimization level to thoroughly exercise link-time optimizers. They discovered and reported 37 LTO bugs for GCC and LLVM in 11 months. These techniques heavily depend on the code coverage information.
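To make this dependence concrete, the following minimal sketch (for illustration only; it is not the implementation of [7]) derives an EMI variant from a gcov line report by pruning every source line whose reported execution count is zero. The file names and the naive line-based pruning are assumptions made for this sketch; a real EMI tool operates on the parsed program so that the pruned variant stays compilable.

# emi_prune.py -- illustrative sketch only; not the tool from [7].
# Assumes test.c was built with `gcc --coverage`, executed once, and that
# `gcov test.c` has produced test.c.gcov in the current directory.
import re

def prune_unexecuted(gcov_report: str) -> str:
    # Each report line has the form "  count:  lineno:source text".
    # "#####" means never executed (prune), "-" means not instrumented (keep).
    kept = []
    for line in gcov_report.splitlines():
        m = re.match(r"\s*([^:]+):\s*(\d+):(.*)", line)
        if not m:
            continue
        count, lineno, src = m.group(1).strip(), int(m.group(2)), m.group(3)
        if lineno == 0:        # header lines such as "Source:" or "Runs:"
            continue
        if count == "#####":   # reported as unexecuted -> prune this line
            continue
        kept.append(src)
    return "\n".join(kept) + "\n"

if __name__ == "__main__":
    with open("test.c.gcov") as fin, open("test_variant.c", "w") as fout:
        fout.write(prune_unexecuted(fin.read()))

A line-based cut like this can easily yield an uncompilable variant; the sketch is meant only to make explicit which pruning decisions are driven by the reported execution counts.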
VI. CONCLUSION AND FUTURE WORK

We proposed a randomized differential testing approach to hunting code coverage bugs and implemented a tool named C2V to test two C code coverage tools, gcov and llvm-cov. Our evaluation, in which 42 and 28 bugs were confirmed for gcov and llvm-cov respectively within a few months, provides strong evidence that code coverage tools are not as reliable as they might have been envisaged. Overall, our approach has the following main advantages: (1) it reduces the difficult code coverage validation problem to a simple comparison problem; (2) the comparison between code coverage reports checks not only whether a program chunk gets executed, but also its exact execution frequency, and any discrepancy in these dimensions raises a potential bug report, which helps find subtle but deep semantic bugs in code coverage tools; and (3) our approach is simple, straightforward, and general: it can be easily applied to validate different code coverage tools, under various programming languages and coverage criteria.
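As a concrete illustration of this comparison, the sketch below (for illustration only; it is not the C2V implementation) collects line-coverage reports for one test program from both gcov and llvm-cov and flags every line whose execution count differs. The command lines follow the standard workflows of the two tools; the file names, helper functions, and the deliberately simplified report parsing are assumptions made for this example.

# diff_coverage.py -- illustrative sketch of the report comparison; not C2V.
# Assumes gcc/gcov, clang, llvm-profdata, and llvm-cov are on PATH and that
# test.c is a deterministic, self-contained program (e.g., generated by Csmith).
import os
import re
import subprocess

def gcov_counts(src="test.c"):
    # gcc --coverage instruments the binary; running it emits .gcda data,
    # and `gcov test.c` turns that into the textual report test.c.gcov.
    subprocess.run(["gcc", "--coverage", "-o", "prog_gcc", src], check=True)
    subprocess.run(["./prog_gcc"], check=True)
    subprocess.run(["gcov", src], check=True)
    counts = {}
    for line in open(src + ".gcov"):
        m = re.match(r"\s*(\d+|#####):\s*(\d+):", line)   # "count: lineno:"
        if m:
            counts[int(m.group(2))] = 0 if m.group(1) == "#####" else int(m.group(1))
    return counts

def llvm_counts(src="test.c"):
    # clang's source-based coverage: instrument, run, merge the raw profile,
    # then ask llvm-cov to print per-line execution counts.
    subprocess.run(["clang", "-fprofile-instr-generate", "-fcoverage-mapping",
                    "-o", "prog_clang", src], check=True)
    subprocess.run(["./prog_clang"], check=True,
                   env={**os.environ, "LLVM_PROFILE_FILE": "prog.profraw"})
    subprocess.run(["llvm-profdata", "merge", "-sparse", "prog.profraw",
                    "-o", "prog.profdata"], check=True)
    report = subprocess.run(["llvm-cov", "show", "./prog_clang",
                             "-instr-profile=prog.profdata", src],
                            check=True, capture_output=True, text=True).stdout
    counts = {}
    for line in report.splitlines():
        m = re.match(r"\s*(\d+)\|\s*(\d+)\|", line)        # "lineno| count|src"
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts

if __name__ == "__main__":
    gc, lc = gcov_counts(), llvm_counts()
    for n in sorted(set(gc) | set(lc)):
        if gc.get(n) != lc.get(n):
            print(f"line {n}: gcov={gc.get(n)} llvm-cov={lc.get(n)}")

Because every step runs automatically, the same comparison can be repeated for each randomly generated program, and every printed discrepancy becomes a candidate bug report for further inspection.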
In the future, more effort should be devoted to this area, and the influence of such bugs on other techniques that depend on code coverage needs to be examined.
ACKNOWLEDGMENT

We thank Yanyan Jiang, Zhaogui Xu, and the anonymous reviewers for their constructive comments. We also thank the GCC and LLVM developers, especially Martin Liška, for analyzing and fixing our reported bugs. This work is supported by the National Natural Science Foundation of China (61702256, 61772259, 61432001, 61832009, 61772263, 61802168, 61872177), the Natural Science Foundation of Jiangsu Province (BK20170652), the China Postdoctoral Science Foundation (2018T110481), the Fundamental Research Funds for the Central Universities (020214380032, 02021430047), and the National Key R&D Program of China (2018YFB1003901). Zhendong Su was supported by United States NSF Grants 1528133 and 1618158, and Google and Mozilla Faculty Research awards. Yuming Zhou ([email protected]) and Baowen Xu ([email protected]) are the corresponding authors.
REFERENCES

[1] J. C. Miller and C. J. Maloney, "Systematic mistake analysis of digital computer programs," Commun. ACM, vol. 6, no. 2, pp. 58–63, Feb. 1963.
[2] M. Gligoric, A. Groce, C. Zhang, R. Sharma, M. A. Alipour, and D. Marinov, "Guidelines for coverage-based comparisons of non-adequate test suites," ACM Trans. Softw. Eng. Methodol., vol. 24, no. 4, pp. 22:1–22:33, Sep. 2015.
[3] G. Fraser, M. Staats, P. McMinn, A. Arcuri, and F. Padberg, "Does automated white-box test generation really help software testers?" in Proceedings of the 2013 International Symposium on Software Testing and Analysis, ser. ISSTA 2013. New York, NY, USA: ACM, 2013, pp. 291–301.
[4] S. Yoo and M. Harman, "Regression testing minimization, selection and prioritization: A survey," Softw. Test. Verif. Reliab., vol. 22, no. 2, pp. 67–120, Mar. 2012.
[5] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '05. New York, NY, USA: ACM, 2005, pp. 273–282.
[6] G. Fraser and A. Arcuri, "A large-scale evaluation of automated unit test generation using EvoSuite," ACM Trans. Softw. Eng. Methodol., vol. 24, no. 2, pp. 8:1–8:42, Dec. 2014.
[7] V. Le, M. Afshari, and Z. Su, "Compiler validation via equivalence modulo inputs," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '14. New York, NY, USA: ACM, 2014, pp. 216–226.
[8] M. Böhme, V.-T. Pham, and A. Roychoudhury, "Coverage-based greybox fuzzing as Markov chain," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '16. New York, NY, USA: ACM, 2016, pp. 1032–1043.
[9] D. Hao, L. Zhang, L. Zang, Y. Wang, X. Wu, and T. Xie, "To be optimal or not in test-case prioritization," IEEE Trans. Softw. Eng., vol. 42, no. 5, pp. 490–505, May 2016.
[10] D. Hao, L. Zhang, L. Zhang, G. Rothermel, and H. Mei, "A unified test case prioritization approach," ACM Trans. Softw. Eng. Methodol., vol. 24, no. 2, pp. 10:1–10:31, Dec. 2014.
[11] Z. Zuo, S.-C. Khoo, and C. Sun, "Efficient predicated bug signature mining via hierarchical instrumentation," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, ser. ISSTA 2014. New York, NY, USA: ACM, 2014, pp. 215–224.
[12] Z. Zuo, L. Fang, S.-C. Khoo, G. Xu, and S. Lu, "Low-overhead and fully automated statistical debugging with abstraction refinement," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA 2016. New York, NY, USA: ACM, 2016, pp. 881–896.
[13] Z. Zuo and S.-C. Khoo, "Mining dataflow sensitive specifications," in Formal Methods and Software Engineering, L. Groves and J. Sun, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 36–52.
[14] Y. Li, Z. Su, L. Wang, and X. Li, "Steering symbolic execution to less traveled paths," in Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, ser. OOPSLA '13. New York, NY, USA: ACM, 2013, pp. 19–32.
[15] Z. Li, M. Harman, and R. M. Hierons, "Search algorithms for regression test case prioritization," IEEE Trans. Softw. Eng., vol. 33, no. 4, pp. 225–237, Apr. 2007.
[16] "llvm-cov," https://llvm.org/docs/CommandGuide/llvm-cov.html.
[17] "Clang," http://clang.llvm.org/.
[18] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, "Test-case reduction for C compiler bugs," in Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '12. New York, NY, USA: ACM, 2012, pp. 335–346.
[19] X. Yang, Y. Chen, E. Eide, and J. Regehr, "Finding and understanding bugs in C compilers," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '11. New York, NY, USA: ACM, 2011, pp. 283–294.
[20] Y. Chen, A. Groce, C. Zhang, W.-K. Wong, X. Fern, E. Eide, and J. Regehr, "Taming compiler fuzzers," in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '13. New York, NY, USA: ACM, 2013, pp. 197–208.
[21] C. Sun, V. Le, and Z. Su, "Finding and analyzing compiler warning defects," in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE '16. New York, NY, USA: ACM, 2016, pp. 203–213.
[22] "GCC," https://gcc.gnu.org/.
[23] "Gcov," https://gcc.gnu.org/onlinedocs/gcc/Gcov.html.
[24] "Testwell CTC++," http://www.testwell.fi/ctcdesc.html.
[25] "CovTool," http://covtool.sourceforge.net/.
[26] "JCov," https://wiki.openjdk.java.net/display/CodeTools/jcov.
[27] "JaCoCo," https://www.jacoco.org/.
[28] "EMMA: a free Java code coverage tool," http://emma.sourceforge.net/.
[29] "Atlassian Clover," https://www.atlassian.com/software/clover.
[30] "CodeCover," http://codecover.org/.
[31] "OpenClover," http://openclover.org/.
[32] P. Ammann and J. Offutt, Introduction to Software Testing. Cambridge University Press, 2016.
[33] G. J. Myers, T. Badgett, T. M. Thomas, and C. Sandler, The Art of Software Testing. Wiley Online Library, 2004, vol. 2.
[34] J. A. Jones and M. J. Harrold, "Test-suite reduction and prioritization for modified condition/decision coverage," IEEE Trans. Softw. Eng., vol. 29, no. 3, pp. 195–209, Mar. 2003.
[35] L. Zhang, D. Marinov, L. Zhang, and S. Khurshid, "Regression mutation testing," in Proceedings of the 2012 International Symposium on Software Testing and Analysis, ser. ISSTA 2012. New York, NY, USA: ACM, 2012, pp. 331–341.
[36] S. Artzi, J. Dolby, F. Tip, and M. Pistoia, "Directed test generation for effective fault localization," in Proceedings of the 19th International Symposium on Software Testing and Analysis. ACM, 2010, pp. 49–60.
[37] G. Rothermel, R. J. Untch, and C. Chu, "Prioritizing test cases for regression testing," IEEE Trans. Softw. Eng., vol. 27, no. 10, pp. 929–948, Oct. 2001.
[38] W. E. Wong, J. R. Horgan, S. London, and H. A. Bellcore, "A study of effective regression testing in practice," in Proceedings of the Eighth International Symposium on Software Reliability Engineering, ser. ISSRE '97. Washington, DC, USA: IEEE Computer Society, 1997, pp. 264–274.
[39] K. Jamrozik, G. Fraser, N. Tillmann, and J. De Halleux, "Augmented dynamic symbolic execution," in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE 2012. New York, NY, USA: ACM, 2012, pp. 254–257.
[40] C. Lidbury, A. Lascu, N. Chong, and A. F. Donaldson, "Many-core compiler fuzzing," in Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '15. New York, NY, USA: ACM, 2015, pp. 65–76.
[41] Q. Zhang, C. Sun, and Z. Su, "Skeletal program enumeration for rigorous compiler testing," in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2017. New York, NY, USA: ACM, 2017, pp. 347–361.
[42] C. Lemieux and K. Sen, "FairFuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ser. ASE 2018. New York, NY, USA: ACM, 2018, pp. 475–485.
[43] X. Xie, T. Y. Chen, F.-C. Kuo, and B. Xu, "A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization," ACM Trans. Softw. Eng. Methodol., vol. 22, no. 4, pp. 31:1–31:40, Oct. 2013.
[44] T. Wang and A. Roychoudhury, "Automated path generation for software fault localization," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '05. New York, NY, USA: ACM, 2005, pp. 347–351.
[45] S. Yoo, M. Harman, and D. Clark, "Fault localization prioritization: Comparing information-theoretic and coverage-based approaches," ACM Trans. Softw. Eng. Methodol., vol. 22, no. 3, pp. 19:1–19:29, Jul. 2013.
[46] R. Santelices, J. A. Jones, Y. Yu, and M. J. Harrold, "Lightweight fault-localization using multiple coverage types," in Proceedings of the 31st International Conference on Software Engineering, ser. ICSE '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 56–66.
[47] J. Xuan and M. Monperrus, "Test case purification for improving fault localization," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014. New York, NY, USA: ACM, 2014, pp. 52–63.
[48] X. Wang, N. Zeldovich, M. F. Kaashoek, and A. Solar-Lezama, "Towards optimization-safe systems: Analyzing the impact of undefined behavior," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ser. SOSP '13. New York, NY, USA: ACM, 2013, pp. 260–275.
[49] E. Eide and J. Regehr, "Volatiles are miscompiled, and what to do about it," in Proceedings of the 8th ACM International Conference on Embedded Software, ser. EMSOFT '08. New York, NY, USA: ACM, 2008, pp. 255–264.
[50] A. Balestrat, "CCG: A random C code generator," https://github.com/Merkil/ccg, 2015.
[51] P.-N. Tan, M. Steinbach, and V. Kumar, "Association analysis: basic concepts and algorithms," Introduction to Data Mining, pp. 327–414, 2005.
[52] V. Le, C. Sun, and Z. Su, "Randomized stress-testing of link-time optimizers," in Proceedings of the 2015 International Symposium on Software Testing and Analysis, ser. ISSTA 2015. New York, NY, USA: ACM, 2015, pp. 327–337.
[53] W. M. McKeeman, "Differential testing for software," Digital Technical Journal, vol. 10, no. 1, pp. 100–107, 1998.
[54] A. Groce, G. Holzmann, and R. Joshi, "Randomized differential testing as a prelude to formal verification," in Proceedings of the 29th International Conference on Software Engineering, ser. ICSE '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 621–631.
[55] Y. Chen, T. Su, C. Sun, Z. Su, and J. Zhao, "Coverage-directed differential testing of JVM implementations," in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '16. New York, NY, USA: ACM, 2016, pp. 85–99.
[56] K. Pei, Y. Cao, J. Yang, and S. Jana, "DeepXplore: Automated whitebox testing of deep learning systems," in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP '17. New York, NY, USA: ACM, 2017, pp. 1–18.
[57] T. Y. Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou, "Metamorphic testing: A review of challenges and opportunities," ACM Comput. Surv., vol. 51, no. 1, pp. 4:1–4:27, Jan. 2018.
[58] S. Segura, G. Fraser, A. B. Sanchez, and A. Ruiz-Cortés, "A survey on metamorphic testing," IEEE Trans. Softw. Eng., vol. 42, no. 9, pp. 805–824, Sep. 2016.
[59] V. Le, C. Sun, and Z. Su, "Finding deep compiler bugs via guided stochastic program mutation," in Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA 2015. New York, NY, USA: ACM, 2015, pp. 386–399.
[60] C. Sun, V. Le, and Z. Su, "Finding compiler bugs via live code mutation," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA 2016. New York, NY, USA: ACM, 2016, pp. 849–863.