CTOS: Compiler Testing for Optimization Sequences of LLVM
Abstract—Optimization sequences are often employed in compilers to improve the performance of programs, but may trigger critical
compiler bugs, e.g., compiler crashes. Although many methods have been developed to automatically test compilers, no systematic
work has been conducted to detect compiler bugs when applying arbitrary optimization sequences. To resolve this problem, two main
challenges need to be addressed, namely the acquisition of representative optimization sequences and the selection of representative
testing programs, due to the enormous number of optimization sequences and testing programs. In this study, we propose CTOS, a
novel compiler testing method based on differential testing, for detecting compiler bugs caused by optimization sequences of LLVM.
CTOS first leverages the technique Doc2Vec to transform optimization sequences into vectors to capture the information of
optimizations and their orders simultaneously. Second, a method based on the region graph and call relationships is developed in
CTOS to construct the vector representations of the testing program, such that the semantics and the structure information of programs
can be captured simultaneously. Then, with the vector representations of optimization sequences and testing programs, a “centroid”
based selection scheme is proposed to address the above two challenges. Finally, CTOS takes in the representative optimization
sequences and testing programs as inputs, and tests each testing program with all the representative optimization sequences. If there
is an output that is different from the majority of others of a given testing program, then the corresponding optimization sequence is
deemed to trigger a compiler bug. Our evaluation demonstrates that CTOS significantly outperforms the baselines by 24.76%–50.57% in terms of bug-finding capability on average. Within seven months of evaluation on LLVM, we have reported 104 valid bugs of 5 types, of which 21 have been confirmed or fixed. Most of these bugs are crash bugs (57) and wrong code bugs (24). 47 unique optimizations are identified to be faulty, and 15 of them are loop related optimizations.
Index Terms—Compiler testing, optimization sequences, LLVM, program representation, software testing
3.2 Representation of Optimization Sequences

An optimization sequence consists of some optimizations in a certain order. Thus, the representation of an optimization sequence should reflect the specific optimizations and their orders contained in the sequence. Intuitively, an optimization sequence is similar to a sentence in natural language, which consists of some words in a certain order. Hence, in this study, we treat optimization sequences as sentences, such that efficient representation methods of sentences can be adopted to transform optimization sequences into vectors. However, many state-of-the-art representation methods of sentences, such as the bag-of-words [32], cannot reflect the word order. They fail to distinguish different sentences with the same words. To capture the optimizations and their orders of an optimization sequence simultaneously, we employ Doc2Vec [29], a popular and widely used sentence vector representation technique, to represent optimization sequences as vectors. Doc2Vec is an unsupervised method for learning continuous distributed vector representations of sentences or documents. Doc2Vec takes word orders into consideration, such that sequences with different orders of the same words have different vector representations. In addition, Doc2Vec can be applied to variable-length word sequences, so variable-length optimization sequences can easily be transformed into vector representations.

In this study, Doc2Vec is applied in a relatively straightforward way. That is, optimizations and optimization sequences are viewed as words and sentences, respectively. We leverage the DMPV model (see Section 2.2) of Doc2Vec as the representation method of optimization sequences. Then we input optimizations and optimization sequences into the DMPV model of Doc2Vec to obtain the vector representations of optimization sequences.

For example, if we only take five optimizations in LLVM into consideration, i.e., {-functionattrs, -gvn, -loop-rotate, -loop-vectorize, -sroa}, and set the maximum length of optimization sequences to 5, we can obtain 5¹ + 5² + 5³ + 5⁴ + 5⁵ = 3905 optimization sequences. Take the following three optimization sequences as an example: (a) “-functionattrs -loop-rotate -sroa -gvn -loop-vectorize”; (b) “-functionattrs -sroa -loop-vectorize -loop-rotate -gvn”; (c) “-loop-rotate -sroa -gvn -loop-vectorize”. If we can only test two sequences among them due to the limitation of resources (e.g., time), testing sequences (a) and (b) may uncover more bugs, since sequence (c) is a subsequence of sequence (a) only without the optimization “-functionattrs”. However, in this case, the order of optimizations is hard to capture with bag-of-words methods. For instance, we can calculate the similarity between optimization sequences using the Jaccard similarity coefficient,4 which is defined as the size of the intersection divided by the size of the union of two sample sets A and B, i.e., J(A, B) = |A ∩ B| / |A ∪ B|. Sequences (a) and (b) have the same optimizations, so J(a, b) = 1, while J(a, c) = 4/5 for sequences (a) and (c). This indicates that sequences (a) and (b) are identical and that sequences (a) and (c) should be tested, which contradicts the observation, since the order of optimizations in sequence (a) is completely different from that in sequence (b). By using Doc2Vec, we can resolve this difficulty, since it captures the optimizations and their orders of optimization sequences simultaneously.

4. https://en.wikipedia.org/wiki/Jaccard_index.
For example, if we only take five optimizations in LLVM
into consideration, i.e., {-functionattrs, -gvn, -loop-rotate, 3.3.1 Representation of a Function
-loop-vectorize, -sroa}, and set the max length of optimization A function consists of basic blocks, branches, and loops.
sequences to 5, we can obtain 51 þ 52 þ 53 þ 54 þ 55 ¼ 3905 Basic blocks contain the basic semantics of a function, while
optimization sequences. Take the following three optimiza- branches and loops control the structure of a function [33].
tion sequences as an example: (a) “-functionattrs -loop-rotate Thus, we use the region graph [34], [35] of a function to con-
-sroa -gvn -loop-vectorize”; (b) “-functionattrs -sroa -loop-vector- struct its vector representation, since the region graph could
ize -loop-rotate -gvn”; (c) “-loop-rotate -sroa -gvn -loop- simultaneously capture the semantics and structure infor-
vectorize”. If we can only test two sequences among them mation of a function [34], [35]. Definition 1 shows the gen-
due to the limitation of resources (e.g., time), testing sequen- eral definition of a region graph.
ces (a) and (b) may uncover more bugs, since sequence (c) is
Definition 1. A region graph is a special control flow graph, in
a subsequence of sequence (a) only without the optimiza-
which each node (i.e., basic block) exactly belongs to a region.
tion “-functionattrs”. However, in this case, the order of opti-
Specifically, a region is a connected subgraph of the control
mizations is hard to be captured by some bag-of-words
flow graph that has exactly two connections to the residual
methods. For instance, we can calculate the similarity
graph [34], [35], [36].
between optimization sequences utilizing the Jaccard simi-
larity coefficient,4 which is defined as the size of the inter- In a region graph, each node (i.e., basic block) exactly
section divided by the size of the union of two sample sets belongs to a region. Fig. 4 shows an example of the region
A and B, i.e., JðA; BÞ ¼ jA \ Bj=jA [ Bj. For the sequences graph. This graph is derived from a simple bubble sort algo-
(a) and (b), they have the same optimizations, i.e., Jða; bÞ ¼ rithm using the tool Opt in LLVM. We remove the contents
1; while Jða; cÞ ¼ 4=5 for the sequences (a) and (c). It indi- (i.e., statement sequences) of some basic blocks for simplic-
cates that sequences (a) and (b) are identical, and sequences ity. Clearly, there are 10 basic blocks, 12 edges, and 4
(a) and (c) should be tested. This contrasts with the observa- regions colored by four colors in Fig. 4. The number next to
tion, since the order of optimizations in the sequence (a) is each edge is its index. In these basic blocks, blocks “%11”
and “%15” are entry nodes of the outer loop and the inner
loop respectively, and block “%19” is the entry node of a
4. https://en.wikipedia.org/wiki/Jaccard_index. branch. From Fig. 4, the structure information (e.g., the
Authorized licensed use limited to: NORTHWESTERN POLYTECHNICAL UNIVERSITY. Downloaded on April 04,2024 at 02:29:43 UTC from IEEE Xplore. Restrictions apply.
2344 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 48, NO. 7, JULY 2022
outer loop, the inner loop, and the branch) of the program
are clearly captured by each region.
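As a rough illustration of how edge instruction sequences could be turned into a single function vector, the sketch below embeds each edge's instruction tokens with a trained Doc2Vec model and simply averages the edge vectors. This is a simplification under stated assumptions: the paper's deep region-first aggregation with its two constraints is not reproduced here, and the extraction of per-edge instruction tokens from the IR is assumed to happen elsewhere.

# Simplified sketch: one vector per region-graph edge (from its instruction
# token sequence), then a plain mean as the function vector. The paper's
# deep region-first traversal is replaced by an unweighted average.
import numpy as np
from gensim.models.doc2vec import Doc2Vec

def function_vector(edge_token_seqs, model: Doc2Vec) -> np.ndarray:
    """edge_token_seqs: list of token lists, one per region-graph edge,
    e.g. [["load", "icmp", "br"], ["add", "store", "br"], ...]."""
    if not edge_token_seqs:
        return np.zeros(model.vector_size)
    edge_vecs = [model.infer_vector(tokens) for tokens in edge_token_seqs]
    return np.mean(edge_vecs, axis=0)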
“func4”, it does not have the final vector representation. On the other hand, the recursive calls have almost no effect on improving the representation of functions. For instance, “func7” recursively calls itself, and its vector representation does not change after the aggregation.

Then, the start nodes (start_nodes) and the end nodes (end_nodes) in the call graph are recognized in lines 3 and 4. start_nodes are the nodes without predecessors, while end_nodes are the nodes without successors in the call graph. We add a node “start” (i.e., a fake function with the name “start”) to the call graph with a zero vector as its representation. Next, we add edges from the node “start” to the start_nodes. The reason is that there may exist some functions (besides the “main” function) that are not called by any other functions. Notably, a function without any caller could also be split off into an independent program, but in our study, we treat the functions in a program generated by Csmith in a uniform way to simplify the processing of the program. The work list (work_list) is initialized with the end_nodes. Then, from line 12 to 25, we propagate the vectors of callees to callers until work_list is empty. In line 13, the current node (cur_node) is randomly picked from work_list. In line 14, the predecessor nodes (pre_nodes) of cur_node are selected. Then, for each node in pre_nodes, we propagate the vectors of its successors to it from line 15 to 24. The constraint is verified in line 18: if all nodes in the suc_nodes of a node have been visited, we average the sum of its vector and those of its successors (lines 19 and 20). This average value is then assigned to the node as its final vector representation in line 22.

In Fig. 5, when the recursive calls are deleted, “func4”, “func5”, “func6”, and “func7” are the nodes with final vector representations. Thus, we can propagate these vectors back to “func1”, “func2”, and “func3” according to their calling relationships. After obtaining the final vector representations of “func1”, “func2”, and “func3”, the vector representations of “func2” and “func3” can be propagated back to the function “main”. However, we cannot propagate the vector representations of “func1” and “main” back to “start” until the vector representation of “main” is final. Lastly, the vector representation of “start” is treated as the vector representation of the whole program.
vector representation of the whole program. min dist2sel from the candidate instance to the selected
instances is calculated for avoiding similar instances in the
same or similar orbits. For example, if the instances “Earth”
3.4 Selection Scheme and “Mars” are selected, the instance “Moon” cannot be
With the vector representations of optimization sequences selected since it is very close to “Earth”. Therefore, to balance
and testing programs, we present a selection scheme in this the effect of these two distances, we leverage the product of
subsection to select representative optimization sequences these two distances as the score of a candidate instance. The
and testing programs. larger the score a candidate instance has, the better it will be.
Given a set of instances (i.e., optimization sequences and In this study, we utilize the euclidean distance function
testing programs), we aim to select a small set of instances distðu; vÞ to calculate the distance between two instances.
with better diversity, since the space of instances is huge Suppose ~ u ¼ ðu1 ; u2 ; . . . ; un Þ and ~ v ¼ ðv1 ; v2 ; . . . ; vn Þ are two
and duplicate bugs may be triggered by the instances with candidate instances represented by n dimensional vectors,
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
high similarities. Algorithm 3 presents the proposed selec- distð~ vÞ ¼ ðu1 v1 Þ2 þ þ ðun vn Þ2 . In Algorithm 3,
u; ~
tion scheme. The central idea is to select instances one by
one such that the total distances among the selected instan- for a candidate instance that has not been selected, we first cal-
ces are maximized. First, M instances will be generated by a culate the distance from it to the centroid in line 11; then the
random generator (for the generation of initial optimization minimum distance from it to the selected instances is calcu-
sequences and testing programs, see Section 4.2). We then lated in line 12, the score of the current instance is calculated
cluster these M instances into groups. After that, the central via the product of these two distances in line 13. In lines 10 to
instances of each group are selected as the initialization of 16, the current best instance with the maximum score is
the set of selected instances selected insts, which leads at selected. This process repeats until k instances are selected.
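The sketch below illustrates this centroid-based selection under stated assumptions: scikit-learn's KMeans stands in for the X-means clustering used by the paper, the number of groups is a placeholder parameter, and the score is the product dist2center * min_dist2sel as described above.

# Hedged sketch of the "centroid"-based selection (in the spirit of
# Algorithm 3): initialize with one instance per cluster, then greedily add
# the candidate maximizing dist2center * min_dist2sel until k are chosen.
import numpy as np
from sklearn.cluster import KMeans

def select_instances(vectors: np.ndarray, k: int, n_groups: int = 8) -> list:
    centroid = vectors.mean(axis=0)
    km = KMeans(n_clusters=n_groups, n_init=10).fit(vectors)
    # Initialization: the instance closest to each cluster center.
    selected = [int(np.argmin(np.linalg.norm(vectors - c, axis=1)))
                for c in km.cluster_centers_]
    selected = list(dict.fromkeys(selected))              # deduplicate, keep order
    while len(selected) < k:
        best, best_score = None, -1.0
        for i in range(len(vectors)):
            if i in selected:
                continue
            dist2center = np.linalg.norm(vectors[i] - centroid)
            min_dist2sel = min(np.linalg.norm(vectors[i] - vectors[j]) for j in selected)
            score = dist2center * min_dist2sel             # balance the two distances
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break                                          # no candidates left
        selected.append(best)
    return selected[:k]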
optimizations, and results visualization of optimizations. Finally, 114 optimizations are selected.8 In LLVM, each optimization may depend on certain other optimizations as preconditions. LLVM provides a mechanism, PassManager,9 to manage these dependencies. Thus, we do not need to manually manage the dependencies of optimizations.

For generating the initial optimization sequences, we assign an index to each optimization. The length of an optimization sequence is randomly selected in the range from 50 to 200. This is because there are 84 unique transformation optimizations in the -O3 optimization level of the currently released LLVM 7.0.1 (the optimizations in the -O3 level may differ across LLVM versions), and the range from 50 to 200 guarantees that the generated optimization sequences have different lengths. Then we leverage a uniform random number generator to randomly generate optimization indexes until the length of the current optimization sequence is reached. Next, the index sequence is translated into the corresponding optimization sequence. In addition, the parameters of the optimizations are set to their default values.
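The generation step described above is straightforward; a minimal sketch is shown below. The OPTIMIZATIONS list is an assumed placeholder for the 114 selected passes (its concrete content is not reproduced here).

# Illustrative sketch: random length in [50, 200], uniformly sampled
# optimization indexes, translated back to pass names.
import random

OPTIMIZATIONS = ["-gvn", "-licm", "-loop-rotate", "-sroa", "-jump-threading"]  # ... 114 passes in total

def random_sequence() -> list:
    length = random.randint(50, 200)
    indexes = [random.randrange(len(OPTIMIZATIONS)) for _ in range(length)]
    return [OPTIMIZATIONS[i] for i in indexes]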
Testing Programs. We use Csmith [13], a widely used program generator that supports most features of the C programming language, to generate the initial testing programs. To detect compiler bugs caused by optimization sequences, the testing programs need to be valid, free of undefined behavior, diverse, and executable. However, other program generators (e.g., CCG [42], Yarpgen [43], Orion [9]) cannot meet our requirements. For example, CCG cannot generate runnable testing programs, and its maintenance stopped a long time ago. Yarpgen produces correct, runnable C/C++ programs, but it only supports a few C/C++ features. Orion is a mutation-based tool that generates new programs from seed programs by deleting dead code in the seed programs; for mutation-based tools (e.g., Orion), the diversity of the generated programs is limited by the seed programs. In addition, some grammar-based program generators (e.g., Grammarinator [44]) can also be utilized to generate testing programs, but the generated programs are often invalid and contain undefined behaviors. Thus, we employ Csmith to generate testing programs in this paper. The minimum size of the generated programs is set to 80KB as suggested in [13]. Other parameters of Csmith are set to their default values. In addition, we leverage LLVM warnings and Frama-C10 to detect undefined behaviors in the generated programs, since undefined behavior may cause invalid compiler bugs.
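A rough sketch of this generation-and-screening loop is given below. The exact Csmith and Frama-C command-line options vary between versions, so the invocations and the undefined-behavior check shown here are assumptions for illustration, not the paper's exact setup.

# Sketch: generate a Csmith program, enforce the 80KB minimum size, and
# screen it for undefined behavior (simplified heuristic check).
import os
import subprocess

MIN_SIZE = 80 * 1024  # 80KB minimum program size, as suggested by Csmith's authors

def generate_program(path: str, seed: int) -> bool:
    subprocess.run(["csmith", "--seed", str(seed), "--output", path], check=True)
    if os.path.getsize(path) < MIN_SIZE:
        return False  # too small; try another seed
    # Undefined-behavior screening; treating any reported alarm or non-zero
    # exit as a reason to discard the program is a simplification here.
    frama_c = subprocess.run(["frama-c", "-val", path], capture_output=True, text=True)
    return frama_c.returncode == 0 and "alarm" not in frama_c.stdout.lower()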
Test Case Reduction. Similar to other related studies (e.g., [9], [13], [17]), all test cases that trigger compiler bugs should be reduced before we report them to the developers, such that the developers can quickly locate the real causes of the bugs and fix them. Test case reduction includes two parts in our study, namely, optimization sequence reduction and testing program reduction. For the reduction of optimization sequences, we remove each optimization in the optimization sequence one by one. If the bug still occurs, which indicates that this optimization has no impact on the bug, we delete it from the optimization sequence; otherwise, the removed optimization is put back into its original position. This process continues until no optimization can be deleted. Additionally, similar to the related work [9], [13], [17], we also use Creduce [45], a widely used tool for reducing C, C++, or OpenCL programs, to reduce the testing programs that have triggered bugs. LLVM warnings and Frama-C are used to detect undefined behaviors during the reduction process to ensure that the resultant program is valid.
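The one-by-one reduction just described can be sketched as follows. The callback triggers_bug is assumed to re-run Opt with the candidate sequence on the failing program and report whether the original failure still reproduces; it is a placeholder, not part of the paper's implementation.

# Sketch of one-by-one optimization sequence reduction: drop a pass if the
# bug still reproduces without it, otherwise keep it; repeat to a fixed point.
def reduce_sequence(sequence: list, triggers_bug) -> list:
    reduced = list(sequence)
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(reduced):
            candidate = reduced[:i] + reduced[i + 1:]
            if triggers_bug(candidate):
                reduced = candidate   # the removed pass was irrelevant to the bug
                changed = True        # keep iterating until no pass can be removed
            else:
                i += 1                # put the pass back and move on
    return reduced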
Note that we first reduce the optimization sequences. This is because the reduction of testing programs always takes more time than the reduction of optimization sequences. The developers of LLVM note that “the passes are mostly designed to operate independently, so if we see an assert/crash, then we can always blame the last pass in the sequence. And if the test ends with the same assertion and backtrace in the last pass in the sequence, then we can assume that it is a duplicate.” (see LLVM Bug#40927 [46]). Thus, if the last optimizations of the reduced sequences are identical (for crash bugs, the failed assertion or backtrace should also be identical), the corresponding bugs are treated as duplicates. Therefore, reducing optimization sequences first saves the total time to reduce test cases.

Duplicate Bug Identification. In our study, we also adopt the above strategy to filter out duplicate bugs. We treat the last optimization in a reduced optimization sequence as a buggy optimization. Thus, if the last optimizations in the two reduced optimization sequences of two bugs are the same, these two bugs are treated as duplicates. Besides, for crash bugs, to improve the accuracy of duplicate bug identification, we further use the failed assertion or backtrace to determine duplicate crash bugs. That is, when two crash bugs have the same failed assertion or backtrace, we treat them as duplicate crash bugs. The reason for adopting this strategy, rather than just distinct optimization sequences, to identify duplicate bugs is that many duplicate bugs can be triggered even though the reduced optimization sequences are distinct. This is also the strategy applied by the LLVM developers, as can be seen in LLVM Bug#40926 [47], #40927 [46], #40928 [48], #40929 [49], and #40933 [50], which are marked as duplicates of LLVM Bug#40925 [51]. Although the optimization sequences of these bugs are distinct, they are marked as duplicates because the root causes of these bugs are introduced by the same last optimization in the sequences. Through this strategy, we avoid reporting too many duplicate bugs to developers.
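A minimal sketch of this identification rule is shown below; the function name and signature are illustrative, not from the paper.

# Two bugs are considered duplicates when their keys are equal: same last
# optimization and, for crash bugs, the same failed assertion/backtrace.
def duplicate_key(reduced_sequence: list, is_crash: bool, assertion: str = "") -> tuple:
    last_opt = reduced_sequence[-1]
    return (last_opt, assertion) if is_crash else (last_opt,)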
Bug Types. In our study, we mainly find the following five types of compiler bugs caused by optimization sequences of LLVM.

(1) Crash. The optimizer Opt of LLVM crashes when optimizing the IR of a program.
(2) Invalid IR. In LLVM, each optimization takes the valid IR of a program as input, and its output should also be valid IR. However, invalid IR may be generated by some optimizations due to the interaction among optimizations. In our evaluation, we

8. The full list of 114 optimizations can be found on the website, https://github.com/CTOS-results/LLVM-Bugs-by-Optimization-sequences
9. https://llvm.org/docs/WritingAnLLVMPass.html
10. http://frama-c.com/
TABLE 2
Results of CTOS and its 14 Variants
TP.: Testing Programs, Inv. IR: Invalid IR, WC.: Wrong Code, CGB.: Code Generator Bug.
Therefore, it takes nearly 50 days to run these experiments. Notably, the 10 runs of CTOS and its 14 variants are independent, i.e., the optimization sequences and testing programs are different each time, except the 2-way combinatorial optimization sequences, which are identical in each variant of CTOS since the optimization sequences generated by ACTS are constant. In addition, the testing programs and optimization sequences for CTOS and its 14 variants are prepared before we carry out the experiments. The initial number of testing programs for each experiment is 1,000, because we cannot know how many testing programs can be tested before the experiments. It is rapid to generate RS and RP, while generating RPS and RPSv needs about 3 hours, since some random values of the configuration options may cause Csmith to take more time to generate a testing program.12 Generating SS, SP, and SPS takes 3 to 6 hours in our system, respectively. Hence, compared to the testing period, the time for preparing optimization sequences and testing programs is relatively short. The most time-consuming part is to generate 2W, which takes about 10 hours in our system. But we only need to generate 2W once, and all the experiments use the same 2W.

In RQ1, we do not include the time spent on generating testing programs and optimization sequences in the testing period, for two reasons. On the one hand, we adopt this evaluation strategy due to the limitation of computational resources. Our experiments are conducted on a computer with 16 GB of memory. Although we have tried our best to optimize our programs to select representative testing programs from the initial set of 100,000 testing programs in our experiments, the selection process can consume 4-8 GB of memory. Besides, we need to run 150 experiments (CTOS and its 14 variants, 10 runs for each experiment) in RQ1, which makes us run many experiments simultaneously so that we can finish all experiments in nearly 50 days. For each experiment, the testing process consumes about 100MB-4GB of memory. Thus, if we integrated the selection of testing programs into the testing process, the testing efficiency might be dramatically decreased due to possible memory swapping. However, under our current experiment setting, we may only use a small fraction of the initial 1,000 testing programs to test different optimization sequences. This may be a threat to the validity of our experiments, which will be discussed in Section 6. On the other hand, regarding the baselines using the combinatorial testing technique, the same set of optimization sequences (i.e., 2W) is generated each time (taking about 10 hours). The experiments reuse 2W for each run of the different baselines. Thus, it may bring unfair comparisons between the baselines with 2W and without 2W when the optimization sequence generation time is included.

Table 2 presents the experiment results of CTOS and its 14 variants. The second column is the average total number of testing programs, and the following 5 columns are the average numbers of unique bugs for each type. Actually, many duplicate bugs can be found by CTOS and its 14 variants. We filter out these duplicate bugs using the strategy described in Section 4.2. Next, the eighth column is the average total number of unique bugs for CTOS and its 14 variants. From Table 2, the numbers of testing programs for CTOS(RP + 2W), CTOS(SP + 2W), and CTOS(SPS + 2W) are smaller than those of CTOS and the other variants, since the number of 2-way combinatorial optimization sequences is about 11 times larger than those of the other types of optimization sequences. In addition, the six variants with RPS and RPSv test a small number of testing programs compared to CTOS and the other variants. For example, for the variants with RS, CTOS(RPS + RS) and CTOS(RPSv + RS) only test 28.8 and 61.4 testing programs on average, respectively. However, the numbers of testing programs of CTOS(RP + RS), CTOS(SP + RS), and CTOS(SPS + RS) are 113.3, 122.4, and 117.8, respectively. The reason is that testing programs generated by swarm testing may contain some complicated structures, which require a long time to optimize and to execute the corresponding executables. Generally, the majority of time for the whole testing process is used to optimize testing programs and execute the corresponding executables. This

12. https://github.com/csmith-project/csmith/blob/master/doc/probabilities.txt
13. https://bugs.llvm.org/show_bug.cgi?id=41290.
14. https://bugs.llvm.org/show_bug.cgi?id=42452.
15. https://bugs.llvm.org/show_bug.cgi?id=39626.
16. We use the open source code shared by Tim Menzies to calculate A12, https://github.com/txt/ase16/blob/master/doc/stats.md.
TABLE 3
Excluded Optimization Subsequences

TABLE 5
Buggy Optimizations for Reported Bugs

17. The details of these bugs can be found on the website, https://github.com/CTOS-results/LLVM-Bugs-by-Optimization-sequences.
18. https://bugs.llvm.org/show_bug.cgi?id=42452.
19. https://bugs.llvm.org/show_bug.cgi?id=41723.
Fig. 7. Top 5 most used optimizations for reported bugs. jump-t: jump-threading, instc: instcombine, structcfg: structurizecfg, loop-e: loop-extract,
loop-r: loop-rotate, ls-vec: load-store-vectorizer.
From Table 4, only 21 of the reported bugs are confirmed or fixed. To investigate this phenomenon, we collect 1,323 unique bugs related to scalar optimizations (most optimizations in LLVM belong to this component) from the LLVM bug repository20 from October 2003 to June 2019. Of these bugs, 828 have been confirmed or fixed, and 495 are still kept as “NEW”. For the 828 confirmed or fixed bugs, although 428 were confirmed or fixed within one month, the developers took more than 15 months to confirm or fix most of the remaining bugs. The average number of months for confirming or fixing these bugs is 5.6. In addition, the 495 bugs with “NEW” status have already existed for a long time, an average of 14.1 months. This indicates that the overall speed of confirming or fixing LLVM bugs is relatively slow, not just for our reported bugs. One possible reason is that it is hard and time-consuming to analyze and find the root causes of compiler bugs [58]. Especially for a bug caused by an optimization sequence, the root cause may lie in any optimization of the sequence. In addition, the study by Sun et al. [19] also finds that the bug-fixing rate of LLVM is lower than that of GCC. The authors explain that this is due to limited human resources, since some LLVM developers at Apple are pulled into other projects like Swift [19]. Furthermore, we also talk with three developers of an international company that has a team developing compilers based on LLVM. We ask these three developers what the difficulties are in fixing an LLVM bug during their development. All three developers say that it is difficult to analyze and find the root causes of compiler bugs, especially for wrong code bugs.
Table 5 presents the buggy optimizations of the 104 reported bugs, arranged in alphabetical order. These buggy optimizations are the last ones in the reduced optimization sequences, since the optimizations of LLVM are mostly designed to operate independently and the developers always blame the last optimizations in the reduced sequences [46]. 47 unique optimizations have been reported to be faulty. Specifically, there are 15 buggy loop related optimizations (in bold fonts), such as loop-rotate, loop-unroll, and loop-vectorize. From Table 5, we can see that the loop related optimizations are more bug-prone than other optimizations. This result indicates that the design of loop optimizations may have some flaws and should be further strengthened by the developers.

Fig. 7 shows the statistics of the top 5 most used optimizations in the reduced optimization sequences for the 104 reported bugs. From Fig. 7f, the optimizations jump-threading, gvn, licm, loop-rotate, and instcombine are the 5 most used optimizations in all reduced optimization sequences. In particular, jump-threading appears 44 times in all reduced optimization sequences for the 104 reported bugs. This optimization is used to turn conditional branches into unconditional branches, which can greatly improve performance for hardware with branch prediction, speculative execution, and prefetching. However, the code structure may become complicated after performing jump-threading, since it adds new paths and duplicates code.21 This may cause other optimizations to produce wrong results. In Figs. 7a, 7b, and 7c, we can see that jump-threading is among the top 2 most used optimizations for the crash bugs, invalid IR bugs, and wrong code bugs. In addition, from Figs. 7a, 7b, 7c, 7d, and 7e, optimizations that change the structure of a program (loop-rotate changes the structure of a loop and structurizecfg transforms the control flow structure of a program) are widely used in the buggy optimization sequences. This indicates that design flaws of optimizations may be introduced by edge cases in the structure of a program, which may help developers pay more attention to the interactions among these optimizations when they design and implement new optimizations.

Answer to RQ2. Our testing efforts over seven months clearly demonstrate that CTOS is effective in detecting LLVM bugs caused by optimization sequences. In the seven months, we reported 104 valid bugs within 5 types, of which 21 have been confirmed or fixed. 47 unique optimizations are identified to be faulty and 15 of them are loop related optimizations.

5 DISCUSSION

Importance of Optimization Sequences. Optimization sequences are mainly used to improve the performance (e.g., size, speed, and energy) of a program. In particular, the default optimization levels (e.g., -O1, -O2, and -O3) provided by a compiler are specific optimization sequences designed by compiler experts. Although the default optimization levels can significantly improve program performance, many studies (e.g., [1], [2]) have shown that autotuning of optimization sequences helps to further improve the performance of a program. In addition, a program in different scenarios may have different performance requirements. For example, energy consumption may be more important for a program in an embedded system, while speed may be critical for a scientific program on a high-performance supercomputer. Nevertheless, these techniques may be invalidated by potential bugs in the selected optimization sequences. On the other hand, compiler developers of a new programming language (e.g., Rust, Swift) based on LLVM need to design better optimization sequences as the default optimization levels (e.g., O1, O2, and O3 in LLVM) to match the features of the new language. In this case, if there are bugs in the optimization sequences, compiler developers could be frustrated, badly slowing down the development. Hence, it is critical to guarantee the correctness of optimization sequences.

Buggy Subsequences Exclusion. In the testing process, we exclude some optimization subsequences listed in Table 3 that can easily trigger duplicate bugs. However, the excluded subsequences do not always trigger duplicate bugs. For example, LLVM bug#3962622 can easily be triggered by optimization sequences containing the first subsequence in Table 3. In particular, the optimizer Opt always crashes when any program is optimized using “-early-cse-memssa -early-cse-memssa”, while the subsequence “-early-cse-memssa -gvn -early-cse-memssa” cannot trigger this bug. Even so, we think this exclusion strategy contributes to improving the testing efficiency. First, the excluded subsequences are manually summarized from many duplicate bugs, which indicates that optimization sequences containing these subsequences are likely to trigger duplicate bugs with high probability. This helps us to test more optimization sequences and reduce the time to analyze duplicate bugs. Second, when the corresponding bugs have been fixed, we will remove the restriction on these subsequences, such that deeper bugs caused by these subsequences can be detected. In the future, automation techniques may be developed to make the summarization of subsequences more precise and to filter out duplicate bugs to improve the testing efficiency.

Buggy Optimization Isolation Challenge. In our study, we treat the last optimization in a reduced optimization sequence as a buggy optimization. However, this strategy is not absolutely reliable, since the real reason for a bug may lie in any optimization of the reduced optimization sequence. This indicates that the results in Table 5 may not be accurate. For example, an assertion fails in LLVM bug#4226423 when the optimizer Opt optimizes a program using “-early-cse-memssa -die -gvn-hoist”. We then treat “-gvn-hoist” as the buggy optimization. Nevertheless, the developers show that the root cause is introduced by “-die”, since it does not correctly preserve the information generated by “-MemorySSA” (an analysis method in LLVM24). Hence, we may underestimate the effectiveness of CTOS in the experiments, since some bugs may be wrongly labeled as duplicates. However, we must make a tradeoff to avoid reporting too many duplicate bugs to developers. In practice, it may not be acceptable for developers to receive hundreds of reported bugs in a few days, where the majority of the bugs are duplicates. This situation is currently difficult to alleviate. First, to the best of our knowledge, there does not exist a perfect method to locate the real reasons for a compiler bug. In our work, we manually validate the reduced optimization sequences to guarantee that the bugs cannot be reproduced when omitting the last optimization of the sequences. Second, the developers of LLVM also utilize the same strategy to roughly determine whether bugs caused by optimization sequences are duplicates: since the optimizations in LLVM are mostly designed to operate independently, the developers always blame the last optimization in a reduced optimization sequence [46]. Thus, we believe that the results of Table 5 are reasonable. In the future, we plan to introduce advanced fault localization techniques [58], [59], [60], [61] to address this challenge.

22. https://bugs.llvm.org/show_bug.cgi?id=39626.
23. https://bugs.llvm.org/show_bug.cgi?id=42264.
24. https://www.llvm.org/docs/MemorySSA.html.

Limitation of the Selection Scheme. The experimental results show that CTOS is effective in detecting LLVM bugs caused by optimization sequences. However, the selection scheme in CTOS may be limited. In our study, the selection scheme is based on the hypothesis that the effects of two testing programs (or optimization sequences) for testing LLVM are similar if they are close to each other. Therefore, for a set of testing programs (or optimization sequences), our goal is to select representative testing programs (or optimization sequences) such that the total distances among them are maximized. Nevertheless, the selection scheme may miss some corner test cases, as it is difficult to know which testing program (or optimization sequence) can trigger a bug before execution. Besides, we currently use Csmith with the default configuration to generate the initial testing
programs, which may limit the diversity of the generated testing programs and affect the effectiveness of the proposed selection scheme. In future work, we will consider more advanced techniques (e.g., combining the selection scheme with coverage information) to select representative testing programs and optimization sequences that include more corner cases.

6 THREATS TO VALIDITY

Threats to Internal Validity. The threats to internal validity mainly lie in the implementation of CTOS. As mentioned in Section 3, the vector representations of the optimization sequences and testing programs rely on the Doc2Vec technique. Hence, the testing efficiency may be impacted by the implementation of Doc2Vec. To reduce this threat, we adopt the widely used tool Gensim [38], which has an efficient implementation of Doc2Vec. In addition, the parameters of Doc2Vec are currently set according to the documentation of Gensim and the suggestions in [39]. For the parameters of Algorithm 3, we set their values according to the hardware limitations of our system. We do not investigate the impact of these parameters on CTOS in this paper, due to the heavy time cost of fine-tuning the parameters of CTOS. There are in total 6 parameters, i.e., 4 parameters for Doc2Vec and 2 parameters for Algorithm 3. Assuming that each parameter has 10 candidate values, we would get 10⁶ parameter combinations. This large number of parameter combinations would require a long time to investigate, even though the testing period is only 90 hours (as set in RQ1) for evaluating one parameter combination. Despite this, the experimental results illustrate that CTOS can achieve good results under the parameter settings in our paper.

Besides, in our experiments for RQ1, the time to obtain testing programs and optimization sequences is not included in the testing period, which may cause unfair comparisons between CTOS and the baselines. For example, compared with a random strategy (i.e., RS and RP), CTOS spends 3-6 more hours generating testing programs and optimization sequences. However, we do not expect this to dramatically affect the experimental results in Table 2, because the time to obtain testing programs and optimization sequences (3-6 hours) is much smaller than the testing period (i.e., 90 hours). To investigate its potential impact on the results of CTOS, we analyze the number of bugs detected by CTOS in the first 80 hours. That is, we exclude 10 hours from the testing period, which are assumed to be used by CTOS to obtain testing programs and optimization sequences. As presented in the supplemental material,25 CTOS can find the majority of bugs (11.7 on average) in the first 80 hours; it outperforms all the baselines, which run for 90 hours. For example, CTOS(RP+RS) detects 9.8 bugs on average in 90 hours, which is 19.39 percent fewer than the number of bugs detected by CTOS in the first 80 hours. Thus, we believe that CTOS outperforms the baselines even when the time to obtain testing programs and optimization sequences is considered.

Threats to External Validity. The threats to external validity mainly lie in duplicate bugs and testing programs. First, many duplicate bugs are triggered in the testing process even though we have leveraged the selection scheme to obtain representative optimization sequences and testing programs. To alleviate this threat, we summarize the subsequences (listed in Table 3) that can easily trigger duplicate bugs and remove the optimization sequences that contain these subsequences from the next testing process until the corresponding bugs are fixed. In addition, we utilize the duplicate bug identification strategy described in Section 4.2 to identify duplicate bugs. However, this strategy may not be precise, which may influence the effectiveness of CTOS, because the root cause of a bug can be introduced by any optimization in the reduced optimization sequence. Since our strategy is also adopted by LLVM developers to identify duplicate bugs caused by optimizations in practice, it can still significantly reduce the negative influence of duplicate bugs on the experiments. In the future, we will consider applying advanced software fault localization techniques to improve the strategy for identifying duplicate bugs in our study.

Second, we utilize Csmith to generate the testing programs in this study. However, Csmith has been widely used to test LLVM for a long time, which makes LLVM, to a certain extent, resistant to it. In the future, advanced techniques (e.g., the test-program generation approach via history-guided configuration diversification [62]) may be employed to further improve the diversity of testing programs.

25. https://github.com/CTOS-results/LLVM-Bugs-by-Optimization-sequences/blob/master/Appendix.pdf.

7 RELATED WORK

7.1 Compiler Testing

Compiler testing is currently the most important technique to guarantee the quality of compilers. In the literature, compiler testing techniques fall into three categories, namely, Randomized Differential Testing (RDT), Different Optimization Levels (DOL), and Equivalence Modulo Inputs (EMI) [9], [15], [16], [17], [18]. For a given testing program, RDT detects compiler bugs by comparing the outputs of several compilers with the same specification. DOL is a variant of RDT that compares the outputs produced by the same compiler with different optimization levels to determine whether the compiler has bugs. Most of the techniques [12], [13], [14] belonging to RDT and DOL use randomly generated testing programs to test a compiler. Zhao et al. [12] develop a tool, called JTT, that automatically generates testing programs to validate the EC++ embedded compiler. In particular, Csmith [13], as the most successful random C program generator, has been widely used to test C compilers. Lidbury et al. [14] develop CLsmith based on Csmith to generate programs for testing OpenCL compilers. Different from RDT and DOL, EMI compares the outputs produced by equivalent variants of a seed program to detect compiler bugs. If an output is different from the others, the compiler contains a bug [9]. There are three instantiations of EMI, namely, Orion [9], Athena [15], and Hermes [17]. Orion randomly prunes unexecuted statements to generate variant programs [9], while Athena can delete code from or insert code into code regions that are not executed under the inputs [15]. In contrast to Orion and Athena, Hermes [17] can generate variant programs via mutation performed on both live and dead code regions. An empirical study conducted
by Chen et al. [19] compares the strength of RDT, DOL, and EMI, and reveals that DOL is more effective in detecting compiler bugs related to optimizations.

To accelerate compiler testing, a method [63] based on machine learning has been proposed to predict the bug-revealing probabilities of testing programs, such that the testing programs with large bug-revealing probabilities can be executed as early as possible. Recently, Chen et al. [20] present a more efficient technique to predict test coverage statically for compilers and then leverage the predicted coverage information to prioritize testing programs.

Our work is similar to DOL. However, unlike traditional DOL, which only considers the default optimization levels with fixed orders of optimizations, CTOS tests LLVM with arbitrary optimization sequences.

7.2 Compiler Phase-Ordering Problem

The compiler phase-ordering problem aims to improve the performance of target programs by selecting good optimization sequences [7], [8], [64]. Currently, two methodologies have been proposed to resolve the compiler phase-ordering problem. The approaches in the first category treat the compiler phase-ordering problem as an optimization problem and use evolutionary algorithms to resolve it. For example, Kulkarni et al. [65], [66] develop a method based on genetic algorithms for quickly searching effective optimization sequences. Purini et al. [4] propose a downsampling technique to reduce the infinitely large optimization sequence space. OpenTuner [2] uses ensembles of search techniques to find optimal optimizations for a program.

The approaches in the second category tackle the compiler phase-ordering problem based on machine learning [1], [3], [6]. Most recent methods based on machine learning for compiler auto-tuning have been introduced by the survey [64]. Milepost [1] is a machine-learning based compiler that automatically adapts its internal optimization heuristics to improve performance. Kulkarni and Cavazos [3] propose a method based on the Markov process to mitigate the compiler phase-ordering problem. Ashouri et al. [6] leverage optimization subsequences and machine learning to build a predictive model. Huang et al. [67] present AutoPhase, a deep reinforcement learning method to tackle the compiler phase-ordering problem for multiple high-level synthesis programs.

However, there is no guarantee that the programs optimized by different optimization sequences are correct. No systematic work has been conducted to detect compiler bugs caused by optimization sequences. We present CTOS to mitigate this problem and to further improve the reliability of optimization sequences.

8 CONCLUSION

In this study, we presented CTOS, a method based on differential testing, for catching compiler bugs caused by optimization sequences of LLVM. Rather than only testing compilers with predefined optimization sequences like the state-of-the-art methods, our technique validates compilers with arbitrary optimization sequences, which significantly increases the test efficiency for detecting bugs. Our evaluation demonstrates that CTOS significantly outperforms the baselines by detecting 24.76%–50.57% more bugs on average. Within only seven months, we have reported 104 valid bugs within 5 types, of which 21 have been confirmed or fixed. 47 unique optimizations are identified to be faulty and 15 of them are loop related optimizations.

For future work, we will keep actively testing LLVM with CTOS and report the detected bugs. Furthermore, we plan to design more efficient compiler fuzzing techniques with coverage information of LLVM to further improve the reliability of compiler optimizations.

ACKNOWLEDGMENT

We would like to thank the LLVM developers for analyzing and fixing our reported bugs. This work was supported in part by the National Natural Science Foundation of China under Grants 61772107, 61722202, 61902181, and 62032004.

REFERENCES

[1] G. Fursin et al., “Milepost GCC: Machine learning enabled self-tuning compiler,” Int. J. Parallel Program., vol. 39, no. 3, pp. 296–327, 2011.
[2] J. Ansel et al., “OpenTuner: An extensible framework for program autotuning,” in Proc. 23rd Int. Conf. Parallel Archit. Compilation, 2014, pp. 303–316.
[3] S. Kulkarni and J. Cavazos, “Mitigating the compiler optimization phase-ordering problem using machine learning,” in Proc. ACM Int. Conf. Object Oriented Program. Syst. Lang. Appl., 2012, pp. 147–162.
[4] S. Purini and L. Jain, “Finding good optimization sequences covering program space,” ACM Trans. Archit. Code Optim., vol. 9, no. 4, pp. 56:1–56:23, 2013.
[5] L. G. Martins, R. Nobre, J. M. Cardoso, A. C. Delbem, and E. Marques, “Clustering-based selection for the exploration of compiler optimization sequences,” ACM Trans. Archit. Code Optim., vol. 13, no. 1, 2016, Art. no. 8.
[6] A. H. Ashouri, A. Bignoli, G. Palermo, C. Silvano, S. Kulkarni, and J. Cavazos, “MiCOMP: Mitigating the compiler phase-ordering problem using optimization sub-sequences and machine learning,” ACM Trans. Archit. Code Optim., vol. 14, no. 3, pp. 29:1–29:28, Sep. 2017.
[7] D. B. Loveman, “Program improvement by source-to-source transformation,” J. ACM, vol. 24, no. 1, pp. 121–145, 1977.
[8] S. R. Vegdahl, “Phase coupling and constant generation in an optimizing microcode compiler,” ACM SIGMICRO Newslett., vol. 13, no. 4, pp. 125–133, 1982.
[9] V. Le, M. Afshari, and Z. Su, “Compiler validation via equivalence modulo inputs,” in Proc. 35th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2014, pp. 216–226.
[10] C. Lindig, “Random testing of C calling conventions,” in Proc. Int. Symp. Automated Anal.-Driven Debugging, 2005, pp. 3–12.
[11] F. Sheridan, “Practical testing of a C99 compiler using output comparison,” Softw.: Practice Experience, vol. 37, no. 14, pp. 1475–1488, 2007.
[12] C. Zhao, Y. Xue, Q. Tao, L. Guo, and Z. Wang, “Automated test program generation for an industrial optimizing compiler,” in Proc. ICSE Workshop Autom. Softw. Test, 2009, pp. 36–43.
[13] X. Yang, Y. Chen, E. Eide, and J. Regehr, “Finding and understanding bugs in C compilers,” ACM SIGPLAN Notices, vol. 46, no. 6, pp. 283–294, 2011.
[14] C. Lidbury, A. Lascu, N. Chong, and A. F. Donaldson, “Many-core compiler fuzzing,” in Proc. 36th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2015, pp. 65–76.
[15] V. Le, C. Sun, and Z. Su, “Finding deep compiler bugs via guided stochastic program mutation,” in Proc. ACM SIGPLAN Int. Conf. Object-Oriented Program. Syst. Lang. Appl., 2015, pp. 386–399.
[16] V. Le, C. Sun, and Z. Su, “Randomized stress-testing of link-time optimizers,” in Proc. Int. Symp. Softw. Testing Anal., 2015, pp. 327–337.
[17] C. Sun, V. Le, and Z. Su, “Finding compiler bugs via live code mutation,” in Proc. ACM SIGPLAN Int. Conf. Object-Oriented Program. Syst. Lang. Appl., 2016, pp. 849–863.
[18] Q. Zhang, C. Sun, and Z. Su, “Skeletal program enumeration for rigorous compiler testing,” in Proc. 38th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2017, pp. 347–361.
[19] J. Chen et al., “An empirical comparison of compiler testing techniques,” in Proc. 38th Int. Conf. Softw. Eng., 2016, pp. 180–190.
[20] J. Chen et al., “Coverage prediction for accelerating compiler testing,” IEEE Trans. Softw. Eng., vol. 47, no. 2, pp. 261–278, Feb. 2021.
[21] LLVM Compiler Community, “LLVM language reference manual,” 2021. [Online]. Available: https://llvm.org/docs/LangRef.html
[22] LLVM Compiler Community, “LLVM's analysis and transform passes,” 2021. [Online]. Available: https://www.llvm.org/docs/Passes.html
[23] Clang, “Clang: A C language family frontend for LLVM,” 2021. [Online]. Available: http://clang.llvm.org/
[24] Rust, “Rust program language,” 2021. [Online]. Available: https://www.rust-lang.org/
[25] Swift, “Swift program language,” 2021. [Online]. Available: https://developer.apple.com/swift/
[26] A. Haas et al., “Bringing the web up to speed with WebAssembly,” in Proc. 38th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2017, pp. 185–200.
[27] C. Cadar, D. Dunbar, and D. Engler, “KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in Proc. 8th USENIX Conf. Operating Syst. Des. Implementation, 2008, pp. 209–224.
[28] P. D. Schubert, B. Hermann, and E. Bodden, “PhASAR: An inter-procedural static analysis framework for C/C++,” in Proc. Int. Conf. Tools Algorithms Construction Anal. Syst., 2019, pp. 393–410.
[29] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. II-1188–II-1196.
[30] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[31] W. M. McKeeman, “Differential testing for software,” Digit. Tech. J., vol. 10, no. 1, pp. 100–107, 1998.
[32] Z. Harris, “Distributional structure,” Word, vol. 10, no. 23, pp. 146–162, 1954.
[33] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Addison-Wesley Longman Publishing Co., Inc., 2006.
[34] R. Johnson, D. Pearson, and K. Pingali, “The program structure tree: Computing control regions in linear time,” ACM SIGPLAN Notices, vol. 29, no. 6, pp. 171–185, 1994.
[35] J. Vanhatalo, H. Völzer, and J. Koehler, “The refined process structure tree,” Data Knowl. Eng., vol. 68, no. 9, pp. 793–818, 2009.
[36] Region graph, 2019. [Online]. Available: https://www.llvm.org/doxygen/regioninfo_8h_source.html
[37] D. Pelleg and A. Moore, “X-means: Extending K-means with efficient estimation of the number of clusters,” in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 727–734.
[38] R. Řehůřek and P. Sojka, “Software framework for topic modelling with large corpora,” in Proc. LREC Workshop New Challenges NLP Frameworks, 2010, pp. 45–50.
[39] J. H. Lau and T. Baldwin, “An empirical evaluation of doc2vec with practical insights into document embedding generation,” in Proc. 1st Workshop Representation Learn. NLP, 2016, pp. 78–86.
[40] D. A. Schult, “Exploring network structure, dynamics, and function using NetworkX,” in Proc. 7th Python Sci. Conf., 2008, pp. 11–15.
[41] A. Novikov, “PyClustering: Data mining library,” J. Open Source Softw., vol. 4, no. 36, Apr. 2019, Art. no. 1230. [Online]. Available: https://doi.org/10.21105/joss.01230
[42] A. Balestrat, “CCG: A random C code generator,” 2021. [Online]. Available: https://github.com/Merkil/ccg/
[43] V. Livinskii, D. Babokin, and J. Regehr, “Yarpgen,” 2021. [Online]. Available: https://github.com/intel/yarpgen
[44] R. Hodován, A. Kiss, and T. Gyimóthy, “Grammarinator: A grammar-based open source fuzzer,” in Proc. 9th ACM SIGSOFT Int. Workshop Automating TEST Case Des. Selection Eval., 2018, pp. 45–48.
[45] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, “Test-case reduction for C compiler bugs,” in Proc. 33rd ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2012, pp. 335–346.
[46] LLVM bug 40927, 2021. [Online]. Available: https://bugs.llvm.org/show_bug.cgi?id=40927
[47] LLVM bug 40926, 2021. [Online]. Available: https://bugs.llvm.org/show_bug.cgi?id=40926
[48] LLVM bug 40928, 2021. [Online]. Available: https://bugs.llvm.org/show_bug.cgi?id=40928
[49] LLVM bug 40929, 2021. [Online]. Available: https://bugs.llvm.org/show_bug.cgi?id=40929
[50] LLVM bug 40933, 2021. [Online]. Available: https://bugs.llvm.org/show_bug.cgi?id=40933
[51] LLVM bug 40925, 2021. [Online]. Available: https://bugs.llvm.org/show_bug.cgi?id=40925
[52] C. Nie and H. Leung, “A survey of combinatorial testing,” ACM Comput. Surv., vol. 43, no. 2, pp. 11:1–11:29, Feb. 2011.
[53] D. R. Kuhn, R. N. Kacker, and Y. Lei, “SP 800-142. Practical combinatorial testing,” Nat. Inst. Standards Technol., 2010.
[54] A. Groce, C. Zhang, E. Eide, Y. Chen, and J. Regehr, “Swarm testing,” in Proc. Int. Symp. Softw. Testing Anal., 2012, pp. 78–88.
[55] J. Chen, G. Wang, D. Hao, Y. Xiong, H. Zhang, and L. Zhang, “History-guided configuration diversification for compiler test-program generation,” in Proc. 34th IEEE/ACM Int. Conf. Automated Softw. Eng., 2019, pp. 305–316.
[56] A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in Proc. 33rd Int. Conf. Softw. Eng., 2011, pp. 1–10.
[57] C. Sun, V. Le, and Z. Su, “Finding and analyzing compiler warning defects,” in Proc. 38th Int. Conf. Softw. Eng., 2016, pp. 203–213.
[58] J. Chen, J. Han, P. Sun, L. Zhang, D. Hao, and L. Zhang, “Compiler bug isolation via effective witness test program generation,” in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 223–234.
[59] S. Pearson et al., “Evaluating and improving fault localization,” in Proc. 39th Int. Conf. Softw. Eng., 2017, pp. 609–620.
[60] Y. Chen et al., “Taming compiler fuzzers,” in Proc. 34th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2013, pp. 197–208.
[61] J. Holmes and A. Groce, “Using mutants to help developers distinguish and debug (compiler) faults,” Softw. Test. Verification Rel., vol. 30, no. 2, 2020, Art. no. e1727.
[62] J. Chen, G. Wang, D. Hao, Y. Xiong, H. Zhang, and L. Zhang, “History-guided configuration diversification for compiler test-program generation,” in Proc. 34th IEEE/ACM Int. Conf. Automated Softw. Eng., 2019, pp. 305–316.
[63] J. Chen, Y. Bai, D. Hao, Y. Xiong, H. Zhang, and B. Xie, “Learning to prioritize test programs for compiler testing,” in Proc. IEEE/ACM Int. Conf. Softw. Eng., 2017, pp. 700–711.
[64] A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano, “A survey on compiler autotuning using machine learning,” ACM Comput. Surv., vol. 51, no. 5, 2018, Art. no. 96.
[65] P. A. Kulkarni, S. Hines, J. Hiser, D. B. Whalley, J. W. Davidson, and D. L. Jones, “Fast searches for effective optimization phase sequences,” in Proc. ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2004, pp. 171–182.
[66] P. A. Kulkarni, S. R. Hines, D. B. Whalley, J. D. Hiser, J. W. Davidson, and D. L. Jones, “Fast and efficient searches for effective optimization-phase sequences,” ACM Trans. Archit. Code Optim., vol. 2, no. 2, pp. 165–198, 2005.
[67] A. Haj-Ali et al., “AutoPhase: Compiler phase-ordering for high level synthesis with deep reinforcement learning,” in Proc. IEEE 27th Annu. Int. Symp. Field-Programmable Custom Comput. Mach., 2019, pp. 308–308, doi: 10.1109/FCCM.2019.00049.

He Jiang (Member, IEEE) received the PhD degree in computer science from the University of Science and Technology of China, Hefei, China. He is currently a professor with the Dalian University of Technology, China. He is also a member of the ACM and the CCF (China Computer Federation). He is one of the ten supervisors for the Outstanding Doctoral Dissertation of the CCF in 2014. His current research interests include search-based software engineering (SBSE) and mining software repositories (MSR). His work has been published at premier venues like ICSE, SANER, and GECCO, as well as in major IEEE transactions like IEEE Transactions on Software Engineering, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), IEEE Transactions on Cybernetics, and IEEE Transactions on Services Computing.
Zhide Zhou received the BS degree in computer science and technology from the Guilin University of Electronic Technology, Guilin, China, in 2013. He is currently working toward the PhD degree at the Dalian University of Technology, Dalian, China. He is a student member of the China Computer Federation (CCF). His current research interests include intelligent software engineering, software testing, and program analysis techniques.

Jingxuan Zhang received the PhD degree in software engineering from the Dalian University of Technology, Dalian, China. He is a lecturer with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His current research interests include mining software repositories and software data analytics.