
The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

ALPHAPROG: Reinforcement Generation of Valid Programs for Compiler Fuzzing

Xiaoting Li¹*, Xiao Liu²*†, Lingwei Chen³†, Rupesh Prajapati¹, Dinghao Wu¹
¹Pennsylvania State University, University Park, PA, USA
²Facebook, Inc., USA
³Wright State University, Dayton, OH, USA
[email protected], [email protected], [email protected], [email protected], [email protected]

*These authors contributed equally. †Work done while at PSU.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Fuzzing is a widely used testing technique for assuring software robustness. However, automatically generating high-quality test suites is challenging, especially for software that takes highly structured inputs, such as compilers. Compiler fuzzing remains difficult because generating large numbers of syntactically and semantically valid programs is not trivial. Most previous methods either depend on human-crafted grammars or use heuristics to learn partial language patterns; both suffer from the completeness issue, a classic puzzle in software testing. To mitigate this problem, we propose a knowledge-guided, reinforcement learning-based approach to generating valid programs for compiler fuzzing. We first design a naive learning model that evolves with the sequential mutation rewards provided by the target compiler under test. By iterating the training cycle, the model learns to generate valid programs that also improve testing efficacy. We implement the proposed method in a tool called ALPHAPROG. We analyze the framework with four different reward functions, and our study reveals the effectiveness of ALPHAPROG for compiler testing. We also reported two important bugs in a production compiler that were confirmed and addressed by the project owner, which further demonstrates the applied value of ALPHAPROG in practice.

Introduction

Compilers are among the most critical components of computing systems. Although vast research resources have been devoted to verifying production compilers, they still contain bugs, and their robustness requires improvement (Sun et al. 2016). Unlike application bugs, errors in compilers are usually harder to find: the compiler is not the first place a developer sets breakpoints when debugging unexpected behavior during compilation, and it is presumed correct by most application developers. However, such compiler bugs can be exploited to launch attacks, resulting in serious security threats. For example, as demonstrated by researchers (David 2018), an attacker can enable a stealth backdoor in Microsoft Visual Studio with legitimate code by merely exploiting a simple bug in the MASM compiler. It is therefore critical to enforce the validity of compilers with more advanced techniques.

Testing is widely adopted (Chen et al. 2013; Regehr et al. 2012) to verify the correctness and robustness of compilers. As a random test-case generation paradigm, fuzzing has proven effective at improving testing efficacy and detecting software bugs (Kossatchev and Posypkin 2005; Zalewski 2014; Chen et al. 2016; Takanen et al. 2018), and it can be categorized into black-box fuzzing and white-box fuzzing. The main difference between fuzzing and testing is that fuzzing focuses on program crashes and hangs, whereas testing is more general and aims to detect all kinds of syntactic and semantic errors with well-defined sanitizers. Although black-box fuzzing is efficient for general software, existing techniques are not applicable to compiler testing, where highly structured inputs are required.

To generate high-quality programs for compiler testing, researchers have proposed rigorous generation engines that encode formal language grammars for whole-program generation (Yang et al. 2011). Typically, these engines follow both syntactic and semantic rules to generate effective test programs. However, constructing the grammars in a generation engine takes human effort and expert knowledge, and, as most owners of such fuzzing engines acknowledge, only a subset of the whole language grammar is encoded. To reduce human labor, researchers have proposed using deep neural networks to learn language patterns from existing programs (Cummins et al. 2018; Liu et al. 2019; Godefroid, Peleg, and Singh 2017). Based on a sequence-to-sequence model, language patterns in terms of production rules can be acquired and then used to generate new programs. The neural networks can automatically capture most syntactic features and generate new tests effectively, but their success rate depends on the data chosen for training the model and serving as seeds. Without valid and diverse test suites built by programmers, such machine-learning-based approaches usually do not work as expected.

To address this challenge, in this study we build a deep-learning-based framework that is free from this dataset dependency. Specifically, we propose a reinforcement learning framework (Sutton, Barto et al. 1998) to bootstrap, from scratch, the neural networks that encode language patterns, using as an oracle the messages and runtime information returned during compilation. Reinforcement learning has shown its potential in program analysis (Bunel et al. 2018; Böttinger, Godefroid, and Singh 2018; Verma et al. 2018) due to its capability of achieving learning goals in uncertain and complex environments. In our case, we use it to generate new programs within a limited length. We then ask the compiler to compile each generated program and collect both the returned message and the runtime information, i.e., the execution trace, to calculate a designed reward for training the model. As more programs are generated, the neural network becomes better trained to craft new programs toward our expectations. To achieve high compiler-testing efficacy, we construct coverage-guided reward functions that balance program validity against the testing-coverage improvement of the target compiler. In such a manner, the trained neural network eventually learns to generate valid and diverse test suites.

We built the proposed framework into a prototyping tool called ALPHAPROG. To evaluate the practicality of our approach, we deployed ALPHAPROG on an esoteric language called BrainFuck (Müller 1993) (BF hereafter), a Turing-complete programming language that contains only eight operators. We explored the effectiveness of ALPHAPROG by testing an industrial-grade BF compiler called BFC (Hughes 2019). We compared the results of ALPHAPROG under four different reward functions for compiler fuzzing, and ALPHAPROG achieves promising performance in terms of validity and testing efficacy. During the analysis, we also detected two important bugs in this target compiler. After we reported both issues, they were actively addressed by the project owner and have already been fixed in a new release.

Overview

If we view a program as a string of characters of a language, we can model the program generation task as a Markov Decision Process (MDP) (Markov 1954). An MDP is a 4-tuple (S, A, P_a, R_a), where S is a finite set of states and A is a finite set of actions, each action being a transition between two states. Given a state s ∈ S, the probability of taking action a ∈ A and reaching state s′ is P_a(s, s′); accordingly, the agent receives an immediate reward R_a(s, s′), where s ∈ S is the current state and s′ ∈ S is the state after the action. At training iteration t, one action a_t ∈ A(s_t) is selected and performed. Once the environment receives the current state s_t and action a_t, it responds with a numerical reward r_{t+1} and moves the model to a new state s_{t+1}. In our context, we choose the best character to generate based on the current program state and append new characters iteratively to the current character string until EOF. The generation of EOF may vary; a simple implementation places EOF at a fixed length. The core problem of an MDP is to find a policy π for making action decisions in a specific state s, that is, an update of the transition probabilities P_a(s, s′) that achieves the optimal reward R_a(s, s′). In the fuzzing task, the probability of each transition is learned by neural networks to achieve an optimal reward that combines two important metrics: (1) the validity of the generated programs and (2) compiler testing coverage. The validity of a generated string is confirmed by the messages returned from compilation, which indicates how well the policy conforms to the formal language production rules. Compiler testing coverage is calculated by analyzing the runtime information of each compilation.

Designed Framework

In this work, we propose a reinforcement learning framework based on Q-learning to generate BF code for fuzzing BF compilers. The generation process is illustrated in Figure 1. In this framework, there are essentially two main components: the fuzzing agent and the environment. The fuzzing agent, i.e., the provided neural network, tries to generate a new program following its best current policy, and the environment, i.e., the compiler, provides a scalar reward for evaluating the synthesized program. To generate a new program, the neural network takes in a base string x_t and predicts new characters. The generated program y_t is a new string obtained by appending a new character to the base string. The model evaluates the quality of this new program and calculates a scalar reward r_t according to the message and execution trace from the compilation, which is used to train the neural network iteratively. The model evolves and is optimized gradually as more and more strings are generated and evaluated. In this section, we detail the model configuration in our framework and elaborate on the defined reward functions.
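To make the agent-environment interaction concrete, the following minimal Python sketch outlines one generation episode under our reading of this section; the `agent` and `environment` interfaces, the length limit of 20, and the ε-greedy exploration inside the loop are illustrative assumptions rather than the released implementation.

import random

BF_CHARS = ['>', '<', '+', '-', '.', ',', '[', ']']  # the eight BF operators
LENGTH_LIMIT = 20                                     # assumed fixed EOF length


def run_episode(agent, environment, epsilon=0.1):
    """Generate one program character by character and record transitions.

    `agent.predict(program)` is assumed to return the index of the next
    character to append; `environment.evaluate(program)` is assumed to
    compile the finished program and return the scalar reward described
    in the Reward subsection.
    """
    program = ''                        # base string x_t starts empty
    transitions = []                    # (state, action, reward, next_state, done)
    while len(program) < LENGTH_LIMIT:
        if random.random() < epsilon:              # explore
            action = random.randrange(len(BF_CHARS))
        else:                                      # exploit the Q-network
            action = agent.predict(program)
        next_program = program + BF_CHARS[action]  # generated string y_t
        done = len(next_program) >= LENGTH_LIMIT
        # Intermediate steps receive reward 0; only the finished program
        # is handed to the compiler for evaluation.
        reward = environment.evaluate(next_program) if done else 0.0
        transitions.append((program, action, reward, next_program, done))
        program = next_program
    return program, transitions

The recorded transitions feed the Q-learning update described later in the Training section.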
Action-State Value

Unlike traditional Q-learning, deep Q-learning leverages a deep neural network to improve the scalability of the model for tasks with a large state-action space. In our design, once the current state is observed, the trained action-state network predicts an action, that is, it selects a character from the BF language to append in the next step. To deal with strings of different lengths, we use a simple LSTM model for sequence embedding. In particular, we use an LSTM layer with 128 neurons followed by two fully connected hidden layers with 100 and 512 neurons, respectively. For each layer, we adopt ReLU (Maas, Hannun, and Ng 2013) as the activation function. The size of the output layer is 8 (corresponding to BF's eight different characters), which allows the model to predict the character to append with the highest value.
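A compact sketch of this architecture in tf.keras is shown below; the layer sizes follow the description above, while the one-hot input encoding, the fixed padding length, and the linear output activation are our assumptions (the optimizer and learning rate follow the Settings section).

import tensorflow as tf

NUM_BF_CHARS = 8  # '>', '<', '+', '-', '.', ',', '[', ']'


def build_q_network(max_len=20):
    """Q-network sketch: LSTM sequence embedding plus two dense hidden layers."""
    model = tf.keras.Sequential([
        # Partial programs are assumed to be padded and one-hot encoded
        # into a (max_len, NUM_BF_CHARS) tensor.
        tf.keras.layers.LSTM(128, input_shape=(max_len, NUM_BF_CHARS)),
        tf.keras.layers.Dense(100, activation='relu'),
        tf.keras.layers.Dense(512, activation='relu'),
        # One Q-value per BF character; the argmax is the character to append.
        tf.keras.layers.Dense(NUM_BF_CHARS),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
                  loss='mse')
    return model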
Reward

The reward function is key to a reinforcement learning framework, as it indicates the learning direction. In the compiler fuzzing task, there are two main goals: (a) the generated programs should be valid, and (b) the generated programs should be as diverse as possible. For validity, the generated programs are supposed to be both syntactically and semantically valid. There are several stages in the compilation process, and if the test code is rejected during an early stage, such as syntax analysis, the compilation is terminated and the remaining execution paths are not tested. Thus, the validity of the generated test programs is important for the fuzzing task. In addition to validity, diversity is another goal from the perspective of testing efficacy. If similar tests are generated, then even though they are valid and compile successfully on the target compiler, we cannot achieve any testing-coverage improvement, and we will not be able to trigger more unknown flaws or vulnerabilities in the compiler. In other words, we prefer that more legitimate language patterns be explored and encoded into the neural networks, rather than synthesizing test code in vain with the same patterns. In our design, we set up four different reward functions for the learning process, which illustrate the two learning goals and how to achieve a balance between them.
[Figure 1: Compiler Fuzzing Process of ALPHAPROG. The fuzzing agent, a sequence-to-sequence model (embedding, encoder, decoder), sends an action to the environment, which returns validity, coverage, and bug feedback.]

won’t be able to trigger more unknown flaws or vulnerabili- no branches except for the entry and exit points, which is
ties in compilers. In other words, we prefer more legitimate considered as one of the important atomic units to measure
language patterns are explored and encoded into the neural code coverage. In this regard, we have the reward
networks rather than synthesizing test code in vain with the X
same patterns. In our design, we set up four different reward R2 = B(Tp )/ B(Tρ ). (2)
ρ∈I ′
functions for the learning process which demonstrates the
two different learning goals and how to achieve the balance In this reward function, B(Tp ) is the number of unique basic
in between. blocks of the execution trace of a program p and I ′ is all the
programs generated from this test suite where we compute
Reward 1 First, considering the syntactic and semantic the total number of unique basic blocks created so far.
validity, we set the reward function as
Reward 3 Third, to consider both code validity and di-
0, length is less than limit
(
versity, we further formulate a combination of their reward
R1 = −1, compilation error (1) metrics as the new reward function, which is accordingly
1, compilation success specified as
where for any intermediate programs during a generation 0, length is less than limit
(
episode, we give it a reward of 0 until its length hits our R3 = −1, compilation error (3)
restriction. To collect the compilation feedback and verify 1 + R2 , compilation success
the validity of a synthesized program, we use a production In this reward function, for all the generated programs that
compiler to parse the generated program and evaluate its cor- are compiled successfully, we use the portion of the newly
rectness based on the compilation messages. tested basic blocks as the reward. For the other two cases,
Compilation Message: Usually, there are five kinds of we still return reward 0 when the program length does not
compilation messages: no errors or warning means that the hit the limit, and −1 when the program is not compilable.
program is successfully compiled to an executable without
any conflict to the hard or soft rules defined by the compiler; Reward 4 In the fourth scenario, we further add a control-
(2) errors represents that the program does not conform syn- flow complexity of the synthesized programs into consider-
tactic or semantic checks and hits the exceptions that termi- ation based on the previous reward metrics. According to
nate the compilation process; (3) internal errors indicates Zhang et al.’s study (Zhang, Sun, and Su 2017), the increase
an error (bug) of the compiler where the compiler does not of control-flow complexity of programs in the test suites
conform the pre-defined assertions during the compilation; will remarkably improve the testing efficacy of the corre-
(4) warnings is the sign that the compilation succeeds but sponding compilers. The effective testing coverage can be
there are some soft rules that have not been met, such as improved up to 40% by simply switching the positions of
the program contains some meaningless sequences; and (5) variables in each program within the GCC test suite. In our
hangs depicts the compilation falls into some infinite loops design, we add the cyclomatic complexity (Watson, Wallace,
and it cannot exit in a reasonable time. We consider three and McCabe 1996) of the synthesized programs into our re-
cases among these compilation messages as the indicator for ward metrics which is used to describe program control-flow
a valid program: no errors or warning, warnings, and inter- complexity. Then we have new reward function,
nal errors. Theoretically, this reward metric should guide the R4 = R3 + C(p)/max(C(ρ : ρ ∈ I ′ )). (4)
model to synthesize programs that are valid with least effort
In this function, C(∗) is the cyclomatic complexity of a
such that the same character can be repeatedly generated all
program. We simply add the cyclomatic complexity of a syn-
the time in a synthesized program.
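As a concrete illustration of how such an outcome can be mapped to R_1, the Python sketch below compiles a finished program and classifies the result; the compiler command, the timeout used to detect hangs, and the string match on "internal" are illustrative assumptions, since the paper does not specify how BFC is invoked.

import subprocess
import tempfile


def reward_r1(program, compiler_cmd=('bfc',), timeout=30, length_limit=20):
    """Reward 1 sketch: map a BF program's compilation outcome to {0, -1, 1}."""
    if len(program) < length_limit:
        return 0                     # intermediate program, no reward yet
    with tempfile.NamedTemporaryFile('w', suffix='.bf', delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(list(compiler_cmd) + [path],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return -1                    # hang: logged as a potential bug, not a valid program
    stderr = result.stderr.decode(errors='ignore').lower()
    if result.returncode == 0:
        return 1                     # success, possibly with warnings
    if 'internal' in stderr:
        return 1                     # internal compiler error still counts as a valid input
    return -1                        # ordinary syntax or semantic error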
Reward 2  Second, to measure the diversity of the synthesized programs, we use the unique basic blocks covered on the compilation trace by the generated test suite as the testing coverage. In compiler construction, a basic block of an execution trace is defined as a straight-line code sequence with no branches except at its entry and exit points, and it is considered one of the important atomic units for measuring code coverage. In this regard, we have the reward

R_2 = B(T_p) / \sum_{\rho \in I'} B(T_\rho).    (2)

In this reward function, B(T_p) is the number of unique basic blocks on the execution trace T_p of a program p, and I′ is the set of all programs generated from this test suite, over which we compute the total number of unique basic blocks covered so far.
Reward 3  Third, to consider both code validity and diversity, we further formulate a combination of the two reward metrics as a new reward function, specified as

R_3 = \begin{cases} 0, & \text{length is less than the limit} \\ -1, & \text{compilation error} \\ 1 + R_2, & \text{compilation success} \end{cases}    (3)

In this reward function, for all generated programs that compile successfully, we use the fraction of newly tested basic blocks as part of the reward. For the other two cases, we still return a reward of 0 when the program length has not hit the limit and −1 when the program is not compilable.

Reward 4  In the fourth scenario, we further add the control-flow complexity of the synthesized programs into consideration, on top of the previous reward metrics. According to Zhang et al.'s study (Zhang, Sun, and Su 2017), increasing the control-flow complexity of the programs in a test suite remarkably improves the testing efficacy for the corresponding compilers; the effective testing coverage can be improved by up to 40% by simply switching the positions of variables in each program within the GCC test suite. In our design, we add the cyclomatic complexity (Watson, Wallace, and McCabe 1996) of the synthesized programs, which describes program control-flow complexity, to our reward metrics. We then have the new reward function

R_4 = R_3 + C(p) / \max\{C(\rho) : \rho \in I'\}.    (4)

In this function, C(·) is the cyclomatic complexity of a program. We simply add to the previous reward function R_3 the cyclomatic complexity of the synthesized program divided by the maximum value observed so far. In other words, if the synthesized program has not hit the length limit, we give it a reward of 0, and if it is not valid, we give it a reward of −1; otherwise, the reward is a combination of program validity, testing coverage, and program control-flow complexity.
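Putting the pieces together, a minimal sketch of R_3 and R_4 under these definitions (the boolean flags and the running maximum of the complexity are bookkeeping we introduce for illustration):

def reward_r3(length_ok, valid, r2):
    """Reward 3 sketch: validity plus the coverage term R2."""
    if not length_ok:
        return 0.0
    if not valid:
        return -1.0
    return 1.0 + r2


def reward_r4(length_ok, valid, r2, complexity, max_complexity):
    """Reward 4 sketch: additionally reward control-flow complexity.

    `complexity` is the cyclomatic complexity C(p) of the synthesized program
    (derived in the paper from the LLVM optimizer's control-flow graphs) and
    `max_complexity` is the largest value observed so far over I'.
    """
    base = reward_r3(length_ok, valid, r2)
    if base <= 0.0:                   # length not reached, or program invalid
        return base
    return base + complexity / max(max_complexity, 1)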
Training

During the training stage, we bootstrap the deep neural network for program generation. Each update takes in a current program x with state s, the action a that moves x to a next state s′, the reward r calculated from the compilation, and the current Q-network. For a given state, this Q-network predicts the expected rewards of all defined actions simultaneously. We update the Q-network to move the predicted value Q(s_t, a_t) toward the target r + γ max_a Q(s_{t+1}, a) by minimizing the loss on the deviation between them, where γ is a discount rate between 0 and 1. A value closer to 1 indicates a goal targeted at long-term reward, while a value closer to 0 means the model is more greedy. The trade-off between exploration and exploitation during training is a dilemma frequently faced in reinforcement learning. In our program generation problem, exploitation leans on the trained model to search for new conforming programs as much as possible, while exploration means the fuzzing agent randomly chooses a character, allowing the generated sequences to vary. In our method, we employ the ε-greedy method in the training process to balance exploration and exploitation: with probability ε, our model chooses a random action, and with probability 1 − ε, it follows the prediction of the neural network. In the implementation, we make the value of ε decay, such that at earlier stages of training the chance of choosing a random action is higher, and the probability goes down in proportion to the number of predictions. This means we gradually rely on the trained neural network rather than random guesses as the model matures.
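A minimal sketch of this update step, consistent with the episode sketch given earlier (the `encode` function that turns a partial program into the network's input tensor is assumed, and terminal transitions skip the bootstrap term):

import numpy as np


def dqn_update(q_network, encode, transitions, gamma=1.0):
    """Move Q(s_t, a_t) toward r + gamma * max_a Q(s_{t+1}, a) for a batch of transitions."""
    states = np.stack([encode(s) for s, _, _, _, _ in transitions])
    next_states = np.stack([encode(s2) for _, _, _, s2, _ in transitions])
    targets = q_network.predict(states, verbose=0)
    next_q = q_network.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(transitions):
        # For the final step of an episode there is no next state to bootstrap from.
        targets[i, action] = reward if done else reward + gamma * np.max(next_q[i])
    # Fitting on the modified targets minimizes the squared deviation
    # between the predicted and target Q-values for the taken actions.
    q_network.fit(states, targets, epochs=1, verbose=0)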
Experiment

To evaluate our prototyping tool ALPHAPROG, we perform studies in which the model is trained toward the two different goals by setting the reward functions described in the Reward section. We log the valid rate and the testing-coverage improvement during the learning process; the analysis confirms our expectation about the leading role of the different reward functions. To demonstrate the testing ability, we compare our tool with random fuzzing using 30,000 newly generated programs, in terms of testing efficacy. To illustrate its effectiveness at generating more diverse programs, we also study the generated programs to explain the evolution of the training model. In this section, we report the detailed implementation of ALPHAPROG and discuss the experiments we conducted.

Settings

We build ALPHAPROG on an existing framework of binary instrumentation and neural network training. The core framework of the deep Q-learning module is implemented in Python 3.6. In our implementation, the program execution trace is generated by Pin (Luk et al. 2005), a widely used dynamic binary instrumentation tool. We develop a Pin plug-in to log the executed instructions. Additionally, we develop a coverage analysis tool based on the execution trace that reports all the basic blocks touched so far; it also reports whether, and how many, new basic blocks in the compiler code are covered by a given new program. Our environment also logs and reports abnormal crashes, memory leaks, and failing assertions of the compiler, with the help of the internal-error alarms in the compilation messages.

The Q-learning network is implemented in TensorFlow (Abadi et al. 2016), using an LSTM layer for sequence embedding connected to a 2-layer encoder-decoder network. The initial weights are randomly and uniformly distributed within w ∈ [0, 0.1]. We choose a discount rate γ = 1 to address the long-term goal and a learning rate α = 0.0001 for the gradient descent optimizer. We assign ε_max = 1 and ε_min = 0.01 with a decay of (ε_max − ε_min)/100000 after each prediction; as a result, the model stops exploring after episode 20,000. We will open source our prototyping tool ALPHAPROG for public dissemination after the paper is accepted.
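For reference, the stated hyperparameters translate into the following schedule; the helper below is our own bookkeeping and simply restates the linear per-prediction decay:

# Hyperparameters as reported above.
GAMMA = 1.0            # discount rate, favoring the long-term goal
LEARNING_RATE = 1e-4   # step size of the gradient descent optimizer
EPS_MAX, EPS_MIN = 1.0, 0.01
DECAY_STEPS = 100000   # epsilon shrinks by (EPS_MAX - EPS_MIN)/DECAY_STEPS per prediction


def epsilon_after(num_predictions):
    """Linearly decayed exploration rate after a given number of predictions."""
    step = (EPS_MAX - EPS_MIN) / DECAY_STEPS
    return max(EPS_MIN, EPS_MAX - step * num_predictions)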
Validity

Generating valid programs is one of our important goals. We evaluate the valid rate of the generated programs during the training process. The four reward functions are designed toward two different goals for program generation. We report the number of valid programs per 1,000 generated programs in Figure 2.

Reward 1: Reward 1 represents learning toward generating only valid programs. From Figure 2 we can see that, as the number of generated programs increases, the valid rate grows quickly, and by 20,000 generated programs the valid rate reaches 100%. This result implies that, once the easiest pattern for generating a valid program is found by random generation, e.g., ,,,,,,,, or >>>>>>>>, the network converges to it quickly and stops learning anything new. The model trained with this reward function achieves the highest rate of valid programs in the synthesis procedure.

Reward 2: Reward 2 represents learning toward generating diverse programs that improve testing coverage for a target compiler. Without balancing against syntactic and semantic validity, we anticipate that this reward produces more diverse program patterns but that fewer of them are valid. The results in Figure 2 show that the valid rate stays the lowest most of the time, which means the generation engine is inefficient at learning to produce valid programs from a reward based purely on coverage.

Reward 3: Reward 3 sets up the goal of combining validity and diversity. At a high level, generating valid programs and generating diverse programs are two conflicting goals. To generate valid programs, the model only needs to know one simple pattern that fits the language grammar; for example, in the experiment with Reward 1, the model only learns that by appending , to whatever prefix it has, it can produce a valid program. However, if the goal becomes generating diverse programs, different characters have to be tried, which makes validity easy to violate. The model trained with this reward function achieves the second-highest rate of valid programs in the synthesis procedure. From Figure 2, we observe that the valid rate keeps fluctuating but is increasing overall and approaches 90% at the final stage.

Reward 4: Reward 4 sets up the goal of adding program control-flow complexity on top of synthesis validity and diversity. From related studies, we know that the control-flow complexity of the programs in a test suite is one of the most important factors for improving testing efficacy for compilers. We anticipate that adding this factor to the reward function helps improve the testing coverage of the target compiler without compromising program validity too much. From Figure 2 we find that the model trained with this reward function achieves the third-highest rate of valid programs in the synthesis procedure.
[Figure 2: Code validity under four reward functions (x-axis: # of Tests (K); y-axis: Valid Programs per 1K; one curve per reward function).]

[Figure 3: Testing coverage under four reward functions (x-axis: # of Tests; y-axis: # Unique Basic Blocks Tested; one curve per reward function).]

Testing Coverage

Coverage improvement is the most important metric for software testing. Traditionally, it denotes the overall lines, branches, or paths in the target software that are visited by a given set of test cases. In the design of ALPHAPROG, to improve performance in this end-to-end learning process, we adopt an approximation of the overall testing coverage, namely the accumulated number of unique basic blocks executed by the newly generated programs. A basic block of an execution trace is a straight-line code sequence with no branches except at its entry and exit points. To capture the overall number of unique basic blocks, we first collect the unique basic blocks B(T_p) of each execution trace T_p and then maintain the accumulated number of unique basic blocks B(I′) by taking the union of the new basic blocks on the current trace with those already visited. In the experiments, we log the accumulated testing coverage for the four reward functions adopted in the framework. We compare their coverage improvements and display the results in Figure 3.

Reward 1: The blue line shows the accumulated compiler testing coverage of the programs generated under Reward 1. With this reward, the coverage improves drastically at the beginning of training but stops growing after episode 11,000. In the corresponding validity figure, we also notice that the valid rate reaches 100% around episode 11,000, which is very close to the point where coverage converges. This is because our model finally converges to producing , or > for every action; although the generated programs are 100% valid, they no longer improve the testing coverage. This result confirms the analysis from the validity experiment.

Reward 2: The red line shows the accumulated compiler testing coverage of the programs generated under Reward 2. With this reward, the coverage also increases drastically in the early stages of training, and it still grows slowly after the improvement under Reward 1 has stopped, but the pace is not as fast as under Reward 3. In the corresponding validity figure, our model scarcely generates valid programs under Reward 2, yet the generated ones are pushed to be diverse and hit different parts of the target compiler, which eventually improves the testing coverage, albeit with lower efficiency.

Reward 3: The green line shows the accumulated compiler testing coverage of the programs generated under Reward 3. With this reward, the testing coverage rises dramatically in the early stages and keeps increasing until it eventually reaches the second-highest coverage. We also notice that the coverage improves periodically; in the validity figure (Figure 2), we can observe a regular, wave-like increase. We interpret this as the model always being driven to generate valid programs by the frequent validity reward, while periodically being guided to generate new patterns that yield higher reward. In this case, the generated programs achieve a good trade-off between validity and diversity.

Reward 4: The orange line shows the accumulated compiler testing coverage of the programs generated under Reward 4. With this reward, the coverage improves as drastically as under Reward 3 in the early learning stages, and it keeps increasing until it reaches the highest value among the four designed reward functions. Although the final valid rate under Reward 4 is lower than under Reward 1 and Reward 3, the testing coverage beats both of them. The reason Reward 4 achieves better testing coverage than Reward 1 is straightforward, as the latter naively depends on the validity of the code. It is more subtle that Reward 4 outperforms Reward 3; we interpret it as a side effect of learning from the program control-flow complexity. On one hand, higher control-flow complexity is a more direct and immediate reward for improving testing coverage, which pushes the fuzzing agent to generate programs that require more optimizations during compilation. On the other hand, it sets up the goal of synthesizing a complex program in every episode, which is not considered under Reward 3. To improve testing coverage under Reward 3, the fuzzing agent needs to learn new language patterns, whereas under Reward 4 the fuzzing agent must additionally learn how to combine the newly learned language patterns efficiently, because the total sequence length is limited.
Episode | Cyclomatic Complexity | Program
101     | 2  | [+, <>++[>..],-+<+[,]-,[<].<-[],>,[>. <[+]]+><<]
1786    | 11 | [>[,,[... - [<]>+, .+-,. .-.,],].]> .,+[>]>. +..+.
5096    | 32 | <-+[. <,[.,-] +]> -.+++<++-.>-,[>.,+,] -<- --[]
10342   | 39 | -<[>.<.<.><,]<-<[<.-. ] -,[>- <>++-[],. ]>>-+[,<]

Table 1: Synthesis Examples

Synthesis Examples

To demonstrate how the control-flow complexity of the synthesized programs grows, we show four cases generated during different episodes using the model trained under Reward 4; the synthesized programs are displayed in Table 1. We extract their abstracted control-flow graphs from the control-flow graphs produced by the LLVM machine-independent optimizer and observe an obvious trend toward complexity. We can also observe that, as learning moves forward, the fuzzing agent learns to synthesize more complex programs with higher cyclomatic complexity. Moreover, we calculate the average cyclomatic complexity of the programs generated under the four reward functions; the results increase from 4 to 18, which is consistent with the test-coverage metric. This confirms that, with more designed heuristics integrated into the learning rewards, the fuzzing agent can become aware of, and thus reinforce, the generation goal to craft more effective code patterns.

Comparison with AFL

AFL (Zalewski 2014) is a mature fuzzing tool that has been widely used on many applications. Here we compare ALPHAPROG with AFL from the two perspectives we focus on: validity and coverage. We use AFL with a single empty seed to generate 30,000 programs for fuzzing BFC and record the highest valid rate per 1,000 samples as well as the accumulated coverage achieved. As a result, the highest valid rate for AFL is 35%, and the accumulated coverage in terms of basic blocks tested is 43,135. It covers 162 paths but finds no crashes or hangs (in fact, we ran AFL for 24 hours and no crashes or hangs were found). By contrast, ALPHAPROG achieves a valid rate of around 80% under Reward 4, the most efficient reward for fuzzing, tests over 100,000 basic blocks with 30,000 test samples, and detects two bugs. With this result, we can claim that ALPHAPROG is better than AFL at generating valid and diverse programs for compiler fuzzing.

Deployment for Bug Detection

With the improved testing efficacy, our tool has the potential to be deployed to discover more compiler bugs than pure random fuzzing in practice. During our analysis, ALPHAPROG has already helped detect two important bugs in the target compiler BFC, which is the industrial-grade BF compiler with the most stars (207) and forks (8) on GitHub. We reported two programs that cause BFC to hang due to compile-time evaluation. The first program triggers a bug during the BF IR optimization, while the second triggers a bug because the compiler aggressively unrolls the loop during compile-time evaluation, which sends a huge amount of IR to LLVM that then takes ages to optimize. After we reported both issues, they were addressed by the project owner and fixed in the new release. The two bugs we found, reported, and confirmed¹ are shown in Listing 1 and Listing 2.

Listing 1: Bug 1
. +
[ [ [ [ >.
[+
[<>
]
]
] >
]<>+ .
]>< ,-.,,+++
[
]---
]

Listing 2: Bug 2
. + . +
[ [ [ [ >.
[+
[<>
]
]
] >
]<>+ .
]>< ,-.,,+++
[
]---
]

¹ https://github.com/Wilfred/bfc/issues/28
Discussion and Conclusions

In this paper, we propose a reinforcement learning-based approach that continuously generates programs for compiler fuzzing. We evaluate our method in practice by fuzzing a BF compiler. Our study reveals the overall effectiveness of ALPHAPROG for compiler testing and demonstrates the applied value of our tool. However, there are two main limitations in our current work. First, the scalability of ALPHAPROG is restricted, especially for a complex programming language such as C. Because C language structures are difficult to synthesize (the entire search space is 141^20, almost 8e+24 times that of the BF language under the same length limit), it can take days for our prototype to find even a single valid C program. We still need more grammar to be encoded into the generation engine to make it applicable to complex languages. The second difficulty is that it is hard to determine the end of a training cycle. Unlike the game of Go, the learning goal for reinforcement fuzzing is hard to define strictly with only the current reward metrics. We need more in-depth study to overcome these challenges and leave that as future work.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This research was supported in part by the National Science Foundation (NSF) grant CNS-1652790.

References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.

Böttinger, K.; Godefroid, P.; and Singh, R. 2018. Deep reinforcement fuzzing. In 2018 IEEE Security and Privacy Workshops (SPW), 116–122. San Francisco, CA, USA: IEEE.

Bunel, R.; Hausknecht, M.; Devlin, J.; Singh, R.; and Kohli, P. 2018. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276.

Chen, J.; Hu, W.; Hao, D.; Xiong, Y.; Zhang, H.; Zhang, L.; and Xie, B. 2016. An empirical comparison of compiler testing techniques. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 180–190. Austin, TX, USA: IEEE.

Chen, Y.; Groce, A.; Zhang, C.; Wong, W.-K.; Fern, X.; Eide, E.; and Regehr, J. 2013. Taming compiler fuzzers. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 197–208. New York, NY, USA: ACM.

Cummins, C.; Petoumenos, P.; Murray, A.; and Leather, H. 2018. Compiler fuzzing through deep learning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 95–105. Amsterdam, Netherlands: ACM.

David, B. 2018. How a simple bug in ML compiler could be exploited for backdoors? arXiv preprint arXiv:1811.10851.

Godefroid, P.; Peleg, H.; and Singh, R. 2017. Learn&Fuzz: Machine learning for input fuzzing. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, 50–59. Piscataway, NJ, USA: IEEE Press.

Hughes, W. 2019. BFC: An industrial-grade brainfuck compiler. https://bfc.wilfred.me.uk/.

Kossatchev, A. S.; and Posypkin, M. A. 2005. Survey of compiler testing methods. Programming and Computer Software, 31(1): 10–19.

Liu, X.; Li, X.; Prajapati, R.; and Wu, D. 2019. DeepFuzz: Automatic generation of syntax valid C programs for fuzz testing. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 1044–1051.

Luk, C.-K.; Cohn, R.; Muth, R.; Patil, H.; Klauser, A.; Lowney, G.; Wallace, S.; Reddi, V. J.; and Hazelwood, K. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 190–200. Chicago, IL, USA: ACM. ISBN 1-59593-056-6.

Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 30.

Markov, A. A. 1954. The theory of algorithms. Trudy Matematicheskogo Instituta Imeni V. A. Steklova, 42: 3–375.

Müller, U. 1993. Brainfuck: An eight-instruction Turing-complete programming language. http://en.wikipedia.org/wiki/Brainfuck.

Regehr, J.; Chen, Y.; Cuoq, P.; Eide, E.; Ellison, C.; and Yang, X. 2012. Test-case reduction for C compiler bugs. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, 335–346. New York, NY, USA: ACM. ISBN 978-1-4503-1205-9.

Sun, C.; Le, V.; Zhang, Q.; and Su, Z. 2016. Toward understanding compiler bugs in GCC and LLVM. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA), 294–305. ACM.

Sutton, R. S.; Barto, A. G.; et al. 1998. Reinforcement Learning: An Introduction. USA: MIT Press.

Takanen, A.; Demott, J. D.; Miller, C.; and Kettunen, A. 2018. Fuzzing for Software Security Testing and Quality Assurance. Artech House.

Verma, A.; Murali, V.; Singh, R.; Kohli, P.; and Chaudhuri, S. 2018. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477.

Watson, A. H.; Wallace, D. R.; and McCabe, T. J. 1996. Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric, volume 500. USA: US Department of Commerce, Technology Administration.

Yang, X.; Chen, Y.; Eide, E.; and Regehr, J. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), volume 46, 283–294. USA: ACM.

Zalewski, M. 2014. American fuzzy lop. https://lcamtuf.coredump.cx/afl/.

Zhang, Q.; Sun, C.; and Su, Z. 2017. Skeletal program enumeration for rigorous compiler testing. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, 347–361. New York, NY, USA: ACM. ISBN 978-1-4503-4988-8.
