ing compilation. Reinforcement learning shows its potential in program analysis (Bunel et al. 2018; Böttinger, Godefroid, and Singh 2018; Verma et al. 2018) due to its capability of achieving learning goals in uncertain and complex environments. In our case, we use it to generate new programs within a limited length. We then ask the compiler to compile each generated program and collect both the returned message and the runtime information, i.e., the execution trace, to calculate a designed reward for training the model. As more programs are generated, the neural network becomes better trained to craft new programs toward our expectations. To achieve high compiler testing efficacy, we construct coverage-guided reward functions that balance program validity against the testing coverage improvement of the target compilers. In this manner, the trained neural network eventually learns to generate valid and diverse test suites.

We built the proposed framework into a prototyping tool called AlphaProg. To evaluate the practicality of our approach, we deployed AlphaProg on an esoteric language called BrainFuck (Müller 1993) (BF in the following), a Turing-complete programming language that contains only eight operators. We explored the effectiveness of AlphaProg by testing an industrial-grade BF compiler called BFC (Hughes 2019). We compared the results of AlphaProg under four different reward functions for compiler fuzzing; AlphaProg achieves promising performance in terms of validity and testing efficacy. During the analysis, we also detected two important bugs in the target compiler. After we reported both issues, they were actively addressed by the project owner and have been fixed in a new release.

Overview
If we view a program as a string of characters of a language, we can model the program generation task as a Markov Decision Process (MDP) (Markov 1954). An MDP is a 4-tuple (S, A, Pa, Ra), where S is a finite set of states and A is a finite set of actions, each of which is a transition between two states. Given a state s ∈ S, the probability of taking action a ∈ A is Pa(s, s′); accordingly, the agent receives an immediate reward Ra(s, s′), where s ∈ S is the current state and s′ ∈ S is the state after the action. At training iteration t, one action a_t ∈ A(s_t) is selected and performed. Once the environment receives the current state s_t and action a_t, it responds with a numerical reward r_{t+1} and moves the model to a new state s_{t+1}. In our context, we choose the best character to generate based on the current program state and append new characters iteratively to the current string until EOF. The generation of EOF may vary; a simple implementation places EOF at a fixed length. The core problem of an MDP is to find a policy π for making action decisions in a given state s, i.e., an update of the probability matrix Pa(s, s′) that achieves the optimal reward Ra(s, s′). In the fuzzing task, the probability of each transition is learned by neural networks to achieve an optimal reward that combines two important metrics: (1) the validity of the generated programs and (2) the compiler testing coverage. The validity of a generated string is confirmed by the returned compilation messages, which show how well the policy conforms to the formal language production rules, and the compiler testing coverage is calculated by analyzing the runtime information of each compilation.

Designed Framework
In this work, we propose a reinforcement learning framework based on Q-learning to generate BF code for fuzzing BF compilers. The generation process is illustrated in Figure 1. The framework has two main components: the fuzzing agent and the environment. The fuzzing agent, i.e., the neural network, tries to generate a new program as well as it currently can, and the environment, i.e., the compiler, provides a scalar reward for evaluating the synthesized program. To generate a new program, the neural network takes in a base string x_t and predicts the next character; the generated program y_t is a new string formed by appending that character to the base string. The model evaluates the quality of this new program and calculates a scalar reward r_t according to the message and execution trace from the compilation, which is used to train the neural network iteratively. The model evolves and improves gradually as more and more strings are generated and evaluated. In this section, we detail the model configuration in our framework and elaborate on the defined reward functions.
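For illustration, the following is a minimal Python sketch of one generation episode in this loop. The helper names predict_next_char (the fuzzing agent's policy) and compile_and_reward (the environment's feedback), as well as the length limit MAX_LEN, are assumptions made for the sketch rather than part of the actual AlphaProg implementation.

BF_CHARS = ['>', '<', '+', '-', '.', ',', '[', ']']  # the eight BF operators
MAX_LEN = 20  # assumed length limit; the paper only states that EOF is set at a fixed length

def run_episode(predict_next_char, compile_and_reward):
    """One generation episode: grow a program one character at a time.

    predict_next_char(base) -> one of BF_CHARS     (fuzzing agent, hypothetical helper)
    compile_and_reward(prog, done) -> float        (environment, hypothetical helper)
    """
    base = ""                                    # x_t: current base string (state)
    transitions = []                             # (state, action, reward, next_state)
    while len(base) < MAX_LEN:
        char = predict_next_char(base)           # action a_t: pick a BF character
        assert char in BF_CHARS
        prog = base + char                       # y_t: append the character
        done = len(prog) == MAX_LEN              # EOF reached at the fixed length
        reward = compile_and_reward(prog, done)  # r_t; 0 for intermediate programs (see Reward 1)
        transitions.append((base, char, reward, prog))
        base = prog                              # move to state s_{t+1}
    return base, transitions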
Action-State Value
Unlike traditional Q-learning, deep Q-learning leverages a deep neural network to improve the scalability of the model for tasks with a large state-action space. In our design, after observing the current state, the trained action-state network predicts an action, namely which character from the BF language to append in the next step. To deal with strings of different lengths, we use a simple LSTM model for sequence embedding. In particular, we use an LSTM layer with 128 neurons followed by two fully connected hidden layers with 100 and 512 neurons, respectively. For each layer, we adopt ReLU (Maas, Hannun, and Ng 2013) as the activation function. The output layer has size 8 (corresponding to BF's eight different characters), which allows the model to predict the character to append with the highest value.
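For illustration, a minimal tf.keras sketch of a comparable action-value network is given below. Only the layer sizes (LSTM with 128 units, dense layers with 100 and 512 units and ReLU, and an output of size 8) follow the description above; the one-hot input encoding, the padding length MAX_LEN, and the choice of a plain SGD optimizer with mean-squared-error loss are assumptions for the sketch.

import tensorflow as tf

NUM_CHARS = 8   # BF's eight operators
MAX_LEN = 20    # assumed padding length for program prefixes

# Q-network sketch: maps a padded one-hot character sequence to one
# estimated action value per BF character.
q_network = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN, NUM_CHARS)),  # one-hot encoded program prefix
    tf.keras.layers.Masking(mask_value=0.0),            # ignore padded positions
    tf.keras.layers.LSTM(128),                          # sequence embedding
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(NUM_CHARS)                    # Q(s, a) for each of the 8 characters
])
q_network.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),  # α = 0.0001, per the Settings section
                  loss='mse')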
Reward
The reward function is key to a reinforcement learning framework, as it indicates the learning direction. In the compiler fuzzing task, there are two main goals: (a) the generated programs should be valid; (b) the generated programs should be as diverse as possible. For validity, the generated programs are supposed to be both syntactically and semantically valid. The compilation process has several stages, and if the test code is rejected in an early stage, such as syntax analysis, the compilation is terminated and the remaining execution paths are not tested. Thus, the validity of generated test programs is important for the fuzzing task. In addition to validity, diversity is another goal from the perspective of testing efficacy. If similar tests are generated, then even though they are valid and compile successfully on the target compilers, we cannot achieve any testing coverage improvement, and we will not be able to trigger more unknown flaws or vulnerabilities in the compilers. In other words, we prefer that more legitimate language patterns be explored and encoded into the neural network, rather than synthesizing test code in vain with the same patterns. In our design, we set up four different reward functions for the learning process, which reflect the two learning goals and how to achieve a balance between them.
[Figure 1: Fuzzing Agent (Embedding, Sequence-to-Sequence Model) -> Action -> Environment; Environment -> Validity, Bugs]
Reward 1 First, considering syntactic and semantic validity, we set the reward function as

R_1 = \begin{cases} 0, & \text{length is less than the limit} \\ -1, & \text{compilation error} \\ 1, & \text{compilation success} \end{cases}    (1)

where, for any intermediate program during a generation episode, we give a reward of 0 until its length hits our restriction. To collect the compilation feedback and verify the validity of a synthesized program, we use a production compiler to parse the generated program and evaluate its correctness based on the compilation messages.

Compilation Message: Usually, there are five kinds of compilation messages: (1) no errors or warnings means that the program is successfully compiled to an executable without any conflict with the hard or soft rules defined by the compiler; (2) errors means that the program does not pass the syntactic or semantic checks and hits exceptions that terminate the compilation process; (3) internal errors indicates an error (bug) of the compiler, in which the compiler violates its own pre-defined assertions during compilation; (4) warnings is a sign that the compilation succeeds but some soft rules have not been met, e.g., the program contains some meaningless sequences; and (5) hangs means that the compilation falls into an infinite loop and cannot exit in a reasonable time. We consider three of these compilation messages as indicators of a valid program: no errors or warnings, warnings, and internal errors. Theoretically, this reward metric should guide the model to synthesize valid programs with the least effort, to the extent that the same character may be generated repeatedly throughout a synthesized program.
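As a sketch, assuming a hypothetical helper classify_compilation that maps the production compiler's output to one of the five message kinds above, R1 can be computed as follows.

VALID_OUTCOMES = {"no_errors_or_warnings", "warnings", "internal_error"}

def reward_1(program, length_limit, classify_compilation):
    """R1: 0 until the length limit is reached, then +1/-1 by validity.

    classify_compilation(program) -> one of "no_errors_or_warnings",
    "errors", "internal_error", "warnings", "hangs" (hypothetical helper
    wrapping the production compiler).
    """
    if len(program) < length_limit:
        return 0                      # intermediate program: no feedback yet
    outcome = classify_compilation(program)
    return 1 if outcome in VALID_OUTCOMES else -1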
Reward 2 Second, to measure the diversity of the synthesized programs, we use the unique basic blocks exercised on the compilation traces of the generated test suite as the testing coverage. In compiler construction, a basic block of an execution trace is defined as a straight-line code sequence with no branches except at the entry and exit points, and it is considered one of the important atomic units for measuring code coverage. In this regard, we have the reward

R_2 = B(T_p) / \sum_{\rho \in I'} B(T_\rho).    (2)

In this reward function, B(T_p) is the number of unique basic blocks in the execution trace of a program p, and I' is the set of all programs generated from this test suite, over which we compute the total number of unique basic blocks created so far.
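A small sketch of this coverage reward, assuming the tracing tool has already reduced each execution trace to a set of basic-block identifiers (e.g., block start addresses); the reading of the denominator as a sum of per-program unique counts follows Equation (2) literally.

def reward_2(trace_blocks, suite_blocks):
    """R2 = B(T_p) / sum over rho in I' of B(T_rho).

    trace_blocks: set of unique basic-block IDs in the new program's compilation trace.
    suite_blocks: list of such sets, one per program generated so far (the suite I').
    """
    total = sum(len(blocks) for blocks in suite_blocks)
    return len(trace_blocks) / total if total else 0.0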
Reward 3 Third, to consider both code validity and diversity, we formulate a combination of their reward metrics as a new reward function, specified as

R_3 = \begin{cases} 0, & \text{length is less than the limit} \\ -1, & \text{compilation error} \\ 1 + R_2, & \text{compilation success} \end{cases}    (3)

In this reward function, for all generated programs that compile successfully, we use the proportion of newly tested basic blocks as part of the reward. For the other two cases, we still return a reward of 0 when the program length has not hit the limit, and −1 when the program is not compilable.

Reward 4 In the fourth scenario, we additionally take the control-flow complexity of the synthesized programs into consideration, on top of the previous reward metrics. According to Zhang et al.'s study (Zhang, Sun, and Su 2017), increasing the control-flow complexity of the programs in a test suite remarkably improves the testing efficacy of the corresponding compilers: the effective testing coverage can be improved by up to 40% simply by switching the positions of variables in each program of the GCC test suite. In our design, we add the cyclomatic complexity (Watson, Wallace, and McCabe 1996) of the synthesized programs, which describes program control-flow complexity, to our reward metrics. We then have the new reward function

R_4 = R_3 + C(p) / \max\{C(\rho) : \rho \in I'\}.    (4)

In this function, C(·) is the cyclomatic complexity of a program. We simply add to the previous reward function R_3 the cyclomatic complexity of the synthesized program divided by the maximum value observed so far. In other words, if the synthesized program does not hit the length limit, we give it a reward of 0, and if it is not valid, we give it a reward of −1; otherwise, the reward is a combination of program validity, testing coverage, and program control-flow complexity.
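The two combined rewards can then be sketched on top of reward_1 and reward_2 above. Following the prose description, the sketch adds the coverage and complexity terms only when the program compiles successfully; the cyclomatic-complexity value is assumed to be supplied by the coverage analysis tooling.

def reward_3(program, length_limit, classify_compilation,
             trace_blocks, suite_blocks):
    """R3: like R1, but a successful compilation earns 1 + R2."""
    r1 = reward_1(program, length_limit, classify_compilation)
    if r1 == 1:
        return 1 + reward_2(trace_blocks, suite_blocks)
    return r1                          # 0 below the length limit, -1 if invalid

def reward_4(program, length_limit, classify_compilation,
             trace_blocks, suite_blocks, complexity, max_complexity_so_far):
    """R4: R3 plus the program's cyclomatic complexity C(p), normalized by
    the maximum complexity observed over the suite so far."""
    r3 = reward_3(program, length_limit, classify_compilation,
                  trace_blocks, suite_blocks)
    if r3 <= 0:
        return r3                      # keep 0 / -1 for incomplete or invalid programs
    return r3 + complexity / max(max_complexity_so_far, 1)  # guard against division by zero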
Training
During the training stage, we bootstrap the deep neural network for program generation, which takes in a current program x with state s, the action a that extends x to the next state s′, the reward r calculated from the compilation, and an original Q-network. For a given state, this Q-network predicts the expected rewards of all defined actions simultaneously. We update the Q-network to fit the predicted value Q(s_t, a_t) to the target r + γ max_a Q(s_{t+1}, a) by minimizing the loss on the deviation between them, where γ is a discount rate between 0 and 1. A value closer to 1 indicates a goal targeted at long-term reward, while a value closer to 0 means the model is more greedy. The trade-off between exploration and exploitation during training is a dilemma frequently faced in reinforcement learning. In our program generation problem, exploitation focuses on taking advantage of the trained model to search for new conforming programs as much as possible, while exploration means the fuzzing agent randomly chooses a character so that the generated sequences can vary. In our method, we employ the ε-greedy method in the training process to balance exploration and exploitation: with probability ε, our model chooses a random action, and with probability 1 − ε, it follows the prediction of the neural network. In the implementation, we let the value of ε decay, so that in the earlier stages of training the chance of choosing a random action is higher, and the probability goes down in proportion to the number of predictions. That is, we gradually rely on the trained neural network rather than random guesses to explore as the model matures.
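A condensed sketch of this update and of the ε-greedy action selection is shown below. It assumes the q_network sketched in the Action-State Value section, a hypothetical encode helper that turns a program string into a padded one-hot batch of shape (1, MAX_LEN, 8), and a numpy.random.Generator as rng.

import numpy as np

def select_action(q_network, encode, state, epsilon, rng):
    """ε-greedy: a random character with probability ε, otherwise the
    character with the highest predicted Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(8))                 # explore: random BF character index
    q_values = q_network.predict(encode(state), verbose=0)
    return int(np.argmax(q_values[0]))              # exploit: greedy action

def q_update(q_network, encode, state, action, reward, next_state, done, gamma):
    """Fit Q(s_t, a_t) towards the target r + γ max_a Q(s_{t+1}, a)."""
    target_q = q_network.predict(encode(state), verbose=0)
    if done:
        target_q[0][action] = reward
    else:
        next_q = q_network.predict(encode(next_state), verbose=0)
        target_q[0][action] = reward + gamma * np.max(next_q[0])
    # One gradient step on the squared deviation between prediction and target.
    q_network.fit(encode(state), target_q, epochs=1, verbose=0)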
Experiment
To evaluate our prototyping tool AlphaProg, we perform studies on training the model towards the two different goals by setting the reward functions described in the Reward section. We log the valid rate and the testing coverage improvement during the learning process. The analysis confirms our expectation about the leading role of the different reward functions. To demonstrate the testing ability, we compare our tool with random fuzzing using 30,000 newly generated programs, in terms of testing efficacy. To elaborate its effectiveness in generating more diverse programs, we also study the generated programs to explain the evolving process of the training model. In this section, we report the detailed implementation of AlphaProg and discuss the experiments we conducted.
Settings
We build AlphaProg on top of existing frameworks for binary instrumentation and neural network training. The core framework of the deep Q-learning module is implemented in Python 3.6. In our implementation, the program execution trace is generated by Pin (Luk et al. 2005), a widely used dynamic binary instrumentation tool. We develop a Pin plug-in to log the executed instructions. Additionally, we develop a coverage analysis tool based on the execution trace that reports all basic blocks touched so far; it also reports whether, and how many, new basic blocks in the compiler code are covered by a given new program. Our environment also logs and reports abnormal crashes, memory leaks, and failing assertions of the compilers, with the assistance of the internal-error alarms in the compilation messages.

Besides, the Q-learning network is implemented in TensorFlow (Abadi et al. 2016), using an LSTM layer for sequence embedding connected to a 2-layer encoder-decoder network. The initial weights are randomly and uniformly distributed within w ∈ [0, 0.1]. We choose a discount rate γ = 1 to address the long-term goal and a learning rate α = 0.0001 for the gradient descent optimizer. We assign ε_max = 1 and ε_min = 0.01 with a decay of (ε_max − ε_min)/100000 after each prediction. Therefore, the model stops exploration after episode 20,000. We will open-source our prototyping tool AlphaProg for public dissemination after the paper is accepted.
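As a small illustration of the exploration schedule just described (the constants follow the settings above; how many episodes correspond to 100,000 predictions depends on the number of characters predicted per episode, which we leave aside here):

EPS_MAX, EPS_MIN = 1.0, 0.01
DECAY = (EPS_MAX - EPS_MIN) / 100000      # subtracted after every prediction

def epsilon_after(num_predictions):
    """Linearly decayed ε; it reaches EPS_MIN after 100,000 predictions."""
    return max(EPS_MIN, EPS_MAX - DECAY * num_predictions)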
Validity
Generating valid programs is one of our important goals. We evaluate the valid rate of the generated programs during the training process. The four reward functions are designed towards two different goals for program generation. We report the number of valid programs per 1,000 generated programs in Figure 2.

Reward 1: Reward 1 represents learning towards generating only valid programs. From Figure 2 we find that, as the number of generated programs increases, the valid rate grows quickly, and by 20,000 generated programs the valid rate reaches 100%. This result implies that, once the easiest pattern for generating a valid program is found by random generation, e.g., ,,,,,,,, or >>>>>>>>, the network converges quickly to it and stops learning anything new. The model trained with this reward function achieves the highest rate of valid programs in the synthesis procedure.

Reward 2: Reward 2 represents learning towards generating diverse programs to improve the testing coverage of a target compiler. Without balancing against syntactic and semantic validity, we anticipate that under this reward more diverse program patterns will be generated but fewer of them will be valid. The results in Figure 2 show that the valid rate stays the lowest for most of the time, which means the generation engine learns valid programs only inefficiently from the pure-coverage reward.

Reward 3: Reward 3 sets up the goal of combining validity and diversity. At a high level, generating valid programs and generating diverse programs are two conflicting goals. To generate valid programs, the model only needs to know one simple way that fits the language grammar; for example, in the experiment with Reward 1, the model only learns that by appending , to whatever prefix, it can generate valid programs. However, if the goal becomes generating diverse programs, different characters should be tried, which makes validity easy to violate. The model trained with this reward function achieves second place in the rate of valid programs in the synthesis procedure. From Figure 2, we observe that the valid rate keeps fluctuating, but overall it is increasing and approaches 90% at the final stage.
Figure 2: Code validity under four reward functions
Figure 3: Testing coverage under four reward functions
Reward 4: Reward 4 sets up the goal of adding program control-flow complexity on top of synthesis validity and diversity. From related studies, we know that the control-flow complexity of the programs in a test suite is one of the most important factors for improving compiler testing efficacy. We anticipate that adding this factor to the reward function will help us improve the testing coverage of target compilers without compromising program validity too much. From Figure 2, we find that the model trained with this reward function achieves third place in the rate of valid programs in the synthesis procedure.

Testing Coverage
Coverage improvement is the most important metric for software testing. Traditionally, it denotes the overall lines/branches/paths of the target software visited by certain test cases. In the design of AlphaProg, to improve the performance of this end-to-end learning process, we adopt an approximation of the overall testing coverage, namely the accumulated number of unique basic blocks executed under the newly generated programs. A basic block of an execution trace is, in compiler construction terms, a straight-line code sequence with no branches except at the entry and exit points. To capture the overall number of unique basic blocks, we first collect the unique basic blocks B(T_p) of each execution trace T_p, and then maintain the accumulated number of unique basic blocks B(I') by taking the union of the new basic blocks on the current trace with those already visited. In the experiments, we log the accumulated testing coverage for the four reward functions adopted in the framework. We compare their coverage improvements and display the results in Figure 3.
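A sketch of this accumulation, again assuming each trace arrives as a set of basic-block identifiers:

def update_accumulated_coverage(accumulated_blocks, trace_blocks):
    """Union the basic blocks of the current trace into the running set B(I')
    and report the accumulated total and the number of newly covered blocks."""
    new_blocks = trace_blocks - accumulated_blocks
    accumulated_blocks |= new_blocks              # B(I') grows monotonically
    return len(accumulated_blocks), len(new_blocks)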
Reward 1: The blue line shows the accumulated compiler testing coverage obtained by generating programs under Reward 1. Under this reward, the coverage improves drastically at the beginning of training but stops growing after episode 11,000. In the corresponding validity figure, we also notice that the valid rate reaches 100% around episode 11,000, which is very close to the point where the coverage converges. This is because our model finally converges to a point where it keeps producing , or > for every action. Although the generated programs are 100% valid, they no longer improve the testing coverage. This result confirms the analysis from the validity experiment.

Reward 2: The red line shows the accumulated compiler testing coverage under Reward 2. Under this reward, the coverage also increases drastically in the early stages of training. It still grows slowly after the improvement under Reward 1 stops, but the pace is not as fast as the improvement under Reward 3. Looking at the corresponding code validity figure, although our model scarcely generates valid programs under Reward 2, the programs it does generate are pushed to be diverse and to hit different parts of the target compiler, which eventually improves the testing coverage, though with lower efficiency.

Reward 3: The green line shows the accumulated compiler testing coverage under Reward 3. Under this reward, the testing coverage rises dramatically in the early stages and keeps increasing until the second-highest coverage is eventually achieved. We also notice that the coverage improves periodically. In Figure 2, which shows the code validity, we can observe a regular increasing wave. We interpret this as the model always being driven to generate valid programs by the frequent validity reward, while periodically being guided to generate new patterns towards a higher reward. In this case, the generated programs achieve a good trade-off between validity and diversity.

Reward 4: The orange line shows the accumulated compiler testing coverage under Reward 4. Under this reward, the coverage improves as drastically as under Reward 3 in the early learning stages, and it keeps increasing until the highest value among the four designed reward functions is achieved. Although the final program valid rate under Reward 4 is lower than those under Reward 1 and Reward 3, the testing coverage beats both of them. The reason why Reward 4 achieves better testing coverage than Reward 1 is straightforward, as the latter naively depends on the validity of
Episode | Cyclomatic Complexity | Program
101     | 2                     | [+, <>++[>..],-+<+[,]-,[<].<-[],>,[>. <[+]]+><<]
1786    | 11                    | [>[,,[... - [<]>+, .+-,. .-.,],].]> .,+[>]>. +..+.
5096    | 32                    | <-+[. <,[.,-] +]> -.+++<++-.>-,[>.,+,] -<- --[]
10342   | 39                    | -<[>.<.<.><,]<-<[<.-. ] -,[>- <>++-[],. ]>>-+[,<]
take days for our prototype to just find one single valid C program. More grammar rules still need to be encoded in the generation engine to make it applicable to complex languages. The second difficulty is that it is hard to determine the end of a training cycle. Unlike the game of Go, the learning goal of reinforcement fuzzing is hard to define strictly with only the current reward metrics. We need more in-depth study to overcome the existing challenges and leave that as our future work.

Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This research was supported in part by the National Science Foundation (NSF) grant CNS-1652790.
References
Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.
Böttinger, K.; Godefroid, P.; and Singh, R. 2018. Deep reinforcement fuzzing. In 2018 IEEE Security and Privacy Workshops (SPW), 116–122. San Francisco, CA, USA: IEEE.
Bunel, R.; Hausknecht, M.; Devlin, J.; Singh, R.; and Kohli, P. 2018. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276.
Chen, J.; Hu, W.; Hao, D.; Xiong, Y.; Zhang, H.; Zhang, L.; and Xie, B. 2016. An empirical comparison of compiler testing techniques. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 180–190. Austin, TX, USA: IEEE.
Chen, Y.; Groce, A.; Zhang, C.; Wong, W.-K.; Fern, X.; Eide, E.; and Regehr, J. 2013. Taming Compiler Fuzzers. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 197–208. New York, NY, USA: ACM.
Cummins, C.; Petoumenos, P.; Murray, A.; and Leather, H. 2018. Compiler fuzzing through deep learning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 95–105. Amsterdam, Netherlands: ACM.
David, B. 2018. How a simple bug in ML compiler could be exploited for backdoors? arXiv preprint arXiv:1811.10851.
Godefroid, P.; Peleg, H.; and Singh, R. 2017. Learn&Fuzz: Machine learning for input fuzzing. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, 50–59. Piscataway, NJ, USA: IEEE Press.
Hughes, W. 2019. BFC: An industrial-grade brainfuck compiler. https://bfc.wilfred.me.uk/.
Kossatchev, A. S.; and Posypkin, M. A. 2005. Survey of compiler testing methods. Programming and Computer Software, 31(1): 10–19.
Liu, X.; Li, X.; Prajapati, R.; and Wu, D. 2019. DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 1044–1051.
Luk, C.-K.; Cohn, R.; Muth, R.; Patil, H.; Klauser, A.; Lowney, G.; Wallace, S.; Reddi, V. J.; and Hazelwood, K. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 190–200. Chicago, IL, USA: ACM. ISBN 1-59593-056-6.
Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 30.
Markov, A. A. 1954. The theory of algorithms. Trudy Matematicheskogo Instituta Imeni V. A. Steklova, 42: 3–375.
Müller, U. 1993. Brainfuck: an eight-instruction Turing-complete programming language. http://en.wikipedia.org/wiki/Brainfuck.
Regehr, J.; Chen, Y.; Cuoq, P.; Eide, E.; Ellison, C.; and Yang, X. 2012. Test-case Reduction for C Compiler Bugs. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '12), 335–346. New York, NY, USA: ACM. ISBN 978-1-4503-1205-9.
Sun, C.; Le, V.; Zhang, Q.; and Su, Z. 2016. Toward understanding compiler bugs in GCC and LLVM. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA), 294–305. ACM.
Sutton, R. S.; and Barto, A. G. 1998. Reinforcement Learning: An Introduction. USA: MIT Press.
Takanen, A.; Demott, J. D.; Miller, C.; and Kettunen, A. 2018. Fuzzing for Software Security Testing and Quality Assurance. Artech House.
Verma, A.; Murali, V.; Singh, R.; Kohli, P.; and Chaudhuri, S. 2018. Programmatically Interpretable Reinforcement Learning. arXiv preprint arXiv:1804.02477.
Watson, A. H.; Wallace, D. R.; and McCabe, T. J. 1996. Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric, volume 500. USA: US Department of Commerce, Technology Administration.
Yang, X.; Chen, Y.; Eide, E.; and Regehr, J. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), volume 46, 283–294. USA: ACM.
Zalewski, M. 2014. American fuzzy lop. https://lcamtuf.coredump.cx/afl/.
Zhang, Q.; Sun, C.; and Su, Z. 2017. Skeletal Program Enumeration for Rigorous Compiler Testing. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017), 347–361. New York, NY, USA: ACM. ISBN 978-1-4503-4988-8.