Metamorphic Code Generation from LLVM IR Bytecode
SJSU ScholarWorks
Spring 2013
Recommended Citation
Tamboli, Teja, "Metamorphic Code Generation from LLVM IR Bytecode" (2013). Master's Projects. 301.
DOI: https://doi.org/10.31979/etd.adyy-u2vw
https://scholarworks.sjsu.edu/etd_projects/301
This Master's Project is brought to you for free and open access by the Theses and Graduate Research at SJSU
ScholarWorks. It has been accepted for inclusion in Master's Projects by an authorized administrator of SJSU
ScholarWorks. For more information, please contact [email protected].
Metamorphic Code Generation from LLVM IR Bytecode
A Project
Presented to
In Partial Fulfillment
Master of Science
by
Teja Tamboli
May 2013
© 2013
Teja Tamboli
Metamorphic software changes its internal structure across generations with its
writers as a means of evading signature detection and other advanced detection strate-
gies. However, code morphing also has potential security benefits, since it increases
In this research, we have created a metamorphic code generator within the LLVM
tion process. Our metamorphic generator functions at the IR bytecode level, which
morphing techniques that we employ include dead code insertion—where the dead
We have tested the effectiveness of our code morphing using hidden Markov model
analysis.
ACKNOWLEDGMENTS
Firstly, I would like to thank Dr. Mark Stamp, my project advisor, for his guid-
Secondly, I would like to thank my committee members, Dr. Sami Khuri and
Dr. Robert Chun, for providing suggestions without which this project would not
have been possible.
Finally, I would like to thank my family and my husband Onkar Deshpande for
TABLE OF CONTENTS
CHAPTER
1 Introduction
2 Malware
2.1.1 Virus
2.1.2 Worms
3 Metamorphic Techniques
3.5 Transposition
4 LLVM
4.1 Introduction
4.2.2 LLVM Design
4.4.3 lli
4.4.4 Opt
4.4.5 llvm-dis
4.4.6 llc
4.4.7 llvm-link
5.1 HMM
5.1.1 Example
6.1 Introduction
6.2 Challenge and Innovation
6.3 Goals
6.5 Implementation
7 Experiments
7.2 HMM
7.3 Detection
8 Conclusion
APPENDIX
LIST OF TABLES
LIST OF FIGURES
15 HMM Model
22 HMM scores after using optimizers
A.2 HMM results with 10% and 20% dead code insertion
A.3 HMM results with 30% and 50% of dead code insertion
CHAPTER 1
Introduction
To date, metamorphic code generation has primarily been used by malware writ-
ers, since well-designed metamorphic code can evade signature-based detection and
other more advanced detection strategies [22, 42, 53]. However, metamorphism also
has the potential to provide security benefits by increasing the “genetic diversity” of
software, thereby making several types of attacks more difficult [16, 45].
cording to recent research [42, 53], techniques based on hidden Markov models
Many metamorphic malware generators are readily available [51]. Some notable
examples include
In addition, research morphing engines are presented in [22] and [42]. All of these
metamorphic generators work at the assembly language level. Note that code
morphing of high-level source code is far simpler, but generally ineffective, since such
morphing does not provide sufficient control over the resulting executable file.
the LLVM compiler framework [2]. The LLVM compiler infrastructure provides for
source languages and multiple target architectures. In the optimization process, code
tool functions at this IR bytecode level, which provides advantages that are some-
what analogous to morphing at the source-code level, but also provides the necessary
“shadow attack” is developed using LLVM. This attack hides system call behavior
We evaluate our morphing technique using the hidden Markov model (HMM)
analysis developed in [53] which has been further developed and analyzed in [42, 43].
This body of previous work provides a baseline for determining the effectiveness of
our approach.
the architecture of the LLVM compiler infrastructure and elaborates on IR byte code.
Chapter 5 details the design and implementation of the HMM detector used to evaluate
our results. Chapter 6 covers the design and implementation of our metamorphic
CHAPTER 2
Malware
puter [36]. These malicious activities include crashing disks or the operating system and
altering system data [35]. Writing malware is a challenging task [3] and can be a
source of revenue for malware writers. Anti-virus software can detect these malicious
activities and remove the malware from the computer. To date, most development
and research into metamorphic code has involved malware. Therefore, we present
background information on malware before turning our attention to the general case
The two most prominent types of malware are viruses and worms. They are
2.1.1 Virus
do not have reproductive ability but it can replicate itself. Viruses need human
interaction to spread the infection from one computer to another. For example, they can
be downloaded from the Internet or spread by exchanging infected USB drives or floppy
disks. Virus writers constantly develop new obfuscation techniques to evade
signature-based detection. The most important methods of evading detection are
encryption, polymorphism, and modern metamorphic techniques [21]. These techniques are
explained in the following subsections.
2.1.1.1 Encrypted Viruses
The simplest method of hiding the virus body is to encrypt it with different
encryption keys. Most of the virus executable is encrypted, and a small decryption
module exists to decrypt the encrypted body. For example, the virus body can be XORed
with the key [6]. Since the virus is encrypted with a different key for each infected
file, only the decryption module remains constant across generations. The encrypted
body of such viruses has no common signature, but virus scanners can detect the
decryption module. As a result, the code which encrypts or decrypts is part of the
signature in most
polymorphic viruses can generate a large number of unique decryptors by using different
encryption methods. Therefore, no two infections have the same signature [22, 42].
decryption process and dynamically decrypts the encrypted virus body [13].
ance of the virus while keeping its functionality. Metamorphic viruses don’t need
encryption or decryption techniques. They produce a new virus body on each
infection [13].
The metamorphic engine can be kept separate, or it can be embedded in the virus.
2.1.2 Worms
Worms are self-replicating malware. Unlike viruses, they do not need any human
intervention to spread. They are standalone programs [36, 46]. Worms use techniques
such as metamorphism and polymorphism to avoid detection. A worm could be a
macro residing in a Word document or an Excel sheet that spreads itself across the
network. Worms spread from host to host across the network, unlike viruses which
2.2 Detection Techniques
anti-virus software. They are popular because of their accurate detection, simplicity,
and speed [44]. In this detection mechanism, the scanner scans each executable and
looks for the virus. Anti-virus software maintains a database of signatures for
different viruses and detects a virus by comparing signatures. The database must be
updated continually with new malware signatures. Another downside of this type of
detection is that it cannot detect a new virus. Therefore, by using simple code
obfuscation techniques,
overcome using this type of detection mechanism. Heuristic methods are implemented
to detect anomalous behavior. There are two primary phases: training and detection.
In training, the scanner learns normal and malicious behavior, e.g., attempts to find
the root password. Using this technique, new viruses can be detected, but it has a
downside too [18]. Since detection depends on how the system is trained, it has a
higher number of false
2.2.3 Hidden Markov Model Based Detection
technique. HMMs are probabilistic models and are widely used in solving pattern
recognition problems. They help in finding the probability of transition from one
state to another. An HMM has two phases: training and detection. Once the model is
trained, it can be used to distinguish benign software from malware [22, 42, 53].
CHAPTER 3
Metamorphic Techniques
The metamorphic code generator described in this paper makes use of morphing
techniques like dead code insertion and function permutation. A number of tech-
mented metamorphosis via register usage exchange. Register swap means changing
the register operands. For example, if the instruction is PUSH ECX, then it can be
replaced with PUSH EAX. In this technique, the opcode sequence remains the same.
Figure 2 shows some sample code fragments selected from two different generations of
W95/Regswap that use different registers. A wildcard string can be used to detect such
viruses [10].
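The wildcard idea can be illustrated with a minimal sketch. The byte sequences below are hypothetical (they are not taken from W95/Regswap); they rely only on the standard x86 encodings in which the low three bits of the PUSH, POP, and MOV-immediate opcode bytes select the register. A signature that leaves those bits unconstrained matches both generations.

```python
import re

# Hypothetical signature: PUSH r32 ; MOV r32, 0x00000004 ; POP r32.
# Each character class spans the eight register encodings of one opcode,
# so the register choice acts as a wildcard.
SIGNATURE = re.compile(rb"[\x50-\x57][\xb8-\xbf]\x04\x00\x00\x00[\x58-\x5f]")

def matches_signature(code: bytes) -> bool:
    """Return True if the wildcard signature occurs anywhere in the code."""
    return SIGNATURE.search(code) is not None

# Two made-up "generations" that differ only in register usage.
gen1 = bytes([0x51, 0xB8, 0x04, 0x00, 0x00, 0x00, 0x59])  # PUSH ECX; MOV EAX,4; POP ECX
gen2 = bytes([0x50, 0xBB, 0x04, 0x00, 0x00, 0x00, 0x58])  # PUSH EAX; MOV EBX,4; POP EAX

print(matches_signature(gen1), matches_signature(gen2))
```

Because the opcode sequence is unchanged by register swapping, a single pattern of this kind covers every register assignment.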
This technique changes the appearance of a virus by reordering the layout of its
subroutines. If a virus has n subroutines, then n! distinct generations are possible
without repetition. Some examples are BadBoy and W32/Ghost (discovered in May 2000).
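As a small illustration of the n! count, the following sketch enumerates every layout of three subroutines. The subroutine names are placeholders chosen for this example only.

```python
import itertools
import math

# Placeholder subroutine labels; a real program would hold code bodies here.
subroutines = ["sub_a", "sub_b", "sub_c"]

# Every ordering of the subroutines is one possible layout.
layouts = list(itertools.permutations(subroutines))

# n subroutines yield n! distinct layouts.
layout_count_is_factorial = len(layouts) == math.factorial(len(subroutines))

# Each layout contains exactly the same subroutines; only the order differs.
same_content = all(sorted(layout) == sorted(subroutines) for layout in layouts)

print(len(layouts), layout_count_is_factorial, same_content)
```

Since every layout contains the same unmodified subroutines, reordering alone changes the file layout but not the content of any individual subroutine.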
them can be detected with search strings as the content of each subroutine remains
malware morphing strategy, particularly with respect to statistical based detection.
3.3 Dead Code Insertion
Dead code can be inserted into an existing program. In its simplest form, dead
code is never executed. Alternatively, dead code can be executed, provided that it
has no effect on the overall program function. Although more difficult, this latter
approach is more effective, since the dead code is more difficult to detect.
Dead code can be very effective for evading malware detection, particularly with
respect to statistical-based techniques, since the dead code can be used to mask the
statistical properties of the underlying code. By adding such code, a virus can
generate an unlimited number of unique copies. However, dead code insertion can be
challenging at the assembly code level, since care must be taken so that addresses
remain valid.
The Win95/Zperm virus, which appeared in June and September of 2000, incorporated this
technique [10]. Figure 4 explains the code structure changes of Zperm-like viruses.
For example, MOV R1, R2 can be replaced by PUSH R1 and then POP R2. As another
trivial example, XOR R1, R1 and SUB R1, R1 both zero the contents of register
R1, but the opcodes of these two instructions are different. Instruction substitution
is a powerful technique for evading signature detection and altering code statistics.
code level. The W32/MetaPhor virus is one of the metamorphic virus generators that
3.5 Transposition
actual functionality. It means the sequence can be reordered only if two instructions
These instructions can be swapped since both instructions are independent of each
other. Therefore, they can be reordered with the following sequence without changing
niques [7, 14, 54]. In general, traditional morphing engines are non-deterministic
automata (NDA), since transitions are possible from every symbol to every other
symbol [54]. The symbol set is the set of all possible instructions. That is, any
niques, one can create formal grammar rules and can apply these rules to create viral
copies with great variation. Figure 5 shows a simple polymorphic decryptor template
and two possible mutations of the decryptor code achieved using the formal gram-
mar. Figure 6 With this decryptor template and formal grammar combination, it is
Figure 6: Formal grammar for decryptor mutation [54]
CHAPTER 4
LLVM
4.1 Introduction
components in multiple languages to gain efficiency. During the lifetime of the
application, certain components have small hot spots in terms of memory footprint or
CPU; others spread their execution time evenly throughout the application. It is im-
LLVM (Low Level Virtual Machine) is a compiler infrastructure which has several
a Static Single Assignment (SSA). This means that each variable is assigned exactly
once and cannot be reassigned [2, 32]. This is done by using numbers to represent variables.
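The single-assignment property can be illustrated with a minimal sketch that renames variables in straight-line code by appending a version number. This is only an illustration of the idea, not the algorithm LLVM uses: the helper name `to_ssa` is invented here, branches (which would require phi nodes) are not handled, and the input is assumed not to contain names that already end in digits.

```python
import re

def to_ssa(statements):
    """Rename variables in straight-line code so each name is assigned once.

    `statements` is a list of (target, expression) pairs.
    """
    version = {}   # latest version number of each assigned variable
    result = []
    for target, expr in statements:
        # Uses on the right-hand side refer to the latest version so far.
        def latest(match):
            name = match.group(0)
            return f"{name}{version[name]}" if name in version else name
        new_expr = re.sub(r"[A-Za-z_]\w*", latest, expr)
        # Each assignment creates a fresh, numbered name for the target.
        version[target] = version.get(target, 0) + 1
        result.append((f"{target}{version[target]}", new_expr))
    return result

program = [("x", "1"), ("x", "x + 2"), ("y", "x * x")]
ssa = to_ssa(program)
for target, expr in ssa:
    print(f"{target} = {expr}")
```

After renaming, the second assignment to x becomes a new name, so no name is ever written twice.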
LLVM can be used with a GCC-based front end (llvm-gcc). It supports static compilation and late compilation from the
The LLVM infrastructure originated in a project called “The Lifelong Code
science at the University of Illinois at Urbana-Champaign [37].
Most traditional static compilers (for example, GCC, used for C/C++
programs) are three-phase compilers. The three main phases are the frontend, the
optimizer, and the backend. Figure 7 shows the typical design of a three-phase compiler [27].
The key function of the frontend component is to parse the source code, check for
syntax errors, and then build a language-specific Abstract Syntax Tree (AST).
Optimizers use this tree and transform it to a new representation by applying opti-
dependent representation of the code (assembly code). It maps the code to the target
machine's instruction set. Its main goal is to generate correct code that can take
The key feature of LLVM's three-phase compiler design is that it supports multiple
A frontend can be written for any source language, which is then converted
to an intermediate representation. This intermediate representation is machine and
language independent. A backend can be written for any target platform to compile
from this common representation [1, 27]. Figure 8 shows the LLVM design.
Using this design, it is easy to add a new language by implementing a new frontend
and reusing the existing optimizers and backends. New platforms can be supported
implement new language we have to start all over from the scratch. To support N
rated from each other, the skills required to implement a frontend are completely
different from the skills required to implement a backend or optimizer [2, 32].
Frontend developers can maintain or enhance only their part of the compiler. This is
not a technical issue, but for an open source project, it lowers the barrier to
contributing as much as possible.
compilers supporting only one source language and target. Traditional open source
compilers (like GCC) are stable and efficient because they serve larger communities.
This tends to generate better optimized machine code compared to narrower compilers
mon intermediate form (IR). This common form separates frontend and backend
restructuring transformations.
Figure 9 shows the structure of LLVM IR. It contains the following sections [37]:
1. A module: The module is the container which has functions and global variables.
a list of basic blocks. Each basic block consists of a set of instructions. Instructions
code [27]. Frontend programmers should understand IR and its invariants. LLVM
Figure 9: LLVM bytecode file format [37]
text. Similarly, backend writers (code generators) should know how to convert it into
machine code.
4.4 Tools in LLVM
code there are various tools available in the LLVM infrastructure. Some of these tools
are explained in the following sections. A program’s life cycle from source program to
This tool is used to generate the LLVM IR bytecode. It checks for syntacti-
cal errors and then produces IR bytecode of the source program. It uses .s or .ll
4.4.2 llvm-as (.s → .bc)
This tool is used to generate bitcode file (executable) from the IR bytecode. For
example,
llvm-as -f helloworld.s
4.4.3 lli
To execute LLVM bitcode file, lli command is used [28]. For example,
lli helloworld.s.bc
4.4.4 Opt
It is used to run different optimizers on IR bytecode. One can write his / her own
opt <optimizer name> <input bytecode or bitcode file> <output bytecode or bitcode file>
For example,
The file mhellow.bc is generated after running the optimizer pass mem2reg on hel-
4.4.5 llvm-dis
This tool is used to disassemble the LLVM bitcode file to IR bytecode file.
Using this command we can check how code was optimized in mem2reg pass [29].
4.4.6 llc
This tool is used to generate native assembly-code from bitcode file. For example,
llc -f opthelloworld.bc
A file containing the native assembly code opthelloworld.s is created [28]. Throughout
Here, file.bc is the input binary file and file.asm is the output file containing assembly
code
4.4.7 llvm-link
This tool takes several LLVM bitcode files and links them together and generates
This command generates one finalopt.bc bitcode file after linking addCode.s.bc and
CHAPTER 5
that Hidden Markov Models (HMMs) are very effective in detecting metamorphic
viruses [5, 11, 22, 39, 48, 49, 53]. Hidden Markov models can be viewed as a
machine learning technique. Previous work describes a method for training an HMM with
sequences of opcodes from viruses that belong to the same family [53]. The trained
HMM is then used to score binaries to determine whether a given binary is a
virus or a benign file. A Log Likelihood Per Opcode (LLPO) score is calculated, and
based on this score, a threshold is obtained that separates viruses from benign
executables. This threshold value classifies an executable as a virus or as benign
based on its LLPO score. The detailed working of HMM and its virus detection mechanism are
5.1 HMM
The Hidden Markov Model (HMM) is a statistical pattern analysis tool. The
in Figure 12. Its state and observation at time $t$ are represented by $X_t$ and $O_t$,
respectively. The initial state $X_0$ and the $A$ matrix together determine the hidden
Markov process. This Markov process is illustrated in Figure 12. $O_i$ is related to the states
Table 1: HMM Notations [43]
Symbol   Description
T        Length of the observed sequence
N        Number of states in the model
M        Number of distinct observation symbols
O        Observation sequence (O_0, O_1, ..., O_{T-1})
A        State transition probability matrix
B        Observation probability distribution matrix
π        Initial state distribution matrix
Research in [12] indicates HMMs are used in protein modeling and speech recog-
nition systems. HMMs can also be used to detect certain types of software piracy [20].
In general, HMM needs to be trained with the input data. It creates training models
based on this training data. Each individual element in the training data is mapped
to an observation symbol. All unique observation symbols are then extracted and
are represented in the training model. This trained model is then used to determine
whether the new sequence of observations is similar to one represented in the training
model. For virus detection, training data is collected from known viruses and one
model is built for each virus family. A new file is tested against these models. If a
new file matches a model, then it can be identified as a virus from that family [43, 53].
5.1.1 Example
Papers [22, 43, 53] explain the inner workings of an HMM with this simple example:
Suppose one has to find out the average annual temperature for any given year by
observing the tree sizes (S-small, M -medium and L-large). To keep it simple, assume
the annual temperature is either hot (H) or cold (C). In addition to this, the prob-
another hot year (HH) is 0.7, a hot year followed by a cold year (HC) is 0.3, a cold
year followed by a hot year is 0.4 and a cold year followed by another cold year (CC)
The correlation between tree sizes and temperature is also known. In a hot year,
the probability of tree size being small is 0.1, being medium is 0.4 and being large
is 0.5. In a cold year, the probability of tree size being small is 0.7, being medium
is 0.2 and being large is 0.1 [43]. The matrix representation of this information is as
follows:
This known information can be mapped to HMM notation. The states are the annual
temperatures, and the tree sizes are the observation symbols. States H and C
are hidden, as we cannot directly observe the temperature in the past. We can observe
only the observation symbols (S, M and L). With this known information, we can build the
Figure 13: HMM Model [43]
Suppose we know the tree sizes for four consecutive years, (S, M, S, L), and we want
to use this information to find the most likely annual temperature sequence. To solve
this problem using an HMM, its parameters are as follows [43]:
HMM steps to determine transitions of length T = 4 with given observations
2. Calculate the probability for each state transition (Table 2) with the given
P(HHCC) = (0.6)(0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1)
= 0.000212
3. We know that annual temperature sequence will be the one with the highest
CCCH sequence.
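The brute-force calculation can be sketched as follows. The transition and observation probabilities are those stated in the example; the value a_CC = 0.6 is inferred from the row-stochastic condition, and the initial distribution π = (0.6, 0.4) is an assumption consistent with the leading factor 0.6 in the calculation above.

```python
from itertools import product

STATES = "HC"
# Transition probabilities A[from][to]; the C-to-C value 0.6 is inferred
# so that each row sums to one.
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
# Observation probabilities B[state][tree size].
B = {"H": {"S": 0.1, "M": 0.4, "L": 0.5}, "C": {"S": 0.7, "M": 0.2, "L": 0.1}}
# Assumed initial state distribution (not stated in the surviving text).
PI = {"H": 0.6, "C": 0.4}

def sequence_probability(states, observations):
    """Joint probability of one state sequence and the observations."""
    p = PI[states[0]] * B[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= A[prev][cur] * B[cur][obs]
    return p

observations = "SMSL"
# Enumerate all N^T = 2^4 = 16 state sequences.
scores = {
    "".join(seq): sequence_probability(seq, observations)
    for seq in product(STATES, repeat=len(observations))
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 6))
```

The enumeration visits all N^T state sequences, which is why this approach does not scale to long observation sequences.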
The brute force method applied here requires exponential amount of work which
The following three problems can be efficiently solved using HMMs [22, 42, 43]:
Problem 2. The model λ = (A, B, π) is known, find an optimal state sequence for
Table 2: Probabilities of observing (S, M, S, L) for all possible state sequences
number of states N are known, find the model λ = (A, B, π) that maximizes the
probability of O.
quence using the model λ. The second problem concentrates on exposing the hidden part
of the HMM. The third problem concentrates on training the HMM with the input
observation sequence O and the parameters M and N. In this paper, we first train a model
(Problem 3) on opcode sequences derived from a base piece of software. We then use the
trained model to score (Problem 1) morphed versions of this base software. Previous
research has shown that HMMs are effective at detecting metamorphic malware, and
that HMMs can also be used to detect certain types of software piracy [20]. That is,
HMMs have proven useful at detecting morphed or disguised versions of code.
Consequently, HMM analysis provides a challenging test for any code morphing technique.
3. $P(O \mid \lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i)$
2. For $t = T-2, T-3, \ldots, 0$ and $i = 0, 1, \ldots, N-1$, compute [43]
$$\beta_t(i) = \sum_{j=0}^{N-1} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$$
by $\beta_t(i)$ [43],
$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}$$
The most likely state at any time $t$ is the state for which $\gamma_t(i)$ is maximum.
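The following sketch combines the forward and backward passes to compute γ and pick the most likely state at each time step for the temperature example. The forward recursion and the initialization β_{T-1}(i) = 1 are the standard ones and are assumed here; π = (0.6, 0.4) and a_CC = 0.6 are assumptions consistent with the example.

```python
def gammas(A, B, pi, obs):
    """Return gamma[t][i] = alpha_t(i) * beta_t(i) / P(O | lambda)."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]   # beta_{T-1}(i) = 1
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    p_obs = sum(alpha[T - 1])
    return [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)] for t in range(T)]

# States: 0 = H, 1 = C.  Observations: 0 = S, 1 = M, 2 = L.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]          # assumed initial distribution
gamma = gammas(A, B, pi, [0, 1, 0, 2])

# Choose the state with the largest gamma at each time step.
most_likely = "".join("HC"[row.index(max(row))] for row in gamma)
print(most_likely)
```

Under these assumptions the per-step choice is CHCH, which differs from the single best overall sequence CCCH: maximizing each position separately is not the same as maximizing the probability of the whole sequence.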
By adjusting the model parameters, this algorithm provides an efficient way to best
fit the observations. In this algorithm, the number of states N and the number of
unique observation symbols M are constant. However, the other parameters, A, B and π,
can change, subject to the row stochastic condition. The process of re-estimating the model is
2. Compute $\alpha_t(i)$, $\beta_t(i)$, $\gamma_t(i)$ and $\gamma_t(i, j)$, where $\gamma_t(i, j)$ is di-gamma. This Di-
3. Re-estimate model parameters as follows: for $i = 0, 1, \ldots, N-1$, let
$$\pi_i = \gamma_0(i)$$
in [12, 23]. To serve as a virus detection tool, an HMM must be trained on input data
to generate a trained model. A trained HMM represents the statistical properties of the
virus family. These trained models are then used to determine the score of a new
binary file. This score indicates how “close” a new binary file is to the virus family
that the model represents. Based on threshold values, we can then categorize files.
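The scoring and categorization step can be sketched as follows, assuming that the log likelihood of a file under the trained model has already been computed. The function names, the numeric values, and the threshold are hypothetical and serve only to illustrate the length normalization and the comparison.

```python
def llpo(log_likelihood, num_opcodes):
    """Log likelihood per opcode: the length-normalized score."""
    return log_likelihood / num_opcodes

def belongs_to_family(score, threshold):
    """A file scoring above the threshold is treated as a family member."""
    return score > threshold

# Hypothetical log likelihoods and opcode counts for two files.
threshold = -4.0                      # hypothetical threshold
score_a = llpo(-2500.0, 1000)         # -2.5 per opcode
score_b = llpo(-54000.0, 9000)        # -6.0 per opcode

print(score_a, belongs_to_family(score_a, threshold))
print(score_b, belongs_to_family(score_b, threshold))
```

Dividing by the number of opcodes lets files of different lengths be compared against a single threshold.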
To train the HMM, a set of virus files belonging to the same family is first
disassembled. From each disassembled file, the assembly opcodes are extracted. These
opcodes constitute the HMM symbols. An example of extracted opcodes is shown in Fig-
from all virus files within the same family. This concatenated sequence is then used
to train an HMM. The set of unique opcodes from this long sequence serves as the set
of distinct observation symbols. An example of an HMM model is shown in Figure 15.
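The construction of the observation sequence can be sketched as follows. The opcode lists are made up for illustration, and the helper name `build_observations` is not from the original work.

```python
def build_observations(files):
    """Concatenate per-file opcode lists and map opcodes to symbol indices."""
    # One long sequence made from all files of the family.
    sequence = [opcode for opcodes in files for opcode in opcodes]
    # The distinct opcodes form the observation symbol set (M = len(symbols)).
    symbols = sorted(set(sequence))
    index = {opcode: i for i, opcode in enumerate(symbols)}
    return [index[opcode] for opcode in sequence], symbols

# Hypothetical opcode sequences extracted from two disassembled files.
files = [
    ["push", "mov", "call", "pop"],
    ["mov", "add", "mov", "ret"],
]
observations, symbols = build_observations(files)
print(symbols)
print(observations)
```

The integer sequence is the observation sequence O, and the number of distinct symbols gives the parameter M used in training.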
Figure 14: Extracted opcode sequence
problem, forward and backward algorithms are used to normalize the result. This
so log likelihood is length dependent. The longer the sequence, the larger the
magnitude of the log observation probability. The sequences in the test set can differ
in length from the sequences used to train the HMM. To obtain the LLPO, divide the log likelihood by
Figure 15: HMM Model
ent files, were also detected by the HMM in [53]. The detection rate of the HMM is
almost 90% [12].
Researchers have tried to write a metamorphic engine that evades HMM
detection [22]. Dead code is inserted into the virus files based on a dynamic scoring
algorithm. A block of dead code is inserted into the virus file only if doing so
moves the score of the virus file closer to the score of benign files.
Results in [23] indicate that inserting long sequences of opcodes, such as
subroutines, is more effective in avoiding HMM detection than randomly inserting blocks
of dead code. The HMM detector failed when 35% dead blocks and 30% subroutines were
inserted from benign files into the virus file. The metamorphic code generator presented in
this paper makes use of these results while inserting dead code.
CHAPTER 6
6.1 Introduction
at IR bytecode level instead of at assembly code. It has been seen in the past that
writing LLVM optimizer passes [17]. Malware writers have also tried to implement
The aim of the project is to produce multiple copies of the base software that are
hard to detect and significantly different from each other. When a program is compiled
with this optimizer, it generates a significantly different morphed copy of the base
software. Even after all metamorphic techniques were implemented, the HMM detector
developed in [53] was able to classify virus files and benign files correctly. An
unsuccessful attempt was made [11] to escape from the HMM-based detector. This
indicates that HMMs are very effective in detection. Therefore, our aim is to write a
metamorphic code generator that evades
include
• LLVM was originally implemented for C and C++, but its language-agnostic
design has spawned a wide variety of front ends, which include Objective-C,
Fortran, Ada, Haskell, Java bytecode, Python, Ruby, ActionScript, GLSL,
D, and Rust. Code written in any of the above languages can use our code
• At IR level, virtual addresses are not assigned. Addresses get assigned at bitcode
6.3 Goals
Morphed copies of code should have same functionality as base file. In addition,
the higher the percentage of inserted or modified code, the more the morphed files
should differ (on average) from the base file. A morphed base file will look like a
morphing file, if its opcode counts and opcode sequences are more like morphing files
than base file. As previously mentioned, HMMs have a proven record of being able
techniques described in Chapter 3. For example, register swapping and equivalent in-
from other program files. In addition, the order of these dead subroutines is
randomized. In this way, we create a significant amount of transposition and dead
to all dead code subroutines so that they are not trivially identifiable as dead code.
Dead code insertion involves inserting instructions whose result is never used in
any other computation. The main goal of adding this code is to increase the diversity
like instruction set, it is difficult to add dead instructions. However, functions can
be easily inserted by using the linker tool llvm-link. To insert dead code we need IR
We have used coreutils Linux command files [24] and files from the httpd web
server [4] as sources of dead code. These files include system-level code that performs
operations that we would expect to be somewhat similar to our selected base code.
LLVM provides options to optimize and remove dead code. Anti-virus software
is also smart enough to identify code snippets that are never executed. It can track
the execution sequence and detect the virus. To make the
metamorphic code generator smarter, we “call” the inserted dead code. A detailed
can be implemented easily by changing the sequence of the functions. This helps
to evade any pattern matching detector. We have written a Python script that operates
on IR bytecode and produces another text file which has the same functions but with a
In this project we have developed three passes. These passes and their algorithms
are explained in following subsections. The high level architecture of our morphing
The first pass inserts the dead code. A base file, a morphing file
(i.e., our source of dead code), and a dead code percentage are specified. Based on
the percentage of dead code, we determine the total number of lines we want to insert
into the base file. We then select complete functions from the morphing file so that
the total size approximates the number of lines we want to insert into the base file.
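One way to select functions is a greedy strategy, sketched below. This is an illustration under stated assumptions rather than the exact implementation: the target is assumed to be the given percentage of the base file's line count, functions are considered from largest to smallest, and the function names and sizes are hypothetical.

```python
def select_dead_functions(function_sizes, base_lines, percentage):
    """Greedily pick whole functions whose total size approximates the target.

    function_sizes maps a function name to its number of IR lines.
    """
    # Assumed definition of the target: a percentage of the base file size.
    target = base_lines * percentage / 100
    chosen, total = [], 0
    # Try the largest functions first and keep each one that still fits.
    for name, size in sorted(function_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if total + size <= target:
            chosen.append(name)
            total += size
    return chosen, total

# Hypothetical function sizes from a morphing file.
sizes = {"f": 40, "g": 30, "h": 25, "k": 10}
chosen, total = select_dead_functions(sizes, base_lines=500, percentage=10)
print(chosen, total)
```

Because only complete functions are selected, the inserted amount approximates the target rather than matching it exactly in general.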
These subroutines are integrated into the base file at the linking stage. The output
lists the names of the functions that were inserted. It also identifies the dead
function names that can be called in pass 2. The details of this first pass of our
1. Compile selected morphing file using llvm-gcc command and generate its IR
bytecode.
4. Based on total number of dead code lines, use a greedy strategy to determine a
6. Create bitcode files for the base code and temporary IR bytecode using llvm-as.
8. If there are any subroutine naming conflicts, replace each offending name in
In this pass, we use the LLVM optimizer to insert a “call” instruction for each
dead code subroutine. As mentioned in Section 6.5.1, pass 1 identifies dead functions
which can be called using this pass. The optimizer takes a function name as input. It
then finds the main function definition in the IR bytecode and inserts a “call”
instruction after every “load” instruction. This optimizer operates on the Module
class. The current implementation does not support structure-type parameters,
or pointers other than single pointers. For each dead code subroutine, we perform the
following steps.
3. To insert the “call” instruction, iterate over its function parameters. For each
IR bytecode file. This pass is written as a Python script. Its algorithm is explained in
the following steps.
number of functions.
4. If function is not already added, then write this function definition in temporary
IR bytecode file.
5. Repeat steps from 3 until all functions are written to temporary bytecode file.
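The steps above can be sketched in Python as follows. This is a simplified illustration, not the original script: it assumes textual IR in which every function starts on a line beginning with `define ` and ends on a line containing only `}`, it moves all other top-level lines ahead of the functions, and it shuffles the list once instead of repeatedly picking functions that have not yet been written. The function name `shuffle_functions` is invented here.

```python
import random

def shuffle_functions(ir_text, seed=None):
    """Return the IR text with its function definitions in random order."""
    header, functions, current = [], [], None
    for line in ir_text.splitlines():
        if current is None and line.startswith("define "):
            current = [line]                 # start of a function definition
        elif current is not None:
            current.append(line)
            if line.strip() == "}":          # end of the function body
                functions.append("\n".join(current))
                current = None
        else:
            header.append(line)              # globals, declarations, blank lines
    random.Random(seed).shuffle(functions)
    return "\n".join(header + functions) + "\n"

sample_ir = (
    'target triple = "x"\n'
    "\n"
    "define i32 @a() {\n  ret i32 1\n}\n"
    "\n"
    "define i32 @b() {\n  ret i32 2\n}\n"
)
shuffled = shuffle_functions(sample_ir, seed=1)
print(shuffled)
```

Each function body is written out unchanged, so only the order of the definitions in the file differs between runs.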
The LLVM 3.1 source code [34] is used. LLVM commands from this version
are used to link and compile the code. LLVM-GCC version 4.2, which ships with the
Mac operating system, is used to create IR bytecode files of the source program.
CHAPTER 7
Experiments
In this section, we use the HMM technique developed in [53] to test the effective-
of dead code to find the threshold at which HMM detector starts to fail. We show that
after adding about 20% (or more) dead code, our metamorphic code generator started
to escape HMM-based detection. These results indicate that our LLVM-based morph-
ing strategy is more effective than any of the hacker-produced metamorphic malware
For the experiments given here, we use spike fuzzer [41] as our base software.
or errors in the application [19]. Fuzzing finds bugs which can then be used by
attackers to run their own code. Fuzzing is one of the best ways in which exploitable
As this metamorphic code generator is written within the LLVM compiler, it has
different executable format. To disassemble the binary, we cannot use standard tools
like IDA Pro. The base software and morphing files should be disassembled by using
For each experiment, we generate 50 morphed copies by inserting dead code from
different morphing files. The morphing files were randomly selected from coreutil
Once the morphed files are generated, we use an HMM scoring technique similar
to that in [53]. Previous research [22, 42, 53] has consistently shown that the number
of hidden states in the HMM does not impact the quality of the file classification.
First, we train an HMM to model the base file. To obtain sufficient observations
for training, we generated 50 copies of the base file, each having a 5% rate of morphing.
We used 50 randomly selected files as morphing files to create this set of files. We then trained an
HMM on these 50 morphed files. We refer to this model as the “base HMM”. As
discussed in [53], the purpose of the morphing at this stage is simply to prevent the
base HMM from overfitting the available data in the base file. Consequently, we use
Once the base files, morphing files, and model are ready, we use the base HMM
to score the 50 morphing files. Specifically, we score the coreutil Linux commands
If every morphing file scores lower than every base file, the HMM can separate the
two sets: there is a threshold value such that a given file is classified as a base file
(score greater than the threshold) or as a morphing file (score lower than the threshold).
If the score of a morphing file is greater than the scores of some of the base files,
then no single threshold cleanly separates the two sets, and some base files can
escape HMM-based detection.
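The threshold argument above can be illustrated with a short sketch. The function names and any score values passed to them are hypothetical; the sketch only assumes that each file has a single per-opcode score and that a higher score means the file looks more like the base family.

```python
def separating_threshold(base_scores, morphing_scores):
    """Return a threshold that separates the two score sets, or None.

    A file is classified as a base (family) file when its score is above
    the threshold. Perfect separation requires every base score to exceed
    every morphing score.
    """
    lowest_base = min(base_scores)
    highest_morphing = max(morphing_scores)
    if lowest_base > highest_morphing:
        # Any value strictly between the two works; take the midpoint.
        return (lowest_base + highest_morphing) / 2.0
    return None

def escaped_base_files(base_scores, morphing_scores):
    """Count base files that score no higher than the best morphing file."""
    highest_morphing = max(morphing_scores)
    return sum(1 for score in base_scores if score <= highest_morphing)
```

When `separating_threshold` returns `None`, the scores overlap, and `escaped_base_files` gives a rough count of the base files that would be missed by a threshold set just above the highest morphing score.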
Figure 17 shows the scores of the 50 base files against those of the 50 morphing files.
All morphing files score lower than the base files. Thus, the base files generated using our code
Figure 17: HMM results for base virus and benign files
7.2 HMM
We then conducted experiments where we morph the base file at each of the
following rates: 10%, 20%, 30%, and finally, 50%. In each case, we generated 50
morphed versions of the base file, with each file morphed at the given rate. These
morphed copies were then scored using the base HMM and these scores were compared
to the scores obtained for the morphing files. Note that all scores are normalized to
a per opcode basis so that file size does not affect the results. By calculating scores
at the various percentages, we determined the percentage at which base files start to
escape HMM detection.
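A per-opcode score of this kind is commonly computed with the scaled forward algorithm. The following is a minimal sketch under that assumption, not the scoring code used in these experiments; the function name, the parameter layout (`pi`, `A`, `B`), and the encoding of opcodes as integer indices are all illustrative choices.

```python
import math

def log_likelihood_per_opcode(pi, A, B, observations):
    """Scaled forward algorithm; returns log P(O | model) / len(O).

    pi[i]   : initial probability of hidden state i
    A[i][j] : transition probability from state i to state j
    B[i][k] : probability of emitting opcode symbol k in state i
    observations : non-empty sequence of opcode symbol indices
    """
    n = len(pi)
    log_prob = 0.0
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for t in range(len(observations)):
        if t > 0:
            obs = observations[t]
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            # The model assigns zero probability to this sequence.
            return float('-inf')
        # Rescale to avoid underflow and accumulate the log of the scale.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob / len(observations)
```

Dividing by the sequence length is what makes scores of files with different opcode counts comparable.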
The number of lines added as dead code is calculated as a percentage of the number of
lines in the IR bytecode of the base file.
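As a small worked illustration of this calculation, the sketch below derives the dead code line budget from the base file and a morphing rate. The function name, the decision to count only non-empty lines, and the rounding down are assumptions made for the example.

```python
def dead_code_line_count(base_ir_text, rate_percent):
    """Number of dead code lines to insert for a given morphing rate.

    Counts the non-empty lines of the base file's textual IR and takes the
    requested percentage, rounded down (the rounding rule is an assumption).
    """
    base_lines = sum(1 for line in base_ir_text.splitlines() if line.strip())
    return base_lines * rate_percent // 100
```

For a base file with 200 non-empty IR lines, a 10% rate gives a budget of 20 dead code lines and a 50% rate gives 100.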
The first graph in Figure 18 shows the scores after inserting 10% dead code. The second
graph in Figure 18 shows the result after inserting 20% dead code.
Figure 18: HMM results with 10% and 20% dead code insertion
After inserting 10% dead code (left figure), the scores of the base morphed files
improve a little, but most of the files still have scores similar to those of the base files.
After inserting 20% (right figure), the scores are better, but some files still have scores
similar to those of the base files.
The first graph in Figure 19 shows the result after inserting 30% dead code. The second
graph in Figure 19 shows the scores after inserting 50% dead code.
The scores improve significantly after inserting 30% and 50% dead code.
For 50%, almost all base morphed files have scores almost the same as that
scores.
Figure 19: HMM results with 30% and 50% dead code insertion
From these results, we see that after inserting 20% dead code, the scores are
starting to merge, which indicates that the morphed base files are difficult for the
HMM to distinguish from the morphing files. This is precisely the effect that we hope
The graph in Figure 20 gives an overall picture of the results for the various
percentages of dead code insertion.
Results of Figure 20 are summarized in the ROC curves in Figure 21. These
ROC curves plot the false positive rate versus the true positive rate as the threshold
The area under the ROC curve (AUC) is equal to the probability that a classifier
ranks a randomly chosen positive instance higher than a randomly chosen negative
one [9]. The AUC values for the ROC curves in Figure 21 are given in Table 3.
Figure 20: HMM results with rate of dead code insertion
Note that an AUC of 1.0 indicates ideal separation (i.e., no false positives or false
negatives), while an AUC of 0.5 indicates that the classifier yields results that are
no better than flipping a coin. After inserting 20% dead code, our HMM classifier
does poorly, and at higher morphing rates, the rate of classification failure increases
dramatically. Again, these results show that our code morphing technique is highly
Figure 21: ROC curves for rate of dead code insertion
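The probabilistic reading of the AUC given above can be computed directly by comparing every positive score with every negative score. The sketch below is illustrative only: the function name is hypothetical, treating scores of morphed base files as positives and scores of morphing files as negatives is an assumption, and ties are counted as one half, as in the usual Mann-Whitney formulation.

```python
def auc_by_ranking(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

With fully separated scores this returns 1.0, and with identical score sets it returns 0.5, matching the two reference points described in the text.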
7.3 Detection
A number of dead code removal optimizers have already been developed in LLVM.
used compile-time optimizer options in the order given in Table 4 to remove dead
code. All these optimizers perform a single pass over the function to remove instructions
Table 4: LLVM optimizer passes used to remove dead code
After executing these optimizers on base morphed files with 50% dead code
insertion, we found that they removed some of the dead code from the base
morphed files. However, while optimizing, they used different instructions and hence different
opcodes. From Figure 22 we can see that the scores of some files actually improved
(LTO) to remove dead code. This optimizer, as explained in [31], works in multiple
phases and optimizes the code very effectively. After executing this optimizer, we
found that it removed all of the dead code we had inserted and also optimized
the original base file. The current implementation does not use the results of the dead
functions in the rest of the computation; therefore, the optimizer identifies the dead
functions over its multiple passes and removes them. From Figure 23 we can see that all 50 files have
After the link-time optimizer is applied, our metamorphic code generator becomes
ineffective: all morphed files revert to the original base file.
Figure 22: HMM scores after using optimizers
CHAPTER 8
Conclusion
In this paper, we presented and analyzed a novel code morphing technique based
on LLVM IR bytecode. Our approach makes a strong code morphing engine available
as a compile-time option, and requires no special effort on the part of the software
developer. As far as we are aware, this is the first general purpose code morphing
Our metamorphic generator uses dead code insertion and function permutation.
The dead code is in the form of functions copied from other programs, and these dead
functions are called within the program, unlike in [23, 42], which makes their detection
and removal more challenging. We tested the effectiveness of our code morphing using
an HMM technique that has proven successful in metamorphic malware detection and
certain cases of software piracy. We showed that our morphing is highly effective, in
the sense that the HMM cannot effectively distinguish our morphed code from other
Results from the experiments show that as the rate of dead code insertion
increases, the base files start to look more like the morphing files: the log likelihood
score per opcode of the base files becomes similar to that of the morphing
files.
The HMM detector’s performance is acceptable with 20% dead code insertion.
After inserting more than 20% dead code, the HMM started misclassifying files, as
From the experiments, this tool is still effective after compile-time optimizers
are used to remove dead code from base morphed files. However, after the link-time
optimizer is used, it becomes ineffective: all morphed files revert to the original base file.
here. The dead code insertion could be improved so that it does not depend on complete
subroutines; it should be possible to perform such insertion at the level of basic blocks. Other
serve to make our code morphing techniques more robust. For example, in our current
implementation, tools available within the LLVM framework could be used to analyze
the morphed bitcode. However, if the bitcode is converted to, say, a Windows PE
file, then the tools within LLVM will not be available for such analysis.
mizer as shown in Figure 23. Future work could address this situation.
LLVM compiler infrastructure can also be used to write complex malware [17,
mizers already exist to remove dead code at the IR bytecode level [25]. Similar optimizers
We used an HMM-based detector to evaluate the effectiveness of our generator. It was
not able to detect all of the virus files generated by our code generator. Further research
is needed to enhance the HMM detector. One possible way is to remove dead
LIST OF REFERENCES
[1] V. Adve and C. Lattner. A compilation framework for lifelong program analysis
and transformation. Proceedings of the 2004 International Symposium on Code
Generation and Optimization, 2004.
http://www.cgo.org/cgo2004/papers/06_76_lattner_c.pdf
[2] V. Adve and C. Lattner. Architecture for a next Generation GCC. First GCC
Annual Developer’s Summit, May 2003.
http://llvm.org/pubs/2003-05-01-GCCSummit2003pres.pdf
[5] S. Attaluri, S. McGhee, and M. Stamp. Profile hidden markov models and meta-
morphic virus detection. Journal in Computer Virology, 5:151–169, 2009.
[9] A. P. Bradley. The use of the area under the roc curve in the evaluation of
machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.
[12] FASM.
http://flatassembler.net/
[13] E. Filiol. Computer viruses: from theory to applications, Volume 1. Birkhauser,
2005, pp. 19–38.
[14] E. Filiol. Metamorphism, formal grammars and undecidable code mutation. In-
ternational Journal of Computer Science, 2:70–75, 2007.
[16] X. Gao and M. Stamp. Metamorphic software for buffer overflow mitigation. Pro-
ceedings of 3rd Conference on Computer Science and its Applications, P. P. Dey
and M. N. Amin, editors, San Diego, California, June 30, 2005.
[18] N. Idika and A.P. Mathur. A Survey of Malware Detection Techniques. Purdue
University, p. 48, 2007.
http://www.serc.net/system/files/SERC-TR-286.pdf
[20] S. Kazi and M. Stamp. Hidden Markov models for software piracy detection, to
appear in Information Security Journal: A Global Perspective.
[21] A. Lakhotia. Are Metamorphic Viruses Really Invincible? Virus Bulletin,
December 2004.
http://www.iscas2007.org/~arun/papers/invincible-complete.pdf
[22] D. Lin and M. Stamp. Hunting for undetectable metamorphic viruses. Journal
in Computer Virology, 7:201–214, Aug. 2011.
[23] D. Lin. Hunting for undetectable metamorphic viruses (2009). Master’s Projects.
Paper 18.
http://scholarworks.sjsu.edu/etd_projects/18
[26] LLVM Analysis and Transform passes.
http://llvm.org/docs/Passes.html#id63
[29] LLVM Helloworld in C, Overview on LLVM tools and explains how to compile
code using LLVM.
http://projects.prabir.me/compiler/wiki/LLVMHelloworldInC.ashx
[36] Panda Security (n.d.), Virus, worms, trojans and backdoors: Other harmful
relatives of viruses, 2011.
http://www.pandasecurity.com/homeusers-cms3/security-info/about-malware/generalconcepts/concept-2.html
[37] J. Praher. A change framework based on the Low Level Virtual Machine Compiler
Infrastructure. Thesis report at the Johannes Kepler University Linz, April 2007.
http://llvm.cs.uiuc.edu/pubs/2007-04-PraherMSThesis.pdf
[39] N. Runwal, R. M. Low, and M. Stamp. Opcode graph similarity and metamorphic
detection. Journal in Computer Virology, 8:37–52, 2012.
[42] S. Sridhara. Metamorphic Worm that Carries Its Own Morphing Engine (2012).
Master’s Projects. Paper 240.
http://scholarworks.sjsu.edu/etd_projects/240/
[44] M. Stamp. Information Security: Principles and Practice, second edition,
Wiley, 2011.
[45] M. Stamp. Risks of Monoculture. Inside Risks 165, CACM 47, 3, March 2004.
http://www.csl.sri.com/users/neumann/insiderisks04.html#165
[46] Symantec. What is the difference between viruses, worms, and Trojans?, 2006.
http://service1.symantec.com/support/nav.nsf/docid/1999041209131106
[49] A. Venkatesan. Code obfuscation and virus detection (2008). Master’s Projects.
Paper 116.
http://scholarworks.sjsu.edu/etd_projects/116
[52] VX Heavens.
http://download.adamas.ai/dlbase/Stuff/VX%20Heavens%20Library/static/vdat/creatrs1.htm
[53] W. Wong and M. Stamp. Hunting for metamorphic engines. Journal in Computer
Virology, 2(3):211–229, 2006.
[54] P. Zbitskiy. Code mutation techniques by means of formal grammars and au-
tomatons. Journal in Computer Virology, 5:199–207, 2009.
APPENDIX
• HMM parameters: N = 3
• Graph of overall picture with different rate of dead code insertions: Figure A.4
Figure A.2: HMM results with 10% and 20% dead code insertion
• AUC values for the various percentages of dead code insertion are shown in
Table A.1.
Figure A.3: HMM results with 30% and 50% of dead code insertion
Figure A.4: HMM results with rate of dead code insertion
Table A.1: ROC AUC statistics for rate of dead code insertion