Fast Automated Unpacking and Classification of Malware
Identification of malware variants provides great benefit in early detection. Control flow
has been proposed as a characteristic that can be identified across variants, resulting in
construct the signatures but can be ineffective if malware undergoes a code packing
transformation to hide its real content. This thesis proposes a novel system, named
Malwise, for malware classification using a fast application level emulator to reverse
the code packing transformation, and two flowgraph matching algorithms to perform
exact flowgraph matching algorithm uses string based signatures of graph invariants,
and is able to detect malware with near real-time performance. The approximate
flowgraph matching algorithm is slower but more effective and uses the decompilation
using the string edit distance. To demonstrate the effectiveness and efficiency of the
automated unpacking and flowgraph based classification, we evaluate the system with
synthetic malware and over 15,000 real samples. The evaluation shows our system is
highly effective in terms of accuracy in revealing all of a sample’s hidden code, execution
time for unpacking and classification, and accuracy in detection of malware variants.
Fast Automated Unpacking and Classification of
Malware
Silvio CESARE
Master of Informatics
May 2010
Certificate of Authorship and Originality of thesis
The work contained in this thesis has not been previously submitted either in whole or
in part for a degree at Central Queensland University or any other tertiary institution. To
the best of my knowledge and belief, the material presented in this thesis is original
Signed:
Copyright statement
This thesis may be freely copied and distributed for private use and study, however, no
part of this thesis or the information contained therein may be included in or referred to
in publication without prior written permission of the author and/or any reference fully
acknowledged.
Signed:
Table of Contents
1 Introduction
2.1.1.8 Code Packing
2.3.8 Data Flow
2.5.4 Graphs
2.7.3.1 Whole System Emulation
2.8.2.1.1 Whole Program Control Flow Graph Isomorphism Recognition
2.10 Summary
4.2 Application Level Emulation
4.4 Discussion
4.6 Summary
5 Malware Feature Extraction
5.4 Discussion
5.5 Summary
6.4.1 Effectiveness
7.2 Conclusions
List of Tables
Table 1. Metrics for identifying the original entry point in packed samples (hostname.exe).
Table 2. Metrics for identifying the original entry point in packed samples (calc.exe).
Table 9. Similarity matrix for non similar programs using approximate matching.
Table 10. Similarity matrix for non similar programs using exact matching.
List of Figures
Figure 12. A control flow graph (left), and a call graph (right).
Figure 14. Block diagram of the Malwise malware classification system.
Figure 16. The relationship between a control flow graph, a high level structured graph,
Figure 17. The grammar to represent a structured control flow graph signature.
Figure 18. Assignment of flowgraph strings between sets.
Figure 20. Pseudo code for the set similarity search.
Acknowledgments
I would first like to thank my family for their support and in particular my mother,
and scholarship support. I am thankful for his desire to publish during candidature.
List of Publications
[1] Silvio Cesare, and Yang Xiang 2010, Classification of Malware Using
[2] Silvio Cesare, and Yang Xiang 2010, A Fast Flowgraph Based Classification
System for Packed and Polymorphic Malware on the Endhost, IEEE 24th
1 Introduction
The presence of malicious software is a problem that plagues internet and network
by their malicious intent. They are hostile, intrusive or annoying software programs.
Examples of malware include trojan horses, worms, backdoors, dialers and spyware.
Internet Threat Report [1], 499,811 new malware samples were received in the second
half of 2007. F-Secure additionally reported, “As much malware [was] produced in
The modern purpose of malware is that of criminal enterprise for financial gain [3]. In
2008, “78 percent of confidential information threats exported user data” [3]. The
stealing of banking information using malware known as spyware [4] to covertly log
The malware problem continues when malicious software remains undetected by users.
example of a malicious botnet [5]. Botnets are illegally leased to criminal networks in
order to create email spamming networks, and to extort money from commercial
entities using the threat of distributed denial of service attacks. A user’s inability to
prevent or detect malware often makes them liable to become an additional node in a
malware poses to users’ security. Detecting malware before it is allowed to execute its
malicious or benign, automated analysis is required. The analysis can employ either a
static or dynamic approach. In the dynamic approach, the malware is executed, possibly
in a sandbox, and its runtime behaviour examined. In the purely static approach, the
static detection. Dynamic approaches [6], while having some benefits compared to
static detection, also have disadvantages. The dynamic approach requires an execution
For cross platform systems, this may be an ineffective environment in which to operate.
required, which may allow malware to execute its intent, before being detected.
malicious behaviour is not triggered during the analysis. While dynamic malware
detection is an important topic, this thesis focuses only on the static detection of
malware.
Traditional solutions to static malware detection have employed the use of signatures.
identify it. Because of performance constraints, the most predominantly used signature
is a string containing patterns of the raw file content [7, 8]. This allows for a string
search [9] to quickly identify patterns associated with known malware. However, these
patterns can easily be invalidated because minor changes to the malware source code
have significant effects on the malware’s raw content. Thus, traditional signatures can
related, but different instances of malware sharing a common history of code. Code
sharing among variants can have many sources, whether derived from autonomously
self mutating malware, or manually copied by the malware creator to reuse previously
authored code. Related to polymorphic malware is packed malware. Code packing is
an obfuscation technique used to hide a malware’s real content. A code packing tool is
new packed version of the malware. It is often used to make manual analysis and
automated analysis of the malware more difficult. Code packing is also used to evade
than the traditional byte level content. The field of static program analysis has provided
This thesis investigates unpacking combined with the static detection and classification
polymorphism, develop algorithms to unpack malware, and develop algorithms to
1.2.1 Aim
The aim of this research is to discover effective and efficient methods for the detection
1.2.2 Scope
The scope of this study is limited to malware in the form of executable program
worms based on network traffic, while important in their own right, are not investigated.
To achieve the aim of detecting and classifying malware, the associated analyses must
would otherwise make such analyses ineffective. This mandates that the code packing
investigated. The scope is limited to only the code packing transformations and
obfuscations evident in malware. The scope does not extend into the general problem of
deobfuscation and associated static analyses. The malware detection and classification,
Chapter 2 surveys the related work of malware unpacking and static
the general approach for Malwise, our prototype malware classification system.
application level emulation. We also propose our method for detecting when
Chapter 5 examines the static features we extract from malware that we will use
1.4 Major Contributions of the Thesis
complete and the hidden code has been revealed. Entropy analysis has
previously been used to detect packed binaries, but has not been used in
malware unpacking.
similarity. The graph invariant chosen has been used previously to aid detection
classification.
based control flow signature, amenable to comparisons using the string edit
distance. This approach can be used for approximate control flow graph
matching. Decompilation has not been used previously to construct control flow
We propose a set similarity function and a set similarity search algorithm which
form the basis for our malware classification system and which perform
efficiently in the expected case. The set similarity function and search are
Malwise.
2 Related Work
This chapter surveys the related work in malware unpacking and classification. The
Section 2.3 provides a taxonomy of static features that are present in malware and
benign samples that can be used for automated malware classification and
detection purposes.
Section 2.5 categorizes the taxonomy of static program features in terms of their
Section 2.6 examines static analysis techniques that can be used on malware and
Section 2.8 then surveys the literature that investigates static classification of
structure of the malware [10]. Though the syntactic structure changes in polymorphic
used to evade byte level signature based detection and classification that is routinely
usage it is used to describe the automated syntactic mutation of the malware’s code and
mutation of limited parts of the malware’s instruction content. The remaining parts of
the malware are encoded at the byte level without regard to the instruction syntax or
other.
Dead code is also known as junk code or a semantic nop [10]. Dead code is
semantically equivalent to a nil operation. Insertion of this type of code has no semantic
impact on the malware. The insertion increases the size of the malware and modifies the
Figure 1. A semantic nop.
semantically equivalent, but differing instructions and instruction sequences. The size of
Variable renaming [11] and the associated technique of register reassignment alter the
use of variables and registers in a sequence of code such that the instructions are
semantically equivalent but use different variables and registers when compared to the
original code.
2.1.1.4 Code Reordering
Code reordering [11] changes the syntactic order of the code in the malware [10]. The
actual or semantic execution path of the program does not change. However, the
syntactic order as present in the malware image is altered. Code reordering includes the
techniques of branch obfuscation, branch inversion, branch flipping, and the use of
opaque predicates.
Branch obfuscation attempts to hide the target of a branch instruction. Examples include
the use of Structured Exception Handling (SEH) on the Microsoft Windows platform.
The use of SEH to obscure control flow is common in modern malware. Similar
techniques involve indirect branching. Indirect branching uses data content as the target
of a branch. This translates control flow identification into a harder data flow analysis
problem. The use of a branch function [12] extends this approach and dispatches
multiple branches through a single routine. The main purpose of branch obfuscation is
to make the static analysis of the malware by an analyst or automated system harder to
perform.
mov $0x8048200,%eax
jmp *%eax
2.1.1.6 Branch Inversion and Flipping
Branch inversion inverts the branch condition in conditional branches. Whereas the
branch may originally transfer control when the condition is true, branch inversion
alters the condition to branch when false. To maintain the original semantics of the
program the branch instruction is also inverted. For example, a branch on condition true
condition being tested would also be inverted. Branch inversion is effectively a form of
flow properties. For example, if the original code is to branch on condition true then the
new code branches on condition false to the original fall-through instruction. The new
target.
Original:
jz $0x80482000
L:

Transformed:
jnz L
jmp $0x80482000
L:
2.1.1.7 Opaque Predicate Insertion
An opaque predicate [12] is a predicate that always evaluates to the same result. An
to know the predicate result. Opaque predicates can be used to insert superfluous
branching in the malware’s control flow. They can also be used to assign variables
values which are hard to determine statically. The use of opaque predicates is primarily
analysis.
mov $1,%eax
jz $0x80482000
Code packing [13, 14] is used to hide and obfuscate the contents of malware from an
analyst and automated static analyses. Code packing is described in Section 2.2.
new variant is a derived work of the original malware. Semantic changes to malware
occur due to the malware authors modifying the original source code or functionality.
This can occur due to a natural evolution of the malware during its software development
life cycle. Additionally, it can occur when a malware author reuses existing malicious
2.1.2.1 Code Insertion
algorithm or code.
Code transposition occurs when specific code and functionality of the malware is
removed from its initial location and inserted into a semantically different location in
the malware.
Code packing is the dominant technique used to obfuscate malware and hinder an
analyst’s understanding of the malware’s intent. In one month during 2007, 79% of
identified malware from a commercial Antivirus vendor was found to be packed [15].
Additionally, almost 50% of new malware in 2006 were repacked versions of existing
malware [16].
[17] evaluated the effectiveness of code packing against Antivirus detection by
providing a service to pack malware using a variety of code packing tools. Antivirus
systems often have the capabilities of unpacking known code packing tools, and
unpacking unknown tools has also had commercial interest [18]. However, Polypack
demonstrated that packing can be an effective tool to defeat an Antivirus system with
many commercial malware detection systems failing to identify the packed versions of
existing malware.
Code packing is used in the majority of malware, but code packing also serves to
provide compression and software protection for the intellectual property contained in a
being indicative of malicious activity. Code packing tools are freely available [19] and
commercially sold to the public as legitimate software [20]. For this reason, unpacking
are malicious, rather than identifying only the fact that unknown contents are packed.
[Figure: the traditional code packing transformation. Packing: Hidden Code = f(Original Code), stored with a restoration routine. Runtime: Original Code = g(Hidden Code), alongside the remnant code and restoration routine.]
The most common method of code packing is described in [13]. Malware employing
this method of code packing transforms executable code into data as a post-processing
stage in the malware development cycle. This transformation may perform compression
analysis. At runtime, the data, or hidden code, is restored to its original executable form
Execution then resumes as normal to the original entry point. The original entry point
marks the entry point of the original malware, before the code packing transformation is
applied. Execution of the malware, once the restoration routine is complete and control
is transferred to the original entry point, is transparent to the fact that code packing and
restoration had been performed. A malware may have the code packing transformation
applied more than once. After the restoration routine of one packing transformation has
[Figure: the shifting decode frame. Labels: Shifting Decode Frame Restoration Routine; Hidden Code = f(Original Code); Hidden Code.]
an encrypted form at run-time. During execution of the malware, blocks of memory can
automated system from having access to all the hidden code at any single moment in
time. This technique is known as the shifting decode frame [22]. The granularity of
encryption can occur at the page level, the basic block level, and the instruction level.
This type of code packing is not often used by malware in the wild, and in practice, traditional
code packing and instruction virtualization are the dominant techniques used in real
malware.
Code packing may employ the use of instruction virtualization also known as a malware
emulator [14]. An emulator used by a malware should not be confused with an emulator
[Figure: instruction virtualization. Labels: Interpreter (at packing and at runtime); Original Code.]
packing translates the original native code into a byte-code which is subsequently
emulated by the malware at run-time. Using this form of code packing, the hidden code
Many malware packers introduce code that intentionally makes run-time analysis of the
packed malware more difficult [22]. Strategies employed by packed malware include
detection of the malware being debugged, or detection of the malware being executed
inside a virtual machine. These techniques are currently being employed by malware
[23]. In these situations, when an attempted dynamic analysis is being performed, the
execution of the malware packer diverges and the true malware behaviour remains
2.3 Taxonomy of Static Program Features
Malware classification and detection involves the extraction of features which are
execution of the programs and extracting features based on their behaviour. Static
The object file header contains attributes which are often custom written during link
2.3.2 Bytes
The simplest feature that can be extracted from a program is the raw byte level content
of the malware executable file [24]. An alternative source of content comes from the
individual program sections in the binary, including the code and data segments.
Bytes                   Instructions
8d 4c 24 04             lea 0x4(%esp),%ecx
83 e4 f0                and $0xfffffff0,%esp
ff 71 fc                pushl -0x4(%ecx)
55                      push %ebp
89 e5                   mov %esp,%ebp
51                      push %ecx
83 ec 24                sub $0x24,%esp
e8 6a 00 00 00          call 4011b0 <___main>
c7 45 f8 00 00 00 00    movl $0x0,-0x8(%ebp)
eb 10                   jmp 40115f <_main+0x2f>
c7 04 24 a0 20 40 00    movl $0x4020a0,(%esp)
e8 5d 00 00 00          call 4011b8 <_puts>
83 45 f8 01             addl $0x1,-0x8(%ebp)
83 7d f8 09             cmpl $0x9,-0x8(%ebp)
7e ea                   jle 40114f <_main+0x1f>
83 c4 24                add $0x24,%esp
59                      pop %ecx
5d                      pop %ebp
8d 61 fc                lea -0x4(%ecx),%esp
c3                      ret

Basic blocks
lea 0x4(%esp),%ecx
and $0xfffffff0,%esp
pushl -0x4(%ecx)
push %ebp
mov %esp,%ebp
push %ecx
sub $0x24,%esp
call 4011b0 <___main>
movl $0x0,-0x8(%ebp)
jmp 40115f <_main+0x2f>

movl $0x4020a0,(%esp)
call 4011b8 <_puts>
addl $0x1,-0x8(%ebp)

cmpl $0x9,-0x8(%ebp)
jle 40114f <_main+0x1f>

add $0x24,%esp
pop %ecx
pop %ebp
lea -0x4(%ecx),%esp
ret
2.3.3 Instructions
instruction level content of a program can represent a more resilient form than the byte
level content if the instructions are considered by their type or mnemonic representation
[25].
A basic block is a straight line sequence of code without an intervening control transfer
instruction [26]. The basic block may be treated at the byte level, or at the instruction
level. Additionally, data dependencies within the basic block may be examined to
construct a directed acyclic graph [27]. The basic blocks may also be grouped to form a
set, or they may have additional structure imposed by the control flow graph.
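As a minimal sketch of how basic blocks might be recovered from a disassembled procedure, the following marks block leaders (the first instruction, every branch target, and every instruction following a branch) and splits the instruction list at those points. The function name, the tuple layout, and the choice not to end a block at a call instruction are assumptions made for illustration. The addresses are derived here from the instruction lengths and branch targets of the example procedure used in this section.

```python
# Mnemonics that end a basic block in this sketch (an assumption: calls are
# deliberately not treated as block terminators).
BLOCK_ENDERS = {"jmp", "jle", "ret"}

def basic_blocks(instructions):
    """Split (address, mnemonic, branch_target) tuples into basic blocks.

    The list must be sorted by address; branch_target is None for
    instructions that are not branches with a known target.
    """
    if not instructions:
        return []
    # A leader is the first instruction of a basic block.
    leaders = {instructions[0][0]}
    for index, (address, mnemonic, target) in enumerate(instructions):
        if mnemonic in BLOCK_ENDERS:
            if target is not None:
                leaders.add(target)                      # branch target
            if index + 1 < len(instructions):
                leaders.add(instructions[index + 1][0])  # instruction after branch
    blocks = []
    for address, mnemonic, _ in instructions:
        if address in leaders:
            blocks.append([])
        blocks[-1].append((address, mnemonic))
    return blocks

# Hypothetical encoding of the example procedure; addresses are computed
# from the byte lengths shown in the listing.
EXAMPLE = [
    (0x401130, "lea", None), (0x401134, "and", None), (0x401137, "pushl", None),
    (0x40113a, "push", None), (0x40113b, "mov", None), (0x40113d, "push", None),
    (0x40113e, "sub", None), (0x401141, "call", None), (0x401146, "movl", None),
    (0x40114d, "jmp", 0x40115f),
    (0x40114f, "movl", None), (0x401156, "call", None), (0x40115b, "addl", None),
    (0x40115f, "cmpl", None), (0x401163, "jle", 0x40114f),
    (0x401165, "add", None), (0x401168, "pop", None), (0x401169, "pop", None),
    (0x40116a, "lea", None), (0x40116d, "ret", None),
]

if __name__ == "__main__":
    for block in basic_blocks(EXAMPLE):
        print([f"{address:x} {mnemonic}" for address, mnemonic in block])
```

Under these assumptions the example procedure splits into four blocks. Treating calls as block terminators would be an equally defensible choice and would yield a finer partition.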
[Figure 12 image: the control flow graph of the example procedure, with basic blocks as nodes, and a call graph with nodes Proc_0, Proc_1, Proc_2 and Proc_3.]
Figure 12. A control flow graph (left), and a call graph (right).
The control flow graph is a directed graph, where the nodes are basic blocks [28]. The
edges in the graph represent the possible control flow of the associated procedure. The
control flow graph represents the intra-procedural control flow. A program may be
considered a set of control flow graphs, or the control flow graphs may have additional
structure as dictated by the call graph. Alternatively, control flow graphs may represent
inter-procedural and intra-procedural control flow in a single graph. In this case, the
graph. Loop nest trees, dominator trees, and control dependency graphs can also be
constructed [27].
2.3.6 Call Graph
Call graphs, like control flow graphs, model the possible execution paths and control flow
in a program [29]. The call graph is a directed graph representing the inter-procedural
control flow.
Like the control flow graph, alternative or abstracted representations are possible such
as a dominator tree.
Programs interface with the underlying operating system and libraries. The invocation
of an API function from a known library can often be identified statically [30]. The API
The data flow of a program represents the set of possible values data may hold during
program execution [31]. Many types of data flow analyses exist, including live variable
analysis, reaching definitions, and value-set analysis. Each analysis looks at a particular
property of the data at specific program points. Modelling the data flow requires that the
2.3.10 System Dependence Graph
The system dependence graph is a collection of procedure dependence graphs; one for
Malware may be polymorphic, but static program features are known to be invariant
Byte and instruction level program features perform poorly when faced with the
compile time options may result in syntactic changes including variable renaming, and
instruction substitution. Code normalization [10] can sometimes reverse the effects of
syntactic polymorphism and can work in practice, but is not based on a sound
technique. Additionally, the byte and instruction stream may change when minor
The advantage of byte level content as a program feature is that the dependence on
If the instruction stream is used, additional challenges are presented because it is known
that perfect disassembly of an unknown image is undecidable on the x86 platform [32].
program can be used. The control flow features including control flow graphs and call
graphs are considered more invariant in polymorphic malware than byte and instruction
level content [28]. However, opaque predicates may result in these features being
altered. The detection of opaque predicates has been investigated, but it is not evident
that this is entirely satisfactory, and a sound method of detection against all unknown
predicates is not possible. For example, it is known that some algorithms which are used
to construct predicates are not proven to be true and remain only as conjectures that
The presence of pointers and indirection in assembly language also presents problems to
static analyses which may not have the precision required to construct a control flow
graph or call graph with the degree of accuracy required for malware classification. For
all its disadvantages, control flow has been shown to be an effective feature that is invariant
The use of API calls is another approach to solve the syntactic polymorphism problem.
This approach has problems with malware that obscures the use of those calls, as is the
case of the stolen bytes technique [22] introduced by code packing tools.
Data flow analysis is another high level abstraction but when used in the presence of
The procedure and system dependence graphs have similar problems with pointers and
indirection even when the data dependencies of pointers are ignored. The dependence
graphs are also dependent on accurate modelling of the instruction sequence. This
avoids problems such as register reassignment because the data dependencies are
represented as a graph. However, the problem occurs with the modelled instructions
used in the data dependencies, which may be polymorphic and variant. Polymorphism is
not handled effectively in this situation although code normalization may help.
2.5 Classification of Static Program Features
The program features can be divided into four categories of models that enable
Vectors
Strings
Sets
Graphs
2.5.1 Vectors
Vectors represent the simplest object when processed for classification purposes.
[25]. Selecting features and reducing the dimensionality of a vector or feature vector is
possible using data mining techniques. Exact matching of vectors can be done quickly,
in linear time relative to the dimensionality of the vector. Approximate matching may
employ distance metrics or similarity functions. Distance metrics exist between vectors
including the Euclidean distance and the Manhattan distance. Additional methods to
determine the similarity between two vectors include the cosine similarity.
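The vector measures named above can be illustrated with a short sketch. The function names are chosen here for illustration only, and the code assumes both vectors have the same dimensionality.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Manhattan distance: the sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; defined as 0.0 here if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

if __name__ == "__main__":
    u, v = [0, 0], [3, 4]
    print(euclidean(u, v), manhattan(u, v))
    print(cosine_similarity([1, 2], [2, 4]))
```

Each function makes a single pass over the coordinates, so its cost is linear in the dimensionality of the vectors.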
2.5.2 Strings
Strings are often associated with byte level content in relation to malware classification.
Searching for the presence of a substring in a body of text is a traditional technique used
malware database. The Aho-Corasick [9] string matching algorithm can be performed in
a time independent of the size of the database. Extensions to string matching include the
Byte level content may be treated as a string and approximate matching performed. The
Levenshtein or edit distance between two strings is the minimum number of insertions,
deletions and substitutions to transform one string to the other. The edit distance is the
basis for an approximate dictionary search which identifies related strings with at most
a specific number of errors. Related string metrics to show similarity between strings
include the longest common subsequence (LCS), and the sequence alignment
algorithms which are used frequently in the Bioinformatics field. The Smith-Waterman
It is possible to extract all substrings of size n from the text to produce n-grams. Distinct
n-grams represent dimensions in a feature vector. This approach can improve the
grams also allows for reordering of substrings that the edit distance would penalize
heavily. The use of an n-gram feature vector reduces the problem of approximate
matching of strings and byte or instruction level content to the problem of approximate
vector matching.
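A minimal sketch of the n-gram reduction described above, assuming byte level content and using hypothetical function names: every substring of length n is extracted, and the count of each distinct n-gram becomes one dimension of a sparse feature vector.

```python
from collections import Counter

def ngrams(data: bytes, n: int):
    """Return every contiguous substring of length n, in order of appearance."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def ngram_vector(data: bytes, n: int) -> Counter:
    """Sparse feature vector: each distinct n-gram is a dimension, its count the value."""
    return Counter(ngrams(data, n))

if __name__ == "__main__":
    # Swapping the two halves of the input keeps the 2-grams b"ab" and b"cd",
    # even though the two byte strings differ at every position.
    print(ngram_vector(b"abcd", 2))
    print(ngram_vector(b"cdab", 2))
```

The resulting counters can be compared with any vector distance or similarity function once missing dimensions are treated as zero.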
complexity or entropy.
2.5.3 Sets
between sets or collections of objects. Objects could include the control flow graphs or
the basic blocks of a program. An example usage could be to show program similarity
by identifying the set similarity between the programs’ basic blocks. A number of set
similarity functions exist such as the Dice coefficient or the Jaccard index [33].
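The two set similarity functions named above can be sketched as follows. This illustrates the standard definitions only and is not the set similarity function proposed for Malwise. The function names and the convention that two empty sets have similarity 1.0 are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """Dice coefficient: twice the intersection size over the sum of the set sizes."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

if __name__ == "__main__":
    # Hypothetical identifiers standing in for the basic blocks of two programs.
    program_a = {"block1", "block2", "block3"}
    program_b = {"block2", "block3", "block4"}
    print(jaccard(program_a, program_b), dice(program_a, program_b))
```

Both functions return 1.0 for identical sets and 0.0 for disjoint, non-empty sets.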
2.5.4 Graphs
Graphs naturally describe a number of program features including control flow graphs
and call graphs. Showing that two graphs are equivalent means showing that they are
isomorphic. No polynomial time algorithm is known for this problem, but it has also
not been proven that none exists. Approximate, or inexact, graph matching is harder
still. Approximate graph matching is based on the graph edit
distance or the maximum common subgraph. The graph edit distance is analogous to the
made. Graphs may be decomposed into subgraphs of fixed sizes where each distinct
gram decomposition.
the program being analysed is not executed. This type of analysis is often employed
during program compilation for the purposes of code optimisation. Static analysis of
malware has many benefits in identifying features and building abstract models of
malware. These features and models can be used to perform malware classification.
Static analysis has been widely investigated, and its scope in this survey is limited to its
2.6.1 Disassembly
Static disassembly parses the entire binary image statically without execution. In static
disassembly, there are two main algorithms. In the Linear Sweep algorithm, the
instructions are disassembled one instruction after another, starting from the beginning
of code. The disadvantage of this method is that data introduced into the instruction stream
The other main algorithm to perform disassembly is the Recursive Traversal algorithm.
This algorithm decodes each instruction following the order of the control flow. This
resolves the issue of embedded data, but may miss decoding instructions that are the
target of indirect jumps or other situations when it is hard to resolve control flow
statically.
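The difference between the two algorithms can be shown on a toy example. The sketch below assumes a hypothetical four-opcode instruction set, not x86, in which branch instructions carry a one byte absolute target. Linear sweep decodes the bytes in address order, while recursive traversal decodes only what is reachable through control flow from the entry point.

```python
# Hypothetical toy instruction set: opcode -> (mnemonic, length in bytes).
OPCODES = {0x01: ("nop", 1), 0x02: ("jmp", 2), 0x03: ("jz", 2), 0x04: ("ret", 1)}

def decode(code, offset):
    """Decode one toy instruction; return (mnemonic, length, target) or None."""
    if offset >= len(code) or code[offset] not in OPCODES:
        return None
    mnemonic, length = OPCODES[code[offset]]
    if offset + length > len(code):
        return None
    target = code[offset + 1] if length == 2 else None
    return mnemonic, length, target

def linear_sweep(code):
    """Decode sequentially from offset 0, regardless of control flow."""
    offset, found = 0, []
    while offset < len(code):
        insn = decode(code, offset)
        if insn is None:
            found.append((offset, "(bad)"))
            offset += 1
        else:
            found.append((offset, insn[0]))
            offset += insn[1]
    return found

def recursive_traversal(code, entry=0):
    """Decode only the offsets reachable by following control flow from entry."""
    found, worklist = {}, [entry]
    while worklist:
        offset = worklist.pop()
        if offset in found:
            continue
        insn = decode(code, offset)
        if insn is None:
            continue
        mnemonic, length, target = insn
        found[offset] = mnemonic
        if mnemonic in ("jmp", "jz"):
            worklist.append(target)           # follow the branch target
        if mnemonic not in ("jmp", "ret"):
            worklist.append(offset + length)  # follow the fall-through path
    return sorted(found.items())

# jmp 4, two embedded data bytes (0x02, 0xFF), then nop and ret at offset 4.
CODE = bytes([0x02, 0x04, 0x02, 0xFF, 0x01, 0x04])

if __name__ == "__main__":
    print(linear_sweep(CODE))
    print(recursive_traversal(CODE))
```

On this input, linear sweep decodes the embedded data at offset 2 as a jmp instruction, while recursive traversal skips it. Neither sketch attempts to resolve indirect branches, which is where recursive traversal loses coverage.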
algorithm problem by first performing the Recursive Traversal, and then performing a
proposed a more robust algorithm in [34] to disassemble binaries that had been
purposely obfuscated.
when a program uses indirect branches and procedure calls. The analysis of indirect
targets requires data flow analysis. A number of approaches have been employed [35-
37], but the simplest approach is to ignore indirect targets completely and accept a less
accurate representation. The edges of the graphs representing the control flow can be
constructed by connecting the branch or call site to the branch or call target.
The presence of opaque predicates in a control flow graph reduces the accuracy of the
graph because of misleading branch targets. In [38] it was proposed to use the program
predicate algorithms.
Alias analysis is an analysis that seeks to statically determine the possible values that
pointer variables may contain during program execution. Value-Set Analysis [39] has
been proposed as an alias analysis, suitable for binary programs and assembly language.
Value-Set Analysis has been used in malware detection [40] and the automated static
2.6.4 Obfuscation and Limits to Static Analysis
It is known that perfectly precise disassembly is undecidable [32]. Branch targets can be
indirect, and precise understanding of those run-time values can be problematic. In [42],
some limits to the static analysis of malware were identified. The use of
opaque predicates with hard to analyse predicates was shown to confound the problem
Automated unpacking is the process of revealing the hidden code that is introduced by
classification because it is required for the static analysis to avoid false classification of
code packing transformation. By knowing that the sample is not packed, further
unpacking analysis need not be performed. The process of identifying packed binaries
begins with feature extraction. The raw file and section contents can be examined using
[Figure: a packed executable, consisting of a restoration routine and high entropy hidden code.]
Hamrock and Lyda in [43]. The Entropy of a block of data is a statistical measure that
$$H(x) = -\sum_{i=1}^{N} \begin{cases} p(i)\,\log_2 p(i), & p(i) \neq 0 \\ 0, & p(i) = 0 \end{cases}$$
where p(i) is the probability of the ith unit of information in event x's sequence of N
symbols. For the malware packing analysis, the unit of information is a byte value, N is
Hamrock and Lyda made the key observation that compressed and encrypted data
characterise packed malware samples, and compressed and encrypted data are
characterised as having high entropy. Program code and data are found to have much
lower entropies. Using this observation, packed malware is identified by the high
Entropy analysis is simple to implement and has been shown to be effective, yet it has some
limitations. Entropy analysis can fail to detect packed malware that intentionally lowers
its own entropy. However, this form of evasion is not presently employed by malware.
Additionally, entropy analysis can fail to identify code packing transformations which
perform simple obfuscations on the malware content, and do not transform and
obfuscate the malware using strong encryption or compression. Likewise, code packing
making entropy analysis unable to identify binaries packed using this method.
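The entropy measure described above can be illustrated with a short sketch. It computes the standard Shannon entropy over byte-value frequencies and flags high entropy blocks. The function names, the block size of 256 bytes and the threshold of 7.0 bits per byte are assumptions made for illustration; they are not values taken from [43] or from this thesis.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Only byte values that occur contribute; terms with p(i) = 0 are zero.
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def high_entropy_blocks(data: bytes, block_size: int = 256, threshold: float = 7.0):
    """Return (offset, entropy) for each block whose entropy exceeds `threshold`.

    `block_size` and `threshold` are illustrative assumptions.
    """
    result = []
    for offset in range(0, len(data), block_size):
        h = shannon_entropy(data[offset:offset + block_size])
        if h > threshold:
            result.append((offset, h))
    return result
```

A block of identical bytes has an entropy of 0, while a block in which every byte value occurs equally often reaches the maximum of 8 bits per byte. Compressed or encrypted content tends towards the upper end of this range.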
reveals and restores the hidden code. After the restoration routine is complete, the
malware transfers control to the restored code. Because the malware naturally reveals
the hidden code during execution, dynamic analysis can allow for the extraction of the
execution, code that became evident and which was not present in the static
disassembly, was identified. This was identified as the hidden code. The collection of
hidden code constituted the unpacking process. PolyUnpack provides a generic solution
and single stepping through execution. Additionally, the dynamic analysis requires
isolation of the running malware. This would imply the use of a virtual machine or
whole system emulation with the associated performance cost. This system would not
The most common approach to automated unpacking has taken advantage of the fact
that at the original entry point all of the hidden code is revealed. This has resulted in the
Detecting when to stop the simulation – when the restoration routine has
Simulation of the malware may involve whole system emulation, hardware based
revealed. The most common technique in detecting when to stop the simulation is by
memory.
Renovo required the use of a kernel driver in the guest operating system being
emulated. This is used to track the malware process being executed in the guest system.
This requirement of modifying the guest system with a kernel driver may make the
Pandora's Bochs also used whole system emulation, but requires no modifications to the
guest operating system, and was proposed by Bohne in [22]. It is similar in concept to
Renovo. Renovo utilises a dynamic binary translator based on QEMU [45] to perform
the emulation, while Bochs uses an interpreter based emulator. Pandora's Bochs
potentially prone to detection. Attacks to detect whole system emulation were shown in
Both Pandora's Bochs and Renovo using whole system emulation are quite effective at
However, whole system emulation has shown poor performance. Neither Pandora's
Bochs nor Renovo shows results that are suitable for a real-time Antivirus system.
efficient because there is no guest operating system that requires execution within the
simulation. There has been some commercial interest in application level emulation
[18]. However, little literature has been published and no authoritative refereed
literature. Application level emulation's main failing is that it provides a less faithful
simulation than whole system emulation. This is because the implementer of the
emulator must simulate the operating system's operation. In whole system emulation,
Dynamic Binary Instrumentation was proposed by Quist in Saffron [48]. Quist proposed
code. Saffron employed the use of the DBI framework PIN [49] which has problems
unpack programs. Hardware page protections were used to monitor the activity of each
program. Once unpacked, the image would be scanned by Antivirus software. A similar
hardware based approach was employed in [48]. The OmniUnpack system is implemented
to run co-operatively with an operating system, and perform unpacking and virus
scanning on demand. The disadvantage of this approach is in the use of the unpacking
provision of a virtual or emulated machine in which to run. This reduces the level of
2.7.3.5 Hardware Based Virtualization
Using hardware based virtualization for malware analysis and automated unpacking was
generated code triggered extraction of the malware's process image similar to Renovo.
The difference is that the simulated environment is provided by a virtual machine using
hardware support. Ether, like Pandora's Bochs, requires no changes to the guest
operating system. Unlike Pandora's Bochs, Ether does not have the same level of
problems with malware detecting the system emulator. However, it has been shown that
hardware based virtualization is not immune to detection [46]. The use of a virtual
machine, and the use of whole system emulation, requires a software license for
installation of the guest operating system. This restricts desktop adoption which
typically uses a single license. Virtual Machines are also inhibited by slow start-up
times which again are problematic for desktop Antivirus use. The use of a Virtual
Machine also prevents the system being cross platform as the guest and host CPUs
Detecting when the original entry point is reached and the hidden code of the packed
2.7.4.1 Renovo
Malware is executed in the simulated machine and allowed to run until the dynamically
generated code is executed. At this point, the memory image of the running malware is
taken. There can exist multiple layers or stages of the code packing transformation, so
the shadow memory is cleared and the process is restarted. This complete process is
to stop emulation, Pandora's Bochs identified markers that indicate unpacking is still
occurring. Such indications include a high ratio of memory writes to unique branches,
the loading of a new Dynamic Link Library, executing dynamically generated
2.7.4.3 OmniUnpack
The OmniUnpack approach employed the use of hardware based page protection to
monitor writes to memory. OmniUnpack detects the end of the unpacking stage when there is
dangerous system call is one which can leave the system in an unsafe state. The
granularity of tracking memory writes is in the unit of pages. The advantage of the
2.7.4.4 Uncover
eliminate false positives. 1) That the stack pointer at the potential original entry point
must be the same as when the malware is initially started. 2) That the potential original
written pages - and those pages must consist of what appears to be code. Determining if
2.7.4.5 Hump-and-dump
when to stop the simulation. This technique is not based on detecting execution of
addresses of executed instructions. The premise of this technique is to note that the
the spike, is a flat section of height 1 which normally represents the original entry point.
Once the original entry point is detected, simulation ceases and an image of the process
is taken to reveal the hidden code. The process can be repeated to account for multiple
packing stages. The Hump-and-dump approach requires the use of simulation such as
emulation or virtualization.
2.8 Static Approaches to Malware Classification
the benign and malicious classes. In this approach, a labelled training set is required to
build the class models during a process of supervised learning. Many statistical
classification algorithms exist including Naive Bayes, Neural Networks, and Support
Vector Machines. The key to statistical classification is to represent the malicious and
an associated feature vector that can accurately represent the invariant characteristics in
known instance of malware in the training set. Traditional Antivirus utilises this
approach when it performs signature based detection. The key component to perform
[Figure: similarity search showing a query q, a radius r and the distance d(p,q) to an object p; labels: Query, Malicious, Malware.]
between objects, the objects must be modelled by a limited set of features that capture
the invariant characteristics of the malicious and benign programs. In some cases, the
distance function is replaced with a test for equality. However, testing only for equality
reduces the effectiveness of the classification process when dealing with malware
A search of a database to find similar, but not necessarily identical objects to a query is
learning when applied to malware detection and classification using a large number of
Distance functions between objects that have the properties of a metric can employ the
use of Metric Access Methods. A similarity search using metric access methods
performs faster than exhaustive linear search and enables significantly larger databases
without being restricted by an equivalent increase in running time. Metric access
methods may use either static [54] or dynamic databases [55]. In dynamic Metric
A fast approach to detecting whole program control flow graph isomorphism and
subgraph isomorphism was proposed in [56]. This approach constructed a spanning tree
based structure from the control flow graph, and then built a tree automaton for graph
technique is not effective at detecting malware variants that alter the control flow or
have semantic changes. Nor does this approach attempt to perform unpacking.
Decomposing control flow graphs into subgraphs was proposed by Kruegel et al. in [28]
to classify polymorphic worms. The control flow graphs were decomposed into the set
of all subgraphs of fixed size k, where k is the number of nodes in the subgraph. The
k-subgraphs were subsequently transformed into their canonical labelled form. The
adjacency matrix of the canonically labelled graph was transformed into a string. This
string represented the k-subgraph feature of the control flow being analysed. Worm
[Figure: control flow graph fragments with nodes labelled L_0, L_1, L_3, L_6 and L_7.]
features between worm like executable content and unclassified executable programs.
The performance of this system was reasonable. Because the classification only
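The k-subgraph features described above can be sketched as follows. This is a simplified illustration and not the method of [28]: it assumes that only weakly connected induced subgraphs are of interest, and it finds the canonical form by brute force, taking the smallest adjacency matrix string over all orderings of the k nodes. That is only practical for small k. All function names are hypothetical.

```python
from itertools import combinations, permutations


def canonical_string(nodes, edges):
    """Smallest row-major adjacency matrix string over all node orderings."""
    best = None
    for order in permutations(nodes):
        bits = "".join(
            "1" if (u, v) in edges else "0" for u in order for v in order
        )
        if best is None or bits < best:
            best = bits
    return best


def is_connected(nodes, edges):
    """True if `nodes` are weakly connected using only edges among them."""
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(
            m for m in nodes
            if m not in seen and ((n, m) in edges or (m, n) in edges)
        )
    return seen == nodes


def k_subgraph_features(nodes, edges, k):
    """Return the canonical strings of all connected induced k-subgraphs."""
    edges = set(edges)
    features = set()
    for subset in combinations(list(nodes), k):
        if is_connected(subset, edges):
            features.add(canonical_string(subset, edges))
    return features
```

Because the canonical string does not depend on the original node labels, two isomorphic graphs yield the same feature set, and the features of two programs can be compared with ordinary set operations.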
It was proposed in [57] that the inter-procedural control flow information could be
represented as a context free grammar with only some loss of information. A string
could represent the grammar, and string equality used to show equivalence between the
grammars and the inter-procedural control flow they represented. The advantage of this
approach is that string based representations allow for fast searches in a malware
this research is that it did not employ approximate matching of the inter-procedural
control flow. For polymorphic malware variants that alter the control flow through
malware.
fixed points in the flowgraphs and successively matching surrounding nodes in the
graph. Carrera built a similarity index between malware and used this to build
phylogeny (evolutionary) trees for taxonomy. Dullien and Rolles expanded the
callgraphs and control flow graphs. Their algorithm worked by identifying nodes, or
fixed points, between binaries that have uniquely identifiable features. Features for a
node in the callgraph include the number of basic blocks, control flow edges, and
graph isomorphism based on string equality and a string signature of the graph
representing a graph traversal. Once a set of fixed points were known, their
The advantage of this approach is that it allows for moderately fast pair-wise
comparison between graphs. However, the approach does not perform efficiently for a
database of graphs and is not fast enough for desktop Antivirus use. Additionally,
to have occurred before classification is applied. A system for automated unpacking
system [59]. SMIT employed the use of bipartite graphs and the Hungarian algorithm to
find matching nodes between two call graphs being compared in O(N^3) running time.
The strength of their matching algorithm was that it could be used as an
approximation to the graph edit distance. The graph edit distance between two graphs
is the minimum number of edit operations needed to convert one graph to the other.
The graph edit distance gives a sound basis for similarity and dissimilarity between graphs.
The graph edit distance is known to have the properties of a metric which allows the use
of metric access methods to search a database of objects. The metric access method
used in SMIT to perform a nearest neighbor search of call graphs was a Vantage Point
Tree [54]. The disadvantage of a Vantage Point Tree is that it is primarily a static data
structure. Alternate metric access methods such as the M-Tree [55] can be used for the
2.9 Trends
The driving force behind malware development is that of commercial gain by the
involves the typical development cycles as seen with legitimate software. Malware
creators will continue to protect and extend the lifetime of their software using available
attempt to extend the lifetime of their malware. Techniques were developed for
virtualization are now routinely detected by malware [23, 46]. The detection of
individual software systems used for performing analyses will continue. If research
detect these systems. The research community has responded in making analysis
We expect that malware authors will continue to use code packing [13] and
instruction virtualization and malware emulators will grow in use due to the added
resistance they provide against malware analysis [14]. Semantic changes to malware will
also continue as malware authors reuse already developed malicious code. It is likely
that syntactic polymorphism will continue to grow in use. Obfuscations will develop in
response to the static analyses used in detection systems to extract features from
malware because the motivation for malware development is that of financial gain and
will continue to be developed and incorporated into malware detection systems. These
We believe research will continue using this approach and new features will be
developed that can more accurately characterize malware. Instance-based learning will
also be developed with particular research opportunity in working with large scale
datasets.
Static program features have been extracted at increasing levels of abstraction, and we
expect this to continue in future research. Abstraction has the benefit of being resistant
to lower level polymorphic changes. The performance of these research systems has not
been fully investigated, and we expect that future research opportunity lies in making
2.10 Summary
must deal with polymorphism. Polymorphic malware introduces syntactic and semantic
poorly with polymorphic malware. Program abstractions including control flow are
observed to be more invariant, when used as static features, than traditional approaches.
However, efficient algorithms that use these static features are lacking. Efficiency is a
research systems to scale to the high number of malware found in the wild.
and the true contents of the malicious software are hidden. Automated unpacking
reveals the hidden content. Efficiency is a key requirement for desktop adoption and
widespread use.
3 Problem Definition and Our Approach
The problem of malware classification and variant detection is defined in this chapter.
The problem summary is to use instance based learning and perform a similarity search
Malwise.
use of the system commences. The system has as input a previously unknown binary
that is to be classified as being malicious or non malicious. The input binary and the
the input binary and each malware in the database. The similarity is measured as a real
number between 0 and 1, with 0 indicating not at all similar and 1 indicating an identical or
very similar match. This similarity is based on the similarity between malware
characteristics in the database. If the similarity exceeds a given threshold for any
malware in the database, then the input binary is deemed a variant of that malware, and
[Figure: Malwise system overview. Flowchart elements: Win32 Executable; Packed?; Dynamic Analysis (Emulate; End of Unpacking?); Generate Signatures; Classify; Malware Database; New Signature; Malicious; Non Malicious.]
incorporate the potentially new set of generated signatures associated with that variant.
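The threshold based decision described above can be sketched as a search over the database. The similarity function used by Malwise is not reproduced here. As a stand-in, the sketch below assumes each program is a set of signature strings and uses the Jaccard ratio between two sets. The names `jaccard` and `classify`, and the threshold used in the test, are hypothetical.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity in [0, 1]: shared signatures over all signatures."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(
    query: Set[str],
    database: Dict[str, Set[str]],
    threshold: float,
    similarity: Callable[[Set[str], Set[str]], float] = jaccard,
) -> Optional[Tuple[str, float]]:
    """Return (name, similarity) of the most similar malware whose
    similarity exceeds `threshold`, or None if the query is non malicious."""
    best = None
    for name, signatures in database.items():
        score = similarity(query, signatures)
        if score > threshold and (best is None or score > best[1]):
            best = (name, score)
    return best
```

A result of None corresponds to the non malicious outcome. Any other result names the malware of which the input binary is deemed a variant.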
Our approach employs both dynamic and static analysis to classify malware. Entropy
analysis initially determines if the binary has undergone a code packing transformation.
If packed, dynamic analysis employing application level emulation reveals the hidden
code using entropy analysis to detect when unpacking is complete. Static analysis then
identifies characteristics, building signatures for control flow graphs in each procedure.
The similarities between the set of control flow graphs and those in a malware database
malware database to find similar objects to the query. The system design of our
generate and compare flowgraph signatures: exact flowgraph matching and approximate
flowgraph matching.
3.2.1 Exact Flowgraph Matching
An ordering of the nodes in the control flow graph is used to generate a string based
signature or graph invariant of the flowgraph. String equality between graph invariants
The control flow graph is structured in this approach. Structuring is the process of
decompiling unstructured control flow into higher level, source code like constructs
including structured conditions and iteration. Each signature representing the structured
control flow is represented as a string. These signatures are then used for querying the
between flowgraphs can subsequently be constructed using the string edit distance.
4 Automated Unpacking
extraction and classification. In this chapter, the automated unpacking component of the
Malwise performs an initial analysis on the input binary to determine if it has undergone
a code packing transformation. Entropy analysis [43] is used to identify packed binaries.
Compressed and encrypted data have relatively high entropy. Program code and data
have much lower entropy. Packed data is typically characterised as being encrypted or
malware is established if there exist sequential blocks of high entropy data in the input
binary. If the binary is identified as being packed, then the dynamic analysis to perform
automated unpacking proceeds. If the binary is not packed, then the static analysis and
may reveal its hidden code. The hidden code once revealed is then extracted from the
process image.
Application level emulation provides an alternate approach to whole system emulation
for automated unpacking. Application level emulation simulates the instruction set
architecture and system call interface. In the Windows OS, the officially supported
4.2.1 Interpretation
Much of the 32-bit x86 ISA has been implemented in Malwise. Extensions to the ISA,
including SSE and MMX instructions, have been partially implemented. A partial
implementation is adequate because the majority of programs do not employ full use of
the ISA. FPU, SSE, and MMX instructions are primarily used by malware to evade or
detect emulation. Malware may also use the debugging interface component of the ISA,
including debug registers and the trap flag, which are primarily used to obfuscate
control flow.
segment registers to reference thread specific data. Thread specific data is additionally
handle abnormal conditions such as division by zero and is routinely used by packers
Segmented memory is handled in Malwise by maintaining a table of segment
descriptions, known in the x86 ISA as the descriptor table. Addressed memory is
associated with a segment through a segment selector, which holds an index
into the descriptor table. This enables a translation from segmented addressing to flat
linear addressing.
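The translation from a selector and offset to a linear address can be sketched as below. The sketch assumes a single descriptor table and omits privilege and type checks. The class and function names are hypothetical and do not reflect the Malwise implementation. On x86, the table index occupies the upper 13 bits of the 16-bit selector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentDescriptor:
    base: int   # linear address at which the segment starts
    limit: int  # highest valid offset within the segment


def to_linear(table: List[SegmentDescriptor], selector: int, offset: int) -> int:
    """Translate selector:offset into a flat 32-bit linear address."""
    index = selector >> 3  # bits 3..15 of an x86 selector hold the table index
    descriptor = table[index]
    if offset > descriptor.limit:
        raise ValueError("offset exceeds segment limit")
    return (descriptor.base + offset) & 0xFFFFFFFF
```

For example, a selector whose index refers to a descriptor with base 0x7FFDE000 translates offset 0x30 to the linear address 0x7FFDE030.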
address. Each memory region maintains its associated memory contents. Each region
also maintains a shadow memory that is utilised by the automated unpacking logic. The
shadow memory maintains a flag for each address that is set if that location has been
The Windows API is the official system call interface provided by Windows. Malwise
detects calls to the Windows API by inspecting the simulated program counter. If the
program counter contains the address of a Windows API function, then a handler
There are too many Windows API functions to fully emulate, so only the most common
APIs are implemented. Commonly used APIs include heap management, object
4.2.1.4 Linking and Loading
Program loading entails allocating the appropriate virtual memory, loading the program
text, data and dynamic libraries and performing any required relocations. OS specific
without having access to the native library. Such a system may have benefit when the
emulator is cross platform and when licensing issues should be avoided. Malwise
performs full dynamic library loading using the native libraries. This is done to provide
level threads - only one thread is running on the host at any particular time and each
Windows has process and thread specific structures that require initialization such as the
Process Environment Block, Thread Environment Block, and Loader Module. These
4.2.2 Improvements to Emulation
speed. In this technique, the decoding of unique instructions is cached. This results in a
processor time. Predecoding can also be used to cache a function pointer directly to the
opcode handler. When used in this way, predecoding allows for fast implementation of
the x86 debugging ISA including hardware breakpoints and single step execution used
by debuggers. In this optimisation, the cache holding a function pointer to the opcode
handler is modified on-demand to reflect that it should execute the breakpoint or trap
logic. This removes explicit checks for these conditions from the emulator's main loop.
The x86 condition codes are another point of optimisation and the prototype defers to
lazy evaluation of these at the time of their use, similar to QEMU [45].
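The predecoding optimisation described above can be illustrated with a toy interpreter. The three opcodes below are invented for the example and are not x86 encodings. The class and method names are likewise hypothetical. The cache maps an address to a handler, so each unique instruction is decoded once, and a breakpoint is installed by replacing the cached handler instead of adding a check to the main loop.

```python
from typing import Callable, Dict, List

# Toy opcodes for illustration only; these are not x86 encodings.
OP_DEC, OP_JNZ, OP_HALT = 0x01, 0x02, 0xFF


class Emulator:
    def __init__(self, code: bytes, acc: int = 0) -> None:
        self.code = code
        self.acc = acc
        self.pc = 0
        self.running = True
        self.decodes = 0                 # number of times the slow decoder ran
        self.hits: List[int] = []        # addresses at which a breakpoint fired
        self.cache: Dict[int, Callable[[], None]] = {}

    def _dec(self) -> None:
        self.acc -= 1
        self.pc += 1

    def _jnz(self, target: int) -> None:
        self.pc = target if self.acc != 0 else self.pc + 2

    def _halt(self) -> None:
        self.running = False

    def _decode(self, addr: int) -> Callable[[], None]:
        self.decodes += 1
        opcode = self.code[addr]
        if opcode == OP_DEC:
            return self._dec
        if opcode == OP_JNZ:
            target = self.code[addr + 1]
            return lambda: self._jnz(target)
        if opcode == OP_HALT:
            return self._halt
        raise ValueError(f"unknown opcode {opcode:#x} at {addr}")

    def _fetch(self, addr: int) -> Callable[[], None]:
        handler = self.cache.get(addr)
        if handler is None:
            handler = self.cache[addr] = self._decode(addr)
        return handler

    def set_breakpoint(self, addr: int) -> None:
        original = self._fetch(addr)

        def trapped() -> None:
            self.hits.append(addr)   # breakpoint logic runs first
            original()               # then the original opcode handler

        self.cache[addr] = trapped

    def run(self) -> None:
        # The main loop contains no explicit breakpoint check.
        while self.running:
            self._fetch(self.pc)()
```

This sketch omits cache invalidation. An emulator used for unpacking would need to discard cached entries whenever the underlying memory is written, because packed programs overwrite their own code.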
Many instances of malware use modified variants of the same packer or share similar
code between different packers. Taking advantage of this, it is possible to detect known
sections of code during emulation and handle them more specifically, and therefore
more efficiently than interpretation [60]. To implement this it is noted that each stage
during unpacking gives access to a layer of hidden code that has been revealed, and the
memory in each layer can be searched for sections of known code. These sections of
code can then be emulated, in whole, using custom handlers. This approach achieves
code sections that can have written handlers include decryption loops, decompression
loops and checksum calculations. Handlers can also be written and used to dynamically
Malwise implements handlers for frequently used loops in several well known packers.
to that of testing whole system emulation [61]. To achieve this, the program being
emulated is executed in parallel on the host machine. The host program is monitored
using the Windows debugging API. At the commencement of each instruction, the
emulator machine state is compared against the host version and examined for deviant
Faithful emulation is made more difficult, as some instructions and Windows API
functions behave differently when debugged. Malwise rewrites these instructions and
enabled testing of packers and malware that employ known techniques to detect and
evade debugging.
4.3 Entropy Analysis to Detect Completion of Hidden Code Extraction
Detection of the original entry point (OEP) during emulation identifies the point at
which the hidden code is revealed and execution of the original unpacked code begins
to take place. Detecting the execution of dynamic code generation by tracking memory
writes was used as an estimation of the original entry point in Renovo [21]. In this
approach the emulator executes the malware, and a shadow memory is maintained to
track newly written memory. If any newly written memory is executed, then the hidden
code in the packed binary will now be revealed. To complicate this approach,
multiple layers or stages of hidden code may be present, and malware may be packed
more than once. This scenario is handled by clearing the shadow memory contents,
continuing emulation, and repeating the monitoring process until a timeout expires.
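The shadow memory approach described above can be sketched over a recorded trace of memory events. The trace format and the function name are assumptions made for illustration. In an emulator the same logic would run inline on each write and each instruction fetch, and the end of the trace stands in for the timeout.

```python
from typing import Iterable, List, Set, Tuple

# A trace event is ("write", address) or ("exec", address).
Event = Tuple[str, int]


def find_unpacking_stages(trace: Iterable[Event]) -> List[int]:
    """Return the addresses at which newly written memory was first executed."""
    shadow: Set[int] = set()       # addresses written during the current stage
    stage_entries: List[int] = []
    for kind, address in trace:
        if kind == "write":
            shadow.add(address)
        elif kind == "exec" and address in shadow:
            stage_entries.append(address)  # hidden code has been revealed
            shadow.clear()                 # start monitoring the next stage
    return stage_entries
```

Each returned address is a candidate original entry point for one stage. With multiple packing layers the last entry is the best estimate, which is why a stopping criterion is needed.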
Malwise extends the concept of identifying the original entry point when unpacking
multiple stages by identifying more precisely at which stage to terminate the process,
without relying on a timeout. The intuition behind our approach is that if there exists
high entropy packed data that has not been used by the packer during execution, then it
original entry point, the entropy of new or unread memory in the process image is
examined. Newly written memory is indicated by the shadow memory for the current
stage being unpacked. Unread memory is maintained globally, in a shadow memory for
all stages. If the entropy of the analysed data is low, then it is presumed that no more
completion of unpacking. Malwise also performs the described entropy analysis to
detect unpacking completion after a Windows API imposes a significant change to the
entropy. This is commonly seen when the packer deallocates large amounts of memory
during unpacking. In the remaining case that the original entry point is not identified at
function will have the same effect as having identified the original entry point at this
location.
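The entropy based stopping criterion can be sketched as follows. The sketch assumes that the newly written addresses of the current stage and the globally unread addresses are examined together as one sample, and that a fixed threshold separates packed from unpacked content. Both are assumptions, as are the function names and the threshold of 7.0 bits per byte.

```python
import math
from collections import Counter
from typing import Set


def entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def unpacking_complete(
    memory: bytes,
    written_this_stage: Set[int],
    never_read: Set[int],
    threshold: float = 7.0,  # illustrative assumption
) -> bool:
    """At a candidate original entry point, decide whether packed data remains.

    Low entropy in the new or unread memory is taken to mean that no
    packed data is left, so unpacking is presumed complete.
    """
    addresses = sorted(written_this_stage | never_read)
    sample = bytes(memory[a] for a in addresses)
    return entropy(sample) < threshold
```

If the function returns False, emulation would continue to the next stage, with the per-stage shadow memory cleared and the global record of unread memory retained.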
4.4 Discussion
implemented to emulate the Windows operating system. The Windows API is a large
set of APIs that requires significant effort to faithfully emulate. Complete emulation of
the API has not been achieved in the prototype and faithful emulation of undocumented
side effects may be near impossible. Malware that circumvents usual calling
mechanisms and malware that employs the use of uncommon APIs may result in
An alternative approach is to emulate the Native API which is used by the Windows
API implementation. However, the only complete and official documentation for
system call interfaces is the Windows API. The Windows API is a library interface, but
malware may employ the use of the Native API to interface directly with the kernel.
Malware that employs the Native API to evade Antivirus software has been reported.
Another problem that exists is early termination of unpacking due to time constraints.
much time is consumed during emulation. Malware may employ code that
deliberately consumes time in order to cause early termination of unpacking.
Dynamic binary translation may provide some relief through faster emulation.
4.5 Evaluation
To verify our system correctly performs hidden code extraction, we tested the Malwise
prototype against 14 public packing tools. These tools perform various techniques in the
obfuscation, debugger detection and virtual machine detection. The samples chosen to
undergo the packing transformation were the Microsoft Windows XP system binaries
hostname.exe and calc.exe. hostname.exe is 7680 bytes in size, and calc.exe is 114688
bytes.
The original entry point identified by the unpacking system was compared against what
was identified as the real OEP. To identify the real OEP, the program counter was
inspected during emulation and the memory at that location examined. If the program
counter was found to have the same entry point as the original binary, and the 10 bytes
of memory at that location were the same as in the original binary, then that address was
The results of the OEP detection evaluation are in table 1 and table 2. The revealed code
column in the tabulated results identifies the size of the dynamically generated code and
data. The number of unpacking stages to reach the real OEP is also tabulated, as is the
number of stages actually unpacked using entropy based OEP detection. Finally, the
that were executed to reach the real OEP is also shown. This last metric is not a
definitive metric by itself, as the result of the unaccounted for instructions may not
affect the revelation of hidden code – the instructions could be only used for debugger
evasion for example. Entries where the OEP was not identified are marked with err.
Binaries that failed to pack correctly are marked as fail. The closer the results in columns
3 and 4 are to each other, the better the performance. The closer the result in column 5 is to 100%
Table 1. Metrics for identifying the original entry point in packed samples (hostname.exe).
yoda's
Table 2. Metrics for identifying the original entry point in packed samples (calc.exe).
yoda's
The results show that unpacking the samples reveals most of the hidden code. The OEP
of pespin was not identified, possibly due to unused encrypted data remaining in the
process image, which would raise the entropy and affect the heuristic OEP detection.
The OEP in the packed calc.exe samples was more accurately identified, relative to the
metrics, than in the hostname.exe samples. This may be due to fixed size stages during
unpacking that were not executed due to incorrect OEP detection. Interestingly, in many
cases, the revealed code was greater than the size of the original unpacked sample. This
is because the metric for hidden code is all the code and data that is dynamically
generated. Use of the heap, and the dynamic generation of internally used hidden code
The worst result was in hostname.exe using aspack. 43% of the instructions to the real
OEP were not executed, yet nearly 2.5K of hidden code and data was revealed, which
is around a third of the original sample size. While some of this may be heap usage and
the result not ideal, it may still potentially result in enough revealed procedures to use
4.5.2 Performance
The system used to evaluate the performance of the unpacking prototype was a modern
desktop - a 2.4 GHz Quad core computer, with 4 GB of memory, running 32-bit Windows
Vista Home Premium with Service Pack 1. The performance of the unpacking system is
shown in table 3. The running time is total time minus start-up time of 0.60s. Binaries
that failed to pack correctly are marked as fail. The number of instructions emulated
Table 3. Running time to perform unpacking.
hostname.exe calc.exe
yoda’s
In this evaluation full interpretation of every instruction is performed. The results
demonstrate the system is fast enough for integration into a desktop anti-malware
system.
4.6 Summary
The analysis of malicious software is made more challenging due to the presence of
packed malware. In this chapter we proposed fast algorithms to unpack malware using
completion of unpacking, we proposed and evaluated the use of entropy analysis. The
detection of the original entry point worked with a high degree of accuracy. The
synthetic samples using known packing tools, with high speed. This demonstrated that
the automated unpacking system is fast enough for potential desktop integration. The
automated unpacking system is efficient and effective and lays the foundation for
5 Malware Feature Extraction
In this chapter, algorithms to extract the static features of malware are proposed. These
features characterize the malware samples and are used for subsequent classification in
binary. The analysis is used to extract characteristics from the input binary that can be
used for classification. The characteristic for each procedure in the input binary is
obtained through transforming its control flow into a compact representation that is
To initiate the static analysis process, the memory image of the binary is disassembled
using speculative disassembly [34]. Procedures are identified during this stage. A
disassembly - the target of a call instruction identifies a procedure, only if the call site
belongs to an existing procedure. Data runs of more than 256 bytes all having the value
of zero are ignored. Once processed, the disassembly is translated into an intermediate
Malwise is built as a general binary analysis platform which utilizes the intermediate
control flow graph for each identified procedure. The control flow graph is then
[Figure: control flow graph with nodes 1 to 4 and true (T) / false (F) edges, alongside its signature: (1 -> 2), (1 -> 4); (2 -> 3), (); (), (); (4 -> 3), ()]
associated with a weight, described in the following sections. The weight intuitively
represents the importance of the signature when used to determine program similarity.
It is possible to generate a signature using a fast and simple method if the matching
algorithm only identifies graph isomorphism [29]. This approach takes note that if the
signatures or graph invariants of two graphs are not the same, then the graphs are not
isomorphic. The converse, while not strictly sound, is used as a good estimate to
indicate isomorphism. To generate a signature, the algorithm orders the nodes in the
control flow graph using a depth first order, although other orderings would serve
equally well. A signature subsequently consists of a list of graph edges for the ordered
nodes, using the node ordering as node labels. This signature can be represented as a
To improve the performance, a hash of the string signature can be used instead. CRC64
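The signature generation described above can be sketched as follows. This is a minimal illustration and not the Malwise implementation: the function names, the adjacency-list input format and the separators in the output string are assumptions, and `zlib.crc32` stands in for the CRC64 hash mentioned in the text because the Python standard library has no CRC64.

```python
import zlib


def exact_signature(cfg, entry):
    """Build an edge-list string for a control flow graph.

    `cfg` maps each node to the ordered list of its successors and `entry`
    is the procedure's entry node (both are assumed input conventions).
    Nodes are relabelled 1, 2, 3, ... in depth-first (preorder) order, so
    the result does not depend on the original node names.
    """
    order = {}
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in order:
            continue
        order[node] = len(order) + 1
        # Push successors reversed so the first successor is visited first.
        stack.extend(reversed(cfg.get(node, [])))

    parts = []
    for node in sorted(order, key=order.get):
        edges = ["(%d -> %d)" % (order[node], order[succ])
                 for succ in cfg.get(node, [])]
        parts.append(", ".join(edges) if edges else "()")
    return "; ".join(parts)


def signature_hash(signature):
    """Hash the string signature (CRC32 used here as a stand-in for CRC64)."""
    return zlib.crc32(signature.encode("ascii"))


# Two graphs with the same shape but different node names.
graph_a = {"A": ["B", "D"], "B": ["C"], "C": [], "D": ["C"]}
graph_b = {10: [20, 40], 20: [30], 30: [], 40: [30]}
same = exact_signature(graph_a, "A") == exact_signature(graph_b, 10)
```

Because the node labels come from the traversal order rather than from addresses, two procedures with the same control flow shape produce the same string, and string equality (or hash equality) then serves as the estimate of isomorphism.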
matching is that classification using exact matches of signatures can be performed very
$$\mathrm{weight}_x = \frac{B_x}{\sum_i B_i}$$
The similarity ratio between two flowgraphs in exact matching, with signatures x and y
is:
$$w_{ed} = \begin{cases} 1, & x = y \\ 0, & x \neq y \end{cases}$$
In Malwise, balanced binary trees implement the exact search of the flowgraph
Structuring is the process of recovering high level structured control flow from a control
flow graph. In our system, the control flow graphs in a binary are structured to produce
signatures that are amenable to comparison and approximate matching using string edit
distances.
malware variants are reflected by variants sharing similar high level structured control
flow. If the source code of the variant is a modified version of the original malware,
proposed in the DCC decompiler [62]. If the algorithm cannot structure the control flow
graph then an unstructured branch is generated. Surprisingly, even when graphs are
reducible (a measure of how inherently structured the graph is), the algorithm generates
[Figure 16 panels: a control flow graph with nodes L_0 to L_7 and edges labelled
"true", the structured control flow recovered from it, and the resulting signature.]

    proc(){
    L_0:
        while (v1 || v2) {
    L_1:
            if (v3) {
    L_2:
            } else {
    L_4:
            }
    L_5:
        }
    L_7:
        return;
    }

    Signature: BW|{BI{B}E{B}B}BR

Figure 16. The relationship between a control flow graph, a high level
structured graph, and a signature.
improvements to this algorithm to reduce the generation of unstructured branches have
been proposed [63, 64]. However, these improvements were not implemented.
high level structured constructs that are typical in a structured programming language.
Subfunction calls are represented, as are gotos; however, the goto and subfunction
targets are ignored. The grammar for a resulting signature is defined in figure 17.
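As an illustration of how a structured procedure maps to a string signature, the sketch below emits the signature of Figure 16 from a small hand-built syntax tree. The tuple-based tree format and the function name are hypothetical, and the symbols used ('B' for a basic block, 'W' for a while loop, 'I' and 'E' for if/else, 'R' for return, '|' for an "or" condition) are read off the Figure 16 example and the IfThenElse rule; other constructs of the grammar are not covered.

```python
def emit_signature(statements):
    """Serialise a list of structured statements into a signature string.

    Each statement is a tuple whose first element names the construct:
      ("block",)                         -> "B"
      ("return",)                        -> "R"
      ("while", cond, body)              -> "W" cond "{" body "}"
      ("if", cond, then_body, else_body) -> "I" cond "{" then "}E{" else "}"
    `cond` is the condition string, e.g. "|" for `a || b`, "" for one test.
    """
    out = []
    for stmt in statements:
        kind = stmt[0]
        if kind == "block":
            out.append("B")
        elif kind == "return":
            out.append("R")
        elif kind == "while":
            _, cond, body = stmt
            out.append("W" + cond + "{" + emit_signature(body) + "}")
        elif kind == "if":
            _, cond, then_body, else_body = stmt
            out.append("I" + cond + "{" + emit_signature(then_body) + "}"
                       + "E{" + emit_signature(else_body) + "}")
        else:
            raise ValueError("unsupported construct: %r" % (kind,))
    return "".join(out)


# The procedure shown in Figure 16.
figure_16 = [
    ("block",),                                   # L_0
    ("while", "|", [                              # while (v1 || v2)
        ("block",),                               # L_1
        ("if", "", [("block",)], [("block",)]),   # if (v3) L_2 else L_4
        ("block",),                               # L_5
    ]),
    ("block",),                                   # L_7
    ("return",),
]
signature = emit_signature(figure_16)
```

The nesting of braces in the output mirrors the nesting of the control structures, which is what makes small source-level changes show up as small edits to the string.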
$$\mathrm{weight}_x = \frac{\mathrm{len}(s_x)}{\sum_i \mathrm{len}(s_i)}$$
where si is signature of procedure i in the binary. The weights are normalized so that the
The similarity ratio [26] was proposed to measure the similarity between basic blocks. It
is used in our research to establish the number of allowable errors between flowgraph
$$w_{ed} = 1 - \frac{ed(x, y)}{\max(\mathrm{len}(x), \mathrm{len}(y))}$$
where ed(x,y) is the edit distance. Malwise defines the edit distance as the Levenshtein
distance: the minimum number of insertions, deletions, and substitutions needed to convert
one string to another. Signatures that have a similarity ratio equal to or exceeding a threshold t (t=0.9)
are identified as positive matches. This figure was derived empirically through a pilot
study.
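The similarity ratio can be sketched directly from its definition. The code below is an illustrative implementation under the stated formula, not the Malwise source; the function names are hypothetical, and the treatment of two empty strings as identical is an assumption made to avoid division by zero.

```python
def levenshtein(x, y):
    """Levenshtein distance via dynamic programming, O(len(x) * len(y))."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cx != cy)))  # substitution
        prev = curr
    return prev[-1]


def similarity_ratio(x, y):
    """w_ed = 1 - ed(x, y) / max(len(x), len(y))."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty signatures count as identical
    return 1.0 - levenshtein(x, y) / longest


def is_match(x, y, t=0.9):
    """A positive match has a similarity ratio equal to or exceeding t."""
    return similarity_ratio(x, y) >= t


ratio = similarity_ratio("BW|{B}BR", "BI{B}BR")
```

For the two short signatures in the usage line, one substitution and one deletion are needed, so the ratio falls below the 0.9 threshold; longer signatures tolerate proportionally more edits.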
Procedure ::= StatementList
IfThenElse ::= 'I' Condition '{' StatementList '}' 'E' '{' StatementList '}'
Using the similarity ratio t as a threshold, the number of allowable errors, E, or edit
$$E = \mathrm{len}(x)\,(1 - t)$$
To identify matching graphs from a flowgraph database, an approximate dictionary
BK Trees [65]. BK Trees exploit knowledge that the Levenshtein distance forms a
metric space. The BK Tree search algorithm is faster than an exhaustive comparison of
The runtime complexity of computing the edit distance between two signatures or strings is
O(nm), where n and m are the lengths of the respective signatures. The algorithm employs
dynamic programming.
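A BK Tree search over signatures can be sketched as below. This is a generic textbook BK Tree and not the Malwise data structure; the class and function names are hypothetical, and the small epsilon in `allowable_errors` is an assumption added to guard against floating point rounding when evaluating E = len(x)(1 - t).

```python
import math


def levenshtein(x, y):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


def allowable_errors(length, t=0.9):
    """E = len(x)(1 - t), rounded down to a whole number of edits."""
    return math.floor(length * (1.0 - t) + 1e-9)


class BKTree:
    """Metric tree: each node is (word, {distance_to_child: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_errors):
        """Return sorted (distance, word) pairs within max_errors of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_errors:
                results.append((d, word))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_errors, d + max_errors] can contain a match.
            for edge, child in children.items():
                if d - max_errors <= edge <= d + max_errors:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for sig in ["BR", "BW|{B}BR", "BI{B}BR", "BSSR", "BSR", "BSSSSR", "BSSSR"]:
    tree.add(sig)
near = tree.search("BSSR", 1)
```

The pruning step is where the metric property pays off: subtrees whose edge label falls outside the permitted band are never visited, so most of the database is skipped when the query has no close neighbours.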
5.4 Discussion
Malware classification based on static analysis has a number of inherent problems and
may fail to perform correctly in all cases. Performing static disassembly, identifying
procedures and generating control flow graphs is, in the general case, undecidable.
Malware may specifically craft itself to make static analysis hard. In practice, the
majority of malware is compiled from a high level language and obfuscated as a post-
practice.
5.5 Summary
malware. In this chapter we proposed two algorithms to extract control flow graph
graph isomorphism through the string equality of graph invariants. We also proposed
for use in approximate graph matching. The structured signature was amenable to
approximate matching using the string edit distance. These features lay the foundation
6 Malware Classification
To classify an input binary, the analysis makes use of a malware database. The database
classify the input binary, a similarity is constructed between the set of the binary's
flowgraph strings and each set of flowgraphs associated with malware in the database.
To construct the similarity between the two sets of flowgraph strings we construct a
mapping or assignment between the strings from each set. For exact matching, the
is made for the best approximate matching string where the similarity ratio is above 0.9.
Two weights are associated with each matching flowgraph signature. The weights have
been normalized and the sum of matching weights identifies the size of the matching
$$S_x = \sum_i \begin{cases} 0, & w_{ed_i} < t \\ w_{ed_i} \cdot \mathrm{weight}_{x_i}, & w_{ed_i} \geq t \end{cases}$$
where t is the empirical threshold value of 0.9, wed is the similarity ratio between the ith
control flow graph of the input binary and the matching graph in the malware database,
and weightx is the weight of the cfg where x is either the input binary or the malware
[Figure: flowgraph strings p and q related by the edit distance d = ed(p,q).
Strings shown: BR, BW|{B}BR, BI{B}BR, BSSR, BSR, BSSSSR, BSSSR]
The analysis performs more accurately with a greater number of procedures and hence
signatures. If the input binary has too few procedures, then classification cannot be
performed. The prototype does not perform classification on binaries with fewer than 10
procedures. For the exact matching classification, an additional requirement is that the
The program similarity is the final measure of similarity used for classification and is
the product of the asymmetric similarities. The program similarity is defined as:
$$S(i, d) = S_i \, S_d$$
where i is the input binary, d is the database malware instance, Si and Sd are the
figure 19.
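The asymmetric similarities and their product can be sketched as follows. This is an illustrative reading of the two formulas above and not the Malwise code: the function names and the list-of-pairs input format are assumptions, and the example numbers are invented for the demonstration. The classification threshold of 0.6 is the program similarity threshold used in the set similarity search.

```python
def asymmetric_similarity(matches, t=0.9):
    """S_x: sum of w_ed_i * weight_x_i over matches with w_ed_i >= t.

    `matches` is a list of (w_ed, weight) pairs, one per flowgraph of
    program x, where w_ed is the similarity ratio of its assigned match
    and weight is the normalised weight of that flowgraph in x.
    """
    return sum(w_ed * weight for w_ed, weight in matches if w_ed >= t)


def program_similarity(input_matches, db_matches, t=0.9):
    """S(i, d) = S_i * S_d, the product of the asymmetric similarities."""
    return (asymmetric_similarity(input_matches, t)
            * asymmetric_similarity(db_matches, t))


def is_variant(similarity, threshold=0.6):
    """Flag a variant when the program similarity reaches the threshold."""
    return similarity >= threshold


# Invented example: two of three flowgraphs match at or above t = 0.9.
input_side = [(1.0, 0.5), (0.5, 0.3), (0.95, 0.2)]   # (w_ed, weight in input)
db_side = [(1.0, 0.4), (0.95, 0.3)]                  # (w_ed, weight in malware)
s = program_similarity(input_side, db_side)
```

Taking the product means both programs must be largely covered by the matching flowgraphs; a small malware embedded in a large benign binary scores high on one side only and the product stays low.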
If the program similarity of the examined program to any malware in the database
contains only malicious software, the binary of unknown status is also deemed
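[Figure: flowgraphs of the input binary (weights a1 to a6) assigned to flowgraphs of a
malware in the database (weights b1 to b4). The matched pairs carry similarity ratios
s1 and s3, giving the triples (s1, a1, b1) and (s3, a3, b3).]
S = Si * Sd
S = (s1*a1 + s3*a3) * (s1*b1 + s3*b3)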
malware in the database, the new set of malware signatures can be stored in the
to find any similar malware in the database. The search can be performed exhaustively
but has poor performance. To improve the performance, the similarity between
programs, represented as sets, can utilise an alternative algorithm. The expected case
when performing the set similarity search, is that the query is not similar to any
malware in the database and our algorithm exploits this expected case.
Our first proposed algorithm iterates through each flowgraph string in the query
program and finds matching strings from malware using a global database. From this,
the asymmetric similarities associated with each malware are constructed during each
round. After processing the query program, the matching malware are examined to
identify those that have a program similarity above the threshold of 0.6.
The problem with this initial approach is that some flowgraph strings have many
matching malware. To handle this problem, we divide the classification process into
two stages. In the first stage, we only build the asymmetric similarity for flowgraphs
processing uniquely matching malware, we prune those that cannot have an eventual
program similarity above 0.6. Finally, we process the remaining flowgraph strings, but
we do not employ the entire flowgraph database, and instead use a local database for
each of the malware remaining from the previous stage. Pseudo code to describe the
algorithm is given in figure 20. We then return the remaining malware equal to or
exceeding the program similarity of 0.6. This part of the process is not shown to
conserve space.
The set similarity search algorithm can be used for approximate matching by using an
approximate dictionary search in place of the standard dictionary search used in exact
matching. The similarity ratio threshold defines the maximum number of errors allowed
in the search.
S = 0.6
matches[name][Sa,Sb] : output : input initialized Sa=0, Sb=0
db : input : malware database
in : input : input binary
solutions : global temporary
Figure 20. Pseudo code for the set similarity search.
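The two-stage search described in the text can be illustrated with the simplified sketch below, restricted to exact matching. It is an assumption-laden reading of the prose and not a reconstruction of Figure 20: the dictionary-based database layout, the function name, and the pruning bound (the query-side similarity already accumulated plus the weight still to be processed, which is an upper bound on the final product because the database-side similarity cannot exceed 1) are all choices made for this example. Malware with no uniquely matching flowgraph are never considered, which is a further simplification.

```python
def set_similarity_search(query, db, threshold=0.6):
    """Return {malware_name: program_similarity} for matches >= threshold.

    `query` maps each flowgraph signature of the input binary to its
    weight; `db` maps each malware name to the same kind of mapping.
    Weights within one program are assumed to sum to 1.
    """
    # Global database: signature -> names of the malware that contain it.
    index = {}
    for name, sigs in db.items():
        for sig in sigs:
            index.setdefault(sig, []).append(name)

    # Stage 1: use only signatures that match exactly one malware.
    scores = {}      # name -> [S_a (input side), S_b (database side)]
    deferred = []    # signatures matching several malware
    for sig, weight in query.items():
        owners = index.get(sig, [])
        if len(owners) == 1:
            entry = scores.setdefault(owners[0], [0.0, 0.0])
            entry[0] += weight
            entry[1] += db[owners[0]][sig]
        elif owners:
            deferred.append(sig)

    # Prune malware that cannot reach the threshold even if every
    # deferred signature matched (assumed bound, see lead-in).
    remaining = sum(query[sig] for sig in deferred)
    candidates = {name: entry for name, entry in scores.items()
                  if entry[0] + remaining >= threshold}

    # Stage 2: resolve deferred signatures against each survivor's
    # local database only.
    for name, entry in candidates.items():
        local = db[name]
        for sig in deferred:
            if sig in local:
                entry[0] += query[sig]
                entry[1] += local[sig]

    return {name: entry[0] * entry[1]
            for name, entry in candidates.items()
            if entry[0] * entry[1] >= threshold}


db = {"m1": {"A": 0.5, "B": 0.3, "C": 0.2},
      "m2": {"C": 0.6, "D": 0.4}}
hits = set_similarity_search({"A": 0.5, "B": 0.4, "C": 0.1}, db)
```

In the expected case, where the query resembles nothing in the database, stage 1 leaves no candidates and the common signatures shared by many malware are never expanded.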
6.3 Complexity Analysis
We assume the search complexity is O(log(N)) for both global and local flowgraph
where M is the number of control flow graphs in the database, and N is the number of
control flow graphs in the input binary. N is proportional to the input binary size and
not more than several hundred in most cases. The worst case can be expected to have a
malware to the input binary. It is desirable that the malware database is not populated
In existing literature, the runtime complexity of identifying similarity between two call
graphs using the Hungarian method [59] is O(N^3), where N is the sum of nodes in each
graph. Metric trees can avoid exhaustive comparisons in the database, which naively
would cost O(MN^3), where M is the number of indexed malware. An average of 70% of the
database size M was pruned when identifying the 10 nearest neighbours in a search
utilizing metric trees [59]. Our algorithm has similar intentions and comparable results
Antivirus systems, employing the Aho-Corasick algorithm [9] is linear to the size of the
input program and number of identified matches. The disadvantage of this approach is
that pre-processing is required on the malware database to enable linear scanning time
that is independent of the database size. Our system imposes more overhead by
performing unpacking and static analysis, but is potentially capable of real-time updates
to the malware database, and is capable of maintaining efficient runtime complexity.
increase [56]. Our system is more resilient to false positives under these conditions
6.4 Evaluation
6.4.1 Effectiveness
variants from the Netsky, Klez, Roron and Frethem families of malware were classified.
The Netsky, Klez and Roron malware samples were chosen to mimic a selection of the
malware and evaluation metrics in previous research [29]. The malware was obtained
through a public database [66]. A number of the malware samples were packed.
Malwise automatically identifies and unpacks such malware as necessary. Each of the
252 comparisons identified variants. The same evaluation was performed using exact
Table 4. Similarity matrices for malware using approximate matching.

      a     b     c     d     g     h
a     -     0.84  1.00  0.76  0.47  0.47
b     0.84  -     0.84  0.87  0.46  0.46
c     1.00  0.84  -     0.76  0.47  0.47
d     0.76  0.87  0.76  -     0.46  0.45
g     0.47  0.46  0.47  0.46  -     0.83
h     0.47  0.46  0.47  0.45  0.83  -

      aa    ac    f     j     p     t     x     y
aa    -     0.78  0.61  0.70  0.47  0.67  0.44  0.81
ac    0.78  -     0.66  0.75  0.41  0.53  0.35  0.64
f     0.61  0.66  -     0.86  0.46  0.59  0.39  0.72
j     0.70  0.75  0.86  -     0.52  0.67  0.44  0.83
p     0.47  0.41  0.46  0.52  -     0.61  0.79  0.56
t     0.67  0.53  0.59  0.67  0.61  -     0.61  0.79
x     0.44  0.35  0.39  0.44  0.79  0.61  -     0.49
y     0.81  0.64  0.72  0.83  0.56  0.79  0.49  -

      ao    b     d     e     g     k     m     q     a
ao    -     0.70  0.28  0.28  0.27  0.75  0.70  0.70  0.75
b     0.74  -     0.31  0.34  0.33  0.82  1.00  1.00  0.87
d     0.28  0.29  -     0.50  0.74  0.29  0.29  0.29  0.29
e     0.31  0.34  0.50  -     0.64  0.32  0.34  0.34  0.33
g     0.27  0.33  0.74  0.64  -     0.29  0.33  0.33  0.30
k     0.75  0.82  0.29  0.30  0.29  -     0.82  0.82  0.96
m     0.74  1.00  0.31  0.34  0.33  0.82  -     1.00  0.87
q     0.74  1.00  0.31  0.34  0.33  0.82  1.00  -     0.87
a     0.75  0.87  0.30  0.31  0.30  0.96  0.87  0.87  -

Table 5. Similarity matrices for malware using exact matching.

      a     b     c     d     g     h
a     -     0.76  0.82  0.69  0.52  0.51
b     0.76  -     0.83  0.80  0.52  0.51
c     0.82  0.83  -     0.69  0.51  0.51
d     0.69  0.80  0.69  -     0.51  0.50
g     0.52  0.52  0.51  0.51  -     0.85
h     0.51  0.51  0.51  0.50  0.85  -

      aa    ac    f     j     p     t     x     y
aa    -     0.74  0.59  0.67  0.49  0.72  0.50  0.83
ac    0.74  -     0.69  0.78  0.40  0.55  0.37  0.63
f     0.59  0.69  -     0.88  0.44  0.61  0.41  0.70
j     0.67  0.78  0.88  -     0.49  0.69  0.46  0.79
p     0.49  0.40  0.44  0.49  -     0.68  0.85  0.58
t     0.72  0.55  0.61  0.69  0.68  -     0.63  0.86
x     0.50  0.37  0.41  0.46  0.85  0.63  -     0.54
y     0.83  0.63  0.70  0.79  0.58  0.86  0.54  -

      ao    b     d     e     g     k     m     q     a
ao    -     0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
b     0.44  -     0.27  0.27  0.27  0.51  1.00  1.00  0.58
d     0.28  0.27  -     0.48  0.56  0.27  0.27  0.27  0.27
e     0.27  0.27  0.48  -     0.59  0.27  0.27  0.27  0.27
g     0.28  0.27  0.56  0.59  -     0.27  0.27  0.27  0.27
k     0.55  0.51  0.27  0.27  0.27  -     0.51  0.51  0.75
m     0.44  1.00  0.27  0.27  0.27  0.51  -     1.00  0.58
q     0.44  1.00  0.27  0.27  0.27  0.51  1.00  -     0.58
a     0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58  -
Table 4 and table 6 evaluate the flowgraph matching system in more detail using
normal operation, the system does not calculate the complete similarity between
binaries which are not considered variants; however, this performance feature was
relaxed for this evaluation metric. Highlighted cells identify a malware variant, defined
excess of 0.9. To improve the performance of exact matching, procedures with fewer than
5 basic blocks were not included, which on occasion results in higher similarity being
malware. The results demonstrate that the system finds high similarities between
Table 5 shows the difference in the similarity matrix when the threshold for the
similarity ratio is increased to 1.0. Differences of up to 30% were noted across the
malware variants using the two similarity ratio thresholds. Using a threshold of 1.0 for
the similarity ratio is similar, but not identical, to the results of exact matching.
To evaluate exact matching in Malwise on a larger scale, 15,409 malware samples with
unique MD5 hashes were collected between 02-01-2009 and 8-12-2009 from honeypots
in the mwcollect Alliance [67] network. The malware samples were sorted according to
collection time, and processed in order. 94.4% of malware samples were found to have
a similarity of more than 95% to previously classified malware in the set. 863
representative malware signatures were stored in the database, where none were more
than 95% similar to other signatures. It was found that 88.26% of malware were
strong evidence that detecting malware variants has much benefit in the detection of
unknown malware samples. It was also found that 34.24% of malware were 100%
similar to existing malware, once unpacked. This corroborates research [16] that many
new instances of malware are repacked versions of existing malware. The results after
809 malware samples with unique MD5 hashes were collected between 29-04-2009 and
17-05-2009 from honeypots in the mwcollect Alliance network [67] and form a subset
of the previously classified 15,409 malware. All malware were used to populate the
malware. 754 samples were found to have at least one other sample in the set which was
a variant. Table 7 and figure 21 evaluate the speed of processing these malware
samples, including unpacking and classification time but excluding the loading time of
the malware database. The evaluation was performed on a 2.4 GHz Quad Core Desktop
PC with 4 GB of memory, running 32-bit Windows Vista Home Premium with Service
Pack 1, as was used in the unpacking performance testing. 86% of the malware were
processed in under 1.3 seconds. The only malware that was not processed in under 5
seconds instead took nearly 14 seconds. This was because nearly 163 million
instructions were emulated during unpacking. This is possibly the result of an
anti-emulation loop. Manual inspection of the results also reveals some malware were not
fully unpacked. The static analysis is therefore likely generating signatures based on the

Table 7. Malware sample processing time.

Time (s)  Samples
0-1       299
1-2       401
2-3       46
3-4       30
4-5       32
5+        1

Table 8. Benign sample processing time.

Time (s)  Samples
0.0       0
0.1       139
0.2       80
0.3       42
0.6       10
0.7       3
0.8       6
0.9       5
1-2       17
2+        6

Figure 21. Malware processing time.
Figure 22. Benign processing time.
To evaluate the speed of classifying benign samples, 346 binaries in the Windows
system directory were evaluated using the malware database created in the previous
evaluation. The results are shown in table 8 and figure 22. The median time to perform
classification was 0.25 seconds. The slowest sample classified required 5.12 seconds.
It is much faster to process benign samples than malicious samples. Malicious samples
are typically packed and the unpacking consumes the majority of processing time. The
results clearly show this difference, and give more evidence that our system performs
quickly in the average case. The results shown demonstrate efficient processing in the
majority of benign and real malware samples, with speeds suitable for potential desktop
adoption.
Figure 23. Scalability of classification.
synthetic database was constructed. To simulate conditions likely in real samples, 10%
of the control flow graphs were made common to all malware. The synthetic database
contained up to a maximum of 70,000 malware, with each malware having 200 control
flow graphs. The malware signatures were randomly generated. The time to perform
in figure 23. Less than a millisecond was required to complete a single repetition of
classification for all evaluated database sizes. The trend of the graph is logarithmic, as
Table 9. Similarity matrix for non similar programs using approximate matching.
cmd.exe calc.exe netsky.aa klez.a roron.ao
To evaluate the generation of false positives in Malwise, table 9 and table 10 show
classification among non similar binaries using approximate and exact matching. Low
To further evaluate the exact matching algorithm against false positives, the malware
database created from the 809 samples in Section 6.4.3 was used for classifying the
binaries in the Windows system directory. No false positives were identified. The
highest matching sample showed a similarity of 0.34. All other binaries had similarities
below 0.25. This result clearly shows resilience against false positives.
Table 11. Histogram of similarities between executable files in Windows system directory.

Similarity  Matches (approx.)  Matches (exact)
0.6         44                 34
0.7         72                 24
0.8         24                 22
0.9         20                 12
1.0         6                  0
thorough test for false positive generation by comparing each executable binary to every
other binary in the Windows Vista system directory. The histogram groups binaries that
share similarity in buckets grouped in intervals of 0.1. The results show there exist
similarities between some of the binaries, but for the majority of comparisons the
similarity is less than 0.1. This seems a reasonable result as most binaries will be
expected. Exact matching also produces fewer comparisons due to the added
requirement of each flowgraph having at least 5 basic blocks, which resulted in some
6.5 Summary
algorithm to identify the similarity between programs based on sets of control flow
graph features. We additionally proposed a similarity search algorithm that allowed for
efficient database searching to find similar sets to our query. We implemented these
algorithms in the prototype Malwise system. It was shown that our system can
effectively identify variants of malware in samples of real malware. It was also shown
that there is a high probability that new malware is a variant of existing malware.
Finally, we evaluated the speed and efficiency of the complete Malwise system
Malwise as suitable for potential applications including desktop and Internet gateway
7 Conclusions and Future Work
The Malwise system performs effectively but we believe the malware detection rate
assigning control flow graphs between sets of programs. The Malwise system currently
employs a greedy solution to the assignment problem. This could be replaced with an
In addition to effectiveness, the efficiency of the Malwise system could also potentially
translation. Approximate matching could use heuristic based comparisons. The more
sound string edit distance could subsequently be used to refine the results. Additionally,
alternative string metrics are possible such as the sequence alignment algorithms
The malware detection could also be made more robust against different forms of
situations. The use of multiple features, including call graph information and data
control flow features in the detection of unknown and novel malware samples.
7.2 Conclusions
malware and static classification of malware. The thesis proposed novel approaches to
effectively unpack and classify malware while maintaining a high degree of efficiency.
Our approach employed application level emulation for unpacking malware and used
similarity.
string based control flow signature, amenable to comparisons using the string
edit distance. This approach was used for approximate control flow graph
matching.
which formed the basis for our malware classification system and performed
We implemented and evaluated our ideas in a novel prototype system named Malwise.
The automated unpacking system was found to accurately unpack samples that were
obfuscated using known packing tools. The speed and efficiency of the unpacking
system was found to be suitable for potential desktop adoption. The malware
classification system was demonstrated to detect variants of real malware. It was shown
that a high probability existed that a new malware instance was a variant of existing
malware. Approximate matching was shown to detect more malware variants than exact
matching, yet exact matching was shown to have comparable effectiveness. The exact
matching classification system was found to perform efficiently in our evaluation with
Antivirus.
References
[1] Symantec 2008, Symantec internet security threat report: Volume xii, Symantec.
[2] F-Secure 2007, F-secure reports amount of malware grew by 100% during
us/pressroom/news/2007/fs_news_20071204_1_eng.html
[3] Symantec 2009, Symantec internet security threat report: Volume xiv,
Symantec.
[4] Heng, Y., Dawn, S., Manuel, E., Christopher, K. & Engin, K. 2007, Panorama:
[5] Feily, M., Shahrestani, A. & Ramadass, S. 2009, A survey of botnet and botnet
[6] Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X., Wang, X. F.
& Santa Barbara, U. C. 2009, Effective and efficient malware detection at the
[7] Griffin, K., Schneider, S., Hu, X. & Chiueh, T. 2009, Automatic generation of
Springer.
[8] Kephart, J. O. & Arnold, W. C. 1994, Automatic extraction of computer virus
[10] Christodorescu, M., Kinder, J., Jha, S., Katzenbeisser, S. & Veith, H. 2005,
[11] Mihai, C. & Somesh, J. 2004, Testing malware detectors, Proceedings of the
[13] Royal, P., Halpin, M., Dagon, D., Edmonds, R. & Lee, W. 2006, Polyunpack:
[14] Sharif, M., Lanzi, A., Giffin, J. & Lee, W. 2009, Rotalume: A tool for automatic
http://research.pandasecurity.com/archive/Mal_2800_ware_2900_formation-statistics.aspx
[16] Stepan, A. 2006, Improving proactive detection of packed malware, Virus
Bulletin Conference.
[17] Oberheide, J., Bailey, M. & Jahanian, F. 2009, Polypack, USENIX Workshop
[19] 2010, Upx: The ultimate packer for executables, viewed 6 April 2010,
http://upx.sourceforge.net/
[21] Kang, M. G., Poosankam, P. & Yin, H. 2007, Renovo: A hidden code extractor
University of Mannheim.
pp. 470-478.
[26] Gheorghescu, M. 2005, An automated virus classification system, Virus Bulletin
[27] Aho, A. V., Sethi, R. & Ullman, J. D. 1986, Compilers: Principles, techniques,
[28] Kruegel, C., Kirda, E., Mutz, D., Robertson, W. & Vigna, G. 2006,
[30] Ye, Y., Wang, D., Li, T. & Ye, D. 2007, Imds: Intelligent malware detection
[31] Christodorescu, M., Jha, S., Seshia, S. A., Song, D. & Bryant, R. E. 2005,
detranslation of computer programs', The Computer Journal, vol. 23, pp. 223-
229.
[34] Kruegel, C., Robertson, W., Valeur, F. & Vigna, G. 2004, Static disassembly of
[36] Kästner, D. & Wilhelm, S. 2002, 'Generic control flow reconstruction from
[37] Theiling, H. 2000, Extracting safe and precise control flow from binaries,
[38] Dalla Preda, M., Madou, M., De Bosschere, K. & Giacobazzi, R. 'Opaque
[39] Balakrishnan, G., Reps, T., Melski, D. & Teitelbaum, T. 2007, 'Wysinwyx:
What you see is not what you execute', Verified Software: Theories, Tools,
[40] Leder, F., Steinbock, B. & Martini, P. 2009, Classification and detection of
Canada.
[42] Moser, A., Kruegel, C. & Kirda, E. 2007, Limits of static analysis for malware
[43] Lyda, R. & Hamrock, J. 2007, 'Using entropy analysis to find encrypted and
[44] Josse, S. 2007, 'Secure and advanced unpacking using computer emulation',
[45] Bellard, F. 2005, Qemu, a fast and portable dynamic translator, USENIX
[46] Raffetseder, T., Kruegel, C. & Kirda, E. 2007, 'Detecting system emulators',
[47] Min Gyung, K., Heng, Y., Steve, H., Stephen, M. & Dawn, S. 2009, Emulating
[48] Quist, D. & Valsmith 2007, Covert debugging circumventing software armoring
[49] Luk, C. K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S.,
[50] Martignoni, L., Christodorescu, M. & Jha, S. 2007, Omniunpack: Fast, generic,
[51] Dinaburg, A., Royal, P., Sharif, M. & Lee, W. 2008, Ether: Malware analysis
[52] Wu, Y., Chiueh, T. & Zhao, C. 2009, Efficient and automatic instrumentation
[53] Sun, L., Ebringer, T. & Boztas, S. 2008, Hump-and-dump: Efficient generic
[54] Peter, N. Y. 1993, Data structures and algorithms for nearest neighbor search
[55] Paolo, C., Marco, P. & Pavel, Z. 1997, M-tree: An efficient access method for
[56] Bonfante, G., Kaczmarek, M. & Marion, J. Y. 2008, Morphological detection of
[59] Hu, X., Chiueh, T. & Shin, K. G. Large-scale malware indexing using function-
[60] Babar, K., Khalid, F. & Pakistan, P. 2009, Generic unpacking techniques,
[61] Martignoni, L., Paleari, R., Roglia, G. F. & Bruschi, D. 2009, Testing cpu
University of Technology.
[63] Moretti, E., Chanteperdrix, G. & Osorio, A. 2001, New algorithms for control-
[64] Wei, T., Mao, J., Zou, W. & Chen, Y. 2007, Structuring 2-way branches in
http://www.offensivecomputing.net
http://alliance.mwcollect.org