
Malware Detection using Assembly Code and Control Flow Graph Optimization

Anju S.S, Harmya.P, Noopa Jagadeesh, Darsana.R
Centre for Cyber Security
Amrita Vishwa Vidyapeetham
Tamil Nadu, India

ABSTRACT
Malware detection is a crucial aspect of software security. A malware detector is a system that attempts to determine whether a program has malicious intent. Current malware detectors work by checking for signatures, which attempt to capture the syntactic characteristics of the machine-level byte sequence of the malware. This syntactic approach makes current detectors vulnerable to code obfuscations, increasingly used by malware writers, that alter the syntactic properties of the malware byte sequence without significantly affecting its execution behavior.

This paper derives from the idea that the key to malware identification lies in syntactic as well as semantic features. It explains an approach using control flow graphs (CFGs) for malware detectors. We present an architecture for detecting malicious patterns in executables that is resilient to common obfuscation transformations.

Categories and Subject Descriptors
D.4.6 [Operating Systems]: Security and Protection - Access Control; K.6.5 [Management of Computing and Information Systems]: Security and Protection - Invasive software

General Terms
Design, Security

Keywords
Malware, Optimization, Control Flow Graph, Detection

"Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. A2CWiC 2010, September 16-17, 2010, India Copyright © 2010 978-1-4503-0194-7/10/0009… $10.00"

1. INTRODUCTION
Malware, short for malicious software, is software designed to infiltrate a computer system without the owner's informed consent. It includes computer viruses, worms, trojan horses, spyware, dishonest adware, crimeware, most rootkits, and other malicious and unwanted software.

Most of today's commercial malware-detection tools recognize malware by searching for peculiar sequences of bytes. Such byte strings act as the malware's "fingerprint" and are called malware signatures. However, this classic signature-based method often fails to detect variants of known malware or previously unknown malware, because malware writers adopt techniques like obfuscation to bypass these signatures [1][2][3]. In order to remain effective, it is of paramount importance for antivirus companies to be able to quickly analyze variants of known malware and previously unknown malware samples. The number of file samples that need to be analyzed on a daily basis is constantly increasing [4]. Thus, a current trend in the community is to design a new generation of malware detectors based on semantic aspects [5, 6, 7]. However, a major difficulty of these approaches is the efficiency of detection. Heuristics can be very complex, as illustrated in the field of computer safety. We place the emphasis on these issues.

In this paper, a detection strategy based on control flow graphs (CFGs) is used. More precisely, we show how flow graphs can be used as signatures. Our technique is essentially syntactic, but we also take into account some of the semantic features of the program. Section 2 describes the architecture of the proposed system. Experiments and results are presented in Section 3.

2. SYSTEM ARCHITECTURE
In this approach we use control flow graphs (CFGs), which play the role of a malware signature database. When a new obfuscated malware sample arrives, it is scanned to recognize the shape of a known malware. If its shape is the same as one present in the signature database, it is treated as malicious. The design is close to that of a string-signature-based detector, but we also take semantic features into account. Indeed, the CFG can be used directly as a witness of the program. The design of the malware detector is presented in Fig 1.
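The database lookup described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: a CFG is assumed to be a Python dict mapping each node to a pair (label, ordered successors), and the names `canonical_form`, `signature_db`, `lookup` and "Example.Loop" are hypothetical. The sketch tests whole-graph shape equality only; the comparison the paper actually relies on is the sub-graph test of Section 2.4.

```python
def canonical_form(cfg, root):
    """Number nodes breadth-first, following successors in order, and
    return a hashable description of the shape reachable from root."""
    index = {root: 0}          # node -> position in visit order
    order = [root]
    i = 0
    while i < len(order):
        _label, successors = cfg[order[i]]
        for nxt in successors:
            if nxt not in index:
                index[nxt] = len(order)
                order.append(nxt)
        i += 1
    # Node names disappear; only labels and ordered edges remain.
    return tuple(
        (cfg[node][0], tuple(index[s] for s in cfg[node][1]))
        for node in order
    )


# Hypothetical signature: a loop of the shape inst -> jcc -> (back | ret).
KNOWN = {0: ("inst", (1,)), 1: ("jcc", (0, 2)), 2: ("ret", ())}
signature_db = {canonical_form(KNOWN, 0): "Example.Loop"}


def lookup(cfg, root):
    """Return the signature name if the whole shape is known, else None."""
    return signature_db.get(canonical_form(cfg, root))


# Same shape with different node names, and an unrelated shape.
SAMPLE = {"x": ("inst", ("y",)), "y": ("jcc", ("x", "z")), "z": ("ret", ())}
OTHER = {"p": ("inst", ("q",)), "q": ("ret", ())}

print(lookup(SAMPLE, "x"))
print(lookup(OTHER, "p"))
```

Because node names are replaced by visit positions, two graphs with the same labels and the same ordered edges produce the same key, which is what lets a graph shape act like a string signature.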
Fig 1. System Architecture

2.1 Assembly Code Optimization
The first step is to get the assembly code (x86 architecture) from the .exe files using a disassembler [8]. To ease the manipulation of object code, our detector uses a high-level representation of machine instructions to express every opcode's operational semantics, as well as the registers and memory addresses involved. This considerably simplifies the subsequent sub-processes, because they only require a target language with a very limited set of features and instructions [9]. x86 assembler programs have four kinds of control flow instruction: unconditional jumps (jmp), conditional jumps (jcc), function calls (call) and function returns (ret). The instruction operands can only be registers, memory addresses, and constants. The second step is to convert this assembly language into an intermediate representation. A simple example is shown below, reduced from the original due to space constraints. Even the simple dec instruction conceals complex semantics: its argument is decremented by one, and the instruction, or rather the subtraction, produces an update of six control flags according to the result.

Example: dec %ebx can be converted into
tmp = r08
r08 = r08 - 1
NF = r08@[31:31]
ZF = [r08 = 0? 1:0]
CF = (~ (tmp @[31:31])....

Next, this intermediate representation is optimized so that it becomes simpler in structure while preserving the original semantics. This optimization is done because most malware, when obfuscated, leads to under-optimized versions. The obfuscated versions often grow in size because of the addition of irrelevant statements to the code, and this lets them evade signature-based detection. Hence, by optimization we are trying to remove the unwanted instructions and reduce the number of indirect paths. The optimization process includes the techniques that compilers employ to optimize code and improve its space complexity and hence its efficiency [10-11]. All of these use the information obtained from the static analysis performed on the intermediate code at the beginning of the code optimization phase. The optimizations mainly depend on the accuracy of the static analysis.

The x86 assembly instructions denote simple expressions that generally have no more than one or two operands. We try to build higher-level expressions (with more than two operands) and eliminate all intermediate temporary variables that the malware used to implement high-level expressions. This is done by propagating values assigned or computed by intermediate instructions. For example, the sequence of instructions in the intermediate representation
r9= [r12]
r12= r10|r9
[r11]= [r11]|r12
[r11]= ~ [r11]
[r11]= [r11] & r10
can be converted to
r9= [r12]
r12=r10| [r12]
[r11]= (~ ([r11]|r12)) & r10.

During this phase we try to find obfuscating instructions such as dead code and remove them. Assignments whose expressions do not change the value, such as addition with zero, can also be treated as dead code. In the example given above, the first statement is completely useless once its value has been propagated and can be removed; hence it is identified as dead code. During this phase of optimization we also try to evaluate all expressions involved. This step is mainly useful for finding out whether certain branch conditions will always be true or always false. If any branched block of statements is found to be unreachable, it can be removed from the code. Expression evaluation also helps us find out whether a constant memory address is camouflaged through more complex expressions. During this phase we often come across a large number of indirect statements, where the target addresses of jumps and function calls have to be dynamically computed. For example, if the statement is jmp eax, one needs the value of the register eax in order to follow the control flow transfer. In such cases, our current procedure relies on a heuristic (|e|) which provides the value of the expression e by static analysis. If the value cannot be computed then (|e|) = ⊥. Such a heuristic can be based on partial evaluation, emulation or any other static analysis technique. Next, we must also take care of the fake conditional and unconditional jumps by which malware can significantly twist a program's control flow; left in place, they would lead to a different control flow graph. A chain of unconditional jump instructions can be optimized by replacing the chain with the straight-line sequence of instructions it links together. Fake conditional jump instructions can be removed by pruning those paths which are never reached.

2.2 Control-flow graph Extraction
The control-flow graph (CFG) is a fundamental data structure needed by almost all the techniques that compilers use to find opportunities for optimization and to prove the safety of those optimizations. After optimization, the different instances of the same malicious code may not reduce to the same form. Hence a bytewise comparison will likely lead to missed detections (false negatives). This is why we decided to use control-flow graphs. Our CFG representation is a rough abstraction of programs. Indeed we do not make any distinction between the different kinds of
sequential instruction; they are all represented by nodes labeled 'inst'.

2.3 Control-flow graph optimization
We try to compress the extracted CFG, thereby making it more robust against classic mutation techniques. The following procedures can be adopted to reduce the CFG [12]:
• Merge consecutive sequential (continuous flow) instructions into a single node (Fig 2a).
• Replace an unconditional jump by the block it jumps to (Fig 2b).
• Combine consecutive conditional jumps into a single one (Fig 2c).
• Prune any paths that are detected as never reached.

Fig 2a. Merge continuous flow instructions
Fig 2b. Replacement of unconditional jumps
Fig 2c. Merge conditional jumps

2.4 Control flow graph comparison and detection
When a malicious program infects another program, it includes its own code within the host program. We can therefore reasonably suppose that the CFG of the malicious program appears as a sub-graph of the global CFG of the infected program. As a result, we can detect such an infection by deciding the sub-graph isomorphism problem within the context of CFGs: given two graphs G1 and G2, is G1 isomorphic to a subgraph of G2? In general, subgraph isomorphism is NP-complete. However, because the successor relation of a CFG is ordered, the problem is polynomial in this setting. Indeed, a CFG composed of n vertices has only n distinct sub-CFGs of at most n vertices. Figure 3 shows the two graphs just mentioned: Figure 3a models the malicious code, and Figure 3b shows the suspicious program (the highlighted nodes are those that match).

Fig 3a. CFG of a malicious code
Fig 3b. CFG of an infected program
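The comparison step of Section 2.4 can be illustrated with a short sketch. It is not the prototype's code; it assumes a CFG is stored as a dict mapping each node to (label, ordered successors), that every signature node is reachable from a designated root, and that a signature node may list fewer successors than its image (a truncated boundary). The names `match_from` and `find_signature` are hypothetical. Because successors are ordered, choosing the image of the root forces the image of every other node, so one attempt per program vertex is enough.

```python
from collections import deque


def match_from(sig, sig_root, prog, start):
    """Try to embed sig into prog with sig_root mapped to start.
    Returns the node mapping, or None if the embedding fails."""
    mapping = {sig_root: start}
    used = {start}                      # program nodes already taken
    queue = deque([sig_root])
    while queue:
        s = queue.popleft()
        p = mapping[s]
        s_label, s_succ = sig[s]
        p_label, p_succ = prog[p]
        if s_label != p_label or len(s_succ) > len(p_succ):
            return None
        # The i-th successor of s must map to the i-th successor of p.
        for s_next, p_next in zip(s_succ, p_succ):
            if s_next in mapping:
                if mapping[s_next] != p_next:
                    return None
            elif p_next in used:
                return None             # mapping must stay one-to-one
            else:
                mapping[s_next] = p_next
                used.add(p_next)
                queue.append(s_next)
    return mapping


def find_signature(sig, sig_root, prog):
    """Try every program vertex as the image of the signature root."""
    for start in prog:
        found = match_from(sig, sig_root, prog, start)
        if found is not None:
            return found
    return None


# Hypothetical data: a small loop signature and two program CFGs.
SIG = {0: ("inst", (1,)), 1: ("jcc", (0, 2)), 2: ("ret", ())}
INFECTED = {
    "a": ("inst", ("b",)), "b": ("call", ("c",)),
    "c": ("inst", ("d",)), "d": ("jcc", ("c", "e")), "e": ("ret", ()),
}
CLEAN = {"x": ("inst", ("y",)), "y": ("ret", ())}

print(find_signature(SIG, 0, INFECTED))
print(find_signature(SIG, 0, CLEAN))
```

Each attempt visits every signature node at most once, so the whole search costs roughly the number of program vertices times the size of the signature, in line with the polynomial bound stated above.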
3. EXPERIMENTS AND RESULTS
To experimentally verify our approach, in terms of both correctness and efficiency, we developed a prototype. We built the code optimization module on top of Boomerang (http://boomerang.sourceforge.net), an open source decompiler that reconstructs high-level code by analyzing binary executables. Boomerang performs the data- and control-flow analysis (static analysis) directly on an intermediate form [13] automatically generated from machine code. We adapted it to our needs and used it to undo the previously described mutations. Using the information collected with the static analysis, our tool decides which set of optimizations to apply to a piece of code based on control- and data-flow analysis results. The analysis framework can also accommodate the resolution of indirections and performs jump- and call-table analysis [14]. After the prototype has optimized the code, the CFG extraction module builds a control-flow graph of the resulting code. Then the CFG optimization module optimizes the CFG obtained, which in turn makes it more robust against classic mutation techniques. The optimized CFGs are then stored in the database as signatures. During the detection phase, the optimized CFG of the program to be checked is constructed. We then give this graph as input to a subgraph isomorphism algorithm in order to perform the detection.

We collected malware from different sources; this collection is composed of 100 malicious programs. We then collected 100 win32 binaries from a fresh installation of Windows Vista™. This second collection is considered to consist of sane programs. Using those samples we experimented with the prototype of our malware detector. We focused our attention on false positives in order to validate the method.

Table 1. Results of the experiment

Size of CFG        <100   100-4000   >4000   Total
Sane Programs         9         51      40     100
Malware              47         45       8     100
False Positives       9          2       1      12

From Table 1 we observed that the number of false positives decreases as the size of the CFG grows. For fewer than 100 nodes the CFG is not reliable enough to discriminate malware from sane programs (9 false positives among the 9 sane programs in that range), but for more than 100 nodes the result is much better (2 among 51, and 1 among 40). This result is encouraging: detection accuracy increases with program size, whereas classical signature scanning produces more false positives as size increases. As a result, CFG detection could complement classical detection, and it would be interesting to combine the two methods.

4. CONCLUSION
We observed that certain malicious behaviors (such as decryption loops) appear in all variants of a certain malware. In this paper, we proposed a detection technique based on semantic features. From this point of view, the results we obtained are promising, and we are still working on the approach. Our future work is to optimize our tool to reduce execution times.

5. REFERENCES
[1] A. Sung, J. Xu, P. Chavez, and S. Mukkamala, 2004. Static analyzer of vicious executables (SAVE). Proc. 20th Annu. Comput. Security Appl. Conf., 326–334.
[2] Danilo Bruschi, Lorenzo Martignoni, Mattia Monga, 2006. Using Code Normalization for Fighting Self-Mutating Malware. Proceedings of International Symposium on Secure Software Engineering.
[3] M. Christodorescu and S. Jha, 2003. Static analysis of executables to detect malicious patterns. Proceedings of USENIX Security Symposium, Aug.
[4] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, Jahanian, and J. Nazario, 2007. Automated classification and analysis of internet malware. Proc. RAID 2007, LNCS, vol. 4637, 178–197.
[5] M. Dalla Preda, M. Christodorescu, S. Jha, and S. Debray, 2007. A Semantics-Based Approach to Malware Detection. POPL'07.
[6] M. Christodorescu, S. Jha, S.A. Seshia, D. Song, and R.E. Bryant, 2005. Semantics-aware malware detection. IEEE Symposium on Security and Privacy.
[7] Andrew Walenstein, Rachit Mathur, Mohamed R. Chouchane, and Arun Lakhotia, 2006. Normalizing metamorphic malware using term rewriting. Source Code Analysis and Manipulation, Sixth IEEE International Workshop.
[8] DataRescue sa/nv. IDA Pro - interactive disassembler. http://www.datarescue.com/idabase/
[9] C. Cifuentes and S. Sendall, 1998. Specifying the Semantics of Machine Instruction. 6th Int'l Workshop on Program Comprehension (IWPC 98), IEEE CS Press, 126–133.
[10] S.K. Debray et al., 2000. Compiler Techniques for Code Compaction. ACM Trans. Programming Languages and Systems, vol. 22, no. 2, 378–415.
[11] A.V. Aho, R. Sethi, and J.D. Ullman, 1986. Compilers: Principles, Techniques and Tools, Addison-Wesley.
[12] Guillaume Bonfante, Matthieu Kaczmarek and Jean-Yves Marion, 2007. Control Flow Graphs as Malware Signatures. International Workshop on the Theory of Computer Viruses.
[13] C. Cifuentes and S. Sendall, 1998. Specifying the Semantics of Machine Instructions. Proc. 6th Int'l Workshop on Program Comprehension (IWPC 98), IEEE CS Press, 126–133.
[14] C. Cifuentes and M.V. Emmerik, 2001. Recovery of Jump Table Case Statements from Binary Code. Proc. 7th Int'l Workshop on Program Comprehension, IEEE CS Press, 171–188.
