Anju 2010
ABSTRACT
Malware detection is a crucial aspect of software security. A malware detector is a system that attempts to determine whether a program has malicious intent. Current malware detectors work by checking for signatures, which attempt to capture the syntactic characteristics of the machine-level byte sequence of the malware. This syntactic approach makes current detectors vulnerable to code obfuscations, increasingly used by malware writers, that alter the syntactic properties of the malware byte sequence without significantly affecting its execution behavior.

This paper derives from the idea that the key to malware identification lies in both its syntactic and its semantic features. It explains an approach that uses control flow graphs (CFGs) for malware detection. We present an architecture for detecting malicious patterns in executables that is resilient to common obfuscation transformations.

Categories and Subject Descriptors
D.4.6 [Operating Systems]: Security and Protection - Access Control; K.6.5 [Management of Computing and Information Systems]: Security and Protection - Invasive software

General Terms
Design, Security

Keywords
Malware, Optimization, Control Flow Graph, Detection

"Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. A2CWiC 2010, September 16-17, 2010, India Copyright © 2010 978-1-4503-0194-7/10/0009… $10.00"

1. INTRODUCTION
Malware, short for malicious software, is software designed to infiltrate a computer system without the owner's informed consent. It includes computer viruses, worms, trojan horses, spyware, dishonest adware, crimeware, most rootkits, and other malicious and unwanted software.

Most of today's commercial malware-detection tools recognize malware by searching for peculiar sequences of bytes. Such byte strings act as the malware's "fingerprint" and are called malware signatures. However, this classic signature-based method often fails to detect variants of known malware or previously unknown malware, because malware writers routinely adopt techniques such as obfuscation to bypass these signatures [1][2][3]. To remain effective, antivirus companies must be able to quickly analyze variants of known malware and previously unknown malware samples. The number of file samples that need to be analyzed on a daily basis is constantly increasing [4]. Thus, a current trend in the community is to design a new generation of malware detectors based on semantic aspects [5, 6, 7]. However, a major difficulty of these approaches is the efficiency of the detection. Heuristics can be very complex, as illustrated in the field of computer safety. We emphasize these issues in this paper.

In this paper, a detection strategy based on control flow graphs (CFGs) is used. More precisely, we show how flow graphs can be used as signatures. Our technique is essentially syntactic, but we also take into account some of the semantic features of the program. Section 2 describes the system architecture of the proposed system. Experiments and results are presented in Section 3.

2. SYSTEM ARCHITECTURE
In this approach, control flow graphs (CFGs) play the role of malware signatures, and a collection of them forms the signature database. When a new, possibly obfuscated, program arrives, it is scanned to recover the shape of its control flow. If that shape is the same as one present in the signature database, the program is treated as malicious. As we see, the design is close to that of a string-signature-based detector, but it also takes semantic features into account. Indeed, the CFG can be used directly as a witness of the program. The design of the malware detector is presented in Fig 1.
Fig 1. System Architecture

2.1 Assembly Code Optimization
The first step is to obtain the assembly code (x86 architecture) from the .exe files using a disassembler [8]. To ease the manipulation of object code, our detector uses a high-level representation of machine instructions to express every opcode's operational semantics, as well as the registers and memory addresses involved. This considerably simplifies the subsequent steps, because they only need to handle a target language with a very limited set of features and instructions [9]. x86 assembler programs have four kinds of flow instruction: unconditional jumps (jmp), conditional jumps (jcc), function calls (call) and function returns (ret). The instruction operands can only be registers, memory addresses, and constants. The second step is to convert this assembly language into an intermediate representation. A simple example is shown below, reduced from the original due to space constraints. It shows that even the simple dec instruction conceals complex semantics: its argument is decremented by one, and the instruction, or the subtraction, produces an update of six control flags according to the result.

Example: dec %ebx can be converted into
    tmp = r08
    r08 = r08 - 1
    NF = r08@[31:31]
    ZF = [r08 = 0? 1:0]
    CF = (~ (tmp @[31:31])....

Next, this intermediate representation is optimized so that it becomes structurally simpler while preserving the original semantics. This optimization is done because most obfuscated malware is an under-optimized version of the original code. The obfuscated versions often grow in size because irrelevant statements are added to the code, which lets them evade signature-based detection. Hence, by optimization we are actually trying to remove the unwanted instructions and reduce the number of indirect paths. The optimization process includes the techniques that compilers employ to optimize code and to improve space complexity, and hence efficiency [10-11]. All of these use the information obtained from the static analysis performed on the intermediate code at the beginning of the code optimization phase. The optimizations depend mainly on the accuracy of the static analysis.

The x86 assembly instructions denote simple expressions that generally have no more than one or two operands. We try to develop higher-level expressions (with more than two operands) and eliminate all intermediate temporary variables that the malware used to implement high-level expressions. This is done by propagating values assigned or computed by intermediate instructions. For example, the sequence of instructions in the intermediate representation
    r9 = [r12]
    r12 = r10|r9
    [r11] = [r11]|r12
    [r11] = ~[r11]
    [r11] = [r11] & r10
can be converted to
    r9 = [r12]
    r12 = r10|[r12]
    [r11] = (~([r11]|r12)) & r10.

During this phase we also try to find inserted instructions such as dead code and remove them. Assignments whose expressions do not change a value, such as an addition of zero, are also treated as dead code. In the converted sequence above, the first statement (r9 = [r12]) is useless once its value has been propagated, so it is identified as dead code and can be removed.

During this phase of optimization we also try to evaluate all expressions involved. This step is mainly useful for finding out whether certain branch conditions will ever be true or false. If a branched block of statements is found to be unreachable, it can be removed from the code. Expression evaluation also helps us find out whether a constant memory address is camouflaged by a more complex expression.

During this phase we often come across a large number of indirect statements, in which the target addresses of jumps and function calls have to be computed dynamically. For example, if the statement is jmp eax, one needs the value of the register eax in order to follow the control flow transfer. In such cases, our current procedure relies on a heuristic (|e|) which provides the value of the expression e by static analysis. If the value cannot be computed, then (|e|) = ⊥ (undefined). Such a heuristic can be based on partial evaluation, emulation or any other static analysis technique.

Next, we must also take care of the fake conditional and unconditional jumps with which malware can significantly twist a program's control flow, since they lead to a different control flow graph. A chain of unconditional jump instructions can be optimized by replacing it with the sequence of instructions that the jumps link together. Fake conditional jump instructions can be removed by pruning the paths that are never reached.

2.2 Control-flow graph Extraction
The control-flow graph (CFG) is a fundamental data structure needed by almost all the techniques that compilers use to find opportunities for optimization and to prove the safety of those optimizations. After optimization, different instances of the same malicious code may still not reduce to the same form. Hence a bytewise comparison is likely to miss variants, that is, to produce false negatives. This is why we decided to compare control-flow graphs instead. Our CFG representation is a rough abstraction of programs. Indeed we do not make any distinction between the different kinds of
sequential instruction; they are all represented by nodes labeled 'inst'.

2.3 Control-flow graph optimization
We try to compress the extracted CFG, thereby making it more robust against classic mutation techniques. The following procedures can be adopted to reduce the CFG [12]:
- Merge consecutive continuous-flow instructions into a single node (fig 2a).
- Replace unconditional jumps by the jumped-to block itself (fig 2b).
- Combine consecutive conditional jumps into a single one (fig 2c).
- Prune any paths that are detected as never reached.

2.4 Control flow graph comparison and detection
When a malicious program infects another program, it includes its own code within the host program. We can then reasonably suppose that the CFG of the malicious program appears as a sub-graph of the global CFG of the infected program. As a result, we can detect such an infection by deciding the sub-graph isomorphism problem in the context of CFGs: given two graphs G1 and G2, is G1 isomorphic to a subgraph of G2? Subgraph isomorphism is NP-complete in general. However, because the successor relation is ordered, the problem is polynomial in the present setting. Indeed, a CFG composed of n vertices has only n distinct sub-CFGs of at most n vertices. Figure 3 shows the two graphs just mentioned: Figure 3a models the malicious code, and Figure 3b shows the suspicious program, with the matching nodes highlighted.
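
The polynomial-time check described in Section 2.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The dictionary-based CFG encoding, the function names match_at and find_signature, the reading of "n distinct sub-CFGs" as one candidate per root vertex, and the rule that a signature node without successors acts as a boundary that matches on its label alone are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: node id -> (label, ordered list of successor ids).
# Labels follow the paper's node kinds: 'inst', 'jmp', 'jcc', 'call', 'ret'.
CFG = Dict[int, Tuple[str, List[int]]]


def match_at(sig: CFG, sig_root: int, prog: CFG, prog_root: int) -> Optional[Dict[int, int]]:
    """Try to embed the signature CFG into the program CFG with
    sig_root mapped to prog_root. Because successors are ordered, the
    i-th successor of a signature node can only map to the i-th
    successor of its image, so no backtracking is needed."""
    mapping = {sig_root: prog_root}   # signature node -> program node
    used = {prog_root}                # program nodes already taken
    stack = [sig_root]
    while stack:
        s = stack.pop()
        p = mapping[s]
        s_label, s_succ = sig[s]
        p_label, p_succ = prog[p]
        if s_label != p_label:
            return None
        if not s_succ:
            continue  # assumed boundary node: only the label is checked
        if len(s_succ) != len(p_succ):
            return None
        for s_child, p_child in zip(s_succ, p_succ):
            if s_child in mapping:
                if mapping[s_child] != p_child:
                    return None       # inconsistent with an earlier choice
            elif p_child in used:
                return None           # mapping must stay one-to-one
            else:
                mapping[s_child] = p_child
                used.add(p_child)
                stack.append(s_child)
    return mapping


def find_signature(sig: CFG, sig_root: int, prog: CFG) -> Optional[Dict[int, int]]:
    """Try every program node as the image of the signature root."""
    for candidate in prog:
        mapping = match_at(sig, sig_root, prog, candidate)
        if mapping is not None:
            return mapping
    return None


# Toy signature: inst -> jcc -> (inst -> ret | ret)
signature: CFG = {
    0: ("inst", [1]),
    1: ("jcc", [2, 3]),
    2: ("inst", [3]),
    3: ("ret", []),
}
# Toy host program that contains the same shape starting at node 12.
infected: CFG = {
    10: ("inst", [11]),
    11: ("call", [12]),
    12: ("inst", [13]),
    13: ("jcc", [14, 15]),
    14: ("inst", [15]),
    15: ("ret", []),
}
clean: CFG = {0: ("inst", [1]), 1: ("ret", [])}

print(find_signature(signature, 0, infected))  # {0: 12, 1: 13, 2: 14, 3: 15}
print(find_signature(signature, 0, clean))     # None
```

Each candidate root is checked in time proportional to the number of edges in the signature, so under these assumptions the whole scan costs roughly the number of program nodes times the signature size. In a full detector this check would run on the reduced CFGs of Section 2.3 and be repeated for every signature in the database.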
5. REFERENCES
[1] A. Sung, J. Xu, P. Chavez, and S. Mukkamala, 2004. Static
analyzer of vicious executables (SAVE). Proc. 20th Annu.
Comput. Security Appl. Conf., 326–334.