Malware analysis method using visualization of binary files

Conference Paper · October 2013


DOI: 10.1145/2513228.2513294

Malware Analysis Method using Visualization of Binary Files

KyoungSoo Han
Dept. of Computer and Software, Hanyang University, Seoul, Korea
+82-2-2220-2381
[email protected]

Jae Hyun Lim
Dept. of Computer and Software, Hanyang University, Seoul, Korea
+82-2-2220-2381
[email protected]

Eul Gyu Im
Div. of Computer Science and Engineering, Hanyang University, Seoul, Korea
+82-2-2220-4321
[email protected]

ABSTRACT
Malware authors have been generating and disseminating malware variants in various ways, such as reusing modules or using automated malware generation tools. With the help of these malware generation techniques, the number of malware samples keeps increasing every year. Therefore, new malware analysis techniques are needed to reduce malware analysis overheads. Recently, several malware visualization methods were proposed to help malware analysts. In this paper, we propose a novel method to visually analyze malware by transforming malware binary information into image matrices. Our experimental results show that the image matrices of malware can effectively classify malware families.

Categories and Subject Descriptors
D.4.6 [Operating Systems]: Security and Protection – Invasive software.
Security and privacy~Malware and its mitigation

General Terms
Security

Keywords
Malware analysis, malware visualization, malware similarity, malware detection

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
RACS'13, October 1–4, 2013, Montreal, QC, Canada.
Copyright 2013 ACM 978-1-4503-2348-2/13/10 …$15.00.

1. INTRODUCTION
The number of malware samples found on the Internet continues to increase because malware can be generated with various automated tools and reused modules. Because some modules for malicious behaviors are reused in malware variants, variants of the same family have similar binary patterns, and these patterns can be used to detect malware and to classify malware families.

Most antivirus programs focus on malware signatures, i.e., string patterns, to detect malware [1]. However, malware variants can avoid these signature-based detection methods by applying diverse detection avoidance techniques [2,3]. Consequently, malware analysts and researchers have been studying diverse analysis techniques in order to deal with malware variants. Since the number of malware samples increases every year, new malware analysis techniques are needed to reduce the burden on malware analysts. One new approach to malware analysis is to use visualization techniques.

In this paper, we propose a novel method of visually analyzing malware using the malware binary information to quickly identify, detect, and classify malware and malware families. The proposed method generates RGB colored pixels on image matrices using the malware binary information extracted through static analysis. Using these malware image matrices, similarities are calculated among different malware samples. Experimental results show that the image matrices of malware can effectively classify malware families. The proposed visualization technique can be easily automated and used to analyze a large number of malware samples.

This paper is organized as follows. In Section 2, malware analysis-related studies are described. In Section 3, malware analysis methods using visualized binary information and similarity calculation methods are proposed, and experimental results are presented in Section 4. Finally, in Section 5, conclusions and future directions are provided.

2. RELATED WORK
To detect and classify malware, various static analysis techniques have been proposed so far, including control flow graph analysis [4,5], function call graph analysis [6], byte-level analysis [7], instruction-based analysis [8,9,10], and similarity-based analysis [11,12]. Even though many static analysis techniques are available, new techniques that can complement existing techniques are still needed to improve malware analysis performance.

Recently, various visualization techniques for malware analysis have been proposed to enable human analysts to visually observe the features of malware. Studies to visualize malware behaviors have also been conducted. Trinius et al. [13] collected information on API (Application Program Interface) calls and instructions of certain behaviors, and they visualized the percentages of API calls as a "Treemap" and malware behaviors as a "Thread graph." Saxe et al. [14] proposed a system to visualize the relationships and similarities of system call sequences. One of its visualizations shows a map-like view of similarity, and the other shows similarities and differences between selected samples, based on system calls or function calls.

Conti et al. [15] proposed an integrated visualizing system that enables the analysis of byte information of malware samples through different graphical elements. A "byteview visualization" maps each byte in the binary sample to a pixel, and a "byte presence visualization" shows how many bytes have appeared. Moreover, a "dot plot visualization" detects duplicated sequences of bytes contained within a sample. Because of the overhead of the dot plot algorithm, they implemented a simplified algorithm when applying these visualization techniques.

Anderson et al. [16] visually showed the results of similarity calculations between malware samples through images named "Heatmap." Nataraj et al. [17] scanned all malware bytes, converted the information into gray-scale images, and classified the malware using image processing. After generating images, they applied an abstract representation technique for scene images, i.e., GIST [18,19], to compute texture features and to classify malware. Moreover, they showed that binary texture analysis using image processing can classify malware more quickly than existing malware classification methods [20]. However, since the texture analysis method has large computational overheads, their method has difficulty processing a large number of malware samples.

In this paper, we propose a novel analysis method using image matrices in order to visually represent malware so that the features of malware can be easily detected and the similarities between different malware samples can be calculated faster than with other visualization methods.

3. OUR PROPOSED METHOD

3.1 Overview
Our proposed visualized malware analysis method consists of three steps, as shown in Figure 1. In Step 1, binary information is extracted from binary sample files. In Step 2, image matrices are generated in which the binary information is recorded as RGB colored pixels. In Step 3, the similarities between the image matrices are calculated. In the following sections, each step is explained in detail.

Figure 1. Overview of the proposed method

3.2 Extraction of Binary Information
Figure 2 shows the process to extract binary information from binary sample files in Step 1.

Figure 2. Binary information extraction procedure

To extract binary information, the binary sample files are first disassembled using disassembling tools such as IDA Pro [21] or OllyDbg [22]. After the assembly code is extracted, the sequence of assembly instructions is divided into blocks according to some instructions that are used as delimiters, as shown in Figure 3.

The sequence of opcodes included in each block is used as binary information. From each opcode, only the first three characters are used to generate information for the block. For example, four-character opcodes such as push are reduced to three characters. Then, these three-character instructions are concatenated, and the resulting character string is used to represent the opcode block in the next step to generate an image matrix.

Figure 3. Opcode instructions used as binary information

3.3 Generation of Image Matrices
Figure 4 shows the procedure of Step 2, which converts the opcode instruction sequences into an image matrix. Two hash functions are used to decide coordinate information and RGB color information.

Figure 4. Generating images using binary information

In order to visualize a binary file as an image matrix, both the length and the width of an image matrix are initialized to 2^n, where n is selected by users. To reduce the probability of hash collisions, n should be large enough. In our experiments, we selected n as 8 to avoid collisions.

The coordinate-defining module and the RGB color-defining module are used to generate image matrices. First, the coordinate-defining module defines the (x,y) coordinates on the image matrix for the binary information of each code block. SimHash [23] is applied to the binary information extracted in Step 1.

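Step 1 (Section 3.2) and the matrix initialization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the delimiter set is hypothetical (the text says only that "some instructions" act as delimiters), the choice to keep the delimiter inside the block it closes is an assumption, and the function names `block_strings` and `empty_matrix` are invented for this sketch.

```python
# Assumed delimiter opcodes. The paper only says that "some instructions" act
# as block delimiters (its Figure 3), so this set is purely illustrative.
DELIMITERS = frozenset({"jmp", "call", "ret"})


def block_strings(opcodes, delimiters=DELIMITERS):
    """Split an opcode sequence into blocks and return one string per block."""
    blocks, current = [], []
    for opcode in opcodes:
        opcode = opcode.lower()
        current.append(opcode[:3])      # keep only the first three characters
        if opcode in delimiters:        # assumption: a delimiter ends its block
            blocks.append("".join(current))
            current = []
    if current:                         # trailing block without a delimiter
        blocks.append("".join(current))
    return blocks


def empty_matrix(n=8):
    """Return a 2^n x 2^n matrix of black RGB pixels (n = 8 gives 256 x 256)."""
    size = 2 ** n
    return [[(0, 0, 0) for _ in range(size)] for _ in range(size)]


print(block_strings(["push", "mov", "call", "pop", "ret"]))
# ['pusmovcal', 'popret']
```

Each returned string stands for one opcode block and is the input that the two hash functions of Step 2 would consume.
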
SimHash is a locality-sensitive hash function that assumes that if input values are similar, the output values will also be similar. Therefore, if the character strings of binary information are similar, the outputs will be similar and will map to nearby coordinates in an image matrix.

Second, the RGB color-defining module defines the color values of the images on an image matrix. djb2 [24] is applied to the binary information to determine the color of the image for that binary information. RGB colors are defined by calculating 8-bit values for each of the red, green, and blue channels.

Once the coordinates and RGB colors of individual images have been defined, the RGB colored images are recorded at their coordinates in the image matrix. To provide human analysts with a more convenient visual analysis, pixels around the defined coordinates are recorded simultaneously. As shown in Figure 5, nine pixels from (x-1, y-1) to (x+1, y+1) around the (x,y) coordinate defined through the opcode instruction sequence of a block are recorded.

Figure 5. Nine pixels recording for one opcode instruction sequence

If images overlap each other because the coordinates defined for multiple opcode instruction sequences are adjacent, as shown in Figure 6, the sums of the RGB colors become the new pixel colors. If a color sum exceeds 255 (0xFF), the result is set to 255. For example, if RGB1 is (255,0,0) and RGB2 is (0,176,50), the new color becomes (255,176,50).

Figure 6. Method of recording overlapping pixels

The number of pixels recorded on an image matrix varies according to the file size and the number of opcode instruction sequences, and the number of overlapping pixels increases as the number of images increases. If there are too many overlapping images, the size of the image matrix should be increased.

3.4 Similarity Calculation between Image Matrices
We used "selective area matching" to calculate the similarities between image matrices. For selective area matching, an image matrix is divided into N pieces, where N can be set to 4^x (x = 1, 2, 3, ...), such as 4, 16, and 64. Figure 7 shows image matrices in which the areas were divided according to different N values. Then, n pieces are randomly selected from the image matrix to be compared. For example, as shown in Figure 8, an image matrix can be divided into 16 (N=16) areas and four (n=4) areas can be randomly selected.

Matching pixels are then identified in each selected area and used in the similarity calculations. In this paper, a vector angular-based distance measure algorithm [25], which decides the similarity using the vector value of each pixel, is used to calculate the similarities between image matrices. The similarities of the n selected areas are calculated, and the overall similarity is calculated as the average of the similarities for the matching pixels in each area.

Figure 7. The areas of image matrices divided according to the values of N

Figure 8. Examples of randomly selected areas (N=16, n=4)

4. EXPERIMENTAL RESULTS

4.1 Experimental Data
Using the visualization analysis tools implemented in this paper, image matrices were generated for the benign and malware binary samples shown in Table 1. We set the sizes of the generated image matrices to 256 × 256 pixels.

Table 1. Benign and malware binary samples
Type      File Name   # of Blocks
Benign    notepad     1072
Benign    winmine     360
Benign    wuauclt     1073
Malware   Boxed.a     1038
Malware   Boxed.b     1043
Malware   Boxed.d     1050
Malware   Klez.f      1474
Malware   Klez.g      1475
Malware   Klez.j      1529
Malware   Evol.a      184
Malware   Evol.b      187
Malware   Evol.c      189

4.2 Results of Image Matrix Extraction
Figure 9 shows image matrices extracted from individual benign binary samples, and Figure 10 shows image matrices extracted from individual malware families. Since the number of opcode instruction sequences used as binary information varies, the number of pixels recorded on image matrices differs. For the benign binaries, pixels may be recorded at the same coordinates of different image matrices.

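The recording procedure of Section 3.3 that produces these image matrices can be sketched as follows. This is a minimal sketch, not the authors' implementation. Several details are assumptions because the text does not specify them: the SimHash features (three-character tokens of the block string), the per-token hash (SHA-256), the way the fingerprint is split into (x,y), the way the djb2 value is split into three 8-bit channels, and the clipping of the nine-pixel neighborhood at the matrix border. The names `simhash`, `add_saturated`, and `record_block` are invented for this sketch.

```python
import hashlib


def djb2(text):
    """Bernstein's djb2 string hash (h = h * 33 + c), kept to 32 bits."""
    h = 5381
    for byte in text.encode("ascii"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def simhash(tokens, bits=16):
    """Charikar-style SimHash; SHA-256 as the per-token hash is an assumption."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("ascii")).digest()
        value = int.from_bytes(digest[:4], "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def add_saturated(c1, c2):
    """Sum two RGB colors channel by channel, capping each channel at 255."""
    return tuple(min(255, a + b) for a, b in zip(c1, c2))


def record_block(matrix, block, n=8):
    """Record one opcode-block string as a 3 x 3 patch on a 2^n x 2^n matrix."""
    size = 2 ** n
    tokens = [block[i:i + 3] for i in range(0, len(block), 3)]
    fingerprint = simhash(tokens, bits=2 * n)
    x, y = fingerprint >> n, fingerprint & (size - 1)       # assumed bit split
    h = djb2(block)
    color = ((h >> 16) & 0xFF, (h >> 8) & 0xFF, h & 0xFF)   # assumed bit fields
    for py in range(max(0, y - 1), min(size, y + 2)):       # clip at the border
        for px in range(max(0, x - 1), min(size, x + 2)):
            matrix[py][px] = add_saturated(matrix[py][px], color)
    return (x, y), color


matrix = [[(0, 0, 0)] * 256 for _ in range(256)]
record_block(matrix, "pusmovcal")
print(add_saturated((255, 0, 0), (0, 176, 50)))
# (255, 176, 50)
```

The last line reproduces the overlap example from Section 3.3: the red channel is already at 255, so only green and blue change.
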
Even so, the similarities between the image matrices of benign binaries are minimal because the RGB color information of the relevant pixels is different. In contrast, many similar pixels are found among the image matrices of binary files classified into the same malware family.

Figure 9. Image matrices of benign binaries

Figure 10 shows the image matrices of the Trojan-DDoS.Win32.Boxed, Email-Worm.Win32.Klez, and Virus.Win32.Evol families, respectively. Pixels with the same RGB color recorded at the same coordinates can be found in the image matrices extracted from variants of the same malware family.

Images for 1038, 1043, and 1050 opcode instruction sequences were respectively recorded on the image matrices of the Boxed family, as shown in Figure 10(a), and 731 images were the same. For the Klez family in Figure 10(b), images for 1474, 1475, and 1529 opcode instruction sequences were respectively recorded, and 984 images were the same. For the Evol family in Figure 10(c), images for 184, 187, and 189 opcode instruction sequences were respectively recorded, and 102 images were the same.

Figure 10. Image matrices of the malware families

In contrast, among several hundred opcode instruction sequences on different image matrices extracted from benign binaries, only 9 cases showed the same RGB colored images at the same coordinates, and a maximum of 18 cases showed the same images on image matrices of benign binaries and malware families. Our results show that image matrices of variants in the same malware family are similar and that clear differences exist between malware binaries and benign binaries.

4.3 Results of Image Matrix Similarity
Table 2 shows the results of similarity calculations between image matrices extracted from binary samples of three malware families. The image matrices from the same malware family have a similarity of at least 0.95 on average, whereas those from different families have a similarity of 0.325 on average. Figure 11 shows the average similarities among fifty malware sample files from ten malware families. The average similarity within the same family is 0.984, whereas the average similarity between different families is 0.309. Therefore, our proposed method can be used to classify malware families effectively. The average time spent calculating the similarity between image matrices was about 2.4 ms.

Table 2. Similarities between image matrices of malware
              Boxed family           Klez family            Evol family
              .a     .b     .d       .f     .g     .j       .a     .b     .c
Boxed  .a     1      0.968  0.936    0.450  0.449  0.443    0.266  0.267  0.267
Boxed  .b     0.968  1      0.932    0.449  0.449  0.442    0.266  0.266  0.267
Boxed  .d     0.936  0.932  1        0.448  0.448  0.442    0.265  0.266  0.266
Klez   .f     0.450  0.449  0.448    1      0.999  0.965    0.263  0.263  0.266
Klez   .g     0.449  0.449  0.448    0.999  1      0.965    0.263  0.263  0.264
Klez   .j     0.443  0.442  0.442    0.965  0.965  1        0.263  0.263  0.264
Evol   .a     0.266  0.266  0.265    0.263  0.263  0.263    1      0.986  0.960
Evol   .b     0.267  0.266  0.266    0.263  0.263  0.263    0.986  1      0.974
Evol   .c     0.267  0.267  0.266    0.264  0.264  0.264    0.960  0.974  1

Figure 11. Average similarities of image matrices from the 10 malware families

5. CONCLUSION AND FUTURE WORK
In this paper, we proposed a novel method to visually analyze malware by generating image matrices from assembly instructions. The proposed method was implemented with visualization analysis tools. The experimental results showed that binary variants in the same malware family were similar when converted into image matrices. The similarities were calculated through selective area matching, and the similarities among variants of the same family were shown to be higher than those between different families. With our proposed method, malware analysts can analyze malware files visually and can distinguish similar malware files for further analysis.

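The selective area matching of Section 3.4 can be sketched as follows. The per-pixel formula combines an angle term and a magnitude term in the spirit of the vector angular-based measure cited as [25]. The exact formula and the definition of "matching pixels" used by the authors are not spelled out in the text, so both are assumptions here, as are the handling of unrecorded (black) pixels and the skipping of empty areas. All function names are invented for this sketch.

```python
import math
import random

BLACK = (0, 0, 0)
MAX_DIST = math.sqrt(3 * 255 ** 2)   # largest distance between two RGB vectors


def pixel_similarity(a, b):
    """Angle-and-magnitude similarity in [0, 1] between two RGB vectors."""
    if a == BLACK or b == BLACK:
        # assumption: an unrecorded (black) pixel matches only another black one
        return 1.0 if a == b else 0.0
    dot = sum(p * q for p, q in zip(a, b))
    cos = dot / (math.hypot(*a) * math.hypot(*b))
    cos = max(-1.0, min(1.0, cos))               # guard against rounding error
    angle_term = 1 - (2 / math.pi) * math.acos(cos)
    magnitude_term = 1 - math.dist(a, b) / MAX_DIST
    return angle_term * magnitude_term


def area_similarity(m1, m2, x0, y0, side):
    """Mean similarity over pixels recorded in at least one of the two areas."""
    sims = [
        pixel_similarity(m1[y][x], m2[y][x])
        for y in range(y0, y0 + side)
        for x in range(x0, x0 + side)
        if m1[y][x] != BLACK or m2[y][x] != BLACK
    ]
    return sum(sims) / len(sims) if sims else None


def selective_area_matching(m1, m2, N=16, n=4, rng=random):
    """Average the similarity of n randomly chosen areas out of N = 4**x."""
    per_side = math.isqrt(N)                     # 2**x areas along each axis
    side = len(m1) // per_side
    areas = [(cx * side, cy * side)
             for cy in range(per_side) for cx in range(per_side)]
    scores = []
    for x0, y0 in rng.sample(areas, n):
        score = area_similarity(m1, m2, x0, y0, side)
        if score is not None:                    # assumption: skip empty areas
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


image = [[BLACK] * 8 for _ in range(8)]
for y in range(0, 8, 2):
    for x in range(0, 8, 2):
        image[y][x] = (10, 20, 30)
# Comparing a matrix with itself gives a score of approximately 1.0.
print(selective_area_matching(image, image, N=16, n=4, rng=random.Random(0)))
```

Because only n of the N areas are compared, the cost per comparison drops with n, at the price of a score that depends on which areas are drawn. Passing a seeded `random.Random` makes a run repeatable.
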
Our future studies include visualizing various other information from binary files, and extending opcode instruction sequences and algorithms for automatic malware classification.

6. ACKNOWLEDGMENTS
This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2011-0029924).

7. REFERENCES
[1] Christodorescu, M. and Jha, S., 2004. Testing malware detectors. ACM SIGSOFT Software Engineering Notes 29, 4, 34-44.
[2] Kang, B., Kim, T., Kwon, H., Choi, Y., and Im, E.G., 2012. Malware classification method via binary content comparison. In Proceedings of the 2012 ACM Research in Applied Computation Symposium, ACM, 316-321.
[3] Moser, A., Kruegel, C., and Kirda, E., 2007. Limits of static analysis for malware detection. In Proceedings of the Twenty-Third Annual IEEE Computer Security Applications Conference (ACSAC), 2007, 421-430.
[4] Cesare, S. and Xiang, Y., 2010. A fast flowgraph based classification system for packed and polymorphic malware on the endhost. In Proceedings of the 24th IEEE International Conference on Advanced Information Networking and Applications (AINA), 2010, 721-728.
[5] Kinable, J. and Kostakis, O., 2011. Malware classification based on call graph clustering. Journal in Computer Virology 7, 4, 233-245.
[6] Shang, S., Zheng, N., Xu, J., Xu, M., and Zhang, H., 2010. Detecting malware variants via function-call graph similarity. In Proceedings of the 5th IEEE International Conference on Malicious and Unwanted Software (MALWARE), 2010, 113-120.
[7] Tabish, S.M., Shafiq, M.Z., and Farooq, M., 2009. Malware detection using statistical analysis of byte-level file content. In Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, ACM, 23-31.
[8] Bilar, D., 2007. Opcodes as predictor for malware. International Journal of Electronic Security and Digital Forensics 1, 2, 156-168.
[9] Han, K.S., Kim, S.-R., and Im, E.G., 2012. Instruction frequency-based malware classification method. INFORMATION - An International Interdisciplinary Journal 15, 7, 2973-2984.
[10] Santos, I., Brezo, F., Nieves, J., Penya, Y.K., Sanz, B., Laorden, C., and Bringas, P.G., 2010. Idea: Opcode-sequence-based malware detection. Engineering Secure Software and Systems, Springer, 35-43.
[11] Sung, A.H., Xu, J., Chavez, P., and Mukkamala, S., 2004. Static analyzer of vicious executables (SAVE). In Proceedings of the 20th Annual IEEE Computer Security Applications Conference, 2004, 326-334.
[12] Walenstein, A., Venable, M., Hayes, M., Thompson, C., and Lakhotia, A., 2007. Exploiting similarity between variants to defeat malware. In Proceedings of the BlackHat DC Conference.
[13] Trinius, P., Holz, T., Gobel, J., and Freiling, F.C., 2009. Visual analysis of malware behavior using treemaps and thread graphs. In Proceedings of the 6th IEEE International Workshop on Visualization for Cyber Security (VizSec), 2009, 33-38.
[14] Saxe, J., Mentis, D., and Greamo, C., 2012. Visualization of shared system call sequence relationships in large malware corpora. In Proceedings of the Ninth International Symposium on Visualization for Cyber Security, ACM, 33-40.
[15] Conti, G., Dean, E., Sinda, M., and Sangster, B., 2008. Visual reverse engineering of binary and data files. Visualization for Computer Security, Springer, 1-17.
[16] Anderson, B., Storlie, C., and Lane, T., 2012. Improving malware classification: bridging the static/dynamic gap. In Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, ACM, 3-14.
[17] Nataraj, L., Karthikeyan, S., Jacob, G., and Manjunath, B., 2011. Malware images: visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, ACM.
[18] Oliva, A. and Torralba, A., 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42, 3, 145-175.
[19] Torralba, A., Murphy, K.P., Freeman, W.T., and Rubin, M.A., 2003. Context-based vision system for place and object recognition. In Proceedings of the Ninth IEEE International Conference on Computer Vision, 273-280.
[20] Nataraj, L., Yegneswaran, V., Porras, P., and Zhang, J., 2011. A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, ACM, 21-30.
[21] Eagle, C., 2008. The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler. No Starch Press.
[22] Yuschuk, O., 2007. Ollydbg. http://www.ollydbg.de/
[23] Charikar, M.S., 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, ACM, 380-388.
[24] D. Bernstein. Usenet posting, comp.lang.c. http://groups.google.com/group/comp.lang.c/msg/6b82e964887d73d9, Dec. 1990.
[25] Androutsos, D., Plataniotis, K., and Venetsanopoulos, A.N., 1999. A novel vector-based approach to color image retrieval using a vector angular-based distance measure. Computer Vision and Image Understanding 75, 1, 46-58.
