
computers & security 81 (2019) 123–147


Survey of machine learning techniques for malware analysis

Daniele Ucci a,∗, Leonardo Aniello b, Roberto Baldoni a


a Research Center of Cyber Intelligence and Information Security, “La Sapienza” University of Rome, Italy
b Cyber Security Research Group, University of Southampton, United Kingdom

Article history: Received 1 February 2018; Revised 30 October 2018; Accepted 9 November 2018; Available online 24 November 2018

Keywords: Portable executable; Malware analysis; Machine learning; Benchmark; Malware analysis economics

Abstract: Coping with malware is getting more and more challenging, given their relentless growth in complexity and volume. One of the most common approaches in the literature is using machine learning techniques to automatically learn models and patterns behind such complexity, and to develop technologies to keep pace with malware evolution. This survey aims at providing an overview of the way machine learning has been used so far in the context of malware analysis in Windows environments, i.e. for the analysis of Portable Executables. We systematize surveyed papers according to their objectives (i.e., the expected output), the information about malware they specifically use (i.e., the features), and the machine learning techniques they employ (i.e., the algorithm used to process the input and produce the output). We also outline a number of issues and challenges, including those concerning the used datasets, and identify the main current topical trends and how to possibly advance them. In particular, we introduce the novel concept of malware analysis economics, regarding the study of existing trade-offs among key metrics, such as analysis accuracy and economic costs.

© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Despite the significant improvement of cyber security mechanisms and their continuous evolution, malware are still among the most effective threats in the cyber space. Malware analysis applies techniques from several different fields, such as program analysis and network analysis, to study malicious samples and develop a deeper understanding of several aspects, including their behaviour and how they evolve over time. Within the unceasing arms race between malware developers and analysts, each advance in security technology is usually promptly followed by a corresponding evasion. Part of the effectiveness of novel defensive measures depends on what properties they leverage. For example, a detection rule based on the MD5 hash of a known malware can be easily eluded by applying standard techniques like obfuscation, or more advanced approaches such as polymorphism or metamorphism; for a comprehensive review of these techniques, refer to Ye et al. (2017). These methods change the binary of the malware, and thus its hash, but leave its behaviour unmodified. On the other side, a detection rule that captures the semantics of a malicious sample is much more difficult to circumvent, because malware developers would have to apply more complex modifications. A major goal of malware analysis is to capture additional properties to be used to improve security measures and make evasion as hard as possible. Machine learning is a natural choice to support such a process of knowledge extraction. Indeed, many works in the literature have taken this direction, with a variety of approaches, objectives and results.


∗ Corresponding author.
E-mail addresses: [email protected] (D. Ucci), [email protected] (L. Aniello), [email protected] (R. Baldoni).
https://doi.org/10.1016/j.cose.2018.11.001
0167-4048/© 2018 Elsevier Ltd. All rights reserved.

This survey aims at reviewing and systematising the existing literature where machine learning is used to support malware analysis of Windows executables, i.e. Portable Executables (PEs). The intended audience includes any security analyst, i.e. a security-minded reverse engineer or software developer, who may benefit from applying machine learning to automate part of malware analysis operations and make the workload more tractable. Although mobile malware represents an ever growing threat, Windows largely remains the preferred target (AV-TEST, 2017) among all the existing platforms. Malware analysis techniques for PEs differ from those for Android apps because there are significant dissimilarities in how the operating system and applications work. As a matter of fact, papers on malware analysis commonly point out what specific platform they target, so we specifically focus on works that consider the analysis of PEs. 64 recent papers have been selected on the basis of their bibliographic significance, reviewed and systematised according to a taxonomy with three fundamental dimensions: (i) the specific objective of the analysis, (ii) what types of features extracted from PEs are considered and (iii) what machine learning algorithms are used. We distinguish three main objectives: malware detection, malware similarity analysis and malware category detection. PE features have been grouped in eight types: byte sequences, APIs/system calls, opcodes, network, file system, CPU registers, PE file characteristics and strings. Machine learning algorithms have been categorized depending on whether the learning is supervised, unsupervised or semi-supervised. The characterisation of surveyed papers according to such a taxonomy allows us to spot research directions that have not been investigated yet, such as the impact of particular combinations of features on analysis accuracy. The analysis of such a large literature singles out three main issues to address. The first concerns overcoming modern anti-analysis techniques such as encryption. The second regards the inaccuracy of malware behaviour modelling due to the choice of which operations of the sample are considered for the analysis. The third is about the obsolescence and unavailability of the datasets used in the evaluations, which affect the significance of obtained results and their reproducibility. In this respect, we propose a few guidelines to prepare suitable benchmarks for malware analysis through machine learning. We also identify a number of topical trends that we consider worth investigating in more detail, such as malware attribution and triage. Furthermore, we introduce the novel concept of malware analysis economics, regarding the existing trade-offs between analysis accuracy, time and cost, which should be taken into account when designing a malware analysis environment.

The novel contributions of this work are:

• the definition of a taxonomy to synthesise the state of the art on machine learning for malware analysis of PEs;
• a detailed comparative analysis of the existing literature on that topic, structured according to the proposed taxonomy, which highlights possible new research directions;
• the determination of the present main issues and challenges on that subject, and the proposal of high-level directions to investigate to overcome them;
• the identification of a number of topical trends on machine learning for malware analysis of PEs, with general guidelines on how to advance them;
• the definition of the novel concept of malware analysis economics.

The rest of the paper is structured as follows. Related work is described in Section 2. Section 3 presents the taxonomy we propose to organise the reviewed malware analysis approaches based on machine learning, which are then characterised according to such a taxonomy in Section 4. From this characterisation, current issues and challenges are pointed out in Section 5. Section 6 highlights topical trends and how to advance them. Malware analysis economics is introduced in Section 7. Finally, conclusions and future works are presented in Section 8.

2. Related work

Other academic works have already addressed the problem of surveying contributions on the usage of machine learning techniques for malware analysis. The survey by Shabtai et al. (2009) is the first one on this topic; it specifically deals with how classifiers are used on static features to detect malware. As with most of the other surveys mentioned in this section, the main difference with our work is that our scope is wider, as we target other objectives besides malware detection, such as similarity analysis and category detection. Furthermore, a novel contribution we provide is the idea of malware analysis economics, which is not mentioned by any related work. In Sahu et al. (2014), the authors provide a comparative study of papers using pattern matching to detect malware, reporting their advantages, disadvantages and problems. Souri and Hosseini (2018) propose a taxonomy of malware detection approaches based on machine learning. In addition to considering detection only, their work differs from ours because they do not investigate what features are taken into account. LeDoux and Lakhotia (2015) describe how machine learning is used for malware analysis, whose end goal is defined there as "automatically detect malware as soon as possible, remove it, and repair any damage it has done".

Bazrafshan et al. (2013) focus on malware detection and identify three main methods for detecting malicious software, i.e. based on signatures, behaviours and heuristics, the latter also using machine learning techniques. They also identify what classes of features are used by the reviewed heuristics for malware detection, i.e. API calls, control flow graphs, n-grams, opcodes and hybrid features. In addition to going beyond malware detection, we propose a larger number of feature types, which reflects the wider breadth of our research.

Basu (2016) examines different works relying on data mining and machine learning techniques for the detection of malware. They identify five types of features: API call graph, byte sequence, PE header and sections, opcode sequence frequency and kernel, i.e. system calls. In our survey we establish more feature types, such as strings, file system and CPU registers. They also compare surveyed papers by used features, used dataset and mining method.

Fig. 1 – Taxonomy of machine learning techniques for malware analysis.

Ye et al. (2017) examine different aspects of malware detection processes, focusing on feature extraction/selection and classification/clustering algorithms. Also in this case, our survey looks at a larger range of papers, by also including many works on similarity analysis and category detection. They also highlight a number of issues, mainly dealing with machine learning aspects (i.e. incremental learning, active learning and adversarial learning). We instead look at current issues and limitations from a distinct angle, indeed coming to a different set of identified problems that complement theirs. Furthermore, they outline several trends in malware development, while we rather report on trends about machine learning for malware analysis, again complementing their contributions.

Barriga and Yoo (2017) briefly survey the literature on malware detection and malware evasion techniques, to discuss how machine learning can be used by malware to bypass current detection mechanisms. Our survey focuses instead on how machine learning can support malware analysis, even when evasion techniques are used. Gardiner and Nagaraja (2016) concentrate their survey on the detection of command and control centres through machine learning.

3. Taxonomy of machine learning techniques for malware analysis

This section introduces the taxonomy of how machine learning is used for malware analysis in the reviewed papers. We identify three major dimensions along which surveyed works can be conveniently organised. The first one characterises the final objective of the analysis, e.g. malware detection. The second dimension describes the features that the analysis is based on, in terms of how they are extracted, e.g. through dynamic analysis, and what features are considered, e.g. CPU registers. Finally, the third dimension defines what type of machine learning algorithm is used for the analysis, e.g. supervised learning. Fig. 1 shows a graphical representation of the taxonomy. The rest of this section is structured according to the taxonomy: Section 3.1 describes the objective dimension in detail, features are pointed out in Section 3.2 and machine learning algorithms are reported in Section 3.3.

3.1. Malware analysis objectives

Malware analysis, in general, demands strong detection capabilities to find matches with the knowledge developed by investigating past samples. Anyway, the final goal of searching for those matches differs. For example, a malware analyst may be specifically interested in determining whether new suspicious samples are malicious or not, while another may rather be inspecting new malware to find what family they likely belong to. This section details the analysis goals of the surveyed papers, organized in three main objectives: malware detection (Section 3.1.1), malware similarity analysis (Section 3.1.2) and malware category detection (Section 3.1.3).

3.1.1. Malware detection

The most common objective in the context of malware analysis is detecting whether a given sample is malicious. This objective is also the most important, because knowing in advance that a sample is dangerous allows blocking it before it becomes harmful. Indeed, the majority of reviewed works have this as their main goal (Ahmadi et al., 2015; Ahmed et al., 2009; Anderson et al., 2011; 2012; Bai et al., 2014; Chau et al., 2010; Chen et al., 2015; Elhadi et al., 2015; Eskandari et al., 2013; Feng et al., 2015; Firdausi et al., 2010; Ghiasi et al., 2015; Kolter and Maloof, 2006; Kruczkowski and Szynkiewicz, 2014; Kwon et al., 2015; Mao et al., 2015; Raff and Nicholas, 2017; Santos et al., 2013b; 2011; Saxe and Berlin, 2015; Schultz et al., 2001; Tamersoy et al., 2014; Uppal et al., 2014; Vadrevu et al., 2013; Wüchner et al., 2015; Yonts, 2012). Depending on what machine learning technique is used, the generated output can be provided with a confidence value that analysts can use to understand whether a sample needs further inspection.

3.1.2. Malware similarity analysis

Another relevant objective is spotting similarities among malware, for example to understand how novel samples differ from previous, known ones. We find four slightly different versions of this objective: variants detection, families detection, similarities detection and differences detection.

Variants detection. Developing variants is one of the most effective and cheapest strategies for an attacker to evade detection mechanisms, while reusing as much as possible already available code and resources. Recognizing that a sample is actually a variant of a known malware prevents such a strategy from succeeding, and paves the way to understanding how malware evolve over time through the development of new variants. This objective has been studied in depth in the literature, and several reviewed papers target the detection of variants. Given a malicious sample m, variants detection consists in selecting from the available knowledge base the samples that are variants of m (Gharacheh et al., 2015; Ghiasi et al., 2015; Khodamoradi et al., 2015; Liang et al., 2016; Upchurch and Zhou, 2015; Vadrevu and Perdisci, 2016). Considering the huge number of malicious samples received daily by major security firms, recognising variants of already known malware is crucial to reduce the workload for human analysts.

Families detection. Given a malicious sample m, families detection consists in selecting from the available knowledge base the families that m likely belongs to (Lee and Mody, 2006; Huang et al., 2009; Park et al., 2010; Ye et al., 2010; Dahl et al., 2013; Hu et al., 2013; Islam et al., 2013; Kong and Yan, 2013; Nari and Ghorbani, 2013; Ahmadi et al., 2015; Kawaguchi and Omote, 2015; Lin et al., 2015; Mohaisen et al., 2015; Pai et al., 2015; Raff and Nicholas, 2017). In this way, it is possible to associate unknown samples with already known families and, in consequence, provide added-value information for further analyses.

Similarities detection. Analysts can be interested in identifying the specific similarities and differences of the binaries to analyse with respect to those already analysed. Similarities detection consists in discovering what parts and aspects of a sample are similar to something that has already been examined in the past. It enables focusing on what is really new, and hence discarding the rest as it does not deserve further investigation (Bailey et al., 2007; Bayer et al., 2009; Egele et al., 2014; Palahan et al., 2013; Rieck et al., 2011).

Differences detection. As a complement, identifying what is different from everything else already observed in the past is also worthwhile. As a matter of fact, differences can guide towards discovering novel aspects that should be analysed in more depth (Bayer et al., 2009; Lindorfer et al., 2011; Palahan et al., 2013; Polino et al., 2015; Rieck et al., 2011; Santos et al., 2013a).

3.1.3. Malware category detection

Malware can be categorized according to their prominent behaviours and objectives. They can be interested in spying on users' activities and stealing their sensitive information (i.e., spyware), encrypting documents and asking for a ransom (i.e., ransomware), or gaining remote control of an infected machine (i.e., remote access toolkits). Using these categories is a coarse-grained yet significant way of describing malicious samples (Attaluri et al., 2009; Chen et al., 2012; Comar et al., 2013; Kwon et al., 2015; Sexton et al., 2015; Wong and Stamp, 2006). Although cyber security firms have not yet agreed upon a standardized taxonomy of malware categories, effectively recognising the category of a sample can add valuable information for the analysis.

3.2. Malware analysis features

This section deals with the features of samples that are considered for the analysis. How features are extracted from executables is reported in Section 3.2.1, while Section 3.2.2 details which specific features are taken into account.

3.2.1. Feature extraction

The information extraction process is performed through either static or dynamic analysis, or a combination of both, while examination and correlation are carried out by using machine learning techniques. Approaches based on static analysis look at the content of samples without requiring their execution, while dynamic analysis works by running samples to examine their behaviour. Several techniques can be used for dynamic malware analysis. Debuggers are used for instruction-level analysis. Simulators model and show a behaviour similar to the environment expected by the malware, while emulators replicate the behaviour of a system with higher accuracy but require more resources. Sandboxes are virtualised operating systems providing an isolated and reliable environment where malware can be detonated. Refer to Ye et al. (2017) for a more detailed description of these techniques. Execution traces are commonly used to extract features when dynamic analysis is employed. Reviewed articles generate execution traces by using either sandboxes (Anderson et al., 2011; Bayer et al., 2009; Firdausi et al., 2010; Graziano et al., 2015; Kawaguchi and Omote, 2015; Lee and Mody, 2006; Lin et al., 2015; Lindorfer et al., 2011; Mao et al., 2015; Palahan et al., 2013; Park and Jun, 2009; Rieck et al., 2011) or emulators (Asquith, 2015; Liang et al., 2016). Program analysis tools and techniques can also be useful in the feature extraction process by providing, for example, disassembly code and control- and data-flow graphs. An accurate disassembly is important for obtaining correct byte sequence and opcode features (Section 3.2.2), while control- and data-flow graphs can be employed in the extraction of APIs and system calls (Section 3.2.2). For an extensive dissertation on dynamic analyses, refer to Egele et al. (2012).

Among reviewed works, the majority relies on dynamic analyses (Anderson et al., 2011; Bailey et al., 2007; Bayer et al., 2009; Comar et al., 2013; Dahl et al., 2013; Elhadi et al., 2015; Firdausi et al., 2010; Ghiasi et al., 2015; Kawaguchi and Omote, 2015; Kruczkowski and Szynkiewicz, 2014; Lee and Mody, 2006; Liang et al., 2016; Lin et al., 2015; Lindorfer et al., 2011; Mohaisen et al., 2015; Nari and Ghorbani, 2013; Palahan et al., 2013; Park et al., 2010; Rieck et al., 2011; Uppal et al., 2014; Wüchner et al., 2015), while the others use, in equal proportions, either static analyses alone

(Ahmadi et al., 2015; Attaluri et al., 2009; Bai et al., 2014; Caliskan-Islam et al., 2015; Chen et al., 2015; 2012; Feng et al., 2015; Gharacheh et al., 2015; Hu et al., 2013; Khodamoradi et al., 2015; Kolter and Maloof, 2006; Kong and Yan, 2013; Pai et al., 2015; Santos et al., 2013a; 2011; Schultz et al., 2001; Sexton et al., 2015; Siddiqui et al., 2009; Srakaew et al., 2015; Tamersoy et al., 2014; Upchurch and Zhou, 2015; Vadrevu et al., 2013; Wong and Stamp, 2006; Yonts, 2012) or a combination of static and dynamic techniques (Anderson et al., 2012; Egele et al., 2014; Eskandari et al., 2013; Graziano et al., 2015; Islam et al., 2013; Jang et al., 2011; Polino et al., 2015; Santos et al., 2013b; Vadrevu and Perdisci, 2016). Depending on the specific features, extraction processes can be performed by applying either static, dynamic, or hybrid analysis.

3.2.2. Portable executable features

This section provides an overview of what features are used by reviewed papers to achieve the objectives outlined in Section 3.1. In many cases, surveyed works only refer to macro-classes without mentioning the specific features they employed. As an example, when n-grams are used, only a minority of works mention the size of n.

Byte sequences. A binary can be characterised by computing features on its byte-level content. Analysing the specific sequences of bytes in a PE is a widely employed static technique. A few works use chunks of bytes of specific sizes (Raff and Nicholas, 2017; Schultz et al., 2001; Srakaew et al., 2015), while many others rely on n-grams (Ahmadi et al., 2015; Anderson et al., 2011; 2012; Chen et al., 2015; Dahl et al., 2013; Feng et al., 2015; Jang et al., 2011; Kolter and Maloof, 2006; Lin et al., 2015; Rieck et al., 2011; Sexton et al., 2015; Srakaew et al., 2015; Upchurch and Zhou, 2015; Uppal et al., 2014; Wüchner et al., 2015). An n-gram is a sequence of n bytes, and features correspond to the different combinations of these n bytes: each feature represents how many times a specific combination of n bytes occurs in the binary. The majority of works that specified the size of the used n-grams rely on sequences no longer than 3 (i.e. trigrams) (Ahmadi et al., 2015; Anderson et al., 2011; 2012; Dahl et al., 2013; Islam et al., 2013; Lin et al., 2015; Sexton et al., 2015; Srakaew et al., 2015). Indeed, the number of features to consider grows exponentially with n.
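A minimal sketch of this construction (ours, not taken from any surveyed paper) counts byte trigrams with a plain Python dictionary; "sample.exe" is a placeholder path.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every overlapping n-byte sequence in a binary."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Illustrative usage; "sample.exe" is a placeholder for a real PE.
with open("sample.exe", "rb") as f:
    counts = byte_ngrams(f.read(), n=3)

# With n = 3 there are 256**3 (about 16.7M) possible features, which is
# why surveyed works rarely go beyond trigrams; a fixed-size vector can
# be obtained from these counts with the hashing trick.
print(counts.most_common(5))
```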
Opcodes. Opcodes identify the machine-level operations executed by a PE, and can be extracted through static analysis by examining the assembly code (Ahmadi et al., 2015; Anderson et al., 2011; 2012; Attaluri et al., 2009; Gharacheh et al., 2015; Hu et al., 2013; Khodamoradi et al., 2015; Kong and Yan, 2013; Pai et al., 2015; Santos et al., 2013a; 2013b; Sexton et al., 2015; Srakaew et al., 2015; Wong and Stamp, 2006; Ye et al., 2010). Opcode frequency is one of the most commonly used features. It measures the number of times each specific opcode appears within the assembly or is executed by a PE (Khodamoradi et al., 2015; Ye et al., 2010). Others (Anderson et al., 2012; Khodamoradi et al., 2015) count opcode occurrences by aggregating them by operation type, e.g., mathematical instructions and memory access instructions. Similarly to n-grams, sequences of opcodes are also used as features (Gharacheh et al., 2015; Khodamoradi et al., 2015; Srakaew et al., 2015; Ye et al., 2010).
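As an illustration of how opcode frequencies can be computed, the sketch below counts mnemonics with the Capstone disassembler. It assumes the raw bytes of the code section have already been obtained (e.g., with a PE parser); the hand-written x86 stub merely stands in for real input.

```python
from collections import Counter
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

def opcode_histogram(code: bytes, base_addr: int = 0x1000) -> Counter:
    """Linearly disassemble a 32-bit x86 code section and count mnemonics."""
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    return Counter(insn.mnemonic for insn in md.disasm(code, base_addr))

# push ebp; mov ebp, esp; xor eax, eax; pop ebp; ret
stub = b"\x55\x89\xe5\x31\xc0\x5d\xc3"
print(opcode_histogram(stub))
# Counter({'push': 1, 'mov': 1, 'xor': 1, 'pop': 1, 'ret': 1})
```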
API and system calls. Similarly to opcodes, APIs and system calls enable the analysis of samples' behaviour, but at a higher level. They can be extracted either statically or dynamically, by analysing the disassembly code (to get the list of all calls that can potentially be executed) or the execution traces (for the list of calls actually invoked). While APIs allow characterising what actions are executed by a sample (Ahmadi et al., 2015; Ahmed et al., 2009; Bai et al., 2014; Egele et al., 2014; Islam et al., 2013; Kawaguchi and Omote, 2015; Kong and Yan, 2013; Liang et al., 2016), looking at system call invocations provides a view on the interaction of the PE with the operating system (Anderson et al., 2012; Asquith, 2015; Bayer et al., 2009; Dahl et al., 2013; Egele et al., 2014; Elhadi et al., 2015; Lee and Mody, 2006; Mao et al., 2015; Palahan et al., 2013; Park et al., 2010; Rieck et al., 2011; Santos et al., 2013b; Uppal et al., 2014). Data extracted by observing APIs and system calls can be really large, and many works carry out additional processing to reduce the feature space by using convenient data structures. One of the most popular data structures to represent PE behaviour and extract program structure is the control flow graph. This data structure allows compilers to produce an optimized version of the program itself and to model control flow relationships (Allen, 1970). Several works employ control flow graphs and their extensions for sample analysis, in combination with other feature classes (Anderson et al., 2012; Eskandari et al., 2013; Graziano et al., 2015; Polino et al., 2015; Wüchner et al., 2015).
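The sketch below turns a dynamic-analysis trace into per-API frequencies and ordered API bigrams. The report layout is hypothetical (a JSON file named report.json holding a list of {"api": ...} records in invocation order, loosely inspired by the reports of sandboxes such as Cuckoo); it is not the format used by any specific surveyed work.

```python
import json
from collections import Counter

def api_features(report_path: str):
    """Derive frequency and bigram features from an API call trace."""
    with open(report_path) as f:
        calls = [record["api"] for record in json.load(f)]
    freq = Counter(calls)                     # how often each API is invoked
    bigrams = Counter(zip(calls, calls[1:]))  # ordered pairs of consecutive calls
    return freq, bigrams

freq, bigrams = api_features("report.json")  # placeholder report
print(freq.most_common(3))
print(bigrams.most_common(3))
```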
Network activity. A great deal of key information can be obtained by observing how the PE interacts with the network. Contacted addresses and generated traffic can unveil valuable aspects, e.g. regarding the communication with a command and control centre. Relevant features include statistics on used protocols, TCP/UDP ports, HTTP requests and DNS-level interactions. Many surveyed works require dynamic analysis to extract this kind of information (Bailey et al., 2007; Bayer et al., 2009; Graziano et al., 2015; Kwon et al., 2015; Lee and Mody, 2006; Liang et al., 2016; Lindorfer et al., 2011; Mohaisen et al., 2015; Nari and Ghorbani, 2013). Other papers extract network-related inputs by monitoring the network and analysing incoming and outgoing traffic (Comar et al., 2013; Kruczkowski and Szynkiewicz, 2014; Vadrevu and Perdisci, 2016). A complementary approach consists in analysing the download patterns of network users in a monitored network (Vadrevu et al., 2013). It does not require sample execution and focuses on network features related to the download of a sample, such as the website from which the file has been downloaded.

File system. What file operations are executed by samples is fundamental to grasp evidence about the interaction with the environment and possibly detect attempts to gain persistence. Features of interest mainly concern how many files are read or modified, what types of files and in what directories, and which files appear in not-infected/infected machines (Bailey et al., 2007; Chau et al., 2010; Graziano et al., 2015; Kong and Yan, 2013; Lee and Mody, 2006; Lin et al., 2015; Mao et al., 2015; Mohaisen et al., 2015). Sandboxes and memory analysis toolkits include modules for monitoring interactions with the file system, usually modelled by counting the number of files created/deleted/modified by the PE. In Mohaisen et al. (2015) the size of these files is considered as well, while Lin et al. leverage the number of created hidden files (Lin et al., 2015).

A particularly relevant type of file system features are those extracted from the Windows Registry. The registry is one of the main sources of information for a PE about the environment, and also represents a fundamental tool to hook into the operating system, for example to gain persistence. Discovering what keys are queried, created, deleted and modified can shed light on many significant characteristics of a sample (Lee and Mody, 2006; Lin et al., 2015; Mao et al., 2015; Mohaisen et al., 2015). Usually, works relying on file system inputs also monitor the Windows Registry.
CPU registers. The way CPU registers are used can also be a valuable indication, including whether any hidden register is used, and what values are stored in the registers, especially in the FLAGS register (Ahmadi et al., 2015; Egele et al., 2014; Ghiasi et al., 2015; Kong and Yan, 2013).
PE file characteristics. A static analysis of a PE can provide a large set of valuable information, such as sections, imports, symbols and used compilers (Asquith, 2015; Bai et al., 2014; Kirat et al., 2013; Lee and Mody, 2006; Saxe and Berlin, 2015; Yonts, 2012).
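As a minimal illustration, the following sketch pulls a few such characteristics with the open-source pefile library; the particular selection of fields is an assumption made for the example, and "sample.exe" is a placeholder path.

```python
import pefile

def pe_header_features(path: str) -> dict:
    """Extract a handful of static features from PE headers."""
    pe = pefile.PE(path)
    feats = {
        "num_sections": pe.FILE_HEADER.NumberOfSections,
        "timestamp": pe.FILE_HEADER.TimeDateStamp,
        "entry_point": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        "size_of_code": pe.OPTIONAL_HEADER.SizeOfCode,
    }
    # Per-section entropy is a common proxy for packing/encryption.
    for section in pe.sections:
        name = section.Name.rstrip(b"\x00").decode(errors="replace")
        feats[f"entropy_{name}"] = section.get_entropy()
    # Imported DLLs hint at the functionality the sample can reach.
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode(errors="replace").lower()
        feats[f"imports_{dll}"] = len(entry.imports)
    return feats

print(pe_header_features("sample.exe"))  # placeholder path
```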
Strings. A PE can be statically inspected to explicitly look for the strings it contains, such as code fragments, author signatures, file names and system resource information (Ahmadi et al., 2015; Islam et al., 2013; Saxe and Berlin, 2015; Schultz et al., 2001).
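A minimal sketch of this extraction, mimicking the Unix strings utility with a regular expression, is shown below; the minimum length of 5 and the placeholder path are arbitrary choices for the example.

```python
import re

def printable_strings(path: str, min_len: int = 5):
    """Return runs of printable ASCII bytes, like the Unix `strings` tool."""
    with open(path, "rb") as f:
        data = f.read()
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

# URLs, registry paths, mutex names and compiler banners often surface
# here and can be turned into bag-of-strings features.
for s in printable_strings("sample.exe")[:10]:  # placeholder path
    print(s.decode("ascii"))
```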
3.3. Malware analysis algorithms

This section reports what machine learning algorithms are used in the surveyed works, organising them on the basis of whether the learning is supervised (Section 3.3.1), unsupervised (Section 3.3.2) or semi-supervised (Section 3.3.3).

3.3.1. Supervised learning

Supervised learning is the task of gaining knowledge by providing statistical models with correct instance examples during a preliminary phase called training. The supervised algorithms used by reviewed papers are rule-based classifier (Ahmed et al., 2009; Feng et al., 2015; Ghiasi et al., 2015; Liang et al., 2016; Lindorfer et al., 2011; Schultz et al., 2001; Sexton et al., 2015; Tian et al., 2008), Bayes classifier (Kawaguchi and Omote, 2015; Santos et al., 2013a; 2013b; Uppal et al., 2014; Wüchner et al., 2015), Naïve Bayes (Firdausi et al., 2010; Kawaguchi and Omote, 2015; Kolter and Maloof, 2006; Schultz et al., 2001; Sexton et al., 2015; Uppal et al., 2014; Wüchner et al., 2015), Bayesian Network (Eskandari et al., 2013; Santos et al., 2013a; 2013b), Support Vector Machine (SVM) (Ahmadi et al., 2015; Ahmed et al., 2009; Anderson et al., 2011; Chen et al., 2012; Comar et al., 2013; Feng et al., 2015; Firdausi et al., 2010; Islam et al., 2013; Kawaguchi and Omote, 2015; Kolter and Maloof, 2006; Kong and Yan, 2013; Kruczkowski and Szynkiewicz, 2014; Lin et al., 2015; Mohaisen et al., 2015; Santos et al., 2013a; 2013b; Sexton et al., 2015; Uppal et al., 2014; Wüchner et al., 2015), Multiple Kernel Learning (Anderson et al., 2012), Prototype-based Classification (Rieck et al., 2011), Decision Tree (Ahmed et al., 2009; Bai et al., 2014; Firdausi et al., 2010; Islam et al., 2013; Kawaguchi and Omote, 2015; Khodamoradi et al., 2015; Kolter and Maloof, 2006; Mohaisen et al., 2015; Nari and Ghorbani, 2013; Santos et al., 2013a; 2013b; Srakaew et al., 2015; Uppal et al., 2014), Random Forest (Ahmadi et al., 2015; Comar et al., 2013; Islam et al., 2013; Kawaguchi and Omote, 2015; Khodamoradi et al., 2015; Kwon et al., 2015; Mao et al., 2015; Siddiqui et al., 2009; Uppal et al., 2014; Wüchner et al., 2015), Gradient Boosting Decision Tree (Chen et al., 2012; Sexton et al., 2015), Logistic Model Tree (Dahl et al., 2013; Graziano et al., 2015; Palahan et al., 2013; Sexton et al., 2015), k-Nearest Neighbors (k-NN) (Ahmed et al., 2009; Firdausi et al., 2010; Islam et al., 2013; Kawaguchi and Omote, 2015; Kong and Yan, 2013; Lee and Mody, 2006; Mohaisen et al., 2015; Raff and Nicholas, 2017; Santos et al., 2013a), Artificial Neural Network (Dahl et al., 2013; Saxe and Berlin, 2015) and Multilayer Perceptron Neural Network (Firdausi et al., 2010).
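As a generic illustration of this train-then-classify workflow (using scikit-learn with random stand-in data, not the pipeline of any surveyed paper), consider:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-ins for real inputs: one row of numeric features per PE
# (e.g. n-gram counts) and one label per PE (0 = benign, 1 = malware).
rng = np.random.default_rng(0)
X = rng.random((1000, 50))
y = rng.integers(0, 2, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# predict_proba yields the confidence value mentioned in Section 3.1.1.
print(clf.predict_proba(X_te[:3]))
print(classification_report(y_te, clf.predict(X_te)))
```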
3.3.2. Unsupervised learning

Unsupervised approaches do not rely on any training phase and learn directly from unlabeled data. Reviewed papers use these unsupervised learning algorithms: Clustering with Locality Sensitive Hashing (Bayer et al., 2009; Tamersoy et al., 2014; Upchurch and Zhou, 2015), Clustering with Distance and Similarity Metrics (using either Euclidean (Mohaisen et al., 2015; Rieck et al., 2011) or Hamming distances (Mohaisen et al., 2015), or cosine (Mohaisen et al., 2015) or Jaccard similarities (Mohaisen et al., 2015; Polino et al., 2015)), Expectation Maximization (Pai et al., 2015), k-Means Clustering (Huang et al., 2009; Pai et al., 2015), k-Medoids (Ye et al., 2010), Density-based Spatial Clustering of Applications with Noise (Vadrevu and Perdisci, 2016), Hierarchical Clustering (Jang et al., 2011; Mohaisen et al., 2015), Prototype-based Clustering (Rieck et al., 2011) and Self-Organizing Maps (Chen et al., 2012).
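For illustration, the sketch below applies hierarchical clustering over Jaccard distances, one of the distance/similarity combinations listed above; the random boolean behaviour profiles (e.g. "API i was invoked") stand in for real feature vectors, and the 0.7 cut is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
profiles = rng.random((200, 300)) < 0.1   # stand-in boolean profiles

dist = pdist(profiles, metric="jaccard")  # condensed distance matrix
tree = linkage(dist, method="average")    # agglomerative clustering
labels = fcluster(tree, t=0.7, criterion="distance")
print(len(set(labels)), "clusters")
```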
3.3.3. Semi-supervised learning

Semi-supervised learning combines both labeled and unlabeled data for feeding statistical models to acquire knowledge. Learning with Local and Global Consistency is used in Santos et al. (2011), while Belief Propagation is used in Chau et al. (2010), Tamersoy et al. (2014) and Chen et al. (2015).
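scikit-learn's LabelSpreading implements Zhou et al.'s local-and-global-consistency formulation, so a sketch in the spirit of such approaches (with random stand-in data, not the actual setup of Santos et al. (2011)) looks as follows:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Stand-in setting: 1,000 feature vectors of which only 50 carry a
# trusted label; -1 marks the unlabeled majority.
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = np.full(1000, -1)
y[:25], y[25:50] = 0, 1          # a few known benign / malicious samples

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
print(model.transduction_[:10])  # labels propagated to unlabeled samples
```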
4. Characterization of surveyed papers

In this section we characterize each reviewed paper on the basis of its analysis objective, the machine learning algorithms used and the features used. Several details are also reported on the dataset used for the evaluation, including whether it is publicly available (Public column), where samples have been collected from (Source column) and whether the specific set of samples considered for the experiment is available (Available column). Indeed, many works declare that they do not use all the executables in the dataset, but they do not specify which samples they choose, which prevents reproducing their results.
The Label column states how samples have been labelled. Finally, the Benign, Malicious and Total columns report the benign executables count, the malware count and their sum, respectively.

Malware detection. Table 1 lists all the reviewed works having malware detection as objective. The most used features are byte sequences and API/system call invocations, derived by executing the samples. Most of the works use more than one algorithm to find out the one guaranteeing the most accurate results.

Malware similarity analysis. A table is provided for each version of this objective (Section 3.1.2). Tables 2 and 3 describe the works dealing with variants detection and families detection, respectively. For both, APIs and system calls are largely used, as well as malware interactions with the environment, i.e. memory, file system and CPU registers. Tables 4 and 5 report the papers on similarities and differences detection, respectively. All the analysed papers but Santos et al. (2013a) rely on APIs and system calls collection. Works on differences detection, in general, do not take into account the interactions with the hosting system, while those on similarities detection do.

Malware category detection. These articles focus on the identification of specific threats and, thus, on particular features such as byte sequences, opcodes, function lengths and network activity. Table 6 reports the works whose objective is the detection of the malware category.

By reasoning on what algorithms and features have been used, and which have not, for specific objectives, the provided characterisation allows easy identification of gaps in the literature and, thus, of possible research directions to investigate. For instance, all works on differences detection (see Table 5) but Santos et al. (2013a) rely on dynamically extracted APIs and system calls for building their machine learning models. Novel approaches can be explored by taking into account other features that capture malware interactions with the environment (e.g., memory, file system, CPU registers and Windows Registry).
5. Issues and challenges

Based on the characterization detailed in Section 4, this section identifies the main issues and challenges of the surveyed papers. Specifically, the main problems regard the usage of anti-analysis techniques by malware (Section 5.1), what operation set to consider (Section 5.2) and the used datasets (Section 5.3).

5.1. Anti-analysis techniques

Malware developers want to prevent their samples from being analysed, so they devise and refine several anti-analysis techniques that are effective in hindering the reverse engineering of executables. Indeed, many surveyed works claim that the solution they propose does not work, or loses accuracy, when samples using such techniques are considered (Section 4).

Static analysis (Section 3.2.1) is commonly prevented by rendering sample binary and resources unreadable through obfuscation, packing or encryption. Anyway, at runtime, code and any other concealed data have to be either deobfuscated, unpacked or decrypted to enable the correct execution of the payload. This implies that such anti-analysis techniques can be overcome by using dynamic analysis (Section 3.2.1) to make the sample unveil the hidden information and load it in memory, from where it can then be extracted by creating a dump. Refer to Ye et al. (2017) for a detailed disquisition on how obfuscation, packing and encryption are used by malware developers.
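A crude but widely used way to flag likely packed or encrypted samples before deciding how to analyse them is byte entropy; the sketch below is a generic heuristic with an assumed cutoff around 7 bits/byte, not a technique prescribed by the surveyed papers.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

with open("sample.exe", "rb") as f:  # placeholder path
    h = shannon_entropy(f.read())

# Compressed or encrypted payloads push entropy towards 8 bits/byte.
print(f"{h:.2f} bits/byte", "-> possibly packed" if h > 7.0 else "")
```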
More advanced anti-analysis techniques exist to keep malware internals secret even when dynamic analysis is used. One approach, commonly referred to as environmental awareness, consists in the malware trying to detect whether it is being executed in a controlled setting where an analyst is trying to dissect it, for example by using a virtual machine or by running the sample in debug mode. If any cue is found of possibly being under analysis, then the malware does not execute its malicious payload. Miramirkhani et al. (2017) show that a malware can easily understand if it is running in an artificial environment. Other approaches rely on timing-based evasion, i.e. they only show their malicious behaviour at predetermined dates and times. Other malware instead require or wait for some user interaction to start their intended activity, in order to make any kind of automatic analysis infeasible.

Identifying and overcoming these anti-analysis techniques is an important direction to investigate in order to improve the effectiveness of malware analysis. Recent academic and non-academic literature are aligned on this aspect. Karpin and Dorfman (2017) highlight the need to address very current problems such as discovering where malware configuration files are stored and whether standard or custom obfuscation/packing/encryption algorithms are employed. Deobfuscation (Blazytko et al., 2017; Kotov and Wojnowicz, 2018) and other operations aimed at supporting binary reverse engineering, such as function similarity identification (Liao et al., 2018), are still very active research directions. Symbolic execution techniques (Baldoni et al., 2018) are promising means to understand what execution paths trigger the launch of the malicious payload.

5.2. Operation set

Opcodes, instructions, APIs and system calls (hereinafter, we refer to them in general as operations) are the most used and powerful features employed for malware analysis (Section 4), as they allow to directly and accurately model sample behaviour. Normally, to reduce complexity and required computational power, only a subset of all the available operations is considered. This choice of ignoring part of the operations at disposal can reduce the accuracy of the malware behaviour model, which in turn reflects on the reliability of analysis outcomes. This issue has been raised explicitly in some surveyed papers, including Anderson et al. (2012), Gharacheh et al. (2015) and Khodamoradi et al. (2015), while others are affected although they do not mention it, such as Ghiasi et al. (2015), Liang et al. (2016) and Huang et al. (2009).

On one hand, this challenge can be addressed by either improving or using different machine learning techniques to achieve a more effective feature selection. On the other hand, program analysis advances can be leveraged to enhance the accuracy of disassemblers and decompilers; indeed, these tools are known to be error-prone (Guilfanov, 2008; Rosseau
Table 1 – Characterization of surveyed papers having malware detection as objective. Dataset figures are reported as benign / malicious / total; "?" marks counts not reported.

• Schultz et al. (2001). Algorithms: Rule-based classifier, Naïve Bayes. Features: strings and byte sequences. Limitations: the proposed solutions are not effective against encrypted executables, and the dataset used in the evaluations is small. Dataset: 1,001 / 3,265 / 4,266; source: FTP sites; labeling: automated.
• Kolter and Maloof (2006). Algorithms: Decision Tree, Naïve Bayes, SVM. Features: byte sequences. Limitations: payload classification fails in the presence of binary obfuscation, and the dataset used in the evaluations is very small. Dataset: 1,971 / 1,651 / 3,622; source: Internet, VX Heavens, and MITRE (not public; sample set not available); labeling: automated.
• Ahmed et al. (2009). Algorithms: Decision Tree, Naïve Bayes, SVM. Features: APIs/system calls. Limitations: the dataset used in the experimental evaluations is very small. Dataset: 100 / 416 / 516; source: legitimate apps and VX Heavens (sample set not available).
• Chau et al. (2010). Algorithms: Belief Propagation. Features: file system. Limitations: rare and new files cannot be accurately classified as benign or malicious. Dataset: ? / ? / 903 · 10^6; source: Symantec's Norton Community Watch (sample set not available).
• Firdausi et al. (2010). Algorithms: Decision Tree, Naïve Bayes, SVM, k-NN, Multilayer Perceptron Neural Network. Features: APIs/system calls, file system, and Windows Registry. Limitations: the dataset used in the experimental evaluations is very small. Dataset: 250 / 220 / 470; source: Windows XP SP2 (sample set not available).
• Anderson et al. (2011). Algorithms: SVM. Features: byte sequences and APIs/system calls. Limitations: the dataset used in the evaluations is small. Dataset: 615 / 1,615 / 2,230; source not reported (sample set not available).
• Santos et al. (2011). Algorithms: Learning with Local and Global Consistency. Features: byte sequences. Limitations: the proposed approach is not effective against packed malware and requires manual labeling of a portion of the dataset, which is small. Dataset: 1,000 / 1,000 / 2,000; source: own machines and VX Heavens (sample set not available).
• Anderson et al. (2012). Algorithms: Multiple Kernel Learning. Features: byte sequences, opcodes, and APIs/system calls. Limitations: instruction categorization is not optimal. Dataset: 776 / 21,716 / 22,492; source: Offensive Computing (not public; sample set not available).
• Yonts (2012). Algorithms: Rule-based classifier. Features: PE file characteristics. Limitations: only a subset of all the potential low-level attributes is considered. Dataset: 65,000 / 25 · 10^5 / 25.65 · 10^5; source: SANS Institute (not public; sample set not available).
• Eskandari et al. (2013). Algorithms: Bayesian Network. Features: APIs/system calls. Limitations: ignores specific instructions; evasion/obfuscation techniques and samples requiring user interactions reduce the effectiveness of the proposed approach; the dataset is small. Dataset: 1,000 / 2,000 / 3,000; source: Research Laboratory at Shiraz University (not public; sample set not available).
• Santos et al. (2013b). Algorithms: Bayesian Network, Decision Tree, k-NN, SVM. Features: opcodes, APIs/system calls, and raised exceptions. Limitations: packed malware, evasion techniques, and samples requiring user interactions reduce the accuracy of the proposed solution; the dataset is small. Dataset: 1,000 / 1,000 / 2,000; source: own machines and VX Heavens (sample set not available).
• Vadrevu et al. (2013). Algorithms: Random Forest. Features: PE file characteristics and network. Limitations: requires a huge number of samples labeled either as malicious or benign. Dataset: 170,780 / 15,182 / 185,962; source: Georgia Institute of Technology (not public; sample set not available); labeling: automated.
• Bai et al. (2014). Algorithms: Decision Tree, Random Forest. Features: PE file characteristics. Limitations: assumes that samples are not packed; malware authors can properly modify the PE header to remain undetected. Dataset: 8,592 / 10,521 / 19,113; source: Windows and Program Files folders and VX Heavens (sample set not available); labeling: automated.
• Kruczkowski and Szynkiewicz (2014). Algorithms: SVM. Features: network. Limitations: the dataset used in the experimental evaluations is small. Dataset: ? / ? / 1,015; source: N6 Platform (not public; sample set not available).
• Tamersoy et al. (2014). Algorithms: clustering with locality sensitive hashing. Features: file system. Limitations: rare and new files cannot be accurately classified as benign or malicious. Dataset: 1,663,506 / 47,956 / 4,970,865; source: Symantec's Norton Community Watch (sample set not available).
• Uppal et al. (2014). Algorithms: Decision Tree, Random Forest, Naïve Bayes, SVM. Features: byte sequences and APIs/system calls. Limitations: the dataset used in the experimental evaluations is very small. Dataset: 150 / 120 / 270; source: legitimate apps and VX Heavens (sample set not available).
• Chen et al. (2015). Algorithms: Belief Propagation. Features: file system. Limitations: rare and new files cannot be accurately classified as benign or malicious. Dataset: 19,142 / 2,883 / 69,165; source: Comodo Cloud Security Center (not public; sample set not available).
• Elhadi et al. (2015). Algorithms: malicious graph matching. Features: APIs/system calls. Limitations: the dataset used in the experimental evaluations is extremely small. Dataset: 10 / 75 / 85; source: VX Heavens.
• Feng et al. (2015). Algorithms: Rule-based classifier, SVM. Features: byte sequences. Limitations: only specific malware classes are considered in the evaluation of the approach. Dataset: 100,000 / 135,064 / 235,064; source: Windows system files and own AV platform (not public; sample set not available).
• Ghiasi et al. (2015). Algorithms: Rule-based classifier. Features: APIs/system calls and CPU registers. Limitations: the APIs/system calls categorization could be not optimal, and the dataset size is small. Dataset: 390 / 850 / 1,240; source: Windows XP system and Program Files folders and a private repository (sample set not available).
• Kwon et al. (2015). Algorithms: Random Forest. Features: network. Limitations: not able to detect bots with rootkit capabilities. Dataset: ? / ? / 24 · 10^6; source: Symantec's Worldwide Intelligence Network Environment (sample set not available).
• Mao et al. (2015). Algorithms: Random Forest. Features: APIs/system calls, file system, and Windows Registry. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed approach; the dataset is small. Dataset: 534 / 7,257 / 7,791; source: Windows XP SP3 and VX Heavens (sample set not available).
• Saxe and Berlin (2015). Algorithms: Neural Networks. Features: strings and PE file characteristics. Limitations: labels assigned to the training set may be inaccurate, and the accuracy of the proposed approach decreases substantially when samples are obfuscated. Dataset: 81,910 / 350,016 / 431,926; source: legitimate apps and own malware database (not public; sample set not available); labeling: automated.
• Srakaew et al. (2015). Algorithms: Decision Tree. Features: byte sequences and opcodes. Limitations: obfuscation techniques reduce detection accuracy. Dataset: 600 / 3,851 / 69,165; source: legitimate files and apps and CWSandbox (not public; sample set not available).
• Wüchner et al. (2015). Algorithms: Naïve Bayes, Random Forest, SVM. Features: byte sequences, APIs/system calls, file system, and Windows Registry. Limitations: the obfuscation techniques applied by the authors may not reflect those of real-world samples; the dataset is small. Dataset: 513 / 6,994 / 7,507; source: legitimate app downloads and Malicia (not public; sample set not available).
• Raff and Nicholas (2017). Algorithms: k-NN with Lempel-Ziv Jaccard distance. Features: byte sequences. Limitations: obfuscation techniques reduce detection accuracy. Dataset: 240,000 / 237,349 / 477,349; source: industry partner (not public; sample set not available).
Table 2 – Characterization of surveyed papers having malware variants selection as objective. Dataset figures as in Table 1.

• Gharacheh et al. (2015)¹. Features: opcodes. Limitations: the opcode sequence is not optimal, and the dataset size is very small. Dataset: ? / ? / 740; source: Cygwin and VX Heavens (sample set not available).
• Ghiasi et al. (2015). Algorithms: Rule-based classifier. Features: APIs/system calls and CPU registers. Limitations: the APIs/system calls categorization could be not optimal, and the dataset size is small. Dataset: 390 / 850 / 1,240; source: Windows XP system and Program Files folders and a private repository (sample set not available).
• Khodamoradi et al. (2015). Algorithms: Decision Tree, Random Forest. Features: opcodes. Limitations: the opcode sequence is not optimal, and the dataset size is very small. Dataset: 550 / 280 / 830; source: Windows XP system and Program Files folders and self-generated metamorphic malware (sample set not available).
• Upchurch and Zhou (2015). Algorithms: clustering with locality sensitive hashing. Features: byte sequences. Limitations: the dataset size is extremely small. Dataset: 0 / 85 / 85; source: sampled from security incidents (not public); labeling: manual.
• Liang et al. (2016). Algorithms: Rule-based classifier. Features: APIs/system calls, file system, Windows Registry, and network. Limitations: the monitored API/system call set could be not optimal, and the dataset size is small. Dataset: 0 / 330,248 / 330,248; source: Anubis website (not public; sample set not available).
• Vadrevu and Perdisci (2016). Algorithms: DBSCAN clustering. Features: APIs/system calls, PE file characteristics, and network. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed approach. Dataset: 0 / 1,651,906 / 1,651,906; source: security company and large research institute (not public; sample set not available).

¹ Instead of using machine learning techniques, Gharacheh et al. rely on Hidden Markov Models to detect variants of the same malicious sample (Gharacheh et al., 2015).
Table 3 – Characterization of surveyed papers having malware families selection as objective. Dataset figures as in Table 1.

• Huang et al. (2009). Algorithms: k-Means-like algorithm. Features: byte sequences. Limitations: the instruction sequence categorization could be not optimal, and the dataset size is small. Dataset: 0 / 2,029 / 2,029; source: Kingsoft Corporation (not public; sample set not available).
• Park et al. (2010). Algorithms: malicious graph matching. Features: APIs/system calls. Limitations: the approach is vulnerable to APIs/system calls injection, and the dataset used in the experimental evaluations is very small. Dataset: 80 / 300 / 380; source: legitimate apps and Anubis Sandbox (not public; sample set not available); labeling: automated.
• Ye et al. (2010). Algorithms: k-Medoids variants. Features: opcodes. Limitations: the instruction categorization could be not optimal. Dataset: 0 / 11,713 / 11,713; source: Kingsoft Corporation (not public; sample set not available).
• Dahl et al. (2013). Algorithms: Logistic Regression, Neural Networks. Features: byte sequences and APIs/system calls. Limitations: the authors obtain a high two-class error rate. Dataset: 817,485 / 1,843,359 / 3,760,844; source: Microsoft (not public; sample set not available); labeling: mostly manual.
• Hu et al. (2013). Algorithms: Prototype-based clustering. Features: opcodes. Limitations: obfuscation techniques reduce the effectiveness of their prototype for malware family selection. Dataset: 0 / 137,055 / 137,055; source not reported (not public; sample set not available); labeling: manual and automated.
• Islam et al. (2013). Algorithms: Decision Tree, k-NN, Random Forest, SVM. Features: strings, byte sequences and APIs/system calls. Limitations: the proposed approach is less effective on novel samples; the dataset is small. Dataset: 51 / 2,398 / 2,939; source: CA Labs (not public; sample set not available).
• Kong and Yan (2013). Algorithms: SVM, k-NN. Features: opcodes, memory, file system, and CPU registers. Limitations: significant differences between samples belonging to the same family reduce the accuracy of the proposed approach. Dataset: 0 / 526,179 / 526,179; source: Offensive Computing (not public; sample set not available); labeling: automated.
• Nari and Ghorbani (2013). Algorithms: Decision Tree. Features: network. Limitations: network features are extracted by a commercial traffic analyzer; the dataset used in the experimental evaluations is small. Dataset: 0 / 3,768 / 3,768; source: Communication Research Center Canada (not public; sample set not available); labeling: automated.
• Ahmadi et al. (2015). Algorithms: SVM, Random Forest, Gradient Boosting Decision Tree. Features: byte sequences, opcodes, APIs/system calls, Windows Registry, CPU registers, and PE file characteristics. Limitations: the selected features could be further reduced to give a clearer view of the reasons behind sample classification. Dataset: 0 / 21,741 / 21,741; source: Microsoft's malware classification challenge (sample set not available).
• Asquith (2015)². Features: APIs/system calls, memory, file system, PE file characteristics, and raised exceptions. No limitations or dataset reported.
• Lin et al. (2015). Algorithms: SVM. Features: byte sequences, APIs/system calls, file system, and CPU registers. Limitations: the selected API/system call set could be not optimal; evasion techniques and samples requiring user interactions reduce the accuracy of the proposed approach; the dataset is small. Dataset: 389 / 3,899 / 4,288; source: own sandbox (not public; sample set not available).
• Kawaguchi and Omote (2015). Algorithms: Decision Tree, Random Forest, k-NN, Naïve Bayes. Features: APIs/system calls. Limitations: this classification approach can be easily evaded by real-world malware; the dataset used in the experimental evaluations is very small. Dataset: 236 / 408 / 644; source: FFRI Inc. (not public; sample set not available).
• Mohaisen et al. (2015). Algorithms: Decision Tree, k-NN, SVM, clustering with different similarity measures, hierarchical clustering. Features: file system, Windows Registry, CPU registers, and network. Limitations: evasion techniques reduce the accuracy of the proposed solution. Dataset: 0 / 115,157 / 115,157; source: AMAL system (not public; sample set not available); labeling: manual and automated.
• Pai et al. (2015). Algorithms: k-Means, Expectation Maximization. Features: opcodes. Limitations: obfuscation techniques reduce the effectiveness of the employed approach; the dataset is small. Dataset: 213 / 8,052 / 8,265; source: Cygwin utility files and Malicia (not public; sample set not available).
• Raff and Nicholas (2017). Algorithms: k-NN with Lempel-Ziv Jaccard distance. Features: byte sequences. Limitations: obfuscation techniques reduce detection accuracy. Dataset: 240,000 / 237,349 / 477,349; source: industry partner (not public; sample set not available).

² Asquith (2015) describes aggregation overlay graphs for storing PE metadata, without further discussing any machine learning technique that could be applied on top of these new data structures.
Table 4 – Characterization of surveyed papers having malware similarities detection as objective. Dataset figures as in Table 1.

• Bailey et al. (2007). Algorithms: hierarchical clustering with normalized compression distance. Features: APIs/system calls, file system, Windows Registry, and network. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed classification method; the dataset is small. Dataset: 0 / 8,228 / 8,228; source: Albor Malware Library and a public repository (not public; sample set not available); labeling: automated.
• Bayer et al. (2009). Algorithms: clustering with locality sensitive hashing. Features: APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the approach accuracy. Dataset: 0 / 75,692 / 75,692; source: Anubis website (not public; sample set not available); labeling: manual and automated.
• Rieck et al. (2011). Algorithms: prototype-based classification and clustering with Euclidean distance. Features: byte sequences and APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed framework. Dataset: 0 / 36,831 / 36,831; source: CWSandbox and Sunbelt Software (not public; sample set not available); labeling: automated.
• Palahan et al. (2013). Algorithms: Logistic Regression. Features: APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed framework, while unknown observed behaviors are classified as malicious; the dataset used in the experimental evaluations is very small. Dataset: 49 / 912 / 961; source: own honeypot (not public; sample set not available).
• Egele et al. (2014). Algorithms: SVM³. Features: APIs/system calls, memory, and CPU registers. Limitations: the accuracy of the computed PE function similarities drops when different compiler toolchains or aggressive optimization levels are used; the dataset is small. Dataset: 1,140 / 0 / 1,140; source: coreutils-8.13 program suite (sample set not available).

³ SVM is used only for computing the optimal values of the weight factors associated to each feature chosen to detect similarities among malicious samples.
Table 5 – Characterization of surveyed papers having malware differences detection as objective. Dataset figures as in Table 1.

• Bayer et al. (2009). Algorithms: clustering with locality sensitive hashing. Features: APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the approach accuracy. Dataset: 0 / 75,692 / 75,692; source: Anubis website (not public; sample set not available); labeling: manual and automated.
• Lindorfer et al. (2011). Algorithms: Rule-based classifier. Features: APIs/system calls and network. Limitations: sophisticated evasion techniques and samples requiring user interactions can still bypass detection processes; the dataset used in the experimental evaluations is small. Dataset: 0 / 1,871 / 1,871; source: Anubis Sandbox (not public; sample set not available); labeling: automated.
• Rieck et al. (2011). Algorithms: prototype-based classification and clustering with Euclidean distance. Features: byte sequences and APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed framework. Dataset: 0 / 36,831 / 36,831; source: CWSandbox and Sunbelt Software (not public; sample set not available); labeling: automated.
• Palahan et al. (2013). Algorithms: Logistic Regression. Features: APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed framework, while unknown observed behaviors are classified as malicious; the dataset used in the experimental evaluations is small. Dataset: 49 / 912 / 961; source: own honeypot (not public; sample set not available).
• Santos et al. (2013a). Algorithms: Decision Tree, k-NN, Bayesian Network, Random Forest. Features: opcodes. Limitations: the opcode sequence is not optimal, and the proposed method is not effective against packed malware; the dataset is small. Dataset: 1,000 / 1,000 / 2,000; source: own machines and VX Heavens (sample set not available); labeling: automated.
• Polino et al. (2015). Algorithms: clustering with Jaccard similarity. Features: APIs/system calls. Limitations: evasion techniques, packed malware, and samples requiring user interactions reduce the accuracy of the proposed framework; the API call sequence used to identify sample behaviors is not optimal; the dataset size is small. Dataset: ? / ? / 2,136.
Table 6 – Characterization of surveyed papers having malware category detection as objective. (Footnote 4: instead of using machine learning techniques, these articles rely on Hidden Markov Models to detect metamorphic viruses (Attaluri et al., 2009; Wong and Stamp, 2006).)

Wong and Stamp (2006). Algorithms: - (see footnote 4). Features: opcodes. Limitations: detection fails if metamorphic malware are similar to benign files; the dataset is extremely small. Dataset: public: ✗; source: Cygwin and VX Heavens generators; available: ✗; labeling: -; samples: 40 benign, 25 malicious, 65 total.

Attaluri et al. (2009). Algorithms: - (see footnote 4). Features: opcodes. Limitations: the proposed approach is not effective against all types of metamorphic viruses; the dataset size is very small. Dataset: public: ✗; source: Cygwin, legitimate DLLs and VX Heavens generators; available: ✗; labeling: -; samples: 240 benign, 70 malicious, 310 total.

Tian et al. (2008). Algorithms: rule-based classifier. Features: function length. Limitations: function lengths alone are not sufficient to detect Trojans, and the dataset used in the experimental evaluations is very small. Dataset: public: -; source: -; available: -; labeling: -; samples: 0 benign, 721 malicious, 721 total.

Siddiqui et al. (2009). Algorithms: Decision Tree, Random Forest. Features: opcodes. Limitations: advanced packing techniques could reduce detection accuracy; the dataset used in the experimental evaluations is small. Dataset: public: ✓; source: Windows XP and VX Heavens; available: ✗; labeling: -; samples: 1,444 benign, 1,330 malicious, 2,774 total.

Chen et al. (2012). Algorithms: Random Forest, SVM. Features: byte sequences. Limitations: the proposed framework heavily relies on security companies' encyclopedias. Dataset: public: ✗; source: Trend Micro; available: ✗; labeling: -; samples: 0 benign, 330,248 malicious, 330,248 total.

Comar et al. (2013). Algorithms: Random Forest, SVM. Features: network. Limitations: network features are extracted by a commercial traffic analyzer. Dataset: public: ✗; source: Internet Service Provider; available: ✗; labeling: manual and automated; samples: 212,505 benign, 4,394 malicious, 216,899 total.

Kwon et al. (2015). Algorithms: Random Forest. Features: network. Limitations: not able to detect bots with rootkit capabilities. Dataset: public: ✓; source: Symantec's Worldwide Intelligence Network Environment; available: ✗; labeling: -; samples: ? benign, ? malicious, 24×10⁶ total.

Sexton et al. (2015). Algorithms: rule-based classifier, Logistic Regression, Naïve Bayes, SVM. Features: byte sequences and opcodes. Limitations: obfuscation techniques reduce detection accuracy; the dataset used in the experimental evaluations is small. Dataset: public: -; source: -; available: -; labeling: -; samples: 4,622 benign, 197 malicious, 4,819 total.

Fig. 2 – Frequency histogram showing how many reviewed papers use each type of source (e.g. public repositories,
honeypot) to collect their datasets, and whether it is used to gather malware or benign samples.

and Seymour, 2018) and are thus likely to negatively affect the whole analysis. Approaches that improve the quality of generated disassembly and decompiled code, as in Schulte et al. (2018), can reduce the impact of these errors.

5.3. Datasets

More than 72% of the surveyed works use datasets containing both malicious and benign samples, while about 28% rely on datasets with malware only. Just two works rely exclusively on benign datasets (Caliskan-Islam et al., 2015; Egele et al., 2014), because their objectives are, respectively, identifying sample similarities and attributing the ownership of source code under analysis.

Fig. 2 shows the dataset sources for malicious and benign samples. It is worth noting that most benign datasets consist of legitimate applications (e.g. software contained in "Program Files" or "system" folders), while most malware has been obtained from public repositories, security vendors and popular sandboxed analysis services. The most popular public repository in the examined works is VX Heavens (Vxheavens), followed by Offensive Computing (OffensiveComputing) and Malicia Project (MaliciaProject). The first two repositories are still actively maintained at the time of writing, while Malicia Project has been permanently shut down due to dataset ageing and lack of maintainers.

Security vendors, popular sandboxed analysis services, and AV companies have access to a huge number of samples. Surveyed works rely on CWSandbox, developed by ThreatTrack Security (ThreatTrack), and on Anubis (Anubis). As can be observed from Fig. 2, these sandboxes are mainly used for obtaining malicious samples. Internet Service Providers (ISPs), honeypots and Computer Emergency Response Teams (CERTs) share both benign and malicious datasets with researchers. A few works use malware developed by the authors themselves (Gharacheh et al., 2015; Khodamoradi et al., 2015), created with malware toolkits (Wong and Stamp, 2006) such as the Next Generation Virus Construction Kit, Virus Creation Lab, Mass Code Generator and Second Generation Virus Generator, all available on VX Heavens (Vxheavens). A minority of the analysed papers do not mention the source of their datasets.

Among the surveyed papers, a recurring issue is the size of the datasets used. Many works, including Kolter and Maloof (2006), Ahmed et al. (2009) and Firdausi et al. (2010), carry out evaluations on fewer than 1,000 samples. Just 39% of the reviewed studies test their approaches on a population greater than 10,000 samples.

When both malicious and benign samples are used for the evaluation, it is crucial to reflect their real-world distribution (Ahmed et al., 2009; Anderson et al., 2011; 2012; Bai et al., 2014; Bilge et al., 2012; Dahl et al., 2013; Elhadi et al., 2015; Eskandari et al., 2013; Feng et al., 2015; Firdausi et al., 2010; Ghiasi et al., 2015; Islam et al., 2013; Kawaguchi and Omote, 2015; Kirat et al., 2013; Kolter and Maloof, 2006; Lin et al., 2015; Mao et al., 2015; Pai et al., 2015; Palahan et al., 2013; Park et al., 2010; Raff and Nicholas, 2017; Santos et al., 2013a; 2013b; 2011; Saxe and Berlin, 2015; Schultz et al., 2001; Siddiqui et al., 2009; Srakaew et al., 2015; Uppal et al., 2014; Wüchner et al., 2015; Yonts, 2012). Indeed, there should be a huge imbalance, because non-malicious executables are the overwhelming majority in the wild. 48% of the surveyed works do not take this aspect into account, and use datasets that either are balanced between malware and non-malicious software or even contain more of the former than the latter. In Yonts (2012), the author supports his choice of a smaller benign dataset by pointing out that standard system files and legitimate applications change little over time. 38% of the examined papers instead employ datasets with a proper distribution of malware and non-malware: they are either unbalanced towards benign samples or use exclusively benign or malicious software. As an example, the majority of the surveyed papers having malware similarities detection as objective (see Table 4) contain exclusively either malware or legitimate applications (Bailey et al., 2007; Bayer et al., 2009; Egele et al., 2014; Rieck et al., 2011). The remaining 14% do not describe how their datasets are composed.
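The practical impact of class distribution on reported results can be illustrated with a back-of-the-envelope computation. The following minimal Python sketch, using invented detector rates rather than figures from any surveyed work, shows how the very same true and false positive rates yield excellent precision on a balanced dataset but much poorer precision under a realistic benign-to-malware ratio.

```python
# Illustrative only: how a fixed TPR/FPR translates into precision
# under different benign-to-malware ratios. The rates below are
# made-up numbers, not results from any surveyed paper.

def precision(tpr, fpr, n_malware, n_benign):
    tp = tpr * n_malware          # correctly flagged malware
    fp = fpr * n_benign           # benign samples wrongly flagged
    return tp / (tp + fp)

TPR, FPR = 0.95, 0.01             # hypothetical detector quality

# Balanced dataset, as used by many surveyed works.
print(precision(TPR, FPR, 1_000, 1_000))      # ~0.990

# Realistic scenario: benign executables vastly outnumber malware.
print(precision(TPR, FPR, 1_000, 100_000))    # ~0.487
```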

Differently from other research fields, no reference benchmark is available in malware analysis to compare accuracy and performance across works. Furthermore, published results are known to be biased towards good results (Sanders, 2017). In addition, since the datasets used for evaluations are rarely shared, it is nearly impossible to compare works. Only two surveyed works have shared their dataset (Schultz et al., 2001; Upchurch and Zhou, 2015), while a third one plans to share it in the future (Mohaisen et al., 2015). It is worth mentioning that one of the shared datasets dates back to 2001, and is hence almost useless today. Indeed, temporal information is crucial to evaluate malware analysis results (Miller et al., 2015) and to determine whether machine learning models have become obsolete (Harang and Ducau, 2018; Jordaney et al., 2017).

Given such a lack of reference datasets, we propose three desiderata for malware analysis benchmarks.

1. Benchmarks should be labeled according to the specific objectives to achieve. As an example, benchmarks for family selection should be labeled with the samples' families.
2. Benchmarks should realistically model the sample distributions of real-world scenarios, considering the objectives to attain. For example, benchmarks for malware detection should contain a set of legitimate applications orders of magnitude greater than the number of malware samples.
3. Benchmarks should be actively maintained and updated over time with new samples, trying to keep pace with the malware industry. Samples should also be provided with temporal information, e.g., when they were first spotted.

The datasets used in Schultz et al. (2001) and Upchurch and Zhou (2015) are correctly labeled according to the malware detection and malware variants selection objectives, respectively. Neither dataset is balanced. In Schultz et al. (2001), the described dataset is biased towards malicious programs, while in Upchurch and Zhou (2015) the groups of variants contain different numbers of samples, ranging from 3 to 20. Finally, neither dataset is actively maintained, and they do not contain temporal information (in Schultz et al. (2001), the authors do not mention whether such information has been included in the dataset).
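Regarding temporal information, one simple way to exploit it during evaluation is a chronological train/test split, so that models are always tested on samples that appeared after everything they were trained on, in line with the temporally consistent labels advocated by Miller et al. (2015). A minimal sketch, where the record fields and dates are hypothetical:

```python
from datetime import datetime

# Hypothetical records: (sample_id, first_seen, label)
samples = [
    ("a1", datetime(2016, 3, 1), 1),
    ("b2", datetime(2017, 7, 9), 0),
    ("c3", datetime(2018, 1, 15), 1),
]

def temporal_split(samples, cutoff):
    """Train on samples first seen before the cutoff,
    test on samples that appeared afterwards."""
    train = [s for s in samples if s[1] < cutoff]
    test = [s for s in samples if s[1] >= cutoff]
    return train, test

train, test = temporal_split(samples, datetime(2017, 1, 1))
```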
6. Topical trends

This section outlines a list of topical trends in malware analysis, i.e. topics that are currently being investigated but have not yet reached the same level of maturity as the areas described in the previous sections.

6.1. Malware development detection

Malware developers can use online public services like VirusTotal (VirusTotal) and Malwr (Malwr) to test the effectiveness of their samples in evading the most common antiviruses. Malware analysts can leverage such behaviour by querying these online services to obtain additional information useful for the analysis, such as the submission time and how many online antiviruses classify a sample as malicious. Graziano et al. (2015) leverage submissions to an online sandbox to identify cases where new samples are being tested, with the final aim of detecting novel malware during their development process. Surprisingly, it turned out that samples used in infamous targeted campaigns had been submitted to public sandboxes months or years before.
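As an illustration of the kind of metadata that can be collected, the sketch below queries the VirusTotal public API (version 2) for a file report; the API key and hash are placeholders, and the rate limiting and error handling that a real collector would need are omitted.

```python
import requests

API_KEY = "<your-api-key>"        # placeholder
SHA256 = "<sample-sha256>"        # placeholder

# VirusTotal public API v2 file report endpoint.
resp = requests.get(
    "https://www.virustotal.com/vtapi/v2/file/report",
    params={"apikey": API_KEY, "resource": SHA256},
)
report = resp.json()

# Submission-related metadata usable as analysis features:
# when the sample was scanned and how many engines flag it.
print(report.get("scan_date"))
print(report.get("positives"), "/", report.get("total"))
```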
With reference to the proposed taxonomy, advances in the state of the art in malware analysis could be obtained by analysing submissions to online malware analysis services, both to extract additional machine learning features and to gather intelligence on what the next malware are likely to be.

6.2. Malware attribution

Another aspect of interest for malware analysts is the identification of who developed a given sample, i.e. the attribution of a malware sample to a specific malicious actor. There are a number of features in a binary that can support this process: the programming language used, the IP addresses and URLs included, and the language of comments and resources. Additional, recently proposed features which can be used for attribution are the time slot in which the malware communicates with a command and control centre and the digital certificates used (Ruthven and Blaich, 2017). Features related to coding style can also reveal important details about a developer's identity, at least for arguing whether different malware have been developed by the same person or group. In Caliskan-Islam et al. (2015), the coding style of the author of generic (i.e. not necessarily malicious) software is accurately profiled through syntactic, lexical, and layout features. Unfortunately, this approach requires the availability of source code, which happens only occasionally, e.g. in case of leaks and/or public disclosures.
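The feature set of Caliskan-Islam et al. (2015) is far richer than what can be shown here, but a toy sketch conveys the flavour of lexical and layout stylometry; the four features below are simplified stand-ins chosen purely for illustration.

```python
import re

def style_features(source_code: str) -> dict:
    """Toy lexical/layout features inspired by code stylometry;
    a real profile would use many more (e.g. AST-based) features."""
    lines = source_code.splitlines() or [""]
    words = re.findall(r"[A-Za-z_]\w*", source_code)
    return {
        "avg_line_length": sum(map(len, lines)) / len(lines),
        "tab_indent_ratio": sum(l.startswith("\t") for l in lines) / len(lines),
        "comment_ratio": sum("//" in l or l.lstrip().startswith("/*")
                             for l in lines) / len(lines),
        "avg_identifier_length": (sum(map(len, words)) / len(words)) if words else 0.0,
    }
```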
Malware attribution can be seen as an additional analysis objective, according to the proposed taxonomy. Progress in this direction through machine learning techniques is currently hindered by the lack of ground truth on malware authors, which proves to be really hard to provision. Recent approaches leverage public reports referring to APT groups and detailing what malware they are supposed to have developed: those reports are parsed to mine the relationships between malicious samples and the corresponding APT group authors (Laurenza et al., 2017). The state of the art in malware attribution through machine learning can be advanced by researching alternative methods to generate reliable ground truth on malware developers, or on what malware have been developed by the same actor.


6.3. Malware triage

Given the huge amount of new malware that needs to be analysed, fast and accurate prioritisation is required to identify which samples deserve more in-depth analyses. This can be decided on the basis of the level of similarity with already known samples. If a new malware sample very closely resembles binaries that have been analysed before, then its examination is not a priority. Otherwise, further analyses can be advised if a new malware sample really looks different from everything else observed so far. This process is referred to as malware triage, and it shares some aspects with malware similarity analysis, as they both provide key information to support the prioritisation of malware analysis. They differ, however, because triage requires faster results at the cost of lower accuracy, hence different techniques are usually employed (Jang et al., 2011; Kirat et al., 2013; Laurenza et al., 2017; Rosseau and Seymour, 2018).

Like attribution, triage can be considered another malware analysis objective. One important challenge of malware triage is finding the proper trade-off between accuracy and performance, which fits the problems we address in the context of malware analysis economics (see Section 7).
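A minimal form of similarity-driven triage can be sketched as follows: represent each sample by a set of features (e.g. imported API names) and queue for deep analysis only the samples whose best match against already analysed binaries is weak. The Jaccard measure and the threshold below are illustrative choices, not those of any surveyed system.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def triage(new_sample: set, known_samples: list, threshold: float = 0.7):
    """Return 'low priority' if the new sample closely resembles
    something already analysed, 'deep analysis' otherwise."""
    best = max((jaccard(new_sample, k) for k in known_samples), default=0.0)
    return "low priority" if best >= threshold else "deep analysis"

known = [{"CreateFileA", "WriteFile", "RegSetValueA"},
         {"InternetOpenA", "HttpSendRequestA"}]
print(triage({"CreateFileA", "WriteFile", "RegSetValueExA"}, known))
```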
lies datasets, we claim the worthiness to boost the research
6.4. Prediction of future variants on malware evolution prediction through machine learning
techniques.
Compared to malware analysts, malware developers have the
advantage of knowing current anti-malware measures and 6.5. Other features
thus novel variants can be designed accordingly. A novel trend
in malware analysis is investigating the feasibility to fill that This section describes features different from those analysed
gap by predicting how future malware will look like, so as in Section 3.2.2 and that have been used by just a few papers so

6.5.1. Memory accesses
Any data of interest, such as user generated content, is temporarily stored in main memory, hence analysing how memory is accessed can reveal important information about the behaviour of an executable (Pomeranz, 2012). Kong and Yan (2013) rely on statically tracing reads and writes in main memory, while Egele et al. (2014) dynamically trace the values read from and written to the stack and the heap.

6.5.2. Function length
Another characterising feature is the function length, measured as the number of bytes contained in a function. This input alone is not sufficient to discriminate malicious executables from benign software; indeed, it is usually combined with other features. This idea, formulated in Tian et al. (2008), is adopted in Islam et al. (2013), where function length frequencies, extracted through static analysis, are used together with other static and dynamic features.
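A sketch of how function lengths can be turned into a fixed-size frequency vector is shown below; the bin boundaries are arbitrary illustrative values, not those used by Tian et al. (2008) or Islam et al. (2013).

```python
def function_length_features(func_lengths, bins=(16, 64, 256, 1024, 4096)):
    """Normalised histogram of function lengths (in bytes) over
    fixed bins, so that binaries with different numbers of
    functions remain comparable."""
    counts = [0] * (len(bins) + 1)
    for length in func_lengths:
        i = sum(length > b for b in bins)  # number of bin edges below the length
        counts[i] += 1
    total = len(func_lengths) or 1
    return [c / total for c in counts]

# e.g. lengths extracted with a disassembler from one PE sample
print(function_length_features([12, 40, 300, 5000, 70]))
```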
that the sample has actually invoked, thus simplifying the
6.5.3. Raised exceptions identification of those suspicious APIs. By consequences, in
The analysis of the exceptions raised during the execution this case dynamic analysis is likely to generate more valuable
can help understanding what strategies a malware adopts to features compared to static analysis. MazeWalker (Kulakov,
evade analysis systems (Asquith, 2015; Santos et al., 2013b). A 2017) is a typical example of how dynamic information can
common trick to deceive analysts is throwing an exception to integrate static analysis.
run a malicious handler, registered at the beginning of mal- Although choosing dynamic analysis over, or in addition
ware execution. In this way, examining the control flow be- to, static seems obvious, its inherently higher time complexity
comes much more complex. constitutes a potential performance bottleneck for the whole
malware analysis process, which can undermine the possibil-
ity to keep pace with malware evolution speed. The natural
7. Malware analysis economics solution is to provision more computational resources to par-
allelise analysis tasks and thus remove bottlenecks. In turn,
Analysing samples through machine learning techniques re- such solution has a cost to be taken into account when de-
quires complex computations for extracting desired features signing a malware analysis environment, such as the one pre-
and running chosen algorithms. The time complexity of these sented by Laurenza et al. (2016).
computations has to be carefully taken into account to en- The qualitative trade-offs we have identified are between
sure they complete fast enough to keep pace with the speed accuracy and time complexity (i.e., higher accuracy requires
new malware are developed. Space complexity has to be con- larger times), between time complexity and analysis pace (i.e.,
sidered as well, indeed feature space can easily become ex- larger times implies slower pace), between analysis pace and
cessively large (e.g., using n-grams), and also the memory re- computational resources (faster analysis demands using more
quired by machine learning algorithms can grow to the point resources), and between computational resources and eco-
of saturating available resources. nomic cost (obviously, additional equipment has a cost). Sim-
Time and space complexities can be either reduced to ilar trade-offs also hold for space complexity. As an example,
adapt to processing and storage capacity at disposal, or they when using n-grams as features, it has been shown that larger
can be accommodated by supplying more resources. In the for- values of n lead to more accurate analysis, at cost of having the
mer case, the analysis accuracy is likely to worsen, while, in feature space grow exponentially with n (Lin et al., 2015; Up-
the latter, accuracy levels can be preserved at the cost of pro- pal et al., 2014). As another example, using larger datasets in
viding more computing machines, storage and network. There general enables more accurate machine learning models and
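For instance, the statically visible API surface of a PE sample can be listed with the pefile Python library, as in the sketch below (error handling omitted); contrasting this set with the APIs actually observed at runtime helps to spot the padding trick described above.

```python
import pefile

def static_imports(path: str) -> set:
    """Set of API names statically imported by a PE sample."""
    pe = pefile.PE(path, fast_load=True)
    pe.parse_data_directories()
    names = set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            if imp.name:                  # skip imports by ordinal
                names.add(imp.name.decode())
    return names

# APIs imported but never observed at runtime may be mere padding;
# runtime-only APIs (e.g. resolved via GetProcAddress) are suspicious.
# static_only = static_imports("sample.exe") - dynamically_observed_apis
```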
Although choosing dynamic analysis over, or in addition to, static analysis seems obvious, its inherently higher time complexity constitutes a potential performance bottleneck for the whole malware analysis process, which can undermine the ability to keep pace with the speed of malware evolution. The natural solution is to provision more computational resources to parallelise analysis tasks and thus remove bottlenecks. In turn, such a solution has a cost, to be taken into account when designing a malware analysis environment such as the one presented by Laurenza et al. (2016).

The qualitative trade-offs we have identified are between accuracy and time complexity (higher accuracy requires larger times), between time complexity and analysis pace (larger times imply a slower pace), between analysis pace and computational resources (faster analysis demands more resources), and between computational resources and economic cost (additional equipment obviously has a cost). Similar trade-offs also hold for space complexity. As an example, when using n-grams as features, it has been shown that larger values of n lead to more accurate analysis, at the cost of having the feature space grow exponentially with n (Lin et al., 2015; Uppal et al., 2014). As another example, larger datasets generally enable more accurate machine learning models and thus better accuracy, provided that enough space is available to store all the samples of the dataset and the related analysis reports.
Table 7 – Type of analysis (static or dynamic) required for extracting the inputs presented in Sections 3.2.2 and 6.5: strings, byte sequences, opcodes, APIs/system calls, file system, CPU registers, PE file characteristics, network, AV/sandbox submissions, code stylometry, memory accesses, function length, and raised exceptions.

We present a qualitative, simplified example of analysis that leverages the trade-offs just introduced. The scenario we target regards detecting the malware families of new malicious samples (Section 3.1.2), using as features n-grams computed over invoked APIs (Section 3.2.2), recorded through dynamic analysis (Section 3.2.1). We want here to explore the trade-offs between family detection accuracy, execution time, analysis pace and cost, in terms of required computational resources. For what concerns the scenario and the qualitative figures on the relationships between n, the number of features, accuracy and execution time, we take inspiration from the experimental evaluation presented by Lin et al. (2015). Table 8 shows the relationship between n and the feature count. We introduce a few simplifying assumptions and constraints to make this qualitative example as consistent as possible. We assume that the algorithm used to detect families is parallelisable and ideally scalable, meaning that by doubling the available machines we also double the throughput, i.e. the number of malware samples analysed per second. We want to process one million malware samples per day with an accuracy of at least 86%.

Table 8 – Relationship between n and number of features.

n    Feature count
1    187
2    6,740
3    46,216
4    130,671
5    342,663
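The growth shown in Table 8 mirrors what a straightforward n-gram extractor produces over API traces. A minimal sketch, with a toy trace set purely for illustration:

```python
def api_ngrams(trace, n):
    """Set of n-grams (tuples of n consecutive API names) in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

traces = [["NtCreateFile", "NtWriteFile", "NtClose"],
          ["NtOpenKey", "NtSetValueKey", "NtClose"]]

for n in (1, 2, 3):
    vocabulary = set().union(*(api_ngrams(t, n) for t in traces))
    # With thousands of long real-world traces, this vocabulary
    # explodes as n grows (cf. Table 8).
    print(n, len(vocabulary))
```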
Fig. 3 – Relationship between execution time (in logarithmic scale) and detection accuracy as n varies. The target accuracy of 86% is also reported.

Fig. 4 – Relationship between machine count and malware throughput (in logarithmic scale) for different n-gram sizes. The target load of one million malware samples per day is also reported.

Fig. 3 highlights the trade-off between execution time (in logarithmic scale) and detection accuracy as n is varied. As n grows, the accuracy increases almost linearly, while the execution time rises exponentially, which translates into an exponential decrease in how many malware samples per second can be processed. It can be noted that the minimum n-gram size that meets the 86% accuracy requirement is 3. The trade-off between analysis pace and cost can be observed in Fig. 4, where, by leveraging the assumption of ideal scalability of the detection algorithm, it is shown that the sustainable malware throughput (in logarithmic scale) increases linearly as the algorithm is parallelised over more machines. 4-grams and 5-grams cannot be used to cope with the expected malware load of one million samples per day, at least when considering up to five machines. On the other hand, by using four machines and 3-grams, we can sustain the target load and at the same time meet the constraint on detection accuracy.
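The arithmetic behind Figs. 3 and 4 can be reproduced in a few lines. The per-sample execution times below are invented placeholders, merely chosen to be consistent with the qualitative picture of the figures, not measurements from Lin et al. (2015).

```python
import math

TARGET_PER_DAY = 1_000_000

# Hypothetical per-sample analysis time (seconds) for each n-gram size.
exec_time = {1: 0.05, 2: 0.1, 3: 0.3, 4: 1.2, 5: 5.0}

for n, t in exec_time.items():
    per_day_per_machine = 86_400 / t
    machines = math.ceil(TARGET_PER_DAY / per_day_per_machine)
    print(f"n={n}: {per_day_per_machine:,.0f} samples/day/machine, "
          f"{machines} machine(s) for the target load")
```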
The presented toy example is just meant to better explain how malware analysis economics can be used in practical scenarios. We claim the significance of investigating these trade-offs in more detail, with the aim of outlining proper guidelines and strategies to design a malware analysis environment in compliance with requirements on analysis accuracy and pace, while respecting budget constraints.

8. Conclusion

We presented a survey of the existing literature on malware analysis through machine learning techniques. Our work makes five main contributions. First, we proposed an organization of reviewed works according to three orthogonal dimensions: the objective of the analysis, the type of features extracted from samples, and the machine learning algorithms used to process these features. Such characterization provides an overview of how machine learning algorithms can be employed in malware analysis, emphasising which specific feature classes allow the objective(s) of interest to be achieved. Second, we arranged the existing literature on PE malware analysis through machine learning according to the proposed taxonomy, providing a detailed comparative analysis of surveyed works. Third, we highlighted the current issues of machine learning for malware analysis: anti-analysis techniques used by malware, what operation set to consider for the features, and the datasets used. Fourth, we identified topical trends concerning interesting objectives and features, such as malware attribution and triage. Fifth, we introduced the novel concept of malware analysis economics, concerning the investigation and exploitation of existing trade-offs between performance metrics of malware analysis (e.g., analysis accuracy and execution time) and economic costs.

Noteworthy research directions can be linked to those contributions. Novel combinations of objectives, features and algorithms can be investigated to achieve better accuracy than the state of the art. Moreover, observing that some classes of algorithms have never been used for a certain objective may suggest novel directions to examine further. The discussion on malware analysis issues can provide further ideas worth exploring. In particular, defining appropriate benchmarks for malware analysis is a priority for the whole research area. The novel concept of malware analysis economics can encourage further research directions, where appropriate tuning strategies can be provided to balance competing metrics (e.g. accuracy and cost) when designing a malware analysis environment.

Acknowledgment

This work has been partially supported by a grant of the Italian Presidency of the Ministry Council and by the Laboratorio Nazionale of Cyber Security of the CINI (Consorzio Interuniversitario Nazionale Informatica).
R E F E R E N C E S

Ahmadi M, Giacinto G, Ulyanov D, Semenov S, Trofimov M. Novel feature extraction, selection and fusion for effective malware family classification. CoRR 2015;abs/1511.04317.

Ahmed F, Hameed H, Shafiq MZ, Farooq M. Using spatio-temporal information in API calls with machine learning algorithms for malware detection. In: Proceedings of the 2nd ACM workshop on security and artificial intelligence. ACM; 2009. p. 55–62.

Allen FE. Control flow analysis. In: Proceedings of a symposium on compiler optimization. New York, NY, USA: ACM; 1970. p. 1–19.

Anderson B, Quist D, Neil J, Storlie C, Lane T. Graph-based malware detection using dynamic analysis. J Comput Virol 2011;7(4):247–58.

Anderson B, Storlie C, Lane T. Improving malware classification: bridging the static/dynamic gap. In: Proceedings of the 5th ACM workshop on security and artificial intelligence. ACM; 2012. p. 3–14.

Anubis. https://seclab.cs.ucsb.edu/academic/projects/projects/anubis/. Accessed: 2018-06-03.

Asquith M. Extremely scalable storage and clustering of malware metadata. J Comput Virol Hack Tech 2015;12:1–10.

Attaluri S, McGhee S, Stamp M. Profile hidden Markov models and metamorphic virus detection. J Comput Virol 2009;5(2):151–69.

AV-TEST. Security report 2016/17. 2017. https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2016-2017.pdf.

Bai J, Wang J, Zou G. A malware detection scheme based on mining format information. Sci World J 2014.

Bailey M, Oberheide J, Andersen J, Mao ZM, Jahanian F, Nazario J. Automated classification and analysis of internet malware. In: Proceedings of the 10th international symposium on recent advances in intrusion detection. Springer; 2007. p. 178–97.

Baldoni R, Coppa E, D'Elia DC, Demetrescu C, Finocchi I. A survey of symbolic execution techniques. ACM Comput Surv 2018;51(3).

Barriga JJ, Yoo SG. Malware detection and evasion with machine learning techniques: a survey. Int J Appl Eng Res 2017;12(318):41:1–41:40.

Basu I. Malware detection based on source data using data mining: a survey. Am J Adv Comput 2016;3(1):18–37.

Bayer U, Comparetti PM, Hlauschek C, Kruegel C, Kirda E. Scalable, behavior-based malware clustering, vol. 9; 2009. p. 8–11.

Bazrafshan Z, Hashemi H, Fard SMH, Hamzeh A. A survey on heuristic malware detection techniques. In: Proceedings of the 5th conference on information and knowledge technology (IKT). IEEE; 2013. p. 113–20.

Bilge L, Balzarotti D, Robertson W, Kirda E, Kruegel C. Disclosure: detecting botnet command and control servers through large-scale netflow analysis. In: Proceedings of the 28th annual computer security applications conference (ACSAC '12). ACM; 2012. p. 129–38.

Blazytko T, Contag M, Aschermann C, Holz T. Syntia: synthesizing the semantics of obfuscated code. In: Proceedings of the 26th USENIX security symposium (USENIX Security 17). USENIX Association; 2017. p. 643–59.

Caliskan-Islam A, Harang R, Liu A, Narayanan A, Voss C, Yamaguchi F, Greenstadt R. De-anonymizing programmers via code stylometry. In: Proceedings of USENIX Security '15. USENIX Association; 2015. p. 255–70.

Chau DH, Nachenberg C, Wilhelm J, Wright A, Faloutsos C. Polonium: tera-scale graph mining for malware detection. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining; 2010. p. 131–42.

Chen L, Li T, Abdulhayoglu M, Ye Y. Intelligent malware detection based on file relation graphs. In: Proceedings of the IEEE international conference on semantic computing (ICSC); 2015. p. 85–92.

Chen Z, Roussopoulos M, Liang Z, Zhang Y, Chen Z, Delis A. Malware characteristics and threats on the internet ecosystem. J Syst Softw 2012;85(7):1650–72.

Comar PM, Liu L, Saha S, Tan PN, Nucci A. Combining supervised and unsupervised learning for zero-day malware detection. In: Proceedings of the 32nd annual IEEE international conference on computer communications (INFOCOM); 2013. p. 2022–30.

Dahl GE, Stokes JW, Deng L, Yu D. Large-scale malware classification using random projections and neural networks. In: Proceedings of the 38th international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2013. p. 3422–6.

Damodaran A, Di Troia F, Visaggio CA, Austin TH, Stamp M. A comparison of static, dynamic, and hybrid analysis for malware detection. J Comput Virol Hack Tech 2015:1–12.

Egele M, Scholte T, Kirda E, Kruegel C. A survey on automated dynamic malware-analysis techniques and tools. ACM Comput Surv (CSUR) 2012;44(2):6.

Egele M, Woo M, Chapman P, Brumley D. Blanket execution: dynamic similarity testing for program binaries and components. In: Proceedings of the 23rd USENIX security symposium. San Diego, CA: USENIX Association; 2014. p. 303–17.

Elhadi E, Maarof MA, Barry B. Improving the detection of malware behaviour using simplified data dependent API call graph. J Secur Appl 2015.

Eskandari M, Khorshidpour Z, Hashemi S. HDM-Analyser: a hybrid analysis approach based on data mining techniques for malware detection. J Comput Virol Hack Tech 2013;9(2):77–93.

Feng Z, Xiong S, Cao D, Deng X, Wang X, Yang Y, Zhou X, Huang Y, Wu G. HRS: a hybrid framework for malware detection. In: Proceedings of the 2015 ACM international workshop on security and privacy analytics. ACM; 2015. p. 19–26.

Firdausi I, Lim C, Erwin A, Nugroho AS. Analysis of machine learning techniques used in behavior-based malware detection. In: Proceedings of the second international conference on advances in computing, control, and telecommunication technologies (ACT '10). IEEE; 2010. p. 201–3.

Gardiner J, Nagaraja S. On the security of machine learning in malware C&C detection: a survey. ACM Comput Surv 2016;49(3):59:1–59:39.

Gharacheh M, Derhami V, Hashemi S, Fard SMH. Proposing an HMM-based approach to detect metamorphic malware. In: Proceedings of the 4th Iranian joint congress on fuzzy and intelligent systems (CFIS); 2015. p. 1–5.

Ghiasi M, Sami A, Salehi Z. Dynamic VSA: a framework for malware detection based on register contents. Eng Appl Artif Intell 2015;44:111–22.

Graziano M, Canali D, Bilge L, Lanzi A, Balzarotti D. Needles in a haystack: mining information from public dynamic analysis sandboxes for malware intelligence. In: Proceedings of the 24th USENIX security symposium; 2015. p. 1057–72.

Guilfanov I. Decompilers and beyond. Black Hat USA; 2008.

Harang R, Ducau F. Measuring the speed of the Red Queen's race. https://i.blackhat.com/us-18/Wed-August-8/us-18-Harang-Measuring-the-Speed-of-the-Red-Queens-Race.pdf. Last accessed: 2018-10-18; 2018.

Howard M, Pfeffer A, Dalai M, Reposa M. Predicting signatures of future malware variants. In: Proceedings of the 12th international conference on malicious and unwanted software (MALWARE); 2017. p. 126–32.

Hu X, Shin KG, Bhatkar S, Griffin K. MutantX-S: scalable malware clustering based on static features. In: Proceedings of the USENIX annual technical conference; 2013. p. 187–98.

Huang K, Ye Y, Jiang Q. ISMCS: an intelligent instruction sequence based malware categorization system. In: Proceedings of the 3rd international conference on anti-counterfeiting, security, and identification in communication. IEEE; 2009. p. 509–12.

Islam R, Tian R, Batten LM, Versteeg S. Classification of malware based on integrated static and dynamic features. J Netw Comput Appl 2013;36(2):646–56.

Jang J, Brumley D, Venkataraman S. BitShred: feature hashing malware for scalable triage and semantic analysis. In: Proceedings of the ACM conference on computer and communications security. ACM; 2011. p. 309–20.

Jordaney R, Sharad K, Dash SK, Wang Z, Papini D, Nouretdinov I, Cavallaro L. Transcend: detecting concept drift in malware classification models. In: Proceedings of the 26th USENIX security symposium (USENIX Security 17). Vancouver, BC: USENIX Association; 2017. p. 625–42.

Juzonis V, Goranin N, Cenys A, Olifer D. Specialized genetic algorithm based simulation tool designed for malware evolution forecasting. Annales Univ Mariae Curie-Sklodowska sectio AI-Inf 2012;12(4):23–37.

Karpin J, Dorfman A. Crypton - exposing malware's deepest secrets. https://recon.cx/2017/montreal/resources/slides/RECON-MTL-2017-crypton.pdf. Last accessed: 2018-10-18; 2017.

Kawaguchi N, Omote K. Malware function classification using APIs in initial behavior. In: Proceedings of the 10th Asia joint conference on information security (AsiaJCIS). IEEE; 2015. p. 138–44.

Khodamoradi P, Fazlali M, Mardukhi F, Nosrati M. Heuristic metamorphic malware detection based on statistics of assembly instructions using classification algorithms. In: Proceedings of the 18th CSI international symposium on computer architecture and digital systems (CADS). IEEE; 2015. p. 1–6.

Kirat D, Nataraj L, Vigna G, Manjunath B. SigMal: a static signal processing based malware triage. In: Proceedings of the 29th annual computer security applications conference. ACM; 2013. p. 89–98.

Kolter JZ, Maloof MA. Learning to detect and classify malicious executables in the wild. J Mach Learn Res 2006;7:2721–44.

Kong D, Yan G. Discriminant malware distance learning on structural information for automated malware classification. In: Proceedings of the international conference on knowledge discovery and data mining (KDD '13). New York, NY, USA: ACM; 2013. p. 1357–65.

Kotov V, Wojnowicz M. Towards generic deobfuscation of Windows API calls. Computing Research Repository 2018;abs/1802.04466.

Kruczkowski M, Szynkiewicz EN. Support vector machine for malware analysis and classification. In: Proceedings of web intelligence (WI) and intelligent agent technologies (IAT). IEEE Computer Society; 2014. p. 415–20.

Kulakov Y. MazeWalker. https://recon.cx/2017/montreal/resources/slides/RECON-MTL-2017-MazeWalker.pdf; 2017.

Kwon BJ, Mondal J, Jang J, Bilge L, Dumitras T. The dropper effect: insights into malware distribution with downloader graph analytics. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. ACM; 2015. p. 1118–29.

Laurenza G, Aniello L, Lazzeretti R, Baldoni R. Malware triage based on static features and public APT reports. In: Dolev S, Lodha S, editors. Cyber security cryptography and machine learning. Cham: Springer International Publishing; 2017. p. 288–305.

Laurenza G, Ucci D, Aniello L, Baldoni R. An architecture for semi-automatic collaborative malware analysis for CIs. In: Proceedings of the 46th annual IEEE/IFIP international conference on dependable systems and networks workshop. IEEE; 2016. p. 137–42.

LeDoux C, Lakhotia A. Malware and machine learning. In: Intelligent methods for cyber warfare. Springer; 2015. p. 1–42.

Lee T, Mody JJ. Behavioral classification. In: Proceedings of the EICAR conference; 2006. p. 1–17.

Liang G, Pang J, Dai C. A behavior-based malware variant classification technique. Int J Inf Educ Technol 2016;6(4):291.

Liao Y, Cai R, Zhu G, Yin Y, Li K. MobileFindr: function similarity identification for reversing mobile binaries. Lecture Notes in Computer Science, vol. 11098. Springer; 2018. p. 66–83.

Lin CT, Wang NJ, Xiao H, Eckert C. Feature selection and extraction for malware classification. J Inf Sci Eng 2015;31(3):965–92.

Lindorfer M, Kolbitsch C, Comparetti PM. Detecting environment-sensitive malware. In: Recent advances in intrusion detection. Springer; 2011. p. 338–57.

Malicia Project. http://malicia-project.com. Accessed: 2018-06-03.

Malwr. https://malwr.com. Accessed: 2018-06-03.

Mao W, Cai Z, Towsley D, Guan X. Probabilistic inference on integrity for access behavior based malware detection. In: Proceedings of the international workshop on recent advances in intrusion detection. Springer; 2015. p. 155–76.

Miller B, Kantchelian A, Afroz S, Bachwani R, Faizullabhoy R, Huang L, Shankar V, Tschantz M, Wu T, Yiu G, et al. Back to the future: malware detection with temporally consistent labels. Computing Research Repository 2015.

Miramirkhani N, Appini MP, Nikiforakis N, Polychronakis M. Spotless sandboxes: evading malware analysis systems using wear-and-tear artifacts. In: Proceedings of the IEEE symposium on security and privacy (SP); 2017. p. 1009–24.

Mohaisen A, Alrawi O, Mohaisen M. AMAL: high-fidelity, behavior-based automated malware analysis and classification. Comput Secur 2015;52:251–66.

Nari S, Ghorbani AA. Automated malware classification based on network behavior. In: Proceedings of the international conference on computing, networking and communications (ICNC). IEEE; 2013. p. 642–7.

Offensive Computing. http://www.offensivecomputing.net. Accessed: 2018-06-03.

Pai S, Di Troia F, Visaggio CA, Austin TH, Stamp M. Clustering for malware classification. J Comput Virol Hack Tech 2015.

Palahan S, Babić D, Chaudhuri S, Kifer D. Extraction of statistically significant malware behaviors. In: Proceedings of the annual computer security applications conference. ACM; 2013. p. 69–78.

Park HS, Jun CH. A simple and fast algorithm for K-medoids clustering. Expert Syst Appl 2009;36:3336–41.

Park Y, Reeves D, Mulukutla V, Sundaravel B. Fast malware classification by automated behavioral graph matching. In: Proceedings of the workshop on cyber security and information intelligence research. ACM; 2010. p. 45.

Polino M, Scorti A, Maggi F, Zanero S. Jackdaw: towards automatic reverse engineering of large datasets of binaries. In: Detection of intrusions and malware, and vulnerability assessment. Springer International Publishing; 2015. p. 121–43.

Pomeranz H. Detecting malware with memory forensics. http://www.deer-run.com/~hal/Detect_Malware_w_Memory_Forensics.pdf. Last accessed: 2018-05-14; 2012.

Raff E, Nicholas C. An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2017. p. 1007–15.

Rieck K, Trinius P, Willems C, Holz T. Automatic analysis of malware behavior using machine learning. J Comput Secur 2011;19(4):639–68.

Rosseau A, Seymour R. Finding Xori: malware analysis triage with automated disassembly. https://i.blackhat.com/us-18/Wed-August-8/us-18-Rousseau-Finding-Xori-Malware-Analysis-Triage-With-Automated-Disassembly.pdf. Last accessed: 2018-10-14; 2018.

Ruthven M, Blaich A. Fighting targeted malware in the mobile ecosystem. https://www.blackhat.com/docs/us-17/wednesday/us-17-Ruthven-Fighting-Targeted-Malware-In-The-Mobile-Ecosystem.pdf. Last accessed: 2018-10-16; 2017.

Sahu MK, Ahirwar M, Hemlata A. A review of malware detection based on pattern matching technique. Int J Comput Sci Inf Technol (IJCSIT) 2014;5(1):944–7.

Sanders H. Garbage in, garbage out - how purportedly great ML models can be screwed up by bad data. https://www.blackhat.com/docs/us-17/wednesday/us-17-Sanders-Garbage-In-Garbage-Out-How-Purportedly-Great-ML-Models-Can-Be-Screwed-Up-By-Bad-Data.pdf. Last accessed: 2018-10-18; 2017.

Santos I, Brezo F, Ugarte-Pedrero X, Bringas PG. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inf Sci 2013a;231:64–82.

Santos I, Devesa J, Brezo F, Nieves J, Bringas PG. OPEM: a static-dynamic approach for machine-learning-based malware detection. In: Proceedings of the international joint conference CISIS'12-ICEUTE'12-SOCO'12 special sessions. Springer; 2013b. p. 271–80.

Santos I, Nieves J, Bringas PG. In: International symposium on distributed computing and artificial intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011. p. 415–22.

Saxe J, Berlin K. Deep neural network based malware detection using two dimensional binary program features. In: Proceedings of the 10th international conference on malicious and unwanted software (MALWARE). IEEE; 2015. p. 11–20.

Schulte E, Ruchti J, Noonan M, Ciarletta D, Loginov A. Evolving exact decompilation. In: Proceedings of the workshop on binary analysis research (BAR); 2018.

Schultz MG, Eskin E, Zadok F, Stolfo SJ. Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE symposium on security and privacy; 2001. p. 38–49.

Sexton J, Storlie C, Anderson B. Subroutine based detection of APT malware. J Comput Virol Hack Tech 2015:1–9.

Shabtai A, Moskovitch R, Elovici Y, Glezer C. Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey. Inf Secur Tech Rep 2009;14(1):16–29.

Siddiqui M, Wang MC, Lee J. Detecting internet worms using data mining techniques. J Syst Cybern Inf 2009:48–53.

Souri A, Hosseini R. A state-of-the-art survey of malware detection approaches using data mining techniques. Hum Cent Comput Inf Sci 2018;8(1):3.

Srakaew S, Piyanuntcharatsr W, Adulkasem S. On the comparison of malware detection methods using data mining with two feature sets. J Secur Appl 2015;9:293–318.

Tamersoy A, Roundy K, Chau DH. Guilt by association: large scale malware detection by mining file-relation graphs. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2014. p. 1524–33.

ThreatTrack. https://www.threattrack.com/malware-analysis.aspx. Accessed: 2018-06-03.

Tian R, Batten LM, Versteeg SC. Function length as a tool for malware classification. In: Proceedings of the 3rd international conference on malicious and unwanted software (MALWARE); 2008. p. 69–76.

Upchurch J, Zhou X. Variant: a malware similarity testing framework. In: Proceedings of the 10th international conference on malicious and unwanted software (MALWARE). IEEE; 2015. p. 31–9.

Uppal D, Sinha R, Mehra V, Jain V. Malware detection and classification based on extraction of API sequences. In: Proceedings of the international conference on advances in computing, communications and informatics (ICACCI). IEEE; 2014. p. 2337–42.

Vadrevu P, Perdisci R. MAXS: scaling malware execution with sequential multi-hypothesis testing. In: Proceedings of the ACM Asia conference on computer and communications security (ASIA CCS '16). New York, NY, USA: ACM; 2016. p. 771–82.

Vadrevu P, Rahbarinia B, Perdisci R, Li K, Antonakakis M. Measuring and detecting malware downloads in live network traffic. In: Proceedings of the 18th European symposium on research in computer security, Egham, UK, September 9–13, 2013. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013. p. 556–73.

VirusTotal. https://www.virustotal.com. Accessed: 2018-06-03.

Vxheaven. https://github.com/opsxcq/mirror-vxheaven.org. Accessed: 2018-06-03.

Wong W, Stamp M. Hunting for metamorphic engines. J Comput Virol 2006;2(3):211–29.

Wüchner T, Ochoa M, Pretschner A. Robust and effective malware detection through quantitative data flow graph metrics. In: Detection of intrusions and malware, and vulnerability assessment. Springer; 2015. p. 98–118.

Ye Y, Li T, Adjeroh D, Iyengar SS. A survey on malware detection using data mining techniques. ACM Comput Surv (CSUR) 2017;50(3):41.

Ye Y, Li T, Chen Y, Jiang Q. Automatic malware categorization using cluster ensemble. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2010. p. 95–104.

Yonts J. Attributes of malicious files. Technical report. The SANS Institute; 2012.

Daniele Ucci is a Ph.D. student in Engineering in Computer Science at the Department of Computer, Control, and Management Engineering "Antonio Ruberti" at Sapienza University of Rome. He received his master degree with honors in Engineering in Computer Science in the 2014 academic year. His research interests mainly focus on Big Data and on information security and privacy, with special regard to malware analysis. During his master thesis, he investigated topics related to business intelligence and Big Data. Currently, he is working both on privacy-preserving data sharing of sensitive information in collaborative environments and on malware analysis based on machine learning techniques.

Leonardo Aniello is a Lecturer in Cyber Security at the University of Southampton, where he is also a member of the Cyber Security Research Group. He obtained a Ph.D. in Engineering in Computer Science in 2014 from "La Sapienza" University of Rome, with a thesis on techniques for processing Big Data in large-scale environments through a collaborative approach, with the aim of improving the timeliness of the elaboration. His research is currently focused on cyber security aspects, including malware analysis, blockchain-based systems and privacy-preserving data sharing. Leonardo is the author of more than 30 papers on these topics, published in international conferences, workshops, journals and books.

Roberto Baldoni is a full professor at the Sapienza University of Rome. He conducts research (from theory to practice) in the fields of distributed, pervasive and p2p computing, middleware platforms and information systems infrastructure, with a specific emphasis on dependability and security aspects. Roberto Baldoni is director of the Sapienza Research Center for Cyber Intelligence and Information Security and, at the national level, is director of the Cyber Security National Laboratory. Recently, he has been appointed as coordinator of the National Committee for Cybersecurity Research, born in February 2017 as an agreement between the Italian National Research Council and the Cyber Security National Laboratory. A partial list of his publications can be found at DBLP, at Google Scholar and at the MIDLAB publication repository.
