2018 - Countering Android Malware - A Scalable Semi-Supervised Approach For Family-Signature Generation

This document summarizes an article that proposes a scalable semi-supervised framework to automatically cluster Android applications into families and generate signatures to identify them. The framework was tested on a database of 1.5 million Android apps and has been successfully deployed on an anti-malware platform called Koodous since January 2018. It aims to reduce the effort required by humans to analyze malware by focusing only on representative samples from each identified family.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2874502, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI

Countering Android Malware: a Scalable Semi-Supervised Approach for Family-Signature Generation
ANDREA ATZENI¹, FERNANDO DÍAZ², ANDREA MARCELLI¹, ANTONIO SÁNCHEZ², GIOVANNI SQUILLERO¹ (Senior Member, IEEE), ALBERTO TONDA³
¹ Politecnico di Torino, DAUIN, Corso Duca degli Abruzzi 24, 10129 Torino, Italy (e-mail: {andrea.atzeni, andrea.marcelli, giovanni.squillero}@polito.it)
² Hispasec Sistemas S.L., C/ Trinidad Grund 12, 1A-1B, 29001 Málaga, España (e-mail: {asanchez, fdiaz}@hispasec.com)
³ INRA, UMR 782 GMPA, Avenue Lucien Brétignières, 78850 Thiverval-Grignon, France (e-mail: [email protected])
Corresponding author: Andrea Marcelli (e-mail: [email protected]).

ABSTRACT Reducing the effort required by humans in countering malware is of utmost practical value.
We describe a scalable, semi-supervised framework to dig into massive datasets of Android applications
and identify new malware families. Up to the 2010s, the industrial standard for the detection of malicious
applications has been mainly based on signatures; as each tiny alteration in malware makes them ineffective,
new signatures are frequently created — a task that requires a considerable amount of time and resources
from skilled experts. The framework we propose is able to automatically cluster applications in families
and suggest formal rules for identifying them with 100% recall and quite high precision. The families are
used either to safely extend experts’ knowledge on new samples, or to reduce the number of applications
requiring thorough analyses. We demonstrated the effectiveness and the scalability of the approach running
experiments on a database of 1.5 million Android applications. In January 2018, the framework was
successfully deployed on Koodous, a collaborative anti-malware platform.

INDEX TERMS Semi-supervised learning, Clustering, Android, Malware, Automatic signature generation

I. INTRODUCTION
Android's first malware, FakePlayer, was released in August 2010 [1] and, since then, the number of new malware samples has steadily increased [2]. After only seven years, malware programs are hundreds of times bigger than the old FakePlayer, hide their presence and activities, and can even communicate secretly through complex anonymous networks.

Android offers an open market model, where millions of applications are downloaded by users every day. While applications from the official Google Play store undergo a review process to confirm that they comply with Google policies [3], third-party markets do not. Hence, a typical pattern among malware developers is to repackage popular applications from Google Play by adding malicious features and distribute them to third-party app stores, leveraging the apps' popularity to accelerate malware propagation.

In the personal-computer ecosystem, malware developers commonly exploit executable packing and other code obfuscation techniques to generate a large number of polymorphic variants of the same malicious application [4], [5]. As a consequence, antivirus (AV) software struggles to keep its signature database up to date, and AV scanners suffer from a considerable number of false negatives [6]. Moreover, the malicious code is often reused and customized to fit different needs. For example, a developer may reuse the rootkit installation code, while replacing the modules that provide network connectivity to a Command-and-Control server.

By the end of the 2010s, the Android ecosystem is facing a similar scenario, although the situation is exacerbated by the simplicity of malicious repackaging [7], that is, an alteration of the original application installation package (i.e., the APK file), where legitimate applications are reverse engineered, modified to include malicious code, signed with a new signature, and eventually distributed for download. Since applications consist of bytecode, changes are relatively easy to implement and ad-hoc tools assist the procedure [8], [9].

The growth of Android malware created a major challenge for AV vendors to efficiently handle new samples and accurately label them. Due to the practical impossibility

2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
of manually analyzing the thousands of suspicious samples received every day, a large fraction is left unlabeled, delaying the signature generation.

While malware variants can be generated at a high pace, they are likely to perform similar malicious activities when executed. One possible solution would be to automatically cluster such applications in a family and focus the manual analysis on a few archetypal samples, with the underlying assumption that malware bearing significant similarities is likely to derive from the same code base [10]. Furthermore, the label of a new sample of a known family could be automatically derived, and existing signatures or other mitigation techniques could be more easily extended to cover the new threats.

Eventually, if a large number of malware samples belonging to the same family is identified, it may become possible to define a generic behavioral signature able to detect future variants with reduced false positives and false negatives [11]. Therefore, a sharp clustering is crucial to help AV companies categorize the large number of samples, avoid duplicate work, and allow analysts to prioritize their limited resources on novel and representative samples [12], [13].

In this article, we describe a semi-supervised system for the analysis of massive datasets of malicious applications. We created a platform able to suggest new families of applications to human experts; the platform also generates an intelligible YARA rule [14] to identify family members with high precision. We explicitly minimize false positives, a business hazard and a reputation blow for AV vendors. The approach relieves human experts of the burden of manually inspecting thousands of Android applications, while letting them take critical decisions. The main contributions of this article can be summarized as:

• We introduce a scalable system for the analysis of massive Android malware datasets based on careful feature engineering and a standard clustering algorithm. The mechanism is demonstrated to be robust and able to overcome the well-known limitations of traditional signature-matching mechanisms.
• We propose an algorithm that, starting from a cluster of samples, generates its family signature as a YARA rule. Thanks to exact and heuristic evaluations, such rules are intelligible and appear reasonable to human experts. Moreover, the algorithm guarantees zero false positives in the existing dataset, and limits the possibility of false positives in the future.
• We report experiments on a dataset of about 1.5 million Android applications, and results show the scalability of the approach. We use a set of internal and external indicators to demonstrate that the proposed system performs an accurate and efficient automatic identification of groups of similar applications. By exploiting limited data, the framework is able to propose insightful extensions to the rule detecting suspicious applications.
• Finally, our framework has been deployed and has been in use on Koodous¹, the mobile AV platform from Hispasec, since January 2018.

¹ https://koodous.com/

The rest of the paper is organized as follows: Section II illustrates the problem statement and motivation, and Section III introduces Koodous; Section IV describes in detail the proposed approach, while experimental results and performance evaluations are presented in Section V; Section VI surveys related work about Android malware and automated analysis procedures; limitations and future works are discussed in Sections VII and VIII; Section IX concludes the paper.

II. PROBLEM STATEMENT AND MOTIVATION
Since the 2000s, academia has proposed approaches based on machine learning aiming at completely replacing humans in the malware analysis process. In most cases, such proposals fell back into mere classification, that is, supervised machine learning. The drawbacks included the need for large amounts of accurately labeled, i.e., already analyzed, data, and the lack of control over the false positives eventually produced, a major cause of concern for all AV vendors. As a result, AV companies developed systems mostly based on the reliable signature-detection mechanism. Even though signatures suffer from the so-called “specificity” problem, and new ones need to be frequently generated, they have been demonstrated to be effective, scalable, and almost unaffected by false positives.

The proposed framework is semi-supervised and introduces essential improvements in the identification of similar applications and the generation of family signatures. It combines the scalability of fully automatic techniques for clustering and the optimization of new family signatures, while it exploits manual analysis, inherently more flexible and accurate, in a few crucial steps, such as the validation of newly discovered malware families.

Traditionally, the effort of automatically classifying and analyzing malware focuses on content-based signatures that specify binary sequences. However, content-based signatures are inherently vulnerable to malware obfuscation: even if all variants of a malicious application share the same functionalities and exhibit the same behavior, they can have slightly different syntactic representations. As a consequence, a huge number of signatures needs to be created and distributed by AV companies.

On the other hand, a rule that automatically identifies the behavior of a family of samples would be the first step towards the creation of true family signatures. Such a signature would match all samples of a family, and would significantly help to reduce the number of signatures required to cover it. Moreover, as new samples could be mapped to a family behavior already known, the time and effort required to analyze and reverse engineer new samples would be reduced.

Differently from the previous approaches, the proposed system generates effective, precise and descriptive rules using the properties directly extracted from both static and dynamic analyses. While aiming at reducing false positives


and false negatives, it also exploits a heuristic measure to emulate how expert analysts write existing signatures.

III. KOODOUS
Koodous is a collaborative platform for research on Android malware that combines online analysis tools with social interactions between the analysts. Started in 2014, in four years it has collected one of the largest repositories of Android applications: its databases contain more than 30 million applications, of which 7 million have already been identified as malicious. Fig. 1 illustrates the trend of application submission and detection from October 2014 until March 2017.

FIGURE 1. Monthly trend of application submission and detection in Koodous from October 2014 until March 2017.

Koodous provides both an analysis service and end-point protection: upon submission, each application is analyzed both statically and dynamically, and the final report is accessible through a web interface specifically designed to help analysts detect new malware threats. Analysis tools include a custom version of Androguard [15], CuckooDroid² and DroidBox [16].

Instead of relying on a closed group of expert malware analysts, Koodous takes advantage of an open community to identify malicious applications. Furthermore, in order to guarantee high-quality results, manual detections are subject to reputation-based checking. Moreover, protection is guaranteed through an Android application, which relies on the cloud platform to detect the most recent threats³.

Koodous uses YARA to describe patterns for detecting malware applications: since the creation of high-quality YARA rules requires a considerable effort, the platform also offers the possibility to identify malware through a simpler voting mechanism, an operation referred to as “triage”. As of July 1, 2018, more than 2.5 million applications are detected by triage⁴.

² https://github.com/idanr1986/cuckoo-droid
³ https://play.google.com/store/apps/details?id=com.koodous.android
⁴ For the up-to-date figure, visit https://koodous.com/apks?search=rating:%3C-1%20%26%20detected:1.

IV. PROPOSED FRAMEWORK
Our framework operates through three main steps, detailed in Sections IV-A, IV-B and IV-C.
1) Similarities among Android samples are discovered through an iterative clustering process, offering a new point of view and valuable information to malware analysts.
2) Families of suspicious applications are identified taking advantage of the knowledge already available in Koodous, and extensions to the current detection rules are proposed.
3) Signatures are generated to identify the malware families with an acceptable generalization capability, yet a reduced risk of false positives in the future.

A. ITERATIVE CLUSTERING
Clustering provides a mechanism to automatically categorize applications into groups that reflect their similarity, both in source code and runtime behavior. We exploit HDBSCAN, a density-based algorithm, as it fits most of our requirements.

Density-based clustering algorithms locate high-density regions in the feature space; DBSCAN (density-based spatial clustering of applications with noise) is probably the best known among them [17]. Density-based algorithms can effectively discover clusters of arbitrary shape and filter out outliers, eventually increasing cluster homogeneity. Additionally, the number of expected clusters to be found in the data is not required: our aim is to discover groups of similar applications without any prior knowledge about their composition, and the number of clusters is hard to guess a priori.

In 2013, Campello et al. [18] proposed HDBSCAN, a new density-based algorithm that converts the original DBSCAN into a hierarchical clustering algorithm. As a matter of fact, HDBSCAN finds clusters of varying densities, and is more robust to parameter selection. Moreover, it supports the GLOSH (global-local outlier score from hierarchies) outlier detection algorithm: during the fitting phase, each data point is associated with a score that represents its likelihood of being an outlier; at the end of the process, outliers are selected via upper quantiles [19].

In low-dimensional spaces, HDBSCAN has an average complexity of approximately O(n log n), while its space requirement is O(n), making it applicable to moderately large datasets [20].

As the number of samples in malware datasets is in the tens of millions, we exploit an iterative process where the original dataset $D$ is divided into $m$ chunks $d_i$ of fixed size $N$ ($m = \lceil |D| / N \rceil$).

$$D = \bigcup_{i=0}^{m-1} d_i \qquad (1)$$

The parameter $N$ balances the quality of the results with the time required for the analysis, and can be set experimentally according to the available resources.

HDBSCAN is applied to each chunk of data $d_i$, finding at each step a set of clusters $c_i$ and a set of outliers $o_i$. Finally, all the outliers $O = \bigcup_{i=0}^{m-1} o_i$ are clustered together in order to find even those small groups of applications whose samples are spread through several chunks of data. In the end, the total number of required iterations is equal to $m + 1$.

Since HDBSCAN can be executed in parallel on the first $m$ chunks, the benefit of the iterative approach is the huge reduction in the time required for the analysis. On the other hand, a few applications could be misclassified as outliers and the same group of similar applications could be found multiple times, although, as shown in Section IV-B, those corner cases do not limit the framework efficacy.

1) Features selection
Accurate feature selection is a crucial step in every machine learning approach. As suggested in [11], we exploit aggregate information: from the analysis result of each application, we extract a subset of “statistical” properties, meant as quantitative measures of a malware behavior. Indeed, we experimentally found that exploiting statistical similarities among applications, rather than “structural” properties which exactly describe the malicious behavior, does not significantly alter the results, while at the same time it significantly reduces the amount of data to process.

Starting from a set of $n$ analysis reports $r_i$ provided by Koodous, each report $r_i$ is translated into a feature vector $v_i = (f_0, \ldots, f_{34})$ containing the 35 statistical properties. Table 1 summarizes the features extracted from the results of the static and dynamic analysis.

TABLE 1. List of the 35 statistical properties extracted from the analysis result of each APK file. Features are grouped according to the type of analysis. Static features are extracted using Androguard, both parsing the Manifest file and looking for interesting API calls in the decompiled source code. Dynamic features are extracted using DroidBox and CuckooDroid from the dynamic analysis of the application.

Analysis method | Software | Statistical property
Parsing Manifest file | Androguard | Filters; Activities; Receivers; Services; Permissions
Statically from APK | Androguard | Accounts; Advertisement; Browser history; Camera; Crypto functions; Dynamic broadcast receiver; Installed applications; Run binary; MCC; ICCID; IMEI; IMSI; SMS; MMS; Phone call; Phone number; Sensor; Serial number; Socket; SSL
Dynamically | DroidBox | Files written; Crypto usage; Files read; Send SMS; Send network; Recv network
Dynamically | CuckooDroid | HTTP request; Hosts; Domains; DNS
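Once every report has been turned into a feature vector, the chunked procedure of Section IV-A can be illustrated with a short Python sketch: the dataset is split into m = ⌈|D|/N⌉ chunks, each chunk is clustered on its own, and the pooled outliers are clustered in one final run, for m + 1 runs in total. This is a minimal sketch under stated assumptions, not the deployed implementation: all function names are hypothetical, and toy_cluster is a deliberately trivial stand-in for HDBSCAN that only groups identical vectors.

```python
from collections import defaultdict
from math import ceil


def split_into_chunks(dataset, chunk_size):
    """Partition D into m = ceil(|D| / N) chunks of at most N items each."""
    m = ceil(len(dataset) / chunk_size)
    return [dataset[i * chunk_size:(i + 1) * chunk_size] for i in range(m)]


def toy_cluster(points, min_size=2):
    """Hypothetical stand-in for HDBSCAN (illustration only): identical
    vectors form a cluster if at least min_size of them occur; all other
    points are reported as outliers."""
    groups = defaultdict(list)
    for p in points:
        groups[p].append(p)
    clusters = [g for g in groups.values() if len(g) >= min_size]
    outliers = [p for g in groups.values() if len(g) < min_size for p in g]
    return clusters, outliers


def iterative_clustering(dataset, chunk_size, cluster_fn=toy_cluster):
    """Cluster each chunk, then cluster the pooled outliers once more."""
    chunks = split_into_chunks(dataset, chunk_size)
    clusters, pooled_outliers = [], []
    for chunk in chunks:  # the m per-chunk runs are independent of each other
        found, outliers = cluster_fn(chunk)
        clusters.extend(found)
        pooled_outliers.extend(outliers)
    found, residual = cluster_fn(pooled_outliers)  # run number m + 1
    clusters.extend(found)
    return clusters, residual, len(chunks) + 1


# Toy feature vectors (two features instead of 35, for readability).
data = [(1, 0), (1, 0), (2, 5), (3, 3), (2, 5), (9, 9), (3, 3), (3, 3)]
clusters, residual, runs = iterative_clustering(data, chunk_size=3)
```

With this toy data, the two (2, 5) vectors fall into different chunks and are grouped only in the final run over the pooled outliers, while one (3, 3) vector stays an outlier although a (3, 3) cluster is found in another chunk. Both effects mirror the corner cases mentioned in Section IV-A.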
In more detail, the static analysis performed by Androguard extracts the features from the Manifest file (i.e., number of activities, permissions, receivers, filters) and from the source code analysis. The former unveils similarities among applications based on the software architecture used to develop the application, while the latter models each application by extracting portions of code related to suspicious API calls (e.g., number of calls to the SMS API, to IMEI, or to other network-related methods). On the other hand, the dynamic analysis extracts features that model the application's interaction with the surrounding operating system, both at file system and network level, extracted by DroidBox (e.g., files written, usage of cryptographic methods, SMS sent), and the network information extracted by CuckooDroid (e.g., number of DNS names resolved, HTTP requests). These are the standard types of information extracted in the field of malware analysis.

Because the range of each feature is quite different, the dataset is first normalized so that the features have mean equal to zero and variance equal to one. As the choice of the distance to use during cluster analysis is tied to the type and the dimension of the selected features, we determined experimentally that the Euclidean distance delivered the best performance.

B. EXTENDING MALWARE DETECTION
Starting from millions of samples, the iterative clustering (Section IV-A) identifies a smaller number of families composed of strongly related applications. In some cases, by combining this result with the information already available in Koodous, these families may be automatically labeled, as they extend either known threats or legitimate software. In the other cases, experts are required to manually evaluate the family, but they need to analyze only a few representative samples of the group and not all applications, therefore drastically reducing the time required by the analysis. This process exploits the “clustering assumption” of semi-supervised learning algorithms, which states that two points which are in the same cluster (i.e. which are linked by a high

density path) are likely to share the same label. In such a way, the partial information of the few labels extracted from each cluster can be used to increase the knowledge of all the applications within the same group.

FIGURE 2. The figure illustrates the subdivision of the applications in the database and the seven types of families (i.e., clusters) that can be automatically inferred by the proposed approach. The database is divided in three macro areas according to the type of detection: applications detected by signatures, by triage only, and undetected. Each point in the figure represents an application, and based on the detection status of the applications within each cluster, the proposed approach identifies seven possible cases.

The set of all applications in Koodous $K$ may be partitioned into three subsets, $K = S \cup T \cup U$, corresponding to the applications detected by signatures ($S$), detected by triage only ($T$), and undetected ($U$); applications detected both by signatures and in the triage phase belong to the $S$ set. Such a partition does not reflect a peculiarity of Koodous, as the usage of a staging area $T$ where samples are pointed out waiting further analysis is common in AV laboratories.

It is possible to classify a family according to the different subsets its applications belong to (Fig. 2). The resulting seven different types of family correspond to the powerset $\mathcal{P}(\{S, T, U\})$, excluding the empty set: { {S}, {T}, {U}, {S, T}, {S, U}, {T, U}, {S, T, U} }

• Type 1 $\{\forall s \in F^{(1)} \mid s \in S\}$. The family is composed of applications that have been already detected by YARA signatures. No further action is required, although the generated family rule may still be effective to generalize the detection.
• Type 2 $\{\forall s \in F^{(2)} \mid s \in S \cup T\}$. The family includes applications already identified as malicious either by YARA, or during the triage process. The correctness of the detection is either guaranteed by the existing signatures, or by the triage process (i.e., the community votes); thus a new YARA rule matching all the applications in the family can be automatically generated and added to the detection system without further manual check.
• Type 3 $\{\forall s \in F^{(3)} \mid s \in T\}$. The family is composed of applications that have been detected through the triage process only. The correctness of the detection is guaranteed by the triage process, and as in the previous case, a new YARA rule can be automatically generated and added to the detection system without manual intervention.
• Type 4 $\{\forall s \in F^{(4)} \mid s \in S \cup U\}$. The family combines applications detected by existing signatures with undetected ones. In order to avoid false positives, the correctness of the family must be manually validated before generating a family signature.
• Type 5 $\{\forall s \in F^{(5)} \mid s \in S \cup T \cup U\}$. The family combines applications detected both by signatures and by triage only, with undetected ones. As in the previous case, in order to guarantee complete correctness the family must be manually validated before generating a signature.
• Type 6 $\{\forall s \in F^{(6)} \mid s \in T \cup U\}$. The family combines applications detected by the triage process only with undetected ones. As in the two previous cases, the family must be manually validated before generating a family signature.
• Type 7 $\{\forall s \in F^{(7)} \mid s \in U\}$. The family is composed of undetected applications, hence no classification can be automatically inferred. However, as all the applications within the cluster show strong similarities, the analysis of a few representative samples shall be sufficient to classify the whole cluster as malware or goodware.

Such an approach offers apparent benefits: the need for human intervention is often limited to the simple validation of the discovered family, while the need for full analysis is reduced to a few representative samples. The identification of families with only partially detected applications, either by signature or during the triage process, makes it possible to discover false negatives and new 0-day malware.

In Koodous, the triage process makes it possible to quickly identify threats without the burden of creating signatures, although it has the drawback of potentially leaving other similar applications undetected. Our framework may automatically convert all the knowledge about single, unrelated threats into more reliable signatures, potentially able to discover newer variants as well.

Finally, among Type 7 families, the system is able to identify groups of legitimate software, for example finding applications written by the same developer or using the same framework. This result proved to be of practical importance to limit and correct false positive detections.

C. FAMILY SIGNATURES GENERATION
In the last step of our framework, a signature is generated for each family that has been identified as malicious⁵. We developed an automatic procedure that starts from a set of applications, and eventually produces a YARA rule describing them. The program has no requirements on the origin of the set: it could be the result of automatic clustering or manual selection, although the more the applications in the set are related, the better the result.

⁵ Our system could generate family signatures of legitimate applications as well, but they would be of no use for Koodous.
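Which families reach signature generation directly depends on their type as defined in Section IV-B, and that type follows mechanically from the detection status of the family members. The Python sketch below illustrates the mapping; it is only an illustration, and the status codes "S", "T", "U", the function names, and the action strings are assumptions introduced here rather than part of Koodous.

```python
# Detection status of one application, following the partition of Section IV-B:
# "S" = detected by a signature, "T" = detected by triage only, "U" = undetected.
FAMILY_TYPES = {
    frozenset("S"): 1,
    frozenset("ST"): 2,
    frozenset("T"): 3,
    frozenset("SU"): 4,
    frozenset("STU"): 5,
    frozenset("TU"): 6,
    frozenset("U"): 7,
}


def family_type(statuses):
    """Return the family type (1-7) given one status code per application."""
    present = frozenset(statuses)
    if present not in FAMILY_TYPES:
        raise ValueError(f"empty family or unknown status: {sorted(present)}")
    return FAMILY_TYPES[present]


def next_action(ftype):
    """Summarize the handling that Section IV-B describes for each type."""
    if ftype == 1:
        return "already detected; a family rule may still generalize detection"
    if ftype in (2, 3):
        return "generate signature automatically"
    if ftype in (4, 5, 6):
        return "validate manually, then generate signature"
    if ftype == 7:
        return "analyze a few representative samples manually"
    raise ValueError(f"unknown family type: {ftype}")


# A cluster mixing triage-detected and undetected applications is Type 6.
example = family_type(["T", "U", "U", "T"])
```

Using the set of statuses that occur in the cluster, rather than their counts, matches the definition of the seven types as the non-empty subsets of {S, T, U}.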

Authoring an effective signature requires considerable effort and experience. Good signatures are compact, and they have the ability to generalize, that is, to identify all known variants of the malware and even possible new ones. Moreover, they do not yield false-positive results by detecting non-family members, and finally, they appear intelligible to human experts and are almost self-explanatory.

More formally, a signature $S$ is the disjunction of $n$ clauses, $S = \bigvee_{i=0}^{n} c_i$. A clause $c_i$ is a finite conjunction of $m_i$ literals, $c_i = \bigwedge_{k=0}^{m_i} l_k^i$. In the present context, a literal $l_k^i$ is a single feature specified in the report resulting from the analysis of the application.

Traditionally, a YARA rule is defined on unique strings found in malware but not present in legitimate programs; quite differently, we generate precise, descriptive rules using the structural properties extracted by the static and dynamic analyses. Our program identifies an optimal set of clauses for matching all target applications while yielding no false positives in the current database; moreover, thanks to some heuristics, the rule has a good ability to generalize, a low risk of detecting false positives in the future, and it appears reasonable to the eye of human experts.

An example of an automatically generated YARA rule for the Syringe malware family is shown below⁶. It may be noted that the statistical features exploited during clustering (Section IV-A) are not used in the rule, as they would result in over-complicated rules hardly understandable by humans.

    rule YaYaSyringe {
        condition:
            androguard.filter("action.BATTERYCHECK")
            and androguard.number_of_services == 3
            and androguard.permission("SYSTEM_ALERT_WINDOW")
            and androguard.url("http://s.adslinkup.com/v2")
            ...
    }

⁶ The complete version of the rules is available on Koodous at https://koodous.com/rulesets/3243

The process is performed in three steps: a reasonable signature composed of a small number of clauses is generated; the signature is checked against the full database of applications, and false positives are identified; the generation procedure is run again, but explicitly taking into consideration the false positives discovered in the second step.

Fig. 3 exemplifies the process of generating a signature for two malware samples $m_a$ and $m_b$, and two legitimate applications $g_a$ and $g_b$. In the first phase, the algorithm defines a signature $Y = r$, where $r$ is a single clause composed of the features common to the two malware samples: $r = m_a \cap m_b$. Indeed, a rule $Y$ detects an application $m$ only if $Y$ is a subset of $m$: $Y \subseteq m$.

During the second phase, the rule $Y$ is checked against the complete database, where it generates two false positives matching two legitimate applications $g_a$ and $g_b$. The clause $r$ is therefore too generic to be used as a signature. As it is not possible to find features common to the malware samples that do not match the legitimate applications, that is, $(m_a \cap m_b) \setminus g_a = \emptyset$ and $(m_a \cap m_b) \setminus g_b = \emptyset$, the third step generates a signature with the disjunction of two clauses $Y^* = (r \wedge r_a) \vee (r \wedge r_b)$.

FIGURE 3. Schema of the process of generation of a YARA rule. In the first phase a signature $Y = r$ is defined for malware $m_a$ and $m_b$. In the second phase $Y$ is checked against a dataset of goodware ($g_a$ and $g_b$). Finally, in the third phase, a new signature $Y^* = (r \wedge r_a) \vee (r \wedge r_b)$ is created to avoid false positive detection of $g_a$ and $g_b$.

Algorithm 1 Automatic YARA rule generation
1: procedure GENERATESIGNATURE(R)
2:     C ← Clauses(R, ∅)
3:     Y ← Clot(R, C)
4:     G ← GetFalsePositives(Y)
5:     C* ← Clauses(R, G)
6:     Y* ← Clot(R, C*)
7:     DumpAsYARARule(Y*)

The pseudocode of the algorithm is reported in Algorithm 1: at first it determines a suitable set of clauses (function Clauses), then picks a subset of them of variable size to build an optimal family signature (function Clot). Lines 2 and 3 correspond to the first phase of Fig. 3; line 4, to the second; lines 5 and 6, to the third.

Algorithms 2 and 3 add more details about the procedure: the function Clauses extracts the clauses that can be used to build the signature, and is based on a heuristic algorithm. First, each malware application $r_i$ in the target set $R$ is transformed into a single clause $y_i$ able to detect it using all available literals. Such clauses are not directly usable, but are the starting point of the iterative procedure for building the set of optimal clauses $H$: in each step, the least generic $y_i$ is selected and compared against all clauses in $H$, calculating the
common features $z_i$; the least generic of these $z_i$ is eventually considered for inclusion in $H$.

Algorithm 2 Clauses extraction
1: function CLAUSES(R, G)
2:     Y ← {Features(r) ∀r ∈ R}
3:     for all r ∈ R do
4:         Y ← Y ∪ SelectedClauses(r)
5:     H ← {Features(r) ∀r ∈ R}
6:     while |H| > 0 do
7:         h ← LeastGeneric(H)
8:         Z ← {CommonFeatures(h, y) ∀y ∈ Y}
9:         z ← LeastGeneric(Z)
10:        F = {r ∈ G | Det(z, r) = True}
11:        if F = ∅ and z ∉ Y and Quality(z) > Tq then
12:            H ← H ∪ {z}
13:            Y ← Y ∪ {z}
14:        H ← H \ {h}
15:    return Y

The rationale is to build $Y$ by adding clauses that are progressively less specific (i.e., checking fewer features), but still usable in signatures. Line 10 computes the set $F$ of applications from $G$ detected by the candidate clause; as $G$ is the set of all potential false positives, if $F$ is not empty the clause is too generic to be usable. Additionally, the function Quality(·) performs a heuristic evaluation of the clause: if the quality is below a certain threshold $T_q$, the rule is so generic that it is likely to create false positives in the near future (see Section IV-C1 for more details). For each application, a few not-too-generic, heuristically selected clauses are also included (i.e., $r_a$ and $r_b$ in the example shown in Fig. 3).

The function Clot (Algorithm 3) implements a dynamic greedy algorithm for building the signature as a disjunction of clauses. It iteratively adds one clause to $Y$ from a set $C$ until all applications in $R$ are detected by at least one clause in $Y$.

Algorithm 3 Clauses selection
1: procedure CLOT(R, C)
2:     Y ← ∅
3:     D ← ∅
4:     while R ≠ C do
5:         if ∃r ∈ R \ D : Critical(r) = True then
6:             r̄ ← GetCritical(R \ D)
7:             Z = {z ∈ C | Det(z, r̄) = True}
8:         else
9:             Z = {z ∈ C | ∃r ∈ R \ D : Det(z, r) = True}
10:        Y ← Y ∪ {MostUseful(Z)}
11:        D ← {r ∈ R | ∄y ∈ Y : Det(y, r) = True}
12:    return Y

In an iterative way, Clot first picks out all clauses that detect at least one application not yet detected by any rule, with the only exception that, if an application can be detected by only one clause, that clause is the only one picked. Then the algorithm selects among this group the clause that is able to detect the most applications in the original target set $R$.

1) Rule quality
A heuristic evaluation is used to reduce the risk of false positives in the future and to increase the perceived quality of the rule. We defined a heuristic score $S(\cdot)$ for a rule, inversely related to its generality. More formally, let us associate each literal $l$ with a score $S^*(l)$ that measures how specific the literal is. The score of a clause $c_i$ is the sum of the scores of the $n_i$ literals composing it: $S(c_i) = \sum_{k=0}^{n_i} S^*(l_k^i)$. The score of a rule $r$ is the minimum among the scores of its clauses: $S(r) = \min_{\forall i} S(c_i)$.

The higher the score, the more specific a rule is and the less susceptible it is to generating false positives. On the other hand, the lower the score, the more a rule will be able to generalize, while being more prone to generating false positives in the future. High quality signatures require an optimal balance between generality and specificity, and this is one of the main challenges in automatic signature generation. We use two thresholds $T_{min}$ and $T_{max}$, where the lower is the minimum score that a rule needs to be valid, and the higher is used in the optimization process to avoid overly-specific rulesets.

All the clauses in YARA rules created by expert analysts are valid, that is, the score assigned to literals must guarantee that $\forall r \in R_{expert} : T_{min} \leq S(r) \leq T_{max}$. We consider invalid the rules containing a clause mentioning only Android official permissions and intent filters, or containing a clause composed of a single literal, with the exception of accessing a URL that has been detected as malicious by VirusTotal or similar services. Then, we exploit the simplex method as a means to automatically define $S^*(\cdot)$ starting from the existing ruleset.

The simplex method is a linear programming technique, which refers to the problem of optimizing a linear objective function $\zeta$ of $m$ variables $x_i$ subject to a set of $n$ linear inequality constraints. In standard form, the problem of finding an optimal set of weights for $m$ literals can be expressed as:

$$\min \zeta = c^T x \quad \text{s.t.} \quad -A x \geq -b, \quad x \geq 0$$

where $c_i = 1, \forall i = 1 \dots m$, since the objective function $\zeta$ minimizes the number of literals in each clause, $x \in \mathbb{R}^m$ is a vector of $m$ unknown weights, and $b_i = T_{min}, \forall i = 1 \dots n$, as we want each existing literal combination to satisfy the minimum score of all existing rulesets.

Finally, $A$ is an $n \times m$ matrix that relates each clause to its own literals:

$$A = \begin{pmatrix} l_{11} & l_{12} & l_{13} & \dots & l_{1m} \\ l_{21} & l_{22} & l_{23} & \dots & l_{2m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & l_{n3} & \dots & l_{nm} \end{pmatrix}$$
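As a rough illustration of the scoring scheme and of the constraint behind the linear program, the sketch below computes clause and rule scores and builds the 0/1 rows of A for a toy rule. The weights are a subset of those listed in Table 3, and the thresholds are the values Tmin = 400 and Tmax = 700 reported in the paper. The two example rules, the helper names (clause_score, rule_score, is_valid, incidence_row), and the reading of the constraint as "each clause must reach Tmin" are assumptions made only for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only: helper names and example clauses are hypothetical.
# Weights are copied from Table 3; T_MIN and T_MAX are the paper's thresholds.
WEIGHTS = {
    "url": 400,
    "activity": 150,
    "service": 150,
    "intent_filter": 150,
    "permission_dangerous": 80,
    "permission_normal": 7,
}
LITERAL_TYPES = sorted(WEIGHTS)  # fixed column order for the rows of A
T_MIN, T_MAX = 400, 700


def clause_score(clause):
    """S(c): sum of the weights S*(l) of the literals in one clause."""
    return sum(WEIGHTS[literal] for literal in clause)


def rule_score(rule):
    """S(r): minimum clause score, i.e. the score of the most generic clause."""
    return min(clause_score(clause) for clause in rule)


def is_valid(rule):
    """A rule is accepted here when T_MIN <= S(r) <= T_MAX."""
    return T_MIN <= rule_score(rule) <= T_MAX


def incidence_row(clause):
    """One 0/1 row of A: 1 where the literal type occurs in the clause."""
    return [1 if literal in clause else 0 for literal in LITERAL_TYPES]


# Two hypothetical rules in DNF: a rule is a list of clauses,
# and each clause is a set of literal types.
specific_rule = [
    {"url", "intent_filter"},
    {"activity", "service", "intent_filter"},
]
generic_rule = [{"permission_normal", "permission_dangerous"}]

x = [WEIGHTS[literal] for literal in LITERAL_TYPES]  # candidate weight vector
for clause in specific_rule:
    row = incidence_row(clause)
    # The product of a row of A with x is exactly the clause score.
    assert sum(a * w for a, w in zip(row, x)) == clause_score(clause)

print(rule_score(specific_rule), is_valid(specific_rule))
print(rule_score(generic_rule), is_valid(generic_rule))
```

Taking the minimum over clauses means that a single weak clause is enough to make the whole rule too generic, which matches the requirement that every clause of an expert rule be valid.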
In the matrix $A$, $l_{nm} = 1$ if literal $m$ belongs to the clause $c_n$, and $l_{nm} = 0$ otherwise. In order to get the list of all the $n$ existing clauses $c_i$, we first reduced all the available YARA rulesets to disjunctive normal form (DNF).

We set almost arbitrarily the values $T_{min} = 400$ and $T_{max} = 700$ for the two thresholds. Table 2 reports the details about the rules, clauses, and unique clauses that have been analyzed, using the YARA rules from both Koodous⁷ and the Yara-Rules repository on GitHub⁸. Table 3 shows the final result, where each literal type is assigned its own weight.

⁷ https://koodous.com/rulesets
⁸ https://github.com/Yara-Rules/rules/tree/master/Mobile_Malware

TABLE 2. Details about the number of rules, clauses, and unique clauses analyzed to find the optimal score for each literal.

Source                  Num. of YARA rules   Num. of DNF clauses (Total)   Num. of DNF clauses (Unique)
Koodous public rules    348                  788                           104
Yara-Rules on GitHub    348                  697                           48

TABLE 3. Weights assigned to each type of literal as a result of the simplex method optimization. Weights are used by the automatic procedure to generate new YARA rulesets.

Module Name   Literal type                  S*(·)
Androguard    App name                      100
Androguard    Package Name                  100
Androguard    Certificate SHA1              150
Androguard    Certificate Subject           100
Androguard    Certificate Issuer            100
Androguard    Main Activity                 50
Androguard    Activity                      150
Androguard    Service                       150
Androguard    Broadcast Receiver            100
Androguard    Intent Filter                 150
Androguard    Content Provider              80
Androguard    Functionality                 15
Androguard    URL                           400
Androguard    Permission Normal             7
Androguard    Permission Dangerous          80
Androguard    Permission Not third party    50
Androguard    Permission System             80
Androguard    Permission with Typos         150
Androguard    Permission non standard       50
Cuckoo        DNS lookup                    400
Cuckoo        HTTP request                  400

V. CASE STUDY
As a case study we used a dataset of 1.5 million Android applications collected during 2016. The dataset is recent and diverse in the set of attack vectors it represents: in order to have the same ratio between detected and undetected applications as in Koodous, we sampled a subset of 1 million apps⁹. As a result, the dataset under analysis is composed of 65% undetected applications, 31% detected by signatures, and 4% detected through triage only.

⁹ In order to ensure the quality of the results and avoid artifacts, the sampling of 1 million applications has been repeated three times: in all the cases the proposed techniques showed coherent results.

A. EVALUATING CLUSTERING RESULTS
HDBSCAN has two parameters that mostly influence the results of the clustering: min cluster size (mss) determines the smallest size of a cluster, while min samples (ms) determines how conservative the results are. A higher value of min samples restricts clusters to denser areas, but it also increases the number of outliers. We use mss = 3 and ms = 1; in other words, we considered only malware clusters containing a minimum of three samples as representative of a malware family.

We used a high-performance, open-source implementation of HDBSCAN in Python from Leland McInnes [21]. All experiments were performed on a 6-core Intel Xeon (CPU E5-1650 v2 @ 3.50GHz), with 128 GB of RAM, although HDBSCAN only used up to four cores and 6 GB of RAM in each run.

The quality of the clustering results is evaluated as a measure of the ability to correctly extend malware detection to undetected applications. However, given the difficulty of establishing a reliable ground truth in the field of malware analysis, evaluating the results was challenging. Finally, for the clustering validation we used all the available information: detection results and AV labels extracted from VirusTotal reports, and signature labels extracted from existing YARA rules in Koodous.

TABLE 4. Comparison of Homogeneity (Hom.) and Completeness (Comp.) index values between the families inferred by the clustering process (using both the iterative clustering with different chunk sizes N, and the non-iterative version), and the family labels extracted from Koodous and VirusTotal.

                 Koodous labels      VirusTotal labels
N                Hom.     Comp.      Hom.     Comp.
50k              0.96     0.36       0.85     0.49
100k             0.96     0.35       0.85     0.49
200k             0.96     0.35       0.85     0.50
non-iterative    0.92     0.36       0.78     0.50

Since clustering exploits the relationship between statistical similarities among applications, in contrast to the structural properties commonly used in AV signatures, no one-to-one correspondence between clusters and AV labels is expected; however, by combining several indexes we deliver a trustworthy quality measure of clustering performance. In order to estimate cluster assignment, we adopt the Adjusted Rand Index in combination with other external indexes as proposed by Rosenberg et al. [22]:
• Adjusted Rand Index (ARI) is defined as the number of pairs of items that are either both in the same cluster or both in different clusters in the two partitions, normalized over the total number of pairs of items. The index lies between 0 and 1: when two partitions agree perfectly, the Rand index achieves the maximum value 1, and more in general a larger adjusted Rand
index means a higher agreement between two partitions. Moreover, ARI supports the measure of the agreement even when the compared partitions have different numbers of clusters.
• Homogeneity (Hom.), which measures whether each cluster contains only data points that are members of a single class.
• Completeness (Comp.), which measures whether all the data points that are members of a given class are elements of the same cluster.
• V-measure (V-ms.), measured as the weighted harmonic mean of homogeneity and completeness; this is useful since homogeneity and completeness of a clustering solution run roughly in opposition: increasing the homogeneity of a clustering solution often results in decreasing its completeness.

FIGURE 4. Number of total applications, and newly automatically inferred detections, for each type of malware family (Type 2...6). Results refer to the iterative clustering approach, using chunk size N = 100k, over a dataset of 1 million applications.
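The four external indexes listed above can be sketched in a few lines of standard-library Python. The code below follows the usual textbook definitions (pair counting for ARI, conditional entropy for homogeneity and completeness); the function names and the toy labels are hypothetical, and the paper does not state which implementation was actually used. Note that, under this standard definition, the adjusted index can fall below 0 when the agreement is worse than chance.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(classes, clusters):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(classes)
    if n < 2:
        return 1.0
    same_both = sum(comb(v, 2) for v in Counter(zip(classes, clusters)).values())
    same_class = sum(comb(v, 2) for v in Counter(classes).values())
    same_cluster = sum(comb(v, 2) for v in Counter(clusters).values())
    expected = same_class * same_cluster / comb(n, 2)
    maximum = (same_class + same_cluster) / 2
    if maximum == expected:
        return 1.0
    return (same_both - expected) / (maximum - expected)


def entropy(labels):
    """Shannon entropy H(X) of a labelling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given), computed from the joint label counts."""
    n = len(labels)
    joint = Counter(zip(labels, given))
    marginal = Counter(given)
    return -sum(c / n * log(c / marginal[g]) for (_, g), c in joint.items())


def homogeneity(classes, clusters):
    """1.0 when every cluster contains members of a single class."""
    h = entropy(classes)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(classes, clusters) / h


def completeness(classes, clusters):
    """1.0 when all members of a class fall in the same cluster."""
    return homogeneity(clusters, classes)


def v_measure(classes, clusters, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    h = homogeneity(classes, clusters)
    c = completeness(classes, clusters)
    return 0.0 if h + c == 0 else (1 + beta) * h * c / (beta * h + c)


# Toy example: two reference families, each split into two pure clusters.
families = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 2, 2, 3]
print(adjusted_rand_index(families, clusters))
print(homogeneity(families, clusters), completeness(families, clusters))
print(v_measure(families, clusters))
```

In the toy example every cluster is pure, so homogeneity is maximal while completeness is not; this is the same pattern that a finer-grained partition of known families would produce.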
Table 4 compares homogeneity and completeness index values between the families (i.e., clusters) inferred during the clustering process, and the family labels extracted from Koodous signature names and VirusTotal AV labels¹⁰. Results are compared using both the iterative clustering, with different chunk sizes N, and the non-iterative version.

¹⁰ The comparison with VirusTotal AV labels is limited to 100,000 randomly selected applications.

Since AVs listed in VirusTotal commonly use different names to identify the same type of threat, we took advantage of AVclass [23], an automated labeling tool that, given the labels of multiple antivirus engines, returns the most likely family names for each sample, focusing on normalization, removal of generic tokens and alias detection. The implementation is open-source, available on GitHub [24], and natively provides VirusTotal integration.

Interestingly, all the cases reported in Table 4 show a very high homogeneity value, which indicates that malware families identified by AV signatures are further split into finer partitions during the clustering process. Moreover, precise clusters increase the effectiveness of the subsequent automatically generated signatures.

1) Extending malware detection
Fig. 4 illustrates the result of the automatic detection extension for the 1 million applications under analysis: each bar in the plot is related to a family type (refer to Section IV-B for an accurate description of each type of malware family), illustrating both the total number of applications, and the number of those automatically identified as malicious. Results are obtained using the iterative clustering approach, with chunk size N = 100k. Note that Type 1 and 7 families are not shown, as the former consist of applications that are already completely detected by signature, while the latter include families found within unknown applications, hence no direct information about their composition can be automatically inferred.

Table 5 is complementary to Fig. 4, as it compares the number of families, for each family Type, using both the iterative clustering with different chunk sizes N, and the non-iterative version.

TABLE 5. Number of families automatically inferred by the clustering algorithm (using both the iterative clustering with different chunk sizes N, and the non-iterative version), using a dataset of 1 million applications. Results are gathered for each type of malware family (Type 2...6).

N                Type 2    Type 3    Type 4    Type 5    Type 6
50k              1,890     2,949     1,467     463       2,846
100k             1,477     2,439     1,519     500       3,385
200k             1,193     2,203     1,436     536       3,133
non-iterative    435       1,046     2,126     536       4,629

Among the clusters of Type 2 and 3, the system automatically identifies a total of 21,450 new malicious applications that will be automatically covered by new signatures, without requiring any human intervention. In more detail, 5,386 applications (Type 2) are found within clusters with other apps already detected by YARA signatures, while 16,064 applications (Type 3) are assigned to clusters purely made of applications detected during the triage phase only. As a matter of fact, generating new family signatures for these applications makes it possible to transform the knowledge of existing threats into a more reliable and scalable form of detection, without affecting the precision of the results: all those applications have already been identified as malicious by the community of malware experts.

On the other hand, 34,818 applications are assigned to families of Type 4, 5 and 6: 20,464 are the newly identified potential threats, since they were previously marked as undetected. In this case, the proposed framework allows an easy identification of hard-to-find potential threats, reducing the human intervention from the manual analysis of thousands of applications to the validation of a much smaller number of families where applications reflect a similar behavior, eventually speeding up the procedure of new malware discovery. For example, the system identified a total of 500 families for the Type 4 (refer to Table 5, second row), reducing by an order of magnitude the need for manual analysis; as a detailed analysis of a malicious
application could take a few hours, this approach results in a huge time saving.

2) Evaluation of malware detection extension
Aiming at evaluating the detection extension performance in a real-world case, we evaluate how accurate the proposed system is in relation to the detection information available in VirusTotal. We chose VirusTotal as a well-known and trustworthy source of information about existing threats since it collects the detection results from tens of independent AV companies. Moreover, other researchers have recently used the same metric [25].

In order to evaluate the detection extension results, we first assessed how precisely Koodous detects malware samples, and how effectively it covers all the malware variants. Starting from two randomly sampled subsets of 50,000 applications, respectively originally undetected and detected in Koodous, we cross-checked their maliciousness using VirusTotal. Results are illustrated in Table 6. The first line of the table (Koodous det.) shows that among detected applications, Koodous has 100% precision and very high recall (99.8%), as almost all Koodous detected applications are completely identified as malware by traditional AVs too, while only 100 applications (0.2% of the dataset) are unknown or undetected by VirusTotal. However, the second line of the table (Koodous und.) shows a very low accuracy (27.8%), as a consequence of a major diversity in the detection ratio among the applications undetected by Koodous and VirusTotal. Although such a difference could be partially explained by the different policies that traditional AVs use in identifying a malicious application, particularly regarding adware, this result further motivates the need for an automatic mechanism to increase the number of correct detections in Koodous.

TABLE 6. Comparison of the detection results between VirusTotal and two datasets of 50,000 applications, respectively undetected (und.) and detected (det.) by Koodous. Columns indicate the number of applications unknown (unk.), undetected (und.), detected by at least one AV (det.), and detected by more than three AVs, as reported by VirusTotal.

                VT unk.    VT und.    VT det.    VT det. >3
Koodous det.    18         72         49,910     49,717
Koodous und.    3,449      12,508     34,043     28,166

With the awareness that VT detection results are not completely reliable, we only considered those clusters for which the VT information is available for all the applications. In order to calculate the accuracy of the proposed system, we adopted the following metrics:
• if the system proposes an extension to a malware family where all the applications are detected by VT, we consider the extension as correct;
• if the system proposes an extension to a family where all the applications are undetected by VT, the extension is considered as incorrect;
• if the system proposes an extension to a cluster that mixes applications partially detected and undetected by VT, the result is considered unknown.

Table 7 illustrates the results. For each clustering experiment, each line of the table reports the number of applications that have been correctly or wrongly classified, according to the type of the cluster to which they were assigned. Without any human intervention, the system scores a minimum accuracy that ranges from 86.04% to 91.23%, and it has a worst-case error of 6.18%. A further manual inspection of the results revealed that several families completely undetected by VT are mostly related to aggressive adware samples, whose classification is subject to different considerations. Furthermore, results show that a smaller chunk size increases the precision of the detection, reducing the error, although the absolute number of applications automatically extended is smaller. Accordingly, the chunk size can be set in accordance with the needs of the system.

TABLE 7. Evaluation of the accuracy of the clustering system to automatically identify groups of malicious applications, by comparing the detection of the new applications with VirusTotal. Columns Correct and Incorrect respectively report the number of applications correctly or wrongly classified, while Min and Error illustrate the minimum precision and the maximum error of the proposed approach. Results are reported using both the iterative clustering with different chunk sizes N, and the non-iterative version.

N                Correct    Incorrect    Min %    Error %
50k              7,493      254          91.23    2.91
100k             12,502     877          86.04    6.03
200k             13,628     917          89.54    6.18
non-iterative    14,619     1,109        87.93    6.67

3) Example of manual analysis of a malware family

TABLE 8. Example of a Type 4 malware family. As the first two samples are already detected in Koodous by the YARA rule Xynyin.Trojan, the system identifies other applications within the cluster as potentially malicious too. The comparison with VirusTotal (the number of detections is reported) and a manual analysis confirm the accuracy of the system.

MD5                                 Detected by Koodous    VT detections
998faf5e7a0d45f6ad60903bc5d60817    Yes                    12
5a8dd85a5707f520563069bf536f9d5f    Yes                    19
695d6b9f97a9e992f8e321d36509c080    No                     0
304754e9f8f95228af0e7118d62e999f    No                     12
805d8770d6314f5adad266ddaba610e1    No                     10
23863ddba21b96aea3e8b2cc120bb2b2    No                     12

¹¹ https://koodous.com/rulesets/1225
¹² Detection results refer to 15 Nov 2016
¹³ MD5: 695d6b9f97a9e992f8e321d36509c080

Table 8 shows an example of a Type 4 malware family. As the first two samples are already detected by the signature Xynyin.Trojan¹¹ in Koodous, the system proposes to extend the detection to the other applications of the same cluster. The comparison of the detection results with VirusTotal¹² shows that all but one application are already detected, while a manual analysis of Leagueoftankheroes3D¹³ confirms its
affinity to the Xynyin malware family¹⁴.

¹⁴ On 24 August 2017 VirusTotal updated the detection, identifying the applications as malicious too.

One of the major benefits of a semi-supervised system is to limit the detection of false positives, and the operation is further simplified since the analysts should only focus on groups of similar applications, without considering single samples. As a useful side effect, the system could also be used to improve the precision of the results, by reducing false positive detections for those families of applications that have been partially misclassified by existing signatures.

4) The Iterative algorithm
The adoption of the iterative approach brings a number of benefits: it proved to be essential in order to analyze millions of applications, and the resulting number of outliers, as illustrated in Table 9, is much lower than what was obtained by clustering all applications together. The time required by the clustering phase is proportional to the chunk size and it is up to one order of magnitude lower than in the non-iterative case.

TABLE 9. Comparison of the clustering results using both the iterative version with different chunk sizes N, and the non-iterative one. Column Time indicates the time (in seconds) required by the clustering process, while column "Outliers" reports the number of outliers found at the end of the iterations.

N                Time (s)    Outliers
50k              5,746       64,553
100k             6,408       65,685
200k             10,573      67,081
non-iterative    16,592      119,919

The adoption of the iterative approach does not affect the quality of the results, even though using a bigger chunk size results in a greater number of new detections.

Table 10 compares the iterative approach using as a reference clustering assignment the one produced by the non-iterative version. A relatively low ARI value indicates a difference in the clustering assignment between the two approaches, while a very high homogeneity value, compared to completeness, is a clear sign of a finer cluster partitioning. In other words, using the iterative approach the quality of the information is not compromised, although the resulting clusters are smaller, hence less likely to contain enough applications that span different detection areas, finally resulting in a lower extension. A bigger chunk size lowers the differences between the iterative and the non-iterative assignment, as shown by an increasing V-score value. Eventually, if a large enough chunk size is used, the iterative approach produces almost the same results as the non-iterative one, while generally finding a higher number of clusters, as illustrated in Table 5, and fewer outliers (Table 9).

TABLE 10. Indexes comparison of the clustering label inferred by the iterative approach (with different chunk sizes N) using the assignment produced by the non-iterative version as a reference.

N       ARI     Homogeneity    Completeness    V-Score
50k     0.26    0.92           0.78            0.85
100k    0.27    0.93           0.81            0.86
200k    0.29    0.94           0.84            0.89

Finally, in order to further test the scalability of the proposed method, we successfully applied the algorithm on a very large dataset of 10 million applications, using a chunk size N = 500k.

B. AUTOMATIC FAMILY SIGNATURES GENERATION
In order to evaluate the effectiveness of the automatic signature generator, we compare the detection results of several YARA rules automatically generated by the proposed algorithm with existing rulesets created by expert analysts.

Table 11 reports the results of the rule detections on a dataset of 1.5 million applications: in all the cases, the automatically generated rules¹⁵ performed better than the ones authored by humans, increasing the detection from 8.2% up to 131.2%, without generating any false positives.

¹⁵ Example rulesets can be found at the following address: https://koodous.com/analysts/YaYaGen/rulesets
¹⁶ https://koodous.com/my_rulesets/3466

Referring to Section IV-C, in all the cases the rule generation process stopped at the second step, as none of the new rules produced any false positives in the current dataset of applications. A further manual analysis of the detected applications confirmed that no false positive was generated.

As shown in Table 12, the time required to generate a rule for a few hundred malware samples is always less than a minute, although when the target increases to a few thousand applications, the time required grows up to several minutes, as the most expensive part of the process is the check for false positives against a reference dataset. This is not considered a limitation, since the whole process is automatic and, given the quality of the results, it provides invaluable support for the family signature generation process.

Table 12 reports the number of literals (i.e., application features) and the final score for each generated YARA rule: referring to Section IV-C, each score is higher than the minimum threshold $T_{min} = 400$, satisfying the minimum requirement for acceptability in order to avoid false positive detections, and lower than the maximum threshold $T_{max} = 700$, as a result of the optimization process to increase the rule generality and therefore the ability to catch future malware variants.

As shown in the example of Section IV-C, in order to increase the effectiveness of a rule, URLs are included only if they are known to be malicious, as in the case of http://s.adslinkup.com/v2 for the Syringe malware family. Moreover, aiming at identifying malware with very high precision and avoiding false positives, whenever available, the automatic signature generator includes those attributes extracted from the application analysis that contain a typing mistake. For instance, the rule YaYaMetasploit1¹⁶ includes a wrong permission ACCESS_COURSE_LOCATION instead of the correct
one ACCESS_COARSE_LOCATION. Given the difficulty of reproducing such an uncommon mistake, we consider this feature as a hard indicator of the maliciousness of a sample.

TABLE 11. Comparison of the detection performance of human-authored YARA rules (Original) with automatically generated ones (Auto). The last column reports the improvement (in percentage) for the newly generated rules. Detections are tested on a dataset of 1.5 million applications.

Rule name          Detections (Original)   Detections (Auto)   Improvement
SmsSender          539                     1,004               +86.3%
Syringe            220                     315                 +43.2%
HummingBad2        136                     257                 +89.0%
Marcher2           559                     652                 +16.6%
SMSReg             159                     172                 +8.2%
VolcmanDropper     186                     430                 +131.2%
FakeGoogleChrome   516                     822                 +59.3%

TABLE 12. Number of literals, score and time (in seconds) required to generate each YARA rule.

Rule name          Literals   Score   Time (s)
SmsSender          15         412     43
Syringe            19         574     48
HummingBad2        12         599     52
Marcher2           20         686     49
SMSReg             34         537     42
VolcmanDropper     10         439     13
FakeGoogleChrome   15         407     43

VI. RELATED WORK

A. CLUSTERING APPLIED TO MALWARE ANALYSIS

The first attempt to automatically group computer malware based on their behavior dates back to Lee and Mody [26], who use a sequence of runtime events (e.g., registry and file system modifications) to cluster similar programs. As a similarity measure, they choose a variant of the edit distance, which is demanding in terms of computational resources, since it has a computational complexity of O(n^2) in the number n of features.

Later, Bailey et al. [27] propose a system for automated malware classification and analysis as a remedy for the inconsistent and incomplete labeling that commonly affects traditional antivirus products. By applying single-linkage hierarchical agglomerative clustering (HAC) with Normalized Compression Distance (NCD) and using an inconsistency measure as the cutting criterion, Bailey et al. are able to automatically categorize malware profiles into groups that reflect similar classes of behaviors in terms of system state changes. While results are generally affected by the restrictions of dynamic analysis, for the first time they introduce the idea of “detection through clustering”, exploited in our proposed framework.

In their work, Apel et al. [5] study which combination of metrics (i.e., Edit Distance, Approximated Edit Distance with Blockwise Hashing, NCD and Manhattan Distance) and n-gram features is most appropriate for determining relations between malware samples. They define three different criteria to support their evaluation (i.e., appropriateness, computable efficiency and local sensitiveness), using single-linkage HAC as the clustering algorithm. Experimental results show that Manhattan Distance along with 3-grams delivers the best results, while NCD and Edit Distance generally perform poorly.

Neither Lee and Mody [26] nor Bailey et al. [27] have any specific solution to large-scale clustering. On the other hand, Bayer et al. [13], Rieck et al. [28], and Jang et al. [12] directly address the problem of managing large datasets, developing methods to scale the clustering process.

Bayer et al. [13] propose a scalable malware clustering approach using a combination of approximate and hierarchical clustering with Locality-Sensitive Hashing (LSH) [29] to significantly reduce the number of distance computations. By extending Anubis [30], they are able to extract detailed behavioral reports based on taint tracking results and network captures from malware execution. In particular, the taint engine allows them to map low-level operations (e.g., system calls) to operating system objects (e.g., registry keys and files). By deploying LSH, Bayer et al. are capable of clustering 75,000 samples in less than 3 hours. By contrast, Rieck et al. [28], [31] propose an incremental approach, where they alternate a prototype-based clustering algorithm with a classification step, eventually reducing the runtime complexity by performing clustering only on representative samples.

Jang et al. [12] develop BitShred as a remedy to the problem of clustering large data sets with high-dimensional feature sets. They propose to use feature hashing to reduce the dimensionality of large feature sets, while reducing the computational cost of the Jaccard index by means of an approximated version that exploits bit-vector arithmetic. However, since BitShred relies solely on a static analysis approach, results are susceptible to binary-level obfuscation.

In 2010, Perdisci et al. [11] propose a network-based version of a behavioral malware clustering system, relying on a three-step clustering refinement process, starting from the analysis of malicious HTTP traces. The first phase consists of a coarse-grained clustering where malware samples are grouped together according to simple statistical similarities; subsequently, a fine-grained clustering further splits samples considering structural properties of HTTP queries. In the final step, fine-grained clusters whose centroids are close to each other are merged together. The system is tested on HTTP traces generated from 25,000 applications using single-linkage HAC and the Davies-Bouldin (DB) validity index [32] as the cutting criterion. While the underlying idea of a multi-step clustering refinement process is quite interesting, in practice it is the biggest limitation to the scalability of their work. Moreover, Perdisci et al. limit behavioral analysis to HTTP-based malware only, which in practice can be easily bypassed by using an encrypted protocol (e.g., HTTPS).

In 2013, Hu et al. [10] present MutantX-S, focusing on malware comparison and triage on a large scale. Their system falls into the static-analysis category, since it relies on features extracted from the malware instructions. MutantX-S can efficiently cluster a large number of samples into families based on static program features, by extracting N-gram features directly from the x86 opcode sequences and exploiting a feature hashing technique to reduce feature dimensionality, thus significantly lowering the memory requirement and computation costs. MutantX-S adopts the same prototype-based algorithm of [31] because of its efficiency and explicit expression of malware features.

In the Android context, ClusTheDroid [33] is the first research to combine behavioral analysis and clustering to specifically target Android malware. The goal is both to develop a tool and to evaluate clustering alternatives. In the end, the authors focus on single and complete linkage HAC, using a feature set composed of 38 numerical quantities extracted from the CopperDroid [34] report, and weighted according to a three-level interpretation of malware behaviors.

Differently from most of the previous works [5], [11], [13], [27], [33], which rely on the HAC algorithm (which is expensive both computationally and in storage, respectively O(n^2 log n) and O(n^2) [35]), we use HDBSCAN, which with n data points has an average complexity of approximately O(n log n) [20] and a space requirement of O(n), making it applicable to large datasets. Furthermore, differently from [31], we devise an iterative clustering approach where HDBSCAN is iteratively applied over the entire dataset, without the need to alternate it with any classification step, finally discovering precise families of applications with a shared behavior.

B. EVALUATING CLUSTERING RESULTS

The clustering problem is inherently ill-posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world [36]. Cluster validity analysis often involves the use of subjective criteria of optimality specific to a particular application. Therefore, no commonly accepted standard for validating the output of a clustering procedure exists [37]. In real-world applications, it is often completely infeasible to manually investigate the results of a clustering, which makes the definition of automatic measures necessary [33]. Helpful metrics to determine the quality of a clustering process are commonly classified into internal and external indexes. The former evaluate both cluster cohesion and separation, which determine how distinct or well-separated a cluster is from others. The latter, on the other hand, use a reference set as a means of quality control for the setup of the clustering algorithm [33].

In the field of malware analysis, clustering validation is further complicated by the intrinsic difficulty of establishing a reliable ground truth. Firstly, malware analysis is challenging and it gets more difficult when anti-analysis, triggering sequences and dynamic code loading techniques are in place. Secondly, not even a manual categorization would provide a reliable partition, since most malware cannot be unequivocally assigned to categories, not to mention the unrealistically high amount of time it would require.

As a reference set is not available, one possibility is to take advantage of the labels assigned to each malware sample by several antivirus scanners. The availability of services that specifically provide these results (e.g., Metadefender or VirusTotal; see footnotes 17 and 18) eases the procedure. However, there is an intrinsic complexity in defining a unique labeling schema, since most malware ends up being marked as belonging to one malicious category only. As a matter of fact, Bailey et al. [27] showed that antivirus labeling fails to satisfy three fundamental criteria: consistency among different products, completeness in malware tagging, and conciseness in label semantics. One possible explanation is that signatures used in the malware-matching algorithms mostly evaluate static properties of the binary, in contrast to behavioral properties: the result is that families found using static features might be quite different from ones established using behavioral features. Moreover, different AV products apply different criteria and granularity to rule generation, resulting in inconsistent results. Despite the complexity and intrinsic challenges of the procedure, given the importance of automatically building a malware reference dataset to evaluate clustering results, the problem has been directly tackled in several research works, such as VAMO [38] and AVclass [23].

17 https://metadefender.opswat.com
18 https://virustotal.com/

In the literature on malware clustering, several techniques are proposed. Bayer et al. [13] and Jang et al. [12] use precision and recall to compare the results of their system-level behavioral clustering to a reference dataset, defining a manual mapping between labels assigned by different AVs. However, as the dataset size increases this method becomes hardly sustainable and quite costly. Similarly, ClusTheDroid [33] uses a reference set developed through manual analysis [39].

On the other hand, Apel et al. [5] choose to take into consideration the amount of “shared behaviour” that can be found among different analysis traces within the same cluster of applications. In practice, each system call is modeled as a single character, and the evaluation is computed in linear time by finding all substrings in a generalized suffix tree, using the algorithm described in [40]. The main limitation of this technique is related to the choice of the reference dataset, since Apel et al. use an artificial dataset starting from three real-world malware traces, then divided into blocks of system calls and randomly permuted.

Differently, Perdisci et al. [11] tackle the problem by measuring the cohesion and separation of each cluster, in terms of the agreement between the cluster assignment and the labels assigned by multiple AV scanners. However, since AV labels have been shown to be inconsistent [41], the measures of cluster cohesion and separation only give an indication of the validity of the clustering results.

C. SIGNATURE-BASED DETECTION

Early AV products used the hash value of an application to detect malicious software. However, every modification in
the source code, as tiny as one byte, results in a detection evasion. Today’s signatures are pattern-matching rules commonly defined on static or dynamic properties of applications under analysis and, even though they are assisted by heuristic and AI-based solutions, still represent the most reliable (i.e., with the lowest false positives) antivirus technology.

1) Automatic Signature Generation

A number of prior works propose systems to automatically generate different types of network signatures to identify malicious traffic.

Honeycomb [42], Autograph [43], and EarlyBird [44] propose the generation of signatures comprising a single contiguous string (i.e., token). Later on, PAYL [45], Nemean [46], Hamsa [47] and Botzilla [48] introduce more complex methods based on token subsequence signatures. Other works like ProVex [49], AutoRE [50], ShieldGen [51], and [52] also tackle the problem of automatically generating network signatures, although their applicability is specific to network traffic detection.

In 2005, Newsome et al. introduce Polygraph [53], a system which exploits the Token-Subsequence algorithm to automatically obtain IDS signatures to match polymorphic worms. Polygraph is tested against three real-world exploits and is able to successfully generate HTTP and DNS signatures with a low false positive rate.

Perdisci et al. [11] also tackle the problem of automatically generating network signatures for cluster centroids, with the aim of deploying them into an IDS at the edge of a network in order to detect malicious HTTP traffic. Since malware samples may contact legitimate websites for malicious purposes, instead of pre-filtering HTTP traffic against legitimate websites, the authors apply a pruning process by testing the signature set against a large dataset of legitimate traffic and discarding signatures that generate false positives, although the effectiveness of such an approach depends on the legitimate traffic available.

In the Android context, Faruki et al. [54] propose AndroSimilar, a statistical signature-based solution that generates variable-length signatures for the application under test and identifies malware on the basis of a similarity percentage with a dataset of known malicious samples.

Another approach is presented in DroidAnalytics [55], a signature-based analytic system, which extracts and analyzes applications at the opcode level. Firstly, a three-level signature (i.e., methods, classes, application) is generated by combining the API call traces, then the malware is associated to a family according to its malicious content.

While [54] shows robustness against control-flow obfuscation, junk method insertion and string encryption, [55] could fail in the detection of repackaged malware. On the other hand, both solutions are affected by a high false-positive rate due to the wrong choice of signature patterns, which are available in both malicious and benign applications.

Since the release of YARA [56], a pattern-matching language designed to help identify and classify malware samples, a few automatic tools have been proposed to generate malware signatures which balance the generality required to catch future samples with the need to avoid false positive detections.

In 2013, Chris Clark develops YaraGenerator (footnote 19), a Python program which automatically generates YARA rules by sampling a small subset of strings common to malware samples, while blacklisting goodware ones. Although the tool is designed to work with any type of malicious file, in order to increase the efficacy of the results, specific datasets of goodware strings are available for several file formats (e.g., Windows executable, PDF, email and office document).

19 https://github.com/Xen0ph0n/YaraGenerator

Similarly, yarGen (footnote 20) is a Python tool developed by Florian Roth to automatically generate YARA rules by combining the topmost malware strings, while removing those that also appear in goodware files. By using fuzzy regular expressions, each malware string is assigned a score proportional to the inverse of its frequency, and the “Gibberish Detector” makes it possible to prefer real language over meaningless character chains. The tool also exploits a naive Bayes classifier to classify candidate strings, avoiding compression or encryption garbage in favor of more generic strings. Finally, each rule is created by combining the 20 strings with the highest score. The result of the generation process may be a single rule, specific to one sample, or a super rule, catching malware variants and groups.

20 https://github.com/Neo23x0/yarGen

While both YaraGenerator and yarGen have been developed to support rule creation, rather than to completely replace the role of expert analysts, their major drawback is that their efficacy strongly relies on the completeness of the dataset of goodware strings.

Differently from previous works, which mostly rely on the search for an optimal sequence of opcodes or strings, the proposed algorithm generates signatures from a set of attributes extracted from the application analysis, finding an optimal combination to minimize false negatives and guarantee zero false positives in the current set of applications. None of the previous works can be directly applied to solve such a problem. Moreover, the proposed approach exploits a heuristic measure to find the right balance between rule generality and specificity, using the same criteria that expert analysts adopted while authoring existing rulesets.

VII. LIMITATIONS

A major limiting factor of the described semi-supervised approach is the ability to extract meaningful information from the applications under analysis. Indeed, the accuracy of the analysis directly affects the clustering results and the automatic rule generation process. The Android platform lacks mature reverse engineering tools compared to the ones used for x86 malware [57]. Since each malware is different, automatically finding the malicious code by means of static analysis is difficult, because it is mixed with
benign code; moreover, dynamic code loading and reflection make the analysis even harder. Unfortunately, most malware includes trigger-based anti-analysis techniques that delay or hide its malicious activities at the first application run or in an emulated environment. For instance, the family of applications known as DroidKungFu (footnote 21) uses a time bomb of 240 minutes to schedule the execution of its malicious code; a simple dynamic analysis therefore fails to observe interesting behaviors. However, in this research we do not address problems related to application analysis, as we focus on the detection of new samples and the automatic generation of new signatures.

21 Sample MD5: 7f5fd7b139e23bed1de5e134dda3b1ca

Evasion attacks, such as noise-injection attacks [58] and other similar approaches [59]–[62], may affect the correctness of the results of the clustering and the signature generation. Those attacks rely on the ability to inject, into the analysis platform, applications specially crafted to mislead the clustering process and the generation of a good detection model.

In the described system, an attacker could exploit such attacks by injecting specially crafted applications with the final goal of generating a false positive or a false negative detection. However, in both cases we assume that the detection information of already known threats (identified through signatures or by triage) cannot be maliciously tampered with, thus newly injected families will result in a Type 4, 5, 6 or 7, and hence will be subject to manual validation.

If the attacker wants to deliberately generate a false positive, several malicious applications whose statistical properties are similar to a target goodware can be injected. Since a false positive detection mainly generates a disruption to a third-party service, causing reputational damage to the AV solution, the magnitude of the echo is proportional to the diffusion of the target goodware. As a matter of fact, the analyst will be alerted by such a family.

On the other hand, if the goal is to generate a false negative, the attacker could inject several goodware applications with the same statistical properties as a target unknown malicious app. Such a family could be misclassified as entirely goodware even after the validation process, as the manual analysis focuses only on a few samples. However, such a situation applies only as long as the malware is a zero-day and no specific knowledge about that threat is available. The identification of zero-day malware is a challenging and open research problem in the security community.

Finally, the proposed system strongly relies on the information provided by the platform to automatically extend the detection to new applications and identify new potential malware families. It is a prerequisite that this information is not tampered with by any malicious actor. Although Koodous provides protection mechanisms for both YARA rules (rules undergo a review process before becoming active) and the triage process (community members are subject to a reputation check), it is not the intent of this research to tackle those issues, leaving their study to future works.

VIII. FUTURE WORKS

The work presented in this paper can be improved and extended in a number of ways. At the time of writing, we are focusing our efforts on the correct management of new samples collected every day. Since the current version of the system does not allow new applications to be added incrementally to the existing model, when enough samples are collected, those are treated as a new iteration of the clustering process. As an alternative to the iterative approach, incremental clustering algorithms have been proposed [63]–[65], although they are still not directly applicable to HDBSCAN. Their study and adoption will be addressed in future works.

IX. CONCLUSION

In this paper, we introduced a set of semi-supervised techniques with the ultimate goal of assisting human experts in the generation of malware family signatures. As a result, we developed a scalable framework able to dig into massive datasets of Android applications with the main purpose of identifying new malware samples, while reducing false positive detections.

Our study shows that combining the scalability of the automatic techniques with the inherent flexibility of the manual analysis achieves the best performance. Ultimately, the proposed approach introduces two essential automation improvements in a well-known and tested standard AV detection mechanism based on signatures. An iterative clustering algorithm allows for easy identification of hard-to-find potential threats, reducing the human intervention from the manual analysis of thousands of applications to the validation of a much smaller number of clusters where applications reflect a similar class of behavior. Subsequently, an automated procedure, which exploits a heuristic optimization strategy, generates a set of YARA rules to cover newly identified malware with an acceptable generalization capability while minimizing false positives.

Experimental results on a dataset of 1.5 million distinct Android applications confirm the effectiveness of the proposed system, both in the identification of new malware samples and in the generation of new family signatures in the form of YARA rules.

Finally, the proposed approach was deployed in January 2018 and has been in use since then on Koodous, the mobile antivirus platform developed by Hispasec.

ACKNOWLEDGEMENT

Andrea Marcelli’s Ph.D. program at Politecnico di Torino is supported by a fellowship from TIM (Telecom Italia Group). The authors wish to thank Dario Lombardo for his support and insightful comments.

This article is based upon work from COST Action CA15140 ‘Improving Applicability of Nature-Inspired Optimisation by Joining Theory and Practice (ImAppNIO)’ supported by COST (European Cooperation in Science and Technology).

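As a rough cross-check of the experimental figures recalled in the conclusion, the short sketch below re-derives the Improvement column of Table 11 from the Original and Auto detection counts, and verifies that every score in Table 12 lies strictly between the thresholds T_min = 400 and T_max = 700. All numbers are copied from the two tables; the function and variable names are illustrative assumptions and are not part of the described framework.

```python
# Values copied from Table 11: rule name -> (Original, Auto, reported improvement in %).
TABLE_11 = {
    "SmsSender": (539, 1004, 86.3),
    "Syringe": (220, 315, 43.2),
    "HummingBad2": (136, 257, 89.0),
    "Marcher2": (559, 652, 16.6),
    "SMSReg": (159, 172, 8.2),
    "VolcmanDropper": (186, 430, 131.2),
    "FakeGoogleChrome": (516, 822, 59.3),
}

# Scores copied from Table 12: rule name -> final score of the generated rule.
TABLE_12_SCORES = {
    "SmsSender": 412,
    "Syringe": 574,
    "HummingBad2": 599,
    "Marcher2": 686,
    "SMSReg": 537,
    "VolcmanDropper": 439,
    "FakeGoogleChrome": 407,
}

# Thresholds quoted in the discussion of Table 12 (defined in Section IV-C).
T_MIN, T_MAX = 400, 700


def improvement_pct(original: int, auto: int) -> float:
    """Relative increase in detections, in percent, rounded to one decimal."""
    return round((auto - original) / original * 100, 1)


def scores_within_thresholds(scores, t_min=T_MIN, t_max=T_MAX) -> bool:
    """True if every score is strictly between t_min and t_max."""
    return all(t_min < score < t_max for score in scores.values())


if __name__ == "__main__":
    for name, (original, auto, reported) in TABLE_11.items():
        computed = improvement_pct(original, auto)
        print(f"{name:18s} computed {computed:+.1f}%  reported {reported:+.1f}%")
    print("all scores within (T_min, T_max):",
          scores_within_thresholds(TABLE_12_SCORES))
```

The improvement is computed here as (Auto - Original) / Original, which is an assumption about how the column was obtained; it is the reading under which the recomputed values can be compared directly with the reported ones.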

REFERENCES

[1] “AndroidOS.FakePlayer | Symantec,” https://www.symantec.com/security_response/writeup.jsp?docid=2010-081100-1646-99, August 2010, (Accessed on 03/24/2017).
[2] S. Arshad, M. A. Shah, A. Khan, and M. Ahmed, “Android malware detection & protection: a survey,” Int. J. Adv. Comput. Sci. Appl, vol. 7, no. 2, pp. 463–475, 2016.
[3] “Google_android_security_2016_report_final.pdf,” https://source.android.com/security/reports/Google_Android_Security_2016_Report_Final.pdf, Mar 2017, (Accessed on 11/12/2017).
[4] F. Guo, P. Ferrie, and T.-C. Chiueh, “A study of the packer problem and its solutions,” in International Workshop on Recent Advances in Intrusion Detection. Springer, 2008, pp. 98–115.
[5] M. Apel, C. Bockermann, and M. Meier, “Measuring similarity of malware behavior,” in Local Computer Networks, 2009. LCN 2009. IEEE 34th Conference on. IEEE, 2009, pp. 891–898.
[6] J. Oberheide, E. Cooke, and F. Jahanian, “Cloudav: N-version antivirus in the network cloud.” in USENIX Security Symposium, 2008, pp. 91–106.
[7] F. Zhang, H. Huang, S. Zhu, D. Wu, and P. Liu, “ViewDroid: Towards obfuscation-resilient mobile application repackaging detection,” in Proceedings of the 2014 ACM conference on Security and privacy in wireless & mobile networks. ACM, 2014, pp. 25–36.
[8] R. Winsniewski, “Android–apktool: A tool for reverse engineering Android apk files,” 2012.
[9] J. Freke, “Smali, an assembler/disassembler for Android’s dex format,” Google Project Hosting [online] http://code.google.com/p/smali, 2013.
[10] X. Hu, K. G. Shin, S. Bhatkar, and K. Griffin, “Mutantx-s: Scalable malware clustering based on static features.” in USENIX Annual Technical Conference, 2013, pp. 187–198.
[11] R. Perdisci, W. Lee, and N. Feamster, “Behavioral clustering of http-based malware and signature generation using malicious network traces.” in NSDI, vol. 10, 2010, p. 14.
[12] J. Jang, D. Brumley, and S. Venkataraman, “Bitshred: feature hashing malware for scalable triage and semantic analysis,” in Proceedings of the 18th ACM conference on Computer and communications security. ACM, 2011, pp. 309–320.
[13] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, “Scalable, behavior-based malware clustering.” in NDSS, vol. 9. Citeseer, 2009, pp. 8–11.
[14] “YARA - The pattern matching swiss knife for malware researchers,” https://virustotal.github.io/yara/, November 2013, (Accessed on 03/27/2017).
[15] A. Desnos, “Androguard: Reverse engineering, malware and goodware analysis of Android applications... and more (ninja!),” Retrieved June, vol. 10, p. 2014, 2011.
[16] A. Desnos and P. Lantz, “Droidbox: An android application sandbox for dynamic analysis,” 2011.
[26] T. Lee and J. J. Mody, “Behavioral classification,” in EICAR Conference, 2006, pp. 1–17.
[27] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, “Automated classification and analysis of internet malware,” in International Workshop on Recent Advances in Intrusion Detection. Springer, 2007, pp. 178–197.
[28] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, “Learning and classification of malware behavior,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2008, pp. 108–125.
[29] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in VLDB, vol. 99, no. 6, 1999, pp. 518–529.
[30] U. Bayer, C. Kruegel, and E. Kirda, “Anubis: Analyzing unknown binaries,” 2009.
[31] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of malware behavior using machine learning,” Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
[32] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On clustering validation techniques,” Journal of intelligent information systems, vol. 17, no. 2-3, pp. 107–145, 2001.
[33] D. Korczynski, “Clusthedroid: Clustering Android malware,” 2015.
[34] K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, “CopperDroid: Automatic reconstruction of Android malware behaviors.” in NDSS, 2015.
[35] P.-N. Tan, M. Steinbach, and V. Kumar, “Data mining cluster analysis: basic concepts and algorithms,” Introduction to data mining, 2013.
[36] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[37] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM computing surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
[38] R. Perdisci et al., “Vamo: towards a fully automated malware clustering validity analysis,” in Proceedings of the 28th Annual Computer Security Applications Conference. ACM, 2012, pp. 329–338.
[39] Y. Zhou and X. Jiang, “Dissecting Android malware: Characterization and evolution,” in Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 2012, pp. 95–109.
[40] E. Ukkonen, “On-line construction of suffix trees,” Algorithmica, vol. 14, no. 3, pp. 249–260, 1995.
[41] A. Mohaisen and O. Alrawi, “Av-meter: An evaluation of antivirus scans and labels,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2014, pp. 112–131.
[42] C. Kreibich and J. Crowcroft, “Honeycomb: creating intrusion detection signatures using honeypots,” ACM SIGCOMM computer communication review, vol. 34, no. 1, pp. 51–56, 2004.
[43] H.-A. Kim and B. Karp, “Autograph: Toward automated, distributed worm signature detection.” in USENIX security symposium, vol. 286. San Diego, CA, 2004.
[17] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm [44] S. Singh, C. Estan, G. Varghese, and S. Savage, “Automated worm finger-
for discovering clusters in large spatial databases with noise.” in Kdd, printing.” in OSDI, vol. 4, 2004, pp. 4–4.
vol. 96, no. 34, 1996, pp. 226–231. [45] K. Wang, G. Cretu, and S. J. Stolfo, “Anomalous payload-based worm
[18] R. J. Campello, D. Moulavi, and J. Sander, “Density-based clustering detection and signature generation,” in International Workshop on Recent
based on hierarchical density estimates,” in Pacific-Asia Conference on Advances in Intrusion Detection. Springer, 2005, pp. 227–246.
Knowledge Discovery and Data Mining. Springer, 2013, pp. 160–172. [46] V. Yegneswaran, J. T. Giffin, P. Barford, and S. Jha, “An architecture for
[19] R. J. Campello, D. Moulavi, A. Zimek, and J. Sander, “Hierarchical density generating semantic aware signatures.” in USENIX Security Symposium,
estimates for data clustering, visualization, and outlier detection,” ACM 2005, pp. 97–112.
Transactions on Knowledge Discovery from Data (TKDD), vol. 10, no. 1, [47] Z. Li, M. Sanghi, Y. Chen, M.-Y. Kao, and B. Chavez, “Hamsa: Fast
p. 5, 2015. signature generation for zero-day polymorphic worms with provable attack
[20] L. McInnes and J. Healy, “Accelerated hierarchical density clustering,” resilience,” in Security and Privacy, 2006 IEEE Symposium on. IEEE,
arXiv preprint arXiv:1705.07321, 2017. 2006, pp. 15–pp.
[21] L. McInnes, J. Healy, and S. Astels, “hdbscan: Hierarchical density based [48] K. Rieck, G. Schwenk, T. Limmer, T. Holz, and P. Laskov, “Botzilla:
clustering,” The Journal of Open Source Software, vol. 2, no. 11, mar Detecting the phoning home of malicious software,” in Proceedings of the
2017. [Online]. Available: https://doi.org/10.21105%2Fjoss.00205 2010 ACM Symposium on Applied Computing. ACM, 2010, pp. 1978–
[22] A. Rosenberg and J. Hirschberg, “V-measure: A conditional entropy-based 1984.
external evaluation measure.” in EMNLP-CoNLL, vol. 7, 2007, pp. 410– [49] C. Rossow and C. J. Dietrich, “Provex: Detecting botnets with encrypted
420. command and control channels,” in International Conference on Detection
[23] M. Sebastián, R. Rivera, P. Kotzias, and J. Caballero, “Avclass: A tool for of Intrusions and Malware, and Vulnerability Assessment. Springer,
massive malware labeling,” in International Symposium on Research in 2013, pp. 21–40.
Attacks, Intrusions, and Defenses. Springer, 2016, pp. 230–253. [50] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I. Osipkov, “Spam-
[24] “malicialab/avclass: AVClass malware labeling tool,” https://github.com/ ming botnets: signatures and characteristics,” ACM SIGCOMM Computer
malicialab/avclass, July 2016, (Accessed on 03/27/2017). Communication Review, vol. 38, no. 4, pp. 171–182, 2008.
[25] Y. Li, S. C. Sundaramurthy, A. G. Bardas, X. Ou, D. Caragea, X. Hu, [51] W. Cui, M. Peinado, H. J. Wang, and M. E. Locasto, “Shieldgen: Auto-
and J. Jang, “Experimental study of fuzzy hashing in malware clustering matic data patch generation for unknown vulnerabilities with informed
analysis,” in 8th workshop on cyber security experimentation and test (cset probing,” in Security and Privacy, 2007. SP’07. IEEE Symposium on.
15), vol. 5, no. 1. USENIX Association, 2015, p. 52. IEEE, 2007, pp. 252–266.


FERNANDO DÍAZ is a malware analyst and software engineer. He is currently a B.Sc. student in Health Engineering at the University of Malaga and works at Hispasec Sistemas as a Security Engineer. His daily work focuses on automated malware configuration extraction, distributed analysis environments, and developing software for the Koodous platform. His research includes the analysis of new malware families and IoT malware.

ANDREA MARCELLI received his M.Sc. degree in Computer Engineering from Politecnico di Torino, Italy, in 2015. He is currently a Ph.D. student in Computer and Control Engineering at the same institute and a member of the CAD group. His research interests include malware analysis, semi-supervised modeling, machine learning, and optimization problems, with main applications in computer security.

ANTONIO SÁNCHEZ is a malware analyst and research engineer. He received his M.Sc. in Computer Science from Universidad de Jaén (Spain) in 2013 and has been working as a security engineer at Hispasec Sistemas S.L. since 2012. His daily work focuses on improving systems for the automatic detection and analysis of malware samples, while his research interests include new techniques for the storage, recovery, and correlation of big data.

ANDREA ATZENI holds an M.Sc. and a Ph.D. in Computer Engineering, both from Politecnico di Torino. He is currently Senior Research Assistant in the TORSEC Security Group at the Politecnico di Torino. In the last fifteen years he has contributed to a number of large-scale European research projects under the FP5, FP6, FP7, and CIP programmes. He has addressed, among other topics, the definition of security requirements in multi-platform systems, mobile security, modelling of user expectations on security and privacy, security specification, risk analysis and threat modeling for complex cross-domain architectures, development of cross-domain usable security, digital and cloud forensics, development and integration of cross-border authentication mechanisms, and malware analysis and modelling.

GIOVANNI SQUILLERO (M'01–SM'14) received his M.S. and Ph.D. in computer science in 1996 and 2001, respectively. He is an associate professor of computer science at Politecnico di Torino. His research mixes the whole spectrum of bio-inspired metaheuristics with computational intelligence, machine learning, and selected topics in electronic CAD, games, and multi-agent systems. Other activities focus on the development of optimization techniques able to achieve acceptable solutions with a limited amount of resources, mainly applied to industrial problems. Squillero is a member of the IEEE Computational Intelligence Society Games Technical Committee.


ALBERTO TONDA received his Ph.D. in Computer Science Engineering from Politecnico di Torino, Torino, Italy, in 2010. He is a senior permanent researcher at Université Paris-Saclay, INRA, France. His research interests include semi-supervised modeling of complex systems, evolutionary optimization, and machine learning, with main applications in food science and industry.
