2018 - Countering Android Malware - A Scalable Semi-Supervised Approach For Family-Signature Generation
fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2018.2874502, IEEE Access
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
ABSTRACT Reducing the effort required by humans in countering malware is of utmost practical value.
We describe a scalable, semi-supervised framework to dig into massive datasets of Android applications
and identify new malware families. Up to the 2010s, the industrial standard for the detection of malicious
applications has been mainly based on signatures; as each tiny alteration in malware makes them ineffective,
new signatures are frequently created — a task that requires a considerable amount of time and resources
from skilled experts. The framework we propose is able to automatically cluster applications in families
and suggest formal rules for identifying them with 100% recall and quite high precision. The families are
used either to safely extend experts’ knowledge on new samples, or to reduce the number of applications
requiring thorough analyses. We demonstrated the effectiveness and the scalability of the approach running
experiments on a database of 1.5 million Android applications. In January 2018, the framework was
successfully deployed on Koodous, a collaborative anti-malware platform.
INDEX TERMS Semi-supervised learning, Clustering, Android, Malware, Automatic signature generation
VOLUME 4, 2016 1
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2018.2874502, IEEE Access
Atzeni et al.: Countering Android Malware: a Scalable Semi-Supervised Approach for Family-Signature Generation
of manually analyzing the thousands of suspicious samples received every day, a large fraction is
left unlabeled, delaying the signature generation.

While malware variants can be generated at a high pace, they are likely to perform similar malicious
activities when executed. One possible solution would be to automatically cluster such applications
into a family and focus the manual analysis on a few archetypal samples, with the underlying
assumption that malware samples bearing significant similarities are likely to derive from the same
code base [10]. Furthermore, the label of a new sample of a known family could be automatically
derived, and existing signatures or other mitigation techniques could be more easily extended to
cover the new threats.

Eventually, if a large number of malware samples belonging to the same family is identified, it may
become possible to define a generic behavioral signature able to detect future variants with reduced
false positives and false negatives [11]. Therefore, a sharp clustering is crucial to help AV
companies categorize the large number of samples, avoid duplicate work, and allow analysts to
prioritize their limited resources on novel and representative samples [12], [13].

In this article, we describe a semi-supervised system for the analysis of massive datasets of
malicious applications. We created a platform able to suggest new families of applications to human
experts; the platform also generates an intelligible YARA rule [14] to identify family members with
high precision. We explicitly minimize false positives, a business hazard and a reputation blow for
AV vendors. The approach relieves human experts of the burden of manually inspecting thousands of
Android applications, while letting them take critical decisions. The main contributions of this
article can be summarized as:

• We introduce a scalable system for the analysis of massive Android malware datasets based on
  careful feature engineering and a standard clustering algorithm. The mechanism is demonstrated to
  be robust and able to overcome the well-known limitations of traditional signature-matching
  mechanisms.
• We propose an algorithm that, starting from a cluster of samples, generates its family signature
  as a YARA rule. Thanks to exact and heuristic evaluations, such rules are intelligible and appear
  reasonable to human experts. Moreover, the algorithm guarantees zero false positives in the
  existing dataset, and limits the possibility of false positives in the future.
• We report experiments on a dataset of about 1.5 million Android applications, and results show
  the scalability of the approach. We use a set of internal and external indicators to demonstrate
  that the proposed system performs an accurate and efficient automatic identification of groups of
  similar applications. By exploiting limited data, the framework is able to propose insightful
  extensions to the rule detecting suspicious applications.
• Finally, our framework has been deployed and it has been used on Koodous¹, the mobile AV platform
  from Hispasec, since January 2018.

¹ https://koodous.com/

The rest of the paper is organized as follows: Section II illustrates the problem statement and
motivation, and Section III introduces Koodous; Section IV describes in detail the proposed
approach, while experimental results and performance evaluations are presented in Section V;
Section VI surveys related work about Android malware and automated analysis procedures;
limitations and future work are discussed in Sections VII and VIII; Section IX concludes the paper.

II. PROBLEM STATEMENT AND MOTIVATION
Since the 2000s, academia has proposed approaches based on machine learning aiming at completely
replacing humans in the malware analysis process. In most cases, such proposals fell back into mere
classification, that is, supervised machine learning. The drawbacks included the need for large
amounts of accurately labeled, i.e., already analyzed, data, and the lack of control over the false
positives eventually produced, a major cause of concern for all AV vendors. As a result, AV
companies developed systems mostly based on the reliable signature-detection mechanism. Even though
signatures suffer from the so-called “specificity” problem, and new ones need to be frequently
generated, they have proven effective, scalable, and almost unaffected by false positives.

The proposed framework is semi-supervised and introduces essential improvements in the
identification of similar applications and the generation of family signatures. It combines the
scalability of fully automatic techniques for clustering and the optimization of new family
signatures, while it exploits manual analysis, inherently more flexible and accurate, in a few
crucial steps, such as the validation of newly discovered malware families.

Traditionally, the effort of automatically classifying and analyzing malware focuses on
content-based signatures that specify binary sequences. Indeed, content-based signatures are
inherently vulnerable to malware obfuscation: even if all variants of a malicious application share
the same functionalities and exhibit the same behavior, they can have slightly different syntactic
representations. As a consequence, a huge number of signatures needs to be created and distributed
by AV companies.

On the other hand, a rule that automatically identifies the behavior of a family of samples would be
the first step towards the creation of true family signatures. Such a signature would match all
samples of a family, and would significantly help to reduce the number of signatures required to
cover it. Moreover, as new samples could be mapped to a family behavior already known, the time and
effort required to analyze and reverse engineer new samples would be reduced.

Differently from the previous approaches, the proposed system generates effective, precise and
descriptive rules using the properties directly extracted from both static and dynamic analyses.
While aiming at reducing false positives
and false negatives, it also exploits a heuristic measure to emulate how expert analysts write
existing signatures.

III. KOODOUS
Koodous is a collaborative platform for research on Android malware that combines online analysis
tools with social interactions between the analysts. Started in 2014, in 4 years it collected one of
the largest repositories of Android applications: its databases contain more than 30 million
applications, among which 7 million have already been identified as malicious. Fig. 1 illustrates
the trend of application submission and detection from October 2014 until March 2017.

Koodous provides both analysis service and end-point protection: upon submission, each application
is analyzed both statically and dynamically, and the final report is accessible through a web
interface specifically designed to help analysts detect new malware threats. Analysis tools include
a custom version of Androguard [15], CuckooDroid² and DroidBox [16].

Instead of relying on a closed group of expert malware analysts, Koodous takes advantage of an open
community to identify malicious applications. Furthermore, in order to guarantee high quality
results, manual detections are subject to reputation-based checking. Moreover, protection is
guaranteed through an Android application, which relies on the cloud platform to detect the most
recent threats³.

Koodous uses YARA to describe patterns for detecting malware applications: since the creation of
high-quality YARA rules requires a considerable effort, the platform also offers the possibility to
identify malware through a simpler voting mechanism, an operation referred to as “triage”. As of
July 1, 2018, more than 2.5 million applications are detected by triage⁴.

A. ITERATIVE CLUSTERING
Clustering provides a mechanism to automatically categorize applications into groups that reflect
their similarity, both in source code and runtime behavior. We exploit HDBSCAN, a density-based
algorithm, as it fits most of our requirements.

Density-based clustering algorithms locate high-density regions in the feature space; DBSCAN
(density-based spatial clustering of applications with noise) is probably the best known among them
[17]. Density-based algorithms can effectively discover clusters of arbitrary shape and filter out
outliers, eventually increasing cluster homogeneity. Additionally, the number of expected clusters
to be found in the data is not required: our aim is to discover groups of similar applications
without any prior knowledge about their composition, so the number of clusters is hard to guess a
priori.

In 2013, Campello et al. [18] proposed HDBSCAN, a new density-based algorithm that converts the
original DBSCAN into a hierarchical clustering algorithm. As a matter of fact, HDBSCAN finds
clusters of varying densities, and is more robust to parameter selection. Moreover, it supports the
GLOSH (global-local outlier score from hierarchies) outlier detection algorithm: during the fitting
phase, each data point is associated with a score that represents its likelihood of being an
outlier; at the end of the process, outliers are selected via upper quantiles [19].

In low-dimensional spaces, HDBSCAN has an average complexity of approximately $O(n \log n)$, while
its space requirement is $O(n)$, making it applicable to moderately large datasets [20].

As the number of samples in malware datasets is in the tens of millions, we exploit an iterative
process where the original dataset $D$ is divided into $m$ chunks $d_i$ of fixed size $N$
($m = \lceil |D| / N \rceil$).
$$D = \bigcup_{i=0}^{m-1} d_i \tag{1}$$

The parameter $N$ balances the quality of the results with the time required for the analysis, and
can be set experimentally according to the available resources.

HDBSCAN is applied to each chunk of data $d_i$ finding, at each step, a set of clusters $c_i$ and a
set of outliers $o_i$. Finally, all the outliers $O = \bigcup_{i=0}^{m-1} o_i$ are clustered
together in order to find even those small groups of applications whose samples are spread through
several chunks of data. In the end, the total number of required iterations is equal to $m + 1$.

Since HDBSCAN could be executed in parallel on the first $m$ chunks, the benefit of the iterative
approach is the huge reduction in the time required for the analysis. On the other hand, a few
applications could be misclassified as outliers and the same group of similar applications could be
found multiple times, although, as shown in Section IV-B, those corner cases do not limit the
framework efficacy.

1) Features selection
An accurate feature selection is a crucial step in every machine learning approach. As suggested in
[11], we exploit aggregate information: from the analysis result of each application, we extract a
subset of “statistical” properties, meant as quantitative measures of a malware behavior. Indeed, we
experimentally found that exploiting statistical similarities among applications, rather than
“structural” properties which exactly describe the malicious behavior, does not significantly alter
the results, while at the same time significantly reducing the amount of data to process.

Starting from a set of $n$ analysis reports $r_i$ provided by Koodous, each report $r_i$ is
translated into a feature vector $v_i = (f_0, \ldots, f_{34})$ containing the 35 statistical
properties. Table 1 summarizes the features extracted from the results of the static and dynamic
analysis.

TABLE 1. List of the 35 statistical properties extracted from the analysis result of each APK file.
Features are grouped according to the type of analysis. Static features are extracted using
Androguard both parsing the Manifest file and looking for interesting API calls in the decompiled
source code. Dynamic features are extracted using DroidBox and CuckooDroid from the dynamic analysis
of the application.

Analysis method        Software     Statistical property
---------------------  -----------  ----------------------------------------------------------------
Parsing Manifest file  Androguard   Filters, Activities, Receivers, Services, Permissions
Statically from APK    Androguard   Accounts, Advertisement, Browser history, Camera, Crypto
                                    functions, Dynamic broadcast receiver, Installed applications,
                                    Run binary, MCC, ICCID, IMEI, IMSI, SMS, MMS, Phone call, Phone
                                    number, Sensor, Serial number, Socket, SSL
Dynamically            DroidBox     Files written, Crypto usage, Files read, Send SMS, Send network,
                                    Recv Network
Dynamically            CuckooDroid  HTTP request, Hosts, Domains, DNS
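The iterative scheme of Section IV-A (split the dataset into $m = \lceil |D|/N \rceil$ chunks, cluster each chunk, then cluster the pooled outliers in one extra pass) can be sketched as follows. This is a minimal illustration and not the deployed implementation: the clustering step is passed in as a function, and `toy_cluster_fn` is a hypothetical stand-in for HDBSCAN that simply groups identical vectors and returns the label -1 for outliers.

```python
from collections import Counter
from math import ceil


def iterative_clustering(data, chunk_size, cluster_fn):
    """Cluster `data` chunk by chunk, then re-cluster the pooled outliers.

    `cluster_fn` maps a list of feature vectors to one label per vector,
    with -1 marking an outlier. Returns (clusters, outliers), both
    expressed as indices into `data`.
    """
    m = ceil(len(data) / chunk_size)  # number of chunks d_i
    clusters, pooled = [], []

    def run(indices):
        labels = cluster_fn([data[i] for i in indices])
        groups, leftovers = {}, []
        for idx, label in zip(indices, labels):
            if label == -1:
                leftovers.append(idx)
            else:
                groups.setdefault(label, []).append(idx)
        return list(groups.values()), leftovers

    # The first m runs are independent of each other and could run in parallel.
    for i in range(m):
        start = i * chunk_size
        chunk = list(range(start, min(start + chunk_size, len(data))))
        found, leftovers = run(chunk)
        clusters.extend(found)
        pooled.extend(leftovers)  # the outliers o_i of this chunk

    # Iteration m + 1: cluster all pooled outliers together.
    found, outliers = run(pooled) if pooled else ([], [])
    clusters.extend(found)
    return clusters, outliers


def toy_cluster_fn(points, min_size=2):
    """Hypothetical stand-in for HDBSCAN: identical points form a cluster."""
    counts = Counter(tuple(p) for p in points)
    ids, labels = {}, []
    for p in points:
        key = tuple(p)
        labels.append(ids.setdefault(key, len(ids)) if counts[key] >= min_size else -1)
    return labels


data = [(0,), (0,), (1,), (2,), (2,), (1,)]
clusters, outliers = iterative_clustering(data, 3, toy_cluster_fn)
```

In this toy run the two samples with value 1 fall into different chunks, are flagged as outliers in both, and are grouped only in the final pass over the pooled outliers, which is the case the extra iteration is meant to recover.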
In more detail, the static analysis performed by Androguard extracts the features from the Manifest
file (i.e., number of activities, permissions, receivers, filters), and from the source code
analysis. The former unveils similarities among applications based on the software architecture used
to develop the application, while the latter models each application by extracting portions of code
related to suspicious API calls (e.g., number of calls to SMS API, or IMEI, or other network related
methods). On the other hand, the dynamic analysis extracts features that model the application
interaction with the surrounding operating system, both at file system and network level, extracted
by DroidBox (e.g., files written, usage of cryptographic methods, SMS sent), and the network
information extracted by CuckooDroid (e.g., number of DNS resolved, HTTP requests). These are the
standard types of information extracted in the field of malware analysis.

Because the range of each feature is quite different, the dataset is first normalized so that the
features have mean equal to zero and variance equal to one. Since the choice of the distance to use
during cluster analysis is tied to the type and the dimension of selected features, we
experimentally found that the combination with the Euclidean distance delivered the best
performance.

B. EXTENDING MALWARE DETECTION
Starting from millions of samples, the iterative clustering (Section IV-A) identifies a smaller
number of families composed of strongly related applications. In some cases, by combining this
result with the information already available in Koodous, these families may be automatically
labeled, as they extend either known threats or legitimate software. In the other cases, experts are
required to manually evaluate the family, but they need to analyze only a few representative samples
of the group and not all applications, therefore drastically reducing the time required by the
analysis. This process exploits the “clustering assumption” of the semi-supervised learning
algorithms, which states that two points which are in the same cluster (i.e. which are linked by a high
Authoring an effective signature requires a considerable effort and experience. Good signatures are
compact, and they have the ability to generalize, that is, to identify all known variants of the
malware and even possible new ones. Moreover, they do not yield false-positive results by detecting
non-family members, and finally, they appear intelligible to human experts and are almost
self-explanatory.

More formally, a signature $S$ is the disjunction of $n$ clauses $S = \bigvee_{i=0}^{n} c_i$. A
clause $c_i$ is a finite conjunction of $m_i$ literals $c_i = \bigwedge_{k=0}^{m_i} l_k^i$. In the
present context, a literal $l_k^i$ is a single feature specified in the report resulting from the
analysis of the application.

Traditionally, a YARA rule is defined on unique strings found in malware but not present in
legitimate programs; quite differently, we generate precise, descriptive rules using the structural
properties extracted by the static and dynamic analyses. Our program identifies an optimal set of
clauses for matching all target applications while yielding no false positive in the current
database; moreover, thanks to some heuristics, the rule has a good ability to generalize, a low risk
of detecting false positives in the future, and it appears reasonable to the eye of the human
experts.

An example of an automatically generated YARA rule for the Syringe malware family is shown below⁶.
It may be noted that the statistical features exploited during clustering (Section IV-A) are not
used in the rule, as they would result in over-complicated rules hardly understandable by humans.

    rule YaYaSyringe {
        condition:
            androguard.filter("action.BATTERYCHECK")
            and androguard.number_of_services == 3
            and androguard.permission("SYSTEM_ALERT_WINDOW")
            and androguard.url("http://s.adslinkup.com/v2")
            ...
    }

⁶ The complete version of the rules is available on Koodous at
https://koodous.com/rulesets/3243

The process is performed in three steps: a reasonable signature composed of a small number of
clauses is generated; the signature is checked against the full database of applications, and false
positives are identified; the generation procedure is run again, but explicitly taking into
consideration the false positives discovered in the second step.

Fig. 3 exemplifies the process of generating a signature for two malware samples $m_a$ and $m_b$,
and two legitimate applications $g_a$ and $g_b$. In the first phase, the algorithm defines a
signature $Y = r$, where $r$ is a single clause composed of the features common to the two malware
samples: $r = m_a \cap m_b$. Indeed, a rule $Y$ detects an application $m$ only if $Y$ is a subset
of $m$: $Y \subseteq m$.

During the second phase, the rule $Y$ is checked against the complete database, where it generates
two false positives matching two legitimate applications $g_a$ and $g_b$. The clause $r$ is
therefore too generic to be used as a signature.

As it is not possible to find features common to the malware samples that do not match the
legitimate applications, $(m_a \cap m_b) \setminus g_a = \emptyset$ and
$(m_a \cap m_b) \setminus g_b = \emptyset$, the third step generates a signature with the
disjunction of two clauses $Y^* = (r \wedge r_a) \vee (r \wedge r_b)$.

FIGURE 3. Schema of the process of generation of a YARA rule. In the first phase a signature
$Y = r$ is defined for malware $m_a$ and $m_b$. In the second phase $Y$ is checked against a dataset
of goodware ($g_a$ and $g_b$). Finally, in the third phase, a new signature
$Y^* = (r \wedge r_a) \vee (r \wedge r_b)$ is created to avoid false positive detection of $g_a$ and
$g_b$.

Algorithm 1 Automatic YARA rule generation
1: procedure GENERATESIGNATURE(R)
2:     C ← Clauses(R, ∅)
3:     Y ← Clot(R, C)
4:     G ← GetFalsePositives(Y)
5:     C∗ ← Clauses(R, G)
6:     Y∗ ← Clot(R, C∗)
7:     DumpAsYARARule(Y∗)

The pseudocode of the algorithm is reported in Algorithm 1: at first it determines a suitable set of
clauses (function Clauses), then picks a subset of them of variable size to build an optimal family
signature (function Clot). Lines 2 and 3 correspond to the first phase of Fig. 3; line 4, to the
second; lines 5 and 6, to the third.

Algorithms 2 and 3 add more details about the procedure: the function Clauses extracts the clauses
that can be used to build the signature, and is based on a heuristic algorithm. First, each malware
application $r_i$ in the target set $R$ is transformed into a single clause $y_i$ able to detect it
using all available literals. Such clauses are not directly usable, but are the starting point of
the iterative procedure for building the set of optimal clauses $H$: in each step, the least generic
$y_i$ is selected and compared against all clauses in $H$ calculating the
common features $z_i$; the least generic of these $z_i$ is eventually considered for inclusion in
$H$.

Algorithm 2 Clauses extraction
1: function CLAUSES(R, G)
2:     Y ← {Features(r) ∀r ∈ R}
3:     for all r ∈ R do
4:         Y ← Y ∪ SelectedClauses(r)
5:     H ← {Features(r) ∀r ∈ R}
6:     while |H| > 0 do
7:         h ← LeastGeneric(H)
8:         Z ← {CommonFeatures(h, y) ∀y ∈ Y}
9:         z ← LeastGeneric(Z)
10:        F = {r ∈ G | Det(z, r) = True}
11:        if F = ∅ and z ∉ Y and Quality(z) > Tq then
12:            H ← H ∪ {z}
13:            Y ← Y ∪ {z}
14:        H ← H \ {h}
15:    return Y

The rationale is to build $Y$ by adding clauses progressively less specific (i.e., checking fewer
features), but still usable in signatures. Line 10 computes the set $F$ of applications from $G$
detected by the candidate clause; as $G$ is the set of all potential false positives, if $F$ is not
empty the clause is too generic to be usable. Additionally, the function Quality(·) performs a
heuristic evaluation of the clause: if the quality is below a certain threshold $T_q$, the rule is
so generic that it is likely to create false positives in the near future (see IV-C1 for more
details). For each application, a few not-too-generic, heuristically selected clauses are also
included (i.e., $r_a$ and $r_b$ in the example shown in Fig. 3).

The function Clot (Algorithm 3) implements a dynamic greedy algorithm for building the signature as
a disjunction of clauses. It iteratively adds one clause to $Y$ from a set $C$ until all
applications in $R$ are detected by at least one clause in $Y$.

Algorithm 3 Clauses selection
1: procedure CLOT(R, C)
2:     Y ← ∅
3:     D ← ∅
4:     while R ≠ C do
5:         if ∃r ∈ R \ D : Critical(r) = True then
6:             r̄ ← GetCritical(R \ D)
7:             Z = {z ∈ C | Det(z, r̄) = True}
8:         else
9:             Z = {z ∈ C | ∃r ∈ R \ D : Det(z, r) = True}
10:        Y ← Y ∪ {MostUseful(Z)}
11:        D ← {r ∈ R | ∄y ∈ Y : Det(y, r) = True}
12:    return Y

In an iterative way, Clot first picks out all clauses that detect at least one application not yet
detected by any rule, with the only exception that, if an application can be detected by only one
clause, that clause is the only one picked. Then the algorithm selects among this group the clause
that is able to detect the most applications in the original target set $R$.

1) Rule quality
A heuristic evaluation is used to reduce the risk of false positives in the future and to increase
the perceived quality of the rule. We defined a heuristic score $S(\cdot)$ for a rule, inversely
related to its generality. More formally, let us associate each literal $l$ with a score $S^*(l)$
that measures how specific the literal is. The score of a clause $c_i$ is the sum of the scores of
the $n_i$ literals composing it: $S(c_i) = \sum_{k=0}^{n_i} S^*(l_k^i)$. The score of a rule $r$ is
the minimum among the scores of its clauses: $S(r) = \min_{\forall i} S(c_i)$.

The higher the score, the more specific a rule is and the less likely it is to generate false
positives. On the other hand, the lower the score, the more a rule will be able to generalize, while
being more prone to generate false positives in the future. High quality signatures require an
optimal balance between generality and specificity, and this is one of the main challenges in
automatic signature generation. We use two thresholds $T_{min}$ and $T_{max}$, where the lower is
the minimum score that a rule needs to be valid, and the higher is used in the optimization process
to avoid overly-specific rulesets.

All the clauses in YARA rules created by expert analysts are valid, that is, the score assigned to
literals must guarantee that $\forall r \in R_{expert} : T_{min} \le S(r) \le T_{max}$. We consider
invalid the rules containing a clause mentioning only Android official permissions and intent
filters, or containing a clause composed of a single literal, with the exception of accessing a URL
that has been detected as malicious by VirusTotal or similar services. Then, we exploit the simplex
method as a means to automatically define $S^*(\cdot)$ starting from the existing ruleset.

The simplex method is a linear programming technique, which refers to the problem of optimizing a
linear objective function $\zeta$ of $m$ variables $x_i$ subject to a set of $n$ linear inequality
constraints. In standard form, the problem of finding an optimal set of weights for $m$ literals can
be expressed as:

$$\min \zeta = c^T \times x$$
$$\text{s.t.} \quad -A \times x \ge -b, \quad x \ge 0$$

where $c_i = 1, \forall i = 1 \ldots m$, since the objective function $\zeta$ minimizes the number
of literals in each clause, $x \in \mathbb{R}^m$ is a vector of $m$ unknown weights, and
$b_i = T_{min}, \forall i = 1 \ldots n$, as we want each existing literal combination to satisfy the
minimum score of all existing rulesets.

Finally, $A$ is an $n \times m$ matrix that relates each clause to its own literals:

$$A = \begin{pmatrix}
l_{11} & l_{12} & l_{13} & \ldots & l_{1m} \\
l_{21} & l_{22} & l_{23} & \ldots & l_{2m} \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
l_{n1} & l_{n2} & l_{n3} & \ldots & l_{nm}
\end{pmatrix}$$
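The scoring scheme of this subsection, with $S(c_i)$ as the sum of literal scores and $S(r)$ as the minimum over the clauses of a rule, can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the literal names, the weights, and the threshold values are hypothetical, whereas in the paper the weights $S^*(\cdot)$ are derived from the existing rulesets through the linear program.

```python
def clause_score(clause, literal_score):
    """S(c): sum of the scores S*(l) of the literals in the clause."""
    return sum(literal_score[literal] for literal in clause)


def rule_score(rule, literal_score):
    """S(r): the minimum score among the clauses of the rule."""
    return min(clause_score(clause, literal_score) for clause in rule)


def is_valid(rule, literal_score, t_min, t_max):
    """A rule is valid when T_min <= S(r) <= T_max."""
    return t_min <= rule_score(rule, literal_score) <= t_max


# Hypothetical literal weights (higher means more specific).
weights = {
    "permission:SEND_SMS": 1,
    "filter:BOOT_COMPLETED": 1,
    "number_of_services==3": 2,
    "url:malicious.example": 4,
}

# A rule is a disjunction of clauses; each clause is a set of literals.
rule = [
    {"url:malicious.example", "number_of_services==3"},
    {"permission:SEND_SMS", "filter:BOOT_COMPLETED", "number_of_services==3"},
]

score = rule_score(rule, weights)
```

Because the rule score is the minimum over clauses, a single overly generic clause is enough to push the whole rule below $T_{min}$, which matches the intent of rejecting rules that contain a weak clause.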
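The greedy selection performed by Clot (Algorithm 3) can be sketched in Python as follows. The sketch follows the prose description rather than the pseudocode symbol by symbol, and it rests on several assumptions: applications and clauses are modeled as sets of literal strings, Det(z, r) is taken to be the subset test stated earlier for rule detection, an application is treated as critical when exactly one clause detects it, the loop runs until every application is detected, and MostUseful picks the candidate that detects the most applications in $R$. The example data are hypothetical.

```python
def detects(clause, app):
    """Det(z, r): a clause detects an application if all its literals occur in it."""
    return clause <= app


def clot(apps, clauses):
    """Greedily pick clauses until every application in `apps` is detected."""
    signature = []    # Y: the disjunction of clauses being built
    detected = set()  # indices of the applications already detected
    while len(detected) < len(apps):
        pending = [i for i in range(len(apps)) if i not in detected]
        candidates = None
        for i in pending:
            usable = [c for c in clauses if detects(c, apps[i])]
            if not usable:
                raise ValueError(f"no clause detects application {i}")
            if len(usable) == 1:  # critical application: only one clause fits
                candidates = usable
                break
        if candidates is None:
            candidates = [c for c in clauses
                          if any(detects(c, apps[i]) for i in pending)]
        # MostUseful: the candidate detecting the most applications overall.
        best = max(candidates, key=lambda c: sum(detects(c, a) for a in apps))
        signature.append(best)
        detected = {i for i, a in enumerate(apps)
                    if any(detects(c, a) for c in signature)}
    return signature


apps = [{"a", "b", "c"}, {"a", "b", "d"}, {"e", "f"}]
clauses = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"e"})]
signature = clot(apps, clauses)
```

Every candidate considered in an iteration detects at least one pending application, so each pass strictly enlarges the detected set and the loop terminates whenever every application is detectable by some clause.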
TABLE 2. Details about the number of rules, clauses, and unique clauses analyzed to find the optimal
score for each literal.

Source                 Num. of YARA rules  Num. of DNF clauses (Total)  Num. of DNF clauses (Unique)
---------------------  ------------------  ---------------------------  ----------------------------
Koodous public rules   348                 788                          104
Yara-Rules on GitHub   348                 697                          48

TABLE 4. Comparison of Homogeneity (Hom.) and Completeness (Comp.) index values between the families
inferred by the clustering process (using both the iterative clustering with different chunk sizes
N, and the non-iterative version), and the families labels extracted from Koodous and VirusTotal.

               Koodous labels    VirusTotal labels
N              Hom.    Comp.     Hom.    Comp.
-------------  ------  ------    ------  ------
50k            0.96    0.36      0.85    0.49
100k           0.96    0.35      0.85    0.49
200k           0.96    0.35      0.85    0.50
non-iterative  0.92    0.36      0.78    0.50
TABLE 6. Comparison of the detection results between VirusTotal and two datasets of 50,000 applications, respectively undetected (und.) and detected (det.) by Koodous. Columns indicate the number of applications unknown (unk.), undetected (und.), detected by at least one AV (det.), and detected by more than three AVs, as reported by VirusTotal.

TABLE 8. Example of a Type 4 malware family. As the first two samples are already detected in Koodous by the YARA rule Xynyin.Trojan, the system identifies other applications within the cluster as potentially malicious too. The comparison with VirusTotal (the number of detections is reported) and a manual analysis confirm the accuracy of the system.
affinity to the Xynyin malware family¹⁴.

¹⁴ On 24 August 2017 VirusTotal updated the detection, identifying the applications as malicious too.

One of the major benefits of a semi-supervised system is to limit the detection of false positives, and the operation is further simplified since the analysts should only focus on groups of similar applications, without considering single samples. As a useful side effect, the system could also be used to improve the precision of the results, by reducing false positive detections for those families of applications that have been partially misclassified by existing signatures.

4) The Iterative algorithm

The adoption of the iterative approach brings a number of benefits: it proved to be essential in order to analyze millions of applications, and the resulting number of outliers, as illustrated in Table 9, is much lower than what was obtained by clustering all applications together. The time required by the clustering phase is proportional to the chunk size and it is up to one order of magnitude lower than in the non-iterative case.

TABLE 9. Comparison of the clustering results using both the iterative version with different chunk sizes N, and the non-iterative one. Column "Time (s)" indicates the time (in seconds) required by the clustering process, while column "Outliers" reports the number of outliers found at the end of the iterations.

N                Time (s)   Outliers
50k              5,746      64,553
100k             6,408      65,685
200k             10,573     67,081
non-iterative    16,592     119,919

The adoption of the iterative approach does not affect the quality of the results, even though using a bigger chunk size results in a greater number of new detections.

TABLE 10. Comparison of the indexes for the clustering labels inferred by the iterative approach (with different chunk sizes N), using the assignment produced by the non-iterative version as a reference.

N      ARI    Homogeneity   Completeness   V-Score
50k    0.26   0.92          0.78           0.85
100k   0.27   0.93          0.81           0.86
200k   0.29   0.94          0.84           0.89

Table 10 compares the iterative approach using as a reference clustering assignment the one produced by the non-iterative version. A relatively low ARI value indicates a difference in the clustering assignment between the two approaches, while a very high homogeneity value, compared to completeness, is a clear sign of a finer cluster partitioning. In other words, using the iterative approach the quality of the information is not compromised, although the resulting clusters are smaller, hence less likely to contain enough applications that span different detection areas, finally resulting in a lower extension. A bigger chunk size lowers the differences between the iterative and the non-iterative assignment, as shown by an increasing V-score value. Eventually, if a large enough chunk size is used, the iterative approach produces almost the same results as the non-iterative one, while generally finding a higher number of clusters, as illustrated in Table 5, and fewer outliers (Table 9).

Finally, in order to further test the scalability of the proposed method, we successfully applied the algorithm on a very large dataset of 10 million applications, using a chunk size N = 500k.

B. AUTOMATIC FAMILY SIGNATURES GENERATION

In order to evaluate the effectiveness of the automatic signature generator, we compare the detection results of several YARA rules automatically generated by the proposed algorithm with existing rulesets created by expert analysts.

Table 11 reports the results of the rule detections on a dataset of 1.5 million applications: in all the cases, the automatically generated rules¹⁵ performed better than the ones authored by humans, increasing the number of detections by 8.2% up to 131.2%, without generating any false positives.

¹⁵ Example rulesets could be found at the following address: https://koodous.com/analysts/YaYaGen/rulesets

Referring to Section IV-C, in all the cases the rule generation process stopped at the second step, as none of the new rules produced any false positives in the current dataset of applications. A further manual analysis of the detected applications confirmed that no false positive was generated.

As shown in Table 12, the time required to generate a rule for a few hundred malware samples is always less than a minute, although when the target increases to a few thousand applications, the time required grows up to several minutes, as the most expensive part of the process is the check for false positives against a reference dataset. This is not considered a limitation, since the whole process is automatic and, given the quality of the results, it is an invaluable support for the family signature generation process.

Table 12 reports the number of literals (i.e., application features) and the final score for each generated YARA rule: referring to Section IV-C, each score is higher than the minimum threshold T_min = 400, satisfying the minimum requirement for acceptability in order to avoid false positive detections, and lower than the maximum threshold T_max = 700, as a result of the optimization process to increase the rule generality and therefore the ability to catch future malware variants.

¹⁶ https://koodous.com/my_rulesets/3466

As shown in the example of Section IV-C, in order to increase the effectiveness of a rule, URLs are included only if they are known to be malicious, as in the case of http://s.adslinkup.com/v2 for the Syringe malware family. Moreover, aiming at identifying malware with very high precision and avoiding false positives, whenever available, the automatic signature generator includes those attributes extracted from the application analysis that contain a typing mistake. For instance, the rule YaYaMetasploit1¹⁶ includes a wrong permission ACCESS_COURSE_LOCATION instead of the correct
one ACCESS_COARSE_LOCATION. Given the difficulty of reproducing such an uncommon mistake, we consider this feature as a hard indicator of the maliciousness of a sample.

TABLE 11. Comparison of detection performances of human-authored YARA rules (Original) with automatically generated ones (Auto). The last column reports the improvement (in percentage) for the newly generated rules. Detections are tested on a dataset of 1.5 million applications.

Rule name           Detections (Original)   Detections (Auto)   Improvement
SmsSender           539                     1,004               +86.3%
Syringe             220                     315                 +43.2%
HummingBad2         136                     257                 +89.0%
Marcher2            559                     652                 +16.6%
SMSReg              159                     172                 +8.2%
VolcmanDropper      186                     430                 +131.2%
FakeGoogleChrome    516                     822                 +59.3%

TABLE 12. Number of literals, score and time (in seconds) required to generate each YARA rule.

Rule name           Literals   Score   Time (s)
SmsSender           15         412     43
Syringe             19         574     48
HummingBad2         12         599     52
Marcher2            20         686     49
SMSReg              34         537     42
VolcmanDropper      10         439     13
FakeGoogleChrome    15         407     43

VI. RELATED WORK

A. CLUSTERING APPLIED TO MALWARE ANALYSIS

The first attempt to automatically group computer malware based on their behavior dates back to Lee and Mody [26], who use a sequence of runtime events (e.g., registry and file system modifications) to cluster similar programs. As a similarity measure, they choose a variant of the edit distance, which is demanding in terms of computational resources, since it has a computational complexity O(n^2) in the number n of features.

Later, Bailey et al. [27] propose a system for automated malware classification and analysis as a remedy for the inconsistent and incomplete labeling that commonly affects traditional antivirus products. By applying single-linkage Hierarchical Agglomerative Clustering (HAC) with Normalized Compression Distance (NCD) and using an inconsistency measure as a cutting criterion, Bailey et al. are able to automatically categorize malware profiles into groups that reflect similar classes of behaviors in terms of system state changes. While results are generally affected by the restrictions of dynamic analysis, for the first time they introduce the idea of “detection through clustering”, exploited in our proposed framework.

In their work, Apel et al. [5] study which combination of metrics (i.e., Edit Distance, Approximated Edit Distance with Blockwise Hashing, NCD and Manhattan Distance) and n-gram features is most appropriate for determining relations between malware samples. They define three different criteria to support their evaluation (i.e., appropriateness, computable efficiency and local sensitiveness), using single-linkage HAC as clustering algorithm. Experimental results show that Manhattan Distance along with 3-grams delivers the best results, while NCD and Edit Distance generally perform poorly.

Neither Lee and Mody [26], nor Bailey et al. [27] have any specific solution to large-scale clustering. On the other hand, Bayer et al. [13], Rieck et al. [28], and Jang et al. [12] directly address the problem of managing large datasets, developing methods to scale the clustering process.

Bayer et al. [13] propose a scalable malware clustering approach using a combination of approximate and hierarchical clustering with Locality Sensitive Hashing (LSH) [29] to significantly reduce the number of distance computations. By extending Anubis [30], they are able to extract detailed behavioral reports based on taint tracking results and network captures from malware execution. In particular, the taint engine allows them to map low-level operations (e.g., system calls) to operating system objects (e.g., registry keys and files). By deploying LSH, Bayer et al. are capable of clustering 75,000 samples in less than 3 hours. By contrast, Rieck et al. [28], [31] propose an incremental approach, where they alternate a prototype-based clustering algorithm with a classification step, eventually reducing the runtime complexity by performing clustering only on representative samples.

Jang et al. [12] develop BitShred as a remedy to the problem of clustering large data sets with high-dimensional feature sets. They propose to use feature hashing to reduce the dimensionality of high-scale feature sets, while reducing the computational cost of the calculation of the Jaccard index using an approximated version that exploits bit-vector arithmetic. However, since BitShred simply relies on a static analysis approach, results are susceptible to binary-level obfuscation.

In 2010, Perdisci et al. [11] propose a network-based version of a behavioral malware clustering system, relying on a three-step clustering refinement process, starting from the analysis of malicious HTTP traces. The first phase consists of a coarse-grained clustering where malware samples are grouped together according to simple statistical similarities; subsequently, a fine-grained clustering further splits samples considering structural properties of HTTP queries. In the final step, fine-grained clusters whose centroids are close to each other are merged together. The system is tested on HTTP traces generated from 25,000 applications using single-linkage HAC and the Davies-Bouldin (DB) validity index [32] as cutting criterion. While the underlying idea of a multi-step clustering refinement process is quite interesting, this practically results in the biggest limitation to the scalability of their work. Moreover, Perdisci et al. limit behavioral analysis to HTTP-based malware only, which in practice can be easily bypassed by using an encrypted protocol (e.g., HTTPS).

In 2013, Hu et al. [10] present MutantX-S, focusing on malware comparison and triage on a large scale. Their system falls into the static-analysis category, since it relies on features extracted from the malware instructions. MutantX-S can efficiently cluster a large number of samples into families based on program static features, by extracting N-gram features directly from the x86 opcode sequences and exploiting a feature hashing technique to reduce feature dimensionality, thus significantly lowering the memory requirement and computation costs. MutantX-S adopts the same prototype-based algorithm of [31] because of its efficiency and explicit expression of malware features.

In the Android context, ClusTheDroid [33] is the first research to combine behavioral analysis and clustering to specifically target Android malware. The goal is both to develop a tool and to evaluate clustering alternatives. In the end, the authors focus on single and complete linkage HAC, using a feature set composed of 38 numerical quantities extracted from the CopperDroid [34] report, and weighted according to a three-level interpretation of malware behaviors.

Differently from most of the previous works [5], [11], [13], [27], [33] that rely on the HAC algorithm (which is expensive both computationally and in storage, respectively O(n^2 log n) and O(n^2) [35]), we use HDBSCAN, which, with N data points, has an average complexity of approximately O(N log N) [20] and a space requirement of O(N), making it applicable to large datasets. Furthermore, differently from [31], we devise an iterative clustering approach where HDBSCAN is iteratively applied over the entire dataset, without the need to alternate any classification step, finally discovering precise families of applications with a shared behavior.

B. EVALUATING CLUSTERING RESULTS

The clustering problem is inherently ill-posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world [36]. Cluster validity analysis often involves the use of subjective criteria of optimality specific to a particular application. Therefore, no commonly accepted standard for validating the output of a clustering procedure exists [37]. In real-world applications, it is often completely infeasible to manually investigate the results of a clustering, making the definition of automatic measures necessary [33]. Helpful metrics to determine the quality of a clustering process are commonly classified into internal and external indexes. The former evaluate both cluster cohesion and separation, which determine how distinct or well-separated a cluster is from others. The latter, on the other hand, use a reference set as a means of quality control for the setup of the clustering algorithm [33].

In the field of malware analysis, clustering validation is further complicated by the intrinsic difficulty of establishing a reliable ground truth. Firstly, malware analysis is challenging and it gets more difficult when anti-analysis, triggering sequences and dynamic code loading techniques are in place. Secondly, not even a manual categorization would provide a reliable partition, since most of the malware could not be unequivocally assigned to categories; not to mention the [...]

As a reference set is not available, one possibility is to take advantage of labels assigned to each malware sample by several antivirus scanners. The availability of services that specifically provide these results (e.g., Metadefender¹⁷ or VirusTotal¹⁸) eases the procedure. However, there is an intrinsic complexity in defining a unique labeling schema, since most malware samples end up being marked as belonging to one malicious category only. As a matter of fact, Bailey et al. [27] showed that antivirus labeling fails to satisfy three fundamental criteria: consistency among different products, completeness in malware tagging, and conciseness in label semantics. One possible explanation is that signatures used in the malware-matching algorithms mostly evaluate static properties of the binary, in contrast to behavioral properties: the result is that families found using static features might be quite different from ones established using behavioral features. Moreover, different AV products apply different criteria and granularity to rule generation, resulting in inconsistent results. Despite the complexity and intrinsic challenges of the procedure, given the importance of automatically building a malware reference dataset to evaluate clustering results, the problem was directly tackled in different researches, such as VAMO [38] and AVclass [23].

¹⁷ https://metadefender.opswat.com

In the literature of malware clustering, several techniques are proposed. Bayer et al. [13] and Jang et al. [12] use precision and recall to compare the results of their system-level behavioral clustering to a reference dataset, defining a manual mapping between labels assigned by different AVs. However, as the dataset size increases this method becomes hardly sustainable and quite costly. Similarly, ClusTheDroid [33] used a reference set developed through manual analysis [39].

On the other hand, Apel et al. [5] choose to take into consideration the amount of “shared behaviour” that can be found among different analysis traces within the same cluster of applications. In practice, each system call is modeled as a single character, and the evaluation is computed in linear time by finding all substrings in a generalized suffix tree, using the algorithm described in [40]. The main limitation of this technique is related to the choice of the reference dataset, since Apel et al. use an artificial dataset starting from three real-world malware traces, then divided into blocks of system calls and randomly permuted.

Differently, Perdisci et al. [11] tackle the problem by measuring the cohesion and separation of each cluster, in terms of agreement between labels assigned by cluster and multiple AV scanners. However, since AV labels have been shown to be inconsistent [41], the measures of cluster cohesion and separation only give an indication of the validity of the clustering results.

C. SIGNATURE-BASED DETECTION

Early AV products used the hash value of an application to detect malicious software. However, every modification in
the source code, as tiny as one byte, results in a detection evasion. Today’s signatures are pattern-matching rules commonly defined on static or dynamic properties of applications under analysis and, even though they are assisted by heuristic and AI-based solutions, still represent the most reliable (i.e., with the lowest false positives) antivirus technology.

1) Automatic Signature Generation

A number of prior works propose systems to automatically generate different types of network signatures to identify malicious traffic.

Honeycomb [42], Autograph [43], and EarlyBird [44] propose the generation of signatures comprising a single contiguous string (i.e., token). Later on, PAYL [45], Nemean [46], Hamsa [47] and Botzilla [48] introduce more complex methods based on token subsequence signatures. Other researches like ProVex [49], AutoRE [50], ShieldGen [51], and [52] also tackle the problem of automatically generating network signatures, although their applicability is specific to network traffic detection.

In 2005, Newsome et al. introduce Polygraph [53], a system which exploits the Token-Subsequence algorithm to automatically obtain IDS signatures to match polymorphic worms. Polygraph is tested against three real-world exploits and is able to successfully generate HTTP and DNS signatures with a low false positive rate.

Perdisci et al. [11] also tackle the problem of automatically generating network signatures for cluster centroids, with the aim of deploying them into an IDS at the edge of a network in order to detect malicious HTTP traffic. Since malware samples may contact legitimate websites for malicious purposes, instead of pre-filtering HTTP traffic against legitimate websites, the authors apply a pruning process by testing the signature set against a large dataset of legitimate traffic, while discarding signatures that generate false positives, although such an approach is only as effective as the legitimate traffic available.

In the Android context, Faruki et al. [54] propose AndroSimilar, a statistical signature-based solution that generates variable-length signatures for the application under test and identifies malware on the basis of a similarity percentage with a dataset of known malicious samples.

Another approach is presented in DroidAnalytics [55], a signature-based analytic system, which extracts and analyzes applications at opcode level. Firstly, a three-level signature (i.e., methods, classes, application) is generated by combining the API call traces, then the malware is associated to a family according to its malicious content.

While [54] shows robustness against control-flow obfuscation, junk method insertion and string encryption, [55] could fail in the detection of repackaged malware. On the other hand, both solutions are affected by a high false-positive rate due to the wrong choice of signature patterns available in both malicious and benign applications.

Since the release of YARA [56], a pattern-matching language [...] samples, a few automatic tools have been proposed to generate malware signatures which balance the required generality to catch future samples with the need to avoid false positive detections.

In 2013, Chris Clark develops YaraGenerator¹⁹, a Python program which automatically generates YARA rules by sampling a small subset of common strings between malware, while blacklisting goodware ones. Although the tool is designed to work with any type of malicious file, in order to increase the efficacy of the results, specific datasets of goodware strings are available for several file formats (e.g., Windows executable, PDF, email and office document).

¹⁹ https://github.com/Xen0ph0n/YaraGenerator

Similarly, yarGen²⁰ is a Python tool developed by Florian Roth to automatically generate YARA rules by combining the topmost malware strings, while removing those that also appear in goodware files. By using fuzzy regular expressions, each malware string is assigned a score proportional to the inverse of its frequency, and the “Gibberish Detector” makes it possible to select real language over meaningless character chains. The tool also exploits a naive Bayes classifier to classify candidate strings, avoiding compression or encryption garbage in favor of more generic strings. Finally, each rule is created by combining the 20 strings with the highest score. The result of the generation process may be a single rule, specific to one sample, or a super rule, catching malware variants and groups.

While both YaraGenerator and yarGen have been developed to support rule creation, rather than to completely replace the role of expert analysts, their major drawback is that their efficacy strongly relies on the completeness of the dataset of goodware strings.

Differently from previous works, which mostly rely on the search of an optimal sequence of opcodes or strings, the proposed algorithm generates signatures from a set of attributes extracted from the application analysis, finding an optimal combination to minimize false negatives and guarantee zero false positives in the current set of applications. None of the previous researches can be directly applied to solve such a problem. Moreover, the proposed approach exploits a heuristic measure to find the right balance between rule generality and specificity, using the same criteria that expert analysts adopted while authoring existing rulesets.

VII. LIMITATIONS

A major limiting factor of the described semi-supervised approach is represented by the ability to extract meaningful information from the applications under analysis. Indeed, the accuracy of the analysis directly affects the clustering results and the automatic rule generation process. The Android platform lacks mature reverse engineering tools compared to the ones used for x86 malware [57]. Since each malware is different, automatically finding the malicious code by means of static analysis is difficult, because it is mixed with
benign code; moreover, dynamic code loading and reflection make the analysis even harder. Unfortunately, most malware include trigger-based anti-analysis techniques that delay or hide their malicious activities at the first application run or in an emulated environment. For instance, the family of applications known as DroidKungFu²¹ uses a time bomb of 240 minutes to schedule the execution of its malicious code, so a simple dynamic analysis fails to observe interesting behaviors. However, in this research we do not address problems related to application analysis, as we focus on the detection of new samples and the automatic generation of new signatures.

²¹ Sample MD5: 7f5fd7b139e23bed1de5e134dda3b1ca

Evasion attacks, such as noise-injection attacks [58] and other similar approaches [59]–[62], may affect the correctness of the results of the clustering and the signature generation. Those attacks rely on the ability to inject, in the analysis platform, applications specially crafted to mislead the clustering process and the generation of a good detection model.

In the described system, an attacker could exploit such attacks by injecting specially crafted applications with the final goal of generating a false positive or a false negative detection. However, in both cases we assume that the detection information of already known threats (identified through signatures or by triage) cannot be maliciously tampered with, thus newly injected families will result in a Type 4, 5, 6 or 7, and hence will be subject to manual validation.

If the attacker wants to deliberately generate a false positive, several malicious applications whose statistical properties are similar to a target goodware can be injected. Since a false positive detection mainly generates a disruption to a third party service, causing a loss of reputation for the AV solution, the magnitude of the echo is proportional to the diffusion of the target goodware. As a matter of fact, the analyst will be alerted by such a family.

On the other hand, if the goal is to generate a false negative, the attacker could inject several goodware applications with the same statistical properties as a target unknown malicious app. Such a family could be misclassified as entirely goodware even after the validation process, as the manual analysis focuses only on a few samples. However, such a situation applies only as long as the malware is a zero-day, and no specific knowledge about that threat is available. The identification of zero-day malware is a challenging and open research problem in the security community.

Finally, the proposed system strongly relies on the information provided by the platform to automatically extend the detection to new applications and identify new potential malware families. It is a prerequisite that this information is not tampered with by any malicious actor. Although Koodous provides protection mechanisms for both YARA rules (rules undergo a review process before becoming active) and the triage process (community members are subject to a reputation check), it is not the intent of this research to tackle those issues, leaving their study to future works.

VIII. FUTURE WORKS

The work presented in this paper can be improved and extended in a number of ways. At the time of writing, we are focusing our efforts on the correct management of new samples collected every day. Since the current version of the system does not allow new applications to be added incrementally to the existing model, when enough samples are collected, those are treated as a new iteration of the clustering process. As an alternative to the iterative approach, incremental clustering algorithms have been proposed [63]–[65], although they are still not directly applicable to HDBSCAN. Their study and adoption will be addressed in future works.

IX. CONCLUSION

In this paper, we introduced a set of semi-supervised techniques with the ultimate goal of assisting human experts in the generation of malware family signatures. As a result, we developed a scalable framework able to dig into massive datasets of Android applications with the main purpose of identifying new malware samples, while reducing false positive detections.

Our study shows that combining the scalability of the automatic techniques with the inherent flexibility of the manual analysis achieves the best performance. Eventually, the proposed approach introduces two essential automation improvements in a well known and tested standard AV detection mechanism based on signatures. An iterative clustering algorithm allows for easy identification of hard-to-find potential threats, reducing the human intervention from the manual analysis of thousands of applications to the validation of a much smaller number of clusters where applications reflect a similar class of behavior. Subsequently, an automated procedure, which exploits a heuristic optimization strategy, generates a set of YARA rules to cover newly identified malware with an acceptable generalization capability while minimizing false positives.

Experimental results on a dataset of 1.5 million distinct Android applications confirm the effectiveness of the proposed system, both in the identification of new malware samples and in the generation of new family signatures in the form of YARA rules.

Finally, the proposed approach was deployed in January 2018 and, since then, it has been in use on Koodous, the mobile antivirus platform developed by Hispasec.

ACKNOWLEDGEMENT

Andrea Marcelli’s Ph.D. program at Politecnico di Torino is supported by a fellowship from TIM (Telecom Italia Group). The authors wish to thank Dario Lombardo for his support and insightful comments.

This article is based upon work from COST Action CA15140 ‘Improving Applicability of Nature-Inspired Optimisation by Joining Theory and Practice (ImAppNIO)’ supported by COST (European Cooperation in Science and Technology).
[52] P. Wurzinger, L. Bilge, T. Holz, J. Goebel, C. Kruegel, and E. Kirda, “Automatically generating models for botnet detection,” in European symposium on research in computer security. Springer, 2009, pp. 232–249.
[53] J. Newsome, B. Karp, and D. Song, “Polygraph: Automatically generating signatures for polymorphic worms,” in Security and Privacy, 2005 IEEE Symposium on. IEEE, 2005, pp. 226–241.
[54] P. Faruki, V. Ganmoor, V. Laxmi, M. S. Gaur, and A. Bharmal, “Androsimilar: robust statistical feature signature for Android malware detection,” in Proceedings of the 6th International Conference on Security of Information and Networks. ACM, 2013, pp. 152–159.
[55] M. Zheng, M. Sun, and J. C. Lui, “Droid analytics: a signature based analytic system to collect, extract, analyze and associate Android malware,” in Trust, Security and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International Conference on. IEEE, 2013, pp. 163–171.
[56] “Virus Bulletin :: Rule-driven malware identification and classification,” https://www.virusbulletin.com/virusbulletin/2008/01/rule-driven-malware-identification-and-classification, January 2008, (Accessed on 04/03/2017).
[57] N. Kiss, J.-F. Lalande, M. Leslous, and V. V. T. Tong, “Kharon dataset: Android malware under a microscope,” in The Learning from Authoritative Security Experiment Results (LASER) workshop. The USENIX Association, 2016.
[58] R. Perdisci, D. Dagon, W. Lee, P. Fogla, and M. Sharif, “Misleading worm signature generators using deliberate noise injection,” in Security and Privacy, 2006 IEEE Symposium on. IEEE, 2006, pp. 15–pp.
[59] J. Newsome, B. Karp, and D. Song, “Paragraph: Thwarting signature learning by training maliciously,” in International Workshop on Recent Advances in Intrusion Detection. Springer, 2006, pp. 81–105.
[60] W. Xu, Y. Qi, and D. Evans, “Automatically evading classifiers,” in Proceedings of the 2016 Network and Distributed Systems Symposium, 2016.
[61] B. Biggio, K. Rieck, D. Ariu, C. Wressnegger, I. Corona, G. Giacinto, and F. Roli, “Poisoning behavioral malware clustering,” in Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop. ACM, 2014, pp. 27–36.
[62] J. Crussell and P. Kegelmeyer, “Attacking dbscan for fun and profit,” in Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM, 2015, pp. 235–243.
[63] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu, “Incremental clustering for mining in a data warehousing environment,” in VLDB, vol. 98, 1998, pp. 323–333.
[64] N. Goyal, P. Goyal, K. Venkatramaiah, P. Deepak, and P. Sanoop, “An efficient density based incremental clustering algorithm in data warehousing environment,” in 2009 International Conference on Computer Engineering and Applications, IPCSIT, vol. 2, 2011, pp. 482–486.
[65] A. M. Bakr, N. M. Ghanem, and M. A. Ismail, “Efficient incremental density-based algorithm for clustering large datasets,” Alexandria engineering journal, vol. 54, no. 4, pp. 1147–1154, 2015.
FERNANDO DÍAZ is a malware analyst and software engineer. He is currently a B.Sc. student in Health Engineering at the University of Malaga and works at Hispasec Sistemas as a Security Engineer. His daily work focuses on automated malware configuration extraction, distributed analysis environments, and developing software for the Koodous platform. His research includes the analysis of new malware families and IoT malware.
ANDREA MARCELLI received his M.Sc. degree in Computer Engineering from Politecnico di Torino, Italy, in 2015. He is currently a Ph.D. student in Computer and Control Engineering at the same institute and a member of the CAD group. His research interests include malware analysis, semi-supervised modeling, machine learning and optimization problems, with main applications in computer security.
ANTONIO SÁNCHEZ is a malware analyst and research engineer. He received his M.Sc. in Computer Science from Universidad de Jaén (Spain) in 2013, and since 2012 he has been working as a security engineer at Hispasec Sistemas S.L. His daily work focuses on the improvement of systems for the automatic detection and analysis of malware samples, while his research interests include new techniques for the storage, recovery and correlation of big data.