Generating Android Malware Profiling Rules From Simple to Extract Metadata

1st Fabian Haugen
IDA
Linköping University
Linköping, Sweden
[email protected]

2nd André Willquist
IDA
Linköping University
Linköping, Sweden
[email protected]

Abstract—The aim of this research is to derive rules for profiling malware in Android applications, using data that can be easily extracted from the hosting marketplaces and data intrinsic to the downloaded APKs themselves. An automatic system for scraping metadata from the marketplace website, extracting intrinsic metadata and submitting APKs to Virustotal is set up. Based on the dataset generated by this system, analysis is done using decision trees and association analysis. Because the decision trees created turned out to be heavily biased towards rules for profiling benign APKs, the strategy of the research was changed for the association analysis: the dataset was modified in such a way as to increase the likelihood of extracting rules which profile malware.
From both of these analyses a set of rules was found. The rules were in general rather intuitive and reasonable, with a few outliers which were harder to interpret. The rules are discussed and their meanings analysed.
Index Terms—Malware Profiling, Android Security, Rule Extraction, Machine Learning

I. INTRODUCTION

Given the prevalence of malware in mobile applications shared on the web, Android as a platform should in no way be excluded from modern security considerations. Among applications offered on various marketplaces targeting Android users, many are disguised as benign while in reality exhibiting malicious behaviour when run. As of 2018, 26.6 million applications were detected to contain malware [1]. This is, however, still just a small part of all the Android apps available for download. Thus, when considering all applications currently in circulation, how does one determine which can be trusted and which to avoid? Scanning every application with well-known malware detection tools may be a good idea, but might not be feasible given the number of APKs available. The aim of this research is to generate rules from easy-to-extract metadata available on an Android marketplace, to aid security analysts in profiling malicious applications.
This task motivates the following research question:
• What rules are likely to be most effective for deciding whether a given app is malicious?

II. BACKGROUND

Terms in need of explanation will be defined here, along with a presentation of previous research lightly illustrating concepts discussed in this report, mostly relating to machine learning, data mining and Android security.

A. APK

An Android package, or APK, is a compressed file in the format used to distribute applications on the Android platform. APKs contain everything that is needed for an app to run on an Android phone. The relevant files for feature extraction are as follows:
• The Android Manifest defines many of the aspects which make up the metadata of the app, such as the name, the components of the app, the permissions it requires to run and the hardware and software features it requires. [2]
• The .dex files contain the application's program code.
• CERT.RSA contains certificate information of the application issuer.
• The assets/ and res/ directories contain data used by the application, such as images or text data.

B. Description of Relevant Android Permissions

• BROADCAST_STICKY - Allows an application to broadcast sticky intents. These are broadcasts whose data is held by the system after being finished, so that clients can quickly retrieve that data without having to wait for the next broadcast. Messages are sent between components of an application. Protection level: normal
• BLUETOOTH_ADMIN - Allows applications to discover and pair Bluetooth devices. Protection level: normal
• BLUETOOTH - Allows applications to connect to paired Bluetooth devices. Protection level: normal
• CALL_PHONE - Allows an application to initiate a phone call without going through the Dialer user interface for the user to confirm the call. Protection level: dangerous
• CHANGE_NETWORK_STATE - Allows applications to change network connectivity state. Protection level: normal
• INTERNET - Allows applications to open network sockets. Protection level: normal
• MODIFY_AUDIO_SETTINGS - Allows an application to modify global audio settings. Protection level: normal
• READ_PHONE_STATE - Allows read only access to phone state, including the phone number of the device, current cellular network information, the status of any ongoing calls, and a list of any PhoneAccounts registered on the device. Protection level: dangerous
• RECEIVE_SMS - Allows an application to receive SMS messages. Protection level: dangerous
• READ_SMS - Allows an application to read SMS messages. Protection level: dangerous
• READ_EXTERNAL_STORAGE - Allows an application to read from external storage. Protection level: none
• WRITE_EXTERNAL_STORAGE - Allows an application to write to external storage. Protection level: none
• PROCESS_OUTGOING_CALLS - Allows an application to see the number being dialed during an outgoing call with the option to redirect the call to a different number or abort the call altogether. Protection level: dangerous

C. Tools

• Virustotal is a website providing services for scanning files for malware with several different engines such as Avast and F-Secure. Virustotal also provides an API for automating the scanning of files via scripting. The Virustotal API can be accessed for free, but with some restrictions, which are in part enforced by requiring users to create an account on the Virustotal website. [3]
• NinjaDroid is a tool which automatically extracts most of the existing information from a given APK, such as file sizes, name, version, requirements, et cetera. [4]
• Weka is an application which contains several tools for machine learning and data mining. It is GUI based and allows the user to apply many different algorithms to provided data. One of the tools is specialized in performing association analysis and allows the user to perform such analysis with minimal knowledge of the underlying functionality and algorithms.
• Selenium is a tool for testing web applications. It uses a browser process, a web driver, to interact with a webpage using given commands. In this case, Google Chrome was used together with Selenium's Python implementation.

D. Classification

Classification of data in a machine learning sense means that, given measurements of different attributes of the sample that is to be classified, the model tries to find patterns so that it can predict the class of the object based on the attributes. An example of this would be that, given the number of downloads for an app, the model tries to predict whether the app is malware or not.

E. Selected machine learning algorithms

While technologies such as neural networks (often deep neural networks, also known as deep learning) and support vector machines (with non-linear kernels) are usually more in the public eye when it comes to machine learning nowadays, these techniques do have some weaknesses. One of these is the reliance on transforming the data into a feature space to ease the classification process and to enable non-linear classification. Being able to perform non-linear classification is essential for handling complex data, and thus the conversions to feature space are essential for these techniques to function properly. These feature spaces do however come with some drawbacks, one of the largest being a lack of interpretability. It can be extremely hard to tell exactly which properties of the input data have affected the outcome. Since it is hard to tell which inputs lead to which output, it is also hard to create human-understandable rules connecting input to output based on the trained models.
Because of these problems, this report resorts to other machine learning approaches which can provide rules, based on metadata, for deciding whether an app is malware or not. These approaches are introduced in the following sections.
1) Decision trees: The simplest classifier imaginable would probably classify based on one value in one dimension: for example, if the number of downloads is less than 10,000 then the app is assumed to be malware, otherwise it is assumed to be safe. Such a classifier is called a decision stump. While decision stumps might work on the simplest of data, they are obviously not good classifiers for more complicated data; even a relation as simple as y=x could not be modelled. However, if we combine several of these decision stumps, looking at different dimensions and having different thresholds on the values, a much better classifier can be created. This is what is called a decision tree, and it is what will be used as a classifier in this study.
The biggest advantage of decision trees compared to other relevant machine learning approaches is that the classification rules are immediately extractable from the trained model.
2) Association Analysis Rules: Association analysis is the theory of finding rules for the different classes in already existing data, and is thus extremely relevant for this report. The rules are found by looking at the dataset and identifying which patterns are often present in the interesting samples. Such a pattern could be a correlation between malware classification and the average review score of a given app. Since generating all possible rules would be hard and take far too long, there are algorithms limiting the generation of rules to the ones which are considered interesting, based on each rule's support and confidence.
The support of a rule is how often it occurs in the database. For example, if low review score and malware occur together in 1% of the entries in the data, then the support for this combination of attributes is 1%.
The confidence of a rule is the fraction of the cases in which the left hand side of the rule (the antecedent) occurs where the right hand side (the consequent) also occurs. For example, if in 70% of the cases where we have classified an app as malware we also find that the app has less than 10,000 downloads, then the confidence of the rule is 70%.
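The two measures can be made concrete with a minimal sketch in Python. The toy records, the attribute names `malware` and `low_downloads`, and the helper functions are illustrative assumptions, not data or code from this study.

```python
from typing import Dict, List

Record = Dict[str, bool]


def count(records: List[Record], items: List[str]) -> int:
    """Number of records in which every attribute in `items` is true."""
    return sum(all(r[i] for i in items) for r in records)


def support(records: List[Record], items: List[str]) -> float:
    """Fraction of all records in which the attributes occur together."""
    return count(records, items) / len(records)


def confidence(records: List[Record], lhs: List[str], rhs: List[str]) -> float:
    """Fraction of the records matching `lhs` that also match `rhs`."""
    matches_lhs = count(records, lhs)
    if matches_lhs == 0:
        raise ValueError("left hand side never occurs in the data")
    return count(records, lhs + rhs) / matches_lhs


# Hypothetical toy dataset: 10 apps, 4 of them malware,
# and 3 of those 4 have fewer than 10,000 downloads.
records = (
    [{"malware": True, "low_downloads": True}] * 3
    + [{"malware": True, "low_downloads": False}] * 1
    + [{"malware": False, "low_downloads": True}] * 2
    + [{"malware": False, "low_downloads": False}] * 4
)

print(support(records, ["malware", "low_downloads"]))       # 0.3
print(confidence(records, ["malware"], ["low_downloads"]))  # 0.75
```

In this toy data the combination occurs in 3 of 10 records (support 30%), and 3 of the 4 malware records also have few downloads (confidence 75%).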
The algorithms for rule generation require that the data is discrete, so the data must be discretized before the algorithm can be run. This means that the discretization parameters selected can severely affect the outcome of the algorithm.

F. Malware Classification on Application Metadata

In [5] it has been shown that it is possible to classify Android APKs as either malware or benign based on easily extracted metadata features, such as the number of required permissions and the time since the latest update. This lends credence to the idea of finding rules for classification of malware based on such easily extracted metadata features. It has also been shown that certain categories of apps are more likely to contain malware, once again supporting the hypothesis that such a classification is possible.

G. Motivation of Collected Features

In this section the initial motivation for the choice of attributes to analyze is presented, arguing for why they are considered important in the context of this report.
The expectations when setting out with the project are that:
• There will be a difference in malware density among the different categories, as shown in previous studies. [5]
• Apps with more permission requirements are more likely to be malware, since the goal of the malware is likely to gain access to as many of the phone's functionalities as possible.
• Apps with a lower user review score are more likely to be malware, since users who discover that malware is present are likely to give the app a lower review score.
• Apps with a higher version number are probably less likely to be malware, since it is unlikely that malware will get continuous support and updates after release.
• The ratio between the number of URLs and the size of the app might be indicative of malware. While it is likely that larger apps have more URLs, it is possible that malware has a higher number of URLs relative to its size.

III. METHODS

In this section the general workflow of the project is presented. This includes the fetching of apps, the labeling of the apps (as malware or benign) and the rule extraction from the collected data.

A. Choosing Markets

To select a suitable marketplace for downloading apps, a few different criteria were taken into account. The most important were:
• The number of free apps should be larger than 10,000
• The likelihood of viruses being present on the site should not be too low
The chosen marketplace was Anzhi.com, on which both stated criteria were fulfilled [6]. Furthermore, the site had relevant metadata readily available on every app page and had a structure that could be traversed without too much trouble (for more information on structure and traversal see section III-D).

B. Challenges and Rationale of the Program Structure

A challenge with this project is the amount of data which needs to be downloaded and processed. Assuming that the maximum size of each app is 30MB, which is the maximum we can submit to Virustotal, and that the goal of the report is to download and process about 10,000 apps, in the worst case approximately 300GB of data would need to be downloaded and stored. It is therefore preferable to manage the downloading and processing of apps in smaller batches. This allows the apps within a batch to be removed once processed, so that only the unprocessed apps need to be stored on the computer, greatly reducing the required space. This does however increase the complexity of the programming task.
Another challenge is the amount of time it takes to process all of the apps. Since Virustotal restricts the allowed number of API calls to 4 per key per minute, the expected time to process 10,000 apps would range from two to three days. This is quite a long time to expect a single process to run uninterrupted, especially when the process requires access to the internet to function. It is therefore preferable to save state as often as possible, so that the process can be paused at any point, or even stopped entirely, without having to restart the downloading and processing. This becomes harder when the downloaded apps are removed after they have been processed.

C. Structure of the Data Collection Code and Execution

The data collection code is made up of four main components:
• A module for downloading APKs and scraping metadata
• A module for submitting APKs to Virustotal for scanning
• A module for extracting the intrinsic metadata from the APKs
• A module for removing the already processed apps
These four components communicate with each other through files which they write to and read from. For example, when the downloading module has downloaded an APK it records this in two different files, to notify the submitting module and the extracting module that the APK is available for processing.
The entire system is also built up in a cyclical manner, where the processes do all of the work available to them in one cycle before restarting to check if any new work is available. For example, the submitter submits all of the APKs available to it in one cycle and then waits until the next cycle starts before checking if any new apps are available. This is done so that the cleaner has some time where all processes have stopped, during which it can remove APKs safely.
The system was parallelized between 6 computers, each working from its own assigned index.

D. Traversing the Website

For the site traversal the choice was made to fetch the app pages from an index page found on the site. All the apps
considered for download could be reached by incrementing a number in the URL of this page and following the app links listed there. The computers running the program were given indices spread out over the whole set.

E. Downloading the Metadata

Application metadata was obtained by loading the app-specific page on the marketplace and then scraping the relevant information from it (using the Python package BeautifulSoup), which was then saved to disk in a CSV file for future analysis. If an app was missing some of the expected metadata, a default value was used instead.
The chosen data was:
• The app name
• The review score
• The app category
• The number of comments

F. Downloading the APKs

The APKs were downloaded by clicking a download link on their respective app pages. Since the apps could not simply be downloaded by fetching from a specific URL, but only by executing some JavaScript code, a web driver was needed to initiate downloads. The Python package Selenium was used with a Chrome driver to click on the download buttons and run downloads in the browser. Since the filenames were randomly generated by Chrome, the file creation time was used to associate files with fetched data, which meant that the downloading process needed to be serial.

G. Extracting the Intrinsic Metadata

To extract intrinsic metadata from the APKs, the APK parser tool NinjaDroid was used. The apps marked ready for parsing were given as input, and NinjaDroid output JSON formatted data. From this output the relevant data was extracted and saved in a CSV file for future analysis. The chosen data was:
• The version number
• The app size
• The number of URLs in the APK (extracted from the .dex files)
• The number of shell commands used (extracted from the .dex files)
• The permissions required

H. Submitting the APKs to Virustotal

The submitter sends APKs to Virustotal for malware scanning. It is made up of one main thread, which loads in all of the apps and makes sure that all other subprocesses finish properly, as well as one new thread for each of the APKs which are to be processed.
As stated, for each of the APKs to be submitted one new thread is spawned, which is responsible for the entire process of scanning that APK. This process is made up of several steps, which vary depending on the responses received from Virustotal. The first thing done for each APK is to check whether the APK is already present on Virustotal. If this is the case, the next step is to check whether the current report on Virustotal is recent enough to use; here, recent enough is defined as made within the last 180 days.
If no report is present on Virustotal, or if the existing report is too old, the APK is submitted to Virustotal for scanning. After an APK has been submitted, the thread responsible for the APK sleeps for five minutes waiting for the result. When the thread wakes up it checks whether the results are available. If they are, the thread is done and outputs the results; if they are not, the thread sleeps for another three minutes.
The reason for the threaded module is to make it possible to process more than one app at a time. If the Virustotal submitter has to wait for results for one of the apps it can still process the others, making sure to fully utilize the four available requests per minute.
The restriction of four requests per minute gets somewhat trickier when there is one thread per APK, since all of the living threads must share the available requests. This is accomplished using Python's condition functionality. The condition allows a restriction where threads can only make a request if a counter is above a threshold value; this counter is protected by a lock. A separate thread then refills the counter with four new requests every minute and notifies all of the waiting threads when the counter is updated.

I. Extracting Malware Recognition Rules

For the rule extraction the R programming language is used. The data is loaded and combined into one large data frame. The loaded data is then used to create a first classification tree. After this, cross validation is used to decide upon the optimal tree depth and the tree is pruned to this level iteratively.
For the association analysis Weka is used to generate the rules. To be usable in Weka the data needs to be converted to the .arff format, which can easily be done in Weka itself. After converting the data, the analysis is performed by selecting the association option from the interface. The algorithm is then run and the rules extracted. Because of how the association analysis works, not all APKs can be used for this. If all of the APKs are included, the benign APKs far outnumber the malicious APKs, and thus the rules found relate only to the benign APKs. Because of this, the data was preprocessed so that there were approximately as many benign APKs as malware APKs. This was done by first dividing the data into malware and benign APKs. Then approximately as many benign APKs as there were malware APKs were drawn at random from the set of benign APKs. These two equally sized sets were combined into one final set, which was then processed. The reason benign APKs are needed at all in the set processed by the association analysis is that otherwise the confidence of the rules becomes problematic, since there would be no benign APKs for which the rules could also hold.
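The balancing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing code used in the project: the function name, the `virus` label column, the fixed seed and the choice to draw exactly as many benign APKs as there are malware APKs (the text only says "approximately") are all assumptions.

```python
import random
from typing import Dict, List

Row = Dict[str, int]


def balance_by_undersampling(rows: List[Row], label_key: str = "virus",
                             seed: int = 0) -> List[Row]:
    """Keep every malware row and an equally sized random sample of benign rows."""
    malware = [r for r in rows if r[label_key] == 1]
    benign = [r for r in rows if r[label_key] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_benign = rng.sample(benign, min(len(malware), len(benign)))
    combined = malware + sampled_benign
    rng.shuffle(combined)  # avoid all malware rows appearing first
    return combined


# Hypothetical dataset: 100 malware rows and 900 benign rows.
rows = [{"id": i, "virus": int(i < 100)} for i in range(1000)]
balanced = balance_by_undersampling(rows)
print(len(balanced), sum(r["virus"] for r in balanced))  # 200 100
```

Keeping all malware rows and discarding benign rows trades dataset size for class balance, which matches the motivation given above: with the full dataset, the frequent patterns found would describe benign APKs only.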
In the intrinsic metadata extraction as done in this project, each possible permission of an app is represented as one binary variable in the data for the APK. This means that each app was represented by 170 different variables in total. Because of both the time and the memory complexity of the Apriori algorithm as a function of the number of variables in the data, it becomes infeasible to run the algorithm on the data while keeping all of the variables. This means that some of the variables have to be removed. The removal was done manually, according to the following criterion: if fewer than 100 APKs fell into either of a variable's categories, the variable was removed. For example, if fewer than 100 of the APKs used a permission, the variable representing that permission was removed before the algorithm was run. This also means that the variables removed were those with the lowest variance.
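The removal criterion can be expressed as a short Python sketch. In the project the removal was done manually; the function below merely automates the same rule for illustration. The function name, the `keep` parameter, the toy data and the use of `NFC` as an example of a rarely requested permission are assumptions.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def drop_rare_binary_columns(rows: List[Row], min_count: int = 100,
                             keep: Tuple[str, ...] = ("virus",)) -> List[Row]:
    """Drop binary columns where either value (0 or 1) occurs in fewer than
    `min_count` rows. Columns named in `keep` are never dropped."""
    kept_columns = []
    for column in rows[0]:
        ones = sum(1 for r in rows if r[column] == 1)
        zeros = len(rows) - ones
        if column in keep or min(ones, zeros) >= min_count:
            kept_columns.append(column)
    return [{c: r[c] for c in kept_columns} for r in rows]


# Hypothetical data: INTERNET is requested by 900 of 1000 APKs, NFC by only 5.
rows = [
    {"INTERNET": int(i < 900), "NFC": int(i < 5), "virus": int(i % 10 == 0)}
    for i in range(1000)
]
pruned = drop_rare_binary_columns(rows)
print(sorted(pruned[0]))  # ['INTERNET', 'virus']
```

Since a binary variable with a very small minority class has a variance close to zero, this filter removes the lowest-variance variables first, as noted above.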

IV. RESULTS

In this section the generated decision tree and rules are presented to the reader.

Fig. 1. Virus density as function of threshold

A. Model and Extracted Rules

In the following section the results regarding both the generated tree classifier and the generated rules are presented. Regarding the decision trees, only the final tree for the finally selected virus threshold is presented. Trees were however generated for each of the integer thresholds in the range [1,20], in order to compare the misclassification rates of the trees.

B. Selection of Threshold
As presented earlier, one challenge in this project was to decide upon a threshold for how many of the Virustotal engines had to mark an APK as a virus for it to be considered a virus. To help with this decision two figures were generated: one showing how the proportion of viruses among all APKs changes for different values of the threshold (Figure 1), the other showing the misclassification rate of the decision tree as a function of the selected threshold (Figure 2).

Fig. 2. Misclassification rate as related to threshold

Based on these figures the threshold for the decision trees was set at 13. The threshold for the association rules was also set at 13, which is the value for which about 10% of the data is virus. The reason that 20 activations were not selected, even though this threshold apparently has a lower misclassification rate, is that when the threshold is increased the density of malware in the data decreases. This decrease in malware density results in a lower misclassification rate even if the model is not better at profiling viruses, simply because there are fewer to profile.
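The effect described above can be illustrated with a small Python sketch. The detection counts are invented for illustration, the function names are hypothetical, and treating "at least `threshold` engines" as the labeling rule is an assumption about how the threshold was applied. The sketch compares the malware density with the error of a trivial model that always predicts the majority class; that error shrinks as the threshold rises, without anything being learned about malware.

```python
from typing import List


def label_by_threshold(detections: List[int], threshold: int) -> List[int]:
    """Label an APK as virus (1) when at least `threshold` engines flag it."""
    return [int(d >= threshold) for d in detections]


def malware_density(detections: List[int], threshold: int) -> float:
    """Fraction of APKs labeled as virus for the given threshold."""
    labels = label_by_threshold(detections, threshold)
    return sum(labels) / len(labels)


def majority_baseline_error(detections: List[int], threshold: int) -> float:
    """Misclassification rate of a model that always predicts the majority class."""
    density = malware_density(detections, threshold)
    return min(density, 1 - density)


# Hypothetical engine-detection counts for 100 APKs.
detections = [0] * 70 + [5] * 10 + [13] * 10 + [25] * 10

for threshold in (1, 13, 20):
    print(threshold,
          malware_density(detections, threshold),
          majority_baseline_error(detections, threshold))
```

With these invented counts the density falls from 30% at threshold 1 to 20% at threshold 13 and 10% at threshold 20, and the trivial baseline error falls with it. A lower misclassification rate at a higher threshold is therefore not, on its own, evidence of a better model.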
C. Generated Tree Models

Based on the selected threshold of 13 engine activations for the APK to be classified as a virus, the following tree model was generated. The actual tree model is presented in Appendix A, since it was too large to include in this text. Because of problems with the graphical representation, the tree model is included in text form.

D. Rules Extracted via Association Analysis

When the rules were generated, two sets of rules were produced: one with rules only related to the classification of viruses and one where all rules were included. Here only the virus classification rules are shown.
Only two of the generated rules are actually presented, since the length of the rules otherwise quickly makes the report rather unreadable. The two selected were two out of the top four. The reason that not simply the top four were selected is that the rules seem to come in pairs, where every second rule is just a slight modification of the first rule. In all of these cases both the left and the right hand side also cover exactly the same number of samples in both rules, meaning that both rules relate to exactly the same sample of apps. Because of this it seemed more relevant to include rules which differed more, even if they were not the rules with the highest confidence.
It should be noted that even with this exclusion the included rules are still quite similar, but neither side covers exactly the same number of apps, and the confidence differs. More differing rules could likely be found if more rules were generated; doing this was only prevented by a lack of time. The rules could easily be generated by the authors upon request.
Any further generated rules would have a lower confidence than the ones presented in this report.
The rules are in the format

A, B, C, relevant for D samples → E, F conf:(G)

To read this format it is important to know:
• A, B, C are the attributes that all apps in the rule have
• D is the number of samples which show all of these attributes
• E is the result of the rule (in this report, that the samples are virus)
• F is how many of the D samples are actually virus
• G is the confidence of the rule, based upon D and F
If a rule says A=0 or A=1, this means that A is false or true, respectively, for the samples. For example, if A is a permission, the samples either do not have or have the permission. If A is the virus status, the samples either are not viruses or are viruses.

1) BLUETOOTH=0,
BROADCAST_STICKY=0,
CALL_PHONE=0,
CHANGE_NETWORK_STATE=0,
INTERNET=1,
MODIFY_AUDIO_SETTINGS=0,
READ_PHONE_STATE=1,
RECEIVE_SMS=0,
relevant for 1105 samples
→ virus=1, 791 conf:(0.72)

2) BLUETOOTH=0,
BROADCAST_STICKY=0,
CHANGE_NETWORK_STATE=0,
INTERNET=1,
MODIFY_AUDIO_SETTINGS=0,
READ_EXTERNAL_STORAGE=0,
READ_PHONE_STATE=1,
READ_SMS=0,
WRITE_EXTERNAL_STORAGE=1,
relevant for 1098 samples
→ virus=1, 785 conf:(0.71)

V. DISCUSSION

In this section the results are discussed: were they similar to the expected results? Do they have good support and confidence? Do they make sense?

A. Rule Analysis - Tree Algorithm

While analysing the resulting tree, it should be noted that the feature on the level closest to the root is the most important, in this case the READ_PHONE_STATE permission. This being the most important feature also seems rather logical. Data about the phone state would be useful in developing effective malware, which is further backed by the fact that the permission has the “dangerous” protection level. Not having the READ_PHONE_STATE permission reduces the proportion of malware in the population from about 10% to about 2.7%, which is a rather significant shift, almost a factor of 4.
Attributes on the second level are also rather important; these include the update date and the number of shell commands. The role of these attributes is however not possible to determine from the tree. This is partly because the rules represented by these nodes are an intersection of several different constraints, and thus do not depend only on these attributes. Another reason why no immediate meaning can be extracted as to why the attributes are important is that the intermediate nodes do not represent complete rules, nor do they represent any finished profiling of the apps as malware or benign. That said, there are some interesting complete rules ending in leaf nodes which can be discussed. The most interesting rules to discuss are the ones that profile malware with reasonable certainty, as well as those that profile benign apps with extraordinary certainty. These are the ones that will be presented.
First amongst the rules is the one ending in node 4), which profiles benign apps with 99.8% certainty. This rule holds for apps without the READ_PHONE_STATE permission whose latest update was more than about 2.5 years ago. This states that somewhat older apps almost always needed the READ_PHONE_STATE permission in order to function as malware, increasing the likelihood that the READ_PHONE_STATE permission at least used to be an extremely important indicator of malware.

The next rule, the one ending in node 21, does however present a somewhat interesting counterpoint to this. It states that apps that do not have the READ_PHONE_STATE permission, but that were updated more recently than about 2.5 years ago, have fewer than 13.5 shell commands and are larger than 6 MB, have a 70% probability of being malicious. This seems to indicate that some way of performing malicious activities without the READ_PHONE_STATE permission has been introduced recently. It also seems likely that the code required to perform this malicious activity is rather large, leading to the size requirement in the rule. Based on the previous assumption that malware in general has a higher number of shell commands, the shell command requirement does however seem somewhat unintuitive. Overall, however, this rule seems quite intuitive based on these explanations. The rule does raise some questions as to what allowed the malware to bypass the requirement of having the READ_PHONE_STATE permission. A potential explanation could be the emergence of lower-level exploits, such as Meltdown and Spectre [7], which could allow applications to bypass the protections of the operating system, even without special permissions.

The third interesting rule is the one ending in node 11. This is one of the most unintuitive rules in the generated tree. It states that if the app does not have the READ_PHONE_STATE permission, was updated more recently than about 2.5 years ago and has more than 13.5 shell commands, then it is with 99.7% certainty a benign app. This rule is not nearly as intuitive as most other rules, but two possible explanations can be seen. The first is, once again, that some important feature is missing or that some unsuitable simplification has been made in the model. The other is not that there is something special about the benign apps having more than 13 shell commands, but rather that the malware profiled by the previous rule, at node 21, has extraordinarily few shell commands. If we assume this, it might be that the malware profiled by the rule ending in node 21 uses some exploit to bypass the restrictions of the operating system (thereby allowing for direct execution of instructions), so that shell commands are less necessary than for a usual app. Combined with the assumption that malware applications likely have less effort put into them, and thus require fewer shell commands than average to perform their other functionality, this might result in this specific class of malware having fewer shell commands than average overall. This would somewhat explain the shell command part of the rules, both for this rule and for the previous rule, ending in node 21.

The final rule, ending in node 222), is rather long. It states that if the app uses the READ_PHONE_STATE permission, has more than 30 but fewer than about 231 shell commands, has more than 120 URLs, has the ACCESS_COARSE_LOCATION permission and has a version number between 64 and 115, then it is with 84% certainty a virus. This rule has many parts to go through. Starting with the READ_PHONE_STATE permission, it has earlier been stated that this is a likely predictor of malware, so it is no surprise to see it included in the rule. The ACCESS_COARSE_LOCATION permission, which allows the app to access the phone's approximate location, is also included. Just like READ_PHONE_STATE, this is a “dangerous” permission, so it is reasonable to find it amongst the permissions which help profile malware. The location of a phone is valuable information that might be of high interest to potential malware creators, if it is possible to get hold of. As to the number of URLs, it is in no way unlikely that malware would contain a higher number of URLs, since all of the locations from which to download malicious information and to which to upload user information would need to be included in the app. Finally, the version number in the rule raises some questions, as it was assumed that malware would not be updated frequently. Of note, however, is that the version number is taken from the Android manifest and can thus be set at will by the application developer. This means that the version number does not necessarily correlate directly with the actual version of the app. What this part of the rule might indicate is that the app developers set the version number intentionally so that the app does not seem as new as it might in fact be. The reason for this might be to give users a false sense of security: they assume that the app is safe since it has been around and been updated, while this is not actually true.

Further research would be required to confirm the validity of these explanations. The applications could be examined for the presence of exploits bypassing the operating system and, given such a presence, its effect on APK size could also be considered.

B. Rule Analysis - Association Analysis

The rules generated seem to be logical, with a clear correlation between sets of permissions involving one with a “dangerous” (as defined by [8]) or undefined protection level (READ_PHONE_STATE and WRITE_EXTERNAL_STORAGE), some general permission (INTERNET) and positive virus profiling. The inverse is in part true for permissions of a “normal” protection level, such as BLUETOOTH and BROADCAST_STICKY, which indicate negative malware profiling. However, permissions related to making calls and sending SMS messages, which have a “dangerous” protection level, are constrained to be missing from the permission set in the presented rules. One explanation could be that these particular sets of rules are given with respect to particular malware families, in which no such
permissions are used. The permissions just mentioned would then have to be present in the rules to differentiate from other malware families. Analysis of the code of some of the applications from the supposed malware families would be helpful in determining whether or not such patterns actually exist in the dataset.

Finally, the rules chosen for presentation are similar, which is not unexpected if the variables correlate with the malicious truth value on their own.

C. Rule Analysis - Algorithm Comparison

A comparison of the rules generated by the algorithms, and thereby also of the algorithms themselves, would have been interesting, and performing such a comparison was the plan at the start of the project. Currently, however, such a comparison is impossible because the algorithms did not run on the same data. The decision tree ran on data consisting of 90% benign apps and 10% malware, while the association analysis ran on a dataset made up of 50% benign apps and 50% malware. Since the algorithms ran on different sets of data, neither the resulting rules nor the algorithms themselves can be compared fairly.

D. Method Criticism

Amongst the rules, a few unintuitive ones were found. One possible reason is that some important underlying features were not collected. This would be rather reasonable, since only easily extracted metadata from the marketplace and from the APK were used. If this is the case, there might be a clear pattern in the data, based on some missing feature(s), that the rules try to mimic using the currently available features. It might then, by happenstance, be that a seemingly random feature is the best available approximation of the missing feature(s). The model would select this approximating feature for lack of a better option, leading to seemingly random features and rules being included in the final model.

As previously stated, most of the rules generated by the association analysis were rather similar. This means that possibly only the first of these rules would have been needed for a relevant profiling. Generating more rules would probably have helped, since it is unlikely that all of the rules are this similar; it is probably only the case for the rules with the highest confidence, since they likely profile a similar set of APKs. Unfortunately, there was not enough time to generate these rules to see whether later rules differed.

Another problem is that the association analysis was only run on about 3,000 apps. As stated in method section III-I, there needed to be a balance between benign and malicious APKs for the algorithm to function correctly.

The reason for changing the number of APKs for the association analysis was to find relevant rules. It should however be noted that it should be possible to find the rules even if the APKs are not removed; the rules would just appear later in the list. The problem was that running the algorithm long enough to try to find these rules was not feasible on the available hardware: the algorithm ran for five hours before finally crashing after using more than 12 GB of RAM.

It should also be noted that changing the proportions of the APKs will, as previously stated, change the confidence of the rules, since it is possible that there would be more benign APKs which also have the same permissions, or lack of permissions, as presented in the rules. This would mean that, out of all of the APKs which have all of the features presented in the rules, a smaller proportion would be malware, and thus the confidence would be lower. The only two other options for fixing the problem, however, were to either acquire better hardware or to reduce the number of viruses as well, keeping the proportion, until rules were found; neither was deemed feasible or a better solution.

There was also some bias towards benign apps in the decision tree: only a few of the rules ended up profiling malware, while most only profiled benign apps. The data was not changed for the decision tree as well because some rules for profiling malware did actually exist, so it was deemed worse to risk reducing the generalizability merely to increase the number of rules which profiled malware.

VI. CONCLUSIONS

This study contributes some possible rules to help profile malware. These rules can potentially be used by security analysts to help profile malware and identify which groups of apps might be most relevant for finding new malware. The rules can also help in profiling zero-day exploits, where the tools to actually identify the malware are not yet developed. Only a small set of rules was extracted for this report, but the frameworks for extracting more, as well as for basing the rules on more malware, are set up and tested, making it easy to extract more rules if required.

Most of the rules generated seemed intuitive, with a few that were harder to interpret. This indicates that the method in general is likely valid, but with some minor aspects which could be improved. Seeing as the result would likely improve given a larger set of features, it can be concluded that profiling based on easily extracted features is possible, but not optimal.

REFERENCES

[1] “Development of new android malware worldwide from january 2011 to march 2018,” https://www.statista.com/statistics/680705/global-android-malware-volume, accessed on 2019-04-12.
[2] “App manifest overview,” https://developer.android.com/guide/topics/manifest/manifest-intro, accessed on 2019-04-05.
[3] “About us,” https://support.virustotal.com/hc/en-us/categories/360000160117-About-us, accessed on 2019-05-05.
[4] P. Rovelli, “Ninjadroid,” https://github.com/rovellipaolo/NinjaDroid, May 2017, accessed on 2019-05-05.
[5] A. Muñoz, I. Martín, A. Guzmán, and J. A. Hernández, “Android malware detection from google play meta-data: Selection of important features,” in 2015 IEEE Conference on Communications and Network Security (CNS), Sep. 2015, pp. 701–702.
[6] Y. Ishii, T. Watanabe, F. Kanei, Y. Takata, E. Shioji, M. Akiyama, T. Yagi, B. Sun, and T. Mori, “Understanding the security management of global third-party android marketplaces,” in Proceedings of the 2nd ACM SIGSOFT International Workshop on App Market Analytics. ACM, 2017, pp. 12–18.
[7] “Meltdown and spectre,” https://meltdownattack.com/, accessed on 2019-05-16.
[8] “Permissions overview,” https://developer.android.com/guide/topics/permissions/overview, accessed on 2019-05-14.

APPENDIX
Because of issues with the graphical representation of the tree that could be generated, the tree is presented in text form. Some explanation of the format might therefore be needed; the explanation generated together with the tree is also presented in its first two lines.
Each row of the tree represents a split in the tree, starting from the root node. Each line has the form:
Node number), split condition, number of data points in node (and potential children), classification value of data points in the node (only definitive if leaf node), (fraction of data points in node that are not viruses, fraction of data points in node that are viruses)
Note that the update date feature is the number of days between the year 2000 and the date the app was last updated, so a larger number indicates a more recent update. This was done to get a good and fair comparison between different apps for the model to process. This structure was also good since it meant that the date value of a given app would not change from day to day, which is important since the downloading process took several days. The size is given in number of bytes.
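As an illustration of the update date encoding described above, the following minimal Python sketch converts between a day count and a calendar date. The report only says the count starts at "the year 2000"; the exact starting day used here (2000-01-01) and the names ASSUMED_EPOCH, days_to_date and date_to_days are assumptions made for this example, not part of the original pipeline.

```python
from datetime import date, timedelta

# Assumption: day 0 is 2000-01-01. The report only states "the year 2000".
ASSUMED_EPOCH = date(2000, 1, 1)


def days_to_date(update_date: float) -> date:
    """Convert an update_date value (days since the assumed epoch) to a date.

    Split thresholds in the tree can be fractional (e.g. 6395.5), so the
    value is truncated to whole days first.
    """
    return ASSUMED_EPOCH + timedelta(days=int(update_date))


def date_to_days(last_updated: date) -> int:
    """Convert a last-updated date to the day count used as a feature."""
    return (last_updated - ASSUMED_EPOCH).days
```

Under this assumption a split threshold such as update_date < 6395.5 can be read as a calendar date with days_to_date(6395.5). A different starting day within the year 2000 would shift every converted date by the same offset.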
node), split, n, yval, (yprob)
      * denotes terminal node

1) root 9192 0 ( 0.899043 0.100957 )
  2) READ_PHONE_STATE < 0.5 2394 0 ( 0.972013 0.027987 )
    4) update_date < 6395.5 1864 0 ( 0.998391 0.001609 ) *
    5) update_date > 6395.5 530 0 ( 0.879245 0.120755 )
      10) shell_commands < 13.5 144 0 ( 0.562500 0.437500 )
        20) size < 6.0739e+06 57 0 ( 0.964912 0.035088 ) *
        21) size > 6.0739e+06 87 1 ( 0.298851 0.701149 ) *
      11) shell_commands > 13.5 386 0 ( 0.997409 0.002591 ) *
  3) READ_PHONE_STATE > 0.5 6798 0 ( 0.873345 0.126655 )
    6) shell_commands < 231.5 3850 0 ( 0.810130 0.189870 )
      12) shell_commands < 30.5 277 1 ( 0.469314 0.530686 ) *
      13) shell_commands > 30.5 3573 0 ( 0.836552 0.163448 )
        26) urls < 120.5 2530 0 ( 0.881423 0.118577 )
          52) VIBRATE < 0.5 1108 0 ( 0.805957 0.194043 )
            104) update_date < 6394.5 698 0 ( 0.883954 0.116046 )
              208) update_date < 5011 277 0 ( 0.758123 0.241877 ) *
              209) update_date > 5011 421 0 ( 0.966746 0.033254 ) *
            105) update_date > 6394.5 410 0 ( 0.673171 0.326829 )
              210) update_date < 6420 304 0 ( 0.572368 0.427632 ) *
              211) update_date > 6420 106 0 ( 0.962264 0.037736 ) *
          53) VIBRATE > 0.5 1422 0 ( 0.940225 0.059775 ) *
        27) urls > 120.5 1043 0 ( 0.727709 0.272291 )
          54) ACCESS_COARSE_LOCATION < 0.5 390 0 ( 0.961538 0.038462 ) *
          55) ACCESS_COARSE_LOCATION > 0.5 653 0 ( 0.588055 0.411945 )
            110) version < 64 352 0 ( 0.863636 0.136364 )
              220) shell_commands < 153 67 0 ( 0.507463 0.492537 ) *
              221) shell_commands > 153 285 0 ( 0.947368 0.052632 ) *
            111) version > 64 301 1 ( 0.265781 0.734219 )
              222) version < 115 254 1 ( 0.153543 0.846457 ) *
              223) version > 115 47 0 ( 0.872340 0.127660 ) *
    7) shell_commands > 231.5 2948 0 ( 0.955902 0.044098 ) *
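
To show how a complete rule can be read off the listing, the sketch below parses lines in the format above, walks from a leaf back to the root, and recovers the rule together with the number of viruses in the leaf. It relies on the numbering visible in the listing, where the children of node k are numbered 2k and 2k+1. The names Node, parse_line, path_to_root and SAMPLE are hypothetical helpers written for this illustration; this is not the tooling used to generate the tree.

```python
import re
from typing import NamedTuple


class Node(NamedTuple):
    node_id: int
    split: str
    n: int            # data points in the node
    yval: int         # 0 = benign, 1 = virus
    p_benign: float
    p_virus: float
    is_leaf: bool


# Matches e.g. "222) version < 115 254 1 ( 0.153543 0.846457 ) *"
LINE_RE = re.compile(
    r"^\s*(\d+)\)\s+(.*?)\s+(\d+)\s+([01])\s+"
    r"\(\s*([\d.]+)\s+([\d.]+)\s*\)\s*(\*)?\s*$"
)


def parse_line(line: str) -> Node:
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognised tree line: {line!r}")
    node_id, split, n, yval, p0, p1, star = m.groups()
    return Node(int(node_id), split, int(n), int(yval),
                float(p0), float(p1), star is not None)


def path_to_root(node_id: int) -> list[int]:
    """Node ids from the root down to node_id (children of k are 2k, 2k+1)."""
    path = []
    while node_id >= 1:
        path.append(node_id)
        node_id //= 2
    return path[::-1]


# The lines on the path to leaf 222, copied from the listing above.
SAMPLE = """\
1) root 9192 0 ( 0.899043 0.100957 )
3) READ_PHONE_STATE > 0.5 6798 0 ( 0.873345 0.126655 )
6) shell_commands < 231.5 3850 0 ( 0.810130 0.189870 )
13) shell_commands > 30.5 3573 0 ( 0.836552 0.163448 )
27) urls > 120.5 1043 0 ( 0.727709 0.272291 )
55) ACCESS_COARSE_LOCATION > 0.5 653 0 ( 0.588055 0.411945 )
111) version > 64 301 1 ( 0.265781 0.734219 )
222) version < 115 254 1 ( 0.153543 0.846457 ) *
"""

nodes = {node.node_id: node for node in map(parse_line, SAMPLE.splitlines())}
leaf = nodes[222]
# Skip the root, which carries no split condition.
rule = " AND ".join(nodes[i].split for i in path_to_root(222)[1:])
virus_count = round(leaf.n * leaf.p_virus)
```

Here rule holds every constraint on the path to node 222, including the upper bound on shell commands introduced at node 6, and virus_count divided by leaf.n gives the certainty quoted for the rule.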
