Hybrid-Based Malware Analysis For Effective and Efficiency Android Malware Detection
Abstract— Over the last decade, Android has become the most widely used operating system. Alongside this rapidly increasing popularity, Android has also become a target for the spread of malware. Android permits the installation of applications from unofficial third-party markets, which allows malware developers to distribute malicious apps and compromise Android devices. So far, malware analysis and detection systems have been built on both static analysis and dynamic analysis. However, existing research still lags in detecting malware both efficiently and accurately, because accurate detection often consumes many resources on resource-limited mobile devices. Therefore, this research proposes a solution by developing and testing efficient and accurate machine learning and deep learning models for this problem. We used the Malware Genome dataset and the Drebin project for static analysis and the CICMalDroid dataset for dynamic analysis. From these datasets, we extract 261 combined features for the hybrid analysis. To test the resulting model, we took 311 application samples consisting of 165 benign apps from the Play Store and 146 malicious apps from VirusShare. The test results show that the hybrid analysis model can increase detection accuracy by about 5%. Further testing also revealed that the extreme gradient boosting (XGB) ensemble model offers the best combination of accuracy and efficiency.

Keywords—malware analysis, hybrid-based analysis, Android malware, malware detection

I. INTRODUCTION

The Android operating system is known as the most popular and widely used mobile operating system. According to the Stat Counter report, Android dominated the mobile operating system market, holding 85% of the total market share by the end of 2019 [1]. With the popularity of Android growing every year, Android malware attacks are also proliferating. Trend Micro reported that the Android malware count had increased to 10.6 million by mid-2020 [2]. This enormous volume of mobile malware will keep evolving and spreading to commit cybercrimes on mobile devices.

Even with various protection mechanisms such as Play Protect, there are still reports of malware in the Play Store [3]. These reports show that even a large company like Google is unable to keep the Android ecosystem free of malware. Besides the popularity of Android smartphones, Android is the main target of mobile malware because the operating system allows users to install applications downloaded from third-party markets. This allows an attacker to mislead Android users into downloading malware from the attacker's server.

Because of the increasing sophistication and volume of malware, malware detection analysis is crucial. Many researchers endeavor to mitigate Android malware attacks through various approaches. So far, feature extraction approaches for malware can be classified into static analysis-based detection and dynamic analysis-based detection [4]. In a static approach, the analysis is carried out without executing the malware sample, whereas a dynamic approach executes it.

Static-based detection is an anti-malware approach that identifies malware by matching patterns in the software being examined against a database of signatures from known malicious programs. This scheme is based on the premise that a malware instance can be characterized by patterns. The disadvantage of the static analysis method is that, because the byte signature patterns originate from known malware, they are also commonly known. Therefore, an attacker can easily evade detection using code obfuscation techniques [5].

Malware that can change its code as it spreads is known as metamorphic or polymorphic malware. This type of malware also cannot be detected by static-based analysis [6]. Since signature-based anti-malware systems are built on a database of known malware, they cannot recognize zero-day malware attacks. Over time, this database also grows, which affects the speed of malware detection.

On the other hand, dynamic-based analysis detects malware by running a sample in a controlled environment such as a sandbox or virtual machine. In this scheme, the extracted features are usually API calls, system calls, memory writes, or registry changes [7]. The drawback of dynamic-based analysis is that it is time-consuming, because it requires setting up an environment for testing malware samples.

These limitations inspire researchers to develop hybrid-based analysis approaches for more effective results.
Although hybrid-based analysis can achieve high accuracy by combining the benefits of static and dynamic analysis, it fails to ensure efficient use of resources. This efficiency is essential because the memory, processor, and power resources on smartphones are limited compared to desktops. For these reasons, we propose a hybrid-based malware analysis model that aims not only at accurate malware detection but also at being resource-friendly.

In this research, we combine the data extracted from signature-based and dynamic-based analysis and then build machine learning and deep learning-based analysis models. Although machine learning and deep learning models are commonly used for malware detection, building a reliable deep learning model requires a large amount of training data. This requirement can be met by artificially creating data using an oversampling technique. In this way, we create simulated data for the malware class while also balancing the class sizes. In this model, we also use dimension reduction to increase malware detection performance significantly. Because we construct a hybrid-based approach, the static approach compensates for the drawbacks of the dynamic approach, while the dynamic approach covers the shortcomings of the static approach. Our proposed malware detection approach is expected to recognize different types of malware effectively while maintaining efficient performance through real-time analysis.

II. RELATED WORK

There are several studies related to malware analysis on the Android platform. Pham et al. stated that there is a significant relationship between malware and the permission attribute in the Android manifest file [8]. Meanwhile, Fauzia Idrees stated that the permissions and intents used by Android applications distinguish malware efficiently and accurately while remaining resistant to code obfuscation [9]. Methods commonly used in static analysis include intent filters [10], system commands, and fingerprinting [11]. Dynamic or behavioral analysis can be performed by tracking system calls [12] and analyzing instructions [13].

The hybrid-based analysis by Damodaran et al., which combines the two methods, shows that API calls and opcode sequences in the dynamic analysis are the most effective [14]. However, there is an imbalance between the number of benign samples and malware samples, so a more in-depth analysis is needed. Several studies show improved detection using hybrid-based analysis. Ali-Gombe et al. combine static and dynamic analysis, using static bytecode instrumentation to recognize the abuse of resources and examine suspicious behavior, which is then observed dynamically [15]. Surendran et al. propose a new TAN (Tree Augmented Naive Bayes) hybrid malware detection mechanism that uses conditional dependencies between static and dynamic features (API calls, permissions, and system calls) in their machine learning classifiers [16]. However, this work still struggles with performance because many variables need to be analyzed.

Machine learning has also been widely used in malware detection. Sapna Malik analyzed Android system calls using various machine learning algorithms and found a strong correlation between malware and system calls [17]. Li et al. use machine learning for malware detection, achieving notably more accurate results through the identification of significant permissions [18]. Their feature extraction from the Android manifest file, combined with the Naive Bayes classification algorithm, increases the malware detection rate. According to the authors of [19], machine learning algorithms are used intensively for malware classification or identification, and some for malware clustering. Deep learning is an innovative domain of machine learning that mimics the way the human brain functions and has gained growing recognition in the field of artificial intelligence. It has motivated a significant number of successful applications in speech recognition and image classification. Preliminary studies have also applied deep learning to Android malware detection [20].

III. METHODOLOGY

This research went through several stages, as illustrated in Fig. 1. The initial stage is data gathering. We performed experiments on two datasets from two publicly accessible malware sample collections (the Android Malgenome project [21] and DREBIN [22]). This static dataset contains 215 features and two classes, malware and benign. It consists of 18,835 Android applications (6,820 malware and 12,015 benign). To develop an effective and efficient Android malware detection model, we assemble the most representative attributes, such as manifest permissions, API call signatures, intent filters, command signatures, and binaries from the application being analyzed.

Data for behavior analysis is taken from the CICMalDroid 2020 dataset [23]. The dataset consists of 17,341 Android samples. It was analyzed using the CopperDroid framework [24] to obtain dynamic behaviors, which are broken down into three categories: system calls, binder calls, and composite behavior. About 11,598 samples were analyzed successfully; the rest failed due to errors such as time-outs, invalid APK files, and memory allocation failures. Of the analyzed samples, 1,795 are benign and the other 9,803 are malware.

Fig. 1. Research methodology

The following step is data pre-processing. At this stage, features in the dataset with missing values are omitted. Because the sample sizes of the benign and malware classes are imbalanced, we balance the number of samples in each class before splitting them into training and test sets for analysis with machine learning techniques. To use all the samples equally, we randomly shuffle the dataset in each class before balancing the samples. We then use the oversampling technique with the Synthetic Minority Over
Sampling Technique (SMOTE) method [25] to equalize the number of class samples.

The next process is dataset training and validation. The dataset is split into training and validation data. We use 10-fold cross-validation to ensure that the validation and training data are fully rotated [26]. We conduct training using various machine learning and deep learning classification algorithms, namely Support Vector Machine (SVM), decision tree, random forest, Naive Bayes, K-Nearest Neighbor (K-NN), Multi-Layer Perceptron (MLP), and Gradient Boosting (GB). After benchmarking, we improve on the benchmark results by tuning the hyperparameters of the algorithms in order to identify the best practice for malware classification. This process is conducted on the datasets resulting from both static-based and dynamic-based analysis.

After the optimal classification setting is determined, we combine the features of the static and dynamic datasets. This combined feature set was evaluated on test data consisting of 311 applications (166 malware and 145 benign). Finally, we use Principal Component Analysis (PCA) to reduce the features to those that best represent each variant. All these steps are carried out in a development environment with an Intel Core i7 9750 processor, 32 GB DDR4 RAM, an Nvidia Quadro T1000 graphics card, a 512 GB NVMe SSD, and the Windows 10 operating system.

IV. RESULT AND DISCUSSION

After the data is collected, we process it using Python 3 with the Scikit-learn library. In the oversampling stage with the SMOTE technique, an equal number of samples is obtained for both classes, namely 12,015 in each class. Stratified 10-fold cross-validation splits the data so that, in each fold, 90% is used for training and the remaining 10% for validation. The techniques used for training and validation are SVM, K-NN, MLP, Random Forest, Decision Tree, Naive Bayes, and GB. Table 1 reports the average training accuracy and latency.

[...] artificial neural network that can learn and make decisions independently. Deep learning enables us to analyze, through its hidden layer architecture, problems that would otherwise be considerably more complex to program manually. However, the drawback is a long training time, which grows with the number of nodes in the neural network.

Gradient boosting (GB), a type of boosting in machine learning, shows the best accuracy compared to the other algorithms. It relies on the intuition that the best possible next model, when combined with the preceding models, minimizes the overall prediction error. The essential idea is to set the target outcomes for this next model so as to minimize the error, which seems to be the reason behind the most accurate result. Gradient boosting tends to have a long training time because, unlike random forests, its decision trees depend on one another to create the final prediction model.

Next, we combine the data from the static and dynamic analysis results. There are 261 features: 215 from static analysis and 46 from dynamic analysis. We conducted a PCA to find the top 10 features that best distinguish malware from benign applications. Fig. 2 illustrates the results of this analysis.
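The class balancing and stratified 10-fold cross-validation reported in this section can be illustrated with a minimal sketch. The paper states that Python 3 and Scikit-learn were used; the version below relies only on the standard library so that it is self-contained, and it is not the authors' implementation. The function names `smote` and `stratified_folds`, the neighbour count `k=5`, and the toy feature vectors are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea).
    Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class
        # (squared Euclidean distance is enough for ranking).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def stratified_folds(labels, n_splits=10, seed=0):
    """Assign sample indices to n_splits folds so that every fold keeps
    roughly the same class ratio as the whole dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


# Toy data (made up): 30 "benign" and 10 "malware" two-feature vectors.
rng = random.Random(42)
benign = [[rng.random(), rng.random()] for _ in range(30)]
malware = [[rng.random() + 1.0, rng.random() + 1.0] for _ in range(10)]

# Oversample the smaller class until both classes have the same size.
malware_balanced = malware + smote(malware, n_new=len(benign) - len(malware))
X = benign + malware_balanced
y = [0] * len(benign) + [1] * len(malware_balanced)

# Each fold serves once as the 10% validation split; the other nine
# folds together form the 90% training split.
folds = stratified_folds(y, n_splits=10)
```

In practice, a maintained implementation (for example the SMOTE class of the imbalanced-learn package together with Scikit-learn's `StratifiedKFold`) would normally be preferred over hand-written code like this.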
Authorized licensed use limited to: Queens University Belfast. Downloaded on May 17, 2021 at 02:46:16 UTC from IEEE Xplore. Restrictions apply.
2020 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS)
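The stage-wise idea behind gradient boosting discussed in this paper, where each new weak learner is fitted to the errors left by the models before it, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the configuration used in the experiments: it uses squared-error loss on 0/1 labels, one-split decision stumps as weak learners, and a hypothetical toy dataset with two binary features in which a sample is malicious only when both features are present. All function names are illustrative.

```python
def fit_stump(X, residuals):
    """Find the one-feature threshold split that best fits the residuals
    in the least-squares sense. Returns (feature, threshold, left, right)."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lmean, rmean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gradient_boosting(X, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean label, then add stumps one at a time."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, X)]
    return base, stumps, learning_rate


def predict(model, row):
    """Sum the contributions of all stumps and threshold the score at 0.5."""
    base, stumps, learning_rate = model
    score = base + learning_rate * sum(predict_stump(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0


# Hypothetical toy data: label 1 ("malware") only when both features are set.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

one_round = fit_gradient_boosting(X, y, n_rounds=1)
model = fit_gradient_boosting(X, y, n_rounds=20)
```

With a single stump the toy model still misclassifies the malicious sample, while the 20-round model classifies all four training samples correctly, which mirrors the point that later learners correct the shortcomings of earlier ones. Because every round depends on the residuals of the previous one, training is inherently sequential, whereas the stump outputs in `predict` are independent of each other and can be summed in any order.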
The last experiment was carried out to evaluate the combined static and dynamic analysis model. For this purpose, we reverse engineer sample applications for feature extraction. Benign applications are obtained from the Play Store, while malicious applications are taken from VirusShare. After collecting 311 sample applications, we extracted both static and dynamic features according to the feature vectors contained in each dataset. The trained and validated models are then tested against the collected samples. This test is carried out after the hyperparameters of the models have been tuned to obtain the best accuracy. The top 3 results of the benchmark are shown in Table 2.

TABLE II. COMBINED MODEL RESULT

Algorithm        Average score    Prediction time (in seconds)
GB               0.993572         0.667
Random Forest    0.978146         1.222
MLP              0.950026         2.455

Table 2 reveals that the GB algorithm has the best accuracy with the shortest prediction latency. These results do not differ from the earlier experiments with static analysis data. By adding dynamic data, the accuracy increases by about 5%. Interestingly, the GB algorithm is significantly faster at prediction than during training. The reason is that the algorithm makes its decisions sequentially during training, whereas at prediction time they are made in parallel.

The results of the GB algorithm are closely followed by random forest. A boosting algorithm such as GB is a development of the random forest approach. Random forests build each tree independently, while GB builds one tree at a time. This additive model works in a forward stage-wise manner, introducing a weak learner to improve on the shortcomings of the existing weak learners. Based on the test results, the GB algorithm is slightly better than random forest; this seems to be due to the structured and noise-free input dataset. Random forest algorithms perform better on data with a lot of noise, where GB tends to overfit. With further experiments, we also found that a variant of the ensemble model, namely extreme gradient boosting (XGB), reduced the prediction time by 15% with the same accuracy.

V. CONCLUSIONS

We tested a model with 215 static analysis features from the Malware Genome project and Drebin project dataset of 18,835 applications. We tried to build the best machine learning or deep learning model to detect malware based on these datasets. The results show a detection accuracy of around 99% with the gradient boosting algorithm. We then also analyzed 46 dynamic analysis features from the CICMalDroid dataset of 17,341 applications. The two feature sets from the static and dynamic analysis results are then combined to develop a hybrid-based malware analysis model. A total of 261 features were tested on 311 applications to measure latency and model accuracy in detecting malware. The results show that the hybrid model can increase malware detection accuracy by up to 3% compared with relying solely on static-based analysis. Furthermore, through extensive experimentation, we found that the gradient boosting algorithm had the best accuracy with the shortest latency among the machine learning and deep learning algorithms that were measured.

REFERENCES

[1] Stat Counter, "Mobile Operating System Market Share Worldwide," [Online]. Available: https://gs.statcounter.com/os-market-share/mobile/worldwide. [Accessed: 20 July 2020].
[2] Trend Micro, "Android Malware Campaigns Reportedly Installed 250 Million Times," [Online]. Available: https://www.trendmicro.com/news/mobile-safety/android-malware-campaigns-simbad-adware-and-operation-sheep-reportedly-installed-250-million-times. [Accessed: 11 July 2020].
[3] W. R. Aditya, N. Qolbi, M. F. Kamal, I. S. Tsany, F. N. Pramitha, and R. B. Hadiprakoso, "Malware Detection Analysis on Android Applications with Deep Learning Algorithms," in ICOIACT, 2020.
[4] A. Damodaran, F. D. Troia, C. A. Visaggio, T. H. Austin, and M. Stamp, "A comparison of static, dynamic, and hybrid analysis for malware detection," Journal of Computer Virology and Hacking Techniques, vol. 13, pp. 1-13, 2017.
[5] A. M. Al-Bakri and H. L. Hussein, "Static Analysis Based Behavioral API for Malware Detection," Computer Engineering and Intelligent Systems, vol. 5, no. 12, pp. 55-63, 2014.
[6] E. Masabo, K. S. Kaawaase, J. S. Otim, J. Ngubiri, and D. Hanyurwimfura, "A State of The Art Survey On Polymorphic Malware Analysis And Detection Techniques," ICTACT Journal on Soft Computing, vol. 8, no. 4, pp. 1762-1772, 2018.
[7] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, "A Multimodal Deep Learning Method for Android Malware Detection," IEEE Transactions on Information Forensics and Security, vol. 14, no. 3, pp. 773-788, 2018.
[8] R. B. Hadiprakoso, "Face Anti-Spoofing Method with Blinking Eye and HSV Texture Analysis," in Tarumanagara International Conference on the Application of Technology and Engineering (TICATE), 2020.
[9] F. Idrees, M. R. M. Conti, T. M. Chen, and Y. Rahulamathavan, "PIndroid: A novel Android malware detection system using ensemble learning methods," Computers & Security, vol. 68, pp. 36-46, 2017.
[10] N. V. Duc, P. T. Giang, and P. Minh, "Permission Analysis for Android Malware Detection," in The Proceedings of the 7th VAST, Hanoi, 2015.
[11] F. I. Abro, "Investigating Android permissions and intents for malware detection," Unpublished Doctoral thesis, University of London, 2018.
[12] S. Malik, "Android System Call Analysis for Malicious Application Detection," International Journal of Computer Sciences and Engineering, vol. 5, no. 11, pp. 105-108, 2017.
[12] X. Li et al., "An android malware detection method based on android manifest file," in 2016 4th International Conference on Cloud Computing and Intelligence Systems (CCIS), IEEE, 2016.
[13] A. T. Kabakus and I. A. Dogru, "An in-depth analysis of Android malware using hybrid techniques," Digital Investigation, vol. 24, pp. 25-33, 2018.
[14] A. Damodaran et al., "A comparison of static, dynamic, and hybrid analysis for malware detection," Journal of Computer Virology and Hacking Techniques, vol. 13, no. 1, pp. 1-12, 2017.
[15] A. I. Ali-Gombe et al., "Toward a more dependable hybrid analysis of android malware using aspect-oriented programming," Computers & Security, vol. 73, pp. 235-248, 2018.
[16] R. Surendran, T. Thomas, and S. Emmanuel, "A TAN based hybrid model for android malware detection," Journal of Information Security and Applications, vol. 54, 102483, 2020.
[17] J. Li et al., "Significant permission identification for machine-learning-based android malware detection," IEEE Transactions on Industrial Informatics, vol. 14, no. 7, pp. 3216-3225, 2018.
[18] R. Sihwail, K. Omar, and K. A. Zainol Ariffin, "A survey on malware analysis techniques: Static, dynamic, hybrid and memory analysis," International Journal on Advanced Science, Engineering and Information Technology, vol. 8, no. 4-2, p. 1662, 2018.
[19] M. Yang and Q. Wen, "Detecting android malware by applying classification techniques on images patterns," in 2017 IEEE 2nd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), IEEE, 2017.
[20] H. S. Anderson and P. Roth, "Ember: an open dataset for training static pe malware machine learning models," arXiv preprint arXiv:1804.04637, 2018.
[21] N. Xie et al., "Fingerprinting Android malware families," Frontiers of Computer Science, vol. 13, no. 3, pp. 637-646, 2019.
[22] F. Wei et al., "Deep ground truth analysis of current android malware," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, Cham, 2017.
[23] Y.-L. Zhao and Q. Qian, "Android malware identification through visual exploration of disassembly files," International Journal of Network Security, vol. 20, no. 6, pp. 1061-1073, 2018.
[24] T. Bhatia and R. Kaushal, "Malware detection in android based on dynamic analysis," in 2017 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), IEEE, 2017.
[25] A. Pektaş and T. Acarman, "Deep learning for effective Android malware detection using API call graph embeddings," Soft Computing, vol. 24, no. 2, pp. 1027-1043, 2020.
[26] J. Yan, G. Yan, and D. Jin, "Classifying malware represented as control flow graphs using deep graph convolutional neural network," in 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE, 2019.