Research Paper 2 Malware Detection
1. Introduction

The Internet is a major source of information sharing among distinct nodes. It consists of millions of computers, networks, and devices. Consequently, the Internet has become a target of cybercriminals. Malware, spam, and phishing are examples of such cyberattacks (Shaukat et al., 2020b). Malware is considered to be one of the major security threats to cyberspace. 'Malware' is a combination of 'mal' from 'malicious' and 'ware' from 'software'. It is a code snippet covertly inserted into a computer system or a network with malicious intent to disrupt the normal flow of activities. Based on its functionality, malware is sometimes classified into viruses, worms and Trojan horses (Guo et al., 2016).

Cybercriminals use various obfuscation methods to generate malware variants. Exclusive-OR, base64 encoding, ROT13, dead code insertion, instruction substitution, and sub-routine reordering are common obfuscation methods. These methods introduce adversarial modifications into binary and textual data that malware detectors find difficult to interpret and detect (Shaukat et al., 2022). New malware and unwanted applications are reported to number over 450,000 daily and over 100 million yearly. The malicious pattern can reside within a data file or a software application, to name a few. These software applications may belong to different platforms, such as Windows, Linux, and Android.

Malware analysis techniques are usually categorised into two major groups: static and dynamic. In static analysis, an application is observed for malicious patterns without execution. The data files or applications are decrypted and disassembled into feature vectors. Feature vectors characterise the essential features and format information of the file potentially containing the malicious pattern. Anti-malware solutions detect malicious patterns by analysing these feature vectors. Conventional malware detectors use a signature-based detection method. These detectors are equipped with a vast database of malware signatures (malicious code patterns). They decode the suspicious file and match its extracted static features with the stored malware signatures. Malware variants can easily undermine conventional anti-malware solutions.
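To make this concrete, the short Python sketch below applies two of the obfuscation methods named above, Exclusive-OR and ROT13, to the same illustrative byte pattern; the payload string is purely a stand-in, not real malware. Each variant yields different bytes, so a signature matched against the original pattern no longer fires.

```python
import codecs

payload = b"malicious-pattern"  # illustrative byte pattern, not real malware

# Exclusive-OR obfuscation: XOR every byte with a one-byte key.
xor_key = 0x5A
xor_variant = bytes(b ^ xor_key for b in payload)

# ROT13 obfuscation: rotate alphabetic characters by 13 positions.
rot13_variant = codecs.encode(payload.decode("ascii"), "rot13").encode("ascii")

# Three different byte sequences now carry the same underlying payload.
print(payload.hex())        # 6d616c6963696f75732d7061747465726e
print(xor_variant.hex())    # different bytes after XOR
print(rot13_variant.hex())  # different bytes again after ROT13
```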
Static analysis is vulnerable to adversarial attacks and obfuscation techniques. Dynamic analysis is carried out by running the application in a controlled virtual environment to observe the dynamic behaviour and potential malicious patterns (Arora et al., 2014). Function call monitoring, dynamic visual analysis, and instruction traces are common techniques used in dynamic analysis. Typically, dynamic analysis is implemented using a sandbox. Researchers have also combined static and dynamic analysis to detect malware (De Paola et al., 2018).

Challenges: The exponential growth of malware variants is another challenge for conventional anti-malware solutions, in addition to the challenges mentioned above. These solutions are incapable of detecting polymorphic and rapidly growing malware variants. Static analysis is faster and provides insights into the structural properties of the application; it can thus examine the complete code of an application when detecting malware variants. However, a detailed analysis is impossible, as most variants employ detection avoidance techniques generated using code obfuscation. On the other hand, dynamic analysis effectively detects variants generated through code obfuscation. Nevertheless, it is slower, requires intensive resources (a virtual environment, i.e., a sandbox) and is highly time-consuming. It also requires human effort to analyse and interpret detailed reports. This is where learning-based artificial intelligence techniques come into practice. Machine learning is a sub-branch of artificial intelligence (AI) that learns from existing attack patterns to prevent similar attacks in the future. Cyber defenders have also used machine learning-based methods to detect suspicious characteristics in evolving malware variants (Li et al., 2021). However, it is still challenging for machine learning-based methods to detect polymorphic and new malware variants. Nataraj et al. (2011) proposed a method to convert a malware-packed binary into a greyscale image to classify different malware families. Their detection approach considered the visible features to address code obfuscation. Image-based malware conversion can help to overcome the problem of detection avoidance techniques in certain malware variants. However, the extraction of complex texture features, i.e., Super-Resolution Radial Fluctuations (SRRF) and Linde-Buzo-Gray (LBG), slowed the detection processing. Consequently, the representation of various variants that can be represented in colour images is overlooked. Colour images are represented using the red, green, and blue (RGB) colour format. The proposed approach overcomes these challenges by providing an efficient mechanism for converting a PE file into a coloured image. Thus, the proposed method reduces the challenges posed by static and dynamic methods.

Learning-based detectors come with their own challenges. They need to be trained with a sufficient and equal number of instances from each class (benign and malicious). Due to privacy and security issues, most datasets that contain the latest attacks are private. The publicly available datasets are laboriously anonymised and suffer from various issues. The problem of imbalanced data is a major concern for the learning-based detector, as few malicious samples are present to train the detector properly. The skewed distribution of the training examples makes standard learning-based malware detectors biased towards the majority class. The proposed approach targets the imbalanced dataset and enhances the detection effectiveness by applying various data augmentation techniques.

Machine learning models are not explicitly designed for malware detection tasks and need domain knowledge. However, deep transfer learning pre-trained models are trained with more than 10 million images. These cutting-edge pre-trained models can be directly used as detectors (end-to-end classification) and feature extractors for malware image classification. Thus, combining both machine learning (ML) and deep learning (DL) can help to overcome the intensive feature engineering and domain knowledge tasks. Hence, we have combined recent developments in image processing using transfer learning with state-of-the-art machine learning to improve detection accuracy. This paper proposes a novel hybrid malware detector using a deep learning model as a feature extractor and a machine learning model as a detector. The proposed approach targets all the aforementioned challenges. The performance of the proposed approach is validated with 15 deep learning and 12 machine learning models. We demonstrate that our proposed approach outperforms the state-of-the-art conventional and learning-based solutions.

Our contributions are as follows:

• We propose a novel hybrid framework combining deep transfer learning and machine learning for malware detection. First, deep transfer learning is used to extract all the deep features from the last fully connected layer of the deep learning model, and then machine learning models are used as the final detector, which fully utilises the inherent connections between input and output. The proposed approach eliminates the need for knowledge from domain experts for reverse engineering tasks.
• The proposed framework consists of two steps. In the first step, an image-based PE dataset is generated by transforming malicious and benign Windows executables into RGB-coloured images. We address the imbalanced data problem in this step using data augmentation. The images are normalised at this stage. The performance of 15 deep transfer learning models is evaluated on three datasets for image-based malware detection. We show that data augmentation enhances the detector's performance.
• The second step involves a novel combination of deep learning and machine learning models for malware detection. An in-depth analysis of 15 deep learning models as feature extractors and 12 end-to-end machine learning models as malware detectors is presented.
• We demonstrate that our proposed approach is scalable, cost-effective, and efficient. The effectiveness of the proposed framework is validated on a smaller dataset. We analyse the effectiveness of various models, first considering only a single feature and then using all features for malware classification. Our results demonstrate that the proposed framework performs better than other state-of-the-art techniques.

The rest of the paper is structured as follows. Section 2 provides a literature review of the relevant related works. Section 3 describes the proposed framework. Results and discussion are given in Section 4. Section 5 concludes the paper and outlines future research. We list the acronyms used in this paper in Table 1 for convenient referencing.

2. Related work

Most anti-malware solutions are based on either conventional or learning-based solutions. The conventional anti-malware solution includes feature extraction, analysis mode (static and dynamic), and statistical similarity measures. Learning-based anti-malware solutions use machine learning, deep learning, and visualisation techniques. Subsequent sections describe related works in each of these methods.

2.1. Conventional anti-malware approaches

Ishai et al. in Rosenberg et al. (2018) proposed a method to classify advanced persistent threats (APTs). APTs are a collection of highly evolving, dynamic and sophisticated attacks developed by nation states. The authors used dynamic analysis using a sandbox to extract features of APTs. A deep neural network (DNN) was further used for detection. Yu Jiang et al. in Jiang et al. (2020) proposed a method based on static Android malware analysis. The authors extracted the opcode sequence of malware and provided a detection mechanism. However, their approach did not handle packed applications. Packed applications contain encrypted applications that can be restored only when the application is executed. Their proposed system also ignored Android applications that use dynamic loading technology (DLT). The Android applications that use DLT are not detectable using static analysis.
Table 1
List of abbreviations and acronyms.
Abbreviation Description Abbreviation Description
AI Artificial Intelligence LDA Linear Discriminant Analysis
ANN Artificial Neural Network LR Logistic Regression
APTs Advanced Persistent Threats LSTM Long Short-Term Memory
APK Android Application Package ML Machine Learning
API Application Programming Interface MLP Multi-Layer Perceptron
CNN Convolutional Neural Network NB Naïve Bayes
DT Decision Tree NN Neural Network
DL Deep Learning PE Portable Executable
DNN Deep Neural Network PCA Principal Component Analysis
DLL Dynamic-Link Libraries RF Random Forest
FNR False Negative Rate RNN Recurrent Neural Networks
FPR False Positive Rate SVM Support Vector Machine
HistGradientBoosting Histogram-Based Gradient Boosting TL Transfer Learning
IoT Internet of Things TNR True Negative Rate
IDS Intrusion Detection System TPR True Positive Rate
k-NN K Nearest Neighbour XGBoost Extreme Gradient Boosting
Zhangjie Fu et al. proposed a long short-term memory (LSTM) model for Android malware detection. They extracted the features from 3090 benign and 3090 malicious Android application package (APK) files (Fu et al., 2021). Their experimental results outperformed traditional support vector machine (SVM), random forest (RF), decision tree (DT), and K-nearest neighbour (KNN) models.

Schultz et al. in Schultz et al. (2000) proposed a data mining-based method using static analysis. The authors extracted the byte, string sequences, and portable executable header features from malicious binaries. Instead of extracting the byte n-gram sequence as in Schultz et al. (2000), Shabtai et al. extracted the opcode sequence to identify the malicious pattern (Shabtai et al., 2012). Kolter et al. in Kolter and Maloof (2006) also performed static analysis and extracted the n-grams of byte codes from a set of benign and malicious Windows executables. The authors extracted the most relevant n-grams from a set of 255 million distinct n-grams. Naïve Bayes, SVM, and decision tree have been used for malware detection. However, their proposed method failed to correctly classify malicious codes with obfuscation and encryption. Further, all these approaches need a domain expert and feature engineering tools for malware detection.

Researchers have also analysed the API system call sequences for detecting malicious behaviour using dynamic analysis. Searles et al. in Searles et al. (2017) performed dynamic analysis to extract the dependency relationships among various system calls. They constructed a similarity matrix to feed the SVM for malware detection based on the generated dependency graph. Cesare et al. in Cesare et al. (2013) extracted the control flow graphs using de-compilers. Further, the control flow graphs were used to identify different malware variants. Malware detection was also performed through static visual analysis. Authors either performed static or dynamic analysis to identify a specific behaviour or extract specific features from the malware binaries. The monitored behaviour or specific features are transformed into an image. Afterwards, various statistical similarity methods are used on the crafted image datasets for final malware detection. Gianni et al. in D'Angelo et al. (2020) performed dynamic analysis on Android-based apps. The authors converted the API call sequences into API images during execution. An autoencoder is then used for detection. They show that this approach outperforms decision tree and naïve Bayes classifiers, with an accuracy of 94%.

Syed et al. in Shaid and Maarof (2014) performed dynamic analysis to classify various malware variants. The authors executed the malware apps in a virtual environment to identify the API call sequences. The extracted API call sequences were then transformed into a coloured image. The authors applied various statistical methods to find similarities among different malware images. KyoungSoo et al. used opcode instructions to generate coloured images. The image matrices are generated from these images. Specific areas of the images are used to identify similarities among these matrices (Han et al., 2013). Most researchers extracted a specific feature during dynamic analysis for further malware detection; for example, the work in Anderson et al. (2011) collected instruction traces, the work in Imran et al. (2015) observed system calls, and the work in Fujino et al. (2015) monitored API calls. Though dynamic analysis helps to identify malicious behaviour, it is a time-consuming and resource-intensive task.

The hybrid-based approach combines both static and dynamic analysis. Alejandro et al. in Martín et al. (2019) proposed a framework that can perform static and dynamic analyses. The authors performed the experiments on Android applications. Huda et al. in Huda et al. (2016) proposed a hybrid analysis approach that obtained the cluster information using term-frequency features. The authors used an SVM for malware detection.

2.2. Learning-based anti-malware approaches

Many research works have leveraged machine learning and deep learning advancements for malware detection. Many works have converted the malware binaries into images and used various machine learning models to classify and identify different malware families. This section describes the state-of-the-art malware detection approaches using visualisation-based and learning-based techniques.

Niket et al. in Bhodia et al. (2019) used transfer learning for malware detection. The authors compared traditional K-NN to ResNet34 on the Malicia dataset. The authors reported that the deep learning ResNet34 outperformed traditional K-NN. However, the authors did not compare it with other state-of-the-art deep learning architectures. They used 9895 malware instances and only 704 benign samples to train their model. They did not tackle the imbalanced data problem. Most of the researchers only used one deep learning model for malware detection, neglecting the latest deep learning architectures; e.g., Li Chen in Chen (2018) used Inception-V1, Wai Weng et al. in Lo et al. (2019) used Xception, Prima et al. in Prima and Bouhorma (2020) used VGG16, Daniel et al. in Nahmias et al. (2020) used VGG-19, and Yuntao et al. in Zhao et al. (2020) used Faster Region-Convolutional Neural Networks (RCNN) only. MalConv was initially proposed by Raff et al. (2018) for malware and benign classification. Kadri et al. in Kadri et al. (2019) used MalConv to classify different classes of malware. Kadri et al. claim that MalConv gives similar performance for classifying among various classes of malware. In Kumar (2021), the authors used the Malimg dataset for malware classification by slightly modifying the last layer of ResNet50. Their proposed model is better than the standard ResNet50. However, the authors have not provided a rationale for using ResNet50 only. Furthermore, there was no analysis provided for other state-of-the-art deep learning architectures. Xianwei Gao et al. in Gao et al. (2020) extracted byte features from malware and applied a recurrent neural network (RNN) for malware detection. The authors also validated the performance of their approach using various machine learning models, and the XGBoost model came out the best. However, their approach has the overhead of static analysis and feature selection.
Table 2
A comparative analysis of the literature.
Sr# | Ref# | Year | Model count | Used models | Feature extraction | Need for domain expert | Transfer learning | Technique (static/dynamic/visual) | Handled data imbalance | ML features extracted | Dataset | Deep features comparison | Experimented on smaller dataset | One-vs-All features
1 | Nataraj et al. (2011) | 2011 | 1 | KNN | No | No | No | Visual | No | Yes | Malimg | No | No | No
2 | Rosenberg et al. (2018) | 2018 | 1 | DNN | Yes | Yes | No | Dynamic | No | No | APTs | No | No | No
3 | Jiang et al. (2020) | 2020 | 1 | Resnet50 | Yes | Yes | Yes | Static and visual | No | No | Drebin | No | No | No
4 | Fu et al. (2021) | 2021 | 1 | LSTM | Yes | Yes | Yes | Static | No | No | VirusShare | No | No | No
5 | Shabtai et al. (2012) | 2012 | 5 | ANN, LR, RF, DT, NB | Yes | Yes | No | Static | No | Yes | VXHeaven | No | No | No
6 | Kolter and Maloof (2006) | 2006 | 4 | NB, DT, SVM, Boosting | Yes | Yes | No | Static | No | Yes | VXHeaven | No | No | No
7 | Searles et al. (2017) | 2017 | 1 | SVM | Yes | Yes | No | Dynamic | No | Yes | Custom | No | No | No
8 | D'Angelo et al. (2020) | 2020 | 1 | Autoencoder | Yes | Yes | No | Dynamic | No | No | Custom | No | No | No
9 | Chen (2018) | 2018 | 1 | InceptionV1 | No | No | Yes | Visual | No | No | Microsoft | No | No | No
10 | Lo et al. (2019) | 2018 | 1 | Xception | No | No | Yes | Visual | No | No | Malimg, Microsoft | No | No | No
11 | Kumar (2021) | 2021 | 1 | ResNet50 | No | No | Yes | Visual | No | No | Malimg | No | No | No
12 | Hemalatha et al. (2021) | 2021 | 1 | DenseNet | No | No | Yes | Visual | Yes | No | Malimg | No | No | No
13 | Gibert et al. (2019) | 2019 | 1 | CNN | No | No | No | Visual | No | No | Malimg | No | No | No
14 | Vasan et al. (2020b) | 2020 | 2 | VGG16, ResNet50 | No | No | Yes | Visual | No | No | Malimg | No | No | No
15 | Roseline et al. (2019) | 2019 | 4 | RF, Xgboost, ETC, LR | Yes | No | No | Visual | No | Yes | Malimg | No | No | No
16 | Ben Abdel Ouahab et al. (2019) | 2019 | 1 | KNN | Yes | Yes | No | Visual | No | Yes | Malimg | No | No | No
17 | Vinayakumar et al. (2019) | 2019 | 5 | CNN, DT, NB, LR, KNN | Yes | Yes | No | Dynamic and visual | No | Yes | Malimg | No | No | No
18 | Rong et al. (2020) | 2020 | 1 | ResNet50 | No | No | Yes | Visual | No | No | MCFP | No | No | No
19 | Rezende et al. (2017) | 2017 | 1 | ResNet50 | No | No | Yes | Visual | No | No | Malimg | No | No | No
20 | Bansal et al. (2021) | 2021 | 4 | VGG19, NB, DT, RF | Yes | No | Yes | Visual and static | No | Yes | Caltech-101 | No | No | No
21 | Our paper | 2022 | 27 | – | No | No | Yes | Static, visual | Yes | Yes | VirusShare, VXHeaven, Malimg | Yes | Yes | Yes
in an image. Afterwards, each substring of length 8 bits is converted into an unsigned integer ranging between 0 and 255. For example, the PE binary given in Fig. 1 is divided into substrings of 8 bits length, i.e., 01100010100111101101010100010110 → 01100010, 10011110, 11010101, 00010110 → 98, 158, 213, 22. A one-dimensional (1D) feature vector containing the decimal values of the 8-bit substrings is generated. The 1D feature vector is transformed into a two-dimensional (2D) matrix. The height of the image depends upon the contents of the PE file. However, the width of the image is fixed according to criteria based on the file size. The width of the image is determined using the file size of the PE file (Oliva and Torralba, 2001). Finally, a colour map is applied to the 2D matrix to visualise the PE binary. Most of the research in visualisation-based malware detection was performed with greyscale images. Colour images better represent various malware variants and other features than greyscale images (Vasan et al., 2020a). We have performed all our experiments on coloured images.

A PE file contains information related to API calls, opcodes, etc. We have generated an API-image dataset by extracting the API calls from each PE file. The purpose of generating this API-based image dataset is to analyse the detection effectiveness of learning-based detectors on an image dataset generated through one feature or all features. A set of experiments was performed to compare the detection effectiveness of malware detection using an image dataset generated by converting a whole PE file binary into an image or through an API-based image dataset. The API-based image dataset is generated by extracting the API calls. For brevity, we have used the same malware and benign PEs dataset from Al-Dujaili et al. (2018). The Al-Dujaili et al. (2018) dataset contains 19,000 malicious and 19,000 benign PE files. The LIEF library (LIEF Library, 2021) is utilised to parse each PE file. There were 22,760 unique API calls in the Al-Dujaili et al. (2018) dataset. Thus, the size of each binary feature vector is 22,761. After parsing, each PE file is mapped into its corresponding binary feature vector. Each index of the feature vector represents the presence or absence of a unique API call: '1' on the respective index denotes the presence of an API call, '0' otherwise. Each feature vector is mapped into an API-based image dataset using the process illustrated in Fig. 1.

Thus, the whole image generation and later detection process does not require any feature engineering or domain expert. The height and width of the image depend upon the contents and the size of the PE file, respectively. All the images are pre-processed and normalised according to the network input settings. Most of the deep learning architectures take input images of size 224 × 224. Thus, we have transformed all the images into a fixed rectangular shape of size 224 × 224. However, the images have been resized in case some deep learning-based models take different image sizes as input.
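The following Python sketch illustrates this conversion pipeline under stated assumptions: the width rule and the 'jet' colour map are illustrative stand-ins for the paper's exact choices (which follow Oliva and Torralba, 2001), and NumPy, Matplotlib, and Pillow handle the array and image manipulation. A companion helper shows the API-based binary feature vector described above.

```python
import numpy as np
from matplotlib import cm
from PIL import Image

def pe_to_rgb_image(pe_path: str, out_path: str) -> None:
    """Convert a PE file's raw bytes into an RGB image (sketch)."""
    data = np.fromfile(pe_path, dtype=np.uint8)   # each 8-bit substring -> 0-255

    # Illustrative width rule: wider images for larger files; the height
    # then follows from the file contents, as described in the text.
    width = 256 if data.size < 100_000 else 512
    height = int(np.ceil(data.size / width))
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: data.size] = data
    matrix = padded.reshape(height, width)        # 1D feature vector -> 2D matrix

    # Apply a colour map to the 2D matrix to obtain an RGB visualisation,
    # then normalise to the 224 x 224 input size used by most networks.
    rgb = (cm.jet(matrix / 255.0)[..., :3] * 255).astype(np.uint8)
    Image.fromarray(rgb).resize((224, 224)).save(out_path)

def api_feature_vector(imported_apis: set, all_apis: list) -> np.ndarray:
    """Binary presence/absence vector over the corpus's unique API calls."""
    return np.array([1 if api in imported_apis else 0 for api in all_apis],
                    dtype=np.uint8)
```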
We have resized the images in these cases. Though some features are lost during this resizing process, texture features are preserved.

3.1.3. Data imbalance and solution

Class imbalance and the unavailability of the most recent attack patterns are well-known problems in malware detection. Due to privacy reasons, data on the latest attacks are not publicly available. This makes the learning models skewed and biased towards the majority class. The malware classes in benchmark and well-known public datasets are not equally distributed. The Malimg dataset comprises 9339 malware samples from 25 families and 617 benign samples. Among the different malware families, the Allaple.A family has 2949 samples, whereas Skintrim.N has only 80 samples. The Microsoft malware dataset is composed of 10,868 malware samples. Out of the 10,868 samples, the Kelihos_ver3 malware family has 2942 samples, and Simda has only 42 samples. As mentioned in Table 2, this problem has not been addressed for learning-based malware detectors. Learning-based detectors are skewed towards the majority class and give less attention to the minority classes. The detector accuracy is unreliable due to the lack of consideration of minority class samples during the model's training. We have used state-of-the-art data augmentation techniques to overcome the degradation of the detection effectiveness of learning-based detectors and to avoid the overfitting caused by an imbalanced dataset.

The real image data is used to augment new images by relocating the image pixels. The augmentation is performed to ensure that the structural and other important features of the images remain consistent. The augmentation not only expands the size of the data but also adds a level of variation across the samples of the dataset, which allows the malware detector to generalise better on unseen data. We have used various augmentation techniques, including rotation range, width shift, height shift, zoom range, horizontal flip, vertical flip, brightness range, rescale, fill mode, and shear range. It is worth considering that a few data augmentation techniques may alter the image dimension after augmentation, e.g., rotation. Rotating a square image will preserve the image dimension. However, rotating a rectangular image will not preserve the dimension (e.g., once rotated by 90 degrees). Considering this issue, we have already normalised all our images into square 224 × 224 images. The normalisation process preserves not only the image dimension but also the important features in the augmented images. Thus, the generated data is quite similar to the real data. In this study, we used the Augmentor package of Python, which has features for size-preserving rotations and size-preserving shearing. These features ensure the preservation of the original features of the images in the augmented data.

We have created a pipeline of various augmentation techniques. The pipeline processes each augmentation technique sequentially. However, a probability can also be assigned to each augmentation technique. Alternatively, any augmentation technique can be applied randomly, or only a subset of augmentation techniques can be applied from the pipeline. Let $A = [a_1, a_2, a_3, \ldots, a_{10}]$ be the pipeline of augmentation techniques and $S_i = a_x, \ldots, a_y$ be the sequence of the 10 augmentation techniques given in Table 3. Each augmentation technique is assigned a specific weight $w_x$. Hence, the final sequence of augmentation techniques is given as $S_i = a_x w_x, \ldots, a_y w_y$, where $w_x$ is the weight assigned to each augmentation technique. We have only augmented the training images; the test data is not augmented. Using these techniques, the training set is balanced before the models are trained.
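A minimal sketch of such a pipeline with the Augmentor package is given below; the directory path, probabilities, parameter values, and sample count are illustrative assumptions, with each operation's probability playing the role of the weight $w_x$ above.

```python
import Augmentor

# Build a sequential augmentation pipeline over the training images only;
# the source directory is illustrative.
pipeline = Augmentor.Pipeline("data/train/malware")

# Each operation carries its own application probability (its weight).
pipeline.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
pipeline.shear(probability=0.4, max_shear_left=5, max_shear_right=5)
pipeline.flip_left_right(probability=0.5)    # horizontal flip
pipeline.flip_top_bottom(probability=0.3)    # vertical flip
pipeline.zoom(probability=0.5, min_factor=1.05, max_factor=1.2)

# Draw augmented samples until the minority class is balanced with the
# majority class; the count is illustrative.
pipeline.sample(5000)
```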
3.2. Hybrid deep learning and machine learning-based malware detection

This section explains the second part of the framework, i.e., the hybrid deep learning and support vector machine (SVM) classifier for the final malware detection. Section 3.2.1 briefly explains the use of deep learning models and transfer learning. Then, Section 3.2.2 describes the rationale behind the proposed approach. Section 3.2.3 briefly introduces the final detector, i.e., the support vector machine, and the other applied machine learning models used to validate the performance of the proposed framework. The explanation of the convolutional neural network-SVM (CNN-SVM) malware detector is provided in Section 3.2.4. Fig. 2 illustrates the pictographic explanation of the second step of the proposed novel framework for malware detection. We name the proposed model a hybrid deep learning and machine learning-based malware detector (HDLMLMD ≡ HD(LM)2D). The HD(LM)2D is a framework that classifies malware and benign PEs. It consists of 3 parts: image-based PE dataset generation (step 1, explained in Section 3.1), a fine-tuned pre-trained model and transfer learning to extract features, and final detection to get an output.

Fig. 2 depicts the second step of the proposed framework. It consists of the pre-processed PE image dataset, data splitting, a pre-trained model for feature extraction, and the feeding of the extracted features to the final classifier. It takes the pre-processed PE images as input. The pre-processed dataset is generated as a result of step 1 (Section 3.1). The data is split into training, validation, and testing sets. The training and validation data are used for transfer learning and feature extraction. The testing data is used to test the detection effectiveness of the models. Any deep learning model can be used to extract the deep features from the malware image dataset. The deep features are extracted from the last fully connected layer as a 1D vector. The extracted deep features are fed into a machine learning model, i.e., an SVM, for final malware detection. The non-linear machine learning model fully utilises the inherent connections between input and output. The performance of the proposed HD(LM)2D framework is compared with state-of-the-art ML and DL-based approaches. The performance of the final detector, i.e., the SVM in the proposed framework, is also validated against other ML models. The following sections explain the framework in detail.

3.2.1. Deep learning models and transfer learning

Transfer learning (TL) is a machine learning method that leverages the knowledge learnt from solving one problem to solve another related problem. Transfer learning alleviates two major problems in malware detection: imbalanced datasets and inadequate computing resources. As mentioned in the previous sections, the available public datasets do not contain sufficient recent attack patterns or suffer from an imbalanced number of instances, which is required to train a malware detector effectively while avoiding overfitting. We have addressed the problem of imbalanced data through the data augmentation described earlier. Transfer learning can be formally defined in terms of domains and learning tasks. Table 4 summarises the formal notations.
Table 4
Formal notations and their descriptions.
Notations Description Notations Description
Ds Source domain Xs Source feature space
Ts Source learning task 𝑃 (𝑋s ) Marginal probability distribution of source feature space
Dt Target domain 𝑃 (𝑋t ) Marginal probability distribution of target feature space
Tt Target learning task Ys Label space of the source domain
𝑓s Objective function Yt Label space of the target domain
Let $D_s$ be a source domain, $T_s$ a source learning task, $D_t$ a target domain, and $T_t$ a target learning task. The source domain $D_s$ consists of a feature space $\mathcal{X}_s$ and a marginal probability distribution $P(X_s)$, where $X_s = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}_s$. If $D_s$ and $D_t$ are different, then $P(X)$ and $\mathcal{X}$ would also be different, i.e., $\mathcal{X}_s \neq \mathcal{X}_t$ and $P(X_s) \neq P(X_t)$.

Given the source domain $D_s = \{\mathcal{X}_s, P(X_s)\}$, a learning task $T_s$ consists of two components: a label space $\mathcal{Y}_s$, where $Y_s = \{y_1, y_2, \ldots, y_n\} \in \mathcal{Y}_s$, and a mapping objective function $f_s : \mathcal{X}_s \to \mathcal{Y}_s$. The function $f_s(x)$ is used to efficiently predict the corresponding label $y_{new}$ of a new instance $x_{new}$. The learning task $T_s = \{\mathcal{Y}_s, f(x)\}$ is learnt from the source training data consisting of pairs $\{x_{si}, y_{si}\}$, where $x_{si} \in \mathcal{X}_s \wedge y_{si} \in \mathcal{Y}_s$.

In our case, $D_s$ is the collection of images in the ImageNet dataset, and $D_t$ is the collection of malware images. Given $D_s \neq D_t \vee T_s \neq T_t$, the aim of transfer learning is to utilise the knowledge learned from the source domain $D_s$ and the source learning task $T_s$ to generate the target mapping objective function $f_t : \mathcal{X}_t \to \mathcal{Y}_t$ on the malware domain $D_t$ and improve the malware detection effectiveness. In this study, 15 pre-trained models are applied for malware detection. The architectural details of these models are given in Table 5.

The pre-trained models are trained on the ImageNet dataset. The ImageNet dataset has 10+ million images covering 1000 class labels. In this study, the pre-trained models are analysed from two perspectives. First, the 15 pre-trained models are fine-tuned and used for end-to-end malware detection. We have fine-tuned all the models by adding a dropout layer, a flattened layer, and a final fully connected layer containing two classes (malware and benign) instead of the original fully connected layer (containing 1000 classes). The initial weights of the pre-trained models (trained on natural images) were used, and the last layers were fine-tuned for the malware detection task. An in-depth analysis of the 15 malware detectors is done on three sets of datasets. Second, the 15 pre-trained, fine-tuned models were used as feature extractors at the last fully connected layer.
The shortcoming of the pre-trained models in accurately detecting malware images, and its solution, is provided in the next section.

RegNetY320 has been used to extract the features. Facing the problem of vanishing gradients, training deep neural networks was challenging before ResNet. To add the outputs from an earlier layer to a later layer, ResNet employs skip connections, which aid it in overcoming the problem of vanishing gradients. This made it possible to achieve several advancements in various computer vision tasks. Regardless of its success in generating gradient flow between building blocks, the additive function limits the opportunity to explore other potential complementary features because of the basic shortcut connection technique. Ilija et al. in Radosavovic et al. (2020) introduced a regulator module as a memory mechanism to extract complementary features, which are then sent to the ResNet. Convolutional recurrent neural networks are used in the regulator module because they have been demonstrated to be good at retrieving spatiotemporal information. RegNetY320 is the name given to the new regulated networks. The regulator module is simple and may be added to any ResNet design. Two types of ResNet building blocks, non-bottleneck and bottleneck, are proposed, and RegNet is shown to outperform on the CIFAR10/100 and ImageNet datasets. Following are the equations (Eqs. (1)–(6)) that express the $t$th bottleneck RegNet module:

$X_2^t = ReLU(BN(W_{12}^t * X_1^t + b_{12}^t))$ (1)

$[H^t, C^t] = ReLU(BN(ConvLSTM(X_2^t, [H^{t-1}, C^{t-1}])))$ (2)

$X_3^t = ReLU(BN(W_{23}^t * X_2^t + b_{23}^t))$ (3)

$X_4^t = ReLU(BN(W_{34}^t * Concat[X_3^t, H^t]))$ (4)

$X_5^t = ReLU(BN(W_{45}^t * X_4^t + b_{45}^t))$ (5)
Table 5
Overview of pre-trained models.
Sr# Model Year Depth Layers Size (MB) Parameters (millions) Non-trainable parameters Trainable parameters
1 VGG16 2014 16 41 515 138 137,897,600 102,400
2 ResNet50 2015 50 177 96 25.6 25,548,800 51,200
3 InceptionV3 2015 48 316 89/ 87 23.9 23,848,800 51,200
4 VGG19 2014 19 47 535 20.1 20,024,384 25,089
5 MobileNet 2018 28 55 16 3.2 3,228,864 50,177
6 Xception 2013 71 171 88 22.9 20,861,480 100,353
7 DenseNet169 2016 169 338 57 12.7 12,642,880 81,537
8 DenseNet201 2017 201 709 80 18.4 18,321,984 94,081
9 InceptionResNetV2 2017 164 825 215 54.4 54,336,736 38,401
10 MobileNetV2 2018 53 155 13/ 14 3.5 3,468,000 32,000
11 ResNet152V2 2015 307 570 232 58.4 58,331,648 100,353
12 AlexNet 2012 8 25 227 61 60,897,600 102,400
13 SqueezeNet 2017 18 68 4.5 23.5 23,084,252 48,842
14 NasNetMobile 2018 389 914 23 4.4 4,269,716 51,745
15 RegNetY320 2020 320 553 145 144,754,400 103,840
$X_1^{t+1} = ReLU(X_1^t + X_5^t)$ (6)

where $W_{ij}^t$ is the convolutional kernel that maps the feature maps $X_i^t$ to $X_j^t$, and $b_{ij}^t$ is the corresponding bias. ReLU, batch normalisation (BN), and concatenation layer functions are used. The rectified linear activation function transforms a node's summed weighted input into the node's activation, or output, for that input. Batch normalisation is a technique for training very deep neural networks that standardises the inputs to a layer for each mini-batch. This helps to stabilise the learning process and significantly reduces the number of training epochs needed to train deep networks.
deep networks. learning model for the final malware detection. We have experimented
with extracting features from the second fully connected and earlier
3.2.2. Rationale behind the proposed framework layers. The results were not promising. The last fully connected results
Fig. 3 depicts a conventional malware detection process using a gave better results as it has more deep and rich features. The number of
learning-based model. Based on the training process, the final fully features returned by each model is different, depending on the density
connected layer of the learning model returns a vector of logits. The and architecture of each model. For example, the last fully connected
vector then passes through the softmax activation function that assigns layer of VGG16 models returned 4096 nodes. Hence, the vector size of
a certain probability to each classification label (malware and benign 4096 is used to represent each image and later provided to the final
in our case). The softmax activation function is used on the last output detector for malware detection.
layer of the pre-trained model for final malware detection. Fig. 1 Secondly, a deep learning model can be used for classification and
illustrates the detailed conversion process of the PE file into an image. deep feature computation. In the first approach, the deep learning
The width and length of a converted image vary and depend upon the model is used end-to-end for feature computation and malware detec-
content and size of the PE file. Fig. 3 illustrates that the binaries of a tion. In the second approach, the deep features are extracted from any
PE file can be mapped into the two 2D matrices differently. layer and are given to a linear classifier for final malware detection. Our
The binaries of the same PE file are reshaped in two different proposed model used the second approach. In the first approach, end-
images. Both images are passed through certain layers of the same to-end, deep learning models require high computation resources and
8
K. Shaukat, S. Luo and V. Varadharajan Engineering Applications of Artificial Intelligence 122 (2023) 106030
1 𝑇 ∑ 𝑁
𝑚𝑖𝑛 𝑤 𝑤+𝑐 𝜉𝑛 (7)
𝑁 𝑖=0
Fig. 4. Support vector machine. 𝑠.𝑡. T𝑛′ (𝑤.𝑥 + 𝑏) ≥ 1 − 𝜉𝑛 (8)
𝜉𝑛 ≥ 0∀𝑛 (9)
parameter tuning using the new image dataset. Hence, the model may The unconstrained optimisation problem of Eq. (7) can be defined as
suffer from overfitting. Furthermore, it may require a lot of training Eq. (10):
data that may not be available in the case of malware detection. Lastly,
1 𝑇 ∑ 𝑁
we need a quick response to classify a PE file as malware or benign 𝑚𝑖𝑛 𝑤 𝑤+𝑐 𝑚𝑎𝑥(0, 1 − T𝑛′ (𝑤𝑇 𝑥𝑖 + 𝑏)) (10)
to avoid subsequent damage. We also need to reduce the reliance on 𝑁 𝑖=0
knowledge from domain experts for reverse engineering tasks. Hence, where (𝑤𝑇 𝑥𝑖 + 𝑏) is the predictor function and T𝑛′ is the actual label.
our proposed malware detector is based on the second approach that The loss function for Eq. (10) is hinge loss. Eq. (11), also called the
addresses all these requirements. We use a linear classifier, i.e., Support L2-SVM, provides a more stable result. It uses the ‖𝑤‖22 Euclidean norm
Vector Machine (SVM), for final malware detection. (a.k.a L2 norm) with the squared hinge loss.
9
K. Shaukat, S. Luo and V. Varadharajan Engineering Applications of Artificial Intelligence 122 (2023) 106030
Table 6
A compact overview of machine learning models.
Model Year Ref. No Description Limitations
Decision Tree 1979 Ross (1993) • Works on an if-then rule to find the best • Difficult to change the data without affecting the
immediate node. overall structure. Complex, expensive and
• Continue the process until the predicted class is time-consuming.
obtained.
Naïve Bayes 1960 Frank and Hall • A probabilistic classifier that takes less • Assigns 0 probability if some category in the test
(2011) computational time. data set is not present in the training data set.
• Assumes that a feature is entirely independent of • Stores entire training examples
all other present features. • Needs massive data to obtain good results.
Random Forest 1995 Agrawal and • Composed of many DTs. • Computational cost is higher.
Srikant (1995) • Every DT yields a prediction. • Slow prediction generator
• The prediction that having a maximum number of
votes will be the final prediction of the model.
Logistic 1970 Akram et al. • A.k.a. the logit model; a classification model. • Suffers from restrictive expressiveness.
Regression (2021) • Uses a logistic function for modelling a • Struggles with the problem of complete separation.
binary classification. When a feature can separate each class perfectly,
• Does not require a linear relationship between the LR model cannot be trained.
input and output variables.
Linear 1936 Maulana et al. • Considers the linear combinations of variables that • Sensitive to outliers.
Discriminant (2022) better explain the data. • Too many assumptions and restrictions.
Analysis • It can work well with small sample sizes. • Linear decision boundary might not generalise
• The decision boundary is linear. Provided the well to all the classes and adequately separate both
dimensionality reduction that later helped in classes.
feature engineering.
K-Nearest 1951 Cunningham • A non-parametric classifier that groups a data • It is not scalable and does not perform well with
Neighbour and Delany sample based on proximity. Similar points lie near high-dimension data.
(2021) each other in a sample cluster. • It is computationally expensive.
• It is a lazy learner model that does not learn from • A lower value of k tends to be overfitting, and a
the data immediately; rather first stores the dataset higher value tends to smooth out. It always required
• It is adaptive and easy to implement. the value of k, which makes it complex sometimes.
Histogram-based 1990 Huo et al. • Provides built-in support to handle missing values. • Needs a conversion of data values into
Gradient (2022) • Bins the input samples into integer-based value integer-based to form bins.
Boosting bins.
Adaboost 1995 Galen and • Combines the inputs from a sequence of weak • Computationally expensive.
Steele (2021) learners applied to repeatedly modify training • Cannot do parallelism for each learner. It needs to
dataset versions. wait for the output from the previous learner.
• Combined output through voting. The weights are • Needs a balanced dataset for better performance.
revised on each iteration. The learning model is • Cannot perform well when the data is noisy or has
reapplied to the revised weighted dataset. outliers.
• It is Adaptive boosting.
Gradient 1999 Gibert et al. • Weak learners are ensembled to outperform in • Does not provide parallelism.
Boosting (2022) terms of efficiency and accuracy. • Considers the loss from all possible tree splits to
• Typically, decision trees are the weak learners. form a new branch that makes it inefficient.
The outputs of decision trees are combined to • Prone to overfitting.
achieve better results. • Hard to interpret the final models.
• Uses a gradient to reduce the loss function.
XGBoost 2014 Al-Hashmi • Speedy and high-performance classifier for large • Does not perform well where the number of
et al. (2022) datasets. training samples is less than the number of features.
• Performs parallelism with a single tree.
• Uses advanced regularisation such as L1 and L2. • Requires a lot of resources to train larger datasets.
• Also handles missing values in the dataset.
• It is extreme Gradient Boosting.
Bagging 1994 Kumar and • A.k.a. bootstrapping method. • Reduce the model interpretability.
Subbiah (2022) • The base models are executed on the bags of the • Computationally expensive.
whole dataset. A bag is a subset of the whole • The final model can experience the problem of
dataset with a replacement to balance the size of high bias.
the whole dataset.
• The same model is learned with multiple subsets
of the dataset.
analysed. The models are fine-tuned on various hyperparameters for malware detection. Second, the performance of the CNN-based SVM is analysed in Section 4.2. Third, the detection effectiveness of the proposed malware framework is presented in Section 4.3. The proposed
framework HD(LM)2D's performance is validated with other machine learning models (when used as final detectors). In addition, this section also compares the performance of the proposed malware framework with other published work on the frequently used Malimg dataset. Due to privacy, the latest attack patterns are not available. Fourth, to consider the shortage-of-attack-patterns problem, Section 4.4 validates the detection effectiveness of the proposed HD(LM)2D on a smaller dataset. These results reveal the performance of the proposed malware detector when insufficient data samples are available for training. Fifth, the used image datasets (datasetA, datasetB, datasetC, and Malimg) are generated using all the features and binaries of goodware and malware PE files. The API call is one feature of the PE file. The API calls from all the PE files are extracted and converted into feature vectors. The API-based feature vectors are then transformed into an image-based API dataset. Section 4.5 compares the performance of detectors on an image-based dataset generated using one input feature (the API call) or considering all features/binaries as images. We show that the proposed malware detector has outperformed state-of-the-art learning-based detectors on large and small datasets, including published static and dynamic techniques.

Different datasets are used to provide a true comparison of the state-of-the-art with the proposed framework. All the PEs are transformed into images using the aforementioned methodology. Various data pre-processing techniques are applied to scale and normalise the data as per the requirement of each deep learning model. The malware images are divided into training, validation, and testing sets with a ratio of 60:20:20, respectively. However, the data augmentation techniques are only applied to the training data to generalise the model in a better way. The 10-fold cross-validation is further used to validate the results. The statistical result comparison is made using the results of 10-fold cross-validation. For consistency of the comparison with the state-of-the-art, the same splitting mechanism is used. The experiments are performed with different values of epochs and batch sizes. The experiments are performed on the high-performance computing (HPC) system of The University of Newcastle (UON), Australia, having an Nvidia V100 GPU. Detection effectiveness includes accuracy, loss, true negative rate, true positive rate, false positive rate and false negative rate. Accuracy is the ratio of correctly classified malware and benign PEs to all the malware and benign PEs in the dataset. The accuracy values are given in percentages. The loss, on the other hand, is not a percentage; it is a summation of the differences/errors made by the model against each PE file in the validation or test set. Specificity, a.k.a. the true negative rate (TNR), is the ratio of correctly classified negative samples to the total negative samples. Sensitivity, a.k.a. the true positive rate (TPR), is the ratio of correctly classified positive samples to the total positive samples. The false positive rate (FPR) is the ratio of negative samples incorrectly classified as positive to the total number of negative samples. The false negative rate (FNR) is the ratio of positive samples incorrectly classified as negative to the total number of positive samples (Shaukat et al., 2020c).

4.1. Detection effectiveness on imbalanced datasets

This section provides the detection effectiveness of the 15 fine-tuned deep learning-based malware detectors. Deep learning models are not specifically designed for malware detection. Using transfer learning, we have fine-tuned the 15 deep learning-based models for the malware detection task using end-to-end classification. Section 4.1 highlighted the problem of an imbalanced dataset, where the count of instances of each classification class is not balanced. As a result, the detector cannot generalise each class instance. Various state-of-the-art augmentation techniques are used to balance the data. Tables 7, 8 and 9 provide the detection effectiveness of the 15 fine-tuned deep learning-based malware detectors on three different datasets. The analysis is based on loss and accuracy. The analysis is performed with batch sizes 8, 16, and 32. The effect of various epoch values is also analysed. These deep learning models are selected based on their best performance in the literature on image and malware detection tasks. A total corpus of 20,000 malicious PEs and 10,000 benign PEs has been collected.¹ The malware corpus is a collection of malicious PEs from the Microsoft malware dataset (Microsoft, 2015), Malimg (Nataraj et al., 2011), and VirusShare (VirusShare, 2022). The benign PEs are collected from multiple online resources. The collected corpus is further divided into three datasets: datasetA, datasetB, and datasetC. The datasets are generated to analyse the performance of learning-based detectors on different dataset sizes.

Table 7 presents the results of the 15 fine-tuned deep learning-based malware detectors on datasetA. DatasetA consists of 2000 malware and 10,000 benign PEs. The samples are divided with a ratio of 60:20:20 for training, validation, and testing, respectively. That is, 1200 malware and 6000 benign PEs are used for training, 400 malicious and 2000 benign PEs for validation, and 400 malicious and 2000 benign PEs for testing. All the PEs are transformed into images. We see that the VGG16 model has the highest accuracy of 92.24% for a batch size of 64 and 350 epochs. The test loss value is <1 for all settings of hyperparameters for both validation and test data. ResNet50 has given the highest loss value of 46.7. The model did not generalise all the malicious PE images; it has overfitted the benign PEs. This might happen due to the low number of instances of the malicious class. Additionally, it is worth noting that the test and validation accuracy values given in Tables 7, 8, and 9 are based on the ratio of correctly classified malicious and benign PEs.

However, the balanced accuracy value for ResNet50 is important to discuss. Balanced accuracy (BA) is an important metric in the case of an imbalanced dataset. It is the average value of recall and specificity for each class (malware and benign in our case). Hence, $BA = \frac{TPR + TNR}{2}$ is the average of correctly classified benign and correctly classified malware samples. For example, the test accuracy value is 68.32% for 150 epochs and a batch size of 8. ResNet50 did not generalise the malicious samples perfectly. The recall (a.k.a. true positive rate) is 79.8% for the benign class (1596 samples correctly classified out of 2000) and 11% for the malicious class (only 44 samples correctly classified out of 400). The balanced accuracy of ResNet50 is 45.4% ((79.8 + 11)/2). Overall, ResNet50 gives the worst performance. VGG19 has a comparable accuracy of 92.19%, with a lower loss value for both validation and test data. The bold value is the best performance of each model. The bold value with a yellow background is the best among all models.
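A small sketch of this balanced-accuracy computation, treating malware as the positive class; the counts in the trailing comment reproduce the ResNet50 example above.

```python
from sklearn.metrics import confusion_matrix

def balanced_accuracy(y_true, y_pred):
    """BA = (TPR + TNR) / 2, computed from the 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # sensitivity / recall on the malicious class
    tnr = tn / (tn + fp)  # specificity on the benign class
    return (tpr + tnr) / 2

# ResNet50 example from the text: 1596/2000 benign and 44/400 malicious
# correct, so TNR = 0.798, TPR = 0.11 and BA = (0.798 + 0.11) / 2 = 0.454.
```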
Table 8 presents the results of the 15 fine-tuned deep learning-based malware detectors on datasetB. DatasetB consists of 20,000 malware and 10,000 benign PEs. The samples are divided with a ratio of 60:20:20 for training, validation, and testing, respectively. 12,000 malware and 6000 benign PEs are used for training, 4000 malicious and 2000 benign PEs for validation, and 4000 malicious and 2000 benign PEs for testing. The VGG16 model has again outperformed the others, with an accuracy of 93.58% and a lower loss rate. The ResNet50 model gave a higher loss value of 54.62 for the test data. VGG19 has an accuracy value of 93.04%. ResNet50 again did not generalise the samples of the minority class and skewed towards the majority class. The overall accuracy of ResNet50 is 44.58%, and the balanced accuracy is 38.33%, with 150 epochs and a batch size of 32. The true positive rate is 57.02%: out of 4000 malicious PEs, only 2281 are correctly classified. On the other hand, the true negative rate is 19.65% for benign samples: 393 benign samples are correctly classified out of 2000. The bold value is the best performance of each model. The bold value with a yellow background is the best among all the models.

Table 9 presents the results of the 15 fine-tuned deep learning-based malware detectors on datasetC. DatasetC consists of 20,000 malware and 10,000 benign PEs. All the samples are the same as in datasetB.

¹ The code and datasets are available by request.
Table 7
Detection effectiveness of malware detectors on datasetA.
Sr. No. Model Epochs Batch Size Valid Loss Valid Acc. Test Loss Test Acc.
150 8 0.9573 78.19% 0.6878 89.08%
150 16 0.6512 85.78% 0.4279 90.86%
150 32 0.578 86.99% 0.4075 90.71%
1 VGG16
250 32 0.4898 89.16% 0.4279 91.93%
150 64 0.621 82.78% 0.3445 91.17%
250 64 0.4914 88.14% 0.338
150 8 46.63 69.39% 46.7 68.32%
70 8 21.5819 69.39% 21.57 69.33%
2 ResNet50
150 16 21.85 69.39% 18.43 69.18%
150 32 30.88 69.39% 31.42 69.38%
150 8 6.86 67.22% 4.68 67.14%
150 16 4.66 61.67% 4.52 62.04%
3 InceptionV3
250 16 3.29 74.11% 3.48 72.19%
150 32 5.55 71.30% 3.599 72.55%
150 8 0.8179 86.86% 0.3869 90.91%
4 VGG19 150 16 0.666 86.80% 0.424 92.19%
150 32 0.5474 88.46% 0.3023 89.84%
150 8 16.5933 59.82% 9.54 74.08%
150 16 14.74 57.21% 8.89 66.02%
5 MobileNet
250 16 11.15 66.65% 8.348 73.77%
150 32 10.6329 55.99% 5.44 70.25%
150 8 7.11 77.49% 6.56 81.07%
150 16 5.72 75.26% 5.81 80.71%
6 DenseNet169
150 32 4.53 77.10% 4.702 77.95%
250 32 4.65 77.55% 4.08 81.78%
150 8 4.94 65.43% 5.84 57.44%
7 Xception 150 16 6.402 50.64% 7.04 46.17%
150 32 3.24 65.43% 3.88 62.04%
150 8 14.19 58.61% 20.38 52.80%
150 16 11.86 59.38% 11.83 66.22%
150 32 9.74 73.02% 10.99 75.05%
8 DenseNet201
250 32 7.7311 73.53% 7.07 67.75%
150 64 9.5959 46.36% 7.156 55.05%
250 64 6.1271 60.97% 9.188 62.34%
150 8 5.035 65.37% 6.066 65.91%
9 Inception ResNet V2 150 16 8.3494 47.39% 8.959 46.17%
150 32 3.389 61.54% 4.63 52.80%
150 8 12.5227 67.22% 13.35 69.81%
10 MobileNetV2 150 16 10.577 63.39% 10.911 66.53%
150 32 12.3141 70.34% 10.84 70.81%
150 8 13.9263 73.34% 12.98 74.64%
11 ResNet152V2 150 16 12.24 75.06% 9.25 77.91%
150 32 11.49 74.87% 9.099 77.44%
150 8 6.04 70.60% 4.226 76.29%
150 16 6.94 52.04% 6.202 54.03%
150 32 4.307 56.17% 5.53 50.70%
12 NasNetMobile
250 32 2.8907 70.41% 3.01 70.96%
150 64 2.3857 75.32% 2.3213 77.29%
250 64 3.666 64.16% 2.7905 65.00%
150 8 10.23 70.13% 8.533 69.52%
13 AlexNet 150 16 10.48 68.48% 9.12 70.32%
150 32 9.03 66.72% 8.49 68.54%
150 8 10.89 63.46% 10.29 64.28%
14 SqueezeNet 150 16 10.39 66.41% 8.18 67.21%
150 32 10.88 71.23% 8.38 70.27%
150 8 4.12 76.34% 3.19 78.59%
15 RegNetY320 150 16 3.92 79.17% 3.57 79.24%
150 32 2.98 78.18% 2.41 78.92%
The benign PEs are divided with a ratio of 60:20:20; hence, 6000 are used for training, 2000 for validation and 2000 for testing.
Table 8
Detection effectiveness of malware detectors on datasetB.
Sr. No. Model Epochs Batch Size Valid Loss Valid Acc. Test Loss Test Acc.
1 VGG16
150 8 0.9236 89.45% 0.564 92.54%
150 16 0.7045 89.39% 0.4013 93.56%
150 32 0.6954 88.09% 0.487 90.75%
250 32 0.5649 91.09% 0.3765 92.22%
150 64 0.36 91.69% 0.2609
250 64 0.5372 90.09% 0.3284 92.69%
2 ResNet50
150 8 28.91 46.80% 19.89 54.38%
70 8 64.94 38.07% 31.41 39.53%
150 16 33.37 41.98% 16.07 64.51%
150 32 67.29 35.00% 54.62 44.58%
3 InceptionV3
150 8 8.01 74.09% 5.98 78.97%
150 16 4.21 75.16% 2.57 79.12%
250 16 3.89 76.94% 3.1 78.20%
150 32 4.01 75.38% 2.72 80.23%
4 VGG19
150 8 0.791 91.16% 0.5591 91.34%
150 16 0.669 90.56% 0.5262 91.98%
150 32 0.5344 90.34% 0.3785 93.04%
5 MobileNet
150 8 11.74 76.25% 10.26 74.31%
150 16 15.04 74.67% 8.23 73.79%
250 16 14.64 75.28% 8.3 74.41%
150 32 8.503 77.42% 5.15 76.04%
6 DenseNet169
150 8 18.53 72.70% 10.53 83.57%
150 16 11.59 76.03% 7.823 82.34%
150 32 11.43 73.42% 5.94 83.59%
250 32 14.31 72.33% 8.71 80.87%
7 Xception
150 8 6.68 73.36% 5.18 77.96%
150 16 4.52 76.95% 3.56 80.29%
150 32 3.53 76.11% 4.29 73.93%
8 DenseNet201
150 8 24.64 70.58% 25.67 75.98%
150 16 17.97 70.98% 16.3 77.62%
150 32 15.84 68.50% 14.24 76.23%
250 32 12.73 73.23% 12.18 78.31%
150 64 11.03 68.82% 10.38 76.54%
250 64 11.12 71.52% 12.39 76.28%
9 Inception ResNet V2
150 8 8.71 72.11% 10.86 73.62%
150 16 4.99 73.33% 5.77 73.28%
150 32 4.27 74.41% 5.29 75.35%
10 MobileNetV2
150 8 10.73 73.36% 10.42 73.35%
150 16 9.23 73.42% 8.73 75.18%
150 32 8.61 72.05% 9.12 73.97%
11 ResNet152V2
150 8 22.52 79.33% 14.63 81.04%
150 16 17.7 77.31% 12.19 81.09%
150 32 16.02 75.86% 11.71 80.21%
12 NasNetMobile
150 8 6.2 76.55% 6.34 79.45%
150 16 4.11 76.81% 4.15 78.46%
150 32 3.77 76.17% 3.52 78.43%
250 32 3.33 75.61% 4.43 76.60%
150 64 2.9 74.45% 3.09 76.54%
250 64 3.66 73.53% 4.03 75.91%
13 AlexNet
150 8 9.91 72.87% 8.34 72.52%
150 16 10.02 70.76% 8.59 71.35%
150 32 10.64 72.65% 8.02 73.11%
14 SqueezeNet
150 8 9.68 72.76% 9.47 74.51%
150 16 9.12 73.71% 7.85 75.24%
150 32 9.63 71.87% 8.26 74.95%
15 RegNetY320
150 8 2.34 80.18% 2.21 81.07%
150 16 3.14 82.65% 2.76 82.71%
150 32 2.83 83.12% 1.93 82.54%
Table 9
Detection effectiveness of malware detectors on datasetC.
Sr. No. Model Epochs Batch Size Valid Loss Valid Acc. Test Loss Test Acc.
1 VGG16
150 8 0.7649 90.41% 0.427 93.12%
150 16 0.4896 91.72% 0.3067 93.84%
150 32 0.4312 91.72% 0.2509 93.68%
250 32 0.4069 91.76% 0.31 92.44%
150 64 0.3291 92.23% 0.2173
250 64 0.459 91.11% 0.26 93.95%
2 ResNet50
150 8 14.24 74.71% 15.97 74.69%
70 8 18.05 74.69% 17.56 74.61%
150 16 19.76 72.81% 16.01 74.56%
150 32 28.88 70.97% 30.54 74.84%
3 InceptionV3
150 8 5.07 78.39% 4.51 78.99%
150 16 4.08 76.13% 2.36 79.26%
250 16 3.22 78.72% 2.63 80.76%
150 32 2.87 77.33% 1.9 80.46%
4 VGG19
150 8 0.6398 91.39% 0.3659 92.95%
150 16 0.4814 91.11% 0.3327 93.53%
150 32 0.4998 91.39% 0.2789 93.64%
5 MobileNet
150 8 10.59 76.69% 9.43 78.22%
150 16 10.97 74.80% 7.43 74.49%
250 16 10.27 75.99% 8.28 76.58%
150 32 7.58 78.24% 4.39 76.13%
6 DenseNet169
150 8 6.71 81.08% 4.86 83.85%
150 16 4.64 82.45% 4.94 84.19%
150 32 3.71 91.12% 2.66 84.15%
250 32 4.48 81.75% 3.23 84.89%
7 Xception
150 8 4.72 74.64% 3.61 78.15%
150 16 3.7 77.43% 3.17 81.45%
150 32 2.8 76.57% 3.41 74.21%
8 DenseNet201
150 8 11.84 79.82% 11.6 79.80%
150 16 11.76 77.16% 11.7 77.83%
150 32 9.09 77.13% 8.84 77.44%
250 32 6.17 81.54% 5.57 82.38%
150 64 5.36 78.21% 4.47 79.84%
250 64 5.59 75.38% 8.95 76.57%
9 Inception ResNet V2
150 8 4.01 75.78% 5.95 76.77%
150 16 3.34 75.38% 4.24 74.36%
150 32 3.38 76.69% 4.21 75.55%
10 MobileNetV2
150 8 8.82 73.80% 9.7 74.42%
150 16 8.65 74.02% 7.67 75.28%
150 32 8.51 73.04% 8.8 74.01%
11 ResNet152V2
150 8 11.35 80.31% 11.19 81.99%
150 16 11.05 77.71% 8.59 82.41%
150 32 9.5 76.62% 8.95 80.31%
12 NasNetMobile
150 8 4.54 78.46% 3.43 81.99%
150 16 3.02 79.00% 2.66 79.41%
150 32 2.33 77.86% 1.93 79.62%
250 32 2.14 75.90% 2.25 76.65%
150 64 1.84 79.40% 1.54 80.20%
250 64 2.72 77.72% 2.47 78.33%
13 AlexNet
150 8 9.25 75.95% 7.52 77.18%
150 16 9.85 74.17% 8.36 76.41%
150 32 8.81 76.01% 7.14 77.48%
14 SqueezeNet
150 8 9.56 72.95% 8.52 78.32%
150 16 8.92 73.84% 7.51 79.87%
150 32 9.51 73.51% 7.94 81.28%
15 RegNetY320
150 8 2.21 84.52% 1.75 87.95%
150 16 2.87 84.96% 1.24 88.01%
150 32 1.96 85.41% 1.19 88.52%
Table 10
A comparison of detection effectiveness of fine-tuned deep learning-based malware detectors with CNN-based SVM malware detector on datasetA.
Sr. No. Model Epochs Batch size Valid loss Valid Acc. Test loss Test Acc.
1 VGG16 150 32 0.578 86.99% 0.4575 90.71%
2 ResNet50 150 32 30.88 69.39% 31.42 69.38%
3 InceptionV3 150 32 5.55 71.30% 3.599 72.55%
4 VGG19 150 32 0.5874 85.46% 0.4613 89.84%
5 MobileNet 150 32 10.6329 55.99% 5.44 70.25%
6 DenseNet169 150 32 4.53 77.10% 4.702 77.95%
7 Xception 150 32 3.24 65.43% 3.88 62.04%
8 DenseNet201 150 32 9.74 73.02% 10.99 75.05%
9 Inception ResNet V2 150 32 3.389 61.54% 4.63 52.80%
10 MobileNetV2 150 32 12.3141 70.34% 10.84 70.81%
11 ResNet152V2 150 32 11.49 74.87% 9.099 77.44%
12 NasNetMobile 150 32 4.307 56.17% 5.53 50.70%
13 AlexNet 150 32 9.03 66.72% 8.49 68.54%
14 SqueezeNet 150 32 10.88 71.23% 8.38 70.27%
15 RegNetY320 150 32 2.98 78.18% 2.41 78.92%
16 CNN-SVM 150 32 0.397 88.65% 0.4366 91.17%
Conspicuously, the count of benign and malicious PEs for training is highly imbalanced. We have used the augmentation techniques mentioned in Section 4.1 to augment the benign training data described in Section 3.1.3. The total number of benign PEs for training is 12,000: the benign training set includes 6000 real benign samples and 6000 augmented benign images, whereas the count of validation and testing benign PEs is unchanged. The augmentation is only performed on the training data. VGG16 has outperformed with an accuracy of 94.44%. VGG19 takes second place with an accuracy of 93.64%. The ResNet50 loss has also decreased. After the augmentation, the results are improved, and each model has generalised the samples of both classes correctly. The overall accuracy of ResNet50 is 74.84%, and the balanced accuracy is 71.73%. The augmentation has significantly increased the accuracies for both classification classes. The true positive rate is 81.02% for the malware class, an improvement of 24% over the setting without augmentation: out of 4000 malicious samples, 3241 samples are correctly classified. On the other hand, 1249 benign samples are correctly classified out of 2000, a true negative rate of 62.45%, which is 42.80% higher than the 19.65% obtained without augmentation for the same settings of ResNet50 in Table 8. The bold value is the best performance of each model; the bold value with a yellow background is the best among all models.
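The per-class rates quoted above follow directly from the confusion counts; the short sketch below recomputes them for ResNet50 on the augmented datasetC test split (the results agree with the reported figures up to rounding).

tp, fn = 3241, 4000 - 3241  # malicious test samples: correct / missed
tn, fp = 1249, 2000 - 1249  # benign test samples: correct / missed

tpr = tp / (tp + fn)                        # true positive rate (81.02%)
tnr = tn / (tn + fp)                        # true negative rate (62.45%)
accuracy = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
balanced_accuracy = (tpr + tnr) / 2         # mean of the per-class rates
print(tpr, tnr, accuracy, balanced_accuracy)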
Overall, the accuracy of VGG16 is improved by roughly 1% for all hyperparameter settings. Surprisingly, ResNet50's accuracy values are enhanced by more than 10% for each setting of hyperparameters. With balanced data, the accuracy of the densest architecture, RegNetY320, is improved by more than 5%. The VGG19 model's accuracy is improved by >1% for batch sizes 8 and 16. The augmentation has increased the accuracy by 3.49% on average across all the models. It is observed that the VGG16 model's accuracy decreases when increasing the number of epochs with a batch size of 32. However, the InceptionV3 accuracy is increased from 79.26% to 80.76% by increasing the number of epochs with a batch size of 16. The accuracy of DenseNet201 is also increased from 77.44% to 82.38% by increasing the number of epochs from 150 to 250 for a batch size of 32. On the other hand, the accuracy is decreased by 3% when the number of epochs is increased from 150 to 250 for a batch size of 64; a similar case happened with VGG16 with a batch size of 64. The augmentation techniques proved efficient in increasing the accuracy of the fine-tuned models and decreasing the loss values. From this analysis, we cannot comment exactly on the effect of the hyperparameters: sometimes an increase in the batch size improves the accuracy and sometimes it does not, and the same is valid for the number of epochs. We will provide the data and code on a reasonable request. We plan to analyse the effects of other hyperparameters and models in the future.
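The balancing step above can be realised with standard image-augmentation layers; a minimal sketch is given below. The specific transforms (flips, translations, zoom) and the helper function are illustrative placeholders, since the exact augmentation set is the one described in Section 4.1.

import numpy as np
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomTranslation(0.05, 0.05),
    tf.keras.layers.RandomZoom(0.1),
])

def balance_benign(benign_images: np.ndarray, target: int) -> np.ndarray:
    """Append augmented copies of the benign images until `target` samples exist."""
    out = [benign_images]
    rng = np.random.default_rng(0)
    while sum(len(x) for x in out) < target:
        idx = rng.integers(0, len(benign_images), size=len(benign_images))
        out.append(augment(benign_images[idx], training=True).numpy())
    return np.concatenate(out)[:target]

# e.g. balanced = balance_benign(benign_train, target=12000)  # 6000 real + 6000 augmented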
4.2. Analysis of CNN-based SVM malware detector

This section provides a comparative analysis of CNN-SVM with the deep learning-based models for malware detection. A detailed discussion of CNN-SVM is provided in Section 3.2.4. Three sets of experiments were performed on three datasets: datasetA, datasetB, and datasetC. The same datasets are used to validate the performance of this approach. We have observed that better results are achieved with 150 epochs and a batch size of 32; for brevity, we have used only this hyperparameter setting for the analysis in this section. Tables 10, 11, and 12 present the results on these datasets. It is seen that CNN-SVM has outperformed the existing approaches on all three datasets. The experiments were performed using TensorFlow with 150 epochs, a batch size of 32, a 1e−3 learning rate, and a 0.85 dropout rate.

The CNN-SVM has outperformed with an accuracy of 91.17% on datasetA, 93.39% on datasetB, and 95.15% on datasetC. The CNN-SVM has outperformed VGG16 by 0.46%, as demonstrated in Table 10. However, the accuracy of VGG19 is the highest on datasetB with these hyperparameter settings, at 93.04% (Table 11); CNN-SVM has increased the accuracy by 0.35%. Table 12 presents the results on datasetC, which was augmented to balance the data. The VGG16 gave 93.68% accuracy; CNN-SVM outperformed it with 95.15%, which is 1.47% higher. The CNN-SVM outperformed the other models with slightly higher accuracy. However, the increase is not significant, though the training process took a shorter time than the training of the fine-tuned deep learning models using transfer learning.
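As a rough illustration, the sketch below follows the common CNN-SVM construction of Tang (2013) and Agarap (2017): a linear final layer trained with a squared hinge loss, using the learning rate and dropout rate reported above. The convolutional stack is an assumed placeholder; the exact architecture is the one specified in Section 3.2.4. Labels must be encoded as {-1, +1} for the hinge loss.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.85),                 # dropout rate reported above
    tf.keras.layers.Dense(1, activation="linear"), # SVM-style margin output
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # 1e-3 LR, as reported
              loss="squared_hinge")                      # linear-SVM objective
# model.fit(x_train, y_train_pm1, epochs=150, batch_size=32)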
4.3. Detection effectiveness of the proposed malware detection framework

This section highlights the detection effectiveness of our proposed hybrid framework. The proposed framework extracts the deep features using a fine-tuned deep learning model. The extracted features are fed into a machine learning model to fully utilise the inherent connections between input and output. The proposed framework uses the SVM as the final detector. The densest deep learning model, RegNetY320, performed best in combination with the SVM for malware detection. We are unaware of any work using RegNetY320 as a feature extractor or for end-to-end malware detection. We are also unaware of any work using learning-based models as feature extractors and SVM as a final detector. Hence, the proposed framework is novel and scalable: in principle, any deep learning and machine learning models can be used as feature extractors and final detectors, respectively.

Augmentation increased the malware detection effectiveness and reduced the training and testing loss in our previous experiments. For brevity, we have used datasetC to analyse the detection effectiveness of our proposed framework. All the experiments are conducted with a batch size of 32 and 150 epochs. Table 13 not only demonstrates the detection effectiveness of our proposed framework but also validates the results against state-of-the-art machine learning models. The model column lists all the models.
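A minimal sketch of this two-stage pipeline is given below, pairing a truncated CNN (global-average-pooled deep features) with an SVM final detector. VGG16 stands in for the extractor because it is available in all TensorFlow releases; RegNetY320 can be substituted where tf.keras.applications provides it. The paths, preprocessing, and the absence of a fine-tuning step are assumptions of the sketch.

import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

extractor = tf.keras.applications.VGG16(include_top=False,
                                        weights="imagenet",  # or fine-tuned weights
                                        pooling="avg")       # one 1D vector per image

def deep_features(images: np.ndarray) -> np.ndarray:
    """images: float32 batch of shape (N, 224, 224, 3)."""
    x = tf.keras.applications.vgg16.preprocess_input(images.copy())
    return extractor.predict(x, batch_size=32)

# X_train/X_test are image batches, y_train/y_test in {0, 1}
# svm = SVC(kernel="rbf").fit(deep_features(X_train), y_train)
# accuracy = svm.score(deep_features(X_test), y_test)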
Table 11
A comparison of detection effectiveness of fine-tuned deep learning-based malware detectors with CNN-based SVM malware detector on datasetB.
Sr. No. Model Epochs Batch size Valid loss Valid Acc. Test loss Test Acc.
1 VGG16 150 32 0.6954 88.09% 0.487 90.75%
2 ResNet50 150 32 67.29 50.00% 54.62 74.58%
3 InceptionV3 150 32 4.01 75.38% 2.72 80.23%
4 VGG19 150 32 0.5344 90.34% 0.3785 93.04%
5 MobileNet 150 32 8.503 77.42% 5.15 76.04%
6 DenseNet169 150 32 11.43 73.42% 5.94 83.59%
7 Xception 150 32 3.53 76.11% 4.29 73.93%
8 DenseNet201 150 32 15.84 68.50% 14.24 76.23%
9 Inception ResNet V2 150 32 4.27 74.41% 5.29 75.35%
10 MobileNetV2 150 32 8.61 72.05% 9.12 73.97%
11 ResNet152V2 150 32 16.02 75.86% 11.71 80.21%
12 NasNetMobile 150 32 3.77 76.17% 3.52 78.43%
13 AlexNet 150 32 10.64 72.65% 8.02 73.11%
14 SqueezeNet 150 32 9.63 71.87% 8.26 74.95%
15 RegNetY320 150 32 2.83 83.12% 1.93 82.54%
16 CNN-SVM 150 32 0.4544 92.00% 0.3373 93.39%
Table 12
A comparison of detection effectiveness of fine-tuned deep learning-based malware detectors with CNN-based SVM malware detector on datasetC.
Sr. No. Model Epochs Batch size Valid loss Valid Acc. Test loss Test Acc.
1 VGG16 150 32 0.4312 91.72% 0.2509 93.68%
2 ResNet50 150 32 28.88 70.97% 30.54 74.84%
3 InceptionV3 150 32 2.87 77.33% 1.9 80.46%
4 VGG19 150 32 0.4998 91.39% 0.2789 93.64%
5 MobileNet 150 32 7.58 78.24% 4.39 76.13%
6 DenseNet169 150 32 3.71 91.12% 2.66 84.15%
7 Xception 150 32 2.8 76.57% 3.41 74.21%
8 DenseNet201 150 32 9.09 77.13% 8.84 77.44%
9 Inception ResNet V2 150 32 3.38 76.69% 4.21 75.55%
10 MobileNetV2 150 32 8.51 73.04% 8.8 74.01%
11 ResNet152V2 150 32 9.5 76.62% 8.95 80.31%
12 NasNetMobile 150 32 2.33 77.86% 1.93 79.62%
13 AlexNet 150 32 8.81 76.01% 7.14 77.48%
14 SqueezeNet 150 32 9.51 73.51% 7.94 81.28%
15 RegNetY320 150 32 1.96 85.41% 1.19 88.52%
16 CNN-SVM 150 32 0.2783 93.70% 0.2303 95.15%
The accuracy column presents the detection effectiveness of each fine-tuned deep learning model in this hyperparameter setting. The follow-up columns report the results of each machine learning-based model when used as the final detector. Finally, the last column shows the performance of our proposed framework in bold green text.

Table 13 presents the in-depth analysis of the 15 fine-tuned deep learning-based models for malware detection. Each Cartesian value presents the framework's accuracy when the features are extracted using a fine-tuned deep learning model (listed in the model column) and fed to a machine learning model (listed in the remaining columns) as the final detector. For example, the accuracy column shows that if we apply a fine-tuned deep learning model (VGG16) for end-to-end malware detection, the obtained accuracy is 93.68%; however, if we extract the features using VGG16 and apply bagging as the detector, the obtained accuracy is 93.41%. The same holds for the rest of the columns as final detectors.

The proposed framework improves the detection effectiveness of all the fine-tuned models. Though the accuracy of some fine-tuned models (when applied end-to-end) was even lower than 70%, the framework significantly increased the detection effectiveness of all fine-tuned models. We observed that the framework performed better with the densest architecture, i.e., RegNetY320. Table 14 provides a compact overview of the performance improvement compared to the fine-tuned deep learning-based malware detectors. Bagging, AdaBoost, Gradient Boosting, and XGBoost have increased the accuracy for most models but not all. The HistGradient Boosting (HGB) classifier has outperformed all the fine-tuned models and given a 15.07% average increase over them. The proposed framework has given a 16.55% average increase over all of the fine-tuned models, a 1.48% better performance than the HistGradient Boosting classifier. Logistic regression (LR) outperformed all fine-tuned models, though the average increase with LR is 16.05%, which is 1% better than HGB. Random Forest (RF) has also performed better for all the deep learning models; its average increase in accuracy is 14.11%, which is 2.44% less than that of our proposed framework. Naïve Bayes (NB), Decision Tree (DT), and Linear Discriminant Analysis (LDA) could not outperform all the models, though the average increases were 4.27%, 8.27%, and 12.4%, respectively. K-Nearest Neighbour (K-NN) has outperformed all fine-tuned models with an average increase of 14.4%, which is 2.15% less than that of the proposed framework. Resultantly, Bagging, AdaBoost, Gradient Boosting, XGBoost, Naïve Bayes, Decision Tree, and Linear Discriminant Analysis could not increase the accuracy for all the models. Conversely, K-NN, RF, LR and HGB have outperformed the fine-tuned models. Our proposed framework has outperformed not only the existing fine-tuned models but also the state-of-the-art machine learning classifiers.

Regarding the models, VGG16, VGG19, and RegNetY320 are the models for which not all combinations of machine learning models (as final detectors) could outperform the fine-tuned model's individual performance (as an end-to-end malware detector). For example, the end-to-end malware detection effectiveness of VGG19 is better than when the framework used the AdaBoost, gradient boosting, naïve Bayes, decision tree, and linear discriminant analysis classifiers as the final detectors.
Table 13
A deep analysis of the proposed framework (all values are in percentage %).
Sr. No. Model Accuracy Bagging AdaBoost Gradient Boosting XGBoost HistGradient Boosting Logistic Regression Random Forest Naïve Bayes Decision Tree K-Nearest Neighbour Linear Discriminant Analysis Proposed framework
1 VGG16 93.68 93.41 92.11 91.81 93.59 96.12 97.23 95.31 84.87 90.31 96.72 89.21 97.91
2 ResNet50 74.84 93.2 92.29 90.23 93.18 96.04 96.53 94.39 81.51 89.37 94.78 84.43 96.87
3 InceptionV3 80.46 91.61 90.33 89.49 92.34 94.31 95.21 92.37 83.82 84.41 92.12 93.81 96.02
4 VGG19 93.64 95.23 93.48 93.58 94.28 96.52 97.36 95.85 86.43 91.57 96.53 90.53 97.82
5 MobileNet 76.13 93.1 92.14 91.33 92.57 96.33 97.24 95.89 86.57 88.43 95.47 95.81 97.34
6 Xception 74.21 91.48 90.23 91.13 93.31 95.11 96.31 94.71 82.89 88.81 94.58 95.77 96.74
7 DenseNet169 84.15 92.93 89.99 92.44 94.73 96.27 97.43 95.77 85.23 90.21 96.81 96.79 97.55
8 DenseNet201 77.44 90.23 87.81 89.23 94.18 96.51 97.78 95.56 86.57 89.9 96.51 97.23 97.91
9 InceptionResNetV2 75.55 91.44 90.25 90.93 94.03 96.48 97.41 94.97 87.1 88.53 95.91 96.14 97.9
10 MobileNetV2 74.01 92.51 91.14 91.82 93.81 96.43 97.35 95.35 87.41 89.43 95.58 95.44 97.5
11 NasNetMobile 79.62 93.14 92.1 91.22 93.48 95.93 96.33 94.56 84.86 88.33 95.21 95.23 96.84
12 ResNet152V2 80.31 93.22 90.98 91.58 92.13 95.22 96.24 94.23 85.75 88.58 92.47 95.35 96.73
13 AlexNet 77.48 89.97 88.54 90.27 90.85 94.28 95.28 93.43 83.29 88.56 93.86 86.59 96.25
Table 14
A compact comparison of the detection effectiveness of the proposed framework with state-of-the-art machine learning models (all values are in percentage %; green colour shows the % increment over the benchmark, i.e., the accuracy column values).
Table 15
A comparison of the proposed framework with the state-of-the-art.
Sr. No. Year Dataset Reference Models Accuracy Precision Recall F-score
1 2011 Malimg Nataraj et al. (2011) K-NN 98.08% – – –
2 2017 Malimg Luo and Lo (2017) CNN+LBP 93.72% 94.13% 92.54% 93.33%
3 2017 Malimg Rezende et al. (2017) ResNet50 97.48% – – –
4 2017 Malimg Rezende et al. (2017) ResNet-50 98.62% – – –
5 2017 Malimg Makandar and Patrot (2017) GIST+SVM 98.88% – – –
6 2018 Malimg Lo et al. (2019) Xception 98.52% – – –
7 2019 Malimg Bhodia et al. (2019) ResNet34 94.80% – – –
8 2019 Malimg Roseline et al. (2019) Ensembling using RF 97.82% 98% 98% 98%
9 2019 Malimg Agarap (2017) CNN-SVM 77.22% 84% 77% 79%
10 2019 Malimg Ben Abdel Ouahab et al. (2019) K-NN 97% – – –
11 2019 Malimg Gibert et al. (2019) CNN 97.18% – – –
12 2019 Malimg Vinayakumar et al. (2019) CNN+LSTM 96.3% 96.3% 96.2% 96.2%
13 2019 Malimg Vinayakumar et al. (2019) RF 78.6% – – –
14 2019 Malimg Vinayakumar et al. (2019) NB 80.5% – – –
15 2019 Malimg Vinayakumar et al. (2019) KNN 41.8% – – –
16 2019 Malimg Vinayakumar et al. (2019) DT 79.5% – – –
17 2019 Malimg Singh et al. (2019) ResNet50 96.08% 95.76% 96.16% 95.96%
18 2020 Malimg Vasan et al. (2020b) VGG16 97.59% – – –
19 2020 Malimg Vasan et al. (2020b) ResNet50 95.94% – – –
20 2021 Malimg Hemalatha et al. (2021) DenseNet 98.23% 97.78% 97.92% 97.85%
21 2022 Malimg Proposed framework 99.06% 98.47% 98.52% 98.49%
The generalisation capability of the proposed framework is assessed in Table 15, which demonstrates the comparison of our framework with the existing works. To validate the detection effectiveness of our proposed framework, we conducted experiments with the Malimg dataset. Fig. 5 illustrates the comparative results of the proposed framework against the state-of-the-art for better visualisation and understanding. In our proposed framework, RegNetY320 with SVM performed best, and we used the same combination for the comparison with the state-of-the-art. It is evident that the proposed framework has outperformed the existing work on a benchmark dataset as well. We are unaware of any previous work that has handled the problem of an imbalanced dataset and such an extensive malware analysis in a single research work.

4.4. Performance analysis on a smaller dataset

This section validates the detection effectiveness of the proposed framework on a smaller dataset. DatasetD is used to conduct the experiments for this sub-section. DatasetD consists of 1000 malicious PEs and 1000 benign PEs, randomly selected from datasetA. The dataset is split into training, validation, and testing with a ratio of 60:20:20: 600 malicious PEs and 600 benign PEs are used for training, and 200 of each class are used for validation and testing. Deep learning models are called data-hungry models; they give better results when trained with a large amount of data. Due to privacy issues, the latest attack patterns are unavailable in real life. The results are not
reasonable if the learning-based malware detectors are trained with imbalanced class instances. The deep learning models tend to skew towards the majority class and cannot generalise the instances from the minority class well. The performance analysis of imbalanced datasets and its solution have been discussed in the previous sections. In this section, we validate the performance of the proposed framework on a limited-size, smaller dataset (with few samples in training). The performance of the proposed framework is validated on different settings of hyperparameters, i.e., the number of epochs and the batch size, with the 15 fine-tuned deep learning-based and CNN-SVM malware detectors. We conducted experiments with batch sizes of 8, 16, and 32, with 150 epochs. The performance of the state-of-the-art fine-tuned malware detectors decreased with the smaller dataset.

However, the proposed malware detector outperformed them with an accuracy of 93.43%. In our previous experiments with datasetA, datasetB, and datasetC, the VGG16 performed better than the other deep learning models. The VGG16 has 89.71% accuracy with 150 epochs and a batch size of 32. The CNN-SVM has increased the accuracy to 90.87%, which is 1.16% better than the VGG16 (see Table 16).

The proposed framework used RegNetY320 as the feature extractor. The extracted deep learning features are then given to the SVM for final malware detection. The proposed framework gave 93.43% accuracy and has outperformed VGG16, VGG19, and CNN-SVM by 3.72%, 4.7%, and 2.56%, respectively.
4.5. Performance comparison of detectors: with one vs. all features

This section presents the analysis of the one-vs-all-features image-based dataset. Al-Dujaili et al. (2018) proposed a method of malware detection using API features, in which each PE file is converted into a corresponding feature vector; the authors then used an artificial neural network (ANN) for the malware detection. Our work presents a hybrid deep learning-based solution for malware detection. We conducted experiments from four perspectives. Table 17 presents these results: the first data row shows the detection effectiveness of Al-Dujaili et al. (2018), and the subsequent rows demonstrate the results of our proposed approaches. The experiments were performed using the same dataset as Al-Dujaili et al. (2018).

In Proposed Method 1 and Proposed Method 2, we converted the API-based feature vectors into images and named the dataset APIImg. The process mentioned in Section 3.1.2 transformed all the feature vectors into images.

In Proposed Method 1, we used the fine-tuned VGG16 model. Out of the 15 fine-tuned models, VGG16 outperformed the other models once applied end-to-end for malware detection. We then leverage the experiments and apply the fine-tuned VGG16 to compare with the performance of the ANN used by Al-Dujaili et al. (2018). Proposed Method 1 classified the malware and benign samples with an accuracy of 93.47%, outperforming it by 1.57%.

In Proposed Method 2, we conducted the experiments using our proposed framework. We extracted the features from the APIImg dataset using RegNetY320. The deep features are then fed to the SVM for final detection. Proposed Method 2 gave an accuracy of 97.86%. The accuracy of the proposed framework is 5.96% and 4.39% better than Al-Dujaili et al. (2018) and the fine-tuned VGG16, respectively.

In Proposed Method 3 and Proposed Method 4, we transformed all the binaries of the source PEs dataset (Al-Dujaili et al., 2018) into images. The fine-tuned VGG16 gave 94.02% accuracy, and the proposed framework outperformed it with 98.53%.

Through these experiments, we analysed the detection effectiveness of considering a single feature versus the full binaries for malware detection. In Al-Dujaili et al. (2018), the proposed method worked on the binary feature vectors with an ANN for detection. We analysed whether converting those binary feature vectors into images and leveraging the latest advancements in deep learning can increase the accuracy, and we concluded that applying the latest deep learning models could improve accuracy. From methods 3 and 4, we observed that considering one feature (e.g., API calls), regardless of the subsequent detection model or method (binary feature vectors or visualisation), could reduce the accuracy. We transformed the PE binaries into images and applied the latest developments in deep learning to analyse the performance. The applied models have certainly increased the detection effectiveness. We are unsure how cybercriminals manipulate the PEs to convert them into malicious ones; thus, considering all the binaries proved helpful in increasing the detection accuracy (methods 3 and 4).
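For concreteness, the sketch below shows one common way of rendering a PE's raw bytes as an RGB image. It is an assumption-laden stand-in for the actual transformation of Section 3.1.2; the row width, zero padding, and bicubic resizing are illustrative choices.

import numpy as np
from PIL import Image

def pe_to_rgb_image(path: str, width: int = 256, size=(224, 224)) -> Image.Image:
    """Render a binary file's bytes as a fixed-size RGB image."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    pad = (-len(data)) % (width * 3)                 # pad to complete RGB rows
    data = np.concatenate([data, np.zeros(pad, np.uint8)])
    img = data.reshape(-1, width, 3)                 # height follows file length
    return Image.fromarray(img, mode="RGB").resize(size, Image.BICUBIC)

# e.g. pe_to_rgb_image("sample.exe").save("sample.png")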
4.6. Statistical results

Deep learning models can be employed for feature computation and for final classification, and we conducted experiments from both perspectives. Firstly, we leverage the power of deep learning models for end-to-end malware detection. Secondly, deep features are extracted using deep learning models; the deep learning models efficiently extract deep features by reducing the overhead of feature engineering and the need for domain experts.
Table 16
A comparison of the proposed framework on a smaller dataset.
Sr. No. Model Epochs Batch size Valid loss Valid Acc. Test loss Test Acc.
1 VGG16
150 8 0.8283 84.15% 0.5161 88.19%
150 16 0.8176 84.07% 0.4513 88.76%
150 32 0.6356 84.40% 0.3665 89.71%
250 32 0.7702 84.28% 0.418 89.51%
150 64 0.8612 83.64% 0.3687 89.76%
250 64 0.6013 86.10% 0.3585 89.84%
2 ResNet50
150 8 43.87 42.77% 45.75 42.44%
70 8 47.27 42.86% 47.1 42.95%
150 16 9.015 39.65% 10.79 39.91%
150 32 49.6 42.90% 49.44 43.01%
3 InceptionV3
150 8 4.855 66.70% 3.79 69.82%
150 16 3.5 63.31% 2.462 70.04%
250 16 4.04 64.37% 3.23 68.89%
150 32 2.917 63.10% 2.391 66.60%
4 VGG19
150 8 0.8895 84.19% 0.3683 89.34%
150 16 0.774 85.93% 0.3878 89.44%
150 32 0.5388 85.46% 0.3769 88.73%
5 MobileNet
150 8 15.59 60.64% 11.92 63.70%
150 16 12.67 61.50% 7.18 67.10%
250 16 16.43 60.02% 11.41 62.14%
150 32 18.96 53.09% 12.55 54.92%
6 DenseNet169
150 8 6.65 72.95% 5.57 75.77%
150 16 5.92 72.61% 5.03 73.04%
150 32 3.82 73.37% 3.59 75.47%
250 32 3.94 72.95% 4.01 73.65%
7 Xception
150 8 7.6 60.19% 8.69 56.27%
150 16 7.23 58.75% 8.42 56.00%
150 32 5.42 59.13% 7.46 53.14%
8 DenseNet201
150 8 19.16 58.20% 21.06 59.31%
150 16 12.17 61.29% 12.2 61.64%
150 32 8.28 63.86% 8.26 64.00%
250 32 8.44 64.03% 6.99 68.32%
150 64 5.93 61.96% 5.15 66.23%
250 64 7.44 62.26% 7.9 63.70%
9 Inception ResNet V2
150 8 6.3 60.78% 7.4 56.88%
150 16 6.92 55.62% 9.28 51.58%
150 32 5.54 56.13% 7.19 51.95%
10 MobileNetV2
150 8 17.03 54.31% 16.03 52.97%
150 16 11.35 53.47% 11.26 54.18%
150 32 14.84 49.66% 16.22 49.82%
11 ResNet152V2
150 8 15.41 66.23% 14.06 67.74%
150 16 11.85 67.54% 9.69 70.24%
150 32 7.874 68.30% 7.15 69.50%
12 NasNetMobile
150 8 6.43 62.55% 5.44 64.34%
150 16 2.95 68.34% 3.24 67.64%
150 32 2.2 68.05% 1.84 69.57%
250 32 3.19 64.88% 2.67 66.83%
150 64 2.41 61.71% 2.11 62.21%
250 64 2.37 64.29% 1.73 69.33%
13 AlexNet
150 8 13.98 64.47% 11.49 63.32%
150 16 13.64 56.13% 13.08 64.21%
150 32 11.09 59.75% 10.96 62.98%
14 SqueezeNet
150 8 13.75 57.41% 13.38 58.29%
150 16 14.51 61.44% 11.27 62.55%
150 32 15.48 67.79% 12.08 64.47%
15 RegNetY320
150 8 6.83 71.17% 6.72 72.74%
150 16 6.08 72.86% 6.95 74.08%
150 32 5.72 73.58% 5.67 73.18%
16 CNN-SVM
150 8 0.7415 89.16% 0.3741 89.93%
150 16 0.8749 89.27% 0.3686 89.87%
150 32 0.4479 89.38% 0.3277 90.87%
17 Proposed detector 150 32 0.2177 91.85% 0.2974 93.43%
Table 17
A comparison of one-vs-all features image dataset (all metrics are in %, except epochs column).
Sr. No. Methods Dataset Model Epochs Accuracy Specificity Sensitivity FPR FNR
1 Al-Dujaili et al. (2018) VirusShare-API ANN 150 91.9 91.8 91.9 8.2 8.1
2 Proposed Method 1 VirusShare-APIImg VGG16 150 93.47 93.08 93.87 6.92 6.13
3 Proposed Method 2 VirusShare-APIImg Proposed framework 150 97.86 97.82 97.9 2.18 2.1
4 Proposed Method 3 VirusShare VGG16 150 94.02 94.23 93.81 5.77 6.19
5 Proposed Method 4 VirusShare Proposed framework 150 98.53 98.74 98.33 1.26 1.67
VGG16 performed better for end-to-end malware detection than the other fine-tuned deep learning models. In the second scenario, integrating RegNetY320 (as a feature extractor) with SVM outperformed the other deep learning models. VGG16 is not considered a state-of-the-art benchmark model for the ImageNet dataset; the latest deep learning models have outperformed VGG16 on ImageNet. In this research, the experiments are performed from different perspectives using the latest state-of-the-art deep learning models, both for feature extraction and for end-to-end malware detection. The results revealed a deterioration of performance with the latest models. These results are interesting in the context of malware detection, as the latest deep learning models outperformed VGG16 on the ImageNet dataset. We inferred from the results that the learned features of deep learning models, except VGG16, are less transferable. It is hard to give the exact rationale behind what actually makes CNN models more transferable; however, the results show that one CNN model might outperform other state-of-the-art models on a specific task. It would be valuable to evaluate the transferability of a CNN model for a specific task in future work. Nevertheless, performing experiments from various aspects confirmed the superiority of our proposed malware detection framework. We have conducted a non-parametric Wilcoxon signed-rank (WSR) test to show the statistical significance of our proposed framework.

Null hypothesis statistical testing (NHST) is a statistical technique for interpreting the results; it also ensures that the claim of better performance is backed up by statistical analysis. The WSR test is used to compare the locations of two populations using the same dataset or to locate a population based on sample data. The WSR test is the best alternative to the t-test when the population means are not of interest, for example, when we want to test whether there is a >50% chance that a sample from one population is better than the samples from other populations, or whether the median value of the population is non-zero. Secondly, the existing methods mentioned in Table 15 do not follow a Gaussian distribution. Thus, we have chosen the one-sample non-parametric WSR test to validate the significance against the theoretical and median values of the existing methods. The whiskers in Fig. 6 show the standard deviation of the different methods.

The null hypothesis (H0) is that our proposed malware detection framework does not have better detection effectiveness than the existing methods. The alternative hypothesis (Ha) is that the proposed malware detection framework has better detection effectiveness and performs differently. The test is performed to validate the p ≪ 0.0001 value using Table 15. 10-fold cross-validation is further used to avoid any biased effect of the data. The test validates whether the existing methods are significantly the same or different and whether our proposed framework significantly improved the detection effectiveness. The two-tailed p-value is less than the threshold α value (0.0001), which shows that the improvement of the proposed framework is highly statistically significant; the statistical test rejected the null hypothesis based on the results reported in Table 15. Our proposed framework has significantly improved the effectiveness of malware detection in various aspects. Table 18 summarises the comparison of the existing systems and the proposed approach based on various indicators. The existing systems include static analyses, dynamic analyses, and existing learning-based solutions.
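A minimal sketch of this one-sample WSR test is given below, applied to the Table 15 accuracies against the proposed framework's 99.06%; scipy.stats.wilcoxon tests whether the median of the differences is zero.

from scipy.stats import wilcoxon

existing = [98.08, 93.72, 97.48, 98.62, 98.88, 98.52, 94.80, 97.82,
            77.22, 97.00, 97.18, 96.30, 78.60, 80.50, 41.80, 79.50,
            96.08, 97.59, 95.94, 98.23]   # rows 1-20 of Table 15
proposed = 99.06

stat, p = wilcoxon([x - proposed for x in existing])  # two-sided by default
print(f"W={stat}, two-tailed p={p:.2e}")  # a small p rejects H0 above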
Time Complexity: The time complexity determines the time a model will take during training and testing; the greater the complexity, the more time the model will take. Deep learning-based malware detectors contain complex structures; thus, evaluating a model or making a quick prediction about a PE file is hard. The time complexity of a convolutional layer depends upon the size of the kernels, the size of the input feature maps, and the count of input and output channels. Given $D$ as the depth of the CNN model, $S_m$ as the size of the feature maps, $C_{n-1}$ and $C_n$ as the counts of input and output channels of convolutional layer $n$, and $S_k$ as the size of the kernels, Eq. (13) represents the time complexity of a CNN model:

$O\left(\sum_{n=1}^{D} S_m^2 \cdot S_k^2 \cdot C_{n-1} \cdot C_n\right)$  (13)
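Eq. (13) can be read directly as a per-layer accumulation; the sketch below computes it for an assumed list of convolutional layers (the example shapes are illustrative, not measurements of the paper's models).

def conv_time_complexity(layers):
    """layers: iterable of (S_m, S_k, C_in, C_out) per convolutional layer."""
    return sum(S_m**2 * S_k**2 * C_in * C_out
               for S_m, S_k, C_in, C_out in layers)

# e.g. the first two blocks of a VGG-style network at 224x224 input:
ops = conv_time_complexity([(224, 3, 3, 64), (224, 3, 64, 64),
                            (112, 3, 64, 128), (112, 3, 128, 128)])
print(f"~{ops / 1e9:.1f} G multiply-accumulates")  # constant factors aside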
Table 18
A comparative summary of existing approaches with the proposed approach.
Existing approaches Proposed approach
Method Learning-based solutions use deep learning for end-to-end Use deep learning for feature extraction and machine
classification. learning as the final detector.
Efficient Require a lot of training for end-to-end classification, Efficient as it uses a deep learning model for feature
extensive resources and slower in the case of dynamic extraction and eliminates the need for the virtual
analysis. environment.
Flexibility Can work with one model. Most existing learning-based Experimented with 15 deep learning and 12 machine
solutions use and experiment with one deep learning learning models. Flexible to adapt any deep learning
model. and/or machine learning model.
Adaptive Need specific approaches for each field (IoT, mobile malware). Adaptive to IoT and attacks on mobile devices; easily translates them into images.
Decoding Need to disassemble and decode the file for static analysis. Easily maps the whole file contents into binary and then into images.
Domain knowledge Need domain knowledge and experts. No need for domain experts.
Reverse engineering Yes No
Memory resources Does not require extensive memory resources as the Yes, to hold the extracted features using a deep learning
features are decoded into other formats. model.
Extensive virtual environment Require extensive resources in the case of dynamic Does not require any virtual environment or extensive
analysis. resources.
Features count Most of the time, consider one feature for detection, e.g., API or Opcode. Considers the whole file by converting the file binaries into an image for further processing.
Colour format Mostly grey images in the case of learning-based solutions. RGB colour images to provide richer features for malware detection.
Imbalanced data Not handled Handled through data augmentation
Limitations: The proposed framework also comes with a few limitations. The dimensions of the features extracted using deep learning models are high, and we need a significant amount of memory to hold those 1D features temporarily. In the future, we will explore the implementation of incremental learning, in which only an incremental subset of data is needed for training. Exploring various feature reduction and selection techniques is another open research direction. The malware and benign binaries are translated into images. The augmentation transformations act on the image data, but how those transformations translate back onto the portable executable is unclear; this means it is possible that even a small change to the malware image (e.g., rotating it 2 degrees) could represent a potentially benign portable executable. It would be optimal to analyse what portable executables such transformed images correspond to.

Future Works: We will explore the robustness of the proposed framework against adversarial attacks. Due to the breadth and depth of the analysis, other important metrics (such as the area under the curve and the F-measure) are overlooked in analysing the proposed framework; we will analyse the performance of the proposed framework against these metrics in the future. We will also analyse the effectiveness of the proposed framework for Android and IoT applications. The deep learning models are CNN-based, which comes with the limitation of requiring uniform-size images for training. We will explore a spatial pyramid pooling layer in the future that can take input of any size; this will enable us to utilise all the images in their original format without normalisation. Having stated the limitations, it is evident from the results that the proposed framework can be useful in real-world applications for detecting malicious PEs. The proposed approach is generic, flexible, and adaptive to other deep learning and machine learning models: any new deep learning or machine learning model can be investigated and integrated. In the defence industry, the proposed framework will be helpful in devising more efficient malware detection solutions.

5. Conclusions

This paper has proposed a framework for malware detection based on a hybrid deep learning and machine learning approach, as well as providing an in-depth analysis of various methods for malware detection. The proposed method combined deep learning with machine learning and did not need intensive feature engineering and domain knowledge. It consisted of three major steps: visualising a portable executable file as a colour image, extracting deep features from the colour image using a fine-tuned deep learning model, and detecting malware based on the deep features using support vector machines.

The performance of the proposed framework has been validated on 15 deep learning and 12 machine learning models. The proposed framework is scalable, cost-effective, and efficient: any deep learning and machine learning model can be used as the feature extractor and final detector, respectively. We found the best combination to be RegNetY320 as the feature extractor and SVM as the final detector; this combination outperformed the existing methods with an accuracy of 99.06%. Resultantly, we eliminated the need for knowledge from domain experts for reverse engineering tasks. Furthermore, the proposed framework uses deep learning, thus eliminating the high computation and time-consuming overhead of feature engineering. Lastly, the success of a learning-based detector highly depends upon the features on which the model was trained; the paper provides the detection effectiveness of various models on the dataset generated using one or all features for malware detection. Extensive experiments have demonstrated the superiority of our proposed framework over state-of-the-art approaches.

The statistical test was performed to validate the hypothesis that the proposed model generalises better. The validity of the proposed framework was validated over various learning-based approaches, and the results demonstrate convincing statistical evidence that the proposed framework performed better than other state-of-the-art approaches. There is a need for more advanced malware visualisation techniques in the future. It was apparent that reshaping and resizing images influenced the detection effectiveness of the learning-based models. We will explore designing visualisation techniques specifically for PE transformation (when dealing with real-world images, bicubic interpolation is usually used). We will also analyse the effect of other data resampling methods, such as generative adversarial networks (GANs), on the learning process. The proposed framework is applicable to all Windows PEs; however, in the future, we will explore the performance of the proposed framework on other mobile and IoT datasets. We will also explore whether feature reduction and selection techniques help to improve detection accuracy. Lastly, we will explore the impact of ensemble techniques as a final detector.
CRediT authorship contribution statement

Kamran Shaukat: Conceptualization, Data curation, Formal analysis, Investigation, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Suhuai Luo: Conceptualization, Investigation, Methodology, Project administration, Resources, Validation, Writing – review & editing. Vijay Varadharajan: Conceptualization, Investigation, Methodology, Project administration, Resources, Validation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

Agarap, A.F., 2017. Towards building an intelligent anti-malware system: a deep learning approach using support vector machine (SVM) for malware classification. arXiv preprint arXiv:1801.00318.
Agrawal, R., Srikant, R., 1995. Mining sequential patterns. In: Proceedings of the Eleventh International Conference on Data Engineering. IEEE, pp. 3–14.
Akram, Z., Majid, M., Habib, S., 2021. A systematic literature review: Usage of logistic regression for malware detection. In: 2021 International Conference on Innovative Computing. ICIC, IEEE, pp. 1–8.
Al-Dujaili, A., Huang, A., Hemberg, E., O'Reilly, U.-M., 2018. Adversarial deep learning for robust detection of binary encoded malware. In: 2018 IEEE Security and Privacy Workshops. SPW, IEEE, pp. 76–82.
Al-Hashmi, A.A., et al., 2022. Deep-ensemble and multifaceted behavioral malware variant detection model. IEEE Access 10, 42762–42777.
Anderson, B., Quist, D., Neil, J., Storlie, C., Lane, T., 2011. Graph-based malware detection using dynamic analysis. J. Comput. Virol. 7 (4), 247–258.
Arora, A., Garg, S., Peddoju, S.K., 2014. Malware detection using network traffic analysis in android based mobile devices. In: 2014 Eighth International Conference on Next Generation Mobile Apps, Services and Technologies. IEEE, pp. 66–71.
Awan, M.J., et al., 2021. Image-based malware classification using VGG19 network and spatial convolutional attention. Electronics 10 (19), 2444.
Bansal, M., Kumar, M., Sachdeva, M., Mittal, A., 2021. Transfer learning for image classification using VGG19: Caltech-101 image data set. J. Ambient Intell. Humaniz. Comput. 1–12.
Ben Abdel Ouahab, I., Bouhorma, M., Boudhir, A.A., El Aachak, L., 2019. Classification of grayscale malware images using the K-nearest neighbor algorithm. In: The Proceedings of the Third International Conference on Smart City Applications. Springer, pp. 1038–1050.
Bhodia, N., Prajapati, P., Di Troia, F., Stamp, M., 2019. Transfer learning for image-based malware classification. arXiv preprint arXiv:1903.11551.
Bouchaib, P., Bouhorma, M., 2021. Transfer learning and smote algorithm for image-based malware classification. In: Proceedings of the 4th International Conference on Networking, Information Systems & Security. pp. 1–6.
Cesare, S., Xiang, Y., Zhou, W., 2013. Control flow-based malware variant detection. IEEE Trans. Dependable Secure Comput. 11 (4), 307–317.
Chandio, A., et al., 2022. Precise single-stage detector. arXiv preprint arXiv:2210.04252.
Chen, L., 2018. Deep transfer learning for static malware classification. arXiv preprint arXiv:1812.07606.
Cogswell, M., Ahmed, F., Girshick, R., Zitnick, L., Batra, D., 2015. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297.
Cui, Z., Du, L., Wang, P., Cai, X., Zhang, W., 2019. Malicious code detection based on CNNs and multi-objective algorithm. J. Parallel Distrib. Comput. 129, 50–58.
Cunningham, P., Delany, S.J., 2021. K-nearest neighbour classifiers: a tutorial. ACM Comput. Surv. 54 (6), 1–25.
D'Angelo, G., Ficco, M., Palmieri, F., 2020. Malware detection in mobile environments based on autoencoders and API-images. J. Parallel Distrib. Comput. 137, 26–33.
De Paola, A., Gaglio, S., Re, G.L., Morana, M., 2018. A hybrid system for malware detection on big data. In: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, pp. 45–50.
El-Shafai, W., Almomani, I., AlKhayer, A., 2021. Visualized malware multi-classification framework using fine-tuned CNN-based transfer learning models. Appl. Sci. 11 (14), 6446.
Frank, E., Hall, M.A., 2011. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
Fu, Z., Ding, Y., Godfrey, M., 2021. An LSTM-based malware detection using transfer learning. J. Cybersecur. 3 (1), 11.
Fujino, A., Murakami, J., Mori, T., 2015. Discovering similar malware samples using api call topics. In: 2015 12th Annual IEEE Consumer Communications and Networking Conference. CCNC, IEEE, pp. 140–147.
Galen, C., Steele, R., 2021. Empirical measurement of performance maintenance of gradient boosted decision tree models for malware detection. In: 2021 International Conference on Artificial Intelligence in Information and Communication. ICAIIC, IEEE, pp. 193–198.
Gao, X., Hu, C., Shan, C., Liu, B., Niu, Z., Xie, H., 2020. Malware classification for the cloud via semi-supervised transfer learning. J. Inf. Secur. Appl. 55, 102661.
Gibert, D., Mateu, C., Planes, J., Vicens, R., 2019. Using convolutional neural networks for classification of malware represented as images. J. Comput. Virol. Hacking Tech. 15 (1), 15–28.
Gibert, D., Planes, J., Mateu, C., Le, Q., 2022. Fusing feature engineering and deep learning: A case study for malware classification. Expert Syst. Appl. 117957.
Guo, H., Cheng, H.K., Kelley, K., 2016. Impact of network structure on malware propagation: A growth curve perspective. J. Manage. Inf. Syst. 33 (1), 296–325.
Han, K., Lim, J.H., Im, E.G., 2013. Malware analysis method using visualization of binary files. In: Proceedings of the 2013 Research in Adaptive and Convergent Systems. pp. 317–321.
Hemalatha, J., Roseline, S.A., Geetha, S., Kadry, S., Damaševičius, R., 2021. An efficient densenet-based deep learning model for malware detection. Entropy 23 (3), 344.
Huda, S., Abawajy, J., Alazab, M., Abdollalihian, M., Islam, R., Yearwood, J., 2016. Hybrids of support vector machine wrapper and filter based framework for malware detection. Future Gener. Comput. Syst. 55, 376–390.
Huo, D., Li, X., Li, L., Gao, Y., Li, X., Yuan, J., 2022. The application of 1D-CNN in microsoft malware detection. In: 2022 7th International Conference on Big Data Analytics. ICBDA, IEEE, pp. 181–187.
Imran, M., Afzal, M.T., Qadir, M.A., 2015. Using hidden markov model for dynamic malware analysis: First impressions. In: 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery. FSKD, IEEE, pp. 816–821.
Jiang, B., Chen, S., Wang, B., Luo, B., 2022. MGLNN: Semi-supervised learning via multiple graph cooperative learning neural networks. Neural Netw. 153, 204–214.
Jiang, Y., Li, R., Tang, J., Davanian, A., Yin, H., 2020. Aomdroid: Detecting obfuscation variants of android malware using transfer learning. In: International Conference on Security and Privacy in Communication Systems. Springer, pp. 242–253.
Kadri, M.A., Nassar, M., Safa, H., 2019. Transfer learning for malware multi-classification. In: Proceedings of the 23rd International Database Applications & Engineering Symposium. pp. 1–7.
Kim, S., 2018. PE header analysis for malware detection.
Kolter, J.Z., Maloof, M.A., 2006. Learning to detect and classify malicious executables in the wild. J. Mach. Learn. Res. 7 (12).
Kumar, S., 2021. MCFT-CNN: Malware classification with fine-tune convolution neural networks using traditional and transfer learning in internet of things. Future Gener. Comput. Syst. 125, 334–351.
Kumar, R., Subbiah, G., 2022. Zero-day malware detection and effective malware analysis using Shapley ensemble boosting and bagging approach. Sensors 22 (7), 2798.
Lad, S.S., Adamuthe, A.C., 2020. Malware classification with improved convolutional neural network model. Int. J. Comput. Netw. Inf. Secur. 12, 30–43.
Li, N., Zhang, Z., Che, X., Guo, Z., Cai, J., 2021. A survey on feature extraction methods of heuristic malware detection. J. Phys. Conf. Ser. 1757 (1), 012071. IOP Publishing.
LIEF - library to instrument executable formats - quarkslab, 2021. https://lief.quarkslab.com/ (accessed April 05, 2021).
Lo, W.W., Yang, X., Wang, Y., 2019. An xception convolutional neural network for malware classification with transfer learning. In: 2019 10th IFIP International Conference on New Technologies, Mobility and Security. NTMS, IEEE, pp. 1–5.
Luo, J.-S., Lo, D.C.-T., 2017. Binary malware image classification using machine learning with local binary pattern. In: 2017 IEEE International Conference on Big Data (Big Data). IEEE, pp. 4664–4667.
Makandar, A., Patrot, A., 2017. Malware class recognition using image processing techniques. In: 2017 International Conference on Data Management, Analytics and Innovation. ICDMAI, IEEE, pp. 76–80.
Marastoni, N., Giacobazzi, R., Dalla Preda, M., 2021. Data augmentation and transfer learning to classify malware images in a deep learning context. J. Comput. Virol. Hacking Tech. 17 (4), 279–297.
Martín, A., Lara-Cabrera, R., Camacho, D., 2019. Android malware detection through hybrid features fusion and ensemble classifiers: the AndroPyTool framework and the OmniDroid dataset. Inf. Fusion 52, 128–142.
Maulana, P., Heryanto, A., Oklilas, A.F., 2022. Klasifikasi Malware Adware Pada Android Menggunakan Metode Support Vektor Machine (SVM) Dan Linear Discriminant Analysis (LDA) [Classification of Adware malware on Android using the Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA) methods]. Sriwijaya University.
Microsoft. Microsoft Malware Classification Challenge (BIG 2015) [Online]. Available: https://www.kaggle.com/c/malware-classification/data.
Naeem, H., Guo, B., Naeem, M.R., Ullah, F., Aldabbas, H., Javed, M.S., 2019. Identification of malicious code variants based on image visualization. Comput. Electr. Eng. 76, 225–237.
Nahmias, D., Cohen, A., Nissim, N., Elovici, Y., 2020. Deep feature transfer learning for trusted and automated malware signature generation in private cloud environments. Neural Netw. 124, 243–257.
Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B.S., 2011. Malware images: visualization and automatic classification. In: Proceedings of the 8th International Symposium on Visualization for Cyber Security. pp. 1–7.
Oliva, A., Torralba, A., 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 42 (3), 145–175.
Prajapati, P., Stamp, M., 2021. An empirical analysis of image-based learning techniques for malware classification. In: Malware Analysis using Artificial Intelligence and Deep Learning. Springer, pp. 411–435.
Prima, B., Bouhorma, M., 2020. Using transfer learning for malware classification. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 44, 343–349.
Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P., 2020. Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10428–10436.
Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., Nicholas, C.K., 2018. Malware detection by eating a whole exe. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
Rezende, E., Ruppert, G., Carvalho, T., Ramos, F., De Geus, P., 2017. Malicious software classification using transfer learning of resnet-50 deep neural network. In: 2017 16th IEEE International Conference on Machine Learning and Applications. ICMLA, IEEE, pp. 1011–1014.
Rong, C., Gou, G., Cui, M., Xiong, G., Li, Z., Guo, L., 2020. TransNet: Unseen malware variants detection using deep transfer learning. In: International Conference on Security and Privacy in Communication Systems. Springer, pp. 84–101.
Roseline, S.A., Sasisri, A., Geetha, S., Balasubramanian, C., 2019. Towards efficient malware detection and classification using multilayered random forest ensemble technique. In: 2019 International Carnahan Conference on Security Technology. ICCST, IEEE, pp. 1–6.
Rosenberg, I., Sicard, G., David, E.O., 2018. End-to-end deep neural networks and transfer learning for automatic analysis of nation-state malware. Entropy 20 (5), 390.
Ross, Q.J., 1993. C4.5: Programs for Machine Learning. San Mateo, CA.
Roy, A.M., Bhaduri, J., Kumar, T., Raj, K., 2022. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecol. Inform. 101919.
Schölkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.C., 2000. Support vector method for novelty detection. Adv. Neural Inf. Process. Syst. 582–588.
Schultz, M.G., Eskin, E., Zadok, F., Stolfo, S.J., 2000. Data mining methods for detection of new malicious executables. In: Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001. IEEE, pp. 38–49.
Searles, R., et al., 2017. Parallelization of machine learning applied to call graphs of binaries for malware detection. In: 2017 25th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. PDP, IEEE, pp. 69–77.
Shabtai, A., Moskovitch, R., Feher, C., Dolev, S., Elovici, Y., 2012. Detecting unknown malicious code by applying classification techniques on opcode patterns. Secur. Inform. 1 (1), 1–22.
Shaid, S.Z.M., Maarof, M.A., 2014. Malware behaviour visualization. J. Teknol. 70 (5).
Shaukat, K., Luo, S., Chen, S., Liu, D., 2020a. Cyber threat detection using machine learning techniques: A performance evaluation perspective. In: 2020 International Conference on Cyber Warfare and Security. ICCWS, IEEE, pp. 1–6.
Shaukat, K., Luo, S., Varadharajan, V., 2022. A novel method for improving the robustness of deep learning-based malware detectors against adversarial attacks. Eng. Appl. Artif. Intell. 116, 105461.
Shaukat, K., Luo, S., Varadharajan, V., Hameed, I.A., Xu, M., 2020b. A survey on machine learning techniques for cyber security in the last decade. IEEE Access 8, 222310–222354.
Shaukat, K., et al., 2020c. Performance comparison and current challenges of using machine learning techniques in cybersecurity. Energies 13 (10), 2509.
Singh, A., Handa, A., Kumar, N., Shukla, S.K., 2019. Malware classification using image representation. In: International Symposium on Cyber Security Cryptography and Machine Learning. Springer, pp. 75–92.
Tang, Y., 2013. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239.
Vasan, D., Alazab, M., Wassan, S., Naeem, H., Safaei, B., Zheng, Q., 2020a. IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture. Comput. Netw. 171, 107138.
Vasan, D., Alazab, M., Wassan, S., Safaei, B., Zheng, Q., 2020b. Image-based malware classification using ensemble of CNN architectures (IMCEC). Comput. Secur. 92, 101748.
Vinayakumar, R., Alazab, M., Soman, K., Poornachandran, P., Venkatraman, S., 2019. Robust intelligent malware detection using deep learning. IEEE Access 7, 46717–46738.
VirusShare. https://virusshare.com/ (accessed May 24, 2022).
Wang, W., Zhao, M., Wang, J., 2019. Effective android malware detection with a hybrid model based on deep autoencoder and convolutional neural network. J. Ambient Intell. Humaniz. Comput. 10 (8), 3035–3043.
Zhao, Y., Cui, W., Geng, S., Bo, B., Feng, Y., Zhang, W., 2020. A malware detection method of code texture visualization based on an improved faster RCNN combining transfer learning. IEEE Access 8, 166630–166641.