A Review of Deep Learning Based Malware Detection Techniques
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Survey paper
ARTICLE INFO
Keywords: Malware detection; Deep learning; Malware datasets
ABSTRACT
With the popularization of computer technology, the amount of malware has increased dramatically in recent years. Some malware threatens users’ network security once it is downloaded and installed, and can even spread widely on the Internet, causing consequences such as private data leakage from the operating system, extortion, and network paralysis. To deal with these threats, researchers analyze malicious samples through various analysis techniques, which are usually divided into static and dynamic analysis according to whether the code needs to be executed. This paper analyzes in detail several classical feature extraction methods used in malware detection. With the development of artificial intelligence, deep learning is gradually being introduced into malware detection; it does not require identification by professional security personnel and greatly improves the generalization ability of detection. The paper reviews text-based, image visualization-based, and graph structure-based detection techniques according to their feature extraction methods. In addition, it compares 26 datasets that have been commonly used in the research field in recent years and explains their main contents and specifications. Finally, a summary and outlook of the malware research field is given.
1. Introduction
With the development of the Internet, computers have been fully integrated into people’s daily lives, the amount of malware has grown substantially, and the technology behind it has matured. Malware has become one of the main threats to computer security. Hackers take advantage of vulnerabilities in a computer to steal data and information, to obtain information without the user’s permission, or even to paralyze the network. Hundreds of millions of electronic devices have been or are being attacked by malware, and the financial losses caused globally are increasing dramatically. As the technology of criminal gangs has improved, the types of malware have multiplied, taking many different forms, becoming more varied, and posing a greater threat to computers. In the early days, hackers wrote simple code that was easily detected. Nowadays, they mix many different types of malware [1], so that the same malware presents the characteristics of multiple categories and becomes harder to detect. The focus of malware detection is to discover how a malicious file, URL or software behaves on a computer and what it was created for. Therefore, identifying malware is the key to network security defense.
In 1986, the "Brain" malware was developed, which had a very high propagation rate and damaged thousands of computer systems. With the development of technology, there are thousands of new malware infections every day, and compared with early malware they are more targeted. In the early days, the traditional detection method was based on feature codes: signature features were generated in a specific way [2], and each new sample had to be matched against the signature library. Maintaining the signature library consumes a lot of manpower, and the method cannot detect new generations of malware, so traditional detection can no longer keep up with malware detection needs. Nowadays, to cope with this kind of attack, researchers use machine learning and deep learning model architectures [10]. These techniques can automatically learn the characteristics of malicious code, mine complex structure in high-dimensional data, and make full use of the deep connections within the data. With the help of deep learning, researchers have made a great breakthrough in accuracy.
Nowadays, deep learning has replaced traditional detection methods as a hot topic in malware classification. Deep neural networks can simulate the human brain for learning, so scholars have applied neural
* Corresponding author.
E-mail address: [email protected] (H. Wang).
https://doi.org/10.1016/j.neucom.2024.128010
Received 2 November 2023; Received in revised form 29 May 2024; Accepted 7 June 2024
Available online 15 June 2024
0925-2312/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
H. Wang et al. Neurocomputing 598 (2024) 128010
network models to malware classification problems [3,4]. In 2019, Wang et al. [5] used a hybrid model of a deep autoencoder (DAE) and a convolutional neural network (CNN) to reconstruct high-dimensional features in samples. The DAE was used as a pre-training method for the CNN, and experiments were conducted on 10000 benign applications and 13000 malicious applications using this combination. Compared to traditional machine learning detection methods, the DAE-CNN model has lower time complexity and an accuracy improvement of 5 % over SVM models. In 2022, to address the issue of insufficient features in sample data, Zhang Yang et al. [6] combined the attention mechanism with the residual network (ResNet) model to establish the ARMD model. Their experimental data came from 47580 different types of hash values in Kaggle, and the final experiment showed that the accuracy of ARMD reached 97.76 %. Shen et al. [7] proposed a feature fusion-based detection method that combines a dual attention mechanism and a bidirectional long short-term memory (BiLSTM) network. Malicious code is converted into grayscale images, a dual attention mechanism extracts local texture features from the grayscale images, and the BiLSTM model extracts global features from them. Finally, the local features are fused with the global features, and the results showed that the feature fusion method did indeed have better performance. Sachith [8] designed a self-supervised model called SHERLOCK based on the Vision Transformer (ViT) network. The model uses a Masked Auto Encoder (MAE) that takes masked images as input, learns features from the images, and uses self-supervised learning to reconstruct the original images. This method effectively reproduces the trained model for three different classification tasks with minimal computational cost. These three tasks correspond to three types of labels in the MalNet dataset: malware/benign (2 classes), malware types (47 categories) such as adware, virus, trojan, etc., and malware families (696 categories) such as dhyvax, dxshbq, etc. The dataset contains 1.2 million images. This type of self-supervised learning can achieve a binary classification accuracy of 97 % for malware, but it is only an improvement based on static analysis. The network process is shown in Fig. 1.
This article categorizes malware detection techniques into three categories: text-based malware detection, image-based malware detection, and graph structure-based malware detection [9]. Text-based detection directly uses the text features of malware samples as model input and classifies malicious families based on the semantic information in the text. Image-based detection [10] typically converts binary files of malicious code into grayscale or RGB images and classifies them through differences in image texture features. The graph structure-based method extracts information from malware binaries and converts it into graph structures with edges and nodes to achieve classification [11]. Because the two crucial steps in malware detection are feature extraction and the construction of deep learning models, the article focuses on these two parts in Sections 3 and 4 based on the three classification methods mentioned earlier.
This article provides an overview of deep learning based malware detection techniques, investigating the evolution and research status of malware detection methods. The main contributions of this article are as follows:
• It explains the relevant technologies and methods of malware detection, including the development history of malware, malware countermeasure technology, malware detection technology, malware analysis technology, etc.
• It summarizes and compares 26 malware datasets.
• It summarizes the research on deep learning-based malware detection technology over the past five years, divided into three categories for discussion based on their feature extraction methods.
• It discusses the challenges currently facing the field and offers an outlook.
2. Malware classification and related technologies
2.1. Type of malware
Malicious software (malware) is a collective term for all software or programs that continuously attack tens of thousands of computer systems in order to destroy computer resources, obtain economic benefits, steal private and confidential data, and use computer resources. Malware can be classified into viruses, Trojans, worms, ransomware, adware, rootkits, spyware, etc. based on behavior and execution methods [13]. Before 1999, malware mainly appeared in its original form, with a single function and limited impact. From 2000 to 2010, the Internet became the main transmission mechanism of malware, and propagation speed increased. At the same time, variants of malware continued to emerge, their destructive power gradually increased, and the number of malware samples and toolkits grew rapidly. Email worms and SQL injection attacks became mainstream. Since 2010, malware teams have become more collaborative, and malware functions have become increasingly complex. The goals of malware have also become more diverse, including stealing personal information, damaging systems, engaging in phishing, encrypting files to demand ransom, and so on. Malicious software during this period is characterized by strong persistence and resistance. Table 1 presents the typical evolution of malware functionality over time.
2.2. Malware detection process
Due to the coding characteristics of malicious code and the
Table 1
Types and evolution of malware.
Time | Representative sample | Type of software | Purpose or effect | Sample functional technical characteristics
limitations of relevant domain knowledge, experience, and coding tools possessed by different hacker teams, malicious code from the same author or family often has similarities in content and structure. This similarity is mainly manifested in malware functions, coding styles, and code logic structures. As malware development has progressed, the coding of malicious families or authors has become relatively mature, and they usually exhibit more advanced and covert attack behaviors, which increases the difficulty of malware detection. To better classify malicious software, researchers need to understand its execution purpose, data characteristics, and differences from benign software.
Fig. 2 shows the five steps of the malware detection experiment. Firstly, data collection aims to collect both benign and malicious datasets. The more experimental datasets there are, the more convincing the results are. Secondly, feature extraction involves extracting features from APK or PE files through supporting tools. This process requires overcoming the adversarial measures of malicious code and revealing the sequence features, code structure features, image features, permission features, and resource usage features of the malicious software. Next, feature processing follows on from feature extraction, for example by applying regular expression transformation and N-gram sequence processing to sequence features. For graph structure features, redundancy should be removed, and code structures related to malicious operations should be retained. Finally, deep learning networks are used to select and build models [14]. The feature extraction and network model design in this process are crucial to detection. These two parts are detailed in Sections 3 and 4.
2.3. Malware-related technologies
2.3.1. Malware countermeasure technology
The purpose of malware countermeasure technology is to bypass the analysis of security personnel and the detection technology of antivirus engines. This technology has resulted in a large number of variants of malicious code. According to Symantec statistics [15], in 2012 there were an average of 38 variants per mobile malware family. In 2013, the number of variants in each family reached 58. In 2016, Symantec [16] discovered 407 million malicious codes, including 357 million malicious code variants. These data demonstrate that the scale of mobile platform malware variants is increasing year by year.
At the same time, in order to evade detection by mobile platform defense systems, the amount of malicious software constructed using obfuscation technology has also been increasing year by year. Reference [17] points out that nearly 25 % of legitimate applications in Google Play are obfuscated, and 90 % of malicious code uses obfuscation techniques [18]. In 2017, 24 research institutions in the United States [19] conducted traceability studies on Advanced Persistent Threat (APT) attacks, publishing up to 47 related research reports. Hackers have taken countermeasures at multiple levels, including code and attack behavior, in order to avoid detection as much as possible. Common adversarial techniques include packing (shell addition), obfuscation, deformation, and anti-sandbox techniques. Among them, packing uses encryption algorithms to compress malicious code, so that antivirus software cannot read the entire code file and some functions cannot be obtained through decompilation. Meanwhile, this technique does not affect the direct operation of the program.
In 2017, to address the issue of imbalanced detection data caused by packing techniques, Guan et al. [20] used class imbalanced learning (CIL) methods. They used oversampling algorithms to generate minority-class samples to achieve data balance. In 2022, Liu et al. [21] designed a simulation-based system called UBER to harden malware analysis sandboxes against the anti-sandbox countermeasures used by malware. This technology uses automatically derived user profile models to generate realistic system artifacts. Experiments have shown that UBER can effectively mitigate anti-sandbox techniques.
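The oversampling idea attributed to Guan et al. [20] in this section can be illustrated with a minimal sketch. The text does not say which oversampling algorithm they used, so the code below assumes the simplest variant, random oversampling with replacement. The function name `random_oversample`, the two-dimensional toy features, and the "packed"/"unpacked" labels are hypothetical and only serve the illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class.

    Assumption: plain random oversampling with replacement, not
    necessarily the algorithm used in [20].
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples that belong to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Append copies until the class reaches the target size.
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: four unpacked samples and one packed sample.
X = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6], [0.9, 0.1]]
y = ["unpacked", "unpacked", "unpacked", "unpacked", "packed"]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Duplicating samples balances the class counts but adds no new information, which is why synthetic oversampling methods are often preferred in practice; whether that applies to [20] is not stated in the text.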
Obfuscation techniques reduce the readability of code by renaming variables, altering control flow, inserting garbage code, and making indirect calls based on the original code. Deformation technology transforms the source code during the development phase to defeat detection based on feature-code matching. For example, if an attacker adds a few specific subtle perturbations to the original software, some malware detection schemes will often produce incorrect results.
In 2022, Enmin Zhu et al. [22] proposed the N-gram MalGAN method in response to this problem. It directly extracts the hexadecimal bytecode of executable files, so the features of the model are functionally independent of the executable file. N-gram MalGAN adds them to the non-functional area of malicious programs to maintain their original executability, while also making adversarial attacks easier.
2.3.2. Malware analysis techniques
Usually, deep learning techniques cannot be directly applied to binary files. To better extract features, these files require some transformation. Malware analysis techniques are usually adopted to identify the behavior, characteristics, and functions of malicious software. According to whether malicious code samples need to be run, malware analysis techniques can be divided into static analysis, dynamic analysis, and hybrid analysis, with their characteristics shown in Fig. 3.
2.4. Static analysis
Static analysis refers to the technique of obtaining static information without running an application. This technology is based on disassembly, decompilation, and reverse analysis of the code. This also means that in this analysis, we do not need to observe the internal structure of the software or gain access to the system. Static analysis determines the execution characteristics of code based on its underlying execution semantics, and cannot handle complex code samples or techniques such as packing, obfuscation, deformation, and anti-sandbox [23]. Common feature codes in static analysis include Dalvik opcodes, API calls, PE header files, binary assembly instructions, strings, control flow diagrams, and data flow diagrams. In text-based malware detection, API call sequences and other features are often used as the main features. Usually, API call sequences are extracted indirectly using disassemblers (JEB, IDA, Apktool, etc.). These convert the machine language code of the program back into assembly language code, making the control flow and data flow of the program visible. Next, the function call relationships and data flow in the code are analyzed, paying special attention to the parts that involve API calls. Once the API calls are identified, construction of the API call sequence can begin. Finally, the extracted API call sequence is analyzed and applied. The quality of API feature sequence extraction directly affects the detection results. In 2023, G Balikcioglu [24] proposed that different disassembly tools would affect the extraction of feature sequences and therefore detection performance. They therefore studied the impact of three disassembly tools (JEB, IDA, and Apktool) on detection results, testing them with an improved LSTM model based on RNN. According to their experimental results, Apktool performs better than the other two disassemblers. The accuracy of Apktool on the BSM, MSM, and CSM datasets is 81.9 %, 88.8 %, and 91.4 %, respectively, which is significantly higher than that of the other disassemblers.
Generally speaking, static detection includes analyzing the characteristics of feature codes and matching them against the feature library to calculate the weights of operation codes [12], using machine learning models for classification, or extracting malware images and classifying them with deep learning networks. In 2019, Fang et al. [25] constructed a reinforcement learning based DQEAF framework, built on static analysis techniques, to evade anti-malware engines. The proposed DQEAF has a high success rate on PE samples and demonstrates good robustness. In 2021, Chen [26] developed a transformer-based MSDT model to detect the injection of real code into source code packages. MSDT is a new type of static analysis method based on deep learning. Compared with traditional static analysis methods, the model’s processing object changes from a file to a function, which improves the detection effect. The final experimental accuracy reaches 0.909.
2.5. Dynamic analysis
Monitoring the behavior of an application during runtime is called dynamic analysis. During dynamic analysis, the effects of code execution can be observed, such as changes to user logs, updates to the registry, deletion and addition of system files, etc. The behavioral characteristics of the code are collected and converted into features, which are then used in deep learning models. Because dynamic analysis only covers the behavior that is actually executed, it cannot fully capture the overall characteristics of the code. In addition, dynamic analysis often requires running malicious software in virtual environments such as sandboxes, VMware software, AndroPyTool, MobSF, etc. to avoid executing malicious programs directly on the local machine and harming the operating system.
In 2020, Jeon et al. [27] proposed a Dynamic Analysis Based Detection (DAIMD) method to protect IoT devices from malware infection. They constructed a cloud-based nested virtual environment. To prevent malicious software from spreading to major networks on the Internet of Things, the researchers developed an embedded Linux system with Advanced RISC Machines (ARM) processors using embedded software development verification solutions for virtual machines. Using CNN models to dynamically analyze IoT malware, this technique can accurately detect various new and variant malware in the network. Yang [28] used dynamic analysis methods to run malicious code in a sandbox and obtain API call sequences. The API call sequence was abstracted as a data flow graph with attributes, and a graph convolutional neural network was used to learn the attribute data flow graph, ultimately achieving an
accuracy of 0.9679. Acarturk [29] used dynamic analysis to mine the running trajectories of portable executable files and trained an LSTM network model on them. The final accuracy reached 0.9926.
2.6. Hybrid analysis
Static analysis methods can quickly and accurately detect known malicious code, but struggle with samples that use techniques such as packing and obfuscation. Dynamic analysis requires a large amount of computation and has poor scalability. Hybrid detection uses static analysis to identify static structural features and dynamic analysis to identify dynamic behavioral features, combining the advantages of both. In the analysis of the behavioral characteristics of spyware, for example, dynamic analysis is first used to monitor the interaction between components and browsers to determine the code area, and then static analysis is used to check the code area, identify system call information, etc. to detect malicious code. In addition, this analysis method requires adapting the combination to different code families. This increases complexity and workload, so it may not be applicable to large-scale detection.
In 2019, Xue et al. [30] proposed a probability scoring classification system called Malcore based on this method, which is divided into two detection stages: static analysis (Stage 1) and dynamic analysis (Stage 2). The feature in Stage 1 is the grayscale image, analyzed by a convolutional neural network with spatial pyramid pooling, while the feature in Stage 2 is the API call sequence, analyzed using n-gram technology. Finally, the two stages are merged. The article conducted experiments on 174607 malware samples from 63 families, achieving an accuracy of 98.82 %.
In contrast, detection techniques based on static analysis are more accurate for non-obfuscated samples, with relatively low overhead and faster detection speed. Their disadvantage is that malware using obfuscation techniques cannot be fully analyzed through static analysis. Detection based on dynamic analysis performs better on obfuscated malware, but its analysis and detection time is long and its computational cost is high. Detection based on hybrid analysis combines the advantages of both, making it more comprehensive in application analysis. However, it also occupies more system resources and has high computational costs.
3. Feature extraction
Data feature extraction is an important step in the detection process, which uses data mining techniques to extract feature information from malicious code samples. It is mainly described in three aspects: text, image and graph structure. The main process of feature extraction is shown in Fig. 4 (taking text extraction as an example).
3.1. Text-based feature extraction
Text-based feature extraction usually refers to detecting malicious software by directly analyzing its code or related text information. Firstly, feature information is extracted from the malicious software and usually converted into binary file form. Then techniques such as n-gram are used to extract n consecutive words or characters from the text as features. Finally, the extracted feature set is processed using classification methods such as machine learning or deep learning. The following are several commonly used text feature extraction methods:
3.1.1. N-gram based extraction
N-gram technology is an algorithm based on statistical language modeling and the Markov assumption that divides system call sequences or API call sequences into contiguous windows. Usually, it sets a sliding window of length n to divide the text into many segments, and the number of elements contained in each segment is determined by n. After each segment is taken, the window continues to slide along the sequence. Finally, the frequency of each segment is counted to form a gram list, where each gram segment represents a feature vector [31]. Because the behavioral features created by this method are extracted directly from the samples, there is no need for identification by security experts.
In 2022, Zhu et al. [32] borrowed the idea of n-gram to extend the feature sources of adversarial malware examples in the model. Using n-gram technology, the model can directly obtain feature vectors from the hexadecimal byte code of executable files without requiring any prior knowledge of executable files or any professional feature extraction tools, significantly simplifying the model. Liu et al. [33] combined the assembly file instruction features extracted by n-grams with the texture features extracted from grayscale images of binary files, and fed this new feature into a classifier to achieve malware classification. This fusion further improved accuracy compared to a single feature. In 2023, Choi et al. [34] proposed the efficient SR network NGswin using two trajectories in the n-gram context. They applied SwinTransformer to single image super-resolution (SR), solving the problem of ignoring wide areas due to limited receptive fields when processing high-resolution images.
3.1.2. One-hot model
Word vector models combined with deep learning networks can also be used to classify malware [35], where word vector models convert unstructured word sample data into structured spatial vectors, and then classify malware through neural network models in the field of deep
learning. The simplest and most commonly used text feature representation is one-hot coding. Since the distance and similarity of features are crucial in classification algorithms, one-hot coding maps discrete features into Euclidean space, and the transformed result is usually a combination of 0s and 1s. This method serves to expand the features, but it cannot reflect the sequential features of text, cannot be combined with contextual analysis, and imposes too great a computational burden when the data volume is huge, so one-hot coding is usually used together with some dimensionality reduction method.
In 2021, Markus Ring et al. [36] used Windows audit logs as a target to detect malware. Windows audit logs are sequential textual data, and neural networks can only handle numerical data. Therefore, the text features are converted into sequential vectors using one-hot encoding and different embedding representations, and an LSTM is used to capture the sequential effect. In analyzing malware, Zhou et al. [37] used parameters such as the frequency of individual API calls over the entire sample as statistical features. These include the parameters of API calls in each thread and the TF-IDF values of API sequence fragments of length 1–4. Finally, one-hot coding is used to construct the feature vector and synthesize the feature matrix to improve the accuracy of the model.
3.1.3. Word2vec model
Word2vec comes from a 2013 paper [38]. This widely used tool was developed by Google. It is a lightweight network model that inputs the n words before and after a word into a neural network, which is better than traditional embedding, and this distributed word vector technique preserves the semantics of the context, which further improves the accuracy of the final classification result [39]. Word2Vec consists of two types of network structures: the Skip-gram model uses input words to predict the context, while the Continuous Bag-of-Words (CBOW) model takes the context of the word as input to predict the word itself. In contrast, the Skip-gram model has a shorter training time, and the CBOW model has a higher prediction accuracy [40].
In 2019, Jungho et al. [41] combined word2vec with the LSTM method, extracting opcodes and API function names from assembly sources. Word2vec was used to vectorize them into lower-dimensional vectors, which were input into a long short-term memory network. Compared to one-hot encoding based methods, it shows a performance improvement of about 0.5 %. In 2020, the authors of [42] proposed an efficient Recurrent Neural Network (RNN) for detecting malware. For the hyperparameters of the model, three different feature vectors were selected to measure RNN performance, namely one-hot encoding, random feature vectors, and Word2Vec feature vectors. The experiments show that the RNN using the feature vectors from the Skip-gram architecture of the Word2Vec model is most suitable for malware detection and has high performance and stability.
3.2. Feature extraction based on graph structure
Common feature extraction techniques also include graph structure-based feature extraction [43], which can obtain more syntactic and semantic information. Even if malware is packed and obfuscated when it is written, this detection method can still trace malicious features based on semantics. Because program graphs can capture more comprehensive behavioral semantic information, many researchers combine graph structures as important features with neural networks in deep learning to analyze and detect malicious software. In the graph structure, the code structure is converted into a graph G(V, E), where V is the set of nodes, each representing a function call, and E is the set of edges, indicating the call relationships. As the number of samples increases, the graph structure will generate a large number of subgraphs. The definition of subgraphs requires a significant amount of time, while the graph structure of malicious programs does not require a large number of nodes and edges, making it possible to detect malicious or benign behavior. Malicious programs express execution semantics through the logical structure of their code, which can be represented by the Function Call Graph (FCG), Control Flow Graph (CFG), and Program Dependency Graph (PDG).
CFG is a flowchart with instructions as atoms, which can describe the code logic of malicious programs at a fine-grained level, including trigger conditions, API calls, method calls, result returns, etc. [44]. At the same time, CFG can comprehensively cover the execution process involved in the code and more fully reflect the logical structure of malicious code. By revealing the control flow and data flow of malicious software, CFG can more accurately identify the behavior patterns of malicious software, detect abnormal control flow, and analyze the execution process of malicious software.
PDG is a program dependency graph, whose code structure is centered around data flow. PDG can reveal the data dependency relationships in malicious software, discover the taint propagation path of malicious programs, and further accurately locate the execution range of malicious programs [45]. Therefore, it is very helpful for discovering potential malicious behavior and data leakage issues.
FCG is a graph structure constructed based on API calls, which better expresses the original information of a program than a sequence set composed of function calls [46]. In addition, FCG is also better at understanding the functionality and behavior patterns of malware, as well
H. Wang et al. Neurocomputing 598 (2024) 128010
vectors, and used a text convolutional neural network (TextCNN) to identify malicious families. The model was evaluated by classifying malicious code families on the Microsoft Challenge dataset and the SOREL-20M dataset. The experimental results show that the method has good accuracy, with accuracy rates of 98.66 % and 93.46 %, respectively. In addition, the average time to identify malicious code is 0.04 seconds.
Due to its strong representational learning ability, strong robustness, and good parallel computing performance in text processing, the Long Short-Term Memory network (LSTM) has many advantages.
Therefore, many researchers detect malicious software by stacking multiple LSTM layers or combining them with other network models. In 2020, Catak [54] ran malware in a sandbox and collected API call sequences from Windows to form a dataset containing 7071 types. He constructed a network model using LSTM and, by comparing single-layer LSTM with double-layer LSTM, demonstrated that the complexity of the model did not affect performance. At the same time, eight different categories of multi-classification models were created, with classification accuracy ranging from 0.835 to 0.985. In 2021, Bae and Lee [55] utilized easy data augmentation (EDA) techniques to generate artificial training data and enhanced an LSTM-based malware detection classifier in order to improve detection accuracy. Compared to RNN, LSTM has an accuracy improvement of 1.76 %. In 2022, Li et al. [56] used the Bi-LSTM model to detect the relationships between APIs. By fully exploring the internal features of the API sequence and using the API categories, actions, and operation objects as factors to express semantic information, an accuracy of 0.9731 was ultimately achieved on the dataset. Liu [57] extracted the API sequence of malicious code and converted it into input for word embedding models. In the classification model, the author introduced a temporal pattern attention (TPA) mechanism into a bidirectional long short-term memory network. Compared with the 1D-CNN DenseNet model based on spatial features, the TPA-BiLSTM model based on temporal features has higher accuracy in the field of malicious code detection. Deniz [58] combined Stacked BiLSTM (stacked Bidirectional Long Short-Term Memory network) with a Generative Pre-trained Transformer (GPT-2) language model. Assembly instructions were extracted from the .text section of malicious and benign portable executable (PE) files, and three datasets were created based on these instructions. The first dataset consists of multiple files that are input into the stacked BiLSTM document level analysis model (DLAM). The second dataset consists of sentences and serves a sentence level analysis model (SLAM) based on stacked BiLSTM and DistilBERT, a domain-specific language model DSLM-GPT2, and a general language model GLM-GPT2. The third dataset combines all unlabeled assembly instructions and inputs them into a custom pre-trained model. The article indicates that the accuracy of the DLAM, DistilBERT, DSLM-GPT2, and GLM-GPT2 experiments is 98.3 %, 70.4 %, 86.0 %, and 76.2 %, respectively. Meanwhile, the pre-trained model improved the detection performance of DSLM-GPT2 and GLM-GPT2.
In addition to improving CNN and LSTM models, researchers have also combined many other deep learning models for classification.
In 2020, Chen et al. [59] conducted research on adversarial sample generation methods and utilized deep reinforcement learning to generate malware adversarial samples. Based on [60], they further optimized and proposed a Gym Malware mini learning environment. The success rates of DQN and A2C agents were 20 % and 15 % higher than those of random agents, respectively. In 2022, JahRizvi [61] used unsupervised clustering and feature-focused neural networks to train the neural network with pseudo-labels. Its accuracy also reached 0.98. Demirkıran et al. [62] first applied a transformer-based CANINE pre-training model to the field of malware detection, using the bagging-based BERT model for data preprocessing. They then trained the proposed Random Transformer Forest (RTF) model on the Catak dataset, achieving an F1 score of 0.6149. In 2023, Fang et al. [63] proposed a comprehensive Android malware detection method, FEDrive, based on a federated learning (FL) architecture. In order to better detect Android malware variants, they adopted a genetic evolution strategy to simulate the evolution of Android malware. They constructed a network model based on residual neural networks and tested it using samples from different years, including 275052 benign and 305139 malicious programs. In the end, FEDrive achieved a detection effect of 0.9608 and an F1 score of 0.9853. In 2023, Qiao [64] proposed an adversarial malware sample detection method based on model interpretation. First, the existing adversarial malware attack methods are analyzed, model interpretation techniques are used to obtain the contribution of each byte to the model's decision, and the distribution characteristics of sample contributions are constructed. Finally, the adversarial samples are identified based on their contribution distribution characteristics. The model interpretation is built with Grad-CAM, and unsupervised anomaly detection methods are applied in the classifier
section. The experiment shows that this method can effectively detect adversarial malware. After training with adversarial samples, the accuracy of the model is 0.916 and the recall is 0.959.
In addition, some variants of traditional machine learning models have also shown good performance in detecting text-based malware.
In 2019, Tian et al. [65] achieved good performance in the training of decision tree models and random forest models by optimizing the opcode features and API call features in disassembly code. Li et al. [66] extracted header data from PE files as features and trained on them using ensemble learning algorithms. Compared with naive Bayes, XGBoost, and logistic regression methods, ensemble learning showed better performance. In 2023, Muhammad [67] proposed a framework called ANTI-ANT, which consists of decision tree classifiers, random forests, logistic regression, and support vector machines, achieving an accuracy of 0.9964.
In addition to improving the model, researchers also focus on optimizing feature extraction, with specific improvements as follows.
In 2020, Lu et al. [68] aimed to resist Android malware obfuscation technology by extracting dynamic and static features of malware at runtime and constructing a comprehensive feature set to enhance malware detection capability. They improved the traditional classification model by combining deep belief networks (DBN) and gated recurrent units (GRU). Because static features are relatively independent, DBN is used to process them; dynamic features are temporally correlated, so GRU is used to process the dynamic feature sequence. Finally, the training results are input into a Back Propagation (BP) neural network for classification. Compared with traditional machine learning algorithms, this method improves resistance to obfuscation and achieves an accuracy of 0.9678. Mateless [69] conducted more comprehensive processing in feature extraction, using decompilation techniques to extract features from Android software: API call information and permission information from source code serve as feature baselines, together with keywords and non-obfuscated tags, which are divided into stop tags, feature tags, and long tail tags. Finally, different classifiers were compared on the AMD dataset, and the results showed that the method using random forest outperformed other models, with an accuracy of 0.978. In 2021, Mirabelle [70] combined multidimensional features extracted from strings of executable binary files and image-based representations to classify IoT malware. Two components based on the CNN model were designed and pre-trained on strings and grayscale images, and their top-level learned features were used as inputs for the feature fusion and classification components. In the experiment, 10234 IoT malware samples from four well-known families were used, achieving good recognition results with an F1 score exceeding 0.995. In addition, they attempted to identify the family labels of 24271 unlabeled (unknown/unseen) malware samples in the dataset, but the effect was not significant. Therefore, this method needs to be improved in terms of anti-obfuscation ability. Zhou [37] proposed a thread fusion feature extraction method using dynamic analysis. He extracted no more than 5000 pieces of API call information from malicious files to form an API call parameter dataset. The TF-IDF values of API sequence fragments are used as part of the features, and these feature vectors are combined into a feature matrix using one-hot encoding. Finally, gradient descent is used to optimize the feature weight parameters of the LR model, and vectorization converts the iteration into a matrix operation. This optimizes the LR algorithm and improves the accuracy of the model (Table 2).
4.2. Image-based malware detection
In recent years, deep learning has made significant progress in the field of image processing, so many researchers have adopted different network models to detect malicious software. In 2011, Nataraj et al. [47] first converted binary files into a two-dimensional array, generating grayscale images from the two-dimensional array and using the global features of the image as input to the neural network. They used the K-Nearest Neighbor (KNN) algorithm with a Euclidean distance metric for classification, and ultimately validated the image texture based classification scheme on a large-scale binary file dataset with an accuracy of 0.72. Nataraj’s method of processing data has become an effective preprocessing step that researchers can learn from. In 2016, Mansour et al. [71] used a fine-tuned XGBoost for training and used the forward stepwise selection technique to perform feature selection on the data. They achieved an accuracy of 0.997 on the BIG 2015 dataset.
Meanwhile, CNN is also commonly used to process image data, and many researchers have innovated models based on CNN.
In 2018, Cui [72] used grayscale images as input and introduced the bat algorithm on top of CNN to address the data imbalance problem. They used 9342 grayscale images from 25 families as the training set, achieving an accuracy of 0.945. However, when faced with similar malicious families, detection based on a single feature is not satisfactory. In 2020, Danish [73] fine-tuned CNN and proposed the IMCFN network model. IMCFN was pre-trained on the ImageNet dataset (≥10000000 images), and data augmentation techniques were used to improve model robustness. The classification accuracy reaches 0.9882 on the Malimg dataset and 0.9756 on the IoT Android mobile dataset. In 2021, Ma [74] combined CNN with an attention mechanism based on [72]. Detection was performed on the VX Heaven dataset and the Malimg dataset, with accuracy rates of 0.945 and 0.988, respectively. Compared with the model in [72], the accuracy improved by 4.3 percentage points. Jeyaprakash and S Abijah [75] also used Densely Connected Convolutional Networks (DenseNet) to detect malware in response to data imbalance issues. DenseNet is improved and optimized from CNN, inheriting its basic structure and characteristics, and the introduction of dense connections further improves network performance. They placed a reweighted class-balanced loss function in the final classification layer of the DenseNet model and tested it on four benchmark datasets. The accuracy is 0.9823 on the Malimg dataset, 0.9846 on the BIG 2015 dataset, 0.9821 on the MaleVis dataset, and 0.8948 on the Malicia dataset. At the same time, the time performance and anti-obfuscation performance have been optimized.
In addition, transfer learning technology is also a direction of detection for researchers.
In 2019, Bhodia [76] added transfer learning to the deep learning network ResNet and pre-trained it on the ImageNet dataset. The experiments used the Malimg dataset and the Malicia dataset, and the results showed that the performance of the model improved compared to KNN in simulated zero-day malware experiments. The accuracy rates in binary classification and multiclass classification problems are 0.976 and 0.923, respectively. In 2021, Mazhar [77] also used a transfer learning model based on ImageNet to detect malicious software. He used VGG-19 and added spatial attention mechanisms to enhance features before inputting them into CNN. Performance was evaluated on the Malimg dataset, with an accuracy of 0.9768. However, the model has a parameter count of 20199402, so it is not lightweight. In the same year, Pratikkumar [78] applied image features to multiple network models for comparison, including multi-layer perceptrons (MLP), convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), gated recurrent units (GRU), and transfer learning based ResNet52 and VGG-19. Transfer learning technology showed good performance in detection, with an accuracy of 0.9216.
In addition to the above improvement methods, researchers have also combined many deep learning models for classification, for example the AlexNet neural network, knowledge distillation technology, and autoencoders.
In 2021, Jiang [79] used RGB images as input and combined them with an AlexNet neural network to extract their color texture features. Finally, training on the balanced Malimg dataset yielded an accuracy of 0.978. They utilized multi-channel feature extraction techniques and
local response normalization techniques to improve the generalization ability of the model while effectively reducing its complexity. Compared with the VGGNet method, the accuracy has improved by 1.8 %, and the accuracy on the Malimg dataset is 0.978. Meanwhile, the disadvantage of this model is that it does not perform well in extracting fine-grained features. Wang et al. [80] used knowledge distillation technology to detect malicious software samples. They used SE ResNet50 as the teacher model to extract deep-level features of image texture while introducing a channel attention mechanism, extracting key information in the image based on changes in channel weights. LeNet and ResNet18 are used as student network models. The parameter count of ResNet18 is approximately 55 % lower than that of the teacher network. The LeNet and ResNet18 student networks improved their model accuracy by 3.12 % and 1.93 %, respectively, after knowledge distillation. In 2022, Xing [81] utilized an autoencoder to identify high-dimensional features contained in grayscale images. They designed two model structures, AE-1 and AE-2, and passed grayscale images through these two deep learning networks. AE-1 analyzes the feasibility of using grayscale images to represent the corresponding features of software, while AE-2 performs classification of malicious and benign software. Compared with CNN, SVM, and Naïve Bayes, this method reduces training time and improves detection accuracy, with an accuracy of 0.961. In 2022, Olorunjube [82] designed an integrated network based on deep convolutional neural networks and deep generative adversarial neural networks using RGB images as input. The detection performance was evaluated on three benchmark datasets, MaleVis, Malimg, and VirusShare, with an average accuracy of 0.967.
Table 2
Performance comparison of text-based research articles.
Columns: Research methods; Research papers; Model; Type of malware; F1-score; Accuracy.
Zero-day malware refers to malware that exploits previously unknown or newly discovered software vulnerabilities. In 2019, Vinayakumar [83] proposed an architecture called ScaleMalNet to combat zero-day malware. The analysis process mainly includes two stages. First, a hybrid analysis method based on static and dynamic analysis is applied to malware classification, and then image processing methods are used to group malware into corresponding malware categories. In the experiment, multiple machine learning and deep learning methods were compared, and finally CNN and LSTM networks were combined to classify images, with an accuracy of 0.963 on the dataset. In 2022, Won [84] proposed a new detection method for this type of malware, the PlausMal GAN network architecture based on generative adversarial networks. PlausMal GAN combines deep convolutional GAN, least squares GAN, Wasserstein GAN with gradient penalty, and evolutionary GAN. In PlausMal GAN, the first stage trains a generator and discriminator based on real malware data and generated malware data. In the second stage, the generator is fixed and the discriminator is retrained on real malware data and the malware data generated by the fixed generator. The accuracy of the designed zero-day experiment is 0.9556. In 2023, Yuhan Chai [85] proposed a sample adaptive dynamic prototype network based on Few Shot Learning (FSL) for detecting malware with few samples. This model uses dynamic convolution to achieve sample adaptive dynamic feature extraction. It solves the overfitting problem caused by sparse samples and the insufficient expression ability of lightweight neural networks, while achieving effective detection of unknown malicious software. This method improved the average accuracy by 19.53 % in the 5-way 5-shot scenario and by 15.10 % in the 5-way 10-shot scenario. Compared with the best optimization based methods, this method improved the average accuracy by 13.06 % in the 5-way 5-shot scenario and 12.44 % in the 5-way 10-shot scenario.
In the process of feature extraction, some researchers focus on feature fusion technology to improve detection accuracy and model robustness.
In 2021, Xiang Huang [86] pointed out that, compared with dynamic features, static analysis is easily affected by obfuscated code. However, static features often reflect the structure and layout of the original binary file, so he proposed a hybrid visualization technique that combines static and dynamic methods to extract features. Static and dynamic images are merged into one RGB image. The network architecture based on VGG16 showed an accuracy of 0.947. Weijie Han [87] explored the potential connection between dynamic and static API call sequences. The static and dynamic API sequences are associated and fused into a mixed sequence based on semantic mapping, constructing a mixed feature vector space. A machine learning based malware detection framework, MalDAE, was established, with detection and classification accuracies of 97.89 % and 94.39 %, respectively. However, MalDAE still has shortcomings in handling obfuscation techniques. In 2022, Liu [88] proposed using
disassembly technology to extract n-gram features from .bytes and .asm files in PE files, and feature fusion technology to fuse grayscale images and n-gram features. Finally, the BiLSTM model is used for classification. Experiments have shown that multi-feature classification has improved accuracy compared to single-feature classification, and the accuracy of this model is higher than that of traditional classification models such as random forests. In the same year, Zhang [89] and Shen [90] also applied feature fusion technology, but they chose to fuse the global features of the image. Zhang et al. detected malicious software in the Internet of Things (IoT) environment. They proposed the FF-MICNN network on the basis of traditional CNN. Unlike CNN, FF-MICNN can process images of different sizes. Compared with DBN, CNN, and KNN algorithms, their proposed algorithm improves detection speed, feature comprehensiveness, and accuracy. Compared with the detection algorithm based on a single feature, the accuracy of the proposed algorithm has also improved by 0.2 %. Shen et al. combined a dual attention mechanism with Bidirectional Long Short-Term Memory (BiLSTM). The dual attention mechanism gives different attention to the channels and spatial positions of images to extract local texture features of grayscale images. The BiLSTM module extracts the global texture structure features of malicious code grayscale images. The final model integrates the two types of features, which can reflect local texture features and retain global overall structural features. Compared to a single feature, the classification accuracy has improved by 3.25 % to 4.35 % (Table 3).
Table 3
Performance comparison of image-based research articles.
Columns: Research methods; Research paper; Model; Types of malware; F1-score; Accuracy.
4.3. Malware detection based on graph structure
Deep learning also performs well in handling graph-based detection, and using graph structures as input features can capture complex structures and relationships within malicious software. In recent years, researchers have conducted experiments using different network models, such as CNN, LSTM, GNN, etc.
In 2019, Wei Zhong [91] argued that a single deep learning model could not effectively handle the complex distribution of malware data, and therefore proposed a multi-level deep learning system (MLDLS). First, important static and dynamic features are selected from the feature set. A parallel improved K-means algorithm is used to divide the dataset into multiple primary clusters, and then multiple cluster subtrees are generated to construct a hierarchical clustering tree. Deep learning models are built for each cluster separately, and each cluster evaluates several important deep learning models, including Convolutional Neural Network (CNN), Deep Recurrent Neural Network (RNN), and Deep Fully Connected Feedforward Network (FC). The test results indicate that MLDLS utilizes parallel strategies for training and selecting deep learning models to reduce model construction time and improve the performance of detection systems. Hisham Alasmary [92] conducted in-depth graph analysis on three different datasets: Android malware, IoT malware, and IoT benign samples. By tracking the distribution of CFG attributes for different malware samples and types, it was found that Android and IoT malware differ in density, tightness, and number of nodes. Therefore, they use these different features as a pattern to construct deep learning network models. In this method, each malware sample is abstracted into a Control Flow Graph (CFG), and traditional machine learning algorithms such as Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF) are used in the experiment, as well as more advanced deep learning methods such as Convolutional Neural Network (CNN), where CNN has the best learning ability with an accuracy of 0.996. Ge et al. [93] compared several machine learning classification models. They used the skip-gram technique of doc2vec to transform FCG into continuous vectors, combined with graph kernel technology to learn the structural semantics of malicious software from FCG. Classifiers were then constructed with machine learning techniques, including Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (KNN). The experiment shows that SVM performs the best with an accuracy of 0.97. Bai et al. [94] combined the VF2 algorithm and the FCGiso algorithm to study the function call graph as the signature of the program. In order to deal with packing and obfuscation techniques, they used these two graph isomorphism algorithms to detect malware and its variants. Experiments showed that both algorithms can quickly determine whether the test sample is isomorphic to one of the FCG feature libraries. On the FCG dataset with 64934 vertices, the VF2 algorithm took 102.3 seconds, while the FCGiso algorithm took only 18.9 seconds.
References [95,96] combined the features of CNN and LSTM to analyze and detect graph structures. In 2019, Jieran Liu [95] constructed a detection model based on CNN. A network model named MCrab was constructed using the system call graph CFG as the model input. MCrab consists of two main components: CNN and improved LSTM. Among them, the CNN model extracts higher-level word feature sequences from the disassembly code of each custom function’s CFG. The LSTM model is responsible for capturing the long-term dependencies of window features. Its accuracy is 0.942. In 2022, Shen et al. [96]
collected up to 8000 samples from different APT organizations over a period of two years and extracted 680718 custom functions from 6972 samples from 10 APT organizations. The samples were disassembled using IDA Pro to form a Critical System Call Graph (CFG). A CNN-SLSTM model was constructed based on the MCrab model in paper [95], and the input gate, forget gate, unit, and output gate in the network were modified. After training, the accuracy reached 0.95, which is 11.6 % higher than CNN.
MITRE, an American research institution, launched a new attack model, ATT&CK, in 2014. The ATT&CK model is mainly used for threat intelligence analysis and can also be used for detecting and analyzing network series attack behaviors. In 2021, Yang [97] proposed the m-ATT&CK model to further abstract the behavioral characteristics of malicious code. At the same time, an approximate pattern matching algorithm based on F-MWTO with non-negative variable gap constraints was proposed to map malicious code behavior information to the m-ATT&CK model. This pattern matching algorithm combines contextual semantic information to explore potential relationships between high-level behaviors. Using semantic level attack graphs as inputs to the network model can indicate the attack intent and working mechanism of malicious code. In 2022, Zhang et al. [98] proposed a model called HyGNN Mal for automatic detection of Android malware based on self-attention, Deep-TNN, and Bi GRU. Meanwhile, for the first time, the Abstract Syntax Tree (AST) structure was applied to malware detection. Directed graph features are constructed based on the AST and input into the deep traversal neural network Deep-TNN to extract features. In addition, self-attention mechanisms are used to handle source code sequences with added row and column position information, while Bi GRU is responsible for handling API call features. Trained in this way, HyGNN Mal achieved an accuracy of 0.992 on classification tasks across five families. In the same year, D′Angelo et al. [99] proposed a new model called Perm Maps. This model combines Android application permissions and severity level information, and improves accuracy by 16 % through CNN training. The feature selection technique proposed in the article reduces the computational complexity required for Perm Maps generation and CNN training processes.
Because graph neural networks are better at handling graph structured data and can capture the complex relationships between nodes as well as the global properties of graphs, many researchers use GNN as the basic model for their research.
In 2019, Liu et al. [100] detected malicious software based on GCN by using disassembly tools to extract malicious code API calls and annotate their attributes. At the same time, based on the contribution of APIs, key APIs are selected to generate call graphs as model inputs, and the similarity of malicious code is calculated with GCN and CNN. Finally, DBSCAN is used to cluster malicious families. The experimental results can reach 0.873. In 2021, Li [101] chose the API call graph as the model input, extracted the feature map of the directed cyclic graph using the Markov chain method, and normalized the weights. The extracted features were input into the graph convolutional neural network GCN, with an accuracy of 0.9832. Fang [102] utilized GCN to learn control flow information from functions and embedded basic semantic blocks into the network with a word embedding model from natural language processing. Their proposed detection model eliminates the differences in instruction architecture of binary code. The scalability and accuracy of the final result have been improved. Its accuracy reached 0.876, which is better than the Opcode N-Gram method. In 2023, Feng et al. [103] proposed a new Java bytecode classification framework, BejaGNN, for Java-based malware. It combines word embedding tech
mentioned earlier, some researchers have focused their research on feature construction. In addition to traditional graph structures, different graph structures are constructed from multiple aspects for use in malware detection.
In 2018, Ding et al. [104] proposed a method for constructing a Common Dependency Graph (CDG) of a malware family during feature extraction. The authors construct a common behavior graph for each malware family, and at the same time, in order to cope with similar or repetitive behaviors appearing multiple times in the dependency graph, they prune similar behaviors within subgraphs and between subgraphs. Based on the API dependency graph, dynamic taint analysis technology is used to label system call parameters with taint labels, and a system call dependency graph is constructed by tracking the propagation of tainted data. This method has a high detection rate and a low false alarm rate, and is effective in detecting variants of malicious software.
In 2019, Di Xue [105] learned features from multiple perspectives, including Call Graph (CFG), grayscale image, and RGB image. A malicious software homology analysis system (MHAS) for processing multiple features was designed using ensemble learning methods with CNN as the basic learner. MHAS generates grayscale and RGB images from malicious software binary files, and uses the disassembly tool IDA Pro to extract opcode sequences and system call graphs, learning from three types of feature views. In MHAS, 9 CNNs were used as basic learners for classification, and the ensemble results were mapped to the result matrix. Finally, they proposed an ELR method to integrate the ensemble results again. In the end, MHAS achieved an accuracy of 99.17 % on datasets from 10 families.
In 2021, Pengbin et al. [106] did not analyze API call information, but directly analyzed the source code of malicious software and extracted high-level semantic information. The effectiveness of this feature extraction method was confirmed through experiments. They use Word2Vec to extract internal attributes, including required permissions, security levels, and semantic information of Smali instructions, to form nodes within the graph structure. Then, a CGDroid model was designed by combining graph neural networks, with a detection accuracy of 0.92. This method raises the barrier to evading detection and counteracts some shell techniques, making it difficult for attackers to escape detection. Yang Pin et al. [107] designed a Directed Data Flow Graph (ADG) with weights. They run malicious code in a sandbox to extract API sequences, then abstract API sequences into data flow events and build ADGs. At the same time, a graph convolutional neural network called Attributed Data Flow Graph Convolutional Networks (ADGCN) was proposed. ADGCN is based on the GraphSE framework, and compared to API call graphs, this graph structure improves detection accuracy by 10 %, achieving an accuracy of 0.90. SibelGulmez et al. [108] adopted another approach to constructing graph structures, using disassembly techniques to extract opcode sequences from PE files and construct opcode feature maps for detection. Their proposed method achieved an accuracy of 0.98 on the dataset. This construction method improves performance by 10 % compared to opcode histograms.
In 2023, Niu et al. [109] conducted research on the detection of malicious software in the Internet of Things and proposed a graph compression algorithm with reachable relation extraction (GCRR). Based on this, an Android malware detection and classification method, GCDroid, was designed. They first decompiled the APK file using ApkTool and extracted the APIs. Then, GCRR is used to extract the reachability relationship between APKs, compressing the massive heterogeneous APK-API relationship graph into a homogeneous APK graph. Finally, the obtained small homogeneous graph is input into the GCN model to complete the classification of APK nodes.
nology and GNN algorithm to capture key semantic information from The experiment showed that the detection accuracy of GCDroid
the Inter Procedural Control Flow Graph (ICFG) of Java programs. We improved by 1.53–39.13 % on different datasets. This method not only
compared the performance of GCN, GAT, and GIN models, and ulti reduces the time and space costs of Android malware detection and
mately, the GAT model showed an accuracy of 0.998. classification methods, but also ensures accuracy. In 2023, Sun et al.
However, in addition to the improvements made to the model [110] decomposed attribute maps into function based modules using
H. Wang et al. Neurocomputing 598 (2024) 128010
graph embedding clustering method and proposed a binary code module similarity detection method, ModDiff. This method uses graph matching algorithms and similarity detection algorithms for homology detection. A Siamese network and a pre-trained BERT model were combined to detect similarity. The results show an accuracy of 0.94, which effectively counters obfuscation techniques (Table 4).
Based on the preceding summary and analysis, the following research trends and preferences in malware detection technology have emerged in recent years:
(1) The application of deep learning technology
With its continuous development in recent years, deep learning can automatically extract features and identify new malicious software by learning from and analyzing a large number of malicious software samples. This method can not only improve the accuracy of detection, but also save a lot of manual detection costs. Researchers often use various deep learning based network models to detect malicious software. For example, convolutional neural networks (CNN) and other models have achieved significant results in the field of image recognition [72–74], and graph neural networks (GNN) have been used to learn the representation of graph structures and automatically capture malicious software behavior features [100–103]. Therefore, the application of deep learning technology is a major trend in research.
(2) Multi-source feature fusion
In text-based detection tasks, researchers have achieved good performance through multi-feature fusion [37], [68–70]. In image-based detection tasks, researchers use feature fusion techniques to convert different features into grayscale or RGB images to enhance the model's generalization ability [86–90]. In graph-based detection tasks, researchers improve detection performance by constructing different graph structures [107–110].
Therefore, multi-feature fusion, which integrates multiple types of information as inputs to the model, is a major trend in research. Examples include extracting advanced semantic information from the source code [106], extracting reachability relationships between APKs [109], fusing local and global image features [90], and fusing static and dynamic features [87].
(3) The development of dynamic detection technology
Dynamic detection technology has been an important research direction in the field of malware detection in recent years. By simulating the behavior of malicious software in real environments, researchers can use dynamic detection techniques to observe and record its behavioral characteristics in real time [37], [68], [83], [109], and thereby detect and report potential threats in a timely manner. This method has a higher detection rate and a lower false alarm rate compared to static detection techniques. Therefore, dynamic detection technology is also a major trend in the research of malware detection technology.
5. Dataset
With the continuous development of malware detection technology, malware datasets are also constantly improving. The malware dataset is a key factor in malware detection. A comprehensive and high-quality malware dataset can ensure that the model learns the features and behavior patterns of various malware, thereby improving the accuracy of detection. A dataset of this quality can be used to conduct in-depth analysis of the behavior of malicious software, allowing researchers to better understand the attack methods, propagation pathways, and potential targets of malicious software. This is very helpful for researchers to develop targeted defense strategies.
The collection and labeling of malware datasets is a time-consuming and expensive process that requires ensuring the privacy and security of the data. To make full use of malware datasets in malware detection, datasets need to be constantly updated and optimized to adapt to the constantly changing malware environment. The following summarizes and analyzes some datasets commonly used for malware analysis in recent years; a comparison is shown in Table 5.
Malicia dataset [111]:
Since March 7, 2012, the Malicia dataset has collected a total of 502 exploit servers with 46,514 malware executables and 603 DNS domains, hosted in 242 ASes in 57 countries.
Microsoft Malware Classification Challenge Dataset [112]:
The 2015 Microsoft Malware Classification Challenge was released on the Kaggle website. The dataset has a size of about 0.5 TB and contains the bytecode of over 20,000 malware samples. The dataset includes a mix of nine different families; each file has an ID, a hash value that uniquely identifies the file, and a category. To ensure operating system security, each file is free of PE headers and contains a hexadecimal representation of the binary content. The dataset is used not only in Kaggle competitions, but also as a benchmark by a wide range of researchers. The dataset has been cited in more than 50 research papers.
Malimg dataset [47]:
This dataset is a malicious code visualization dataset compiled by Nataraj et al. in 2011, in which malicious binary files are converted to 8-bit pixel values arranged in a matrix M ∈ R^{m×n}. The resulting malicious images are grayscale images with values in [0, 255], where 0 denotes black and 255 denotes white. The dataset includes 25 malware families and 9339 samples stored as malware byte-map images [24]. Most researchers chose this dataset for the classification of malicious families.
Malnet dataset [113]:
The MALNET-IMAGE dataset is the largest public cybersecurity
Table 4
Performance comparison of research articles based on graph structure.
Research methods Research papers Model Type of malware F1-score Accuracy
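Many of the graph-based detectors compared in Table 4 pass an API call graph or a control flow graph through a graph convolutional network. As an illustration only, the following minimal sketch computes one GCN propagation step, ReLU(D^{-1/2}(A + I)D^{-1/2} X W), on a three-node toy call graph. The graph, the one-hot node features, the weight values, and the function names are all assumptions made for this example; the sketch uses the standard GCN normalization on an undirected graph and does not reproduce the implementation of any surveyed paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W) for a symmetric adjacency."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Node degrees of the self-looped graph.
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by the square roots of both endpoint degrees.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbor features, project them, then apply ReLU.
    projected = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in projected]


if __name__ == "__main__":
    # Hypothetical call graph with three API nodes and edges 0-1 and 1-2.
    adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
    features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot node IDs
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # fixed here; learned in practice
    for row in gcn_layer(adjacency, features, weights):
        print([round(v, 4) for v in row])
```

In a real detector the weights would be learned, several such layers would be stacked, and a directed call graph would either be symmetrized first or handled with a directed variant of the normalization.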
dataset. This dataset contains 1.2 million malware images with 47 types and 696 families. MALNET-IMAGE provides 24 times more images and 70 times more classes compared to the largest public database of binary images. Malnet provides an opportunity to advance cybersecurity.
Virus-MNIST dataset [114]:
The Virus-MNIST dataset is an image dataset of about 50,000 examples in 10 executable classes: 9 malware families and one class of benign software. The images are built from the first 1024 bytes of each Portable Executable (PE) file.
Drebin dataset [115]:
This dataset was collected between August 2010 and October 2012 and contains 5,560 files from 179 different malware families. The Drebin dataset was built by performing extensive static analysis, which gathers many code structure features. These features are embedded into a joint vector space and can help researchers automatically identify malware. It is a dataset of real malware based on Android applications. The initial collection of malicious and benign software comprises 131,611 applications: 96,150 apps from the Google Play store, 19,545 apps from Chinese markets, 2810 from Russian markets, and 13,106 samples from blogs, forums, and websites. In addition, Drebin contains all the data from the Android Malware Genome project.
CIC-AAGM dataset [116]:
This dataset is provided by the Canadian Institute for Cybersecurity and consists of data collected after semi-automatic installation of programs on Android phones. It contains 1900 applications of mixed types, including 250 adware, 150 general malware, and 1500 benign applications.
NSL-KDD dataset [117]:
This dataset, also from the Canadian Institute for Cybersecurity, addresses some of the problems inherent in the KDD'99 dataset by removing redundant data from the training set, which contains about 125,000 records with 41 features.
CLaMP dataset [118]:
This dataset is a collection of header field values of portable executables (PEs). It contains 5184 malware and benign software samples.
VirusTotal dataset [119]:
This dataset is derived from the VirusTotal website, which processes over 2 million sample files per day. The site has collected about 2.4 billion samples since 2014. VirusTotal provides unlimited downloads of sample files from the past seven days and generates analytical reports for researchers to download.
Ember dataset [120]:
This dataset contains features extracted from 1.1 million Portable Executable (PE) files scanned by VirusTotal in 2017: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). Each sample record includes the SHA-256 hash of the file.
The Ember dataset consists of eight groups of raw features that cover parsed features as well as format-independent histograms and strings. The Ember dataset does not distribute the files themselves, but rather metadata derived from each file, including the month in which the file first appeared, its label, and the derived features of the PE file, which avoids intellectual property disputes.
Malshare dataset [121]:
The Malshare dataset is a collaborative, community-driven public malware repository published by the Malshare website. The website provides public API keys without restriction, with a quota of 2000 API calls per key, covering sample downloads, detailed information lookups, and searches.
VirusShare dataset [122]:
VirusShare.com is a malware sample repository hosted and maintained by Corvus Forensics. The VirusShare dataset is a malware sample set constructed with static API calls. The samples are listed in MD5 hash form: each list is a plain text file with one hash per line. Files 0–148 are 4.3 MB in size, with 131072 hashes per file. Files 149 and higher are 2.1 MB in size, with 65536 hashes per file.
VirusSign dataset [123]:
The VirusSign dataset contains a large amount of malicious sample data, with over 300 TB and approximately 60 billion non-redundant samples. The sample data consists of two collections. One is MalwareList, which contains malware samples for PCs, excluding Android. The other is AndroidList, a collection of mobile samples that includes Android, Mac and Java samples.
MalwareBazaar dataset [124]:
This dataset tracks over 300,000 malware distribution sites, in
Table 5
Comparison of datasets.
Platform Dataset name Time Sample type Number of samples
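Image-oriented datasets such as Malimg and Virus-MNIST represent each binary as a grayscale picture in which every byte becomes one pixel (0 is black, 255 is white). The sketch below shows one way such a conversion could look, assuming a fixed row width and zero padding of the last row. The function name, the width of 32, and the sample bytes are illustrative choices, not the procedure used to build either dataset.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 32) -> List[List[int]]:
    """Arrange raw bytes row by row into a grid of 8-bit pixel values (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Zero-pad the tail so that the final row is complete (an assumption).
    padded = data + bytes((-len(data)) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


if __name__ == "__main__":
    # Stand-in for the first 1024 bytes of an executable: a 32 x 32 pixel grid.
    header = bytes(range(256)) * 4
    image = bytes_to_grayscale(header[:1024], width=32)
    print(len(image), len(image[0]))  # prints: 32 32
```

A fixed width keeps every image the same shape for a CNN; choosing the width from the file size instead would preserve more of each file but yield images of varying height.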
contrast to the VirusTotal dataset, the MalwareBazaar dataset tracks only malware samples and does not contain benign software.
RmvDroid [125]:
This dataset was collected between 2014 and 2018 and contains 9133 malware samples from 56 malware families. The researchers crawled apps together with all of their meta information from Google Play and flagged malware with the help of Google Play's app maintenance behavior.
Benchmark API Call Dataset [126]:
This dataset is an API call-based malware dataset built from Windows OS API calls recorded by Cuckoo Sandbox. The dataset was obtained by analyzing 7107 malware samples of different types, covering 8 malware families: Trojan, Backdoor, Downloader, Worms, Spyware, Adware, Dropper, and Virus.
MalGenome dataset [127]:
This dataset was collected from August 2010 to October 2011 and is composed of 1260 malicious samples and 863 benign samples. The malicious samples fall into 49 categories, which contain about 52 Android malware families.
Android Malware Dataset [128]:
This data was acquired from 2010 to December 2015 and consists of 24,650 malicious samples. The team used anti-virus scan results and automation techniques to classify the malware samples into 135 variants based on the semantics of their malicious behaviors [37]; the variants belong to 71 malware families.
Marvin dataset [129]:
This dataset was collected in 2015 and includes more than 135,000 Android apps and 15,000 malware samples, of which 10,572 are malicious and 75,996 are benign.
ISCX Android Botnet Dataset [130]:
This dataset focuses on collecting malicious samples for botnets. The samples were collected from 2010 to 2014 and consist of 1929 malicious samples that can be categorized into 14 botnet families [40], which represent early and mature versions of Android botnets.
Android PRAGuard dataset [131]:
This dataset consists of 10,479 malicious samples collected in 2010–2011 and was obtained by obfuscating the MalGenome and Contagio Mobile Minidump datasets using seven different obfuscation techniques.
National Software Reference Library (NSRL) [132]:
This dataset provides metadata (MD5 hashes) for benign and malicious software samples. RDS 2.10 in the NSRL was released in September 2005. It contains 33,860,009 files and provides 10,663,650 unique SHA-1, MD5 and CRC32 values.
AndroOBFS dataset [133]:
This dataset was collected from 2018 to 2020. Unobfuscated malware samples were obtained through the AndroZoo project and VirusShare.com and were obfuscated month by month for each year in six different categories, yielding 14,579 unique obfuscated malware samples distributed across 158 families.
Piggybacking dataset [134]:
Released in 2016, this dataset is a dataset of piggybacked apps. Android's packaging model provides plenty of opportunities for malware writers to carry malicious code in apps and then easily distribute it to numerous users. The dataset contains 1136 samples.
AndroZoo dataset [135]:
This dataset collects millions of Android apps from various data sources, totaling over 20 TB. These apps have been analyzed by dozens of different antivirus products.
6. Summary and prospect
The severe confrontation in network security has increased the necessity of malware detection technology. According to the previous summary and explanation, integrating deep learning technology into malware detection has become a hot topic. Based on existing research, this article reviewed, analyzed, and summarized the development history, related technologies, and detection methods of malicious software, and reviewed malware detection techniques from three feature perspectives: text, graph structure, and image. In addition, due to differences in the content and quality of datasets, the choice of dataset can also affect research results to a certain extent. Therefore, the article also summarizes 26 datasets applied to malware detection research.
In recent years, malware detection technology has made significant progress. Compared with traditional detection methods, deep learning technology reduces the cost and time of manual labeling. The detection accuracy of malware tracing is constantly improving, and there are also certain achievements in combating unknown malware. However, the methods of malware attacks continue to develop, so there are still shortcomings and challenges in this work.
(1) The challenges of adversarial attacks to research work
Researchers have been fighting adversarial attacks all along. Malware detection faces challenges from techniques such as packing, obfuscation, metamorphism, and sandbox evasion. These techniques hide the malware's true intention and behavior, rendering traditional signature-based detection methods ineffective and further increasing the difficulty of detection.
(2) Zero-day malware poses a huge threat
Attackers often exploit undisclosed vulnerabilities (zero-day vulnerabilities) for attacks. Currently, researchers mostly run their experiments on existing datasets. Obviously, the larger the sample size in the dataset, the more convincing the detection results are. However, the samples in existing datasets provide only prior knowledge about new APT attacks, so tracing techniques always fall behind new variants of malicious software. Especially when facing zero-day malware in an APT environment, existing experience often fails to identify it effectively.
(3) The wide spread of malicious software attacks
The diversity of malware lies in its ability to run on various operating systems, including Windows, macOS, Linux, Android, and iOS. This increases the difficulty of detection and maintenance, while also increasing the risk of missed detections and false alarms.
(4) The limitations of malware detection methods
Both static detection and dynamic detection have certain limitations. Static detection cannot effectively deal with adversarial measures such as packing and malware variants, and is easily bypassed by attackers through encryption and obfuscation techniques. Dynamic detection can only execute malicious software with limited permissions in sandbox environments, yet many malicious programs exhibit different operational characteristics without root privileges.
(5) The latency of malware detection
Malware behaves and mutates very quickly, and some malware can even transform itself, automatically changing its code structure or features. Such malware exhibits a different form at each detection, and traditional detection methods cannot monitor and analyze its behavior in real time.
In order to address the difficulties and challenges currently faced by malware detection technology, the relevant prospects are as follows:
(1) In future work, we need to delve into the principles of adversarial attacks and analyze the patterns and characteristics of adversarial techniques used in malware. Dynamic real-time behavior analysis should be strengthened and high-level semantic features extracted from malicious samples. Fine-grained characterization of malicious software behavior can reconstruct the intent of attacking malware in order to respond to adversarial attacks.
(2) Build a detection model that is better suited to zero-day malware by combining multiple types of features (static code features, dynamic behavior features, network traffic features, etc.). For example, models that learn by themselves and adapt to new environmental changes can be built on unsupervised techniques, and network models that quickly adapt to new tasks can be built on meta-learning. Combining deep learning models to identify abnormal behavior and potential zero-day threats improves the ability to detect unknown threats.
(3) Build a system that adapts to a multi-operating system environment, for example by using cloud security services, to ensure that it can resist malicious software attacks regardless of the operating system.
(4) Enhance static and dynamic techniques: improve static analysis capabilities by learning from a large number of malicious software samples, and construct methods that counter obfuscation, encryption, and shell deformation (packing). In the static stage, try to reveal the characteristics of malicious behavior as much as possible. Further enhance the intelligence level of sandbox technology to counter anti-sandbox technology and capture the real behavior of malicious software in more advanced execution environments.
(5) When constructing network models, researchers need to further improve the generalization ability of the model to cope with complex variants, and combine various technical means such as static detection, dynamic behavior analysis, and network traffic monitoring to build a multi-level defense system. At the same time, detection latency should be reduced so that the model has real-time detection and response capabilities.
CRediT authorship contribution statement
Huijuan Wang: Writing – review & editing, Supervision, Project administration, Methodology, Funding acquisition. Boyan Cui: Writing – review & editing, Writing – original draft, Validation, Conceptualization. Quanbo Yuan: Software, Resources, Investigation. Ruonan Shi: Visualization, Data curation. Mengying Huang: Investigation.
Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Wang Huijuan reports financial support was provided by The Fund Project of Central Government Guided Local Science and Technology Development (No. 226Z0302G). Wang Huijuan reports financial support was provided by The Special Project of Langfang Key Research and Development (No. 2023011005B). If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
The authors do not have permission to share data.
Acknowledgments
This study was funded by the Fund Project of Central Government Guided Local Science and Technology Development under Grant No. 226Z0302G, the Special Project of Langfang Key Research and Development under Grant No. 2023011005B and the Young Top Talent Project of Hebei Provincial Department of Education under Grant No. BJK2023116.
References
[1] Ö.A. Aslan, R. Samet, A comprehensive review on malware detection approaches, IEEE Access 8 (2020) 6249–6271.
[2] Q. Le, O. Boydell, B. MacNamee, et al., Deep learning at the shallow end: malware classification for non-domain experts, Digital Invest. 26 (2018) S118–S126.
[3] Ali Muzaffar, et al., An in-depth review of machine learning based Android malware detection, Comput. Secur. 121 (2022) 102833.
[4] R. Vinayakumar, M. Alazab, K.P. Soman, et al., Robust intelligent malware detection using deep learning, IEEE Access 7 (2019) 46717–46738.
[5] Wei Wang, Zhao Mengxue, Wang Jigang, Effective android malware detection with a hybrid model based on deep autoencoder and convolutional neural network, J. Ambient Intelligence Humanized Comput. 10 (2019) 3035–3043.
[6] Zhang Yang, Hao Jiangbo, Malicious code detection method based on attention mechanism and residual network, Comput. Appl. 42 (06) (2022) 1708–1715.
[7] Gaoning Shen, et al., Feature fusion-based malicious code detection with dual attention mechanism and BiLSTM, Comput. Secur. 119 (2022) 102761.
[8] S. Seneviratne, R. Shariffdeen, S. Rasnayaka, et al., Self-supervised vision transformers for malware detection, IEEE Access 10 (2022) 103121–103135.
[9] Ö.A. Aslan, R. Samet, A comprehensive review on malware detection approaches, IEEE Access 8 (2020) 6249–6271.
[10] W. Han, J. Xue, Y. Wang, et al., MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics, Comput. Secur. 83 (2019) 208–233.
[11] J. Singh, J. Singh, A survey on machine learning-based malware detection in executable files, J. Syst. Architecture 112 (2021) 101861.
[12] M. Gopinath, S.C. Sethuraman, A comprehensive survey on deep learning based malware detection techniques, Comput. Sci. Rev. 47 (2023) 100529.
[13] Information Security: 12th International Conference, ISC 2009 Pisa, Italy, September 7–9, 2009 Proceedings[M]. Springer, 2009.
[14] Jiang Kaolin, Bai Wei, Zhang Lei, et al., Malicious code detection based on multi-channel image deep learning, Comput. Appl. 41 (04) (2021) 1142–1147.
[15] Fossi M., Egan G., Haley K., et al. Symantec internet security threat report trends for 2010[J]. Volume XVI, 2011.
[16] K. Haley, N. Johnson, J. Fulton, Symantec internet security threat report 2017, Symantec Corp. Mt. View CA USA Tech. Rep. (2017) 22.
[17] Wermke D., Huaman N., Acar Y., et al. A large scale investigation of obfuscation use in google play. arXiv preprint arXiv:1801.02742, 2018.
[18] Faruki P., Fereidooni H., Laxmi V., et al. Android code protection via obfuscation techniques: past, present and future directions. arXiv preprint arXiv:1611.10231, 2016.
[19] AVLTeam. Antiy mobile security's “Dvmap” Android malware analysis report. 2017. 〈http://www.freebuf.com/articles/terminal/137015.html〉.
[20] Li Li, et al., Understanding android app piggybacking: a systematic study of malicious code grafting, IEEE Trans. Inf. Forensics Secur. 12.6 (2017) 1269–1284.
[21] S. Liu, P. Feng, S. Wang, et al., Enhancing malware analysis sandboxes with emulated user behavior, Comput. Secur. 115 (2022) 102613.
[22] E. Zhu, J. Zhang, J. Yan, et al., N-gram MalGAN: evading machine learning detection via feature n-gram, Digital Communications and Networks 8 (4) (2022) 485–491.
[23] S. Liu, P. Feng, S. Wang, et al., Enhancing malware analysis sandboxes with emulated user behavior, Comput. Secur. 115 (2022) 102613.
[24] P.G. Balikcioglu, M. Sirlanci, O. A. Kucuk, et al., Malicious code detection in android: the role of sequence characteristics and disassembling methods, Int. J. Inf. Secur. 22 (1) (2023) 107–118.
[25] Zhiyang Fang, et al., Evading anti-malware engines with deep reinforcement learning, IEEE Access 7 (2019) 48867–48879.
[26] C. Acarturk, M. Sirlanci, P.G. Balikcioglu, et al., Malicious code detection: Run trace output analysis by LSTM, IEEE Access 9 (2021) 9625–9635.
[27] J. Jueun Jeonand, Y., Jeong Parkand, Dynamic analysis for IoT malware detection with convolution neural network model, IEEE Access 8 (2020).
[28] N.W. Pérez-Díaz, J.O. Chinchay-Maldonado, H.I. Mejía-Cabrera, et al., Ransomware Identification Through Sandbox Environment[C]//Proceedings of the Future Technologies Conference, Springer International Publishing, Cham, 2022, pp. 326–335.
[29] Tsfaty, C., Fire, M., Malicious Source Code Detection Using Transformer. arXiv preprint arXiv:2209.07957, 2022.
[30] D. Xue, J. Li, T. Lv, et al., Malware classification using probability scoring and machine learning, IEEE Access 7 (2019) 91641–91656.
[31] B. Kolosnjaji, A. Zarras, G. Webster, C. Eckert, Deep learning for classification of malware system call sequences. In: Proc. of the Australasian Joint Conf. on Artificial Intelligence, Springer-Verlag, Cham, 2016, pp. 137–149.
[32] E. Zhu, J. Zhang, J. Yan, et al., N-gram MalGAN: evading machine learning detection via feature n-gram, Digital Communications and Networks 8 (4) (2022) 485–491.
[33] Liu Zixuan, Wang Chen, BiLSTM Malicious Code Classification Based on Multi-feature Fusion, in: Electronic Design Engineering, 30, 2022, pp. 67–72. DOI: 10.14022/j.issn1674-6236.2022.18.014.
[34] Choi, H., Lee, J., Yang, J., N-gram in swin transformers for efficient lightweight image super-resolution[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 2071-2081.
[35] Sanjay Madan, Sofat Sanjeev, Bansal Divya, Tools and techniques for collection and analysis of internet-of-things malware: a systematic state-of-art review, J. Comput. 34.10 (2022) 9867–9888.
[36] M. Ring, D. Schlör, S. Wunderlich, et al., Malware detection on windows audit logs using LSTMs, Comput. Secur. 109 (2021) 102389.
[37] Zhou Yang, Detection and Analysis of Windows Malicious Code Based on Behavioral Features, People’s Public Security University of China, 2021. DOI: 10.27634/d.cnki.gzrgu.2021.000279.
[38] Q. Le, O. Boydell, B. MacNamee, et al., Deep learning at the shallow end: malware classification for non-domain experts, Digital Invest. 26 (2018) S118–S126.
[39] Y. Sung, S. Jang, Y.S. Jeong, et al., Malware classification algorithm using advanced Word2vec-based Bi-LSTM for ground control stations, Comput. Commun. 153 (2020) 342–348.
[40] J. Sun, X. Luo, H. Gao, et al., Categorizing malware via a Word2Vec-based temporal convolutional network scheme, J. Cloud Comput. 9 (2020) 1–14.
[41] J. Kang, S. Jang, S. Li, et al., Long short-term memory-based malware classification method for information security, Comput. Electrical Eng. 77 (2019) 366–375.
[42] S. Jha, D. Prashar, H.V. Long, et al., Recurrent neural network for detecting malware, Comput. Secur. 99 (2020) 102037.
[43] Y. Ding, X. Xia, S. Chen, et al., A malware detection method based on family behavior graph, Comput. Secur. 73 (2018) 73–86.
[44] Song Wenna, Peng Guojun, Fu Jianming, et al. Research on Malicious Code Evolution and Traceability Technology [J]. Journal of Software, 2019, 30 (08): 2229-2267. DOI: 10.13328/j.cnki. job-005767.
[45] Silva C.D.S., Ferreira da Costa L., Rocha L.S., et al. KNN applied to PDG for source code similarity classification[C]//Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part II 9. Springer International Publishing, 2020: 471-482.
[46] Li H., Cheng Z., Wu B., et al. Black-box Adversarial Example Attack towards FCG Based Android Malware Detection under Incomplete Feature Information[C]//32nd USENIX Security Symposium (USENIX Security 23). 2023: 1181-1198.
[63] W. Fang, J. He, W. Li, et al., Comprehensive android malware detection based on federated learning architecture[J], IEEE Trans. Inf. Forensics Secur. (2023).
[64] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proc. IEEE Int. Conf. Comput. Vis. (2017) 618–626.
[65] Tian Donghai, Wei Xing, Zhang Bo, et al. Research and implementation of kernel malicious program detection based on machine learning[J]. Journal of Beijing Institute of Technology, 2020, 40(12): 1295-1301. DOI: 10.15918/j.tbit1001-0645.2019.261.
[66] W. Li, C. Zhang, J. Zhou, Malicious Code Detection Method Based on Static Features and Ensemble Learning[C]//Journal of Physics: Conference Series. IOP Publishing, 2021, 2010(1): 012165.
[67] M. Awais, M.A. Tariq, J. Iqbal, Anti-Ant Framework for Android Malware Detection and Prevention Using Supervised Learning[C]//2023 4th International Conference on Advancements in Computational Sciences (ICACS). IEEE, 2023: 1-5.
[68] T. Lu, Y. Du, L. Ouyang, et al., Android malware detection based on a hybrid deep learning model, Secur. Commun. Netw. 2020 (2020) 1–11.
[69] R. Mateless, D. Rejabek, O. Margalit, et al., Decompiled APK based malicious code classification, Future Gener. Comput. Syst. 110 (2020) 135–147.
[70] M. Dib, S. Torabi, E. Bou-Harb, et al., A multi-dimensional deep learning framework for IoT malware classification and family attribution, IEEE Trans. Netw. Serv. Manag. 18 (2) (2021) 1165–1177.
[71] Mansour Ahmadi, et al., Novel feature extraction, selection and fusion for effective malware family classification, Proc. Sixth ACM Conf. Data Appl. Secur. Priv. (2016).
[72] Z. Cui, F. Xue, X. Cai, et al., Detection of malicious code variants based on deep learning, IEEE Trans. Ind. Inform. 14 (7) (2018) 3187–3196.
[73] Danish Vasan, et al., IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture, Comput. Netw. 171 (2020) 107138.
[74] Ma Dan, Wan Liang, Cheng Qichen, et al., Attention-CNN in malicious code detection, Comput. Sci. Explor. 15 (04) (2021) 670–681.
[75] J. Hemalatha, S.A. Roseline, S. Geetha, et al., An efficient densenet-based deep learning model for malware detection, Entropy 23 (3) (2021) 344.
[76] Bhodia N., Prajapati P., Di Troia F., et al. Transfer learning for image-based malware classification[J]. arXiv preprint arXiv:1903.11551, 2019.
[77] M. Ahmadi, D. Ulyanov, S. Semenov, et al., Novel feature extraction, selection and fusion for effective malware family classification, Proc. Sixth ACM Conf. Data Appl. Secur. Priv. (2016) 183–194.
[47] Nataraj,Lakshmanan,etal."Acomparativeassessmentofmalwareclassification [78] P. Prajapati, M. Stamp, An empirical analysis of image-based learning techniques
usingbinarytextureanalysisanddynamicanalysis."Proceedingsofthe for malware classification, Malware Anal. Using Artif. Intell. Deep Learn. (2021)
4thACMWorkshoponSecurityandArtificialIntelligence.2011. 411–435.
[48] J. Gennissen, L. Cavallaro, V. Moonsamy, et al., Gamut: sifting through images to [79] Kao-Lin Jiang, Wei Bai, Lei Zhang et al. Malicious code detection based on multi-
detect android malware[J]. Bachelor thesis, Royal Holloway University, London, channel image deep learning[J]. Computer.
UK, 2017. [80] W.A.N.G. Runzheng, G.A.O. Jian, H.U.A.N.G. Shuhua, et al., Malicious code
[49] G. Conti, E. Dean, M. Sinda, et al., Visual reverse engineering of binary and data family detection method based on knowledge distillation, Comput. Sci. 48 (01)
files[C]. //International Workshop on Visualization for Computer Security, (2021) 280–286.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 1–17. [81] X. Xing, X. Jin, H. Elahi, et al., A malware detection approach using autoencoder
[50] Freitas S., Duggal R., Chau D.H. MalNet: A large-scale image database of in deep learning, IEEE Access 10 (2022) 25696–25706.
malicious software[C]//Proceedings of the 31st ACM International Conference on [82] O.J. Falana, A.S. Sodiya, S.A. Onashoga, et al., Mal-detect: an intelligent
Information & Knowledge Management. 2022: 3948-3952. visualization approach for malware detection, J. King Saud. Univ. Comput. Inf.
[51] Gibert, Daniel; Mateu, Carles; Planes, Jordi. (2019). [IEEE 2019 International Sci. 34 (5) (2022) 1968–1983.
Joint Conference on Neural Networks (IJCNN) - Budapest, Hungary (2019.7.14- [83] R. Vinayakumar, M. Alazab, K.P. Soman, et al., Robust intelligent malware
2019.7.19)] 2019 International Joint Conference on Neural Networks (IJCNN) - A detection using deep learning, IEEE Access 7 (2019) 46717–46738.
Hierarchical Convolutional Neural Network for Malware Classification., (.), 1–8. [84] D.O. Won, Y.N. Jang, S.W. Lee, PlausMal-GAN: Plausible malware training based
doi:10.1109/ijcnn.2019.8852469. on generative adversarial networks for analogous zero-day malware detection,
[52] Q. Wang, Q. Qian, Malicious code classification based on opcode sequences and IEEE Trans. Emerg. Top. Comput. 11 (1) (2022) 82–94.
textCNN network, J. Inf. Secur. Appl. 67 (2022) 103151. [85] Y. Chai, L. Du, J. Qiu, et al., Dynamic prototype network based on sample
[53] Q. Wang, Q. Qian, Malicious code classification based on opcode sequences and adaptation for few-shot malware detection, IEEE Trans. Knowl. Data Eng. 35 (5)
textCNN network, J. Inf. Secur. Appl. 67 (2022) 103151. (2022) 4754–4766.
[54] F.O. Catak, A.F. Yazı, O. Elezaj, et al., Deep learning based Sequential model for [86] X. Huang, L. Ma, W. Yang, et al., A method for windows malware detection based
malware analysis using Windows exe API Calls, PeerJ. Comput. Sci. 6 (2020) on deep learning[J], J. Signal Process. Syst. 93 (2021) 265–273.
e285. [87] W. Han, J. Xue, Y. Wang, et al., MalDAE: detecting and explaining malware based
[55] J. Bae, C. LeeEasy Data Augmentation for Improved Malware Detection: A on correlation and fusion of static and dynamic characteristics, Comput. Secur. 83
Comparative Study[C]//2021 IEEE International Conference on Big Data and (2019) 208–233.
Smart Computing (BigComp). IEEE, 2021: 214-218. [88] Liu Zixuan, Wang Chen, BiLSTM malicious code classification based on multi-
[56] F.O. Catak, A.F. Yazı, O. Elezaj, et al., Deep learning based Sequential model for feature fusion, in: Electronic Design Engineering, 30, 2022, pp. 67–72. DOI:
malware analysis using Windows exe API Calls, PeerJ. Comput. Sci. 6 (2020) 10.14022/j.issn1674-6236.2022.18.014.
e285. [89] W. Zhang, Y. Feng, G. Han, et al., A malicious code detection method based on FF-
[57] Xiaochen Liu, Research on deep learning detection model of malicious code based MICNN in the internet of things, Sensors 22 (22) (2022) 8739.
on text features, People’S. Public Secur. Univ. China (2022), https://doi.org/ [90] G. Shen, Z. Chen, H. Wang, et al., Feature fusion-based malicious code detection
10.27634/d.cnki.gzrgu.2022.000193. with dual attention mechanism and BiLSTM, Comput. Secur. 119 (2022) 102761.
[58] D. Demırcı, C. Acarturk, Static malware detection using stacked BiLSTM and GPT- [91] W. Zhong, F. Gu, A multi-level deep learning system for malware detection,
2, IEEE Access 10 (2022) 58488–58502. Expert Syst. Appl. 133 (2019) 151–162.
[59] J. Chen, J. Jiang, R. Li, Generating adversarial examples for static PE malware [92] H. Alasmary, A. Khormali, A. Anwar, et al., Analyzing and detecting emerging
detector based on deep reinforcement learning[C]//Journal of Physics: internet of things malware: a graph-based approach, IEEE Internet Thing sJ. 6 (5)
Conference Series. IOP Publishing, 2020, 1575(1): 012011. (2019) 8977–8988.
[60] Anderson H.S., Kharkar A., Filar B., et al. Learning to evade static pe machine [93] X. Ge, Y. Pan, Y. Fan, et al., AMDroid: android malware detection using function
learning malware models via reinforcement learning[J]. arXiv preprint arXiv: call graphs[C]//. 2019 IEEE 19th International Conference on Software Quality,
1801.08917, 2018. Reliability and Security Companion (QRS-C), IEEE, 2019, pp. 71–77.
[61] S.K.J. Rizvi, W. Aslam, M. Shahzad, et al., PROUD-MAL: static analysis-based [94] J. Bai, Q. Shi, S. Mu, A malware and variant detection method using function call
progressive framework for deep unsupervised malware classification of windows graph isomorphism, Secur. Commun. Netw. 2019 (2019) 1–12.
portable executable, Complex Intell. Syst. (2022) 1–13. [95] J. Liu, Y. Shen, H. Yan, Functions-based CFG embedding for malware homology
[62] F. Demirkıran, A. Çayır, U. Ünal, et al., An ensemble of pre-trained transformer analysis[C]. 2019 26th International Conference on Telecommunications (ICT),
models for imbalanced multiclass malware classification, Comput. Secur. 121 IEEE, 2019, pp. 220–226.
(2022) 102846.
H. Wang et al. Neurocomputing 598 (2024) 128010
[96] Shen Yuan, Yan Hanbing, Xia Chunhe, et al. A deep learning-based malicious code clone detection technique[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(02): 282-290. DOI: 10.13700/j.bh.1001-5965.2020.0400.
[97] Yang Ping, Shu Hui, Kang Fei, et al., A method for generating malicious code attack graphs based on semantic analysis, Comput. Sci. 48 (S1) (2021) 448–458+463.
[98] C. Zhang, Q. Zhou, Y. Huang, et al., Automatic detection of Android malware via hybrid graph neural network, Wirel. Commun. Mob. Comput. 2022 (2022).
[99] G. D’Angelo, F. Palmieri, A. Robustelli, A federated approach to Android malware classification through Perm-Maps, Clust. Comput. 25 (4) (2022) 2487–2500.
[100] K. Liu, Y. Fang, L. Zhang, et al., Malicious code clustering based on graph convolutional networks, J. Sichuan Univ. 56 (04) (2019) 654–660.
[101] Shan-Xi Li, Research on Self-optimizing Real-time Detection Technology of Unknown Malicious Code Based on Machine Learning, Lanzhou University, 2021. DOI: 10.27204/d.cnki.glzhu.2021.000051.
[102] L. Fang, Q. Wei, Z.H. Wu, et al., Neural network based similarity detection technique for binary functions, Comput. Sci. 48 (10) (2021) 286–293.
[103] P. Feng, L. Yang, D. Lu, et al., BejaGNN: behavior-based Java malware detection via graph neural network, J. Supercomput. 79 (14) (2023) 15390–15414.
[104] Y. Ding, X. Xia, S. Chen, et al., A malware detection method based on family behavior graph, Comput. Secur. 73 (2018) 73–86.
[105] D. Xue, J. Li, W. Wu, et al., Homology analysis of malware based on ensemble learning and multifeatures, PloS One 14 (8) (2019) e0211373.
[106] P. Feng, J. Ma, T. Li, et al., Android malware detection via graph representation learning, Mob. Inf. Syst. 2021 (2021) 1–14.
[107] Yang Pin, Zhu Yue, Zhang Lei, Classification of malicious code families based on attribute data flow graph, Inf. Secur. Res. 6 (03) (2020) 228–234.
[108] S. Gülmez, I. Sogukpinar, Graph-based malware detection using opcode sequences[C]//2021 9th International Symposium on Digital Forensics and Security (ISDFS), IEEE, 2021, pp. 1–5.
[109] W. Niu, Y. Wang, X. Liu, et al., GCDroid: Android malware detection based on graph compression with reachability relationship extraction for IoT devices, IEEE Internet Things J. (2023).
[110] Sun H., Shu H., Kang F., et al. ModDiff: Modularity Similarity-Based Malware Homologation Detection[J]. Electronics, 2023, 12(10): 2258. Huang X, Ma L, Yang W, et al. A method for windows malware detection based on deep learning[J]. Journal of Signal Processing Systems, 2021, 93: 265-273.
[111] A. Nappa, M.Z. Rafique, J. Caballero, The MALICIA dataset: identification and analysis of drive-by download operations, Int. J. Inf. Secur. 14 (2015) 15–33.
[112] Ronen, R., Radu, M., Feuerstein, C., et al., Microsoft malware classification challenge[J]. arXiv preprint arXiv:1802.10135, 2018.
[113] Ö. Aslan, A.A. Yilmaz, A new malware classification framework based on deep learning algorithms[J], IEEE Access 9 (2021) 87936–87951.
[114] Noever, David, and Samantha E. Miller Noever. "Virus-MNIST: A benchmark malware dataset." arXiv preprint arXiv:2103.00602 (2021).
[115] Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K., & Siemens, C.E.R.T. (2014, February). Drebin: Effective and explainable detection of android malware in your pocket. In NDSS (Vol. 14, pp. 23-26).
[116] A. Huertas Celdrán, P.M. Sánchez Sánchez, F. Sisi, et al., Creation of a Dataset Modeling the Behavior of Malware Affecting the Confidentiality of Data Managed by IoT Devices[M]//Robotics and AI for Cybersecurity and Critical Infrastructure in Smart Cities, Springer International Publishing, Cham, 2022, pp. 193–225.
[117] R. Bala, R. Nagpal, A review on KDD Cup99 and NSL-KDD dataset, Int. J. Advanced Res. Comput. Sci. 10 (2019) 2.
[118] Morales-Molina C.D., Santamaria-Guerrero, D., Sanchez-Perez, G., et al., Methodology for malware classification using a random forest classifier[C]//2018 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC). IEEE, 2018: 1-6.
[119] Virustotal. Virustotal. n.d. Web. Accessed March 18, 2024 〈https://www.virustotal.com/〉.
[120] Anderson H.S., Roth P., Ember: an open dataset for training static PE malware machine learning models[J]. arXiv preprint arXiv:1804.04637, 2018.
[121] MalShare. MalShare. n.d. Web. Accessed October 17, 2023 〈https://malshare.com/〉.
[122] VirusShare. VirusShare. n.d. Web. Accessed October 17, 2023 〈https://virusshare.com/〉.
[123] VirusSign. n.d. Web. Accessed October 17, 2023 〈https://www.virussign.com/〉.
[124] Bazaar. Bazaar. n.d. Web. Accessed October 17, 2023 〈https://bazaar.abuse.ch/browse/VirusSign〉.
[125] H. Wang, J. Si, H. Li, et al., Rmvdroid: towards a reliable android malware dataset with app metadata[C]//2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), IEEE, 2019, pp. 404–408.
[126] Catak F.O., Yazı A.F., A benchmark API call dataset for windows PE malware classification[J]. arXiv preprint arXiv:1905.01999, 2019.
[127] Wei F., Li Y., Roy S., et al., Deep ground truth analysis of current android malware[C]//Detection of Intrusions and Malware, and Vulnerability Assessment: 14th International Conference, DIMVA 2017, Bonn, Germany, July 6-7, 2017, Proceedings 14. Springer International Publishing, 2017: 252-276.
[128] Li, Y., Jang, J., Hu X., et al., Android malware clustering through malicious payload mining[C]//Research in Attacks, Intrusions, and Defenses: 20th International Symposium, RAID 2017, Atlanta, GA, USA, September 18–20, 2017, Proceedings. Springer International Publishing, 2017: 192-214.
[129] Lindorfer, M., Neugschwandtner M., Platzer C., Marvin: Efficient and comprehensive mobile app classification through static and dynamic analysis[C]//2015 IEEE 39th annual computer software and applications conference. IEEE, 2015, 2: 422-433.
[130] Abdul Kadir A.F., Stakhanova N., Ghorbani A.A. Android botnets: What urls are telling us[C]//Network and System Security: 9th International Conference, NSS 2015, New York, NY, USA, November 3-5, 2015, Proceedings 9. Springer International Publishing, 2015: 78-91.
[131] Davide Maiorca, et al., Stealth attacks: An extended insight into the obfuscation effects on android malware, Comput. Secur. 51 (2015) 16–31.
[132] White, D., NIST national software reference library (NSRL)[C]//Mid-Atlantic Chapter HTCIA Meeting. 2005.
[133] Kumar S., Mishra D., Panda, B., et al., AndroOBFS: time-tagged obfuscated Android malware dataset with family information[C]//Proceedings of the 19th International Conference on Mining Software Repositories. 2022: 454-458.
[134] Mallya, A., Davis D., Lazebnik S., Piggyback: Adapting a single network to multiple tasks by learning to mask weights[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 67-82.
[135] Allix, K., Bissyandé, T.F., Klein, J., et al., Androzoo: Collecting millions of android apps for the research community[C]//Proceedings of the 13th international conference on mining software repositories. 2016: 468-471.

Huijuan Wang was born in Dacheng, Hebei, China, in 1982. She received the B.S. and M.S. degrees in computer science and technology from Nankai University, China in 2005 and 2008. She received the PhD degree from Hebei University of Technology, China in 2019. She is currently a Professor at North China Institute of Aerospace Engineering. She has published more than 20 papers. Her research interests include computer vision, pattern recognition and deep learning.

Boyan Cui was born in Cangzhou City, Hebei Province, China in 1999. She received a bachelor’s degree in engineering from the Computer Department of North China Institute of Aerospace Engineering in 2018. She is currently studying for a master’s degree in the Computer Department of North China Institute of Aerospace Engineering. Research interests include malware detection and deep learning.

Quanbo Yuan was born in Feixiang, Hebei, China, in 1984. He received the B.S. degree in computer science and technology from North China Institute of Aerospace Engineering, China in 2008. He received the M.S. degree in computer science and technology from Tianjin Polytechnic University, China in 2019. He is working on a PhD degree in computer science at Tianjin University. He is currently an Associate Researcher at North China Institute of Aerospace Engineering. He has published more than 10 papers. His research interests include computer vision, artificial intelligence and deep learning.
Ruonan Shi was born in Xingtai City, Hebei Province in 2002. He is currently studying at North China Institute of Aerospace Engineering and will pursue a Master’s degree in Engineering from Tianjin University of Technology in 2024. His research areas include network security, artificial intelligence, and data mining.

Mengying Huang was born in Hunan, China, in 1994. She received the B.E. and the M.S. degrees in computer science and technology from North China Electric Power University, China in 2016 and 2019. She is currently a teaching assistant at North China Institute of Aerospace Engineering. Her research interests include fault diagnosis, pattern recognition and data mining.