Research Paper
Abstract: In recent years, the rapid proliferation of sophisticated malware has necessitated advanced detection techniques. This paper presents a novel deep learning approach for multi-class malware detection by leveraging a modified Graph Neural Network. Specifically, DeeperGCN, a cutting-edge deep learning model, is employed to enhance the feature extraction capabilities. Recognizing the imperative need for transparency in machine learning models, especially in cybersecurity, this approach integrates Explainable AI (XAI) principles using the Grad-CAM method, allowing for the interpretability of model decisions. A critical step in the methodology involved the merging of diverse datasets: both ASM (Assembly) and BYTE files. This comprehensive dataset amalgamation ensures the capture of intricate malware behaviors, thus bolstering the accuracy and reliability of the detection system. The approach balances high detection performance, with accuracy up to 97%, against interpretability, so that stakeholders can intuitively understand and trust the model's predictions, paving the way for more robust and accountable malware detection systems in the future.

Keywords: Malware detection, deep learning, DeeperGCN, Explainable AI.

I. INTRODUCTION
In today's digital age, the reliability and security of computing systems have become paramount. As dependence on digital ecosystems expands, so does the prevalence of malware: malicious software crafted to harm, disrupt, or illicitly access computer systems. These threats are ever-evolving, continually adapting to the latest security measures, and often leave behind a trail of significant damage, both financial and operational. Consequently, the need to detect and nullify malware in its myriad forms is of utmost importance.

Historically, the most common approach to malware detection has been the signature-based method. This approach relies on a database of known malware signatures, patterns of code unique to particular malware strains. When incoming software or files match a signature from the database, they are flagged as malicious. However, the Achilles heel of this method is evident: it primarily detects only known malware strains. With the ceaseless emergence of novel malware variants, many of which employ polymorphic or metamorphic techniques to alter their code and thus evade signature-based systems, this method often falls short. Furthermore, signature-based tools frequently offer little insight into the rationale behind their detections, which makes it difficult to comprehend why specific software has been flagged or to understand the nature of the threat.

Deep learning, a subset of machine learning, has witnessed tremendous success across various domains, from image recognition to natural language processing. By automatically extracting intricate patterns from vast amounts of data, deep learning models hold significant promise in the realm of malware detection. In particular, they can potentially detect malware not just based on fixed signatures but on behavioural and structural patterns, accommodating the dynamic nature of the malware landscape. However, one of the pressing concerns with deep learning models is their 'black box' nature. While they might produce impressive results, they often do so without elucidating the underlying reasoning. In critical applications such as malware detection, where the stakes are high, understanding the 'why' behind a decision is as crucial as the decision itself. The emerging field of Explainable AI (XAI) seeks to address this concern by making AI models more interpretable and transparent.

Within this context, this research introduces a novel approach: combining the power of DeeperGCN with Explainable AI for multi-class malware detection. DeeperGCN, which has gained traction due to its ability to process data represented in graph form, is particularly suitable for analysing certain malware datasets. For instance, files such as .bytes and .asm, which are often used to represent malware samples, can be modelled as graphs, making them amenable to DeeperGCN-based analysis. By melding DeeperGCN with XAI, the proposed methodology aspires not only to detect a diverse range of malware types accurately but also to offer clear insights into the decision-making process, ensuring that users can trust and understand the detection outcomes.

Current signature-based methods often struggle to identify previously unseen malware variants, and their lack of transparency hinders the understanding of detection outcomes. Prior work has mostly addressed binary classification and operated on binary text datasets. To tackle these challenges, this research aims to develop a deep learning-based approach for multi-class malware detection while incorporating Explainable AI techniques and applying a graph neural network algorithm to enhance the interpretability and trustworthiness of the detection process.

II. LITERATURE REVIEW
Authors explore advanced techniques for malware detection. They harness the power of deep learning, an evolving computational method that excels in pattern
Fig. 1. Flowchart of GRAD-CAM to integrate Explainable AI into DeeperGCN for malware image classification (steps shown: collection of asm/byte file dataset; preprocessing of dataset; applying DeeperGCN architecture; Explainable AI integration using GRAD-CAM; multiclass classification of malware images).
Fig. 4. Flow diagram of malware byte-to-image conversion for malware classification (stage shown: merging 300 byte bigram, 200 opcode bigram, 200 opcode trigram, tetragram, and first 200 image pixels).
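The n-gram features and the image sizing named in this part of the paper can be illustrated with a small sketch. It is not the authors' pipeline: it uses `collections.Counter` from the Python standard library as a stand-in for a count-vectorizer matrix, the helper names (`ngram_counts`, `project`, `image_shape`) and the toy tokens are hypothetical, and the near-square image sizing is an assumption about how the dimensions are derived from the file size.

```python
from collections import Counter
from math import ceil, sqrt


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a sequence of byte or opcode tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def project(counts, vocabulary):
    """Map n-gram counts onto a fixed, ordered vocabulary (one column each)."""
    return [counts.get(gram, 0) for gram in vocabulary]


def image_shape(num_bytes):
    """Assumed sizing rule: width near sqrt(size), enough rows to fit all bytes."""
    width = max(1, int(sqrt(num_bytes)))
    height = ceil(num_bytes / width)
    return height, width


# Toy inputs (hypothetical, not taken from the paper's dataset).
byte_tokens = ["4D", "5A", "90", "00", "4D", "5A"]
opcodes = ["push", "mov", "call", "push", "mov", "ret"]

byte_bigrams = ngram_counts(byte_tokens, 2)
opcode_bigrams = ngram_counts(opcodes, 2)
opcode_trigrams = ngram_counts(opcodes, 3)

# A fixed vocabulary keeps the feature vector length identical across files.
byte_vocab = [("4D", "5A"), ("5A", "90"), ("00", "4D")]
opcode_vocab = [("push", "mov"), ("mov", "call")]

# Merged feature vector: byte bigram columns followed by opcode bigram columns.
features = project(byte_bigrams, byte_vocab) + project(opcode_bigrams, opcode_vocab)
print(features)          # [2, 1, 1, 2, 1]
print(image_shape(10))   # (4, 3)
```

In a real setting the vocabularies would hold the most frequent n-grams across the corpus (for example the 300 byte bigrams and 200 opcode bigrams mentioned in the figure), so that every file yields a vector of the same length.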
Byte Features Extraction:
• Perform bigram and trigram combination on the byte data.
• Use CountVectorizer to create a matrix from the bigram data.
ASM Features Extraction:
• Extract opcode sequences from the ASM files and represent them as strings.
• Generate opcode bi-grams, tri-grams, and tetra-grams.
• Save these opcode n-grams in npz format.
Image Creation:
• Determine the dimensions of the image based on the size of the ASM files.
• Calculate the width of the image as the square root of the file size.
• Reshape the image to match the calculated dimensions.
Final Feature Set:
• Combine the various extracted features, including byte bigram, opcode bi-gram, opcode trigram, opcode tetra-gram, and the ASM images (represented as pixel values).
• Convert pixel representations into actual images.
• Categorize the images into their respective classes based on the data they represent.

Table 1 shows the number of images generated for each malware type.

Table 1. Malware types with their labels
Label | Malware Type | No. of Images
0 | Adposhel | 360
1 | Fakeran | 306
2 | Hlux | 350
3 | Regrun | 350
4 | Snarasite | 351
5 | VBA | 350
6 | Vilsel | 350
Total Images | | 3407

C. Model Building
The algorithms used for developing the model are as follows.

C.1. Graph Convolution Network: Graph Convolutional Networks (GCNs) are specialized neural networks designed for graph-structured data. They harness the inherent relationships between nodes, utilizing graph convolution operations to iteratively aggregate information from neighboring nodes. This iterative process enhances representation learning and enables effective capture of intricate dependencies within graph data. While a simple GCN employs a single graph convolution layer, more advanced architectures like DeeperGCN address challenges such as over-smoothing and vanishing gradients, contributing to their adaptability. GCNs exhibit success in diverse applications, from social network analysis to recommendation systems and biological network analysis. Their capacity to handle non-Euclidean, irregularly structured data underscores their versatility in tackling real-world problems entrenched in complex relational dependencies.

Fig. 5 demonstrates the overall flow of the DeeperGCN architecture, from the input image to the output prediction over 7 classes.
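To make the idea of iterative neighbour aggregation with skip connections concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the model used in this paper: there are no learned weights, the update rule h_i' = h_i + ReLU(AGG(neighbours)) is one simple way to combine aggregation with a residual link, and the function names and the toy graph are hypothetical.

```python
import math


def aggregate(features, neighbors, node, mode="mean"):
    """Combine neighbour feature vectors element-wise (sum, mean or max)."""
    vectors = [features[j] for j in neighbors[node]] or [features[node]]
    columns = list(zip(*vectors))
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    return [sum(c) / len(c) for c in columns]


def relu(vector):
    return [max(0.0, x) for x in vector]


def residual_layer(features, neighbors, mode="mean"):
    """Assumed update: h_i' = h_i + ReLU(AGG({h_j : j in N(i)}))."""
    updated = []
    for i, h in enumerate(features):
        message = relu(aggregate(features, neighbors, i, mode))
        updated.append([a + b for a, b in zip(h, message)])
    return updated


def readout(features):
    """Mean-pool all node vectors into one graph-level vector."""
    return [sum(c) / len(c) for c in zip(*features)]


def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy graph: a path 0 - 1 - 2 with two features per node (hypothetical).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

for _ in range(2):  # stack two layers
    h = residual_layer(h, neighbors)

probs = softmax(readout(h))
print(h)      # [[2.0, 2.5], [2.0, 3.0], [2.0, 3.5]]
print(probs)  # two class probabilities that sum to 1
```

Because each layer adds its message to the existing node vector instead of replacing it, the original node information survives stacking, which is the intuition behind using skip connections to train deeper graph networks.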
Fig. 5. Flow of the DeeperGCN architecture (steps shown: superpixel construction; represent each pixel as superpixel node; define edges between superpixels (adjacency matrix); assign features to each node; stack multiple GCN layers; introduction of skip connections between layers; aggregation functions).
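The edge-definition step of this flow, in which superpixels become nodes and an adjacency matrix is built from a spatial threshold as in equation (3), can be sketched as follows. The centroid coordinates, the value of the fraction parameter, and the function names are hypothetical and chosen only for illustration.

```python
import math


def spatial_threshold(width, height, k, fraction):
    """Equation (3): fraction * image diagonal / sqrt(number of superpixels)."""
    diagonal = math.sqrt(width ** 2 + height ** 2)
    return fraction * diagonal / math.sqrt(k)


def build_adjacency(centroids, threshold):
    """Connect two superpixels when their centroids are closer than threshold."""
    n = len(centroids)
    adjacency = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            if math.dist(centroids[a], centroids[b]) < threshold:
                adjacency[a][b] = adjacency[b][a] = 1
    return adjacency


# Hypothetical centroids of four superpixels in a 30 x 40 image.
centroids = [(5, 5), (15, 5), (5, 30), (25, 35)]
threshold = spatial_threshold(width=30, height=40, k=4, fraction=0.5)
adjacency = build_adjacency(centroids, threshold)

print(threshold)  # 12.5
print(adjacency)  # [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

A larger fraction yields a denser graph; in this toy case only the first two superpixels are close enough to share an edge, and the other two remain isolated.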
Figure. Steps of SLIC superpixel construction (steps shown: initialization and spatial clustering; color and spatial arrangement; cluster centre update; repeat process until centre movement stops; superpixel mapping).
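The SLIC steps (grid initialization, assignment by a combined color and spatial distance, and center update) can be sketched in plain Python as below. This is an illustrative toy, not the segmentation code used in this work: pixels are represented as hypothetical (R, G, B, x, y) tuples, and the weight m / s on the spatial term follows the standard SLIC formulation, which is an assumption here.

```python
import math


def init_centers(width, height, k):
    """Equation (1): place sqrt(k) x sqrt(k) centers on a regular grid."""
    side = int(math.sqrt(k))
    return [(i * width / side, j * height / side)
            for i in range(side) for j in range(side)]


def slic_distance(pixel, center, m, s):
    """Color distance plus compactness-weighted spatial distance (assumed m / s)."""
    r_p, g_p, b_p, x_p, y_p = pixel
    r_c, g_c, b_c, x_c, y_c = center
    d_color = math.sqrt((r_p - r_c) ** 2 + (g_p - g_c) ** 2 + (b_p - b_c) ** 2)
    d_space = math.sqrt((x_p - x_c) ** 2 + (y_p - y_c) ** 2)
    return d_color + (m / s) * d_space


def assign(pixels, centers, m, s):
    """Label each pixel with the index of its nearest cluster center."""
    return [min(range(len(centers)),
                key=lambda c: slic_distance(p, centers[c], m, s))
            for p in pixels]


def update_centers(pixels, labels, centers):
    """Move each center to the mean color and position of its assigned pixels."""
    updated = []
    for c, old in enumerate(centers):
        members = [p for p, label in zip(pixels, labels) if label == c]
        if not members:
            updated.append(old)  # keep an empty cluster where it was
            continue
        updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return updated


# Hypothetical pixels as (R, G, B, x, y) and two starting centers.
pixels = [(255, 0, 0, 0, 0), (250, 5, 0, 1, 0), (0, 0, 255, 8, 8), (5, 0, 250, 9, 8)]
centers = [(255, 0, 0, 0, 0), (0, 0, 255, 9, 9)]

labels = assign(pixels, centers, m=10, s=5)
new_centers = update_centers(pixels, labels, centers)

print(init_centers(8, 8, 4))  # [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]
print(labels)                 # [0, 0, 1, 1]
print(new_centers[0])         # (252.5, 2.5, 0.0, 0.5, 0.0)
```

In a full run, `assign` and `update_centers` would alternate until the centers stop moving, and the final `labels` list plays the role of the superpixel map.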
It includes two main steps:

1. Image to Superpixel (Node) Conversion:
Simple Linear Iterative Clustering (SLIC) is a superpixel segmentation algorithm that transforms images into superpixels. It combines K-means clustering with a compactness term that controls the influence of spatial proximity, which is important for pixel grouping and for efficient computer vision tasks. SLIC clusters pixels based on color and spatial proximity, providing a structured representation that preserves finer details and enhances computational efficiency in subsequent image processing and analysis. By grouping pixels into perceptually meaningful superpixels, SLIC facilitates tasks such as segmentation, object recognition, and feature extraction.

For initialization, the number of clusters K is defined and the cluster centers are placed on a regular grid over the W x H image, where W is the width and H is the height of the image, as given in equation (1):

$$(x, y) = \left(i \cdot \frac{W}{\sqrt{K}},\; j \cdot \frac{H}{\sqrt{K}}\right) \tag{1}$$

In the color and spatial arrangement step, the algorithm iterates through the pixels and assigns each pixel to the cluster whose center is nearest in both color space and spatial space. The distance between a pixel p and a cluster center c is defined by equation (2):

$$D(p, c) = \sqrt{(R_p - R_c)^2 + (G_p - G_c)^2 + (B_p - B_c)^2} + \frac{m}{S}\sqrt{(x_p - x_c)^2 + (y_p - y_c)^2} \tag{2}$$

where $R_p, G_p, B_p$ are the RGB color values of pixel p, $R_c, G_c, B_c$ are the RGB color values of cluster center c, $x_p, y_p$ are the spatial coordinates of pixel p, $x_c, y_c$ are the spatial coordinates of cluster center c, m is the compactness parameter, K is the number of superpixels or clusters, and $S = \sqrt{WH/K}$ is the grid interval between initial cluster centers.

The formula combines two metrics, color similarity and spatial proximity; the factor m/S defines the normalization by the compactness term.

In the cluster center update step, each center is updated to the mean color and mean spatial position of the pixels assigned to that cluster. This step is repeated until the cluster centers stop moving appreciably. The superpixel map then assigns each pixel in the image to its corresponding superpixel label.

2. Adjacency Matrix and Edges
Once the image has been converted into superpixels, the superpixels are treated as the nodes of the graph, and each superpixel is connected by an edge to the superpixels closest to it. The spatial threshold that decides closeness can be set by the user or computed with equation (3):

$$\text{Spatial Threshold} = \frac{\text{Fraction} \times \text{Diagonal}}{\sqrt{K}} \tag{3}$$

where Fraction is a user-defined parameter representing the fraction of the image size, Diagonal is the length of the diagonal of the image, calculated as $\sqrt{W^2 + H^2}$, and K is the number of superpixels. If the distance between two superpixels is less than the spatial threshold, an edge is placed between them; otherwise no edge is added.

C.2.2. Apply DeeperGCN on Graph Representation
Message passing between a node v and its neighbours u, the vertices of the graph G = (V, E), at the l-th layer is given by equations (4), (5), and (6):

$$m_{vu}^{(l)} = \rho^{(l)}\left(f_v^{(l)}, f_u^{(l)}, f_{e_{vu}}^{(l)}\right), \quad u \in N(v) \tag{4}$$

$$m_v^{(l)} = \delta^{(l)}\left(\left\{ m_{vu}^{(l)} \mid u \in N(v) \right\}\right) \tag{5}$$

$$f_v^{(l+1)} = \phi^{(l)}\left(f_v^{(l)}, m_v^{(l)}\right) \tag{6}$$

where $\rho^{(l)}$, $\delta^{(l)}$, and $\phi^{(l)}$ are differentiable functions for message construction, message aggregation, and vertex update at the l-th layer, $f_v$ and $f_u$ are the vertex features, and $f_{e_{vu}}$ are the features of the edge between them, used to construct a unique message for each neighbour $u \in N(v)$.

Additionally, DeeperGCN introduces skip connections, which provide direct links from the input of one layer to the output of a deeper layer. This approach mitigates the vanishing gradient problem, promoting smoother training of deeper architectures. The updated representation of a node with skip connections is the sum of its existing representation and the information skipped from the previous layer, as shown in equation (7):

$$h_i' = h_i + \mathrm{Skip\_conn}\left(h_i^{\text{prev}}\right) \tag{7}$$

where $\mathrm{Skip\_conn}(h_i^{\text{prev}})$ represents the information skipped from the previous layer.

The aggregation function, pivotal in graph convolution, combines information from neighbouring nodes. For each node, the updated representation is determined by aggregating features from the surrounding neighbourhood through a differentiable generalized aggregation function followed by an activation function, as in equation (8):

$$h_i' = \sigma\left(\mathrm{AGG}\left(\left\{ h_j, \forall j \in N(i) \right\}\right)\right) \tag{8}$$

where $h_i'$ is the updated representation of node i, $\sigma$ is the activation function, and $N(i)$ is the neighbourhood of node i. Aggregation functions include sum, min, and max.

The graph readout pooling operation in equation (9) plays a pivotal role in capturing global patterns within the graph:

$$h_{graph} = \mathrm{POOLING}\left(\left\{ h_j, \forall j \in \text{Graph} \right\}\right) \tag{9}$$

The graph readout pooling operation aggregates node representations to obtain a graph-level representation. A common approach is global pooling, where the graph-level representation $h_{graph}$ is obtained by applying a pooling function (e.g., mean or max pooling) to all node representations. A softmax function is then added at the end so that the model can be trained for classification.

D. Training and Testing of Model
This work employs a deeper Graph Convolutional Network (GCN) model for multi-class malware detection using image data. The model's training involves multiple epochs, with a key focus on monitoring validation accuracy to ensure effective generalization. During testing, the model's performance is evaluated using accuracy, and Explainable AI techniques are applied to enhance transparency.

E. Explainable AI:
Grad-CAM is a technique used to visualize and understand the attention of convolutional neural networks (CNNs). It generates heatmaps that highlight the regions of an input image that contribute most to the model's predictions. By combining the gradients of the target class with respect to the feature maps, Grad-CAM produces a weighted combination of the feature maps, emphasizing the regions relevant to the target class. This helps in interpreting and explaining the decision-making process of CNN models. Both CAM and Grad-CAM operate under the fundamental assumption that the final score $Y^c$ for a particular class c can be represented as a linear combination of the global average-pooled feature maps $A^k$ extracted from the last convolutional layer, as expressed in equation (10):

$$Y^c = \sum_k w_k^c \, \frac{1}{Z} \sum_i \sum_j A_{ij}^k \tag{10}$$

For a feature map $A^k$ and class c, the weight $w_k^c$ is given by equation (11):

$$w_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial Y^c}{\partial A_{ij}^k} \tag{11}$$

where Z is the number of spatial positions in the feature map.

Modifying Grad-CAM: Fig. 7 illustrates Grad-CAM, in which the CNN layers are stacked together to perform accurate localization. By changing the number of layers in the CNN, the accuracy of localization can be improved. Different CNN architectures are used, such as VGG16 with 16 layers, ResNet50 with 50 layers, and MobileNetV2 with 53 layers.
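As an illustration of the Grad-CAM weighting in equation (11), the sketch below computes one weight per feature map as the spatial mean of the class-score gradients and then forms the weighted sum of the feature maps. The activations and gradients are hypothetical numbers assumed to be precomputed by a deep learning framework, and the final ReLU is the usual Grad-CAM post-processing step rather than something stated in the equations of this paper.

```python
def gradcam(feature_maps, gradients):
    """Return (weights, heatmap) for one class.

    feature_maps[k][i][j] is the activation A^k_ij of the last conv layer and
    gradients[k][i][j] is dY^c / dA^k_ij; both are assumed precomputed.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    z = rows * cols
    # Equation (11): one weight per feature map, the spatial mean gradient.
    weights = [sum(sum(row) for row in grad) / z for grad in gradients]
    heatmap = [[0.0] * cols for _ in range(rows)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(rows):
            for j in range(cols):
                heatmap[i][j] += w * fmap[i][j]
    # ReLU keeps only regions that push the class score up (assumed step).
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]
    return weights, heatmap


# Two hypothetical 2 x 2 feature maps and their gradients for one class.
feature_maps = [[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 2.0], [2.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],
             [[-0.25, -0.25], [-0.25, -0.25]]]

weights, heatmap = gradcam(feature_maps, gradients)
print(weights)  # [0.5, -0.25]
print(heatmap)  # [[0.5, 0.0], [0.0, 0.5]]
```

The second feature map receives a negative weight, so the positions where only it is active are suppressed to zero in the heatmap, while the diagonal positions driven by the first map remain highlighted.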
Fig. 7. Grad-CAM visualization (panel labels: Original Image, CNN, VGG 16, ResNet 50).
Table 3. Comparison of existing and proposed algorithms
Algorithm | Precision | Recall | F1 score | Validation accuracy | Pros | Cons
JK-GraphSAGE | 93.10% | 94.20% | 93.60% | 0.9491 | Flexible and can learn complex patterns in graph data | Can be computationally expensive to train
GIN | 92.50% | 93.70% | 93.10% | 0.9307 | Simple to implement and train | Can be less interpretable than other models
SGC | 91.30% | 92.50% | 91.90% | 0.9479 | Efficient to train and infer | May not be able to learn complex patterns in graph data
JK-GCN | 89.80% | 91.00% | 90.40% | 0.8941 | Simple and effective | Can be less interpretable than other models
JK-GIN | 92.90% | 94.10% | 93.50% | 0.9369 | Combines the advantages of GIN and JK-Net | Can be computationally expensive to train
Deeper-GCN (Proposed Model) | 96% | 95% | 95.40% | 0.9704 | High accuracy, precision, recall, and F1 score; more interpretable than other GNN models | Can be computationally expensive to train
Snarasite, a portmanteau of 'snare' and 'parasite,' suggests an insidious nature. VBA exploits Microsoft Office macros, targeting document processing vulnerabilities. Lastly, Vilsel, though less familiar, demands scrutiny for potential emergent threats. Table 3 compares the proposed algorithm with existing algorithms from previous research work.

V. SCOPE OF PROJECT
The project, "Multi-Class Malware Detection using modified GNN and Explainable AI," holds significant scope in the realms of cybersecurity and artificial intelligence. It primarily aims to create an effective malware detection system capable of identifying various types of malware, including viruses, trojans, worms, and ransomware, using deep learning techniques. This entails the investigation and customization of deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), as well as the incorporation of Graph Neural Networks (GNNs) to enhance the resilience of malware analysis. Additionally, the project focuses on achieving multi-class classification, making model decisions transparent through Explainable AI, conducting feature engineering, and preparing diverse datasets. It aims to deploy the system for real-time detection, address scalability challenges, and consider ethical and legal aspects. The project's findings can contribute to research in the field, and engagement with the cybersecurity community can enhance its impact.

VI. CONCLUSION
This work offers a comprehensive and forward-looking approach to the critical challenges of malware detection in the realms of cybersecurity and artificial intelligence. A deep learning system using a modified Graph Neural Network was developed for multi-class malware detection, ensuring accurate identification of various malware types. Explainable AI is integrated for transparent decision-making. The work involves feature engineering, robust evaluation, and scalability for real-time detection, and addresses ethical and legal concerns.

REFERENCES
[1] K. Shaukat, S. Luo, V. Varadharajan, "A novel deep learning-based approach for malware detection," Engineering Applications of Artificial Intelligence, vol. 122, p. 106030, 2023, ISSN: 0952-1976.
[2] M.S. Akhtar, T. Feng, "Malware Analysis and Detection Using Machine Learning Algorithms," Symmetry, vol. 14, p. 2304, 2022.
[3] R. Vinayakumar, M. Alazab, K.P. Soman, P. Poornachandran, S. Venkatraman, "Robust Intelligent Malware Detection Using Deep Learning," IEEE Access, vol. 7, pp. 46717-46738, 2019.
[4] E.S. Alomari, R.R. Nuiaa, Z.A.A. Alyasseri, H.J. Mohammed, N.S. Sani, M.I. Esa, B.A. Musawi, "Malware Detection Using Deep Learning and Correlation-Based Feature Selection," Symmetry, vol. 15, p. 123, 2023.
[5] R. Ali, A. Ali, F. Iqbal, M. Hussain, F. Ullah, "Deep Learning Methods for Malware and Intrusion Detection: A Systematic Literature Review," Security and Communication Networks, vol. 2022, Article ID 2959222, pp. 1-31, 2022.
[6] Z. Hou, Y. Wang, W.L. Hsu, H. Zheng, "MALCOM: A MALware detection framework using COMbination of deep learning and graph-based techniques," Future Generation Computer Systems, vol. 91, pp. 56-66, 2019.
[7] A. Mushtaq, S. Khan, A. Gani, "Graph Convolutional Networks for Malware Detection," IEEE Access, vol. 7, pp. 148703-148714, 2019. DOI: 10.1109/access.2019.2948648.
[8] K. Jain, S. Kumar, "Explainable and Interpretable Intent-based Analysis For Malware Detection," arXiv preprint arXiv:2006.00570, 2020.
[9] G.S. Pudlo, B.X. Furquim, J. Granatyr, "DeepMal: A Deep Learning Approach for Malware Classification Exploiting Recurrent Neural Networks and Word Embeddings," in Proceedings of the 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), pp. 477-482.
[10] H.P. Nguyen, Y. Nakamichi, "Static Malware Detection Using Deep Learning on Operation Sequence," in Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 816-821. DOI: 10.1109/icmla.2018.00142.