
Multi-Class Malware Detection using modified GNN and Explainable AI

Premanand Ghadekar, Tejas Adsare, Neeraj Agrawal, Dhananjay Deore, Tejas Dharmik
Department of Information Technology, Vishwakarma Institute of Technology, Pune, India

Abstract: In recent years, the rapid proliferation of sophisticated malware has necessitated advanced detection techniques. This paper presents a novel deep learning approach for multi-class malware detection that leverages a modified Graph Neural Network. Specifically, DeeperGCN, a cutting-edge deep learning model, is employed to enhance the feature extraction capabilities. Recognizing the need for transparency in machine learning models, especially in cybersecurity, the approach integrates Explainable AI (XAI) principles using the Grad-CAM method, which makes model decisions interpretable. A critical step in the methodology is the merging of two datasets: ASM (assembly) and BYTE files. This merged dataset captures intricate malware behaviors and thus bolsters the accuracy and reliability of the detection system. The approach strikes a balance between a high detection rate, with accuracy up to 97%, and predictions that stakeholders can intuitively understand and trust, paving the way for more robust and accountable malware detection systems in the future.

Keywords: Malware detection, deep learning, DeeperGCN, Explainable AI.

I. INTRODUCTION
In today's digital age, the reliability and security of computing systems have become paramount. As dependence on digital ecosystems expands, so does the prevalence of malware, that is, malicious software crafted to harm, disrupt, or illicitly access computer systems. These threats are ever-evolving, continually adapting to the latest security measures, and often leave behind significant financial and operational damage. Consequently, the need to detect and nullify malware in its many forms is of utmost importance. Historically, the most common approach to malware detection has been the signature-based method. This approach relies on a database of known malware signatures, which are patterns of code unique to particular malware strains. When incoming software or files match a signature from the database, they are flagged as malicious. However, the Achilles heel of this method is evident: it primarily detects only known malware strains. With the ceaseless emergence of novel malware variants, many of which employ polymorphic or metamorphic techniques to alter their code and thus evade signature-based systems, this method often falls short.

Furthermore, signature-based tools frequently offer little insight into the rationale behind their detections, which makes it difficult to comprehend why specific software has been flagged or to understand the nature of the threat. Deep learning, a subset of machine learning, has witnessed tremendous success across various domains, from image recognition to natural language processing. By automatically extracting intricate patterns from vast amounts of data, deep learning models hold significant promise for malware detection. In particular, they can potentially detect malware based not just on fixed signatures but on behavioural and structural patterns, accommodating the dynamic nature of the malware landscape. However, one of the pressing concerns with deep learning models is their 'black box' nature. While they might produce impressive results, they often do so without elucidating the underlying reasoning. In critical applications such as malware detection, where the stakes are high, understanding the 'why' behind a decision is as crucial as the decision itself. The emerging field of Explainable AI (XAI) seeks to address this concern by making AI models more interpretable and transparent.

Within this context, this research introduces a novel approach: combining the power of DeeperGCN with Explainable AI for multi-class malware detection. DeeperGCN, which has gained traction due to its ability to process data represented in graph form, is particularly suitable for analysing certain malware datasets. For instance, files like .bytes and .asm, which are often used to represent malware samples, can be modelled as graphs, making them amenable to DeeperGCN-based analysis. By melding DeeperGCN with XAI, the proposed methodology aspires not only to detect a diverse range of malware types accurately but also to offer clear insights into the decision-making process, so that users can trust and understand the detection outcomes. Current signature-based methods often struggle to identify previously unseen malware variants, and their lack of transparency hinders the understanding of detection outcomes. Existing approaches have mostly been implemented for binary classification and mostly operate on binary text datasets. To tackle these challenges, this research aims to develop a deep learning-based approach for multi-class malware detection that incorporates Explainable AI techniques and applies a graph neural network algorithm to enhance the interpretability and trustworthiness of the detection process.

II. LITERATURE REVIEW
Authors explore advanced techniques for malware detection. They harness the power of deep learning, an evolving computational method that excels in pattern



recognition. Furthermore, they incorporate correlation-based feature selection to enhance the robustness of their model. This dual approach aims to address the increasing sophistication of cyber threats. Their research underscores the necessity of employing multifaceted strategies in cybersecurity, advocating for a fusion of artificial intelligence and statistical methodologies to ensure optimal malware detection and prevention [1].

A deep learning-based malware detection system was presented by Jeon in 2020. In this work, opcode sequences taken from Windows executable files were transformed using a convolutional encoder. Recurrent neural networks (RNNs) were then applied specifically to virus detection tasks. The method achieved a 96% true positive rate and 96% detection accuracy [2].

This study introduces a malware detection technique based on deep learning (DLMD) that relies on static methods to categorize various malware families. Experimental results indicate that the proposed DLMD technique achieves a log-loss of 2.09 across ten independent runs [3].

Researchers investigated the impact of feature selection on training deep learning models for malware detection, employing two separate datasets in their analysis. The first, characterized by a vast number of records and few attributes, revealed that retaining 81.58% of the features caused only a minor reduction of 0.07% in validation accuracy. The second, a high-dimensional dataset with 15,036 records and 214 attributes, showed a 9.44% reduction in validation accuracy when only 6.5% of the columns were retained. The research also underscored the utility of an LSTM layer in enhancing model accuracy, especially for the first dataset, where it increased accuracy by 0.05% for validation and 0.44% for test scenarios. The optimal data split was consistently identified as 80-20 for training and testing [4].

In their comprehensive review, Ali et al. (2022) analysed the application of deep learning (DL) in malware detection across various platforms such as Windows, smartphones, and IoT. From an initial corpus of 290 studies sourced from five databases, they scrutinized 107 unique publications spanning six years. The research landscape shows a significant reliance on DL techniques like CNN, AE, RNN, and LSTM, with many models achieving up to 99.9% accuracy. Python and its associated libraries, notably TensorFlow and Keras, dominate the implementation space. Despite these advancements, the practical applicability of DL-based systems in real-world settings, especially with voluminous data, remains uncertain [5].

This paper explores the potential of combining deep learning techniques with graph-based methods for malware detection. The proposed framework, MALCOM, leverages features extracted from malware samples through a modified graph neural network, improving performance in multi-class malware classification tasks. The paper contributes to the discussion of how the capabilities of GNNs can be used in malware detection and how such a solution can be made explainable, leading to greater transparency in cybersecurity [6].

The authors explore the advantages of Graph Convolutional Networks (GCNs) for malware detection in this paper. They propose changes to traditional GCNs that help capture structural dependencies within malware data graphs effectively. The proposed model also emphasizes explainability, providing clear interpretations of its decision-making process. Additionally, the paper emphasizes the interpretability of the model's predictions. The authors introduce an explainability mechanism that provides insights into the decision-making process of the GCN, which enhances the model's transparency and trustworthiness [7].

The paper introduces an intent-based analysis framework using Graph Neural Networks (GNNs) for malware detection. The framework aims to offer more explainability, unravelling and justifying the predictions made by the model. This helps in understanding the exact reasoning behind the model's decision, enhancing its transparency and the trust of the user. Overall, this paper presents a valuable contribution to the field of malware detection by emphasizing the importance of explainability and interpretability. By utilizing intent-based analysis and GNNs, the framework provides insights into the decision-making process of the model, improving trust and understanding [8].

In summary, malware detection systems, exemplified by Jeon and Moon's 2020 work [2], showcase 96% accuracy using convolutional encoders and recurrent neural networks (RNNs) on Windows executable files. Static methods, like Deep Learning-based Malware Detection (DLMD), achieved a log-loss of 2.09. The role of feature selection in training deep learning models has been explored and found to have minimal impact on validation accuracy. Long Short-Term Memory (LSTM) layers enhance accuracy, especially for high-dimensional datasets. Ali et al.'s 2022 review [5] highlights the prevalence of deep learning techniques like RNNs, AEs, CNNs, and LSTMs in malware detection, yet their real-world applicability, particularly with large datasets, remains uncertain. Advances in graph-based methods, seen in MALCOM and Graph Convolutional Networks (GCNs), emphasize transparency and structural dependencies in malware data graphs, signaling an ongoing commitment to innovative deep learning approaches.

III. METHODOLOGY / PROPOSED SYSTEM
The proposed system for detecting malware follows a staged methodology. The first stage collects the malware dataset, prepares it for training, and builds a model that converts the byte and asm datasets into an image dataset. The second stage builds the model for the image dataset and uses the image data to classify the different malware types. Finally, Explainable AI (XAI) is applied, which helps to improve the malware classification system by providing transparency into how decisions are made, enabling debugging and model refinement. In this system, models for byte/asm files and for images have been designed and tested. The dataset includes various types of malware: Adposhel, Fakeran, Hlux, Regrun, Snarasite, VBA, and Vilsel.

The implementation uses a modified GNN architecture known as DeeperGCN to enhance the accuracy and performance of the model. TensorFlow and Keras are the chosen technologies for training the models. The proposed malware detection system, illustrated in "Fig. 1", consists of five sub-groups: collecting datasets from seven different classes, merging the datasets by converting the byte/asm dataset into images, constructing the deep learning model, training and evaluating on the datasets, and applying explainable AI to the images.

The initial stage of the system lays the groundwork for the entire model and outlines its implementation. The process commences by taking files through a pipeline, converting them into images, extracting features, preprocessing the input, and subsequently feeding it into the deep learning architecture. Explainable AI is then applied to interpret the results. Additional details regarding the subsequent stages are elaborated upon in the following sections.

[Figure 1: flowchart. Blocks shown: Collection of asm/byte file dataset; Preprocessing of dataset; Conversion in image data; Training of malware image dataset; Applying DeeperGCN architecture + Explainable AI integration using GRAD-CAM; Evaluation and Testing; Multiclass classification of malware images]

Fig.1. Flowchart of GRAD-CAM to integrate Explainable AI into DeeperGCN for malware image classification

A. Collection of Datasets
The process begins with the merging of two core datasets provided in textual format: the ".byte" dataset and the ".asm" dataset. The primary objective of this merger is to capitalize on the information present in each dataset, enhancing the overall richness and diversity of the training data.

ASM file:
The focus was on ASM (assembly) files, shown in "Fig.2", with an emphasis on extracting valuable features from these files. The structure and significance of ASM elements were explored, and it was found that a more structured feature set outperformed brute-force methods like character or word n-grams. Key features included the proportion of lines or characters associated with recognized sections based on their headers, counts of specific DLLs, occurrences of distinct opcodes, and proportions of relevant punctuation characters within ASM sections.

Fig.2. Snapshot of Asm File

To optimize the feature set, rare elements were removed based on their occurrence in training files. In a specific model, the feature set was further refined by retaining only those features present in at least 500 files and conducting statistical analyses on sections, DLLs, opcodes, and function calls. These ASM-based features constituted a critical component of the malware detection strategy.

Byte file:
The byte files, shown in "Fig.3", were subject to a comprehensive analysis in this methodology. Two primary aspects of feature extraction were explored. The first was byte n-grams, including 1, 2, and 4-byte sequences, which capture intricate patterns within the byte data of these files. The second was the distribution of entropy across the byte files.

Fig.3. Snapshot of Byte File

B. Merging and Generating Images from asm/byte file
The feature extraction process amalgamates diverse feature sets to create a comprehensive representation for data analysis, as shown in "Fig.4". For byte features, bigram and trigram combinations are used to construct a matrix with CountVectorizer, capturing the essence of the byte sequences. In the ASM feature domain, opcode vectorization with n-gram representations captures opcode sequences as strings, and the bi-, tri-, and tetra-grams are saved in npz format. In addition, images are created by using the ASM file size to determine the image dimensions and reshaping the data accordingly, which yields a unique image per file. The final feature set includes the byte bigram, opcode bi-gram, opcode trigram, and opcode tetra-gram features, and 200 pixels of the ASM images. These pixel representations are converted into images and categorized into their respective classes, enhancing the depth and diversity of the feature set for data analysis and machine learning applications.
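The opcode n-gram counting described above can be sketched as follows. This is a minimal illustration using only the Python standard library rather than CountVectorizer; the function names and the sample opcode sequence are hypothetical and are not taken from the authors' code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token sequence as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tokens, sizes=(2, 3, 4)):
    """Count bi-, tri- and tetra-grams of one opcode sequence."""
    counts = Counter()
    for n in sizes:
        counts.update(ngrams(tokens, n))
    return counts

# Hypothetical opcode sequence, as it might be parsed from one .asm file.
opcodes = ["push", "mov", "call", "push", "mov", "call", "ret"]
counts = ngram_counts(opcodes)
```

Each file then contributes one row of counts. Keeping only the most frequent n-grams across the training files would give a fixed-length feature vector per file.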

[Figure 4: flow diagram. Byte File gives Byte Bigram; Asm File gives Opcode Bigram, Opcode Trigram, and Opcode Tetragram; Merging 300 byte bigram, 200 opcode bigram, 200 opcode trigram, tetragram, first 200 image pixels; Final Image Pixel Data; Image Generated among 7 classes]

Fig.4. Flow Diagram of Malware Byte to Image conversion for malware classification
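The image-creation step of this pipeline can be sketched as below. The sketch assumes that the raw bytes of a file are laid out row by row in a square grid whose side is the smallest integer that fits all bytes, with zero padding at the end. The padding rule and the function name are assumptions made for illustration.

```python
import math

def bytes_to_square_image(data, pad_value=0):
    """Reshape a byte string into a square grid of pixel intensities (0-255)."""
    # Smallest side length whose square holds every byte.
    width = math.isqrt(len(data))
    if width * width < len(data):
        width += 1
    # Pad the tail so the data fills the square exactly.
    padded = list(data) + [pad_value] * (width * width - len(data))
    return [padded[r * width:(r + 1) * width] for r in range(width)]

# Six hypothetical bytes become a 3 x 3 image with three padded pixels.
image = bytes_to_square_image(b"\x56\x8d\x44\x24\x08\x50")
```

Each inner list is one image row, so the result can be written out with any image library as a grayscale picture.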
Byte Features Extraction:
• Perform bigram and trigram combination on the byte data.
• Use CountVectorizer to create a matrix from the bigram data.
ASM Features Extraction:
• Extract opcode sequences from the ASM files and represent them as strings.
• Generate opcode bi-grams, tri-grams, and tetra-grams.
• Save these opcode n-grams in npz format.
Image Creation:
• Determine the dimensions of the image based on the size of the ASM files.
• Calculate the width of the image as the square root of the file size.
• Reshape the image to match the calculated dimensions.
Final Feature Set:
• Combine the various extracted features, including byte bigram, opcode bi-gram, opcode trigram, opcode tetra-gram, and the ASM images (represented as pixel values).
• Convert pixel representations into actual images.
• Categorize the images into their respective classes based on the data they represent.

"Table 1" shows the number of images generated for each malware type.

Table 1. Malware types with their labels

Label   Malware Type   No. of Images
0       Adposhel       360
1       Fakeran        306
2       Hlux           350
3       Regrun         350
4       Snarasite      351
5       VBA            350
6       Vilsel         350
Total Images           3407

C. Model Building
The algorithms used for developing the model are as follows.

C.1. Graph Convolution Network:
Graph Convolutional Networks (GCNs) are specialized neural networks designed for graph-structured data. They harness the inherent relationships between nodes, utilizing graph convolution operations to iteratively aggregate information from neighboring nodes. This iterative process enhances representation learning and enables effective capture of intricate dependencies within graph data. While a simple GCN employs a single graph convolution layer, more advanced architectures like DeeperGCN address challenges such as over-smoothing and vanishing gradients, which contributes to their adaptability. GCNs have been applied successfully in diverse applications, from social network analysis to recommendation systems and biological network analysis. Their capacity to handle non-Euclidean, irregularly structured data underscores their versatility in tackling real-world problems with complex relational dependencies.

"Fig.5" shows the overall flow of the DeeperGCN architecture, from inputting an image to outputting the prediction among 7 classes.
[Figure 5: flow diagram. Blocks shown: Input image dataset; Superpixel Construction; Represent each pixel as node; Define edges between superpixels (adjacency matrix); Assign features to each node; Stack multiple GCN layers; Introduction of skip connections between layers; Aggregation functions; Graph readout or pooling; Fully connected with SoftMax; Train the model; Evaluation and prediction in 7 classes]

Fig.5 The project flow diagram of Deeper GCN architecture
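The core of this flow (neighbour aggregation, skip connections between stacked layers, graph readout, and softmax) can be sketched on a toy graph. This is a deliberately simplified illustration with no learnable weights. Mean aggregation, the ReLU placement, and all names are assumptions, not the authors' TensorFlow/Keras model.

```python
import math

def aggregate_mean(features, neighbors, node):
    """Mean of the neighbours' feature vectors (one possible aggregation)."""
    nbrs = neighbors[node]
    dim = len(features[node])
    if not nbrs:
        return [0.0] * dim
    return [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]

def layer_with_skip(features, neighbors):
    """One simplified layer: ReLU of the aggregated message plus a skip connection."""
    updated = []
    for i in range(len(features)):
        msg = aggregate_mean(features, neighbors, i)
        # Skip connection: the node keeps its own vector and adds the message.
        updated.append([h + max(0.0, m) for h, m in zip(features[i], msg)])
    return updated

def readout_mean(features):
    """Graph-level vector: mean over all node vectors."""
    n = len(features)
    return [sum(f[d] for f in features) / n for d in range(len(features[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy graph: three nodes on a path 0 - 1 - 2, two features per node.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for _ in range(2):  # stack two layers
    features = layer_with_skip(features, neighbors)
probs = softmax(readout_mean(features))
```

In a trained model each layer would also apply learned weight matrices, and the readout vector would pass through a fully connected layer before the softmax over the seven classes.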

C.2. DeeperGCN:
While Graph Convolutional Networks (GCNs) excel in graph learning tasks, challenges such as the vanishing gradient problem, over-smoothing, and overfitting impede the training of deep GCNs. The DeeperGCN algorithm introduces several techniques to address these issues. It incorporates skip connections between layers to enhance gradient flow, facilitating the training of deeper architectures. It also introduces differentiable generalized aggregation functions and MsgNorm. Finally, it uses a graph readout pooling layer, which enables effective information aggregation across nodes and thereby promotes robust and expressive graph representations. These features set DeeperGCN apart, overcoming the limitations above and significantly improving the depth and performance of GCNs in handling complex graph-structured data. The process includes two steps:
1. Graph representation and construction using images.
2. Applying the DeeperGCN algorithm, with the graph node structure as input, to classify the malware images.

C.2.1. Graph Representation and Construction:
As "Fig.6" illustrates, building a graph representation from an image involves converting groups of pixels into nodes and constructing edges based on spatial proximity. The resulting graph can be given as input to the DeeperGCN algorithm.
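Edge construction by spatial proximity can be sketched as follows, assuming the superpixel centroids are already available and using the threshold rule that the paper gives as equation (3): Fraction times the image diagonal, divided by the square root of the number of superpixels. The function names and the toy centroids are hypothetical.

```python
import math

def spatial_threshold(width, height, k, fraction):
    """Threshold = fraction * image diagonal / sqrt(number of superpixels)."""
    diagonal = math.hypot(width, height)
    return fraction * diagonal / math.sqrt(k)

def build_edges(centroids, threshold):
    """Connect every pair of superpixels whose centroids are closer than the threshold."""
    edges = []
    for a in range(len(centroids)):
        for b in range(a + 1, len(centroids)):
            if math.dist(centroids[a], centroids[b]) < threshold:
                edges.append((a, b))
    return edges

# Four hypothetical superpixel centroids in a 4 x 4 image.
centroids = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
threshold = spatial_threshold(width=4, height=4, k=4, fraction=0.9)
edges = build_edges(centroids, threshold)
```

With these values the horizontally and vertically adjacent superpixels are linked while the two diagonal pairs are not, so a smaller `fraction` gives a sparser graph.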

[Figure 6: flow diagram. Blocks shown: Initialization and clustering; Color and Spatial Arrangement; Cluster Centre Update; Repeat process until centre movement stops; Superpixel Mapping]

Fig.6. SuperPixel Construction
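Two pieces of the superpixel construction, grid initialization of the cluster centres and the combined colour and spatial distance, can be sketched as below. The sketch assumes that the number of clusters is a perfect square and that the spatial term is weighted by m/k, where m is the compactness parameter and k the number of clusters. That weighting and the function names are assumptions, and standard SLIC implementations normalize somewhat differently.

```python
import math

def init_centers(width, height, k):
    """Place k cluster centres on a regular grid (k assumed to be a perfect square)."""
    per_side = math.isqrt(k)
    step_x, step_y = width / per_side, height / per_side
    return [(i * step_x, j * step_y)
            for j in range(per_side) for i in range(per_side)]

def slic_distance(pixel_rgb, pixel_xy, center_rgb, center_xy, m, k):
    """Colour distance plus compactness-weighted spatial distance."""
    d_color = math.dist(pixel_rgb, center_rgb)
    d_space = math.dist(pixel_xy, center_xy)
    # Assumed weighting: a larger m makes superpixels more compact.
    return d_color + (m / k) * d_space

centers = init_centers(width=4, height=4, k=4)
d = slic_distance((10, 10, 10), (0, 0), (10, 10, 10), (3, 4), m=8, k=4)
```

A full implementation would assign every pixel to the centre with the smallest distance, recompute each centre as the mean of its pixels, and repeat until the centres stop moving.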

Graph construction includes two main steps:

1. Image to Superpixel (Node) Conversion:
Simple Linear Iterative Clustering (SLIC) is a superpixel segmentation algorithm that transforms images into superpixels. It combines K-means clustering with a compactness term that controls the influence of spatial proximity, which is important for pixel grouping and for efficient computer vision tasks. SLIC clusters pixels based on color and spatial proximity, providing a structured representation that preserves finer details and enhances computational efficiency in subsequent image processing and analysis. By grouping pixels into perceptually meaningful superpixels, SLIC facilitates tasks such as segmentation, object recognition, and feature extraction.

For initialization, equation (1) places the K cluster centers on a grid over the W x H image, where W is the width and H is the height of the image:

$$(x, y) = \left(i \cdot \frac{W}{\sqrt{K}},\; j \cdot \frac{H}{\sqrt{K}}\right) \tag{1}$$

In the color and spatial arrangement step, the algorithm iterates through the pixels and assigns each pixel to the cluster whose center is nearest in both color space and spatial space. Equation (2) defines the distance between a pixel p and a cluster center c:

$$D(p, c) = \sqrt{(R_p - R_c)^2 + (G_p - G_c)^2 + (B_p - B_c)^2} + \frac{m}{k}\sqrt{(x_p - x_c)^2 + (y_p - y_c)^2} \tag{2}$$

where $R_p, G_p, B_p$ are the RGB color values of pixel p,
$R_c, G_c, B_c$ are the RGB color values of cluster center c,
$x_p, y_p$ are the spatial coordinates of pixel p,
$x_c, y_c$ are the spatial coordinates of cluster center c,
m is the compactness parameter, and
k is the number of superpixels or clusters.

The formula combines two metrics, color similarity and spatial proximity. The factor $m/k$ normalizes the spatial term by the compactness parameter.

In the cluster center update step, each center is updated to the mean color and mean spatial position of the pixels assigned to it. This step repeats until the centers move only slightly. The superpixel map then assigns each pixel in the image to its corresponding superpixel label.

2. Adjacency Matrix and Edges:
The superpixels obtained from the images are treated as the nodes of the graph. Superpixels that are close to each other are connected with an edge. The spatial threshold that decides closeness can be set by the user or computed with equation (3):

$$\text{Spatial Threshold} = \frac{\text{Fraction} \times \text{Diagonal}}{\sqrt{k}} \tag{3}$$

Fraction is a user-defined parameter representing a fraction of the image size. Diagonal is the length of the diagonal of the image, calculated as $\sqrt{w^2 + h^2}$, and k is the number of superpixels. If the distance between two superpixels is less than the spatial threshold, an edge is created between them; otherwise no edge is created.

C.2.2. Apply DeeperGCN on Graph Representation
Equations (4), (5), and (6) describe the message passing between nodes v and u, the vertices of the graph G = (V, E), at the l-th layer:

$$m_{vu}^{(l)} = \rho^{(l)}\left(f_v^{(l)}, f_u^{(l)}, f_{e_{vu}}^{(l)}\right), \quad u \in N(v) \tag{4}$$

$$m_v^{(l)} = \delta^{(l)}\left(\left\{ m_{vu}^{(l)} \mid u \in N(v) \right\}\right) \tag{5}$$

$$f_v^{(l+1)} = \phi^{(l)}\left(f_v^{(l)}, m_v^{(l)}\right) \tag{6}$$

where $\rho^{(l)}$, $\delta^{(l)}$, and $\phi^{(l)}$ are differentiable functions for message construction, message aggregation, and vertex update at the l-th layer, and f denotes the features used to construct a unique message for each neighbour $u \in N(v)$.

DeeperGCN also introduces skip connections, which provide a direct link from the input of one layer to the output of a deeper layer. This mitigates the vanishing gradient problem and promotes smoother training of deeper architectures. With skip connections, the updated representation of a node is the sum of its existing representation and the information skipped from the previous layer. Equation (7) shows the updated representation:

$$h_i^{(l+1)} = h_i^{(l)} + \mathrm{Skip\_conn}\left(h_i^{(l-1)}\right) \tag{7}$$

where $\mathrm{Skip\_conn}(h_i^{(l-1)})$ represents the information skipped from the previous layer.

The aggregation function, pivotal in graph convolution, combines information from neighbouring nodes. In equation (8), the updated representation of each node is determined by aggregating features from the surrounding neighbourhood through a differentiable generalized aggregation function, followed by an activation function:

$$h_i = \sigma\left(\mathrm{AGG}\left(\left\{ h_j, \forall j \in N(i) \right\}\right)\right) \tag{8}$$

where $h_i$ is the updated representation of node i, $\sigma$ is the activation function, $N(i)$ is the neighborhood of node i, and AGG is an aggregation function such as sum, min, or max.

In equation (9), the graph readout pooling operation captures global patterns within the graph:

$$h_{graph} = \mathrm{POOLING}\left(\left\{ h_j, \forall j \in \text{Graph} \right\}\right) \tag{9}$$

The graph readout pooling operation aggregates node representations to obtain a graph-level representation. A common approach is global pooling, where the graph-level representation $h_{graph}$ is obtained by applying a pooling function (e.g., mean or max pooling) to all node representations. A softmax function is then added at the end, and the model can be trained.

D. Training and Testing of model
This project employs a deeper Graph Convolutional Network (GCN) model for multi-class malware detection using image data. The model is trained over multiple epochs, with a key focus on monitoring validation accuracy to ensure effective generalization. During testing, the model's performance is evaluated using accuracy, and Explainable AI techniques are applied to enhance transparency.

E. Explainable AI:
Grad-CAM is a technique used to visualize and understand the attention of convolutional neural networks (CNNs). It generates heatmaps that highlight the regions of an input image that contribute most to the model's predictions. By combining the gradients of the target class with respect to the feature maps, Grad-CAM produces a weighted combination of the feature maps that emphasizes the regions relevant to the target class. This helps in interpreting and explaining the decision-making process of CNN models. Both CAM and Grad-CAM operate under the fundamental assumption that the final score $Y^c$ for a particular class c can be represented as a linear combination of the global average-pooled feature maps $A^k$ extracted from the last convolutional layer, as expressed in equation (10):

$$Y^c = \sum_k w_k^c \, \frac{1}{Z} \sum_i \sum_j A_{ij}^k \tag{10}$$

For a feature map $A^k$ and class c, the weight $w_k^c$ is given by equation (11):

$$w_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} \tag{11}$$

where Z is the number of spatial locations in the feature map.

Modifying Grad-CAM: As "Fig.7" illustrates, the CNN layers are stacked together to perform accurate localization. Changing the number of layers in the CNN can improve the accuracy of localization. Different CNN architectures are used: VGG16 with 16 layers, ResNet50 with 50 layers, and MobileNetV2 with 53 layers.
[Figure 7: block diagram. Blocks shown: Original Image; CNN; GradCAM with weights W1, W2 ... Wn; ReLU; Class]

Fig.7. Modified GradCAM Explainable AI
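The Grad-CAM weighting can be sketched in plain Python. The sketch assumes the feature maps of the last convolutional layer and the gradients of the class score with respect to them are already available as nested lists. In a real pipeline they would come from the deep learning framework's automatic differentiation, and the names and toy values here are hypothetical.

```python
def gradcam_heatmap(feature_maps, gradients):
    """Return (weights, heatmap) for K feature maps of size H x W.

    feature_maps, gradients: lists of K maps, each a list of H rows of W floats.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    z = height * width
    # One weight per feature map: the spatial mean of its gradient.
    weights = [sum(sum(row) for row in grad) / z for grad in gradients]
    # Weighted sum of the feature maps.
    heatmap = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += w * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class.
    return weights, [[max(0.0, v) for v in row] for row in heatmap]

# Two hypothetical 2 x 2 feature maps and their gradients.
feature_maps = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
weights, heatmap = gradcam_heatmap(feature_maps, gradients)
```

The heatmap is usually upsampled to the input image size and overlaid on the image, which is how outputs such as a heatmap figure are typically produced.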


IV. RESULTS AND DISCUSSIONS
In this study, byte and assembly (ASM) files were converted into an image dataset for subsequent analysis. This approach made it possible to apply the DeeperGCN (Graph Convolutional Network) algorithm to classify these files as images. By integrating Grad-CAM (Gradient-weighted Class Activation Mapping) into the methodology, a transparent explanation for the classification results was provided, offering valuable insights into the inner workings of the model. The following sections present the results obtained from this approach and discuss the implications and significance of the findings.

A. Generating Images through files:

B. Image classification on images using DeeperGCN:

Fig.8. Malware image (Predicted Malware: Adposhel)

Fig.9. Heatmap of image

Fig.10. Explainable AI

"Fig.8" shows the prediction of malware. "Fig.9" and "Fig.10" show the heatmap and the explainable AI output of the malware image, respectively. The study achieved 97% accuracy in classifying byte and ASM files as images using DeeperGCN. This underscores the effectiveness of the approach, enhanced by Grad-CAM for transparent explanations. The methodology holds promise for malware detection and broader file analysis, though potential limitations warrant further exploration and discussion for a comprehensive understanding of its implications. "Table 2" shows the comparison of Grad-CAM images using different CNN architectures.

Table 2. Comparison of GradCAM images using different CNN Architecture
[Image table. Columns: Original Image; CNN Architecture Used; GradCam Output image. Rows: VGG 16; ResNet 50; MobileNet V2]

Analysis of Malware: This research paper investigates seven distinct types of malware: Adposhel, Fakeran, Hlux, Regrun, Snarasite, VBA, and Vilsel. The analysis delves into their unique characteristics, propagation methods, and potential impact on computer systems. Adposhel exploits advertising platforms; Fakeran disguises itself as legitimate software. Hlux employs sophisticated evasion techniques, challenging traditional security measures. Regrun exhibits traits associated with rootkits, posing a substantial threat to system integrity.
Table 3. Comparison Table of Existing and Proposed Algorithm

Algorithm                     Precision   Recall    F1 score   Validation accuracy
JK-GraphSAGE                  93.10%      94.20%    93.60%     0.9491
GIN                           92.50%      93.70%    93.10%     0.9307
SGC                           91.30%      92.50%    91.90%     0.9479
JK-GCN                        89.80%      91.00%    90.40%     0.8941
JK-GIN                        92.90%      94.10%    93.50%     0.9369
Deeper-GCN (Proposed Model)   96%         95%       95.40%     0.9704

Algorithm                     Pros / Cons
JK-GraphSAGE                  Pros: Flexible and can learn complex patterns in graph data. Cons: Can be computationally expensive to train.
GIN                           Pros: Simple to implement and train. Cons: Can be less interpretable than other models.
SGC                           Pros: Efficient to train and infer. Cons: May not be able to learn complex patterns in graph data.
JK-GCN                        Pros: Simple and effective. Cons: Can be less interpretable than other models.
JK-GIN                        Pros: Combines the advantages of GIN and JK-Net. Cons: Can be computationally expensive to train.
Deeper-GCN (Proposed Model)   Pros: High accuracy, precision, recall, and F1 score; more interpretable than other GNN models. Cons: Can be computationally expensive to train.
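As a reading aid for the F1 column, the sketch below computes the F1 score from precision and recall, assuming the standard definition as their harmonic mean. The paper does not state how its F1 values were computed, and the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    return 2 * precision * recall / (precision + recall)

# JK-GraphSAGE row of Table 3: precision 93.10%, recall 94.20%.
f1 = f1_score(93.10, 94.20)
```

For this row the result rounds to the 93.60% listed in the table.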

Snarasite, a portmanteau of 'snare' and 'parasite,' suggests an insidious nature. VBA exploits Microsoft Office macros, targeting document processing vulnerabilities. Lastly, Vilsel, though less familiar, demands scrutiny for potential emergent threats. "Table 3" compares the proposed algorithm with existing algorithms from previous research work.

V. SCOPE OF PROJECT
The project, "Multi-Class Malware Detection using modified GNN and Explainable AI," holds significant scope in the realms of cybersecurity and artificial intelligence. It primarily aims to create an effective malware detection system capable of identifying various types of malware, including viruses, trojans, worms, and ransomware, using deep learning techniques. This entails the investigation and customization of deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), as well as the incorporation of Graph Neural Networks (GNNs) to enhance the resilience of malware analysis. Additionally, the project focuses on achieving multi-class classification, making model decisions transparent through Explainable AI, conducting feature engineering, and preparing diverse datasets. It aims to deploy the system for real-time detection, address scalability challenges, and consider ethical and legal aspects. The project's findings can contribute to research in the field, and engagement with the cybersecurity community can enhance its impact.

VI. CONCLUSION
This work offers a comprehensive and forward-looking approach to the critical challenges of malware detection in the realms of cybersecurity and artificial intelligence. A deep learning system using modified Graph Neural Networks was developed for multi-class malware detection, ensuring accurate identification of various malware types. Explainable AI is integrated for transparent decision-making. The work involves feature engineering, robust evaluation, and scalability for real-time detection, and it addresses ethical and legal concerns.

REFERENCES
[1] K. Shaukat, S. Luo, V. Varadharajan, "A novel deep learning-based approach for malware detection," Engineering Applications of Artificial Intelligence, vol. 122, p. 106030, 2023, ISSN: 0952-1976.
[2] M.S. Akhtar, T. Feng, "Malware Analysis and Detection Using Machine Learning Algorithms," Symmetry, vol. 14, p. 2304, 2022.
[3] R. Vinayakumar, M. Alazab, K.P. Soman, P. Poornachandran, S. Venkatraman, "Robust Intelligent Malware Detection Using Deep Learning," IEEE Access, vol. 7, pp. 46717-46738, 2019.
[4] E.S. Alomari, R.R. Nuiaa, Z.A.A. Alyasseri, H.J. Mohammed, N.S. Sani, M.I. Esa, B.A. Musawi, "Malware Detection Using Deep Learning and Correlation-Based Feature Selection," Symmetry, vol. 15, p. 123, 2023.
[5] R. Ali, A. Ali, F. Iqbal, M. Hussain, F. Ullah, "Deep Learning Methods for Malware and Intrusion Detection: A Systematic Literature Review," Security and Communication Networks, vol. 2022, Article ID 2959222, pp. 1-31, 2022.
[6] Z. Hou, Y. Wang, W.L. Hsu, H. Zheng, "MALCOM: A MALware detection framework using COMbination of deep learning and graph-based techniques," Future Generation Computer Systems, vol. 91, pp. 56-66, 2019.
[7] A. Mushtaq, S. Khan, A. Gani, "Graph Convolutional Networks for Malware Detection," IEEE Access, vol. 7, pp. 148703-148714, 2019. DOI: 10.1109/access.2019.2948648.
[8] K. Jain, S. Kumar, "Explainable and Interpretable Intent-based Analysis For Malware Detection," arXiv preprint arXiv:2006.00570, 2020.
[9] G.S. Pudlo, B.X. Furquim, J. Granatyr, "DeepMal: A Deep Learning Approach for Malware Classification Exploiting Recurrent Neural Networks and Word Embeddings," in Proceedings of the 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), pp. 477-482.
[10] H.P. Nguyen, Y. Nakamichi, "Static Malware Detection Using Deep Learning on Operation Sequence," in Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 816-821. DOI: 10.1109/icmla.2018.00142.
