Deep Learning Models for Real-Time Automatic Malware Detection
ABSTRACT The increase in the sophistication and volume of cyberattacks has made traditional malware
detection methods, such as those based on signatures and heuristics, obsolete. These conventional techniques
struggle to identify new malware variants that employ advanced evasion tactics, resulting in significant
security gaps. This study addresses this problem by proposing a hybrid model based on deep learning that
integrates static and dynamic analysis to improve the precision and robustness of malware detection. This
proposal combines the extraction of static features from the code and dynamic features from the behavior at
runtime, using convolutional neural networks for visual analysis and recurrent neural networks for sequential
analysis. This comprehensive integration of features allows our model to detect known malware and new
variants more effectively. The results show that our model achieves a precision of 98%, a recall of 97%,
and an F1-score of 0.975, outperforming traditional methods, which generally reach 88% to 89% precision.
Furthermore, our model outperforms recent deep learning approaches documented in the literature, which
report up to 96% precision. This work thus offers a significant advancement in malware detection, providing a
more effective and adaptable solution to modern cyber threats.
INDEX TERMS Malware detection, deep learning, static and dynamic analysis, cybersecurity.
This work presents several important contributions. First, it introduces a hybrid model combining static and dynamic analysis using CNN and RNN, improving the detection of known and new malware [9]. Furthermore, the model achieves an accuracy of 98%, significantly outperforming traditional and recent techniques. Limitations of the model are also addressed, highlighting the need for more flexible and adaptive architectures and for a more representative dataset. Finally, future research directions are proposed, suggesting the need for more diverse and representative datasets and the development of more robust architectures for malware detection.

The article is structured as follows: the Introduction presents the context, a literature review, the definition of the problem, and our proposal. The literature review covers traditional methods and recent advances in deep learning for malware detection. Materials and Methods describes data selection and preprocessing, the model architecture, and training. The Case Study details the implementation of the system on a mobile application platform and the evaluation of the results. The Results and Discussion present an analysis, a comparison with other approaches, and the study's limitations. The Conclusions summarize the findings, potential impact, and future research directions. Finally, the References used in the study are included.

II. LITERATURE REVIEW
Malware detection using deep learning techniques has gained considerable attention in the last decade due to its ability to identify complex patterns in large and heterogeneous data [10]. Recent studies have explored various neural network architectures to improve accuracy and robustness in malware detection.

A study by Liu et al. [11] introduced a deep neural network for malware detection using static features extracted from binary code. This work showed an accuracy of around 95%, but malware obfuscation and mutation techniques limited the model's effectiveness. Compared to our study, which achieved a precision of 98%, the difference can be attributed to our incorporation of dynamic analysis techniques and the use of a more complex neural network architecture.

Another significant work is Yadav et al. [12], where a CNN was implemented for malware detection based on visualizing binaries as images. This method achieved a precision of around 96%, highlighting the effectiveness of CNNs in detecting complex visual patterns. However, this approach is limited to static features and may be less effective against malware that uses behavior-based evasion techniques. Our study combined static and dynamic features, which allowed for better generalization and superior performance.

Yao et al. [13] proposed using recurrent neural networks (RNNs) for malware detection by modeling sequences of system calls, achieving an accuracy of around 94%. RNNs are effective at capturing temporal dependencies and sequential patterns, but their ability to handle large volumes of sequential data may be limited. In contrast, our model used a combination of RNNs and CNNs to take advantage of sequential and visual features, thereby improving accuracy and generalization.

The work of Chen and Cao [14] combined static and dynamic analysis using a deep neural network, achieving a precision of 93%. Although this approach is similar to ours, the difference in results can be attributed to our dataset's greater diversity and size and to the optimization of model hyperparameters. Incorporating advanced preprocessing and feature selection techniques also played a crucial role in improving performance.

Deep learning-based methods have proven more effective in detecting unknown malware variants than traditional malware detection methods such as signature-based and heuristic-based ones. For example, signature-based methods, such as those discussed by Pandit and Mondal [15], showed an accuracy of 88%, which is less effective against new malware variants. Although more adaptive, heuristic techniques achieved an accuracy of 89% but suffer from high false-positive rates due to the difficulty of distinguishing between legitimate and malicious behavior. Our study, with an accuracy of 98%, demonstrates the superiority of deep learning techniques in detecting modern malware.

Despite the strides made, deep learning for malware detection still faces challenges and limitations. Generalization to new samples and the representativeness of the dataset are critical issues. Our study identified that approximately 5% of the newest samples were not detected due to advanced evasion techniques. Moreover, the representativeness of the dataset is limited, with 80% of the samples representing only five types of malware. These issues underscore the urgent need for more diverse and representative datasets to bolster the robustness and effectiveness of the models.

Hybrid approaches, which combine static and dynamic analysis with advanced neural networks, have been explored to bridge these gaps. For instance, Dong et al. [10] employed a combination of CNN and DNN to detect malware on Android devices, enhancing accuracy by integrating multiple features. Furthermore, Yerima et al. [16] proposed a model that incorporates a balancing optimizer with deep learning techniques for Android malware detection, demonstrating the effectiveness of hybrid approaches in improving accuracy and generalization. This progress indicates that malware detection techniques are continuously improving.

III. MATERIALS AND METHODS
A. DATA SELECTION
This work was developed in a cybersecurity research environment within the University's Computer Security Laboratory, equipped with advanced computing resources and access to multiple malware databases. The lab is configured with high-performance servers, large storage capacity, and specialized software tools for security analysis. The infrastructure includes GPU clusters to efficiently train deep learning models and sandboxing systems to execute malware samples safely.
Several public and private databases were used to obtain malware and benign software samples. Public databases include Drebin [17], a widely used collection of malicious Android applications, and VirusShare [18], which offers extensive malware samples for multiple platforms. Internal repositories of samples collected and labeled by the lab were also accessed, including executable files for Windows systems and mobile applications for Android and iOS.

The dataset used in this study is classified into two main categories: static analysis and dynamic analysis. The static analysis includes features extracted from application source code and binaries, such as code signatures, requested permissions, and code structure [19]. On the other hand, dynamic analysis focuses on the behavior of applications during their execution, capturing information such as system call sequences, network activities, and system resource usage [20]. This duality allows for more complete and robust malware detection, combining static patterns with dynamic behaviors.

The dataset includes 50,000 samples, distributed between 30,000 malware samples and 20,000 benign software samples. These samples cover a wide range of malware types, including Trojans, ransomware, adware, and spyware, as well as benign applications from various categories, such as games, productivity tools, and social networking applications. The diversity of malware types and target platforms (Windows, Android, iOS) ensures that the deep learning model can effectively generalize and detect various threats in different operating environments.

B. DATA PREPROCESSING
Data cleansing is crucial in preparing malware and benign software samples for deep learning analysis. This process involves several steps to ensure that the data is consistent and of high quality. First, duplicate samples were identified and removed using hashing techniques to ensure each sample was unique [21]. Subsequently, samples that do not provide relevant information, such as empty files or files containing only non-executable data, were discarded. The data was normalized to ensure format consistency, such as file names and folder structures, thus facilitating subsequent analysis [22].

Feature extraction is essential to convert raw data into useful information that deep learning models can process. This study used both static and dynamic features. Static features were obtained through static analysis of the code without executing it, extracting code signatures, permissions requested by applications, and structures from the source code [23]. In contrast, dynamic features were captured by monitoring the behavior of applications during their execution in a controlled environment (sandbox) [24]. Sequences of system calls, network activities, and system resource usage were recorded, providing a dynamic profile of each sample's behavior.

Several transformations were performed to prepare the data for the deep learning model. Binaries of the malware samples were transformed into images by representing bytes as pixels. This technique allows CNNs to analyze the samples as if they were images, identifying visual patterns characteristic of malware [25]. Additionally, system call sequences and other dynamic features were converted into numerical vectors using one-hot encoding and embeddings for RNNs and long short-term memory networks (LSTMs).

Specific examples of extracted features are as follows.
Static Features:
• Code Signatures: Byte patterns in binary code that help identify similarities between different malware samples.
• Requested Permissions: Permissions that applications request, such as access to sensitive data or device functionality, which may indicate potentially malicious behavior.
Dynamic Features:
• System Call Sequences: The software's interactions with the operating system, providing a detailed profile of the software's actions.
• Network Activities: Traffic generated by the application, which is relevant to identifying malicious behavior, such as communication with command and control servers.

TABLE 1. Features extracted for malware and benign software analysis.

Table 1 summarizes the features extracted for analysis, including the techniques used and the data types generated. It also clearly distinguishes the static and dynamic characteristics and the transformations carried out.

FIGURE 1. Distribution of extracted features between malware and benign software samples.

Figure 1 illustrates the distribution of extracted features, comparing the number of extracted features between malware and benign software samples. The graph shows four main categories of features: code signatures, permissions, system calls, and network activities. Each category is important for deep analysis of the samples, providing multiple perspectives on the software's behavior and structure.
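To make these transformations concrete, the sketch below shows one plausible way to turn a raw binary into a grayscale image for the CNN branch and a system-call trace into a fixed-length sequence of integer ids for the RNN/LSTM branch. The function names, the image width of 256, the maximum sequence length, and the toy vocabulary are illustrative assumptions, not the exact pipeline used in this study.

```python
import numpy as np

def binary_to_image(path, width=256):
    """Read a raw binary and reshape its bytes into a 2-D grayscale array.

    Each byte (0-255) becomes one pixel; the byte stream is zero-padded so it
    fits an integer number of rows of the chosen width.
    """
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    rows = max(1, int(np.ceil(len(data) / width)))
    padded = np.zeros(rows * width, dtype=np.uint8)
    padded[:len(data)] = data
    return padded.reshape(rows, width)

def encode_syscalls(trace, vocab, max_len=500):
    """Map a system-call trace to a fixed-length sequence of integer ids.

    Unknown calls map to 0; sequences are truncated or zero-padded so they can
    feed an embedding layer followed by an RNN/LSTM.
    """
    ids = [vocab.get(call, 0) for call in trace[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int32)

# Hypothetical usage with a toy vocabulary built from the training traces.
vocab = {"open": 1, "read": 2, "write": 3, "connect": 4}
sequence = encode_syscalls(["open", "read", "connect"], vocab)
```

A one-hot representation can be derived from the same integer ids when an embedding layer is not used.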
This split allows the model's performance to be evaluated at each training stage and parameters to be adjusted as necessary. The samples were randomized to avoid bias and to ensure that each set adequately represented the data types, malware, and benign software. Additionally, class balance was ensured within each set to prevent the problem of class imbalance, which could lead to a biased model.
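A stratified split is one straightforward way to implement the randomization and class balancing described above. The sketch below assumes a 70/15/15 partition (the paper states only that the independent test set holds 15% of the data) and uses synthetic placeholders for the feature matrix and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels (1 = malware, 0 = benign); in practice these
# come from the preprocessing stage described earlier.
X = np.random.rand(1000, 64)
y = np.random.randint(0, 2, size=1000)

# Stratification keeps the malware/benign ratio identical in every subset.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, shuffle=True, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```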
The model training procedure involves several steps to optimize the model's performance and ensure its robustness. The training was carried out in a high-performance computing environment using GPU clusters, specifically NVIDIA Tesla V100 units, which provide the power needed to handle the large volume of data and the complexity of the deep learning model. The choice of GPUs was based on their ability to perform massively parallel calculations, which is essential to accelerate the training process of deep convolutional neural networks.

The first step in the training procedure was setting up the environment. The TensorFlow framework was used with Keras, running in a development environment based on Jupyter Notebooks, allowing easy manipulation and visualization of data and results [30]. The training scripts were implemented in Python, taking advantage of the advanced deep-learning libraries and tools available in this ecosystem. Additionally, Docker containers were used to ensure the development environment's reproducibility and to facilitate the model's deployment on different operating systems and hardware configurations.

Several regularization techniques were applied during training to prevent overfitting and improve the model's generalization ability. The dropout technique was used in the fully connected layers, with a dropout rate of 50%. This technique randomly turns off a fraction of the neurons during each training step, forcing the model to learn more robust and distributed representations. Additionally, batch normalization was implemented after each convolutional layer [31]. This technique normalizes the activations of each mini-batch, stabilizing and accelerating the training process by reducing the problems of vanishing and exploding gradients.

The training process was carried out for 50 epochs, with a batch size of 64 samples. The loss function used was binary cross-entropy, which is suitable for binary classification. The Adam optimizer, known for its ability to adapt dynamically to changes in the gradient, was used to minimize the loss function. During training, model performance was continuously monitored on the validation set, adjusting hyperparameters to optimize precision and reduce error.

Several evaluations were performed during training to monitor model performance and ensure that overfitting did not occur. These evaluations included cross-validation, where the training set was further divided into k subsets, and the model was trained and evaluated k times, each time using a different subset as the validation set and the remaining k-1 subsets as the training set [11]. This technique helps ensure that the model generalizes well and is not overly dependent on any specific subset of the data.

In addition, data augmentation techniques were implemented to increase the diversity of the training data and improve the model's ability to generalize to new samples. These techniques included rotating and scaling the images generated from the binaries and introducing Gaussian noise, which helps simulate variations in the data and makes the model more robust to different representations of the malware [32]. The training process was monitored using precision and loss plots for the training and validation sets. These visualizations made it possible to quickly identify any signs of overfitting or underfitting and to adjust the hyperparameters accordingly. Additionally, Keras callbacks were used to implement early stopping, halting training if the loss on the validation set stopped improving for a predefined number of epochs, thus preventing overfitting.
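The following sketch illustrates how a hybrid model of this kind could be assembled with TensorFlow and Keras, combining a CNN branch over the binary images with an LSTM branch over the system-call sequences, and applying the training settings described above (50% dropout in the dense layers, batch normalization after the convolutional layers, the Adam optimizer with binary cross-entropy, a batch size of 64, up to 50 epochs, and early stopping on the validation loss). Layer sizes, the vocabulary size, and the input dimensions are illustrative assumptions rather than the exact architecture of the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Static branch: binaries rendered as 256x256 grayscale images.
img_in = layers.Input(shape=(256, 256, 1), name="binary_image")
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.BatchNormalization()(x)              # batch norm after each conv layer
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Dynamic branch: system-call sequences encoded as integer ids.
seq_in = layers.Input(shape=(500,), name="syscall_sequence")
s = layers.Embedding(input_dim=5000, output_dim=64, mask_zero=True)(seq_in)
s = layers.LSTM(128)(s)

# Fusion and classification head.
h = layers.concatenate([x, s])
h = layers.Dense(256, activation="relu")(h)
h = layers.Dropout(0.5)(h)                      # 50% dropout in fully connected layers
out = layers.Dense(1, activation="sigmoid")(h)

model = Model(inputs=[img_in, seq_in], outputs=out)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(name="precision"),
                       tf.keras.metrics.Recall(name="recall"),
                       tf.keras.metrics.AUC(name="auc")])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# Training call (input arrays omitted; they are the preprocessed features):
# model.fit({"binary_image": X_img_train, "syscall_sequence": X_seq_train}, y_train,
#           validation_data=({"binary_image": X_img_val,
#                             "syscall_sequence": X_seq_val}, y_val),
#           epochs=50, batch_size=64, callbacks=[early_stop])
```

Augmentation of the image branch (rotation, scaling, added Gaussian noise) could be applied with standard Keras preprocessing layers before training.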
E. VALIDATION AND EVALUATION
Model validation and evaluation ensure that the deep learning model performs optimally and can adequately generalize to unseen data. Several evaluation methods and metrics were used to assess the model's performance. Cross-validation was used to assess the stability and generalization of the model. This study applied k-fold cross-validation, dividing the training set into k subsets (folds). The model is trained k times, using k-1 subsets for training and the remaining subset for validation [33]. This is repeated k times so that each subset is used exactly once as a validation set. Cross-validation helps ensure the model does not overfit a specific part of the training data. The primary evaluation metric is calculated on each fold, and the metrics across all folds are then averaged to obtain a robust estimate of model performance.

To evaluate the model's performance, the following metrics were used: precision, recall, F1-score, and the area under the ROC curve (AUC-ROC) [34]. These metrics provide a comprehensive view of model performance for binary classification (malware vs. benign). In the following, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

Precision: Precision is the proportion of correctly predicted positive samples among all samples predicted as positive. It is calculated as:

Precision = TP / (TP + FP)    (1)

Recall (Sensitivity or True Positive Rate): Recall measures the model's ability to correctly identify all positive samples. It is calculated as:

Recall = TP / (TP + FN)    (2)

F1-score: The F1-score is the harmonic mean of precision and recall, balancing the two. It is calculated as:

F1-score = (2 × Precision × Recall) / (Precision + Recall)    (3)
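The sketch below illustrates the k-fold procedure and the computation of Equations (1)-(3) with scikit-learn. A value of k = 5 is assumed (the paper does not state k), synthetic data stands in for the real feature matrices, and a logistic regression classifier is used as a lightweight stand-in for the deep model purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # stand-in for the deep model
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold

# Placeholder training data (1 = malware, 0 = benign).
X = np.random.rand(500, 32)
y = np.random.randint(0, 2, size=500)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {"precision": [], "recall": [], "f1": []}

for train_idx, val_idx in kfold.split(X, y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[val_idx])
    scores["precision"].append(precision_score(y[val_idx], y_pred))  # Eq. (1)
    scores["recall"].append(recall_score(y[val_idx], y_pred))        # Eq. (2)
    scores["f1"].append(f1_score(y[val_idx], y_pred))                # Eq. (3)

# Fold metrics are averaged to obtain a robust estimate of model performance.
print({name: round(float(np.mean(vals)), 3) for name, vals in scores.items()})
```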
AUC-ROC: The ROC curve is a graph that shows the relationship between the true positive rate (TPR) and the false positive rate (FPR). The AUC provides a single measure of performance, where a value of 1 indicates perfect performance and a value of 0.5 indicates random performance. The AUC is calculated as the integral of the ROC curve. The TPR and FPR are defined as:

TPR = TP / (TP + FN)    (4)

FPR = FP / (FP + TN)    (5)

The final evaluation was conducted on the independent test set (15% of the total data), which was not seen by the model during training. This approach ensures that the evaluation metrics reflect the model's actual performance on unseen data, accurately measuring its generalization ability. In addition, confusion matrices were generated to analyze the model predictions in detail. Confusion matrices make it possible to identify and quantify true positives, false positives, and false negatives, providing a detailed view of the areas where the model can improve.
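As an illustration of how the AUC-ROC and the confusion-matrix quantities in Equations (4) and (5) can be obtained on the held-out test set, the following sketch uses scikit-learn with synthetic labels and scores in place of the real model outputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Placeholder test labels and model scores (probability of the malware class).
y_test = np.random.randint(0, 2, size=200)
y_score = np.random.rand(200)
y_pred = (y_score >= 0.5).astype(int)

# Confusion matrix laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / (tp + fn)     # Eq. (4)
fpr = fp / (fp + tn)     # Eq. (5)

# The AUC integrates the ROC curve over all decision thresholds.
auc = roc_auc_score(y_test, y_score)
fpr_curve, tpr_curve, thresholds = roc_curve(y_test, y_score)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  AUC-ROC={auc:.3f}")
```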
F. REAL-TIME IMPLEMENTATION
Deploying the deep learning model in a production environment requires careful integration with other software components to ensure efficient and reliable operation. The trained model is deployed on a highly available server, integrated with a microservices architecture to facilitate interaction with other systems and applications. The system architecture includes several main components. First, a dedicated inference server uses GPUs to speed up request processing. This server is connected to a RESTful web service that allows external applications to submit software samples for real-time analysis.

Additionally, a NoSQL database stores records of the inferences performed, including classification results, response times, and any errors found. This allows for continuous monitoring and rapid response to operational problems [35]. The preprocessing pipeline ensures that software samples undergo the same feature extraction and transformation process as during training, ensuring consistency in the input data. A threshold-based alert system notifies administrators of detected anomalies, such as an unexpected increase in false positives or high response times.
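A minimal sketch of such a RESTful inference endpoint is shown below, assuming Flask as the web framework, a saved Keras model named hybrid_malware_model.keras, and request payloads that already contain the preprocessed image and sequence features. None of these names are taken from the paper, and the production service would add the NoSQL logging and alerting described above.

```python
# pip install flask tensorflow
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical artifact name; in practice this is the trained hybrid model.
model = tf.keras.models.load_model("hybrid_malware_model.keras")

@app.route("/analyze", methods=["POST"])
def analyze():
    """Score one sample whose features were produced by the same
    preprocessing pipeline used during training."""
    payload = request.get_json()
    image = np.array(payload["binary_image"], dtype=np.float32)    # e.g. 256x256
    seq = np.array(payload["syscall_sequence"], dtype=np.int32)    # e.g. length 500
    score = float(model.predict({"binary_image": image[None, ..., None],
                                 "syscall_sequence": seq[None, :]},
                                verbose=0)[0, 0])
    return jsonify({"malware_probability": score,
                    "label": "malware" if score >= 0.5 else "benign"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```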
Several techniques were implemented to optimize real-time model performance, ensuring fast and efficient inference without compromising model precision. Compression techniques such as quantization and weight pruning were applied to reduce the model's size and improve its inference efficiency. Quantization reduces the precision of the model weights from 32-bit floating point to 16 or even 8 bits, while pruning removes insignificant weights that do not noticeably affect the precision of the model. The model runs on high-performance GPUs, such as the NVIDIA Tesla V100, capable of massively parallel calculations. Additionally, the use of TPUs (Tensor Processing Units) was evaluated for specific inference tasks, which could offer additional benefits in terms of latency and performance.
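Post-training quantization of a trained Keras model can be performed with the TensorFlow Lite converter, as in the sketch below. Float16 quantization is shown, and the SELECT_TF_OPS flag is included because recurrent layers sometimes require TensorFlow fallback ops during conversion. The file names are assumptions, and weight pruning would be applied separately (for example with the TensorFlow Model Optimization Toolkit) before conversion.

```python
import tensorflow as tf

# Assumes the trained hybrid model was saved during the training stage.
model = tf.keras.models.load_model("hybrid_malware_model.keras")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]        # enable quantization
converter.target_spec.supported_types = [tf.float16]        # 32-bit -> 16-bit weights
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_fp16 = converter.convert()

with open("malware_model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)
```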
A cache system was implemented to store recent inference results in order to improve efficiency and reduce latency. This is particularly useful for samples analyzed repeatedly, avoiding the need to process the same samples multiple times. A load balancer distributes inference requests across multiple inference server instances, ensuring no single point of failure and improving system scalability.
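One simple way to realize such a cache is to key recent verdicts on a hash of the raw sample bytes, as in the sketch below. The class name and capacity are illustrative, and a production deployment would more likely use a shared store such as Redis behind the load balancer.

```python
import hashlib
from collections import OrderedDict

class InferenceCache:
    """Small LRU cache keyed by the SHA-256 of the raw sample bytes, so that
    repeatedly submitted samples are not re-analyzed."""

    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, sample_bytes):
        key = hashlib.sha256(sample_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)        # mark as recently used
            return self._store[key]
        return None

    def put(self, sample_bytes, result):
        key = hashlib.sha256(sample_bytes).hexdigest()
        self._store[key] = result
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)     # evict least recently used entry

# Usage: consult the cache before invoking the model, store the verdict after.
cache = InferenceCache()
sample = b"raw sample bytes"
if cache.get(sample) is None:
    cache.put(sample, {"label": "benign", "malware_probability": 0.02})
```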
Several continuous monitoring and evaluation methods were implemented to ensure the malware detection system works effectively in a real production environment. Stress tests were performed to evaluate system performance under load, simulating a high volume of inference requests to identify potential bottlenecks and ensure the system can handle traffic spikes without performance degradation [36]. A continuous monitoring system was implemented using tools such as Prometheus and Grafana, which allow tracking key metrics such as inference latency, error rate, and resource utilization in real time. This helps detect operational issues quickly and take corrective action before they impact end users. In addition to operational metrics, model precision in production was monitored by collecting and analyzing ground-truth labels for a subset of the analyzed samples. This makes it possible to continually evaluate the model's effectiveness and adjust parameters as necessary. A feedback loop was established in which newly labeled samples are fed back to the model for periodic adjustments and retraining, continuously improving its detection capacity and adapting it to new threats.
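The sketch below shows how the latency and error metrics mentioned above could be exposed to Prometheus (and hence to Grafana dashboards) with the prometheus_client library. The metric names, the port, and the simulated work are assumptions for illustration.

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram("inference_latency_seconds",
                              "Time spent producing one verdict")
INFERENCE_ERRORS = Counter("inference_errors_total",
                           "Number of failed inference requests")

def classify(sample_bytes):
    """Placeholder for the real model call; latency and errors are recorded."""
    with INFERENCE_LATENCY.time():
        try:
            time.sleep(random.uniform(0.01, 0.05))   # simulate model latency
            return {"label": "benign"}
        except Exception:
            INFERENCE_ERRORS.inc()
            raise

if __name__ == "__main__":
    start_http_server(9000)        # metrics scraped by Prometheus at :9000/metrics
    for _ in range(100):
        classify(b"sample")
```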
G. ETHICAL AND SAFETY CONSIDERATIONS
Implementing a deep learning-based malware detection system involves not only technical challenges but also ethical and security considerations that are important for its acceptance and effectiveness in real environments. Data privacy is a priority in the design and implementation of the malware detection system. Several measures were adopted to ensure the privacy and security of the data used and generated by the system. First, all sample data is anonymized before use in model training and evaluation. This includes removing personally identifiable information (PII) from software samples and inference logs. Before the anonymization process, the risk of explicit or implicit inferences is assessed; that is, the structure and information within each attribute are identified and understood to ensure that all records enabling such inferences have been removed. Data is also encrypted at rest and in transit using advanced encryption algorithms, such as AES-256, to prevent unauthorized access. Data access is restricted to authorized personnel through role-based access controls (RBAC), ensuring that only users with appropriate credentials can access sensitive information [37].
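The sketch below illustrates two of the safeguards described above with the Python cryptography library: replacing identifying fields with a salted SHA-256 digest, and encrypting data at rest with AES-256 in GCM mode. Function names are illustrative, and in practice the key would live in a secrets manager under the RBAC policy rather than in code.

```python
# pip install cryptography
import hashlib
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def anonymize_id(identifier: str, salt: bytes) -> str:
    """Replace a potentially identifying field (e.g., a device id in an
    inference log) with a salted SHA-256 digest."""
    return hashlib.sha256(salt + identifier.encode()).hexdigest()

# AES-256-GCM for encrypting stored samples and logs at rest.
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

def encrypt(plaintext: bytes) -> bytes:
    nonce = os.urandom(12)                     # unique nonce per message
    return nonce + aesgcm.encrypt(nonce, plaintext, None)

def decrypt(blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None)

token = encrypt(b"inference record with PII already removed")
assert decrypt(token) == b"inference record with PII already removed"
```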
Malware detection carries ethical implications that must be carefully considered. One of the main challenges is the potential for false positives, where legitimate software is incorrectly identified as malware. This can have significant consequences, including disruption of services, loss of data, and damage to the reputation of software developers [38]. To mitigate these risks, manual verification mechanisms are implemented where suspected cases are reviewed before corrective actions are taken. Additionally, transparent communication is maintained with end users, providing clear explanations when malware is detected and allowing appeals or additional reviews in case of disputes.

Another important ethical aspect is responsibility in automated decision-making. System decisions must be auditable and explainable. For this reason, explainable AI (XAI) techniques were implemented to allow the deep learning model's decisions to be broken down and justified. Both the decisions and the techniques implemented must allow human evaluation.

Compliance with relevant regulations and standards is essential for successfully implementing any cybersecurity system. The malware detection system is aligned with various international and local data protection and cybersecurity regulations. This includes compliance with the European Union's General Data Protection Regulation (GDPR), which establishes strict guidelines for collecting, processing, and storing personal data.

In the field of cybersecurity, the system follows the standards established by the National Institute of Standards and Technology (NIST), particularly the NIST Cybersecurity Framework and the guidelines for privacy risk management [39]. In addition, it adheres to the recommendations of the Cybersecurity and Infrastructure Security Agency (CISA) for protecting critical infrastructure and managing cyber incidents. These measures ensure that the system complies with current regulations and is prepared to adapt to future regulatory changes. Continuous review and updating of security and privacy policies and procedures ensure the system remains compliant and adequately protects user data and rights.

IV. RESULTS
A. GENERAL DESCRIPTION OF RESULTS
Analysis of the performance of the deep learning model was performed using a set of metrics, including precision, recall, F1-score, and AUC-ROC. The model was trained and evaluated in multiple phases to obtain these results. First, the data was divided into training, validation, and test sets, ensuring adequate representation of malware and benign software samples in each set. The model was then trained using the training set, with hyperparameter tuning based on performance on the validation set. Finally, model performance was evaluated on the test set to ensure that the metrics reflect the model's ability to generalize to previously unseen data.

In each phase of the analysis, precision, recall, F1-score, and AUC-ROC metrics were calculated to evaluate the performance of the deep learning model. Precision measures the proportion of correct positive predictions over the total positive predictions, recall measures the model's ability to correctly identify positive samples (malware), the F1-score provides a balance between precision and recall, and the AUC-ROC evaluates the ability of the model to distinguish between the positive and negative classes.

TABLE 2. Deep learning model evaluation metrics.

The results obtained are presented in Table 2. The model achieved a precision of 0.96 on the validation set and 0.95 on the test set, indicating a high precision level in classifying malware and benign software. The model recall was 0.94 in validation and 0.93 in testing, reflecting its ability to identify malware samples correctly. The F1-score, which balances precision and recall, was 0.95 in validation and 0.94 in testing, suggesting balanced performance of the model. Finally, the AUC-ROC, which measures the model's ability to distinguish between classes, was 0.97 in validation and 0.96 in testing, demonstrating excellent discrimination between malware and benign software.

TABLE 3. Performance comparison between deep learning model and traditional malware detection methods.

The deep learning model performed better than traditional malware detection methods, such as those based on signatures and heuristics. As shown in Table 3, the deep learning model achieved significantly higher precision (0.95) compared to the signature-based (0.85) and heuristic-based (0.80) methods. Similarly, the recall and F1-score of the deep learning model were higher, with values of 0.93 and 0.94, respectively, compared to 0.80 and 0.825 for signature-based methods and 0.78 and 0.79 for heuristic methods. The AUC-ROC of the deep learning model (0.96) also outperformed that of traditional methods, indicating a better ability to distinguish between malware and benign software.

Figure 3 illustrates the deep learning model's performance compared to traditional malware detection methods, using line graphs for a more detailed and precise representation. The first part of the figure presents the deep learning model's performance on the training, validation, and test sets.

FIGURE 3. Performance of the deep learning model and comparison with traditional malware detection methods. Graph 3A: Metrics of the deep learning model. Graph 3B: Comparison of conventional malware detection methods.

In Graph 3A, we observe that the deep learning model shows high precision in all sets, with values of 0.98 in training, 0.96 in validation, and 0.95 in testing. This indicates that the model accurately classifies malware and benign software samples. The model recall follows a similar trend, with values of 0.97 in training, 0.94 in validation, and 0.93 in testing, reflecting its ability to identify malware samples correctly. The F1-score, which balances precision and recall, is also high, with values of 0.975 in training, 0.95 in validation, and 0.94 in testing, suggesting balanced model performance. Finally, the AUC-ROC of the model is 0.99 in training, 0.97 in validation, and 0.96 in testing, demonstrating excellent discrimination between malware and benign software.
Graph 3B compares the performance of the deep learning model with traditional methods based on signatures and heuristics. Here, the deep learning model outperforms conventional methods in all evaluated metrics. The precision of the deep learning model is significantly higher (0.95) compared to the signature-based (0.85) and heuristic-based (0.80) methods. Similarly, the recall and F1-score of the deep learning model are higher, with values of 0.93 and 0.94, respectively, compared to 0.80 and 0.825 for signature-based methods and 0.78 and 0.79 for heuristic methods. The AUC-ROC of the deep learning model, with a value of 0.96, also exceeds that of traditional methods, indicating a better ability to distinguish between malware and benign software.

The results demonstrate the effectiveness of the deep learning approach in malware detection, outperforming traditional methods in precision, recall, F1-score, and AUC-ROC. The superiority of the deep learning model is due to its ability to learn complex patterns and features that signature-based and heuristic methods cannot capture. This capability allows the deep learning model to detect newer, more sophisticated malware variants with greater precision, making it a valuable tool for real-time cybersecurity.
B. QUANTITATIVE RESULTS
The precision, recall, F1-score, and AUC-ROC metrics were calculated over 50 epochs for the training, validation, and test sets to evaluate the deep learning model's performance. This analysis was carried out following a meticulous process in which the model was trained iteratively and evaluated in each epoch, thus allowing us to observe how the metrics evolve. These results provide detailed insight into the model's ability to generalize to unseen data, complementing the general results presented in the previous section.

The critical difference between this section and the previous one lies in the granularity and temporal focus of the metrics. While the previous section focused on the final results and the comparison with traditional methods, here we explore how the metrics change during the training process, providing insights into the stability and behavior of the model over time.

TABLE 4. Comparison of Features between Interactive Learning Tools.

Table 4 shows that the model precision improves consistently across epochs, reaching a value of 0.98 on the training set, 0.96 on the validation set, and 0.95 on the test set at the end of 50 epochs. The recall also shows continuous improvement, with final values of 0.97, 0.94, and 0.93 for the training, validation, and test sets. The F1-score follows a similar trend, indicating an adequate balance between precision and recall. At the same time, the AUC-ROC reflects an excellent ability of the model to distinguish between classes, with final values of 0.99, 0.97, and 0.96. These metrics allow an understanding of how the model improves performance over time and help identify potential optimization points in future iterations.

TABLE 5. Analysis of Case Studies in Malware Detection.

Figure 4 presents the evolution of the loss during training and validation, as well as the confusion matrices for the validation and test sets. The process to obtain these results includes monitoring the loss in each epoch during training and validation, which allows for evaluating the model's convergence and detecting possible overfitting or underfitting problems. The confusion matrices provide a detailed view of the model's ability to correctly classify malware and benign software samples.

In Graph 4A, the evolution of the loss during training and validation shows how the model fits the data. Initially, the loss is high but decreases as the model learns, stabilizing towards the later epochs, indicating that the model has reached a good fit. Graphs 4B and 4C represent the confusion matrices for the validation and test sets. These matrices show that the model has a high rate of true positives and true negatives, with a relatively low number of false positives and false negatives. This confirms the model's ability to correctly classify malware and benign software samples. However, there is always room to reduce false negatives further and increase the model's sensitivity.
FIGURE 4. Visualization of the evolution of the loss and confusion matrices. Graph 4A: Evolution of Loss during Training and Validation. Graph 4B: Confusion Matrix - Validation. Graph 4C: Confusion Matrix - Test.

TABLE 6. Performance Metrics Under Load.

Figure 5 presents the system's performance under different load levels.

FIGURE 5. System Performance Under Different Load Levels. Graph 5A: Inference Latency Under Different Load Levels. Graph 5B: Request Processing Rate Under Different Load Levels.

FIGURE 6. Continuous System Performance in Production. Graph 6A: Average Latency Over Time. Graph 6B: Real-Time Error Rate Over Time.

TABLE 7. Continuous Monitoring Metrics of the System in Production.
In Graph 6A, an increasing trend in average latency is observed, especially noticeable between Day 3 and Day 5, which could indicate an increasing workload or optimization problems in the system. Average latency increased from 60 ms on Day 1 to 82 ms on Day 5, suggesting the need for adjustments to improve operational efficiency.

Graph 6B presents the real-time error rate over time. The error rate shows an increasing trend, with significant peaks on Day 3 and Day 5, when it reached 0.85%. This increase in errors can be due to system overload, network issues, or failures in the underlying infrastructure. This analysis highlights the importance of continuous monitoring and the need for periodic adjustments to maintain the stability and reliability of the system in production.

E. COMPARATIVE EVALUATION WITH ADVANCED TECHNIQUES
To provide a clear context for the developed model's performance, it was compared with other recent studies in the literature. This benchmarking process compares our model's critical metrics with other deep-learning approaches and traditional malware detection methods.

When comparing our model with other recent studies presented in Table 8, it is observed that our deep learning-based approach outperforms the models presented in studies A [12], B [13], and C [14]. Specifically, our model achieves a precision of 98%, a recall of 97%, an F1-score of 0.975, and an AUC-ROC of 0.99. These values are higher than those obtained in the other studies, where the metrics range between 94% and 96% for precision and between 91% and 94% for recall.

Table 9 compares traditional malware detection methods. The deep learning-based model also shows significantly better performance than these traditional approaches.
TABLE 8. Comparison of Key Metrics with Other Deep Learning Approaches.

TABLE 10. Limitations of the Study and Results.
Techniques such as these could be promising approaches to improving model generalization ability and reducing biases in the data.

In terms of future work, exploring several directions to improve and expand this work is recommended. A promising line of research is the development of hybrid models that combine traditional machine learning techniques with deep learning to take advantage of the strengths of both approaches. Additionally, implementing real-time detection systems that dynamically adapt to new threats and adjust their parameters based on live data is crucial to maintaining model relevance and effectiveness in production environments.

It would also be beneficial to investigate the application of XAI techniques to provide greater transparency and interpretability in model decisions, thus facilitating their adoption in safety-critical environments.

REFERENCES
[1] E. S. Alomari, R. R. Nuiaa, Z. A. A. Alyasseri, H. J. Mohammed, N. S. Sani, M. I. Esa, and B. A. Musawi, "Malware detection using deep learning and correlation-based feature selection," Symmetry, vol. 15, no. 1, p. 123, Jan. 2023, doi: 10.3390/sym15010123.
[2] X. Luo, J. Li, W. Wang, Y. Gao, and W. Zhao, "Towards improving detection performance for malware with a correntropy-based deep learning method," Digit. Commun. Netw., vol. 7, no. 4, pp. 570–579, Nov. 2021, doi: 10.1016/j.dcan.2021.02.003.
[3] Y. J. Kim, C.-H. Park, and M. Yoon, "FILM: Filtering and machine learning for malware detection in edge computing," Sensors, vol. 22, no. 6, p. 2150, Mar. 2022, doi: 10.3390/s22062150.
[4] Y. Liu, P. Yang, P. Jia, Z. He, and H. Luo, "MalFuzz: Coverage-guided fuzzing on deep learning-based malware classification model," PLoS ONE, vol. 17, no. 9, Sep. 2022, Art. no. e0273804, doi: 10.1371/journal.pone.0273804.
[5] M. Maray, M. Maashi, H. M. Alshahrani, S. S. Aljameel, S. Abdelbagi, and A. S. Salama, "Intelligent pattern recognition using equilibrium optimizer with deep learning model for Android malware detection," IEEE Access, vol. 12, pp. 24516–24524, 2024, doi: 10.1109/access.2024.3357944.
[6] G. Iadarola, F. Martinelli, F. Mercaldo, and A. Santone, "Towards an interpretable deep learning model for mobile malware detection and family identification," Comput. Secur., vol. 105, Jun. 2021, Art. no. 102198, doi: 10.1016/j.cose.2021.102198.
[7] Ö. A. Aslan and R. Samet, "A comprehensive review on malware detection approaches," IEEE Access, vol. 8, pp. 6249–6271, 2020, doi: 10.1109/ACCESS.2019.2963724.
[8] A. R. Nasser, A. M. Hasan, and A. J. Humaidi, "DL-AMDet: Deep learning-based malware detector for Android," Intell. Syst. Appl., vol. 21, Mar. 2024, Art. no. 200318, doi: 10.1016/j.iswa.2023.200318.
[9] H. Rathore, A. Samavedhi, S. K. Sahay, and M. Sewak, "Robust malware detection models: Learning from adversarial attacks and defenses," Forensic Sci. Int., Digit. Invest., vol. 37, Jul. 2021, Art. no. 301183, doi: 10.1016/j.fsidi.2021.301183.
[10] S. Dong, L. Shu, and S. Nie, "Android malware detection method based on CNN and DNN hybrid mechanism," IEEE Trans. Ind. Informat., vol. 20, no. 5, pp. 7744–7753, May 2024, doi: 10.1109/tii.2024.3363016.
[11] B. Liu, W. Huo, C. Zhang, W. Li, F. Li, A. Piao, and W. Zou, "αDiff: Cross-version binary code similarity detection with DNN," in Proc. 33rd ACM/IEEE Int. Conf. Automated Softw. Eng., Sep. 2018, pp. 667–678, doi: 10.1145/3238147.3238199.
[12] P. Yadav, N. Menon, V. Ravi, S. Vishvanathan, and T. D. Pham, "EfficientNet convolutional neural networks-based Android malware detection," Comput. Secur., vol. 115, Apr. 2022, Art. no. 102622, doi: 10.1016/j.cose.2022.102622.
[13] Y. Yao, Y. Zhu, Y. Jia, X. Shi, L. Zhang, D. Zhong, and J. Duan, "Research on malware detection technology for mobile terminals based on API call sequence," Mathematics, vol. 12, no. 1, p. 20, Dec. 2023, doi: 10.3390/math12010020.
[14] Z. Chen and J. Cao, "VMCTE: Visualization-based malware classification using transfer and ensemble learning," Comput., Mater. Continua, vol. 75, no. 2, pp. 4445–4465, 2023, doi: 10.32604/cmc.2023.038639.
[15] A. V. Pandit and D. Mondal, "Real-time malware detection on IoT devices using behavior-based analysis and neural networks," Res. J. Comput. Syst. Eng., vol. 4, no. 2, pp. 117–129, Dec. 2023, doi: 10.52710/rjcse.82.
[16] S. Y. Yerima, M. K. Alzaylaee, A. Shajan, and P. Vinod, "Deep learning techniques for Android botnet detection," Electronics, vol. 10, no. 4, p. 519, Feb. 2021, doi: 10.3390/electronics10040519.
[17] F. M. Alotaibi and Fawad, "A multifaceted deep generative adversarial networks model for mobile malware detection," Appl. Sci., vol. 12, no. 19, p. 9403, Sep. 2022, doi: 10.3390/app12199403.
[18] K. Kong, Z. Zhang, Z.-Y. Yang, and Z. Zhang, "FCSCNN: Feature centralized Siamese CNN-based Android malware identification," Comput. Secur., vol. 112, Jan. 2022, Art. no. 102514, doi: 10.1016/j.cose.2021.102514.
[19] G. Marín, P. Casas, and G. Capdehourat, "DeepMAL—Deep learning models for malware traffic detection and classification," in Data Science–Analytics and Applications. Wiesbaden, Germany, 2021, pp. 105–112, doi: 10.1007/978-3-658-32182-6_16.
[20] S. S. Lad and A. C. Adamuthe, "Improved deep learning model for static PE files malware detection and classification," Int. J. Comput. Netw. Inf. Secur., vol. 14, no. 2, pp. 14–26, Apr. 2022, doi: 10.5815/ijcnis.2022.02.02.
[21] A. Morales, R. Cuevas, and J. M. Martínez, "Analytical processing with data mining," RECI Revista Iberoamericana de las Ciencias Computacionales e Informática, vol. 5, no. 9, pp. 22–43, 2016. [Online]. Available: http://www.reci.org.mx/index.php/reci/article/view/40/176
[22] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008, doi: 10.1145/1327452.1327492.
[23] A. Ksibi, M. Zakariah, L. Almuqren, and A. S. Alluhaidan, "Efficient Android malware identification with limited training data utilizing multiple convolution neural network techniques," Eng. Appl. Artif. Intell., vol. 127, Jan. 2024, Art. no. 107390, doi: 10.1016/j.engappai.2023.107390.
[24] U. A. Khan and A. Alamäki, "Designing an ethical and secure pain estimation system using AI sandbox for contactless healthcare," Int. J. Online Biomed. Eng., vol. 19, no. 15, pp. 166–201, Oct. 2023, doi: 10.3991/ijoe.v19i15.43663.
[25] I. Almomani, A. Alkhayer, and W. El-Shafai, "E2E-RDS: Efficient end-to-end ransomware detection system based on static-based ML and vision-based DL approaches," Sensors, vol. 23, no. 9, p. 4467, May 2023, doi: 10.3390/s23094467.
[26] A. Rasool, A. R. Javed, and Z. Jalil, "SHA-AMD: Sample-efficient hyper-tuned approach for detection and identification of Android malware family and category," Int. J. Ad Hoc Ubiquitous Comput., vol. 38, nos. 1–3, p. 172, 2021, doi: 10.1504/ijahuc.2021.119097.
[27] B. Menaouer, A. E. H. M. Islem, and M. Nada, "Android malware detection approach using stacked AutoEncoder and convolutional neural networks," Int. J. Intell. Inf. Technol., vol. 19, no. 1, pp. 1–22, Sep. 2023, doi: 10.4018/ijiit.329956.
[28] M. Aamir, M. W. Iqbal, M. Nosheen, M. U. Ashraf, A. Shaf, K. A. Almarhabi, A. M. Alghamdi, and A. A. Bahaddad, "AMDDLmodel: Android smartphones malware detection using deep learning model," PLoS ONE, vol. 19, no. 1, Jan. 2024, Art. no. e0296722, doi: 10.1371/journal.pone.0296722.
[29] H. G. Ghifari, D. Darlis, and A. Hartaman, "Pendeteksi golongan darah manusia berbasis tensorflow menggunakan ESP32-CAM," ELKOMIKA, Jurnal Teknik Energi Elektrik, Teknik Telekomunikasi, Teknik Elektronika, vol. 9, no. 2, p. 359, Apr. 2021, doi: 10.26760/elkomika.v9i2.359.
[30] J. Huang, "Accelerated training and inference with the TensorFlow object detection API," Google AI Blog, Mountain View, CA, USA, Rep., 2017.
[31] Y. Qiao, W. Zhang, Z. Tian, L. T. Yang, Y. Liu, and M. Alazab, "Adversarial malware sample generation method based on the prototype of deep learning detector," Comput. Secur., vol. 119, Aug. 2022, Art. no. 102762, doi: 10.1016/j.cose.2022.102762.
[32] M. Chandan, S. G. Santhi, and T. S. Rao, "Combined shallow and deep learning models for malware detection in WSN," Int. J. Image Graph., vol. 19, no. 2, Sep. 2023, Art. no. 2550034, doi: 10.1142/s0219467825500342.
[33] L. D. M. Ortiz-Aguilar, M. Carpio, J. A. Soria-Alcaraz, H. Puga, C. Díaz, C. Lino, and V. Tapia, "Training OFF-line hyperheuristics for course timetabling using K-folds cross validation," La Revista Programación Matemática y Softw., vol. 8, pp. 1–8, Oct. 2016.
[34] S. Sen, D. Sugiarto, and A. Rochman, "Komparasi metode multilayer perceptron (MLP) dan long short term memory (LSTM) dalam peramalan harga beras," Ultimatics, vol. 12, no. 1, pp. 35–41, 2020.
[35] A. A. Darem, F. A. Ghaleb, A. A. Al-Hashmi, J. H. Abawajy, S. M. Alanazi, and A. Y. Al-Rezami, "An adaptive behavioral-based incremental batch learning malware variants detection model using concept drift detection and sequential deep learning," IEEE Access, vol. 9, pp. 97180–97196, 2021, doi: 10.1109/ACCESS.2021.3093366.
[36] I. Almomani, A. Alkhayer, and W. El-Shafai, "An automated vision-based deep learning model for efficient detection of Android malware attacks," IEEE Access, vol. 10, pp. 2700–2720, 2022, doi: 10.1109/ACCESS.2022.3140341.
[37] G. Sahani, C. S. Thaker, and S. M. Shah, "Supervised learning-based approach mining ABAC rules from existing RBAC enabled systems," EAI Endorsed Trans. Scalable Inf. Syst., vol. 10, no. 1, 2023, Art. no. e9, doi: 10.4108/eetsis.v5i16.1560.
[38] M. Cho, J.-S. Kim, J. Shin, and I. Shin, "mal2D: 2D based deep learning model for malware detection using black and white binary image," IEICE Trans. Inf. Syst., vol. E103-D, no. 4, pp. 896–900, 2020, doi: 10.1587/transinf.2019edl8146.
[39] T. Lu, Y. Du, L. Ouyang, Q. Chen, and X. Wang, "Android malware detection based on a hybrid deep learning model," Secur. Commun. Netw., vol. 2020, pp. 1–11, Aug. 2020, doi: 10.1155/2020/8863617.
[40] A. Albakri, F. Alhayan, N. Alturki, S. Ahamed, and S. Shamsudheen, "Metaheuristics with deep learning model for cybersecurity and Android malware detection and classification," Appl. Sci., vol. 13, no. 4, p. 2172, Feb. 2023, doi: 10.3390/app13042172.

LORENA NARANJO GODOY received the master's degree in new technologies law and the Ph.D. degree (cum laude) in legal and political sciences from the Universidad Pablo de Olavide, Seville, Spain. She is a Researcher, a BID Consultant, an undergraduate and postgraduate Teacher, the author of several academic articles, and a national and international Lecturer. She is a leading implementer with national and international companies, banks, and other entities in the financial sector, digital platforms, and e-commerce in adopting personal data protection models, cybersecurity, and digital transformation incorporating big data, the Internet of Things, and artificial intelligence, with experience in the public and private sectors. She is the author and a leader of the process of approval of the personal data protection law for Ecuador and other regulations that allowed its implementation in the National System of Public Data Registry, when she was the National Director of DINARDAP. She was the Director of the School of Law, UDLA; an Undersecretary of Normative Development of the Ministry of Justice, Human Rights and Worship; an Advisor to the Presidency of the National Court of Justice; and the National Director of the Public Data Registry. Currently, she is the Director of the Master's in digital law and innovation, with a mention in the economy, trust, and digital transformation with UDLA, and of the digital law and personal data protection area of Estudio Jurídial.