Improve Malware Classifiers Performance Using Cost-Sensitive Learning For Imbalanced Dataset
Corresponding Author:
Ikram BEN ABDEL OUAHAB
Computer science, systems and telecommunication laboratory (LIST)
Faculty of Sciences and Techniques, University Abdelmalek Essaadi
Tangier, Morocco
Email: [email protected]
1. INTRODUCTION
Malware is malicious code designed to install covertly on a target system. Its malicious intent
may be to destroy data, install additional malicious programs, exfiltrate data, or encrypt data to obtain
a ransom [1]. Malware compromises the confidentiality, integrity, and availability of the user's data. The
landscape of malware is constantly evolving. In the past, malware was typically fast-acting and easily
detectable, often carrying out destructive actions shortly after infecting a system [2], [3]. Older types of
malware had specific procedures for dealing with different types of infections. However, today's malware is
designed to be stealthy and difficult to detect. It spreads slowly over time, gathering information over a longer
period before exfiltrating it. Modern-day malware tends to utilize a single set of procedures, as most attacks
are blended and incorporate multiple methods [4], [5].
In cybersecurity, the use of artificial intelligence (AI) is becoming necessary [6]–[8]. Many works in the
literature focus on solving the imbalanced data issue [9], [10], since most algorithms are designed to
work well with balanced databases. Recently, most researchers have worked with the malware visualization
technique. This method deals indirectly with the malicious code. The main idea is to visualize a malicious
binary executable as a grayscale or color image. These images are represented as arrays of values in the range
[0, 255]. This technique was first introduced by Nataraj et al. in 2011 [11], who released the Malimg database,
which contains 9,369 malware images in 25 classes. Researchers use many methodologies [12], [13], whose
common first step is malware visualization (dealing with images). A malware detection
method was proposed by Wu et al. [14], which used cascading extreme gradient boosting (XGBoost) and
cost-sensitive techniques to handle unbalanced data. The method used application programming interface (API)
calls extracted from portable executable (PE) files as features, and adopted a three-tier cascading XGBoost
approach for data balancing and model training. The authors used a database that contained two classes, malicious
and benign API calls, and achieved a high accuracy of 99% with this method. In a separate study, Burks et al.
[15] used generative models to produce synthetic training data for malware detection. Two models
were used, the generative adversarial network (GAN) and the variational autoencoder (VAE), with the goal of
improving the performance of the residual network (ResNet-18) classifier. The addition of synthetic malware
samples to the training data resulted in a 2% accuracy improvement for ResNet-18 using VAE, and a 6%
accuracy improvement using GAN.
In this paper, we classify malware into 25 malware families. To deal with imbalanced
data, we propose a new approach to calculate weights as part of cost-sensitive learning. We then
evaluate the proposed approach using two different convolutional neural network (CNN) models that we
developed from scratch using the functional and subclassing Keras APIs. We compare the proposed weights
approach with classical approaches such as weights calculated using sklearn and random weight values. The
overall goal of this work is to increase the performance of the classifier while working with imbalanced data.
We reach this goal: our proposed weights approach performs better than the other techniques and
better than training without any cost-sensitive learning approach. This manuscript is structured as follows. First, an
introduction to malware classification challenges and how researchers deal with imbalanced data. Second, the
description of the proposal. Third, the methods and materials used in the whole approach, followed by the
experiments and the results obtained. Finally, we discuss these results, compare them with others in the literature,
and conclude with future perspectives.
2. PROPOSAL
This article's contribution is a weights approach for cost-sensitive learning to deal with imbalanced
data in general and malware image data in particular, as shown in Figure 1. We demonstrate that the classically
used weights are not effective when there are many classes, as in our case with 25 classes. We then evaluate our
approach using two CNN models, built with the functional and subclassing APIs. All the experiments share a common
goal: to detect and classify malware variants effectively into their corresponding families. With 25 classes as a
use case, the improvement of the proposed weights over the classical weights is clearly visible in
terms of classification metrics. Our main contributions are:
− Proposing a customized weight for cost-sensitive learning to deal with the imbalanced Malimg database.
− Evaluating the cost-sensitive approach using two CNN models and comparing it with the classical approach.
Improve malware classifiers performance using cost-sensitive learning … (Ikram Ben Abdel Ouahab)
1838 ISSN: 2252-8938
3. METHODS
3.1. Image representation of a malware
Malware visualization is an area focused on detecting, classifying, and presenting malware features
in the form of visual cues that can be used to convey more data about a specific malware type. Visualization
techniques can be used to display static data, monitor network traffic, or manage networks. In [16], the visualization
technique is used to discover and visualize malware behavior. Recently, researchers have focused on the development
of orthogonal methods motivated by signal and image processing to deal with malware variants. They took
advantage of the fact that most malware variants have a similar structure, since a new malware is in most cases
simply a variant of an existing one. So, malware is treated as a digital signal, and signal and image
processing techniques are applied to it. These techniques have proved to be effective in malware classification
and detection in many studies [17]. The traditional way to view and edit malware binaries is by using hex editors,
which show the binary file byte by byte in hexadecimal format. In [11], the authors proposed a new method to
view binary files as a grayscale image or signal. A malware binary is read as a vector of 8-bit unsigned integers,
as shown in Figure 2. These integers are then organized into a 2D array, so that it can be viewed
as a grayscale image with values in the range [0, 255]. After converting a malware binary to a grayscale image,
the image itself keeps a significant structure, as described in an older work [18]. The binary fragments of a malware
show special image textures, which have allowed malware images to be classified effectively for years.
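The conversion described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the authors' code: the function name `bytes_to_grayscale`, the fixed row `width`, and the choice to drop trailing bytes that do not fill a full row are all assumptions made for the example, and the input is a synthetic byte string rather than a real binary.

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape raw bytes into a 2D uint8 array (one byte = one pixel)."""
    # Each byte is read as an 8-bit unsigned integer in [0, 255].
    pixels = np.frombuffer(data, dtype=np.uint8)
    # Assumption: trailing bytes that do not fill a complete row are dropped.
    height = len(pixels) // width
    return pixels[: height * width].reshape(height, width)

# Synthetic stand-in for a binary file; no real malware is read here.
sample = bytes(range(256)) * 4
image = bytes_to_grayscale(sample, width=32)
print(image.shape, image.dtype)
```

The resulting array can be saved or displayed with any image library as a grayscale picture, since every value already lies in [0, 255].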
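[Figure: bar chart of the number of samples per malware family; x-axis: Malware families; bar labels: 1591, 800, 431, 408, 384, 214, 200, 198, 184, 177, 162, 158, 159, 146, 142, 136, 132, 128, 127, 123, 116, 106, 97, 80]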
to samples from the minority class. Ensemble learning combines multiple techniques from one or both
categories (data-level preprocessing and/or cost-sensitive learning). Hence, this method is broadly referred to
as ensemble learning, and it can be viewed as a wrapper around other methods [20]–[23].
In general, the goal of a machine learning algorithm is to minimize the cost function, or loss function,
(1). In cost-sensitive learning, we modify this cost function to take into account that the cost of a false positive
and that of a false negative may not be the same. Below is the standard cost function for the logistic regression
classifier, also known as the binary cross-entropy loss. In logistic regression, we call the positive class 1 and the
negative class 0. These values are chosen for convenience; the numerical values given to each class do not
really matter as long as two different numbers are used.
$$\text{cost} = \frac{1}{n} \sum_{i=1}^{n} \left[ -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{1}$$
Where:
$n$ is the number of training samples
$y_i$ is the actual label of sample $i$
$\hat{y}_i$ is the predicted probability for sample $i$
$-y_i \log(\hat{y}_i)$ represents the cost for $y_i = 1$ (minority)
$-(1 - y_i) \log(1 - \hat{y}_i)$ represents the cost for $y_i = 0$ (majority)
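As a worked illustration of the cost function (1), the following pure-Python sketch computes the mean binary cross-entropy for a handful of predictions. The function name and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the formula itself.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy, following equation (1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) is never evaluated (assumption).
        p = min(max(p, eps), 1.0 - eps)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

labels = [1, 0, 0, 1]
probs = [0.9, 0.1, 0.2, 0.6]
print(round(binary_cross_entropy(labels, probs), 4))
```

Only one of the two terms is active for each sample: the first when the label is 1, the second when it is 0.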
For the modified cost function (2), we define two class weights w1 and w0 to incorporate the significance
of each class in the cost function. In general, wj is defined as the total number of samples divided by the
product of the number of classes and the number of samples in class j.
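$$\text{cost}_{\text{modified}} = \frac{1}{n} \sum_{i=1}^{n} \left[ -w_1\, y_i \log(\hat{y}_i) - w_0\, (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{2}$$
Where,
$$w_j = \frac{n}{\text{classes} \times n_j} \tag{3}$$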
When the two classes are balanced, w1 and w0 are identical, which means that we put the same weight on
false positive and false negative mistakes. However, suppose the two classes are imbalanced: the minority
class has, for example, 10% of the total number of samples, and the majority class has 90%. We can then plug
these values into the equation for the weight parameter: $n_1 = \frac{n}{10}$ (minority) and $n_0 = \frac{9n}{10}$ (majority).
So,
$$\left. \begin{aligned} w_1 &= \frac{n}{2 \times \frac{n}{10}} = 5 \\ w_0 &= \frac{n}{2 \times \frac{9n}{10}} = \frac{10}{18} < 1 \end{aligned} \right\} \rightarrow w_0 < w_1 \tag{4}$$
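The arithmetic of this example can be checked with a short pure-Python sketch. It computes the weights with the formula wj = n / (classes × nj) and plugs them into the modified cost function (2); this is also the heuristic that scikit-learn documents for its "balanced" class weights. The helper names `class_weights` and `weighted_bce`, the clipping constant `eps`, and the toy labels are assumptions made for the illustration.

```python
import math
from collections import Counter

def class_weights(labels):
    """w_j = n / (classes * n_j), the weight formula given above."""
    counts = Counter(labels)
    n, classes = len(labels), len(counts)
    return {label: n / (classes * n_j) for label, n_j in counts.items()}

def weighted_bce(y_true, y_prob, w1, w0, eps=1e-12):
    """Mean weighted binary cross-entropy, following equation (2)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) is never evaluated (assumption).
        p = min(max(p, eps), 1.0 - eps)
        total += -w1 * y * math.log(p) - w0 * (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Toy labels: 10% minority (1) and 90% majority (0), as in the example above.
labels = [1] * 10 + [0] * 90
weights = class_weights(labels)
print(weights)
```

With these weights, an equally confident mistake on a minority sample contributes nine times more to the cost than the same mistake on a majority sample, which is what pushes the classifier to pay attention to the minority class.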
We see here that the weight of the minority class is greater than the weight of the majority class. Based on
the cost function given before, this means that we are paying more attention to the minority class. We now
turn to the weight calculation. First, we use the sklearn function to compute the weights. This function is an
implementation of the previous formulas, so there is no need to redo it. As presented in Table 1, these weights
are tiny, on the order of $10^{-6}$. After that, we use random values of 1 and 2 for the 25 classes. These two
weighting methods are not effective in our case, which led us to look for a new way to compute weights.
Above all, we propose a new approach to calculate weights for multiclass databases. The reason the
previous weights were too small is that, according to the weight formula, we divide by 25 classes. So, we decided
to redistribute the classes in order to increase these weights and make them more relevant. In our approach, we
first sort the classes by number of samples, then we divide the database into 5 new classes: class A, class B,
class C, class D, and class E. In other words, class A contains the 5 majority classes and class E contains the
5 minority classes. After that, we compute the total number of samples in each new class and the percentage of
each new class over the whole database. Then, we calculate the new weights, this time with only 5 classes in
the weight formula. In general, the simplified formula of the weight for class i is given in (5). As planned,
our weights respect the ordering of the classes. Based on the new weights given in Table 2, the model will give
more attention to classes with a high weight (class E) and less attention to classes with a low weight (class A).
$$w_i = \frac{1}{\text{classes} \times N_i} \tag{5}$$
Where,
$\text{classes}$ is the number of classes
$N_i$ is the percentage of class $i$ over the database
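The grouping procedure can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the function name `grouped_weights` is hypothetical, N_i is taken as a fraction of the whole database (so that the result matches the general weight formula), each original family inherits the weight of the group it belongs to, and the sample counts in the toy example are invented for brevity.

```python
def grouped_weights(samples_per_family, group_size=5):
    """Give each family the weight of its group: w = 1 / (groups * share)."""
    # Sort families from the most to the fewest samples.
    ordered = sorted(samples_per_family, key=samples_per_family.get, reverse=True)
    total = sum(samples_per_family.values())
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    weights = {}
    for group in groups:
        # Assumption: N_i is the group's share of the database as a fraction.
        share = sum(samples_per_family[f] for f in group) / total
        w = 1.0 / (len(groups) * share)
        for family in group:
            weights[family] = w
    return weights

# Hypothetical counts for four families, grouped two by two for brevity.
counts = {"fam_a": 60, "fam_b": 20, "fam_c": 15, "fam_d": 5}
weights = grouped_weights(counts, group_size=2)
print(weights)
```

For the 25 Malimg families, calling the function with the default `group_size=5` would yield the five groups A to E described above, and the resulting dictionary could be mapped to class indices and passed as per-class weights during training.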
model without cost sensitive. Here we use the Keras TensorFlow functional API. The model architecture is simple,
with well-known layers, as shown in Figure 4. The second model architecture is given in the figure below. First, we
build the CNNBlock. Second, we create the ResBlock based on the previous block. Third, we build the global
malware detection model, which contains the previous blocks in addition to other widely known layers such as
MaxPooling, flatten, and dense. We train and evaluate the model without cost-sensitive learning, then with the
default weights, and then with the proposed weights. Here, we have more flexibility and more options to customize
in terms of coding, as shown in Figure 5.
3.4. Tools
Powerful hardware is essential for image processing, and in our laboratory, we use the NVIDIA
Quadro T1000 with Max-Q graphics processing unit (GPU) workstation due to its high compute capability of
7.5, which allows us to process images quickly compared to other devices. For deep learning using TensorFlow,
we found that installing compute unified device architecture (CUDA) and CUDA deep neural network
(cuDNN) on the GPU environment was necessary. We also installed additional required Python packages.
While attempting to use a simple Conda command for installation, we encountered numerous errors, leading
us to recommend a manual installation and configuration to save time and ensure successful installation. A
manual installation guide for TensorFlow can be found in reference [25].
The loss is 1%, the accuracy is 98.46%, the precision is 98.5%, and the recall is 98.42%. These values remain
the best over several experiments. In addition, cost-sensitive learning using our proposed weights
approach also gives the best results compared to the other methods for the functional CNN model. So, we demonstrated
the efficacy of this approach in the case of many classes, 25 in our case. However, the classical weights
calculated with the sklearn function are the worst with the first model and average with the second model. Results
are presented in Figure 6 and Table 3.
Figure 6. Loss, accuracy, precision, and recall curves using various techniques
So, customized weights are effective with the imbalanced Malimg database. In general, the
same approach can be applied to any imbalanced data. The point is to avoid the default weights, especially
with a multiclass database, where we can regroup the classes to calculate the weights. The proposed weights
approach gives a detailed calculation for 25 classes; the same idea could be applied in other contexts. In the
literature, most works using cost-sensitive learning in different domains and applications show an improvement
when this technique is used to deal with imbalanced data [26], [27]. For instance, in [14] cost-sensitive learning
was implemented for binary classification, and the result obtained reaches 99%. For the imbalanced Malimg
database, the researchers in [15] used GAN, which also gives acceptable results (90%). In our paper, we proposed
the cost-sensitive weights approach to deal with the imbalanced Malimg data, and the result obtained is 98%. The most
important point is that all classes receive appropriate attention through their weights: even if a class has few
samples, we gave it a meaningful weight, and the classifier was able to recognize this class more effectively than
before. The final performance of the overall model without cost-sensitive learning or any other technique that deals
with imbalanced data can be very high, but when we look at the details, we find that the model fails to recognize
some classes effectively or with high accuracy (mainly those with less data) [28], [29].
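The last point, that a high overall score can hide poorly recognized minority classes, can be illustrated with a per-class recall computation. The sketch below uses invented toy labels and a hypothetical helper `per_class_recall`; it does not reproduce any result from this paper.

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Recall of each class: correct predictions / samples of that class."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {label: correct[label] / total[label] for label in total}

# Toy data: 95 majority samples, all correct; 5 minority samples, 1 correct.
y_true = ["major"] * 95 + ["minor"] * 5
y_pred = ["major"] * 95 + ["minor"] + ["major"] * 4
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
print(accuracy, recalls)
```

In this toy case the overall accuracy is high even though most minority samples are misclassified, which is why per-class metrics are worth inspecting alongside the global ones.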
5. CONCLUSION
Summing up, in this work we investigated cost-sensitive learning for advancing the classification of
imbalanced data. A new cost-sensitive weights computation was proposed and evaluated using 2 CNN models
along with evaluation metrics. The main goal is to improve the performance of the classification of malware into
their corresponding families. So, we proposed a new cost-sensitive approach using customized weights
to deal with an unbalanced database. We order the classes by number of samples, then we form new classes,
each of which comprises 5 of the malware classes. Then we compute the weights; each new weight is
given to all malware families belonging to the corresponding new class. The idea is to give more attention to classes
having few samples. After that, we compare the proposed weights to the classically computed weights. When
applying the proposed weights, the model performance clearly improved with each of the two CNN models.
In conclusion, we recommend using customized weights in the case of many classes, e.g., 25 classes, in order
to improve the overall performance, and especially the performance within minority classes. The best results
in this paper are obtained with the customized cost-sensitive approach and the CNN subclassing model, where we
improved the accuracy by +0.1% (and likewise the other metrics). As future work, we aim to develop a
framework based on our methods to defend against malware using malware images and deep learning. Also,
we plan to use GAN as a data augmentation technique and compare it to the present findings.
Moreover, we found that the subclassing API is the most suitable way to build CNN models. It gives the developer
many possibilities to customize nearly everything. We intend to go deeper in this direction
and propose customized layers and functions.
ACKNOWLEDGMENTS
This work was supported by the “Centre National pour la Recherche Scientifique et Technique”,
CNRST, Morocco.
REFERENCES
[1] P. W. Singer and A. Friedman, “Cybersecurity: What everyone needs to know,” Oxford University Press, no. September, pp. 1–7,
2018, [Online]. Available: https://www.researchgate.net/profile/Sushma-Rao-4/publication/354907006_Cybersecurity_What_Everyone_needs_to_know_Cybersecurity_What_Everyone_needs_to_know/links/6153a90b14d6fd7c0fb7a705/Cybersecurity-What-Everyone-needs-to-know-Cybersecurity-What-Everyon.
[2] M. Macas, C. Wu, and W. Fuertes, “A survey on deep learning for cybersecurity: Progress, challenges, and opportunities,” Computer
Networks, vol. 212, 2022, doi: 10.1016/j.comnet.2022.109032.
[3] A. Gaurav, B. B. Gupta, and P. K. Panigrahi, “A comprehensive survey on machine learning approaches for malware detection in
IoT-based enterprise information system,” Enterprise Information Systems, vol. 17, no. 3, 2023,
doi: 10.1080/17517575.2021.2023764.
[4] F. L. Barsha and H. Shahriar, “Mitigation of malware using artificial intelligence techniques,” Security Engineering for Embedded
and Cyber-Physical Systems, pp. 221–234, 2022, doi: 10.1201/9781003278207-13.
[5] M. Ahsan, K. E. Nygard, R. Gomes, M. M. Chowdhury, N. Rifat, and J. F. Connolly, “Cybersecurity threats and their mitigation
approaches using machine learning- A review,” Journal of Cybersecurity and Privacy, vol. 2, no. 3, pp. 527–555, 2022,
doi: 10.3390/jcp2030027.
[6] S. E. Donaldson, S. G. Siegel, C. K. Williams, and A. Aslam, “Enterprise cybersecurity study guide,” Enterprise Cybersecurity
Study Guide, 2018, doi: 10.1007/978-1-4842-3258-3.
[7] S. Mahdavifar and A. A. Ghorbani, “Application of deep learning to cybersecurity: A survey,” Neurocomputing, vol. 347,
pp. 149–176, 2019, doi: 10.1016/j.neucom.2019.02.056.
[8] L. F. Sikos, “AI in cybersecurity,” Springer, 2018, doi: 10.1007/978-3-319-98842-9.
[9] J. H. Lee and K. H. Park, “GAN-based imbalanced data intrusion detection system,” Personal and Ubiquitous Computing, vol. 25,
no. 1, pp. 121–128, 2021, doi: 10.1007/s00779-019-01332-y.
[10] Y. Fu, Y. Du, Z. Cao, Q. Li, and W. Xiang, “A deep learning model for network intrusion detection with imbalanced data,”
Electronics (Switzerland), vol. 11, no. 6, 2022, doi: 10.3390/electronics11060898.
[11] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, “Malware images: Visualization and automatic classification,” ACM
International Conference Proceeding Series, 2011, doi: 10.1145/2016904.2016908.
[12] I. Obaidat, M. Sridhar, K. M. Pham, and P. H. Phung, “Jadeite: A novel image-behavior-based approach for Java malware detection
using deep learning,” Computers and Security, vol. 113, 2022, doi: 10.1016/j.cose.2021.102547.
[13] T. Sree Lakshmi, M. Govindarajan, and A. Sreenivasulu, “Malware visual resemblance analysis with minimum losses using Siamese
neural networks,” Theoretical Computer Science, vol. 943, pp. 219–229, 2023, doi: 10.1016/j.tcs.2022.07.018.
[14] D. Wu, P. Guo, and P. Wang, “Malware detection based on cascading XGboost and cost sensitive,” Proceedings - 2020 International
Conference on Computer Communication and Network Security, CCNS 2020, pp. 201–205, 2020,
doi: 10.1109/CCNS50731.2020.00051.
[15] R. Burks, K. A. Islam, Y. Lu, and J. Li, “Data augmentation with generative models for improved malware detection: A comparative
study,” 2019 IEEE 10th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference, UEMCON 2019,
pp. 0660–0665, 2019, doi: 10.1109/UEMCON47517.2019.8993085.
[16] K. Han, J. H. Lim, and E. G. Im, “Malware analysis method using visualization of binary files,” Proceedings of the 2013 Research
in Adaptive and Convergent Systems, RACS 2013, pp. 317–321, 2013, doi: 10.1145/2513228.2513294.
[17] I. B. A. Ouahab, M. Bouhorma, L. El Aachak, and A. A. Boudhir, “Towards a new cyberdefense generation: Proposition of an
intelligent cybersecurity framework for malware attacks,” Recent Advances in Computer Science and Communications, vol. 15,
no. 8, pp. 1026–1042, 2020, doi: 10.2174/2666255813999201117093512.
[18] G. Conti et al., “A visual study of primitive binary fragment types,” Black Hat USA, pp. 1–17, 2010.
BIOGRAPHIES OF AUTHORS
Ikram Ben Abdel Ouahab received her master's degree in Computer Systems and
Networks from the Faculty of Sciences and Techniques of Tangier, Morocco. She is currently
working toward her PhD degree in the LIST Laboratory of FSTT, University Abdelmalek
Essaadi, Tangier, Morocco. Her main research interests include cybersecurity, malware
analysis, artificial intelligence, and IoT. She has participated in many international conferences
and has published more than 10 scientific papers in 3 years with the LIST Laboratory team. In
August 2022, she graduated from the CPITS (Certification Program in IT Security) program
provided by Trend Micro, an intensive 9-week training in cybersecurity technologies and
industry trends, including certification in industry-leading solutions for endpoint, cloud, and
network security. She is a recently certified AWS (Amazon Web Services) Solutions Architect.
She can be contacted at email: [email protected].