SlideShare a Scribd company logo
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
DOI: 10.5121/ijcnc.2024.16605 75
INTRUSION DETECTION MODEL USING MACHINE
LEARNING ALGORITHMS ON NSL-KDD DATASET
Reem Alshamy and Muhammet Ali Akcayol
Department of Computer Science, Gazi University, Ankara, Turkey
ABSTRACT
Big data, generated by various sources such as mobile devices, sensors, and the Internet of Things (IoT),
has many characteristics such as volume, velocity, variety, variability, veracity, validity, vulnerability,
volatility, visualization, and value. An Intrusion Detection System (IDS) is essential for cybersecurity to
detect intrusions before or after attacks. Traditional software methods struggle to store, manage, and
analyze big data, developing new techniques for effective and rapid intrusion detection in organizations
and enterprises. This study introduces the IDS Random Forest (RF) model in binary and multiclass
classification for intrusion detection. In this model, we used the Synthetic Minority Oversampling
TEchnique (SMOTE) to address class imbalances, and the RF classifier to classify attacks using the
Network Security Laboratory (NSL)-KDD dataset. In the experiment, we compared the IDS-RF model with
the Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and Logistic Regression (LR) classifiers
in terms of accuracy, precision, recall, f1-score, and times for training and testing. The experimental
results showed that the IDS-RF model achieved high performance in binary and multiclass classification
compared to others. In addition, the proposed model also achieved high accuracies for each class (Normal,
DoS, Probe, U2R, or R2L) and obtained 98.69%, 99.72%, 98.93%, 95.13%, and 89%, respectively.
KEYWORDS
Intrusion Detection, Network Security Laboratory (NSL)-KDD dataset, SMOTE, Machine Learning
1. INTRODUCTION
Computers and the Internet have become integral components of human life. Cybercrime is rising
in a world characterized by connectivity and data dependency, causing significant disruptions to
our way of life. Tools such as antivirus software, firewalls, and IDS have been deployed to
address these security challenges [1].
An IDS is a device that examines data to identify and detect unauthorized attempts to affect a
system or network's security. Traditional methods can be complex and inefficient when dealing
with big data, making the system vulnerable to attack. Leveraging big data tools and techniques
can reduce the time required for computations. An IDS utilizes three primary methods to detect
attacks: 1) Signature-based detection is effective for detecting known attacks but cannot detect
new types of attacks. 2) Anomaly-based detection utilizes established profiles to compare user
activities and identify odd behaviors; however, it can result in high false-positive rates (FPR). 3)
Hybrid-based detection combines multiple methods to address each method's drawbacks while
utilizing its strengths. Many researchers have recommended using ML techniques in IDS to
reduce FPR and improve IDS accuracy. By leveraging advanced data analysis tools and
methodologies, an IDS can enhance the effectiveness and precision of its detection approaches [2,
3].
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
76
The term " big data" refers to an amount of data that stands out for its volume, velocity, and
variety. Moreover, the increasing number of attacks worsens these issues. As a result, big data
poses challenges, for IDS in identifying and analyzing intrusions and attacks. Dealing with
security data from sources is a task for intrusion detection particularly when handling the
complexities of big data. The complexity of managing data grows when dealing with issues like
having high dimensions of data from various sources or varied data structures. To tackle these
challenges effectively an IDS needs to integrate techniques designed for big data environments.
Incorporating big data techniques and ML in IDS can help address challenges such, as
computational time and speed ultimately contributing to the improvement of more accurate IDS
systems [2].
Spark and Hadoop MapReduce are widely recognized examples of frameworks used for
processing big data. Spark is a distributed processing system known for its speed. Its flexible
architecture supports a range of large-scale workloads including algorithms, batch applications,
streaming tasks, and interactive searches. Furthermore, Spark offers an ML library known as
MLlib [4]. Numerous research papers have employed ML classifiers to identify intrusions, in
settings some of which have integrated techniques, for selecting features to handle amounts of
data [5]. Other studies have employed Apache Spark and its MLlib to develop distributed IDS.
However, users of the Spark ML library face limitations in applying feature selection methods, as
this library only contains the chi-squared selector [6].
Datasets are essential for training and evaluating anomaly-based network IDS. Some studies have
used actual data captured from switches and tools designed to simulate attacks to test their
systems. However, a major inconvenience is the impossibility of comparing the results with those
of other studies because of the differences in the data, which inevitably lead to different
outcomes. Consequently, several public network datasets are available for this purpose. The most
well-known are DARPA98, ARPA99, KddCup99, NSL-KDD, Kyoto 2006+, UNSW-NB15, and
CICIDS2017, which are widely used in numerous studies. However, these datasets typically
include a significant number of redundant or missing records and are considered outdated. The
landscape of computer networks and their attack patterns has evolved considerably since their
creation. To address these challenges, the Communications Security Establishment (CSE) has
introduced a new dataset. Since its release, many projects have demonstrated that CSE-CIC-IDS
2018 is more complex than previous datasets, and can be reliably used to evaluate existing and
new detection strategies. CSE-CIC-IDS 2018 accurately represents modern network attacks [7,
8].
The remainder of this study is structured as follows: Section 2 presents a review of related
works. In the 'Methodology' section, we introduce the IDS-RF model and describe each step of
this method. The results of the IDS-RF model and a discussion are presented in Section 4.
Comparisons with existing studies are provided in Section 5. Finally, we conclude our work and
outline future work in the 'Conclusion' section.
2. RELATED WORKS
Many research approaches have been introduced for IDS that apply ML algorithms, feature
selection methods, and big data platforms. Different approaches have been proposed for IDS that
utilize the benchmark NSL-KDD dataset to improve attack detection. Some researchers have
employed traditional techniques, while others have leveraged big data techniques in an IDS,
aiming to reduce training and computation time. With the advent of big data, traditional
techniques for handling such data have become increasingly complex. In this section, the authors
summarize previous works that utilized big data techniques to address classification problems
(binary, multi-class, or both) in IDS.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
77
Louati et al. [5] proposed a novel IDS based on Multi-Agent Reinforcement Learning (MARL)
for big data environments. The MARL method leverages decentralized cooperative learning
among agents to address the issues presented by the large amount, velocity, and variety of big
data. This work contributes significantly to developing advanced IDS that can effectively
function within the challenging environments of big data frameworks.
Trrad et al. [9] focused on using ML algorithms to categorize network traffic. They specifically
examined the NSL-KDD dataset to evaluate how well classifiers, like SVM, k-NN, and Long
Short-Term Memory (LSTM) could detect network activities.
Lin et al. [10] introduced the Multi-Feature Extraction Extreme Learning Machine (MFE-ELM)
as a method to prevent intrusions in cloud computing environments related to the Internet of
Things (IoT). For intrusion detection, the MFE-ELM method employed the NSL-KDD dataset.
The results indicate that the proposed algorithm enhances the ability to promptly identify security
weaknesses in cloud clusters, facilitating the development of intrusion response strategies.
Mari et al. [11] introduced a random forest algorithm that uses learning to enhance feature
selection. The algorithm combined three ML models; SVM, LR, and k-NN. Their research
showed an enhancement, in detecting and categorizing network intrusions through ML
highlighting the importance of feature selection in improving detection accuracy. The study also
underlined the potential of these methods for real-time detection and monitoring of cyber-attacks.
Türk et al. [12] evaluated intrusion detection capabilities by applying ML and Deep Learning
(DL) techniques on two datasets, specifically the University of New South Wales (UNSW)-
NB15 and NSL-KDD. They achieved accuracy in classification showcasing the effectiveness of
these algorithms, in identifying and categorizing network intrusions.
Kalinin et al. [13] investigated Quantum ML (QML) to address challenges, in intrusion detection
and improve effectiveness. Their study involved comparing Quantum SVM (QSVM) and
Quantum Convolutional Neural Networks (QCNN) as strategies for detecting intrusions. The
findings from their experiments showed that intrusion detection based on QML can effectively
manage amounts of data inputs with high accuracy and at twice the speed of conventional
algorithms.
Vibhute et al. [14] developed an ML-based IDS using a Generative Adversarial Network (GAN).
The study aimed to generate adversarial network traffic instances that might avoid conventional
IDS detection. These instances were then used to improve the IDS's ability to detect new and
evolving threats. The GAN method not only assessed the durability of the IDS against complex
approaches used to avoid detection but also enhanced its capacity to adapt to new attacks,
resulting in notable progress in the field of cybersecurity.
Abid et al. [15] proposed a real-time, fault-tolerant, distributed, and scalable approach for
detecting intrusions in Industrial Control Systems (ICS). Their system employs data fusion
techniques to merge streaming datasets into a stream enhancing accuracy and precision. The
approach demonstrated performance in binary and multiclass classifications by leveraging
Apache Spark Structured Continuous Streaming for real-time data processing along with a Multi-
Layer Perceptron Classifier (MLPC). Table 1 provides a comparison of studies focusing on
aspects such as algorithms, tools, datasets, and classification problems.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
78
Table 1. The comparison of the related works.
In related works, we observed that IDS encounter challenges posed by big data. Researchers have
addressed these challenges using two main strategies: employing a complete dataset and applying
the proposed approach with Big Data techniques [5]. Table 1 summarizes related works,
categorizing them based on the algorithms used for classification, the tools developed for the
research, the datasets on which the research was trained and evaluated, and the focus of the
classification problems (Binary, Multiclass, or both).
3. METHODOLOGY
This section will provide a detailed explanation of the proposed model and the tools that were
utilized. Figure 1 shows the IDS-RF model. The IDS-RF model can be summarized in the
following steps:
1. Import the NSL-KDD dataset.
2. Conduct data pre-processing.
3. Utilize the SMOTE and RF classifier to train the IDS-RF model on the training dataset.
4. The IDS-RF model is evaluated by applying the RF classifier to the NSL-KDD test dataset.
Figure 1. The sequence of steps in the IDS-RF model.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
79
3.1. Dataset Description
The NSL-KDD dataset is among the most widely used datasets for training and evaluating IDS
models [16]. This dataset contained 41 features, plus one feature for the target class, including
four categorical, four binary, and 33 numerical features. The features in the NSL-KDD dataset
are divided into four groups: basic, host, traffic, and content features [17]. The 42 features
provide data on five classes of network connections, with each instance classified either as
normal or as one of four types of attacks [7, 14].
The attack types in the NSL-KDD dataset can fall into one of the following four main categories:
Denial of Service Attack (DoS): The attacker occupies excessive computing or memory resources
by performing calculations, thereby denying legitimate users access to the machine. This is
achieved by handling more requests so the system can process them logically. Probing Attack:
This type of attack occurs when an attacker exploits vulnerabilities to gain local access to the
network. Then, the attacker sends packets to the device over the network. Remote to Local Attack
(R2L): This attack occurs when an attacker takes advantage of vulnerabilities to gain local
network access and begins sending packets to the device over the network. User-to-Root Attack
(U2R): An attacker gains root access by exploiting vulnerabilities, potentially allowing the
attacker to escalate privileges from a regular user account to root access. The distribution of
various attacks on the NSL-KDD dataset is presented in Table 2.
Table 2. The Distribution of various attack types in the NSL-KDD dataset [9].
The training dataset consisted of 23 classes, whereas the test dataset included 38. These
encompass 21 attacks from the training dataset, 16 novel attacks, and one normal class. The class
labels of instances in the dataset were categorized into five main categories: Normal, DoS, Probe,
R2L, and U2R. According to the distribution of instances across different attack types, as
illustrated in Table 2, the distribution of various types of data is unbalanced. 'Normal' represents
the highest percentage of the total data, whereas U2R has the lowest, which may lead to
inadequate training and misclassification of data. Attackers often utilize Probe, R2L, and U2R
attacks as advanced threats. Therefore, improving the classification accuracy for these types of
attacks should be prioritized.
3.2. Dataset Pre-Processing
The preprocessing is an essential step that enhances the data quality and aids the extraction of
features [18]. This step is essential for developing and training the models. We first need to
preprocess the data to derive the value from the dataset and prepare it for the IDS-RF model,
Below, we describe the techniques used for data preprocessing:
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
80
3.2.1. Transformation
The data performed a transformed into formats that are appropriate for ML algorithms [19]. First,
the dataset underwent a process of eliminating features that included zero values. Additionally,
we transformed the categorical features within the dataset into numerical features. Second, we
employ the one-hot encoder technique to transform categorical features into a one-hot numeric
array. We achieve this by identifying the distinct values of each feature and creating
corresponding categories. As an illustration, the output of this function can encode the
'protocol_type' feature, which has three separate values (TCP, UDP, and UCMP), as [1,0,0],
[0,1,0], and [0,0,1] correspondingly. As a result, this function transformed the initial 41-
dimensional NSL-KDD dataset into a format with 122 dimensions.
3.2.2. Standardization
In ML, the standardization of datasets is very significant for algorithms that utilize the Euclidean
distance. Without standardization, there is a potential risk that features with values in larger
ranges may be given undue importance. Because only some features can be represented within
the same measurement range, features of varying magnitudes can negatively affect ML
algorithms [20]. This issue can be mitigated through standardization, which involves converting
data to a common range. Standardization is beneficial because it helps improve the prediction
performance of the IDS-RF model. For example, the NSL-KDD dataset encompasses features
with a broad spectrum of values. Some features displayed significant discrepancies between their
minimum and maximum values, making them incomparable and unsuitable for direct processing.
An example is the 'duration' feature, which ranges from a minimum of 0 to a maximum of
58,329, rendering the feature values inconsistent for analysis. To address these challenges within
the IDS-RF model, we employ StandardScaler. This tool standardizes features by eliminating the
mean and scaling to unit variance using the following formula to achieve data standardization:
𝒛 =
𝒙 − 𝒖
𝒔
𝟏
The equation represents the relationship between the standardized value of a sign (z), the original
vector (x), the mean of the vectors (u), and the standard deviation (s).
3.3. IDS-RF Model
This section presents the SMOTE technique and the RF classifier as a component of the IDS-RF
model.
3.3.1. SMOTE
The NSL-KDD dataset has a class imbalance issue. When dealing with a dataset that contains a
class imbalance, it is preferable to use the SMOTE method to enhance attack detection in the
IDS-RF model [21, 22]. SMOTE was developed by Chawla [23]. This method generates
synthetic samples in the minority class instead of replacing existing samples, which helps avoid
the overfitting problem. The SMOTE algorithm was proposed to overcome the overfitting issue
and improve the accuracy. This straightforward cluster-based oversampling technique has gained
widespread acceptance in the community [24]. It selects two samples from the minority class and
assigns properties between them. The difference in value is calculated for each feature, and then a
random value r is generated to lie between 0 and 1. Consequently, the features of the new sample
were generated according to the following equation:
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
81
𝒙𝒏𝒊 = 𝒎𝒊𝒏 𝒙𝒋𝒊, 𝒙𝒌𝒊 + |𝒙𝒋𝒊 − 𝒙𝒌𝒊| × 𝒓 (2)
Equation (2) calculates the value of xni, which represents the ith
feature of the nth
new sample. It is
determined by taking the minimum value between xji and xki and adding the absolute difference
between xji and xki multiplied by the r value. The samples xj and xki are randomly selected from
the minority class, and r is a number between 1 and 0. This equation helps generate a set of
oversampled values. By employing this method, the two classes were nearly evenly distributed in
terms of proportion. This approach effectively preserves the correlation between the
characteristics, even when a substantial number of artificially generated samples are included in
the dataset. The researchers utilized the SMOTE technique in their tests to attain precise
categorization [25]. Figures 2 and 3 demonstrate the distribution of instances for various
attacks before and after applying SMOTE to the training dataset in binary and multiclass
scenarios.
Figure 2. Results of Applying the SMOTE method to the training dataset for binary classification.
Figure 3. Results of Applying the SMOTE Method to the training dataset for multiclass classification.
3.3.2. RF Classifier
The random forest algorithm is composed of many tree predictors. Each tree makes predictions
based on a random vector, which is sampled individually and follows the same distribution across
all trees in the forest [26]. Random forests are regarded as an enhanced iteration of bootstrap
aggregating (bagging), where the forecasts of many decision trees, developed with bootstrapped
data samples that are identically distributed, are aggregated. One drawback of bagging is that the
correlations between each pair of variables limit the extent to which the prediction errors can be
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
82
reduced, even when a large number of trees are aggregated. Random forests enhance the bagging
technique by randomly selecting a subset of variables to use for splitting while constructing each
decision tree. This helps to reduce the impact of correlations between pairs of variables. To
obtain a comprehensive description of the random forest algorithm, please consult the work of
Breiman [26]. The number of trees (B) and the number of randomly picked variables utilized for
each split in an individual decision tree (m) are two important hyperparameters in constructing
random forests. These two parameters are commonly identified using a grid search technique in
conjunction with cross-validation [27]. As stated by Breiman [26], the process of building a
random forest model involves the following steps: Iterate over the values of b from 1 to B:
For b=1 to :
(a) Generate a resampled dataset of size N using the bootstrap method from the training data.
(b) Construct an RF tree by iteratively performing the following steps for each terminal
node of the tree until the minimal node size nmin is reached.
1) Randomly choose m variables from the complete set of variables.
2) Choose the most optimal variable from the options provided.
3) Partition the node into two sub-regions.
Print the collection of trees { }i, i=1, 2,…, .
For a regression problem, given a new input x, the corresponding prediction can be written as:
(3)
Figure 4 illustrates the steps described above.
Figure 4. Random Forest Architecture [27].
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
83
4. RESULTS ANALYSIS AND DISCUSSION
4.1. Evaluation Metrics
Some of the performance metrics used to evaluate the effectiveness of the IDS-RF model involve
accuracy, precision, recall, f1-score, and prediction time for binary and multiclass classification.
Accuracy is a metric that measures the proportion of correct predictions made by the IDS-RF
model when applied to a test dataset. Precision is a measure of the accuracy of a positive
prediction. Recall, or the true positive rate, is the measure of the number of true positives
predicted out of all positives in the dataset. The f1-score measures the accuracy of a model based
on its recall and precision. The metrics mentioned above have definitions in Eqs. (4), (5), (6), and
(7).
where TP is True Positive, FN is False Negative, FP is False Positive, and TN is True Negative.
TP indicates the number of records correctly detected as a normal class, FN indicates the number
of records not correctly detected as an attack class, FP indicates the number of records not
correctly detected as a normal class, and TN indicates the number of records correctly detected as
an attack class.
4.2. Binary Classification
The IDS-RF model achieved an accuracy of 98.67% for binary classification on the NSL-KDD
dataset. Precision was 98.68%, recall was 98.67%, and f1-score was 98.67%. The confusion
matrix for the IDS-RF model in the binary classification is shown in Figure 5. In addition, we
assessed the performance of the ML algorithms k-NN, SVM, and LR, focusing on accuracy,
precision, recall, and F1-score. These algorithms were chosen because of their extensive
application and proven efficacy in IDS [17]. The accuracy of the k-NN classifier was 97.34%, the
precision was 97.32%, the recall was 97.26%, and the f1-score was 97.29%. The accuracy of the
SVM classifier was 95.77%, the precision was 95.72%, the recall was 95.65%, and the f1-score
was 95.69%. The accuracy of the LR classifier was 94.62%, precision was 94.61%, recall was
94.41%, and f1-score was 94.50%. A comparison of the accuracy values between the four
algorithms for each class (normal and attack) is presented in Table 3. The results in Table 3
highlight the superior performance of the IDS-RF model in terms of accuracy for binary and
multiclass classification.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
84
Figure 5. Confusion Matrix for IDS-RF model for binary classification.
Table 3. Accuracy for each class in binary classification.
4.3. Multiclass Classification
For NSL-KDD in multiclass, the IDS-RF model achieved an overall accuracy of 98.54%,
precision of 97.58 %, recall of 96.29%, and f1-score of 96.92%. The confusion matrix for the
IDS-RF model for multiclass classification is shown in Figure 6. The accuracy of the k-NN
classifier was 96.93 %, precision was 94.33 %, recall was 93.98%, and the f1-score was 94.15 %.
The accuracy of the SVM classifier was 95.53%, precision was 93.69 %, recall was 93.39 %, and
f1-score was 93.49%. The accuracy of the LR classifier was 94.76%, precision was 93.25%,
recall was 91.36%, and f1-score was 92.13%. In addition, the IDS-RF model achieved high
performance in terms of accuracy for multiclass classification. The IDS-RF model accuracy was
higher for Normal, DoS, Probe, and R2L and lower for U2R. This is because the U2R class lacks
sufficient data compared to other attacks, leading to a lower accuracy than other classes. The
accuracy comparison between the four algorithms for each class in the multiclass classification is
presented in Figure 7.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
85
Figure 6. Confusion Matrix for IDS-RF model in multiclass classification.
Table 4. Accuracy for each class in multiclass classification.
Figure 7. Accuracy for ML algorithms in multiclass classification.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
86
In IDS, analyzing big data requires high real-time performance, which requires not only high
evaluation metrics but also minimal detection times to ensure the efficiency of network security
applications. To demonstrate the superiority of the IDS-RF model, researchers measured its
training and prediction times for binary and multiclass classification and compared these metrics
with other ML algorithms. The findings revealed that our proposed model significantly reduced
training and prediction times. Table 5 compares the training and prediction times between the
proposed IDS-RF model and the other classifiers for binary and multiclass classification.
Table 5. Training and testing times for IDS-RF model with other classifiers.
IDS-RF consistently outperformed the other algorithms in all metrics, establishing it as the most
effective option for the classification task in this study. These impressive results were achieved
by integrating the RF classifier with the SMOTE method, which was applied to the training data
for binary classification, as illustrated in Table 3. The superiority of the IDS-RF model is mainly
due to its flexibility and adeptness in identifying correlations between features, rendering it an
ideal choice for this task. In addition, RF consistently shows excellent performance, even when
handling large datasets. Moreover, the preprocessing stages yielded a dataset comprising 134,686
instances for binary classification and 202,305 for multiclass classification. Utilizing Anaconda
capabilities [28], our system achieved an exceptional processing rate of 22,544 instances in just 6
s for binary classification and 41.762 s for multiclass classification (as shown in Table 5). This
rapid processing rate significantly boosts the intrusion detection capabilities, thereby enhancing
the overall efficiency and effectiveness of the system. The results highlight the efficiency and
reliability of the proposed system classification.
5. COMPARISON WITH EXISTING WORKS
The importance of the strategy presented in the IDS-RF model is compared with existing studies.
Some previous studies focused on ML and DL algorithms and achieved satisfactory results. In
this section, we focus on studies that use the same dataset used in our model to train and test the
IDS-RF model. Although the NSL-KDD dataset we used does not cover modern attacks,
researchers still use it widely. In addition, when comparing the IDS-RF model with previous
studies that used the NSL-KDD dataset for binary and multiclass classification, the IDS-RF
model consistently demonstrated high performance and speed in attack detection across all
metrics evaluated. The accuracy of the k-NN model in [14] was 98.24% when using the ensemble
learning-enabled RF algorithm for feature selection. In [11], correlation-based feature selection
achieved an accuracy of 91.33% through k-NN, less than that achieved in our study. Based on our
findings, the IDS-RF model suggested in this work outperforms the models used in previous
studies [5, 9, 10, 12] in terms of accuracy, training, and testing times. To enhance the
performance of the IDS-RF model, we recommend its implementation in Apache Spark, a cluster
computing platform designed for high-speed processing [11]. Table 6 shows the significance of
the current study compared to recent studies in binary and multiclass classification.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
87
Table 6. Comparison results with other studies in binary and multiclass classification.
6. CONCLUSIONS
This study introduces an IDS-RF model developed by researchers to detect intrusions using the
NSL-KDD dataset. The classification process for big data is complicated in an IDS because of its
high dimensionality. The proposed model leveraged the Anaconda platform to process and
analyze big data rapidly. In the proposed model, researchers utilized the SMOTE technique to
handle a dataset and utilized Random Forest (RF) for binary and multiclass classification. The
experimental results showed that the model achieved high accuracy, precision, recall, f1-score,
and speed for binary and multiclass classification compared with the k-NN, SVM, and LR
algorithms. The proposed model achieved an overall accuracy of 98.67% for binary classification
and 98.54% for multiclass classification. The proposed model also achieved high testing
accuracies for Normal (98.69%), DoS (99.72%), Probe (98.93%), R2L (95.13%), and U2R (89%)
even though the accuracy of the U2R class was lower compared with other attacks because the
number of instances in this class is small. In addition, the proposed model achieved high
accuracy, which is remarkable for binary and multiclass classification compared to existing
models. In future work, we plan to apply the IDS-RF model on more modern datasets to detect
intrusions and reduce computation time using the Apache Spark environment.
CONFLICTS OF INTEREST
The authors declare no conflict of interest.
ACKNOWLEDGEMENTS
We would like to thank all authors for their contributions and the success of this manuscript and
all editors and anonymous reviewers of this manuscript.
REFERENCES
[1] F. Bannat Wala and M. J. a. e.-p. Kiran, "5G Network Security Practices: An Overview and
Survey," p. arXiv: 2401.14350, 2024.
[2] S. M. Othman, N. T. Alsohybe, F. M. Ba-Alwi, A. T. J. I. J. o. C.-S. Zahary, and D. Forensics,
"Survey on intrusion detection system types," vol. 7, no. 4, pp. 444-463, 2018.
[3] A. Touzene, A. Al Farsi, N. J. I. J. o. C. N. Al Zeidi, and Communications, "HIGH
PERFORMANCE NMF BASED INTRUSION DETECTION SYSTEM FOR BIG DATA IOT
TRAFFIC," vol. 16, no. 2, pp. 43-58, 2024.
[4] X. Sun, L. Zhao, J. Chen, Y. Cai, D. Wu, and J. Z. J. E. A. o. A. I. Huang, "Non-MapReduce
computing for intelligent big data analysis," vol. 129, p. 107648, 2024.
[5] F. Louati, F. B. Ktata, and I. J. C. C. Amous, "Big-IDS: a decentralized multi agent reinforcement
learning approach for distributed intrusion detection in big data networks," pp. 1-19, 2024.
[6] K. Pramilarani and P. V. J. A. S. C. Kumari, "Cost based Random Forest Classifier for Intrusion
Detection System in Internet of Things," vol. 151, p. 111125, 2024.
International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024
88
[7] M. Ghurab, G. Gaphari, F. Alshami, R. Alshamy, and S. J. A. J. o. R. i. C. S. Othman, "A detailed
analysis of benchmark datasets for network intrusion detection system," vol. 7, no. 4, pp. 14-33,
2021.
[8] M. Ghurab, R. Alshamy, S. J. I. J. o. S. R. Othman, and E. Development, "Performance Evaluation
for Attack Detection in Intrusion Detection System," vol. 4, no. 5, 2021.
[9] Trrad, "Applying Deep Learning Techniques for Network Traffic Classification: A Comparison
Study on the NSL-KDD Dataset," 2024.
[10] H. Lin, Q. Xue, J. Feng, and D. Bai, "Internet of things intrusion detection model and algorithm
based on cloud computing and multi-feature extraction extreme learning machine," Digital
Communications and Networks, vol. 9, no. 1, pp. 111-124, 2023.
[11] A.-G. Mari, D. Zinca, and V. J. S. Dobrota, "Development of a Machine-Learning Intrusion
Detection System and Testing of Its Performance Using a Generative Adversarial Network," vol. 23,
no. 3, p. 1315, 2023.
[12] F. J. B. E. Ü. F. B. D. Türk, "Analysis of intrusion detection systems in UNSW-NB15 and NSL-
KDD datasets with machine learning algorithms," vol. 12, no. 2, pp. 465-477, 2023.
[13] M. Kalinin and V. Krundyshev, "Security intrusion detection using quantum machine learning
techniques," Journal of Computer Virology and Hacking Techniques, vol. 19, no. 1, pp. 125-136,
2023.
[14] A. D. Vibhute, C. H. Patil, A. V. Mane, and K. V. J. P. C. S. Kale, "Towards Detection of Network
Anomalies using Machine Learning Algorithms on the NSL-KDD Benchmark Datasets," vol. 233,
pp. 960-969, 2024.
[15] A. Abid, F. Jemili, and O. Korbaa, "Real-time data fusion for intrusion detection in industrial control
systems based on cloud computing and big data techniques," Cluster Computing, pp. 1-22, 2023.
[16] R. Alshamy, M. J. J. o. C. S. Ghurab, and Engineering, "A review of big data in network intrusion
detection system: Challenges, approaches, datasets, and tools," vol. 8, no. 7, pp. 62-74, 2020.
[17] NSL-KDD dataset. Accessed at Jan. 01, 2024. [Online]. Available:
https://www.unb.ca/cic/datasets/nsl.html
[18] M. A. Bouke, A. Abdullah, K. Cengiz, and S. J. P. C. S. Akleylek, "Application of BukaGini
algorithm for enhanced feature interaction analysis in intrusion detection systems," vol. 10, p. e2043,
2024.
[19] K. C. Santos, R. S. Miani, F. J. J. o. N. de Oliveira Silva, and S. Management, "Evaluating the
Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems," vol.
32, no. 2, p. 36, 2024.
[20] M. A. Talukder, S. Sharmin, M. A. Uddin, M. M. Islam, and S. J. I. J. o. I. S. Aryal, "MLSTL-WSN:
machine learning-based intrusion detection using SMOTETomek in WSNs," pp. 1-20, 2024.
[21] W. A. Ali, M. Roccotelli, G. Boggia, and M. P. J. I. S. J. A. G. P. Fanti, "Intrusion detection system
for vehicular ad hoc network attacks based on machine learning techniques," pp. 1-19, 2024.
[22] R. Alshamy, M. Ghurab, S. Othman, and F. Alshami, "Intrusion detection model for imbalanced
dataset using SMOTE and random forest algorithm," in Advances in Cyber Security: Third
International Conference, ACeS 2021, Penang, Malaysia, August 24–25, 2021, Revised Selected
Papers 3, 2021, pp. 361-378: Springer.
[23] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. J. J. o. a. i. r. Kegelmeyer, "SMOTE: synthetic
minority over-sampling technique," vol. 16, pp. 321-357, 2002.
[24] J. S. Aguilar-Ruiz and M. J. S. R. Michalak, "Classification performance assessment for imbalanced
multiclass data," vol. 14, no. 1, p. 10759, 2024.
[25] A. Kumar, D. Singh, R. J. M. T. Shankar Yadav, and Applications, "Class overlap handling methods
in imbalanced domain: A comprehensive survey," pp. 1-48, 2024.
[26] L. J. M. l. Breiman, "Random forests," vol. 45, pp. 5-32, 2001.
[27] H. Gong, Y. Sun, W. Hu, P. A. Polaczyk, and B. Huang, "Investigating impacts of asphalt mixture
properties on pavement performance using LTPP data through random forests," Construction and
Building Materials, vol. 204, pp. 203-212, 2019.
[28] Anaconda platform. Accessed at Jan. 01, 2024. [Online]. Available: https://www.anaconda.com/
Ad

More Related Content

Similar to Intrusion Detection Model using Machine Learning Algorithms on NSL-KDD Dataset (20)

Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...
Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...
Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...
IJCNCJournal
 
Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...
Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...
Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...
IJCNCJournal
 
Constructing a predictive model for an intelligent network intrusion detection
Constructing a predictive model for an intelligent network intrusion detectionConstructing a predictive model for an intelligent network intrusion detection
Constructing a predictive model for an intelligent network intrusion detection
Alebachew Chiche
 
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIERATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
CSEIJJournal
 
Attack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest ClassifierAttack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest Classifier
CSEIJJournal
 
A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...
A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...
A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...
IRJET Journal
 
Attention Mechanism for Attacks and Intrusion Detection
Attention Mechanism for Attacks and Intrusion DetectionAttention Mechanism for Attacks and Intrusion Detection
Attention Mechanism for Attacks and Intrusion Detection
AIRCC Publishing Corporation
 
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
AIRCC Publishing Corporation
 
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
ijcsit
 
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SETCLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
IJNSA Journal
 
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SETCLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
IJNSA Journal
 
Network Intrusion Detection using MRF Technique
Network Intrusion Detection using MRF Technique Network Intrusion Detection using MRF Technique
Network Intrusion Detection using MRF Technique
Associate Professor in VSB Coimbatore
 
A45010107
A45010107A45010107
A45010107
IJERA Editor
 
A45010107
A45010107A45010107
A45010107
IJERA Editor
 
An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...
IJECEIAES
 
An ensemble-based approach for effective distributed denial of service attack...
An ensemble-based approach for effective distributed denial of service attack...An ensemble-based approach for effective distributed denial of service attack...
An ensemble-based approach for effective distributed denial of service attack...
IAESIJAI
 
1850 1854
1850 18541850 1854
1850 1854
Editor IJARCET
 
1850 1854
1850 18541850 1854
1850 1854
Editor IJARCET
 
Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...
IJCNCJournal
 
Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...
IJCNCJournal
 
Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...
Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...
Intrusion Detection System (IDS) Development Using Tree-Based Machine Learnin...
IJCNCJournal
 
Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...
Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...
Intrusion Detection System(IDS) Development Using Tree-Based Machine Learning...
IJCNCJournal
 
Constructing a predictive model for an intelligent network intrusion detection
Constructing a predictive model for an intelligent network intrusion detectionConstructing a predictive model for an intelligent network intrusion detection
Constructing a predictive model for an intelligent network intrusion detection
Alebachew Chiche
 
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIERATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
CSEIJJournal
 
Attack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest ClassifierAttack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest Classifier
CSEIJJournal
 
A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...
A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...
A Comparative Study of Deep Learning Approaches for Network Intrusion Detecti...
IRJET Journal
 
Attention Mechanism for Attacks and Intrusion Detection
Attention Mechanism for Attacks and Intrusion DetectionAttention Mechanism for Attacks and Intrusion Detection
Attention Mechanism for Attacks and Intrusion Detection
AIRCC Publishing Corporation
 
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
AIRCC Publishing Corporation
 
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
ijcsit
 
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SETCLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
IJNSA Journal
 
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SETCLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
CLASSIFICATION PROCEDURES FOR INTRUSION DETECTION BASED ON KDD CUP 99 DATA SET
IJNSA Journal
 
An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...
IJECEIAES
 
An ensemble-based approach for effective distributed denial of service attack...
An ensemble-based approach for effective distributed denial of service attack...An ensemble-based approach for effective distributed denial of service attack...
An ensemble-based approach for effective distributed denial of service attack...
IAESIJAI
 
Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection Via Deep Deterministic Policy Gradient R...
IJCNCJournal
 
Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...
Trust Metric-Based Anomaly Detection via Deep Deterministic Policy Gradient R...
IJCNCJournal
 

More from IJCNCJournal (20)

Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
IJCNCJournal
 
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT NetworksDelay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
IJCNCJournal
 
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor NetworksEnhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
IJCNCJournal
 
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable NetworksComparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
IJCNCJournal
 
Developing a Secure and Transparent Blockchain System for Fintech with Fintru...
Developing a Secure and Transparent Blockchain System for Fintech with Fintru...Developing a Secure and Transparent Blockchain System for Fintech with Fintru...
Developing a Secure and Transparent Blockchain System for Fintech with Fintru...
IJCNCJournal
 
Visually Image Encryption and Compression using a CNN-Based Autoencoder
Visually Image Encryption and Compression using a CNN-Based AutoencoderVisually Image Encryption and Compression using a CNN-Based Autoencoder
Visually Image Encryption and Compression using a CNN-Based Autoencoder
IJCNCJournal
 
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
IJCNCJournal
 
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT NetworksDelay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
IJCNCJournal
 
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor NetworksEnhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
IJCNCJournal
 
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable NetworksComparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
IJCNCJournal
 
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
IJCNCJournal
 
Formal Abstraction & Interface Layer for Application Development in Automatio...
Formal Abstraction & Interface Layer for Application Development in Automatio...Formal Abstraction & Interface Layer for Application Development in Automatio...
Formal Abstraction & Interface Layer for Application Development in Automatio...
IJCNCJournal
 
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
IJCNCJournal
 
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
IJCNCJournal
 
Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...
Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...
Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...
IJCNCJournal
 
Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...
Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...
Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...
IJCNCJournal
 
A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...
A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...
A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...
IJCNCJournal
 
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
IJCNCJournal
 
FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...
FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...
FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...
IJCNCJournal
 
Load Balanced Attack Defense System with Lightweight Authentication and Modif...
Load Balanced Attack Defense System with Lightweight Authentication and Modif...Load Balanced Attack Defense System with Lightweight Authentication and Modif...
Load Balanced Attack Defense System with Lightweight Authentication and Modif...
IJCNCJournal
 
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
IJCNCJournal
 
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT NetworksDelay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
IJCNCJournal
 
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor NetworksEnhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
IJCNCJournal
 
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable NetworksComparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
IJCNCJournal
 
Developing a Secure and Transparent Blockchain System for Fintech with Fintru...
Developing a Secure and Transparent Blockchain System for Fintech with Fintru...Developing a Secure and Transparent Blockchain System for Fintech with Fintru...
Developing a Secure and Transparent Blockchain System for Fintech with Fintru...
IJCNCJournal
 
Visually Image Encryption and Compression using a CNN-Based Autoencoder
Visually Image Encryption and Compression using a CNN-Based AutoencoderVisually Image Encryption and Compression using a CNN-Based Autoencoder
Visually Image Encryption and Compression using a CNN-Based Autoencoder
IJCNCJournal
 
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
Efficient Algorithms for Isogeny Computation on Hyperelliptic Curves: Their A...
IJCNCJournal
 
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT NetworksDelay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
Delay and Throughput Aware Cross-Layer TDMA Approach in WSN-based IoT Networks
IJCNCJournal
 
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor NetworksEnhancement of Quality of Service in Underwater Wireless Sensor Networks
Enhancement of Quality of Service in Underwater Wireless Sensor Networks
IJCNCJournal
 
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable NetworksComparative Analysis of POX and RYU SDN Controllers in Scalable Networks
Comparative Analysis of POX and RYU SDN Controllers in Scalable Networks
IJCNCJournal
 
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
IJCNCJournal
 
Formal Abstraction & Interface Layer for Application Development in Automatio...
Formal Abstraction & Interface Layer for Application Development in Automatio...Formal Abstraction & Interface Layer for Application Development in Automatio...
Formal Abstraction & Interface Layer for Application Development in Automatio...
IJCNCJournal
 
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
Deadline-Aware Task Scheduling Strategy for Reducing Network Contention in No...
IJCNCJournal
 
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
IJCNCJournal
 
Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...
Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...
Elliptic Curve Cryptography Algorithm with Recurrent Neural Networks for Atta...
IJCNCJournal
 
Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...
Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...
Enhanced Papr Reduction in OFDM Systems using Adaptive Clipping with Dynamic ...
IJCNCJournal
 
A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...
A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...
A Novel Stable Path Selection Algorithm for Enhancing Qos and Network Lifetim...
IJCNCJournal
 
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
A Novel Cluster Head Selection Algorithm to Maximize Wireless Sensor Network ...
IJCNCJournal
 
FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...
FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...
FSHAHA: Feature Selection using Hybrid Ant Harris Algorithm for IoT Network S...
IJCNCJournal
 
Load Balanced Attack Defense System with Lightweight Authentication and Modif...
Load Balanced Attack Defense System with Lightweight Authentication and Modif...Load Balanced Attack Defense System with Lightweight Authentication and Modif...
Load Balanced Attack Defense System with Lightweight Authentication and Modif...
IJCNCJournal
 
Ad

Recently uploaded (20)

Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
Resistance measurement and cfd test on darpa subboff model
Resistance measurement and cfd test on darpa subboff modelResistance measurement and cfd test on darpa subboff model
Resistance measurement and cfd test on darpa subboff model
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Data Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptxData Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptx
RushaliDeshmukh2
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
Smart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineeringSmart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineering
rushikeshnavghare94
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
LECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's usesLECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's uses
CLokeshBehera123
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
New Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdfNew Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdf
mohamedezzat18803
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Data Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptxData Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptx
RushaliDeshmukh2
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
Smart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineeringSmart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineering
rushikeshnavghare94
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
LECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's usesLECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's uses
CLokeshBehera123
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
New Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdfNew Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdf
mohamedezzat18803
 
Ad

Intrusion Detection Model using Machine Learning Algorithms on NSL-KDD Dataset

  • 1. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 DOI: 10.5121/ijcnc.2024.16605 75 INTRUSION DETECTION MODEL USING MACHINE LEARNING ALGORITHMS ON NSL-KDD DATASET Reem Alshamy and Muhammet Ali Akcayol Department of Computer Science, Gazi University, Ankara, Turkey ABSTRACT Big data, generated by various sources such as mobile devices, sensors, and the Internet of Things (IoT), has many characteristics such as volume, velocity, variety, variability, veracity, validity, vulnerability, volatility, visualization, and value. An Intrusion Detection System (IDS) is essential for cybersecurity to detect intrusions before or after attacks. Traditional software methods struggle to store, manage, and analyze big data, developing new techniques for effective and rapid intrusion detection in organizations and enterprises. This study introduces the IDS Random Forest (RF) model in binary and multiclass classification for intrusion detection. In this model, we used the Synthetic Minority Oversampling TEchnique (SMOTE) to address class imbalances, and the RF classifier to classify attacks using the Network Security Laboratory (NSL)-KDD dataset. In the experiment, we compared the IDS-RF model with the Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and Logistic Regression (LR) classifiers in terms of accuracy, precision, recall, f1-score, and times for training and testing. The experimental results showed that the IDS-RF model achieved high performance in binary and multiclass classification compared to others. In addition, the proposed model also achieved high accuracies for each class (Normal, DoS, Probe, U2R, or R2L) and obtained 98.69%, 99.72%, 98.93%, 95.13%, and 89%, respectively. KEYWORDS Intrusion Detection, Network Security Laboratory (NSL)-KDD dataset, SMOTE, Machine Learning 1. INTRODUCTION Computers and the Internet have become integral components of human life. Cybercrime is rising in a world characterized by connectivity and data dependency, causing significant disruptions to our way of life. Tools such as antivirus software, firewalls, and IDS have been deployed to address these security challenges [1]. An IDS is a device that examines data to identify and detect unauthorized attempts to affect a system or network's security. Traditional methods can be complex and inefficient when dealing with big data, making the system vulnerable to attack. Leveraging big data tools and techniques can reduce the time required for computations. An IDS utilizes three primary methods to detect attacks: 1) Signature-based detection is effective for detecting known attacks but cannot detect new types of attacks. 2) Anomaly-based detection utilizes established profiles to compare user activities and identify odd behaviors; however, it can result in high false-positive rates (FPR). 3) Hybrid-based detection combines multiple methods to address each method's drawbacks while utilizing its strengths. Many researchers have recommended using ML techniques in IDS to reduce FPR and improve IDS accuracy. By leveraging advanced data analysis tools and methodologies, an IDS can enhance the effectiveness and precision of its detection approaches [2, 3].
  • 2. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 76 The term " big data" refers to an amount of data that stands out for its volume, velocity, and variety. Moreover, the increasing number of attacks worsens these issues. As a result, big data poses challenges, for IDS in identifying and analyzing intrusions and attacks. Dealing with security data from sources is a task for intrusion detection particularly when handling the complexities of big data. The complexity of managing data grows when dealing with issues like having high dimensions of data from various sources or varied data structures. To tackle these challenges effectively an IDS needs to integrate techniques designed for big data environments. Incorporating big data techniques and ML in IDS can help address challenges such, as computational time and speed ultimately contributing to the improvement of more accurate IDS systems [2]. Spark and Hadoop MapReduce are widely recognized examples of frameworks used for processing big data. Spark is a distributed processing system known for its speed. Its flexible architecture supports a range of large-scale workloads including algorithms, batch applications, streaming tasks, and interactive searches. Furthermore, Spark offers an ML library known as MLlib [4]. Numerous research papers have employed ML classifiers to identify intrusions, in settings some of which have integrated techniques, for selecting features to handle amounts of data [5]. Other studies have employed Apache Spark and its MLlib to develop distributed IDS. However, users of the Spark ML library face limitations in applying feature selection methods, as this library only contains the chi-squared selector [6]. Datasets are essential for training and evaluating anomaly-based network IDS. Some studies have used actual data captured from switches and tools designed to simulate attacks to test their systems. However, a major inconvenience is the impossibility of comparing the results with those of other studies because of the differences in the data, which inevitably lead to different outcomes. Consequently, several public network datasets are available for this purpose. The most well-known are DARPA98, ARPA99, KddCup99, NSL-KDD, Kyoto 2006+, UNSW-NB15, and CICIDS2017, which are widely used in numerous studies. However, these datasets typically include a significant number of redundant or missing records and are considered outdated. The landscape of computer networks and their attack patterns has evolved considerably since their creation. To address these challenges, the Communications Security Establishment (CSE) has introduced a new dataset. Since its release, many projects have demonstrated that CSE-CIC-IDS 2018 is more complex than previous datasets, and can be reliably used to evaluate existing and new detection strategies. CSE-CIC-IDS 2018 accurately represents modern network attacks [7, 8]. The remainder of this study is structured as follows: Section 2 presents a review of related works. In the 'Methodology' section, we introduce the IDS-RF model and describe each step of this method. The results of the IDS-RF model and a discussion are presented in Section 4. Comparisons with existing studies are provided in Section 5. Finally, we conclude our work and outline future work in the 'Conclusion' section. 2. RELATED WORKS Many research approaches have been introduced for IDS that apply ML algorithms, feature selection methods, and big data platforms. Different approaches have been proposed for IDS that utilize the benchmark NSL-KDD dataset to improve attack detection. Some researchers have employed traditional techniques, while others have leveraged big data techniques in an IDS, aiming to reduce training and computation time. With the advent of big data, traditional techniques for handling such data have become increasingly complex. In this section, the authors summarize previous works that utilized big data techniques to address classification problems (binary, multi-class, or both) in IDS.
  • 3. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 77 Louati et al. [5] proposed a novel IDS based on Multi-Agent Reinforcement Learning (MARL) for big data environments. The MARL method leverages decentralized cooperative learning among agents to address the issues presented by the large amount, velocity, and variety of big data. This work contributes significantly to developing advanced IDS that can effectively function within the challenging environments of big data frameworks. Trrad et al. [9] focused on using ML algorithms to categorize network traffic. They specifically examined the NSL-KDD dataset to evaluate how well classifiers, like SVM, k-NN, and Long Short-Term Memory (LSTM) could detect network activities. Lin et al. [10] introduced the Multi-Feature Extraction Extreme Learning Machine (MFE-ELM) as a method to prevent intrusions in cloud computing environments related to the Internet of Things (IoT). For intrusion detection, the MFE-ELM method employed the NSL-KDD dataset. The results indicate that the proposed algorithm enhances the ability to promptly identify security weaknesses in cloud clusters, facilitating the development of intrusion response strategies. Mari et al. [11] introduced a random forest algorithm that uses learning to enhance feature selection. The algorithm combined three ML models; SVM, LR, and k-NN. Their research showed an enhancement, in detecting and categorizing network intrusions through ML highlighting the importance of feature selection in improving detection accuracy. The study also underlined the potential of these methods for real-time detection and monitoring of cyber-attacks. Türk et al. [12] evaluated intrusion detection capabilities by applying ML and Deep Learning (DL) techniques on two datasets, specifically the University of New South Wales (UNSW)- NB15 and NSL-KDD. They achieved accuracy in classification showcasing the effectiveness of these algorithms, in identifying and categorizing network intrusions. Kalinin et al. [13] investigated Quantum ML (QML) to address challenges, in intrusion detection and improve effectiveness. Their study involved comparing Quantum SVM (QSVM) and Quantum Convolutional Neural Networks (QCNN) as strategies for detecting intrusions. The findings from their experiments showed that intrusion detection based on QML can effectively manage amounts of data inputs with high accuracy and at twice the speed of conventional algorithms. Vibhute et al. [14] developed an ML-based IDS using a Generative Adversarial Network (GAN). The study aimed to generate adversarial network traffic instances that might avoid conventional IDS detection. These instances were then used to improve the IDS's ability to detect new and evolving threats. The GAN method not only assessed the durability of the IDS against complex approaches used to avoid detection but also enhanced its capacity to adapt to new attacks, resulting in notable progress in the field of cybersecurity. Abid et al. [15] proposed a real-time, fault-tolerant, distributed, and scalable approach for detecting intrusions in Industrial Control Systems (ICS). Their system employs data fusion techniques to merge streaming datasets into a stream enhancing accuracy and precision. The approach demonstrated performance in binary and multiclass classifications by leveraging Apache Spark Structured Continuous Streaming for real-time data processing along with a Multi- Layer Perceptron Classifier (MLPC). Table 1 provides a comparison of studies focusing on aspects such as algorithms, tools, datasets, and classification problems.
  • 4. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 78 Table 1. The comparison of the related works. In related works, we observed that IDS encounter challenges posed by big data. Researchers have addressed these challenges using two main strategies: employing a complete dataset and applying the proposed approach with Big Data techniques [5]. Table 1 summarizes related works, categorizing them based on the algorithms used for classification, the tools developed for the research, the datasets on which the research was trained and evaluated, and the focus of the classification problems (Binary, Multiclass, or both). 3. METHODOLOGY This section will provide a detailed explanation of the proposed model and the tools that were utilized. Figure 1 shows the IDS-RF model. The IDS-RF model can be summarized in the following steps: 1. Import the NSL-KDD dataset. 2. Conduct data pre-processing. 3. Utilize the SMOTE and RF classifier to train the IDS-RF model on the training dataset. 4. The IDS-RF model is evaluated by applying the RF classifier to the NSL-KDD test dataset. Figure 1. The sequence of steps in the IDS-RF model.
  • 5. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 79 3.1. Dataset Description The NSL-KDD dataset is among the most widely used datasets for training and evaluating IDS models [16]. This dataset contained 41 features, plus one feature for the target class, including four categorical, four binary, and 33 numerical features. The features in the NSL-KDD dataset are divided into four groups: basic, host, traffic, and content features [17]. The 42 features provide data on five classes of network connections, with each instance classified either as normal or as one of four types of attacks [7, 14]. The attack types in the NSL-KDD dataset can fall into one of the following four main categories: Denial of Service Attack (DoS): The attacker occupies excessive computing or memory resources by performing calculations, thereby denying legitimate users access to the machine. This is achieved by handling more requests so the system can process them logically. Probing Attack: This type of attack occurs when an attacker exploits vulnerabilities to gain local access to the network. Then, the attacker sends packets to the device over the network. Remote to Local Attack (R2L): This attack occurs when an attacker takes advantage of vulnerabilities to gain local network access and begins sending packets to the device over the network. User-to-Root Attack (U2R): An attacker gains root access by exploiting vulnerabilities, potentially allowing the attacker to escalate privileges from a regular user account to root access. The distribution of various attacks on the NSL-KDD dataset is presented in Table 2. Table 2. The Distribution of various attack types in the NSL-KDD dataset [9]. The training dataset consisted of 23 classes, whereas the test dataset included 38. These encompass 21 attacks from the training dataset, 16 novel attacks, and one normal class. The class labels of instances in the dataset were categorized into five main categories: Normal, DoS, Probe, R2L, and U2R. According to the distribution of instances across different attack types, as illustrated in Table 2, the distribution of various types of data is unbalanced. 'Normal' represents the highest percentage of the total data, whereas U2R has the lowest, which may lead to inadequate training and misclassification of data. Attackers often utilize Probe, R2L, and U2R attacks as advanced threats. Therefore, improving the classification accuracy for these types of attacks should be prioritized. 3.2. Dataset Pre-Processing The preprocessing is an essential step that enhances the data quality and aids the extraction of features [18]. This step is essential for developing and training the models. We first need to preprocess the data to derive the value from the dataset and prepare it for the IDS-RF model, Below, we describe the techniques used for data preprocessing:
  • 6. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 80 3.2.1. Transformation The data performed a transformed into formats that are appropriate for ML algorithms [19]. First, the dataset underwent a process of eliminating features that included zero values. Additionally, we transformed the categorical features within the dataset into numerical features. Second, we employ the one-hot encoder technique to transform categorical features into a one-hot numeric array. We achieve this by identifying the distinct values of each feature and creating corresponding categories. As an illustration, the output of this function can encode the 'protocol_type' feature, which has three separate values (TCP, UDP, and UCMP), as [1,0,0], [0,1,0], and [0,0,1] correspondingly. As a result, this function transformed the initial 41- dimensional NSL-KDD dataset into a format with 122 dimensions. 3.2.2. Standardization In ML, the standardization of datasets is very significant for algorithms that utilize the Euclidean distance. Without standardization, there is a potential risk that features with values in larger ranges may be given undue importance. Because only some features can be represented within the same measurement range, features of varying magnitudes can negatively affect ML algorithms [20]. This issue can be mitigated through standardization, which involves converting data to a common range. Standardization is beneficial because it helps improve the prediction performance of the IDS-RF model. For example, the NSL-KDD dataset encompasses features with a broad spectrum of values. Some features displayed significant discrepancies between their minimum and maximum values, making them incomparable and unsuitable for direct processing. An example is the 'duration' feature, which ranges from a minimum of 0 to a maximum of 58,329, rendering the feature values inconsistent for analysis. To address these challenges within the IDS-RF model, we employ StandardScaler. This tool standardizes features by eliminating the mean and scaling to unit variance using the following formula to achieve data standardization: 𝒛 = 𝒙 − 𝒖 𝒔 𝟏 The equation represents the relationship between the standardized value of a sign (z), the original vector (x), the mean of the vectors (u), and the standard deviation (s). 3.3. IDS-RF Model This section presents the SMOTE technique and the RF classifier as a component of the IDS-RF model. 3.3.1. SMOTE The NSL-KDD dataset has a class imbalance issue. When dealing with a dataset that contains a class imbalance, it is preferable to use the SMOTE method to enhance attack detection in the IDS-RF model [21, 22]. SMOTE was developed by Chawla [23]. This method generates synthetic samples in the minority class instead of replacing existing samples, which helps avoid the overfitting problem. The SMOTE algorithm was proposed to overcome the overfitting issue and improve the accuracy. This straightforward cluster-based oversampling technique has gained widespread acceptance in the community [24]. It selects two samples from the minority class and assigns properties between them. The difference in value is calculated for each feature, and then a random value r is generated to lie between 0 and 1. Consequently, the features of the new sample were generated according to the following equation:
  • 7. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 81 𝒙𝒏𝒊 = 𝒎𝒊𝒏 𝒙𝒋𝒊, 𝒙𝒌𝒊 + |𝒙𝒋𝒊 − 𝒙𝒌𝒊| × 𝒓 (2) Equation (2) calculates the value of xni, which represents the ith feature of the nth new sample. It is determined by taking the minimum value between xji and xki and adding the absolute difference between xji and xki multiplied by the r value. The samples xj and xki are randomly selected from the minority class, and r is a number between 1 and 0. This equation helps generate a set of oversampled values. By employing this method, the two classes were nearly evenly distributed in terms of proportion. This approach effectively preserves the correlation between the characteristics, even when a substantial number of artificially generated samples are included in the dataset. The researchers utilized the SMOTE technique in their tests to attain precise categorization [25]. Figures 2 and 3 demonstrate the distribution of instances for various attacks before and after applying SMOTE to the training dataset in binary and multiclass scenarios. Figure 2. Results of Applying the SMOTE method to the training dataset for binary classification. Figure 3. Results of Applying the SMOTE Method to the training dataset for multiclass classification. 3.3.2. RF Classifier The random forest algorithm is composed of many tree predictors. Each tree makes predictions based on a random vector, which is sampled individually and follows the same distribution across all trees in the forest [26]. Random forests are regarded as an enhanced iteration of bootstrap aggregating (bagging), where the forecasts of many decision trees, developed with bootstrapped data samples that are identically distributed, are aggregated. One drawback of bagging is that the correlations between each pair of variables limit the extent to which the prediction errors can be
  • 8. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 82 reduced, even when a large number of trees are aggregated. Random forests enhance the bagging technique by randomly selecting a subset of variables to use for splitting while constructing each decision tree. This helps to reduce the impact of correlations between pairs of variables. To obtain a comprehensive description of the random forest algorithm, please consult the work of Breiman [26]. The number of trees (B) and the number of randomly picked variables utilized for each split in an individual decision tree (m) are two important hyperparameters in constructing random forests. These two parameters are commonly identified using a grid search technique in conjunction with cross-validation [27]. As stated by Breiman [26], the process of building a random forest model involves the following steps: Iterate over the values of b from 1 to B: For b=1 to : (a) Generate a resampled dataset of size N using the bootstrap method from the training data. (b) Construct an RF tree by iteratively performing the following steps for each terminal node of the tree until the minimal node size nmin is reached. 1) Randomly choose m variables from the complete set of variables. 2) Choose the most optimal variable from the options provided. 3) Partition the node into two sub-regions. Print the collection of trees { }i, i=1, 2,…, . For a regression problem, given a new input x, the corresponding prediction can be written as: (3) Figure 4 illustrates the steps described above. Figure 4. Random Forest Architecture [27].
  • 9. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 83 4. RESULTS ANALYSIS AND DISCUSSION 4.1. Evaluation Metrics Some of the performance metrics used to evaluate the effectiveness of the IDS-RF model involve accuracy, precision, recall, f1-score, and prediction time for binary and multiclass classification. Accuracy is a metric that measures the proportion of correct predictions made by the IDS-RF model when applied to a test dataset. Precision is a measure of the accuracy of a positive prediction. Recall, or the true positive rate, is the measure of the number of true positives predicted out of all positives in the dataset. The f1-score measures the accuracy of a model based on its recall and precision. The metrics mentioned above have definitions in Eqs. (4), (5), (6), and (7). where TP is True Positive, FN is False Negative, FP is False Positive, and TN is True Negative. TP indicates the number of records correctly detected as a normal class, FN indicates the number of records not correctly detected as an attack class, FP indicates the number of records not correctly detected as a normal class, and TN indicates the number of records correctly detected as an attack class. 4.2. Binary Classification The IDS-RF model achieved an accuracy of 98.67% for binary classification on the NSL-KDD dataset. Precision was 98.68%, recall was 98.67%, and f1-score was 98.67%. The confusion matrix for the IDS-RF model in the binary classification is shown in Figure 5. In addition, we assessed the performance of the ML algorithms k-NN, SVM, and LR, focusing on accuracy, precision, recall, and F1-score. These algorithms were chosen because of their extensive application and proven efficacy in IDS [17]. The accuracy of the k-NN classifier was 97.34%, the precision was 97.32%, the recall was 97.26%, and the f1-score was 97.29%. The accuracy of the SVM classifier was 95.77%, the precision was 95.72%, the recall was 95.65%, and the f1-score was 95.69%. The accuracy of the LR classifier was 94.62%, precision was 94.61%, recall was 94.41%, and f1-score was 94.50%. A comparison of the accuracy values between the four algorithms for each class (normal and attack) is presented in Table 3. The results in Table 3 highlight the superior performance of the IDS-RF model in terms of accuracy for binary and multiclass classification.
  • 10. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 84 Figure 5. Confusion Matrix for IDS-RF model for binary classification. Table 3. Accuracy for each class in binary classification. 4.3. Multiclass Classification For NSL-KDD in multiclass, the IDS-RF model achieved an overall accuracy of 98.54%, precision of 97.58 %, recall of 96.29%, and f1-score of 96.92%. The confusion matrix for the IDS-RF model for multiclass classification is shown in Figure 6. The accuracy of the k-NN classifier was 96.93 %, precision was 94.33 %, recall was 93.98%, and the f1-score was 94.15 %. The accuracy of the SVM classifier was 95.53%, precision was 93.69 %, recall was 93.39 %, and f1-score was 93.49%. The accuracy of the LR classifier was 94.76%, precision was 93.25%, recall was 91.36%, and f1-score was 92.13%. In addition, the IDS-RF model achieved high performance in terms of accuracy for multiclass classification. The IDS-RF model accuracy was higher for Normal, DoS, Probe, and R2L and lower for U2R. This is because the U2R class lacks sufficient data compared to other attacks, leading to a lower accuracy than other classes. The accuracy comparison between the four algorithms for each class in the multiclass classification is presented in Figure 7.
  • 11. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 85 Figure 6. Confusion Matrix for IDS-RF model in multiclass classification. Table 4. Accuracy for each class in multiclass classification. Figure 7. Accuracy for ML algorithms in multiclass classification.
  • 12. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 86 In IDS, analyzing big data requires high real-time performance, which requires not only high evaluation metrics but also minimal detection times to ensure the efficiency of network security applications. To demonstrate the superiority of the IDS-RF model, researchers measured its training and prediction times for binary and multiclass classification and compared these metrics with other ML algorithms. The findings revealed that our proposed model significantly reduced training and prediction times. Table 5 compares the training and prediction times between the proposed IDS-RF model and the other classifiers for binary and multiclass classification. Table 5. Training and testing times for IDS-RF model with other classifiers. IDS-RF consistently outperformed the other algorithms in all metrics, establishing it as the most effective option for the classification task in this study. These impressive results were achieved by integrating the RF classifier with the SMOTE method, which was applied to the training data for binary classification, as illustrated in Table 3. The superiority of the IDS-RF model is mainly due to its flexibility and adeptness in identifying correlations between features, rendering it an ideal choice for this task. In addition, RF consistently shows excellent performance, even when handling large datasets. Moreover, the preprocessing stages yielded a dataset comprising 134,686 instances for binary classification and 202,305 for multiclass classification. Utilizing Anaconda capabilities [28], our system achieved an exceptional processing rate of 22,544 instances in just 6 s for binary classification and 41.762 s for multiclass classification (as shown in Table 5). This rapid processing rate significantly boosts the intrusion detection capabilities, thereby enhancing the overall efficiency and effectiveness of the system. The results highlight the efficiency and reliability of the proposed system classification. 5. COMPARISON WITH EXISTING WORKS The importance of the strategy presented in the IDS-RF model is compared with existing studies. Some previous studies focused on ML and DL algorithms and achieved satisfactory results. In this section, we focus on studies that use the same dataset used in our model to train and test the IDS-RF model. Although the NSL-KDD dataset we used does not cover modern attacks, researchers still use it widely. In addition, when comparing the IDS-RF model with previous studies that used the NSL-KDD dataset for binary and multiclass classification, the IDS-RF model consistently demonstrated high performance and speed in attack detection across all metrics evaluated. The accuracy of the k-NN model in [14] was 98.24% when using the ensemble learning-enabled RF algorithm for feature selection. In [11], correlation-based feature selection achieved an accuracy of 91.33% through k-NN, less than that achieved in our study. Based on our findings, the IDS-RF model suggested in this work outperforms the models used in previous studies [5, 9, 10, 12] in terms of accuracy, training, and testing times. To enhance the performance of the IDS-RF model, we recommend its implementation in Apache Spark, a cluster computing platform designed for high-speed processing [11]. Table 6 shows the significance of the current study compared to recent studies in binary and multiclass classification.
  • 13. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 87 Table 6. Comparison results with other studies in binary and multiclass classification. 6. CONCLUSIONS This study introduces an IDS-RF model developed by researchers to detect intrusions using the NSL-KDD dataset. The classification process for big data is complicated in an IDS because of its high dimensionality. The proposed model leveraged the Anaconda platform to process and analyze big data rapidly. In the proposed model, researchers utilized the SMOTE technique to handle a dataset and utilized Random Forest (RF) for binary and multiclass classification. The experimental results showed that the model achieved high accuracy, precision, recall, f1-score, and speed for binary and multiclass classification compared with the k-NN, SVM, and LR algorithms. The proposed model achieved an overall accuracy of 98.67% for binary classification and 98.54% for multiclass classification. The proposed model also achieved high testing accuracies for Normal (98.69%), DoS (99.72%), Probe (98.93%), R2L (95.13%), and U2R (89%) even though the accuracy of the U2R class was lower compared with other attacks because the number of instances in this class is small. In addition, the proposed model achieved high accuracy, which is remarkable for binary and multiclass classification compared to existing models. In future work, we plan to apply the IDS-RF model on more modern datasets to detect intrusions and reduce computation time using the Apache Spark environment. CONFLICTS OF INTEREST The authors declare no conflict of interest. ACKNOWLEDGEMENTS We would like to thank all authors for their contributions and the success of this manuscript and all editors and anonymous reviewers of this manuscript. REFERENCES [1] F. Bannat Wala and M. J. a. e.-p. Kiran, "5G Network Security Practices: An Overview and Survey," p. arXiv: 2401.14350, 2024. [2] S. M. Othman, N. T. Alsohybe, F. M. Ba-Alwi, A. T. J. I. J. o. C.-S. Zahary, and D. Forensics, "Survey on intrusion detection system types," vol. 7, no. 4, pp. 444-463, 2018. [3] A. Touzene, A. Al Farsi, N. J. I. J. o. C. N. Al Zeidi, and Communications, "HIGH PERFORMANCE NMF BASED INTRUSION DETECTION SYSTEM FOR BIG DATA IOT TRAFFIC," vol. 16, no. 2, pp. 43-58, 2024. [4] X. Sun, L. Zhao, J. Chen, Y. Cai, D. Wu, and J. Z. J. E. A. o. A. I. Huang, "Non-MapReduce computing for intelligent big data analysis," vol. 129, p. 107648, 2024. [5] F. Louati, F. B. Ktata, and I. J. C. C. Amous, "Big-IDS: a decentralized multi agent reinforcement learning approach for distributed intrusion detection in big data networks," pp. 1-19, 2024. [6] K. Pramilarani and P. V. J. A. S. C. Kumari, "Cost based Random Forest Classifier for Intrusion Detection System in Internet of Things," vol. 151, p. 111125, 2024.
  • 14. International Journal of Computer Networks & Communications (IJCNC) vol 16, No 6, November 2024 88 [7] M. Ghurab, G. Gaphari, F. Alshami, R. Alshamy, and S. J. A. J. o. R. i. C. S. Othman, "A detailed analysis of benchmark datasets for network intrusion detection system," vol. 7, no. 4, pp. 14-33, 2021. [8] M. Ghurab, R. Alshamy, S. J. I. J. o. S. R. Othman, and E. Development, "Performance Evaluation for Attack Detection in Intrusion Detection System," vol. 4, no. 5, 2021. [9] Trrad, "Applying Deep Learning Techniques for Network Traffic Classification: A Comparison Study on the NSL-KDD Dataset," 2024. [10] H. Lin, Q. Xue, J. Feng, and D. Bai, "Internet of things intrusion detection model and algorithm based on cloud computing and multi-feature extraction extreme learning machine," Digital Communications and Networks, vol. 9, no. 1, pp. 111-124, 2023. [11] A.-G. Mari, D. Zinca, and V. J. S. Dobrota, "Development of a Machine-Learning Intrusion Detection System and Testing of Its Performance Using a Generative Adversarial Network," vol. 23, no. 3, p. 1315, 2023. [12] F. J. B. E. Ü. F. B. D. Türk, "Analysis of intrusion detection systems in UNSW-NB15 and NSL- KDD datasets with machine learning algorithms," vol. 12, no. 2, pp. 465-477, 2023. [13] M. Kalinin and V. Krundyshev, "Security intrusion detection using quantum machine learning techniques," Journal of Computer Virology and Hacking Techniques, vol. 19, no. 1, pp. 125-136, 2023. [14] A. D. Vibhute, C. H. Patil, A. V. Mane, and K. V. J. P. C. S. Kale, "Towards Detection of Network Anomalies using Machine Learning Algorithms on the NSL-KDD Benchmark Datasets," vol. 233, pp. 960-969, 2024. [15] A. Abid, F. Jemili, and O. Korbaa, "Real-time data fusion for intrusion detection in industrial control systems based on cloud computing and big data techniques," Cluster Computing, pp. 1-22, 2023. [16] R. Alshamy, M. J. J. o. C. S. Ghurab, and Engineering, "A review of big data in network intrusion detection system: Challenges, approaches, datasets, and tools," vol. 8, no. 7, pp. 62-74, 2020. [17] NSL-KDD dataset. Accessed at Jan. 01, 2024. [Online]. Available: https://www.unb.ca/cic/datasets/nsl.html [18] M. A. Bouke, A. Abdullah, K. Cengiz, and S. J. P. C. S. Akleylek, "Application of BukaGini algorithm for enhanced feature interaction analysis in intrusion detection systems," vol. 10, p. e2043, 2024. [19] K. C. Santos, R. S. Miani, F. J. J. o. N. de Oliveira Silva, and S. Management, "Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems," vol. 32, no. 2, p. 36, 2024. [20] M. A. Talukder, S. Sharmin, M. A. Uddin, M. M. Islam, and S. J. I. J. o. I. S. Aryal, "MLSTL-WSN: machine learning-based intrusion detection using SMOTETomek in WSNs," pp. 1-20, 2024. [21] W. A. Ali, M. Roccotelli, G. Boggia, and M. P. J. I. S. J. A. G. P. Fanti, "Intrusion detection system for vehicular ad hoc network attacks based on machine learning techniques," pp. 1-19, 2024. [22] R. Alshamy, M. Ghurab, S. Othman, and F. Alshami, "Intrusion detection model for imbalanced dataset using SMOTE and random forest algorithm," in Advances in Cyber Security: Third International Conference, ACeS 2021, Penang, Malaysia, August 24–25, 2021, Revised Selected Papers 3, 2021, pp. 361-378: Springer. [23] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. J. J. o. a. i. r. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," vol. 16, pp. 321-357, 2002. [24] J. S. Aguilar-Ruiz and M. J. S. R. Michalak, "Classification performance assessment for imbalanced multiclass data," vol. 14, no. 1, p. 10759, 2024. [25] A. Kumar, D. Singh, R. J. M. T. Shankar Yadav, and Applications, "Class overlap handling methods in imbalanced domain: A comprehensive survey," pp. 1-48, 2024. [26] L. J. M. l. Breiman, "Random forests," vol. 45, pp. 5-32, 2001. [27] H. Gong, Y. Sun, W. Hu, P. A. Polaczyk, and B. Huang, "Investigating impacts of asphalt mixture properties on pavement performance using LTPP data through random forests," Construction and Building Materials, vol. 204, pp. 203-212, 2019. [28] Anaconda platform. Accessed at Jan. 01, 2024. [Online]. Available: https://www.anaconda.com/