Abstract—Computer networks are exposed to cyber attacks due to the widespread usage of the internet; as a result, several intrusion detection systems (IDSs) have been proposed by researchers. Detecting intrusions is among the key research issues in securing networks: it helps to recognize unauthorized usage and attacks as a measure to ensure the network's security. Various approaches have been proposed to determine the most effective features and hence enhance the efficiency of intrusion detection systems. The methods include machine learning (ML) based approaches, Bayesian-based algorithms, nature-inspired meta-heuristic techniques, swarm intelligence algorithms, and Markov neural networks. Over the years, the various works carried out have been evaluated on different datasets.

Keywords—Machine learning, Single classifiers, Hybrid, Ensemble, Misuse detection, Intrusion Detection System
I. INTRODUCTION

For detecting illicit or abnormal behavior, an IDS is used. An attack launched in a network manifests as anomalous behavior. Attackers take advantage of network weaknesses such as poor security measures and practices, and program bugs such as buffer overflows, yielding breaches of the network. The attackers may be less privileged device operators who aim to claim more access control, or black-hat hackers who are normal users of the internet intending to hijack sensitive information [1]. Techniques for detecting intrusion can be centered on misuse detection or on anomaly detection. A misuse-based IDS tracks the flow of network traffic and compares it to predefined signatures of malicious activities held in a database, whereas in anomaly detection, attacks are detected as actions which deviate from normal user operations [2].

The IDS could be a Network-based IDS (NIDS) or a Host-based IDS (HIDS). Computer network administrators utilize the host-based intrusion detection method to track and evaluate activities on a specific machine. HIDS has the benefit that information which is encrypted when moving over the network can still be accessed at the host. The downside is that HIDS is very difficult to handle, as every host needs to be configured and managed. In addition, some forms of denial-of-service attacks could disable HIDS. A NIDS is intelligently distributed software- or hardware-based IDS in the network which tracks packets passing through the network. A NIDS is double-interfaced, the first interface listening to network conversations and the second used for monitoring [4][3]. The NIDS has the benefit that only a few well-placed NIDS sensors are needed to monitor a wide network, and a NIDS is often hidden from intruders, so it is safe against invasions. However, during a time of heavy traffic, a NIDS has the downside of finding it hard to discover an attack launch.

The rampant usage of the internet makes it difficult to protect network resources from the mischievous actions of attackers. According to Cybersecurity Ventures, the damage related to cybercrime is predicted to reach $6 trillion yearly by 2021. Gartner reports that, in taking steps to counter the damage, global expenses on cybersecurity could reach $133.7 billion in 2022. Multiple measures have been taken, and various security tools such as IDS were developed. The previous works done on building IDS showed effectiveness to some extent. However, several issues need to be addressed to build an efficient IDS that could detect and report malicious traffic with very high detection accuracy.

Most IDS were developed and evaluated using outdated datasets like KDD Cup '99, NSL-KDD and so on, which lack the most recent and up-to-date attack labels. A slow detection rate is experienced in the existing works; this happens due to the inability to get rid of all redundant and irrelevant columns. A high false positive rate arises when legitimate traffic is incorrectly detected and classified as an attack; the false positive rate increases the complexity of the IDS, hence reducing its performance.

The rest of the paper is presented as follows: Section two discusses the overview of ML techniques. The comparison of the studied works is covered by Section three; the comparison is based on the classifier used, the performance of the algorithms, as well as the datasets applied to evaluate the algorithms. The last section of this paper discusses, concludes and provides future scope in developing IDS using ML techniques.
II. OVERVIEW OF MACHINE LEARNING (ML) ALGORITHMS

Fig 1: Block Diagram of Intrusion Detection System

A. Machine Learning (ML)
ML could be described as an approach whereby models undergo training for the purpose of learning and enhancing their performance parameters automatically, so that they do not have to be explicitly programmed, using previous experience or example data. In accordance with the attributes, the ML model focuses on training datasets to predict various class labels. ML is typically divided into three groups:

1. Supervised Learning
In supervised ML the dataset to be trained is made up of examples of the input vector, each with its equivalent desired output vector. Algorithms in this type of learning include: Naive Bayes, KNN, ANN, Decision Tree (C4.5, ID3, CART, RF, and J48), SVM, ensemble methods (Bagging, Voting Classifier, AdaBoost, Gradient Boosting), and logistic regression.

2. Unsupervised Learning
In unsupervised ML, the learning algorithm is not given labels, and as such it must find structure in its input by itself. This is also known as learning without a teacher. Self-Organizing Map (SOM), the Apriori algorithm, the Eclat algorithm, outlier detection, hierarchical clustering, and cluster analysis (K-Means clustering, fuzzy clustering) are various unsupervised learning algorithms.

3. Reinforcement Learning
In reinforcement learning, the model is trained to make a sequence of decisions. The goal is achieved in an uncertain and potentially complex environment. The model performs trial and error to bring up a solution to the problem. Deep Q Network (DQN), Q-Learning, State-Action-Reward-State-Action (SARSA), and Deep Deterministic Policy Gradient (DDPG) are various reinforcement learning algorithms.
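To make the trial-and-error loop of reinforcement learning concrete, the sketch below runs tabular Q-learning on a hypothetical five-state toy environment. The environment, its reward of 1 for reaching the final state, and the hyper-parameters are illustrative assumptions only and are not taken from any of the reviewed IDS works.

    import random

    # Minimal Q-learning sketch on a toy 5-state chain (assumed environment).
    N_STATES, ACTIONS = 5, [0, 1]          # action 0 = move left, action 1 = move right
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate

    q_table = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def step(state, action):
        """Move along the chain; reaching the last state yields reward 1 and ends the episode."""
        nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
        reward = 1.0 if nxt == N_STATES - 1 else 0.0
        return nxt, reward, nxt == N_STATES - 1

    for episode in range(200):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection (trial and error).
            if random.random() < EPSILON:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q_table[(state, a)])
            nxt, reward, done = step(state, action)
            # Q-learning update of the state-action value.
            best_next = max(q_table[(nxt, a)] for a in ACTIONS)
            q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])
            state = nxt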
B. Single Machine Learning Classifiers
ML classifiers are classified as single classifiers when they contain just one classification algorithm. Single-machine classification models have been adopted by many intrusion detection systems. SVM, ANN, DT, K-Nearest Neighbor, and NB are made up of one ML algorithm and were adopted in various IDS studied in this work.

1. Support Vector Machine (SVM)
SVM can be used for both classification and regression cases. SVM provides a separating hyper-plane which defines the various classes to be predicted. The classification always depends upon the nature of the problem and the adopted dataset. The dataset could be one-dimensional, in which case the hyper-plane is a point on a one-dimensional line. In a situation whereby the dataset is two-dimensional, the hyper-plane is a separating line; for three dimensions the hyper-plane is a plane; and lastly, for a higher-dimensional dataset it is a hyper-plane. SVM is widely used in most intrusion detection systems due to its popularity in making accurate predictions.
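As an illustration only (not drawn from any of the reviewed works), the sketch below trains an SVM on a synthetic stand-in for a preprocessed IDS dataset using scikit-learn; the sample size, feature count, class weights and RBF kernel are assumptions made for the example.

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic stand-in for labelled network-flow records (label 1 = attack).
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Feature scaling matters for SVM because the separating hyper-plane is distance-based.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    svm.fit(X_train, y_train)
    print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))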
2. Artificial Neural Network (ANN)
ANNs are a category of ML algorithms motivated by the behavior of, and the amount of computing performed by, the human brain in the biological nervous system. The ANN model is made up of an input layer, one or more hidden layer(s), and an output layer. The hidden layer(s) weigh and process the inputs fed to the artificial neurons so that the output to the next layer can be decided. In ANN, a learning rule known as gradient-descent back-propagation of error is used to adaptively adjust the various weights and biases of the hidden and output layer neurons so that the desired or required output is achieved.
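A minimal sketch of such a feed-forward network, using scikit-learn's MLPClassifier (a back-propagation-trained multi-layer perceptron) as a stand-in; the synthetic data and the two hidden layers of 64 and 32 neurons are assumptions for illustration.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for labelled traffic records.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Input layer -> hidden layers (64, 32 neurons) -> output layer; weights and biases
    # are adjusted by gradient-descent back-propagation of the error.
    ann = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0))
    ann.fit(X_train, y_train)
    print("ANN accuracy:", ann.score(X_test, y_test))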
3. Decision Tree (DT)
The decision tree is a type of machine learning algorithm that is applied to both categorical and numeric classification. The decision tree is made up of three kinds of nodes, namely the tree's topmost node, called the root node (root); intermediate nodes, also called internal nodes; and leaf nodes, also called leaves. In a decision tree the flow of the learning rule is top-down, and the leaves are the outcomes of decisions. At each split, the data sample is divided into two more homogeneous sets (sub-samples) based on the most significant splitter. The decision tree is widely used in classification problems due to its popularity in data exploration and its low data cleaning requirement.
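The sketch below fits a small decision tree on synthetic stand-in data and prints the top of the learned tree so the root and first internal splits can be inspected; the depth limit and entropy criterion are assumptions for the example.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Each internal node splits the sample on the most significant feature threshold;
    # the leaves carry the final normal/attack decision.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
    tree.fit(X_train, y_train)
    print("DT accuracy:", tree.score(X_test, y_test))
    print(export_text(tree, max_depth=2))   # inspect the root node and first splits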
4. K-Nearest Neighbor (KNN)
KNN is a distance-based ML model employed to solve classification problems. When it is used and combined with prior knowledge, KNN produces very good results. Classification in KNN is done by assigning to each unlabeled example the majority label among its K nearest neighbors in the training set. The nearest neighbors are determined by the distance metric KNN operates on; various techniques of measuring distance are used to identify the nearest neighbors, the most popular among which is the Euclidean distance. KNN is time-efficient and can easily be interpreted, hence its wide usage for classification problems in IDS and other applications.
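A short illustrative sketch of KNN majority voting under the Euclidean metric on synthetic stand-in data; the choice of K=5 and the scaling step are assumptions, not taken from the reviewed works.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Each test record is labelled by the majority class among its K nearest training
    # records under the Euclidean metric (Minkowski distance with p=2).
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2))
    knn.fit(X_train, y_train)
    print("KNN accuracy:", knn.score(X_test, y_test))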
C. Hybrid Classifiers
A hybrid classifier is an approach in which multiple ML models are combined in order to improve the efficiency of the aggregate classifier in the IDS. The purpose of using the hybrid method in IDS is to improve IDS performance, as it is well known that hybrid systems work much more efficiently than IDS classification with a single machine learning model. Either supervised or unsupervised ML models can be set as the initial level of the hybrid classifier.
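As one of many possible hybrid arrangements (a sketch only, not a specific published pipeline), the example below uses unsupervised K-Means as the first level and feeds the cluster-distance features it produces into a supervised SVM at the second level; the number of clusters and the synthetic data are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    scaler = StandardScaler().fit(X_train)
    X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

    # Level 1 (unsupervised): K-Means distances to the cluster centres become new features.
    kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_train_s)
    train_aug = np.hstack([X_train_s, kmeans.transform(X_train_s)])
    test_aug = np.hstack([X_test_s, kmeans.transform(X_test_s)])

    # Level 2 (supervised): SVM classifies the cluster-augmented records.
    svm = SVC(kernel="rbf").fit(train_aug, y_train)
    print("Hybrid K-Means + SVM accuracy:", svm.score(test_aug, y_test))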
D. Ensemble Classifier
An ensemble classifier is a combination of more than one ML classifier, the members sometimes being referred to as weak learners, whose individual choices are combined into a consensus decision in some way to provide better predictive performance. Therefore, by aggregating the different results of weak learners, the ensemble classifier provides enhanced efficiency. The work of many researchers who implemented ensemble models demonstrates high precision and efficiency. Approaches for building ensembles include: infusion of randomness, plurality voting, ensembles built over different feature subsets, bagging and random trees, and error-correcting output coding.
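A minimal sketch of two of these approaches in scikit-learn, on a synthetic stand-in dataset: plurality (hard) voting over heterogeneous base learners, and bagging of decision trees trained on bootstrap resamples. The choice of base learners and ensemble sizes are assumptions for illustration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, VotingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Plurality (majority) voting over three heterogeneous base learners.
    voter = VotingClassifier(
        estimators=[("dt", DecisionTreeClassifier(max_depth=8)),
                    ("knn", KNeighborsClassifier(n_neighbors=5)),
                    ("nb", GaussianNB())],
        voting="hard")
    voter.fit(X_train, y_train)
    print("Voting ensemble accuracy:", voter.score(X_test, y_test))

    # Bagging: many trees trained on bootstrap resamples of the training set.
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    bag.fit(X_train, y_train)
    print("Bagged trees accuracy:", bag.score(X_test, y_test))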
III. REVIEW AND COMPARISON OF RELATED WORKS

In one reviewed work, multiple ML techniques were adopted with the aim of overcoming the lack of accuracy in dealing with low-frequency attacks, often faced by previous IDS, when ANN with fuzzy clustering is used. They did this effectively by separating the heterogeneous training dataset into homogeneous subsets, thus decreasing the size of each training set. In the study, J48 trees, MLP and BN classifiers were applied, with J48 trees offering the highest precision. One big disadvantage of their work is the failure to implement feature extraction to discard all irrelevant, obsolete and unnecessary attributes.

An ensemble-based ML approach was applied in another work, in which the outputs of several models, both supervised and unsupervised ML, were combined through voting classification. The work increases the accuracy and reliability of the IDS. Their work was evaluated on the Kyoto2006+ dataset, since it is more appealing in comparison with the most adopted datasets, which are outdated. The accuracy of their work is quite good, though the false positive rate is high.

A real-time approach to hybrid IDS was also suggested, in which signature-based detection was utilized to discover well-known intrusions and the anomaly-based method to discover new threats. A good detection rate was obtained in this work because attacks that evaded the signature-based technique could still be classified as intrusions by the anomaly detection technique. The precision of the algorithm improved incrementally every day of the trial, reaching a substantial value of 92.65 percent on the last day, and the false negative rate declined sharply as the algorithm improved and the machine was retrained every day. However, whenever the model is extended to a very broad data size, the problem of a slow detection rate is observed.

Another study reveals that the performance of anomaly-based IDS could be improved, especially in the FPR. The NSL-KDD dataset was applied to evaluate extreme gradient boosting (XGBoost) and AdaBoost models. While a relatively high accuracy is achieved, the implementation of hybrid or ensemble ML classifiers is required to boost the effectiveness of the IDS.

Many works failed to address the issues of high execution time and low detection rate, as they lack feature extraction. One study evaluated various ML algorithms and attribute extraction methods on NSL-KDD. Because of the high FPR of the model and the work focusing solely on signature-based threats, a significant limitation regarding zero-day attacks remains unaddressed, leaving novel attacks uncaptured. Many of the previous works also failed to evaluate their adopted model on different datasets. Another work suggested a novel IDS in which feature extraction was applied. The work has the advantage of integrating an ensemble classifier with feature selection, which provides increased intrusion detection performance and accuracy. Three separate datasets were utilized for the work: the popular NSL-KDD dataset and two newly released datasets, i.e. CIC-IDS2017 and AWID. The CFS-BA-based technique was used for feature selection. The ensemble-based method improves the efficiency of multi-class categorization. The model showed the best accuracy value when evaluated on the AWID dataset.

Both FFANN and PRANN were applied in another work, which utilized scaled conjugate gradient and Bayesian regularization techniques in training the ANN-based IDS. To assess the quality and capability of the work, various result metrics were used. In the various output tests on different attack detections, the two models were each shown to outperform the other in turn. Overall, the FFANN provided an improved precision of 98.0742 percent. The reliability of the work needs to be increased by checking the model on multiple datasets.

Four different algorithms were combined into an ensemble model comprising a Bayes classifier, decision tree, random forest, and RNN-LSTM. The work contributes a way of handling an imbalanced dataset by selecting the most effective intrusion detection features needed for detecting intrusions and signaling to system administrators whether the traffic is legitimate or illicit behavior. Although the approach performs on NSL-KDD with some degree of precision, an experimental study on the most up-to-date datasets is required.

Another work developed an IDS using single machine learning classifiers. They applied RF and DT algorithms, evaluated on the NSL-KDD dataset. Having edged out the decision tree in accuracy, the random forest classifier gives the superior results. The study has not addressed the issues of detection rate and FPR.

A further work proposed an IDS in which two separate datasets, NSL-KDD and UNSW NB-15, were evaluated on KNN and Random Committee. In this work, feature extraction has been implemented that produces and uses only the most appropriate attribute subsets for the applied datasets. The study findings show that the Random Committee algorithm works better than KNN. In future studies, it is important to further resolve the problems of large data scale, data imbalance, and the general efficiency of IDS algorithms.
In Ponthapalli et al.'s proposed work, single classifiers were applied to detect network intrusions. The algorithms used are SVM, LR, RF, and DT. The work was evaluated on the NSL-KDD dataset. The research showed that the intrusion detection system performs best with the random forest classifier. They also found that the RF algorithm has the least execution time. The study has the drawback of being evaluated on only one dataset.

A work was carried out on a stacking ensemble technique using heterogeneous datasets. LR, KNN, SVM and RF constitute the ensemble approach. The study uses two recent datasets, UGR'16 and UNSW NB-15. UNSW NB-15 was generated on a simulated machine, while UGR'16 was generated from an actual data traffic scenario. The method increased the IDS's estimation accuracy and detection speed and returned the best accuracy. Moreover, further studies need to be performed on multiple datasets that contain the latest attack types.

A hybrid NIDS was also proposed, in which several hybrid models were evaluated on the NSL-KDD dataset. A combination of neural network and K-Means clustering with attribute extraction was made; SVM was also combined with K-Means. The findings showed vividly that hybridizing the different types of ML lets them complement each other, boosting the IDS's performance. The highest accuracy was obtained by the integration of support vector machine and K-Means with attribute extraction. To decrease the FPR, further works need to be carried out using improved hybrid ML techniques.
A. Comparison of Related Work
In the review of the previous work, various papers have been studied, with at least two research papers studied in each of the years within the stated range. Figure 2 illustrates the overall distribution of the studied research papers. As can be clearly observed in Figure 3, the ensemble classifier returns the highest accuracy whenever it is employed over the years.

B. Datasets Employed in the Previous Works
A dataset can be defined as a collection of records, where a record is the word used to describe a single data row. Each record consists of many features, referred to as the attributes of a data instance. NSL-KDD is the most common dataset used in the works being studied. In general, KDD'99, NSL-KDD, Kyoto2006+, UGR'16, CICIDS'17, and UNSW-NB'15 are the datasets used to evaluate the various algorithms applied in the studied works.

The KDD Cup dataset is the dataset applied in the 3rd International Knowledge Discovery and Data Mining Tools Competition. It is made up of 41 columns of attributes, which constitute each input pattern instance. The second most used dataset is the NSL-KDD dataset. The NSL-KDD dataset is an upgrade over the KDD'99 dataset: redundant records have been removed from the KDD Cup '99 dataset to get rid of classification bias effects. This dataset consists of 38 numeric features and 3 nominal features, taking the total number of features to 41.

Kyoto2006+ was developed based on real traffic data gathered over three years at Kyoto University using 348 honeypots. 24 features are used in this dataset; 14 of its features are similar to those of the KDD Cup '99 dataset, while the remaining 10 columns contain additional characteristics that bring light to certain problems frequently encountered while using KDD Cup '99.

AWID is made up of real benign and attack traffic collected from real network environments; it was publicly released in 2015. Standard and recent typical attacks are given in the CIC-IDS2017 dataset, which was released in 2017 by the Canadian Institute for Cybersecurity as an updated dataset for IDS; it consists of 3,119,345 rows with 84 distinct labelled features. The UNSW NB-15 dataset was established in 2015; its acronym refers to the University of New South Wales. There is a total of 47 attributes in the dataset, with two class labels [20].
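As a hedged illustration of how such a dataset might be loaded for the ML models discussed above, the sketch below assumes the NSL-KDD training split is available locally as the header-less file KDDTrain+.txt, laid out as 41 features followed by an attack label (and, in some distributions, a difficulty score); the path and column positions are assumptions, not part of the reviewed works.

    import pandas as pd

    # Hypothetical local path to the NSL-KDD training split.
    df = pd.read_csv("KDDTrain+.txt", header=None)

    features = df.iloc[:, :41]                          # 38 numeric + 3 nominal features
    labels = (df.iloc[:, 41] != "normal").astype(int)   # 1 = attack, 0 = normal

    # The nominal features (protocol type, service, flag) need encoding before ML.
    features = pd.get_dummies(features, columns=[1, 2, 3])
    print(features.shape, labels.value_counts().to_dict())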
TABLE 1: Distribution of Dataset Usage over the Years

Year | KDD Cup 99 | NSL-KDD | CIC-IDS 2017 | UNSW NB-15 | UGR '16
2015 | 1 | 1 | 0 | 0 | 0
2016 | 1 | 2 | 0 | 0 | 0
2017 | 0 | 3 | 0 | 0 | 0
2018 | 2 | 1 | 1 | 0 | 0
2019 | 0 | 3 | 0 | 1 | 0
2020 | 0 | 4 | 0 | 0 | 2

Table 1 and Figure 3 show the distribution of dataset usage over the years across the reviewed papers. NSL-KDD was applied 14 times, covering 58.33 percent of the total use of datasets. It was followed by KDD Cup '99, which was applied on four occasions. UNSW NB-15 was applied on two occasions, while Kyoto2006+, AWID, CICIDS-2017, and UGR'16 were each applied on one occasion.

TABLE 2: Comparison of the related works

Work | Technique(s) | Dataset | Performance | Strengths | Limitations
IDS using bagging with partial decision tree base classifier | Genetic Algorithm (GA) based feature selection; bagged classifier with partial decision tree (PART) | NSL-KDD | Bagged PART = 99.7166% | High detection rate | High execution time
IDS based on KNN combining cluster centers and nearest neighbors | CANN, KNN, SVM | KDD'99 | CANN = 99.76%; KNN = 93.87%; SVM = 80.65% | Applied attribute selection for effective classification of intrusions and normal traffic | Some malicious traffic managed to escape detection
Comparison of classification techniques applied for network intrusion detection and classification | BFTree, NBTree, J48, RFT, MLP, NB | NSL-KDD | BFTree = 98.24%; NBTree = 98.44%; J48 = 97.68%; RFT = 98.34%; MLP = 98.53%; NB = 84.75% | High decrease in FP | The study needs to be evaluated on updated datasets
Machine Learning Based Network Intrusion Detection | | | | |
Intrusion detection in computer networks using hybrid machine learning techniques | Hybrid NN, SVM and K-Means; J48 trees; MLP; BN | NSL-KDD, KDD'99 | SVM+K-Means = 96.81%; NN+K-Means = 95.55%; J48 = 93.1083%; MLP = 91.9017%; BN = 90.7317% | Integration of supervised and unsupervised ML models complement each other in boosting IDS efficiency; provides a solution to the low accuracy often faced in the detection of low-frequency attacks | Has to evaluate the work on up-to-date datasets; failed to select only the required attributes
Machine Learning Methods for Network Intrusions | | | | |
Evaluation of Machine Learning Techniques for Network Intrusion Detection | K-Means, KNN, FCM, SVM, NB, RBF | Kyoto2006+ | RBF = 97.54%; KNN = 97.54%; Ensemble = 96.72%; NB = 96.72%; SVM = 94.26%; FCM = 83.60%; K-Means = 83.60% | The work was evaluated using Kyoto2006+ | Low recall
Anomaly network-based IDS using a reliable hybrid artificial bee colony and AdaBoost algorithm | AdaBoost for classification and ABC for feature selection | NSL-KDD, ISCXIDS2012 | AdaBoost = 98.9% | As a result of applying feature selection using ABC, good performance on different datasets was demonstrated | The approach needs to be evaluated on recent datasets
Improved off-line IDS using GA | GA | NSL-KDD | GA = 98.90% | The approach managed to reduce false negatives by building up aggregate solution sets of all compatible intrusions found | Future IDS development has to consider the standardization of audit files
An implementation of IDS using GA | GA | KDD'99 | 99.40% | A reasonable level of detection rate was achieved | Need to evaluate the approach on more recent datasets
Improving AdaBoost-based IDS performance on CICIDS2017 dataset | AdaBoost | CICIDS2017 | 81.83% | The result shows that the work's performance outperforms previous works of a similar technique | A machine learning based approach needs to be adopted