Ensemble Based Approach For Intrusion
Abstract With the rapid growth of Internet technology, many types of attacks
and intrusions now take place over the Internet. Intrusion Detection Systems (IDS)
are widely used to detect attacks. Extensive research has been done on IDS, but
because new attacks keep emerging it remains an open research area. In this paper, an
ensemble based scheme using the extra tree classifier is proposed for intrusion detection.
The proposed scheme has four major steps: data collection, preprocessing
of data, ensemble based training and testing, and results. The basic
idea of an ensemble based approach is to build several separate classifiers, train
each of them, and combine their decisions to obtain a
stronger decision. The proposed scheme is implemented in a Python programming
environment and evaluated on the KDDcup99 and NSL-KDD datasets, which are
benchmarks for intrusion detection. The simulation results show that the proposed
scheme is effective for intrusion detection: it
achieved 99.97% accuracy on the KDDcup99 dataset and 99.32% on the NSL-KDD dataset.
1 Introduction
Due to the rapid growth of the Internet, a wide range of transactions now take place over
networks. Over the past decade, people have become highly dependent
on the Internet and networking. However, different types of attacks also arise
during the exchange of information over the Internet [1]. This is a major
challenge for network security researchers. Nowadays, it is essential
to protect useful and private information. An intrusion detection system (IDS) is used to
protect such information from different types of attacks. The main purpose of an IDS
is to detect the attacks that occur in the network. Intrusion detection is
the process of monitoring the states occurring in a network and investigating
them to detect malicious states [2]. Broadly, IDS techniques can
be classified as follows:
Anomaly Based Technique: In this type of IDS, occurring states are investigated
based on deviation in the behavior of the occurring event. If a deviation is
found in the behavior, the state is detected as an intrusion; otherwise it is detected
as a normal state [3].
Signature Based Technique: In this type of IDS, occurring states are investigated
based on signatures of malicious activities that are already stored in the
knowledge base of the IDS. If the signature of an
occurring state matches an existing signature, that state is detected as an
intrusion. This technique is also referred to as knowledge-based or misuse detection [4].
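The difference between the two techniques can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the stored signatures, the baseline connection rates, and the threshold of three standard deviations are assumptions, not part of the proposed scheme.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical knowledge base of known-malicious patterns (illustrative only).
SIGNATURES = {"/etc/passwd", "' OR 1=1"}

def signature_detect(payload: str) -> bool:
    """Signature based: flag the event if it contains any stored signature."""
    return any(sig in payload for sig in SIGNATURES)

def anomaly_detect(value: float, baseline: Sequence[float], k: float = 3.0) -> bool:
    """Anomaly based: flag the event if it deviates from normal behavior
    by more than k standard deviations (k is an assumed threshold)."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > k * sigma

# Made-up connection rates observed during normal operation.
normal_rates = [98, 102, 100, 97, 103, 101, 99]

print(signature_detect("GET /index.php?id=' OR 1=1"))  # matches a stored signature
print(anomaly_detect(500, normal_rates))               # far outside normal behavior
print(anomaly_detect(101, normal_rates))               # within normal behavior
```

A signature based detector can only flag what is already in its knowledge base, whereas the anomaly based detector can flag previously unseen behavior at the cost of depending on how well the baseline describes normal traffic.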
2 Related Work
Sindhu et al. [12] presented an improved multi-class IDS built from three
different viewpoints, including preprocessing and cleaning of the input traffic pattern followed by
a feature selection algorithm. They used a neurotree model to increase the classification
and detection rate, which is superior to NN* and extended C4.5 algorithms.
Aburomman and Reaz [13] observed that combining the opinions of multiple experts
into one through an ensemble approach can improve classification accuracy.
Particle Swarm Optimization obtained the
best results, with an average accuracy improvement of 0.756%, taking relatively little
time and producing optimized weights that gave the best possible accuracy. It was concluded
that these techniques were good enough for binary classification.
Mukkamala et al. [14] proposed an ensemble approach and compared it with
other techniques. The performance of ANNs, SVMs and MARs was compared with
that of an ensemble method. It was concluded that the ensemble method was superior in
classification accuracy, with the possibility of reaching 100% classification accuracy
given the correct intelligent paradigms. Among the individual techniques, SVMs outperformed
MARs and ANNs in terms of training time and prediction accuracy. Resilient back
propagation performed the best, with an accuracy of 97.04% after 67 training epochs.
These results demonstrate the importance of ensembles across different learning paradigms.
3 Proposed Scheme
The proposed scheme is divided into the following major steps: data collection,
preprocessing of datasets, training and testing using the Extra tree classifier, and
results. The outline of the proposed scheme is given in Fig. 1.
3.1 Data Collection
A dataset is required as a benchmark for any proposed intrusion detection technique.
Here, two of the most popular intrusion datasets, KDDcup99 and NSL-KDD,
are used; both are publicly available to the network security research
community [15]. These datasets contain many normal and malicious records.
In the implementation, all records are categorized as Normal, Denial of Service
(DoS), Probe, R2L or U2R [16].
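The categorization of records can be sketched as a lookup from the raw attack label to one of the five categories. The grouping below is the one commonly used for the KDDcup99 training labels and is an assumption here, since the paper does not list its exact mapping; the names `ATTACK_CATEGORIES` and `to_category` are hypothetical.

```python
# Commonly used grouping of KDDcup99 training labels (assumed, not taken from the paper).
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
}

# Invert the grouping into a label -> category lookup table.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}

def to_category(raw_label: str) -> str:
    """Map a raw record label (e.g. 'smurf.') to one of the five categories."""
    label = raw_label.strip().rstrip(".").lower()
    if label == "normal":
        return "Normal"
    # Labels outside the table are reported as 'Unknown' rather than guessed.
    return LABEL_TO_CATEGORY.get(label, "Unknown")

print(to_category("smurf."))   # DoS
print(to_category("normal."))  # Normal
```

Labels that do not appear in the table, such as attack types present only in some test sets, fall through to "Unknown" so that they can be handled explicitly.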
3.2 Preprocessing
Datasets collected from any source are not normalized. In
preprocessing, some operations are performed on the datasets before training and
testing. The main objective of preprocessing is to eliminate unwanted data from
the datasets. Put simply, preprocessing transforms the
data from its original form into a desirable form [17]. A preprocessed dataset behaves
well during the training and testing phases, which makes the outcome of the model more accurate.
In this implementation, a Python environment is used to preprocess both the
KDDcup99 and NSL-KDD datasets.
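The paper does not list its individual preprocessing operations, so the following is only a sketch of steps that are typical for these datasets: removing duplicate records, encoding symbolic features as integers, and min-max scaling of numeric features. The helper names and the toy records are hypothetical.

```python
def drop_duplicates(rows):
    """Remove repeated records while keeping the original order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def encode_categorical(column):
    """Map each distinct symbolic value to an integer code (sorted for determinism)."""
    codes = {value: i for i, value in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]

def min_max_scale(column):
    """Rescale numeric values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Toy records of the form (protocol_type, src_bytes); values are made up.
records = [("tcp", 200), ("udp", 100), ("tcp", 200), ("icmp", 1100)]

records = drop_duplicates(records)
protocol = encode_categorical([r[0] for r in records])
src_bytes = min_max_scale([r[1] for r in records])

print(protocol)   # [1, 2, 0]
print(src_bytes)  # [0.1, 0.0, 1.0]
```

Tree based classifiers do not strictly require feature scaling, so the scaling step matters mainly if the same preprocessed data is reused with other learners.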
3.3 Training and Testing
Training and testing is a very important phase in which the model is built through
training and then tested. Here, the Extra tree classifier ensemble method is
used. A graphical representation of this method is given in Fig. 2. Extra tree
classification is a modification of bagging in which samples of the training dataset
are used to construct random trees [18]. The Extra tree classifier is also known as
extremely randomized trees. Its working is given below:
Step 1: Supply the training dataset.
Step 2: Perform random selection. The random selection (K) is used to
determine the best split.
Step 3: Multiple decision trees are built using the random vector.
Step 4: All trees generated by the random vector are combined into a single ensemble decision.
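The steps above can be sketched with the `ExtraTreesClassifier` from scikit-learn. The paper states only that a Python environment was used, so the choice of library, the hyperparameter values, and the synthetic data standing in for the preprocessed intrusion records are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for preprocessed records (NOT KDDcup99 or NSL-KDD).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=2, class_sep=2.0, random_state=0)

# Step 1: split off the training dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Steps 2 and 3: each of the n_estimators trees draws a random subset of
# features at every node (max_features plays the role of K) and picks the
# best among randomly drawn split thresholds.
model = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)

# Step 4: the ensemble prediction averages the class probabilities of all trees.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here describes only the synthetic data and says nothing about the figures reported for KDDcup99 or NSL-KDD.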
Ensemble Based Approach for Intrusion Detection … 217
3.4 Results
This phase is the outcome of the above phases. In this phase, the model
decides whether the occurring event is a malicious activity or a normal activity.
The proposed scheme gives 99.97% detection accuracy on the KDDcup99 dataset and
99.32% on the NSL-KDD dataset.
Tables 1 and 2 summarize the results obtained using the proposed scheme. The
results have been analyzed using precision, recall and F1 score. The confusion
matrices for both datasets are given in Tables 3 and 4. The terminology used at
the analysis stage is explained as follows:
Precision: Precision is defined as the ratio of True Positive (TP) to the sum of
True Positive (TP) and False Positive (FP). It measures the model's exactness, i.e.,
how many of the predicted positives are correct.
\[
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}
\]
Recall: Recall is defined as the ratio of TP to the sum of TP and False Negative (FN).
It measures the model’s completeness [19].
\[
\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}
\]
Accuracy: Accuracy is defined as the ratio of correctly predicted
observations to the total number of observations, where TN refers to True Negative.
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{3}
\]
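As a minimal sketch, the three measures defined so far can be computed directly from the four confusion-matrix counts. The counts used below are hypothetical and are not taken from the tables of this paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model detected."""
    return tp / (tp + fn)

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all observations that were predicted correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical counts for a binary attack-vs-normal decision.
tp, fp, fn, tn = 90, 10, 5, 895

print(f"precision = {precision(tp, fp):.4f}")
print(f"recall    = {recall(tp, fn):.4f}")
print(f"accuracy  = {accuracy(tp, fp, fn, tn):.4f}")
```

With heavily imbalanced classes, as in this made-up example, accuracy can stay high even when recall on the attack class drops, which is why precision and recall are reported alongside it.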
F-beta (β) score: The F-beta score is defined as the weighted harmonic mean of Precision and
Recall. When the value of beta is 1, it is termed the F1 score.
The F-beta score ranges from 0 (worst) to 1 (best) [20].
\[
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R} \tag{4}
\]
where R refers to Recall and P refers to Precision. In terms of the confusion-matrix
counts, the F-beta (β) score can be computed as:
\[
F_{\beta} = \frac{(1 + \beta^{2})\, TP}{(1 + \beta^{2})\, TP + \beta^{2}\, FN + FP} \tag{5}
\]
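The relation between the two forms of the F-beta score can be checked numerically. The sketch below computes the score once from precision and recall and once from the raw counts. Substituting P = TP / (TP + FP) and R = TP / (TP + FN) into the precision-recall form yields a count form in which β² multiplies FN, and that is the form implemented here. The function names and counts are hypothetical.

```python
import math

def f_beta_from_pr(p: float, r: float, beta: float = 1.0) -> float:
    """F-beta from precision p and recall r."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from confusion-matrix counts (beta^2 weights the false negatives)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Hypothetical counts, not taken from the paper's tables.
tp, fp, fn = 90, 10, 5
p, r = tp / (tp + fp), tp / (tp + fn)

for beta in (0.5, 1.0, 2.0):
    a = f_beta_from_pr(p, r, beta)
    b = f_beta_from_counts(tp, fp, fn, beta)
    assert math.isclose(a, b)  # both forms agree
    print(f"beta = {beta}: F = {a:.4f}")
```

Values of beta above 1 weight recall more heavily, which may suit intrusion detection when missed attacks are costlier than false alarms; values below 1 favor precision.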
5 Conclusion
In this paper, several ensemble based approaches for intrusion detection have been
reviewed, and a new ensemble based approach has been proposed and analyzed.
The proposed scheme has four phases: data collection, preprocessing of datasets,
training and testing using the Extra tree classifier, and results. The proposed scheme has
been applied to both intrusion detection datasets, i.e., KDDcup99 and NSL-KDD.
For implementation, a Python programming environment has been used. Results
have been analyzed based on accuracy
and the confusion matrix, which show that the proposed scheme achieved 99.97% accuracy
on the KDDcup99 dataset and 99.32% on the NSL-KDD dataset. In future, the proposed
scheme may be applied to real-time datasets, and optimization techniques may be
used.
References
1. Lazarevic A (2005) Managing cyber threats: issues, approaches, and challenges. Springer
Science+Business Media, Incorporated
2. Bace R, Mell P (2001) NIST special publication on intrusion detection systems
3. Bhati BS, Rai CS (2016) Intrusion detection systems and techniques: a review. Int J Crit Comput
Based Syst 6(3):173–190
4. Liao HJ, Lin CHR, Lin YC, Tung KY (2013) Intrusion detection system: a comprehensive
review. J Netw Comput Appl 36(1):16–24
220 B. S. Bhati and C. S. Rai