(IJCST-V10I4P14): Manish Chava, Aman Agarwal, Dr Radha K


International Journal of Computer Science Trends and Technology (IJCST) – Volume 10 Issue 4, Jul-Aug 2022

RESEARCH ARTICLE OPEN ACCESS

Ensemble Learning Vs ANN Approaches to Network Based Intrusion Detection System
Manish Chava [1], Aman Agarwal [2], Dr Radha K [3]
[1] CSE, GITAM University, Hyderabad, Rudraram
[2] CSE, GITAM University, Hyderabad, Rudraram
[3] CSE, GITAM University, Hyderabad, Rudraram

ABSTRACT
Network attacks and breaches have become very common among global companies, leading to substantial revenue losses. It is critical to protect data and networks against malicious assaults. Therefore, it is high time for companies to use a highly dependable Intrusion Detection System (IDS). An IDS is a system that monitors network traffic for suspicious activity and sends an alert when such activity is discovered. This project demonstrates the working of an intrusion detection system that scans the network for any malicious activity. An intrusion detection system uses the data to identify highly advanced threats before they wreak havoc on the system. Machine learning must be applied to make the IDS more generic and capable of discovering new attack techniques and avoiding such attacks.
Because of the rapid growth in computer network usage and the massive rise in applications that use these networks, providing security is becoming increasingly vital. The existing security systems include technical and commercial flaws that are difficult for manufacturers to address. As a result, the role of IDS in detecting network threats is becoming increasingly important.
Keywords :- Machine learning, deep learning, Artificial Neural Networks, XGBoost, Random Forest, Intrusion Detection System, Computer Networks.

I. INTRODUCTION
Although many research papers discuss NIDS implementation, companies do not use them because of their high fault ratio. The existing security systems include technical and commercial flaws that are difficult for manufacturers to address. This has increased the need for genuine, reliable, and widely acceptable NIDS. The goal is to develop an algorithm capable of detecting attacks in network traffic by adopting a machine learning-based approach.

First, a highly reliable NIDS is developed using supervised machine learning algorithms. The dataset considered for this project is the NSL KDD dataset.

After proper analysis of the dataset, the appropriate algorithms are chosen. This project adopts classification algorithms to predict the types of network attacks that are possible. The attacks are categorized into five types: Normal, PROBE, R2L, U2R, and DOS. After the NIDS model is developed, it is trained and tested with the data to see its performance. The performance considers how well the model predicts each attack category. Different classification metrics such as train and test accuracy, precision, recall, F1 score, and ROC AUC score are used.

This project comprises several algorithms designed and developed for potential network attack prediction. The supervised machine learning approach generally consists of well-structured data to feed the machine learning algorithm. The algorithm tries to learn hidden patterns and features from the training data and then uses the knowledge it gained to predict future outcomes. In this project, the dataset is well structured and labelled, with separate training and testing data.

II. BACKGROUND

A. Ensemble Techniques
Ensemble learning is a machine learning technique in which multiple weak learners, called base models, are developed. Their results are combined to obtain one higher-accuracy model rather than a single model with lower accuracy. Ensemble learning is of two types: Bagging and Boosting. In this project, one technique from each sub-category is explored: Random Forest from Bagging and Extreme Gradient Boosting (XGBoost) from Boosting [4].

1) Random Forest: The Random Forest classifier is a Bagging (Bootstrap Aggregation) technique.

ISSN: 2347-8578 www.ijcstjournal.org Page 87
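As a rough illustration of bootstrap aggregation, the sketch below draws rows with replacement for each base model and combines class predictions by majority vote. The helper names, the toy rows, the seed, and the three example votes are all hypothetical and are not taken from the paper's implementation.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Row-wise sampling with replacement: draw len(rows) rows, duplicates allowed."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine base-model predictions by picking the most common class."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed, arbitrary seed so the sketch is repeatable
rows = [("tcp", "NORMAL"), ("udp", "DOS"), ("icmp", "PROBE"), ("tcp", "DOS")]

# One bootstrap sample per base model (three hypothetical base models here).
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical predictions from the three base models for one connection.
votes = ["DOS", "NORMAL", "DOS"]
final_prediction = majority_vote(votes)
print(final_prediction)
```

In a full Random Forest each sample would train its own decision tree on a random subset of the features; that step is omitted here to keep the sketch short.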



Fig. 1 Random Forest Process

Let us consider a dataset D with dimensionality r x c, where r = number of rows and c = number of columns. In Random Forest, multiple models are created based on decision trees (D.T). A subset of the data D is provided individually to each of these models. The technique of drawing these subsets from D is called row-wise sampling with replacement. Each model also considers only a subset of the features. After each model makes its prediction, the results are combined by choosing the majority class as the final prediction class.

2) Extreme Gradient Boosting (XGBoost): Boosting is a serial ensemble technique.

Initially, a base model is constructed. The output of this base model is the average of the dependent (target) column. The errors, called residuals, are then calculated for each instance, and the following model (say model 1) is fitted on these residuals.

A similarity score is calculated on the residuals node. Then the splitting criterion is selected. After the split is made, the similarity score for each sub-branch is calculated.

The Gain value is then computed. Using the gamma value provided to the XGBoost algorithm, the split is kept if the gain value is greater than the gamma value; otherwise it is discarded. This is how auto pruning of the decision tree happens [2].

Fig. 2 XGBoost Process [9]

Lambda: Regularization parameter

If the lambda value is high, it leads to heavier pruning of the tree and thus helps avoid overfitting.

A higher lambda value also reduces the effect of outliers on the model to some extent.

New prediction = Previous prediction + Learning rate × Output

B. DEEP LEARNING

Artificial neural networks are used. The input values must be standardized or normalized. These methods keep the parameters at a suitable scale, allowing the neural network to process them more easily. This is essential for the operational capacity of the neural network. In a neural network, in order to find the optimal weights, we need to minimize the cost function. One approach for this is the brute force approach. However, this is not feasible when the neural networks are complex. So, the gradient descent approach was adopted [4].

The GDF (Gradient Descent Function) is an iterative algorithm to find the minimum of a function. Here that function is the cost function. Usually, the GDF is applied to smooth and convex cost functions. If the function is not convex, the algorithm might find a local minimum rather than the global minimum. Moreover, if the cost function curve is not smooth but sharp, then it is not differentiable at those points [4].
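A minimal sketch of the gradient descent iteration on a smooth, convex cost is given below. The quadratic cost, the starting point, the learning rate, and the step count are illustrative assumptions and do not come from the paper; a real network would minimize its loss over many weights at once.

```python
def cost(w):
    """Hypothetical convex cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def cost_gradient(w):
    """Derivative of the cost above with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to move toward a minimum."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * cost_gradient(w)
    return w


best_w = gradient_descent(start=0.0)
print(round(best_w, 4), round(cost(best_w), 6))
```

Because this cost is convex, the iteration approaches the single global minimum; on a non-convex cost the same loop could settle in a local minimum depending on the starting point.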


Fig. 3 Types of Cost Function Curves [10]

The ANN is a sequential neural network. The sequential order of the layers is as follows.
Input layer -> Hidden layers -> Output layer

The number of neurons in the input layer equals the number of independent features. For a classification task, the number of neurons in the output layer equals the total number of class labels. For hidden layers, the value is dynamic, meaning the neuron count is chosen according to how well the network performs on the data. Different activation functions are used for each layer. Generally, for ANN, the Rectified Linear Unit (ReLU) activation function is used [2].

3) Intrusion Detection System: Intrusion detection systems are of four types. They are explained below.

a) Network Intrusion Detection System (NIDS): These IDS are deployed across multiple locations within a network that is unprotected and has a high risk of exposure to potential attacks [8].

b) Host Intrusion Detection System (HIDS): These are implemented at multiple host/client machines within a network. Unlike NIDS, HIDS monitor the network packets within the host machine (internally) for any suspicious activity [8].

c) Anomaly-Based Intrusion Detection System (AIDS): This form of IDS uses a method or strategy to monitor the network traffic and compare it to predefined standards. It then detects and notifies administrators of abnormal activity in the network [8].

d) Signature-based Intrusion Detection System (SIDS): These systems have a database or library of signatures or properties that are present in known intrusion attacks or malicious threats. Signature-based IDS monitors all network packets and detects potential malware by comparing signatures to suspicious activity [8].

III. METHODOLOGY
A. DATA PREPROCESSING

Data pre-processing involves checking the dataset for null values, data variety, label encoding, data standardization, and feature selection. The NSL KDD dataset contains no null values. For label encoding, the 'attack' feature is first categorized into five types, namely 'DOS', 'PROBE', 'NORMAL', 'R2L', and 'U2R'. They are categorized as follows [6] [5].

Fig. 4 Categorizing attacks into five categories

The machine understands data only as numbers, so the label encoder is used to convert the categorical data into numerical data. The categorical features present in the data are 'protocol_type', 'flag', and 'attack'.
The label encoder encodes each category with a value between 0 and n_classes-1, where n_classes is the number of distinct categories.
The label encoding is done in the following manner.

FEATURE         CATEGORY    LABEL ENCODED VALUE
flag            OTH         0
                REJ         1
                RSTO        2
                RSTOS0      3
                RSTR        4
                S0          5
                S1          6
                S2          7
                S3          8
                SF          9
                SH          10
attack          DOS         0
                NORMAL      1
                PROBE       2
                R2L         3
                U2R         4
Protocol type   icmp        0
                tcp         1
                udp         2
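As a hedged illustration of the encoding in the table, the sketch below assigns integers to categories in sorted order, which is how scikit-learn's LabelEncoder orders its classes and which reproduces the values listed above. The function names and the toy columns are hypothetical, and plain Python is used so the example stands alone.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer in 0..n_classes-1 (sorted order)."""
    classes = sorted(set(values))
    return {category: index for index, category in enumerate(classes)}


def encode(values, mapping):
    """Replace every category in a column by its integer code."""
    return [mapping[value] for value in values]


# Toy columns standing in for the 'protocol_type' and 'attack' features.
protocol_column = ["tcp", "udp", "icmp", "tcp", "icmp"]
attack_column = ["NORMAL", "DOS", "U2R", "PROBE", "R2L", "DOS"]

protocol_mapping = fit_label_encoder(protocol_column)
attack_mapping = fit_label_encoder(attack_column)
encoded_protocols = encode(protocol_column, protocol_mapping)

print(protocol_mapping)
print(attack_mapping)
print(encoded_protocols)
```

The mapping must be fitted once and then reused for both the training and the testing data, otherwise the same category could receive different codes in the two sets.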
Table. 1 Label Encoded Values Distribution

The data in the 'attack' column is distributed as follows.

Fig. 5 Training data    Fig. 6 Testing data

The dataset consists of 43 features. Feeding all of this data into the machine learning model can introduce considerable noise and thus make the model inefficient and inaccurate. So we used feature selection techniques to extract the best features from the dataset that would help predict the attacks accurately and efficiently.

The feature selection technique adopted in this project is 'SelectKBest'. This technique generates a score for each feature, called 'Feature_score'. These feature scores are used to select the best K features, where K is an integer.

In this project, we considered K as 6, 10, and 15. So we generated three separate data frames for the top_6, top_10, and top_15 features. Each algorithm is trained on each of these top_n feature sets, where n = 6, 10, 15.

After feature selection, the data is standardized using the standard scaler technique. It converts the data to zero mean and unit variance.

The same data pre-processing and feature selection process is applied to both the training and testing data.

B. Model Building and Training

A model is built using each of the 'Random Forest', 'Extreme Gradient Boosting', and 'ANN' algorithms.

Dealing with such massive data involves trying to obtain the optimal values for all the algorithm's hyperparameters. For Random Forest and XGBoost, the RandomizedSearchCV technique is used for hyperparameter tuning. It takes a set of candidate values for each hyperparameter of interest as input, forms random combinations of those values, and finds the combination that gives the best output.

This is performed on the top 6, 10, and 15 feature datasets.

For ANN, the Keras Tuner is used for hyperparameter tuning. This includes finding the optimal number of hidden layers required to learn the data efficiently during training, and their activation functions. It uses RandomSearch for this. The input layer contains neurons equal to the count of features fed into the network. The output layer contains five neurons with the SoftMax activation function, corresponding to the five attack classes. The hidden layers contain a number of neurons x in the range 5 <= x <= 15, which Keras Tuner determines during the model-building process.

IV. RESULTS AND DISCUSSION

A. Random Forest

For the Random Forest classifier, the highest accuracy of 76% was achieved when the top 15 features were considered. The ROC AUC score is 94%. The classification report gives the per-class breakdown.

The reason for the low scores of the fourth category, namely 'U2R', is the small amount of data available for this attack class. Less than 0.8% of the training data belongs to the 'U2R' category. So, the model was not provided with a sufficient amount of 'U2R' data to learn appropriately, hence the low scores.

B. Extreme Gradient Boosting (XGBOOST)

For the XGBoost classifier, the highest accuracy of 78% was achieved when the top 15 features were considered. The ROC AUC score is 94.8%. The classification report gives the per-class breakdown.
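To make the classification metrics named in this paper concrete, the following sketch computes accuracy and per-class precision, recall, and F1 from true and predicted labels. The label lists are made-up toy data, not the paper's results, and the function names are hypothetical; the point is only to show how a rare class such as 'U2R' can score zero while overall accuracy stays high.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for every label, computed one-vs-rest."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


# Toy, imbalanced labels: the single U2R sample is misclassified as NORMAL.
y_true = ["NORMAL"] * 6 + ["DOS"] * 3 + ["U2R"]
y_pred = ["NORMAL"] * 6 + ["DOS", "DOS", "NORMAL"] + ["NORMAL"]

overall_accuracy = accuracy(y_true, y_pred)
report = per_class_report(y_true, y_pred)
print(overall_accuracy)
for label, scores in report.items():
    print(label, scores)
```

In this toy case the overall accuracy is high even though every 'U2R' score is zero, which is the same kind of imbalance effect the discussion of the 'U2R' class describes.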


C. Artificial Neural Networks
Accuracy: 99%
Loss: 0.034
Validation Accuracy: 80%

REFERENCES
[1] G. De Carvalho Bertoli et al., "An End-to-End Framework for Machine Learning-Based Network Intrusion Detection System," in IEEE Access, vol. 9, pp. 106790-106805, 2021, doi: 10.1109/ACCESS.2021.3101188.

[2] Devan, P., Khare, N. An efficient XGBoost–DNN-based classification model for network intrusion detection system. Neural Comput & Applic 32, 12499–12514 (2020). https://doi.org/10.1007/s00521-020-04708-x

[3] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, E. Vázquez, Anomaly-based network intrusion detection: Techniques, systems and challenges, Computers & Security, Volume 28, Issues 1–2, 2009, Pages 18-28, ISSN 0167-4048, https://doi.org/10.1016/j.cose.2008.08.003 (https://www.sciencedirect.com/science/article/pii/S0167404808000692)

[4] Ahmad, Zeeshan, et al. "Network intrusion detection system: A systematic study of machine learning and deep learning approaches." Transactions on Emerging Telecommunications Technologies 32.1 (2021): e4150.

[5] Lasitha Hiranya (2020, March 9). Most Important things of "NSL-KDD" data set. https://medium.datadriveninvestor.com/did-you-know-the-famous-data-set-called-nsl-kdd-293b39420c74

[6] Gerry Saporito (2019, Sep 17). A Deeper Dive into the NSL-KDD Data Set. https://towardsdatascience.com/a-deeper-dive-into-the-nsl-kdd-data-set-15c753364657

[7] DNSSTUFF (2020, Feb 18). 7 Best Intrusion Detection Software and Latest IDS System. https://www.dnsstuff.com/network-intrusion-detection-software

[8] Tek Tools (2020, Feb 14). Intrusion Detection System (IDS) - The Fundamentals. https://www.tek-tools.com/security/what-is-an-intrusion-detection-system-ids

[9] Wikipedia contributors, "XGBoost," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=XGBoost&oldid=1095596992 (accessed July 18, 2022).

[10] https://www.coursera.org/learn/machine-learning-classification-algorithms, Instructor: Anna Koop.

[11] https://www.udemy.com/course/deeplearning/, Instructor: Kirill Eremenko
