(IJCST-V10I4P14): Manish Chava, Aman Agarwal, Dr. Radha K
ABSTRACT
Network attacks and breaches have become common among global companies and cause them substantial revenue losses.
It is critical to protect data and networks against malicious attacks. Companies therefore need a highly
dependable Intrusion Detection System (IDS). An IDS is a system that monitors network traffic
for suspicious activity and sends an alert when such activity is discovered. This project demonstrates the working of an
intrusion detection system that scans the network for malicious activity. An intrusion detection system uses network data to
identify advanced threats before they wreak havoc on the system. Machine learning must be applied to make the IDS more general
and capable of discovering new attack techniques and preventing such attacks.
Because of the rapid growth in computer network usage and the massive rise in applications that use these networks,
providing security is becoming vital. The existing security systems have technical and commercial flaws that are
difficult for manufacturers to address. As a result, the role of IDS in detecting network threats is becoming increasingly
important.
Keywords :- Machine learning, Deep learning, Artificial Neural Networks, XGBoost, Random Forest, Intrusion Detection
System, Computer Networks.
I. INTRODUCTION
Although many research papers discuss NIDS implementation, companies do not use these systems because of their high fault ratio. The existing security systems have technical and commercial flaws that are difficult for manufacturers to address. This has increased the need for a genuinely reliable and widely acceptable NIDS. The goal is to develop an algorithm capable of detecting attacks in network traffic by adopting a machine learning-based approach.

First, a highly reliable NIDS is developed using supervised machine learning algorithms. The dataset considered for this project is the NSL-KDD dataset.

After proper analysis of the dataset, the appropriate algorithms are chosen. This project adopts classification algorithms to predict the type of network attack. The traffic is categorized into five classes: Normal, PROBE, R2L, U2R, and DOS. After the NIDS model is developed, it is trained and tested with the data to evaluate its performance. The evaluation considers how well the model predicts each attack category. Classification metrics such as train and test accuracy, precision, recall, F1 score, and ROC AUC score are used.

well-structured data to feed the machine learning algorithm. The algorithm usually tries to learn hidden patterns and features from the training data and then uses the knowledge it gained to predict future outcomes. In this project, the dataset is well structured and labelled, with separate training and testing data.

II. BACKGROUND
A. Ensemble Techniques
Ensemble learning is a machine learning technique in which multiple weak learners, called base models, are developed. Their results are combined to obtain one model with higher accuracy than any single base model. Ensemble learning is of two types: Bagging and Boosting. In this project, one technique from each category is explored: Random Forest from Bagging and Extreme Gradient Boosting (XGBoost) from Boosting [4].

1) Random Forest: The Random Forest classifier is based on Bootstrap Aggregation (bagging).
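As a minimal sketch of bootstrap aggregation, the Python below draws row-wise samples with replacement, gives each base model a random subset of the feature columns, and combines the individual predictions by majority vote. The base learner is a hypothetical stand-in (a 1-nearest-neighbour rule rather than a decision tree), and the function names and toy rows are assumptions made only for illustration; they are not taken from this project.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Row-wise sampling with replacement; the sample has as many rows as the input."""
    return [rng.choice(rows) for _ in rows]


def fit_nearest_neighbour(sample, feature_idx):
    """Hypothetical stand-in for a decision tree: a 1-nearest-neighbour
    model that only looks at the chosen feature columns."""
    def predict(x):
        def squared_distance(row):
            features, _label = row
            return sum((features[i] - x[i]) ** 2 for i in feature_idx)
        _features, label = min(sample, key=squared_distance)
        return label
    return predict


def fit_bagged_ensemble(rows, n_models, n_features, seed=0):
    """Train n_models base models, each on its own bootstrap sample
    and its own random subset of feature columns."""
    rng = random.Random(seed)
    n_columns = len(rows[0][0])
    models = []
    for _ in range(n_models):
        sample = bootstrap_sample(rows, rng)
        feature_idx = rng.sample(range(n_columns), n_features)
        models.append(fit_nearest_neighbour(sample, feature_idx))
    return models


def predict_majority(models, x):
    """Final prediction = the class predicted by most base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy rows of (features, label); values are made up for illustration.
rows = [
    ((0.1, 0.2, 0.0), "normal"),
    ((0.2, 0.1, 0.1), "normal"),
    ((0.0, 0.3, 0.2), "normal"),
    ((0.9, 0.8, 1.0), "dos"),
    ((1.0, 0.9, 0.8), "dos"),
    ((0.8, 1.0, 0.9), "dos"),
]

models = fit_bagged_ensemble(rows, n_models=25, n_features=2)
print(predict_majority(models, (0.05, 0.15, 0.1)))
print(predict_majority(models, (0.95, 0.90, 0.9)))
```

Because every base model sees a different sample and a different feature subset, individual models can disagree; the vote smooths out those disagreements.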
c = No. of columns.

In Random Forest, multiple models are created based on decision trees (DT). Each model is given its own subset of the data D. The technique of drawing these subsets from D is called row-wise sampling with replacement. Each model also considers only a subset of the features. After each model makes its prediction, the results are combined by choosing the majority class as the final predicted class.

2) EXTREME GRADIENT BOOSTING (XGBOOST):

Boosting is a serial (sequential) ensemble technique.

Initially, a base model is constructed. Its output is the average of the dependent (target) column. The errors of this prediction, called residuals, are then calculated for each instance, and the following model (say model 1) is fitted on these residuals.

A similarity score is calculated for the node containing the residuals. A splitting criterion is then selected, and after the split the similarity score of each sub-branch is calculated.

The gain value is then computed. If the gain is greater than the gamma value provided to the XGBoost algorithm, the split is kept; otherwise it is not. This is how automatic pruning of the decision tree happens [2].

Lambda: Regularization parameter

If the lambda value is high, it leads to more pruning of the tree and thus helps avoid overfitting.

A higher lambda value also reduces the effect of outliers on the model to some extent.

New prediction = Previous prediction + Learning rate × output

B. DEEP LEARNING

Artificial neural networks are used. The input values must be standardized or normalized. These methods keep the input values in a suitable range, which makes them easier for the neural network to process and is essential for the network to train well. In a neural network, in order to find the optimal weights, we need to minimize the cost function. One approach is brute force, but this is not feasible when the neural network is complex, so the gradient descent approach was adopted [4].

The Gradient Descent Function (GDF) is an iterative algorithm for finding the minimum of a function; here, that function is the cost function. The GDF is usually applied to smooth, convex cost functions. If the function is not convex, the algorithm might find a local minimum rather than the global minimum. Moreover, if the cost function curve is not smooth but has sharp corners, it is not differentiable at those points [4].
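The behaviour of gradient descent on convex and non-convex cost functions can be illustrated with a small sketch. The two one-dimensional cost functions, the learning rates, and the starting points below are assumptions chosen for illustration; they are not the cost function of the network used in this project.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Convex cost J(w) = (w - 3)^2 has a single minimum at w = 3.
def convex_grad(w):
    return 2 * (w - 3)


# Non-convex cost J(w) = w^4 - 4w^2 + w has two separate minima.
def nonconvex_cost(w):
    return w ** 4 - 4 * w ** 2 + w


def nonconvex_grad(w):
    return 4 * w ** 3 - 8 * w + 1


w_convex = gradient_descent(convex_grad, start=-5.0)

# The same algorithm ends in different minima depending on where it starts.
w_left = gradient_descent(nonconvex_grad, start=-2.0, learning_rate=0.01, steps=2000)
w_right = gradient_descent(nonconvex_grad, start=2.0, learning_rate=0.01, steps=2000)

print("convex minimum:", round(w_convex, 4))
print("left minimum:", round(w_left, 4), "cost:", round(nonconvex_cost(w_left), 4))
print("right minimum:", round(w_right, 4), "cost:", round(nonconvex_cost(w_right), 4))
```

On the non-convex function, the run that starts on the right settles in a minimum whose cost is higher than the one reached from the left, which is the local-versus-global issue noted above.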
III. METHODOLOGY
A. DATA PREPROCESSING
udp 2
Table. 1 Label Encoded Values Distribution

The data in the ‘attack’ column is distributed as follows.

The dataset consists of 43 features. Feeding all of them into the machine learning model can introduce considerable noise and thus make the model inefficient and inaccurate. So we used feature selection techniques to extract the best features from the dataset, those that help predict the attacks accurately and efficiently.

For the ANN, Keras Tuner is used for hyperparameter tuning. This includes finding the optimal number of hidden layers required to learn the data efficiently during training, and their activation functions. It uses RandomSearch for this. The input layer contains as many neurons as there are features fed into the network. The output layer contains five neurons with the SoftMax activation function, one for each of the five attack classes. Each hidden layer contains between 5 and 15 neurons (5 <= x <= 15), a count that Keras Tuner determines during the model-building process.

A. Random Forest

For the Random Forest classifier, the highest accuracy of 76% was achieved when the top 15 features were considered. The ROC AUC score is 94%. A review of the classification report is as below.
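To make the metrics in such a classification report concrete, the following minimal sketch computes accuracy and per-class precision, recall, and F1 score from true and predicted labels. The labels below are made up for illustration and the helper names `accuracy` and `per_class_report` are hypothetical; none of this is the project's data or its reported results.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_report(y_true, y_pred, classes):
    """Precision, recall, and F1 for each class, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels (illustrative only).
y_true = ["normal", "normal", "normal", "dos", "dos", "probe", "probe", "r2l"]
y_pred = ["normal", "normal", "dos", "dos", "dos", "probe", "normal", "normal"]

report = per_class_report(y_true, y_pred, ["normal", "dos", "probe", "r2l"])
print("accuracy:", accuracy(y_true, y_pred))
for name, scores in report.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Reporting the metrics per class matters for intrusion data because a rare class can be missed entirely (as the toy `r2l` class is here) while overall accuracy still looks moderate.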
https://www.tek-tools.com/security/what-is-an-intrusion-detection-system-ids
[6] Gerry Saporito (2019, Sep 17). A Deeper Dive into the NSL-KDD Data Set.
https://towardsdatascience.com/a-deeper-dive-into-the-nsl-kdd-data-set-15c753364657
https://www.dnsstuff.com/network-intrusion-detection-software.