NETWORK INTRUSION DETECTION SYSTEM
A PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE
OF
BACHELOR OF TECHNOLOGY
IN
MATHEMATICS AND COMPUTING
Submitted by:
Harshit Gandhi (2K20/MC/054)
Ishan Pathak (2K20/MC/062)
Jaskaran Singh Sahota (2K20/MC/064)
CANDIDATE’S DECLARATION
We (Harshit Gandhi, 2K20/MC/054; Ishan Pathak, 2K20/MC/062; and Jaskaran Singh Sahota,
2K20/MC/064) of B.Tech (Mathematics and Computing) hereby declare that the project
dissertation titled “Network Intrusion Detection System”, which is submitted by us to the
Department of Applied Mathematics, Delhi Technological University, Delhi in partial fulfillment
of the requirement for the award of the degree of Bachelor of Technology, is original and not
copied from any source without proper citation. This work has not previously formed the basis
for the award of any Degree, Diploma, Associateship, Fellowship or other similar title or
recognition.
CERTIFICATE
ABSTRACT
This report describes the use of intrusion detection models that build on the strengths of the
Random Forest algorithm. PCA is incorporated to improve model efficacy and feature selection.
The study concludes with an examination of model predictions that makes the variables
affecting each result easier to understand. It also highlights how important visualization tools
are for communicating complex ideas and for building a better understanding of the complex
relationships found in the data.
ACKNOWLEDGEMENT
We are extremely grateful to our project guide, Prof. Sumedha Seniaray, Assistant Professor,
Department of Applied Mathematics, Delhi Technological University, Delhi for providing
invaluable guidance and being a constant source of inspiration throughout our research. We will
always be grateful to her for her extensive support and encouragement.
We are extremely grateful to all the panel members who evaluated our progress, guided us
throughout our project, and gave us constant support and motivation, innovative ideas, and all
the information that we needed to pursue this project.
CONTENTS
CHAPTER 1 INTRODUCTION
1.1 Overview
1.1.1 Intrusion detection
1.1.2 Attacks - Why are these a problem?
1.1.3 Why do we need IDS?
1.2 Problem Formulation
1.3 Objectives
1.4 Motivation
CHAPTER 2 THEORY
2.1 Original Vs Improved CIC IDS 2017 DATASET
2.2 Dimensionality Reduction Analysis
2.2.1 2D-PCA visualization:
2.2.2 2D-TSNE visualization:
2.2.2.1 A BRIEF COMPARISON BETWEEN 2D TSNE VS 2D PCA
2.3 Machine Learning and Its Techniques
2.3.1 Random Forest Technique
2.3.2 Principal Component Analysis
CHAPTER 3 RELATED WORK
3.1 Some remarkable work
3.2 Comparison of related work
3.2.1 Succinct overview of the main ideas covered in the research paper
3.2.2 Here is a quick explanation of the research paper's important points
3.2.3 Here are the main points of the research paper
3.2.4. The following are the main points of the research paper
3.3 Limitations
3.3.1 2D PCA Limitations
3.3.2 2D t-SNE Limitations
3.3.3 Principal Component Analysis
CHAPTER 4 PROPOSED WORK
4.1 Tools and Technologies
4.1.1 Languages
4.1.2 Libraries
4.1.3 IDEs
4.2 Methodology
CHAPTER 5 RESULTS AND CONCLUSION
5.1 PCA intrusion detector
5.2 Conclusion
CHAPTER 6 REFERENCES
LIST OF TABLES
● Table 1: A brief comparison of 2D PCA vs 2D t-SNE
● Table 2: Existing research work and their analysis
● Table 3: Cumulative result of PCA Intrusion Detection System
LIST OF FIGURES
● Figure 1: IDS and its types
● Figure 2: 2D PCA Visualization
● Figure 3: 2D t-SNE Visualization
● Figure 4:
● Figure 5:
● Figure 6:
● Figure 7:
● Figure 8:
● Figure 9:
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 Overview
An intrusion detection system (IDS) monitors and examines network and system activity for
signs of malicious activity or violations of security procedures, which makes it a very
important part of cybersecurity. Its main objective is to address and mitigate potential
security threats and so improve the overall security posture of a system or network. It works
through the following stages:
1. Monitoring: By examining logs, events, and data packets, IDS continuously monitors
traffic and system activity.
2. Analysis: After the network data is collected, it is compared against the system's stored
signatures of known threats or attacks and against its established baselines.
3. Alerting: When it detects anomalous behavior, the intrusion detection system (IDS)
notifies users. The system supervisors or security personnel are notified about potential
security incidents by these notifications.
4. Reaction: IP addresses may be automatically banned by intrusion detection systems
(IDS), depending on how they are configured.
Classifications:
1. Signature-Based IDS: This kind compares observed patterns against a database of attack
signatures. It works well against known threats, but it may be weaker against novel or
more advanced ones.
2. Anomaly-Based IDS: This type of intrusion detection works by creating a baseline of
typical behavior and then marking any departures from it as possibly malicious. It can
catch previously unknown threats, but it can also produce false positives.
3. Heuristic-Based IDS: This method blends aspects of anomaly and signature detection. It
makes use of rule-based algorithms to spot known attack patterns and departures from
typical behavior.
4. Network-Based IDS (NIDS): Tracks and examines packets as they travel across the
network. It works well for identifying outside threats.
5. Host-Based IDS (HIDS): Concentrates on specific devices or hosts, keeping an eye on
their actions. It is especially helpful in identifying insider threats and attacks.
6. Distributed IDS (DIDS): An interconnected system of IDS sensors that collaborate to
offer a thorough overview of the whole network, augmenting detection capability.
Together, these classes of IDS strengthen digital defenses and support a proactive approach
to cybersecurity threats.
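The difference between signature-based and anomaly-based detection can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the signature strings, the requests-per-second baseline, and the three-standard-deviation threshold are invented and are not taken from the project code.

```python
from statistics import mean, stdev

# Hypothetical signature database: payload fragments of known attacks.
KNOWN_SIGNATURES = {"' OR 1=1 --", "../../etc/passwd"}


def signature_alert(payload: str) -> bool:
    """Signature-based check: flag a payload that contains a known pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: list, threshold: float = 3.0) -> bool:
    """Anomaly-based check: flag a value far away from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > threshold * sigma


# Made-up baseline of typical requests per second.
baseline_rps = [48, 52, 50, 47, 53, 49, 51, 50]

known_attack = signature_alert("GET /index.php?id=' OR 1=1 --")  # matches a signature
novel_flood = anomaly_alert(400, baseline_rps)                   # far above the baseline
normal_load = anomaly_alert(51, baseline_rps)                    # within the baseline
```

The signature check can only recognize patterns that are already in its database, while the anomaly check can flag unseen behavior at the cost of possible false positives when legitimate traffic drifts away from the baseline.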
1.1.2 Attacks - Why are these a problem?
Attackers seek to undermine the efficacy of intrusion detection systems (IDS) and avoid
detection. In evasion attacks, crafted data is used to trick the IDS, while in a Distributed Denial
of Service (DDoS) attack the IDS is flooded with traffic. Attackers also target the IDS's
internal vulnerabilities, trying to disable it completely or to take advantage of flaws in its
algorithms. Techniques like polymorphic malware, which alters its form to evade signature-based
detection, show how attacks keep changing. Because attackers use adaptive and creative
methods to carry out harmful attacks, IDS must be continuously strengthened against new threats.
Types:
● Benign: Non-malicious activities or harmless events within a system.
● DoS Hulk: A Denial of Service attack that floods a web server with a large volume of
unique HTTP requests to overwhelm and disable it.
● DDoS: Distributed Denial of Service attack, where multiple compromised systems flood
a target with traffic to disrupt or halt its services.
● PortScan: Systematic probing of a computer's ports to discover vulnerabilities.
● DoS GoldenEye: A variant of DoS attack aiming to render a target's resources
unavailable.
● FTP-Patator: An attack using brute-force techniques to crack FTP server credentials.
● Slowloris: A DoS attack that keeps numerous connections to a target web server open,
preventing it from serving legitimate requests.
● DoS Slowhttptest: Exploits slow HTTP POST vulnerabilities, consuming server
resources to disrupt services.
● SSH-Patator: A brute-force attack targeting SSH servers to gain unauthorized access.
● Bot: Malicious software that allows an attacker to control a compromised computer
remotely.
● Web Attack: A generic term for various attacks targeting web applications or services.
● Infiltration: Unauthorised entry into a system with the intent of compromising security.
● Heartbleed: A security vulnerability in OpenSSL that allows attackers to read sensitive
data from a server's memory.
corporations face the risk of legal issues and damage to their reputations. Because ransomware
attacks can interrupt critical systems and hold valuable data hostage, these risks are
exacerbated by the stealthy nature of malware. As technology drives changes in cyber threat
strategies, both individuals and organizations need to strengthen their defenses against online
intruders.
Intrusion Detection Systems (IDS) stand out as steadfast defenders of digital systems in the
face of the growing threats of malware attacks and data breaches. They use a variety of
strategies to monitor network activity, spot irregularities, prevent intrusions, and protect
data. Signature-based detection spots recognized patterns of malicious behavior, whereas
anomaly-based detection closely examines departures from accepted standards. Heuristic-based
detection adds a further layer by identifying new dangers through behavioral analysis. This
proactive approach reduces the negative effects of data breaches by enabling early threat
identification and fast reactions. The methods used by these systems also advance with
technology, strengthening digital defenses and helping to protect the availability, integrity,
and confidentiality of private and business data in a dynamic cyber environment.
1.3 Objectives
This project has two goals. First, it uses machine learning techniques to build a strong
intrusion detection system (IDS). Second, it applies dimensionality reduction techniques to
increase the system's efficacy against intrusions and cyberattacks. The aim is to develop an
IDS that can learn continuously, respond to newly emerging threats, and recognize patterns of
malicious conduct through the use of machine learning algorithms. This proactive defense
strategy is intended to minimize the impact of any breach and to cut down response times.
In addition, the report uses dimensionality reduction techniques to examine the large datasets
associated with cyber threats. By simplifying the datasets while retaining important
information, the goal is to increase the accuracy and speed with which potential intrusions are
detected and handled. The intended result is a defense mechanism that keeps pace with the
constantly changing nature of cybersecurity threats while strengthening the digital
infrastructure against them.
1.4 Motivation
This project's primary concern is the urgent need to protect digital assets from the increasing
threat of intrusions and attacks. Given the increasing number of threats and breaches that
expose personal data and confidential corporate files, strengthening the security of
interconnected networks is a driving force. Observing the severe effects of cyber threats on
financial stability, organizational integrity, and individual privacy motivates us to continue
developing cutting-edge solutions. The dynamic nature of malware threats is another source of
inspiration, leading us to investigate efficient techniques such as dimensionality reduction
combined with machine learning. Beyond the technical developments, we hope to contribute
steadily to the future of cybersecurity: not only responding to current threats but also
building a strong defense system that can withstand the relentless evolution of digital threats.
In essence, our motivation lies in creating a safer digital landscape for individuals and
organizations alike.
CHAPTER 2: THEORY
The CIC IDS 2017 dataset is an industry-standard benchmark for assessing the usefulness of
intrusion detection systems. It was introduced by the Canadian Institute for Cybersecurity to
help evaluate the effectiveness of intrusion detection systems.
The Improved CIC IDS 2017 dataset is a revision of the CIC IDS 2017 dataset, a crucial
resource in the field of cybersecurity. CIC-IDS-2017 and CSE-CIC-IDS-2018 have many errors
throughout the dataset creation lifecycle, such as in attack orchestration, feature generation,
documentation, and labeling. The original dataset is still regarded as an important reference
point, but its flaws motivated a detailed redesign to meet the changing needs of the
environment in which the data is used. The Improved CIC IDS 2017 dataset is a refined and
enhanced version that addresses these built-in limitations. Its distinguishing features are
careful data preprocessing for noise reduction, detailed feature engineering that supports
finer-grained analysis, and the addition of realistic background traffic.
These enhancements aim to create a more challenging and realistic environment for evaluating
IDS performance. The improved dataset captures the complexities of contemporary cyber
threats, offers a more accurate representation of normal and malicious activities, and reflects
the sophistication of modern attack vectors, giving researchers and practitioners a more robust
platform to assess and advance IDS capabilities. As the cybersecurity landscape continues to
evolve, the Improved CIC IDS 2017 dataset stands as a strong alternative for innovation and
research in the field of intrusion detection.
● Feature Engineering: The improved variant allows for sophisticated feature engineering,
which helps provide a more sensitive analysis of network traffic and therefore enhances
the dataset's overall quality.
● Realistic Background Traffic: Unlike the original dataset, the Improved CIC IDS 2017
dataset introduces realistic background traffic, offering a more challenging environment
for Intrusion Detection System evaluations.
● Covariance Matrix Calculation: Calculate the standardized data's covariance matrix to
see how various features relate to one another and depend on one another.
● Eigenvalue and Eigenvector Computation: Find the covariance matrix's eigenvalues and
matching eigenvectors. The directions of the data's largest variance are represented by
these eigenvectors.
● Ranking Components: The 2D subspace is formed by ranking the eigenvectors according
to their corresponding eigenvalues and choosing the top two.
● Projection: Project the original data into the specified 2D subspace that the selected
eigenvectors produce.
● 2D-PCA Visualization: The resulting 2D projection captures the important variance and
offers a reduced dimensionality visual representation of the data.
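The steps listed above can be sketched directly with NumPy. This is a minimal illustration on synthetic data, assuming a random five-feature matrix as a stand-in for the real flow features; the variable names are chosen for the example and do not come from the project code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # synthetic stand-in for flow features
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]      # make two features strongly correlated

# Standardization: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data (features are columns).
cov = np.cov(X_std, rowvar=False)

# Eigenvalues and eigenvectors; eigh is appropriate because cov is symmetric.
eigvals, eigvecs = np.linalg.eigh(cov)

# Ranking: sort by decreasing eigenvalue and keep the top two eigenvectors.
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]

# Projection of the data onto the 2D subspace.
Z = X_std @ top2
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is why ranking by eigenvalue selects the directions of largest variance.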
Figure 2: 2D PCA Visualization
Working:
Figure 3: 2D t-SNE Visualization
t-SNE is particularly adept at revealing clusters and patterns in data, making it a valuable tool for
exploratory data analysis and understanding the underlying structures in complex datasets.
Table 1: A brief comparison of 2D PCA vs 2D t-SNE
Preservation of distance
  2D t-SNE: Focuses on preserving local relationships; effective for clustering.
  2D PCA: Preserves global variance, emphasizing overall structure.
continues to grow in complexity, machine learning remains a driving force in extracting
meaningful insights and facilitating intelligent decision-making.
CHAPTER 3: RELATED WORK
Table 2: Existing research work and their analysis
Results: [...] learning algorithms employed in the study. The results show that the decision
tree classifier model built with the extracted features shows better accuracy [...]
Limitations: [...] generalizable to other datasets. Additionally, the study only focuses on
three machine-learning algorithms and does not explore other potential techniques that could
be used for intrusion detection. Finally, the study does not address the issue of false
positives and false negatives, which are important considerations in intrusion detection.
Results: [...] a false positive rate of 1.01%.
Limitations: [...] available intrusion datasets
The research papers on "Intrusion Detection using Machine Learning" addressed various
methodologies and techniques to enhance the efficiency and accuracy of intrusion detection
systems (IDS). Here's a summary of their contributions and limitations.
3.2.1 Below is a succinct overview of the main ideas covered in the research paper
The study suggests an intrusion detection system (IDS) that can identify various cyberattacks by
utilizing machine learning methods. The objective is to utilize feature selection on the NSL-KDD
dataset to construct an appropriate IDS model.
The main steps are:
● Data Preprocessing: Clean and preprocess data to handle missing values, categorical
variables, etc.
● Feature Selection: Select the most relevant features using ANOVA F-test and recursive
feature elimination (RFE). This improved model accuracy.
● Modeling: Build models using Decision Tree, Random Forest, and SVM machine
learning algorithms on the selected features.
● Evaluation: Test models on unseen data to predict attack types. Evaluate performance
using accuracy, precision, recall, etc.
Key Results:
● Feature selection increased model accuracy over using all features
● Random Forest algorithm performed best overall with 87-98% accuracy across different
attack types
● Comparative evaluation is done between models to determine the most suitable ML
algorithm
In summary, the paper demonstrates how to successfully apply machine learning for network
intrusion detection by carefully selecting features and algorithms. Performance is quantified
through rigorous evaluation of multiple models.
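The workflow summarized above (ANOVA F-test, recursive feature elimination, then a classifier evaluated on unseen data) could be sketched with scikit-learn as follows. This is not the reviewed paper's code: the synthetic data from `make_classification` stands in for NSL-KDD, and the feature counts (15 and 8) are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a preprocessed intrusion dataset (assumption).
X, y = make_classification(n_samples=600, n_features=30, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 1: univariate filter using the ANOVA F-test.
anova = SelectKBest(score_func=f_classif, k=15).fit(X_train, y_train)
X_train_f, X_test_f = anova.transform(X_train), anova.transform(X_test)

# Step 2: recursive feature elimination wrapped around a decision tree.
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=8).fit(X_train_f, y_train)
X_train_sel, X_test_sel = rfe.transform(X_train_f), rfe.transform(X_test_f)

# Step 3: train on the selected features and evaluate on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_sel, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_sel))
print(f"accuracy on held-out data: {accuracy:.3f}")
```

Fitting both selectors on the training split only keeps information from the test split out of the feature selection step.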
The main goals are to:
● Review publicly available labeled intrusion datasets - Analyze 23 datasets in terms of
data source, traffic type, features, anomalies, etc., identify drawbacks of popular
datasets like KDDCup and NSL-KDD, and suggest UNSW-NB15 and BoT-IoT as more
up-to-date alternatives.
● Discuss machine learning techniques for IDS - Analyze 26 ML algorithms to understand
their characteristics, uses, and limitations in intrusion detection, including techniques like
decision trees, random forests, neural networks, and support vector machines.
● Survey recent IDS models - Review 23 research papers on ML-based IDS models for
traditional and advanced networks like cloud and IoT, and discuss challenges, solutions,
outcomes, and future work.
● Identify IDS problems - Highlight issues like high false alarms, lack of updated datasets,
data imbalance, and model complexity that impact IDS performance.
To help with the design of efficient intrusion detection systems, the paper offers helpful insights
into intrusion detection data and methods. It also points out areas that still need investigation to
effectively handle modern threats and monitor quickly changing network data in real time.
To summarize, it is an intrusion detection system that uses machine learning and log correlation
to effectively identify cyber-attacks while adapting to new threats.
3.2.4. The following are the main points of the research paper
● They compare three feature extraction approaches (PCA, autoencoder, and LDA) for
lowering the dimensionality of three NIDS datasets: UNSW-NB15, ToN-IoT, and
CSE-CIC-IDS2018. Lower-dimensional inputs help machine learning models work effectively.
● They run these reduced datasets through six machine learning models, three of which are
deep learning (DFF, CNN, and RNN) and three of which are shallow learning (logistic
regression, decision tree, and naive Bayes).
● They compare the results of various dimensionality reductions, datasets, and models.
The purpose is to see if any strategies have good generalization across datasets.
● No single feature extraction and machine learning combination outperforms the others
across all datasets. This emphasizes the importance of dataset selection.
● They recommend defining a standardized universal feature set for NIDS to enable better
comparison of machine learning techniques across research papers.
● Analysis of the variance retained by PCA shows that most variance is captured by the first
10 dimensions, so higher dimensions provide diminishing returns.
In essence, their main contribution is the systematic testing of various combinations of ML
models and feature extraction approaches on various datasets. This elucidates generalization
capabilities and the requirement for uniform datasets.
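The observation about diminishing returns from additional PCA dimensions can be checked with a cumulative explained-variance curve. The sketch below uses synthetic data in which 40 observed features are generated from 5 hidden factors; that setup and the 95% cut-off are assumptions made for the illustration, not values from the reviewed paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))      # 5 hidden factors (assumption)
mixing = rng.normal(size=(5, 40))       # spread them over 40 observed features
X = latent @ mixing + 0.05 * rng.normal(size=(500, 40))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"components needed for 95% of the variance: {n_for_95}")
```

When the curve flattens after a handful of components, keeping more dimensions adds computation but little extra information.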
3.3 Limitations
The challenge of dealing with imbalanced datasets, the requirement for large amounts of labeled
data for supervised learning techniques, and the possible vulnerability of ML-based IDS to
adversarial attacks are all drawbacks shared by these studies. Moreover, the interpretability of
complex models remains a problem, especially in critical systems where decision-making
depends on knowing the logic underlying alerts. Resolving these issues will become
increasingly important as the field grows, in order to sustain the development and use of
reliable and efficient machine learning-based intrusion detection systems.
● Sensitivity to Outliers: PCA is sensitive to outliers, which can skew the principal
components. In IDS, outliers may represent significant intrusion attempts, and PCA's
vulnerability to outliers might affect its ability to highlight these instances effectively.
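The sensitivity to outliers can be demonstrated on a small synthetic example: a single extreme point is enough to rotate the first principal component. The data and the position of the outlier are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Variance lies almost entirely along the first axis.
X = np.column_stack([rng.normal(size=300), 0.1 * rng.normal(size=300)])

pc_clean = PCA(n_components=1).fit(X).components_[0]

# Add one extreme point along the second axis.
X_out = np.vstack([X, [[0.0, 50.0]]])
pc_outlier = PCA(n_components=1).fit(X_out).components_[0]

# Absolute cosine between the two directions (1 = same direction, 0 = orthogonal).
cosine = abs(float(pc_clean @ pc_outlier))
print(pc_clean, pc_outlier, cosine)
```

Without the outlier the first component follows the first axis; with it, the component turns toward the outlier, so the projection describes that single point rather than the bulk of the traffic.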
4.1 Tools and Technologies
4.1.1 Languages
● For the implementation of this project, we used the Python programming language. Python
offers an extensive range of easy-to-use machine learning techniques and libraries.
● We used images of various alphabets and numbers as input and found out that the most feasible
library for image processing operations is OpenCV. OpenCV library can be used with Python’s
numpy module for image processing tasks.
● Python is easy to learn and has several useful modules for exploratory data analysis making the
task of loading and preprocessing data very simple.
4.1.2 Libraries
● SKlearn - Scikit-learn, commonly referred to as sklearn, is a popular machine-learning library in
Python that provides simple and efficient tools for data analysis and modeling.
● OpenCV - OpenCV is a widely used cross-platform library for real-time computer vision
tasks. The major applications of OpenCV are image processing and classification, object
detection, and video capturing.
Some features of openCV include:
1. Reading and writing of images
2. Capturing and saving videos
3. Performing feature detection
4. Processing of images
5. Object detection
6. Handles videos efficiently - estimates the motion in videos, reduces the unwanted
background, and traces the objects in it.
● os - os is a Python standard-library module that provides access to operating system
functionality, such as file paths and directories. We used os for loading and saving data.
● Numpy - Numpy is one of the most widely used Python libraries. It is used to
perform computation tasks and has support for handling numerical computations for
multidimensional arrays and matrices. This library was used for performing operations on
images for preprocessing tasks and manipulating and preparing image data for machine learning
models.
● Pandas - Pandas is a powerful Python library that is used for data manipulation and
analysis. It is widely used for the preparation of data for machine learning models. We used
pandas as it makes data analysis and understanding very simple.
● Seaborn - Seaborn is a data visualization library in Python. It is used for statistical analysis of
data through graphs and charts of several kinds. We used Seaborn for its excellent graphical
analysis tools and also for visualization of image data.
● Matplotlib - Matplotlib is a Python tool that can be utilized for creating interactive
visualizations. Its extensive features make it an attractive choice when it comes to graphical
analysis of models and data. We have used Matplotlib for preprocessing tasks and to visualize
results.
4.1.3 IDEs
For writing and executing the Python code we utilized the following IDEs:
● Jupyter Notebook- Jupyter Notebook is a widely used open-source IDE. It is a web-based
application that provides facilities for editing and executing code within the web browser. It has
a user-friendly interface and files can be easily uploaded and downloaded in multiple formats.
● Google Colaboratory - As this was a group project, we needed to share our work and work in
coordination. Hence, we utilized Google Colaboratory IDE which allows users to combine
executable code and work together virtually. It uses a Jupyter notebook environment and has
options for using GPU and TPU for fast processing.
4.2 Methodology
4.2.1 IMPLEMENTING LIBRARIES
CIC IDS 2017 (Original): This part of the code loads and concatenates the multiple CSV files
that make up the Canadian Institute for Cybersecurity Intrusion Detection System (CIC IDS)
2017 dataset.
1. dataset_csv_path: Specifies the directory path where the CSV files for the
CIC IDS 2017 dataset are located.
2. csv_file_names: A list containing the names of the specific CSV files that
correspond to different scenarios or types of network traffic in the CIC
IDS 2017 dataset.
3. complete_paths: An empty list is created, and then, for each CSV file
name, the code appends the complete file path by joining the
dataset_csv_path and csv_file_name using os.path.join.
● pd.concat: Concatenates multiple DataFrames along a particular axis. Here it combines the
DataFrames created by reading each CSV file with pd.read_csv.
● map(pd.read_csv, complete_paths): Applies pd.read_csv to each file path in the
complete_paths list, resulting in a list of DataFrames.
● ignore_index=True: Resets the index of the resulting DataFrame.
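The loading step described above could look roughly like the following. The variable names `dataset_csv_path`, `csv_file_names`, and `complete_paths` follow the description; the file names, column names, and the temporary directory are placeholders so that the sketch runs on its own, and the real dataset directory and file list would replace them.

```python
import os
import tempfile

import pandas as pd

# Assumption: a temporary folder with two tiny CSV files stands in for the
# real CIC IDS 2017 directory.
dataset_csv_path = tempfile.mkdtemp()
csv_file_names = ["day1.csv", "day2.csv"]  # hypothetical file names

pd.DataFrame({"Flow Duration": [10, 20], "Label": ["BENIGN", "DDoS"]}).to_csv(
    os.path.join(dataset_csv_path, csv_file_names[0]), index=False)
pd.DataFrame({"Flow Duration": [30], "Label": ["PortScan"]}).to_csv(
    os.path.join(dataset_csv_path, csv_file_names[1]), index=False)

# Build the complete path of every CSV file.
complete_paths = []
for csv_file_name in csv_file_names:
    complete_paths.append(os.path.join(dataset_csv_path, csv_file_name))

# Read each file and stack the resulting DataFrames into one,
# renumbering the rows from 0.
df = pd.concat(map(pd.read_csv, complete_paths), ignore_index=True)
print(df.shape)
```

Without `ignore_index=True`, each file's own row numbers would be kept and the combined index would contain duplicates.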
CIC IDS 2017 (Improved): This code segment performs data cleaning and
preprocessing on the Improved CICIDS 2017 dataset.
1. dropping_cols: A list of column names to be dropped from the DataFrame. These
columns seem to include identifiers and timestamp information.
2. clean_df: A function that presumably performs additional cleaning operations on the
DataFrame. This function might handle tasks like handling missing values, data type
conversions, or other specific cleaning operations.
3. drop: Removes the specified columns (dropping_cols) from the DataFrame
(improved_df) along the specified axis (columns). The inplace=True parameter modifies
the DataFrame in place.
4. Prints the counts of unique values in the 'Label' column, providing insights into the
distribution of different labels in the dataset.
● Prints the shape (number of rows and columns) of the DataFrame after dropping selected
columns.
● Subsequent lines indicate the removal of zero variance columns, dropping rows with NaN
values, dropping duplicate rows, and removing columns with identical values.
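A possible shape of this cleaning step is sketched below on a toy DataFrame. The names `dropping_cols`, `clean_df`, and `improved_df` come from the description; the body of `clean_df`, the column names, and the data are assumptions, since the original function is not shown.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Improved CIC IDS 2017 DataFrame (assumed columns).
improved_df = pd.DataFrame({
    "Flow ID": ["a", "b", "c", "c", "d"],
    "Timestamp": ["t1", "t2", "t3", "t3", "t4"],
    "Flow Duration": [10.0, 20.0, 30.0, 30.0, np.nan],
    "Fwd Packets": [1, 2, 3, 3, 4],
    "Fwd Packets Copy": [1, 2, 3, 3, 4],
    "Constant": [0, 0, 0, 0, 0],
    "Label": ["BENIGN", "DDoS", "BENIGN", "BENIGN", "DDoS"],
})
dropping_cols = ["Flow ID", "Timestamp"]  # identifier and timestamp columns


def clean_df(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaner covering the four operations listed above."""
    df = df.loc[:, df.nunique(dropna=False) > 1]  # remove zero-variance columns
    df = df.dropna()                              # drop rows with NaN values
    df = df.drop_duplicates()                     # drop duplicate rows
    keep = []                                     # remove identical columns
    for col in df.columns:
        if not any(df[col].equals(df[kept]) for kept in keep):
            keep.append(col)
    return df[keep].reset_index(drop=True)


improved_df.drop(dropping_cols, axis=1, inplace=True)
print(improved_df["Label"].value_counts())  # label distribution
print(improved_df.shape)                    # shape after dropping columns
cleaned = clean_df(improved_df)
print(cleaned.shape)
```

Printing the label counts before and after cleaning is a simple way to check that removing duplicates and incomplete rows has not wiped out a rare attack class.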
The code segment performs dimensionality reduction using Principal Component Analysis
(PCA) on a subsampled CICIDS 2017 dataset and visualizes the results using scatterplots. Let's
break down the code step by step:
Subsampling the DataFrame
● groupby('Label').apply(...): Groups the DataFrame by the 'Label' column and applies the
sample function to each group, sampling 10% of the data from each group.
● reset_index(drop=True): Resets the index of the resulting DataFrame.
Performing PCA:
● PCA: Initializes a PCA object with 2 components and fits it to the feature data (X). The
transformed data is stored in z.
● Creates a scatterplot for binary classification, distinguishing between 'BENIGN' and
'ATTACK' classes.
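The subsampling and PCA steps can be sketched as follows on synthetic data. Two points are assumptions of this sketch rather than facts about the project code: it uses `GroupBy.sample`, which draws the same per-label 10% sample as the `groupby('Label').apply(...)` pattern described above, and it standardizes the features before PCA. The feature names and labels are invented.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
df["Label"] = rng.choice(["BENIGN", "DDoS", "PortScan"], size=n, p=[0.7, 0.2, 0.1])

# Sample 10% of the rows within each label, then renumber the rows.
subsample_df = (df.groupby("Label")
                  .sample(frac=0.1, random_state=0)
                  .reset_index(drop=True))

# Features X and labels y.
X = StandardScaler().fit_transform(subsample_df.drop(columns="Label"))
y = subsample_df["Label"]

# Fit PCA with 2 components; z holds the transformed data.
pca = PCA(n_components=2)
z = pca.fit_transform(X)

# DataFrame for plotting, with a binary BENIGN/ATTACK column.
pca_df = pd.DataFrame({"pc1": z[:, 0], "pc2": z[:, 1], "label": y})
pca_df["binary"] = np.where(pca_df["label"] == "BENIGN", "BENIGN", "ATTACK")
```

`pca_df` can then be passed to a plotting call such as `seaborn.scatterplot(data=pca_df, x="pc1", y="pc2", hue="binary")` to obtain the binary scatterplot.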
4.2.3 2D-TSNE Visualization:
[IMPROVED CIC IDS 2017 IS USED]
This code segment performs dimensionality reduction using t-distributed Stochastic Neighbor
Embedding (t-SNE) on a subsampled CICIDS2017 dataset and visualizes the results using
scatterplots.
1. Defining Features and Labels
2. Performing t-SNE
3. Creating DataFrames for Visualization (15 Classes)
4. Visualizing t-SNE Projection (15 Classes)
5. Creating DataFrames for Binary Classification Visualization
6. Visualizing t-SNE Projection (Binary Classes)
In summary, this code segment uses t-SNE to reduce the dimensionality of a subsampled
CICIDS2017 dataset and visualizes the data in the first two t-SNE components for both the
original 15 classes and a binary classification scenario. The scatterplots provide insights into the
distribution and separability of the data points in the reduced-dimensional space.
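The six steps listed above can be sketched with scikit-learn's `TSNE`. The two synthetic clusters, the labels, and the perplexity value are assumptions for illustration; the project's own settings are not given in the text.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Defining features and labels: two well-separated synthetic groups.
X = np.vstack([rng.normal(0, 1, size=(40, 8)),
               rng.normal(6, 1, size=(40, 8))])
y = np.array(["BENIGN"] * 40 + ["DDoS"] * 40)

# Performing t-SNE; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
z = tsne.fit_transform(X)

# DataFrame for the multi-class plot, plus a binary BENIGN/ATTACK column.
tsne_df = pd.DataFrame({"tsne1": z[:, 0], "tsne2": z[:, 1], "label": y})
tsne_df["binary"] = np.where(tsne_df["label"] == "BENIGN", "BENIGN", "ATTACK")
```

Unlike PCA, t-SNE has no `transform` method for unseen points, so the embedding is computed once on the subsample and used only for visualization.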
CHAPTER 5: RESULTS AND CONCLUSION
Figure 4
Figure 5
Figure 6
Figure 9
RESULT
Attacks | au_precision_recall | auroc | f1 score | precision | recall
All attacks | 0.9092582665 | 0.9216302073 | 0.8537748773 | 0.8096094692 | 0.903036891
DoS Slowloris and Slowhttptest | 0.1842091476 | 0.9668418907 | 0.09437501861 | 0.04952702421 | 0.9989493591
FTP-Patator | 0.02374412187 | 0.8418531858 | 0.06878954889 | 0.03562217242 | 0.9982227488
SSH-Patator | 0.007302431664 | 0.6084999388 | 0.0005758280194 | 0.0002958547463 | 0.01072705602
DoS Slowloris | 0.1623154495 | 0.9729041334 | 0.06704024869 | 0.03468342644 | 0.9993902439
DoS Slowhttptest | 0.04441064695 | 0.9533975659 | 0.03134123942 | 0.01592061266 | 0.9979716024
DoS Hulk | 0.7609364448 | 0.9245216066 | 0.7147619319 | 0.5793380732 | 0.9328126624
DoS GoldenEye | 0.3998278721 | 0.9161204869 | 0.107248062 | 0.05718832672 | 0.8603855721
HeartBleed | 0.02090841161 | 0.9995213072 | 0.0001972559506 | 9.86E-05 | 1
Web Attack - Brute Force | 0.0004619218491 | 0.8449767742 | 0.001357297664 | 0.0006791097091 | 1
Infiltration | 0.00507130006 | 0.9743308539 | 0.0006791097091 | 0.0003396701912 | 1
Infiltration - Portscan | 0.6525348331 | 0.9484159677 | 0.563296276 | 0.3954890606 | 0.9784597226
Web Attack - XSS | 0.0001864256554 | 0.9087524706 | 0.00032871669 | 0.0001643853631 | 1
Web Attack - SQL Injection | 4.08E-05 | 0.7088096257 | 4.38E-05 | 2.19E-05 | 0.1818181818
Botnet | 0.004678904774 | 0.7637311257 | 0.005471599792 | 0.002754519817 | 0.4025559105
Portscan | 0.664846762 | 0.8835710745 | 0.6485631584 | 0.5436109332 | 0.8037365206
DDoS | 0.8742312588 | 0.9723754681 | 0.6340985298 | 0.466879368 | 0.9879440604
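The five columns of the table correspond to standard scikit-learn metrics. The sketch below shows how such values can be computed from anomaly scores; the labels, scores, and the 0.5 threshold are made up, and treating `au_precision_recall` as average precision is an assumption, since the report does not state how the area under the precision-recall curve was computed.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Made-up ground truth (1 = attack) and detector scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.2, 0.4, 0.8, 0.9, 0.65])

# Threshold-based predictions (0.5 is an arbitrary choice).
y_pred = (scores >= 0.5).astype(int)

metrics = {
    # Threshold-free metrics use the raw scores.
    "au_precision_recall": average_precision_score(y_true, scores),
    "auroc": roc_auc_score(y_true, scores),
    # Threshold-dependent metrics use the binary predictions.
    "f1 score": f1_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because precision depends on the number of false alarms relative to true detections, a rare attack class can show recall close to 1 together with very low precision and F1, which is the pattern visible in several rows of the table.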
5.2 CONCLUSION
Through the use of Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour
Embedding (t-SNE) for dimensionality reduction, as well as PCA, Random Forest, and K-means
clustering for intrusion detection, the project aimed to improve intrusion detection through the
application of Machine Learning (ML) techniques. The PCA and t-SNE approaches reduced the
feature space and so made a more compact data representation possible. By using PCA, the
number of dimensions was significantly reduced while retaining pertinent information,
improving computational efficiency without sacrificing accuracy.
The results showed promising performance when PCA, Random Forest, and K-means clustering
were used for intrusion detection. Using the reduced PCA-transformed features, the
Random Forest algorithm showed resilience in categorizing network data and achieved a
noteworthy level of accuracy in identifying intrusion patterns. Furthermore, K-means clustering
shows potential for identifying anomalous activity by forming clusters that
distinguish between normal and intrusive network behavior.
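One way to combine these pieces is shown in the following minimal sketch, which trains a Random Forest on PCA-reduced features and, separately, clusters the same reduced features with K-means. The synthetic data, the five retained components, and the two clusters are assumptions for illustration and do not reproduce the project's configuration or results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for labeled network flows (assumption).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Supervised path: scale, reduce with PCA, classify with Random Forest.
rf_on_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
rf_on_pca.fit(X_train, y_train)
rf_accuracy = rf_on_pca.score(X_test, y_test)

# Unsupervised path: cluster the same PCA-reduced features with K-means.
reducer = make_pipeline(StandardScaler(), PCA(n_components=5)).fit(X_train)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(reducer.transform(X_train))
cluster_ids = kmeans.predict(reducer.transform(X_test))
print(f"Random Forest accuracy on PCA features: {rf_accuracy:.3f}")
```

Wrapping the scaler and PCA in a pipeline ensures that both are fitted on the training data only and then reused unchanged on new traffic.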
Finally, the use of machine learning techniques, including PCA and t-SNE for dimensionality
reduction, in conjunction with K-means clustering and Random Forest for intrusion detection,
demonstrated encouraging results in improving network security. PCA's feature space reduction
allowed for more effective computing without sacrificing detection accuracy. The potential for
real-time intrusion detection systems was demonstrated by the successful identification and
delineation of intrusive activity using the merging of Random Forest and K-means clustering. To
improve the system's resilience and adaptability in a variety of network contexts and intrusion
scenarios, more testing and algorithm optimization are advised. In summary, this study
highlights the effectiveness of machine learning techniques in supporting intrusion detection
systems, opening the door for enhanced cybersecurity protocols in contemporary network
architectures.