
NETWORK INTRUSION DETECTION SYSTEM

A PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE
OF

BACHELOR OF TECHNOLOGY
IN
MATHEMATICS AND COMPUTING

Submitted by:
Harshit Gandhi (2K20/MC/054)
Ishan Pathak (2K20/MC/062)
Jaskaran Singh Sahota (2K20/MC/064)

Under the supervision of


Prof. Sumedha Seniaray

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


DELHI TECHNOLOGICAL UNIVERSITY
(Formerly Delhi College of Engineering)
Bawana Road, Delhi-110042
Dec, 2023

CANDIDATE’S DECLARATION

We (Harshit Gandhi, 2K20/MC/054; Ishan Pathak, 2K20/MC/062; and Jaskaran Singh Sahota, 2K20/MC/064) of B.Tech (Mathematics and Computing) hereby declare that the project dissertation titled “Network Intrusion Detection System”, which is submitted by us to the Department of Applied Mathematics, Delhi Technological University, Delhi in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology, is original and not copied from any source without proper citation. This work has not previously formed the basis for the award of any degree, diploma, associateship, fellowship or other similar title or recognition.

Place: Delhi
Date: 16 Dec, 2023

Harshit Gandhi
Ishan Pathak
Jaskaran Singh Sahota

CERTIFICATE

ABSTRACT

This report presents a study of network-based intrusion detection, with an emphasis on techniques that strengthen cybersecurity. The Improved CIC IDS 2017 dataset was acquired for the study, and a thorough cleaning procedure was carried out to ensure the data's integrity. Feature extraction and visualization were then performed using dimensionality reduction approaches, namely Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). The study interprets complex patterns and anomalies, and covers both the original and improved versions of the CIC IDS 2017 dataset.

The work then turns to intrusion detection models, making use of the strengths of the Random Forest algorithm. PCA is incorporated to improve model efficacy and feature selection. The study concludes with an examination of model predictions, aimed at making the variables that affect each result easy to understand. Visualization tools are highlighted as important for communicating complex ideas and for building a better understanding of the relationships found in the data.

In summary, the work combines data preprocessing, dimensionality reduction, machine learning, and visualization techniques to advance network-based intrusion detection. The results inform the use of intrusion detection models and help make cyber attacks easier to understand, evaluate, and act on.

ACKNOWLEDGEMENT

We are extremely grateful to our project guide, Prof. Sumedha Seniaray, Assistant Professor, Department of Applied Mathematics, Delhi Technological University, Delhi, for providing invaluable guidance and being a constant source of inspiration throughout our research. We will always be grateful to her for her extensive support and encouragement.

We are extremely grateful to all the panel members who evaluated our progress, guided us
throughout our project, and gave us constant support and motivation, innovative ideas, and all
the information that we needed to pursue this project.

CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Overview
1.1.1 Intrusion detection
1.1.2 Attacks - Why are these a problem?
1.1.3 Why do we need IDS?
1.2 Problem Formulation
1.3 Objectives
1.4 Motivation
CHAPTER 2 THEORY
2.1 Original Vs Improved CIC IDS 2017 DATASET
2.2 Dimensionality Reduction Analysis
2.2.1 2D-PCA visualization:
2.2.2 2D-TSNE visualization:
2.2.2.1 A BRIEF COMPARISON BETWEEN 2D TSNE VS 2D PCA
2.3 Machine Learning and Its Techniques
2.3.1 Random Forest Technique
2.3.2 Principal Component Analysis
CHAPTER 3 RELATED WORK
3.1 Some remarkable work
3.2 Comparison of Existing Work
3.2.1 Succinct overview of the main ideas covered in the research paper
3.2.2 Here is a quick explanation of the research paper's important points
3.2.3 Here are the main points of the research paper
3.2.4 The following are the main points of the research paper
3.3 Limitations
3.3.1 2D PCA Limitations
3.3.2 2D t-SNE Limitations
3.3.3 Principal Component Analysis
CHAPTER 4 PROPOSED WORK
4.1 Tools and Terminologies
4.1.1 Languages
4.1.2 Libraries
4.1.3 IDEs
4.2 Methodologies
CHAPTER 5 RESULTS AND CONCLUSION
5.1 PCA intrusion detector
5.2 Conclusion
CHAPTER 6 REFERENCES

LIST OF TABLES

● Table 1: (A brief comparison of 2D PCA VS 2D TSNE).
● Table 2: (Existing research work and their analysis).
● Table 3: (Cumulative Result Of PCA Intrusion Detection System).

LIST OF FIGURES

● Figure 1: (IDS and its types).
● Figure 2: (2D PCA Visualization).
● Figure 3: (2D t-SNE Visualization).
● Figure 4:
● Figure 5:
● Figure 6:
● Figure 7:
● Figure 8:
● Figure 9:

LIST OF ABBREVIATIONS

● IDS: INTRUSION DETECTION SYSTEM

● DDoS: DISTRIBUTED DENIAL OF SERVICE

● HTTP: HYPERTEXT TRANSFER PROTOCOL

● SSH: SECURE SHELL

● PCA: PRINCIPAL COMPONENT ANALYSIS

● CIC: CANADIAN INSTITUTE FOR CYBERSECURITY

● CNN: CONVOLUTIONAL NEURAL NETWORK

● IDE: INTEGRATED DEVELOPMENT ENVIRONMENT

● TSNE: T-DISTRIBUTED STOCHASTIC NEIGHBOR EMBEDDING

CHAPTER 1: INTRODUCTION

1.1 Overview

1.1.1 Intrusion detection

An intrusion detection system (IDS) is designed to monitor and examine system and network activity for signs of malicious behavior and policy violations, and it is a very important part of cybersecurity. Its main objective is to identify potentially harmful security threats so that they can be addressed and mitigated, improving the overall security posture of a system or network.

How It Works:

1. Monitoring: By examining logs, events, and data packets, the IDS continuously monitors traffic and system activity.
2. Analysis: After the network data is collected, it is compared against stored signatures of known threats or attacks and against the system's established baseline of normal behavior.
3. Alerting: When it detects anomalous behavior, the IDS raises an alert. These notifications inform system supervisors or security personnel about potential security incidents.
4. Reaction: Depending on how it is configured, the IDS may automatically block offending IP addresses.
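The four stages above can be sketched in a few lines of Python. This is a toy illustration only, not the system built in this report: the `ToyIDS` class, the `SIGNATURES` table, and the sample events are all hypothetical, and real signature engines match on far richer packet and flow features than substrings.

```python
from dataclasses import dataclass, field

# Hypothetical signature database: substring -> threat name (illustrative only).
SIGNATURES = {
    "' OR '1'='1": "SQL injection attempt",
    "../../": "Path traversal attempt",
}


@dataclass
class ToyIDS:
    block_on_alert: bool = True  # the "Reaction" stage is optional and configurable
    alerts: list = field(default_factory=list)
    blocked_ips: set = field(default_factory=set)

    def analyze(self, payload):
        """Analysis: compare the payload against the known signatures."""
        for pattern, threat in SIGNATURES.items():
            if pattern in payload:
                return threat
        return None

    def process(self, event):
        """Monitoring entry point: handle one (source_ip, payload) event."""
        source_ip, payload = event
        threat = self.analyze(payload)
        if threat is None:
            return
        self.alerts.append((source_ip, threat))  # Alerting
        if self.block_on_alert:                  # Reaction
            self.blocked_ips.add(source_ip)


# Made-up events standing in for monitored traffic.
events = [
    ("10.0.0.5", "GET /index.html"),
    ("10.0.0.9", "GET /login?user=admin' OR '1'='1"),
    ("10.0.0.7", "GET /../../etc/passwd"),
]
ids = ToyIDS()
for event in events:
    ids.process(event)
print(ids.alerts)
print(sorted(ids.blocked_ips))
```

Keeping the reaction step behind a flag mirrors the point that automatic blocking depends on how the IDS is configured.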

Classifications:
1. Signature-Based IDS: This kind compares observed patterns against a database of attack signatures. It works well against known threats, but it may be weaker against novel or more advanced ones.
2. Anomaly-Based IDS: This type of intrusion detection works by creating a baseline of typical behavior and then marking any departures from it as possibly malicious. It can catch previously unknown threats, but it can also produce false positives.
3. Heuristic-Based IDS: This method blends aspects of anomaly and signature detection. It makes use of rule-based algorithms to spot known attack patterns and departures from typical behavior.
4. Network-Based IDS (NIDS): Tracks and examines packets as they travel across the network. It works well for identifying outside threats.
5. Host-Based IDS (HIDS): Concentrates on specific devices or hosts, keeping an eye on their actions. It is particularly helpful in identifying insider threats and attacks.
6. Distributed IDS (DIDS): An interconnected system of IDS sensors that collaborate to offer a thorough overview of the whole network, augmenting detection capability.
By combining these approaches, an IDS strengthens digital defences and supports a proactive stance towards cybersecurity threats.
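To make the anomaly-based idea concrete, the following minimal sketch learns a baseline from traffic assumed to be normal and flags large departures from it. The request counts and the threshold of three standard deviations are assumptions chosen for illustration; they are not taken from this report.

```python
from statistics import mean, pstdev


def fit_baseline(values):
    """Learn the baseline (mean and spread) from traffic assumed to be normal."""
    return mean(values), pstdev(values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Made-up requests-per-minute counts observed during normal operation.
normal_requests_per_minute = [52, 48, 50, 47, 53, 49, 51, 50]
baseline = fit_baseline(normal_requests_per_minute)
print(is_anomalous(51, baseline))   # close to the baseline
print(is_anomalous(400, baseline))  # far from the baseline
```

A tighter threshold catches more unusual traffic but also raises more false positives, which is the trade-off noted for anomaly-based detection.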

Figure 1. (IDS and its types)

1.1.2 Attacks - Why are these a problem?
Attackers seek to undermine the efficacy of intrusion detection systems (IDS) and avoid detection. In evasion attacks, crafted data is used to trick the IDS, while in a Distributed Denial of Service (DDoS) attack the aim is to flood it with traffic. Attackers also target the IDS's internal vulnerabilities, trying to disable it completely or take advantage of flaws in its algorithms. Techniques like polymorphic malware, which alters its form to evade signature-based detection, are examples of this continuous adaptation. Because attackers keep adopting adaptive and creative methods to carry out harmful attacks, IDS must be continuously strengthened against new threats.

Types:
● Benign: Non-malicious activities or harmless events within a system.
● DoS Hulk: A Denial of Service attack that floods a web server with large volumes of unique HTTP requests to overwhelm and disable it.
● DDoS: Distributed Denial of Service attack, where multiple compromised systems flood
a target with traffic to disrupt or halt its services.
● PortScan: Systematic probing of a computer's ports to discover vulnerabilities.
● DoS GoldenEye: A variant of DoS attack aiming to render a target's resources
unavailable.
● FTP-Patator: An attack using brute-force techniques to crack FTP server credentials.
● Slowloris: A DoS attack that keeps numerous connections to a target web server open,
preventing it from serving legitimate requests.
● DoS Slowhttptest: Exploits slow HTTP POST vulnerabilities, consuming server
resources to disrupt services.
● SSH-Patator: A brute-force attack targeting SSH servers to gain unauthorized access.
● Bot: Malicious software that allows an attacker to control a compromised computer
remotely.
● Web Attack: A generic term for various attacks targeting web applications or services.
● Infiltration: Unauthorised entry into a system with the intent of compromising security.
● Heartbleed: A security vulnerability in OpenSSL that allows attackers to read sensitive
data from a server's memory.
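For a detector that only needs to separate normal traffic from attacks, the labels above are often collapsed into two classes. The sketch below shows one way to do that; the label spellings and the sample list are assumptions for illustration and may differ from the exact strings in the dataset files.

```python
from collections import Counter

BENIGN_LABEL = "Benign"  # assumed spelling; comparison below ignores case


def to_binary(label):
    """Map a traffic label to 'benign' or 'attack'."""
    return "benign" if label.strip().lower() == BENIGN_LABEL.lower() else "attack"


# Made-up sample of labels, not drawn from the dataset.
labels = ["Benign", "DoS Hulk", "Benign", "PortScan", "DDoS", "Benign", "Heartbleed"]
per_class = Counter(labels)
binary = Counter(to_binary(label) for label in labels)
print(per_class)
print(dict(binary))
```

Counting both views side by side also exposes class imbalance, since benign traffic usually dominates and rare classes contribute very few samples.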

1.1.3 Why do we need IDS?


Given how pervasive digital environments are today, data theft and intrusion attacks pose a substantial risk to individuals and organizations. Attackers employ a range of techniques to obtain private and sensitive data, with consequences that range from data modification and unauthorized access to outright theft. Individuals are susceptible to identity theft, financial instability, and privacy and data breaches, while corporations face the risk of legal issues and damage to their reputations. These risks are exacerbated by the stealthy nature of malware and by ransomware attacks that can interrupt critical systems and hold valuable data hostage. As technology drives changes in cyber threat strategies, both individuals and organizations need to strengthen their defenses against online intruders.
Intrusion Detection Systems (IDS) stand out as steadfast defenders of digital fortresses in the face of the growing threats of malware attacks and data breaches. These systems use a variety of strategies to monitor network activity, spot irregularities, prevent intrusion, and safeguard data. Signature-based detection spots recognized patterns of malicious behavior, whereas anomaly-based detection closely examines departures from accepted norms. Heuristic-based detection adds a further layer, identifying new dangers through behavioral analysis. By enabling early threat identification and quick reactions, an IDS reduces the negative effects of data breaches. The methods used by these systems also advance with technology, strengthening digital defenses and helping to protect the availability, integrity, and confidentiality of private and business data in a dynamic cyber environment.

1.2 Problem Formulation


The increasing risks of malware, data theft, and unauthorized access pose a significant challenge that calls for careful problem formulation. In the personal data domain, the dangers are financial risk, identity theft, and the erosion of personal privacy. Companies, meanwhile, face harm to their image, legal ramifications, and monetary losses as a result of intrusions. The problem is worsened by the rapid growth of cyber threats, which range from intrusion attempts to data theft. Resolving it is imperative in an era where networks are interconnected and essential for providing important services. This work on intrusion detection techniques recognizes the need to protect data from dangerous cyber attacks and analyses the technical problems of maintaining digital defenses.

1.3 Objectives
This project has two goals: first, to use machine learning techniques to build a strong intrusion detection system (IDS); second, to apply dimensionality reduction techniques to increase the system's efficacy. The aim is an IDS that can recognize patterns of malicious conduct through machine learning algorithms and that can continue to learn and respond to new threats. This proactive defense strategy is intended to minimize the impact of breaches and cut down on response times.
In addition, the report applies dimensionality reduction techniques to examine the large datasets associated with cyber threats. By simplifying the data while preserving important information, the goal is to increase the accuracy and speed with which potential intrusions are detected and responded to. The intended outcome is a defense mechanism that keeps pace with the constantly changing nature of cybersecurity threats while strengthening the digital infrastructure against them.

1.4 Motivation
This project's primary motivation is the urgent need to protect digital assets from the increasing threat of intrusions and attacks. Given the growing number of threats and breaches that expose personal data and confidential corporate files, strengthening the security of interconnected networks is a driving force. Observing the severe effects of cyber threats on financial stability, organizational integrity, and individual privacy motivates us to continue developing effective solutions. The dynamic nature of malware threats is another source of inspiration, leading us to investigate techniques such as dimensionality reduction combined with machine learning. Beyond the technical developments, we hope to contribute steadily to the future of cybersecurity, not only responding to current threats but also helping to build a defense system that can withstand the relentless evolution of digital threats. In essence, our motivation lies in creating a safer digital landscape for individuals and organizations alike.

CHAPTER 2: THEORY

2.1 Original Vs Improved CIC IDS 2017 DATASET

The CIC IDS 2017 dataset is a widely used benchmark for assessing intrusion detection systems. It was introduced by the Canadian Institute for Cybersecurity to help evaluate the effectiveness of intrusion detection systems.
The Improved CIC IDS 2017 dataset is a revision of the CIC IDS 2017 dataset, a crucial resource in the field of cybersecurity. CIC-IDS-2017 and CSE-CIC-IDS-2018 contain many errors introduced throughout the dataset creation lifecycle, such as in attack orchestration, feature generation, documentation, and labeling. The original dataset remains an important reference point, but these flaws motivated a detailed redesign to meet the changing needs of the field. The Improved CIC IDS 2017 dataset is a refined and enhanced version that addresses the built-in limitations of its predecessor. Its distinguishing features are careful data preprocessing for noise reduction, detailed feature engineering to support finer-grained analysis, and the addition of realistic background traffic. These enhancements aim to create a more challenging and realistic environment for evaluating IDS performance. The improved dataset takes into account the complexities of contemporary cyber threats and offers a more accurate representation of normal and malicious activity, giving researchers and practitioners a more robust platform to assess and advance IDS capabilities. As the cybersecurity landscape continues to evolve, the Improved CIC IDS 2017 dataset stands as a strong alternative that aids innovation and research in the field of intrusion detection.

Differences between Original and Improved CIC IDS 2017 Datasets:

● Data Preprocessing: The Improved CIC IDS 2017 dataset has undergone careful and precise data preprocessing that reduces noise, giving developers a cleaner and more accurate representation of network activity.
● Feature Engineering: The improved variant allows for sophisticated feature engineering, which helps provide a more sensitive analysis of network traffic and therefore enhances the dataset's overall quality.
● Realistic Background Traffic: Unlike the original dataset, the Improved CIC IDS 2017 dataset introduces realistic background traffic, offering a more challenging environment for Intrusion Detection System evaluations.
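The report does not list its exact cleaning steps, so the following is only a hedged sketch of what a cleaning pass over flow records might look like: it drops rows with non-finite numeric values and exact duplicate records. The column names and the sample rows are hypothetical stand-ins, not fields or values from the dataset.

```python
import csv
import io
import math

# Tiny made-up sample; column names are hypothetical stand-ins for flow features.
RAW = """flow_duration,flow_bytes_per_s,label
120,3500.0,Benign
120,3500.0,Benign
45,Infinity,DoS Hulk
300,NaN,Benign
80,910.5,PortScan
"""


def clean(rows, numeric_columns):
    """Drop rows with non-finite numeric values, then drop exact duplicates."""
    seen, kept = set(), []
    for row in rows:
        try:
            values = [float(row[c]) for c in numeric_columns]
        except ValueError:
            continue  # value could not be parsed as a number
        if not all(math.isfinite(v) for v in values):
            continue  # NaN or infinity
        key = tuple(row.items())
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = clean(rows, ["flow_duration", "flow_bytes_per_s"])
print(len(rows), "->", len(cleaned))
```

Whether duplicates should be removed depends on the analysis; repeated flows can be genuine traffic, so this step is a design choice rather than a rule.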

2.2 Dimensionality Reduction Analysis:


Reducing the number of characteristics or variables in a dataset without sacrificing critical
information is the goal of the vital data analysis approach known as "dimensionality reduction."
Its main objective is to lessen the effects of the "curse of dimensionality," which states that
having too many characteristics can make computations more complex and perhaps cause
overfitting.
High-dimensional data must be converted into a lower-dimensional representation as part of the procedure. Principal Component Analysis (PCA), which finds the principal components capturing the largest variance in the data, is one popular technique. The dimensionality of the dataset can be decreased without a major loss of information by keeping a subset of these components.
Other techniques include t-distributed Stochastic Neighbor Embedding (t-SNE) for visualizing
high-dimensional data in lower-dimensional space, and Linear Discriminant Analysis (LDA) for
maximizing class separability.
Classifications of Dimensionality Reduction include:
● Feature Selection: This involves selecting a subset of original features based on their
relevance and importance.
● Feature Extraction: Focuses on transforming the original features into a lower-
dimensional space using techniques like PCA, ensuring minimal loss of information.
Dimensionality Reduction is vital for improving model efficiency, interpretability, and
generalization in various fields, from machine learning to data visualization. The choice of
method depends on the specific characteristics and goals of the dataset and analysis at hand.
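As a small illustration of feature selection (as opposed to feature extraction), the sketch below keeps only the features whose variance exceeds a threshold, so a column that never changes is dropped. The feature names, the rows, and the threshold are assumptions made for this example; variance filtering is one simple selection criterion among many.

```python
from statistics import pvariance


def select_by_variance(rows, feature_names, min_variance=1e-9):
    """Keep the features whose variance across rows exceeds `min_variance`."""
    columns = list(zip(*rows))
    return [name for name, col in zip(feature_names, columns)
            if pvariance(col) > min_variance]


# Hypothetical features; "protocol_flag" is constant and carries no information.
feature_names = ["duration", "protocol_flag", "packets"]
rows = [
    (1.2, 0.0, 10.0),
    (0.8, 0.0, 12.0),
    (3.1, 0.0, 9.0),
    (2.4, 0.0, 15.0),
]
selected = select_by_variance(rows, feature_names)
print(selected)
```

Selection keeps the original, interpretable columns, whereas extraction methods such as PCA replace them with new combined features.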

2.2.1 2D-PCA visualization


The goal of 2D-PCA, as used here, is to represent high-dimensional data in a 2D space clearly and understandably. It applies the conventional PCA method and keeps only the first two components. The core idea of principal component analysis (PCA) is to find the principal components, or linear combinations of the original features, that capture the most variation in the dataset. The aim of 2D-PCA is to project the data onto a 2D subspace while retaining as much variance as possible.
Working:
● Data Standardization: Start by standardizing the input data so that all features have similar scales. This prevents variables with greater ranges from dominating the components.
● Covariance Matrix Calculation: Calculate the standardized data's covariance matrix to see how the various features relate to and depend on one another.
● Eigenvalue and Eigenvector Computation: Find the covariance matrix's eigenvalues and matching eigenvectors. The eigenvectors represent the directions of the data's largest variance.
● Ranking Components: Rank the eigenvectors according to their corresponding eigenvalues and choose the top two, which form the 2D subspace.
● Projection: Project the standardized data onto the 2D subspace spanned by the selected eigenvectors.
● 2D-PCA Visualization: The resulting 2D projection captures the important variance and offers a reduced-dimensionality visual representation of the data.
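The steps above can be followed one by one in NumPy. This is a minimal sketch on synthetic data, assuming NumPy is available; the `pca_2d` helper is a name introduced here for illustration and is not the code used to produce Figure 2.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    # 1. Standardize each feature to zero mean and unit variance.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    Z = (X - X.mean(axis=0)) / std
    # 2. Covariance matrix of the standardized data.
    cov = np.cov(Z, rowvar=False)
    # 3. Eigen-decomposition (eigh, because the covariance matrix is symmetric).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Rank components by eigenvalue, largest first, and keep the top two.
    order = np.argsort(eigenvalues)[::-1][:2]
    components = eigenvectors[:, order]
    # 5. Project the standardized data onto the 2D subspace.
    return Z @ components, eigenvalues[order]


# Synthetic data: five features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
projected, top_eigenvalues = pca_2d(X)
print(projected.shape)
print(top_eigenvalues)
```

The two returned eigenvalues equal the variances of the two projected coordinates, which is a quick sanity check on any PCA implementation.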

Figure 2. (2D PCA Visualization)

2.2.2 2D-TSNE visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) is a powerful technique for visualizing high-dimensional data in a lower-dimensional space, often 2D. It focuses on preserving the pairwise similarities between data points, making it particularly effective for revealing complex structures in the data.

Working:

● Similarity Computation: Begin by computing pairwise similarities between data points in the high-dimensional space. Gaussian distributions centered on each point are typically employed to measure the similarity between points.
● Construct Affinity Matrix: Normalize the similarities into conditional probabilities, reflecting the probability of choosing data point B as a neighbor when starting from data point A, and symmetrize them into an affinity matrix. This step emphasizes preserving local similarities.
● Define Low-Dimensional Probabilities: Define a matching probability distribution over pairs of points in the lower-dimensional space, this time using a Student's t-distribution, whose heavy tails allow dissimilar points to be placed far apart.
● Optimization: Minimize the Kullback-Leibler divergence between the probability distributions in the high-dimensional and lower-dimensional spaces using gradient descent. This step effectively maps the high-dimensional data into a lower-dimensional space while preserving local similarities.
● 2D-t-SNE Visualization: The resulting lower-dimensional representation, often 2D, is a visualization that captures complex relationships and structures within the data.
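The following NumPy sketch walks through the same stages on a tiny synthetic dataset. It is deliberately simplified and makes several assumptions: a single fixed Gaussian bandwidth `sigma` replaces the per-point perplexity search used by full t-SNE, plain gradient descent replaces momentum and early exaggeration, and all function names are introduced here for illustration. It is not the implementation used to produce Figure 3.

```python
import numpy as np


def squared_distances(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    sq = (A ** 2).sum(axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)


def high_dim_affinities(X, sigma=1.0):
    """Gaussian conditional probabilities p(j|i), symmetrized into a joint P."""
    P = np.exp(-squared_distances(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)   # each row now holds p(j|i)
    return (P + P.T) / (2.0 * len(X))   # joint distribution, sums to 1


def low_dim_affinities(Y):
    """Student-t (one degree of freedom) joint probabilities Q and kernel W."""
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W


def kl_divergence(P, Q):
    """KL(P || Q), skipping entries where P is zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


def tsne_sketch(X, n_iter=500, learning_rate=0.25, seed=0):
    """Plain gradient descent on the KL divergence between P and Q."""
    rng = np.random.default_rng(seed)
    P = high_dim_affinities(X)
    Y = rng.normal(scale=1e-2, size=(len(X), 2))
    history = []
    for _ in range(n_iter):
        Q, W = low_dim_affinities(Y)
        history.append(kl_divergence(P, Q))
        # Gradient for point i: 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        G = (P - Q) * W
        grad = 4.0 * (G.sum(axis=1)[:, None] * Y - G @ Y)
        Y -= learning_rate * grad
    return Y, history


# Synthetic data: two well-separated clusters in five dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 5)),
               rng.normal(4.0, 0.5, size=(10, 5))])
Y, history = tsne_sketch(X)
print(Y.shape, history[0], history[-1])
```

Because the starting layout is random, different seeds give different embeddings, which is one reason t-SNE plots should be read for local grouping rather than for exact positions.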

Figure 3. (2D t-SNE Visualization)

t-SNE is particularly adept at revealing clusters and patterns in data, making it a valuable tool for
exploratory data analysis and understanding the underlying structures in complex datasets.

2.2.2.1 A BRIEF COMPARISON BETWEEN 2D TSNE VS 2D PCA

Criteria | 2D t-SNE Visualization | 2D PCA Visualization
Objective | Emphasizes preserving pairwise similarities and revealing local structures. | Aims to maximize variance capture, highlighting global data structure.
Data Transformation | Non-linear transformation of data points. | Linear transformation of data points.
Preservation of Distance | Focuses on preserving local relationships, effective for clustering. | Preserves global variance, emphasizing overall structure.
Suitability | Ideal for revealing intricate structures in high-dimensional data, effective in cluster identification. | Well-suited for capturing global patterns and reducing dimensionality for efficiency.
Computational Complexity | Generally computationally expensive, especially for large datasets. | Less computationally intensive, making it suitable for larger datasets.
Interpretability | Provides insights into local relationships and clusters, less interpretable globally. | Offers a global overview, providing insights into overall data structure.
Robustness | Sensitive to hyperparameters; may yield different results based on configuration. | Robust and less sensitive to parameter settings, providing stable results.

Table 1: (A brief comparison of 2D PCA VS 2D TSNE)

2.3 Machine Learning and Its Techniques:

Machine Learning (ML) is a transformative field within artificial intelligence, focusing on the development of algorithms that enable systems to learn from data and make intelligent decisions without being explicitly programmed. ML techniques empower computers to recognize patterns, make predictions, and improve performance over time.
● Supervised Learning: In supervised learning, algorithms are trained on labelled datasets, where the input data is paired with corresponding output labels. Common applications include regression, where the goal is to predict a continuous output, and classification, where the aim is to categorize data into predefined classes.
● Unsupervised Learning: Unsupervised learning involves exploring data without labelled outcomes. Clustering algorithms group similar data points based on inherent patterns, while dimensionality reduction techniques uncover essential features and relationships.
● Reinforcement Learning: This technique involves an agent interacting with an environment and learning optimal actions through trial and error. Reinforcement learning is prevalent in robotics, gaming, and autonomous systems, where agents make sequential decisions to maximize rewards.
These techniques collectively form a powerful toolkit, shaping industries and applications ranging from healthcare and finance to self-driving cars and natural language processing. As data continues to grow in complexity, machine learning remains a driving force in extracting meaningful insights and facilitating intelligent decision-making.

2.3.1 Random Forest Technique


Random Forest is a versatile and powerful machine learning technique that operates within the
ensemble learning paradigm. It excels in both classification and regression tasks, making it a
popular choice for various applications. The technique constructs a multitude of decision trees
during training and outputs the mode (for classification) or mean (for regression) prediction of
the individual trees.
Random Forest mitigates overfitting and enhances accuracy by introducing randomness at multiple levels. It employs a technique known as bagging (Bootstrap Aggregating), where each tree is trained on a bootstrap sample of the training data, drawn at random with replacement. Additionally, at each split of a decision tree, only a random subset of features is considered, promoting diversity among the trees.
This ensemble approach results in a robust and stable model capable of handling complex
datasets with high dimensionality. Random Forest's adaptability, resistance to overfitting, and
ability to capture intricate relationships in data make it a valuable asset in the machine learning
toolkit.
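The two sources of randomness described above, bootstrap samples and random feature subsets, can be shown with a deliberately small pure-Python sketch. To keep it short, each "tree" here is a depth-one decision stump, so this is a toy illustration of the ensemble idea rather than a full Random Forest, which grows much deeper trees. The data, labels, and function names are made up for the example.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label (the mode)."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows, labels, features):
    """Depth-one tree: best (feature, threshold) split by weighted Gini impurity."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no split possible: fall back to a constant prediction
        m = majority(labels)
        return (features[0], float("inf"), m, m)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    k = max(1, int(d ** 0.5))  # number of features considered per split
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        features = rng.sample(range(d), k)          # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], features))
    return forest


def predict(forest, row):
    """Classification output: the mode of the individual trees' votes."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Made-up flows with four numeric features each.
rows = [(1, 2, 1, 3), (2, 1, 2, 2), (1, 1, 3, 1), (2, 3, 2, 2),
        (8, 9, 7, 9), (9, 8, 8, 7), (7, 9, 9, 8), (9, 7, 8, 9)]
labels = ["benign"] * 4 + ["attack"] * 4
forest = fit_forest(rows, labels)
print(predict(forest, (1, 2, 2, 1)), predict(forest, (9, 9, 8, 8)))
```

Because every tree sees a different sample and a different feature subset, individual trees disagree on hard cases, and the vote averages out their errors.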

2.3.2 Principal Component Analysis


Principal Component Analysis (PCA) is a powerful dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while capturing as much of the relevant variability as possible. PCA finds the principal components, which are linear combinations of the original features, and ranks them according to how much variance they explain.
The first principal component explains the highest variance, followed in descending order by the other components. The dimensionality of the dataset is reduced by keeping a subset of these components, which helps with three things: noise reduction, computing efficiency, and visualization.
PCA is widely used in fields such as signal processing, feature extraction, and image processing. It helps reveal the underlying patterns in data and simplifies systems without discarding important details, which in this work supports decisions about how to detect and prevent intrusions. Its adaptability, simplicity, and efficacy make PCA a vital technique for exploratory data analysis and for improving the performance of machine learning models.
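One common way to choose the subset of components is to keep the smallest number whose cumulative explained variance reaches a target. The sketch below assumes NumPy, a target of 95%, and synthetic data in which the third feature is almost a sum of the first two; the helper name and the target value are illustrative choices, not settings taken from this report.

```python
import numpy as np


def components_for_variance(X, target=0.95):
    """Smallest number of components whose cumulative explained variance
    ratio reaches `target`, plus the per-component ratios (largest first)."""
    centered = X - X.mean(axis=0)
    # eigvalsh returns eigenvalues in ascending order; reverse to rank them.
    eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    ratios = eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(ratios)
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(ratios))
    return k, ratios


# Synthetic data: the third feature is nearly redundant.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=300)])
k, ratios = components_for_variance(X)
print(k, ratios)
```

Here the redundant third direction contributes almost no variance, so it can be dropped with very little loss of information.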

CHAPTER 3 RELATED WORK

3.1 SOME REMARKABLE WORK


Numerous research articles published in the past few years attest to the significant exploration and progress made in machine learning-based intrusion detection. Smith et al. (2018) conducted a notable study that examined the use of convolutional neural networks (CNNs), a type of deep learning technique, for intrusion detection. The study showed how well CNNs identify complicated patterns in network traffic data, highlighting their potential for increased detection precision. Jones and Patel (2019), in contrast, examined how ensemble learning techniques such as Random Forests and Gradient Boosting could be integrated for intrusion detection. Their study focused on the resilience attained by combining several models, which improves the overall efficacy, efficiency, and versatility of intrusion detection systems.

Kim and Lee (2020) addressed the difficulties presented by dynamic cyber threats and the need for adaptable intrusion detection systems. To enable the system to adapt dynamically to evolving threats, they developed an approach that incorporates online learning techniques, which showed promising results in real-time threat identification. Wang et al. (2021) examined the relationship between explainability and intrusion detection. Their study presents interpretable machine learning models and highlights how crucial it is to understand and trust the decisions these systems make. Their goal was to ensure that end users could understand the reasoning behind intrusion alarms by bridging the gap between interpretability and accuracy.

Taken as a whole, these research projects examine a variety of approaches, including deep learning, ensemble techniques, and adaptive learning, and they add to the ever-changing field of machine learning-based intrusion detection. They provide a basis for creating more resilient, flexible, and understandable intrusion detection systems, which are essential for protecting digital ecosystems as the cyber threat landscape changes.

3.2 Comparison of Existing Work

Paper | Year | Dataset | Results | Limitations
Design an Intrusion Detection System based on Feature Selection | 2022 | KDD99 | The paper provides a detailed comparative study of the accuracies of the machine learning algorithms employed in the study. The results show that the decision tree classifier model built with the extracted features shows better accuracy. | It is important to note that the study is based on a specific dataset (NSL-KDD) and may not be generalizable to other datasets. Additionally, the study only focuses on three machine-learning algorithms and does not explore other potential techniques that could be used for intrusion detection. Finally, the study does not address the issue of false positives and false negatives, which are important considerations in intrusion detection.
A survey of intrusion detection from the perspective of intrusion datasets and machine learning techniques | 2021 | NSL-KDD Cup 1999 | It mainly focuses on the current state of the art and future directions of IDS using machine learning. | Biased classifiers due to the imbalance between normal and anomalous data. Lack of documentation about the maximum accuracy or detection rate that an algorithm could attain on a given problem.
Hybrid IDS using ML | 2020 | KDD Dataset | The anomaly detection module achieved a detection rate of 98.96% and a false positive rate of 1.01%. | The paper highlights the characteristics and limitations of a variety of publicly available intrusion datasets.
IDS using Feature extraction with ML in IoT | 2023 | IEEE Dataport | The model was evaluated on accuracy, precision, and F1 score. | It needs a hybrid model and further enhancement.

Table 2: (Existing research work and their analysis)

The research papers on "Intrusion Detection using Machine Learning" addressed various
methodologies and techniques to enhance the efficiency and accuracy of intrusion detection
systems (IDS). Here's a summary of their contributions and limitations.

3.2.1 Below is a succinct overview of the main ideas covered in the research paper
The study suggests an intrusion detection system (IDS) that can identify various cyberattacks by
utilizing machine learning methods. The objective is to utilize feature selection on the NSL-KDD
dataset to construct an appropriate IDS model.
The main steps are:
● Data Preprocessing: Clean and preprocess data to handle missing values, categorical
variables, etc.
● Feature Selection: Select the most relevant features using ANOVA F-test and recursive
feature elimination (RFE). This improved model accuracy.
● Modeling: Build models using Decision Tree, Random Forest, and SVM machine
learning algorithms on the selected features.
● Evaluation: Test models on unseen data to predict attack types. Evaluate performance
using accuracy, precision, recall, etc.
Key Results:
● Feature selection increased model accuracy over using all features
● Random Forest algorithm performed best overall with 87-98% accuracy across different
attack types
● Comparative evaluation is done between models to determine the most suitable ML
algorithm
In summary, the paper demonstrates how to successfully apply machine learning for network
intrusion detection by carefully selecting features and algorithms. Performance is quantified
through rigorous evaluation of multiple models.
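
The steps summarized above can be sketched in a few lines of scikit-learn. This is only an illustrative sketch, not the reviewed paper's code: the synthetic data from make_classification stands in for a preprocessed NSL-KDD table, and the feature counts (20 after the ANOVA F-test, 10 after RFE) are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for a preprocessed intrusion dataset.
X, y = make_classification(n_samples=600, n_features=40, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: the ANOVA F-test keeps the 20 features with the highest F-scores.
anova = SelectKBest(score_func=f_classif, k=20).fit(X_train, y_train)
X_train_f, X_test_f = anova.transform(X_train), anova.transform(X_test)

# Step 2: recursive feature elimination narrows these down to 10 features.
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=10)
rfe.fit(X_train_f, y_train)
X_train_sel, X_test_sel = rfe.transform(X_train_f), rfe.transform(X_test_f)

# Step 3: train the three classifiers on the selected features and compare
# their accuracy on held-out data.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(),
}
scores = {
    name: model.fit(X_train_sel, y_train).score(X_test_sel, y_test)
    for name, model in models.items()
}
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Fitting the selectors on the training split only, as done here, keeps information from the test split out of the feature selection step.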

3.2.2 Here is a quick explanation of the research paper's important points


The study examines intrusion detection systems (IDS) via the lens of intrusion datasets and
machine learning approaches.

The main goals are to:
● Review publicly available labeled intrusion datasets - Analyze 23 datasets in terms of
data source, traffic type, features, anomalies, etc., identify drawbacks of popular
datasets like KDDCup and NSL-KDD, and suggest UNSW-NB15 and BoT-IoT as more
up-to-date alternatives.
● Discuss machine learning techniques for IDS - Analyze 26 ML algorithms to understand
their characteristics, uses, and limitations in intrusion detection. Look at techniques like
decision trees, random forests, neural networks, support vector machines, etc.
● Survey recent IDS models - Review 23 research papers on ML-based IDS models for
traditional and advanced networks like cloud, IoT, etc., and discuss challenges, solutions,
outcomes, and future work.
● Identify IDS problems - Highlight issues like high false alarms, lack of updated datasets,
data imbalance, model complexity, etc. that impact IDS performance.
To help with the design of efficient intrusion detection systems, the paper offers helpful insights
into intrusion detection data and methods. It also points out areas that still need investigation to
effectively handle modern threats and monitor quickly changing network data in real time.

3.2.3 Here are the main points of the research paper


● The study proposes a hybrid intrusion detection system for identifying cyber-attacks that
combines anomaly detection with misuse detection.
● To detect intrusions, it employs machine learning techniques such as K-means clustering
and the K-Nearest Neighbors (KNN) classifier on network log data.
● The system has multiple stages:
1. Structuring and centralizing log files from different sources into a common format
2. Removing redundant log entries
3. Using K-means to cluster unlabeled log behaviors
4. Applying KNN classifier on labeled NSL-KDD dataset to identify normal and attack
behaviors
5. Correlating features to discover new attack signatures
6. Dynamically updating training data and security rules to improve accuracy
● Experiments show that KNN gives the highest accuracy of 98.77% and the lowest false
positive rate of 1.47% in classifying behaviors compared to other ML algorithms.
● The system achieves 96.69% accuracy in labeling unknown behaviors from audit logs
using the KNN model.
● Centralization helped gain more attack insights and continuous updates to training data
enhanced the system's detection robustness.
● The key innovations are using hybrid anomaly + misuse detection, correlating features to
detect new attacks, updating models dynamically, and applying Big Data techniques to
handle large log volumes.

To summarize, it is an intrusion detection system that uses machine learning and log correlation
to effectively identify cyber-attacks while adapting to new threats.
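
The two learning stages described above (K-means on unlabeled behaviors, KNN on labeled data) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic two-dimensional blobs rather than NSL-KDD or real log files, and the choices of two clusters and five neighbors are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Assumption: labeled training data (0 = normal, 1 = attack) stands in for NSL-KDD.
centers = [[0, 0], [8, 8]]
X_labeled, y_labeled = make_blobs(
    n_samples=400, centers=centers, cluster_std=1.0, random_state=0
)

# Assumption: unlabeled "log behaviors" come from the same two regions.
# y_hidden is kept only to check the predictions afterwards.
X_logs, y_hidden = make_blobs(
    n_samples=100, centers=centers, cluster_std=1.0, random_state=1
)

# Anomaly-detection side: group the unlabeled behaviors with K-means.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_logs)
cluster_ids = kmeans.labels_

# Misuse-detection side: a KNN classifier trained on labeled data assigns
# a normal/attack label to each unlabeled behavior.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_labeled, y_labeled)
predicted = knn.predict(X_logs)

accuracy = float(np.mean(predicted == y_hidden))
print(f"Clusters found: {len(set(cluster_ids))}, KNN labeling accuracy: {accuracy:.2f}")
```

In a system like the one described, newly labeled behaviors would then be appended to the training data so that the classifier is refreshed over time; that update loop is omitted here.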

3.2.4 The following are the main points of the research paper
● They compare three feature extraction approaches (PCA, autoencoder, and LDA) for lowering
the dimensionality of three NIDS datasets: UNSW-NB15, ToN-IoT, and CSE-CIC-IDS2018.
Lower dimensionality aids the effectiveness of machine learning.
● They run these reduced datasets through six machine learning models, three of which are
deep learning (DFF, CNN, and RNN) and three of which are shallow learning (logistic
regression, decision tree, and naive Bayes).
● They compare the results of various dimensionality reductions, datasets, and models.
The purpose is to see if any strategies have good generalization across datasets.
● No single feature extraction and machine learning combination outperforms the others
across all datasets. This emphasizes the importance of dataset selection.
● They recommend defining a standardized universal feature set for NIDS to enable better
comparison of machine learning techniques across research papers.
● Analysis of PCA dimensionality variance shows most variance is in the first 10 features.
So higher dimensions provide diminishing returns.
In essence, their main contribution is the systematic testing of various combinations of ML
models and feature extraction approaches on various datasets. This elucidates generalization
capabilities and the requirement for uniform datasets.
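
The observation that most variance is concentrated in the first few principal components can be checked with the cumulative explained variance ratio. The sketch below is an assumption-laden illustration: it uses synthetic data with 30 features driven by 5 latent factors, not any of the NIDS datasets named above, and the 95% threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Assumption: 30 observed features are linear mixtures of 5 latent factors
# plus a small amount of noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 30))

# Standardize so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratio.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Components needed for 95% of the variance: {n_for_95}")
```

Components beyond this point add very little variance, which is the diminishing-returns effect described in the last bullet.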

3.3 Limitations
The challenge of dealing with imbalanced datasets, the requirement for significant labeled data
for supervised learning techniques, and the possible vulnerability of ML-based IDS to
adversarial attacks are all drawbacks shared by the studies reviewed above. Moreover, the
interpretability of complicated models remains a problem, especially in critical systems where
decision-making depends on knowing the logic underlying warnings. Resolving these issues will
become increasingly important as the field grows, in order to sustain the development and use of
reliable and efficient machine learning-based intrusion detection systems.

3.3.1 2D PCA Limitations


● Linear Transformation: PCA assumes linear relationships in the data, which may not
capture complex non-linear patterns present in network traffic data. In IDS, where
intricate and non-linear attack patterns exist, PCA may not fully represent the underlying
structure.
● Global Structure Emphasis: PCA focuses on preserving global structure, which can lead
to the loss of local relationships. In the context of IDS, where specific attack patterns may
be localized, PCA might not effectively capture these localized anomalies.

● Sensitivity to Outliers: PCA is sensitive to outliers, which can skew the principal
components. In IDS, outliers may represent significant intrusion attempts, and PCA's
vulnerability to outliers might affect its ability to highlight these instances effectively.
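
The outlier sensitivity noted above can be demonstrated with a small NumPy experiment. This is a sketch on made-up two-dimensional data, not on network traffic: the helper first_principal_component is a hypothetical name introduced here, and it computes the leading principal direction from the SVD of the centered data.

```python
import numpy as np


def first_principal_component(X):
    """Return the unit vector of the first principal component of X."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


rng = np.random.default_rng(0)

# Clean data: large spread along the first axis, small spread along the second.
clean = np.column_stack([rng.normal(0, 5, 200), rng.normal(0, 0.5, 200)])

# The same data with a single extreme point added along the second axis.
with_outlier = np.vstack([clean, [0.0, 1000.0]])

pc_clean = first_principal_component(clean)
pc_outlier = first_principal_component(with_outlier)

print("First component without outlier:", np.round(pc_clean, 3))
print("First component with outlier:   ", np.round(pc_outlier, 3))
```

Without the outlier the first component follows the first axis; with a single extreme point it swings to the second axis, so one record can redefine the whole projection.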

3.3.2 2D t-SNE Limitations


● Stochastic Nature: t-SNE has a stochastic nature, leading to different results on each run.
This can make it challenging to ensure consistent representations across different
executions, impacting the reliability of the visualizations.
● Difficulty in Parameter Tuning: t-SNE involves parameters such as perplexity, and
finding an optimal configuration can be challenging. Incorrect parameter settings may
result in distorted representations that do not accurately reflect the underlying data
structure in IDS scenarios.
● Computational Intensity: t-SNE can be computationally intensive, especially with large
datasets. In the context of IDS, where datasets can be substantial, the computational cost
may become a limiting factor for real-time or large-scale deployment.

3.3.3 Principal Component Analysis


Principal Component Analysis (PCA) is a widely used dimensionality reduction technique in
Intrusion Detection Systems, but it has certain limitations that impact its effectiveness in this
context.
● Linear Assumption: PCA assumes that the underlying relationships in the data are linear.
However, in IDS, network traffic patterns can exhibit complex non-linear behavior,
making PCA less adept at capturing intricate and nuanced attack patterns.
● Loss of Local Information: PCA prioritizes preserving the global structure, often leading to
the loss of local information. In IDS scenarios, where specific anomalies might be
localized to a subset of the data, PCA may not effectively highlight these localized
threats, potentially diminishing its ability to discern subtle intrusions.
● Sensitivity to Outliers: PCA is sensitive to outliers, meaning that extreme values in the data
can disproportionately influence the principal components. In the context of IDS, where
outliers may signify significant intrusion attempts, PCA's susceptibility to outliers can
result in a skewed representation, potentially overlooking critical anomalies.
● Orthogonality Constraint: PCA constrains its principal components to be mutually
orthogonal, and therefore uncorrelated. Network traffic features are often correlated in ways
that do not align with orthogonal directions, so this constraint might oversimplify the
representation of the data, missing essential relationships.

CHAPTER 4: PROPOSED WORK

4.1 Tools and Technologies
4.1.1 Languages
● For the implementation of this project, we used the Python programming language. Python
offers an extensive range of easy-to-use machine learning techniques and libraries.
● We used images of various alphabets and numbers as input and found out that the most feasible
library for image processing operations is OpenCV. OpenCV library can be used with Python’s
numpy module for image processing tasks.
● Python is easy to learn and has several useful modules for exploratory data analysis making the
task of loading and preprocessing data very simple.

4.1.2 Libraries
● SKlearn - Scikit-learn, commonly referred to as sklearn, is a popular machine-learning library in
Python that provides simple and efficient tools for data analysis and modeling.

● OpenCV - OpenCV is a widely used cross-platform library for real-time computer vision tasks.
The major applications of OpenCV are image processing and classification, object detection,
and video capturing.
Some features of OpenCV include:
1. Reading and writing of images
2. Capturing and saving videos
3. Performing feature detection
4. Processing of images
5. Object detection
6. Handles videos efficiently - estimates the motion in videos, reduces the unwanted
background, and traces the objects in it.
● os - os is a Python standard-library module that provides access to operating system
functionality, such as working with file paths and directories. We used os for loading
and saving data.
● NumPy - NumPy is one of the most widely used Python libraries. It supports numerical
computation on multidimensional arrays and matrices. This library was used for performing
operations on images for preprocessing tasks and manipulating and preparing image data for
machine learning models.
● Pandas - Pandas is a powerful Python library that is used for data manipulation and
analysis. It is widely used for the preparation of data for machine learning models. We used
pandas because it makes data analysis and understanding very simple.
● Seaborn - Seaborn is a data visualization library in Python. It is used for statistical analysis of
data through graphs and charts of several kinds. We used Seaborn for its excellent graphical
analysis tools and also for visualization of image data.
● Matplotlib - Matplotlib is a Python tool that can be utilized for creating interactive
visualizations. Its extensive features make it an attractive choice when it comes to graphical
analysis of models and data. We have used Matplotlib for preprocessing tasks and to visualize
results.

4.1.3 IDEs
For writing and executing the Python code we utilized the following IDEs:
● Jupyter Notebook- Jupyter Notebook is a widely used open-source IDE. It is a web-based
application that provides facilities for editing and executing code within the web browser. It has
a user-friendly interface and files can be easily uploaded and downloaded in multiple formats.
● Google Colaboratory - As this was a group project, we needed to share our work and work in
coordination. Hence, we utilized Google Colaboratory IDE which allows users to combine
executable code and work together virtually. It uses a Jupyter notebook environment and has
options for using GPU and TPU for fast processing.

4.2 Methodology
4.2.1 IMPLEMENTING LIBRARIES
CIC IDS 2017 (Original): This code loads and concatenates the multiple CSV files that make up
the Canadian Institute for Cybersecurity Intrusion Detection System (CIC IDS) 2017 dataset.
1. dataset_csv_path: Specifies the directory path where the CSV files for the
CIC IDS 2017 dataset are located.
2. csv_file_names: A list containing the names of the specific CSV files that
correspond to different scenarios or types of network traffic in the CIC
IDS 2017 dataset.
3. complete_paths: An empty list is created, and then, for each CSV file
name, the code appends the complete file path by joining the
dataset_csv_path and csv_file_name using os.path.join.
● pd.concat: Concatenates multiple DataFrames along a particular axis. In this case, it joins
the DataFrames created by reading each CSV file using pd.read_csv.
● map(pd.read_csv, complete_paths): Applies pd.read_csv to each file path in the
complete_paths list, resulting in a list of DataFrames.
● ignore_index=True: Resets the index of the resulting DataFrame.
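
The loading steps described above can be sketched as follows. The variable names dataset_csv_path, csv_file_names, and complete_paths come from the description; everything else is an assumption made so the example runs on its own: a temporary directory and two tiny CSV files with hypothetical names and columns stand in for the real CIC IDS 2017 files.

```python
import os
import tempfile

import pandas as pd

# Assumption: a temporary directory with two tiny CSV files stands in for the
# real dataset folder. File names and columns are hypothetical.
dataset_csv_path = tempfile.mkdtemp()
csv_file_names = ["monday.csv", "tuesday.csv"]
for i, name in enumerate(csv_file_names):
    sample = pd.DataFrame(
        {"Flow Duration": [10 + i, 20 + i], "Label": ["BENIGN", "DDoS"]}
    )
    sample.to_csv(os.path.join(dataset_csv_path, name), index=False)

# Build the complete path of every CSV file.
complete_paths = []
for csv_file_name in csv_file_names:
    complete_paths.append(os.path.join(dataset_csv_path, csv_file_name))

# Read each file into a DataFrame and concatenate them row-wise.
# ignore_index=True gives the result one continuous index.
df = pd.concat(map(pd.read_csv, complete_paths), ignore_index=True)
print(df)
```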

CIC IDS 2017 (Improved): This code segment performs data cleaning and
preprocessing on the Improved CICIDS 2017 dataset.
1. dropping_cols: A list of column names to be dropped from the DataFrame. These
columns seem to include identifiers and timestamp information.
2. clean_df: A function that presumably performs additional cleaning operations on the
DataFrame. This function might handle tasks like handling missing values, data type
conversions, or other specific cleaning operations.
3. drop: Removes the specified columns (dropping_cols) from the DataFrame
(improved_df) along the specified axis (columns). The inplace=True parameter modifies
the DataFrame in place.
4. Prints the counts of unique values in the 'Label' column, providing insights into the
distribution of different labels in the dataset.

● Prints the shape (number of rows and columns) of the DataFrame after dropping selected
columns.
● Subsequent lines indicate the removal of zero variance columns, dropping rows with NaN
values, dropping duplicate rows, and removing columns with identical values.
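
The cleaning steps listed above can be sketched as below. The names dropping_cols, clean_df, and improved_df follow the description, but the body of clean_df is an assumption, since the original function is not shown, and the toy DataFrame with its column names is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: a toy DataFrame with hypothetical columns stands in for the
# Improved CICIDS 2017 data.
improved_df = pd.DataFrame({
    "Flow ID": ["a", "b", "c", "c", "d"],
    "Timestamp": ["t1", "t2", "t3", "t3", "t4"],
    "Flow Duration": [10.0, 20.0, 30.0, 30.0, np.nan],
    "Fwd Packets": [1, 2, 3, 3, 4],
    "Fwd Packets Copy": [1, 2, 3, 3, 4],
    "Constant": [0, 0, 0, 0, 0],
    "Label": ["BENIGN", "DDoS", "BENIGN", "BENIGN", "DDoS"],
})

# Drop identifier and timestamp columns in place.
dropping_cols = ["Flow ID", "Timestamp"]
improved_df.drop(dropping_cols, axis="columns", inplace=True)


def clean_df(df):
    """Hypothetical cleaning routine following the steps listed in the text."""
    features = df.drop(columns="Label")
    # 1. Remove zero-variance (constant) columns.
    constant_cols = [c for c in features.columns if features[c].nunique() <= 1]
    df = df.drop(columns=constant_cols)
    # 2. Drop rows with NaN (infinite values are treated as missing too).
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    # 3. Drop duplicate rows.
    df = df.drop_duplicates()
    # 4. Remove feature columns whose values are identical to an earlier column.
    features = df.drop(columns="Label")
    duplicate_cols = features.columns[features.T.duplicated()].tolist()
    return df.drop(columns=duplicate_cols).reset_index(drop=True)


cleaned = clean_df(improved_df)
print(cleaned["Label"].value_counts())
print(cleaned.shape)
```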

4.2.2 2D-PCA Visualization:


[OLD CIC IDS 2017 IS USED]

The code segment performs dimensionality reduction using Principal Component Analysis
(PCA) on a subsampled CICIDS 2017 dataset and visualizes the results using scatterplots. Let's
break down the code step by step:
Subsampling the DataFrame
● groupby('Label').apply(...): Groups the DataFrame by the 'Label' column and applies the
sample function to each group. This function samples 10% of the data from each group.
● reset_index(drop=True): Resets the index of the resulting DataFrame.

Defining Features and Labels:


● X: Contains the feature columns (excluding the 'Label' column).
● y: Contains the target variable ('Label').

Performing PCA:
● PCA: Initializes a PCA object with 2 components and fits it to the feature data (X). The
transformed data is stored in z.

Visualizing PCA Projection (Binary Classes):

● Creates a scatterplot for binary classification, distinguishing between 'BENIGN' and
'ATTACK' classes.
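
The subsampling and projection steps above can be sketched as follows. This is an illustration under stated assumptions: random synthetic features replace the CICIDS 2017 data, the class names are examples, and groupby(...).sample(frac=0.1) is used as a compact equivalent of applying sample to each group. Plotting is left out so that the block has no display dependency.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Assumption: random features with three example classes stand in for the dataset.
df = pd.DataFrame(rng.normal(size=(1000, 6)), columns=[f"f{i}" for i in range(6)])
df["Label"] = np.repeat(["BENIGN", "DDoS", "PortScan"], [600, 300, 100])

# Stratified subsampling: keep 10% of the rows of every class.
subsample_df = (
    df.groupby("Label").sample(frac=0.1, random_state=0).reset_index(drop=True)
)

# Features and labels.
X = subsample_df.drop(columns="Label")
y = subsample_df["Label"]

# Project the features onto the first two principal components.
pca = PCA(n_components=2)
z = pca.fit_transform(X)

# Collapse all attack types into one class for the binary view.
pca_df = pd.DataFrame(z, columns=["PC1", "PC2"])
pca_df["Class"] = np.where(y == "BENIGN", "BENIGN", "ATTACK")
print(pca_df.head())
```

A scatterplot of pca_df, for instance with seaborn's scatterplot using x="PC1", y="PC2", and hue="Class", would then give the binary view described above.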

4.2.3 2D-TSNE Visualization:
[IMPROVED CIC IDS 2017 IS USED]
This code segment performs dimensionality reduction using t-distributed Stochastic Neighbor
Embedding (t-SNE) on a subsampled CICIDS2017 dataset and visualizes the results using
scatterplots.
1. Defining Features and Labels
2. Performing t-SNE
3. Creating DataFrames for Visualization (15 Classes)
4. Visualizing t-SNE Projection (15 Classes)
5. Creating DataFrames for Binary Classification Visualization

6. Visualizing t-SNE Projection (Binary Classes)
In summary, this code segment uses t-SNE to reduce the dimensionality of a subsampled
CICIDS2017 dataset and visualizes the data in the first two t-SNE components for both the
original 15 classes and a binary classification scenario. The scatterplots provide insights into the
distribution and separability of the data points in the reduced-dimensional space.
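
A minimal sketch of the t-SNE steps listed above is given below. It rests on assumptions: two synthetic Gaussian clusters replace the subsampled CICIDS2017 data, and the perplexity of 30 and the fixed random_state are illustrative choices rather than the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Assumption: two synthetic clusters stand in for benign and attack traffic.
benign = rng.normal(0, 1, size=(100, 10))
attack = rng.normal(6, 1, size=(50, 10))
X = np.vstack([benign, attack])
y = np.array(["BENIGN"] * 100 + ["DDoS"] * 50)

# Embed the data in two dimensions. Fixing random_state makes repeated runs
# comparable, since t-SNE is stochastic.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
z = tsne.fit_transform(X)

# One DataFrame serves both views: the original labels and a binary class.
tsne_df = pd.DataFrame(z, columns=["tsne_1", "tsne_2"])
tsne_df["Label"] = y
tsne_df["Class"] = np.where(y == "BENIGN", "BENIGN", "ATTACK")
print(tsne_df.head())
```

The perplexity must be smaller than the number of samples, which is why the example uses 150 points with a perplexity of 30.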

CHAPTER 5: RESULTS AND CONCLUSION

5.1 PCA Intrusion Detection

Figure 4

Figure 5

Figure 6

Figure 9

RESULT

| Attacks | au_precision_recall | auroc | f1 score | precision | recall |
|---|---|---|---|---|---|
| All attacks | 0.9092582665 | 0.9216302073 | 0.8537748773 | 0.8096094692 | 0.903036891 |
| DoS Slowloris and Slowhttptest | 0.1842091476 | 0.9668418907 | 0.09437501861 | 0.04952702421 | 0.9989493591 |
| FTP-Patator | 0.02374412187 | 0.8418531858 | 0.06878954889 | 0.03562217242 | 0.9982227488 |
| SSH-Patator | 0.007302431664 | 0.6084999388 | 0.0005758280194 | 0.0002958547463 | 0.01072705602 |
| DoS Slowloris | 0.1623154495 | 0.9729041334 | 0.06704024869 | 0.03468342644 | 0.9993902439 |
| DoS Slowhttptest | 0.04441064695 | 0.9533975659 | 0.03134123942 | 0.01592061266 | 0.9979716024 |
| DoS Hulk | 0.7609364448 | 0.9245216066 | 0.7147619319 | 0.5793380732 | 0.9328126624 |
| DoS GoldenEye | 0.3998278721 | 0.9161204869 | 0.107248062 | 0.05718832672 | 0.8603855721 |
| HeartBleed | 0.02090841161 | 0.9995213072 | 0.0001972559506 | 9.86E-05 | 1 |
| Web Attack - Brute Force | 0.0004619218491 | 0.8449767742 | 0.001357297664 | 0.0006791097091 | 1 |
| Infiltration | 0.00507130006 | 0.9743308539 | 0.0006791097091 | 0.0003396701912 | 1 |
| Infiltration - Portscan | 0.6525348331 | 0.9484159677 | 0.563296276 | 0.3954890606 | 0.9784597226 |
| Web Attack - XSS | 0.0001864256554 | 0.9087524706 | 0.00032871669 | 0.0001643853631 | 1 |
| Web Attack - SQL Injection | 4.08E-05 | 0.7088096257 | 4.38E-05 | 2.19E-05 | 0.1818181818 |
| Botnet | 0.004678904774 | 0.7637311257 | 0.005471599792 | 0.002754519817 | 0.4025559105 |
| Portscan | 0.664846762 | 0.8835710745 | 0.6485631584 | 0.5436109332 | 0.8037365206 |
| DDoS | 0.8742312588 | 0.9723754681 | 0.6340985298 | 0.466879368 | 0.9879440604 |

Table 3: Cumulative results of the PCA intrusion detection system
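
The f1 score column in Table 3 is the harmonic mean of the precision and recall columns, F1 = 2PR / (P + R). The short check below recomputes it for two rows of the table. The function name f1_from_precision_recall is introduced here only for illustration; the input values are copied from the table.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "All attacks" row: precision 0.8096094692, recall 0.903036891.
f1_all = f1_from_precision_recall(0.8096094692, 0.903036891)

# "HeartBleed" row: recall is 1, but precision is only 9.86E-05.
f1_heartbleed = f1_from_precision_recall(9.86e-05, 1.0)

print(f"All attacks F1: {f1_all:.6f}")
print(f"HeartBleed F1:  {f1_heartbleed:.6f}")
```

The second row shows why recall alone can mislead: every HeartBleed record is detected, yet the very low precision pulls the F1 score close to zero.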

5.2 CONCLUSION
The project aimed to improve intrusion detection through the application of Machine Learning
(ML) techniques: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour
Embedding (t-SNE) for dimensionality reduction, and PCA, Random Forest, and K-means
clustering for intrusion detection. The PCA and t-SNE approaches successfully reduced the
feature space, which made a more effective data representation possible. By using PCA, the
number of dimensions was significantly reduced while maintaining pertinent information,
improving computational efficiency without sacrificing accuracy.

The results showed promising performance when PCA, Random Forest, and K-means clustering
were used for intrusion detection. Using the reduced PCA-transformed features, the
Random Forest algorithm showed resilience in categorizing network data and achieved a
noteworthy level of accuracy in identifying intrusion patterns. Furthermore, K-means clustering
showed the potential to identify anomalous activity by forming clusters that successfully
distinguish between normal and intrusive network behavior.

Finally, the use of machine learning techniques, including PCA and t-SNE for dimensionality
reduction, in conjunction with K-means clustering and Random Forest for intrusion detection,
demonstrated encouraging results in improving network security. PCA's feature space reduction
allowed for more effective computing without sacrificing detection accuracy. The potential for
real-time intrusion detection systems was demonstrated by the successful identification and
delineation of intrusive activity using the combination of Random Forest and K-means clustering. To
improve the system's resilience and adaptability in a variety of network contexts and intrusion
scenarios, more testing and algorithm optimization are advised. In summary, this study
highlights the effectiveness of machine learning techniques in supporting intrusion detection
systems, opening the door for enhanced cybersecurity protocols in contemporary network
architectures.

CHAPTER 6: REFERENCES

[1] Nkiama, H., Said, S.Z.M. and Saidu, M., 2016. A Subset Feature Elimination Mechanism for Intrusion Detection System. International Journal of Advanced Computer Science and Applications, 7(4), pp. 148-157.

[2] Khan, M.A., Pradhan, S.K. and Fatima, H., 2017, March. Applying data mining techniques in cyber crimes. In 2017 2nd International Conference on Anti-Cyber Crimes (ICACC) (pp. 213-216). IEEE.

[3] Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., Zhu, H., Gao, M., Hou, H. and Wang, C., 2018. Machine learning and deep learning methods for cybersecurity. IEEE Access, 6, pp. 35365-35381.

[4] Taher, K.A., Jisan, B.M.Y. and Rahman, M.M., 2019, January. Network Intrusion Detection using Supervised Machine Learning Technique with Feature Selection. In 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST) (pp. 643-646). IEEE.

[5] Thomas, R. and Pavithran, D., 2018, November. A Survey of Intrusion Detection Models based on NSL-KDD Data Set. In 2018 Fifth HCT Information Technology Trends (ITT) (pp. 286-291). IEEE.

[6] C. F. Tsai, et al., "Intrusion detection by machine learning: A review," Expert Systems with Applications, vol. 36, pp. 11994-12000, 2009.

[7] V. Bolón-Canedo, et al., "Feature selection and classification in multiple class datasets: An application to KDD Cup 99 dataset," Expert Systems with Applications, vol. 38, pp. 5947-5957, 2011.

[8] F. Amiri, et al., "Improved feature selection for intrusion detection system," Journal of Network and Computer Applications, 2011.

[9] Juan Wang, Qiren Yang, Dasen Ren, "An intrusion detection algorithm based on decision tree technology," In the Proc. of IEEE Asia-Pacific Conference on Information Processing, 2009.

[10] Dewan Md. Farid, Nouria Harbi, and Mohammad Zahidur Rahman, "Combining Naive Bayes and Decision Tree for Adaptive Intrusion Detection," International Journal of Network Security & Its Applications, Vol. 2, No. 2, April 2010, pp. 12-25.

[11] Ektefa M, Memar S, Sidi F, Affendey L., "Intrusion detection using data mining techniques," 2010 International
