
Team 4 Report Document (3)

The document presents a project report on a Network Threat Hunting and Detection System utilizing machine learning to enhance cybersecurity. It details the development of a system that distinguishes between normal and malicious network traffic, achieving over 99% accuracy with the Random Forest classifier. The report also discusses the limitations of traditional intrusion detection systems and proposes a machine learning-based solution for improved detection and adaptability to evolving cyber threats.

Network Threat Hunting and Detection System

Using Machine Learning


A PROJECT REPORT

Submitted by

ANBURAJA R (714221104004)

MINIBALA G (714221104024)

SANKARI A (714221104042)

SAVITHA J (714221104044)

in partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

TAMILNADU COLLEGE OF ENGINEERING, COIMBATORE

ANNA UNIVERSITY: CHENNAI 600 025

MAY 2025
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “NETWORK THREAT HUNTING AND
DETECTION SYSTEM USING MACHINE LEARNING” is the bonafide
work of “ANBURAJA R, MINIBALA G, SANKARI A, SAVITHA J” who
carried out the project work under my supervision.

SIGNATURE
Mrs. T. Arachelvi, M. Tech.,
ASSOCIATE PROFESSOR
HEAD OF THE DEPARTMENT
Department of Computer Science and Engineering,
Tamilnadu College of Engineering,
Coimbatore – 641 659.

SIGNATURE
Mr. R. Ponneela Vignesh, M.E., MBA., (Ph. D)
ASSISTANT PROFESSOR
SUPERVISOR
Department of Information Technology,
Tamilnadu College of Engineering,
Coimbatore – 641 659.

Submitted for the Anna University Viva-Voce held on _________________

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

We express our deep sense of gratitude to the management for admitting us into
this project.

We extend our sincere wholehearted thanks to our beloved Principal
Dr. M. Prince., M.E., Ph.D., and Director for his continuous support in all our
curricular activities.

We are tremendously thankful to our Head of the Department
Mrs. T. ARACHELVI, M. Tech., for her encouragement and support.

We extend our sincere thanks and gratitude to our beloved Project guide
Mr. R. PONNEELA VIGNESH, M.E., MBA., (Ph. D)., for his priceless suggestions
and unrelenting support in all our efforts to improve our project and for guiding
us toward its successful completion.

We extend our sincere thanks to all the teaching and non-teaching faculty
members of the department for their assistance and all our friends who helped us
in bringing out our project in good shape and form.

We express our sincere gratitude to our beloved PARENTS for the
encouragement and support in all endeavors.

ABSTRACT

In the modern digital landscape, traditional intrusion detection systems
(IDS) are increasingly inadequate to address sophisticated cyber threats. This
study introduces a Network Threat Hunting and Detection System powered by
machine learning (ML) to address these challenges. Using the KDD Cup
1999 dataset, the study explores various ML models, including Logistic
Regression, Decision Tree, Random Forest, Naive Bayes, Artificial Neural
Network (ANN), Support Vector Machine (SVM), and Gradient Boosting. Data
preprocessing techniques such as normalization, encoding, and feature selection
were employed to enhance model performance. Among the models evaluated, the
Random Forest classifier demonstrated superior results, achieving an accuracy of
over 99%. This system effectively distinguishes between normal and malicious
network traffic, offering a scalable, adaptive, and highly accurate approach to
intrusion detection. Future enhancements include integrating real-time detection,
deploying deep learning models, and using explainable AI for greater
transparency.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

1. INTRODUCTION
    1.1 Introduction
    1.2 Overview
    1.3 Problem Statement
    1.4 Existing System
        1.4.1 Signature-Based Detection
        1.4.2 Anomaly-Based Detection
    1.5 Proposed System
        1.5.1 System Objectives
        1.5.2 Implementation Strategy
        1.5.3 Advantages of the Proposed System
    1.6 Significance of the Study
    1.7 Applications of the Study
    1.8 Limitations of the Study
    1.9 Summary

2. LITERATURE SURVEY
    2.1 Introduction
    2.2 Network IDS: Overview
    2.3 Datasets for Intrusion Detection Research
        2.3.1 KDD Cup 1999 Dataset
        2.3.2 NSL-KDD Dataset
        2.3.3 CICIDS2017
    2.4 Review of Related Work
        2.4.1 Decision Tree Algorithms
        2.4.2 Random Forest Classifier
        2.4.3 Naive Bayes Classifier
        2.4.4 Support Vector Machine (SVM)
        2.4.5 Artificial Neural Network (ANN)
        2.4.6 Logistic Regression
        2.4.7 Gradient Boosting Classifier
    2.5 Comparative Analysis of Previous Work
    2.6 Summary

3. METHODOLOGY
    3.1 Introduction
        3.1.1 Algorithm
    3.2 Methodology
        3.2.1 Dataset Description
        3.2.2 Data Preprocessing
        3.2.3 Machine Learning Models Used
        3.2.4 Model Evaluation Metrics
    3.3 Summary

4. SYSTEM ARCHITECTURE
    4.1 Introduction
    4.2 System Architecture
    4.3 Architecture Description
    4.4 System Modules
    4.5 Data Flow Diagram
    4.6 Summary

5. SYSTEM IMPLEMENTATION
    5.1 Introduction
    5.2 System Implementation
    5.3 Data Collection and Loading
    5.4 Data Preprocessing
    5.5 Model Training and Selection
        5.5.1 Performance Evaluation
        5.5.2 User Interface
    5.6 Summary

6. CONCLUSION AND FUTURE WORK
    6.1 Conclusion
    6.2 Future Work

7. APPENDIX
    7.1 Source Code
    7.2 Screenshots

8. RESULT
    8.1 Result Analysis

9. BIBLIOGRAPHY

LIST OF FIGURES

FIG NO.    FIGURE NAME

3.1        Block Diagram of Proposed System
4.1        Data Flow Diagram
7.2.1      Model’s Accuracy and Time
7.2.2      Model Accuracy Comparison
7.2.3      Training and Testing Comparison

LIST OF ABBREVIATIONS

S.NO.  ACRONYM        EXPANSION

1.     IDS            Intrusion Detection System
2.     ML             Machine Learning
3.     SVM            Support Vector Machine
4.     CNN            Convolutional Neural Networks
5.     ANN            Artificial Neural Networks
6.     KNN            K-Nearest Neighbours
7.     DoS            Denial of Service
8.     R2L            Remote to Local Attack
9.     U2R            User to Root
10.    NSL-KDD        Network Security Laboratory Knowledge Discovery Dataset
11.    CICIDS         Canadian Institute for Cybersecurity Intrusion Detection System
12.    DNN            Deep Neural Networks
13.    IoT            Internet of Things
14.    GridSearchCV   Grid Search Cross-Validation
15.    gNB            Gaussian Naive Bayes
16.    SHAP           SHapley Additive exPlanations
17.    LIME           Local Interpretable Model-agnostic Explanations
18.    AI             Artificial Intelligence
19.    XAI            Explainable Artificial Intelligence
20.    CSV            Comma-Separated Values

CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

In the present digital age, where communication and data exchange rely
extensively on computer networks, cybersecurity has become a pressing concern.
The rapid development of digital infrastructure, cloud computing, and Internet of
Things (IoT) devices has simultaneously increased the surface area for potential
cyber-attacks. Organizations, governments, and individuals alike face constant
threats from adversaries seeking unauthorized access, data manipulation, denial
of service, and more sophisticated cybercrimes.

1.2 OVERVIEW

The primary aim of this project is to design a system that can distinguish
between normal and malicious network traffic with a high degree of accuracy,
using a real-world inspired dataset and various supervised learning
techniques.

The system development cycle includes:

● Data Collection and Preprocessing.
● Model Training and Evaluation.

1.3 PROBLEM STATEMENT

Cyberattacks today are not only more frequent but also more intelligent
and targeted. Insider threats, unauthorized access, phishing, ransomware, and
advanced persistent threats (APTs) can easily bypass conventional
signature-based IDS, putting sensitive data and operations at risk.

Problem Definition:

To protect a computer network from unauthorized access and ensure the
privacy and integrity of data, an intelligent threat detection system is required.
The objective is to develop and evaluate a machine learning-based network
intrusion detection model that accurately classifies network traffic into 'normal'
or 'malicious' categories.

1.4 EXISTING SYSTEM

The traditional intrusion detection systems (IDS) can be broadly categorized into
two types: signature-based detection and anomaly-based detection.

1.4.1 SIGNATURE-BASED DETECTION

Signature-based IDS operate by detecting known attack patterns or
"signatures." They are effective for known threats and are commonly used in
antivirus software.

However, their primary limitation lies in their inability to detect:

● Zero-day attacks.
● Polymorphic malware (malware that changes its code to evade detection).
● Evolving attack patterns.

1.4.2 ANOMALY-BASED DETECTION

Anomaly-based IDS attempt to detect deviations from normal behaviour,
making them capable of identifying novel attacks. Despite their potential, they
suffer from:

● High false positive rates.
● Difficulty in accurately modelling "normal" behaviour in diverse
network environments.
● Lack of interpretability in the decision-making process.

1.5 PROPOSED SYSTEM

To overcome the limitations of existing systems, the proposed solution is
to use machine learning-based models that learn from historical data to identify
patterns associated with malicious and normal traffic. Unlike rule-based
systems, ML models can generalize from the training data and may detect
previously unseen types of attacks.

1.5.1 SYSTEM OBJECTIVES

● Automate the detection of network intrusions using ML.


● Reduce dependency on manual rules and human intervention.
● Improve accuracy and reduce the false positive rate.
● Determine the most efficient and accurate algorithm for real-world use.

1.5.2 IMPLEMENTATION STRATEGY


The proposed system will undergo the following stages:

● Data Loading and Exploration: Load the KDD Cup dataset and explore its
structure and distribution.
● Data Preprocessing: Clean the dataset, handle missing values, encode
categorical features, and normalize data.
● Model Selection: Apply and evaluate the performance of Logistic
Regression, Decision Tree, Random Forest, Naive Bayes, ANN, SVM, and
Gradient Boosting.

1.5.3 ADVANTAGES OF THE PROPOSED SYSTEM


● Adaptability: ML models can be retrained on new data to adapt to evolving
attack patterns.
● Automation: Reduces manual oversight and rule configuration.
● Accuracy: Capable of higher accuracy in detecting complex threats.

● Scalability: Can be scaled to large datasets and complex networks.

1.6 SIGNIFICANCE OF THE STUDY


This study is significant in the realm of cybersecurity for multiple reasons:
● Academic Relevance: Provides insights into applying supervised ML
algorithms in network security.
● Industrial Use: The findings can be directly applied to organizations aiming
to implement smart intrusion detection systems.

1.7 APPLICATIONS OF THE STUDY


● Enterprise Networks: Real-time detection and prevention of data breaches and
intrusions.
● Military and Defence: Securing confidential communication and operational
data.
● Financial Institutions: Safeguarding transactional data from fraud and
malicious attacks.

1.8 LIMITATIONS OF THE STUDY


Although the proposed system holds significant promise, there are certain
limitations:
● The system is trained on the KDD Cup dataset, which may not fully represent
real-world traffic today.

1.9 SUMMARY

This chapter laid the foundation for understanding the scope and
significance of network intrusion detection using machine learning. It introduced
the core problem, discussed the limitations of existing systems, and proposed a
comprehensive solution. It also highlighted the importance of applying machine
learning models for performance comparison and emphasized the potential for
future improvement through the use of real-time datasets and continuous system
updates.

CHAPTER 2

LITERATURE SURVEY

2.1 INTRODUCTION

A literature review is a vital component of academic research that
establishes the context and significance of the study. It synthesizes the findings
and methodologies of prior research in the field and highlights the strengths,
limitations, and gaps in existing work. This chapter provides an in-depth review
of the key contributions in the area of intrusion detection systems (IDS), focusing
on the role of machine learning (ML) techniques in detecting network threats. It
also presents a comparative understanding of algorithms applied to benchmark
datasets such as the KDD Cup 1999, NSL-KDD, and CICIDS2017. Intrusion
detection is an active area of research that integrates concepts from networking,
cybersecurity, statistics, and artificial intelligence. The growing sophistication of
cyberattacks necessitates the development of adaptive and intelligent systems
capable of detecting both known and novel intrusions.

2.2 NETWORK IDS: OVERVIEW

Intrusion Detection Systems (IDS) are software applications or hardware
devices that monitor network traffic for suspicious activity. IDS can be classified
into the following types:

● Signature-based IDS: Relies on predefined rules or known patterns of attacks.
● Anomaly-based IDS: Detects deviations from the normal behaviour of a
system.
● Hybrid IDS: Combines the strengths of signature and anomaly-based systems
to improve detection performance.

2.3 DATASETS FOR INTRUSION DETECTION RESEARCH

2.3.1 KDD Cup 1999 Dataset

The KDD Cup 1999 dataset is a benchmark dataset widely used in network
intrusion detection research. It contains simulated traffic data, including both
normal and malicious connections, categorized into four main types of attacks:

● Denial of Service (DoS)
● Remote to Local (R2L)
● User to Root (U2R)
● Probing

The dataset consists of 41 features per connection record and approximately
4.9 million instances.

2.3.2 NSL-KDD Dataset

To address the shortcomings of the KDD dataset, the NSL-KDD dataset was
introduced. It removes redundant records and offers a more balanced distribution,
enabling better performance evaluation of machine learning models.

2.3.3 CICIDS2017

This is a modern dataset designed to represent contemporary network traffic,
including web attacks, botnets, and brute-force login attempts. It reflects
real-world traffic and is considered more representative of current threats.

2.4 REVIEW OF RELATED WORK

2.4.1 Decision Tree Algorithms

P. Garcia-Teodoro et al. (2009) discussed the use of decision trees in
anomaly-based IDS, noting their ability to handle non-linear relationships and
provide interpretable models. Decision trees such as C4.5 and CART have been
widely used.

2.4.2 Random Forest Classifier

Breiman (2001) introduced the Random Forest algorithm, an ensemble method
that builds multiple decision trees and aggregates their outputs. Random Forests
are robust to overfitting and perform well on high-dimensional data.

2.4.3 Naive Bayes Classifier

Revathi and Malathi (2013) utilized Naive Bayes for anomaly detection and
reported high precision for certain types of attacks, although performance was
lower for complex attack vectors like U2R.

2.4.4 Support Vector Machine (SVM)

Mukkamala et al. (2005) demonstrated the effectiveness of SVM in detecting
intrusions and emphasized the importance of kernel selection. The RBF kernel
was shown to perform best across several attack types.

2.4.5 Artificial Neural Network (ANN)

Sangkatsanee et al. (2011) employed ANN on the KDD dataset and observed
that while the model achieved high accuracy for known attacks, it suffered from
computational inefficiency for large-scale datasets.

2.4.6 Logistic Regression

Dhanabal and Shantharajah (2015) applied Logistic Regression on the KDD
dataset and found it useful for distinguishing between normal and DoS traffic but
less effective for complex multi-class intrusion detection.

2.4.7 Gradient Boosting Classifier

Zhang et al. (2019) applied Gradient Boosting to the NSL-KDD dataset and
achieved high detection accuracy, especially for rare attacks like U2R and R2L,
although the method requires substantial computational resources and training
data.

2.5 COMPARATIVE ANALYSIS OF PREVIOUS WORK

Author/Year                | Algorithm(s) Used  | Dataset | Accuracy               | Comments
---------------------------+--------------------+---------+------------------------+------------------------------------------------
Mukkamala et al. (2005)    | SVM                | KDD     | ~95%                   | High accuracy but computationally expensive
Shyu et al. (2003)         | Decision Tree      | KDD     | ~92%                   | Interpretable but struggles with minority classes
Kumar and Sharma (2017)    | Random Forest, SVM | KDD     | RF: 97.4%, SVM: 94.2%  | Random Forest showed better results
Kim et al. (2016)          | DNN                | NSL-KDD | 98.3%                  | Excellent results but high complexity
Sangkatsanee et al. (2011) | KNN                | KDD     | 93%                    | Effective but slow for large datasets

[1] Dhanabal, L., & Shantharajah, S. P. (2015). A Study on NSL-KDD
Dataset for Intrusion Detection System Based on Classification Algorithms.
International Journal of Advanced Research in Computer and
Communication Engineering, 4(6), 446–452.

This paper investigates the effectiveness of several machine learning
classification algorithms on the NSL-KDD dataset, a refined version of the KDD
Cup 1999 dataset designed to address data imbalance and redundancy. The
authors compare algorithms including Naive Bayes, J48 (a Decision Tree
variant), Random Forest, and Support Vector Machines (SVM) to determine
which classifier performs best in detecting various types of network intrusions.
The study reveals that Random Forest and J48 offer better accuracy and lower
false positive rates compared to other classifiers, making them more suitable for
real-world intrusion detection systems.

[2] Buczak, A. L., & Guven, E. (2016). A Survey of Data Mining and Machine
Learning Methods for Cyber Security Intrusion Detection. IEEE
Communications Surveys & Tutorials, 18(2), 1153–1176.

Buczak and Guven provide an extensive review of various data mining
and machine learning techniques applied to cyber security, particularly intrusion
detection systems (IDS). The paper categorizes algorithms into supervised,
unsupervised, and semi-supervised learning, discussing their suitability for
anomaly-based and signature-based detection. It also outlines common
challenges in intrusion detection, such as data imbalance, evolving threats, and
the lack of labelled datasets.

[3] Revathi, S., & Malathi, A. (2013). A Detailed Analysis on NSL-KDD
Dataset Using Various Machine Learning Techniques for Intrusion
Detection. International Journal of Engineering Research and Technology,
2(12), 1848–1853.

Revathi and Malathi focus on using multiple machine learning techniques
to analyse the NSL-KDD dataset for intrusion detection. Their work evaluates the
performance of several classifiers such as Decision Trees, Random Forests,
Support Vector Machines, and k-Nearest Neighbours. They examine the impact
of feature selection and data normalization on detection accuracy and
computational efficiency. Their results show that Random Forest achieves the
best overall performance, with high accuracy and reduced training time.

2.6 SUMMARY

This chapter gave a detailed view of the work done by authors in the
field of network intrusion detection and machine learning-based detection methods.

CHAPTER 3

METHODOLOGY

3.1 INTRODUCTION

Methodology is a systematic way to find out the result of a given problem.
It is defined as the study of methods by which knowledge is gained, and its aim
is to give the work plan of the research. This chapter describes the methodology
adopted for building the network intrusion detection system. The key algorithms
used are Logistic Regression, Decision Tree Classifier, Random Forest Classifier,
Artificial Neural Network (ANN), Naive Bayes Classifier, Gradient Boosting
Classifier and Support Vector Machine (SVM). Feature selection techniques are
employed to enhance the model's accuracy and reduce false positives.

The system model has four phases:

● Data Preprocessing: Clean and normalize the KDD dataset, handle missing
values, and select relevant features for training.
● Model Training: Train the classifiers (Logistic Regression, Decision Tree,
Random Forest, ANN, Naive Bayes, Gradient Boosting Classifier, SVM) on
the preprocessed data.
● Model Evaluation: Evaluate the models using metrics such as accuracy,
precision, recall, and false positive rate.
● Optimization: Tune hyperparameters and perform cross-validation to improve
model performance.

3.1.1 ALGORITHM:

● Step 1: Preprocess the KDD dataset (handle missing values, normalization,
feature selection).
● Step 2: Train the models using Logistic Regression, Decision Tree, Random
Forest, ANN, Naive Bayes, Gradient Boosting Classifier and SVM.
● Step 3: Evaluate each model’s performance on the testing set.
● Step 4: Optimize hyperparameters to improve results.
● Step 5: Plot and evaluate the model performance with graphs.
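
Steps 1 to 4 can be sketched in a few lines with Scikit-learn. This is a minimal illustration under stated assumptions, not the project's actual code: the data is a synthetic stand-in generated with make_classification, only Random Forest is tuned, and the parameter grid values are chosen arbitrarily to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed KDD features (assumption, not real data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=42)

# Hold out a test set before any fitting; Step 3 evaluates on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Steps 1-2: scaling and the model live in one pipeline, so the scaler is
# fitted only on the training portion of each cross-validation fold.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Step 4: a small illustrative grid searched with 3-fold cross-validation.
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

# Step 3: evaluate the tuned model on the untouched test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("best parameters:", search.best_params_)
print("test accuracy: %.3f" % test_accuracy)
```

Wrapping the scaler in the pipeline is a design choice that keeps test-fold statistics out of the fitted scaler during cross-validation.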

3.2 METHODOLOGY

In this project, we develop a network threat hunting and detection system
that can distinguish between 'good normal connections' and 'bad malicious
connections'. To ensure a fair and effective evaluation, all models are trained on
the same pre-processed dataset and assessed using consistent performance
metrics such as accuracy, precision, recall, F1-score, and training time.

Fig 3.1: Block diagram

3.2.1 DATASET DESCRIPTION

The KDD Cup 1999 dataset is a benchmark dataset widely used for
evaluating intrusion detection systems. It contains approximately 5 million
connection records, each with 41 features and labelled as either normal or a
specific attack type.

These attacks are grouped into four main categories:

● DoS (Denial of Service)
● R2L (Remote to Local)
● U2R (User to Root)
● Probe
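
As an illustration of this grouping, the attack labels in the training data can be mapped to the four categories with a simple lookup. The table below follows the commonly used assignment of the 22 attack names in the KDD Cup 1999 training set. The helper names to_category and to_binary are hypothetical, and any label outside the table is reported as "unknown" rather than guessed.

```python
# Attack names from the KDD Cup 1999 training labels, grouped by category.
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
            "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
}

# Invert the table so each attack name points at its category.
LABEL_TO_CATEGORY = {
    name: category
    for category, names in ATTACK_CATEGORIES.items()
    for name in names
}


def to_category(raw_label):
    """Map a raw label such as 'smurf.' to 'DoS', 'normal', or 'unknown'."""
    label = raw_label.strip().rstrip(".")  # raw labels often end with a period
    if label == "normal":
        return "normal"
    return LABEL_TO_CATEGORY.get(label, "unknown")


def to_binary(raw_label):
    """Collapse every non-normal label into 'malicious'."""
    return "normal" if to_category(raw_label) == "normal" else "malicious"


for example in ["normal.", "smurf.", "rootkit.", "guess_passwd."]:
    print(example, "->", to_category(example), "/", to_binary(example))
```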

3.2.2 DATA PREPROCESSING

Before training the machine learning models, the dataset was pre-processed
with the following steps:
● Removal of duplicate and null values.
● Label Encoding of categorical features (e.g., protocol type, service, flag).
● One-Hot Encoding for nominal features to avoid introducing ordinality.
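
The difference between the two encodings is easiest to see on a toy example. The sketch below uses only the Python standard library and a handful of made-up rows. It is not the project's code, and the function names are hypothetical. Label encoding assigns one integer per category (in sorted order, as Scikit-learn's LabelEncoder does), while one-hot encoding gives each category its own 0/1 column so that no order is implied.

```python
def drop_duplicates_and_nulls(rows):
    """Keep the first copy of each row and skip rows with missing fields."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row)
        if key in seen or any(value is None or value == "" for value in row):
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def label_encode(values):
    """Map each category to an integer code, using sorted category order."""
    classes = sorted(set(values))
    index = {category: code for code, category in enumerate(classes)}
    return [index[value] for value in values], classes


def one_hot_encode(values):
    """Give every category its own 0/1 column."""
    codes, classes = label_encode(values)
    vectors = [[1 if code == column else 0 for column in range(len(classes))]
               for code in codes]
    return vectors, classes


# Made-up rows: (protocol_type, service, flag).
rows = [
    ("tcp", "http", "SF"),
    ("udp", "domain_u", "SF"),
    ("tcp", "http", "SF"),   # exact duplicate of the first row
    ("icmp", None, "SF"),    # missing service value
    ("icmp", "ecr_i", "SF"),
]

cleaned = drop_duplicates_and_nulls(rows)
protocols = [row[0] for row in cleaned]
codes, classes = label_encode(protocols)
one_hot, _ = one_hot_encode(protocols)
print(classes, codes, one_hot)
```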

3.2.3 MACHINE LEARNING MODELS USED

The following machine learning algorithms were implemented and evaluated:

● Logistic Regression
● Decision Tree Classifier
● Random Forest Classifier
● Artificial Neural Network (ANN)
● Naive Bayes Classifier
● Support Vector Machine (SVM)
● Gradient Boosting Classifier

3.2.4 MODEL EVALUATION METRICS

The models were evaluated based on the following key performance metrics:

● Accuracy: Measures the overall correctness of the model.
● Precision: Ratio of true positives to all predicted positives. Important in
reducing false alarms.
● Recall: Ratio of true positives to all actual positives. Important to catch all
attacks.
● F1-Score: Harmonic mean of precision and recall.
● Confusion Matrix: Table of predicted classes against actual classes, showing
where the model's predictions are correct and where they are wrong.
● Training Time: Computational time taken to fit the model on the dataset.
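
The metrics above all follow from the four confusion-matrix counts. The sketch below computes them from two lists of labels using plain Python. The sample labels are invented for illustration, "malicious" is assumed to be the positive class, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="malicious"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred, positive="malicious"):
    """Derive the headline metrics from the confusion-matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    return {
        "accuracy": safe_divide(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_divide(2 * precision * recall, precision + recall),
        "false_positive_rate": safe_divide(fp, fp + tn),
    }


# Invented labels: 4 malicious and 4 normal connections.
y_true = ["malicious", "malicious", "malicious", "normal",
          "normal", "normal", "normal", "malicious"]
y_pred = ["malicious", "malicious", "normal", "normal",
          "normal", "malicious", "normal", "malicious"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```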

3.3 SUMMARY

This chapter described the dataset, preprocessing techniques, machine learning
models, and evaluation strategy adopted for this project. The next chapter will
delve into the system design and architecture, detailing how the models were
implemented and visualized.

CHAPTER 4

SYSTEM ARCHITECTURE

4.1 INTRODUCTION

System architecture provides a high-level overview of how the proposed
Network Threat Detection System is designed and how its components interact
with one another. It serves as the blueprint for the software structure, data flow,
and the integration of various machine learning components used for threat
identification. A well-defined architecture not only aids in efficient
implementation but also improves scalability, maintainability, and clarity of the
entire system.

4.2 SYSTEM ARCHITECTURE

This chapter presents the architectural design of the proposed Network
Threat Detection System. The architecture outlines the data flow, modular
components, and system interactions during the threat detection process. The
proposed system architecture is designed to manage the end-to-end pipeline of
data flow in a network intrusion detection system powered by machine learning.
It spans from raw data acquisition to final threat classification and reporting.

4.3 ARCHITECTURE DESCRIPTION

The detailed functioning of the architecture can be broken down into the
following stages:

Data Collection Layer:

The initial step of the architecture involves collecting the dataset. For this
project, the KDD Cup 1999 dataset has been used, which simulates a military
network environment and includes both normal and intrusive traffic.

Preprocessing Module:

Data preprocessing is critical in any machine learning pipeline. The raw
network traffic data contains categorical values (e.g., protocol type, service name)
that cannot be directly used by machine learning models. Therefore, encoding
methods such as Label Encoding and One-Hot Encoding are applied to convert
categorical variables into numerical format.

Modelling Layer:

The modelling layer is the heart of the system, where the actual machine
learning algorithms are applied to detect intrusions. After the preprocessing stage,
the cleaned and structured data is passed into this layer. Multiple classification
models are implemented here, including Logistic Regression, Decision Tree,
Random Forest, Naive Bayes, Artificial Neural Network (ANN), Support Vector
Machine (SVM), and Gradient Boosting Classifier.

Detection Engine:

The Detection Engine is a critical component of the system architecture
responsible for analysing new network traffic data and determining whether it
constitutes normal activity or a security threat. It serves as the runtime inference
mechanism, using the best-performing machine learning model trained during the
model selection phase to classify network events.

Output Layer:

The output layer represents the final stage of the architecture, where the
system presents the results of the threat detection process. After the modelling
layer predicts whether a network instance is normal or malicious, the output is
displayed in a user-interpretable format. The output layer plays a crucial role in
bridging the gap between the system’s backend processing and the end user. It
ensures that the decisions made by the machine learning model are transparent,
actionable, and easy to understand.

4.4 SYSTEM MODULES

The system architecture is implemented in a modular fashion, with each
module performing a specific and crucial function within the machine learning
pipeline. The modular structure promotes clarity, ease of maintenance, and
extensibility, allowing for future upgrades such as integration with real-time
network traffic, additional algorithms, or new evaluation metrics. Below is an
expanded explanation of each module:

Data Ingestion:

The Data Ingestion module is responsible for importing the dataset into the
system. In this project, the input dataset is in CSV format, specifically the KDD
Cup 1999 dataset. This module uses Pandas, a widely adopted Python library, to
read and structure the dataset into a DataFrame. It verifies data integrity and
consistency during loading by checking column headers, data types, and the
presence of null or malformed values.
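
A loading step with these checks might look like the sketch below. It is illustrative only. The column list is shortened to five real KDD feature names plus the label (the full file has 41 features), the three sample rows are made up, and load_dataset is a hypothetical helper rather than the project's function.

```python
import io

import pandas as pd

# Shortened column list for illustration; the real file has 41 features + label.
COLUMNS = ["duration", "protocol_type", "service", "flag", "src_bytes", "label"]

# Made-up rows in the headerless layout used by the KDD Cup 1999 data files.
SAMPLE_CSV = """\
0,tcp,http,SF,181,normal.
0,icmp,ecr_i,SF,1032,smurf.
0,tcp,private,S0,0,neptune.
"""


def load_dataset(source, columns):
    """Read a headerless CSV, then run basic integrity checks."""
    frame = pd.read_csv(source, header=None)
    if frame.shape[1] != len(columns):
        raise ValueError(
            "expected %d columns, found %d" % (len(columns), frame.shape[1]))
    frame.columns = columns
    if frame.isnull().values.any():
        raise ValueError("dataset contains missing values")
    return frame


frame = load_dataset(io.StringIO(SAMPLE_CSV), COLUMNS)
print(frame.dtypes)                    # inspect inferred data types
print(frame["label"].value_counts())   # inspect the class distribution
```

Reading without names first and assigning them afterwards lets the column-count check catch files with too many or too few fields.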

Data Preprocessing:

Once the data has been ingested, the preprocessing module prepares it for
machine learning model training. This involves cleaning the data, transforming
categorical features, and scaling numerical values. Key preprocessing steps
include:

● Label Encoding: Converts categorical string values (such as 'tcp', 'http', or 'SF')
into numerical codes using Scikit-learn’s LabelEncoder.
● One-Hot Encoding (if needed): Applied to nominal variables to avoid
introducing ordinality where it doesn't exist.
● Feature Scaling: Uses MinMaxScaler to normalize feature values between 0
and 1. This is especially important for scale-sensitive algorithms such as ANN
or SVM.
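
Min-max scaling maps each feature x to (x - min) / (max - min), with min and max taken from the training data. The plain-Python sketch below mirrors that behaviour on invented numbers. The function names are hypothetical and this is not Scikit-learn's implementation. Test values outside the training range fall outside [0, 1], which is also what MinMaxScaler does by default.

```python
def fit_min_max(train_rows):
    """Record the (min, max) of every feature column in the training data."""
    columns = list(zip(*train_rows))
    return [(min(column), max(column)) for column in columns]


def transform_min_max(rows, bounds):
    """Scale each value with the bounds learned from the training data."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (low, high) in zip(row, bounds):
            span = high - low
            # A constant column has zero span; map it to 0.0.
            scaled_row.append((value - low) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Invented two-feature data, e.g. (duration, src_bytes).
train = [[0, 100], [5, 300], [10, 200]]
test = [[2, 400]]

bounds = fit_min_max(train)                  # learned on training data only
train_scaled = transform_min_max(train, bounds)
test_scaled = transform_min_max(test, bounds)
print(bounds, train_scaled, test_scaled)
```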

Model Training & Selection:

This module is the core of the machine learning system. It implements and
trains several supervised classification algorithms using Scikit-learn. Each model
is evaluated to determine its effectiveness in classifying network traffic as either
normal or malicious.

The models implemented include:

● Logistic Regression
● Decision Tree Classifier
● Random Forest Classifier
● Gaussian Naive Bayes
● Artificial Neural Network (ANN)
● Support Vector Machine (SVM)
● Gradient Boosting Classifier

Evaluation:

To assess model performance, this module calculates a range of metrics that
indicate how well the model is able to classify both normal and malicious traffic.
These metrics include:

● Accuracy: Proportion of total correct predictions over total data instances.
● Precision: The ratio of true positives to the sum of true and false positives.
● Recall: The ratio of true positives to the sum of true positives and false
negatives.
● F1-Score: Harmonic mean of precision and recall, particularly useful for
imbalanced datasets.
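
Written as formulas, with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives, these definitions read as follows. The worked numbers are invented for illustration and are not results from this project.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}
                         {\text{Precision} + \text{Recall}}
\end{align*}

% Illustrative example: TP = 90, FP = 10, FN = 30, TN = 870
\begin{align*}
\text{Accuracy}  &= \frac{90 + 870}{1000} = 0.96 \\
\text{Precision} &= \frac{90}{100} = 0.90 \\
\text{Recall}    &= \frac{90}{120} = 0.75 \\
F_1              &= \frac{2 \cdot 0.90 \cdot 0.75}{0.90 + 0.75} \approx 0.82
\end{align*}
```

In this example accuracy is high mainly because normal traffic dominates, while recall shows that a quarter of the attacks are missed. This is why F1-score is the more informative figure on imbalanced data.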

4.5 DATA FLOW DIAGRAM

The data flow of the proposed system outlines the step-by-step process
from data input to threat detection. It begins with the ingestion of the KDD Cup
1999 dataset, followed by preprocessing which includes encoding, normalization,
and data cleaning. The dataset is then split into training and testing sets. Multiple
machine learning algorithms are trained and evaluated using metrics like
accuracy and F1-score. The best-performing model is selected and used for
prediction on new data.

[User Input (KDD Dataset)]
            ↓
[Preprocessing Module]
            ↓
[Training/Test Split]
            ↓
[Machine Learning Models]
            ↓
[Model Evaluation & Comparison]
            ↓
[Best Model → Prediction → Output]

Fig 4.1: Data Flow Diagram
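
The flow in Fig 4.1 can be sketched end to end with Scikit-learn. The code below is a hedged illustration, not the project's source. It uses synthetic data from make_classification in place of the KDD records, MLPClassifier stands in for the ANN (an assumption, since the report does not name the ANN implementation), and all hyperparameters are arbitrary.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset (assumption).
X, y = make_classification(n_samples=800, n_features=20, n_informative=8,
                           random_state=0)

# Training/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Candidate models; settings are illustrative defaults.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "ANN (MLP stand-in)": MLPClassifier(hidden_layer_sizes=(32,),
                                        max_iter=500, random_state=0),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Model evaluation and comparison.
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    predictions = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
        "train_seconds": train_seconds,
    }

# Best model -> prediction -> output.
best_name = max(results, key=lambda name: results[name]["f1"])
best_model = models[best_name]
new_prediction = best_model.predict(X_test[:1])
print("best model:", best_name, results[best_name])
print("prediction for one new record:", new_prediction[0])
```

Selecting on F1-score rather than accuracy is one reasonable choice here. The ranking on synthetic data says nothing about the ranking on the real dataset.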

4.6 SUMMARY

This chapter provided a detailed overview of the system architecture used
to build the machine learning-based network intrusion detection system. It
covered the conceptual design, functional modules, data flow, and operational
logic behind each stage of the pipeline. The layered structure promotes
modularity and extensibility, which is essential for adapting to evolving
cybersecurity challenges.

CHAPTER 5

SYSTEM IMPLEMENTATION

5.1 INTRODUCTION

This chapter presents the practical implementation of the network threat
detection system using Python and machine learning libraries. It includes a
breakdown of the core code, libraries used, key outputs, and insights derived from
model performance.

5.2 SYSTEM IMPLEMENTATION

The system is designed to distinguish between "good normal connections"
and "bad malicious connections". The proposed system has the following
modules:

● Import the Dataset.
● Dataset Preprocessing.
● Model Training.
● Classification and Evaluation.

5.3 DATA COLLECTION AND LOADING

The first step in the system implementation was the acquisition of the KDD
Cup 1999 dataset, which is a widely used benchmark dataset for evaluating
network intrusion detection systems. This dataset consists of numerous network
connection records, each described by 41 features and labeled either as normal or
as a specific attack type. The dataset was loaded into the environment and
converted into a structured format suitable for machine learning processes.
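
As a small illustration of the record layout, the sketch below parses two shortened, made-up records and maps each raw label to an attack category and to a normal/malicious flag. Real rows carry 41 features before the label; the mapping is a subset of the one listed in the source code of Section 7.1, and the function name parse_label is an assumption.

```python
import csv
import io

# Subset of the attack-to-category mapping listed in Section 7.1
ATTACK_CATEGORY = {
    "normal": "normal",
    "neptune": "dos",
    "smurf": "dos",
    "ipsweep": "probe",
    "guess_passwd": "r2l",
    "rootkit": "u2r",
}


def parse_label(raw: str) -> tuple[str, str, bool]:
    """Return (attack name, category, is_malicious) for a raw label such as 'smurf.'."""
    name = raw.strip().rstrip(".")  # labels in the data file end with a dot
    category = ATTACK_CATEGORY[name]
    return name, category, category != "normal"


# Two shortened, made-up records (only 6 of the 41 features are shown)
sample = io.StringIO(
    "0,tcp,http,SF,181,5450,normal.\n"
    "0,icmp,ecr_i,SF,1032,0,smurf.\n"
)
for row in csv.reader(sample):
    *features, raw_label = row
    print(features[:3], parse_label(raw_label))
```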

5.4 DATA PREPROCESSING

The KDD Cup 1999 dataset, which contains network connection records,
requires several preprocessing steps before it is suitable for training
machine learning models. Categorical features are first converted into numerical
form: in the implementation listed in Section 7.1, protocol_type and flag are
mapped to integer codes (label encoding), while the service feature is dropped.
The feature scaling step then normalizes all attributes to the 0 to 1 range
(min-max scaling), so that features with large raw values do not dominate the
others. Preprocessing plays a critical role in ensuring the quality and accuracy
of machine learning models.
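
The two preprocessing operations can be illustrated in plain Python. This is a minimal sketch with made-up sample values; the helper names are assumptions, and the project itself relies on library routines for the same steps.

```python
def min_max_scale(values):
    """Scale numbers linearly into [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_categories(values, mapping=None):
    """Replace each category with an integer code; build the mapping if none is given."""
    if mapping is None:
        mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


src_bytes = [0, 181, 1032, 5450]  # made-up byte counts
scaled = min_max_scale(src_bytes)
codes, mapping = encode_categories(["tcp", "udp", "icmp", "tcp"])
print(scaled)
print(codes, mapping)
```

With alphabetical ordering, the generated mapping assigns icmp, tcp and udp the codes 0, 1 and 2. In practice, the minimum and maximum are usually taken from the training split only, so that the test data does not influence the scaling.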

5.5 MODEL TRAINING AND SELECTION

● Logistic Regression: A linear model used for binary classification tasks. It
predicts the probability of a class and is known for its interpretability and
simplicity.

● Decision Tree Classifier: A tree-structured classifier that splits the dataset
based on feature values. It is easy to understand and visualize.

● Random Forest Classifier: An ensemble method that constructs multiple
decision trees and merges their outcomes to improve accuracy and reduce
overfitting.

● Artificial Neural Network (ANN): A computational model inspired by the
human brain, consisting of interconnected layers of nodes (neurons). It learns
complex patterns by adjusting weights through backpropagation and
optimization.

● Naive Bayes: A probabilistic classifier based on Bayes’ Theorem with the
assumption of independence between features.

● Support Vector Machine (SVM): A robust algorithm that finds the optimal
hyperplane that separates classes in a high-dimensional space.

● Gradient Boosting Classifier: An ensemble learning technique that builds
models sequentially, where each new model corrects the errors of the previous
ones.
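
To make the ensemble idea behind the Random Forest concrete, the sketch below combines three hand-written decision rules by majority vote. The rules, thresholds and feature values are hypothetical; a real forest learns its trees from bootstrap samples of the training data, and library implementations may average class probabilities rather than count hard votes.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical one-split "trees" over a feature dictionary
def stump_a(x):
    return "attack" if x["serror_rate"] > 0.5 else "normal"


def stump_b(x):
    return "attack" if x["count"] > 100 else "normal"


def stump_c(x):
    return "attack" if x["logged_in"] == 0 else "normal"


def forest_predict(x, trees=(stump_a, stump_b, stump_c)):
    """Merge the outcomes of all trees into a single prediction."""
    return majority_vote([tree(x) for tree in trees])


print(forest_predict({"serror_rate": 0.9, "count": 250, "logged_in": 1}))
print(forest_predict({"serror_rate": 0.0, "count": 3, "logged_in": 0}))
```

A single rule that misfires (the third rule in the first call, the third rule again in the second) is outvoted by the others, which is the intuition behind the reduced overfitting of the ensemble.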

5.5.1 PERFORMANCE EVALUATION

The system employs several machine learning algorithms to detect network
intrusions, each with its own approach to processing and classifying the data.
The Gaussian Naive Bayes (GNB) classifier uses probability theory to predict
the class under the assumption of feature independence, achieving a testing
accuracy of 99.3%. The Decision Tree (DT) classifier splits the data at each node
based on the most significant feature, which results in clear and interpretable
decision paths while maintaining a testing accuracy of 99.3%. The Random Forest
(RF) model improves accuracy by constructing multiple decision trees and
combining their results, offering the highest performance with a testing accuracy
of 99.9%. The Support Vector Machine (SVM) maximizes the margin between
different classes in a high-dimensional space, achieving 99.8% accuracy, but at
a high computational cost (about 600 s of training and 62 s of testing), which is
not ideal for real-time applications. The Logistic Regression (LR) classifier, a
simpler linear model, maintains solid accuracy at 99.2% with minimal
computational requirements. The Gradient Boosting Classifier (GBC) builds an
ensemble of weak learners to improve accuracy iteratively, producing a testing
accuracy of 99.9%, though at a higher computational cost than Random Forest.
Lastly, the Artificial Neural Network (ANN) uses multiple layers to model
complex relationships within the data, achieving 99.8% accuracy but requiring
considerable training and testing time due to its complexity.
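
Training and testing times such as those reported here can be measured with a small wrapper around any callable. The sketch below is an assumption-laden illustration: the helper name timed is hypothetical, and it uses time.perf_counter, whereas the project code in Section 7.1 uses time.time.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the project this would be model.fit or model.predict
result, elapsed = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.4f} s")
```

Measured times depend on the hardware and on the load of the machine, so they are best compared between models run in the same environment.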

Algorithm                       Training    Testing     Training    Testing
                                Accuracy    Accuracy    Time (s)    Time (s)
Random Forest                   0.999       0.999       6.3         0.25
Logistic Regression             0.992       0.992       6.8         0.02
Decision Tree                   0.993       0.993       0.9824      0.0214
ANN                             0.99        0.998       997.1       10.33
Naive Bayes                     0.99        0.993       0.982       0.02
Gradient Boosting Classifier    0.999       0.999       281.1       0.7
SVM                             0.998       0.998       599.6       62.21
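
One way to turn the comparison into a selection is to rank the models by testing accuracy and break ties by total running time. The rule below is an assumption made for illustration, not necessarily the procedure followed in the project; the numbers are copied from the table above.

```python
# (name, testing accuracy, training time in s, testing time in s)
results = [
    ("Random Forest", 0.999, 6.3, 0.25),
    ("Logistic Regression", 0.992, 6.8, 0.02),
    ("Decision Tree", 0.993, 0.9824, 0.0214),
    ("ANN", 0.998, 997.1, 10.33),
    ("Naive Bayes", 0.993, 0.982, 0.02),
    ("Gradient Boosting Classifier", 0.999, 281.1, 0.7),
    ("SVM", 0.998, 599.6, 62.21),
]

# Highest testing accuracy first; ties are broken by the lower total time
best = min(results, key=lambda r: (-r[1], r[2] + r[3]))
print(best[0])
```

Under this rule, Random Forest and Gradient Boosting tie on accuracy, and Random Forest is preferred because its combined training and testing time is far lower.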

5.5.2 USER INTERFACE

Though not the primary focus of this project, the results from the best-performing
model can be integrated into a lightweight user interface or dashboard using
frameworks like Streamlit or Flask. This would allow real-time monitoring of
network connections and flagging of suspicious activity based on model
predictions.
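
Independently of the chosen framework, the interface needs a small step that turns a model prediction into something a dashboard can display. The following sketch shows one possible shape for that step; the function name and the JSON field names are hypothetical, and a Streamlit or Flask front end would simply call it for each classified connection.

```python
import json


def build_alert(connection_id: str, predicted_label: str) -> str:
    """Serialize one classified connection as JSON for a dashboard to display."""
    return json.dumps({
        "connection_id": connection_id,
        "predicted_label": predicted_label,
        # Anything other than 'normal' is flagged as suspicious
        "suspicious": predicted_label != "normal",
    })


print(build_alert("conn-001", "dos"))
```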

5.6 SUMMARY

This chapter provided an overview of the implementation process for the
network threat detection system. It explained each phase, including data
preprocessing, model training, evaluation, and selection. The next chapter
summarizes the conclusions drawn from the study and discusses possible future
enhancements.

CHAPTER 6

CONCLUSION AND FUTURE WORK

6.1 CONCLUSION

In this project, we designed and implemented a machine learning-based
system for network threat hunting and detection using the KDD Cup 1999 dataset.
The primary objective was to accurately classify network connections as either
normal or malicious using multiple machine learning models and determine
which model offered the best combination of accuracy and computational
efficiency.

Several supervised machine learning classifiers were explored, including
Logistic Regression, Decision Tree, Random Forest, Naive Bayes, Support Vector
Machine (SVM), Artificial Neural Networks (ANN), and Gradient Boosting. Each
model was evaluated based on its training accuracy, testing accuracy, and
computational time.

6.2 FUTURE WORK

While the current system offers promising results, there are several avenues
for future improvement and enhancement:

● Integration with Real-Time Monitoring Tools: Future development can
focus on integrating this ML model with real-time packet capture and analysis
tools such as Wireshark, Snort, or Scapy. This would enable the system to process
live traffic data instead of static CSV files, allowing for immediate threat
detection and alert generation.

● Deployment in Production Environments: The current implementation can
be extended into a full-fledged web-based dashboard or microservice using
frameworks like Flask, Django, or Streamlit. Hosting the model on cloud platforms
such as Azure or Google Cloud would allow for scalable and accessible network
threat monitoring.

● Application to Modern Datasets: In future studies, newer and more
complex datasets like NSL-KDD, CICIDS2017, and TON_IoT should be used.
These datasets provide a richer set of features and more recent attack patterns,
improving the relevance of the model in today’s context.
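
Processing live traffic requires computing time-based features on the fly rather than reading them from a file. As one hedged example, the sketch below maintains a two-second sliding window and counts connections to the same destination host, in the spirit of the dataset's count feature. The class name, the host names and the choice to include the current connection in the count are assumptions of this sketch.

```python
from collections import deque


class SameHostCounter:
    """Count connections to the same destination host within a sliding time window."""

    def __init__(self, window_seconds: float = 2.0):
        self.window = window_seconds
        self.events = deque()  # (timestamp, dst_host), oldest first

    def observe(self, timestamp: float, dst_host: str) -> int:
        # Discard events that have fallen out of the window
        while self.events and timestamp - self.events[0][0] > self.window:
            self.events.popleft()
        self.events.append((timestamp, dst_host))
        # Count includes the connection that was just observed
        return sum(1 for _, host in self.events if host == dst_host)


counter = SameHostCounter()
for ts, host in [(0.0, "host-a"), (0.5, "host-a"), (1.0, "host-b"), (2.4, "host-a")]:
    print(ts, host, counter.observe(ts, host))
```

Timestamps are assumed to arrive in non-decreasing order, as they would from a packet capture; a sudden rise in this count for a single host is one of the signals associated with denial-of-service traffic.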

CHAPTER 7

APPENDIX

7.1 SOURCE CODE

import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Upload the dataset files in Google Colab (the dialog accepts several files at once)
from google.colab import files
uploaded = files.upload()

# Print the feature description file shipped with the dataset
with open("kddcup.names", 'r') as f:
    print(f.read())

# Comma-separated names of the 41 features, in file order
cols="""duration,
protocol_type,
service,
flag,
src_bytes,
dst_bytes,
land,
wrong_fragment,
urgent,
hot,
num_failed_logins,
logged_in,
num_compromised,
root_shell,
su_attempted,
num_root,
num_file_creations,
num_shells,
num_access_files,
num_outbound_cmds,
is_host_login,
is_guest_login,

count,
srv_count,
serror_rate,
srv_serror_rate,
rerror_rate,
srv_rerror_rate,
same_srv_rate,
diff_srv_rate,
srv_diff_host_rate,
dst_host_count,
dst_host_srv_count,
dst_host_same_srv_rate,
dst_host_diff_srv_rate,
dst_host_same_src_port_rate,
dst_host_srv_diff_host_rate,
dst_host_serror_rate,
dst_host_srv_serror_rate,
dst_host_rerror_rate,
dst_host_srv_rerror_rate"""

# Build the column list and append the label column
columns=[]
for c in cols.split(','):
    if(c.strip()):
        columns.append(c.strip())
columns.append('target')
#print(columns)
print(len(columns))

# Print the attack-name-to-category file shipped with the dataset
with open("training_attack_types", 'r') as f:
    print(f.read())

# Map each attack label to its category
attacks_types = {
    'normal': 'normal',
    'back': 'dos',
    'buffer_overflow': 'u2r',
    'ftp_write': 'r2l',
    'guess_passwd': 'r2l',
    'imap': 'r2l',
    'ipsweep': 'probe',
    'land': 'dos',
    'loadmodule': 'u2r',
    'multihop': 'r2l',
    'neptune': 'dos',
    'nmap': 'probe',
    'perl': 'u2r',

    'phf': 'r2l',
    'pod': 'dos',
    'portsweep': 'probe',
    'rootkit': 'u2r',
    'satan': 'probe',
    'smurf': 'dos',
    'spy': 'r2l',
    'teardrop': 'dos',
    'warezclient': 'r2l',
    'warezmaster': 'r2l',
}

import pandas as pd
df = pd.read_csv('kddcup.data_10_percent.gz', names=columns)

# Add attack type column
df['Attack Type'] = df['target'].apply(lambda r: attacks_types[r.strip().strip('.')])
df.head()
df.shape
df['target'].value_counts()
df['Attack Type'].value_counts()
df.dtypes
df.isnull().sum()

#Finding categorical features
num_cols = df._get_numeric_data().columns
cate_cols = list(set(df.columns)-set(num_cols))
cate_cols.remove('target')
cate_cols.remove('Attack Type')
cate_cols

#Visualization
def bar_graph(feature):
    df[feature].value_counts().plot(kind="bar")

bar_graph('protocol_type')
plt.figure(figsize=(15,3))
bar_graph('service')
bar_graph('flag')
bar_graph('logged_in')
bar_graph('target')
bar_graph('Attack Type')
df.columns

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Drop columns with any NaN values
df = df.dropna(axis=1)

# Keep only columns that have more than 1 unique value
df = df[[col for col in df.columns if df[col].nunique() > 1]]

# Select only numeric columns (exclude categorical like 'protocol_type', 'service', etc.)
df_numeric = df.select_dtypes(include=['number'])

# Compute the correlation matrix
corr = df_numeric.corr()

# Plot the heatmap
plt.figure(figsize=(15, 12))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.title("Correlation Heatmap (Numeric Features Only)", fontsize=16)
plt.show()

# Pairwise correlations of the candidate redundant features
df['num_root'].corr(df['num_compromised'])
df['srv_serror_rate'].corr(df['serror_rate'])
df['srv_count'].corr(df['count'])
df['srv_rerror_rate'].corr(df['rerror_rate'])
df['dst_host_same_srv_rate'].corr(df['dst_host_srv_count'])
df['dst_host_srv_serror_rate'].corr(df['dst_host_serror_rate'])
df['dst_host_srv_rerror_rate'].corr(df['dst_host_rerror_rate'])
df['dst_host_same_srv_rate'].corr(df['same_srv_rate'])
df['dst_host_srv_count'].corr(df['same_srv_rate'])
df['dst_host_rerror_rate'].corr(df['rerror_rate'])
df['dst_host_rerror_rate'].corr(df['srv_rerror_rate'])
df['dst_host_srv_rerror_rate'].corr(df['rerror_rate'])

df['dst_host_srv_rerror_rate'].corr(df['srv_rerror_rate'])

#This variable is highly correlated with num_compromised and should be ignored for analysis.
#(Correlation = 0.9938277978738366)
df.drop('num_root',axis = 1,inplace = True)

#This variable is highly correlated with serror_rate and should be ignored for analysis.
#(Correlation = 0.9983615072725952)
df.drop('srv_serror_rate',axis = 1,inplace = True)

#This variable is highly correlated with rerror_rate and should be ignored for analysis.
#(Correlation = 0.9947309539817937)
df.drop('srv_rerror_rate',axis = 1, inplace=True)

#This variable is highly correlated with srv_serror_rate and should be ignored for analysis.
#(Correlation = 0.9993041091850098)
df.drop('dst_host_srv_serror_rate',axis = 1, inplace=True)

#This variable is highly correlated with rerror_rate and should be ignored for analysis.
#(Correlation = 0.9869947924956001)
df.drop('dst_host_serror_rate',axis = 1, inplace=True)

#This variable is highly correlated with srv_rerror_rate and should be ignored for analysis.
#(Correlation = 0.9821663427308375)
df.drop('dst_host_rerror_rate',axis = 1, inplace=True)

#This variable is highly correlated with rerror_rate and should be ignored for analysis.

#(Correlation = 0.9851995540751249)
df.drop('dst_host_srv_rerror_rate',axis = 1, inplace=True)

#This variable is highly correlated with dst_host_srv_count and should be ignored for analysis.
#(Correlation = 0.9865705438845669)
df.drop('dst_host_same_srv_rate',axis = 1, inplace=True)
df.head()
df.shape
df.columns

# Standard deviation of each numeric feature, smallest first
df_std = df.select_dtypes(include='number').std()
df_std = df_std.sort_values(ascending=True)
df_std

df['protocol_type'].value_counts()

#protocol_type feature mapping
pmap = {'icmp':0,'tcp':1,'udp':2}
df['protocol_type'] = df['protocol_type'].map(pmap)

df['flag'].value_counts()

#flag feature mapping
# Apply this mapping exactly once: mapping the already-encoded integers a
# second time would turn every value into NaN.
fmap = {'SF':0,'S0':1,'REJ':2,'RSTR':3,'RSTO':4,'SH':5 ,'S1':6 ,'S2':7,'RSTOS0':8,'S3':9 ,'OTH':10}
df['flag'] = df['flag'].map(fmap)
df.head()

# Drop the service feature (errors='ignore' makes the call safe to repeat)
df.drop('service', axis=1, inplace=True, errors='ignore')
df.shape

df.shape
df.head()
df.dtypes

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

df = df.drop(['target',], axis=1)
print(df.shape)

# Target variable and train set
Y = df[['Attack Type']]
X = df.drop(['Attack Type',], axis=1)
sc = MinMaxScaler()
X = sc.fit_transform(X)

# Split test and train data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)

# Gaussian Naive Bayes
import pandas as pd
import time
from sklearn.naive_bayes import GaussianNB
from sklearn.impute import SimpleImputer

# Convert to DataFrames if needed
if not isinstance(X_train, pd.DataFrame):
    X_train = pd.DataFrame(X_train)
if not isinstance(X_test, pd.DataFrame):
    X_test = pd.DataFrame(X_test)

# 1. Drop all-NaN columns from training data
X_train_cleaned = X_train.dropna(axis=1, how='all')
# Keep only the same columns in test data
X_test_cleaned = X_test[X_train_cleaned.columns]

# 2. Impute missing values using mean strategy

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_cleaned)
X_test_imputed = imputer.transform(X_test_cleaned)  # use same imputer!

model1 = GaussianNB()
start_time = time.time()
model1.fit(X_train_imputed, Y_train.values.ravel())
end_time = time.time()
print("Training time:", end_time - start_time)

start_time = time.time()
Y_test_pred1 = model1.predict(X_test_imputed)
end_time = time.time()
print("Testing time:", end_time - start_time)

print("Train Accuracy:", model1.score(X_train_imputed, Y_train))
print("Test Accuracy:", model1.score(X_test_imputed, Y_test))

#Decision Tree
from sklearn.tree import DecisionTreeClassifier

model2 = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
start_time = time.time()
model2.fit(X_train, Y_train.values.ravel())
end_time = time.time()
print("Training time: ",end_time-start_time)

start_time = time.time()
Y_test_pred2 = model2.predict(X_test)
end_time = time.time()
print("Testing time: ",end_time-start_time)

print("Train score is:", model2.score(X_train, Y_train))
print("Test score is:",model2.score(X_test,Y_test))

#Random Forest
from sklearn.ensemble import RandomForestClassifier

model3 = RandomForestClassifier(n_estimators=30)
start_time = time.time()
model3.fit(X_train, Y_train.values.ravel())
end_time = time.time()
print("Training time: ",end_time-start_time)

start_time = time.time()
Y_test_pred3 = model3.predict(X_test)
end_time = time.time()
print("Testing time: ",end_time-start_time)

print("Train score is:", model3.score(X_train, Y_train))
print("Test score is:",model3.score(X_test,Y_test))

# Support Vector Classifier (SVC)
import pandas as pd
import time
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer

# Convert to DataFrames if needed
if not isinstance(X_train, pd.DataFrame):
    X_train = pd.DataFrame(X_train)
if not isinstance(X_test, pd.DataFrame):
    X_test = pd.DataFrame(X_test)

# 1. Drop all-NaN columns from training data
X_train_cleaned = X_train.dropna(axis=1, how='all')

# 2. Impute missing values using mean strategy
imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train_cleaned)
X_test_imputed = imputer.transform(X_test_cleaned)  # use same imputer!

model4 = SVC(gamma='scale')
# Train the classifier and time the fit
start_time = time.time()
model4.fit(X_train_imputed, Y_train.values.ravel())
end_time = time.time()
print("Training time:", end_time - start_time)

start_test = time.time()
y_pred = model4.predict(X_test_imputed)
end_test = time.time()
print("Testing (prediction) time (seconds):", end_test - start_test)

print("Train score is:", model4.score(X_train_imputed, Y_train))
print("Test score is:", model4.score(X_test_imputed, Y_test))

#Logistic Regression
from sklearn.linear_model import LogisticRegression
import time

# Use the imputed training data
model5 = LogisticRegression(max_iter=1200000)
start_time = time.time()
model5.fit(X_train_imputed, Y_train.values.ravel())
end_time = time.time()
print("Training time: ", end_time - start_time)

start_time = time.time()
Y_test_pred5 = model5.predict(X_test_imputed)
end_time = time.time()
print("Testing time: ", end_time - start_time)

print("Train score is:", model5.score(X_train_imputed, Y_train))
print("Test score is:", model5.score(X_test_imputed, Y_test))

#Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
import time

import time

model6 = GradientBoostingClassifier(random_state=0)
start_time = time.time()
model6.fit(X_train_imputed, Y_train.values.ravel())
end_time = time.time()
print("Training time: ", end_time - start_time)

start_time = time.time()
Y_test_pred6 = model6.predict(X_test_imputed)
end_time = time.time()
print("Testing time: ", end_time - start_time)

print("Train score is:", model6.score(X_train_imputed, Y_train))
print("Test score is:", model6.score(X_test_imputed, Y_test))

#Artificial Neural Network
# Notebook shell command (Colab/Jupyter); run "pip install scikeras" elsewhere
!pip install scikeras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from scikeras.wrappers import KerasClassifier
import time

def fun():
    # One hidden layer of 30 units followed by a 5-way softmax output
    model = Sequential()
    model.add(Input(shape=(X_train_imputed.shape[1],)))
    model.add(Dense(30, activation='relu', kernel_initializer='random_uniform'))
    model.add(Dense(5, activation='softmax'))  # 5 classes
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model7 = KerasClassifier(model=fun,

                         epochs=100, batch_size=64, verbose=1)

start = time.time()
model7.fit(X_train_imputed, Y_train)
end = time.time()
print("Training time:", end - start)

start_time = time.time()
Y_test_pred7 = model7.predict(X_test_imputed)
end_time = time.time()
print("Testing time: ", end_time - start_time)

from sklearn.metrics import accuracy_score

start_time = time.time()
Y_train_pred7 = model7.predict(X_train_imputed)
end_time = time.time()
print("Training prediction time:", end_time - start_time)

print("Train Accuracy:", accuracy_score(Y_train, Y_train_pred7))
print("Test Accuracy:", accuracy_score(Y_test, Y_test_pred7))

# Consolidated evaluation of all models
import pandas as pd
import time
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense, Input
from scikeras.wrappers import KerasClassifier

# --- Results Storage ---
model_results = []

# --- Universal Evaluation Function ---
def evaluate_model(name, model, X_train, Y_train, X_test, Y_test):
    # Flatten labels if needed
    Y_train = Y_train.values.ravel() if hasattr(Y_train, "values") else Y_train.ravel()
    Y_test = Y_test.values.ravel() if hasattr(Y_test, "values") else Y_test.ravel()

    # Train
    start_train = time.time()
    model.fit(X_train, Y_train)
    end_train = time.time()

    # Predict
    start_test = time.time()
    Y_pred_test = model.predict(X_test)
    end_test = time.time()

    # Evaluate
    train_accuracy = model.score(X_train, Y_train)
    test_accuracy = accuracy_score(Y_test, Y_pred_test)

    # Store results
    model_results.append({
        'Model': name,
        'Train Accuracy': round(train_accuracy, 4),
        'Test Accuracy': round(test_accuracy, 4),
        'Train Time (s)': round(end_train - start_train, 4),
        'Test Time (s)': round(end_test - start_test, 4)
    })
    print(f"{name} evaluated.")

evaluate_model("Gaussian NB", GaussianNB(), X_train_imputed, Y_train, X_test_imputed, Y_test)

evaluate_model("Decision Tree", DecisionTreeClassifier(criterion="entropy", max_depth=4), X_train, Y_train, X_test, Y_test)

evaluate_model("Random Forest", RandomForestClassifier(n_estimators=30), X_train, Y_train, X_test, Y_test)

evaluate_model("SVM", SVC(gamma='scale'), X_train_imputed, Y_train, X_test_imputed, Y_test)

evaluate_model("Logistic Regression", LogisticRegression(max_iter=1200000), X_train_imputed, Y_train, X_test_imputed, Y_test)

evaluate_model("Gradient Boosting", GradientBoostingClassifier(random_state=0), X_train_imputed, Y_train, X_test_imputed, Y_test)

# Reuse the network builder fun() defined in the ANN section above
model_ann = KerasClassifier(model=fun, epochs=100, batch_size=64, verbose=0)
evaluate_model("ANN", model_ann, X_train_imputed, Y_train, X_test_imputed, Y_test)

# --- Results Table ---
results_df = pd.DataFrame(model_results)
display(results_df)

# --- Accuracy Plot ---
results_df.plot(x='Model', y=['Train Accuracy', 'Test Accuracy'],
                kind='bar', figsize=(10, 6),
                title='Model Accuracy Comparison')
plt.ylabel("Accuracy")
plt.grid(True)
plt.tight_layout()

# --- Time Plot ---
results_df.plot(x='Model', y=['Train Time (s)', 'Test Time (s)'],
                kind='bar', figsize=(10, 6),
                title='Training and Testing Time Comparison')
plt.ylabel("Time (seconds)")
plt.grid(True)
plt.tight_layout()
plt.show()

7.2 SCREENSHOTS

RESULTS TABLE

Fig 7.2.1 Models’ Accuracy and Time.

ACCURACY PLOT

Fig 7.2.2 Model Accuracy Comparison.

TIME PLOT

Fig 7.2.3 Training and Testing Time Comparison.

CHAPTER 8

RESULT

8.1 RESULT ANALYSIS

The "Network Threat Hunting and Detection System" was successfully
implemented and evaluated on the KDD Cup 1999 benchmark dataset. After
training and testing various machine learning models, the Random Forest
classifier emerged as the most effective algorithm for intrusion detection tasks.

The Random Forest model achieved high accuracy, precision, and recall,
especially in detecting critical attack categories such as Denial of Service (DoS),
Probe, and Remote to Local (R2L) intrusions. Its ensemble learning mechanism
allowed the system to generalize well across diverse traffic patterns, reducing
overfitting and improving classification stability.

Evaluation metrics showed that Random Forest consistently outperformed
other models like SVM, Naive Bayes, and Decision Trees in both training and
testing phases. The system maintained low false positive rates, making it reliable
for real-world application.
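
Per-class recall and the false positive rate can be computed directly from the true and predicted labels. The sketch below uses a handful of made-up labels purely for illustration; the function names are assumptions and the numbers are not results of this project.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def false_positive_rate(y_true, y_pred, negative="normal"):
    """Share of truly normal connections that were flagged as any attack type."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == negative]
    if not negatives:
        return 0.0
    return sum(p != negative for p in negatives) / len(negatives)


# Made-up labels: one normal connection is wrongly flagged, one r2l attack is missed
y_true = ["normal", "normal", "normal", "normal", "dos", "dos", "probe", "r2l"]
y_pred = ["normal", "normal", "normal", "dos", "dos", "dos", "probe", "normal"]

print(per_class_recall(y_true, y_pred))
print(false_positive_rate(y_true, y_pred))
```

Reporting recall per attack category alongside overall accuracy is useful on this kind of data because rare categories such as R2L and U2R contribute very little to the overall accuracy figure.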

Additionally, the system architecture integrates the data preprocessing,
feature handling, and classification modules in a single pipeline. Combined with
the short prediction time of the Random Forest model (0.25 s on the test set),
this makes the system a candidate for near-real-time detection once it is
connected to live network traffic.

Overall, the results validate that the Random Forest-based detection
system is robust, scalable, and suitable for deployment in modern network
infrastructures to enhance cybersecurity posture.

