Team 4 Report Document (3)
Submitted by
ANBURAJA R (714221104004)
MINIBALA G (714221104024)
SANKARI A (714221104042)
SAVITHA J (714221104044)
of
BACHELOR OF ENGINEERING
IN
MAY 2025
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE
Mrs. T. Arachelvi, M.Tech.,
ASSOCIATE PROFESSOR,
HEAD OF THE DEPARTMENT,
Department of Computer Science and Engineering,
Tamilnadu College of Engineering,
Coimbatore – 641 659.

SIGNATURE
Mr. R. Ponneela Vignesh, M.E., MBA., (Ph.D.),
ASSISTANT PROFESSOR,
SUPERVISOR,
Department of Information Technology,
Tamilnadu College of Engineering,
Coimbatore – 641 659.
ACKNOWLEDGEMENT
We express our deep sense of gratitude to the management for giving us the
opportunity to take up this project.
We extend our sincere thanks and gratitude to our beloved Project guide
Mr. R. PONNEELA VIGNESH, M.E., MBA., (Ph.D.), for his priceless suggestions
and unrelenting support in all our efforts to improve our project, and for guiding
us in the right direction for the successful completion of our project.
We extend our sincere thanks to all the teaching and non-teaching faculty
members of the department for their assistance and all our friends who helped us
in bringing out our project in good shape and form.
ABSTRACT
This project applies machine learning (ML) to address the existing challenges of
traditional intrusion detection systems. Using the KDD Cup 1999 dataset, several
supervised classifiers were trained and compared, including Logistic Regression,
Decision Tree, Random Forest, Naive Bayes, Artificial Neural Network (ANN),
Support Vector Machine (SVM), and Gradient Boosting. Preprocessing techniques such
as label encoding, one-hot encoding, and Min-Max feature scaling were employed to
enhance model performance. Among the models evaluated, the Random Forest classifier
achieved an accuracy of over 99%. The resulting system effectively distinguishes
between normal and malicious network traffic and presents its results in a
user-interpretable format for transparency.
TABLE OF CONTENTS
ABSTRACT iv
LIST OF TABLES v
LIST OF FIGURES ix
LIST OF ABBREVIATIONS x
1. INTRODUCTION 2
1.1 Introduction 2
1.2 Overview 2
1.3 Problem Statement 2
1.4 Existing System 3
2. LITERATURE SURVEY 8
2.1 Introduction 8
2.2 Network IDS: Overview 8
2.3 Datasets for Intrusion Detection Research 9
2.3.3 CICIDS2017 9
2.4 Review of Related Work 9
3. METHODOLOGY 15
3.1 Introduction 15
3.1.1 Algorithm 15
3.2 Methodology 16
3.2.4 Model Evaluation Metrics 17
3.3 Summary 18
4. SYSTEM ARCHITECTURE 20
4.1 Introduction 20
4.2 System Architecture 20
4.3 Architecture Description 20
4.4 System Modules 22
4.5 Data Flow Diagram 24
4.6 Summary 25
5. SYSTEM IMPLEMENTATION 27
5.1 Introduction 27
5.2 System Implementation 27
5.3 Data Collection and Loading 27
5.4 Data Preprocessing 27
7. APPENDIX 35
9. BIBLIOGRAPHY 55
LIST OF FIGURES
LIST OF ABBREVIATIONS
2. ML Machine Learning
18. AI Artificial Intelligence
19. XAI Explainable Artificial Intelligence
20. CSV Comma-Separated Values
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
In the present digital age, where communication and data exchange rely
extensively on computer networks, cybersecurity has become a pressing concern.
The rapid development of digital infrastructure, cloud computing, and Internet of
Things (IoT) devices has simultaneously increased the surface area for potential
cyber-attacks. Organizations, governments, and individuals alike face constant
threats from adversaries seeking unauthorized access, data manipulation, denial
of service, and more sophisticated cybercrimes.
1.2 OVERVIEW
The primary aim of this project is to design a system that can distinguish
between normal and malicious network traffic with a high degree of accuracy,
using a real-world inspired dataset and applying various supervised learning
techniques.
Cyberattacks today are not only more frequent but also more intelligent
and targeted. Insider threats, unauthorized access, phishing, ransomware, and
advanced persistent threats (APTs) can easily bypass conventional signature-
based IDS, putting sensitive data and operations at risk.
1.3 PROBLEM STATEMENT
Traditional intrusion detection systems (IDS) can be broadly categorized into
two types: signature-based detection and anomaly-based detection. Signature-based
approaches in particular struggle with:
Zero-day attacks.
Polymorphic malware (malware that changes its code to evade detection).
Evolving attack patterns.
The main objectives of this project are:
● Data Loading and Exploration: Load the KDD Cup dataset and explore its
structure and distribution.
● Data Preprocessing: Clean the dataset, handle missing values, encode
categorical features, and normalize data.
● Model Selection: Apply and evaluate the performance of Logistic
Regression, Decision Tree, Random Forest, Naive Bayes, KNN, and SVM.
● Scalability: Can be scaled to large datasets and complex networks.
1.9 SUMMARY
This chapter laid the foundation for understanding the scope and
significance of network intrusion detection using machine learning. It introduced
the core problem, discussed the limitations of existing systems, and proposed a
comprehensive solution. It highlighted the importance of applying multiple machine
learning models for performance comparison and emphasized the potential for future
improvement through the use of real-time datasets and continuous system updates.
CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
The KDD Cup 1999 dataset consists of 41 features per connection record and over
4 million instances.
To address the shortcomings of the KDD dataset, the NSL-KDD dataset was
introduced. It removes redundant records and offers a more balanced distribution,
enabling better performance evaluation of machine learning models.
2.3.3 CICIDS2017
2.4 REVIEW OF RELATED WORK
Breiman (2001) introduced the Random Forest algorithm, an ensemble method
that builds multiple decision trees and aggregates their outputs. Random Forests
are robust to overfitting and perform well on high-dimensional data.
Revathi and Malathi (2013) utilized Naive Bayes for anomaly detection and
reported high precision for certain types of attacks, although performance was
lower for complex attack vectors like U2R.
Sangkatsanee et al. (2011) employed ANN on the KDD dataset and observed
that while the model achieved high accuracy for known attacks, it suffered from
computational inefficiency for large-scale datasets.
Zhang et al. (2019) applied Gradient Boosting to the NSL-KDD dataset and
achieved high detection accuracy, especially for rare attacks like U2R and R2L,
although such models require substantial computational resources and training data.
Table: Comparison of related work by Author/Year, Algorithm(s) Used, Dataset,
Accuracy, and Comments.
[2] Buczak, A. L., & Guven, E. (2016). A Survey of Data Mining and Machine
Learning Methods for Cyber Security Intrusion Detection. IEEE
Communications Surveys & Tutorials, 18(2), 1153–1176.
Several of the surveyed studies also compared the models in terms of accuracy and
computational efficiency. Their results show that Random Forest achieves the
best overall performance, with high accuracy and reduced training time.
2.6 SUMMARY
This chapter presented a detailed review of the work carried out by various
authors in the field of network intrusion detection, covering the benchmark
datasets and the machine learning methods applied to detect malicious network traffic.
CHAPTER 3
METHODOLOGY
3.1 INTRODUCTION
The methodology adopted in this project consists of the following steps:
Data Preprocessing: Clean and normalize the KDD dataset, handle missing
values, and select relevant features for training.
Model Training: Train the classifiers (Logistic Regression, Decision Tree,
Random Forest, ANN, Naive Bayes, Gradient Boosting Classifier, SVM) on
the preprocessed data.
Model Evaluation: Evaluate the models using metrics such as accuracy,
precision, recall, and false positive rate.
Optimization: Tune hyperparameters and perform cross-validation to improve
model performance (an illustrative tuning sketch follows this list).
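An illustrative sketch of this optimization step is given below. It assumes a
preprocessed feature matrix X and label vector y are already available; the
parameter grid shown is only an example and not the exact values tuned in this project.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid (example values, not the project's final settings)
param_grid = {'n_estimators': [10, 30, 50], 'max_depth': [None, 10, 20]}

# 5-fold cross-validated grid search over a Random Forest
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)          # X, y: preprocessed features and labels
print("Best parameters:", grid.best_params_)
print("Cross-validated accuracy:", grid.best_score_)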
3.1.1 ALGORITHM:
3.2 METHODOLOGY
The KDD Cup 1999 dataset is a benchmark dataset widely used for
evaluating intrusion detection systems. It contains approximately 5 million
connection records, each with 41 features and labelled as either normal or a
specific attack type.
Before training the machine learning models, the dataset was pre-processed
with the following steps (a short code sketch follows the list):
Removal of duplicate and null values.
Label Encoding of categorical features (e.g., protocol type, service, flag).
One-Hot Encoding for nominal features to avoid ordinality.
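A minimal sketch of these preprocessing steps is shown below, assuming df is the
loaded KDD Cup DataFrame with the categorical columns named above; it is illustrative
rather than the exact code used (the full listing appears in the appendix).

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Remove duplicate and null records
df = df.drop_duplicates().dropna()

# Label Encoding: map each category of the categorical features to an integer code
for col in ['protocol_type', 'service', 'flag']:
    df[col] = LabelEncoder().fit_transform(df[col])

# One-Hot Encoding (alternative treatment for nominal features, avoiding implied ordering)
# df = pd.get_dummies(df, columns=['protocol_type', 'service', 'flag'])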
The models were evaluated based on the following key performance metrics (a
computation sketch follows the list):
F1-Score: Harmonic mean of precision and recall.
Confusion Matrix: Visual representation of the model’s prediction accuracy.
Training Time: Computational time taken to fit the model on the dataset.
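The following sketch shows how these metrics can be computed with Scikit-learn,
assuming y_test holds the true labels and y_pred the predictions of any trained model.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# y_test: true labels, y_pred: predictions from a trained classifier
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall   :", recall_score(y_test, y_pred, average='weighted'))
print("F1-Score :", f1_score(y_test, y_pred, average='weighted'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))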
3.3 SUMMARY
CHAPTER 4
SYSTEM ARCHITECTURE
4.1 INTRODUCTION
The detailed functioning of the architecture can be broken down into the
following stages:
The initial step of the architecture involves collecting the dataset. For this
project, the KDD Cup 1999 dataset has been used, which simulates a military
network environment and includes both normal and intrusive traffic.
Preprocessing Module:
Modelling Layer:
The modelling layer is the heart of the system, where the actual machine
learning algorithms are applied to detect intrusions. After the preprocessing stage,
the cleaned and structured data is passed into this layer. Multiple classification
models are implemented here, including Logistic Regression, Decision Tree,
Random Forest, Naive Bayes, Artificial Neural Network (ANN), Support Vector
Machine (SVM), and Gradient Boosting Classifier.
Detection Engine:
Output Layer:
The output layer represents the final stage of the architecture, where the
system presents the results of the threat detection process. After the modelling
layer predicts whether a network instance is normal or malicious, the output is
displayed in a user-interpretable format. The output layer plays a crucial role in
bridging the gap between the system’s backend processing and the end user. It
ensures that the decisions made by the machine learning model are transparent,
actionable, and easy to understand.
Data Ingestion:
The Data Ingestion module is responsible for importing the dataset into the
system. In this project, the input dataset is in CSV format—specifically, the KDD
Cup 1999 dataset. This module uses Pandas, a widely adopted Python library, to
read and structure the dataset into a DataFrame. It verifies data integrity and
consistency during loading by checking column headers, data types, and the
presence of null or malformed values.
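A brief sketch of this ingestion step is given below; it assumes the 10% KDD Cup
1999 file and the column list built in the appendix, and mirrors the integrity
checks described above.

import pandas as pd

# 'columns' is the list of 41 KDD feature names plus the 'target' label column
df = pd.read_csv('kddcup.data_10_percent.gz', names=columns)

# Basic integrity checks: shape, column data types, and missing values
print(df.shape)
print(df.dtypes)
print("Missing values:", df.isnull().sum().sum())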
Data Preprocessing:
Once the data has been ingested, the preprocessing module prepares it for
machine learning model training. This involves cleaning the data, transforming
categorical features, and scaling numerical values. Key preprocessing steps
include:
Label Encoding: Converts categorical string values (such as 'tcp', 'http', or 'SF')
into numerical codes using Scikit-learn's LabelEncoder.
One-Hot Encoding (if needed): Applied to nominal variables to avoid
introducing ordinality where it doesn't exist.
Feature Scaling: Uses MinMaxScaler to normalize feature values between 0
and 1. This is especially important for distance-based algorithms like ANN or
SVM.
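The feature scaling step can be sketched as follows, assuming X_train and X_test
are the numeric feature matrices produced by the encoding steps above; the scaler is
fitted on the training data only and then reused for the test data.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                           # scales every feature into the [0, 1] range
X_train_scaled = scaler.fit_transform(X_train)    # fit on training data only
X_test_scaled = scaler.transform(X_test)          # apply the same scaling to the test data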
Model Training and Classification:
This module is the core of the machine learning system. It implements and
trains several supervised classification algorithms using Scikit-learn. Each model
is evaluated to determine its effectiveness in classifying network traffic as either
normal or malicious.
Logistic Regression
Decision Tree Classifier
Random Forest Classifier
Gaussian Naive Bayes
Artificial Neural Network (ANN)
Support Vector Machine (SVM)
Gradient Boosting Classifier
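A condensed sketch of how the classifiers listed above can be trained and compared
is shown below, assuming preprocessed X_train, X_test, Y_train, Y_test; the ANN is
trained separately with Keras, as in the appendix, and the hyperparameter values
shown follow the appendix listing.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1200000),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=4),
    "Random Forest": RandomForestClassifier(n_estimators=30),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(gamma='scale'),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each classifier and report its accuracy on the held-out test set
for name, model in models.items():
    model.fit(X_train, Y_train)
    print(name, "test accuracy:", model.score(X_test, Y_test))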
Evaluation:
4.5 DATA FLOW DIAGRAM
The data flow of the proposed system outlines the step-by-step process
from data input to threat detection. It begins with the ingestion of the KDD Cup
1999 dataset, followed by preprocessing which includes encoding, normalization,
and data cleaning. The dataset is then split into training and testing sets. Multiple
machine learning algorithms are trained and evaluated using metrics like
accuracy and F1-score. The best-performing model is selected and used for
prediction on new data.
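A short sketch of this flow is given below, assuming a preprocessed feature matrix
X and labels y, and the dictionary of fitted classifiers from the training module;
the split ratio is illustrative.

from sklearn.model_selection import train_test_split

# Hold out part of the records for testing (illustrative split ratio)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# 'models' is a dict of classifiers fitted on X_train (see the training sketch above);
# keep the one with the highest accuracy on the held-out test set
scores = {name: m.score(X_test, Y_test) for name, m in models.items()}
best_model = models[max(scores, key=scores.get)]
predictions = best_model.predict(X_test)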
[Data Flow: KDD Cup 1999 Dataset → Preprocessing Module → Training/Test Split →
Model Training and Evaluation → Prediction]
4.6 SUMMARY
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1. INTRODUCTION
The first step in the system implementation was the acquisition of the KDD
Cup 1999 dataset, which is a widely used benchmark dataset for evaluating
network intrusion detection systems. This dataset consists of numerous network
connection records, each described by 41 features and labeled either as normal or
as a specific attack type. The dataset was loaded into the environment and
converted into a structured format suitable for machine learning processes.
The KDD Cup 1999 dataset, which contains network connection records,
requires several preprocessing steps to ensure the data is suitable for training
machine learning models. Initially, categorical features like protocol_type,
service, and flag are encoded using Label Encoding and One-Hot Encoding to
convert them into numerical formats. The feature scaling step ensures that all
attributes are aligned on the same scale by normalizing them within the 0 to 1
range. Preprocessing plays a critical role in ensuring the quality and accuracy of
machine learning models.
● Support Vector Machine (SVM): A robust algorithm that finds the optimal
hyperplane that separates classes in a high-dimensional space.
5.5.1. PERFORMANCE EVALUATION
Model                          Train Accuracy   Test Accuracy   Train Time (s)   Test Time (s)
Naive Bayes                    0.99             0.993           0.982            0.02
Gradient Boosting Classifier   0.999            0.999           281.1            0.7
SVM                            0.998            0.998           599.6            62.21
Though not the primary focus of this project, the results from the best-performing
model can be integrated into a lightweight user interface or dashboard using
frameworks like Streamlit or Flask. This would allow real-time monitoring of
network connections and flagging of suspicious activity based on model
predictions.
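A hypothetical minimal sketch of such a dashboard using Streamlit is shown below;
the model file name 'best_model.pkl' and the 'normal' label value are assumptions
for illustration only.

import joblib
import pandas as pd
import streamlit as st

st.title("Network Threat Detection Dashboard")

# Hypothetical path to the best-performing model saved with joblib
model = joblib.load("best_model.pkl")

uploaded = st.file_uploader("Upload preprocessed connection records (CSV)")
if uploaded is not None:
    records = pd.read_csv(uploaded)
    records["prediction"] = model.predict(records)
    st.dataframe(records)
    # Assumes the benign class is labelled 'normal'
    st.write("Connections flagged as malicious:",
             int((records["prediction"] != "normal").sum()))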
5.6. SUMMARY
CHAPTER 6
6.1 CONCLUSION
While the current system offers promising results, there are several avenues
for future improvement and enhancement:
Cloud Deployment: Deploying the system on cloud platforms such as Azure or
Google Cloud will allow for scalable and accessible network threat monitoring.
Application to Modern Datasets: In future studies, newer and more
complex datasets like NSL-KDD, CICIDS2017, and TON_IoT should be used.
These datasets provide a richer set of features and more recent attack patterns,
improving the relevance of the model in today’s context.
CHAPTER 7
APPENDIX
7.1 SOURCE CODE

import os

# The original notebook printed a reference file here; the file handle 'f' is not
# defined in the surviving listing.
# print(f.read())

# Column names for the dataset (standard KDD Cup 1999 feature order, 41 features)
cols = """duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment,
urgent, hot, num_failed_logins, logged_in, num_compromised, root_shell, su_attempted, num_root,
num_file_creations, num_shells, num_access_files, num_outbound_cmds, is_host_login, is_guest_login,
count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate,
diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate,
dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate,
dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate"""

# Build the column list and append the label column
columns = []
for c in cols.split(','):
    if c.strip():
        columns.append(c.strip())
columns.append('target')
# print(columns)
print(len(columns))
# Map each attack label in the 10% KDD data to its broad category; only the 'phf' entry
# is visible in the original listing, the remaining entries follow the standard KDD mapping.
attacks_types = {
    'normal': 'normal', 'back': 'dos', 'land': 'dos', 'neptune': 'dos', 'pod': 'dos',
    'smurf': 'dos', 'teardrop': 'dos', 'buffer_overflow': 'u2r', 'loadmodule': 'u2r',
    'perl': 'u2r', 'rootkit': 'u2r', 'ftp_write': 'r2l', 'guess_passwd': 'r2l', 'imap': 'r2l',
    'multihop': 'r2l', 'phf': 'r2l', 'spy': 'r2l', 'warezclient': 'r2l', 'warezmaster': 'r2l',
    'ipsweep': 'probe', 'nmap': 'probe', 'portsweep': 'probe', 'satan': 'probe',
}

import pandas as pd
import matplotlib.pyplot as plt

# Load the 10% KDD Cup 1999 data with the column names built above
df = pd.read_csv('kddcup.data_10_percent.gz', names=columns)

# Add attack type column
df['Attack Type'] = df['target'].apply(lambda r: attacks_types[r.strip().strip('.')])
df.head()
df.shape
df.dtypes

# Bar charts of selected categorical features
def bar_graph(feature):
    df[feature].value_counts().plot(kind="bar")

bar_graph('protocol_type')
plt.figure(figsize=(15,3))
bar_graph('service')
bar_graph('flag')
bar_graph('logged_in')
bar_graph('target')
plt.show()
# Correlation analysis: drop one feature out of each highly correlated pair
df['dst_host_srv_rerror_rate'].corr(df['srv_rerror_rate'])

# This variable is highly correlated with num_compromised and should be ignored for analysis.
# This variable is highly correlated with srv_serror_rate and should be ignored for analysis.
# (Correlation = 0.9938277978738366)
# (Correlation = 0.9983615072725952)
# (Correlation = 0.9947309539817937)
# (Correlation = 0.9993041091850098)

df.drop('dst_host_srv_serror_rate', axis=1, inplace=True)
df.drop('dst_host_serror_rate', axis=1, inplace=True)
df.drop('dst_host_rerror_rate', axis=1, inplace=True)
# (Correlation = 0.9851995540751249)
df.drop('dst_host_srv_rerror_rate', axis=1, inplace=True)

# This variable is highly correlated with dst_host_srv_count and should be ignored for analysis.
# (Correlation = 0.9865705438845669)
df.drop('dst_host_same_srv_rate', axis=1, inplace=True)

df.head()
df.shape
df.columns

# Standard deviation of each numeric feature, sorted in ascending order
df_std = df.select_dtypes(include='number').std()
df_std = df_std.sort_values(ascending=True)
df_std

df['protocol_type'].value_counts()

# protocol_type feature mapping
pmap = {'icmp': 0, 'tcp': 1, 'udp': 2}
df['protocol_type'] = df['protocol_type'].map(pmap)

df['flag'].value_counts()

# flag feature mapping
fmap = {'SF': 0, 'S0': 1, 'REJ': 2, 'RSTR': 3, 'RSTO': 4, 'SH': 5, 'S1': 6,
        'S2': 7, 'RSTOS0': 8, 'S3': 9, 'OTH': 10}
df['flag'] = df['flag'].map(fmap)
df.head()

# Drop the high-cardinality 'service' column
df.drop('service', axis=1, inplace=True, errors='ignore')
df.shape
df.columns
# Gaussian Naive Bayes
# X_train, X_test, Y_train, Y_test are assumed to have been created earlier by splitting the
# preprocessed data; X_train_cleaned / X_test_cleaned have all-NaN columns removed (see below).
from sklearn.naive_bayes import GaussianNB
from sklearn.impute import SimpleImputer

# Impute missing values with the column mean, fitting the imputer on the training data only
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_cleaned)
X_test_imputed = imputer.transform(X_test_cleaned)   # use same imputer!

model1 = GaussianNB()
model1.fit(X_train_imputed, Y_train.values.ravel())
Y_test_pred1 = model1.predict(X_test_imputed)

print("Train Accuracy:", model1.score(X_train_imputed, Y_train))
print("Test Accuracy:", model1.score(X_test_imputed, Y_test))

# Decision Tree
from sklearn.tree import DecisionTreeClassifier

model2 = DecisionTreeClassifier(criterion="entropy", max_depth=4)
model2.fit(X_train, Y_train.values.ravel())
Y_test_pred2 = model2.predict(X_test)
print("Train score is:", print("Test score
model2.score(X_train, Y_train)) is:",model3.score(X_test,Y_test))
print("Test score
is:",model2.score(X_test,Y_test))
# Support Vector Classifier (SVC)
#Random Tree
import pandas as pd
from sklearn.ensemble import
import time
RandomForestClassifier
from sklearn.svm import SVC
model3 =
from sklearn.impute import
RandomForestClassifier(n_estimator
SimpleImputer
s=30)
# Convert to DataFrames if needed
start_time = time.time()
if not isinstance(X_train,
model3.fit(X_train,
pd.DataFrame):
Y_train.values.ravel())
X_train = pd.DataFrame(X_train)
end_time = time.time()
if not isinstance(X_test,
print("Training time: ",end_time-
pd.DataFrame):
start_time)
X_test = pd.DataFrame(X_test)
start_time = time.time()
# 1. Drop all-NaN columns from
Y_test_pred3 =
training data
model3.predict(X_test)
X_train_cleaned =
end_time = time.time()
X_train.dropna(axis=1, how='all')
print("Testing time: ",end_time-
# 2. Impute missing values using
start_time)
mean strategy
print("Train score is:",
imputer =
model3.score(X_train, Y_train))
SimpleImputer(strategy='mean')
43
X_train_imputed = imputer.fit_transform(X_train_cleaned)
X_test_imputed = imputer.transform(X_test_cleaned)   # use same imputer!

model4 = SVC(gamma='scale')

start_time = time.time()
model4.fit(X_train_imputed, Y_train.values.ravel())
end_time = time.time()
print("Training time:", end_time - start_time)

start_test = time.time()
y_pred = model4.predict(X_test_imputed)
end_test = time.time()
print("Testing (prediction) time (seconds):", end_test - start_test)

print("Train score is:", model4.score(X_train_imputed, Y_train))
print("Test score is:", model4.score(X_test_imputed, Y_test))

# Logistic Regression
from sklearn.linear_model import LogisticRegression
import time

# Use the imputed training data
model5 = LogisticRegression(max_iter=1200000)

start_time = time.time()
model5.fit(X_train_imputed, Y_train.values.ravel())
end_time = time.time()
print("Training time: ", end_time - start_time)

start_time = time.time()
Y_test_pred5 = model5.predict(X_test_imputed)
end_time = time.time()
print("Testing time: ", end_time - start_time)

print("Train score is:", model5.score(X_train_imputed, Y_train))
print("Test score is:", model5.score(X_test_imputed, Y_test))

# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
import time

model6 = GradientBoostingClassifier(random_state=0)

# Artificial Neural Network (ANN): Keras imports
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
# ANN wrapped as a scikit-learn estimator; create_ann() builds the Keras Sequential model.
# The KerasClassifier construction is assumed, mirroring the model_ann definition further below.
from scikeras.wrappers import KerasClassifier

model7 = KerasClassifier(model=create_ann, epochs=100, batch_size=64, verbose=1)

start = time.time()
model7.fit(X_train_imputed, Y_train)
end = time.time()
print("Training time:", end - start)

start_time = time.time()
Y_test_pred7 = model7.predict(X_test_imputed)
end_time = time.time()
print("Testing time: ", end_time - start_time)

from sklearn.metrics import accuracy_score

Y_train_pred7 = model7.predict(X_train_imputed)
print("Train Accuracy:", accuracy_score(Y_train, Y_train_pred7))
print("Test Accuracy:", accuracy_score(Y_test, Y_test_pred7))

# Combined comparison of all models
import pandas as pd
import time
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from tensorflow.keras.layers import Dense, Input

# Helper used below to fit a model, time it, and record train/test accuracy.
# Reconstructed minimally; only the prediction line of the original definition is visible.
model_results = []

def evaluate_model(name, model, X_train, Y_train, X_test, Y_test):
    start = time.time()
    model.fit(X_train, Y_train.values.ravel())
    train_time = time.time() - start

    start = time.time()
    Y_pred_test = model.predict(X_test)
    test_time = time.time() - start
    Y_pred_train = model.predict(X_train)

    model_results.append({
        'Model': name,
        'Train Accuracy': accuracy_score(Y_train, Y_pred_train),
        'Test Accuracy': accuracy_score(Y_test, Y_pred_test),
        'Train Time (s)': train_time,
        'Test Time (s)': test_time,
    })

evaluate_model("Decision Tree", DecisionTreeClassifier(criterion="entropy", max_depth=4),
               X_train, Y_train, X_test, Y_test)
evaluate_model("Random Forest", RandomForestClassifier(n_estimators=30),
               X_train, Y_train, X_test, Y_test)
evaluate_model("SVM", SVC(gamma='scale'),
               X_train_imputed, Y_train, X_test_imputed, Y_test)
evaluate_model("Logistic Regression", LogisticRegression(max_iter=1200000),
               X_train_imputed, Y_train, X_test_imputed, Y_test)
evaluate_model("Gradient Boosting", GradientBoostingClassifier(random_state=0),
               X_train_imputed, Y_train, X_test_imputed, Y_test)

# create_ann builds the Keras Sequential network (its definition is not reproduced here)
model_ann = KerasClassifier(model=create_ann, epochs=100, batch_size=64, verbose=0)
evaluate_model("ANN", model_ann, X_train_imputed, Y_train, X_test_imputed, Y_test)

# --- Results Table ---
results_df = pd.DataFrame(model_results)
display(results_df)

# --- Accuracy Plot ---
results_df.plot(x='Model', y=['Train Accuracy', 'Test Accuracy'],
                kind='bar', figsize=(10, 6), title='Model Accuracy Comparison')
plt.ylabel("Accuracy")
plt.grid(True)
plt.tight_layout()

# --- Training and Testing Time Plot ---
results_df.plot(x='Model', y=['Train Time (s)', 'Test Time (s)'],
                kind='bar', figsize=(10, 6), title='Training and Testing Time Comparison')
plt.ylabel("Time (seconds)")
plt.grid(True)
plt.tight_layout()
plt.show()
7.2 SCREENSHOTS
RESULTS TABLE
ACCURACY PLOT
TIME PLOT
CHAPTER 8
RESULTS
The Random Forest model achieved high accuracy, precision, and recall,
especially in detecting critical attack categories such as Denial of Service (DoS),
Probe, and Remote to Local (R2L) intrusions. Its ensemble learning mechanism
allowed the system to generalize well across diverse traffic patterns, reducing
overfitting and improving classification stability.
CHAPTER 9
BIBLIOGRAPHY
5. Daniel Barbará, Julia Couto, Sushil Jajodia, Leonard Popyack and Ningning Wu,
"ADAM: Detecting Intrusions by Data Mining," IEEE Workshop on Information
Assurance and Security, West Point, New York, June 5-6, pp. 11-16, 2001.
6. L. Dhanabal and S. P. Shantharajah, "A Study on NSL-KDD Dataset for
Intrusion Detection System Based on Classification Algorithms,” Int. J. Adv.
Res. Comput. Commun. Eng., vol. 4, no. 6, pp. 446–452, Jun. 2015.
12. Debra Anderson, Thane Frivold, and Alfonso Valdes, "Next-generation Intrusion
Detection Expert System (NIDES): A Summary," Computer Science Laboratory,
SRI-CSL-95-07, May 1995.
13. Te-Shun Chou and Tsung-Nan Chou, "Hybrid Classifier Systems for Intrusion
Detection," Seventh Annual Communications Networks and Services Research
Conference, pp. 286-291, 2009.
14. N.B. Amor, S. Benferhat, and Z. Elouedi, "Naïve Bayes vs. decision trees in
intrusion detection systems," Proc. of 2004 ACM Symposium on Applied
Computing, 2004, pp. 420-424.
15. Nasrin Sultana, Naveen Chilamkurti, Wei Peng and Rabei Alhadad, "Survey on
SDN based network intrusion detection system using machine learning approaches,"
Springer, DOI: 10.1007/s12083-017-0630-0.