Project Final Doc(17)

The document is a project report on 'Cyber Threat Detection Using Machine Learning Techniques' submitted by students for their Bachelor of Technology in Artificial Intelligence and Machine Learning. It evaluates various machine learning methods, including deep belief networks, decision trees, and support vector machines, for detecting cyber threats. The report emphasizes the increasing need for effective cybersecurity measures due to the rise in cyber threats and the limitations of traditional intrusion detection systems.

CYBER THREAT DETECTION USING MACHINE LEARNING

TECHNIQUES: A PERFORMANCE EVALUATION PERSPECTIVE

A Project Report submitted in partial fulfillment of the requirements for the award of
the degree

BACHELOR OF TECHNOLOGY IN
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

Submitted by

S.PUNYA TEJA 216N1A6156

K.V.R.TULASI 216N1A6129
L.JITHENDRA 216N1A6136
A.SATHISH 216N1A6101
V.V.S.K.ARUN 216N1A6163

Under the Esteemed Guidance of

Dr. R. SRINIVAS, M.Tech, Ph.D
Associate Professor of AIML

SRINIVASA INSTITUTE OF ENGINEERING AND TECHNOLOGY

(UGC–Autonomous Institution)
(Approved by AICTE, permanently affiliated to JNTUK, Kakinada, ISO 9001:2015 certified Institution)

(Accredited by NAAC with 'A' Grade; Recognised by UGC under sections 2(f) & 12(B))

NH-216, Cheyyeru (V), Amalapuram-533216.

APRIL-2025
CERTIFICATE
This is to certify that the Project Report entitled “CYBER THREAT DETECTION USING MACHINE
LEARNING TECHNIQUES: A PERFORMANCE EVALUATION PERSPECTIVE” is being
submitted by S.PUNYATEJA (216N1A6156), K.V.R TULASI (216N1A6129), L.JITHENDRA
(216N1A6136), A.SATHISH (216N1A6101), and V.V.S.K.S.ARUN (216N1A6163) in partial fulfillment
of the requirements for the award of Bachelor of Technology in Artificial Intelligence and Machine Learning during the academic
period 2021-2025.

Project Guide                      Head of the Department

Dr. R. Srinivas, M.Tech, Ph.D      Dr. R. Srinivas, M.Tech, Ph.D

Associate Professor,               Associate Professor,
Department of AIML                 Department of AIML

EXTERNAL EXAMINER
ACKNOWLEDGEMENT

We express our sincere gratitude to our esteemed institute, “Srinivasa Institute of
Engineering & Technology”, which has provided us an opportunity to fulfill our most cherished
desire to reach our goal.

We owe our project to Dr. R. SRINIVAS, M.Tech, Ph.D, Associate Professor,
Department of AIML, who has been our project guide. We sincerely thank him for his support and
guidance, without which we would not have brought this effort to
success.

We express our heartfelt thanks to Dr. R. SRINIVAS, M.Tech, Ph.D, Associate
Professor, our beloved Head of the Department, for his
valuable advice and timely guidance.

We would like to thank the Principal, Dr. M. SREENEVAS KUMAR, M.Tech, Ph.D
(U.K), MISTE, FIE (1), and the Management of “Srinivasa Institute of Engineering &
Technology” for providing us with the requisite facilities to carry out our project on the campus.

Our heartfelt thanks go to the project coordinator, Mr. K VIJAY BABU, M.Tech, and all the
faculty members of our department for their value-based teaching of theory and practical
subjects, which we put to use in our project. We are also indebted to the Non-Teaching Staff
for their co-operation.

We would like to thank our friends and family members for their help and support in
making our project a success. Last but not least, we would like to convey special thanks to all
those who have helped, directly or indirectly, in the completion of the project work.

S.PUNYA TEJA 216N1A6156

K.V. R.TULASI 216N1A6129

L.JITHENDRA 216N1A6136

A.SATHISH 216N1A6101

V.V.S. K. ARUN 216N1A6163

LIST OF CONTENTS

1   CHAPTER-1   INTRODUCTION
    1.1 Overview
2   CHAPTER-2   LITERATURE SURVEY
3   CHAPTER-3   CYBER ATTACKS
    3.1 Introduction
    3.2 Types of cyber attacks
    3.3 Fundamentals of machine learning
4   CHAPTER-4   INPUT & OUTPUT DESIGN
    4.1 Input design
    4.2 Output design
5   CHAPTER-5   PROBLEM IDENTIFICATION & OBJECTIVES
    5.1 Existing System
    5.2 Proposed System
    5.3 Modules
    5.4 Algorithms
    5.5 Testing
6   CHAPTER-6   SYSTEM REQUIREMENTS
    6.1 Software Environment
    6.2 System specifications
    6.3 Introduction to Python
    6.4 Installation of Windows
    6.5 Python Libraries
    6.6 Deployment of System and Maintenance
7   CHAPTER-7   SYSTEM DESIGN
    7.1 System Architecture
    7.2 UML diagrams
    7.3 Use case diagram
    7.4 Class diagram
    7.5 Sequence diagram
    7.6 Activity diagram
8   CHAPTER-8   IMPLEMENTATION
    8.1 Flowchart
    8.2 Code
9   CHAPTER-9   RESULTS AND DISCUSSIONS
    9.1 Implementation Description
    9.2 Dataset Description
    9.3 Result Description
10  CHAPTER-10  CONCLUSION AND FUTURE SCOPE
11  CHAPTER-11  REFERENCES

LIST OF FIGURES

Figure No   Figure Name

3.1         Overall design of cyber attack
5.1         Dataset Splitting
7.1         System Architecture
7.2.1       Dataflow diagram
7.3.1       Use case diagram
7.4.1       Class Diagram
7.5.1       Sequence diagram
7.6.1       Activity diagram
8.1         Working Process
9.3.1       Upload dataset
9.3.2       Data preprocess
9.3.3       Training the dataset
9.3.4       SVM accuracy score
9.3.5       Logistic Regression accuracy score
9.3.6       Decision Tree accuracy score
9.3.7       Random forest accuracy score
9.3.8       Upload the test dataset
9.3.9       Attack detection
9.3.10      Accuracy graph
9.3.11      Confusion matrix

ABSTRACT

Present-day life depends on cyberspace for nearly every aspect of daily
living, and the use of cyberspace is rising with each passing day. The world is spending more
time on the Internet than ever before. As a result, the risks of cyber threats and cybercrimes
are increasing. The term 'cyber threat' refers to illegal activity performed using the
Internet. Cybercriminals change their techniques over time to get past existing
defenses. Conventional techniques are not capable of detecting zero-day attacks and
sophisticated attacks. Many machine learning techniques have therefore been developed to
detect cybercrimes and combat cyber threats. The objective of this research work is
to evaluate some of the widely used machine learning techniques for
detecting some of the most serious threats to cyberspace. Three primary machine
learning techniques are investigated: deep belief network, decision tree, and
support vector machine. We present a brief exploration to gauge the performance of
these machine learning techniques in spam detection, intrusion detection, and malware
detection on frequently used benchmark datasets.

CYBER THREAT DETECTION

SIET-AIML 1
CHAPTER-1
INTRODUCTION

OVERVIEW:
Intrusion Detection Systems (IDSs) are security solutions that, like antivirus software, firewalls, and access control schemes, are designed
to make information and communication systems more secure. IDSs arose as a result of the inadequacy of
traditional security methods. The following subsections discuss network security, firewalls, and IDSs,
respectively. According to Cisco [5], network security involves any activity designed to protect
the usability and integrity of the user’s network and data. This activity incorporates both tangible (hardware)
and intangible (software) technologies. Effective network security manages access to the
network. It can detect and prevent a variety of threats from entering or spreading throughout
the user’s network. The majority of security threats are purposefully created by malicious
people seeking a benefit, gaining publicity, or harming someone. Network security issues can be loosely
classified into five interconnected areas, as noted by [6]:
1. Confidentiality: The contents of the transmitted communication should be understood only by the sender
and the intended receiver, because the message could be intercepted by eavesdroppers. Encryption is used to
accomplish this.
2. Message integrity: The content of the delivered message must not be tampered with, either intentionally or
accidentally. Checksums and hash functions are used to accomplish this.
3. Verification (authentication): The party sending the information and the party receiving it should each be
able to verify the identity of the other.
4. Non-repudiation: This deals with the possibility of someone denying having sent a message or carried
out an action. It is achieved through digital signatures.
5. Operational security: This is a security process used to prevent important materials of a company or an
institution from being accessed by unauthorized individuals.
Nearly all institutions, including banks and higher learning institutions, among others, possess or
use a network that is linked to the public Internet. Such networks can be
tampered with without the owner’s consent. Malicious people can introduce worms into the network’s hosts,
access the institution’s confidential documents, change the organization’s network configuration, and launch
denial-of-service (DoS) attacks. For this reason, firewalls and IDSs are put into use to counter attacks that may
arise against a company’s network. Networks of companies or institutions are organized into two categories:
the internal network and the demilitarized zone (Fig. 1). The internal network of the company or institution can
be accessed only by the network administrators or the workers within the company. The demilitarized zone
(DMZ) can be accessed by anyone. Having a demilitarized zone within an organization plays a crucial
role.

It adds an extra layer of security to the company’s internal network because the hosts that are the most
susceptible to attacks are the ones that provide services to users outside the internal network, for
instance, electronic mail, website, and domain name system (DNS) servers. Because these hosts face a high
number of attacks, they are placed within a subnetwork to protect the rest of
the organization’s network. An external host can access only the information exposed in the DMZ;
the rest of the organization’s network cannot be
accessed from an external host. Nevertheless, separating the organization’s
network without a mechanism to control network traffic makes little sense. Consequently,
a common security mechanism is the addition of a firewall. A packet
filter (firewall) inspects the header fields of protocols such as IP, TCP, UDP, and ICMP when determining whether
to allow packets past the firewall. However, Deep Packet Inspection (DPI) is required to detect many attack
types, particularly those that the packet filter cannot detect. A device that, unlike a packet filter, analyzes not only
the headers of all packets traveling through it but also their contents through deep packet inspection has a place in
intrusion prevention. An Intrusion Prevention System (IPS) is a device that, on detecting a suspect packet or a
suspicious series of packets, drops them to prevent them from reaching the organization’s network.
An Intrusion Detection System (IDS) is a device that lets such packets pass on their way to the corporate
network but sends an alarm to the network administrator or logs the packets. This section looks at
intrusion detection in further depth. Intrusion Detection Systems are computer-based security and defense
systems that monitor, identify, and analyze harmful activity on hosts or networks. The purpose of an
intrusion detection system is to ensure that the integrity,
confidentiality, and availability of a computer system or network are maintained. Upon detecting that an
intrusion has occurred and that the firewall failed to mitigate or stop it, the Intrusion Detection System raises an alert [8]. The firewall
is the first line of protection against intrusion. The use of an Intrusion Detection System rests on
the expectation that some attacks will occur that the firewall cannot eliminate or mitigate. Intrusion Detection
Systems can be classified in different ways, based on the monitored platform or the technique they employ to
identify anomalous activity.
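
The distinction drawn above between a packet filter, an IDS, and an IPS can be illustrated with a minimal Python sketch. It is only a toy model under stated assumptions: the `Packet` class, the allowed ports, and the byte patterns are hypothetical and are not taken from the project's implementation.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src_ip: str
    dst_port: int
    protocol: str   # e.g. "TCP", "UDP", or "ICMP"
    payload: bytes


# Hypothetical policy values, chosen only for illustration.
ALLOWED_PORTS = {25, 53, 80, 443}
SUSPICIOUS_PATTERNS = [b"/etc/passwd", b"' OR '1'='1"]


def packet_filter(pkt: Packet) -> bool:
    """Header-only check: return True if the packet may pass."""
    return pkt.protocol in {"TCP", "UDP"} and pkt.dst_port in ALLOWED_PORTS


def deep_inspect(pkt: Packet) -> bool:
    """Toy stand-in for DPI: True if the payload contains a known bad pattern."""
    return any(pattern in pkt.payload for pattern in SUSPICIOUS_PATTERNS)


def ids(pkt: Packet, alerts: list) -> bool:
    """IDS: always forwards the packet, but records an alert if it looks malicious."""
    if deep_inspect(pkt):
        alerts.append(f"ALERT: suspicious payload from {pkt.src_ip}")
    return True


def ips(pkt: Packet, alerts: list) -> bool:
    """IPS: drops (returns False) a packet that looks malicious."""
    if deep_inspect(pkt):
        alerts.append(f"DROPPED: suspicious payload from {pkt.src_ip}")
        return False
    return True
```

In this sketch a request to port 80 whose payload contains `/etc/passwd` passes the header-only filter, is forwarded with an alert by `ids`, and is dropped by `ips`.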
Motivation:
Cybersecurity threats have increased significantly due to the rapid digital transformation and the growing
dependence on technology. Cyber-attacks pose significant risks to individuals, businesses, and
government institutions, leading to financial losses, data breaches, and privacy concerns. The need for
robust cybersecurity measures and an understanding of various attack mechanisms is more crucial than
ever.
Cyber threats come in many forms, including malware, phishing, ransomware, denial-of-service attacks,
and insider threats. Organizations and individuals must adopt proactive measures to safeguard their
digital assets. The increasing sophistication of cybercriminals calls for continuous innovation in
cybersecurity solutions, emphasizing real-time monitoring, threat intelligence, and advanced mitigation
strategies.

Problem Statement:
The task is to build a network intrusion detector: a predictive model capable of distinguishing between bad
connections, called intrusions or attacks, and good, normal connections. Securing industrial
networks with conventional IT solutions may not be a reasonable approach because these networks serve
different functions. Hence, to effectively protect an industrial control system (ICS) network from the increasing number of intrusions
and reduce their impact, an efficient Intrusion Detection System (IDS) that can minimize the effects of
attacks is vital. However, existing IDSs have proven inefficient at detecting zero-day attacks. They also
suffer from false positives (unnecessary alarms) and false negatives (missed attacks, which weaken security), both of which affect
the performance and accuracy of the ICS. When designing an efficient IDS framework, the challenge
developers face is to combine the various components in a way that reduces these drawbacks.
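
To make the false-positive and false-negative terms concrete, the following is a minimal sketch of how they could be computed for a binary detector. The label convention (1 = attack, 0 = normal) and the function names are assumptions made for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 = attack and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unnecessary alarm
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attack
    return tp, fp, tn, fn


def ids_metrics(y_true, y_pred):
    """Return accuracy, false-positive rate, and false-negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

A detector can score well on accuracy while still missing a large share of attacks, which is why the two error rates are reported separately here.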
Applications:

Cyberattack detection is essential for securing digital assets across various domains, including network
security, cloud security, finance, healthcare, industrial control systems, and government sectors. It helps
identify unauthorized access, detect malware, prevent data breaches, and mitigate threats using advanced
technologies like Intrusion Detection Systems (IDS), AI-driven anomaly detection, and real-time
monitoring. In the financial sector, it prevents fraudulent transactions, while in healthcare, it safeguards
patient data from cyber threats. Industrial control systems rely on it to protect critical infrastructure, and
government agencies use it for national security. As cyber threats continue to evolve, robust detection
mechanisms are crucial for ensuring digital safety and maintaining data integrity.

CHAPTER-2
LITERATURE SURVEY

Because a basic SVM cannot be applied directly to the IDS domain due to the previously mentioned shortcomings, various authors
have suggested variants of the SVM framework to address these limitations. Some of the related works
are summarized here.
• Heba F. Eid, Ashraf Darwish, Aboul Ella Hassanien, and Ajith Abraham introduced an
intrusion detection system that uses Principal Component Analysis (PCA) with Support Vector
Machines (SVMs) as an approach to select the optimum feature subset [11]. They verified the
effectiveness and feasibility of the proposed IDS through several experiments on the NSL-KDD
dataset.
• J. F. Joseph, A. Das, and B. C. Seet proposed an autonomous host-based IDS for detecting
sinking behaviour in an ad hoc network [12]. The proposed detection system uses a cross-layer
approach to maximize detection accuracy. To further maximize the detection accuracy, SVM is used
for training the detection model. However, SVM is computationally expensive for resource-limited
ad hoc network nodes. Hence, the proposed IDS preprocesses the training data to reduce the
computational overhead incurred by SVM. The number of features in the training data is reduced using
predefined association functions. The proposed IDS also uses a linear classification algorithm,
namely Fisher Discriminant Analysis (FDA), to remove data with low information content
(entropy). These data reduction measures make SVM feasible on ad hoc network nodes.
• T. Shon, Y. Kim, C. Lee, and J. Moon proposed a machine learning model using a
modified Support Vector Machine (SVM) that combines the benefits of supervised and unsupervised
learning. Moreover, a preliminary feature selection process using a genetic algorithm (GA) is provided to select more
appropriate packet fields.
• Peddabachigari, A. Abraham, and C. Grosan conducted an empirical investigation of SVM and Decision
Tree (DT), in which they analyzed their performance as standalone detectors and as hybrids. Two hybrid
models were examined: a hierarchical model (DT-SVM), with the DT as the first layer to produce
node information for the SVM in the second layer, and an ensemble model comprising the standalone
techniques and the hierarchical hybrid. For the ensemble approach, each technique is given a weight
according to its detection rate for each particular attack type during training. Thereafter, when the system
is tested, only the technique with the largest weight for the respective attack prediction is chosen to
output the classification. The approaches were tested on the KDD Cup ’99 data set.

• R. C. Chen, K. F. Cheng, and C. F. Hsieh used RST (Rough Set Theory) and SVM
(Support Vector Machine) to detect intrusions [15]. First, RST is used to preprocess the data and
reduce the dimensions. Next, the features selected by RST are sent to the SVM model for training and
testing. The method effectively decreases the space density of the data.
• Kyaw Thet Khaing proposed an enhanced SVM model with Recursive Feature
Elimination (RFE) and a K Nearest Neighbor (KNN) method to perform the feature ranking and
selection task of the new model [16].
Different techniques have been implemented to tackle the problem of feature selection. Some
methods use the predictive accuracy of a classifier as a means to evaluate the “goodness” of a feature set,
while others use measures such as information, consistency, or distance to compute the relevance
of a set of features. These approaches suffer from several drawbacks. The first major drawback is that feeding
the classifier with arbitrary features may lead to biased results, and hence we cannot rely on the classifier’s
predictive accuracy as a measure for selecting features. A second drawback is that, for a set of N features, trying
all possible combinations of features (2^N combinations) to find the best combination to feed the classifier is
not a feasible approach.
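
The second drawback can be made concrete with a small sketch that contrasts exhaustive subset enumeration with a greedy forward search, one common way to avoid the 2^N blow-up. This is an illustration only, not a method taken from the surveyed papers; the `score` callback is a hypothetical placeholder for any evaluation measure, such as a classifier's validation accuracy.

```python
from itertools import combinations


def all_subsets(features):
    """Yield every subset of the feature list (2**N subsets, including the empty one)."""
    for k in range(len(features) + 1):
        for subset in combinations(features, k):
            yield subset


def greedy_forward_selection(features, score, max_features):
    """Repeatedly add the single feature that most improves score(subset).

    `score` is a caller-supplied function taking a tuple of feature names and
    returning a number; higher is better. The search stops when no candidate
    improves the score or when max_features have been selected.
    """
    selected = []
    best_score = score(tuple(selected))
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max(
            (score(tuple(selected + [f])), f) for f in candidates
        )
        if top_score <= best_score:
            break
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score
```

With N = 41 features, as in KDD-style datasets, exhaustive search would need 2^41 (about 2.2 trillion) evaluations, whereas the greedy search evaluates at most N candidates per added feature. The trade-off is that a greedy search is not guaranteed to find the best subset.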

CHAPTER-3
INTRODUCTION TO CYBER ATTACKS

3.1 Introduction to Cyber Attacks

In today's digital world, cyberattacks have become a significant threat to individuals, businesses, and
governments. A cyberattack is any malicious attempt to gain unauthorized access to a computer system,
network, or data with the intent to steal, alter, or destroy information. These attacks can take various forms,
including malware infections, phishing scams, denial-of-service (DoS) attacks, and ransomware. With the
increasing reliance on digital infrastructure, cyber threats are becoming more sophisticated and frequent,
making cybersecurity a critical concern.

Cybercriminals use various techniques to exploit vulnerabilities in systems, often targeting sensitive data
such as personal information, financial records, and intellectual property. Advanced persistent threats
(APTs), social engineering attacks, and zero-day exploits are some of the highly complex cyber threats
organizations face today. The consequences of a cyberattack can be severe, ranging from financial losses and
reputational damage to legal consequences and national security risks. This growing threat landscape
demands proactive security measures and robust cyber defense strategies. To combat cyberattacks
effectively, organizations and individuals must adopt a multi-layered cybersecurity approach. This includes
implementing strong authentication methods, regular security updates, employee awareness training, and the
use of artificial intelligence for threat detection. Governments and cybersecurity experts worldwide are also
working on developing advanced security frameworks to counter evolving cyber threats. As technology
continues to advance, the need for continuous monitoring and improvement in cybersecurity practices
remains crucial to safeguarding digital assets and ensuring data privacy.

Fig 3.1.1: Overview of cyber attack

3.2 TYPES OF CYBER ATTACK:

Cyberattacks come in various forms, each designed to exploit vulnerabilities in digital systems and
networks. Malware attacks involve malicious software such as viruses, worms, ransomware, and spyware
that infect devices to steal data, disrupt operations, or demand ransom payments. Phishing attacks trick
users into revealing sensitive information, such as passwords and financial details, through deceptive emails,
messages, or fake websites. Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) attacks
flood networks or websites with excessive traffic, causing disruptions and making services unavailable to
legitimate users. Man-in-the-Middle (MitM) attacks intercept communications between two parties to
steal or manipulate data, often occurring over unsecured public Wi-Fi networks.

Additionally, SQL injection attacks exploit vulnerabilities in web applications by injecting malicious SQL
code into databases, allowing attackers to access or manipulate sensitive information. Zero-day exploits
target unknown software vulnerabilities before developers can release security patches, making them highly
dangerous. Password attacks, such as brute-force and credential stuffing, attempt to crack passwords and
gain unauthorized access to systems. Social engineering attacks manipulate individuals into disclosing
confidential information through psychological manipulation, often bypassing technical security measures.
As cyber threats continue to evolve, understanding these attack types is essential for implementing effective
cybersecurity measures and protecting digital assets.
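
As an illustration of the SQL injection attack described above, the following minimal sketch uses Python's standard `sqlite3` module with an in-memory database. The `users` table and the function names are hypothetical and exist only for this example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one hypothetical user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def login_vulnerable(conn, name, password):
    # UNSAFE: user input is concatenated straight into the SQL text,
    # so a crafted password can change the meaning of the query.
    query = (
        "SELECT COUNT(*) FROM users "
        f"WHERE name = '{name}' AND password = '{password}'"
    )
    return conn.execute(query).fetchone()[0] > 0


def login_safe(conn, name, password):
    # SAFER: placeholders keep user input as data, never as SQL code.
    query = "SELECT COUNT(*) FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchone()[0] > 0
```

Passing the password `' OR '1'='1` to `login_vulnerable` turns the condition into one that is always true, so the login succeeds without the real password; `login_safe` treats the same string as an ordinary, incorrect password.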

3.3 Fundamentals of Machine Learning

Machine learning is the foundation of modern cybersecurity advancements because it enables the analysis of
massive datasets, the recognition of patterns, and the formation of predictions that are essential for the
detection, prevention, and response to threats. Within the scope of this part, we will investigate the
fundamental ideas that underpin machine learning and the significance of these ideas in the field of
cybersecurity.

1. Understanding Machine Learning Algorithms

The term "machine learning" refers to a wide variety of algorithms, each of which is designed to meet
particular requirements within the field of cybersecurity. This subsection offers a summary of the
fundamental ideas, which are as follows:

Supervised Learning: In the process of supervised learning, models are trained using datasets that have
been labelled, with the input data being associated with the output labels that correspond to it. In order to
complete tasks such as classification and regression, this approach is absolutely necessary.

Unsupervised Learning: Discovering hidden patterns or groups is the goal of unsupervised learning, which
involves training models on data that has not been labelled. In the field of cybersecurity, clustering and
dimensionality reduction are two applications that are frequently used.

Semi-Supervised Learning: The semi-supervised learning approach, which incorporates aspects of both
supervised and unsupervised learning, is particularly beneficial in situations where there is a scarcity of
labelled data but an abundance of unlabelled data.
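
The difference between supervised and unsupervised learning can be sketched with two very small routines on one-dimensional data (for example, packets per second). Both are simplified stand-ins chosen for illustration, a nearest-centroid classifier and a two-cluster k-means, and neither is the algorithm used later in this report.

```python
from statistics import mean


def fit_centroids(samples, labels):
    """Supervised: compute one centroid per class from labelled 1-D samples."""
    classes = sorted(set(labels))
    return {c: mean(x for x, y in zip(samples, labels) if y == c) for c in classes}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def two_means(samples, iterations=10):
    """Unsupervised: split unlabelled 1-D samples into two clusters (k-means, k=2).

    Assumes the samples contain at least two distinct values.
    """
    lo, hi = min(samples), max(samples)
    for _ in range(iterations):
        a = [x for x in samples if abs(x - lo) <= abs(x - hi)]
        b = [x for x in samples if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(a), mean(b)
    return sorted(a), sorted(b)
```

The supervised routine needs the labels to learn what "normal" and "attack" look like, while the unsupervised routine only discovers that the data falls into two groups and leaves their interpretation to the analyst.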

CHAPTER 4

INPUT AND OUTPUT DESIGN

4.1 INPUT DESIGN

The input design is the link between the information system and the user. It comprises developing
specifications and procedures for data preparation, that is, the steps necessary to put transaction data into a
usable form for processing. This can be achieved by instructing the computer to read data from a written or printed
document, or by having people key the data directly into the system. The design of input
focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra
steps, and keeping the process simple. The input is designed so that it provides security and
ease of use while retaining privacy. Input design considers the following:
➢ What data should be given as input?
➢ How should the data be arranged or coded?
➢ The dialog to guide the operating personnel in providing input.
➢ Methods for preparing input validations and the steps to follow when errors occur.

OBJECTIVE
1. Input design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important to avoid errors in the data input process and to show
management the correct way to get accurate information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large volumes of
data. The goal of designing input is to make data entry easier and free from errors. The data entry
screen is designed in such a way that all data manipulations can be performed. It also provides record
viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the help of
screens. Appropriate messages are provided as and when needed so that the user is never left confused.
Thus, the objective of input design is to create an input layout that is easy to follow.
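
The validity check and user messages described in the objectives above could look like the following minimal sketch. The field names in the usage note and the specific rules (required fields, non-negative numbers) are assumptions made for illustration and are not taken from the project's code.

```python
def _is_blank(value):
    """True if the value is missing or contains only whitespace."""
    return value is None or str(value).strip() == ""


def validate_record(record, required_fields, numeric_fields):
    """Return a list of readable error messages; an empty list means the record is valid."""
    errors = []
    for field in required_fields:
        if _is_blank(record.get(field)):
            errors.append(f"Missing value for '{field}'.")
    for field in numeric_fields:
        value = record.get(field)
        if _is_blank(value):
            continue  # blank values are reported above when the field is required
        try:
            if float(value) < 0:
                errors.append(f"'{field}' must not be negative.")
        except (TypeError, ValueError):
            errors.append(f"'{field}' must be a number, got {value!r}.")
    return errors
```

For example, a row with an empty `protocol`, a `duration` of `"abc"`, and a `src_bytes` of `"-5"` would produce three messages that can be shown to the user before the data reaches the model.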

4.2 OUTPUT DESIGN

A quality output is one that meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users and to other systems
through outputs. In output design, it is determined how the information is to be displayed for immediate need,
as well as the hard copy output. It is the most important and direct source of information to the user. Efficient
and intelligent output design improves the system’s relationship with the user and helps decision-making.
Designing computer output should proceed in an organized, well-thought-out manner; the
right output must be developed while ensuring that each output element is designed so that people
can use the system easily and effectively. When analysts design computer output, they should:
1. Identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following objectives:
• Convey information about past activities, current status, or projections of the future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.

CHAPTER-5
PROBLEM IDENTIFICATION & OBJECTIVES

5.1 Existing System – Honeypot in Cyber Attack Detection

In the field of cybersecurity, honeypots are widely used as a proactive defense mechanism to detect, analyze,
and mitigate cyber threats. A honeypot is a decoy system or network designed to lure cyber attackers,
allowing security experts to study their tactics, techniques, and procedures (TTPs). Unlike traditional
security measures that focus on preventing intrusions, honeypots attract malicious actors, providing valuable
intelligence on potential threats. These systems simulate real-world vulnerabilities, making them an essential
tool for identifying new attack patterns and improving overall cybersecurity defenses.

Honeypots come in different types, including low-interaction honeypots, which mimic basic system
functionalities to detect automated attacks, and high-interaction honeypots, which replicate fully functional
systems to engage attackers in-depth. By deploying honeypots, organizations can monitor cybercriminal
activities in a controlled environment without risking their actual infrastructure. Additionally, honeypots
help identify zero-day vulnerabilities, brute-force attacks, and malware propagation strategies, enabling
security teams to enhance their defensive measures.

Despite their effectiveness, honeypots also come with certain limitations. Sophisticated attackers may
recognize and evade honeypot systems, reducing their effectiveness. Furthermore, if not properly isolated, a
compromised honeypot could become a launching point for further attacks on the network. To maximize
their benefits, honeypots should be integrated with other cybersecurity tools, such as intrusion detection
systems (IDS), machine learning-based threat analysis, and firewall protection. By continuously
evolving and improving honeypot strategies, cybersecurity professionals can stay ahead of emerging threats
and strengthen overall network security.

Limitations of Honeypots in Cybersecurity

1. Limited Detection Scope – Honeypots only detect attacks that specifically target them. If an attacker
bypasses the honeypot and directly exploits the actual system, the attack remains undetected.

2. Ineffectiveness Against Internal Threats – Honeypots primarily attract external attackers and are
not effective in identifying malicious activities from within an organization.

3. False Data and Misleading Information – Attackers may intentionally provide false data to
manipulate honeypots and mislead security analysts, reducing the reliability of gathered intelligence.

4. Limited Real-World Application – While honeypots are useful for research and threat intelligence,
they do not actively protect the entire system and should be combined with other security measures
like firewalls and intrusion detection systems (IDS).

5.2 Proposed System

The Cloud-Network-Signatures dataset was analyzed and trained using four prominent Machine
Learning (ML) algorithms: Random Forest (RF), Decision Tree, Logistic Regression, and Support
Vector Machine (SVM). These algorithms were selected for their ability to classify network traffic patterns
and detect potential intrusions in cloud environments. Each algorithm offers unique advantages in
cybersecurity applications—Random Forest and Decision Tree are widely used for their ability to handle
large datasets with high accuracy, while Logistic Regression provides interpretability, and SVM excels in
high-dimensional spaces. By leveraging these models, the study aims to improve the accuracy and efficiency
of Network Intrusion Detection Systems (NIDS).

A crucial step in the process was data preparation, which involved eliminating ambiguity, selecting
relevant features, and normalizing data to ensure consistency and accuracy. Proper feature selection
enhances the predictive performance of ML models by reducing noise and computational complexity.
Normalization ensures that all numerical values are on a comparable scale, preventing biases in the training
process. This step is essential for Intrusion Detection Systems (IDS), as it helps the models learn patterns
more effectively, leading to better anomaly detection in cloud-network environments. By refining the
dataset, the study ensures that the models can generalize well to new, unseen network traffic, minimizing
false positives and negatives.

For evaluating the NIDS performance, each of the supervised ML algorithms was tested using different
feature selection methods. Support Vector Machines (SVM), Random Forest (RF), Decision Tree, and
Logistic Regression were applied to analyze the dataset's effectiveness in detecting cyber threats. These
models were trained on different feature subsets to determine which combination provided the highest
detection accuracy. The comparative analysis of these algorithms helped identify the most suitable model for
cloud-based intrusion detection, ensuring robust security and timely threat mitigation. By combining
multiple ML techniques, the study enhances the adaptability and resilience of IDS in cloud environments,
making networks more secure against evolving cyber threats.


Advantages

1. Tuned to Specific Content in Network Packets – The system can analyze network packets in detail
and filter traffic based on specific patterns, helping to detect and prevent cyber threats more
effectively. This ensures precise identification of malicious activities, improving overall network
security.
2. Ability to Qualify and Quantify Attacks – The technique not only detects attacks but also
categorizes them based on severity and impact. It provides a quantitative measure of threats, helping
security teams prioritize and mitigate risks efficiently.
3. Compliance with Regulations – The cybersecurity system is designed to align with industry
standards and legal regulations. It simplifies adherence to data protection laws, cybersecurity
policies, and cloud security frameworks, making regulatory compliance more manageable for
organizations.
4. Automated Cybersecurity for Cloud Protection – The proposed approach introduces an automated
Intrusion Detection System (IDS) that continuously monitors cloud environments. By leveraging
Machine Learning (ML) and AI-driven techniques, it detects anomalies in real-time, ensuring
proactive threat mitigation.
5. Secure Document Storage and Sharing – The application allows users to securely store and share
documents while maintaining data confidentiality, integrity, and availability. Advanced
encryption techniques protect sensitive information from unauthorized access, ensuring secure
collaboration within cloud environments.

5.3 MODULES
In this section, the methodology of the research is discussed. According to the literature, there is a
critical need for effective machine learning and deep learning models for identifying attacks
in datasets. The NSL-KDD dataset was analyzed and used to train four Machine Learning algorithms:
Random Forest (RF), Decision Tree, Logistic Regression, and Support Vector Machine (SVM). The general
layout of the methodology is as follows.
a) Dataset Collection:

NSL-KDD is a condensed version of the original KDD dataset that was acquired from the Canadian
Institute for Cybersecurity [21]. It has the same features as KDD. Each record has 41 features and one
class attribute. Each connection is classified as either an attack or a normal connection.

b) Data pre-processing

Pre-processing is a very important step in preparing the data to be fed into the algorithm. The
goal of data preparation is to eliminate ambiguity in the dataset and provide the IDS with accurate data. It
combines feature selection and normalization. Many symbolic attributes in the dataset, such as flags and
protocol types, have nominal values. These values must be converted to numeric values so that the
algorithms can process them and perform better.
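The conversion and normalization described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the four-row sample is made up, the column names (`duration`, `protocol_type`, `flag`, `src_bytes`, `label`) are assumed to follow the usual NSL-KDD naming, and one-hot encoding with min-max scaling is only one of several reasonable choices.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical mini-sample; the real dataset has 41 features per record.
records = pd.DataFrame({
    "duration": [0, 2, 0, 5],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["SF", "SF", "S0", "REJ"],
    "src_bytes": [181, 239, 0, 1032],
    "label": ["normal", "normal", "attack", "attack"],
})

def preprocess(df):
    """Encode symbolic attributes as numbers and scale every feature to [0, 1]."""
    features = df.drop(columns=["label"])
    # One-hot encode the nominal attributes (assumed choice of encoding).
    encoded = pd.get_dummies(features, columns=["protocol_type", "flag"], dtype=float)
    # Min-max normalization puts all numeric values on a comparable scale.
    scaled = MinMaxScaler().fit_transform(encoded)
    X = pd.DataFrame(scaled, columns=encoded.columns, index=df.index)
    # Binary target: 1 for attack, 0 for normal.
    y = (df["label"] != "normal").astype(int)
    return X, y

X, y = preprocess(records)
```

In a real pipeline the scaler would be fitted on the training portion only and then applied to the test portion, so that no information from the test data influences training.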
c) Feature selection

Feature selection produces improved and more efficient subsets by eliminating redundant and irrelevant
features. Correlation is a popular and successful strategy for identifying the most closely linked
characteristics in any dataset; it defines the strength of the relationship between features, based on the
assumption that features are conditionally independent given the class. A good feature subset contains
features that are highly correlated with (predictive of) the class yet uncorrelated with, and not predictive
of, one another. For feature selection, CfsSubsetEval with the BestFirst search method in WEKA was
chosen; the table shows the result.
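The idea of a "good feature subset" can be made concrete with the correlation-based merit score commonly associated with CFS, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the number of features, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below is an illustration under assumptions: it uses absolute Pearson correlation as the correlation measure, which may differ from what WEKA uses internally, and the function name `cfs_merit` is hypothetical.

```python
import numpy as np

def cfs_merit(X, y):
    """Merit of a feature subset: high class correlation, low inter-feature correlation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = X.shape[1]
    # Mean absolute correlation between each feature and the class.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(k)])
    if k == 1:
        return float(r_cf)
    # Mean absolute correlation between distinct pairs of features.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    r_ff = corr[np.triu_indices(k, 1)].mean()
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))

# Toy data: one feature that matches the class and one that is unrelated to it.
y = np.array([0, 0, 1, 1])
informative = np.array([0, 0, 1, 1])
unrelated = np.array([0, 1, 0, 1])

merit_one = cfs_merit(informative.reshape(-1, 1), y)
merit_two = cfs_merit(np.column_stack([informative, unrelated]), y)
```

Here adding the unrelated feature lowers the merit, which is the behaviour a subset search such as BestFirst exploits when deciding which features to keep.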

d) Split and discretization

The main objective of discretization is to improve overall classification performance while reducing
storage space, because discretized data takes up less space. Discretization is an important step before
classification because several classifiers operate on discrete data. The numeric attributes were
discretized with an unsupervised discretization filter in WEKA using 10 bins. Splitting the dataset into
training and testing sets is also one of the most important steps in building any machine learning
model. In this study, the dataset was split into 80% of the data for training and 20% for testing, with 1%
of the data set aside for validation.
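A rough Python equivalent of the split and the 10-bin unsupervised discretization is sketched below. It is an assumption-laden illustration: the study used a WEKA filter, whereas this uses scikit-learn's `KBinsDiscretizer` with equal-width bins, the data are randomly generated, and the separate validation portion is omitted.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for numeric traffic features and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# 80% training, 20% testing, keeping the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Unsupervised discretization into 10 equal-width bins per feature.
# The bin edges are learned from the training data only and reused for the test data.
discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
X_train_binned = discretizer.fit_transform(X_train)
X_test_binned = discretizer.transform(X_test)
```

Fitting the discretizer after the split, rather than before it, keeps the test set from influencing the bin boundaries.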


Fig 5.3.1: Data Splitting

e) Classification process

To evaluate the performance of NIDS on the NSL-KDD dataset, this study used four supervised machine
learning algorithms, Support Vector Machine (SVM), Random Forest (RF), Decision Tree, and Logistic
Regression, with each type of feature selection method. In general, every classification process in
machine learning is divided into five steps.
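As an illustration of how the four classifiers can be trained and compared on the same split, the following sketch uses scikit-learn with a synthetic dataset. The data, the 13-feature setting, and all hyperparameters are assumptions chosen for the example, not the configuration used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for a selected feature subset.
X, y = make_classification(n_samples=500, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The four supervised algorithms compared in this report (default-like settings).
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
}

# Fit each model on the training part and record its accuracy on the test part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Repeating the loop once per feature subset gives the kind of per-method comparison described above.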
f) Evaluation metrics

Evaluating the resulting classification models is an important phase, and it is carried out with a
variety of evaluation metrics. The metrics are computed from the following counts:

• True Positives (TP): the total number of malicious packets correctly classified as attacks.

• True Negatives (TN): the total number of normal packets correctly classified as normal.

• False Positives (FP): the total number of normal packets incorrectly classified as attacks.

• False Negatives (FN): the total number of malicious packets incorrectly classified as normal.

Classification accuracy is the most commonly used statistic for evaluating a model; however, on its own it
is not always a reliable indicator of performance, particularly when the classes are imbalanced.

Accuracy: The proportion of correctly classified samples to the total number of input samples. It is
calculated using the following formula:

Accuracy = (TP+TN)/(TP+FP+FN+TN)

Precision: The number of correctly classified positive samples divided by the number of samples
that the classifier predicted as positive (i.e., the proportion of predicted positives that are truly
positive). Its formula is as follows:

Precision = TP/(TP+FP)


Recall: It is calculated by dividing the number of correctly classified positive samples by the total
number of positive samples passed.

Recall = TP/(TP+FN)

Matthews Correlation Coefficient (MCC): It measures the correlation between observed and
predicted binary classifications.

MCC = (TP*TN – FP*FN) / sqrt[(TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)]
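The formulas above can be checked with a small worked example. The function below is a minimal sketch; the counts passed to it are made-up numbers for illustration, not results from this study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}

# Hypothetical counts: 100 attack packets and 100 normal packets.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 175/200 = 0.875, the precision is 90/105 (about 0.857), the recall is 90/100 = 0.9, and the MCC is about 0.751.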

g) Model validation

In the final step, the model is implemented and trained based on the decisions made in the previous
steps, and then validated to check whether it meets all of the preconditions and how accurately it
predicts on new data. These assessments reveal the model's flaws and limitations,
allowing the required measures to be taken to address them.
In comparison to the other algorithms, the experiment shows that RF has the highest accuracy, followed by
the other ML algorithms. Table 10 shows that selecting 13 features for each algorithm provides high accuracy
in the binary-class setting. The models reach close values of accuracy (98.92%) and F-measure (98.9%). This
indicates that the model is well suited to detecting DoS attacks. The multi-class setting shows the same
pattern, with slight changes in accuracy, which was highest for the RF model.

5.4 Algorithms
Logistic Regression Algorithm
Logistic Regression (LR) is a supervised machine learning (SML) model that is widely used for classification.
It performs well on linearly separable classes, is easy to implement, and is commonly used in industry.
LR is a linear model generally used for binary classification, but with the one-vs-rest (OvR) technique
it can also be used for multi-class classification [9]. LR is applied on the dataset with three
different train-test ratios (80:20, 60:40, and 70:30) to predict whether the bank currency is forged or genuine.
For the 80:20 train-test ratio, the ROC curve and learning curves are drawn. The accuracy of LR is observed
to be around 98%.

Decision Tree Algorithm:

A decision tree (DT) is a classification model with a tree-like structure. It is built incrementally by
breaking the dataset down into smaller and smaller subsets. The resulting tree has two types of nodes:
decision nodes and leaf nodes. For example, a decision node such as Outlook has the branches Rainy,
Overcast, and Sunny, representing values of the tested feature, while a leaf node such as Hours Played gives
the decision on the numerical target value. DT can handle both numerical and categorical data [8]. DT is
applied on the dataset with three different train-test ratios (80:20, 60:40, and 70:30) to predict whether
the bank currency is forged or genuine. For the 80:20 train-test ratio, the ROC curve and learning curves are
drawn. The accuracy of DT is observed to be around 99%.

Random Forest Algorithm

Random Forest is a popular supervised learning technique. It is used mainly for classification
problems and also for regression problems. RF is a classifier that builds multiple decision
trees on different subsets of a given dataset and combines their outputs to improve prediction
accuracy on that dataset.
Rather than depending on a single decision tree, the random forest collects a prediction from every tree and
produces the final result based on the majority vote of those predictions. A larger number of trees in the
forest generally gives higher accuracy and helps avoid overfitting. RF is based on the concept of ensemble
learning, which combines multiple classifiers to solve a complex problem and improve model performance.
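The majority-vote idea can be sketched by hand with a few bootstrapped decision trees. This is a simplified illustration on synthetic data, not the implementation used in the project; the helper names `fit_forest` and `predict_majority` are hypothetical, and library implementations such as scikit-learn's `RandomForestClassifier` combine trees by averaging predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random feature subset at each split
            random_state=int(rng.integers(1_000_000)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Return, for every sample, the class predicted by most trees."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(sample_votes).argmax() for sample_votes in votes.T])

# Synthetic binary data: first 240 rows for training, last 60 for prediction.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
trees = fit_forest(X[:240], y[:240])
predictions = predict_majority(trees, X[240:])
```

Using an odd number of trees avoids tied votes in the binary case.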

Support Vector Machine (SVM)

The SVM is widely regarded as one of the most effective learning algorithms for binary classification. The
SVM, originally a type of pattern classifier based on a statistical learning technique for classification and
regression with a variety of kernel functions, has been successfully applied to a number of pattern
recognition applications. Recently, it has also been applied to information security for intrusion detection.
The Support Vector Machine has become one of the popular techniques for anomaly intrusion detection due to
its good generalization and its ability to overcome the curse of dimensionality. Another positive
aspect of SVM is that it is useful for finding a global minimum of the actual risk using structural risk
minimization, since it can generalize well with kernel tricks even in high-dimensional spaces with few
training samples.
The SVM can select appropriate setup parameters because, unlike neural networks, it does not depend on
traditional empirical risk. One of the main advantages of using SVM for IDS is its speed, as the capability of
detecting intrusions in real time is very important. SVMs can learn a larger set of patterns and
scale better, because the classification complexity does not depend on the dimensionality of the feature
space. SVMs can also update the training patterns dynamically whenever a new pattern appears
during classification.


5.5 SYSTEM TEST

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable
fault or weakness in a work product. It provides a way to check the functionality of components,
subassemblies, assemblies, and/or a finished product. It is the process of exercising software with the intent
of ensuring that the software system meets its requirements and user expectations and does not fail in an
unacceptable manner. There are various types of tests. Each test type addresses a specific testing requirement.

TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning
properly and that program inputs produce valid outputs. All decision branches and internal code flow should
be validated. It is the testing of individual software units of the application. It is done after the
completion of an individual unit, before integration. This is structural testing that relies on knowledge of
the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific
business process, application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly defined inputs and
expected results.
Integration testing
Integration tests are designed to test integrated software components to determine whether they actually run
as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields.
Integration tests demonstrate that, although the components were individually satisfactory, as shown by
successful unit testing, the combination of components is correct and consistent. Integration testing is
specifically aimed at exposing the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as specified by the
business and technical requirements, system documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.


Systems/Procedures : interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions, or
special test cases. In addition, systematic coverage pertaining to identified business process flows, data
fields, predefined processes, and successive processes must be considered for testing. Before functional
testing is complete, additional tests are identified and the effective value of current tests is determined.

System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the configuration
oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-
driven process links and integration points.
White Box Testing
White Box Testing is testing in which the software tester has knowledge of the
inner workings, structure, and language of the software, or at least its purpose. It is used to test
areas that cannot be reached from a black box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner workings, structure, or
language of the module being tested. Black box tests, like most other kinds of tests, must be written from a
definitive source document, such as a specification or requirements document. It is testing in which the
software under test is treated as a black box: you cannot "see" into it. The test provides inputs and
responds to outputs without considering how the software works.

Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.

Test strategy and approach

Field testing will be performed manually and functional tests will be written in detail.
Test objectives
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.


Features to be tested
• Verify that the entries are of the correct format
• No duplicate entries should be allowed
• All links should take the user to the correct page.
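Two of the checks listed above (correct entry format and no duplicate entries) can be expressed as automated unit tests. The sketch below uses Python's built-in `unittest` module; the `EntryRegistry` class and its IPv4-style format rule are hypothetical stand-ins invented for the example and are not part of the project's code.

```python
import re
import unittest

# Hypothetical format rule: an entry must look like a dotted IPv4-style address.
ENTRY_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

class EntryRegistry:
    """Hypothetical stand-in for a form that accepts field entries."""

    def __init__(self):
        self._entries = set()

    def add(self, entry):
        if not ENTRY_PATTERN.match(entry):
            raise ValueError(f"invalid format: {entry!r}")
        if entry in self._entries:
            raise ValueError(f"duplicate entry: {entry!r}")
        self._entries.add(entry)
        return len(self._entries)

class EntryRegistryTest(unittest.TestCase):
    def setUp(self):
        # Each test starts from an empty registry.
        self.registry = EntryRegistry()

    def test_valid_entry_is_accepted(self):
        self.assertEqual(self.registry.add("192.0.2.1"), 1)

    def test_invalid_format_is_rejected(self):
        with self.assertRaises(ValueError):
            self.registry.add("not-an-address")

    def test_duplicate_entry_is_rejected(self):
        self.registry.add("192.0.2.1")
        with self.assertRaises(ValueError):
            self.registry.add("192.0.2.1")
```

Saved in a file, these tests can be run with `python -m unittest`, which reports each failing case separately.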
Integration Testing
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications, e.g. components in a
software system or – one step up – software applications at the company level – interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the end
user. It also ensures that the system meets the functional requirements.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.


CHAPTER-6
SYSTEM REQUIREMENTS

6.1 SOFTWARE ENVIRONMENT
Python, a widely used high-level programming language, plays a crucial role in cybersecurity and
cyberattack detection. Its simplicity, readability, and extensive libraries make it an ideal choice for security
analysts and developers working on cyber defense systems. Python allows both interactive and script-based
programming, enabling security professionals to write and execute security scripts efficiently. Features like
dynamic typing, automatic memory management, and support for multiple programming paradigms make it
highly adaptable for developing cybersecurity applications, including intrusion detection systems, malware
analysis, and network security monitoring.

Python’s data structures, such as lists, tuples, and dictionaries, are essential for handling large datasets
commonly used in cyber threat analysis. Lists provide flexibility in storing and processing network logs,
while tuples ensure the integrity of unmodifiable security data. Dictionaries, which store key-value pairs, are
particularly useful in cybersecurity for mapping IP addresses to threat intelligence data and managing log
files efficiently. Additionally, Python supports multi-line statements, indentation-based code structuring, and
powerful string manipulation functions, making it an excellent tool for parsing and analyzing security-
related information from various sources.
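A small sketch of how these data structures fit together is given below. The log records, IP addresses (taken from documentation address ranges), and the threat-intelligence note are all made up for illustration, and the `summarize` helper is hypothetical.

```python
from collections import Counter

# Hypothetical parsed log records: (source IP, protocol, bytes sent).
# Tuples keep each record immutable; the list holds the ordered log.
log_entries = [
    ("192.0.2.10", "tcp", 181),
    ("198.51.100.7", "tcp", 0),
    ("192.0.2.10", "udp", 239),
    ("198.51.100.7", "tcp", 0),
    ("203.0.113.5", "icmp", 1032),
]

# Dictionary mapping IP addresses to threat-intelligence notes (made-up entry).
threat_intel = {"198.51.100.7": "known scanner"}

def summarize(entries, intel):
    """Count connections per source IP and attach any threat-intelligence note."""
    counts = Counter(ip for ip, _protocol, _size in entries)
    return {
        ip: {"connections": n, "note": intel.get(ip, "no record")}
        for ip, n in counts.items()
    }

summary = summarize(log_entries, threat_intel)
for ip, info in summary.items():
    print(ip, info)
```

The dictionary lookup `intel.get(ip, "no record")` is what makes mapping addresses to intelligence data efficient, since it does not require scanning a list for each address.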

Security professionals leverage Python's command-line capabilities and built-in libraries to automate
security tasks, such as vulnerability scanning, forensic analysis, and penetration testing. Libraries and tools
such as Scapy, Requests, and Nmap provide powerful functionalities for network traffic analysis and ethical hacking.
Python also enables secure handling of user authentication, encryption, and access control mechanisms,
strengthening cyberattack detection frameworks. Given its versatility and effectiveness, Python remains a
cornerstone in the field of cybersecurity, empowering organizations to proactively defend against cyber
threats and enhance digital security.


6.2 SYSTEM SPECIFICATION:

HARDWARE REQUIREMENTS:
❖ System : Intel i3 to untill

❖ Hard Disk : 100 GB minimum.

❖ Monitor : 14/10/12/15’ Colour Monitor.

❖ Mouse : Optical Mouse.

❖ RAM : 4 GB minimum.

SOFTWARE REQUIREMENTS:
❖ Operating system : Windows 7/8/10/11.

❖ Coding Language : Python 3.7.

❖ Type of Application : GUI Application

❖ Front-End Technologies : Tkinter API

❖ Backend Technologies : Matplotlib, pandas, NumPy, scikit-learn

❖ IDE Tool : PyCharm community edition 2021

❖ Dataset : NSL-KDD-Dataset


6.3 INTRODUCTION TO PYTHON:

Below are some facts about Python.

1. Python is currently one of the most widely used multi-purpose, high-level programming languages.
2. Python allows programming in Object-Oriented and Procedural paradigms. Python programs are generally
shorter than equivalent programs in languages like Java.
3. Programmers have to type relatively less, and the language's indentation requirement keeps code readable.
4. Python is used by almost all tech-giant companies, such as Google, Amazon, Facebook,
Instagram, Dropbox, and Uber.

Advantages of Python :

Python offers several advantages over other languages.

1. Extensive Libraries :
Python ships with an extensive standard library that contains code for various purposes
such as regular expressions, documentation generation, unit testing, web browsers, threading, databases, CGI,
email, image manipulation, and more. As a result, we do not have to write the complete code for these tasks manually.
2. Extensible :
Python can be extended with other languages. You can write some of your code in
languages like C++ or C, which comes in handy, especially in projects.
3. Embeddable :
Complementary to extensibility, Python is embeddable as well. You can put Python code in the source
code of a different language, like C++. This lets us add scripting capabilities to code in the other language.
4. Improved Productivity :
The language's simplicity and extensive libraries make programmers more productive than languages like
Java and C++ do, since less code needs to be written to get more done.
5. IoT Opportunities :
Since Python forms the basis of new platforms like Raspberry Pi, its future looks bright for the Internet of
Things. This is a way to connect the language with the real world.
6. Simple and Easy :
When working with Java, you may have to create a class to print 'Hello World', but in Python a single print
statement will do. Python is also quite easy to learn, understand, and code. This is why people who pick up
Python often have a hard time adjusting to more verbose languages like Java.


7. Readable :
Because it is not a verbose language, reading Python is much like reading English. This is the reason
why it is so easy to learn, understand, and code. It also does not need curly braces to define blocks, and
indentation is mandatory. This further aids the readability of the code.
8. Object-Oriented :
This language supports both the procedural and object-oriented programming paradigms. While functions
help us with code reusability, classes and objects let us model the real world. A class allows the encapsulation
of data and functions into one.
9. Free and Open-Source :
Python is freely available. Not only can you download Python for free, but you can
also download its source code, make changes to it, and even distribute it. It comes with an extensive
collection of libraries to help you with your tasks.
10. Portable :
When you code your project in a language like C++, you may need to make some changes to it if you want to
run it on another platform. This is not the case with Python: you need to code only once, and you can
run it anywhere. This is called Write Once Run Anywhere (WORA). However, you need to be careful
not to include any system-dependent features.
11. Interpreted :
Lastly, Python is an interpreted language. Since statements are executed one by one, debugging is
easier than in compiled languages.

Disadvantages of Python :

So far, we have seen why Python is a great choice for a project. Anyone choosing it should also be aware of
its downsides compared with other languages.
1. Speed Limitations :
Python code is executed line by line. Because Python is interpreted, this often results in slow
execution. This, however, isn't a problem unless speed is a focal point for the project. In other words, unless
high speed is a requirement, the benefits offered by Python outweigh its speed
limitations.
2. Weak in Mobile Computing and Browsers :
While it serves as an excellent server-side language, Python is rarely seen on the client side. It is also
rarely used to implement smartphone-based applications; one such application is called
Carbonnelle. Despite the existence of Brython, Python has not become popular on the client side because it is
not considered secure enough there.

3. Design Restrictions :
Python is dynamically typed, which means that you don't need to declare the type of a variable
while writing the code. It uses duck typing: if an object looks like a duck (that is, it supports the
operations being used), it is treated as a duck. While this is easy on programmers during coding, it can
raise run-time errors.
4. Underdeveloped Database Access Layers :
Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and ODBC (Open
DataBase Connectivity), Python's database access layers are a bit underdeveloped. Consequently, it is less
often applied in huge enterprises.
5. Simple :
Python's simplicity can itself be a problem. For programmers accustomed to Python, its syntax is so simple
that the verbosity of Java code seems unnecessary, which can make it harder to adjust to other languages.

6.4 INSTALLATION ON WINDOWS:

Visit https://www.python.org/downloads/ to download the latest release of Python. In
this process, we will install Python 3.7.6 on our Windows operating system.

Double-click the executable file which is downloaded; the following window will open.

Select Customize installation and proceed.


The following window shows all the optional features. All the features need to be installed
and are checked by default; we need to click Next to continue.

The following window shows a list of advanced options. Check all the options which you want to install
and click next. Here, we must notice that the first check-box (install for all users) must be checked.


Now, we are ready to install python-3.7.6. Let's install it.

Now, try to run Python from the command prompt. Type python in the case of Python 2,
or python3 in the case of Python 3. It will show an error, as given in the image below, because we
haven't set the path.


To set the path of Python, we need to right-click on "My Computer" and go to
Properties → Advanced → Environment Variables.

Add the new path variable in the user variable section.


Type PATH as the variable name and set the path to the installation directory of Python, as shown in
the image below.

Now that the path is set, we are ready to run Python on our local system. Restart CMD and type
python again. It will open the Python interpreter shell, where we can execute Python
statements.

6.5 Python Libraries:

Several Python libraries are crucial for cybersecurity applications, particularly in detecting and analyzing
cyber threats. These libraries provide functionalities for data visualization, processing, and machine
learning model development in intrusion detection systems (IDS).

1. Matplotlib – A powerful data visualization library that helps in creating charts and graphs to analyze
network traffic patterns and detect anomalies.
2. Pandas – A data manipulation library used for processing large cybersecurity datasets, such as
intrusion detection logs and network traffic data.
3. NumPy – A library for numerical computations, useful in handling large-scale network traffic data
and performing mathematical operations efficiently.

4. Scikit-learn – A machine learning library used for developing predictive models to detect cyber
threats and classify network intrusions.


6.6 Deployment of System and Maintenance:

Once the system development is complete and fully prepared, the next crucial phase is its deployment.
Deployment refers to the process of installing and configuring the system in a real-world environment where
it will be used. Since this project is an academic initiative, the deployment was carried out in our school lab,
ensuring all required software and dependencies were properly installed on systems running Windows OS.
This setup allowed for thorough testing, performance evaluation, and user training in a controlled
environment before any potential real-world application.

The deployment process involves various steps, including installation of software components,
configuration of system settings, user accessibility setup, and security testing to ensure the system runs
efficiently. Additionally, post-deployment testing is conducted to identify and resolve any potential issues
that may arise during usage. By deploying the system in a structured manner, we ensure that it functions as
intended and meets the requirements set during the development phase.

Maintenance is a crucial aspect of system management, ensuring long-term functionality and efficiency. In
this project, maintenance is a one-time measure, primarily focusing on fixing any initial issues that may
arise after deployment. However, in practical scenarios, maintenance often includes regular updates, security
patches, performance enhancements, and bug fixes to improve system reliability over time. Ensuring proper
documentation and providing user support can further enhance the system’s sustainability and usability.


CHAPTER-7
SYSTEM DESIGN

7.1 SYSTEM ARCHITECTURE:

Fig 7.1: System Architecture

This network security architecture ensures a secure communication system by implementing multiple
layers of protection, including firewalls, routers, and Intrusion Detection Systems (IDS). The internal
network consists of various devices such as computers and printers, which are connected through structured
networking. To safeguard these internal resources, a firewall acts as a protective barrier that filters traffic,
allowing only legitimate data to pass through while blocking potential threats.

At the heart of the system, a router serves as a bridge between the internal network and the Internet,
directing traffic efficiently. To further enhance security, Network-Based Intrusion Detection Systems
(IDS) are strategically placed at different points within the network. These IDS monitor incoming and
outgoing traffic, detecting suspicious activities or potential cyber threats. IDS are positioned both before
and after the firewall, as well as near critical servers, to ensure maximum protection.

The infrastructure also includes essential servers such as the Web Server, Email Server, and DNS Server,
which handle website hosting, email communication, and domain name resolution, respectively. The firewall
and IDS work together to prevent unauthorized access and cyberattacks on these critical systems.

By implementing multiple layers of security, this network setup ensures data integrity, confidentiality, and
availability, effectively protecting against cyber threats. This approach allows organizations to maintain
secure and uninterrupted network operations while minimizing security risks.

7.2 UML DIAGRAMS

UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling
language in the field of object-oriented software engineering. The standard is managed, and was created by,
the Object Management Group.

The goal is for UML to become a common language for creating models of object-oriented computer
software. In its current form, UML comprises two major components: a meta-model and a notation. In
the future, some form of method or process may also be added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and
documenting the artifacts of software systems, as well as for business modeling and other non-software
systems.

The UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems.

The UML is a very important part of developing object-oriented software and of the software
development process. The UML uses mostly graphical notations to express the design of software projects.

GOALS:

The primary goals in the design of the UML are as follows:

1. Provide users with a ready-to-use, expressive visual modeling language so that they can develop and
exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks, patterns, and
components.
7. Integrate best practices.

DATA FLOW DIAGRAM:

1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to
represent a system in terms of the input data to the system, the various processing carried out on this data,
and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the
system components: the system processes, the data used by the processes, the external entities that
interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a series of
transformations. It is a graphical technique that depicts information flow and the transformations that
are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. It may be partitioned into
levels that represent increasing information flow and functional detail.

Fig 7.2.1:Data Flow Diagram

7.3 USE CASE DIAGRAM:

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram
defined by and created from a Use-case analysis. Its purpose is to present a graphical overview of the
functionality provided by a system in terms of actors, their goals (represented as use cases), and any
dependencies between those use cases. The main purpose of a use case diagram is to show what system
functions are performed for which actor. Roles of the actors in the system can be depicted.

Fig 7.3.1:Use Case Diagram

7.4 CLASS DIAGRAM:

In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static
structure diagram that describes the structure of a system by showing the system's classes, their attributes,
operations (or methods), and the relationships among the classes. It shows which class holds which
information.

Fig 7.4.1:Class Diagram

7.5 SEQUENCE DIAGRAM:

A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction diagram that shows how
processes operate with one another and in what order. It is a construct of a Message Sequence
Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams.

Fig 7.5.1:Sequence Diagram

7.6 ACTIVITY DIAGRAM:

Activity diagrams are graphical representations of workflows of stepwise activities and actions with support
for choice, iteration and concurrency. In the Unified Modeling Language, activity diagrams can be used to
describe the business and operational step-by-step workflows of components in a system. An activity
diagram shows the overall flow of control.

Fig 7.6.1:Activity Diagram

The activity diagram represents the workflow of a machine learning-based classification system using the
NSL-KDD dataset. The process begins with the user initiating the dataset collection, which involves obtaining
the NSL-KDD dataset. The collected data undergoes a preprocessing step to clean and prepare it for analysis.
After preprocessing, feature selection is performed to extract relevant attributes, which may include splitting
and discretization for improved classification performance.

The selected features are then used in the classification process, which involves multiple machine learning
algorithms, including Decision Tree, Random Forest, Support Vector Machine, and Logistic Regression. The
classification results are evaluated using various metrics, such as a confusion matrix, accuracy, precision,
recall, and F1-score. The evaluation metrics help assess the performance of each algorithm. Finally, model
validation is performed to compare the results and determine the most effective model for classification. The
performance comparison allows for selecting the best approach for network intrusion detection or similar
applications.

This structured approach ensures that data is systematically processed, classified, and validated, leading to
accurate and efficient decision-making.
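
The workflow described above can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the project's code: a synthetic dataset from make_classification stands in for the cleaned NSL-KDD features, and SelectKBest with k=15 is a hypothetical choice for the feature selection step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for 38 numeric features and a 0/1 label.
X, y = make_classification(n_samples=600, n_features=38, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The four classifiers named in the activity diagram.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # Scaling and feature selection are fitted on the training split only.
    pipeline = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=15), model)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

# Model validation: compare the models and pick the most accurate one.
best = max(results, key=lambda n: results[n]["accuracy"])
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
print("Best by accuracy:", best)
```

Wrapping the preprocessing and the classifier in one pipeline keeps the test split untouched until evaluation, which is one way to make the comparison between models fair.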

CHAPTER-8
IMPLEMENTATION

8.1 Flowcharts:

Fig 8.1.1: Working Process

The flowchart illustrates a cyberattack detection framework designed for a SCADA (Supervisory Control
and Data Acquisition) system. The process is divided into two main phases: the Training Phase and the
Detection Phase, both of which rely on data stored in a Cyber-attack Database.

The process begins with the SCADA system (Master Operation Room), which collects and stores cyber
attack-related data in the cyber-attack database. This database contains both training data and testing data,
which are used in different phases of the detection system.

In the Training Phase, the system uses training data to build a predictive model. This phase involves three
key steps: Data Preprocessing, where raw data is cleaned and formatted; Training, where the system learns
from historical attack patterns; and Predictive Model Development, where a trained model is created to
recognize cyber threats. The trained predictive model is then used for real-time attack detection.

The Detection Phase is responsible for identifying potential cyber-attacks in real-world scenarios. Here, the
system takes testing data and processes it through the same Data Preprocessing step. The data is then
analyzed using the Predictive Model to determine whether a cyberattack is occurring. If the model detects
an attack, a Decision is made to classify it into a specific Type of Cyber-Attack. If no attack is detected, the
system continues normal operations.

This structured approach ensures that the SCADA system can efficiently
identify and classify cyber-attacks, enhancing cybersecurity measures and preventing potential threats.
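
The two phases can be illustrated with a short scikit-learn sketch. It is only an assumption-laden example: the attack names in ATTACK_TYPES, the helper functions train_phase and detection_phase, and the synthetic two-feature data are all hypothetical and are not taken from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical class ids and attack names, used only for this illustration.
ATTACK_TYPES = {0: "normal", 1: "dos", 2: "probe"}


def train_phase(X_train, y_train):
    """Training phase: preprocessing plus training yields the predictive model."""
    model = Pipeline([
        ("preprocess", StandardScaler()),
        ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
    ])
    model.fit(X_train, y_train)
    return model


def detection_phase(model, X_new):
    """Detection phase: the same preprocessing is reused before prediction."""
    decisions = []
    for label in model.predict(X_new):
        if label == 0:
            decisions.append("no attack: continue normal operation")
        else:
            decisions.append("attack detected: " + ATTACK_TYPES[int(label)])
    return decisions


# Synthetic training data: three well-separated clusters, one per class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
y_train = np.repeat([0, 1, 2], 50)
X_train = centers[y_train] + rng.normal(scale=0.5, size=(150, 2))

model = train_phase(X_train, y_train)
new_records = np.array([[0.1, -0.2], [5.8, 6.1], [-6.2, 5.9]])
decisions = detection_phase(model, new_records)
for decision in decisions:
    print(decision)
```

Because the scaler lives inside the fitted pipeline, the detection phase applies exactly the preprocessing learned during training, which mirrors the shared Data Preprocessing step in the flowchart.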

8.2 CODE
IDS.py
from tkinter import messagebox
from tkinter import *
from tkinter import simpledialog
import tkinter
from tkinter import filedialog
import matplotlib.pyplot as plt
import numpy as np
from tkinter.filedialog import askopenfilename
import pandas as pd
from sklearn import *
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from keras.models import Sequential
from keras.layers import Dense
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

main = tkinter.Tk()
main.title("Detection of Cyber Attack in Network using Machine Learning Techniques")
main.geometry("1300x1200")

global filename
global labels
global columns
global balance_data
global data
global X, Y, X_train, X_test, y_train, y_test
global svm_acc, classifier, LR_acc, DT_acc, RFT_acc

def isfloat(value):
    # Return True when the value can be parsed as a number.
    try:
        float(value)
        return True
    except ValueError:
        return False

def splitdataset(balance_data):
    # The cleaned file holds 38 numeric features followed by the 0/1 label.
    X = balance_data.values[:, 0:38]
    Y = balance_data.values[:, 38]
    print(X)
    print(Y)
    # Hold out 20% of the records for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.2, random_state=0)
    return X, Y, X_train, X_test, y_train, y_test

def upload():
    global filename
    text.delete('1.0', END)
    filename = askopenfilename(initialdir="NSL-KDD-Dataset")
    pathlabel.config(text=filename)
    text.insert(END, "Dataset loaded\n\n")

def preprocess():
    global labels
    global columns
    global filename

    text.delete('1.0', END)
    # NSL-KDD column names: 41 features followed by the class label.
    columns = ["duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
               "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
               "num_compromised", "root_shell", "su_attempted", "num_root",
               "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
               "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
               "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
               "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
               "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
               "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
               "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "label"]

    # Attack-name to class-id mapping. It is kept for reference; the cleaning
    # loop below only distinguishes 'normal' from 'anomaly'.
    labels = {"normal": 0, "neptune": 1, "warezclient": 2, "ipsweep": 3, "portsweep": 4,
              "teardrop": 5, "nmap": 6, "satan": 7, "smurf": 8, "pod": 9, "back": 10,
              "guess_passwd": 11, "ftp_write": 12, "multihop": 13, "rootkit": 14,
              "buffer_overflow": 15, "imap": 16, "warezmaster": 17, "phf": 18, "land": 19,
              "loadmodule": 20, "spy": 21, "perl": 22, "saint": 23, "mscan": 24, "apache2": 25,
              "snmpgetattack": 26, "processtable": 27, "httptunnel": 28, "ps": 29,
              "snmpguess": 30, "mailbomb": 31, "named": 32, "sendmail": 33, "xterm": 34,
              "worm": 35, "xlock": 36, "xsnoop": 37, "sqlattack": 38, "udpstorm": 39}
    balance_data = pd.read_csv(filename)
    dataset = ''
    cols = ''
    first_row = True

    for _, row in balance_data.iterrows():
        # Keep only the fields that parse as numbers; this drops
        # protocol_type, service, flag and the textual label.
        for i in range(0, 42):
            if isfloat(row.iloc[i]):
                dataset += str(row.iloc[i]) + ','
                if first_row:
                    cols += columns[i] + ','
        # Encode the class label as 0 (normal) or 1 (anomaly).
        if row.iloc[41] == 'normal':
            dataset += '0'
        if row.iloc[41] == 'anomaly':
            dataset += '1'
        if first_row:
            cols += 'Label'
        dataset += '\n'
        first_row = False

    # Save the header line followed by the cleaned records.
    with open("clean.txt", "w") as f:
        f.write(cols + "\n" + dataset)

    text.insert(END, "Removed non numeric characters from dataset and saved inside clean.txt file\n\n")
    text.insert(END, "Dataset Information\n\n")
    text.insert(END, dataset + "\n\n")

def generateModel():
    text.delete('1.0', END)
    global X, Y, X_train, X_test, y_train, y_test
    global balance_data
    balance_data = pd.read_csv("clean.txt")
    X, Y, X_train, X_test, y_train, y_test = splitdataset(balance_data)
    text.insert(END, "Train & Test Model Generated\n\n")
    text.insert(END, "Total Dataset Size : " + str(len(balance_data)) + "\n")
    text.insert(END, "Split Training Size : " + str(len(X_train)) + "\n")
    text.insert(END, "Split Test Size : " + str(len(X_test)) + "\n")

def prediction(X_test, cls):
    y_pred = cls.predict(X_test)
    for i in range(len(X_test)):
        print("X=%s, Predicted=%s" % (X_test[i], y_pred[i]))
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred, details):
    accuracy = accuracy_score(y_test, y_pred) * 100
    text.insert(END, details + "\n\n")
    text.insert(END, "Accuracy : " + str(accuracy) + "\n\n")
    return accuracy

def runSVM():
    text.delete('1.0', END)
    global svm_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]

    text.insert(END, "Total Features : " + str(total) + "\n")
    # No feature selection is applied in this listing, so the reduction is 0.
    text.insert(END, "Features set reduce after applying features selection concept : " + str(
        (total - X_train.shape[1])) + "\n\n")
    cls = svm.SVC(kernel='rbf', class_weight='balanced', probability=True)
    cls.fit(X_train, y_train)
    text.insert(END, "Prediction Results\n\n")
    prediction_data = prediction(X_test, cls)
    svm_acc = cal_accuracy(y_test, prediction_data,
                           'SVM Accuracy, Classification Report & Confusion Matrix')
    # Keep the trained SVM for the Detect Attack step.
    classifier = cls

def runDT():
    text.delete('1.0', END)
    global DT_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]

    text.insert(END, "Total Features : " + str(total) + "\n")
    text.insert(END, "Features set reduce after applying features selection concept : " + str(
        (total - X_train.shape[1])) + "\n\n")
    DT = DecisionTreeClassifier()
    DT.fit(X_train, y_train)
    text.insert(END, "Prediction Results\n\n")
    prediction_data = prediction(X_test, DT)
    DT_acc = cal_accuracy(y_test, prediction_data,
                          'DecisionTreeClassifier Accuracy, Classification Report & Confusion Matrix')

def runRFT():
    text.delete('1.0', END)
    global RFT_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]

    text.insert(END, "Total Features : " + str(total) + "\n")
    text.insert(END, "Features set reduce after applying features selection concept : " + str(
        (total - X_train.shape[1])) + "\n\n")
    RFT = RandomForestClassifier(n_estimators=10, criterion="entropy")
    RFT.fit(X_train, y_train)
    text.insert(END, "Prediction Results\n\n")
    prediction_data = prediction(X_test, RFT)
    RFT_acc = cal_accuracy(y_test, prediction_data,
                           'RandomForestClassifier, Classification Report & Confusion Matrix')

def runLR():
    text.delete('1.0', END)
    global LR_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]

    text.insert(END, "Total Features : " + str(total) + "\n")
    text.insert(END, "Features set reduce after applying features selection concept : " + str(
        (total - X_train.shape[1])) + "\n\n")
    LR = LogisticRegression(random_state=0)
    LR.fit(X_train, y_train)
    text.insert(END, "Prediction Results\n\n")
    prediction_data = prediction(X_test, LR)
    LR_acc = cal_accuracy(y_test, prediction_data,
                          'LogisticRegression Accuracy, Classification Report & Confusion Matrix')

def detectAttack():
    text.delete('1.0', END)
    global X, Y, X_train, X_test, y_train, y_test

    filename = filedialog.askopenfilename(initialdir="NSL-KDD-Dataset")
    test = pd.read_csv(filename)
    text.insert(END, filename + " test file loaded\n")
    # 'classifier' is the SVM stored by runSVM(), so that step must run first.
    y_pred = classifier.predict(test.values)
    print(y_pred)
    for i in range(len(test)):
        # Show each uploaded record (not a row of X_test) next to its prediction.
        if float(y_pred[i]) == 1.0:
            text.insert(END, "X=%s, Predicted=%s" % (test.values[i], ' Infected. Detected Anomaly Signatures') + "\n\n")
        else:
            text.insert(END, "X=%s, Predicted=%s" % (test.values[i], 'Normal Signatures') + "\n\n")

def graph():
    # Bar chart of the test-set accuracy of the four classifiers.
    height = [svm_acc, LR_acc, DT_acc, RFT_acc]
    bars = ('SVM Accuracy', 'LR Accuracy', 'DT Accuracy', 'RFT Accuracy')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()

font = ('times', 16, 'bold')

title = Label(main,
text='Detection of Cyber Attack in Network using Machine Learning Techniques')
title.config(bg='PaleGreen2', fg='Khaki4')
title.config(font=font)
title.config(height=3, width=120)
title.place(x=0, y=5)

font1 = ('times', 14, 'bold')

upload = Button(main, text="Upload Dataset", command=upload)
upload.place(x=700, y=100)
upload.config(font=font1)

pathlabel = Label(main)
pathlabel.config(bg='DarkOrange1', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=700, y=150)

preprocess = Button(main, text="Preprocess Dataset", command=preprocess)

preprocess.place(x=700, y=200)
preprocess.config(font=font1)

model = Button(main, text="Generate Training Model", command=generateModel)

model.place(x=700, y=250)
model.config(font=font1)

runsvm = Button(main, text="Run SVM Algorithm", command=runSVM)

runsvm.place(x=700, y=300)
runsvm.config(font=font1)

runsvm = Button(main, text="Run LR Algorithm", command=runLR)

runsvm.place(x=700, y=400)
runsvm.config(font=font1)

runsvm = Button(main, text="Run DT Algorithm", command=runDT)

runsvm.place(x=700, y=450)
runsvm.config(font=font1)

runsvm = Button(main, text="Run RFT Algorithm", command=runRFT)

runsvm.place(x=700, y=500)
runsvm.config(font=font1)

attackButton = Button(main, text="Upload Test Data & Detect Attack", command=detectAttack)

attackButton.place(x=700, y=550)
attackButton.config(font=font1)

graphButton = Button(main, text="Accuracy Graph", command=graph)

graphButton.place(x=700, y=600)
graphButton.config(font=font1)

font1 = ('times', 12, 'bold')

text = Text(main, height=30, width=80)
scroll = Scrollbar(text)
text.configure(yscrollcommand=scroll.set)
text.place(x=10, y=100)
text.config(font=font1)

main.config(bg='PeachPuff2')
main.mainloop()

CHAPTER-9
RESULTS AND DISCUSSIONS
9.1 IMPLEMENTATION DESCRIPTION:
This Python script implements a graphical user interface (GUI) application developed using
the Tkinter library. The purpose of the application is to detect cyber attacks in network traffic and to
compare four machine learning classifiers, namely Support Vector Machine (SVM), Logistic Regression,
Decision Tree, and Random Forest, for cybersecurity threat analysis.

• Import Statements:

The script begins by importing the necessary libraries and modules, such as Tkinter for GUI
development, Matplotlib for data visualization, NumPy, Pandas, scikit-learn, and other essential
libraries for data preprocessing, feature extraction, model training, and evaluation.

• GUI Setup:

The main GUI window is initialized with a title and specific dimensions, providing an interactive
interface for users to upload network traffic data, preprocess it, apply machine learning models,
and visualize attack patterns.

Function Definitions:

• Upload Dataset: Lets the user select the NSL-KDD dataset file to be loaded.
• Preprocess Dataset: Removes the non-numeric fields from the uploaded data, encodes the label as 0
(normal) or 1 (anomaly), and saves the result to clean.txt.
• Generate Training Model: Reads clean.txt and splits it into training (80%) and testing (20%) sets.
• SVM: Trains an SVM classifier with an RBF kernel on the training set and reports its accuracy on the
test set.
• Logistic Regression (LR): Trains on the same data to classify network activity as normal or an
intrusion. It uses probability-based decision-making and, in the listing in Section 8.2, is evaluated by its
accuracy on the test set.
• Decision Tree: Builds a rule-based model by splitting on the features to identify suspicious
activities. It is interpretable and is evaluated by its accuracy on the test set.
• Random Forest: Uses multiple decision trees (10 in this implementation) to detect intrusions more
accurately, reducing overfitting. It combines the predictions of the individual trees and is evaluated by its
accuracy on the test set.
• Detect Attack: Loads a test file and labels each record as normal or anomalous using the trained SVM
classifier.
• Accuracy Graph: Generates a bar graph comparing the accuracy of the SVM, logistic regression,
decision tree, and random forest models.

GUI Components:

• Buttons: Buttons are provided for uploading the dataset, preprocessing it, generating the training
model, running the SVM, LR, DT, and RFT algorithms, uploading test data and detecting attacks, and
plotting the accuracy graph.
• Text Box: A text box is provided to display messages and performance metrics.
• GUI Configuration: Buttons and text box are configured with appropriate labels, commands, fonts,
and positions.
• Main Loop: The main.mainloop() call starts the GUI application, allowing the user to interact with
it.

9.2 Dataset Description:

The given dataset is a network intrusion detection dataset used in cyber attack detection by training machine
learning models. It contains structured, labeled data with both numerical and categorical features, making it
suitable for classification tasks. The dataset consists of several network traffic attributes that help identify
whether a connection is normal or an anomaly (attack).

The dataset includes multiple features that describe different aspects of network connections. The
protocol_type column specifies the type of network protocol used, such as TCP, UDP, or ICMP. The service
column represents the destination service being accessed, like HTTP, FTP, or private services. The flag
column indicates the status of the connection, such as SF (successful connection) or REJ (rejected
connection). The duration attribute represents the time duration of the connection, while src_bytes and
dst_bytes denote the number of bytes sent from the source to the destination and vice versa.

Additional security-related features include land, which indicates whether the source and destination
addresses and ports are the same, and wrong_fragment, which counts incorrect fragments in the connection.
The dataset also captures urgent packets, hot indicators that identify suspicious command execution, and
num_failed_logins, which tracks failed authentication attempts. logged_in is a binary indicator showing
whether a user successfully logged in, while num_compromised, root_shell, and su_attempted highlight
potential security breaches.

Several attributes focus on file and command access, such as num_file_creations, num_shells,
num_access_files, and num_outbound_cmds. Login-related attributes include is_host_login and
is_guest_login, which flag logins from privileged ("hot" list) accounts and from guest accounts,
respectively. Connection-based attributes, such as count and srv_count, track the number of connections
made to the same host or the same service within a short time frame.

Statistical features include serror_rate and srv_serror_rate, which measure the percentage of connections
with SYN errors, while rerror_rate and srv_rerror_rate track the percentage of rejected connections.
same_srv_rate, diff_srv_rate, and srv_diff_host_rate help understand network behavior by indicating the
percentage of connections to the same or different services and hosts.

Destination-specific attributes, including dst_host_count, dst_host_srv_count, and dst_host_same_srv_rate,
describe the number of connections to the destination host and service. Other metrics like
dst_host_diff_srv_rate, dst_host_same_src_port_rate, and dst_host_srv_diff_host_rate help detect unusual
network patterns. The dataset also captures dst_host_serror_rate, dst_host_srv_serror_rate,
dst_host_rerror_rate, and dst_host_srv_rerror_rate, which measure error rates for connections at the
destination level.

Finally, the label column serves as the classification target, categorizing connections as
either normal (legitimate) or anomaly (potential attack). This dataset is widely used for developing and
evaluating intrusion detection systems (IDS), enabling machine learning models to differentiate between
normal and malicious network activities effectively.
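
To make the mix of numerical and categorical features concrete, the following sketch builds a tiny table with a few of the columns named above and converts it into a purely numeric form. The four records are made-up illustrative values, not rows from NSL-KDD, and one-hot encoding with pandas is shown as one possible treatment of the categorical columns; the listing in Section 8.2 drops those columns instead.

```python
import pandas as pd

# Four made-up records with a handful of NSL-KDD-style columns (assumption).
records = pd.DataFrame({
    "duration": [0, 0, 2, 0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "private", "ftp", "ecr_i"],
    "flag": ["SF", "SF", "REJ", "SF"],
    "src_bytes": [181, 105, 0, 1032],
    "dst_bytes": [5450, 146, 0, 0],
    "label": ["normal", "normal", "anomaly", "anomaly"],
})

# One-hot encode the three categorical columns; numeric columns pass through.
categorical = ["protocol_type", "service", "flag"]
features = pd.get_dummies(records.drop(columns="label"),
                          columns=categorical, dtype=int)

# Encode the classification target: 0 for normal, 1 for anomaly.
target = records["label"].map({"normal": 0, "anomaly": 1})

print(features.columns.tolist())
print(target.tolist())
```

One-hot encoding keeps the protocol, service, and flag information available to the classifiers at the cost of a wider feature matrix.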

9.3 Result Description:

Fig 9.3.1 upload dataset

Fig 9.3.1 shows the initial step of the process, where the dataset is selected through the Upload Dataset
button. The application opens a file dialog in the NSL-KDD-Dataset folder, displays the chosen path, and
reports "Dataset loaded". The file itself is read with pd.read_csv() in the preprocessing step, which
brings the raw data into a structured format ready for further processing.
Ensuring that the data is correctly loaded without errors is crucial, as this forms the foundation for
subsequent analysis and model training.

Fig 9.3.2 Data preprocess

Fig 9.3.2 shows the data preprocessing phase, which is essential for cleaning and preparing the dataset
before applying machine learning models. In this implementation, the Preprocess Dataset step reads the
CSV file with pd.read_csv(), keeps only the numeric fields of each record (dropping protocol_type,
service, and flag), encodes the label as 0 for normal and 1 for anomaly, and saves the result to clean.txt.
More general preprocessing, such as handling missing values (fillna(), dropna()), scaling numerical
features (StandardScaler, MinMaxScaler), or encoding categorical variables (LabelEncoder,
OneHotEncoder), is not applied here but could be added.

Another critical aspect of preparing the data is splitting the dataset into training and testing sets using
train_test_split(), ensuring that the model is trained on one portion of the data and evaluated on
another for unbiased performance assessment.
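
The cleaning and splitting steps can also be written with vectorized pandas operations. The sketch below is an assumption-based alternative to the row-by-row loop in Section 8.2, not the project's code: the helper clean_dataset and the five sample rows are hypothetical, and the function selects numeric columns by dtype rather than testing each value.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_dataset(raw: pd.DataFrame, label_column: str = "label") -> pd.DataFrame:
    """Keep numeric feature columns and encode the label as 0 (normal) / 1 (anomaly)."""
    numeric = raw.drop(columns=label_column).select_dtypes(include="number")
    cleaned = numeric.copy()
    cleaned["Label"] = raw[label_column].map({"normal": 0, "anomaly": 1})
    return cleaned


# Made-up sample rows (assumption) with one categorical column to be dropped.
raw = pd.DataFrame({
    "duration": [0, 2, 0, 1, 0],
    "protocol_type": ["tcp", "tcp", "udp", "icmp", "tcp"],
    "src_bytes": [181, 0, 105, 1032, 239],
    "label": ["normal", "anomaly", "normal", "anomaly", "normal"],
})

cleaned = clean_dataset(raw)

# Same split settings as the listing: 20% test data, fixed random seed.
X = cleaned.drop(columns="Label").to_numpy()
y = cleaned["Label"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(cleaned)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Selecting columns by dtype avoids building the cleaned file through string concatenation, which tends to be slow on a dataset with many thousands of records.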

Fig 9.3.3: Training the dataset

Fig 9.3.3 shows the output of the Generate Training Model step. The cleaned data in clean.txt is read
back and split with train_test_split() into a training set (80%) and a test set (20%), and the
application displays the total dataset size together with the sizes of the two splits. No additional
feature engineering, such as Principal Component Analysis (PCA), mutual information scores, or recursive
feature elimination, is applied in this implementation, so all 38 numeric features are passed to the
classifiers. The SelectKBest and chi2 utilities imported in the script could be used to remove redundant
or irrelevant features, which may improve accuracy and efficiency while reducing computational
complexity.

Fig 9.3.4: SVM Accuracy

Fig 9.3.4 shows the result of applying a Support Vector Machine (SVM) model to the processed data. SVM is
a powerful supervised learning algorithm used for classification and regression tasks. At this stage, the
model is created with the SVC() class from the scikit-learn library (RBF kernel, balanced class weights),
training is performed on the training split with cls.fit(X_train, y_train), and predictions are made
using cls.predict(X_test). The application reports the accuracy on the test set; precision, recall,
F1-score, and a confusion matrix can be computed from the same predictions. This phase is crucial for
assessing how well the model generalizes to unseen data and for tuning parameters if necessary.

Fig 9.3.5: LR Accuracy

Fig 9.3.5 shows the result of the Run LR Algorithm step. A Logistic Regression model, a popular
classification algorithm, is trained on the training split, and its predictions and accuracy on the test
split are displayed.

Fig 9.3.6:DT Accuracy

Fig 9.3.6 shows the result of the Run DT Algorithm step. A Decision Tree is a hierarchical model used for
classification and regression tasks that splits the data at each node based on feature values. The
application trains the tree on the training split and displays its predictions and test-set accuracy.

Fig 9.3.7: RFT Accuracy

Fig 9.3.7 shows the result of the Run RFT Algorithm step. A Random Forest combines multiple decision trees
(10 in this implementation, built with the entropy criterion) to improve accuracy and reduce overfitting.
The application displays the predictions and the test-set accuracy of the combined model.
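
How a random forest combines its trees can be checked directly in scikit-learn. The sketch below is illustrative only and assumes synthetic data from make_classification; it uses the same n_estimators=10 and criterion="entropy" settings as the listing. Note that scikit-learn averages the class probabilities of the individual trees rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic binary data in place of the NSL-KDD features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, criterion="entropy",
                                random_state=0)
forest.fit(X, y)

sample = X[:5]

# Each fitted tree is available in forest.estimators_.
per_tree_proba = np.array([tree.predict_proba(sample)
                           for tree in forest.estimators_])

# Average the per-tree class probabilities and pick the most probable class.
averaged = per_tree_proba.mean(axis=0)
combined = forest.classes_[averaged.argmax(axis=1)]

print("Averaged probabilities:\n", averaged)
print("Combined prediction:", combined)
print("Forest prediction:  ", forest.predict(sample))

# Feature importance scores aggregated over all trees.
importances = forest.feature_importances_
print("Most important feature index:", int(importances.argmax()))
```

Averaging over trees trained on different bootstrap samples is what smooths out the overfitting of any single tree.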

Fig 9.3.8: Upload the test data

Fig 9.3.8 shows the Upload Test Data & Detect Attack step, in which a separate test file is selected
for evaluation. The file contains network records in the same numeric format as the cleaned training data
and is read with pd.read_csv(). Running the trained model on data it has not seen helps practitioners
assess its reliability, identify potential biases, and decide whether further tuning is needed before
deploying the model in real-world applications.

Fig 9.3.9: Attack Detection

Fig 9.3.9 shows the output of the attack detection step. The trained classifier predicts a label for each
record in the uploaded test file, and the application reports the record either as a normal signature or
as infected, with anomaly signatures detected.

Fig 9.3.10: Accuracy Graph

Fig 9.3.10 compares the accuracy of the four machine learning algorithms: SVM, Logistic
Regression, Decision Tree, and Random Forest. The application draws this comparison as a bar chart with
one bar per model. Based on the accuracy graph, the Random Forest algorithm demonstrates the highest
accuracy compared to the other models. This superior performance suggests that
Random Forest is the best-suited model for the given dataset, possibly due to its ability to reduce overfitting
and improve generalization by averaging multiple decision trees.

Fig 9.3.11: Confusion Matrix

Fig 9.3.11 shows a confusion matrix, a tool used to evaluate classification model performance. It
displays the counts of True Positives, False Positives, True Negatives, and False Negatives, from which
accuracy, precision, recall, and F1-score can be derived.
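
The relationship between the four confusion-matrix counts and the derived metrics can be worked through on a small example. The ten labels below are made up for illustration (0 = normal, 1 = anomaly) and are not results from the project.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions for ten connections (assumption).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics computed by hand from the four counts.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy :", round(accuracy, 3), round(accuracy_score(y_true, y_pred), 3))
print("precision:", round(precision, 3), round(precision_score(y_true, y_pred), 3))
print("recall   :", round(recall, 3), round(recall_score(y_true, y_pred), 3))
print("f1-score :", round(f1, 3), round(f1_score(y_true, y_pred), 3))
```

In an intrusion detection setting, recall is often watched closely because a false negative is an attack that goes unnoticed.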

CHAPTER-10
CONCLUSION AND FUTURE SCOPE

CONCLUSION:
In this report, a Network Intrusion Detection System utilizing machine learning techniques was presented.
The performance of the proposed detection system was evaluated using multiple machine learning
algorithms on the NSL-KDD dataset. The results show that the Random Forest and Decision Tree algorithms
performed well compared to the other models in predicting malicious packets, especially in terms of
accuracy, recall, and the Matthews correlation coefficient. Moreover, the RF classifier outperformed state-of-
the-art intrusion detection systems. The NSL-KDD dataset suffers from several issues: its classes are
imbalanced, and its recorded malicious traffic is synthetic and does not reflect real-world attacks.
Even so, the classifiers presented satisfactory results and are capable of detecting network intrusions.

FUTURE SCOPE:
In the future, we will increase the amount of testing data for our system and examine how the accuracy
varies. We also hope to combine the RST method with a genetic algorithm to improve the accuracy of the IDS.
The present system just displays the log information but does not employ any techniques to analyze the
information present in the log records and extract knowledge. The system can be extended by incorporating
data mining techniques to analyze the information in the log records, which may help in efficient decision
making. The present system detects only known attacks. It can be extended by incorporating intelligence
so that it gains knowledge by itself, analyzing the growing traffic and learning new intrusion patterns.

CHAPTER-11
REFERENCES

1. Cisco Annual Internet Report (2018–2023) White Paper. (2022, January 23). Cisco.
https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
2. Dyn Analysis Summary of Friday October 21 Attack. (2022, February 20).
https://web.archive.org/web/20200620203923
3. Dartigue, C., Jang, H.I., Zeng, W. A new data-mining based approach for network intrusion detection. In
Seventh Annual Communication Networks and Services Research Conference. 2009; 372–377.
4. García-Teodoro, P., Díaz-Verdejo, J., Maciá-Fernández, G., Vázquez, E. Anomaly-based network intrusion
detection: Techniques, systems and challenges. Computers & Security. 2009; 28(1–2): 18–28.
5. Cisco. What Is Network Security? (2022, February 8). Cisco.
https://www.cisco.com/c/en/us/products/security/what-is-network-security.html
6. Kurose, J.F., Ross, K.W. Computer Networking: A Top-Down Approach (6th Edition). Pearson, 2012.
7. Tanenbaum, A., Wetherall, D. Computer Networks (5th Edition). Pearson, 2010.
8. Fernandes, G., Rodrigues, J.J.P.C., Carvalho, L.F., Al-Muhtadi, J.F., Proença, M.L. A comprehensive
survey on network anomaly detection. Telecommunication Systems. 2018; 70(3): 447–489.
9. Othman, S.M., Alsohybe, N.T., Ba-Alwi, F.M., Zahary, A.T. Survey on intrusion detection system types.
2018; 7(4): 444–463.
10. Pal Singh, A., Deep Singh, M. Analysis of Host-Based and Network-Based Intrusion Detection System.
International Journal of Computer Network and Information Security. 2014; 6(8): 41–47.
11. Ferrag, M.A., Maglaras, L., Moschoyiannis, S., Janicke, H. Deep learning for cyber security intrusion
detection: Approaches, datasets, and comparative study. Journal of Information Security and Applications.
2020; 50.
12. Boutaba, R., Salahuddin, M.A., Limam, N., Ayoubi, S., Shahriar, N., Estrada-Solano, F., Caicedo, O.M.
A comprehensive survey on machine learning for networking: evolution, applications and research
opportunities. Journal of Internet Services and Applications. 2018; 9(1).
13. Buczak, A.L., Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security
Intrusion Detection. IEEE Communications Surveys & Tutorials. 2016; 18(2): 1153–1176.
14. Chaabouni, N., Mosbah, M., Zemmari, A., Sauvignac, C., Faruki, P. Network Intrusion Detection for IoT
Security Based on Learning Techniques. IEEE Communications Surveys & Tutorials. 2019; 21(3): 2671–
2701.

15. Berman, D., Buczak, A., Chavis, J., Corbett, C. A Survey of Deep Learning Methods for Cyber Security.
Information. 2019; 10(4): 122.
16. Mahdavifar, S., Ghorbani, A.A. Application of deep learning to cybersecurity: A survey.
Neurocomputing. 2019; 347: 149–176.
17. Ahmed, M., Naser Mahmood, A., Hu, J. A survey of network anomaly detection techniques. Journal of
Network and Computer Applications. 2016; 60: 19–31.
18. Ring, M., Wunderlich, S., Scheuring, D., Landes, D., Hotho, A. A survey of network-based intrusion
detection data sets. Computers & Security. 2019; 86: 147–167.
19. Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K. Network Anomaly Detection: Methods, Systems and
Tools. IEEE Communications Surveys & Tutorials. 2014; 16(1): 303–336.
20. Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., Zhu, H., Gao, M., Hou, H., Wang, C. Machine Learning and
Deep Learning Methods for Cybersecurity. IEEE Access. 2018; 6: 35365–35381.
21. UNB. (2021, November 15). https://www.unb.ca/cic/datasets/nsl.html
22. Chumachenko, K. Machine learning methods for malware detection and classification. 2017.
23. Zou, J., Han, Y., So, S.S. Overview of artificial neural networks. Methods in molecular biology (Clifton,
N.J.). 2008; 458: 15–23.
24. Dong, B., Wang, X. Comparison deep learning method to traditional methods using for network
intrusion detection. In 2016 8th IEEE International Conference on Communication Software and Networks
(ICCSN). 2016; 581–585.
25. Mahesh, B. Machine Learning Algorithms – A Review. International Journal of Science and Research
(IJSR). 2020; 381–386.
26. Farnaaz, N., Jabbar, M.A. Random forest modeling for network intrusion detection system. Procedia
Computer Science. 2016; 89: 213–217.
27. Bhumgara, A., Pitale, A. Detection of Network Intrusions using Hybrid Intelligent Systems. 1st
International Conference on Advances in Information Technology (ICAIT). 2019; 500–506.
28. Kumar, K., Batth, J.S. Network Intrusion Detection with Feature Selection Techniques using Machine-
Learning Algorithms. International Journal of Computer Applications. 2016; 150(12): 1–13.
29. Dhanabal, L., Shantharajah, S.P. A study on NSL-KDD dataset for intrusion detection system based on
classification algorithms. International Journal of Advanced Research in Computer and Communication
Engineering. 2015; 4(6): 446–452.
