Project Final Doc(17)
A Project Report Submitted in partial fulfillment of the requirements for the award of
The degree
BACHELOR OF TECHNOLOGY IN
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by
(Accredited by NAAC with 'A' Grade; Recognised by UGC under sections 2(f) & 12(B))
NH-216,Cheyyeru(V),Amalapuram-533216.
APRIL-2025
SRINIVASA INSTITUTE OF ENGINEERING AND TECHNOLOGY
(UGC–Autonomous Institution)
(Approved by AICTE, permanently affiliated to JNTUK, Kakinada, ISO 9001:2015 certified Institution)
CERTIFICATE
This is to certify that Project Report entitled “CYBER THREAT DETECTION USING MACHINE
for the award of Bachelor of Technology in Artificial Intelligence and Machine Learning during the academic
period 2021-2025.
Associate Professor, Associate Professor,
Department of AIML Department of AIML.
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
We would like to thank the Principal, DR.M. SREENEVAS KUMAR, M. Tech, Ph.D,
(U.K), MISTE, FIE (1) and the Management of “Srinivasa Institute of Engineering &
Technology” for providing us with the requisite facilities to carry out our project on the campus.
Our heartfelt thanks go to our project coordinator, Mr. K VIJAY BABU, M. Tech, and all the
faculty members of our department for their value-based teaching of theory and practical
subjects, which we put to use in our project. We are also indebted to the Non-Teaching Staff
for their co-operation.
We would like to thank our friends and family members for their help and support in
making our project a success. Last but not least, we would like to convey special thanks to all
those who have helped, directly or indirectly, in the completion of the project work.
L.JITHENDRA 216N1A6136
A.SATHISH 216N1A6101
6.5 Deployment of System and Maintenance
CHAPTER-7 SYSTEM DESIGN
7.1 System Architecture
CHAPTER-8 IMPLEMENTATION
8.1 Flowchart
CHAPTER-9 RESULTS AND DISCUSSIONS
LIST OF FIGURES
ABSTRACT
The present-day world has become heavily dependent on cyberspace for every aspect of daily
living, and the use of cyberspace is rising with each passing day. People are spending more
time on the Internet than ever before. As a result, the risks of cyber threats and cybercrimes
are increasing. The term 'cyber threat' refers to illegal activity performed using the
Internet. Cybercriminals keep changing their techniques to get past protective measures, and
conventional techniques are not capable of detecting zero-day attacks and other
sophisticated attacks. Many machine learning techniques have therefore been developed to
detect cybercrimes and combat cyber threats. The objective of this research work is
to evaluate some of the widely used machine learning techniques for
detecting some of the most serious threats to cyberspace. Three primary machine
learning techniques are investigated: deep belief network, decision tree, and
support vector machine. We present a brief exploration that gauges the performance of
these techniques in spam detection, intrusion detection, and malware
detection on frequently used benchmark datasets.
CYBER THREAT DETECTION
SIET-AIML 1
CHAPTER-1
INTRODUCTION
OVERVIEW:
IDSs are security solutions that, like antivirus software, firewalls, and access control schemes, are designed
to make information and communication systems more secure. They arose as a result of the inadequacy of
traditional security methods. This overview discusses network security, firewalls, and IDSs in turn.
According to Cisco [5], network security is any activity designed to protect the usability and integrity of the
user’s network and data. It includes both hardware and software technologies. Effective network security
manages access to the network: it detects a variety of threats and stops them from entering or spreading
through the user’s network. The majority of security threats are purposefully created by malicious
people seeking a benefit, gaining publicity, or harming someone. Network security issues can be loosely
classified into five interconnected areas, as noted by [6]:
1. Confidentiality: The contents of a transmitted communication should be understood only by the sender
and the intended receiver, because the message could be intercepted by eavesdroppers. Encryption is used to
accomplish this.
2. Message integrity: The content of the delivered message must not be tampered with, either intentionally
or accidentally. Checksums and hash functions are used to accomplish this.
3. Verification (authentication): The party sending the information and the party receiving it should each be
able to verify the identity of the other.
4. Non-repudiation: This deals with the possibility of someone denying having sent a message or carried out
an action. It is achieved through digital signatures.
5. Operational security: This is a security process that prevents the important assets of a company or an
institution from being accessed by unauthorized individuals.
Nearly all institutions, including banks and higher learning institutions, among others, possess or
use a network that is linked to the public Internet. Such networks can be compromised without the owner’s
consent. Malicious people can introduce worms into the network’s hosts,
access the institution’s confidential documents, change the organization’s network configuration, and launch
denial-of-service (DoS) attacks. For this reason, firewalls and IDSs are used to counter attacks that may
arise against a company’s network. The networks of companies or institutions are organized into two zones:
the internal network and the demilitarized zone (DMZ) (Fig. 1). The internal network can
only be accessed by the network administrators or the workers within the company, whereas the DMZ
can be accessed by anyone. Having a demilitarized zone within an organization plays a crucial
role.
It adds an extra layer of security to the company’s internal network, because the hosts most
susceptible to attack are those that provide services to users outside the internal network, for
instance, electronic mail, website, and domain name system servers. Because these hosts face a high risk of
attack, they are placed within a sub-network to protect the rest of the organization’s network.
An external host can access only what is exposed in the DMZ; the rest of the organization’s network cannot
be reached from outside. Nevertheless, separating the organization’s
network without any means of controlling network traffic makes little sense. Consequently,
a common security mechanism is the addition of a firewall.
A packet filter (firewall) inspects the header fields of protocols such as IP, TCP, UDP, and ICMP when
determining whether to allow packets past the firewall. However, Deep Packet Inspection (DPI) is required to
detect many attack types that a packet filter cannot detect. A device that not only analyzes the headers of all
packets traveling through it but also, unlike a packet filter, performs deep packet inspection has a place in
intrusion prevention and detection. An Intrusion Prevention System (IPS) drops a suspect packet or a
suspicious series of packets when it detects them, to prevent them from reaching the organization’s network.
An Intrusion Detection System (IDS) lets such packets pass on their way to the corporate
network but sends an alarm to the network administrator or logs the packets. The rest of this section looks at
intrusion detection in further depth.
Intrusion Detection Systems are computer-based security and defense
systems that monitor, identify, and analyze harmful activity on hosts or networks. The purpose of an
IDS is to ensure that the security of a computer system or network, in terms of integrity,
confidentiality, and availability, is maintained. The firewall is the first line of protection against intrusion;
the IDS acts upon detecting that an intrusion has occurred and that the firewall failed to mitigate or stop
it [8]. Using an IDS therefore rests on the expectation that attacks will occur that the firewall cannot
eliminate or mitigate. Intrusion Detection
Systems can be classified in different ways, based on the monitored platform or the technique they employ to
identify anomalous activity.
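The distinction drawn above between a header-only packet filter, an IDS, and an IPS can be illustrated with a minimal sketch. It is not the implementation used in this project: the `Packet` fields, the allowed ports, and the payload signatures are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    protocol: str   # e.g. "TCP", "UDP", "ICMP"
    dst_port: int
    payload: bytes


# Hypothetical policy values, assumed only for this example.
ALLOWED_PORTS = {25, 53, 80, 443}
SIGNATURES = [b"/etc/passwd", b"' OR '1'='1"]


def packet_filter(pkt: Packet) -> bool:
    """Header-only check: return True if the packet may pass the firewall."""
    return pkt.protocol in {"TCP", "UDP"} and pkt.dst_port in ALLOWED_PORTS


def deep_inspect(pkt: Packet) -> bool:
    """Deep packet inspection: return True if the payload matches a signature."""
    return any(sig in pkt.payload for sig in SIGNATURES)


def handle(pkt: Packet, mode: str) -> str:
    """mode='ids' forwards suspicious packets and alerts; mode='ips' drops them."""
    if not packet_filter(pkt):
        return "dropped by firewall"
    if deep_inspect(pkt):
        return "forwarded, alert raised" if mode == "ids" else "dropped by IPS"
    return "forwarded"


suspicious = Packet("TCP", 80, b"GET /../../etc/passwd HTTP/1.1")
print(handle(suspicious, "ids"))
print(handle(suspicious, "ips"))
```

The suspicious packet passes the header check because port 80 is allowed; only the payload inspection reveals it, which is why the text treats DPI as necessary for many attack types.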
Motivation:
Cybersecurity threats have increased significantly due to the rapid digital transformation and the growing
dependence on technology. Cyber-attacks pose significant risks to individuals, businesses, and
government institutions, leading to financial losses, data breaches, and privacy concerns. The need for
robust cybersecurity measures and an understanding of various attack mechanisms is more crucial than
ever.
Cyber threats come in many forms, including malware, phishing, ransomware, denial-of-service attacks,
and insider threats. Organizations and individuals must adopt proactive measures to safeguard their
digital assets. The increasing sophistication of cybercriminals calls for continuous innovation in
cybersecurity solutions, emphasizing real-time monitoring, threat intelligence, and advanced mitigation
strategies.
Problem Statement:
The task is to build a network intrusion detector: a predictive model capable of distinguishing between bad
connections, called intrusions or attacks, and good, normal connections. Securing industrial
networks with conventional IT solutions may not be a reasonable approach, because these networks serve
different functions. Hence, to protect an industrial control system (ICS) network from the increasing number
of intrusions and reduce their impact, an efficient Intrusion Detection System (IDS) that can minimize the
effects of attacks is vital. However, existing IDSs have proven inefficient at detecting zero-day attacks. They also
suffer from false positives (unnecessary alarms) and false negatives (missed attacks, which compromise
security), both of which affect the performance and accuracy of the IDS. When designing an efficient IDS
framework, the challenge for developers is to integrate the various components in a way that reduces these drawbacks.
Applications:
Cyberattack detection is essential for securing digital assets across various domains, including network
security, cloud security, finance, healthcare, industrial control systems, and government sectors. It helps
identify unauthorized access, detect malware, prevent data breaches, and mitigate threats using advanced
technologies like Intrusion Detection Systems (IDS), AI-driven anomaly detection, and real-time
monitoring. In the financial sector, it prevents fraudulent transactions, while in healthcare, it safeguards
patient data from cyber threats. Industrial control systems rely on it to protect critical infrastructure, and
government agencies use it for national security. As cyber threats continue to evolve, robust detection
mechanisms are crucial for ensuring digital safety and maintaining data integrity.
CHAPTER-2
LITERATURE SURVEY
Because a basic SVM has shortcomings when applied to the IDS domain, various authors
have suggested variants of the SVM framework to address these limitations. Some of the related works
are summarized here.
• Heba F. Eid, Ashraf Darwish, Aboul Ella Hassanien, and Ajith Abraham introduced an
intrusion detection system that uses Principal Component Analysis (PCA) with Support Vector
Machines (SVMs) to select the optimum feature subset [11]. They verified the
effectiveness and feasibility of the proposed IDS through several experiments on the NSL-KDD
dataset.
• J.F Joseph, A. Das, and B.C. Seet proposed an autonomous host-based IDS for detecting
sinking behaviour in an ad hoc network [12]. The proposed detection system uses a cross-layer
approach to maximize detection accuracy. To further maximize detection accuracy, SVM is used
for training the detection model. However, SVM is computationally expensive for resource-limited
ad hoc network nodes. Hence, the proposed IDS preprocesses the training data to reduce the
computational overhead incurred by SVM. The number of features in the training data is reduced using
predefined association functions. The proposed IDS also uses a linear classification algorithm,
namely Fisher Discriminant Analysis (FDA), to remove data with low information content
(entropy). These data reduction measures make SVM feasible on ad hoc network nodes.
• T. Shon, Y. Kim, C. Lee, and J. Moon proposed a machine learning model using a
modified Support Vector Machine (SVM) that combines the benefits of supervised and unsupervised
learning. Moreover, a preliminary feature selection process using a genetic algorithm (GA) is provided to
select more appropriate packet fields.
• Peddabachigari, A. Abraham, and C. Grosan conducted an empirical investigation of SVM and Decision
Tree (DT), in which they analyzed their performance as standalone detectors and as hybrids. Two hybrid
models were examined: a hierarchical model (DT-SVM), with the DT as the first layer producing
node information for the SVM in the second layer, and an ensemble model comprising the standalone
techniques and the hierarchical hybrid. For the ensemble approach, each technique is given a weight
according to its detection rate for each particular attack type during training. Thereafter, when the system
is tested, only the technique with the largest weight for the respective attack prediction is chosen to
output the classification. The approaches were tested on the KDD Cup ’99 data set.
• R. C. Chen, K.F Cheng, and C. F Hsieh used RST (Rough Set Theory) and SVM
(Support Vector Machine) to detect intrusions [15]. First, RST is used to preprocess the data and
reduce its dimensions. Next, the features selected by RST are sent to the SVM model for training and
testing. The method effectively decreases the space density of the data.
• Kyaw Thet Khaing proposed an enhanced SVM model that uses Recursive Feature
Elimination (RFE) and K Nearest Neighbor (KNN) to perform the feature ranking and
selection task of the new model [16].
Different techniques have been implemented to tackle the problem of feature selection. Some
methods use the predictive accuracy of a classifier to evaluate the “goodness” of a feature set,
while others use measures such as information, consistency, or distance to compute the relevance
of a set of features. These approaches suffer from drawbacks. The first major drawback is that feeding
the classifier with arbitrary features may lead to biased results; hence, we cannot rely on the classifier’s
predictive accuracy as a measure for selecting features. A second drawback is that, for a set of N features, trying
all possible combinations of features (2^N combinations) to find the best one to feed the classifier is
not a feasible approach.
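To make the second drawback concrete, the short sketch below counts the candidate feature subsets. The function names and the three-feature list are illustrative assumptions; the value 41 is the number of features per NSL-KDD record stated later in this report.

```python
from itertools import combinations


def count_subsets(n_features: int) -> int:
    """Number of possible feature subsets, including the empty one."""
    return 2 ** n_features


def all_subsets(features):
    """Enumerate every subset; only practical for very small feature lists."""
    for k in range(len(features) + 1):
        yield from combinations(features, k)


small = ["duration", "protocol_type", "flag"]
print(len(list(all_subsets(small))))  # 8 subsets for 3 features
print(count_subsets(41))              # 2199023255552 subsets for 41 features
```

Training and evaluating a classifier once per subset is manageable for three features but not for 41, which is why heuristic search strategies are used in practice.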
CHAPTER-3
INTRODUCTION TO CYBER ATTACKS
3.1 Introduction to Cyber Attacks
In today's digital world, cyberattacks have become a significant threat to individuals, businesses, and
governments. A cyberattack is any malicious attempt to gain unauthorized access to a computer system,
network, or data with the intent to steal, alter, or destroy information. These attacks can take various forms,
including malware infections, phishing scams, denial-of-service (DoS) attacks, and ransomware. With the
increasing reliance on digital infrastructure, cyber threats are becoming more sophisticated and frequent,
making cybersecurity a critical concern.
Cybercriminals use various techniques to exploit vulnerabilities in systems, often targeting sensitive data
such as personal information, financial records, and intellectual property. Advanced persistent threats
(APTs), social engineering attacks, and zero-day exploits are some of the highly complex cyber threats
organizations face today. The consequences of a cyberattack can be severe, ranging from financial losses and
reputational damage to legal consequences and national security risks. This growing threat landscape
demands proactive security measures and robust cyber defense strategies. To combat cyberattacks
effectively, organizations and individuals must adopt a multi-layered cybersecurity approach. This includes
implementing strong authentication methods, regular security updates, employee awareness training, and the
use of artificial intelligence for threat detection. Governments and cybersecurity experts worldwide are also
working on developing advanced security frameworks to counter evolving cyber threats. As technology
continues to advance, the need for continuous monitoring and improvement in cybersecurity practices
remains crucial to safeguarding digital assets and ensuring data privacy.
Cyberattacks come in various forms, each designed to exploit vulnerabilities in digital systems and
networks. Malware attacks involve malicious software such as viruses, worms, ransomware, and spyware
that infect devices to steal data, disrupt operations, or demand ransom payments. Phishing attacks trick
users into revealing sensitive information, such as passwords and financial details, through deceptive emails,
messages, or fake websites. Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) attacks
flood networks or websites with excessive traffic, causing disruptions and making services unavailable to
legitimate users. Man-in-the-Middle (MitM) attacks intercept communications between two parties to
steal or manipulate data, often occurring over unsecured public Wi-Fi networks.
Additionally, SQL injection attacks exploit vulnerabilities in web applications by injecting malicious SQL
code into databases, allowing attackers to access or manipulate sensitive information. Zero-day exploits
target unknown software vulnerabilities before developers can release security patches, making them highly
dangerous. Password attacks, such as brute-force and credential stuffing, attempt to crack passwords and
gain unauthorized access to systems. Social engineering attacks manipulate individuals into disclosing
confidential information through psychological manipulation, often bypassing technical security measures.
As cyber threats continue to evolve, understanding these attack types is essential for implementing effective
cybersecurity measures and protecting digital assets.
Machine learning is the foundation of modern cybersecurity advancements because it enables the analysis of
massive datasets, the recognition of patterns, and the formation of predictions that are essential for the
detection, prevention, and response to threats. Within the scope of this part, we will investigate the
fundamental ideas that underpin machine learning and the significance of these ideas in the field of
cybersecurity.
The term "machine learning" refers to a wide variety of algorithms, each of which is designed to meet
particular requirements within the field of cybersecurity. This subsection offers a summary of the
fundamental ideas, which are as follows:
Supervised Learning: In supervised learning, models are trained on labelled datasets, in which each input
is paired with its corresponding output label. This approach underlies tasks such as classification and
regression.
Unsupervised Learning: The goal of unsupervised learning is to discover hidden patterns or groups by
training models on unlabelled data. In the field of cybersecurity, clustering and
dimensionality reduction are two frequently used applications.
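The difference between the two paradigms can be sketched in a few lines. This is an illustrative example only, assuming a toy dataset of two scaled features per connection; the nearest-centroid classifier and the naive k-means below stand in for the supervised and unsupervised families and are not the models evaluated in this report.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit_nearest_centroid(X, y):
    """Supervised: the labels y are used to compute one centroid per class."""
    groups = {}
    for point, label in zip(X, y):
        groups.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in groups.items()}


def predict(model, point):
    """Return the label whose class centroid is closest to the point."""
    return min(model, key=lambda label: math.dist(point, model[label]))


def kmeans(X, k, iterations=10):
    """Unsupervised: no labels; points are grouped around k centres."""
    centres = list(X[:k])  # naive initialisation, sensitive to data order
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in X:
            idx = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[idx].append(p)
        centres = [centroid(c) if c else centres[i] for i, c in enumerate(clusters)]
    return clusters


# Hypothetical toy data: (scaled duration, scaled bytes) per connection.
X = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
y = ["normal", "attack", "normal", "attack"]

model = fit_nearest_centroid(X, y)
print(predict(model, (0.85, 0.75)))
print(kmeans(X, k=2))
```

The supervised model can name its prediction because it was given labels; the clustering step only separates the points into two groups, which an analyst would still have to interpret.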
CHAPTER-4
INPUT AND OUTPUT DESIGNS
The input design is the link between the information system and the user. It comprises developing the
specifications and procedures for data preparation, that is, the steps necessary to put transaction data into a
usable form for processing. This can be achieved by instructing the computer to read data from a written or
printed document, or by having people key the data directly into the system. The design of input
focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra
steps, and keeping the process simple. The input is designed so that it provides security and
ease of use while retaining privacy. Input design considers the following:
➢ What data should be given as input?
➢ How should the data be arranged or coded?
➢ The dialog to guide the operating personnel in providing input.
➢ Methods for preparing input validations and the steps to follow when errors occur.
OBJECTIVE
1. Input design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important for avoiding errors in the data input process and for
showing management the correct way to obtain accurate information from the computerized system.
2. It is achieved by creating user-friendly data entry screens that can handle large volumes of
data. The goal of designing input is to make data entry easier and free from errors. The data entry
screen is designed so that all data manipulations can be performed. It also provides record
viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the help of
screens. Appropriate messages are provided as and when needed so that the user is never left confused.
Thus the objective of input design is to create an input layout that is easy to follow.
through outputs. In output design, it is determined how the information is to be displayed for immediate need
and also as hard copy output. It is the most important and direct source of information for the user. Efficient
and intelligent output design improves the system’s relationship with the user and helps decision-making.
1. Designing computer output should proceed in an organized, well thought out manner; the
right output must be developed while ensuring that each output element is designed so that people find
the system easy and effective to use. When analysts design computer output, they should identify the
specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following objectives.
• Convey information about past activities, current status, or projections of the future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.
CHAPTER-5
PROBLEM IDENTIFICATION & OBJECTIVES
5.1 Existing System – Honeypot in Cyber Attack Detection
In the field of cybersecurity, honeypots are widely used as a proactive defense mechanism to detect, analyze,
and mitigate cyber threats. A honeypot is a decoy system or network designed to lure cyber attackers,
allowing security experts to study their tactics, techniques, and procedures (TTPs). Unlike traditional
security measures that focus on preventing intrusions, honeypots attract malicious actors, providing valuable
intelligence on potential threats. These systems simulate real-world vulnerabilities, making them an essential
tool for identifying new attack patterns and improving overall cybersecurity defenses.
Honeypots come in different types, including low-interaction honeypots, which mimic basic system
functionalities to detect automated attacks, and high-interaction honeypots, which replicate fully functional
systems to engage attackers in-depth. By deploying honeypots, organizations can monitor cybercriminal
activities in a controlled environment without risking their actual infrastructure. Additionally, honeypots
help identify zero-day vulnerabilities, brute-force attacks, and malware propagation strategies, enabling
security teams to enhance their defensive measures.
Despite their effectiveness, honeypots also come with certain limitations. Sophisticated attackers may
recognize and evade honeypot systems, reducing their effectiveness. Furthermore, if not properly isolated, a
compromised honeypot could become a launching point for further attacks on the network. To maximize
their benefits, honeypots should be integrated with other cybersecurity tools, such as intrusion detection
systems (IDS), machine learning-based threat analysis, and firewall protection. By continuously
evolving and improving honeypot strategies, cybersecurity professionals can stay ahead of emerging threats
and strengthen overall network security.
1. Limited Detection Scope – Honeypots only detect attacks that specifically target them. If an attacker
bypasses the honeypot and directly exploits the actual system, the attack remains undetected.
2. Ineffectiveness Against Internal Threats – Honeypots primarily attract external attackers and are
not effective in identifying malicious activities from within an organization.
3. False Data and Misleading Information – Attackers may intentionally provide false data to
manipulate honeypots and mislead security analysts, reducing the reliability of gathered intelligence.
4. Limited Real-World Application – While honeypots are useful for research and threat intelligence,
they do not actively protect the entire system and should be combined with other security measures
like firewalls and intrusion detection systems (IDS).
The Cloud-Network-Signatures dataset was analyzed and trained using four prominent Machine
Learning (ML) algorithms: Random Forest (RF), Decision Tree, Logistic Regression, and Support
Vector Machine (SVM). These algorithms were selected for their ability to classify network traffic patterns
and detect potential intrusions in cloud environments. Each algorithm offers unique advantages in
cybersecurity applications—Random Forest and Decision Tree are widely used for their ability to handle
large datasets with high accuracy, while Logistic Regression provides interpretability, and SVM excels in
high-dimensional spaces. By leveraging these models, the study aims to improve the accuracy and efficiency
of Network Intrusion Detection Systems (NIDS).
A crucial step in the process was data preparation, which involved eliminating ambiguity, selecting
relevant features, and normalizing data to ensure consistency and accuracy. Proper feature selection
enhances the predictive performance of ML models by reducing noise and computational complexity.
Normalization ensures that all numerical values are on a comparable scale, preventing biases in the training
process. This step is essential for Intrusion Detection Systems (IDS), as it helps the models learn patterns
more effectively, leading to better anomaly detection in cloud-network environments. By refining the
dataset, the study ensures that the models can generalize well to new, unseen network traffic, minimizing
false positives and negatives.
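As an illustration of the normalization step described above, the following sketch rescales numeric columns to the range [0, 1]. Min-max scaling is an assumption made for this example, since the text does not name the normalization method used; the column names and values are likewise hypothetical.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical raw values on very different scales.
src_bytes = [0, 250, 1000]
duration = [0, 2, 8]

print(min_max_normalize(src_bytes))  # [0.0, 0.25, 1.0]
print(min_max_normalize(duration))   # [0.0, 0.25, 1.0]
```

After scaling, a feature measured in bytes no longer dominates one measured in seconds purely because of its larger numeric range.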
For evaluating the NIDS performance, each of the supervised ML algorithms was tested using different
feature selection methods. Support Vector Machines (SVM), Random Forest (RF), Decision Tree, and
Logistic Regression were applied to analyze the dataset's effectiveness in detecting cyber threats. These
models were trained on different feature subsets to determine which combination provided the highest
detection accuracy. The comparative analysis of these algorithms helped identify the most suitable model for
cloud-based intrusion detection, ensuring robust security and timely threat mitigation. By combining
multiple ML techniques, the study enhances the adaptability and resilience of IDS in cloud environments,
making networks more secure against evolving cyber threats.
Advantages
1. Tuned to Specific Content in Network Packets – The system can analyze network packets in detail
and filter traffic based on specific patterns, helping to detect and prevent cyber threats more
effectively. This ensures precise identification of malicious activities, improving overall network
security.
2. Ability to Qualify and Quantify Attacks – The technique not only detects attacks but also
categorizes them based on severity and impact. It provides a quantitative measure of threats, helping
security teams prioritize and mitigate risks efficiently.
3. Compliance with Regulations – The cybersecurity system is designed to align with industry
standards and legal regulations. It simplifies adherence to data protection laws, cybersecurity
policies, and cloud security frameworks, making regulatory compliance more manageable for
organizations.
4. Automated Cybersecurity for Cloud Protection – The proposed approach introduces an automated
Intrusion Detection System (IDS) that continuously monitors cloud environments. By leveraging
Machine Learning (ML) and AI-driven techniques, it detects anomalies in real-time, ensuring
proactive threat mitigation.
5. Secure Document Storage and Sharing – The application allows users to securely store and share
documents while maintaining data confidentiality, integrity, and availability. Advanced
encryption techniques protect sensitive information from unauthorized access, ensuring secure
collaboration within cloud environments.
5.3 MODULES
In this section, the methodology of the research is discussed. According to the literature, there is a
critical need for effective machine learning and deep learning models for identifying attacks
in datasets. The NSL-KDD dataset was analyzed and used to train four Machine Learning algorithms:
Random Forest (RF), Decision Tree, Logistic Regression, and Support Vector Machine (SVM). The general
layout of the methodology is as follows.
a) Dataset Collection:
NSL-KDD is a condensed version of the original KDD dataset, obtained from the Canadian
Institute for Cybersecurity [21]. It has the same features as KDD. Each record has 41 features and one
b) Data pre-processing
Pre-processing is a very important step in preparing the data to be fed into the algorithm. The
goal of data preparation is to eliminate ambiguity in the dataset and provide the IDS with accurate data. It
combines feature selection and normalization. Many symbolic attributes in the dataset, such as flags and
protocol types, have nominal values. These values must be converted to numeric values so that the
learning algorithms can process them.
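The conversion of nominal attributes to numeric values can be sketched as follows. This is a minimal illustration, not the exact encoding used in the project: the helper names are hypothetical, and integer coding is only one option (it imposes an artificial order on the categories, which one-hot encoding avoids).

```python
def build_mapping(values):
    """Assign each distinct nominal value an integer code (sorted for reproducibility)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode_column(values, mapping):
    """Replace each nominal value with its integer code."""
    return [mapping[v] for v in values]


# A few illustrative records with the two symbolic attributes named in the text.
records = [
    {"protocol_type": "tcp", "flag": "SF"},
    {"protocol_type": "udp", "flag": "SF"},
    {"protocol_type": "icmp", "flag": "REJ"},
    {"protocol_type": "tcp", "flag": "S0"},
]

protocols = [r["protocol_type"] for r in records]
flags = [r["flag"] for r in records]

protocol_map = build_mapping(protocols)
flag_map = build_mapping(flags)

print(protocol_map)                         # {'icmp': 0, 'tcp': 1, 'udp': 2}
print(encode_column(protocols, protocol_map))  # [1, 2, 0, 1]
print(encode_column(flags, flag_map))          # [2, 2, 0, 1]
```

The same mapping built on the training data should be reused when encoding test data, so that a given protocol or flag always receives the same code.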
c) Feature selection
Feature selection produces a more compact and efficient feature subset by eliminating redundant and
irrelevant features. Correlation is a popular and successful strategy for identifying the most closely
related characteristics in a dataset; it measures the strength of the relationship between features, based
on the assumption that features are conditionally independent given the class. A good feature subset
contains features that are highly correlated with (predictive of) the class yet uncorrelated with, and not
predictive of, one another. CfsSubsetEval with the BestFirst search method in WEKA was chosen for
feature selection; the table shows the result.
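The idea of ranking features by their correlation with the class can be sketched in a few lines. This is a simplified illustration and not WEKA's CfsSubsetEval, which additionally penalizes correlation between the selected features; the function names and the toy data are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, labels):
    """Sort feature names by absolute correlation with the class label."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): label 1 = attack, 0 = normal.
labels = [0, 0, 1, 1]
columns = {
    "count": [1, 2, 9, 10],    # rises with the label
    "land": [0, 0, 0, 0],      # constant, so uninformative
    "duration": [5, 1, 2, 2],  # only weakly related to the label
}
ranking = rank_features(columns, labels)
for name, score in ranking:
    print(name, round(score, 3))
```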
For the supervised machine learning algorithms used to evaluate the performance of NIDS over the NSL-
KDD dataset in this study, we used Support Vector Machines (SVM), Random Forest (RF),
Decision Tree and Logistic Regression algorithms for each type of feature selection method. In
general, every process of classification in machine learning is divided into five steps:
c) Evaluation metrics
Evaluating the produced classification models is an important phase, and it is done using a variety of
evaluation metrics. The metrics are built from the following counts:
• True Positives (TP): the total number of malicious packets correctly classified as attacks.
• True Negatives (TN): the total number of normal packets correctly classified as normal.
• False Positives (FP): the total number of normal packets incorrectly classified as attacks.
• False Negatives (FN): the total number of malicious packets incorrectly classified as normal.
Classification accuracy is the most commonly used statistic for evaluating a model; however, on its own
it is not always a reliable indicator of performance, particularly when the classes are imbalanced.
Accuracy: The proportion of correctly classified samples to the total number of input samples. It is
calculated using the following formula:
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision: The number of correctly classified positive samples divided by the number of samples that
the classifier predicted as positive. Its formula is as follows:
Precision = TP/(TP+FP)
Recall: It is calculated by dividing the number of correctly classified positive samples by the total
number of positive samples.
Recall = TP/(TP+FN)
Matthews Correlation Coefficient (MCC): It represents the correlation between the observed and
predicted binary classifications.
MCC = (TP*TN - FP*FN) / sqrt[(TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)]
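The four formulas above translate directly into code. The sketch below is a minimal illustration; the function name classification_metrics and the confusion counts are hypothetical and are not results from this project.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}


# Hypothetical counts chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(name, round(value, 4))
```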
d) Model validation
In the final step, the model is implemented and trained based on the decisions made in the previous
steps, and then validated to check that it meets all of the preconditions and to measure how accurately it
predicts on new data. These assessments expose the model's flaws and limitations, allowing the required
measures to be taken to address them.
In comparison to other algorithms, the experiment shows that RF has the highest accuracy, followed by the
ML algorithm. Table 10 shows that selecting 13 features for each algorithm provides high accuracy in the
binary class. The models reach similar results, with an accuracy of 98.92% and an F-measure of 98.9%.
This indicates that the model is most effective at detecting DoS attacks. The multi-class results are
similar, with slight changes in accuracy, which was highest for the RF model.
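Validation on new data can be sketched as a hold-out split followed by an accuracy check on the held-out part. The sketch is dependency-free and makes several assumptions: holdout_split and accuracy are hypothetical helper names, the 80:20 ratio mirrors the split used in the IDS.py listing, and threshold_model is a stand-in rule rather than a trained classifier.

```python
import random


def holdout_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle indices reproducibly and hold out `test_size` of the data for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(rows) * test_size)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(rows, train_idx), pick(rows, test_idx), pick(labels, train_idx), pick(labels, test_idx)


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the true label."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(labels)


def threshold_model(row):
    """Stand-in for a trained classifier: flag a record as an attack (1) above a threshold."""
    return 1 if row[0] > 5 else 0


# Hypothetical single-feature data.
rows = [[v] for v in range(10)]
labels = [1 if v > 5 else 0 for v in range(10)]
X_train, X_test, y_train, y_test = holdout_split(rows, labels, test_size=0.2)
test_accuracy = accuracy(threshold_model, X_test, y_test)
print(len(X_train), len(X_test), test_accuracy)
```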
5.4 ALGORITHMS
Logistic Regression Algorithm
Logistic Regression (LR) is a supervised machine learning (SML) model that is widely used for
classification. It performs very well on linearly separable classes, is easy to implement, and is especially
common in industry. Because it is a linear model, LR is generally used for binary classification, but with
the one-vs-rest (OvR) technique it can also be used for multi-class classification [9]. LR is applied to the
dataset with three different train-test ratios (80:20, 60:40, and 70:30) to predict whether the bank currency
is forged or genuine. For the 80:20 train-test ratio, the ROC curve and learning curves are drawn. The
accuracy of LR is observed to be around 98%.
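How a binary logistic regression model learns can be sketched with plain gradient descent. This is an illustrative, dependency-free sketch and not the scikit-learn LogisticRegression class imported in IDS.py; the learning rate, epoch count, and toy data are assumptions, and the multi-class OvR extension is not shown.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Fit weights and bias with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = p - target  # gradient of the log loss with respect to the score
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, features):
    """Classify as 1 when the predicted probability is at least 0.5."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy, linearly separable data (hypothetical): one feature, two classes.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, row) for row in X]
print(predictions)
```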
representing values of the tested feature. A leaf node (e.g., Hours Played) gives the decision on the
numerical target value. DT can handle both numerical and categorical data [8]. DT is applied to the
dataset with three different train-test ratios (80:20, 60:40, and 70:30) to predict whether the bank currency
is forged or genuine. For the 80:20 train-test ratio, the ROC curve and learning curves are drawn. The
accuracy of DT has been observed to be around 99%.
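A decision tree chooses, at each internal node, the feature test that best separates the classes. The sketch below shows one common way to score a candidate split, Gini impurity; the report does not state which criterion was used, so the criterion, the function names, and the toy values are all assumptions.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split `value <= threshold`."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical values of a single numeric feature and their class labels.
values = [0, 2, 3, 40, 50, 60]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)
```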
TYPES OF TESTS
Unit testing
Unit testing involves designing test cases that validate that the internal program logic functions properly
and that program inputs produce valid outputs. All decision branches and internal code flow should be
validated. It is the testing of individual software units of the application, and it is done after the completion
of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's
construction and is invasive. Unit tests perform basic tests at the component level and test a specific
business process, application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly defined inputs
and expected results.
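As an illustration, a unit test for the isfloat helper from the IDS.py listing in Section 8.2 could look like the sketch below. The helper is copied from the listing; the test class, the chosen inputs, and the run_tests wrapper are assumptions added for the example and are not part of the project's test suite.

```python
import unittest


def isfloat(value):
    """Return True if `value` can be converted to a float (as in the IDS.py listing)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


class IsFloatTest(unittest.TestCase):
    def test_numeric_strings_are_accepted(self):
        self.assertTrue(isfloat("0.25"))
        self.assertTrue(isfloat("181"))

    def test_symbolic_values_are_rejected(self):
        self.assertFalse(isfloat("tcp"))
        self.assertFalse(isfloat("SF"))


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsFloatTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

Each test method covers one decision branch of the helper, which matches the goal of validating all branches of a unit.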
Integration testing
Integration tests are designed to test integrated software components to determine if they actually run as
one program. Testing is event driven and is more concerned with the basic outcome of screens or fields.
Integration tests demonstrate that although the components were individually satisfactory, as shown by
successful unit testing, the combination of components is correct and consistent. Integration testing is
specifically aimed at exposing the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as specified by the
business and technical requirements, system documentation, and user manuals.
System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the configuration
oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-
driven process links and integration points.
White Box Testing
White Box Testing is testing in which the software tester has knowledge of the inner workings, structure,
and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a
black box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner workings, structure, or
language of the module being tested. Black box tests, like most other kinds of tests, must be written from
a definitive source document, such as a specification or requirements document. The software under test
is treated as a black box: you cannot "see" into it. The test provides inputs and responds to outputs without
considering how the software works.
Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.
Features to be tested
• Verify that the entries are of the correct format
• No duplicate entries should be allowed
• All links should take the user to the correct page.
Integration Testing
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications, e.g. components in a
software system or – one step up – software applications at the company level – interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the end
user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
CHAPTER-6
SYSTEM REQUIREMENTS
6.1 SOFTWARE ENVIRONMENT
Python, a widely used high-level programming language, plays a crucial role in cybersecurity and
cyberattack detection. Its simplicity, readability, and extensive libraries make it an ideal choice for security
analysts and developers working on cyber defense systems. Python allows both interactive and script-based
programming, enabling security professionals to write and execute security scripts efficiently. Features like
dynamic typing, automatic memory management, and support for multiple programming paradigms make it
highly adaptable for developing cybersecurity applications, including intrusion detection systems, malware
analysis, and network security monitoring.
Python’s data structures, such as lists, tuples, and dictionaries, are essential for handling large datasets
commonly used in cyber threat analysis. Lists provide flexibility in storing and processing network logs,
while tuples ensure the integrity of unmodifiable security data. Dictionaries, which store key-value pairs, are
particularly useful in cybersecurity for mapping IP addresses to threat intelligence data and managing log
files efficiently. Additionally, Python supports multi-line statements, indentation-based code structuring, and
powerful string manipulation functions, making it an excellent tool for parsing and analyzing security-
related information from various sources.
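The use of dictionaries, tuples, and string handling described above can be sketched as follows. Everything in this example is hypothetical: the THREAT_INTEL table, the four-field log format, and the IP addresses (taken from the ranges reserved for documentation) are assumptions made only to illustrate the data structures.

```python
# Hypothetical threat-intelligence table: IP address -> (category, severity).
# Tuples are used because each entry is meant to be read-only.
THREAT_INTEL = {
    "203.0.113.7": ("port-scan", "medium"),
    "198.51.100.23": ("brute-force", "high"),
}


def parse_log_line(line):
    """Split an assumed 'timestamp src_ip dst_ip protocol' log line into a dict."""
    timestamp, src_ip, dst_ip, protocol = line.split()
    return {"timestamp": timestamp, "src_ip": src_ip, "dst_ip": dst_ip, "protocol": protocol}


def flag_suspicious(lines):
    """Return the parsed entries whose source IP appears in the threat table."""
    flagged = []
    for line in lines:
        entry = parse_log_line(line)
        intel = THREAT_INTEL.get(entry["src_ip"])
        if intel is not None:
            category, severity = intel
            flagged.append({**entry, "category": category, "severity": severity})
    return flagged


log_lines = [
    "2025-01-01T10:00:00 192.0.2.10 192.0.2.1 tcp",
    "2025-01-01T10:00:05 198.51.100.23 192.0.2.1 tcp",
]
alerts = flag_suspicious(log_lines)
print(alerts)
```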
Security professionals leverage Python's command-line capabilities and its ecosystem of libraries to
automate security tasks such as vulnerability scanning, forensic analysis, and penetration testing.
Third-party libraries such as Scapy and Requests, together with Python wrappers for tools such as Nmap,
provide powerful functionality for network traffic analysis and ethical hacking. Python also enables secure
handling of user authentication, encryption, and access control mechanisms, strengthening cyberattack
detection frameworks. Given its versatility and effectiveness, Python remains a cornerstone in the field of
cybersecurity, empowering organizations to proactively defend against cyber threats and enhance digital
security.
HARDWARE REQUIREMENTS:
❖ System : Intel i3 to untill
SOFTWARE REQUIREMENTS:
❖ Operating system : Windows 7/8/10/11.
❖ Dataset : NSL-KDD-Dataset
Advantages of Python :
7. Readable :
Because it is not a verbose language, reading Python is much like reading English, which is why it is easy
to learn, understand, and code. It does not need curly braces to define blocks, and indentation is
mandatory, which further aids the readability of the code.
8. Object-Oriented :
This language supports both the procedural and object-oriented programming paradigms. While functions
help us with code reusability, classes and objects let us model the real world. A class allows the encapsulation
of data and functions into one.
9. Free and Open-Source :
Python is freely available. Not only can you download Python for free, you can also download its source
code, make changes to it, and even distribute it. It ships with an extensive collection of libraries to help
you with your tasks.
10. Portable :
When you code your project in a language like C++, you may need to make some changes to it if you want
to run it on another platform. This is not the case with Python: you code only once and can run the program
anywhere. This is called Write Once Run Anywhere (WORA). However, you need to be careful not to
include any system-dependent features.
11. Interpreted :
Python is an interpreted language. Since statements are executed one by one, debugging is often easier
than in compiled languages.
Disadvantages of Python :
The sections above explain why Python is a good choice for this project. Anyone choosing it should also
be aware of its drawbacks. The downsides of choosing Python over another language are listed below.
1. Speed Limitations :
Python code is executed line by line. Because Python is interpreted, execution is often slow. This is not a
problem unless speed is a focal point for the project. In other words, unless high speed is a requirement,
the benefits offered by Python outweigh its speed limitations.
2. Weak in Mobile Computing and Browsers :
While it serves as an excellent server-side language, Python is rarely seen on the client side. It is also
rarely used to implement smartphone-based applications; one such application is called Carbonnelle.
Despite the existence of Brython, Python is not popular in the browser, reportedly because it is not
that secure there.
3. Design Restrictions :
Python is dynamically typed, which means that the type of a variable does not need to be declared while
writing the code. It uses duck typing: if an object behaves like a duck, it is treated as a duck. While this
is easy on programmers during coding, it can raise run-time errors.
4. Underdeveloped Database Access Layers :
Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and ODBC (Open
DataBase Connectivity), Python's database access layers are a bit underdeveloped. Consequently, it is less
often applied in huge enterprises.
5. Simple :
Python's simplicity can itself be a problem. To programmers who are used to its simple syntax, the
verbosity of languages such as Java can seem unnecessary, which makes it harder to switch to them.
Double-click the executable file which is downloaded; the following window will open.
The following window shows all the optional features. All the features need to be installed and are
checked by default; click Next to continue.
The following window shows a list of advanced options. Check all the options you want to install and
click Next. Note that the first check-box (install for all users) must be checked.
Now, try to run Python from the command prompt. Type the command python for Python 2 or python3
for Python 3. It will show an error, as in the image below, because the path has not been set yet.
To set the path of Python, right-click on "My Computer" and go to
Properties → Advanced → Environment Variables.
Type PATH as the variable name and set the path to the installation directory of Python, as shown in
the image below.
Now that the path is set, Python is ready to run on the local system. Restart CMD and type python
again. It will open the Python interpreter shell, where Python statements can be executed.
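Once the interpreter starts, the installation can be checked from Python itself. The short script below is a sketch using only the standard library; the helper name describe_python is hypothetical, and the printed values depend on the machine.

```python
import shutil
import sys


def describe_python():
    """Collect basic facts about the running interpreter and the PATH setup."""
    return {
        "version": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        # shutil.which searches the PATH directories for the given command name.
        "on_path": shutil.which("python") is not None or shutil.which("python3") is not None,
    }


info = describe_python()
print("Python", info["version"], "at", info["executable"])
print("Found on PATH:", info["on_path"])
```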
Several Python libraries are crucial for cybersecurity applications, particularly in detecting and analyzing
cyber threats. These libraries provide functionalities for data visualization, processing, and machine
learning model development in intrusion detection systems (IDS).
1. Matplotlib – A powerful data visualization library that helps in creating charts and graphs to analyze
network traffic patterns and detect anomalies.
2. Pandas – A data manipulation library used for processing large cybersecurity datasets, such as
intrusion detection logs and network traffic data.
3. NumPy – A library for numerical computations, useful in handling large-scale network traffic data
and performing mathematical operations efficiently.
4. Scikit-learn – A machine learning library used for developing predictive models to detect cyber
threats and classify network intrusions.
Once the system development is complete and fully prepared, the next crucial phase is its deployment.
Deployment refers to the process of installing and configuring the system in a real-world environment where
it will be used. Since this project is an academic initiative, the deployment was carried out in our school lab,
ensuring all required software and dependencies were properly installed on systems running Windows OS.
This setup allowed for thorough testing, performance evaluation, and user training in a controlled
environment before any potential real-world application.
The deployment process involves various steps, including installation of software components,
configuration of system settings, user accessibility setup, and security testing to ensure the system runs
efficiently. Additionally, post-deployment testing is conducted to identify and resolve any potential issues
that may arise during usage. By deploying the system in a structured manner, we ensure that it functions as
intended and meets the requirements set during the development phase.
Maintenance is a crucial aspect of system management, ensuring long-term functionality and efficiency. In
this project, maintenance is a one-time measure, primarily focusing on fixing any initial issues that may
arise after deployment. However, in practical scenarios, maintenance often includes regular updates, security
patches, performance enhancements, and bug fixes to improve system reliability over time. Ensuring proper
documentation and providing user support can further enhance the system’s sustainability and usability.
CHAPTER-7
SYSTEM DESIGN
7.1 SYSTEM ARCHITECTURE:
This network security architecture ensures a secure communication system by implementing multiple
layers of protection, including firewalls, routers, and Intrusion Detection Systems (IDS). The internal
network consists of various devices such as computers and printers, which are connected through structured
networking. To safeguard these internal resources, a firewall acts as a protective barrier that filters traffic,
allowing only legitimate data to pass through while blocking potential threats.
At the heart of the system, a router serves as a bridge between the internal network and the Internet,
directing traffic efficiently. To further enhance security, Network-Based Intrusion Detection Systems
(IDS) are strategically placed at different points within the network. These IDS monitor incoming and
outgoing traffic, detecting suspicious activities or potential cyber threats. IDS are positioned both before
and after the firewall, as well as near critical servers, to ensure maximum protection.
The infrastructure also includes essential servers such as the Web Server, Email Server, and DNS Server,
which handle website hosting, email communication, and domain name resolution, respectively. The firewall
and IDS work together to prevent unauthorized access and cyberattacks on these critical systems.
By implementing multiple layers of security, this network setup ensures data integrity, confidentiality, and
availability, effectively protecting against cyber threats. This approach allows organizations to maintain
secure and uninterrupted network operations while minimizing security risks.
UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling
language in the field of object-oriented software engineering. The standard is managed, and was created,
by the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented computer
software. In its current form, UML comprises two major components: a meta-model and a notation. In
the future, some form of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and
documenting the artifacts of a software system, as well as for business modeling and other non-software
systems.
The UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and of the software
development process. The UML uses mostly graphical notations to express the design of software projects.
GOALS:
1. The data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that
can be used to represent a system in terms of the input data to the system, the various processing
carried out on this data, and the output data generated by the system.
2. The DFD is one of the most important modeling tools. It is used to model the system components.
These components are the system process, the data used by the process, the external entities that
interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a series of
transformations. It is a graphical technique that depicts information flow and the transformations that
are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. It may be partitioned into
levels that represent increasing information flow and functional detail.
The activity diagram represents the workflow of a machine learning-based classification system using the
NSL-KDD dataset. The process begins with the user initiating the dataset collection, which involves obtaining
the NSL-KDD dataset. The collected data undergoes a preprocessing step to clean and prepare it for analysis.
After preprocessing, feature selection is performed to extract relevant attributes, which may include splitting
and discretization for improved classification performance.
The selected features are then used in the classification process, which involves multiple machine learning
algorithms, including Decision Tree, Random Forest, Support Vector Machine, and Logistic Regression. The
classification results are evaluated using various metrics, such as a confusion matrix, accuracy, precision,
recall, and F1-score. The evaluation metrics help assess the performance of each algorithm. Finally, model
validation is performed to compare the results and determine the most effective model for classification. The
performance comparison allows for selecting the best approach for network intrusion detection or similar
applications.
This structured approach ensures that data is systematically processed, classified, and validated, leading to
accurate and efficient decision-making.
CHAPTER-8
IMPLEMENTATION
8.1 Flowcharts:
The flowchart illustrates a cyberattack detection framework designed for a SCADA (Supervisory Control
and Data Acquisition) system. The process is divided into two main phases: the Training Phase and the
Detection Phase, both of which rely on data stored in a Cyber-attack Database.
The process begins with the SCADA system (Master Operation Room), which collects and stores cyber
attack-related data in the cyber-attack database. This database contains both training data and testing data,
which are used in different phases of the detection system.
In the Training Phase, the system uses training data to build a predictive model. This phase involves three
key steps: Data Preprocessing, where raw data is cleaned and formatted; Training, where the system learns
from historical attack patterns; and Predictive Model Development, where a trained model is created to
recognize cyber threats. The trained predictive model is then used for real-time attack detection.
The Detection Phase is responsible for identifying potential cyber-attacks in real-world scenarios. Here, the
system takes testing data and processes it through the same Data Preprocessing step. The data is then
analyzed using the Predictive Model to determine whether a cyberattack is occurring. If the model detects
an attack, a Decision is made to classify it into a specific Type of Cyber-Attack. If no attack is detected, the
system continues normal operations. This structured approach ensures that the SCADA system can
efficiently identify and classify cyber-attacks, enhancing cybersecurity measures and preventing potential
threats.
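The two phases can be sketched as two functions that share one preprocessing step. The sketch is an assumption-laden illustration: a nearest-centroid rule stands in for the predictive model (it is not the model used in this project), and the feature values and class names are made up.

```python
def preprocess(record):
    """Shared step for both phases: convert raw string fields to floats."""
    return [float(value) for value in record]


def train(training_data):
    """Training phase: build one mean vector per class (a nearest-centroid stand-in model)."""
    sums, counts = {}, {}
    for record, label in training_data:
        features = preprocess(record)
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def detect(model, record):
    """Detection phase: preprocess identically, then return the closest class label."""
    features = preprocess(record)

    def distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    return min(model, key=lambda label: distance(model[label]))


# Hypothetical records: two numeric fields stored as strings, plus a class label.
training_data = [
    (["2", "0.0"], "normal"),
    (["4", "0.1"], "normal"),
    (["250", "1.0"], "dos"),
    (["240", "0.9"], "dos"),
]
model = train(training_data)
decision = detect(model, ["230", "0.95"])
print(decision)
```

Reusing the same preprocess function in both phases reflects the point made above that testing data passes through the same Data Preprocessing step as training data.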
8.2 CODE
IDS.py
# GUI toolkit
import tkinter
from tkinter import *
from tkinter import filedialog, messagebox, simpledialog
from tkinter.filedialog import askopenfilename

# Plotting and data handling
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imutils import paths

# Machine learning
from sklearn import *
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from keras.models import Sequential
from keras.layers import Dense
# Main application window
main = tkinter.Tk()
main.title("Detection of Cyber Attack in Network using Machine Learning Techniques")
main.geometry("1300x1200")

# Shared state (filename, labels, columns, balance_data, data, the train/test
# splits, the active classifier and the per-model accuracies) is kept in
# module-level variables that the functions below declare with `global`.


def isfloat(value):
    """Return True if value can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False
def splitdataset(balance_data):
    """Split the cleaned dataset 80:20 into training and test sets."""
    # Columns 0-37 are used as features and column 38 as the class label.
    X = balance_data.values[:, 0:38]
    Y = balance_data.values[:, 38]
    print(X)
    print(Y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.2, random_state=0)
    return X, Y, X_train, X_test, y_train, y_test
def upload():
    """Ask the user for a dataset file and show its path in the GUI."""
    global filename
    text.delete('1.0', END)
    filename = askopenfilename(initialdir="NSL-KDD-Dataset")
    pathlabel.config(text=filename)
    text.insert(END, "Dataset loaded\n\n")
def preprocess():
    """Clean the uploaded dataset and save the result to clean.txt."""
    global labels
    global columns
    global filename
    text.delete('1.0', END)
    # Column names of an NSL-KDD record: 41 features followed by the class label.
    columns = [
        "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
        "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
        "num_compromised", "root_shell", "su_attempted", "num_root", "num_file_creations",
        "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
        "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate",
        "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
        "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
        "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
        "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
        "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "label"]
    # The statements that read `filename` and build the `cols` header string and
    # the cleaned `dataset` string are not included in this listing.
    with open("clean.txt", "w") as f:
        f.write(cols + "\n" + dataset)
    text.insert(END, "Removed non numeric characters from dataset and saved inside clean.txt file\n\n")
    text.insert(END, "Dataset Information\n\n")
    text.insert(END, dataset + "\n\n")
def generateModel():
    """Load clean.txt, split it into train and test sets, and report the sizes."""
    text.delete('1.0', END)
    global X, Y, X_train, X_test, y_train, y_test
    global balance_data
    balance_data = pd.read_csv("clean.txt")
    X, Y, X_train, X_test, y_train, y_test = splitdataset(balance_data)
    text.insert(END, "Train & Test Model Generated\n\n")
    text.insert(END, "Total Dataset Size : " + str(len(balance_data)) + "\n")
    text.insert(END, "Split Training Size : " + str(len(X_train)) + "\n")
    text.insert(END, "Split Test Size : " + str(len(X_test)) + "\n")
# Only the opening statements of the four model functions appear in this
# listing; the training and evaluation statements are not included.
def runSVM():
    text.delete('1.0', END)
    global svm_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]  # number of feature columns


def runDT():
    text.delete('1.0', END)
    global DT_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]  # number of feature columns


def runRFT():
    text.delete('1.0', END)
    global RFT_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]  # number of feature columns


def runLR():
    text.delete('1.0', END)
    global LR_acc
    global classifier
    global X, Y, X_train, X_test, y_train, y_test
    total = X_train.shape[1]  # number of feature columns
def detectAttack():
    """Classify every record of a user-selected test file with the last trained classifier."""
    text.delete('1.0', END)
    global X, Y, X_train, X_test, y_train, y_test
    filename = filedialog.askopenfilename(initialdir="NSL-KDD-Dataset")
    test = pd.read_csv(filename)
    text.insert(END, filename + " test file loaded\n")
    y_pred = classifier.predict(test)
    print(y_pred)
    for i in range(len(test)):
        # Show the record that was actually classified (a row of `test`, not of X_test).
        if str(y_pred[i]) == '1.0':
            text.insert(END, "X=%s, Predicted=%s" % (test.values[i], ' Infected. Detected Anomaly Signatures') + "\n\n")
        else:
            text.insert(END, "X=%s, Predicted=%s" % (test.values[i], 'Normal Signatures') + "\n\n")
def graph():
    """Plot the accuracy of the four classifiers as a bar chart."""
    height = [svm_acc, LR_acc, DT_acc, RFT_acc]
    bars = ('SVM Accuracy', 'LR Accuracy', 'DT Accuracy', 'RFT Accuracy')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()


# Widget layout. The definitions of font1, the text box and the buttons are
# not included in this listing.
pathlabel = Label(main)
pathlabel.config(bg='DarkOrange1', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=700, y=150)

graphButton.place(x=700, y=600)
graphButton.config(font=font1)

main.config(bg='PeachPuff2')
main.mainloop()
CHAPTER-9
RESULTS AND DISCUSSIONS
9.1 IMPLEMENTATION DESCRIPTION:
This Python script implements a graphical user interface (GUI) application developed with the Tkinter
library. The purpose of the application is to detect cyber attacks using Support Vector Machine (SVM)
classification and to compare the results of several machine learning models for cybersecurity threat
analysis.
• Import Statements:
The script begins by importing the necessary libraries and modules, such as Tkinter for GUI
development, Matplotlib for data visualization, NumPy, Pandas, scikit-learn, and other essential
libraries for data preprocessing, feature extraction, model training, and evaluation.
• GUI Setup:
The main GUI window is initialized with a title and specific dimensions, providing an interactive
interface for users to upload network traffic data, preprocess it, apply machine learning models,
and visualize attack patterns.
Function Definitions:
• Upload Dataset: This function allows the user to upload the NSL-KDD dataset.
• Data Processing: Preprocesses the uploaded data, normalizing and splitting them into training and
testing sets.
• SVM: Trains an SVM classifier on the preprocessed data and evaluates its performance.
• Logistic Regression (LR): Trains on IDS log data to classify network activity as normal or an
intrusion. It uses probability-based decision-making and is evaluated using accuracy, precision, and
recall.
• Decision Tree: Builds a rule-based model by splitting IDS log features to identify suspicious
activities. It is interpretable and evaluated with accuracy and confusion matrix.
• Random Forest: Uses multiple decision trees to detect intrusions more accurately, reducing
overfitting. It classifies threats using majority voting and is assessed using ROC-AUC and F1-score.
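The four training functions above share one pattern: fit a classifier on the training split, then score it on the held-out test split. A minimal sketch of that pattern follows. It assumes scikit-learn and uses a synthetic dataset from make_classification as a stand-in for the preprocessed NSL-KDD features; the function name compare_models and all hyperparameters are illustrative and are not taken from the project script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, seed=42):
    """Train the four classifiers and return their test accuracy by name."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Hyperparameters are illustrative defaults (assumption).
    models = {
        "SVM": SVC(),
        "LR": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(random_state=seed),
        "RFT": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the preprocessed features (assumption, not NSL-KDD).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
scores = compare_models(X, y)
for name, acc in scores.items():
    print(f"{name} accuracy: {acc:.3f}")
```

The returned dictionary can be passed directly to a bar chart, with the keys as tick labels and the values as bar heights.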
GUI Components:
• Buttons: Buttons are provided for uploading the dataset, preprocessing it, training the SVM, Logistic
Regression, Decision Tree, and Random Forest models, generating the accuracy graph, and exiting.
• Text Box: A text box is provided to display messages and performance metrics.
• GUI Configuration: Buttons and text box are configured with appropriate labels, commands, fonts,
and positions.
• Main Loop: The mainloop() function starts the GUI application, allowing the user to interact with
it.
The given dataset is a network intrusion detection dataset used in cyber attack detection by training machine
learning models. It contains structured, labeled data with both numerical and categorical features, making it
suitable for classification tasks. The dataset consists of several network traffic attributes that help identify
whether a connection is normal or an anomaly (attack).
The dataset includes multiple features that describe different aspects of network connections. The
protocol_type column specifies the type of network protocol used, such as TCP, UDP, or ICMP. The service
column represents the destination service being accessed, like HTTP, FTP, or private services. The flag
column indicates the status of the connection, such as SF (successful connection) or REJ (rejected
connection). The duration attribute represents the time duration of the connection, while src_bytes and
dst_bytes denote the number of bytes sent from the source to the destination and vice versa.
Additional security-related features include land, which indicates whether the source and destination IPs are
the same, and wrong_fragment, which counts incorrect fragments in the connection. The dataset also
captures urgent packets, hot indicators that identify suspicious command execution, and num_failed_logins,
which tracks failed authentication attempts. The logged_in attribute is a binary indicator showing whether a user
successfully logged in, while num_compromised, root_shell, and su_attempted highlight potential security
breaches.
Several attributes focus on file and command access, such as num_file_creations, num_shells,
num_access_files, and num_outbound_cmds. Host-related attributes include is_host_login and
is_guest_login, which indicate whether the login was performed on a host or guest account.
Connection-based attributes, such as count and srv_count, track the number of connections made
within a short time frame.
Statistical features include serror_rate and srv_serror_rate, which measure the percentage of connections
with SYN errors, while rerror_rate and srv_rerror_rate track rejected connections. The same_srv_rate,
diff_srv_rate, and srv_diff_host_rate features help characterize network behavior by indicating the
percentage of connections to the same or different services and hosts.
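To make the rate features concrete, the sketch below computes them over a small window of recent connections. It is an illustration under stated assumptions, not the procedure used to build the dataset: the Conn record, the window_rates helper, and the choice of which flags count as SYN errors (S0 to S3) or rejections (REJ) are all hypothetical, and the rates are returned as fractions between 0 and 1 rather than percentages.

```python
from collections import namedtuple

# Hypothetical record type holding only the fields needed here.
Conn = namedtuple("Conn", ["service", "flag"])

# Which flags count as SYN errors or rejections is an assumption of this sketch.
SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}
REJ_FLAGS = {"REJ"}


def window_rates(window, current_service):
    """Compute count and rate features over a window of recent connections."""
    count = len(window)
    if count == 0:
        return {"count": 0, "serror_rate": 0.0, "rerror_rate": 0.0,
                "same_srv_rate": 0.0, "diff_srv_rate": 0.0}
    serror = sum(c.flag in SYN_ERROR_FLAGS for c in window)
    rerror = sum(c.flag in REJ_FLAGS for c in window)
    same_srv = sum(c.service == current_service for c in window)
    return {
        "count": count,
        "serror_rate": serror / count,
        "rerror_rate": rerror / count,
        "same_srv_rate": same_srv / count,
        "diff_srv_rate": (count - same_srv) / count,
    }


# Four made-up connections: three to http, one rejected ftp attempt.
recent = [Conn("http", "SF"), Conn("http", "S0"),
          Conn("ftp", "REJ"), Conn("http", "SF")]
rates = window_rates(recent, current_service="http")
print(rates)
```

A burst of half-open connections would push serror_rate toward 1, which is why these features are useful for spotting scanning and flooding behavior.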
This figure shows the first step of the process, in which the dataset is loaded into the application. The data
is read from a file (for example, a CSV file) with a function such as pd.read_csv(), which brings the raw
records into a structured format ready for further processing. Ensuring that the data is loaded correctly and
without errors is crucial, as this forms the foundation for subsequent analysis and model training.
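The loading step can be sketched as follows. This is a minimal example that assumes a headerless, comma-separated file; the two records, the shortened column list, and the helper name load_dataset are made up for illustration, and an in-memory string is used in place of a file path so the snippet runs on its own.

```python
import io

import pandas as pd

# Two made-up records in the headerless, comma-separated layout assumed here.
SAMPLE_CSV = """0,tcp,http,SF,181,5450,normal
0,tcp,private,REJ,0,0,anomaly
"""

# Only a handful of columns are shown; the real dataset has many more.
COLUMNS = ["duration", "protocol_type", "service", "flag",
           "src_bytes", "dst_bytes", "label"]


def load_dataset(source):
    """Read a headerless CSV (path or file-like object) into a DataFrame."""
    return pd.read_csv(source, header=None, names=COLUMNS)


df = load_dataset(io.StringIO(SAMPLE_CSV))
print(df.shape)
print(df.dtypes)
```

Checking the shape and column types right after loading is a cheap way to catch a wrong delimiter or a shifted column before any model is trained.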
The image focuses on the data preprocessing phase, which is essential for cleaning and preparing the dataset
before applying machine learning models. This step includes handling missing values by either filling them
with appropriate values (fillna()) or removing incomplete records (dropna()). Additionally,
numerical features may need to be scaled using techniques like standardization (StandardScaler) or
normalization (MinMaxScaler) to ensure that the model can learn effectively. Categorical variables might
also require encoding using LabelEncoder or OneHotEncoder.
Another critical aspect of preprocessing is splitting the dataset into training and testing sets using
train_test_split(), ensuring that the model is trained on one portion of the data and evaluated on
another for unbiased performance assessment.
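The preprocessing steps described above can be sketched as follows. The small DataFrame is made up for illustration, and the specific choices (median imputation, LabelEncoder for the categorical column, StandardScaler, a 75/25 split) are assumptions rather than the settings of the project script.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Small made-up frame; one missing src_bytes value to demonstrate fillna().
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp", "tcp", "udp", "tcp", "icmp"],
    "src_bytes": [181, 239, np.nan, 0, 235, 146, 0, 1032],
    "dst_bytes": [5450, 486, 0, 0, 1337, 0, 0, 0],
    "label": ["normal", "normal", "anomaly", "anomaly",
              "normal", "normal", "anomaly", "anomaly"],
})

# 1. Fill missing numeric values with the column median.
df["src_bytes"] = df["src_bytes"].fillna(df["src_bytes"].median())

# 2. Encode the categorical column and the target as integers.
df["protocol_type"] = LabelEncoder().fit_transform(df["protocol_type"])
y = LabelEncoder().fit_transform(df["label"])  # anomaly -> 0, normal -> 1
X = df.drop(columns="label")

# 3. Split first, then fit the scaler on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training split alone keeps statistics of the test records out of training, which is what makes the later evaluation unbiased.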
The image likely represents the feature engineering or feature extraction process, where meaningful features
are created or selected from the existing dataset to enhance machine learning performance. This step may
involve transforming raw data into a more suitable format, such as extracting text features using TF-IDF for
natural language processing tasks, generating embeddings for deep learning models, or encoding
categorical variables. It could also include selecting important features using techniques like Principal
Component Analysis (PCA), mutual information scores, or recursive feature elimination. Removing
redundant or irrelevant data lets the model focus on the most informative attributes, which improves both
accuracy and efficiency while reducing computational complexity.
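Two of the techniques named above, mutual information scoring and PCA, can be sketched with scikit-learn as follows. The data is synthetic (an assumption, not the project dataset), and keeping three features or components is an arbitrary choice for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data (assumption): 10 features, only 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# Keep the 3 features with the highest mutual information with the label.
score_fn = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_fn, k=3).fit(X, y)
X_selected = selector.transform(X)
print("kept feature indices:", selector.get_support(indices=True))

# Alternatively, project onto 3 principal components (ignores the label).
pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Selection keeps original columns, so the result stays interpretable; PCA produces new combined axes, which is more compact but harder to explain to an analyst.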
Fig 9.3.4: SVM Accuracy
The image depicts the application of a Support Vector Machine (SVM) model to the processed data. SVM is
a powerful supervised learning algorithm used for classification and regression tasks. At this stage, the
model is initialized using the SVC() function from the scikit-learn library, and training is performed on the
preprocessed dataset with model.fit(X_train, y_train). After training, predictions are made
using model.predict(X_test), and performance evaluation is conducted using metrics such as
accuracy, precision, recall, F1-score, or a confusion matrix. This phase is crucial for assessing how well the
model generalizes to unseen data and fine-tuning parameters if necessary to enhance performance.
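The train, predict, and evaluate sequence described above can be sketched as follows. The data is synthetic, the kernel and C value are illustrative assumptions, and the class names passed to the report simply label classes 0 and 1 for readability.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data standing in for the preprocessed records (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Initialize and train the classifier; kernel and C are illustrative values.
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)

# Predict on unseen data and evaluate.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"accuracy: {acc:.3f}")
print("confusion matrix (rows = true class, columns = predicted class):")
print(cm)
print(classification_report(y_test, y_pred, target_names=["normal", "anomaly"]))
```

The report lists precision, recall, and F1-score per class, which matters for intrusion data because a high overall accuracy can hide poor recall on the attack class.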
Fig 9.3.5: LR Accuracy
This image likely contains a visualization related to Logistic Regression, a popular classification algorithm.
It could be a decision boundary, performance metrics, or a model evaluation graph that explains how logistic
regression classifies data points into different categories.
This image represents a Decision Tree model, which is a hierarchical structure used for classification and
regression tasks. It could display the tree's structure, including nodes and branches, showing how the model
splits data based on different feature values.
The Random Forest image may illustrate how multiple decision trees are combined to improve accuracy and
reduce overfitting. It could include a representation of multiple trees working together or feature importance
scores generated by the model.
This file likely contains information on test data, possibly showing a dataset that has been uploaded for
evaluation purposes, ensuring the reliability and performance of a machine learning model. It might be a
screenshot or a structured file, such as a CSV or JSON, representing test samples used to validate the model's
predictions against known outcomes. By analyzing this data, practitioners can assess model accuracy, identify
potential biases, and fine-tune hyperparameters to optimize performance before deploying the model in real-
world applications.
This image might represent an attack detection system or an adversarial attack in machine learning. If related
to cybersecurity, it could depict different types of attacks or vulnerabilities in a dataset.
The accuracy graph compares four machine learning algorithms: SVM, Logistic Regression, Decision Tree,
and Random Forest. In this graph, the Random Forest algorithm shows the highest accuracy of the four
models. The bar chart places the accuracy score of each model side by side, making it visually evident that
Random Forest outperforms the others. This superior performance suggests that Random Forest is the
best-suited model for the given dataset, possibly due to its ability to reduce overfitting and improve
generalization by averaging multiple decision trees.
This image likely represents a confusion matrix, a tool used to evaluate classification model performance. It
typically displays True Positives, False Positives, True Negatives, and False Negatives, helping to assess
accuracy, precision, recall, and F1-score.
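The way the four cells of a confusion matrix yield these metrics can be shown with a short worked example. The counts below are made up purely to illustrate the arithmetic and are not results from this project; the helper name metrics_from_confusion is likewise hypothetical.

```python
def metrics_from_confusion(tp, fp, tn, fn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only (not project results): "positive" means attack.
m = metrics_from_confusion(tp=90, fp=10, tn=85, fn=15)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 175/200, precision is 90/100, and recall is 90/105. In an intrusion detection setting, each false negative is a missed attack, so recall usually deserves as much attention as accuracy.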
CHAPTER-10
CONCLUSION AND FUTURE SCOPE
CONCLUSION:
In this paper, a Network Intrusion Detection System utilizing machine learning techniques was presented.
The performance of the proposed detection system was thoroughly evaluated using multiple machine
learning algorithms on the NSL-KDD dataset. The results show that the Random Forest and Decision Tree
algorithms performed well compared to the other models in predicting malicious packets, especially in
terms of accuracy, recall, and the Matthews correlation coefficient. Moreover, the RF classifier
outperformed state-of-the-art intrusion detection systems. The NSL-KDD dataset suffers from several
issues: its classes are imbalanced, and its recorded malicious traffic is synthetic and does not reflect
real-world attacks. Even so, the classifiers presented satisfactory results and are capable of detecting
network intrusions.
FUTURE SCOPE:
In the future, we will increase the amount of test data for our system and examine how the accuracy varies.
We also hope to combine the RST method with a genetic algorithm to improve the accuracy of the IDS. The
present system only displays the log information; it does not employ any techniques to analyze the
information in the log records and extract knowledge. The system can be extended by incorporating data
mining techniques to analyze the log records, which may help in efficient decision making. The present
system also detects only known attacks. It can be extended by incorporating intelligence so that it gains
knowledge by itself, analyzing the growing traffic and learning new intrusion patterns.
CHAPTER-11
REFERENCES
1. Cisco Annual Internet Report (2018–2023) White Paper. (2022, January 23). Cisco.
https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
2. Dyn Analysis Summary of Friday October 21 Attack (2022, February 20).
https://web.archive.org/web/20200620203923
3. Dartigue, C., Jang, H.I., Zeng, W. A new data-mining based approach for network intrusion detection. In
Seventh Annual Communication Networks and Services Research Conference. 2009; 372–377.
4. García-Teodoro, P., Díaz-Verdejo, J., Maciá-Fernández, G., Vázquez, E. Anomaly-based network intrusion
detection: Techniques, systems and challenges. Computers & Security. 2009; 28(1–2): 18–28.
5. Cisco. What Is Network Security? (2022, February 8). Cisco.
https://www.cisco.com/c/en/us/products/security/what-is-network-security.html
6. Kurose, J.F., Ross, K.W. Computer Networking: A Top-Down Approach (6th Edition). Pearson, 2012.
7. Tanenbaum, A., Wetherall, D. Computer Networks (5th Edition). Pearson, 2010.
8. Fernandes, G., Rodrigues, J.J.P.C., Carvalho, L.F., Al-Muhtadi, J.F., Proença, M.L. A comprehensive
survey on network anomaly detection. Telecommunication Systems. 2018; 70(3): 447–489.
9. Othman, S.M., Alsohybe, N.T., Ba-Alwi, F.M., Zahary, A.T. Survey on intrusion detection system types.
2018; 7(4): 444–463.
10. Pal Singh, A., Deep Singh, M. Analysis of Host-Based and Network-Based Intrusion Detection System.
International Journal of Computer Network and Information Security. 2014; 6(8): 41–47.
11. Ferrag, M.A., Maglaras, L., Moschoyiannis, S., Janicke, H. Deep learning for cyber security intrusion
detection: Approaches, datasets, and comparative study. Journal of Information Security and Applications.
2020; 50.
12. Boutaba, R., Salahuddin, M.A., Limam, N., Ayoubi, S., Shahriar, N., Estrada-Solano, F., Caicedo, O.M.
A comprehensive survey on machine learning for networking: evolution, applications and research
opportunities. Journal of Internet Services and Applications. 2018; 9(1).
13. Buczak, A.L., Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security
Intrusion Detection. IEEE Communications Surveys & Tutorials. 2016; 18(2): 1153–1176.
14. Chaabouni, N., Mosbah, M., Zemmari, A., Sauvignac, C., Faruki, P. Network Intrusion Detection for IoT
Security Based on Learning Techniques. IEEE Communications Surveys & Tutorials. 2019; 21(3): 2671–
2701.
15. Berman, D., Buczak, A., Chavis, J., Corbett, C. A Survey of Deep Learning Methods for Cyber Security.
Information. 2019; 10(4): 122.
16. Mahdavifar, S., Ghorbani, A.A. Application of deep learning to cybersecurity: A survey.
Neurocomputing. 2019; 347: 149–176.
17. Ahmed, M., Naser Mahmood, A., Hu, J. A survey of network anomaly detection techniques. Journal of
Network and Computer Applications. 2016; 60: 19–31.
18. Ring, M., Wunderlich, S., Scheuring, D., Landes, D., Hotho, A. A survey of network-based intrusion
detection data sets. Computers & Security. 2019; 86: 147–167.
19. Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K. Network Anomaly Detection: Methods, Systems and
Tools. IEEE Communications Surveys & Tutorials. 2014; 16(1): 303–336.
20. Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., Zhu, H., Gao, M., Hou, H., Wang, C. Machine Learning and
Deep Learning Methods for Cybersecurity. IEEE Access. 2018; 6: 35365–35381.
21. UNB (2021, November 15). https://www.unb.ca/cic/datasets/nsl.html
22. Chumachenko, K. Machine learning methods for malware detection and classification., 2017.
23. Zou, J., Han, Y., So, S.S. Overview of artificial neural networks. Methods in molecular biology (Clifton,
N.J.). 2008; 458: 15–23.
24. Dong, B., Wang, X. Comparison deep learning method to traditional methods using for network
intrusion detection. In 2016 8th IEEE International Conference on Communication Software and Networks
(ICCSN). 2016; 581–585.
25. Mahesh, B. Machine Learning Algorithms – A Review. International Journal of Science and Research
(IJSR). 2020; 381–386.
26. Farnaaz, N., Jabbar, M.A. Random forest modeling for network intrusion detection system. Procedia
Computer Science. 2016; 89: 213–217.
27. Bhumgara, A., Pitale, A. Detection of Network Intrusions using Hybrid Intelligent Systems. 1st
International Conference on Advances in Information Technology (ICAIT). 2019; 500–506.
28. Kumar, K., Batth, J.S. Network Intrusion Detection with Feature Selection Techniques using Machine-
Learning Algorithms. International Journal of Computer Applications. 2016; 150(12): 1–13.
29. Dhanabal, L., Shantharajah, S.P. A study on NSL-KDD dataset for intrusion detection system based on
classification algorithms. International journal of advanced research in computer and communication
engineering. 2015; 4(6): 446–452.