PREDICTION OF CYBER ATTACKS USING DATA SCIENCE TECHNIQUE
Submitted in partial fulfillment of the requirements for the award of
Bachelor of Engineering Degree
in
Computer Science and Engineering
By
Ramadugu Kaustub (Reg. No. 38110448)
Vishnu Vardhan Raju (Reg. No. 38110628)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SCHOOL OF COMPUTING
SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade "A" by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600119
April 2022
SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY
(Established under Section 3 of UGC Act, 1956)
Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai - 600119
www.sathyabama.ac.in
SCHOOL OF COMPUTING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Ramadugu Kaustub
(Reg. No. 38110448) and Vishnu Vardhan Raju (Reg. No. 38110628), who carried out the
project entitled "Prediction of Cyber Attacks Using Data Science Technique" under our
supervision from November 2021 to April 2022.
Internal Guide
Dr. T.Judgi
Head of the Department
Dr. S. VIGNESHWARI, M.E., Ph.D.
DECLARATION
DATE:
PLACE: SIGNATURE OF THE CANDIDATE
ACKNOWLEDGEMENT
I wish to express my thanks to all teaching and non-teaching staff members of the
department of COMPUTER SCIENCE AND ENGINEERING who were helpful in many
ways for the completion of the project.
ABSTRACT
TABLE OF CONTENTS
01 INTRODUCTION 11
02 EXISTING SYSTEM 12
   2.1 DRAWBACKS
03 DOMAIN OVERVIEW 13
   3.1 DATA SCIENCE
   3.2 ARTIFICIAL INTELLIGENCE
04 MACHINE LEARNING 19
05 PREPARING DATASET 21
06 PROPOSED SYSTEM 21
6.1 ADVANTAGES
07 LITERATURE SURVEY 22
08 SYSTEM STUDY 30
8.1 OBJECTIVES
8.2 PROJECT GOAL
8.3 SCOPE OF THE PROJECT
09 FEASIBILITY STUDY 37
10 LIST OF MODULES 39
11 PROJECT REQUIREMENTS 39
   11.1 FUNCTIONAL REQUIREMENTS
   11.2 NON-FUNCTIONAL REQUIREMENTS
12 ENVIRONMENT REQUIREMENT 40
13 SOFTWARE DESCRIPTION 41
13.1 ANACONDA NAVIGATOR
13.2 JUPYTER NOTEBOOK
14 PYTHON 51
15 SYSTEM ARCHITECTURE 63
16 WORKFLOW DIAGRAM 64
17 USECASE DIAGRAM 65
18 CLASS DIAGRAM 66
19 ACTIVITY DIAGRAM 67
20 SEQUENCE DIAGRAM 68
21 ER – DIAGRAM 69
22 MODULE DESCRIPTION 70
22.1 MODULE DIAGRAM
22.2 MODULE GIVEN INPUT AND EXPECTED OUTPUT
23 DEPLOYMENT (GUI) 94
24 CODING 95
25 CONCLUSION 141
LIST OF FIGURES
LIST OF SYMBOLS
1. Class: Represents a collection of similar entities grouped together. The class box lists the class name, its attributes and its operations, with visibility markers: + public, - private, # protected.
2. Association: Associations represent static relationships between classes. Roles represent the way the two classes see each other.
3. Actor: It aggregates several classes into a single class.
5. Relation (uses): Used for additional process communication.
6. Relation (extends): The extends relationship is used when one use case is similar to another use case but does a bit more.
7. Communication: Communication between various use cases.
9. Initial State: Initial state of the object.
13. Use case: Interaction between the system and the external environment.
14. Component: Represents physical modules, which are a collection of components.
15. Node: Represents physical modules, which are a collection of components.
16. Data Process/State: A circle in a DFD represents a state or process which has been triggered due to some event or action.
17. External entity: Represents external entities such as keyboard, sensors, etc.
18. Transition: Represents the communication that occurs between processes.
19. Object Lifeline: Represents the vertical dimension over which the object communicates.
CHAPTER 1
1.1 INTRODUCTION
The advantage of being able to detect unusual activity is the ability to recognise new
(or unexpected) attacks, which brings many benefits. The techniques used are based on
analytics pipelines already applied in several industries. We provide general information
for the analysis of traffic data and of the information that can be exploited for targeted
attacks. A comparative study between machine learning algorithms has been carried out
in order to determine which algorithm is the most accurate in predicting the type of
cyber attack. We classify four types of attacks: DoS, R2L, U2R and Probe attacks. The
results show that the effectiveness of the proposed machine learning technique can be
compared, for the best accuracy, in terms of precision, recall, F1 score, sensitivity,
specificity and entropy.

In related existing work on the detection of road incidents using long-distance road
routes, sentences are classified into three kinds: good, bad and neutral. The output of
this classification is the polarity of the sentence (positive, negative or neutral) with
respect to road sentences, depending on whether or not it relates to traffic. A
bag-of-words (BoW) representation is used to convert each sentence into a one-hot code
that is fed to bidirectional LSTM networks (Bi-LSTM). After training, a multi-stage
network uses softmax to classify sentences according to location, vehicle experience and
type of polarisation. The proposed strategy is compared with various machine learning
and state-of-the-art training techniques in terms of precision, F-scores and other
standard measures.
Disadvantages:
1. The performance is not good and it becomes complicated for other networks.
2. Performance metrics such as recall and F1 score, and a comparison of machine
learning algorithms, are not reported.
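To make the comparison metrics named above concrete, the following small, self-contained sketch (not the project's actual evaluation code) shows how they can be computed with scikit-learn on made-up labels:

# Hedged sketch: computing the comparison metrics with scikit-learn.
# The label arrays below are illustrative placeholders, not project results.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels (1 = attack, 0 = normal)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by some classifier

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy    :", accuracy_score(y_true, y_pred))
print("Precision   :", precision_score(y_true, y_pred))
print("Recall/Sens.:", recall_score(y_true, y_pred))   # sensitivity
print("F1 score    :", f1_score(y_true, y_pred))
print("Specificity :", tn / (tn + fp))                  # true-negative rate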
CHAPTER 2
DOMAIN OVERVIEW
2.1 Data Science
The term "data science" has been traced back to 1974, when Peter
Naur proposed it as an alternative name for computer science. In 1996, the
International Federation of Classification Societies became the first conference
to specifically feature data science as a topic. However, the definition was still
in flux.
The term "data science" was first coined in 2008 by D. J. Patil and Jeff
Hammerbacher, the pioneer leads of data and analytics efforts at LinkedIn and
Facebook. In less than a decade, it has become one of the hottest and most
trending professions in the market. Data science is concerned with finding the
hidden insights or patterns in raw data that can be of major use in the formation
of big business decisions.
Data Scientist:
2.2 ARTIFICIAL INTELLIGENCE
Artificial intelligence (AI) is intelligence demonstrated by machines: systems that
perceive their environment and take actions that maximise their chance of
achieving their goals. Some popular accounts use the term "artificial intelligence"
to describe machines that mimic "cognitive" functions that humans associate
with the human mind, such as "learning" and "problem solving", however this
definition is rejected by major AI researchers.
AI research has tried and discarded many different approaches during its
lifetime, including simulating the brain, modeling human problem solving,
formal logic, large databases of knowledge and imitating animal behavior. In
the first decades of the 21st century, highly mathematical statistical machine
learning has dominated the field, and this technique has proved highly
successful, helping to solve many challenging problems throughout industry and
academia.
The various sub-fields of AI research are centered around particular goals
and the use of particular tools. The traditional goals of AI research
include reasoning, knowledge representation, planning, learning, natural
language processing, perception and the ability to move and manipulate
objects. General intelligence (the ability to solve an arbitrary problem) is among
the field's long-term goals.
The field was founded on the assumption that human intelligence "can be
so precisely described that a machine can be made to simulate it". This raises
philosophical arguments about the mind and the ethics of creating artificial
beings endowed with human-like intelligence.
In general, AI systems work by ingesting large amounts of labeled
training data, analyzing the data for correlations and patterns, and using these
patterns to make predictions about future states. In this way, a chatbot that is fed
examples of text chats can learn to produce lifelike exchanges with people, or
an image recognition tool can learn to identify and describe objects in images
by reviewing millions of examples.
Artificial neural networks and deep learning artificial intelligence
technologies are quickly evolving, primarily because AI processes large
amounts of data much faster and makes predictions more accurately than
humanly possible.
MACHINE LEARNING
Machine learning is used to predict the future from past data. Machine learning
(ML) is a type of artificial intelligence (AI) that provides computers with the
ability to learn without being explicitly programmed. Machine learning focuses
on the development of computer programs that can change when exposed to new
data; this chapter covers the basics of machine learning and the implementation of
a simple machine learning algorithm using Python. The process of training and
prediction involves the use of specialised algorithms: we feed training data to an
algorithm, and the algorithm uses this training data to make predictions on new
test data. Machine learning can be roughly separated into three categories:
supervised learning, unsupervised learning and reinforcement learning. In
supervised learning, the program is given both the input data and the corresponding
labels; the data has to be labelled by a human being beforehand. In unsupervised
learning, no labels are provided to the learning algorithm; the algorithm has to
figure out the clustering of the input data on its own. Finally, reinforcement
learning dynamically interacts with its environment and receives positive or
negative feedback to improve its performance.
A classification problem may be binary (for example, predicting whether a
person is male or female, or whether a mail is spam or non-spam) or it may be
multi-class too. Some examples of classification problems are: speech
recognition, handwriting recognition, biometric identification, document
classification, etc.
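As a concrete illustration of such a classification task, the following minimal sketch trains a classifier on a synthetic two-class dataset; it is an illustration only and does not use the project's KDD data:

# Minimal supervised-classification sketch on synthetic data (illustration only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A small two-class dataset standing in for "attack" vs "normal" records.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)              # learn from labelled training data
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))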
Preparing Dataset:
DoS Attack
R2L Attack
U2R Attack
Probe Attack
Proposed System:
Advantages:
1. Anomaly detection can be automated using machine learning.
CHAPTER 3
Literature Survey:
General
A literature review is a body of text that aims to review the critical points of
current knowledge on, and/or methodological approaches to, a particular topic.
Literature reviews are secondary sources: they discuss published information in a
particular subject area, sometimes restricted to a certain time period. The ultimate
goal is to bring the reader up to date with the current literature on a topic; the
review forms the basis for another goal, such as identifying future research that
may be needed in the area, and it often precedes a research proposal. It may be just
a simple summary of sources, but usually it has an organisational pattern and
combines both summary and synthesis.
Loan default trends have long been studied from a socio-economic standpoint.
Most economic surveys rely on empirical modelling of these complex systems in
order to predict the default rate for a particular individual. The use of machine
learning for such tasks is a trend that is being observed now. Some of the surveys
below help to understand the past and present perspective of this kind of approval
prediction.
Review of Literature Survey
Year : 2014
In a network formed using cyberspace or telecommunication technology, the
identification or prediction of any kind of socio-technical attack is always
difficult. This challenge creates an opportunity to explore different methodologies,
concepts and algorithms used to identify these kinds of community on the basis of
certain patterns, properties, structure and trends in their linkage. The work tries
to find the hidden information in a huge social network by compressing it into
small networks through the apriori algorithm; the result is then diagnosed using
the Viterbi algorithm to predict the most probable pattern of conversation to be
followed in the network, and if this pattern matches the existing pattern of
criminals, terrorists and hijackers then it may be helpful to generate some kind of
alert before the crime.

Due to the emergence of the internet on mobile phones, the different social
networks such as social networking sites, blogs, opinions, ratings, reviews,
social bookmarking, social news, media sharing and Wikipedia have led people to
disperse any kind of information very easily. Rigorous analysis of these patterns
can reveal some very undisclosed and important information, for example whether
a person is conducting malignant or harmless communications with a particular
user, which may be a reason for some kind of socio-technical attack. From the
above simulation done on CDR data, it may be concluded that if this kind of
simulation is applied to networks based on the internet, and if we are in a position
to get data that can be transformed into transition and emission matrices, then
several kinds of prediction may be drawn which will be helpful in taking decisions.
Year : 2013
Intrusion detection systems (IDS) are used to detect the occurrence of malicious
activities against IT systems. The malicious activities are detected by monitoring
and analysing the IT system's activities. In the ideal case, the IDS generates one
or more alerts for each detected malicious activity and stores them in the IDS
database. Some of the stored alerts are related; alert relations range from
duplication relations to same-attack-scenario relations. A duplication relation
means that the two alerts were generated as a result of the same malicious
activity, whereas a same-attack-scenario relation means that the two related alerts
were generated as a result of related malicious activities. An attack scenario, or
multi-step attack, is a set of related malicious activities run by the same attacker
to reach a specific goal. The normal relation between malicious activities
belonging to the same attack scenario is a causal relation: the output of the
current malicious activity is a pre-condition for running the next malicious
activity. A possible multi-step attack against a network starts with information
gathering about the network, which is done through network reconnaissance and
fingerprinting. Through reconnaissance, the network configuration and running
services are identified; through fingerprinting, the operating system type and
version are identified. The authors propose a real-time prediction methodology for
predicting the most probable attack steps and attack scenarios. The proposed
methodology benefits from the history of attacks against the network and from
attack graph source data, and it comes without considerable computation
overhead, such as checking an attack-plan library. It provides parallel prediction
for parallel attack scenarios. A possible third attack step is to identify the attack
plan based on the attack graph modelled in the previous step. The attack plan
usually includes the exploitation of a sequence of discovered vulnerabilities, and
this sequence is mostly distributed over a set of network nodes whose
vulnerabilities are related through causal relations and connectivity. Lastly, the
attacker exploits the attack scenario sequence in order until reaching his or her
goal. An attack plan consists of many correlated malicious activities that end with
the attacking goal.
Year : 2012
The prediction results reflect the security situation of the target network in the
future, and security administrators can take corresponding measures to enhance
network security according to the results. To quantitatively predict the possible
attacks on the network in the future, attack probability plays a significant role: it
can be used to indicate the possibility of invasion by intruders. As an important
kind of quantitative network-security evaluation measure, attack probability and
its computation methods have been studied for a long time, and many models have
been proposed for evaluating network security. Graphical models such as attack
graphs have become the mainstream approach. Attack graphs, which capture the
relationships among vulnerabilities and exploits, show all the possible attack paths
that an attacker can take to intrude on all the targets in the network. The traffic to
different hosts or servers may differ; hosts or servers with heavy traffic may be
more at risk, since they are often important hosts or servers with which intruders
may have more contact and understanding. In this cyber-attack prediction model,
the authors use an attack graph to capture the vulnerabilities in the network. In
addition, they consider three environment factors that are the major impact
factors for future cyber attacks: the value of assets in the network, the usage
condition of the network, and the attack history of the network. Cyber-attack
prediction is an important part of risk management. Existing cyber-attack
prediction methods did not fully consider the specific environment factors of the
target network, which may make the results deviate from the true situation. In
this paper, the authors propose a cyber-attack prediction model based on a
Bayesian network: attack graphs are used to represent all the vulnerabilities and
possible attack paths, the environment factors are captured using a Bayesian
network model, and cyber-attack predictions are performed on the constructed
Bayesian network.
Year : 2008
This paper begins with the relation that exists between network traffic data and
the amount of DoS attack traffic, and then proposes a clustering method based on
a genetic optimisation algorithm to implement the classification of DoS attack
data. The method first obtains a proper partition of the relation between the
network traffic and the amount of DoS attack traffic based on the optimised
clustering, and builds prediction sub-models of the DoS attack. Meanwhile, with
the Bayesian method, the calculation of the output probability corresponding to
each sub-model is deduced, and then the distribution of the amount of DoS attack
traffic in some future range is obtained. The paper describes the clustering
problem first, and then utilises the genetic algorithm to optimise the clustering
method. Based on the optimised clustering of the sample data, various categories
of the relation between traffic and attack amounts are obtained, and several
prediction sub-models for the DoS attack are built. Furthermore, according to the
Bayesian method, a discrete probability calculation is deduced for each sub-model,
yielding a discrete-probability distribution prediction model for the DoS attack.
Author: Xiaoyong Yuan, Pan He, Qile Zhu and Xiaolin Li
Year : 2019

Year : 2019
This paper has investigated the distributed secure control of multiagent
systems under DoS attacks. We focus on the investigation of a jointly adverse
impact of distributed DoS attacks from multiple adversaries. In this scenario,
two kinds of communication schemes, that is, sampled-data and event-triggered
communication schemes, have been discussed and, then, a fully distributed
control protocol has been developed to guarantee satisfactory asymptotic
consensus. Note that this protocol has strong robustness and high scalability. Its
design does not involve any global information, and its efficiency has been
proved. For the event-triggered case, two effective dynamical event conditions
have been designed and implemented in a fully distributed way, and both of
them have excluded Zeno behavior. Finally, a simulation example has been
provided to verify the effectiveness of theoretical analysis. Our future research
topics focus on fully distributed event/self-triggered control for linear/nonlinear
multiagent systems to gain a better understanding of fully distributed control.
CHAPTER 4
SYSTEM STUDY
Classification of Attacks:
The KDD Cup 99 data set contains normal data and 22 attack types, with 41
features; every generated traffic pattern ends with a label, either 'normal' or an
attack type, for subsequent analysis. A variety of attacks enter the network over a
period of time, and the attacks are classified into the following four main classes.
Denial of Service (DoS)
User to Root (U2R)
Remote to User (R2L)
Probing
Denial of Service:
Denial of Service is a class of attacks where an attacker makes some computing
or memory resource too busy or too full to handle legitimate requests, denying
legitimate users access to a machine. The different ways to launch a DoS attack are:
by abusing the computer's legitimate features
by targeting implementation bugs
by exploiting the misconfiguration of the systems
DoS attacks are classified based on the services that an attacker renders
unavailable to legitimate users.
User to Root:
In User to Root attack, an attacker starts with access to a normal user
account on the system and gains root access. Regular programming mistakes and
environment assumptions give an attacker the opportunity to exploit a
vulnerability and gain root access.
Remote to User:
In Remote to User attack, an attacker sends packets to a machine over a
network that exploits the machine's vulnerability to gain local access as a user
illegally. There are different types of R2L attacks and the most common attack
in this class is done by using social engineering.
Probing:
Probing is a class of attacks where an attacker scans a network to gather
information in order to find known vulnerabilities. An attacker with a map of
machines and services that are available on a network can manipulate the
information to look for exploits. There are different types of probes: some of
them abuse the computer's legitimate features and some of them use social
engineering techniques. This class of attacks is the most common because it
requires very little technical expertise.
Summary:
This chapter outlines the structure of the dataset used in the proposed
work. The various kinds of features such as discrete and continuous features are
studied with a focus on their role in the attack. The attacks are classified with a
brief introduction to each. The next chapter discusses the clustering and
classification of the data with a view to machine learning.
Table: Attack Types Grouped to respective Class
xlock
xsnoop

Types of Attacks: Description
back: Denial of service attack against the Apache web server, where a client requests a URL containing many backslashes.
neptune: SYN flood denial of service on one or more ports.
land: Denial of service where a remote host is sent a UDP packet with the same source and destination.
pod: Denial of service ping of death.
smurf: Denial of service ICMP echo reply flood.
teardrop: Denial of service where mis-fragmented UDP packets cause some systems to reboot.
multihop: Multi-day scenario in which a user first breaks into one machine.
phf: Exploitable CGI script which allows a client to execute arbitrary commands on a machine with a mis-configured web server.
spy: Multi-day scenario in which a user breaks into a machine with the purpose of finding important information while trying to avoid detection; uses several different exploit methods to gain access.
warezclient: Users downloading illegal software which was previously posted via anonymous FTP by the warezmaster.
warezmaster: Anonymous FTP upload of warez (usually illegal copies of copyrighted software) onto an FTP server.
imap: Remote buffer overflow using the IMAP port; leads to a root shell.
loadmodule: Non-stealthy loadmodule attack which resets IFS for a normal user and creates a root shell.
perl: Perl attack which sets the user id to root in a Perl script and creates a root shell.
rootkit: Multi-day scenario where a user installs one or more components of a rootkit.
ipsweep: Surveillance sweep performing either a port sweep or a ping on multiple host addresses.
nmap: Network mapping using the nmap tool; the mode of exploring the network varies, and options include SYN scans.
satan: Network probing tool which looks for well-known weaknesses; operates at three different levels, level 0 being light.
portsweep: Surveillance sweep through many ports to determine which services are supported on a single host.
dict: Guess passwords for a valid user using simple variants of the account name over a telnet connection.
eject: Buffer overflow using the eject program on Solaris; leads to a user-to-root transition if successful.
ffb: Buffer overflow using the ffbconfig UNIX system command; leads to a root shell.
format: Buffer overflow using the fdformat UNIX system command; leads to a root shell.
ftp-write: Remote FTP user creates a .rhost file in a world-writable anonymous FTP directory and obtains local login.
guest: Try to guess the password via telnet for the guest account.
syslog: Denial of service for the syslog service: connects to port 514 with an unresolvable source IP.
warez: User logs into an anonymous FTP site and creates a hidden directory.
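The grouping behind this table can be written as a plain dictionary. The sketch below follows the standard KDD Cup 99 assignment of raw labels to the four classes and is meant only as an illustration; the helper function and its name are invented here:

# Standard KDD Cup 99 grouping of raw attack labels into the four classes
# (illustrative sketch; label spellings follow the dataset's conventions).
ATTACK_CLASS = {
    # Denial of Service
    "back": "DoS", "land": "DoS", "neptune": "DoS", "pod": "DoS",
    "smurf": "DoS", "teardrop": "DoS",
    # User to Root
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
    # Remote to User
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    # Probing
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
}

def attack_class(label):
    # Map a raw KDD label (e.g. 'smurf.') to its class, or 'normal' if unknown.
    return ATTACK_CLASS.get(label.rstrip("."), "normal")

print(attack_class("smurf."))   # -> DoS
print(attack_class("normal."))  # -> normal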
Objectives:
Apply the fundamental concepts of machine learning to an available dataset, and
evaluate and interpret the results, justifying the interpretation based on the
observed dataset.
Create notebooks that serve as computational records and document the thought
process, and investigate whether a network connection is attacked or not by
analysing the data set.
Evaluate and analyse statistical and visualised results, which find the standard
patterns for all regiments.
Project Goals
Comparing algorithms to predict the result, based on the best accuracy.
Scope:
The scope of this project is to investigate a dataset of network connection attacks
(KDD records for the medical sector) using machine learning techniques, and to
identify whether a network connection is attacked or not.
CHAPTER 5
Feasibility study:
Data Wrangling
In this section of the report we load the data, check it for cleanliness, and then
trim and clean the given dataset for analysis. The steps are documented carefully
and the cleaning decisions are justified.
Data collection
The data set collected for prediction is split into a Training set and a Test set.
Generally, a 7:3 ratio is applied to split the Training set and Test set. The data
models created using the Random Forest, Logistic Regression and Decision Tree
algorithms and the Support Vector Classifier (SVC) are applied on the Training
set and, based on the test result accuracy, Test set prediction is done.
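A minimal sketch of the 7:3 split described above, using a tiny stand-in DataFrame rather than the actual KDD records:

# Hedged sketch of the 7:3 train/test split; the DataFrame is a made-up stand-in.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "duration":  [0, 12, 0, 3, 0, 7, 1, 0, 5, 2],
    "src_bytes": [181, 239, 235, 219, 217, 212, 159, 210, 180, 199],
    "class":     [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],   # 1 = attack, 0 = normal
})

X = df.drop(columns=["class"])
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)            # roughly a 7:3 ratio
print(len(X_train), "training rows /", len(X_test), "test rows")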
Preprocessing
The data which was collected might contain missing values that may lead
to inconsistency. To gain better results data need to be preprocessed so as to
improve the efficiency of the algorithm. The outliers have to be removed and
variable conversion also needs to be done.
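A small sketch of the pre-processing steps mentioned here (missing-value handling, a simple outlier rule and variable conversion), again on a made-up stand-in DataFrame:

# Hedged pre-processing sketch: missing values, outliers, variable conversion.
import pandas as pd

df = pd.DataFrame({
    "src_bytes": [181, 239, None, 219, 99999, 212],   # has a gap and an outlier
    "protocol_type": ["tcp", "udp", "tcp", "icmp", "tcp", "udp"],
})

df = df.dropna()                                      # drop rows with missing values

# Simple IQR rule to drop extreme outliers in a numeric column.
q1, q3 = df["src_bytes"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["src_bytes"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Variable conversion: turn the categorical column into numeric codes.
df["protocol_code"] = df["protocol_type"].astype("category").cat.codes
print(df)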
For the prediction of a phishing website, a Random Forest algorithm prediction
model is effective for the following reason: it provides better results in
classification problems.
Machine learning needs data gathering, i.e. a lot of past data. Data gathering
provides sufficient historical and raw data. Raw data cannot be used directly; it is
first preprocessed, and then a suitable algorithm and model are chosen. Training
and testing check that the model is working and predicting correctly with minimum
error. The tuned model is improved from time to time to increase the accuracy.
The overall workflow is:
Data Gathering
Data Pre-Processing
Choose model
Train model
Test model
Tune model
Prediction
CHAPTER 6
Project Requirements
General:
1. Functional requirements
2. Non-Functional requirements
3. Environment requirements
A. Hardware requirements
B. software requirements
6.1 Functional requirements:
1. Problem define
2. Preparing data
3. Evaluating algorithms
4. Improving results
5. Predicting the result
1. Software Requirements:
2. Hardware requirements:
RAM : minimum 2 GB
CHAPTER 7
SOFTWARE DESCRIPTION
7.1 ANACONDA NAVIGATOR
If you are primarily doing data science work, Anaconda is a great option.
Anaconda is created by Continuum Analytics, and it is a Python distribution that
comes preinstalled with lots of useful Python libraries for data science.
The following applications are available by default in Navigator:
JupyterLab
Jupyter Notebook
Spyder
PyCharm
VSCode
Glueviz
Orange 3 App
RStudio
Anaconda Prompt (Windows only)
Anaconda PowerShell (Windows only)
Anaconda Navigator is a desktop graphical user interface (GUI) included
in Anaconda distribution.
Anaconda comes with many built-in packages that you can easily find
with conda list on your anaconda prompt. As it has lots of packages (many of
which are rarely used), it requires a lot of space and time as well. If you have
enough space and time, and do not want the burden of installing small utilities
like JSON and YAML yourself, you had better go for Anaconda.
Conda:
Conda is an open source, cross-platform, language-agnostic package
manager and environment management system that installs, runs, and updates
packages and their dependencies. It was created for Python programs, but it can
package and distribute software for any language (e.g., R), including multi-
language projects. The conda package and environment manager is included in
all versions of Anaconda, Miniconda, and Anaconda Repository.
7.2 JUPYTER NOTEBOOK
Jupyter notebooks are human-readable documents containing the analysis description and the results
(figures, tables, etc.) as well as executable documents which can be run to
perform data analysis.
Installation: The easiest way to install the Jupyter Notebook App is installing a
scientific python distribution which also includes scientific python packages.
The most common distribution is called Anaconda.
This will launch a new browser window (or a new tab) showing
the Notebook Dashboard, a sort of control panel that allows (among other
things) to select which notebook to open.
When started, the Jupyter Notebook App can access only files within its
start-up folder (including any sub-folder). No configuration is necessary if you
place your notebooks in your home folder or subfolders. Otherwise, you need to
choose a Jupyter Notebook App start-up folder which will contain all the
notebooks.
Save notebooks: Modifications to the notebooks are automatically saved every
few minutes. To avoid modifying the original notebook, make a copy of the
notebook document (menu file -> make a copy…) and save the modifications
on the copy.
Executing a notebook: Download the notebook you want to execute and put it
in your notebook folder (or a sub-folder of it).
Launch the jupyter notebook app
Click on the menu Help -> User Interface Tour for an overview of
the Jupyter Notebook App user interface.
You can run the notebook document step-by-step (one cell a time) by
pressing shift + enter.
You can run the whole notebook in a single step by clicking on the
menu Cell -> Run All.
To restart the kernel (i.e. the computational engine), click on the
menu Kernel -> Restart. This can be useful to start over a computation
from scratch (e.g. variables are deleted, open files are closed, etc…).
Purpose: To support interactive data science and scientific computing across all
programming languages.
File Extension: An IPYNB file is a notebook document created by Jupyter
Notebook, an interactive computational environment that helps scientists
manipulate and analyze data using Python.
Depending on the type of computations, the kernel may consume significant CPU
and RAM. Note that the RAM is not released until the kernel is shut down.
The Notebook Dashboard has other features similar to a file manager, namely
navigating folders and renaming/deleting files.
Working Process:
Download and install anaconda and get the most useful package for
machine learning in Python.
Load a dataset and understand its structure using statistical summaries
and data visualization.
Evaluate machine learning models, pick the best and build confidence that the
accuracy is reliable.
The best way to get started using Python for machine learning is to complete a
project.
It will force you to install and start the Python interpreter (at the very least).
It will give you a bird's eye view of how to step through a small project.
It will give you confidence, maybe to go on to your own small projects.
When you are applying machine learning to your own datasets, you are
working on a project. A machine learning project may not be linear, but it has a
number of well-known steps:
Define Problem.
Prepare Data.
Evaluate Algorithms.
Improve Results.
Present Results.
The best way to really come to terms with a new platform or tool is to
work through a machine learning project end-to-end and cover the key steps.
Namely, from loading data, summarizing data, evaluating algorithms and
making some predictions.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.
PYTHON
Introduction:
Python consistently ranks as one of the most popular programming languages.
History:
Python was conceived in the late 1980s by Guido van Rossum at Centrum
Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC
programming language, which was inspired by SETL, capable of exception
handling and interfacing with the Amoeba operating system. Its implementation
began in December 1989. Van Rossum shouldered sole responsibility for the
project, as the lead developer, until 12 July 2018, when he announced his
"permanent vacation" from his responsibilities as Python's Benevolent Dictator
For Life, a title the Python community bestowed upon him to reflect his long-
term commitment as the project's chief decision-maker. In January 2019, active
Python core developers elected a 5-member "Steering Council" to lead the
project. As of 2021, the current members of this council are Barry Warsaw,
Brett Cannon, Carol Willing, Thomas Wouters, and Pablo Galindo Salgado.
Python 2.0 was released on 16 October 2000, with many major new
features, including a cycle-detecting garbage collector and support for Unicode.
Python 2.7's end-of-life date was initially set at 2015 then postponed to
2020 out of concern that a large body of existing code could not easily be
forward-ported to Python 3. No more security patches or other improvements
will be released for it. With Python 2's end-of-life, only Python 3.6.x and later
are supported.
Python 3.9.2 and 3.8.8 were expedited as all versions of Python (including
2.7) had security issues, leading to possible remote code execution and web
cache poisoning.
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Readability counts.
Rather than having all of its functionality built into its core, Python was
designed to be highly extensible (with modules). This compact modularity has
made it particularly popular as a means of adding programmable interfaces to
existing applications. Van Rossum's vision of a small core language with a large
standard library and easily extensible interpreter stemmed from his frustrations
with ABC, which espoused the opposite approach.
Python's developers aim to keep the language fun to use. This is reflected
in its name, a tribute to the British comedy group Monty Python, and in
occasionally playful approaches to tutorials and reference materials, such as
examples that refer to spam and eggs (a reference to a Monty Python sketch)
instead of the standard foo and bar.
Indentation:
Main article: Python syntax and semantics & Indentation
Python uses whitespace indentation, rather than curly brackets or keywords, to
delimit blocks. An increase in indentation comes after certain statements; a
decrease in indentation signifies the end of the current block.
Thus, the program's visual structure accurately represents the program's
semantic structure. This feature is sometimes termed the off-side rule, which
some other languages share, but in most languages indentation does not have
any semantic meaning. The recommended indent size is four spaces.
The with statement, which encloses a code block within a context manager, allowing
resource-acquisition-is-initialization (RAII)-like behavior; it replaces a common
try/finally idiom.
The break statement, exits from a loop.
The continue statement, skips this iteration and continues with the next item.
The del statement, removes a variable, which means the reference from the
name to the value is deleted and trying to use that variable will cause an
error. A deleted variable can be reassigned.
The pass statement, which serves as a NOP. It is syntactically needed to
create an empty code block.
The assert statement, used during debugging to check for conditions that
should apply.
The yield statement, which returns a value from a generator function and
yield is also an operator. This form is used to implement co-routines.
The return statement, used to return a value from a function.
The import statement, which is used to import modules whose functions or
variables can be used in the current program.
Support for co-routine-like functionality is provided by extending
Python's generators. Before 2.5, generators were lazy iterators; information was
passed uni-directionally out of the generator. From Python 2.5, it is possible to
pass information back into a generator function, and from Python 3.3, the
information can be passed through multiple stack levels.
Expressions:
Addition, subtraction, and multiplication are the same, but the behavior of
division differs. There are two types of divisions in Python. They are floor
division (or integer division) // and floating-point / division. Python also uses
the ** operator for exponentiation.
From Python 3.5, the new @ infix operator was introduced. It is intended to
be used by libraries such as NumPy for matrix multiplication.
From Python 3.8, the syntax :=, called the 'walrus operator' was introduced.
It assigns values to variables as part of a larger expression.
In Python, == compares by value, versus Java, which compares numerics by
value and objects by reference. (Value comparisons in Java on objects can
be performed with the equals() method.) Python's is operator may be used to
compare object identities (comparison by reference). In Python, comparisons
may be chained, for example A<=B<=C.
Python uses the words and, or, not for its boolean operators rather than
the symbolic &&, ||, ! used in Java and C.
Python has a type of expression termed a list comprehension as well as a
more general expression termed a generator expression.
Anonymous functions are implemented using lambda expressions; however,
these are limited in that the body can only be one expression.
Conditional expressions in Python are written as x if c else y (different in
order of operands from the c ? x : y operator common to many other
languages).
Python makes a distinction between lists and tuples. Lists are written as [1,
2, 3], are mutable, and cannot be used as the keys of dictionaries (dictionary
keys must be immutable in Python). Tuples are written as (1, 2, 3), are
immutable and thus can be used as the keys of dictionaries, provided all
elements of the tuple are immutable. The + operator can be used to
concatenate two tuples, which does not directly modify their contents, but
rather produces a new tuple containing the elements of both provided tuples.
Thus, given the variable t initially equal to (1, 2, 3), executing t = t + (4,
5) first evaluates t + (4, 5), which yields (1, 2, 3, 4, 5), which is then
assigned back to t, thereby effectively "modifying the contents" of t, while
conforming to the immutable nature of tuple objects. Parentheses are
optional for tuples in unambiguous contexts.
Python features sequence unpacking wherein multiple expressions, each
evaluating to anything that can be assigned to (a variable, a writable
property, etc.), are associated in an identical manner to that forming tuple
literals and, as a whole, are put on the left-hand side of the equal sign in an
assignment statement. The statement expects an iterable object on the right-
hand side of the equal sign that produces the same number of values as the
provided writable expressions when iterated through and will iterate through
it, assigning each of the produced values to the corresponding expression on
the left.
Python has a "string format" operator %. This functions analogously to
printf format strings in C, e.g. "spam=%s eggs=%d" % ("blah", 2) evaluates
to "spam=blah eggs=2". In Python 3 and 2.6+, this was supplemented by the
format() method of the str class, e.g. "spam={0}
eggs={1}".format("blah", 2). Python 3.6 added "f-strings": blah = "blah";
eggs = 2; f'spam={blah} eggs={eggs}'.
Strings in Python can be concatenated by "adding" them (using the same operator
as for adding integers and floats), e.g. "spam" + "eggs" returns "spameggs".
Even if your strings contain numbers, they are still added as strings rather
than integers, e.g. "2" + "2" returns "22".
Python has various kinds of string literals:
o Strings delimited by single or double quote marks. Unlike in Unix
shells, Perl and Perl-influenced languages, single quote marks and double
quote marks function identically. Both kinds of string use the backslash
(\) as an escape character. String interpolation became available in
Python 3.6 as "formatted string literals".
o Triple-quoted strings, which begin and end with a series of three single
or double quote marks. They may span multiple lines and function
like here documents in shells, Perl and Ruby.
o Raw string varieties, denoted by prefixing the string literal with an r .
Escape sequences are not interpreted; hence raw strings are useful where
literal backslashes are common, such as regular
expressions and Windows-style paths. Compare "@-quoting" in C#.
Python has array index and array slicing expressions on lists, denoted as
a[Key], a[start:stop] or a[start:stop:step]. Indexes are zero-based, and
negative indexes are relative to the end. Slices take elements from
the start index up to, but not including, the stop index. The third slice
parameter, called step or stride, allows elements to be skipped and reversed.
Slice indexes may be omitted, for example a[:] returns a copy of the entire
list. Each element of a slice is a shallow copy.
In Python, a distinction between expressions and statements is rigidly
enforced, in contrast to languages such as Common Lisp, Scheme, or Ruby.
This leads to duplicating some functionality. For example:
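The example that normally follows this sentence does not survive in the extracted text; a minimal illustration of the duplication (a list comprehension versus a for statement, and a conditional expression versus an if statement) might look like this:

# Expression form versus statement form of the same computations.
squares_expr = [n * n for n in range(5)]     # list comprehension (expression)

squares_stmt = []
for n in range(5):                           # equivalent for statement
    squares_stmt.append(n * n)

x = 7
parity_expr = "odd" if x % 2 else "even"     # conditional expression

if x % 2:                                    # equivalent if statement
    parity_stmt = "odd"
else:
    parity_stmt = "even"

assert squares_expr == squares_stmt and parity_expr == parity_stmt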
Methods:
Methods on objects are functions attached to the object's class; the syntax
instance.method(argument) is, for normal methods and functions, syntactic
sugar for Class.method(instance, argument). Python methods have an explicit
self parameter to access instance data, in contrast to the implicit self (or this) in
some other object-oriented programming languages (e.g., C++, Java, Objective-
C, or Ruby). Apart from this, Python also provides methods, sometimes called
dunder methods due to their names beginning and ending with double
underscores, to extend the functionality of a custom class to support native
functions such as print, length, comparison, support for arithmetic operations,
type conversion, and many more.
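A brief sketch of the explicit self parameter and a couple of double-underscore methods; the class and its names are invented here for illustration:

class PacketLog:
    def __init__(self, packets):          # `self` is passed explicitly in Python
        self.packets = list(packets)

    def add(self, packet):                # normal method: PacketLog.add(log, pkt)
        self.packets.append(packet)

    def __len__(self):                    # lets the built-in len() work on the object
        return len(self.packets)

    def __str__(self):                    # lets print() produce a readable form
        return f"PacketLog with {len(self)} packets"

log = PacketLog(["syn", "ack"])
log.add("fin")
print(len(log), log)                      # -> 3 PacketLog with 3 packets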
Typing :
Python uses duck typing and has typed objects but untyped variable
names. Type constraints are not checked at compile time; rather, operations on
an object may fail, signifying that the given object is not of a suitable type.
Despite being dynamically-typed, Python is strongly-typed, forbidding
operations that are not well-defined (for example, adding a number to a string)
rather than silently attempting to make sense of them.
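Both behaviours (duck typing and strong typing) can be seen in a few lines:

# Duck typing: any object with __len__ works with len(), whatever its class.
for obj in ("attack", [1, 2, 3], {"a": 1}):
    print(type(obj).__name__, len(obj))

# Strong typing: ill-defined operations raise an error instead of being guessed at.
try:
    result = "2" + 2                      # str + int is not defined
except TypeError as err:
    print("TypeError:", err)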
Before version 3.0, Python had two kinds of classes: old-style and new-
style. The syntax of both styles is the same, the difference being whether the
class object is inherited from, directly or indirectly (all new-style classes inherit
from object and are instances of type). In versions of Python 2 from Python 2.2
onwards, both kinds of classes can be used. Old-style classes were eliminated in
Python 3.0.
The long-term plan is to support gradual typing and from Python 3.5, the
syntax of the language allows specifying static types but they are not checked in
the default implementation, CPython. An experimental optional static type
checker named mypy supports compile-time type checking.
CHAPTER 8
System Architecture
(System architecture diagram: the source data is split into a training dataset and a testing dataset.)
Workflow Diagram
Use Case Diagram:
Use case diagrams are considered for high-level requirement analysis of a system.
When the requirements of a system are analyzed, the functionalities are captured
in use cases. So it can be said that use cases are nothing but the system
functionalities written in an organized manner.
Class Diagram:
A class diagram is basically a graphical representation of the static view of the
system and represents different aspects of the application, so a collection of class
diagrams represents the whole system. The name of the class diagram should be
meaningful and describe the aspect of the system. Each element and their
relationships should be identified in advance. The responsibility (attributes and
methods) of each class should be clearly identified, and for each class a minimum
number of properties should be specified, because unnecessary properties will
make the diagram complicated. Use notes whenever required to describe some
aspect of the diagram, and at the end of the drawing it should be understandable
to the developer/coder. Finally, before making the final version, the diagram
should be drawn on plain paper and reworked as many times as possible to make
it correct.
Activity Diagram:
Activity is a particular operation of the system. Activity diagrams are not
only used for visualizing dynamic nature of a system but they are also used to
construct the executable system by using forward and reverse engineering
techniques. The only missing thing in activity diagram is the message part. It
does not show any message flow from one activity to another. An activity diagram
is sometimes considered to be a flowchart; although the diagram looks like a
flowchart, it is not. It shows different kinds of flow, such as parallel, branched,
concurrent and single.
Sequence Diagram:
Sequence diagrams model the flow of logic within your system in a
visual manner, enabling you both to document and validate your logic, and are
commonly used for both analysis and design purposes. Sequence diagrams are
the most popular UML artifact for dynamic modeling, which focuses on
identifying the behavior within your system. Other dynamic modeling
techniques include activity diagramming, communication diagramming, timing
diagramming, and interaction overview diagramming. Sequence diagrams,
along with class diagrams and physical data models are in my opinion the most
important design-level models for modern business application development.
ER Diagram:
An entity relationship diagram (ERD), also known as an entity
relationship model, is a graphical representation of an information system that
depicts the relationships among people, objects, places, concepts or events
within that system. An ERD is a data modeling technique that can help define
business processes and be used as the foundation for a relational database.
Entity relationship diagrams provide a visual starting point for database design
that can also be used to help determine information system requirements
throughout an organization. After a relational database is rolled out, an ERD can
still serve as a referral point, should any debugging or business process re-
engineering be needed later.
CHAPTER 9
Module Description:
Data validation process by each attack (Module-01)
Performance measurements of DoS attacks (Module-02)
Performance measurements of R2L attacks (Module-03)
Performance measurements of U2R attacks (Module-04)
Performance measurements of Probe attacks (Module-05)
Performance measurements of overall network attacks (Module-06)
GUI based prediction results of Network attacks (Module-07)
Module-01:
Validation techniques in machine learning are used to get the error rate of the
machine learning (ML) model, which can be considered close to the true error
rate of the dataset. If the data volume is large enough to be representative of the
population, you may not need validation techniques. However, in real-world
scenarios, we work with samples of data that may not be a true representative of
the population of the given dataset. In this module we find the missing values and
duplicate values and describe the data types, i.e. whether a variable is a float or an
integer. A sample of the data is used to provide an unbiased evaluation of a model
fit on the training dataset while tuning the model hyperparameters. The evaluation
becomes more biased as skill on the validation dataset is incorporated into the
model configuration. The validation set is used to evaluate a given model
frequently, and machine learning engineers use this data to fine-tune the model
hyperparameters. Data collection, data analysis, and the process of addressing
data content, quality, and structure can add up to a time-consuming to-do list.
During the process of data identification, it helps to understand your data and its
properties; this knowledge will help you choose which algorithm to use to build
your model. For example, time series data can be analyzed by regression
algorithms, while classification algorithms can be used to analyze discrete data.
(For example, the data type format of the given dataset can be displayed, as in the
sketch below.)
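A hedged sketch of the validation checks described above, run on a small stand-in DataFrame rather than the project's dataset:

# Hedged sketch of basic data-validation checks on a made-up DataFrame.
import pandas as pd

df = pd.DataFrame({
    "duration":      [0, 12, 0, None, 5],
    "protocol_type": ["tcp", "udp", "tcp", "tcp", "icmp"],
    "class":         ["normal.", "smurf.", "normal.", "neptune.", "normal."],
})

print(df.dtypes)                    # data type of every column (float/int/object)
print(df.isnull().sum())            # count of missing values per column
print(df.duplicated().sum())        # count of exact duplicate rows
print(df.describe(include="all"))   # quick summary of each column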
Exploratory data analysis and visualization:
Sometimes data does not make sense until you can look at it in a visual form,
such as with charts and plots. Being able to quickly visualize data samples is an
important skill both in applied statistics and in applied machine learning. Here we
cover the main types of plots needed when visualizing data in Python and how to
use them to better understand your own data (a small sketch follows the list
below).
How to chart time series data with line plots and categorical quantities
with bar charts.
How to summarize data distributions with histograms and box plots.
How to summarize the relationship between variables with scatter plots.
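A minimal sketch of these plot types, using made-up values rather than the project's data:

# Hedged sketch of the basic plot types listed above.
import matplotlib.pyplot as plt

durations = [0, 2, 1, 5, 3, 2, 8, 1, 0, 4, 2, 6]
src_bytes = [181, 239, 235, 219, 217, 212, 159, 210, 180, 199, 205, 190]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(durations, bins=5)               # distribution of one variable
axes[0].set_title("Histogram")
axes[1].boxplot(durations)                    # spread and outliers
axes[1].set_title("Box plot")
axes[2].scatter(durations, src_bytes)         # relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()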
Even before predictive models are prepared on training data, outliers can
result in misleading representations and in turn misleading interpretations of
collected data. Outliers can skew the summary distribution of attribute values in
descriptive statistics like mean and standard deviation and in plots such as
histograms and scatterplots, compressing the body of the data. Finally, outliers
can represent examples of data instances that are relevant to the problem such as
anomalies in the case of fraud detection and computer security.
Just fitting the model on the training data is not enough; we cannot say that the
model will work accurately on real data. For this, we must ensure that our model
captures the correct patterns from the data and does not pick up too much noise.
Cross-validation is a technique in which we train our model using a subset of the
data set and then evaluate it using the complementary subset of the data set.
Advantages of cross-validation:
1. This runs K times faster than leave-one-out cross-validation, because K-fold
cross-validation repeats the train/test split K times.
2. It is simpler to examine the detailed results of the testing process.
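A minimal K-fold cross-validation sketch on synthetic data (an illustration, not the project's code):

# Hedged K-fold cross-validation sketch.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy  :", scores.mean().round(3))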
Data Pre-processing:
MODULE DIAGRAM
input: data
MODULE DIAGRAM
input: data
Module-02:
An application layer attack (sometimes referred to as a layer 7 DDoS attack) is a form of DDoS attack
where attackers target application-layer processes. The attack over-exercises
specific functions or features of a website with the intention to disable those
functions or features. This application-layer attack is different from an entire
network attack, and is often used against financial institutions to distract IT and
security personnel from security breaches.
MODULE DIAGRAM
GIVEN INPUT EXPECTED OUTPUT
input: data
Module-03:
MODULE DIAGRAM
input: data
Module-04:
computer resource. Detection of these attacks, and prevention of computers from
them, is a major research topic for researchers throughout the world.
MODULE DIAGRAM:
input : data
Module-05:
MODULE DIAGRAM:
GIVEN INPUT EXPECTED OUTPUT
input : data
Module-06:
Features that have been used successfully to detect phishing attacks include URLs that contain IP addresses,
the age of a linked-to domain, and a mismatch between anchor and text of a
link.
MODULE DIAGRAM:
input : data
Module-07:
GUI means Graphical User Interface. It is the common user Interface that
includes Graphical representation like buttons and icons, and communication
can be performed by interacting with these icons rather than the usual text-based
or command-based communication. A common example of a GUI is the Microsoft
Windows family of operating systems.
The term GUI tends not to be applied to other lower-display-resolution types of
interfaces, such as video games (where a head-up display (HUD) is preferred), or
to interfaces not including flat screens, like volumetric displays, because the term
is restricted to the scope of two-dimensional display screens able to describe
generic information, in the tradition of the computer science research at the
Xerox Palo Alto Research Center.
MODULE DIAGRAM:
GIVEN INPUT EXPECTED OUTPUT
ALGORITHMS EXPLANATION:
Logistic Regression:
It is a statistical method for analysing a data set in which there are one or
more independent variables that determine an outcome. The outcome is
measured with a dichotomous variable (in which there are only two possible
outcomes). The goal of logistic regression is to find the best fitting model to
describe the relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of independent
(predictor or explanatory) variables. Logistic regression is a Machine Learning
classification algorithm that is used to predict the probability of a categorical
dependent variable. In logistic regression, the dependent variable is a binary
variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.
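A minimal sketch of logistic regression with scikit-learn on a synthetic binary problem (illustration only, not the project's dataset):

# Hedged sketch: logistic regression on a synthetic binary problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]      # probability of the positive class
print("First 3 probabilities:", proba[:3].round(2))
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))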
Decision Tree:
Decision tree builds classification or regression models in the form of a
tree structure. It breaks down a data set into smaller and smaller subsets while at
the same time an associated decision tree is incrementally developed. A decision
node has two or more branches and a leaf node represents a classification or
decision. The topmost decision node in a tree, which corresponds to the best
predictor, is called the root node. Decision trees can handle both categorical and
numerical data. Decision tree builds classification or regression models in the
form of a tree structure. It utilizes an if-then rule set which is mutually exclusive
and exhaustive for classification. The rules are learned sequentially using the
training data one at a time. Each time a rule is learned, the tuples covered by the
rules are removed.
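A minimal decision-tree sketch with scikit-learn on synthetic data; export_text prints the learned if-then rules mentioned above:

# Hedged sketch: decision-tree classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))                     # the learned if-then rules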
Random Forest:
A random forest is an ensemble method that applies the same base algorithm
multiple times to form a more powerful prediction model. The random forest
algorithm combines multiple algorithms of the same type, i.e. multiple decision
trees, resulting in a forest of trees, hence the name "Random Forest". The random
forest algorithm can be used for both regression and classification tasks.
The following are the basic steps involved in performing the random forest
algorithm. In practice, random forests continue to be a go-to method for obtaining
a high-performing model with little tuning.
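The list of steps referred to above does not survive in the extracted text; as a rough substitute, the sketch below shows the usual fit/predict flow of a random forest on synthetic data (illustration only):

# Hedged sketch: random forest (an ensemble of decision trees) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy       :", forest.score(X_test, y_test))
print("Most important input:", forest.feature_importances_.argmax())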
sklearn:
In python, sklearn is a machine learning package which include a lot
of ML algorithms.
Here, we are using some of its modules like train_test_split,
DecisionTreeClassifier or Logistic Regression and accuracy_score.
NumPy:
It is a numeric python module which provides fast maths functions for
calculations.
It is used to read data in numpy arrays and for manipulation purpose.
Pandas:
Used to read and write different files.
Data manipulation can be done easily with data frames.
Matplotlib:
Data visualization is a useful way to help identify patterns in a given dataset.
tkinter:
Standard python interface to the GUI toolkit.
Accessible to everybody and reusable in various contexts.
1. Deployment:
Tkinter:
Tkinter is Python's de-facto standard GUI (Graphical User Interface)
package. It is a thin object-oriented layer on top of Tcl/Tk. Tkinter is not the
only GUI programming toolkit for Python, but it is the most commonly used one.
For more detail, see "Graphical User Interfaces with Tk", a chapter from the
Python documentation.
Running python -m tkinter from the command line should open a window
demonstrating a simple Tk interface, letting you know that tkinter is properly
installed on your system, and also showing what version of Tcl/Tk is installed,
so you can read the Tcl/Tk documentation specific to that version.
Tkinter is not a thin wrapper, but adds a fair amount of its own logic to
make the experience more pythonic. This documentation will concentrate on
these additions and changes, and refer to the official Tcl/Tk documentation for
details that are unchanged.
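A minimal sketch of the kind of Tkinter window used for such a prediction GUI; the widgets and the callback below are illustrative placeholders, not the project's actual interface:

# Hedged sketch of a minimal Tkinter window for a prediction GUI.
import tkinter as tk

def on_predict():
    # Placeholder: a real GUI would run the trained model here.
    result_label.config(text="Prediction would be shown here")

root = tk.Tk()
root.title("Network Attack Prediction")
tk.Label(root, text="Enter connection features:").pack(padx=10, pady=5)
entry = tk.Entry(root, width=40)
entry.pack(padx=10, pady=5)
tk.Button(root, text="Predict", command=on_predict).pack(pady=5)
result_label = tk.Label(root, text="")
result_label.pack(padx=10, pady=10)
root.mainloop()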
SOURCE CODE:
MODULE – 1
# NOTE: the lines below assume pandas has been imported as `p` and the KDD Cup 99
# records loaded into `data` with these column names, for example:
#   import pandas as p
#   data = p.read_csv("<path-to-KDD-dataset>", names=features)
# (the exact file path is not shown in the report)

# feature names
features = ["duration", "protocol_type", "service", "flag", "src_bytes",
    "dst_bytes", "land", "Wrong_fragment", "Urgent", "hot", "num_failed_login",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
    "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_ srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host _rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "class"]
data = p.read_csv("data6.csv", names=features)
data.head(10)
df=data.dropna()
df.head(10)
#show columns
df.columns
# (output: Index listing of all 42 column names, ending with 'class', dtype='object')
#To describe the dataframe
df.describe()
d =p.crosstab(df['protocol_type'], df['class'])
d.plot(kind='bar', stacked=True, color=['red','green'], grid=False,
figsize=(18,8))
import matplotlib.pyplot as plt
pr =df["protocol_type"]
fl=df["flag"]
plt.plot(fl, pr, color='g')
plt.xlabel('Flag Types')
plt.ylabel('Protocol Types')
plt.title('Flag Details by protocol type')
plt.show()
df["class"].unique()
df['land'].value_counts()
df['service'].value_counts()
df['protocol_type'].value_counts()
import numpy as n
def PropByVar(df, variable):
    dataframe_pie = df[variable].value_counts()
    ax = dataframe_pie.plot.pie(figsize=(10,10), autopct='%1.2f%%', fontsize=12);
    ax.set_title(variable + ' (%) (Per Count)\n', fontsize=15);
    return n.round(dataframe_pie/df.shape[0]*100, 2)
PropByVar(df, 'protocol_type')
df['DOSland'] =df.land.map({0:'attack',1:'noattack',2:'normal'})
df['DOSlandclass'] =df.DOSland.map({'attack':1,'noattack':0,'normal':0})
df['DOSlandclass'].value_counts()
df.head()
df['R2L'] =df['class'].map({'normal.':0, 'snmpgetattack.':1, 'named.':1, 'xlock.':1, 'smurf.':0,
'ipsweep.':0, 'multihop.':1, 'xsnoop.':1, 'sendmail.':1, 'guess_passwd.':1,
'saint.':0, 'buffer_overflow.':0, 'portsweep.':0, 'pod.':0, 'apache2.':0,
'phf.':1, 'udpstorm.':0, 'warezmaster.':1, 'perl.':0, 'satan.':0, 'xterm.':0,
'mscan.':0, 'processtable.':0, 'ps.':0, 'nmap.':0, 'rootkit.':0, 'neptune.':0,
'loadmodule.':0, 'imap.':1, 'back.':0, 'httptunnel.':1, 'worm.':1,
'mailbomb.':0, 'ftp_write.':1, 'teardrop.':0, 'land.':0, 'sqlattack.':0,
'snmpguess.':1})
df['attack'] =df['class'].map({'normal.':0, 'snmpgetattack.':1, 'named.':1, 'xlock.':1, 'smurf.':1,
'saint.':1, 'buffer_overflow.':1, 'portsweep.':1, 'pod.':1, 'apache2.':1,
'phf.':1, 'udpstorm.':1, 'warezmaster.':1, 'perl.':1, 'satan.':1, 'xterm.':1,
'mscan.':1, 'processtable.':1, 'ps.':1, 'nmap.':1, 'rootkit.':1, 'neptune.':1,
'loadmodule.':1, 'imap.':1, 'back.':1, 'httptunnel.':1, 'worm.':1,
'mailbomb.':1, 'ftp_write.':1, 'teardrop.':1, 'land.':1, 'sqlattack.':1,
'snmpguess.':1})
df.head()
df.corr()
Before Pre-Processing:
df.head()
After Pre-Processing:
df.columns
from sklearn.preprocessing import LabelEncoder
var_mod= ['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
'dst_bytes', 'land', 'Wrong_fragment', 'Urgent', 'hot',
'num_failed_login', 'logged_in', 'num_compromised', 'root_shell',
'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
'num_access_files', 'num_outbound_cmds', 'is_host_login',
'is_guest_login', 'count', 'srv_count', 'serror_rate',
'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
'dst_host_srv_count', 'dst_host_same_srv_rate',
'dst_host_diff_ srv_rate', 'dst_host_same_src_port_rate',
'dst_host_srv_diff_host _rate', 'dst_host_serror_rate',
'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
'dst_host_srv_rerror_rate', ]
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(str)
df.head()
MODULE-2
import pandas as p
import warnings
warnings.filterwarnings('ignore')
# feature names
features = ["duration", "protocol_type", "service", "flag", "src_bytes",
"dst_bytes", "land", "Wrong_fragment", "Urgent", "hot", "num_failed_login",
"logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
"num_file_creations", "num_shells", "num_access_files",
"num_outbound_cmds", "is_host_login", "is_guest_login", "count",
"srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
"same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
"dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_ srv_rate",
"dst_host_same_src_port_rate", "dst_host_srv_diff_host _rate",
"dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate","class"]
data =p.read_csv("data6.csv", names = features)
df=data.dropna()
df['DOSland'] =df.land.map({0:'attack',1:'noattack',2:'normal'})
df['DOSlandclass'] =df.DOSland.map({'attack':1,'noattack':0,'normal':0})
df['DOS'] =df['class'].map({'normal.':0, 'snmpgetattack.':0, 'named.':0, 'xlock.':0,
'smurf.':1,
'ipsweep.':0, 'multihop.':0, 'xsnoop.':0, 'sendmail.':0, 'guess_passwd.':0,
'saint.':0, 'buffer_overflow.':0, 'portsweep.':0, 'pod.':1, 'apache2.':1,
'phf.':0, 'udpstorm.':1, 'warezmaster.':0, 'perl.':0, 'satan.':0, 'xterm.':0,
'mscan.':0, 'processtable.':1, 'ps.':0, 'nmap.':0, 'rootkit.':0, 'neptune.':1,
'loadmodule.':0, 'imap.':0, 'back.':1, 'httptunnel.':0, 'worm.':0,
'mailbomb.':1, 'ftp_write.':0, 'teardrop.':1, 'land.':1, 'sqlattack.':0,
'snmpguess.':0})
from sklearn.preprocessing import LabelEncoder
var_mod= ['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
'dst_bytes', 'land', 'Wrong_fragment', 'Urgent', 'hot',
'num_failed_login', 'logged_in', 'num_compromised', 'root_shell',
'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
'num_access_files', 'num_outbound_cmds', 'is_host_login',
'is_guest_login', 'count', 'srv_count', 'serror_rate',
'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
'dst_host_srv_count', 'dst_host_same_srv_rate',
'dst_host_diff_ srv_rate', 'dst_host_same_src_port_rate',
'dst_host_srv_diff_host _rate', 'dst_host_serror_rate',
'dst_host_srv_serror_rate', 'dst_host_rerror_rate'
]
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
del df['DOSland']
del df['dst_host_srv_rerror_rate']
del df['DOSlandclass']
del df['class']
# Now evaluate the candidate models on a held-out test set.
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score
X = df.drop(labels='DOS', axis=1)
# Response variable
y = df.loc[:,'DOS']
# We'll use a test size of 30%. We also stratify the split on the response variable,
# which is very important to do because the attack classes are imbalanced.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
Logistic Regression:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logR=LogisticRegression()
logR.fit(X_train,y_train)
predictR=logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Logistic Regression is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
accuracy = cross_val_score(logR, X, y, scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Logistic Regression is:", accuracy.mean() * 100)
lr = accuracy.mean() * 100
Decision Tree:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)
predictR=dt.predict(X_test)
print("")
print('Classification report of Decision Tree Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Decision Tree is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
accuracy = cross_val_score(dt, X, y, scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Decision Tree is:", accuracy.mean() * 100)
dt=accuracy.mean() * 100
Random Forest:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train,y_train)
predictR=rfc.predict(X_test)
print("")
print('Classification report of Random Forest Results:')
print("")
print(classification_report(y_test,predictR))
print("")
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Random Forest is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
accuracy = cross_val_score(rfc, X, y, scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Random Forest is:", accuracy.mean() * 100)
rf=accuracy.mean() * 100
from sklearn.svm import SVC
sv= SVC()
sv.fit(X_train, y_train)
predictSVC=sv.predict(X_test)
print("")
print('Classification report of Support Vector Classifier Results:')
print("")
print(classification_report(y_test,predictSVC))
print("")
cm4=confusion_matrix(y_test,predictSVC)
print('Confusion Matrix result of Support Vector Classifier is:\n',
confusion_matrix(y_test,predictSVC))
print("")
sensitivity1 = cm4[0,0]/(cm4[0,0]+cm4[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm4[1,1]/(cm4[1,0]+cm4[1,1])
print('Specificity : ', specificity1)
print("")
accuracy = cross_val_score(sv, X, y, scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Support Vector Classifier is:", accuracy.mean() * 100)
sv=accuracy.mean() * 100
def graph():
    import matplotlib.pyplot as plt
    data = [lr, dt, rf, sv]
    alg = "LR", "DT", "RF", "SVM"
    plt.figure(figsize=(10,5))
    b = plt.bar(alg, data, color=("r","g","b","y"))
    plt.title("Accuracy comparison of DoS Attacks", fontsize=15)
    plt.legend(b, data, fontsize=9)
    plt.savefig('DOS.png')
graph()
import tkinter
from matplotlib.backends.backend_tkagg import (FigureCanvasTkAgg, NavigationToolbar2Tk)
from matplotlib.backend_bases import key_press_handler
from matplotlib.figure import Figure
import numpy as np
root =tkinter.Tk()
root.wm_title("Accuracy plot for DoS Attacks")
fig = Figure(figsize=(10,10),dpi=1)
canvas =FigureCanvasTkAgg(fig, master=root)
canvas.draw()
canvas.get_tk_widget().pack(side=tkinter.TOP, fill=tkinter.BOTH, expand=1)
icon=tkinter.PhotoImage(file='DOS.png')
label=tkinter.Label(root,image=icon)
label.pack()
root.mainloop()
MODULE-3
import pandas as p
# feature names
features = ["duration", "protocol_type", "service", "flag", "src_bytes",
"dst_bytes", "land", "Wrong_fragment", "Urgent", "hot", "num_failed_login",
"logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
"num_file_creations", "num_shells", "num_access_files",
"num_outbound_cmds", "is_host_login", "is_guest_login", "count",
"srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
"same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
"dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_ srv_rate",
"dst_host_same_src_port_rate", "dst_host_srv_diff_host _rate",
"dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate","class"]
data =p.read_csv("data6.csv", names = features)
df=data.dropna()
df['R2L'] = df['class'].map({'normal.':0, 'snmpgetattack.':1, 'named.':1, 'xlock.':1, 'smurf.':0,
'ipsweep.':0, 'multihop.':1, 'xsnoop.':1, 'sendmail.':1, 'guess_passwd.':1,
'saint.':0, 'buffer_overflow.':0, 'portsweep.':0, 'pod.':0, 'apache2.':0,
'phf.':1, 'udpstorm.':0, 'warezmaster.':1, 'perl.':0, 'satan.':0, 'xterm.':0,
'mscan.':0, 'processtable.':0, 'ps.':0, 'nmap.':0, 'rootkit.':0, 'neptune.':0,
'loadmodule.':0, 'imap.':1, 'back.':0, 'httptunnel.':1, 'worm.':1,
'mailbomb.':0, 'ftp_write.':1, 'teardrop.':0, 'land.':0, 'sqlattack.':0,
'snmpguess.':1})
from sklearn.preprocessing import LabelEncoder
var_mod= ['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
'dst_bytes', 'land', 'Wrong_fragment', 'Urgent', 'hot',
'num_failed_login', 'logged_in', 'num_compromised', 'root_shell',
'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
'num_access_files', 'num_outbound_cmds', 'is_host_login',
'is_guest_login', 'count', 'srv_count', 'serror_rate',
'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
'dst_host_srv_count', 'dst_host_same_srv_rate',
'dst_host_diff_ srv_rate', 'dst_host_same_src_port_rate',
'dst_host_srv_diff_host _rate', 'dst_host_serror_rate',
'dst_host_srv_serror_rate', 'dst_host_rerror_rate'
]
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
del df['dst_host_srv_rerror_rate']
del df["class"]
# Now evaluate the candidate models on a held-out test set.
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score
X =df.drop(labels='R2L', axis=1)
#Response variable
y =df.loc[:,'R2L']
# We'll use a test size of 30%. We also stratify the split on the response variable,
# which is very important to do because the attack classes are imbalanced.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
import warnings
warnings.filterwarnings('ignore')
Logistic Regression:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logR=LogisticRegression()
logR.fit(X_train,y_train)
predictR=logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Logistic Regression is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
Decision Tree:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)
predictR=dt.predict(X_test)
print("")
print('Classification report of Decision Tree Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Decision Tree is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
Random Forest:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train,y_train)
predictR=rfc.predict(X_test)
print("")
print('Classification report of Random Forest Results:')
print("")
print(classification_report(y_test,predictR))
print("")
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Random Forest is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
from sklearn.svm import SVC
sv= SVC()
sv.fit(X_train, y_train)
predictSVC=sv.predict(X_test)
print("")
print('Classification report of Support Vector Classifier Results:')
print("")
print(classification_report(y_test,predictSVC))
print("")
cm4=confusion_matrix(y_test,predictSVC)
print('Confusion Matrix result of Support Vector Classifier is:\n',
confusion_matrix(y_test,predictSVC))
print("")
sensitivity1 = cm4[0,0]/(cm4[0,0]+cm4[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm4[1,1]/(cm4[1,0]+cm4[1,1])
print('Specificity : ', specificity1)
def graph():
    import matplotlib.pyplot as plt
    data = [lr, dt, rf, sv]
    alg = "LR", "DT", "RF", "SVM"
    plt.figure(figsize=(10,5))
    b = plt.bar(alg, data, color=("r","g","b","y"))
    plt.title("Accuracy comparison of R2L Attacks", fontsize=15)
    plt.legend(b, data, fontsize=9)
    plt.savefig('R2L.png')
graph()
import tkinter
from matplotlib.backends.backend_tkagg import (FigureCanvasTkAgg, NavigationToolbar2Tk)
from matplotlib.backend_bases import key_press_handler
from matplotlib.figure import Figure
import numpy as np
root =tkinter.Tk()
root.wm_title("Accuracy plot for R2L Attacks")
fig = Figure(figsize=(10,10),dpi=1)
canvas =FigureCanvasTkAgg(fig, master=root)
canvas.draw()
canvas.get_tk_widget().pack(side=tkinter.TOP, fill=tkinter.BOTH, expand=1)
icon=tkinter.PhotoImage(file='R2L.png')
label=tkinter.Label(root,image=icon)
label.pack()
root.mainloop()
MODULE-4
import pandas as p
# feature names
features = ["duration", "protocol_type", "service", "flag", "src_bytes",
"dst_bytes", "land", "Wrong_fragment", "Urgent", "hot", "num_failed_login",
"logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
"num_file_creations", "num_shells", "num_access_files",
"num_outbound_cmds", "is_host_login", "is_guest_login", "count",
"srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
"same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
"dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_ srv_rate",
"dst_host_same_src_port_rate", "dst_host_srv_diff_host _rate",
"dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate","class"]
data=p.read_csv("data6.csv",names=features)
df=data.dropna()
df['U2R'] = df['class'].map({'normal.':0,'snmpgetattack.':0,'named.':0,'xlock.':0,'smurf.':0,
'ipsweep.':0,'multihop.':0,'xsnoop.':0,'sendmail.':0,'guess_passwd.':0,
'saint.':0,'buffer_overflow.':1,'portsweep.':0,'pod.':0,'apache2.':0,
'phf.':0,'udpstorm.':0,'warezmaster.':0,'perl.':1,'satan.':0,'xterm.':1,
'mscan.':0,'processtable.':0,'ps.':1,'nmap.':0,'rootkit.':1,'neptune.':0,
'loadmodule.':1,'imap.':0,'back.':0,'httptunnel.':0,'worm.':0,
'mailbomb.':0,'ftp_write.':0,'teardrop.':0,'land.':0,'sqlattack.':1,
'snmpguess.':0})
from sklearn.preprocessing import LabelEncoder
var_mod=['duration','protocol_type','service','flag','src_bytes',
'dst_bytes','land','Wrong_fragment','Urgent','hot',
'num_failed_login','logged_in','num_compromised','root_shell',
'su_attempted','num_root','num_file_creations','num_shells',
'num_access_files','num_outbound_cmds','is_host_login',
'is_guest_login','count','srv_count','serror_rate',
'srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate',
'diff_srv_rate','srv_diff_host_rate','dst_host_count',
'dst_host_srv_count','dst_host_same_srv_rate',
'dst_host_diff_ srv_rate','dst_host_same_src_port_rate',
'dst_host_srv_diff_host _rate','dst_host_serror_rate',
'dst_host_srv_serror_rate','dst_host_rerror_rate'
]
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(str)
del df['dst_host_srv_rerror_rate']
del df["class"]
# Now evaluate the candidate models on a held-out test set.
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score
X=df.drop(labels='U2R',axis=1)
#Response variable
y=df.loc[:,'U2R']
# We'll use a test size of 30%. We also stratify the split on the response variable,
# which is very important to do because the attack classes are imbalanced.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
Logistic Regression:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logR=LogisticRegression()
logR.fit(X_train,y_train)
predictR=logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Logistic Regression is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(logR,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Logistic Regression is:",accuracy.mean()*100)
lr=accuracy.mean()*100
Decision Tree:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)
predictR=dt.predict(X_test)
print("")
print('Classification report of Decision Tree Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Decision Tree is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(dt,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Decision Tree is:",accuracy.mean()*100)
dt=accuracy.mean()*100
Random Forest:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train,y_train)
predictR=rfc.predict(X_test)
print("")
print('Classification report of Random Forest Results:')
print("")
print(classification_report(y_test,predictR))
print("")
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Random Forest is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(rfc,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Random Forest is:",accuracy.mean()*100)
rf=accuracy.mean()*100
print("")
cm4=confusion_matrix(y_test,predictSVC)
print('Confusion Matrix result of Support Vector Classifier
is:\n',confusion_matrix(y_test,predictSVC))
print("")
sensitivity1=cm4[0,0]/(cm4[0,0]+cm4[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm4[1,1]/(cm4[1,0]+cm4[1,1])
print('Specificity : ',specificity1)
accuracy=cross_val_score(sv,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Support Vector Classifier is:",accuracy.mean()*100)
sv=accuracy.mean()*100
def graph():
    import matplotlib.pyplot as plt
    data = [lr, dt, rf, sv]
    alg = "LR", "DT", "RF", "SVM"
    plt.figure(figsize=(10,5))
    b = plt.bar(alg, data, color=("r","g","b","y"))
    plt.title("Accuracy comparison of U2R Attacks", fontsize=15)
    plt.legend(b, data, fontsize=9)
    plt.savefig('U2R.png')
graph()
import tkinter
from matplotlib.backends.backend_tkagg import (FigureCanvasTkAgg, NavigationToolbar2Tk)
from matplotlib.backend_bases import key_press_handler
from matplotlib.figure import Figure
import numpy as np
root=tkinter.Tk()
root.wm_title("Accuracy plot for U2R Attacks")
fig=Figure(figsize=(10,10),dpi=1)
canvas=FigureCanvasTkAgg(fig,master=root)
canvas.draw()
canvas.get_tk_widget().pack(side=tkinter.TOP,fill=tkinter.BOTH,expand=1)
icon=tkinter.PhotoImage(file='U2R.png')
label=tkinter.Label(root,image=icon)
label.pack()
root.mainloop()
MODULE-5
from sklearn.preprocessing import LabelEncoder
var_mod=['duration','protocol_type','service','flag','src_bytes',
'dst_bytes','land','Wrong_fragment','Urgent','hot',
'num_failed_login','logged_in','num_compromised','root_shell',
'su_attempted','num_root','num_file_creations','num_shells',
'num_access_files','num_outbound_cmds','is_host_login',
'is_guest_login','count','srv_count','serror_rate',
'srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate',
'diff_srv_rate','srv_diff_host_rate','dst_host_count',
'dst_host_srv_count','dst_host_same_srv_rate',
'dst_host_diff_ srv_rate','dst_host_same_src_port_rate',
'dst_host_srv_diff_host _rate','dst_host_serror_rate',
'dst_host_srv_serror_rate','dst_host_rerror_rate'
]
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
del df['dst_host_srv_rerror_rate']
del df["class"]
# Now evaluate the candidate models on a held-out test set.
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score
X=df.drop(labels='Probe',axis=1)
#Response variable
y=df.loc[:,'Probe']
# We'll use a test size of 30%. We also stratify the split on the response variable,
# which is very important to do because the attack classes are imbalanced.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
Logistic Regression:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logR=LogisticRegression()
logR.fit(X_train,y_train)
predictR=logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Logistic Regression is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(logR,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Logistic Regression is:",accuracy.mean()*100)
lr=accuracy.mean()*100
Decision Tree:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)
predictR=dt.predict(X_test)
print("")
print('Classification report of Decision Tree Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Decision Tree is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(dt,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Decision Tree is:",accuracy.mean()*100)
dt=accuracy.mean()*100
Random Forest:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train,y_train)
predictR=rfc.predict(X_test)
print("")
print('Classification report of Random Forest Results:')
print("")
print(classification_report(y_test,predictR))
print("")
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Random Forest is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(rfc,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Random Forest is:",accuracy.mean()*100)
rf=accuracy.mean()*100
from sklearn.svm import SVC
sv = SVC()
sv.fit(X_train, y_train)
predictSVC = sv.predict(X_test)
print("")
print('Classification report of Support Vector Classifier Results:')
print("")
print(classification_report(y_test,predictSVC))
print("")
cm4=confusion_matrix(y_test,predictSVC)
print('Confusion Matrix result of Support Vector Classifier
is:\n',confusion_matrix(y_test,predictSVC))
print("")
sensitivity1=cm4[0,0]/(cm4[0,0]+cm4[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm4[1,1]/(cm4[1,0]+cm4[1,1])
print('Specificity : ',specificity1)
accuracy=cross_val_score(sv,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Support Vector Classifier is:",accuracy.mean()*100)
sv=accuracy.mean()*100
def graph():
    import matplotlib.pyplot as plt
    data = [lr, dt, rf, sv]
    alg = "LR", "DT", "RF", "SVM"
    plt.figure(figsize=(10,5))
    b = plt.bar(alg, data, color=("r","g","b","y"))
    plt.title("Accuracy comparison of Probe Attacks", fontsize=15)
    plt.legend(b, data, fontsize=9)
    plt.savefig('Probe.png')
graph()
import tkinter
from matplotlib.backends.backend_tkagg import (FigureCanvasTkAgg, NavigationToolbar2Tk)
from matplotlib.backend_bases import key_press_handler
from matplotlib.figure import Figure
import numpy as np
root=tkinter.Tk()
root.wm_title("Accuracy plot for Probe Attacks")
fig=Figure(figsize=(10,10),dpi=1)
canvas=FigureCanvasTkAgg(fig,master=root)
canvas.draw()
canvas.get_tk_widget().pack(side=tkinter.TOP,fill=tkinter.BOTH,expand=1)
icon=tkinter.PhotoImage(file='Probe.png')
label=tkinter.Label(root,image=icon)
label.pack()
root.mainloop()
MODULE-6
df['attack'] = df['class'].map({'normal.':0,'snmpgetattack.':1,'named.':1,'xlock.':1,'smurf.':1,
'ipsweep.':1,'multihop.':1,'xsnoop.':1,'sendmail.':1,'guess_passwd.':1,
'saint.':1,'buffer_overflow.':1,'portsweep.':1,'pod.':1,'apache2.':1,
'phf.':1,'udpstorm.':1,'warezmaster.':1,'perl.':1,'satan.':1,'xterm.':1,
'mscan.':1,'processtable.':1,'ps.':1,'nmap.':1,'rootkit.':1,'neptune.':1,
'loadmodule.':1,'imap.':1,'back.':1,'httptunnel.':1,'worm.':1,
'mailbomb.':1,'ftp_write.':1,'teardrop.':1,'land.':1,'sqlattack.':1,
'snmpguess.':1})
from sklearn.preprocessing import LabelEncoder
var_mod=['duration','protocol_type','service','flag','src_bytes',
'dst_bytes','land','Wrong_fragment','Urgent','hot',
'num_failed_login','logged_in','num_compromised','root_shell',
'su_attempted','num_root','num_file_creations','num_shells',
'num_access_files','num_outbound_cmds','is_host_login',
'is_guest_login','count','srv_count','serror_rate',
'srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate',
'diff_srv_rate','srv_diff_host_rate','dst_host_count',
'dst_host_srv_count','dst_host_same_srv_rate',
'dst_host_diff_ srv_rate','dst_host_same_src_port_rate',
'dst_host_srv_diff_host _rate','dst_host_serror_rate',
'dst_host_srv_serror_rate','dst_host_rerror_rate'
]
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
del df['dst_host_srv_rerror_rate']
del df["class"]
# Now evaluate the candidate models on a held-out test set.
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score
X=df.drop(labels='attack',axis=1)
#Response variable
y=df.loc[:,'attack']
# We'll use a test size of 30%. We also stratify the split on the response variable,
# which is very important to do because the attack classes are imbalanced.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
Logistic Regression:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logR=LogisticRegression()
logR.fit(X_train,y_train)
predictR=logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Logistic Regression is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(logR,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Logistic Regression is:",accuracy.mean()*100)
lr=accuracy.mean()*100
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)
predictR=dt.predict(X_test)
print("")
print('Classification report of Decision Tree Results:')
print("")
print(classification_report(y_test,predictR))
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Decision Tree is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(dt,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Decision Tree is:",accuracy.mean()*100)
dt=accuracy.mean()*100
Random Forest:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train,y_train)
predictR=rfc.predict(X_test)
print("")
print('Classification report of Random Forest Results:')
print("")
print(classification_report(y_test,predictR))
print("")
cm1=confusion_matrix(y_test,predictR)
print('Confusion Matrix result of Random Forest is:\n',cm1)
print("")
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ',specificity1)
print("")
accuracy=cross_val_score(rfc,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Random Forest is:",accuracy.mean()*100)
rf=accuracy.mean()*100
from sklearn.svm import SVC
sv=SVC()
sv.fit(X_train,y_train)
predictSVC=sv.predict(X_test)
print("")
print('Classification report of Support Vector Classifier Results:')
print("")
print(classification_report(y_test,predictSVC))
print("")
cm4=confusion_matrix(y_test,predictSVC)
print('Confusion Matrix result of Support Vector Classifier
is:\n',confusion_matrix(y_test,predictSVC))
print("")
sensitivity1=cm4[0,0]/(cm4[0,0]+cm4[0,1])
print('Sensitivity : ',sensitivity1)
print("")
specificity1=cm4[1,1]/(cm4[1,0]+cm4[1,1])
print('Specificity : ',specificity1)
accuracy=cross_val_score(sv,X,y,scoring='accuracy')
print('Cross validation test results of accuracy:')
print(accuracy)
#get the mean of each fold
print("")
print("Accuracy result of Support Vector Classifier is:",accuracy.mean()*100)
sv=accuracy.mean()*100
def graph():
    import matplotlib.pyplot as plt
    data = [lr, dt, rf, sv]
    alg = "LR", "DT", "RF", "SVM"
    plt.figure(figsize=(10,5))
    b = plt.bar(alg, data, color=("r","g","b","y"))
    plt.title("Accuracy comparison of Overall Attacks", fontsize=15)
    plt.legend(b, data, fontsize=9)
    plt.savefig('overallattack.png')
graph()
import tkinter
from matplotlib.backends.backend_tkagg import (FigureCanvasTkAgg, NavigationToolbar2Tk)
from matplotlib.backend_bases import key_press_handler
from matplotlib.figure import Figure
import numpy as np
root=tkinter.Tk()
root.wm_title("Accuracy plot for Overall Attacks")
fig=Figure(figsize=(10,10),dpi=1)
canvas=FigureCanvasTkAgg(fig,master=root)
canvas.draw()
canvas.get_tk_widget().pack(side=tkinter.TOP,fill=tkinter.BOTH,expand=1)
icon=tkinter.PhotoImage(file='overallattack.png')
label=tkinter.Label(root,image=icon)
label.pack()
root.mainloop()
MODULE-7
from tkinter import *
deldf["duration"]
deldf["land"]
deldf["Urgent"]
deldf["hot"]
deldf["num_failed_login"]
deldf["logged_in"]
deldf["num_compromised"]
deldf["root_shell"]
deldf["is_host_login"]
deldf["is_guest_login"]
deldf['num_root']
deldf['num_file_creations']
deldf['num_shells']
deldf['num_outbound_cmds']
deldf['count']
deldf['srv_count']
deldf['srv_serror_rate']
deldf['srv_rerror_rate']
deldf['same_srv_rate']
deldf['diff_srv_rate']
deldf['srv_diff_host_rate']
deldf['dst_host_count']
deldf['dst_host_srv_count']
deldf['dst_host_same_srv_rate']
deldf['dst_host_diff_ srv_rate']
deldf['dst_host_same_src_port_rate']
deldf['dst_host_srv_diff_host _rate']
deldf['dst_host_serror_rate']
deldf['dst_host_srv_serror_rate']
deldf['dst_host_rerror_rate']
deldf['dst_host_srv_rerror_rate']
deldf['su_attempted']
deldf['num_access_files']
df.columns
df['protocol_type'].unique()
df['UDP']=df.protocol_type.map({'udp':1,'tcp':0,'icmp':0})
df['TCP']=df.protocol_type.map({'udp':0,'tcp':1,'icmp':0})
df['ICMP']=df.protocol_type.map({'udp':0,'tcp':0,'icmp':1})
deldf['protocol_type']
df['service'].unique()
df['private']=df.service.map({'ecr_i':0,'http':0,'private':1})
df['http']=df.service.map({'ecr_i':0,'http':1,'private':0})
df['ecr_i']=df.service.map({'ecr_i':1,'http':0,'private':0})
df['http'].unique()
deldf['service']
df['flag'].unique()
df['SF']=df.flag.map({'SF':1,'S0':0,'REJ':0,'S1':0})
df['S1']=df.flag.map({'SF':0,'S0':0,'REJ':0,'S1':1})
df['REJ']=df.flag.map({'SF':0,'S0':0,'REJ':1,'S1':0})
df['S0'] = df.flag.map({'SF':0,'S0':1,'REJ':0,'S1':0})
df['S0'].unique()
deldf['flag']
df.columns
df['src_bytes'].unique()
df['SRC_BY_BL_50']=df.src_bytes.map({1032:0,283:0,252:0,0:1,105:0,303:0,
42:1,45:1,213:0,285:0,5050:0,
212:0,184:0,289:0,291:0,246:0,175:0,241:0,293:0,245:0,249:0,225:0,
305:0,3894:0,320:0,162:0,206:0,353:0,1:1})
df['SRC_BY_AB_50']=df.src_bytes.map({1032:0,283:0,252:0,0:0,105:1,303:0,
42:0,45:0,213:0,285:0,5050:0,
212:1,184:1,289:0,291:0,246:1,175:1,241:1,293:0,245:1,249:1,225:1,
305:0,3894:0,320:0,162:1,206:1,353:0,1:0})
df['SRC_BY_AB_250']=df.src_bytes.map({1032:0,283:1,252:1,0:0,105:1,303:0
,42:0,45:0,213:0,285:1,5050:0,
212:0,184:0,289:1,291:1,246:0,175:1,241:0,293:1,245:0,249:0,225:0,
305:1,3894:0,320:1,162:1,206:0,353:1,1:0})
df['SRC_BY_AB_450']=df.src_bytes.map({1032:0,283:1,252:1,0:0,105:0,303:0
,42:0,45:0,213:1,285:1,5050:0,
212:1,184:0,289:1,291:1,246:1,175:0,241:1,293:1,245:1,249:1,225:1,
305:0,3894:0,320:0,162:0,206:1,353:0,1:0})
df['SRC_BY_AB_650']=df.src_bytes.map({1032:0,283:0,252:0,0:0,105:0,303:0
,42:0,45:0,213:0,285:0,5050:0,
212:0,184:0,289:0,291:0,246:0,175:0,241:0,293:0,245:0,249:0,225:0,
305:0,3894:0,320:0,162:0,206:0,353:0,1:0})
df['SRC_BY_AB_850']=df.src_bytes.map({1032:0,283:0,252:0,0:0,105:0,303:0
,42:0,45:0,213:0,285:0,5050:0,
212:0,184:0,289:0,291:0,246:0,175:0,241:0,293:0,245:0,249:0,225:0,
305:0,3894:0,320:0,162:0,206:0,353:0,1:0})
df['SRC_BY_AB_1000']=df.src_bytes.map({1032:1,283:0,252:0,0:0,105:0,303:
0,42:0,45:0,213:0,285:0,5050:1,
212:0,184:0,289:0,291:0,246:0,175:0,241:0,293:0,245:0,249:0,225:0,
305:0,3894:1,320:0,162:0,206:0,353:0,1:0})
deldf['src_bytes']
df.columns
df['dst_bytes'].unique()
df['DST_BY_BL_50']=df.dst_bytes.map({0:1,903:0,1422:0,146:0,1292:0,42:1,
115:0,4996:0,145:0,
5200:0,329:0,341:0,128:0,721:0,331:0,753:0,38352:0,722:0,
1965:0,634:0,628:0,188:0,47582:0,489:0,105:0,486:0,2940:0,
209:0,1401:0,292:0,1085:0,1:1})
df['DST_BY_AB_50']=df.dst_bytes.map({0:0,903:0,1422:0,146:1,1292:0,42:0,
115:1,4996:0,145:1,
5200:0,329:0,341:0,128:1,721:0,331:0,753:0,38352:0,722:0,
1965:0,634:0,628:0,188:1,47582:0,489:0,105:1,486:0,2940:0,
209:1,1401:0,292:1,1085:0,1:0})
df['DST_BY_AB_250']=df.dst_bytes.map({0:0,903:0,1422:0,146:0,1292:0,42:0
,115:0,4996:0,145:0,
5200:0,329:1,341:1,128:0,721:0,331:1,753:0,38352:0,722:0,
1965:0,634:0,628:0,188:0,47582:0,489:0,105:0,486:0,2940:0,
209:1,1401:0,292:1,1085:0,1:0})
df['DST_BY_AB_450']=df.dst_bytes.map({0:0,903:0,1422:0,146:0,1292:0,42:0
,115:0,4996:0,145:0,
5200:0,329:0,341:0,128:0,721:0,331:0,753:0,38352:0,722:0,
1965:0,634:1,628:1,188:0,47582:0,489:1,105:0,486:1,2940:0,
209:0,1401:0,292:0,1085:0,1:0})
df['DST_BY_AB_650']=df.dst_bytes.map({0:0,903:0,1422:0,146:0,1292:0,42:0
,115:0,4996:0,145:0,
5200:0,329:0,341:0,128:0,721:1,331:0,753:1,38352:0,722:1,
1965:0,634:0,628:0,188:0,47582:0,489:0,105:0,486:0,2940:0,
209:0,1401:0,292:0,1085:0,1:0})
df['DST_BY_AB_850']=df.dst_bytes.map({0:0,903:1,1422:0,146:0,1292:0,42:0
,115:0,4996:0,145:0,
5200:0,329:0,341:0,128:0,721:0,331:0,753:0,38352:0,722:0,
1965:0,634:0,628:0,188:0,47582:0,489:0,105:0,486:0,2940:0,
209:0,1401:0,292:0,1085:0,1:0})
df['DST_BY_AB_1000']=df.dst_bytes.map({0:0,903:0,1422:1,146:0,1292:1,42:
0,115:0,4996:1,145:0,
5200:1,329:0,341:0,128:0,721:0,331:0,753:0,38352:1,722:0,
1965:1,634:0,628:0,188:0,47582:1,489:0,105:0,486:0,2940:1,
209:0,1401:0,292:0,1085:1,1:0})
df['DST_BY_AB_1000'].unique()
deldf['dst_bytes']
deldf['Wrong_fragment']
deldf['serror_rate']
deldf['rerror_rate']
df.head()
df.columns
l1 = ['SRC_BY_BL_50','SRC_BY_AB_50','SRC_BY_AB_250','SRC_BY_AB_450','SRC_BY_AB_650','SRC_BY_AB_850','SRC_BY_AB_1000']
l2 = ['DST_BY_BL_50','DST_BY_AB_50','DST_BY_AB_250','DST_BY_AB_450','DST_BY_AB_650','DST_BY_AB_850','DST_BY_AB_1000']
l3 = ['UDP','TCP','ICMP']
l4 = ['SF','S1','REJ','S0']
l5 = ['private','http','ecr_i']
l6 = ['UDP','TCP','ICMP','private','http','ecr_i','SF','S1','REJ','S0',
'SRC_BY_BL_50','SRC_BY_AB_50','SRC_BY_AB_250','SRC_BY_AB_450','SRC_BY_AB_650','SRC_BY_AB_850','SRC_BY_AB_1000',
'DST_BY_BL_50','DST_BY_AB_50','DST_BY_AB_250','DST_BY_AB_450','DST_BY_AB_650','DST_BY_AB_850','DST_BY_AB_1000']
df['class'].unique()
decision = ['smurf','perl','xlock','xsnoop','xterm','satan','neptune','nmap','back','apache2','multihop','worm',
'buffer overflow','sql attack','saint','Nmap','ipsweep']
l7 = []
for x in range(0, len(l6)):
    l7.append(0)
df['class'].unique()
df.replace({'class':{'smurf':0,'perl':1,'xlock':2,'xsnoop':3,'xterm':4,'satan':5,'neptune':6,
'nmap':7,'back':8,'apache2':9,'multihop':10,'worm':11,'buffer overflow':12,
'sql attack':13,'saint':14,'Nmap':15,'ipsweep':16}}, inplace=True)
import numpy as np
Xd = df[l6]
yd = df[["class"]]
np.ravel(yd)
import numpy as np
X_testd = df[l6]
y_testd = df[["class"]]
np.ravel(y_testd)
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
def over():
    clf = SVC()
    gnb = clf.fit(Xd, np.ravel(yd))
    # calculating accuracy ---------------------------
    from sklearn.metrics import accuracy_score
    y_predd = gnb.predict(X_testd)
    print(accuracy_score(y_testd, y_predd))
    print(accuracy_score(y_testd, y_predd, normalize=False))
    # -----------------------------------------------------
    terms = [src.get(), dst.get(), prt.get(), fl.get(), ser.get()]
    for k in range(0, len(l6)):
        for z in terms:
            if (z == l6[k]):
                l7[k] = 1
    inputtest = [l7]
    predict = gnb.predict(inputtest)
    predicted = predict[0]
    h = 'no'
    for a in range(0, len(decision)):
        if (predicted == a):
            h = 'yes'
            break
    if (h == 'yes'):
        t1.delete("1.0", END)
        t1.insert(END, decision[a])
    else:
        t1.delete("1.0", END)
        t1.insert(END, "Not Found")
root1=Tk()
root1.title("Prediction of Network Attacks")
#root1.configure(background='black')
root=Canvas(root1,width=1620,height=1800)
root.pack()
photo=PhotoImage(file='im3.png')
root.create_image(0,0,image=photo,anchor=NW)
src=StringVar()
src.set(None)
dst=StringVar()
dst.set(None)
prt=StringVar()
prt.set(None)
fl=StringVar()
fl.set(None)
ser=StringVar()
ser.set(None)
# Heading
w2 = Label(root, justify=LEFT, text="Network attack prediction", fg="red", bg="white")
w2.config(font=("Elephant",20))
w2.grid(row=1, column=0, columnspan=2, padx=100)
w2 = Label(root, justify=LEFT, text="DoS, R2L, U2R and Probe Types", fg="blue")
w2.config(font=("Aharoni",15))
w2.grid(row=2, column=0, columnspan=2, padx=100)
# labels
srcLb=Label(root,text="Source File Size(in BY):")
srcLb.grid(row=6,column=0,pady=15,sticky=W)
prtLb=Label(root,text="Protocol Type:")
prtLb.grid(row=8,column=0,pady=15,sticky=W)
flLb=Label(root,text="Flag Type:")
flLb.grid(row=9,column=0,pady=15,sticky=W)
serLb=Label(root,text="Select services:")
serLb.grid(row=10,column=0,pady=15,sticky=W)
lrdLb=Label(root,text="Attack_Type",fg="white",bg="red")
lrdLb.grid(row=13,column=0,pady=10,sticky=W)
# entries
OPTIONSsrc=sorted(l1)
OPTIONSdst=sorted(l2)
OPTIONSprt=sorted(l3)
OPTIONSfl=sorted(l4)
OPTIONSser=sorted(l5)
srcEn=OptionMenu(root,src,*OPTIONSsrc)
srcEn.grid(row=6,column=1)
dstEn=OptionMenu(root,dst,*OPTIONSdst)
dstEn.grid(row=7,column=1)
prtEn=OptionMenu(root,prt,*OPTIONSprt)
prtEn.grid(row=8,column=1)
flEn=OptionMenu(root,fl,*OPTIONSfl)
flEn.grid(row=9,column=1)
serEn=OptionMenu(root,ser,*OPTIONSser)
serEn.grid(row=10,column=1)
def clear_display_result():
    t1.delete('1.0', END)
lrd = Button(root, text="Check Result", command=over, bg="cyan", fg="green")
lrd.grid(row=13, column=3, padx=10)
b = Button(root, text="Reset", command=clear_display_result, bg="red", fg="white")
b.grid(row=5,column=3,padx=10)
t1=Text(root,height=1,width=40,bg="orange",fg="black")
t1.grid(row=13,column=1,padx=10)
root1.mainloop()
OUTPUT SCREENSHOT:
CONCLUSION
The analytical process started from data cleaning and processing, handling of
missing values and exploratory analysis, and finished with model building and
evaluation. The best accuracy on the held-out test set is found by comparing each
algorithm against every type of network attack, so that the best-performing model
can be used for future predictions on new connections. This gives several insights
into diagnosing the network-attack type of each new connection. The work presents
a prediction model, built with the aid of artificial intelligence, that aims to improve
over human accuracy and provides scope for early detection. It can be inferred from
this model that the use of machine learning techniques is useful in developing
prediction models that can help the network sector shorten the long process of
diagnosis and reduce human error.
FUTURE WORK