Report

This document discusses using machine learning techniques to detect phishing URLs. It aims to develop a model that can accurately classify URLs as either legitimate or phishing. Cyber security experts are looking for reliable phishing detection methods. The proposed system will extract and evaluate various features of authentic and phishing URLs using machine learning. This will provide a model for URL classification that can help identify phishing attacks by validating any supplied link. The goal is to create a cost-effective solution that can efficiently classify websites in a timely manner to protect users from phishing attacks.

Uploaded by

Vinay Singh

ABSTRACT

With the ever deeper integration of the Internet into social life, the Internet is changing
how people learn and work, while also exposing us to increasingly serious security attacks.
Recognizing network threats, especially attacks not seen before, is a primary issue that
needs immediate attention. The aim of phishing URLs is to collect private information such
as a user's identity, passwords and online financial transactions. Phishers use sites that
are visually and semantically similar to authentic websites. Since most users go online to
access services provided by government and financial organizations, phishing threats and
attacks have increased significantly in recent years. As technology advances, phishing
methods are evolving rapidly, and anti-phishing techniques are needed to detect and counter
them. Machine learning is a powerful tool that can be used against phishing attacks. This
study develops a model that can predict whether a URL is legitimate or phishing.

Cyber security professionals are now looking for reliable and stable techniques for
detecting phishing websites. By extracting and evaluating numerous features of authentic
and phishing URLs, this project uses machine learning to detect phishing URLs.

In conclusion, the study provided a model for URL classification into phishing and legitimate
URLs. This would be very valuable in assisting individuals and companies in identifying
phishing attacks by authenticating any link supplied to them to prove its validity.

Keywords: Phishing attacks, legitimate, trustworthy, Machine Learning, Personal
Information, Malicious links, Phishing domain characteristics.

MAJOR PROJECT REPORT (2019-2023 Batch) Dept. of CSE, MLRITM viii


LIST OF FIGURES

FIG. NO    FIG. NAME

1.1    Applications of ML in Cyber Security
3.1    System Architecture
4.1    UML Activity Diagram
4.2    DFD level – 0
4.3    DFD level – 1
4.4    DFD level – 2
5.1    Implementation
5.2    Flowchart of proposed System
5.3    Flowchart of web Interface
6.1    Graph of Accuracy


LIST OF TABLES

TABLE NO.    TABLE TITLE

2.1    Literature survey
3.1    Supporting Python Modules
6.1    Performance of Proposed System
6.2    Accuracy Classification
7.1    Test Cases


SYMBOLS & ABBREVIATIONS
ML : Machine Learning

AI : Artificial Intelligence

IDS : Intrusion Detection Systems

HTTPS : Hypertext Transfer Protocol Secure

URLs : Uniform Resource Locators

UML : Unified Modeling Language

CSS : Cascading Style Sheets

CNN : Convolutional Neural Network

KSVM : Kernel Support Vector Machine

RFC : Random Forest Classifier

DFD : Data Flow Diagram


CHAPTER 1

INTRODUCTION

1.1 Background
Artificial intelligence is a branch of science that studies and develops theories,
methods, techniques, and applications that simulate, extend and expand human
intelligence. ML is a branch of artificial intelligence and is closely related to
computational statistics, which also focuses on making predictions with computers.
Machine learning has strong ties to mathematical optimization, which supplies
methods, theory and application domains to the field. ML is sometimes conflated with
data mining, but the data mining subfield focuses more on exploratory data analysis
and is associated with unsupervised learning. ML can likewise be unsupervised and be
used to learn and establish baseline profiles for various entities, which are then
used to find meaningful anomalies.

Cyber security is a set of technologies and processes designed to protect computers,
networks, programs and data from attacks and unauthorized access or modification. A
network security system comprises a network protection system and a computer
protection system. Each of these systems includes firewalls, antivirus software, and
an intrusion detection system (IDS). IDSs help discover, determine and identify
unauthorized system behavior such as use, copying, modification and destruction.

The application of machine learning (ML) methods in cybersecurity is growing faster
than ever before, as shown in fig 1.1. From IP traffic classification to filtering
malicious traffic for intrusion detection, machine learning is one of the most
promising answers to zero-day attacks. New research is being carried out using
statistical traffic characteristics and ML techniques. A phishing technique was
described as early as 1987, although the term itself is usually dated to the
mid-1990s. Phishing is a form of online theft that steals an individual's private and
identity data. It is a kind of fraud in which the attacker gains access to another
individual's private data.


Figure 1.1: Applications of Machine Learning in Cyber Security

Because of the increase in phishing attacks, numerous solutions have been proposed.
There are several ways to build a framework that defends against phishing, including
blacklist-based, whitelist-based, fuzzy rule-based, CANTINA-based, machine
learning-based, heuristic and image-based approaches. Several other studies discuss a
variety of methods and techniques for detecting the different types of phishing
attacks. Phishing sites look like genuine websites, and many people have trouble
recognizing them. A few anti-phishing techniques are built into some browsers.

1.2 Motivation

There are many anti-phishing techniques that help protect us from phishing sites.
Mozilla Firefox, Safari and Google Chrome make use of the Google Safe Browsing (GSB)
service to block phishing websites. Tools such as McAfee SiteAdvisor, Quick Heal,
Avast and Netcraft are also widely used. GSB analyzes a URL using the blacklist
approach. The main disadvantage of GSB is that it cannot detect a phishing website
until the blacklist has been updated. In the case of Netcraft, a phishing website was
recorded as phishing although it was not blocked. Netcraft blocks a website only when
it is 100% sure that the website is phishing. The warning is shown only when the user
right-clicks the icon to see the risk rating. The risk arises when the individual
does not check the rating or decides to use the site even after checking it.
Protection against online attacks is provided by software such as Quick Heal and
Avast. The functioning of Avast antivirus was checked after installing it. The Avast
browser was not able to identify the phishing URL that was successfully detected by
Netcraft and GSB. These points underline the need for more advanced anti-phishing
tools. It is noteworthy that these tools must be installed separately. A lay person
who is not aware of practices like phishing might never install them, and then relies
only on the GSB service. Hence, awareness of phishing and of such anti-phishing tools
is very important. Also, no individual should rely fully on tools, because they can
misclassify websites.

1.3 Problem Statement

The problem was derived after a thorough study of methods for classifying phishing
websites using machine learning techniques. We must design a system that:

• Takes little time for detection and is cost effective.
• Accurately and efficiently classifies websites as legitimate or phishing.

1.4 Scope
This study explores data science and machine learning models that use datasets obtained
from open-source platforms to analyze website links and distinguish between phishing
and legitimate URLs. The model will be integrated into a web application, allowing a
user to predict whether a URL is legitimate or phishing. The web application is
compatible with a variety of browsers.

1.5 Definition of the Problem

Phishing attacks have become increasingly complex, and it is very difficult for an
average person to determine whether a link in an email message or a website is
legitimate. Cyber-attacks by criminals who employ phishing schemes are prevalent and
successful nowadays. Hence, this project seeks to address fake URLs and domain names
by identifying phishing website links. A web application that gives the user an
interface to check whether a URL is phishing or legitimate will help decrease security
risks to individuals and organizations.

1.6 Aim and Objective

The project’s objectives are as follows:

• To study various automatic phishing detection methods
• To identify the appropriate machine learning techniques and define a solution
using the selected method
• To select an appropriate dataset for the problem statement
• To apply appropriate algorithms to achieve the solution to phishing attacks.

1.7 Challenges

The challenges faced during the project are as follows:

• Finding an appropriate dataset.

• Feature extraction required studying various modules, understanding each of them
and getting the expected outcome from it.

• An extensive review of related topics and existing documented materials such as
journals, e-books, and websites was needed; the information gathered was examined
to retrieve the essential data needed to better understand and improve the system.

1.8 Methodology
The methodology used to achieve the objectives stated earlier is explained below. The
dataset consists of phishing and legitimate URLs obtained from open-source platforms.
The dataset was then pre-processed, that is, cleaned of abnormalities such as missing
data so that incomplete records do not skew the dataset. Afterward, exploratory data
analysis was done to explore and summarize the dataset. Once the dataset was free from
anomalies, website content-based features were extracted from it to train and test the
model. An extensive review was done of the existing literature and machine learning
models for detecting phishing websites in order to decide on the classification models
best suited to the problem.

Hence, a series of machine learning classification models, namely Decision Tree,
Support Vector Machine, XGBoost, Multilayer Perceptron, Autoencoder Neural Network
and Random Forest, was deployed on the dataset to distinguish between phishing and
legitimate URLs. The model with the highest training accuracy among those deployed
was selected and then integrated into a web application. Thus, a user can enter a URL
in the web application to predict whether it is phishing or legitimate.
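
The selection step described above can be illustrated with a small, self-contained sketch. It is not the project's code: the function names, the toy rule-based "models" and the label convention (1 = phishing, 0 = legitimate) are assumptions made only for illustration, and the sketch scores candidates on a held-out split rather than on the training data. In practice the trained classifiers would take the place of the toy rules.

```python
import random


def split_holdout(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) - int(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def select_best(models, rows):
    """Score every candidate on (features, label) rows and return the best one.

    `models` maps a name to a callable that takes a feature list and
    returns a label (assumed here: 1 = phishing, 0 = legitimate).
    """
    y_true = [label for _, label in rows]
    scores = {}
    for name, predict in models.items():
        y_pred = [predict(features) for features, _ in rows]
        scores[name] = accuracy(y_true, y_pred)
    best = max(scores, key=scores.get)
    return best, scores
```

Keeping the candidates behind a plain callable interface means that any classifier can be compared in the same loop, whatever library it comes from.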

1.9 Organization of the thesis

Chapter 1 introduces the application of ML in cyber security. It details the problem
statement, objectives and scope of the project, and describes the challenges faced
during development.

Chapter 2 covers the study of phishing attacks and their detection using machine
learning techniques. It gives a detailed description of earlier work in this area and
the limitations of that work.

Chapter 3 discusses the software and hardware requirements of the system. It details
the minimum requirements for the project and the Python modules used.

Chapter 4 describes the system design and its representation using architecture, data
flow and activity diagrams. It gives a graphical representation of the system from the
system's, user's and run-time perspectives for better understanding.

Chapter 5 examines the implementation of the project. It details the dataset used, the
steps involved in the implementation, the classifiers used, etc.

Chapter 6 examines the test cases and compares the expected output with the actual
output to validate the results.

Chapter 7 discusses the outcome obtained and the environmental setup of the project.

Chapter 8 concludes the project and discusses future enhancements.


CHAPTER 2

LITERATURE SURVEY

2.1 Overview of the Study

This chapter offers an insight into important studies by scholars, drawn from articles,
books, and other sources relevant to the detection of phishing websites. It also
provides a theoretical, conceptual, and empirical review to demonstrate understanding
of the project.

2.2 Literature Survey

A literature survey is a scholarly article that presents current knowledge, including
substantive findings as well as theoretical and methodological contributions to a
particular topic.

No. 1
Paper title: "OFS-NN: An Effective Phishing Websites Detection Model Based on Optimal
Feature Selection and Neural Network"
Method/technique: The proposed method has 3 stages: it defines a new index, designs an
optimal feature selection algorithm, and produces the OFS-NN model.
Publish year: 2019
Limitations: The continuous growth of features that are sensitive to phishing attacks
requires collecting more features for the OFS.

No. 2
Paper title: "Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection"
Method/technique: The proposed method uses Fuzzy Rough Set (FRS) theory to identify the
features. The decision boundary is determined by the lower and upper approximation
regions. Using the lower and upper approximation memberships, the category to which a
set member belongs is decided.
Publish year: 2019
Limitations: The specific features used in the method are not specified.

No. 3
Paper title: "Phishing Website Detection based on Multidimensional Features driven by
Deep Learning"
Method/technique: The proposed method has the following stages: 1. character sequence
features of the URL are extracted and used for fast classification; 2. an LSTM (long
short-term memory) network is used to capture context semantics and dependency features
of URL character sequences; 3. softmax classifies the extracted features.
Publish year: 2019
Limitations: It requires more computation and is therefore an expensive method.

No. 4
Paper title: "WC-PAD: Web Crawling based Phishing Attack Detection"
Method/technique: A 3-phase approach to phishing attack detection. The 3 phases of
WC-PAD are 1) a DNS blacklist, 2) a heuristics-based approach and 3) a web crawler-based
approach. Both feature extraction and phishing attack detection make use of the web
crawler.
Publish year: 2019
Limitations: Time consuming, as every website has to go through all three phases.

No. 5
Paper title: "Phishing URL Detection via CNN and Attention-Based Hierarchical RNN"
Method/technique: An NN module is used to derive character-level spatial feature
representations of the URLs. The representational features are then combined using a
3-layer CNN to create precise feature representations of URLs, which are used to train
the phishing URL classifier.
Publish year: 2019
Limitations: The false positive rate is high.

No. 6
Paper title: "An Adaptive Machine Learning Based Approach for Phishing Detection Using
Hybrid Features"
Method/technique: A phishing detection system was developed using a machine learning
classifier called XCS, an adaptive online ML technique that evolves a set of rules
called classifiers. The model derives 38 features from the webpage source code and
URLs.
Publish year: 2020
Limitations: -

No. 7
Paper title: "A new method for Detection of Phishing Websites: URL Detection"
Method/technique: The three major phases in this work are Parsing, Heuristic
Classification of data, and Performance Analysis. Each phase uses different and
distinctive data processing methods to get better results.
Publish year: 2020
Limitations: Does not give full information about the techniques used.

Table 2.1: Literature Survey.

From the above, ML methods play a vital role in many cybersecurity applications and
remain a promising direction that attracts further research. In practice, however,
several barriers limit implementation. As discussed, many approaches have been
proposed for detecting phishing website attacks, and each has its own limitations.
Therefore, the aim of this project is to detect phishing website attacks using a novel
machine learning technique.

2.3 Analysis of Existing System

Existing phishing detection techniques suffer from low detection accuracy and a high
false alarm rate, especially when new phishing approaches are introduced. Moreover,
the most common technique is the blacklist-based method, which is inefficient in
responding to emerging phishing attacks: since registering a new domain has become
easier, no comprehensive blacklist can guarantee a perfectly up-to-date database for
phishing detection.

2.4 Proposed System

The proposed phishing detection system utilizes machine learning models. The system
comprises two major parts: the machine learning models and a web application. The
models are Decision Tree, Support Vector Machine, XGBoost, Logistic Regression, and
Random Forest. They were selected after comparing the performance of multiple machine
learning algorithms. Each model is trained and tested on website content-based
features extracted from both phishing and legitimate datasets. The model with the
highest accuracy is then selected and integrated into a web application that enables a
user to predict whether a URL is phishing or legitimate.

2.5 Benefits of the new system

i. It will be able to differentiate between phishing (0) and legitimate (1) URLs

ii. It will help reduce phishing data breaches for an organization

iii. It will be helpful to individuals and organizations

iv. It is easy to use


2.6 Summary

In this chapter we focused on existing systems through a literature survey. Various
research papers were analyzed, some important points of each paper were specified, and
related diagrams or graphs are included. In the comparison section we highlighted a few
important advantages and disadvantages of each paper and compared the papers with one
another. This chapter also introduced the drawbacks of the existing system and the
functionality and advantages of the proposed system.


CHAPTER 3

ANALYSIS

3.1 Overview of System Analysis

This chapter describes the various processes, methods, and procedures adopted by the
researcher to achieve the stated aim and objectives, and the conceptual structure
within which the research was conducted.

The methodology of any research work refers to the research approach adopted by the
researcher to tackle the stated problem. Since the efficiency and maintainability of
any application depend largely on how its design is prepared, this chapter provides
detailed descriptions of the methods employed to meet the stated objectives of the
research work.

According to the Merriam-Webster dictionary (11th ed.), system analysis is "the
process of studying a procedure or business to identify its goals and purposes and
create systems and procedures that will efficiently achieve them". It is also the act,
process, or profession of studying an activity (such as a procedure, a business, or a
physiological function) typically by mathematical means to define its goals or
purposes and to discover operations and procedures for accomplishing them most
efficiently. System analysis is used in every field where something is developed.
Before planning and development, you need to understand the old systems thoroughly
and use that knowledge to determine how your new system can work best.

In ML and statistics, classification is a supervised learning approach in which a
computer program learns from input data and then uses what it has learned to classify
new observations. The following are a few classification techniques used in the
detection of phishing URLs.


3.2 Software Requirement Specification

3.2.1 Installation Requirements


The hardware (physical components of a computer system that can be seen, touched,
or felt) and software (both the system software and the application software installed
and used in the system development) tools needed to satisfy these objectives are
highlighted below:

3.2.2 Hardware Requirements:

• Processor (CPU) - Intel Pentium Dual Core or higher
• Hard disk capacity - 512 MB minimum free space
• RAM - 4 GB minimum

3.2.3 Software requirements

• Programming language – Python

• Operating system - Windows 8.1 or above

• IDE – Anaconda or PyCharm

• iPython version 3.x

3.3 Other Non-Functional Requirements


A non-functional requirement is a specification that describes the system's operational
capabilities and constraints that improve its functionality. Some of them are as follows:

• Reusability: The same code with limited changes can be used for detecting
phishing attack variants like smishing, vishing, etc.

• Maintainability: The implementation is very basic and includes print
statements that make it easy to debug.

• Usability: The software used is very user friendly and open source. It also
runs on any operating system.

• Scalability: The implementation can be extended to include detection of vishing,
smishing, etc.

3.4 System Architecture

Figure 3.1: System Architecture

The architecture of the system is shown in fig 3.1. The URLs to be classified as
legitimate or phishing are fed as input to the appropriate classifier. The classifier,
which has been trained on the training dataset to classify URLs as phishing or
legitimate, uses the patterns it has learned to classify the new input. Features such
as IP address, URL length, domain, presence of a favicon, etc. are extracted from the
URL, and a list of their values is generated. The list is fed to classifiers such as
KNN, kernel SVM, Decision Tree and Random Forest. The performance of these models is
then evaluated and an accuracy score is generated. Using the generated list, the
trained classifier predicts whether the URL is legitimate or phishing. The list
contains the value 1 if a feature exists, 0 if it is not applicable and -1 if it does
not exist. There are 30 features considered in this project.
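
As a rough illustration of how a URL might be turned into such a list, the sketch below computes three lexical features with the 1 / 0 / -1 encoding described above. It is a minimal example under stated assumptions, not the project's extractor: the function names, the choice of these three features and the 75-character threshold for a "long" URL are all hypothetical, and the project itself considers 30 features.

```python
import ipaddress
from urllib.parse import urlparse

# Encoding taken from the text: 1 = feature exists, 0 = not applicable, -1 = absent.
PRESENT, NOT_APPLICABLE, ABSENT = 1, 0, -1

# Hypothetical cut-off for a "long" URL; the report does not state one.
LONG_URL_THRESHOLD = 75


def host_is_ip(host: str) -> int:
    """PRESENT if the host is a literal IPv4/IPv6 address, else ABSENT."""
    try:
        ipaddress.ip_address(host)
        return PRESENT
    except ValueError:
        return ABSENT


def extract_features(url: str) -> list:
    """Return [ip_in_host, long_url, at_symbol] for one URL."""
    host = urlparse(url).hostname
    if host is None:
        # No host could be parsed, so none of the checks is meaningful.
        return [NOT_APPLICABLE] * 3
    return [
        host_is_ip(host),
        PRESENT if len(url) > LONG_URL_THRESHOLD else ABSENT,
        PRESENT if "@" in url else ABSENT,
    ]
```

For example, `extract_features("http://192.168.0.1/login")` gives `[1, -1, -1]`: the host is an IP address, the URL is short, and it contains no `@`. Content-based features such as the favicon check would additionally require fetching and parsing the page.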


3.5 Supporting Python modules

Python lets you put definitions in a file and use them in a script or in an
interactive instance of the interpreter. Such a file is called a module; definitions
from a module can be imported into other modules or into the main module.
Some of the modules used in the project are shown in Table 3.1.

S.No    Python Module    Description

1    ipaddress    Provides the capabilities to create, manipulate and operate on
IPv4 and IPv6 addresses and networks.

2    re    Provides regular expression matching operations similar to those found
in Perl.

3    urllib.request    Defines functions and classes that help in opening URLs
(mostly HTTP) in a complex world.

4    BeautifulSoup    A Python package for parsing HTML and XML documents. It
creates a parse tree for parsed pages that can be used to extract data from HTML,
which is useful for web scraping.

5    socket    Provides access to the BSD socket interface.

6    requests    Allows HTTP requests to be sent using Python.

7    whois    WHOIS is a query and response protocol that is widely used for
querying databases that store the registered users or assignees of an Internet
resource, such as a domain name, an autonomous system or an IP address block. It is
also used for a wider range of other information.

Table 3.1: Supporting Python modules

3.6 Machine learning models

The machine learning model is nothing but a piece of code; an engineer or data
scientist makes it smart through training with data. So, if you give garbage to the
model, you will get garbage in return, i.e. the trained model will provide false or
wrong predictions.

3.6.1 Supervised learning


Supervised learning, in the context of artificial intelligence (AI) and machine
learning, is a type of system in which both the input and the desired output data are
provided. Input and output data are labelled for classification to provide a learning
basis for future data processing. Supervised learning models have some advantages over
the unsupervised approach, but they also have limitations. For example, the systems
are more likely to make decisions that humans can relate to, because humans have
provided the basis for those decisions. However, in the case of a retrieval-based
method, supervised learning systems have trouble dealing with new information.
Supervised learning is divided into two categories, "Classification" and "Regression".
A classification problem is one in which the target variable is categorical (i.e. the
output can be assigned to classes: it belongs to Class A, Class B or some other class),
while a regression problem is one in which the target variable is continuous.

3.6.2 Logistic Regression
Logistic Regression is a classification model that is used when the dependent variable
(output) is binary, such as 0 (False) or 1 (True). This makes logistic regression a
good fit for our purpose of predicting whether a URL is a phishing URL (1) or not (0).

Logistic Regression can be seen as an extension of the Linear Regression model. Let us
understand this with a simple example. Suppose we want to classify whether an email is
spam or not. A Linear Regression model would only give us continuous values such as
0.4 or 0.7, which are not class labels and are not even guaranteed to lie between 0
and 1. Logistic Regression extends the linear model by passing its output through the
logistic function to obtain a probability and then applying a threshold, commonly 0.5:
the data point is classified as spam if the output value is greater than 0.5 and as
not spam otherwise. In this way, we can apply Logistic Regression to classification
problems.

The logistic function, also called the sigmoid function, was initially used by
statisticians to describe properties of population growth in ecology. It is a
mathematical function used to map the predicted values to probabilities. The sigmoid
function has an S-shaped curve and takes values between 0 and 1, but never exactly at
those limits.
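
The sigmoid mapping and the 0.5 threshold can be written out in a few lines. The sketch below is illustrative only: the weights and bias are made-up numbers rather than fitted values, and the function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one feature vector.

    The linear score is squashed by the sigmoid into a probability, and the
    label is 1 when that probability exceeds the threshold, else 0.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p > threshold)
```

With the made-up parameters `weights=[2.0, -1.0]` and `bias=-0.5`, the feature vector `[1, -1]` has a linear score of 2.5, which the sigmoid maps to a probability above 0.5, so the predicted label is 1. Mathematically the sigmoid never reaches 0 or 1, although floating-point arithmetic rounds to those values for very large inputs.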

3.6.3 Random Forest Algorithm


The random forest algorithm is one of the most powerful algorithms in machine learning
and is based on the concept of the decision tree algorithm. It creates a forest
consisting of a number of decision trees. A larger number of trees generally gives
higher detection accuracy, although the gains level off. The trees are created using
the bootstrap method: to construct a single tree, samples of the dataset are randomly
selected with replacement, and a random subset of the features is considered. Among
the randomly selected features, the random forest algorithm chooses the best splitter
for the classification; like the decision tree algorithm, it uses the Gini index and
information gain to find the best splitter. This process continues until the random
forest has created n trees. Each tree in the forest predicts the target value, and the
algorithm then counts the votes for each predicted target. Finally, the random forest
algorithm takes the target with the most votes as the final prediction.
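
Three building blocks mentioned above, the Gini index, bootstrap sampling and majority voting, can be sketched with the standard library alone. This is not a full random forest and not the project's implementation; the function names are hypothetical, and tree construction itself is left out.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini index of a set of labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the most trees (first seen wins a tie)."""
    return Counter(predictions).most_common(1)[0][0]
```

A pure node such as `[1, 1, 1, 1]` has a Gini index of 0, while an evenly mixed node such as `[0, 1, 0, 1]` has 0.5, the maximum for two classes; a split is better the lower the impurity of the nodes it produces. If five trees predict `[1, 0, 1, 1, 0]`, the forest's prediction is 1.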

3.6.4 Support Vector Machine Algorithm


The support vector machine is another powerful algorithm in machine learning. In the
support vector machine algorithm, each data item is plotted as a point in
n-dimensional space, and the algorithm constructs a separating boundary between the
two classes, known as a hyperplane. The support vector machine looks for the points of
each class that lie closest to the other class, called support vectors, and draws a
line connecting them. It then constructs the separating line that bisects the
connecting line and is perpendicular to it. For the classifier to generalize well, the
margin should be as large as possible. Here the margin is the distance between the
hyperplane and the support vectors. In real scenarios, complex and non-linear data
often cannot be separated by a linear boundary; to solve this problem, the support
vector machine uses the kernel trick, which implicitly maps the data from a
lower-dimensional space to a higher-dimensional space.
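
To make the kernel trick concrete, the sketch below uses a degree-2 polynomial kernel on 2-D points as an illustrative choice; the report does not say which kernel the project uses. It shows that the kernel value computed in the original space equals an ordinary dot product after an explicit mapping `phi` into 3-D, so the higher-dimensional space never has to be built explicitly. All names are hypothetical.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y)^2, computed in the input space."""
    return dot(x, y) ** 2


def phi(x):
    """Explicit feature map for the kernel above, taking 2-D points to 3-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)
```

For `x = (1, 2)` and `y = (3, 4)`, both `poly2_kernel(x, y)` and `dot(phi(x), phi(y))` come to 121 (up to floating-point rounding). The mapping also shows why the extra dimension helps: the XOR-style points (1, 1) and (-1, -1) versus (1, -1) and (-1, 1) cannot be separated by a line in 2-D, but the middle coordinate of `phi` is positive for the first pair and negative for the second, so a plane separates them in 3-D.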


CHAPTER 4

DESIGN

4.1 System Modelling

System modeling is the process of developing abstract models of a system, with each
model presenting a different view or perspective of the system. It is the process of
representing a system using various graphical notations that show how users will
interact with the system and how certain parts of the system function. The proposed
system was modeled using the following diagrams:

i. Architecture diagram
ii. Use case diagram
iii. Flowcharts

The proposed system will be implemented using the Python programming language along
with different machine learning models and libraries such as pandas, scikit-learn,
python-whois, BeautifulSoup, NumPy, seaborn, and matplotlib.

4.2 UML Activity Diagram

An activity diagram is a behavioral diagram. Fig 4.1 shows the activity diagram of
the system. It depicts the control flow from a start point to an end point, showing
the various paths that exist during the execution of the activity.

Figure 4.1: UML activity diagram

4.3 Data Flow Diagrams

DFDs are used to depict graphically the data flow in a system. A DFD explains the
processes involved in a system from the input to the report generation. It shows all
possible paths from one entity to another in a system. The detail of a data flow
diagram can be represented at three different levels, numbered 0, 1 and 2. There are
many notations for drawing a data flow diagram, among which the Yourdon-Coad and
Gane-Sarson methods are popular. The DFDs depicted in this chapter use the Gane-Sarson
notation.

4.3.1 Data Flow Diagram – Level 0

DFD level 0 is called a Context Diagram. It is a simple overview of the whole system
being modeled. Fig 4.2 shows the DFD level 0 of the system.

Figure 4.2: DFD - level 0

It shows the system as a high-level process with its relationship to the external
entities. It should be easily understood by a wide audience, from stakeholders to
developers to data analysts.

4.3.2 Data Flow Diagram – Level 1

DFD level 1 gives a more detailed explanation of the Context diagram. The high-level
process of the Context diagram is broken down into its subprocesses. The DFD level
1 of the system is depicted in Fig 4.3.

Figure 4.3: DFD - level 1

The Level 1 DFD goes a step deeper by including the processes involved in the system,
such as feature extraction, splitting of the dataset and building the classifier, and
hence gives a more detailed view of the system.

4.3.3 Data Flow Diagram – Level 2

DFD level 2 goes one step deeper into the subprocesses of Level 1. Fig 4.4 shows the
DFD level 2 of the system. It might require more text to reach the necessary level of
detail about the functioning of the system. Level 2 gives a more detailed view of the
system by grouping the processes involved into three categories, namely
preprocessing, feature scaling and classification. It also graphically depicts each of
these categories in detail and gives a complete idea of how the system works.

Figure 4.4: DFD - level 2

4.4 Summary
In this chapter, the system's architecture, the processes involved from input to
output at varying levels of complexity, and the system's behaviour are represented
graphically for a better understanding of the system.

CHAPTER 5

IMPLEMENTATION

5.1 Introduction

This chapter of the report illustrates the approach employed to classify the URLs as
either phishing or legitimate. The methodology involves building a training set. The
training set is used for training a machine learning model, i.e., the classifier. Fig 5.1
shows the diagrammatic representation of the implementation.

Figure 5.1: Implementation

5.2 Technology Used


PYTHON

In technical terms, Python is an object-oriented, high-level programming language
with integrated dynamic semantics, used primarily for web and app development. It is
extremely attractive in the field of Rapid Application Development because it offers
dynamic typing and dynamic binding options.

Python is relatively simple and easy to learn, since it has a distinctive syntax that
focuses on readability. Developers can read and translate Python code much more easily
than code in other languages. In turn, this reduces the cost of program maintenance
and development because it allows teams to work collaboratively without significant
language and experience barriers.

Additionally, Python supports the use of modules and packages, which means that
programs can be designed in a modular style and code can be reused across a variety
of projects. Once you have developed a module or package you need, it can be scaled
for use in other projects, and it’s easy to import these modules.

One of the most promising benefits of Python is that both the standard library and the
interpreter are available free of charge, in both binary and source form. There is no
exclusivity either, as Python and all the necessary tools are available on all major
platforms. Therefore, it is an enticing option for developers who don't want to worry
about paying high development costs.

That makes Python accessible to almost anyone. If you have the time to learn, you can
create some amazing things with the language.

MACHINE LEARNING

Machine learning provides simplified and efficient methods for data analysis. It has
recently shown promising results in real-time classification problems. The key
advantage of machine learning is the ability to create flexible models for specific
tasks such as phishing detection. Since phishing detection is a classification
problem, machine learning models can be used as a powerful tool. Machine learning
models can adapt to changes quickly to identify patterns of fraudulent transactions,
which helps in developing a learning-based identification system. Most of the machine
learning models discussed here are supervised: the algorithm tries to learn a function
that maps an input to an output based on example input-output pairs, inferring that
function from labeled training data consisting of a set of training examples.
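
As a small illustration of learning from labeled input-output pairs, a k-nearest-neighbours classifier (one of the classifiers named in this report) can be sketched in plain Python. The two features used here (URL length and number of dots) and the six training rows are assumptions made up for the example; they are not taken from the project's dataset.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy labelled examples: (URL length, number of dots) -> label.
train_x = [(20, 1), (25, 2), (22, 1), (90, 5), (110, 6), (95, 4)]
train_y = ["legitimate", "legitimate", "legitimate",
           "phishing", "phishing", "phishing"]

prediction = knn_predict(train_x, train_y, query=(100, 5))
```

The "training" here is simply storing the labeled pairs; the mapping from input to output is inferred at prediction time from the nearest stored examples.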

PANDAS

Pandas is an open-source Python library that provides high-performance data
manipulation and analysis tools built on its powerful data structures. The name Pandas
is derived from "panel data", an econometrics term for multidimensional data. In 2008,
developer Wes McKinney started developing Pandas when he needed a high-performance,
flexible tool for data analysis. Prior to Pandas, Python was mostly used for data
munging and preparation and contributed very little to data analysis; Pandas solved
this problem. Using Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of its origin: load, prepare, manipulate, model, and
analyze. Python with Pandas is used in a wide range of academic and commercial
domains, including finance, economics, statistics and analytics.

NUMPY

NumPy is a Python package whose name stands for 'Numerical Python'. It is a library
consisting of multidimensional array objects and a collection of routines for
processing arrays.

Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package,
Numarray, was also developed, with some additional functionality. In 2005, Travis
Oliphant created the NumPy package by incorporating the features of Numarray into the
Numeric package. There are many contributors to this open-source project.

Using NumPy, a developer can perform the following operations:

• Mathematical and logical operations on arrays.

• Fourier transforms and routines for shape manipulation.

• Operations related to linear algebra. NumPy has in-built functions for linear algebra
and random number generation.

5.3 Flowchart of the system

A flowchart is a diagram that depicts a process, system, or computer algorithm. It is
a graphical representation of the steps that are to be performed in a system, shown in
sequential order. It is used to present the flow of algorithms and to communicate
complex processes in clear, easy-to-understand diagrams.

Figure 5.2 shows the flow of phishing detection systems using the machine learning
process.

Figure 5.3 shows the phishing detection web interface system. The user inputs a URL
link and the website validates the format of the URL and then predicts if the link is
phishing or legitimate.
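
The format check that precedes prediction can be sketched as follows. This is an illustrative Python version only: the report's web interface is built with HTML, CSS, PHP and JavaScript, and the function names is_full_url and check_input are hypothetical. The message text follows the one shown in Table 7.1.

```python
from urllib.parse import urlparse


def is_full_url(text):
    """Return True when `text` has an http(s) scheme and a host name."""
    parts = urlparse(text.strip())
    return parts.scheme.lower() in ("http", "https") and bool(parts.netloc)


def check_input(text):
    """Mirror the interface flow: validate first, then hand over to the model."""
    if not is_full_url(text):
        return "Please input full URL"
    return "ready for prediction"  # placeholder: the classifier call goes here


result_full = check_input("HTTPS://FACEBOOK.COM")
result_partial = check_input("WWW.FACEBOOK.COM")
```

A URL without a scheme is parsed as a bare path rather than a host, which is why the second input is rejected before any features are extracted.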

Figure 5.2 Flowchart of the proposed System

Figure 5.3 Flowchart of the web interface

5.4 Dataset

5.5 Process involved in implementation
The first step of the research work was determining the right dataset. The dataset
selected for this task was collected from Kaggle. It was chosen for several reasons:

• The dataset is large, which makes it interesting to work with.

• The dataset has 30 features, giving a wide range of features that can make the
predictions a little more accurate.

• The URLs are quite evenly distributed between the two categories.

5.5.1 Splitting:

The dataset was split into a training set and a testing set, with 75% of the data
used for training and 25% for testing, using the “train test split” method. The
splitting was done after assigning the dependent and independent variables.
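
A 75/25 split of this kind can be sketched with the standard library alone. This is a hand-written stand-in for illustration, not the library routine the project used: the function name split_dataset, the seed, and the toy rows and labels are assumptions.

```python
import random


def split_dataset(features, labels, test_fraction=0.25, seed=42):
    """Shuffle row indices, then cut them into training and testing parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: 8 rows of 30 feature values each; label values are arbitrary.
X = [[i] * 30 for i in range(8)]
y = [i % 2 for i in range(8)]
x_train, x_test, y_train, y_test = split_dataset(X, y)
```

Features and labels are indexed with the same shuffled positions, so each row keeps its own label after the split.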

5.5.2 Preprocessing:

Preprocessing involves filling in or removing missing data to obtain a clean dataset.
The dataset chosen was already preprocessed and did not require any further cleaning,
so the only preprocessing step performed was feature scaling.
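
The report does not state which scaling method was applied. As one common possibility, standardization (zero mean, unit variance) is sketched below; the function name standardize and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column has nothing to scale
    return [(value - mu) / sigma for value in column]


url_lengths = [20, 40, 60, 80]  # illustrative values only
scaled = standardize(url_lengths)
```

In practice the mean and standard deviation are usually computed on the training part only and then reused for the testing part, so that no information from the test data leaks into training.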

5.5.3 Feature extraction:

Feature values are extracted using Python modules such as whois, requests, socket, re,
ipaddress and BeautifulSoup to get information regarding the IP address, length of the
URL, domain name, subdomains, presence of a favicon, etc. The values obtained are
stored in a list. This is done because the dataset is in this format, and hence the
classifier is trained with input in this format. Therefore, when a URL is passed as
input to the system, it is converted into a Python list of 30 elements, each
representing its respective feature, and that list is then fed to the trained
classifier. The classifiers used are KNN, kernel SVM, decision tree and random
forest.
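
Turning a URL into a list of feature values can be sketched with standard-library modules. The six features below are an illustrative subset chosen for this example and are not the dataset's 30 feature definitions; the function names host_is_ip and extract_features are hypothetical, and features that need network access (WHOIS records, page content) are left out.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(host):
    """True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Turn a URL into a small list of numeric features (illustrative subset)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    return [
        1 if host_is_ip(host) else 0,          # IP address used as host
        len(url),                              # total URL length
        host.count("."),                       # dots in the host name
        1 if "@" in url else 0,                # '@' symbol present
        1 if "-" in host else 0,               # hyphen in the host name
        1 if parts.scheme == "https" else 0,   # HTTPS scheme
    ]


suspicious = "http://192.168.0.1/login@secure"
features = extract_features(suspicious)
```

Every URL produces a list of the same length and order, which is what allows the same list format to be used both for training and for classifying a newly entered URL.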

5.6 General Working of The System

A one-page phishing detection web application has been developed to run in any
browser. The application was developed using HTML, CSS, PHP, and JavaScript. The
phishing detection web application has the following pages:

5.6.1 The home page

The home page contains a section where a user can enter a URL and predict whether it
is phishing or legitimate. It predicts the state of the URL based on the feature
selection. The purpose of this page is to help its users validate a URL.

5.6.2 FAQs Page

The FAQs page contains a series of questions and answers about phishing attacks and
how users can protect themselves from being attacked by phishing sites.

Figure 5.4 The Home Page

Figure 5.5 Code for the web application

5.7 SUMMARY

This chapter discusses the working of the system through the proposed system
architecture. The flow diagrams show the working of the proposed system and the
software requirement specification. The project is also explained through the
architecture of the proposed system.

CHAPTER 6

RESULTS AND DISCUSSIONS

6.1 Screen Shots

6.2 Table and Graphs of results

Table 6.1 performance of the proposed system

Figure 6.1 Graph of accuracy

6.3 Results comparison and graphs
The phishing website classification model is built by implementing the random forest,
logistic regression and support vector machine algorithms. The goal of this project is
to compare the performance of different classifiers and find the best approach for
classifying phishing and non-phishing websites. These algorithms were implemented in
Python.
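
Comparing classifiers requires a common set of scores. A minimal sketch of how accuracy, precision and recall can be computed from actual and predicted labels is given below; the function name evaluate and the five label pairs are made up for illustration and are not the project's results.

```python
def evaluate(actual, predicted, positive="phishing"):
    """Compute accuracy, precision and recall for one classifier's predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels for illustration; these are not the project's results.
actual = ["phishing", "phishing", "legitimate", "legitimate", "phishing"]
predicted = ["phishing", "legitimate", "legitimate", "phishing", "phishing"]
scores = evaluate(actual, predicted)
```

Running the same function on each classifier's predictions for the same test set gives directly comparable figures.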

Table 6.2 Accuracy classification

CHAPTER 7

TESTING AND VALIDATION

7.1 Introduction
System testing is a series of different tests whose primary purpose is to fully
exercise the computer-based system. Although each test has a different purpose, all
work to verify that the system elements have been properly integrated and perform
their allocated functions. Testing is carried out to make sure that the web
application does exactly what it is supposed to do. The testing stage aims to achieve
the following goals:

● To affirm the quality of the project.

● To find and eliminate any residual errors from previous stages.

● To validate the software as a solution to the original problem.

● To provide operational reliability of the system.

In this chapter, we check the working of the proposed system by testing it and
comparing the result of each algorithm with the actual result, which amounts to
validating the system. The testing is done for each algorithm with a legitimate and a
phishing URL, and the results are as follows.

7.2 Testing Types

7.2.1 Unit Testing

Unit testing, also known as component testing, refers to tests that verify the
functionality of a specific section of code, usually at the function level. In an
object-oriented environment, this is usually at the class level, and the minimal unit
tests include the constructors and destructors. Unit testing is a software development
process that involves synchronized application of a broad spectrum of defect
prevention and detection strategies in order to reduce software development risks,
time and costs. The following unit testing table shows the functions that were tested
at the time of programming. The first column gives all the modules which were tested,
and the second column gives the test results. Test results indicate whether the
functions, for given inputs, are delivering valid outputs.

7.2.2 Validation Testing

At the culmination of integration testing, the software is completely assembled as a
package, and interfacing errors have been uncovered and corrected. Validation testing
can be defined in many ways; here the testing validates that the software functions in
a manner that is reasonably expected by the customer. In software project management,
software testing, and software engineering, verification and validation (V&V) is the
process of checking that a software system meets specifications and that it fulfills
its intended purpose. It may also be referred to as software quality control.

7.2.3 Functional Testing

Functional testing is a type of testing that seeks to establish whether each application
feature works as per the software requirements. Each function is compared to the
corresponding requirement to ascertain whether its output is consistent with the end
user’s expectations. The testing is done by providing sample inputs, capturing
resulting outputs, and verifying that actual outputs are the same as expected outputs.

7.2.4 Integration Testing

Integration testing is any type of software testing that seeks to verify the interfaces
between components against a software design. Software components may be
integrated in an iterative way or all together. Normally the former is considered a
better practice since it allows interface issues to be located more quickly and fixed.
Integration testing works to expose defects in the interfaces and interaction between
integrated components (modules). Progressively larger groups of tested software
components corresponding to elements of the architectural design are integrated and
tested until the software works as a system.

7.3 Test Cases

S No   Input URL                          Expected Output         Actual Output           Remarks

1      HTTPS://WWW.MLRITM.AC.IN/          Legitimate              Legitimate              Success

2      https://www.2498.b.hostable.me/    Phishing                Phishing                Success

3      HTTPS://FACEBOOK.COM               Legitimate              Legitimate              Success

4      WWW.FACEBOOK.COM                   Please input full URL   Please input full URL   Success

Table 7.1 Test Cases Table

7.4 Summary

This chapter discusses the importance of testing and the various methods that are used
to test the model built. This helps us to understand the performance of the system and
make the necessary changes accordingly.

CHAPTER 8

CONCLUSION AND FUTURE SCOPE

8.1 Conclusion

Phishing is becoming an increasingly serious threat in today's rapidly developing
world of technology. Today, every nation is moving towards cashless transactions,
online business, paperless tickets and so on to keep pace with a growing world, yet
phishing is becoming an obstacle to this progress. People no longer feel that the web
is reliable. Machine learning can be used to learn from data and to build useful data
products. A layperson who is completely unaware of how to recognize a security threat
would never take the risk of making financial transactions online. Phishers target
the payment industry and cloud services the most.

This project explores this area by presenting a use case of detecting phishing
websites using machine learning. It aimed to build a phishing detection mechanism
using machine learning tools and techniques that is efficient, accurate and cost
effective. The project was carried out in the Anaconda environment and was written in
Python. The proposed method used four machine learning classifiers to achieve this,
and a comparative study of the four algorithms was made. A good accuracy score was
also achieved.

8.2 Future Enhancement

Further work can be done to enhance the model by using ensemble models to get a
higher accuracy score. Ensemble methods are an ML technique that combines many base
models to generate an optimal predictive model. Further-reaching future work would be
combining multiple classifiers, trained on different aspects of the same training
set, into a single classifier that may provide a more robust prediction than any of
the single classifiers on their own.
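
One simple way of combining several classifiers is soft voting, in which the probability outputs of the base models are averaged. The sketch below is only an illustration of that idea: the function names soft_vote and ensemble_label, the 0.5 threshold and the four probabilities are assumptions, not results or design decisions from this project.

```python
def soft_vote(probabilities, weights=None):
    """Average the phishing probabilities reported by several base models."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total


def ensemble_label(probabilities, threshold=0.5):
    """Label a URL as phishing when the averaged probability reaches the threshold."""
    return "phishing" if soft_vote(probabilities) >= threshold else "legitimate"


# Hypothetical outputs of four base classifiers for one URL.
model_probabilities = [0.9, 0.4, 0.7, 0.6]
label = ensemble_label(model_probabilities)
```

The optional weights allow a more reliable base model to count for more in the average than a weaker one.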

The project can also be extended to cover other variants of phishing, such as
smishing and vishing, to complete the system. Looking even further out, the
methodology needs to be evaluated on how it might handle collection growth. The
collections will ideally grow incrementally over time, so there will need to be a way
to apply a classifier incrementally to the new data, and potentially also to have this
classifier receive feedback that might modify it over time.

8.3 Recommendation

Through this project, one can learn a lot about phishing attacks and how to prevent
them. This project can be taken further by creating a browser extension that can be
installed on any web browser to detect phishing URLs.

REFERENCES

[1] Reid G. Smith and Joshua Eckroth. Building ai applications: Yesterday, today, and
tomorrow. AI Magazine, 38(1):6–22, Mar. 2017.

[2] Panos Louridas and Christof Ebert. Machine learning. IEEE Software, 33:110–
115, 09 2016.

[3] Michael Jordan and T.M. Mitchell. Machine learning: Trends, perspectives, and
prospects. Science (New York, N.Y.), 349:255–60, 07 2015.

[4] Steven Aftergood. Cybersecurity: The cold war online. Nature, 547(7661):30+, Jul
2017.

[5] Aleksandar Milenkoski, Marco Vieira, Samuel Kounev, Alberto Avritzer, and
Bryan Payne. Evaluating computer intrusion detection systems: A survey of common
practices. ACM Computing Surveys, 48:12:1–, 09 2015.

[6] Chirag N. Modi and Kamatchi Acha. Virtualization layer security challenges and
intrusion detection/prevention systems in cloud computing: a comprehensive review.
The Journal of Supercomputing, 73(3):1192–1234, Mar 2017.

[7] Eduardo Viegas, Altair Santin, Andre Fanca, Ricardo Jasinski, Volnei Pedroni,
and Luiz Soares de Oliveira. Towards an energy-efficient anomaly-based intrusion
detection engine for embedded systems. IEEE Transactions on Computers, 66:1–1,
Jan 2016.

[8] Y. Xin, L. Kong, Z. Liu, Y. Chen, Y. Li, H. Zhu, M. Gao, H. Hou, and C. Wang.
Machine learning and deep learning methods for cybersecurity. IEEE Access,
6:35365–35381, 2018.

[9] Neha R. Israni and Anil N. Jaiswal. A survey on various phishing and anti-
phishing measures. International journal of engineering research and technology, 4,
2015.

[10] Pingchuan Liu and Teng-Sheng Moh. Content based spam e-mail filtering. pages
218–224, 10 2016.

[11] N. Agrawal and S. Singh. Origin (dynamic blacklisting) based spammer
detection and spam mail filtering approach. In 2016 Third International Conference
on Digital Information Processing, Data Mining, and Wireless Communications
(DIPDMWC), pages 99–104, 2016.

[12] Vikas Sahare, Sheetalkumar Jain, and Manish Giri. Survey: anti-phishing
framework using visual cryptography on cloud. JAFRC, 2, 01 2015.

[13] S. Patil and S. Dhage. A methodical overview on phishing detection along with
an organized way to construct an anti-phishing framework. In 2019 5th International
Conference on Advanced Computing Communication Systems (ICACCS), pages
588–593, 2019.

[14] Dipesh Vaya, Sarika Khandelwal, and Teena Hadpawat. Visual cryptography: A
review. International Journal of Computer Applications, 174:40–43, 09 2017.

[15] Saurabh Saoji. Phishing detection system using visual cryptography, 03 2015.

[16] C. Pham, L. A. T. Nguyen, N. H. Tran, E. Huh, and C. S. Hong. Phishing-aware:
A neuro-fuzzy approach for anti-phishing on fog networks. IEEE Transactions on
Network and Service Management, 15(3):1076–1089, 2018.

[17] K. S. C. Yong, K. L. Chiew, and C. L. Tan. A survey of the qr code phishing: the
current attacks and countermeasures. In 2019 7th International Conference on Smart
Computing Communications (ICSCC), pages 1–5, 2019.

[18] G. Egozi and R. Verma. Phishing email detection using robust nlp techniques. In
2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 7–
12, 2018.

[19] J. Mao, W. Tian, P. Li, T. Wei, and Z. Liang. Phishing-alarm: Robust and
efficient phishing detection via page component similarity. IEEE Access, 5:17020–
17030, 2017.

[20] G. J. W. Kathrine, P. M. Praise, A. A. Rose, and E. C. Kalaivani. Variants of
phishing attacks and their detection techniques. In 2019 3rd International Conference
on Trends in Electronics and Informatics (ICOEI), pages 255–259, 2019.

[21] Muhammet Baykara and Zahit Gurel. Detection of phishing attacks. pages 1–5,
03 2018.

[22] Prof. Gayathri Naidu. A survey on various phishing detection and prevention
techniques. International Journal of Engineering and Computer Science, 5(9), May 2016.

[23] E. Zhu, Y. Chen, C. Ye, X. Li, and F. Liu. Ofs-nn: An effective phishing
websites detection model based on optimal feature selection and neural network.
IEEE Access, 7:73271–73284, 2019.

[24] Mahdieh Zabihimayvan and Derek Doran. Fuzzy rough set feature selection to
enhance phishing attack detection, 03 2019.

[25] P. Yang, G. Zhao, and P. Zeng. Phishing website detection based on
multidimensional features driven by deep learning. IEEE Access, 7:15196–15209,
2019.

[26] T. Nathezhtha, D. Sangeetha, and V. Vaidehi. Wc-pad: Web crawling based
phishing attack detection. In 2019 International Carnahan Conference on Security
Technology (ICCST), pages 1–6, 2019.
