Report
As the Internet becomes ever more deeply integrated with social life, it is changing how
people learn and work, while also exposing us to increasingly serious security attacks.
Recognizing network threats, especially attacks not seen before, is a primary issue that
needs immediate attention. Phishing URLs aim to collect private information such as a
user’s identity, passwords and online financial transactions. Phishers use sites that are
visually and semantically similar to authentic websites. Since most users go online to
access services provided by government and financial organizations, phishing threats
and attacks have increased significantly in recent years. As technology grows, phishing
methods have progressed rapidly, and they should be countered with anti-phishing
techniques that detect phishing. Machine learning is a powerful tool that can be used
against phishing attacks. This study develops a model that can predict whether a
URL link is legitimate or phishing.
Cybersecurity professionals are now looking for trustworthy and stable techniques for
detecting phishing websites. By extracting and evaluating numerous features of legitimate
and phishing URLs, this project uses machine learning to detect phishing URLs.
In conclusion, the study provided a model for URL classification into phishing and legitimate
URLs. This would be very valuable in assisting individuals and companies in identifying
phishing attacks by authenticating any link supplied to them to prove its validity.
AI : Artificial Intelligence
INTRODUCTION
1.1 Background
Artificial intelligence is a field of science that studies and develops theories,
methods, techniques, and applications that simulate and extend human
intelligence. Machine learning (ML) is a branch of artificial intelligence that is closely
related to computational statistics, which also concentrates on making predictions with
the use of computers. Machine learning has strong ties to mathematical optimization,
which supplies methods, theory and application domains to the field. ML is sometimes
conflated with data mining, but the data mining subfield focuses more on
exploratory data analysis and is known as unsupervised learning. ML can
likewise be unsupervised and be used to learn and establish baseline profiles for
various entities, which are then used to find meaningful anomalies.
Because of the increase in phishing attacks, numerous solutions have been proposed
to address the issue. There are several ways to build a framework that provides a
defense against phishing attacks. Methods for detecting phishing attacks include
blacklist-based, whitelist-based, fuzzy rule-based, CANTINA-based, machine
learning-based, heuristic and image-based approaches. Several other studies discuss
a variety of methods and techniques to detect the different types of phishing attacks.
Phishing sites look like genuine websites, and many individuals have trouble
recognizing them. A few anti-phishing techniques are built into some browsers.
1.2 Motivation
There are many anti-phishing techniques that help protect us from phishing sites.
Mozilla Firefox, Safari and Google Chrome make use of the Google Safe Browsing
(GSB) service to block phishing websites. Other widely used tools include
McAfee SiteAdvisor, Quick Heal, Avast and Netcraft. GSB
analyzes a URL using the blacklist approach. The main disadvantage of
GSB is that it fails to detect a phishing website when the blacklist has not been
updated. In the case of Netcraft, a phishing website was recorded as phishing
although it wasn’t blocked. Netcraft blocks a website only when it is 100% sure
that the website is phishing. The warning is given only when the user clicks the
right button on the icon to find the risk rating. The risk arises when the individual
doesn’t check the rating or decides to use the site after checking the rating. Security
against online attacks is provided by some software such as Quick Heal and
The problem was derived from a thorough observation and study of
methods for classifying phishing websites using machine learning
techniques. We must design a system that can:
• Detect phishing quickly and cost-effectively.
• Accurately and efficiently classify websites as legitimate or phishing.
1.4 Scope
This study explores data science and machine learning models that use datasets obtained
from open-source platforms to analyze website links and distinguish between phishing
and legitimate URL links. The model will be integrated into a web application,
allowing a user to predict if a URL link is legitimate or phishing. This online
application is compatible with a variety of browsers.
Phishing attacks have become increasingly complex, and it is very difficult for an average
person to determine whether an email message link or website is legitimate. Cyber-attacks
by criminals who employ phishing schemes are prevalent and successful nowadays.
Hence, this project seeks to address fake URLs and domain names by identifying
phishing website links. Therefore, having a web application that provides the user an
1.7 Challenges
1.8 Methodology
The methodology used to achieve the earlier stated objectives is explained below. The
dataset consists of phishing and legitimate URLs obtained from
open-source platforms. The dataset was then pre-processed, that is, cleaned of
abnormalities such as missing data. Afterward, exploratory
data analysis was done on the dataset to explore and summarize it. Once the
dataset was free from all anomalies, website content-based features were extracted
from the dataset to get accurate features to train and test the model. An extensive
Hence, a series of machine learning classification models, namely Decision Tree,
Support Vector Machine, XGBoost, Multilayer Perceptron, Autoencoder Neural
Network and Random Forest, was deployed on the dataset to distinguish between
phishing and legitimate URLs. The model with the highest training accuracy among
the deployed models was selected and then integrated into a web application.
Thus, a user can enter a URL link on the web application to predict if it is phishing or
legitimate.
Chapter 2 covers the study of phishing attacks and their detection using machine
learning techniques. It gives a detailed description of earlier work in this area and
the limitations of those related works.
Chapter 4 describes the system design and its representation using architecture, data
flow and activity diagrams. It gives a graphical and diagrammatic
representation of the system for better understanding, from the system’s, user’s and
run-time perspectives.
Chapter 7 discusses the outcome obtained and the environmental setup of the
project.
I conclude the project in chapter 8 and also discuss future enhancements to
the project.
LITERATURE SURVEY
This chapter offers an insight into various important studies conducted by
scholars, drawn from articles, books, and other sources relevant to the detection of
phishing websites. It also provides the project with a theoretical review, conceptual
review, and empirical review to demonstrate understanding of the project.
5. “Phishing URL Detection via CNN and Attention-Based Hierarchical RNN” (2019)
   Approach: NN module is used to derive high representation of spatial feature that is character level of the URLs. Then the representational features are combined by using a CNN of 3 layers to create precise feature representations of URLs. That is then used for training the classifier of phishing URLs.
   Limitation: false positive rate is

7. “A new method for Detection of Phishing Websites: URL Detection” (2020)
   Approach: The three major phases in this work are Parsing, Heuristic Classification of data, Performance Analysis in this model. All of these phases use various and distinctive methods for data processing to get results that are better.
   Limitation: Does not give full information about the techniques used
From the above, ML methods play a vital role in many applications of cybersecurity
and remain a promising direction that attracts further investigation. In
practice, however, there are several barriers that act as limitations during
The existing phishing detection techniques suffer from low detection accuracy
and high false alarm rates, especially when new phishing approaches are introduced.
Moreover, the most common technique is the blacklist-based method,
which is inefficient in responding to emerging phishing attacks: since registering a
new domain has become easier, no comprehensive blacklist can ensure a perfectly
up-to-date database for phishing detection.
The proposed phishing detection system utilizes machine learning models. The
system comprises two major parts, which are the machine learning models and a web
application. These models consist of Decision Tree, Support Vector Machine,
XGBoost, Logistic Regression, and Random Forest. These models were selected after
comparing the performance of multiple machine learning algorithms.
Each of these models is trained and tested on website content-based features
extracted from both phishing and legitimate datasets. The model with the
highest accuracy is then selected and integrated into a web application that enables a
user to predict if a URL link is phishing or legitimate.
i. Will be able to differentiate between phishing (0) and legitimate (1) URLs
In this chapter we focused on existing systems through a literature survey. We
analyzed various research papers, noted the important points of each paper,
and included related diagrams or graphs. In the comparison section we
highlighted a few important advantages and disadvantages of each paper and
compared the papers. This chapter also introduced the drawbacks of the existing
system and the functionality and advantages of the proposed system.
ANALYSIS
This chapter describes the various processes, methods, and procedures adopted by the
researcher to achieve the set aim and objectives, and the conceptual structure within
which the research was conducted.
The methodology of any research work refers to the research approach adopted by the
researcher to tackle the stated problem. Since the efficiency and maintainability of
any application depend largely on how its design is prepared, this chapter
provides detailed descriptions of the methods employed to address the stated
objectives of the research work.
• Reusability: The same code with limited changes can be used for detecting
variants of phishing attacks such as smishing, vishing, etc.
• Usability: The software used is very user friendly and open source. It also
runs on any operating system.
The architecture of the system is shown in fig 4.1. The URLs to be classified as
legitimate or phishing are fed as input to the appropriate classifier. The classifier, which
has been trained on the training dataset to classify URLs as phishing or legitimate,
uses the patterns it has learned to classify the newly fed input. Features such as IP
address, URL length, domain, presence of a favicon, etc. are extracted from the URL and a
list of their values is generated. The list is fed to classifiers such as KNN, kernel
SVM, Decision Tree and Random Forest. The performance of these models is then
evaluated and an accuracy score is generated. Using the generated list, the trained
classifier predicts if the URL is legitimate or phishing. The list contains the values 1,
0 and -1 if a feature exists, is not applicable, or doesn’t exist,
respectively. There are 30 features being considered in this project.
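To illustrate how a list of feature values can drive a prediction, the following is a minimal sketch of a k-nearest-neighbour classifier written with the Python standard library only. It is not the project's implementation: the toy vectors use 4 features instead of 30, the Hamming distance and `k=3` are assumptions, and the names `hamming_distance` and `knn_predict` are hypothetical.

```python
from collections import Counter

def hamming_distance(a, b):
    """Count the positions at which two equal-length feature lists differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: hamming_distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training data (assumed): each value is 1 (feature exists),
# 0 (not applicable) or -1 (feature doesn't exist).
train_X = [
    [1, 1, 1, -1],
    [1, 1, -1, 1],
    [1, 0, 1, 1],
    [-1, -1, -1, 0],
    [-1, -1, 0, -1],
    [-1, 0, -1, -1],
]
train_y = ["phishing", "phishing", "phishing",
           "legitimate", "legitimate", "legitimate"]

print(knn_predict(train_X, train_y, [1, 1, 1, 1]))
print(knn_predict(train_X, train_y, [-1, -1, -1, -1]))
```

An odd `k` avoids tied votes when there are only two classes.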
Python provides a way to put definitions in a file and use them in a script
or in an interactive instance of the interpreter. Such a file is known as a module;
definitions from a module can be imported into other modules or into the main module.
Some of the modules used in the project are shown in Table 3.1.
The machine learning model is nothing but a piece of code; an engineer or data
scientist makes it smart through training with data. So, if you give garbage to the
model, you will get garbage in return, i.e. the trained model will provide false or
wrong predictions.
The logistic function, also called the sigmoid function, was initially used by
statisticians to describe properties of population growth in ecology. The sigmoid
function is a mathematical function used to map the predicted values to probabilities.
The logistic function has an S-shaped curve and takes values between 0 and 1 but
never exactly at those limits.
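As a small illustration of the mapping described above, the sketch below implements the sigmoid function 1 / (1 + e^(-z)) and turns its output into a class label. The 0.5 threshold and the helper name `predict_label` are assumptions made for the example, not details taken from the project.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps a real number to a value between 0 and 1."""
    # Use the algebraically equivalent form for negative z so that
    # math.exp is never called with a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(z: float, threshold: float = 0.5) -> int:
    """Return 1 if the predicted probability reaches the (assumed) threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4), predict_label(z))
```

Mathematically the curve never reaches 0 or 1; in floating-point arithmetic very large inputs may still round to those values.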
DESIGN
An activity diagram is a behavioral diagram. Fig 4.5 shows the activity diagram of
the system. It depicts the control flow from a start point to an end point, showing
the various paths which exist during the execution of the activity.
DFDs are used to depict graphically the data flow in a system. They explain the
processes involved in a system from the input to the report generation and show all
possible paths from one entity to another in a system. The detail of a data flow
diagram can be represented at three different levels, numbered 0, 1 and 2.
There are many notations for drawing a data flow diagram, among which the
Yourdon-Coad and Gane-Sarson methods are popular. The DFDs depicted in this
chapter use the Gane-Sarson DFD notation.
The context diagram (DFD level 0) shows the system as a high-level process with its
relationship to the external entities. It should be easily understood by a wide
audience, from stakeholders to developers to data analysts.
DFD level 1 gives a more detailed explanation of the Context diagram. The high-level
process of the Context diagram is broken down into its subprocesses. The DFD level
1 of the system is depicted in fig 4.3.
The Level 1 DFD goes a step deeper by including the processes involved in the system,
such as feature extraction, splitting of the dataset, building the classifier, etc., and
hence gives a more detailed view of the system.
DFD level 2 goes one more step deeper into the subprocesses of Level 1. Fig 4.4
shows the DFD level 2 of the system. It might require more text to get into the
necessary level of detail about the functioning of the system. Level 2 gives a
more detailed view of the system by grouping the processes involved in the system
into three categories, namely preprocessing, feature scaling and classification. It also
graphically depicts each of these categories in detail and gives a complete idea of how
the system works.
4.4 Summary
The system’s architecture, the processes involved from input to output at varying
levels of complexity, and the system’s behaviour were graphically represented in this
chapter for a better understanding of the system.
IMPLEMENTATION
5.1 Introduction
This chapter of the report illustrates the approach employed to classify the URLs as
either phishing or legitimate. The methodology involves building a training set. The
training set is used for training a machine learning model, i.e., the classifier. Fig 5.1
shows the diagrammatic representation of the implementation.
Additionally, Python supports the use of modules and packages, which means that
programs can be designed in a modular style and code can be reused across a variety
of projects. Once you have developed a module or package you need, it can be scaled
for use in other projects, and it’s easy to import these modules.
One of the most promising benefits of Python is that both the standard library and the
interpreter are available free of charge, in both binary and source form. There is no
exclusivity either, as Python and all the necessary tools are available on all major
platforms. Therefore, it is an enticing option for developers who don't want to worry
about paying high development costs.
That makes Python accessible to almost anyone. If you have the time to learn, you can
create some amazing things with the language.
MACHINE LEARNING
Machine learning provides simplified and efficient methods for data analysis. It has
recently shown promising results in real-time classification problems. The key
advantage of machine learning is the ability to create flexible models for specific
tasks like phishing detection. Since phishing detection is a classification problem,
machine learning models can be used as a powerful tool. Machine learning models can
adapt quickly to changes and identify patterns of fraudulent transactions, which helps
to develop a learning-based identification system. Most of the machine learning models
discussed here are supervised: the
algorithm tries to learn a function that maps an input to an output based on example
input-output pairs. It infers a function from labeled training data consisting of a set of
training examples.
PANDAS
NUMPY
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package,
Numarray, was also developed, having some additional functionalities. In 2005,
Travis Oliphant created the NumPy package by incorporating the features of Numarray
into the Numeric package. There are many contributors to this open source project.
Operations using NumPy
Using NumPy, a developer can perform the following operations:
• Operations related to linear algebra. NumPy has in-built functions for linear algebra
and random number generation.
Figure 5.3 shows the phishing detection web interface system. The user inputs a URL
link and the website validates the format of the URL and then predicts if the link is
phishing or legitimate.
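The format check mentioned above could look roughly like the following sketch. It is written in Python for illustration only and is not the validation code of the web application; the rules it applies (an http or https scheme and a dotted host name) and the name `is_well_formed_url` are assumptions.

```python
from urllib.parse import urlparse

def is_well_formed_url(url: str) -> bool:
    """Return True if `url` has an http(s) scheme and a plausible host name."""
    candidate = url.strip()
    if " " in candidate:
        return False
    try:
        parts = urlparse(candidate)
        host = parts.hostname
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. unbalanced IPv6 brackets.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # Assumed rule: require a dot in the host, so bare names are rejected.
    return bool(host) and "." in host

for sample in ("https://example.com/login", "example", "ftp://example.com"):
    print(sample, is_well_formed_url(sample))
```

Only inputs that pass such a check would be forwarded to feature extraction and the classifier.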
5.4 Dataset
5.5.1 Splitting:
The dataset was split into a training set and a testing set, with 75% for training and
25% for testing, using the “train_test_split” method. The splitting was done after
assigning the dependent variables and independent variables.
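The idea behind a 75/25 split can be sketched with the standard library alone: shuffle the row indices, then hold out a quarter of them. This is an illustration, not the project's code; the function name `split_dataset`, the seed and the toy data are assumptions. In scikit-learn, `train_test_split(X, y, test_size=0.25)` performs a comparable shuffled split.

```python
import random

def split_dataset(X, y, test_fraction=0.25, seed=42):
    """Shuffle row indices, then hold out `test_fraction` of the rows for testing."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data (assumed): independent variables X and dependent variable y.
X = [[i, -i] for i in range(8)]
y = [i % 2 for i in range(8)]
X_train, X_test, y_train, y_test = split_dataset(X, y)
print(len(X_train), len(X_test))
```

Each feature row keeps its own label after the shuffle, which is why the indices are shuffled rather than the two lists separately.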
5.5.2 Preprocessing:
Preprocessing involves filling in or removing missing data to obtain a clean
dataset. However, the dataset chosen was already preprocessed and did not
require any further preprocessing from my end. The only step to be performed in
preprocessing was feature scaling.
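The report does not say which scaling method was applied. As one common possibility, the sketch below assumes standardization (zero mean, unit variance per feature), with the statistics computed on the training rows only; the names `fit_scaler` and `transform` are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute the per-column mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize each value as (value - column mean) / column deviation."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy training rows (assumed), using the 1 / 0 / -1 feature encoding.
train_rows = [[1, -1, 0], [-1, -1, 1], [1, -1, -1], [-1, -1, 0]]
means, stds = fit_scaler(train_rows)
scaled = transform(train_rows, means, stds)
print(scaled[0])
```

The same `means` and `stds` would then be reused to transform the test rows, so that no information from the test set leaks into training.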
Feature values are extracted using Python modules like whois, requests, socket, re,
ipaddress, BeautifulSoup, etc. to get information regarding the IP address, length of the
URL, domain name, subdomains, presence of a favicon, etc. The values obtained are
stored in a list. This is done because the dataset is in this format, and hence the
classifier is trained with input in this format. Therefore, when a URL is passed as input
to the system, it is converted into a Python list of 30 elements, each representing its
respective feature, and that list is then fed to the trained classifier. The classifiers
used include KNN, kernel SVM, Decision Tree and Random Forest.
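A reduced sketch of this conversion is given below for three of the features, using only the standard library. It follows the encoding stated earlier (1 if the feature exists, 0 if not applicable, -1 if it doesn't exist), but the length cut-off, the naive dot-counting rule for subdomains and the function names are assumptions, not the project's actual rules.

```python
import ipaddress
from urllib.parse import urlparse

LONG_URL_THRESHOLD = 75  # hypothetical cut-off, not taken from the report

def host_is_ip(host: str) -> bool:
    """Return True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def extract_features(url: str) -> list:
    """Return [ip_in_host, long_url, has_subdomain] encoded as 1, 0 or -1."""
    host = urlparse(url).hostname
    if not host:
        ip_flag, subdomain_flag = 0, 0   # no host: both checks not applicable
    elif host_is_ip(host):
        ip_flag, subdomain_flag = 1, 0   # subdomains are meaningless for an IP
    else:
        ip_flag = -1
        # Naive rule: two or more dots are taken to indicate a subdomain.
        subdomain_flag = 1 if host.count(".") >= 2 else -1
    length_flag = 1 if len(url) > LONG_URL_THRESHOLD else -1
    return [ip_flag, length_flag, subdomain_flag]

print(extract_features("http://192.168.0.1/login"))
print(extract_features("https://login.secure.example.com/"))
```

A full extractor would append the remaining features in the same order as the columns of the training dataset.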
A one-page phishing detection web application has been developed to run on any
browser. The application was developed using programming languages such as
HTML, CSS, PHP, and JavaScript. The phishing detection web application has the
following pages:
The home page contains a section for a user to enter a URL and predict if it is
phishing or legitimate. It predicts the state of the URL based on the feature selection.
The purpose of this page is to help its users validate a URL link.
The FAQs page contains a series of questions and answers about phishing attacks
and how users can protect themselves from phishing sites.
This chapter discussed the working of the system through the proposed system
architecture. The flow diagram shows the working of the proposed system and the
software requirement specification. The project is also explained through the
architecture of the proposed system.
7.1 Introduction
System testing is a series of different tests whose primary purpose is to fully
exercise the computer-based system. Although each test has a different purpose, all
work to verify that all the system elements have been properly integrated and perform
their allocated functions. The testing process is carried out to make sure that the
web application does exactly what it is supposed to do. In the testing
stage, the following goals are pursued:
In this chapter, we check the working of the proposed system by testing and
comparing the result of the algorithm with the actual result. This essentially validates
the system. The testing is done for each algorithm with a legitimate and a phishing URL
and the results are as follows.
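The comparison between predicted and actual results can be quantified as sketched below. The labels follow the convention stated earlier in the report (phishing = 0, legitimate = 1); the sample label lists are made up for illustration and are not the project's results, and the function names are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of URLs whose predicted label matches the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def confusion_counts(actual, predicted, positive=0):
    """Count TP, FP, FN, TN, treating `positive` (phishing = 0) as the target class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Illustrative labels only.
actual = [0, 0, 1, 1, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 1, 0, 0]
print(accuracy(actual, predicted))
print(confusion_counts(actual, predicted))
```

The false-negative count (phishing URLs predicted as legitimate) is usually the most important figure for a phishing detector.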
Unit testing, also known as component testing, refers to tests that verify the
functionality of a specific section of code, usually at the function level. In an object-
oriented environment, this is usually at the class level, and the minimal unit tests
include the constructors and destructors. Unit testing is a software development
process that involves synchronized application of a broad spectrum of defect
prevention and detection strategies in order to reduce software development risks,
time and costs. The following Unit testing table shows the functions that were tested
at the time of programming. The first column gives all the modules which were tested,
Functional testing is a type of testing that seeks to establish whether each application
feature works as per the software requirements. Each function is compared to the
corresponding requirement to ascertain whether its output is consistent with the end
user’s expectations. The testing is done by providing sample inputs, capturing
resulting outputs, and verifying that actual outputs are the same as expected outputs.
Integration testing is any type of software testing that seeks to verify the interfaces
between components against a software design. Software components may be
integrated in an iterative way or all together. Normally the former is considered a
better practice since it allows interface issues to be located more quickly and fixed.
Integration testing works to expose defects in the interfaces and interaction between
integrated components (modules). Progressively larger groups of tested software
components corresponding to elements of the architectural design are integrated and
tested until the software works as a system.
7.4 Summary
This chapter discussed the importance of testing and the various methods that are used
to test the model built. This helps us to understand the performance of the system and
make the necessary changes accordingly.
8.1 Conclusion
Further work can be done to enhance the model by using ensemble models to get
a greater accuracy score. Ensemble methods are an ML technique that combines many
base models to generate an optimal predictive model. Further-reaching future work
would be combining multiple classifiers, trained on different aspects of the same
training set, into a single classifier that may provide a more robust prediction than any
of the single classifiers on their own.
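One simple way to combine classifiers is hard (majority) voting, sketched below. This is only one of several ensemble strategies and is offered as an illustration; the three prediction lists are invented, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote per sample."""
    if not predictions or len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Invented predictions from three base models (phishing = 0, legitimate = 1).
model_a = [0, 1, 1, 0]
model_b = [0, 0, 1, 1]
model_c = [1, 1, 1, 0]
print(majority_vote([model_a, model_b, model_c]))
```

Using an odd number of base models avoids tied votes in a two-class problem.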
8.3 Recommendation
Through this project, one can know a lot about phishing attacks and how to prevent
them. This project can be taken further by creating a browser extension that can be
installed on any web browser to detect phishing URL Links.