Detecting Phishing Website With Code Implementation
Souhardya Gayen
15/05/2023
Abstract
Existing anti-phishing approaches based on machine learning extract only a few features,
often from other sources, which makes the detection of fraudulent websites comparatively slow
and unfit for real-time execution. This paper addresses the problem by examining the
characteristics of phishing and legitimate websites in depth and analysing thirty features that
distinguish phishing websites from legitimate URLs. The values of these thirty features are
extracted directly from the URLs of the websites and are therefore independent of any third
party. The resulting approach achieves an accuracy of 93%.
This results in a more realistic, reliable, and well-informed computational approach.
1.0 Introduction
Phishing is a social engineering attack that exploits weaknesses at the user's end of a system.
For example, a system may be technically secure against password theft, yet an unaware user
may leak his or her password when an attacker sends a false password-update request through
a forged (phished) website. To address this issue, a layer of protection must be added on the
user side. In a phishing attack, a criminal sends an email or a URL while pretending to be
someone or something he is not, in order to get sensitive information out of the victim.
Driven by curiosity or a sense of urgency, victims are likely to acquiesce and enter details
such as a username, password, or credit card number. A recent example is a Gmail phishing
scam that targeted around 1 billion Gmail users worldwide.
Phishing attacks have become a major concern for the cyber world. They cause enormous
privacy and financial problems for internet users. Scammers, known as phishers, create
false websites that look and feel genuine in order to deceive people. They spoof emails to
steal the identity of legitimate users, and they gather confidential personal information such
as passwords, account information, and credit card details used for transactions. Phishers
constantly change their attack strategies. Social engineering is one of their essential
techniques: with it, they gather personal credentials by appearing to be a trustworthy party.
Phishers create false websites and spoofed emails that closely resemble, and sometimes look
exactly like, those of a real company. Sometimes the attackers act like a real source and
pressure users into updating the system. Moreover, they threaten to suspend the customer's
account and demand a ransom. Email spoofing is another technique used for phishing fraud,
in which customers are misled into disclosing private information such as passwords and
credit card numbers. Phishing is thus mainly used to steal valuable information such as bank
account, password, and credit card details. This type of scam is increasing rapidly, and
individuals and business people are losing their trust in online business. The result is a
negative impression of online business among clients who have lost faith in online
transactions. Even though encryption software is used to protect information in computer
storage, such systems are also vulnerable to attack. In this paper, phishing detection is
performed using machine learning (ML).
Because a phishing attack gives attackers a foothold in corporate networks and access
to vital information, it is important to safeguard users from becoming victims of fraud.
Phishing detection tools therefore play a vital role in ensuring security, which is why I
chose to work on this topic.
There are several methods used to assess the detection of phishing websites, including
vulnerability assessments, penetration testing, and user awareness training.
Penetration testing is a targeted approach that involves simulating an attack on an
organization's systems to identify weaknesses that could be exploited. This testing can
help identify specific weaknesses in an organization's security system that could be
used to launch a phishing attack.
I use the phishing website dataset from Kaggle for this project. The dataset can be found here:
https://www.kaggle.com/datasets/akashkr/phishing-website-dataset
Source:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0258361
https://www.researchgate.net/publication/328541785_Phishing_Website_Detection_using_Machine_Learning_Algorithms
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8504731/
https://www.sciencedirect.com/topics/computer-science/website-phishing-detection
https://towardsdatascience.com/phishing-domain-detection-with-ml-5be9c99293e5
6.0 Benchmarking
Histogram Plot For Phishing Data
Correlation Matrix For Phishing Data
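The numbers behind a histogram and a correlation matrix can be sketched without any plotting
library. The snippet below is a minimal illustration using only the Python standard library;
the column names and the toy values are hypothetical and are not taken from the dataset. It
assumes features and labels encoded as -1 and 1.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns maps a column name to its values; returns a nested dict of correlations."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy columns (assumption: -1 = phishing-like, 1 = legitimate-like).
toy = {
    "having_ip": [1, 1, -1, -1, 1, -1],
    "url_length": [1, 1, -1, -1, 1, -1],
    "label": [-1, -1, 1, 1, -1, 1],
}

label_counts = Counter(toy["label"])  # bar heights of a histogram of the label column
matrix = correlation_matrix(toy)      # values a correlation heatmap would display
```

Columns whose correlation with the label is close to zero are natural candidates to drop
during feature selection, while pairs of features that correlate strongly with each other
carry largely redundant information.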
Phishing is a type of online fraud that targets individuals and organizations by tricking
them into giving away sensitive information, such as passwords or credit card numbers.
To help prevent phishing attacks, several initiatives and mechanisms focus on the
detection of phishing websites.
One of the most notable initiatives is the Anti-Phishing Working Group (APWG),
which is an international coalition dedicated to eliminating phishing and other online
fraud. The APWG works to bring together industry, government, and law enforcement
organizations to share information and best practices for detecting and preventing
phishing attacks.
In addition to the APWG, another important mechanism is the use of URL blacklists.
URL blacklists are lists of known phishing websites that are automatically blocked by
web browsers and other software. These lists are updated regularly to ensure that users
are protected from the latest phishing threats.
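A blacklist lookup of this kind can be sketched in a few lines. This is a minimal
illustration only: the blacklist entries are made-up placeholder domains, and real browsers
rely on regularly updated remote lists rather than a hard-coded set.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries, for illustration only.
BLACKLIST = {"login-paypa1.example", "secure-update.example"}


def is_blacklisted(url, blacklist=BLACKLIST):
    """Return True if the URL's host, or any parent domain of it, is on the blacklist.

    The URL must include a scheme (for example "https://"); otherwise urlsplit
    cannot identify the host and the function returns False.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # For "www.a.example" this checks "www.a.example", "a.example" and "example".
    candidates = {".".join(labels[i:]) for i in range(len(labels))}
    return not candidates.isdisjoint(blacklist)
```

For example, `is_blacklisted("http://www.login-paypa1.example/signin")` returns `True`
because the parent domain is listed, while a URL on an unlisted host returns `False`. The
weakness of the approach is visible here as well: a phishing domain that has not yet been
reported is never matched, which motivates the learned detection described in this report.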
Beyond these, technical measures can also be used to detect
phishing websites. Machine learning algorithms can be used to analyze website content
and identify suspicious patterns. Certificate validation can also be used to verify the
authenticity of a website's SSL certificate.
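Certificate validation can be sketched with Python's standard `ssl` module. The function
name `has_valid_certificate` is hypothetical and the snippet is an illustration, not part of
the project's code. Note that a valid certificate only proves control of the domain;
phishing sites can and do obtain valid certificates, so this check is one signal among many.

```python
import socket
import ssl


def build_context():
    """Default client context: verifies the certificate chain and the hostname."""
    return ssl.create_default_context()


def has_valid_certificate(hostname, port=443, timeout=5.0):
    """Return True if a TLS handshake with certificate verification succeeds.

    Any failure (invalid or expired certificate, hostname mismatch, or a plain
    network error) returns False, so False does not always mean a bad certificate.
    """
    context = build_context()
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname):
                return True
    except OSError:  # ssl.SSLError and socket errors are both subclasses of OSError
        return False
```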
This section describes how I framed the concept using exploratory data analysis (EDA), data
cleaning, feature selection algorithms, and machine learning algorithms, and how I evaluated
the performance. The following figure (Fig 1) summarises the steps of the proposed methodology,
which aims at building a website that detects whether a given website is fraudulent or trustworthy.
STEP 1: DATA COLLECTION
STEP 2: FEATURES EXTRACTION
STEP 3: EXPLORATORY DATA ANALYSIS
STEP 4: DATA CLEANING
STEP 5: FEATURE SELECTION TECHNIQUE
STEP 6: MACHINE LEARNING ALGORITHM
Fig 1: Data collection -> Features extraction -> EDA -> Data cleaning -> Feature selection
technique -> Machine learning algorithm
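The feature extraction step can be illustrated with a small sketch that derives a handful of
URL-based features using only the standard library. The feature names, the thresholds, the
shortener list, and the -1/1 encoding are assumptions made for illustration; they are not
the thirty features used in this project.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative list of URL shortening services (assumption, not exhaustive).
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def _is_ip(host):
    """True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Map a URL to features encoded as -1 (suspicious) or 1 (unsuspicious)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return {
        "has_ip_address": -1 if _is_ip(host) else 1,
        "long_url": -1 if len(url) > 75 else 1,            # threshold is an assumption
        "has_at_symbol": -1 if "@" in url else 1,
        "uses_shortener": -1 if host in SHORTENERS else 1,
        "hyphen_in_domain": -1 if "-" in host else 1,
        "many_subdomains": -1 if host.count(".") > 3 else 1,  # threshold is an assumption
        "uses_https": 1 if parts.scheme == "https" else -1,
    }
```

Because every value is computed from the URL string alone, no third-party service has to be
queried at prediction time, which is the property the abstract relies on for real-time use.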
After applying the machine learning algorithms, I trained and tested the models and
performed model evaluation to determine which model gives the best performance. My
source code produced the following result.
The accuracy of the initial model is given below:
I took samples from the dataset and checked manually whether the model gives correct and
appropriate results.
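The kind of check described here can be sketched as follows. The labels and predictions
below are made-up toy values, not the results of this project, and the encoding (-1 for
phishing, 1 for legitimate) is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=-1):
    """Count true/false positives and negatives, treating `positive` as phishing."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of all predictions that are correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


def precision(counts):
    """Share of URLs flagged as phishing that really are phishing."""
    flagged = counts["tp"] + counts["fp"]
    return counts["tp"] / flagged if flagged else 0.0


def recall(counts):
    """Share of real phishing URLs that were flagged."""
    actual = counts["tp"] + counts["fn"]
    return counts["tp"] / actual if actual else 0.0


# Toy example (hypothetical values).
y_true = [-1, -1, -1, 1, 1, 1, 1, -1]
y_pred = [-1, -1, 1, 1, 1, -1, 1, -1]
counts = confusion_counts(y_true, y_pred)
```

Accuracy alone can hide the more costly error type. For a phishing detector, a missed
phishing site (false negative) is usually worse than a false alarm, so recall is worth
reporting next to accuracy.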
The product needs the following functions to work well and provide good results.
Back-end
Model Development: This must be done before releasing the service. A lot of manual
supervised machine learning must be performed to optimize the automated tasks.
1. Performing EDA to identify the dependent and independent features.
2. Algorithm training and optimization, including hyperparameter tuning, must be done
to minimize overfitting of the model.
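One common way to tune hyperparameters while guarding against overfitting is k-fold
cross-validation. The source does not say which procedure was used, so the following is
only a sketch under that assumption; `fit` and `score` are hypothetical callables that the
caller supplies for whichever model is being tuned.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(fit, score, X, y, k=5, seed=0):
    """Average the held-out score of fit(X_train, y_train) over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)
```

Calling `cross_validate` once per candidate hyperparameter value and keeping the value with
the best average score selects settings on data the model did not see during fitting.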
Front End
1. Different user interface: The user must be given many options to choose from in
terms of parameters. This can only be optimized after a lot of testing and analysis of
all the edge cases.
2. Interactive visualization: The data extracted from the trained models will be raw
and inscrutable. It must be presented in an aesthetic and "easy to read" style.
3. Feedback system: A valuable feedback system must be developed to understand the
customer’s needs that have not been met. This will help us train the models
constantly.
I created a web app using Python and HTML, used an SVM model trained on the data,
and deployed the web app on the Render cloud platform.
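The source does not state which SVM implementation or kernel the web app uses. To show the
idea behind the model, the sketch below trains a linear SVM in plain Python by batch
sub-gradient descent on the regularised hinge loss. It is a dependency-free illustration,
not the deployed model; the hyperparameter values and the toy data are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b for labels in {-1, 1} using the hinge loss.

    Minimises lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
    """
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularisation term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:               # only margin violators contribute
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, x):
    """Return 1 or -1 depending on which side of the hyperplane x falls."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: three -1/1 features, label follows the first feature.
X_toy = [[1, 1, 1], [1, 1, -1], [1, -1, 1], [-1, -1, -1], [-1, -1, 1], [-1, 1, -1]]
y_toy = [1, 1, 1, -1, -1, -1]
model = train_linear_svm(X_toy, y_toy)
```

In a web app, a handler would turn the submitted URL into a feature vector and pass it to a
`predict`-style function. A library implementation with kernel support would normally be
preferred over hand-written training code for real use.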
15.0 Conclusion
In conclusion, detecting and preventing phishing attacks is crucial for maintaining online
security. There are various strategies and tools available to detect phishing websites,
including technical solutions, user education, and targeted detection measures. While
there are challenges and constraints to effective detection, investing in anti-phishing
solutions is a wise decision for organizations and individuals looking to protect their
sensitive information and prevent financial loss. By staying vigilant and taking proactive
measures, we can work to stay safe and secure in an increasingly digital world.