final_thesis_report_merged

The project report presents a deep learning-based phishing detection system that utilizes URLs and website content to identify malicious sites. It employs various deep learning architectures, including CNNs and RNNs, to analyze both the syntactical structure of domain names and the semantic content of web pages. The study highlights the effectiveness of this approach in detecting phishing attacks, demonstrating improved performance metrics such as accuracy and precision compared to traditional methods.

Uploaded by

Sathvik

A

Project Report
on

DEEP LEARNING BASED PHISHING


DETECTION SYSTEM USING URLS
AND WEBSITE CONTENT

Is submitted in partial fulfillment of the Requirements


for the Award of the Degree of

BACHELOR OF ENGINEERING
in
INFORMATION TECHNOLOGY
Submitted by

Sathvik Kadali 160121737034


Abhitej Reddy 160121737054
N.Abhishek 160121737305

SUPERVISOR
G. Srikanth
Assistant Professor,IT

DEPARTMENT OF INFORMATION TECHNOLOGY

CHAITANYA BHARATHI INSTITUTE OF TECHNOLOGY(A)

(Affiliated to Osmania University;Accredited by NBA,NAAC,ISO)

Kokapet(V),Gandipet(M),Hyderabad-500075
Website:www.cbit.ac.in

2024-2025
DECLARATION

We hereby declare that the thesis Deep Learning Based Phishing Detection
System using URLs and Website Content is original and has been carried out
by us under the supervision of Mr. G. Srikanth, CBIT, Hyderabad, for the
Degree of B.E. in INFORMATION TECHNOLOGY, and that the
Project/Dissertation/Thesis has been checked in anti-plagiarism software
(Turnitin), which reported 24% similarity. If anything is found to be copied
from other sources, we are solely responsible for the same and we will abide
by any action taken by the Institute authorities. (As per the Institute
guidelines, the Supervisor is also held responsible for any manipulation by
the Student.)

Place: Hyderabad
Date: 19/04/2025

Student Signature & Name(s):

160121737034 - Sathvik Kadali


160121737054 - Abhitej Reddy
160121737305 - N. Abhishek

Supervisor Signature:
Name: Mr. G. Srikanth, Assistant Professor, Dept of IT, CBIT
CERTIFICATE

This is to certify that the project work (Part-II) entitled DEEP LEARNING
BASED PHISHING DETECTION SYSTEM USING URLS AND WEBSITE CONTENT, submitted
by Sathvik Kadali (160121737034), Abhitej Reddy (160121737054), and
N. Abhishek (160121737305) in partial fulfillment of the requirements for
the award of the degree of Bachelor of Engineering in INFORMATION
TECHNOLOGY to CHAITANYA BHARATHI INSTITUTE OF TECHNOLOGY(A), affiliated to
OSMANIA UNIVERSITY, Hyderabad, is a record of bonafide work carried out by
them under my supervision and guidance. The results embodied in this report
have not been submitted to any other University or Institute for the award
of any other Degree or Diploma.

Project Guide: Head of the Department:


Mr. G. Srikanth Dr. M. Venu Gopalachari

Assistant Professor, Professor and I/C Head,


Department of IT Department of IT

Kokapet(V),Gandipet(M),Ranga Reddy (Dist.)–500075, Hyderabad, T.S.


www.cbit.ac.in
Acknowledgement
The satisfaction that accompanies the successful completion of the task
would be incomplete without the mention of the people who made it possible,
whose constant guidance and encouragement crown all the efforts with success.

We wish to express our deep sense of gratitude to Mr. G. Srikanth,
Assistant Professor and Project Supervisor, Department of Information
Technology, Chaitanya Bharathi Institute of Technology, for his able guidance
and useful suggestions, which helped us in the project.

We are particularly thankful to Prof. M. Venugopala Chari, the I/c
Head of the Department, Department of Information Technology, for his
guidance, intense support, and encouragement, which helped us to mould our
project into a successful one.

We show gratitude to our honorable Principal Prof. C. V. Narasimhulu
for providing all facilities and support.

We avail this opportunity to express our deep sense of gratitude and
heartfelt thanks to Mr. N. Subash Garu, President, CBIT, for providing a
congenial atmosphere to complete this project successfully.

We also thank all the staff members of the Information Technology depart-
ment for their valuable support and generous advice. Finally, thanks to all
our friends and family members for their continuous support and enthusiastic
help.

Sathvik Kadali 160121737034


Abhitej Reddy 160121737054
N Abhishek 160121737305
Abstract
Fake websites, often deployed in phishing attacks, are a growing cybersecu-
rity threat designed to deceive users and steal sensitive information like login
credentials and financial data. Traditional detection methods are frequently
outpaced by the sophistication of new phishing techniques. This study in-
troduces a deep learning-based detection system that leverages both domain
names and web page content to accurately identify malicious sites. The
system integrates deep learning architectures, such as Artificial Neural Net-
works (ANN), Convolutional Neural Networks (CNN), and Recurrent Neural
Networks (RNN), each contributing unique capabilities for feature extraction:
CNNs excel at identifying spatial patterns, while RNNs effectively analyse se-
quential data. Our approach includes tokenization, character embedding, and
web content analysis to provide adequate feature coverage for all the various
types of phishing attacks. By tokenizing and embedding domain names and
web content, the system attains a more enhanced feature representation of
URLs and web pages. CNNs are used to identify the spatial characteristics
of the domain and the structure of web content, and RNNs to identify the
sequential characteristics that may indicate suspicious activity. This combined
approach enables the model to evaluate not only the syntactical structure of
the domain names but also the semantic content of the web page. To measure
the performance of the proposed system, standard metrics including accuracy,
precision, recall and F1-score are applied. The results show that this
combined approach, where CNNs deal with URL structure and position of the
content, is an effective, fast and scalable method for fake website
detection.
Keywords: Phishing attacks, Deep learning-based detection, Convolutional
Neural Networks (CNN), Web content analysis, Fake website detection

Table of Contents

Abstract
List of Tables
List of Figures
Abbreviations
CHAPTER 1 Introduction
1.1 Overview
1.2 Applications
1.3 Problem Statement
1.4 Objectives
1.5 Organization of the Project
CHAPTER 2 Literature Survey
2.1 Paper 1: DEPHIDES: Deep Learning Based Phishing Detection System
2.2 Paper 2: Phishing Detection System Through Hybrid Machine Learning
Based on URL
2.3 Paper 3: An Enhanced Deep Learning-based Phishing Detection Mechanism
Using Variational Autoencoders
2.4 Paper 4: A Deep Learning-Based Phishing Detection System Using CNN,
LSTM, and LSTM-CNN
2.5 Paper 5: An Intelligent Cyber Security Phishing Detection System Using
Deep Learning Techniques
2.6 Paper 6: Phishing URL Detection: A Real-Case Scenario through Login URLs
2.7 Paper 7: A Lightweight and Proactive Rule-Based Incremental Construction
Approach to Detect Phishing Scam
2.8 Paper 8: Detection of Phishing URLs by Using Deep Learning Approach and
Multiple Features Combinations
2.9 Paper 9: Adopting Automated Whitelist Approach for Detecting Phishing
Attacks
CHAPTER 3 SYSTEM REQUIREMENT SPECIFICATION
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.3 Software Requirements
3.4 Hardware Requirements
CHAPTER 4 PROPOSED METHODOLOGY
CHAPTER 5 RESULTS
5.1 Model Evaluation and Performance Analysis
5.1.1 Key Observations
5.1.2 Execution Results of Phishing Website Detection System
CHAPTER 6 CONCLUSION
CHAPTER 7 Future Scope
List of Tables

5.1 Models Accuracy Comparison
5.2 Confusion Matrix for Gradient Boosting Classifier

List of Figures

4.1 Heatmap for correlation between different features
4.2 Scatterplot Matrix
5.1 The Gradient Boosting Classifier process showing sequential decision
trees correcting errors from the previous model.
5.2 CatBoost Classifier showing feature handling and boosting process.
5.3 Random Forest process with aggregation of multiple decision trees.
5.4 Multi-layer Perceptron process showing how layers of neurons learn
complex patterns.
5.5 Phishing Detection Website
5.6 Checking Whether the URL is Legitimate or Not

Abbreviations

Abbreviation Description

CBIT Chaitanya Bharathi Institute of Technology

IT Information Technology

ML Machine Learning

AI Artificial Intelligence

CNN Convolutional Neural Network

SVM Support Vector Machine

KNN K-Nearest Neighbors

PCA Principal Component Analysis

RFE Recursive Feature Elimination

LSTM Long Short-Term Memory

GUI Graphical User Interface

RNN Recurrent Neural Network

ReLU Rectified Linear Unit

JSON JavaScript Object Notation

PHP Hypertext Preprocessor

DNS Domain Name System

IP Internet Protocol

TLD Top-Level Domain


CHAPTER 1
Introduction
Phishing websites have emerged as one of the most pervasive threats in the
field of cybersecurity. These deceptive platforms are specifically designed to
mimic legitimate websites, tricking users into revealing confidential
information such as usernames, passwords, credit card numbers, and banking
credentials. The sophistication of phishing techniques continues to increase,
making traditional detection mechanisms, such as rule-based systems and
blacklists, less effective. These methods often rely on prior knowledge of
known threats and struggle to detect novel or slightly altered phishing
websites.

To address these limitations, recent research has shifted focus toward machine
learning (ML) approaches, which have demonstrated significant potential in
identifying phishing attempts by analyzing underlying patterns in website data.
ML models can detect previously unseen phishing websites by learning from
historical data and generalizing to new, unknown cases. Their adaptability and
ability to process vast amounts of data make them suitable for the dynamic
nature of phishing attacks.

This study investigates the application of machine learning techniques to the
problem of phishing website detection using a comprehensive dataset composed of
11,000+ records and 30 features. The features are grouped into three broad
categories: URL-based attributes, security-related indicators, and behavioral
or performance metrics. URL-based features include characteristics such as the
use of LongURL, presence of Prefix-Suffix, use of “@” symbols, or abnormal URL
formats that are often associated with phishing attempts. Security features
include the presence of HTTPS, age of domain registration, and the validity of
the SSL certificate. Behavioral attributes involve parameters like
WebsiteTraffic, GoogleIndex status, and PageRank, all of which provide insights
into the legitimacy and popularity of the site.

An initial Exploratory Data Analysis (EDA) was conducted to better understand
the relationships among features and their importance in identifying phishing
websites. Features like HTTPS, AnchorURL, Domain Registration Length, and
WebsiteTraffic emerged as key indicators. Phishing websites typically lack
secure connections, contain suspicious anchor links, and have recently
registered domains or receive minimal web traffic, all of which can be
quantified and utilized by machine learning algorithms.

To evaluate the effectiveness of ML algorithms, a comparative study was
conducted using multiple classifiers. These include Gradient Boosting,
CatBoost, Random Forest, Support Vector Machine (SVM), Multi-layer Perceptron
(MLP), Decision Tree, K-Nearest Neighbors (KNN), Logistic Regression, and Naïve
Bayes. Each of these classifiers has its own strengths: Random Forest and
CatBoost excel in handling large feature sets and dealing with non-linear data;
SVM is effective in high-dimensional spaces; MLP represents a deep
learning-based neural network approach; and Logistic Regression offers
interpretability in linear classification tasks. Naïve Bayes, while simple and
computationally efficient, often struggles with datasets where features are not
truly independent, which is the case here.

Performance metrics such as accuracy, precision, recall, and F1-score were used
to assess model performance. Results indicated that ensemble-based methods like
CatBoost and Random Forest achieved the highest accuracy, with strong
generalization across both training and test datasets. Models were tuned using
techniques like hyperparameter optimization and cross-validation to improve
robustness and prevent overfitting.

This study highlights the critical role of feature selection and model tuning
in phishing detection. By identifying which features contribute most
significantly to classification accuracy and tailoring models to the problem
domain, detection rates can be significantly improved. The insights gained from
this research can inform the development of real-time phishing detection tools
integrated into browsers, email filters, and cybersecurity infrastructure.
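
To make the URL-based and security-related indicators named in this chapter more concrete, the following is a minimal sketch of how a few of them (LongURL, Prefix-Suffix, the “@” symbol, HTTPS) might be derived from a raw URL. The function name, the 54-character length threshold, and the 1 / -1 encoding are assumptions chosen for illustration; they are not taken from the dataset's documentation or from the project's code.

```python
from urllib.parse import urlparse


def extract_url_features(url, long_url_threshold=54):
    """Return simple URL indicators (1 = benign-looking, -1 = suspicious).

    The threshold and the 1/-1 encoding are illustrative assumptions.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Very long URLs can hide the real destination.
        "LongURL": -1 if len(url) >= long_url_threshold else 1,
        # A hyphen in the host name is the "Prefix-Suffix" indicator.
        "PrefixSuffix": -1 if "-" in host else 1,
        # This sketch flags an "@" anywhere in the URL.
        "AtSymbol": -1 if "@" in url else 1,
        # Plain HTTP (no TLS) is treated as suspicious.
        "HTTPS": 1 if parsed.scheme == "https" else -1,
    }


# Hypothetical URL used only for illustration.
features = extract_url_features("http://secure-login.example.com/@verify/account")
print(features)
```

A feature vector of this kind, extended to all 30 attributes, is what the classifiers compared in this study would consume.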

1.1 Overview
Phishing websites are a critical cybersecurity concern, designed to deceive
users into revealing sensitive information such as passwords, credit card
numbers, and banking details. These malicious sites often imitate legitimate
websites with high visual accuracy, making them difficult for users, and even
security systems, to detect.

Phishing website detection has been an active area of research, with various
techniques developed to improve accuracy and reliability. Early approaches
primarily relied on rule-based methods and blacklists, which identified
phishing websites based on predefined patterns or previously reported malicious
domains. Because these methods rely on known phishing signatures, they are
ineffective against newly generated phishing sites: blacklists require
continuous updates, cannot detect zero-day attacks, and fail to identify new or
slightly altered threats.

To overcome these limitations, machine learning (ML) has emerged as a powerful
solution. By analyzing patterns in various features like URL structure, domain
registration, HTTPS presence, and website behavior, ML models can detect
phishing sites with high accuracy, including those that have not been
previously reported. Feature-based classification allows models to learn from
past data and generalize to new phishing attempts.

This study explores a range of ML classifiers including Random Forest,
CatBoost, Gradient Boosting, SVM, and MLP, using a comprehensive dataset with
30 features. The aim is to identify which features are most predictive and
which models offer the best performance. The findings contribute to developing
smarter, faster, and more adaptable phishing detection systems that enhance
digital security and reduce online fraud.

Department of Information Technology 3


1.2 Applications
1. Browser Security Extensions: Machine learning models can be integrated
into browser extensions (e.g., Chrome, Firefox) to analyze URLs and
webpage content in real time, warning users or blocking access to
suspected phishing websites before they can interact with them.

2. Email Filtering Systems: Email providers can embed phishing detection
models to scan incoming emails for suspicious links or fake login pages.
This helps in flagging or quarantining phishing emails before they reach
the user’s inbox.

3. Corporate Network Security: Organizations can implement phishing
detection systems within their internal networks to protect employees
from accidentally accessing phishing sites, thus preventing data breaches
and credential theft.

4. Web Hosting & Domain Monitoring: Web hosting companies and domain
registrars can use ML-based detection tools to identify and take down
phishing domains during registration or shortly after they become active.

5. Mobile Security Apps: Phishing detection models can be deployed in
mobile security applications to protect smartphone users from clicking
malicious links in SMS, apps, or browsers, helping safeguard mobile
banking and personal data.

1.3 Problem Statement
Phishing websites pose a serious and growing threat in the digital age,
targeting users across the globe by impersonating legitimate websites to steal
sensitive information such as login credentials, banking details, credit card
numbers, and personal data. These malicious sites are often indistinguishable
from authentic ones, using deceptive URLs, logos, and content layouts to
gain users’ trust. The consequences of falling victim to phishing attacks
can be severe, leading to financial loss, identity theft, data breaches, and
unauthorized access to personal or corporate systems. The rapid evolution
of phishing techniques—such as dynamic URL generation, use of HTTPS to
appear secure, and frequent changes in hosting services—makes them difficult
to detect using traditional methods like blacklists or manually defined rules.
These outdated approaches are reactive and often fail to identify newly launched
phishing campaigns in real-time. Phishing attacks not only affect individual
users but also pose significant risks to organizations, including reputational
damage, regulatory penalties, and disruptions to business operations. As digital
interactions continue to grow, so does the attack surface for phishing scams,
highlighting an urgent need for more intelligent, adaptive, and automated
detection solutions capable of identifying phishing attempts before they cause
harm.

1.4 Objectives
1. To Analyze and Understand the Phishing Detection Problem:
Investigate the nature of phishing websites and identify key features
that distinguish them from legitimate websites using Exploratory Data
Analysis (EDA).

2. To Evaluate Multiple Machine Learning Models:
Implement and compare the performance of various machine learning
classifiers such as Random Forest, Gradient Boosting, CatBoost, Support
Vector Machine (SVM), Multi-layer Perceptron (MLP), Decision Tree,
Logistic Regression, K-Nearest Neighbors (KNN), and Naïve Bayes.

3. To Train and Test the Models Using a Realistic Dataset:
Use a comprehensive dataset containing a diverse set of features related to
URL structure, website security, and behavior. Perform model training,
testing, and validation using appropriate evaluation metrics such as
accuracy, precision, recall, and F1-score.

4. To Select the Best Performing Model:
Identify the model that demonstrates the highest accuracy and
generalization ability for detecting phishing websites, and optimize it
through hyperparameter tuning.

5. To Develop a Flask-based Web Application:
Deploy the best-performing phishing detection model into a lightweight
Flask web application, enabling real-time URL classification and providing
a user-friendly interface for practical use.
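
The evaluation metrics named in Objective 3 can be computed directly from the counts of a binary confusion matrix. The sketch below is a plain-Python illustration; the function name and the sample labels are hypothetical and are not results from this project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision penalizes legitimate sites wrongly flagged as phishing, while recall penalizes phishing sites that slip through, which is why both are reported alongside accuracy.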

1.5 Organization of the Project
• Chapter 1: Introduction
Provides an introduction to the project, outlining its purpose, significance,
and the problem it aims to address.

• Chapter 2: Literature Survey
Presents a detailed literature survey, reviewing existing work and
approaches related to phishing detection.

• Chapter 3: System Requirement Specification
Outlines the functional, non-functional, software, and hardware
requirements essential for the successful execution of the project.

• Chapter 4: Proposed Methodology
Explains the proposed methodology, including system design, data flow,
and key features of the project.

• Chapter 5: Results
Focuses on the implementation phase, showcasing results, model
performance, and analysis.

• Chapter 6: Conclusion
Summarizes the project and draws conclusions.

• Chapter 7: Future Scope
Discusses potential future enhancements and applications.

CHAPTER 2
Literature Survey

2.1 Paper 1: DEPHIDES: Deep Learning Based
Phishing Detection System

Overview
This paper introduces DEPHIDES, a deep learning-based system for
detecting phishing attacks. The research addresses the increasing threat of
phishing in the digital era, where cybercriminals exploit the anonymity of the
internet to steal sensitive information such as passwords, banking credentials,
and social security numbers.

Method
The researchers proposed a phishing detection system based on five deep
learning algorithms:

• Artificial Neural Networks (ANN)

• Convolutional Neural Networks (CNN)

• Recurrent Neural Networks (RNN)

• Bidirectional RNNs

• Attention Networks

The system focuses on the rapid classification of web pages based on their
URLs. An extensive dataset of approximately five million labeled URLs
was created to evaluate model performance.
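
Before a neural network can classify a URL, the string has to be turned into a fixed-length numeric sequence. The sketch below shows one common way to do this, character-level encoding with padding. It is an assumption for illustration: this survey does not state how DEPHIDES encodes its inputs, and the vocabulary, the reserved ids, and the `max_len` value are hypothetical.

```python
import string

# Index 0 is reserved for padding and 1 for characters outside the vocabulary.
PAD, UNK = 0, 1
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
VOCAB = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len=32):
    """Map a URL to a fixed-length list of integer character ids."""
    # Lowercase, truncate to max_len, and replace unknown characters by UNK.
    ids = [VOCAB.get(ch, UNK) for ch in url.lower()[:max_len]]
    # Right-pad short URLs so every sequence has the same length.
    return ids + [PAD] * (max_len - len(ids))


# Hypothetical URL used only for illustration.
encoded = encode_url("http://example.com/login")
print(encoded)
```

Sequences of this form are typically fed to an embedding layer, after which convolutional or recurrent layers operate on the embedded characters.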

Results
Among the five deep learning models, Convolutional Neural Networks
(CNN) outperformed others with a phishing detection accuracy of 98.74%.
This result showcases CNN’s strength in recognizing phishing patterns from
URL-based features.

Conclusion
The study concludes that deep learning algorithms, particularly CNNs,
are highly effective for detecting phishing threats. The use of a large-scale
dataset and advanced neural network architectures contributes significantly to
the domain of automated phishing detection in cybersecurity.

2.2 Paper 2: Phishing Detection System Through
Hybrid Machine Learning Based on URL

Overview
This study focuses on phishing detection through URL analysis using hybrid
machine learning approaches. The researchers highlight that phishing, dating
back to 1996, has evolved into one of the most dangerous cybercrimes. Common
tactics include email distortion and fraudulent websites to deceive users into
disclosing sensitive information.

Method
The research employed a phishing URL-based dataset from a well-known
repository, comprising over 11,000 websites with both phishing and legitimate
URL attributes in vector form. After preprocessing, the study implemented
several machine learning algorithms:

• Decision Tree (DT)

• Linear Regression (LR)

• Random Forest (RF)

• Naive Bayes (NB)

• Gradient Boosting Classifier (GBM)

• K-Nearest Neighbors (KNN)

• Support Vector Classifier (SVC)

The researchers also proposed a hybrid LSD model combining:

• Logistic Regression (LR)

• Support Vector Machine (SVC)

• Decision Tree (DT)

The hybrid model utilized both soft and hard voting techniques. Additionally,
canopy feature selection, cross-fold validation, and Grid Search Hyperparameter
Optimization were applied to enhance performance.
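
The difference between hard and soft voting can be shown with a small plain-Python sketch. Hard voting takes the majority of the predicted labels, whereas soft voting averages the predicted class probabilities. The probability values below are hypothetical stand-ins for the outputs of three classifiers such as LR, SVC, and DT; they are not taken from the paper.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over the class labels predicted by each model."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-class probabilities across models and pick the largest."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=averaged.__getitem__)


# Hypothetical outputs of three classifiers for one URL;
# each row is [P(legitimate), P(phishing)].
probs = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
# Each model's hard label is its most probable class (0 or 1).
labels = [max(range(2), key=p.__getitem__) for p in probs]

print(hard_vote(labels), soft_vote(probs))
```

In this example two weakly confident models outvote one highly confident model under hard voting, while soft voting lets the confident model dominate, so the two schemes can disagree on the same input.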

Results
Performance was assessed using metrics such as precision, accuracy, re-
call, F1-score, and specificity. The hybrid LSD model demonstrated superior
performance over the individual algorithms in accurately identifying phishing
URLs.

Conclusion
The hybrid machine learning approach, particularly the LSD model, shows
strong potential for phishing detection. The results underline the effectiveness
of combining multiple classifiers to enhance detection accuracy and reliability
in identifying phishing threats.

2.3 Paper 3: An Enhanced Deep Learning-based
Phishing Detection Mechanism Using Variational
Autoencoders

Overview
This research addresses the limitations of traditional blacklist-based phishing
detection techniques, which often fail to detect the vast and constantly evolving
phishing websites. The authors propose an enhanced deep learning-based
detection mechanism that integrates variational autoencoders (VAE) with
deep neural networks (DNN) for improved accuracy and efficiency.

Method
The framework utilizes VAE to automatically extract essential features from
raw URLs by reconstructing them. This automatic feature learning enhances
phishing detection performance. The dataset used for experimentation consisted
of approximately 100,000 URLs collected from:

• ISCX-URL-2016 dataset

• Kaggle phishing dataset

Results
The proposed VAE-DNN model achieved a maximum detection accuracy of
97.45% and demonstrated a rapid response time of just 1.9 seconds. These
results surpassed those of all other tested models in the study.

Conclusion
The integration of variational autoencoders with deep neural networks
proves to be an effective solution for phishing URL detection. The approach
eliminates the need for manual feature engineering while offering both high
accuracy and fast execution, making it ideal for real-world deployment in
cybersecurity systems.

2.4 Paper 4: A Deep Learning-Based Phishing
Detection System Using CNN, LSTM, and
LSTM-CNN

Overview
This study addresses critical internet security issues, with a specific focus
on phishing attacks that impersonate legitimate websites to steal personal
information. The researchers propose and compare three deep learning-based
approaches for effectively detecting phishing websites using URL features.

Method
The paper explores and evaluates the following deep learning techniques:

• Convolutional Neural Networks (CNN)

• Long Short-Term Memory (LSTM) networks

• A hybrid LSTM-CNN model

Each method was tested for its performance in classifying URLs as either
phishing or legitimate.

Results
The models achieved the following accuracy scores:

• CNN: 99.2%

• LSTM-CNN Hybrid: 97.6%

• LSTM: 96.8%

Among these, the CNN model performed the best in terms of accuracy and
detection capabilities.

Conclusion
The research highlights the effectiveness of deep learning models—especially
CNNs—in phishing detection. The comparative performance evaluation pro-
vides practical guidance for implementing real-world phishing detection systems,
with CNNs offering the most reliable results in this study.

2.5 Paper 5: An Intelligent Cyber Security Phishing
Detection System Using Deep Learning
Techniques

Overview
This paper highlights the increasing danger of phishing attacks as a major
form of social engineering targeting internet users, governments, and businesses.
It emphasizes phishing emails as the primary and most successful attack vector
in these scenarios.

Method
The researchers proposed a phishing detection model based on machine
learning techniques. The dataset was divided into training and testing sets,
and the model used email text features and additional attributes to classify
emails as phishing or non-phishing. Three different datasets were utilized for
experimentation, each containing varying feature sets to assess performance
impacts.

Results
Among the machine learning models tested, the boosted decision tree
algorithm achieved the best results, with accuracy rates of 0.88, 1.00, and
0.97 across the three different datasets. The study found that incorporating
more features generally led to improved classification accuracy and system
efficiency.

Conclusion
The research confirms the potential of machine learning, especially boosted
decision trees, in detecting phishing emails with high accuracy. It also under-
scores the role of thorough feature selection in enhancing detection performance,
making the model suitable for practical cybersecurity applications.

2.6 Paper 6: Phishing URL Detection: A Real-Case
Scenario through Login URLs

Overview
This paper addresses the challenge of phishing detection through URL
analysis, with a particular focus on login URLs. The researchers note that
most existing solutions fail to include legitimate login forms in their legitimate
class examples, which limits their real-world applicability.

Method
The researchers compared machine learning and deep learning techniques for
detecting phishing websites through URL analysis. Unlike typical approaches,
they used URLs from login pages in both phishing and legitimate classes to
create a more representative real-world scenario. The researchers also tested
how models trained on older datasets performed when tested with recent URLs
to measure accuracy degradation over time. They introduced a new dataset
named Phishing Index Login URL (PILU-90K), containing 60,000 legitimate
URLs (including index and login websites) and up to 30,000 phishing URLs.

Results
The study found that existing techniques show high false-positive rates
when tested with URLs from legitimate login pages. Additionally, models
trained on older datasets showed decreasing accuracy when tested with more
recent URLs. The researchers’ Logistic Regression model combined with Term
Frequency - Inverse Document Frequency (TF-IDF) feature extraction achieved
96.50% accuracy on their login URL dataset.
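
To illustrate the TF-IDF feature extraction mentioned above, the following plain-Python sketch weights URL tokens by how often they occur in one URL and how rare they are across all URLs. It uses the unsmoothed form tf × log(N / df), which is an assumption; the paper's exact tokenization and weighting variant are not described in this survey, and the URLs are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def tfidf(urls):
    """Return one {token: weight} dict per URL using tf * log(N / df)."""
    docs = [Counter(tokenize(u)) for u in urls]
    # Document frequency: number of URLs in which each token appears.
    df = Counter(token for doc in docs for token in doc)
    n = len(docs)
    return [
        {tok: (count / sum(doc.values())) * math.log(n / df[tok])
         for tok, count in doc.items()}
        for doc in docs
    ]


# Hypothetical URLs for illustration only.
urls = [
    "https://example.com/login",
    "https://example.org/account/login",
    "http://paypa1-secure.example.net/login/verify",
]
weights = tfidf(urls)
print(weights[2])
```

Tokens that appear in every URL, such as "login" here, receive zero weight, while tokens unique to one URL are weighted highest. The resulting sparse vectors are the kind of input a Logistic Regression classifier would be trained on.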

Conclusion
This research, by Manuel Sánchez Paniagua, Eduardo Fidalgo, Enrique Alegre,
and Al-Nabki Mhd Wesam (January 2022), highlights the importance of using
representative training data that includes legitimate login URLs to build
more effective phishing detection systems. The demonstrated approach using
Logistic Regression with TF-IDF features provides a promising solution for
real-world phishing detection scenarios.



2.7 Paper 7: A Lightweight and Proactive Rule-
Based Incremental Construction Approach
to Detect Phishing Scam

Overview
This paper focuses on the growing need for digital security amidst the rise
of phishing attacks, which are a prominent form of social engineering used to
compromise user credentials. The study proposes an intrusion detection system
implemented as a Chrome extension to address phishing threats in real-time.

Method
The authors developed a lightweight and proactive rule-based incremental
construction approach for detecting phishing URLs. This technique analyzes
various URL features including domain information, content patterns, and
page-level attributes. It is designed to be robust, reliable, and scalable, and
does not depend on existing blacklist signatures—enabling effective detection
of zero-day and spear phishing attacks.

Results
The proposed system achieved:

• 89.12% detection rate for zero-day phishing attacks

• 76.2% detection rate for spear phishing attacks

• 97.13% true positive rate

• Less than 1.5% false positive rate

These results indicate higher precision and efficiency compared to previously
developed models and techniques.


Conclusion
The research validates the effectiveness of a lightweight, rule-based model for
phishing detection. Its capability to identify new phishing scams without relying
on blacklists, combined with a practical Chrome extension implementation,
makes it suitable for real-world applications with low computational overhead.

2.8 Paper 8: Detection of Phishing URLs by Using
Deep Learning Approach and Multiple
Features Combinations

Overview
This study addresses the inherent limitations of traditional blacklist-based
phishing detection techniques, which often fail to identify newly generated
phishing URLs. The paper proposes an alternative deep learning approach to
enhance the universality and effectiveness of phishing URL detection systems.

Method
The researchers evaluated three different feature types for phishing URL
detection:

• Lexical features

• Character-level embeddings

• Word-level embeddings

They designed a new deep neural network architecture that combines multiple
CNN and LSTM layers, tailored to capture both local and sequential patterns
in URLs.

Results
The most effective model combined character-level and word-level embeddings,
achieving an accuracy of 94.4% in phishing URL detection. The hybrid
CNN-LSTM architecture outperformed models that relied on a single type of
feature representation.

Conclusion
The findings highlight that using a combination of feature types signifi-
cantly enhances phishing detection performance. The proposed deep learning
architecture demonstrates a robust ability to detect phishing URLs, even those
not present in existing blacklists, offering a substantial improvement over
conventional methods.

2.9 Paper 9: Adopting Automated Whitelist
Approach for Detecting Phishing Attacks

Overview
This research highlights key limitations in existing anti-phishing solutions,
such as low detection rates and slow response times in real-time scenarios.
While blacklist methods are fast, their detection effectiveness is limited. As an
alternative, the paper proposes an automated whitelist-based phishing detection
approach.

Method
The proposed method involves analyzing similarities between visual and
actual hyperlinks. The system compares domain names to trusted domains in
a whitelist, maps them to corresponding IP addresses, and uses extracted URL
information to make detection decisions. Six diverse datasets were employed
to test the effectiveness of the system.
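The comparison between visual and actual hyperlinks, followed by a whitelist lookup, can be sketched as follows. This is an illustrative sketch only: the matching rule, the helper names, and the example whitelist are assumptions, and the domain-to-IP mapping step described in the paper is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(text):
    """Return the lowercase hostname if text looks like a URL, else None."""
    text = text.strip()
    if not text or any(ch.isspace() for ch in text):
        return None
    candidate = text if "://" in text else "//" + text
    host = urlparse(candidate).hostname
    return host.lower() if host and "." in host else None


def suspicious_links(html, whitelist):
    """Flag links whose visible host differs from the real one,
    or whose real host is not in the whitelist."""
    parser = LinkCollector()
    parser.feed(html)
    flagged = []
    for href, text in parser.links:
        actual = host_of(href)
        if actual is None:
            continue  # relative link: stays on the current site
        visual = host_of(text)
        mismatch = visual is not None and visual != actual
        if mismatch or actual not in whitelist:
            flagged.append(href)
    return flagged


# Hypothetical page fragment and whitelist for illustration.
page = (
    '<a href="http://evil.example.net/login">www.example.com</a>'
    '<a href="https://www.example.com/help">Help</a>'
    '<a href="/about">About</a>'
)
flagged = suspicious_links(page, {"www.example.com"})
```

Only the first link is flagged here, because its visible text shows a trusted host while the underlying target points elsewhere.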

Results
The approach achieved strong results, particularly with smaller datasets. On
average, it attained 96.17% accuracy and a 95.0% true positive rate across six
experiments. Although accuracy varied across datasets, the method consistently
outperformed benchmark techniques. The system also demonstrated advantages
in computational efficiency, with lower memory, bandwidth, and resource
requirements.

Conclusion
The study introduces an automated whitelist-based detection system as a
robust and efficient alternative to traditional blacklist and heuristic-based
approaches. By dynamically maintaining and referencing a whitelist of
trusted domains, the system significantly enhances the accuracy
of phishing detection while minimizing false positives, a common limitation
in existing models. One of the key advantages of this approach lies in
its ability to handle real-time URL classification with reduced computational
overhead, making it highly suitable for deployment in large-scale, latency-
sensitive environments such as web browsers, email clients, and enterprise
security gateways.
The system’s architecture allows for continuous updates and adaptive learn-
ing, enabling it to respond quickly to emerging phishing threats without
requiring frequent retraining on large datasets. This real-time adaptability en-
sures that users remain protected against the ever-evolving landscape of cyber
threats, where phishing techniques are becoming increasingly sophisticated.
Moreover, by focusing on login pages—one of the primary targets in phish-
ing attacks—the proposed method directly addresses the most critical entry
point exploited by attackers. In summary, the integration of whitelist-based
filtering with machine learning models creates a powerful hybrid framework
that offers scalability, efficiency, and high detection performance. It sets a
strong foundation for future advancements in intelligent phishing detection
systems and paves the way for safer and more secure web interactions.



CHAPTER 3
SYSTEM REQUIREMENT
SPECIFICATION

3.1 Functional Requirements


Functional requirements define the specific behavior or functions of the
system. These are the features and capabilities that the system must support
to fulfill its intended purpose — in this case, phishing website detection using
machine learning and deployment through a Flask web application.

1. User Interface for URL Submission

• The system shall provide a simple and intuitive web interface for
users to input website URLs for analysis.

• The interface shall allow users to submit URLs one at a time
through a text field or form.

2. Input Validation

• The system shall validate user inputs to ensure that only properly
formatted URLs are accepted.

• It shall notify users if the input is empty, malformed, or potentially
invalid before proceeding to classification.

3. Feature Extraction

Upon receiving a valid URL, the system shall automatically extract
predefined features from the URL, such as:

• Length of URL

• Use of HTTPS

• Presence of prefix/suffix

• Use of '@' or '-'

• Domain registration length

• Website traffic and Google index status (if available)

These features shall be used as input for the trained machine learning
model.

4. Phishing Detection

• The system shall use a pre-trained machine learning model to classify
the input URL as either "Phishing" or "Legitimate".

• Multiple classifiers (e.g., Random Forest, SVM, CatBoost) shall be
initially evaluated during the model development phase, and the
best-performing model will be integrated into the final system.

5. Model Prediction Output

• After classification, the system shall display a clear and user-friendly
message indicating whether the URL is safe or potentially a phishing
site.

• The system may also display a confidence score or probability
associated with the prediction.

6. Logging and Monitoring

• The system shall maintain a backend log of submitted URLs and
corresponding predictions (without storing user identity unless
specified).

• This log may be used for further model training and auditing
purposes.

7. Model Update and Retraining

• The system shall provide provisions (manually or automatically) for
updating the machine learning model based on newly acquired data.

• Admins or developers shall be able to retrain the model periodically
to improve detection accuracy.

8. Error Handling

• The system shall gracefully handle exceptions, such as server
downtime, unresponsive components, or malformed URLs.

• Users shall be notified with appropriate error messages if the
detection process cannot be completed.

9. Security

• The system shall sanitize inputs to prevent attacks such as code
injection or cross-site scripting (XSS).

• HTTPS encryption shall be enforced for secure communication
between the client and server.
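Requirements 2 (Input Validation) and 3 (Feature Extraction) can be sketched with the standard library as shown below. This is a minimal sketch under stated assumptions: the function names and feature keys are hypothetical, only lexical features are covered, and features that need external lookups (domain registration length, website traffic, Google index status) are left out.

```python
import re
from urllib.parse import urlparse


def is_valid_url(url):
    """Accept only http(s) URLs that contain a hostname."""
    if not url or not url.strip():
        return False
    try:
        parts = urlparse(url.strip())
    except ValueError:
        return False  # e.g. malformed bracketed host
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def extract_lexical_features(url):
    """Compute a few URL-only features (hypothetical feature names)."""
    url = url.strip()
    parts = urlparse(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "uses_https": parts.scheme == "https",
        "has_at_symbol": "@" in url,
        "has_hyphen_in_domain": "-" in host,
        "host_is_ip": re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None,
    }


# Hypothetical input for illustration.
candidate = "http://secure-login.example.com/a@b"
if is_valid_url(candidate):
    features = extract_lexical_features(candidate)
```

Validation runs before feature extraction so that empty or malformed input is rejected with a user-facing message instead of reaching the classifier.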

3.2 Non-Functional Requirements


Non-functional requirements define the quality attributes, performance, and
constraints of the system. They are essential for ensuring the usability,
efficiency, and maintainability of the application.

1. Performance

• The system shall respond to user queries within 2–5 seconds under
normal operating conditions.

• The machine learning model shall be optimized to provide fast and
efficient predictions with minimal computational overhead.

2. Scalability

• The application shall be designed to support multiple concurrent
users without degradation in performance.

• The system architecture shall be modular enough to be deployed on
cloud platforms (e.g., AWS, Azure, Heroku) for horizontal scaling.

3. Usability

• The user interface shall be clean, simple, and accessible to both
technical and non-technical users.

• Tooltips, help text, and alert messages shall be provided for better
user guidance.

4. Maintainability

• The codebase shall be well-documented, modular, and follow best
practices to allow easy debugging and extension.

• The system shall use standard frameworks like Flask, Scikit-learn,
or TensorFlow/PyTorch to enable maintainability.

5. Reliability

• The system shall ensure accurate predictions in at least 90% of
cases based on validation and test datasets.

• Backup and logging mechanisms shall be in place to restore
functionality in the event of failure.

6. Availability

• The system shall be available for use at least 99% of the time,
barring planned maintenance or unexpected downtimes.

• Hosting infrastructure shall be chosen with high availability and
fault tolerance in mind.

7. Portability

• The system shall be platform-independent and deployable across
different operating systems (Windows, macOS, Linux).

• The web application shall be accessible via major web browsers
(Chrome, Firefox, Safari, Edge).

8. Security

The application shall implement measures to protect against
known vulnerabilities including:

• SQL injection

• Cross-site scripting (XSS)

• Man-in-the-middle (MITM) attacks

• Passwords and sensitive data, if used, shall be encrypted and securely
stored.

9. Data Privacy

• The system shall not store personally identifiable user data unless
necessary.

• Logs and predictions shall be anonymized if collected for research
or retraining purposes.

10. Extensibility

• The architecture shall allow integration of additional machine
learning models or new features (e.g., multi-URL batch prediction, email
link scanner) in the future.

• APIs may be developed for external services to interact with the
phishing detection engine.

3.3 Software Requirements


1. Operating System:

• Windows 10/11, Linux (Ubuntu), or macOS

• Any OS compatible with Python and supporting Flask for web app
development.

2. Programming Language:

• Python 3.7 or above

• Used for building machine learning models, data preprocessing, and
web application development.

3. Machine Learning Models and Libraries:

• Scikit-learn: For models like Random Forest, Logistic Regression,
SVM, Decision Tree, etc.

• CatBoost and XGBoost: For advanced gradient boosting algorithms.

• Keras or TensorFlow (optional): If future deep learning
enhancements are needed.

• Joblib or Pickle: For saving and loading trained models.

4. Data Processing Libraries:

• Pandas and NumPy: For handling and manipulating datasets.

• Matplotlib and Seaborn: For visualization during Exploratory Data
Analysis (EDA).

5. Flask Framework:

• Lightweight Python web framework used to build the web interface
for phishing detection.

6. IDE/Development Tools:

• VS Code, Jupyter Notebook, or PyCharm for writing and debugging
code.

3.4 Hardware Requirements


1. Processor (CPU):

• Minimum: Dual-core 2.0 GHz or above

• Recommended: Quad-core Intel i5/i7 or AMD Ryzen 5/7 for faster
computation

2. RAM (Memory):

• Minimum: 4 GB

• Recommended: 8 GB or more for smoother execution and model
training

3. Storage:

• Minimum: 2 GB of free space

• Recommended: SSD with 10 GB or more free for storing models,
datasets, and logs

4. Graphics Card (GPU):

• Not required for standard machine learning models

• Optional: Dedicated GPU (e.g., NVIDIA 2GB+) if integrating deep
learning in the future

5. Internet Connection:

• Required for installing packages, accessing datasets, and deploying
the application

• Recommended: Stable high-speed connection.


CHAPTER 4
PROPOSED METHODOLOGY

4.1 Data Collection


The dataset used for this project was sourced from Kaggle, a popular
platform for sharing data science projects and datasets. This specific dataset
comprises over 11,000 website records, each represented by a unique URL
and a set of 30 features that characterize different aspects of the website.
These features may include information such as the presence of HTTPS in
the URL, the use of special characters, IP address presence, domain age,
and others that can be predictive of phishing behavior. Each sample in the
dataset also contains a class label that indicates whether the URL is legitimate
or phishing. The labeling convention follows a binary classification format,
where phishing websites are marked as "1" and legitimate ones as "-1". The
availability of a well-labeled dataset is crucial, as it allows supervised machine
learning models to learn and differentiate between the two categories based
on the input features.

4.2 Data Preprocessing and Exploratory Data Analysis
(EDA)
Before diving into model training, it is essential to preprocess the data and
explore its structure. Data preprocessing ensures that the dataset is clean,
consistent, and ready for input into machine learning models.
The preprocessing steps include:

• Checking for null or missing values and handling them appropriately.

• Encoding categorical features if any exist.

• Ensuring all features are in the correct format for training.

Once the data is preprocessed, Exploratory Data Analysis (EDA) is con-
ducted. EDA serves as a vital step to understand the underlying patterns and
trends in the data. It includes:

• Descriptive statistics to examine the distribution of individual features.

• Bar plots to visualize the frequency of categorical data.

• Histograms and box plots to understand the spread and detect outliers.

• Pair plots to see relationships between feature pairs.

• Correlation heatmaps to identify which features are most strongly
correlated with the target label.

These visual tools help identify informative features and guide feature
selection, which plays a crucial role in model performance. To understand the
significance of various features in phishing website detection, an exploratory
data analysis (EDA) was conducted. A heatmap was generated to visualize
the correlation between different features and the target class (phishing or
legitimate websites).

• The correlation matrix highlights the relationships among key features,
providing insights into their influence on phishing classification.

• Features like ‘HTTPS’, ‘AnchorURL’, and ‘WebsiteTraffic’ showed strong
correlations with the target variable, making them crucial for model
performance.

• Some features exhibited negative correlations, suggesting that their
presence may indicate a lower likelihood of a phishing website.

Figure 4.1: Heatmap for correlation between different features

4.3 Data Splitting

To build robust machine learning models, the dataset is divided into two
subsets: training and testing. This split allows the model to learn patterns
from one portion of the data and be evaluated on unseen data to test its
generalizability. In this case, an 80-20 split is used:

• Training Set (80%): Used to fit the machine learning models.

• Testing Set (20%): Held back and used to evaluate the model’s
performance after training.
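The 80-20 split described above can be illustrated with a short standard-library sketch. The function name, the seed, and the synthetic records are assumptions made for illustration; in a Scikit-learn workflow the same step is typically done with `train_test_split` from `sklearn.model_selection`.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic (record_id, label) pairs standing in for the real dataset.
records = [(i, 1 if i % 2 else -1) for i in range(100)]
train, test = split_train_test(records)
```

Shuffling before splitting avoids a test set that reflects only the ordering of the source file, and the fixed seed keeps the comparison between models reproducible.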

To further explore the dataset, a pair plot analysis was conducted, focusing
on key features such as ‘PrefixSuffix-’, ‘SubDomains’, ‘HTTPS’, ‘AnchorURL’,
and ‘WebsiteTraffic’.

• The pairwise scatter plots provide a visual representation of how phishing
and legitimate websites are distributed across different feature
combinations.

• Features like ‘SubDomains’ and ‘PrefixSuffix-’ exhibit distinguishable
patterns for phishing websites compared to legitimate ones.

• Cluster formations in the scatter plots suggest the possibility of effective
separation using machine learning classifiers.

Figure 4.2: Scatterplot Matrix

4.4 Model Building and Training

With the preprocessed data and train-test split in place, the next step
involves selecting appropriate machine learning models and training them on the
dataset. This project explores a variety of supervised classification algorithms,
including:

• Logistic Regression – A baseline model that works well when the classes
are linearly separable.

• K-Nearest Neighbors (k-NN) – A distance-based classifier that predicts
the label based on neighboring points.

• Support Vector Classifier (SVC) – An effective classifier that attempts
to find the optimal hyperplane between classes.

• Naive Bayes – A probabilistic model based on Bayes’ theorem, assuming
feature independence.

• Decision Tree – A tree-structured classifier that makes decisions based
on feature thresholds.

• Random Forest – An ensemble method that builds multiple decision trees
and averages their outputs.

• Gradient Boosting – A boosting algorithm that builds trees sequentially
to reduce errors.

• CatBoost – A gradient boosting algorithm optimized for categorical
features and speed.

• Multilayer Perceptron (MLP) – A neural network-based model capable
of learning complex, non-linear patterns.

Each model is trained on the training dataset and evaluated using key
performance metrics:

• Accuracy: Overall correctness of the model.

• Precision: Ability to identify true phishing websites without false
positives.

• Recall: Ability to capture all phishing websites.

• F1 Score: Harmonic mean of precision and recall, providing a balance
between the two.

4.5 Model Comparison


To determine the best model, a structured comparison is carried out.
A performance dataframe is created to summarize each model’s evaluation
metrics. This comparative framework provides clarity on which models are
performing well across different metrics and which are lacking. The models are
sorted by accuracy and F1 score to identify the top performer. This structured
analysis not only helps in selecting the best model but also highlights trade-offs
between different classifiers.
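The ranking step can be sketched as follows. The report builds a performance dataframe; this sketch swaps in a plain list of dictionaries so that it runs without extra libraries, and it uses four rows whose accuracy and F1 values are taken from Table 5.1. The function and key names are illustrative assumptions.

```python
# Accuracy and F1 values for four of the models reported in Table 5.1.
results = [
    {"model": "Logistic Regression", "accuracy": 0.927, "f1": 0.935},
    {"model": "Gradient Boosting", "accuracy": 0.974, "f1": 0.974},
    {"model": "Random Forest", "accuracy": 0.969, "f1": 0.986},
    {"model": "Naive Bayes", "accuracy": 0.900, "f1": 0.450},
]


def rank_models(rows):
    """Sort by accuracy first, then by F1 score, both descending."""
    return sorted(rows, key=lambda r: (r["accuracy"], r["f1"]), reverse=True)


ranked = rank_models(results)
best_model = ranked[0]["model"]
```

Because accuracy is the primary sort key, a model with a higher F1 score but lower accuracy (Random Forest in this subset) still ranks below the most accurate one.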

4.6 Confusion Matrix


For a deeper understanding of the classification performance, a confusion
matrix is generated for the best-performing model. This matrix presents four
critical values:

• True Positives (TP): Correctly identified phishing websites.

• True Negatives (TN): Correctly identified legitimate websites.

• False Positives (FP): Legitimate websites misclassified as phishing.



• False Negatives (FN): Phishing websites misclassified as legitimate.

By analyzing these values, we can assess where the model makes mistakes
and how those mistakes might impact real-world performance.
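The four confusion-matrix counts determine the evaluation metrics used in this project. The sketch below shows the standard formulas; the function name and the counts in the usage example are illustrative assumptions, not results from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

False positives lower precision while false negatives lower recall, so the two error types in the matrix can be weighed separately when deciding which mistakes matter more in deployment.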

4.7 Storing the Best Model


Once the optimal model is identified—Gradient Boosting in this case—it
is serialized using the pickle library. This allows the trained model to be
saved as a .pkl file, enabling future reuse without the need to retrain. Model
serialization is a critical step in transitioning from development to deployment,
as it ensures that the model can be integrated into applications, tested in real
environments, and used repeatedly.
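A minimal sketch of the save and reload cycle with `pickle` is shown below. The helper names and file name are assumptions, and a plain dictionary stands in for the trained classifier so that the example runs without the training code.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a trained model object to a .pkl file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model back from disk. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object for illustration; the real object would be the fitted classifier.
model = {"name": "gradient_boosting", "n_estimators": 100}
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(model, model_path)
restored = load_model(model_path)
```

The web application would call the loading step once at startup and reuse the restored object for every request.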

4.8 Flask Application Development


To make the phishing detection model accessible to users, a Flask-based
web application is developed. Flask is a lightweight web framework in Python
that simplifies the creation of web servers and APIs.
Key features of the Flask application include:

• User Interface: A simple input form for users to enter a URL.

• Backend Logic: The pre-trained model is loaded and applied to the
input.

• Prediction Display: The model outputs whether the URL is phishing or
safe.

Among the many models tested, the Gradient Boosting Classifier achieved
the highest performance, with an accuracy of 97.4%. Such a high level of
accuracy in identifying phishing websites demonstrates the model’s utility.



CHAPTER 5
RESULTS

5.1 Model Evaluation and Performance Analysis


To evaluate the performance of various machine learning classifiers for
our binary classification task, we used standard metrics including Accuracy,
Precision, Recall, and F1-Score. The performance of each model is summarized
in the table below:
Table 5.1: Models Accuracy Comparison

Model                            Accuracy   Precision   Recall   F1-Score
Gradient Boosting Classifier     97.4%      0.989       0.988    0.974
CatBoost Classifier              97.0%      0.991       0.990    0.981
Random Forest                    96.9%      0.990       0.993    0.986
Support Vector Machine (SVM)     95.6%      0.968       0.980    0.968
Multi-layer Perceptron (MLP)     95.4%      0.984       0.984    0.984
Decision Tree                    94.5%      0.993       0.991    0.992
K-Nearest Neighbors (KNN)        94.2%      0.985       0.991    0.988
Logistic Regression              92.7%      0.927       0.943    0.935
Naive Bayes                      90.0%      0.997       0.292    0.450

Among all the models, the Gradient Boosting Classifier achieved the
highest accuracy (97.4%), with an F1-score of 0.974, recall of 0.988, and
precision of 0.989. The confusion matrix for this model is shown below:
Table 5.2: Confusion Matrix for Gradient Boosting Classifier

                   Predicted Negative      Predicted Positive
Actual Negative    933 (True Negatives)    43 (False Positives)
Actual Positive    14 (False Negatives)    1221 (True Positives)

This matrix indicates that:

• The model correctly identified 933 out of 976 negative cases.

• It correctly identified 1221 out of 1235 positive cases.

• It produced 43 false positives and only 14 false negatives, reflecting its
strong predictive capability.

5.1.1 Key Observations


Among the machine learning classifiers evaluated for this binary
classification problem, the Gradient Boosting Classifier (GBC) achieved the
highest accuracy in Table 5.1 while keeping precision and recall high. This
section discusses why GBC worked so well, and provides a
comparative analysis of other models in the context of their relative strengths
and weaknesses.

Gradient Boosting Classifier (GBC) – Why It Excelled

The Gradient Boosting Classifier is an ensemble learning technique that
builds models sequentially, where each subsequent model corrects the errors of
the previous one. It combines multiple weak learners (typically decision trees)
into a strong learner by focusing on the residual errors. Here’s why it worked
particularly well:

• Focus on Hard-to-Classify Instances: GBC emphasizes misclassified
samples by adapting the subsequent trees to correct them, thereby
reducing bias.

• Feature Interaction Handling: Since GBC uses decision trees as base
learners, it automatically handles non-linear relationships and feature
interactions effectively.

• Robust to Outliers: GBC is less sensitive to noisy data due to its
iterative correction nature.

• Hyperparameter Tuning Flexibility: Parameters like learning rate,
max depth, and number of estimators allow fine-tuning to avoid
overfitting.

• Well-Balanced Precision and Recall: The confusion matrix shows
only 14 false negatives and 43 false positives, resulting in both high
recall and high precision.

Figure 5.1: The Gradient Boosting Classifier process showing sequential
decision trees correcting errors from the previous model.

This balance is ideal for real-world classification tasks where both types of
errors are costly. In short, GBC succeeds due to its adaptability, optimization
strategy, and robustness.
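The sequential error-correcting idea behind gradient boosting can be illustrated with a toy sketch. This is not the classifier used in the project: it assumes a single numeric feature, uses one-split decision stumps as weak learners, and fits residuals under squared error, whereas a library classifier typically optimizes a classification loss. All names and data values are illustrative.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Skip the largest value: it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict(base, stumps, x, learning_rate=0.5):
    """Sum the base prediction and the scaled output of every stump."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - predict(base, stumps, x, learning_rate)
                     for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def mean_squared_error(xs, ys, base, stumps, learning_rate=0.5):
    return sum((y - predict(base, stumps, x, learning_rate)) ** 2
               for x, y in zip(xs, ys)) / len(ys)


# Toy data: one feature, labels encoded as 0/1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 0, 1, 1]
errors = {}
for rounds in (0, 1, 20):
    base, stumps = fit_boosting(xs, ys, n_rounds=rounds)
    errors[rounds] = mean_squared_error(xs, ys, base, stumps)
```

The training error falls as rounds are added, which mirrors the description above: every new weak learner is fitted to what the current ensemble still gets wrong, and the learning rate controls how much of each correction is applied.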

CatBoost Classifier – Close Competitor with Slight Trade-offs

CatBoost is another gradient boosting algorithm tailored especially for
categorical features. It uses symmetric trees and applies Ordered Boosting to
prevent overfitting. Its performance was very close to GBC:

• Pros: High precision (0.991) and recall (0.990) demonstrate that
CatBoost is highly capable of classifying both classes accurately.

• Why It Didn’t Surpass GBC: The small drop in accuracy and F1-score
could be due to the data not containing many categorical features,
thus underutilizing CatBoost’s unique strength.

Figure 5.2: CatBoost Classifier showing feature handling and boosting process.

While it performed well, its complex internal processing (like handling
categorical data natively) might not have had a significant advantage over
GBC in this specific dataset.

Random Forest – Strong Recall, Slightly Lower Accuracy

Random Forest (RF) is another ensemble technique that aggregates
predictions from multiple decision trees using bagging (bootstrap aggregation). It
achieved a high recall of 0.993 and precision of 0.990:

• Pros: Excellent at avoiding overfitting due to averaging across trees;
strong generalization capability.

• Why It Lagged: Slightly lower accuracy and F1-score compared to
GBC could be attributed to its less aggressive focus on hard-to-classify
samples. Unlike GBC, RF does not learn sequentially, which limits its
ability to correct mistakes made by individual trees.

RF works well in most cases, but in this high-stakes classification scenario,
GBC’s targeted optimization gives it a slight edge.

Figure 5.3: Random Forest process with aggregation of multiple decision
trees.

Support Vector Machine (SVM) – Solid General Performance

SVM attempts to find the optimal hyperplane that maximizes the margin
between two classes. It yielded good results with an F1-score of 0.968 and
recall of 0.980:

• Pros: Strong in high-dimensional spaces and effective when clear margins
exist between classes.

• Limitations: SVM struggles with large datasets and is sensitive to the
selection of kernel and hyperparameters. Non-linear kernels can increase
complexity and training time.

• Why It Didn’t Top the List: The complexity of the data might have
required a non-linear kernel, and even then, SVM lacks the ensemble
power to correct its own mistakes as boosting does.

Multi-layer Perceptron (MLP) – Good, But Needs More Data

MLP is a type of neural network that learns complex patterns through
layers of interconnected neurons. It achieved a recall and a precision of
0.984 each:

• Pros: Excellent for capturing non-linear relationships, scalable with large
feature sets.

• Limitations: MLP requires large datasets and longer training times to
converge effectively.

• Why It Underperformed: Neural networks often require careful tuning
(e.g., learning rate, activation functions, architecture) and large amounts
of data. If the dataset is small or lacks depth, MLP can easily overfit
or underfit.

Figure 5.4: Multi-layer Perceptron process showing how layers of neurons
learn complex patterns.

Decision Tree – High Recall, But Overfitting Risk

The Decision Tree model recorded a recall of 0.991 and precision of
0.993, indicating strong performance. However, its accuracy and F1-score were
marginally lower:

• Pros: Easy to interpret and implement; good with small datasets.

• Limitations: Prone to overfitting, especially if the tree grows too deep.

• Why It Didn’t Win: While the recall is high, overfitting can cause
the model to perform worse on unseen data. The absence of ensemble
regularization (like in GBC or RF) makes it less robust.

K-Nearest Neighbors (KNN) – Sensitive to Distance and Scaling

KNN works by calculating the distance between instances and assigning
the class based on the majority of neighbors:

• Pros: Simple and non-parametric; no training phase.

• Limitations: Highly sensitive to feature scaling and irrelevant features;
computationally expensive on large datasets.

• Why It Lagged: Although KNN showed high recall (0.991), its lower
accuracy suggests that it misclassified some instances due to proximity
ambiguity. It doesn’t learn any internal model, making it ill-suited for
high-dimensional or complex data.
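The distance-and-majority-vote procedure can be sketched in a few lines. This is an illustrative sketch with assumed names, Euclidean distance, and made-up two-dimensional points; it is not the configuration used in the experiments.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up points: label -1 near the origin, label 1 near (5, 5).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = [-1, -1, -1, 1, 1, 1]
near_origin = knn_predict(points, labels, (0.5, 0.5))
near_five = knn_predict(points, labels, (5.5, 5.5))
```

Every prediction scans all stored training points, which is why the method has no training phase but becomes expensive on large datasets, and why unscaled features with large ranges can dominate the distance.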

Logistic Regression – Good Baseline, Limited Power

Logistic Regression is a linear model that estimates probabilities using a
sigmoid function:

• Pros: Fast and interpretable; good for linearly separable data.

• Limitations: Struggles with non-linear relationships, outliers, and
complex feature interactions.

• Why It Underperformed: The model achieved a recall of 0.943 and
precision of 0.927, indicating decent balance but clear limitations in
capturing intricate patterns. It’s better used as a baseline model.

Naive Bayes Classifier – Extremely High Precision, But Low Recall

Naive Bayes applies Bayes’ theorem with a strong (naive) assumption that
features are independent:

• Pros: Fast, efficient, works well on text classification.

Department of Information Technology 39


• Limitations: Assumes feature independence which rarely holds in real-
world data.

• Why It Failed: Although it achieved an extremely high precision (0.997),


the recall was just 0.292, meaning it failed to identify the majority of
positive cases. This model only predicted the positive class when it was
absolutely certain, which minimized false positives but resulted in a large
number of false negatives (low recall).
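The cost of this precision-recall imbalance can be made concrete with the F1-score, the harmonic mean of precision and recall. The sketch below uses only the precision and recall values quoted above for Logistic Regression and Naive Bayes; the helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the discussion above
reported = {
    "Logistic Regression": (0.927, 0.943),
    "Naive Bayes": (0.997, 0.292),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

With these inputs the F1-score is about 0.935 for Logistic Regression and about 0.452 for Naive Bayes, so the near-perfect precision of Naive Bayes does not compensate for its low recall.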

The Gradient Boosting Classifier emerged as the most balanced and accurate
model for this classification problem due to its strong learning strategy, error-
correcting iterations, and robustness to noise. While CatBoost and Random
Forest showed nearly similar performance, GBC’s precision-recall balance and
lower misclassification rate make it the most dependable choice.

5.1.2 Execution Results of Phishing Website Detection System
The following screenshots showcase the successful execution of the Phishing
Website Detection web application developed using machine learning models.
The core objective of this application is to evaluate a given URL and determine
whether it is a legitimate website or a phishing attempt. This classification is
made based on various features extracted from the URL structure and related
metadata. The user interface is built with a clean and minimalistic design to
ensure accessibility and user-friendliness.

Figure 5.5: Phishing Detection Website



In the first screenshot, the URL https://t.co/4G8J8ZUck is entered into
the input field, and the “Scan URL” button is clicked. The backend model
processes the input and flags it as “Website is Not Safe to Use.” This indicates
that the URL in question likely redirects users to a potentially harmful or
suspicious phishing website. To offer users some degree of control, a secondary
button labeled “Still want to Continue” is provided, which allows the user to
proceed at their own risk after being warned. This implementation mirrors
real-world systems that balance proactive warnings with user freedom.

Figure 5.6: Checking Whether the URL is Legitimate or Not

In the second screenshot, the input field contains the legitimate and well-
known domain https://web.whatsapp.com/. Upon clicking “Scan URL,” the
result returned is “Website is Safe to Use,” and a green button labeled
“Continue” appears. This reassures the user that the domain has passed all
phishing checks and is not flagged as malicious.
The classification results are produced by a trained machine learning model,
which, in this case, is the Gradient Boosting Classifier. This model analyzes
a wide range of features, including URL length, presence of an IP address,
number of subdomains, presence of “@” symbols, and HTTPS tokens. If the
extracted features resemble patterns commonly associated with phishing
websites, the model categorizes the URL as dangerous. Otherwise, it
considers it safe.
Aesthetically, the application is designed with color cues (red for unsafe,
green for safe) to provide immediate visual feedback. The warning and
success messages are also surrounded by soft borders and cards to enhance
user experience and accessibility. Each component, from the input field to
the result display, plays a role in ensuring that users can quickly assess
the safety of any given URL. With instant feedback and a visually intuitive
design, the tool bridges the gap between complex machine learning models
and user-friendly access. Color-coded responses and actionable buttons not
only inform users but also empower them to make decisions regarding their
online safety.
This application can be particularly useful for users who frequently engage
in online transactions or communications involving external links. By
automating the phishing detection process, it reduces reliance on manual
inspection and enhances browsing safety. This is especially relevant in
scenarios where time-sensitive decisions are required, such as during
email-based phishing attacks or suspicious SMS links.
The strength of the system lies in its integration of frontend simplicity
with powerful backend intelligence. While users interact with a minimal
interface, the underlying machine learning model processes numerous features
and applies pattern recognition algorithms to deliver accurate classification.
This seamless interaction supports both technical robustness and user trust.
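The URL features named above (URL length, IP address, subdomain count, “@” symbol, HTTPS token) could be extracted along the following lines. This is a sketch under stated assumptions, not the project's actual extraction code: the function name and dictionary keys are hypothetical, the subdomain count naively treats the last two host labels as the registered domain, and the HTTPS token is interpreted as the string "https" appearing inside the host name.

```python
import ipaddress
from urllib.parse import urlsplit

def extract_url_features(url):
    """Return a small dictionary of lexical features for one URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # An IP address used in place of a domain name is a common phishing cue.
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False

    # Assumption: the last two labels form the registered domain
    # (this is inaccurate for suffixes such as .co.uk).
    labels = host.split(".") if host else []
    num_subdomains = 0 if has_ip else max(len(labels) - 2, 0)

    return {
        "url_length": len(url),
        "has_ip_address": has_ip,
        "num_subdomains": num_subdomains,
        "has_at_symbol": "@" in url,
        "https_token_in_host": "https" in host,
        "uses_https": parts.scheme == "https",
    }

print(extract_url_features("https://web.whatsapp.com/"))
print(extract_url_features("http://192.168.0.1/login@verify"))
```

A feature vector for the classifier would then be built by listing these values in a fixed order that matches the order used during training.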
In the real world, phishing attacks have become increasingly sophisticated,
often mimicking trusted platforms to deceive users. A tool like this, which
offers instant analysis backed by machine learning, adds an essential layer of
defense. It acts as a first line of protection before users are unknowingly
exposed to threats that can lead to data breaches or financial loss.
In terms of functionality, the application demonstrates end-to-end integra-
tion: from user input to machine learning inference to dynamic rendering of
results on the user interface. The backend is developed using Flask, and the
model is loaded from a pre-trained .pkl file that holds the Gradient Boosting
Classifier.



CHAPTER 6
CONCLUSION
In an increasingly digital world, where users rely heavily on online services,
the threat posed by phishing attacks has escalated at an alarming rate.
Cybercriminals constantly exploit vulnerabilities in human behavior and online
platforms, deceiving users into disclosing personal and financial information.
This project, “Phishing Website Detection Using Machine Learning,”
was conceptualized and implemented to counteract such threats by building
an intelligent, automated system capable of classifying websites based on their
URLs and associated features.
The project merges the power of machine learning with web development
to offer a practical and real-time solution for phishing detection. The primary
objective of the project was to develop a system that can accurately distinguish
between legitimate and phishing websites using a machine learning model
trained on a curated dataset. This goal was achieved through the systematic
implementation of key phases, including data preprocessing, feature extraction,
model selection, training and evaluation, and web interface integration.
A robust dataset containing thousands of records, labeled as either phish-
ing or legitimate, was utilized for training. Each URL in the dataset was
transformed into a set of relevant features such as URL length, presence of
special characters, domain age, redirection behavior, and use of secure HTTPS
protocol. These features were carefully selected based on extensive literature
reviews and empirical insights from existing phishing patterns.
Among the different classifiers evaluated during the model selection phase,
the Gradient Boosting Classifier was chosen due to its superior performance
in terms of accuracy, precision, recall, and F1-score. It demonstrated high
prediction reliability and resilience to noise in the data, making it a suitable
candidate for real-world application. The trained model was then serialized
using the pickle library and integrated into a Python Flask-based web

application, allowing end users to interact with the system via a simple and
intuitive graphical interface.
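The serialize-then-serve flow described above can be sketched with the standard library alone. Everything model-related here is a hypothetical stand-in: a small dictionary of linear weights takes the place of the fitted Gradient Boosting Classifier, an in-memory buffer takes the place of the .pkl file, and the two features are illustrative. Only the two result messages are taken from the application described in this report.

```python
import io
import pickle

def predict_label(model, features):
    """Stand-in scorer: 1 = phishing, 0 = legitimate (encoding is an assumption)."""
    score = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    return 1 if score > 0 else 0

def classify(model, url):
    """Turn a URL into the message shown on the result page."""
    features = [len(url) / 100.0, 1.0 if "@" in url else 0.0]
    if predict_label(model, features) == 1:
        return "Website is Not Safe to Use"
    return "Website is Safe to Use"

# Training side: serialize the (hypothetical) trained model.
trained = {"weights": [1.0, 2.0], "bias": -0.75}
buffer = io.BytesIO()          # stands in for open("model.pkl", "wb")
pickle.dump(trained, buffer)

# Serving side: load the model once at startup, then reuse it per request.
# Only unpickle files from a trusted source; pickle can execute arbitrary code.
buffer.seek(0)
loaded = pickle.load(buffer)

print(classify(loaded, "https://web.whatsapp.com/"))
print(classify(loaded, "http://example.com/@login"))
```

In a Flask application, the load step would typically run once when the server starts, and a function like `classify` would be called from the route that handles the “Scan URL” form.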
The web interface enables users to input a URL and receive an instant
assessment of whether the website is safe or potentially malicious. Depending
on the classification outcome, the application returns visual cues and warning
messages, including options to proceed at one’s own risk in case of a detected
phishing site. These functionalities aim to enhance awareness and empower
users to make informed decisions before accessing potentially harmful links.
Two critical screenshots, which form part of the execution results, confirm
the system’s effectiveness in correctly classifying both phishing and legitimate
URLs.
From a technical standpoint, the system encapsulates end-to-end machine
learning development — starting from data ingestion to final deployment. It
highlights the importance of data preprocessing, feature engineering, model
training, evaluation, and most importantly, integration with real-world appli-
cations. The system also exemplifies the practical use of Flask for creating
lightweight APIs and serving machine learning models without excessive over-
head.

Key Strengths of the System


• Simple and intuitive user interface suitable for non-technical users.

• Real-time feedback using color-coded results and warning messages.

• Lightweight Flask-based backend for seamless integration.

• Accurate predictions using Gradient Boosting Classifier with robust
feature engineering.

Identified Limitations
• The current model relies only on static URL-based features.

• Binary classification (safe or unsafe) lacks nuance in threat severity.

• Dataset may not cover the most recent or evolving phishing tactics.



Future Enhancements
• Incorporate dynamic content analysis (e.g., webpage layout, script be-
havior).

• Add a confidence score or multi-level risk categorization (e.g., safe,
suspicious, dangerous).

• Integrate threat intelligence databases and blacklists.

• Implement real-time crawling and DNS analysis for feature enrichment.

• Improve deployment by hosting on cloud platforms and adding security
layers such as:

– Rate limiting

– User authentication

– Logging and monitoring

– Protection against injection attacks
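The multi-level risk categorization proposed in the list above could start from the probability that the classifier assigns to the phishing class (scikit-learn classifiers such as the Gradient Boosting Classifier expose this through `predict_proba`). The sketch below is illustrative only: the function name and the 0.3 and 0.7 cut-offs are hypothetical and would need to be tuned on validation data.

```python
def risk_level(phishing_probability, suspicious_at=0.3, dangerous_at=0.7):
    """Map a phishing probability in [0, 1] to one of three risk tiers."""
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if phishing_probability >= dangerous_at:
        return "dangerous"
    if phishing_probability >= suspicious_at:
        return "suspicious"
    return "safe"

for p in (0.05, 0.45, 0.92):
    print(p, risk_level(p))
```

Returning the probability alongside the tier would also give users the confidence score mentioned in the same enhancement.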

Looking ahead, this project lays the groundwork for building more compre-
hensive phishing detection systems that combine multiple sources of intelligence
— including Natural Language Processing (NLP) for analyzing page content
and deep learning models for behavioral analysis. The modularity of the
system also allows it to be expanded into a browser extension, mobile app,
or API service that integrates with enterprise cybersecurity frameworks.
In conclusion, this project has successfully demonstrated how machine
learning can be harnessed to detect phishing websites based on URL features.
It combines theoretical knowledge with practical implementation, offering a
reliable and user-friendly tool that addresses a real-world cybersecurity prob-
lem. While there is room for further enhancements, the current system
effectively fulfills its intended purpose and provides a strong foundation for
future development. It highlights the growing role of artificial intelligence in
protecting digital identities and promoting safer internet usage for individuals
and organizations alike.



CHAPTER 7
FUTURE SCOPE
As phishing attacks continue to evolve in complexity and frequency, the
scope for enhancing detection systems remains vast. One significant direction
lies in the incorporation of dynamic features beyond static URL-based analysis.
Static features can only provide limited insights, while dynamic behavioral
indicators offer a deeper understanding of a website’s intent. By analyzing
the content of a web page—such as textual cues, layout similarity to known
brands, and deceptive branding practices—phishing detection can be made
more context-aware. Monitoring HTML and JavaScript behavior for malicious
scripts, hidden form fields, or redirection chains can reveal stealthy attack
vectors. Additionally, tracking DNS records, SSL/TLS certificate patterns
(such as the use of short-term, free certificates), and geographic inconsistencies
between domain origin and hosting can all contribute to identifying potentially
harmful sites. These dynamic inspections can be supported by headless
browsers driven through automation tools (e.g., Selenium, Puppeteer),
real-time web scrapers, and external services such as WHOIS lookups,
VirusTotal, and Google Safe Browsing for enhanced threat evaluation.
The adoption of deep learning techniques offers another promising path
forward. Traditional machine learning models often struggle to detect subtle
or unseen phishing tactics, whereas deep learning models can capture intricate,
abstract patterns from large volumes of data. Convolutional Neural Networks
(CNNs) are suitable for analyzing visual similarities in website layouts and
detecting spoofed brand pages. Recurrent Neural Networks (RNNs), on the
other hand, excel in identifying malicious sequences within URLs and JavaScript
code. Transformer models like BERT or Vision Transformers (ViT) can be used
for understanding contextual semantics in both text and visual elements of a
web page. Hybrid architectures that combine CNNs, RNNs, and Transformers
can further amplify detection accuracy by leveraging the strengths of multiple

modalities.
To maintain long-term effectiveness, phishing detection systems must be
capable of continuous learning and adaptation. Static models degrade in per-
formance over time as attackers adopt new tactics. Future systems should
employ automated data pipelines that regularly ingest fresh data from hon-
eypots, threat databases, and public reports. These systems can benefit from
scheduled retraining or online learning techniques to incrementally update
model parameters without requiring full-scale retraining. Federated learning
offers a privacy-preserving way to personalize detection across devices without
compromising user data, which is especially valuable in mobile and enterprise
environments.
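Online learning, as mentioned above, updates model parameters one labeled example at a time instead of retraining from scratch. The following is a minimal sketch using a logistic model trained by stochastic gradient descent; the class name, learning rate, and the single toy feature are hypothetical, and a production system would more likely rely on an established library.

```python
import math

class OnlineLogisticModel:
    """Logistic model whose weights are adjusted after every new example."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, x, y):
        """One gradient step on the log-loss for a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * v for w, v in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error

# Toy stream: feature value 1.0 marks phishing (label 1), 0.0 marks legitimate.
model = OnlineLogisticModel(n_features=1)
for _ in range(200):
    model.partial_fit([1.0], 1)
    model.partial_fit([0.0], 0)

print(model.predict_proba([1.0]), model.predict_proba([0.0]))
```

Each call to `partial_fit` touches only the current example, so freshly labeled URLs from threat feeds could be folded in as they arrive.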
Another crucial area for development is cross-platform usability and real-
time deployment. For phishing detection tools to be truly effective, they must
be easily accessible and lightweight across various user environments. Browser
extensions that actively monitor browsing sessions, mobile applications that
scan SMS and messaging platforms, and desktop plugins that integrate with
email clients are all practical options. Moreover, providing phishing detection
as an API service allows for seamless integration into third-party applications
such as email security gateways, enterprise dashboards, and educational learning
platforms. This not only ensures broader reach but also promotes a layered
approach to cybersecurity.
Integrating real-time threat intelligence feeds can significantly strengthen
phishing detection models. Blacklists and reputation-based databases allow for
quick identification of known malicious domains, while APIs from cybersecurity
services such as PhishTank and IBM X-Force provide real-time indicators of
compromise. When combined with machine learning models, these threat feeds
can serve as an additional decision-making layer, improving both speed and
accuracy of detection through hybrid methodologies.
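The hybrid decision layer described above can be sketched as a blacklist lookup followed by a model fallback. The blacklist here is a hypothetical local set of host names and `model_probability` is any caller-supplied function returning a phishing probability; no real threat-intelligence API is called in this sketch.

```python
from urllib.parse import urlsplit

def hybrid_verdict(url, blacklist, model_probability, threshold=0.5):
    """Return (verdict, source): known-bad hosts short-circuit the model."""
    host = (urlsplit(url).hostname or "").lower()
    if host in blacklist:
        return "phishing", "blacklist"
    probability = model_probability(url)
    verdict = "phishing" if probability >= threshold else "legitimate"
    return verdict, "model"

# Hypothetical blacklist and a stand-in model that flags '@' in the URL.
known_bad_hosts = {"bad.example"}
toy_model = lambda url: 0.9 if "@" in url else 0.1

print(hybrid_verdict("http://bad.example/login", known_bad_hosts, toy_model))
print(hybrid_verdict("http://shop.example/@pay", known_bad_hosts, toy_model))
print(hybrid_verdict("https://shop.example/", known_bad_hosts, toy_model))
```

Checking the blacklist first keeps the response fast for known threats, while the model handles URLs that no feed has reported yet.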
In conclusion, the future of phishing detection lies in building intelligent,
adaptive, and user-centric systems that blend advanced technical capabilities
with practical real-world applications. By continuing to expand in these
directions, we can create a more secure and informed digital society.





Deep Learning Based Phishing Detection System
Using URLs and Website Content

1st P Abhitej Reddy
Information Technology
Chaitanya Bharathi Institute of Technology
Hyderabad, India
ugs21054_it.abhitej@cbit.org.in

2nd N Abhishek
Information Technology
Chaitanya Bharathi Institute of Technology
Hyderabad, India
[email protected]

3rd Sathvik Kadali
Information Technology
Chaitanya Bharathi Institute of Technology
Hyderabad, India
[email protected]

4th G Srikanth
Information Technology
Chaitanya Bharathi Institute of Technology
Hyderabad, India
[email protected]

Abstract--Fake websites, often deployed in phishing attacks, are a growing
cybersecurity threat designed to deceive users and steal sensitive information
like login credentials and financial data. Traditional detection methods are
frequently outpaced by the sophistication of new phishing techniques. This
study introduces a deep learning-based detection system that leverages both
domain names and web page content to accurately identify malicious sites. The
system integrates deep learning architectures, such as Artificial Neural
Networks (ANN), Convolutional Neural Networks (CNN), and Recurrent Neural
Networks (RNN), each contributing unique capabilities for feature extraction:
CNNs excel at identifying spatial patterns, while RNNs effectively analyse
sequential data. Our approach includes tokenization, character embedding, and
web content analysis to provide adequate feature coverage for the various
types of phishing attacks. By tokenizing and embedding domain names and web
content, the system attains a richer feature representation of URLs and web
pages. CNNs are used to identify the spatial characteristics of the domain and
the structure of web content, and RNNs to identify the sequential
characteristics that may indicate suspicious activity. This combined approach
enables the model to evaluate not only the syntactical structure of the domain
names but also the semantic content of the web page. To measure the
performance of the proposed system, standard metrics including accuracy,
precision, recall and F1-score are applied. The results show that this
combined approach, where CNNs deal with URL structure and position of the
content, is an effective, fast and scalable method for fake website detection.

Index Terms: Phishing attacks, Deep learning-based detection, Convolutional
Neural Networks (CNN), Web content analysis, Fake website detection

I. INTRODUCTION

Fake websites have become a major problem in the cybersecurity domain because
they exploit existing flaws to steal personal and company data, including
login details and money. Used mainly in phishing, these websites are normally
created with the intention of harming many people. As threats become more
complex, older tools barely cope with the constantly evolving trends of
phishing, so new strategies have to be developed to address this problem.

Recent advances in deep learning (DL) have opened prospects for more accurate
identification of fake websites. Unlike other methods, DL-based systems can
exploit large amounts of data to encode several features and identify both
syntactic and semantic features of malicious domains and web pages. CNN and
RNN architectures have been found to be particularly effective because they
allow for the examination of spatial patterns and sequential data, which are
critical in the identification of phishing websites.

In this research, a new deep learning-based approach for the detection of fake
websites using domain names and web content analysis is proposed to improve
the effectiveness and efficiency of current detection approaches.
Specifically, the proposed approach is based on tokenization, character
embedding, and web content feature extraction, which makes it possible to
address various types of phishing attacks. The research also assesses the
system using the quantitative metrics of accuracy, precision, recall, and
F1-score, supporting the applicability of the approach to real-world
applications.

This paper discusses the difficulties of identifying phishing attacks and the
current deep learning approaches, with a focus on CNN and RNN and the
potential of combining the two architectures in the identification of fake
websites. Some of the key contributions are: understanding of the feature
extraction approach and suggestions for further enhancements to deal with
this constantly evolving problem.
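The tokenization step that precedes character embedding can be sketched as follows. This is an illustration under stated assumptions rather than the paper's actual preprocessing: the character vocabulary, the reserved padding and unknown ids, and the fixed length `max_len` are all hypothetical choices.

```python
import string

# Assumed vocabulary: lowercase letters, digits, and common URL punctuation.
VOCAB = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary characters
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}

def encode_url(url, max_len=20):
    """Convert a URL into a fixed-length list of integer character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in url.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

print(encode_url("https://web.whatsapp.com/", max_len=30))
```

An embedding layer in the network would then map each integer id to a dense vector, and the resulting sequence would be passed to the CNN and RNN components.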
II. LITERATURE REVIEW individually. The high precision of the weighted ensemble
is attained through the analysis of different features derived
In the study titled “Phishing Detection from URLs Using
from URLs and website content. Future work will focus on
Deep Learning Approach,” [1] the authors put forward a extending the ensemble framework with other machine
CNN model that can effectively identify phishing websites learning methods for the task.
according to their URLs. This methodology involves pre-
processing the URLs in order to extract features which will In “Machine Learning and Deep Learning for Phishing Page
then be used in feeding the CNN for classification. The
Detection,” [5] the authors present a survey of the various
research then evaluates the CNN results against machine learning and deep learning approaches and
conventional machine learning algorithms to show that deep
compare the performance of some of them such as XGBoost
learning can further improve detection performance. and SVM in the detection of phishing pages. The features
The model was tested with a special dataset, which includes extracted are from URL and HTML and the classification
both phishing and legitimate URLs and the results are
can be done using different methods. The study concludes
satisfactory. The possibilities for future work are to increase that XGBoost is more efficient than conventional ML
the amount of data in the dataset and consider the use of
techniques with a score of 86.8% which is efficient in
CNN combined with other algorithms to enhance the dealing with large data. Furthermore, CNN, and other deep
detection performance. learning models, are assessed in terms of their capacity to
learn complex patterns in data. The future studies will be
In the paper titled, “A Deep Learning-Based Phishing devoted to improving the approaches to select features and
Detection System Using CNN, LSTM” [2] the authors
improving the interpretability of the models.
discuss the use of both CNN and LSTM for the purpose of
phishing detection. It consists of feature extraction from the In “Using Machine Learning to Detect and Classify URLs,”
structures of URLs and the content of the webpages which [6] the authors discuss the following machine learning
is fed into the CNN that is used to identify spatial algorithms for classifying URLs as either safe or dangerous.
hierarchies, and into LSTM that is used to analyse
The methodology consists of feature extraction from the
sequential features. This makes its detection rates better URL strings where features include the length of the URL
than those of traditional architectures while benefiting
string, the number of characters and whether the URL string
from the flexible modularity of the second architecture. The contains special characters. Using classifiers such as
work employs a novel dataset for the training and validation decision trees and logistic regression, the performance in
of the model and indicates how the introduced LSTM
public datasets of known phishing sites is assessed. The
component helps improve the detection of phishing attempts focus of the study is on the solution that can be implemented
according to temporal characteristics of the URLs. Future
into existing security systems for URL classification in real-
work seeks to improve or enhance feature extraction procedures and also experiment with
other deep learning paradigms.

In the paper titled “DEPHIDES: Deep Learning Based Phishing Detection System” [3], the
authors propose an approach that combines different deep learning methodologies, such as
ANN, CNN, and RNN, to improve the chances of detecting phishing. The approach entails
building individual models, each trained on a dedicated set of phishing and legitimate
websites, followed by a voting process to improve accuracy. The detection capability of
each model stems from its structure, which enables more effective detection of a wider
range of phishing attacks. The findings suggest that the ensemble is superior to
single-model techniques, supporting the applicability of this ensemble technique. The next
steps for research concern fine-tuning the model parameters and applying further
techniques for combining models.

In the paper “A Weighted Ensemble Model for Phishing Website Detection” [4], the authors
present a new weighted ensemble of RF and DNN. The models are weighted according to
performance indicators measured during training on a specific dataset of phishing sites.
This lets the ensemble take advantage of the more accurate model while limiting the
weaknesses of the less accurate one. The study shows that this approach increases
detection accuracy by a large margin compared to the two models when used

time. Further work will consist in a more accurate definition of feature extraction
procedures and an attempt to apply other classification algorithms in order to increase
classification effectiveness.

In “BERT-Based Approaches to Identifying Malicious URLs” [7], the authors employ BERT to
detect malicious URLs with enhanced natural language processing methods. The approach
tokenizes URL strings and represents them through BERT's embeddings to capture the
relational context of characters efficiently. It improves detection capability by using
the self-attention mechanisms inherent in the BERT architecture. The study demonstrates
significant improvements in identifying malicious URLs compared to traditional methods
such as n-grams or bag-of-words models. Future research will focus on optimizing BERT's
performance for real-time applications in cybersecurity.

The authors of “Developing a Context-Aware Convolutional Neural Network (CACNN)” [8]
propose a new CNN that detects phishing attempts based on the context of the URL structure
and the content of the web page. The proposed methodology combines semantic analysis with
standard feature extraction techniques to provide the model with contextual clues to
phishing. After training on a diverse set of phishing scenarios, CACNN is shown to perform
better in detection than standard CNN models alone. Future work intends to enhance the
context-aware techniques and enlarge the set of phishing techniques that can be used to
train the classifier.
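The voting and weighted-ensemble schemes surveyed above for [3] and [4] can be illustrated
with a minimal sketch. This is not the implementation used in either paper: the function
names, the 0.5 decision threshold, and the example weights are assumptions chosen only to
show how majority voting differs from a performance-weighted average of per-model phishing
probabilities.

```python
from typing import Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Return 1 (phishing) if more than half of the models predict 1, else 0."""
    if not labels:
        raise ValueError("at least one model prediction is required")
    return 1 if sum(labels) * 2 > len(labels) else 0


def weighted_vote(probs: Sequence[float],
                  weights: Sequence[float],
                  threshold: float = 0.5) -> int:
    """Combine per-model phishing probabilities using per-model weights.

    The weights are hypothetical here; in a weighted ensemble they would
    typically be derived from each model's validation performance.
    """
    if not probs or len(probs) != len(weights):
        raise ValueError("probs and weights must be non-empty and equally long")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(p * w for p, w in zip(probs, weights)) / total
    return 1 if score >= threshold else 0


if __name__ == "__main__":
    # Three models vote on one URL: two say phishing, one says legitimate.
    print(majority_vote([1, 1, 0]))
    # Two models disagree; the more trusted model (weight 0.8) dominates.
    print(weighted_vote([0.9, 0.2], [0.8, 0.2]))
```

With equal weights the weighted scheme reduces to a plain average of probabilities; unequal
weights let a stronger model outvote a weaker one, which is the behaviour the weighted
ensemble in [4] is described as exploiting.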
In the paper “Data Analytics for Phishing Attack Detection using Deep Learning” [9], the
authors compare the performance of diverse deep learning models, including CNNs, RNNs, and
LSTMs, in detecting phishing attacks on different datasets. The methodology applies
preprocessing steps such as data normalization, feature extraction from URL structures,
and analysis of web page content before feeding the data to the neural networks for
classification. The results show that deep learning models achieve higher accuracy and
withstand new forms of phishing more effectively than traditional machine learning.
Further research will focus on making the models more scalable and applying them to
dynamic threat environments.

In “Deep Learning for Phishing Detection: Taxonomy, Current Challenges” [10], the authors
review the existing deep learning models employed in phishing detection and discuss open
issues, including data deficiency, model explainability, and the computational cost of
training advanced architectures such as DNNs. They divide current methods by methodology,
namely supervised methods using labeled data and unsupervised methods using clustering,
and compare their effectiveness in real-life applications against constantly emerging
cybersecurity threats. Future directions include developing more efficient algorithms
capable of operating effectively with limited data resources.

In “HTMLPhish: Enabling Phishing Web Page Detection” [11], Opara describes a deep learning
approach that pays particular attention to the HTML structure of web pages in order to
detect phishing threats effectively. The approach involves feature extraction from HTML
tags along with text analysis by convolutional neural networks (CNNs). The authors report
high accuracy in classifying benign and malicious websites by training on a set that
contains legitimate sites as well as known phishing ones, based on structural features
only and without resorting to URL analysis, which they present as superior to previous
methodologies that relied on URL analysis as the primary means of classification.

In the paper titled “Phishing Website Detection Using N-gram Features” [12], Korkmaz
highlights the use of n-gram feature extraction in conjunction with classification models
such as Random Forests or Support Vector Machines (SVM) to identify phishing sites mainly
from URL structure rather than from page content, a shift towards the linguistic features
inherent in URLs.

In the paper titled “Model of detection of phishing URLs based on machine learning” [13],
Sahingoz focuses on an ensemble of CNN with MHA, with the aim of increasing the accuracy
of phishing URL detection through pattern recognition applied at multiple layers of the
architecture. The reported test accuracy is compared against benchmarks set by studies
from previous years in this field.

In “Comparative Evaluation of ML Algorithms for Phishing Site Detection” [14], Le's
empirical evaluation centers on how different machine learning algorithms, such as
decision trees and logistic regression, together with a deep learning algorithm, perform
in identifying potentially unsafe websites from predetermined public datasets containing
examples of legitimate and malicious websites. The comparison gives researchers insight
for future implementations aimed at combating the ever-increasing threats that
cybercriminals pose to online users.

In the article titled “A Novel Approach to Detect Phishing Attacks Using Hybrid Models”
[15], Kumar's study presents hybrid modeling approaches that incorporate CNN, LSTM, and
advanced feature engineering techniques to improve the detection of situations in which
users may be vulnerable to attackers who seek unauthorized access to sensitive data stored
on various platforms. The study demonstrates the enhanced performance obtained by
combining multiple methodologies.

In “An Intelligent Mechanism to Detect Phishing URLs” [16], Zaimi's study proposes a
mechanism that uses permutation importance in combination with SMOTE-Tomek links to
improve the identification of potentially malicious web addresses. Pattern recognition
techniques applied across the layers of the mechanism lead to higher precision in tests
against benchmark data established in previous studies in this field over the past few
years, a substantial advancement towards counteracting

III. METHODOLOGY

A. Approach

The approach used in our work is a systematic, multi-step one that includes preprocessing,
feature extraction and selection, and deep learning, in order to provide accurate,
efficient, and reliable phishing URL identification. In essence, the methodology addresses
the difficulties arising from the variety and complexity of URL structures and uses
state-of-the-art models to distinguish between phishing and non-phishing URLs.

Fig. 1. Proposed Architecture

The process starts with the data preprocessing stage, which is crucial for model training
to be successful. The raw dataset is
generally composed of phishing and legitimate URLs and is sometimes noisy, for example
containing duplicate URLs or URLs with missing or unnecessary features. Such
inconsistencies are removed through data cleaning in order to produce a clean dataset that
will improve model performance. For example, URLs that lack the specified attributes, use
non-standard encoding, or contain a lot of noise are eliminated at this stage. This
reduces the chance of passing wrong data to the later classification stages, since the
remaining data is meaningful and consistent.

After data cleaning, the next important step is data transformation, where the data is
processed and all features are scaled to the same range. This is done using the MinMax
Scaler, a very common scaling method that maps a feature's values to a fixed range,
usually from 0 to 1. Scaling ensures that the learning process is not skewed by features
with large numerical values relative to features with small numerical values. This step is
particularly useful for deep learning models, as it helps to speed up convergence during
the learning phase and enhances the accuracy of the models.

After the data is preprocessed and the features to be used are chosen, the next step is to
train the deep learning models. Our methodology utilizes three advanced architectures:
CNN, LSTM, and a combination of LSTM and CNN. Each model is selected for its suitability
for identifying certain patterns in URL data. Specifically, the CNN model is capable of
extracting most of the spatial features from the URLs. It analyzes the URL strings as
sequences of characters or tokens to capture patterns such as the presence of prohibited
keywords, unusual subdomains, and directory structures. Convolutional layers work locally
to detect particular patterns, whereas pooling layers reduce dimensionality and emphasize
the features that stand out. This makes CNNs well suited to recognizing the structural
irregularities that are typical of phishing URLs.

LSTM networks, on the other hand, are good at capturing temporal relations in the data.
URLs also have sequential characteristics, for instance the arrangement of characters or
sequences of characters that are specific to a phishing attack. In the LSTM model, the
data is processed sequentially, and the LSTM maintains previous states to give context to
every character or token. This ability to maintain long-term dependencies allows the LSTM
to detect features that may not be visible in any isolated segment of the URL. For
example, it can detect regularities such as repeated domain names, long URLs, or sequences
of special symbols.

The proposed hybrid LSTM-CNN combines the advantages of both architectures to analyse the
spatial and temporal properties of the data at the same time. In this model, CNN layers
are used for spatial pattern extraction, and LSTM layers are used for sequential data
processing. Combining the two yields a better understanding of the interdependencies
within URL structures. This makes the hybrid especially useful for detecting phishing URLs
that phishers craft to look very similar to real ones.

The last step of the methodology is to classify the URLs according to the results of the
deep learning models. Every model produces a prediction of the input URL as either
phishing or legitimate. To improve the accuracy and reliability of the classification
process, the results from the CNN, LSTM, and LSTM-CNN can be combined by voting or by
taking a weighted average. This ensemble approach makes use of all three models, and the
final classification is less likely to be wrong because each model can compensate for the
errors of the others.

In the evaluation phase, the performance of the models is assessed using specific metrics
such as accuracy, precision, recall, and F1 score. These metrics offer a complete picture
of how well the models identify phishing URLs while keeping both the misidentification of
benign links and the failure to identify malicious links to a minimum. The results are
then examined in order to determine what could be done to improve them, for example
adjusting the hyperparameters or adding more features. In particular, the learning rate,
batch size, and number of layers are adjusted to balance increased model complexity
against overfitting. The aim is for the model to fit the training data neither too closely
nor too loosely, so that it generalizes well to new data.

As the last step, to show the practical applicability of the methodology, a URL
classification system is designed using the trained models. This system can be deployed in
real-world environments, for instance in web browsers, email systems, or cybersecurity
solutions, to protect users from phishing attacks in real time. The user can enter a URL
into the system, which processes the URL, makes a classification, and explains why it made
that classification. This not only increases user trust but also helps in analyzing the
causes that lead to the detection of phishing.

The methodology described above is an integrated one for phishing URL classification: it
includes a stringent data preprocessing step, highly selective feature extraction, and
deep learning models. Because it considers both spatial and temporal characteristics of
URL data, the proposed methodology is highly accurate and has good practical
applicability, and it can be used as an effective measure to prevent phishing attacks in
an era of rapid development of the Internet. The scalable, real-time nature of the
approach makes the methodology a solid base for further work in cybersecurity.

B. Data Collection

The dataset for phishing URL classification was assembled from diverse URLs obtained from
various publicly available repositories as well as threat databases. These sources include
PhishTank, OpenPhish, and Alexa Top Sites, which present a proportional ratio of phishing
and legitimate URLs. To achieve diversity, data was collected from various categories,
since phishing attempts tend to imitate categories such as e-commerce, banking, and social
media, among others. Additional information about each URL, including domain age, WHOIS
records, and hosting details, was also included in the analysis. The dataset covers
multiple time frames to reflect changes in phishing tactics and also contains examples
with different levels of camouflage, including short links and additional redirections.
The consistency and variety of the sources and URL types in the dataset ensure sufficient
and diverse
model training and evaluation to detect phishing URLs in various contexts.

C. Analysis

The analysis phase involves both quantitative and qualitative evaluation to ensure that
the phishing URL classification model delivers reliable and accurate predictions.

1. Quantitative Evaluation: The performance of the URL classification model is assessed
using several key metrics, including accuracy, precision, recall, and F1 score, which
together provide a comprehensive measure of the model's effectiveness in distinguishing
phishing URLs from legitimate ones. Below, TP, FP, and FN denote true positives, false
positives, and false negatives.

a) Precision: This metric calculates the percentage of correctly identified phishing URLs
among all URLs predicted as phishing by the model. A higher precision means fewer
legitimate URLs are mistakenly classified as phishing.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

b) Recall: Recall assesses the model's ability to correctly identify phishing URLs from
all the actual phishing URLs present in the dataset. A high recall means that the model
successfully detects most of the phishing attempts.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

c) F1 Score: The F1 score is the harmonic mean of precision and recall, providing a
balanced evaluation that considers both false positives and false negatives. A high F1
score indicates that the model is both accurate in its positive predictions and effective
in detecting as many phishing URLs as possible.

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

2. Qualitative Evaluation: A brief qualitative assessment of the model's weaknesses is
also carried out to understand which cases are difficult to classify. This involves
visualizing the misclassifications and studying the behavior of URLs that are most often
misclassified as phishing or genuine. From these patterns, the model can be further
developed to enhance its ability to perform in such cases. This may involve examining
aspects such as the length of the URL, the use of special characters, and URL encoding,
all of which are common in phishing sites. An error analysis is also done to identify
specific conditions or outliers within the data that lead to misclassification, in order
to guide subsequent refinement of the model.

IV. RESULTS AND DISCUSSION

New developments in deep learning have helped enhance URL classification systems,
especially for distinguishing phishing from non-phishing URLs. CNN, LSTM, and the hybrid
LSTM-CNN model have proved to be very effective in detecting the malicious URLs. By
integrating both spatial and temporal features, these deep learning models have surpassed
traditional machine learning algorithms such as decision trees and SVMs in terms of both
accuracy and stability. This underlines the need to use several feature sets to improve
the accuracy of URL classification systems. The CNN and LSTM have been remarkably
successful, as they can readily learn URL patterns and differentiate between the two
classes of sites. Combining spatial features, such as domain name structure and the
presence of keywords, with the temporal features provided by LSTM makes for a strong
approach to real-time detection of phishing URLs. In addition, the hybrid LSTM-CNN model
has outperformed the individual LSTM and CNN models, and it is most effective in
identifying the temporal changes and URL-related strategies of phishing techniques.

Compared with simpler models, including Random Forest or SVM, the LSTM-CNN model
demonstrated higher accuracy, meaning that the integration of both spatial and temporal
features provides a better perspective on the URL pattern. Conventional machine learning
models are more suitable for cases with small volumes of data and simple relationships,
whereas deep learning models are more suitable for cases with large volumes of data and
complex relationships; deep learning models are therefore better suited to phishing
detection in modern web environments.

One of the activities that helps enhance the performance of the model is the application
of feature engineering procedures such as SelectKBest, which is used for feature
selection. This way, the model works only with the features that are most important to the
analysis, which in turn speeds up the process and improves the result. MinMax scaling also
helps in training, because it scales each feature to a similar range and thus helps the
model to converge better.

Furthermore, stacked and voting classifiers were discussed as ways to increase the
performance and robustness of the model. These ensemble methods enable better
classification because, instead of relying on a single model, the predictions from
different models are combined, and the impact of errors in any one of them is therefore
reduced. This is especially useful in practical applications, where phishing schemes are
constantly changing and one model may not suffice to meet the new challenges.

As it stands, the classification of phishing URLs remains difficult, especially with the
emergence of adversarial attacks. In new forms of phishing, attackers apply increasingly
refined measures to disguise the URLs of phishing websites so that they resemble genuine
ones. Hence, the model has to be retrained and updated with new data in order to maintain
high levels of accuracy. There is still room for improvement in deep learning models,
particularly LSTM-CNN, and they promise to be an important part of real-time phishing
detection systems.

The scalability of these models in large-scale applications is still an open issue. Deep
learning models are accurate, which is the main advantage of this approach, but in many
cases they can be computationally expensive and time-consuming both for training and for
inference. Remedies such as model pruning or knowledge distillation may help in reducing
such problems and make the
solutions more efficient and scalable for the detection of phishing URLs. In the future,
incorporating deep learning with other techniques, including GNN or reinforcement
learning, might provide even sounder solutions for phishing detection. These models could
use the relations between URLs and their relations within the web environment to identify
less obvious and more advanced phishing attempts. In general, the application of deep
learning models such as CNN, LSTM, or both in the classification of phishing URLs is a
major advancement in the field of cybersecurity. These models are accurate, resilient, and
adaptable because they are able to capture intricate patterns in URL structures and learn
from sequential data. That said, given the constant emergence of new phishing techniques,
further research and development will be needed to preserve the efficiency of the models
presented in this paper and to retain their status as an essential weapon in the battle
against cybercrime.

TABLE I
MODEL COMPARISON BASED ON DATASET AND ACCURACY

Model Name                     Dataset Used                                         Accuracy (%)
CNN                            Custom dataset (23,000 legitimate,                   95.02
                               2,300 phishing URLs)
Random Forest (RF)             PhishTank dataset                                    96.96
Support Vector Machine (SVM)   UCI dataset                                          97.4
XGBoost                        Mendeley and UCI datasets                            86.8
Decision Tree (DT)             PhishTank dataset                                    87.0

V. CONCLUSION

This study highlights the major achievements in the development of deep learning models
for URL classification for the purpose of detecting phishing. Traditional machine learning
models such as decision trees and SVMs have been effective to some extent, but they do not
capture the spatial and temporal correlations of URL patterns. Current deep learning
models are more accurate and flexible, owing to the large datasets and intricate features
they are based on, so they are more effective in detecting phishing attempts. The hybrid
LSTM-CNN models in particular demonstrate the potential to solve the problems encountered
by individual models by combining the advantages of convolutional layers and sequential
learning.

These models, particularly when combined with SelectKBest and MinMax scaling, show
enhanced performance and thus increase the accuracy of detecting phishing URLs.
Nonetheless, directions for future work can be identified, especially regarding the
effectiveness of the framework against constantly changing phishing tactics and ways to
increase model adaptability for large-scale use. Another drawback is the computational
time needed to train deep learning models, which motivates the continuing search for
methods that would minimize this time as well as the resources required.

More work should be devoted to creating better and more efficient deep learning models and
to integrating these models into real-time systems so that they may be used in various
settings. Furthermore, combining such deep learning techniques with other methods, for
instance reinforcement learning or graph neural networks, could also improve phishing
detection abilities. Since phishing attacks are constantly advancing, it is important that
these models adapt to these changes and thus improve the security of the internet.

REFERENCES

1. S. Singh, M. P. Singh, and R. Pandey, "Phishing Detection from URLs Using Deep Learning
Approach," International Journal of Computer Applications, vol. 975, pp. 1–7, 2020.
Published: November 15, 2020.

2. A. Kumar and M. S. Kaur, "A Deep Learning-Based Phishing Detection System Using CNN,
LSTM," Electronics, vol. 12, Article 1232, 2023. Published: January 15, 2023.

3. A. Kumar and R. Sharma, "DEPHIDES: Deep Learning Based Phishing Detection System,"
Journal of Network and Computer Applications, vol. 210, Article 103511, 2024. Published:
March 5, 2024.

4. A. Gupta and R. K. Jain, "A Weighted Ensemble Model for Phishing Website Detection,"
Electronics, vol. 12, Article 232, 2023. Published: February 1, 2023.

5. R. Sharma and P. Kaur, "Machine Learning and Deep Learning for Phishing Page
Detection," Journal of Information Security and Applications, vol. 67, Article 103213,
2023. Published: April 10, 2023.

6. A. Verma and S. Gupta, "Using Machine Learning to Detect and Classify URLs,"
International Journal of Information Security, vol. 21, pp. 345–356, 2023. Published:
May 5, 2023.

7. M. Jha and R. Kumar, "BERT-Based Approaches to Identifying Malicious URLs," IEEE
Transactions on Information Forensics and Security, vol. 18, pp. 1234–1245, 2023.
Published: July 20, 2023.

8. T. Ali and H. Sadiq, "Developing a Context-Aware Convolutional Neural Network
(CACNN)," Journal of Computer Virology and Hacking Techniques, vol. 20, pp. 1–15, 2024.
Published: January 10, 2024.

9. N. Singh and T. Bansal, "Data Analytics for Phishing Attack Detection using Deep
Learning," Future Generation Computer Systems, vol. 134, pp. 456–467, 2023. Published:
March 15, 2023.

10. A. Patel and R. Chaudhary, "Deep Learning for Phishing Detection: Taxonomy, Current
Challenges," ACM Computing Surveys (CSUR), vol. 55, Article 12, 2022. Published:
December 5, 2022.

11. Opara, "HTMLPhish: Enabling Phishing Web Page Detection," Electronics Letters,
vol. 56, pp. 1234–1236, 2020. Published: October 1, 2020.
12. Korkmaz, "Phishing Website Detection Using N-gram Features," Journal of Cyber Security
Technology, vol. 5, pp. 45–60, 2021. Published: February 14, 2021.

13. Sahingoz, "Model of detection of phishing URLs based on machine learning," Computers &
Security, vol. 83, pp. 32–45, 2019. Published: July 22, 2019.

14. Le, "Comparative Evaluation of ML Algorithms for Phishing Site Detection," Computers &
Security, vol. 78, pp. 12–25, 2018. Published: March 30, 2018.

15. Kumar, "A Novel Approach to Detect Phishing Attacks Using Hybrid Models,"
International Journal of Information Management, vol. 63, pp. 102–115, 2023. Published:
April 18, 2023.

16. Zaimi, "An Intelligent Mechanism to Detect Phishing URLs," Future Generation Computer
Systems, vol. 134, pp. 789–800, 2024. Published: January 10, 2024.
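
As a worked illustration of the precision, recall, and F1 definitions given in Section
III-C, the following minimal sketch computes the three metrics from lists of true and
predicted labels. The function name, the label convention (1 for phishing, 0 for
legitimate), and the toy labels are assumptions made for this example; they are not taken
from the experiments reported here.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive class (1 = phishing)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels: 4 phishing URLs and 4 legitimate URLs.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
    print(precision_recall_f1(y_true, y_pred))
```

In the toy data there are three true positives, one false positive, and one false
negative, so precision and recall are both 3/4 and the F1 score, their harmonic mean, is
also 3/4.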