PHISHARMOUR: A RESILIENT PHISHING WEBSITE

DETECTION WITH ENSEMBLE MODEL


A PROJECT REPORT

Submitted by

NITHISH. B (312820104057)
PRADEEP JOSHWA (312820104060)
PRAKASH RAJA. C (312820104061)

Under the guidance of

MS. AISHWARYA (ASST. PROFESSOR, CSE)

in partial fulfilment for the award of the degree of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

AGNI COLLEGE OF TECHNOLOGY

MAY 2024
ABSTRACT

Phishing attacks represent a formidable challenge in the realm of cybersecurity, with
attackers continually evolving their tactics to outsmart existing defense mechanisms.
These deceptive tactics involve the creation of replica websites that are
indistinguishable from legitimate ones, duping unsuspecting users into sharing
sensitive information such as usernames, passwords, and financial details. The
repercussions of such attacks extend far beyond their initial targets, with pilfered
credentials serving as potential entry points for unauthorised access across a
spectrum of popular online platforms. Despite the availability of countermeasures like
anti-phishing tools and browser extensions, the tenacity of these attacks underscores
the inadequacy of prevailing approaches. As the digital landscape becomes more
complex, there arises an urgent need to fortify defences against phishing threats and
to bolster the resilience of cybersecurity measures.

In response to the escalating sophistication of phishing attacks, this project is
dedicated to elevating the efficiency and computational performance of existing
phishing detection models. The overarching goal is a holistic optimization that not only
strengthens the ability to identify phishing attempts but also streamlines the overall
operational framework. This optimization journey involves a multifaceted approach,
encompassing the refinement of algorithms to enhance accuracy, the improvement of
data processing methods for swifter analysis, and the fortification of models against
emerging evasion techniques employed by cybercriminals. A pivotal aspect of this
endeavour is model pruning, a strategic reduction of hardware workload without
compromising the system's resilience. By achieving this delicate balance, the project
aims to ensure that the phishing detection system remains adaptive and efficient in
real-time, responding effectively to the dynamic landscape of cyber threats.

These efforts are not merely technical enhancements; they are crucial for mitigating
the risk of data breaches and identity theft that loom over unsuspecting users. As
phishing attacks become more intricate and insidious, a proactive and adaptive
response is imperative to safeguard the digital ecosystem. Beyond the immediate goal
of threat mitigation, these optimizations contribute significantly to the broader objective
of fostering user privacy, building trust, and preserving data integrity across diverse
online platforms. In essence, this project represents a critical stride towards fortifying
the digital realm against the ever-evolving and persistent menace of phishing attacks.
TABLE OF CONTENTS

CHAPTER NO. TITLE

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 OVERVIEW
1.1.1 DATA SCIENCE
1.1.2 MACHINE LEARNING
1.2 OBJECTIVE
1.3 DESCRIPTION
1.4 STRUCTURE OF THE PROJECT REPORT
2 LITERATURE SURVEY
3 PROBLEM DEFINITION AND METHODOLOGIES
3.1 PROBLEM DEFINITION
3.2 EXISTING SYSTEM
3.3 PROPOSED SYSTEM
3.4 METHODOLOGIES
3.4.1 WEB SCRAPING AND DATA COLLECTION
3.4.2 FEATURE EXTRACTION
3.4.3 FISHER’S SCORE FOR FEATURE SELECTION
3.4.4 MULTILAYER STACKED ENSEMBLE MODEL
3.4.5 XGBOOST MODEL
3.4.6 METRIC SELECTION
4 DESIGN PROCESS
4.1 DESIGN OVERVIEW
4.2 DATA FLOW DIAGRAM
4.3 ARCHITECTURE DIAGRAM
4.4 PROJECT REQUIREMENTS
4.4.1 SOFTWARE REQUIREMENTS
4.4.2 HARDWARE REQUIREMENTS
5 IMPLEMENTATIONS
5.1 ANALYSIS ON THE DATA
5.2 DATA VISUALIZATION
5.3 MODEL BUILDING AND TRAINING
6 EXPERIMENTATION RESULTS
6.1 OBSERVATIONS
6.2 INFERENCES
7 FUTURE WORK AND ENHANCEMENTS
REFERENCES
APPENDIX
A1 – SOURCE CODE
A2 – SCREENSHOTS

LIST OF FIGURES

FIGURE NO. FIGURE NAME

FIGURE 4.1 DATA FLOW DIAGRAM
FIGURE 4.2 FRONT-END BASED ARCHITECTURE DIAGRAM
FIGURE 4.3 BACK-END BASED ARCHITECTURE DIAGRAM
FIGURE 5.1 DATA
FIGURE 5.2 SHAPE OF THE DATA
FIGURE 5.3 FEATURES OF THE DATA
FIGURE 5.4 INFORMATION ABOUT THE DATA
FIGURE 5.5 UNIQUE VALUES OF THE DATA
FIGURE 5.6 DESCRIPTION OF THE DATA
FIGURE 5.7 CORRELATION MAP OF THE DATA
FIGURE 5.8 PIE CHART FOR CLASSES OF THE DATA
FIGURE 5.9 TESTING AND TRAINING DATA
FIGURE 5.10 FISHER’S SCORE FOR THE FEATURES IN THE DATA
FIGURE 5.11 FINAL ACCURACY OF THE MODEL
FIGURE 5.12 CONFUSION MATRIX FOR THE MODEL
FIGURE 5.13 CLASSIFICATION REPORT OF THE MODEL
FIGURE 6.1 THE PERFORMANCE OF XGBOOST AND MLSELM MODEL
FIGURE A2.1 HOMEPAGE OF THE WEBSITE
FIGURE A2.2 URL DETECTOR PAGE OF THE WEBSITE
FIGURE A2.3 VALID PHISHING HAS BEEN PASTED
FIGURE A2.4 VALID PHISHING HAS BEEN DETECTED
FIGURE A2.5 INVALID PHISHING HAS BEEN PASTED
FIGURE A2.6 INVALID PHISHING HAS BEEN DETECTED

LIST OF ABBREVIATIONS

ML Machine Learning

AI Artificial Intelligence

WEKA Waikato Environment for Knowledge Analysis

URL Uniform Resource Locator

MLSELM Multilayer Stacked Ensemble Learning Model

LSTM Long Short-Term Memory

RF Random Forest

CNN Convolutional Neural Network

RNN Recurrent Neural Network

SVM Support Vector Machine

LR Logistic Regression

DT Decision Tree

NB Naive Bayes

SVC Support Vector Classifier

IDS Intrusion Detection System

MITM Man-in-the-Middle

DOS Denial of Service

KNN K-Nearest Neighbors

IOT Internet of Things

HTTPS Hypertext Transfer Protocol Secure

IP Internet Protocol

API Application Programming Interface

MLP Multilayer Perceptron

XGB Extreme Gradient Boosting

RAM Random Access Memory


GPU Graphics Processing Unit

NTP Number of True Positives

NTN Number of True Negatives

NFP Number of False Positives

NFN Number of False Negatives

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW

Phishing attacks represent a fraudulent attempt to obtain confidential information by
posing as a legitimate entity. These attacks often involve fake emails, websites or
messages that trick users into revealing personal information such as passwords,
bank details, and credentials. Despite preventive measures, such as anti-phishing
tools, these attacks persist because attackers keep changing their methods. To
counter these persistent threats, the development of machine learning models for
phishing detection is gaining attention. ML models use algorithms and data analysis
to distinguish between legitimate and phishing websites. Features such as URL
structure, content, and user behavior patterns are analyzed to create predictive
models that can identify potential threats.
However, building ML models for phishing detection is difficult. Obtaining high-quality
data for training and testing models, ensuring model accuracy, adapting to new
phishing techniques, and balancing detection accuracy against computational efficiency
are critical challenges.

Data science and machine learning are closely intertwined disciplines, often used in
conjunction to extract valuable insights from data, make predictions, and automate
decision-making processes. Let's delve into how they are integrated, particularly in the
context of classification tasks.

1.1.1 DATA SCIENCE

Data science is a broader field that encompasses a range of techniques and
methodologies to handle and analyse data. It involves processes such as data
collection, cleaning, exploration, and visualisation. The goal of data science is to derive
actionable insights and knowledge from data to support decision-making.

1.1.2 MACHINE LEARNING

Machine learning is a subset of artificial intelligence that focuses on the development
of algorithms and models that enable computers to learn patterns from data and make
predictions or decisions without explicit programming. It involves training models on
historical data to generalise and make accurate predictions on new, unseen data.

Data Preparation: Data scientists play a crucial role in preparing the data for
classification tasks. They handle tasks such as cleaning noisy data, handling missing
values, and transforming data into a suitable format for machine learning algorithms.

Feature Engineering: Identifying and selecting relevant features (variables) from the
dataset is a critical step in classification. Data scientists use domain knowledge to
determine which features are most informative for the task at hand.

Model Selection: Machine learning practitioners, often working in collaboration with
data scientists, choose appropriate classification algorithms based on the nature of
the data and the problem. Common algorithms include logistic regression, decision
trees, support vector machines, and neural networks.

Training the Model: Machine learning models are trained on labelled datasets during
this phase. The model learns patterns and relationships between features and labels,
adjusting its parameters to make accurate predictions.

Evaluation and Validation: Data scientists are responsible for evaluating the
performance of the trained model using validation datasets. They use metrics such as
accuracy, precision, recall, and F1-score to assess how well the model generalises to
new, unseen data.

Iterative Process: The integration of data science and machine learning in
classification is often an iterative process. Data scientists and machine learning
practitioners collaborate to refine the model, adjust features, and improve overall
performance.

The seamless integration of data science and machine learning in classification tasks
allows organisations to automate decision-making processes, classify and categorise
data efficiently, and derive insights that contribute to informed decision-making. This
collaborative approach leverages the strengths of both fields to create robust and
accurate classification models.

1.2 OBJECTIVE

The objective of this project is to develop and compare two machine learning models
for the task of detecting phishing websites. The primary focus is on enhancing the
accuracy and efficiency of phishing detection methods by using feature selection, and
also on evaluating the models based on their accuracy in order to determine which
one performs better for the given task.

This objective aims to create a sophisticated solution capable of differentiating
between legitimate and deceptive websites by utilising a stacked ensemble machine
learning model, to effectively analyse patterns, URLs, and content structures.

Additionally, it also includes the potential integration of this ML-based system into web
browsers or as extensions, ensuring real-time warnings and protection for users while
browsing, ultimately fostering a safer and more secure online environment.

1.3 DESCRIPTION

This project employs machine learning algorithms to identify critical factors influencing
the detection of phishing websites. Two models, a Multilayer Stacked Ensemble Model
and an XGBoost Model, use supervised classification to discern patterns in data.
Addressing the rising threat of phishing websites, the project aims to develop and
compare two robust machine learning models. The models are evaluated for efficiency
in detecting deceptive websites using Fisher's score for feature selection. Additionally,
a Chrome extension will integrate the superior model, offering real-time warnings and
enhanced user protection during online activities.

1.4 STRUCTURE OF THE PROJECT REPORT

Chapter 1 lays the groundwork for the entire project report. It provides an overview of
the project, including its purpose, scope, and significance. It also describes the project
in more detail, outlining the objectives and expected outcomes. Additionally, it explains
the structure of the report itself, giving the reader a roadmap to navigate the
information presented.

Chapter 2 provides a comprehensive overview of existing research and knowledge
related to the project's topic. It's essentially a review of relevant scholarly works, books,
articles, and other sources that inform the project's understanding of the problem and
potential solutions.

Chapter 3 tackles the heart of the project by defining the problem and outlining the
chosen approach to solve it. This chapter dives into the existing system, analyses its
limitations, and presents the proposed solution with its underlying algorithm. In
essence, it's the roadmap for tackling the challenge at hand.

Chapter 4 lays out the blueprint for a website with a Chrome extension, from its purpose
and audience to its technical structure and development process. It starts with an
overview, then specifies requirements, details the architecture, and breaks down the
design step-by-step. Each module gets its own dedicated explanation, while the
conclusion wraps everything up and suggests potential future directions. Essentially,
this chapter serves as a comprehensive roadmap for bringing the website and extension to life.

Chapter 5 chronicles the critical steps taken to construct and activate a powerful
phishing URL detection website. It showcases the development and deployment of its
core security features. It meticulously details the construction of these vital safeguards,
providing a comprehensive blueprint for those seeking to establish their own phishing
URL detection stronghold.

Chapter 6 discusses the significant lessons learned, and the necessary future
enhancements that can be used to improve the overall feasibility of the proposed work.

CHAPTER 2
LITERATURE SURVEY

Lakshmana Rao Kalabarige et al., 2022 [1] propose a highly effective Multilayer
Stacked Ensemble Learning Model for phishing detection, achieving 96.79% to
98.90% accuracy. Outperforming baselines, the model underscores its efficacy with
improved metrics. The paper stresses the urgency of countering phishing, outlines the
model's architecture and results, and suggests future research on feature selection
and model optimization. Overall, the study introduces a potent detection model,
validates its effectiveness, and outlines avenues for further research.

Al-Sariera et al., 2019 [2] presented advanced AI meta-learner models like AdaBoost-
Extra Tree for phishing detection, achieving over 97% accuracy with minimal false
positives (below 0.028). The study critically reviewed existing methods, emphasising
the need for improved techniques. Thorough evaluation using 10-fold cross-validation
and WEKA software demonstrated the models' superiority in accuracy and predictive
capabilities over existing methods. The paper highlighted the importance of
interpretable AI models and suggested exploring alternative decision tree algorithms
and hybridization methods for future advancements.

Ayman El Aassal et al., 2020 [3] conducted a thorough benchmarking of phishing
detection research, emphasising the significance of diverse datasets and real-time
detection. They introduced PhishBench, a systematic framework for evaluating and
comparing detection methods, addressing challenges like imbalanced scenarios. The
study covers benchmarking frameworks, feature importance, and the architecture of
PhishBench, advocating for comprehensive datasets in phishing detection research.

Rasha Zeini et al., 2023 [4] review phishing detection methods, emphasising model
explainability, feature engineering, and domain knowledge. They identify gaps,
including URL shortening challenges, and stress the importance of reproducibility,
diverse datasets, and informed feature selection. The review offers insights into
evolving phishing tactics, highlighting the need for continuous research and user
education in effective countermeasures.

Al-Sarem et al., 2021 [5] presented an optimised stacking ensemble method for
phishing detection, employing Genetic Algorithm to determine optimal parameters.
The ensemble comprised algorithms like Random Forests, AdaBoost, XGBoost,
Bagging, GradientBoost, and LightGBM, applied to UCI Phishing, Mendeley, and
Mendeley-small variant datasets. The model demonstrated remarkable accuracy of
97.16%, 98.58%, and 97.35% on the respective datasets, showcasing its
effectiveness across diverse phishing instances and features.

Yi Wei et al., 2022 [6] compare machine learning and deep learning
methods for phishing website classification. They assess traditional algorithms,
ensemble methods such as Random Forest and AdaBoost, and deep learning models
such as LSTM, CNN, and RNN. Results emphasise ensemble methods' effectiveness, particularly
Random Forest, in achieving high accuracy and computational efficiency, especially
with reduced feature sets.

Ahmet Ozaday et al., 2022 [7] used six machine learning algorithms to classify URLs
based on eleven features, with Random Forest yielding the highest accuracy of
98.90%. Comparing various methods, they concluded Random Forest provided
consistent and superior performance. The study stressed the importance of updated
datasets, global collaboration, and user awareness in combating phishing.

A. Karim et al., 2023 [8] developed a phishing detection system employing various
machine learning models and a hybrid approach (LR+SVC+DT). The hybrid model
demonstrated high efficiency, utilising metrics like accuracy, precision, recall,
specificity, and F1-score. The study underscores the effectiveness of combining
list-based and machine-learning-based systems for more efficient phishing URL
detection.

M. Aljabri et al., 2022 [9] offers a thorough review of ML algorithms for detecting
malicious URLs, highlighting SVM, RF, DT, NB, and LR with accuracy surpassing
98.42%. The document underscores the effectiveness of ensemble techniques in
achieving over 90% accuracy and discusses challenges like sample sizes and network
traffic considerations. Providing insights into datasets, features, and model accuracy,
the study contributes to understanding and addressing unresolved issues in malicious
URL detection.

Priscilla Kyei Danso et al., 2022 [10] tackles IoT security challenges with an Intelligent
Ensemble-based IDS at the gateway, mitigating threats like MITM and DoS. The
proposed solution employs Naïve Bayes, SVM, and k-NN as base learners,
demonstrating efficacy through ensemble models on various datasets. The study
emphasises the importance of ensemble learning in IoT security and suggests future
directions for anomaly-based IDS improvements.

Pankaj Saraswat et al., 2022 [11] addresses email security challenges, focusing on
phishing detection with machine learning. Using SVM and Random Forest, the study
achieves a maximum accuracy of 96.87%, emphasising the need for effective
detection methods against evolving phishing techniques. The proposed system
extracts link, tag, and word-based features, underscoring the importance of dataset
expansion for real-world applicability.

F. Castaño et al., 2023 [12] introduces PhiKitA-500, a dataset linking phishing websites
to phishing kits, facilitating algorithm evaluation. The methodology involves stages like
source definition, phishing kit collection, website extraction, and postprocessing.
Results indicate successful grouping of phishing kits, demonstrating the utility of kit
information in classifying phishing attacks, despite challenges in multiclass
classification.

CHAPTER 3

PROBLEM DEFINITION AND METHODOLOGIES

In this chapter, the project delves into the core by identifying the problem and
articulating the selected strategy for resolution. The chapter critically examines the
current system, putting forth its constraints, and introduces the solution, complete with
its underlying algorithm. Essentially, this section serves as a comprehensive guide,
outlining the path forward for addressing the project's central challenge.

3.1 PROBLEM DEFINITION

The escalating threat of phishing websites poses a significant challenge to online
security. Existing detection methods require enhancement to discern deceptive sites
effectively. This project addresses the need for robust machine learning models
capable of identifying critical factors influencing phishing website detection. The lack
of a comprehensive solution, coupled with the urgency to protect users from evolving
phishing techniques, necessitates the development and comparison of advanced
models. The challenge lies in creating models that not only exhibit high efficiency in
detecting deceptive websites but also integrate seamlessly into user workflows,
providing real-time warnings and enhancing online security.

3.2 EXISTING SYSTEM

The existing model for phishing detection is a Layer-wise Stacked Ensemble Learning
architecture, comprising multiple layers of estimators culminating in a meta-learner.
The workflow involves initialising the model, creating multiple layers with diverse
estimators, and adding a meta-learner as the final layer for comprehensive decision-
making. The Stacked Ensemble Learning process involves running estimators in
parallel within layers and sequentially between layers, employing various models like
Random Forest and Logistic Regression. The phases of the Multilayer Stacked
Ensemble Learning Model (MLSELM) include the input phase with the Phishing
dataset, data balancing phase, and the implementation phase for executing the model
effectively.

3.3 PROPOSED SYSTEM

The proposed system incorporates a dedicated website aimed at detecting phishing
URLs, complemented by the introduction of a user-friendly Chrome extension
available for download to enhance the detection of potential phishing URLs. To
improve feature selection, the system integrates Fisher's score, a methodology not
implemented in the existing system. Through this enhancement, the comparison
between the Multilayer Stacked Ensemble Model and the XGBoost Model is done to
evaluate their efficacy in the context of phishing URL detection, addressing a critical
aspect of online security that was not explicitly covered in the existing model. The
Chrome extension's functionality includes real-time pop-up notifications when
hovering over URLs, signaling potential phishing attempts. By combining the
capabilities of the integrated website and Chrome extension, the system actively
protects users during browsing activities.

3.4 METHODOLOGIES

3.4.1 Web Scraping and Data Collection:

Web scraping is a technique used to extract data from websites. In this context, the
system employs web scraping to collect a large dataset of URLs for both phishing
websites and legitimate websites. This process involves crawling known phishing
websites, which are sites designed to steal sensitive information such as login
credentials or financial data from users, as well as legitimate websites to ensure a
comprehensive dataset. By collecting URLs from both categories, the system can train
machine learning models to differentiate between phishing and legitimate URLs
effectively.

3.4.2 Feature Extraction:

Feature extraction involves identifying and extracting relevant information from the
collected URLs to create a comprehensive feature set for model training. Features
such as URL length, presence of HTTPS, use of IP addresses, presence of suspicious
keywords, and other relevant indicators are extracted. These features provide
valuable information for the machine learning models to learn and make predictions
effectively.
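
The features named above can be illustrated with a minimal sketch. This is not the project's extraction code: the function names, the feature names, and the keyword list are assumptions chosen for illustration, and only the Python standard library is used.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list for illustration; the report does not specify one.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def host_is_ip_address(host: str) -> bool:
    """Return True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Map one URL to a flat dictionary of numeric features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "host_is_ip": int(host_is_ip_address(host)),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": int("@" in url),
        "suspicious_keyword_count": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
    }
```

Each URL becomes one row of numbers, which is the form that the feature selection and model training steps expect.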

3.4.3 Fisher's Score for Feature Selection:

Fisher's score is a statistical measure used for feature selection, helping identify the
most informative features for training the models. In this step, features with higher
Fisher's scores are considered more relevant and are selected for model training,
while less informative features are discarded. By focusing on the most informative
features, the system improves the performance of the machine learning models by
reducing noise and irrelevant information in the dataset.
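
As a rough illustration, the sketch below uses the common definition of Fisher's score for a single feature: the between-class scatter of the class means divided by the within-class scatter. Whether the project uses exactly this variant is an assumption, and the function names are hypothetical.

```python
from statistics import fmean


def fisher_score(values, labels):
    """Fisher score of one feature: between-class scatter / within-class scatter."""
    overall_mean = fmean(values)
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)

    between = 0.0
    within = 0.0
    for group in by_class.values():
        class_mean = fmean(group)
        between += len(group) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in group)
    if within == 0:
        # No spread inside the classes: perfectly separating or constant feature.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(rows, labels, names):
    """Score every column of rows and return (name, score) pairs, best first."""
    columns = list(zip(*rows))
    scores = {name: fisher_score(col, labels) for name, col in zip(names, columns)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A feature whose values differ strongly between phishing and legitimate samples, while staying tight within each class, receives a high score and is kept. Features near the bottom of the ranking are candidates for removal.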

3.4.4 Multilayer Stacked Ensemble Model:

A multilayer stacked ensemble model involves combining multiple machine learning
models in a stacked architecture. Each layer of the ensemble may use different
algorithms, such as decision trees, logistic regression, or neural networks, to learn
from the data and make predictions. The predictions from each layer are then
combined, typically using a meta-learner or aggregation method, to produce the final
output. This approach improves the overall performance of the model by leveraging
the strengths of multiple algorithms and capturing complex relationships in the data.

3.4.5 XGBoost Model:

XGBoost is a popular and powerful gradient-boosting algorithm commonly used in
classification tasks. It works by iteratively training decision trees to correct the errors
made by previous trees, leading to a highly accurate predictive model. In this
methodology, the system implements an XGBoost model as an alternative approach
for comparison with the multilayer stacked ensemble model. This allows for evaluating
the performance of different algorithms and selecting the best-performing model for
the task.
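
The idea of each new tree correcting the errors of the previous ones can be shown with a toy example. The sketch below is plain gradient boosting on one feature with single-split trees (stumps) and squared error. It is not XGBoost itself, which adds regularisation and other refinements, and all names are illustrative.

```python
def fit_stump(xs, residuals):
    """Pick the threshold split of xs that fits the residuals best (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals left by the earlier ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )
```

With `xs = [1, 2, 3, 4, 5, 6]` and `ys = [0, 0, 0, 1, 1, 1]`, every round halves the remaining error, so the predictions approach 0 for small inputs and 1 for large ones.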

3.4.6 Metric Selection:
Evaluation metrics are used to assess the performance of the machine learning
models. Common metrics used in binary classification tasks like phishing detection
include precision, recall, F1 score, and accuracy.

• Precision measures the proportion of true phishing URLs among the URLs
predicted as phishing.
• Recall measures the proportion of actual phishing URLs that the model
correctly identifies.
• F1 score is the harmonic mean of precision and recall.
• Accuracy measures the overall correctness of the model's predictions.

By using relevant evaluation metrics, the system can effectively evaluate and compare
the performance of the multilayer stacked ensemble model and the XGBoost model to
select the best-performing approach.
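
The four metrics follow directly from the counts of true positives, true negatives, false positives, and false negatives. The sketch below is a minimal illustration; the function name and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 score, and accuracy from confusion-matrix counts.

    Undefined ratios (zero denominator) are reported as 0.0 by convention here.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }
```

For example, with made-up counts of 90 true positives, 85 true negatives, 15 false positives, and 10 false negatives, precision is 90/105, recall is 90/100, and accuracy is 175/200.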

CHAPTER 4

DESIGN PROCESS

This chapter serves as a detailed guide for constructing a website accompanied by a
Chrome extension, outlining its purpose, target audience, technical framework, and
developmental procedures. The chapter begins with a broad overview, followed by a
precise delineation of requirements and an in-depth exploration of the architectural
blueprint.

4.1 DESIGN OVERVIEW

The proposed system offers a comprehensive solution for phishing URL detection, with
a dedicated website and a convenient Chrome extension to strengthen user
protection while browsing the web. The dedicated website is the core of the system,
with an easy-to-use interface where users can input URLs to be analyzed. The
backend integrates machine learning models such as the multilayer stacked
ensemble model and the XGBoost model. The inclusion of Fisher’s score in the feature
selection methodology improves the system’s ability to identify phishing patterns by
focusing on critical aspects that are not explicitly covered by the existing system.

This website is designed to integrate seamlessly with the Chrome extension, providing
a unified and synchronous user experience. The Chrome extension is available for
download and provides an extra layer of protection while online. One of the key
features of this extension is the real-time pop-up notifications that appear when a
user hovers over a URL to alert them of a potential phishing attempt. These notifications
act as a direct link between the extension and the dedicated website, providing timely
warnings and alerting users so they can make informed decisions while navigating the
web.

The system’s design focuses on the smooth integration of the dedicated website with
the Chrome extension to provide a holistic approach to the detection of phishing URLs.
Sophisticated communication channels ensure the secure exchange of data between
the website’s backend and Chrome extension, while preserving the user’s privacy and
the system’s reliability. Regular updates and syncing mechanisms ensure that the
machine learning model and detection algorithms are always up-to-date and effective
against ever-evolving phishing tactics.

The user experience is at the forefront, with an intuitive website interface and
unobtrusive Chrome extension, allowing users to conveniently access protection
features without interrupting their browsing. Real-time notifications from the extension
enable users to make smart decisions about the security of the URLs they encounter.
In conclusion, the proposed solution combines the best features of a dedicated site
with a Chrome extension to increase the effectiveness of phishing URL detection,
while prioritising easy-to-use interactions and real-time feedback to actively protect
users during their online engagements.

4.2 DATA FLOW DIAGRAM

FIGURE 4.1 Data Flow Diagram

The key components of the phishing detection system include the user, who interacts
with the system through a Chrome Extension integrated with the web browser. The
user begins by registering and logging into the extension or a corresponding service.
When the user interacts with a website, the extension engages in URL analysis,
either letting the user paste a URL manually or automatically scanning links when
the cursor hovers over them. The core of the system is the Phishing Detection System,
which scrutinises the provided URL by comparing it against a database of known
phishing URLs, patterns, and characteristics. After determining whether the website is likely
a phishing attempt or legitimate, the system issues phishing alerts to the user,
potentially blocking access or displaying warnings if suspicious activity is detected.
This comprehensive process aims to enhance online security by proactively identifying
and notifying users about potential phishing threats.
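
The decision flow described above can be sketched as a known-list lookup followed by a model score. Everything here is an illustrative assumption: the function name, the set of known phishing hosts standing in for the database, the `score_url` callable standing in for the trained model, and the 0.5 threshold.

```python
from urllib.parse import urlparse


def check_url(url, known_phishing_hosts, score_url, threshold=0.5):
    """Return a verdict dictionary for one URL.

    known_phishing_hosts: set of lowercase host names (stand-in for the database).
    score_url: callable returning a phishing probability in [0, 1] (stand-in for the model).
    """
    host = (urlparse(url).hostname or "").lower()
    if host in known_phishing_hosts:
        # A match against the known list short-circuits the model.
        return {"url": url, "verdict": "phishing", "source": "known-list"}
    score = score_url(url)
    verdict = "phishing" if score >= threshold else "legitimate"
    return {"url": url, "verdict": verdict, "source": "model", "score": score}
```

The extension would then turn a "phishing" verdict into a warning or a block, and let a "legitimate" verdict pass silently.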

4.3 ARCHITECTURE DIAGRAM

Figure 4.2 Front-end based architecture diagram

The web application is the overarching entity, encompassing two main divisions: the
frontend and the backend. The frontend is the user-facing component primarily
accessed through the Chrome Extension. Within the frontend, the Content Script
operates as a script embedded in the visited webpage, providing access to the
webpage's content such as text, images, and structure to gather essential data. On
the other hand, the backend serves as the processing powerhouse behind the scenes,
managing critical functions. The Background Script, integrated into the Chrome
Extension, acts as a coordinator facilitating communication between the content script
and the backend by passing data seamlessly. Additionally, an essential element of the
backend is the API, functioning as an interface that enables the detection model within
the backend to receive data and convey evaluations effectively.

At the core of the system lies the Phishing Detection Model, serving as the intelligence
hub. This is a machine learning model trained to discern patterns and features
characteristic of phishing websites. It uses what it has learned to analyse incoming data
and determine whether a visited webpage poses a potential phishing threat. In essence,
the web application operates as an integrated unit: the frontend handles user
interaction through the Chrome Extension and its content script, while the background
script, the API, and the Phishing Detection Model together handle the processing and
identification of potential phishing websites.
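
The data flow described above can be sketched as a single request handler. This is a minimal, framework-agnostic illustration rather than the project's actual API: the function names score_url and handle_check and the toy scoring rule are assumptions made for this example, while the response field names follow those read by the Check.html script in the appendix.

```python
from urllib.parse import urlparse


def score_url(url: str) -> float:
    """Hypothetical stand-in for the detection model.

    Returns a phishing score in [0, 1]. A real system would call the
    trained model here; this toy rule only counts a few suspicious
    traits so that the example is runnable.
    """
    host = urlparse(url).hostname or ""
    traits = [
        "@" in url,                        # '@' symbol in the URL
        host.replace(".", "").isdigit(),   # raw IPv4 address used as host
        "-" in host,                       # '-' separated prefix or suffix
        len(url) > 75,                     # unusually long URL
    ]
    return sum(traits) / len(traits)


def handle_check(payload: dict) -> dict:
    """Validate the request body and return a JSON-serialisable verdict."""
    target = str(payload.get("target", "")).strip()
    parsed = urlparse(target)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return {"status": False, "message": "Invalid URL"}

    unsafe = round(score_url(target) * 100, 1)
    return {
        "status": True,
        "target": target,
        "unsafe_percentage": unsafe,
        "safe_percentage": round(100 - unsafe, 1),
    }


print(handle_check({"target": "https://example.com"}))
```

In this arrangement the background script would only need to forward the visited URL as the target field and relay the returned percentages to the user interface.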

Figure 4.3 Backend-based architecture diagram

A stacked ensemble is a machine learning technique in which the predictions of several
models are combined by a further model, with the aim of outperforming any single
constituent model.
The model shown in this diagram is a multilayer stacked ensemble, that is, a stacked
ensemble with more than one layer of models. The first layer consists of five different
machine learning models: MLP, KNN, RF, LR, and XGB, all trained on the training
dataset. The second layer consists of three models (MLP, RF, and XGB, as implemented
in Section 5.3), trained on the outputs of the first layer. The third layer consists of a
single XGB model, also called the meta learner, trained on the outputs of the second
layer. The final output is a prediction of whether a website is phishing or legitimate.
By combining the strengths of multiple models, the stacked ensemble can improve on the
performance of a single machine learning model.
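
As a rough illustration of the layered arrangement described above, scikit-learn's StackingClassifier can be nested to form more than one stacking layer. The sketch below is a stand-in under stated assumptions, not the project's implementation: it uses synthetic data and a reduced set of lightweight scikit-learn estimators so that it runs without XGBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the phishing dataset (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Layer 1: diverse base learners trained on the original features.
layer1 = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Layers 2 and 3: a smaller stack whose own final estimator is the meta
# learner. Using it as final_estimator places it on top of layer 1.
upper_layers = StackingClassifier(
    estimators=[
        ("rf2", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("lr2", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# cv=3: each layer is trained on out-of-fold predictions of the layer below.
model = StackingClassifier(estimators=layer1, final_estimator=upper_layers, cv=3)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Because StackingClassifier builds its meta-features with internal cross-validation, the upper layers never see predictions made on rows a base model was fitted on.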

4.4 PROJECT REQUIREMENTS

This segment details the precise technological prerequisites essential for
implementing the project. The subsequent content outlines the necessary software
and hardware requirements crucial for the successful execution of this initiative.

4.4.1 SOFTWARE REQUIREMENTS

● Operating System: Windows
● Tools Used: Jupyter, Google Colab, Visual Studio

4.4.2 HARDWARE REQUIREMENTS

● Processor: AMD Ryzen 7 5800H
● Hard Disk: 500GB SSD
● RAM: 16GB SODIMM RAM
● GPU: NVIDIA GeForce RTX 3050

Jupyter is an open-source tool for interactive computing that supports various
programming languages. It is widely used for creating and sharing documents
containing live code, equations, visualizations, and narrative text. Jupyter is well
suited to developing and testing code, especially for tasks like data preprocessing,
feature extraction, and initial model training, because its interactive nature
facilitates iterative development and experimentation.

Google Colab is a cloud-based platform provided by Google for creating and running
Jupyter notebooks in a collaborative environment. It provides free access to GPU
resources, which is beneficial for training machine learning models. Its cloud-based
infrastructure can be leveraged for resource-intensive tasks, such as training machine
learning models on the dataset.

In summary, the project targets Windows as the operating system, with Jupyter and
Google Colab as the key development tools. The hardware comprises a high-performance
processor, an SSD for fast storage, ample RAM for efficient multitasking, and a dedicated
GPU for accelerated machine learning tasks. These choices are aligned with the
computational demands of developing and implementing a phishing URL detection
system with machine learning models.

CHAPTER 5

IMPLEMENTATIONS

Chapter 5 describes the steps taken to build and deploy the phishing URL detection
website. It covers the development of its core components, from data analysis and
visualization to model building and training, in enough detail to serve as a blueprint
for those seeking to build their own phishing URL detection system.

5.1 ANALYSIS ON THE DATA

# Loading data into a DataFrame
import pandas as pd

data = pd.read_csv("phishing.csv")
data.head()

Figure 5.1 Data

# Shape of the DataFrame
data.shape

Figure 5.2 Shape of the data

#Listing the features of the dataset
data.columns

Figure 5.3 Features of the data

#Information about the dataset


data.info()

Figure 5.4 Information of the data

# Number of unique values in each column
data.nunique()

Figure 5.5 Unique values of the data

# Dropping the index column
data = data.drop(['Index'], axis=1)

# Description of the dataset
data.describe().T

Figure 5.6 Description of the data

• There are 11054 instances and 31 features in the dataset.
• Of these, 30 are independent features and 1 is the dependent feature.
• There are no outliers in the dataset.
• There are no missing values in the dataset.
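
The observations above can be checked programmatically. The following is a minimal sketch, assuming a DataFrame with a class label column encoded as -1/1; the tiny frame and its other column names are made up so that the snippet runs on its own.

```python
import pandas as pd

# Tiny made-up frame standing in for phishing.csv (assumption).
data = pd.DataFrame(
    {
        "UsingIP": [1, -1, 1, 1],
        "LongURL": [1, 0, -1, 1],
        "class": [-1, 1, 1, -1],
    }
)

n_rows, n_cols = data.shape                 # instances, columns (incl. label)
n_missing = int(data.isnull().sum().sum())  # total number of missing cells
class_share = data["class"].value_counts(normalize=True)  # label balance

print(f"{n_rows} instances, {n_cols - 1} independent features")
print(f"Missing values: {n_missing}")
print(class_share)
```

Run against the real dataset, the same three quantities would confirm the instance count, the absence of missing values, and how balanced the two classes are.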

5.2 DATA VISUALIZATION

# Correlation heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 15))
sns.heatmap(data.corr(), annot=True)
plt.show()

Figure 5.7 Correlation map of the data

This code uses the seaborn (sns) and matplotlib (plt) libraries to draw a heatmap of
the correlation matrix of the DataFrame data. Each cell shows the correlation between a
pair of columns. With seaborn's default colormap, lighter colors represent higher
(more positive) coefficients and darker colors represent lower (more negative) ones. The
annotations on the heatmap give the exact correlation coefficient for each pair of
columns.

#Phishing Count in a pie chart

data['class'].value_counts().plot(kind='pie', autopct='%1.2f%%')
plt.title("Phishing Count")
plt.show()

Figure 5.8 Pie chart for classes of the data

This code generates a pie chart of the distribution of the 'class' column in the
DataFrame 'data'. Each slice represents one class, and its size corresponds to the
frequency of that class in the dataset. The percentage shown on each slice is the
proportion of that class relative to the total number of instances.

5.3 MODEL BUILDING AND TRAINING

# Splitting the dataset into dependent and independent features
X = data.drop(["class"], axis=1)
y = data["class"]

# Splitting the dataset into train and test sets: 80-20 split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

Figure 5.9 Testing and Training Data

# Defining Fisher’s Score
import numpy as np

def fisher_score(X, y):
    """
    Compute the Fisher Score for each feature.

    Parameters:
    - X: array-like, shape (n_samples, n_features)
      Feature matrix.
    - y: array-like, shape (n_samples,)
      Target vector.

    Returns:
    - scores: numpy array, shape (n_features,)
      Fisher Scores for each feature.
    """
    # Accept pandas objects as well as numpy arrays
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Number of samples for each class
    classes, class_counts = np.unique(y, return_counts=True)

    # Overall mean of each feature
    mean_overall = np.mean(X, axis=0)

    # Within-class and between-class scatter
    S_W = np.zeros(X.shape[1])
    S_B = np.zeros(X.shape[1])

    for i, label in enumerate(classes):
        X_class = X[y == label]
        mean_class = np.mean(X_class, axis=0)
        S_W += ((X_class - mean_class) ** 2).sum(axis=0)
        S_B += class_counts[i] * ((mean_class - mean_overall) ** 2)

    # Compute Fisher Score
    scores = S_B / S_W
    return scores

# Example usage
# Assuming X is your feature matrix and y is your target vector
fisher_scores = fisher_score(X, y)
# If you want to rank features based on Fisher Score
ranking = np.argsort(fisher_scores)[::-1]
print("Features ranked by Fisher Score:")
for rank in ranking:
    print(f"Feature {rank} Score: {fisher_scores[rank]}")

Figure 5.10 Fisher’s score for the features in the data
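
One way to act on this ranking is to keep only the highest-scoring features before training. The sketch below is illustrative only: the compact fisher_scores helper, the select_top_k function, the cut-off k, and the synthetic data are all assumptions, and the report does not state how many features were retained.

```python
import numpy as np


def fisher_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ratio of between-class to within-class scatter, per feature."""
    overall_mean = X.mean(axis=0)
    s_between = np.zeros(X.shape[1])
    s_within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_c = X[y == label]
        class_mean = X_c.mean(axis=0)
        s_between += len(X_c) * (class_mean - overall_mean) ** 2
        s_within += ((X_c - class_mean) ** 2).sum(axis=0)
    # Guard against constant features, whose within-class scatter is zero.
    return s_between / np.maximum(s_within, 1e-12)


def select_top_k(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Return the column indices of the k features with the largest scores."""
    order = np.argsort(fisher_scores(X, y))[::-1]
    return np.sort(order[:k])


# Synthetic example: column 0 tracks the label, columns 1 and 2 are noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = np.column_stack([y + rng.normal(0, 0.1, 200), rng.normal(size=(200, 2))])

kept = select_top_k(X, y, k=1)
X_reduced = X[:, kept]
print("Kept feature indices:", kept)
```

The same index array would have to be applied to both the training and the test matrix so that the model sees identical columns at fit and prediction time.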

# Fitting the models
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

def fit_models(models, X, y):
    for model in models:
        model.fit(X, y)

def generate_meta_features(models, X):
    meta_features = [model.predict_proba(X) for model in models]
    return np.hstack(meta_features)

# Mapping -1 to 0 and 1 remains 1
y_train = y_train.map({-1: 0, 1: 1})
y_test = y_test.map({-1: 0, 1: 1})

# Initialize a label encoder and fit it to the training labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)

# First layer models
models_layer1 = [
    xgb.XGBClassifier(random_state=42, use_label_encoder=False,
                      eval_metric='mlogloss'),
    MLPClassifier(max_iter=1000, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
    RandomForestClassifier(n_estimators=100, random_state=42),
    LogisticRegression(max_iter=1000, random_state=42)
]
# First layer training and meta-features generation
fit_models(models_layer1, X_train, y_train)
X_train_meta_1 = generate_meta_features(models_layer1, X_train)
X_test_meta_1 = generate_meta_features(models_layer1, X_test)

# Define the second layer models
models_layer2 = [
    MLPClassifier(max_iter=1000, random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
    xgb.XGBClassifier(random_state=42, use_label_encoder=False,
                      eval_metric='mlogloss')
]
# Train second layer models using the meta-features from the first layer
fit_models(models_layer2, X_train_meta_1, y_train)
# Generate second layer meta-features
X_train_meta_2 = generate_meta_features(models_layer2, X_train_meta_1)
X_test_meta_2 = generate_meta_features(models_layer2, X_test_meta_1)

#Third layer
final_layer_model = xgb.XGBClassifier(random_state=42, use_label_encoder=False,
eval_metric='mlogloss')
# Third layer training and final predictions
final_layer_model.fit(X_train_meta_2, y_train)
y_pred_final = final_layer_model.predict(X_test_meta_2)

# Evaluate the model
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred_final)
print(f'Final Model Accuracy: {accuracy}')

Figure 5.11 Final accuracy of the model
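
In the listing above, generate_meta_features predicts on the same rows each layer was fitted on, which can make the upper layers rely too heavily on base models that memorise the training set. A common alternative is to build the training meta-features from out-of-fold predictions. The sketch below shows this for one layer using scikit-learn's cross_val_predict; the helper name oof_meta_features, the fold count, the lightweight estimators, and the synthetic data are assumptions, and this is not the procedure used to obtain the reported results.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier


def oof_meta_features(models, X_train, y_train, X_test, n_folds=5):
    """Build meta-features for one stacking layer.

    Training rows get out-of-fold probabilities; test rows get
    probabilities from a copy of each model refitted on all training rows.
    """
    train_parts, test_parts = [], []
    for model in models:
        train_parts.append(
            cross_val_predict(
                model, X_train, y_train, cv=n_folds, method="predict_proba"
            )
        )
        fitted = clone(model).fit(X_train, y_train)
        test_parts.append(fitted.predict_proba(X_test))
    return np.hstack(train_parts), np.hstack(test_parts)


# Synthetic binary data standing in for the phishing dataset (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

layer1 = [DecisionTreeClassifier(random_state=42), LogisticRegression(max_iter=1000)]
meta_train, meta_test = oof_meta_features(layer1, X_train, y_train, X_test)

meta_learner = LogisticRegression(max_iter=1000).fit(meta_train, y_train)
print("Meta-feature shapes:", meta_train.shape, meta_test.shape)
print("Held-out accuracy:", meta_learner.score(meta_test, y_test))
```

The returned arrays have the same layout as those produced by generate_meta_features, so the helper could be applied layer by layer in the same way.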

# Generate the confusion matrix
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred_final)
print("Confusion Matrix:")
print(conf_matrix)

Figure 5.12 Confusion matrix for the model

# Generate classification report
from sklearn.metrics import classification_report

class_report = classification_report(y_test, y_pred_final)
print("Classification Report:")
print(class_report)

Figure 5.13 Classification report of the model

CHAPTER 6

EXPERIMENTATION RESULTS

Chapter 6 serves as a reflection on the essential lessons derived from our research
journey, highlighting the quest for improvement and innovation in ensemble learning
methodologies.

6.1 OBSERVATIONS

This chapter compares two machine learning approaches, the Multilayer Stacked
Ensemble Learning Machine (MLSELM) and XGBoost, on the task of phishing website
detection. The MLSELM model, comprising three layers of classifiers, outperformed the
XGBoost model in terms of accuracy. With feature selection applied, MLSELM achieved
an accuracy of 97%, while the XGBoost model attained 86%.

In this section, we evaluate the performance of the proposed Multilayer Stacked
Ensemble Learning Machine (MLSELM), which is built from several established machine
learning algorithms, namely Multi-Layer Perceptron (MLP), k-nearest neighbours (KNN),
Random Forest (RF), Logistic Regression (LR), and XGBoost (XGB), against a standalone
XGBoost model. The classification metrics used for performance evaluation are
Precision, Recall, F-score, and Accuracy. In distinguishing between Legitimate and
Phishing instances, Phishing instances are designated as positive, while Legitimate
instances are termed negative.

The counts of True Positives (NTP), True Negatives (NTN), False Positives (NFP),
and False Negatives (NFN) are defined as follows:
- P: Total number of phishing instances
- N: Total number of legitimate instances
- NTN: Number of legitimate instances predicted as legitimate
- NFN: Number of phishing instances predicted as legitimate
- NTP: Number of phishing instances predicted as phishing
- NFP: Number of legitimate instances predicted as phishing

The computation of each metric is articulated as follows:
• Accuracy: Accuracy is the proportion of correctly classified cases (true positives
and true negatives) out of the total number of cases examined.
((NTP + NTN) / (P + N)) × 100
• Precision: Precision is the proportion of true positives out of the total number
of cases predicted as positive.
(NTP / (NTP + NFP)) × 100
• Recall: Recall is the proportion of true positives out of the total number of
positive cases in the dataset.
(NTP / (NTP + NFN)) × 100
• F-score: F-score combines precision and recall into a single metric, their harmonic
mean. With Precision and Recall expressed as percentages:
2 × (Precision × Recall) / (Precision + Recall)
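
These metrics can be computed directly from the four confusion counts. The following sketch uses the standard harmonic-mean form of the F-score, 2PR / (P + R); the function name and the example counts are hypothetical and are not taken from the project's experiments.

```python
def classification_metrics(n_tp: int, n_tn: int, n_fp: int, n_fn: int) -> dict:
    """Return accuracy, precision, recall and F-score as percentages.

    Phishing is the positive class, so n_tp counts phishing sites flagged
    as phishing and n_fp counts legitimate sites flagged as phishing.
    """
    total = n_tp + n_tn + n_fp + n_fn  # equals P + N
    accuracy = (n_tp + n_tn) / total
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f_score": 100 * f_score,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(n_tp=80, n_tn=90, n_fp=10, n_fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

The zero-denominator guards matter in practice when a model predicts no positives at all, in which case precision is otherwise undefined.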

The experimental setup involved training and evaluating the MLSELM and XGBoost
models on the same dataset of features relevant to phishing website detection, so that
both were assessed under identical conditions. Both models underwent feature selection
using Fisher’s Score to optimize their performance, which also gives insight into the
impact of this preprocessing step on model performance. This comparative analysis
highlights the relative strengths and weaknesses of each approach, aiding in the
selection of the most suitable algorithm for phishing website detection tasks.

MEASURES (%)    XGBOOST    MLSELM
ACCURACY        86         97
PRECISION       93         97
RECALL          79         96
F1 SCORE        85         96

Figure 6.1 The performance of XGBoost algorithm & MLSELM algorithm


Based on the results in the table, the MLSELM model outperforms the XGBoost model
on all four metrics. It has a substantially higher accuracy (97% compared to 86%),
meaning it correctly classified a larger proportion of websites. MLSELM also has higher
precision (97% compared to 93%) and recall (96% compared to 79%), indicating that it
raised fewer false alarms on legitimate websites and identified a larger proportion of
actual phishing sites. Finally, it has a higher F1 score (96% compared to 85%), reflecting
a better overall balance between precision and recall.

It is important to consider that the performance of these models may vary depending
on the specific dataset they are trained on and the types of phishing websites they
encounter.

6.2 INFERENCES

The superior performance of the MLSELM model can be attributed to its multilayer
stacked ensemble architecture, which leverages the collective intelligence of multiple
classifiers to make accurate predictions. By incorporating diverse base classifiers and
meta-learning techniques, MLSELM effectively captures the complex relationships
between features and the target variable, enhancing its discriminative power. In
contrast, while XGBoost is renowned for its scalability and efficiency, its performance
may be limited by its single-layer ensemble approach, which may struggle to capture
intricate patterns in the data.

Furthermore, the success of the MLSELM model emphasizes the importance of
feature selection in enhancing model performance. By identifying and prioritizing
relevant features and optimizing model parameters, we can mitigate the risk of
overfitting and improve the model's generalization capabilities.

CHAPTER 7

FUTURE WORK AND ENHANCEMENTS

In envisioning the future enhancements for our phishing website detection model, we
are poised to revolutionize cybersecurity by imbuing it with self-learning and
self-updating capabilities. By leveraging advanced machine learning algorithms and
innovative techniques, our model will autonomously adapt to emerging threats,
continuously refining its predictive capabilities without the need for manual
intervention. This transformative approach not only ensures real-time protection but
also alleviates the burden on administrators, freeing them from the tedious task of
manual updates.

Moreover, our vision extends beyond mere efficacy to inclusivity, as we aspire to
expand the reach of our phishing detection solution beyond the confines of a single
browser. Through future enhancements to our Chrome extension, we aim to engineer
a versatile tool that transcends browser limitations, offering seamless integration with
a myriad of popular web browsers. This expansion democratizes access to cutting-
edge cybersecurity measures, empowering users across diverse platforms to
safeguard themselves against phishing attacks effectively.

In essence, our commitment to innovation and inclusivity drives us to reimagine the
landscape of cybersecurity, ushering in an era where protection is not only intelligent
and adaptive but also universally accessible. With these future enhancements, we are
poised to make a lasting impact, fortifying digital ecosystems against the ever-evolving
threat of phishing attacks.

APPENDIX

A1 - SOURCE CODE

Base.html
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />

<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=DM+Sans&display=swap"
rel="stylesheet" />
<!-- Required Css Stylesheets -->
{% block styles %}
<link rel="stylesheet" href="{{ url_for('static', filename='css/bootstrap.min.css') }}" />
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}" />
{% endblock %}

<title>{% block title %}{% endblock %}</title>


</head>
{% set theme = theme|default('dark') -%}

<body class="body--{{ theme }}">


<main>{% block content %}{% endblock %}</main>
<div class="footer">{% block footer %}{% endblock %}</div>

<!-- Required Js Scripts -->


{% block scripts %}
<script src="{{ url_for('static', filename='js/jquery.min.js') }}"></script>
<script src="{{ url_for('static', filename='js/bootstrap.bundle.min.js') }}"></script>
<script src="https://cdn.jsdelivr.net/npm/sweetalert2@11"></script>
{% endblock %}
</body>

</html>

Check.html
{% extends 'base.html' %}
{% block title %}Processing | Phishing Website Detector{% endblock %}

{% set theme = 'dark' -%}

{% block content %}
<div class="card card--result color-bg-dark">
<img class="screenshot--target card-img rounded-0" alt="{{target}}" style="display:
none;">
<div class="screenshot--skeleton placeholder-glow">
<span class="placeholder"></span>
</div>

<div class="card-img-overlay content--wrapper">


<div class="content--area placeholder-glow">
<div class="liquid-ball placeholder">
<div class="ball-inner">
<div class="ball-percent"></div>
<div class="ball-water"></div>
<div class="ball-glare"></div>
</div>
</div>
<button id="web-button" class="btn btn-lg btn-redirect placeholder px-lg-5">Continue to website</button>
</div>
</div>
</div>
{% endblock %}

{% block scripts %}
{{ super() }}
<script>
function setTargetScreenshot() {
    var targetScreenshot = $(".screenshot--target");
    var skeletonScreenshot = $(".screenshot--skeleton");

    targetScreenshot.hide();
    skeletonScreenshot.show();

    var height = $(window).height();
    var width = $(window).width();

    targetScreenshot.attr("src", `{{ url_for('screenshot', target=target) }}&width=${width}&height=${height}`);
    targetScreenshot.on('load', function () {
        skeletonScreenshot.hide();
        $(this).show();
    });
}

function showResult(data) {
var variantInc = 100 / 3;
var safe_percentage = data.safe_percentage;
var percentage = Math.max(data.safe_percentage, data.unsafe_percentage);

if (percentage !== "" &&
!isNaN(percentage) &&
percentage <= 100 &&
percentage >= 0) {

var waterLevel = 100 - percentage;

$(".ball-percent").append($("<span>").text(percentage.toLocaleString('en-US', {
minimumFractionDigits: 0, maximumFractionDigits: 1,
}) + "%"));
$(".ball-water").css("top", waterLevel + "%");

if (safe_percentage < variantInc * 1) {


$(".content--area").addClass("content--unsafe");
$(".ball-percent").append($("<span>").text("unsafe"));
$(".btn-redirect").attr("onClick", "phishedAlert()");
} else if (safe_percentage < variantInc * 2) {
$(".content--area").addClass("content--doubt");
$(".ball-percent").append($("<span>").text("doubt"));
$(".btn-redirect").attr("onClick", `warningAlert('${data.target}')`);
} else {
$(".content--area").addClass("content--safe");
$(".ball-percent").append($("<span>").text("safe"));
$(".btn-redirect").attr("onClick", `safeAlert('${data.target}')`);
}
} else {
$(".ball-water").css("top", "100%");
$(".ball-percent").text("NaN").css("font-size", "92px");
$(".content--area").addClass("content--doubt");
}

$(".content--area").find(".placeholder").removeClass("placeholder");
$(".content--area").removeClass("placeholder-glow");
}

function phishedAlert() {
    Swal.fire({
        title: "Phished Website!!",
        text: "It's too dangerous to continue, hence we can't allow this action.",
        showCancelButton: true,
        showConfirmButton: false,
        showDenyButton: true,
        denyButtonText: 'Back to home'
    }).then((result) => {
        if (result.isDenied) {
            window.location = "{{ url_for('home') }}";
        }
    });
}

function warningAlert(target) {
    Swal.fire({
        title: "Seems unsafe to me!!",
        text: "It's too dangerous to continue, hence we can't allow this action.",
        showCancelButton: true,
        showConfirmButton: true,
        confirmButtonText: 'Continue anyways'
    }).then((result) => {
        // Open the site only when the user explicitly confirms
        if (result.isConfirmed) {
            window.open(target, "_blank");
        }
    });
}

function safeAlert(target) {
    Swal.fire({
        title: "Hurray! You're safe",
        html: "You'll be redirected to the website...",
        timer: 3000,
        timerProgressBar: true
    }).then((result) => {
        if (result.dismiss === Swal.DismissReason.timer) {
            window.open(target, "_blank");
        }
    });
}

function invalidAlert(error) {
    Swal.fire({
        icon: 'error',
        title: 'Oops! Something went wrong!',
        text: error,
        input: 'url',
        showDenyButton: true,
        denyButtonText: 'Back to home',
        confirmButtonText: 'Check again',
        inputPlaceholder: 'Enter the URL',
        allowOutsideClick: false
    }).then((result) => {
        if (result.isConfirmed) {
            window.location = `{{ url_for('check') }}?target=${result.value}`;
        } else if (result.isDenied) {
            window.location = "{{ url_for('home') }}";
        }
    });
}

$(function () {
    setTargetScreenshot();

    $.ajax({
        type: "POST",
        url: "{{ url_for('check') }}",
        contentType: "application/json",
        data: JSON.stringify({
            target: "{{target}}",
        }),
        dataType: "json",
        success: function (data, status) {
            console.log("Data: " + JSON.stringify(data) + "\nStatus: " + status);
            if (data.status) {
                showResult(data);
            } else {
                showResult("");
                invalidAlert(data.message);
            }
        },
    });
});

$(window).on('resize', function () {
setTargetScreenshot();
});
</script>
{% endblock %}

Home.html
<!DOCTYPE html>

<html lang="en" dir="ltr">


<head>
<meta charset="utf-8">
<title>Phishing Detector</title>
<link rel="shortcut icon" href="{{ url_for('static', filename='spam-favicon.ico') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='styles.css') }}">
<script src="https://kit.fontawesome.com/5f3f547070.js"
crossorigin="anonymous"></script>
<link href="https://fonts.googleapis.com/css2?family=Roboto&display=swap"
rel="stylesheet">

</head>

<body>
<header>
<a href="/url-detector" class="link">Phishing Url Detector</a>
</header>

<!-- Website Title -->


<div class="container">
<h1 class='container-heading'><span>Secure Your Organization from
Phishing Attacks</span></h1>
<div class='description'>
<p>Advanced phishing detection app that utilizes cutting-edge algorithms and
machine learning techniques to identify and prevent phishing attacks targeting
organizations.</p>
</div>
</div>

<!-- Text Area -->


<!-- <div class="ml-container">
<form action="{{ url_for('predict') }}" method="POST">
<textarea class='message-box' name="message" rows="10"
cols="100" placeholder="Paste Email Message Here... eg. Dear Citizen.
Due to Covid-19 related issues, NY.GOV will pay $600 for victims who are
affected by this pandemic
Please complete the online form to join the aids program.
Please click here to activate your account. Please do not close out of the
browser while
compicung he account achvalion
Thank you
New York State"></textarea><br/>
<input type="submit" class="my-cta-button" value="Identify Phishing
Attempts">
</form>
</div> -->

<!-- Footer -->
<!-- <div class='footer'>

</div> -->

</body>
</html>

Index.html
{% extends 'base.html' %}
{% block title %}Phishing Detector by Invaders{% endblock %}

{% block content %}
<section class="container min-vh-100">
<header>
<a href="/" class="link">Home Page</a>
</header>

<div class="row align-items-center justify-content-center min-vh-100 py-5">

<div class="col-lg-10 my-5">


<div class="d-flex justify-content-center mb-3">
<figure class="text-center">
<blockquote class="blockquote">
<h1 class="fw-bold">Phishing Website Detector</h1>
</blockquote>

</figure>
</div>
<form action="{{ url_for('check') }}" method="get" class="form--home input-group
rounded">

<input type="url" id="target-url" name="target" class="form-control form-control-lg text-center border border-dark border-2 py-3" placeholder="http://phish-site.com/malicious-url" required
/>

<button class="btn btn-dark fs-5 py-3 px-5" type="submit">


Let's find out
</button>
</form>
</div>
<div class="col-10 my-5">
<div class="row text-center">
<div class="col-md py-2">
<div class="card shadow-sm p-4 border border-dark border-2">
<div id="web-visits" class="display-5 fw-bold text-primary"></div>
<p class="fs-5 fw-semibold">Total website visits</p>
</div>
</div>
<div class="col py-2">
<div class="card shadow-sm p-4 border border-dark border-2">
<div id="web-checked" class="display-5 fw-bold text-secondary"></div>
<p class="fs-5 fw-semibold">Total website checked</p>
</div>
</div>
<div class="col py-2">
<div class="card shadow-sm p-4 border border-dark border-2">
<div id="web-phished" class="display-5 fw-bold text-danger"></div>
<p class="fs-5 fw-semibold">Total phished websites</p>
</div>
</div>
</div>
</div>
</div>

</section>
{% endblock %}

{% block scripts %}
{{ super() }}
<script>
var eventSource = new EventSource("{{ url_for('listen') }}");

eventSource.addEventListener(
    "stats",
    function (e) {
        console.log(e.data);
        // Declared locally to avoid creating an implicit global
        var data = JSON.parse(e.data);
        $("#web-visits").text(data.visits);
        $("#web-checked").text(data.checked);
        $("#web-phished").text(data.phished);
    },
    true
);
</script>
{% endblock %}

Result.html
<!DOCTYPE html>

<html lang="en" dir="ltr">


<head>
<meta charset="utf-8" />
<title>PHISHING DETECTOR</title>
<link rel="shortcut icon" href="{{ url_for('static', filename='spam-favicon.ico') }}" />
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='styles.css') }}" />
<script src="https://kit.fontawesome.com/5f3f547070.js" crossorigin="anonymous"></script>
<link href="https://fonts.googleapis.com/css2?family=Roboto&display=swap" rel="stylesheet" />
</head>

<body>
<div class="results">
<h1>PREDICTION RESULT</h1>
{% if prediction==1 %}
<h2>
<span class="danger">Caution! Our system has flagged this message as a possible phishing attempt</span>
</h2>
<img class="image" src="{{ url_for('static', filename='unsafe-icon.png') }}" alt="SPAM Image" />
{% elif prediction==0 %}
<h2>
<span class="safe">Congratulations! This message is classified as SAFE</span>
</h2>
<img class="image" src="{{ url_for('static', filename='safety-icon.png') }}" alt="Not a spam image" />
{% endif %}
</div>
</body>
</html>

Content detector.py

# Importing essential libraries
from flask import Flask, render_template, request
import pickle

# Load the Multinomial Naive Bayes model and CountVectorizer object from disk
filename = 'spam-sms-mnb-model.pkl'
classifier = pickle.load(open(filename, 'rb'))
cv = pickle.load(open('cv-transform.pkl', 'rb'))
app = Flask(__name__)


@app.route('/')
def home():
    return render_template('home.html')


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        # Turn the message into a dense word-count vector and classify it
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)


if __name__ == '__main__':
    app.run(debug=True)
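
The `/predict` route above relies on `cv.transform(...)` to turn the submitted message into a row of word counts before the classifier sees it. As a rough illustration of that idea only, the sketch below counts words against a small fixed vocabulary. It is a pure-Python stand-in and not scikit-learn's `CountVectorizer`; the vocabulary and the helper name `to_count_vector` are assumptions made up for this example (the project's real vocabulary lives in `cv-transform.pkl`).

```python
import re

# Hypothetical vocabulary, for illustration only.
VOCABULARY = ["account", "free", "prize", "verify", "meeting"]


def to_count_vector(message, vocabulary=VOCABULARY):
    """Return how often each vocabulary word occurs in the message."""
    # Lower-case the text and split it into alphanumeric tokens
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [tokens.count(word) for word in vocabulary]


if __name__ == "__main__":
    print(to_count_vector("Verify your account now to claim a FREE prize, free!"))
```

Each position in the resulting list corresponds to one vocabulary word, which is the kind of fixed-length numeric input a Naive Bayes classifier expects.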

Features.py
# Extraction of features from the URL
# 0 having_IP_Address
# 1 URL_Length
# 2 Shortining_Service
# 3 having_At_Symbol
# 4 double_slash_redirecting
# 5 Prefix_Suffix
# 6 having_Sub_Domain
# 7 URL_Depth
# 8 Domain_registeration_length
# 9 Favicon
# 10 port
# 11 HTTPS_token
# 12 Request_URL
# 13 URL_of_Anchor
# 14 Links_in_tags
# 15 SFH
# 16 Submitting_to_email
# 17 Abnormal_URL
# 18 Redirect
# 19 on_mouseover
# 20 RightClick
# 21 popUpWidnow
# 22 Iframe
# 23 age_of_domain
# 24 DNSRecord
# 25 web_traffic
# Each of the feature functions above returns
# 1 if the URL is Phishing,
# -1 if the URL is Legitimate and
# 0 if the URL is Suspicious
# (URL_Depth is the exception: it returns a count of path segments)

import re
import whois
import datetime
import requests
import ipaddress
from dns import resolver
from bs4 import BeautifulSoup
from urllib.parse import urlparse

class FeatureExtraction:
    def __init__(self, url):
        self.url = url
        self.parsedurl = urlparse(self.url)
        self.domain = self.parsedurl.netloc

        # WHOIS lookup; left as None when the lookup fails
        try:
            self.whois = whois.whois(self.domain)
        except Exception:
            self.whois = None

        # Fetch the page once and parse it for the content-based features
        try:
            self.request = requests.get(self.url, timeout=5, headers={
                "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) "
                              "AppleWebKit/537.36 (KHTML, like Gecko) "
                              "Chrome/81.0.4044.141 Safari/537.36"})
            self.soup = BeautifulSoup(self.request.content, 'html.parser')
        except Exception:
            self.request = None
            self.soup = None

        # Alternation of known URL-shortening domains
        self.shortening_services = \
            r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
            r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
            r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
            r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
            r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
            r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
            r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
            r"tr\.im|link\.zip\.net"

    def getFeaturesDict(self):
        return {
            "having_IP_Address": self.having_IP_Address(),
            "URL_Length": self.URL_Length(),
            "Shortining_Service": self.Shortining_Service(),
            "having_At_Symbol": self.having_At_Symbol(),
            "double_slash_redirecting": self.double_slash_redirecting(),
            "Prefix_Suffix": self.Prefix_Suffix(),
            "having_Sub_Domain": self.having_Sub_Domain(),
            "URL_Depth": self.URL_Depth(),
            "Domain_registeration_length": self.Domain_registeration_length(),
            "Favicon": self.Favicon(),
            "port": self.port(),
            "HTTPS_token": self.HTTPS_token(),
            "Request_URL": self.Request_URL(),
            "URL_of_Anchor": self.URL_of_Anchor(),
            "Links_in_tags": self.Links_in_tags(),
            "SFH": self.SFH(),
            "Submitting_to_email": self.Submitting_to_email(),
            "Abnormal_URL": self.Abnormal_URL(),
            "Redirect": self.Redirect(),
            "on_mouseover": self.on_mouseover(),
            "RightClick": self.RightClick(),
            "popUpWidnow": self.popUpWidnow(),
            "Iframe": self.Iframe(),
            "age_of_domain": self.age_of_domain(),
            "DNSRecord": self.DNSRecord(),
            "web_traffic": self.web_traffic()
        }

    """#### ** IP Address in the URL **

    Checks for the presence of an IP address in the URL. A URL may use an IP address
    in place of a domain name; when it does, this is treated as a strong sign that
    someone is trying to steal personal information with the URL.
    If the domain part of the URL is an IP address, the value assigned to this feature
    is 1 (phishing), or else -1 (legitimate).
    """

    def having_IP_Address(self):
        try:
            ipaddress.ip_address(self.domain)
            return 1
        except Exception:
            return -1

    """#### ** Length of URL **

    Computes the length of the URL. Phishers can use a long URL to hide the doubtful
    part in the address bar. In this project, a URL shorter than 54 characters is
    treated as legitimate, a URL of 54 to 75 characters as suspicious, and a longer
    URL as phishing.
    If the length of the URL is < 54, the value assigned to this feature is -1
    (legitimate); if it is between 54 and 75, 0 (suspicious); otherwise 1 (phishing).
    """

    def URL_Length(self):
        if len(self.url) < 54:
            return -1
        elif len(self.url) >= 54 and len(self.url) <= 75:
            return 0
        else:
            return 1

    """#### ** Using URL Shortening Services “TinyURL” **

    URL shortening is a method on the “World Wide Web” in which a URL may be
    made considerably smaller in length and still lead to the required webpage. This is
    accomplished by means of an “HTTP Redirect” on a domain name that is short,
    which links to the webpage that has a long URL.
    If the URL is using a shortening service, the value assigned to this feature is 1
    (phishing), or else -1 (legitimate).
    """

    def Shortining_Service(self):
        if re.search(self.shortening_services, self.url):
            return 1
        else:
            return -1

    """#### ** "@" Symbol in URL **

    Checks for the presence of the '@' symbol in the URL. Using the “@” symbol in the
    URL leads the browser to ignore everything preceding the “@” symbol, and the real
    address often follows the “@” symbol.
    If the URL has an '@' symbol, the value assigned to this feature is 1 (phishing),
    or else -1 (legitimate).
    """

    def having_At_Symbol(self):
        if '@' in self.url:
            return 1
        else:
            return -1

    """#### ** Redirection "//" in URL **

    Checks for the presence of "//" in the URL. The existence of “//” within the URL
    path means that the user will be redirected to another website. The location of
    the “//” in the URL is computed.
    If "//" appears anywhere in the URL other than directly after the protocol, the
    value assigned to this feature is 1 (phishing), or else -1 (legitimate).
    """

    def double_slash_redirecting(self):
        if re.search(r'https?://[^\s]*//', self.url):
            return 1
        else:
            return -1

    """#### ** Prefix or Suffix "-" in Domain **

    Checks for the presence of '-' in the domain part of the URL. The dash symbol is
    rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated
    by (-) to the domain name so that users feel that they are dealing with a
    legitimate webpage.
    If the URL has a '-' symbol in the domain part, the value assigned to this feature
    is 1 (phishing), or else -1 (legitimate).
    """

    def Prefix_Suffix(self):
        if '-' in self.domain:
            return 1
        else:
            return -1

    """#### ** SubDomains **

    Counts the dots in the domain part of the URL as a measure of how many subdomains
    it has. If there are at most 2 dots, the value assigned to this feature is -1
    (legitimate); if there are exactly 3, 0 (suspicious); otherwise 1 (phishing).
    """

    def having_Sub_Domain(self):
        count = self.domain.count('.')
        if count <= 2:
            return -1
        elif count > 2 and count <= 3:
            return 0
        else:
            return 1

    """#### ** Depth of URL **

    Computes the depth of the URL. This feature calculates the number of sub pages
    in the given URL based on the '/'.
    The value of this feature is a number derived from the URL.
    """

    def URL_Depth(self):
        depth = 0
        subdirs = self.parsedurl.path.split('/')
        for subdir in subdirs:
            if subdir:
                depth += 1
        return depth

    """#### ** End Period of Domain **

    This feature can be extracted from the WHOIS database. For this feature, the
    remaining domain time is calculated by finding the difference between the
    expiration time and the current time. The end period considered for a legitimate
    domain is 6 months or more for this project.
    If the end period of the domain is < 6 months, or the WHOIS lookup fails, the
    value of this feature is 1 (phishing), else -1 (legitimate).
    """

    def Domain_registeration_length(self):
        if self.whois is None:
            return 1
        try:
            # WHOIS may report several expiration dates; use the first one
            if isinstance(self.whois['expiration_date'], list):
                expiration_date = self.whois['expiration_date'][0]
            else:
                expiration_date = self.whois['expiration_date']

            registration_length = abs(
                (expiration_date - datetime.datetime.now()).days)
            if registration_length / 30 >= 6:
                return -1
            else:
                return 1
        except Exception:
            return 1

    """#### ** Favicon **

    Checks for the presence of a favicon in the website. The presence of a favicon
    can be used as a feature to detect phishing websites.
    If the website has no favicon, or the page could not be fetched, the value
    assigned to this feature is 1 (phishing), or else -1 (legitimate).
    """

    def Favicon(self):
        try:
            if re.findall(r'favicon', self.soup.text) or \
                    self.soup.find('link', rel='shortcut icon') or \
                    self.soup.find('link', rel='icon'):
                return -1
            else:
                return 1
        except Exception:
            return 1

    """#### ** Non-Standard Port **

    Checks for the use of a non-standard port. Phishers often use non-standard ports
    in the URL in order to make it look like a legitimate one. As implemented, any
    port given explicitly in the URL is treated as non-standard.
    If the URL specifies a port, the value assigned to this feature is 1 (phishing),
    or else -1 (legitimate).
    """

    def port(self):
        if self.parsedurl.port:
            return 1
        else:
            return -1

    """#### ** "http/https" in Domain name **

    Checks for the presence of the "https" token in the domain part of the URL.
    Phishers may add the “HTTPS” token to the domain part of a URL in order to trick
    users.
    If the domain part contains "https", the value assigned to this feature is 1
    (phishing), or else -1 (legitimate).
    """

    def HTTPS_token(self):
        if 'https' in self.domain:
            return 1
        else:
            return -1

    """### ** Request_URL **

    The fine line that distinguishes phishing websites from legitimate ones is how
    many times a website has been redirected. In our dataset, we find that legitimate
    websites have been redirected one time at most. On the other hand, phishing
    websites containing this feature have been redirected at least 4 times.
    If the request was redirected at most once, the value assigned to this feature is
    -1 (legitimate); if two or three times, 0 (suspicious); otherwise 1 (phishing).
    """

    def Request_URL(self):
        try:
            # request.history holds one response per redirect that was followed
            if len(self.request.history) <= 1:
                return -1
            elif len(self.request.history) <= 3:
                return 0
            else:
                return 1
        except Exception:
            return -1

    """#### ** URL_of_Anchor **

    Checks the fetched page for “<a>” HTML tags that carry an href attribute.
    If the page has no such anchor, or the page could not be fetched, the value
    assigned to this feature is 1 (phishing), or else -1 (legitimate).
    """

    def URL_of_Anchor(self):
        try:
            count = 0
            for i in self.soup.find_all('a'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except Exception:
            return 1

    """#### ** Links_in_tags **

    Checks the fetched page for “<link>” HTML tags that carry an href attribute.
    If the page has no such tag, or the page could not be fetched, the value
    assigned to this feature is 1 (phishing), or else -1 (legitimate).
    """

    def Links_in_tags(self):
        try:
            count = 0
            for i in self.soup.find_all('link'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except Exception:
            return 1

    """#### ** SFH **

    The presence of a “<form>” HTML tag in the page is treated as an indicator of
    phishing websites. This feature checks the fetched page for a “<form>” tag.
    If the page has a “<form>” tag, the value assigned to this feature is 1
    (phishing), or else -1 (legitimate). If the page could not be fetched, the value
    is 0 (suspicious).
    """

    def SFH(self):
        try:
            if self.soup.find('form'):
                return 1
            else:
                return -1
        except Exception:
            return 0

    """#### ** Submitting_to_email **

    The presence of “mailto:” in the page is treated as an indicator of phishing
    websites. This feature checks the fetched page for “mailto:” targets.
    If the page has a “mailto:” target, the value assigned to this feature is 1
    (phishing), or else -1 (legitimate). If the page could not be fetched, the value
    is 0 (suspicious).
    """

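    def Submitting_to_email(self):
        try:
            # Look for mailto: targets in link hrefs and form actions
            if self.soup.find(href=re.compile(r"mailto:")) or \
                    self.soup.find(action=re.compile(r"mailto:")):
                return 1
            else:
                return -1
        except Exception:
            return 0

    """#### ** Abnormal_URL **

    The presence of script-related tokens in the URL is treated as an indicator of
    phishing websites. This feature checks the URL itself for tokens such as
    "script", "javascript", "alert", "onload", "onerror" and "onclick".
    If the URL contains such a token, the value assigned to this feature is 1
    (phishing), or else -1 (legitimate).
    """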

    def Abnormal_URL(self):
        try:
            if re.findall(r'script|javascript|alert|onmouseover|onload|onerror|onclick|onmouse',
                          self.url):
                return 1
            else:
                return -1
        except Exception:
            return -1

    """#### ** Redirect **

    A “<meta>” HTML tag with http-equiv="refresh" redirects the visitor to another
    page and is treated as an indicator of phishing websites. This feature checks the
    fetched page for such a tag.
    If the page has a meta refresh tag, the value assigned to this feature is 1
    (phishing), or else -1 (legitimate).
    """

    def Redirect(self):
        try:
            if self.soup.find('meta', attrs={'http-equiv': 'refresh'}):
                return 1
            else:
                return -1
        except Exception:
            return -1

    """### ** Status Bar Customization **

    Phishers may use JavaScript to show a fake URL in the status bar to users. To
    extract this feature, we must dig out the webpage source code, particularly the
    “onMouseOver” event, and check if it makes any changes to the status bar.
    If onmouseover is found, the value assigned to this feature is 1 (phishing);
    otherwise, including when the response is empty, it is -1 (legitimate).
    """
    def on_mouseover(self):
        try:
            if re.findall(r"onmouseover", self.soup.text):
                return 1
            else:
                return -1
        except Exception:
            return -1

    """### ** Disabling Right Click **

    Phishers use JavaScript to disable the right-click function, so that users cannot
    view and save the webpage source code. This feature is treated exactly as “Using
    onMouseOver to hide the Link”. Nonetheless, for this feature, we search for
    “contextmenu” or the event “event.button==2” in the webpage source code and check
    if the right click is disabled.
    If either pattern is found, the value assigned to this feature is 1 (phishing);
    otherwise, including when the response is empty, it is -1 (legitimate).
    """

    def RightClick(self):
        try:
            if re.findall(r"contextmenu|event.button ?== ?2", self.soup.text):
                return 1
            else:
                return -1
        except Exception:
            return -1

    """### ** PopUp Window **

    Phishers may use JavaScript to open a fake webpage in a new window to trick
    users. This feature is treated exactly as “Using onMouseOver to hide the Link”.
    Nonetheless, for this feature, we search for “window.open” (as well as “alert(”
    and “onMouseOver”) in the webpage source code and check if a pop-up window is
    opened.
    If any of these is found, the value assigned to this feature is 1 (phishing);
    otherwise, including when the response is empty, it is -1 (legitimate).
    """

    def popUpWidnow(self):
        try:
            if re.findall(r"alert\(|onMouseOver|window.open", self.soup.text):
                return 1
            else:
                return -1
        except Exception:
            return -1

    """### ** IFrame Redirection **

    IFrame is an HTML tag used to display an additional webpage inside the one that
    is currently shown. Phishers can make use of the “iframe” tag and make it
    invisible, i.e. without frame borders. In this regard, phishers make use of the
    “frameBorder” attribute, which causes the browser to render a visual delineation.
    If an iframe or a frameBorder attribute is found in the page, the value assigned
    to this feature is 1 (phishing); otherwise, including when the response is empty,
    it is -1 (legitimate).
    """

    def Iframe(self):
        try:
            # Look for an <iframe> element or any element with a frameborder attribute
            if self.soup.find('iframe') or \
                    self.soup.find(attrs={'frameborder': True}):
                return 1
            else:
                return -1
        except Exception:
            return -1

    """#### ** Age of Domain **

    This feature can be extracted from the WHOIS database. Most phishing websites
    live for a short period of time. The minimum age of a legitimate domain is
    considered to be 12 months for this project. Age here is the difference between
    the creation time and the current time.
    If the age of the domain is 12 months or less, or the WHOIS lookup fails, the
    value of this feature is 1 (phishing), else -1 (legitimate).
    """
    def age_of_domain(self):
        if self.whois is None:
            return 1
        try:
            # WHOIS may report several creation dates; use the first one
            if isinstance(self.whois['creation_date'], list):
                creation_date = self.whois['creation_date'][0]
            else:
                creation_date = self.whois['creation_date']

            ageofdomain = abs((datetime.datetime.now() - creation_date).days)

            if ageofdomain / 30 > 12:
                return -1
            else:
                return 1
        except Exception:
            return 1

    """#### ** DNS Record **

    For phishing websites, either the claimed identity is not recognized by the WHOIS
    database or no records are found for the hostname.
    If the DNS record is empty or not found, the value assigned to this feature is 1
    (phishing), or else -1 (legitimate).
    """

    def DNSRecord(self):
        try:
            resolver.resolve(self.domain, 'A')
            return -1
        except Exception:
            return 1

    """#### ** Web Traffic **

    This feature measures the popularity of the website by determining the number of
    visitors and the number of pages they visit. However, since phishing websites live
    for a short period of time, they may not be recognized by the Alexa database
    (Alexa the Web Information Company., 1996). By reviewing our dataset, we find that
    in worst scenarios, legitimate websites ranked among the top 100,000. Furthermore,
    if the domain has no traffic or is not recognized by the Alexa database, it is
    classified as “Phishing”.
    If the rank of the domain is < 100000, the value of this feature is -1
    (legitimate), else 1 (phishing).
    """

    def web_traffic(self):
        try:
            alexadata = BeautifulSoup(requests.get(
                "http://data.alexa.com/data?cli=10&dat=s&url=" + self.domain,
                timeout=10).content, 'lxml')
            rank = int(alexadata.find('reach')['rank'])
            if rank < 100000:
                return -1
            else:
                return 1
        except Exception:
            return 1
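
Several of the features above depend only on the URL string and need no network access. A minimal offline sketch of a few of them, using the same 1 / 0 / -1 coding, might look like the following. The function name `lexical_features` is an assumption for this example, and the sketch deliberately reads `hostname` (which drops any port) rather than `netloc` as `FeatureExtraction` does, so results can differ for URLs that carry a port.

```python
import ipaddress
from urllib.parse import urlparse


def lexical_features(url):
    """Compute a few URL-only features with the same 1 / 0 / -1 coding."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # assumption: hostname (no port), not netloc

    # 1 when the host is a literal IP address, -1 otherwise
    try:
        ipaddress.ip_address(host)
        has_ip = 1
    except ValueError:
        has_ip = -1

    # Same thresholds as URL_Length: < 54, 54 to 75, > 75
    length = len(url)
    if length < 54:
        url_length = -1
    elif length <= 75:
        url_length = 0
    else:
        url_length = 1

    return {
        "having_IP_Address": has_ip,
        "URL_Length": url_length,
        "having_At_Symbol": 1 if "@" in url else -1,
        "Prefix_Suffix": 1 if "-" in host else -1,
        # Number of non-empty path segments
        "URL_Depth": len([part for part in parsed.path.split("/") if part]),
    }


if __name__ == "__main__":
    print(lexical_features("http://192.168.0.1:8080/login/verify"))
```

Keeping these checks separate from the WHOIS, DNS and page-content checks makes them easy to exercise in isolation, since they return the same value every time for a given URL.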

Helper.py
from features import FeatureExtraction
from flask import send_from_directory
from html2image import Html2Image
from urllib.parse import urlparse
import pandas as pd
import validators
import pickle
import re
import os

screenshot_dir = 'screenshot/'
stats_params = ('visits', 'checked', 'phished')

h2i = Html2Image()
h2i.output_path = screenshot_dir

stats_filename = 'stats.txt'
model = pickle.load(open("model.pkl", "rb"))

def format_url(url):
    # Prepend http:// when the user left out the scheme
    url = url.strip()
    if not re.match('(?:http|ftp|https)://', url):
        return 'http://{}'.format(url)
    return url

def capture_screenshot(target_url, filename='screenshot.png', size=(1920, 1080)):
    h2i.screenshot(url=target_url, save_as=filename, size=size)
    return send_from_directory(screenshot_dir, path=filename)


def get_phishing_result(target_url):
    target_url = format_url(target_url)
    if not (target_url and validators.url(target_url)):
        return dict(status=False, message="You have provided an invalid target url, Please try again after updating the url.")

    try:
        update_stats('checked')
        target = urlparse(target_url)

        # Build a single-row DataFrame with one column per feature
        features_obj = FeatureExtraction(target_url)
        x = pd.DataFrame.from_dict(features_obj.getFeaturesDict(), orient='index').T

        pred = model.predict(x)[0]  # 1 is phished & 0 is not

        pred_prob = model.predict_proba(x)[0]
        safe_prob = pred_prob[0]
        unsafe_prob = pred_prob[1]

        if pred == 1:
            update_stats('phished')
        return dict(
            status=True,
            domain=target.netloc,
            target=target_url,
            safe_percentage=safe_prob*100,
            unsafe_percentage=unsafe_prob*100
        )
    except Exception as e:
        return dict(status=False, message=str(e))

def get_stats(key=None):
    stats = {}
    if os.path.exists(stats_filename):
        # Each line of the stats file has the form "name:count"
        with open(stats_filename, "r") as file:
            for line in file:
                (k, v) = line.split(":")
                stats[k] = int(v)

        if key is not None:
            return stats[key] if key in stats else None

        return stats

    return False

def update_stats(key):
    stats = get_stats()
    with open(stats_filename, "w+") as file:
        if stats is False:
            # No stats file yet: create it with every counter at zero
            file.write('\n'.join([f"{x}:0" for x in stats_params]))
        else:
            lines = []
            avail_params = list(stats_params)
            for k, v in stats.items():
                avail_params.remove(k)
                if k == key:
                    v += 1
                lines.append(f"{k}:{v}")
            # Add any counter that was missing from the file
            if len(avail_params) > 0:
                for param in avail_params:
                    lines.append(f"{param}:0")
            file.write("\n".join(lines))
            file.flush()
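
The counters in `stats.txt` are stored as one `name:count` pair per line. The read, increment and write cycle that `get_stats` and `update_stats` perform on that file can be sketched on plain strings as follows. The function names `parse_stats`, `bump` and `dump_stats` are assumptions for this example, and unlike `update_stats` the sketch increments the counter even when it starts from empty input.

```python
STATS_PARAMS = ("visits", "checked", "phished")


def parse_stats(text):
    """Parse 'name:count' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split(":")
            stats[key] = int(value)
    return stats


def bump(stats, key):
    """Return a copy with every known counter present and `key` incremented."""
    updated = {name: stats.get(name, 0) for name in STATS_PARAMS}
    updated[key] = updated.get(key, 0) + 1
    return updated


def dump_stats(stats):
    """Serialise counters back to the one-pair-per-line text format."""
    return "\n".join(f"{key}:{value}" for key, value in stats.items())


if __name__ == "__main__":
    stats = parse_stats("visits:3\nchecked:1\nphished:0")
    print(dump_stats(bump(stats, "checked")))
```

Working on strings keeps the format logic separate from file access, which makes the round trip straightforward to check without touching the disk.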
Main.py
from flask import Flask
from url_detector import *
from content_detector import *

app = Flask(__name__)

# Routes from app.py


@app.route('/')
def home():
    return render_template('home.html')


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)

# Routes from main.py


@app.route('/url-detector')
def url():
    update_stats('visits')
    return render_template("index.html")


@app.route('/check', methods=['GET', 'POST'])
def check():
    update_stats('visits')
    if request.method == "POST":
        target_url = request.json['target']
        result = get_phishing_result(target_url=target_url)
        return jsonify(result)

    target_url = request.args.get("target")
    return render_template('check.html', target=target_url)


@app.route("/listen")
def listen():
    def respond_to_client():
        # Stream the current counters as a "stats" event every half second
        while True:
            stats = get_stats()
            _data = json.dumps(
                {"visits": stats['visits'], "checked": stats['checked'], "phished": stats['phished']})
            yield f"id: 1\ndata: {_data}\nevent: stats\n\n"
            time.sleep(0.5)

    return Response(respond_to_client(), mimetype='text/event-stream')

@app.route("/screenshot")
def screenshot():
    query = request.args
    if query and query.get("target"):
        target_url = query.get("target")
        today_date = date.today()

        width = default_screenshot_width
        height = default_screenshot_height

        if query.get("width") and query.get("height"):
            width = int(query.get("width"))
            height = int(query.get("height"))

        ss_file_name = secure_filename(f"{target_url}-{today_date}-{width}x{height}.png")
        ss_file_path = os.path.join(screenshot_dir, ss_file_name)

        # Reuse a screenshot already taken today at this size
        if os.path.exists(ss_file_path):
            return send_from_directory(screenshot_dir, path=ss_file_name)

        return capture_screenshot(target_url=target_url, filename=ss_file_name, size=(width, height))
    abort(404)


if __name__ == '__main__':
    app.run(debug=True)

Url detector.py
from flask import Flask, Response, render_template, request, send_from_directory, jsonify, abort
from helper import get_phishing_result, get_stats, update_stats, capture_screenshot, screenshot_dir
from werkzeug.utils import secure_filename
from datetime import date
import time
import json
import os

app = Flask(__name__)
app.secret_key = os.urandom(12).hex()

default_screenshot_width = 1920
default_screenshot_height = 1080


@app.route('/')
def home():
    update_stats('visits')
    return render_template("index.html")


@app.route('/check', methods=['GET', 'POST'])
def check():
    update_stats('visits')
    if request.method == "POST":
        target_url = request.json['target']
        result = get_phishing_result(target_url=target_url)
        return jsonify(result)
    target_url = request.args.get("target")
    return render_template('check.html', target=target_url)

@app.route("/listen")
def listen():

    def respond_to_client():
        # Stream the current counters as a "stats" event every half second
        while True:
            stats = get_stats()
            _data = json.dumps(
                {"visits": stats['visits'], "checked": stats['checked'], "phished": stats['phished']})
            yield f"id: 1\ndata: {_data}\nevent: stats\n\n"
            time.sleep(0.5)

    return Response(respond_to_client(), mimetype='text/event-stream')

@app.route("/screenshot")
def screenshot():
    query = request.args
    if query and query.get("target"):
        target_url = query.get("target")
        today_date = date.today()
        width = default_screenshot_width
        height = default_screenshot_height

        if query.get("width") and query.get("height"):
            width = int(query.get("width"))
            height = int(query.get("height"))

        ss_file_name = secure_filename(f"{target_url}-{today_date}-{width}x{height}.png")
        ss_file_path = os.path.join(screenshot_dir, ss_file_name)

        # Reuse a screenshot already taken today at this size
        if os.path.exists(ss_file_path):
            return send_from_directory(screenshot_dir, path=ss_file_name)

        return capture_screenshot(target_url=target_url, filename=ss_file_name, size=(width, height))
    abort(404)


if __name__ == '__main__':
    app.run(debug=True)
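
The `/listen` route above streams Server-Sent Events, which the `EventSource` listener in index.html receives as `stats` events. Each event is a small block of `field: value` lines closed by a blank line. A minimal sketch of building one such frame is given below; the helper name `format_sse` and its parameters are assumptions for this example and are not part of Flask.

```python
import json


def format_sse(data, event=None, event_id=None):
    """Build one Server-Sent Events frame; a blank line ends the frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # The payload is JSON-encoded so the browser can JSON.parse it
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    frame = format_sse({"visits": 3, "checked": 1, "phished": 0},
                       event="stats", event_id=1)
    print(frame, end="")
```

A generator that yields such frames can be passed to a response with the `text/event-stream` MIME type, as the route above does.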

A2 – SCREENSHOTS

Figure A2.1 Home page of the website

Figure A2.2 URL Detector page of the website

This image also shows the number of website visits, the total number of websites
checked, and the total number of phishing websites detected out of the total websites
checked.

Figure A2.3 Valid phishing URL has been pasted

Figure A2.4 Valid phishing URL has been detected

Figure A2.5 Invalid phishing URL has been pasted

Figure A2.6 Invalid phishing URL has been detected
