Final Doc Fin
Submitted by
NITHISH. B (312820104057)
PRADEEP JOSHWA (312820104060)
PRAKASH RAJA. C (312820104061)
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
MAY 2024
ABSTRACT
These efforts are not merely technical enhancements; they are crucial for mitigating
the risk of data breaches and identity theft that loom over unsuspecting users. As
phishing attacks become more intricate and insidious, a proactive and adaptive
response is imperative to safeguard the digital ecosystem. Beyond the immediate goal
of threat mitigation, these optimizations contribute significantly to the broader objective
of fostering user privacy, building trust, and preserving data integrity across diverse
online platforms. In essence, this project represents a critical stride towards fortifying
the digital realm against the ever-evolving and persistent menace of phishing attacks.
TABLE OF CONTENTS
ABSTRACT 2
TABLE OF CONTENTS vi
LIST OF ABBREVIATIONS ix
1 INTRODUCTION 1
1.1 OVERVIEW 1
1.2 OBJECTIVE 3
1.3 DESCRIPTION 3
2 LITERATURE SURVEY 6
3.4 METHODOLOGIES 10
3.4.5 XGBOOST MODEL 11
4 DESIGN PROCESS 13
5 IMPLEMENTATIONS 20
6 EXPERIMENTATION RESULTS 31
6.1 OBSERVATIONS 31
6.2 INFERENCES 33
REFERENCES 35
APPENDIX 39
A1 – SOURCE CODE 39
A2 – SCREENSHOTS 76
LIST OF FIGURES
FIGURE NO. FIGURE NAME PAGE NO.
LIST OF ABBREVIATIONS
ML Machine Learning
AI Artificial Intelligence
RF Random Forest
LR Logistic Regression
DT Decision Tree
NB Naive Bayes
MITM Man-in-the-Middle
IP Internet Protocol
MLP Multilayer Perceptron
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Data science and machine learning are closely intertwined disciplines, often used in
conjunction to extract valuable insights from data, make predictions, and automate
decision-making processes. This section outlines how they are integrated, particularly
in the context of classification tasks.
1.1.2 MACHINE LEARNING
Data Preparation: Data scientists play a crucial role in preparing the data for
classification tasks. They handle tasks such as cleaning noisy data, handling missing
values, and transforming data into a suitable format for machine learning algorithms.
Feature Engineering: Identifying and selecting relevant features (variables) from the
dataset is a critical step in classification. Data scientists use domain knowledge to
determine which features are most informative for the task at hand.
Training the Model: Machine learning models are trained on labelled datasets during
this phase. The model learns patterns and relationships between features and labels,
adjusting its parameters to make accurate predictions.
Evaluation and Validation: Data scientists are responsible for evaluating the
performance of the trained model using validation datasets. They use metrics such as
accuracy, precision, recall, and F1-score to assess how well the model generalises to
new, unseen data.
1.2 OBJECTIVE
The objective of this project is to develop and compare two machine learning models
for the task of detecting phishing websites. The primary focus is on enhancing the
accuracy and efficiency of phishing detection methods by using feature selection, and
also on evaluating the models based on their accuracy in order to determine which
one performs better for the given task.
The project also covers the potential integration of this ML-based system into web
browsers, or as a browser extension, to provide real-time warnings and protection for
users while browsing, ultimately fostering a safer and more secure online environment.
1.3 DESCRIPTION
This project employs machine learning algorithms to identify critical factors influencing
the detection of phishing websites. Two models, a Multilayer Stacked Ensemble Model
and an XGBoost Model, use techniques such as clustering and classification to discern
patterns in data.
Addressing the rising threat of phishing websites, the project aims to develop and
compare two robust machine learning models. The models are evaluated for efficiency
in detecting deceptive websites using Fisher's score for feature selection. Additionally,
a Chrome extension will integrate the superior model, offering real-time warnings and
enhanced user protection during online activities.
Chapter 1 lays the groundwork for the entire project report. It provides an overview of
the project, including its purpose, scope, and significance. It also describes the project
in more detail, outlining the objectives and expected outcomes. Additionally, it explains
the structure of the report itself, giving the reader a roadmap to navigate the
information presented.
Chapter 3 tackles the heart of the project by defining the problem and outlining the
chosen approach to solve it. This chapter dives into the existing system, analyses its
limitations, and presents the proposed solution with its underlying algorithm. In
essence, it's the roadmap for tackling the challenge at hand.
Chapter 4 lays out the blueprint for a website with a Chrome extension, from its purpose
and audience to its technical structure and development process. It starts with an
overview, then specifies requirements, details the architecture, and breaks down the
design step-by-step. Each module gets its own dedicated explanation, while the
conclusion wraps everything up and suggests potential future directions. Essentially,
this chapter serves as a comprehensive roadmap for bringing the website and its
Chrome extension to life.
Chapter 5 chronicles the critical steps taken to construct and activate a powerful
phishing URL detection website. It showcases the development and deployment of its
core security features. It meticulously details the construction of these vital safeguards,
providing a comprehensive blueprint for those seeking to establish their own phishing
URL detection stronghold.
Chapter 6 discusses the significant lessons learned, and the necessary future
enhancements that can be used to improve the overall feasibility of the proposed work.
CHAPTER 2
LITERATURE SURVEY
Lakshmana Rao Kalabarige et al., 2022 [1] propose a highly effective Multilayer
Stacked Ensemble Learning Model for phishing detection, achieving 96.79% to
98.90% accuracy. Outperforming baselines, the model underscores its efficacy with
improved metrics. The paper stresses the urgency of countering phishing, outlines the
model's architecture and results, and suggests future research on feature selection
and model optimization. Overall, the study introduces a potent detection model,
validates its effectiveness, and outlines avenues for further research.
Al-Sariera et al., 2019 [2] presented advanced AI meta-learner models like AdaBoost-
Extra Tree for phishing detection, achieving over 97% accuracy with minimal false
positives (below 0.028). The study critically reviewed existing methods, emphasising
the need for improved techniques. Thorough evaluation using 10-fold cross-validation
and WEKA software demonstrated the models' superiority in accuracy and predictive
capabilities over existing methods. The paper highlighted the importance of
interpretable AI models and suggested exploring alternative decision tree algorithms
and hybridization methods for future advancements.
Rasha Zieni et al., 2023 [4] review phishing detection methods, emphasising model
explainability, feature engineering, and domain knowledge. They identify gaps,
including URL shortening challenges, and stress the importance of reproducibility,
diverse datasets, and informed feature selection. The survey offers insights into
evolving phishing tactics, highlighting the need for continuous research and user
education in effective countermeasures.
Al-Sarem et al., 2021 [5] presented an optimised stacking ensemble method for
phishing detection, employing Genetic Algorithm to determine optimal parameters.
The ensemble comprised algorithms like Random Forests, AdaBoost, XGBoost,
Bagging, GradientBoost, and LightGBM, applied to UCI Phishing, Mendeley, and
Mendeley-small variant datasets. The model demonstrated remarkable accuracy of
97.16%, 98.58%, and 97.35% on the respective datasets, showcasing its
effectiveness across diverse phishing instances and features.
Yi Wei et al., 2022 [6] compare machine learning and deep learning
methods for phishing website classification. They assess traditional algorithms,
ensemble methods, and deep learning models like Random Forest, AdaBoost, LSTM,
CNN, and RNN. Results emphasise ensemble methods' effectiveness, particularly
Random Forest, in achieving high accuracy and computational efficiency, especially
with reduced feature sets.
Ahmet Ozaday et al., 2022 [7] used six machine learning algorithms to classify URLs
based on eleven features, with Random Forest yielding the highest accuracy of
98.90%. Comparing various methods, they concluded Random Forest provided
consistent and superior performance. The study stressed the importance of updated
datasets, global collaboration, and user awareness in combating phishing.
A. Karim et al., 2023 [8] developed a phishing detection system employing various
machine learning models and a hybrid approach (LR+SVC+DT). The hybrid model
demonstrated high efficiency, utilising metrics like accuracy, precision, recall,
specificity, and F1-score. The study underscores the effectiveness of combining
list-based and machine-learning-based systems for more efficient phishing URL
detection.
M. Aljabri et al., 2022 [9] offer a thorough review of ML algorithms for detecting
malicious URLs, highlighting SVM, RF, DT, NB, and LR with accuracy surpassing
98.42%. The document underscores the effectiveness of ensemble techniques in
achieving over 90% accuracy and discusses challenges like sample sizes and network
traffic considerations. Providing insights into datasets, features, and model accuracy,
the study contributes to understanding and addressing unresolved issues in malicious
URL detection.
Priscilla Kyei Danso et al., 2022 [10] tackle IoT security challenges with an Intelligent
Ensemble-based IDS at the gateway, mitigating threats like MITM and DoS. The
proposed solution employs Naïve Bayes, SVM, and k-NN as base learners,
demonstrating efficacy through ensemble models on various datasets. The study
emphasises the importance of ensemble learning in IoT security and suggests future
directions for anomaly-based IDS improvements.
Pankaj Saraswat et al., 2022 [11] address email security challenges, focusing on
phishing detection with machine learning. Using SVM and Random Forest, the study
achieves a maximum accuracy of 96.87%, emphasising the need for effective
detection methods against evolving phishing techniques. The proposed system
extracts link, tag, and word-based features, underscoring the importance of dataset
expansion for real-world applicability.
F. Castaño et al., 2023 [12] introduce PhiKitA-500, a dataset linking phishing websites
to phishing kits, facilitating algorithm evaluation. The methodology involves stages like
source definition, phishing kit collection, website extraction, and postprocessing.
Results indicate successful grouping of phishing kits, demonstrating the utility of kit
information in classifying phishing attacks, despite challenges in multiclass
classification.
CHAPTER 3
In this chapter, the project delves into the core by identifying the problem and
articulating the selected strategy for resolution. The chapter critically examines the
current system, putting forth its constraints, and introduces the solution, complete with
its underlying algorithm. Essentially, this section serves as a comprehensive guide,
outlining the path forward for addressing the project's central challenge.
The existing model for phishing detection is a Layer-wise Stacked Ensemble Learning
architecture, comprising multiple layers of estimators culminating in a meta-learner.
The workflow involves initialising the model, creating multiple layers with diverse
estimators, and adding a meta-learner as the final layer for comprehensive decision-
making. The Stacked Ensemble Learning process involves running estimators in
parallel within layers and sequentially between layers, employing various models like
Random Forest and Logistic Regression. The phases of the Multilayer Stacked
Ensemble Learning Model (MLSELM) include the input phase with the Phishing
dataset, data balancing phase, and the implementation phase for executing the model
effectively.
3.4 METHODOLOGIES
such as URL length, presence of HTTPS, use of IP addresses, presence of suspicious
keywords, and other relevant indicators are likely extracted. These features provide
valuable information for the machine learning models to learn and make predictions
effectively.
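Indicators of the kind listed above can be computed with standard-library Python. The sketch below is illustrative only: the function name extract_url_features, the feature names, and the keyword list are assumptions made for this example and are not taken from the project's dataset or code.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list, chosen for illustration only.
SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "secure", "account")


def extract_url_features(url: str) -> dict:
    """Return a few simple lexical features of a URL (illustrative sketch)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # The host counts as an IP address if it parses as IPv4 or IPv6.
    try:
        ipaddress.ip_address(host)
        uses_ip = 1
    except ValueError:
        uses_ip = 0

    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "uses_ip_address": uses_ip,
        "suspicious_keyword_count": sum(lowered.count(k) for k in SUSPICIOUS_KEYWORDS),
    }


print(extract_url_features("http://192.168.0.1/login"))
```

Each returned value is numeric, so the dictionary can be turned directly into a row of a feature matrix.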
3.4.6 Metric Selection:
Evaluation metrics are used to assess the performance of the machine learning
models. Common metrics used in binary classification tasks like phishing detection
include precision, recall, F1 score, and accuracy.
• Precision measures the proportion of true phishing URLs among the URLs
predicted as phishing.
• Recall measures the proportion of true phishing URLs correctly identified by the
model.
• F1 score is the harmonic mean of precision and recall.
• Accuracy measures the overall correctness of the model's predictions.
By using relevant evaluation metrics, the system can effectively evaluate and compare
the performance of the multilayer stacked ensemble model and the XGBoost model to
select the best-performing approach.
CHAPTER 4
DESIGN PROCESS
The proposed system offers a comprehensive solution for phishing URL detection, with
a dedicated website and a convenient Chrome extension to strengthen user
protection while browsing the web. The dedicated website is the core of the system,
with an easy-to-use interface where users can input URLs to be analyzed. The
backend integrates two machine learning models: the multilayer stacked ensemble
model and the XGBoost model. The inclusion of Fisher’s score in the feature
selection methodology improves the system’s ability to identify phishing patterns by
focusing on critical aspects that are not explicitly covered by the current system.
This website is designed to integrate seamlessly with the Chrome extension, providing
a unified and synchronous user experience. The Chrome extension is available for
download and provides an extra layer of protection while online. One of the key
features of this extension is the real-time pop-up notification that appears when a
user hovers over a URL, alerting them to a potential phishing attempt. These
notifications act as a direct link between the extension and the dedicated website,
providing timely warnings so that users can make informed decisions while navigating
the web.
The system’s design focuses on the smooth integration of the dedicated website with
the Chrome extension to provide a holistic approach to the detection of phishing URLs.
Sophisticated communication channels ensure the secure exchange of data between
the website’s backend and Chrome extension, while preserving the user’s privacy and
the system’s reliability. Regular updates and syncing mechanisms ensure that the
machine learning model and detection algorithms are always up-to-date and effective
against ever-evolving phishing tactics.
The user experience is at the forefront, with an intuitive website interface and an
unobtrusive Chrome extension, allowing users to conveniently access protection
features without interrupting their browsing. Real-time notifications from the extension
enable users to make smart decisions about the security of the URLs they encounter.
In conclusion, the proposed solution combines the best features of a dedicated site
with a Chrome extension to increase the effectiveness of phishing URL detection,
while prioritising easy-to-use interactions and real-time feedback to actively protect
users during their online engagements.
The web application is the overarching entity, encompassing two main divisions: the
frontend and the backend. The frontend is the user-facing component primarily
accessed through the Chrome Extension. Within the frontend, the Content Script
operates as a script embedded in the visited webpage, providing access to the
webpage's content such as text, images, and structure to gather essential data. On
the other hand, the backend serves as the processing powerhouse behind the scenes,
managing critical functions. The Background Script, integrated into the Chrome
Extension, acts as a coordinator facilitating communication between the content script
and the backend by passing data seamlessly. Additionally, an essential element of the
backend is the API, functioning as an interface that enables the detection model within
the backend to receive data and convey evaluations effectively.
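The exchange between the background script and the backend API can be pictured as a JSON request carrying the URL and a JSON response carrying the evaluation. The following framework-agnostic sketch shows one possible shape of that exchange. The function name handle_check_request, the field names, the 0.5 threshold, and the predict_phishing_probability callable are all hypothetical and stand in for the project's actual API and model.

```python
import json


def handle_check_request(body: str, predict_phishing_probability) -> str:
    """Hypothetical API handler: takes a JSON request body, returns a JSON reply.

    predict_phishing_probability is any callable mapping a URL string to a
    probability in [0, 1]; it stands in for the detection model.
    """
    try:
        payload = json.loads(body)
        url = payload["url"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'url' field"})

    probability = float(predict_phishing_probability(url))
    return json.dumps({
        "url": url,
        "phishing_probability": probability,
        "is_phishing": probability >= 0.5,  # assumed decision threshold
    })


# Usage with a stand-in model that always returns 0.9.
print(handle_check_request('{"url": "http://example.test"}', lambda u: 0.9))
```

Keeping the handler independent of any web framework means the same function could sit behind whichever routing layer the backend uses.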
At the core of the system lies the Phishing Detection Model, serving as the intelligence
hub. This is a machine learning model trained to discern patterns and features
characteristic of phishing websites. It analyses incoming data and determines whether
a visited webpage poses a potential phishing threat. In essence, the web application
operates as an integrated unit: the frontend facilitates user interaction through the
Chrome Extension, whose content script and background script gather and relay page
data, while the backend, with its API and the Phishing Detection Model, processes
that data and identifies potential phishing websites.
Figure 4.3 Backend-based architecture diagram
A stacked ensemble learning model is a machine learning technique that uses multiple
models to improve the performance of a single model.
The stacked ensemble learning model used in this diagram is a multilayer stacked
ensemble learning model. A multilayer stacked ensemble learning model is a type of
stacked ensemble learning model that uses multiple layers of models. The first layer
of the model consists of five different machine learning models: MLP, KNN, RF, LR,
and XGB. These models are all trained on the training dataset. The second layer of
the model consists of two models. These models are trained on the outputs of the first
layer of models. The third layer of the model consists of a single XGB model, also
called a meta learner. This model is trained on the outputs of the second layer of
models. The final output of the model is a prediction of whether a website is phishing
or legitimate.
The stacked ensemble learning model can improve the performance of a single
machine learning model by combining the strengths of multiple models.
4.4 PROJECT REQUIREMENTS
Jupyter is an open-source tool that allows interactive computing and supports various
programming languages. It's widely used for creating and sharing documents
containing live code, equations, visualizations, and narrative text. Jupyter is likely used
for developing and testing code, especially for tasks like data preprocessing, feature
extraction, and initial model training. Its interactive nature facilitates iterative
development and experimentation.
Google Colab is a cloud-based platform provided by Google that allows for the creation
and execution of Jupyter notebooks in a collaborative environment. It provides free
access to GPU resources, which can be beneficial for training machine learning
models. Google Colab is specified as a tool, indicating that the project may leverage
its cloud-based infrastructure for resource-intensive tasks, such as training machine
learning models on the specified dataset.
In summary, the specified software and hardware requirements are optimised to be
used with Windows as the required OS, with Jupyter and Google Colab as key
development tools. The hardware specifications include a high-performance
processor, SSD for fast storage, ample RAM for efficient multitasking, and a dedicated
GPU for accelerated machine learning tasks. These choices are aligned with the
computational demands of developing and implementing a phishing URL detection
system with machine learning models.
CHAPTER 5
IMPLEMENTATIONS
Chapter 5 chronicles the critical steps taken to construct and activate a powerful
phishing URL detection website. It showcases the development and deployment of its
core security features. It meticulously details the construction of these vital safeguards,
providing a comprehensive blueprint for those seeking to establish their phishing URL
detection stronghold.
#Listing the features of the dataset
data.columns
# nunique value in columns
data.nunique()
Figure 5.6 Description of the data
5.2 DATA VISUALIZATION
#Correlation heatmap
plt.figure(figsize=(15,15))
sns.heatmap(data.corr(), annot=True)
plt.show()
This code uses the seaborn (sns) and matplotlib (plt) libraries to draw a heatmap of
the correlation matrix of the DataFrame data. Each cell is coloured by the correlation
coefficient between a pair of columns; with seaborn's default colormap, lighter cells
indicate coefficients closer to +1 and darker cells indicate lower (including negative)
values. The annotations on the heatmap provide the exact correlation coefficient for
each pair of columns.
#Phishing Count in a pie chart
data['class'].value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.title("Phishing Count")
plt.show()
This code generates a pie chart of the distribution of the 'class' column in the
DataFrame 'data'. Each slice of the pie represents a unique class, and the size of each
slice corresponds to the frequency of that class in the dataset. The percentage values
displayed on the chart indicate the proportion of each class relative to the total number
of instances in the dataset.
# Splitting the dataset into train and test sets: 80-20 split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
Figure 5.9 Testing and Training Data
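# NOTE: the function header and the lines marked "reconstructed" are missing
# from this listing and are inferred from the surviving code.
import numpy as np

def fisher_score(X, y):
    """
    Compute the Fisher Score of each feature.

    Parameters:
    - X: numpy array, shape (n_samples, n_features)
      Feature matrix.
    - y: numpy array, shape (n_samples,)
      Target vector.
    Returns:
    - scores: numpy array, shape (n_features,)
      Fisher Scores for each feature.
    """
    # Number of samples for each class
    classes, class_counts = np.unique(y, return_counts=True)
    n_classes = len(classes)
    mean_overall = np.mean(X, axis=0)  # reconstructed
    S_W = np.zeros(X.shape[1])  # within-class scatter (reconstructed)
    S_B = np.zeros(X.shape[1])  # between-class scatter (reconstructed)
    for i in range(n_classes):  # reconstructed
        X_class = X[y == classes[i]]  # reconstructed
        mean_class = np.mean(X_class, axis=0)
        S_W += ((X_class - mean_class)**2).sum(axis=0)
        S_B += class_counts[i] * ((mean_class - mean_overall)**2)
    return S_B / S_W  # reconstructed: between-class over within-class scatter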
# Example usage
# Assuming X is your feature matrix and y is your target vector
fisher_scores = fisher_score(X, y)
# If you want to rank features based on Fisher Score
ranking = np.argsort(fisher_scores)[::-1]
print("Features ranked by Fisher Score:")
for rank in ranking:
    print(f"Feature {rank} Score: {fisher_scores[rank]}")
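Once features are ranked, feature selection amounts to keeping the columns with the highest scores. A minimal sketch of that step follows. The helper name select_top_k_features and the cut-off k are assumptions for illustration, since the report does not state how many features are retained.

```python
import numpy as np


def select_top_k_features(X, scores, k):
    """Keep the k columns of X with the highest scores (illustrative sketch).

    Returns the reduced matrix and the indices of the kept columns.
    """
    scores = np.asarray(scores, dtype=float)
    # Undefined scores (for example 0/0 for a constant feature) rank last.
    scores = np.where(np.isnan(scores), -np.inf, scores)
    top_idx = np.argsort(scores)[::-1][:k]
    return X[:, top_idx], top_idx


# Toy example: 3 samples, 4 features, hypothetical scores.
X_demo = np.arange(12).reshape(3, 4)
X_selected, kept = select_top_k_features(X_demo, [0.1, 2.0, np.nan, 0.5], k=2)
print(kept)
```

The same column indices should be applied to the training and test matrices so that both models see an identical feature subset.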
#Fitting the models
def fit_models(models, X, y):
    for model in models:
        model.fit(X, y)
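The layer-wise code in this chapter also calls a helper named generate_meta_features, whose definition is not reproduced in this listing. A minimal sketch of what such a helper might look like is given here, assuming a binary task and that every fitted model exposes predict_proba. This is an illustration of the general stacking idea, not the project's actual implementation.

```python
import numpy as np


def generate_meta_features(models, X):
    """Stack each model's positive-class probability as one column.

    Assumes every model is already fitted and implements predict_proba,
    returning an array of shape (n_samples, 2) for a binary task.
    """
    columns = []
    for model in models:
        proba = model.predict_proba(X)
        columns.append(proba[:, 1])  # probability of the positive class
    return np.column_stack(columns)  # shape (n_samples, n_models)
```

One design point worth noting: producing training meta-features from models fitted on the same training rows can make the next layer overfit, which is why stacking implementations often use out-of-fold predictions for the training set instead.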
# Define the second layer models
models_layer2 = [
    MLPClassifier(max_iter=1000, random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
    xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
]
# Train second layer models using the meta-features from the first layer
fit_models(models_layer2, X_train_meta_1, y_train)
# Generate second layer meta-features
X_train_meta_2 = generate_meta_features(models_layer2, X_train_meta_1)
X_test_meta_2 = generate_meta_features(models_layer2, X_test_meta_1)
#Third layer
final_layer_model = xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
# Third layer training and final predictions
final_layer_model.fit(X_train_meta_2, y_train)
y_pred_final = final_layer_model.predict(X_test_meta_2)
Figure 5.12 Confusion matrix for the model
CHAPTER 6
EXPERIMENTATION RESULTS
Chapter 6 serves as a reflection on the essential lessons derived from our research
journey, highlighting the quest for improvement and innovation in ensemble learning
methodologies.
6.1 OBSERVATIONS
This chapter delves into the comparison of two powerful machine learning algorithms,
the Multilayer Stacked Ensemble Learning Model (MLSELM) and XGBoost, for the task
of phishing website detection. The MLSELM model, comprising three layers of
classifiers, outperformed the XGBoost model in terms of accuracy. Through
meticulous feature selection, the MLSELM achieved an impressive accuracy of 97%,
while the XGBoost model attained 86% accuracy.
The calculation of True Positives (NTP), True Negatives (NTN), False Positives (NFP),
and False Negatives (NFN) is outlined as follows:
- P: Total number of phishing instances
- N: Total number of legitimate instances
- NTN: Number of legitimate instances predicted as legitimate
- NFN: Number of phishing instances predicted as legitimate
- NTP: Number of phishing instances predicted as phishing
- NFP: Number of legitimate instances predicted as phishing
The computation of each metric is articulated as follows:
• Accuracy: Accuracy is the proportion of correctly classified cases (true positives
and true negatives) out of the total number of cases examined.
((NTP + NTN) / (P + N)) × 100
• Precision: Precision is the proportion of true positives out of the total number
of positive cases identified.
(NTP / (NTP + NFP)) × 100
• Recall: Recall is the proportion of true positives out of the total number of
positive cases in the dataset.
(NTP / (NTP + NFN)) × 100
• F-score: The F-score combines precision and recall into a single metric as their
harmonic mean.
2 × (Precision × Recall) / (Precision + Recall), with Precision and Recall
expressed as the percentages above
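The metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard F-score definition, the harmonic mean 2PR / (P + R). The function name and the counts in the usage example are hypothetical and are not results from this project.

```python
def classification_metrics(n_tp, n_tn, n_fp, n_fn):
    """Compute accuracy, precision, recall and F-score as percentages."""
    total = n_tp + n_tn + n_fp + n_fn  # equals P + N
    accuracy = (n_tp + n_tn) / total * 100
    # Guard against empty denominators when a class is never predicted or present.
    precision = n_tp / (n_tp + n_fp) * 100 if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) * 100 if (n_tp + n_fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(n_tp=90, n_tn=85, n_fp=15, n_fn=10))
```

Because precision and recall are already percentages, their harmonic mean is a percentage as well and needs no further scaling.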
The experimental setup involved training and evaluating the MLSELM and XGBoost
models on the same dataset of features relevant to phishing website detection, so that
both are assessed under identical conditions. Both models underwent feature selection
using Fisher’s Score, which also gives insight into the impact of this preprocessing
step on model performance. This comparative analysis highlights the relative
strengths and weaknesses of each approach and aids in selecting the most suitable
algorithm for phishing website detection tasks.
It is important to consider that the performance of these models may vary depending
on the specific dataset they are trained on and the types of phishing websites they
encounter.
6.2 INFERENCES
The superior performance of the MLSELM model can be attributed to its multilayer
stacked ensemble architecture, which leverages the collective intelligence of multiple
classifiers to make accurate predictions. By incorporating diverse base classifiers and
meta-learning techniques, MLSELM effectively captures the complex relationships
between features and the target variable, enhancing its discriminative power. In
contrast, while XGBoost is renowned for its scalability and efficiency, its performance
may be limited by its single-layer ensemble approach, which may struggle to capture
intricate patterns in the data.
CHAPTER 7
As a future enhancement, we envision giving our phishing website detection model
self-learning and self-updating capabilities. Using machine learning techniques that
support continual updating, the model would adapt autonomously to emerging threats,
continuously refining its predictive capabilities without the need for manual
intervention. Such an approach would keep protection current and relieve
administrators of the tedious task of manual updates.
REFERENCES
[1] Lizhen Tang; Qusay H. Mahmoud – (2023) "A Deep Learning-Based Framework
for Phishing Website Detection"
[2] Rasha Zieni, Luisa Masari, and Maria Carla Calzarossa – (2023) "Phishing or
Not Phishing? A Survey on the Detection of Phishing Websites"
[3] Yazan A. Al-Sarier, Victor Elijah Adeyemo, Abdullateef O. Balogun and Ammar
K. Alazzawi – (2020) "AI Meta-Learners and Extra-Trees Algorithm for the
Detection of Phishing Websites"
[6] D. Fumarola et al., (2019) "Phishing Detection Using Genetic Programming with
Human-Competitive Performance," in IEEE Transactions on Evolutionary
Computation, vol. 23, no. 3, pp. 390-403, doi: 10.1109/TEVC.2018.2885320.
[8] A. Nazir et al., (2017) "Machine Learning-Based Phishing Detection Using URL
and Website Content Features," in Computers & Security, vol. 68, pp. 126-140,
doi: 10.1016/j.cose.2017.04.003.
[9] J. Ma et al., (2016) "A Machine Learning-Based Approach for Detecting Phishing
URLs," in Journal of Computer and System Sciences, vol. 82, no. 8, pp. 1284-
1297, doi: 10.1016/j.jcss.2016.04.002.
[10] A. Kumar et al., (2015) "A Review of Machine Learning Approaches to Phishing
Detection," in Procedia Computer Science, vol. 48, pp. 96-104, doi:
10.1016/j.procs.2015.04.197.
[11] L. Liao et al., (2018) "Combating Phishing Using Trusted Features and Machine
Learning," in Information Sciences, vol. 423, pp. 85-102, doi:
10.1016/j.ins.2017.10.005.
[12] H. Y. Son et al., (2019) "Phishing Website Detection Using Machine Learning
and Features Extracted from Website Images," in Journal of Information
Processing Systems, vol. 15, no. 1, pp. 117-133, doi: 10.3745/JIPS.03.0104.
[15] C. Singh et al., (2020) "A Machine Learning Approach for Detecting Phishing
Websites Using Neural Network," in Journal of King Saud University - Computer
and Information Sciences, doi: 10.1016/j.jksuci.2020.07.001.
[17] G. Li et al., (2019) "Phishing Website Detection Based on URL Features Using
Machine Learning," in IEEE Access, vol. 7, pp. 131577-131588, doi:
10.1109/ACCESS.2019.2936143.
[18] S. A. Alqahtani et al., (2018) "A Novel Approach for Phishing Detection Based
on Ensemble Learning," in International Journal of Advanced Computer Science
and Applications, vol. 9, no. 10, pp. 308-316, doi:
10.14569/IJACSA.2018.091044.
[24] How Hackers do Phishing Attacks to hack your accounts - YouTube Video by
Tech Raj: https://www.youtube.com/watch?v=RNzMKEYi2_0
[25] Feature Selection Techniques Easily Explained - YouTube Video by Krish Naik:
https://www.youtube.com/watch?v=EqLBAmtKMnQ
[26] What is a Phishing Attack? – Article by IBM: https://www.ibm.com/topics/phishing
[27] How to Recognize and Avoid Phishing Scams – Article by Federal Trade
Commission: https://consumer.ftc.gov/articles/how-recognize-and-avoid-phishing-scams
[28] Phishing: Spot and report scam emails, texts, websites and calls – Article by
National Cybersecurity Centre: https://www.ncsc.gov.uk/collection/phishing-scams
[29] Multi-layer stacking ensemble learners for low footprint network intrusion
detection – Article by Springer Link:
https://link.springer.com/article/10.1007/s40747-022-00809-3
APPENDIX
A1 - SOURCE CODE
Base.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
</html>
Check.html
{% extends 'base.html' %}
{% block title %}Processing | Phishing Website Detector{% endblock %}
{% block content %}
<div class="card card--result color-bg-dark">
<img class="screenshot--target card-img rounded-0" alt="{{target}}" style="display:
none;">
<div class="screenshot--skeleton placeholder-glow">
<span class="placeholder"></span>
</div>
5">Continue to
website</button>
</div>
</div>
</div>
{% endblock %}
{% block scripts %}
{{ super() }}
<script>
function setTargetScreenshot() {
    var targetScreenshot = $(".screenshot--target");
    var skeletonScreenshot = $(".screenshot--skeleton");
    targetScreenshot.hide();
    skeletonScreenshot.show();
percentage <= 100 && percentage >= 0) {
    $(".ball-percent").append($("<span>").text(percentage.toLocaleString('en-US', {
        minimumFractionDigits: 0, maximumFractionDigits: 1,
    }) + "%"));
    $(".ball-water").css("top", waterLevel + "%");
    $(".content--area").find(".placeholder").removeClass("placeholder");
    $(".content--area").removeClass("placeholder-glow");
}
function phishedAlert() {
    Swal.fire({
        title: "Phished Website!!",
        text: "It's too dangerous to continue, hence we can't allow this action.",
        showCancelButton: true,
        showConfirmButton: false,
        showDenyButton: true,
        denyButtonText: 'Back to home'
    }).then((result) => {
        if (result.isDenied) {
            window.location = "{{ url_for('home') }}";
        }
    });
}
function invalidAlert(error) {
    Swal.fire({
        icon: 'error',
        title: 'Oops! Something went wrong!',
        text: error,
        input: 'url',
        showDenyButton: true,
        denyButtonText: 'Back to home',
        confirmButtonText: 'Check again',
        inputPlaceholder: 'Enter the URL',
        allowOutsideClick: false
    }).then((result) => {
        if (result.isConfirmed) {
            // Encode the URL so characters such as & or # survive in the query string
            window.location = `{{ url_for('check') }}?target=${encodeURIComponent(result.value)}`;
        } else if (result.isDenied) {
            window.location = "{{ url_for('home') }}";
        }
    });
}
$(function () {
    setTargetScreenshot();
    $(window).on('resize', function () {
        setTargetScreenshot();
    });
});
</script>
{% endblock %}
Home.html
<!DOCTYPE html>
</head>
<body>
<header>
<a href="/url-detector" class="link">Phishing Url Detector</a>
</header>
<!-- Footer -->
<!-- <div class='footer'>
</div> -->
</body>
</html>
Index.html
{% extends 'base.html' %}
{% block title %}Phishing Detector by Invaders{% endblock %}
{% block content %}
<section class="container min-vh-100">
<header>
<a href="/" class="link">Home Page</a>
</header>
</figure>
</div>
<form action="{{ url_for('check') }}" method="get" class="form--home input-group rounded">
<input type="url" id="target-url" name="target" class="form-control form-control-lg text-center border border-dark border-2 py-3" placeholder="http://phish-site.com/malicious-url" required />
</section>
{% endblock %}
{% block scripts %}
{{ super() }}
<script>
var eventSource = new EventSource("{{ url_for('listen') }}");
eventSource.addEventListener(
    "stats",
    function (e) {
        console.log(e.data);
        var data = JSON.parse(e.data);
        $("#web-visits").text(data.visits);
        $("#web-checked").text(data.checked);
        $("#web-phished").text(data.phished);
    },
    true
);
</script>
{% endblock %}
Result.html
<!DOCTYPE html>
type="text/css"
href="{{ url_for('static', filename='styles.css') }}"
/>
<script
src="https://kit.fontawesome.com/5f3f547070.js"
crossorigin="anonymous"
></script>
<link
href="https://fonts.googleapis.com/css2?family=Roboto&display=swap"
rel="stylesheet"
/>
</head>
<body>
<div class="results">
<h1>PREDICTION RESULT</h1>
{% if prediction==1 %}
<h2>
<span class="danger">Caution! Our system has flagged this message as a possible phishing attempt</span>
</h2>
<img class="image" src="{{ url_for('static', filename='unsafe-icon.png') }}" alt="SPAM Image" />
{% elif prediction==0 %}
<h2>
<span class="safe">Congratulations! This message is classified as SAFE</span>
</h2>
<img class="image" src="{{ url_for('static', filename='safety-icon.png') }}" alt="Not a spam image" />
{% endif %}
</div>
</body>
</html>
Content detector.py
# Imports inferred from the names used in this module
import pickle

from flask import Flask, render_template, request

# Load the Multinomial Naive Bayes model and CountVectorizer object from disk
filename = 'spam-sms-mnb-model.pkl'
classifier = pickle.load(open(filename, 'rb'))
cv = pickle.load(open('cv-transform.pkl', 'rb'))

app = Flask(__name__)


@app.route('/')
def home():
    return render_template('home.html')


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        # Vectorize the submitted text with the fitted CountVectorizer
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)


if __name__ == '__main__':
    app.run(debug=True)
Features.py
# Extraction of features from the URL
# 0 having_IP_Address
# 1 URL_Length
# 2 Shortining_Service
# 3 having_At_Symbol
# 4 double_slash_redirecting
# 5 Prefix_Suffix
# 6 having_Sub_Domain
# 7 URL_Depth
# 8 Domain_registeration_length
# 9 Favicon
# 10 port
# 11 HTTPS_token
# 12 Request_URL
# 13 URL_of_Anchor
# 14 Links_in_tags
# 15 SFH
# 16 Submitting_to_email
# 17 Abnormal_URL
# 18 Redirect
# 19 on_mouseover
# 20 RightClick
# 21 popUpWidnow
# 22 Iframe
# 23 age_of_domain
# 24 DNSRecord
# 25 web_traffic
# Each feature function above returns
# 1 if the URL is Phishing,
# -1 if the URL is Legitimate and
# 0 if the URL is Suspicious
# (URL_Depth is the exception: it returns the number of path segments)
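# Imports inferred from the names used in this module
import datetime
import ipaddress
import re
from urllib.parse import urlparse

import requests
import whois
from bs4 import BeautifulSoup
from dns import resolver


class FeatureExtraction:
    def __init__(self, url):
        self.url = url
        self.parsedurl = urlparse(self.url)
        self.domain = self.parsedurl.netloc
        # WHOIS lookup; None marks a failed lookup
        try:
            self.whois = whois.whois(self.domain)
        except:
            self.whois = None
        # Fetch and parse the page; None marks a failed request
        try:
            self.request = requests.get(self.url, timeout=5, headers={
                "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"})
            self.soup = BeautifulSoup(self.request.content, 'html.parser')
        except:
            self.request = None
            self.soup = None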
        self.shortening_services = \
            r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
            r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
            r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
            r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
            r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
            r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
            r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
            r"tr\.im|link\.zip\.net"
    def getFeaturesDict(self):
        return {
            "having_IP_Address": self.having_IP_Address(),
            "URL_Length": self.URL_Length(),
            "Shortining_Service": self.Shortining_Service(),
            "having_At_Symbol": self.having_At_Symbol(),
            "double_slash_redirecting": self.double_slash_redirecting(),
            "Prefix_Suffix": self.Prefix_Suffix(),
            "having_Sub_Domain": self.having_Sub_Domain(),
            "URL_Depth": self.URL_Depth(),
            "Domain_registeration_length": self.Domain_registeration_length(),
            "Favicon": self.Favicon(),
            "port": self.port(),
            "HTTPS_token": self.HTTPS_token(),
            "Request_URL": self.Request_URL(),
            "URL_of_Anchor": self.URL_of_Anchor(),
            "Links_in_tags": self.Links_in_tags(),
            "SFH": self.SFH(),
            "Submitting_to_email": self.Submitting_to_email(),
            "Abnormal_URL": self.Abnormal_URL(),
            "Redirect": self.Redirect(),
            "on_mouseover": self.on_mouseover(),
            "RightClick": self.RightClick(),
            "popUpWidnow": self.popUpWidnow(),
            "Iframe": self.Iframe(),
            "age_of_domain": self.age_of_domain(),
            "DNSRecord": self.DNSRecord(),
            "web_traffic": self.web_traffic()
        }
    def having_IP_Address(self):
        try:
            ipaddress.ip_address(self.domain)
            return 1
        except:
            return -1
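The check above passes the URL's netloc to ipaddress.ip_address. A netloc keeps any ":port" suffix, which ip_address rejects, so an address such as 192.168.0.1:8080 would be rated legitimate. The following is a minimal sketch, not the project's code, that tests the hostname instead; the function name host_is_ip is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(url):
    """Return 1 if the URL's host is a literal IP address, else -1."""
    # hostname drops any port and the brackets around IPv6 literals
    host = urlparse(url).hostname
    if not host:
        return -1
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return -1


print(host_is_ip("http://192.168.0.1:8080/login"))
print(host_is_ip("https://example.com/"))
```

Catching only ValueError keeps unrelated errors visible instead of silently mapping them to "legitimate".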
equal 54 characters then the URL is classified as phishing, otherwise legitimate. As
implemented below, the value assigned to this feature is -1 (legitimate) if the URL is
shorter than 54 characters, 0 (suspicious) if its length is between 54 and 75, and 1
(phishing) if it is longer than 75.
"""
    def URL_Length(self):
        if len(self.url) < 54:
            return -1
        elif len(self.url) >= 54 and len(self.url) <= 75:
            return 0
        else:
            return 1
"""#### ** Using URL Shortening Services “TinyURL” **
URL shortening is a method on the “World Wide Web” in which a URL may be
made considerably smaller in length and still lead to the required webpage. This is
accomplished by means of an “HTTP Redirect” on a domain name that is short,
which links to the webpage that has a long URL.
If the URL is using Shortening Services, the value assigned to this feature is 1
(phishing) or else -1 (legitimate).
"""
    def Shortining_Service(self):
        if re.search(self.shortening_services, self.url):
            return 1
        else:
            return -1
    def having_At_Symbol(self):
        if '@' in self.url:
            return 1
        else:
            return -1

    def double_slash_redirecting(self):
        if re.search(r'https?://[^\s]*//', self.url):
            return 1
        else:
            return -1

    def Prefix_Suffix(self):
        if '-' in self.domain:
            return 1
        else:
            return -1
"""#### ** SubDomains **
The feature counts the dots in the domain: at most 2 gives -1 (legitimate), exactly 3
gives 0 (suspicious), and more than 3 gives 1 (phishing).
"""
    def having_Sub_Domain(self):
        count = self.domain.count('.')
        if count <= 2:
            return -1
        elif count > 2 and count <= 3:
            return 0
        else:
            return 1
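As a worked example of the dot-count thresholds used above, the sketch below applies the same rule to a few sample domains. The function name subdomain_feature and the sample domains are illustrative only. Note that a country-code domain such as www.example.co.uk already contains three dots and is therefore rated suspicious by this rule.

```python
def subdomain_feature(domain):
    """Map the number of dots in a domain to -1, 0 or 1."""
    dots = domain.count(".")
    if dots <= 2:
        return -1  # legitimate
    if dots == 3:
        return 0   # suspicious
    return 1       # phishing


for name in ("www.example.com", "login.www.example.com", "a.b.c.example.com"):
    print(name, subdomain_feature(name))
```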
    def URL_Depth(self):
        depth = 0
        subdirs = self.parsedurl.path.split('/')
        for subdir in subdirs:
            if subdir:
                depth += 1
        return depth
& current time. The end period considered for the legitimate domain is 6 months or
more for this project.
If end period of domain < 6 months, the value of this feature is 1 (phishing) else -1
(legitimate).
"""
    def Domain_registeration_length(self):
        if self.whois is None:
            return 1
        try:
            if type(self.whois['expiration_date']) is list:
                expiration_date = self.whois['expiration_date'][0]
            else:
                expiration_date = self.whois['expiration_date']
            registration_length = abs(
                (expiration_date - datetime.datetime.now()).days)
            if registration_length / 30 >= 6:
                return -1
            else:
                return 1
        except:
            return 1
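The six-month rule can be tried without a WHOIS lookup by passing the dates in directly. The sketch below is an illustration under stated assumptions, not the project's code: the name registration_feature and the now parameter are hypothetical. Unlike the listing, it does not take the absolute value of the difference, so a domain that has already expired is rated as phishing rather than long-lived.

```python
import datetime


def registration_feature(expiration_date, now=None):
    """Return -1 (legitimate) if the domain expires in 6+ months, else 1."""
    now = now or datetime.datetime.now()
    # WHOIS libraries may return several expiration dates; use the first
    if isinstance(expiration_date, list):
        expiration_date = expiration_date[0] if expiration_date else None
    if expiration_date is None:
        return 1
    days_left = (expiration_date - now).days
    # Same approximation as the listing: one month is taken as 30 days
    return -1 if days_left / 30 >= 6 else 1


today = datetime.datetime(2024, 1, 1)
print(registration_feature(datetime.datetime(2025, 1, 1), now=today))
print(registration_feature(datetime.datetime(2024, 3, 1), now=today))
```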
"""#### ** Favicon **
Checks for the presence of a favicon in the website, which can be used as a feature
to detect phishing websites.
If the website has a favicon, the value assigned to this feature is -1 (legitimate);
if none is found or the page could not be fetched, it is 1 (phishing).
"""
    def Favicon(self):
        try:
            if re.findall(r'favicon', self.soup.text) or \
                    self.soup.find('link', rel='shortcut icon') or \
                    self.soup.find('link', rel='icon'):
                return -1
            else:
                return 1
        except:
            return 1
    def port(self):
        if self.parsedurl.port:
            return 1
        else:
            return -1

    def HTTPS_token(self):
        if 'https' in self.domain:
            return 1
        else:
            return -1
"""### ** Request_URL **
The fine line that distinguishes phishing websites from legitimate ones is how
many times a website has been redirected. In our dataset, we find that legitimate
websites have been redirected one time max. On the other hand, phishing websites
containing this feature have been redirected at least 4 times.
"""
"""#### ** URL_of_Anchor **
Checks the fetched page for “<a>” tags that carry an href attribute.
If the page has no such anchors, or the page could not be fetched, the value assigned
to this feature is 1 (phishing) or else -1 (legitimate).
"""
    def URL_of_Anchor(self):
        try:
            count = 0
            for i in self.soup.find_all('a'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except:
            return 1
"""#### ** Links_in_tags **
Checks the fetched page for “<link>” tags that carry an href attribute.
If the page has no such tags, or the page could not be fetched, the value assigned to
this feature is 1 (phishing) or else -1 (legitimate).
"""
    def Links_in_tags(self):
        try:
            count = 0
            for i in self.soup.find_all('link'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except:
            return 1
"""#### ** SFH **
The presence of a “<form>” HTML tag in the fetched page is treated as an indicator of
phishing websites. This feature checks the page for a “<form>” tag.
If the page has a “<form>” tag, the value assigned to this feature is 1 (phishing) or else
-1 (legitimate); if the page could not be fetched, it is 0 (suspicious).
"""
    def SFH(self):
        try:
            if self.soup.find('form'):
                return 1
            else:
                return -1
        except:
            return 0
"""#### ** Submitting_to_email **
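The presence of “mailto:” in the page's links or form actions is treated as an indicator
of phishing websites. This feature checks the fetched page for “mailto:” targets.
If the page has a “mailto:” target, the value assigned to this feature is 1 (phishing) or
else -1 (legitimate); if the page could not be fetched, it is 0 (suspicious).
"""
    def Submitting_to_email(self):
        try:
            # soup.find('mailto:') would look for a tag named "mailto:" and never
            # match, so search link and form targets instead
            mailto = re.compile(r'^\s*mailto:', re.I)
            if self.soup.find('a', href=mailto) or self.soup.find('form', action=mailto):
                return 1
            else:
                return -1
        except:
            return 0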
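The mailto check can also be illustrated with the standard library alone. The sketch below is not the project's code: it uses html.parser instead of BeautifulSoup, and the names MailtoFinder and submitting_to_email are hypothetical. It flags any href or action attribute whose value starts with "mailto:".

```python
from html.parser import HTMLParser


class MailtoFinder(HTMLParser):
    """Flags pages whose links or form actions point at a mailto: address."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        for name, value in attrs:
            if name in ("href", "action") and value:
                if value.strip().lower().startswith("mailto:"):
                    self.found = True


def submitting_to_email(html):
    """Return 1 (phishing) if a mailto: target is present, else -1."""
    finder = MailtoFinder()
    finder.feed(html)
    return 1 if finder.found else -1


print(submitting_to_email('<form action="mailto:someone@example.com"></form>'))
print(submitting_to_email('<form action="/login"></form>'))
```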
"""#### ** Abnormal_URL **
The presence of script-related keywords (such as “script”, “javascript” or “onclick”) in
the URL is treated as an indicator of phishing websites. This feature searches the URL
string for these keywords.
If the URL contains any of them, the value assigned to this feature is 1 (phishing) or
else -1 (legitimate).
"""
    def Abnormal_URL(self):
        try:
            if re.findall(r'script|javascript|alert|onmouseover|onload|onerror|onclick|onmouse',
                          self.url):
                return 1
            else:
                return -1
        except:
            return -1
"""#### ** Redirect **
The presence of a “<meta>” refresh tag in the fetched page is treated as an indicator of
phishing websites. This feature checks the page for a “<meta>” tag whose http-equiv
attribute is "refresh".
If the page has such a tag, the value assigned to this feature is 1 (phishing) or
else -1 (legitimate).
"""
    def Redirect(self):
        try:
            if self.soup.find('meta', attrs={'http-equiv': 'refresh'}):
                return 1
            else:
                return -1
        except:
            return -1
    def popUpWidnow(self):
        try:
            if re.findall(r"alert\(|onMouseOver|window.open", self.soup.text):
                return 1
            else:
                return -1
        except:
            return -1

    def DNSRecord(self):
        try:
            resolver.resolve(self.domain, 'A')
            return -1
        except:
            return 1
a short period of time, they may not be recognized by the Alexa database (Alexa the
Web Information Company., 1996). By reviewing our dataset, we find that in worst
scenarios, legitimate websites ranked among the top 100,000. Furthermore, if the
domain has no traffic or is not recognized by the Alexa database, it is classified as
“Phishing”.
If the rank of the domain < 100000, the value of this feature is -1 (legitimate) else 1
(phishing).
"""
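    def web_traffic(self):
        # Any failure (no response, no rank in the reply) is treated as phishing.
        # Note: the Alexa ranking service was retired in 2022, so this request
        # may no longer return a rank.
        try:
            alexadata = BeautifulSoup(requests.get(
                "http://data.alexa.com/data?cli=10&dat=s&url=" + self.domain,
                timeout=10).content, 'lxml')
            rank = int(alexadata.find('reach')['rank'])
            if rank < 100000:
                return -1
            else:
                return 1
        except:
            return 1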
h2i = Html2Image()
h2i.output_path = screenshot_dir
    try:
        update_stats('checked')
        target = urlparse(target_url)
        features_obj = FeatureExtraction(target_url)
        x = pd.DataFrame.from_dict(features_obj.getFeaturesDict(), orient='index').T
        pred_prob = model.predict_proba(x)[0]
        safe_prob = pred_prob[0]
        unsafe_prob = pred_prob[1]
        if pred == 1:
            update_stats('phished')
        return dict(
            status=True,
            domain=target.netloc,
            target=target_url,
            safe_percentage=safe_prob*100,
            unsafe_percentage=unsafe_prob*100
        )
    except Exception as e:
        return dict(status=False, message=str(e))
def get_stats(key=None):
    stats = {}
    if os.path.exists(stats_filename):
        with open(stats_filename, "r") as file:
            for line in file:
                (k, v) = line.split(":")
                stats[k] = int(v)
        return stats
    return False
def update_stats(key):
    stats = get_stats()
    with open("stats.txt", "w+") as file:
        if stats is False:
            file.write('\n'.join([f"{x}:0" for x in stats_params]))
        else:
            lines = []
            avail_params = list(stats_params)
            for k, v in stats.items():
                avail_params.remove(k)
                if k == key:
                    v += 1
                lines.append(f"{k}:{v}")
            if len(avail_params) > 0:
                for param in avail_params:
                    lines.append(f"{param}:0")
            file.write("\n".join(lines))
        file.flush()
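The two helpers above keep the counters in a plain text file with one key:value pair per line. The following sketch shows that round trip in a self-contained form. It is an illustration, not the project's code: the names read_stats, bump_stat and STATS_PARAMS are hypothetical, the counter names are taken from the /listen route, and the file path is passed explicitly. As far as the listing shows, update_stats writes all counters as zero when the file does not exist yet, so the first event is not counted; the sketch counts it.

```python
import os
import tempfile

# Counter names assumed from the keys used by the /listen route
STATS_PARAMS = ("visits", "checked", "phished")


def read_stats(path):
    """Parse 'key:value' lines into a dict; a missing file gives all zeros."""
    stats = {name: 0 for name in STATS_PARAMS}
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                if ":" in line:
                    key, value = line.strip().split(":", 1)
                    stats[key] = int(value)
    return stats


def bump_stat(path, key):
    """Increment one counter and rewrite the whole file."""
    stats = read_stats(path)
    stats[key] = stats.get(key, 0) + 1
    with open(path, "w") as fh:
        fh.write("\n".join(f"{k}:{v}" for k, v in stats.items()))
    return stats


demo_path = os.path.join(tempfile.mkdtemp(), "stats.txt")
bump_stat(demo_path, "visits")
bump_stat(demo_path, "visits")
print(read_stats(demo_path))
```

Rewriting the whole file on every update is simple but not safe under concurrent requests; a lock or a small database would be needed for that.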
Main.py
from flask import Flask
from url_detector import *
from content_detector import *

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)
@app.route("/listen")
def listen():
    def respond_to_client():
        while True:
            stats = get_stats()
            _data = json.dumps(
                {"visits": stats['visits'], "checked": stats['checked'], "phished": stats['phished']})
            yield f"id: 1\ndata: {_data}\nevent: stats\n\n"
            time.sleep(0.5)
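The generator in the /listen route emits Server-Sent Events frames: a few "field: value" lines followed by a blank line. The sketch below shows that framing on its own, without Flask. The names sse_frame and stats_stream are hypothetical. In a Flask view, such a generator is typically returned wrapped as Response(generator, mimetype="text/event-stream"); the listing as shown does not include that return statement, so this is an assumption about how the route is completed.

```python
import json


def sse_frame(payload, event="stats", event_id=1):
    """Format one Server-Sent Events frame; the blank line ends the frame."""
    data = json.dumps(payload)
    return f"id: {event_id}\nevent: {event}\ndata: {data}\n\n"


def stats_stream(snapshots):
    """Yield one frame per stats snapshot (any iterable of dicts)."""
    for snapshot in snapshots:
        yield sse_frame(snapshot)


frames = list(stats_stream([{"visits": 3, "checked": 1, "phished": 0}]))
print(frames[0])
```

On the browser side, the event name is what the page's addEventListener("stats", ...) call subscribes to.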
@app.route("/screenshot")
def screenshot():
    query = request.args
    if query and query.get("target"):
        target_url = query.get("target")
        today_date = date.today()
        width = default_screenshot_width
        height = default_screenshot_height
        ss_file_name = secure_filename(f"{target_url}-{today_date}-{width}x{height}.png")
        ss_file_path = os.path.join(screenshot_dir, ss_file_name)
        if os.path.exists(ss_file_path):
            return send_from_directory(screenshot_dir, path=ss_file_name)


if __name__ == '__main__':
    app.run(debug=True)
app = Flask(__name__)
app.secret_key = os.urandom(12).hex()


@app.route('/')
def home():
    update_stats('visits')
    return render_template("index.html")


@app.route('/check', methods=['GET', 'POST'])
def check():
    update_stats('visits')
    if request.method == "POST":
        target_url = request.json['target']
        result = get_phishing_result(target_url=target_url)
        return jsonify(result)
    target_url = request.args.get("target")
    return render_template('check.html', target=target_url)
@app.route("/listen")
def listen():
    def respond_to_client():
        while True:
            stats = get_stats()
            _data = json.dumps(
                {"visits": stats['visits'], "checked": stats['checked'], "phished": stats['phished']})
            yield f"id: 1\ndata: {_data}\nevent: stats\n\n"
            time.sleep(0.5)
@app.route("/screenshot")
def screenshot():
    query = request.args
    if query and query.get("target"):
        target_url = query.get("target")
        today_date = date.today()
        width = default_screenshot_width
        height = default_screenshot_height
        ss_file_name = secure_filename(f"{target_url}-{today_date}-{width}x{height}.png")
        ss_file_path = os.path.join(screenshot_dir, ss_file_name)
        if os.path.exists(ss_file_path):
            return send_from_directory(screenshot_dir, path=ss_file_name)


if __name__ == '__main__':
    app.run(debug=True)
A2 – SCREENSHOTS
Figure A2.2 URL Detector page of the website
This image also shows the number of website visits, the total number of websites
checked, and the total number of phishing websites detected out of the total websites
checked.
Figure A2.5 Invalid phishing URL has been pasted