BCSE497J Project I Report - Sparsh (1)

The document outlines a project focused on developing a hybrid AI-based phishing attack detection system for financial transactions, combining machine learning and deep learning techniques to enhance detection accuracy. It details the project's objectives, methodologies, and the significance of addressing evolving phishing threats in the digital finance landscape. The study aims to improve cybersecurity defenses by integrating advanced detection mechanisms into financial security systems.


B.Tech.

BCSE497J - Project-I

PHISHING ATTACK DETECTION IN FINANCIAL
TRANSACTIONS USING AI

Submitted in partial fulfillment of the requirements for the degree of

Bachelor of Technology
in
Computer Science Engineering
by

21BCE2360 Sparsh Indurkar


21BCE2392 Jahnavi Gupta

Under the Supervision of


Professor Santhi H
Associate Professor Sr.
School of Computer Science and Engineering (SCOPE)

November 2024
DECLARATION

I hereby declare that the project entitled Phishing Attack Detection in Financial
Transactions using AI submitted by me, for the award of the degree of Bachelor of
Technology in Computer Science and Engineering to VIT is a record of bonafide work
carried out by me under the supervision of Prof. Santhi H.
I further declare that the work reported in this project has not been submitted and will not
be submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.

Place : Vellore

Date : 13/11/24
Signature of the Candidate

CERTIFICATE

This is to certify that the project entitled Phishing Attack Detection in Financial
Transactions using AI submitted by SPARSH INDURKAR (21BCE2360) , School of
Computer Science and Engineering, VIT, for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, is a record of bonafide work carried out
by him / her under my supervision during Fall Semester 2024-2025, as per the VIT code of
academic and research ethics.

The contents of this report have not been submitted and will not be submitted either
in part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The project fulfills the requirements and regulations of the University
and in my opinion meets the necessary standards for submission.

Place : Vellore
Date : 13/11/24

Signature of the Guide

Examiner(s)

Dr. K.S Umadevi


Computer Science Engineering

ACKNOWLEDGEMENTS

I am deeply grateful to the management of Vellore Institute of Technology (VIT) for providing
me with the opportunity and resources to undertake this project. Their commitment to fostering a
conducive learning environment has been instrumental in my academic journey. The support and
infrastructure provided by VIT have enabled me to explore and develop my ideas to their fullest
potential.

My sincere thanks to Dr. Ramesh Babu K, the Dean of the School of Computer Science and
Engineering (SCOPE), for his unwavering support and encouragement. His leadership and vision
have greatly inspired me to strive for excellence. The Dean’s dedication to academic excellence
and innovation has been a constant source of motivation for me. I appreciate his efforts in creating
an environment that nurtures creativity and critical thinking.

I express my profound appreciation to Dr. K.S Umadevi, the Head of the Department of Software
Systems, for her insightful guidance and continuous support. Her expertise and advice have
been crucial in shaping the direction of my project. The Head of Department’s commitment to
fostering a collaborative and supportive atmosphere has greatly enhanced my learning experience.
Her constructive feedback and encouragement have been invaluable in overcoming challenges and
achieving my project goals.

I am immensely thankful to my project supervisor, Prof. Santhi H for her dedicated mentorship
and invaluable feedback. Her patience, knowledge, and encouragement have been pivotal in the
successful completion of this project. My supervisor’s willingness to share her expertise and
provide thoughtful guidance has been instrumental in refining my ideas and methodologies. Her
support has not only contributed to the success of this project but has also enriched my overall
academic experience.

Thank you all for your contributions and support.

Name of the Candidate

Sparsh Indurkar

TABLE OF CONTENTS

Sl.No Contents Page No.


Abstract ix
1. INTRODUCTION 1
1.1 Background 1
1.2 Motivations 1
1.3 Scope of the Project 2

2. PROJECT DESCRIPTION AND GOALS


2.1 Literature Review 3
2.2 Research Gap 16
2.3 Objectives 17
2.4 Problem Statement 17
2.5 Project Plan 18
3. TECHNICAL SPECIFICATION
3.1 Requirements 19
3.1.1 Functional
3.1.2 Non-Functional
3.2 Feasibility Study 19
3.2.1 Technical Feasibility
3.2.2 Economic Feasibility
3.2.3 Social Feasibility
3.3 System Specification 20
3.3.1 Hardware Specification
3.3.2 Software Specification
4. DESIGN APPROACH AND DETAILS
4.1 System Architecture 21
4.2 Design
4.2.1 Data Flow Diagram 23
4.2.2 Use Case Diagram 24
4.2.3 Class Diagram 24
4.2.4 Sequence Diagram 25

5. METHODOLOGY AND TESTING 26
6. PROJECT DEMONSTRATION 29
7. RESULT AND DISCUSSION 31
9. CONCLUSION 34
10. FUTURE WORKS 35
11. REFERENCES 36
APPENDIX A – SAMPLE CODE

List of Figures

Figure No. Title Page No.


1 Gantt Chart 17
2 Data Flow Diagram 20
3 Use Case Diagram 21
4 Class Diagram 22
5 Sequence Diagram 23
6 Random Forest Model Comparison 31
7 Training and Test Accuracy 32
8 Metrics Direct Values 33
9 Metrics Grouped Values 33
10 Metrics Direct Values for ANN 33
11 Multi-layer Perceptron Model Metrics Comparison 33

List of Abbreviations

2G Second Generation
AI Artificial Intelligence
ANN Artificial Neural Network
API Application Programming Interface
CNN Convolutional Neural Network
DL Deep Learning
DNS Domain Name System
DNN Deep Neural Network
FPR False Positive Rate
GPU Graphics Processing Unit
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
HTTPS Hypertext Transfer Protocol Secure
ID Identifier
IP Internet Protocol
KNN K-Nearest Neighbor
LSTM Long Short-Term Memory
ML Machine Learning
NLP Natural Language Processing
RBF Radial Basis Function
SVC Support Vector Classification
SVM Support Vector Machine
TLD Top-Level Domain
TPR True Positive Rate
URL Uniform Resource Locator

Symbols and Notations

α Learning rate
λ Regularization parameter
σ Activation function
μ Mean of a dataset or feature
Σ Summation symbol
∇ Gradient
Ψ Potential function

ABSTRACT

Phishing attacks have emerged as a serious cybersecurity threat, particularly in financial
transactions where attackers attempt to steal sensitive information via fraudulent URLs and
misleading messages. Traditional phishing detection methods, such as rule-based and blacklist
approaches, frequently fall short of detecting sophisticated and evolving phishing techniques. This
study presents a hybrid artificial intelligence (AI)-based phishing detection system that uses both
machine learning (ML) and deep learning (DL) techniques to improve detection accuracy for
financial applications.

The proposed system uses a machine learning Random Forest classifier to evaluate various URL
and email attributes, including URL length, the presence of suspicious or whitespace characters,
and risky top-level domains (TLDs), and classifies them as significant indicators of phishing
attempts. In parallel, an Artificial Neural Network (ANN) is used to analyse more complex
phishing patterns in messages and URLs using tokenisation and vectorisation. Both models are
subjected to hyperparameter optimisation, which involves adjusting parameters such as the
number of trees in the Random Forest and the number of hidden layers in the ANN to improve
detection performance.

The models' effectiveness is measured using accuracy, precision, recall, and the F1 score. After
comparing performance, the most accurate models are used in real-time phishing detection as part
of financial security systems. This hybrid approach improves detection accuracy while also
providing a scalable and dependable solution for identifying and mitigating phishing threats. By
integrating this system into financial transactions, the research contributes to stronger
cybersecurity defenses, protecting users against increasingly advanced phishing attacks.

1. INTRODUCTION

1.1 Background
Advance Fee Fraud (AFF) is a common type of scam in cyberspace that can cause significant
damage, particularly to the banking and finance industries. Such scams are opportunistic attacks that
attempt to gain access to personal information (such as login credentials, financial data, or even
a victim's identity) by impersonating a legitimate entity. They are carried out primarily through the use
of fake URLs or misleading text messages. Because money exchange is now mostly digital, this
risk has grown, and stronger cybersecurity solutions that detect more dangerous scams such as
phishing attacks are becoming increasingly important.

Historically, rule-based methods were popular for phishing detection. They worked well on
known threats but failed to spot new or sophisticated ones. Machine Learning (ML) and Deep
Learning (DL) are two powerful alternatives that make it possible to build a system that learns
patterns and improves over time. In this regard, using AI techniques to detect phishing is a
promising way to identify fraudulent transactions with higher accuracy and
efficiency.

1.2 Motivation
Phishing has become a growing concern within the field of financial fraud due to its widespread
impact on individuals and organizations. As these fraudulent operations become more
sophisticated, subtle new phishing trends emerge that are difficult for current security controls
to detect. Successful phishing attacks have many consequences, including financial loss, data loss,
and a loss of user trust.
That is why the goal of this research project is to develop a better, more accurate, and reliable
system for detecting and categorizing phishing attempts. The project seeks to address the
shortcomings of traditional anti-phishing systems by using smart technologies, particularly ML
and DL. The hybrid model combines Random Forest and ANN so that simple phishing features
are used together with features that require contextual reasoning to detect more complex
patterns, which increases user confidence. The proposed system will help mitigate the threat of phishing
attacks during web-based financial transactions, safeguard confidential information, and improve
overall security.

1.3 Scope of the Project
The project concentrates on developing a hybrid system for detecting phishing in
financial transactions by incorporating both Machine Learning and Deep Learning techniques.
The scope includes:

(1) Feature Extraction: Development of methodologies for extracting significant phishing-related
features from both URLs and text messages. It entails the analysis of attributes like URL
length, suspicious characters, and keywords in the text.

(2) Model Development: This involves implementing a Random Forest Classifier for the
machine learning-based detection and an Artificial Neural Network for deep learning-based
detection. The models are then trained and fine-tuned to ensure a high level of accuracy during
classification.

(3) Hyperparameter Tuning: Here, the Random Forest and ANN models are tuned for
optimal performance by varying hyperparameters such as the number of trees, the learning rate,
and the number of hidden layers.

(4) Assessment: The models are compared using standard performance metrics such as accuracy,
precision, recall, and F1 score to find out which one is best.

(5) Real-time Detection: The best-performing model is deployed as a service to be
embedded within the security systems of financial institutions for real-time phishing detection.

This project aims to enhance phishing detection accuracy, provide real-time security and reduce
phishing attacks on financial transactions with ensured scalability and adaptability to evolving
threats.
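
The feature-extraction step in the scope can be illustrated with a minimal sketch. It computes the URL attributes the report names (URL length, suspicious characters, whitespace, and risky top-level domains). The function name `extract_url_features`, the `RISKY_TLDS` set, and the `SUSPICIOUS_CHARS` set are assumptions made for illustration; the report does not list the specific characters or TLDs it uses.

```python
from urllib.parse import urlsplit

# Hypothetical examples of higher-risk TLDs; the report does not name specific ones.
RISKY_TLDS = {"tk", "xyz", "top"}
# Hypothetical set of characters treated as suspicious inside a URL.
SUSPICIOUS_CHARS = set("@%$!")


def extract_url_features(url: str) -> dict:
    """Return a small dictionary of lexical features for one URL."""
    host = urlsplit(url.strip()).hostname or ""
    # The TLD is taken as the text after the last dot of the host name.
    tld = host.rsplit(".", 1)[-1].lower() if "." in host else ""
    return {
        "url_length": len(url),
        "suspicious_char_count": sum(ch in SUSPICIOUS_CHARS for ch in url),
        "has_whitespace": any(ch.isspace() for ch in url),
        "risky_tld": tld in RISKY_TLDS,
    }


features = extract_url_features("http://secure-login.example.tk/verify?id=1@2")
```

A dictionary like this could be turned into one row of a feature matrix for a classifier such as Random Forest; how the report's own pipeline encodes these features is not stated here.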

2. PROJECT DESCRIPTION AND GOALS

2.1 Literature Review

A comprehensive survey of AI-enabled phishing attacks detection techniques

The literature survey in the paper explores the various AI-based techniques that have been
proposed for phishing attack detection. [15]

Deep Learning Approaches: The paper discusses how deep
learning models like deep neural networks, convolutional neural networks, and recurrent neural
networks have been applied to the phishing detection problem. These approaches aim to
automatically learn relevant features from raw data like URLs and web page content. However,
the performance of deep learning heavily depends on the selection of appropriate model
architectures and hyperparameters.

Machine Learning Approaches: A significant portion of the
literature has focused on applying traditional machine learning classifiers like decision trees,
random forests, SVMs, and ensemble methods for phishing detection. These approaches typically
involve extracting a set of features from URLs, web pages, or emails, and then training a model to
distinguish between legitimate and phishing instances. Feature engineering and selection are
critical for achieving high accuracy with these methods.

Hybrid Approaches: Some studies have
explored combining multiple machine learning techniques, such as using ensemble methods or
stacking classifiers, to further improve phishing detection performance. The intuition is that the
strengths of different models can be leveraged to achieve better overall results.

Scenario-based Approaches: A smaller set of studies have investigated phishing detection from a
more behavioural and psychological perspective. These works have looked at factors like the dark
triad personality traits of attackers, the impact of email legitimacy and influence on user
susceptibility, and the use of game-based training to improve user awareness.

The literature review highlights that while significant progress has been made in developing
AI-powered phishing detection solutions, there are still challenges around achieving high accuracy,
low false positives, scalability, and practical deployment. The paper concludes by outlining future
research directions in this domain, such as the need for more comprehensive and up-to-date
datasets, the development of smart browser plugins for real-time phishing detection, and the
exploration of novel AI techniques like reinforcement learning.

A Methodical Overview on Phishing Detection along with an Organized Way to
Construct an Anti-Phishing Framework
In this report, numerous anti-phishing tools that help protect against phishing websites are
analysed. Google Chrome, Mozilla Firefox, and Safari use the Google Safe Browsing (GSB) service,
which blocks a site if it is identified as phishing. Other tools like Netcraft, McAfee SiteAdvisor,
Avast, and Quick Heal are also in use. The Google Safe Browsing service uses a blacklist approach
to analyze a URL. Because the blacklist had not been updated, the Google Safe Browsing service
could not detect the phishing site.

Anti-virus software like Avast and Quick Heal also provides protection against online security
threats. The authors installed Avast anti-virus to check how it functions and found that the special
Avast browser for secure browsing was unable to detect a phishing URL that Netcraft or Google
Safe Browsing detected successfully. This points to a major need for an advanced anti-phishing
tool. It is also worth noting that these tools are installed separately. Thus, it is
very important to create awareness regarding these tools and phishing attacks.
The blacklisting technique for detecting spam turned out to be less useful for newly registered
domains. New fake websites can be created every day. If a fake URL is not present
in the blacklist, the website cannot be detected. The heuristics-based approach can fail when
no rule is present for a particular attribute. That attribute then remains undetected, so
all the relevant rules or heuristics need to be added to the system. According to the
paper, the heuristics approach and the hybrid approach perform best among the methods considered. [4]
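
The weakness of blacklisting for newly registered domains can be shown with a small sketch. It is only an illustration under stated assumptions: the `BLACKLIST` contents and the function name `is_blacklisted` are hypothetical, and real services such as Google Safe Browsing work differently in detail.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist containing one previously reported phishing host.
BLACKLIST = {"known-phish.example"}


def is_blacklisted(url: str, blacklist=BLACKLIST) -> bool:
    """Return True only if the URL's host is already on the blacklist."""
    host = (urlsplit(url).hostname or "").lower()
    return host in blacklist


# A host that was reported earlier is caught.
caught = is_blacklisted("http://known-phish.example/login")
# A freshly registered host is missed until someone adds it to the list.
missed = is_blacklisted("http://new-phish.example/login")
```

The second lookup returns `False` even though the page could be just as malicious, which is the gap that heuristic and machine learning methods try to close.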

Although the accuracy of the heuristics approach is high, it may be unable to
analyze a newly added spam feature in a website, which can result in classifying the website as
legitimate. Thus, a heuristics-based approach that works on predefined thresholds is very
cumbersome. The authors of this paper experimented with various other machine learning algorithms
but found Random Forest to be the best. That said, the hybrid approach is also a good way to
design a model that gives better accuracy, as one dataset is trained using two algorithms.

A Novel Approach to Detect Phishing Attacks using Binary Visualisation and
Machine Learning

The paper discusses a wide variety of approaches that have been proposed to counter the ever-persistent
threat of phishing in both commercial and public domains. These approaches can be classified into
two main categories: user training approaches and software classification approaches. Training
approaches aim at raising the ability of end-users to identify phishing attacks, which could reduce
their susceptibility to falling victim to them. Classification approaches are
typically designed to classify phishing and legitimate web pages on behalf of the user in an attempt
to tackle issues of human error and ignorance. [2]

The proposed system in the paper consists of two stages, the learning stage and the detection
stage. In the first stage, the samples and the topological structure of the TensorFlow machine
learning model are built, while in the second stage the submitted URLs are tested against the samples
in the database to perform classification. It relies on visualizing scraped HTML files as 2D
images, which are then processed by the TensorFlow model, which analyses them against its training
modules to distinguish between legitimate and phishing websites.
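
The idea of turning scraped HTML into a 2D image can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes a plain row-major layout where each byte becomes one grayscale pixel, and the function name `bytes_to_grid`, the `width` value, and the zero padding are all assumptions. The paper's own visualisation scheme may map bytes to pixels differently.

```python
def bytes_to_grid(data: bytes, width: int = 16, pad: int = 0) -> list:
    """Lay out raw bytes row by row as a 2D grid of values in 0..255."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        # Pad the last row so that every row has the same length.
        row += [pad] * (width - len(row))
        rows.append(row)
    return rows


grid = bytes_to_grid(b"<html>", width=4)
# grid is [[60, 104, 116, 109], [108, 62, 0, 0]]
```

A grid like this could then be resized and passed to an image classifier; the resizing and the classifier are outside the scope of this sketch.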

URLs passed through the system are recorded in a database. Each URL submitted by the user
is first checked for duplicates, which improves the system's overall performance because it
avoids repeating the binary image production process, which is time-consuming. If the
submitted URL does not exist in the database, the system automatically scrapes the HTML
code from the corresponding website and stores it in a string format.
The automation of scraping the web page protects users from having to visit the potential phishing
page and removes the risk of droppers and browser exploits. According to the initial experimental
results, the method seems promising and is able to detect phishing attacks quickly and with high
accuracy. Moreover, the method learns from its misclassifications and improves its efficiency.

Artificial Intelligence based Cyber Security Threats Identification in Financial
Institutions Using Machine Learning Approach

Cybersecurity threat identification is crucial as digital threats increase. Using technologies like
malware detection, intrusion detection systems, AI, and ML, companies proactively detect and prevent
cyber threats. According to the paper, machine learning can be utilized to detect cybersecurity issues
more effectively than traditional methods, and the paper describes some potential use cases of this
technology. The proposed Cyber Machine Learning Approach (CMLA) is compared with the existing
NLP-Based Analysis (NLPBA), Genetic Algorithm based Techno-Economic Optimization (GATEO),
Load-Based Resource Utilization Algorithm (LRUBI), and Intelligent Deep Learning (IDL) methods. [1]

Financial institutions should employ robust encryption technologies to protect customer data
during transactions. This should include using encrypted communication channels, as well as
encryption of data stored on servers or in databases. Common encryption algorithms used in robust
encryption include Advanced Encryption Standard (AES), RSA, and Twofish. Multi-factor
authentication can also help to further secure customer data by requiring users to provide multiple
pieces of evidence, such as a password and a security code, to gain access to the system. The
institutions should monitor user activity for any suspicious activity or attempts to gain
unauthorized access. They should also have procedures in place to quickly identify and address
any potential security breaches. Regular security audits should be carried out to identify any
vulnerabilities or weaknesses in the system. Audits should also be conducted to ensure that all
security measures are up to date and properly implemented.

There are a number of important steps that need to be executed to secure financial transactions.
The first is to ensure that any online accounts or services used to store financial information are
secure. This means using strong passwords, two-factor authentication, and keeping up-to-date with
security patches and updates. Additionally, it is important to be aware of phishing scams, which
use email and other methods to trick people into giving away their financial information. Another
important step for financial transaction security is to use trusted payment systems whenever
possible. This means making sure that any online payments are made through reputable services,
such as PayPal or Apple Pay. [1]
Additionally, it is important to use secure methods of payment, such as credit cards or online
wallets, to make sure that the money is not stolen or misused. It is important to be aware of one’s
own financial situation and to monitor accounts for any suspicious activity. This means regularly
checking account balances and transactions to make sure that nothing unusual is happening.
Additionally, it is important to report any suspicious activity to the authorities in order to prevent
any further theft or misuse of funds.

Boosting the Phishing Detection Performance by Semantic Analysis

The paper discusses three main approaches for phishing detection:
1. Blacklist-based methods: These maintain a list of known phishing websites to determine if
a new website is a phishing site. The main limitation is that they cannot detect zero-hour
phishing attacks that have not been added to the blacklist yet. [3]
2. Heuristic methods: These identify phishing websites using one or more features extracted
from known phishing attacks, such as URL-based features, HTML features, visual
similarity features, and third-party service features. Examples include the CANTINA and
CANTINA+ methods. The main issues are that the heuristic rules are difficult to update
and may also flag some legitimate websites as phishing.
3. Machine learning based methods: These treat phishing detection as a classification problem
and use machine learning algorithms to learn patterns from extracted statistical features of
websites. The key is to capture the inherent patterns of phishing sites across different
dimensions like URL, HTML, visual similarity, and third-party services. However, these
methods have largely ignored the semantic information of web pages, which is an
important aspect of phishing.
The paper proposes to extract semantic features using word embeddings and fuse them with the
statistical features to build a more robust phishing detection model. This is presented as a novel
approach compared to prior work.
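
The fusion of semantic and statistical features can be illustrated with a short sketch. It assumes the simplest possible fusion, concatenating the statistical feature vector with the mean of the word embeddings of the page's tokens. The tiny `TOY_EMBEDDINGS` table, its two dimensions, and the function names are hypothetical; the paper's actual embedding model and fusion strategy are not described here.

```python
# Hypothetical 2-dimensional embeddings, used only to keep the sketch self-contained.
TOY_EMBEDDINGS = {
    "verify": [0.9, 0.1],
    "account": [0.8, 0.3],
    "weather": [0.1, 0.9],
}


def mean_embedding(tokens, embeddings, dim):
    """Average the embeddings of the known tokens; zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fuse_features(statistical, tokens, embeddings=TOY_EMBEDDINGS, dim=2):
    """Concatenate statistical features with the averaged semantic vector."""
    return list(statistical) + mean_embedding(tokens, embeddings, dim)


# Two statistical features (for example URL length and a symbol count)
# followed by two semantic dimensions.
fused = fuse_features([44.0, 1.0], ["verify", "account", "unknown"])
```

The fused vector can be fed to any standard classifier; tokens missing from the embedding table are simply skipped in this sketch.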

Cyber Threats Classifications and Countermeasures in Banking and Financial Sector

The banking and financial sector has long been a prime target for cyber threats due to the sensitive
and critical nature of the information it handles. With the increasing digitization of banking
services, the sector has become more vulnerable to a wide range of cyber threats, including
malware, phishing, distributed denial-of-service (DDoS) attacks, and insider threats. Several
studies have examined the impact of cyber threats on the banking industry. Akinbowale conducted
a comprehensive survey of the literature on the effects of cybercrime on the banking sector,
highlighting the financial losses, reputational damage, and legal consequences that can result from
such attacks. Alzoubi and Ghelani further explored the specific cyber security threats facing
digital banking, such as unauthorized access, data breaches, and service disruptions. [7]

Researchers have also focused on classifying and understanding the nature of cyber threats in the
banking sector. Alkhalil provided a detailed analysis of phishing attacks, a common tactic used by
cybercriminals to target financial institutions. Akinbowale and Barrigar examined the threat of
mobile banking malware and the challenges in combating it. Salahdine and Kaabouch delved into
the social engineering techniques employed by threat actors to manipulate human behaviour and
gain unauthorized access to banking systems.
In addition to technical threats, researchers have also highlighted the importance of addressing
non-technical threats, such as insider threats and regulatory compliance issues. Kazi and Kathrine
explored the use of machine learning techniques to detect and mitigate banking malware and
network communication threats. Jakovljević and Al-Alawi and Al-Bassam investigated the factors
contributing to cybersecurity awareness and the challenges in the banking sector.

Several studies have also focused on the development of effective countermeasures to address
cyber threats in the banking industry. Dubois and Tatar discussed the importance of training and
test beds to better prepare banks against cyber threats. Ali and Lin and Wang highlighted the
growing cyber threat that has created financial and operational challenges for the global banking
industry. Somogyi and Nagy observed an increasing trend in the number of cyberattacks in the
banking industry, underscoring the need for robust information security measures.
Overall, the existing literature provides a comprehensive understanding of the cyber threat
landscape facing the banking and financial sector, the potential impacts of these threats, and the
various strategies and countermeasures that can be employed to mitigate the risks. However, the
rapidly evolving nature of cyber threats and the increasing sophistication of attack methods
continue to pose significant challenges for the sector. Ongoing research and collaboration between
industry, academia, and policymakers are crucial to developing more effective and proactive
approaches to cybersecurity in the banking and financial domain.

Deep Learning for Phishing Detection: Taxonomy, Current Challenges and Future
Directions

This paper provides a comprehensive systematic literature review (SLR) on the application of deep
learning (DL) techniques for phishing detection. The authors adopt a structured SLR approach to
analyze relevant studies and select 81 articles based on predefined criteria.

The paper first proposes a taxonomy of phishing detection methods, categorizing them into list-
based, heuristic-based, and visual similarity approaches. It then presents a taxonomy of DL
techniques used for phishing detection, including convolutional neural networks (CNNs), long
short-term memory (LSTMs), deep neural networks (DNNs), and others. The analysis shows that
LSTM and CNN are the most popular DL techniques, accounting for 34% and 30% of the reviewed
articles respectively.[5]

The authors then discuss the strengths and weaknesses of various DL algorithms in the context of
phishing detection. Key issues identified include the need for manual parameter tuning, long
training times, and suboptimal detection accuracy. The paper also highlights the importance of
choosing appropriate evaluation metrics, especially for imbalanced datasets. To further illustrate
the challenges, the authors conduct an empirical analysis comparing the performance of different
DL models on a phishing dataset. The results confirm that ensemble DL models generally
outperform individual models in terms of accuracy but require longer training times. Based on the
literature review and empirical findings, the paper proposes several future research directions.
These include exploring unsupervised and semi-supervised DL approaches, leveraging
multimodal and multi-task learning, and developing explainable DL models to provide better
inference justification. Integrating DL with big data technologies is also suggested to address the
computational cost issues.
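
The point about evaluation metrics on imbalanced datasets can be made concrete with a small sketch that computes accuracy, precision, recall, and F1 score from confusion-matrix counts. The function name `classification_metrics` and the example counts are assumptions chosen for illustration; they are not results from the reviewed paper.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical dataset: 1000 samples, only 10 of them phishing.
# A classifier that labels everything "legitimate" gets 990 true negatives
# and 10 false negatives.
metrics = classification_metrics(tp=0, fp=0, fn=10, tn=990)
# Accuracy is 0.99, yet recall and F1 are 0.0: no phishing sample was found.
```

This is why precision, recall, and F1 score are usually reported alongside accuracy when phishing samples are rare.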

Overall, this SLR provides a comprehensive overview of the state-of-the-art in DL-based phishing
detection, identifies key challenges, and outlines promising future research avenues. The detailed
analysis and empirical insights make this paper a valuable resource for researchers and
practitioners in the cybersecurity domain.

Design and Evaluation of a Real-Time URL Spam Filtering Service

The paper presents Monarch, a real-time system that crawls URLs submitted to web services and
determines if they are spam. The goal is to provide accurate, real-time spam detection for web
services like social networks and URL shorteners, which have become targets for scams, phishing,
and malware.[6]

The paper evaluates the viability of Monarch and the challenges in addressing the diversity of web
service spam. Key findings include:
- Spam targeting email differs significantly from spam targeting Twitter.
- Email spam has short-lived campaigns that quickly churn through domains, while
Twitter spam uses more persistent infrastructure.
- 90% of the features used to detect email spam and Twitter spam do not overlap, indicating
that they come from separate actors.
- Classifiers trained on one type of spam do not generalize well to the other.

Monarch's approach: It uses an instrumented web browser to collect a wide range of features
from URLs, including HTML content, redirects, plugins, etc. The proposed model employs a
distributed logistic regression classifier with L1-regularization to handle millions of features
and can process 638,000 URLs per day on a modest cloud infrastructure, with a median processing
time of 5.54 seconds per URL. The model achieved 90.78% accuracy with 0.87% false positives
when trained on 1.7 million spam URLs.
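
To show why L1-regularization suits a very large feature space, the following is a minimal single-machine sketch of L1-regularized logistic regression trained with proximal gradient descent (gradient step followed by soft-thresholding). It is not Monarch's distributed implementation; the function names, the toy data, the +1/-1 feature encoding, and the values of `lam`, `lr`, and `epochs` are all assumptions.

```python
import math


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink a weight toward zero; small weights become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=500):
    """Proximal gradient descent for logistic loss plus lam * |w|_1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the loss, then the L1 proximal (shrinkage) step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the bias is not regularized
    return w, b


# Toy data: feature 0 determines the label, feature 1 carries no information.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [1, 1, 0, 0]
w, b = train_l1_logistic(X, y)
# The uninformative feature keeps a weight of exactly 0.0.
```

Weights that stay at zero can be dropped from the model entirely, which is what makes this kind of penalty attractive when millions of features are collected per URL.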

Challenges identified in the paper include adversarial attacks: attackers can tune features to evade
detection, modify content after classification, or block the crawler.
To conclude, the paper presents a scalable, real-time spam detection system that highlights the
fundamental differences between email and social media spam, requiring distinct classification
approaches. It also identifies key challenges in building robust spam defenses in an adversarial
environment.

Email Phishing Detection and prevention by using data mining techniques

This paper explores technical phishing attacks that can come through various communication
channels like email, instant messaging, websites, and social media. It examines methods for
detecting suspicious text, phishing URLs, phishing websites, and malicious attachments.

The paper first provides background on social engineering attacks, which rely on manipulating
people rather than technical vulnerabilities. Phishing, spear phishing, and vishing are common
social engineering attack methods that try to trick users into revealing sensitive information or
downloading malware. The paper then delves into technical approaches for detecting and
preventing phishing attacks, focusing on those that come through email. One key approach is URL
analysis - extracting lexical features from URLs to identify fraudulent or hidden URLs associated
with phishing sites. Research has shown this can achieve over 97% accuracy in classifying URLs
as legitimate or phishing.[14]

Another approach is detecting phishing websites by analyzing their content and visual similarity
to legitimate sites. Techniques like PhishZoo and Cantina use visual profiles and text/image
features to achieve over 90% accuracy in identifying phishing sites. The paper also discusses
machine learning-based anti-phishing systems that analyze email behavior patterns to detect
phishing attempts. These systems, called MLAPT, use techniques like Support Vector Machines
to classify emails as phishing or legitimate based on features like URL, form, and keyword
analysis.

In addition, the paper covers filtering and blacklisting mechanisms that leverage protocols like
SMTP to detect forged sender addresses and block known phishing sources. Overall, the paper
emphasizes that there is no single solution to the phishing problem. A combination of techniques,
including lexical URL analysis, visual website profiling, and machine learning-based email
screening, is needed to effectively combat the evolving threat of phishing attacks. The 89%
success rate achieved in the authors' own phishing detection system demonstrates the potential of
these approaches.

Fresh-Phish: A Framework for Auto-Detection of Phishing Websites


The paper introduces a framework called "Fresh-Phish" for creating up-to-date machine learning
data for detecting phishing websites. Phishing attacks, where attackers try to trick users into
revealing sensitive information like passwords and credit card numbers, are a growing problem on
the internet. Existing datasets used for training machine learning models to detect phishing become
outdated quickly due to the dynamic nature of the web.

The Fresh-Phish framework uses 30 different features of websites to classify them as either
phishing or legitimate. These features are categorized into 5 groups: URL-based, DNS-based,
external statistics, HTML-based, and JavaScript-based. The framework crawls the top 6,000 Alexa
websites and 6,000 known phishing websites from Phishtank to create a labeled dataset.
The paper then evaluates the performance of several machine learning classifiers on the Fresh-
Phish dataset, including neural networks, support vector machines (SVMs), and linear classifiers.
The results show that the TensorFlow neural network using a GradientDescent optimizer achieves
the highest accuracy at around 90%, followed closely by the SVM with a Gaussian kernel. The
linear classifiers performed the worst.[11]

Feature importance analysis revealed the top 10 most important features for detecting phishing
websites, including URL length, use of URL shortening services, and presence of prefixes/suffixes
in the URL. The authors note that the Fresh-Phish framework is intended to be an extensible tool
that other researchers can use to generate up-to-date phishing detection datasets. They plan to
continue improving the framework by defining their own features beyond the ones used in prior
work and exploring additional machine learning techniques. Overall, the Fresh-Phish framework
provides a valuable resource for researchers working on automated detection of the ever-evolving
problem of phishing websites.

Phishing Attack Detection on Text Messages Using Machine Learning Techniques

This paper presents a phishing attack detection system for text messages (PADSTM) that uses
machine learning techniques to detect phishing attacks. Phishing is a prevalent type of social
engineering attack where attackers try to manipulate users into revealing sensitive information.
PADSTM focuses on detecting phishing in text messages by considering two key aspects -
verifying URLs against a blacklist, and analysing the content of the text messages using
customized keywords. The customized keywords are categorized into special tokens, currency
symbols, visual morphemes, mobile numbers, and URLs.
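
The two aspects described above, a blacklist check and keyword-category features, can be sketched
as follows. This is only an illustration and not the PADSTM implementation: the regular
expressions, the SPECIAL_TOKENS list, and the function name are assumptions, and visual morphemes
are omitted because their definition is not given here.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\b\d{10}\b")         # assumed: 10-digit mobile numbers
CURRENCY_RE = re.compile(r"[$€£₹]")          # assumed set of currency symbols
SPECIAL_TOKENS = {"win", "free", "urgent", "verify", "prize"}  # hypothetical keyword list

def message_features(text, blacklist):
    """Return keyword-category counts and a blacklist flag for one text message."""
    urls = URL_RE.findall(text)
    words = re.findall(r"[a-z]+", text.lower())
    return {
        # True if any URL in the message (minus trailing punctuation) is blacklisted.
        "blacklisted_url": any(u.lower().rstrip(".,!") in blacklist for u in urls),
        "url_count": len(urls),
        "currency_count": len(CURRENCY_RE.findall(text)),
        "phone_count": len(PHONE_RE.findall(text)),
        "special_token_count": sum(w in SPECIAL_TOKENS for w in words),
    }

message = "URGENT: you win ₹5000! Verify at http://bad.example/claim or call 9876543210"
features = message_features(message, blacklist={"http://bad.example/claim"})
```

The resulting dictionary can be turned into a numeric feature vector and passed to any of the
classifiers discussed next.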

The system uses four machine learning algorithms - Naive Bayes Classifier, Support Vector
Classification (SVC), Random Forest Classifier, and K-Nearest Neighbour (KNN) - to classify the
text messages as phished or not phished based on the extracted features. During the training phase,
the machine learning models are trained on a dataset of text messages labelled as phished or not
phished. In the testing phase, new incoming text messages are classified using the trained
models.[12]

Experimental results show that the Random Forest Classifier outperforms the other algorithms,
achieving the highest accuracy of 97.01% and F1-score of 98.29% in detecting phishing attacks.
This indicates that the Random Forest model is the most suitable for the proposed PADSTM
system.

The key advantages of PADSTM are its focus on customized keywords in text messages as
features, and its use of URL verification against a blacklist to enhance phishing detection. The
system can be used to detect phishing in various types of text messages like SMS, WhatsApp
messages, and social media posts. Future work could explore extending the system to handle text
messages in languages other than English. In summary, the PADSTM system leverages machine
learning and customized keyword analysis to provide an effective mechanism for detecting
phishing attacks in text messages, with the Random Forest Classifier demonstrating the best
performance.

Phishing Emails Detection Using CS-SVM

This paper proposes a Cuckoo Search SVM (CS-SVM) algorithm for detecting phishing emails.
Phishing attacks are a common online threat that can lead to financial losses through malware or
social engineering. Machine learning techniques, particularly Support Vector Machines (SVMs),
have been effective for phishing email detection. However, the parameters of the kernel function
in SVM can significantly impact the classification accuracy.

The proposed CS-SVM approach extracts 23 features from email headers, bodies, and URLs to
build a hybrid classifier. These features include things like the sending time of the email, presence
of certain keywords, number of URLs, and URL characteristics. The Cuckoo Search algorithm is
then used to optimize the parameter selection for the SVM's Radial Basis Function (RBF)
kernel.[9]

Experiments were conducted on a dataset of 1,384 phishing emails and 20,071 non-phishing
emails. The results show that the CS-SVM approach outperforms a standard SVM classifier with
default parameter settings. Specifically, the CS-SVM achieved a True Positive Rate (TPR) of up
to 93.12%, which is about 4% higher than the standard SVM. The False Positive Rate (FPR) was
the same for both classifiers. The overall classification accuracy of CS-SVM was over 91%,
compared to the standard SVM which was less than 91%.
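
For reference, the three metrics quoted above can be computed from confusion-matrix counts as in
the following sketch. The function name and the counts in the example are hypothetical and are
not taken from the paper.

```python
def detection_rates(tp, fn, fp, tn):
    """Return (TPR, FPR, accuracy) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                        # share of phishing emails that were caught
    fpr = fp / (fp + tn)                        # share of legitimate emails wrongly flagged
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all emails classified correctly
    return tpr, fpr, accuracy

# Hypothetical counts, not taken from the paper.
tpr, fpr, accuracy = detection_rates(tp=90, fn=10, fp=5, tn=95)
```

Because the dataset contains far more non-phishing than phishing emails, accuracy is dominated by
the majority class, which is a likely reason for reporting TPR and FPR alongside it.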

The authors conclude that the proposed CS-SVM method is more effective than the standard SVM
for phishing email detection, as it is able to optimize the kernel function parameters. Future work
will focus on parallelizing the CS-SVM algorithm to improve efficiency on larger datasets.
Overall, this research demonstrates the benefits of using advanced optimization techniques like
Cuckoo Search to enhance the performance of machine learning models for security applications.

Phishing URL detection using URL Ranking

This paper proposes a framework for detecting and ranking phishing URLs in real-time. The
approach uses a combination of URL clustering, classification, and categorization to identify
malicious URLs.

The paper first extracts lexical features from URLs, such as the presence of incorrect spellings or
brand names, which can indicate a phishing attempt. It also uses host-based features, analyzing the
IP addresses and autonomous system numbers associated with the URL's domain, hostname, mail
server, and name server. These host-based features can reveal if a URL is hosted on infrastructure
known to be used for phishing. The paper then performs clustering on the full dataset of URLs
using the K-means algorithm. This clustering step extracts structural patterns in the data and
assigns each URL a cluster ID, which is then used as an additional feature for the classification
model.[10]

The classification model uses an online learning approach to adapt to evolving trends in phishing
URLs. It achieves classification accuracies between 93% and 98% in detecting phishing URLs. To
provide more meaningful feedback to users, the paper also incorporates URL categorization using
the Microsoft Reputation Service (MRS). The categories returned by MRS (e.g. "Phishing",
"Education") are used to place each URL into a "Severe", "Moderate", or "Benign" threat level.
An internal threat scale and URL ranking algorithm then combines the classification results and
threat level to provide a final ranking for each URL, using a color-coded system (red, yellow,
green) to indicate the level of risk.

The paper's key contributions are: 1) the use of URL clustering to enhance the classification model,
2) the incorporation of URL categorization from external services to improve the ranking
mechanism, and 3) the development of a comprehensive framework that can detect, categorize,
and rank phishing URLs in real-time. This approach outperforms prior work that relied on
lexical or host-based features alone.

Phishing Website Detection Using Machine Learning

The paper proposes a machine learning approach to detect phishing websites. Phishing is an
internet scam where attackers send fake messages that appear to be from a trusted source, in order
to steal personal information or infect devices. Traditional blacklisting methods have limitations
in detecting zero-hour phishing attacks.[13]

The proposed approach uses supervised machine learning algorithms like Random Forest and
Decision Tree to classify URLs as phishing or legitimate. A dataset of 10,000 URLs (5,000
phishing, 5,000 legitimate) is used. Feature extraction is done on three types of features:
1. Domain-based features: DNS record, website traffic, domain age, domain expiration
2. HTML and JavaScript-based features: iframe redirection, status bar customization,
disabling right-click, website forwarding
3. Address bar-based features: domain, IP address, use of "@" symbol, URL length, URL
depth, use of URL shortening services, use of "-" in domain
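
As an illustration of the address bar-based group, the sketch below computes several of the listed
features with the Python standard library. It is not the paper's extraction code: the function
names and the SHORTENERS set are assumptions, and URL depth is taken here to mean the number of
non-empty path segments.

```python
import ipaddress
from urllib.parse import urlparse

SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}  # illustrative, not exhaustive

def _is_ip(host):
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def address_bar_features(url):
    """Return a dictionary of simple address bar-based features for one URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    depth = len([segment for segment in parsed.path.split("/") if segment])
    return {
        "has_ip_address": _is_ip(host),
        "has_at_symbol": "@" in url,
        "url_length": len(url),
        "url_depth": depth,
        "uses_shortener": host in SHORTENERS,
        "has_hyphen_in_domain": "-" in host,
    }

example = address_bar_features("http://192.168.0.1/login/verify.php")
```

Each value can be encoded as 0/1 or as a number before being passed to a classifier such as
Random Forest.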
The extracted features are used to train the machine learning models. Random Forest achieved an
accuracy of 87.0%, while Decision Tree achieved 82.4% accuracy in classifying the URLs.
The trained Random Forest model is then deployed as a web application using Flask. Users can
enter a URL in the web interface, and the model will classify it as phishing or legitimate, displaying
the result to the user. The key advantages of this approach are effective detection of phishing
websites, including zero-hour attacks, using machine learning; high accuracy achieved through
feature engineering and the use of ensemble methods like Random Forest; and ease of deployment as
a web application for user-friendly phishing detection. Overall, the proposed machine
learning-based phishing detection system provides a robust and practical solution to protect
users from phishing attacks.

PhishNet: Predictive Blacklisting to Detect Phishing Attacks

PhishNet is a system designed to improve the resilience and efficiency of URL blacklists in
defending against phishing attacks. It has two main components: a URL prediction component and
an approximate URL matching component. The URL prediction component systematically
generates new URLs from existing blacklist entries using five different heuristics.

These heuristics exploit observations about the prevalence of lexical similarities in phishing URLs.
For example, the "Replacing TLDs" heuristic generates new URLs by replacing the top-level
domain of a blacklisted URL with other common TLDs. The "Directory structure similarity"
heuristic creates new URLs by exchanging file names or query strings among URLs with similar
directory structures. The generated URLs are then validated through DNS lookups, server
connectivity tests, and content similarity checks to identify new phishing sites.
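
The "Replacing TLDs" heuristic can be sketched as below. This is a simplified illustration rather
than PhishNet's code: the COMMON_TLDS list and the function name are assumptions, only the last
label of the hostname is treated as the TLD (so multi-label suffixes such as "co.uk" and IP-address
hosts are not handled), and the validation steps (DNS lookup, connectivity, content similarity)
are left out.

```python
from urllib.parse import urlsplit, urlunsplit

COMMON_TLDS = ["com", "net", "org", "info"]  # illustrative list

def replace_tld_candidates(url, tlds=COMMON_TLDS):
    """Generate candidate URLs by swapping the last hostname label for other TLDs."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if "." not in host:
        return []
    base, current = host.rsplit(".", 1)
    candidates = []
    for tld in tlds:
        if tld != current:
            # Any port or user info in the original URL is dropped in this sketch.
            new_parts = (parts.scheme, f"{base}.{tld}", parts.path, parts.query, parts.fragment)
            candidates.append(urlunsplit(new_parts))
    return candidates

candidates = replace_tld_candidates("http://paypa1-login.com/signin?x=1")
```

Each candidate would then have to pass the validation checks described above before being added
to the blacklist.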
The approximate URL matching component performs a more nuanced analysis of a given URL,
rather than relying on exact matches with the blacklist. It breaks down the URL into four key
components - IP address, hostname, directory structure, and brand name - and scores each
component based on its similarity to the blacklist entries.[8]

The final cumulative score determines whether the URL is flagged as potentially malicious.
Evaluation of PhishNet shows that it is effective at generating new phishing URLs, discovering
around 18,000 new phishing URLs from a set of 6,000 blacklist entries. Its approximate matching
algorithm also achieves low false positive (3%) and false negative (5%) rates, while being
significantly faster than the Google Safe Browsing API. Overall, PhishNet demonstrates the
benefits of combining systematic URL generation with approximate matching to enhance the
resilience of blacklists against phishing attacks.

Enhancing secure financial transactions through the synergy of blockchain and
artificial intelligence

The document presents a comprehensive study on enhancing secure financial transactions through
the integration of blockchain and artificial intelligence (AI). It proposes the Integrated Blockchain
and Artificial Intelligence (IBAI) framework, which leverages blockchain's decentralized
architecture and AI's data processing capabilities to improve security in financial transactions. The
framework aims to safeguard user data from cyber threats by storing customer information on a
blockchain while AI algorithms analyze and detect suspicious activities. This combination
strengthens data protection and enables faster, more secure financial services. Key elements of
the IBAI framework include:
(1) Blockchain for Data Security: Blockchain ensures secure, decentralized data sharing,
removing the need for centralized control. Each transaction is recorded in blocks, preventing
tampering.
(2) AI for Suspicious Activity Detection: AI-driven algorithms analyze data for any signs of
suspicious behavior, with accuracy rates as high as 98% in identifying threats.
(3) Secure Transactions: The IBAI framework incorporates robust encryption and
consensus mechanisms to validate transactions securely across the network.
(4) Improved Financial Services: The use of AI and blockchain improves the speed,
accuracy, and safety of financial transactions, enhancing the overall efficiency of financial
systems.
The Integrated Blockchain and Artificial Intelligence (IBAI) Framework is a proposed system
designed to enhance the security and efficiency of financial transactions by combining the
strengths of blockchain technology and artificial intelligence (AI). [16]
Limitations mentioned in the paper include increased power consumption, higher processing time
for secure transactions, complexity of implementation, scalability issues, resource-intensive
computation, potential overhead costs, and dependence on network infrastructure.

An Enhanced Security Method for Monitoring Transaction Risks in Electronic
Money Transfer Machines in India

The document, "An Enhanced Security Method for Monitoring Transaction Risks in Electronic
Money Transfer Machines in India", focuses on improving the security of electronic money
transfer systems, particularly micro-ATMs, which are widely used in rural areas for Aadhaar-
based payments. The paper highlights the vulnerabilities in these systems, including memory,
volatile, and static data threats, and addresses risks such as skimming, identity theft, and data
breaches at point-of-sale terminals. Key points include:
(1) Vulnerabilities: Memory data theft, weak encryption, and card skimming.
(2) Security Measures: Locking card data, monitoring transaction risks, updating software and
hardware, and using firewalls.
(3) Proposed Solutions: Terminal-to-terminal locking methods, user awareness, and enhanced
encryption.
The paper stresses the need for stringent security protocols and consistent updates to safeguard
transactions and protect user data.[17]

Secure Credit or Debit Card Transaction Using Alert messages and OTP to prevent
phishing attacks

The document titled "Secure Credit or Debit Card Transaction Using Alert Messages and OTP to
Prevent Phishing Attacks" focuses on improving the security of online credit and debit card
transactions. It proposes a system that sends an alert message to the card owner before a
transaction is completed, offering a chance to block it if unauthorized. The method combines
existing OTP (One-Time Password) authentication with this additional alert system to prevent
phishing attacks. The model is built around validating card details with the Luhn algorithm,
sending alerts, and using a sequence of checks with the customer's bank, the merchant, and OTP
generation to confirm the transaction. The system enhances security by allowing the cardholder
to block transactions if alerted in real-time before any payment is processed.[20]

The Luhn algorithm is a simple way to check whether a credit or debit card number is valid. It
works as follows:
(1) Moving leftwards from the rightmost digit, double every second digit (the rightmost digit
itself is not doubled).
(2) If a doubled value is more than 9, subtract 9 from it.
(3) Add all the digits together.
(4) If the total is a multiple of 10, the card number is valid; otherwise, it is invalid.
It is a quick check to catch errors in card numbers.
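
The Luhn check can be written in a few lines of Python, as in this minimal sketch. The function
name and the handling of spaces and hyphens are assumptions made for the example; the numbers used
are standard test values, not real cards.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the card number passes the Luhn check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; positions 1, 3, 5, ... (0-based) are doubled.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

is_valid = luhn_valid("4539 1488 0343 6467")
```

A passing check only shows that the number is well formed; it says nothing about whether the card
exists or the transaction is authorized, which is why the alert and OTP steps are still required.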

Detection and mitigation of insider attacks in financial systems

The project aims to detect and mitigate insider attacks in financial systems using a
smart-contactless blockchain approach. It focuses on ensuring data integrity and security without
full-scale data migration. The system uses blockchain to store fixed-length fingerprints of SQL
database tuples. It intercepts SQL queries to verify data integrity by comparing with stored
fingerprints, using an enhanced hashing algorithm for faster detection. [18]
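
The fingerprint idea can be illustrated with the sketch below. SHA-256 is used in place of the
enhanced hashing algorithm, which is not specified here, and a plain dictionary stands in for the
blockchain; the function names and the example row are hypothetical.

```python
import hashlib
import json

def fingerprint(row: dict) -> str:
    """Return a fixed-length (64 hex characters) fingerprint of one database tuple."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def record(ledger: dict, key, row: dict) -> None:
    """Store the fingerprint of a tuple; the dict stands in for the blockchain."""
    ledger[key] = fingerprint(row)

def verify(ledger: dict, key, row: dict) -> bool:
    """Check a tuple read from the database against its stored fingerprint."""
    return ledger.get(key) == fingerprint(row)

ledger = {}
row = {"id": 1, "account": "A-100", "balance": 2500}
record(ledger, 1, row)
tampered = dict(row, balance=9500)  # simulated insider modification
```

Here verify(ledger, 1, row) succeeds, while the tampered tuple no longer matches its stored
fingerprint and would be flagged.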

The system is divided into four modules: User Management, Financial Transaction Management,
Insider Attack Detection, and Insider Attack Mitigation. It employs a blockchain network, SQL
database, and an interface for integrity checks. The implementation environment includes AWS
EC2 instances and Lambda functions for scalability and parallel processing. The system shows
reduced execution time with increased parallel processing units, enhancing detection.

Online Banking in India: Attacks and Preventive Measures to Minimize Risk

The document titled "Online Banking in India: Attacks and Preventive Measures to Minimize
Risk" provides a comprehensive overview of the security challenges associated with online
banking in India and offers strategies to mitigate risks. Key Points:
(1) Online Banking Attacks: Common attacks include phishing, malware, identity theft, and
session hijacking. These attacks aim to steal personal and financial information, often through
deceptive emails, malicious software, or unauthorized access.
(2) Malware Threats: Types of malware include spyware, trojans, viruses, and worms, all of
which can compromise user information. Malware can steal login details, manipulate websites,
and hijack user sessions to carry out unauthorized transactions.
(3) Phishing and Vishing: Phishing uses fake emails to lure users into giving sensitive
information, while vishing involves phone calls pretending to be from a bank to collect details.
(4) Security Measures: Banks are encouraged to implement stronger authentication methods,
including two-factor authentication, network security tools like firewalls, and regular risk
assessments to minimize vulnerabilities.
(5) Customer Education: Training and awareness programs for both customers and employees
are critical to recognizing and avoiding online scams.
(6) Comprehensive Security: A holistic approach involving preventive, detective, deterrent,
corrective, and recovery controls is necessary to protect against cyber threats in online banking.
The document emphasizes the need for both technical and organizational measures to safeguard
online banking systems from evolving cyber threats.[19]

2.2 Research Gap

Phishing attacks have been extensively studied, and traditional detection methods, such as
rule-based systems and blacklist approaches, have been used to combat phishing threats. However,
some of these methods struggle to keep up with rapidly changing phishing scams, particularly
those targeting financial services.
1) Lack of Adaptability to Emerging Threats: Existing rule-based and blacklist approaches
are static, which makes them weak against phishing techniques devised on the fly to slip
through existing security measures.
2) Inability to Handle Complex Patterns: Traditional detection methods are not equipped to
catch the subtle and intricate patterns that modern phishing attacks exhibit, such as dynamic
URL generation and target-specific text messages.
3) Insufficient Use of Hybrid Techniques: Although ML and DL have each been used separately
for phishing detection, few works combine the two to exploit their complementary strengths.
4) Performance Trade-offs: Traditional ML and DL solutions are often not optimized for both
goals at once, leaving a gap between detection accuracy and the computational efficiency
needed for real-time phishing detection.

2.3 Objectives
The primary objectives of the project are:
1) To develop an effective hybrid AI-based phishing detection system that combines Machine
Learning (ML) and Deep Learning (DL) techniques to detect phishing attacks in financial
transactions.
2) To extract relevant features from URLs and text messages, including suspicious characters,
risky top-level domains (TLDs), and specific terms in text messages, for better phishing detection.
3) To implement a Random Forest classifier for ML-based detection, utilizing extracted features
to identify phishing attempts based on predefined and statistical patterns.
4) To design and train an Artificial Neural Network (ANN) that can detect more complex and
subtle phishing patterns through tokenization and vectorization of URLs and text messages.
5) To optimize model performance through hyperparameter tuning, including adjusting learning
rates, the number of hidden layers, and the number of estimators, to ensure high accuracy,
precision, recall, and F1 score.
6) To evaluate and compare the performance of both models using relevant metrics and deploy
the best-performing model for real-time phishing detection.
7) To integrate the model into existing financial security systems for scalable and real-time
phishing detection and mitigation.

2.4 Problem Statement

Phishing attacks are among the most significant cyber threats, especially in finance. They use
fake URLs and spoofed text messages to steal login credentials and other sensitive information
such as financial data. Because phishers continually change their tactics to stay a step ahead of
rule-based detection systems and blacklists, these traditional phishing defense mechanisms are
proving less and less effective. Traditional methods have difficulty identifying newly developed,
evolving phishing techniques and sophisticated adversary behaviour, resulting in severe monetary
losses, data leaks, and loss of user trust.[17]

While both Machine Learning (ML) and Deep Learning (DL) have substantially improved phishing
detection, many implementations rely on one method or the other, chosen for either simplicity or
expressive power, and therefore fail to capture the full range of patterns from simple to very
intricate. A flexible, reliable, and scalable phishing detection system is needed that
incorporates well-known ML and DL techniques to enhance accuracy while responding in real time.
This work aims at an integrated AI-powered solution that takes a hybrid approach, combining the
benefits of ML and DL models to identify phishing attempts that occur through financial
transactions. This will improve fraud detection in financial systems by extending the current
capability, which detects phishing attempts from simple features such as URL structure, to cover
both patterns that are simple to detect and patterns that are too intricate and would otherwise
be overlooked.

2.5 Project Plan

This project proposes a Google Chrome extension that provides much-needed assistance by
detecting phishing URLs and phishing attacks in text messages, protecting users' data from
theft. The solution detects phishing using Random Forest classification and an ANN.

Figure 1: Gantt Chart

3. TECHNICAL SPECIFICATION

3.1 Requirements
3.1.1 Functional
1) Phishing URL Detection:
- The system should analyse URLs entered by the user.
- The Random Forest classification model should classify URLs as either phishing or
legitimate.
2) Phishing Text Message Detection:
- The system should allow users to input text messages.
- The Artificial Neural Network (ANN) should classify messages as either phishing or
non-phishing.
3) Real-time Detection:
- The extension must provide real-time feedback on URLs and text messages.
4) User Interface (UI):
- A simple, user-friendly interface where users can input URLs or text for
analysis.
- Visual feedback to inform users of the classification results.
5) Database Management:
- Store and manage a list of blacklisted phishing URLs.
- Update the database as new phishing URLs are detected.

3.1.2 Non-Functional


1) Performance: The classification models should process input and return results within
a few seconds.
2) Scalability: The system should be scalable to handle multiple users simultaneously.
3) Security: Ensure user data, such as entered URLs or text, is not stored or shared.
4) Reliability: Ensure the extension remains functional across various Chrome versions.
5) Maintainability: The system should be easily maintainable, allowing for model
updates and improvements.

3.2 Feasibility Study


3.2.1 Technical Feasibility
1) Data Availability: Adequate phishing and non-phishing datasets are available for
training the Random Forest and ANN models.
2) Development Tools: The Chrome extension will be built using HTML, CSS, and
JavaScript for the frontend. Python-based machine learning models can be integrated
into the backend via Flask or Node.js.
3) Machine Learning Models: Random Forest is well-suited for URL classification, and
ANN is effective for detecting phishing in text messages. These models can be
optimized for browser environments.

3.2.2 Economic Feasibility
1) Development Costs: Most tools required for this project (e.g., Python, JavaScript,
Chrome Extension APIs) are free or open source, keeping development costs minimal.

2) Model Training: Cloud services like Google Colab can be used for model training,
reducing the need for expensive infrastructure.

3) Deployment: Publishing the extension on the Chrome Web Store involves a small
registration fee, making it economically feasible.

3.2.3 Social Feasibility


1) User Impact: This extension will significantly benefit users by protecting them from
phishing attacks, thereby enhancing online safety.

2) Ease of Use: The extension is designed for non-technical users, providing easy access
to phishing detection.

3.3 System Specification


3.3.1 Hardware Specification
Development Machine:
- Processor: Intel Core i5 or higher
- RAM: 8 GB or higher
- Storage: 250 GB SSD
User System Requirements:
- Any device capable of running Google Chrome (PC, Mac, etc.)
- Minimum of 4 GB RAM for smooth operation

3.3.2 Software Specification


Development Environment:
- Python 3.8 or higher for machine learning models (Random Forest, ANN)
- Flask or Node.js for backend integration
- HTML, CSS, JavaScript for Chrome Extension development
- Chrome Extension API for extension functionality
Machine Learning Libraries:
- Scikit-learn for Random Forest implementation
- TensorFlow or Keras for ANN development
Version Control:
- Git for source code management and collaboration
Testing Tools:
- Postman for API testing
- Selenium for browser-based testing

4. DESIGN APPROACH AND DETAILS

4.1 System Architecture


1. User Interface (UI) Layer
(1) Chrome Extension UI:
(I) The front end of the extension, where users input URLs or text
messages to be analysed. It is developed using HTML, CSS, and
JavaScript for interactivity.
(II) Provides input fields for URLs and text messages.
(III) Displays the classification results: "Phishing" or "Legitimate" for
URLs, and "Phishing" or "Non-phishing" for text messages.
2. Browser Layer
(1) Chrome API Integration:
(I) The extension interacts with Google Chrome through its Extension API.
This layer manages extension-specific functionality like user interactions
and permissions.
(II) Handles input capture, event listeners, and background operations.
3. Backend Logic Layer
(1) Random Forest Classifier for URL Detection: A pre-trained Random Forest
model is integrated into the backend. Upon user input (URL), this layer processes
the URL by extracting relevant features (e.g., URL length, domain, special
characters) and predicts whether it is phishing or legitimate.
(2) Artificial Neural Network (ANN) for Text Message Detection: A deep learning
model (ANN) is used to detect phishing attempts in text messages. The input text
is preprocessed (tokenization, stop word removal) before being fed into the ANN
model for classification.
(3) Model Integration: The models are hosted in a lightweight backend server
(Flask/Node.js). The backend is responsible for receiving inputs from the Chrome
extension, passing them to the respective machine learning models, and returning
the results to the extension.
4. Data Processing & Model Layer
(1) Feature Extraction Module (for URLs): Extracts essential features from the URL,
such as domain name length, number of special characters, presence of HTTPS, and
other phishing indicators.
(2) Text Preprocessing Module (for Text Messages): Tokenizes and vectorizes
input text messages, removing unnecessary words and noise, preparing the input for the
ANN model.
(3) Machine Learning Models:
(I) Random Forest Classifier: A pre-trained model classifying URLs.
(II) ANN Model: A deep learning network classifying text messages.
5. Database Layer (Optional)
(1) Phishing URL Database: Stores blacklisted phishing URLs for reference and
real-time validation.
(2) User Activity Log (Optional): Can log user inputs for further analysis,
feedback, or enhancement of phishing detection models. (Ensure privacy and
anonymity for user data.)
6. Communication Layer
(1) AJAX/REST API Calls:
(I) The Chrome extension sends input data (URLs or text messages) to the
backend via RESTful API calls.
(II) Receives classification results from the backend asynchronously to
ensure smooth user experience.
7. Result Display Layer
(1) UI Feedback: Once the backend returns the classification results, the Chrome
extension UI displays them to the user (e.g., "Phishing" or "Legitimate" for
URLs, "Phishing" or "Non-phishing" for text messages).
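
The request flow through the backend logic and communication layers can be sketched in a
framework-independent way as follows. This is a minimal sketch under stated assumptions: the JSON
fields "type" and "value", the STOP_WORDS list, and the function names are hypothetical, and the
two models are passed in as plain callables rather than real Random Forest or ANN objects.

```python
import json
import re

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "your"}  # illustrative list

def preprocess_text(message):
    """Lower-case and tokenize a message, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def handle_request(body, url_model, text_model):
    """Route one JSON request from the extension to the matching model."""
    try:
        payload = json.loads(body)
        kind, value = payload["type"], payload["value"]
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected JSON with 'type' and 'value'"})
    if not isinstance(value, str):
        return json.dumps({"error": "'value' must be a string"})
    if kind == "url":
        label = "Phishing" if url_model(value) else "Legitimate"
    elif kind == "text":
        label = "Phishing" if text_model(preprocess_text(value)) else "Non-phishing"
    else:
        return json.dumps({"error": "unknown type"})
    return json.dumps({"type": kind, "label": label})

# Placeholder models for demonstration only; real models would be loaded from disk.
url_stub = lambda url: "@" in url
text_stub = lambda tokens: "verify" in tokens
reply = handle_request('{"type": "text", "value": "Please verify your account"}',
                       url_stub, text_stub)
```

In a Flask or Node.js deployment, a function of this kind would sit behind a single POST endpoint
that the extension calls asynchronously.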

4.2 Design
4.2.1 Data Flow Diagram
(1) Input: The URL flows from the user to the web extension.
(2) Processing: The URL is passed to the machine learning model for prediction.
(3) Accessing Data Store: The ML model consults the pre-trained data store to characterize the
URL.
(4) Prediction: The model returns a result and transmits it to the web extension.
(5) Output: The output describing the nature of the URL is displayed to the user.

Figure 2: Data flow diagram

4.2.2 Use Case Diagram

(1) Primary Actor: User. The user is the primary actor, and the only error-prone one,
who interacts with the web extension by pasting a URL and viewing the prediction.
(2) System Actions: In the prediction scenario, the web extension analyzes the URL and
produces a prediction using the machine learning model.
(3) Optional Features: If prediction logging is enabled, an Admin can optionally view
the logged prediction results.
(4) Relationships: The relationships show how the actors and use cases are associated
within the system.

Figure 3: Use-case diagram
4.2.3 Class Diagram
(1) Association: The URLAnalyzer is a functional component owned by the WebExtension.
Without the URLAnalyzer, the WebExtension cannot perform any analysis on the URL.
(2) Aggregation: The URLAnalyzer participates in an aggregation, since analyzing
the URL alone is insufficient without generating a prediction.
(3) Optional Dependency: The ResultLogger complements the WebExtension, which
includes an optional feature for logging predictions.
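
The relationships above can be sketched as three small classes. This is an illustrative reading of the diagram, not the project's code: the method names (`check`, `analyze`, `log`) and the `stub_model` rule are hypothetical, and the model is represented as a plain callable.

```python
from typing import Callable, List, Optional, Tuple


class ResultLogger:
    """Optional collaborator that records predictions."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str]] = []

    def log(self, url: str, prediction: str) -> None:
        self.entries.append((url, prediction))


class URLAnalyzer:
    """Turns a URL into a prediction by delegating to a model callable."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self._model = model

    def analyze(self, url: str) -> str:
        return self._model(url)


class WebExtension:
    """Owns a URLAnalyzer; the logger is an optional dependency."""

    def __init__(self, analyzer: URLAnalyzer, logger: Optional[ResultLogger] = None) -> None:
        self._analyzer = analyzer
        self._logger = logger

    def check(self, url: str) -> str:
        prediction = self._analyzer.analyze(url)
        if self._logger is not None:
            self._logger.log(url, prediction)
        return prediction


def stub_model(url: str) -> str:
    # Placeholder rule standing in for the trained classifier.
    return "Phishing" if "@" in url else "Legitimate"


logger = ResultLogger()
extension = WebExtension(URLAnalyzer(stub_model), logger)
print(extension.check("http://user@evil.example.test/"))
```

Constructing `WebExtension` without a logger exercises the optional dependency: predictions are still returned, but nothing is recorded.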

Figure 4: Class diagram

4.2.4 Sequence Diagram

In this sequence diagram, the user copies and pastes a URL into the WebExtension. The
WebExtension calls the URLAnalyzer, which then contacts the Machine Learning Model
for a prediction. The prediction is returned and displayed to the user. If desired, the
ResultLogger records the outcome via the WebExtension.

Figure 5: Sequence diagram

5. METHODOLOGY AND TESTING

This section covers the methodology of building a browser extension for phishing detection,
including the structure of files, feature extraction, backend API, and integration with a machine
learning model. The extension is designed to identify phishing websites by examining various
characteristics of URLs, such as length, presence of symbols, HTTPS usage, and domain-specific
features. This section also describes the testing strategy to ensure accurate phishing detection.

Dataset and Feature Selection


The phishing detection model relies on a comprehensive set of URL features, each of which is
indicative of potential phishing activity. The dataset includes features such as UsingIP, LongURL,
ShortURL, and Symbol@, which assess basic URL characteristics and their tendency to indicate
phishing attempts. For instance, the UsingIP feature checks if the URL uses an IP address rather
than a domain, a tactic often employed by phishing websites to disguise their identity. LongURL
and ShortURL features account for URLs that are abnormally long or use shortening services, as
phishers may use these methods to hide their intent.

Additional features assess more advanced URL properties, such as HTTPS, DomainRegLen,
Favicon, NonStdPort, and ServerFormHandler. These are designed to capture security indicators,
such as whether the site uses HTTPS, has a valid domain registration length, or uses non-standard
ports. Features like Favicon and HTTPSDomainURL help identify anomalies in the site's icon
source or HTTPS presence in the domain section of the URL, which could signal phishing.
The dataset also incorporates behavioral and structural features within the webpage, such as
AnchorURL, LinksInScriptTags, IframeRedirection, and UsingPopupWindow. These features
provide insight into how the page is structured and its interaction with users, with the assumption
that phishing sites may attempt to hide malicious actions in scripts, iframes, or popup windows.
Finally, popularity and reputation-based features like WebsiteTraffic, PageRank, and GoogleIndex
assess the site’s online presence, which typically indicates its legitimacy.
Each feature contributes to a robust detection mechanism that allows the machine learning model
to analyze URLs comprehensively and accurately distinguish between phishing and legitimate
sites.
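
A few of the basic URL features can be sketched as follows, using the -1 (phishing indicator) / 1 (legitimate indicator) encoding seen in Appendix A. The 54-character threshold appears in the appendix code; the upper threshold of 75 with an intermediate value of 0, and the list of shortening services, are assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative subset of shortening services (assumption, not the project's list).
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def using_ip(url: str) -> int:
    """-1 if the host part of the URL is a literal IP address, else 1."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return -1
    except ValueError:
        return 1


def long_url(url: str) -> int:
    """1 for short URLs, 0 for borderline lengths (assumed), -1 for very long ones."""
    if len(url) < 54:
        return 1
    if len(url) <= 75:  # assumed upper threshold
        return 0
    return -1


def short_url(url: str) -> int:
    """-1 if the host is a known shortening service, else 1."""
    return -1 if (urlparse(url).hostname or "") in SHORTENERS else 1


def symbol_at(url: str) -> int:
    """-1 if the URL contains '@', else 1."""
    return -1 if "@" in url else 1


sample = "http://125.98.3.123/login@secure"
feature_vector = [using_ip(sample), long_url(sample), short_url(sample), symbol_at(sample)]
print(feature_vector)  # [-1, 1, 1, -1]
```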

Chrome Extension Structure


The Chrome extension is implemented through a series of structured files, each serving a specific
role within the application.
1. Manifest.json: This is the core configuration file for the Chrome extension, defining
metadata like the extension's name, version, and the permissions it needs to access browser
resources. The file also specifies which scripts are required for background processing and
UI elements, such as the popup interface.
2. Popup Interface (popup.html and popup.js): The user-facing part of the extension
consists of an HTML file (popup.html) that serves as a popup when the user clicks the
extension icon. Within this popup, users can enter a URL, triggering the phishing detection
process. The JavaScript file popup.js manages this input and communicates with the
backend by sending the URL to the background script (background.js), awaiting the
phishing detection result, and updating the popup interface based on the response.
3. Background Script (background.js): The background.js file maintains persistent
operations while the extension is active. It functions as an intermediary between the
frontend (popup.js) and the backend API (app.py). Upon receiving the URL from popup.js,
background.js forwards it to the backend for feature extraction and phishing prediction.

4. Backend API (app.py): The backend is built using FastAPI and is responsible for handling
API requests from background.js. This API is designed to accept a URL input, pass it to
the feature extraction module (features.py), and apply a machine learning model
(model.pkl) for prediction. The API returns a classification result, which background.js
then relays to popup.js to be displayed in the popup interface.
5. Feature Extraction (features.py): The features.py file houses the functions required for
computing the URL features mentioned earlier. It processes the URL and extracts features
that the model uses to determine the likelihood of phishing. These extracted features form
the input to the pre-trained model in app.py.
6. Machine Learning Model (model.pkl): The model file, model.pkl, is a pre-trained binary
classifier designed to distinguish between phishing and safe websites. It utilizes the
features computed in features.py to predict the safety of a URL, with the output fed back
into the FastAPI framework in app.py for delivery to the frontend.

Backend API Workflow


The backend API, structured in app.py, is developed with FastAPI, providing a real-time
prediction system that assesses each input URL for potential phishing risks. The API defines an
input schema (URLInput) to ensure incoming requests are in a valid format. Upon initialization,
app.py loads the pre-trained model (model.pkl) with the Python pickle module, allowing the model
to be reused across requests.
The core of the API’s functionality lies in two endpoints:
• Root Endpoint (/): This serves as a basic endpoint returning a welcome message to
confirm the API is operational.
• Prediction Endpoint (/predict): This endpoint accepts POST requests with a URL, which
it passes to the getfeatures function in features.py. The feature extraction results are
reshaped as necessary and provided to the model for prediction.
Depending on the models' predictions, the API returns "Website is safe," "Website is not
safe," or a prediction-mismatch message when the result is ambiguous. To enhance robustness,
the API returns error messages with HTTP status codes if issues arise during prediction. This
API-centric approach allows phishing detection to function seamlessly within the extension.
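
The mapping from model outputs to the returned message can be sketched as a small pure function. The label encoding (1 for safe, -1 for phishing) and the three messages follow the FastAPI code in Appendix A; the function and constant names are hypothetical.

```python
SAFE, PHISHING = 1, -1


def interpret(rf_prediction: int, ann_prediction: int) -> str:
    """Return a verdict only when both classifiers agree."""
    if rf_prediction == SAFE and ann_prediction == SAFE:
        return "Website is safe"
    if rf_prediction == PHISHING and ann_prediction == PHISHING:
        return "Website is not safe"
    return "Prediction mismatch - unable to determine"


print(interpret(SAFE, SAFE))
print(interpret(PHISHING, SAFE))
```

Keeping this logic separate from the endpoint makes it straightforward to unit-test without starting the server.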

Workflow of the Phishing Detection Process

The phishing detection process follows a structured workflow:
1. User Input: The user enters a URL into the extension's popup interface.
2. Data Handling by popup.js: The entered URL is captured by popup.js and forwarded to
background.js.
3. Communication with app.py: background.js sends the URL to the API hosted by app.py.
4. Feature Extraction: Inside app.py, the URL is sent to features.py to extract all necessary
URL features, preparing the data for model input.
5. Model Prediction: The features are then passed to model.pkl for classification, yielding a
prediction of either "phishing" or "safe."
6. Result Relay and Display: The prediction result is returned by app.py, passed back
through background.js to popup.js, and displayed to the user.

Model Testing and Evaluation


The model’s performance is evaluated based on its accuracy in correctly classifying URLs as
phishing or safe. Testing includes:
1. Unit Testing of Feature Extraction: Each feature extraction function in features.py is
unit-tested to ensure correct values are derived from a variety of URL structures.
2. Model Validation: The pre-trained model undergoes validation with a separate testing set.
The model's accuracy, precision, recall, and F1-score are calculated to assess its ability to
identify phishing URLs accurately without mislabeling legitimate sites.

3. End-to-End Testing: The entire extension—from user input to the displayed result—is
tested to verify seamless interaction between frontend and backend components.
4. Cross-Browser Compatibility: Although designed as a Chrome extension, cross-browser
testing is conducted to ensure consistent behavior in other Chromium-based browsers.
The rigorous testing framework and attention to detail in each component provide confidence that
the extension will perform accurately and reliably for end-users. By integrating feature-rich data,
advanced machine learning, and real-time API processing, the phishing detection extension offers
a robust, user-friendly solution for identifying phishing threats.
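
The validation metrics named above can be computed from predictions as in the following sketch. It assumes the -1/1 label encoding from Appendix A and treats phishing (-1) as the positive class; that choice is an assumption, and scikit-learn's metric functions default to treating label 1 as positive. The example labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=-1):
    """Accuracy, precision, recall and F1 with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: three phishing URLs (-1) and five legitimate ones (1).
y_true = [-1, -1, -1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, 1, 1, 1, 1, -1, 1]
metrics_report = classification_metrics(y_true, y_pred)
print(metrics_report)
```

In this example one phishing URL is missed and one legitimate URL is flagged, so precision and recall are both 2/3 while accuracy is 0.75.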

6. PROJECT DEMONSTRATION

(1) Input and Output of an Unsafe Website

Input :

Output:

(2) Input and Output of a Safe Website

Input:

Output:

7. RESULT AND DISCUSSION

The phishing detection system’s effectiveness was evaluated based on accuracy, precision, recall,
and F1 score to assess its ability to accurately classify phishing and legitimate URLs. The hybrid
model, combining a Random Forest classifier and an Artificial Neural Network, demonstrated the
following key results:

1. Model Performance:

Random Forest Classifier: The Random Forest model showed high accuracy in
identifying phishing attempts based on structured URL attributes, such as URL length,
subdomain structure, and HTTPS usage. During testing, it achieved an accuracy of over
90% in URL classification, with a precision of 0.92 and recall of 0.89. This model
effectively identifies URLs with clear phishing characteristics, such as suspicious symbols
and unusual structure.

Artificial Neural Network (ANN): The ANN model excelled at detecting more complex
phishing patterns, especially in URLs containing subtle indicators. The ANN achieved an
F1 score of 0.91, with precision and recall scores of 0.89 and 0.93, respectively. Its
performance highlighted its ability to detect sophisticated phishing attempts, particularly
those embedded in suspicious textual patterns.
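
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the reported figures can be verified with a few lines of arithmetic. The helper name is hypothetical; the input values are the ones quoted above.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


ann_f1 = f1_from(0.89, 0.93)  # ANN precision and recall as reported
rf_f1 = f1_from(0.92, 0.89)   # Random Forest precision and recall as reported
print(round(ann_f1, 3), round(rf_f1, 3))
```

The ANN value rounds to 0.91, matching the reported F1 score; the Random Forest figures imply an F1 score of roughly 0.90.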

Figure 6: Random Forest Model Metrics Comparison

Figure 7: Training and Test Accuracy

2. Hybrid Model Performance:


1) The hybrid model, integrating both the Random Forest and ANN classifiers, improved
overall accuracy to approximately 94%. This combined approach leveraged the Random
Forest classifier for initial detection and then passed potentially suspicious URLs to the
ANN for further scrutiny. By combining the strengths of both models, the system achieved
enhanced detection rates, reducing both false positives and false negatives.

2) Real-Time Detection: Testing demonstrated the system's capability for real-time
analysis, processing URLs in less than two seconds, which is crucial for financial
transactions that require immediate phishing detection.
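
The two-stage idea described in item 1) can be sketched as a cascade in which the second model is consulted only when the first is uncertain. This is a hedged illustration: the scoring callables, the 0.2/0.8 confidence band, and the 0.5 cut-off are assumptions, and the API code in Appendix A instead runs both models and requires them to agree.

```python
from typing import Callable


def cascade_predict(url: str,
                    rf_score: Callable[[str], float],
                    ann_score: Callable[[str], float],
                    low: float = 0.2,
                    high: float = 0.8) -> str:
    """Scores are estimated phishing probabilities in [0, 1]."""
    p_rf = rf_score(url)
    if p_rf >= high:
        return "phishing"  # first stage is confident
    if p_rf <= low:
        return "safe"      # first stage is confident
    # Uncertain band: defer to the second model.
    return "phishing" if ann_score(url) >= 0.5 else "safe"


ann_calls = []


def stub_rf(url: str) -> float:
    # Placeholder scores standing in for a trained Random Forest.
    if "paypa1" in url:
        return 0.95
    if "login" in url:
        return 0.5
    return 0.05


def stub_ann(url: str) -> float:
    # Placeholder score standing in for a trained ANN; records each call.
    ann_calls.append(url)
    return 0.7


urls = ["http://paypa1.example.test", "http://example.test/login", "http://example.test/about"]
results = [cascade_predict(u, stub_rf, stub_ann) for u in urls]
print(results)
print(ann_calls)
```

Only the middle URL reaches the second stage, which is what keeps the average latency of such a cascade low.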

3. Feature Importance: Key features contributing to detection accuracy included URL length,
subdomain count, HTTPS presence, and the use of specific symbols or keywords. Additionally,
features like Google Index status, domain age, and PageRank provided critical insight into the
legitimacy of URLs, helping the model filter out benign sites effectively.

4. User Interface Feedback: User testing revealed a positive response to the system’s intuitive
interface. Users appreciated the real-time feedback and clear indication of a website's safety status,
with the interface successfully highlighting phishing risks based on the model’s predictions.

These results demonstrate that the hybrid AI-based model can effectively and efficiently detect
phishing attempts, especially within the context of financial transactions. The system’s accuracy
and speed highlight its suitability for integration into financial security measures, providing real-
time protection against phishing attacks.

Figure 11: Multi-layer Perceptron Model Metrics Comparison

Figure 8: Metrics Direct Values

Figure 9: Metrics Grouped Values

Figure 10: Metrics Direct Values for ANN

9. CONCLUSION

This research demonstrates the effectiveness of an AI-based hybrid approach for phishing
detection in financial transactions. By combining a Random Forest classifier with an Artificial
Neural Network, the system leverages both structured feature-based analysis and deep learning's
pattern-recognition capabilities. The Random Forest model efficiently captures high-level
indicators of phishing within URLs, while the ANN provides additional depth, identifying more
nuanced phishing patterns that evade traditional detection methods.
The resulting hybrid system addresses the limitations of conventional rule-based and blacklist
approaches, which often fall short in detecting evolving phishing strategies. The AI-powered
solution achieves high accuracy and low latency, essential characteristics for phishing detection in
financial transactions, where time is critical, and user trust must be maintained.

While the current model has proven effective in detecting phishing URLs, future work could
explore expanding the model to analyze phishing emails, potentially improving financial security
systems by integrating email threat detection. Additionally, exploring advanced deep learning
architectures, such as transformer models, could enhance pattern recognition in complex phishing
attempts. Moreover, integrating this system into a mobile application could provide broader
accessibility and real-time protection across devices.

In summary, this AI-based phishing detection system provides a scalable and efficient solution for
financial institutions, helping to protect users against sophisticated phishing attacks. By enhancing
cybersecurity measures in financial transactions, this research contributes significantly to the
ongoing efforts to counteract the growing threat of phishing in the digital age.

10. FUTURE WORKS

To enhance the capabilities and effectiveness of the phishing detection system, some future
research and development directions are proposed below:

1. Expansion of the Chrome Extension Functionality:


Incorporate advanced analysis features that allow users to test a large number of URLs
simultaneously, enabling bulk classification. This functionality would be beneficial for
organizations that need to screen multiple URLs regularly for cybersecurity purposes.
2. Broader Data Collection and Real-Time Updates:
(I) Increase the dataset by integrating URLs from various global threat databases to ensure
the model stays updated with the latest phishing trends.
(II) Develop an automated data collection pipeline to continuously update the training
dataset, enabling real-time adaptation to emerging phishing tactics.
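
The bulk classification proposed in item 1 could take a shape like the sketch below, which fans a list of URLs out over a thread pool. The function names and the stub classifier are hypothetical; in practice `classify_one` would call the prediction API for a single URL.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def classify_bulk(urls: Iterable[str],
                  classify_one: Callable[[str], str],
                  max_workers: int = 8) -> Dict[str, str]:
    """Classify many URLs concurrently and return a URL -> verdict mapping."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its verdict.
        verdicts = list(pool.map(classify_one, url_list))
    return dict(zip(url_list, verdicts))


def stub_classify(url: str) -> str:
    # Placeholder rule standing in for a call to the prediction endpoint.
    return "Website is not safe" if "@" in url else "Website is safe"


report = classify_bulk(
    ["http://example.test/", "http://user@evil.example.test/"],
    stub_classify,
)
print(report)
```

Threads suit this case because each classification is expected to be dominated by network waiting rather than computation.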

By focusing on these future directions, the project can further advance its mission of providing
accurate and scalable solutions for phishing detection, contributing to stronger cybersecurity
defences across various digital platforms.

9.REFERENCES

[1] Dhruv kumar & Kakumanu Prabhanjan Kumar. Artificial Intelligence based Cyber
Security Threats Identification in Financial Institutions Using Machine Learning Approach.
2023 2nd International Conference for Innovation in Technology (INOCON) Bangalore,
India. Mar 3-5, 2023

[2] Luke Barlow, Gueltoum Bendiab, Stavros Shiaeles & Nick Savage. A Novel Approach to
Detect Phishing Attacks using Binary Visualisation and Machine Learning. 2020 IEEE World
Congress on Services (SERVICES).

[3] Xi Zhang, Yu Zeng, Xiao-Bo Jin, Zhi-Wei Yan & Guang-Gang Geng. Boosting the
Phishing Detection Performance by Semantic Analysis. 2017 IEEE International Conference
on Big Data (BIGDATA).

[4] Srushti Patil & Sudhir Dhage. A Methodical Overview on Phishing Detection along with
an Organized Way to Construct an Anti-Phishing Framework. 2019 5th International
Conference on Advanced Computing & Communication Systems (ICACCS).

[5] Nguyet quang do, Ali Selamat, Ondrej Krejcar, Enrique Herrera-Viedma & Hamido
Fujita. Deep Learning for Phishing Detection: Taxonomy, Current Challenges and Future
Directions. Received November 8, 2021, accepted January 21, 2022, date of publication
February 17, 2022, date of current version April 8, 2022.

[6] Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson & Dawn Song. Design and Evaluation
of a Real-Time URL Spam Filtering Service. 2011 IEEE Symposium on Security and Privacy.

[7] Abdulbasit A. Darem, Asma A Alhashmi, Tareq M Alkhadi, Abdullah M Alashjaee,


Sultan M Alanazi & Shouki A Ebad. Cyber Threats Classifications and Countermeasures in
Banking and Financial Sector. Received 18 September 2023, accepted 17 October 2023, date
of publication 23 October 2023, date of current version 10 November 2023.

[8] Pawan Prakash, Manish Kumar, Ramana Rao Kompella & Minaxi Gupta. PhishNet:
Predictive Blacklisting to Detect Phishing Attacks. Mini-Conference at IEEE INFOCOM
2010.

[9] Weina Niu, Xiaosong Zhang, Guowu Yang, Zhiyuan Ma & Zhongliu Zhuo. Phishing
Emails Detection Using CS-SVM. 2017 IEEE International Symposium on Parallel and
Distributed Processing with Applications and 2017 IEEE International Conference on
Ubiquitous Computing and Communications (ISPA/IUCC).

[10] Mohammed Nazim Feroz & Susan Mengel. Phishing URL detection using URL
Ranking. 2015 IEEE International Congress on Big Data.
36
[11] Hossein Shirazi, Kyle Haefner & Indrakshi Ray. Fresh-Phish: A Framework for Auto-
Detection of Phishing Websites. Fresh-Phish: A Framework for Auto-Detection of Phishing
Websites.

[12] Swarangi Uplenchwar, Varsha Sawant, Prajakta Surve, Shilpa Deshpande & Supriya
Kelkar. Phishing Attack Detection on Text Messages Using Machine Learning Techniques.
2022 IEEE Pune Section International Conference (PuneCon) International Institute of
Information Technology ((I²IT), Pune, India. Dec 15-17, 2022.

[13] Adarsh Mandadi, Saikiran Boppana, Vishnu Ravella & Dr R Kavitha. Phishing Website
Detection Using Machine Learning. 2022 IEEE 7th International conference for Convergence
in Technology (I2CT) Pune, India. Apr 07-09, 2022

[14] Serafettin Sentürk, Elif Yerli & Ibrahim Sogukpinar. Email Phishing Detection and
Prevention by using Data Mining Techniques. (UBMK’17) 2nd International conference on
Computer Science and Engineering

[15] Abdul Basit, Maham Zafar, Xuan Liu, Abdul Rehman Javed, Zunera Jalil & Kashif
Kifayat. A Comprehensive survey of AI-enabled phishing attacks detection techniques.
Accepted: 9 October 2020 / Published online: 23 October 2020 ©Springer Science+Business
Media, LLC, part of Springer Nature 2020.

[16] Abdullah Alenizi , Shailendra Mishra and Abdullah Baihan. “Enhancing secure financial
transactions through the synergy of blockchain and artificial intelligence” epartment of
Information Technology, College of Computer and Information Sciences, Majmaah
University, Al-Majmaah 11952, Saudi Arabia b Computer Science Department, Community
College, King Saud University, Riyadh 11437, Saudi Arabia

[17] Devender Singh and Shikha Bharti. “An Enhanced Security Method for Monitoring
Transaction Risks in Electronic Money Transfer Machines in India”. 2023 IEEE International
Conference on Integrated Circuits and Communication Systems (ICICACS)

[18] Pradesh GV, Sangeetha D, Ram Kishore V and Sai Sharan L. “Detection and Mitigation
of Insider Attacks in Financial Systems”. 2024 International Conference on Advances in
Computing, Communication and Applied informatics (ACCAI) | 979-8-3503-
89449/24/831.00.

[19] Mrs. Rajshree Khande and Dr. Yashwant Patil. “Online Banking in India: Attacks and
Preventive Measures to Minimize Risk”. ISBN No.978-1-4799-38346/14/$31.00©2014
IEEE.

[20] Parvesh, Indervati, Sonia Kumari, Kartik Kumar, Gorakh Gupta and P Rajakumar.
“Secure Credit or Debit Card Transaction Using Alert messages and OTP to prevent phishing
attacks”. 3 rd International Conference on Innovative Practices in Technology and
Management (ICIPTM 2023).
37
[21] Dr. Aniket Deshpande .“Cybersecurity in Financial Services: Addressing AI-Related
Threats and Vulnerabilities”. 2024 International Conference on Knowledge Engineering and
Communication Systems (ICKECS) | 979-8-3503-5968-8/24/$31.00 ©2024 IEEE | DOI:
10.1109/ICKECS61492.2024.10616498

[22] OLEKSANDR KUZNETSOV, (Member,IEEE), PAOLO SERNANI ,LUCA


ROMEO ,EMANUELE FRONTONI,(Member,IEEE), AND ADRIANO MANCINI
.“On the Integration of Artificial Intelligence and Blockchain Technology: A
Perspective About Security”. Received 22 November 2023, accepted 19 December 2023,
date of publication 1 January 2024, date of current version 10 January 2024.IEEE
Access10.1109/ACCESS.2023.3349019

[23] Sonam Rani , Prof. (Dr.) Ajit Mittal .“Securing digital payments A comprehensive
analysis of AI driven fraud detection with real time transaction monitoring and anomaly
detection”. 2023 6th International Conference on Contemporary Computing and Informatics
(IC3I) | 979-8-3503-0448-0/23/$31.00 ©2023 IEEE | DOI:
10.1109/IC3I59117.2023.10397958

[24] Ms. Sanghmitra Gopal, Ms. Priyanks Gupta, Ms. Amrisha Minocha. “Advancements in
Fin-Tech and Security Challenges of Banking Industry”. 4 th International Conference on
Intelligent Engineering and Management (ICIEM 2023).

[25] Kuldeep Singh, Lakshami Sevakamoorthy. “Blockchain and AI-Based Threat Detection
for Enhanced Security in Financial Networks”. 2023 IEEE Technology & Engineering
Management Conference - Asia Pacific (TEMSCON-ASPAC) | 979-8-3503-
84659/23/$31.00 ©2023 IEEE | DOI: 10.1109/TEMSCON-ASPAC59527.2023.10531316
[26] Md. Jafrin Hossain, Umme Nusrat Jahan, Rejuan Haque Rifat, Annajiat Alim Rasel,
Muhammad Abdur Rahman. “Classifying Cyberattacks on Financial Organizations Based on
Publicly Available Deep Web Dataset”. 2023 International Conference On Cyber
Management And Engineering (CyMaEn) | 978-1-6654-9329-1/23/$31.00 ©2023 IEEE |
DOI: 10.1109/CyMaEn57228.2023.10050921.
[27] Dingari Jahnavi, Mona A, Sandeep Pulata, Sasank Sami, Bharadwaj Vakamullu,
Bharathi Mohan G. “Robust Hybrid Machine Learning Model for Financial Fraud Detection
in Credit Card Transactions”. Proceedings of the 2nd International Conference on Intelligent
Data Communication Technologies and Internet of Things (IDCIoT-2024) IEEE Xplore Part
Number: CFP24CV1-ART; ISBN: 979-8-3503-2753-3

[28] Amartyani Chattopadhyay, Dr.Divya Sripada. “Security Analysis and Threat Modelling
of Mobile Banking Applications”. 2023 14th International Conference on Computing
Communication and Networking Technologies (ICCCNT) | 979-8-3503-35095/23/$31.00
©2023 IEEE | DOI: 10.1109/ICCCNT56998.2023.10307577

38
[29] Oksana Avdeyuk, Dmitriy Kozlov, Lida Druzhinina, Irina Tarasova. “Fraud prevention
in the system of electronic payments on the basis of POS-networks security monitoring”.
978-1-5386-0798-5/17/$31.00 ©2017 IEEE.

[30] Dr. S Surya, Suvana Ranjeet jagtap, Ramnarayan, Mankali Priyadarshini, Read Khalid
Ibrahim, Malik Bader Alazzam. “Protecting Online Transactions: A Cybersecurity Solution
Model”. 2023 3rd International Conference on Advance Computing and Innovative
Technologies in Engineering (ICACITE) | 979-8-3503-9926-4/23/$31.00 ©2023 IEEE | DOI:
10.1109/ICACITE57410.2023.10183282.

[31] Tamsanqa Ngalo, Hannan Xiao, Bruce Christianson, Ying Zhang. “Threat Analysis of
Software Agents in Online Banking and Payments”. 2018 IEEE 16th Int. Conf. on
Dependable, Autonomic & Secure Comp., 16th Int. Conf. on Pervasive Intelligence &
Comp., 4th Int. Conf. on Big Data Intelligence & Comp., and 3rd Cyber Sci. & Tech. Cong.

[32] Jack Sturgess, Simon Eberz, Ivo Sluganovic, and Ivan Martinovic. “WatchAuth: User
Authentication and Intent Recognition in Mobile Payments using a Smartwatch”. 2022 IEEE
7th European Symposium on Security and Privacy (EuroS&P).
[33] Aya H. Salem1, Safaa M. Azzam, O. E. Emam1 and Amr A. Abohany. “Advancing
cybersecurity: a comprehensive review of AI-driven detection techniques”. Salem et al.
Journal of Big Data (2024) 11:105 https://doi.org/10.1186/s40537-024-00957-y.

Web-links:
1. https://www.datavisor.com/wiki/advance-fee-fraud/
2. https://stripe.com/in/resources/more/how-machine-learning-works-for-payment-
frauddetection-and-prevention
3. https://www.fraud.com/post/artificial-intelligence

39
APPENDIX A

(a) Random Forest Model Training Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove the next line when running as a plain script
%matplotlib inline
import seaborn as sns
from sklearn import metrics
import warnings
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle
warnings.filterwarnings('ignore')

data = pd.read_csv("phishing.csv")
data.head()

data = data.drop(['Index'],axis = 1)
data['class'].value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.title("Phishing Count")
plt.show()

X = data.drop(["class"],axis =1)
y = data["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

ML_Model = []
accuracy = []
f1_score = []
recall = []
precision = []

#function to call for storing the results
def storeResults(model, a,b,c,d):
    ML_Model.append(model)
    accuracy.append(round(a, 3))
    f1_score.append(round(b, 3))
    recall.append(round(c, 3))
    precision.append(round(d, 3))

# Random Forest Classifier Model
# (RandomForestClassifier is already imported at the top of the script)
# instantiate the model
forest = RandomForestClassifier(n_estimators=10)
# fit the model
forest.fit(X_train,y_train)

# Save the model to a .pkl file
with open("model.pkl", "wb") as file:
    pickle.dump(forest, file)

#predicting the target value from the model for the samples
y_train_forest = forest.predict(X_train)
y_test_forest = forest.predict(X_test)

#computing the accuracy, f1_score, Recall, precision of the model performance

acc_train_forest = metrics.accuracy_score(y_train,y_train_forest)
acc_test_forest = metrics.accuracy_score(y_test,y_test_forest)
print("Random Forest : Accuracy on training Data: {:.3f}".format(acc_train_forest))
print("Random Forest : Accuracy on test Data: {:.3f}".format(acc_test_forest))
print()

f1_score_train_forest = metrics.f1_score(y_train,y_train_forest)
f1_score_test_forest = metrics.f1_score(y_test,y_test_forest)
print("Random Forest : f1_score on training Data: {:.3f}".format(f1_score_train_forest))
print("Random Forest : f1_score on test Data: {:.3f}".format(f1_score_test_forest))
print()

recall_score_train_forest = metrics.recall_score(y_train,y_train_forest)
recall_score_test_forest = metrics.recall_score(y_test,y_test_forest)
print("Random Forest : Recall on training Data: {:.3f}".format(recall_score_train_forest))
print("Random Forest : Recall on test Data: {:.3f}".format(recall_score_test_forest))
print()

precision_score_train_forest = metrics.precision_score(y_train,y_train_forest)
precision_score_test_forest = metrics.precision_score(y_test,y_test_forest)
print("Random Forest : precision on training Data: {:.3f}".format(precision_score_train_forest))
print("Random Forest : precision on test Data: {:.3f}".format(precision_score_test_forest))

print(metrics.classification_report(y_test, y_test_forest))

training_accuracy = []
test_accuracy = []
# try n_estimators from 1 to 19
depth = range(1,20)
for n in depth:
    forest_test = RandomForestClassifier(n_estimators=n)

    forest_test.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(forest_test.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(forest_test.score(X_test, y_test))

#plotting the training & testing accuracy for n_estimators from 1 to 19
plt.figure(figsize=None)
plt.plot(depth, training_accuracy, label="training accuracy")
plt.plot(depth, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_estimators")
plt.legend();

# Sample metric values (replace these with your actual metric variables)
metrics_names = ['Accuracy', 'F1 Score', 'Recall', 'Precision']
train_scores = [
acc_train_forest,
f1_score_train_forest,
recall_score_train_forest,
precision_score_train_forest
]
test_scores = [
acc_test_forest,
f1_score_test_forest,
recall_score_test_forest,
precision_score_test_forest
]

# Define bar width


bar_width = 0.35
index = np.arange(len(metrics_names))

# Plotting
fig, ax = plt.subplots(figsize=(10, 6))
train_bars = ax.bar(index, train_scores, bar_width, label='Train Score', color='b')
test_bars = ax.bar(index + bar_width, test_scores, bar_width, label='Test Score',
color='g')

# Labeling
ax.set_xlabel('Metrics')
ax.set_ylabel('Scores')
ax.set_title('Random Forest Model Metrics Comparison')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(metrics_names)
ax.legend()

# Display the values on top of bars
for bars in [train_bars, test_bars]:
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.3f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.tight_layout()
plt.show()

(b) FastAPI Code:

from fastapi import FastAPI, HTTPException


from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import pickle
import numpy as np
from features import getfeatures

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"], # Allows requests from any domain
    allow_credentials=True,
    allow_methods=["*"], # Allows all HTTP methods
    allow_headers=["*"], # Allows all HTTP headers
)

class URLInput(BaseModel):
    url: str

# Load both pre-trained models once at startup so they are reused across requests
with open("model.pkl", "rb") as file:
    model = pickle.load(file)

with open("modelANN.pkl", "rb") as file:
    modelANN = pickle.load(file)

@app.get("/")
async def root():
    return {"message": "Hello, this is the phishing detection API"}

@app.post("/predict")
async def predict(input_data: URLInput):
    try:
        # Extract features from the URL
        features = getfeatures(input_data.url)
        # Ensure 2D shape for model input
        features_array = np.array(features).reshape(1, -1)

        # Make initial predictions
        prediction = model.predict(features_array)
        predictionANN = modelANN.predict(features_array)

        # Repeat prediction if there's a mismatch
        retry_count = 0
        while prediction != predictionANN and retry_count < 3: # Set a retry limit
            prediction = model.predict(features_array)
            predictionANN = modelANN.predict(features_array)
            retry_count += 1

        # Interpret prediction
        if prediction == 1 and predictionANN == 1:
            result = "Website is safe"
        elif prediction == -1 and predictionANN == -1:
            result = "Website is not safe"
        else:
            result = "Prediction mismatch - unable to determine"

        return {"url": input_data.url, "Prediction": result}

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

(c) Features extraction Code:

import ipaddress
import re
import urllib.request
from bs4 import BeautifulSoup
import socket
import requests
from googlesearch import search
import whois
from datetime import date, datetime
import time
from dateutil.parser import parse as date_parse
from urllib.parse import urlparse

class FeatureExtraction:
    features = []

    def __init__(self, url):
        self.features = []
        self.url = url
        self.domain = ""
        self.whois_response = ""
        self.urlparse = ""
        self.response = ""
        self.soup = ""

        # Fetch and parse the page; on failure the page-based features use their fallbacks
        try:
            self.response = requests.get(url)
            self.soup = BeautifulSoup(self.response.text, 'html.parser')
        except Exception:
            pass

        # Split the URL and keep the network location as the domain
        try:
            self.urlparse = urlparse(url)
            self.domain = self.urlparse.netloc
        except Exception:
            pass

        # WHOIS lookup for the registration-date features
        try:
            self.whois_response = whois.whois(self.domain)
        except Exception:
            pass

        # The order of the 30 features must match the training data columns
        self.features.append(self.UsingIp())
        self.features.append(self.longUrl())
        self.features.append(self.shortUrl())
        self.features.append(self.symbol())
        self.features.append(self.redirecting())
        self.features.append(self.prefixSuffix())
        self.features.append(self.SubDomains())
        self.features.append(self.Hppts())
        self.features.append(self.DomainRegLen())
        self.features.append(self.Favicon())

        self.features.append(self.NonStdPort())
        self.features.append(self.HTTPSDomainURL())
        self.features.append(self.RequestURL())
        self.features.append(self.AnchorURL())
        self.features.append(self.LinksInScriptTags())
        self.features.append(self.ServerFormHandler())
        self.features.append(self.InfoEmail())
        self.features.append(self.AbnormalURL())
        self.features.append(self.WebsiteForwarding())
        self.features.append(self.StatusBarCust())

        self.features.append(self.DisableRightClick())
        self.features.append(self.UsingPopupWindow())
        self.features.append(self.IframeRedirection())
        self.features.append(self.AgeofDomain())
        self.features.append(self.DNSRecording())
        self.features.append(self.WebsiteTraffic())
        self.features.append(self.PageRank())
        self.features.append(self.GoogleIndex())
        self.features.append(self.LinksPointingToPage())
        self.features.append(self.StatsReport())

    # 1.UsingIp
    def UsingIp(self):
        try:
            # Test only the host part; a full URL string never parses as an IP address
            ipaddress.ip_address(urlparse(self.url).hostname)
            return -1
        except Exception:
            return 1

    # 2.longUrl
    def longUrl(self):
        if len(self.url) < 54:
            return 1
        if len(self.url) >= 54 and len(self.url) <= 75:
            return 0
        return -1

    # 3.shortUrl
    def shortUrl(self):
        match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                          r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                          r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                          r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                          r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                          r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                          r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|tr\.im|link\.zip\.net', self.url)
        if match:
            return -1
        return 1

    # 4.Symbol@
    def symbol(self):
        if re.findall("@", self.url):
            return -1
        return 1

    # 5.Redirecting//
    def redirecting(self):
        if self.url.rfind('//') > 6:
            return -1
        return 1

    # 6.prefixSuffix
    def prefixSuffix(self):
        try:
            match = re.findall(r'\-', self.domain)
            if match:
                return -1
            return 1
        except Exception:
            return -1

    # 7.SubDomains
    def SubDomains(self):
        dot_count = len(re.findall(r"\.", self.url))
        if dot_count == 1:
            return 1
        elif dot_count == 2:
            return 0
        return -1

    # 8.HTTPS
    def Hppts(self):
        try:
            https = self.urlparse.scheme
            if 'https' in https:
                return 1
            return -1
        except Exception:
            return 1

    # 9.DomainRegLen
    def DomainRegLen(self):
        try:
            expiration_date = self.whois_response.expiration_date
            creation_date = self.whois_response.creation_date
            # WHOIS may return a list of dates; use the first entry
            try:
                if(len(expiration_date)):
                    expiration_date = expiration_date[0]
            except Exception:
                pass
            try:
                if(len(creation_date)):
                    creation_date = creation_date[0]
            except Exception:
                pass

            # Registration length in months
            age = (expiration_date.year - creation_date.year) * 12 + (expiration_date.month - creation_date.month)
            if age >= 12:
                return 1
            return -1
        except Exception:
            return -1

    # 10. Favicon
    def Favicon(self):
        try:
            for head in self.soup.find_all('head'):
                for link in self.soup.find_all('link', href=True):
                    dots = [x.start(0) for x in re.finditer(r'\.', link['href'])]
                    if self.url in link['href'] or len(dots) == 1 or self.domain in link['href']:
                        return 1
            return -1
        except Exception:
            return -1

    # 11. NonStdPort
    def NonStdPort(self):
        try:
            port = self.domain.split(":")
            if len(port) > 1:
                return -1
            return 1
        except Exception:
            return -1

    # 12. HTTPSDomainURL
    def HTTPSDomainURL(self):
        try:
            if 'https' in self.domain:
                return -1
            return 1
        except Exception:
            return -1

    # 13. RequestURL
    def RequestURL(self):
        try:
            # Counters: i = resources seen, success = resources matching the URL/domain test
            i, success = 0, 0

            for img in self.soup.find_all('img', src=True):
                dots = [x.start(0) for x in re.finditer(r'\.', img['src'])]
                if self.url in img['src'] or self.domain in img['src'] or len(dots) == 1:
                    success = success + 1
                i = i+1

            for audio in self.soup.find_all('audio', src=True):
                dots = [x.start(0) for x in re.finditer(r'\.', audio['src'])]
                if self.url in audio['src'] or self.domain in audio['src'] or len(dots) == 1:
                    success = success + 1
                i = i+1

            for embed in self.soup.find_all('embed', src=True):
                dots = [x.start(0) for x in re.finditer(r'\.', embed['src'])]
                if self.url in embed['src'] or self.domain in embed['src'] or len(dots) == 1:
                    success = success + 1
                i = i+1

            for iframe in self.soup.find_all('iframe', src=True):
                dots = [x.start(0) for x in re.finditer(r'\.', iframe['src'])]
                if self.url in iframe['src'] or self.domain in iframe['src'] or len(dots) == 1:
                    success = success + 1
                i = i+1

            try:
                percentage = success/float(i) * 100
                if percentage < 22.0:
                    return 1
                elif((percentage >= 22.0) and (percentage < 61.0)):
                    return 0
                else:
                    return -1
            except Exception:
                # No resources found (division by zero)
                return 0
        except Exception:
            return -1

    # 14. AnchorURL
    def AnchorURL(self):
        try:
            i, unsafe = 0, 0
            for a in self.soup.find_all('a', href=True):
                if "#" in a['href'] or "javascript" in a['href'].lower() or "mailto" in a['href'].lower() or not (self.url in a['href'] or self.domain in a['href']):
                    unsafe = unsafe + 1
                i = i+1

            try:
                percentage = unsafe / float(i) * 100
                if percentage < 31.0:
                    return 1
                elif ((percentage >= 31.0) and (percentage < 67.0)):
                    return 0
                else:
                    return -1
            except Exception:
                return -1

        except Exception:
            return -1

    # 15. LinksInScriptTags
    def LinksInScriptTags(self):
        try:
            i, success = 0, 0

            for link in self.soup.find_all('link', href=True):
                dots = [x.start(0) for x in re.finditer(r'\.', link['href'])]
                if self.url in link['href'] or self.domain in link['href'] or len(dots) == 1:
                    success = success + 1
                i = i+1

            for script in self.soup.find_all('script', src=True):
                dots = [x.start(0) for x in re.finditer(r'\.', script['src'])]
                if self.url in script['src'] or self.domain in script['src'] or len(dots) == 1:
                    success = success + 1
                i = i+1

            try:
                percentage = success / float(i) * 100
                if percentage < 17.0:
                    return 1
                elif((percentage >= 17.0) and (percentage < 81.0)):
                    return 0
                else:
                    return -1
            except Exception:
                return 0
        except Exception:
            return -1

    # 16. ServerFormHandler
    def ServerFormHandler(self):
        try:
            if len(self.soup.find_all('form', action=True)) == 0:
                return 1
            else:
                for form in self.soup.find_all('form', action=True):
                    if form['action'] == "" or form['action'] == "about:blank":
                        return -1
                    elif self.url not in form['action'] and self.domain not in form['action']:
                        return 0
                    else:
                        return 1
        except Exception:
            return -1

    # 17. InfoEmail
    def InfoEmail(self):
        try:
            # Search the page source for mail() or mailto: (alternation, not a character class)
            if re.findall(r"mail\(\)|mailto:?", self.response.text):
                return -1
            else:
                return 1
        except Exception:
            return -1

    # 18. AbnormalURL
    def AbnormalURL(self):
        try:
            if self.response.text == self.whois_response:
                return 1
            else:
                return -1
        except Exception:
            return -1

    # 19. WebsiteForwarding
    def WebsiteForwarding(self):
        try:
            if len(self.response.history) <= 1:
                return 1
            elif len(self.response.history) <= 4:
                return 0
            else:
                return -1
        except Exception:
            return -1

    # 20. StatusBarCust
    def StatusBarCust(self):
        try:
            if re.findall("<script>.+onmouseover.+</script>", self.response.text):
                return 1
            else:
                return -1
        except Exception:
            return -1

    # 21. DisableRightClick
    def DisableRightClick(self):
        try:
            if re.findall(r"event.button ?== ?2", self.response.text):
                return 1
            else:
                return -1
        except Exception:
            return -1

    # 22. UsingPopupWindow
    def UsingPopupWindow(self):
        try:
            if re.findall(r"alert\(", self.response.text):
                return 1
            else:
                return -1
        except Exception:
            return -1

    # 23. IframeRedirection
    def IframeRedirection(self):
        try:
            # Alternation between the two tags; square brackets would match single characters
            if re.findall(r"<iframe>|<frameBorder>", self.response.text):
                return 1
            else:
                return -1
        except Exception:
            return -1

    # 24. AgeofDomain
    def AgeofDomain(self):
        try:
            creation_date = self.whois_response.creation_date
            try:
                if(len(creation_date)):
                    creation_date = creation_date[0]
            except Exception:
                pass

            # Domain age in months
            today = date.today()
            age = (today.year - creation_date.year) * 12 + (today.month - creation_date.month)
            if age >= 6:
                return 1
            return -1
        except Exception:
            return -1

    # 25. DNSRecording
    def DNSRecording(self):
        try:
            creation_date = self.whois_response.creation_date
            try:
                if(len(creation_date)):
                    creation_date = creation_date[0]
            except Exception:
                pass

            today = date.today()
            age = (today.year - creation_date.year) * 12 + (today.month - creation_date.month)
            if age >= 6:
                return 1
            return -1
        except Exception:
            return -1

    # 26. WebsiteTraffic
    def WebsiteTraffic(self):
        try:
            rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + self.url).read(), "xml").find("REACH")['RANK']
            if (int(rank) < 100000):
                return 1
            return 0
        except Exception:
            return -1

    # 27. PageRank
    def PageRank(self):
        try:
            prank_checker_response = requests.post("https://www.checkpagerank.net/index.php", {"name": self.domain})

            global_rank = int(re.findall(r"Global Rank: ([0-9]+)", prank_checker_response.text)[0])
            if global_rank > 0 and global_rank < 100000:
                return 1
            return -1
        except Exception:
            return -1

    # 28. GoogleIndex
    def GoogleIndex(self):
        try:
            # search() yields results lazily; build a list so an empty result is falsy
            site = list(search(self.url, 5))
            if site:
                return 1
            else:
                return -1
        except Exception:
            return 1

    # 29. LinksPointingToPage
    def LinksPointingToPage(self):
        try:
            number_of_links = len(re.findall(r"<a href=", self.response.text))
            if number_of_links == 0:
                return 1
            elif number_of_links <= 2:
                return 0
            else:
                return -1
        except Exception:
            return -1

    # 30. StatsReport
    def StatsReport(self):
        try:
            # Hosts and IP addresses taken from a fixed blocklist
            url_match = re.search(
                r'at\.ua|usa\.cc|baltazarpresentes\.com\.br|pe\.hu|esy\.es|hol\.es|sweddy\.com|myjino\.ru|96\.lt|ow\.ly', self.url)
            ip_address = socket.gethostbyname(self.domain)
            ip_match = re.search(r'146\.112\.61\.108|213\.174\.157\.151|121\.50\.168\.88|192\.185\.217\.116|78\.46\.211\.158|181\.174\.165\.13|46\.242\.145\.103|121\.50\.168\.40|83\.125\.22\.219|46\.242\.145\.98|'
                                 r'107\.151\.148\.44|107\.151\.148\.107|64\.70\.19\.203|199\.184\.144\.27|107\.151\.148\.108|107\.151\.148\.109|119\.28\.52\.61|54\.83\.43\.69|52\.69\.166\.231|216\.58\.192\.225|'
                                 r'118\.184\.25\.86|67\.208\.74\.71|23\.253\.126\.58|104\.239\.157\.210|175\.126\.123\.219|141\.8\.224\.221|10\.10\.10\.10|43\.229\.108\.32|103\.232\.215\.140|69\.172\.201\.153|'
                                 r'216\.218\.185\.162|54\.225\.104\.146|103\.243\.24\.98|199\.59\.243\.120|31\.170\.160\.61|213\.19\.128\.77|62\.113\.226\.131|208\.100\.26\.234|195\.16\.127\.102|195\.16\.127\.157|'
                                 r'34\.196\.13\.28|103\.224\.212\.222|172\.217\.4\.225|54\.72\.9\.51|192\.64\.147\.141|198\.200\.56\.183|23\.253\.164\.103|52\.48\.191\.26|52\.214\.197\.72|87\.98\.255\.18|209\.99\.17\.27|'
                                 r'216\.38\.62\.18|104\.130\.124\.96|47\.89\.58\.141|78\.46\.211\.158|54\.86\.225\.156|54\.82\.156\.19|37\.157\.192\.102|204\.11\.56\.48|110\.34\.231\.42', ip_address)
            if url_match:
                return -1
            elif ip_match:
                return -1
            return 1
        except Exception:
            return 1

    def getFeaturesList(self):
        return self.features

def getfeatures(url):
    # Take URL input from the user
    #url = input("Enter the URL to analyze: ")
    # Create an instance of the feature extraction class
    feature_extractor = FeatureExtraction(url)

    features = []

    # Display the extracted features
    #print("Extracted Features:")
    for i, feature in enumerate(feature_extractor.features, start=1):
        #print(f"Feature {i}: {feature}")
        features.append(feature)

    return features
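
The UsingIp feature is meant to flag URLs whose host is a raw IP address, which means the host has to be separated from the rest of the URL before it is tested. The sketch below shows one way to do that with the standard library only. The function name uses_ip_host and the sample URLs are hypothetical and are not part of the project code; the return values follow the 1 / -1 convention of the listing above.

```python
# Hypothetical stand-alone version of the "UsingIp" check (illustrative only).
import ipaddress
from urllib.parse import urlparse


def uses_ip_host(url: str) -> int:
    """Return -1 if the URL's host is an IPv4/IPv6 literal, otherwise 1."""
    host = urlparse(url).hostname  # None when the URL has no network location
    if host is None:
        return 1
    try:
        ipaddress.ip_address(host)  # raises ValueError for ordinary host names
        return -1
    except ValueError:
        return 1


if __name__ == "__main__":
    for sample in ["http://192.168.0.1/login", "https://example.com/", "http://[::1]:8080/"]:
        print(sample, "->", uses_ip_host(sample))
```

Passing the whole URL string to ipaddress.ip_address would raise ValueError for every input, so the feature would never take the value -1.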

(d) ANN Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import metrics
import warnings
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
import pickle
warnings.filterwarnings('ignore')

data = pd.read_csv("phishing.csv")
data.head()

data = data.drop(['Index'],axis = 1)

X = data.drop(["class"],axis =1)
y = data["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

# Creating holders to store the model performance results
ML_Model = []
accuracy = []
f1_score = []
recall = []
precision = []

def storeResults(model, a, b, c, d):
    ML_Model.append(model)
    accuracy.append(round(a, 3))
    f1_score.append(round(b, 3))
    recall.append(round(c, 3))
    precision.append(round(d, 3))

mlp = MLPClassifier()
# fit the model
mlp.fit(X_train, y_train)

# Save the trained model so the API can load it
with open("modelANN.pkl", "wb") as file:
    pickle.dump(mlp, file)

y_train_mlp = mlp.predict(X_train)
y_test_mlp = mlp.predict(X_test)

acc_train_mlp = metrics.accuracy_score(y_train, y_train_mlp)
acc_test_mlp = metrics.accuracy_score(y_test, y_test_mlp)
print("Multi-layer Perceptron : Accuracy on training Data: {:.3f}".format(acc_train_mlp))
print("Multi-layer Perceptron : Accuracy on test Data: {:.3f}".format(acc_test_mlp))
print()

f1_score_train_mlp = metrics.f1_score(y_train, y_train_mlp)
f1_score_test_mlp = metrics.f1_score(y_test, y_test_mlp)
print("Multi-layer Perceptron : f1_score on training Data: {:.3f}".format(f1_score_train_mlp))
# Report the test-set score here (the training score was printed twice before)
print("Multi-layer Perceptron : f1_score on test Data: {:.3f}".format(f1_score_test_mlp))
print()

recall_score_train_mlp = metrics.recall_score(y_train, y_train_mlp)
recall_score_test_mlp = metrics.recall_score(y_test, y_test_mlp)
print("Multi-layer Perceptron : Recall on training Data: {:.3f}".format(recall_score_train_mlp))
print("Multi-layer Perceptron : Recall on test Data: {:.3f}".format(recall_score_test_mlp))
print()

precision_score_train_mlp = metrics.precision_score(y_train, y_train_mlp)
precision_score_test_mlp = metrics.precision_score(y_test, y_test_mlp)
print("Multi-layer Perceptron : precision on training Data: {:.3f}".format(precision_score_train_mlp))
print("Multi-layer Perceptron : precision on test Data: {:.3f}".format(precision_score_test_mlp))

print(metrics.classification_report(y_test, y_test_mlp))

metrics_names = ['Accuracy', 'F1 Score', 'Recall', 'Precision']

train_scores_mlp = [
    acc_train_mlp,
    f1_score_train_mlp,
    recall_score_train_mlp,
    precision_score_train_mlp
]
test_scores_mlp = [
    acc_test_mlp,
    f1_score_test_mlp,
    recall_score_test_mlp,
    precision_score_test_mlp
]

# Set up the bar width and x-axis indices
bar_width = 0.35
index = np.arange(len(metrics_names))

# Plot the scores
fig, ax = plt.subplots(figsize=(10, 6))
train_bars_mlp = ax.bar(index, train_scores_mlp, bar_width, label='Train Score', color='blue')
test_bars_mlp = ax.bar(index + bar_width, test_scores_mlp, bar_width, label='Test Score', color='orange')

# Set titles and labels
ax.set_xlabel('Metrics')
ax.set_ylabel('Scores')
ax.set_title('Multi-layer Perceptron Model Metrics Comparison')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(metrics_names)
ax.legend()

# Display the values on top of each bar
for bars in [train_bars_mlp, test_bars_mlp]:
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.3f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

# Show plot
plt.tight_layout()
plt.show()
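
The f1_score, recall_score and precision_score calls above use scikit-learn's default pos_label=1, so with the -1 / 1 labels of this dataset they describe the legitimate class (label 1). The sketch below recomputes the same three quantities from raw counts for a chosen positive label, which shows how the phishing-class scores could be obtained instead. It is an illustration only: binary_scores and the four-sample label lists are hypothetical and are not taken from the project data.

```python
# Hypothetical helper (illustrative only): precision, recall and F1 for a chosen
# positive label, computed from true/predicted label lists.

def binary_scores(y_true, y_pred, pos_label):
    """Return (precision, recall, f1) treating pos_label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up labels: 1 = legitimate, -1 = phishing
    y_true = [1, 1, -1, -1]
    y_pred = [1, -1, -1, -1]
    print("legitimate class:", binary_scores(y_true, y_pred, pos_label=1))
    print("phishing class:  ", binary_scores(y_true, y_pred, pos_label=-1))
```

With scikit-learn itself, the phishing-class figures would be requested by passing pos_label=-1 to the same metric functions.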
