
Across_the_Spectrum_In-Depth_Review_AI-Based_Models_for_Phishing_Detection

This article, accepted for publication in the IEEE Open Journal of the Communications Society, provides a comprehensive review of AI-based models for phishing detection, analyzing over 130 articles published between 2020 and 2024. It discusses the evolution of phishing attacks, traditional detection methods, and the effectiveness of machine learning and deep learning models, while identifying gaps and challenges in current research. The study aims to offer a roadmap for researchers and cybersecurity experts to enhance phishing detection strategies and improve overall internet security.

Uploaded by

Rinku Bathra

This article has been accepted for publication in IEEE Open Journal of the Communications Society.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/OJCOMS.2024.3462503

Received XX Month, XXXX; revised XX Month, XXXX; accepted XX Month, XXXX; Date of publication XX Month, XXXX; date of
current version 11 January, 2024.
Digital Object Identifier 10.1109/OJCOMS.2024.011100

Across the Spectrum In-Depth Review AI-Based Models for Phishing Detection
SHAKEEL AHMAD 1, MUHAMMAD ZAMAN 1 (Member, IEEE), AHMAD SAMI AL-SHAMAYLEH 3, TANZILA KEHKASHAN 1,4, RAHIEL AHMAD 1, SHAFI’I MUHAMMAD ABDULHAMID 5, ISMAIL ERGEN 2, ADNAN AKHUNZADA 6 (Senior Member, IEEE)
1 Faculty of Computer Science, University of Lahore, 10 KM Lahore-Sargodha Rd, Sargodha, Punjab 40100, Pakistan
2 Department of Fine art, design and Architecture, Faculty of Digital Game Design, Istinye University, Istanbul, 34396, Türkiye
3 Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Al-Ahliyya Amman University, Amman, 19328, Jordan
4 Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
5 Department of information technology, Science and Technology Division, Community College of Qatar, Qatar
6 College of Computing & IT, Department of Data and Cybersecurity, University of Doha for Science and Technology, Doha, 24449, Qatar
CORRESPONDING AUTHOR: MUHAMMAD ZAMAN, Email: ([email protected])
PRINCIPAL CORRESPONDING AUTHOR: Shafii Muhammad Abdul Hamid, Email: ([email protected])
The open access funding is provided by Qatar National Library. Besides, we also extend our appreciation for the necessary support of
Al-Ahliyya Amman University, Jordan.

ABSTRACT Advancement of the Internet has increased security risks associated with data protection
and online shopping. Several techniques compromise Internet security, including hacking, SQL injection,
phishing attacks, and DNS tunneling. Phishing attacks are particularly significant among web phishing
techniques. In a phishing attack, the attacker creates a fake website that closely resembles a legitimate one
to deceive users into providing sensitive information. These attacks can be detected using both traditional
and modern AI-based models. However, even with state-of-the-art methods, accurately classifying newly
emerged links as phishing or legitimate remains a challenge. This study conducts a comparative analysis
of more than 130 articles published between 2020 and 2024, identifying challenges and gaps in the
literature and comparing the findings of various authors. The novelty of this research lies in providing a
roadmap for researchers, practitioners, and cybersecurity experts to navigate the landscape of machine
learning (ML) and deep learning (DL) models for phishing detection. The study reviews traditional
phishing detection methods, ML and DL models, phishing datasets, and the step-by-step phishing process.
It highlights limitations, research gaps, weaknesses, and potential improvements. Accuracy measures are
used to compare model performance. In conclusion, this research provides a comprehensive survey of
website phishing detection using AI models, offering a new roadmap for future studies.

INDEX TERMS Anomaly Detection, Blocklists, Cyber-Attack Mitigation, Cybersecurity, Deep Learning
(DL), Machine Learning (ML), Phishing Detection, Threat Intelligence, Web Phishing Detection, Whitelists

I. Introduction

Web phishing is a cyber attack method in which attackers disguise themselves as trustworthy entities to extract sensitive personal information from individuals, such as usernames, passwords, and credit card details [1]. This is typically done through emails, instant messaging, or web pages that mimic real services. Generally, phishing aims to gain access to personal and financial information for fraudulent activities or sales on the dark web [2]. Understanding the basics of web phishing is crucial, as it forms the foundation for developing effective detection and prevention strategies [3].

Phishing attacks have evolved significantly over the years, becoming more sophisticated and complex to detect. Initially, phishing was relatively simple in design, involving general messages sent to multiple recipients at once, hoping that

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/



a small fraction would fall for it [4]. Such messages often included poor grammar and other obvious signs of fraud [5]. Today’s phishing techniques are highly targeted and personalized, using even minute details about the victim obtained from social media or other sources [6]. This approach, known as spear phishing, increases the chances of success due to its more convincing deceit [7].

Detection of web phishing involves identifying and mitigating malicious activities before they can spread [8]. Traditional detection mechanisms are highly dependent on blacklists, which are databases containing known phishing URLs that are blocked in web browsers and security software [9]. The effectiveness of blacklists is often limited, as they can only protect against threats that have been identified in the past, making them essentially reactive. This limitation has spurred the development of more advanced detection techniques designed to track down newer and emerging web phishing threats in real-time [10].

Machine learning has become a critical tool in the fight against phishing on the web. By analyzing large volumes of data, machine learning algorithms can identify patterns and anomalies associated with phishing attacks [11]. These algorithms can learn characteristics related to URL structure, domain age, website content, and email metadata to detect phishing attempts with high precision [12]. Thus, using machine learning not only enhances detection capabilities but also reduces false positives, thereby avoiding the misidentification of legitimate websites and emails as malware [13]. However, traditional and AI models often fail to detect newly emerged phishing links. Therefore, systematic literature reviews are essential for researchers to identify study gaps, evaluate the performance of existing models, and discuss current datasets [14].

The significant contributions of this study include the examination of methods to prevent phishing attacks, covering attack types, phishing processes, user behaviors, and prevention measures. It discusses both traditional and modern detection methods, analyzes existing solutions, and addresses their limitations. The key contributions include:

1) Discussing all possible phishing attack modes, techniques, and the effects of attacks.
2) Exploring attack processes, typologies, and anti-phishing solutions.
3) Presenting a comparative analysis of traditional as well as machine learning (ML) and deep learning (DL) based models.
4) Providing taxonomic classification of anti-phishing techniques.
5) Measuring the performance of models using accuracy to evaluate significant models for phishing detection.
6) Finally, presenting a plethora of promising future research directions.

The remainder of this paper is organized as follows: Section II discusses the literature review. Section III covers the research methodology. Section IV discusses different phishing attack methods along with the step-by-step phishing attack process. Section V presents how phishing works and different phishing techniques. Section VI explains different phishing detection methods based on webpage screenshots, while Section VII describes the different phishing detection datasets and their comparative analysis. Similarly, Section VIII presents different anti-phishing methods with their comparative analysis. Section IX presents model recommendations for phishing detection, highlighting which model is the best fit for phishing detection. Section X highlights open challenges and discussions related to phishing detection and research papers. In the last section, the conclusion is presented.

II. Literature Review

Internet use has become integral to people’s daily lives, making it difficult to envision a world without it. According to the Global Digital Population Survey Report (GDP) [15], published in 2023, approximately 5.3 billion people use the internet worldwide. Of these, 62% use social media. In the report [16], it is stated that 94.6% of these users have accessed the internet through smartphones. This connectivity has revolutionized life, including information exchange, online shopping, communication, and professional tasks. At the beginning of 2019, when the pandemic began, there were significant changes in traditional offline services. These services transformed from offline to online platforms, particularly in industries such as catering and retail.

In this digital era, individuals frequently share sensitive online data, such as login credentials, personal information, and credit card details. Unfortunately, cybercriminals exploit various illicit methods to acquire this information and subsequently engage in unauthorized activities on the Internet. Network security concerns have been present since the inception of the Internet, evolving in tandem with its development. In [17], the author proposed that the rapid evolution of network attack techniques poses significant challenges to cybersecurity. There are several categories of cybersecurity issues, classified based on attack methods and forms, including denial-of-service attacks (DoS), man-in-the-middle (MitM) attacks, SQL injection (SQL-Inj), zero-day (ZD) exploits, DNS tunneling, phishing, and malware. In [18], the author explained that the dynamic landscape of the Internet and its vulnerabilities necessitate ongoing efforts to enhance cybersecurity measures and protect users from potential threats.

In [19], the author explained that phishing attacks require tactful skills, including re-engineering, networking, coding, databases, and deep knowledge of protocols and how information is stolen from these protocols. In [20], the author explained how the attacker designs the phishing page linked to the database; the web form looks like the original and shares a link with the user using social media, SMS Gateway, and email. This sharing contains alarming messages and warning text, including misleading images, to attract and


induce the user to click the link. Phishing attacks have caused economic losses in the last 30 years. In [21], the author discusses the history of phishing, stating that phishing attacks increased dramatically during the 2019 pandemic. In that period, governments worldwide issued financial assistance to their citizens and started collecting sensitive data that contained bank accounts, credit card history, debit card history, and personal details to disburse funds.

Similarly, attackers launched the same kind of campaign to obtain data online from citizens. According to phishing attack statistics published in 2022, approximately 36% of data breaches are caused by phishing attacks, and 83% of citizens in the US experience phishing attacks [22]. This ratio increased from 80% to 345% from 2020 to 2021. Another report published in 2022 by a US-based organization [23] states that the number of phishing attacks doubled in 2022 compared to the 2019 pandemic due to the high success rate. In [23], the author proposed several methods to prevent phishing attacks, including technical staff education and training on daily email responses, SMS, WhatsApp, and social media material sharing. The objective of [24] is to survey recently published anti-phishing methods. Identifying a phishing website is a challenging task during the process of obtaining user information. Before the advent of artificial intelligence, researchers proposed several methods to identify phishing websites, including traditional methods such as whitelisting uniform resource locators (URLs) and blocklisting URLs [25]. In a whitelist, the listed URLs are considered legitimate, while all others are considered phishing. Similarly, blocklists contain all URLs with shortened forms, unnecessary strings, long lengths, unstructured formats, or ambiguous domains. Whitelists and blacklists are shared with the general public so that users can avoid visiting listed phishing URLs [26].

This approach protects the user from phishing attacks; however, it is not as effective because of the higher computational cost of matching string by string in a real-time environment. Moreover, this method could not identify shortened, modified, and long-string phishing URLs [27]. Another older method is rule-based phishing detection. In this method, rules are defined for web surfing. This type of detection requires expert knowledge of cybersecurity policies and web filtering. According to this method, the user must know how to implement rules and analyze whether a URL is phishing or legitimate [28]. After the introduction of ML and DL models, phishing detection and identification became more efficient, but there are drawbacks to traditional ML models. In these models, feature extraction must be performed manually to identify phishing pages. This means that humans must write the rules: if a given string, word, or signature does not validate, the URL is marked as phishing [29].

According to the World Wide Web Consortium (W3C) [30], a URL must contain elements such as the protocol, subdomain, domain, port number, database path, query parameters, and data that need to be retrieved. Older rule sets such as whitelists and blacklists check the registered domain of the URL: if the domain exists, the system passes the URL as legitimate; otherwise, it blocks it [26]. To establish the authenticity of a URL, the system needs to verify information such as the domain expiry and registration dates with a third party. Once the authentication rules are published, the attacker learns them and works within those rules to bypass the system. Thus, these older methods have not been successful in controlling phishing attacks.

Several models were used for anti-phishing after the introduction of ML and DL models. The phishing detection mechanism uses labeled data to classify phishing and legitimate websites. Different state-of-the-art models have been used for web phishing detection and identification in ML and DL [26], [31], [32], [33], [34], [35], [36], [37], [38]. The fundamental use of these models is to identify and classify phishing links correctly. Therefore, the models are differentiated based on their accuracy and computation time. Models with higher accuracy and lower computation time are considered the best models for detection.

Phishing attacks have increased dramatically due to the increased number of members on social media and online businesses. Therefore, cyber risks and threats have increased and need to be addressed appropriately [39]. The complex nature of hyperlinks makes it difficult for the human eye to distinguish original links from fake ones. Therefore, cybersecurity experts are paying more attention to the detection of counterfeit URLs. Phishers use advanced techniques and methods after learning the modern methods of ML and DL [29].

Several research papers have been published on phishing detection methods. In [40], the authors analyzed different phishing solutions based on different parameters. The authors discussed lists of phishing techniques used on other devices and provided countermeasures against phishing attacks in four major categories: AI-Anti-Phishing models, Classical methods based on different scenarios, and list-based. The authors concluded that the appropriate feature selection method gives a higher output for better results and that the model shows the highest accuracy compared to other AI models. However, the authors did not investigate other ML and DL methods proposed in [41], [42], [43], [44], [45], such as SVM, LSTM, NB, and other modern DL models, which can detect with accuracy rates from 99.00% to 99.62%.

In [46], the authors explained the two main types of phishing attacks, including social engineering and the use of malware. The authors also discussed feature extraction techniques based on some rules. However, they did not discuss the challenges associated with feature extraction, its limitations, which feature extraction technique is suitable for which ML or DL model, and how accurately the required features were extracted, as given in [36]–[47].

In [48], the authors present phishing attacks and categorize their solutions into three main types: URL-based,


content-based, and hybrid approaches. After carefully studying and comparing the proposed methods, they concluded that a hybrid approach is best for the detection and prevention of real-time phishing. However, they did not discuss the challenges associated with implementing the model and dataset described in these [49], [50] papers.

In [51], the authors explained different anti-phishing methods with phishing techniques. They discussed nine different datasets used for phishing detection methods. They also wrote about 18 different AI-based models and compared their results. They discussed various challenges and limitations of the models, such as precision and over-fitting. However, they did not discuss the methods to reduce the model over-fitting and improve accuracy, as discussed in these papers [51], [52].

The authors in [53] discussed different email phishing detection techniques, including email spoofing using modern ML methods and natural language processing (NLP). NLP and ML are used for feature extraction and to detect malicious email content. In [54], the author explained that the analysis is based on URL parts, page contents, and web page coding to find whether any tag or part of the code is modified or redirected elsewhere. Then, all models’ performance will be compared to find the best one. In [55], the authors did not clearly explain the feature detection methods. There are many ways to extract features, such as manual selection methods and applying ML and DL models. However, there is a problem with the ML and DL models because the analyst has to manually select features that will be useful for the current dataset only. In case the data set changes, the feature selection technique fails. Furthermore, the authors did not review the modern methods for feature selection.

The author reviews modern AI-based phishing detection models in [56]. This paper divided the detection methods into four categories: ML DL-based, scenario-based, hybrid approach, and list-based. The author fails to provide a more detailed review of AI models that can be used confidently for phishing detection. This paper lacks data processing and feature extraction techniques for phishing detection datasets.

In [57], they proposed a novel approach for detecting phishing websites by combining Support Vector Machines (SVMs) with nature-inspired optimization algorithms. SVMs are robust classifiers that aim to find an optimal hyperplane to separate data points into different classes. By integrating these algorithms, the researchers achieved promising results in identifying phishing URLs. The author and colleagues developed an anti-phishing browser that leverages the Random Forest algorithm and a rule-based extraction framework.

In [58], the authors proposed an RF method for the detection of Web phishing. The study is based on rule-based detection. The features extracted for modeling are based on the RF model. In [59], the author gives us a view of the different phishing techniques. This study is based on the hybrid approach for phishing detection, and features are extracted using XGBoost and Gradient Boost. The author used the same models for noise removal from the dataset. The author and collaborators explore the interaction of swarm intelligence and deep learning for phishing detection. Swarm intelligence draws inspiration from collective behavior observed in natural systems (e.g., ant colonies, bird flocks). Combining these principles with deep learning allows their I-BBA model to be observed [60].

In [61], the authors introduce a novel approach that combines support vector machines (SVMs) with nature-inspired optimization algorithms. SVMs are robust classifiers that aim to find an optimal hyperplane to separate data points into different classes. By integrating these algorithms, the researchers achieved promising results in identifying phishing websites. In [62], the study investigates techniques for detecting spoofed websites, which often mimic legitimate sites to deceive users. The authors explore machine learning models, feature engineering, and anomaly detection methods. Their work enhances the accuracy of identifying fraudulent web pages. Their study involves URL feature extraction, behavioral analysis, and model training. Taking into account the lexical and content-based features, they contribute to the development of robust detection mechanisms [63]. Another study [64] explores a comprehensive approach to web phishing detection. The authors combine web crawling techniques, cloud infrastructure, and deep learning frameworks. By analyzing web content, network traffic, and behavioral patterns, their model provides robust protection against phishing attacks.

The overview paper [65] by Scholar and the team critically examines existing methods for detecting phishing sites. They discuss zero-day attacks, adversarial evasion, and real-time detection challenges. In this work, the authors investigate the effectiveness of combining multiple machine-learning models for phishing classification. They achieve improved performance by leveraging ensemble techniques such as stacking or blending. Their study emphasizes the benefits of model fusion in security applications.

In [66], researchers and colleagues propose an ensemble model designed explicitly for detecting phishing intrusions from URLs. Their approach combines decision trees, random forests, and gradient boosting. By considering diverse classifiers, they enhance the robustness of their detection system. The research article [67] opens up a new dimension to the world: the DRL-BWO algorithm, optimized by Black Widow Optimization, for UAV networks. In addition, DRL incorporates an enhanced reinforcement learning-based DBN for the detection of intrusions in UAV networks. The BWO algorithm is applied to the parameter optimization of the DRL approach. It enhances the performance of intrusion detection in UAV networks, securing communication over the UAV.

A. Research Gap in Literature Review

During the literature review, we identified several research gaps that highlight areas where further investigation is


needed. A complete review of these research gaps is provided in Table 1.
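The literature review refers repeatedly to URL components (protocol, subdomain, domain, port number, path, and query parameters), to lexical URL features, and to whitelist and blocklist lookups. The following is a minimal sketch of how these three ideas might fit together, using only the Python standard library. The feature names, the two example lists, and the example hosts are assumptions made for illustration and are not taken from any of the reviewed papers; the subdomain split is deliberately naive.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical example lists; a real system would load curated, regularly updated feeds.
BLOCKLIST = {"login-verify.example.net"}
WHITELIST = {"www.example.com"}


def url_components(url: str) -> dict:
    """Split a URL into the components named in the text."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    return {
        "protocol": parts.scheme,
        "host": host,
        # Naive split: treats the last two labels as the registered domain,
        # which is wrong for multi-part suffixes such as "co.uk".
        "subdomain": ".".join(labels[:-2]),
        "domain": ".".join(labels[-2:]),
        "port": parts.port,
        "path": parts.path,
        "query": parse_qs(parts.query),
    }


def lexical_features(url: str) -> dict:
    """Compute a few simple lexical features (illustrative choices only)."""
    components = url_components(url)
    return {
        "url_length": len(url),
        "num_dots": components["host"].count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": components["protocol"] == "https",
    }


def list_lookup(url: str) -> str:
    """Exact-match list check; hosts on neither list stay 'unknown'."""
    host = url_components(url)["host"]
    if host in BLOCKLIST:
        return "blocked"
    if host in WHITELIST:
        return "allowed"
    return "unknown"


if __name__ == "__main__":
    example = "https://secure.login.example.com:8443/account/reset?user=alice&token=123"
    print(url_components(example))
    print(lexical_features(example))
    print(list_lookup(example))
```

The "unknown" outcome illustrates the reactive nature of list-based detection discussed above: a newly registered phishing host appears on neither list, which is where learned models over features such as these are expected to help.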

TABLE 1. The research gap review table summarizes the methods employed and the limitations identified in various studies focused on phishing detection. It highlights gaps such as the need for model adaptability, handling modern attack techniques, effective feature selection, dataset diversity, and real-world applicability across different approaches.

| Sr. No | Authors | Methods | Limitation |
|---|---|---|---|
| 1 | Abad et al. [25] | SVM, RF, DT, KNN, RNN, NB, Optimizer, DRLSH, BPLSH | There is a lack of explanation on model adaptability with multiple doors, handling of modern attack techniques, and real-world dataset implementation. |
| 2 | Anupam et al. [31] | SVM, BAT, WA | In this paper, the author did not address the issues of imbalanced data, feature engineering, or new phishing techniques. |
| 3 | Shahrivari et al. [32] | DT, LR, KNN, ANN, RF, Ad | Lack of feature selection/extraction methods, lack of exploration of deep learning techniques, and inadequate real-world scenario applicability. |
| 4 | ALSARIERA et al. [34] | ABET, BET, RoFBET, LBET, ANN, SVM, RF, KNN, SVM, DT | There is a need to explore diverse datasets beyond Mendeley, scalability assessment for larger datasets, hyperparameter impact analysis, and adaptation to evolving threats. |
| 5 | Lokesh & BoreGowda [38] | RF, KNN, DT, L-SVC | Lack of specific feature discussion, detailed algorithm comparison, real-world dataset robustness, and adaptation to emerging phishing techniques. |
| 6 | Butt et al. [41], [42], [68] | LSTM, SVM, NB, ISHO, WARM, Firefly, BAT | Lack of comparative analysis, absence of discussion on URL feature selection, and limited applicability to specific datasets. |
| 7 | Zamir et al. [47] | [RF], [NN], NB, KNN | Lack of integration methods for various data sources repetition of existing models from previous research. |
| 8 | Jovanovic et al. [60] | XGBoost, WI, MOFA | In this study, the author discussed only selected features and did not discuss handling blank images and shortened URLs. |
| 9 | KARIM et al. [63] | DNN, RF, NB, Na | Absence of generalization to evolving phishing methods, inadequate feature extraction for large datasets, and limited real-world scenario assessment. |
| 10 | Zieni et al. [64] | list-based, similarity-based, and machine learning-based | Inadequate handling of imbalanced datasets, lack of feature details for classification, and focus on controlled experiments over real-world scenarios. |
| 11 | Shaukat et al. [65] | SVM, RF, MP, XGBoost | Limited to three datasets, lacks a universal feature extraction method, and inadequate analysis of ML and DL models. |
| 12 | Korkmaz et al. [69] | XGBoost, RF, LR, KNN, SVM, DT, NN, NB | Missing exploration of mitigation strategies for detected phishing attacks. |
| 13 | Adebowal et al. [70], [71] | LSTM, CNN, IPDS Classifier | Uncertainty on adaptation to new phishing methods, lack of automated feature selection for large datasets, and inadequate real-world scenario assessment. |
| 14 | Maci et al. [72] | DL, DRL, MDL, ICMDP | Unclear performance with increasing features and large datasets, lack of exploration on additional features’ impact, and absence of real-time scenario evaluation. |
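Table 1 lists imbalanced data among the recurring limitations, while the reviewed studies compare models mainly by accuracy. The sketch below, written in plain Python, illustrates why accuracy alone can be misleading on an imbalanced dataset. The label encoding (1 for phishing, 0 for legitimate) and the 95/5 class split are hypothetical values chosen for illustration, not figures from any reviewed paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced sample: 95 legitimate URLs (0) and 5 phishing URLs (1).
y_true = [0] * 95 + [1] * 5
# A trivial "classifier" that labels every URL as legitimate.
y_pred = [0] * 100

scores = evaluate(y_true, y_pred)
print(scores)  # accuracy is 0.95 even though no phishing URL is detected (recall 0.0)
```

Under these assumptions the trivial classifier reaches 95% accuracy while missing every phishing URL, which is why reporting precision and recall alongside accuracy is a reasonable design choice when classes are imbalanced.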

III. Research Methodology


The literature review section of this article is based on various research papers, including surveys, reviews, and original research articles. These articles were selected from high-impact journals published between 2020 and 2024 as shown in Figure 1 and Figure 2. The selection process began with a broad query as shown in Figure 3 using the general keyword "Web Phishing Detection," which generated a list of numerous relevant papers.

We filtered the papers to focus on those more specific to the domain. Initially, the query returned approximately 3,500 articles. After narrowing the selection to more specific documents, we identified about 500 articles relevant to the query. However, many of these documents were written before 2020. By refining the query, we ultimately selected the most recently published papers from 2020 to 2024. These articles include 100 original research articles, 30 survey articles, and 20 review articles.

FIGURE 3. The research process flow diagram for web phishing detection outlines the systematic steps involved in identifying, filtering, and selecting academic papers related to web phishing detection. It begins with a keyword search, progresses through various filtering stages, including filtering by year, and concludes with the download of selected papers. This process ensures a focused and relevant literature review.

A. Phishing Attack Modes


1) Using Electronic Mail
Nowadays, almost everyone has an email address to ex-
change data and correspondence. According to an article
published in 2024, about 3.4 billion emails are sent daily,
and 1.2% of them contain malicious URLs [73]. Another
report states that approximately 96% of these emails contain
phishing URLs [73]. Users are often attracted to these emails
FIGURE 1. Illustrates the distribution of papers across different because the content appears desirable, leading them to visit
categories: original papers, review papers, and survey papers. the links provided [74]. Some emails offer incentives, while
others claim that ”your password has been hacked; please
reset your password using the following link.” Upon reading
these emails, users might click on the link and unknowingly
provide their credentials directly to phishers [75]. Phishers
then use these credentials for illegal purposes, such as cash
withdrawal. According to a report published in 2022 by the
Federal Bureau of Investigation (FBI), about 10.3 billion
USD was lost due to email phishing [76].

2) Using Social Networks


The world has become a global village due to the invention
of technology and the rise of social media. People from all
over the world connect on social media platforms, share
FIGURE 2. Illustrates that the contents are sourced from the most recent images and thoughts, sell products, and launch business
publications from the years 2020 to 2023. campaigns to promote their businesses [76]. These activi-
ties make social networks an effective tool for hackers to
reach their target audiences [77]. Hackers may create fake
IV. Phishing Attack Methods, Step-by-Step Process, discount sales campaigns and include phishing URLs in the
Effects and Techniques descriptions, requiring users to fill in sensitive information,
Phishing attack refers to various techniques to share mali- such as bank, credit card, or debit card details. As a result,
cious link share with end users. There are several methods hackers obtain the necessary information, leading to financial
described below. loss and mental distress for users. According to a CBS

institution report, about $4.1 billion was lost due to social media phishing attacks [78].

3) Using SMS (Short Message Service)
Phishing attacks have increased due to changes in the way people communicate. Since the invention of the smartphone, social communication modes have evolved [79]. People enjoy sending SMS messages to their loved ones to stay in touch because it is an inexpensive communication method. Similarly, companies send SMS messages to promote product campaigns to end users [80]. Using the same strategy, phishers send SMS messages containing links that request the recipient to fill in personal information. Sometimes these messages claim to be surveys, while other times they announce the launch of a new product. The user may click the link, believing that they will receive a significant discount, and then fill in all the required fields. Consequently, all sensitive information is transferred to phishers, who use it for illegal activities, leading to financial loss for the user [81].

4) Using Live Messengers
There are other ways to communicate with friends and family through live messengers such as Yahoo Messenger, Y-Mail Messenger, and Hotmail Messenger. People share their location, pictures, documents, and sensitive information via these platforms. Phishers often pretend to be company representatives, sending phishing links and requesting recipients to fill in their information to participate in a supposed lucky draw. As a consequence, users may face financial loss and account termination [82].

5) Using Blog Posts and Community Forums
Blog posts and community forums are websites where people share their thoughts and problems, discuss issues, and get feedback from others. They are widely used for sharing information and completing surveys, forms, and other details. Attackers may create fake surveys, pretending to be the forum owner, to obtain members' details, which are later used for illegal purposes [83].

V. How Web-Phishing Works
Phishing attacks differ entirely from hacking or gaining unauthorized access through various means. In phishing attacks, the phisher is typically a technical person with deep knowledge of web development, SQL tools, machine learning, and the creation of fake web pages. Phishing attacks involve several steps, as shown in Figure 4 and outlined below.
1) Develop Strategy: First, phishers develop a strategy to identify and target a specific community.
2) Development of Phishing Website: Once the target is identified, the next step is to develop a website that closely resembles the original.
3) Domain Registration: The developed website is then hosted on a domain using a URL similar to the original one, with minor modifications, for example, changing https://www.bankalfa.com.pk to https://www.bankalfaa.com.pk.
4) URL Shortening: After the website is hosted, the next step is to shorten the URL to share with the target audience.
5) Forward Phishing URL: Once the URL is shortened, it is forwarded using various means, such as social media, email, SMS, WhatsApp, and other sources.
6) User Click: The phishing link contains attractive details, content, offers, and language that persuade the user to click. Users then fill in the requested information, which is transferred to the phisher's web server.
7) Collect Required Data: When users click the link, they are prompted to fill in information that is directly saved on the phisher's web server. Phishers then use this data for other purposes.
8) Illegal Use of Data: Phishers collect data and use it for illegal purposes, resulting in financial loss and compromise of online accounts.

FIGURE 4. It illustrates the typical steps of a phishing attack. It begins with the attacker sending a phishing email to the target, followed by the target clicking on a phishing link. This action leads them to a fake website where their credentials are collected. Ultimately, the attacker uses these credentials to access private information on the original website.

A. Phishing Attack Damages
A phishing attack is a hacking activity in which the phisher obtains personal information accessed via URL, which may cause the following effects on the user:
1) By gaining unauthorized access, phishers cause financial damage to the end user without consent.
2) It may tarnish the user's online reputation, leading to reduced business opportunities.
3) Phishing attacks diminish trust in companies, resulting in decreased customer engagement and business activities.

B. Possible Phishing Techniques
Phishing attacks employ various methods to deceive users and obtain sensitive information. The following are some possible techniques used by phishers, as shown in Figure 5:


FIGURE 5. Phishing Attack Methods diagram categorizes various phishing techniques into four main groups: Social Manipulation, System-Based Methods, Using Mobile Devices, and Other Phishing Methods. Each category lists specific types of attack, highlighting the diversity and complexity of phishing tactics used to compromise security across different platforms.

1) Social Manipulation
Social manipulation [84] involves tricking individuals into sharing personal information without realizing the risk of being hacked. This often involves following a URL to provide information, thinking it is necessary to secure their account. For example, receiving an email that appears to be from a bank asking for account confirmation can prompt a person to click on a link and fill out a form, believing that they are providing accurate information to their financial institution. Phishers exploit this by counting on the user to click on the URLs, leading to unauthorized access.

a: Web Phishing Tricking
Phishers create fake web pages that closely resemble legitimate ones to deceive users. These fake pages often mimic login screens or other interactive elements to trick users into divulging sensitive information such as usernames and passwords.

b: Web Phishing Cloning
This phishing attack involves cloning web pages to resemble the original site. Phishers use online or offline tools to create these fake web pages [85]. They primarily focus on replicating login pages to ensure users believe they are on the original site rather than the phishing page. The cloned site can be detected visually and using browser security features, which may alert users about the phishing page [86]. Therefore, phishers often remove identification tags, codes, divs, and frames from the phishing web page to avoid detection [23].

c: Email Tracking
Another method of obtaining user details is by email. Phishers might send two types of email to end users:
1) The attacker designs a web page that closely resembles the original, including visuals such as graphs and pictures. The phishing URL is hidden using JavaScript, which displays the legitimate URL instead, so that the user is not alerted and the page appears legitimate.
2) Attackers send download links for registered software, which include malware. Once downloaded and installed, the malware begins sharing information with the phisher, compromising the user's system.

d: Spear Phishing
In this attack, the phisher targets an individual from a reputable organization. The attacker monitors the target's social media accounts to learn their schedule. Once enough information is gathered, the phisher contacts the target via email, pretending to be a company manager, and requests them to fill out a form for an urgent meeting. The target, believing the request is legitimate, shares sensitive information with the attacker. Using spear phishing methods, several prominent authorities have been attacked, suffering significant financial losses and data breaches [87].

2) System-Based Methods
The following are the main types of system-based phishing attacks:

a: Ransomware Virus
Ransomware is a modern type of malware that deeply affects users, causing financial and sensitive data loss [88]. In this attack, phishers send harmful links containing ransomware. Once the user clicks on the link, the malware is downloaded and installed on the target computer. After installation, all files are encrypted, and a pop-up displays the attacker's account details for data recovery [89].

b: Trojan Horse Virus
The Trojan horse is another type of malware similar to ransomware, but it functions differently. The attacker sends a Trojan via email, text message, or WhatsApp link to download media or applications [90]. The Trojan then installs on the target device, running in the background without user knowledge [90]. Once installed, it sends sensitive data, including bank details, to the attacker [91].

c: Content Injection
In content injection attacks, also known as cross-site scripting (XSS), attackers exploit website vulnerabilities to add harmful code to web pages [92]. This code, often JavaScript, can perform dangerous actions such as stealing cookies, session tokens, or other sensitive user data stored in the browser [92]. Once inserted, the code can capture what users type in forms or redirect users to fake pages that resemble legitimate websites. According to reports, about 2.0 million websites suffer from content injection attacks [49], making it a prevalent online threat. These attacks often occur due
to poor input validation or lack of sanitization, allowing attackers to inject malicious scripts into trusted sites. To mitigate these risks, web developers should implement security measures such as Content Security Policy (CSP), input validation, output encoding, and regular security audits to detect and fix vulnerabilities before exploitation [93].

d: Keylogger / Screen Logger
A keylogger is an application installed on a system that tracks user keystrokes [94]. This malicious software can be installed physically or remotely on the target system without the user's knowledge. Once installed, the software tracks keystrokes and website visits, sending the data to the phisher's email or configured location.

e: Communication Crack Phishing Attack
The communication crack phishing exploit targets vulnerabilities within wireless networks, such as open or public Wi-Fi [95]. The attacker conducts a man-in-the-middle attack using tools like Wireshark or SSLstrip to capture and decrypt traffic between users and the organization's servers [96]. Users are drawn to a rogue access point that mimics a real network, capturing sensitive information, including login credentials and financial data [97]. Alternatively, the attacker may redirect users to phishing sites for credential harvesting or inject malware during communication [98]. The attacker exploits weaknesses in encryption, such as the KRACK vulnerability in WPA2 protocols, to decrypt and modify network traffic [99]. Mitigating such risks involves implementing robust encryption, regularly updating security protocols, and educating users about the risks of connecting to open Wi-Fi networks and recognizing phishing attempts [100].

3) Using Mobile Devices
Over the decades, mobile device use has become widespread. Today, phones are primarily used for communication, information sharing, and online shopping. Phishers generate phishing links and share them via SMS, email, WhatsApp, and third-party mobile apps [101]. When users click on these links and provide information, their devices are compromised. The attacker gains control over the mobile device, leading to misuse. Several mobile phishing methods are listed below.

a: Smishing
Smishing is a blend of "SMS" and "phishing," hence the name. Attackers send SMS messages with malicious links, prompting users to click and fill out forms [102]. For example, if the government announces financial assistance for needy people, phishers might generate messages with phishing links requesting sensitive personal, banking, and other information [103]. Users, thinking the message is from a legitimate agency, share sensitive information without verifying its authenticity, allowing phishers to collect the information.

b: Mali-App
Mali-App stands for malicious applications, referring to apps not verified by Google's security mechanisms [104]. These third-party apps can harm mobile devices [105]. Phishers gain unauthorized access through these apps and start collecting data and sensitive information.

c: Vishing
Similar to smishing, vishing is a type of phishing attack where attackers use voice manipulation to mimic legitimate voices [106]. The attacker sets up a VoIP platform and uses voice changer software to imitate an authentic voice. For example, an employee of a reputable company may receive a call from someone pretending to be their manager, asking for critical login details. This method is often successful due to the high level of trust involved.

d: Wi-Fi Phishing
Wi-Fi phishing is a modern phishing method involving open Wi-Fi hotspots. While users access the internet through these hotspots, attackers monitor traffic for valuable information. Alternatively, users may be asked to fill out a registration form, which includes personal information stored in the phishing system [49].

4) Other Phishing Methods

a: Compromised Server
In this phishing attack, the phisher hacks a target website's server and uploads a malicious toolkit. The phisher silently controls the server and hosts a similar web page to divert users, making them believe it is legitimate [107]. By compromising the server, hackers save on hosting costs. A study [108] found that about 76.5% of phishing websites are hosted on compromised servers.

b: Phishing Using Botnets
Botnets [109] are networks of computers connected to perform specific tasks. These computers ensure the smooth operation of applications like VoIP and chat systems. However, phishers may insert malicious applications into one system, sending numerous emails from the organization's systems. This type of attack is particularly dangerous and difficult to detect.

c: DNS Injection Attack
Phishing attacks that rely on DNS manipulation have become more sophisticated and dangerous. Many fake websites are hosted, and traffic is diverted using DNS injection methods [108]. Once the DNS cache becomes infected, it starts transmitting data to malicious URLs.


5) Social Networking Phishing Methods
Social networking sites are significant targets for phishing attacks. The following methods are used to exploit these platforms.

a: Malicious URL Sharing
Social media has become integral to daily life, especially for selling products and launching campaigns. Attackers design malicious URLs and share them with members, friends, or company contacts, asking for information related to meetings [50]. This information may include system usernames and passwords. The relevant group interacts with the URL, providing the requested information, which is then transferred to the attacker's account.

b: Masked URLs
In this attack, the phisher shares a URL while pretending to be the admin of a social media group [50]. The URL links to a dummy form that requests sensitive information. Once group members fill out the form, the attacker uses the information for unauthorized access [24]. Attackers may send messages from hacked accounts or even request loans, pretending to be legitimate contacts [110].

c: Forged Profile
Another deceptive method is using a fake profile. The attacker targets a prominent social media figure, monitoring their profile, posts, and comments daily. After gathering enough information, the attacker creates a similar profile, mirroring the content, and adds new friends to deceive others.

VI. Phishing Detection Based on Webpage Screenshot
Screenshot-based phishing detection is a novel approach that utilizes visual analysis techniques to identify fraudulent web pages [111]. In this technique, a screenshot of a webpage is analyzed for visual elements such as logos, text layout, color schemes, and design patterns to determine whether the website is genuine [112]. Machine learning algorithms are employed to compare the screenshot against legitimate and phishing sites within a database, acting as image processors for anomaly detection or identifying potential threats [113]. This technique is particularly useful for detecting phishing attempts that visually replicate a real website, which might otherwise go unnoticed by traditional text-based detection methods [114]. Additionally, the approach can be integrated with other detection techniques, such as URL analysis and text feature extraction, to enhance the overall accuracy and reliability of phishing detection systems [115]. Phishing detection through screenshots offers powerful protection against growing phishing attacks by focusing on the visual content of web pages [115].

VII. Web Phishing Detection Datasets
In web phishing detection, the latest datasets play a vital role in model training and detection. Several datasets are available online for phishing detection, varying in size from small to large. As phishing attacks continue to increase, there is a growing need for advanced models for detection. A well-trained model relies on the latest datasets with a comprehensive set of features. Table 2 provides details of some existing phishing datasets. In this section, we will review existing datasets, their features, and the models used for feature extraction using these datasets, as well as their advantages and drawbacks.

TABLE 2. A comprehensive list of datasets used in phishing detection research, categorized by their nature (Legit or Phisher), update year, and accessible URLs for further exploration.

Dataset Name | Update Year | Type | URL
Phishing-Tank [116], [117] | 2024 | Phisher | https://phishtank.org/developer_info.php
Alexa-dataset [117] | 2022 | Legit | https://www.similarweb.com/website/alexa.com/
Wein-dataset [118] | 2021 | Phisher | https://jeowein.net/
Crawl-dataset [119] | 2021 | Legit | https://commoncrawl.org/
Open-Phish-dataset [119], [120] | 2021 | Phisher | https://openphish.com/
Phishing-army dataset [119], [120] | 2021 | Phisher | https://www.phishing.army/
Kaggle-phishing dataset [121] | 2021 | Legit | https://www.kaggle.com/datasets/shashwatwork/phishing-dataset-for-machine-learning
UCI-Dataset [51] | 2022 | Phisher | https://data.world/uci/phishing-websites
Parsed-dataset [122] | 2022 | Legit | https://doi.org/10.7910/dvn/omv
Yahoo-Phishing [123] | 2022 | Legit | https://webscope.sandbox.yahoo.com/
Yandex-phishing [123] | 2022 | Legit | https://Yandex.com/dev/xml/
Phished-dataset [121] | 2022 | Phisher | https://www.medien.ifi.lmu.de/team

• Phisher-Tank Dataset: The Phisher-Tank dataset contains almost 2 million entries from phishing websites that are blocklisted on the internet. Approximately 90% of these websites are offline due to removal from internet sources, while 11,000 remain active as phishing
sites. This dataset is compiled and maintained by the Talos group of companies. The Phisher-Tank also offers an online service where users can paste a web URL to check its legitimacy. If a URL is identified as phishing, it is automatically added to the Phisher-Tank database.
• Alexa Dataset: Alexa provides an online system that analyzes website performance, checking how efficiently a website operates and whether it contains hidden phishing links. This service, launched and controlled by Amazon, helps identify and report phishing sites.
• Wein Phishing Dataset: The Jet-Wein phishing dataset is an open-source system that uses an API to blocklist phishing websites. This dataset contains approximately 15,000 phishing URLs.
• Crawl Dataset: The Common Crawl dataset is generated using a web crawler. Once URLs are crawled, they are tested with a phishing detector to identify phishing URLs. This dataset contains over 10,000 entries.
• OpenPhish Dataset: OpenPhish is a free open-source platform for web analysis and phishing detection. This system includes almost 19 million phishing links.

A. Feature Extraction Methods

1) Dynamic Feature Extraction Method
In phishing detection, accurate model performance relies on extracting relevant features from the data, enabling the model to train effectively and differentiate between phishing and legitimate websites. Researchers [66], [85], [117] have proposed a dynamic feature extraction method based on feature weights. They extracted 17 features from the dataset, categorized into three groups: address-based features, script-based features, and tag-based features. The paper [85] discusses automatic feature extraction from the URL and address bar without using third-party tools. However, the WHOIS database is used for domain name and registration data. Additionally, page scores are extracted from the Google rank database. After extracting the features, they are weighted based on their scores and averages, which are then used to train and test the model.

2) Machine Learning Feature Extraction
In 2015, researchers proposed a new dataset containing over 11,000 instances with 30 features. These features were extracted using machine learning models to improve phishing detection accuracy.

3) Feature Extraction Using NLP (Natural Language Processing)
The integration of NLP in machine learning has enabled researchers to extract features from phishing URLs more effectively. Character-level characteristics are extracted using machine learning models, classified, and used for model training [124]. NLP facilitates feature extraction from datasets at the character level, allowing the model to use the TF-IDF method for feature scoring. Once extracted, these features are used for model training.

4) Recursive Feature Elimination Method
Known as RFE, this method was introduced in [47]. It involves extracting all features and removing weak ones based on a threshold value.

5) Using Principal Component Analysis Method
Proposed in [47], this method begins with preprocessing, followed by the selection of features after removing redundant and unwanted ones. Techniques like median filtering or adaptive thresholding are commonly used. The wavelet packet transform (WPT) is another method that can be applied in this technique [125].

6) Using Information Gain Method
Information Gain is a popular method for feature extraction in phishing datasets. As discussed in [47], this method uses probability functions to identify vital features based on probability scores, selecting features that meet algorithmic criteria and discarding others.

7) Relief Ranking Filter Method
Used in [126], this algorithm extracts features based on a near-neighbor score (NNS) algorithm. First applied to the UCI dataset, features are scored, compared with near nodes, and selected using the NNS algorithm. This method identified 22 features from the UCI-Phishing dataset.

8) FRS (Fuzzy Rough Set) Feature Extraction Method
This algorithm, related to rough set theory, identifies related data points and compares nodes and classes to discern between them. For example, it compares every feature of a phishing website with another to check legitimacy. Features are extracted based on class matches, represented as 0 or 1 in the original UCI phishing dataset [126], [127].

9) El-Rashidy Feature Extraction Method
First introduced by El-Rashidy in [128], this algorithm operates in two steps. Initially, it extracts features from the dataset and trains the model using a random forest (RF) model. During training, it assesses accuracy and removes features with lower accuracy. After training, it refines features and selects high-accuracy features for further testing. While effective for small datasets, this method is less suitable for large datasets due to computational costs.
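
Of the selection methods listed in this subsection, Information Gain (item 6) is perhaps the simplest to illustrate. The following is a minimal sketch using the standard entropy-based definition of information gain, not the specific procedure of [47]. The toy feature names (`has_ip`, `has_https`), the toy labels, and the selection threshold are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(features, labels, threshold):
    """Keep the names of features whose information gain reaches the threshold.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    return [name for name, values in features.items()
            if information_gain(values, labels) >= threshold]


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 0, 0]
features = {
    "has_ip": [1, 1, 0, 0],     # perfectly separates the two classes
    "has_https": [1, 0, 1, 0],  # carries no information about the class
}
selected = select_features(features, labels, threshold=0.5)
print(selected)
```

In this toy case the first feature removes all uncertainty about the class and the second removes none, so only the first is kept. On a real dataset the scores would fall between these extremes and the threshold would need tuning.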


B. URL-Based Feature Extraction Method appears in the code. The analysis checks for any tampering
This method involves extracting URL properties using var- or unusual information outside the normal structure [136],
ious techniques. These properties include URL syntax, do- [137], [138]. These features are used for model training.
main name, registration and expiration dates, website age, Additionally, form-related tags are identified and used as fea-
hosting server location, IP address, and DNS details. Ex- ture parameters in the training process. If a website contains
tracted features help identify whether a URL is legitimate graphical visuals, properties related to those graphics are also
or phishing. Four main categories of URL feature extraction extracted to identify manipulated pages. These features are
exist: crucial for phishing detection [129].

a: Structure-Based Properties
This category examines URL structure, syntax, communication protocol, domain name, hosting server, and discriminating tokens like ',' and '.'. Features are extracted through tokenization, removing URL elements. Another method [129] involves bag-of-words comparison to identify matching URLs, extracting relevant features while discarding others. Binary notation determines domain or property length, returning 1 for matches and 0 otherwise.

b: URL Pattern Properties
This method analyzes URL statistics, such as length, domain name, hosting validity, and expiry dates. Extracted statistics are compared with legitimate URL statistics to identify relevant characteristics [130]. Term and inverse URL frequencies are calculated for feature matching. Frequency-level word counts are compared to determine matching features, using the Jaccard Index Pairwise (JIP) method to establish feature associations.

c: Domain-Based Properties
This method extracts features related to the domain name, registration and expiry dates, and other URLs using third-party tools like WHOIS, WHOAMI, and DNS-LOOKUP. Information obtained from external sources is used as model training features.

d: Ranking-Based Properties
Phishers often target well-known websites due to their high traffic volumes, increasing the likelihood of encountering unsuspecting victims. Consequently, website ranking is a critical feature during feature extraction in phishing detection systems.

1) Hypertext-Based Features
These features relate to the website's source code and include different HTML tags and forms. They are significant in training data for identifying parsed HTML or phishing HTML pages. Several studies focus on these features, including [131], [45], [127], [128], [130], [132], [133], [129], [134], [135]. There are three main types of these features:

a: Text-Based Properties
This property involves analyzing the complete web page code, including the HTML tags. First, tags are extracted, and a frequency table is used to determine how often each tag

b: Visual-Based Properties
Visual properties also play a role in phishing detection. Features such as size, width, height, brightness, and darkness [139] are extracted for model training. This helps identify malicious images on websites [132].

2) Filter Method for Feature Extraction
The filter method for feature extraction is based on statistical calculations and classifications using techniques like Chi-Square [138], Information Gain [140], Gain Scores, Correlation Method [141], [142], [143], and the Fisher method [142], [143].

3) Wrapper Method for Feature Extraction
This method selects features using machine learning models such as genetic algorithms, greedy forward selection, and classifiers [144]. It is used to determine 177 optimal features from the web page source and URL [145].

4) Tokenization and Vectorization
Tokenization is one of the techniques for feature extraction. It translates a single string into a sequence of one or more non-empty substrings. This method has been implemented to identify malicious URLs in previous work [142], [143]. The selected 10,000 benign URLs and 10,000 malicious URLs make up data1. Words such as "com" in URLs do not add value, so they are removed before tokenization. Tokenization is done using the special characters slash, dash, and dot. After tokenization of the selected URLs, the data is converted into a sparse matrix vector for machine learning [146].

5) Character N-grams
The character N-grams extracted from URLs are overlapping sequences of N consecutive characters, where N represents the length of the character substring and varies between 1 and 10. For example, the first three bigrams of the URL "example.com" are "ex", "xa", "am". This is much richer than the bag-of-words approach used by researchers in [11], as it captures punctuation, misspellings, etc. in the URLs. Table 3 provides a comparative overview of different feature extraction methods.
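The tokenization, vectorization, and character N-gram steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the cited studies: the function names, the stop-token list, and the dictionary-based sparse representation are all assumptions made for the example.

```python
import re


def tokenize_url(url, stop_tokens=("com",)):
    """Split a URL on slash, dash, and dot; drop empty tokens and stop tokens.

    The stop-token list is an assumption; the text only names "com".
    """
    tokens = re.split(r"[/\-.]", url)
    return [t for t in tokens if t and t not in stop_tokens]


def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def vectorize(token_lists):
    """Build a vocabulary and one sparse count vector (a dict) per URL.

    Each row maps a vocabulary index to a token count, so absent tokens
    take no space. This stands in for a sparse matrix.
    """
    vocab = {}
    rows = []
    for tokens in token_lists:
        row = {}
        for tok in tokens:
            idx = vocab.setdefault(tok, len(vocab))
            row[idx] = row.get(idx, 0) + 1
        rows.append(row)
    return vocab, rows


if __name__ == "__main__":
    print(char_ngrams("example.com", 2)[:3])  # ['ex', 'xa', 'am']
    print(tokenize_url("login-secure.example.com/account"))
```

The bigram call reproduces the "ex", "xa", "am" example from the text. In practice a library vectorizer would replace the hand-written `vectorize` helper.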


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
TABLE 3. The table categorizes extracted features from various studies into different properties such as lexical, statistical, network, reputation, textual/visual, and traffic. It lists the feature types, the number of features extracted, and the third-party services used for each category, providing a comprehensive overview of methods and tools applied in URL-based phishing detection research.

| Feature | Paper | URL-based | HTML-based | No. of features | Third-party services |
| Lexical properties | Yassine et al. | ✓ | ✓ | 112 | WHOIS, Common Crawl |
| Lexical properties | Liu et al. | ✓ | ✓ | 25 | WordNet, Google Search |
| Lexical properties | Al-Rfou et al. | ✓ | ✓ | 120 | WordNet, Google Search |
| Statistical properties | Abualigah et al. | ✓ | ✓ | 10 | None |
| Statistical properties | Li et al. | ✓ | ✓ | 18 | None |
| Statistical properties | Wang et al. | ✓ | ✓ | 12 | SimilarWeb, Alexa ranking |
| Network properties | Alsuwaihli et al. | ✓ | ✓ | 14 | Social network analysis |
| Network properties | Wu et al. | ✓ | ✓ | 22 | Social network analysis, Majestic |
| Network properties | Zhang et al. | ✓ | ✓ | 16 | Social network analysis, Alexa ranking |
| Reputation properties | Bhatia et al. | ✓ | ✓ | 22 | Online reviews, social media sentiment |
| Reputation properties | Cao et al. | ✓ | ✓ | 28 | Online reviews, social media sentiment, Trustpilot |
| Reputation properties | Goyal et al. | ✓ | ✓ | 15 | Online reviews |
| Textual/visual properties | Feng et al. | ✓ | ✓ | 18 | Image analysis, sentiment analysis |
| Textual/visual properties | Xu et al. | ✓ | ✓ | 20 | Image analysis, named entity recognition |
| Textual/visual properties | Jiang et al. | ✓ | ✓ | 24 | Sentiment analysis, named entity recognition |
| Traffic properties | Ali et al. | ✓ | ✓ | 15 | Web traffic analysis |
| Traffic properties | Hassan et al. | ✓ | ✓ | 12 | SimilarWeb, Similarweb Pro |
| Traffic properties | Tripathi et al. | ✓ | ✓ | 18 | SimilarWeb, Alexa ranking |


VIII. Anti-Phishing Methods
In this section, we will discuss different approaches to anti-phishing. These approaches include classical and state-of-the-art methods, including machine learning (ML) and deep learning (DL). The basic structure of anti-phishing methods is shown in Figure 6.

A. User-Based Anti-Phishing Approaches
1) User-Based Anti-Phishing Technique
a: Educating Anti-Phishing Technique
Humans are often unaware of new threats, which makes education essential to learn about new methods and techniques. Education is crucial to teach people about phishing attacks, how phishing works, and how to identify phishing emails. Organizations worldwide must train their staff about phishing threats. Similarly, all government agencies should educate the public about phishing [147]. By educating employees and the general public, phishing attacks can be controlled.

b: Awareness About Security Warning Anti-Phishing Technique
Most phishing detection methods use browser plugins, which quickly identify suspicious websites and alert users when visiting a potentially dangerous site. Understanding security warning signs is crucial when human intervention is required. If users ignore these warnings, it could lead to negative consequences. Proper training on the recognition of security indicators is essential. Studies have shown that 60% of the users ignore warnings and proceed to phishing URLs without training, while the click-through rate for trained users was 0.

There are two types of warning: active warnings that prevent users from accessing phishing URLs and passive warnings that display a message while allowing access. Most contemporary web browsers, such as Mozilla Firefox and Google Chrome, use passive warnings. Active warnings are more effective, since many users tend to ignore passive warnings. A study with 60 participants found passive warnings inadequate; 79% noticed active warnings, while only 13% noticed passive warnings [148], [149], [150].

c: Training Using Games Anti-Phishing Technique
Training methods that incorporate games are advantageous because they are convenient and easy to learn, providing a natural setting for teaching. Various developers have created interactive teaching tools to educate users on recognizing phishing attempts.

Before-and-after studies demonstrated the effectiveness of training games. Participants who played these games showed an increased awareness of phishing emails and websites. The training was integrated into users' daily routines, making it user-friendly. Periodically, instructive notes were sent to users after the program began. Research showed that only 30% of trained users clicked on fake links in emails they learned to recognize. Moreover, findings indicated that interactive training sessions are more successful than warning notifications [148], [149], [150].

A serious game was created to enhance users' ability to recognize phishing URLs. The game, "Phisher," is a solo, lightweight, intuitive, and narrative-driven game. The scenario begins with the player receiving several messages claiming they won a substantial sum of $600,000 and would be sent to an island if they input bank details. The player must capture the scammer using a boat, a hungry tiger, and no money, returning to the beach to survive. The gameplay progresses as the player answers questions about phishing. The pre-game and post-game surveys of the participants showed that the number of correct answers increased from 4-7 to 5-8 after playing the game [148], [149].

The confidence level of accuracy rose from 4.09 to 4.47 (p < 0.05), and the accuracy improved from 0.70 to 0.795 (p = 4.12 × 10^-142). The false negative rate decreased from 0.22 to 0.14 (p = 5.03 × 10^-91), while the false positive rate decreased from 0.34 to 0.25 (p = 7.71 × 10^-76). 25% of the participants played the game more than once, indicating the appeal and participation of the game [148], [149].

d: User Response Anti-Phishing Technique
Research conducted by various academics has investigated why individuals become susceptible to phishing schemes. These studies also evaluate whether users examine the URL, browser toolbar, or other security indicators. Many computer users are targeted due to their ignorance of warning signs and indicators when using toolbars. The survey found that most of the participants had no bank account and were unfamiliar with financial jargon, making them unable to recognize 90% of fraudulent sites, even when they visually resembled legitimate ones. The studies discovered that 23% of the users prefer to avoid verifying URLs. Researchers studied phishing attacks, specifically spear phishing, among 158 volunteers of various age groups. The study found that older women were more susceptible to phishing attacks compared to other demographics. Scarcity was more prevalent among young people, while reciprocation was more prevalent among elderly adults [88].

B. Classical Methods
1) URL-Based Method
a: Blocklist URLs
This method uses anti-phishing tools like Phish-Net, Google Safe Browsing, and PhishTank to generate a list of URLs. It works by matching suspected phishing URLs against the blocklist. If a match is found, the URL is identified as phishing; otherwise, it is considered legitimate. This method is beneficial for quickly identifying known phishing URLs but is not effective for new URLs. Therefore, the blocklist must be updated daily [151].
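The blocklist lookup described above amounts to a set-membership test on a normalized URL. The sketch below is a hypothetical illustration: the `normalize` and `classify` helpers and the example URLs are invented for this example and do not reflect the interfaces of Phish-Net, Google Safe Browsing, or PhishTank.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to scheme://host/path so trivial variants compare equal.

    Assumption: port, query string, and fragment are ignored, and a
    trailing slash is stripped. Real blocklists apply richer rules.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{host}{path}"


def classify(url, blocklist):
    """Label a URL 'phishing' if it is on the blocklist, else 'legitimate'."""
    return "phishing" if normalize(url) in blocklist else "legitimate"


# Hypothetical blocklist entries, for illustration only.
BLOCKLIST = {normalize(u) for u in [
    "http://bad.example/login",
    "http://fake-bank.example/verify",
]}

if __name__ == "__main__":
    print(classify("HTTP://Bad.Example/login/", BLOCKLIST))     # known entry
    print(classify("http://new-bad.example/login", BLOCKLIST))  # unseen URL
```

The second call shows the limitation noted in the text: a phishing URL that has not yet been listed is labeled legitimate, which is why the list needs frequent updates.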


Shakeel et al.: Across the Spectrum In-Depth Review AI-Based Models for Phishing Detection

FIGURE 6. The diagram of classical and AI-based anti-phishing models categorizes various anti-phishing techniques into three main groups:
user-based approaches, classical methods, and AI methods. It outlines specific strategies within each category, ranging from educating users and
employing URL-based methods to advanced machine learning and deep learning models. This comprehensive framework highlights the multi-faceted
approach necessary to effectively combat phishing threats.

b: Whitelist URLs
This list contains only legitimate URLs with no associated phishing sources or code. All URLs that are not identified as phishing are added and managed in this list. When a URL is visited, it is compared with the whitelist to verify its legitimacy. If a URL is identified as phishing, it is marked for the blocklist. However, if it matches the whitelist, it is considered safe. The verification mechanism is often integrated within search engines or browser extensions like the Google toolbar [26].

c: Heuristics
Heuristics involve analyzing URL details such as domain name, domain path, website rank, Alexa ranking, and reputation score. A suspected phishing URL is tested against heuristic criteria. If the URL satisfies criteria such as domain name, path, address, and rank, it is considered legitimate. Otherwise, it is deemed phishing and added to the blocklist. Due to the short lifespan of phishing URLs, they rarely achieve high rankings [152].

Table 4 shows that the blocklist method performs well, but there are some limitations associated with this approach. This method does not identify new phishing URLs, requiring the URL list to be updated daily. Similarly, whitelist and heuristic lists also need to be updated regularly. However, if all these methods are combined and implemented with automated list updates, they would provide the most effective and reliable solution for anti-phishing.

TABLE 4. The table provides a comparison between URL-based anti-phishing methods, detailing their limitations, proposed enhancements, and accuracy levels. It offers insights into how integrating these methods with advanced technologies like machine learning and AI can improve efficacy and reduce manual effort in identifying phishing threats.

| Author | Method | Gaps | Solution | Accuracy |
| [153] | Blocklist URLs | Needs daily updates | Combine with other methods like whitelists and heuristics | High for known phishing sites |
| [26] | Whitelist URLs | Requires manual effort to maintain | Combine with automated tools and regular updates | High for whitelisted sites |
| [154] | Heuristics | Requires advanced analysis and expertise | Use machine learning and AI to improve accuracy and reduce false positives/negatives | Varies depending on the implementation |

2) Content-Based Methods
a: Keyword Matching
Keyword matching is another classical anti-phishing method used to detect potential phishing URLs. In this method, URLs are scanned for specific keywords commonly associated with phishing attacks, such as "username," "password," "login," "register," "CNIC," "credit card," "PIN code," "passcode," "email," and "one-time password" [51]. If a URL string matches any of these keywords, it is flagged as phishing, and a pop-up alert warns the user to avoid visiting that URL [46].

b: HTML Source Code Analysis
HTML code is vital in anti-phishing efforts because it helps identify hidden patterns and modifications in the code. Several methods analyze and verify the contents of the source code. First, check if the form code, including input boxes or password fields, contains any hidden content or script that could redirect user input to a phisher's database.

Secondly, verify whether any SQL queries related to these fields are correctly executed or redirected. Next, examine JavaScript code to detect any malicious elements by comparing it with the original code and syntax. Additionally, verify internal and external links to ensure they refer to the correct domain, and validate security certificates with the SSL server. If all these parameters match the original page, it is declared legitimate; otherwise, it is identified as phishing [131].

c: Visual Similarity
Visual similarity matching involves analyzing CSS, text formatting, image formatting, image dimensions, and web page content. These features are compared with authenticated and verified web pages to determine if a page is phishing [85]. The visual similarity method is divided into four subtypes: document object comparison, CSS similarity, image similarity, and visual feature matching. The model is first trained using these features and then tested in real-world scenarios. If a testing webpage matches all the features, it is marked as legitimate; otherwise, it is flagged as phishing [85].

Table 5 presents a comparative study of HTML code-based anti-phishing methods, showing that visual similarity is more efficient than HTML source code analysis and keyword matching. These detection methods can be enhanced by combining different approaches. If keyword matching, HTML source code analysis, and visual similarity methods are combined, the resulting output will be superior to other classical methods discussed in Table 4.

TABLE 5. The table provides a comparison of different methods for analyzing HTML source code in anti-phishing solutions, highlighting their gaps, proposed solutions, accuracy, and overall suitability for various security needs. It underscores the trade-offs between ease of implementation and the level of security provided, suggesting combined approaches for enhanced protection.

| Author | Method | Gaps | Solution | Accuracy | Decision |
| [46] | Keyword Matching | Limited detection capabilities; cannot identify complex attacks | Combine with other methods; use updated keyword lists | Moderate | Good for essential protection, but not suitable for high-security needs |
| [131] | HTML Source Code Analysis | May miss well-crafted visual phishing attempts | Combine with visual similarity and user education | High | Good for advanced protection; suitable for security-conscious users |
| [85] | Visual Similarity | May be fooled by minor visual changes; ineffective against non-visual attacks | Use advanced machine learning techniques; combine with other methods | Very high | Best for high-security needs; requires significant resources and expertise |

C. Network-Based Method
1) Traffic Analysis
The traffic analysis mechanism provides a comprehensive overview of network traffic to identify and control any suspicious activity immediately [155]. During phishing detection using traffic analysis, various parameters such as IP addresses, domain names, packet sizes, encryption techniques, communication protocols, and phishing keywords are examined. Machine learning and deep learning models are trained using labeled data to classify phishing and legitimate traffic. However, this method may not provide 100% accuracy due to a high false positive rate. While it can help secure the system, more advanced methods are needed for precise traffic analysis [153].

2) DNS Analysis
DNS anti-phishing involves verifying the domain name associated with an IP address to determine whether it is phishing or legitimate. If the results and records match, the URL or website is considered legitimate; otherwise, it is flagged as phishing [154].

D. Client-Side Methods
1) Browser Extension
A browser-based solution proposed by [154] elaborates on how to stay safe from phishing URLs. The author presented a new browser extension that operates in real time, distinguishing phishing from legitimate URLs. The extension can block JavaScript and provide alerts for any phishing URLs.

2) Anti-Phishing Toolbars
Toolbar-based solutions are presented as browser extensions that must be installed to prevent phishing attacks. When users visit a phishing website, the toolbar assesses the website's credibility and warns users to avoid fraud.

E. Search Engine-Based Methods
In this method, when a site is visited through a search engine, its page ranking in the search results is considered. The search engine indexes the website based on its lifespan and visit statistics. New websites typically do not rank at the top of search results. Search engine detection is classified into two methods:

1) Logo-Based Technique
This is an older method to detect original URLs using a search engine. It involves extracting the logo of the original website and searching for it to find legitimate URLs.

2) Information Retrieval Technique
In this method, the search engine extracts features of a website, including web page tags. The extracted tags are used in search queries to discover phishing websites. The Google Chrome browser uses a phishing detection extension to analyze website content, including domain names and web page titles.

F. AI Methods
AI plays a vital role in developing anti-phishing models, which perform well using different feature extraction methods to detect phishing websites. Several models have been developed and proposed as standards for phishing detection within the industry. However, due to the high rate of phishing attacks and new phishing techniques, these methods need continuous improvement. AI models are divided into two categories:

1) Machine Learning Models
a: Decision Tree Model
The Decision Tree is a supervised machine learning model known for its ease of use, parametric distribution, scoring distribution, and efficiency [156]. It can quickly learn and handle different types of data sets simultaneously [157]. This model operates iteratively to predict whether websites are phishing or legitimate [158].

\[ E(s) = \sum_{i=1}^{C} -P_i \log_2 P_i \] (Equation-1)

\[ E(T, X) = \sum_{c \in X} p(c) \cdot E(c) \] (Equation-2)

\[ IG(T, X) = E(T) - E(T, X) \] (Equation-3)

Equations 1, 2, and 3 are central to optimizing Decision Tree models for phishing detection. Entropy (E(s)) measures the impurity of a data set, conditional entropy (E(T, X)) evaluates the entropy after splitting by attribute X, and Information Gain (IG(T, X)) calculates the reduction in entropy due to the split. These metrics help to determine the most informative attributes for splitting nodes in the model.

b: Random Forest Model
The random forest (RF) is a classifier that is used to categorize data into different classes. It is a highly effective model that is often used to solve classification problems [159]. Like a decision tree, this model organizes data into various categories as given in Equation 4. It aggregates results from different nodes to predict classes [158], [160].

\[ E(s) = \frac{1}{B} \sum_{i=0}^{B} f_i(x_t) \] (Equation-4)

c: Support Vector Machine Model
Support Vector Machine (SVM) is a supervised machine learning model used for binary classification problems as given in Equations 5, 6, 7, and 8. SVM solves linear problems using the kernel function, which eliminates the need to transform the data manually, as the SVM kernel handles this task. Additionally, there is no need to make assumptions about feature extraction or selection, as the SVM kernel manages these aspects. SVM provides a robust solution for classification problems by employing a nature-inspired optimization algorithm for phishing detection and spam identification. It classifies data into two classes using a hyperplane, predicting the class based on the new value's position relative to the hyperplane [146].


\[ w^T x + b = 0 \] (Equation-5)
Equation 5: Linear hyperplane model

\[ d_i = \frac{w^T x_i + b}{\lVert w \rVert} \] (Equation-6)
Equation 6: The distance between a data point and the decision boundary

\[ \hat{y} = \begin{cases} 1 & \text{if } w^T x + b \ge 0 \\ 0 & \text{if } w^T x + b < 0 \end{cases} \] (Equation-7)
Equation 7: SVM with Linear Classifier Model

\[ \min_{w, b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} c_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - c_i, \quad c_i \ge 0 \quad \text{for } i = 1, 2, 3, \ldots, m \] (Equation-8)
Equation 8: Linear SVM Classifier with Soft Margin

d: Naive Bayes Model
Naïve Bayes is a probabilistic model used as a classifier. It is based on the Bayesian model as given in Equation 9 and is used to find relationships between different features. This model calculates the probability of feature occurrence using the corpus consideration method. As a supervised machine learning model, Naïve Bayes works with class label attributes to classify data. The correlation of all attributes is calculated independently, assuming that each feature contributes equally and independently to the outcome [47].

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \] (Equation-9)
Equation 9: Naive Bayes Theorem

e: AdaBoost
AdaBoost is a classifier that functions similarly to a Random Forest classifier. It uses multiple weak learners and combines them into strong learners to classify data, based on Equation 10. This model utilizes the weights of the nodes during training, transferring these weights to subsequent nodes to enhance the model's accuracy. As a supervised machine learning model, AdaBoost uses labeled data for classification. During training, the model first generates a weak tree structure, assigns scores to weak learners, and then transfers weights to the next learner, making it more robust than the previous one. In essence, the model learns from previous errors to improve its accuracy over time. The number of iterations contributes to improved accuracy [32], [33].

\[ H(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m h_m(x) \right) \] (Equation-10)
Equation 10: AdaBoost Mathematical Model

f: Gradient Boosting Classifier
Gradient Boosting is a supervised machine learning classifier, given in Equation 11, that works similarly to AdaBoost. It assigns weights to the training data and learns from errors. During training, the model iteratively learns by minimizing errors and improving accuracy in subsequent iterations [30], [32], [46].

\[ H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) \] (Equation-11)
Equation 11: Gradient Boosting Mathematical Model

g: XGBoost
XGBoost is an advanced form of the Gradient Boosting algorithm that outperforms other boosting classifiers. It is highly scalable and effective for various classification problems using Equations 12 and 13. XGBoost provides faster processing due to its efficient memory management and distributed processing capabilities, enabling data scientists to run large datasets on desktop processors [25].

\[ H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) \] (Equation-12)

\[ Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \] (Equation-13)

h: K-Nearest Neighbor
The k-Nearest Neighbors (k-NN) algorithm is a supervised machine learning model used to solve classification and regression problems using Equations 14 and 15. It classifies data points according to their similarity, measured by the Euclidean distance, where the value of k is a positive integer representing the number of nearest neighbors considered. The model calculates the nearest-neighbor score to find the closest data points and groups them into one class. Although it is simple to apply, it can be computationally expensive due to assumptions such as equally divided classes [25].

\[ \hat{y}_q = \arg\max_{y} \sum_{i=1}^{N} 1(y_i = y) \cdot 1(x_i \in KNN(x_q)) \] (Equation-14)

\[ \hat{y}_q = \frac{1}{k} \sum_{i=1}^{N} y_i \cdot 1(x_i \in KNN(x_q)) \] (Equation-15)

i: Logistic Regression
The logistic regression algorithm is a supervised machine learning algorithm used for classification and regression modeling as given in Equations 16 and 17. It utilizes the sigmoid function to calculate the probability score of two classes and map them accordingly. Logistic regression performs well when the relationship between the classes in the dataset is linear, but its performance declines significantly in non-linear cases.

\[ P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-z}} \] (Equation-16)
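As a small numerical illustration of Equation 16, the sketch below maps a score z to a class probability with the sigmoid function. The text does not define z, so treating it as a bias plus a weighted sum of features is an assumption here, and the weights and feature values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample.

    Assumption: z is the linear score bias + sum(w_j * x_j).
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(weights, bias, features, threshold=0.5):
    """Return 1 when the probability reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


if __name__ == "__main__":
    # Hypothetical weights for two URL features.
    print(predict_proba([1.0, -1.0], 0.0, [2.0, 2.0]))  # z = 0, so 0.5
    print(predict_label([2.0, 0.5], -1.0, [1.0, 2.0]))  # z = 2, so class 1
```

A score of zero sits exactly on the decision boundary and yields probability 0.5; positive scores push the probability toward 1 and negative scores toward 0.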


\[ \sigma(z) = \frac{1}{1 + e^{-z}} \] (Equation-17)

2) Hybrid Models
a: Artificial Neural Network
An Artificial Neural Network (ANN) is a deep learning model inspired by the biological neural network of the human brain, as shown in Figure 7. It consists of multiple layers, each containing numerous neurons that process data. The neurons in each layer receive input from the previous layer, process it using assigned weights and activation functions, and then pass the output to the next layer. There are two types of processing within a neural network: feed-forward and feedback. In the feed-forward process, data flow from left to right through the layers, while in feedback (or back-propagation), data flow from right to left to adjust weights based on the error between predicted and actual outcomes. The number of layers in an ANN can be customized according to the dataset and model requirements, and each layer independently transforms its data for the subsequent layer. During the initial stage of model training, nodes receive random weights, which are then adjusted using a gradient descent algorithm to achieve optimal solutions. Due to their ability to handle non-linear data, neural networks are effective for addressing complex problems.

FIGURE 7. The Artificial Neural Network Layers Model diagram illustrates a basic neural network structure, comprising input, hidden, and output layers. Nodes in the input layer receive data, which then flows through interconnected nodes in the hidden layer before reaching the output layer for final processing. Arrows represent the connections through which data is forwarded, with weights that are adjusted during training to optimize the network's predictive accuracy.

3) Deep Learning Models
a: Deep Neural Network (DNN)
A Deep Neural Network (DNN) is an advanced form of an Artificial Neural Network (ANN). The primary distinction of DNN is the increase in the number of hidden layers. DNNs contain more hidden layers than typical ANNs, allowing for more complex data processing. The layers are arranged in sequence, starting with the input layer that receives data. The hidden layers process this data using activation functions such as sigmoid functions, employing feed-forward and feedback methods. The final layer, the output layer, provides the results of classification or detection tasks. DNN is a supervised machine learning model used for classification and detection problems [161].

b: Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) is a deep learning algorithm that operates using the layered architecture listed in Equations 18, 19, 20, and 21. The first layer is the convolutional layer, which applies convolution operations to extract features from the input data. The second layer is the fully connected layer, which is typically used for classification tasks. CNNs are primarily used in computer vision for image processing and work with two-dimensional data. Additional layers in CNNs include pooling layers for down-sampling, dropout layers to prevent overfitting, batch normalization layers to stabilize learning, and output layers for final predictions [162]. The basic structure of the CNN model is given in Figure 8.

\[ (f * g)(t) = \sum_{a=-\infty}^{\infty} f(a)\, g(t - a) \] (Equation-18)

\[ \mathrm{ReLU}(x) = \max(0, x) \] (Equation-19)

\[ \text{max-pooling}(x) = \max(\text{neighborhood}(x)) = W x + b \] (Equation-20)

\[ L(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i) \] (Equation-21)
Equations 18 to 21: Convolutional Neural Network Model

FIGURE 8. The Convolutional Neural Network Layers Structure diagram represents the architecture of a typical Convolutional Neural Network (CNN). It begins with an input layer and progresses through multiple convolutional layers where filters are applied to extract features. These are followed by pooling layers that reduce dimensionality. The architecture culminates in a fully connected layer that leads to the output.

c: Recurrent Neural Network
Recurrent Neural Network (RNN) is a deep learning model for language processing and text mining. In this model, given in Equations 22, 23, and 24, each layer is connected with the inner layer, generating a bi-directional connection. This model helps process sequential data and is helpful for phishing URL detection and feature extraction [163]. Figure 9 shows the basic structure of the RNN model.

\[ h_t = \mathrm{activation}(W_{hx} x_t + W_{hh} h_{t-1} + b_h) \] (Equation-22)


$$y_t = \mathrm{softmax}(W_{yh} h_t + b_y) \quad \text{(Equation-23)}$$

$$L(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i) \quad \text{(Equation-24)}$$
Equation 16: RNN Mathematical Model with Loss Function
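The RNN equations above can be traced step by step with a small sketch in plain Python. It is only an illustration under stated assumptions: tanh is chosen as the unspecified "activation", the weight matrices and the three-step input sequence are made-up toy values, and the helper names (`matvec`, `rnn_step`) are hypothetical rather than taken from the surveyed work.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """One time step: Equation-22 (hidden state) and Equation-23 (output)."""
    pre = [a + b + c for a, b, c in
           zip(matvec(W_hx, x_t), matvec(W_hh, h_prev), b_h)]
    h_t = [math.tanh(p) for p in pre]  # assumption: activation = tanh
    y_t = softmax([a + b for a, b in zip(matvec(W_yh, h_t), b_y)])
    return h_t, y_t

def cross_entropy(y, y_hat):
    """Equation-24: L(y, y_hat) = -sum_i y_i * log(y_hat_i)."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)

# Hypothetical toy parameters (2 inputs, 2 hidden units, 2 output classes).
W_hx = [[0.5, -0.5], [0.25, 0.75]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
W_yh = [[1.0, -1.0], [-1.0, 1.0]]
b_y = [0.0, 0.0]

# The same weights are reused at every time step of the sequence.
h = [0.0, 0.0]
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, y_hat = rnn_step(x_t, h, W_hx, W_hh, b_h, W_yh, b_y)

loss = cross_entropy([1.0, 0.0], y_hat)
```

The loop makes the weight sharing explicit: only the hidden state `h` changes between steps, which is how the network carries information from earlier elements of a sequence (for example, earlier characters of a URL) to later ones.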

FIGURE 9. Basic Architecture of RNN Model diagram illustrates the
unfolding of a Recurrent Neural Network (RNN) over time. It shows how
input x at different time steps t is processed through the same hidden
layer h with recurrent connections, and how it influences the output o at
each step. The unfolding visualizes the sequence processing nature of
RNNs, where weights U, V, and W are shared across all time steps,
highlighting the network’s ability to maintain temporal dependencies in
data.

d: Long-Short-Term Memory Model
Long-Short-Term Memory (LSTM) is an advanced form
of RNN. The basic RNN did not support more than ten
inner connections of layers. However, with the LSTM
method, it is possible to implement more than ten inner
connected layers in the model, as shown in Figure 10. LSTM
enables the application of more than 1000 processing steps
by establishing the relationship of data points with each
other. During processing, LSTM computes an importance score
for each step where necessary and keeps it in memory until
the model is executed [161], [162].

FIGURE 10. LSTM Model Architecture diagram illustrates the architecture
of an LSTM cell, highlighting the processing of inputs through forget,
input, and output gates to update the cell state and hidden state, thereby
managing long-term dependencies in data sequences.

FIGURE 11. Performance Comparison of Various Phishing Detection
Models: This graph illustrates the accuracy rates of different models used
in phishing detection, ranging from Convolutional Autoencoder + DNN to
Random Forest, highlighting the progressive improvement in detection
capabilities across models.

IX. Model Recommendation for Phishing Detection
Based on the literature study and the visualization of different
models using accuracy metrics, as given in Figure 11, it is
concluded that the Random Forest model performs best in
classifying legitimate and manipulated websites. This model
excels both in individual implementation and when used in
ensemble and hybrid modeling techniques.

X. Open Challenges & Discussion
Several models have been developed for detecting Web
phishing, but due to the evolving nature of phishing attacks,
many challenges persist.
1) Adapting to Emerging Phishing Techniques: Phishing
websites are launched daily with new phishing material,
making it essential to regularly update phishing
datasets and train models with the latest data [1], [2].
This challenge is significant as existing models may
become outdated quickly without frequent updates [28].
2) Dataset-Specific Model Limitations: Many models,
including hybrid ones, are trained on specific datasets
with particular features, which can limit their
effectiveness against emerging attacks [22], [47]. The specificity
of datasets poses a challenge in developing models that
are versatile across different types of phishing attacks
[38].
3) Feature Extraction Gaps: Current models often extract
features based on fixed thresholds or statistical models,
lacking a dynamic approach that considers
comprehensive information such as web page content, URL
properties, visualization properties, and domain
properties [17], [38], [43]. More extensive work is needed
in feature extraction to enhance phishing detection
research [113].

TABLE 6. Comparative analysis of various phishing detection models, detailing their method types, datasets, main challenges, limitations, and model accuracy. It highlights the
performance metrics and constraints of each model, showcasing how they fare against different types of phishing datasets and under varying conditions.

| Model/Ref | Method-Type | Dataset-Ref | Main Challenges | Model Limitations | Model Accuracy |
|---|---|---|---|---|---|
| Random Forest [45] | Single Model | ISCX-2016 | Enhance accuracy using limited features without any third-party tool | This model was not tested with multiple datasets. The model did not support robustness. | 0.9957 |
| Random Forest [36] | Single Model | Phishing Tank | Manually extracted features rely on third-party services | Small experimental dataset | 0.9950 |
| PSL + PART [44] | Hybrid Model | Phishing Tank Bank Dataset | Trained different ML models and compared results | The dataset was based only on bank data. | 0.9930 |
| ISHO + SVM [42] | Hybrid Model | UCI Dataset | Dataset lacks original URLs, no feature extraction | Limited to specific datasets without URL information | 0.9864 |
| Adaboost [43], [71] | Single Model | UCI Dataset | Not evaluated on diverse datasets | Not enough information on dataset size | 0.9830 |
| LBET (Logistic + Extra Tree) [71] | Hybrid Model | UCI Dataset | Insufficient data sources, lack of feature extraction | Limited to specific datasets without URL information | 0.9757 |
| Bootstrap Aggregating + Logistic Model Tree [164] | Hybrid Model | UCI Dataset | Insufficient data sources, lack of feature extraction | Limited to specific datasets without URL information | 0.9742 |
| Random Forest + Neural Network + Bagging [47] | Hybrid Model | UCI Dataset | No prior research on this specific combination | Insufficient data sources, lack of feature extraction | 0.9740 |
| Priority-Based Algorithms [61] | Hybrid Model | UCI Dataset | Insufficient data sources, lack of feature extraction | Limited to specific datasets without URL information | 0.9700 |
| Random Forest [38] | Single Model | Phishing Tank | Insufficient data sources, lack of feature extraction | Limited to specific datasets without URL information | 0.9687 |
| Adam optimizer + Deep Neural Network (DNN) [62] | Hybrid Model | Phishing Tank | Insufficient data sources, lack of feature extraction | Limited to specific datasets without URL information | 0.9600 |
| CNN [31], [165] | Single Model | Phishing Tank | Large dataset size, long training time | Sensitive to URL length; ignores website status | 0.9502 |
| Grey Wolf Optimizer + SVM [31] | Hybrid Model | Phishing Tank | Small dataset, limited evaluation on diverse datasets | Rule-based features only, potentially lacking complexity | 0.9038 |
| Genetic Algorithm (GA) + DNN [166] | Hybrid Model | Phishing Tank | Requires longer training due to GA feature selection | Insufficient data sources, lack of feature extraction | 0.8950 |
| Convolutional Autoencoder + DNN [167] | Hybrid Model | Phishing Tank | Lower accuracy than some methods, small dataset | Rule-based features, potentially limited information | 0.8900 |

4) Autonomous System Development: Developing an
autonomous system that continuously updates itself with
the latest phishing attacks and maintains an up-to-date
database of phishing URLs is a significant challenge
[18], [26]. The need for real-time environments that
automatically learn and train themselves to ensure user
safety is critical [38].
5) User Awareness and Education: Despite technical
advancements, many users remain unaware of phishing
attacks, including educated individuals who might
inadvertently click on malicious URLs [74]. Conducting
training sessions and awareness campaigns is crucial to
educate people about phishing threats and prevention
strategies [147].
6) Dependency on Third-Party Services for Feature
Extraction: Most feature extraction methods rely on
third-party services to extract specific information such as
domain names, DNS registration, and host names. These
services are often paid and may not always provide
up-to-date information, leading to higher error rates in
models [26].
7) Handling Tiny URLs: The literature lacks specific
mechanisms for handling tiny URLs, which are difficult
to track and verify for phishing content. Educating users
about the credibility of URLs, whether from known
or unknown sources, is crucial [37], [131]. Designing
a system to identify whether tiny URLs are phished
or legitimate remains a significant research challenge
[130].
8) Limitations of Rule-Based and List-Based Models:
Rule-based and list-based models, while effective in
identifying phished or legitimate URLs, require frequent
updates and can have slow detection speeds, resulting
in high response times [154]. Designing systems
capable of handling phishing links spread through various
devices presents a substantial challenge [133].

XI. Conclusion
In conclusion, datasets with few phishing URLs can
adversely affect model performance when tested on larger
datasets. Although blacklisting, whitelisting, and rule-based
detection methods are effective, they are constrained by their lists
or rules. Machine learning models have been introduced in
some research, but these are limited to specific features.
Even when using NLP-based machine learning models for
feature extraction, third-party services are still necessary.
A comprehensive and dynamic system capable of handling
all types of attack, operating across all devices, and
updating dynamically with new phishing techniques is still
needed. Continued research is necessary to develop
benchmark datasets and systems for both offline and real-time
detection.

REFERENCES
[1] D. Kalla, F. Samaah, S. Kuraku, and N. Smith, “Phishing detection implementation using databricks and artificial intelligence,” International Journal of Computer Applications, vol. 185, no. 11, pp. 1–11, 2023.
[2] A. Safi and S. Singh, “A systematic literature review on phishing website detection techniques,” Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 2, pp. 590–611, 2023.
[3] B. Naqvi, K. Perova, A. Farooq, I. Makhdoom, S. Oyedeji, and J. Porras, “Mitigation strategies against the phishing attacks: A systematic literature review,” Computers & Security, p. 103387, 2023.
[4] M. K. Pandey, M. K. Singh, S. Pal, and B. Tiwari, “Prediction of phishing websites using machine learning,” Spatial Information Research, vol. 31, no. 2, pp. 157–166, 2023.
[5] C. Cross, ““i knew it was a scam”: Understanding the triggers for recognizing romance fraud,” Criminology & Public Policy, vol. 22, no. 4, pp. 613–637, 2023.
[6] L. Brotherston, A. Berlin, and W. F. Reyor III, Defensive security handbook. O’Reilly Media, Inc., 2024.
[7] T. Xu, K. Singh, and P. Rajivan, “Personalized persuasion: Quantifying susceptibility to information exploitation in spear-phishing attacks,” Applied Ergonomics, vol. 108, p. 103908, 2023.
[8] M. Nadeem, S. W. Zahra, M. N. Abbasi, A. Arshad, S. Riaz, and W. Ahmed, “Phishing attack, its detections and prevention techniques,” International Journal of Wireless Security and Networks, vol. 1, no. 2, pp. 13–25, 2023.
[9] R. Goenka, M. Chawla, and N. Tiwari, “A comprehensive survey of phishing: Mediums, intended targets, attack and defence techniques and a novel taxonomy,” International Journal of Information Security, vol. 23, no. 2, pp. 819–848, 2024.
[10] I. Ahmad, S. Khan, and S. Iqbal, “Guardians of the vault: unmasking online threats and fortifying e-banking security, a systematic review,” Journal of Financial Crime, 2024.
[11] F. S. Alsubaei, A. A. Almazroi, and N. Ayub, “Enhancing phishing detection: A novel hybrid deep learning framework for cybercrime forensics,” IEEE Access, 2024.
[12] S. Asiri, Y. Xiao, S. Alzahrani, S. Li, and T. Li, “A survey of intelligent detection designs of html url phishing attacks,” IEEE Access, vol. 11, pp. 6421–6443, 2023.
[13] Y. Guo, “A review of machine learning-based zero-day attack detection: Challenges and future directions,” Computer communications, vol. 198, pp. 175–185, 2023.
[14] A. S. Albahri, A. M. Duhaim, M. A. Fadhel, A. Alnoor, N. S. Baqer, L. Alzubaidi, O. S. Albahri, A. H. Alamoodi, J. Bai, A. Salhi et al., “A systematic review of trustworthy and explainable artificial intelligence in healthcare: Assessment of quality, bias risk, and data fusion,” Information Fusion, vol. 96, pp. 156–191, 2023.
[15] S. Kemp, “Global overview report,” https://datareportal.com/reports/digital-2022-global-overview-report, 2022.
[16] A.P.W.G., “Apwg phishing trends report 2nd quarter 2022,” Anti-Phishing Working Group (APWG), no. September, 2022.
[17] B. Gontla, P. Gundu, P. Uppalapati, K. Rao, and S. Hussain, “A machine learning approach to identify phishing websites: A comparative study of classification models and ensemble learning techniques,” EAI Endorsed Transactions on Scalable Information Systems, vol. 10, no. 5, 2023.
[18] S. Santos, P. Costa, and A. Rocha, “It/ot convergence in industry 4.0: Risks and analysis of the problems,” in Iberian Conference on Information Systems and Technologies, CISTI, 2023.
[19] N. Do, A. Selamat, O. Krejcar, E. Herrera-Viedma, and H. Fujita, “Deep learning for phishing detection: Taxonomy, current challenges, and future directions,” IEEE Access, vol. 10, 2022.
[20] U. Agarwal, “Blockchain technology for secure supply chain management: A comprehensive review,” IEEE Access, 2022.
[21] S. Zahra, M. Chishti, A. Baba, and F. Wu, “Detecting covid-19 chaos driven phishing/malicious url attacks by a fuzzy logic and data mining based intelligence system,” Egyptian Informatics Journal, vol. 23, no. 2, 2022.
[22] B. Naqvi, K. Perova, A. Farooq, I. Makhdoom, S. Oyedeji, and J. Porras, “Mitigation strategies against the phishing attacks: A systematic literature review,” Computers and Security, vol. 132, 2023.
[23] A. Safi and S. Singh, “A systematic literature review on phishing website detection techniques,” Journal of King Saud University - Computer and Information Sciences, vol. 35, no. 2, 2023.


[24] A. Jain, S. Sahoo, and J. Kaubiyal, “Online social networks security and privacy: comprehensive review and analysis,” Complex and Intelligent Systems, vol. 7, no. 5, 2021.
[25] S. Abad, H. Gholamy, and M. Aslani, “Classification of malicious urls using machine learning,” Sensors, vol. 23, no. 18, 2023-09.
[26] I. Kotenko, “Detection of anomalies and attacks in container systems: An integrated approach based on black and white lists,” in Lecture Notes in Networks and Systems, 2023.
[27] T. Pattewar, C. Mali, S. Kshire, M. Sadarao, J. Salunkhe, and A. Shah, “Malicious short urls detection: A survey,” International Research Journal of Engineering and Technology, 2019.
[28] O. Abiodun, S. A.S, and K. S.O, “Linkcalculator – an efficient link-based phishing detection tool,” Acta Informatica Malaysia, vol. 4, no. 2, 2020.
[29] P. Yang, G. Zhao, and P. Zeng, “Phishing website detection based on multidimensional features driven by deep learning,” IEEE Access, vol. 7, 2019.
[30] L. Tang and Q. Mahmoud, “A survey of machine learning-based solutions for phishing website detection,” Machine Learning and Knowledge Extraction, vol. 3, no. 3, 2021.
[31] S. Anupam and A. Kar, “Phishing website detection using support vector machines and nature-inspired optimization algorithms,” Telecommun Syst, vol. 76, no. 1, pp. 17–32, 2021-01.
[32] V. Shahrivari, M. Darabi, and M. Izadi, “Phishing detection using machine learning techniques,” 2020-09. [Online]. Available: http://arxiv.org/abs/2009.11116
[33] J. Rashid, T. Mahmood, M. Nisar, and T. Nazir, “Phishing detection using machine learning technique,” in Proceedings - 2020 1st International Conference of Smart Systems and Emerging Technologies, SMART-TECH 2020. Institute of Electrical and Electronics Engineers Inc, 2020-11, pp. 43–46.
[34] S. Alnemari and M. Alshammari, “Detecting phishing domains using machine learning,” Applied Sciences (Switzerland), vol. 13, no. 8, 2023.
[35] A. Dutta, “Detecting phishing websites using machine learning technique,” PLoS One, vol. 16, no. 10, 2021-10.
[36] E. Gandotra and D. Gupta, “Improving spoofed website detection using machine learning,” Cybern Syst, vol. 52, no. 2, pp. 169–190, 2021.
[37] B. Waseso and N. Setiyanto, “Web phishing classification using combined machine learning methods,” Journal of Computing Theories and Applications, vol. 1, no. 1, pp. 11–18, 2023-08.
[38] G. Lokesh and G. BoreGowda, “Phishing website detection based on effective machine learning approach,” Journal of Cyber Security Technology, vol. 5, no. 1, pp. 1–14, 2021-01.
[39] M. Lei, Y. Xiao, S. Vrbsky, and C. Li, “Virtual password using random linear functions for on-line services, atm machines, and pervasive computing,” Comput Commun, vol. 31, no. 18, 2008.
[40] A. Basit, M. Zafar, X. Liu, A. Javed, Z. Jalil, and K. Kifayat, “A comprehensive survey of ai-enabled phishing attacks detection techniques,” Telecommunication Systems, vol. 76, no. 1, 2021.
[41] U. Butt, R. Amin, H. Aldabbas, S. Mohan, B. Alouffi, and A. Ahmadian, “Cloud-based email phishing attack using machine and deep learning algorithm,” Complex and Intelligent Systems, vol. 9, no. 3, pp. 3043–3070, 2023-06.
[42] M. Sabahno and F. Safara, “Isho: improved spotted hyena optimization algorithm for phishing website detection,” Multimed Tools Appl, vol. 81, no. 24, pp. 34 677–34 696, 2022-10.
[43] A. Odeh, I. Keshta, and E. Abdelfattah, “Phiboost-a novel phishing detection model using adaptive boosting approach,” 2021.
[44] P. Barraclough, G. Fehringer, and J. Woodward, “Intelligent cyber-phishing detection for online,” Comput Secur, vol. 104, 2021-05.
[45] B. Gupta, K. Yadav, I. Razzak, K. Psannis, A. Castiglione, and X. Chang, “A novel approach for phishing urls detection using lexical based machine learning in a real-time environment,” Comput Commun, vol. 175, pp. 47–57, 2021-07.
[46] C. Singh and Meenu, “Phishing website detection based on machine learning: A survey,” in 2020 6th International Conference on Advanced Computing and Communication Systems, ICACCS 2020, 2020.
[47] A. Zamir, “Phishing web site detection using diverse machine learning algorithms,” Electronic Library, vol. 38, no. 1, pp. 65–80, 2020-03.
[48] M. Vijayalakshmi, S. Shalinie, M. Yang, and U. Meenakshi, “Web phishing detection techniques: A survey on the state-of-the-art, taxonomy and future directions,” IET Networks, vol. 9, no. 5, pp. 235–246, 2020-09-01.
[49] A. Jain and B. Gupta, “A survey of phishing attack techniques, defence mechanisms and open research challenges,” Enterprise Information Systems, vol. 16, no. 4, pp. 527–565, 2022.
[50] M. Bhattacharya, S. Roy, S. Chattopadhyay, A. Das, and S. Shetty, “A comprehensive survey on online social networks security and privacy issues: Threats, machine learning-based solutions, and open challenges,” SECURITY AND PRIVACY, vol. 6, no. 1, 2023.
[51] S. Samad, “Analysis of the performance impact of fine-tuned machine learning model for phishing url detection,” Electronics (Switzerland), vol. 12, no. 7, 2023.
[52] M. Almousa, T. Zhang, A. Sarrafzadeh, and M. Anwar, “Phishing website detection: How effective are deep learning-based models and hyperparameter optimization?” SECURITY AND PRIVACY, vol. 5, no. 6, 2022.
[53] S. Salloum, T. Gaber, S. Vadera, and K. Shaalan, “Phishing email detection using natural language processing techniques: A literature survey,” in Procedia CIRP, 2021.
[54] M. Korkmaz, O. Sahingoz, and B. Diri, “Feature selections for the classification of webpages to detect phishing attacks: A survey,” in HORA 2020 - 2nd International Congress on Human-Computer Interaction, Optimization and Robotic Applications, Proceedings, 2020.
[55] M. Hr, A. Mv, S. Prasad, and S. Vinay, “Development of anti-phishing browser based on random forest and rule of extraction framework,” Cybersecurity, vol. 3, no. 1, 2020-12.
[56] P. Kumar, T. Jaya, and V. Rajendran, “Si-bba – a novel phishing website detection based on swarm intelligence with deep learning,” Mater Today Proc, vol. 80, pp. 3129–3139, 2023-01.
[57] L. Abdulrahman, S. Ahmed, Z. Rashid, Y. Jghef, T. Ghazi, and U. Jader, “Web phishing detection using web crawling, cloud infrastructure and deep learning framework,” Journal of Applied Science and Technology Trends, vol. 4, no. 01, pp. 54–71, 2023-03.
[58] P. Kalaharsha and B. Mehtre, “Detecting phishing sites – an overview,” Mar, 2021. [Online]. Available: http://arxiv.org/abs/2103.12739
[59] R. Pravali, S. Raha, Y. Rachana, and D. Kamesh, “Ensemble machine learning model for phishing intrusion detection and classification from urls,” 2023.
[60] L. Jovanovic, “Improving phishing website detection using a hybrid two-level framework for feature selection and xgboost tuning,” Journal of Web Engineering, vol. 22, no. 3, pp. 543–574, 2023-07.
[61] A. Lakshmanarao, P. P. Rao, and M. Krishna, “Phishing website detection using novel machine learning fusion approach,” in Proceedings - International Conference on Artificial Intelligence and Smart Systems, ICAIS 2021. Institute of Electrical and Electronics Engineers Inc, 2021-03, pp. 1164–1169.
[62] L. Lakshmi, M. Reddy, C. Santhaiah, and U. Reddy, “Smart phishing detection in web pages using supervised deep learning classification and optimization technique adam,” Wirel Pers Commun, vol. 118, no. 4, pp. 3549–3564, 2021-06.
[63] A. Karim, M. Shahroz, K. Mustofa, S. Belhaouari, and S. Joga, “Phishing detection system through hybrid machine learning based on url,” IEEE Access, vol. 11, pp. 36 805–36 822, 2023.
[64] R. Zieni, L. Massari, and M. Calzarossa, “Phishing or not phishing? a survey on the detection of phishing websites,” IEEE Access, vol. 11, pp. 18 499–18 519, 2023.
[65] M. Shaukat, R. Amin, M. Muslam, A. Alshehri, and J. Xie, “A hybrid approach for alluring ads phishing attack detection using machine learning,” Sensors, vol. 23, no. 19, 2023-10.
[66] L. Yang, J. Zhang, X. Wang, Z. Li, Z. Li, and Y. He, “An improved elm-based and data preprocessing integrated approach for phishing detection considering comprehensive features,” Expert Syst Appl, vol. 165, 2021-03.
[67] V. Praveena, A. Vijayaraj, P. Chinnasamy, I. Ali, R. Alroobaea, S. Y. Alyahyan, and M. A. Raza, “Optimal deep reinforcement learning for intrusion detection in uavs,” Computers, Materials & Continua, vol. 70, no. 2, pp. 2639–2653, 2022.
[68] P. Chinnasamy, K. S. Sathya, B. J. A. Jebamani, A. Nithyasri, and S. Fowjiya, “Deep learning: Algorithms, techniques, and applications—a systematic survey,” in Deep Learning Research Applications for Natural Language Processing. IGI global, 2023, pp. 1–17.
[69] M. Korkmaz, O. Sahingoz, and B. Diri, “Detection of phishing websites by using machine learning-based url analysis,” 2020.
[70] M. Adebowale, K. Lwin, and M. Hossain, “Intelligent phishing detection scheme using deep learning algorithms,” Journal of Enterprise Information Management, vol. 36, no. 3, pp. 747–766, 2023-04.
[71] Y. Alsariera, V. Adeyemo, A. Balogun, and A. Alazzawi, “Ai meta-learners and extra-trees algorithm for the detection of phishing websites,” IEEE Access, vol. 8, pp. 142 532–142 542, 2020.
[72] A. Maci, A. Santorsola, A. Coscia, and A. Iannacone, “Unbalanced web phishing classification through deep reinforcement learning,” Computers, vol. 12, no. 6, 2023-06.
[73] R. Goenka, M. Chawla, and N. Tiwari, “A comprehensive survey of phishing: Mediums, intended targets, attack and defence techniques and a novel taxonomy,” International Journal of Information Security, vol. 23, no. 2, pp. 819–848, 2024.
[74] L. Gallo, D. Gentile, S. Ruggiero, A. Botta, and G. Ventre, “The human factor in phishing: Collecting and analyzing user behavior when reading emails,” Computers & Security, vol. 139, p. 103671, 2024.
[75] R. Paudel and M. N. Al-Ameen, “Priming through persuasion: Towards secure password behavior,” Proceedings of the ACM on Human-Computer Interaction, vol. 8, no. CSCW1, pp. 1–27, 2024.
[76] H. Gururaj, V. Janhavi, and V. Ambika, Social Engineering in Cybersecurity: Threats and Defenses. CRC Press, 2024.
[77] A. Juanna, M. A. S. Monoarfa, R. Podungge, and R. Tantawi, “Identification of trends in business promotion and marketing using video-based content on social media,” Jambura Science of Management, vol. 6, no. 2, pp. 88–103, 2024.
[78] N. Akyeşilmen and A. Alhosban, “Non-technical cyber-attacks and international cybersecurity: The case of social engineering,” Gaziantep University Journal of Social Sciences, vol. 23, no. 1, pp. 342–360, 2024.
[79] D. Senecal, The Reign of Botnets: Defending Against Abuses, Bots and Fraud on the Internet. John Wiley & Sons, 2024.
[80] K. Church and R. De Oliveira, “What’s up with whatsapp? comparing mobile instant messaging behaviors with traditional sms,” in Proceedings of the 15th international conference on Human-computer interaction with mobile devices and services, 2013, pp. 352–361.
[81] Z. Alkhalil, C. Hewage, L. Nawaf, and I. Khan, “Phishing attacks: A recent comprehensive study and a new anatomy,” Frontiers in Computer Science, vol. 3, p. 563060, 2021.
[82] P. Syiemlieh, G. M. Khongsit, U. M. Sharma, and B. Sharma, “Phishing-an analysis on the types, causes, preventive measures and case studies in the current situation,” IOSR J. Comput. Eng, vol. 9, pp. 2278–8727, 2015.
[83] W. Kim, O.-R. Jeong, C. Kim, and J. So, “The dark side of the internet: Attacks, costs and responses,” Information systems, vol. 36, no. 3, pp. 675–705, 2011.
[84] N. Knopf, “Social engineering: How crowdmasters, phreaks, hackers, and trolls created a new form of manipulative communication —robert w,” IEEE Technology and Society Magazine, vol. 42, no. 1, p. 344, 2022.
[85] B. Dooremaal, P. Burda, L. Allodi, and N. Zannone, “Combining text and visual features to improve the identification of cloned webpages for early phishing detection,” in ACM International Conference Proceeding Series, 2021.
[86] I. Tomicic, “Social engineering aspects of email phishing: an overview and taxonomy,” in 2023 46th ICT and Electronics Convention, MIPRO 2023 - Proceedings, 2023.
[87] S. Dadvandipour and A. Ganie, “Analyzing and predicting spear-phishing using machine learning methods,” Multidiszciplináris tudományok, vol. 10, no. 4, 2020.
[88] H. Oz, A. Aris, A. Levi, and A. Uluagac, “A survey on ransomware: Evolution, taxonomy, and defense solutions,” ACM Comput Surv, vol. 54, no. 11s, 2022.
[89] Ekta and U. Bansal, “A review on ransomware attack,” in ICSCCC 2021 - International Conference on Secure Cyber Computing and Communications, 2021.
[90] S. Ullah, T. Ahmad, A. Buriro, N. Zara, and S. Saha, “Trojandetector: A multi-layer hybrid approach for trojan detection in android applications,” Applied Sciences (Switzerland), vol. 12, no. 21, 2022.
[91] I. Riadi, Sunardi, and D. Aprilliansyah, “Analysis of anubis trojan attack on android banking application using mobile security labware,” International Journal of Safety and Security Engineering, vol. 13, no. 1, 2023.
[92] A. Hannousse, S. Yahiouche, and M. C. Nait-Hamoud, “Twenty-two years since revealing cross-site scripting attacks: a systematic mapping and a comprehensive survey,” Computer Science Review, vol. 52, p. 100634, 2024.
[93] F. Kalantari, M. Zaeifi, T. Bao, R. Wang, Y. Shoshitaishvili, and A. Doupé, “Context-auditor: Context-sensitive content injection mitigation,” in ACM International Conference Proceeding Series, 2022.
[94] S. Yadav, A. Mahajan, M. Prasad, and A. Kumar, “Advanced keylogger for ethical hacking,” International Journal of Engineering Applied Sciences and Technology, vol. 5, no. 1, 2020.
[95] K. Hussain, A. R. Rahmatyar, B. Riskhan, M. A. U. Sheikh, and S. R. Sindiramutty, “Threats and vulnerabilities of wireless networks in the internet of things (iot),” in 2024 IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC). IEEE, 2024, pp. 1–8.
[96] I. Despotopoulos, “Wireless local area network security and modern cryptographic protocols: Wep & wpa1/2/3,” 2024.
[97] B. Tsouvalas and N. Nikiforakis, “Knocking on admin’s door: Protecting critical web applications with deception,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2024, pp. 283–306.
[98] D. Senecal, The Reign of Botnets: Defending Against Abuses, Bots and Fraud on the Internet. John Wiley & Sons, 2024.
[99] I. Despotopoulos, “Wireless local area network security and modern cryptographic protocols: Wep & wpa1/2/3,” 2024.
[100] M. A. I. Mallick and R. Nath, “Navigating the cyber security landscape: A comprehensive review of cyber-attacks, emerging trends, and recent developments,” World Scientific News, vol. 190, no. 1, pp. 1–69, 2024.
[101] U. Joseph and M. Jacob, “Real time detection of phishing attacks in edge devices using lstm networks,” in AIP Conference Proceedings, 2022.
[102] R. Ulfath, I. Sarker, M. Chowdhury, and M. Hammoudeh, “Detecting smishing attacks using feature extraction and classification techniques,” in Lecture Notes on Data Engineering and Communications Technologies, 2022, vol. 95.
[103] S. Tang, X. Mi, Y. Li, X. Wang, and K. Chen, “Clues in tweets: Twitter-guided discovery and analysis of sms spam,” in Proceedings of the ACM Conference on Computer and Communications Security, 2022.
[104] R. Mayrhofer, J. Stoep, C. Brubaker, and N. Kralevich, “The android platform security model,” ACM Transactions on Privacy and Security, vol. 24, no. 3, 2021.
[105] M. Suleman, T. Soomro, T. Ghazal, and M. Alshurideh, “Combating against potentially harmful mobile apps,” 2021.
[106] M. Armstrong, K. Jones, and A. Namin, “How perceptions of caller honesty vary during vishing attacks that include highly sensitive or seemingly innocuous requests,” Hum Factors, vol. 65, no. 2, 2023.
[107] “scholar (11)”.
[108] S. Mahdavifar, N. Maleki, A. Lashkari, M. Broda, and A. Razavi, “Classifying malicious domains using dns traffic analysis,” in 2021 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). IEEE, 2021, pp. 60–67.
[109] H. Owen, J. Zarrin, and S. Pour, “A survey on botnets, issues, threats, methods, detection and prevention,” Journal of Cybersecurity and Privacy, vol. 2, no. 1, 2022.
[110] H. Kilavo, L. Mselle, R. Rais, and S. Mrutu, “Reverse social engineering to counter social engineering in mobile money theft: A tanzanian context,” Journal of Applied Security Research, vol. 18, no. 3, 2023.
[111] M. Wang, L. Song, L. Li, Y. Zhu, and J. Li, “Phishing webpage detection based on global and local visual similarity,” Expert Systems with Applications, vol. 252, p. 124120, 2024.
[112] A. Brunstein, “Automatic web crawler for malicious websites classification,” Ph.D. dissertation, Politecnico di Torino, 2024.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
[113] D.-J. Liu and J.-H. Lee, “A cnn-based sia screenshot method to [135] I. Kara, M. Ok, and A. Ozaday, “Characteristics of understanding
visually identify phishing websites,” Journal of Network and Systems urls and domain names features: The detection of phishing websites
Management, vol. 32, no. 1, p. 8, 2024. with machine learning methods,” IEEE Access, vol. 10, 2022.
[114] O. Sarker, A. Jayatilaka, S. Haggag, C. Liu, and M. A. Babar, “A [136] F. Sadique, R. Kaul, S. Badsha, and S. Sengupta, “An automated
multi-vocal literature review on challenges and critical success factors framework for real-time phishing url detection,” in 2020 10th Annual
of phishing education, training and awareness,” Journal of Systems Computing and Communication Workshop and Conference, CCWC
and Software, vol. 208, p. 111899, 2024. 2020, 2020.
[115] S. Das Guptta, K. T. Shahriar, H. Alqahtani, D. Alsalman, and I. H. [137] A. Bozkir and M. Aydos, “Logosense: A companion hog based
Sarker, “Modeling hybrid feature-based phishing websites detection logo detection scheme for phishing web page and e-mail brand
using machine learning techniques,” Annals of Data Science, vol. 11, recognition,” Comput Secur, vol. 95, 2020.
no. 1, pp. 217–242, 2024. [138] M. Pandey, M. Singh, S. Pal, and B. Tiwari, “Prediction of phishing
[116] A. Aljammal, S. taamneh, A. Qawasmeh, and H. Salameh, “Machine websites using stacked ensemble method and hybrid features selec-
learning based phishing attacks detection using multiple datasets,” tion method,” SN Comput Sci, vol. 3, no. 6, 2022.
International Journal of Interactive Mobile Technologies, vol. 17, [139] P. Indrasiri, M. Halgamuge, and A. Mohammad, “Robust ensemble
no. 5, 2023. machine learning model for filtering phishing urls: Expandable ran-
[117] M. Sánchez-Paniagua, E. Fidalgo, E. Alegre, and R. Alaiz-Rodrı́guez, dom gradient stacked voting classifier (erg-svc,” IEEE Access, vol. 9,
“Phishing websites detection using a novel multipurpose dataset and 2021.
web technologies features,” Expert Syst Appl, vol. 207, 2022. [140] M. Prince, A. Hasan, and F. Shah, “A new ensemble model for
[118] Q. Li, G. Zhong, C. Xie, and R. Hedjam, “Weak edge identification phishing detection based on hybrid cumulative feature selection,” in
network for ocean front detection,” IEEE Geoscience and Remote ISCAIE 2021 - IEEE 11th Symposium on Computer Applications and
Sensing Letters, vol. 19, 2022. Industrial Electronics, 2021.
[119] L. Xue, “mt5: A massively multilingual pre-trained text-to-text [141] L. Rani, C. Foozy, and S. Mustafa, “Feature selection to enhance
transformer,” in NAACL-HLT 2021 - 2021 Conference of the North phishing website detection based on url using machine learning
American Chapter of the Association for Computational Linguistics: techniques,” Journal of Soft Computing and Data Mining, vol. 4,
Human Language Technologies, Proceedings of the Conference, no. 1, 2023.
2021. [142] J. Moedjahedy, A. Setyanto, F. Alarfaj, and M. Alreshoodi, “Ccrfs:
[120] S. Gopal and C. Poongodi, “Mitigation of phishing url attack in iot Combine correlation features selection for detecting phishing web-
using h-ann with h-ffgwo algorithm,” KSII Transactions on Internet sites using machine learning,” Future Internet, vol. 14, no. 8, 2022.
and Information Systems, vol. 17, no. 7, 2023. [143] Y. Mansour and M. Alenizi, “Enhanced classification method for
[121] H. Alqahtani, “Evolutionary algorithm with deep auto encoder net- phishing emails detection,” Journal of Information Security and
work based website phishing detection and classification,” Applied Cybercrimes Research, vol. 3, no. 1, 2020.
Sciences (Switzerland, vol. 12, no. 15, 2022. [144] A. Alhussan, H. Al-Mahdawi, and A. Kadi, “Spam detection in
[122] K. Apoorva and S. Sangeetha, “Analysis of uniform resource locator connected networks using particle swarm and genetic algorithm
using boosting algorithms for forensic purpose,” Comput Commun, optimization: Youtube as a case study,” International Journal of
vol. 190, 2022. Wireless and Ad Hoc Communication, vol. 6, no. 1, pp. 08–18,, 2023.
[145] A. Ramana, K. Rao, and R. Rao, “Stop-phish: an intelligent phishing
[123] V. Mazzeo, A. Rapisarda, and G. Giuffrida, “Detection of fake news
detection method using feature selection ensemble,” Soc Netw Anal
on covid-19 on web search engines,” Front Phys, vol. 9, 2021.
Min, vol. 11, no. 1, 2021.
[124] A. Aljofey, Q. Jiang, Q. Qu, M. Huang, and J. Niyigena, “An effective
[146] S. Anupam and A. Kar, “Phishing website detection using sup-
phishing detection model based on character level convolutional
port vector machines and nature-inspired optimization algorithms,”
neural network from url,” Electronics (Switzerland, vol. 9, no. 9,
Telecommun Syst, vol. 76, no. 1, pp. 17–32,, 2021-01.
pp. 1–24,, 2020-09.
[147] O. Sarker, A. Jayatilaka, S. Haggag, C. Liu, and M. Babar, “A
[125] E. Gualberto, R. Sousa, T. Brito Vieira, J. Costa, and C. Duque, “The multi-vocal literature review on challenges and critical success
answer is in the text: Multi-stage methods for phishing detection factors of phishing education, training and awareness,” Journal
based on feature engineering,” IEEE Access, vol. 8, 2020. of Systems and Software, vol. 208, pp. 111 899,, 2024. [Online].
[126] S. Shabudin, N. Sani, K. Ariffin, and M. Aliff, “Feature selection for Available: https://doi.org/10.1016/j.jss.2023.111899.
phishing website classification,” International Journal of Advanced [148] M. Grubbs, “Anti-phishing game-based training: An experimental
Computer Science and Applications, vol. 11, no. 4, 2020. analysis of demographic factors,” SSRN Electronic Journal, 2022.
[127] A. Singh and S. Misra, “A comparison of performance of rough [149] J. Brickley, K. Thakur, and A. Kamruzzaman, “A comparative
set theory with machine learning techniques in detecting phishing analysis between technical and non-technical phishing defences,” In-
attack,” in Lecture Notes in Networks and Systems, 2022, vol. 289. ternational Journal of Cyber-Security and Digital Forensics, vol. 10,
[128] M. El-Rashidy, “A smart model for web phishing detection based no. 1, 2021.
on new proposed feature selection technique,” Menoufia Journal of [150] A. Chattopadhyay, C. Maschinot, and L. Nestor, “Mirror on the wall -
Electronic Engineering Research, vol. 30, no. 1, 2021. what are cybersecurity educational games offering overall: A research
[129] A. Thahira and A. John, “Phishing website detection using lgbm study and gap analysis,” in Proceedings - Frontiers in Education
classifier with url-based lexical features,” in Proceedings - 2022 IEEE Conference, FIE, 2021.
Silchar Subsection Conference, SILCON 2022, 2022. [151] S. Bell and P. Komisarczuk, “An analysis of phishing blacklists:
[130] H. Zhao, Z. Chen, and R. Yan, “Malicious domain names detection Google safe browsing, openphish, and phishtank,” in ACM Interna-
algorithm based on statistical features of urls,” in 2022 IEEE 25th tional Conference Proceeding Series, 2020.
International Conference on Computer Supported Cooperative Work [152] S. Chanti and T. Chithralekha, “Classification of anti-phishing solu-
in Design, CSCWD 2022, 2022. tions,” SN Comput Sci, vol. 1, no. 1, 2020.
[131] S. Asiri, Y. Xiao, S. Alzahrani, S. Li, and T. Li, “A survey of [153] M. Korkmaz, E. Kocyigit, O. Sahingoz, and B. Diri, “A hybrid
intelligent detection designs of html url phishing attacks,” IEEE phishing detection system using deep learning-based url and content
Access, vol. 11, pp. 6421–6443,, 2023. analysis,” Elektronika ir Elektrotechnika, vol. 28, no. 5, 2022.
[132] R. Rao, T. Vaishnavi, and A. Pais, “Catchphish: detection of phishing [154] R. Zaimi, M. Hafidi, and M. Lamia, “Survey paper: Taxonomy of
websites by inspecting urls,” J Ambient Intell Humaniz Comput, website anti-phishing solutions,” in 2020 7th International Confer-
vol. 11, no. 2, 2020. ence on Social Network Analysis, Management and Security, SNAMS
[133] F. Kausar, B. Al-Otaibi, A. Al-Qadi, and N. Al-Dossari, “Hybrid 2020, 2020.
client side phishing websites detection approach,” International Jour- [155] T. Suleman, “A survey on web phishing detection techniques: A
nal of Advanced Computer Science and Applications, vol. 5, no. 7, taxonomy-based approach,” LGU International Journal for Electronic
2014. Crime Investigation, pp. 1–12,, 2021.
[134] C. Tan, K. Chiew, K. Yong, S. Sze, J. Abdullah, and Y. Sebastian, [156] Z. Azam, M. Islam, and M. Huda, “Comparative analysis of intru-
“A graph-theoretic approach for the detection of phishing webpages,” sion detection systems and machine learning-based model analysis
Comput Secur, vol. 95, 2020. through decision tree,” IEEE Access, vol. 11, 2023.

XII. Biography Section

SHAKEEL AHMAD is an eminent professional educationist who has worked as
Subject Specialist (Computer Science) in the School Education Department,
Punjab, Pakistan, for the last 10 years. Born on September 6, 1986, in Rasool
Pur Tarar, a town within District Hafizabad, Punjab, Pakistan, he holds a
master’s degree in computer science & information technology from the
prestigious University of Education, Division of Science and Technology,
Township Campus, Lahore, Punjab, Pakistan (2012). Prior to his master’s
degree, he completed his graduation in Business and Finance at the University
of Punjab, Lahore, Pakistan, in 2008. He started his professional career as
Subject Specialist (Computer Science) in 2014 and began working as a Research
Assistant in Machine Learning & Deep Learning in 2015. He also completed a
certification in Computer Application offered by the Govt. of Punjab,
Pakistan, in 2008. Shakeel’s expertise spans Research and Development in
Machine Learning and Object Detection Algorithms, Technical Office Management
& Administration, Technical Report writing, SOP writing, Training management,
and Training Policy administration across Primary to Advanced technical
levels, showcasing his exceptional versatility and proficiency.

MUHAMMAD ZAMAN (Member, IEEE) is a committed lecturer at the University of
Lahore and a graduate of COMSATS University of Islamabad with a Master of
Science in Computer Science. Driven by an unwavering dedication to the
progression of knowledge, he possesses extensive research acumen in numerous
fields, including but not limited to medical image processing, remote sensing,
and natural language processing. His substantial research pursuits and
contributions demonstrate a profound enthusiasm for artificial intelligence,
machine learning, deep learning, computer vision, and reinforcement learning.
Muhammad’s numerous publications in computer vision have substantiated his
contributions to the field. Furthermore, he has served as a mentor and guide
to a considerable number of MS students as they completed their research
theses and papers, which demonstrates his dedication to cultivating the next
generation of scholars and innovators.

AHMAD SAMI AL-SHAMAYLEH received the master’s degree in Information Systems
from The University of Jordan, Jordan, in 2014, and the Ph.D. degree in
Artificial Intelligence from the University of Malaya, Malaysia, in 2020. He is
currently an Assistant Professor with the Faculty of Information Technology,
Al-Ahliyya Amman University, Jordan. His research interests include Artificial
Intelligence, Human Computer Interaction, IoT, Arabic NLP, Arabic sign
language recognition, language resources production, the design and evaluation
of interactive applications for handicapped people, multimodality, and
software engineering.

Dr. TANZILA KEHKASHAN is a lecturer at the Faculty of Computer Science at the
University of Lahore, Pakistan. She earned her Ph.D. from Universiti Teknologi
Malaysia (UTM) and holds a Master of Science in Computer Science from the
University of Central Punjab, Lahore, Pakistan. In addition to her teaching
role, Dr. Kehkashan is an active member of the Virtual, Visualization, and
Vision Research Group (UTM VicubeLab, Malaysia), where she conducts
cutting-edge research in visual computing, particularly in the areas of
computer vision and natural language processing. Her work has been published
in prestigious journals and conference proceedings, showcasing her commitment
to advancing research in these fields. Dr. Kehkashan also serves as a
supervisor for master’s theses and final-year projects, contributing
significantly to academic mentorship. Her research interests include image and
video analysis, medical imaging, and language modeling. Her dedication to both
research and academic practice reflects her passion for the field of computer
science.

RAHIEL AHMAD is a distinguished professional born on February 4, 1987, in
Rasool Pur Tarar, a town within District Hafizabad, Punjab, Pakistan. He holds
a master’s degree in computer science (specialized in AI & ML) from the
prestigious University of Lahore, Punjab, Pakistan (2021). Prior to his
master’s degree, Rahiel completed his bachelor’s in software engineering at
COMSATS University Islamabad. With a rich and diverse professional background,
Engr. Rahiel currently serves as a Training Coordinator at Avionics Flight,
PAFAA, a position he has held since June 2023. Prior to this, Engr. Rahiel
held notable roles as an Assistant Avionics Maintenance Engineer at Trainer
Fleet, Avionics Flight, PAFAA (2010-2014), and as Senior Avionics Maintenance
Engineer and Technical Administrator at UAV Fleet, PAF, Mushaf (2014-2023).
Engr. Rahiel’s expertise spans Research and Development in Machine Learning
and Object Detection Algorithms, Technical Office Management & Administration,
Technical Report writing, SOP writing, Training management, and Training
Policy administration across Primary to Advanced technical levels, showcasing
his exceptional versatility and proficiency.

SHAFI’I MUHAMMAD ABDULHAMID is currently an Assistant Professor with the
Department of Information Technology, Community College of Qatar. His research
interests include soft computing, machine learning, fog computing, and cloud
computing security.

ISMAIL ERGEN is a distinguished professional with a rich background in art,
design, and emerging technologies. After completing his university education
in Turkey, he relocated to the United States, where he spent 15 years honing
his skills and expanding his knowledge base. During this period, Ismail
completed his Proficiency in Art education at the Academy of Art University in
San Francisco, one of the world’s foremost art institutions. His career in
America spanned over a decade, during which he worked in various roles related
to advertising, art, and business. Ismail is also an innovative entrepreneur,
having established a toy company in the United States that specializes in
designing cutting-edge 3D toys. Upon returning to Turkey, he pursued and
completed a doctoral thesis on artificial intelligence and graphic design at
Arel University. Currently, Ismail holds a managerial position at Ambeent.ai,
where he delves into the impact of emerging technologies on game design. His
work encompasses game design, artificial intelligence, experience design, as
well as UX and UI, reflecting his multifaceted expertise and ongoing
commitment to advancing the field of design through technology.

ADNAN AKHUNZADA (Senior Member, IEEE), a Professional Member of ACM, has a
rich and accomplished tenure of 15 years in Research and Development (R&D),
seamlessly navigating the intersection of the ICT industry and academia. He is
renowned for high-impact publications, US patents, and commercial products.
Dr. Adnan’s patented cybersecurity and AI innovations have secured
multi-million-dollar projects for global corporations such as Vinnova and EU
Horizon. In recognition of his outstanding scholarly contributions, Stanford
University acknowledged him as one of the top 2% scientists globally in 2023.
Prof. Adnan Akhunzada leverages his strong cybersecurity skills and
cutting-edge technological knowledge to solve industrial problems and develop
state-of-the-art security tools, techniques, and frameworks. He has completed
a postdoc in Cybersecurity and holds a PhD in Network Security and an MS in
Information Security. His expertise in Cybersecurity & AI, Secure Future
Internet, modelling and designing Secure & Dependable Software Defined
Networks, Large-Scale Distributed Systems (Cloud, Fog, Edge, IoT, IoE, IIoT,
CPS), Lightweight Cryptographic Next Generation Communication Protocols,
QoS/QoE, and Adversarial Machine Learning is helping shape the future of
secure and dependable systems.
