Phishing Website Detection Machine Learning

Review
Aditya Deshmukh1, Akash Yadav2, Pratham Maske3, Shreyash Kathane4, Dr. D.S Adane5
Student, Information & Technology, RCOEM, Nagpur, India1-4
Professor of Information & Technology, RCOEM, Nagpur, India5
Abstract: As technology develops, the opportunity for cybercrime grows. URL-based phishing attacks are among the most common threats to Internet users. Such attacks do not rely on technical vulnerabilities; instead, they exploit human weaknesses and are launched against both organizations and individuals. Attackers deceive users into clicking on URLs that appear trustworthy, leading them to reveal sensitive information or install malware. Machine learning techniques for phishing URL detection classify URLs as phishing or legitimate, and researchers continue to refine these models to make them as accurate and efficient as possible. This paper reviews machine learning techniques for detecting phishing URLs, together with the URL features and datasets used to train the models, and discusses the methods researchers have proposed to improve detection accuracy.

1. INTRODUCTION
In 2024, our reliance on technology deepened further, exposing us to more cyber threats. The ongoing digital transformation, accelerated by the global pandemic, has created fertile ground for cybercriminals. Recent analyses and reports point to a surge in security breaches that have caused financial losses and exposure of personal information on a very large scale. Phishing remains prevalent among these crimes, using both social engineering and technical deception to steal personal identity data and financial account credentials. Attackers build fake versions of trusted websites to trick people into voluntarily divulging usernames, passwords, banking details, and other sensitive information. These phishing URLs are typically distributed through e-mail, instant messages, or text messages, so users should stay alert and follow sound cybersecurity practices.

The FBI's Internet Crime Complaint Center (IC3) 2023 report [1] highlights a significant rise in online fraud, with 880,418 complaints and potential losses exceeding $12.5 billion, a 10% increase in complaints and a 22% rise in losses over 2022. California reported the highest number of complaints and losses, with nearly 80,000 complaints and over $2 billion in losses. Investment scams were the most damaging: they alone cost victims $4.57 billion, an increase of 38% over the previous year. Crypto-investment fraud alone accounted for $3.94 billion, a 53% rise. Phishing-type schemes are among the most reported crimes, with over 298,000 complaints, which make up about 34% of all complaints.

Fig 1: Complaints and losses over the last five years [1]

The report stresses the importance of public reporting to IC3 in helping the FBI combat cyber threats. The FBI encourages consumers to look out for and read consumer and industry alerts about cybercrime, notify financial institutions if victimized, and file a report with IC3 or local law enforcement.

2. BACKGROUND

A. Phishing Detection
A URL-based phishing attack is carried out by sending users malicious links that seem legitimate and tricking them into clicking on them. In phishing detection, an incoming URL is analysed on the basis of its features and classified as phishing or legitimate. Different machine learning algorithms are trained on various datasets of URL features to perform this classification.

B. Phishing Detection Approaches

List-Based Phishing Detection Systems
These systems rely on two lists to classify a website as either phishing or non-phishing. The whitelist contains safe and legitimate websites, while the blacklist includes those identified as phishing. Researchers have used whitelists to ensure that only URLs on the list are accessible. Another approach is the blacklist method, where URLs are checked against a list of known phishing sites. However, these systems have a significant drawback: even a small change in the URL can prevent it from being matched in the list. Additionally, they struggle to catch new, zero-day attacks.[3]

Rule-Based Phishing Detection Systems
The feature sets for rule-based systems stem from relational rule mining. The rules weight the characteristics most prevalent in phishing URLs. Used together with the features, these rules give better accuracy than classification on the features alone. For example, researchers in the CANTINA study used TF-IDF and some specific rules to identify phishing attacks. Similar works have combined features and rules to achieve higher detection accuracy.[3]

Visual Similarity-Based Phishing Detection Systems
These systems compare web pages with phishing sites visually to detect phishing attempts. They take a server-perspective comparison of both phishing and non-phishing sites and use image processing techniques to identify minor visual differences that users would not notice. Fake sites are designed to resemble the original ones; however, these techniques make the slight differences visible. Studies have shown that visual similarity-based systems can be effective detection models against phishing attacks when comparing generic visual elements.[3]

Machine Learning-Based Phishing Detection Systems
Machine learning-based systems detect phishing websites by classifying specified features using artificial intelligence techniques. These features can include URL structure, domain name, website content, and more. Due to their dynamic nature, these systems are particularly popular for detecting anomalies on websites. Machine learning models can adapt to new phishing tactics, making them highly effective in protecting users from evolving threats.[3]

3. LITERATURE REVIEW
In this section, a few of the research works that deploy the above-mentioned algorithms are reviewed and their results are summarized.

The study conducted by Dr. Nitin N. Sakhare et al. [2] integrated conventional machine learning models like XGBoost, LightGBM, and a referenced but inactive Random Forest classifier alongside a Graph Neural Network (GNN). The XGBoost classifier gives an accuracy of 92.09%, and LightGBM gives the highest accuracy of 93.29%. Apart from this, they implement another tree-based machine learning algorithm, CatBoost, which gives an accuracy of 92.98%. The GNN's performance left considerable scope for improvement. LightGBM emerged as the standout performer, giving a precision score of 0.93 alongside a recall score of 0.93.

B. Sucharitha et al. [3] investigated the application of machine learning algorithms to classify phishing websites. The dataset for this research comprises 32 features including IP address, URL length, URL shortening service employed, and state of SSL, among others. The study presents these as salient features of malicious URLs that identify phishing websites. The authors considered different machine learning models, namely Decision Trees, Random Forest, and Gradient Boosting. These models were evaluated using metrics such as accuracy, precision, recall, and F1-score. Among all models, Gradient Boosting achieved the highest score with accuracy 98.9%, precision 99.0%, recall 99.4%, and F-value just slightly lower at 98.6%. Thus, the authors concluded that ensemble methods such as Gradient Boosting and Random Forest can provide accurate results and strong generalization when detecting phishing websites. The authors stress the importance of using features from varied sources and suggest that combining machine learning models with other phishing detection techniques can enhance detection further. This research illustrates the use of machine learning in the detection of phishing websites and points to further improvement through hybrid models and additional features.

Machikuri Santoshi Kumari et al. [4] detect phishing using models that combine blacklisting with machine-learning methods. Several machine-learning algorithms, such as XGBoost, Random Forest, Decision Tree, and Multilayer Perceptrons, were used for the detection. Other datasets were used in addition to the Phishtank dataset, namely: one containing phishing websites and the second containing phonemy features. A total of 30 features were used, of which the most important were HTTPS, followed by Anchor URL, Website Traffic, etc. XGBoost gave the maximum training accuracy of 100% and the best test accuracy, 96.7%, of all the algorithms. They concluded that "using the XGBoost algorithm to detect phishing improves prediction accuracy."
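
The list-based approach described in Section 2, which [4] combines with machine learning, can be sketched as a pair of exact lookups. This is only an illustration under stated assumptions: the list entries, the function name `list_based_verdict`, and the three return labels are hypothetical and do not come from any of the reviewed papers.

```python
from urllib.parse import urlsplit

# Hypothetical list entries for illustration; real systems rely on large,
# frequently updated feeds such as PhishTank.
BLACKLIST = {"http://www.confirme-paypal.com/login"}
WHITELIST_HOSTS = {"www.paypal.com"}


def list_based_verdict(url: str) -> str:
    """Return 'phishing', 'legitimate', or 'unknown' using exact list lookups."""
    if url in BLACKLIST:
        return "phishing"
    host = urlsplit(url).hostname or ""
    if host in WHITELIST_HOSTS:
        return "legitimate"
    # Not on either list: a hybrid system would pass the URL on to rules
    # or to a trained classifier at this point.
    return "unknown"
```

Calling `list_based_verdict("http://www.confirme-paypal.com/login?id=1")` returns "unknown" even though the same URL without the query string is blacklisted, which is the drawback noted above: a small change in the URL defeats the exact match.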
A. Orunsolu et al. [5] proposed a scalable architecture that combines incremental learning with a modular approach. Using a dataset from Phishtank (comprising 2,541 phishing URLs) and Alexa (containing 2,500 legitimate URLs), the model attained 99.96% accuracy with a low false positive rate of 0.04%. Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms were used in the comparative performance studies. The study provides a criterion for assessing feature importance based on how often a feature occurs in the phishing and legitimate datasets. Features are then selected for maximum relevance with minimum redundancy. The URL features include, but are not limited to, length, presence of '@', and hexadecimal codes. The webpage features investigated include validity of SSL certificates and congruency with domain names, while patterns of behavior, such as cookie handling, and the age of the domain also qualify as important features. The incremental methodology processes these features in stages, starting with URL analysis, followed by webpage properties, and finally webpage behaviors if needed. This modular approach ensures scalability and adaptability to new phishing tactics. The study's results demonstrate the effectiveness of the proposed system, though it has limitations such as dataset diversity, lack of real-time testing, and absence of benchmarking.

Korkmaz et al. [6] addressed the persistent problem of phishing through URL analysis, using machine-learning techniques to detect attacks that exploit human nature by imitating legitimate sites in a bid to obtain sensitive data. The work focuses primarily on the attributes of URLs in order to improve efficiency. The authors employed eight machine learning algorithms, including Random Forest (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM), which were tested on three datasets with over 126,000 URLs. The datasets combined phishing URLs from PhishTank with legitimate URLs from the Alexa and Common Crawl databases. The system extracted and used 48 key features from the URLs, including domain structure, special character presence, and length metrics, without recourse to third-party services, for efficiency reasons. The experimental results indicate that the Random Forest algorithm had the highest accuracy across the datasets (up to 94.59%) and had better accuracy than previous studies. The approach runs efficiently enough to be used for real-time detection. However, the limited coverage noted in the paper suggests directions for further work, such as expanding the initial dataset.

Phishing attack detection was investigated by Alam et al. (2020) [7], who used decision tree and random forest algorithms for the classification of attacks. The dataset, which came from Kaggle, had 30 significant features for identifying phishing URLs. A detailed preprocessing step produced clean, noise-free data and was followed by feature selection using algorithms like PCA. The performance of each algorithm was analyzed in terms of confusion matrices and the following performance measures: accuracy, precision, recall, and F1-score. Random forests outperformed decision trees (DTs), offering 97% accuracy compared to 91.94% for DTs, and dealt effectively with overfitting and variability. The study asserted that random forests and other ensemble approaches help filter out spam and assist phishing detection substantially in view of the large data.

Rashid et al. (2020) [8] presented a machine learning approach for phishing detection that uses Support Vector Machines (SVM) for classification. The dataset, obtained from repositories such as Phish Tank and Alexa, consists of valid and phishing URLs, together with internal features, such as the length of the URL, and external features derived from third-party services. Principal Component Analysis (PCA) was performed for dimension reduction to make processing more efficient. The model achieved 95.66% accuracy using SVM with only five features, higher than that achieved using other techniques, for example Random Forest, which showed an accuracy of 94.27% with 30 features. This reduction in the feature set improved computational efficiency while maintaining good detection rates. The authors indicated that their solution is robust at identifying new and transient phishing sites, which makes it a practical defense against cyber threats.

The research by Vahid Shahrivari and Mohammad Mahdi Darabi [9] deals with the application of various machine-learning algorithms to the detection of phishing websites. This research uses a dataset of 30 features, such as IP address presence, URL length, whether shortening services are used, and SSL state, among others. These characteristics are used to distinguish phishing websites from legitimate ones. Logistic regression, decision tree, random forest, AdaBoost, KNN, SVM, gradient boosting, XGBoost, and neural networks were the machine learning algorithms tried out. Besides accuracy, precision, recall, and F1 score were also used to assess the performance of the different models. XGBoost proved most accurate at 98.32%, Random Forest came second at an accuracy close to 97.27%, and the Neural Network also performed well, achieving 96.98% accuracy. The authors concluded that ensemble methods such as Random Forest and XGBoost are good at detecting phishing websites due to their high accuracy and robustness. They stressed the usefulness of employing multiple features and suggested that detection performance might be enhanced by coupling machine learning models with other phishing detection methods. This work illustrates the potential of machine learning to help discern phishing websites, and its further promise of improvement with hybrid models and novel features.

Jitendra Kumar et al. [10] described the training of Logistic Regression, Naive Bayes, Random Forest, Decision Tree and K-Nearest Neighbor classifiers using features derived from the lexical structure of URLs. They carefully created a dataset to avoid common problems like data imbalance, biased training, variance and overfitting. The preprocessed dataset was evenly split between phishing and trusted URLs and was further divided in a 70:30 ratio for training and testing. All classifiers had similar AUC (Area Under Curve) values, but the Naive Bayes classifier was reported as the best performer with the highest AUC value. It achieved an accuracy of 98% with precision of 1, recall of 0.95 and F1-score of 0.97. The study thus underlines the importance of a balanced dataset and presents Naive Bayes as a strong candidate for phishing detection.

Kulkarni and Brown (2019) [11] studied the detection of phishing websites with machine learning techniques. The dataset was obtained from the University of California, Irvine Machine Learning Repository and contains 1353 URLs labeled as phishing, suspicious, and legitimate. Nine features were extracted from URLs, including URL length, age of domain, presence of an IP address, and others. Four classifiers were run: Decision Tree, Support Vector Machine (SVM), Naïve Bayes, and Neural Network. The Decision Tree classifier achieved an accuracy of 91.5%, with a True Positive Rate (TPR) of 90.97% and a False Positive Rate (FPR) of 7.81%. The SVM was behind, achieving an accuracy of 86.69%, and Naïve Bayes and Neural Network trailed slightly at 86.14% and 84.87%, respectively. The study stated that Decision Trees work well with discrete feature values but need pruning to deal with overfitting. The authors concluded that more features and larger datasets would help the performance of the classifier and recommended ensemble methods and rule-based approaches for future work.

Rishikesh Mahajan and Irfan Siddavatam [12] focused on three classification algorithms: Decision Tree, Random Forest, and Support Vector Machine. The dataset was constructed from 17,058 benign URLs from Alexa and 19,653 phishing URLs from PhishTank, all with 16 features. The data were partitioned into training and testing sets with proportions of 50:50, 70:30, and 90:10. The performance was judged according to accuracy, false negative rate, and false positive rate. Random Forest stood out, achieving 97.14% accuracy with the lowest false negative rate. Their conclusion was that the more data used for training, the better the accuracy.

4. DATASETS
The datasets have been collected from various sites such as PhishTank [13], Alexa, etc., which hold data about phishing websites and keep it updated. The datasets contain all features and their respective values.

5. FEATURE EXTRACTION
URLs have certain characteristics and patterns that can be considered as their features. For URL-based analysis with machine learning models, we need to extract these features in order to form a dataset that can be used for training and testing. There are four categories of features that are most commonly considered for feature extraction, as in [9]. They are as follows:
1) Address Bar based features
2) Abnormal based features
3) HTML and JavaScript based features
4) Domain based features

Address Bar Based Features

1) Having IP Address: An IP address is used instead of the domain name in the URL, such as http://217.102.24.235/sample.html
2) URL Length: Phishers can use a long URL to hide the doubtful part in the address bar.
3) Shortening Service: A short link that leads to a webpage with a long URL. For example, the URL http://sharif.hud.ac.uk/ can be shortened to bit.ly/1sSEGTB.
4) Having @ Symbol: Using the @ symbol in the URL leads the browser to ignore everything preceding the @ symbol, and the real address often follows the @ symbol.
5) Double Slash Redirection: The existence of // within the URL, which means that the user will be redirected to another website.
6) Prefix Suffix: Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For example, http://www.Confirme-paypal.com.
7) Having Sub Domain: The URL contains subdomains.
8) SSL State: Shows whether the website uses SSL.
9) Domain Registration Length: Based on the fact that a phishing website lives for a short period.
10) Favicon: A favicon is a graphic image (icon) associated with a specific webpage. If the favicon is loaded from another domain, the webpage is likely to be considered a phishing attempt.
11) Using Non-Standard Port: To control intrusions, it is much better to open only the ports that you need. Several firewalls, proxy and Network Address Translation (NAT) servers will, by default, block all or most of the ports.
12) HTTPS token: Having a deceiving https token in the URL. For example, http://https-www-mellat-phish.ir

Abnormal Based Features

13) Request URL: Examines whether the external objects contained within a webpage, such as images, videos, and sounds, are loaded from another domain.
14) URL of Anchor: An anchor is an element defined by the <a> tag. This feature is treated exactly as Request URL.
15) Links In Tags: It is common for legitimate websites to use <Meta> tags to offer metadata about the HTML document; <Script> tags to create a client side script; and <Link> tags to retrieve other web resources.
16) Server Form Handler: Whether the domain name in the server form handler (SFH) is different from the domain name of the webpage.
17) Submitting Information To E-mail: A phisher might redirect the user's information to his email.
18) Abnormal URL: It is extracted from the WHOIS database. For a legitimate website, identity is typically part of its URL.
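
Several of the address bar features (1, 2, 3, 4, 5, 6 and 12) depend only on the URL string, so they can be sketched without fetching the page. The following is a minimal illustration, not the extraction code of any reviewed paper: the function names, the shortener list, and the length threshold of 75 characters are assumptions made here for the example.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumed, non-exhaustive list of URL shortening hosts (not from the paper).
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}


def is_ip_host(host: str) -> bool:
    """True if the host part of the URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url: str, long_url_threshold: int = 75) -> dict:
    """Compute a few URL-only address bar features as booleans.

    long_url_threshold is an assumed cut-off; the paper gives no number.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""            # lower-cased by urlsplit
    after_scheme = url.split("://", 1)[-1]  # ignore the '//' of the scheme
    return {
        "having_ip_address": is_ip_host(host),
        "long_url": len(url) > long_url_threshold,
        "shortening_service": host in SHORTENERS,
        "having_at_symbol": "@" in url,
        "double_slash_redirection": "//" in after_scheme,
        "prefix_suffix": "-" in host,
        "https_token": "https" in host,
    }
```

For instance, `address_bar_features("http://www.Confirme-paypal.com")` flags `prefix_suffix`, and `address_bar_features("http://https-www-mellat-phish.ir")` flags `https_token`. The remaining features (SSL state, favicon, registration length, ports) need network or WHOIS lookups and are outside this sketch.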

HTML & JavaScript Based Features

19) Website Redirect Count: Whether the page is redirected more than four times.
20) Status Bar Customization: Use of JavaScript to show a fake URL in the status bar to users.
21) Disabling Right Click: It is treated exactly as using onMouseOver to hide the link.
22) Using Pop-up Window: Whether the webpage shows pop-up windows.
23) IFrame: IFrame is an HTML tag used to display an additional webpage inside the one that is currently shown.

Domain Based Features

24) Age of Domain: Whether the age of the domain is less than a month.
25) DNS Record: Whether the domain has a DNS record.
26) Web Traffic: This feature measures the popularity of the website by determining the number of visitors.
27) Page Rank: Page rank is a value ranging from 0 to 1. PageRank aims to measure how important a webpage is on the Internet.
28) Google Index: This feature examines whether a website is in Google's index or not.
29) Links Pointing To Page: The number of links pointing to the web page.
30) Statistical Report: Whether the IP belongs to the top phishing IPs or not.

5.1) LEXICAL STRUCTURE OF A URL [10]

The structure of a URL can reveal a lot of hidden information. A URL starts with a protocol name like HTTP or HTTPS. The fully qualified domain name (FQDN) is the complete domain name of the server hosting the website, which is then translated into an IP address using DNS servers. The domain name consists of a second-level domain (SLD) and a top-level domain (TLD). This domain name is unique and registered with a domain registrar.

Fig 2: Lexical structure of a URL [10]

Let us consider this URL:
http://amazon.com-verification accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d
The lexical analysis of the above URL reveals the parts shown in the figure. Attackers obfuscate the URL in such a way that the actual domain name is nested deep inside the URL and is not easily noticed by an ordinary user.

Fig. 2. Different parts of the URL
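
The lexical split described in Section 5.1 can be sketched with the Python standard library. This is a simplified illustration rather than the method of [10]: the function name `lexical_parts` is hypothetical, and the rule "last label is the TLD, the label before it is the SLD" is a naive assumption that breaks for multi-label suffixes such as co.uk and for IP-literal hosts.

```python
from urllib.parse import urlsplit


def lexical_parts(url: str) -> dict:
    """Split a URL into protocol, FQDN, registered domain, subdomain and path.

    Naive assumption: the TLD is the last dot-separated label of the host
    and the SLD is the label before it.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    registered = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {
        "protocol": parts.scheme,
        "fqdn": host,
        "registered_domain": registered,       # SLD + TLD
        "subdomain": ".".join(labels[:-2]),     # everything left of the SLD
        "path": parts.path,
    }


# The separator between "verification" and "accounts" is unclear in the
# source text; a hyphen is assumed here purely so the example parses.
example = "http://amazon.com-verification-accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d"
print(lexical_parts(example))
```

Under these assumptions the registered domain comes out as darotob.com, while the familiar-looking "amazon.com" text sits in the subdomain, which is the kind of obfuscation that lexical features are meant to expose.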

TABLE I. RESULT ANALYSIS

Paper | Model Used | Suitable Models | Accuracy score
----- | ---------- | --------------- | --------------
[2] | XGBoost, LightGBM, Graph Neural Network (GNN) and CatBoost applied. Performance evaluated using accuracy, precision, recall and F1-score. | LightGBM gave highest accuracy with precision 0.93 and recall score 0.93 | XGBoost: 92.09%, LightGBM: 93.29%, GNN: 70%, CatBoost: 92.98%
[3] | Decision Tree (DT), Random Forest (RF) and Gradient Boost (GB) implemented. Performance evaluated using accuracy, precision, recall and F1 score. | GB produces reliable results in terms of accuracy 98.9%, precision 99.0%, recall 99.4%, and F1 score 98.6%. | GB: 98.9%, RF: 96.9%, DT: 96.0%
[4] | Combined blacklisting and ML algorithms (XGBoost, RF, DT, and Multilayer Perceptrons) applied to a dataset with features. Phishing URLs collected from Phishtank and OpenPhish. | Among these, XGBoost was found to be the most accurate model. | XGBoost: 96.7%, RF: 92.5%, DT: 90.5%, Multilayer Perceptrons: 88%
[5] | Support Vector Machine (SVM) and Naïve Bayes (NB) with features based on maximum relevance with minimum redundancy. Phishtank (2,541 phishing URLs) and Alexa (2,500 legitimate URLs) datasets. | Both Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers have TPR of 99.96, FNR of 0.04, TNR of 99.96, and FPR of 0.04 | SVM: 99.96%, NB: 99.96%
[6] | Random Forest (RF), Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbor (KNN), Decision Tree (DT), Naive Bayes (NB), XGBoost | Random Forest (RF) was the best-suited model based on its highest accuracy and overall performance in detecting phishing URLs. | RF: 94.59%, ANN: 94.35%, XGBoost: 92.95%, DT: 92.59%, KNN: 91.49%, LR: 91.31%, NB: 88.35%, SVM: 87.03%
[7] | DT and RF applied to a Kaggle dataset with 30 features. PCA used for feature selection. Performance evaluated using accuracy, precision, recall, and F1-score. | RF outperformed DT, addressing overfitting and variability effectively | RF: 97%, DT: 91.94%
[8] | Applied SVM on data from PhishTank and Alexa, with internal and external features, and PCA for dimensionality reduction | RF and NB classifiers had better accuracies. In terms of AUC, Gaussian Naive Bayes had a slightly higher value of 0.991. | SVM: 95.66%, RF: 94.27%
[9] | The examined classifiers are Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, Random Forest, Neural Networks, KNN, Gradient Boosting, and XGBoost. | Very good performance from the ensemble classifiers, namely Random Forest and XGBoost, in both computation duration and accuracy | Logistic regression: 92.6%, Decision tree: 96.5%, Random forest: 97.2%, AdaBoost: 93.6%, KNN: 95%, SVM: 94.9%, Gradient boosting: 94.8%, XGBoost: 98.3%
[10] | A balanced dataset was utilized to train classifiers such as Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and k-Nearest Neighbors (k-NN), using features derived from the lexical structure of URLs. | Random Forest and Naive Bayes demonstrated superior accuracy | Random Forest: 98.03%, Gaussian Naive Bayes: 97.18%
[11] | Four classifiers (DT, SVM, Naïve Bayes, Neural Network) applied to a UCI dataset with 1,353 labeled URLs and 9 extracted features | DT performed best with 91.5% accuracy but required pruning to address overfitting. Ensemble methods were recommended | DT: 91.5%, SVM: 86.69%, Naïve Bayes: 86.14%, Neural Network: 84.87%
[12] | The dataset was divided into split ratios of 50:50, 70:30, and 90:10. Decision Tree (DT), Random Forest (RF), and SVM classifiers were applied. | The Random Forest classifier demonstrated superior accuracy and the lowest false negative rate. | 50:50 split ratio: 96.72%, 70:30 split ratio: 96.84%, 90:10 split ratio: 97.14%

6. PERFORMANCE EVALUATION METRICS
Selected metrics are used to measure the performance of the system. These metrics are Accuracy, Precision, Recall, F1 Score, and the ROC curve, all derived from the values of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

In the context of URL classification:

True Positive (TP): The number of phishing URLs correctly detected as phishing.
True Negative (TN): The number of legitimate URLs correctly detected as legitimate.
False Positive (FP): The number of legitimate URLs incorrectly classified as phishing.
False Negative (FN): The number of phishing URLs incorrectly classified as legitimate.
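
The metrics named in the performance evaluation section follow directly from the four counts TP, TN, FP and FN, using their standard definitions. The sketch below assumes hypothetical counts; they are not taken from any reviewed paper and were chosen only so that precision and recall match the values 1 and 0.95 reported for Naive Bayes in [10].

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts (phishing = positive)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # also the TPR
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0          # false positive rate
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical counts: 100 phishing and 100 legitimate URLs in the test set.
m = classification_metrics(tp=95, tn=100, fp=0, fn=5)
print({name: round(value, 3) for name, value in m.items()})
```

With these counts the F1 score rounds to 0.97, which agrees with the F1-score that [10] reports alongside a precision of 1 and a recall of 0.95.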
A confusion matrix arranges these four values in a table and thereby summarizes the performance of the classification model [10].

OBSERVATIONS
Phishing attacks are constantly evolving, and new types of attacks appear often. Hence no particular detection approach or algorithm can be singled out as the best one that always gives exact results. The literature survey shows that Random Forest gives better results in most scenarios. However, the performance of each algorithm varies depending on the dataset used, the train-test split ratio, the feature selection techniques applied, and so on. Researchers prefer machine learning models that perform phishing detection with the best values for the evaluation metrics and the least training time. Therefore, future work should focus on these aspects of phishing detection.

7. CONCLUSION
Due to the greater demand for the security of personal, financial, and professional data in this digital era, phishing detection has become a highly critical area of research. URL-based analysis is one of the ways to enhance both detection speed and detection accuracy. By extracting features from the given URL and applying feature selection and dimensionality reduction techniques, models are refined by eliminating unnecessary data and focusing on the most informative features. Numerous machine learning algorithms have shown strong performance on phishing URL classification, including Random Forest, XGBoost, and Support Vector Machines. In this paper, we retrospectively examined phishing detection, focusing on different methodologies and their performance. The review provides a basis for future researchers taking their next step in improving phishing detection systems.

REFERENCES
[1] FBI, 2023 Internet Crime Report. Retrieved from: https://www.ic3.gov/Media/PDF/AnnualReport/2023_IC3Report.pdf
[2] Dr. Nitin N. Sakhare, Jyoti L. Bangare, Dr. Radhika G. Purandare, Disha S. Wankhede, Pooja Dehankar, "Phishing Website Detection Using Advanced Machine Learning Techniques", International Journal of Intelligent Systems and Applications in Engineering, 2024.
[3] Sucharitha, B., Chandini, B., Kumar, D. S., Surendra, M., & Kumar, G. K. (2024). Detecting phishing websites using machine learning. IJARCCE, 13(4). https://doi.org/10.17148/ijarcce.2024.134145
[4] Machikuri Santoshi Kumari, Chiguru Keerthi Priya, Gondhi Bhavya Haridas Neha, Monisha Awasthi, Surendra Tripathi, "Viable Detection of URL Phishing using Machine Learning Approach", 15th International Conference on Materials Processing and Characterization (ICMPC 2023).
[5] A. A. Orunsolu, A. S. Sodiya, and A. T. Akinwale, "A predictive model for phishing detection," Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 2, pp. 232-247, 2022.
[6] Korkmaz, M., Sahingoz, O. K., & Diri, B. (2020). Detection of Phishing Websites by Using Machine Learning-Based URL Analysis. Presented at the 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 1-3, 2020, IIT Kharagpur, India. IEEE.
[7] Mohammad Nazmul Alam, Dhiman Sarma et al., "Phishing attacks detection using machine learning approach," 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020.
[8] Junaid Rashid, "Phishing Detection Using Machine Learning Technique", First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), 2020.
[9] Vahid Shahrivari, Mohammad Mahdi Darabi, Mohammad Izadi, "Phishing Detection Using Machine Learning Techniques", arXiv preprint arXiv:2009.11116, 2020. Retrieved from arXiv.
[10] Jitendra Kumar, A. Santhanavijayan, B. Janet, Balaji Rajendran, and Bindhumadhava BS, "Phishing website classification and detection using machine learning," International Conference on Computer Communication and Informatics (ICCCI), 2020.
[11] Arun Kulkarni, Leonard L. Brown, "Phishing Websites Detection using Machine Learning", IJACSA International Journal of Advanced Computer Science and Applications, Vol. 10, No. 7, 2019.
[12] Rishikesh Mahajan and Irfan Siddavatam, "Phishing website detection using machine learning algorithms," International Journal of Computer Applications (0975-8887), vol. 181, no. 23, 2018.
[13] PhishTank: https://phishtank.org/
