Review Paper
Aditya Deshmukh1, Akash Yadav2, Pratham Maske3, Shreyash Kathane4, Dr. D.S Adane5
Student, Information & Technology, RCOEM, Nagpur, India1-4
Professor of Information & Technology, RCOEM, Nagpur, India5
Abstract— As technology develops, the opportunity for cybercrime grows with it. Phishing attacks based on URLs are among the most common threats to Internet users. Such attacks do not rely on technical vulnerabilities; instead, they exploit human weaknesses and are often launched against both organizations and individuals. Attackers deceive users into clicking on URLs that appear trustworthy, leading them to reveal sensitive information or install malware. Various machine learning techniques used for phishing URL detection classify URLs as phishing or legitimate. These models remain under development and refinement as researchers work to make them as accurate and efficient as possible. This paper reviews different machine learning techniques for detecting phishing URLs, together with the URL features and datasets used to train the models. It further discusses the methods put forth by researchers to enhance the detection accuracy of these models.

1. INTRODUCTION

In 2024, our reliance on technology only deepened, exposing us further to cyber threats. The ongoing digital transformation, given major impetus by the global pandemic, has created fertile ground for cybercriminals. Recent analyses and reports point to a surge in security breaches, which have caused financial losses and exposures of personal information on a massive scale. Phishing has remained prevalent among these cybercrimes, using both social engineering and technical deception to steal individuals' personal identity data and financial account credentials. Attackers build fake versions of trusted websites with the aim of tricking people into voluntarily divulging their usernames, passwords, banking details, and other sensitive information. These phishing URLs are typically distributed through e-mail, instant messages, or text messages, so users should remain alert and follow sound cybersecurity practices.

... over $2 billion in losses. Investment scams were the most damaging: they alone robbed victims of $4.57 billion, an increase of 38% from the previous year. Crypto-investment fraud alone accounted for $3.94 billion, a 53% rise. Phishing-type schemes are among the most reported crimes, with over 298,000 complaints, which make up about 34% of all complaints.

Fig 1: Complaints and losses of last 5 years [1]

The report stresses the importance of public reporting to IC3 to assist the FBI in combating cyber threats. The FBI encourages consumers to look out for and read consumer and industry alerts about cybercrime, notify financial institutions if victimized, and file a report with IC3 or local law enforcement.
A. Phishing Detection
A URL-based phishing attack is carried out by sending users malicious links that seem legitimate and tricking them into clicking on them. In phishing detection, an incoming URL is identified as phishing or not by analysing its different features, and is classified accordingly. Different machine learning algorithms are trained on various datasets of URL features to classify a given URL as phishing or legitimate.

B. Phishing Detection Approaches
List-Based Phishing Detection Systems
These systems rely on two lists to classify a website as either phishing or non-phishing. The whitelist contains safe and legitimate websites, while the blacklist includes those identified as phishing. Researchers have used whitelists to ensure that only URLs on the list are accessible. Another approach is the blacklist method, where URLs are checked against a list of known phishing sites. However, these systems have a significant drawback: even a small change in the URL can prevent it from being matched in the list. Additionally, they struggle to catch new, zero-day attacks.[3]

Rule-Based Phishing Detection Systems
The feature sets for rule-based systems stem from relational rule mining. The rules provide a weighting of the characteristics most prevalent in phishing URLs. Applied within the system, these rules yield better accuracy than classification on the features alone. For example, researchers in the CANTINA study used TF-IDF and some specific rules to identify phishing attacks. Similar works have combined features and rules to achieve higher detection accuracy.[3]

Visual Similarity-Based Phishing Detection Systems
These systems compare web pages with phishing sites visually to detect phishing attempts. They take a server-perspective comparison of both phishing and non-phishing sites and use image processing techniques to identify minor visual differences that users would not notice. Fake sites are designed to resemble the original ones; however, these techniques make the slight differences visible. Studies have shown that visual similarity-based systems can be effective detection models against phishing attacks when comparing generic visual elements.[3]

Machine Learning-Based Phishing Detection Systems
Machine learning-based systems detect phishing websites by classifying specified features using artificial intelligence techniques. These features can include URL structure, domain name, website content, and more. Due to their dynamic nature, these systems are particularly popular for detecting anomalies on websites. Machine learning models can adapt to new phishing tactics, making them highly effective in protecting users from evolving threats.[3]

3. LITERATURE REVIEW
In this section, a few of the research works that deploy the above-mentioned algorithms are reviewed and their results are summarized.

The study conducted by Dr. Nitin N. Sakhare et al. [2] integrated conventional machine learning models such as XGBoost, LightGBM, and a referenced but inactive Random Forest classifier alongside a Graph Neural Network (GNN). The XGBoost classifier gave an accuracy of 92.09%, while LightGBM gave the highest accuracy of 93.29%. Apart from this, they implemented another tree-based machine learning algorithm, CatBoost, which gave an accuracy of 92.98%. The GNN's performance left considerable room for improvement. LightGBM emerged as the standout performer, with a precision of 0.93 alongside a recall of 0.93.

B. Sucharitha et al. [3] investigated the application of machine learning algorithms to classify phishing websites. The dataset for this research comprises 32 features, including IP address, URL length, use of a URL shortening service, and SSL state, among others. The study presents these as salient features of malicious URLs that help identify phishing websites. The authors considered different machine learning models, namely Decision Trees, Random Forest, and Gradient Boosting. These models were evaluated using metrics such as accuracy, precision, recall, and F1-score. Among all the models, Gradient Boosting achieved the highest scores, with accuracy 98.9%, precision 99.0%, recall 99.4%, and an F-value slightly lower at 98.6%. The authors thus concluded that ensemble methods such as Gradient Boosting and Random Forest can provide accurate detection and strong generalization when detecting phishing websites. They stress the importance of using features from varied sources and suggest that combining machine learning models with other phishing detection techniques can enhance detection capabilities further. This research demonstrates the value of machine learning in detecting phishing websites and points to further improvement through hybrid models and additional features.

Machikuri Santoshi Kumari et al. [4] detect phishing using models that combine blacklisting with machine-learning methods. Several machine-learning algorithms, such as XGBoost, Random Forest, Decision Tree, and Multilayer Perceptrons, were used for the detection. Two other datasets were used in addition to the PhishTank dataset: one containing phishing websites and the second containing phonemy features. A total of 30 features were used; among them, the most important were HTTPS, followed by Anchor URL, Website Traffic, etc. XGBoost gave the maximum training accuracy of 100% and the best test accuracy of 96.7% among all the algorithms. They concluded that "using the XGBoost algorithm to detect phishing improves prediction accuracy."
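The drawback of the list-based approach discussed above (a small change in the URL defeats the match, and unseen domains are never flagged) can be illustrated with a minimal sketch. The blacklist entry and the function names `exact_match` and `host_match` are hypothetical, invented for illustration; they are not taken from any of the reviewed systems.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entry, invented for illustration only.
BLACKLIST = {"http://paypal-login.example.com/verify"}


def exact_match(url: str) -> bool:
    """Naive list-based check: flags a URL only if the full string is listed."""
    return url in BLACKLIST


def host_match(url: str) -> bool:
    """More tolerant check: compares only the (lower-cased) host name."""
    blocked_hosts = {urlsplit(entry).hostname for entry in BLACKLIST}
    return urlsplit(url).hostname in blocked_hosts


listed = "http://paypal-login.example.com/verify"
variant = "http://paypal-login.example.com/verify?id=1"   # small change in the URL
new_site = "http://paypal-l0gin.example.com/verify"       # unseen (zero-day) domain

print(exact_match(listed), exact_match(variant))   # the variant escapes exact matching
print(host_match(variant), host_match(new_site))   # host matching still misses new domains
```

Matching on the host name tolerates changes to the path or query string, but neither check can flag a domain that has never been reported, which is one motivation for the feature-based machine learning approaches reviewed here.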
A. Orunsolu et al. [5] proposed a scalable architecture that combines incremental learning with a modular approach and found it to be effective. Using a dataset from PhishTank (comprising 2,541 phishing URLs) and Alexa (containing 2,500 legitimate URLs), the model attained 99.96% accuracy with a low false positive rate of 0.04%. Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms were used in the comparative performance studies. The study provides a criterion for assessing feature importance based on how often phishing and legitimate datasets favor certain features. Features are therefore selected for maximum relevance with minimum redundancy. The URL features include, but are not limited to, length, presence of '@', and hexadecimal codes. The webpage features investigated include the validity of SSL certificates and congruency with domain names, while patterns of behavior, such as cookie handling, and the age of the domain also qualify as important features. The incremental methodology processes these features in stages, starting with URL analysis, followed by webpage properties, and finally webpage behaviors if needed. This modular approach ensures scalability and adaptability to new phishing tactics. The study's results demonstrate the effectiveness of the proposed system, though limitations remain, such as limited dataset diversity, lack of real-time testing, and absence of benchmarking.

Korkmaz et al. [6] addressed the persistent problem of phishing through URL analysis, employing machine-learning techniques to detect attacks that exploit vulnerabilities inherent in human nature by imitating legitimate sites in a bid to obtain sensitive data. The work focuses primarily on the attributes of URLs in order to improve efficiency. The authors employed eight machine learning algorithms, including Random Forest (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM), which were tested on three datasets with over 126,000 URLs. The datasets combined phishing URLs from PhishTank with legitimate URLs from the Alexa and Common Crawl databases. The system extracted and used 48 key features from the URLs, including domain structure, special character presence, and length metrics, without recourse to third-party services for efficiency reasons. The experimental results indicate that the Random Forest algorithm had the highest accuracy across the datasets (up to 94.59%) and better accuracy than previous studies. The system runs efficiently enough to be used for real-time detection. However, the limited coverage acknowledged in the paper suggests directions for further work, such as expanding the initial dataset.

Phishing attack detection was investigated by Alam et al. (2020) [7], who used decision tree and random forest algorithms for the classification of attacks. The dataset, which came from Kaggle, had 30 significant features for identifying phishing URLs. A detailed preprocessing step produced clean and noise-free data, followed by feature selection using algorithms such as PCA. The performance of each algorithm was analyzed in terms of confusion matrices and the following performance measures: accuracy, precision, recall, and F1-score. Random forests outperformed decision trees (DTs), offering 97% accuracy compared to 91.94% for DTs, and dealt effectively with overfitting and variability issues. The study asserted that random forests and other ensemble approaches help filter out spam and substantially assist phishing detection when dealing with large data.

Rashid et al. (2020) [8] presented a machine learning approach for phishing detection that uses Support Vector Machines (SVM) for classification. The dataset, obtained from repositories such as PhishTank and Alexa, consists of valid and phishing URLs, together with internal features, such as the length of the URL, and external features derived from third-party services. Principal Component Analysis (PCA) was performed for dimension reduction to facilitate more efficient processing. The model achieved 95.66% accuracy using SVM with only five features, higher than that achieved using other techniques, for example Random Forest, which showed an accuracy of 94.27% with 30 features. This reduction in the feature set improved computational efficiency while maintaining good detection rates. The authors indicated that their solution is robust at identifying new and transient phishing sites, which makes it a practical countermeasure against cyber threats.

The research by Vahid Shahrivari and Mohammad Mahdi Darabi [9] deals with the application of various machine-learning algorithms to the detection of phishing websites. It uses a dataset of 30 features, such as IP address presence, URL length, whether shortening services are used, and SSL state, among others. Characteristics common to phishing URLs are employed to distinguish phishing websites from legitimate ones. Logistic regression, decision tree, random forest, AdaBoost, KNN, SVM, gradient boosting, XGBoost, and neural networks were the machine learning algorithms tried out. Besides accuracy, precision, recall, and F1 score were also used to assess the performance of the different models. XGBoost proved most accurate at 98.32%, Random Forest came second at 97.27%, and the Neural Network also performed well, achieving 96.98% accuracy. The authors concluded that ensemble methods such as Random Forest and XGBoost are good at detecting phishing websites due to their high accuracy and robustness. They stressed the usefulness of employing multiple features and suggested that detection performance might be enhanced by coupling machine learning models with other phishing detection methods. This work exemplifies the potential of machine learning to help discern phishing websites, and its further promise of improvement with hybrid models and novel features.

Jitendra Kumar et al. [10] described the training of Logistic Regression, Naive Bayes, Random Forest, Decision Tree and K-Nearest Neighbor classifiers using features derived from the lexical structure of URLs. They carefully created a dataset to address common problems such as data imbalance, biased training, variance and overfitting. The preprocessed dataset was evenly split between phishing and trusted URLs and was further divided in a 70:30 ratio for training and testing. Interestingly, all classifiers had similar AUC (Area Under Curve) values, but the Naive Bayes classifier was the best performer with the highest AUC value. It achieved an accuracy of 98% with a precision of 1, recall of 0.95 and F1-score of 0.97. The study thus underlines the importance of a balanced dataset and presents Naive Bayes as a strong candidate for phishing detection.
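The studies reviewed here report accuracy, precision, recall, and F1-score, all of which derive from the four cells of a confusion matrix. The following is a minimal sketch of those standard definitions; the function name `classification_metrics` and the counts in the usage example are hypothetical. The counts were chosen only so that precision and recall equal the values quoted for [10] (1 and 0.95); they are not the authors' data.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp: phishing URLs correctly flagged      fp: legitimate URLs wrongly flagged
    tn: legitimate URLs correctly passed     fn: phishing URLs that were missed
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical balanced test set: 100 phishing and 100 legitimate URLs.
metrics = classification_metrics(tp=95, fp=0, tn=100, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, a precision of 1 and a recall of 0.95 give an F1-score of about 0.974, which agrees with the rounded 0.97 reported in [10]. Because F1 is a harmonic mean, it always lies between precision and recall.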
Kulkarni and Brown (2019) [11] studied the detection of phishing websites with machine learning techniques. The dataset was reported as obtained from the University of California, Irvine Machine Learning Repository and contains 1353 URLs labeled as phishing, suspicious, and legitimate. Nine features were extracted from the URLs, including URL length, age of domain, presence of an IP address, and others. Four classifiers were run: Decision Tree, Support Vector Machine (SVM), Naïve Bayes, and Neural Network. The Decision Tree classifier achieved an accuracy of 91.5%, with a True Positive Rate (TPR) of 90.97% and a False Positive Rate (FPR) of 7.81%. The SVM followed with an accuracy of 86.69%, and Naïve Bayes and Neural Network trailed slightly at 86.14% and 84.87%, respectively. The study stated that Decision Trees are quite good with discrete feature values, but they need pruning to deal with overfitting. The authors concluded that more features and larger datasets would help the performance of the classifier, and recommended ensemble methods and rule-based approaches for future work.

Rishikesh Mahajan and Irfan Siddavatam [12] focused on three classification algorithms: Decision Tree, Random Forest, and Support Vector Machine. The dataset was constructed by taking 17,058 benign URLs from Alexa and 19,653 phishing URLs from PhishTank, each with 16 features. The data were partitioned into training and testing sets in proportions of 50:50, 70:30, and 90:10. Performance was judged according to accuracy, false negative rate, and false positive rate. Random Forest stood out, achieving 97.14% accuracy with the lowest false negative rate. Their conclusion was that the more data used for training, the better the accuracy.

4. DATASETS
The datasets have been collected from various sources such as PhishTank [13], Alexa, etc. These sources hold data about phishing websites and keep it updated. The datasets contain all features and their respective values.

5. FEATURE EXTRACTION
URLs have certain characteristics and patterns that can be considered as their features.
In URL-based analysis for designing machine learning models, we need to extract these features in order to form a dataset that can be used for training and testing. There are four categories of features that are most commonly considered for feature extraction, as in [9]. They are as follows:
1) Address Bar based features
2) Abnormal based features
3) HTML and JavaScript based features
4) Domain based features

Address Bar Based Features
1) Having IP Address: An IP address is used instead of the domain name in the URL, such as http://217.102.24.235/sample.html.
2) URL Length: Phishers can use a long URL to hide the doubtful part in the address bar.
3) Shortening Service: A short link that leads to a webpage with a long URL. For example, the URL http://sharif.hud.ac.uk/ can be shortened to bit.ly/1sSEGTB.
4) Having @ Symbol: Using the @ symbol in the URL leads the browser to ignore everything preceding the @ symbol, and the real address often follows the @ symbol.
5) Double Slash Redirection: The existence of // within the URL means that the user will be redirected to another website.
6) Prefix Suffix: Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For example, http://www.Confirme-paypal.com.
7) Having Sub Domain: The URL contains subdomains.
8) SSL State: Shows whether the website uses SSL.
9) Domain Registration Length: Based on the fact that a phishing website lives for a short period.
10) Favicon: A favicon is a graphic image (icon) associated with a specific webpage. If the favicon is loaded from another domain, then the webpage is likely to be considered a phishing attempt.
11) Using Non-Standard Port: To control intrusions, it is much better to open only the ports that you need. Several firewalls, proxy and Network Address Translation (NAT) servers will, by default, block all or most of the ports.
12) HTTPS token: Having a deceptive https token in the URL. For example, http://https-www-mellat-phish.ir

Abnormal Based Features
13) Request URL: Request URL examines whether the external objects contained within a webpage, such as images, videos, and sounds, are loaded from another domain.
14) URL of Anchor: An anchor is an element defined by the <a> tag. This feature is treated exactly as Request URL.
15) Links In Tags: It is common for legitimate websites to use <Meta> tags to offer metadata about the HTML document, <Script> tags to create a client-side script, and <Link> tags to retrieve other web resources.
16) Server Form Handler: Checks whether the domain name in SFHs is different from the domain name of the webpage.
17) Submitting Information To E-mail: A phisher might redirect the user's information to his email.
18) Abnormal URL: It is extracted from the WHOIS database. For a legitimate website, identity is typically part of its URL.
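Several of the address-bar features listed above can be computed from the URL string alone. The following is a minimal sketch using only the Python standard library. The function name `extract_features`, the feature keys, and the small `SHORTENERS` set are illustrative assumptions and do not reproduce the exact rules of [9]; the subdomain count is a naive dot count that ignores multi-part suffixes such as .co.uk.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumed, non-exhaustive list of shortening services (illustrative only).
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}


def extract_features(url: str) -> dict:
    """Derive a few lexical (address-bar) features from a raw URL string."""
    # urlsplit only recognises the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    try:
        ipaddress.ip_address(host)      # raises ValueError for non-IP hosts
        having_ip = True
    except ValueError:
        having_ip = False

    remainder = url.split("://", 1)[-1]  # everything after the scheme
    return {
        "having_ip": having_ip,
        "url_length": len(url),
        "shortening_service": host in SHORTENERS,
        "having_at": "@" in url,
        "double_slash_redirect": "//" in remainder,
        "prefix_suffix": "-" in host,
        # Naive: dots in the host minus the one before the top-level domain.
        "subdomain_count": 0 if having_ip else max(host.count(".") - 1, 0),
        "https_token": "https" in host,
    }


for example in ("http://217.102.24.235/sample.html",
                "http://https-www-mellat-phish.ir",
                "bit.ly/1sSEGTB"):
    print(example, extract_features(example))
```

Features that depend on page content or external services (SSL state, domain registration length, favicon, WHOIS-based checks) need network access and are deliberately left out of this sketch. Each resulting dictionary can serve as one row of a training dataset.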