
Fake URL Detection Using Machine Learning Algorithms
Ms. Neha Gupta
Department of Information Technology
Greater Noida Institute of Technology (Engineering Institute)

Nitesh Kumar
Department of Information Technology
Greater Noida Institute of Technology (Engineering Institute)
[email protected]

Abstract - Phishing is a common scam in which people are misled into supplying personal information through fraudulent websites. Phishing website URLs are used to harvest usernames, passwords, and online banking credentials, and phishers build sites that look and behave like legitimate ones. As technology has improved, phishing strategies have grown more sophisticated; to address this, phishing attempts must be detected by anti-phishing software. Machine learning is an efficient way to counter phishing attacks. This study examines the feature sets used in machine learning-based detection techniques. At its core, phishing is the practice of deceiving an individual into clicking on a harmful link that appears to be authentic.

Keywords - website authenticity, phishing websites, fake websites, website content analysis.

I. INTRODUCTION

Security experts are increasingly concerned about phishing because of the simplicity with which a counterfeit website that closely mimics a genuine one can be constructed. Phishing is a fraudulent technique that obtains sensitive data, such as passwords and credit card numbers, through social engineering: the attacker poses as a trustworthy person or corporation via electronic contact [2]. Links from phishing websites are used to trick consumers into visiting fraudulent pages, and the fake emails are crafted to look like messages from authentic sources, such as online businesses or financial organizations [11].

While professionals can identify fraudulent websites, many users are less fortunate and fall victim to phishing attacks [12]. The attacker's primary purpose is usually to obtain bank account credentials, and because clients are becoming less aware, phishing attacks are becoming increasingly effective. Phishing attacks are also hard to thwart because they exploit human vulnerabilities rather than technical ones, so detection systems must be continually updated [1]. To evade existing programs and frameworks, phishers adopt inventive and hybrid strategies; countermeasures therefore include techniques for identifying phishing content online and for identifying potential phishing attempts in interactions [7][4].

II. LITERATURE REVIEW

This section reviews earlier work on machine learning techniques for spotting fraudulent websites. In one approach, neural networks classify URLs as phishing or benign, with binary visualization techniques applied to improve model accuracy; this method can build on the phishing detection frameworks already in place to ascertain whether a website is phishing [4]. The trial's small dataset gave the researchers insight into its impact on the model's prediction and efficacy, so the technique might be enhanced by employing other prediction models and additional datasets for training and testing.

According to the findings in [5], acquired through a systematic review of the literature, the reviewed work used supervised machine learning techniques, with deep neural networks being the most popular methodology. Although only Deep Learning (DL) approaches for phishing detection were examined, these algorithms have the potential to improve online system security, and the work significantly advances the field, especially with regard to the effectiveness of deep learning algorithms. To address the shortcomings of previous research, future studies may look at alternative machine learning techniques. A separate study [6] combined convolutional neural networks (CNN) with machine learning.

III. PROJECT DESCRIPTION

As part of the project, we created a website that acts as a platform for every customer: a dynamic, flexible site that helps users discern between legitimate and counterfeit sites [12]. It was built with a number of web development languages, including HTML, CSS, JavaScript, and Django. HTML provides the website's core foundation, while CSS enhances its look and feel. It is important to remember that the website is intended to be accessible to everyone, so anyone should be able to use it without problems.

The dataset includes several elements that should be taken into account when determining whether a URL on the internet is phishing or legitimate. The following groups of elements are used:

1. Address bar-based features.
2. Abnormal-based features.
3. JavaScript- and HTML-based features.

1. Address Bar-based Features

1.1. IP address in the URL. When someone accesses a website through an IP address (125.98.3.123, for example) instead of a domain name, they are more likely to experience identity theft.

1.2. Extended URL to mask the questionable part. Phishers can utilize long URLs to hide malicious material within the address bar.

1.3. URL shortening services. With services such as TinyURL, a URL can be greatly abbreviated while maintaining its link to the intended location.

1.4. URLs that contain the @ sign. The browser disregards everything that comes before the @ mark in a URL; the actual address follows it.

2. Data Preprocessing

Pre-processing is a vital first step in getting data ready for machine learning algorithms. The first stage in cleaning up the original data is to remove any unnecessary information, missing values, and duplicate records; this is the primary reason for our model's high accuracy. To clean the data, we employ a variety of techniques, such as eliminating redundant URLs, rows with missing information, and irrelevant data that has no effect on the malware, phishing, defacement, and benign classes. The cleaned data was then organized so that machine learning algorithms could use it. Here, feature engineering is applied: the process of discovering and choosing significant qualities from datasets [11].

3. Word Cloud

Word clouds may be used to assess token distribution in certain data categories [13]. Figure 2 depicts a word cloud for each of the four classes examined in this study. Benign URLs frequently include common tokens like html, com, and org.
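The address-bar features above reduce to simple string checks on the raw URL. The sketch below is a minimal illustration, not the paper's actual feature extractor; the 54-character length threshold and the list of shortening services are assumptions chosen for demonstration.

```python
import re

# Illustrative, non-exhaustive list of URL-shortening domains.
SHORTENERS = {"tinyurl.com", "bit.ly", "goo.gl", "t.co", "ow.ly"}

def extract_address_bar_features(url: str) -> dict:
    """Binary (0/1) address-bar features matching items 1.1-1.4."""
    # Strip the scheme and isolate the host part of the URL.
    host = re.sub(r"^https?://", "", url).split("/")[0]
    return {
        # 1.1 Host is a raw IP address instead of a domain name.
        "uses_ip": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))),
        # 1.2 Unusually long URL that can hide malicious material.
        "long_url": int(len(url) >= 54),
        # 1.3 URL points at a known shortening service.
        "shortened": int(host.lower() in SHORTENERS),
        # 1.4 '@' present: browsers ignore everything before it.
        "has_at": int("@" in url),
    }

print(extract_address_bar_features("http://125.98.3.123/login"))
# {'uses_ip': 1, 'long_url': 0, 'shortened': 0, 'has_at': 0}
```

Each feature is already 0/1, so vectors like these can be fed directly to the classifiers described later.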


IV. RESULT ANALYSIS AND DISCUSSION

Table I shows that the RF algorithm outperformed the other two, with an accuracy of 97%. In addition, various metrics such as F1 score, recall, and accuracy are used to assess the algorithm's overall performance.

TABLE I. PERFORMANCE OF OUR PROPOSED MODEL

Figure 1. Methodology for detecting fake website URLs.

Figure 2 shows word clouds for each of the four classes in the collection: (a) benign URLs, (b) phishing URLs, (c) malware URLs, and (d) defacement URLs. Phishing URLs use tokens like www, file, tools, and ietf to trick visitors into thinking they are legitimate (see Figure 2b). Malware URLs frequently contain high-frequency tokens like exe, E7, BB, and MOZI; these tokens are spread via trojans, represented by the executable files in Figure 2c. Defacement URLs (Figure 2d) typically employ development terminology (php, index, itemid, etc.) and attempt to modify the original website's code. During feature engineering, specific lexical characteristics are extracted from raw URLs and used as input features for the machine learning model; the elements listed in Table I are intended to help identify them.

During our analysis, we observed that RF outperformed the other two algorithms in terms of accuracy rate, achieving 97%. This suggests that the system is accurate in identifying bogus URLs, which makes it a useful tool for guarding against phishing schemes and other internet dangers. The full RF performance obtained in our investigation is displayed in Table II, and Figure 3 depicts the RF confusion matrix. For classification problems, this strategy generates a forest of decision trees and delivers the average forecast from each one: during training it builds a large number of decision trees, from which it computes the class mode (classification) or class mean (regression).

TABLE II. EVALUATION REPORT FOR RANDOM FOREST
A. LightGBM

Our investigation revealed that LightGBM has a 96% accuracy rate in identifying phony URLs, as seen in Table III. This demonstrates the algorithm's ability to identify phony websites and the reasons it should be used in various cybersecurity applications. Because of its remarkable speed and scalability, LightGBM has become widely used and is a great option for real-time applications that demand quick predictions [12]. Figure 4 shows the LightGBM confusion matrix.

TABLE III. EVALUATION REPORT FOR LIGHTGBM

Figure 3. Random Forest confusion matrix.

B. XGBoost

XGBoost is well known for its high accuracy, quick execution, and special aptitude for handling imbalanced datasets, missing values, and other problems that come with real-world data. We found that the XGBoost algorithm achieved a precision rate of 96.2%, which suggests that it could be a useful tool for accurately labelling URLs as benign or malicious [11].

Figure 4. LightGBM confusion matrix.

C. Comparative Results

The main goal of this research is to use cutting-edge machine learning techniques to identify bogus websites [13]. To do this, we contrasted the results of the three algorithms employed in our research with several earlier studies, listed in Table V. Comparing our proposed model's output with that of other models shows that our approach, which integrates LightGBM, XGBoost, and RF, performed better overall, with RF exhibiting the best performance [1]. LightGBM itself is a decision tree-based gradient-boosting ensemble method; like other decision tree-based techniques, it may be applied to both classification and regression.

D. Real-Time Prediction and Results

The trained classifiers assess in real time whether a given URL is dangerous or benign with high precision. The complete assessment findings are shown in Table V, and the XGBoost confusion matrix is shown in Figure 5.

Figure 5. XGBoost confusion matrix.
websites[13]. In order to do this, we have contrasted
V. ALGORITHMS USED

There are currently two methods used here for figuring out whether a URL is real: the random forest and decision tree algorithms. The random forest algorithm creates a forest made up of many decision trees [12]; a high tree count results in excellent detection accuracy. Tree creation is founded on the bootstrap approach [11]: to build each single tree, dataset samples are randomly selected with replacement, and at each step the method picks the best splitter (the root of the subtree) from a randomly chosen subset of the available attributes [14]. The algorithm keeps building the tree until it reaches a leaf node (Figure 7); each leaf node belongs to a class label, and each internal node corresponds to a characteristic [14]. Decision trees, in turn, are used to generate training models that predict target values or classes in tree representations [11].

When a user enters a URL that leads to a phishing website, a pop-up window will appear to warn them. If the user still wishes to access data from the website, they have the option to 'CONFIRM'; otherwise, they are sent back to the previous page [15]. By contrast, systems that rely only on blacklists and whitelists, which are seldom updated, restrict users to listed websites and block access to everything else [19].

Fig. 7. Working of the Random Forest algorithm.
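The random forest behavior described above, bootstrap sampling with replacement plus a random feature subset at each split, maps directly onto scikit-learn's `RandomForestClassifier` parameters. This is an illustrative sketch on synthetic data, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the URL feature matrix.
X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,     # many trees -> better detection accuracy
    bootstrap=True,       # each tree trained on samples drawn with replacement
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)
clf.fit(X, y)
print(f"trees in the forest: {len(clf.estimators_)}")  # 200
```

Class predictions are the mode over all 200 trees, exactly the classification behavior the text describes.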
Our proposed solution incorporates three approaches: blacklists and whitelists, heuristics, and visual similarities. The system employs the following algorithm [15]:

1. Create a browser extension that monitors all "http" traffic from the end user's system. Using an extension rather than a standalone application or program enables real-time processing and dynamic output delivery.

2. Compare each URL's domain against trusted and illegitimate domain lists to obtain the data required [2].

3. Evaluate the website on a range of further criteria. We investigated characteristics such as URL length, the number of hyperlinks, and the website protocol (secure or unsecure), delivering phishing detection technology directly to end users [11].

Figure 6. The Decision Tree algorithm in action.
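Step 2, checking a URL's domain against trusted and illegitimate lists, can be sketched as follows. The list contents and the verdict labels are hypothetical; a real deployment would load regularly maintained feeds.

```python
from urllib.parse import urlparse

# Hypothetical, illustrative lists; real systems load maintained feeds.
WHITELIST = {"google.com", "wikipedia.org"}
BLACKLIST = {"phish-example.net"}

def check_domain(url: str) -> str:
    """Return a verdict from the domain lists; 'unknown' falls through
    to the heuristic and visual-similarity checks described above."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if domain in BLACKLIST:
        return "phishing"
    if domain in WHITELIST:
        return "trusted"
    return "unknown"

print(check_domain("http://phish-example.net/login"))  # phishing
print(check_domain("https://www.google.com"))          # trusted
```

The 'unknown' branch is what motivates the machine learning stage: list lookups alone cannot classify the vast majority of URLs.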

VI. WORKING

• We collected unstructured URL data from many websites, including Phishtank, Kaggle, and Alexa.
• After structuring the dataset, each feature is assigned a paired (0, 1) value, which is then input into the classifiers.
• We train three distinct classifiers and test their accuracy with the Decision Tree and Random Forest approaches [1].

VII. RESULTS

Scikit-Learn was used to load the machine learning algorithms. Classifiers are trained on one set of data and evaluated on a different set, and an accuracy score was used to evaluate the effectiveness of each classifier (Figs. 8 and 9).

Fig. 8. Random Forest algorithm accuracy.

Fig. 9. Accuracy with the Decision Tree algorithm.
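The working steps above, binary features fed to Decision Tree and Random Forest classifiers trained on one split and scored on another, can be sketched as below. The data is synthetic, standing in for the Phishtank/Kaggle/Alexa dataset, and the printed accuracies are not the paper's figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the structured 0/1 URL feature matrix.
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

results = {}
for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=1)),
                    ("Random Forest", RandomForestClassifier(random_state=1))]:
    model.fit(X_tr, y_tr)  # train on one split...
    results[name] = accuracy_score(y_te, model.predict(X_te))  # ...score on the other
    print(f"{name} accuracy: {results[name]:.3f}")
```

Holding out a test split, as in the text, is what makes the reported accuracy an estimate of performance on URLs the model has never seen.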


VIII. CONCLUSION

Phishers have figured out how to change URLs so that they can avoid detection, so even though lexical features by themselves yield a high level of accuracy (about 97%), combining these attributes with additional ones is the most effective tactic. In the future, we must expand the phishing detection architecture, leverage online learning to better understand contemporary attack strategies, and increase the model's accuracy through richer feature extraction.

In conclusion, phishing presents a significant risk to online security and safety, making its detection a crucial concern. We examined conventional phishing detection methods, such as heuristic assessments and blacklists [6]. This study evaluated three popular algorithms (XGBoost, LightGBM, and RF) by comparing their results on the "Phishing Websites Dataset", and found that they yielded impressive results: with an accuracy rate of more than 97%, our proposed method successfully distinguishes between bogus and real websites. According to a feature-significance study, the length of the URL, the number of dots in it, and the presence of certain keywords are among the most important characteristics that reveal phony websites. To detect bogus websites in practice, we developed a Chrome plugin using the fastest algorithm. By automatically spotting phony websites, machine learning algorithms may be able to stop a variety of attacks, such as phishing, malware, and defacement [9]. Future research might focus on applying more sophisticated deep learning methods to enhance detection.
