Detection of Phishing Websites Using Machine Learning
Abstract: Improvements in Internet and cloud technology in recent years have significantly increased electronic trade and consumer-to-consumer online transactions. This growth also harms the resources of an enterprise, because it permits unauthorised access to sensitive information about users. Phishing is one well-known attack that deceives users into accessing dangerous content and giving up their information. Most phishing websites imitate the interface and uniform resource locator (URL) of the legitimate websites. Phishing attacks are more likely to succeed on the Internet because of its anonymous and unregulated nature. Existing research shows that the performance of phishing detection systems is limited, so an intelligent method is required to safeguard users against cyber-attacks. In this paper, we investigate the use of machine learning classifiers and a wrapper feature selection technique to detect phishing websites. The purpose of this study is to identify phishing URLs and to select the best-performing machine learning algorithm by evaluating the accuracy, false positive rate, and false negative rate of each algorithm.

Keywords: Phishing, URL, Wrapper Features Selection, Machine Learning

I. INTRODUCTION

Because it is so simple to build a fake website that closely resembles a legitimate one, phishing has recently become a top concern for security specialists. Experts can spot bogus websites, but not all users can, and those users end up falling for phishing scams. Phishing attacks typically originate through email, websites, and software. In email phishing, the attacker sends millions of emails in the hope that at least a few thousand recipients will fall for it. Most of these emails claim to be from a reputable company. Users are lured into visiting phishing websites via the links contained in phishing emails. In website-based phishing, a legitimate website is copied in an effort to trick people into disclosing personal data. Users can also reach phishing sites through social networking sites such as Facebook or Twitter. Malware-based phishing involves embedding malicious software, such as a Trojan horse, into a legitimate website that has been compromised. When a user clicks on the link, the malicious software is installed on their system; it then tracks any sensitive data on the computer and sends it to the attacker. Malware [1] can be introduced into a trustworthy website through links, audio files, or video files. Most modern malware is multipurpose: it may download and install other malicious software without the user's knowledge, steal data, and make the victim's computer part of a botnet.

Owing to the rapid increase in complex phishing strategies created by sophisticated attackers, even novice phishers can quickly build phishing websites using phishing toolkits that are readily available online. Many security experts are now concentrating on machine learning techniques to overcome the limitations of blacklist and heuristics-based methods.

The remaining sections are arranged as follows. Section 2 covers the background of the investigation and the relevant literature on URL detection. Section 3 details the methodology of the research. Section 4 presents the results and analysis. Section 5 concludes the study and outlines future directions.
d licensed use limited to: Vignan's Foundation for Science Technology & Research (Deemed to be University). Downloaded on November 16,2024 at 09:11:11 UTC from IEEE Xplore. Restrictio
II. RESEARCH BACKGROUND AND RELATED WORKS

To identify phishing, Google Safe Browsing employs a blacklist-based anti-phishing approach: a suspicious URL is checked to see whether it is on the blacklist. Using a blacklist as input, the PhishNet approach [2] predicts variations of each URL based on five URL variation heuristics: Top Level Domain (TLD) replacement, directory structure similarity, IP address equivalence, query string replacement, and brand name equivalence.

Owing to the rapid increase in sophisticated phishing techniques, even novice phishers can quickly build phishing websites using phishing toolkits that are readily available online. As a result, anti-phishing methods including blacklisting, whitelisting, heuristic analysis, and visual similarity-based approaches are no longer as successful at identifying phishing websites. Blacklisting and whitelisting only cover URLs that appear on the respective list; they cannot identify URLs that are not on that list. Unlisted URLs therefore lead to false positives (with a whitelist) or false negatives (with a blacklist), and the lists must be updated often. Compared to list-based approaches, heuristic approaches are faster and have fewer false positives and false negatives, but they are also less accurate. Compared with the other two approaches, the visual similarity strategy is slower, both because it requires a database of the visual content of all trusted websites as a starting point for comparison with the suspected website, and because visual content comparison is more expensive than URL comparison.

Rao and Ali [3] present PhishShield, which, given a URL, determines whether the website is genuine or fraudulent. PhishShield is faster than visual-based assessment tools at spotting phishing attempts, and it may be able to detect zero-hour phishing attacks that blacklists cannot.

A survey of recent techniques used to identify phishing [4] provides a comprehensive analysis of phishing attacks and the deception they rely on, together with a sample of newer visual similarity-based methodologies for phishing identification.

III. PROPOSED METHOD

The first step in the process is to obtain the raw dataset from the UCI and Kaggle websites. Although the dataset was collected in its raw form, it has already undergone pre-processing to increase its dependability and usefulness for machine learning algorithms. Data cleaning removes noise, missing values, and invalid values that could be difficult for ML algorithms to process. The better the data, the easier it is for the ML algorithms [5] to produce good results. In the following stage, ML models are built with three algorithms: artificial neural network (ANN), random forest (RF), and support vector machine (SVM). Each algorithm offers advantages that are utilised in creating the final result. Each model generates its own output, a prediction of whether the features in the historical data indicate a legitimate website or a phishing website. The results are then compared, and the model that produces the best results is selected. To further improve performance and dependability, so that the system produces reliable results for newly entered data, dynamic feature extraction is used to build the feature set for a URL newly entered by the user.

Fig.1: Architecture of the proposed system

Here, a highly accurate phishing website detection method that makes use of the Wrapper feature selection strategy [6] is proposed. Classification algorithms such as neural networks, support vector machines, and random forests are used to identify phishing websites. The features used for phishing site detection fall into the following categories:

• Domain-Based Features
• Features Based on URLs
• Features on a page
• Features Based on Content
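As an illustration of the wrapper idea, the following minimal Python sketch runs a greedy forward search that repeatedly adds the feature giving the largest gain in held-out accuracy for the wrapped classifier. The paper does not state the search strategy or the implementation, so the forward search, the nearest-centroid stand-in classifier (used here instead of ANN, RF, or SVM to keep the sketch dependency-free), the function names, and the toy data are all assumptions.

```python
from statistics import mean


def centroid_accuracy(train, test, features):
    """Held-out accuracy of a nearest-centroid classifier restricted to `features`."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [mean([r[f] for r in rows]) for f in features]

    def predict(x):
        # Pick the label whose centroid is closest (squared Euclidean distance).
        return min(
            sorted(centroids),
            key=lambda c: sum((x[f] - m) ** 2 for f, m in zip(features, centroids[c])),
        )

    return mean([1.0 if predict(x) == y else 0.0 for x, y in test])


def forward_select(train, test, score=centroid_accuracy):
    """Greedy wrapper search: add the feature that most improves the wrapped score."""
    n_features = len(train[0][0])
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        # (score, -index) so that ties are broken in favour of the lowest index.
        top_score, neg_f = max((score(train, test, selected + [f]), -f) for f in candidates)
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(-neg_f)
        best = top_score
    return selected, best


# Toy data (hypothetical): features take values in {-1, 1}; label 1 = legitimate, -1 = phishing.
TRAIN = [
    ([1, 1, 1], 1), ([1, -1, 1], 1), ([1, 1, -1], 1),
    ([-1, 1, -1], -1), ([-1, -1, -1], -1), ([-1, -1, 1], -1),
]
TEST = [([1, -1, -1], 1), ([1, 1, 1], 1), ([-1, 1, 1], -1), ([-1, -1, -1], -1)]

if __name__ == "__main__":
    print(forward_select(TRAIN, TEST))
```

On this toy data only feature 0 separates the two classes, so the search stops after selecting it. In practice the score would come from cross-validation of the actual classifier rather than a single held-out split.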
A. Extraction of Dynamic Features from a URL

Since phishing sites don't last long, their DNS information can become inaccessible over time. A website is treated as a phishing site if its DNS record cannot be found. If the domain name of the suspicious page doesn't match a record in the WHOIS database, the page is likewise assumed to be phishing. The features that we have extracted to identify phishing URLs [7] are listed below.

1. Having an Internet Protocol (IP) Address: If the URL uses an IP address in place of a domain name, as in http://125.98.2.142/contoh.html, it is reasonable to suspect an attempt to steal data.

2. Length of the URL: Long URLs may also be a sign that a website is a phishing scam.

3. Shortening Services: URL shortening is a technique that reduces the length of a URL while still leading to the page that has the longer URL.

4. Using the @ symbol: An "@" in a URL causes the browser to ignore everything that comes before the "@" symbol.

5. Redirecting: The two-slash symbol "//" within the URL path indicates that the user will be forwarded to another website.

6. Prefixes and Suffixes: The dash symbol is rarely used in legitimate URLs. Phishers add a prefix or suffix to the domain name, separated by (-), so that users believe they are dealing with a legitimate website.

7. Domain registration Length: The domain name may include a country-code top-level domain (ccTLD). The first steps in isolating the core domain are to remove the "www." from the URL and the ccTLD, if present. The remaining dots are then counted; if the number of dots is greater than one, the URL can be labelled "suspect" on the basis of its subdomains alone.

8. Favicon: A favicon is an image used as an icon for a website; it also serves to convey the identity of the site.

9. Port: A port is used to reach a particular service, such as HTTP.

10. HTTPS Token: Phishers may add the "https" token to the domain part of a URL in order to mislead the user, as in http://httpswww-amikom-coolest-college.com/.

11. Request URL: In a legitimate website, the page address and the images, audio files, and other media embedded in the page share the same domain rather than being loaded from other domains.

12. Abnormal URL: If a website is authentic, its URL will reflect its identity.

13. SFH: Server form handlers (SFHs) that contain an empty string or "about:blank" are regarded as suspicious, because some action should be taken on the submitted data.

14. Email submission: A legitimate site normally submits personal data to a server for processing, rather than sending it to an email address.

15. Redirect: The number of times a site has been redirected is a subtle distinction between phishing sites and legitimate ones. In our sample, genuine sites were redirected only once or twice, whereas phishing sites with this feature were redirected a number of times.

16. On mouseover: To extract this feature, we need to retrieve the source code of the web page, paying close attention to the "onMouseOver" event, and check whether it makes any changes to the status bar.

17. Iframe: An iframe is an HTML tag that displays an additional web page within the one currently shown. Scammers can use the "iframe" tag to hide their work, for instance by making the frame borders invisible.

18. DNS Record: For phishing sites, either no records have been created for the hostname or the WHOIS database does not recognise the claimed identity.

19. PageRank: PageRank is a value from "0" to "1" that aims to measure a page's importance on the Internet. The higher the PageRank value, the more important the page. We find that almost 95% of phishing pages in our datasets have no PageRank.

20. Google Index: When a website is indexed by Google, it appears in the search results. Since phishing pages typically remain active only for a short time, many of them may not be found in the Google index.

21. Links Pointing to Page: We find that 98 percent of the items in phishing datasets have no links pointing to them, owing to their brief lifespan. Genuine websites, however, usually have at least two external links pointing to them.
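Several of the purely lexical features in the list above (IP address, URL length, shortening service, "@" symbol, "//" redirect, dash in the domain, dot count, and "https" token) can be computed from the URL string alone. The sketch below shows one way to do so with the Python standard library. It is not the authors' extractor: the 54-character length threshold, the position-7 rule for "//", the short list of shortening hosts, and all function and key names are assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical, non-exhaustive set of URL shortening hosts (an assumption).
SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly"}


def host_is_ip(host: str) -> bool:
    """Return True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str, long_url_threshold: int = 54) -> dict:
    """Compute a few lexical URL features; a value of 1 marks a suspicious trait."""
    host = urlparse(url).hostname or ""
    # Drop a leading "www." before counting the dots left in the host name.
    bare_host = host[4:] if host.startswith("www.") else host
    return {
        "has_ip": int(host_is_ip(host)),
        "long_url": int(len(url) >= long_url_threshold),  # threshold is an assumption
        "shortener": int(host in SHORTENERS),
        "has_at": int("@" in url),
        # A "//" after the scheme separator ("http://" or "https://") hints at a redirect.
        "double_slash_redirect": int(url.rfind("//") > 7),
        "prefix_suffix": int("-" in host),
        "dots_in_host": bare_host.count("."),
        "https_token_in_host": int("https" in host),
    }


if __name__ == "__main__":
    for example in (
        "http://125.98.2.142/contoh.html",
        "http://httpswww-amikom-coolest-college.com/",
    ):
        print(example, url_features(example))
```

Features that depend on page content, DNS, WHOIS, or search-engine data (for example SFH, iframe, PageRank, and Google Index) need network lookups or HTML parsing and are outside the scope of this sketch.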
IV. RESULTS AND DISCUSSION
A. Model Evaluation
B. Accuracy Score:
Fig.2: ANN
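The evaluation measures named in the abstract (accuracy, false positive rate, and false negative rate) and the error rate can all be derived from the four cells of a binary confusion matrix. The following sketch shows the arithmetic. It assumes that phishing is treated as the positive class and uses made-up labels, not the paper's data; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Derive accuracy, error rate, and false positive/negative rates."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        # Share of legitimate sites wrongly flagged as phishing.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        # Share of phishing sites that were missed.
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only: 1 = phishing, 0 = legitimate.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    print(rates(*confusion_counts(y_true, y_pred)))
```

Accuracy and error rate always sum to one, which is why a table of accuracy and error rate carries the same information in both columns; the false positive and false negative rates show how the errors split between the two classes.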
The confusion matrices for ANN, RF and SVM on the testing dataset of 3317 records are presented in Fig.2, 3 and 4. Additionally, Fig.5 compares these three trained models, with accuracy of 96.7%, 95%, and 94%, respectively.

Table 1: Performance proportion of AI classifier

Table 2: Accuracy and Error Rate

The performance of the ML models with and without Wrapper feature selection, along with their respective accuracy and error rate, is shown in Tables 1 and 2.

Figures 7 and 8 show the application classifying a URL entered by the user as a legitimate URL or a phishing URL.

CONCLUSION

The accuracy obtained by the support vector machine, artificial neural network, and random forest trained models is 96.7%, 95% and 94%, respectively. Therefore, the Random Forest model with wrapper feature selection, which is used in the final detection of phishing sites, has the best accuracy obtained. The main strategy for protecting users against phishing attacks is awareness training. Web users should be aware of all security recommendations made by professionals. Every user should also learn not to click carelessly on links to websites where they must submit sensitive information. It is essential to check the URL before visiting a website. In future, the system can be extended to recognise phishing pages automatically by running in the background during user sessions, so that it can prevent phishing attacks efficiently.
Fig.7: Example of Phishing URL

REFERENCES

[1] Routhu Srinivasa Rao and Syed Taqi Ali, 2015, "PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach", Department of Computer Engineering, National Institute of Technology, Kurukshetra, Haryana, India, Procedia Computer Science 54, 147-156.
[7] Jian Mao, Jingdong Bian, Wenqian Tian, Shishi Zhu, Tao Wei, Aili Li and Zhenkai Liang, 2018, "Detecting Phishing Websites via Aggregation Analysis of Page Layouts", Procedia Computer Science 129, 224-230.
[8] R. Kiruthiga, D. Akila, 2019, "Phishing Websites Detection Using Machine Learning", IJRTE, ISSN: 2277-3878, Volume-8, Issue-2S11.