IEEE_Format_Paper

This document discusses the use of machine learning techniques for detecting phishing websites, highlighting various methods and their effectiveness. It reviews the prevalence of phishing attacks, the techniques used by phishers, and the challenges faced in detection, emphasizing the need for improved models and dataset diversity. The paper also compares different machine learning algorithms and approaches, including deep learning, for identifying phishing threats in the digital landscape.

Uploaded by

kohinoor9010

PHISHING DETECTION USING URL-BASED FEATURES: A MACHINE LEARNING APPROACH

ABSTRACT

Although the web has become part and parcel of our daily lives, it also provides
anonymity for people who undertake malicious acts such as phishing. To deceive
victims, a phisher can use various methods, such as social engineering and
counterfeit websites, to steal personal and corporate account IDs, usernames, and
passwords, among other data. Several techniques have been proposed to detect
phishing websites, but phishers have come up with their own ways of evading
them. Machine learning is one of the best approaches for identifying these
malicious activities because most phishing attacks share common characteristics that
machine learning can recognize. This paper compares different machine learning
models for predicting phishing websites.
Keywords: Phishing, Classification, Cybercrime, Machine Learning

INTRODUCTION

Phishing is a kind of cybercrime in which an attacker tries to obtain valuable or
confidential data from users, usually by creating a counterfeit website that mimics
a genuine one. Techniques used in phishing attacks include link manipulation,
filter evasion, website spoofing, hidden redirects, and social engineering. The
easiest of these is to create a fake website that imitates an authentic one. Such
attacks have become an important subject of discussion recently, notably in the
2018 Internet Crime Report of the US Federal Bureau of Investigation's Internet
Crime Complaint Center (IC3). The 2018 IC3 statistics indicate that internet theft,
fraud, and exploitation remained rampant, causing losses of $2.7 billion. According
to the report, there were 20,373 complaints of business e-mail compromise (BEC) and
e-mail account compromise (EAC), with losses of over $1.2 billion [1]. The report
also notes that the number of these sophisticated attacks has been increasing over
the past several years. The Anti-Phishing Working Group (APWG) likewise highlights
that phishing attacks have increased in recent years. Figure 1 shows the total
number of phishing sites identified by APWG in the last quarter of 2019 and the
first quarter of 2020; the number grew slowly but steadily from 162,155 in Q4 2019
to 165,772 in Q1 2020. Phishing is thus responsible for enormous losses in many
corporate organizations and in the global economy at large. A survey conducted by
APWG member OpSec Security during the last quarter of 2019 found that SaaS and
webmail sites were the most vulnerable targets for phishing attacks. Phishers
continue to harvest credentials from these targets using BEC and then gain access
to corporate SaaS accounts [2].

Several methods have been employed to filter out phishing sites. Each can be
applied at a different stage of the attack flow: network-level protection,
authentication, client-side tools, user education, and server-side filters and
classifiers. Although each kind of phishing attack has its own peculiarities, the
majority follow approximately similar procedures. Machine learning is an effective
way of finding patterns in data, so it can pick up these common features and use
them to recognize phishing sites. This study compares and analyses different
machine learning techniques for detecting phishing websites. The techniques we
studied comprise logistic regression, decision tree, random forest, AdaBoost,
support vector machine (SVM), k-nearest neighbours (KNN), artificial neural
networks, gradient boosting, and XGBoost. The rest of this paper is arranged as
follows: Section II presents some widely used phishing techniques, and Section III
discusses different phishing detection methods along with how to prevent such
attacks. Section IV describes the characteristics of our dataset. Sections VI and
VII present the evaluation results of the machine learning algorithms, and
Section VIII concludes and discusses future work.

RELATED WORKS

Phishing detection has been investigated extensively. Prior studies cover a range
of methods, challenges, and future prospects. A selection of significant
contributions to phishing detection and prevention follows.

Machine Learning Approaches:

Mittal et al. (2022) studied phishing detection using NLP methods such as
tokenization, stemming, and Bag-of-Words. Their textual analysis was sound, but they
did not analyze attachments, and they called for further work incorporating
file-based phishing detection.

Rawal et al. (2017) applied SVM, Naive Bayes, and Random Forest algorithms to
detect email-based phishing. Although the reported accuracy was commendable, the
study was limited by a small dataset whose email samples lacked diversity.

Alattas et al. (2022) used SVM and Random Forest for email phishing detection but
encountered problems associated with imbalanced datasets. They suggested attachment
analysis as a worthwhile future direction for making their model more robust.

Ensemble Learning Techniques:

Orunsolu et al. (2019) implemented a model comprising Decision Trees, Random
Forests, and Support Vector Machines. Although effective, their approach suffered
from undesirably high false positive rates, which suggests a need to optimize the
feature vectors. Dinesh et al. (2023) applied more advanced ensemble methods such
as XGBoost and Random Forest to phishing detection. Their research identified
ever-changing phishing methods as a hurdle and recommended updating detection
mechanisms regularly.

Challenges and Future Directions:

Most of the works reviewed recognized the limited diversity of their datasets and
the potential for further development of their models, particularly by increasing
the size of the dataset and adapting detection strategies to new forms of phishing
attacks.

The future work proposed in these studies emphasizes incorporating attachment
analysis and reporting, and improving detection mechanisms in real-life situations,
including multi-modal data and advanced phishing techniques.

PHISHING TECHNIQUES

Attackers can target users and obtain their personal information using a variety of
technologies. As technology has advanced, cyber criminals have also changed the
techniques they use. To protect yourself from phishing, you must know what cyber
criminals do and understand how to counter any kind of phishing attack that may
come your way.

Phishing

Spear phishing, for instance, is more targeted than traditional phishing, which is
commonly known as the “spray and pray” method and entails sending mass emails to
millions of people in the hope that some will fall prey. Spear phishing requires
hackers to research the person or organization being attacked and to tailor the
attack to that individual; this increases the risk that the target is deceived,
because the message reflects knowledge about him or her.

Hijacking Sessions

In session hijacking, attackers exploit web-session management mechanisms in order
to gain unauthorized access to restricted user areas or to steal sensitive user
data. In session sniffing, a simple form of session hijacking, the phisher uses a
sniffer to intercept messages and gain unauthorized entry to various websites.

Email/Spam

Email is the channel most often combined with other phishing methods: a single
email is sent simultaneously to thousands or millions of users asking for personal
details, which the phishers then use for illegal activities. Most often the message
contains a short note stating that the user needs to provide credentials in order
to update account information, change details, or verify an account quickly.
Sometimes it prompts the user to fill out a form, reached via a link in the email,
to access a new service.

Content Splicing

Content splicing is a method in which phishers modify part of the content of a
trusted web page. The modified content sends the user to a page other than the
official website, where he or she is asked to fill in personal details.

Web-based distribution

Web-based distribution is among the common phishing methods. Also referred to as
MITM (man in the middle), this technique requires hackers to place themselves
between the legitimate site and the phishing system. The phisher traces the details
exchanged between the legitimate website and its users. As the user enters
information, the phisher acquires it without the user's knowledge. Phishing sites
also gather such data when users attempt to make purchases by entering their credit
card information. Many bogus bank sites offer users credit cards or loans at low
interest rates, yet these are actually phishing sites.

Link Manipulation

Link manipulation is a technique in which phishers send links that lead to
malicious sites. When the user clicks on a fake link, the browser opens the
phisher's website instead of the web page the link appears to point to. Place your
cursor over any link to see its actual address and avoid being misled by it.

Voice Phishing

In phone phishing, the phisher calls the user and asks them to dial a number. The
aim is to obtain bank account information over the phone. Phone phishing is usually
perpetrated by callers who falsify their identity.

SMS Phishing

SMS phishing (smishing) uses text messages, which usually contain links to phishing
websites that try to deceive victims into giving out personal details.

Keyloggers

Keyloggers are malware that records keyboard input. The recorded keystrokes are
sent to hackers, who can extract passwords and other information from them. To
prevent such software from logging important financial data, secure sites let users
log in with mouse clicks on a virtual keyboard.

Malware

Phishing scams that involve malware require it to run on the recipient's computer.
Emails sent to individuals by phishers may contain malware as a link or attachment;
when you click on it, the malware starts running. Sometimes malware is also bundled
with downloadable files.

Trojan Horses

A Trojan horse is a kind of malware that appears to function legitimately but
instead allows unauthorized access to the user's accounts through the local
machine. Credentials that the user submits are then sent to cyber criminals.

Ransomware

Ransomware makes a device or its data unavailable until a ransom has been paid.
Personal computer ransomware is a form of malware that sneaks onto a user's machine
through a social engineering attack in which the user is prompted to click on,
open, or visit a link.

Malicious Advertising

Malvertising refers to malicious advertisements that are intended to download
malware or impose undesired content on your computer. Attackers most commonly
exploit vulnerabilities in Adobe PDF and Flash.

PHISHING DETECTION APPROACHES

Existing work on phishing detection can be classified into three groups:

Deep Learning (DL) for phishing attack detection:

This subsection outlines how DL-based detection is performed. Recent advances in
deep learning suggest that deep neural networks may be preferable to typical
machine learning (ML) algorithms for identifying phishing websites. Nevertheless,
the results obtained with deep neural networks are expected to vary from one
subject area to another. Deep learning methods used in network security research
include deep autoencoders, restricted Boltzmann machines (RBMs), deep belief
networks (DBNs), feedforward deep neural networks (FDNNs), recurrent neural
networks (RNNs), and convolutional neural networks (CNNs). Figure 5 shows how a DL
model works: a neuron is fed a series of inputs, a weight is assigned to each of
them, and the weighted inputs are used to predict whether the traffic is a phishing
attack or legitimate.
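
The behaviour of a single neuron described above can be sketched in a few lines of Python. This is only an illustration, not the model used in this paper: the sigmoid activation, the 0.5 decision threshold, and the example weights and inputs are all assumptions chosen for readability.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def classify(inputs, weights, bias=0.0, threshold=0.5):
    """Label the input 'phishing' when the neuron output reaches the threshold."""
    score = neuron_output(inputs, weights, bias)
    return "phishing" if score >= threshold else "legitimate"


# Hypothetical weights for three binary indicators (assumed, not learned).
example_weights = [2.0, 1.5, -1.0]
print(classify([1, 1, 0], example_weights, bias=-1.0))
print(classify([0, 0, 1], example_weights, bias=-1.0))
```

A deep network stacks many such units in layers and learns the weights from labelled data rather than fixing them by hand.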

Machine Learning (ML) for Phishing Attack Detection:

ML methods are widely used for phishing website detection, which can be framed as a
simple classification problem. To train a machine learning model to detect
phishing, data must be available whose features are labelled as belonging to either
phishing or legitimate websites. Various techniques are used to detect phishing
attacks, and powerful machine learning classifiers have been established for this
purpose. An optimal feature selection method reduces the number of candidate
features. Figure 6 illustrates how a machine learning model works: a dataset is
provided as input to the model, which predicts either a phishing attack or
legitimate traffic. When unnecessary features are eliminated, the information
becomes clearer and more concise. The algorithms most often used in the literature
and found to give reasonable accuracy for phishing attack detection include C4.5,
k-NN, and SVM. Decision tree (DT) classifiers such as C4.5 improve the precision
and effectiveness of phishing attack detection. Researchers have also pointed out
limitations of existing studies: a common observation is that most studies did not
employ ensemble learning or reduce their feature sets, which limited them
considerably. James et al. use classifiers such as C4.5, IBK, NB, and SVM. Liew et
al. employ RF to distinguish phishing attacks from genuine web pages. Adebowale et
al. use integrated features to detect and prevent phishing with a robust system
based on a modified neuro-fuzzy inference scheme.

List based:

Browsers such as Mozilla Firefox, Microsoft Edge, and Google Chrome use list-based
techniques to identify phishing websites. There are two kinds of list: whitelists
and blacklists. A whitelist contains valid URLs that may be accessed through the
browser; if a URL is whitelisted, the browser can download the web page. A
blacklist, on the other hand, contains phishing or scam URLs, and the browser is
prevented from downloading those pages. The major shortcoming is that even a minute
variation in a URL may allow it to bypass the list; the list therefore has to be
updated regularly to stop new phishing URLs. For this reason, features that can
differentiate fake websites from real ones have been selected as a strategy for
protection against phishing scams through email links.
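
The shortcoming of list-based checks can be illustrated with a minimal Python sketch. The list contents and the `list_based_verdict` helper are hypothetical; real browsers use much larger, frequently updated lists and more elaborate matching.

```python
from urllib.parse import urlparse

# Hypothetical lists, keyed by hostname (reserved .example names).
WHITELIST = {"bank.example"}
BLACKLIST = {"bank-login.example"}


def list_based_verdict(url):
    """Return 'allow', 'block', or 'unknown' from exact hostname lookups."""
    host = (urlparse(url).hostname or "").lower()
    if host in WHITELIST:
        return "allow"
    if host in BLACKLIST:
        return "block"
    return "unknown"


print(list_based_verdict("https://bank.example/account"))
print(list_based_verdict("http://bank-login.example/signin"))
# A one-character variation of the blacklisted host is no longer matched.
print(list_based_verdict("http://bank-1ogin.example/signin"))
```

The last call returns "unknown" because the altered hostname is on neither list, which is why feature-based classifiers are used alongside lists.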

Such a system receives information from several sources, e.g., URLs, text content
(such as emails), DNS records and metadata from domain name resolution services,
digital certificates (e.g., SSL), and web traffic (i.e., IP packets). Its success
rate depends, among other things, on the specific techniques used to train the
models, e.g., the classification algorithms. One advantage of these technologies is
the ability to recognize zero-day phishing attacks automatically, as they occur,
which was not previously possible.

MACHINE LEARNING APPROACH

The digital landscape remains a contested space in which phishing attacks
constantly threaten both individuals and organizations. These attacks aim to steal
sensitive information such as credentials or financial data, or to mislead users
into following malicious links that install malware on their devices or redirect
them to fake websites. Machine learning (ML) has become a potent tool against this
evolving threat. This section examines the ML techniques commonly used to detect
phishing, highlighting their strengths, their limitations, and the points to
consider when implementing them.

Feature Engineering and supervised learning: Basic techniques.

Machine learning (ML) algorithms are good at pattern recognition and at prediction
based on large datasets. In phishing detection, for example, such algorithms are
trained on large collections of labelled emails that are classified as “genuine” or
“phishing.” During training, the most relevant characteristics of the emails are
extracted as features, and the model learns how these features relate to the
labels. This enables the model to examine new incoming emails that it has not seen
before and to predict, from the learnt features, which of the two classes they
belong to.

Feature engineering is of great importance in this process. Features for phishing
detection can be drawn from different parts of an email as follows:

Email header properties: Sender address (fake), recipient address, email size,
timestamps.

Content properties: Subject keywords (for example, “urgent,” “important”), body
content (urgency level, threats involved and suspicious grammar), presence of
HTML code.

Attachment and URL properties: Attachment file extensions, presence of known
malicious URLs, properties of embedded links (i.e., shortened URLs and domain
name mismatches).

The prevalent approach to phishing detection is supervised learning, in which the
ML model is trained on a labelled dataset where each email is categorized. The
algorithms most commonly used for this purpose are listed below, along with their
strengths and weaknesses:

Logistic Regression: A simple yet powerful linear model for binary classification,
well suited to distinguishing between legitimate and phishing emails. It predicts
the likelihood that an email belongs to one class or the other (legitimate or
phishing) based on the extracted features. Logistic regression is easy to
understand and computationally efficient, making it a good starting point for
phishing detection. However, it may struggle with the complex feature relationships
seen in some phishing attempts.

K-Nearest Neighbours (KNN): This algorithm classifies data points based on their
similarity to labelled data points in the training set. In phishing detection, an
email is considered phishing when its features are close to those of known phishing
emails in the training data. KNN is relatively simple to implement and its
decisions are fairly easy to interpret. Nevertheless, on very large datasets
prediction can be slow because of the computational load, and performance may
depend greatly on the choice of “K” (the number of nearest neighbours taken into
account).
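
A minimal pure-Python sketch of the KNN decision rule follows. The Euclidean distance, the majority vote, and the toy two-feature training points are assumptions made for illustration; they are not taken from the dataset used in this paper.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort training points by Euclidean distance to the query, keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data: label 1 = phishing, label 0 = legitimate (hypothetical values).
toy_train = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
    ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1),
]
print(knn_predict(toy_train, (5.2, 5.1), k=3))
print(knn_predict(toy_train, (0.2, 0.1), k=3))
```

Every prediction scans the whole training set, which is the computational load mentioned above.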

Support Vector Machines (SVM): These powerful algorithms aim to separate legitimate
emails from phishing ones in a high-dimensional feature space. SVMs handle complex
feature relationships effectively and are usually not very sensitive to noise in
the dataset. They can be especially valuable when the dataset is unbalanced, for
example when phishing emails make up only a small share of all the emails
collected. However, training SVMs on big datasets tends to be computationally
expensive, and their decision-making process can be less transparent than that of
simpler models.

Decision trees: A decision tree is a set of branching queries on the features of an
email. The email's characteristics are tested at each branch until a leaf node is
reached that gives the classification (legitimate or phishing). Decision trees are
easy to understand, since their decision-making process is simple to follow. They
can also handle numeric and categorical attributes without much need for data
preprocessing. On the flip side, decision trees overfit if their growth is not
properly controlled, and their performance may be affected by the order in which
splits are chosen during tree construction.

Random Forests: These ensemble methods combine multiple decision trees to enhance
accuracy and robustness. Multiple decision trees are trained on random sub-samples
of the data, and the predictions of these trees are then merged to produce the
final classification. Random forests generalize better than single decision trees
because they do not overfit as easily. They also offer a degree of interpretability
by identifying the influential predictors across the group of trees. However, on
large datasets random forests are computationally expensive to train.

Ensemble Methods (Boosting): These techniques combine multiple weak learners to
create a stronger learner. Commonly used boosting algorithms include AdaBoost,
Gradient Boosting, and XGBoost. They train models iteratively, with emphasis on the
data points that the previous models in the ensemble misclassified. This results in
a more resilient model that can handle complex phishing patterns. However, the high
accuracy of boosting algorithms comes at a price: they can be computationally
expensive to train and harder to interpret than simpler models, and they are prone
to overfitting if not properly tuned. XGBoost, a popular boosting algorithm, has
advantages such as handling missing data and being quite efficient on large
datasets, but it requires careful tuning of hyperparameters for optimal
performance.
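
The "emphasis on misclassified points" can be made concrete with one reweighting round of AdaBoost, sketched below in plain Python. This shows only the classic AdaBoost sample-weight update for binary labels; Gradient Boosting and XGBoost instead fit each new learner to the errors of the current ensemble. The function name and the toy labels are assumptions for illustration.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return the learner weight and updated sample weights."""
    total = sum(weights)
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase the weight of misclassified samples, decrease the rest.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Four samples with equal weight; the weak learner gets the last one wrong.
alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.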

Strengths and benefits of ML-based phishing detection

Adaptability: ML models can adapt to evolving phishing tactics. Continuously
updating the training data with new phishing examples helps the model identify
them. This ability is important because threats keep changing. Security teams can
use threat intelligence feeds or honeypots (decoy email accounts) to identify new
types of phishing and update their training data accordingly.

Scalability: The scalable nature of machine learning models makes them a good fit
for companies that receive high volumes of email every day. Scalability ensures
that email protection remains effective even as the number of emails increases. To
support real-time ML email analysis, some organizations may choose cloud-based
solutions or invest in high-performance computing resources for training and
deploying ML models.

Automation: ML-based email analysis systems automate much of the screening and
significantly reduce the workload of security staff. This allows security personnel
to concentrate on examining flagged emails and addressing other security threats.
In addition, automated systems can trigger actions such as quarantining suspicious
emails or blocking malicious URLs, further reducing the chance of a successful
phishing attack.

Improved Accuracy: Machine learning algorithms trained on a comprehensive dataset
with a well-defined feature set are highly accurate in detecting phishing emails.
Improved accuracy means fewer successful phishing attacks, and hence fewer data
leaks and financial losses. An organization's email security posture can be greatly
enhanced by leveraging ML-based detection mechanisms.

Limitations and considerations for implementing ML for phishing detection

Data Accuracy: The competency of ML algorithms depends mainly on the quality and
quantity of the training data. Models trained on poorly annotated data are likely
to be biased or unable to generalize to unseen types of phishing attempts. It is
therefore important for organizations to invest in obtaining and maintaining
high-quality labelled datasets so that their models perform optimally. This may
involve collaborating with security vendors or joining data sharing initiatives
within industry communities.

Evolving Phishing Techniques: Although ML models are adaptive, attackers keep
inventing new methods. Continuous monitoring, data updates, and possibly model
retraining are necessary to stay one step ahead. Security teams should be
constantly watchful and proactive in identifying fresh phishing tactics and
incorporating them into their training data so as not to undermine the efficiency
of their ML models.

False Positives and False Negatives: No machine learning model is perfect. There
can be instances in which legitimate emails are identified as phishing (false
positives) while some actual phishing emails pass through undetected (false
negatives). Striking a balance between accuracy and minimizing false positives and
negatives is very important. For instance, cost-sensitive learning techniques may
help prioritize the reduction of false positives, which disrupt user workflows,
while preserving a reasonable detection rate for phishing emails.
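
The trade-off can be illustrated with a simple cost-based choice of decision threshold, sketched below. This is a post-hoc relative of cost-sensitive learning rather than a training procedure, and the costs, scores, and candidate thresholds are hypothetical values chosen for illustration.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Sum the cost of false positives and false negatives at a given threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += fp_cost  # legitimate email flagged as phishing
        elif predicted == 0 and label == 1:
            cost += fn_cost  # phishing email missed
    return cost


def best_threshold(scores, labels, candidates, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost),
    )


# Hypothetical classifier scores (probability of phishing) and true labels.
scores = [0.2, 0.55, 0.45, 0.9]
labels = [0, 0, 1, 1]
candidates = [0.3, 0.5, 0.7]
print(best_threshold(scores, labels, candidates, fp_cost=5.0, fn_cost=1.0))
print(best_threshold(scores, labels, candidates, fp_cost=1.0, fn_cost=5.0))
```

When false positives are assumed to be more costly the higher threshold is selected, and when missed phishing emails are more costly the lower one is selected.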

Computational resources: Training sophisticated machine learning models can be
resource-intensive and may require high-performance computing. Furthermore,
real-time processing of huge volumes of email may need considerable computational
capacity. Organizations that deploy ML-based phishing detection infrastructure must
take into account the costs of maintaining these systems. Cloud-based solutions can
be cost efficient for small institutions with limited on-premise resources.

Explainability and transparency: Understanding how an ML model arrives at its
decision can be difficult. In a security context, lack of explainability is
problematic because understanding why an email was flagged is critical. Feature
importance analysis may shed some light on the model's decision process, but fully
explaining complex models may prove difficult.

Successful implementation: a multi-dimensional approach

Although machine learning has been found to be a powerful tool for detecting
phishing, it should be combined with other security measures to form a
comprehensive plan that maximizes protection. The main points are as follows:

Employee training and user education: When it comes to phishing attacks, employees
are usually on the front line. Organizations need to invest in user education
programs that train their staff to detect popular phishing techniques and to follow
safe email practices.

Secure Email Gateways (SEGs): SEGs can be deployed to scan both incoming and
outgoing emails for malicious content such as phishing. They employ anti-virus and
anti-phishing technologies to detect suspicious emails before they reach the
recipient's inbox.

Email authentication protocols (DMARC, SPF, DKIM): These protocols protect against
email spoofing, a technique frequently used by attackers for phishing. DMARC
enables a company to specify how email recipients should treat emails claiming to
originate from its domain, thereby preventing domain spoofing. SPF and DKIM help
validate that the sender is genuine.

DATASET DESCRIPTION

One of the major challenges we faced was the scarcity of phishing datasets. Many
research papers on phishing detection have been published, but most of them do not
provide the dataset used in the research. An ideal dataset to work on contains a
standard set of characteristics recorded for each phishing website. The dataset we
used in our research is well equipped with features: it has a range index of 10,000
entries, ranging from 0 to 9999, and a total of 50 columns. Each website is marked
as either legitimate or phishing. The features of our dataset are as follows:

 #  Column                              Non-Null Count  Dtype
--  ------                              --------------  -----
 0  id                                  10000 non-null  int64
 1  NumDots                             10000 non-null  int64
 2  SubdomainLevel                      10000 non-null  int64
 3  PathLevel                           10000 non-null  int64
 4  UrlLength                           10000 non-null  int64
 5  NumDash                             10000 non-null  int64
 6  NumDashInHostname                   10000 non-null  int64
 7  AtSymbol                            10000 non-null  int64
 8  TildeSymbol                         10000 non-null  int64
 9  NumUnderscore                       10000 non-null  int64
10  NumPercent                          10000 non-null  int64
11  NumQueryComponents                  10000 non-null  int64
12  NumAmpersand                        10000 non-null  int64
13  NumHash                             10000 non-null  int64
14  NumNumericChars                     10000 non-null  int64
15  NoHttps                             10000 non-null  int64
16  RandomString                        10000 non-null  int64
17  IpAddress                           10000 non-null  int64
18  DomainInSubdomains                  10000 non-null  int64
19  DomainInPaths                       10000 non-null  int64
20  HttpsInHostname                     10000 non-null  int64
21  HostnameLength                      10000 non-null  int64
22  PathLength                          10000 non-null  int64
23  QueryLength                         10000 non-null  int64
24  DoubleSlashInPath                   10000 non-null  int64
25  NumSensitiveWords                   10000 non-null  int64
26  EmbeddedBrandName                   10000 non-null  int64
27  PctExtHyperlinks                    10000 non-null  float64
28  PctExtResourceUrls                  10000 non-null  float64
29  ExtFavicon                          10000 non-null  int64
30  InsecureForms                       10000 non-null  int64
31  RelativeFormAction                  10000 non-null  int64
32  ExtFormAction                       10000 non-null  int64
33  AbnormalFormAction                  10000 non-null  int64
34  PctNullSelfRedirectHyperlinks       10000 non-null  float64
35  FrequentDomainNameMismatch          10000 non-null  int64
36  FakeLinkInStatusBar                 10000 non-null  int64
37  RightClickDisabled                  10000 non-null  int64
38  PopUpWindow                         10000 non-null  int64
39  SubmitInfoToEmail                   10000 non-null  int64
40  IframeOrFrame                       10000 non-null  int64
41  MissingTitle                        10000 non-null  int64
42  ImagesOnlyInForm                    10000 non-null  int64
43  SubdomainLevelRT                    10000 non-null  int64
44  UrlLengthRT                         10000 non-null  int64
45  PctExtResourceUrlsRT                10000 non-null  int64
46  AbnormalExtFormActionR              10000 non-null  int64
47  ExtMetaScriptLinkRT                 10000 non-null  int64
48  PctExtNullSelfRedirectHyperlinksRT  10000 non-null  int64
49  CLASS_LABEL                         10000 non-null  int64
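
Several of the columns listed above are lexical properties of the URL string. The sketch below shows how a handful of them could be computed with the Python standard library. The column names are taken from the table, but the exact definitions used to build the dataset are not given here, so each computation (for example, counting dots over the whole URL or the IPv4 pattern) is an assumption, and `url_features` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

IPV4_PATTERN = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def url_features(url):
    """Compute a few URL-based features (assumed definitions, see lead-in)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "NumDots": url.count("."),
        "UrlLength": len(url),
        "NumDash": url.count("-"),
        "NumDashInHostname": host.count("-"),
        "AtSymbol": int("@" in url),
        "NoHttps": int(parsed.scheme != "https"),
        "IpAddress": int(IPV4_PATTERN.fullmatch(host) is not None),
        "HostnameLength": len(host),
        "PathLength": len(parsed.path),
        "QueryLength": len(parsed.query),
    }


sample_url = "http://192.168.0.1/login-update/index.php?user=a@b.com"
for name, value in url_features(sample_url).items():
    print(name, value)
```

Each URL becomes one row of numeric values in this way, which is the form the classifiers discussed earlier expect as input.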

The mean and standard deviation of all features are given below:-

Columns Mean Std. dev

------- -------- ---------

NumDots 2.445100 1.346769

SubdomainLevel 0.586800 0.751176

PathLevel 3.300300 1.863148

UrlLength 70.264100 33.368209

NumDash 1.818000 3.106103

NumDashInHostname 0.138900 0.545717

AtSymbol 0.000300 0.017318

TildeSymbol 0.013100 0.113703

NumUnderscore 0.323200 1.114604

NumPercent 0.073800 0.622217

NumQueryComponents 0.458600 1.344725

NumAmpersand 0.277200 1.117300


NumHash 0.002300 0.047903

NumNumericChars 5.810300 9.617396

NoHttps 0.988800 0.105236

RandomString 0.525200 0.499365

IpAddress 0.017200 0.130016

DomainInSubdomains 0.022200 0.147333

DomainInPaths 0.428900 0.494919

HttpsInHostname 0.000000 0.000000

HostnameLength 18.824300 8.116134

PathLength 35.564900 24.587273

QueryLength 8.606500 24.311838

DoubleSlashInPath 0.000900 0.029986

NumSensitiveWords 0.109300 0.368719

EmbeddedBrandName 0.057100 0.232034

PctExtHyperlinks 0.241334 0.342353

PctExtResourceUrls 0.39293 0.387273

ExtFavicon 0.167200 0.373154

InsecureForms 0.844000 0.362855

RelativeFormAction 0.248700 0.432260

ExtFormAction 0.101800 0.302385

AbnormalFormAction 0.057600 0.232985

PctNullSelfRedirectHyperlinks 0.136136 0.312398

FrequentDomainNameMismatch 0.215300 0.411030

FakeLinkInStatusBar 0.005500 0.073958

RightClickDisabled 0.014000 0.117490


PopUpWindow 0.004900 0.069828

SubmitInfoToEmail 0.128800 0.334978

IframeOrFrame 0.339600 0.473573

MissingTitle 0.032200 0.176531

ImagesOnlyInForm 0.030400 0.171685

SubdomainLevelRT 0.956600 0.248025

UrlLengthRT 0.020200 0.819995

PctExtResourceUrlsRT 0.353300 0.888864

AbnormalExtFormActionR 0.793200 0.520993

ExtMetaScriptLinkRT 0.173400 0.755733

PctExtNullSelfRedirectHyperlinksRT 0.314100 0.897798

CLASS_LABEL 0.500000 0.500000
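
As a cross-check on the table above, the following minimal sketch shows how a per-column mean and standard deviation can be computed. It uses only the Python standard library and a tiny made-up sample: the `toy_columns` values and the `summarize` helper are hypothetical and are not taken from the dataset. The deviation of exactly 0.500000 reported for the balanced CLASS_LABEL column matches the population formula (dividing by n), so the sketch uses `statistics.pstdev`; that the figures in the table were produced this way is an assumption.

```python
from statistics import fmean, pstdev

# Hypothetical toy sample; the real dataset has 10000 rows per column.
toy_columns = {
    "NumDots": [1, 2, 3, 2],
    "CLASS_LABEL": [0, 1, 0, 1],
}


def summarize(columns):
    """Return {column name: (mean, population standard deviation)}."""
    return {
        name: (fmean(values), pstdev(values))
        for name, values in columns.items()
    }


summary = summarize(toy_columns)
for name, (mean, std) in summary.items():
    # Same layout as the table: column name, mean, standard deviation.
    print(f"{name:<12} {mean:.6f} {std:.6f}")
```

A balanced binary column always has a population standard deviation of 0.5, which is why the toy CLASS_LABEL column reproduces the last row of the table.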

Evaluation Metrics:-

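To evaluate the models on the phishing classification task, we use accuracy, recall, precision and F1 score.

Accuracy is the fraction of all websites that are classified correctly. Recall measures the percentage of phishing websites that the model manages to detect, while precision measures the percentage of websites flagged as phishing that actually are phishing. The F1 score is the harmonic mean of precision and recall.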

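Let $N_{L \to L}$ be the number of legitimate websites classified as legitimate, $N_{L \to P}$ the number of legitimate websites misclassified as phishing, $N_{P \to L}$ the number of phishing websites misclassified as legitimate, and $N_{P \to P}$ the number of phishing websites classified as phishing. The following equations then hold, where $p$ and $r$ in equation (4) denote precision and recall:

$$\text{Accuracy} = \frac{N_{L \to L} + N_{P \to P}}{N_{L \to L} + N_{L \to P} + N_{P \to L} + N_{P \to P}} \tag{1}$$

$$\text{Recall} = \frac{N_{P \to P}}{N_{P \to L} + N_{P \to P}} \tag{2}$$

$$\text{Precision} = \frac{N_{P \to P}}{N_{L \to P} + N_{P \to P}} \tag{3}$$

$$\text{F1 score} = \frac{2pr}{p + r} \tag{4}$$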
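
Equations (1) to (4) translate directly into code. The sketch below is illustrative only: the function name `phishing_metrics` and the four counts in the usage line are hypothetical and are not results from this study.

```python
def phishing_metrics(n_ll, n_lp, n_pl, n_pp):
    """Compute accuracy, recall, precision and F1 from the four counts.

    n_ll: legitimate websites classified as legitimate
    n_lp: legitimate websites misclassified as phishing
    n_pl: phishing websites misclassified as legitimate
    n_pp: phishing websites classified as phishing
    """
    total = n_ll + n_lp + n_pl + n_pp
    accuracy = (n_ll + n_pp) / total                              # Eq. (1)
    # Guard against empty denominators so degenerate inputs do not raise.
    recall = n_pp / (n_pl + n_pp) if (n_pl + n_pp) else 0.0       # Eq. (2)
    precision = n_pp / (n_lp + n_pp) if (n_lp + n_pp) else 0.0    # Eq. (3)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0         # Eq. (4)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for 100 legitimate and 100 phishing websites.
metrics = phishing_metrics(n_ll=90, n_lp=10, n_pl=20, n_pp=80)
for name, value in metrics.items():
    print(f"{name:<9} {value:.4f}")
```

With these made-up counts, recall (80 of 100 phishing sites caught) is lower than precision (80 of 90 alerts correct), and the F1 score falls between the two.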

Experimental Results:-

In our study, we applied a range of machine learning models to phishing detection: logistic regression, AdaBoost, random forest, gradient boosting, SVM, a stacking classifier, a voting classifier, XGBoost and GaussianNB.

We evaluate the accuracy, precision, recall and F1 score of each model and compare them to identify the best-performing model for the dataset. The table below compares the accuracy, precision, recall and F1 score of these models.

We observed that the classifiers vary widely in capability and performance metrics. The support vector machine (SVM) produced notably different results across kernels, with the RBF kernel performing best, which suggests that its non-linear classification ability suits our dataset. We also recognize the importance of meticulous hyperparameter tuning through cross-validation, especially for models like SVM: overfitting is a real concern, and cross-validation helps balance fit on the training data against generalization to unseen data.
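
The kernel comparison and cross-validated tuning described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the configuration used in the study: the data comes from `make_classification` as a synthetic stand-in for the phishing features, and the parameter grid, the 5-fold split and the F1 scoring are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the phishing feature matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so each fold is scaled using only
# its own training portion.
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid: three kernels and three regularization strengths.
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean cross-validated F1:", round(search.best_score_, 3))
```

Selecting the kernel on cross-validated scores rather than on training accuracy is what guards against choosing an overfitted configuration.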

Random Forest stood out as one of the best performers, with high accuracy, robustness against noise and outliers, and efficient feature selection capabilities. These properties are consistent with the accuracy of 98% and F1 score of 98.5% that random forest achieved in our experimental analysis. Despite these strengths, tuning its numerous parameters for optimal performance proved challenging.
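
One common way to use a random forest for feature selection is to rank features by the impurity-based importances the fitted model exposes. The sketch below assumes scikit-learn and synthetic data; the column names are borrowed from the dataset purely as labels, the values are generated, and the hyperparameters are illustrative rather than those used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names borrowed from the dataset for readability; the values are synthetic.
feature_names = [
    "UrlLength", "NumDots", "NumDash",
    "PathLevel", "QueryLength", "HostnameLength",
]

# With shuffle=False the first three columns carry the signal and the
# remaining three are noise (an assumption of this toy setup).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
top_features = sorted(importances, key=importances.get, reverse=True)[:3]
print(top_features)
```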

XGBoost was also one of the best models on our dataset; its strengths are its speed and its regularization for variance reduction. In our experimental analysis it reached an accuracy of 98.8% and an F1 score of 98.8%. However, we also noted the algorithm's complexity and the expertise required to tune it effectively for the challenges of our dataset.

Logistic regression performed fairly well, with an accuracy of 92.8% and an F1 score of 92.8%. Its simplicity and efficient training make it easy to use and a valuable choice for binary classification tasks. Despite its linear nature, it handles feature scaling well and provides clear and useful insight into the features through coefficient analysis.
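
Coefficient analysis of the kind mentioned above can be sketched as follows. The example assumes scikit-learn and synthetic data: the four column names are borrowed from the dataset only as labels, and the values come from `make_classification`, so the printed weights say nothing about the real features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Names borrowed from the dataset for readability; the values are synthetic.
feature_names = ["UrlLength", "NumDots", "NumDash", "PathLevel"]
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    random_state=0,
)

# Standardizing first puts all coefficients on a comparable scale.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# One coefficient per feature for binary classification.
coefficients = model.named_steps["logisticregression"].coef_[0]

# Rank features by absolute weight; the sign gives the direction of the
# effect on the log-odds of the positive class.
ranked = sorted(
    zip(feature_names, coefficients),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, weight in ranked:
    print(f"{name:<10} {weight:+.3f}")
```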

AdaBoost and gradient boosting are ensemble techniques; they achieved accuracies of 97.5% and 97.95% and F1 scores of 97.54% and 97.98%, respectively. AdaBoost's ability to combine weak learners into a strong classifier makes it resistant to overfitting on low-noise datasets, and its simplicity and ease of use made it a natural choice for our experiments. Gradient boosting is an iterative training process in which each new model corrects the errors of the previous ones, leading to strong predictive performance. Its ability to handle both numerical and categorical features made it a suitable choice for our experiments.

The Stacking Classifier and Voting Classifier demonstrated high accuracy and
precision in ensemble learning experiments. The Stacking Classifier combined
multiple classification models, optimizing ensemble predictions, but may require
more computational resources. The Voting Classifier combined strengths from
different algorithms, improving model robustness.
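
The two ensemble strategies can be contrasted in a short scikit-learn sketch. The base learners chosen here (logistic regression, random forest and Gaussian naive Bayes), the logistic-regression meta-learner, and the synthetic data are assumptions for illustration; the text above does not state which models were combined in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the phishing feature matrix (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def base_learners():
    """Return fresh, unfitted base models (an illustrative selection)."""
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ]


# Soft voting averages the class probabilities of the base models.
voting = VotingClassifier(estimators=base_learners(), voting="soft")

# Stacking trains a meta-learner on cross-validated base predictions,
# which is where the extra computational cost comes from.
stacking = StackingClassifier(
    estimators=base_learners(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

scores = {}
for name, clf in [("voting", voting), ("stacking", stacking)]:
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # held-out accuracy

print(scores)
```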

CONCLUSION AND FUTURE WORK

In this study, ten different classifiers, including Logistic Regression, Decision Tree, Support Vector Machine (SVM), AdaBoost, Random Forest, Gradient Boosting and XGBoost, were evaluated on a phishing website dataset. The findings indicate that ensemble classifiers, especially Random Forest and XGBoost, perform better in terms of both computation time and prediction accuracy. Consistent with established practice for classification tasks, ensemble methods, which combine many weak learners into one strong learner, proved effective. On low-noise data, AdaBoost was robust to overfitting and easy to interpret; on noisy samples, however, it showed drawbacks, namely very long learning times and possible distortion of the results. It was also slower in execution than Random Forest and XGBoost.

As part of this work, we also developed a hybrid model combining SVM and Random Forest, aiming to enhance the detection of phishing websites by utilizing the advantages of both classifiers. This hybrid approach is intended to offer high classification performance by combining the high accuracy of Random Forest with the effective boundary setting of the SVM.

The results motivate further research aimed at enhancing the dataset by incorporating additional features, as well as combining machine learning models with other phishing detection techniques, specifically list-based methods. Furthermore, we plan to refine existing features or develop new ones so that phishing detection techniques can keep up with new forms of phishing attack.

REFERENCES

[1] FBI, “IC3 annual report released.”

[2] APWG, “Phishing activity trends report.”

[3] V. B. et al., “Study on phishing attacks,” International Journal of Computer Applications, 2018.

[4] I.-F. Lam, W.-C. Xiao, S.-C. Wang, and K.-T. Chen, “Counteracting phishing page polymorphism: An image layout analysis approach,” in International Conference on Information Security and Assurance, pp. 270–279, Springer, 2009.

[5] W. Jing, “Covert redirect vulnerability,” 2017.

[6] K. Krombholz, H. Hobel, M. Huber, and E. Weippl, “Advanced social engineering attacks,” Journal of Information Security and Applications, vol. 22, pp. 113–122, 2015.

[7] P. Kumaraguru, J. Cranshaw, A. Acquisti, L. Cranor, J. Hong, M. A. Blair, and T. Pham, “School of phish: a real-world evaluation of anti-phishing training,” in Proceedings of the 5th Symposium on Usable Privacy and Security, pp. 1–12, 2009.

[8] R. C. Dodge Jr, C. Carver, and A. J. Ferguson, “Phishing for user security awareness,” Computers & Security, vol. 26, no. 1, pp. 73–80, 2007.

[9] R. Dhamija, J. D. Tygar, and M. Hearst, “Why phishing works,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 581–590, 2006.

[10] C. Ludl, S. McAllister, E. Kirda, and C. Kruegel, “On the effectiveness of techniques to detect phishing sites,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 20–39, Springer, 2007.

[11] A. P. Rosiello, E. Kirda, F. Ferrandi, et al., “A layout-similarity-based approach for detecting phishing pages,” in 2007 Third International Conference on Security and Privacy in Communications Networks and the Workshops-SecureComm 2007, pp. 454–463, IEEE, 2007.

[12] S. Afroz and R. Greenstadt, “PhishZoo: Detecting phishing websites by looking at them,” in 2011 IEEE Fifth International Conference on Semantic Computing, pp. 368–375, IEEE, 2011.

[13] K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen, “Fighting phishing with discriminative keypoint features,” IEEE Internet Computing, vol. 13, no. 3, pp. 56–63, 2009.

[14] A. K. Jain and B. B. Gupta, “Phishing detection: Analysis of visual similarity based approaches,” Security and Communication Networks, vol. 2017, 2017.

[15] R. S. Rao and S. T. Ali, “A computer vision technique to detect phishing attacks,” in 2015 Fifth International Conference on Communication Systems and Network Technologies, pp. 596–601, IEEE, 2015.

[16] B. B. Gupta, N. A. Arachchilage, and K. E. Psannis, “Defending against phishing attacks: taxonomy of methods, current issues and future directions,” Telecommunication Systems, vol. 67, no. 2, pp. 247–267, 2018.

[17] A. Karatzoglou, D. Meyer, and K. Hornik, “Support vector machines in R,” Journal of Statistical Software, vol. 15, no. 9, pp. 1–28, 2006.

[18] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[19] T. Hastie, S. Rosset, J. Zhu, and H. Zou, “Multi-class AdaBoost,” Statistics and its Interface, vol. 2, no. 3, pp. 349–360, 2009.

[20] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.

[21] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.

[22] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[23] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Phishing websites features,” School of Computing and Engineering, University of Huddersfield, 2015.
