
Journal of Ambient Intelligence and Humanized Computing (2019) 10:2015–2028
https://doi.org/10.1007/s12652-018-0798-z

ORIGINAL RESEARCH

A machine learning based approach for phishing detection using hyperlinks information

Ankit Kumar Jain · B. B. Gupta

Received: 11 December 2017 / Accepted: 14 April 2018 / Published online: 26 April 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract
This paper presents a novel approach that detects phishing attacks by analysing the hyperlinks found in the HTML source code of a website. The proposed approach incorporates several new hyperlink-specific features, divides them into 12 categories, and uses these features to train machine learning algorithms. We have evaluated the performance of the proposed phishing detection approach with various classification algorithms on a dataset of phishing and non-phishing websites. The approach is an entirely client-side solution and does not require any third-party services. Moreover, it is language independent and can detect websites written in any textual language. Compared to other methods, the proposed approach achieves relatively high accuracy in detecting phishing websites, reaching more than 98.4% accuracy with the logistic regression classifier.

Keywords Cyber security · Phishing attack · Hyperlink · Social engineering · Website · Machine learning

1 Introduction

1.1 Context

Today, phishing is one of the most serious Internet security threats. In this attack, the user enters sensitive credentials such as credit card details and passwords into a fake website that looks like a genuine one (Jain and Gupta 2017a). Online payment services, e-commerce, and social networks are the sectors most affected by this attack. A phishing attack is performed by taking advantage of the visual resemblance between the fake and the authentic webpages (Jain and Gupta 2017b). The attacker creates a webpage that looks exactly like the legitimate webpage. The link of the phishing webpage is then sent to thousands of Internet users through email and other means of communication. Usually, the fake email conveys a sense of fear or urgency, or offers some prize money, and asks the user to take urgent action; e.g., the fake email will impel the user to update their PIN to avoid debit/credit card suspension. When the user unknowingly updates the confidential credentials, the cyber criminals acquire the user's details (Bhuiyan et al. 2016; Fan et al. 2016; Li et al. 2018). Phishing attacks are performed not only for gaining information; phishing has also become the number one delivery method for spreading other types of malicious software such as ransomware. 90% of all active cyber-attacks start with a phishing email (Phishingpro Report 2016). Phishing encompasses over half of all cyber fraud affecting Internet users. According to the APWG report, 291,096 unique phishing websites were detected between January and June 2017 (APWG H1 2017 Report 2017). The per-month attack volume has also grown by 5753% over the 12 years from 2004 to 2016 (1609 phishing attacks per month in 2004 and an average of 92,564 attacks per month in 2016). Figure 1 presents the growth of phishing attacks from 2005 to 2016.

* Corresponding author: B. B. Gupta ([email protected]), National Institute of Technology, Kurukshetra, India

1.2 Problem definition

Recent developments in phishing detection have led to the growth of various new machine learning based techniques. In machine learning based techniques, a classification algorithm is trained using features that can differentiate a phishing website from a legitimate one (Jain and Gupta 2016a). These features are extracted from various


Fig. 1 Growth of phishing attacks: number of unique phishing sites detected per year, 2005–2016

sources like the URL, search engine page source, website traffic, search engines, DNS, etc. Existing machine learning based methods extract features from third parties, search engines, etc. Therefore, they are complicated, slow in nature, and not fit for a real-time environment. Phishing websites are short-lived, and thousands of fake websites are generated every day. Therefore, there is a requirement for a real-time, fast, and intelligent phishing detection solution.

1.3 Proposed solution

To solve the above problem, this paper presents a novel machine learning based anti-phishing approach that extracts features from the client side only. The approach detects phishing websites using the hyperlink information present in the source code of the website. It extracts the hyperlinks from the page source and analyses them to decide whether the given website is phishing or not. We have divided the hyperlink features into 12 categories, namely total hyperlinks, no hyperlink, internal hyperlinks, external hyperlinks, internal error, external error, internal redirect, external redirect, null hyperlink, login form link, external/internal CSS, and external/internal favicon.

1.4 Contributions

The following are the major contributions of our paper:

• The proposed approach extracts its features from the web browser only and does not depend on third party services (e.g. search engines, third party DNS, Certification Authorities, etc.). Therefore, it can be implemented on the client side and provides better privacy.
• The proposed approach can identify "zero-hour" phishing attacks with high accuracy.
• The proposed approach can detect phishing websites written in any textual language.
• We have also conducted a sensitivity analysis to identify the most powerful features for the detection of phishing websites.

1.5 Experimental results

We have evaluated our proposed phishing detection approach on various classification algorithms using a dataset of 2544 phishing and non-phishing websites. Experimental results show that logistic regression performs best in the detection of phishing websites. The proposed approach achieves relatively high accuracy, with more than a 98.39% true positive rate and only a 1.52% false positive rate. Moreover, the accuracy, precision, and f1 score of our approach are 98.42, 98.80, and 98.59%, respectively. We have also examined the area under the receiver operating characteristic (ROC) curve. In our experiment, the area under the ROC curve is 99.6%, which shows that our approach classifies websites with high accuracy.

1.6 Outline

The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 describes our proposed approach in detail. Section 4 presents the extraction of the various features used to train the machine learning algorithms. Section 5 presents the implementation details, evaluation metrics, and experimental results. Finally, Sect. 6 concludes the paper and presents future work.
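As a compact illustration of the 12 feature categories listed in Sect. 1.3, the features can be held in a fixed-order vector. This is an illustrative sketch only, not the authors' implementation; all names below are invented labels.

```python
# Illustrative sketch: the 12 hyperlink feature categories from Sect. 1.3
# as an ordered feature vector. Names are paraphrased labels, not
# identifiers from the authors' implementation.
FEATURE_NAMES = [
    "total_hyperlinks",           # F1
    "no_hyperlink",               # F2
    "internal_hyperlinks",        # F3
    "external_hyperlinks",        # F4
    "null_hyperlinks",            # F5
    "internal_external_css",      # F6
    "internal_redirect",          # F7
    "external_redirect",          # F8
    "internal_error",             # F9
    "external_error",             # F10
    "login_form_link",            # F11
    "internal_external_favicon",  # F12
]

def to_vector(features: dict) -> list:
    """Order a feature dict into the fixed 12-element vector; missing
    features default to 0 (the 'legitimate' value)."""
    return [features.get(name, 0) for name in FEATURE_NAMES]

print(len(to_vector({"total_hyperlinks": 7})))  # 12
```

One such vector is produced per crawled website and fed to the classifiers described in Sect. 5.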


2 Related work

In this section, we present an overview of various anti-phishing solutions proposed in the literature. Phishing detection approaches fall into two categories: the first is based on user education, and the other relies on software. In user education based approaches, Internet users are taught the characteristics of phishing attacks, which eventually leads them to correctly distinguish phishing websites and emails from legitimate ones (Kumaraguru et al. 2007). Software-based approaches are further classified into machine learning, blacklist, and visual similarity based approaches. A machine learning based approach trains a classification algorithm with some features, and a website is declared as phishing if its design matches the predefined feature set. Visual similarity based approaches compare the visual appearance of the suspicious website with that of the corresponding legitimate website (Jain and Gupta 2017a). A blacklist matches the suspicious domain against a set of known phishing domains. The drawback of blacklist and visual similarity based schemes is that they usually do not cover newly launched (i.e. zero-hour) phishing websites; most phishing URLs are added to a blacklist only about 12 h after the attack begins (Sheng et al. 2009). Therefore, machine learning based approaches are more effective in dealing with phishing attacks. Some machine learning based approaches from the literature are discussed below.

Pan and Ding (2006) proposed an anti-phishing method that inspects anomalies in the website. The approach extracts anomalies from various sources such as the URL, page title, cookies, login form, DNS records, and SSL certificates. It used SVM and achieved an 88% true positive rate and a 29% false positive rate; however, the scheme was evaluated on a dataset of only 379 websites. Zhang et al. (2007) proposed a content-specific approach, CANTINA, that detects phishing webpages by analysing text content with the TF-IDF algorithm. The top five keywords with the highest TF-IDF scores are submitted to a search engine to extract the relevant domains. CANTINA also uses heuristics such as special symbols in the URL ("@" at sign, "–" dash), dot count, domain age, etc. However, the accuracy of the scheme depends on the TF-IDF algorithm and the language used on the website, and CANTINA's 6% false positive rate is considered very high. Abu-Nimeh et al. (2007) compared six machine learning algorithms for phishing e-mail detection, namely logistic regression, Bayesian additive regression trees, SVM, random forest, neural networks, and regression trees. The results show that there is no single machine learning algorithm that efficiently detects all phishing attacks. Garera et al. (2007) proposed a technique based on phishing URLs. The approach discussed four different kinds of obfuscation techniques used in phishing URLs and uses logistic regression as a classifier. However, this technique cannot identify tiny-URL based phishing websites. Mohammad et al. (2014) proposed an intelligent phishing detection system using a self-structuring neural network. The authors collected 17 features from the URL, the source code, and third parties to train the system, with back-propagation used to adjust the network weights. Although the network design was somewhat complex, the training and testing accuracies were 94.07 and 92.18%, respectively, after 1000 epochs. Aburrous et al. (2010) used 27 features to construct a fuzzy-logic based model for detecting phishing attacks on banking websites. The authors drew features from the URL, page content (e.g. spelling errors), SSL certificates, etc. This approach focused only on e-banking websites and did not report detection results on other types of websites. Whittaker et al. (2010) published research on large-scale classification of phishing websites using features from the URL, page hosting, and page content; the TPR and FPR of the approach are 90 and 0.1%, respectively. Xiang et al. (2011) proposed CANTINA+, which takes 15 features from the URL, the HTML DOM (Document Object Model), third party services, and search engines, and trains them using a support vector machine (SVM). However, the performance of the scheme is affected by third party services such as WHOIS lookups and search results. He et al. (2011) used 12 features from legitimate and phishing websites and achieved a 97% true positive rate and a 4% false positive rate; the features are taken from meta tags, webpage content, the URL, hyperlinks, TF-IDF, etc. Zhang et al. (2017) extract hybrid features from the URL, text content, and the web, and use the extreme learning machine (ELM) technique. The first phase of this technique builds a textual content classifier to predict the label of the textual content using ELM, with OCR software used to extract text from images; the second phase combines the text classifier with another hybrid feature based classifier. El-Alfy (2017) proposed an approach that builds probabilistic neural networks (PNNs). The benefits of the PNN are fast training, insensitivity to outliers, and good generalisation; however, a PNN may require large amounts of space and time as the data grows, so the authors use K-medoids clustering with the PNN to reduce the number of training instances. Montazera and ArabYarmohammadi (2015) proposed an anti-phishing method for the e-banking system of Iran. The authors identified 28 features utilized by attackers to mimic Iranian banking websites, and the detection accuracy is 88% on the Iranian banking system. That approach is designed to identify Iranian banking websites only, while our approach can filter all kinds of phishing and legitimate websites.


Usually, machine learning based techniques compare the features of the suspicious website with a predefined feature set (Wang et al. 2018; Lin et al. 2018). Therefore, the accuracy of a scheme depends on the feature set and on how accurately a defender chooses the features (Maio et al. 2017, 2018).

3 Proposed approach

Figure 2 presents the system architecture of the proposed approach. The selection of an outstanding feature set is the major contribution of this paper. We have proposed six new features to improve the accuracy of phishing webpage detection. Our proposed features identify the relation between the webpage content and the URL of the webpage, and are based on the hyperlinks of the webpage. A website can be transformed into a Document Object Model (DOM) tree, which is used to extract the hyperlink features as shown in Fig. 3. In our approach, we gather the website hyperlink features automatically using a web crawler, as shown in Fig. 4. In the hyperlink extraction process, relative links are replaced by their hierarchically known absolute links. Our proposed approach takes its decision based on 12 features, namely total hyperlinks, no hyperlink, internal hyperlinks, external hyperlinks, null hyperlinks, internal error, external error, internal redirect, external redirect, login form link, external/internal CSS, and external/internal favicon.

In particular, features 2, 6, 7, 8, 9, and 10 are novel and proposed by us. Features 1, 3, 4, 5, 11, and 12 are taken from other approaches (Mohammad et al. 2014; Whittaker et al. 2010; Xiang et al. 2011; He et al. 2011); however, we fine-tuned these adopted features through various experiments to get better results. After extraction of these features, a feature vector is created for each website. We have constructed the training and testing datasets by extracting the 12 defined features from phishing and non-phishing websites. The training phase generates a binary classifier from the feature vectors of the phishing and legitimate website datasets. In the testing phase, the classifier determines whether a new site is a phishing site or not, taking its decision based on what it learned from the labelled dataset. The binary classifier classifies websites into two possible categories, namely phishing and legitimate.
Fig. 2  System architecture
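The relative-to-absolute link normalisation and the internal/external split described above can be sketched with Python's standard library. This is a simplification of the paper's Java/Jsoup pipeline: we compare full hostnames, whereas the authors obtain registered base domains via the Guava library.

```python
from urllib.parse import urljoin, urlparse

def resolve_links(page_url: str, hrefs: list) -> list:
    """Resolve relative hyperlinks against the page URL so that every
    link carries an explicit host for feature extraction."""
    return [urljoin(page_url, h) for h in hrefs]

def is_internal(page_url: str, link: str) -> bool:
    """A link is internal when its host matches the visited page's host
    (a simplification of the paper's base-domain comparison)."""
    return urlparse(link).netloc == urlparse(page_url).netloc

links = resolve_links("https://example.com/login",
                      ["/css/site.css", "https://other.com/a"])
print(links[0])                                   # https://example.com/css/site.css
print(is_internal("https://example.com/login", links[0]))  # True
```

With every hyperlink in absolute form, counting internal versus external links for the features of Sect. 4 becomes a simple partition of the resolved list.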


Fig. 3  HTML DOM tree

Fig. 4  Web crawler to extract features

When a user requests a new website, the crawler generates the feature values and the binary classifier then identifies the given website.

4 Features extraction

The accuracy of a phishing detection scheme depends on the feature set used to distinguish phishing from legitimate websites. Based on the limitations of the individual and third-party dependent approaches given in Sect. 2, we have adopted hyperlink-specific features in the proposed approach. These features are extracted on the client side and do not depend on any third party services. In the following, F = {F1, F2, …, F12} is defined as the feature vector, with one entry per feature. Some features produce a value of 1 or 0, where 1 indicates phishing and 0 indicates legitimate. We discuss all these features in the following subsections.

4.1 Total and no hyperlink features (F1 and F2)

Phishing websites are small compared to legitimate websites. A legitimate website usually contains many webpages, whereas a phishing website consists of very limited


webpages, sometimes only one or two. Moreover, sometimes a phishing website does not provide any hyperlink at all, because attackers use hyperlink hiding techniques (Geng et al. 2014). An attacker may also use server-side scripting and framesets to hide the source code of the webpage (Jain and Gupta 2016b). From our experiments, we observe that if a website is genuine, we can extract at least one hyperlink from its source code. Therefore, if the approach does not extract any link from the source code, the website is considered a phishing website (feature F2). Total hyperlinks are calculated by counting the href, link, and src tags. Taking the absence of hyperlinks as a separate feature increases the true positive rate of the proposed approach.

F1 = total number of hyperlinks present in a website  (1)

F2 = 0 if F1 > 0; 1 if F1 = 0  (2)

4.2 Internal and external hyperlinks (F3 and F4)

Internal and external hyperlinks are hyperlinks whose base domain is, respectively, the same as or different from that of the visited page. A phishing website usually copies the source code of its targeted official website and may therefore contain many hyperlinks that point to the targeted website. In a legitimate website, most hyperlinks contain the same base domain, while in a phishing website many hyperlinks may contain the domain of the corresponding legitimate website. In our experiment, we found that out of 1428 phishing websites, 593 included direct hyperlinks to their official website. To set the internal hyperlink feature, we calculate the ratio of internal hyperlinks to the total links present in the website (Eq. 3); if the ratio is less than 0.5, the feature is set to 1, else 0 (Eq. 4). Similarly, for the external hyperlink feature, we calculate the ratio of external hyperlinks to the total available links (Eq. 5); if the ratio is greater than 0.5, the feature is set to 1, else 0 (Eq. 6).

Ratio_Internal = H_Internal / H_Total if H_Total > 0; 0 if H_Total = 0  (3)

F3 = 0 if Ratio_Internal ≥ 0.5; 1 if Ratio_Internal < 0.5  (4)

Ratio_External = H_External / H_Total if H_Total > 0; 0 if H_Total = 0  (5)

F4 = 0 if Ratio_External ≤ 0.5; 1 if Ratio_External > 0.5  (6)

where H_Internal, H_External, and H_Total are the numbers of internal, external, and total hyperlinks in a website, and Ratio_Internal and Ratio_External are the ratios of internal and external hyperlinks to the total available hyperlinks.

4.3 Null hyperlink (F5)

In a null hyperlink, the href attribute of the anchor tag does not contain any URL; when the user clicks on the null link, the same page is loaded again. A legitimate website consists of many webpages, so to behave like a legitimate website the phisher places empty values in hyperlinks, and the links appear active on the website. Phishers also exploit web browser vulnerabilities with the help of empty links (Jain and Gupta 2016b). The HTML used for null hyperlinks includes <a href="#">, <a href="#content">, and <a href="javascript:void(0)">. To set the null hyperlink feature, we calculate the ratio of null hyperlinks to the total number of links present in the website; if the ratio is greater than 0.34, the feature is set to 1, else 0. The following equations are used to calculate the null hyperlink feature:

Ratio_Null = H_Null / H_Total if H_Total > 0; 0 if H_Total = 0  (7)

F5 = 0 if Ratio_Null ≤ 0.34; 1 if Ratio_Null > 0.34  (8)

where H_Null and H_Total are the numbers of null and total hyperlinks in a website, and Ratio_Null is the ratio of null hyperlinks to total hyperlinks present in the website.

4.4 Internal/external CSS (F6)

Cascading Style Sheets (CSS) is a language used to describe the formatting of a document and to set the visual appearance of a website written in HTML, XHTML, or XML. An attacker always tries to mimic the legitimate website and keep the design of the phishing website the same as that of the targeted website to attract potential victims. Formally, a CSS file contains a list of rules that associate groups of selectors, properties, and values with sets of declarations. The CSS of a website is either included via an external CSS file or embedded within the HTML itself. External CSS files are attached to an HTML page using the <link> tag; to extract an external CSS file, we look for a tag of the form <link … rel="stylesheet" … href="URL of CSS file" …>. During our experiments, we found that a phishing website typically uses only one CSS file or internal style, and that this external CSS file contains the link of the targeted legitimate website, whereas many legitimate websites use more than one CSS file or internal style. We developed an algorithm to find suspicious CSS in a website, as shown in Fig. 5.
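Equations 3–8 can be sketched as follows. This is a minimal illustration under the thresholds given in the text; the function names are ours, not the authors'.

```python
def ratio(part: int, total: int) -> float:
    """Ratio helper matching Eqs. 3, 5 and 7: defined as 0 when the
    website contains no hyperlinks at all."""
    return part / total if total > 0 else 0.0

def f3_internal(h_internal: int, h_total: int) -> int:
    # Eq. 4: phishing (1) when fewer than half of the hyperlinks are internal.
    return 1 if ratio(h_internal, h_total) < 0.5 else 0

def f4_external(h_external: int, h_total: int) -> int:
    # Eq. 6: phishing (1) when more than half of the hyperlinks are external.
    return 1 if ratio(h_external, h_total) > 0.5 else 0

def f5_null(h_null: int, h_total: int) -> int:
    # Eq. 8: phishing (1) when the null-hyperlink ratio exceeds 0.34.
    return 1 if ratio(h_null, h_total) > 0.34 else 0

# A page with 10 hyperlinks: 2 internal, 8 external, 4 null.
print(f3_internal(2, 10), f4_external(8, 10), f5_null(4, 10))  # 1 1 1
```

Note that when H_Total = 0 the ratios default to 0, so F3 evaluates to 1; this is consistent with the text, since a page without hyperlinks is already flagged by F2.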


4.5 Internal and external redirection (F7 and F8)

Redirection indicates whether a website redirects the visitor somewhere else: when a browser tries to open a URL that has been redirected, a webpage with a different URL opens. URL redirection can confuse users about which website they are surfing, and may also take the user to a bogus website. In a phishing website, there may be links that redirect to the corresponding legitimate domain, and sometimes the fake website itself redirects to the legitimate one after the login form is filled in. In this paper, we consider only response codes 301 and 302 as URL redirection, and we include both internal and external URL redirection in our feature set. For these features, we calculate the ratio of hyperlinks that redirect: internal redirection (F7) is the number of internally redirected hyperlinks divided by the total internal hyperlinks, and external redirection (F8) is the number of externally redirected hyperlinks divided by the total external hyperlinks.

F7 = H_i-redirect / H_Internal if H_Internal > 0; 0 if H_Internal = 0  (9)

F8 = H_e-redirect / H_External if H_External > 0; 0 if H_External = 0  (10)

where H_i-redirect, H_e-redirect, H_Internal, and H_External are the numbers of internal redirect, external redirect, total internal, and total external hyperlinks present in the website.

4.6 Internal and external error (F9 and F10)

In this heuristic, we check for errors in the hyperlinks of the website. The "404 not found" error occurs when a user requests a URL and the server is not able to locate it; phishers often add hyperlinks to the fake page that do not exist, so this error is generated when a user attempts to access a dead or broken link. We consider the 403 and 404 response codes of hyperlinks. The proposed approach uses a web crawler to fetch the response code of each hyperlink. Internal error (F9) is the number of internal error hyperlinks divided by the total internal hyperlinks, and external error (F10) is the number of external error hyperlinks divided by the total external hyperlinks.

F9 = H_i-error / H_Internal if H_Internal > 0; 0 if H_Internal = 0  (11)

F10 = H_e-error / H_External if H_External > 0; 0 if H_External = 0  (12)

where H_i-error, H_e-error, H_Internal, and H_External are the numbers of internal error, external error, total internal, and total external hyperlinks in a website.

4.7 Login form link (F11)

Phishing websites usually contain a login form to steal the credentials of Internet users: the personal information of the user is transferred to the attacker after the form on the fake website is filled in. The login form of a phishing website looks the same as that of the legitimate website. For this feature, we check the authenticity of login forms. In a legitimate website, the action field typically contains the URL of the current website; attackers, however, use either a different domain (other than the visited one), a null value, or a PHP file in the form action field of phishing websites (Jain and Gupta 2017c). The PHP file contains a script that saves the input data (e.g. user id or password) to a text file on the attacker's computer; such files are usually named index.php, login.php, etc. We construct an algorithm to check the authenticity of the login form, as shown in Fig. 6. The input of the algorithm is the URL of the suspicious website and the output is a value in {0, 1}, 0 for legitimate and 1 for phishing. If the hyperlink present in the action field is relative, the system replaces it with the absolute link.
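The redirect and error ratios of Sects. 4.5 and 4.6 (Eqs. 9–12) can be sketched as a pure function over HTTP status codes. This assumes the per-hyperlink response codes have already been fetched by a crawler; the function name and interface are ours.

```python
def redirect_and_error_ratios(status_codes: list) -> tuple:
    """Given the HTTP response codes of one hyperlink group (internal or
    external), return the redirect ratio (codes 301/302, Eqs. 9-10) and
    the error ratio (codes 403/404, Eqs. 11-12).
    Both ratios are defined as 0 for an empty group."""
    total = len(status_codes)
    if total == 0:
        return 0.0, 0.0
    redirects = sum(1 for c in status_codes if c in (301, 302))
    errors = sum(1 for c in status_codes if c in (403, 404))
    return redirects / total, errors / total

# Internal hyperlinks returned codes 200, 301, 404, 200.
f7, f9 = redirect_and_error_ratios([200, 301, 404, 200])
print(f7, f9)  # 0.25 0.25
```

Calling the function once on the internal group yields F7 and F9, and once on the external group yields F8 and F10.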

Fig. 5 Algorithm to detect suspicious CSS

Algorithm to detect suspicious CSS
Input: URL of suspicious website
Output: F6 ∈ {0, 1}, 0 – legitimate, 1 – phishing
Start
Step 1: Extract all the CSS files of the website
Step 2: If the CSS is internal, then set F6 = 0
Step 3: If the CSS is external and its base domain is equal to the current domain, then set F6 = 0; else set F6 = 1
End
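The Fig. 5 heuristic might be sketched as follows. Hostname comparison here is a simplification of the paper's base-domain check, and links are assumed to have been resolved to absolute URLs where applicable; the function name is ours.

```python
from urllib.parse import urlparse

def f6_suspicious_css(page_url: str, css_links: list) -> int:
    """Sketch of the Fig. 5 heuristic: internal styles and same-domain
    stylesheets are benign (0); an external stylesheet hosted on a
    different domain marks the page as phishing (1).

    css_links holds the href values of <link rel="stylesheet"> tags;
    an empty list means only internal styles were found."""
    page_domain = urlparse(page_url).netloc
    for href in css_links:
        if urlparse(href).netloc not in ("", page_domain):
            return 1  # external CSS pointing at a foreign (targeted) domain
    return 0

print(f6_suspicious_css("https://fake-bank.example",
                        ["https://realbank.com/style.css"]))  # 1
print(f6_suspicious_css("https://realbank.com/login",
                        ["/style.css"]))                      # 0
```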


Fig. 6 Algorithm to find suspicious login form

Algorithm to find suspicious login form
Input: URL of suspicious website
Output: F11 ∈ {0, 1}, 0 – legitimate, 1 – phishing
Start
Step 1: Extract the action field value of each form
Step 2: If the value of the action field is blank, "#", or javascript:void(0), then set F11 = 1
Step 3: If the value of the action field is of the form "filename.php", then set F11 = 1
Step 4: If the action field contains a foreign domain, then set F11 = 1; otherwise set F11 = 0
End

Suspicious action field values:
a. action=" "
b. action="#"
c. action="javascript:void(0)"
d. action="filename.php" // e.g. filename is the name of the php file
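The Fig. 6 steps can be sketched as below. This is an illustrative simplification: hostnames stand in for base domains, relative actions with a path are treated as same-domain (the paper converts relative links to absolute first), and the function name is ours.

```python
from urllib.parse import urlparse

NULL_ACTIONS = {"", "#", "javascript:void(0)"}

def f11_suspicious_form(page_url: str, action_values: list) -> int:
    """Sketch of the Fig. 6 heuristic on form action fields: a blank or
    null action, a bare PHP file name, or a foreign domain marks the
    page as phishing (1); otherwise legitimate (0)."""
    page_domain = urlparse(page_url).netloc
    for action in action_values:
        value = action.strip().lower()
        if value in NULL_ACTIONS:
            return 1                              # Step 2: blank, '#', javascript:void(0)
        if value.endswith(".php") and "/" not in value:
            return 1                              # Step 3: bare "filename.php"
        if urlparse(value).netloc not in ("", page_domain):
            return 1                              # Step 4: foreign domain in action
    return 0

print(f11_suspicious_form("https://fake.example", ["login.php"]))                  # 1
print(f11_suspicious_form("https://realbank.com", ["https://realbank.com/auth"]))  # 0
```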

4.8 Internal/external favicon (F12)

A favicon is an image icon associated with a particular website; an attacker may copy the favicon of the targeted website. The favicon is an .ico file linked to a URL and found in the link tag of the DOM tree. If the favicon shown in the address bar belongs to a website other than the current one, this is considered a phishing attempt. This feature takes two values, 0 (legitimate) and 1 (phishing): if the favicon belongs to the same domain, the feature is set to 0, else 1. The following HTML is typically used for favicons:

a. <link rel="shortcut icon" href="https://www.facebook.com/rsrc.php/yl/r/H3nktOa7ZMg.ico" />
b. <link rel="shortcut icon" href="//in.bmscdn.com/webin/common/favicon.ico" type="image/x-icon" />
c. <link type="image/png" href="/css/img/favicon.png" rel="shortcut icon">

4.9 Unused features

We have described the usefulness and importance of the features used in our approach. However, several other features used by various existing approaches are not appropriate for the proposed approach, for the following reasons:

1. Search engine based features. Various approaches have used search engine based features (Zhang et al. 2007; He et al. 2011; Varshney et al. 2016). These approaches verify the authenticity of the webpage by searching for the URL, domain name, title keywords, most frequent words, website logo, etc. in a popular search engine (Google, Yahoo, Bing, etc.). The approaches presented in Zhang et al. (2007) and He et al. (2011) are based on the TF-IDF algorithm; if the algorithm extracts the wrong keywords, the results are defective. Moreover, the rank assigned to a website determines its position in the list of returned links: newly published websites and isolated blogs with no connection to mainstream websites are pushed back in the search results. Furthermore, different search engines allow the search string to be specified in particular ways so that it gives the specific result the user is looking for; e.g., Google has special query patterns to search exact phrases, exclude a word, search a specific domain, search at a specific location, etc. If the search string the user enters matches such a special case, the search results could be irrelevant, and in some cases the search engine may fail to produce results for such queries.

2. Third party dependent features. We have not chosen features that depend on third party services such as DNS, blacklists/whitelists, WHOIS records, certifying authorities, search engines, etc. Third party dependent features make our approach dependent on the third party and create additional network delay, which can result in high prediction time. Moreover, the DNS database may also be poisoned.

3. URL based features. Various approaches have used URL features (Whittaker et al. 2010; Xiang et al. 2011; He et al. 2011; Zhang et al. 2017), e.g. the number of dots, the presence of the special symbols "@", "#", "–", URL length, suspicious words in the URL, the position of the top-level domain, http count,


Brand name in URL, IP address, etc.). Nowadays phishers are changing the way they perform attacks, and these techniques cannot detect TinyURL and data URI based phishing websites, which are considered popular ones.

5 System design, implementation and results

This section presents the construction of the dataset, the evaluation measures, the implementation details, and the results of the proposed anti-phishing approach. The detection of phishing websites is a binary classification problem in which various features are used to train a classifier. The trained classifier is then used to classify a new website into the phishing or legitimate category.

5.1 Training dataset

We have collected the proposed features from 2544 different phishing and legitimate websites. Table 1 presents the number of instances and the sources of the phishing and legitimate datasets. The life of a phishing website is very short; therefore, we crawled the websites while they were alive. We have used a wide range of websites in our dataset, such as blogs, social media networks, payment gateways, and banking sites. Table 2 presents a sample of the phishing and legitimate website datasets. The Alexa dataset includes 500 high ranked websites (rank 1–500) and 500 low ranked websites (rank 999,500–1,000,000). Some features contain values like "Legitimate" and "Phishing"; in this case, we replaced these values with the numerical values 0 and 1, respectively. Feature vectors having identical values were removed from the dataset. Our solution is language independent; therefore, we have also included websites in different languages to test our approach.

5.2 Evaluation metrics

We use true positive rate, false positive rate, true negative rate, false negative rate, f1 score, accuracy, precision, and recall to evaluate the performance of the proposed approach. Table 3 shows the possible true and false classification results. The performance of our approach is evaluated in the following manner:

True positive rate (TPR): measures the rate of phishing websites classified as phishing out of all phishing websites.
False positive rate (FPR): measures the rate of legitimate websites classified as phishing out of all legitimate websites.
False negative rate (FNR): measures the rate of phishing websites classified as legitimate out of all phishing websites.
True negative rate (TNR): measures the rate of legitimate websites classified as legitimate out of all legitimate websites.
Accuracy (A): measures the overall rate of correct prediction.
Precision: measures the rate of instances correctly detected as phishing with respect to all instances detected as phishing.
f1 score: the harmonic mean of precision and recall.

5.3 Implementation tool

The experiments were conducted on a Pentium i5 computer with a 2.4 GHz processor. The proposed approach is implemented in Java Platform Standard Edition 7. Jsoup (Jsoup HTML parser 2018) is used to extract hyperlinks from a website, and the Guava library (Guava libraries, Google Inc 2018) is used to obtain the base domains of the hyperlinks. We have used WEKA ("Waikato Environment for Knowledge Analysis"), an open source Java toolkit, to judge the performance of our proposed approach on various classifiers. Numerous data mining and machine learning algorithms are implemented in WEKA, and it contains a rich collection of modelling, clustering, classification, regression, and data pre-processing techniques. Experiments have been performed on various classification algorithms, namely SMO, Naive Bayes, Random Forest, Support Vector Machine (SVM), Adaboost, Neural Networks, C4.5, and Logistic Regression.

5.4 Training with classifier

We have used logistic regression (LR) as the binary classifier because it gives better accuracy compared to the other classifiers. Logistic regression is a classification

Table 1 Training and testing dataset

S. no.  Database                                         Number of instances  Phishing/legitimate
1       Phishtank dataset (2018)                         1428                 Phishing
2       Alexa top websites (2018)                        1000                 Legitimate
3       Stuffgate Free Online Website Analyzer (2018)    50                   Legitimate
4       List of online payment service providers (2018)  66                   Legitimate
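Written out from confusion-matrix counts, the metric definitions of Sect. 5.2 take the following form. This is an illustrative sketch, not the paper's implementation (which is in Java/WEKA); the counts below are approximate values back-computed from the dataset sizes in Table 1 and the rates reported later in Table 5.

```python
# Evaluation metrics of Sect. 5.2, computed from confusion-matrix counts.
# tp/fn/fp/tn are approximate counts inferred from the dataset
# (1428 phishing, 1116 legitimate sites); they are illustrative only.

def metrics(tp, fn, fp, tn):
    tpr = tp / (tp + fn)            # recall: phishing caught / all phishing
    fpr = fp / (fp + tn)            # legitimate flagged / all legitimate
    tnr = tn / (fp + tn)
    fnr = fn / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean of precision and recall
    return {"TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
            "accuracy": accuracy, "precision": precision, "f1": f1}

# Roughly 1405 of 1428 phishing and 1099 of 1116 legitimate sites classified correctly
m = metrics(tp=1405, fn=23, fp=17, tn=1099)
```

With these assumed counts, the sketch reproduces the rates reported in Table 5 (TPR ≈ 98.39%, FPR ≈ 1.52%, accuracy ≈ 98.4%).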

2024 A. K. Jain, B. B. Gupta

technique used to predict a binary dependent variable from a set of independent variables. Logistic regression estimates the probability of occurrence of the dependent variable. In our approach, the dependent variable decides whether a website is phishing, and the independent variables are the proposed feature set explained in Sect. 4. A labelled training dataset is used to train the logistic regression classifier. Our labelled dataset consists of 2544 websites, of which 1428 are phishing and 1116 are legitimate, as described in Sect. 5.1. Phishing websites are defined under the positive (true, 1) class, and legitimate websites under the negative (false, 0) class.

Table 4 presents the coefficient and odds ratio corresponding to each feature. The odds ratio is the ratio of the odds of an event occurring in the positive class (phishing) to the odds of it occurring in the negative class (non-phishing). An odds ratio of 1 means a feature is equally useful in identifying both categories (phishing and non-phishing). If the odds ratio is greater than 1, the related feature is more valuable in recognizing the positive class, and a higher odds ratio means a more helpful feature for identifying phishing websites. From Table 4 we can see that features 2, 3, 4, 5, 6, 9, 11 and 12 have very high odds ratios and are identified as the most useful features in our proposed feature set. However, these eight features are not sufficient to detect all kinds of phishing websites. Therefore, we have also used the other features to improve

Table 2 Sample datasets

Feature                     Legitimate websites (1–4)   Phishing websites (1–4)
Total hyperlink             537, 54, 19, 635            27, 13, 38, 115
No hyperlink                0, 0, 0, 0                  0, 0, 0, 0
Internal hyperlink          484, 38, 19, 618            15, 2, 28, 2
External hyperlink          52, 14, 0, 12               11, 0, 9, 100
Null links                  1, 2, 0, 5                  0, 11, 0, 0
Internal/external CSS       1, 0, 0, 0                  1, 0, 1, 1
Internal redirection        47, 3, 0, 20                1, 0, 0, 0
External redirection        14, 5, 1, 6                 8, 0, 2, 11
Internal error              2, 0, 0, 1                  0, 0, 27, 0
External error              1, 0, 0, 0                  0, 0, 0, 84
Login form link             0, 0, 0, 1                  0, 1, 1, 0
Internal/external favicon   0, 0, 0, 0                  1, 0, 1, 0

Table 3 Confusion matrix

Prediction       True results: 1 (phishing)   True results: 0 (legitimate)
1 (phishing)     True positive rate           False positive rate
0 (legitimate)   False negative rate          True negative rate

Table 4 Coefficient and odds ratio of feature set

Feature              Coefficients   Odds ratio
Total hyperlinks     −0.017236      0.982911
No hyperlink         23.231230      1.23E+10
Internal hyperlink   2.327730       10.254638
External hyperlink   1.914151       6.781177
Null hyperlink       20.263021      6.31E+08
CSS                  2.515211       12.369218
Internal redirect    0.149838       1.161646
External redirect    0.826801       2.285995
Internal error       2.089872       8.083884
External error       0.222953       1.249762
Login form           5.445638       231.745272
Favicon              2.910151       18.359569
Intercept            −3.714364      0.024371
Table 5  Results of our proposed approach


Total dataset True positive rate/recall False positive rate True negative rate False negative rate Accuracy Precision f1 Score

2544 98.39% 1.52% 98.48% 1.61% 98.42% 98.80% 98.59%

the accuracy of the proposed approach. We have evaluated our dataset with tenfold cross-validation, which uses 90% of the data for training and 10% for testing. The TPR of the approach is 98.39%, and the FPR is 1.52%. In other words, 98.39% of phishing websites are caught by our approach, and 1.61% (false negatives) are missed. The accuracy, precision, and f1 score of our approach are 98.42%, 98.80%, and 98.59%, respectively, as presented in Table 5. We have also explored the area under the ROC (Receiver Operating Characteristic) curve to obtain a further metric of prediction quality. In our experiment, the area under the ROC curve for the phishing class is 99.6%, as shown in Fig. 7, which indicates that our approach has high accuracy in classifying websites correctly. Results of our approach on the different classifiers are presented in Fig. 8. The probability that a website is phishing is given in logistic regression by the following equation:

p = e^(b0 + b1x1 + b2x2 + … + bnxn) / (1 + e^(b0 + b1x1 + b2x2 + … + bnxn)) = 1 / (1 + e^(−(b0 + b1x1 + b2x2 + … + bnxn)))    (13)

Fig. 7 ROC curve of the logistic regression classifier (true positive rate vs. false positive rate)

In Eq. 13, p is the probability of the event occurring, x1, x2, …, xn are the values corresponding to each

Fig. 8 Evaluation results of our approach on the various classifiers

Classifier            True positive (%)  True negative (%)  Precision (%)  f1 measure (%)  Accuracy (%)
SMO                   96.91              96.86              97.53          97.22           96.89
Naive Bayes           95.8               95.79              96.67          96.23           95.79
Random Forest         96.85              98.03              98.43          97.63           97.37
SVM                   92.65              89.96              92.19          92.42           91.47
Adaboost              95.09              96.77              97.42          96.24           95.83
Neural Network        97.69              96.68              97.41          97.55           97.25
C4.5                  97.41              97.13              97.75          97.58           97.29
Logistic Regression   98.39              98.48              98.8           98.59           98.42

feature, and b0, b1, …, bn are the coefficients corresponding to each feature. In our experiment, we set the classification cut-off at 0.5, since at 0.5 the system achieves the maximum accuracy. If the score of a website is less than 0.5, the website is more likely to be genuine, and if it is greater than 0.5, the website is considered phishing.

In this paper, our primary objective is to design an approach which has high TPR and TNR, and low FPR and FNR. If the classification cut-off increases, the FPR decreases, but at the same time the TPR also decreases. Conversely, if we reduce the classification cut-off, the TPR increases, but the FPR increases as well. A good phishing detection approach requires both a high TPR and a low FPR.

5.5 Complexity of the proposed approach

Extracting the features from the source code of the webpage helps in reducing the processing time as well as the response time, making the approach more reliable and efficient. The computational complexity of the proposed approach depends on extracting and computing the proposed features. We need to obtain all hyperlinks from the webpage to compute the features. A regular expression covering all the ways in which hyperlinks can be present on the webpage is applied; every piece of text in the page source that matches the regular expression is identified as a hyperlink, and this matching runs in linear time, O(n), where n is the length of the source code of the webpage. A single-pattern matching algorithm (the Knuth–Morris–Pratt algorithm) is used to match the domain name of each hyperlink against the URL of the webpage. Moreover, the proposed method does not depend on any third party services, and hence it does not need to wait for results returned by such services.

5.6 Comparison with other machine learning based phishing detection methods

This experiment compares our proposed method with the existing machine learning based approaches given in the literature. The comparison is based on TPR, FPR, accuracy, third party independence, language independence, zero hour detection, and search engine independence. Table 6 presents the comparison of our approach with previous phishing detection methods. Search engine based techniques assume that a legitimate site appears in the top results of a search engine, although only popular sites appear in the top search results; therefore, we have not considered search engine based features. Moreover, most of the previous methods used datasets of famous sites, while we have also considered low ranked websites. Our approach gives an FPR of 1.52% for legitimate websites. Only the works of Garera et al. (2007), Whittaker et al. (2010) and Xiang et al. (2011) give an FPR lower than our approach, but their TPR and overall detection accuracy are very low compared to ours. The TPR of Garera et al. (2007) is 88%, i.e. this scheme fails to detect 12% of phishing websites, which is very high. Another important issue in the comparison is the language used in the website: only 52.1% of websites use the English language (Usage of content languages for websites 2017), and many approaches (Garera et al. 2007; Aburrous et al. 2010) depend on the textual language of the website. The proposed approach uses hyperlink specific features because they are very efficient and language independent. Some approaches (Aburrous et al. 2010; Montazera and ArabYarmohammadi 2015) cannot detect zero hour attacks because they are designed to detect special kinds of phishing websites; our approach, on the other hand, can detect all kinds of phishing websites. Moreover, most of the approaches use third party features, e.g. WHOIS lookup, DNS records, certifying authorities, etc.; their accuracy then depends on the results returned by the third party, and querying it is a time-consuming process. Therefore, we have not considered third party dependent features in our proposed approach.

Table 6 Comparison between various anti-phishing approaches based on results obtained

Approach                               TPR (%)  FPR (%)  Accuracy (%)  Search engine independent  Language independent  Zero hour detection  Third party independent
Pan and Ding (2006)                    88       29       84            Yes  No   Yes  No
Zhang et al. (2007)                    97       6        95            No   No   Yes  No
Garera et al. (2007)                   88       0.7      97.3          Yes  Yes  Yes  No
Aburrous et al. (2010)                 86.38    13.6     88.4          Yes  Yes  No   No
Whittaker et al. (2010)                91.85    0.0001   95.92         Yes  No   Yes  No
Xiang et al. (2011)                    92       0.4      95.8          No   No   Yes  No
He et al. (2011)                       97       4        96.5          No   No   Yes  No
Zhang et al. (2017)                    97       2        97.50         Yes  Yes  Yes  No
El-Alfy (2017)                         97.89    4.59     96.74         No   Yes  Yes  No
Montazera and ArabYarmohammadi (2015)  88       12       88            Yes  No   No   No
Our method                             98.39    1.52     98.42         Yes  Yes  Yes  Yes
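The decision rule of Sect. 5.4 — the logistic probability of Eq. 13 followed by the 0.5 cut-off — can be sketched as below. The two-feature coefficient vector at the end is a made-up illustration, not the trained model from Table 4.

```python
import math

def phishing_probability(coefficients, intercept, features):
    """Eq. 13: logistic (sigmoid) probability that a website is phishing."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(coefficients, intercept, features, cutoff=0.5):
    """Label 1 (phishing) when the probability exceeds the cut-off."""
    return 1 if phishing_probability(coefficients, intercept, features) > cutoff else 0

# Illustrative two-feature model: a site whose linear score z is
# positive gets probability > 0.5 and is labelled phishing.
b, b0 = [2.0, -1.5], -0.5
```

Raising the cut-off above 0.5 trades true positives for fewer false positives, and lowering it does the reverse, which is the trade-off discussed above.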
seem to be a perfect target for malicious attacks like mobile
phishing. Therefore, detecting the phishing websites in the
6 Discussion

With the rapid growth of e-commerce, e-banking, and social networking, phishing attacks are also growing day by day, resulting in enormous financial losses to industries and Internet users. Therefore, there is a need for an effective solution that detects phishing attacks with high accuracy and low response time. We proposed a novel anti-phishing approach, which includes various unique hyperlink specific features that have never been considered before. We implemented these hyperlink specific features with different machine learning algorithms, and found that logistic regression achieved the best performance. There are certain limitations to our proposed approach. The feature set of our phishing detection approach depends entirely on the source code of the website. We believe that attackers use the source code of the targeted legitimate website to construct the phishing website, modifying the login form handler to steal users' credentials. If a cybercriminal alters all the page resource references (i.e. images, CSS, favicon, JavaScript, etc.), then our approach predicts a false result as well. Also, if the attacker uses embedded objects (images, JavaScript, Flash, ActiveX, etc.) instead of the DOM to hide the HTML code from phishing detection approaches, then our technique may incorrectly classify the phishing website.

7 Conclusion and future work

In this paper, we have identified various new features for recognizing phishing websites. These features are based on the hyperlink information given in the source code of the website. We have used these features to train a logistic regression classifier, which achieved high accuracy in the detection of phishing and legitimate websites. One of the major contributions of this paper is the selection of hyperlink specific features which are extracted on the client side and do not depend on any third party services. Moreover, these features are sufficient to detect a website written in any language. The experimental results showed that the proposed method is very efficient in the classification of phishing websites, as it has a 98.39% true positive rate and 98.42% overall accuracy. Our proposed phishing detection approach depends entirely on the source code of the website; adding certain further features may increase the classification accuracy, but extracting features from third parties would increase the running time complexity of the scheme. In future work, we aim to design a system which can also detect non-HTML websites with high accuracy. Nowadays, mobile devices are very popular and seem to be a perfect target for malicious attacks like mobile phishing; therefore, detecting phishing websites in the mobile environment is a challenge for further research and development.

References

Abu-Nimeh S, Nappa D, Wang X, Nair S (2007) A comparison of machine learning techniques for phishing detection. In: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, Pittsburgh, pp 60–69
Aburrous M, Hossain MA, Thabatah F, Dahal K (2010) Intelligent phishing detection system for e-banking using fuzzy data mining. Expert Syst Appl 37(12):7913–7921
Alexa top websites (2018) http://www.alexa.com/topsites. Retrieved 22 Aug 2017
APWG H1 2017 Report (2017) http://docs.apwg.org/reports/apwg_trends_report_h1_2017.pdf. Retrieved 25 March 2018
Bhuiyan MZA, Wu J, Wang G, Cao J (2016) Sensing and decision making in cyber-physical systems: the case of structural event monitoring. IEEE Trans Ind Inform 12(6):2103–2114
El-Alfy E-SM (2017) Detection of phishing websites based on probabilistic neural networks and K-Medoids clustering. Comput J. https://doi.org/10.1093/comjnl/bxx035
Fan L, Lei X, Yang N, Duong TQ, Karagiannidis GK (2016) Secure multiple amplify-and-forward relaying with cochannel interference. IEEE J Select Topics Signal Process 10(8):1494–1505
Garera S, Provos N, Chew M, Rubin AD (2007) A framework for detection and measurement of phishing attacks. In: Proceedings of the 2007 ACM workshop on recurring malcode, Alexandria, pp 1–8
Geng G-G, Yang X-T, Wang W, Meng C-J (2014) A taxonomy of hyperlink hiding techniques. In: Asia-Pacific web conference, vol 8709, Lecture Notes in Computer Science. Springer, Suzhou, pp 165–176
Guava libraries, Google Inc. (2018) https://github.com/google/guava. Retrieved 18 Jan 2018
He M, Horng SJ, Fan P, Khan MK, Run RS, Lai JL, Sutanto A (2011) An efficient phishing webpage detector. Expert Syst Appl 38(10):12018–12027
Jain AK, Gupta BB (2016a) Comparative analysis of features based machine learning approaches for phishing detection. In: Proceedings of 3rd international conference on computing for sustainable global development (INDIACom). IEEE, New Delhi, pp 2125–2130
Jain AK, Gupta BB (2016b) A novel approach to protect against phishing attacks at client side using auto-updated white-list. EURASIP J Inf Secur 2016(9)
Jain AK, Gupta BB (2017a) Phishing detection: analysis of visual similarity based approaches. Secur Commun Netw. https://doi.org/10.1155/2017/5421046
Jain AK, Gupta BB (2017b) Two-level authentication approach to protect from phishing attacks in real time. J Ambient Intell Humaniz Comput, pp 1–14
Jain AK, Gupta BB (2017c) Towards detection of phishing websites on client-side using machine learning based approach. Telecommun Syst, pp 1–14. https://doi.org/10.1007/s11235-017-0414-0
Jsoup HTML parser (2018) https://jsoup.org/apidocs/org/jsoup/parser/Parser.html. Retrieved 20 Jan 2018
Kumaraguru P, Rhee Y, Acquisti A, Cranor LF, Hong J, Nunge E (2007) Protecting people from phishing: the design and evaluation of an embedded training email system. In: Proceedings of SIGCHI conference on human factors in computing systems, San Jose
Li J, Sun L, Yan Q, Li Z, Srisa-an W, Ye H (2018) Significant permission identification for machine learning based android malware detection. IEEE Trans Ind Inform
Lin Q, Li J, Huang Z, Chen W, Shen J (2018) A short linearly homomorphic proxy signature scheme. IEEE Access
List of online payment service providers (2018) http://research.omicsgroup.org/index.php/List_of_online_payment_service_providers. Retrieved 25 March 2018
Maio CD, Fenza G, Gallo M, Loia V, Parente M (2017) Time-aware adaptive tweets ranking through deep learning. Future Gener Comput Syst. https://doi.org/10.1016/j.future.2017.07.039
Maio CD, Fenza G, Gallo M, Loia V, Parente M (2018) Social media marketing through time-aware collaborative filtering. Concurr Comput Pract Exp 30(1)
Mohammad RM, Thabtah F, McCluskey L (2014) Predicting phishing websites based on self-structuring neural network. Neural Comput Appl 25(2):443–458
Montazera GA, ArabYarmohammadi S (2015) Detection of phishing attacks in Iranian e-banking using a fuzzy–rough hybrid system. Appl Soft Comput 35:482–492
Pan Y, Ding X (2006) Anomaly based web phishing page detection. In: Proceedings of 22nd annual computer security applications conference, Miami Beach, pp 381–392
Phishingpro Report (2016) http://www.razorthorn.co.uk/wp-content/uploads/2017/01/Phishing-Stats-2016.pdf. Retrieved 14 Oct 2017
Phishtank dataset (2018) http://www.phishtank.com. Retrieved 22 Aug 2017
Sheng S, Wardman B, Warner G, Cranor LF, Hong J, Zhang C (2009) An empirical analysis of phishing blacklists. In: Proceedings of the sixth conference on email and anti-spam, Mountain View
Stuffgate Free Online Website Analyzer (2018) http://www.stuffgate.com/. Retrieved 21 Jan 2018
Usage of content languages for websites (2017) https://w3techs.com/technologies/overview/content_language/all. Retrieved 22 Aug 2017
Varshney G, Misra M, Atrey PK (2016) A phish detector using lightweight search features. Comput Secur 62:213–228
Wang YG, Zhu G, Shi YQ (2018) Transportation spherical watermarking. IEEE Trans Image Process 27(4):2063–2077
Whittaker C, Ryner B, Nazif M (2010) Large-scale automatic classification of phishing pages. In: Proceedings of the network and distributed system security symposium, San Diego, pp 1–14
Xiang G, Hong J, Rose CP, Cranor L (2011) CANTINA+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur 14(2)
Zhang Y, Hong JI, Cranor LF (2007) CANTINA: a content-based approach to detecting phishing websites. In: Proceedings of 16th international world wide web conference (WWW2007), Banff, pp 639–648
Zhang W, Jiang Q, Chen L, Li C (2017) Two-stage ELM for phishing web pages detection using hybrid features. World Wide Web 20(4):797–813

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
