A Machine Learning Based Approach For Phishing Detection Using Hyperlinks Information
https://doi.org/10.1007/s12652-018-0798-z
ORIGINAL RESEARCH
Received: 11 December 2017 / Accepted: 14 April 2018 / Published online: 26 April 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018
Abstract
This paper presents a novel approach that detects phishing attacks by analysing the hyperlinks found in the HTML source code of a website. The proposed approach incorporates various new hyperlink specific features, divides them into 12 different categories, and uses them to train machine learning algorithms. We have evaluated the performance of the proposed phishing detection approach with various classification algorithms on a dataset of phishing and non-phishing websites. The proposed approach is an entirely client-side solution and does not require any third party services. Moreover, it is language independent and can detect phishing websites written in any textual language. Compared to other methods, the proposed approach has relatively high accuracy in the detection of phishing websites: it achieved more than 98.4% accuracy with the logistic regression classifier.
Keywords Cyber security · Phishing attack · Hyperlink · Social engineering · Website · Machine learning
2016 A. K. Jain, B. B. Gupta
sources like URL, search engine page source, website traffic, search engine, DNS, etc. The existing machine learning based methods extract features from third parties, search engines, etc. Therefore, they are complicated, slow in nature, and not fit for the real-time environment. Phishing websites are short-lived, and thousands of fake websites are generated every day. Therefore, there is a need for a real-time, fast and intelligent phishing detection solution.

1.4 Contributions

The following are the major contributions of our paper:

• The proposed approach extracts the outstanding features from the web browser only and does not depend on third party services (e.g. search engine, third party DNS, Certification Authority, etc.). Therefore, it can be implemented at the client side and provides better privacy.
• The proposed approach can identify "zero-hour" phishing attacks with high accuracy.
• The proposed approach can detect phishing websites written in any textual language.
• We have also conducted a sensitivity analysis to identify the most powerful features in the detection of phishing websites.

1.5 Experimental results

1.6 Outlines

The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 describes our proposed approach in detail. Section 4 presents the extraction of the various features used to train the machine learning algorithms. Section 5 presents the implementation details, evaluation metrics, and experimental results. Finally, Sect. 6 concludes the paper and presents future work.
A machine learning based approach for phishing detection using hyperlinks information 2017
Usually, machine learning based techniques compare the features of the suspicious website with a predefined feature set (Wang et al. 2018; Lin et al. 2018). Therefore, the accuracy of the scheme depends on the feature set and on how accurately a defender chooses the features (Maio et al. 2017, 2018).

3 Proposed approach

Figure 2 presents the system architecture of the proposed approach. The selection of an outstanding feature set is the major contribution of this paper. We have proposed six new features to improve the detection accuracy for phishing webpages. Our proposed features identify the relation between the webpage content and the URL of the webpage. Our features are based on the hyperlinks of the webpage. A website can be transformed into a Document Object Model (DOM) tree, which is used to extract the hyperlink features as shown in Fig. 3. In our approach, we have gathered the website hyperlink features automatically using a web crawler as shown in Fig. 4. In the hyperlink extraction process, the relative links are replaced by their hierarchically known absolute links. Our proposed approach takes the decision based on 12 features, namely total hyperlinks, no hyperlinks, internal hyperlinks, external hyperlinks, null hyperlinks, internal error, external error, internal redirect, external redirect, login form link, external/internal CSS and external/internal favicon.

In particular, features 2, 6, 7, 8, 9, 10 are novel and proposed by us. Features 1, 3, 4, 5, 11, 12 are taken from other approaches (Mohammad et al. 2014; Whittaker et al. 2010; Xiang et al. 2011; He et al. 2011). However, we fine-tuned these adopted features by performing various experiments to get better results. After extraction of these features, a feature vector is created for each website. We have constructed the training and testing datasets by extracting the 12 defined features from the phishing and non-phishing websites. The training phase generates a binary classifier from the feature vectors of the phishing and legitimate websites datasets. In the testing phase, the classifier determines whether a new site is a phishing site or not. A classifier takes the decision based on what it has learned from the labelled dataset. A binary classifier classifies the websites
into two possible categories, namely phishing and legitimate. When a user requests a new website, the crawler generates the feature values and the binary classifier identifies the given website.

4 Features extraction

The accuracy of a phishing detection scheme depends on the feature set which distinguishes phishing and legitimate websites. Based on the limitations of individual and third party dependent approaches given in Sect. 2, we have adopted hyperlink specific features in the proposed approach. These features are extracted on the client side and do not depend on any third party services. Here, F = {F1, F2, …, F12} is defined as the feature vector, with one component per feature. Some features produce a value of 1 or 0, where 1 indicates phishing and 0 indicates legitimate. We discuss all these features in the following subsections.

4.1 Total and no hyperlink feature (F1 and F2)

Phishing websites are small as compared to legitimate websites. A legitimate website usually contains many webpages. However, a phishing website consists of very limited
webpages, sometimes only one or two. Moreover, sometimes the phishing website does not provide any hyperlink because the attackers use hyperlink hiding techniques (Geng et al. 2014). Attackers also use server-side scripting and framesets to hide the source code of the webpage (Jain and Gupta 2016b). From our experiments, we observe that if a website is genuine, we can extract at least one hyperlink from the source code. Therefore, if the approach does not extract any link from the source code, the website is considered a phishing website (feature 2). Total hyperlinks are calculated by adding the href, link, and src tags. Taking the no hyperlink as a separate feature increases the true positive rate of the proposed approach.

$$F_1 = \text{total number of hyperlinks present in a website} \tag{1}$$

$$F_2 = \begin{cases} 0 & \text{if } F_1 > 0 \\ 1 & \text{if } F_1 = 0 \end{cases} \tag{2}$$

4.2 Internal and external hyperlinks (F3 and F4)

An internal hyperlink has the same base domain as the website, and an external hyperlink has a different base domain. The phishing website usually copies the source code from its targeted official website, and it may have many hyperlinks that point to the targeted website. In a legitimate website, most of the hyperlinks contain the same base domain, while in a phishing website many hyperlinks may contain the domain of the corresponding legitimate website. In our experiment, we found that out of 1428 phishing websites, 593 websites include direct hyperlinks to their official website. To set the internal hyperlink feature, we calculate the ratio of internal hyperlinks to the total links present in a website (Eq. 3), and if the ratio is less than 0.5 the feature is set to 1, else 0, as given in Eq. 4. Furthermore, to set the external hyperlink feature, we calculate the ratio of external hyperlinks to the total available links (Eq. 5), and if the ratio is greater than 0.5 the feature is set to 1, else 0, as represented in Eq. 6.

$$\mathit{Ratio}_{Internal} = \begin{cases} \dfrac{H_{Internal}}{H_{total}} & \text{if } H_{total} > 0 \\ 0 & \text{if } H_{total} = 0 \end{cases} \tag{3}$$

$$F_3 = \begin{cases} 0 & \mathit{Ratio}_{Internal} \geq 0.5 \\ 1 & \mathit{Ratio}_{Internal} < 0.5 \end{cases} \tag{4}$$

$$\mathit{Ratio}_{External} = \begin{cases} \dfrac{H_{External}}{H_{total}} & \text{if } H_{total} > 0 \\ 0 & \text{if } H_{total} = 0 \end{cases} \tag{5}$$

$$F_4 = \begin{cases} 0 & \mathit{Ratio}_{External} \leq 0.5 \\ 1 & \mathit{Ratio}_{External} > 0.5 \end{cases} \tag{6}$$

where $H_{Internal}$, $H_{External}$, and $H_{total}$ are the numbers of internal, external and total hyperlinks in a website. $\mathit{Ratio}_{Internal}$ and $\mathit{Ratio}_{External}$ are the ratios of internal and external hyperlinks to the total available hyperlinks.

4.3 Null hyperlink (F5)

In a null hyperlink, the href attribute of the anchor tag does not contain any URL. When the user clicks on a null link, it returns to the same page again. A legitimate website consists of many webpages; therefore, to behave like a legitimate website, the phisher places no values in hyperlinks, and the links appear active on the website. Phishers also exploit vulnerabilities of the web browser with the help of empty links (Jain and Gupta 2016b). The HTML code used for designing null hyperlinks is <a href="#">, <a href="#content">, <a href="javascript:void(0)">. To set the null hyperlink feature, we calculate the ratio of null hyperlinks to the total number of links present in a website, and if the ratio is greater than 0.34 the feature is set to 1, else 0. The following equations are used to calculate the null hyperlink feature.

$$\mathit{Ratio}_{Null} = \begin{cases} \dfrac{H_{Null}}{H_{total}} & \text{if } H_{total} > 0 \\ 0 & \text{if } H_{total} = 0 \end{cases} \tag{7}$$

$$F_5 = \begin{cases} 0 & \mathit{Ratio}_{Null} \leq 0.34 \\ 1 & \mathit{Ratio}_{Null} > 0.34 \end{cases} \tag{8}$$

where $H_{Null}$ and $H_{total}$ are the numbers of null and total hyperlinks in a website. $\mathit{Ratio}_{Null}$ is the ratio of null hyperlinks to the total hyperlinks present in the website.

4.4 Internal/external CSS (F6)

Cascading Style Sheets (CSS) is a language used for describing the formatting of a document and setting the visual appearance of a website written in HTML, XHTML, and XML. An attacker always tries to mimic the legitimate website and keeps the design of the phishing website the same as that of the targeted website to attract potential victims. Formally, a CSS contains a list of rules, which associate a group of selectors, properties, and values with a set of declarations. The CSS of any website is either included as an external CSS file or written within the HTML tags themselves. External CSS files are associated with an HTML website by using the <link> tag. To extract an external CSS file, we look for a tag of the form <link … rel="stylesheet" … href="URL of CSS file" …>. During the experiment, we found that a phishing website typically uses only one CSS file or an internal style, and this external CSS file contains the link of the targeted legitimate website, whereas several legitimate websites use more than one CSS file or internal style. We developed an algorithm to find suspicious CSS in a website, as shown in Fig. 5.
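The hyperlink-count features of Sects. 4.1 to 4.3 (Eqs. 1 to 8) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the list of extracted link values is assumed to be given, the base-domain test is approximated by comparing hostnames, and null links are counted as neither internal nor external, which the paper does not specify.

```python
from urllib.parse import urljoin, urlparse

# Null-link patterns listed in Sect. 4.3; the empty string is an added assumption.
NULL_HREFS = {"", "#", "#content", "javascript:void(0)"}


def is_null(href):
    """Return True if the href carries no real URL."""
    return href.strip().lower().replace(" ", "") in NULL_HREFS


def host(url):
    """Hostname of a URL, used here as a simplified stand-in for the base domain."""
    return (urlparse(url).hostname or "").lower()


def hyperlink_features(page_url, hrefs):
    """Compute F1..F5 from the raw link values found in one page."""
    total = len(hrefs)
    null_count = sum(1 for h in hrefs if is_null(h))
    # Relative links are replaced by absolute links before the domain comparison.
    resolved = [urljoin(page_url, h) for h in hrefs if not is_null(h)]
    internal = sum(1 for u in resolved if host(u) == host(page_url))
    external = len(resolved) - internal

    def ratio(count):
        # Eqs. 3, 5 and 7: the ratio is defined as 0 when there are no links.
        return count / total if total > 0 else 0.0

    return {
        "F1": total,                                # Eq. 1
        "F2": 0 if total > 0 else 1,                # Eq. 2
        "F3": 1 if ratio(internal) < 0.5 else 0,    # Eq. 4
        "F4": 1 if ratio(external) > 0.5 else 0,    # Eq. 6
        "F5": 1 if ratio(null_count) > 0.34 else 0, # Eq. 8
    }
```

For example, a page at a hypothetical `http://shop.example/login` with one relative link, two links to another host and one `#` link has an internal ratio of 0.25 and an external ratio of exactly 0.5, so F3 is 1 and F4 is 0.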
4.5 Internal and external redirection (F7 and F8)

Redirection indicates whether a website redirects to some other place. When a browser tries to open a URL which has been redirected, a webpage with a different URL opens. Sometimes URL redirection confuses users about which website they are surfing. Moreover, redirection may also take the user to a website which is bogus. In a phishing website, there may be some links that redirect to the corresponding legitimate domain, and sometimes the fake website also redirects to the legitimate one after the login form is filled in. In this paper, we consider only response codes 301 and 302 for URL redirection. We select both internal and external URL redirection in our feature set. In this feature, we calculate the ratio of hyperlinks which are redirecting. Internal Redirection (F7) is calculated by dividing the total internally redirected hyperlinks by the total internal hyperlinks. External Redirection (F8) is calculated by dividing the total externally redirected hyperlinks by the total external hyperlinks.

$$F_7 = \begin{cases} \dfrac{H_{i\text{-}redirect}}{H_{Internal}} & \text{if } H_{Internal} > 0 \\ 0 & \text{if } H_{Internal} = 0 \end{cases} \tag{9}$$

$$F_8 = \begin{cases} \dfrac{H_{e\text{-}redirect}}{H_{External}} & \text{if } H_{External} > 0 \\ 0 & \text{if } H_{External} = 0 \end{cases} \tag{10}$$

where $H_{i\text{-}redirect}$, $H_{e\text{-}redirect}$, $H_{Internal}$ and $H_{External}$ are the numbers of internal redirect, external redirect, total internal and total external hyperlinks present in the website.

4.6 Internal and external error (F9 and F10)

In this heuristic, we check the errors in the hyperlinks of the website. The error "404 not found" occurs when a user requests a URL and the server is not able to find the requested URL, i.e. when a user attempts to access a dead or broken link. Phishers also add some hyperlinks to the fake page which do not exist. We consider the 403 and 404 response codes of hyperlinks. The proposed approach uses a web crawler to fetch the response code of each hyperlink. Internal error (F9) is calculated by dividing the total internal error hyperlinks by the total internal hyperlinks. External error (F10) is calculated by dividing the total external error hyperlinks by the total external hyperlinks.

$$F_9 = \begin{cases} \dfrac{H_{i\text{-}error}}{H_{Internal}} & \text{if } H_{Internal} > 0 \\ 0 & \text{if } H_{Internal} = 0 \end{cases} \tag{11}$$

$$F_{10} = \begin{cases} \dfrac{H_{e\text{-}error}}{H_{External}} & \text{if } H_{External} > 0 \\ 0 & \text{if } H_{External} = 0 \end{cases} \tag{12}$$

where $H_{i\text{-}error}$, $H_{e\text{-}error}$, $H_{Internal}$ and $H_{External}$ are the numbers of internal error, external error, total internal and total external hyperlinks in a website.

4.7 Login form link (F11)

Phishing websites usually contain a login form to steal the credentials of Internet users. The personal information of the user is transferred to the attacker after the form on the fake website is filled in. The login form of a phishing website appears the same as that of the legitimate website. In this feature, we check the authenticity of login forms. In a legitimate website, the action field typically contains the URL of the current website. However, attackers use either a different domain (other than the visited domain), a null value (hyperlink in footer section) or a PHP file in the form action field of phishing websites (Jain and Gupta 2017c). The PHP file contains a script which saves the input data (e.g. user id or password) in a text file stored on the attacker's computer. The PHP file is usually named index.php, login.php, etc. We constructed an algorithm to check the authenticity of the login form, as shown in Fig. 6. The input of the algorithm is the URL of the suspicious website and the output is a value in {0, 1}: 0 for legitimate and 1 for phishing. If the hyperlink present in the action field is relative, then the system replaces it by the absolute link.
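The redirection and error ratios of Sects. 4.5 and 4.6 (Eqs. 9 to 12) reduce to counting response codes per link group. The sketch below assumes the crawler has already produced, for each hyperlink, a flag saying whether it is internal and the HTTP status code it returned; the function names and the input shape are hypothetical, and no network access is performed.

```python
REDIRECT_CODES = {301, 302}  # response codes treated as redirection (Sect. 4.5)
ERROR_CODES = {403, 404}     # response codes treated as errors (Sect. 4.6)


def safe_ratio(count, total):
    """Ratio that falls back to 0 when the denominator is 0, as in Eqs. 9-12."""
    return count / total if total > 0 else 0.0


def response_code_features(links):
    """links: iterable of (is_internal, status_code) pairs for one page."""
    links = list(links)
    internal = [code for is_internal, code in links if is_internal]
    external = [code for is_internal, code in links if not is_internal]
    return {
        "F7": safe_ratio(sum(c in REDIRECT_CODES for c in internal), len(internal)),
        "F8": safe_ratio(sum(c in REDIRECT_CODES for c in external), len(external)),
        "F9": safe_ratio(sum(c in ERROR_CODES for c in internal), len(internal)),
        "F10": safe_ratio(sum(c in ERROR_CODES for c in external), len(external)),
    }
```

Keeping the status codes as plain input separates the slow crawling step from the feature arithmetic, so the ratios can be checked without touching the network.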
Fig. 6 (fragment) Form action values checked by the login form algorithm:
a. action=""
b. action="#"
c. action="javascript:void(0)"
d. action="filename.php" // e.g. filename is the name of a PHP file
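A rough approximation of the login form check (F11) is sketched below. Fig. 6 is not fully recoverable here, so this is an assumption-laden illustration rather than the authors' algorithm: it flags a page when any form action matches one of the four patterns above or resolves to a different host. The class and function names are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Action values a-c from the fragment above, after lower-casing and trimming.
NULL_ACTIONS = {"", "#", "javascript:void(0)"}


class FormActionParser(HTMLParser):
    """Collects the action attribute of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or valueless action attribute is treated as empty.
            self.actions.append(dict(attrs).get("action") or "")


def login_form_feature(page_url, html):
    """Return 1 (phishing) or 0 (legitimate) for the login form feature."""
    parser = FormActionParser()
    parser.feed(html)
    page_host = urlparse(page_url).hostname
    for action in parser.actions:
        value = action.strip().lower()
        # Patterns a-d: null action or a PHP handler file.
        if value in NULL_ACTIONS or value.split("?")[0].endswith(".php"):
            return 1
        # Relative actions are made absolute before comparing hosts.
        if urlparse(urljoin(page_url, action)).hostname != page_host:
            return 1
    return 0
```

Treating every `.php` action as suspicious follows the text of Sect. 4.7 literally; a deployed system would likely need a finer rule, since legitimate sites also use PHP handlers.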
4.8 Internal/external favicon (F12)

A favicon is an image icon associated with a particular website. An attacker may copy the favicon of the targeted website. The favicon is an .ico file linked to a URL, and it is found in the link tag of the DOM tree. If the favicon shown in the address bar belongs to a website other than the current one, it is considered a phishing attempt. This feature takes two values, 0 (legitimate) and 1 (phishing). If the favicon belongs to the same domain, the feature is set to 0, else 1. Following HTML coding is used in designing of favicon.

4.9 Unused features

We have mentioned the usefulness and importance of the proposed features used in our approach. However, there are several other features, which are used by various existing approaches and are not appropriate for the proposed approach for the following reasons:

1. Search engine based features Various approaches have used search engine based features (Zhang et al. 2007; He et al. 2011; Varshney et al. 2016). These approaches verify the authenticity of the webpage by searching the URL, domain name, title keywords, most frequent word, website logo, etc. in a popular search engine (Google, Yahoo, Bing, etc.). In Zhang et al. (2007) and He et al. (2011) the presented approaches are based on the TF-IDF algorithm. In this case, if the algorithm extracts the wrong keywords, then the results are defective. Moreover, the rank given to a website determines its position in the list of the searched links. Newly published websites and isolated blogs which have no connection to the mainstream websites are pushed back in the search results. Furthermore, different search engines allow the search string to be specified in a particular way so that it gives the specific result the user is looking for, e.g. Google has special query patterns to search exact phrases, exclude a word, search a specific domain, search specifying a location, etc. If the search string that the user enters matches a special case, then the search results could be irrelevant, and in some cases the search engine may fail to produce results for such queries.

2. Third Party dependent features We have not chosen features which are dependent on third party services such as DNS, blacklist/whitelist, WHOIS record, certifying authority, search engine, etc. Third party dependent features make our approach dependent on the third party and create additional network delay, which can result in high prediction time. Moreover, the DNS database may also be poisoned.

3. URL based features Various approaches used URL features (Whittaker et al. 2010; Xiang et al. 2011; He et al. 2011; Zhang et al. 2017) (e.g. number of dots, presence of special "@", "#", "–" symbols, URL length, suspicious words in URL, position of Top-Level Domain, http count,
Brand name in URL, IP address, etc.). Nowadays phishers are changing the way they perform attacks, and these techniques cannot detect tiny URLs and Data URI based phishing websites, which are considered popular.

5 System design, implementation and results

This section presents the construction of the dataset, evaluation measures, implementation details, and results of the proposed anti-phishing approach. The detection of phishing websites is a binary classification problem where various features are used to train the classifier. This trained classifier is then used to classify a new website into the phishing or legitimate category.

5.1 Training dataset

True positive rate (TPR): measures the rate of phishing websites classified as phishing out of total phishing websites.

False positive rate (FPR): measures the rate of legitimate websites classified as phishing out of total legitimate websites.

False negative rate (FNR): measures the rate of phishing websites classified as legitimate out of total phishing websites.

True negative rate (TNR): measures the rate of legitimate websites classified as legitimate out of total legitimate websites.

Accuracy (A): measures the overall rate of correct prediction.

Precision: measures the rate of instances correctly detected as phishing with respect to all instances detected as phishing.

f1 Score: the harmonic mean of Precision and Recall.
…ing to each feature. The odds ratio is the ratio of the odds of an event in the positive class (phishing) to the odds of it happen… 12 have the very high odds ratio, and identify as the most use…

[Table residue: confusion matrix layout with the prediction against the classes 1 (phishing) and 0 (legitimate), reporting the true positive rate and false positive rate; cell values not recoverable]

[Table residue: sample feature values per website, with columns including No hyperlink, Internal error, External error, Login form link and Internal/external favicon; cell values not recoverable]

[Table residue: per-feature logistic regression values; the two recoverable rows are
Total hyperlinks    −0.017236    0.982911
No hyperlink        23.231230    1.23E+10]
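The two recoverable rows are consistent with the usual reading of logistic regression output, in which the odds ratio of a feature is the exponential of its coefficient. The short check below assumes, without confirmation from the source, that the first numeric column is the coefficient and the second is the odds ratio; the function name is hypothetical.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient: exp(b)."""
    return math.exp(coefficient)


# Recoverable rows of the table: (feature, first column, second column).
rows = [
    ("Total hyperlinks", -0.017236, 0.982911),
    ("No hyperlink", 23.231230, 1.23e10),
]

for name, coef, reported in rows:
    # Relative difference between exp(coefficient) and the reported value.
    rel_diff = abs(odds_ratio(coef) - reported) / reported
    print(f"{name}: exp(b) = {odds_ratio(coef):.6g}, relative difference = {rel_diff:.2%}")
```

Under this reading, a coefficient slightly below zero gives an odds ratio just under 1, while the large positive coefficient of the no hyperlink feature gives a very large odds ratio.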
[Fig. 7 ROC curve of logistic regression classifier (x-axis: False Positive Rate, y-axis: True Positive Rate)]

the accuracy of the proposed approach. We have evaluated our dataset with tenfold cross validation, which uses 90% of the data for training and 10% of the data for testing. The TPR of the approach is 98.39%, and the FPR is 1.52%. In other words, 98.39% of phishing websites are caught by our approach, and 1.61% (false negatives) will be missed. The accuracy, precision, and f1 score of our approach are 98.42, 98.80, and 98.59%, respectively, as presented in Table 5. We have also explored the area under the ROC (Receiver Operating Characteristic) curve to find a better metric of precision. In our experiment, the area under the ROC curve for phishing websites is 99.6% as shown in Fig. 7, which shows that our approach has high accuracy in classifying websites correctly. Results of our approach on different classifiers are presented in Fig. 8. The probability that a website is phishing in logistic regression is given by the following equation.

$$p = \frac{e^{b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n}}{1 + e^{b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n}} = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n)}} \tag{13}$$

In Eq. 13, p is the probability of the event occurring. x1, x2, …, xn are the values corresponding to each
[Fig. 8 data: results of the approach on different classifiers; chart axis ticks omitted]

Classifier            True positive (%)   True negative (%)   Precision (%)   f1 measure (%)   Accuracy (%)
SMO                   96.91               96.86               97.53           97.22            96.89
Naive Bayes           95.8                95.79               96.67           96.23            95.79
Random Forest         96.85               98.03               98.43           97.63            97.37
SVM                   92.65               89.96               92.19           92.42            91.47
Adaboost              95.09               96.77               97.42           96.24            95.83
Neural Network        97.69               96.68               97.41           97.55            97.25
C4.5                  97.41               97.13               97.75           97.58            97.29
Logistic Regression   98.39               98.48               98.8            98.59            98.42
feature and b0, b1, …, bn are the coefficients corresponding to each feature. In our experiment, we set the classification cut-off at 0.5, since at 0.5 the system achieves the maximum accuracy. If the score of the website is less than 0.5, then the website is more likely to be a genuine website, and if it is greater than 0.5, then the website is considered a phishing website.

In this paper, our primary objective is to design an approach which has high TPR and TNR, and low FPR and FNR. If the classification cut-off increases, then the FPR decreases, but at the same time the TPR also decreases. Furthermore, if we reduce the classification cut-off, then the TPR increases but the FPR increases as well. A good phishing detection approach requires both high TPR and low FPR.

5.5 Complexity of the proposed approach

Feature extraction from the source code of the webpage helps in reducing the processing time as well as the response time, hence making the approach more reliable and efficient. The computational complexity of the proposed approach depends on extracting and computing the proposed features. We need to obtain all hyperlinks from the webpage to compute the features. We use a regular expression that covers all the ways in which hyperlinks can be present on the webpage. Every text in the page source that matches the given regular expression is identified as a hyperlink, and this takes linear time, O(n), where n is the source code length of the webpage. A single pattern matching algorithm (i.e. the Knuth–Morris–Pratt algorithm) is used to match the domain name of hyperlinks with the URL of the webpage. Moreover, the proposed method is not dependent on any third party services, and hence it does not need to wait for the results returned by these services.

5.6 Comparison with other machine learning based phishing detection method

This experiment compares our proposed method with the existing machine learning based approaches given in the literature. The comparison is based on TPR, FPR, accuracy, third party independence, language independence, zero hour detection, and search engine independence. Table 6 presents the comparison of our approach with previous phishing detection methods. The search engine based techniques assume that a legitimate site appears in the top results of a search engine. However, only popular sites appear in the top search results. Therefore, we have not considered search engine based features. Moreover, most of the previous methods have used datasets of famous sites, while we have also considered low ranked websites. Our approach gives an FPR of 1.52% for legitimate websites. Only the works of Garera et al. (2007), Whittaker et al. (2010) and Xiang et al. (2011) give an FPR lower than our approach, but their TPR and overall detection accuracy are very low compared to our approach. The TPR of Garera et al. (2007) is 88%, i.e. this scheme fails to detect 12% of phishing websites, which is very high. Another important issue of comparison is the language used in the website. Only 52.1% of websites use the English language (Usage of content languages for websites 2017). Many approaches (Garera et al. 2007; Aburrous et al. 2010) are dependent on the textual language of the website. The proposed approach uses hyperlink specific features because they are efficient and language independent. Some of the approaches (Aburrous et al. 2010; Montazera and ArabYarmohammadi 2015) cannot detect zero hour attacks because these approaches are designed to detect a special kind of phishing website. On the other hand, our approach can detect all kinds of phishing websites. Moreover, most of the approaches use
the third party features, e.g. WHOIS lookup, DNS, certifying authority, etc., so the accuracy also depends on the result returned by the third party, and it is also a time consuming process. Therefore, we have not considered third party dependent features in our proposed approach.

6 Discussion

With the rapid growth of e-commerce, e-banking, and social networking, the phishing attack is also growing day by day. This results in an enormous amount of financial losses to industries and Internet users. Therefore, there is a need for an effective solution to detect phishing attacks which has high accuracy and low response time. We proposed a novel anti-phishing approach, which includes various unique hyperlink specific features that have never been considered. We implemented these hyperlink specific features on different machine learning algorithms, and found that logistic regression achieved the best performance. There are certain limitations of our proposed approach. The feature set of our phishing detection approach depends completely on the source code of the website. We believe that attackers use the source code from the targeted legitimate website to construct the phishing website and modify the login form handler to steal the user's credentials. If a cybercriminal alters all the page resource references (i.e. images, CSS, favicon, JavaScript, etc.), then our approach may produce a false result. Also, if the attacker uses embedded objects (images, JavaScript, Flash, ActiveX, etc.) instead of the DOM to hide the HTML coding from phishing detection approaches, then our technique may incorrectly classify the phishing websites.

7 Conclusion and future work

In this paper, we have identified various new features for identifying phishing websites. These features are based on hyperlink information given in the source code of the website. We have used these features to train a logistic regression classifier, which achieved high accuracy in the detection of phishing and legitimate websites. One of the major contributions of this paper is the selection of hyperlink specific features which are extracted on the client side and do not depend on any third party services. Moreover, these features are sufficient to detect a website written in any language. The experimental results showed that the proposed method is very efficient in the classification of phishing websites, as it has a 98.39% true positive rate and 98.42% overall accuracy. The accuracy of our approach may be improved by adding certain more features. Our proposed phishing detection approach depends completely on the source code of the website. Adding certain more features may increase the classification accuracy. However, extracting other features from third parties will increase the running time complexity of the scheme. In future work, we aim to design a system which can also detect non-HTML websites with high accuracy. Nowadays, mobile devices are more popular and seem to be a perfect target for malicious attacks like mobile phishing. Therefore, detecting phishing websites in the mobile environment is a challenge for further research and development.

References

Abu-Nimeh S, Nappa D, Wang X, Nair S (2007) A comparison of machine learning techniques for phishing detection. In: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, Pittsburgh, pp 60–69
Aburrous M, Hossain MA, Thabatah F, Dahal K (2010) Intelligent phishing detection system for e-banking using fuzzy data mining. Expert Syst Appl 37(12):7913–7921
Alexa top websites (2018) http://www.alexa.com/topsites. Retrieved 22 Aug 2017
APWG H1 2017 Report (2017) http://docs.apwg.org/reports/apwg_trends_report_h1_2017.pdf. Retrieved 25 March 2018
Bhuiyan MZA, Wu J, Wang G, Cao J (2016) Sensing and decision making in cyber-physical systems: the case of structural event monitoring. IEEE Trans Ind Inform 12(6):2103–2114
El-Alfy E-SM (2017) Detection of phishing websites based on probabilistic neural networks and K-Medoids clustering. Comput J. https://doi.org/10.1093/comjnl/bxx035
Fan L, Lei X, Yang N, Duong TQ, Karagiannidis GK (2016) Secure multiple amplify-and-forward relaying with cochannel interference. IEEE J Select Topics Signal Process 10(8):1494–1505
Garera S, Provos N, Chew M, Rubin AD (2007) A framework for detection and measurement of phishing attacks. In: Proceedings of the 2007 ACM workshop on recurring malcode, Alexandria, pp 1–8
Geng G-G, Yang X-T, Wang W, Meng C-J (2014) A taxonomy of hyperlink hiding techniques. In: Asia-Pacific web conference, vol 8709, Lecture Notes in Computer Science. Springer, Suzhou, pp 165–176
Guava libraries, Google Inc. (2018) https://github.com/google/guava. Retrieved 18 Jan 2018
He M, Horng SJ, Fan P, Khan MK, Run RS, Lai JL, Sutanto A (2011) An efficient phishing webpage detector. Expert Syst Appl 38(10):12018–12027
Jain AK, Gupta BB (2016a) Comparative analysis of features based machine learning approaches for phishing detection. In: Proceedings of 3rd international conference on computing for sustainable global development (INDIACom). IEEE, New Delhi, pp 2125–2130
Jain AK, Gupta BB (2016b) A novel approach to protect against phishing attacks at client side using auto-updated white-list. EURASIP J Inf Secur 2016(9)
Jain AK, Gupta BB (2017a) Phishing detection: analysis of visual similarity based approaches. Secur Commun Netw. https://doi.org/10.1155/2017/5421046
Jain AK, Gupta BB (2017b) Two-level authentication approach to protect from phishing attacks in real time. J Ambient Intell Humaniz Comput, 1–14
Jain AK, Gupta BB (2017c) Towards detection of phishing websites on client-side using machine learning based approach. Telecommun Syst, 1–14. https://doi.org/10.1007/s11235-017-0414-0
Jsoup HTML parser (2018) https://jsoup.org/apidocs/org/jsoup/parser/Parser.html. Retrieved 20 Jan 2018
Kumaraguru P, Rhee Y, Acquisti A, Cranor LF, Hong J, Nunge E (2007) Protecting people from phishing: the design and evaluation of an embedded training email system. In: Proceedings of SIGCHI conference on human factors in computing systems, San Jose
Li J, Sun L, Yan Q, Li Z, Srisa-an W, Ye H (2018) Significant permission identification for machine learning based android malware detection. IEEE Trans Ind Inform
Lin Q, Li J, Huang Z, Chen W, Shen J (2018) A short linearly homomorphic proxy signature scheme. IEEE Access
List of online payment service providers (2018) http://research.omicsgroup.org/index.php/List_of_online_payment_service_providers. Retrieved 25 March 2018
Maio CD, Fenza G, Gallo M, Loia V, Parente M (2017) Time-aware adaptive tweets ranking through deep learning. Future Gener Comput Syst. https://doi.org/10.1016/j.future.2017.07.039
Maio CD, Fenza G, Gallo M, Loia V, Parente M (2018) Social media marketing through time-aware collaborative filtering. Concurr Comput Pract Exp 30(1)
Mohammad RM, Thabtah F, McCluskey L (2014) Predicting phishing websites based on self-structuring neural network. Neural Comput Appl 25(2):443–458
Montazera GA, ArabYarmohammadi S (2015) Detection of phishing attacks in Iranian e-banking using a fuzzy–rough hybrid system. Appl Soft Comput 35:482–492
Pan Y, Ding X (2006) Anomaly based web phishing page detection. In: Proceedings of 22nd annual computer security applications conference, Miami Beach, pp 381–392
Phishingpro Report (2016) http://www.razorthorn.co.uk/wp-content/uploads/2017/01/Phishing-Stats-2016.pdf. Retrieved 14 Oct 2017
Phishtank dataset (2018) http://www.phishtank.com. Retrieved 22 Aug 2017
Sheng S, Wardman B, Warner G, Cranor LF, Hong J, Zhang C (2009) An empirical analysis of phishing blacklists. In: Proceedings of the sixth conference on email and anti-spam, Mountain View
Stuffgate Free Online Website Analyzer (2018) http://www.stuffgate.com/. Retrieved 21 Jan 2018
Usage of content languages for websites (2017) https://w3techs.com/technologies/overview/content_language/all. Retrieved 22 Aug 2017
Varshney G, Misra M, Atrey PK (2016) A phish detector using lightweight search features. Comput Secur 62:213–228
Wang YG, Zhu G, Shi YQ (2018) Transportation spherical watermarking. IEEE Trans Image Process 27(4):2063–2077
Whittaker C, Ryner B, Nazif M (2010) Large-scale automatic classification of phishing pages. In: Proceedings of the network and distributed system security symposium, San Diego, pp 1–14
Xiang G, Hong J, Rose CP, Cranor L (2011) CANTINA+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur 14(2)
Zhang Y, Hong JI, Cranor LF (2007) CANTINA: a content-based approach to detecting phishing websites. In: Proceedings of 16th international world wide web conference (WWW2007), Banff, pp 639–648
Zhang W, Jiang Q, Chen L, Li C (2017) Two-stage ELM for phishing Web pages detection using hybrid features. World Wide Web 20(4):797–813

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.