
Sådhanå (2020) 45:165 © Indian Academy of Sciences
https://doi.org/10.1007/s12046-020-01392-4

Efficient deep learning techniques for the detection of phishing websites
M SOMESHA*, ALWYN ROSHAN PAIS, ROUTHU SRINIVASA RAO and
VIKRAM SINGH RATHOUR

Information Security Research Lab, National Institute of Technology Karnataka, Surathkal 575025, India
e-mail: [email protected]; [email protected]; [email protected];
[email protected]

MS received 18 May 2019; revised 10 January 2020; accepted 12 February 2020; published online 27 June 2020

Abstract. Phishing is a fraudulent practice and a form of cyber-attack designed and executed with the sole
purpose of gathering sensitive information by masquerading as genuine websites. Phishers fool users by
replicating original, genuine content so that users reveal personal information such as security numbers, credit
card numbers, passwords, etc. Many anti-phishing techniques have been proposed to date, including blacklist- or
whitelist-based, heuristic-feature-based and visual-similarity-based methods. Modern browsers adapt to reduce the
chances of users getting trapped by a vicious agenda, but users still fall prey to phishers and end up revealing
their secret information. In a previous work, the authors proposed a machine learning approach based on heuristic
features for phishing website detection and achieved an accuracy of 99.5% using 18 features. In this paper, we
propose novel phishing URL detection models using (a) a Deep Neural Network (DNN), (b) Long Short-Term Memory
(LSTM) and (c) a Convolution Neural Network (CNN), using only 10 features of our earlier work. The proposed
technique achieves an accuracy of 99.52% for DNN, 99.57% for LSTM and 99.43% for CNN. The proposed models use
only one third-party service feature, making them more robust to third-party failures and faster at phishing
detection.

Keywords. Deep neural network (DNN); long short-term memory (LSTM); convolution neural network (CNN); recurrent neural network (RNN); phishing; heuristic technique; deep learning.

1. Introduction

The Internet has brought revolutionary transformations in social networking, communication, banking, marketing and service delivery, and the number of users of these Internet facilities is growing at a drastic rate. Communication technology keeps evolving to meet human needs; however, adversaries are also evolving to disrupt communication. These adversaries steal sensitive information by tricking users through malware or phishing websites. Phishing is one of the fraudulent practices of the cyber world: the phisher sends out a bait, a replica of a real website, and waits for users to fall prey. The phisher succeeds when a user becomes a victim by trusting the mimicked page. Researchers have recently given more attention to phishing attacks in order to avoid damage to innocent Internet users [1]. Surveys of such attacks are conducted by many consortia, such as NSFOCUS, the Anti-Phishing Working Group (APWG), etc.

APWG is a non-profit international consortium that analyses phishing attacks reported by its members, including security product and service-oriented organizations, law enforcement agencies, government agencies, trade associations, regional international treaties and communications organizations such as BitDefender, Symantec, McAfee, VeriSign, etc. APWG publishes statistical reports on phishing trends across cyberspace periodically (quarterly or half-yearly). According to the latest APWG report [2], 263,538 phishing attacks were reported, a growth of 46% compared with the fourth quarter of 2017. There are many ways in which attackers can design phishing attacks; the medium of attack may be e-mail, websites or malware.

In e-mail phishing, a spoofed e-mail is sent to the intended users, ostensibly by some trusted company or organization. In website phishing, phishers build websites that are replicas of real sites and advertise them on other websites or through technology giants such as Facebook, Twitter, Google, etc. Some phishing sites also use security indicators such as Hypertext Transfer Protocol Secure (HTTPS) [2] and the green padlock, which makes it difficult for users to differentiate between real and fake sites.

There are many methods proposed for detecting and preventing phishing. These techniques can be summarized as follows.

• Listing-based detection: Most browsers, such as Chrome, Mozilla and Opera, maintain lists of blocked and permitted Uniform Resource Locators (URLs). The database of blocked URLs is termed a blacklist and that of permitted URLs a whitelist. With whitelists, even a legitimate site that has no entry in the database may be blocked from browser access. Blacklist-based methods work the opposite way: instead of maintaining a database of legitimate URLs, they keep a database of phishing URLs. This method fails when it encounters phishing sites that are not even a day old, known as zero-day phishing sites, and it may be bypassed with slight URL changes. The list must be updated frequently, which is demanding given the rising number of phishing attacks.
• Heuristic-feature-based detection: This technique is based on features extracted from phishing sites and is used to detect and prevent phishing attacks. Its limitation is that the heuristic features are not guaranteed to exist in all phishing sites, which may reduce detection rates. The technique can also be easily bypassed if the detection algorithms or features are known in advance.
• Visual-similarity-based detection: Attackers mimic target websites using favicons, screenshots, background images and logos so that a user is easily tricked. Many techniques [3–6] use databases of logos, screenshots, favicons and Document Object Models (DOM) of target websites for similarity computation with suspicious sites. If the similarity score is higher than a certain threshold, the suspicious site is considered to have mimicked a legitimate site and is declared phishing. Phishers can easily bypass this defence with a slight change in visual elements, without changing the content.
• Conventional machine-learning-based detection: One of the main problems of heuristic detection is that it is not flexible enough to accommodate changes in phishing sites; even minor changes can bypass detection. Machine learning gives the heuristic model the flexibility to accommodate such changes: datasets representing the values of heuristically extracted features are prepared to train a machine learning model. Some of the algorithms used are Support Vector Machine Decision Tree (SVM-DT), Random Forest (RF), Sequential Minimum Optimization (SMO), Principal Component Analysis Random Forest, J48 tree, Multilayer Perceptron, etc. These algorithms can detect even zero-day phishing attacks when trained with heuristic model features. If the training data is vast, these algorithms perform even better, as they learn most of the possible variations that phishing sites may have. Rao and Pais [1] achieved an accuracy of about 99.5% in detecting phishing sites using machine learning techniques. Also, according to a recent survey [7], detection of phishing sites with an accuracy of more than 99% can be achieved using machine learning. The performance of a machine learning algorithm depends on the size of the training data, the quality of the extracted features and the values of the hyperparameters used to optimize the accuracy.
• Deep-learning-based detection: Deep learning is a machine learning technique that learns features directly from data; the data may be images, text or sound. Deep learning requires a large amount of labelled data, and Graphical Processing Units (GPUs) make it possible to train deep networks in less time. Recent work exploits Deep Neural Network (DNN) techniques such as multi-layer feed-forward networks [8], Convolutional Neural Networks (CNN) [9] and Recurrent Neural Networks (RNN) [10] to detect and prevent phishing attacks. These networks are trained on multi-featured datasets obtained using heuristic methods. Bahnsen et al [10] trained an RNN over the URL character sequence. They argued that character sequences have correlations, i.e. nearby characters in a URL are likely to be connected; these sequential patterns are important because they can be used to improve predictor performance. Le et al [9] used CNN to learn sequential URL behaviour. They adopted two techniques, character-level CNN and word-level CNN, which identify unique characters and words. Each character or word is represented as a vector, and the vectors are trained over a CNN to learn the sequential behaviour of the URL and identify phishing URLs.

The robustness of machine learning algorithms trained over datasets of heuristic feature values has led to many proposed methods for dealing with phishing sites. Many works [7, 11–13] use third-party services such as Google or Bing results, Alexa ranking (https://www.alexa.com/topsites) and WHOIS (https://www.whois.com) to detect phishing. However, some phishing websites hosted on compromised domains bypass even such techniques. According to APWG's report [14], most phishing sites do not last a day, but phishing sites hosted on a compromised site may live for more than a day. The same statistics point to the limitations of existing techniques, which have led to the drastic increase in phishing over the years. Therefore, there must be a mechanism that prevents phishing attacks with higher accuracy and

minimizes the use of third-party services, using minimal features. The heuristic method captures specific and compelling features that are sufficiently robust to detect even zero-day phishing. These methods were used to extract the required features for training our multi-layer DNN, Long Short-Term Memory (LSTM) network and CNN. We also tuned the hyperparameters of these networks to obtain the best possible accuracy with minimal features.

The previous work [1] used RF and its variations as classifiers with a rich feature set for the classification of phishing sites. In the current work, we use deep learning algorithms, namely CNN, LSTM and DNN, for the detection of phishing websites. We also use an information gain (IG) algorithm to select the best performing features among our proposed features and use them to classify phishing websites. The feature selection reduced the feature set from 18 to 10, with lower dependence on third-party services, while achieving the same accuracy as the previous work. Of these 10 features, 6 are existing features proposed by others and 4 were proposed in the earlier work.

Our paper makes the following research contributions:

1. We have proposed the IG algorithm to select the best performing features for phishing URL detection.
2. We have proposed novel DNN-, LSTM- and CNN-based models for phishing URL detection with only 10 features.
3. The proposed models achieve a promising accuracy of 99.52%, 99.57% and 99.43% for DNN, LSTM and CNN, respectively.

The rest of the paper is organized as follows. In section 2, we discuss related work carried out by different researchers using different techniques and algorithms. Section 3 explains the architecture of the proposed work. Section 4 deals with the implementation of the proposed model, with the tools and datasets used. In section 5, we discuss the results of the individual methods, with their efficiency and accuracy at different test levels of the model. In section 6, we list the limitations of our work, and finally, we conclude the paper in section 7.

2. Related work

The proposed model fits into the deep-learning-based category, in which less work has been done; it is inspired by existing list-based (whitelist [15], blacklist [16, 17]), heuristic [18–22] and machine learning methods [23–27]. These methods were discussed in section 1 and in the previous work [1]. In this section, we discuss some of the latest works on deep-learning- and machine-learning-based techniques.

Marchal et al [28] proposed a client-side application that extracts features mainly from the URL and content of the website, resulting in a 210-feature vector. The authors used a Gradient Boosting algorithm for the classification of phishing sites and achieved a significant detection rate. The use of such a large feature vector may add significant time for the feature extraction and classification of URLs. Sahingoz et al [29] proposed a phishing detection model based on URLs using a machine learning approach. The authors applied 7 different classification algorithms to Natural Language Processing (NLP)-based features for the classification of phishing URLs. The experimental results demonstrated that the RF algorithm with NLP-based features achieved a significant accuracy of 97.98%.

Li et al [30] proposed a stacking model combining the Gradient Boosting Decision Tree, XGBoost and LightGBM algorithms for detecting phishing web pages. The authors extracted features from the URL and Hypertext Markup Language (HTML) of the suspicious website; the feature vector contains 8 URL-based and 12 HTML-based features. The vector was fed to the stacked model for classification, achieving an accuracy of 97.30%. Jain & Gupta [31] proposed a client-side technique that uses features from the URL and source code of the suspicious site for classification. They applied five machine learning algorithms to identify the best classifier for their dataset; RF outperformed the other classifiers with an accuracy of 99.09%.

Yang et al [32] proposed phishing website detection based on multidimensional features driven by deep learning. The authors propose a direct URL approach in which character sequence features are extracted from the URLs using a dynamic category decision algorithm, and deep learning is applied for the classification of websites. Features are extracted from the URL, webpage code and webpage text, and combined into a multidimensional feature set. This feature set is fed to a CNN–LSTM model for the detection of phishing sites, achieving an accuracy of 98.99%. El-Alfy [33] proposed detecting phishing websites with probabilistic neural networks and K-medoids clustering. This framework combines unsupervised and supervised algorithms for training the nodes; the K-medoid technique uses feature selection or transformation, and component analysis is used to reduce the space dimensionality. The technique achieved 96.79% accuracy considering 30 features.

Zhang et al [24] proposed SMO for the detection and classification of Chinese phishing e-business websites. To evaluate the model, they used 15 unique and some generic domain-specific features, and applied 4 different machine learning algorithms for the classification of phishing sites. Among the 4 algorithms, SMO performed best, detecting phishing sites with an accuracy of 95.83%. The disadvantage of this approach is that it works well only on Chinese websites. Bahnsen et al [10] proposed an RNN for the classification of phishing URLs using LSTM.

The authors compared the traditional RF machine learning algorithm to LSTM with 3-fold cross-validation. RF used 14 features for URL statistical and lexical analysis, with an accuracy of 93.5%, whereas the RNN working on direct URLs performed better than the RF classification algorithm, with 98.7% accuracy, without requiring labour-intensive and time-consuming manual feature extraction.

Le et al [9] use a deep learning model to detect phishing URLs. They use the URLNet framework to learn a nonlinear URL embedding for malicious URL detection directly from the URL. To learn the URL embedding, URLNet applies CNNs to both the characters and the words of the URL string. The proposed method has similar accuracy at word level and character level, and performs much better than other methods. It may fail, however, if the phishing sites are represented with short URLs (bitly, goo, tiny, etc.) or data URLs.

Zhao et al [34] proposed a Gated Recurrent Neural Network model and showed that the Gated Recurrent Unit (GRU) outperformed an RF classifier with 21 features, achieving 2.1% better efficiency than RF, i.e., 98.5%. However, only URLs are used as the dataset, and all characters must be transformed into vectors to learn hidden patterns; hence, the GRU needs more time to train and requires the system architecture to be optimized for better performance. Mohammad et al [35] proposed a model to predict phishing sites based on self-structuring neural networks. They used 17 features extracted from the URL and source code of the website; these features are used to classify websites with artificial neural networks. This model must be regularly retrained with up-to-date training datasets.

Feng et al [36] proposed a novel classification model for detecting the legitimacy of a given website. They used the Monte Carlo algorithm [38] for training the model and the risk minimization principle to avoid over-fitting. They adopted 30 features from the UCI repository (http://archive.ics.uci.edu/ml/index.php) and achieved an accuracy of 97.71%. Yi et al [37] proposed a deep learning framework with two types of feature sets, namely original and interaction features. The original features are extracted from URL analysis, i.e. the presence of special characters (@, _, Unicode), the count of dots and the age of the domain. The interaction features are extracted from the source code of the website, i.e. in-degree, out-degree, frequency of accessing the URL and cookie absence. A Deep Belief Network (DBN) applied to the extracted features achieved a 90% true positive rate and a 0.6% false positive rate.

The related works with deep learning classifiers are summarized in table 1. The table compares the proposed method with the other approaches using six metrics: detection of phishing sites that replace textual content with an image (image-based phishing), detection of phishing sites in which most hyperlinks are directed towards a common page (common page detection), detection of phishing sites hosted in any language (language independence), detection of phishing sites with a maximum number of broken links (broken links), the models used and the number of features used for the classification of phishing sites.

3. Proposed work

The goal of this work is to detect the status of a given URL using minimal distinctive features with deep learning classifiers. The architecture of the proposed system is shown in figure 1. It comprises feature extraction, feature selection and classification stages. A set of webpage URLs is fed as input to the feature extractor, which extracts the required features from three sources (URL obfuscation, hyperlink-based and third-party-based). The extracted features are then fed to the IG feature ranking algorithm, whose outcome helps in selecting the best performing features through a clear investigation of the dependences. The best performing 10 features are then used to train the different deep learning models, which output the status of the URL as legitimate or phishing.

A detailed description of the individual models follows.

3.1 Feature extraction

The features are extracted from three sources:

• URL obfuscation features,
• hyperlink-based features and
• third-party-based features.

These features are extracted using Selenium with Python and BeautifulSoup, an HTML parser, for parsing the websites. The selection of prominent features from the extracted features is carried out using the IG mechanism. The IG for the features proposed by Rao and Pais [1] is given in table 2.

3.1.1 URL obfuscation features: These are characteristics that can be extracted from the URL itself; they involve neither the website content nor third-party services. Before defining the different URL-based features, we have to understand the typical URL anatomy. A URL is a specific Uniform Resource Identifier (URI) used to locate existing resources on the Internet; it is used when a web client requests the server for resources such as HTML, CSS, images, videos or other hypermedia. A URL usually consists of four or five components. The typical structure of a URL is http://www.reg.signin.nitk.com.pk/secure/login/web/index.php. It consists of the following parts.

Table 1. Summary of related work in comparison with proposed work.

Techniques | Image-based phishing | Common page-based phishing | Language independence | Broken links | Models | Features
Rao and Pais [1] | Yes | Yes | Yes | Yes | Machine learning | 18
Zhang et al [24] | No | No | No | No | SMO | 15
El-Alfy [33] | Yes | Yes | Yes | No | PNN K-medoid clustering | 30
Zhao et al [34] | No | No | Yes | No | Gated Recurrent Neural Network | Direct URLs
Mohammad et al [35] | No | No | Yes | No | Neural Network | 17
Le et al [9] | No | No | Yes | No | Convolution Neural Network | Direct URLs
Bahnsen et al [10] | No | No | Yes | No | Recurrent Neural Network | Direct URLs
Yang et al [32] | No | No | Yes | No | CNN–LSTM Hybrid Network | Direct URLs
Feng et al [36] | Yes | Yes | Yes | No | Neural Network | 30
Yi et al [37] | Yes | No | Yes | Yes | Deep Learning DBN | 8
Proposed Model-I | Yes | Yes | Yes | Yes | Deep Learning DNN | 10
Proposed Model-II | Yes | Yes | Yes | Yes | Deep Learning LSTM | 10
Proposed Model-III | Yes | Yes | Yes | Yes | Deep Learning CNN | 10

Figure 1. Architecture of proposed model.

Table 2. Information gain of individual features.

Feature (from Rao and Pais [1]) | Information gain
UF1 – Dots in hostname | 0.0874
UF2 – URL with @ symbol | 0.00797
UF3 – Length of URL | 0.28293
UF4 – Presence of IP | 0.00523
UF5 – Presence of HTTPS | 0.07321
TF1 – Age of domain | 0.29139
TF2 – Page rank | 0.88344
TF31 – Website in search engine results: title | 0.15664
TF32 – Website in search engine results: copyright | 0.16603
TF33 – Website in search engine results: description | 0.27909
HF1 – Frequency of domain in anchor links | 0.21588
HF2 – Frequency of domain in CSS links, image links and script links | 0.04654
HF3 – Common page detection ratio in website | 0.40058
HF4 – Common page detection ratio in footer | 0.29128
HF5 – Null links ratio in website | 0.25015
HF6 – Null links ratio in footer | 0.08162
HF7 – Presence of anchor links in website | 0.14237
HF8 – Broken links ratio | 0.20216
• Scheme: identifies the protocol used, Hypertext Transfer Protocol (HTTP) or HTTP with Secure Sockets Layer (HTTPS).
• Hostname: identifies the machine that holds the resources. The hostname includes the generic top-level domain (gTLD) and the country-code top-level domain (ccTLD). In the given example, reg.signin indicates the subdomain, nitk is the primary domain, com is the gTLD and pk is the ccTLD.
• Path: identifies the resource in the host that the web client wants to access. In the given URL example, the pathname is secure/login/web/index.php.
• Query string: when a query string is used, it follows the path component and provides a string of information that the resource can use for some purpose. The query string is usually a string of name and value pairs, separated by ampersands (&). For example, in the URL http://www.google.co.uk/search?q=url&ie=utf-8, the query string is q=url&ie=utf-8, with name and value pairs q = url and ie = utf-8.
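These components map directly onto the URL obfuscation features selected in section 3.1.1 below (UF1, UF3, UF5). The following minimal Python sketch illustrates the idea; the function and field names are ours, not the paper's, and any thresholds are omitted:

```python
from urllib.parse import urlparse

def url_obfuscation_features(url: str) -> dict:
    """Illustrative extraction of the three retained URL features
    (UF1, UF3, UF5); exact definitions follow Rao and Pais [1]
    only in spirit."""
    parsed = urlparse(url)
    host = parsed.netloc.split(":")[0]           # drop any port
    return {
        "UF1_dots_in_hostname": host.count("."),  # many dots hint at fake subdomains
        "UF3_url_length": len(url),               # long URLs often hide the real domain
        "UF5_uses_https": int(parsed.scheme == "https"),
    }

print(url_obfuscation_features(
    "http://www.reg.signin.nitk.com.pk/secure/login/web/index.php"))
```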
In the earlier work [1], five URL obfuscation features (UF1, UF2, UF3, UF4, UF5) were proposed. Out of those five, two perform poorly under the IG algorithm, as shown in table 2. The best performing three features have been selected:

1. UF1: dots in hostname.
2. UF3: lengthy URL.
3. UF5: presence of HTTPS.

3.1.2 Hyperlink-based features: These features are extracted from the hyperlinks in the source code of a website. A hyperlink is an electronic document element that connects one source to another; the web source may be an image, a program, an HTML document or an HTML document element. As mentioned earlier, the previous work [1] used eight hyperlink-based features for phishing detection. We have selected the six best performing among the eight features, based on the results of the IG analysis shown in table 2:

1. HF1: presence of domain in anchor links.
2. HF2: frequency of domain in image links.
3. HF3: common page detection ratio.
4. HF4: common page detection ratio in footer.
5. HF7: presence of anchor links in HTML body.
6. HF8: broken links ratio.

It may be observed that HF5 and HF6 performed well in IG, but they have been eliminated since HF3 and HF4 capture their characteristics [22].
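As an illustration of how such features can be computed from page source with BeautifulSoup, the parser used in this work, the sketch below approximates HF1 and HF3; the exact definitions in Rao and Pais [1] may differ in detail:

```python
from collections import Counter
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

def hyperlink_features(html: str, page_url: str) -> dict:
    """Illustrative versions of two hyperlink features (HF1, HF3)."""
    soup = BeautifulSoup(html, "html.parser")
    own_domain = urlparse(page_url).netloc
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    if not hrefs:
        return {"HF1_own_domain_ratio": 0.0, "HF3_common_page_ratio": 0.0}
    resolved = [urljoin(page_url, h) for h in hrefs]
    domains = [urlparse(u).netloc for u in resolved]
    # HF1: fraction of anchor links pointing to the page's own domain;
    # phishing pages often link mostly to the mimicked (foreign) domain.
    hf1 = domains.count(own_domain) / len(domains)
    # HF3: fraction of links taken by the single most common link target.
    hf3 = Counter(resolved).most_common(1)[0][1] / len(resolved)
    return {"HF1_own_domain_ratio": hf1, "HF3_common_page_ratio": hf3}
```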
3.1.3 Third-party-based feature: We use third-party services such as WHOIS, Alexa and search engines for the extraction of third-party-based features. Out of these three, the Alexa rank-based feature performed significantly better in the IG analysis. Even though the other third-party features performed better than the other (URL obfuscation, hyperlink-based) features, we have not considered them in our feature selection, in order to reduce the dependence on third-party services.

TF2: Alexa ranking is a third-party service used to classify phishing sites. The rationale behind this feature is that phishing sites are ranked low, while target websites are ranked high. The feature checks the rank of a suspicious website in the Alexa database: an HTTP request is sent to http://data.alexa.com/data?cli=10&url=<domain> and an XML parser is used to read the Alexa rank,

\mathrm{Pagerank} = \begin{cases} 0 & \text{if rank is not found} \\ \text{rank} & \text{otherwise.} \end{cases}

The features selected from these three sources are marked in table 3.

Table 3. Selected features.

Feature (from Rao and Pais [1]) | Selected (✓)
UF1 | ✓
UF2 | –
UF3 | ✓
UF4 | –
UF5 | ✓
TF1 | –
TF2 | ✓
TF31 | –
TF32 | –
TF33 | –
HF1 | ✓
HF2 | ✓
HF3 | ✓
HF4 | ✓
HF5 | –
HF6 | –
HF7 | ✓
HF8 | ✓
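The TF2 lookup described above can be sketched as follows; the endpoint and XML element names reflect the historical Alexa data service (since retired), and the helper name is ours:

```python
import urllib.request
import xml.etree.ElementTree as ET

def alexa_rank(domain: str) -> int:
    """Illustrative TF2 lookup: query the (retired) Alexa data endpoint
    and parse the POPULARITY rank from the XML reply. Returns 0 when no
    rank is found, matching the Pagerank definition above."""
    url = f"http://data.alexa.com/data?cli=10&url={domain}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            tree = ET.parse(resp)
        node = tree.find(".//POPULARITY")
        return int(node.get("TEXT")) if node is not None else 0
    except Exception:
        return 0   # treat network/parse failures as "rank not found"
```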

3.2 Feature selection

We have used IG as the ranking criterion to score the features and, by applying a threshold, we have filtered out the prominent ones. The intuition behind the ranking is to evaluate the relevance of the features for the detection of phishing websites. Relevance implies that features may be mutually exclusive of one another, but must not be completely independent of the class labels: there must exist a relation between a feature and the class labels. Features that are irrelevant, having no relation or an abysmal role, can be discarded. Hence, our primary aim in using this technique is to rank features based on their relevance to, and influence on, the class labels, which can then be used in the feature reduction process. IG [39] is measured based on the entropy of a system. Entropy is defined as the degree of disorder and impurity in the system; IG is defined as the reduction in impurity, bringing more certainty to the system. For feature ranking purposes, we calculated IG on the entire dataset. In summary, IG looks at each feature in isolation and is calculated for each feature independently; by computing the IG of each feature independently, we get a quantitative measure of the significance and relevance of that feature to the class labels. The computation of information for a feature involves two steps.

1. Compute the entropy of the class label for the entire dataset:

\mathrm{info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (1)

where m = 2 is the total number of unique class labels (phishing, legitimate) in our dataset and D represents a feature of the dataset. Each feature has some instances belonging to one class and the rest to the other class; p_i is the probability that an instance of D belongs to the ith class, computed by counting the number of instances of D that belong to the ith class and dividing by the total number of instances of D. Once we have p_i for all i, Eq. (1) gives the entropy of D.

2. Compute the conditional entropy for each unique value of that feature. This requires a frequency count of the class label by feature value; feature values can be continuous as well as discrete.

i. For discrete-valued features, it is calculated by the formula

\mathrm{info}_A(D) = \sum_{i=1}^{v} \frac{|D_i|}{|D|}\, \mathrm{info}(D_i) \qquad (2)

where v is the total number of unique discrete values present in the feature, D_i represents the instances with the ith value and |D| is the total count of feature values.

ii. For continuous-valued features, we sort the feature values and divide them into \sqrt{n} bins, where n is the total count of feature values. Each bin can then be treated as a discrete value, and Eq. (2) is used to calculate the conditional entropy of continuous features.

The information gain (IG) is then calculated using the following equation:

\mathrm{IG}(A) = \mathrm{info}(D) - \mathrm{info}_A(D). \qquad (3)

The IG values for the features proposed by Rao and Pais [1] are given in table 2. The classification methodologies are discussed in detail in section 4.3, along with the formal descriptions of DNN, LSTM and CNN.
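A compact sketch of this two-step computation, assuming discrete feature values (continuous ones would first be binned as in step 2.ii); the helper names are ours, as the paper does not publish code:

```python
import math
from collections import Counter

def entropy(labels):
    """Eq. (1): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Eqs. (2)-(3): IG of one discrete-valued feature w.r.t. the labels."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in by_value.values())  # info_A(D)
    return entropy(labels) - cond                                      # IG(A)

# toy example: a feature that separates the two classes fairly well
f = [1, 1, 1, 0, 0, 0, 1, 0]
y = ["phish", "phish", "phish", "legit", "legit", "legit", "legit", "phish"]
print(round(information_gain(f, y), 4))   # about 0.1887
```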
4. Implementation

Given a list of website URLs, we have trained and cross-validated the proposed deep-learning-based models to identify each URL as legitimate or phishing. We used the Selenium library in Python with the Firefox web driver to take screenshots of website URLs and to download the source code, and Beautiful Soup in Python to parse the source code and extract the required features. Screenshots and status codes are further used to verify that the contents have not changed while features were being extracted from the source code. The extracted datasets are further examined manually; duplicate and unwanted URLs (neither phishing nor legitimate) are removed from the PhishTank dataset. This avoids legitimate sites being treated as phishing and reduces the processing time by avoiding unwanted comparisons.

4.1 Tools used

We implemented Python (3.6) scripts to extract all features from the URL and URL content. We collected phishing URLs from the PhishTank website (http://www.phishtank.com/index.php) and legitimate sites from the Alexa database. When these URLs are fed as input to the Python scripts, all the essential features are extracted and stored in text files. The extracted features are then passed to the deep learning algorithms for training and cross-validation, so that they can classify URLs into legitimate and phishing sites. We implemented the deep learning algorithms with TensorFlow, an open-source machine learning framework built on top of Python that supports parallel computing.

4.2 Datasets used

We used the dataset of Rao and Pais [1] for all the experiments in this paper. The dataset consists of 3526 instances, of which 2119 are phishing sites collected from PhishTank and 1407 are legitimate sites collected from the Alexa database. They were divided into training and testing sets of 75% and 25%, respectively, for model evaluation.
4.3 Deep learning algorithms

To evaluate the performance of the feature set, it has been trained and cross-validated against many different parameter combinations. In the multi-feed-forward network, we must gather data based on the feature sets and then tune the parameters to achieve maximum accuracy in phishing site classification. This is an essential process in which training networks must set parameters and validate them across appropriate values; once the right values are attained, phishing sites can be classified with the highest probability. We used the Python programming language along with the TensorFlow library to implement the deep learning algorithms. From various combinations of hidden layers, we found that a DNN with 5 hidden layers achieved the best results; this permits the features we have extracted to be represented most effectively as nonlinear, separable and complex functions. The proposed deep feed-forward neural network comprises 7 layers: one input layer, 5 hidden layers and one output layer. All layers were followed and standardized by the Rectified Linear Unit (ReLU) or sigmoid function: the first four layers were followed by the ReLU function and the output layer by the sigmoid function. The rationale behind batch normalization is that it speeds up training by reducing the internal covariate shift, and it reduces over-fitting. ReLU has replaced sigmoidal or tanh activation functions in the hidden layers due to its tendency to learn faster, avoiding significant delays in the rate of gradient descent convergence after an initial set of iterations.

4.3.1 Formal description of DNN: The DNN is a type of machine learning technology consisting of many common neural network layers. It has one input layer, one output layer and at least one hidden layer, as shown in figure 2. Each layer is composed of the basic computing unit, the neuron. The neuron is inspired by the biological neuron and performs mathematical functions for the storage of information; this information is transmitted to other neurons, and thus information propagates through the neural network. A neuron's general mathematical representation is

Y^{k} = \Phi\Big(\sum_{j=0}^{n} W_{kj}\, x_j + b_k\Big) \qquad (4)

where \Phi is the activation function, W_k \in R^{L \times B} is the weight of the kth neuron and Y^{k} is the output of the kth neuron. The number of neurons in the input layer depends on the dimension of the dataset, or equivalently the number of features: X \in R^{L \times K}, where L is the total number of instances in the dataset, K is the total number of features and R denotes the real numbers. The number of neurons in the output layer depends on the number of outputs we want, and the number of neurons in the hidden layers is a hyperparameter that needs to be tuned to obtain an optimum result. Since each neuron performs computation, the number of neurons defines the network complexity. Each DNN is a complex mathematical function that adapts itself to the nature of the data; making the network more complex may result in over-fitting, i.e. it performs well on training data but fails to achieve good accuracy on unseen data.

Let l \in \{0, 1, \ldots, 6\} index the layers of our deep learning model, Y^{(l-1)} be the input to layers \{1, \ldots, 6\}, Y^{(l)} be the output of layer l, W^{(l)} be the weight of layer l used for the linear transformation of the inputs to the outputs, B^{(l)} be the bias of layer l and F^{(l)} be the associated activation function of each layer. Y^{(0)} is the input layer and Y^{(6)} is the output layer. Then

Z^{(l)} = Y^{(l-1)} * W^{(l)} + B^{(l)}, \qquad (5)

Y^{(l)} = F(Z^{(l)}) \qquad (6)

where * denotes matrix multiplication. The W values were initialized with Xavier initialization (an initializer of random values), and B was initialized with zeros; W and B are updated after each iteration by the backpropagation method. Layer 0 is the input layer, layer 6 is the output layer, and layers 1–5 are hidden layers activated with the ReLU function given by

Y_i^{(l)} = \begin{cases} 0 & \text{if } Z_i^{(l)} \le 0 \\ Z_i^{(l)} & \text{otherwise} \end{cases} \qquad (7)

where i represents the ith iteration and l the lth layer.

Figure 2. Architecture of simple neuron.
The intermediate output of our model Y^{*} is obtained using the sigmoid activation function:

Y^{*} = \frac{1}{1 + e^{-Z^{(l)}}} \qquad (8)

where l = 6 in the case of the output layer. The loss function L(Y^{*}, \hat{Y}) over the entire dataset is defined as the sum of the cross-entropy between the model output and the actual output:

L(Y^{*}, \hat{Y}) = -\sum_{j=1}^{n} \big[\hat{y}_j \log y_j + (1 - \hat{y}_j) \log(1 - y_j)\big] \qquad (9)

where Y^{*} is the intermediate output over the entire dataset obtained after processing it through the deep learning model, y_j \in (0, 1) is the jth row of Y^{*}, \hat{Y} is the actual label vector of our dataset and \hat{y}_j \in \{0, 1\} is the jth row of \hat{Y}, where 0 represents a legitimate site and 1 a phishing site. This loss function is optimized using the Adam Optimizer at every epoch to update the parameters and train the deep neural model by the backpropagation algorithm. The functional formations represent these features without over-fitting, because the DNN has 5 hidden layers along with 1 input and 1 output layer.

4.3.2 Formal description of LSTM: An RNN is a specific type of bio-inspired neural network that can model and learn sequential data patterns. It learns sequential dependences one sequence element at a time, thereby introducing time into neural network modelling. RNNs are well suited to sequential and time-series datasets and have proved very useful [10, 40]. However, the general recurrent network suffers from the vanishing gradient or exploding gradient problem, i.e. it cannot retain memory across longer paths, which breaks long-term dependences [41, 42]. As a result, long correlations between sequences are not maintained, and the network fails in such circumstances. LSTM takes care of the long correlation between sequences.

LSTM (figure 3) removes the vanishing or exploding gradient problem and thereby preserves long-term dependences. In LSTM, the neuron is replaced by a memory cell, which performs its task by applying activation functions to a linear combination of the dot product of the inputs and weights, with a bias. The LSTM also uses an update gate, a forget gate and an output gate to manage long-term dependences:

\tilde{C}^{<t>} = \tanh(W_c [a^{<t-1>}, x^{<t>}] + b_c) \qquad (10)

\Gamma_u = \sigma(W_u [a^{<t-1>}, x^{<t>}] + b_u) \qquad (11)

\Gamma_f = \sigma(W_f [a^{<t-1>}, x^{<t>}] + b_f) \qquad (12)

\Gamma_o = \sigma(W_o [a^{<t-1>}, x^{<t>}] + b_o) \qquad (13)

C^{<t>} = \Gamma_u \odot \tilde{C}^{<t>} + \Gamma_f \odot C^{<t-1>} \qquad (14)

a^{<t>} = \Gamma_o \odot C^{<t>} \qquad (15)

The weight matrices W_c, W_u, W_f, W_o, bias vectors b_c, b_u, b_f, b_o, temporary cell state \tilde{C}^{<t>}, update gate \Gamma_u, forget gate \Gamma_f, output gate \Gamma_o, cell state C^{<t>} and activation a^{<t>} [43] remain the same for all time-steps in a single unit of the LSTM network, and are updated after each epoch during the backpropagation method. The long memory is usually called the cell state; Eq. (10) gives its candidate value. The cell state allows the network to store information coming from the previous cell; it is updated using the values of both the update gate (Eq. (11)) and the forget gate (Eq. (12)). The forget gate enables the network to forget information that is not necessary or relevant by multiplying it by 0, and helps retain information by multiplying it by 1. The update gate determines which information should enter the cell memory for storage. The output gate (Eq. (13)) decides which result should move forward to the next hidden layer. To explore the feasibility of an RNN on our dataset, we have implemented LSTM.
Figure 3. Architecture of LSTM.
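As a sanity check on Eqs. (10)–(15), a single LSTM time-step can be written directly in NumPy. This toy sketch (dimension choices and names are ours, not the TensorFlow implementation used in the experiments) preserves the paper's Eq. (15), in which the output gate multiplies the raw cell state rather than tanh of it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, W, b):
    """One time-step of Eqs. (10)-(15). W and b hold the four weight
    matrices / bias vectors (c, u, f, o); concat is [a^<t-1>, x^<t>]."""
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(W["c"] @ concat + b["c"])   # Eq. (10) candidate cell state
    g_u = sigmoid(W["u"] @ concat + b["u"])       # Eq. (11) update gate
    g_f = sigmoid(W["f"] @ concat + b["f"])       # Eq. (12) forget gate
    g_o = sigmoid(W["o"] @ concat + b["o"])       # Eq. (13) output gate
    c_t = g_u * c_tilde + g_f * c_prev            # Eq. (14) new cell state
    a_t = g_o * c_t                               # Eq. (15) activation
    return a_t, c_t

# toy dimensions: hidden size 4, one feature per time-step as in our dataset
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 5)) for k in "cufo"}
b = {k: np.zeros(4) for k in "cufo"}
a, c = np.zeros(4), np.zeros(4)
for x in np.array([[0.3], [0.7], [0.1]]):
    a, c = lstm_step(a, c, x, W, b)
```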


In this neural network, we have taken 4 LSTM units performing the different mathematical computations defined earlier. Each unit has 10 time-steps, and each time-step has 1 output, which is passed as input to the next time-step. The output of the last time-step of the first unit is also passed to the second unit, and so on; finally, we get the output from the last time-step of the fourth LSTM unit. The obtained output is further densed to a single output and passed to the sigmoid function; the loss function is calculated, and the error is optimized using the Adam Optimizer. During backpropagation, every parameter of all 4 LSTM units is updated. The loss function is calculated again in each epoch, and the network learns as the variables are updated. We modified the dataset dimensionality to implement LSTM: the 10 features were converted into 10 time-steps, each time-step consisting of 1 feature, so the new dataset dimension is (3526, 10, 1). Through LSTM, we attempted to find possible relationships between the different features. Initially, a zero vector and the first time-step were passed to the first gate. The output from the first LSTM gate is passed as input to the second, and so on till the tenth gate. The single output obtained from the tenth gate forms the output of one LSTM unit, which is again passed as input to the first gate of the second unit. Hence, the output of the previous gate along with the current time-step is fed to the next gate, and the output of the previous LSTM unit is fed to the next unit, up to the fourth LSTM unit. In the fourth unit, the output is densed to 1 and passed to the sigmoid activation function (8); the loss function (9) is then calculated and optimized using the Adam Optimizer.

4.3.3 Formal description of CNN: A CNN is similar to an ordinary DNN. These networks consist of neurons that have weights and biases, which are updated and made to learn. Each neuron receives inputs that are converted into a linear combination of the dot product of weights and inputs plus a bias. However, instead of fully connected hidden layers, a CNN performs convolution on the input layers x \in R^{L \times B}. Convolution is performed using the convolution operator * of length L with stride s, and consists of a convolving filter W \in R^{B \times K}. Generally, CNNs are used with images due to the high correlation between neighbouring pixels. CNNs can figure out relations and different features using convolutional techniques, used in conjunction with pooling layers; batch normalization is done before passing the result to any activation function [9, 44]. CNNs have also been used in NLP, after character encoding, due to the correlation between character sequences [9, 45].

In our work, the 10 features selected by the IG algorithm are fed to the CNN model to identify the status of the suspicious site. The proposed CNN model consists of 8 layers (6 convolutional and 2 dense layers). In the first layer, the input is passed to the convolution layer and the output of this layer is activated using the tanh function (Eq. (16)),

\tanh(x) = (e^{2x} - 1)/(e^{2x} + 1). \qquad (16)

The activated output is subjected to batch normalization and pooling, and the result is passed to the next convolutional layer. In this way, we have 6 convolutional layers connected sequentially, the output of one layer being the input of the next. At the seventh layer, the output of the sixth convolutional layer is densed to 500 and again activated using tanh; at the end it is densed to 1. The output of the tanh function is passed to the sigmoid activation function (8) for an output in the range (0, 1). The loss function (9) is then calculated and optimized using the Adam Optimizer; backpropagation then updates the variables, and the network learns. We have used the 10 features obtained in the feature selection process; the size of our dataset is 3526, hence the dimensionality of the converted dataset is (3526, 10, 1), and it is passed to our CNN model for phishing detection.

5. Results and discussions

We conducted experiments to evaluate the performance of our DNN, LSTM and CNN models with different features and parameters. All experiments were conducted with the same dataset of 3526 instances. Each experiment was repeated, with the data randomly sampled from the dataset. For evaluating our models, we used the accuracy (Eq. (17)) and error (Eq. (18)) rates as the main evaluation metrics. To calculate them, we considered the phishing sites as condition positive (P), where P represents the total number of phishing sites in our dataset, and the legitimate sites as condition negative (N), where N represents the total number of legitimate sites. Correctly classified phishing sites are counted as True Positives (TP) and correctly classified legitimate sites as True Negatives (TN).

• Accuracy (ACC): measures the fraction of the total number of websites, legitimate and phishing, that is correctly classified:

\mathrm{ACC} = \frac{TP + TN}{P + N}. \qquad (17)

• Error Rate (ERR): measures the fraction of websites that is incorrectly classified:
\mathrm{ERR} = 1 - \frac{TP + TN}{P + N}. \qquad (18)
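In code, both metrics reduce to a few lines; the sketch below assumes the label encoding of Eq. (9), 0 for legitimate and 1 for phishing:

```python
def accuracy_and_error(y_true, y_pred):
    """Eqs. (17)-(18) with labels 1 = phishing (positive), 0 = legitimate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    return acc, 1.0 - acc   # ERR = 1 - ACC

acc, err = accuracy_and_error([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(acc, err)   # 0.8 0.2 (up to float rounding)
```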
Table 4. Accuracy of individual features.

Feature | Training accuracy (%) | Testing accuracy (%)
UF1 | 64.94 | 63.94
UF2 | 59.90 | 60.86
UF3 | 78.03 | 76.79
UF4 | 59.90 | 60.86
UF5 | 66.94 | 67.12
TF1 | 77.34 | 79.41
TF2 | 95.83 | 96.47
TF31 | 81.96 | 80.01
TF32 | 75.57 | 74.18
TF33 | 64.78 | 62.0
HF1 | 72.04 | 68.25
HF2 | 64.48 | 61.32
HF3 | 82.37 | 80.31
HF4 | 80.33 | 79.18
HF5 | 75.04 | 73.95
HF6 | 63.39 | 63.25
HF7 | 68.75 | 68.03
HF8 | 76.96 | 75.88

Figure 4. Network performance using 18 features.

5.1 Validation of selected features using DNN

Our feature retention and rejection are based on the inferences drawn from the IG values listed in table 2: we have retained the features with higher IG. In this section, we validate our feature selection using DNN.

To validate our selection of features, we conducted 3 experiments. The first experiment was conducted by supplying all 18 features of our earlier work [1] to the DNN; the overall accuracy obtained using 18 features is 97.95%. The individual accuracies from this experiment are tabulated in table 4. The testing accuracy of individual features varies from 60.86% (UF2, UF4) to 96.47% (TF2). Based on the accuracy, we eliminated the two URL-based features whose testing accuracy is less than 61% (UF2, UF4). We also eliminated two hyperlink-based features, HF5 and HF6, since their functionalities are covered by HF3 and HF4, respectively; moreover, the individual accuracies of HF5 and HF6 are lower than those of HF3 and HF4, respectively. Experiment 2 was conducted after eliminating these 4 features (UF2, UF4, HF5, HF6) from the total set of 18. The accuracy chart of training and testing with these 14 features is given in figure 5; the overall accuracy obtained using 14 features is 99.20%. Experiment 3 was conducted to evaluate the features after minimizing the third-party-based features; the overall accuracy after eliminating four third-party features is 98.97%.

The 10 features used in experiment 3 are the same as the ones with the highest IG given in table 2. The experimental results are in line with the IG, and they validate our selection of features.

Experiment 1 – evaluation of individual heuristic features using DNN: In this experiment the performance of each individual feature was evaluated; the results are given in table 4. This was done in order to know the individual contribution of each feature to the accuracy. Features with higher accuracy in detecting phishing sites have more relevance to the class labels. This relevance is an experimental manifestation of our feature ranking process using the IG algorithm, and an experimental justification of the retention and rejection of features using IG. The overall accuracy obtained using 18 features is 97.95%; the accuracy chart using 18 features is shown in figure 4. The obtained accuracy is less than that of our earlier work [1] with the same set of features.

Experiment 2 – evaluation of the model using 14 features: In this experiment, we evaluated the model accuracy using 14 features, leaving out UF2, UF4, HF5 and HF6. The reasons for leaving out these features are summarized as follows.

• UF2 and UF4: These two features have individual accuracies of less than 61%, the lowest in table 4. In the IG table 2, the two lowest values are 0.00797 and 0.00523, for UF2 and UF4, respectively. The lowest accuracy and lowest IG among all features validate their rejection in the feature selection process. These two features are the least relevant to the class labels, and their individual contributions are the lowest; our experimental analysis therefore upholds their rejection, reducing the inhibition offered by the worst performing features.
• HF3 vs HF5: HF3 is the ratio of the most common link to the total number of links in the URL web page, and HF5 is the ratio of null links to the total number of links in the URL web page.
If the null link is the most common link, the two are equivalent, and if null links are few or absent, the null link ratio feature is of no use. Hence, HF3 contains HF5 [22]. For removing features, we also considered the IG of each feature. In table 4, the individual accuracy of HF3 is higher than that of HF5; hence, the retention of HF3 is experimentally justified.
• HF4 vs HF6: HF4 is the ratio of the most common link to the total number of links in the footer, and HF6 is the ratio of null links to the total number of links in the footer. Both will have the same value if the most common link is a null link, and if null links are few or absent, HF6 is shallow. HF4, therefore, contains HF6 [22]. In our feature selection process, we considered the higher IG of HF4 over HF6, and in our experimental analysis the individual accuracy of HF4 is higher than that of HF6, which validates our claim.

After removing these features, the accuracy increased to 99.20%, as can be observed in figure 5.

Figure 5. Accuracy chart with 14 features.

Experiment 3 – evaluation of features by minimizing third-party-based features: In this experiment, we began by retaining all third-party features (TF1, TF2, TF31, TF32, TF33) shown in table 3. The extraction of third-party features from a URL is a time-consuming process; hence, phishing URL detection cannot be done in a fast, time-bound manner. Moreover, if any of these third-party features is unavailable, we have to deal with missing data, and the accuracy of the model drops. Hence we removed all the third-party features, including TF2, which has the highest IG, to test the robustness of our model. The accuracy of our model dropped to 90% after removing all the third-party features. We then experimented with the inclusion of one third-party feature (TF2) and obtained an accuracy of 98.97% with 5000 epochs.

5.2 Results with DNN

Section 5.1 described three experiments using DNN to validate our features by comparing results obtained with the IG ranking algorithm. We selected the 10 best performing features from experiment 3; the individual accuracies of the best 10 features are shown in figure 6.

Figure 6. Learning rate with α = 0.001 and α = 0.0001.

Experiment 4 was conducted using DNN on the dataset of Rao and Pais [1] with the selected 10 features. Hyperparameter tuning was performed to optimize the model by selecting the learning rate (α), optimizer, number of hidden layers, number of nodes per layer and number of epochs.

Experiment 4 – evaluation of the model by tuning parameters: In experiment 3, we had not fine-tuned the parameters to optimize the model. In experiment 4, fine-tuning of the parameters was performed to optimize our DNN model.
Table 5. Parameters for DNN.

Layers | Number of units in layers | Learning rate | Optimizer | Epochs | Activation function
6 | 10, 19, 100, 200, 300, 1 | 0.0001 | Adam Optimizer | 6000 | ReLU
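Table 5 lists six layer widths (10, 19, 100, 200, 300, 1); reading the first as the 10-feature input and the last as the sigmoid output gives the Keras sketch below. This is our rendering of the description in sections 4.3 and 4.3.1, not the authors' published code, and the placement of batch normalization is an assumption:

```python
import tensorflow as tf

# Hedged sketch of the table 5 configuration: ReLU hidden layers of
# 19, 100, 200 and 300 units over the 10-feature input, with batch
# normalization, and a single sigmoid output unit.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(19, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=6000, validation_data=(X_test, y_test))
```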
Figure 7. DNN individual feature accuracy.

Table 6. DNN experimental results.

Experiment | α value | No. of features | Epochs | Training accuracy (%) | Testing accuracy (%)
1 | 0.001 | 18 | 5000 | 98.71 | 97.95
2 | 0.001 | 14 | 5000 | 99.96 | 99.20
3 | 0.001 | 10 | 5000 | 99.69 | 98.97
4 | 0.0001 | 10 | 5000 | 99.51 | 99.20
4 | 0.0001 | 10 | 6000 | 99.55 | 99.52

Figure 8. Comparison between optimizers.

Figure 9. DNN accuracy with ten features.

The parameter fine-tuning process is as follows.

• Learning rate: We began with α = 0.001, keeping the rest of the parameters as specified in table 5. At this α value, our model's loss function converged rapidly, as indicated in figure 7, with higher losses; in this case, we got 99.69% training accuracy and 98.97% testing accuracy. Hence we decreased α to 0.0001, and the loss function of our model began to converge, as can be seen in figure 7, at about 800 epochs, with the lowest loss (error) of about 0.012%. We achieved a test accuracy of 99.20% and a training accuracy of 99.51% with 5000 epochs. We further increased the number of epochs to 6000, keeping α at 0.0001, and achieved a consistent training accuracy of 99.55% and testing accuracy of 99.52%, as shown in figure 9; all experimental results are tabulated in table 6. We also decreased α further, to 0.00001, and the loss function converged at much higher epochs with the same accuracy. We therefore concluded that decreasing α further would take
more processing time without increasing the model accuracy.
• Optimizer: We used the Adam Optimizer and obtained a training accuracy of around 99.55% and a testing accuracy of 99.52%. We also tested our model with a gradient descent optimizer, which makes the model very slow and less accurate, as can be clearly seen in the graph shown in figure 8.
• Number of epochs: We used an iterative process to determine the total number of epochs for the best performance of our model. We started with 500 epochs and increased by 500 until we reached the minimum loss. Minimum loss means that, if we keep iterating the model, the loss will continue to decline to some minimum value and then start fluctuating, so we have to stop at that minimum point; it varies with different learning rates and different optimizers.
• Number of hidden layers: Increasing the number of hidden layers increases the network complexity, because we fit the data with the number of hidden layers and the number of hidden units. We initialized with one hidden layer and moved progressively to identify the optimal number of hidden layers. Based on this empirical analysis, we achieved the optimal model results with 4 hidden layers; on further increasing the layers, the model showed non-promising results with additional processing time.
• Number of units in hidden layers: There is no rule to determine the number of hidden units in each layer. We began with a smaller number of hidden units per layer, which was faster, but the network could not learn properly and accuracy suffered. Hence, we increased the number of units and tested the accuracy of each configuration. We observed that having too many units slows down training and leads to over-fitting, and therefore lower accuracy.

Therefore, after all these experiments, we found that the DNN is consistent, achieving 99.55% training accuracy and 99.52% testing accuracy with the finalized 10 features. The individual feature training and testing accuracies of the DNN are shown in figure 6, and the accuracy graph of the selected features using DNN is shown in figure 9. Experimental results are tabulated in table 6; due to the close difference between training and test accuracy, over-fitting is reduced.

Figure 10. LSTM individual feature accuracy.

Figure 11. Accuracy graph of LSTM.

Table 7. Parameters for LSTM.

Number of LSTM units | Learning rate | Optimizer | Epochs
4 | 0.001 | Adam Optimizer | 700
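Read together with section 4.3.2, table 7 suggests an architecture along the following lines: the 10 features reshaped to 10 time-steps of 1 feature, four stacked LSTM units and a single sigmoid output. A hedged Keras sketch (the per-unit hidden size is our assumption, as the paper does not state it):

```python
import tensorflow as tf

# Sketch of the table 7 setup, not the authors' code: input shape (10, 1),
# four stacked LSTM blocks, one sigmoid output unit.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 1)),
    tf.keras.layers.LSTM(4, return_sequences=True),
    tf.keras.layers.LSTM(4, return_sequences=True),
    tf.keras.layers.LSTM(4, return_sequences=True),
    tf.keras.layers.LSTM(4),                 # last unit returns a single vector
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X.reshape(-1, 10, 1), y, epochs=700)
```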
Sådhanå (2020) 45:165 Page 15 of 18 165


5.3 Results with LSTM

To check the effectiveness of our 10 features, we experimented using LSTM. The parameters used for the experiment are given in table 7. In LSTM, the hyperparameters are tuned to achieve better accuracy. These hyperparameters are used in experiment 5 to achieve promising accuracy.

Table 7. Parameters for LSTM.

Number of LSTM units   Learning rate   Optimizer        Epochs
4                      0.001           Adam Optimizer   700

Experiment 5 – evaluation of the LSTM model by tuning parameters:
• Number of LSTM units: We started with 32 LSTM units. However, more LSTM units slow down training. The network testing accuracy was much lower than the training accuracy, which implies over-fitting. After dropouts were used to reduce over-fitting, there was a trade-off between accuracy and over-fitting: over-fitting was reduced at the cost of training accuracy. Hence we started decreasing the number of LSTM units, and this trend held until the number of LSTM units was reduced to four.
• Learning rate (α): This is one of the most important parameters determining our model's convergence. If α is kept large (near one), the minimum convergence point on the model contour can be skipped; if α is kept small (near 0), a long time is required to reach the minimum convergence point. We began with α = 0.0001. However, due to the low magnitude of α, the network was converging slowly even after 5000 epochs, and there was no improvement in accuracy with an increasing number of epochs; it stayed at 98%. Hence, we increased α to 0.001, at which we obtained our maximum accuracy (99.57%). We continued up to α = 0.1, but this proved to be pointless, as the network started oscillating and accuracy decreased.
• Optimizer: We experimented with a simple gradient descent optimizer and applied an incremental approach by adding epochs. The experiment was conducted with a maximum of 10000 epochs, and the convergence rate was still slow at 10000 epochs. In this process, we observed the best possible accuracy with 700 epochs using the Adam Optimizer.
After tuning the afore-mentioned parameters, we achieved a training accuracy of 98.86% and a testing accuracy of 99.57%. The accuracy graph we obtained is shown in figure 11, and the individual feature testing and training accuracies are given in figure 10. The latter figure shows that the training and testing accuracies of the individual features are very close, which prevents over-fitting.

Figure 10. LSTM individual feature accuracy.

Figure 11. Accuracy graph of LSTM.
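As a concrete illustration of this configuration, the sketch below builds a Keras model with 4 LSTM units, the Adam Optimizer at α = 0.001 and 700 epochs, matching table 7. Reshaping the 10-feature vector into a sequence of 10 one-dimensional steps and the sigmoid output layer are our assumptions, as the exact input encoding for the recurrent layer is not specified here.

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the 10-feature dataset and binary labels.
X = np.random.rand(1000, 10).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

# LSTM layers expect 3-D input (samples x steps x features); treating each
# 10-feature vector as 10 one-dimensional time steps is an assumption.
X_seq = X.reshape((-1, 10, 1))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 1)),
    tf.keras.layers.LSTM(4),                         # 4 LSTM units, as in table 7
    tf.keras.layers.Dense(1, activation="sigmoid"),  # phishing probability
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # alpha = 0.001
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.fit(X_seq, y, epochs=700, batch_size=32, validation_split=0.2, verbose=0)

Keeping the recurrent layer at only 4 units keeps the model small, which is consistent with the over-fitting behaviour described above for larger unit counts.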
5.4 Results with CNN

To verify the performance of our 10 selected features, we conducted our next experiment using the CNN model. The parameters used for this experiment are given in table 8. Further details of our experiments are given here.

Table 8. Parameters for CNN.

Layers   Number of filters in layers      Learning rate   Optimizer        Epochs   Window size   Activation function   Stride
7        32, 64, 64, 128, 128, 264, 512   0.001           Adam Optimizer   200      2             tanh                  1

Experiment 6 – evaluation of the CNN model by tuning parameters:
• Window size: This refers to the size of the one-dimensional window that is convolved sequentially over the input. Our dataset has 10 features, so the choice of window size was already limited by the small number of features. We started with a larger window size (7) and found, by evaluating the accuracy and the trends of the loss values during training, that the model did not learn appropriately because it allowed only a small number of layers (2). Hence we kept reducing the window size until we reached a window size of 2.
• Stride value: This refers to the number of steps skipped after each convolution. Higher strides reduce the size of the output. We started with a stride value of 4, which reduced the number of convolutional layers in the model; however, the model with fewer convolutional layers performed poorly. Hence, we reduced the stride to 1.
• Number of filters in each layer: The values are summarized in table 8.
The other hyperparameters are also summarized in table 8. The CNN model performs well with the 10 features, achieving a training accuracy of 99.29% and a testing accuracy of 99.43%. The accuracy graph we obtained is shown in figure 12.

Figure 12. Accuracy graph of CNN.
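A minimal sketch of this CNN under the table 8 settings (seven Conv1D layers with the listed filter counts, window size 2, stride 1, tanh activation, and the Adam Optimizer at a learning rate of 0.001 for 200 epochs) is given below. The Flatten layer and the sigmoid classification head are illustrative assumptions, since only the convolutional hyperparameters are reported here.

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins: each sample is the 10-feature vector reshaped to a
# one-dimensional sequence so that it can be convolved with a 1-D window.
X = np.random.rand(1000, 10).astype("float32").reshape((-1, 10, 1))
y = np.random.randint(0, 2, size=(1000,))

# Seven Conv1D layers with the filter counts, window size (2) and stride (1)
# from table 8; with no padding, the sequence shrinks from 10 to 3 steps.
model = tf.keras.Sequential([tf.keras.Input(shape=(10, 1))])
for filters in [32, 64, 64, 128, 128, 264, 512]:
    model.add(tf.keras.layers.Conv1D(filters, kernel_size=2, strides=1,
                                     activation="tanh"))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # assumed output head

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.fit(X, y, epochs=200, batch_size=32, validation_split=0.2, verbose=0)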
5.5 Comparison study

In this section, we compare our model with existing works that use deep learning for the classification of phishing sites. Like other researchers [32, 46–49], we collected the results of the existing works from the respective papers for the comparison analysis. The results listed in table 9 are those obtained by the respective authors on their own datasets. These researchers' datasets could not be used for comparison because of the limitation of feature extraction: our technique requires a third-party-based feature (TF2) for the classification of phishing URLs, and the use of third-party services for feature extraction requires datasets with live phishing sites. The datasets of the existing works (listed in table 9) mostly consist of URLs that have already been taken down from the Internet; hence, they cannot be used for comparison. The present study has therefore been compared with the previous work [1], CANTINA [18] and CANTINA+ [23] using a common dataset. The results are given in table 10. It is observed that our models with DNN, LSTM and CNN achieved significant accuracy compared with the existing works. It is also demonstrated that the proposed model with LSTM outperforms the other proposed models with an accuracy of 99.57%, which is an improvement over the previous work (99.5%) with minimal features. Note that Le et al [9], Bahnsen et al [10] and Zhao et al [34] applied various deep learning algorithms to the URLs rather than to content for the classification of phishing URLs. Le et al [9] achieved a significant accuracy compared with other existing works that used features extracted from content [24, 33, 35–37]. Despite the use of content-based features in training our model, it is observed that our model outperforms the work of Le et al [9] and other deep-learning-based methods [10, 34], which use URLs for the classification. This shows the richness of our feature set in detecting phishing sites.

Table 9. Summary of the results of related existing works.

Techniques             Accuracy (%)
Zhang et al [24]       95.83
Mohammad et al [35]    92.48
El-Alfy [33]           96.79
Zhao et al [34]        98.5
Le et al [9]           99.29
Bahnsen et al [10]     98.7
Yang et al [32]        98.99
Feng et al [36]        97.71
Yi et al [37]          90

Table 10. Summary of the works implemented on the same dataset.

Techniques                  Accuracy (%)
Rao and Pais [1]            99.5
Zhang et al [18]            89.18
Xiang et al [23]            99.13
Proposed Model-I [DNN]      99.52
Proposed Model-II [LSTM]    99.57
Proposed Model-III [CNN]    99.43

Deployment of model: The model is deployed as a desktop application that takes a URL as input and gives the status of the URL as output. The application makes a connection to a REST API, which runs on a remote server where the actual execution of the technique takes place. The REST API is hosted on an Intel Xeon 16-core Ubuntu server with 16 GB of RAM and a 2.67-GHz processor. The REST API is implemented using the Spring framework, and the GET method is used to transfer the URL from the application to the program running on the remote server. On receiving the URL, the program on the remote server proceeds with the extraction of features. These features are combined to form a feature vector that is then sent to the trained deep learning model to identify the legitimacy of the given URL. The REST API sends back the status of the URL (legitimate or phishing) to the application, which displays the message.
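To illustrate the client side of this workflow, the following minimal sketch submits a URL to such a REST API with an HTTP GET request and prints the returned status. The endpoint address, port and parameter name are hypothetical, as the specification of the deployed Spring service is not published here.

import requests

# Hypothetical endpoint; the real host, port and path of the Spring REST API
# are not disclosed in the paper.
API_ENDPOINT = "http://phish-detector.example.org:8080/api/check"

def check_url(url: str) -> str:
    # The GET method carries the URL to the remote server, which extracts the
    # features, forms the feature vector and runs the trained model on it.
    response = requests.get(API_ENDPOINT, params={"url": url}, timeout=30)
    response.raise_for_status()
    return response.text  # expected status: "legitimate" or "phishing"

if __name__ == "__main__":
    print(check_url("http://suspicious-login.example.com/account/verify"))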
6. Limitations

In this section, we discuss the limitations of our proposed work. Since the proposed model is dependent on third-party services, the non-availability of these services will limit the performance of our work.
Also, our proposed model might fail to detect phishing sites that use embedded objects such as Flash, JavaScript and HTML files to replace textual content. In the future, we intend to include features for the detection of these embedded objects in phishing sites.

7. Conclusions

We have proposed a deep learning model to detect the legitimacy of a given website. We used URL heuristic and third-party service-based features for training the deep learning models.
Unlike the previous work [1], we minimized the number of features and reduced the dependence on third-party services to achieve a significant accuracy of 99.57%. We also tested our features with various deep-learning-based models, namely CNN, DNN and LSTM, and we achieved an accuracy of 99.57% with LSTM, 99.43% with CNN and 99.52% with DNN. The LSTM and DNN models outperformed the previous machine-learning-based work [1], achieving better results with 10 features instead of 18.
In the future, we intend to include additional heuristic features that can detect phishing sites hosted on compromised domains, and also phishing sites that include embedded objects such as iframes, Flash and HTML.

Acknowledgements

This research was funded by the Ministry of Electronics and Information Technology (MeitY), Government of India. The authors sincerely thank MeitY for the financial support. The authors thank the anonymous referees for their comments and criticism, which have helped to improve the quality of the paper.

References

[1] Rao R S and Pais A R 2019 Detection of phishing websites using an efficient feature-based machine learning framework. Neural Comput. Appl. 31: 3851–3873
[2] APWG 2018 Phishing attack trends reports, first quarter 2018. https://docs.apwg.org//reports/apwg_trends_report_q1_2018.pdf, published July 31, 2018
[3] Fu A Y, Wenyin L and Deng X 2006 Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD). IEEE Trans. Dependable Secure Comput. 3: 301–311
[4] Wenyin L, Huang G, Xiaoyue L, Min Z and Deng X 2005 Detection of phishing webpages based on visual similarity. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, ACM, pp. 1060–1061
[5] Hara M, Yamada A and Miyake Y 2009 Visual similarity-based phishing detection without victim site information. In: Proceedings of the IEEE Symposium on Computational Intelligence in Cyber Security, CICS'09, IEEE, pp. 30–36
[6] Rao R S and Ali S T 2015 A computer vision technique to detect phishing attacks. In: Proceedings of the Fifth International Conference on Communication Systems and Network Technologies (CSNT), IEEE, pp. 596–601
[7] Khonji M, Iraqi Y and Jones A 2013 Phishing detection: a literature survey. IEEE Commun. Surv. Tutor. 15: 2091–2121
[8] Zhang N and Yuan Y 2012 Phishing detection using neural network. Technical Report, Department of Computer Science, Department of Statistics, Stanford University (CS229 Lecture Notes)
[9] Le H, Pham Q, Sahoo D and Hoi S C 2018 URLNet: learning a URL representation with deep learning for malicious URL detection. arXiv preprint: arXiv:1802.03162
[10] Bahnsen A C, Bohorquez E C, Villegas S, Vargas J and González F A 2017 Classifying phishing URLs using recurrent neural networks. In: Proceedings of the APWG Symposium on Electronic Crime Research (eCrime), IEEE, pp. 1–8
[11] Whittaker C, Ryner B and Nazif M 2010 Large-scale automatic classification of phishing pages. In: Proceedings of the Network and Distributed System Security Symposium (NDSS), vol. 10
[12] Huh J H and Kim H 2011 Phishing detection with popular search engines: simple and effective. In: Proceedings of the International Symposium on Foundations and Practice of Security. Springer, pp. 194–207
[13] Jain A K and Gupta B B 2018 Two-level authentication approach to protect from phishing attacks in real time. J. Ambient Intell. Humaniz. Comput. 9: 1783–1796
[14] APWG 2014 Global phishing reports first half 2014. https://docs.apwg.org//reports/APWG_Global_Phishing_Report_1H_2014.pdf, published 25 September 2014
[15] Cao Y, Han W and Le Y 2008 Anti-phishing based on automated individual white-list. In: Proceedings of the 4th ACM Workshop on Digital Identity Management, ACM, pp. 51–60
[16] Zhang J, Porras P A and Ullrich J 2008 Highly predictive blacklisting. In: Proceedings of the USENIX Security Symposium, pp. 107–122
[17] Rao R S and Pais A R 2017 An enhanced blacklist method to detect phishing websites. In: Proceedings of the International Conference on Information Systems Security. Springer, pp. 323–333
[18] Zhang Y, Hong J I and Cranor L F 2007 CANTINA: a content-based approach to detecting phishing web sites. In: Proceedings of the 16th International Conference on World Wide Web, ACM, pp. 639–648
[19] Pan Y and Ding X 2006 Anomaly based web phishing page detection. In: Proceedings of the 22nd Annual Computer Security Applications Conference (ACSAC'06), IEEE, pp. 381–392
[20] Horng M H S, Fan P, Khan M, Run R and Chen J L R 2011 An efficient phishing webpage detector. Expert Syst. Appl. 38: 12018–12027
[21] Gowtham R and Krishnamurthi I 2014 A comprehensive and efficacious architecture for detecting phishing webpages. Comput. Secur. 40: 23–37
[22] Srinivasa Rao R and Pais A R 2017 Detecting phishing websites using automation of human behavior. In: Proceedings of the 3rd ACM Workshop on Cyber-Physical System Security, ACM, pp. 33–42
[23] Xiang G, Hong J, Rose C P and Cranor L 2011 CANTINA+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. 14(2): 1–28
[24] Zhang D, Yan Z, Jiang H and Kim T 2014 A domain-feature enhanced classification model for the detection of Chinese phishing e-business websites. Inf. Manag. 51: 845–853
[25] Chiew K L, Chang E H and Tiong W K 2015 Utilisation of website logo for phishing detection. Comput. Secur. 54: 16–26
[26] Moghimi M and Varjani A Y 2016 New rule-based phishing detection method. Expert Syst. Appl. 53: 231–242
[27] Aggarwal A, Rajadesingan A and Kumaraguru P 2012 PhishAri: automatic realtime phishing detection on Twitter. In: Proceedings of the eCrime Researchers Summit (eCrime), IEEE, pp. 1–12
[28] Marchal S, Armano G, Gröndahl T, Saari K, Singh N and Asokan N 2017 Off-the-hook: an efficient and usable client-side phishing prevention application. IEEE Trans. Comput. 66: 1717–1733
[29] Sahingoz O K, Buber E, Demir O and Diri B 2019 Machine learning based phishing detection from URLs. Expert Syst. Appl. 117: 345–357
[30] Li Y, Yang Z, Chen X, Yuan H and Liu W 2019 A stacking model using URL and HTML features for phishing webpage detection. Future Gener. Comput. Syst. 94: 27–39
[31] Jain A K and Gupta B B 2018 Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. 68: 687–700
[32] Yang P, Zhao G and Zeng P 2019 Phishing website detection based on multidimensional features driven by deep learning. IEEE Access 7: 15196–15209
[33] El-Alfy E S M 2017 Detection of phishing websites based on probabilistic neural networks and K-medoids clustering. Comput. J. 60: 1745–1759
[34] Zhao J, Wang N, Ma Q and Cheng Z 2018 Classifying malicious URLs using gated recurrent neural networks. In: Proceedings of the International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing. Springer, pp. 385–394
[35] Mohammad R M, Thabtah F and McCluskey L 2014 Predicting phishing websites based on self-structuring neural network. Neural Comput. Appl. 25: 443–458
[36] Feng F, Zhou Q, Shen Z, Yang X, Han L and Wang J 2018 The application of a novel neural network in the detection of phishing websites. J. Ambient Intell. Humaniz. Comput. 1–15
[37] Yi P, Guan Y, Zou F, Yao Y, Wang W and Zhu T 2018 Web phishing detection using a deep learning framework. Wirel. Commun. Mobile Comput. 2018: Article ID 4678746
[38] Zhou Q, Chen H, Zhao H, Zhang G, Yong J and Shen J 2016 A local field correlated and Monte Carlo based shallow neural network model for non-linear time series prediction. EAI Endorsed Trans. Scalable Inf. Syst. 3: e5-1–e5-7
[39] Quinlan J R 1986 Induction of decision trees. Mach. Learn. 1: 81–106
[40] Smith C and Jin Y 2014 Evolutionary multi-objective generation of recurrent neural network ensembles for time series prediction. Neurocomputing 143: 302–311
[41] Mikolov T, Joulin A, Chopra S, Mathieu M and Ranzato M A 2014 Learning longer memory in recurrent neural networks. arXiv preprint: arXiv:1412.7753
[42] Jozefowicz R, Zaremba W and Sutskever I 2015 An empirical exploration of recurrent network architectures. In: Proceedings of the International Conference on Machine Learning, pp. 2342–2350
[43] Hochreiter S and Schmidhuber J 1997 Long short-term memory. Neural Comput. 9: 1735–1780
[44] Krizhevsky A, Sutskever I and Hinton G E 2012 ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
[45] Pham N Q, Kruszewski G and Boleda G 2016 Convolutional neural network language models. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1153–1162
[46] Ramesh G, Krishnamurthi I and Kumar K S S 2014 An efficacious method for detecting phishing webpages through target domain identification. Decis. Support Syst. 61: 12–22
[47] He M, Horng S J, Fan P, Khan M K, Run R S, Lai J L, Chen R J and Sutanto A 2011 An efficient phishing webpage detector. Expert Syst. Appl. 38: 12018–12027
[48] Marchal S, Armano G, Gröndahl T, Saari K, Singh N and Asokan N 2017 Off-the-hook: an efficient and usable client-side phishing prevention application. IEEE Trans. Comput. 66: 1717–1733
[49] Gowtham R and Krishnamurthi I 2014 A comprehensive and efficacious architecture for detecting phishing webpages. Comput. Secur. 40: 23–37
