Efficient Deep Learning Techniques For The Detection of Phishing
https://doi.org/10.1007/s12046-020-01392-4
Sådhanå (2020) 45:165
Information Security Research Lab, National Institute of Technology Karnataka, Surathkal 575025, India
e-mail: [email protected]; [email protected]; [email protected];
[email protected]
MS received 18 May 2019; revised 10 January 2020; accepted 12 February 2020; published online 27 June 2020
Abstract. Phishing is a fraudulent practice and a form of cyber-attack designed and executed with the sole purpose of gathering sensitive information by masquerading as genuine websites. Phishers fool users by replicating the original, genuine content so that users reveal personal information such as security number, credit card number, password, etc. Many anti-phishing techniques have been proposed to date, such as blacklist- or whitelist-based, heuristic-feature-based and visual-similarity-based methods. Modern browsers adapt to reduce the chances of users getting trapped, but users still fall prey to phishers and end up revealing their secret information. In a previous work, the authors proposed a machine learning approach based on heuristic features for phishing website detection and achieved an accuracy of 99.5% using 18 features. In this paper, we propose novel phishing URL detection models using (a) a Deep Neural Network (DNN), (b) Long Short-Term Memory (LSTM) and (c) a Convolutional Neural Network (CNN), using only 10 features from our earlier work. The proposed technique achieves an accuracy of 99.52% for DNN, 99.57% for LSTM and 99.43% for CNN. The proposed techniques utilize only one third-party service feature, which makes them more robust to failure and increases the speed of phishing detection.
Keywords. Deep neural network (DNN); long short-term memory (LSTM); convolution neural network
(CNN); recurrent neural network (RNN); phishing; heuristic technique; deep learning.
1. Introduction
165 Page 2 of 18 Sådhanå (2020) 45:165
There are many methods proposed for detecting and preventing phishing. These techniques can be summarized as follows.

• Listing-based detection: Most browsers like Chrome, Mozilla, Opera, etc. maintain lists of blocked and permitted Uniform Resource Locators (URLs). The database of blocked URLs is termed a blacklist and that of permitted URLs a whitelist. With whitelist databases, even legitimate sites that are not found in the database could be blocked from browser access. The blacklist-based methods follow the opposite approach. Instead of maintaining a database of legitimate URLs, they keep a database of phishing URLs. This method fails when phishing sites that are new and not even a day old, known as zero-day phishing sites, are encountered. It may also be bypassed with slight URL changes. The list must be updated frequently, which is laborious given the rising number of phishing attacks.
• Heuristic-feature-based detection: This technique is based on features extracted from phishing sites and is used to detect and prevent phishing attacks. However, the limitation is that the heuristic features are not always guaranteed to exist in all the phishing sites, which may lead to reduced detection rates. Also, this technique could be easily bypassed if the algorithms or detection features are known in advance.
• Visual-similarity-based detection: Attackers mimic the target websites by the use of favicons, screenshots, background images and logos such that a user can get tricked easily. There are many techniques [3–6] that use databases of logos, screenshots, favicons and Document Object Models (DOM) of target websites for similarity computation with suspicious sites. If the similarity score is higher than a certain threshold, it implies that the suspicious site has mimicked some legitimate site, and such websites are declared as phishing. Phishers could easily bypass this security system with a slight change in visual elements without changing the contents.
• Conventional machine-learning-based detection: One of the main problems suffered by heuristic detection is that it is not flexible enough to accommodate phishing site changes. Even minor changes could allow a site to bypass such detection. Therefore, the heuristic model has been given flexibility using machine learning techniques to accommodate the changes. In this technique, datasets are prepared to train the machine learning model, and the dataset represents the values of features extracted using a heuristic approach. Some of the algorithms used are Support Vector Machine Decision Tree (SVM-DT), Random Forest (RF), Sequential Minimal Optimization (SMO), Principal Component Analysis Random Forest, J48 tree, Multilayer Perceptron, etc. These algorithms can detect even zero-day phishing attacks when trained with heuristic model features. If the training data is vast, these algorithms perform even better, as they learn most of the possible variations that phishing sites may have. Rao and Pais [1] achieved an accuracy of about 99.5% in detecting phishing sites using machine learning techniques. Also, according to a recent survey [7], detection of phishing sites with an accuracy of more than 99% could be achieved using machine learning techniques. The performance of a machine learning algorithm depends on the size of the training data, the quality of the extracted features and the values of certain hyperparameters used to optimize the accuracy.
• Deep-learning-based detection: Deep learning is a machine learning technique that learns features directly from data. The data may be images, text or sound. Deep learning requires a large amount of labelled data, and Graphics Processing Units (GPUs) make it possible to train deep networks in less time. Recent work has exploited Deep Neural Network (DNN) techniques such as the multi-layer feed-forward network [8], Convolutional Neural Network (CNN) [9] and Recurrent Neural Network (RNN) [10] to detect and prevent phishing attacks. These networks are trained on multi-featured datasets obtained using heuristic methods. Bahnsen et al [10] trained an RNN over the URL character sequence. They argued that each character sequence has correlations, i.e. nearby characters in the URL are likely to be connected. These sequential patterns are important because they can be used to improve predictor performance. Le et al [9] used CNN to learn sequential URL behaviour. They adopted two techniques, character-level CNN and word-level CNN, which identify unique characters and words. Each character or word is represented as a vector, and the vectors are passed through the CNN to learn the sequential behaviour of the URL and identify phishing URLs.

The robustness of the machine learning algorithms, trained over datasets consisting of values of heuristic methods, has led to the proposal of many methods for dealing with phishing sites. Many works [7, 11–13] use third-party services such as Google or Bing results, Alexa¹ ranking and WHOIS² to detect phishing. However, some phishing websites hosted on compromised domains bypass even such techniques. According to APWG's [14] report, most phishing sites may not last for a day, but phishing sites hosted on a compromised site may live for more than a day. The same statistics point to the limitations of the existing techniques, which have led to such a drastic increase in phishing over the years. Therefore, there must be a mechanism to prevent phishing attacks with higher accuracy and minimize the use of third-party services with minimal features. The heuristic method captures specific and compelling features that are sufficiently robust to detect even zero-day phishing. These methods were used to extract the required features for the training of our multi-layer DNN, Long Short-Term Memory (LSTM) Network and CNN. We also tried to optimize the hyperparameters of these networks to obtain the best possible accuracy with minimal features.

The previous work [1] used RF and its variations as classifiers with a rich feature set for the classification of phishing sites. In our current work, we have used deep learning algorithms such as CNN, LSTM and DNN for the detection of phishing websites. Also, we have used an information gain (IG) algorithm to select the best performing features among our proposed features and used them for classifying the phishing websites. The feature selection resulted in a reduction of features from 18 to 10 with lower dependence on third-party services while achieving the same accuracy as that of the previous work. Out of these 10 features, 6 are existing features proposed by others, and the other 4 features were proposed in the earlier work.

Our paper makes the following research contributions:

1. We have proposed the IG algorithm to select the best performing features for phishing URL detection.
2. We have proposed novel DNN-, LSTM- and CNN-based models for phishing URL detection with only 10 features.
3. These proposed models achieved a promising accuracy of 99.52%, 99.57% and 99.43% for DNN, LSTM and CNN, respectively.

The rest of the paper is organized as follows. In section 2, we discuss related work carried out by different researchers using different techniques and algorithms. Section 3 explains the architecture of the proposed work. Section 4 deals with the implementation of the proposed model with the tools and datasets used. In section 5, we discuss and capture the results of individual methods with their efficiency and accuracy by incorporating different test levels of the model. In section 6, we list out the limitations of our work, and finally, we conclude our paper in section 7.

¹ https://www.alexa.com/topsites.
² https://www.whois.com.

2. Related work

The proposed model fits into the deep-learning-based category, in which less work has been done, and it has been inspired by existing list-based (whitelist [15], blacklist [16, 17]), heuristic [18–22] and machine learning methods [23–27]. These methods were discussed in section 1, and also in the previous work [1]. In this section, we discuss some of the latest works on deep-learning- and machine-learning-based techniques, which are given as follows:

Marchal et al [28] proposed a client-side application that extracts features mainly from the URL and content of the website, resulting in a 210-feature vector. The authors used a Gradient Boosting algorithm for the classification of phishing sites to achieve a significant detection rate. The use of a large feature vector may require significant time for the feature extraction and classification of URLs.

Sahingoz et al [29] proposed a phishing detection model from URLs using a machine learning approach. The authors applied 7 different classification algorithms on Natural Language Processing (NLP)-based features for the classification of phishing URLs. The experimental results demonstrated that the RF algorithm with NLP-based features achieved a significant accuracy of 97.98%.

Li et al [30] proposed a stacking model combining Gradient Boosting Decision Tree, XGBoost and LightGBM algorithms for detecting phishing web pages. The authors extracted features from the URL and Hypertext Markup Language (HTML) of the suspicious website. The extracted features comprise 8 URL-based and 12 HTML-based features, which form a feature vector. The vector was fed to the stacked model for the classification and achieved an accuracy of 97.30%. Jain & Gupta [31] proposed a client-side technique that uses features from the URL and source code of the suspicious site for the classification. They applied five machine learning algorithms to identify the best classifier suitable for their dataset. RF outperformed the other classifiers with an accuracy of 99.09%.

Yang et al [32] proposed a phishing website detection method based on multidimensional features driven by deep learning. The authors propose a direct URL approach where character sequence features are extracted from the URLs using a dynamic category decision algorithm, and deep learning is applied for the classification of websites. Features extracted from the URL, webpage code and webpage text are combined into the multidimensional feature set. This feature set is fed to the CNN–LSTM model for the detection of phishing sites and achieves an accuracy of 98.99%. El-Alfy [33] proposed phishing website detection based on probabilistic neural networks and K-medoids clustering. This framework combined unsupervised and supervised algorithms for training the nodes. K-medoid technology uses feature selection or transformation, and component analysis is used to reduce space dimensionality. The technique achieved 96.79% accuracy by considering 30 features.

Zhang et al [24] proposed SMO for the detection and classification of Chinese phishing e-business websites. To evaluate the model, they used 15 unique and some generic domain-specific features. They used 4 different machine learning algorithms for the classification of phishing sites. Among all 4 algorithms, SMO performed the best in detecting phishing sites with an accuracy of 95.83%. The disadvantage of this approach is that it works better with Chinese websites only. Bahnsen et al [10] proposed an RNN for the classification of phishing URLs using LSTM.
The authors compared the traditional RF tree machine learning algorithm to LSTM with 3-fold cross-validation. RF used 14 features for URL statistical and lexical analysis with an accuracy of 93.5%. However, RNN with direct URLs performs better than the RF tree classification algorithm, with 98.7% accuracy, without requiring labour-intensive and time-consuming manual extraction of features.

Le et al [9] use a deep learning model to detect phishing URLs. They use the URLNet framework to learn a non-linear URL embedding for malicious URL detection directly from the URL. To learn the URL embedding, URLNet applies CNNs to both the characters and the words of the URL string. The proposed method has similar accuracy at word level and character level and performs much better than other methods. This method may fail if the phishing sites are represented with short URLs (bitly, goo, tiny, etc.) and data URLs.

Zhao et al [34] proposed a Gated Recurrent Neural Network model and showed that the Gated Recurrent Unit (GRU) outperformed the RF classifier with 21 features and achieved 2.1% better efficiency than RF, i.e., 98.5%. However, here only URLs are used as datasets, and all characters need to be transformed into vectors to learn hidden patterns. Hence, GRU needs more time to train and requires the system architecture to be optimized for better performance. Mohammad et al [35] proposed a model to predict phishing sites based on self-structuring neural networks. They used 17 features extracted from the URL and source code of the website. These features are used to classify websites in artificial neural networks. This model should be regularly retrained with up-to-date training datasets.

Feng et al [36] proposed a novel classification model for the detection of the legitimacy of a given website. They used the Monte Carlo algorithm [38] for training the model and the risk minimization principle to avoid over-fitting in the proposed model. They adopted 30 features from the UCI³ repository and achieved an accuracy of 97.71%.

Yi et al [37] proposed a deep learning framework with two types of feature sets, namely original and interaction features. The original features are extracted from the URL analysis, i.e., presence of special characters (@, _, Unicode), count of dots and age of the domain. The interaction features are extracted from the source code of the website, i.e. in-degree, out-degree, frequency of accessing URL and cookie absence. A Deep Belief Network (DBN) is applied to the extracted features and achieved a 90% true positive rate and a 0.6% false positive rate.

The related works with deep learning classifiers are summarized in table 1. The table gives a comparison between the proposed method and all other approaches using six different metrics. These metrics include the detection of phishing sites that replace textual content with an image (Image-based phishing), detection of phishing sites that contain most of the hyperlinks directed towards a common page (Common Page Detection), detection of phishing sites that are hosted in any language (Language independence), detection of phishing sites that consist of a maximum number of broken links (Broken links), detection of phishing sites based on different models and the number of features used for classification of phishing sites.

³ http://archive.ics.uci.edu/ml/index.php.

3. Proposed work

The goal of this work is to detect the status of a given URL using minimal distinctive features with deep learning classifiers. The architecture of the proposed system is shown in figure 1. The architecture comprises feature extraction, feature selection and classification methodologies. A set of webpage URLs is fed as input into the feature extractor, which extracts the required features from three sources (URL obfuscation, hyperlink and third-party-based). The extracted features are further fed to the IG feature ranking algorithm. The outcome of the algorithm helps in selecting the best performing features by a clear investigation in considering the dependences. The best performing 10 features are further trained through different deep learning methodologies to output the status of the URL as legitimate or phishing.

A detailed description of the individual models is as follows.

3.1 Feature extraction

The features are extracted from three sources:
• URL obfuscation features,
• hyperlink-based features and
• third-party-based features.

These features are extracted using Selenium with Python and BeautifulSoup, an HTML parser, for parsing the websites. The selection of prominent features from the extracted features is carried out using the IG mechanism. The IG for the features proposed by Rao and Pais [1] is given in table 2.

3.1.1 URL obfuscation features They are the characteristics that can be extracted from the URL itself. These features do not involve the website content or third-party services. Before defining different URL-based features, we have to understand the typical URL anatomy. A URL is a specific Uniform Resource Identifier (URI) that is used to locate existing resources on the Internet. It is used when a web client requests the server for resources such as HTML, CSS, images, videos or other hypermedia. A URL usually consists of four or five components. The typical structure of a URL is http://www.reg.signin.nitk.com.pk/secure/login/web/index.php. It consists of the following parts.
• Scheme: The scheme identifies the protocol used, Hypertext Transfer Protocol (HTTP) or HTTP with Secure Sockets Layer (HTTPS).
• Hostname: The hostname identifies the machine that contains the resources. The hostname includes the Generic top-level domain (gTLD) and Country-code top-level domain (ccTLD). In the given example, reg.signin indicates the subdomain, nitk is the primary domain, com is the gTLD and pk is the ccTLD.
• Path: The path identifies the basic or required information in the host that the web client wants to access. From the given URL example, the path-name is secure/login/web/index.php.
• A query string: When a query string is used, it follows the path component and provides a string of information that the resource can use for some purpose. The query string is usually a string of name and value pairs. An ampersand (&) separates name and value pairs. For example, in the URL http://www.google.co.uk/search?q=url&ie=utf-8, ?q=url&ie=utf-8 is the query string with name and value pairs q=url and ie=utf-8.

In the earlier work [1], five URL obfuscation features (UF1, UF2, UF3, UF4, UF5) were proposed. Out of those five, two features performed poorly under the IG algorithm, as shown in table 2. The best performing three features have been selected, and these features are as follows.

1. UF1: dots in hostname.
2. UF3: lengthy URL.
3. UF5: presence of HTTPS.

3.1.2 Hyperlink-based features These features are extracted from the hyperlinks in the source code of a website. A hyperlink is an electronic document element that connects one web source to another. The web source may be an image, program, HTML document or HTML document element. As mentioned earlier, the previous work [1] consists of eight hyperlink-based features that are used for phishing detection. We have selected the six best performing features among the eight based on the results of the IG analysis shown in table 2, and the selected features are as follows.

1. HF1: presence of domain in anchor links.
2. HF2: frequency of domain in image links.
3. HF3: common page detection ratio.
4. HF4: common page detection ratio in footer.
5. HF7: presence of anchor links in HTML body.
6. HF8: broken links ratio.

It may be observed that HF5 and HF6 have performed better in IG, but they have been eliminated since HF3 and HF4 capture the characteristics of these features [22].

3.1.3 Third-party-based feature In this section, we use third-party services such as WHOIS, Alexa and search engines for the extraction of third-party-based features. Surprisingly, out of these three features, the Alexa rank-based feature performed significantly better in the IG. Even though the other third-party features performed better than the other (URL obfuscation, hyperlink-based) features, we have not considered them in our feature selection to reduce the dependence on third-party services.

TF2: Alexa ranking is a third-party-based service used to classify the phishing sites. The rationale behind this feature is that phishing sites are low ranked, and target websites are highly ranked. This feature checks the rank of a suspicious website in the Alexa database. To calculate the rank, an HTTP request is sent to http://data.alexa.com/data?cli=10&url=’’?domain and an XML parser is used to get the Alexa ranking.

\mathrm{Pagerank} = \begin{cases} 0 & \text{if rank is not found} \\ \text{rank} & \text{otherwise} \end{cases}

The selected features from these three sources are highlighted and marked as selected features in table 3.

Table 3. Selected features.

Features from Rao and Pais [1]    Selected features (✓)
UF1                               ✓
UF2                               –
UF3                               ✓
UF4                               –
UF5                               ✓
TF1                               –
TF2                               ✓
TF31                              –
TF32                              –
TF33                              –
HF1                               ✓
HF2                               ✓
HF3                               ✓
HF4                               ✓
HF5                               –
HF6                               –
HF7                               ✓
HF8                               ✓

3.2 Feature selection

We have used IG as a ranking criterion to score the features, and by applying a threshold, we have selected the prominent features. The intuition behind ranking is to evaluate the relevance of the features for the detection of phishing websites. The relevance of a feature implies that the features may be mutually exclusive of each other, but a feature must not be completely independent of the class labels. There must exist a relation between feature and class labels. The
features that are irrelevant and have no relation or abysmal role can be discarded. Hence, our primary aim behind using this technique is to rank features based on their relevance and influence on the class labels. This could be used in the feature reduction process. IG [39] is measured based on the entropy of a system. The entropy is defined as a degree of disorder and impurity in the system. IG is defined as a reduction in the impurity, bringing more certainty in the system. For feature ranking purposes, we have calculated IG on the entire dataset. Summarizing, IG looks at each feature in isolation and is calculated on each feature independently. By computing the IG of each feature independently, we get a quantitative measure of the significance and relevance of this feature on class labels. Computation of information for a feature involves two steps.

1. Compute entropy of the class label for the entire dataset. It can be computed by the formula

   \mathrm{info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (1)

   where $m = 2$, i.e. the total unique number of class labels (phishing, legitimate), and in our dataset, it is two. $D$ represents a feature of a dataset. Hence, each feature has some instances belonging to one class and the remaining to another class; $p_i$ represents the probability of instances of $D$ that belong to the $i$th class. We can compute the probability $p_i$ by counting the number of instances of $D$ that belong to the $i$th class and then dividing by the total number of instances of $D$. Once we get $p_i$ for all $i$, we use Eq. (1) to calculate the entropy of $D$.
2. Computation of conditional entropy for each unique value of that feature: The calculation of conditional entropy requires a frequency count of the class label by feature value. The feature value can be continuous as well as discrete.
   i. For discrete-valued features, it can be calculated by the formula:

   \mathrm{info}_A(D) = \sum_{i=1}^{v} \frac{|D_i|}{|D|} \, \mathrm{info}(D_i) \qquad (2)

   \mathrm{IG}(A) = \mathrm{info}(D) - \mathrm{info}_A(D). \qquad (3)

The IG for the features proposed by Rao and Pais [1] is given in table 2.

The classification methodologies have been discussed in detail in section 4.3 along with the formal description of DNN, LSTM and CNN.

4. Implementation

Given a list of website URLs, we have trained and cross-validated the proposed deep-learning-based models to identify each URL as legitimate or phishing. We have used the Selenium library in Python and the Firefox web driver to get screenshots of website URLs, and also to download the source code. We used Beautiful Soup in Python to parse the source code to extract the required features. Screenshots and status codes are further used to verify that contents have not changed while extracting features from source code. Extracted datasets are further examined manually for legitimate URLs; duplicates and unwanted URLs (neither phishing nor legitimate) are removed from the PhishTank dataset. This process is to avoid legitimate sites being treated as phishing and reduce the processing time by avoiding unwanted comparisons.

4.1 Tools used

We have implemented Python scripts to extract all features using Python 3.6 from URL and URL content. We collected phishing URLs from the PhishTank⁴ website, and legitimate sites from the Alexa databases. When these URLs are fed as inputs to the Python script, all the essential features are extracted and stored in text files. These extracted features are passed to the deep learning algorithms for training and cross-validation so that they can classify URLs into legitimate and phishing sites. We have implemented the deep learning algorithms with the TensorFlow package, an open-source machine learning framework with a Python interface, which supports parallel computing.
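As an illustration of the IG ranking described in section 3.2, the following Python sketch computes Eqs. (1)–(3) for a single discrete-valued feature. It is a minimal sketch, not the authors' script: the function names and the toy data are hypothetical, and only the standard library is used.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """info(D) of Eq. (1): -sum_i p_i * log2(p_i) over the class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(A) of Eq. (3) for one discrete-valued feature A."""
    # Group the class labels by the value the feature takes.
    partitions = defaultdict(list)
    for value, label in zip(feature_values, labels):
        partitions[value].append(label)
    total = len(labels)
    # info_A(D) of Eq. (2): partition entropies weighted by partition size.
    conditional = sum(len(part) / total * entropy(part)
                      for part in partitions.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): 1 = phishing, 0 = legitimate.
labels = [1, 1, 0, 0]
perfect = [1, 1, 0, 0]  # feature that separates the two classes completely
useless = [1, 0, 1, 0]  # feature unrelated to the class label
ig_perfect = information_gain(perfect, labels)
ig_useless = information_gain(useless, labels)
```

On this toy data the separating feature receives the full class entropy as its gain and the unrelated feature receives none, which is the ordering a threshold on IG would exploit when ranking features.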
4.3 Deep learning algorithms

To evaluate the performance of the feature set, the feature set has been trained and cross-validated against many different parameter combinations. In the multi-layer feed-forward network, we must gather data based on feature sets and then tune the parameters to achieve maximum accuracy in phishing site classification. It is an essential process in which training networks must set parameters and validate across appropriate values. After attaining the right value, phishing sites can easily be classified with the highest probability. We used the Python programming language along with the TensorFlow library to implement the deep learning algorithms. From various combinations of hidden layers, we found that the DNN with 5 hidden layers achieved the best results. It can be understood that this permits the features we have extracted in the nonlinear, separable and complex functions to be represented most effectively. The proposed deep feed-forward neural network comprises 7 layers, with 5 hidden layers, one input layer and one output layer. All layers were followed and standardized by the Rectified Linear Unit (ReLU) or sigmoid function. The first four layers were followed by the ReLU function and the output layer by the sigmoid function. The rationale behind batch normalization is that it speeds up training by reducing the internal covariate shift and reducing over-fitting. ReLU activation has replaced sigmoidal or tanh activation functions in hidden layers due to its tendency to learn faster than sigmoidal or tanh, avoiding significant delays in the rate of gradient descent convergence after an initial set of iterations.

4.3.1 Formal description of DNN The DNN is a type of machine learning technology. It consists of many common neural network layers. It has one input layer, one output layer and at least one hidden layer, as shown in figure 2.

Each layer is composed of the basic computing unit, i.e. the neuron. The neuron is inspired by the biological neuron that performs mathematical functions for the storage of information. This information is transmitted to another neuron, and therefore information propagates in the neural network. A neuron's general mathematical representation is

Y^{k} = U\left( \sum_{k=0}^{k=n} W_{kj} x_j + b_k \right) \qquad (4)

where $U$ is the activation function, $W_k \in R^{L \times B}$ is the weight of the $K$th neuron and $Y^{k}$ is the output of the $K$th neuron. The number of neurons in the input layer depends upon the dimension of the dataset or, equivalently, the number of features of the dataset, i.e., $X \in R^{L \times K}$, where $L$ is the total number of the datasets, $K$ is the total number of features in the datasets and $R$ represents a real number. The number of neurons in the output layer depends on the number of outputs we want. The number of neurons in the hidden layer is a hyperparameter that needs to be tuned to obtain an optimum result. Since each neuron performs computation, the number of neurons defines the network complexity. Each DNN is a complex mathematical function that adapts itself according to the nature of the data. Hence, making the network more complex may result in over-fitting, i.e. it performs well with training data but fails to achieve good accuracy with unknown data.

Let $l = \{0,1,2,3,4,5\}$ be the layers in our deep learning model, $Y^{(l-1)}$ be the input to layers $\{1,2,3,4,5\}$ and $Y^{(l)}$ be the output value of a layer, where $W^{(l)}$ is the weight of layer $l$ that is used for the linear transformation of inputs from $n$ layers to the output of $m$ layers, $B^{(l)}$ is the bias of layer $l$ and $F^{(l)}$ is the associated activation function of each layer. $Y^{(0)}$ is nothing but the input layer and $Y^{(l)}$ is the output layer.

Z^{(l)} = Y^{(l-1)} W^{(l)} + B^{(l)}, \qquad (5)
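To make the layer-wise computation of Eq. (5) concrete, the sketch below runs a forward pass through a feed-forward network with 10 inputs, 5 hidden layers and one sigmoid output, in plain Python. It is an illustration under assumptions rather than the TensorFlow implementation used in the paper: the hidden-layer width of 16, the random weights and all function names are hypothetical, and batch normalization and training are omitted.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(y_prev, w, b, activation):
    """One layer: Z = Y_prev W + B as in Eq. (5), then the activation."""
    return [activation(sum(y * w[i][j] for i, y in enumerate(y_prev)) + b[j])
            for j in range(len(b))]


def forward(x, layers):
    """Feed the input through every (weights, bias, activation) triple."""
    y = x
    for w, b, activation in layers:
        y = dense(y, w, b, activation)
    return y


random.seed(0)
# 10 input features, 5 hidden layers (width 16 is an assumption), 1 output.
sizes = [10, 16, 16, 16, 16, 16, 1]
layers = []
for k, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
    w = [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    b = [0.0] * n_out
    # ReLU on the hidden layers, sigmoid on the final output layer.
    activation = sigmoid if k == len(sizes) - 2 else relu
    layers.append((w, b, activation))

x = [random.random() for _ in range(10)]  # one hypothetical feature vector
output = forward(x, layers)
```

The single output lies strictly between 0 and 1 and can be read as a phishing score once the weights have been trained; here the weights are random, so the value itself carries no meaning.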
In this neural network, we use 4 LSTM units that perform the mathematical computations defined earlier. Each unit has 10 time-steps. Each time-step has 1 output, which is passed as input to the next time-step. The output of the last time-step of the first unit is also passed to the second unit, and so on; finally, we obtain the output from the last time-step of the fourth LSTM unit. The obtained output is reduced to a single output by a dense layer and passed to the sigmoid function. The loss function is calculated, and the error is minimized using the Adam Optimizer. During back-propagation, each parameter of all 4 LSTM units is updated. The loss function is calculated again in each epoch; the network learns as the variables are updated. We modified the dataset dimensionality to implement the LSTM. We converted the 10 features to 10 time-steps, each time-step consisting of 1 feature. Hence, the new dimension of our dataset is (3526, 10, 1). Through the LSTM, we attempted to find possible relationships between different features.

Initially, a zero vector and the first time-step are passed to the first gate. The output of the first LSTM gate is passed as input to the second, and so on up to the tenth gate. The single output obtained from the tenth gate forms the output of a single LSTM unit, which is in turn passed as input to the first gate of the second unit. Hence, the output of the previous gate along with the current time-step is fed to the next gate, and the output of the previous LSTM unit is fed to the next unit, up to the fourth LSTM unit. The output of the fourth unit is reduced to 1 by a dense layer and passed on to the sigmoid activation function (8); the loss function (9) is then calculated and optimized using the Adam Optimizer.

4.3.3 Formal description of CNN A CNN is similar to an ordinary DNN. These networks consist of neurons that have weights and biases, which are updated during learning. Each of these neurons receives inputs that are converted into a linear combination, i.e. the dot product of the weights and the inputs plus a bias. However, instead of fully connected hidden layers, it performs convolution on the input $x \in \mathbb{R}^{L \times B}$. Convolution is performed using a convolution operator of length $L$ with stride $s$, and consists of convolving a filter $W \in \mathbb{R}^{B \times K}$.

Generally, CNNs are used with images due to the high correlation between pixels. CNNs can learn relations and different features using convolutional techniques, which are used in conjunction with pooling layers; batch normalization is applied before the output is passed to any activation function [9, 44]. CNNs have also been used in NLP after character encoding, due to the correlation between character sequences [9, 45].

In our work, the 10 features selected by the IG algorithm are fed to the CNN model to identify the status of the suspicious site. The proposed CNN model consists of 8 layers (6 convolution and 2 dense layers). In the first layer, the input is passed to the convolution layer and the output of this layer is activated using a tanh function (Eq. (16)). The activated output is then subjected to batch normalization and pooling. The obtained output is passed to the next convolutional layer. In this way, we have 6 convolutional layers connected sequentially, in which the output of one layer is the input of the next. At the seventh layer, a dense layer maps the output of the sixth convolutional layer to a size of 500, again activated using the tanh function, and a final dense layer reduces it to 1. The output of the tanh function is passed to the sigmoid activation function (8) to obtain an output in the range (0, 1). The loss function (9) is then calculated and optimized using the Adam Optimizer. Backpropagation then updates the variables, and thus the network learns.

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}. \tag{16}$$

We use the 10 features extracted in the feature selection process. The size of our dataset is 3526. Hence, the dimensionality of our converted dataset is (3526, 10, 1), and it is passed to our CNN model for phishing detection.

5. Results and discussions

We conducted experiments to evaluate the performance of our DNN, LSTM and CNN models with different features and parameters. All experiments were conducted with the same dataset of 3526 instances. Each experiment was repeated, with data randomly selected from the dataset. For evaluating our model, we used the accuracy (Eq. (17)) and error (Eq. (18)) rates as the main evaluation metrics. To calculate them, we considered the phishing sites as condition positive (P), where P represents the total number of phishing sites in our dataset. The legitimate sites are termed condition negative (N), where N represents the total number of legitimate sites in our dataset. The correctly classified phishing sites are termed True Positives (TP); the true positive rate is calculated as the ratio of correctly classified phishing sites to the total number of phishing sites (P). Correctly classified legitimate sites are termed True Negatives (TN); the true negative rate is calculated as the ratio of correctly identified legitimate sites to the total number of legitimate sites (N).

• Accuracy (ACC): Measures the rate of correctly classified legitimate and phishing sites over the total number of websites.

$$\mathrm{ACC} = \frac{TP + TN}{P + N}. \tag{17}$$

• Error Rate (ERR): Measures the rate of incorrectly classified legitimate or phishing websites.
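The accuracy of Eq. (17) can be computed from true and predicted labels as in the short sketch below. It treats TP and TN as counts, encodes phishing as 1 and legitimate as 0, and takes the error rate as the complement of the accuracy; the label encoding, the complement form of the error rate, the toy labels and the function name `accuracy_and_error` are assumptions for illustration only.

```python
def accuracy_and_error(y_true, y_pred):
    # Counts of correctly classified phishing (1) and legitimate (0) sites.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    p_total = sum(1 for t in y_true if t == 1)   # condition positive (P)
    n_total = len(y_true) - p_total              # condition negative (N)
    acc = (tp + tn) / (p_total + n_total)        # Eq. (17)
    err = 1.0 - acc                              # assumed form of the error rate
    return acc, err

# Hypothetical labels for eight sites (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc, err = accuracy_and_error(y_true, y_pred)
print(acc, err)  # 0.75 0.25
```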
Sådhanå (2020) 45:165 Page 11 of 18 165
| Layers | Number of units in layers | Learning rate | Optimizer | Epochs | Activation function |
|---|---|---|---|---|---|
| 6 | 10, 19, 100, 200, 300, 1 | 0.0001 | Adam Optimizer | 6000 | ReLU |
| Experiment | α value | No. of features | Epochs | Training accuracy (%) | Testing accuracy (%) |
|---|---|---|---|---|---|
| 1 | 0.001 | 18 | 5000 | 98.71 | 97.95 |
| 2 | 0.001 | 14 | 5000 | 99.96 | 99.20 |
| 3 | 0.001 | 10 | 5000 | 99.69 | 98.97 |
| 4 | 0.0001 | 10 | 5000 | 99.51 | 99.20 |
|  | 0.0001 | 10 | 6000 | 99.55 | 99.52 |
| Layers | Number of filters in layers | Learning rate | Optimizer | Epochs | Window size | Activation function | Stride |
|---|---|---|---|---|---|---|---|
| 7 | 32, 64, 64, 128, 128, 264, 512 | 0.001 | Adam Optimizer | 200 | 2 | tanh | 1 |
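To make the window size and stride in the CNN settings concrete, the sketch below applies a single 1-D convolution with tanh activation to one 10-feature sample in plain NumPy. The first filter count (32), window size 2, stride 1 and tanh follow the table; the absence of padding, the random weights, the random sample and the function name `conv1d_tanh` are assumptions, and batch normalization and pooling are omitted for brevity.

```python
import numpy as np

def conv1d_tanh(x, w, b, stride=1):
    """x: (steps, in_channels); w: (window, in_channels, filters); b: (filters,)."""
    window = w.shape[0]
    out_steps = (x.shape[0] - window) // stride + 1   # no padding (assumption)
    out = np.empty((out_steps, w.shape[2]))
    for i in range(out_steps):
        patch = x[i * stride: i * stride + window]    # (window, in_channels)
        # Sum over window and channel axes: one value per filter.
        out[i] = np.tensordot(patch, w, axes=([0, 1], [0, 1])) + b
    return np.tanh(out)

rng = np.random.default_rng(0)
sample = rng.random((10, 1))               # one sample: 10 time-steps, 1 feature each
w = rng.normal(0.0, 0.1, size=(2, 1, 32))  # window size 2, 32 filters
b = np.zeros(32)
feature_map = conv1d_tanh(sample, w, b, stride=1)
print(feature_map.shape)  # (9, 32)
```

Under these assumptions, a window of 2 with stride 1 shortens the 10-step input to 9 steps, so stacking several such layers without padding or with pooling quickly reduces the sequence length.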
results are tabulated in table 6. The small gap between training and test accuracy indicates that over-fitting is limited.

99.57%. The accuracy graph that we obtain is shown in figure 10. The testing and training accuracies of the individual features are given in figure 11. This figure shows that the training and testing accuracies of individual features are very close, which indicates that over-fitting is limited.
minimized the number of features and reduced the dependence on third-party services to achieve a significant accuracy of 99.57%. We also tested our features with various deep-learning-based models such as CNN, DNN and LSTM, and we achieved an accuracy of 99.57% with LSTM, 99.43% with CNN and 99.52% with DNN. Using only 10 features, the LSTM and DNN models achieved better results than the previous work [1], which used machine learning with 18 features.

In the future, we intend to include additional heuristic features that can detect phishing sites hosted on compromised domains, and also phishing sites that include embedded objects such as iframes, flash and HTML.

Acknowledgements

This research was funded by the Ministry of Electronics and Information Technology (MeitY), Government of India. The authors sincerely thank MeitY for financial support. The authors thank the anonymous referees for their comments and criticism, which have helped to improve the quality of the paper.

References

[1] Rao R S and Pais A R 2019 Detection of phishing websites using an efficient feature-based machine learning framework. Neural Comput. Appl. 31: 3851–3873
[2] APWG 2018 Phishing attack trends reports, first quarter 2018. https://docs.apwg.org//reports/apwg_trends_report_q1_2018.pdf, published July 31, 2018
[3] Fu A Y, Wenyin L and Deng X 2006 Detecting phishing web pages with visual similarity assessment based on earth mover’s distance (emd). IEEE Trans. Dependable Secure Comput. 3: 301–311
[4] Wenyin L, Huang G, Xiaoyue L, Min Z and Deng X 2005 Detection of phishing webpages based on visual similarity. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, ACM, pp. 1060–1061
[5] Hara M, Yamada A and Miyake Y 2009 Visual similarity-based phishing detection without victim site information. In: Proceedings of the IEEE Symposium on Computational Intelligence in Cyber Security, CICS’09, IEEE, pp. 30–36
[6] Rao R S and Ali S T 2015 A computer vision technique to detect phishing attacks. In: Proceedings of the Fifth International Conference on Communication Systems and Network Technologies (CSNT), IEEE, pp. 596–601
[7] Khonji M, Iraqi Y and Jones A 2013 Phishing detection: a literature survey. IEEE Commun. Surv. Tutor. 15: 2091–2121
[8] Zhang N and Yuan Y 2012 Phishing detection using neural network. Technical Report, Department of Computer Science, Department of Statistics, Stanford University (CS229 Lecture Notes)
[9] Le H, Pham Q, Sahoo D and Hoi S C 2018 URLNet: learning a URL representation with deep learning for malicious URL detection. arXiv preprint: arXiv:1802.03162
[10] Bahnsen A C, Bohorquez E C, Villegas S, Vargas J and González F A 2017 Classifying phishing URLs using recurrent neural networks. In: Proceedings of the APWG Symposium on Electronic Crime Research (eCrime), IEEE, pp. 1–8
[11] Whittaker C, Ryner B and Nazif M 2010 Large-scale automatic classification of phishing pages. In: Proceedings of the Network and Distributed System Security Symposium (NDSS), vol. 10
[12] Huh J H and Kim H 2011 Phishing detection with popular search engines: simple and effective. In: Proceedings of the International Symposium on Foundations and Practice of Security. Springer, pp. 194–207
[13] Jain A K and Gupta B B 2018 Two-level authentication approach to protect from phishing attacks in real time. J. Ambient Intell. Humaniz. Comput. 9: 1783–1796
[14] APWG 2014 Global phishing reports first half 2014. https://docs.apwg.org//reports/APWG_Global_Phishing_Report_1H_2014.pdf, published 25 September 2014
[15] Cao Y, Han W and Le Y 2008 Anti-phishing based on automated individual white-list. In: Proceedings of the 4th ACM Workshop on Digital Identity Management, ACM, pp. 51–60
[16] Zhang J, Porras P A and Ullrich J 2008 Highly predictive blacklisting. In: Proceedings of the USENIX Security Symposium, pp. 107–122
[17] Rao R S and Pais A R 2017 An enhanced blacklist method to detect phishing websites. In: Proceedings of the International Conference on Information Systems Security. Springer, pp. 323–333
[18] Zhang Y, Hong J I and Cranor L F 2007 Cantina: a content-based approach to detecting phishing web sites. In: Proceedings of the 16th International Conference on World Wide Web, ACM, pp. 639–648
[19] Pan Y and Ding X 2006 December Anomaly based web phishing page detection. In: Proceedings of the 2006 22nd Annual Computer Security Applications Conference (ACSAC’06), IEEE, pp. 381–392
[20] Horng M H S, Fan P, Khan M, Run R and Chen J L R 2011 An efficient phishing webpage detector. Expert Syst. Appl. Int. J. 38: 12018–12027
[21] Gowtham R and Krishnamurthi I 2014 A comprehensive and efficacious architecture for detecting phishing webpages. Comput. Secur. 40: 23–37
[22] Srinivasa Rao R and Pais A R 2017 Detecting phishing websites using automation of human behavior. In: Proceedings of the 3rd ACM Workshop on Cyber-Physical System Security, ACM, pp. 33–42
[23] Xiang G, Hong J, Rose C P and Cranor L 2011 Cantina+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. 14(2): 1–28
[24] Zhang D, Yan Z, Jiang H and Kim T 2014 A domain-feature enhanced classification model for the detection of Chinese phishing e-business websites. Inf. Manag. 51: 845–853
[25] Chiew K L, Chang E H and Tiong W K 2015 Utilisation of website logo for phishing detection. Comput. Secur. 54: 16–26
[26] Moghimi M and Varjani A Y 2016 New rule-based phishing detection method. Expert Syst. Appl. 53: 231–242
[27] Aggarwal A, Rajadesingan A and Kumaraguru P 2012 Phishari: automatic realtime phishing detection on twitter.
In: Proceedings of the eCrime Researchers Summit (eCrime), IEEE, pp. 1–12
[28] Marchal S, Armano G, Gröndahl T, Saari K, Singh N and Asokan N 2017 Off-the-hook: an efficient and usable client-side phishing prevention application. IEEE Trans. Comput. 66: 1717–1733
[29] Sahingoz O K, Buber E, Demir O and Diri B 2019 Machine learning based phishing detection from URLs. Expert Syst. Appl. 117: 345–357
[30] Li Y, Yang Z, Chen X, Yuan H and Liu W 2019 A stacking model using URL and HTML features for phishing webpage detection. Future Gener. Comput. Syst. 94: 27–39
[31] Jain A K and Gupta B B 2018 Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. 68: 687–700
[32] Yang P, Zhao G and Zeng P 2019 Phishing website detection based on multidimensional features driven by deep learning. IEEE Access 7: 15196–15209
[33] El-Alfy E S M 2017 Detection of phishing websites based on probabilistic neural networks and K-medoids clustering. Comput. J. 60: 1745–1759
[34] Zhao J, Wang N, Ma Q and Cheng Z 2018 Classifying malicious URLs using gated recurrent neural networks. In: Proceedings of the International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing. Springer, pp. 385–394
[35] Mohammad R M, Thabtah F and McCluskey L 2014 Predicting phishing websites based on self-structuring neural network. Neural Comput. Appl. 25: 443–458
[36] Feng F, Zhou Q, Shen Z, Yang X, Han L and Wang J 2018 The application of a novel neural network in the detection of phishing websites. J. Ambient Intell. Humaniz. Comput. 1–15
[37] Yi P, Guan Y, Zou F, Yao Y, Wang W and Zhu T 2018 Web phishing detection using a deep learning framework. Wirel. Commun. Mobile Comput. 2018: Article ID 4678746
[38] Zhou Q, Chen H, Zhao H, Zhang G, Yong J and Shen J 2016 A local field correlated and Monte Carlo based shallow neural network model for non-linear time series prediction. EAI Endorsed Trans. Scalable Inf. Syst. 3: e5-1–e5-7
[39] Quinlan J R 1986 Induction of decision trees. Mach. Learn. 1: 81–106
[40] Smith C and Jin Y 2014 Evolutionary multi-objective generation of recurrent neural network ensembles for time series prediction. Neurocomputing 143: 302–311
[41] Mikolov T, Joulin A, Chopra S, Mathieu M and Ranzato M A 2014 Learning longer memory in recurrent neural networks. arXiv preprint: arXiv:1412.7753
[42] Jozefowicz R, Zaremba W and Sutskever I 2015 An empirical exploration of recurrent network architectures. In: Proceedings of the International Conference on Machine Learning, pp. 2342–2350
[43] Hochreiter S and Schmidhuber J 1997 Long short-term memory. Neural Comput. 9: 1735–1780
[44] Krizhevsky A, Sutskever I and Hinton G E 2012 Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
[45] Pham N Q, Kruszewski G and Boleda G 2016 Convolutional neural network language models. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1153–1162
[46] Ramesh G, Krishnamurthi I and Kumar K S S 2014 An efficacious method for detecting phishing webpages through target domain identification. Decis. Support Syst. 61: 12–22
[47] He M, Horng S J, Fan P, Khan M K, Run R S, Lai J L, Chen R J and Sutanto A 2011 An efficient phishing webpage detector. Expert Syst. Appl. 38: 12018–12027
[48] Marchal S, Armano G, Gröndahl T, Saari K, Singh N and Asokan N 2017 Off-the-hook: an efficient and usable client-side phishing prevention application. IEEE Trans. Comput. 66: 1717–1733
[49] Gowtham R and Krishnamurthi I 2014 A comprehensive and efficacious architecture for detecting phishing webpages. Comput. Secur. 40: 23–37