Detection of Phishing Websites Using Random Forest and XGBOOST
Abstract
Mitigating the risk posed by phishers and other cybercriminals in cyberspace requires a
robust and automatic means of detecting phishing websites and phishing emails, since the
culprits are constantly coming up with new techniques for achieving their goals. Many
approaches have been proposed in an attempt to curb the problems caused by phishers. In
this study, we extend and improve on the existing methods by proposing a hybrid
technique that combines the Random Forest and XGBoost algorithms. Random Forest (RF) was used to rank
and select the most relevant features of our dataset, while XGBoost was used to build the
model from the selected features. The model was evaluated and tested on a phishing dataset of 11055
instances from the UCI repository, consisting of 4898 legitimate and 6157 phishing websites,
using Accuracy, Recall, Matthews Correlation Coefficient (MCC), Precision and Fscore as
performance metrics. The proposed method was compared with some state-of-the-art
methods from the literature, and the results showed that the proposed method was
the most robust in terms of the aforementioned evaluation metrics.
Keywords: Phishing websites, Random Forest, Xgboost, Algorithm, cyberspace
1. INTRODUCTION
Phishing is a cybercrime in which cybercriminals attempt to obtain sensitive
information from users, such as usernames, passwords and credit card details, often for
malicious purposes, by disguising themselves as a trustworthy entity in an electronic communication
(Toolan and Carthy, 2009). The information gained by phishers is often used to access
users' important accounts (Facebook, Twitter, email and bank), which may result in identity
theft and financial losses (Gupta, Tewari, Jain and Agrawal, 2017). The word phishing was
first coined in 1996 as a form of online identity theft after an attack by hackers on
Open Access Journal www.smrpi.com 1
Frontiers of Knowledge Journal Series | International Journal of Pure and
Applied Sciences ISSN: 2635-3393 | Vol. 2 Issue 3 (September, 2019)
America Online accounts (Khonji, Iraqi and Jones, 2013), and the first phishing lawsuit was
filed in 2004 against a California teenager who created an imitation of the
“America Online” website to gain access to users' sensitive information, including credit card details,
causing them huge financial losses. Phishers operate by sending fake emails to their victims,
pretending to be from legitimate and well-known organizations such as banks, universities and
communication networks. The email content deceives or pressures the victims into following a
link to a fake website, where they are asked to update personal information,
including their usernames and passwords, to avoid losing access to some of the
services provided by that organization. Phishers use this avenue to obtain users' sensitive
information, which they in turn use to access the victims' important accounts, resulting in identity
theft and financial loss (Abdelhamid, Ayesh and Thabtah, 2014). Mitigating the risk posed by
phishers and other cybercriminals in cyberspace requires a robust and automatic means
of detecting phishing websites and phishing emails, since the culprits are constantly coming
up with new techniques for achieving their goals. Many approaches have been proposed in
an attempt to curb the problems caused by phishers (Abu-Nimeh, Nappa, Wang and Nair,
2007)-(El-Alfy, 2017). However, due to the dynamic nature of attackers and the
challenging nature of the problem, it still lacks a complete solution. Recently, machine
learning approaches have been found to be very successful in the automated detection of
phishing websites. This paper builds on that work by using XGBoost, an optimized
implementation of the gradient boosted decision tree algorithm, together with the Random Forest (RF)
algorithm, to improve the performance that a predictive model can achieve in distinguishing
a phishing website from a legitimate website.
2. PROBLEM STATEMENT
Advancement in technology has made cyberspace an avenue for banking,
shopping, education, and entertainment. However, as most human activities are being
moved to cyberspace, phishers and other cybercriminals are making it
unsafe by posing serious risks to users and businesses as well as threatening global security
and the economy (Gupta, Tewari, Jain and Agrawal, 2017). A cybercrime in which an
attacker attempts to obtain sensitive information such as usernames, passwords and
credit card information, often for malicious purposes, by masquerading as a trustworthy entity
in an electronic communication is known as phishing. Today, phishers are constantly
evolving the techniques they use to lure users into revealing their sensitive information.
They use this information to access important accounts of their victims, resulting in
identity theft, denial of service, financial losses and damage to reputation (Abdelhamid,
Ayesh and Thabtah, 2014). Many techniques have been proposed in the past for phishing
website detection; however, due to the dynamic and challenging nature of the problem, it
still lacks a complete solution. Consequently, this work tries to improve the
performance that a predictive model can achieve in the task of phishing website detection
by integrating the RF and XGBoost algorithms.
3. RELATED STUDIES
Many studies have been proposed to mitigate the risk posed by phishers and other
cybercriminals in cyberspace. A few of these studies are presented below.
Kaytan and Hanbay (2017) proposed an intelligent phishing website
detection model based on the extreme learning machine. They tested the proposed model using
a dataset having 30 input features and 1 output feature. 10-fold cross validation was used
for splitting the dataset into training and testing sets. The proposed model obtained an
average classification accuracy of 95.05%.
Abu-Nimeh et al. (Abu-Nimeh, Nappa, Wang and Nair, 2007) evaluated the performance
of several machine learning algorithms in the detection of phishing emails, including
logistic regression, Classification and Regression Tree (CART), Support Vector Machine
(SVM), neural network and Random Forest, using a dataset consisting of 2889 legitimate
and phishing emails. They also used 10-fold cross validation in splitting the dataset for
training and testing. The results of their experiments revealed that Random Forest
had the best performance when legitimate and phishing emails are weighted equally, with an error rate
of 7.07%.
Zouina and Outtaj (2017) proposed a lightweight
phishing detection system using SVM and a similarity index. They tested the performance of
the proposed method on a dataset of 2000 websites, consisting of 1000 legitimate and
1000 phishing websites, using only six features. The six features included a similarity index,
which is a new feature proposed by the authors. Their results revealed that the new feature
(similarity index) improves the overall detection rate by 21%.
Although many approaches have been proposed in an attempt to curb the problems
caused by phishers, the problem still lacks a complete solution because of its dynamic and
challenging nature. Consequently, this work tries to improve the
performance that a predictive model can achieve in the task of phishing website detection
by integrating the RF and XGBoost algorithms.
4. METHODOLOGY
4.1 Data Gathering and Description
Most researchers in phishing detection make use of datasets that they construct
themselves. However, with such datasets it is difficult to evaluate and compare the
performance of a model with other models from the literature, since the datasets
are not publicly available for others to use and confirm the results; such
results therefore cannot be generalized (El-Alfy, 2017).
In order to assess and compare the predictive performance of the proposed model,
we adopted a recently created phishing dataset from the UCI Machine Learning Repository. This
dataset was created by Mohammad, Thabtah and McCluskey at the University of
Huddersfield, United Kingdom (Mohammad, Thabtah and McCluskey, 2014). The dataset
has a total of 11055 website instances pre-classified as legitimate (non-phishing) or
phishing, with 30 features. 4898 of the instances are legitimate while the remaining
6157 are phishing. The description of the adopted features of the dataset is presented in the
table below.
Table 1: Adopted Features of Dataset
s/n  Features                      Feature Notation       Value range
1    Having IP Address             Has_ip                 {-1, 1}
2    URL Length                    url_length             {-1, 0, 1}
3    Using URL Shortening Service  Short_service          {-1, 1}
4    URL having the @ symbol       Has_@_symbol           {-1, 1}
5    URL has redirect symbol       Double_slash_redirect  {-1, 1}
6    Prefix or suffix to domain    Pref_suf               {-1, 1}
7    Having subdomains             Has_subdomain          {-1, 1}
In the feature importance plot, the y axis represents the importance value of each feature in the dataset while the x
axis represents the different features in the dataset. Based on their importance values, 24
features were selected and used for the purpose of this work.
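The selection step described above can be sketched as follows. This is a minimal illustration, assuming the importance scores have already been obtained from a fitted random forest (for example, scikit-learn's RandomForestClassifier exposes them as `feature_importances_`). The function name `select_top_features` and the scores in `example_importances` are hypothetical and are not the values obtained in this work.

```python
from typing import Dict, List


def select_top_features(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance, best first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort by descending importance; break ties by name so the result is reproducible.
    ranked = sorted(importances.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical importance values, for illustration only (not the paper's results).
example_importances = {
    "Has_ip": 0.02,
    "url_length": 0.01,
    "Short_service": 0.005,
    "Has_@_symbol": 0.004,
    "Pref_suf": 0.05,
    "Has_subdomain": 0.07,
}

selected = select_top_features(example_importances, k=3)
print(selected)
```

With the full 30-feature dataset, the same call with k=24 would keep the 24 highest-ranked features and drop the remaining six.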
4.3 Design of Xgboost Classifier
XGBoost (Extreme Gradient Boosting) is an optimized implementation of
gradient boosted trees, first introduced by Chen and Guestrin (Chen and Guestrin, 2016). It
is mostly employed in classification tasks, where it is used as a classifier for mapping an input
pattern to a specific class. XGBoost implements a process known as boosting to improve
the performance of gradient boosted trees. Boosting is an ensemble technique that attempts
to create a stronger classifier from a number of weak classifiers (James, Witten, Hastie and
Tibshirani, 2014). XGBoost has many strengths when compared to traditional
gradient boosting implementations. Among its strengths are better regularization,
which helps to reduce overfitting; high speed and performance owing to the parallel manner
in which trees are built; flexibility due to its custom optimization objectives and evaluation
criteria; and inbuilt routines for handling missing values. These and many other advantages
have made XGBoost a tool of choice for many researchers in data science
and machine learning, as can be seen in the following articles (Zimmermann, Djürken,
Mayer, Janke, Boissier and Schlosser, 2017)-(Zhang and Zhan, 2017).
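The idea of building a strong model from weak ones can be illustrated with a toy example. The sketch below is plain gradient boosting with squared-error loss on one-dimensional regression stumps. It is not XGBoost's algorithm (it has no regularization term and no second-order information), and the names `fit_stump`, `predict` and `boost`, as well as the toy data, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# A stump is (threshold, value if x <= threshold, value if x > threshold).
Stump = Tuple[float, float, float]


def fit_stump(xs: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Find the single-split stump that best fits the residuals (squared error)."""
    best: Stump = (0.0, 0.0, 0.0)
    best_err = float("inf")
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_val = sum(left) / len(left)  # never empty: threshold is one of xs
        right_val = sum(right) / len(right) if right else 0.0
        err = sum(
            (r - (left_val if x <= threshold else right_val)) ** 2
            for x, r in zip(xs, residuals)
        )
        if err < best_err:
            best_err, best = err, (threshold, left_val, right_val)
    return best


def predict(stumps: Sequence[Stump], x: float, learning_rate: float) -> float:
    """Sum the shrunken outputs of all stumps fitted so far."""
    return sum(learning_rate * (lo if x <= t else hi) for t, lo, hi in stumps)


def boost(xs: Sequence[float], ys: Sequence[float],
          n_rounds: int = 20, learning_rate: float = 0.5) -> List[Stump]:
    """Each round fits a weak learner to the residuals of the current ensemble."""
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [y - predict(stumps, x, learning_rate) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return stumps


# Toy data: targets are -1 (legitimate) or 1 (phishing), as in the dataset encoding.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
stumps = boost(xs, ys)
scores = [predict(stumps, x, 0.5) for x in xs]
print([1 if s > 0 else -1 for s in scores])
```

Each stump on its own is a weak predictor, but because every round corrects the residual error of the ensemble so far, the summed prediction approaches the targets.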
As an optimization of gradient boosted trees, XGBoost adds a regularization term to the
loss function to form the objective function used for measuring performance, given by:
$$\mathrm{Obj}(\theta) = L(\theta) + \Omega(\theta) \qquad \text{Eqn (1)}$$
where L is the training loss function and Ω is the regularization term. The training loss
measures the performance of the model on the training data. The regularization term controls
the complexity of the model, which helps to limit over-fitting.
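As a numerical illustration of an objective of this kind, the sketch below adds a complexity penalty to a training loss. It assumes binary log loss as the training loss, with labels recoded to 0/1 (the dataset itself uses -1/1), and takes the per-tree penalty in the form used by XGBoost, γT + ½λ‖w‖², where T is the number of leaves and w the leaf scores. The function names and example numbers are hypothetical.

```python
import math
from typing import Sequence


def log_loss(y_true: Sequence[int], p_pred: Sequence[float]) -> float:
    """Summed binary log loss; labels are 0/1 and predictions are probabilities."""
    eps = 1e-15  # keep probabilities away from 0 and 1 to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


def tree_penalty(leaf_weights: Sequence[float], gamma: float, lam: float) -> float:
    """Complexity penalty of one tree: gamma * T + 0.5 * lam * sum(w^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true: Sequence[int], p_pred: Sequence[float],
              trees: Sequence[Sequence[float]],
              gamma: float = 0.0, lam: float = 1.0) -> float:
    """Training loss plus the summed penalty of every tree in the ensemble."""
    return log_loss(y_true, p_pred) + sum(tree_penalty(t, gamma, lam) for t in trees)


# One hypothetical tree with two leaves, scores 1.0 and -2.0.
print(tree_penalty([1.0, -2.0], gamma=0.5, lam=2.0))
print(objective([1, 0], [0.5, 0.5], [[1.0, -2.0]], gamma=0.5, lam=2.0))
```

Larger values of gamma or lam make complex trees more expensive, so two models with the same training loss are ranked by their complexity.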
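Since the base model is a decision tree, the output of the model $\hat{y}_i$ is voted or averaged over a
collection F of K trees, denoted as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \qquad \text{Eqn (2)}$$
The objective function then becomes:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad \text{Eqn (3)}$$
where n is the number of predictions, l is the loss on a single prediction, and Ω is the regularization term defined as:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2} \qquad \text{Eqn (4)}$$
where γ is the complexity cost of each leaf, T is the number of leaves in a decision tree, λ is a
parameter that scales the penalty, and w is the vector of scores on the leaves.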
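The performance metrics are computed from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad \text{Eqn (5)}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \text{Eqn (6)}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{Eqn (7)}$$
$$\mathrm{Fscore} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad \text{Eqn (8)}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad \text{Eqn (9)}$$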
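The five evaluation metrics used in this work (Accuracy, Recall, Precision, Fscore and MCC) can all be computed from confusion-matrix counts. The following is a minimal sketch using their standard definitions. The function name `classification_metrics`, the zero-denominator handling (returning 0.0) and the example counts are assumptions for illustration, not values from the experiments.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F-score and MCC from confusion counts.

    Any metric whose denominator is zero is reported as 0.0 (an assumption).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
        "mcc": mcc,
    }


# Hypothetical counts for a 200-instance test set.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it informative when the two classes are not equally represented.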
5. EXPERIMENTAL SETUP
The proposed methodology was implemented in the Python programming language, and
all experiments were carried out on a Lenovo machine running a 64-bit Windows
operating system, with an AMD E1 Essential CPU at 1.00 GHz and 4.00 GB of RAM. After
pre-processing, the dataset was divided into 70% for training and 30% for testing using
stratified hold-out validation. Finally, based on the relevance of the features to this
problem, only the first 24 features in the importance ranking were selected and used for this work.
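A stratified 70/30 hold-out split keeps the class proportions the same in the training and testing sets. The sketch below shows one way this can be done with the standard library only. The function name `stratified_holdout`, the seed and the toy labels are assumptions. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument serves the same purpose.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_holdout(labels: Sequence[int], test_fraction: float = 0.3,
                       seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split instance indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx: List[int] = []
    test_idx: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random assignment within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 instances of class 1 and 40 of class -1 (not the real dataset).
labels = [1] * 60 + [-1] * 40
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.3, seed=42)
print(len(train_idx), len(test_idx))
```

Because the split is done per class, the 60:40 class ratio of the toy labels is preserved in both subsets.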
5.1 Result of Experiments
[Figure: bar chart of performance measures (Rec, Prc, Fscore) for RF and Xgboost, Xgboost, RF and PNN. Recoverable data row: Rec 0.9735, 0.972, 0.7022, 0.9626]
[Figure: bar chart of performance measures (Rec, Prc, Fscore, MCC, Acc) for RF and Xgboost, Xgboost, RF and PNN, with the data table below]
Measure  RF and Xgboost  Xgboost  RF      PNN
Rec      0.9773          0.9789   0.9785  0.9789
Prc      0.9713          0.9726   0.9442  0.964
Fscore   0.9709          0.9721   0.9611  0.9714
MCC      0.9418          0.9443   0.9127  0.935
Acc      0.9713          0.9726   0.9565  0.9679
6. CONCLUSIONS
In this work, we have proposed a hybrid technique for the
detection of phishing websites that integrates the RF and XGBoost algorithms. RF is used
to identify the most relevant features for the experiments, thereby reducing
computational time, while XGBoost is used for the detection. The robustness of the
proposed method was evaluated in comparison with the individual algorithms and a
recently proposed phishing detection technique (PNN), using MCC, Fscore and Acc as
performance metrics. From the experiments conducted, the proposed technique turned out
to be the most robust among the algorithms compared.
7. REFERENCES
Toolan, F. and Carthy, J. (2009). “Phishing detection using classifier ensembles,” In eCrime
Researchers Summit, eCRIME ’09 (pp. 1-9), IEEE, 2009.
Gupta, B. B., Tewari, A., Jain, A. K. and Agrawal, D. P. (2017). “Fighting against phishing
attacks: state of the art and future challenges,” Neural Computing and
Applications, 28(12), pp. 3629-3654, 2017.
Khonji, M., Iraqi, Y. and Jones, A. (2013). “Phishing detection: a literature survey,” IEEE
Communications Surveys & Tutorials, 15(4), pp. 2091-2121, 2013.
Abu-Nimeh, S., Nappa, D., Wang, X. and Nair, S. (2007). “A comparison of machine
learning techniques for phishing detection,” In Proceedings of the anti-phishing
working groups 2nd annual eCrime researchers summit (pp. 60-69), ACM, 2007.
Zouina, M. and Outtaj, B. (2017). “A novel lightweight URL phishing detection system
using SVM and similarity index,” Human-centric Computing and Information
Sciences, 7(1), p. 17, 2017.
Kaytan, M. and Hanbay, D. (2017). “Effective classification of phishing web pages based
on new rules by using extreme learning machines,” Anatolian Journal of Computer
Sciences, 2(1), pp. 15-36, 2017.
Zhang, D., Qian, L., Mao, B., Huang, C., Huang, B. and Si, Y. (2018). “A Data-Driven
Design for Fault Detection of Wind Turbines Using Random Forests and XGboost,”
IEEE Access, 6, pp. 21020-21031, 2018.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2014). “An Introduction to Statistical
Learning with Applications in R,” Springer, 2014.
Zimmermann, T., Djürken, T., Mayer, A., Janke, M., Boissier, M., Schwarz, C., Schlosser,
R. and Uflacker, M. (2017). “Detecting Fraudulent Advertisements on a Large E-
Commerce Platform,” In EDBT/ICDT Workshops, 2017.
Zhang, L. and Zhan, C. (2017). “Machine Learning in Rock Facies Classification: An
Application of XGBoost,” In International Geophysical Conference, Qingdao,
China, 17-20 April 2017 (pp. 1371-1374), Society of Exploration Geophysicists and
Chinese Petroleum Society.
Author(s)
1 Ali Ahmad Aminu
2 Abdulrahman Abdulkarim
4 Muhammad Aliyu