Phishing Websites Detection Using Machine Learning

Conference Paper · November 2019

International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-8, Issue-2S11, September 2019

Phishing Websites Detection Using Machine Learning
R. Kiruthiga, D. Akila

Manuscript received September 16, 2019.
R. Kiruthiga, Ph.D. Research Scholar, Department of Computer Science, VELS Institute of Science, Technology & Advanced Studies, Chennai, Tamilnadu. (e-mail: [email protected])
Dr. D. Akila, Associate Professor, Department of Information Technology, School of Computing Sciences, VELS Institute of Science, Technology & Advanced Studies, Chennai, Tamilnadu, India. (e-mail: [email protected])

Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: B10180982S1119/2019©BEIESP
DOI: 10.35940/ijrte.B1018.0982S1119

Abstract--- Phishing is a common attack on credulous people that makes them disclose their personal information using counterfeit websites. The objective of phishing website URLs is to steal personal information such as user names, passwords and online banking transactions. Phishers use websites that are visually and semantically similar to the real websites. As technology continues to grow, phishing techniques have progressed rapidly, and this needs to be countered with anti-phishing mechanisms that detect phishing. Machine learning is a powerful tool for fighting phishing attacks. This paper surveys the features used for detection and the detection techniques that use machine learning.
Keywords--- Phishing, Phishing Websites, Detection, Machine Learning.

I. INTRODUCTION
Phishing is one of the most dangerous criminal activities in cyberspace. Since most users go online to access the services provided by government and financial institutions, phishing attacks have increased significantly over the past few years. Phishers profit from these attacks and run them as a successful business. They use various methods to attack vulnerable users, such as messaging, VoIP, spoofed links and counterfeit websites. It is very easy to create a counterfeit website that looks like a genuine website in layout and content; the content may even be identical to that of the legitimate website. The reason for creating these websites is to obtain private data from users, such as account numbers, login IDs and debit and credit card passwords. Moreover, attackers ask users to answer security questions, presenting this as a high-level security measure. Users who respond to those questions are easily trapped by phishing attacks. Much research is under way in different communities around the world to prevent phishing attacks. Phishing attacks can be prevented by detecting phishing websites and by making users aware of how to identify them. Machine learning algorithms have been among the most powerful techniques for detecting phishing websites. This study discusses various methods of detecting phishing websites.

II. LITERATURE REVIEW
The authors of [1] presented a novel approach to detecting phishing websites using machine learning algorithms. They compared the accuracy of five machine learning algorithms: Decision Tree (DT), Random Forest (RF) [1], Gradient Boosting (GBM), Generalized Linear Model (GLM) and Generalized Additive Model (GAM) [1]. Accuracy, precision and recall were calculated for each algorithm and compared. Thirty website attributes were extracted with Python, and the performance evaluation was done in the open-source programming language R. The top three algorithms, namely Decision Tree, Random Forest and GBM, were compared in a table. The tables of accuracy, recall and precision show that the Random Forest algorithm gave the highest scores: 98.4% accuracy, 98.59% recall and 97.70% precision.
The authors of [2] propose a classification model [2] to classify phishing attacks. The model comprises feature extraction from websites and classification of the websites. For feature extraction, 30 features were taken from the UC Irvine (UCI) machine learning repository data set, and the phishing feature extraction rules were clearly defined. To classify these features, Support Vector Machine (SVM), Naïve Bayes (NB) and Extreme Learning Machine (ELM) [2] were used. Six activation functions were used in the ELM, which achieved 95.34% accuracy, higher than SVM and NB. The results were obtained with MATLAB.
The authors of [3] present an approach to detecting phishing email attacks using natural language processing and machine learning. Semantic analysis of the text is performed to detect malicious intent. A Natural Language Processing (NLP) technique parses each sentence and finds the semantic roles of the words in the sentence in relation to the predicate. Based on the role of each word in the sentence, the method recognizes whether the sentence is a question or a command. Supervised machine learning [3] is used to generate the blacklist of malicious pairs. The authors defined the algorithm SEAHound [3] for detecting phishing emails, and the Netcraft Anti-Phishing Toolbar is used to verify the validity of a URL. The algorithm is implemented with Python scripts and uses the Nazario phishing email set as its dataset. The results of Netcraft and SEAHound [3] were compared, with precision of 98% and 95% respectively.
This result demonstrates that semantic information is a strong indicator of social engineering.
Another approach, by the authors of [4], uses feature selection algorithms to reduce the dimensions of the dataset in order to obtain higher classification performance [4]. The results were also compared across other data mining classification algorithms. The dataset of phishing websites was taken from the UCI machine learning repository [4]. The results show that, with the reduced feature set, some classification methods improve in performance while others decline. Bayesian Network, Stochastic Gradient Descent (SGD), lazy.KStar, Randomizable Filtered Classifier, Logistic Model Tree (LMT) and ID3 (Iterative Dichotomiser) [4] are useful for the reduced phishing dataset, whereas Multilayer Perceptron, JRip, PART, J48 [4], Random Forest and Random Tree are not. Lazy.KStar obtained 97.58% accuracy with 27 reduced features. The study was carried out with the WEKA software.
The authors of [5] proposed a model for recognizing phishing sites through URL detection using the Random Forest algorithm. The model has three stages, namely Parsing, Heuristic Classification of data and Performance Analysis [5]. Parsing is used to analyze the feature set. The dataset was gathered from PhishTank. Out of 31 features, only 8 are considered for parsing. The Random Forest method obtained an accuracy of 95%.
The authors of [6] proposed a flexible filtering decision module that uses a neural network model to extract features automatically, without any specific expert knowledge of the URL domain. In this approach, the authors used all the characters included in the URL strings and counted byte values. They not only counted byte values but also overlapped parts of neighbouring characters by shifting 4 bits. They embed the combined information of two characters appearing sequentially, count how many times each value appears in the original URL string, and obtain a 512-dimension vector. The neural network model was tested with three optimizers: Adam, AdaDelta and SGD. Adam was the best optimizer, with an accuracy of 94.18%. The authors also conclude that the accuracy of this model is higher than that of the previously proposed complex neural network topology.
The authors of [7] made a comparative study of malicious URL detection using a classical machine learning technique, logistic regression with bigrams, and deep learning techniques such as the convolutional neural network (CNN) and CNN long short-term memory (CNN-LSTM) [7] architectures. The phishing URLs were collected from PhishTank and OpenPhish, and the malicious URLs from the MalwareDomainlist and MalwareDomains datasets. In the comparison, CNN-LSTM obtained 98% accuracy. The authors used TensorFlow [7] in conjunction with Keras [7] for the deep learning architecture.
The authors of [8] also proposed a reduced feature selection model to detect phishing websites. They used Logistic Regression (LR) and Support Vector Machine (SVM) [8] as classification methods to validate the feature selection method. Nineteen features, reduced from 30 site features, were selected and used for phishing detection. The performance of the LR and SVM algorithms was assessed on precision, recall, F-measure and accuracy. The study shows that the SVM algorithm performed better than the LR algorithm.
The authors of [9] proposed a phishing detection model that detects phishing effectively by mining word-embedding semantic features, semantic features and multi-scale statistical features [9] in Chinese web pages. Eleven features were extracted and categorized into five classes to obtain the statistical features of web pages. AdaBoost, Bagging, Random Forest and SMO [9] were used to train and test the model. The legitimate URL dataset was obtained from DirectIndustry web guides, and the phishing data from the Anti-Phishing Alliance of China. According to the study, the semantic features alone identified phishing sites with high detection [9] efficiency, and the fusion model achieved the best detection performance. The model is specific to Chinese web pages and depends on that language.
This paper [10] proposes an efficient way to detect phishing URL websites using the C4.5 decision tree approach. The technique extracts features from the sites and calculates heuristic values. These values are given to the C4.5 decision tree algorithm [10] to determine whether the site is phishing or not. The dataset was collected from PhishTank and Google. The process includes two phases, a pre-processing phase and a detection phase [10]. Features are extracted according to rules in the pre-processing phase, and the features and their respective values are input to the C4.5 algorithm, which obtained 89.40% accuracy.
The authors of [11] created a Google Chrome extension that detects phishing website content with the help of machine learning algorithms. The UCI Machine Learning Repository dataset was used, and 22 features were extracted from it. The kNN, SVM and Random Forest algorithms were compared on precision, recall, F1-score and accuracy. Random Forest obtained the best score, and HTML, JavaScript and CSS [11] were used along with Python to implement the Chrome extension. A drawback of this extension is that it relies on a declared list of malicious sites, which grows every day.
This paper [12] presents a framework for extracting features flexibly and simply with new strategies. Phishing data was collected from PhishTank [12] and legitimate URLs from Google [12]. C# and R programs were used to obtain the text properties. In total, 133 features were obtained from the dataset and third-party service providers. CFS subset-based and Consistency subset-based feature selection [12] methods were used for feature selection, and the results were analyzed with the WEKA tool. The Naïve Bayes and Sequential Minimal Optimization (SMO) [12] algorithms were compared in the performance evaluation, and the authors prefer SMO to NB for phishing detection.
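The byte-counting features described for [6] above can be illustrated with a short Python sketch. This is not the implementation from [6]: the survey text does not say how the 512 dimensions are laid out, so the split below (256 counts of plain byte values followed by 256 counts of values read at a 4-bit offset) is an assumption, and the function name `bag_of_bytes` is hypothetical.

```python
def bag_of_bytes(url):
    """Return a 512-element count vector for a URL string.

    Assumed layout (not specified in the survey text):
      [0:256]   counts of each byte value in the URL
      [256:512] counts of each 8-bit value read at a 4-bit offset, i.e. the
                low nibble of one byte joined with the high nibble of the next
    """
    data = url.encode("utf-8")
    vector = [0] * 512

    # Plain byte histogram.
    for byte in data:
        vector[byte] += 1

    # Histogram of the values that straddle two neighbouring characters.
    for first, second in zip(data, data[1:]):
        shifted = ((first & 0x0F) << 4) | (second >> 4)
        vector[256 + shifted] += 1

    return vector


features = bag_of_bytes("http://example.com/login")
print(len(features))  # fixed-length input for a neural network
```

Because the vector length is fixed regardless of URL length, it can be passed directly to a classifier without hand-written URL rules, which is the property the paragraph on [6] emphasizes.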
Another heuristic feature detection method, by the authors of [13], uses URL features such as PrimaryDomain, SubDomain and PathDomain, and website ranking features such as PageRank, AlexaRank and AlexaReputation, to identify phishing websites. The dataset was taken from PhishTank, and the experiment was split into 6 phases, carried out with MySQL and PHP on 10 testing datasets. The proposed model contains two phases. In Phase I the site features are extracted, and in Phase II six heuristic values are calculated. According to the authors, if the heuristic value is nearest to one the site is considered legitimate, and if it is nearest to zero the site is suspected of being a phishing site. Root Mean Square Error (RMSE) [13] is used to calculate accuracy, and the model obtained 97% accuracy.
The author of [14] introduces a phishing URL detection system based on URL lexical analysis, named PhishScore. This approach is based on intra-URL relatedness [14][18]. This relatedness reflects the relationship among the parts of the URL. Around 12 features extracted from a single URL are fed to machine learning algorithms to identify phishing URLs. The experiment resulted in an accuracy of 94.91%.
RESULTS
This paper [15] focuses on detecting phishing website URLs using domain name features. The categories of web spoofing attack detection, namely content-based, heuristic-based and blacklist-based approaches [8][17], are explained, and the proposed model, PhishChecker, was developed with Microsoft Visual Studio Express 2013 and the C# language [15]. The dataset was taken from PhishTank and the Yahoo directory set, and the model obtained an accuracy of 96%. This paper checks only the validity of URLs.
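The URL parts and the nearest-to-one or nearest-to-zero rule described for [13] above can be sketched in Python as follows. This is only an illustration, not the method of [13]: the function names are hypothetical, the "last two labels" rule for the primary domain is a simplifying assumption, and treating a heuristic value of exactly 0.5 as legitimate is an arbitrary choice, since the survey does not say how ties are handled or how the six heuristic values are combined.

```python
import math
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (sub_domain, primary_domain, path).

    Assumption: the primary domain is the last two labels of the host name.
    This is wrong for suffixes such as .co.uk; a real system would consult
    a public-suffix list.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    primary = ".".join(labels[-2:])
    sub = ".".join(labels[:-2])
    return sub, primary, parts.path


def classify(heuristic_value):
    """Label a site by whichever of 0 (phishing) or 1 (legitimate) is nearer."""
    return "legitimate" if heuristic_value >= 0.5 else "phishing"


def rmse(predicted, actual):
    """Root mean square error between heuristic values and 0/1 labels."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


print(split_url("http://login.secure.example.com/account/verify"))
print(classify(0.9), classify(0.1))
print(rmse([0.5, 0.5], [1, 0]))
```

An RMSE close to zero means the heuristic values sit close to the true 0/1 labels, which is how an error measure can serve as a stand-in for accuracy in this setting.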
Table 1: Outline of Algorithms used to detect Phishing Website URLs


III. CONCLUSION
This survey presented various algorithms and approaches used by several researchers in machine learning to detect phishing websites. On reviewing the papers, we conclude that most of the work was done using familiar machine learning algorithms such as Naïve Bayes, SVM, Decision Tree and Random Forest. Some authors proposed new systems for detection, such as PhishScore and PhishChecker. Combinations of features were evaluated with regard to accuracy, precision, recall, etc. Experimentally successful techniques for detecting phishing website URLs are summarized in Table 1. As phishing websites increase day by day, some features may need to be added or replaced with new ones to detect them.

REFERENCES
1. J. Shad and S. Sharma, “A Novel Machine Learning Approach to Detect Phishing Websites Jaypee Institute of Information Technology,” pp. 425–430, 2018.
2. Y. Sönmez, T. Tuncer, H. Gökal, and E. Avci, “Phishing web sites features classification based on extreme learning machine,” 6th Int. Symp. Digit. Forensic Secur. ISDFS 2018 - Proceeding, vol. 2018-January, pp. 1–5, 2018.
3. T. Peng, I. Harris, and Y. Sawa, “Detecting Phishing Attacks Using Natural Language Processing and Machine Learning,” Proc. - 12th IEEE Int. Conf. Semant. Comput. ICSC 2018, vol. 2018-January, pp. 300–301, 2018.
4. M. Karabatak and T. Mustafa, “Performance comparison of classifiers on reduced phishing website dataset,” 6th Int. Symp. Digit. Forensic Secur. ISDFS 2018 - Proceeding, vol. 2018-January, pp. 1–5, 2018.
5. S. Parekh, D. Parikh, S. Kotak, and P. S. Sankhe, “A New Method for Detection of Phishing Websites: URL Detection,” in 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), 2018, vol. 0, no. Icicct, pp. 949–952.
6. K. Shima et al., “Classification of URL bitstreams using bag of bytes,” in 2018 21st Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN), 2018, vol. 91, pp. 1–5.
7. A. Vazhayil, R. Vinayakumar, and K. Soman, “Comparative Study of the Detection of Malicious URLs Using Shallow and Deep Networks,” in 2018 9th International Conference on Computing, Communication and Networking Technologies, ICCCNT 2018, 2018, pp. 1–6.
8. W. Fadheel, M. Abusharkh, and I. Abdel-Qader, “On Feature Selection for the Prediction of Phishing Websites,” 2017 IEEE 15th Intl Conf Dependable, Auton. Secur. Comput. 15th Intl Conf Pervasive Intell. Comput. 3rd Intl Conf Big Data Intell. Comput. Cyber Sci. Technol. Congr., pp. 871–876, 2017.
9. X. Zhang, Y. Zeng, X. Jin, Z. Yan, and G. Geng, “Boosting the Phishing Detection Performance by Semantic Analysis,” 2017.
10. L. Machado and J. Gadge, “Phishing Sites Detection Based on C4.5 Decision Tree Algorithm,” in 2017 International Conference on Computing, Communication, Control and Automation, ICCUBEA 2017, 2018, pp. 1–5.
11. A. Desai, J. Jatakia, R. Naik, and N. Raul, “Malicious web content detection using machine leaning,” RTEICT 2017 - 2nd IEEE Int. Conf. Recent Trends Electron. Inf. Commun. Technol. Proc., vol. 2018-January, pp. 1432–1436, 2018.
12. M. Aydin and N. Baykal, “Feature extraction and classification phishing websites based on URL,” 2015 IEEE Conf. Commun. Network Security, CNS 2015, pp. 769–770, 2015.
13. L. A. T. Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen, “A novel approach for phishing detection using URL-based heuristic,” 2014 Int. Conf. Comput. Manag. Telecommun. ComManTel 2014, pp. 298–303, 2014.
14. S. Marchal, J. Francois, R. State, and T. Engel, “PhishScore: Hacking phishers’ minds,” Proc. 10th Int. Conf. Netw. Serv. Manag. CNSM 2014, pp. 46–54, 2015.
15. A. A. Ahmed and N. A. Abdullah, “Real time detection of phishing websites,” 7th IEEE Annu. Inf. Technol. Electron. Mob. Commun. Conf. IEEE IEMCON 2016, 2016.
16. X. Zhang, Y. Zeng, X. B. Jin, Z. W. Yan, and G. G. Geng, “Boosting the phishing detection performance by semantic analysis,” in Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017, 2018, vol. 2018-January, pp. 1063–1070.
17. D. Akila and C. Jayakumar, “Acquiring Evolving Semantic Relationships for WordNet to Enhance Information Retrieval,” International Journal of Engineering and Technology, Volume 6, November 5, pp. 2115–2128, 2014.
18. D. Akila, S. Sathya, and G. Suseendran, “Survey on Query Expansion Techniques in Word Net Application,” Journal of Advanced Research in Dynamical and Control Systems, Vol. 10(4), pp. 119–124, 2018.
