Paper 1
2 authors:
R Kiruthiga, SRM Institute of Science and Technology
Akila D., Saveetha College of Liberal Arts and Sciences
All content following this page was uploaded by R Kiruthiga on 06 September 2023.
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: B10180982S1119/2019©BEIESP
DOI: 10.35940/ijrte.B1018.0982S1119
PHISHING WEBSITES DETECTION USING MACHINE LEARNING
This result demonstrates that semantic information is a strong indicator of social engineering.
Another approach by authors [4] proposes feature selection algorithms to reduce the dimensionality of the dataset and obtain higher classification performance [4]. The approach is also compared with other data mining classification algorithms. The phishing websites dataset was taken from the UCI Machine Learning Repository [4]. The results show that some classification methods perform better with the reduced feature set while others perform worse. Bayesian Network, Stochastic Gradient Descent (SGD), lazy.K.Star, Randomizable Filtered Classifier, Logistic Model Tree (LMT) and ID3 (Iterative Dichotomiser) [4] benefit from the reduced phishing dataset, whereas Multilayer Perceptron, JRip, PART, J48 [4], Random Forest and Random Tree do not. Lazy.K.Star obtained 97.58% accuracy with 27 reduced features. The study was carried out with the WEKA software.
Authors [5] proposed a model that recognizes phishing sites through URL identification using the Random Forest algorithm. The model has three stages, namely Parsing, Heuristic Classification of data and Performance Analysis [5]. Parsing is used to analyze the feature set. The dataset was gathered from PhishTank. Out of 31 features, only 8 are considered for parsing. The Random Forest method obtained an accuracy of 95%.
Authors [6] proposed a flexible filtering decision module that uses a neural network model to extract features automatically, without any specific expert knowledge of the URL domain. In this approach the authors use all the characters in the URL string and count byte values. Besides counting byte values, they also overlap parts of neighbouring characters by shifting 4 bits. This embeds combination information of two characters appearing in sequence; counting how many times each value appears in the original URL string yields a 512-dimension vector. The neural network model was tested with three optimizers: Adam, AdaDelta and SGD. Adam performed best, with an accuracy of 94.18%. The authors also conclude that the accuracy of this model is higher than that of a previously proposed complex neural network topology.
In this paper authors [7] made a comparative study of malicious URL detection using a classical machine learning technique, logistic regression with bigrams, and deep learning architectures such as convolutional neural network (CNN) and CNN long short-term memory (CNN-LSTM) [7]. Phishing URLs were collected from PhishTank and OpenPhish, and malicious URLs from MalwareDomainlist and MalwareDomains. In the comparison, CNN-LSTM obtained 98% accuracy. The authors used TensorFlow [7] in conjunction with Keras [7] for the deep learning architecture.
Authors in this paper [8] also proposed a reduced feature selection model to detect phishing websites. They used Logistic Regression (LR) and Support Vector Machine (SVM) [8] as classification methods to validate the feature selection method. 19 features, reduced from 30 site features, were selected and used for phishing detection. The performance of the LR and SVM algorithms was assessed on precision, recall, f-measure and accuracy. The study shows that the SVM algorithm achieved better performance than the LR algorithm.
In this paper authors [9] proposed a phishing detection model that detects phishing effectively by mining the semantic features of word embedding, semantic features and multi-scale statistical features [9] in Chinese web pages. Eleven features were extracted and categorized into five classes to acquire statistical features of web pages. AdaBoost, Bagging, Random Forest and SMO [9] are used to train and test the model. The legitimate URL dataset was obtained from DirectIndustry web guides and the phishing data from the Anti-Phishing Alliance of China. According to the study, semantic features alone identified the phishing sites with high detection [9] efficiency, and the fusion model achieved the best detection performance. The model is specific to Chinese web pages and is therefore language dependent.
This paper [10] proposes an efficient way to detect phishing URL websites using a C4.5 decision tree approach. The technique extracts features from the sites and calculates heuristic values. These values are given to the C4.5 decision tree algorithm [10] to determine whether the site is phishing or not. The dataset is collected from PhishTank and Google. The process includes two phases, namely a pre-processing phase and a detection phase [10]. Features are extracted based on rules in the pre-processing phase, and the features and their respective values are then input to the C4.5 algorithm, which obtained 89.40% accuracy.
Authors [11] in this paper created a Google Chrome extension to detect phishing website content with the help of machine learning algorithms. A dataset from the UCI Machine Learning Repository was used and 22 features were extracted from it. The kNN, SVM and Random Forest algorithms were compared on precision, recall, f1-score and accuracy. Random Forest obtained the best score. HTML, JavaScript and CSS [11] were used to implement the Chrome extension, along with Python. A drawback of this extension is its reliance on a declared list of malicious sites, which grows every day.
This paper [12] presents a framework that extracts features in a flexible and simple way with new strategies. Phishing data is collected from PhishTank [12] and legitimate URLs from Google [12]. C# and R programming were used to obtain the text properties. 133 features were obtained from the dataset and third-party service providers. CFS subset based and Consistency subset based feature selection [12] methods were used for feature selection, and the results were analyzed with the WEKA tool. The Naïve Bayes (NB) and Sequential Minimal Optimization (SMO) [12] algorithms were compared for performance, and the authors prefer SMO over NB for phishing detection.
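The byte-value encoding summarized above for [6] can be illustrated with a minimal sketch. It is one possible reading of the description, not the authors' implementation: the function name `url_byte_features`, the UTF-8 encoding, and the way the 4-bit shifted value is built (the low four bits of one byte joined with the high four bits of the next) are all assumptions. Under this reading, 256 bins count plain byte values and another 256 bins count the shifted values, giving the 512-dimension vector.

```python
from typing import List


def url_byte_features(url: str) -> List[int]:
    """Hypothetical 512-bin count vector for a URL string (assumed reading of [6])."""
    data = url.encode("utf-8")  # assumption: the URL is treated as UTF-8 bytes
    vec = [0] * 512

    # Bins 0..255: how often each plain byte value occurs in the URL.
    for byte in data:
        vec[byte] += 1

    # Bins 256..511: a window shifted by 4 bits across each neighbouring pair,
    # i.e. the low nibble of the current byte joined with the high nibble
    # of the next byte (assumed interpretation of "shifting 4-bits").
    for cur, nxt in zip(data, data[1:]):
        shifted = ((cur & 0x0F) << 4) | (nxt >> 4)
        vec[256 + shifted] += 1

    return vec


if __name__ == "__main__":
    features = url_byte_features("http://example.com/login")
    print(len(features))
```

A vector of this kind has a fixed length regardless of URL length, which is what lets it be fed directly to a neural network input layer.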
International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-8, Issue-2S11, September 2019
Another heuristic feature detection method by authors [13] describes URL features such as PrimaryDomain, SubDomain and PathDomain, and website rankings such as PageRank, AlexaRank and AlexReputation, to identify phishing websites. The dataset is taken from PhishTank, and the experiment is split into 6 phases using MySQL and PHP with 10 testing datasets. The proposed model contains two phases. In Phase I site features are extracted, and in Phase II six heuristic values are calculated. According to the authors, if the heuristic value is nearest to one the site is considered legitimate, and if it is nearest to zero the site is suspected to be a phishing site. Root Mean Square Error (RMSE) [13] is used to calculate accuracy, and 97% accuracy was obtained.
In this paper the author [14] introduces a phishing URL detection system based on URL lexical analysis, named PhishScore. This approach is based on intra-URL relatedness [14][18]. This relatedness reflects the relationship between parts of the URL. Around 12 site features extracted from a single URL are fed into machine learning algorithms to identify phishing URLs. The experiment achieves an accuracy of 94.91%.
RESULTS
This paper [15] focuses on detecting phishing website URLs with domain name features. The content-based, heuristic-based and blacklist-based categories of approaches to web spoofing attacks [8][17] are explained, and the proposed model, PhishChecker, is developed with Microsoft Visual Studio Express 2013 and the C# language [15]. The dataset is drawn from PhishTank and the Yahoo directory set, and the model obtained an accuracy of 96%. This paper checks only the validity of URLs.
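The decision rule and error measure reported for [13] can be sketched as follows. This is only an illustration under stated assumptions: the function names `classify` and `rmse` are hypothetical, the tie at exactly 0.5 is arbitrarily treated as phishing, and the survey does not say how the six heuristic values are combined or how the 97% figure is derived from RMSE, so neither step is modelled here.

```python
import math
from typing import Sequence


def classify(heuristic_value: float) -> str:
    """Label a site by whether its heuristic value is nearer to one or to zero."""
    # Nearest to one -> legitimate; nearest to zero -> suspected phishing.
    # Assumption: a value exactly halfway (0.5) is treated as phishing.
    if abs(heuristic_value - 1.0) < abs(heuristic_value - 0.0):
        return "legitimate"
    return "phishing"


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root Mean Square Error between predicted and actual values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


if __name__ == "__main__":
    # Hypothetical heuristic values and ground-truth labels (1 = legitimate).
    scores = [0.92, 0.15, 0.60, 0.05]
    truth = [1.0, 0.0, 1.0, 0.0]
    print([classify(s) for s in scores])
    print(rmse(scores, truth))
```

A lower RMSE means the heuristic values sit closer to the true 0/1 labels.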
Table 1: Outline of Algorithms used to detect Phishing Website URLs