Detection of Phishing Websites from URLs by using Classification Techniques on WEKA
Emre Koçyiğit
Yildiz Technical University
All content following this page was uploaded by Emre Koçyiğit on 11 July 2021.
Abstract—The Internet grows stronger every day and makes our lives easier through the many applications that run in the cyberworld. However, with the development of the Internet, cyber-attacks have gradually increased and identity theft has emerged. Phishing is a type of fraud in which intruders use fake web pages to obtain people's private information, such as user IDs, passwords, credit card numbers and bank account numbers. These scammers can also send e-mails that appear to come from important institutions and organizations, imitating their web pages and acting as if they were the originals. Traditional security mechanisms cannot prevent these attacks because they directly target the weakest part of the connection: end users. Machine learning technology has been used to detect and prevent this type of intrusion. Anti-phishing methods have been developed by detecting the attacks carried out with these technologies. In this paper, we combined websites used in phishing attacks into a dataset and then obtained results by applying four classification algorithms to this dataset. The experimental results showed that the proposed systems give good accuracy levels for the detection of these attacks, up to 83% with Random Forest.
Index Terms—phishing attacks, machine learning, classification algorithms, phishing detection, cybersecurity
I. INTRODUCTION
In the developing world, we use the Internet very actively to communicate and reach information, and the user base has recently grown with Internet applications. For this reason, communication and information transfer through applications such as banking, e-commerce, e-mail, and social media applications like Instagram have risen drastically, with a hugely positive influence on our lives [1]. On the other hand, security measures are not sufficiently organized or capable of preventing a wide variety of cyber-attack threats or protecting computer users. This is a vital security problem even for experienced and educated users, and any cyber-threat such as a phishing attack can cause crucial losses.
Phishing is an online attack in which fraudsters send messages to user accounts to collect sensitive, personal, and financial information. Phishing attacks seek access to financial information in particular, using e-mails that imitate official websites and credit card companies. These messages contain a URL link that points the user to another website. The website that the user is connected to is a fake website with an innocent appearance, and the information that the user shares with this website is transmitted to the phisher.
The numbers of phishing websites detected in the first, second, and third quarters of 2020 were 165,772, 146,994 and 571,764, respectively. In total, 884,530 unique phishing websites were detected in the first three quarters of that year. For the same quarters of 2019, the numbers of detected websites were 180,768, 182,465 and 266,378, for a total of 629,611. This means an increase of approximately 40% in phishing websites in a year [2].
Fig. 1. Phishing attacks.
In phishing attacks, the phisher usually designs a fake web page. This web page appears similar to the original web page and has a different but deceptive URL. In this way, phishers can access the private information of users. A careful user may notice that the URL is malicious and belongs to a phishing site. However, phishers take advantage of human vulnerabilities and social engineering techniques to hide their scam.
E-mails sent by phishers have the appearance of official e-mail accounts of institutions and organizations, as part of the deceptive process of phishing. When the user clicks on a link in these e-mails, it leads to a malicious website. This website captures the credentials entered by the user, and this information is saved on a different server. The phisher then uses it to commit a cybercrime. Various phishing techniques and methods have been developed with the advancement of technology in order to acquire confidential data of users. Phishing techniques are shown in the table below [3].
There are also anti-phishing techniques and measures to get rid of spam messages and protect users' vital information. Recently, they have been developed with different approaches. These techniques are:
Authorized licensed use limited to: ULAKBIM UASL - YILDIZ TEKNIK UNIVERSITESI. Downloaded on July 11,2021 at 16:06:47 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Sixth International Conference on Inventive Computation Technologies [ICICT 2021]
IEEE Xplore Part Number: CFP21F70-ART; ISBN: 978-1-7281-8501-9
1) Spam filters have been designed to detect and block harmful and phishing e-mails.
2) Web browsers such as Google Chrome, Internet Explorer and Mozilla Firefox have added measures that warn users about phishing on the websites they visit.
3) Using different password entries. For example, some banks have added an extra password step in which users select images in addition to entering a fixed password. This has increased the number of password entries.
4) To prevent phishing scams, some organizations analyze
Fig. 2. Phishing process.
Phishing Techniques               Definition
Spear Phishing                    Hackers do not target random people or organizations. Instead, the phisher researches a specific target in order to launch a personalized attack and trap it.
Session Hijacking                 The phisher steals information from users by exploiting the web session control mechanism. The identity hunter accesses the web server illegally with the help of listeners.
Email/Spam                        One of the most common techniques: an e-mail sent to millions of people asks for user information. These messages send users a form to fill in with their personal account information, which is then used to access their accounts.
Content Injection                 The phisher changes portions of the content of a trusted website so that users are led to a different web page that collects their personal information.
Web Based Delivery                The phisher observes the user's actions on the website and transmits the user's information to the phishing site without the user's knowledge.
Phishing through Search Engines   A technique that steals users' credit card information through the products they search for on search engines.
Link Manipulation                 The user is connected to a malicious website after clicking on a link sent by the phisher.
Smishing (SMS Phishing)           The attack is carried out through SMS messages sent to phones. Phishers can access personal information via the link in the message.
Malware                           As soon as the user clicks on the link that phishing scammers send by e-mail, the malware starts running on the user's computer.
Trojan                            A type of malicious software written to access credentials.
Ransomware                        Access to a file or device is denied in order to obtain a ransom from the user. The malware is installed on the computer when the user clicks on a link or is tricked by a malicious ad.
The rest of this paper is organized as follows: Section II reviews related work on phishing detection. Section III focuses on the proposed system and gives details about the dataset, tools and machine learning methods used in this work, followed by the data preprocessing steps. Results are presented in Section IV. Section V discusses the work and how it can be improved in the future. Finally, the conclusion is presented in Section VI.
II. RELATED WORK
Phishing is an old problem in the history of the Internet. Hackers consistently and insidiously try to obtain and abuse people's information. Users must be careful to avoid these kinds of attacks, and effective and well-organized strategies should be developed. This study aims to predict phishing websites accurately with several algorithms. Marchal, François, State and Engel (2014) built a phishing dataset by downloading the daily PhishTank blacklist between October 11th and November 10th, 2012, which contained 53,089 unique URLs [5]. After a selection, they had 48,009 extended phishing URLs. Then, for a balanced dataset, they took the same number of legitimate URLs from DMOZ. This study uses supervised classification techniques. They build a feature matrix from the dataset in which each feature vector is composed of 12 elements. The predicted variable is 0 for legitimate and 1 for phishing URLs. Using Weka, they tested seven classifiers. With the Random Forest classifier, they achieved 94.91% accuracy with a 1.44% false positive rate.
Jain and Gupta [6] extract nineteen features from the client side only, using the URL and source code of the websites. The data were collected mostly from Phishtank and Openphish for verified phishing URLs and from Alexa for legitimate ones; the training dataset includes 4059 websites, with 2141 phishing and 1918 legitimate sites. They obtain a TPR of 99.39% and an FPR of 1.25%. They implemented intuitive methods to produce the feature vector and generated a single feature vector for each website sample to build a labelled dataset. They evaluated the dataset with 10-fold cross-validation. Using WEKA, the study reached 99.09% accuracy with Random Forest, 96.16% with SVM, 98.05% with neural networks, 98.25% with Logistic Regression (LR), and 97.59% with Bayes.
Weedon, Tsaptsinos and Denholm-Price [7] obtain their dataset from Phishtank, which is completely verified, and from DMOZ websites. Their study contained 4000 URLs for the training process
and a quarter of the training data belongs to the malicious dataset, while the remaining data belongs to the legitimate one. Their testing set consists of 7000 URLs, and nearly 57% of them belong to the malicious dataset. The study used a literal-only dataset to assess the accuracy of the Random Forest algorithm and obtained 86.9%. With other algorithms, they obtained 83.9% accuracy with J48, 64.6% with Bayes, and 81.5% with LR.
Sahingoz et al. [8], [14] collect phishing URLs mostly from Phishtank by writing a script. Over 70,000 URLs were available in their dataset; roughly half of them were legitimate websites and the other half were phishing websites. They extract each word from these URLs for use in the analyses. They then implemented a Random Word Detection Module, and all words with more than seven characters were examined by a Word Decomposer Module (WDM) to separate their subwords. For words that are not compound, the WDM returned only the original word. After that, their Maliciousness Analysis Module examined and processed the output words of the WDM and the words of up to seven characters. Then a couple of auxiliary features were extracted from these words. Random Forest was the most successful algorithm, with 97.98% accuracy. Natural Language Processing based features increased the performance and scored better than word vectors in all algorithms except Naive Bayes.
Liu, Wang, Lang and Zhou [9] use the WEKA library to run the Random Forest, J48, LR, SVM, MLP and Bayes algorithms. Their dataset contained 29,000 URLs, of which approximately 12,500 malicious ones were obtained from Phishtank. These were combined with 16,516 legitimate URLs taken from the digg58 website. They identified 41 features in their study and tuned the Random Forest algorithm. After their implementation, Random Forest classification achieved considerable scores: the algorithm's precision was 99.7% and its FPR, which is an important factor in this kind of problem, was less than 0.4%.
Rakesh et al. [10] also use the Weka tool. They collected the legitimate URL set from DMOZ and fake samples from a widespread source, Phishtank. The dataset consists of 2000 balanced URLs. This study had 9 features that were extracted using a Java program. They generated 6 subsets of varying sizes to observe how the accuracy rate changes with dataset size. In their project, they classified the URLs with the C4.5 algorithm in WEKA. As a result, the higher accuracy scores belonged to the C4.5 and AdaBoost algorithms.
Aydın et al. [11] identify the most attacked websites and their devious URLs from Phishtank. The study analyzed 8,538 URLs in total. Nearly 40% of them were legitimate and the remaining ones were fraudulent. Their program collected textual properties and "whois" records. Additionally, they obtained some data manually. The dataset has 133 separate features related to URLs. Gain Ratio Attribute (GRA) and ReliefF Attribute selection were used for feature selection and analyzed with WEKA. The SMO and J48 algorithms achieve their best results with the ReliefF attribute-based (58) selection technique, with accuracies of 96.42% and 98.47%, respectively. Naive Bayes reaches its best result using Gain Ratio Attribute (36), with 87.08% accuracy.
İbrahim and Hadi [12] use the WEKA tool to implement classifiers on a public NASA repository dataset. The dataset has 30 attributes and 11,055 instances. They categorized the features into four groups: address bar, abnormal, website content and domain based features. They first used k-fold cross-validation with k = 10. Random Forest obtained the highest accuracy, 95.2%, both with and without feature selection algorithms, and Bayes obtained the lowest. All classifiers obtained better accuracy without feature selection than with it.
James et al. [13] analyzed several algorithms using WEKA and MATLAB. First, they extract the features. Then they choose a classifier to implement in MATLAB. They collect URLs of benign websites from Alexa, DMOZ and web browser history. They collected 37,000 URLs, 45% of which were phishing samples from Phishtank. They also collect WHOIS information for some websites. Using only lexical features, they achieved classification rates of 93.2% for a test section of 60% and 93.78% for a test section of 90%. With the Regression Tree in MATLAB, the accuracy rate was 91.08% with 60% of the dataset and 85.63% with 90% of the dataset.
Priya and Meenakshi [15] analyze the C4.5 (J48) algorithm using the WEKA tool. Phishing and legitimate websites are collected from PhishTank. Thirty-two features are extracted from the websites. Two training datasets are created, with 750 URLs and 2000 URLs, to train the J48 algorithm. The test dataset has 300 URLs. With 750 URLs, the tree has 45 nodes, of which 28 are leaves; with 2000 URLs, 75 nodes are created, of which 43 are leaves. The algorithm has an 82.6% accuracy rate.
III. PROPOSED SYSTEM
A. Dataset Description
In their attacks, phishers try to get their victims to click the URL of a fake site. Identity hunters alter the appearance of the URL structure in various ways. These alterations make the URLs look different from those of the legitimate site. Here, after extensive research on malicious URLs, we use some of their properties to classify the web page, and we analyze the URLs we detected. Some features of malicious URLs are defined below [10].
Fig. 3. URL components.
The CatchPhish D3.csv dataset had 126,077 rows and 2 columns: the site name and a phishing value, which is 1 if the site is a phishing site and 0 if not. We obtained the dataset from [21]. In this dataset, legitimate sites are collected from Common Crawl and Alexa, while phishing sites are collected from Phishtank. After some pre-processing steps, 122,055 values remained, 85,220 with value 0 and 36,835 with value 1, and those
values are ordered so that the 0s come first. After extracting values from the address bar, we had 15 columns, for example protocol, site length, host name, file name, path, path length, fragment, number of query keys, port, number of delimiters, number of reserved characters, etc.
TABLE I
DESCRIPTION OF RELEVANT NOMINAL ATTRIBUTES
Attribute                   Description / Values
site                        full-length sites in the dataset
site len                    length of sites
protocol                    http (0), https (1)
url host name               combination of subdomain and domain name
url host name len           length of the host name of a URL
file name                   name of a file
file name len               length of a file name
file name without ext       file name without extension
file name len without ext   length of file name without extension
path                        location of a file
path len                    length of path of a URL
query                       data passed to the server
query len                   length of a query string
num of query keys           count of query keys of a URL
fragment                    internal page reference
fragment len                length of fragment identifier
port                        by default 80 for HTTP, 443 for HTTPS
num of reserved chars       reserved chars: ; / ? : @ & = + $
num of delimiters           delimiters: < > # % ‘
num of unreserved chars     unreserved chars: - . ! * ~ ' ( )
num of unwise chars         unwise chars: { } | \ ^ [ ]
phish                       not phish (0), phish (1)
B. Tools
1) Python: Python was created by Guido van Rossum in the early 1990s [19]. It has a simple structure, which makes it a good choice for educational purposes for students and beginners. Because it is one of the most preferred languages, many open source projects contribute to it, which leads to a rapid development cycle.
2) Jupyter Notebook: Jupyter Notebook is an open source web application that lets developers execute code in pieces and create visualizations. In this way, it allows users to view code blocks together with their results.
3) Weka: Waikato Environment for Knowledge Analysis (WEKA) is a Java-based open-source tool developed by the University of Waikato. It is used for data mining. Weka includes machine learning algorithms and is easy to use. Besides, visualization and data preprocessing tools are also included [16].
Fig. 4. Distribution of dataset according to phishing sites.
C. Methods
Data Mining (DM) is the process of extracting information from big data. Our task here is to make precise predictions from large chunks of data. The program we use for this is WEKA. WEKA is very convenient here because it produces results quickly when running machine learning algorithms. With the results we obtained, we can compare the algorithms we use. We used the CatchPhish dataset in our program. We obtained features by breaking down the URLs in the dataset, which turned it into a multivariate dataset. Classification is a data mining function that assigns items in a dataset to target categories or classes. The purpose of classification is to make accurate predictions for each value in the data. For the classification model, we looked at algorithms such as Random Forest and Decision Tree, and finally at Naive Bayes. We used multiple algorithms because this gives the advantage of comparing the information extracted by each algorithm. As a result, the multiple-classifier system aims to make the best use of the data in the dataset.
1) Random Forest Classifier: Random Forest is one of the most used and most popular machine learning algorithms in classification. As its name suggests, the algorithm creates a forest made up of decision trees and combines their outputs to make a prediction [17]. In our experiments, it produced results quickly and gave the highest accuracy among the algorithms we compared.
2) Decision Tree: Tree-based learning is one of the most used approaches for data mining classification. The algorithm, which has a tree-like model, makes a sequence of decisions to reach the desired result, using conditional statements to arrive at each conclusion [1]. Decision trees can be used for both classification and regression problems. A decision tree divides a dataset into progressively smaller subsets by applying certain rules. In other words, the data is split into small pieces that are easier to handle. In addition, a decision tree can be visualized, which makes it much easier to understand.
3) Logistic Regression: Logistic regression is a linear classification method for a categorical dependent variable. The purpose is to analyze how the independent variables in our dataset relate to the dependent variable. Like all regression algorithms, it is a predictive analysis. The result here is a binary variable, whose values are 0 (false) and 1 (true). Observations in logistic regression must be independent of each other. A linear relationship between the dependent and independent variables is not required.
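As a minimal sketch of the binary prediction described above, the following Python code trains a logistic regression with plain stochastic gradient descent on a few URL-derived attributes of the kind listed in Table I. It is not the WEKA implementation used in this work: the function names, the chosen subset of attributes, the scaling of the lengths, the learning rate, the number of epochs and the six example URLs with their labels are all illustrative assumptions.

```python
import math
from urllib.parse import urlsplit, parse_qsl


def url_features(url):
    """Map one URL to a small numeric vector (illustrative subset of Table I)."""
    parts = urlsplit(url)
    return [
        1.0 if parts.scheme == "https" else 0.0,   # protocol: http (0), https (1)
        len(url) / 100.0,                           # site length, scaled (assumption)
        len(parts.hostname or "") / 100.0,          # host name length, scaled
        len(parts.path) / 100.0,                    # path length, scaled
        float(len(parse_qsl(parts.query, keep_blank_values=True))),  # number of query keys
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - learning_rate * error * xj for wj, xj in zip(w, xi)]
            b -= learning_rate * error
    return w, b


def predict(w, b, x):
    """Return 1 (phish) if the estimated probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical toy data, not taken from the CatchPhish dataset:
# (URL, label) with 0 = not phish and 1 = phish.
TOY_DATA = [
    ("https://example.org/", 0),
    ("https://example.org/docs/index.html", 0),
    ("https://shop.example.com/cart", 0),
    ("http://example-login.test/secure/update/account.php?user=a&token=b", 1),
    ("http://192.0.2.10/bank/verify.php?id=1", 1),
    ("http://paypa1.example.net/signin/confirm.html?session=x&next=y&z=1", 1),
]

X = [url_features(url) for url, _ in TOY_DATA]
y = [label for _, label in TOY_DATA]
w, b = train(X, y)
print([predict(w, b, x) for x in X])
```

Calling predict(w, b, url_features(u)) returns 1 for a URL that the toy model considers phishing and 0 otherwise. With real data, the labelled URLs of the dataset would take the place of the toy list, and the model would be evaluated on held-out data rather than on the training samples.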
4) Naive Bayes: Naive Bayes is one of the simplest models in the family of "probabilistic classifiers". Each sample is represented as a vector of multiple features. The model is trained on the provided data, and at the end of this training it is used to classify new data.
D. Dataset Preprocessing
Data preprocessing is a step to obtain good-quality data, because the dataset can contain incomplete, inconsistent and outdated data [17]. Our dataset initially had 2 columns: the full site name and the phishing label. First, we converted the dataset from CSV to Excel. Some rows in the phishing column were missing, so we filled them with NaN values, and some rows had different numbers, so we found and deleted them. We dropped the empty rows. After that, we extracted the needed features from the site column and had 15 features. We encoded some columns, such as port, which means that we assigned a number to each value. Then we made the data available for use in Weka and made predictions with Random Forest, J48 (Decision Tree), Naive Bayes and Logistic Regression. Before doing that, we applied 5-fold cross-validation, which divides the data into five subsets and, in each round, holds one subset out as test data.
TABLE II
CONFUSION MATRIX TERMS
                      Predicted Positive Value   Predicted Negative Value
Real Positive Value   True Positive (TP)         False Negative (FN)
Real Negative Value   False Positive (FP)        True Negative (TN)
True Positive Rate (TPR): The proportion of actual positive samples that the classifier predicts as positive. The higher, the better.
TPR = TP / (TP+FN)
False Positive Rate (FPR): The proportion of actual negative samples that the classifier wrongly predicts as positive. The lower, the better.
FPR = FP / (FP+TN)
Accuracy: How often the classifier makes correct predictions.
Accuracy = (TP+TN) / (TP+FP+TN+FN)
Precision: The proportion of samples predicted as positive that are actually positive. It is preferable for it to be high.
Precision = TP / (TP+FP)
Fig. 5 shows the confusion matrix for each algorithm. The diagonal values of the matrices, TP and TN, are the correctly predicted values. Looking at these, we can say that Random Forest predicted the highest number of correct values.
IV. EXPERIMENTAL RESULTS
Recall: The proportion of actual positive samples that are classified correctly, calculated over the total number of positive samples.
Recall = TP / (TP+FN)
F1-Score: The harmonic mean of the Recall and Precision values. Its purpose is to measure the performance of a classifier in a single value, and it is mostly used to compare classifiers.
F1-Score = 2 * Precision * Recall / (Precision + Recall)
ROC Curve: A graph of the classifier's performance across all classification thresholds. The ROC curve reports the trade-off between sensitivity and specificity. The area under the ROC curve is called AUC and is used as an evaluation criterion. AUC is a measure of how well a parameter can distinguish between two classes.
TABLE III
PERFORMANCE METRICS
Algorithm             Class       Precision   Recall   F1-Score
Random Forest         Not Phish   0.877       0.864    0.870
                      Phish       0.743       0.765    0.754
Decision Tree         Not Phish   0.871       0.818    0.844
                      Phish       0.684       0.765    0.722
Naive Bayes           Not Phish   0.833       0.833    0.833
                      Phish       0.676       0.676    0.676
Logistic Regression   Not Phish   0.844       0.818    0.831
                      Phish       0.667       0.706    0.686
The Random Forest classifier's results are shown below. Its execution time is a little longer than that of the other algorithms we use, except Logistic Regression, but it is still fast. The accuracy of the Random Forest classifier is calculated as 83%. The Logistic Regression and Naive Bayes algorithms reach 78% accuracy, but Naive Bayes was faster. J48 was also fast and reached 80% accuracy. As a result, we can say that the Random Forest classifier gave us the best result.
TABLE IV
ALGORITHMS
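The confusion matrix terms of Table II and the metric definitions used in this section (TPR, FPR, Accuracy, Precision, Recall and F1-Score) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the output of WEKA: the function names are hypothetical, the example label lists are made up, and a metric whose denominator is zero is reported as 0.0 by convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN for binary labels (phish = positive class)."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # phishing site predicted as phishing
            else:
                fn += 1   # phishing site missed
        else:
            if predicted == positive:
                fp += 1   # legitimate site flagged as phishing
            else:
                tn += 1   # legitimate site predicted as legitimate
    return tp, fn, fp, tn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, fn, fp, tn):
    """Compute the metrics exactly as defined by the formulas in the text."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "TPR": recall,                                   # TP / (TP+FN)
        "FPR": safe_div(fp, fp + tn),                    # FP / (FP+TN)
        "Accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "Precision": precision,                          # TP / (TP+FP)
        "Recall": recall,                                # TP / (TP+FN)
        "F1-Score": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical labels: 1 = phish, 0 = not phish.
example_true = [1, 1, 1, 0, 0, 0, 0, 1]
example_pred = [1, 0, 1, 0, 1, 0, 0, 1]
example_counts = confusion_counts(example_true, example_pred)
example_metrics = metrics(*example_counts)
for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example there are 3 true positives, 1 false negative, 1 false positive and 3 true negatives, so every metric except FPR equals 0.75 and FPR equals 0.25. Recall and TPR share the same formula, which is why the sketch computes them once.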
Fig. 5. Confusion matrix for each algorithm.
V. DISCUSSION
In this study, the accurate prediction of phishing websites by different classification techniques is the ultimate goal and
we divided our dataset into two main parts, training and test, in the first phase. Initially, we tried to extract effective features from the URL dataset that we can use to detect phishing. Then we preprocessed the data to clean and prepare it. Then we applied the Random Forest, Decision Tree, Naive Bayes and Logistic Regression algorithms to reach the best result and compared their scores. We observed that the addition of phishing samples increased the accuracy scores of the algorithms. A balanced and enhanced dataset can produce better solutions in this case. In addition, some deep learning models whose efficiency was shown in [14] can be adapted to the proposed model in future work.
VI. CONCLUSION
In this paper, we implemented a phishing detection system on WEKA and tested its efficiency on a public dataset, CatchPhish D3, using different classification techniques. The dataset has 2 columns; we extracted features from it and created a new dataset in a structured format. To do this, some preprocessing steps are needed before the dataset can be used in Weka. For the detection of phishing sites, the URLs of the web pages are mainly used. From this data, features are produced and used to detect whether a web page is phishing or not. To predict this, four different machine learning models are used: Random Forest, Naive Bayes, Logistic Regression and Decision Tree.
As a conclusion of this work, we found that the Random Forest algorithm works better than the others, with relatively high accuracy rates. The models can be enhanced by using new features in the system, as in [4], [18]. Additionally, apart from the URL-based features, some content-based features [20] can also be used. Finally, we can also get help from third-party organizations and web pages such as Alexa and Whois to identify whether a page is phishing or not.
REFERENCES
[1] M. Korkmaz, O. K. Sahingoz and B. Diri, "Detection of Phishing Websites by Using Machine Learning-Based URL Analysis," 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2020, pp. 1-7, doi: 10.1109/ICCCNT49239.2020.9225561.
[2] Phishing Activity Trends Report, Summary – 3rd Quarter 2020. (2020). [Online]. Available: https://apwg.org/trendsreports/
[3] A. Das, S. Baki, A. El Aassal, R. Verma and A. Dunbar, "SoK: A Comprehensive Reexamination of Phishing Research From the Security Perspective," in IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 671-708, First quarter 2020, doi: 10.1109/COMST.2019.2957750.
[4] M. Korkmaz, O. K. Sahingoz and B. Diri, "Feature Selections for the Classification of Webpages to Detect Phishing Attacks: A Survey," 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 2020, pp. 1-9, doi: 10.1109/HORA49412.2020.9152934.
[5] S. Marchal, J. François, R. State and T. Engel, "PhishScore: Hacking phishers' minds," 10th International Conference on Network and Service Management (CNSM) and Workshop, Rio de Janeiro, 2014, pp. 46-54, doi: 10.1109/CNSM.2014.7014140.
[6] A. K. Jain and B. B. Gupta, "Towards detection of phishing websites on client-side using machine learning based approach," Telecommunication Systems, vol. 68, 2018, pp. 687-700, doi: 10.1007/s11235-017-0414-0.
[7] M. Weedon, D. Tsaptsinos and J. Denholm-Price, "Random forest explorations for URL classification," 2017 International Conference On Cyber Situational Awareness, Data Analytics And Assessment (Cyber SA), London, 2017, pp. 1-4, doi: 10.1109/CyberSA.2017.8073403.
[8] E. Buber, B. Dırı and O. K. Sahingoz, "Detecting phishing attacks from URL by using NLP techniques," 2017 International Conference on Computer Science and Engineering (UBMK), Antalya, 2017, pp. 337-342, doi: 10.1109/UBMK.2017.8093406.
[9] C. Liu, L. Wang, B. Lang and Y. Zhou, "Finding effective classifier for malicious URL detection," Proceedings of the 2018 2nd International Conference on Management Engineering, Software Engineering and Service Sciences, 2018, pp. 240-244, doi: 10.1145/3180374.3181352.
[10] Rakesh R, Kannan A, Muthurajkumar S, Pandiyaraju V and SaiRamesh L, "Enhancing the precision of phishing classification accuracy using reduced feature set and boosting algorithm," 2014 Sixth International Conference on Advanced Computing (ICoAC), Chennai, 2014, pp. 86-90, doi: 10.1109/ICoAC.2014.7229752.
[11] M. Aydin, I. Butun, K. Bicakci and N. Baykal, "Using Attribute-based Feature Selection Approaches and Machine Learning Algorithms for Detecting Fraudulent Website URLs," 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2020, pp. 0774-0779, doi: 10.1109/CCWC47524.2020.9031125.
[12] D. R. Ibrahim and A. H. Hadi, "Phishing Websites Prediction Using Classification Techniques," 2017 International Conference on New Trends in Computing Sciences (ICTCS), Amman, 2017, pp. 133-137, doi: 10.1109/ICTCS.2017.38.
[13] J. James, Sandhya L. and C. Thomas, "Detection of phishing URLs using machine learning techniques," 2013 International Conference on Control Communication and Computing (ICCC), Thiruvananthapuram, 2013, pp. 304-309, doi: 10.1109/ICCC.2013.6731669.
[14] C.B. Cebi, F.S. Bulut, H. Firat, O.K. Sahingoz and G. Karatas, "Deep Learning Based Security Management of Information Systems: A Comparative Study," Journal of Advances in Information Technology, vol. 11, no. 3, 2020.
[15] A. Priya and E. Meenakshi, "Detection of phishing websites using C4.5 data mining algorithm," 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, 2017, pp. 1468-1472, doi: 10.1109/RTEICT.2017.8256841.
[16] K. P. S. Attwal and A. S. Dhiman, "Exploring data mining tool-Weka and using Weka to build and evaluate predictive models," Advances and Applications in Mathematical Sciences, vol. 19, 2020, pp. 451-469.
[17] B. Geyik and M. Kara, "Severity Prediction with Machine Learning Methods," 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 2020, pp. 1-7, doi: 10.1109/HORA49412.2020.9152601.
[18] E. Buber, Ö. Demir and O. K. Sahingoz, "Feature selections for the machine learning based detection of phishing websites," 2017 International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, 2017, pp. 1-5, doi: 10.1109/IDAP.2017.8090317.
[19] A. Rawat, "A Review on Python Programming," International Journal of Research in Engineering, Science and Management, vol. 3, 2020, pp. 8-11.
[20] U. Ozker and O. K. Sahingoz, "Content Based Phishing Detection with Machine Learning," 2020 International Conference on Electrical Engineering (ICEE), Istanbul, Turkey, 2020, pp. 1-6, doi: 10.1109/ICEE49691.2020.9249892.
[21] R.S. Rao, T. Vaishnavi and A.R. Pais, "CatchPhish: detection of phishing websites by inspecting URLs," Journal of Ambient Intelligence and Humanized Computing, vol. 11, 2020, pp. 813–825, doi: 10.1007/s12652-019-01311-4.