Detection of Phishing Websites from URLs by using Classification Techniques on WEKA

Conference Paper · January 2021

DOI: 10.1109/ICICT50816.2021.9358642


Proceedings of the Sixth International Conference on Inventive Computation Technologies [ICICT 2021]
IEEE Xplore Part Number: CFP21F70-ART; ISBN: 978-1-7281-8501-9

Buket Geyik, Computer Engineering Department, Istanbul Kultur University, Istanbul, Turkey
Kübra Erensoy, Computer Engineering Department, Istanbul Kultur University, Istanbul, Turkey
Emre Kocyigit, Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey

Abstract—The Internet grows stronger day by day and makes our lives easier through the many applications that run in the cyberworld. However, as the Internet has developed, cyber-attacks have gradually increased and identity theft has emerged. This is a type of fraud committed by intruders who use fake web pages to access people's private information, such as user IDs, passwords, credit card numbers and bank account numbers. These scammers can also send e-mails that appear to come from important institutions and organizations, using phishing attacks that imitate their web pages and act as if they were the originals. Traditional security mechanisms cannot prevent these attacks because they directly target the weakest part of the connection: end-users. Machine learning technology has been used to detect and prevent this type of intrusion. Anti-phishing methods have been developed by detecting the attacks made with these technologies. In this paper, we combined the websites used in phishing attacks into a dataset and then obtained results using 4 classification algorithms on this dataset. The experimental results showed that the proposed systems give very good accuracy levels for the detection of these attacks.

Index Terms—phishing attacks, machine learning, classification algorithms, phishing detection, cybersecurity

I. INTRODUCTION

In the developing world, we use the Internet very actively to communicate and to reach information, and the user base has recently increased with Internet applications. For this reason, communication and information transfer through services such as banking, e-commerce, e-mail, and social media applications like Instagram have risen drastically, which has a hugely positive influence on our lives [1]. On the other hand, security measures are not sufficiently organized or capable of preventing the wide variety of cyber-attack threats or of protecting computer users. This is a vital security problem even for experienced and educated users, and any cyber-threat such as a phishing attack can cause crucial losses.

Phishing is an online attack in which fraudsters send messages to user accounts to collect sensitive, personal, and financial information. Phishing attacks especially seek access to financial information using e-mails, official websites and credit card companies. In these messages, there is a URL link that points the user to another website. The website that the user connects to is a fake website with an innocent appearance, and the information that the user shares with this website is transmitted to the phisher.

The numbers of phishing websites detected in the first, second, and third quarters of 2020 were 165,772, 146,994 and 571,764, respectively. In total, 884,530 unique phishing websites were detected in the first three quarters of 2020. For the same quarters of 2019, the numbers of detected websites were 180,768, 182,465 and 266,378, a total of 629,611. This means an increase of approximately 40% in phishing websites in a year [2].

Fig. 1. Phishing attacks.

In phishing attacks, the phisher usually designs a fake web page. This web page looks similar to the original web page and has a different but deceptive URL. In this way, phishers can access the private information of users. A careful user may notice that the URL is malicious and belongs to a phishing site. However, phishers take advantage of human vulnerabilities and use social engineering techniques to hide their scam.

As part of the deceptive process of phishing, e-mails sent by phishers have the appearance of coming from the official e-mail accounts of institutions and organizations. When the user clicks on a link in these e-mails, it leads to a malicious website. This website captures the credentials entered by the user, and this information is saved on a different server. The phisher then uses it to commit a cybercrime. Various phishing techniques and methods have been developed with the advancement of technology in order to acquire the confidential data of users. Phishing techniques are shown in the table below [3].

There are also anti-phishing techniques and measures to get rid of spam messages and protect the vital information of users. Recently, they have been developed with different approaches. These techniques are:


1) Spam filters have been designed to detect and prevent detrimental and phishing e-mails.
2) Web browsers such as Google Chrome, Internet Explorer and Mozilla Firefox have taken measures that warn us of phishing on the websites we enter.
3) Using different password entries. For example, some banks have added a step in which users select images in addition to entering a fixed password. This has increased the number of password entries.
4) To prevent phishing scams, some organizations analyze websites to remove them.
5) Use of multi-factor authentication methods for entry to specific pages.

To be aware of phishing and to detect it, we should take a look at the techniques used by phishers. Some of these techniques are [4]:
• Using an IP address in the URL
• Using a long URL to hide the doubtful part
• Using a tiny URL to hide a long URL
• Having the @ symbol in the URL
• Adding a known prefix or suffix to the domain name
• Using the word "https" in the domain name

Fig. 2. Phishing process.

Phishing Techniques              Definition
Spear Phishing                   In spear phishing, hackers do not target random people or organizations. The phisher does specific research to launch an attack and organizes personal attacks to trap the target.
Session Hijacking                The phisher uses this method to steal information from users through the web session control mechanism. The identity hunter accesses the web server illegally with the help of listeners.
Email/Spam                       E-mail, one of the most common techniques, asks for user information through e-mails sent to millions of people. These messages send users a form in which to fill in their personal account information, which is then used to access their accounts.
Content Injection                The phisher changes portions of the content on a trusted website so that the user is connected to a different web page, which is used to access the user's personal information.
Web Based Delivery               The phisher observes our actions on the website and transmits our information to the phishing site without the user's knowledge.
Phishing through Search Engines  A technique in which phishers steal the credit card information of users through the products they search for on search engines.
Link Manipulation                A method in which the user is connected to a malicious website after clicking on a link sent by the phisher.
Smishing (SMS Phishing)          A method carried out through SMS messages sent to our phones. Phishers can access our personal information via the link in the message.
Malware                          As soon as we click on the link that phishing scammers send to our e-mail, the malware starts running on our computer.
Trojan                           A type of maliciously written software used to access credentials.
Ransomware                       Access to a file or device is denied in order to get a ransom from the user. When the user clicks on a link or is tricked by a malicious ad, the malware gets installed on their computer.

The rest of this paper is organized as follows: in Section II, the related works on phishing detection are reviewed. Section III focuses on the proposed system and gives details about the dataset, tools and machine learning methods that are used in this work, followed by information about the data preprocessing part. Results are presented in Section IV. Section V discusses the work and how to improve it in the future. Finally, the conclusion is presented in Section VI.

II. RELATED WORK

Phishing is an old problem in the history of the Internet. Hackers consistently and insidiously try to obtain and abuse people's information. Users must be quite careful to avoid these kinds of attacks, and effective and well-organized strategies should be developed. This study aims to carry out accurate predictions of phishing websites with several algorithms. Marchal, François, State and Engel (2014) built a phishing dataset by downloading the daily PhishTank blacklist data between October 11th and November 10th, 2012, with 53,089 unique URLs [5]. After a selection step, they had 48,009 extended phishing URLs. Then, for a balanced dataset, they took the same number of legitimate URLs from DMOZ. Their study uses supervised classification techniques. They built a feature vector matrix from the dataset, in which each vector is composed of 12 elements. The predicted variable is 0 for legitimate and 1 for phishing URLs. Using Weka, they tested seven classifiers. With the Random Forest classifier they achieved 94.91% accuracy with a 1.44% false positive rate.

Jain and Gupta [6] extract nineteen features from the client side only, using the URL and source code of the websites. The data is collected mostly from Phishtank and Openphish for verified phishing URLs and from Alexa for the legitimate ones; the training dataset includes 4059 websites, with 2141 phishing and 1918 legitimate sites. They obtain a TPR of 99.39% and an FPR of 1.25%. They implemented intuitive methods to produce the feature vector and generated a single feature vector for each website sample to build a labelled dataset. They evaluated the dataset with 10-fold cross-validation. Using WEKA, the study reached 99.09% accuracy with Random Forest, 96.16% with SVM, 98.05% with neural networks, 98.25% with Logistic Regression (LR), and 97.59% with Bayes.

Weedon, Tsaptsinos and Denholm-Price [7] obtained their dataset from Phishtank, which is completely verified, and from DMOZ websites. Their study contained 4000 URLs for the training process
and a quarter of the training data belonged to the malicious class, while the remaining data belonged to the legitimate class. Their testing set consisted of 7000 URLs, nearly 57% of which were malicious. The study used a literal-only dataset to assess the accuracy of the Random Forest algorithm and obtained 86.9%. With other algorithms, they obtained 83.9% accuracy with J48, 64.6% with Bayes and 81.5% with LR.

Sahingoz et al. [8], [14] collected phishing URLs mostly from Phishtank by writing a script. Over 70000 URLs were available in their dataset; roughly half of them were legitimate websites and the other half were phishing websites. They extracted each word from these URLs to use in their analyses. They then implemented a Random Word Detection Module, and all words longer than seven characters were examined by a Word Decomposer Module (WDM) to separate them into subwords. For words that are not compound, the WDM returned only the original word. After that, their Maliciousness Analysis Module examined and processed the output words of the WDM together with the words of up to seven characters. A couple of auxiliary features were then extracted from these words. Random Forest was the most successful algorithm, with 97.98% accuracy. Natural Language Processing based features increased the performance and had better scores than word vectors in all algorithms except Naive Bayes.

Liu, Wang, Lang and Zhou [9] use the WEKA library to run the Random Forest, J48, LR, SVM, MLP and Bayes algorithms. Their dataset contained 29000 URLs, of which approximately 12500 malicious URLs were obtained from Phishtank. These were combined with 16,516 legitimate URLs taken from the digg58 website. They identified 41 features in their study and tuned the Random Forest algorithm. After their implementation, considerable scores were achieved by Random Forest classification. The algorithm's precision was 99.7% and its FPR, which is an important factor in this kind of problem, was less than 0.4%.

Rakesh et al. [10] also use the Weka tool. They collected the legitimate URL set from DMOZ and the fake samples from a widespread source, Phishtank. The dataset consists of 2000 balanced URLs. The study had 9 features that were extracted using a Java program. They generated 6 subsets of varying size to observe how the accuracy rate changes with dataset size. In their project, they classified the URLs with the C4.5 algorithm in WEKA. As a result, the higher accuracy scores belonged to the C4.5 and AdaBoost algorithms.

Aydın et al. [11] identify the most attacked websites and their devious URLs from Phishtank. The study analyzed 8,538 URLs in total. Nearly 40% of them were legitimate and the remaining ones were fraudulent. Their program obtained textual properties and "whois" records. Additionally, they obtained some data manually. The dataset has 133 separate features related to URLs. Gain Ratio Attribute (GRA) and ReliefF Attribute evaluation were used for feature selection, and the results were analyzed with WEKA. The SMO and J48 algorithms achieved their best results using the ReliefF attribute-based (58) selection technique, with accuracies of 96.42% and 98.47%, respectively. Naive Bayes reached a better result using Gain Ratio Attribute (36), with 87.08% accuracy.

İbrahim and Hadi [12] use the WEKA tool to implement classifiers on a public NASA repository dataset. The dataset has 30 attributes and 11055 instances. They categorized the features of their dataset into four groups: address bar, abnormal, website content and domain-based features. They first used K-fold cross-validation with a k value of 10. Random Forest obtained the highest accuracy, 95.2%, both with and without feature selection algorithms, and Bayes obtained the lowest. All classifiers obtained better accuracy without feature selection than with it.

James et al. [13] analyzed several algorithms using WEKA and MATLAB. First, they extracted the features. Then they chose a classifier to implement in MATLAB. They collected URLs of benign websites from Alexa, DMOZ and web browser history. They collected 37000 URLs, 45% of which were phishing samples from Phishtank. They also collected WHOIS information for some websites. Using only the lexical features, they achieved classification rates of 93.2% for a test section of 60% and 93.78% for a test section of 90%. They used a Regression Tree in MATLAB, and the accuracy rate was 91.08% on 60% of the dataset, whereas it was 85.63% on 90% of the dataset.

Priya and Meenakshi [15] analyze the C4.5 (J48) algorithm using the WEKA tool. Phishing and legitimate websites are collected from PhishTank. Thirty-two features are extracted from the websites. Two training datasets, with 750 URLs and 2000 URLs, are created to train the J48 algorithm. The test dataset has 300 URLs. With 750 URLs, the tree has 45 nodes, of which 28 are leaves. With 2000 URLs, 75 nodes are created, of which 43 are leaf nodes. The algorithm has an accuracy rate of 82.6%.

III. PROPOSED SYSTEM

A. Dataset Description

In their attacks, phishers try to get their victims to click the URL of the phishing site. Identity hunters change the appearance of the URL structure in various ways. These changes make the URLs look different from those of the legitimate site. What we do here is, after extensive research on malicious URLs, use some properties of the URL to classify the web page, and we analyze the URLs we detect. Some features are defined below for the malicious URL [10].

Fig. 3. URL components.

The CatchPhish D3.csv dataset had 126,077 rows and 2 columns: the site name and a phishing value, which is 1 if the site is a phishing site and 0 if not. We obtained the dataset from [21]. In this dataset, legitimate sites are collected from Common Crawl and Alexa, while phishing sites are collected from Phishtank. After some pre-processing steps, 122,055 values remained, 85,220 with value 0 and 36,835 with value 1, and those
values are ordered so that the 0's come first. After extracting values from the address bar, we had 15 columns, for example protocol, site length, host name, file name, path, path length, fragment, number of query keys, port, number of delimiters, number of reserved characters, etc.

Fig. 4. Distribution of dataset according to phishing sites.

TABLE I
DESCRIPTION OF RELEVANT NOMINAL ATTRIBUTES

Attribute                  Description / Values
site                       full-length sites in the dataset
site len                   length of sites
protocol                   http (0), https (1)
url host name              combination of subdomain and domain name
url host name len          length of host name of a URL
file name                  name of a file
file name len              length of a file name
file name without ext      file name without extension
file name len without ext  length of file name without extension
path                       location of a file
path len                   length of path of a URL
query                      pass data to the server
query len                  length of a query string
num of query keys          count of query keys of a URL
fragment                   internal page reference
fragment len               length of fragment identifier
port                       by default 80 for HTTP, 443 for HTTPS
num of reserved chars      reserved chars: ; / ? : @ & = + $
num of delimiters          delimiters: < > # % ‘
num of unreserved chars    unreserved chars: - . ! * ~ ' ( )
num of unwise chars        unwise chars: { } | \ ^ [ ]
phish                      not phish (0), phish (1)

B. Tools

1) Python: Python was created by Guido van Rossum in the early 1990s [19]. It has a simple structure, which makes it a good choice for educational purposes for students and beginners. Being one of the preferred languages brings more contributions to open source projects related to the language, which leads to a rapid development cycle.

2) Jupyter Notebook: Jupyter Notebook is an open source web application that lets developers execute code in pieces and create visualizations. In this way, it allows users to view code blocks together with their results.

3) Weka: Waikato Environment for Knowledge Analysis (WEKA) is a Java-based open-source tool developed by the University of Waikato. It is used for data mining. Weka includes machine learning algorithms and is easy to use. Visualization and data preprocessing tools are also included [16].

C. Methods

Data Mining (DM) is the business of accessing information or mining among big data. Our job here is to predict precisely what will come from large chunks of data. The computer program we use for this is WEKA. WEKA is very convenient here because it produces results very quickly when running machine learning algorithms. Thanks to the results we obtained, we can compare the algorithms we use. We used the CatchPhish dataset in our program. We obtained features by breaking down the URLs in the dataset, which turned it into a multivariate dataset. Classification is a data mining function that assigns items in a dataset to target categories or classes. The purpose of classification is to make accurate predictions for each value in the data. In the classification model, we looked at algorithms such as Random Forest and Decision Tree. Finally, we looked at Naive Bayes. We used multiple algorithms because this gives the advantage of comparing the information extracted by each algorithm. As a result, the multiple classification system aims to make the best use of the data in the dataset.

1) Random Forest Classifier: Random Forest is one of the most used and most popular machine learning algorithms for classification. As its name suggests, the algorithm creates a forest made up of decision trees, each built on a random sample of the data, and combines their outputs to make a prediction [17]. In our experiments, the algorithm produced results quickly and achieved the best accuracy among the algorithms we compared.

2) Decision Tree: The tree-based learning algorithm is one of the most used algorithms for data mining classification. The algorithm has a tree-like model and makes a series of decisions, expressed as conditional statements, to reach the desired result [1]. It can be used for both classification and regression problems and can give good results for the task at hand. A decision tree divides a dataset into smaller and smaller subsets by applying certain rules. In other words, the data is split into small pieces that are easier to handle at each step. In addition, a decision tree can be visualized, which makes it much easier to understand.

3) Logistic Regression: Logistic regression is a linear classification method for categorical dependent variables. The purpose here is to analyze the effect of the independent variables in our dataset on the outcome. Like all regression algorithms, it is a predictive analysis. The result is a binary variable that takes the values 0 (false) and 1 (true). Observations in logistic regression must be independent of each other. The model does not require a linear relationship between the dependent and independent variables.
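
As a minimal sketch of the logistic regression idea described above, the following pure-Python example maps a weighted sum of URL features to a probability with the logistic (sigmoid) function and fits the weights by gradient descent. It is not the WEKA implementation used in this work: the function names (`sigmoid`, `score`, `train_logistic`, `predict`), the learning rate, the number of epochs and the toy samples are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, x):
    """Linear combination of the features; weights[0] is the bias term."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

def train_logistic(rows, labels, lr=0.5, epochs=5000):
    """Fit the weights by batch gradient descent on the average log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = sigmoid(score(weights, x)) - y  # prediction error for this sample
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def predict(weights, x):
    """Return 1 (phish) if the estimated probability is at least 0.5, else 0."""
    return 1 if sigmoid(score(weights, x)) >= 0.5 else 0

# Hypothetical toy samples: [URL length scaled to 0..1, 1 if '@' occurs in the URL]
X = [[0.2, 0], [0.3, 0], [0.25, 0], [0.8, 1], [0.9, 1], [0.7, 0]]
y = [0, 0, 0, 1, 1, 1]  # 0 = not phish, 1 = phish

w = train_logistic(X, y)
print([predict(w, x) for x in X])
```

The binary output (0 or 1) comes from thresholding the estimated probability at 0.5, which corresponds to the "not phish" and "phish" classes used in the dataset.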

4) Naive Bayes: This classifier is the simplest network model in the family of "probabilistic classifiers". It represents each sample as a vector of multiple properties. The model is trained on the information provided, and at the end of this training it is expected to classify new data correctly.

D. Dataset Preprocessing

Data preprocessing is a step to obtain qualified data, because a dataset can contain incomplete, inconsistent and outdated data [17]. Our dataset initially had 2 columns: the full site name and the phishing label. First, we converted the dataset from CSV to Excel format. Some rows in the phishing column were missing, so we filled them with NaN values, and some rows had different numbers, so we found and deleted them. We dropped the empty rows. After that, we extracted the needed features from the site column and obtained 15 features. We normalized some columns such as port, which means we assigned a number to each value. Then we made the dataset available for use in Weka and made predictions with Random Forest, Decision Tree (J48), Naive Bayes and Logistic Regression. Before doing that, we applied 5-fold cross-validation, which divides the data into five subsets and holds one subset out as test data in each round.

TABLE II
CONFUSION MATRIX TERMS

                     Predicted Positive Value   Predicted Negative Value
Real Positive Value  True Positive (TP)         False Negative (FN)
Real Negative Value  False Positive (FP)        True Negative (TN)

True Positive Rate (TPR): The proportion of real positive samples that the classifier predicts as positive. The higher the better.
TPR = TP / (TP+FN)
False Positive Rate (FPR): The proportion of real negative samples that the classifier incorrectly predicts as positive. The lower the better.
FPR = FP / (FP+TN)
Accuracy: How often the classifier makes correct predictions.
Accuracy = (TP+TN) / (TP+FP+TN+FN)
Precision: The proportion of samples predicted as positive that are really positive. It is preferable for it to be high.
Precision = TP / (TP+FP)

The confusion matrix for each algorithm is shown in Fig. 5. The diagonal values of the matrices, which are TP and TN, show the correctly estimated values. By looking at these, we can say that Random Forest predicted the highest number of correct values.

IV. EXPERIMENTAL RESULTS

Recall: The proportion of real positive samples that are correctly classified as positive.
Recall = TP / (TP+FN)
F1-Score: The harmonic mean of the Recall and Precision values. Its purpose is to measure the performance of the classifiers, and it is mostly used to compare classifiers.
F1-Score = 2 * Precision * Recall / (Precision + Recall)
ROC Curve: A graph of classifier performance across all classification thresholds. The ROC curve reports the trade-off between sensitivity and specificity. The area under the ROC curve is called AUC and is used as an evaluation criterion. AUC is a measure of how well a parameter can distinguish between two classes.

TABLE III
PERFORMANCE METRICS

Algorithm            Class      Precision  Recall  F1-Score
Random Forest        Not Phish  0.877      0.864   0.870
                     Phish      0.743      0.765   0.754
Decision Tree        Not Phish  0.871      0.818   0.844
                     Phish      0.684      0.765   0.722
Naive Bayes          Not Phish  0.833      0.833   0.833
                     Phish      0.676      0.676   0.676
Logistic Regression  Not Phish  0.844      0.818   0.831
                     Phish      0.667      0.706   0.686

The accuracy and run time of each algorithm are shown in Table IV. The execution time of Random Forest is a little longer than that of the other algorithms we use, except Logistic Regression, but it is still fast. The accuracy of the Random Forest Classifier is 83%. The Logistic Regression and Naive Bayes algorithms both reach 78% accuracy, but Naive Bayes was faster. J48 was also fast and reached 80% accuracy. As a result, we can say that the Random Forest Classifier gave us the best result.

TABLE IV
ALGORITHMS

Algorithm            Accuracy  Run time
Random Forest        83%       0.06 sec
Decision Tree (J48)  80%       0.03 sec
Naive Bayes          78%       0.01 sec
Logistic Regression  78%       0.07 sec
Algorithm accuracy and runtimes.
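
The metric formulas used in this section can be checked with a short calculation. The sketch below computes TPR, FPR, accuracy, precision, recall and F1-score from the four confusion-matrix counts. The function name `metrics` and the example counts are hypothetical, chosen only for illustration; they are not taken from the experiments.

```python
def metrics(tp, fn, fp, tn):
    """Compute the evaluation metrics from the four confusion-matrix counts."""
    recall = tp / (tp + fn)      # identical to the true positive rate (TPR)
    fpr = fp / (fp + tn)         # false positive rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "TPR": recall,
        "FPR": fpr,
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
    }

# Hypothetical counts for illustration only (not the counts behind Table III).
result = metrics(tp=80, fn=20, fp=10, tn=90)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because the metrics are reported per class in Table III, the same function would be applied twice, once treating "Phish" as the positive class and once treating "Not Phish" as the positive class.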

V. D ISCUSSION
In this study, the accurate prediction of phishing websites
Fig. 5. Confusion matrix for each algorithm. by different classification techniques is the ultimate goal and

we divided our dataset into two main parts as training and [7] M. Weedon, D. Tsaptsinos and J. Denholm-Price, ”Random forest
test in the first phase. Initially, we tried to extract some explorations for URL classification,” 2017 International Conference On
Cyber Situational Awareness, Data Analytics And Assessment (Cyber
effective features from the URL dataset that we can use to SA), London, 2017, pp. 1-4, doi: 10.1109/CyberSA.2017.8073403.
detect phishing. Then we make data preprocessing to clear [8] E. Buber, B. Dırı and O. K. Sahingoz, ”Detecting phishing attacks
and prepare the data. Then we apply Random Forest, Decision from URL by using NLP techniques,” 2017 International Conference
on Computer Science and Engineering (UBMK), Antalya, 2017, pp.
Tree, Naive Bayes and Logistic Regression algorithms to reach 337-342, doi: 10.1109/UBMK.2017.8093406.
the most qualified result and compare each one’s scores. [9] C. Liu, L. Wang, B. Lang and Y. Zhou, ”Finding effective classifier for
We observed that addition of phishing samples increased the malicious URL detection,” Proceedings of the 2018 2nd International
Conference on Management Engineering, Software Engineering and
accuracy score of algorithms. Balanced and enhanced dataset Service Sciences, 2018, pp. 240-244, doi: 10.1145/3180374.3181352.
can create better solutions in this case. In addition to all, some [10] Rakesh R, Kannan A, Muthurajkumar S, Pandiyaraju V and SaiRamesh
deep learning models which showed the efficiency in [14] can L, ”Enhancing the precision of phishing classification accuracy using
reduced feature set and boosting algorithm,” 2014 Sixth International
be adopted to the proposed model in the future work. Conference on Advanced Computing (ICoAC), Chennai, 2014, pp. 86-
90, doi: 10.1109/ICoAC.2014.7229752.
VI. C ONCLUSION [11] M. Aydin, I. Butun, K. Bicakci and N. Baykal, ”Using Attribute-based
Feature Selection Approaches and Machine Learning Algorithms for
In this paper, we have executed a phishing detection system Detecting Fraudulent Website URLs,” 2020 10th Annual Computing and
on WEKA and tested its efficiency by using a public dataset Communication Workshop and Conference (CCWC), Las Vegas, NV,
USA, 2020, pp. 0774-0779, doi: 10.1109/CCWC47524.2020.9031125.
as CatchPhish D3 by using different classification techniques. [12] D. R. Ibrahim and A. H. Hadi, ”Phishing Websites Prediction Us-
The dataset has 2 columns and we extracted some features ing Classification Techniques,” 2017 International Conference on New
and create a new dataset with a structured format. To make Trends in Computing Sciences (ICTCS), Amman, 2017, pp. 133-137,
doi: 10.1109/ICTCS.2017.38.
this, it is needed to make some preprocessing steps to use [13] J. James, Sandhya L. and C. Thomas, ”Detection of phishing URLs
the dataset in Weka system. For detection of phishing sites using machine learning techniques,” 2013 International Conference on
URL of the web pages are mainly used. By using this data Control Communication and Computing (ICCC), Thiruvananthapuram,
2013, pp. 304-309, doi: 10.1109/ICCC.2013.6731669.
some features are produced and these features are used for [14] C.B. Cebi, F.S. Bulut, H. Firat, O.K. Sahingoz and G. Karatas, ”Deep
detection of whether the web page is phishing or not. To Learning Based Security Management of Information Systems: A Com-
predict this, four different machine learning models are used: random forest, naive Bayes, logistic regression, and decision tree.

In conclusion, we found that the Random Forest algorithm performs better than the others, with relatively high accuracy rates. The models can be enhanced by adding new features to the system, as in [4], [18]. Additionally, apart from the URL-based features, some content-based features [20] can also be used. Finally, third-party services such as Alexa and Whois can also help to identify whether a page is phishing or not.

REFERENCES

[1] M. Korkmaz, O. K. Sahingoz and B. Diri, "Detection of Phishing Websites by Using Machine Learning-Based URL Analysis," 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2020, pp. 1-7, doi: 10.1109/ICCCNT49239.2020.9225561.
[2] Phishing Activity Trends Report, Summary – 3rd Quarter 2020. (2020). [Online]. Available: https://apwg.org/trendsreports/
[3] A. Das, S. Baki, A. El Aassal, R. Verma and A. Dunbar, "SoK: A Comprehensive Reexamination of Phishing Research From the Security Perspective," in IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 671-708, First quarter 2020, doi: 10.1109/COMST.2019.2957750.
[4] M. Korkmaz, O. K. Sahingoz and B. Diri, "Feature Selections for the Classification of Webpages to Detect Phishing Attacks: A Survey," 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 2020, pp. 1-9, doi: 10.1109/HORA49412.2020.9152934.
[5] S. Marchal, J. François, R. State and T. Engel, "PhishScore: Hacking phishers' minds," 10th International Conference on Network and Service Management (CNSM) and Workshop, Rio de Janeiro, 2014, pp. 46-54, doi: 10.1109/CNSM.2014.7014140.
[6] A. K. Jain and B. B. Gupta, "Towards detection of phishing websites on client-side using machine learning based approach," Telecommunication Systems, vol. 68, 2018, pp. 687-700, doi: 10.1007/s11235-017-0414-0.
parative Study", Journal of Advances in Information Technology Vol 11 (3), 2020.
[15] A. Priya and E. Meenakshi, "Detection of phishing websites using C4.5 data mining algorithm," 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, 2017, pp. 1468-1472, doi: 10.1109/RTEICT.2017.8256841.
[16] K. P. S. Attwal and A. S. Dhiman, "Exploring data mining tool-Weka and using Weka to build and evaluate predictive models," Advances and Applications in Mathematical Sciences, vol. 19, 2020, pp. 451-469.
[17] B. Geyik and M. Kara, "Severity Prediction with Machine Learning Methods," 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 2020, pp. 1-7, doi: 10.1109/HORA49412.2020.9152601.
[18] E. Buber, Ö. Demir and O. K. Sahingoz, "Feature selections for the machine learning based detection of phishing websites," 2017 International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, 2017, pp. 1-5, doi: 10.1109/IDAP.2017.8090317.
[19] A. Rawat, "A Review on Python Programming," International Journal of Research in Engineering, Science and Management, vol. 3, 2020, pp. 8-11.
[20] U. Ozker and O. K. Sahingoz, "Content Based Phishing Detection with Machine Learning," 2020 International Conference on Electrical Engineering (ICEE), Istanbul, Turkey, 2020, pp. 1-6, doi: 10.1109/ICEE49691.2020.9249892.
[21] R. S. Rao, T. Vaishnavi and A. R. Pais, "CatchPhish: detection of phishing websites by inspecting URLs," Journal of Ambient Intelligence and Humanized Computing, vol. 11, 2020, pp. 813-825, doi: 10.1007/s12652-019-01311-4.
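
As a supplementary illustration of the URL-based features mentioned in the conclusion, the following minimal Python sketch derives a few lexical features from a URL string. The feature names and the function `extract_url_features` are hypothetical and chosen only for illustration; they are not the feature set used in this paper. Such a feature vector could then be exported to a tool such as WEKA for classification.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative (assumed) lexical features from a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Non-empty path segments, e.g. "/login/verify" has two.
    segments = [part for part in parsed.path.split("/") if part]
    return {
        "url_length": len(url),
        "dot_count": host.count("."),
        "hyphen_count": host.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip(host),
        "uses_https": parsed.scheme == "https",
        "path_depth": len(segments),
    }


if __name__ == "__main__":
    # Example URLs are made up for demonstration purposes.
    for example in ("http://192.168.0.1/login/verify",
                    "https://secure-login.example.com/"):
        print(example, extract_url_features(example))
```

Keeping every feature a simple function of the URL string means that no page download or third-party lookup is needed, which is the main practical appeal of URL-based detection; content-based or Whois-based features would require additional network requests.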


