Volume 8, Issue 10, October – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Social Engineering Detection: Phishing URLs

Utkarsh Singh1, Ashvini Kumar2, Pratham Jain3, Tanya Jaiswal4, Sudhanshu Shekhar5, Gurleen Kaur6
Dept. of CSE, Chandigarh University, Mohali, India

Abstract:- In the digital age, the proliferation of malicious phishing URLs poses a significant threat to online security. While conventional machine learning algorithms have been employed to combat this menace, our research pioneers the use of ensemble methods, including XGBoost and Random Forest, for phishing URL detection. Our methodology involves data collection, preprocessing and feature extraction, followed by model training, evaluation and comparison. Notably, our results reveal the superior accuracy of ensemble methods in distinguishing phishing URLs from legitimate ones. These findings underscore the potential of ensemble methods as a game-changing asset in the battle against cyber threats, promising enhanced online security and the protection of sensitive user information.

Keywords:- Social Engineering, Phishing URLs, Cyber Security, Machine Learning.

I. INTRODUCTION

In the digital age, where the exchange of information and communication is paramount, individuals and organizations alike face an ever-increasing threat from social engineering attacks, with phishing being a notorious exemplar. Within this realm, one insidious tactic has emerged as a primary conduit for deceit and exploitation: phishing URLs. These malicious web links, often camouflaged as legitimate destinations, are designed to deceive unsuspecting users into divulging sensitive information or unleashing cyber threats.

Just as any file on a computer can be located by supplying its filename, any website can be located using a URL. Each Uniform Resource Locator (URL) has two primary components: the protocol and the resource identifier. The protocol is the first part of the URL, and it specifies the method used to access the resource. For example, HTTPS is a secure version of HTTP that is used to retrieve hypertext documents. Other protocols include File Transfer Protocol (FTP), Domain Name System (DNS), and more. The second part of the URL is the resource identifier, which is used to grant access to an online destination. For instance, in the URL https://www.google.com, the resource identifier is “www.google.com”.

Safi and Singh [1] have described several types of phishing attacks, including email, web and link manipulation.
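The two URL components described in the Introduction (protocol and resource identifier) can be illustrated with a short sketch. It uses Python's standard `urllib.parse.urlsplit`; the function name `url_components` and the dictionary keys are assumptions chosen for illustration and are not taken from the paper.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the two parts named in the text.

    The key names are illustrative only (hypothetical).
    """
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,              # e.g. "https"
        "resource_identifier": parts.netloc,   # e.g. "www.google.com"
        "path": parts.path,                    # empty when the URL has no path
    }


print(url_components("https://www.google.com"))
# {'protocol': 'https', 'resource_identifier': 'www.google.com', 'path': ''}
```

A phishing URL often keeps a trusted-looking protocol while changing the resource identifier, which is why the two parts are worth inspecting separately.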

Fig 1 Types of Phishing Attacks

IJISRT23OCT1863 www.ijisrt.com 1773


The need for robust and efficient mechanisms to detect phishing URLs has never been more critical. The stakes are high, encompassing not only the protection of personal data but also the preservation of trust in online transactions and communication.

Fig 2 Example of a URL

This research paper delves into the domain of "Social Engineering Attack Detection: Phishing URLs." It focuses on harnessing the capabilities of multiple machine learning models, in combination with ensemble methods, to discern phishing URLs from their legitimate counterparts. This research strives to illuminate the efficacy of different models and their potential for enhancing the accuracy and timeliness of detection, ultimately bolstering cybersecurity defenses in a world where the preservation of digital trust is paramount.

II. RELATED WORK

The field of spam and social engineering detection has witnessed significant advancements over the years, with researchers proposing various techniques and models to combat these security threats. In this literature survey, we reviewed 6 papers on Phishing Detection Systems.

Qabajeh et al. [2] reviewed conventional and automated phishing detection techniques. Raising awareness, educating users, holding regular courses or seminars, and taking a legal perspective are some of the strategies to prevent phishing. Rule-based and machine learning techniques are discussed in the context of automated protection against phishing.

Kunju et al. [3] surveyed methods for investigating phishing attacks. Their review presents various techniques and solutions for detecting phishing attacks and shows that a number of proposed remediation measures are not sufficient to deal with them.

Kathrine et al. [4] proposed a framework to detect and prevent various phishing attacks. The study reports that machine learning-based algorithms can be of real-world benefit. Its limitations are that the literature examined includes only 11 studies and that deep learning techniques for combating phishing websites are not covered.

Benavides et al. [5] conducted a review and analyzed different methods used by other researchers to apply deep learning to the detection of phishing attacks. They conclude that a large gap remains in deep learning algorithms for detecting phishing attacks. The review covers only 19 articles published between 2014 and 2019.

Arshad et al. [6] present different types of phishing and anti-phishing techniques. According to their systematic literature review (SLR), the most commonly used phishing tactics include spear phishing, email spoofing, phone phishing and email manipulation. The study found that machine learning methods were the most accurate.

Shantanu et al. [7] framed malicious URL detection as a binary classification problem and evaluated the performance of several well-known machine learning classifiers. The models were trained on a public Kaggle dataset of 450,000 URLs.

Table 1 summarizes the reviewed work on phishing detection systems.

Table 1 Phishing Detection Systems


Author and Year | Aim | Main Findings | Limitations
Qabajeh et al. [2], 2018 | This review article contrasts conventional anti-phishing techniques, such as utilizing a legal viewpoint, educating users, holding recurring training sessions, and increasing awareness. | Machine learning and rule generation are ideal for stopping phishing attempts because of the high detection rate and, more importantly, the results are easy to understand. | Sixty-seven studies were evaluated, but the studies did not include an in-depth study.
Kunju et al. [3], 2019 | This article provides an overview of various machine learning algorithms such as kNN, Naive Bayes, Decision Trees, SVM, Neural Networks and Random Forests to detect phishing websites. | This study indicates that detecting phishing websites with a single method is insufficient. | In the literature reviewed in this study, only 14 studies discussed machine learning.
Kathrine et al. [4], 2019 | This project introduces various phishing attacks and the latest protection techniques. This study provides a framework for identifying and avoiding phishing scams. | This study shows that machine learning-based algorithms can identify real-world benefits. | Just 11 studies were covered in the work, and Deep Learning methods for phishing website mitigation are not included in the research.
Benavides et al. [5], 2020 | The purpose of this literature review is to evaluate various proposals from other researchers for using deep learning to identify phishing attacks. | This project only considers the search terms phishing and deep learning, including 19 studies. | In summary, there is still a huge gap in the field of deep learning algorithms for detecting phishing attacks.
Arshad et al. [6], 2021 | This study discusses various phishing strategies and protection against phishing. | They came to the conclusion that email manipulation, phone phishing, spear phishing, and email spoofing were the most often used phishing strategies. | The research only draws from twenty studies.
Shantanu et al. [7], 2021 | This study examines various classification models to determine which one has the best accuracy on a dataset of phishing URLs. | In this paper, they address the binary classification problem of malicious URL detection and evaluate the performance of various popular machine learning classifiers. | The models in this work were not constructed using ensemble methods.


III. METHODOLOGY

In this research, we present our methodology for the robust detection of malicious URLs, with a specific focus on machine learning models, feature engineering, and ensemble methods for classification. We proceed through a systematic set of steps.

We begin with the pivotal phase of data collection. The dataset [8] is taken from www.kaggle.com and includes 507,195 unique URLs, of which 72% are good URLs and 28% are malicious, as shown in Table 2. Data preprocessing follows, an indispensable step to ensure the integrity of the dataset. The data is cleaned to eliminate inconsistencies and noise. We also perform feature extraction, deriving significant attributes from the URLs, including domain, path, length, and the presence of special characters. These extracted features serve as input variables for our machine learning models.

Table 2 Dataset Details
Good URLs | Malicious URLs
72% | 28%
365,180 | 142,015

To train and test the models effectively, the data is divided into two groups: training and testing. The training set enables our models to learn from past data, and the test set provides the evidence for evaluating them.

The initial selection of machine learning models [7] was diverse and covered many types of learning. We chose models such as support vector machine (SVM), k-nearest neighbors (KNN), decision trees, random forest, gradient boosting, and bagging and boosting ensembles. These models represent a wide range of classification strategies. After model selection, the next step is the training phase. Each selected model is trained on the training data, a process that involves fine-tuning hyperparameters to improve its performance.

We then explore the power of ensemble methods to improve classification performance. This includes methods such as random forest, gradient boosting (such as XGBoost), AdaBoost, and stacking.

The core of our research is the comparative analysis. We examine the performance of each model in depth, with a focus on both traditional and ensemble methods. Through this analysis, we identify the strengths and limitations of each model and evaluate their accuracy and robustness in distinguishing malicious from legitimate URLs.

The flow diagram in Fig 3 describes the flow of our model, which consists of the pre-processing phase followed by the detection phase. The pre-processing phase comprises webpage feature generation, feature extraction and feature vectorization. The detection phase comprises the training and testing sets, model training and result analysis.
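The feature extraction step named in the methodology (domain, path, length, presence of special characters) could be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `extract_features`, the feature names, and the `SPECIAL_CHARS` set are all assumptions, and the handling of URLs without a scheme is a guess about how the dataset might be formatted.

```python
from urllib.parse import urlsplit

# Hypothetical set of "special characters"; the paper does not list which ones it counts.
SPECIAL_CHARS = "@-_?=&%"


def extract_features(url: str) -> dict:
    """Derive simple lexical features from a URL string (illustrative only)."""
    # urlsplit only fills in netloc when a scheme or a leading '//' is present,
    # so bare entries such as "example.com/login" are prefixed with '//'.
    parts = urlsplit(url if "://" in url else "//" + url)
    return {
        "domain": parts.netloc.lower(),
        "path": parts.path,
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special": sum(url.count(ch) for ch in SPECIAL_CHARS),
        "has_at_symbol": "@" in url,
    }


features = extract_features("www.example.com/a-b?x=1")
print(features["domain"], features["url_length"], features["num_special"])
```

Numeric features such as `url_length` can be fed to a classifier directly, while string features such as `domain` would need a further vectorization step before training.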


Fig 3 Phishing Model Flow Diagram
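The detection phase in the flow above separates the data into a training set and a testing set. A minimal sketch of an 80/20 split is given here, using only the Python standard library. Stratifying by label (so that the good/malicious ratio is preserved in both sets) is an assumption made for this example; the paper does not state how its split was drawn. The names `stratified_split`, `test_fraction` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=42):
    """Split samples into train/test sets while keeping each label's share.

    Returns two lists of (sample, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Toy data mirroring the 72% / 28% class balance reported for the dataset.
urls = [f"url_{i}" for i in range(100)]
labels = ["good"] * 72 + ["bad"] * 28
train, test = stratified_split(urls, labels)
print(len(train), len(test))  # 80 20
```

Keeping the class ratio the same in both sets matters for an imbalanced dataset, since a test set with too few malicious URLs would make precision and recall unstable.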

IV. EXPERIMENTAL ANALYSIS

Feature Extraction: Feature extraction [9] is the process of representing or transforming features to make machine learning models more efficient. It helps in reducing the size of the data and speeding up computation. The most common methods are discriminant analysis and principal component analysis.

Feature scaling: Feature scaling is the process of scaling data features to a fixed range. It is used during data preprocessing to handle data with high variance. Without scaling, machine learning models tend to give more weight to higher values and less weight to lower values. It is one of the most important and time-consuming steps in data preprocessing.

The dataset is split according to the 80-20 rule. Each model is trained on 80% of the data and tested on the remaining 20%.

➢ Measurements used to Evaluate Classification Models:

• True Positive (TP): Model predicts True and the result is also True.
• False Positive (FP): Model predicts True but the result is False.
• True Negative (TN): Model predicts False and the result is also False.
• False Negative (FN): Model predicts False but the result is True.
• Accuracy: The number of correct predictions divided by the total number of predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

• Precision: The ratio of correctly predicted positives to the total number of predicted positives.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

• Recall: The correctly predicted positives divided by the total number of actual positives.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

• F1-score: The harmonic mean of precision and recall.

$$F1\text{-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Table 3 shows the position of TP, TN, FP and FN in a confusion matrix.

Table 3 Confusion Matrix
Predicted \ Actual | Positive | Negative
Positive | True Positive | False Positive
Negative | False Negative | True Negative

The metrics above are used to evaluate the trained models. Two ensemble methods are used: the Random Forest and XGBoost classifiers. Random Forest achieved a prediction accuracy of 92.1%, and XGBoost achieved a prediction accuracy of 93.7%.

➢ Random Forest
Random forest [10] is a popular ensemble learning algorithm that aims to reduce variance by training a series of deep decision trees on different subsets of the same training set; their outputs are then averaged to obtain the final classification.

The results of the random forest model are shown in Fig 4, which reports the model's accuracy, precision, recall, and F-score.

Fig 4 Random Forest Results

➢ XGBoost
XGBoost [11] is an efficient, adaptable, and portable gradient boosting algorithm. To get good results, it makes use of weighted classifiers, tree pruning, and parallelization.

The results of the XGBoost model are shown in Fig 5, which reports the model's accuracy, precision, recall, and F-score.

Fig 5 XGBoost Results

The confusion matrix values of random forest and XGBoost are shown in Table 4.

Table 4 Confusion matrix values
Predicted \ Actual | Random Forest: Positive | Random Forest: Negative | XGBoost: Positive | XGBoost: Negative
Positive | 242 | 32 | 220 | 18
Negative | 22 | 235 | 12 | 180
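Equations (1) to (4) can be evaluated directly from confusion-matrix counts. The sketch below does this for the Random Forest column of Table 4, reading the cells in the layout of Table 3 (TP = 242, FP = 32, FN = 22, TN = 235). The function name `classification_metrics` is hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard is included in this sketch).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # equation (1)
    precision = tp / (tp + fp)                           # equation (2)
    recall = tp / (tp + fn)                              # equation (3)
    f1 = 2 * precision * recall / (precision + recall)   # equation (4)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts read from the Random Forest column of Table 4.
rf = classification_metrics(tp=242, fp=32, fn=22, tn=235)
for name, value in rf.items():
    print(f"{name}: {value:.3f}")
```

The values printed here are plain arithmetic on the Table 4 counts and are not guaranteed to coincide with the figures reported elsewhere in the paper.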

Table 5 summarizes the test results of random forest and XGBoost.

Table 5 Summary of Test Results

Algorithm | Random Forest | XGBoost
Accuracy | 0.921 | 0.937
Precision | 0.883 | 0.938
Recall | 0.914 | 0.949
F-Score | 0.898 | 0.928

As Table 5 shows, XGBoost scores higher than random forest on accuracy, precision, recall and F-Score.
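To make the idea behind Random Forest in this section concrete, the following sketch trains several simple models on bootstrap samples of the same training set and combines them by majority vote (bagging). It is a deliberately simplified stand-in, not the model evaluated in the paper: it uses one-split decision stumps instead of deep trees, omits the random feature selection that a real random forest performs, and runs on hypothetical toy data in which the single feature is a URL length and label 1 means malicious.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Return the one-feature threshold rule with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for left_label in (0, 1):
                preds = [left_label if r[f] <= t else 1 - left_label for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left_label)
    _, f, t, left_label = best
    return lambda row: left_label if row[f] <= t else 1 - left_label


def fit_bagged(rows, labels, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict(models, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: one feature (URL length), label 1 = malicious.
rows = [[10], [12], [15], [18], [60], [70], [80], [90]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagged(rows, labels)
print(predict(models, [5]), predict(models, [100]))
```

Because each stump sees a slightly different sample, individual stumps can disagree near the class boundary; the vote smooths out those disagreements, which is the variance reduction that the Random Forest description refers to.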

V. COMPARATIVE ANALYSIS

Various classification models have been built previously to classify URLs as safe or malicious. One such work is by Shantanu et al. [7], who chose the non-ensemble models Naïve Bayes, KNN and Support Vector Machines. Another is by Sharad Rajendra Parmar [12], who used Logistic Regression and KNN to train his models. Table 6 shows the comparative analysis of the results of the various algorithms.

Table 6 Comparative Analysis of Various Algorithms


Author | Algorithm | Accuracy | Precision | Recall | F-Score
Shantanu et al. [7] | Naïve Bayes | 0.891 | 0.881 | 0.843 | 0.876
Shantanu et al. [7] | KNN | 0.917 | 0.890 | 0.812 | 0.910
Shantanu et al. [7] | SVM | 0.921 | 0.901 | 0.842 | 0.913
Parmar [12] | Logistic Regression | 0.924 | 0.929 | 0.936 | 0.932
Parmar [12] | KNN | 0.543 | 0.605 | 0.548 | 0.756
Our Models | RF | 0.921 | 0.883 | 0.914 | 0.898
Our Models | XGBoost | 0.937 | 0.938 | 0.949 | 0.928

Fig 6 shows the comparative analysis of the algorithms used earlier and our ensemble methods.

Fig 6 Comparative Analysis of Algorithms

From the figure, we can see that our models, Random Forest and XGBoost, have performed well on all the metrics: Accuracy, Precision, Recall and F-Score.

VI. CONCLUSION

To reduce phishing and malware attacks, machine learning can be a very effective technique because it can classify URLs as good or malicious. All things considered, we can say that ensemble learning can produce good classification results. The rationale behind this is that ensemble learning solves a given problem by combining the best features of several models. This method significantly enhances the classification.

To obtain better outcomes, other combinations of machine learning models can be investigated in future studies. Our results indicate that ensemble algorithms, which combine several models, give better results than the individual machine learning algorithms.

REFERENCES

[1]. Asadullah Safi, Satwinder Singh, “A systematic literature review on phishing website detection techniques”, Journal of King Saud University, Volume 35, Issue 2, 2023, pp. 590-611, ISSN 1319-1578.
[2]. Qabajeh, I., Thabtah, F., 2018. “A recent review of conventional vs. automated cybersecurity anti-phishing techniques”. Computer Sci. Rev. 29, 44–55.
[3]. Kunju, M.V., Dainel, E., Anthony, H.C., Bhelwa, S., 2019. “Evaluation of phishing techniques based on machine learning”, 2019 International Conference on Intelligent Computing and Control Systems, ICCS 2019, Iciccs, pp. 963–968.
[4]. Kathrine, G.J.W., Praise, P.M., Rose, A.A., Kalaivani, E.C., 2019. “Variants of phishing attacks and their detection techniques”, Proceedings of the International Conference on Trends in Electronics and Informatics, ICOEI 2019, Icoei, pp. 255–259.
[5]. Benavides, E., Fuertes, W., Sanchez, S., Sanchez, M., 2020. “Classification of phishing attack solutions by employing deep learning techniques: a systematic literature review”. In: Rocha, Á., Pereira, R. (eds) Developments and Advances in Defense and Security. Smart Innovation, Systems and Technologies, vol 152. Springer, Singapore.
[6]. Arshad, A., Rehman, A.U., Javaid, S., Ali, T.M., Sheikh, J.A., Azeem, M., 2021. “A Systematic Literature Review on Phishing and Anti-Phishing Techniques”.
[7]. Shantanu, B. Janet and R. Joshua Arul Kumar, “Malicious URL Detection: A Comparative Study”, 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), Coimbatore, India, 2021, pp. 1147-1151, doi: 10.1109/ICAIS50930.2021.9396014.
[8]. https://www.kaggle.com/code/anseldsouza/phishing-url-classification-using-knn-and-lr/input
[9]. Rakesh Verma, “What’s in a URL: Fast Feature Extraction and Malicious URL Detection”, ACM ISBN 978-1-4503-4909-3/17/03.
[10]. Ali, Jehad, Rehanullah & Ahmad, Nasir & Maqsood, Imran (2012). “Random Forests and Decision Trees”, International Journal of Computer Science Issues (IJCSI).
[11]. Chen, Tianqi & Guestrin, Carlos (2016). “XGBoost: A Scalable Tree Boosting System”. pp. 785-794. 10.1145/2939672.2939785.
[12]. Parmar, Sharad, 2020. “Detection of Phishing URL using Ensemble Learning Techniques”, Master’s thesis, Dublin, National College of Ireland.
