Social Engineering Detection: Phishing URLs
ISSN No:-2456-2165
Abstract:- In the digital age, the proliferation of phishing URLs poses a significant threat to online security. While conventional machine learning algorithms have been employed to combat this menace, our research applies ensemble methods, including XGBoost and Random Forest, to phishing URL detection. Our methodology involves data collection, preprocessing, and feature extraction, followed by model training, evaluation and comparison. Notably, our results reveal the superior accuracy of ensemble methods in distinguishing phishing URLs from legitimate ones. These findings underscore the potential of ensemble methods as a valuable asset in the battle against cyber threats, promising enhanced online security and the protection of sensitive user information.

Keywords:- Social Engineering, Phishing URLs, Cyber Security, Machine Learning.

I. INTRODUCTION

In the digital age, where the exchange of information and communication are paramount, individuals and organizations alike face an ever-increasing threat from social engineering attacks, with phishing being a notorious exemplar. Within this realm, one insidious tactic has emerged as a primary conduit for deceit and exploitation: phishing URLs. These malicious web links, often camouflaged as legitimate destinations, are designed to deceive unsuspecting users into divulging sensitive information or unleashing cyber threats.

Just as any file on a computer can be located by supplying its filename, any website can be located using a URL. Each Uniform Resource Locator (URL) has two primary components: the protocol and the resource identifier. The protocol is the first part of the URL, and it specifies the method used to access the resource. For example, HTTPS is a secure version of HTTP that is used to retrieve hypertext documents. Other protocols include File Transfer Protocol (FTP), Domain Name System (DNS), and more. The second part of the URL is the resource identifier, which is used to grant access to an online destination. For instance, in the URL https://www.google.com, the resource identifier is “www.google.com”.

Asadullah Safi [1] has described several types of phishing attacks, including email, web and link manipulation.
III. METHODOLOGY

In this research, we present our methodology for the robust detection of malicious URLs, with a specific focus on machine learning models, feature engineering, and ensemble methods for classification. We proceed through a systematic set of steps.

We begin with the pivotal phase of data collection. The dataset [8] is taken from www.kaggle.com and includes 507,195 unique URLs, of which 72% are good URLs and 28% are malicious, as shown in Table 2. Data preprocessing follows, an indispensable step to ensure the integrity of the dataset. The data is cleaned to eliminate inconsistencies and noise. We also perform feature extraction, deriving significant attributes from the URLs, including domain, path, length, and the presence of special characters. These extracted features serve as input variables for our machine learning models.

Table 2 Dataset Details
            Good URLs    Malicious URLs
Share       72%          28%
Count       365,180      142,015

To train and test the model effectively, the data is divided into two groups: training and testing. The training set enables our model to learn from past data, and the test set provides the evidence for evaluating the model.

The initial selection of machine learning models [7] was diverse and covered several families of learners. We considered models such as support vector machines (SVM), k-nearest neighbors (KNN), decision trees, random forest, gradient boosting, and bagging and boosting ensembles. These models represent a wide range of classification strategies. After model selection, the next step is the training phase. The selected model is trained on the training data, a process that involves fine-tuning hyperparameters to improve its performance.

We then explore ensemble methods to improve classification performance. This includes methods such as random forest, gradient boosting (for example XGBoost), AdaBoost, and stacking.

The core of our research is the comparative analysis. We examine the performance of each model in depth, with a focus on both traditional and ensemble methods. Through this analysis, we identify the strengths and limitations of each model and evaluate their accuracy and robustness in distinguishing malicious from legitimate URLs.

The flow diagram below describes the flow of our model, which consists of a pre-processing phase followed by a detection phase. The pre-processing phase contains webpage feature generation, feature extraction and feature vectorization. The detection phase contains the training and testing sets, model training and result analysis.
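The feature extraction step described in this section (domain, path, length, and presence of special characters) might look roughly like the following sketch. The paper does not list the exact features or the characters it counts, so the `SPECIAL_CHARS` set, the function name `extract_features`, and the extra fields such as `num_dots_in_domain` are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Assumption: an illustrative character set; the paper does not specify
# which special characters are counted.
SPECIAL_CHARS = set("@-_?=&%#~+")


def extract_features(url: str) -> dict:
    """Derive simple lexical features from a single URL string."""
    raw = url.strip()
    # urlsplit only recognises the host when a scheme is present, so a
    # default scheme is added for bare entries such as "example.com/login".
    with_scheme = raw if "://" in raw else "http://" + raw
    parts = urlsplit(with_scheme)
    domain = parts.hostname or ""
    return {
        "domain": domain,
        "path": parts.path,
        "url_length": len(raw),  # length of the URL as given in the dataset
        "num_dots_in_domain": domain.count("."),
        "num_special_chars": sum(ch in SPECIAL_CHARS for ch in raw),
        "has_at_symbol": "@" in raw,
    }


features = extract_features("example.com/login")
print(features)
```

Textual fields such as `domain` and `path` would still need to be vectorized before being passed to a classifier; the numeric and boolean fields can be used directly.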
Table 5 summarizes the test results of random forest and XGBoost.
In Table 5, the accuracy, precision, recall and F-Score values of XGBoost are higher than those of random forest.
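For reference, the four metrics reported in Table 5 can be computed from true and predicted labels as in the sketch below. The function name `classification_metrics`, the convention that label 1 means malicious, and the toy labels are assumptions for illustration; they are not the paper's data or results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F-Score for a binary task."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Toy labels (1 = malicious, 0 = good), invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

Because the dataset is imbalanced (72% good, 28% malicious), precision, recall and F-Score are worth reading alongside accuracy, since a model that labels every URL as good would already reach 72% accuracy.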
V. COMPARATIVE ANALYSIS
Various classification models have been built previously to classify URLs as safe or malicious. One such work is by Shantanu et al. [7], who chose the non-ensemble models Naïve Bayes, KNN and Support Vector Machines. Another is by Sharad Rajendra Parmar et al. [12], who used Logistic Regression and KNN to train their model. Table 6 shows the comparative analysis of the results of the various algorithms.
Fig. 6 shows the comparative analysis of the algorithms used earlier and our ensemble methods.
From Fig. 6, we can see that our models, Random Forest and XGBoost, perform well on all the metrics: Accuracy, Precision, Recall and F-Score.
REFERENCES