Comparative Evaluation of Machine Learning Models for Malicious URL Detection
Abstract—Malicious URLs promoting threats like phishing and malware result in massive financial losses worldwide. Automated identification of such URLs before users access them is crucial for cybersecurity. This paper investigates various machine learning techniques for accurately detecting malicious URLs. Models like decision trees, random forests, KNN and naive Bayes are evaluated on a dataset of over 500,000 URLs. The ensemble models random forest and extra trees deliver the best performance, with over 91% accuracy in distinguishing benign and malicious URLs. However, class imbalance remains a challenge, with minority malicious types often having lower precision. The comparative assessment demonstrates the feasibility of using ensemble machine learning for automated malicious URL detection. With sufficient examples and feature engineering, tree-based models can be effectively employed to identify threats and strengthen cyber defense.
Index Terms—Malicious URLs, Machine Learning, Decision Tree, Random Forest, AdaBoost, K-Nearest Neighbors, Stochastic Gradient Descent, Extra Trees, Gaussian Naive Bayes
I. INTRODUCTION
The internet has revolutionized access to information, services, communication and transactions. Nearly 4.5 billion people around the world have internet access today. However, this connectivity also enables threats like phishing, spam and malware distribution via websites and emails, resulting in massive financial losses. Symantec estimates such cyber threats cost the world $1.5 trillion annually [22]. For instance, phishing leads to stolen credentials and payment information for financial fraud. Spam emails promote questionable products, steal personal information and spread malware. Drive-by downloads surreptitiously install malware, viruses and ransomware on victim computers for identity theft and system hijacking.
Malicious websites promoting such threats employ deceptive techniques to disguise their true intent while aggressively optimizing for search engine visibility to ensnare users. Examples include spoofed content mimicking trustworthy entities, search engine spamming, URL obfuscation, redirects and hidden iframes [23]. Most users cannot easily distinguish such malicious sites from legitimate ones. By the time a threat is confirmed and blacklists are updated, massive user exposure has already occurred.
Therefore, techniques to automatically detect malicious URLs before users access them can significantly strengthen cyber defense. This enables proactively identifying threats instead of reacting after attacks have taken place and caused damage. Prior research on using machine learning for detecting malicious URLs has shown promising results [24]. This paper presents a comparative study evaluating the efficacy of various standard machine learning models like decision trees, ensembles, support vector machines and Naive Bayes for accurately distinguishing benign and malicious URLs.
II. LITERATURE SURVEY
With the proliferation of cyber threats, malicious URL detection has become a crucial research focus. A substantial body of literature has focused on applying machine learning techniques to accurately identify such harmful URLs before they can inflict damage. This review focuses on the key studies on malicious URL identification and phishing detection using machine learning.
A major research focus has been developing and evaluating machine learning algorithms for effective and automated malicious URL classification. Khan et al. [1] formulated malicious URL detection as a machine learning problem. They developed a comprehensive prototype employing the AdaBoost algorithm. Through large-scale online learning on URL datasets, they demonstrated improved performance over previous blacklist-based and heuristic methods. The prototype established a machine learning framework and benchmark for malicious URL identification.
Sahingoz et al. [3] introduced a real-time anti-phishing system applying seven different classification algorithms alongside natural language processing (NLP)-based features extracted from URLs. The system demonstrated strengths like language independence, real-time execution, minimal
Authorized licensed use limited to: VIT University. Downloaded on August 15,2024 at 03:57:57 UTC from IEEE Xplore. Restrictions apply.
5) Spam Filtering: A randomized manual validation was conducted on a subset of URLs to identify and filter out any spam or irrelevant URLs in order to improve dataset quality.
6) Train-Test Split: The dataset was split 80:20 into training and testing sets for model development and evaluation. Stratified sampling was used so that the benign and malicious URL categories kept proportional representation in both splits [10].
7) Class Weights: To counter class imbalance during training, the inverse class frequencies were supplied as weights to emphasize minority malicious URL types compared to the majority benign type.
The systematic preprocessing converted the raw URL dataset into a high-quality representation suited for comparative machine learning modeling and evaluation [12].
B. Feature Engineering
The text URLs were transformed into numeric features based on insights from prior research in this domain. The following features were extracted programmatically using Python:
1) URL Length: The total number of characters in the URL. Malicious URLs are typically longer on average.
2) Path Levels: The number of path levels beyond the domain hierarchy, delimited by slashes. Many levels may indicate obfuscation attempts [14].
3) IP Presence: A binary feature indicating the presence of a direct IP address. IPs in URLs are rare among benign websites.
4) Dash Count: The number of dashes (-) present. Malicious URLs exhibit higher dash counts on average.
5) Dot Count: The total number of dot or period characters (.) in the URL. Used to identify excessive subdomain chaining.
6) Domain Token Count: The number of words or tokens in the extracted domain name. Unusually long domains may be suspicious [15].
7) Entropy: Shannon entropy calculated on the full URL string, quantifying randomness. High entropy suggests the URL is more likely to have been generated automatically.
8) Special Characters: Count of special characters like @, #, $, etc. Malicious URLs tend to contain more special characters on average.
In total, 12 features were engineered using both heuristics and programmatic methods aimed at capturing distinguishing attributes based on domain knowledge.
C. Comparison of ML Architectures
The following machine learning models were implemented and evaluated for detecting malicious URLs:
1) Decision Trees: Decision trees recursively partition the feature space to minimize a loss function, capturing nonlinearities and feature interactions through their hierarchical structure while also providing interpretability [16].
2) Random Forests: Random forests ensemble decision trees trained on random subsets of data and features to reduce variance, avoid overfitting, and significantly boost accuracy compared to individual trees.
3) AdaBoost: AdaBoost combines weak learners into a robust ensemble by reweighting misclassified examples to focus on hard cases, so that successive high-bias weak learners complement one another and reduce overall bias.
4) K-Nearest Neighbors: The KNN algorithm identifies the k closest training samples based on a distance metric and predicts the class by majority vote, modeling complex regions without assumptions about the data distribution [17].
5) Stochastic Gradient Descent: Stochastic gradient descent updates model weights iteratively on individual samples for efficient large-scale SVM training and faster convergence, while using regularization to prevent overfitting.
6) Extra Trees: Extra trees add additional randomization to tree splitting and feature selection to reduce variance, and can match or exceed the accuracy of standard random forests.
7) Naive Bayes: Naive Bayes uses Bayes' theorem to probabilistically model class distributions, assuming feature independence for computational efficiency, and performs surprisingly well despite its simplicity [18].
These standard algorithms provide a diverse representation of decision trees, ensembles, SVMs, nearest neighbors and probabilistic classifiers commonly applied to text classification. All models were implemented in Python using scikit-learn.
D. Model Training Methodology
Each model was trained on the engineered URL features using the following methodology:
1) Hyperparameter Tuning
Grid search with 5-fold cross-validation on just the training set was used to tune key hyperparameters for each model:
• Decision Tree: Max depth, min samples split, min samples leaf
• Random Forest: Num estimators, max features, max depth
• AdaBoost: Num estimators, learning rate
• KNN: Num neighbors, weights, leaf size
• SGD: Loss function, penalty, alpha
• Extra Trees: Num estimators, max features, max depth
• Naive Bayes: No tuning
The combination of hyperparameters yielding the best cross-validation accuracy was selected.
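Section D trains each model on the engineered URL features from Section B. The sketch below shows how the eight features listed there might be computed with the Python standard library. It is an illustration under stated assumptions, not the authors' extraction code: the function names, the `SPECIAL_CHARS` set, and the choice to split the host on dots and dashes for the domain token count are all hypothetical.

```python
import ipaddress
import math
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed set of "special" characters; the paper only names @, #, $ "etc."
SPECIAL_CHARS = set("@#$%&=?!*~+;")


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def has_ip_host(host):
    """True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Map one URL string to a dict of numeric features."""
    # urlsplit only recognises the host when a scheme or leading "//" is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "path_levels": len([seg for seg in parts.path.split("/") if seg]),
        "has_ip": int(has_ip_host(host)),
        "dash_count": url.count("-"),
        "dot_count": url.count("."),
        # Assumption: tokens are the host pieces separated by dots or dashes.
        "domain_token_count": len([t for t in re.split(r"[.\-]", host) if t]),
        "entropy": shannon_entropy(url),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
    }


example = extract_features("http://secure-login.example.com/a/b/c?id=1#x")
```

Applying `extract_features` to every URL in the dataset yields one row per URL, which can then be stacked into the numeric matrix that the classifiers consume.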
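The grid search with 5-fold cross-validation described under Hyperparameter Tuning could look roughly like the following scikit-learn sketch for the decision tree. Several parts are assumptions made for illustration. The data is a synthetic stand-in for the engineered URL features, and the grid values are hypothetical, since the paper names only the tuned parameters and not their ranges. `class_weight="balanced"` is scikit-learn's built-in inverse-frequency weighting, used here in place of the class weights mentioned in the preprocessing steps.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class stand-in for the URL feature matrix (assumption).
X, y = make_classification(
    n_samples=400,
    n_features=8,
    n_informative=5,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# 80:20 stratified split, mirroring the split described in the preprocessing steps.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidate values for the three tuned decision tree parameters.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set only
    scoring="accuracy",  # select the combination with the best CV accuracy
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # held-out test set, used once
```

Because the search only ever sees `X_train`, the test set stays untouched until the final `score` call, which keeps the reported accuracy free of tuning leakage.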
2) Training
The models were trained on the full URL training set using the optimized hyperparameters. Appropriate class weights were supplied to handle class imbalance [20]. The models were trained for a maximum of 100 epochs and monitored on a validation set.
3) Regularization
Early stopping was used to halt training after 5 epochs of no improvement in validation loss to prevent overfitting.
The scikit-learn library was used for standardized model training and cross-validation.
E. Evaluation Metrics
Several quantitative metrics and visualizations were utilized to evaluate model performance on the held-out test set:
1) Accuracy: Overall accuracy on the test set.
2) Precision & Recall: Precision and recall metrics for each URL class.
3) F1-score: Harmonic mean of precision and recall, providing a balance of both.
4) Confusion Matrix: Breakdown of predictions into true positives, true negatives, false positives and false negatives.
Together these metrics enabled holistic evaluation of the models from multiple perspectives.
V. RESULTS
This section empirically compares the efficacy of the implemented machine learning models on the malicious URL detection task.
A. Test Accuracy
Table IV-A shows the test accuracy attained by each model. The ensemble models Extra Trees and Random Forest achieve the highest accuracy exceeding 91%. Naive Bayes performs the poorest with just 78.7% accuracy.
B. Analysis of Confusion Matrices
The following figures show the confusion matrices of each machine learning model, giving a detailed view of their predictive power on the test set.
Fig. 2. Confusion matrix for Decision Tree
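The metrics listed under Evaluation Metrics can all be derived from confusion-matrix counts. The following standard-library sketch shows one way to do so for a multi-class problem. The function names and the toy labels are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Precision, recall and F1 for each class, from confusion-matrix counts."""
    labels = sorted(set(y_true) | set(y_pred))
    # counts[(true, predicted)] is one cell of the confusion matrix.
    counts = Counter(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(counts[(other, label)] for other in labels if other != label)
        fn = sum(counts[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy labels for illustration only.
y_true = ["benign"] * 4 + ["phishing"] * 3 + ["malware"]
y_pred = ["benign", "benign", "benign", "phishing",
          "phishing", "benign", "benign", "malware"]

overall = accuracy(y_true, y_pred)
report = per_class_metrics(y_true, y_pred)
```

In this toy example the overall accuracy looks moderate while phishing recall is only one in three, which is the kind of gap that per-class reporting is meant to expose on imbalanced data.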
Fig. 6. Confusion matrix for Stochastic Gradient Descent
Fig. 8. Confusion matrix for Gaussian Naive Bayes
TABLE III.
Model          Metric     Benign  Defacement  Phishing  Malware
Random Forest  Precision  0.92    0.94        0.85      0.96
Random Forest  Recall     0.98    0.97        0.62      0.91
Extra Trees    Precision  0.91    0.93        0.83      0.95
Extra Trees    Recall     0.97    0.96        0.59      0.89
Decision Tree  Precision  0.90    0.92        0.81      0.93
Decision Tree  Recall     0.95    0.94        0.57      0.87
KNN            Precision  0.88    0.86        0.77      0.91
KNN            Recall     0.92    0.89        0.51      0.84
SGD            Precision  0.81    0.78        0.72      0.85
SGD            Recall     0.95    0.82        0.41      0.79
AdaBoost       Precision  0.83    0.79        0.71      0.88
AdaBoost       Recall     0.97    0.91        0.38      0.83
Naive Bayes    Precision  0.81    0.74        0.68      0.82
Naive Bayes    Recall     0.88    0.79        0.31      0.77
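Table III reports precision and recall but not the F1-score that the evaluation section defines as their harmonic mean. As a worked illustration, the snippet below derives F1 for the phishing class from the tabulated values. These F1 figures are arithmetic on the table, not results reported by the authors, and the names `f1_score` and `phishing` are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the phishing class, copied from Table III.
phishing = {
    "Random Forest": (0.85, 0.62),
    "Extra Trees": (0.83, 0.59),
    "Decision Tree": (0.81, 0.57),
    "KNN": (0.77, 0.51),
    "SGD": (0.72, 0.41),
    "AdaBoost": (0.71, 0.38),
    "Naive Bayes": (0.68, 0.31),
}

phishing_f1 = {model: round(f1_score(p, r), 2) for model, (p, r) in phishing.items()}
# Random Forest: 2 * 0.85 * 0.62 / (0.85 + 0.62) is about 0.72
```

The derived values fall from roughly 0.72 for Random Forest to roughly 0.43 for Naive Bayes. In every model, recall is the lower of the two inputs and pulls the phishing F1 below its precision.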
ensembles [26]. Among the other techniques, K-Nearest Neighbors, Support Vector Machines, AdaBoost and Naive Bayes face varying limitations in accurately detecting malicious URLs [27].
VII. CONCLUSION
The paper presents an extensive comparative study of machine learning techniques for detecting malicious URLs. Using a dataset of over 500,000 examples, experiments showed that ensemble models like extra trees and random forest achieve over 91% accuracy by learning URL features effectively. However, class imbalance remains an issue, with minority malicious types often having lower precision and recall compared to the benign majority type. The methodology followed here provides a systematic way to compare and evaluate different machine learning models for malicious URL detection and to select the most effective approach for specific requirements in future work. The comparative assessment demonstrates the feasibility of using supervised ensemble methods like tree-based models to proactively detect and filter malicious URLs, thereby strengthening cyber threat defense.
REFERENCES
[1] Firoz Khan, Jinesh Ahamed, Seifedine Kadry, Lakshmana Kumar Ramasamy, "Detecting malicious URLs using binary classification through adaboost algorithm," International Journal of Electrical and Computer Engineering (IJECE), vol. 10, no. 1, pp. 997-1005, Feb 2020, doi: 10.11591/ijece.v10i1.pp997-1005
[2] SK Hasane Ahammad, Sunil D. Kale, Gopal D. Upadhye, Sandeep Dwarkanath Pande, E Venkatesh Babu, Amol V. Dhumane, Mr. Dilip Kumar Jang Bahadur, "Phishing URL detection using machine learning methods," Advances in Engineering Software, vol. 173, pp. 103288, 2022, https://doi.org/10.1016/j.advengsoft.2022.103288.
[3] Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri, "Machine learning based phishing detection from URLs," Expert Systems with Applications, vol 117, pp. 345-357, 2019, https://doi.org/10.1016/j.eswa.2018.09.029.
[4] S. Modi, Y. K. Mali, V. Borate, A. Khadke, S. Mane and G. Patil, "Skin Impedance Technique to Detect Hand-Glove Rupture," 2023 OITS International Conference on Information Technology (OCIT), Raipur, India, 2023, pp. 309-313, doi: 10.1109/OCIT59427.2023.10430992.
[5] S. Venugopal, S. Y. Panale, M. Agarwal, R. Kashyap and U. Ananthanagu, "Detection of Malicious URLs through an Ensemble of Machine Learning Techniques," 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Brisbane, Australia, 2021, pp. 1-6, doi: 10.1109/CSDE53843.2021.9718370.
[6] Y. Mali, B. Vyas, V. K. Borate, P. Sutar, M. Jagtap and J. Palkar, "Role of Block-Chain in Health-Care Application," 2023 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS), New Raipur, India, 2023, pp. 1-6, doi: 10.1109/ICBDS58040.2023.10346537.
[7] F. Vanhoenshoven, G. Nápoles, R. Falcon, K. Vanhoof and M. Köppen, "Detecting malicious URLs using machine learning techniques," 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 2016, pp. 1-8, doi: 10.1109/SSCI.2016.7850079.
[8] V. Borate, Y. Mali, V. Suryawanshi, S. Singh, V. Dhoke and A. Kulkarni, "IoT Based Self Alert Generating Coal Miner Safety Helmets," 2023 International Conference on Computational Intelligence, Networks and Security (ICCINS), Mylavaram, India, 2023, pp. 01-04, doi: 10.1109/ICCINS58907.2023.10450044
[9] F. Alkhudair, M. Alassaf, R. Ullah Khan and S. Alfarraj, "Detecting Malicious URL," 2020 International Conference on Computing and Information Technology (ICCIT-1441), Tabuk, Saudi Arabia, 2020, pp. 1-5, doi: 10.1109/ICCIT-144147971.2020.9213792.
[10] F. Yahya et al., "Detection of Phising Websites using Machine Learning Approaches," 2021 International Conference on Data Science and Its Applications (ICoDSA), Bandung, Indonesia, 2021, pp. 40-47, doi: 10.1109/ICoDSA53588.2021.9617482.
[11] Y. Mali, M. E. Pawar, A. More, S. Shinde, V. Borate and R. Shirbhate, "Improved Pin Entry Method to Prevent Shoulder Surfing Attacks," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-6, doi: 10.1109/ICCCNT56998.2023.10306875.
[12] Yogesh Mali and Tejal Upadhyay, "Fraud Detection in Online Content Mining Relies on the Random Forest Algorithm", SWB, vol. 1, no. 3, pp. 13–20, Jul. 2023, doi: 10.61925/SWB.2023.1302
[13] T. S. Ruprah, V. S. Kore and Y. K. Mali, "Secure data transfer in android using elliptical curve cryptography," 2017 International Conference on Algorithms, Methodology, Models and Applications in Emerging Technologies (ICAMMAET), Chennai, India, 2017, pp. 1-4, doi: 10.1109/ICAMMAET.2017.8186639.
[14] Ritesh Hajare, Rohit Hodage, Om Wangwad, Yogesh Mali, Faraz Bagwan, "Data Security in Cloud", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 8, Issue 3, pp. 240-245, May-June-2021
[15] Atharva Deshpande, Omkar Pedamkar, Nachiket Chaudhary, Dr. Swapna Borde, 2021, "Detection of Phishing Websites using Machine Learning," INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 10, Issue 05, May 2021.
[16] Y. K. Mali and A. Mohanpurkar, "Advanced pin entry method by resisting shoulder surfing attacks," 2015 International Conference on Information Processing (ICIP), Pune, India, 2015, pp. 37-42, doi: 10.1109/INFOP.2015.7489347
[17] Tabassum, Nusrath Neha, Farhin Hossain, Md Shohrab Narman, Husnu. (2021). A Hybrid Machine Learning based Phishing Website Detection Technique through Dimensionality Reduction. 1-6. 10.1109/BlackSeaCom52164.2021.9527806.
[18] Mali, Y., & Chapte, V. (2014). Grid based authentication system, International Journal of Advance Research in Computer Science and Management Studies, Volume 2, Issue 10, October 2014 pg. 93-99, 2(10).
[19] Yogesh Mali, Nilay Sawant, "Smart Helmet for Coal Mining", International Journal of Advanced Research in Science, Communication and Technology (IJARSCT) Volume 3, Issue 1, February 2023, DOI: 10.48175/IJARSCT-8064
[20] Pranav Lonari, Sudarshan Jagdale, Shraddha Khandre, Piyush Takale, Prof Yogesh Mali, "Crime Awareness and Registration System", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 8, Issue 3, pp. 287-298, May-June-2021.
[21] Trushank Mhatre, Yogesh Mali, Sairaj Chaudhari, Mohit Ganorkar, Pravin Dahalke, 2020, "Design of Shoes Against Landmines," INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 09, Issue 09 (September 2020).
[22] Jyoti Pathak, Neha Sakore, Rakesh Kapare, Amey Kulkarni, Prof. Yogesh Mali, "Mobile Rescue Robot", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 4, Issue 8, pp. 10-12, September-October-2019
[23] Devansh Dhote, Piyush Rai, Sunil Deshmukh, Adarsh Jaiswal, Prof. Yogesh Mali, "A Survey: Analysis and Estimation of Share Market Scenario", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 4, Issue 8, pp. 77-80, September-October-2019.
[24] Rajat Asreddy, Avinash Shingade, Niraj Vyavhare, Arjun Rokde, Yogesh Mali, "A Survey on Secured Data Transmission Using RSA Algorithm and Steganography", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 4, Issue 8, pp. 159-162, September-October-2019.
[25] Shivani Chougule, Shubham Bhosale, Vrushali Borle, Vaishnavi Chaugule, Prof. Yogesh Mali, "Emotion Recognition Based Personal Entertainment Robot Using ML & IP", International Journal of Scientific Research in Science and Technology (IJSRST), Print ISSN: 2395-6011, Online ISSN: 2395-602X, Volume 5, Issue 8, pp. 73-75, November-December-2020.
[26] Amit Lokre, Sangram Thorat, Pranali Patil, Chetan Gadekar, Yogesh Mali, "Fake Image and Document Detection using Machine Learning", International Journal of Scientific Research in Science and Technology (IJSRST), Print ISSN: 2395-6011, Online ISSN: 2395-602X, Volume 5, Issue 8, pp. 104-109, November-December-2020.
[27] Ritesh Hajare, Rohit Hodage, Om Wangwad, Yogesh Mali, Faraz Bagwan, "Data Security in Cloud", (IJSRCSEIT), ISSN: 2456-3307, Volume 8, Issue 3, pp. 240-245, May-June-2021.