Machine Learning for Malicious URL Detection

Chapter · January 2021

DOI: 10.1007/978-981-15-8289-9_45

Gold Wejinya and Sajal Bhatia

Abstract In recent years, Web-based attacks have become one of the most common
threats. Threat actors use malicious URLs to intentionally deceive users and
launch attacks. Several approaches, such as blacklisting, have been implemented to
detect malicious URLs. These approaches are unreliable and come with the
strenuous task of maintaining an up-to-date blacklist URL database. In recent
years, machine learning techniques have been explored to detect malicious URLs. This
method analyzes different features of a URL and trains a prediction model on an
existing dataset of both malicious and benign URLs. This paper proposes
a MuD (Malicious URL Detection) model which utilizes three supervised machine
learning classifiers (support vector machine, logistic regression and Naïve Bayes)
to effectively and accurately detect malicious URLs. The preliminary results indicate
that the Naïve Bayes algorithm produced the best results.

Keywords Malicious URL · Machine learning · Support vector machine ·
Logistic regression · Naïve Bayes

G. Wejinya · S. Bhatia (✉)
Sacred Heart University, Fairfield, CT 06826, USA
e-mail: [email protected]
G. Wejinya
e-mail: [email protected]

© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Singapore Pte Ltd. 2021
M. Tuba et al. (eds.), ICT Systems and Sustainability, Advances in Intelligent Systems
and Computing 1270, https://doi.org/10.1007/978-981-15-8289-9_45

1 Introduction

Studies have shown that at least 47% of the world’s population is online at any given
time [1]. Most users, especially those using home computers, are far more vulnerable
to external attacks and threats because they know very little about the attacks that
can be launched against them and how to protect themselves [2]. These users
cannot differentiate between a legitimate and a malicious URL because most malicious
Web sites have content similar to that of legitimate ones. Threat actors take advantage
of this by creating malicious Web content and luring unsuspecting users
to these URLs. Identifying these malicious URLs and raising user awareness
is one way to reduce or prevent phishing and malware attacks. Different
researchers have proposed various methods to identify malicious URLs. Machine
learning algorithms have become one of the most effective techniques for detecting
malicious URLs, and research has been carried out to compare different machine
learning algorithms for their accuracy in detecting malicious URLs.
Our proposed MuD (Malicious URL Detection) model describes a process of
detecting malicious URLs using three different machine learning algorithms and
evaluates which algorithm works best. It explores different algorithms and
implements a working prototype that classifies URLs as benign or malicious based on
certain features of a URL that were seen to be relevant in the detection of
malicious URLs. Previous work on detecting malicious URLs using blacklisting or
heuristics has been found to lack reliability because it cannot detect new malicious
URLs.
The rest of the article is organized as follows: Sect. 2 gives an overview of
pertinent work in malicious URL detection. Section 3 gives a detailed description of the
proposed MuD (Malicious URL Detection) model. Section 4 presents experimental
results and compares the performance of different classifiers. Section 5 summarizes
the paper and presents directions for future research work in this area.

2 Background and Related Work

Over the past few years, there have been studies on the detection of malicious URLs
using machine learning techniques in order to deter attacks originating from such
URLs. The MuD model proposed in this paper makes an effort to stop similar attacks
by effectively and efficiently detecting malicious URLs. This paper builds
on the work done by Vanhoenshoven et al. [3], where malicious URL detection is
framed as a binary classification problem on a publicly available dataset consisting
of 2.4 million URLs and 3.2 million features. Their experimental results demonstrate
that most classification methods achieve acceptable prediction rates without requiring
advanced feature selection techniques or domain expertise. Their results
showed that random forest and multi-layer perceptron attained the highest accuracy.
Singh and Goyal [4] proposed malicious URL detection by focusing on
attribute selection. Their analysis was based on the premise that machine learning
techniques and selection of attributes are more important than any other aspect. They
considered twenty-five attributes that can help find malicious Web sites. These attributes
were analyzed with regard to the computational resources needed for extraction,
processing and accurate classification of the predicted malicious URLs. Based on their
analysis, the top five attributes for detecting malicious URLs were identified to be
geographical location, URL properties, HTTPS-enabled Web site, domain name and
DNS WHOIS information.
Manjeri et al. [5] focused on the feature selection technique after the URL dataset
has been fed into the prediction model. The authors also addressed the issue of class
imbalance in several classifiers and proposed incorporating a reduced number of
features, based on their importance, required for classification. They also proposed
the use of rule mining algorithms such as Apriori and FPGrowth to generate IF-THEN
rules which helped in establishing relationships among these features [5]. They
concluded that, when comparing random forest, decision tree, logistic regression,
k-NN and SVM under class imbalance, random forest had 96% accuracy. Cui et
al. [6] emphasized using keyword matching to detect malicious URLs. They used
statistical analysis for gradient learning and a sigmoidal threshold level for feature
extraction in their proposed techniques. Naïve Bayes, SVM classifiers and decision
trees were used for validation and efficiency calculations of their proposed approach,
which has a good detection performance and an accuracy rate of 98.7%.

3 Proposed Malicious URL Detection (MuD) Model

The proposed model is an approach for malicious URL detection based on the features
of already existing malicious URLs using machine learning. For any URL, 15
features are selected and categorized as lexical, host-based or content-based.
The classifiers used in this project are trained using the dataset [7], and subsequently,
URLs fed into the model are classified as being either malicious or benign. In order to
properly build and train the proposed MuD model (Fig. 1) to effectively and efficiently
detect malicious URLs, the following steps were taken: (a) data gathering; (b) data
preprocessing; (c) classification; (d) training and testing; (e) evaluation.

Fig. 1 MuD (Malicious URL Detection) model for categorizing URLs

Fig. 2 URL components [1]

3.1 Data Gathering

As we will be working with URLs, it is important to clearly understand what a URL
is and what it comprises (Fig. 2). A uniform resource locator (URL) is the
address of a Web page located on the World Wide Web (WWW), and it is displayed
in the Web browser’s address bar. A URL usually contains a protocol. Some common
protocols in use today include Hypertext Transfer Protocol (HTTP), which
helps with site navigation; Hypertext Transfer Protocol Secure (HTTPS), which is
just like HTTP but adds a layer of security; File Transfer Protocol (FTP), which allows
for the exchange or transfer of files between systems; and Domain Name System
(DNS). Cyberattacks are usually performed using compromised URLs that are
malicious [8]. The first step is to gather URLs, which can be done by visiting
Web sites that publish malicious URL links and by collecting benign URLs.
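To make the URL components of Fig. 2 concrete, the following minimal sketch splits a URL into its parts using Python's standard `urllib.parse` module. The function name `url_components` and the example URL are illustrative assumptions, not part of the MuD prototype.

```python
from urllib.parse import urlsplit

def url_components(url: str) -> dict:
    """Split a URL into the parts that later feature checks inspect."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,       # e.g. "http", "https", "ftp"
        "host": parts.hostname or "",   # host name without port or user info
        "port": parts.port,             # None when no explicit port is given
        "path": parts.path,
        "query": parts.query,
    }

# Hypothetical URL, used only for illustration.
example = url_components("https://www.example.com:8080/docs/index.html?id=7")
```

Working from the parsed parts rather than from the raw string keeps later checks, such as inspecting the host, independent of how the rest of the URL is written.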

3.2 Data Preprocessing

This step is used to transform raw data into a clean dataset. It is an important part
of machine learning because the quality of the data used immensely affects the
capacity of the proposed model to learn. In this step, the data is cleaned, null values
in the dataset are replaced or removed, and data is rescaled. Data preprocessing is
important as it makes it possible for the dataset to be used with different machine
learning algorithms. One of the important steps of data preprocessing is feature
extraction.
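The cleaning and rescaling described above can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are represented as `None`, incomplete rows are removed rather than replaced, and min-max scaling is chosen as the rescaling method. The paper does not specify which choices were made.

```python
def drop_incomplete(rows):
    """Remove rows that contain a missing (None) value."""
    return [row for row in rows if all(value is not None for value in row)]

def min_max_rescale(rows):
    """Rescale every column to the range [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical numeric feature rows; the second row has a missing value.
raw = [[54, 1], [None, 0], [120, 3], [87, 2]]
clean = min_max_rescale(drop_incomplete(raw))
```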

3.2.1 Feature Extraction

There are different types of features that can be extracted and used in classifying
URLs. Lexical features include URL length, number of characters
and special characters (“//”, “.”, “@”) [2]. Host-based features are retrieved from
host properties of the URLs, which give access to the location of the host, the IP
address and information about the registration of the domain. “Due to the difficulty
of obtaining new IPs, these features are very important to detect malicious URLs” [2].
Content-based features are obtained by downloading the content of the URL.
They are important in prediction as they help find malicious code that can be embedded
into the HTML by threat actors. Another feature that can help in the detection
is the page rank, which has to do with how many times a URL is visited.
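Lexical features are the simplest to obtain because they need only the URL string. The sketch below counts a few of them; the function name, the choice of counts and the example URL are assumptions made for illustration.

```python
def lexical_features(url: str) -> dict:
    """Count simple lexical properties of a URL string (no network access)."""
    return {
        "length": len(url),
        "dots": url.count("."),
        "at_signs": url.count("@"),
        "double_slashes": url.count("//"),
        "digits": sum(ch.isdigit() for ch in url),
    }

# Hypothetical suspicious-looking URL, used only for illustration.
features = lexical_features("http://login.example.com@10.0.0.5//update")
```

Host-based and content-based features, by contrast, need WHOIS or DNS lookups or a download of the page, so they are not sketched here.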

3.2.2 Dataset Used

An existing dataset from the UCI Machine Learning Repository was used, which has
already been preprocessed and contains 11,055 URLs, both malicious and benign, with 30
features extracted [7]. The dataset was reduced to the specific features that we
needed, which left 15 features. We chose some features from [5], which
explains different features of a URL that can help determine if a URL is malicious
or not, and three of the top five features from [4], which can be used to predict malicious
URLs based on computational resource requirement and classification accuracy. The
following features were used:
1. Having an IP address: checks if the URL has an associated IP address
2. URL length: checks for the number of characters. Threat actors at times use this
feature to hide the malicious parts of the URL
3. Having @ symbol: This makes the Web browser ignore all the content
preceding the ‘@’ symbol; the real address follows the symbol
4. Having double slash ‘//’: The existence of ‘//’ within the URL path implies that
the user will be redirected to another site. The ‘//’ in URLs is usually located
at the sixth position for HTTP and the seventh position for HTTPS.
5. Link in tags: Legitimate Web sites offer link in tags that offer metadata about
the HTML document
6. Having a sub-domain: Sub-domain is a division of a domain that helps to orga-
nize an existing Web site into a separate site, and also, it helps organize and
navigate to different sections of a main Web site
7. Age of Domain: This information can be extracted from the WHOIS database.
The majority of phishing Web sites have a relatively small age of domain.
8. DNS record: It provides the registration details of the Web resource. The
presence or absence of this information in the DNS WHOIS information has been
found to be linked to maliciousness
9. HTTPS token: HTTPS-enabled Web sites are less likely to host malicious
content.
10. Page-Rank: This feature aims to measure the relative importance of a Web page
on the Internet. In the dataset used for this paper, it is observed that around 95%
of phishing Web pages have no Page-Rank
11. Google Index: It examines if a Web site is indexed by Google or not. Phishing
Web pages, which are merely accessible for a short period, are often not found
to be indexed by Google
12. Iframe: The iframe HTML tag is well known for being used to download malicious
JavaScript exploits [5]
13. Redirect: When a URL is redirected, it is often linked to malicious behavior.
The feature checks whether the URL is redirected or not.
14. Pop-up window: JavaScript window.open() pop-ups are often used for
advertisements and injecting exploits.
15. Favicon: This is a graphic associated with a specific Web page. If the image is
found to be loaded from a different domain than the one shown in the address
bar, the Web page is likely to contain malicious content.
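Several of the features listed above can be checked directly on the URL string. The sketch below illustrates features 1, 3, 4 and 6 under stated assumptions: "having an IP address" is interpreted as the host being a literal IP address, a ‘//’ is treated as a redirect when it starts after the seventh position, and the number of dots in the host is used as a rough proxy for sub-domain depth. The function names and thresholds are hypothetical and are not taken from the dataset [7].

```python
import ipaddress
from urllib.parse import urlsplit

def host_is_ip(url: str) -> bool:
    """True when the host part is a literal IP address rather than a name."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def has_redirect_slashes(url: str) -> bool:
    """True when a '//' occurs after the protocol separator."""
    # The protocol's own '//' starts at 0-based index 5 ("http://")
    # or 6 ("https://"), i.e. the sixth or seventh position.
    return url.rfind("//") > 6

def url_flags(url: str) -> dict:
    host = urlsplit(url).hostname or ""
    return {
        "has_ip": host_is_ip(url),
        "has_at": "@" in url,
        "has_redirect_slashes": has_redirect_slashes(url),
        "dots_in_host": host.count("."),  # assumed proxy for sub-domains
    }

# Hypothetical URLs, used only for illustration.
suspicious = url_flags("http://login.example.com@10.0.0.5//update")
ordinary = url_flags("https://www.example.com/index.html")
```

In the first URL everything before the ‘@’ is treated as user information, so the parsed host is `10.0.0.5`, which matches the behavior described in feature 3.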

3.3 Classification

Machine learning is a branch of science which enables computers to learn, adapt,
extrapolate patterns and communicate with each other without being explicitly
programmed to do so and with minimal human intervention. Machine learning approaches
use a set of URLs to train a model and, based on the statistical properties, learn to
classify a URL as malicious or benign. Classification models, also known as
classifiers, offer a range of solutions to classification problems. Some
popular classifiers are decision trees, support vector machines, Bayesian networks,
random forests and k-nearest neighbors. This paper uses three classifiers, viz.
support vector machine (SVM), logistic regression and Bayesian network (Naïve Bayes).
Support Vector Machine (SVM): SVM is a powerful supervised learning
technique that can be used for classification and regression problems. It is usually
used for two-group classification problems. SVM is a non-probabilistic binary linear
classifier that begins with a set of training examples, each labeled as
belonging to one of two categories. SVM separates the examples into the two categories
by finding the hyperplane that has the largest distance to the closest data
point of any class, which tends to minimize the generalization error of the
classifier.
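The maximum-margin idea can be illustrated with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a minimal sketch and not the implementation used for the MuD model; the training method, the hyperparameters (`lam`, `lr`, `epochs`) and the toy data are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by sub-gradient descent on lam/2*||w||^2 + mean hinge loss.

    X is a list of feature lists; y holds labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data: +1 stands for malicious, -1 for benign.
X_toy = [[2, 2], [2, 1], [1, 2], [-2, -2], [-2, -1], [-1, -2]]
y_toy = [1, 1, 1, -1, -1, -1]
w_svm, b_svm = train_linear_svm(X_toy, y_toy)
```

Only examples with a margin below 1 contribute to the update, which is what pushes the hyperplane away from the closest points of each class.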
Naïve Bayes: This is a probabilistic classification algorithm. It is called naïve
because it assumes that the features are independent of each other given the class [4].
The model predicts the class of a new instance by calculating the probability of
each class given the feature values of that instance.
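The independence assumption means that the score of a class is its prior multiplied by one per-feature likelihood for each feature. The sketch below shows this for discrete features, with Laplace (add-one) smoothing added as an assumption so that unseen values do not produce zero probabilities. The function names, the labels and the toy data are illustrative only.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values_seen = defaultdict(set)       # feature index -> observed values
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[(yi, j)][v] += 1
            values_seen[j].add(v)
    return class_counts, value_counts, values_seen

def nb_predict(model, x):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)                # log prior
        for j, v in enumerate(x):
            k = len(values_seen[j] | {v})            # number of possible values
            score += math.log((value_counts[(c, j)][v] + 1) / (n_c + k))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data: each row is (has @ symbol, has IP address), 1 = yes, -1 = no.
X_nb = [[1, 1], [1, -1], [1, 1], [-1, -1], [-1, -1], [-1, 1]]
y_nb = ["malicious", "malicious", "malicious", "benign", "benign", "benign"]
nb_model = train_naive_bayes(X_nb, y_nb)
```

Summing log probabilities instead of multiplying probabilities avoids numerical underflow when many features are used.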
Logistic Regression: This classification algorithm usually gives a properly
calibrated probability and can be used to categorize whether a URL is malicious or
not, since in its basic form it supports binary classification problems. Its goal is
to find the model that best explains the relationship between the outcome and a set
of independent variables.
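A minimal sketch of logistic regression follows, assuming batch gradient descent on the mean log loss as the fitting method. The learning rate, number of epochs and toy data are assumptions and do not reflect the settings used for the MuD model.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent on the mean log loss; y in {0, 1}."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += error * xi[j] / n
            grad_b += error / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x) -> float:
    """Estimated probability that x belongs to class 1 (malicious)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy data: each row is (has @ symbol, has IP address); label 1 = malicious.
X_lr = [[1, 1], [1, -1], [1, 1], [-1, -1], [-1, -1], [-1, 1]]
y_lr = [1, 1, 1, 0, 0, 0]
w_lr, b_lr = train_logistic_regression(X_lr, y_lr)
```

A URL would then be labeled malicious when `predict_proba` exceeds a chosen threshold, commonly 0.5.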

3.4 Training and Testing

In this step, the dataset is divided into a training and testing set. The training set
is usually 70% of the dataset and the rest is used for testing. The classification
model learns from the training set. The testing set should not include any part of the
training set, and it is used for performance evaluation of the model. When the model
is properly trained, it will be able to perform a task on new data.
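The 70/30 split described above can be sketched with the standard library alone. The function name `split_dataset`, the fixed seed and the stand-in rows are assumptions for illustration.

```python
import random

def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into disjoint train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for the 11,055 labeled URLs of the dataset.
rows = list(range(11055))
train_rows, test_rows = split_dataset(rows)
```

Shuffling before cutting prevents any ordering in the source file from leaking into the split, and slicing one shuffled list guarantees that the two sets do not overlap.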

3.5 Evaluation

The next step is to evaluate the performance of our classifiers. To evaluate
our models correctly, we make use of the confusion matrix as shown in [9] and use F1
score, accuracy, precision and recall as the evaluation metrics.
F1 Score: It is a function of precision and recall, calculated as the harmonic mean of
precision and recall.

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1}$$

Accuracy: This is defined as the overall success rate of the URL prediction
technique.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

Precision: This is the fraction of URLs predicted as positive (malicious) that are
actually malicious.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

Recall: This is the fraction of all actual positive (malicious) URLs that were
correctly predicted.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$

TP: number of true positives, malicious URLs correctly classified as malicious
TN: number of true negatives, benign URLs correctly classified as benign
FP: number of false positives (type I error), benign URLs classified as malicious
FN: number of false negatives (type II error), malicious URLs classified as benign.
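Equations (1) to (4) can be computed directly from the four counts. The sketch below derives the counts from actual and predicted labels and then applies the formulas; the function names, the label strings and the toy predictions are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive="malicious"):
    """Return (TP, TN, FP, FN), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def evaluate(tp, tn, fp, fn):
    """Apply Eqs. (1)-(4); ratios with a zero denominator are reported as 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 4 malicious and 6 benign URLs with two mistakes.
y_true = ["malicious"] * 4 + ["benign"] * 6
y_pred = ["malicious"] * 3 + ["benign"] * 6 + ["malicious"]
counts = confusion_counts(y_true, y_pred)
metrics = evaluate(*counts)
```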

4 Results and Discussion

The proposed MuD model is tested with three machine learning classifiers, and the
accuracy, recall and precision of each classifier are recorded. The performance
of each classifier is analyzed based on the precision,
recall, F1 score and accuracy obtained on the UCI Machine Learning Repository
dataset [7]. Classifiers are expected to analyze different patterns and to
identify the differences between new and existing patterns.
A confusion matrix (Table 1) is used to evaluate the performance of the three
classifiers (Naïve Bayes, support vector machine and logistic regression) in detecting and
categorizing URLs as malicious or benign. The highest accuracy score of 100% was
observed for Naïve Bayes, SVM had an accuracy of 98% and logistic regression
gave an accuracy of 96%, as shown in Table 2.
A low recall rate compared to the precision and accuracy, as in the case of logistic
regression, usually indicates that the model will most likely classify malicious URLs as
benign, which results in more malicious URLs going undetected. Conversely, when
precision values are low, the classifier will tend to classify benign URLs as malicious. The
graphical representation in Fig. 3 shows the confusion matrix results, using accuracy,
F1 score, precision and recall for each of the classifiers to evaluate their performance.
With the features selected from the dataset [7], we observed that the classifiers used
achieved high classification precision and accuracy of above 90%. In classifying
URLs, SVM ran slower than Naïve Bayes and logistic regression. Using
lexical, host-based and content-based features of URLs played an important role in the
performance of our classifiers. The prediction model with logistic regression,
which had an accuracy of 96%, was further tested with several malicious and benign URLs
from PhishTank to see how well it classifies URLs.

Table 1 Confusion Matrix for Performance Evaluation of MuD Model

                   Predicted positive   Predicted negative
Actual positive    True positive        False negative
Actual negative    False positive       True negative

Table 2 Evaluation of classifiers in MuD Model

Classifier               Accuracy (%)   F1 Score (%)   Precision (%)   Recall (%)
Support vector machine   98             9              98              95
Naïve Bayes              100            100            100             100
Logistic regression      96             93             97              90
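As a consistency check on Table 2, the F1 score can be recomputed from precision and recall with Eq. (1). The worked arithmetic below uses only the rounded percentages printed in the table, so the results are approximate.

```latex
\begin{align*}
\text{Logistic regression: } F_1 &= 2 \times \frac{0.97 \times 0.90}{0.97 + 0.90}
  = \frac{1.746}{1.87} \approx 0.934 \\
\text{Support vector machine: } F_1 &= 2 \times \frac{0.98 \times 0.95}{0.98 + 0.95}
  = \frac{1.862}{1.93} \approx 0.965
\end{align*}
```

The logistic regression value agrees with the 93% reported in Table 2. The support vector machine value is only what the rounded precision and recall imply; it is not a figure taken from the source.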

Fig. 3 Confusion matrix

5 Conclusion and Future Work

This paper addresses the widespread cybersecurity concern where threat actors
bypass security defenses and use URLs to launch various forms of malicious attacks
on unsuspecting individuals. In order to prevent such attacks, the paper proposes the
use of machine learning algorithms to detect malicious URLs. The proposed MuD
(Malicious URL Detection) model is trained using an existing dataset which contains
11,055 URLs, each with 15 unique features, and is applied to three different machine
learning classifiers—support vector machine, logistic regression and Naïve Bayes.
After training and testing the algorithms, it is observed that Naïve Bayes classifier
recorded the highest accuracy.
As part of the future work in this direction, the authors plan to investigate the
performance of the proposed model by assigning distinct weights to different feature
components of the URL. The proposed model is also planned to be deployed online
by integrating it as a Web browser plug-in capable of warning users of potentially
malicious URLs in real time. URLs clicked or typed will be checked based on their
features to determine whether they are malicious. If a URL is malicious or suspected
to be malicious, a pop-up will inform the user of the potential threat and the URL
will be temporarily blocked unless the user chooses to still navigate to it.
The future work also includes evaluating the proposed model against more recent
and diverse datasets along with using additional classifiers such as decision trees and
random forest.

References

1. S. Halder, S. Ozdemir, Hands-On Machine Learning for Cybersecurity (Packt, 2018)
2. M. Ferreira, Malicious URL detection using machine learning algorithms, in Digital Privacy
and Security Conference (2019), p. 114
3. F. Vanhoenshoven, G. Nápoles, R. Falcon, K. Vanhoof, M. Köppen, Detecting malicious URLs
using machine learning techniques, in 2016 IEEE Symposium Series on Computational
Intelligence (SSCI) (IEEE, 2016), pp. 1–8
4. A. Singh, N. Goyal, A comparison of machine learning attributes for detecting malicious
websites, in 2019 11th International Conference on Communication Systems & Networks
(COMSNETS) (IEEE, 2019), pp. 352–358
5. A.S. Manjeri, R. Kaushik, M. Ajay, P.C. Nair, A machine learning approach for detecting
malicious websites using URL features, in 2019 3rd International conference on Electronics,
Communication and Aerospace Technology (ICECA) (IEEE, 2019), pp. 555–561
6. B. Cui, S. He, X. Yao, P. Shi, Malicious URL detection with feature extraction based on machine
learning. Int. J. High Perform. Comput. Netw. 12(2), 166–178 (2018)
7. A. Asuncion, D. Newman, UCI Machine Learning Repository (2007)
8. D. Sahoo, C. Liu, S.C. Hoi, Malicious URL Detection Using Machine Learning: A Survey. arXiv
preprint arXiv:1701.07179 (2017)
9. V.M. Patro, M.R. Patra, Augmenting weighted average with confusion matrix to enhance
classification accuracy. Trans. Mach. Learn. Artif. Intell. 2(4), 77–91 (2014)