
MALICIOUS URL DETECTION

Mohd Jeeshan, Dr. Mohan B.A, and Ms. Lalita Kumari
Information Science and Engineering
BMS Institute of Technology, Bangalore, India
[email protected], [email protected], [email protected]

Abstract- Online insecurity has become increasingly common and dangerous, and hackers now exploit human weaknesses to attack technology through social engineering, phishing, and domain name spoofing. One of the key steps in such attacks is the use of a malicious Uniform Resource Locator (URL) to trick the user, which has led to increased interest in using machine learning and deep learning to detect malicious URLs. This paper presents a method for detecting malicious URLs based on behavioural and character traits using machine learning algorithms and big data. The proposed method incorporates new URL features and behaviours and aims to improve the detection of malicious URLs that exhibit unusual behaviour. Experimental results show that these new features and behaviours improve the identification of malicious URLs, so the proposed method can be considered an effective and efficient way of detecting malicious and phishing URLs.

I. INTRODUCTION

As society's reliance on online services increases, the number of online scams and malicious websites also increases. Many users are unaware of these threats and may assume that a website is legitimate, compromising their defences against attack. URLs are particularly exposed to malicious attacks because they are the first and most common way to access web data, so it is important to determine whether a URL is malicious or not. Techniques such as blacklists and heuristic strategies have not evolved with changing threats. Malicious URL detection applications combine static data (such as URL string attributes) with host information and HTML or JavaScript content to detect malicious URLs [1]. The main goal of malicious URL detection is to detect and block URLs that contain malware, phishing attempts, or other dangerous content. Such URLs pose a threat to user security and privacy and can reach users through various channels, including email, social media, and web pages. A Uniform Resource Locator (URL) is a unique address on the internet that directs visitors to a website and helps them identify and understand its content.

II. PROPOSED METHOD

To improve the training process, a model was developed along with an in-depth description of the information needed for training. The basis of the model is the dataset itself, because the model needs sufficient and accurate data about benign and malicious URLs. The data contains a list of URLs, each classified as benign or malicious. Each URL has a set of attributes, including the number of elements in the URL, the length of the URL, and a domain-based identifier such as "google.com". The model is trained using a technique called binary classification (also called binary logistic regression). This method has several advantages, such as achieving a high learning rate compared to other machine learning algorithms and requiring less time to train.

III. ABOUT THE URL

A Uniform Resource Locator, or URL, is the address of a web page that points to a resource on the Internet. It can be used to locate websites and identify their content. A URL has two parts: a scheme identifier and the name of the resource. The scheme identifies the protocol used to retrieve the resource, such as HTTP, FTP, or NEWS. The resource name is the address of the resource on the host; in HTTP, for example, resource names are subject to certain restrictions. The host parameter specifies the IP address or domain name of the resource, and the path parameter gives the location of the resource on the host machine [4]. The port parameter is the number assigned to the destination port, if any. Finally, the query parameters provide additional values that are passed to the resource.

Malicious URL

In addition to blacklisting, another method used to identify malicious URLs is whitelisting. Whitelisting involves creating a database of known benign URLs and allowing access only to those URLs. This technique can be effective in protecting against known threats, but it may restrict
access to legitimate websites that have not been added to the whitelist.

Behavioral analysis is another method used to identify malicious URLs [6]. It involves analyzing the behavior of a website to determine whether it is involved in malicious activities such as phishing or drive-by downloads [7]. Behavioral analysis can be effective in detecting new threats, but it can also produce false positives when legitimate websites exhibit similar behavior. Machine learning algorithms can be trained to recognize patterns in URLs that indicate malicious activity. This approach is useful for detecting new threats, but it requires a lot of data and can be evaded by attackers who deliberately disguise their URLs [10]. As a result, a combination of these techniques can be used effectively to identify and block malicious URLs [11]. Users should be careful and cautious when clicking on links, especially from unknown sources, and should use antivirus software and other security measures to block malware and other threats [12].

IV. APPROACH

The blacklisting method involves creating a library of known bad URLs that is used to filter incoming URLs. If an incoming URL matches the list, it is considered dangerous and a warning is sent to the user; otherwise, the URL is considered safe [12]. This technique is ineffective at detecting new attacks, as new malicious URLs emerge all the time. Blacklisting is fast, effective, and low-cost, but it has a serious drawback: newly created malicious URLs often go unnoticed [14]. An example of a URL blacklisting service is Google's Safe Browsing tool.

The heuristic-based approach is a variation of the blacklist approach that aims to create a "feature blacklist" [14]. Instead of storing every bad URL, it detects, extracts, and stores the features of malicious URLs, which allows threats to be detected in new, previously unseen URLs. However, this approach can be misleading, as it can produce a large number of false positives. Heuristics often rely on machine learning techniques to determine the features used for classification.

V. TRAINING AND TESTING

To implement and validate the malicious URL detection code, it is important to design testing procedures that cover a variety of scenarios, including well-known URLs, known threats, and URLs with unusual or unexpected patterns [15]. Using automated testing tools such as pytest and Selenium can simplify the testing process. Any issues or errors that arise during testing must be resolved. Once testing is complete and all issues are identified and resolved, the code can be pushed to production and monitored to ensure it works as expected [16].

VI. IMPLEMENTATION

The classification models were trained using data from around 400,000 URLs obtained from various sources, including OpenPhish and Alexa whitelists. To ensure that the data reflected both bad and good URLs, an 80-20 split was created between the two. Only necessary features should be considered when training a machine learning model, as too many features will cause the model to learn from noise and inconsistent patterns [17]. The process of selecting important features from the data is called feature selection, and it can be done by including important features or excluding irrelevant features [18]. This is the first step towards extracting essential features that accurately describe the URL. Lexical features, which refer to the tokens of the URL string, are selected because of their speed, the small amount of information required, and their ease of extraction. One-hot encoding is used to convert categorical data into numerical data and provides more accurate predictions than single integer labels [19].

Bag-of-words (BoW) is a method for converting text into meaningful numbers by counting the number of occurrences of each word, ignoring grammatical elements and word order. The CountVectorizer tool is used to tokenize the text and build a dictionary of known words, which is then used in the machine learning models. TF and TF-IDF were also used in this work to give more weight to less common words. Because document lengths vary, normalization is used in the denominator of the formula [20]. These features are used to distinguish malicious URLs from benign URLs based on various signals, including IP address substitution and even the use of invalid URLs.

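The lexical tokenization and vectorization steps described for the implementation can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the `url_tokenizer` helper, the toy URLs, and their labels are assumptions made for the example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def url_tokenizer(url):
    # Hypothetical lexical tokenizer: split the URL string on common
    # delimiters and drop empty fragments.
    return [t for t in re.split(r"[/.\-?=&:_]", url) if t]

# Toy data standing in for the labelled URL dataset (0 = benign, 1 = malicious).
urls = [
    "google.com/search?q=news",
    "paypa1-secure-login.xyz/verify?id=123",
    "wikipedia.org/wiki/URL",
    "free-prizes.win/claim.exe",
]
labels = [0, 1, 0, 1]

# Bag-of-words counts over the URL tokens, re-weighted by TF-IDF.
vectorizer = TfidfVectorizer(tokenizer=url_tokenizer, token_pattern=None)
X = vectorizer.fit_transform(urls)

# 80-20 train/test split, mirroring the split described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)
```

In the paper's setup the same idea is applied to roughly 400,000 URLs; a plain `CountVectorizer` can be swapped in where raw bag-of-words counts are wanted instead of TF-IDF weights.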
Preprocessing is a very important step in building a good machine learning model.

To determine the importance of a word, a dictionary of the distinct words is created and each word's term frequency (TF) is calculated, so that words that occur more often receive a higher weight than words that occur rarely. However, TF alone does not capture the importance of a word, because words like 'of' and 'and' appear many times even though their importance is low. The importance of each word in the dictionary is therefore also weighted by how rarely it occurs across the whole collection, its inverse document frequency (IDF): tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing the term t. TF-IDF can, however, become expensive to compute when the dictionary grows large.

Given an instance "x" that needs to be classified, the classifier places the text in a class "c": the algorithm predicts the label of the text by selecting the category with the highest score. This method is widely used in the field of Natural Language Processing (NLP).

A decision tree consists of internal nodes that represent the features of the dataset, branches that represent the decisions, and leaf nodes that represent the results. Each tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes make decisions and have several branches, while leaf nodes hold the final outputs and have no branches. The tree is used to determine the properties of the data or to run tests on it. CART, the Classification and Regression Trees algorithm, is used to create the trees of a random forest: the algorithm asks a question at each node and then divides the tree into subtrees based on the answers.

Figure 1(a): Sigmoid function. Figure 1(b): Decision Tree.

Random forest is a classification technique that uses ensemble learning to improve models and solve complex problems by combining multiple classifiers. It consists of multiple decision trees that work on different subsets of the dataset, and averaging these trees increases the accuracy of the prediction. Instead of relying on a single decision tree, random forests combine the predictions of each tree and decide the final result by majority vote. First, random data points are selected from the training set; then a decision tree is built on those points. These two steps are repeated for every tree in the forest, and each new instance is assigned to the class that receives the most votes from the individual trees.

Fig 2: Random Forest

We use the score() and confusion-matrix methods to evaluate the performance of the model. score() is the scoring method of estimators trained with scikit-learn: it takes the sample test inputs X_test and the expected outputs y_test and computes the accuracy score.
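This evaluation step can be sketched as follows. The toy feature vectors (for example URL length and digit count) and their labels are invented for illustration; only the `score()` and `confusion_matrix` usage mirrors the text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Toy numeric features and labels (0 = benign, 1 = malicious);
# illustrative only, not the paper's real dataset.
X_train = [[10, 0], [52, 7], [12, 1], [60, 9], [15, 0], [48, 6]]
y_train = [0, 1, 0, 1, 0, 1]
X_test = [[11, 0], [55, 8]]
y_test = [0, 1]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# score() returns the mean accuracy on the given test data and labels.
acc = clf.score(X_test, y_test)

# The confusion matrix arranges true/false positives and negatives in a
# square matrix: rows are true labels, columns are predictions.
cm = confusion_matrix(y_test, clf.predict(X_test))
```

Here `score()` is equivalent to computing `accuracy_score` on the classifier's own predictions.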

The four-part confusion matrix is a technique that describes the performance of a classification model on test data. Its entries are the true positives, true negatives, false positives, and false negatives, which together form a square matrix.

The classification report shows how well the predictions fit the true labels. The F1 score summarizes the performance of the model, with a score of 1.0 indicating perfect performance; accuracy is the ratio of correct predictions over the positive and negative classes; and support is the number of samples of each class present in the data. Together these metrics describe the entire performance evaluation process, although by themselves they do not always separate different models from each other.

The training phase collects and correctly classifies malicious URLs, while the detection phase extracts features from each incoming URL to classify it as clean or malicious. The data is divided into two parts: training data for training the machine learning algorithms and test data for evaluating the performance of the models. This article provides a comprehensive explanation of this process.

Figure 3(a): Proposed Method. Figure 3(b): Proposed Method.

Figure 4: Train and Test Data split

Creating features from existing features is an important step in preparing the data for statistical analysis. Since the algorithms used are based on machine learning, the inputs to these algorithms must be numeric, so the URL string must be encoded into a feature vector. By analyzing a large number of URLs, lexical features that can distinguish good URLs from bad URLs are derived; tokenization is used as an example in the implementation. The ML model is then created and used to make predictions outside the training process. This process therefore includes training the ML algorithm, predicting the label of the text with the help of the extracted features, tuning the models according to the needs of the application, and finally evaluating them on the held-out data.

VII. CONCLUSION AND FUTURE ENHANCEMENT

The main goal of this project is to identify malicious URLs based on the information provided by the URL strings alone, without downloading the page content.

For this purpose, a customized "Tokenizer" function and three learning algorithms (multinomial Naive Bayes, logistic regression, and random forest) are used. The results of these models are compared on Count-Vectorized and TF-IDF vectorized data, and the best results are provided by Random Forest. In the third stage, fuzzy string matching is used to detect attempts to trick users by sending links that imitate legitimate sources.

Future work could create more accurate models by exploring different machine learning algorithms and feature selection methods. Real-time detection is another important aspect: bad URLs are constantly changing, so researchers can look for additional signals, such as website content and network links, to identify malicious URLs in real time.
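The fuzzy string matching stage mentioned in the conclusion can be illustrated with Python's standard difflib. The paper does not name its matching library, so this is only a sketch; the similarity threshold and the sample domains are assumptions made for the example.

```python
from difflib import SequenceMatcher

# A small whitelist of well-known domains (illustrative).
KNOWN_DOMAINS = ["google.com", "paypal.com", "wikipedia.org"]

def looks_like_spoof(domain, threshold=0.85):
    """Flag a domain that is very similar to, but not exactly,
    a well-known domain (a common phishing trick)."""
    for known in KNOWN_DOMAINS:
        ratio = SequenceMatcher(None, domain, known).ratio()
        if domain != known and ratio >= threshold:
            return True
    return False

# "paypa1.com" is one character away from "paypal.com" and is flagged;
# an exact match to a known domain is not treated as a spoof.
suspicious = looks_like_spoof("paypa1.com")
legitimate = looks_like_spoof("google.com")
```

The 0.85 threshold is a tuning knob: higher values flag only near-identical look-alikes, lower values catch more variants at the cost of false positives.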
Ethical considerations for using machine learning for malicious URL detection include bias and privacy concerns. Another concern is resistance to adversarial attacks, and methods to improve the robustness of ML models against such attacks need to be explored.

REFERENCES

[1] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Learning to detect malicious URLs," ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 2, 2011.

[2] A. Das, "Content in URLs: Rapid extraction and detection of malicious URLs," in Proc. 3rd International Conference on Security and Self-Assessment, 2017, pp. 55-63.

[3] J. B. Patil, "Detecting malicious pages using URL-based authentication," International Journal of Information Security and Cyber Crime, vol. 5, no. 2, pp. 57-70, 2016.

[4] "A robust feature subset for phishing page prediction selected using maximum relevance and minimum redundancy," Journal of Theoretical and Applied Information Technology, vol. 81, no. 2, pp. 188-205, 2017.

[5] C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, 2002.

[6] D. Sahoo, C. Liu, and S. C. H. Hoi, "Malicious URL detection using machine learning: A survey," CoRR, abs/1701.07179, 2017.

[7] A. Jones, "Phishing detection: A literature survey," IEEE Communications Surveys and Tutorials, vol. 15, no. 4, pp. 2091-2121, 2013.

[8] M. Cova, C. Kruegel, and G. Vigna, "Detection and analysis of drive-by-download attacks and malicious JavaScript code," in Proc. 19th International World Wide Web Conference, ACM, 2010, pp. 281-290.

[9] R. Heartfield and G. Loukas, "A taxonomy of attacks and a survey of defence mechanisms against semantic social engineering attacks," ACM Computing Surveys (CSUR), vol. 48, no. 3, article 37, 2015.

[10] Symantec, "Internet Security Threat Report," vol. 24, 2019. Available: https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf [last accessed October 2019].

[11] S. Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, and C. Zhang, "An empirical analysis of phishing blacklists," in Proc. 6th Conference on Email and Anti-Spam (CEAS), 2009.

