Abstract—The problem of online insecurity has become increasingly common and dangerous: hackers now exploit human weaknesses to attack technology through social engineering, phishing, and domain-name spoofing. One of the key steps in such attacks is the use of a malicious Uniform Resource Locator (URL) to trick the user. This has led to increased interest in using machine learning and deep learning to detect malicious URLs. This paper presents a method for detecting malicious URLs based on behavioural and character-level features using machine learning algorithms and big data. The proposed method incorporates new URL features and behaviours, machine learning algorithms, and big data, and aims to improve the detection of malicious URLs from unusual behaviours. Experimental results show that the new URL features and behaviours improve the ability to identify malicious URLs. The proposed method can therefore be considered an effective and efficient way to detect malicious phishing URLs.

A Uniform Resource Locator (URL) is a unique address on the internet that directs visitors to a website and helps them identify and understand its content.

II. PROPOSED METHOD
To improve the detection process, a model was developed around an in-depth description of the information needed for training. The basis of the model is the dataset itself, because the model needs sufficient and accurate data about both benign and malicious URLs. The data contains a list of URLs, each classified as malicious or benign. Each URL has a set of attributes, including the number of elements in the URL, the length of the URL, and the string identifier itself, such as "google.com". The model is trained using a binary classification technique (logistic regression). This method has several advantages, such as achieving a high learning rate compared to other machine learning algorithms while requiring less time to train.
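The sketch below illustrates the kind of binary classifier this section describes, assuming scikit-learn; the URLs, labels, and feature choices are illustrative placeholders, not the paper's actual dataset or feature set.

```python
# Minimal sketch of a logistic-regression binary classifier trained on
# simple URL attributes. Data and features are illustrative only.
from urllib.parse import urlparse

import numpy as np
from sklearn.linear_model import LogisticRegression

def url_features(url: str) -> list[float]:
    """Simple numeric attributes of a URL: length and element counts."""
    parsed = urlparse(url)
    return [
        len(url),            # total URL length
        url.count("."),      # number of dots
        url.count("/"),      # number of slashes (rough path depth)
        url.count("-"),      # hyphens, common in spoofed host names
        len(parsed.netloc),  # host-name length
    ]

urls = ["https://google.com", "http://paypa1-login.example-bad.com/verify"]
labels = [0, 1]  # 0 = benign, 1 = malicious

X = np.array([url_features(u) for u in urls])
model = LogisticRegression().fit(X, labels)
print(model.predict_proba([url_features("https://github.com")]))
```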
Whitelisting, the inverse of blacklisting, can deny access to legitimate websites that have not yet been added to the whitelist [14].

Behavioral analysis is another method used to identify malicious URLs [6]. It involves analyzing the behavior of a website to determine whether it is involved in malicious activities such as phishing or drive-by downloads [7]. Behavioral analysis can be effective at detecting new threats, but it can also produce false positives when legitimate websites exhibit similar behavior. Machine learning algorithms can be trained to recognize patterns in URLs that indicate malicious activity. This approach is useful for detecting new threats and mitigating vulnerabilities, but it requires a large amount of data and can be fooled by attackers who deliberately try to evade it [10]. As a result, a combination of these techniques can be used to effectively identify and block malicious URLs, as the sketch below illustrates [11]. Users should also be cautious when clicking on links, especially from unknown sources, and should use antivirus software and other security measures to protect themselves and block malware and other threats [12].
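As a rough illustration of combining these techniques, the sketch below checks a small blacklist first and falls back to a trained classifier for unknown URLs; `url_features` and `model` are the hypothetical helpers from the earlier sketch, and the blacklist entry is made up.

```python
# Combining techniques: exact blacklist lookup first, learned model as
# the fallback for URLs the blacklist has never seen.
KNOWN_BAD = {"http://malware.example.com/payload"}  # illustrative entry

def is_malicious(url: str, model, threshold: float = 0.5) -> bool:
    if url in KNOWN_BAD:                 # fast path: known-bad URL
        return True
    prob = model.predict_proba([url_features(url)])[0, 1]
    return prob >= threshold             # otherwise defer to the classifier
```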
IV. APPROACH
The blacklisting method involves creating a library of known bad URLs against which incoming URLs are filtered. If a URL matches the list, it is considered dangerous and a warning is sent to the user; otherwise, if there is no match, the URL is considered safe [12].

This technique is ineffective at detecting new attacks, as new malicious URLs emerge all the time. Blacklisting is fast, effective, and low-cost, but it has an unpleasant side effect: newly created malicious URLs often go unnoticed [14]. An example of a URL blacklisting service is Google's Safe Browsing tool.
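A minimal sketch of this blacklist filter, assuming a plain-text feed of known-bad URLs (the file name is hypothetical):

```python
# Blacklist filtering as described above: load a library of known-bad
# URLs, warn on a match, treat everything else as safe.
def load_blacklist(path: str = "openphish_feed.txt") -> set[str]:
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

def check_url(url: str, blacklist: set[str]) -> str:
    if url in blacklist:
        return f"WARNING: {url} matches the blacklist"
    return f"{url} not on the blacklist; treated as safe"
```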
A heuristic-based approach is a variation of the blacklist approach that aims to create a signature or "feature blacklist" [14]. Instead of storing every bad URL, it detects, extracts, and stores the signatures of malicious URLs, which allows threats to be identified even in new, previously unseen URLs. However, this approach can also be misleading, as it can produce a large number of false positives. Heuristic methods often rely on machine learning techniques to determine which features to use for classification.
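The sketch below shows one way such a "feature blacklist" could work: store recurring suspicious tokens mined from known-bad URLs rather than the URLs themselves, and flag new URLs containing them. The token list and threshold are assumptions for illustration.

```python
# "Feature blacklist" heuristic: match new URLs against stored
# suspicious-token signatures instead of exact bad URLs.
import re

SIGNATURE_TOKENS = {"login", "verify", "update", "secure", "account"}

def suspicious_tokens(url: str) -> set[str]:
    # Split the URL on non-alphanumeric characters and keep known tokens.
    return {t.lower() for t in re.split(r"[\W_]+", url) if t} & SIGNATURE_TOKENS

def heuristic_flag(url: str, min_hits: int = 2) -> bool:
    # Requiring several hits reduces (but cannot eliminate) the false
    # positives discussed above.
    return len(suspicious_tokens(url)) >= min_hits
```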
V. TRAINING AND TESTING
To implement and validate the malicious URL detection code, it is important to design testing procedures that cover a variety of scenarios, including well-known benign URLs, known threats, and URLs with unusual or unexpected patterns [15]. Automated testing tools such as pytest and Selenium can simplify the testing process. Any issues or errors that arise during testing must be resolved. Once testing is complete and all issues have been addressed, the code can be pushed to production and monitored to ensure it works as expected [16].
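A sketch of pytest cases covering the scenarios listed above; it assumes the hypothetical `is_malicious()` helper and trained `model` from the earlier sketches live in a `detector` module.

```python
# Test scenarios: well-known benign URLs, known threats, odd patterns.
import pytest
from detector import is_malicious, model  # hypothetical module

@pytest.mark.parametrize("url, expected", [
    ("https://google.com", False),                 # well-known benign URL
    ("http://malware.example.com/payload", True),  # known blacklisted threat
])
def test_known_urls(url, expected):
    assert is_malicious(url, model) == expected

def test_unusual_pattern_is_handled():
    # URLs with unexpected shapes should classify without raising.
    assert is_malicious("http://192.168.0.1//..//", model) in (True, False)
```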
VI. IMPLEMENTATION
The classification models were trained on data from around 400,000 URLs obtained from various sources, including the OpenPhish feed and the Alexa whitelist. To ensure that the data reflected both bad and good URLs, an 80-20 split was maintained between the two. Only the necessary features should be considered when training a machine learning model, because too many features cause the model to learn from noise and inconsistent patterns [17]. The process of selecting important features from the data is called feature selection, and it can be done by including important features or excluding irrelevant ones [18]. This is the first step towards extracting the essential features that accurately describe a URL.
Preprocessing is a very important step in building a good
machine learning model.
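A minimal preprocessing sketch, assuming pandas and an illustrative `url` column name: normalize, deduplicate, and drop malformed entries before training.

```python
# URL preprocessing before training: normalize case and whitespace,
# remove duplicates, and keep only well-formed http(s) URLs.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["url"] = df["url"].str.strip().str.lower()   # normalize text
    df = df.drop_duplicates(subset="url")           # remove duplicate URLs
    df = df[df["url"].str.contains(r"^https?://")]  # drop malformed entries
    return df
```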
Lexical features refer to the textual properties of the URL string itself, as opposed to host-based features; they were selected because of their speed, the small amount of information they require, and their ease of extraction. One-hot encoding is used to convert categorical data into numerical data, and it can yield more accurate predictions than simple integer label encoding [19].

Bag-of-words (BoW) is a method for converting text into meaningful numbers by counting the occurrences of words while ignoring grammatical elements and word order. A count vectorizer is used to tokenize the text and build a vocabulary of known words, which is then fed to the machine learning models. Term frequency (TF) and TF-IDF weighting were also used in this work to give insight into less common words; because document lengths vary, a normalization term is used in the denominator of the formula [20]. These features help distinguish malicious URLs from benign ones based on various suspicious characteristics, including the use of raw IP addresses in place of domain names and even the use of malformed URLs.
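The sketch below shows BoW and TF-IDF applied to raw URL strings with scikit-learn's vectorizers; the token pattern is an assumption about how URLs are split into words.

```python
# Bag-of-words counts and TF-IDF weights over URL tokens.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

urls = ["https://google.com/search", "http://secure-login.verify-account.biz"]

bow = CountVectorizer(token_pattern=r"[A-Za-z0-9]+")    # raw word counts
X_bow = bow.fit_transform(urls)

tfidf = TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+")  # down-weights common tokens
X_tfidf = tfidf.fit_transform(urls)

print(bow.get_feature_names_out())  # the learned vocabulary
```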
The four-part confusion matrix is a technique for describing the performance of a classification model on test data. Its entries are the true positives, true negatives, false positives, and false negatives, arranged in a 2x2 matrix.

The classification report shows precision, recall, F1 score, and support for the positive and negative classes. Precision is the ratio of correct positive predictions to all positive predictions, and recall is the ratio of correct positive predictions to all actual positives. The F1 score, the harmonic mean of precision and recall, summarizes the performance of the model: a score of 1.0 indicates perfect performance. Accuracy is the ratio of correct predictions over both the positive and negative classes, and support is the number of samples of each class present in the data. Together, these metrics describe the entire performance evaluation process and make it possible to compare different models with one another.
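A short sketch of this evaluation with scikit-learn; the label vectors are placeholders standing in for real test results.

```python
# Confusion matrix plus precision/recall/F1/support for a binary model.
from sklearn.metrics import classification_report, confusion_matrix

y_test = [0, 0, 1, 1, 1]  # placeholder ground truth (0 = benign, 1 = malicious)
y_pred = [0, 1, 1, 1, 0]  # placeholder predictions

print(confusion_matrix(y_test, y_pred))      # [[TN, FP], [FN, TP]]
print(classification_report(y_test, y_pred)) # precision, recall, F1, support
```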
[3] D. R. Patil and J. B. Patil, "Malicious Web Pages Detection Using Static Analysis of URLs," International Journal of Information Security and Cybercrime, vol. 5, no. 2, pp. 57-70, 2016.
[4] "Selection of Robust Feature Subsets for Phishing Webpage Prediction Using Maximum Relevance and Minimum Redundancy Criterion," Journal of Theoretical and Applied Information Technology, vol. 81, no. 2, pp. 188-205, 2017.
[8] S. Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, and C. Zhang, "An Empirical Analysis of Phishing Blacklists," in Proceedings of the 6th Conference on Email and Anti-Spam (CEAS), 2009.