Fr -Detecting Malicious Urls Using Data Analytics

This document discusses a methodology for detecting malicious URLs, particularly phishing attempts, using the One vs All Classification technique in machine learning. It highlights the challenges of traditional methods in real-time detection and proposes a resource-efficient approach that focuses on key URL elements like domain and path. The research aims to enhance accuracy and adaptability in identifying hazardous URLs, contributing valuable insights to cybersecurity practices.

Uploaded by

udhayaop
Copyright
© All Rights Reserved

DETECTING MALICIOUS URLS USING DATA ANALYTICS

ABSTRACT:
Phishing, a prevalent form of social engineering, involves attackers deceiving victims
into divulging their login credentials through deceptive forms that transmit the
information to malicious servers. The dynamic domain of data science is increasingly
reliant on machine learning, with One vs All Classification emerging as a potent
technique for identifying hazardous URLs. This approach entails classifying URLs
into distinct categories like malware, phishing, or spam. The method necessitates
training multiple binary classifiers, each specialized in distinguishing one category
from the rest. Our proposed methodology leverages One vs All Classification to
achieve robust accuracy in detecting malicious URLs, enabling training with compact
datasets focused on crucial URL elements like domain and path. The One vs All
Classification technique proves to be a promising strategy for robust malicious URL
detection. Its efficacy extends to scenarios with limited datasets and computational
resources, presenting a notable advantage over existing methodologies. This research
emphasizes the significance of URL aspects, such as domain and path, in achieving
high accuracy. The adaptability of this approach positions it as a valuable asset in the
ongoing battle against cybersecurity threats.

Keywords: URL, Dataset, Classification, Accuracy, Machine Learning

PROBLEM STATEMENT:

The evolving landscape of cyber threats poses a substantial challenge in effectively
identifying and mitigating phishing attacks, a prevalent form of social engineering.
Attackers continuously refine their tactics, employing deceptive techniques to trick
individuals into divulging sensitive information through malicious URLs. Traditional
approaches to URL classification often struggle to keep pace with these dynamic
threats, requiring extensive datasets and significant computational resources for
accurate detection. Existing methods may not be well-suited for real-time threat
detection, particularly in resource-constrained environments where large datasets and
extensive computational capabilities are not readily available. There is a pressing need
for an advanced and efficient approach that can enhance the accuracy of malicious
URL detection while overcoming the limitations associated with data requirements
and computational intensity. In response to these challenges, this research aims to
explore the potential of the One vs All Classification technique within the framework
of machine learning. The objective is to develop a more adaptive and resource-
efficient system for identifying dangerous URLs, specifically those associated with
phishing attempts. By training multiple binary classifiers to differentiate between
various URL categories, including malware, phishing, and spam, the goal is to create a
solution that is not only effective but also practical for organizations with limited
resources.

OBJECTIVE & SCOPE OF THE PROJECT:

The primary objective of this project is to develop an advanced system for detecting
malicious URLs, with a specific focus on identifying phishing attempts through the
application of the One vs All Classification technique within the domain of machine
learning. The project aims to address the escalating threat landscape by creating a
robust and efficient solution capable of accurately differentiating between safe and
hazardous URLs in real-time. The scope of the project encompasses the exploration
and implementation of One vs All Classification as a central component of the
machine learning approach. This technique involves training multiple binary
classifiers to classify URLs into distinct categories, such as malware, phishing, and
spam. The project's focus on phishing detection is motivated by the urgent need to
combat socially engineered attacks that exploit human vulnerabilities.

To achieve these objectives, the project will investigate the feasibility of training the
system using a minimal dataset that emphasizes crucial URL elements like domain
and path. This approach not only streamlines the training process but also addresses
the challenges associated with resource constraints, making the solution applicable in
diverse operational environments. The project aims to contribute valuable insights to
the field of cybersecurity by assessing the effectiveness and adaptability of One vs All
Classification in detecting phishing attempts. The research outcomes have the
potential to inform and enhance current practices in URL classification, offering a
practical and resource-efficient alternative for organizations seeking robust
cybersecurity measures.

EXISTING SYSTEM:

N. J. Alghamdi and Y. Almardeny: Using feature selection and optimization strategies,
this article presents a machine learning strategy for identifying fraudulent URLs. The
research makes use of a number of different machine learning methods, such as
Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest, in
order to categorize URLs as either harmful or benign. The Chi-Square statistical test is
utilized during the process of selecting features, and the Particle Swarm Optimization
(PSO) algorithm is utilized throughout the optimization procedure. Using the Random
Forest algorithm, the suggested methodology achieved a high accuracy. We propose
using a one vs all classification approach to identify malicious URLs. The approach
involves training multiple classifiers, one for each category of URLs (e.g., phishing,
malware, spam), using a dataset of labeled URLs. We will use feature engineering
techniques to extract meaningful features from the URLs, such as the presence of
certain keywords or characters, and input them into a machine learning model such as
Logistic Regression or SVM. The trained classifiers will then be used to predict the
category of a new, unseen URL. We will evaluate the performance of our approach
using metrics such as accuracy, precision, recall, and F1-score. By leveraging one vs
all classification and feature engineering techniques, we aim to develop an effective
and scalable method for detecting malicious URLs.
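
The feature engineering step described above can be sketched as follows. This is a minimal illustration, not the project's actual feature set: the function name `extract_features`, the keyword list, and the individual features are assumptions chosen to show how the domain and path of a URL can be turned into numeric inputs for a classifier.

```python
from urllib.parse import urlparse

# Hypothetical keyword list, chosen for illustration only.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def extract_features(url: str) -> dict:
    """Return simple lexical features from the domain and path of a URL."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    domain = parsed.hostname or ""
    path = parsed.path
    lowered = url.lower()
    return {
        "domain_length": len(domain),
        "domain_dots": domain.count("."),
        "domain_has_digit": int(any(ch.isdigit() for ch in domain)),
        "domain_has_hyphen": int("-" in domain),
        "path_length": len(path),
        "path_depth": path.count("/"),
        "has_at_symbol": int("@" in url),
        # Number of distinct suspicious keywords found anywhere in the URL.
        "keyword_hits": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = extract_features("http://secure-login.example.com/account/verify.php")
print(features)
```

Each URL becomes a small fixed-length record, which keeps the training data compact; the resulting values can be fed to Logistic Regression, SVM, or any other classifier.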

Drawbacks:

⮚ Limited Precision: Conventional methods often lack the precision required to
accurately predict malicious URLs.
⮚ Absence of Advanced Technologies: The reliance on traditional methods
means that the existing system may not incorporate advanced technologies,
such as ML.
⮚ Inability to Address Zero-Day Attacks.
⮚ Time-consuming process.
⮚ Inefficient Data Handling.

LITERATURE SURVEY:

1. TITLE: DETECTING PHISHING WEBSITES USING AUTOMATION OF
HUMAN BEHAVIOR

AUTHORS: Srinivasa Rao R, Pais AR

DESCRIPTION:

In this paper, we propose a technique to detect phishing attacks based on the behavior
of users when exposed to a fake website. Some online users submit fake credentials to
the login page before submitting their actual credentials. They observe the login
status of the resulting page to check whether the website is fake or legitimate. We
automate the same behavior with our application (FeedPhish) which feeds fake values
into login page. If the web page logs in successfully, it is classified as phishing
otherwise it undergoes further heuristic filtering. If the suspicious site passes through
all heuristic filters then the website is classified as a legitimate site. As per the
experimentation results, our application has achieved a true positive rate of 97.61%,
true negative rate of 94.37% and overall accuracy of 96.38%. Our application neither
demands third party services nor prior knowledge like web history, whitelist or
blacklist of URLs. It is able to detect not only zero-day phishing attacks but also
phishing sites which are hosted on compromised domains.

2. TITLE: GLOBAL PHISHING REPORTS

AUTHOR: Yu Zhou, Jun Xiao

DESCRIPTION:
Phishing uses a fake Web page to steal personal sensitive information such as credit
card numbers and passwords. Generally, the fake Web page is visually similar to the
legitimate target Web page. The phishers can obtain financial benefits through this
information. Anti-phishing is very important for a variety of applications such as
phishing attacks, online transaction security, and user privacy protection. In this paper,
we propose a novel and effective visual similarity based phishing detection approach
that compares the snapshot image pair of the suspected Web page and the protected
Web page. The proposed approach is based on the key insight that both the local and
the global features of the Web page image can be used to represent the visual
characteristics of the Web page together. This approach is purely on the image level,
and thus can effectively deal with non-text phishing tricks, including images or
Flash objects in the HTML contents. For the local feature, the existence of the target
logo is detected. For the global feature, the similarity of the visible part of the Web
page is considered. We implemented and evaluated the proposed approach on a large
scale dataset consisting of 2,129 real world phishing Web pages and 1,367 irrelevant
legitimate Web pages. The experimental results show that the proposed approach can
achieve over 90.00% true positive rate and 97.00% true negative rate. Our approach
has been applied in the anti-phishing project of a major Internet Service Provider and
provides periodic reports to the potential users.

3. TITLE: ENSEMBLE CLASSIFICATION AND REGRESSION-RECENT
DEVELOPMENTS, APPLICATIONS AND FUTURE DIRECTIONS

AUTHOR: Ren Y, Zhang L, Suganthan PN

DESCRIPTION:

Ensemble methods use multiple models to get better performance. Ensemble methods
have been used in multiple research fields such as computational intelligence,
statistics and machine learning. This paper reviews traditional as well as state-of-the-
art ensemble methods and thus can serve as an extensive summary for practitioners
and beginners. The ensemble methods are categorized into conventional ensemble
methods such as bagging, boosting and random forest, decomposition methods,
negative correlation learning methods, multi-objective optimization based ensemble
methods, fuzzy ensemble methods, multiple kernel learning ensemble methods and
deep learning based ensemble methods. Variations, improvements and typical
applications are discussed. Finally this paper gives some recommendations for future
research directions.

4. TITLE: TOWARDS AUTOMATIC REAL TIME IDENTIFICATION OF
MALICIOUS POSTS ON FACEBOOK

AUTHOR: Dewan P, Kumaraguru P

DESCRIPTION:

Online Social Networks (OSNs) witness a rise in user activity whenever a news-
making event takes place. Cyber criminals exploit this spur in user-engagement levels
to spread malicious content that compromises system reputation, causes financial
losses and degrades user experience. In this paper, we characterized a dataset of 4.4
million public posts generated on Facebook during 17 news-making events (natural
calamities, terror attacks, etc.) and identified 11,217 malicious posts containing URLs.
We found that most of the malicious content which is currently evading Facebook's
detection techniques originated from third party and web applications, while more
than half of all legitimate content originated from mobile applications. We also
observed greater participation of Facebook pages in generating malicious content as
compared to legitimate content. We proposed an extensive feature set based on entity
profile, textual content, metadata, and URL features to automatically identify
malicious content on Facebook in real time. This feature set was used to train multiple
machine learning models and achieved an accuracy of 86.9%. We performed
experiments to show that past techniques for phisher campaign detection identified
less than half the number of malicious posts as compared to our model. This model
was used to create a REST API and a browser plug-in to identify malicious Facebook
posts in real time.

5. TITLE: NEW RULE-BASED PHISHING DETECTION METHOD

AUTHOR: Moghimi M, Varjani AY

DESCRIPTION:

Many classification techniques have been used and devised to combat phishing
threats, but none of them is able to efficiently identify web phishing attacks due to the
continuous change and the short life cycle of phishing websites. In this paper, we
introduce a Case-Based Reasoning (CBR) Phishing Detection System (CBR-PDS). It
mainly depends on CBR methodology as a core part. The proposed system is highly
adaptive and dynamic as it can easily adapt to detect new phishing attacks with a
relatively small data set in contrast to other classifiers that need to be heavily trained
in advance. We test our system using different scenarios on a balanced set of 572
phishing and legitimate URLs. Experiments show that the CBR-PDS system accuracy
exceeds 95.62%, yet it significantly enhances the classification accuracy with a small
set of features and limited data sets.

6. TITLE: A COMPUTER VISION TECHNIQUE TO DETECT PHISHING
ATTACKS

AUTHOR: Rao RS, Ali ST

DESCRIPTION:

Phishing refers to a cybercrime that uses social engineering and technical subterfuge
techniques to fool online users into revealing sensitive information such as username,
password, bank account number or social security number. In this paper, we propose a
novel solution to defend against zero-day phishing attacks. Our proposed approach is a
combination of whitelist and visual similarity based techniques. We use a computer
vision technique called the SURF detector to extract discriminative key point features
from both suspicious and targeted websites. These features are then used for computing
the degree of similarity between the legitimate and suspicious pages. Our proposed
solution is efficient, covers a wide range of website phishing attacks and results in a
lower false positive rate.

7. TITLE: NEXT GENERATION SECURITY SOFTWARE LIMITED

AUTHORS: Ollmann G

DESCRIPTION:

Next generation wireless technology is breaking ground with the ability to surpass the
speeds of gigabit Ethernet connections. With these technological advances it is
important to also take into consideration the security of such technology. In this paper
we focus on the IEEE 802.11ac standard. We have recorded and analyzed 802.11ac
wireless traffic with packet capture software and compared the results with 2.4 GHz
802.11n and 5.0 GHz 802.11n traffic in terms of security improvements. The results
of our analysis concluded that 802.11ac does not implement new features in the
802.11 architecture and has the same security weaknesses that exist in 802.11n and
802.11g. This paper also focuses on the performance of 802.11ac as compared to
802.11n 5 GHz and 802.11gn. The results concluded that 802.11ac significantly
outperformed 802.11n 5 GHz and 802.11gn.

PROPOSED SYSTEM:
We propose a novel methodology for addressing the prevalent
cybersecurity threat of phishing through the utilization of One vs All Classification.
Phishing, a form of social engineering, involves deceiving individuals into revealing
sensitive information, such as login credentials, through deceptive URLs. Our
approach specifically targets the identification of hazardous URLs, categorizing them
into distinct classes such as malware, phishing, or spam. The core of our methodology
lies in the application of One vs All Classification, a powerful technique in machine
learning. This approach involves training multiple binary classifiers, each specialized
in distinguishing one category from the rest. By leveraging this technique, we aim to
achieve robust accuracy in detecting malicious URLs. Notably, our methodology
enables training with compact datasets that focus on crucial URL elements,
particularly the domain and path. One of the key advantages of the One vs All
Classification technique is its efficacy in scenarios with limited datasets and
computational resources. This presents a notable advancement over existing
methodologies, making it a promising strategy for robust malicious URL detection.
Our research underscores the significance of considering URL aspects, such as
domain and path, in achieving high accuracy in the identification of cybersecurity
threats.
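
The One vs All idea described above (one binary classifier per category, each separating that category from the rest) can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the binary learner is a small logistic regression trained by stochastic gradient descent, the three binary input features are hypothetical, and a "benign" category is added to the toy data for illustration.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_binary(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression classifier with stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def train_one_vs_all(X, labels):
    """Train one binary classifier per category (that category vs. the rest)."""
    models = {}
    for category in sorted(set(labels)):
        y = [1 if lab == category else 0 for lab in labels]
        models[category] = train_binary(X, y)
    return models


def predict(models, x):
    """Return the category whose classifier gives the highest probability."""
    def score(model):
        w, b = model
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(models, key=lambda c: score(models[c]))


# Hypothetical binary features: [login keyword, executable in path, promo keyword]
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0],
     [0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 0]]
labels = ["phishing", "phishing", "malware", "malware",
          "spam", "spam", "benign", "benign"]

models = train_one_vs_all(X, labels)
print(predict(models, [1, 0, 0]))
```

Because each classifier only answers a yes/no question, any binary learner (Logistic Regression, SVM, Naive Bayes) can be plugged into `train_binary` without changing the surrounding scheme.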

ADVANTAGES:

⮚ Increased Prediction Accuracy: The proposed system's machine learning
approach provides more precise predictions of malicious URLs and
interactions, aiding in better decision-making.
⮚ Early Warning System: The systematic analysis of historical data allows the
algorithm to function as an early warning system.
⮚ Utilization of Advanced ML and DL Techniques.
⮚ Efficient Data Handling.

ARCHITECTURE FOR PROPOSED SYSTEM:


HARDWARE REQUIREMENTS:

PROCESSOR : Intel i5
RAM : 8 GB
HARD DISK : 50 GB

SOFTWARE REQUIREMENTS:

PYTHON IDE : Anaconda Jupyter Notebook
BACK END : Jupyter Notebook & Django framework
FRONT END : HTML & CSS
PROGRAMMING LANGUAGE : Python

2.2 DETAILED DIAGRAM

2.2.1 Front End Module Diagrams:


2.2.2 Back End Module Diagrams:

MODULES:

Data Preprocessing

Missing values were imputed to guarantee that all the algorithms would be able to
handle them. Some algorithms, such as XGBoost, can deal with missing values
automatically without imputation. To restrict the comparison complexity, the missing
values were imputed based on their data type. For numerical data types, the missing
entries were replaced by the median value of the complete entries. For categorical data,
the missing entries were replaced by the mode value of the complete entries.
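
The imputation rule above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `impute_column`, the use of `None` to mark a missing entry, and the sample columns are assumptions.

```python
from statistics import median, mode


def impute_column(values):
    """Fill None entries with the median (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # no complete entries to learn a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = median(present)
    else:
        fill = mode(present)
    return [fill if v is None else v for v in values]


# Hypothetical columns: URL length (numeric) and top-level domain (categorical).
url_lengths = impute_column([23, None, 54, 31, None])
tlds = impute_column(["com", "net", None, "com", "org"])
print(url_lengths, tlds)
```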

DATA CLEANING:
In this module the data is cleaned. After cleaning, the data is grouped as required;
this grouping of data is known as data clustering. The data set is then checked for
missing values. If any value is missing, it is replaced with a default value. After
that, any data whose format needs to change is converted. The whole process carried
out before the prediction is known as data pre-processing. After that, the data is used
for the prediction and forecasting step.

Data splitting
After cleaning, the data is normalized and split for training and testing the model.
Once the data is split, we train the algorithm on the training data set and keep the
test data set aside. The training process produces a trained model based on the
algorithm and the values of the features in the training data. The aim of normalization
(feature scaling) is to bring all the values onto the same scale.

A dataset used for machine learning should be partitioned into three subsets:
training, test, and validation sets.

Training set: A data scientist uses a training set to train a model and define its
optimal parameters, that is, the parameters it has to learn from data.

Test set: A test set is needed to evaluate the trained model and its capability for
generalization. The latter means a model's ability to identify patterns in new,
unseen data after having been trained on the training data. It is crucial to use different
subsets for training and testing in order to detect overfitting, where a model fits the
training data well but fails to generalize.

For each experiment, we split the entire dataset into a 70% training set and a 30% test
set. We used the training set for resampling, hyperparameter tuning, and training the
model, and we used the test set to evaluate the performance of the trained model. While
splitting the data, we specified a fixed random seed, which ensured the same data
split every time the program executed.
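
A seeded 70/30 split of this kind can be sketched as follows. This is a minimal standard-library illustration; the function name `split_dataset`, the seed value 42, and the placeholder rows are assumptions.

```python
import random


def split_dataset(rows, test_fraction=0.3, seed=42):
    """Shuffle rows with a fixed seed and return (train, test)."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible and
    # independent of any other use of the global random module.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = [f"url_{i}" for i in range(10)]  # placeholder records
train, test = split_dataset(rows)
print(len(train), len(test))
```

Calling `split_dataset(rows)` again with the same seed returns the same two subsets, which is what makes repeated experiments comparable.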
A. Collection of URLs

We collected URLs of benign websites from www.alexa.com, www.dmoz.org and
personal web browser history. The phishing URLs were collected from
www.phishtank.com. The data set consists of 17,000 phishing URLs and 20,000
benign URLs. We obtained the PageRank of 240 benign websites and 240 phishing
websites by checking PageRank individually at PR Checker. We collected WHOIS
information for 240 benign websites and 240 phishing websites.

B. Host based analysis

Host-based features explain “where” phishing sites are hosted, “who” they are
managed by, and “how” they are administered. We use these features because
phishing Web sites may be hosted in less reputable hosting centers, on machines
that are not usual Web hosts, or through not so reputable registrars.

CLASSIFICATION:

TRAIN DATASET AND TEST DATASET

The training data is the initial set of data used to fit the model. The model is trained
on this data first, because it is used to learn how the features relate to the labels, and
this data is available on the system. It is the data from which the learning algorithm
teaches the model to perform its task automatically.
Testing data is the input given to the trained model. It shows how the model behaves on
data it did not see during training, and it is used only for testing.

NAIVE BAYES:
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem (or
Bayes’s rule) with strong independence (naive) assumptions. Parameter estimation for
Naïve Bayes models uses the maximum likelihood estimation. It takes only one pass
over the training set and is computationally very fast.
∙ Bayes Rule
A conditional probability is the likelihood of some conclusion, C, given some
evidence/observation, D, where a dependence relationship exists between C and D.
This probability is denoted as P(C|D), where
$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)}$$
∙ NB Classifier
The Naïve Bayes classifier is one of the high-accuracy approaches for learning to
classify text documents. Given a set of classified training samples, an application can
learn from these samples so as to predict the class of an unseen sample.
The features (n_1, n_2, n_3, n_4) which are present in a URL are assumed to be
independent of each other. Each feature n_j (1 ≤ j ≤ 4) takes a binary value showing
whether the particular property occurs in the URL. The probability that a given web
page with feature vector N = (n_1, n_2, n_3, n_4) belongs to a class r_i (r_1: Non-phishing
and r_2: Phishing) is calculated as follows:
$$P(r_i \mid N) = \frac{P(r_i)\,P(N \mid r_i)}{P(N)}$$
where, under the independence assumption, $P(N \mid r_i) = \prod_{j=1}^{4} P(n_j \mid r_i)$.
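
The calculation above can be sketched for four binary URL features as follows. This is a minimal illustration, not the project's code: the toy training rows are invented, and the likelihoods use add-one (Laplace) smoothing, an assumption that departs from the plain maximum likelihood estimate so that a feature value never seen in a class does not produce a zero probability.

```python
def train_nb(X, y):
    """Estimate class priors P(r_i) and per-feature likelihoods P(n_j = 1 | r_i)."""
    n_features = len(X[0])
    priors, likelihoods = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        # Add-one smoothing (assumption, see the note above the code).
        likelihoods[c] = [
            (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
    return priors, likelihoods


def posterior(priors, likelihoods, x):
    """Return P(r_i | N) for every class using Bayes' rule."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for j, xj in enumerate(x):
            theta = likelihoods[c][j]
            p *= theta if xj else (1 - theta)  # product over independent features
        joint[c] = p
    evidence = sum(joint.values())  # P(N)
    return {c: p / evidence for c, p in joint.items()}


# Invented toy data: each row is (n_1, n_2, n_3, n_4).
X = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0],
     [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
y = ["phishing"] * 3 + ["non-phishing"] * 3

priors, likelihoods = train_nb(X, y)
probs = posterior(priors, likelihoods, [1, 1, 1, 0])
print(probs)
```

Training needs only one pass over the data to count feature occurrences per class, which is why the method is computationally cheap.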

RANDOM FOREST CLASSIFIER:

Random forests are recently proposed statistical inference tools, deriving their
predictive accuracy from the nonlinear nature of their component decision tree
members and the power of groups. Random forest committees provide more than just
predictions; model information on data proximities can be exploited to provide
random forest features. Variable importance measures show which variables are
closely associated with a chosen response variable, while partial dependencies
indicate the relation of important variables to said response variable.

PERFORMANCE METRICS:
The data was divided into two portions, training data and testing data, consisting of
70% and 30% of the data respectively. All six algorithms were applied to the
same dataset using Enthought Canopy and the results were obtained.

Prediction accuracy is the main evaluation parameter that we used in this work.
Accuracy is the overall success rate of the algorithm and can be defined by the equation
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

CONFUSION MATRIX:

The confusion matrix is one of the most commonly used evaluation tools in predictive
analysis, mainly because it is very easy to understand and it can be used to compute
other essential metrics such as accuracy, recall, precision, etc. It is an NxN matrix that
describes the overall performance of a model when used on some dataset, where N is
the number of class labels in the classification problem.

Accuracy is the number of true positive and true negative predictions divided by the
total number of positive and negative samples. The True Positive (TP), True Negative
(TN), False Negative (FN) and False Positive (FP) counts predicted by all algorithms
are presented in the table.
True positive (TP) indicates that a positive sample is predicted as positive; TP is the
number of positive samples that the model correctly predicted as positive.
False negative (FN) indicates that a positive sample is predicted as negative; FN is the
number of positive samples that the model wrongly predicted as negative.
False positive (FP) indicates that a negative sample is predicted as positive; FP is the
number of negative samples that the model wrongly predicted as positive.
True negative (TN) indicates that a negative sample is predicted as negative; TN is the
number of negative samples that the model correctly predicted as negative.
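
The metrics derived from these four counts can be sketched as follows. The function name `classification_metrics` and the example counts are hypothetical and are not results from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a phishing (positive) vs. benign (negative) test set.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(metrics)
```

For a multi-class One vs All setup, the same function can be applied once per category by treating that category as the positive class and all others as negative.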

