Detecting Malicious URLs Using Machine Learning Techniques
Abstract—The World Wide Web supports a wide range of criminal activities such as spam-advertised e-commerce, financial fraud and malware dissemination. Although the precise motivations behind these schemes may differ, the common denominator lies in the fact that unsuspecting users visit their sites. These visits can be driven by email, web search results or links from other web pages. In all cases, however, the user is required to take some action, such as clicking on a desired Uniform Resource Locator (URL). In order to identify these malicious sites, the web security community has developed blacklisting services. These blacklists are in turn constructed by an array of techniques including manual reporting, honeypots, and web crawlers combined with site analysis heuristics. Inevitably, many malicious sites are not blacklisted, either because they are too recent or because they were never or incorrectly evaluated.

In this paper, we address the detection of malicious URLs as a binary classification problem and study the performance of several well-known classifiers, namely Naïve Bayes, Support Vector Machines, Multi-Layer Perceptron, Decision Trees, Random Forest and k-Nearest Neighbors. Furthermore, we adopted a public dataset comprising 2.4 million URLs (instances) and 3.2 million features. The numerical simulations show that most classification methods achieve acceptable prediction rates without requiring either advanced feature selection techniques or the involvement of a domain expert. In particular, Random Forest and Multi-Layer Perceptron attain the highest accuracy.

I. INTRODUCTION

With the undeniable prominence of the World Wide Web as the paramount platform supporting knowledge dissemination and increased economic activity, security continues to be at the forefront of many companies' and governments' research efforts.

Symantec's 2016 Internet Security Report [1] elaborates on an ample array of global threats that includes corporate data breaches, attacks on browsers and websites, spear-phishing attempts, ransomware and other types of fraudulent cyber activities. The report also unveils several tricks used by scammers. One well-known and surprisingly effective strategy is baiting users into clicking on a malicious Uniform Resource Locator (URL), which leads to the system being compromised in some way.

In order to identify these malicious sites, the web security community has developed blacklisting services. These blacklists are in turn developed by a myriad of techniques including manual reporting, honeypots, and web crawlers combined with site analysis heuristics [2]. While URL blacklisting has been effective to some extent, it is rather easy for an attacker to deceive the system by slightly modifying one or more components of the URL string. Inevitably, many malicious sites are not blacklisted, either because they are too recent or because they were never or incorrectly evaluated.

Several studies in the literature tackle this problem from a Machine Learning standpoint: they compile a list of URLs that have been classified as either malicious or benign and characterize each URL via a set of attributes. Classification algorithms are then expected to learn the boundary between the decision classes.

De las Cuevas et al. [3] reported classification rates of about 96%, which climbed to 97% with a rough-set-based feature selection preprocessing step that reduced the original 12 features to 9. Each URL was labeled according to a set of security rules dictated by the Chief Security Officer (CSO) of a company. This resulted in an imbalanced classification problem that was dealt with via undersampling. In total, 57,000 URL instances were considered after removing duplicates. The authors noticed an improvement over the results attained in their previous work [4].

Kan and Thi [5] classified web pages not by their content but by their URLs, which is much faster as no delays are incurred in fetching the page content or parsing the text. Each URL was segmented into multiple tokens from which classification features were extracted; these features modeled sequential dependencies between tokens. The authors pointed out that the combination of high-quality URL segmentation and feature extraction improved the classification rate over several baseline techniques. Baykan et al. [6] pursued a similar objective: topic classification from URLs. They trained separate binary classifiers for each topic (student, faculty, course and project) and were able to improve over the best […]

[…] URLs in light of the wealth of information they carry about […] classification rate (95%-99%) and a low false positive rate.
Fig. 2. Accuracy, precision and recall per classification method and per feature set
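For reference, the three quantities plotted in Fig. 2 follow from the binary confusion matrix. The sketch below computes them in plain Python; the label vectors are made-up examples (1 = malicious, 0 = benign), not results from the paper:

```python
def confusion(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Accuracy, precision and recall, as reported per method in Fig. 2."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical classifier output
print(metrics(y_true, y_pred))
# → (0.75, 0.75, 0.75)
```

Precision penalizes benign URLs flagged as malicious (false positives), while recall penalizes malicious URLs that slip through (false negatives); accuracy alone can be misleading on imbalanced class distributions of the kind surveyed in [12], [13].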
[…] feature set B or feature set C as proxies for set A. The benefit of using feature set C over the other sets is that it does not require a preliminary exploratory analysis to determine the correlations, relying solely on the greater descriptive power of a real-valued feature over a binary feature.

It is worth highlighting that these accuracy rates have been achieved without requiring either advanced feature selection techniques or the involvement of a domain expert. Nevertheless, the methods seem to achieve competitive results. We believe that classification problems with over 2 million entries and over 3 million features are as much about feature selection as they are about the identification of an adequate classifier. The fact that the feature set is sparsely populated and contains mainly binary attributes makes feature selection a more challenging task. Categorical features are spread out over multiple binary attributes, so no single attribute contains full knowledge about the underlying feature. Given that numerical features are not coded this way and, as a consequence, do not suffer from the aforementioned drawback, they are interesting candidates to use during training. The results of this paper suggest that the classification methods achieve competitive prediction accuracy rates for URL classification when only the numerical features are used for training.

REFERENCES

[1] Symantec, "2016 Internet security threat report," https://www.symantec.com/security-center/threat-report, 2016, [Online; accessed 11-Aug-2016].
[2] P. Prakash, M. Kumar, R. R. Kompella, and M. Gupta, "PhishNet: predictive blacklisting to detect phishing attacks," in INFOCOM, 2010 Proceedings IEEE. IEEE, 2010, pp. 1–5.
[3] P. de las Cuevas, Z. Chelly, A. Mora, J. Merelo, and A. Esparcia-Alcázar, "An improved decision system for URL accesses based on a rough feature selection technique," in Recent Advances in Computational Intelligence in Defense and Security. Springer, 2016, pp. 139–167.
[4] A. Mora, P. De las Cuevas, and J. Merelo, "Going a step beyond the black and white lists for URL accesses in the enterprise by means of categorical classifiers," ECTA, pp. 125–134, 2014.
[5] M.-Y. Kan and H. O. N. Thi, "Fast webpage classification using URL features," in Proceedings of the 14th ACM International Conference on Information and Knowledge Management. ACM, 2005, pp. 325–326.
[6] E. Baykan, M. Henzinger, L. Marian, and I. Weber, "Purely URL-based topic classification," in Proceedings of the 18th International Conference on World Wide Web. ACM, 2009, pp. 1109–1110.
[7] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond blacklists: learning to detect malicious web sites from suspicious URLs," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 1245–1254.
[8] ——, "Learning to detect malicious URLs," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 30, 2011.
[9] P. Zhao and S. C. Hoi, "Cost-sensitive online active learning with application to malicious URL detection," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 919–927.
[10] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Identifying suspicious URLs: an application of large-scale online learning," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 681–688.
[11] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2012.
[12] Y. Sun, A. K. Wong, and M. S. Kamel, "Classification of imbalanced data: a review," International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, no. 4, pp. 687–719, 2009.
[13] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, "An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics," Information Sciences, vol. 250, pp. 113–141, 2013.
[14] B. Frénay and M. Verleysen, "Classification in the presence of label noise: a survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2014.
[15] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S. Huang, "Semisupervised learning of classifiers: theory, algorithms, and their application to human-computer interaction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 12, pp. 1553–1566, 2004.
[16] I. Triguero, S. García, and F. Herrera, "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study," Knowledge and Information Systems, vol. 42, no. 2, pp. 245–284, 2015.
[17] G. Tsoumakas and I. Katakis, "Multi-label classification: an overview," International Journal of Data Warehousing and Mining, vol. 3, no. 3, pp. 1–13, 2007.
[18] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?" Journal of Machine Learning Research, vol. 15, pp. 3133–3181, 2014.
[19] M. Wainberg, B. Alipanahi, and B. J. Frey, "Are random forests truly the best classifiers?" Journal of Machine Learning Research, vol. 17, no. 110, pp. 1–5, 2016.
[20] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[21] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[22] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[23] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[24] M. A. Hearst, S. T. Dumais, E. Osman, J. Platt, and B. Scholkopf, "Support vector machines," IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
[25] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2001.
[26] T. Joachims, "Transductive inference for text classification using support vector machines," in ICML, vol. 99, 1999, pp. 200–209.
[27] F. Rosenblatt, "Principles of neurodynamics: perceptrons and the theory of brain mechanisms," DTIC Document, Tech. Rep., 1961.