Cyber Intrusion Prediction and Taxonomy System Using Deep Learning and Distributed Big Data Processing
Abstract—The issue of cybersecurity is becoming more and more serious every day at all levels and in all domains. Cyber-attacks threaten the national security of every country and nation. Furthermore, cyber-attacks can significantly harm the economy. With the rapid and continuous growth of the cyber-universe, more software is being created, more data is being generated, and cybersecurity breaches and defense strategies are getting more complex. For such a problem, considering the size and complexity of the cyber-universe, big data mining techniques and advanced machine learning solutions are the most suitable for predicting brand-new attacks, because traditional machine learning methods would not suffice to combat today’s cybersecurity issues. Anomaly-based intrusion detection systems are receiving tremendous attention nowadays because of the vast improvement and development in big data solutions. This paper utilizes highly imbalanced real-life benchmark network traffic datasets covering multiple types of attacks. After resolving the class imbalance issue in our datasets by applying an oversampling approach, our study becomes twofold. First, we build prediction models for each type of attack separately and optimize the model with the highest accuracy. Then, we build a prediction model for all attacks together using deep learning with the smallest number of features, and we optimize the model to achieve the highest accuracy. Our developed model can accurately predict the threat and the type of attack.

Index Terms—Cybersecurity, Big Data, Stream Mining, Real-Time, Intrusion Detection, Deep Learning, High-Performance Computing Clusters (HPCC)

978-1-5386-9276-9/18/$31.00 ©2018 IEEE

I. INTRODUCTION

Today’s world is more interconnected than ever before. Yet, for all its advantages, increased connectivity brings an increased risk of theft, fraud, and abuse. As Americans become more reliant on modern technology, they become more vulnerable to cyber-attacks such as corporate security breaches, spear phishing, and social media fraud. Complementary cybersecurity and law enforcement capabilities are critical to safeguarding and securing cyberspace. The issue of cybersecurity is becoming more and more serious every day at all levels and in all domains. Cyber-attacks threaten the national security of every country and nation. In this work, we tackle the important and serious issue of cybersecurity. Cybersecurity has become an urgent need for everybody all around the world, even for those who don’t deal with technology or cyberspace. All of our information is stored across cyberspace, whether it is held by federal, non-federal, or financial entities.

Intrusion detection systems (IDSs) are an essential component of any cyber or physical computer system. An IDS can help prevent the loss of information through security breaches in any organization by detecting them and raising alarms. This is done by monitoring and analyzing the incoming traffic data. IDSs, in general, are categorized into two main classes: signature-based intrusion detection systems (SIDS) and anomaly-based intrusion detection systems (AIDS). SIDS are capable of giving accurate results in a timely manner if the attempted attack or intrusion has its signature or pattern stored in the SIDS library. The limitation of this method is its inability to detect newly emerging attacks. Traditional IDSs and techniques such as signature-based detection systems are good at detecting previously known attacks, but new attacks will not be detected by these systems. It is not surprising that thousands of attacks occur on a daily basis, such as zero-day attacks, that cannot be detected using SIDS [1]. Hackers are working day and night, either individually or in teams, to generate new attacks that cannot be detected, in order to destroy or steal other people’s information. The discovery of hidden misbehaving patterns is the core of AIDS. The more accurate the discovery is, the fewer the security breaches. Designing and developing an AIDS requires data that represents intrusions or attacks well, which can then be used to train a machine learning model that learns the behavior of those attacks and predicts future attacks. Traditional machine learning and data mining techniques are the core of designing AIDS. However, with the tremendous growth in the amount of data flowing through networks (which can reach gigabytes per minute), these traditional methods and tools cannot handle this amount of data. Big data techniques and algorithms are the solution for designing an effective and efficient anomaly-based IDS for current cybersecurity needs.

Deep learning (DL) has achieved a real breakthrough due to the evolution in computing power. At first, deep learning was extensively used for image and voice recognition, since it is very capable of finding hidden patterns and difficult correlations in the data. Deep learning is the best technique to
efficiently detect zero-day attacks. With the rapid development of computation techniques, a powerful framework has been provided by Artificial Neural Networks (ANNs) with deep architectures for supervised learning. Generally speaking, a deep learning algorithm consists of a hierarchical architecture with many layers, each of which constitutes a non-linear information processing unit. In this paper, we only discuss deep architectures in Neural Networks (NNs). Deep neural networks (DNNs), which employ deep architectures in NNs, can represent functions with higher complexity if the number of layers and the number of units in a single layer are increased. Given enough labeled training data and suitable models, deep learning approaches can help us understand complex problems. In this paper, we focus on four main deep learning architectures. Other methods, like sparse coding, are also briefly discussed. Additionally, some recent advances in the field of deep learning are described [2].

The remainder of this paper is organized as follows. Section II presents the most related work. Section III presents a brief description of intrusion detection systems. Section IV discusses the concept and theory of deep learning. Section V describes the experiments done in this research and the obtained results. Section VI contains the analysis of the obtained results. Finally, conclusions and future work are presented in Section VII.

II. RELATED WORK

Improving cybersecurity by developing efficient IDSs has been discussed and researched since the emergence of the Internet. Nowadays, many federal and private organizations and companies are working collaboratively and individually on developing better IDSs to enhance cybersecurity against all kinds of attacks.

Modi et al. [3] surveyed intrusion detection in the cloud, where most of our data resides today. In introducing the most common attacks that threaten cyberspace, they first discuss the insider attack, in which the threat is initiated by authorized users; it cannot be prevented by a firewall and will not be detected by a SIDS. SIDS are only applicable to known attacks, and this is the main limitation of this detection technique. AIDS are highly recommended by the authors and their related work for use at all levels of the cloud (on a distributed architecture), since they are the best guard against unforeseen attacks. In the Internet of Things (IoT), cybersecurity remains the main challenge apart from connectivity issues.

Diro et al. [1] proposed a DL-based distributed attack detection method for IoT using the fog ecosystem and showed the effectiveness of deep learning for IoT. The authors compared the performance of distributed and parallel IDSs. They proposed parallel training of local nodes and detection of attacks in a distributed manner. However, their proposal lacks real-time intrusion detection for large amounts of data. Traditional machine learning techniques have given comparable results in other research using the same data, such as [4]–[6].

Buczak and Guven [7] surveyed the Machine Learning (ML) and Data Mining (DM) methods used for cybersecurity intrusion detection. They specifically looked at papers that described the use of different ML and DM techniques in the cyber domain, both for misuse and anomaly detection. They concluded that the methods that have been established for cyber applications are not the most effective. Given the richness and complexity of the methods, it is impossible to make a single recommendation, because several criteria must be weighed, such as accuracy, complexity, and the time for classifying an unknown instance with a simple classical prediction model. Also, it is difficult and time consuming to obtain representative data. Depending on the particular IDS, some ML techniques might be more important than others.

Rege et al. [8] used four different neural network models to make temporal predictions of how adversaries progress through cyber-attacks: nonlinear autoregressive (NAR) neural network, NAR neural network with exogenous input (NARX), NAR neural network for multi-steps-ahead prediction, and autoregressive integrated moving average (ARIMA). The models were built on data that was collected from two RTBTE (Red Team/Blue Team exercises) research sites. The authors attempted to build a framework for dynamic prediction of adversarial movement across the cyber-intrusion chain. They concluded that their analysis is inadequate since it did not account for many permutations and combinations of attack scenarios as well as different adversary types and motivations, objectives, and organizational dynamics. Hence, more data is needed to make a reliable mechanism for intrusion chain analysis.

Hindy et al. [9] presented a taxonomy of network threats for intrusion detection systems. The taxonomy is divided into three control stages (Reconnaissance, Scanning, and Attacks) in order to describe more complex attack processes. The authors attempted to create a taxonomy with the ability to inform researchers developing both intrusion detection systems and training datasets in order to increase the detection accuracy and decrease the false positive rate. With the increasing number of connected systems and networks, the taxonomy aims at facilitating the design of future defense mechanisms as well as robust systems.

AlEroud and Karabatis [10] introduced a context-domain knowledge-driven framework that has been implemented and applied in the discovery of cyber-attacks. The proposed framework is intended to address the limitations of knowledge-based IDSs, such as the lack of contextual information and domain knowledge used to detect attacks. This framework consists of several attack prediction models that are utilized in conjunction with IDSs to detect cyber-attacks. After a comprehensive research review of contextual information, they found a common classification of the contextual aspects that should be considered in IDSs to make them aware of the current context. The authors’ approach introduces domain knowledge extracted from taxonomies as a foundation for context-based reasoning in cybersecurity.

Loukas et al. [11] have shown experimentally that utilizing
TABLE II
THE MOST RELEVANT FEATURES COMPUTED IN THE PRELIMINARY ANALYSIS
Dataset             | 1              | 2                   | 3                   | 4                   | 5
DoSHTTP             | sourceTCPFlags | destinationTCPFlags | appName             | stopDateTime        | direction
DDoS                | direction      | destination         | startDateTime       | source              | appName
NetInf              | appName        | sourceTCPFlags      | destinationTCPFlags | destinationPort     | source
BFA                 | source         | appName             | direction           | destinationTCPFlags | sourcePort
BFASSH              | source         | direction           | appName             | destinationTCPFlags | destinationPort
The Aggregated Data | direction      | appName             | sourceTCPFlags      | destinationTCPFlags | sourcePort
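To illustrate how a per-dataset feature ranking like the one in Table II can be produced, the following is a minimal sketch that ranks categorical flow features by information gain. Information gain is a stand-in chosen for illustration and is not necessarily the relevance measure used in the preliminary analysis. The feature names follow Table II, while the toy flows, their values, and the helper names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on one categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    # Weighted entropy of the labels that remains after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy flows: feature names follow Table II, values are invented.
flows = [
    {"direction": "L2R", "appName": "HTTPWeb", "label": "Attack"},
    {"direction": "L2R", "appName": "SSH", "label": "Attack"},
    {"direction": "R2L", "appName": "HTTPWeb", "label": "Normal"},
    {"direction": "R2L", "appName": "SSH", "label": "Normal"},
]
labels = [flow["label"] for flow in flows]
gains = {
    name: information_gain([flow[name] for flow in flows], labels)
    for name in ("direction", "appName")
}
# Most relevant feature first.
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked)
```

In this toy data, `direction` separates the two classes perfectly and `appName` carries no information, so `direction` is ranked first.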
TABLE III
THE CONFUSION MATRIX OF THE BUILT DEEP LEARNING MODEL FOR CYBER-INTRUSION DETECTION ALONG WITH THE ERROR METRICS
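As a reading aid for this kind of table, the following minimal sketch shows how a multi-class confusion matrix and per-class error rates are typically computed. The labels and predictions are made-up illustrative values, not results from this paper, and the function names `confusion_matrix` and `per_class_error` are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of samples whose actual class is labels[i]
    and whose predicted class is labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def per_class_error(matrix):
    """Error rate per actual class: misclassified samples / samples of that class."""
    errors = []
    for i, row in enumerate(matrix):
        total = sum(row)
        errors.append((total - row[i]) / total if total else 0.0)
    return errors


# Illustrative values only (hypothetical, not taken from the paper).
labels = ["Normal", "DDoS", "BFA"]
actual = ["Normal", "Normal", "DDoS", "DDoS", "BFA", "BFA", "BFA", "Normal"]
predicted = ["Normal", "DDoS", "DDoS", "DDoS", "BFA", "Normal", "BFA", "Normal"]

matrix = confusion_matrix(actual, predicted, labels)
errors = per_class_error(matrix)
# Overall accuracy is the sum of the diagonal divided by the sample count.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

for label, row, err in zip(labels, matrix, errors):
    print(label, row, round(err, 3))
print("accuracy:", accuracy)
```

Rows correspond to actual classes and columns to predicted classes, so each row's off-diagonal entries are the misclassifications counted in that class's error rate.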
Fig. 1. The most important features of the aggregated dataset after applying the oversampling with distributed random forest and deep learning algorithms
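The oversampling step mentioned in the caption above can be illustrated with a minimal sketch of random oversampling, which duplicates minority-class rows until every class matches the size of the largest one. This is one common variant chosen for illustration; the text here does not specify which oversampling method was applied. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical imbalanced toy data: 6 normal flows and 2 attack flows.
rows = [[i] for i in range(8)]
labels = ["Normal"] * 6 + ["Attack"] * 2

new_rows, new_labels = random_oversample(rows, labels)
counts = Counter(new_labels)
print(counts)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.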
TABLE IV
EFFECTIVENESS RESULTS COMPARISON OF THE DEVELOPED MODELS
[6] B. A. Tama, A. S. Patil, and K.-H. Rhee, “An improved model of anomaly detection using two-level classifier ensemble,” in Information Security (AsiaJCIS), 2017 12th Asia Joint Conference on. IEEE, 2017, pp. 1–4.
[7] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153–1176, 2016.
[8] A. Rege, Z. Obradovic, N. Asadi, E. Parker, R. Pandit, N. Masceri, and B. Singer, “Predicting adversarial cyber-intrusion stages using autoregressive neural networks,” IEEE Intelligent Systems, no. 2, pp. 29–39, 2018.
[9] H. Hindy, E. Hodo, E. Bayne, A. Seeam, R. Atkinson, and X. Bellekens, “A taxonomy of malicious traffic for intrusion detection systems,” arXiv preprint arXiv:1806.03516, 2018.
[10] A. AlEroud and G. Karabatis, “Methods and techniques to identify security incidents using domain knowledge and contextual information,” in Integrated Network and Service Management (IM), 2017 IFIP/IEEE Symposium on. IEEE, 2017, pp. 1040–1045.
[11] G. Loukas, T. Vuong, R. Heartfield, G. Sakellari, Y. Yoon, and D. Gan, “Cloud-based cyber-physical intrusion detection for vehicles using deep learning,” IEEE Access, vol. 6, pp. 3491–3508, 2018.
[12] L. Wang and R. Jones, “Big data analytics for network intrusion detection: A survey,” International Journal of Networks and Communications, vol. 7, no. 1, pp. 24–31, 2017.
[13] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Elsevier, 2011.
[14] D. Parikh and P. Tirkha, “Data mining & data stream mining - open source tools,” International Journal of Innovative Research in Science, Engineering and Technology, vol. 2, no. 10, pp. 5234–5239, 2013.
[15] Z. Najafian, V. Aghazarian, and A. Hedayati, “Signature-based method and stream data mining technique performance evaluation for security and intrusion detection in advanced metering infrastructures (AMI),” International Journal of Computer and Electrical Engineering, vol. 7, no. 2, p. 128, 2015.
[16] H. Al Najada and X. Zhu, “iSRD: Spam review detection with imbalanced data distributions,” in Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on. IEEE, 2014, pp. 553–560.
[17] R. Balasubramanian and S. Joseph, “Intrusion detection on highly imbalance big data using tree based real time intrusion detection system: effects and solutions,” Int. J. Adv. Res. Comput. Commun. Eng., vol. 5, no. 2, pp. 27–32, 2016.
[18] L. Zhang and G. B. White, “An approach to detect executable content for anomaly based network intrusion detection,” in Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International. IEEE, 2007, pp. 1–8.
[19] B. Kicanaoglu, “Unsupervised anomaly detection in unstructured log-data for root-cause-analysis,” 2015.
[20] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
[21] J. Evermann, J.-R. Rehse, and P. Fettke, “Predicting process behaviour using deep learning,” Decision Support Systems, 2017.
[22] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[23] H. Al-Najada and I. Mahgoub, “Real-time incident clearance time prediction using traffic data from internet of mobility sensors,” in Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence & Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), 2017 IEEE 15th Intl. IEEE, 2017, pp. 728–735.
[24] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, vol. 31, no. 3, pp. 357–374, 2012.
[25] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” in ICISSP, 2018, pp. 108–116.
[26] E. Ramentol, Y. Caballero, R. Bello, and F. Herrera, “SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory,” Knowledge and Information Systems, vol. 33, no. 2, pp. 245–265, 2012.
[27] A. Estabrooks, T. Jo, and N. Japkowicz, “A multiple resampling method for learning from imbalanced data sets,” Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.
[28] V. Srinivasan, S. Suri, and G. Varghese, “Packet classification using tuple space search,” in ACM SIGCOMM Computer Communication Review, vol. 29, no. 4. ACM, 1999, pp. 135–146.