Recommended Citation
Anani, Wafaa, "Recurrent Neural Network Architectures Toward Intrusion Detection" (2018). Electronic
Thesis and Dissertation Repository. 5625.
https://ir.lib.uwo.ca/etd/5625
Abstract
Recurrent Neural Networks (RNN) show remarkable results in sequence learning, particularly in architectures with gated unit structures such as Long Short-term Memory (LSTM). In recent years, several variants of the LSTM architecture have been proposed, mainly to overcome its computational complexity. This dissertation presents a novel empirical study that investigates and evaluates LSTM architecture variants, such as the Gated Recurrent Unit (GRU), Bi-Directional LSTM, and Dynamic-RNN for LSTM and GRU, specifically for detecting network intrusions. The investigation is designed to identify the learning time required by each architecture and to measure intrusion prediction accuracy. Each RNN architecture was evaluated on the DARPA/KDD Cup'99 intrusion detection dataset. Feature selection mechanisms, namely Principal Component Analysis (PCA) and the RandomForest (RF) algorithm, were also implemented to help identify and remove non-essential variables that do not affect the accuracy of the prediction models. The results showed that RF captured more significant features than PCA: accuracy with RF was 97.86% for LSTM and 96.59% for GRU, whereas with PCA it was 64.34% for LSTM and 67.97% for GRU. In terms of RNN architectures, the prediction accuracy of each variant improved at specific parameter settings, yet with a large dataset and suitable training time, the standard vanilla LSTM tended to lead all other RNN architectures, scoring 99.43%. Dynamic-RNNs offered competitive accuracy, with Dynamic-RNN GRU scoring 99.34%; however, they tended to take longer to train at high training cycles, with Dynamic-RNN LSTM needing 25284.03 seconds at 1000 training cycles. The GRU architecture is a variant introduced to reduce LSTM complexity; developed with fewer parameters, it results in a faster-trained model, needing 1903.09 seconds where LSTM required 2354.93 seconds for the same training cycle, while showing equivalent performance with respect to parameters such as hidden layers and time-steps. BLSTM offered an impressive training time of 190 seconds at 100 training cycles, though its accuracy, which did not exceed 90%, was below that of the other RNN architectures.
Keywords: Recurrent Neural Networks, Gated Recurrent Unit, Long Short-term Memory,
Skip-LSTM, Bi-Directional LSTM, Dynamic-RNN, Intrusion Detection, Deep Learning.
Acknowledgements
Foremost, I would like to express my sincere gratitude to my advisor Dr. Jagath Samarabandu
of the Electrical and Computer Engineering department at Western University, for the contin-
uous support of my MESc study and research, as well as his patience, motivation, enthusiasm,
and immense knowledge. The door to Dr. Samarabandu's office was always open whenever I was stuck on an issue or had a question to ask. He consistently steered me in the right direction whenever needed. His guidance assisted me in the writing of this thesis.
I would also like to sincerely thank my labmates, Gobi and Nadun, for their companionship
during this time and their feedback during our invaluable lab meetings.
I must express my gratitude to my parents for providing me with unfailing support and
continuous encouragement throughout my years of study. And finally to my sister Lina, who
experienced all the ups and downs that I faced in my research, and kept me motivated.
Contents
Certificate of Examination i
Abstract i
Acknowledgements ii
List of Figures v
List of Tables vi
Abbreviations vii
1 Introduction 1
1.1 Overview 1
1.2 Motivation 2
1.3 Problem Statement 3
1.4 Purpose, Scope, and Contribution 3
1.5 Research Methodology 5
1.6 Thesis Organization 6
2 Literature Review 8
3 Background 15
3.1 Intrusion Detection and Machine Learning 15
3.2 Recurrent Neural Network (RNN) 17
3.3 Long Short-term Memory (LSTM) 18
3.4 Gated Recurrent Unit (GRU) 19
3.5 Bi-Directional LSTM (BLSTM) 21
3.6 Dynamic-RNN LSTM/GRU 22
3.7 Random Forest (RF) 22
3.8 Principal Component Analysis (PCA) 23
3.9 Parameters 23
3.9.1 Learning Rate 24
3.9.2 Hidden Layers 24
3.9.3 Hidden Units 24
3.9.4 Time-Steps 24
3.10 Evaluation Metrics 25
4 Experimental Results 26
4.1 Dataset Description 27
4.2 Data Preprocessing 28
4.3 Feature Selection 28
4.3.1 RandomForest (RF) 29
4.3.2 Principal Component Analysis (PCA) 29
Bibliography 50
A 56
Curriculum Vitae 59
List of Figures
4.1 Feature Selection Based on RF 29
4.2 Feature Selection Based on the PCA Classifier 31
List of Tables
List of Abbreviations
DRNN Dynamic-RNN
RF RandomForest
Chapter 1
Introduction
1.1 Overview
Intrusion detection is a key research area in network security. A common approach to intrusion detection is detecting anomalies in network traffic; however, network threats are evolving at an unprecedented rate. The gap between the evolution of threats and a network's current detection response time leaves the system vulnerable to attacks [1]. Over the years, a number of machine learning techniques have been developed to detect network intrusions using packet prediction [2]. The Recurrent Neural Network (RNN) is the most popular method for performing classification and other analysis on sequences of data. A subset network of RNN is Long Short-term Memory (LSTM), introduced by Hochreiter and Schmidhuber (1997) [3]. LSTM is a key algorithm for machine learning tasks that involve sequential data. Successful deployments of LSTM have led industry to invest heavily in implementing the algorithm in a wide range of applications, including voice recognition [4], [5], handwriting recognition [6], machine translation, and social media filtering, thus making LSTM a natural candidate for Intrusion Detection Systems (IDS). Yet this
algorithm incurs high computational costs when deployed at large scale, in both time and memory complexity [7], [8]. To overcome this challenge, several variations of the algorithm have been proposed, including the Gated Recurrent Unit (GRU), Bi-Directional long short-term memory (BLSTM), Dynamic-RNN for LSTM and GRU, and Skip-RNN. This thesis presents a novel empirical study investigating and implementing these variants of the LSTM architecture for intrusion detection based on predicting packet sequences. The implementation of each architecture was evaluated in terms of training time, prediction accuracy (normal or intrusion), and parameter sensitivity, as well as several performance metrics including precision, recall, and false alarm rate. Experiments were conducted on the full KDD Cup'99 intrusion detection dataset [9]; the algorithms were evaluated on the entire dataset rather than on the 10% subset used in the majority of intrusion detection literature.
1.2 Motivation
The Long Short-term Memory (LSTM) network is a subset of the RNN, an architecture that excels at storing sequential short-term memories and retrieving them many time-steps later. RNN has the capability to learn from previous time-steps of the input data: the data at each time-step is processed, stored, and given as input to the next time-step, where the algorithm uses the previously stored data to process the new information. Such an architecture, with robust computational power, is well suited to security applications, in particular those dealing with streaming data such as sequences of network packets. The security domain is continuously researching ways to keep up with the evolution of intrusions. The application of RNN to intrusion detection is still in the initial stages of research and has immense potential for adapting these gated algorithms to learn insights much faster and provide intrusion detection close to real time. Though neural network structures are complex, with the right set of parameters they can be tuned to obtain light-weight functionality. This served as the motivation to explore gated RNNs and to focus on the comparison between Long Short-term Memory (LSTM) and its variants, such as GRU, BLSTM, Dynamic-RNN for LSTM/GRU, and Skip-RNN.
1.3 Problem Statement
The main goal of this dissertation is to explore and analyze various RNN architectures, tuning each architecture with a different set of parameters, such as hidden layers, time-steps, training cycles, and learning rate, with the goal of identifying the parameters that achieve the shortest training time for each algorithm and the highest accuracy in predicting whether a network stream packet is an intrusion or not.
• What set of parameters helps to achieve high accuracy and a short training time?
1.4 Purpose, Scope, and Contribution
The purpose of this research is to evaluate different RNN algorithms on an intrusion detection dataset. The best-known algorithms were identified: LSTM, GRU, BLSTM, and Dynamic-RNN LSTM/GRU. This research has immense potential to open doors for improving the intrusion detection domain, as it offers a better detection rate for catching an attack before network security is compromised. The scope of this project is to find the most suitable architecture for this purpose by evaluating and analyzing the performance of the selected algorithms in terms of prediction accuracy and the time required to train each algorithm on an intrusion detection dataset. This study did not measure detection time, i.e., the time elapsed between the initial breach of a network by an attacker and the discovery of that breach. The contributions of this thesis are as follows:
• Implemented, evaluated and compared the different RNN algorithms (LSTM, GRU,
BLSTM, and DRNN LSTM/GRU) in terms of training time and prediction accuracy.
• Identified the sensitivity of each algorithm with respect to its individual parameters, such as learning rate, hidden layers, and training cycles, then measured the prediction accuracy.
• Introduced for the first time DRNN LSTM/GRU to the intrusion detection domain.
• Calculated the performance metrics including precision, recall, and false alarm rate for
each algorithm.
• Presented two feature selection algorithms, RandomForest (RF) and Principal Component Analysis (PCA), for the intrusion detection domain. A comparison between the two algorithms was conducted to find the one that best represents the data in terms of prediction accuracy.
• Overall results were presented illustrating the best-case scenario obtained for each algorithm to be employed in the intrusion detection domain.
• The proposed optimized LSTM model scored 99.43%, correctly detecting 19,593 more attacks out of 3,925,650 when compared with LSTM models in the literature.
1.5 Research Methodology
Domain issues and challenges with regard to intrusion detection were identified through literature reviews. The most often used algorithms and experiments were identified, as well as the accuracy rate of each. As a result, it became evident that RNN architectures are new techniques in the intrusion detection domain. Proof of the concept of using LSTM and GRU algorithms in the field of intrusion detection is scarce due to a lack of intensive experiments. None of the literature consulted demonstrated the best architecture or compared the algorithms in terms of their parameters to achieve the best performance with high accuracy. There are challenges facing IDS which make this goal difficult to achieve: classification of data and labeling of unlabeled data is a challenging task, as the current high volume of network traffic increases the number of attacks.

A module for each selected architecture was therefore developed. Feature selection was then implemented to ensure the best representation of the data and to better present the underlying problem to each prediction model, resulting in improved model accuracy on unseen data. The experiment was conducted in two phases. The first can be described as the feature selection phase, using two different selection mechanisms: Principal Component Analysis (PCA) and RandomForest (RF). RF learns from inputs and improves performance over time; for intrusion detection, the aim is for the algorithm to learn over time which of the network features are best. PCA selects a new feature set that reduces feature redundancy and improves performance [10], [11]. It had to be decided which algorithm would offer better performance for the intrusion detection domain and could be implemented in an IDS on real-time traffic. The second phase evaluates the selected algorithms on the full KDD Cup'99 dataset, a commonly used intrusion detection dataset. A baseline was created with initial values for each parameter, based on the literature review. Different values were used with each run to fine-tune the parameters and identify those that showed the best prediction accuracy. The research methodology is illustrated in
Figure 1.1.
1.6 Thesis Organization
In Chapter 4, the experiment is demonstrated, highlighting the dataset description, preprocessing, and the selection of features
using RF and PCA. In Chapter 5, the results of the experiment are reported, along with a discussion of the performance of each algorithm and an overall analysis. This thesis is concluded in Chapter 6 by summarizing the work carried out, the contributions made, and the conclusions from the results obtained. Further research areas are also outlined in light of the need to secure networks in IoT applications, to investigate deep learning, and to build a hybrid framework with layers of different architectures.
Chapter 2
Literature Review
This chapter highlights related work on anomaly detection techniques in combination with machine learning algorithms, in particular LSTM architectures. LSTM is considered a special kind of RNN that has the form of a chain of repeating modules of a neural network. These repeating modules, called "memory cells" as illustrated in Figure 2.1, contain four "gates" that handle storing, finding long-range dependencies, and determining what information to keep or forget [12]. The LSTM architecture is widely used in sequential data problems, especially ones related to natural language [5], [13], [14]. LSTM and its variants have been used in many studies in the field of intrusion detection. Tuor et al. [15] presented an online deep learning method for intrusion detection due to its excellent ability to learn patterns, employing deep neural network autoencoders for unsupervised network anomaly detection using time-aggregated statistics as features. Streaming scenarios were developed that utilize user logs to detect insider threats. Their evaluation, conducted on the CERT dataset, indicated that LSTM outperformed other anomaly detection techniques, including Isolation Forest, SVMs, and PCA.
Yunsheng Fu et al. [16] proposed an intelligent attack detection method for social networks based on LSTM, with the purpose of achieving a high detection rate. They used the NSL-KDD dataset to evaluate the performance of their proposed method. Their experiment consisted of data preprocessing, feature abstraction, training, and detection. LSTM was used during the training stage to classify whether the traffic was an attack or normal. Experimental results demonstrated that their intelligent attack detection method achieved state-of-the-art performance and is much faster than most post-processing algorithms. They compared their result with well-known classifiers such as Bayesian, SVM, KNN, RBNN, PNN, and GRNN; the intelligent attack detection method scored 98.85%. Within that context, LSTM was employed here on another well-known intrusion detection dataset, the full KDD Cup'99, where it was also shown that LSTM outperformed other anomaly detection algorithms: Al-kasassbeh et al. [17] reported RF with an accuracy of 93.78%, and Meena et al. [18] reported accuracies for J48 and Naïve Bayes of 99.49% and 92.72%, respectively. Yunsheng and his team showed that LSTM outperformed the other well-known classifiers with a satisfactory result of 98.85%. In this research, LSTM scored 99.43%, while DRNN GRU scored 99.34% at 100 training cycles and GRU scored 99.34% at 1000 training cycles.
Kim et al. [19] showed effective results using LSTM with different values for the learning rate and hidden layer size, achieving a high detection rate and accuracy. They applied their experiment to 10% of the KDD Cup'99 training dataset and 10% of the KDD Cup'99 testing dataset. The performance of the algorithm changed depending on the tuning of its parameter values; thus, the learning rate and the number of hidden layers had a great impact on performance. The authors found that the detection rate and false alarm rate improved precisely when the learning rate was set to 0.01 and the hidden layer size to 80, at which point they registered a detection rate of 0.877 and a false alarm rate of 0.133. According to their experiments, the average detection rate was 98.8% among the total attacks, and the average false alarm rate was 10%. Based on their experiments, most attacks, like DoS, and normal instances were detected. However, U2R instances were never detected, since there were not enough instances, with only 30 examples. Staudemeyer trained LSTM on various network topologies to identify suitable LSTM network parameters and structure [20]. His results showed that the LSTM classifier managed to detect "DoS" attacks and network "probes" despite the distant time series of events between each attack. LSTM outperforms the winning entries of the KDD Cup'99 challenge, as it can look back in time and correlate consecutive connection samples. Other researchers addressed the computational complexity by modifying the LSTM implementation. Most literature used the KDD Cup'99 10% dataset to prove the efficiency of vanilla LSTM in predicting traffic anomalies and to select optimal parameters to enhance its performance [18], [19]. For this research, the full KDD Cup'99 dataset was used, and LSTM showed a higher accuracy of 99.43% at the same learning rate of 0.01. Kim et al. used 80 hidden layers [19]; in this experiment, it was found that the best detection could be achieved with 50 hidden layers and less training time.
Miao et al. [7] simplified LSTM based on their analysis of the activation function of the gates and by identifying redundancy in the LSTM structure. They proposed two simplifications: (1) deriving input gates from forget gates, and (2) removing recurrent inputs from output gates. Lyu et al. [12] also proposed another simplification of LSTM. Moreover, Gers and Schmidhuber [21] introduced "peephole" connections to improve the ability of LSTM to learn precise timings and counting of the internal states. Peephole connections allow the gates to depend not only on the previous hidden state (S_{t-1}) but also on the previous internal state (C_{t-1}), adding additional terms to the gate equations. Greff et al. showed in their analysis of LSTM variants that "peephole connections" did not resolve the problems that LSTM was tested on [8]. Based on that survey, peephole connections were avoided in this implementation by focusing on the original LSTM architecture. The simplification Miao introduced did not substantially improve the complexity of LSTM [7], which is why GRU, considered here as an LSTM with only two gates, became the focus. It showed an improvement in training time, yet did not outperform LSTM until the number of training cycles was increased.
Cho et al. [22] introduced GRU as a light version of LSTM. The architecture addresses the complexity of LSTM by eliminating the "output gate", so the cell fully writes the contents of its memory to the larger network at each time-step. Many studies have implemented GRU to evaluate its performance in intrusion detection, as it is well suited to classifying, processing, and predicting time series. One investigation by Athiwaratkun and Stokes presented a new two-stage malware classification model which utilizes a language model to generate the features, followed by a single character-level stage for malware classification. Their new malware language model utilized LSTM or GRU to construct the features [23]. BLSTM, another form of the LSTM architecture, has been used in acoustic modeling for speech recognition [24].
Ahmed E. [25] addressed one of the challenges facing intrusion detection systems: achieving a low false alarm rate on new, unseen threats. The author built models using different RNN variants to identify seen and unseen threats; Bi-Directional RNN, LSTM, and BLSTM were used to detect anomalies in sequences. He tested the models on the NSL-KDD dataset. The results show that BLSTM was superior to the other RNN models. He attributes this to the fact that RNN has the ability to learn normal behavior from large datasets and can therefore be used to detect new, unseen threats.
Ali H. et al. [26] utilized LSTM for computer network intrusion detection with a proposed autoencoder framework for both fixed- and variable-length data sequences. They used LSTM-based encoders, including GRU and BLSTM, in a comprehensive set of experiments, and developed an online sequential unsupervised approach for network intrusion detection using LSTM autoencoders. Their experiment used 5-fold cross-validation to validate the performance of the framework on the ISCX IDS 2012 dataset. The experiment evaluated different autoencoders: LSTM-Autoencoder with last pooling, LSTM-Autoencoder with max pooling, LSTM-Autoencoder with mean pooling, and Deep Auto-LSTM. They demonstrated that the LSTM-Autoencoder with max pooling achieved the best F1-score.
GRU and BLSTM were both implemented in this research, focusing on what each architecture offers in terms of performance on an intrusion detection dataset. As shown in the works of Cho, Athiwaratkun, Ahmed E., and Ali H. [22], [23], [25], [26], their focus was on proving that RNN architectures have great potential in anomaly detection. In this experiment, a more in-depth analysis is offered of what each architecture can offer with the right set of parameters. GRU quickly showed strong results that could compete with LSTM. However, BLSTM showed superiority in training time but did not score high on accuracy, in contrast to the results Ahmed E. reported [25].
Thi-Thu-Huong Le et al. [27] built an IDS classifier using LSTM with six optimizers: RMSprop, Adagrad, Adadelta, Adam, Adamax, and Nadam. They evaluated the performance of each optimizer using the KDD Cup'99 dataset on each attack type: DoS, Probe, R2L, U2R, and Normal. The main purpose of their experiment was to enhance classification performance, including accuracy and detection rate, to decrease the false alarm rate, and to improve the classification result for each attack. Their experiment consisted of two stages: the first stage determined hyperparameter values, and the second applied LSTM with the six optimizers. They concluded that the LSTM RNN model using the Nadam optimizer with a learning rate of 0.002 obtained the best results and outperformed other classifiers, with a detection rate and FAR of 98.95% and 98.98%, respectively. Optimizers play a crucial role in increasing the accuracy of the model. RMSProp, AdaDelta, and Adam are very similar algorithms, and since Adam was found to slightly outperform RMSProp, Adam is generally considered the best overall choice. For this reason, the Adam optimizer was implemented with LSTM in this research.
Bontemps et al. [28] introduced a collective anomaly detection model using LSTM on real-time network traffic. The LSTM is trained on normal time series data without anomalies, relying on the prediction errors in conjunction with detection rules to signal anomalies. They demonstrated the efficient performance of the proposed model using the KDD Cup'99 dataset. The results showed the capability of the model to detect collective anomalies; however, they noted that the training data must be organized in a coherent manner to guarantee the stability of the system.
Recent work by Benjamin et al. [29] demonstrated that LSTM RNNs can be applied to the problem of anomaly detection in computer network flow data. They utilized a public IDS dataset, "ISCX IDS", from the University of New Brunswick's Canadian Institute for Cybersecurity (CIC) and the Information Security Centre of Excellence (ISCX). Their model consists of two stacked bidirectional LSTM layers, a single dense activation layer, and a single fully connected SoftMax output layer, and can identify anomalous network traffic. Observed anomalies were the focus for training the models to predict attacks. The concepts introduced by Bontemps [28] and Benjamin [29] could be a good starting point for future comparisons across all the RNN architectures, from the point of view of having different layers within one framework and of implementing the unseen-threats concept to observe which architecture leads to better performance.
In the Skip-RNN architecture, the authors extend the existing LSTM model by learning to skip state updates, reducing the number of sequential operations and the effective size of the computational graph. Their experiment developed Skip-LSTM and Skip-GRU on the MNIST dataset (a large database of handwritten digits from the Modified National Institute of Standards and Technology, commonly used for training image processing systems). All parameters were trained using backpropagation. The results showed that Skip-RNN matched the baseline models in some cases and even outperformed them in other instances. There was an attempt to include this architecture in this research by implementing it on the KDD Cup'99 dataset at different learning rates: 0.01, 0.001, and 0.0001. Unfortunately, the algorithm did not converge despite several attempts to train the model with different values for its parameters.
In comparison to the existing literature, this research offers insight into RNN architectures. Different feature selection techniques were compared, the technique that best fit intrusion detection was selected, and it was applied to different machine learning models for intrusion detection. This resulted in identifying suitable parameter tunings that increase the accuracy of this set of algorithms, explained in detail in the results and analysis chapter. It is worth pointing out that the best accuracy was achieved by LSTM: 99.43% at 500 training cycles with 50 hidden layers. GRU accuracy was 99.34% at 1000 training cycles. Dynamic-RNN LSTM reached 99.27% and Dynamic-RNN GRU 99.34%, both at 100 training cycles.
Chapter 3
Background
This chapter presents background information for the research presented in this thesis. It introduces the concepts and technology relevant to the application of RNN algorithms in intrusion detection, including intrusion detection and machine learning. This is followed by a description of the architectures of RNN, LSTM, GRU, BLSTM, and DRNN LSTM/GRU. Two feature selection algorithms, RF and PCA, are then explained. The final sections of the chapter explain the parameter definitions and the evaluation metrics used for this research.
3.1 Intrusion Detection and Machine Learning
Any misuse of a network that compromises its stability or the safety of its data is called a network intrusion attack [31]. Intrusion Detection (ID) is an essential component of network security [32] and has been defined as a technique that monitors system and network actions to identify abnormal and malicious activities, including attack attempts in the network [33]. ID aims to identify and scan network activity and detect such intrusion attacks.
ID's biggest challenge is identifying those attacks efficiently, accurately, and in a timely manner. Traditional systems were designed to find well-known attacks but cannot determine unknown threats. Machine learning is the science of getting computers to act without being explicitly programmed. From an intrusion detection perspective, machine learning, data mining, and pattern recognition algorithms can be applied to distinguish between normal and malicious traffic. A great deal of consideration has been given to machine learning techniques, which play an immense part in many IDSs, yet research shows that a machine learning technique that detects one attack well may not perform identically in detecting another. The analysis of different machine learning algorithms has evolved over time [34]; the main goal of these techniques is to distinguish intrusions for better preparation against future attacks. The simulated attacks fall into one of the following four categories:
• Denial of Service Attack (DoS): an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
• User to Root Attack (U2R): a class of exploit in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
• Remote to Local Attack (R2L): occurs when an attacker who has the ability to send packets to a machine over a network, but does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine.
• Probing Attack (Probe): an attempt to gather information about a network of computers, for example by scanning ports, for the apparent purpose of circumventing its security controls.
3.2 Recurrent Neural Network (RNN)
In this research, RNN was selected due to its powerful ability to learn from previous data and then adapt its response to predict. Furthermore, it combines two key features: (1) a distributed hidden state that allows it to store a lot of information about the past efficiently, and (2) non-linear dynamics that allow it to update the hidden state in complicated ways. RNN, an extension of a conventional feed-forward neural network, is designed to recognize patterns in a data sequence. RNNs are called recurrent because they perform the same task for every item of a sequence, with the output depending on the previous computations [35]. The sequential information is preserved in the recurrent network's hidden state, which manages to span many time-steps as it cascades forward to affect the processing of each new input. It finds correlations between events separated by many moments; these correlations are called "long-term dependencies" because an event downstream in time depends upon, and is a function of, one or more events that came before. One way to think about RNNs is to view them as a way to share weights over time, as illustrated in Figure 3.1.
To calculate the RNN hidden state and output, equations (3.1) and (3.2) are used:

h_t = \sigma(W h_{t-1} + U x_t + b_t)    (3.1)

o_t = \mathrm{softmax}(W_s h_t)    (3.2)
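To make the recurrence concrete, the following is a minimal NumPy sketch of one forward step implementing equations (3.1) and (3.2). The shapes and the choice of tanh for the non-linearity σ are illustrative assumptions, not code from this thesis.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, W_s):
    """One RNN forward step per equations (3.1) and (3.2).

    Assumed shapes: x_t (n_in,), h_prev (n_hid,), W (n_hid, n_hid),
    U (n_hid, n_in), b (n_hid,), W_s (n_out, n_hid).
    """
    h_t = np.tanh(W @ h_prev + U @ x_t + b)  # hidden state, eq. (3.1), tanh as sigma
    logits = W_s @ h_t
    e = np.exp(logits - logits.max())        # numerically stable softmax
    o_t = e / e.sum()                        # output distribution, eq. (3.2)
    return h_t, o_t
```

The same weights W, U, and b are reused at every time-step, which is the weight sharing over time described above.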
Unfortunately, a plain RNN implementation would not give a satisfactory result, because the RNN model has a major drawback called the vanishing gradient problem. The result will not be accurate since, at each time-step during training, the same weights are used; as the gradient is propagated back through many time-steps, this repeated multiplication can shrink it toward zero, preventing long-range dependencies from being learned.
3.3 Long Short-term Memory (LSTM)
LSTM is a variation of the recurrent network proposed as one of the machine learning techniques to solve many sequential data problems. LSTM helps preserve the error that is back-propagated through time and layers. The LSTM cell was introduced precisely to reduce the gradient multiplication problem, as well as to make the RNN more useful for long-term memory tasks. The LSTM architecture, as illustrated in Figure 3.2, consists of four main components: the input gate (i), the forget gate (f), the output gate (o), and the memory cell (c). The cell makes decisions about what to store, read, and write via gates that open or close, and each memory cell corresponds to a time-step. These gates pass information based on a set of weights, some of which, like those for the input and hidden states, are adjusted during the learning process. Equations governing the operation of the LSTM architecture include:
h_t = o_t \odot \tanh(c_t)    (3.7)

o_t = f(W_o h_t + b_o)    (3.8)
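Only equations (3.7) and (3.8) are reproduced above. For completeness, the sketch below implements a full LSTM cell step under the standard four-gate formulation (input, forget, output, and candidate); the parameter names and the concatenated-input convention are assumptions for illustration, not the thesis code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds weights W_i/W_f/W_o/W_g and biases b_i/b_f/b_o/b_g
    (hypothetical names), each acting on the concatenated [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(p['W_i'] @ z + p['b_i'])   # input gate
    f_t = sigmoid(p['W_f'] @ z + p['b_f'])   # forget gate
    o_t = sigmoid(p['W_o'] @ z + p['b_o'])   # output gate, cf. eq. (3.8)
    g_t = np.tanh(p['W_g'] @ z + p['b_g'])   # candidate memory content
    c_t = f_t * c_prev + i_t * g_t           # memory cell update
    h_t = o_t * np.tanh(c_t)                 # hidden state, eq. (3.7)
    return h_t, c_t
```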
3.4 Gated Recurrent Unit (GRU)
GRU is a variant of LSTM introduced by K. Cho et al. [22], [13]. GRU is basically an LSTM without an output gate; it therefore fully writes the contents of its memory cell to the larger network at each time-step. Its internal structure is simpler, and it is therefore considered faster to train, as fewer computations are needed to update its hidden state. GRU has two gates: a reset gate r and an update gate z. Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep. The gates for a GRU cell are illustrated in Figure 3.3, where R_t is the reset gate, Z_t is the update gate, h_t is the activation, and \tilde{h}_t is the candidate activation; \odot denotes element-wise multiplication, \sigma is the logistic sigmoid function, and W and U are learned weight matrices.
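A minimal NumPy sketch of one GRU step under the standard Cho et al. formulation follows; the parameter names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step with reset gate r and update gate z (Cho et al. form)."""
    r_t = sigmoid(p['W_r'] @ x_t + p['U_r'] @ h_prev)             # reset gate
    z_t = sigmoid(p['W_z'] @ x_t + p['U_z'] @ h_prev)             # update gate
    h_cand = np.tanh(p['W_h'] @ x_t + p['U_h'] @ (r_t * h_prev))  # candidate activation
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                     # interpolate old and new
    return h_t
```

With only two gates and no separate memory cell, this step does fewer matrix multiplications than the LSTM step above, which is the source of GRU's training-time advantage.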
3.5 Bi-Directional LSTM (BLSTM)
The Bi-Directional RNN was introduced to overcome a limitation of the RNN [36]. This architecture can be trained using all available input information in the past and future of a specific time frame, as illustrated in Figure 3.4. In other words, two RNNs are stacked together: the input sequence is fed in normal time order to one network, as in equation (3.12), and in reverse time order to the other, as in equation (3.13). The outputs of the two networks are usually concatenated at each time-step.
\overrightarrow{h}_t = f(\overrightarrow{W} x_t + \overrightarrow{V}\,\overrightarrow{h}_{t-1} + \overrightarrow{b})    (3.12)

\overleftarrow{h}_t = f(\overleftarrow{W} x_t + \overleftarrow{V}\,\overleftarrow{h}_{t+1} + \overleftarrow{b})    (3.13)

y_t = g(U[\overrightarrow{h}_t ; \overleftarrow{h}_t] + c)    (3.14)
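The sketch below shows how equations (3.12)-(3.14) combine a forward pass and a backward pass; the single-direction step functions and all names are placeholders, assuming recurrences such as the LSTM step sketched earlier.

```python
import numpy as np

def bidirectional_outputs(xs, fwd_step, bwd_step, U, c, h0_f, h0_b):
    """Combine forward/backward recurrences per equations (3.12)-(3.14).

    fwd_step(x, h) and bwd_step(x, h) each return the next hidden state;
    U and c are the shared output projection (illustrative names).
    """
    T = len(xs)
    h_f, h_b = [None] * T, [None] * T
    h = h0_f
    for t in range(T):                  # normal time order, eq. (3.12)
        h = fwd_step(xs[t], h)
        h_f[t] = h
    h = h0_b
    for t in reversed(range(T)):        # reverse time order, eq. (3.13)
        h = bwd_step(xs[t], h)
        h_b[t] = h
    # concatenate the two states at each time-step and project, eq. (3.14)
    return [U @ np.concatenate([h_f[t], h_b[t]]) + c for t in range(T)]
```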
3.6 Dynamic-RNN LSTM/GRU
The Dynamic Recurrent Neural Network (DRNN) has feedback connections that dynamically perform the unrolling of the LSTM cell over the time-steps. DRNNs allow for variable sequence lengths: the state is an internal detail passed from one time-step to the next. A DRNN can handle substantially longer sequences, since it handles a different sequence length per batch, and offers faster graph-building times, since it uses an internal loop. For a single-layer LSTM model of 2048 units and batch size 256, DRNN LSTM can handle a sequence of length 256, while the normal LSTM runs out of memory at 128 [37].
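As an illustration of this mechanism, the sketch below uses the TensorFlow 1.x dynamic_rnn API with a per-example sequence-length vector; the shapes are assumptions and this is not the exact model code of the thesis.

```python
import tensorflow as tf  # TensorFlow 1.x API, as used in this research

# Illustrative shapes: a batch of variable-length sequences padded to max_len.
max_len, n_features, n_hidden = 50, 12, 50

inputs = tf.placeholder(tf.float32, [None, max_len, n_features])
seq_len = tf.placeholder(tf.int32, [None])   # true length of each sequence

cell = tf.nn.rnn_cell.LSTMCell(n_hidden)
# dynamic_rnn unrolls the cell with an internal loop at run time and stops
# each row of the batch at its own sequence_length, skipping padded steps.
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len,
                                   dtype=tf.float32)
```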
3.7 Random Forest (RF)
Random Forest (RF) is a tree-based algorithm used to obtain estimates of feature importance [38]. The RF algorithm builds a large number of simple classifiers using randomly chosen features, from which a subset of the most important features can be created [39]. Properties of RF, as listed by Jaiswal et al. [40], include:
• It is very efficient on huge datasets, even with hundreds or thousands of input variables, without overfitting, and there is no requirement for data pruning.
• It performs very efficiently for feature subset selection and missing data imputation.
• The generated forest performs well on future data.
The ability of the RF classifier to rank the importance of the feature set with respect to the target variable was employed here. Only the variables with the highest importance levels were selected; those with low importance values, being irrelevant to the learning model, were discarded, since they would negatively impact accuracy.
3.9 Parameters
Selecting suitable parameters for RNN architectures can often make the difference in performance, as they have a significant impact on accuracy [8]. However, little is published regarding which parameters and design choices should be evaluated or selected, leaving the correct parameters for obtaining the best performance to experience or trial and error. The parameters used in designing the RNN architectures are the learning rate, the number of hidden layers, the number of neurons in the hidden layers (hidden units), and the number of time-steps.
3.9.1 Learning Rate
The learning rate is a parameter that controls how much adjustment is made to the weights of a network with respect to the loss gradient. A far too large learning rate or dropout rate will prevent the model from learning effectively. The learning rate typically must be in the range of 0.01-0.1 to yield satisfactory results [43].
3.9.2 Hidden Layers
The initial design of any RNN model comprises a single hidden layer followed by a standard feedforward output layer [8]. Usually, the number of hidden layers to be used is based on the size and dimensionality of the dataset.
3.9.3 Hidden Units
Hidden units are the neurons in the RNN's hidden layer. With a higher number of hidden units, the network becomes more powerful; however, the number of parameters to learn also rises, which requires more time to train the prediction model.
3.9.4 Time-Steps
Time-steps determine how many steps back in time backpropagation uses when calculating gradients for weight updates during training. The number of time-steps affects learning: high time-step counts (over 100) typically mean slower convergence, while low counts (in the range of 8-32) mean faster convergence. For intrusion detection, time-steps play a crucial role, since the number of time-steps used for backpropagation affects the ability to identify the correct patterns.
3.10 Evaluation Metrics
For evaluation purposes, the Precision (P), Recall (R), F-measure (F), and Accuracy (ACC) metrics are used. These metrics are calculated using four different measures: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Accuracy (AC): the percentage of true detections over total traffic records,

AC = \frac{TP + TN}{TP + TN + FP + FN}    (3.15)

Precision (P): the percentage of predicted anomalous instances that are actual anomalous instances,

P = \frac{TP}{TP + FP}    (3.16)

Recall (R): the percentage of predicted anomalous instances versus all the anomalous instances present,

R = \frac{TP}{TP + FN}    (3.17)
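These metrics translate directly into code. The helper below is a minimal sketch; the false alarm rate is computed under its standard definition FP/(FP+TN), which is not given explicitly above and is stated here as an assumption.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the metrics of equations (3.15)-(3.17), plus false alarm rate."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)   # eq. (3.15)
    precision = tp / (tp + fp)                    # eq. (3.16)
    recall    = tp / (tp + fn)                    # eq. (3.17)
    far       = fp / (fp + tn)                    # false alarm rate (standard definition)
    return accuracy, precision, recall, far
```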
Chapter 4
Experimental Results
In this research, the current TensorFlow framework [44] was used to implement a model for each architecture. The experiments were performed on a desktop machine with an Intel Core i7-4820K CPU @ 3.70GHz and 23.5 GB of memory. Models were developed for the architectures vanilla LSTM, GRU, BLSTM, Dynamic-RNN LSTM/GRU, and Skip-RNN. The experiments were designed to evaluate the performance of each model on the full KDD Cup'99 dataset in terms of accuracy and the training time required for each model.

The experiment was executed using the developed prediction models (LSTM, GRU, BLSTM, and Dynamic-RNN LSTM/GRU) on the KDD Cup'99 dataset. First, the dataset was preprocessed by scaling the features and converting non-numerical features to numerical values. Second, feature selection was implemented using two different algorithms, RF and PCA, to evaluate the better technique for the KDD Cup'99 dataset; two models, LSTM and GRU, were evaluated to determine which feature selection algorithm to carry forward for the rest of the experiment. Third, the dataset was split into two sets: 80% for training and 20% for testing. The prediction models were run for both training and testing about 10 times, recording the best values of all readings. Finally, the accuracy of all
prediction models and the time required to train them were logged. All the metrics were calculated, including Recall (detection rate), False Alarm Rate (FAR), efficiency, and Precision; these metrics are given in equations (3.15)-(3.17). The experiment process is illustrated in Figure 1.1.
4.1 Dataset Description
The study sample was taken from the Third International Knowledge Discovery and Data Mining Tools Competition (KDD Cup) 1999 [9]. The dataset is a collection of simulated raw TCP dump data gathered over a period of nine weeks on a local area network. The three versions of the KDD Cup'99 IDS dataset are the full KDD dataset, the corrected KDD dataset, and the 10% KDD dataset, as shown in Table 4.1. Among these three, the 10% KDD dataset is the most used in the literature. However, the experiments in this research were conducted on the full dataset (18 MB compressed; 743 MB uncompressed) to obtain a more realistic view. The training data was collected over seven weeks and the testing data over two weeks. The entire dataset contains 39 attack types, which are categorized into four classes: Probe, Denial of Service (DoS), User to Root (U2R), and Remote to Local (R2L). There are 41 features, in addition to one class label for every record: "normal" or the attack type. A complete listing of the 41 features defined for the KDD Cup'99 dataset is given in Appendix A.
4.2 Data Preprocessing
The purpose of data preprocessing is mainly to transform the raw input data into the proper format for the training model. The steps involved in data preprocessing, sketched in the snippet below, are:
• Converting categorical data to numerical data with one-hot vectors to fit the training model, then selecting float64 as the datatype.
• Scaling and normalizing the dataset so that each feature's lowest value is 0 and highest value is 1.
• Splitting the dataset into a training dataset and a testing dataset with a ratio of 8:2.
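A minimal pandas/scikit-learn sketch of these three steps follows. The file name, column indices, and label convention are assumptions based on the standard KDD Cup'99 layout (columns 1-3 are protocol_type, service, and flag; the last column is the label, e.g. 'normal.'), not the thesis code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('kddcup.data', header=None)            # assumed file name
y = (df.iloc[:, -1] != 'normal.').astype('float64')     # 1 = attack, 0 = normal
X = pd.get_dummies(df.iloc[:, :-1], columns=[1, 2, 3])  # one-hot encode categoricals
X = X.astype('float64')                                  # select float64 datatype
X = (X - X.min()) / (X.max() - X.min() + 1e-12)          # scale each feature to [0, 1]

# 8:2 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=42)
```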
4.3 Feature Selection
The feature selection mechanism helps to identify and remove non-essential, irrelevant, and redundant variables that have little effect on accuracy. In this context, feature selection addresses the best representation of the data for learning a solution to the underlying problem. If it is not done well, it can negatively impact the accuracy of the prediction model.
4.3.1 RandomForest (RF)
The RandomForest (RF) classifier was applied for feature selection based on a previous study that demonstrates the efficiency of the RandomForest algorithm on KDD Cup'99. RF is a tree-based algorithm used to obtain estimates of feature importance [38]. The algorithm was run on the dataset, ranking the 41 features by importance, as presented in Figure 4.1. Thereafter, the top 12 features were selected for the experiment, as illustrated in Table 4.2.
Figure 4.1: Feature Selection Based on RF. The y-axis shows feature importances and the x-axis shows feature IDs.
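A minimal scikit-learn sketch of this ranking-and-selection step follows, assuming the X_tr/y_tr variables from the preprocessing sketch in Section 4.2. Note that it ranks the one-hot-encoded columns for simplicity, whereas the thesis ranks the 41 original features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit an RF and rank features by importance (n_estimators is an assumption).
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
top12 = ranking[:12]                                 # keep the top 12 features
X_tr_sel = X_tr.iloc[:, top12]
X_te_sel = X_te.iloc[:, top12]
```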
4.3.2 Principal Component Analysis (PCA)
First, the mean of each variable was calculated and subtracted from every value of that variable. Then the correlation matrix was calculated, followed by the eigenvalues and eigenvectors of that matrix; the eigenvalues were arranged in descending order, and the top 12 were chosen as the selected features, as illustrated in Table 4.3 and Figure 4.2. Their corresponding eigenvectors were used as a characteristic vector matrix. Finally, the data was projected onto the selected eigenvectors, yielding the "processed" dataset. Given a dataset with M samples and N variables, the original (mean-normalized) data is M \times N and the correlation matrix is N \times N. The top K eigenvalues are chosen for the selected features, so their eigenvectors form a characteristic vector matrix of size N \times K. The relationship between them is illustrated in equation (4.1):

FinalData(M \times K) = O(M \times N) \cdot E(N \times K)    (4.1)

where O is the original data, E is the eigenvector matrix, M is the number of samples, N is the number of variables, and K is the number of top eigenvalues selected.
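The steps above translate into a short NumPy sketch of equation (4.1); the variable names mirror the equation, and k = 12 matches the experiment. This is an illustrative implementation, not the thesis code.

```python
import numpy as np

def pca_top_k(O, k=12):
    """Project mean-normalized data onto the eigenvectors of the correlation
    matrix with the k largest eigenvalues, per equation (4.1)."""
    O = np.asarray(O, dtype=float)
    O = O - O.mean(axis=0)                 # subtract each variable's mean
    C = np.corrcoef(O, rowvar=False)       # N x N correlation matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1][:k]  # top-k eigenvalues, descending
    E = eigvecs[:, order]                  # N x K characteristic vector matrix
    return O @ E                           # FinalData(M x K) = O(M x N) * E(N x K)
```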
As presented in this section, the features were ultimately selected based on the RF algorithm, which demonstrated a better representation of the dataset and a higher accuracy in predicting whether samples were anomalous or not. The top 12 features were selected, since adding extra features did not impact the accuracy. It is worth mentioning that the two algorithms had five features in common among their top 12, listed by ID as follows: 22, 25, 27, 29, and 37. The result of each model using RF is presented in detail in the Results, Analysis, and Discussion chapter.
Chapter 5
Results, Analysis, and Discussion
5.1 Introduction
The experiment spanned two phases. The first phase illustrates the difference between the two feature selection algorithms, RF and PCA, for LSTM and GRU; the impact of each algorithm was identified based on prediction accuracy. The second phase carried out the experiment using the RF algorithm as the feature selection method for the rest of the RNN architectures, since the outcome of the first phase shows that RF captures more significant features than PCA.

The first phase begins by training LSTM and GRU with specific parameters and the RF feature selection algorithm. The parameters were set as: learning rate 0.01, 10 hidden layers, and 5 time-steps for backpropagation (working backwards through the network), with different training cycle values. The same experiment setup was repeated using the
PCA algorithm for feature selection. The experiment was executed around 10 times for each algorithm with different parameter combinations. LSTM and GRU were trained at different training cycle counts, 50, 100, and 300, as illustrated in Table 5.1. The goal was to obtain the best case of prediction accuracy and training time. Based on the experiment, the observations are as follows:
• The best accuracy of LSTM and GRU at each training cycle count was achieved with RF. At 50 cycles, LSTM and GRU scored accuracies of 97.86% and 96.59%, respectively, whereas the two models scored 64.34% and 67.97% with PCA.
• At 300 training cycles, LSTM and GRU scored accuracies of 97.54% and 97.57% with PCA. However, this still did not outperform the accuracies recorded with RF, which were 99.00% and 98.89%.
• Based on these results, the rest of the experiment used RF for feature selection in training LSTM, GRU, BLSTM, and DRNN LSTM/GRU.
5.3 Phase II: RNN Architectures for IDS
In phase two, the RNN architectures LSTM, GRU, BLSTM, and DRNN LSTM/GRU were trained to detect anomalies. The implemented models were developed using Python and
TensorFlow [44] to show the capability of each model to learn the definitions of normal and anomalous behavior from labeled datasets. Each model was evaluated on the KDD Cup'99 dataset as previously described. The first model was a vanilla LSTM, which was trained and evaluated, as were all the RNN models selected for this experiment. For each model, the best default parameter setup was identified as a starting point, and each model was then tuned to achieve the best performance and the highest accuracy. Several runs were conducted with different values, as shown in Table 5.2, for the learning rate, training cycles, time-steps, and hidden layers. The same process was followed for all models, including adding certain parameters such as batch size for BLSTM, since that architecture requires passing the data in batches; the batch size limits the number of samples shown to the network before a weight update can be performed. Due to the limitation of available memory, the desired parameter values could not be achieved for hidden layers and time-steps.
First, a vanilla LSTM model was developed, and the suitable learning rate for this model was determined through experiment. The learning rate is one of the most important parameters to tune, due to its impact on fast and effective training; it is also important not to overfit the training model. The experiment was run with three different learning rates: 0.0001, 0.001, and 0.01. After running the experiment several times at 100 training cycles, the results show that a learning rate of 0.01 gives the best loss value, which then decreases during training to allow more weight updates. A learning rate of 0.0001 did not allow the model to converge, as illustrated in Figure 5.2.
The LSTM model was trained at three different cycle counts: 100, 500, and 1000. The accuracy, precision, recall, FAR, and time required to train the model were calculated for each count. As shown in Table 5.3, LSTM reaches its highest accuracy of 99.43% at 500 training cycles. It was noted that adding more training cycles did not increase prediction accuracy, as shown in Figure 5.3. With regard to training time, the LSTM 500-cycle run took 10648.93 seconds, as per Figure 5.4.
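For orientation, the following is a minimal TensorFlow 1.x sketch of a vanilla LSTM classifier graph under these hyperparameters (learning rate 0.01, 50 hidden units, 5 time-steps, Adam optimizer); the layer choices, placeholder shapes, and names are assumptions, not the thesis implementation.

```python
import tensorflow as tf  # TensorFlow 1.x, as used in this research

n_steps, n_features, n_hidden, n_classes = 5, 12, 50, 2

x = tf.placeholder(tf.float32, [None, n_steps, n_features])
y = tf.placeholder(tf.float32, [None, n_classes])

cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden)
inputs = tf.unstack(x, axis=1)                      # static unrolling over time-steps
outputs, _ = tf.nn.static_rnn(cell, inputs, dtype=tf.float32)
logits = tf.layers.dense(outputs[-1], n_classes)    # classify from the last step

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)
```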
The same steps as for the LSTM model were followed to train the GRU model, with the learning rate once more at 0.01, as shown in Figure 5.5. The GRU model was trained at three different training cycle counts, 100, 500, and 1000, calculating the accuracy, precision, recall, FAR, and the time required to train the model at each count. As shown in Table 5.4, the GRU model reaches its highest accuracy of 99.34% at 1000 cycles, as shown in Figure 5.6. GRU required more learning cycles to outperform LSTM. However, its training time was less than LSTM's, owing to the fact that the GRU architecture consists of only two gates; GRU has fewer parameters, and took 16654.21 seconds, as shown in Figure 5.4.
The BLSTM results show that the model did not perform as well as the rest of the models; its accuracy at each training cycle count was below 90%, as illustrated in Table 5.5. The one significant strength of the model is its training time, which shows a marked improvement: for instance, at 500 training cycles the model required only 190 seconds.
The DRNN architecture had two implementations: LSTM and GRU. The DRNN architecture uses variable-length sequences, computing over the actual length of each input sequence, whereas the other architectures use a fixed input sequence length. It relies on a padding technique, with a vector holding the sequence lengths that is passed to the DRNN model. The experiment followed the same steps, obtaining the best learning rate for DRNN, which is 0.01, as shown in Figure 5.8 for LSTM and Figure 5.9 for GRU. The accuracy of each model at each learning rate is presented in Table 5.6. Each model, LSTM and GRU, was trained at the same training cycle counts (100, 500, and 1000), with the model parameters set to 50 hidden layers and 5 time-steps. The results showed the best accuracy of 99.27% for DRNN LSTM at 100 training cycles, where DRNN GRU outperformed DRNN LSTM, scoring 99.34%. However, DRNN GRU dropped significantly at 1000 training cycles, scoring 90.24%, whereas DRNN LSTM maintained a high accuracy of 99.23% at the same count. DRNN LSTM showed its best performance at 500 training cycles, whereas DRNN GRU scored its highest accuracy at 100 training cycles, as shown in Figure 5.10. With
regard to the time taken to train the models, DRNN LSTM required 6154.80 seconds and DRNN GRU required 2072.89 seconds. The accuracy, time, and other metrics of both models are given in Table 5.7 and Table 5.8.
In this experiment, the dataset with intrusion attacks containing 4,898,431 samples with 262,178 attacks was used. A similar series of experiments was run on all selected architectures, with slight changes in the parameters due to some architecture requirements. Each algorithm was trained at different training cycle counts, as in Table 5.9, and the observations are as follows:
• The learning rate is the single most important parameter; for KDD Cup'99, the best learning rate with the selected machine learning models is 0.01.
• DRNNs show the best performance in terms of accuracy with fewer training cycles: DRNN LSTM accuracy is 99.27% where vanilla LSTM accuracy is 98.85%, and DRNN GRU accuracy is 99.34% whereas GRU accuracy is 98.68%. This is because DRNNs allow for variable sequence lengths, letting the RNN run for the correct number of time-steps on sequences shorter than the maximum sequence length.
• DRNN's best accuracy rate is at 100 training cycles; adding more training cycles could reduce the prediction accuracy.
• The highest accuracy recorded for the standard vanilla LSTM is 99.43% at 500 training cycles, taking 10648.93 seconds. The LSTM architecture is suitable for large-scale implementation.
• Meanwhile, GRU outperformed LSTM at 1000 training cycles. GRU required more training cycles to increase its accuracy because it uses only two gates, the update gate and the reset gate; it lacks the output gate, and so fully writes the contents of its memory cell to the larger network at each time-step.
• The highest accuracy at 100 training cycles was DRNN GRU, which scored 99.34%, followed by DRNN LSTM at 99.27%. The highest accuracy at 500 training cycles was LSTM, which scored 99.43%. Finally, at 1000 training cycles, GRU scored the highest accuracy, at 99.25%.
• GRU scored the best training time at two training cycle counts: 500 and 1000. All training time scores are presented in Table 5.10 and Figure 5.11.
• Due to the limitation of available memory, the desired parameter values could not be achieved for hidden layers and time-steps (backpropagation). This restriction did not allow the algorithms to be tested as intended, leading to a trade-off between hidden layers and time-steps. The best set values were 50 for hidden layers and 5 for time-steps.
• Because the GRU model has fewer parameters than LSTM, it proved to train faster, requiring only 31.7 minutes while LSTM required 39.25 minutes.
• BLSTM required less time for training, but its accuracy rate was lower than the rest of the RNN models. More investigation is required into how to enhance the algorithm's accuracy.
• Skip-LSTM could not be trained on KDD Cup'99 intrusion detection, even after several attempts with different parameter settings for the algorithm. More investigation is required into fitting the data to the model so that the algorithm can predict attacks.
• In comparison with LSTM approaches presented in other literature, the proposed optimized LSTM in this research scored the highest accuracy, 99.43%, whereas the highest score among the other approaches was 98.95%, as illustrated in Table 5.11 and Figure 5.12. This corresponds to 19,593 more attacks out of 3,925,650 being correctly detected.
Table 5.11: Comparison of the optimized LSTM model’s accuracy with LSTM models proposed in the literature

Approach          Accuracy
Optimized LSTM    99.34%
Approach 1 [16]   98.85%
Approach 2 [17]   93.78%
Approach 3 [18]   98.85%
Approach 4 [19]   98.80%
Approach 5 [27]   98.95%

Figure 5.12: Comparison of the optimized LSTM model’s accuracy with LSTM models proposed in the literature
Chapter 6
Conclusion and Future Work
The novelty of this research stems from the fact that it is the first experiment to implement and compare RNN architectures, particularly vanilla LSTM, GRU, BLSTM, and DRNN LSTM/GRU, on an intrusion detection dataset, offering more insight into each architecture. Most literature in the domain demonstrates the concept of using LSTM, or one of its variants, as an RNN architecture to improve the accuracy of attack prediction; however, those studies focus on one architecture for one application, comparing it with other machine learning techniques such as SVM, RF, and J48. This research took the path further by studying the architecture of each RNN algorithm and then applying it to an intrusion detection dataset.
It evaluates the performance of each architecture in terms of prediction accuracy and the time required for training. Moreover, this experiment is unique in that it runs these architectures on the full KDD Cup’99 dataset, which contains 4,898,431 samples, rather than the commonly used 10% subset. Feature selection was performed using two different mechanisms suitable for intrusion detection, RF and PCA, which offered a clean dataset carrying all the important features. Feature selection reduced the number of dataset features to improve accuracy, recall, training time, and false alarm rate; a brief sketch of this step follows this paragraph. As part of this research, the experiment was limited to tuning a set of parameters, such as the learning rate, hidden layers, and training cycle, with the goal of finding the optimal values to improve the model’s prediction accuracy and the amount of time required for training.
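To illustrate the Random Forest feature-selection step, the following is a minimal sketch assuming scikit-learn; the estimator settings, the cut-off top_k, and the function name select_features are illustrative placeholders, not the exact configuration used in this thesis.

    # Minimal Random Forest feature-selection sketch (assumptions noted above).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def select_features(X, y, feature_names, top_k=12):
        """Rank features by RF importance and keep the top_k.

        X: (n_samples, n_features) array of KDD Cup'99 records,
        y: attack/normal labels; top_k is an illustrative cut-off.
        """
        rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
        rf.fit(X, y)
        # Indices of the most important features, highest first.
        order = np.argsort(rf.feature_importances_)[::-1][:top_k]
        kept = [feature_names[i] for i in order]
        return X[:, order], kept

A PCA-based alternative would instead project the records onto the leading principal components (sklearn.decomposition.PCA); as reported above, the RF ranking retained more discriminative information on this dataset.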
The results of this evaluation revealed that vanilla LSTM still stands up and outperforms other architectures that are supposed enhancements of it. DRNNs showed the potential to offer better accuracy, but only at high training cycles, resulting in a tendency to take longer to train. The GRU architecture showed equivalent performance with respect to parameters such as hidden layers and time-steps; GRU has fewer parameters, resulting in a faster-trained model compared to LSTM (a back-of-the-envelope parameter count follows this paragraph). In a large-scale implementation, however, LSTM may yield better results. BLSTM offered an impressive training time, yet its accuracy did not exceed 84.99%.
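As a rough illustration of the parameter gap, consider the standard GRU gate equations and a per-layer parameter count, assuming a single layer with $n = 50$ hidden units and $m = 41$ input features (biases included, peephole connections ignored); this is a sketch, not a measurement from the experiments:

\begin{align*}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align*}

Each of the three weight blocks contributes $n(n+m) + n = 50(50+41) + 50 = 4600$ parameters, giving $3 \times 4600 = 13{,}800$ for GRU, whereas LSTM's four blocks (input, forget, and output gates plus the cell candidate) give $4 \times 4600 = 18{,}400$. The resulting 25% parameter reduction is broadly consistent with the observed training-time gap (31.7 versus 39.25 minutes).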
For future work, the aim is to evaluate further architectures on the intrusion detection dataset, and to investigate deep learning with multiple layers and hybrid layers of different architectures in one framework, as well as to deploy these techniques in IoT applications to develop robust security solutions. This is made possible by machine learning and deep learning, as IoT generates an enormous amount of heterogeneous data.
Bibliography
[1] W. Ouyang and X. Wang, “Joint deep learning for pedestrian detection,” in 2013 IEEE
International Conference on Computer Vision, Dec 2013, pp. 2056–2063.
[2] S. S. Roy, A. Mallik, R. Gulati, M. S. Obaidat, and P. Krishna, “A deep learning based
artificial neural network approach for intrusion detection,” in International Conference
on Mathematics and Computing. Springer, 2017, pp. 44–53.
[4] T. Hughes and K. Mierle, “Recurrent neural networks for voice activity detection,” in
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May
2013, pp. 7378–7382.
[5] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neu-
ral networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, May 2013, pp. 6645–6649.
[6] A. Graves, “Generating sequences with recurrent neural networks,” CoRR, vol.
abs/1308.0850, 2013. [Online]. Available: http://arxiv.org/abs/1308.0850
[7] Y. Miao, J. Li, Y. Wang, S. Zhang, and Y. Gong, “Simplifying long short-term memory
acoustic models for fast training and decoding,” in 2016 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 2284–2288.
[10] A. Ferriyan, A. H. Thamrin, K. Takeda, and J. Murai, “Feature selection using genetic
algorithm to improve classification in network intrusion detection system,” in 2017 In-
ternational Electronics Symposium on Knowledge Creation and Intelligent Computing
(IES-KCIC), Sept 2017, pp. 46–49.
[11] J. E. Varghese and B. Muniyal, “An investigation of classification algorithms for intrusion
detection system — a quantitative approach,” in 2017 International Conference on Ad-
vances in Computing, Communications and Informatics (ICACCI), Sept 2017, pp. 2045–
2051.
[12] Q. Lyu and J. Zhu, “Revisit long short-term memory: An optimization perspective,” in Advances in Neural Information Processing Systems Workshop on Deep Learning and Representation Learning, 2014, pp. 1–9.
[14] P. Shah, V. Bakarola, and S. Pati, “Image captioning using deep neural architectures,”
CoRR, vol. abs/1801.05568, 2018. [Online]. Available: http://arxiv.org/abs/1801.05568
[15] A. Tuor, S. Kaplan, B. Hutchinson, N. Nichols, and S. Robinson, “Deep learning for
unsupervised insider threat detection in structured cybersecurity data streams,” CoRR,
vol. abs/1710.00811, 2017. [Online]. Available: http://arxiv.org/abs/1710.00811
[16] Y. Fu, F. Lou, F. Meng, Z. Tian, H. Zhang, and F. Jiang, “An intelligent network attack detection method based on RNN,” in 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), June 2018, pp. 483–489.
[18] G. Meena and R. R. Choudhary, “A review paper on IDS classification using KDD 99 and NSL-KDD dataset in WEKA,” in 2017 International Conference on Computer, Communications and Electronics (Comptelix), July 2017, pp. 553–558.
[19] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long short term memory recurrent neural
network classifier for intrusion detection,” in 2016 International Conference on Platform
Technology and Service (PlatCon), Feb 2016, pp. 1–5.
[21] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” Journal of Machine Learning Research, vol. 3, no. Aug, pp. 115–143, 2002.
[22] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent
neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014. [Online].
Available: http://arxiv.org/abs/1412.3555
[23] B. Athiwaratkun and J. W. Stokes, “Malware classification with LSTM and GRU language models and a character-level CNN,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 2482–2486.
[25] A. Elsherif, “Automatic intrusion detection system using deep recurrent neural network
paradigm,” Journal of Information Security and Cybercrimes Research (JISCR), vol. 1,
no. 1, 2018.
[26] A. H. Mirza and S. Cosan, “Computer network intrusion detection using sequential LSTM neural networks autoencoders,” in 2018 26th Signal Processing and Communications Applications Conference (SIU), May 2018, pp. 1–4.
[27] T. Le, J. Kim, and H. Kim, “An effective intrusion detection classifier using long short-
term memory with gradient descent optimization,” in 2017 International Conference on
Platform Technology and Service (PlatCon), Feb 2017, pp. 1–6.
[30] V. Campos, B. Jou, X. Giró i Nieto, J. Torres, and S. Chang, “Skip RNN: learning to skip
state updates in recurrent neural networks,” CoRR, vol. abs/1708.06834, 2017. [Online].
Available: http://arxiv.org/abs/1708.06834
[32] Y. Li, R. Ma, and R. Jiao, “A hybrid malicious code detection method based on deep
learning,” International Journal of Security and its Applications, vol. 9, no. 5, pp. 205–
216, 2015.
[35] T. Tang, S. A. R. Zaidi, D. McLernon, L. Mhamdi, and M. Ghogho, “Deep recurrent neural network for intrusion detection in SDN-based networks,” in 2018 IEEE International Conference on Network Softwarization (NetSoft 2018). IEEE, 2018.
[36] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transac-
tions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov 1997.
[38] M. A. M. Hasan, M. Nasser, S. Ahmad, and K. I. Molla, “Feature selection for intrusion detection using random forest,” Journal of Information Security, vol. 7, no. 3, p. 129, 2016.
[39] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[40] J. K. Jaiswal and R. Samikannu, “Application of random forest algorithm on feature sub-
set selection and classification and regression,” in 2017 World Congress on Computing
and Communication Technologies (WCCCT), Feb 2017, pp. 65–68.
[43] N. Reimers and I. Gurevych, “Optimal hyperparameters for deep lstm-networks for
sequence labeling tasks,” CoRR, vol. abs/1707.06799, 2017. [Online]. Available:
http://arxiv.org/abs/1707.06799
[44] TensorFlow.org, “TensorFlow framework,” last updated May 25, 2018. [Online]. Available: https://www.tensorflow.org/
Appendix A
KDD Cup’99 Dataset Features

No.  Feature name                  Description                                                Type
2    protocol_type                 type of the protocol, e.g. tcp, udp, etc.                  discrete
21   is_hot_login                  1 if the login belongs to the “hot” list; 0 otherwise      discrete
28   srv_count                     number of connections to the same service as the current  continuous
                                   connection in the past two seconds
33   dst_host_srv_count            srv_count for destination host                             continuous
34   dst_host_same_srv_rate        same_srv_rate for destination host                         continuous
35   dst_host_diff_srv_rate        diff_srv_rate for destination host                         continuous
36   dst_host_same_src_port_rate   same_src_port_rate for destination host                    continuous
37   dst_host_srv_diff_host_rate   diff_host_rate for destination host                        continuous
38   dst_host_serror_rate          serror_rate for destination host                           continuous
39   dst_host_srv_serror_rate      srv_serror_rate for destination host                       continuous
40   dst_host_rerror_rate          rerror_rate for destination host                           continuous
41   dst_host_srv_rerror_rate      srv_rerror_rate for destination host                       continuous
Curriculum Vitae
Publications:
• Wafaa Anani and Abdelkader Ouda. “The Importance of Human Dynamics in the Future
User Authentication”. 2017 IEEE 30th Canadian Conference on Electrical and Com-
puter Engineering CCECE, pp. 1-5, 2017.
• Wafaa Anani and Jagath Samarabandu. “Comparison of Recurrent Neural Network Algorithms for Intrusion Detection Based on Predicting Packet Sequences”. 2018 IEEE 31st Canadian Conference on Electrical and Computer Engineering CCECE, pp. 1-4, 2018.