2330 IEEE INTERNET OF THINGS JOURNAL, VOL. 10, NO. 3, 1 FEBRUARY 2023

An Enhanced AI-Based Network Intrusion Detection System Using Generative Adversarial Networks

Cheolhee Park, Jonghoon Lee, Youngsoo Kim, Jong-Geun Park, Hyunjin Kim, and Dowon Hong

Abstract—As communication technology advances, various and heterogeneous data are communicated in distributed environments through network systems. Meanwhile, along with the development of communication technology, the attack surface has expanded, and concerns regarding network security have increased. Accordingly, to deal with potential threats, research on network intrusion detection systems (NIDSs) has been actively conducted. Among the various NIDS technologies, recent interest is focused on artificial intelligence (AI)-based anomaly detection systems, and various models have been proposed to improve the performance of NIDS. However, there still exists the problem of data imbalance, in which AI models cannot sufficiently learn malicious behavior and thus fail to detect network threats accurately. In this study, we propose a novel AI-based NIDS that can efficiently resolve the data imbalance problem and improve the performance of the previous systems. To address the aforementioned problem, we leveraged a state-of-the-art generative model that could generate plausible synthetic data for minor attack traffic. In particular, we focused on the reconstruction error and Wasserstein distance-based generative adversarial networks, and autoencoder-driven deep learning models. To demonstrate the effectiveness of our system, we performed comprehensive evaluations over various data sets and demonstrated that the proposed systems significantly outperformed the previous AI-based NIDS.

Index Terms—Anomaly detection, generative adversarial network (GAN), network intrusion detection system (NIDS), network security.

Manuscript received 5 April 2022; revised 14 August 2022; accepted 21 September 2022. Date of publication 3 October 2022; date of current version 24 January 2023. This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) Grant funded by the Korea Government (MSIT, Development of 5G Edge Security Technology for Ensuring 5G+ Service Stability and Availability) under Grant 2020-0-00952. (Corresponding author: Cheolhee Park.)
Cheolhee Park, Jonghoon Lee, Youngsoo Kim, Jong-Geun Park, and Hyunjin Kim are with the Cyber Security Research Division, Electronics and Telecommunications Research Institute, Daejeon 34129, South Korea (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
Dowon Hong is with the Department of Applied Mathematics, Kongju National University, Gongju 32588, South Korea (e-mail: dwhong@kongju.ac.kr).
Digital Object Identifier 10.1109/JIOT.2022.3211346
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

With the development of the fifth-generation (5G) mobile communication technology that diversifies the access environments and constructs distributed networks, various and heterogeneous data are communicated through network systems. In general, these data originate from diverse domains, such as sensors, computers, and the Internet of Things (IoT), and the capacity of network systems has been expanded to process these data reliably. However, as the access points are diversified, the attack surface expands, thereby leaving the network systems vulnerable to potential threats. Moreover, cyber-attack techniques have become more complex and sophisticated, and the frequency of attacks has also increased. Accordingly, the importance of cybersecurity is emphasized, and various studies have been actively conducted to prevent potential network threats.

One of the fundamental challenges in cybersecurity is the detection of network threats, and various results have been reported in the field of network intrusion detection systems (NIDSs). In particular, the most recent studies have been focused on applying the artificial intelligence (AI) technology to NIDS, and AI-based intrusion detection systems have achieved remarkable performance. Initially, the research primarily focused on applying traditional machine learning models, such as decision trees [1] (DTs) and support vector machines [2] (SVMs) to existing intrusion detection systems, and it has now been extended to deep learning approaches [3], such as convolutional neural networks (CNNs), long short-term memory (LSTM), and autoencoders. Although these results have achieved remarkable performance in detecting anomalies, there still exist limitations in deploying them in real systems.

In general, most of the network flow data is normal traffic, and malicious behavior that can cause service failure occurs rarely. Moreover, within the category of malicious behavior, most of the data are well-known attacks, and specific types of attacks are extremely rare. Due to this data imbalance problem, AI models deployed in NIDS cannot sufficiently learn the characteristics of specific network threats, and this may leave the network systems vulnerable to the attacks owing to the poor detection performance.

In this study, to address this inherent problem, we propose a novel AI-based NIDS that can resolve the data imbalance problem and improve the performance of the previous systems. To address the aforementioned problem, we leveraged a state-of-the-art deep learning architecture, generative adversarial networks [4] (GANs), to generate synthetic network traffic data. In particular, we focused on the reconstruction error and Wasserstein distance-based GAN architecture [5], which can generate plausible synthetic data for minor attack traffic. By combining the generative model with anomaly detection models, we demonstrated that the proposed systems
PARK et al.: ENHANCED AI-BASED NETWORK INTRUSION DETECTION SYSTEM 2331

Fig. 1. Entire systemic architecture of our AI-based NIDS.

outperformed previous results in terms of the classification performance.

The entire architecture of our system consists of four main stages (see Fig. 1): 1) preprocessing; 2) generative model training; 3) autoencoder training; and 4) predictive model training. In the preprocessing stage, the system refines the raw data set into a format that deep learning models can learn. After preprocessing, the system sequentially trains generative models and an autoencoder model, where the trained generative models are utilized to train the autoencoder model. Finally, the system trains predictive models by applying the trained generative models and the encoder of the trained autoencoder, where the generative models are used to generate scarce data and the encoder is used as a feature extractor. In the case of the classifier models, we consider three deep learning models that have been widely utilized in AI-based NIDS: 1) deep neural networks (DNNs); 2) CNNs; and 3) LSTM model. To evaluate our system, we experimented with four network flow data sets considering different scenarios: 1) NSL-KDD [6], [7]; 2) UNSW-NB15 [8]; 3) IoT data set [9]; and 4) real-world data set. Through experiments on these various data sets, we show that the proposed system outperformed previous results. Moreover, we demonstrate that our methodology can improve the performance of existing AI-based NIDS by resolving the data imbalance problem.

The main contributions of the proposed approach can be summarized as follows.
1) By combining the state-of-the-art GAN model that can generate plausible synthetic data and measure the convergence of training, we show that the proposed system outperforms existing AI-based NIDS in terms of detection rate.
2) Through comparative experiments with various deep learning models, we show that the detection performance for rare attacks can be improved by applying our methodology as a base module.
3) By experimenting with data sets collected from various scenarios, we show that the proposed system can be effectively applied to real-world environments.

The remainder of this article is organized as follows. Section II briefly reviews related research from the perspective of NIDS based on machine learning and deep learning approaches, and Section III provides a background with a focus on autoencoders and GANs. In Section IV, we describe our methodology and the proposed framework as well as the four main stages in detail. In Section V, we evaluate the proposed system in various environments and present experimental results with detailed analysis. Finally, we present concluding remarks and future work directions of this study in Section VI.

II. RELATED WORK

In the field of AI-based NIDSs, many studies have been conducted to apply machine learning and deep learning technologies as anomaly detection. Ingre and Yadav [10] proposed a multilayer perceptron-based intrusion detection system and showed that the proposed approach achieved 81% and 79.9% accuracy in experiments on the NSL-KDD data set for binary and multiclassification, respectively. Gao et al. [11]
proposed a semi-supervised learning approach for NIDSs based on fuzzy and ensemble learning and reported that the proposed system achieved 84.54% accuracy on the NSL-KDD data set. By applying the deep belief network (DBN) model, Alrawashdeh and Purdy [12] developed an anomaly intrusion detection system and showed that the proposed DBN-based IDS exhibited a superior classification performance in subsampled testing sets (sampled subsets from the original data set). By considering the software-defined networking environment, Tang et al. [13] proposed a DNN-based anomaly detection system and reported that the DNN-based approach outperformed traditional machine learning model approaches (e.g., Naïve Bayes, SVM, and DT). Imamverdiyev and Abdullayeva [14] proposed a restricted Boltzmann machine (RBM)-based intrusion detection system and showed that the Gaussian–Bernoulli RBM model outperformed other RBM-based models (such as Bernoulli–Bernoulli RBM and DBN). From the perspective of utilizing both behavioral (network traffic characteristics) and content features (payload information), Zhong et al. [15] introduced a big data and tree architecture-driven deep learning system into the intrusion detection system, where the authors combined shallow learning and deep learning strategies and showed that the system is particularly effective at detecting subtle patterns for intrusion attacks. With the ensemble model-like approach, Haghighat and Li [16] proposed an intrusion detection system based on deep learning and voting mechanisms. They aggregated the best model results and showed that the system can provide more accurate detections. Moreover, they showed that the false alarms can be reduced up to 75% compared to the conventional deep learning approaches. Considering data streams in industrial IoT environments, Yang et al. [17] proposed a tree structure-based anomaly detection system, where the authors incorporated the window sliding, detection strategy changing, and model updating mechanisms into the locality-sensitive hashing-based iForest model [18], [19] to handle the infiniteness of data streams in real-time scenarios. Similarly, Qi et al. [20] proposed an intrusion detection system for multiaspect data streams by combining locality-sensitive hashing, isolation forest, and principal component analysis (PCA) techniques. They showed that the proposed system can effectively detect group anomalies while dealing with multiaspect data and process each data row faster than the previous approaches.

From the perspective of dealing with time-series data, several results have been reported focusing on recurrent models. Kim et al. [21] proposed an LSTM-based IDS model and proved the efficiency of the proposed IDS. Yin et al. [22] proposed a recurrent neural network-based intrusion detection system and achieved 83.3% accuracy and 81.3% accuracy in binary and multiclassification, respectively. Xu et al. [23] developed a recurrent neural network-based intrusion detection model and reported that the gated recurrent unit was more suitable as a memory unit for intrusion detection than the LSTM unit. By considering supervisory control and data acquisition (SCADA) networks, Gao et al. [24] proposed an omni-intrusion detection system. They combined LSTM and a feedforward neural network through an ensemble approach and showed that the proposed system can effectively detect intrusion attacks regardless of temporal correlation. Moreover, they demonstrated that the proposed omni-IDS outperformed previous deep learning approaches through experiments on a SCADA testbed.

In addition to the previous approach of applying supervised learning as an anomaly detection model, several studies have focused on the application of unsupervised learning, especially autoencoder models. Javaid et al. [25] proposed a sparse autoencoder-based NIDS and reported that the proposed model achieved 79.1% accuracy for multiclassification on the NSL-KDD data set. Similarly, Yan and Han [26] leveraged the sparse autoencoder model to extract high-level feature representations of intrusive behavior information and demonstrated that the stacked sparse autoencoder model could be applied as an efficient feature extraction method. Shone et al. [27] proposed a stacked nonsymmetric deep autoencoder-based intrusion detection system and showed that the proposed model could achieve 85.42% accuracy in multiclassification. As one of the significant results, Ieracitano et al. [28] proposed an autoencoder-driven intrusion detection model. Specifically, they proposed autoencoder-based and LSTM-based IDS models and compared their performance with conventional machine learning models. Through experiments on the NSL-KDD data set, they reported that the proposed autoencoder-based systems outperformed other models and achieved 84.21% and 87% accuracy for binary and multiclassification, respectively.

As another approach to applying unsupervised learning, several studies have investigated using generative models to improve the performance of existing NIDS. In particular, they have focused on applying the basic GANs [4], which are based on the Jensen–Shannon divergence (or Kullback–Leibler divergence) [29], [30], [31]. Thereafter, along with the development of various GAN models, studies have been conducted to apply appropriate GAN models for specific purposes. Li et al. [32] and Lee et al. [33] utilized the Wasserstein divergence-based GAN model to generate the synthetic data, and Dlamini et al. [34] proposed a conditional GAN-based anomaly detection model to improve the classification performance in the minority classes. By focusing on specific industrial environments, Li et al. [35] and Alabugin and Sokolov [36] proposed LSTM-GAN and bidirectional GAN-based anomaly detection models, respectively. Through experiments on the secure water treatment (SWaT) data set, they demonstrated that GAN models could be effectively applied to IDS. Siniosoglou et al. [37] proposed an anomaly detection model that could simultaneously detect anomalies and categorize the attack types. They encapsulated the autoencoder architecture into the structure of the basic GAN model (i.e., deploying the encoder as a discriminator and the decoder as a generator) and proved the efficiency of the proposed model in various smart grid environments.

Unlike previous GAN approaches that are based on the distance between data distributions, we considered the reconstruction error-based GAN model to generate more plausible synthetic data. In particular, we leveraged the boundary equilibrium GAN (BEGAN) model [5], which is based on the
Fig. 3. Basic architecture of generative adversarial networks.


Fig. 2. Basic architecture of autoencoder.
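
As a rough illustration of the encoder/decoder structure in Fig. 2, which Section III-A formalizes in (1)-(3), the following NumPy sketch performs one forward pass and computes the reconstruction error. It is only a sketch under stated assumptions: the `tanh` activations, the layer sizes (8 features, 3 latent units), the random untrained weights, and the function names are all hypothetical choices, not the configuration used in the paper.

```python
import numpy as np

def encode(x, W, b, f=np.tanh):
    """Eq. (1): map input x to the latent representation z = f(xW + b)."""
    return f(x @ W + b)

def decode(z, W_dec, b_dec, g=np.tanh):
    """Eq. (2): reconstruct the input from z as g(zW' + b')."""
    return g(z @ W_dec + b_dec)

def reconstruction_error(x, x_tilde):
    """Eq. (3): squared L2 distance between input and reconstruction, per sample."""
    return np.sum((x - x_tilde) ** 2, axis=-1)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(4, 8))         # 4 hypothetical flows, 8 scaled features
W = 0.1 * rng.normal(size=(8, 3))              # encoder weights (8 -> 3), untrained
b = np.zeros(3)
W_dec = 0.1 * rng.normal(size=(3, 8))          # decoder weights (3 -> 8), untrained
b_dec = np.zeros(8)

z = encode(x, W, b)                            # latent codes, shape (4, 3)
x_tilde = decode(z, W_dec, b_dec)              # reconstructions, shape (4, 8)
err = reconstruction_error(x, x_tilde)         # one error value per flow
```

Training would adjust `W`, `b`, `W_dec`, and `b_dec` to drive `err` down; only the forward computation is shown here.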

concept of autoencoders and the Wasserstein distance between reconstruction error distributions of samples (real and synthetic samples). Moreover, we incorporated the autoencoder model into the detection models to extract meaningful features from the data and extend the adaptability and demonstrated that the proposed framework outperforms previous AI-based network intrusion detection models.

III. BACKGROUND

In this section, we briefly illustrate the concepts of autoencoders and GAN, which are key components of our anomaly detection system.

A. Autoencoder

The autoencoder [38], [39] is one of the fundamental deep learning models and is trained with an unsupervised learning process. The objective of autoencoders is to return the output as close to the original input as possible. Therefore, the parameters are updated progressively during the training process to minimize the reconstruction error. In general, the architecture of an autoencoder consists of two components: 1) an encoder and 2) a decoder (see Fig. 2). The encoder is responsible for mapping the given raw input data $x$ into the latent space of representation

$$z = f(xW + b) \tag{1}$$

where $f$ denotes the activation function of the encoder, and $W$ and $b$ represent the weight matrix and the bias vector, respectively. Conversely, the decoder plays the role of reconstructing the representation $z$ into the corresponding input data as close as possible (i.e., $\tilde{x}$)

$$\tilde{x} = g(zW' + b') \tag{2}$$

where $g$ denotes the activation function of the decoder, and $W'$ and $b'$ are the weight matrix and the bias vector, respectively. Therefore, the autoencoder is trained to minimize the reconstruction error $L_{RE}$

$$L_{RE}(x, \tilde{x}; W, W') = \|x - \tilde{x}\|_2^2 = \|x - g(f(xW + b)W' + b')\|_2^2. \tag{3}$$

One of the fundamental characteristics of the autoencoder is to represent high-dimensional input data as lower dimensional information (summarized but meaningful information). Herein, we utilized autoencoders with the aim of feature extraction (dimension reduction) on the input data. Although PCA has traditionally been utilized to project high-dimensional data into a lower dimensional space, we leveraged the autoencoders for nonlinear transformations on complex data sets. Although we only present the basic architecture of autoencoders, models can be built in multiple layers and an asymmetric manner.

B. Generative Adversarial Networks

Generative models are designed to approximate the probability distribution of a training data set and aim to generate synthetic data that is close to the real data (training data). Recently, among these generative models, research on GAN [4] has been of significant interest. Accordingly, various GAN models have been proposed to improve the performance and advance functionality (e.g., [40], [41], and [42]). A GAN model consists of two neural network-based models: 1) a generator $G$ and 2) a discriminator $D$ (see Fig. 3). The generator $G$ aims to generate synthetic data (fake data) that is close to the real data, while the discriminator $D$ aims to discriminate between the real and fake data. In other words, these two components have opposing objectives during the training process.

More formally, let $p_z$ and $p_{\text{data}}$ be the probability distributions of the latent code and the real data, respectively. Then, the objective function $V(D, G)$ of a GAN that consists of a generator $G$ and a discriminator $D$ is a minimax game and can be formulated as follows:

$$V(D, G) = \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D_{\theta_D}(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D_{\theta_D}(G_{\theta_G}(z))\right)\right] \tag{4}$$

where $\theta_D$ and $\theta_G$ denote the model parameters of $D$ and $G$, respectively. Therefore, the discriminator is trained to output a higher confidence value in real data, and the generator is trained to generate synthetic data that can maximize the confidence score in the discriminator. After a sufficient number of iterations of this training process, both the discriminator and generator will settle to a point, where there is no scope for further improvement (i.e., a Nash equilibrium is achieved).

Since the basic concept of the GAN model was introduced, numerous variants have been proposed to develop the original model by adjusting the objective function or by modifying the model architecture. Among these various models, we focus on the BEGAN model [5], which is based on the concept of autoencoders and reconstruction errors. Unlike other GAN models wherein the objective function is defined based on the distance of distributions between confidence vectors (on real
and synthetic samples), the objective of BEGAN is defined based on the Wasserstein distance between reconstruction error distributions as follows:

$$\begin{cases} L_D = L(x; \theta_D) - k_t \cdot L(G(z; \theta_G); \theta_D) \\ L_G = L(G(z; \theta_G); \theta_D) \\ k_{t+1} = k_t + \lambda_k \cdot \left(\gamma \cdot L(x; \theta_D) - L(G(z; \theta_G); \theta_D)\right) \end{cases} \tag{5}$$

where the hyperparameter $\gamma \in [0, 1]$ is the diversity ratio,¹ and $\lambda_k$ serves as the learning rate for $k$. Note that $L(\cdot)$ denotes the reconstruction error of the autoencoder, and $t$ indicates the iteration step.

¹ Originally, the diversity ratio $\gamma$ is defined as $\gamma = \mathbb{E}[L(G(z))]/\mathbb{E}[L(x)]$.

IV. PROPOSED METHODOLOGY

As shown in Fig. 1, the entire architecture of the proposed AI-based NIDS consists of four main stages: 1) preprocessing; 2) generative model training; 3) autoencoder training; and 4) predictive model training. In this section, we describe the proposed methodology and each module (process) in detail.

A. Preprocessing

Before building and training AI models, the system refines a given raw data set via the preprocessing module that consists of three subprocesses: 1) outlier analysis; 2) one-hot encoding; and 3) feature scaling.

In the outlier analysis phase, the system eliminates outliers, which can negatively affect the model training. Typically, outliers are detected by quantifying the statistical distribution of the data sets via robust measures of scale. There are several standard robust measures of scale for detecting outliers, such as interquartile range (IQR) and median absolute deviation (MAD). Among these measures, we leveraged the MAD. For a numeric attribute $A = \{x_1, x_2, \ldots, x_n\}$, the MAD of the attribute is defined as follows:

$$\text{MAD} = \operatorname{median}\left(|x_i - \operatorname{median}(A)|\right). \tag{6}$$

We assume that numeric attributes appearing in the data set follow a normal distribution. Then, a consistent estimator $\hat{\sigma}$ for the estimation of the standard deviation is $1.4826 \times \text{MAD}$. In terms of this estimator, we determine that for a given numeric attribute, values exceeding $10 \times \hat{\sigma}$ are outliers. Obviously, outlier analysis is performed only on the numerical attributes and conducted independently for each class. Note that outlier removal should be performed before scaling features, as scaling can potentially obscure information about outliers.

After filtering out the outliers, the system transforms nominal attributes into one-hot vectors. Each nominal (categorical) attribute is represented as a binary vector with the size of the number of attribute values, where 1 is assigned only to a point corresponding to the expressed value and 0 to all others. For example, in the case of the "protocol" attribute (commonly included in network traffic data) with the values tcp, udp, and icmp, the attribute is transformed into a binary vector of length 3, and the attribute values are converted into [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively. Together with the one-hot encoding process, the system scales the numeric attributes. In general, normalization (e.g., [28]) and standardization (e.g., [24]) can be considered as scaling for numeric features. Between these two approaches, we adopted the min-max normalization method.² The normalization function $f_A(\cdot)$ for a numeric attribute $A$ that maps $\forall x \in A$ into a range [0, 1] can be defined as follows:

$$f_A(x_i) = \tilde{x}_i = \frac{x_i - \min x_j}{\max x_j - \min x_j} \tag{7}$$

where $x_i$ denotes the $i$th attribute value in the attribute $A$.

² In our experiments, there was no significant difference between the two feature scaling methods in terms of the performance of detection models.

In general, existing deep-learning-based approaches consider feature extraction (e.g., PCA, Pearson correlation coefficient, etc.) at this step to feed the model as many informative features as possible, and, consequently, feature extraction can significantly impact the performance of models in anomaly detection. However, we do not consider the computational feature extraction process, as our framework embeds an autoencoder model that can replace functionalities of feature extraction. Note that, in our framework, the model with a computational feature extraction process did not show significant improvement compared with the model without the feature extraction. A detailed description for deploying the autoencoder as a feature extractor is presented later.

B. Synthetic Data Generation With Generative Model

The synthetic data generation module builds and trains generative models using the data set refined in the data preprocessing module. In the case of the generative model, we utilize a state-of-the-art GAN model, BEGAN, which is based on the concept of autoencoders and the reconstruction error-based objective function. For the model architecture, we built the discriminator as a symmetric autoencoder model with five layers and the generator with the same architecture as the decoder of the discriminator (autoencoder). Fig. 4 illustrates the entire architecture of the BEGAN model. Before training the BEGAN model, the system first splits the given data set according to the classes and then builds generative models for each split subdata set. That is, generative models are built in a number equal to the number of classes, and (after training) each generative model produces only synthetic data corresponding to a particular class.

One of the important factors that must be considered when applying GAN models to NIDS is the determination of the termination criteria of training, which has a significant impact on the performance of anomaly detection, as it is directly related to the quality of the synthetic data to be trained on the detection model. The determination of the termination criteria stems from the tracking of the training convergence, and this is a difficult problem, as the objective function of GAN models is defined to have the properties of a zero-sum game. In general, monitoring the training progress has been conducted indirectly through visual inspection of synthetic (generated) data. However, even this approach is not feasible in NIDS environments because the data being handled is not in the form of an image. Fortunately, unlike other
Fig. 4. Architecture of generative model in our system.
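
The BEGAN training signals in (5), together with a threshold-based stopping rule, can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the reconstruction errors are passed in as plain numbers, the default values for `gamma`, `lambda_k`, and `max_steps` are hypothetical, and keeping `k` within [0, 1] is an added assumption. The default threshold of 0.058 is the value reported for the experiments, and the convergence measure is `L(x) + |gamma * L(x) - L(G(z))|`.

```python
def began_losses(loss_real, loss_fake, k_t, gamma=0.5, lambda_k=0.001):
    """One evaluation of Eq. (5).

    loss_real -- L(x; theta_D), reconstruction error on real samples
    loss_fake -- L(G(z; theta_G); theta_D), reconstruction error on synthetic samples
    """
    loss_d = loss_real - k_t * loss_fake           # discriminator objective L_D
    loss_g = loss_fake                             # generator objective L_G
    k_next = k_t + lambda_k * (gamma * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)            # assumption: keep k in [0, 1]
    return loss_d, loss_g, k_next

def convergence_measure(loss_real, loss_fake, gamma=0.5):
    """M = L(x) + |gamma * L(x) - L(G(z))|; smaller values indicate convergence."""
    return loss_real + abs(gamma * loss_real - loss_fake)

def should_stop(m_value, step, threshold=0.058, max_steps=10_000):
    """Stop when M drops below the threshold or the iteration cap is reached."""
    return m_value < threshold or step >= max_steps

# Hypothetical reconstruction errors observed at some training step.
loss_d, loss_g, k_next = began_losses(loss_real=0.04, loss_fake=0.03, k_t=0.1)
m_value = convergence_measure(loss_real=0.04, loss_fake=0.03)
stop = should_stop(m_value, step=1200)
```

The iteration cap guards against the case where the measure never falls below the threshold, which would otherwise leave the training loop running indefinitely.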

GAN models, BEGAN can approximate the convergence of Algorithm 1 Autoencoder Training With Generators
training through the concept of equilibrium, and this char- Input: training dataset Dtrain , a set of generators G
acteristic facilitates the determination of the criteria for the 1: Initialize Autoencoder parameters θAE 0
training termination. The convergence measure M of BEGAN 2: for Gi ∈ G, where 1 ≤ i ≤ k do
is formulated as follows: 3: sample z = {zj }j=1,...,mi from the latent space
M = L(x) + |γ L(x) − L(G(z))| (8) 4: D̂i = Gi (z)
5: end for
where L(·) is the reconstruction error function, and γ is the 6: D̃ = Dtrain ∪ D̂1 ∪ · · · ∪ D̂k
diversity ratio. 7: θAE = Train_Autoencoder(θAE 0 , D̃)
By utilizing the convergence measure, the system termi- 8: θenc = Extract_Encoder(θAE )
nates the generative model’s training process. That is, when Output: trained encoder θenc
training the generative model, the system considers a thresh-
old as an input parameter and terminates the training process
if the convergence measure M outputs a value less than the
given threshold. In the experiment, we set the threshold of the expanded data set composed in the previous module and then
convergence measure M to 0.058.3 utilizes the trained encoder as the feature extraction module.
After training the generative model, the system generates Algorithm 1 presents a detailed process for autoencoder train-
synthetic data according to the classes using the trained gen- ing, where mi (1 ≤ i ≤ k) indicates the magnitude of synthetic
erator and integrates the generated data set into the original data to be generated for the class i. Note that the trained
training data set. This expanded data set is used to train the encoder is placed at the forefront (input layer) of the detec-
autoencoder and detection model in the next stage. Note that tion models as a feature extractor and is set not to learn any
although we designed the synthetic data generation module to more when training detection models (i.e., we fix the model
build multiple generative models according to the number of parameters of the trained encoder when training the detection
classes, it can be built as a single model by integrating the models).
concept of the conditional GAN architecture [41], where class For detection models, we utilized the basic DNN, CNN, and
attributes are embedded in the input space. LSTM as classifiers. We designed the DNN model to possess
two hidden layers, and it could naturally process the refined
C. Learning the Autoencoder and Detection Model network traffic data in terms of the model training and clas-
To build the intrusion detection model, the system first trains sification task. In the case of the CNN model, because the
an autoencoder model that can provide feature extraction and model was originally designed to be more suitable for analyz-
dimensionality reduction functionalities. In our framework, ing image data, it required additional transformation processes
we designed the autoencoder to possess the same architec- in the input data space or the layers of the model depend-
ture as the discriminator of the generative model. Because ing on the approach followed. In our system, we built the
the deployed generative model is BEGAN, the discriminator CNN model with one-dimensional (1-D) convolutional layers
has the form of an autoencoder, as depicted above, and is to process the network traffic data, rather than converting the
compatible in terms of the model architecture, as it handles input data (i.e., network traffic data) into a 2-D space. As
the same data format as the detection model. After build- shown in Fig. 5, we configured the CNN classifier to have
ing an autoencoder model, the system trains it using the two 1-D convolutional layers and one fully connected layer.
3 In the learning process, if the convergence measure M does not fall below For LSTM, we designed the model to possess two recurrent
a given threshold, the process may fall into an infinite loop. To prevent this, layers with the LSTM units and a fully connected layer, as
we additionally set the maximum number of iterations. shown in Fig. 6. LSTM is known to be particularly effective
in analyzing temporally correlated features [24]. Taking these characteristics into account, we omitted the process of combining with the autoencoder model for the LSTM model, since the encoder may obscure the temporal features. For all models, we designed the output layer with a binary field when the task was to detect anomalies, and with multivalued fields when the purpose was to distinguish not only the anomalies but also the detailed threat types. Algorithm 2 presents a detailed workflow for training a detection model with the trained generators and the trained encoder. As with the autoencoder training process, the magnitude $m_i$ ($1 \le i \le k$) of synthetic data generation can be set differently depending on the weight of each class. Note that the process of combining with the trained encoder (lines 7 and 8 in Algorithm 2) can be omitted according to the predictive model.

Fig. 5. CNN model architecture in our system.
Fig. 6. LSTM model architecture in our system.

Algorithm 2 Classifier Training With Generators
Input: training dataset $D_{train}$, a set of generators $G$, trained encoder $\theta_{enc}$
1: Initialize classifier parameters $W^0$
2: for $G_i \in G$, where $1 \le i \le k$ do
3:     sample $z = \{z_j\}_{j=1,\dots,m_i}$ from the latent space
4:     $\hat{D}_i = G_i(z)$
5: end for
6: $\tilde{D} = D_{train} \cup \hat{D}_1 \cup \dots \cup \hat{D}_k$
7: Set Trainable_State on $\theta_{enc}$ = False
8: Build $W^0_{\theta_{enc}}$ = Concatenate_Models($\theta_{AE}$, $W^0$)
9: $W_{\theta_{enc}}$ = Train_Classifier($W^0_{\theta_{enc}}$, $\tilde{D}$)
Output: trained classifier $W_{\theta_{enc}}$
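As an illustration of the data augmentation part of Algorithm 2 (lines 2 to 6), the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `augment_training_set`, the representation of a generator as a callable paired with a class label, and the uniform latent sampling in [-1, 1] are all assumptions made for this example. Lines 7 to 9 (freezing the encoder, concatenating it with the classifier, and training) depend on the deep learning framework in use and are left out.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[List[float], int]                    # (feature vector, class label)
GeneratorFn = Callable[[List[float]], List[float]]  # latent vector -> synthetic features


def augment_training_set(
    d_train: Sequence[Sample],
    generators: Sequence[Tuple[int, GeneratorFn]],  # hypothetical: (class label, G_i)
    magnitudes: Sequence[int],                      # m_i, one entry per generator
    latent_dim: int,
    rng: random.Random,
) -> List[Sample]:
    """Build D~ = D_train U D^_1 U ... U D^_k (lines 2-6 of Algorithm 2)."""
    augmented = list(d_train)
    for (label, g_i), m_i in zip(generators, magnitudes):
        for _ in range(m_i):
            # Assumption: latent vectors are drawn uniformly from [-1, 1].
            z = [rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]
            augmented.append((g_i(z), label))
    return augmented


# Toy usage with a stand-in "generator" for class 1 (not a trained model).
d_train = [([0.1, 0.2], 0), ([0.3, 0.4], 0), ([0.9, 0.8], 1)]
toy_generator = lambda z: [sum(z) / len(z), max(z)]
d_aug = augment_training_set(
    d_train, [(1, toy_generator)], [5], latent_dim=50, rng=random.Random(0)
)
print(len(d_aug), "training samples after augmentation")
```

Setting a different entry of `magnitudes` per class mirrors the statement that $m_i$ can be chosen according to the weight of each class.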
From the perspective of the entire framework, the system sequentially processes the data preprocessing, synthetic data generation, and detection model training modules, and we refer to the whole system as G-DNNAE, G-CNNAE, and G-LSTM, according to the type of the detection model. Additionally, we subdivide the whole system into subsystems for a comprehensive comparison. In particular, we consider the DNN, CNN, and LSTM models as naïve deep learning models and DNNAE and CNNAE, which are models combined with the autoencoder, as advanced deep learning models. In the experiment, we conducted a comparative analysis of G-LSTM, G-DNNAE, and G-CNNAE with the subsystems.

V. EXPERIMENTS AND EVALUATIONS
In this section, we first review the target data sets and describe the detailed implementation of each component. Then, we present the experimental results with comparative analysis and evaluate the proposed systems.

A. Data Set Description
In this work, we focused on three network traffic data sets that are widely used as benchmark data sets in the field of intrusion detection systems. Furthermore, we collected the real data from a large enterprise system and analyzed the performance of the proposed model on the real data set.
1) NSL-KDD Data Set: The NSL-KDD data set is a refined version of the KDDcup99 data set [6], [7] and consists of training and testing data sets, KDDTrain and KDDTest, with 125 973 and 22 544 rows, respectively.⁴ In each data point, there exist 41 attributes (3 nominal, 6 binary, and 32 numeric attributes) presenting different features of the network flow and a label indicating an attack type or normal behavior. For the attack type, there exist four distinct attack profiles: 1) Denial of Service (DoS); 2) Probing; 3) Remote to Local (R2L); and 4) User to Root (U2R). DoS is an attack that depletes resources by sending excessive traffic to the target system, thereby rendering it incapable of handling legitimate network traffic or service access. In the case of a probing attack, the attacker's objective is to gain information about the target system (e.g., scanning ports in use and sweeping IP addresses). R2L is an attack that attempts to obtain local access from a remote machine by sending remote fraudulent traffic to the target, and behaviors, such as password guessing and HTTP tunneling, are considered R2L attacks. In the case of U2R, an attacker first gains access to the target system as an honest user and then attempts to gain root privileges by causing system faults (e.g., buffer overflow and rootkit). Table I presents the entire distribution of the NSL-KDD data set with respect to the classes (attack classes and normal).

⁴ The original configuration of the data set includes several subdata sets. However, we only present the main training and testing data sets.

2) UNSW-NB15 Data Set: Together with the NSL-KDD data set presented above, the UNSW-NB15 data set [8], which was created by the IXIA PerfectStorm tool, has been widely
used as an experimental data set in the field of anomaly detection systems. Similarly, UNSW-NB15 consists of training and testing data sets, UNSW-NB15_training and UNSW-NB15_testing, with 175 341 and 82 332 records, respectively. Each record possesses 43 attributes that present network flow features and two class attributes.⁵ The class attributes consist of an attribute that indicates whether or not the record is normal traffic (binary-valued attribute) and the type of attack (when the record is abnormal). For the attack type, there are nine distinct attack profiles that are intuitively labeled as follows: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Table II presents the entire distribution of the UNSW-NB15 data set. Note that we excluded any unnecessary attribute that did not affect the training of the models (“id” field) and combined the two class attributes into a single field. Therefore, the data set is considered to have 42 attributes (4 nominal, 2 binary, and 36 numeric attributes) and a class attribute.

TABLE I
DATA DISTRIBUTION IN NSL-KDD

TABLE II
DATA DISTRIBUTION IN UNSW-NB15

⁵ The raw data set contains 47 attributes (excluding class attributes), including source/destination IPs and ports. However, we used the provided training/test data set, in which features that do not affect AI training are excluded.

3) IoT Data Set: In addition to the data sets NSL-KDD and UNSW-NB15, we evaluated the performance of our system on a network traffic data set, called IoT-23 [9], collected from the IoT devices. The IoT-23 data set consists of 20 subdata sets collected from malicious IoT scenarios and three subdata sets collected from benign scenarios. For these data sets, we utilized the data set collected on the Mirai botnet scenario (named CTU-IoT-Malware-Capture-34-1). The data set contains 23 145 IoT network flows, where each data point belongs to one of the following four classes: 1) Benign; 2) C&C; 3) DDoS; and 4) PortScan. Benign matches the normal class, and the others are treated as threats. C&C indicates communication connected to the command and control server, and PortScan refers to the activity of scanning ports to gather information in order to conduct further attacks. For each data point, there are 21 attributes (11 nominal, 2 binary, and 8 numeric attributes) presenting different features of network flow, and we removed four features that did not affect the learning, such as id and IP address. To adjust the magnitude of normal class data considering the data imbalance scenario, we randomly sampled 98 077 data from data sets in the benign scenarios. Consequently, we configured the IoT data set to have 100 000 Benign data, 6706 C&C data, 14 394 DDoS data, and 122 PortScan data.

⁶ Note that the suspicious security event can be different from the labels classified by the SOC analysts.

4) Real Data Set: To evaluate the performance of our system in real-world environments, we collected raw security events from a large enterprise system. The data were collected over five months, where threats were logged separately by security operations center (SOC) analysts whenever an intrusion occurred. In the data set, we investigated 798 cyber threats, which occurred evenly over the collection period (not focused on a specific period) and observed 547 system attacks, 240 scanning attacks, and 11 worm attacks (the categorizing was conducted by the SOC analysts). In terms of the categories, the system attack includes cross-site scripting, DDoS, brute force attack, and injection attack, whereas the scanning attack includes Trojan and backdoor attacks. In total, we collected 4 782 342 security event data, of which 230 026 were identified as cyber threats (i.e., 4 552 316 data were labeled as “Normal,” and 230 026 data were labeled as “Threat”). Each raw data has 16 basic features for network flow information, such as the protocol type, service, and source bytes (eight nominal and eight numeric attributes). Moreover, because the collected data are raw security events, each data includes information regarding the suspicious security event.⁶ Table III presents a distribution of the collected data set with respect to the suspicious security events, and it can be seen that the false
positives are relatively high (see [43] for a detailed description of the collected real data set). Note that, although there were several detailed classes of detected attacks, each data was categorized as Normal and Threat only (related to the privacy issues of the enterprise).

TABLE III
DISTRIBUTION OF RAW SECURITY EVENTS IN THE REAL DATA SET

B. Implementation and Hyperparameters Tuning
As described in the previous section, we set the discriminator of the generative model to be a symmetric autoencoder model with three layers. For this model, we constructed the first hidden layer with 80 neurons and a latent space dimension with a size of 50. Therefore, the generator is set to have the latent space of size 50 and a hidden layer of size 80. Additionally, we applied batch normalization to each hidden layer for the stability of learning and used the rectified linear unit (ReLU) as the activation function. Note that because we configured the autoencoder as a feature extractor with the same architecture as the discriminator, the above configuration corresponds to that of the autoencoder as well. In the case of the generative model, we set the convergence threshold to 0.058 and terminated training when the convergence measure fell below the given threshold, or the number of epochs reached 250. For autoencoder learning, we set the default number of epochs to 300 and stopped training when the reconstruction accuracy was above 0.97.
For the classifier models, we deployed three distinct deep learning models: DNN, CNN, and LSTM. Considering the number of features, we explored the depth of the models up to three layers. In the experiment, the one-layer structure showed high volatility, and the three-layer structure showed a tendency to overfit. As a result, the models were most stable in the two-layer structure and showed the highest performance.
For the DNN model, we set the first hidden layer to have 32 neurons and the second layer to have 16 neurons. For CNN, we used a 1-D-CNN model with two convolutional layers. The convolutional layers are configured to have 32 convolution filters with windows of size 5, and a fully connected layer of 16 neurons follows. Additionally, we applied a max-pooling layer with windows of size 3 to the first convolutional layer, and the batch normalization layer after each convolutional layer. For the activation function, we used ReLU as in the generative model. In the case of LSTM, we connected 64 LSTM cells in each layer and concatenated a fully connected layer with 32 neurons. For these detection models, we set the default number of epochs to 300 and applied the early stop technique (we stopped learning when the relative differences of the loss were less than $10^{-6}$ for 35 consecutive epochs [24]).
We utilized two additional basic machine learning models as comparative models.
1) SVM is a supervised learning model based on the statistical learning theory and aims to locate the best hyperplane that can optimally separate input domains according to the classes. In the experiment, we implemented the linear kernel SVM model [2].
2) DT is a nonparametric supervised learning model, and it recursively splits input domains based on the correlation between each feature and class. In this study, we implemented the C4.5 algorithm [1].
For a more extensive comparison, we subdivided the components of our system, DNN, CNN, LSTM, DNNAE, and CNNAE, and utilized them as comparative models with the whole system. Note that we regard these submodels to correspond to the existing AI-based NIDS. In particular, DNN, CNN, and LSTM are considered as naïve deep learning approaches. In the case of DNNAE and CNNAE, they are considered as advanced deep learning approaches combined with autoencoders.⁷
In the experiment, we utilized four metrics to evaluate the performance of AI models: Accuracy, Precision, Recall, and F1-score. Accuracy refers to the fraction of correctly inferred results and is commonly used to quantify the performance of AI models. For a given class in a data set, Precision presents the fraction of positive values inferred by the model that are correct, while Recall refers to the fraction of data with positive values that are correctly inferred by the model. The F1-score is the harmonic mean of Precision and Recall. The formulas of these metrics are defined as follows:
1) $\text{Accuracy} = \dfrac{TP + TN}{TP + FP + TN + FN}$
2) $\text{Precision} = \dfrac{TP}{TP + FP}$
3) $\text{Recall} = \dfrac{TP}{TP + FN}$
4) $\text{F1-score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
where TP, TN, FN, and FP denote the true positive, true negative, false negative, and false positive, respectively.
Using these metrics, we evaluated each model on the experimental data sets. Note that, although we built the models with a stable structure, there was still the issue of volatility. Accordingly, with respect to comparison and evaluation, we independently trained each model 100 times and displayed the results for the model with the best detection rate in the test data set.

⁷ Although the detailed architecture and configurations may differ from those of the previous approaches, we stress that the implemented models are comparable to or outperform the existing systems in terms of performance.
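The four metrics defined in this subsection can be computed directly from the confusion-matrix counts. The sketch below is an illustrative example only: the class `ConfusionCounts`, the helper names, the zero-denominator convention, and the sample counts are all assumptions introduced here and are not taken from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(num: float, den: float) -> float:
    # Convention chosen for this sketch: an empty denominator yields 0.0.
    return num / den if den else 0.0


def accuracy(c: ConfusionCounts) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def precision(c: ConfusionCounts) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def f1_score(c: ConfusionCounts) -> float:
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


# Hypothetical counts for one class, used only to exercise the formulas.
counts = ConfusionCounts(tp=80, fp=10, tn=95, fn=15)
print(f"accuracy={accuracy(counts):.3f} precision={precision(counts):.3f} "
      f"recall={recall(counts):.3f} f1={f1_score(counts):.3f}")
```

For a multiclass task, the same functions can be applied per class by treating that class as positive and all other classes as negative.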
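The early stop rule used for the detection models (a default of 300 epochs, stopping once the relative difference of the loss stays below 10^-6 for 35 consecutive epochs) can be sketched as follows. This is one possible reading rather than the authors' code: the function name `epochs_until_stop` is hypothetical, and the relative difference is assumed to be |previous - current| / |previous|.

```python
from typing import Iterable


def epochs_until_stop(
    losses: Iterable[float],
    tol: float = 1e-6,      # threshold on the relative loss difference
    patience: int = 35,     # consecutive epochs that must stay below tol
    max_epochs: int = 300,  # default epoch budget
) -> int:
    """Return how many epochs run before the early stop rule or the budget ends training."""
    prev = None
    streak = 0
    epochs_run = 0
    for epoch, loss in enumerate(losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if prev is not None:
            # Assumed definition of "relative difference of loss".
            rel_diff = abs(prev - loss) / max(abs(prev), 1e-12)
            streak = streak + 1 if rel_diff < tol else 0
            if streak >= patience:
                break
        prev = loss
    return epochs_run


# Synthetic loss curve: decreases for 50 epochs, then stays flat.
flat_after_50 = [1.0 / e for e in range(1, 51)] + [0.02] * 100
print(epochs_until_stop(flat_after_50))
```

The same pattern applies to the generative model, with the convergence measure compared against the threshold 0.058 and a budget of 250 epochs.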

TABLE IV
BINARY CLASSIFICATION RESULTS FOR THE TEST DATA SET IN NSL-KDD

C. Experiments on the NSL-KDD Data Set
For the NSL-KDD data set, we explored both binary and multiclassification tasks. Note that NSL-KDD is provided separately as a training data set and a test data set as mentioned above, and we used these data sets in our experiments as provided. In other words, we used KDDTrain (125 973 rows) as a training data set and KDDTest (22 544 rows) as a test data set, and there was no data shuffling between the two data sets. In the experiments on our system (i.e., G-DNNAE, G-CNNAE, and G-LSTM), we generated synthetic data for each class via the generative model and integrated them into the training data set. Obviously, the evaluation of all models was conducted on the original test data set (KDDTest) for unbiased comparisons.

Fig. 7. Comparison of binary classification results on the NSL-KDD data set.

1) Binary Classification: Table IV presents the experimental results for the binary classification task on the NSL-KDD data set. Note that the data belonging to the attack classes are naturally considered anomalies in the binary classification task (labeled as abnormal). In the experiments on our system, we generated a total of 35 000 additional data (synthetic data) for each class via the trained generative module. Fig. 7 shows a comparison of experimental results for the NSL-KDD data set in the binary classification scenario.
Overall, the models output relatively high recall values for the data belonging to the normal class and, conversely, showed relatively high precision values for the abnormal class. For the basic machine learning models, the DT outperformed the SVM model, with an accuracy of 81.5%. Moreover, the DT model performed better than the naïve DNN and CNN models, where DNN achieved an accuracy of 79.5% and CNN achieved an accuracy of 80.5%. Among the basic models and the naïve models, the LSTM model outperformed others with an accuracy of 82.0%. For the advanced deep learning approaches, both DNNAE and CNNAE exhibited better results than the basic machine learning and the naïve deep learning models. The advanced models, DNNAE and CNNAE, achieved accuracies of 85.5% and 86.4%, respectively. The proposed models, to which the generative model and autoencoder had been applied, were found to significantly outperform all the aforementioned models. In particular, both G-DNNAE and G-CNNAE achieved an accuracy close to 90%, and it was observed that G-CNNAE produced the highest performance with an accuracy of 90.3%. In the case of LSTM, the generator-combined LSTM model performed slightly better than the naïve LSTM, but was measured to be inferior to CNNAE.
2) Multiclassification: Table V presents the experimental results for the multiclassification task on the NSL-KDD data set.⁸ Unlike the binary classification scenario, the system could further recognize the type of threat that the data belonged to and hence, generate synthetic data with different magnitudes based on weights in the population. In the experiments on our system, we generated synthetic data for minor classes with less than 10% weight in the distribution. That is, we generated synthetic data for Probe, R2L, and U2R classes (10 000 synthetic data for each class) via the trained generative model.

⁸ In the multiclassification scenario, we only presented experimental results for the attack classes. Experimental results for the normal class follow the previous experiment (i.e., experiments in the binary classification scenario).
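The rule of generating synthetic data only for classes with less than 10% weight can be illustrated with a short sketch. The function name `synthetic_budget` is hypothetical, and the class counts below are the commonly reported KDDTrain counts for NSL-KDD (they sum to the 125 973 rows stated in the text); Table I itself is not reproduced here, so they should be treated as an assumption.

```python
from typing import Dict


def synthetic_budget(
    class_counts: Dict[str, int],
    minor_threshold: float = 0.10,  # classes below this weight count as minor
    per_class: int = 10_000,        # synthetic samples generated per minor class
) -> Dict[str, int]:
    """Map each minor class to the number of synthetic samples to generate."""
    total = sum(class_counts.values())
    return {
        name: per_class
        for name, count in class_counts.items()
        if count / total < minor_threshold
    }


# Assumed KDDTrain class counts (commonly reported for NSL-KDD).
kdd_train_counts = {
    "Normal": 67_343,
    "DoS": 45_927,
    "Probe": 11_656,
    "R2L": 995,
    "U2R": 52,
}
print(synthetic_budget(kdd_train_counts))
```

With these counts, Probe falls just under the threshold (about 9.3% of the training set), which is consistent with Probe, R2L, and U2R being the classes selected in the text.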
Fig. 8 shows a comparison of experimental results for the NSL-KDD data set in the multiclassification scenario.

TABLE V
MULTICLASSIFICATION RESULTS FOR THE TEST DATA SET IN NSL-KDD

Fig. 8. Comparison of multiclassification results on the NSL-KDD data set.

In the case of the basic machine learning models and the naïve deep learning approaches, the results obtained were similar to those obtained for the binary classification scenario. In particular, the DT model showed better results than the SVM, DNN, and CNN models in terms of the accuracy metric. However, the basic machine learning models showed poor results for the minor classes. In particular, they showed extremely low F1-scores for the R2L and U2R classes, and even failed to classify. On the contrary, although the detection performance was insufficient, the neural network-based models performed better than the SVM and DT models in the minor classes R2L and U2R. In comparison with the basic models and the naïve models, the LSTM model outperformed others as with the binary classification scenario, and showed better performance in the temporally correlated attack (i.e., DoS attack).
The advanced deep learning models, however, achieved better overall classification performance than the basic machine learning models and the naïve deep learning models, where DNNAE achieved an accuracy of 88.3%, and CNNAE achieved an accuracy of 88.5%. In particular, the models combined with autoencoder demonstrated significant improvement in the DoS, Probe, and R2L classes. However, compared with the naïve deep learning models, they did not improve the classification
performance for U2R, which is extremely minor. Note that, although the results seem to have improved numerically, there is not much difference in terms of the number of data. In the case of our models, G-DNNAE and G-CNNAE achieved the best performance compared with that of the other models and achieved an accuracy of 92.7% and 93.2%, respectively. From the perspective of the minor classes, the proposed models comprehensively improved the classification performance and showed a notable improvement in the classification for the R2L class. Note that we did not generate additional synthetic data for the DoS class (a major class) as mentioned above, and it can be observed that the results for this class were similar to those of the advanced deep learning models.
In summary, we found that neural network-based models combined with autoencoders could significantly improve the classification performance in both the binary and multiclassification tasks, and they can be further improved by applying the generative model. From the perspective of the base model architecture, the DNN-based model and the CNN-based model showed similar classification performance under the same conditions, and no significant differences were found between the two models.

D. Experiments on the UNSW-NB15 Data Set
To compare the performance of models in a data set with more diverse classes, we conducted experiments on the UNSW-NB15 data set as another multiclassification scenario. As described above, UNSW-NB15 has ten classes, including a normal class, three major, and six minor classes. For the minor classes, we determined that classes with a weight of less than 1% are extremely minor. As with the experiments on the NSL-KDD data set, we used the original UNSW-NB15 training and testing data sets (175 341 and 82 332 records, respectively). Similarly, we generated synthetic data for each class via the generative model in the experiments on our system and integrated them into the training data set. Note that the evaluation of all models was conducted on the original UNSW-NB15 testing data set.

TABLE VI
CLASSIFICATION ACCURACY FOR EACH THREAT CLASS ON THE UNSW-NB15 DATA SET

Fig. 9. Comparison of multiclassification results on the UNSW-NB15 data set.

Table VI presents the experimental results for the multiclassification scenario on the UNSW-NB15 data set. In the experiments, we generated synthetic data for all classes. In particular, we generated synthetic data to reach a total size of 50 000 for each major class and a total size of 30 000 for each minor class. Additionally, we assumed that for a given threat data, the classification was correct if the model classified the data into one of the classes corresponding to the attack category (even if the model did not predict the exact class). Accordingly, we report only the accuracy as the performance measure in the experiment on the UNSW-NB15 data set, considering whether the attack was classified as an attack. Fig. 9 shows a comparison of experimental results for the UNSW-NB15 data set in the multiclassification scenario.
As shown in Table VI, G-DNNAE and G-CNNAE outperformed other models in terms of the classification performance. For the major classes, Generic, Exploits, and Fuzzers, the naïve and advanced deep learning models showed similar performance, and it was observed that the proposed models could improve the classification performance for the major classes even in the LSTM-based model. In particular, the generator-combined models showed significant performance improvement in the Generic and Fuzzers classes (up to about 5%). In the case of the minor classes, the proposed models showed a moderate performance improvement overall. Especially, G-LSTM, G-DNNAE, and G-CNNAE achieved about 3% performance improvement in the Backdoors class
(which possessed weights of approximately 1% within the distribution) compared with the other models. Through experiments on the UNSW-NB15 data set containing more diverse classes, we found that the proposed model could improve the classification performance for major classes. Moreover, we found that the implemented generative model could further improve the classification performance in minor and extremely minor classes.
Although the proposed framework can improve the classification performance, there is still the problem of relatively low detection rates for some classes. In particular, all the experimented models were observed to have relatively low detection rates for the DoS class, even in the LSTM-based model, which is suitable for detecting temporally correlated attacks. Regarding these results, we infer that the domain space between classes is heavily overlapping [34], resulting in low detection rates for some classes.

E. Experiments on the IoT Data Set
To evaluate the performance of the proposed systems in IoT environments, we conducted experiments on the IoT-23 data set. As described above, we utilized the data set collected on the Mirai botnet scenario (CTU-IoT-Malware-Capture-34-1) and intentionally simulated an extreme data imbalance scenario. For evaluation, we randomly split the data set into training and test data sets at a ratio of 7:3 in each class (i.e., 84 855 training data and 36 367 test data). In the experiments on the proposed system, we generated synthetic data to attain a total size of 30 000 for each malicious class in the training data set and evaluated the performance of all models using the previously separated test data (36 367 rows).

TABLE VII
EXPERIMENTAL RESULTS ON THE IOT-23 DATA SET FOR MULTICLASSIFICATION TASKS

Fig. 10. Comparison of multiclassification results on the IoT-23 data set.

Table VII presents the experimental results for the multiclassification task on the IoT-23 data set, and Fig. 10 shows a comparison of experimental results. Overall, all the models achieved an accuracy greater than 93%, and the models were observed to have perfect classification performance for the DDoS class, even in the naïve deep learning approach. Moreover, we observed that there was no significant difference in the performance between the advanced model and the naïve model. In the case of the C&C class, all models showed a precision of 100%. For the proposed models, all the generator-combined models showed the same performance and achieved a significant improvement in recall, reaching 80%. These results are presumably due to the fact that the IoT data set is very simple and has features that contain powerful information related to the nature of the attack (e.g., “history”). In addition, regarding these results, we conjecture that the trained generative model has generated plausible data points that fall within a certain region of the C&C distribution (appearing in the test data set, but not in the training data set), and partially covered the missing region in the (extended) training data set. Moreover, since there is a portion of the data in the corresponding region in the test data set, we estimate that G-LSTM, G-DNNAE, and G-CNNAE performed significantly better than the other models. For the PortScan class, which is extremely minor, all models achieved a recall of 100%, and the proposed systems achieved the highest precision, 90.4%.

F. Experiments on the Collected Real Data Set
To analyze the feasibility of the proposed system in a real environment, we collected real network flow data with raw security events from a large enterprise system and conducted experiments on this real data set. As in the above experiment, we randomly split the collected data set into training and test data sets at a ratio of 7:3 in both normal and abnormal classes (i.e., 3 347 639 training data and 1 434 703 test data). Note that
we only considered the binary classification scenario in experiments on the real environment. As shown in Table VIII, the data set possesses a severe imbalance between the normal and abnormal classes. In the experiments on the proposed system, we generated synthetic data for the abnormal class to be the same size as the normal class and evaluated the performance of all models using the previously partitioned test data set as in the previous experiments.

TABLE VIII
EXPERIMENTAL RESULTS ON THE REAL DATA SET FOR BINARY CLASSIFICATION TASKS

Fig. 11. Comparison of binary classification results on the real data set.

Table VIII presents the experimental results on the real data set, and Fig. 11 shows a comparison of experimental results. First, it can be seen that all models achieve a superior performance in terms of the accuracy, as the data set consists of 95.1% normal data and 4.9% anomalous data. Moreover, there was no significant difference between the naïve and advanced models in terms of the classification performance, as in the experiment on the IoT data set. From the perspective of each class, the models achieved high F1-scores for normal data as expected, but relatively low recall values were measured for abnormal data. In the case of the proposed model, G-DNNAE and G-CNNAE achieved 93.8% F1-scores in the abnormal class, and we observed that the deployed generative model could significantly improve the classification performance of minor classes even in the real system.

G. Evaluation
Through comprehensive experiments on various data sets, we demonstrated that the proposed system significantly outperforms previous deep learning approaches and showed that the classification performance for minor classes can be greatly improved through the generative model. In particular, the proposed models showed a noticeable performance improvement for the R2L and Probe classes on the NSL-KDD data set. In addition, we confirmed that the proposed model can significantly improve the detection rate for most classes on the UNSW-NB15 data set. Moreover, through experiments on the IoT data set, we observed that our system can efficiently detect network threats in a distributed environment. To demonstrate the feasibility in real-world environments, we collected real data and tested our system in the binary classification scenario. Through experiments on the real data set, we demonstrated that the proposed model could improve the detection performance of network anomalies by resolving the data imbalance problem, and that the proposed system can be effectively applied in real-world environments.

VI. CONCLUSION
In this study, we presented a novel AI-based NIDS that can efficiently resolve the data imbalance problem and improve the classification performance of the previous systems. To address the data imbalance problem, we leveraged a state-of-the-art generative model that could generate plausible synthetic data and measure the convergence of training. Moreover, we implemented autoencoder-driven detection models based on DNN and CNN and demonstrated that the proposed models outperform previous machine learning and deep learning approaches. The proposed system was analyzed on various data sets, including two benchmark data sets, an IoT data set, and a real data set. In particular, the proposed models achieved accuracies of up to 93.2% and 87% on the NSL-KDD data set and the UNSW-NB15 data set, respectively, and showed remarkable performance improvement in the minor classes. In addition, through experiments on an IoT data set, we demonstrated that the proposed system can efficiently detect network
threats in a distributed environment. Moreover, in order to investigate the feasibility in real-world environments, we collected real data from a large enterprise system and evaluated the proposed model on the collected data set. Through this experiment, we demonstrated that the proposed model can significantly improve the detection rate of network threats by resolving the data imbalance problem in the real environment.
In the future, by considering practical distributed environments, we will focus on applying our framework to federated learning systems and ensemble AI systems to enhance network threat detection. In addition, we will study adversarial attacks that can bypass AI-based NIDS through vulnerabilities in AI models and conduct research on enhanced NIDS that can resist these attacks in real-world environments.

REFERENCES
[1] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). San Mateo, CA, USA: Morgan Kaufmann, 1993.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[20] L. Qi, Y. Yang, X. Zhou, W. Rafique, and J. Ma, “Fast anomaly identification based on multi-aspect data streams for intelligent intrusion detection toward secure industry 4.0,” IEEE Trans. Ind. Informat., vol. 18, no. 9, pp. 6503–6511, Sep. 2022.
[21] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long short term memory recurrent neural network classifier for intrusion detection,” in Proc. Int. Conf. Platform Technol. Service (PlatCon), 2016, pp. 1–5.
[22] C. Yin, Y. Zhu, J. Fei, and X. He, “A deep learning approach for intrusion detection using recurrent neural networks,” IEEE Access, vol. 5, pp. 21954–21961, 2017.
[23] C. Xu, J. Shen, X. Du, and F. Zhang, “An intrusion detection system using a deep neural network with gated recurrent units,” IEEE Access, vol. 6, pp. 48697–48707, 2018.
[24] J. Gao et al., “Omni SCADA intrusion detection using deep learning algorithms,” IEEE Internet Things J., vol. 8, no. 2, pp. 951–961, Jan. 2021.
[25] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning approach for network intrusion detection system,” EAI Endorsed Trans. Security Safety, vol. 3, no. 9, p. e2, May 2016.
[26] B. Yan and G. Han, “Effective feature extraction via stacked sparse autoencoder to improve intrusion detection system,” IEEE Access, vol. 6, pp. 41238–41248, 2018.
[27] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A deep learning approach to network intrusion detection,” IEEE Trans. Emerg. Topics Comput. Intell., vol. 2, no. 1, pp. 41–50, Feb. 2018.
[28] C. Ieracitano, A. Adeel, F. C. Morabito, and A. Hussain, “A novel statistical analysis and autoencoder driven intelligent intrusion detection approach,” Neurocomputing, vol. 387, pp. 51–62, Apr. 2020.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, [29] J. Y. Kim, S. J. Bu, and S. B. Cho, “Malware detection using deep
MA, USA: MIT Press, 2016. transferred generative adversarial networks,” in Proc. Int. Conf. Neural
Inf. Process., 2017, pp. 556–564.
[4] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th Int.
[30] M. H. Shahriar, N. I. Haque, M. A. Rahman, and M. Alonso, “G-IDS:
Conf. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2672–2680.
Generative adversarial networks assisted intrusion detection system,”
[5] D. Berthelot, T. Schumm, and L. Metz, “BEGAN: Boundary equilibrium
in Proc. IEEE 44th Annu. Comput., Softw., Appl. Conf. (COMPSAC),
generative adversarial networks,”2017, arXiv:1703.10717.
Jul. 2020, pp. 376–385.
[6] S. Hettich and S. D. Bay. “KDD cup 1999 data.” 1999. [Online].
[31] I. Yilmaz, R. Masum, and A. Siraj, “Addressing imbalanced data
Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
problem with generative adversarial network for intrusion detection,”
[7] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed anal- in Proc. IEEE 21st Int. Conf. Inf. Reuse Integr. Data Sci. (IRI), Las
ysis of the KDD CUP 99 data set,” in Proc. IEEE Symp. Comput. Intell. Vegas, NV, USA, 2020, pp. 25–30.
Secur. Defense Appl., Jul. 2009, pp. 1–6. [32] D. Li, D. Kotani, and Y. Okabe, “Improving attack detection
[8] N. Moustafa and J. Slay, “UNSW-NB15: A comprehensive data set for performance in NIDS using GAN,” in Proc. IEEE 44th Annu. Comput.,
network intrusion detection systems (UNSW-NB15 network data set),”in Softw., Appl. Conf. (COMPSAC), Jul. 2020, pp. 817–825.
Proc. Military Commun. Inf. Syst. Conf. (MilCIS), 2015, pp. 1–6. [33] W. Lee, B. Noh, Y. Kim, and K. Jeong, “Generation of network traf-
[9] A. Parmisano, S. Garcia, and M. J. Erquiaga, “A labeled dataset with fic using WGAN-GP and a DFT filter for resolving data imbalance,”
malicious and benign IoT network traffic.” 2020. [Online]. Available: in Proc. Int. Conf. Internet Distrib. Comput. Syst. (IDCS), Oct. 2019,
https://www.stratosphereips.org/datasets-iot23 pp. 306–317.
[10] B. Ingre and A. Yadav, “Performance analysis of NSL-KDD dataset [34] G. Dlamini and M. Fahim, “DGM: A data generative model to improve
using ANN,” in Proc. Int. Conf. Signal Process. Commun. Eng. Syst., minority class presence in anomaly detection domain,” Neural Comput.
Andhra Pradesh, India, Jan. 2015, pp. 92–96. Appl., vol. 33, pp. 13635–13646, Apr. 2021.
[11] Y. Gao, Y. Liu, Y. Jin, J. Chen, and H. Wu, “A novel semi-supervised [35] D. Li, D. Chen, J. Goh, and S.-K. Ng, “Anomaly detection with
learning approach for network intrusion detection on cloud-based robotic generative adversarial networks for multivariate time series,” 2018,
system,” IEEE Access, vol. 6, pp. 50927–50938, 2018. arXiv:1809.04758.
[12] K. Alrawashdeh and C. Purdy, “Toward an online anomaly intrusion [36] S. K. Alabugin and A. N. Sokolov, “Applying of generative adversarial
detection system based on deep learning,” in Proc. IEEE 15th Int. Conf. networks for anomaly detection in industrial control systems,” in Proc.
Mach. Learn. Appl. (ICMLA), Anaheim, CA, USA, 2016, pp. 195–200. Global Smart Ind. Conf. (GloSIC), Nov. 2020, pp. 199–203.
[13] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, [37] I. Siniosoglou, P. Radoglou-Grammatikis, G. Efstathopoulos, P. Fouliras,
“Deep learning approach for network intrusion detection in software and P. Sarigiannidis, “A unified deep learning anomaly detection and
defined networking,” in Proc. Int. Conf. Wireless Netw. Mobile Commun. classification approach for smart grid environments,” IEEE Trans. Netw.
(WINCOM), 2016, pp. 258–263. Service Manage., vol. 18, no. 2, pp. 1137–1151, Jun. 2021.
[14] Y. Imamverdiyev and F. Abdullayeva, “Deep learning method for denial [38] D. E. Rumelhart and J. L. McClelland, “Learning internal representations
of service attack detection based on restricted Boltzmann machine,” Big by error propagation,” in Parallel Distributed Processing: Explorations
Data, vol. 6, no. 2, pp. 159–169, Jun. 2018. in the Microstructure of Cognition: Foundations, vol. 1. Cambridge,
[15] W. Zhong, N. Yu, and C. Ai, “Applying big data based deep learn- MA, USA: MIT Press, 1987, pp. 318–362.
ing system to intrusion detection,” Big Data Min. Anal., vol. 3, no. 3, [39] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description
pp. 181–195, Sep. 2020. length and helmholtz free energy,” in Proc. 6th Int. Conf. Neural Inf.
[16] M. H. Haghighat and J. Li, “Intrusion detection system using vot- Process. Syst., 1993, pp. 3–10.
ingbased neural network,” Tsinghua Sci. Technol., vol. 26, no. 4, [40] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
pp. 484–495, Aug. 2021. learning with deep convolutional generative adversarial networks,” 2016,
[17] Y. Yang et al., “ASTREAM: Data-stream-driven scalable anomaly detec- arXiv:1511.06434.
tion with accuracy guarantee in IIoT environment,” IEEE Trans. Netw. [41] M. Mirza and S. Osindero, “Conditional generative adversarial nets,”
Sci. Eng., early access, Mar. 8, 2022, doi: 10.1109/TNSE.2022.3157730. 2014, arXiv:1411.1784.
[18] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation-based anomaly [42] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adver-
detection,” ACM Trans. Knowl. Discov. Data, vol. 6, no. 1, pp. 1–39, sarial networks,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), 2017,
Mar. 2012. pp. 214–223.
[19] X. Zhang et al., “LSHiForest: A generic framework for fast tree isolation [43] J. Lee, J. Kim, I. Kim, and K. Han, “Cyber threat detection based on
based ensemble anomaly analysis,” in Proc. IEEE 33rd Int. Conf. Data artificial neural networks using event profiles,” IEEE Access, vol. 7,
Eng. (ICDE), Apr. 2017, pp. 983–994. pp. 165607–165626, 2019.

Cheolhee Park received the B.S. degree from the Department of Applied
Mathematics, Kongju National University, Gongju, South Korea, in 2014,
and the M.S. and Ph.D. degrees from the Department of Mathematics,
Kongju National University in 2017 and 2021, respectively.
He joined Electronics and Telecommunications Research Institute, Daejeon,
South Korea, in 2021, where he is currently working as a Researcher.
His research interests include data privacy, differential privacy, machine
learning, deep learning, AI security, and network security.

Jonghoon Lee received the B.S., M.S., and Ph.D. degrees in computer
engineering from Kyungpook National University, Daegu, South Korea, in 2000,
2002, and 2020, respectively.
He joined Electronics and Telecommunications Research Institute (ETRI),
Daejeon, South Korea, in 2002. Since 2002, he has been involved in various
research projects in the cyber security and network fields. He is currently
a Principal Researcher with the Cyber Security Research Division, ETRI. His
research interests include cyber security, 5G network security, AI-based
network intrusion detection, AI-SIEM techniques for 5G, and network big data
analytics for cyber security.

Youngsoo Kim received the B.S. degree from the Department of Information
Engineering, Sungkyunkwan University, Seoul, Republic of Korea, in 1998,
and the M.S. and Ph.D. degrees from the Department of Computer Engineering,
Sungkyunkwan University in 2000 and 2009, respectively.
He joined Electronics and Telecommunications Research Institute, Daejeon,
Republic of Korea, in 2000, where he is currently working as a Principal
Researcher. From 2012 to 2015, he was an Adjunct Professor with Chungnam
National University, Daejeon. He is currently interested in 5G security,
network security, digital forensics, cryptography, and AI security.

Jong-Geun Park received the B.S. and M.S. degrees from the Department of
Industrial Engineering, Sungkyunkwan University, Seoul, Republic of Korea,
in 1997 and 1999, respectively, and the Ph.D. degree from the Department of
Computer Engineering, Chungnam National University, Daejeon, Republic of
Korea, in 2013.
From 1999 to 2001, he was a Researcher with ADD, Daejeon. Then, he joined
Electronics and Telecommunications Research Institute, Daejeon, in 2001,
where he is currently working as a Principal Researcher. He is currently
interested in mobile network security, SDN/NFV, cloud security, and AI
security.

Hyunjin Kim received the B.S. degree in information communications
engineering and the M.S. and Ph.D. degrees in computer science and
engineering from Chungnam National University, Daejeon, South Korea, in
2015, 2017, and 2021, respectively.
He is currently a Researcher with Electronics and Telecommunications
Research Institute, Daejeon. He is interested in information security, both
theoretical and practical, and his recent research is largely about network
security and applied cryptography.

Dowon Hong received the B.S., M.S., and Ph.D. degrees in mathematics from
Korea University, Seoul, South Korea, in 1994, 1996, and 2000, respectively.
He was a Principal Member of Engineering Staff of Electronics and
Telecommunications Research Institute, Daejeon, South Korea, from 2000 to
2012. He joined the Department of Applied Mathematics, Kongju National
University, Gongju, South Korea, in 2012, where he has been a Full Professor
since 2015. His research interests include cryptography, data privacy,
differential privacy, and network security.
