An Advanced Intrusion Detection System For IIoT Ba

This article proposes an intrusion detection system for industrial IoT networks that uses a genetic algorithm for feature selection and various machine learning classifiers for detection. The genetic algorithm selects optimal feature vectors using a random forest model in its fitness function. When evaluated on the UNSW-NB15 dataset, the proposed approach achieved a test accuracy of 87.61% and AUC of 0.98, outperforming existing intrusion detection systems.


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3104113, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2019.DOI

An advanced Intrusion Detection System for IIoT Based on GA and Tree based Algorithms
SYDNEY MAMBWE KASONGO
Department of Industrial Engineering and School of Data Science and Computational Thinking, University of Stellenbosch, South Africa
Corresponding author: Sydney M. Kasongo (e-mail: [email protected]).

ABSTRACT
The evolution of the Internet and cloud-based technologies have empowered several organizations with the
capacity to implement large-scale Internet of Things (IoT)-based ecosystems, such as Industrial IoT (IIoT).
The IoT and, by extension, the IIoT, are vulnerable to new types of threats and intrusions because of the nature of
their networks. It is therefore crucial to develop Intrusion Detection Systems (IDSs) that can provide the security,
privacy, and integrity of IIoT networks. In this research, we propose an IDS for IIoT that was implemented
using the Genetic Algorithm (GA) for feature selection, and the Random Forest (RF) model was employed
in the GA fitness function. The models used for the intrusion detection processes include classifiers such
as the RF, Linear Regression (LR), Naïve Bayes (NB), Decision Tree (DT), Extra-Trees (ET), and Extreme
Gradient Boosting (XGB). The GA-RF generated 10 feature vectors for the binary classification scheme
and seven feature vectors for the multiclass classification procedure. The UNSW-NB15 is used to assess the
effectiveness and the robustness of our proposed approach. The experimental outcomes demonstrated that
for the binary modeling process, the GA-RF achieved a test accuracy (TAC) of 87.61% and an Area Under
the Curve (AUC) of 0.98, using a feature vector that contained 16 features. These results were superior to
existing IDS frameworks.

INDEX TERMS Internet of Things, intrusion detection, genetic algorithm, machine learning

I. INTRODUCTION
In recent years, the Internet of Things (IoT) paradigm has shown massive adoption by different industries including the medical sector, vehicle manufacturers, home appliances manufacturers, etc. The acceptance of IoT technology has significantly changed the way we live [1]. The specific use of IoT in the modern industry gave birth to the Industrial IoT (IIoT) concept. Modern Industrial Internet of Things (I-IoT or IIoT) depicts using the regular IoT in different industrial ventures and organizations. IIoT contains countless actuators, sensors, control systems, communication, and integration interfaces, advanced security systems, vehicular networks, home appliances networks, etc. All the nodes within the IIoT can connect to the Internet. Using IIoT in modern industries has greatly enhanced the capabilities of various sectors such as manufacturing plants, asset management systems, advanced logistics systems, etc. Moreover, the IIoT allows for several applications, devices, and services to connect the physical space to a virtual one [2].

There exist several ways IIoT nodes connect to the Internet and this includes communication protocols such as the Transmission Control Protocol and the Internet Protocol (TCP/IP) using Message Queue Telemetry Transport (MQTT), Modbus TCP, Cellular, Long-Range Radio Wide Area Network (LoRaWAN), etc. [3], [4]. Moreover, most IIoT nodes can collect, process, and transmit data. These abilities make them susceptible to some privacy and security threats that have the potential to jeopardize the IIoT systems and the applications to which they belong [5]. One of the key attributes of IIoT nodes is that they are always active while performing the collection, processing, and transmission of data. Fig. 1 depicts all the layers that are present in the IIoT, namely, the perceptual layer, the network layer, the application layer, and the Cloud. These layers are based on the flow of data. Moreover, each layer is prone to various types of attacks and intrusions that could compromise the systems within the IIoT. Some common attacks and intrusions on the IIoT ecosystem include access control attacks, data corruption breaches,

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
spoofing attacks, Denial of Service (DoS) attacks, Distributed DoS, Operating System (OS) attacks, jamming attacks, etc. To counter these malicious attacks and to guarantee that the active nature of IIoT nodes and the security of IIoT networks are maintained, a lot of organizations are implementing Intrusion Detection Systems (IDSs). Moreover, these IDSs can be configured at any layer in Fig. 1 [5].

An IDS plays a critical role in the IIoT by guaranteeing that the integrity, security, and privacy of data transmitted through its network are maintained. An IDS can prevent, detect, react and report any attacks or malicious activities that have the potential to cripple an IIoT network [6]. Traditional IDSs are broadly categorized as follows: signature-based, anomaly-based, and hybrid-based. Signature-based IDSs are designed using existing (known) attack signatures that can be found in the IDS database. Anomaly-based IDS are implemented using abnormal patterns within a network. Hybrid-based IDSs combine signature and anomaly-based IDSs. Some drawbacks of traditional IDSs include a high false-positive rate and a low detection accuracy. Additionally, they cannot detect novel types of intrusions and are incapable of preventing events such as zero-day attacks. To improve on the performance of traditional IDSs, researchers have explored the use of Artificial Intelligence (AI) and more particularly, the application of Machine Learning (ML) based techniques for IDS [7], [8].

ML is a branch of Artificial Intelligence (AI) that empowers various systems with the ability and the capacity to learn from experience and to ameliorate their decision-making process without any explicit programming [9]. At the top level, ML approaches are categorized as supervised and unsupervised. At a granular level, ML algorithms are classified as follows: supervised, unsupervised, semi-supervised, and reinforcement. Supervised ML methods improve their decision-making process by learning from a labeled dataset (a dataset with data points that have a label) to perform future predictions. In contrast, unsupervised ML approaches are used when the learning task involves unlabelled data. Semi-supervised ML algorithms use both labeled and unlabeled data during the learning process. Reinforcement ML methods compute rewards or errors based on their interaction within a given environment [10].

In this research, we propose an IDS for IIoT that uses Tree-based supervised ML algorithms. ML-based IDSs are generally trained using the latest intrusion detection datasets. Nonetheless, the majority of the modern datasets are large, both on the feature space dimension as well as the number of network traces. A high number of features in a dataset has the potential to negatively impact the training process of ML algorithms. Often the performance of ML methods is reduced as the number of features increases. In other words, it is harder to perform the learning process as the number of attributes increases in a dataset [11]. Thus, it is crucial to perform a feature selection or extraction process to guarantee that the size of the attribute vector is reduced to an optimal number of required features [12].

There are three types of feature selection (FS) methods: wrapper-based FS, filter-based FS, and hybrid-based FS. In the instance of the filter-based FS method, the selection process relies on the nature of the data and it uses a variety of statistical methods to extract the optimal feature vector. The filter-based FS method is computationally cheap and efficient. In contrast, the wrapper-based FS approach employs a predictor in the selection process. This occurs by iteratively computing the predictor’s performance over several subsets of features until the candidate optimal feature vector is found. The wrapper-based FS method is computationally expensive, but it is precise in comparison to other FS methods. The hybrid-based FS technique, sometimes called embedded-based FS, combines the filter-based and the wrapper-based FS methods [13]–[15]. In this research, we propose a wrapper-based FS method, based on the Genetic Algorithm (GA) [16] that uses the Random Forest (RF) ML algorithm [17] in its fitness function to generate optimal candidates for feature vectors. Furthermore, to assess the performance of our proposed method, we use the UNSW-NB15 intrusion detection dataset. This dataset is widely adopted by the research community [18], [19]. The network traces present in the dataset were generated in a laboratory environment. But, they do mimic the real-world network traffic patterns, such as the ones generated by an IIoT network system [20]. Additionally, the UNSW-NB15 is a more complex dataset in comparison to the NSL-KDD or KDD Cup 99 datasets [20] and it includes a higher variety of network traffic patterns. Moreover, the UNSW-NB15 is a general-purpose dataset that paved the way to datasets such as the TON_IoT dataset [21].

The major goals and contributions of this paper are as follows:
• Firstly, we propose a Genetic Algorithm (GA)-based feature selection algorithm. The fitness function used in the GA method used the Random Forest (RF) to generate the fitness scores.
• Secondly, for each solution (attribute vector), we implement Tree-based algorithms such as RF, the Decision Tree (DT), and the Extra Tree (ET) methods. Moreover, the generated attribute vectors can be applied by other researchers using their own classifiers.
• Lastly, we conduct a comparison between our proposed method with existing systems. The results demonstrate a noteworthy improvement in performance.

The remainder of the paper is structured as follows. Section II presents an account of related work. Section III introduces the UNSW-NB15 dataset. Section IV presents the proposed IDS methodology. Section V outlines the experiments and provides discussions about the results. Section VI concludes this paper and provides future directions.

II. RELATED WORK
This section provides an account of related research that was conducted in the domain of IDS using ML techniques. Moreover, this section serves as a survey of various IDS
FIGURE 1. Typical IIoT Architecture
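
The wrapper-based FS idea described in the Introduction (score candidate feature subsets with a predictor and keep the best one) can be illustrated with a minimal Python sketch. This is not the method proposed in this paper: as labeled assumptions, an exhaustive search over subsets stands in for the GA, a toy nearest-centroid predictor stands in for the RF, and the function names `centroid_accuracy` and `wrapper_select` as well as the tiny dataset are hypothetical.

```python
from itertools import combinations


def centroid_accuracy(train_x, train_y, val_x, val_y, subset):
    """Fit a nearest-centroid predictor on the `subset` columns and
    return its accuracy on the validation rows (toy stand-in for RF)."""
    sums, counts = {}, {}
    for row, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(subset))
        for k, j in enumerate(subset):
            acc[k] += row[j]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    correct = 0
    for row, label in zip(val_x, val_y):
        # Predict the class whose centroid is closest (squared distance).
        pred = min(
            centroids,
            key=lambda c: sum((row[j] - m) ** 2
                              for j, m in zip(subset, centroids[c])),
        )
        correct += pred == label
    return correct / len(val_y)


def wrapper_select(train_x, train_y, val_x, val_y):
    """Exhaustive wrapper search (assumption: replaces the GA search).
    Smaller subsets are tried first and only a strictly better score
    replaces the current best, so ties favour fewer features."""
    n_features = len(train_x[0])
    best_subset, best_score = (), -1.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            score = centroid_accuracy(train_x, train_y, val_x, val_y, subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical data: column 0 is informative, column 1 is noisy,
# column 2 is constant.
train_x = [[0.1, 9.0, 1.0], [0.2, 1.0, 1.0], [0.9, 8.0, 1.0], [0.8, 4.0, 1.0]]
train_y = [0, 0, 1, 1]
val_x = [[0.15, 9.5, 1.0], [0.85, 0.5, 1.0], [0.3, 5.0, 1.0], [0.7, 5.0, 1.0]]
val_y = [0, 1, 0, 1]

best_subset, best_score = wrapper_select(train_x, train_y, val_x, val_y)
print(best_subset, best_score)
```

The cost of the exhaustive loop grows as 2^n in the number of features, which is one way to see why the wrapper approach is described as computationally expensive and why a heuristic search such as a GA is attractive for a dataset with dozens of attributes.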

frameworks and solutions that were previously implemented for intrusion detection in IoT-based systems.

Liu et al. [22] implemented an IDS system for IoT using a Particle Swarm Optimization (PSO)-based technique for feature selection and the Support Vector Machine (SVM) ML algorithm for classification. The PSO method used in this research is based on the Light Gradient Boosting Machine (LightGBM). The authors used the UNSW-NB15 dataset to validate their model and they considered the accuracy and the False Alarm Rate (FAR) as the performance metrics. The experimental results demonstrated that the PSO-LightGBM achieved an overall accuracy of 86.68% and a high FAR of 10.62%. This research was based on the binary classification scheme. But, the authors could have also implemented the multiclass classification procedure to assess the full potential of their method. Moreover, the FAR obtained by the LightGBM is high.

Zhou et al. [23] implemented a Variational LSTM (VLSTM) IDS for Industrial Big Data systems. The VLSTM was implemented in conjunction with a feature selection and retention technique based on the reconstructed rendering of features. The authors used an Auto-Encoder Neural Network (AENN) to retrieve the low-dimensional attribute characteristics from high-dimensional datasets. To study their model, the researchers used the UNSW-NB15 dataset. During the evaluation phase, the following performance metrics were employed: the False Alarm Rate (FAR), the Area Under the Curve (AUC), the precision, the recall, and the F1-Score. The experimental results demonstrated that the VLSTM achieved an AUC of 0.895, a precision of 86%, a recall of 97.8%, and an F1-Score of 90.7%. Although these results were superior to some of the existing methods, the authors conceded that further experiments needed to be done to deal with the highly imbalanced nature of the UNSW-NB15.

In [24], the authors proposed an ML-based IDS using an adaptive principal component (APAC) for the feature selection process and an incremental extreme learning machine (IELM) algorithm for classification. In this research, the APAC is used to adaptively generate candidate attributes that are then fed to the IELM for the classification procedure. The authors considered the NSL-KDD and the UNSW-NB15 datasets to gauge the effectiveness of the presented framework. Moreover, the multiclass classification scheme was used for both datasets. The main performance metric that was utilized in this work was the accuracy achieved by a model on test data. In the case of the NSL-KDD dataset, the APAC-IELM achieved an accuracy of 81.22%. For the UNSW-NB15, the APAC-IELM obtained an accuracy of 70.51%. Although the authors claimed that the obtained results were superior to those obtained by the existing systems, they
conceded that more research needed to be undertaken to adapt the APAC-IELM to industrial control systems (ICS).

In [25], the authors proposed a deep neural network (DNN)-based IDS. In this research, the aim was to develop a flexible and robust IDS that could easily detect novel forms of attacks. To assess the efficacy of the presented method, the following datasets were considered: KDD-Cup99, UNSW-NB15, NSL-KDD, Kyoto, WSN-DS, and CICIDS 2017. The experimental processes were executed over 1000 epochs for each dataset. Focusing on the UNSW-NB15, the experiments demonstrated that the DNN obtained an accuracy of 76.1%, a precision of 95.1%, a recall of 96.3%, and F1-Score of 79.7% for the binary modeling process. In contrast, the DNN obtained an accuracy of 65.1%, an F1-Score of 75.6%, a precision of 59.7%, and a recall of 65.1% for the multiclass modeling procedure.

Hanif et al. [26] presented an IDS for IoT networks using artificial neural networks (ANN). This system was implemented to overcome the issue of security that is a major concern in IoT networks. Given the fact that IoT devices often lack the capacity to perform high-level computation for security, the authors decided to explore the possibility of using an ML-based IDS system as the first line of defense. To assess the effectiveness of the proposed method, the authors utilized the UNSW-NB15. The experimental outcomes claimed that the ANN-IDS obtained a precision score of 84.00% for the binary classification process. However, the researchers did not provide much clarity on how the hyper-parameters of the ANN were tuned to arrive at their conclusion. Moreover, the authors did not consider any feature selection method.

In [27], the authors conducted a complexity comparison analysis between the UNSW-NB15 and the KDD99 datasets. To achieve the comparison, the authors used various methods, including the expectation-maximization (EM) clustering algorithm and the ANN methods. In this work, the models were assessed using the FAR and the accuracy. In the instance of the KDD99, the EM clustering achieved an accuracy of 78.06% and a FAR of 23.79%. In contrast for the UNSW-NB15, the EM clustering obtained a FAR of 23.79% and an accuracy of 78.47%. Furthermore, the ANN technique attained an accuracy of 81.34% and a FAR of 21.13% when tested on the UNSW-NB15. This research concluded that the UNSW-NB15 dataset is more complex in contrast to the KDD99 dataset.

Ketzaki [28] proposed a light-weight IDS using ANN. This system is destined to secure modern communication systems (5G networks, IIoT networks, etc.). The ANN-IDS presented in this research was designed in two stages. The first stage is the feature extraction procedure using statistical analysis. The second step is the classification process. The authors considered the binary classification scheme using the UNSW-NB15 intrusion detection dataset. The performance metric used to evaluate the ANN models is the accuracy that was obtained on the test data. The results demonstrated that the best model attained an accuracy score of 83.9%. In their future endeavor, the authors aimed to improve the effectiveness of the proposed method.

In [29], the author presented an IDS framework using the J48 tree-based classifier and the SVM algorithm. Several methods were used to conduct the feature selection process, including the GA, the firefly optimization (FFA), and the grey wolf optimizer (GWO). The researchers used the UNSW-NB15 dataset to gauge the effectiveness of the models implemented in the experiments. The results showed that the accuracy scores obtained by the GA-J48, GWO-J48, and the FFA-J48 are 86.874%, 85.676%, and 86.037%, respectively. Moreover, the accuracy scores achieved by the GA-SVM, GWO-SVM, and FFA-SVM are 86.387%, 84.485%, and 85.429%, respectively. Although these are impressive results using the J48 and the SVM methods, the authors recommended that future work be conducted using other approaches such as deep learning methods.

In [30], the researchers implemented a novel feature selection method named Tabu Search - Random Forest (TS-RF). TS-RF is a wrapper-based feature extraction technique in which the TS algorithm conducts the attributes search and the RF approach is used as the learning method. To verify the performance of their model, the authors considered the UNSW-NB15 dataset. The main performance metrics were the accuracy and the False Positive Rate (FPR). The results demonstrated that the TS-RF in conjunction with the RF classifier obtained an accuracy of 83.12% and an FPR of 3.7%. Although the obtained results are promising, the authors conceded that they did not consider the class imbalance problem found in the UNSW-NB15 dataset.

In [31], a Two-Stage (TS) model for IDS was proposed. This methodology used the first stage to detect minority classes of intrusions and the second step to detect majority classes of attacks. The ML classification method used in this work is the RF method. The authors used the Information Gain (IG) for feature extraction. The IG-TS IDS was evaluated using the UNSW-NB15 dataset. The performance metrics considered in this research are accuracy and FAR. In their experiments, the authors used the binary classification scheme as their main configuration. The experimental results showed that the IG-TS obtained a FAR of 15.64% and an accuracy of 85.78%. In future works, the authors aimed to change the classifier that was utilized in the two stages.

In [32], the authors proposed an ML-based IDS using the GA algorithm and the Logistic Regression (LR) method for attributes selection. The binary classification process was conducted using a Tree-based classifier, namely the C4.5 method. The UNSW-NB15 was used to assess the efficacy of the presented method. The authors considered a number of performance metrics to evaluate the proposed approach, however, the accuracy that was obtained on test data was the main metric. The experimental results showed that the GA-LR-DT attained an accuracy of 81.42%. This research did not demonstrate the effectiveness of the GA-LR-DT for the multiclass classification scheme.

Kasongo and Sun [33] proposed an IDS using an XGBoost (extreme gradient boosting) based feature extraction
method in conjunction with several ML methods. The XGBoost, which is an ensemble-tree based algorithm, is used in this research to decrease the number of attributes in the UNSW-NB15. One of the classifiers used in this work is the LR method. The experimental results demonstrated that the XGBoost-LR achieved an accuracy of 75.51% and 72.53% for the binary and multiclass classification schemes, respectively. To overcome the class imbalance problems in the UNSW-NB15 dataset, the authors suggested using oversampling techniques.

In [34], the authors implemented an SVM-based NIDS using the UNSW-NB15 dataset. This system was designed to accommodate the unique nature of IoT networks. The authors considered the accuracy, the detection rate, and the false positive rate as the main performance metrics. The experiments were conducted for both the binary and multiclass classification schemes. The results showed that the SVM-NIDS attained an accuracy of 85.99% for the binary modeling task. In the instance of the multiple classes setting, the SVM-NIDS obtained an accuracy of 75.77%.

Kumar et al. [35] applied the UNSW-NB15 as an offline data source to design an ML-based IDS that would also be used to perform online intrusion detection. The authors used the Information Gain (IG) methodology for the feature selection procedure. The IG method selected 13 attributes. For the classification process, the researchers used an integrated approach that included the following Tree-based classifiers: C5, CHAID, CART, and QUEST. The outcome of the experiments demonstrated that the proposed system obtained an accuracy of 84.83% for the binary classification procedure. However, one of the drawbacks of the IDS presented here is its inability to detect unknown attacks. Solving this issue was one of the recommendations made by the authors.

In [36], the researchers presented an IDS using deep learning methods such as the Long-Short Term Memory (LSTM) RNN. To assess the effectiveness of the proposed approach, the authors used the UNSW-NB15 dataset. Moreover, the authors used the accuracy that was obtained during the classification task as the main performance metric. The experimental processes showed that the LSTM method obtained an accuracy of 85.42% for the binary modeling process. Although the authors claimed that these results were superior to existing ones, they did not consider implementing a feature selection algorithm.

Elijah et al. [37] proposed an ensemble and deep learning-based method for network intrusion detection. The LSTM algorithm was used to implement the deep learning model. The optimization algorithm applied to the LSTM is Stochastic Gradient Descent (SGD). The activation function applied in the LSTM layers is the Rectified Linear Unit (ReLU) in the instance of the binary classification task. For the multiclass classification scheme, the authors used the Softmax function. The UNSW-NB15 dataset was used in order to evaluate the performance of the proposed approach. The experimental results show that the LSTM IDS achieved an accuracy of 80.72% for the two-way classification procedure. In contrast, the LSTM IDS obtained an accuracy of 72.26% for the multiclass classification tasks.

In [38], the authors proposed a deep learning-based IDS using deep neural networks. This model was built using a combination of residual blocks (ResBlk). The ResBlks contain convolutional neural networks (CNNs) and recurrent neural networks (RNN). Moreover, the authors utilized the NSL-KDD and the UNSW-NB15 dataset to assess the performance of the proposed approach. The accuracy was one of the main performance metrics that was used to evaluate the outcome of the experiments. The results showed that the DL method achieved an accuracy of 99.21% and 86.64% in the instance of NSL-KDD and UNSW-NB15 datasets, respectively. Although these results are promising, the authors conceded that more experiments need to be conducted to improve the current performance numbers.

Assiri [39] proposed a GA-RF-based method for anomaly classification. In this work, the authors used the GA for attributes and parameters selection and the RF method for classification. Moreover, the researchers considered the binary classification scheme. The UNSW-NB15 was one of the datasets used to assess the performance of their model. The accuracy, recall, and precision were the main performance metrics that were utilized to evaluate the GA-RF presented here. The experimental results demonstrated that the GA-RF achieved a classification accuracy of 86.70%, a recall of 87.00%, and a precision of 87%.

In [40], the authors implemented an advanced IDS. This system was designed using a multi-objective feature selection method based on a special variation of the GA in conjunction with the Logistic Regression (LR) algorithm. The RF method was one of the ML methods that were used to assess the performance of the proposed methodology. The UNSW-NB15 was amongst the datasets that were employed to evaluate the models. The accuracy was the main performance metric that was considered to gauge the effectiveness of the GA-LR-RF. The experimental outcomes demonstrated that the GA-LR-RF achieved an accuracy of 64.23% for the multiclass classification task.

III. THE UNSW-NB15 DATASET
The UNSW-NB15 [19] is an advanced dataset used for IDS research and it is widely used in the literature. The raw packets (network traces) contained in the UNSW-NB15 dataset were generated by the IXIA PerfectStorm tool in a laboratory set-up of the Cyber Range Laboratory of the Australian Center for Cybersecurity (ACCS). The UNSW-NB15 contains 42 attributes listed in Table 1. As depicted in the list of attributes in Table 1, 3 features are categorical in nature and 39 attributes are numerical (binary, float and integer).

The UNSW-NB15 is composed of two datasets that include the UNSW-NB15-train and the UNSW-NB15-test. In this paper, UNSW-NB15-train is further divided into two datasets. The first one is the UNSW-NB15-75 that makes up 75% of the full UNSW-NB15-train. The second one is the
UNSW-NB15-25 that accounts for 25% of the UNSW-NB15-train subset. In this study, UNSW-NB15-75 is used during the training phase of the models and the UNSW-NB15-25 is used during the validation phase of the models. It is crucial to perform a validation process to guarantee that the results that were obtained during the training phase are optimal. Moreover, the validation results must be like those of the training procedure. The entire UNSW-NB15-test dataset is used during the testing phase of the models presented in this research.

The UNSW-NB15 intrusion detection dataset contains the following nine categories of attacks [20]: Fuzzers, Analysis, Exploits, Worms, Shellcode, DoS, Generic, Reconnaissance, and Backdoor. The value distribution of the UNSW-NB15 (UNSW-NB15-100), the UNSW-NB15-75, the UNSW-NB15-25, and the UNSW-NB15-TEST datasets are shown in Table 2.

TABLE 1. UNSW-NB15 dataset attributes list

No.  Feature  Category   No.  Feature            Category
f1   dur      float      f22  dtcpb              integer
f2   proto    nominal    f23  dwin               integer
f3   service  nominal    f24  tcprtt             float
f4   state    nominal    f25  synack             float
f5   spkts    integer    f26  ackdat             float
f6   dpkts    integer    f27  smean              integer
f7   sbytes   integer    f28  dmean              integer
f8   dbytes   integer    f29  trans_depth        integer
f9   rate     float      f30  response_body_len  integer
f10  sttl     integer    f31  ct_srv_src         integer
f11  dttl     integer    f32  ct_state_ttl       integer
f12  sload    float      f33  ct_dst_ltm         integer
f13  dload    float      f34  ct_src_dport_ltm   integer
f14  sloss    integer    f35  ct_dst_sport_ltm   integer
f15  dloss    integer    f36  ct_dst_src_ltm     integer
f16  sinpkt   float      f37  is_ftp_login       binary
f17  dinpkt   float      f38  ct_ftp_cmd         integer
f18  sjit     float      f39  ct_flw_http_mthd   integer
f19  djit     float      f40  ct_src_ltm         integer
f20  swin     integer    f41  ct_srv_dst         integer
f21  stcpb    integer    f42  is_sm_ips_ports    binary

TABLE 2. UNSW-NB15 dataset values distribution

Attack Category  UNSW-NB15-100  UNSW-NB15-75  UNSW-NB15-25  UNSW-NB15-TEST
Normal           56000          41911         14089         37000
Generic          40000          30081         9919          18871
Exploits         33393          25034         8359          11132
Fuzzers          18184          13608         4576          6062

modeling and evaluation phase. In the pre-processing phase, we load the datasets (training set, validation set, and testing sets). Each dataset is cleaned and normalized. In the feature selection phase, the cleaned training dataset is used to compute the candidate feature vectors using the GA method in conjunction with the RF algorithm. In the modeling and evaluation step, the models (RF, ExtraTrees, DT, LR, XGB) are trained using the cleaned training dataset with a particular attribute vector generated by the previous phase. Once the models have been trained, they are evaluated using the cleaned validation set and they are tested using the cleaned testing set. The building blocks of the proposed framework are explained in more detail in the next subsections.

A. PRE-PROCESSING PHASE
The most important aspects of the pre-processing phase are the cleaning and data normalization steps. Data cleaning is crucial because it ensures that the quality of the data used to build the models has been improved. The steps taken to clean the data include: removing duplicates, replacing missing data, fixing structural errors, and removing unwanted (potentially noisy) observations. Once the data have been cleaned, they require normalization. In this research, we apply the Min-Max scaling [41] and it is defined as follows:

$$x_{norm} = (p - q)\,\frac{x_n - \min(x_n)}{\max(x_n) - \min(x_n)} \qquad (1)$$

where x represents a given feature in the feature space, X. This scaling process acts as a safeguarding process by squeezing the values of each feature within a certain range.

B. RANDOM FOREST
The building blocks of the Random Forest (RF) algorithm are Decision Trees (DTs). A DT is a supervised ML method that is applied in tasks such as regression and classification. In simple terms, a DT algorithm uses a tree-like structure to compute the predictions. Each DT contains three types of nodes: namely, the root node, the internal nodes, and the category nodes. For a given input vector, the DT computes its prediction from the root node, traversing many internal nodes, to the category nodes [42], [43].

In this research, we use an RF classifier in the fitness function of the GA algorithm described in the next section. The RF algorithm was devised by L. Breiman [44] and it is
DoS 12264 9237 3027 4089 one of the most widely used ML algorithms today. The RF
Reconnaissance 10491 7875 2616 3496 algorithm is an ensemble of Decision Trees (DTs) classifiers
Analysis 2000 1477 523 677
Backdoor 1746 1330 416 583
whereby each individual DT is built using an attribute vector
Shellcode 1133 854 279 378 that is randomly selected from the input vector. Finally, each
Worms 130 99 31 44 DT casts a vote for the most popular label in the selected
input attribute vector. The label (class) with the highest score
wins the poll [45], [46]. The RF method can be formulated as
IV. THE PROPOSED IIOT IDS METHODOLOGY follows:
The architecture of the proposed framework is depicted in Let P = {X1 , y1 , ..., (Xk , yk )} be a training subset of
Fig. 2 whereby there are three main phases, namely, the inputs vectors and labels that are randomly selected given
pre-processing phase, the feature selection phase, and the probability distribution (dataset), (Xn , yn ) ∼ (X, Y ).
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 2. The proposed IDS framework for IIoT
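As a minimal sketch of the Min-Max scaling of Eq. (1) used in the
pre-processing phase, the following Python function rescales a single feature
column. The function name `min_max_scale`, the default bounds p = 1 and q = 0,
and the handling of constant columns are assumptions made for illustration;
the text does not state the values of p and q. With these defaults each
feature is mapped to the range [0, 1].

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float], p: float = 1.0, q: float = 0.0) -> List[float]:
    """Scale one feature column following Eq. (1).

    Assumption: p and q are the upper and lower bounds of the target range,
    with defaults p=1 and q=0 (not specified in the text).
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant feature is mapped to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(p - q) * (v - lo) / span for v in values]


# Hypothetical 'dur' values, for illustration only.
dur = [0.0, 2.5, 5.0, 10.0]
print(min_max_scale(dur))  # prints [0.0, 0.25, 0.5, 1.0]
```

In practice the minimum and maximum would be computed on the training set
only and then reused for the validation and testing sets, so that all three
sets share the same scale.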

The aim is to compute a model (classifier) that predicts the label y given an
input X from P.
Let F be a group of possibly weak classifiers defined as follows:
F = {f_1(X), ..., f_N(X)}, where N is the total number of models. Each model,
f_n(X), in F is defined as a Decision Tree (DT). Therefore, F is called the
Random Forest.
Each model f_n(X) has some parameters defined as
B_n = (β_n1, β_n2, ..., β_np). The notation of each tree in the forest
becomes: f_n(X) = f(X|B_n).
The attributes that appear in the nodes of the nth DT are randomly selected
based on B_n. The final result of the forest, f(X) (a combination of all the
classifiers), is computed by majority voting. The label with the most votes
is the output of the RF.

C. EXTRA-TREES
The Extra-Trees (ET) method is a tree-based algorithm (a meta-estimator) that
is related to the RF algorithm because it also uses an ensemble of DTs to
conduct the classification or the regression processes. However, unlike the
RF algorithm, the ET approach randomly selects the cut points of the nodes.
Therefore, the ET method adds another layer of randomization while
maintaining its optimization capability [47].

D. FEATURE SELECTION PHASE USING GENETIC ALGORITHM
The Genetic Algorithm (GA) is an Evolutionary Algorithm (EA) that has gained
popularity by solving various optimization problems with a low computational
cost [48]. EAs are methods that are inspired by biological principles and are
used for optimization or learning tasks. EAs have the following main traits
[49]:
• Population: EA methods maintain a group of candidate solutions called the
  population.
• Fitness: An individual is a solution within a population. Each individual
  possesses its code (gene representation) and its fitness score.
• Variation: The individuals go through changes (mutations) similar to
  biological gene variation. This is how an EA performs the search in the
  solution space.
The main steps in the GA are as follows [50]:
1) Initialize the Population
2) Compute the fitness function
3) Perform the Selection
4) Perform the Crossover
5) Conduct the Mutation
In this research, the fitness function was implemented using the Random
Forest algorithm presented in Algorithm 1.
Algorithm 2 depicts the steps (pseudo code) that were used to implement the
GA on the UNSW-NB15 dataset. Moreover, Figure 3 simplifies this algorithm by
outlining the major steps in a flowchart format.

E. MODELLING AND EVALUATION PHASE
1) Performance metrics
In this study, we used the following metrics to measure the performance of
our proposed method: the accuracy (AC), the precision (PR), the recall (RC)
and the F1-Score (F1S) [51].

FIGURE 3. GA algorithm applied to the UNSW-NB15 dataset
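The GA loop summarized in Algorithm 2 (initialization, fitness evaluation,
selection, crossover, and mutation) can be sketched in Python as follows. This
is an illustrative sketch and not the implementation used in this research:
the truncation selection, single-point crossover, bit-flip mutation, the
default parameter values, and the names `run_ga` and `toy_fitness` are all
assumptions. In particular, `toy_fitness` is only a stand-in for the RF
validation accuracy of Algorithm 1, so that the example runs without a
dataset.

```python
import random
from typing import Callable, List, Tuple

Mask = List[int]  # 1 = feature kept, 0 = feature dropped


def run_ga(n_features: int,
           fitness: Callable[[Mask], float],
           pop_size: int = 20,
           max_iter: int = 30,
           mutation_rate: float = 0.05,
           seed: int = 0) -> Tuple[Mask, float]:
    """Return the best feature mask found and its fitness value."""
    rng = random.Random(seed)
    # 1) Initialize the population with random feature masks.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_value = None, float("-inf")

    for _ in range(max_iter):
        # 2) Compute the fitness of every individual, best first.
        scored = sorted(((fitness(m), m) for m in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_value:
            best_value, best_mask = scored[0][0], scored[0][1][:]
        # 3) Selection: keep the fitter half as parents (assumed operator).
        parents = [m for _, m in scored[:pop_size // 2]]
        # 4) Crossover and 5) mutation produce the next generation.
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_features - 1)  # single-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return best_mask, best_value


def toy_fitness(mask: Mask) -> float:
    # Stand-in for Algorithm 1 (RF validation accuracy); NOT the paper's fitness.
    informative = {0, 3, 4}  # hypothetical "useful" feature indices
    hits = sum(mask[i] for i in informative)
    return hits - 0.1 * sum(mask)  # small penalty for larger subsets


mask, value = run_ga(n_features=8, fitness=toy_fitness)
print(mask, value)
```

To mirror the approach described in this section, `fitness` would instead
train an RF on the columns selected by the mask and return its validation
accuracy; the fixed seed only makes the sketch reproducible.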

Algorithm 1 RF Algorithm in the GA fitness function
Input: X, y; the input dataframe and output series
Output: AC; the accuracy obtained by the RF model
1. Split X and y into X_train, X_val, y_train, y_val
2. Instantiate rf, the model.
3. Fit rf using X_train and y_train
4. Evaluate rf using X_val
5. Compute predictions y_predictions
6. Compute AC using y_predictions and y_val

Algorithm 2 GA Algorithm applied on the UNSW-NB15
Require: D, the UNSW-NB15 data-frame
Require: F, an array that contains the feature names
Require: T, the target value
Require: L, an empty list to store the feature subset
Require: mi, maximum iteration
START
1. Initialize the population P, using F.
2. Implement the fitness function using RF
3. Compute the fitness using D, F, T and P
4. Compute optimal fitness value, v
5. Update L
for i in range(mi)
    6. Implement crossover
    7. Run mutations
    8. Compute the fitness
    9. Compute optimal fitness value, v
    10. Update L
end for
11. Convergence reached; output L and v
STOP

The F1S represents the harmonic mean of the PR and RC. These metrics are
chosen on the basis that we are faced with a classification problem.
Moreover, in this research, we implement binary and multiclass classification
processes. The AC, the RC, the PR, and the F1S are computed as follows:

AC = \frac{TP + TN}{TP + TN + FP + FN}    (2)

RC = \frac{TP}{TP + FN}    (3)

PR = \frac{TP}{TP + FP}    (4)

F1S = 2 \frac{RC \cdot PR}{RC + PR}    (5)

where each component in the above equations is defined as follows:
• True Positive (TP): intrusions that are correctly labelled as attacks.
• True Negative (TN): normal network traces that are correctly labelled as
  legitimate.
• False Positive (FP): normal network traces that are labelled as
  intrusions.
• False Negative (FN): network intrusions that are wrongly labelled as
  non-intrusive (normal).
Additionally, to verify the efficacy of our proposed method, we also plotted
the receiver operating characteristic (ROC) curves for the models. The ROC
curve plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) of
a given model. The area under the ROC curve is defined as the Area Under the
Curve (AUC). The value of the AUC is always between 0 and 1. An efficient
model has an AUC value closer to 1 [52].

TPR = \frac{TP}{TP + FN}    (6)

FPR = \frac{FP}{FP + TN}    (7)

V. EXPERIMENTS AND DISCUSSIONS
A. EXPERIMENTAL CONFIGURATION
In this research, the experiments were conducted on a laptop with the
following specifications: DELL 153000 series Windows 10 OS, Intel Core
i7-8568U-CPU, 1.8GHz - 1.99 GHz. The ML framework that was used to implement
the simulations is Scikit-Learn (a Python-based framework) [53].

B. EXPERIMENTAL RESULTS AND DISCUSSIONS
In this research, the experiments were conducted in two phases (phase 1 and
phase 2). In phase 1, we implemented the GA on the UNSW-NB15 dataset. This
process generated two sets of feature vectors: V_b and V_m.

V_b = {f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}    (8)

V_m = {g1, g2, g3, g4, g5, g6, g7}    (9)

where V_b denotes the group of possible solutions generated by the GA for the
binary classification scheme and V_m denotes the group of possible solutions
generated by the GA for the multiclass classification process. Table 3 and
Table 4 provide the details about the vectors in V_b and V_m. These tables
have three columns whereby the first one shows the vector name, the second
column specifies the number of features that are present in the feature
vector and the third column provides a list of features (attributes) that
were selected by the GA.

TABLE 3. Features selected by the GA - Binary classification

Feature vector  No. of features  List of features
f1              21               dur, state, dpkts, sbytes, sload, dload,
                                 sloss, dloss, sjit, dwin, synack, smean,
                                 dmean, response_body_len, ct_srv_src,
                                 ct_dst_ltm, ct_src_dport_ltm,
                                 ct_dst_sport_ltm, ct_dst_src_ltm,
                                 ct_ftp_cmd, ct_srv_dst
f2              17               service, sbytes, dbytes, sttl, sload, dload,
                                 sinpkt, dinpkt, swin, stcpb, synack, smean,
                                 trans_depth, ct_dst_ltm, ct_dst_src_ltm,
                                 ct_srv_dst, is_sm_ips_ports
f3              16               dur, service, dpkts, sbytes, sttl, djit,
                                 smean, dmean, trans_depth,
                                 response_body_len, ct_src_dport_ltm,
                                 ct_dst_src_ltm, is_ftp_login, ct_ftp_cmd,
                                 ct_srv_dst, is_sm_ips_ports
f4              13               dpkts, sbytes, sloss, dloss, sinpkt, djit,
                                 tcprtt, smean, dmean, ct_srv_src,
                                 ct_src_dport_ltm, ct_ftp_cmd, ct_srv_dst
f5              18               dur, sbytes, sttl, dloss, sinpkt, djit,
                                 dtcpb, synack, ackdat, smean, dmean,
                                 ct_srv_src, ct_state_ttl, ct_src_dport_ltm,
                                 ct_dst_sport_ltm, ct_dst_src_ltm,
                                 is_ftp_login, ct_srv_dst
f6              17               dur, service, sbytes, dbytes, sttl, sloss,
                                 dloss, sjit, stcpb, synack, ackdat, smean,
                                 dmean, ct_srv_src, ct_dst_src_ltm,
                                 ct_srv_dst, is_sm_ips_ports
f7              20               service, spkts, sbytes, dttl, sload, dloss,
                                 sinpkt, djit, swin, stcpb, synack, ackdat,
                                 smean, dmean, ct_srv_src, ct_src_dport_ltm,
                                 ct_dst_sport_ltm, ct_dst_src_ltm,
                                 ct_flw_http_mthd, ct_srv_dst
f8              27               dur, proto, service, dpkts, sbytes, dbytes,
                                 rate, dttl, sload, dload, dloss, sinpkt,
                                 sjit, djit, swin, stcpb, dwin, ackdat,
                                 smean, dmean, response_body_len,
                                 ct_srv_src, ct_dst_ltm, ct_src_dport_ltm,
                                 ct_dst_src_ltm, ct_flw_http_mthd, ct_srv_dst
f9              16               dur, sbytes, dbytes, sttl, dttl, sjit, swin,
                                 dtcpb, tcprtt, smean, trans_depth,
                                 response_body_len, ct_srv_src,
                                 ct_dst_src_ltm, ct_ftp_cmd, is_sm_ips_ports
f10             17               proto, dpkts, sbytes, dbytes, sttl, swin,
                                 tcprtt, synack, ackdat, smean, ct_dst_ltm,
                                 ct_src_dport_ltm, ct_dst_sport_ltm,
                                 ct_dst_src_ltm, is_ftp_login, ct_src_ltm,
                                 ct_srv_dst

In the second phase of our experiments, we implemented two classification
processes. We first conducted the binary classification process whereby the
target feature was binary (Normal or Attack). In this step, we considered all
the feature vectors in V_b. We used the Logistic Regression (LR) [54] as our
baseline model and we implemented the following Tree-based methods: DT, RF,
ET, and XGB. The baseline model was used as our point of departure and the
aim was to beat its performance using the other classifiers. The results of
the experiments are presented in Tables 5-14. The highest test accuracy
(TAC), 87.61%, was achieved by the RF method using f3, as shown in Table 7.
Moreover, this model obtained a validation accuracy (VAC) of 95.87%, a recall
(RC) of 98.34%, a precision (PR) of 82.51%, and an F1-score (F1S) of 89.73%.
In addition, for each of the classifiers that were evaluated using f3, we
computed the ROC curves. The results are depicted in Figure 4 whereby the RF
achieved an AUC = 0.98. This value demonstrates that the quality of
classification yielded by the RF is high. Although the TAC obtained by the
XGB method (Table 7) was lower than that of the RF approach, it yielded an
AUC = 0.98. This shows that the classification quality of the XGB classifier
is high. Both the RF and the ET surpassed the AUC = 0.895 of the VLSTM
presented in [23].
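The metrics reported in the binary classification tables can be reproduced
from the four confusion-matrix counts. The sketch below follows the standard
definitions behind Eqs. (2)-(7): accuracy over all four counts, RC = TPR =
TP / (TP + FN), PR = TP / (TP + FP), F1S as the harmonic mean of RC and PR,
and FPR = FP / (FP + TN). The names `BinaryMetrics` and `binary_metrics` and
the example counts are hypothetical and are not taken from the experiments.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float   # AC, Eq. (2)
    recall: float     # RC, Eq. (3); identical to the TPR of Eq. (6)
    precision: float  # PR, Eq. (4)
    f1: float         # F1S, Eq. (5)
    fpr: float        # FPR, Eq. (7)


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> BinaryMetrics:
    """Compute the metrics from confusion-matrix counts.

    Assumption: every denominator is non-zero (both classes are present
    and at least one sample is predicted as an attack).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    fpr = fp / (fp + tn)
    return BinaryMetrics(accuracy, recall, precision, f1, fpr)


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(m)
```

A high RC combined with a lower PR, as observed for the tree-based models,
corresponds to few missed attacks (low FN) at the cost of more false alarms
(higher FP).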

TABLE 4. Features selected by the GA - Multiclass classification

Feature vector  No. of features  List of features
g1              22               service, spkts, sbytes, dbytes, rate, sttl,
                                 sloss, dinpkt, sjit, swin, tcprtt, synack,
                                 ackdat, smean, dmean, trans_depth,
                                 ct_state_ttl, ct_src_dport_ltm,
                                 is_ftp_login, ct_ftp_cmd, ct_src_ltm,
                                 ct_srv_dst
g2              25               proto, service, state, dpkts, sbytes,
                                 dbytes, sttl, dttl, sloss, dloss, dinpkt,
                                 sjit, djit, stcpb, dwin, tcprtt, smean,
                                 dmean, trans_depth, ct_state_ttl,
                                 ct_dst_ltm, ct_ftp_cmd, ct_flw_http_mthd,
                                 ct_srv_dst, is_sm_ips_ports
g3              28               dur, proto, service, state, spkts, dpkts,
                                 sbytes, dbytes, rate, sload, dload, sloss,
                                 sjit, swin, stcpb, dtcpb, dwin, tcprtt,
                                 ackdat, smean, dmean, ct_state_ttl,
                                 ct_src_dport_ltm, is_ftp_login, ct_ftp_cmd,
                                 ct_flw_http_mthd, ct_src_ltm, ct_srv_dst
g4              20               proto, service, spkts, dpkts, sbytes, sload,
                                 dloss, sinpkt, dinpkt, sjit, djit, tcprtt,
                                 ackdat, smean, dmean, ct_srv_src,
                                 ct_state_ttl, ct_dst_sport_ltm,
                                 ct_dst_src_ltm, ct_flw_http_mthd
g5              17               proto, service, spkts, dpkts, dbytes, sttl,
                                 dloss, dinpkt, sjit, tcprtt, smean, dmean,
                                 trans_depth, ct_dst_src_ltm, is_ftp_login,
                                 ct_ftp_cmd, is_sm_ips_ports
g6              26               dur, proto, service, spkts, dpkts, sbytes,
                                 dbytes, sttl, dttl, dload, dloss, djit,
                                 stcpb, dtcpb, dwin, tcprtt, synack, ackdat,
                                 smean, dmean, ct_srv_src, ct_dst_ltm,
                                 ct_src_dport_ltm, ct_dst_sport_ltm,
                                 is_ftp_login, is_sm_ips_ports
g7              18               proto, service, state, dpkts, sbytes,
                                 dbytes, sinpkt, swin, tcprtt, ackdat, smean,
                                 dmean, trans_depth, ct_state_ttl,
                                 ct_dst_src_ltm, is_ftp_login, ct_ftp_cmd,
                                 ct_flw_http_mthd

TABLE 5. Binary Classification for f1

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f1  88.63    73.40    91.76   69.60   79.16
DT     f1  94.83    85.59    94.65   81.98   87.86
RF     f1  95.95    86.89    98.50   81.53   89.22
ET     f1  95.82    86.67    98.26   81.38   89.03
XGB    f1  94.80    86.22    98.21   80.87   88.70

TABLE 6. Binary Classification for f2

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f2  87.90    74.09    90.56   70.65   79.38
DT     f2  94.50    86.40    95.96   82.29   88.60
RF     f2  95.86    87.37    98.69   82.02   89.59
ET     f2  95.80    87.13    98.48   81.84   89.39
XGB    f2  94.94    86.72    98.78   81.18   89.12

TABLE 7. Binary Classification for f3

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f3  86.54    74.49    86.82   72.37   78.94
DT     f3  94.74    86.53    95.96   82.46   88.70
RF     f3  95.87    87.61    98.34   82.51   89.73
ET     f3  95.72    87.38    97.86   82.48   89.51
XGB    f3  94.87    86.84    98.64   81.40   89.19

TABLE 8. Binary Classification for f4

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f4  88.12    75.51    94.26   70.87   80.91
DT     f4  94.78    85.53    94.45   81.99   87.78
RF     f4  95.93    86.19    96.85   81.53   88.53
ET     f4  95.85    86.13    96.62   81.59   88.47
XGB    f4  94.63    85.17    96.76   80.33   87.78

TABLE 9. Binary Classification for f5

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f5  90.01    71.94    87.10   69.59   77.37
DT     f5  94.64    85.90    95.83   81.72   88.21
RF     f5  96.02    86.92    98.57   81.54   89.25
ET     f5  95.96    86.60    98.35   81.26   88.99
XGB    f5  94.92    86.44    98.29   81.09   88.87

TABLE 10. Binary Classification for f6

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f6  87.63    74.49    90.21   71.17   79.56
DT     f6  94.78    83.03    89.43   81.54   85.30
RF     f6  95.89    87.24    98.33   82.05   89.46
ET     f6  96.05    86.99    97.89   81.98   89.23
XGB    f6  94.93    86.69    98.14   81.47   89.03

TABLE 11. Binary Classification for f7

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f7  92.12    76.81    91.57   73.10   81.30
DT     f7  94.74    86.42    96.26   82.15   88.64
RF     f7  96.11    87.14    98.62   81.77   89.41
ET     f7  96.11    86.85    98.26   81.61   89.16
XGB    f7  94.99    86.52    98.27   81.19   88.92

TABLE 12. Binary Classification for f8

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f8  91.71    77.71    98.46   71.65   82.94
DT     f8  94.87    85.41    94.22   81.98   87.67
RF     f8  95.88    87.28    98.59   81.96   89.51
ET     f8  95.80    86.92    98.12   81.77   89.20
XGB    f8  94.89    86.27    98.44   80.81   88.76

TABLE 13. Binary Classification for f9

Model  FV  VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f9  90.02    70.83    85.92   68.83   76.43
DT     f9  94.78    83.33    89.83   81.71   85.58
RF     f9  95.76    87.31    98.55   82.03   89.53
ET     f9  95.75    87.14    98.54   81.82   89.41
XGB    f9  94.65    86.87    99.03   81.23   89.25


TABLE 14. Binary Classification for f10

Model  FV   VAC (%)  TAC (%)  RC (%)  PR (%)  F1S (%)
LR     f10  86.92    72.70    82.71   71.92   76.94
DT     f10  94.79    86.51    96.29   82.24   88.71
RF     f10  95.86    86.71    98.57   81.27   89.09
ET     f10  95.70    86.32    98.29   80.95   88.78
XGB    f10  94.87    86.41    99.13   80.63   88.93

FIGURE 4. ROC Curves for classifiers using f3

In the second step of phase 2, we implemented the multiclass classification
process whereby all the labels (10 classes) present in the UNSW-NB15 were
considered. Moreover, in this step, we utilized all the attribute vectors in
V_m. The Naïve Bayes (NB) classifier [55] was used as the baseline model and
we further implemented the following Tree-based algorithms: DT, RF, ET, and
XGB. As mentioned in the previous step, the baseline model was utilized as
our starting point and the goal was to surpass its performance using the
other models. The outcomes are shown in Tables 15-21. As depicted in Table
19, the experimental results demonstrated that the best model was the ET
using g5. It attained a VAC of 82.94%, a TAC of 77.64%, a PR of 83.09%, an RC
of 77.64%, and an F1S of 80.27%. Furthermore, we computed the confusion
matrix to check how the model performed for each class present in the
UNSW-NB15. As depicted in Figure 5, the ET performed optimally in detecting
the following classes: Normal, Generic, Exploits, DoS, Reconnaissance, and
Shellcode. However, the ET underperformed for some minority classes such as
Worms, Backdoor, and Analysis.

TABLE 15. Multiclass Classification for g1

Model  FV  VAC (%)  TAC (%)  PR (%)  RC (%)  F1S (%)
DT     g1  81.05    74.56    80.64   74.56   77.48
RF     g1  82.84    76.61    82.85   76.61   79.61
ET     g1  82.90    76.53    82.63   76.53   79.46
XGB    g1  82.95    76.48    82.62   76.48   79.43
NB     g1  54.46    52.28    59.90   52.29   55.83

TABLE 16. Multiclass Classification for g2

Model  FV  VAC (%)  TAC (%)  PR (%)  RC (%)  F1S (%)
DT     g2  81.26    75.08    80.36   75.08   77.63
RF     g2  82.96    77.18    82.70   77.18   79.84
ET     g2  83.07    77.24    82.48   77.24   79.77
XGB    g2  83.03    77.23    82.46   77.23   79.76
NB     g2  55.10    51.59    69.39   51.59   59.18

TABLE 17. Multiclass Classification for g3

Model  FV  VAC (%)  TAC (%)  PR (%)  RC (%)  F1S (%)
DT     g3  80.59    74.03    80.64   74.03   77.20
RF     g3  82.20    76.04    82.43   76.04   79.11
ET     g3  82.22    76.12    82.10   76.12   79.00
XGB    g3  82.27    76.11    82.14   76.11   79.01
NB     g3  46.62    55.47    52.69   55.47   54.05

TABLE 18. Multiclass Classification for g4

Model  FV  VAC (%)  TAC (%)  PR (%)  RC (%)  F1S (%)
DT     g4  80.99    73.85    80.54   73.85   77.05
RF     g4  82.50    76.08    83.29   76.08   79.50
ET     g4  82.51    76.35    83.35   76.35   79.70
XGB    g4  82.57    76.41    83.32   76.41   79.72
NB     g4  35.19    32.72    68.81   32.72   44.35

TABLE 19. Multiclass Classification for g5

Model  FV  VAC (%)  TAC (%)  PR (%)  RC (%)  F1S (%)
DT     g5  81.44    75.70    81.09   75.70   78.30
RF     g5  82.93    77.34    83.01   77.34   80.07
ET     g5  82.94    77.64    83.09   77.64   80.27
XGB    g5  82.94    77.58    82.99   77.58   80.20
NB     g5  43.40    40.26    66.38   40.26   50.13

TABLE 20. Multiclass Classification for g6

Model  FV  VAC (%)  TAC (%)  PR (%)  RC (%)  F1S (%)
DT     g6  80.57    74.40    74.40   80.15   77.16
RF     g6  82.52    76.41    82.37   76.41   79.28
ET     g6  82.64    76.56    82.43   76.56   79.39
XGB    g6  82.67    76.54    82.32   76.54   79.33
NB     g6  50.49    47.32    75.32   47.32   58.12

TABLE 21. Multiclass Classification for g7

Model  FV  VAC (%)  TAC (%)  PR (%)  RC (%)  F1S (%)
DT     g7  81.38    74.90    74.90   80.60   77.65
RF     g7  82.65    76.86    82.93   76.86   79.78
ET     g7  82.87    76.86    82.83   76.86   79.73
XGB    g7  82.83    76.84    82.86   76.84   79.74
NB     g7  47.56    43.55    67.91   43.55   53.07

Furthermore, we conducted a comparative analysis in Table 22. This analysis
showed that the results yielded by the methodologies presented in this paper
are superior to existing frameworks. For instance, in the case of binary
classification, the TAC obtained by the GA-RF-f3 (proposed in this work) was
11.51% higher than the work presented in [25], 12.1%

higher than the method in [33] and 3.71% greater than the TAC obtained in
[26]. In the case of the multiclass classification process, the GA-ET-g5
obtained a TAC that is 5.11% greater than the TAC obtained in [33] and 1.87%
higher than the TAC obtained in [34]. Furthermore, the methods that were
proposed in this research were superior to the DL-based algorithms that were
reviewed in the literature. For instance, the GA-RF achieved a TAC that is
2.19% higher than the TAC obtained by the LSTM method in [36]. In comparison
to the TAC obtained by the LSTM approach in [37], the GA-RF attained a TAC
that is 6.89% higher. Additionally, the GA-RF achieved a higher TAC in
comparison to the CNN-RNN presented in [38]. Additionally, the GA-RF
presented in this paper achieved an accuracy that is superior to existing
research. For instance, for the two-way classification task, it achieved a
TAC that is 0.9% higher than the performance obtained by the GA-RF in [39].
For the multiclass classification procedure, it obtained an accuracy that is
13.34% higher than the score obtained by the GA-RF in [39].

FIGURE 5. Confusion Matrix for g5 results

Moreover, a performance analysis of prediction time was conducted between
different models that used the most optimal feature vectors. In the instance
of the binary classification, the vector that yielded the most optimal TAC is
f3. The graph in Figure 6 shows that the DT model is the most efficient
method in terms of prediction time (18.3 milliseconds) when using f3. For the
multiclass classification process, the vector that achieved the highest TAC
is g5. The plot in Figure 7 demonstrates that the NB (7.96 milliseconds)
method was the most efficient one in terms of prediction time when utilizing
g5. However, the NB did not obtain a satisfactory TAC.
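The text does not describe how the prediction times were measured. One
plausible approach, shown here purely as an assumption, is to time repeated
calls to a fitted model's prediction routine with a monotonic clock and report
the average in milliseconds. The names `prediction_time_ms` and
`dummy_predict` are hypothetical; `dummy_predict` stands in for the `predict`
method of a trained classifier so that the sketch runs on its own.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def prediction_time_ms(predict: Callable[[Sequence], Sequence],
                       samples: Sequence,
                       repeats: int = 5) -> float:
    """Average wall-clock time of predict(samples), in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(samples)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def dummy_predict(rows):
    # Hypothetical stand-in for a fitted classifier's predict method.
    return [int(sum(r) > 1.0) for r in rows]


rows = [[0.2, 0.9], [0.1, 0.3], [0.8, 0.7]]
print(f"{prediction_time_ms(dummy_predict, rows):.4f} ms")
```

Averaging over several repeats reduces the influence of scheduling noise,
which matters when the differences between models are a few milliseconds.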

TABLE 22. Comparison with other methods

Model                           TAC - Binary  TAC - Multiclass
PSO-LightGBM [22]               86.68%        -
APAC-IELM [23]                  -             70.52%
Deep learning - DNN [25]        76.1%         65.1%
ANN [26]                        83.9%         -
GA-SVM [29]                     86.38%        -
GWO-SVM [29]                    84.48%        -
FFA-SVM [29]                    85.42%        -
IG-TS [31]                      85.78%        -
GA-LR-DT [32]                   81.42%        -
XGBoost-LR [33]                 75.51%        72.53%
SVM-NIDS [34]                   85.99%        75.77%
IG-Tree [35]                    84.83%        -
Deep learning - LSTM [36]       85.42%        -
Deep learning - LSTM [37]       80.72%        72.26%
Deep learning - CNN-RNN [38]    86.64%        -
GA - RF [39]                    86.70%        -
GA - RF [40]                    -             64.23%
GA - RF (Proposed)              87.61%        -
GA - ET (Proposed)              -             77.64%

FIGURE 6. Prediction time - Binary classification - f3

FIGURE 7. Prediction time - Multiclass classification - g5

VI. CONCLUSION
In this research, an advanced IDS for IIoT was proposed and it was evaluated
using the UNSW-NB15 dataset. This IDS was designed using multiple stages. The
first stage involved implementing the GA in conjunction with the RF model to
select the most important features to be used by the classifiers. This stage
generated two sets of feature vectors. The first feature set, V_b, included
10 feature vectors destined for the binary classification procedure. The
second feature set, V_m, contained 7 feature vectors that were used for the
multiclass modeling process. For the binary classification experiments, the
LR algorithm was applied as the baseline model and the following Tree-based
models were implemented: DT, RF, ET, and XGB. For the multiclass modeling
process, the NB was used as the baseline model alongside the same Tree-based
algorithms that were implemented for the binary intrusion detection
procedure. The results demonstrated that for the binary classification
process, the GA-RF achieved a TAC of 87.61% and an AUC of 0.98 using f3,
which contained 16 features. When modeling for the multiclass classification,
the outcomes showed that the GA-ET achieved a TAC of 77.64% using g5, which
contained 17 attributes. The results achieved by the methods proposed in this
study were superior in comparison to those achieved by the existing
methodologies. In future work, we intend to pair the GA with models such as
the SVM or ANN. We also aim to increase the performance of our proposed
approach on the minority classes of the UNSW-NB15. Furthermore, we intend to
implement the proposed methodology on the TON_IoT. This dataset contains
traffic patterns that have been mainly generated by IIoT devices.
Additionally, we intend to conduct a performance analysis of the method
proposed in this paper across multiple datasets including the NSL-KDD and the
AWID.

REFERENCES
[1] Y. Zhang, P. Li and X. Wang, “Intrusion detection for IoT based on
    improved genetic algorithm and deep belief network”, IEEE Access, vol. 7,
    pp. 31711-31722, 2019.
[2] A.S. Lalos, A.P. Kalogeras, C. Koulamas, C. Tselios, C. Alexakos and D.
    Serpanos, “Secure and safe IIoT systems via machine and deep learning
    approaches”, Sec. Qual. in Cyber-Physical Sys. Eng., pp. 443-470, 2019.
[3] B. Valeske, A. Osman, F. Römer and R. Tschuncky, “Next Generation NDE
    Sensor Systems as IIoT Elements of Industry 4.0”, Research in
    Nondestructive Evaluation, vol. 31, no. 5-6, pp. 340-369, 2020.
[4] R. Schiekofer, A. Scholz and A. Weyrich, “REST based OPC UA for the
    IIoT”, In IEEE 23rd Int. Conf. Emerg. Tech. Factory Automation (ETFA),
    Sept. 2018, vol. 1, pp. 274-281.
[5] A. Meddeb, “Internet of Things standards: Who stands out from the
    crowd?” IEEE Commun. Mag., vol. 54, no. 7, pp. 40-47, Jul. 2016.
[6] N. Koroniotis, N. Moustafa and E. Sitnikova, “A new network forensic
    framework based on deep learning for Internet of Things networks: A
    particle deep framework”, Future Gen. Comput. Sys., vol. 110, pp. 91-106,
    2020.
[7] A. Khraisat, I. Gondal, P. Vamplew and J. Kamruzzaman, “Survey of
    intrusion detection systems: techniques, datasets and challenges”,
    Cybersecurity, vol. 2, no. 20, 2019.
[8] S. Dua and X. Du, “Data mining and machine learning in cybersecurity”,
    CRC Press, 2016.
[9] X.D. Zhang, “Machine learning”, In A Matrix Algebra Approach to
    Artificial Intelligence, pp. 223-440, 2020.


[10] M. Mohammed, M.B. Khan and E.B.M. Bashier, “Machine learning:
    algorithms and applications”, CRC Press, 2016.
[11] T. Janarthanan and S. Zargari, “Feature selection in UNSW-NB15 and
    KDDCUP99 datasets”, In IEEE 26th Int. Symp. Industrial Electron. (ISIE),
    Jun. 2017, pp. 1881-1886.
[12] H. Gharaee and H. Hosseinvand, “A new feature selection IDS based on
    genetic algorithm and SVM”, In 8th Int. Symp. Telecommu. (IST) IEEE,
    Sept. 2016, pp. 139-144.
[13] R. Wald, T.M. Khoshgoftaar and A. Napolitano, “Stability of filter-and
    wrapper-based feature subset selection”, In 25th Int. Conf. Tools with
    Artificial Intell. IEEE, Nov. 2013, pp. 374-380.
[14] M. Shafiq, Z. Tian, A.K. Bashir, X. Du and M. Guizani, “IoT malicious
    traffic identification using wrapper-based feature selection
    mechanisms”, Computers & Security, vol. 94, p. 101863, 2020.
[15] M.A. Siddiqi and W. Pak, “Optimizing Filter-Based Feature Selection
    Method Flow for Intrusion Detection System”, Electronics, vol. 9, no. 12,
    p. 2114.
[16] S. Ding, X. Xu, H. Zhu, J. Wang and F. Jin, “Studies on optimization
    algorithms for some artificial neural networks based on genetic algorithm
    (GA)”, JCP, vol. 6, no. 5, pp. 939-946, 2011.
[17] P. Probst, M.N. Wright and A.L. Boulesteix, “Hyperparameters and tuning
    strategies for random forest”, Wiley Interdisciplinary Reviews: Data
    Mining and Knowledge Discovery, vol. 9, no. 3, p. e1301, 2019.
[18] V. Kumar, A.K. Das and D. Sinha, “Statistical analysis of the UNSW-NB15
    dataset for intrusion detection”, In Comput. Intell. Pattern Recogn.,
    pp. 279-294, 2020.
[33] S.M. Kasongo and Y. Sun, “Performance Analysis of Intrusion Detection
    Systems Using a Feature Selection Method on the UNSW-NB15 Dataset,”
    Journ. Big Data, vol. 7, no. 1, pp. 1-20, 2020.
[34] D. Jing and H.B. Chen, “SVM based network intrusion detection for the
    UNSW-NB15 dataset”, In IEEE 13th Int. Conf. on ASIC (ASICON), pp. 1-4,
    Oct. 2019.
[35] V. Kumar, D. Sinha, A.K. Das, S.C. Pandey and R.T. Goswami, “An
    integrated rule based intrusion detection system: analysis on UNSW-NB15
    data set and the real time online dataset,” Cluster Comput., vol. 23,
    no. 2, pp. 1397-1418, 2020.
[36] A. Aleesa, M. Younis, A.A. Mohammed and N. Sahar, “Deep Intrusion
    detection system with enhanced UNSW-NB15 dataset based on deep learning
    techniques,” Journ. Eng. Sc. Techn., vol. 16, no. 1, pp. 711-727, 2021.
[37] A.V. Elijah, A. Abdullah, N. Jhanjhi, M. Supramaniam and B. Abdullateef,
    “Ensemble and deep-learning methods for two-class and multi-attack
    anomaly intrusion detection: An empirical study,” Int. J. Adv. Comput.
    Sci. Appl., vol. 10, pp. 520-528, 2019.
[38] P. Wu, H. Guo and N. Moustafa, “Pelican: A deep residual network for
    network intrusion detection,” In 50th Annual IEEE/IFIP Int. Conf. on
    Dependable Syst. Netw. Workshops (DSN-W), pp. 55-62, June 2020.
[39] A. Assiri, “Anomaly classification using genetic algorithm-based random
    forest model for network attack detection,” CMC-Computers Materials &
    Continua, vol. 66, no. 1, pp. 767-778, 2021.
[40] C. Khammassi and S. Krichen, “A NSGA2-LR wrapper approach for feature
    selection in network intrusion detection,” Comput. Ntw., vol. 172,
    p. 107183, 2020.
[19] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for
[41] Z. Liu, “A method of SVM with normalization in intrusion detection”,
network intrusion detection systems (UNSW-NB15 network data set)”,in
Procedia Env. Sc., vol. 11, pp.256-262, 2011.
Military commun. Inf. Sys. Conf.(MilCIS) IEEE, Nov. 2015, pp. 1-6.
[42] H. Sharma and S. Kumar, “A survey on decision tree algorithms of
[20] N. Moustafa and J. Slay, “The evaluation of Network Anomaly Detection classification in data mining”, Int. Journ. Sc. Research (IJSR), vol. 5, no.
Systems: Statistical analysis of the UNSW-NB15 data set and the compar- 4, pp.2094-2097, 2016
ison with the KDD99 data set”, Inf. Sec. Journ. A Global Perspective, vol. [43] J. Liang, Z. Qin, S. Xiao, L. Ou and X. Lin, “Efficient and secure
25, no. 1-3, pp.18-31, 2016. decision tree classification for cloud-assisted online diagnosis services”,
[21] A. Alsaedi, N. Moustafa, Z. Tari, A. Mahmood and A. Anwar, “TON_IoT IEEE Trans. Dependable Sec. Comput., 2019.
telemetry dataset: a new generation dataset of IoT and IIoT for data-driven [44] L. Breiman, “Random forests”, Machine Learning, vol. 45, pp. 5–32, 2001
Intrusion Detection Systems,”. IEEE Access, vol. 8, pp.165130-165150, [45] A. Liaw and M. Wiener, “Classification and regression by randomForest”,
2020. R news, vol. 2, no. 3, pp.18-22, 2002
[22] J. Liu, D. Yang, M. Lian and M. Li, “Research on Intrusion Detection [46] G. Biau and E. Scornet, “A random forest guided tour”, Test, vol. 25, no.
Based on Particle Swarm Optimization in IoT”, IEEE Access, vol.9, 2, pp.197-227, 2016
pp.38254-38268, 2021. [47] P. Geurts, D. Ernst and L. Wehenkel, “Extremely randomized trees. Ma-
[23] X. Zhou, Y. Hu, W. Liang, J. Ma and Q. Jin, “Variational LSTM enhanced chine learning”, vol. 63, no. 1, pp.3-42, 2006.
anomaly detection for industrial big data”, IEEE Trans. on Industrial Inf., [48] L. Davis, “Handbook of genetic algorithms”, 1991.
vol. 17, no.5, pp.3469-3477, 2020. [49] X. Yu and M. Gen, “Introduction to evolutionary algorithms”, Springer
[24] J. Gao, S. Chai, B. Zhang and Y. Xia, “Research on network intrusion Science & Business Media, 2010.
detection based on incremental extreme learning machine and adaptive [50] P. Tao, Z. Sun and Z. Sun, “An improved intrusion detection algorithm
principal component analysis,” Energies, vol. 12, no. 7, p.1223, 2019. based on GA and SVM”, IEEE Access, Vol. 6, pp.13624-13631, 2018.
[25] R. Vinayakumar, M. Alazab, K.P. Soman, P. Poornachandran, A. Al- [51] M. Almseidin, M. Alzubi, S. Kovacs and M. Alkasassbeh, “Evaluation of
Nemrat and S. Venkatraman, “Deep learning approach for intelligent machine learning algorithms for intrusion detection system”, In IEEE 15th
intrusion detection system”, EEE Access, vol. 7, pp.41525-41550, 2019. Int. Symp. Intell. Sys. Inf. (SISY), Sept. 2017, pp. 000277-000282.
[26] S. Hanif, T. Ilyas and M. Zeeshan, M., “Intrusion detection in iot using [52] S. Narkhede, “Understanding auc-roc curve,” Towards Data
artificial neural networks on unsw-15 dataset”, In 16th Int. Conf. on Smart Science, https://towardsdatascience.com/understanding-auc-roc-curve-
Cities: Improving Quality of Life Using ICT & IoT and AI (HONET-ICT) 68b2303cc9c5 (accessed Apr. 04, 2021).
IEEE, Oct. 2019, pp. 152-156. [53] Scikit-Learn, “Machine Learning in Python”. Accessed on: May. 3, 2020.
[27] N. Moustafa and J. Slay, “The evaluation of Network Anomaly Detection [Online]. Available: https://scikit-learn.org/stable/
Systems: Statistical analysis of the UNSW-NB15 data set and the compar- [54] A. De Caigny, K. Coussement and K.W. De Bock, “A new hybrid
ison with the KDD99 data set,” Information Security Journal: A Global classification algorithm for customer churn prediction based on logistic
Perspective, vol. 25, no. 1-3, pp.18-31, 2016. regression and decision trees” European Journal of Operational Research,
[28] E. Ketzaki, A. Drosou, S. Papadopoulos and D. Tzovaras, “A light- vol. 269,no. 2, pp.760-772, 2018.
weighted ANN architecture for the classification of cyber-threats in mod- [55] M.M. Saritas and A. Yasar, “Performance analysis of ANN and Naive
ern communication networks,”. In 10th Int. Conf. Netw. of the Future Bayes classification algorithm for data classification”, Int. Journ. Intell.
(NoF) IEEE, Oct. 2019, pp. 17-24. Sys. Appl. Eng, vol. 7, no. 2, pp.88-91, 2019.
[29] O. Almomani, “A feature selection model for network intrusion detection
system based on PSO, GWO, FFA and GA algorithms,” Symmetry, vol. 12,
no. 6, p.1046, 2020.
[30] A. Nazir and R.A. Khan, “A novel combinatorial optimization based
feature selection method for network intrusion detection”, Computers &
Security, vol. 102, p.102164, 2021
[31] W. Zong, Y.-W. Chow, andW. Susilo, “A two-stage classifier approach for
network intrusion detection,” in Proc. Int. Conf. Inf. Secur. Pract. Exper.
Cham, Switzerland: Springer, 2018, pp. 329-340.
[32] C. Khammassi and S. Krichen, “A GA-LR wrapper approach for feature
selection in network intrusion detection”, computers & security, vol. 70,
pp.255-277, 2017.
