This document discusses using machine learning techniques for intrusion detection in SCADA systems. It begins with background on SCADA systems and common network attack vectors. The contributions of the paper are then outlined as assessing the performance of support vector machines, random forests, and bidirectional long short-term memory networks for intrusion detection using a real SCADA dataset. Experimental results on data normalization techniques and missing data imputation strategies are also discussed to evaluate the machine learning models.

Machine Learning for Reliable Network Attack Detection in SCADA Systems

Rocio Lopez Perez∗, Florian Adamsky†, Ridha Soua†, and Thomas Engel†
∗ [email protected], CSC, University of Luxembourg, Luxembourg
† {name.surname}@uni.lu, SnT, University of Luxembourg, Luxembourg

Abstract—Critical Infrastructures (CIs) use Supervisory Control And Data Acquisition (SCADA) systems for remote control and monitoring. For a long time, operators of CIs applied the air gap principle, a security strategy that physically isolates the control network from other communication channels. True isolation, however, is difficult nowadays due to the massive spread of connectivity: the use of open protocols and increased connectivity exposes CIs to new network attacks. To cope with this dilemma, sophisticated security measures are needed to address malicious intrusions, which are steadily increasing in number and variety. Traditional Intrusion Detection Systems (IDSs) cannot detect attacks that are not already present in their databases. In this paper, we assess Machine Learning (ML) for intrusion detection in SCADA systems using a real data set collected from a gas pipeline system and provided by Mississippi State University (MSU). The contribution of this paper is two-fold: 1) the evaluation of four techniques for missing data estimation and two techniques for data normalization; 2) an assessment of the performance of Support Vector Machine (SVM), Random Forest (RF) and Bidirectional Long Short Term Memory (BLSTM) in terms of accuracy, precision, recall and F1 score for intrusion detection. Two cases are differentiated: binary and categorical classifications. Our experiments reveal that RF and BLSTM detect intrusions effectively, with an F1 score of respectively > 99% and > 96%.

I. INTRODUCTION AND PROBLEM STATEMENT

Supervisory Control And Data Acquisition (SCADA) systems are commonly used by Critical Infrastructures (CIs) or industries which are vital to citizens’ daily lives and countries’ economies. These include oil pipelines, water treatment, and chemical manufacturing plants, to name but a few. Typically, SCADA systems consist of (1) field instrument devices for sensing conditions of the CI (power level, pressure, throughput, etc.); (2) operating equipment such as valves, pumps, etc. controlled by actuators; (3) field local processors such as Programmable Logic Controllers (PLCs) and Remote Terminal Units (RTUs) that communicate with field instrument devices and operating equipment; and finally (4) the Human Machine Interface (HMI) that acts as a central controller and monitoring host. To operate properly in a synchronized manner, these different components must communicate. While short-range communications are used to establish links between local processors, instrument devices and operating equipment, long-range communications are used to connect PLCs and RTUs with the HMI or the Master Terminal Unit (MTU).

Historically, SCADA systems implemented a security principle known as air gap, a strategy that physically isolates the control network from the rest of the network, including the Internet. True isolation, however, is difficult in a real-world environment. First, true isolation may lead to outdated software [1], [2]. Without connectivity to the Internet, the software cannot easily receive security updates from the vendor. Second, true isolation is hard to implement since CIs are often geographically distributed. To avoid the high costs of laying direct fiber cable to substations, CI operators make use of radio, Asymmetric Digital Subscriber Line (ADSL), General Packet Radio Service (GPRS), or leased lines. Moreover, malware like Stuxnet [3] or Flame [4] has shown us that even a USB flash drive can provide connectivity to the outside world. Besides the air gap principle, SCADA systems have made use of proprietary software, hardware, and communication protocols, which have provided a false sense of security through obscurity [1].

Nowadays, the use of standardized communications protocols has enabled the integration of SCADA systems with the Internet and corporate networks. Given this new context, SCADA systems are prone to numerous threats due to their large deployment areas, distributed operating mode and growing interconnectivity [5]. Indeed, the widespread use of the TCP/IP stack has led to its adoption in SCADA systems. Modicon Communication Bus (Modbus) TCP, Distributed Network Protocol (DNP3) [6], and IEC 60870-5-104 are the main communication protocols used. These protocols were designed over twenty years ago and are known to be highly vulnerable to simple network attacks [7]–[10]. Mirian et al. [11], using Internet-wide scanners such as ZMap [12], identified 60,000 vulnerable SCADA devices connected to the Internet. Clearly, these protocol stacks are subject to increasing risks. This can also be seen in the cyberattacks against the Ukrainian power grid in 2015, where 225,000 people were left without electricity. These attacks were the first that resulted in a power outage [13].

Our contributions: In this paper, we focus on assessing the performances of Machine Learning (ML) techniques such as Support Vector Machine (SVM), Random Forest (RF), and Bidirectional Long Short Term Memory (BLSTM) in detecting intrusion in SCADA systems. Section II lays out the foundation of the SCADA architecture and the ML algorithms used. We analyze SCADA protocols from monthly Internet-wide scans and see an increasing number of SCADA services reachable and attackable over the Internet. Section III describes the data set and the experimental setup in detail. In Section IV, we analyze four missing data strategies and
two data normalization techniques, characterizing the performances of the ML algorithms in terms of accuracy, precision, recall and F1 score for binary and categorical classification. We describe related works in Section V and compare them with our approach. Finally, we conclude in Section VI and give directions for future research.

II. TECHNICAL BACKGROUND

In this section, we provide a brief overview of the SCADA architecture, its network protocols, and the ML algorithms that we have used in this work. While discussing the technical background, we also highlight the vulnerabilities that exist in SCADA protocols.

A. Attack Vectors on SCADA

As described in Section I, adversaries can often reach the control system from the Internet, because the air gap principle is no longer applicable in modern SCADA networks [1]. Most of these networks are geographically distributed. Hence, they need to be connected to the HMI, either via ADSL, GPRS, or leased lines. All of these connections can be used to gain access to the control system.

After an attacker has gained access to the network, there are three attack vectors against a SCADA protocol: first, by exploiting vendor-specific implementation faults like memory-corruption bugs; second, by exploiting weaknesses in the infrastructure like missing or inadequate firewall rules; and third, by exploiting protocol-specific weaknesses in the specification. In this paper, we focus on the third attack vector. An attacker wanting to exploit SCADA protocol weaknesses has four general attacks to choose from [7], as shown in Figure 1:

1) Interception: An attacker is able to analyse the network traffic and gather information about the network infrastructure;
2) Interruption: An attacker intercepts packets and does not forward them to the next node;
3) Modification: The attacker is a man-in-the-middle (MitM) modifying packets in a network stream;
4) Fabrication: An attacker is able to inject packets into the network.

Figure 1 depicts a simplified SCADA architecture in which an attacker (red square) has gained access to the network. All four attacks can target the HMI (a), the network infrastructure itself (b), or the RTU/PLC (c). The field devices shown in Figure 1 are sensors and actuators. A sensor monitors the environment, e.g. the pressure of a gas pipeline, and sends the information to the next higher level; an actuator, in contrast, receives commands to control the environment, e.g. opening and closing a valve. The RTU or PLC controls and monitors the field devices, building a substation. One advantage of the SCADA architecture is that substations can be geographically distributed; this is often a necessity for a CI. The control centre is located in a different physical location and contains the HMI which monitors and controls the RTUs and PLCs. The RTUs/PLCs are connected to the HMI via communication links such as radio, fibre-optics, or dial-up lines. All information converges to the HMI or SCADA master, which is monitored and controlled by an employee.

[Figure 1: an attacker (A) with four attacks (1: Intercept, 2: Interrupt, 3: Modify, 4: Fabricate) aimed at (a) the Human Machine Interface, (b) the communication network, and (c) a substation made up of three RTUs/PLCs, each with its sensors and actuators.]

Fig. 1. Attack model demonstrating four network attacks, denoted as (1–4), against a simplified SCADA architecture with three attack targets (a–c) based on [7].

III. ANOMALY DETECTION IN SCADA SYSTEMS: DATA SET AND METHODOLOGY

To investigate the merits of ML-based techniques for anomaly detection in SCADA systems, we use a real-world gas pipeline data set in our experiments. We now describe the data set in detail, as well as the different steps of our methodology for anomaly detection.

A. The Gas Pipeline Data Set

The SCADA data set used in this work is hosted on the Industrial Control System (ICS) Cyber Attack Data Sets [26] website. The real-world raw data was generated using a gas pipeline system provided by the in-house SCADA lab of Mississippi State University (MSU). It contains a total of 274,628 instances.

The methodology for the data set collection is described in the study carried out by Turnipseed [27]. The data set, provided in the Attribute-Relation File Format (ARFF), is used to create ML models once it has been pre-processed. It contains 20 features from Modbus RTU packets, three different types of labels and also pure raw data, which is provided to aid in the pre-processing stage. Table I lists the features and their corresponding types.

The address feature is a unique eight-bit value used for device identification. It is assigned to each master and slave device, allowing them to recognize each other while establishing a communication. This feature is used to overcome scan attacks which broadcast commands to all possible station addresses to determine which addresses are in use. The second feature is the function code. Some function codes can be used
for malicious purposes (DoS attack), such as ‘0x08’, which can be used to force a slave device to stay in listening mode. The length field gives the Modbus frame length. This feature may help detect attacks by identifying frames which are not of an ordinary length. The set point feature is the most critical, since it controls the pressure in the gas pipeline when the system is in automatic mode.

Other features such as gain, reset rate, dead band, cycle time, and rate allow the PID controller to open and/or close the gas valve as well as turn on and/or turn off the pump, based on a calculated error value. The system mode, which represents how the system is operating, may have three possible values: (1) off or inactive, (2) manual configuration or (3) automatic configuration. The control scheme feature determines whether the gas pipeline system will be controlled by the pump or by the solenoid. The pump field controls the pump state when the system mode is set to manual. If an adversary were able to change the gas pipeline system mode to manual and turn the pump on, the system would become over-pressurized.

The pressure measurement feature provides the gas pressure measurement value provided by a pressure gauge attached to the pipeline. An attacker could use this feature to provide false measurements emulating fabricated behaviours in the system. An adversary may perform an attack by constantly transmitting a bad Cyclic Redundancy Check (CRC) to cause a Denial-of-Service (DoS) attack. The command response feature, as its name indicates, helps the Intrusion Detection System (IDS) to differentiate between commands and responses. This feature, along with the timestamp, the binary result, the categorized result and the specific result features, was not parsed from the Modbus RTU frame itself, but from Modbus TCP/IP traffic.

TABLE I
LIST OF FEATURES FROM THE GAS PIPELINE DATA SET.

Nr.  Feature               Type
1    Address               Network
2    Function              Command Payload
3    Length                Network
4    Setpoint              Command Payload
5    Gain                  Command Payload
6    Reset Rate            Command Payload
7    Deadband              Command Payload
8    Cycle Time            Command Payload
9    Rate                  Command Payload
10   System Mode           Command Payload
11   Control Scheme        Command Payload
12   Pump                  Command Payload
13   Solenoid              Command Payload
14   Pressure Measurement  Response Payload
15   CRC rate              Network
16   Command Response      Network
17   Time                  Network
18   Binary Result         Label
19   Categorized Result    Label
20   Specific Result       Label

As discussed in Section I, SCADA systems are a focus of attention for cyber-attacks. The MSU’s in-house SCADA lab used seven categories of attacks which were previously developed in Gao’s research [28] to provide a broader perspective on the attacks that SCADA systems may suffer.

TABLE II
DESCRIPTION, CATEGORY AND TYPE OF THE ATTACKS.

Description                  Category            Attack Type       #
Naive Response Injection     Response Injection  Modify/Fabricate  7,753
Complex Response Injection   Response Injection  Modify/Fabricate  13,035
State Command Injection      Command Injection   Modify/Fabricate  7,900
Parameter Command Injection  Command Injection   Modify/Fabricate  20,412
Function Code Injection      Command Injection   Modify/Fabricate  4,898
Denial of Service            Denial of Service   Interrupt         2,176
Reconnaissance               Reconnaissance      Intercept         3,874

These attacks, set out in Table II, are the result of one or a series of external malicious activities through Modbus RTU packets. For each attack, Table II gives a description, a category and an attack type according to our attack model in Figure 1.

B. Methodology

Developing an ML-based IDS for intrusion detection in SCADA systems requires the steps illustrated in Figure 2. In some cases attribute values (“features”) were missing from the data set used in our experiments. As these values are useful in prediction modelling, the first phase of our approach cleans and transforms the data to eliminate incomplete records. Next, to train our models, we followed the holdout method and split each of the sixteen data sets (four missing-data strategies, two normalization methods and two classification types) into training, validation and test sets containing respectively 60% (164,776 instances), 20% (54,926 instances) and 20% (54,926 instances) of the observations. The validation set was pre-processed using the statistics obtained from the training set, and the test set using the statistics obtained from the combined training and validation sets. Because the parameters of the prior distribution, called hyperparameters, may significantly impact the performance of ML methods, we performed a hyperparameter search for the selected ML algorithms. Given that the data set is comprised of normal traffic and variants of attack types, we distinguish two classifiers: binary classification (normal, anomaly) and seven-category classification (see the attacks listed in Table II).

1) Data Cleaning: We observed that many feature values were missing or non-existent. Table III shows the first three rows of the data set in ARFF. Addr, funct and c/r refer respectively to the address, function and command response features.

Table III presents three different types of payloads where data is missing: 1) all values are missing or nonexistent; 2) only the pressure measurement is present; and 3) all values except the pressure measurement are present. To handle the feature values in the data set that do not have any representation or meaning, we used four techniques:

Gaussian Mixture Model (GMM) can find the best number k of Gaussian distributions needed to cluster our data. To this end, the algorithm finds the best mean or centre µ and variance σ² of the Gaussian distributions that best separate our data.

K-means allows us to find the best number k of clusters by computing the Euclidean distance between the given
TABLE III
EXAMPLES OF THE MISSING VALUES IN THE GAS PIPELINE DATA SET.

Address  Function  Length  Payload                        CRC    C/R  Timestamp
4        3         16      ?,?,?,?,?,?,?,?,?,?,?          12869  1    1418682163.170388
4        3         46      ?,?,?,?,?,?,?,?,?,?,0.689655   12356  0    1418682163.269946
4        16        90      10,115,0.2,0.5,1,0,0,1,0,0,?   17219  1    1418682164.995590
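
The three rows of Table III show the payload patterns that the missing-data strategies of this section have to handle. As a rough illustration only, the following Python sketch applies two of those strategies, zero imputation with indicators and keep prior value (forward-filling, then backward-filling), to these rows. The function names, the use of `None` for the ARFF `?` marker, and the choice to fill each payload position independently across consecutive packets are assumptions of this sketch, not details taken from the authors' code.

```python
from typing import List, Optional

Row = List[Optional[float]]

# Payload columns of the three rows in Table III; "?" marks a missing value.
RAW_PAYLOADS = [
    "?,?,?,?,?,?,?,?,?,?,?",
    "?,?,?,?,?,?,?,?,?,?,0.689655",
    "10,115,0.2,0.5,1,0,0,1,0,0,?",
]


def parse_payload(text: str) -> Row:
    """Turn an ARFF-style payload string into floats, with None for '?'."""
    return [None if tok == "?" else float(tok) for tok in text.split(",")]


def zero_impute_with_indicators(row: Row) -> List[float]:
    """Replace missing values with 0 and append one indicator per feature.

    The indicator is 1 where the value was missing and 0 where it existed.
    """
    values = [0.0 if v is None else v for v in row]
    indicators = [1.0 if v is None else 0.0 for v in row]
    return values + indicators


def keep_prior_value(rows: List[Row]) -> List[Row]:
    """Forward-fill each column over time, then backward-fill leading gaps."""
    filled = [list(r) for r in rows]
    for col in range(len(filled[0])):
        last = None
        for r in filled:  # forward pass: reuse the previous existing value
            if r[col] is None:
                r[col] = last
            else:
                last = r[col]
        nxt = None
        for r in reversed(filled):  # backward pass: no earlier value existed
            if r[col] is None:
                r[col] = nxt
            else:
                nxt = r[col]
    return filled


rows = [parse_payload(p) for p in RAW_PAYLOADS]
zero_rows = [zero_impute_with_indicators(r) for r in rows]
kept_rows = keep_prior_value(rows)
```

With only these three rows, keep prior value ends up giving every packet the same eleven payload values, because each column has exactly one observed value to propagate. On the full data set the fill would instead follow the most recent packet that carried the feature.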

samples and a pre-assigned centroid point, assigning them to a certain cluster and updating the centroids of the clusters until convergence on the best separation of the data.

In both the GMM and K-means techniques, the first payload type was considered as cluster k = 0, and the second and third payload types were assigned to the k clusters defined by the elbow method. This method determined the best number of clusters based on the cost function or distortion:

    \text{distortion} = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2.    (1)

Lower values of the distortion determine a preferable number k of clusters and thus better data separation. With this strategy, payloads are classified into k clusters, which are represented in the pre-processed data in one-hot encoded notation. One-hot encoding is a process of converting categorical variables into a form more suitable for ML algorithms.

Zero imputation & indicators is a technique in which we substituted missing values with 0 and indicated their positions by adding corresponding indicators with 1 values to the payload feature. If the feature value existed in the payload then the value was kept and the indicator set to 0 [29].

Keep prior value, also known as forward-filling, deals with the non-existent values by replacing them with the immediately preceding existing feature value. In the case where forward-filling is not possible due to a lack of existing prior feature values, backward-filling is conducted. The intuition behind this technique is that the missing values are not due to data loss but simply cannot exist, since the type of the packet does not support these features. Therefore, they appear in the data as non-existent values and they may be inferred from previously seen feature values.

[Figure 2: flow chart. The data set is split into a training set, a validation set and a test set. Pre-processing (data cleaning, statistics, data transformation) is carried out once on the training set and once on the combined training and validation sets. The validation set feeds the hyperparameter search, which yields the SVM, RF and BLSTM models; the test set is then used for classification.]

Fig. 2. Flow chart diagram illustrating the steps of our work pipeline.

2) Data Transformation: This step was conducted using two alternative methods: mean-standard deviation and min-max. The mean-standard deviation method consists of subtracting the calculated overall mean and dividing by the calculated overall standard deviation for each of the values within a certain feature. Thus,

    z_i = \frac{x_i - \mu}{\sigma},    (2)

where x_i is a feature value, µ is the mean, and σ is the standard deviation. This pre-processing strategy centres each feature on a mean of zero and scales it to unit standard deviation. The second method is the min-max approach, which consists of finding the minimum and maximum value of a given feature and normalizing the feature values between 0 and 1. Hence,

    z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)},    (3)

where x_i is a feature value, and min(x) and max(x) are the minimum and maximum values calculated from the overall feature values.

3) Hyperparameter Search: In an SVM, the hyperparameters C and γ must be correctly set for each of the sixteen data sets. Hence, we performed a random search to determine the best hyperparameters for our models. Although grid search and manual search are the most widely used techniques for hyperparameter optimization, it has been empirically and theoretically demonstrated that randomly chosen trials are more efficient [30].

For each of the sixteen pre-processed data sets, we ran thirty different prediction trials over the corresponding validation set during the hyperparameter search. The seven most notable results are analyzed to investigate how the algorithms converge to a good result after the best hyperparameters are found. Due
to the long training time of SVMs, we used only 25% of the entire data set.

In RF, the hyperparameters number of estimators and maximum depth of the trees must be correctly set for each of the sixteen data sets. Once again, we performed a random search, through thirty different prediction trials, to define the best hyperparameters for these models.

In BLSTM, the hyperparameters learning rate, batch size, sequence length, dropout and hidden layer size must be correctly set for each of the sixteen data sets. Again, we conducted a random search, running each trial for fifty epochs (a training parameter of the BLSTM), to define the best hyperparameters for these models. For each data set, we ran thirty different predictions over the corresponding validation set during the hyperparameter search. The seven most significant results are used in this study to show how the algorithm converges once the best hyperparameters are found.

4) Classification: In this step, models are created with the aim of classifying novel observations on a set of predefined classes. If only two possible classes exist, then it is called binary classification. In contrast, if more than two classes are differentiated, it is called multi-class classification. In the context of this work, a classification task is performed to correctly classify benign and malicious packets. The trained model outputs 0 or 1 in the binary classification approach, and a class index from 0 to n in the multi-class classification approach.

IV. DETECTING INTRUSION IN SCADA: EXPERIMENT AND ANALYSIS OF RESULTS

We developed our classification scripts with Scikit-learn¹, TensorFlow² and Keras³. Our source code is available on GitHub [31]. In the following, we evaluate our test results, together with the performance results of SVM, RF and BLSTM for anomaly detection using the gas pipeline system data set.

¹ http://scikit-learn.org/
² https://www.tensorflow.org/
³ https://keras.io/

A. Anomaly Detection Results

We split each of the sixteen data sets into training, validation and test sets according to the division in Section III-B. Once we obtain the best configuration for a given classifier, the validation set is combined with the training set, leaving a final split of 80% of the observations in the training set and 20% in the test set. Two classifiers were used to study the performance of the different SVM, RF and BLSTM-based IDS models: binary (normal, anomaly) and categorical (see the attacks listed in Table II). We denote these respectively by “BIN” and “CAT”. For each experiment, we compared the performance of each ML technique under the mean-standard deviation (MEAN) and min-max (MIN-MAX) approaches.

1) SVM Performance: Figures 3a, 3d, 3g, and 3j show the performance of SVMs for the binary and categorical classifiers and for the MEAN and MIN-MAX data normalization. As we can see, the BIN classifier achieves a better F1 score in both cases of data normalization (MEAN and MIN-MAX) using the Keep prior value strategy. Indeed, the lowest F1 score for binary classification is 92.04% (see Figure 3g), while for CAT classification this value drops to 88.45% (see Figure 3j). The worst performance in terms of F1 score for both classifiers was obtained by the GMM and K-means algorithms. The Zeros & Indicators method performs better than GMM but worse than Keep prior value. For both binary and categorical classifiers, the MEAN normalization strategy outperforms MIN-MAX normalization. Table IV summarizes the results, highlighting the best BIN and CAT SVM classifiers obtained with the split criterion of 80% for the training set and 20% for the test set, and using the hyperparameters that gave us the best performance. We obtained an F1 score of 94.34% for the BIN classifier and an F1 score of 92.50% for the CAT classifier. Both were achieved using MEAN normalization and the Keep prior value strategy to deal with missing values.

TABLE IV
BEST BINARY AND CATEGORICAL CLASSIFIERS MODELLED WITH SVM.

SVM                      Hyper-parameters  Measurements
Test sets                C        gamma    Acc     Prec    Recall  F1-score
binary-mean-keep         346.219  0.3975   94.36%  94.33%  94.36%  94.34%
binary-minmax-keep       579.161  0.6270   92.78%  92.91%  92.78%  92.83%
categorical-mean-keep    107.411  0.2689   92.56%  92.47%  92.56%  92.50%
categorical-minmax-keep  536.672  0.7150   89.70%  90.50%  89.70%  89.97%

2) RF Performance: Figures 3b, 3e, 3h, and 3k present the contrasting configurations in a binary and categorical classifier modelled with the RF algorithm. The highest F1 score was achieved by the binary classifier: 99.40% with the MIN-MAX technique for data normalization and the Keep prior value approach for dealing with missing data.

Table V shows the final results obtained using the best hyperparameters and the 80%–20% split criterion for the training and test sets. We obtained an F1 score of 99.58% for BIN and an F1 score of 99.41% for CAT. It is worth mentioning that for the final results, the difference between the MEAN and MIN-MAX normalization strategies is very small: as shown in Table V, the difference in F1 score is 0.02 percentage points for binary classification and 0.03 percentage points for categorical classification. Therefore, similar results can be achieved with both normalization strategies.

TABLE V
BEST BINARY AND CATEGORICAL CLASSIFIERS MODELLED WITH RANDOM FOREST ALGORITHM; NE AND MD CORRESPOND TO NUMBER OF ESTIMATORS AND MAXIMUM DEPTH.

Random Forest            Hyper-parameters  Measurements
Test sets                ne   md           Acc     Prec    Recall  F1-score
binary-mean-keep         47   49           99.58%  99.58%  99.58%  99.58%
binary-minmax-keep       44   71           99.56%  99.57%  99.56%  99.56%
categorical-mean-keep    71   80           99.41%  99.41%  99.41%  99.41%
categorical-minmax-keep  64   88           99.39%  99.39%  99.39%  99.38%

3) BLSTM Performance: In Figures 3c, 3f, 3i, and 3l, which show the results for BLSTM, the Zeros imputation & indicators strategy for dealing with missing values outperforms other techniques, such as K-means and GMM, and slightly outperforms the Keep prior value approach. This is consistent with the theory and experiments presented in
[Figure 3: twelve bar-chart panels arranged in four rows and three columns: (a) SVM BIN-MEAN, (b) RF BIN-MEAN, (c) BLSTM BIN-MEAN, (d) SVM CAT-MEAN, (e) RF CAT-MEAN, (f) BLSTM CAT-MEAN, (g) SVM BIN-MINMAX, (h) RF BIN-MINMAX, (i) BLSTM BIN-MINMAX, (j) SVM CAT-MINMAX, (k) RF CAT-MINMAX, (l) BLSTM CAT-MINMAX. x axis: configurations cfg1 to cfg7; y axis: F1 score (%). Legend: Keep prior value, GMM, Zero & Indicators, K-means.]

Fig. 3. Results for the hyperparameter search for SVM, RF, and BLSTM. The first row shows the results for the binary classification (BIN) using the mean-standard deviation normalization strategy (MEAN), the second row for the categorical classification (CAT) using MEAN. The third row shows the results for BIN using the min-max normalization strategy (MIN-MAX) and finally the fourth row for CAT using MIN-MAX. The x axis shows the different hyperparameter configurations and the y axis shows the F1 score.
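
The results in Fig. 3 come from a random search with thirty trials per data set, scored on the validation set. The following Python sketch shows the general shape of such a search using only the standard library. The sampling ranges, the log-uniform draw for C and γ, and the stand-in `toy_validation_f1` function are assumptions made for illustration; in the actual experiments the score would be the F1 score of a trained SVM, RF or BLSTM model on the validation set.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def sample_svm_params(rng: random.Random) -> Params:
    """Draw C and gamma log-uniformly (ranges are assumed, not from the paper)."""
    return {
        "C": 10 ** rng.uniform(-1, 3),      # roughly 0.1 to 1000
        "gamma": 10 ** rng.uniform(-3, 0),  # roughly 0.001 to 1
    }


def toy_validation_f1(params: Params) -> float:
    """Hypothetical stand-in for 'train on the training set, score on validation'.

    It peaks at C = 100 and gamma = 0.1 purely so the example has something to find.
    """
    dc = math.log10(params["C"]) - 2.0
    dg = math.log10(params["gamma"]) + 1.0
    return 1.0 / (1.0 + dc * dc + dg * dg)


def random_search(
    sample: Callable[[random.Random], Params],
    score: Callable[[Params], float],
    n_trials: int = 30,
    seed: int = 0,
) -> Tuple[Params, float, List[Tuple[Params, float]]]:
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = sample(rng)
        trials.append((params, score(params)))
    best_params, best_score = max(trials, key=lambda t: t[1])
    return best_params, best_score, trials


best_params, best_score, trials = random_search(sample_svm_params, toy_validation_f1)
# Seven best trials in ascending order of score, loosely analogous to cfg1..cfg7.
top_seven = sorted(trials, key=lambda t: t[1])[-7:]
```

Swapping `sample_svm_params` for a sampler over the number of estimators and maximum depth (RF), or over learning rate, batch size, sequence length, dropout and hidden layer size (BLSTM), leaves `random_search` unchanged, which is one reason random search is convenient across the three model families.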
[29]. The Table VI summarizes the results for BIN and CAT TABLE VII
BLSTM classifiers, running three hundred epochs with the C LASSIFICATION REPORT OF THE RF ALGORITHM .
best hyperparameters and using the 80%–20% split criterion. Random Forest Accuracy test data = 99.41 %
Bidirectional Long Short Term Memory outperforms SVM. Type of Data precision recall f1-score support
We obtained a F1 score of 98.39% for BIN and a F1 score of Normal 99.48 % 99.90 % 99.69 % 42953
97.68% for CAT. As shown in Table VI, for both binary and NMRI 98.14 % 96.99 % 97.56 % 1526
CMRI 98.84 % 96.40 % 97.60 % 2641
categorical classifiers, the MEAN is better than MIN-MAX. MSCI 99.28 % 98.63 % 98.96 % 1538
The difference between these two normalization strategies is MPCI 99.90 % 98.00 % 98.94 % 4101
0.77% for BIN and 1.2% for CAT classification. MFCI 98.77 % 100 % 99.38 % 967
DoS 97.54 % 95.42 % 96.47 % 415
Recon 99.61 % 97.96 % 98.78 % 786
TABLE VI avg / total 99.41 % 99.41 % 99.41 % 54927
B EST BINARY AND CATEGORICAL CLASSIFIERS MODELLED WITH BLSTM
A LGORITHM ; LR , BATCH , SEQ , DROP AND H LAYER CORRESPOND TO
LEARNING RATE , BATCH SIZE , SEQUENCE LENGTH , DROPOUT AND
TABLE VIII
HIDDEN LAYER SIZE .
C ONFUSION MATRIX OF THE RF ALGORITHM .
BLSTM Hyper-parameters Measurements
Test sets lr batch seq drop h layer Acc Prec Recall F1-score Normal NMRI CMRI MSCI MPCI MFCI DoS Recon
binary-mean-indi 0.008308 67 4 0.019025 110 98.40 % 98.40 % 98.40 % 98.39 % 42908 12 9 9 4 0 8 3 Normal
binary-minmax-indi 0.011490 121 4 0.027915 218 97.64 % 97.64 % 97.65 % 97.62 %
categorical-mean-indi 0.009908 138 4 0.032404 136 97.71 % 97.69 % 97.71 % 97.68 % 25 1480 21 0 0 0 0 0 NMRI
categorical-minmax-indi 0.013236 138 4 0.039841 254 96.57 % 96.53 % 96.57 % 96.48 % 79 16 2546 0 0 0 0 0 CMRI
20 0 0 1517 0 0 1 0 MSCI
79 0 0 2 4019 0 1 0 MPCI
0 0 0 0 0 967 0 0 MFCI
4) Results Analysis: Although BLSTM models are widely 19 0 0 0 0 0 396 0 DoS
used for time-dependent problems given their capabilities of 4 0 0 0 0 12 0 770 Recon

using forward and backward information, RF results outperform those achieved
with the BLSTM algorithm, as illustrated in Tables V and VI. This may be due
to both a lack of collective attacks and the high randomness in the
occurrence of attacks within the data set. Since the data set was generated,
the developers made sure to avoid the appearance of unintended patterns and
did not inject collective attacks. For instance, a DoS attack could be
performed as a set of packets that overwhelms the system, in which a single
packet may not mean anything to the predictor. Taken together, however, the
packets do matter and represent an attack. In our case, DoS attacks are
performed by sending Modbus packets with incorrect CRC values. We emphasize
that the data was generated, whereas in reality collective or sequential
attacks may appear. This is why it is interesting to study the BLSTM
algorithm and integrate it into a Network Intrusion Detection System (NIDS).

The results from RF, which are listed in Table VIII, show that it correctly
classifies large numbers of normal and malicious packets. The categorical
classification report in Table VII shows the detection rate for each data
type. The distinction between Complex Malicious Response Injection (CMRI)
and Naive Malicious Response Injection (NMRI) yields low recall values.
This is due to the randomness of NMRI attacks, which are likely to overlap
in values with the CMRI attacks and normal data: since a CMRI attack
consists of designing malicious packets that imitate normal behaviours, some
of these overlap with normal packets. For DoS attacks, the low detection
rate, in comparison with the rest of the attacks, is due to the bad CRC
attack. This attack injects an invalid CRC value in a write multiple
register command, which makes the RTU disregard the command, in turn causing
a DoS. The Random Forest algorithm was able to accurately classify the write
command with the incorrect CRC value as an attack, but some responses from
the RTU were not classified as a DoS attack.

V. RELATED WORK

SCADA systems were originally designed following the air gap principle and
therefore without security measures in mind [1]. Nowadays, these systems are
in the spotlight of network attacks, due to standardization and connectivity
to the Internet [2], [35]. While using ML for predicting anomalies in
networks has motivated many studies, little research has tackled the
advantage of using ML in SCADA systems by using real data sets and a varied
set of ML algorithms. In the literature, a large number of studies used the
Knowledge Discovery and Data Mining (KDD) 99 data set to evaluate their
solutions for intrusion detection [36]–[39]. However, this data set does not
consider the specificities of SCADA architecture, communication protocols
and traffic patterns. Moreover, it is seen by the research community as
biased, outdated, and not relevant for modern network attack detection. In
the following, we detail different intrusion detection approaches for SCADA
systems using real data sets.

The authors of [40] combine the signature-based and model-based approaches
to design a rule-based IDS for SCADA networks. Their IDS overcomes the main
disadvantage of signature-based systems, i.e., that only known attacks are
detected using pre-established rules. In [41], the authors presented a
multi-algorithm model-based IDS. Models that represent the
expected/acceptable system behaviour are created, and any behaviour that
causes violations of these models is detected as an attack.

Both [42] and [43] presented an IDS that detects malicious network traffic
in SCADA systems, based on the One Class Support Vector Machine (OCSVM)
technique. While the authors of [42] use OCSVM to classify malicious
observations by comparing them with benign ones, the study carried out in
[43] aims at detecting intruders in SCADA networks by analysing

variables of the control devices. Two different approaches of one-class
classification, the Support Vector Data Description (SVDD) and the Kernel
Principal Component Analysis (KPCA), were proposed as well in [44].
Lp-norms are studied in Radial Basis Function (RBF) kernels for intrusion
detection. An IDS that detects SCADA attacks based on the network traffic
behaviour was proposed in [45]. The IDS extracts the time correlation
between different network packets and then monitors the system to determine
whether it is behaving normally. An alarm is raised when anomalies are
detected.

The authors of [46] presented an IDS using Neural Network based Modelling
(IDS-NNM) algorithm following the supervised learning approach. They adopted
a specific window-based attribute extraction approach to capture the time
series nature of the network packet stream. More recently, a Recurrent
Neural Network (RNN) with unidirectional Long Short Term Memory (LSTM)
architecture was proposed in [47] to detect industrial control system
anomalies.

VI. CONCLUSION AND FUTURE WORK

Until recently, the most common security strategy for SCADA systems was the
air gap principle: an operator of SCADA networks segregated the control
network from other networks. Hence, attackers could not access them. The
attacker had to be physically close to the SCADA system to access the
communication channel, inject malicious data or even interfere with the
protocol. Nowadays, with growing demands for connectivity between the SCADA
control network and the corporate network, novel network attacks have
appeared as PLC or RTU devices are managed over IP communication protocols.
This increased interconnectivity results in the de-isolation of SCADA
systems, making them more vulnerable. Attackers no longer need to gain
physical access to on-site circuits to perform a hostile action; instead,
malicious network packets can reach the field devices from anywhere.

In this paper, we have shown that ML techniques can detect network attacks
against SCADA systems. We used a SCADA data set provided by MSU's in-house
SCADA lab. It was generated using a gas pipeline SCADA system hosted in
their laboratory. We used SVM, RF, and BLSTM to implement diverse IDS
classifiers. We provided a complete comparison between these algorithms
along with the random hyper-parameter search results. We published our
source code on GitHub [31] to help other researchers verify, compare, and/or
extend their studies. In contrast to the state-of-the-art studies, the use
of the test set accuracy, precision, recall and F1 score allowed us to
assess their performance correctly and comprehensively. The RF algorithm
gives the best performance by detecting 99.90% of benign data and 98.46% of
attacks, with an overall detection rate (recall) of 99.58%.

Our approach can be applied to different SCADA environments, because SCADA
is based on a well-defined architecture (see Section II). The data set used
was generated in a real gas pipeline following a typical SCADA architecture.
Although the data set contains only Modbus RTU traffic, other SCADA
protocols (e.g. DNP3 or IEC 60870-5-104) have similar messages to monitor
(read) and control (write) sensors and actuators. In addition, these
protocols can fall victim to the attacks that we have highlighted in
Figure 1.

An interesting future investigation would be the extraction of rules from RF
algorithms to integrate them with signature-based NIDSs such as Snort.

ACKNOWLEDGMENT

This work was partially funded by ATENA H2020 EU Project (H2020-DS-2015-1
Project 700581). We thank Dominic Dunlop for his review and comments that
greatly improved the manuscript.

REFERENCES

[1] E. Byres, “The Air Gap: SCADA’s Enduring Security Myth,” Communications of the ACM, vol. 56, no. 8, pp. 29–31, Aug. 2013. [Online]. Available: http://doi.acm.org/10.1145/2492007.2492018
[2] C. S. Wright. (2011, September) SCADA: Air Gaps Do Not Exist. Accessed: 2017-12-04. [Online]. Available: http://infosecisland.com/blogview/16770-SCADA-Air-Gaps-Do-Not-Exist.html
[3] R. Langner, “Stuxnet: Dissecting a Cyberwarfare Weapon,” IEEE Security Privacy, vol. 9, no. 3, pp. 49–51, May 2011.
[4] K. Zetter. (2012, May) Meet ’Flame,’ The Massive Spy Malware Infiltrating Iranian Computers. Accessed: 2017-12-04. [Online]. Available: https://www.wired.com/2012/05/flame/
[5] V. M. Igure, S. A. Laughter, and R. D. Williams, “Security Issues in SCADA Networks,” Elsevier Computers & Security, vol. 25, no. 7, pp. 498–506, 2006.
[6] “IEEE Standard for Electric Power Systems Communications-Distributed Network Protocol (DNP3),” IEEE Std 1815-2012 (Revision of IEEE Std 1815-2010), pp. 1–821, Oct 2012.
[7] S. East, J. Butts, M. Papa, and S. Shenoi, “A Taxonomy of Attacks on the DNP3 Protocol,” in International Conference on Critical Infrastructure Protection. Springer Berlin Heidelberg, 2009, pp. 67–81.
[8] N. R. Rodofile, K. Radke, and E. Foo, “Real-Time and Interactive Attacks on DNP3 Critical Infrastructure Using Scapy,” in Proceedings of the 13th Australasian Information Security Conference (AISC 2015), 2015, pp. 67–70.
[9] P. Huitsing, R. Chandia, M. Papa, and S. Shenoi, “Attack Taxonomies for the Modbus Protocols,” International Journal of Critical Infrastructure Protection, vol. 1, pp. 37–44, 2008.
[10] P. Maynard, K. McLaughlin, and B. Haberler, “Towards Understanding Man-In-The-Middle Attacks on IEC 60870-5-104 SCADA Networks,” in Proceedings of the 2nd International Symposium for ICS & SCADA Cyber Security Research 2014 (ICS-CSR 2014), Sep. 2014. [Online]. Available: http://ewic.bcs.org/content/ConWebDoc/53228
[11] A. Mirian, Z. Ma, D. Adrian, M. Tischer, T. Chuenchujit, T. Yardley, R. Berthier, J. Mason, Z. Durumeric, A. J. Halderman, and M. Bailey, “An Internet-wide view of ICS devices,” in 14th IEEE Privacy, Security, and Trust Conference (PST’16), 2016.
[12] Z. Durumeric, E. Wustrow, and J. A. Halderman, “ZMap: Fast Internet-wide Scanning and Its Security Applications,” in Proceedings of the 22nd USENIX Conference on Security. USENIX Association, 2013, pp. 605–620. [Online]. Available: http://dl.acm.org/citation.cfm?id=2534766.2534818
[13] R. M. Lee, M. J. Assante, and T. Conway, “TLP: White Analysis of the Cyber Attack on the Ukrainian Power Grid,” E-ISAC, Tech. Rep., Mar. 2016.
[14] S. W. A.-H. Baddar, A. Merlo, and M. Migliardi, “Anomaly detection in computer networks: A state-of-the-art review.” JoWUA, vol. 5, no. 4, pp. 29–64, 2014.
[15] Z. Durumeric, D. Adrian, A. Mirian, M. Bailey, and J. A. Halderman, “A Search Engine Backed by Internet-Wide Scanning,” in Proceedings of the 22nd ACM Conference on Computer and Communications Security, Oct. 2015.
[16] “ModBus Application Protocol Specification V1.1b 3,” http://www.modbus.org/docs/Modbus_Application_Protocol_V1_1b3.pdf, 2012, accessed: 2017-11-24.
[17] C. C. Aggarwal, Data Mining: The Textbook. Springer, 2015.

[18] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: https://doi.org/10.1023/A:1022627411411
[19] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[20] C. Olah. (2015, Aug) Understanding LSTM Networks. Accessed: 2017-12-04. [Online]. Available: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[21] A. Karpathy. (2015, May) The Unreasonable Effectiveness of Recurrent Neural Networks. Accessed: 2017-12-04. [Online]. Available: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
[22] M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[23] A. Graves, S. Fernández, and J. Schmidhuber, “Bidirectional LSTM networks for improved phoneme classification and recognition,” in Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005, W. Duch, J. Kacprzyk, E. Oja, and S. Zadrożny, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 799–804.
[24] S. Latif, M. Usman, and J. Q. R. Rana, “Abnormal heartbeat detection using recurrent neural networks,” arXiv preprint arXiv:1801.08322, 2018.
[25] X. Zhang, W. Kou, E. I. Chang, H. Gao, Y. Fan, Y. Xu et al., “Sleep stage classification based on multi-level feature learning and recurrent neural networks via wearable device,” arXiv preprint arXiv:1711.00629, 2017.
[26] “Industrial Control System (ICS) Cyber Attack Datasets,” https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets, accessed: 2017-12-04.
[27] I. Turnipseed, “A New SCADA Dataset For Intrusion Detection Research,” M. Sc., Mississippi State University, August 2015.
[28] T. Morris and W. Gao, “Industrial Control System Traffic Data Sets for Intrusion Detection Research,” Advances in Information and Communication Technology Critical Infrastructure Protection VIII, pp. 65–78, 2014.
[29] Z. C. Lipton, D. C. Kale, and R. Wetzel, “Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series,” in Proceedings of Machine Learning for Healthcare 2016, 2016, pp. 253–270.
[30] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization,” The Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.
[31] “Machine learning techniques for Intrusion Detection in SCADA Systems,” https://github.com/Rocionightwater/ML-NIDS-for-SCADA.git.
[32] L. Talavera, “Dynamic Feature Selection in Incremental Hierarchical Clustering,” in Proceedings of the European Conference on Machine Learning. Springer, 2000.
[33] F. J. Valverde-Albacete and C. Peláez-Moreno, “100% Classification Accuracy Considered Harmful: The Normalized Information Transfer Factor Explains the Accuracy Paradox,” PloS one, vol. 9, no. 1, 2014.
[34] X. Zhu, Knowledge Discovery and Data Mining: Challenges and Realities. Igi Global, 2007.
[35] B. Zhu and S. Sastry, “SCADA-specific intrusion detection/prevention systems: a survey and taxonomy,” in Proceedings of the 1st Workshop on Secure Control Systems (SCS), vol. 11, 2010.
[36] A. George, “Anomaly Detection Based on Machine Learning: Dimensionality Reduction using PCA and Classification using SVM,” International Journal of Computer Applications, vol. 47, no. 21, 2012.
[37] G. Wang, J. Hao, J. Ma, and L. Huang, “A new Approach to Intrusion Detection using Artificial Neural Networks and Fuzzy Clustering,” An International Journal of Expert Systems with Applications, vol. 37, no. 9, pp. 6225–6232, 2010.
[38] J. Zhang and M. Zulkernine, “A Hybrid Network Intrusion Detection Technique using Random Forests,” in The First International Conference on Availability, Reliability and Security. IEEE, 2006, pp. 8–pp.
[39] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection,” in Proceedings of the International Conference on Platform Technology and Service (PlatCon). IEEE, 2016, pp. 1–5.
[40] Y. Yang, K. McLaughlin, T. Littler, S. Sezer, B. Pranggono, and H. Wang, “Intrusion detection system for IEC 60870-5-104 based SCADA networks,” in Power and Energy Society General Meeting (PES), 2013 IEEE. IEEE, 2013, pp. 1–5.
[41] S. Cheung, B. Dutertre, M. Fong, U. Lindqvist, K. Skinner, and A. Valdes, “Using model-based intrusion detection for SCADA networks,” in Proceedings of the SCADA security scientific symposium, vol. 46. Citeseer, 2007, pp. 1–12.
[42] L. A. Maglaras and J. Jiang, “Intrusion Detection In SCADA Systems using Machine Learning Techniques,” in Science and Information Conference (SAI), 2014, 2014, pp. 626–631.
[43] A. F. S. Prisco and M. J. F. Duitama, “Intrusion detection system for SCADA platforms through machine learning algorithms,” in Communications and Computing (COLCOM), 2017 IEEE Colombian Conference on. IEEE, 2017, pp. 1–6.
[44] P. Nader, P. Honeine, and P. Beauseroy, “lp-norms in one-class classification for intrusion detection in SCADA systems,” IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 2308–2317, 2014.
[45] N. Sayegh, I. H. Elhajj, A. Kayssi, and A. Chehab, “SCADA intrusion detection system based on temporal behavior of frequent patterns,” in Electrotechnical Conference (MELECON), 2014 17th IEEE Mediterranean. IEEE, 2014, pp. 432–438.
[46] O. Linda, T. Vollmer, and M. Manic, “Neural network based intrusion detection system for critical infrastructures,” in Neural Networks, 2009. IJCNN 2009. International Joint Conference on. IEEE, 2009, pp. 1827–1834.
[47] C. Feng, T. Li, and D. Chana, “Multi-level Anomaly Detection in Industrial Control Systems via Package Signatures and LSTM Networks,” in Proceedings of the 47th IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 2017, pp. 261–272.
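
The per-class figures discussed in the results analysis can be re-derived from the confusion matrix. The following minimal Python sketch transcribes Table VIII and recomputes precision, recall, F1 score and support for each class. It assumes that rows hold the true classes and columns the predicted ones; the paper does not state the orientation, but with this choice the row sums match the support column of Table VII. The names `CONFUSION`, `per_class_metrics` and `accuracy` are illustrative and are not taken from the published source code [31].

```python
# Confusion matrix transcribed from Table VIII.
# Assumption: rows are true classes, columns are predicted classes.
LABELS = ["Normal", "NMRI", "CMRI", "MSCI", "MPCI", "MFCI", "DoS", "Recon"]
CONFUSION = [
    [42908, 12, 9, 9, 4, 0, 8, 3],
    [25, 1480, 21, 0, 0, 0, 0, 0],
    [79, 16, 2546, 0, 0, 0, 0, 0],
    [20, 0, 0, 1517, 0, 0, 1, 0],
    [79, 0, 0, 2, 4019, 0, 1, 0],
    [0, 0, 0, 0, 0, 967, 0, 0],
    [19, 0, 0, 0, 0, 0, 396, 0],
    [4, 0, 0, 0, 0, 12, 0, 770],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1, support)} for a square matrix."""
    n = len(labels)
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                         # samples of true class i
        predicted = sum(matrix[r][i] for r in range(n))  # samples predicted as i
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, support)
    return metrics


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for label, (p, r, f1, s) in per_class_metrics(CONFUSION, LABELS).items():
        print(f"{label:<7} precision={p:7.2%} recall={r:7.2%} "
              f"f1={f1:7.2%} support={s}")
    print(f"accuracy={accuracy(CONFUSION):.2%}")
```

Under the stated orientation, the sketch reproduces the values of Table VII, for example an overall accuracy of 99.41 % and a DoS recall of 95.42 % (396 of 415 DoS samples), which is the lowest per-class recall and matches the discussion of the bad CRC attack.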
