Anomaly Detection With Generative Adversarial Networks For Multivariate Time Series

1. The document proposes a novel GAN-based anomaly detection method called GAN-AD to detect anomalies in multivariate time series data from complex cyber-physical systems with networked sensors and actuators.
2. GAN-AD uses LSTM-RNN models for the generator and discriminator of the GAN to capture the distributions of normal sensor/actuator time series data and detect anomalies based on differences between real and generated data.
3. The method was tested on anomaly detection for a water treatment plant system, demonstrating high detection rates of various attacks with low false positives compared to existing methods.

Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series

Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng

arXiv:1809.04758v3 [cs.LG] 15 Jan 2019

This work was presented in the 7th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications on the ACM Knowledge Discovery and Data Mining conference, August 2018, London, United Kingdom.
Dan Li, Dacheng Chen and See Kiong Ng are with the Institute of Data Science, National University of Singapore, 3 Research Link, Singapore 117602.
Jonathan Goh is with the ST Electronics (Info Security) Pte Ltd.
The code is available at https://github.com/LiDan456/GAN-AD

Abstract: Today's Cyber-Physical Systems (CPSs) are large, complex, and affixed with networked sensors and actuators that are targets for cyber-attacks. Conventional detection techniques are unable to deal with the increasingly dynamic and complex nature of the CPSs. On the other hand, the networked sensors and actuators generate large amounts of data streams that can be continuously monitored for intrusion events. Unsupervised machine learning techniques can be used to model the system behaviour and classify deviant behaviours as possible attacks. In this work, we proposed a novel Generative Adversarial Networks-based Anomaly Detection (GAN-AD) method for such complex networked CPSs. We used LSTM-RNN in our GAN to capture the distribution of the multivariate time series of the sensors and actuators under normal working conditions of a CPS. Instead of treating each sensor's and actuator's time series independently, we model the time series of multiple sensors and actuators in the CPS concurrently to take into account potential latent interactions between them. To exploit both the generator and the discriminator of our GAN, we deployed the GAN-trained discriminator together with the residuals between generator-reconstructed data and the actual samples to detect possible anomalies in the complex CPS. We used our GAN-AD to distinguish abnormal attacked situations from normal working conditions for a complex six-stage Secure Water Treatment (SWaT) system. Experimental results showed that the proposed strategy is effective in identifying anomalies caused by various attacks with high detection rate and low false positive rate as compared to existing methods.

Index Terms: Anomaly Detection, Multivariate Time Series, GAN, CPS, IoT

I. INTRODUCTION

Cyber-Physical Systems (CPSs) are interconnected physical systems typically engineered for mission-critical tasks. Some example CPSs are water treatment and distribution plants, natural gas distribution systems, oil refineries, power plants, power grids, and autonomous vehicles. The emergence of the Internet of Things (IoT) will further drive the proliferation of CPSs for a large variety of tasks, resulting in many systems and devices communicating and operating autonomously over networks. As such, cyber-attacks are among the most concerning potential threats to CPSs.

Traditionally, Statistical Process Control (SPC) methods such as CUSUM, EWMA and Shewhart charts are popular solutions for anomaly detection [1]. These conventional detection techniques are unable to deal with the increasingly dynamic and complex nature of the CPSs with the advent of IoT. As such, researchers have moved beyond specification or signature-based techniques and begun to exploit both supervised and unsupervised machine learning techniques to develop more intelligent and adaptive methods from big data to identify anomalies or intrusions [2].

However, even with the use of machine learning techniques, detecting anomalies in time series is still challenging. First, most of the supervised techniques require enough reliable normal data and labelled anomaly classes to learn from, but this is hardly the case in practice as anomalies are typically rare. Secondly, most existing unsupervised methods are built through linear projection and transformation, but there is often non-linearity in the hidden inherent correlations of the multivariate time series of complex CPSs. Most of the current techniques also employ simple comparison between the present state and predicted normal ranges, which can be inadequate for anomaly detection since the control bounds are not flexible enough and cannot effectively identify indirect attacks. (For one specific variable, i.e. a sensor or actuator in the CPS, a "direct attack" is defined as an attack that is directly inserted to it and affects its performance, while an "indirect attack" is an intrusion targeted at another variable that also affects the performance of the variable.)

To address these challenges, we propose a novel unsupervised GAN-based Anomaly Detection (GAN-AD) method for a complex multi-process CPS with multiple networked sensors and actuators by modelling the non-linear correlations among multiple time series and detecting anomalies based on the trained GAN model. Fig. 1 depicts the overall framework of our proposed GAN-AD. First, to deal with time-series data, the generator and discriminator are built as two Long-Short-Term Recurrent Neural Networks (LSTM-RNN), as shown in the left part of Fig. 1. As in a typical GAN, the generator (G) generates fake samples from a specific latent space and passes them to the discriminator (D), which tries to distinguish fake from real. Based on the outputs of D, the system will update the parameters of D and G, so that the discriminator will be trained to be as sensitive as possible in assigning correct labels to both real and fake samples, while the generator will be trained to be as smart as possible in fooling the discriminator (so that it assigns real labels to fake samples). After sufficient rounds of iterations, the generator will have captured the hidden distributions of the training sequences and can generate realistic samples. In other words, G can be viewed as an implicit model of the CPS. At the same time, the resulting discriminator D is also able to distinguish fake from real with high sensitivity.
In other words, D is an intuitive tool for anomaly detection. In this work, we propose to exploit both G and D for the anomaly detection task by (i) exploiting the residuals between real-time testing samples and reconstructed samples based on the mapping from real-time space to the GAN latent space; and (ii) discrimination with the machine-learned discriminator by classifying the real-time series. We depict this aspect of our proposed GAN-AD in the right part of Fig. 1. As shown, the testing samples are mapped back into the latent space, and the corresponding residual loss is calculated based on the difference between the reconstructed testing samples (by the trained generator) and the actual testing samples. At the same time, testing samples are also fed to the trained discriminator to compute the discrimination loss. The two losses are then combined to detect potential anomalies for sequential CPS data (more details are described in Section III-C).

[Figure 1: block diagram with two panels, "Model Training" (left) and "Anomaly Detector" (right).]
Fig. 1: GAN-AD: Unsupervised GAN-based anomaly detection for CPSs. On the left is a GAN framework in which the generator and discriminator are obtained with iterative adversarial training. On the right is the exploitation of both the GAN's generator and discriminator for anomaly detection: the generator is used for computing the residual loss between reconstructed samples and real ones, while the discriminator is used to compute the discrimination loss.

The remaining part of this paper is organized as follows. Section II introduces the related works. Section III presents our proposed GAN-AD and derives an anomaly score function. Section IV introduces the multi-stage Secure Water Treatment system, which is followed by Section V in which we evaluate our proposed GAN-AD on real-time multivariate series. Finally, Section VI summarizes the whole paper and proposes possible future work.

II. RELATED WORKS

The basic task of anomaly detection is to identify whether the testing data conform to the normal data distribution; the non-conforming points are called anomalies, outliers, intrusions, failures or contaminants in various application domains [3], [2]. Anomaly detection is an old but challenging problem: it has been studied in the statistics community as early as the 19th century [3].

Based on how the historical training data is used, we can broadly divide anomaly detection methods into three categories: i) Statistical Process Control (SPC) techniques, ii) supervised machine learning methods, and iii) unsupervised machine learning methods. The SPC techniques were extensively used in the early years for monitoring and controlling quality of manufacturing processes through univariate or multivariate analysis [4]. The SPC approaches typically inspect changes in process mean (mean shifts) and process variance (variance changes), and try to model the relationship among multiple variables [5], [?]. Shewhart control charts and CUSUM control charts are univariate SPC techniques that are usually applied to detect mean shifts [6], [7]. EWMV control charts are sensitive to univariate variance changes as well as mean shifts [8]. Although widely used, the aforementioned SPC techniques cannot model the correlation among various sequences, while interrelations are common even in traditional manufacturing systems. As an improvement, Hotelling's $T^2$ [9], Multivariate Cumulative Sum (MCUSUM) [10] and Multivariate Exponentially Weighted Moving Average (MEWMA) [11] were proposed to monitor the performance of multiple variables in a manufacturing system. However, these multivariate methods require the independent and identically distributed (iid) assumption, which is often violated in reality [12]. Moreover, the computationally intensive multivariate SPC methods are not practical for modern CPSs with high complexity and massive data streams.

Supervised machine learning techniques can also be used for anomaly detection. A typical approach is to build a predictive classification model for the normal and the anomalous classes. The classification model is trained from the labelled data. New measurements can then be analysed by the classifier and be classified to corresponding categories (normal or anomalous) automatically [13]. A wide range of supervised machine learning techniques has been used for anomaly detection. They include Multivariate Regression models [14], Bayes Classifier [15], Neural Networks (NN) [16], Fisher Discriminant Analysis (FDA) [17], Gaussian Mixture Model [18], Support Vector Data Description (SVDD) [19], Support Vector Machines (SVM) [20], and tree-structured learning methods [21], [22]. However, supervised classification methods are dependent on the availability of initial labelled training data. Given that anomalies are typically rare, obtaining enough accurately labelled anomaly classes is usually challenging.

The unsupervised learning methods, also known as descriptive or undirected classification, train models without labelled classes. Due to their simplicity and ability to handle a large amount of process data, unsupervised learning methods have been widely used for anomaly detection for various industrial processes [23]. Popular unsupervised methods include Principal Component Analysis (PCA) [24] and Partial Least Squares (PLS) [25]. PCA is a multivariate data analysis method which preserves the significant variability information extracted from the process measurements and reduces the dimension for a huge amount of correlated data [26]. PLS is another multivariate data analysis method that has been extensively utilized for model building and anomaly detection [27]. Their key performance indicators (Square Predicted Error (SPE) for PCA and the $T^2$ index for PLS) for anomaly detection can be obtained with a correlation model trained off-line and online process measurements. However, these unsupervised methods are only effective for highly correlated data, and require the data to follow a multivariate Gaussian distribution [28].

The recently proposed GAN framework enables researchers to build a generative model via adversarial training [29]. The simultaneous training of a generator and a discriminator in an adversarial fashion is highly suggestive for using the GAN framework for anomaly detection. The current successes of GANs are mainly in generating realistic-looking images. In our CPS and IoT scenarios, we often have to deal with multiple streams of potentially interacting time series data. However, there has been limited work in adopting the GAN framework for time-series data to date. To the best of our knowledge, there are two preliminary works using GAN to generate continuous valued sequences in the literature: one produces polyphonic music with recurrent neural networks as generator and discriminator [30], and the other uses a conditional version of recurrent GAN to generate real-valued medical time series [31]. In both of these works, the multi-sequences were treated as i.i.d. and fed to a uniform GAN framework. This will be inadequate for the IoT and CPSs setting given the potential interactions amongst the multiple time series-generating sensors and actuators involved in the same or different processes of the complex systems.

Unlike traditional classification methods, the GAN-trained discriminator learns to detect false from real in an unsupervised fashion, making GAN an attractive unsupervised machine learning technique for anomaly detection [32]. In addition, the GAN framework also produces a generator which is actually an implicit model of the target system with its ability to output normal samples from a certain latent space. Inspired by [33] and [34], which update a mapping from the real-time space to a certain latent space to enhance the training of generator and discriminator, researchers have recently proposed to train a latent-space-understandable GAN and apply it for unsupervised learning of rich feature representations for arbitrary data distributions. [35] and [36] showed the possibility of recognizing anomalies with reconstructed testing samples from latent space, and successfully applied the proposed GAN-based detection strategy to discover unexpected markers for images. In this work, we will leverage these previous works to make use of both the GAN-trained generator and discriminator to better detect anomalies based on both residual and discrimination losses.

The contributions of this paper are summarized as follows: i) a novel GAN-based unsupervised anomaly detection method is proposed to detect anomalies (cyber-attacks) for complex multi-process cyber-physical systems with networked sensors and actuators; ii) the GAN model is trained with multiple time series, which adapts GAN from the image generation domain to time series generation by adopting the Long Short Term-Recurrent Neural Networks (LSTM-RNN) to capture the temporal dependency; iii) normal sequences with high dimension are uniformly utilized to train the GAN model to discriminate fake from real and reconstruct testing sequences from a specific latent space simultaneously; iv) the discrimination loss calculated by the trained discriminator and the residual loss between reconstructed and real testing sequences (to make use of both the trained discriminator and generator) are combined to detect anomalous points in the high dimensional time series, and the proposed method is shown to outperform existing methods in detecting anomalies due to cyber attacks in a complex Secure Water Treatment (SWaT) system with six stages [37].

III. ANOMALY DETECTION WITH GENERATIVE ADVERSARIAL TRAINING

A. GAN with LSTM-RNN

Long Short Term-Recurrent Neural Networks (LSTM-RNN) have been shown to be capable of learning complex time series by taking the information in backward (or even forward) time steps with memory cells. In this work, in order to handle time series data of the CPS, both the generator (G) and discriminator (D) of GAN are substituted by LSTM-RNN. Following the architecture of a regular GAN framework [29], the GAN model is trained as a two-player minimax game.

\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1} \]

Specifically, the generator G, an LSTM-RNN model, implicitly defines a probability distribution for the generated samples, which can be written as $G_{rnn}(z)$, where $z$ is sampled from the random latent space. The discriminator, which is another LSTM-RNN model, is then trained to minimise the average negative cross entropy between its predictions and sequence labels (i.e., train D to recognize as many training samples as real as possible, and recognize as many generated samples as false as possible). Thus, the discriminator loss is

\[ D_{loss} = \frac{1}{m} \sum_{i=1}^{m} \left[ \log D_{rnn}(x_i) + \log(1 - D_{rnn}(G_{rnn}(z_i))) \right] \Leftrightarrow \min \frac{1}{m} \sum_{i=1}^{m} \left[ -\log D_{rnn}(x_i) - \log(1 - D_{rnn}(G_{rnn}(z_i))) \right] \tag{2} \]

where $x_i, i = 1, ..., m$ are the training samples which should be recognized as real, and $G_{rnn}(z_i), i = 1, ..., m$ are the generated samples that should be recognized as false.

At the same time, the generator is trained to confuse the discriminator so that the discriminator would recognize as many generated samples as real as possible. In other words, the generator loss is

\[ G_{loss} = \sum_{i=1}^{m} \log(1 - D_{rnn}(G_{rnn}(z_i))) \Leftrightarrow \min \sum_{i=1}^{m} \log(-D_{rnn}(G_{rnn}(z_i))) \tag{3} \]
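As a minimal illustration of Eq. (2) and the first line of Eq. (3), the sketch below evaluates both losses from discriminator outputs alone. It is not the authors' implementation: the function names, the `eps` clipping constant and the example scores are assumptions introduced here, and the discriminator is represented only by the probabilities it would assign to real and generated sequences.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Minimised form of Eq. (2): mean of -log D(x_i) - log(1 - D(G(z_i)))."""
    if len(d_real) != len(d_fake) or not d_real:
        raise ValueError("need two non-empty score lists of equal length")
    total = 0.0
    for p_real, p_fake in zip(d_real, d_fake):
        # eps (an assumption of this sketch) guards against log(0).
        total -= math.log(max(p_real, eps))
        total -= math.log(max(1.0 - p_fake, eps))
    return total / len(d_real)


def generator_loss(d_fake, eps=1e-12):
    """First line of Eq. (3): sum of log(1 - D(G(z_i))) over generated samples."""
    return sum(math.log(max(1.0 - p, eps)) for p in d_fake)


# Hypothetical discriminator outputs for m = 3 real and 3 generated sequences.
scores_real = [0.9, 0.8, 0.95]
scores_fake = [0.1, 0.3, 0.2]
print(discriminator_loss(scores_real, scores_fake))
print(generator_loss(scores_fake))
```

With these definitions the discriminator loss falls as D scores real samples closer to 1 and generated samples closer to 0, while the generator loss falls as D is fooled into scoring generated samples closer to 1, which is the adversarial tension the two equations express.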
The generator loss and discriminator loss are jointly dealt with by optimizers and used to update the parameters for $G_{rnn}$ and $D_{rnn}$.

Algorithm 1 LSTM-RNN-GAN-based Anomaly Detection Strategy
loop
    if epoch within number of training iterations then
        for the $k$th epoch do
            Generate samples from the latent space:
                $Z = \{z_i, i = 1, ..., m\} \Rightarrow G_{rnn}(Z)$
            Conduct discrimination:
                $X = \{x_i, i = 1, ..., m\} \Rightarrow D_{rnn}(X)$
                $G_{rnn}(Z) \Rightarrow D_{rnn}(G_{rnn}(Z))$
            Update discriminator parameters by minimizing (descending) $D_{loss}$:
                $\min \frac{1}{m} \sum_{i=1}^{m} (-\log D_{rnn}(x_i) - \log(1 - D_{rnn}(G_{rnn}(z_i))))$
            Update generator parameters by minimizing (descending) $G_{loss}$:
                $\min \sum_{i=1}^{m} \log(-D_{rnn}(G_{rnn}(z_i)))$
            Record parameters of the discriminator and generator in the current iteration.
        end for
    end if
    for the $l$th iteration do
        Mapping testing data back to latent space:
            $Z^k = \min_Z Er(X^{tes}, G_{rnn}(Z^i))$
    end for
    Calculate the residuals:
        $Res = | X^{tes} - G_{rnn}(Z^k) |$
    Calculate the discrimination results:
        $Pro = D_{rnn}(X^{tes})$
    Obtain anomaly score:
        $S = Res + Pro$
end loop

B. GAN-based Anomaly Score

As a newly arisen unsupervised learning method, both the generator and discriminator of GAN are jointly trained to represent the variability of normal data, which is helpful for identifying anomalies. To make full use of the GAN model, both the trained generator and discriminator should be driven to make contributions to the anomaly detection. Following the formulation in [35], the anomaly detection for CPS time series data consists of the following two parts.

• Anomaly Detection with Discrimination
  Intuitively, the trained discriminator D (after a sufficient number of iterations of adversarial training) is a direct tool for anomaly detection since it can distinguish fake from real with high sensitivity.
• Anomaly Detection with Residuals
  As mentioned in previous sections, the trained generator G, which is capable of generating realistic samples, is actually a mapping from the latent space to real data space: $G(Z) : Z \to X$, and can be viewed as an implicit system model that reflects the normal data's distribution. Due to the smooth transitions of latent space mentioned in [38], the generator outputs similar samples if the inputs in the latent space are close. Thus, if it is possible to find the corresponding $Z^k$ in the latent space for the testing data $X^{tes}$, the similarity between $X^{tes}$ and $G(Z^k)$ could indicate to what extent $X^{tes}$ follows the distribution reflected by G. That is to say, residuals between $X^{tes}$ and $G(Z^k)$ could be utilized for identifying anomalies in testing data.

As shown in the right part of Fig. 1, to find the optimal $Z^k$ that corresponds to the testing samples, we first sample a random set $Z^1$ from the latent space and obtain reconstructed raw samples $G(Z^1)$ by feeding it to the generator. Then, the samples from the latent space could be updated with the gradients obtained from the error function defined with $X^{tes}$ and $G(Z)$.

\[ \min_{Z^k} Er(X^{tes}, G_{rnn}(Z^k)) = 1 - Simi(X^{tes}, G_{rnn}(Z^k)) \tag{4} \]

where the similarity between sequences could be defined as covariance for simplicity.

If after enough iteration rounds the error is small enough, the samples $Z^k$ are recorded as the corresponding mapping in the latent space for the testing samples. Thus, the residual at time $t$ for testing samples is calculated as

\[ Res(X_t^{tes}) = \sum_{i=1}^{n} \left| x_t^{tes,i} - G_{rnn}(Z_t^{k,i}) \right| \tag{5} \]

where $X_t^{tes} \subseteq \mathbb{R}^n$ is the measurements at time step $t$ for $n$ variables. In summary, the anomaly score for anomaly detection is

\[ S_t^{tes} = \lambda Res(X_t^{tes}) + (1 - \lambda) D_{rnn}(X_t^{tes}) \tag{6} \]

Our GAN-based unsupervised anomaly detection strategy is summarized in Algo. 1. Mini-batch stochastic optimization based on Adam Optimizer and Gradient Descent Optimizer is used for updating the model parameters.

C. Anomaly Detection Framework

We formulate the anomaly detection problem for multivariate time series as follows. First, consider an $m$-dimensional time series $X = \{x^{(t)}, t = 1, ..., T\}$ with length $T$ (usually, $T$ should not be too large, for the purpose of monitoring and anomaly detection), where $x^{(t)} \in \mathbb{R}^m$ is an $m$-dimensional vector of readings for $m$ variables at time instance $t$. Usually, in industrial processes or mechanical systems (such as the SWaT system considered in this paper), sensor measurements form long time series whose length is much larger than $T$. Thus, multiple predefined time series, $\mathcal{X} = \{X^{(1)}, ..., X^{(L)}\}$, can be obtained by taking a window of length $T$ over the raw data streams. The GAN model is trained based on the normal time-series dataset $X_{real}$, and generates "fake" samples $X_{gs}$ that "look real". Next, the testing time-series dataset $X_{att}$ (or $X_{tes}$), which is real-time CPS data,
can be analysed by the trained model to detect anomalous slots.

However, the use of LSTM-RNN with high-dimensional inputs ($X \subseteq \mathbb{R}^{51}$ in the SWaT case) incurs higher computational cost than usual deep neural networks. Thus, in this paper, we adopt Principal Component Analysis (PCA) to project the high-dimensional data into a PC projection space before feeding the data to the GAN model: $X_{tes} \subseteq \mathbb{R}^m \Rightarrow X^{tes} \subseteq \mathbb{R}^n$. The projection is

\[ P = PCA(X_{real}), \qquad X^{tes} = X_{tes} P^T \tag{7} \]

where $X_{tes} \subseteq \mathbb{R}^m$, $X^{tes} \subseteq \mathbb{R}^n$, $P \subseteq \mathbb{R}^{n \times m}$, $m$ is the original dimension (namely the number of system variables) and $n$ is the number of reserved principal components. Then the projected variables are fed to the GAN-AD model, which outputs anomaly scores according to Eq. (6). Next, the following label assigning function could be applied to identify whether the $i$th variable of the testing time-series set $X^{tes}$ at time step $t$ is being attacked or not.

\[ A_t^{tes,i} = \begin{cases} 1, & \text{if } H\left(S(x_t^{tes,i}), 1\right) > \tau \\ 0, & \text{else} \end{cases} \tag{8} \]

where $t = 1, ..., T$, $i = 1, ..., n$. An anomaly is detected if the cross entropy error $H(\cdot, \cdot)$ for the anomaly score is higher than a certain value $\tau$.

IV. SWAT SYSTEM AND CYBER-ATTACKS

A. Water Treatment System

The Secure Water Treatment (SWaT) system is an operational test-bed for water treatment that represents a small-scale version of a large modern water treatment plant found in large cities [39]. (The overall testbed design was coordinated with Singapore's Public Utility Board, the nation-wide water utility company, and constructed by a third party vendor. That collaboration ensured that the overall physical process and control system closely resemble real systems in the field, so that the results can be applied to real systems as well. For more information, please refer to https://itrust.sutd.edu.sg/research/testbeds/secure-water-treatment-swat/)

The water purification process in SWaT is composed of six sub-processes referred to as P1 through P6 [37]. The first process is for raw water supply and storage, and P2 is for pre-treatment where the water quality is assessed. Undesired materials are then removed by ultra-filtration (UF) backwash in P3. The remaining chlorine is destroyed in the Dechlorination process (P4). Subsequently, the water from P4 is pumped into the Reverse Osmosis (RO) system (P5) to reduce inorganic impurities. Lastly, P6 stores the water ready for distribution.

B. Cyber-Attacks

Various experiments have been conducted on the SWaT system to investigate cyber-attacks and respective system responses. Please refer to the SWaT website (http://itrust.sutd.edu.sg/research/dataset) for a detailed description of the attacks. A total of 36 attacks were launched during the 2016 SWaT data collection process [37]. Generally, the attacked points include sensors (e.g., water level sensors, flow-rate meter, etc.) and actuators (e.g., valve, pump, etc.). The summary of attacked points based on attack location and type is shown in Table I.

TABLE I: List of Cyber-attacks Inserted to the SWaT System

Process  Type  Attacked sensors       Attacked actuators
P1       SSSP  LIT-101                MV-101; P-101; P-102
P1       SSMP  (LIT-101 and MV-101)
P2       SSSP  AIT-202
P2       SSMP  (P-203 and P-205)
P2       SSMP  (P-201, P-203 and P-205)
P3       SSSP  LIT-301; DPIT-301      MV-303; MV-303; MV-304; P-302
P4       SSSP  LIT-401; FIT-401
P5       SSSP  AIT-504                MV-504
P5       SSMP  (P-501 and FIT-502)
P1-6     MSMP  (UV-401, AIT-502 and P501)
P1-6     MSMP  (P-602, DIT-301 MV-301)
P1-6     MSMP  (P-302 and LIT-401)
P1-6     MSMP  (LIT-101, P-101 and MV-201)
P1-6     MSSP  (AIT-402 and AIT-502)
P1-6     MSSP  (FIT-401 and AIT-502)
P1-6     MSSP  (P-101 and LIT-301)

* FIT: flow meter; LIT: water level transmitter
* MV: motorized valve
* P: water pump/dosing pump/Sodium bi-sulphate pump
* AIT: chemical analyser; UV: dechlorinator meter
* DPIT: differential pressure indicating transmitter
* SSSP: single stage single point attack
* SSMP: single stage multi point attack
* MSMP: multi stage multi point attack
* MSSP: multi stage single point attack

As a test-bed for research in the area of cyber security, several related works have been published based on the SWaT dataset. Some of them focused on special attacks. For example, a distributed detection method for single stage multiple points attacks via system specific physical invariants is proposed in [40]. Also, Jonathan et al. proposed to find attacks for the first process via RNN prediction and CUSUM detection [41]. A model-based method, which derives a Kalman filter, was applied to estimate the evolution of the system dynamics on a single-variable basis [42]. In this work, we consider all the aforementioned cyber-attacks as anomalous working conditions and train our proposed GAN-AD method to detect these anomalies for all the six processes in SWaT (results are shown in Section V-E).

C. SWaT Dataset

The 2016 SWaT data collection process lasted for 11 days with the system operating 24 hours per day. Various cyber-attacks were implemented on the testbed with different intents and varying durations (from a few minutes to an hour) in the final four days. The system was either allowed to reach its normal operating state before another attack was launched or the attacks were launched consecutively. The 2016 SWaT dataset and its associated attacks have the following characteristics (the raw data are not plotted in this paper due to the page limit; please refer to [37] for more information):

• Different attacks may last for different time durations due to different scenarios. Some attacks do not even take
  effect immediately. The system stabilization durations also vary across attacks. Simpler attacks, such as those aiming at changing flow rates, require less time for the system to stabilize, while the attacks that caused stronger effects on the dynamics of the system will require more time for stabilization.
• Attacks on one sensor (or actuator) may affect the performance of other sensors (or actuators), usually after a certain time delay.
• Furthermore, similar types of sensors (or actuators) tend to respond to attacks in a similar fashion. For example, attacks on the LIT101 sensor (a water level sensor in process 1) cause abnormal spikes in both LIT101 and LIT301 (another water level sensor in process 3) but do not affect the readings of other sensors and actuators (such as flow rate sensor and power meter) in the system.

The aforementioned observations suggest we should take a multivariate approach in the modelling instead of taking each sensor or actuator in the CPS as an independent data source (univariate approach). The underlying correlations between the sensors and actuators could be exploited to better detect anomalies in the system.

V. EXPERIMENTS

A. Data Pre-processing

In the 2016 SWaT dataset, 51 variables (sensor readings and actuator states) were measured for 11 days. Within the raw data, 496,800 samples were collected under normal working conditions (data collected in the first 7 days), and 449,919 samples were collected when various cyber-attacks were inserted to the system. We eliminate the first 21,600 samples from the training dataset since it took 5-6 hours to reach stabilization when the system was first turned on, according to [37].

We subdivide the original long multiple sequences into smaller time series by taking a window across raw streams. Since the SWaT data were recorded every second, we set the window length as $T = 120$ (i.e. data collected within 2 minutes). To capture the relevant dynamics of SWaT data, the window ($T = 120$) is applied to the normal and testing datasets with shift length $SL_{nor} = 10$ for the normal dataset and $SL_{att} = 120$ for the testing dataset respectively. In order to speed up the GAN training process and avoid over-fitting, the samples were down-sampled to one measurement every 10 seconds by taking the median value. As a result, we obtained 47,508 training samples and 3,720 testing samples with sequence length $L = 12$.

B. System Architecture

For this study, we used an LSTM network with depth

C. Sample Generation

First, we visualize the data samples generated by our GAN versus the actual samples from the CPS. As can be observed in Fig. 2, our GAN generated samples that were clearly different from the training data in the early learning stage. However, after a sufficient number of iterations, the generator is able to output realistic samples for the various sensors and actuators in the system.

We use Maximum Mean Discrepancy (MMD) to evaluate whether the GAN model has learned the distributions of the training data. MMD is one of the training objectives for moment matching networks.

\[ MMD(Z_j, X_{tes}) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i}^{n} K(Z_i^k, Z_j^k) - \frac{2}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} K(Z_i^k, X_j^{tes}) + \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i}^{m} K(X_i^{tes}, X_j^{tes}) \tag{9} \]

We plot the MMD values across GAN training iterations for generating univariate samples and multivariate samples in Fig. 3. In both cases, we can observe that the MMD values tend to converge to small values after 20-30 iterations. Interestingly, the early MMD values of multivariate samples were lower than those of univariate samples, and the MMD for multivariate samples also converged faster than in the univariate case. This suggests that using multiple data streams can help with the training of the GAN model. Fig. 4 shows the performance of our GAN in generating both univariate and multivariate samples. As can be seen, after more than 50 iterations, the GAN model could generate realistic time sequences even for the multivariate case.

D. Evaluation Metric

We use the following metrics, namely Accuracy (Accu), Precision (Pre), Recall (Rec), F1 score, and False Positive Rate (FPR) to evaluate the anomaly detection performance of GAN-AD.

\[ Accu = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} \]

\[ Pre = \frac{TP}{TP + FP} \tag{11} \]

\[ Rec = \frac{TP}{TP + FN} \tag{12} \]

\[ F1 = 2 \times \frac{Pre \times Rec}{Pre + Rec} \tag{13} \]
3 and 100 hidden (internal) units for the generator. The FP
LSTM network for the discriminator is relatively simpler with FPR = (14)
FP + TN
100 hidden units and depth 1. Inspired by the discussion
about latent space dimension in [31], we also tried different where T P is the correctly detected anomaly (At = 1 while
dimensions and found that higher latent space dimension real label Lt = 1), F P is the falsely detected anomaly (At = 1
generally generates better samples especially when generating while real label Lt = 0), T N is the correctly assigned normal
multivariate sequences. Thus, we set the dimension of latent (At = 0 while real label Lt = 0), and F N is the falsely
space as 15 in this study. assigned normal (At = 0 while real label Lt = 1).
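The metrics in Eqs. (10)-(14) can be computed directly from the detected labels A_t and the real labels L_t. The following is a minimal Python sketch rather than the evaluation code used in this study; the function name `detection_metrics` and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
def detection_metrics(detected, real):
    """Compute Accu, Pre, Rec, F1 and FPR from binary labels (1 = anomaly).

    `detected` plays the role of A_t and `real` the role of L_t.
    All values are returned as fractions in [0, 1].
    """
    if len(detected) != len(real):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(detected, real))
    tp = sum(1 for a, l in pairs if a == 1 and l == 1)  # correctly detected anomaly
    fp = sum(1 for a, l in pairs if a == 1 and l == 0)  # falsely detected anomaly
    tn = sum(1 for a, l in pairs if a == 0 and l == 0)  # correctly assigned normal
    fn = sum(1 for a, l in pairs if a == 0 and l == 1)  # falsely assigned normal

    def ratio(num, den):
        # Assumption for this sketch: an undefined ratio is reported as 0.
        return num / den if den else 0.0

    pre = ratio(tp, tp + fp)
    rec = ratio(tp, tp + fn)
    return {
        "Accu": ratio(tp + tn, tp + tn + fp + fn),
        "Pre": pre,
        "Rec": rec,
        "F1": ratio(2 * pre * rec, pre + rec),
        "FPR": ratio(fp, fp + tn),
    }


if __name__ == "__main__":
    # Toy labels for eight windows, purely illustrative.
    print(detection_metrics([1, 1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 0, 0]))
```

Returning fractions keeps the five quantities on one scale; multiplying Accu, Pre, Rec and FPR by 100 gives percentages.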

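The MMD estimate of Eq. (9), used to monitor whether the generator has learned the training distribution, can be sketched with NumPy as follows. This is an illustration under stated assumptions: the kernel K is taken to be a Gaussian (RBF) kernel with a hypothetical `bandwidth` parameter, each sequence is flattened to a vector before the kernel is applied, and the function names are not from the original work.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd_unbiased(generated, real, bandwidth=1.0):
    """Estimate of Eq. (9) between generated samples Z and real samples X.

    Inputs may have shape (num_samples, seq_len, num_vars); each sample is
    flattened to one vector (an assumption of this sketch).
    """
    z = np.asarray(generated, dtype=float)
    x = np.asarray(real, dtype=float)
    z = z.reshape(len(z), -1)
    x = x.reshape(len(x), -1)
    n, m = len(z), len(x)

    k_zz = rbf_kernel(z, z, bandwidth)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_zx = rbf_kernel(z, x, bandwidth)

    # The sums over j != i exclude the diagonal of the square kernel matrices.
    term_zz = (k_zz.sum() - np.trace(k_zz)) / (n * (n - 1))
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_zx = 2.0 * k_zx.sum() / (m * n)
    return term_zz - term_zx + term_xx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 3))
    close = rng.normal(size=(200, 3))           # same distribution as `real`
    far = rng.normal(loc=2.0, size=(200, 3))    # shifted distribution
    print(mmd_unbiased(close, real), mmd_unbiased(far, real))
```

Samples drawn from the same distribution should give a value near zero, while a shifted distribution gives a clearly larger value, which is the behaviour the MMD curves across training iterations rely on.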
Fig. 2: Comparison between generated samples at different training stages: (a) generated samples at iteration 2; (b) generated samples at iteration 95; (c) original samples. GAN-generated samples at the early stage are quite random, while those generated at later stages almost perfectly follow the distribution of the original samples.

Fig. 3: MMD: generation for multiple time series vs. generation for a single time series.

TABLE II: Anomaly Detection Rates for All Attacks by Checking Different Combinations of Measuring Points

Point      Method     Accu (%)  Pre (%)  Rec (%)  F1     FPR (%)
LIT-101    CUSUM      86.63     36.42    26.86    0.30   7.39
           GAN-AD     87.63     50.00    1.75     0.03   9.32
P-101      CUSUM      75.52     24.62    43.97    0.31   19.83
           GAN-AD     80.72     22.24    15.43    0.18   8.71
AIT-202    CUSUM      55.67     9.24     25.15    0.13   39.45
           GAN-AD     60.22     4.58     17.10    0.07   35.48
LIT-301    CUSUM      81.50     12.92    9.02     0.10   8.43
           GAN-AD     86.85     22.22    1.04     0.02   0.52
DPIT-301   CUSUM      84.13     18.46    17.14    0.17   8.41
           GAN-AD     84.40     25.00    2.67     0.05   1.18
MV-303     CUSUM      71.55     9.67     19.18    0.12   22.01
           GAN-AD     87.68     17.54    3.00     0.05   1.74
LIT-401    CUSUM      88.28     47.80    58.53    0.52   7.99
           GAN-AD     80.35     11.68    9.94     0.10   10.14
FIT-401    CUSUM      12.90     12.90    100.00   0.23   100
           GAN-AD     85.40     39.36    4.32     0.08   1.09
AIT-504    CUSUM      70.97     6.23     14.38    0.08   23.01
           GAN-AD     86.03     14.74    14.35    0.14   11.14
All        SPE1       87.24     20.49    2.25     0.04   1.19
           SPE5       82.81     24.92    21.63    0.23   8.87
           GAN-AD1    90.57     85.71    7.20     0.13   0.13
           GAN-AD5    94.80     93.33    63.64    0.75   0.46
*1 means only one principal component is chosen
*5 means the first five principal components are chosen

E. Anomaly Detection Results

We evaluate the anomaly detection performance for both the univariate and multivariate cases. In the univariate case, we compare the performance against the CUSUM approach which was used in previous works such as [41]. For the multivariate case, we compare GAN-AD against PCA-based unsupervised detection by inspecting the Squared Predicted Error (SPE, i.e., residual distance). The anomaly detection results and comparisons are summarized in Table II. We show the results on 9 variables (i.e. sensors/actuators) for discussion. The 9 variables include sensors and actuators from different processes of the system. They were also attack points for single-point attacks.

To evaluate the major contribution of this work (i.e. detection of anomalies for multiple sequences), all the measured variables are fed uniformly into the GAN-AD model (as mentioned previously, to decrease the computational load, the 51-dimensional data is projected to a lower dimensional space with the help of PCA). Since PCA is also a popular unsupervised multivariate anomaly detection method, based on inspecting the Squared Predicted Error (SPE, i.e., residual distance) values, we also evaluate SPE-based detection against our proposed GAN-AD method.

1) Univariate: As shown in Table II, univariate detection generally performs poorly, with a high false positive rate and low precision and recall. We also observed that:
• The water level sensors (e.g., LIT-101, LIT-301 and LIT-401) have the best anomaly detection performance: both CUSUM and GAN-AD were able to achieve a good detection accuracy rate (near or above 80%). This is because, other than the direct attacks on those sensors, most of the attacks applied to the other sensors (or actuators) also affected the water levels indirectly.
• The accuracy values for AIT-504, 70.97% and 86.03% (CUSUM and GAN-AD respectively), were better than those of its earlier counterpart AIT-202 (55.67% and 60.22%). This was because the chemical accumulation in P 5 (where inorganic impurities are reduced) is relevant for anomaly detection after the chlorine has been added to the system in P 3 and P 4. P 2 was just a pre-treatment process where the quality of raw water for P 1 is assessed.
• Detection results with actuators P-101 and MV-303 were not as poor as expected. Despite the 0/1 values for on/off (open/close) states, both CUSUM and GAN-AD managed to achieve accuracy of up to 71.55%-75.52%
and 80.72%-87.68% respectively. While the 0/1 states do not provide variance across the numerical dimension, the frequency with which “0” and “1” appear along the timeline was still useful for anomaly detection.
• For the water flow-rate sensor FIT-401, the results by CUSUM were extremely poor while GAN-AD performed well. One possible reason for CUSUM’s 100% recall and false positive rate was that CUSUM did not assign any negative labels to the testing samples (i.e., it recognized all samples as anomalies) due to unsuitable normal ranges. On looking closely at the normalized raw data, we observed that the values of the flow rate meter took a roughly 0/1 shape (just like the actuator states). However, the flow rate at the high points was not a static 1 but varied with high frequency. CUSUM is unable to capture such a “bi-variant” characteristic and hence performed badly in this case. On the other hand, GAN-AD was able to handle this and achieved an acceptable accuracy rate.

Fig. 4: Visualization of generated samples and original samples for both (a) univariate and (b) multivariate cases. Note that in the multivariate case 4 variables are fed to the GAN model simultaneously.

2) Multivariate: A key contribution of this work is applying our proposed GAN-AD method to solve the multivariate anomaly detection problem for time series data. For dimensionality reduction, instead of directly feeding the high dimensional data to the GAN-AD model, we used PCA to project the raw data into a lower dimensional principal space, as described in Eq. (7).

We plot the variance ratio of the first 10 Principal Components (PC) in Fig. 5. As shown in the figure, there is one main PC that explains more than 50% of the variance of the SWaT data. Also, the PCs after the 5th one contribute little to the overall variance (near 0). As such, we projected the SWaT data onto the most variant PC (the first one) as well as onto the first 5 PCs, and then applied GAN-AD to detect anomalies in the projected data. For comparison, we also performed standard PCA-based anomaly detection by inspecting the testing dataset with the Squared Predicted Error (SPE, i.e., the residual distances calculated by PCA projection) method.

The bottom part of Table II shows the performance of multivariate anomaly detection using SPE and our proposed GAN-AD. The results showed about 3%-12% improvement in accuracy with the proposed GAN-AD. The GAN-AD method also achieved 50%-60% higher precision and 5%-40% higher recall compared with SPE by assigning more true positives (correctly detected anomalies).

We also compared both GAN-AD and SPE based on PC=1 and PC=5. That is, we conducted SPE with the first one and the first five principal components, while the raw data were projected onto the first one and the first five principal components correspondingly before being fed into GAN-AD. It is interesting to see that for both GAN-AD and SPE the recall rates with PC=5 (which retain more than 90% of the variance) were clearly higher than with PC=1 (which retains only around 50% of the variance, as shown in Fig. 5), which implies that using more principal components could reduce false negatives.

In terms of false positive rates, GAN-AD with PC=1 achieved the best FPR amongst all (both univariate and multivariate). Although GAN-AD with PC=5 outperformed the others in terms of accuracy, precision, recall, and F1, its false positive rate was slightly higher than that of GAN-AD with PC=1. A similar phenomenon could be observed in multivariate detection by SPE. This indicates that the improvement in detection accuracy (as well as precision and recall) came at the cost of more false positives, due to the noisy information brought in by adding four more, less important PC dimensions.

The results in Table II also showed that:
• Generally, the univariate detection cannot compete with
multivariate detection. To be specific, univariate detection results in widespread low precision and recall (note that multivariate detection by SPE also performed poorly in terms of these two factors, which we will discuss in the next point) and high FPR. This observation demonstrates that multivariate detection is applicable to complex CPSs with an interconnected IoT of sensors and actuators generating large amounts of time series.
• In fact, even the baseline multivariate anomaly detection method SPE can compete with most of the univariate detection results in terms of accuracy. However, SPE does not compare well with univariate detection for LIT-101, DPIT-301, LIT-401 and FIT-301 in terms of precision. As noisy information can be accumulated when simply projecting the whole raw dimensions, it can sometimes have a negative effect on assigning true positives [43]. One possible direction for future work would be to select suitable variables and assign different weights according to their importance levels, instead of treating all the variables uniformly in one plain framework as in the current study.

Fig. 5: Variance Ratio of Principal Component for the SWaT data.

VI. CONCLUSIONS

Cyber-Physical Systems are large, complex, and affixed with networked sensors and actuators that generate large amounts of data streams. These data streams and their underlying system dependencies can potentially be mined for dynamic detection of possible intrusion incidents. In this paper, we have explored the use of GAN to simultaneously train a deep learning network to model the distributions of multi-sensor data streams in a CPS under normal operating conditions, and another to detect anomalies due to cyber attacks being carried out against the CPS in an unsupervised fashion. We have proposed a novel GAN-based Anomaly Detection (GAN-AD) method that directly utilizes both the discriminator and the generator trained on multivariate time series to detect anomalies. We have tested our approach on a complex CPS dataset from a Secure Water Treatment Testbed (SWaT) and showed that the proposed GAN-AD was able to outperform existing unsupervised detection methods.

For future work, we will explore the use of GAN-AD for other IoT applications such as predictive maintenance and fault diagnosis for smart buildings and machinery. In terms of the GAN-AD methodology, instead of simply feeding multiple sequences uniformly into a fully connected network, we plan to enhance GAN-AD with a multi-GAN framework to better capture the extrinsic knowledge about relationships among the networked sensors and components. We will also conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees.

ACKNOWLEDGEMENT

This material is based on research/work supported by the Singapore National Research Foundation and the Cybersecurity R&D Consortium Grant Office under Seed Grant Award No. CRDCG2017-S05.

REFERENCES

[1] B. Sun, P. B. Luh, Q.-S. Jia, Z. O’Neill, and F. Song, “Building energy doctors: An SPC and Kalman filter-based method for system-level fault detection in HVAC systems,” IEEE Transactions on Automation Science and Engineering, vol. 11, no. 1, pp. 215–229, 2014.
[2] K. Donghwoon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim, “A survey of deep learning-based network anomaly detection,” Cluster Computing, pp. 1–139, 2017.
[3] C. Varun, A. Banerjee, and V. Kumar, “Anomaly detection for discrete sequences: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 5, pp. 823–839, 2012.
[4] Y. N. S. Vilbert and Q. Chen, “Computer intrusion detection through EWMA for autocorrelated and uncorrelated data,” IEEE Transactions on Reliability, vol. 52, no. 1, 2003.
[5] Ryan and T. P., Statistical methods for quality improvement. New York, NY: John Wiley & Sons, 2011.
[6] R. S. W., “Control chart tests based on geometric moving averages,” Technometrics, vol. 1, no. 3, 1959.
[7] M. D. C. and C. M. Mastrangelo, “Some statistical process control methods for autocorrelated data,” Journal of Quality Technology, vol. 23, no. 3, 1991.
[8] C.-W. Lu and M. R. R. Jr, “EWMA control charts for monitoring the mean of autocorrelated processes,” Journal of Quality Technology, vol. 31, no. 2, 1999.
[9] M. Paige, R. E. Swanson, and C. E. Heckler, “Contribution plots: A missing link in multivariate quality control,” Applied Mathematics and Computer Science, vol. 8, no. 4, 1998.
[10] W. W. H. and M. M. Ncube, “Multivariate CUSUM quality-control procedure,” Technometrics, vol. 27, no. 3, 1985.
[11] L. C. A., W. H. Woodall, C. W. Champ, and S. E. Rigdon, “A multivariate exponentially weighted moving average control chart,” Technometrics, vol. 34, no. 1, 1992.
[12] Y. Nong and Q. Chen, “An anomaly detection technique based on a chisquare statistic for detecting intrusions into information systems,” Quality and Reliability Engineering International, vol. 17, no. 2, 2001.
[13] K. P. S. Girase and D. Mukhopadhyay, “A survey of classification techniques in the area of big data,” arXiv preprint arXiv, vol. 1, no. 11, pp. 1–7, 2015.
[14] G. Mustafaraj, J. Chen, and G. Lowry, “Development of room temperature and relative humidity linear parametric models for an open office using BMS data,” Energy and Buildings, vol. 42, pp. 348–356, Aug. 2010.
[15] F. Xiao, Y. Zhao, J. Wen, and S. Wang, “Bayesian network based FDD strategy for variable air volume terminals,” Automation in Construction, vol. 41, pp. 106–118, 2014.
[16] Z. Du, B. Fan, X. Jin, and J. Chi, “Fault detection and diagnosis for buildings and HVAC systems using combined neural networks and subtractive clustering analysis,” Building and Environment, vol. 73, pp. 1–11, 2014.
[17] D. Li, G. Hu, and C. J. Spanos, “A data-driven strategy for detection and diagnosis of building chiller faults using linear discriminant analysis,” Energy and Buildings, vol. 128, pp. 519–529, 2016.
[18] P. Jaikumar, A. Gacic, B. Andrews, and M. Dambier, “Detection of anomalous events from unlabeled sensor data in smart building environments,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 2268–2271.
[19] Y. Zhao, F. Xiao, J. Wen, Y. Lu, and S. Wang, “A robust pattern recognition-based fault detection and diagnosis (FDD) method for chillers,” HVAC&R Research, vol. 20, no. 7, pp. 798–809, 2014.

[20] T. Mulumba, A. Afshari, K. Yan, W. Shen, and L. K. Norford, “Robust model-based fault diagnosis for air handling units,” Energy and
Buildings, vol. 86, pp. 698–707, 2015.
[21] D. Li, Y. Zhou, G. Hu, and C. J. Spanos, “Fault detection and diagnosis
for building cooling system with a tree-structured learning method,”
Energy and Buildings, vol. 127, pp. 540–551, 2016.
[22] ——, “Fusing system configuration information for building cooling
plant fault detection and severity level identification,” in 2016 IEEE In-
ternational Conference on Automation Science and Engineering (CASE).
IEEE, Conference Proceedings, pp. 1319–1325.
[23] Y. Shen, S. X. Ding, X. Xie, and H. Luo, “A review on basic data-
driven approaches for industrial process monitoring,” IEEE Transactions
on Industrial Electronics, vol. 61, no. 11, pp. 6418–6428, 2014.
[24] S. Li and J. Wen, “A model-based fault detection and diagnostic
methodology based on pca method and wavelet transform,” Energy and
Buildings, vol. 68, pp. 63–71, 2014.
[25] H. Xiao, Z. Wang, Y. Liu, and D. H. Zhou, “Least-squares fault
detection and diagnosis for networked sensing systems using a direct
state estimation approach,” IEEE Transactions on Industrial Informatics,
vol. 9, no. 3, pp. 1670–1679, 2013.
[26] W. S., E. K., and G. P., “Principal component analysis,” Chemometrics
and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.
[27] W. Herman, Partial least squares. Encyclopedia of statistical sciences,
1985.
[28] D. Xuewu and Z. Gao, “From model, signal to knowledge: A data-
driven perspective of fault detection and diagnosis,” IEEE Transactions
on Industrial Informatics, vol. 9, no. 4, pp. 2226–2238, 2013.
[29] G. Ian, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in neural information processing systems, vol. ACM, 2014,
pp. 2672–2680.
[30] M. Olof, “C-rnn-gan: Continuous recurrent neural networks with adver-
sarial training,” arXiv preprint arXiv, vol. 1611, no. 09904, 2016.
[31] E. Cristóbal, S. L. Hyland, and G. Rätsch, “Real-valued (medical) time
series generation with recurrent conditional gans,” arXiv preprint arXiv,
vol. 1706, no. 02633, 2017.
[32] X. Yuan, T. Xu, H. Zhang, R. Long, and X. Huang, “Segan: Adversarial
network with multi-scale l1 loss for medical image segmentation,” arXiv
preprint arXiv, vol. 1706, no. 01805, 2017.
[33] Y. Raymond, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N.
Do, “Semantic image inpainting with perceptual and contextual losses,”
arXiv preprint arXiv, vol. 1607, no. 07539, 2016.
[34] S. Tim, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and
X. Chen, “Improved techniques for training gans,” in Advances in
Neural Information Processing Systems, 2016, pp. 2234–2242.
[35] S. Thomas, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and
G. Langs, “Unsupervised anomaly detection with generative adversarial
networks to guide marker discovery,” pp. 146–157, 2017.
[36] Z. Houssam, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar,
“Efficient gan-based anomaly detection,” arXiv preprint arXiv, vol. 1802,
no. 06222, 2018.
[37] G. Jonathan, S. Adepu, K. N. Junejo, and A. Mathur, “A dataset to
support research in the design of secure water treatment systems,”
in International Conference on Critical Information Infrastructures
Security, 2016, pp. 88–99.
[38] R. Alec, L. Metz, and S. Chintala, “Unsupervised representation learning
with deep convolutional generative adversarial networks,” arXiv preprint
arXiv, vol. 1511, no. 06434, 2015.
[39] M. A. P. and N. O. Tippenhauer, “Swat: A water treatment testbed for
research and training on ics security,” in International Workshop on
Cyber-physical Systems for Smart Water Networks (CySWater). IEEE,
2016, pp. 31–36.
[40] A. Sridhar and A. Mathur, “Distributed detection of single-stage
multipoint cyber attacks in a water treatment plant,” in Proceedings of
the 11th ACM on Asia Conference on Computer and Communications
Security. ACM, 2016, pp. 449–460.
[41] G. Jonathan, S. Adepu, M. Tan, and Z. S. Lee, “Anomaly detection in
cyber physical systems using recurrent neural networks,” in IEEE
18th International Symposium on High Assurance Systems Engineering
(HASE). IEEE, 2017, pp. 140–145.
[42] A. C. Mujeeb, C. Murguia, and J. Ruths, “Model-based attack detection
scheme for smart water distribution networks,” in Proceedings of
the 2017 ACM on Asia Conference on Computer and Communications
Security. ACM, 2017, pp. 101–113.
[43] D. Li, Y. Zhou, G. Hu, and C. J. Spanos, “Optimal sensor configuration
and feature selection for ahu fault detection and diagnosis,” IEEE
Transactions on Industrial Informatics, vol. 13, pp. 1369 – 1380, 2017.
