Anomaly Detection With Generative Adversarial Networks For Multivariate Time Series
Abstract—Today's Cyber-Physical Systems (CPSs) are large, complex, and affixed with networked sensors and actuators that are targets for cyber-attacks. Conventional detection techniques are unable to deal with the increasingly dynamic and complex [...]

[...] solutions for anomaly detection [1]. These conventional detection techniques are unable to deal with the increasingly dynamic and complex nature of the CPSs with the advent of IoT. As such, researchers have moved beyond specification[...]
In other words, D is an intuitive tool for anomaly detection. In this work, we propose to exploit both G and D for the anomaly detection task by (i) exploiting the residuals between real-time testing samples and reconstructed samples based on the mapping from the real-time space to the GAN latent space; and (ii) discrimination of the real-time series by the machine-learned discriminator. We depict this aspect of our proposed GAN-AD in the right part of Fig. 1. As shown, the testing samples are mapped back into the latent space, and the corresponding residual loss is calculated based on the difference between the reconstructed testing samples (by the trained generator) and the actual testing samples. At the same time, the testing samples are also fed to the trained discriminator to compute the discrimination loss. The two losses are then combined to detect potential anomalies in sequential CPS data (more details are described in Section III-C).

The remainder of this paper is organized as follows. Section II introduces the related works. Section III presents our proposed GAN-AD and derives an anomaly score function. Section IV introduces the multi-stage Secure Water Treatment system, which is followed by Section V, in which we evaluate our proposed GAN-AD on real-time multivariate series. Finally, Section VI summarizes the paper and proposes possible future work.

II. RELATED WORKS

The basic task of anomaly detection is to identify whether the testing data conform to the normal data distribution; the non-conforming points are called anomalies, outliers, intrusions, failures or contaminants in various application domains [3], [2]. Anomaly detection is an old but challenging problem; it has been studied in the statistics community as early as the 19th century [3].

Based on how the historical training data is used, we can broadly divide anomaly detection methods into three categories: i) Statistical Process Control (SPC) techniques, ii) supervised machine learning methods, and iii) unsupervised machine learning methods. The SPC techniques were extensively used in the early years for monitoring and controlling the quality of manufacturing processes through univariate or multivariate analysis [4]. The SPC approaches typically inspect changes in the process mean (mean shifts) and the process variance (variance changes), and try to model the relationship among multiple variables [5]. Shewhart control charts and CUSUM control charts are univariate SPC techniques that are usually applied to detect mean shifts [6], [7]. EWMV control charts are sensitive to univariate variance changes as well as mean shifts [8]. Although widely used, the aforementioned SPC techniques cannot model the correlation among multiple sequences, while such interrelations are common even in traditional manufacturing systems. As an improvement, Hotelling's T² [9], Multivariate Cumulative Sum (MCUSUM) [10] and Multivariate Exponentially Weighted Moving Average (MEWMA) [11] were proposed to monitor the performance of multiple variables in a manufacturing system. However, these multivariate methods require the independent and identically distributed (i.i.d.) assumption, which is often violated in reality [12]. Moreover, the computationally intensive multivariate SPC methods are not practical for modern CPSs with high complexity and massive data streams.

Supervised machine learning techniques can also be used for anomaly detection. A typical approach is to build a predictive classification model for the normal and the anomalous classes. The classification model is trained on labelled data. New measurements can then be analysed by the classifier and classified into the corresponding categories (normal or anomalous) automatically [13]. A wide range of supervised machine learning techniques has been used for anomaly detection, including Multivariate Regression models [14], Bayes Classifiers [15], Neural Networks (NN) [16], Fisher Discriminant Analysis (FDA) [17], Gaussian Mixture Models [18], Support Vector Data Description (SVDD) [19], Support Vector Machines (SVM) [20], and tree-structured learning methods [21], [22]. However, supervised classification methods are dependent on the availability of initial labelled training data. Given that anomalies are typically rare, obtaining enough accurately labelled anomalous samples is usually challenging.

Unsupervised learning methods, also known as descriptive or undirected classification, train models without labelled classes. Due to their simplicity and their ability to handle large amounts of process data, unsupervised learning methods have been widely used for anomaly detection in various industrial processes [23]. Popular unsupervised methods include Principal Component Analysis (PCA) [24] and Partial Least Squares (PLS) [25]. PCA is a multivariate data analysis method which preserves the significant variability information extracted from the process measurements and reduces the dimension of huge amounts of correlated data [26]. PLS is another multivariate data analysis method that has been extensively utilized for model building and anomaly detection [27]. Their key performance indicators for anomaly detection (the Squared Prediction Error (SPE) for PCA and the T² index for PLS) can be obtained with a correlation model trained off-line together with online process measurements. However, these unsupervised methods are only effective for highly correlated data, and they require the data to follow a multivariate Gaussian distribution [28].

The recently proposed GAN framework enables researchers to build a generative model via adversarial training [29]. The simultaneous training of a generator and a discriminator in an adversarial fashion is highly suggestive of using the GAN framework for anomaly detection. The current successes of GANs are mainly in generating realistic-looking images. In our CPS and IoT scenarios, however, we often have to deal with multiple streams of potentially interacting time series data. There has been limited work in adopting the GAN framework for time-series data to date. To the best of our knowledge, there are two preliminary works using GAN to generate continuous-valued sequences in the literature: one produces polyphonic music with recurrent neural networks as generator and discriminator [30], and the other uses a conditional version of a recurrent GAN to generate real-valued medical time series [31]. In both of these works, the multiple sequences were treated as i.i.d. and fed to a uniform GAN framework. This will be inadequate for the IoT and CPS setting, given the potential interactions amongst the multiple time-series-generating sensors and actuators involved in the same or different processes of such complex systems.
[Fig. 1 block diagram. Left (model training): samples drawn from the latent space are fed through the LSTM-RNN generator, and the LSTM-RNN discriminator is trained to tell the generated samples from real samples taken from the CPS. Right (anomaly detector): testing data are inverted back to the latent space, the generator produces reconstructed samples for the residual loss, and the discriminator scores the testing samples for the discrimination loss.]
Fig. 1: GAN-AD: Unsupervised GAN-based anomaly detection for CPSs. On the left is a GAN framework in which the
generator and discriminator are obtained with iterative adversarial training. On the right is the exploitation of both the GAN’s
generator and discriminator for anomaly detection—the generator is used for computing the residual loss between reconstructed
samples and real ones, while the discriminator is used to compute the discrimination loss.
Unlike traditional classification methods, the GAN-trained discriminator learns to detect fake from real in an unsupervised fashion, making GAN an attractive unsupervised machine learning technique for anomaly detection [32]. In addition, the GAN framework also produces a generator, which is actually an inexplicit model of the target system with its ability to output normal samples from a certain latent space. Inspired by [33] and [34], which update a mapping from the real-time space to a certain latent space to enhance the training of the generator and discriminator, researchers have recently proposed to train a GAN with an understandable latent space and apply it for unsupervised learning of rich feature representations for arbitrary data distributions. [35] and [36] showed the possibility of recognizing anomalies with testing samples reconstructed from the latent space, and successfully applied the proposed GAN-based detection strategy to discover unexpected markers in images. In this work, we leverage these previous works to make use of both the GAN-trained generator and discriminator to better detect anomalies based on both residual and discrimination losses.

The contributions of this paper are summarized as follows: i) a novel GAN-based unsupervised anomaly detection method is proposed to detect anomalies (cyber-attacks) in complex multi-process cyber-physical systems with networked sensors and actuators; ii) the GAN model is trained with multiple time series, which adapts GAN from the image generation domain to time series generation by adopting Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) to capture the temporal dependency; iii) normal sequences with high dimension are uniformly utilized to train the GAN model to discriminate fake from real and to reconstruct testing sequences from a specific latent space simultaneously; iv) the discrimination loss calculated by the trained discriminator and the residual loss between reconstructed and real testing sequences (to make use of both the trained discriminator and generator) are combined to detect anomalous points in the high-dimensional time series, and the proposed method is shown to outperform existing methods in detecting anomalies due to cyber-attacks in a complex Secure Water Treatment (SWaT) system with six stages [37].

III. ANOMALY DETECTION WITH GENERATIVE ADVERSARIAL TRAINING

A. GAN with LSTM-RNN

Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) have been shown to be capable of learning complex time series by retaining information from backward (or even forward) time steps in their memory cells. In this work, in order to handle the time series data of the CPS, both the generator (G) and the discriminator (D) of the GAN are substituted by LSTM-RNNs. Following the architecture of a regular GAN framework [29], the GAN model is trained as a two-player minimax game:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (1)

Specifically, the generator G, an LSTM-RNN model, implicitly defines a probability distribution for the generated samples, which can be written as G_rnn(z), where z is a sample from the random latent space. The discriminator, another LSTM-RNN model, is then trained to minimise the average negative cross-entropy between its predictions and the sequence labels (i.e., to recognize as many training samples as real as possible, and as many generated samples as fake as possible). Thus, the discriminator loss is

D_loss = (1/m) Σ_{i=1}^{m} [log D_rnn(x_i) + log(1 − D_rnn(G_rnn(z_i)))]
       ⇔ min (1/m) Σ_{i=1}^{m} [−log D_rnn(x_i) − log(1 − D_rnn(G_rnn(z_i)))]   (2)

where x_i, i = 1, ..., m are the training samples, which should be recognized as real, and G_rnn(z_i), i = 1, ..., m are the generated samples, which should be recognized as fake.

At the same time, the generator is trained to confuse the discriminator, so that the discriminator recognizes as many generated samples as real as possible. In other words, the generator loss is

G_loss = Σ_{i=1}^{m} log(1 − D_rnn(G_rnn(z_i)))  ⇔  min Σ_{i=1}^{m} log(1 − D_rnn(G_rnn(z_i)))   (3)
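To make the two objectives concrete, the sketch below writes Eqs. (2) and (3) as loss functions of the two networks' outputs. This is an illustrative Python/PyTorch fragment rather than the authors' implementation: D_rnn and G_rnn are assumed to be callables returning, respectively, per-sample probabilities in (0, 1) and generated sequences.

```python
import torch

def d_loss(D_rnn, G_rnn, x, z, eps=1e-8):
    # Eq. (2): D should assign high probability to real windows x and low
    # probability to generated windows G_rnn(z); minimize the negated average.
    real_term = torch.log(D_rnn(x) + eps)
    fake_term = torch.log(1.0 - D_rnn(G_rnn(z)) + eps)
    return -(real_term + fake_term).mean()

def g_loss(D_rnn, G_rnn, z, eps=1e-8):
    # Eq. (3): G is trained to make the discriminator accept its samples,
    # i.e. to drive log(1 - D_rnn(G_rnn(z))) down.
    return torch.log(1.0 - D_rnn(G_rnn(z)) + eps).sum()
```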
The generator loss and the discriminator loss are jointly handled by the optimizers and used to update the parameters of G_rnn and D_rnn.

Algorithm 1 LSTM-RNN-GAN-based Anomaly Detection Strategy
loop
  if epoch within the number of training iterations then
    for the kth epoch do
      Generate samples from the latent space: Z = {z_i, i = 1, ..., m} ⇒ G_rnn(Z)
      Conduct discrimination: X = {x_i, i = 1, ..., m} ⇒ D_rnn(X); G_rnn(Z) ⇒ D_rnn(G_rnn(Z))
      Update the discriminator parameters by minimizing (descending) D_loss:
          min (1/m) Σ_{i=1}^{m} (−log D_rnn(x_i) − log(1 − D_rnn(G_rnn(z_i))))
      Update the generator parameters by minimizing (descending) G_loss:
          min Σ_{i=1}^{m} log(1 − D_rnn(G_rnn(z_i)))
      Record the parameters of the discriminator and the generator in the current iteration.
    end for
  end if
  for the lth iteration do
    Map the testing data back to the latent space: Z^k = min_Z Er(X^tes, G_rnn(Z))
  end for
  Calculate the residuals: Res = |X^tes − G_rnn(Z^k)|
  Calculate the discrimination results: Pro = D_rnn(X^tes)
  Obtain the anomaly score: S = Res + Pro
end loop
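The training phase of Algorithm 1 can be sketched as a standard alternating mini-batch loop. The fragment below is a hedged illustration assuming the d_loss/g_loss helpers above and two LSTM models G_rnn and D_rnn; the text mentions Adam and gradient-descent optimizers, but which optimizer is attached to which network, as well as the learning rates, batch shapes, epoch count and the normal_windows loader, are assumptions.

```python
import torch

latent_dim, num_epochs = 15, 100                       # assumed hyperparameters
d_opt = torch.optim.SGD(D_rnn.parameters(), lr=0.1)    # gradient-descent optimizer
g_opt = torch.optim.Adam(G_rnn.parameters(), lr=1e-3)  # Adam optimizer

for epoch in range(num_epochs):
    for x in normal_windows:                           # mini-batches of normal sub-sequences
        # Generate samples from the latent space: Z = {z_i} => G_rnn(Z)
        z = torch.randn(x.size(0), x.size(1), latent_dim)

        # Update discriminator parameters by descending D_loss (Eq. 2)
        d_opt.zero_grad()
        d_loss(D_rnn, G_rnn, x, z).backward()
        d_opt.step()

        # Update generator parameters by descending G_loss (Eq. 3)
        g_opt.zero_grad()
        g_loss(D_rnn, G_rnn, z).backward()
        g_opt.step()
```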
B. GAN-based Anomaly Score

As a newly arisen unsupervised learning method, both the generator and the discriminator of the GAN are jointly trained to represent the normal variability of the data, which is helpful for identifying anomalies. To make full use of the GAN model, both the trained generator and the trained discriminator should be driven to contribute to the anomaly detection. Following the formulation in [35], the anomaly detection for CPS time series data consists of the following two parts.

• Anomaly Detection with Discrimination. Intuitively, the trained discriminator D (after a sufficient number of iterations of adversarial training) is a direct tool for anomaly detection, since it can distinguish fake from real with high sensitivity.

• Anomaly Detection with Residuals. As mentioned in previous sections, the trained generator G, which is capable of generating realistic samples, is actually a mapping from the latent space to the real data space, G(Z): Z → X, and can be viewed as an inexplicit system model that reflects the normal data's distribution. Due to the smooth transitions of the latent space mentioned in [38], the generator outputs similar samples if the inputs in the latent space are close. Thus, if it is possible to find the corresponding Z^k in the latent space for the testing data X^tes, the similarity between X^tes and G(Z^k) could explain to what extent X^tes follows the distribution reflected by G. That is to say, the residuals between X^tes and G(Z^k) can be utilized for identifying anomalies in the testing data.

As shown in the right part of Fig. 1, to find the optimal Z^k that corresponds to the testing samples, we first sample a random set Z^1 from the latent space and obtain reconstructed raw samples G(Z^1) by feeding it to the generator. Then, the samples from the latent space are updated with the gradients obtained from the error function defined with X^tes and G(Z):

min_{Z^k} Er(X^tes, G_rnn(Z^k)) = 1 − Simi(X^tes, G_rnn(Z^k))   (4)

where the similarity between sequences could be defined as covariance for simplicity.

If, after enough iteration rounds, the error is small enough, the sample Z^k is recorded as the corresponding mapping in the latent space for the testing samples. Thus, the residual at time t for the testing samples is calculated as

Res(X_t^tes) = Σ_{i=1}^{n} | x_t^{tes,i} − G_rnn(Z^k)_t^i |   (5)

where X_t^tes ∈ R^n is the vector of measurements at time step t for the n variables. In summary, the anomaly score for anomaly detection is

S_t^tes = λ · Res(X_t^tes) + (1 − λ) · D_rnn(X_t^tes)   (6)

Our GAN-based unsupervised anomaly detection strategy is summarized in Algorithm 1. Mini-batch stochastic optimization based on the Adam optimizer and the gradient descent optimizer is used to update the model parameters.
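The detection phase (Eqs. (4) to (6)) can be sketched as follows. This is again an illustrative fragment: the error function uses a mean squared error as a simple stand-in for 1 − Simi(·,·), the number of inversion steps, the learning rate and λ = 0.5 are assumptions, and D_rnn is assumed to return a per-time-step probability so that its output lines up with the residual term.

```python
import torch

def invert_to_latent(G_rnn, x_tes, latent_dim, steps=100, lr=0.01):
    # Eq. (4): search the latent space for Z^k whose reconstruction G_rnn(Z^k)
    # is closest to the testing windows, by gradient descent on Z itself.
    z = torch.randn(x_tes.size(0), x_tes.size(1), latent_dim, requires_grad=True)  # Z^1
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        err = ((x_tes - G_rnn(z)) ** 2).mean()   # stand-in for 1 - Simi(X_tes, G_rnn(Z))
        err.backward()
        opt.step()
    return z.detach()

def anomaly_score(G_rnn, D_rnn, x_tes, z_k, lam=0.5):
    res = torch.abs(x_tes - G_rnn(z_k)).sum(dim=-1)   # Eq. (5): residual per time step
    return lam * res + (1.0 - lam) * D_rnn(x_tes)     # Eq. (6): combined anomaly score
```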
C. Anomaly Detection Framework

We formulate the anomaly detection problem for multivariate time series as follows. First, consider an m-dimensional time series X = {x^(t), t = 1, ..., T} with length T (usually, T should not be too large, for the purpose of monitoring and anomaly detection), where x^(t) ∈ R^m is an m-dimensional vector of readings for the m variables at time instance t. Usually, in industrial processes or mechanical systems (such as the SWaT system considered in this paper), the sensor measurements form long time series whose total length is much greater than T. Thus, multiple predefined sub-sequences X = {X^(1), ..., X^(L)} can be obtained by taking a sliding window of length T over the raw data streams. The GAN model is trained on the normal time-series dataset X_real and generates "fake" samples X_gs that "look real".
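A minimal sketch of the windowing step described above (illustrative NumPy code; the stride is an assumption, since the text only fixes the window length T):

```python
import numpy as np

def sliding_windows(data, window_len, stride=1):
    # data: array of shape (total_length, m) holding the raw multivariate readings.
    # Returns the stacked sub-sequences X = {X^(1), ..., X^(L)} of length window_len.
    starts = range(0, data.shape[0] - window_len + 1, stride)
    return np.stack([data[s:s + window_len] for s in starts])   # shape: (L, window_len, m)
```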
Next, the testing time-series dataset X_att (or X_tes), which is real-time CPS data, can be analysed by the trained model to detect anomalous slots.

However, the use of LSTM-RNN with high-dimensional inputs (X ∈ R^51 in the SWaT case) incurs a higher computational cost than usual deep neural networks. Thus, in this paper, we adopt Principal Component Analysis (PCA) to project the high-dimensional data into a PC projection space before feeding the data to the GAN model: X_tes ∈ R^m ⇒ X̄_tes ∈ R^n. The projection is

P = PCA(X_real)
X̄_tes = X_tes P^T   (7)

where X_tes ∈ R^m, X̄_tes ∈ R^n, P ∈ R^{n×m}, m is the original dimension (namely the number of system variables), and n is the number of retained principal components. Then the projected variables are fed to the GAN-AD model, which outputs anomaly scores according to Eq. (6).
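One possible realisation of the projection in Eq. (7), assuming scikit-learn's PCA; the library choice is ours, and mean-centering (which pca.transform applies) is noted but not required by the equation as written:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_projection(X_real, n_components):
    # P = PCA(X_real): principal axes are learned from the normal data only.
    pca = PCA(n_components=n_components)
    pca.fit(X_real)                       # X_real: (num_samples, m)
    return pca.components_                # P: (n, m)

def project(X, P):
    # X_bar = X P^T, mapping m-dimensional readings into the n-dimensional PC space.
    # (In practice one may first subtract the training mean, as pca.transform does.)
    return X @ P.T
```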
Next, the following label-assigning function can be applied to identify whether the i-th variable of the projected testing time series X̄_tes at time step t is being attacked or not:

A_t^{tes,i} = { 1, if H(S(x_t^{tes,i}), 1) > τ
              { 0, otherwise                      (8)

where t = 1, ..., T and i = 1, ..., n. An anomaly is detected if the cross-entropy error H(·, ·) of the anomaly score is higher than a certain threshold τ.
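A sketch of the label-assignment rule in Eq. (8). Reading H(s, 1) as the cross-entropy of the score against the target label 1, i.e. −log s, is our assumption about the intended definition, as is the value of the threshold τ:

```python
import numpy as np

def assign_labels(scores, tau):
    # scores: anomaly scores S(x_t^{tes,i}) arranged over time steps and variables.
    h = -np.log(np.clip(scores, 1e-8, 1.0))   # cross-entropy against the target label 1
    return (h > tau).astype(int)              # A_t^{tes,i}: 1 = attacked, 0 = normal
```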
IV. SWAT SYSTEM AND CYBER-ATTACKS

A. Water Treatment System

The Secure Water Treatment (SWaT) system is an operational test-bed for water treatment that represents a small-scale version of a large modern water treatment plant found in large cities [39]³.

The water purification process in SWaT is composed of six sub-processes referred to as P1 through P6 [37]. The first process is for raw water supply and storage, and P2 is for pre-treatment, where the water quality is assessed. Undesired materials are then removed by ultra-filtration (UF) backwash in P3. The remaining chlorine is destroyed in the dechlorination process (P4). Subsequently, the water from P4 is pumped into the Reverse Osmosis (RO) system (P5) to reduce inorganic impurities. Lastly, P6 stores the water that is ready for distribution.

B. Cyber-Attacks

Various experiments have been conducted on the SWaT system to investigate cyber-attacks and the respective system responses. Please refer to the SWaT website⁴ for a detailed description of the attacks. A total of 36 attacks were launched during the 2016 SWaT data collection process [37]. Generally, the attacked points include sensors (e.g., water level sensors, flow-rate meters) and actuators (e.g., valves, pumps). A summary of the attacked points based on attack location and type is shown in Table I.

TABLE I: List of Cyber-attacks Inserted to the SWaT System

Process | Attack type | Attacked sensors     | Attacked actuators
P1      | SSSP        | LIT-101              | MV-101; P-101; P-102
        | SSMP        | (LIT-101 and MV-101)
P2      | SSSP        | AIT-202              |
        | SSMP        | (P-203 and P-205)
        | SSMP        | (P-201, P-203 and P-205)
P3      | SSSP        | LIT-301; DPIT-301    | MV-303; MV-303; MV-304; P-302
P4      | SSSP        | LIT-401; FIT-401     |
P5      | SSSP        | AIT-504              | MV-504
        | SSMP        | (P-501 and FIT-502)
P1-6    | MSMP        | (UV-401, AIT-502 and P-501)
        | MSMP        | (P-602, DPIT-301 and MV-301)
        | MSMP        | (P-302 and LIT-401)
        | MSMP        | (LIT-101, P-101 and MV-201)
        | MSSP        | (AIT-402 and AIT-502)
        | MSSP        | (FIT-401 and AIT-502)
        | MSSP        | (P-101 and LIT-301)

*FIT: flow meter; LIT: water level transmitter
*MV: motorized valve
*P: water pump / dosing pump / sodium bi-sulphate pump
*AIT: chemical analyser; UV: dechlorinator
*DPIT: differential pressure indicating transmitter
*SSSP: single stage single point attack
*SSMP: single stage multi point attack
*MSMP: multi stage multi point attack
*MSSP: multi stage single point attack

As a test-bed for research in the area of cyber security, several related works have been published based on the SWaT dataset. Some of them focus on specific attacks. For example, a distributed detection method for single-stage multi-point attacks via system-specific physical invariants is proposed in [40]. Also, Goh et al. proposed to find attacks on the first process via RNN prediction and CUSUM detection [41]. A model-based method, which derives a Kalman filter, was applied to estimate the evolution of the system dynamics on a single-variable basis [42]. In this work, we consider all the aforementioned cyber-attacks as anomalous working conditions and train our proposed GAN-AD to detect these anomalies across all six processes of SWaT (results are shown in Section V-E).

C. SWaT Dataset

The 2016 SWaT data collection process lasted for 11 days, with the system operated 24 hours per day. Various cyber-attacks were implemented on the testbed with different intents and divergent durations (from a few minutes to an hour) in the final four days. The system was either allowed to reach its normal operating state before another attack was launched, or the attacks were launched consecutively. The 2016 SWaT dataset and its associated attacks have the following characteristics⁵:

• Different attacks may last for different time durations due to different scenarios. Some attacks do not even take [...]

³ The overall testbed design was coordinated with Singapore's Public Utility Board, the nation-wide water utility company, and constructed by a third-party vendor. That collaboration ensured that the overall physical process and control system closely resemble real systems in the field, so that the results can be applied to real systems as well. For more information, please refer to https://itrust.sutd.edu.sg/research/testbeds/secure-water-treatment-swat/
⁴ http://itrust.sutd.edu.sg/research/dataset
⁵ The raw data are not plotted in this paper due to the page limit; please refer to [37] for more information.
B. System Architecture

For this study, we used an LSTM network with depth 3 and 100 hidden (internal) units for the generator. The LSTM network for the discriminator is relatively simpler, with 100 hidden units and depth 1. Inspired by the discussion about the latent space dimension in [31], we also tried different dimensions and found that a higher latent space dimension generally generates better samples, especially when generating multivariate sequences. Thus, we set the dimension of the latent space to 15 in this study.
100 hidden units and depth 1. Inspired by the discussion
about latent space dimension in [31], we also tried different where T P is the correctly detected anomaly (At = 1 while
dimensions and found that higher latent space dimension real label Lt = 1), F P is the falsely detected anomaly (At = 1
generally generates better samples especially when generating while real label Lt = 0), T N is the correctly assigned normal
multivariate sequences. Thus, we set the dimension of latent (At = 0 while real label Lt = 0), and F N is the falsely
space as 15 in this study. assigned normal (At = 0 while real label Lt = 1).
(a) Generated samples at iteration 2. (b) Generated samples at iteration 95. (c) Original samples.
Fig. 2: Comparison between samples generated at different training stages: GAN-generated samples at an early stage are quite random, while those generated at later stages almost perfectly follow the distribution of the original samples.
[...] and 80.72% to 87.68% respectively. While the 0/1 states do not provide variance across the numerical dimension, the frequency with which "0" and "1" appear along the timeline was still useful for anomaly detection.

• For the water flow-rate sensor FIT-401, the results of CUSUM were extremely poor, while GAN-AD performed well. One possible reason for CUSUM's 100% recall and false positive rate is that CUSUM did not assign any negative labels to the testing samples (i.e., it recognized all samples as anomalies) due to unsuitable normal ranges. On looking closely at the normalized raw data, we observed that the values of the flow-rate meter took a roughly 0/1 shape (just like the actuator states). However, the flow rate at the high points was not a static 1 but varied with high frequency. CUSUM is unable to capture such a "bi-variant" characteristic and hence performed badly in this case. On the other hand, GAN-AD was able to handle this and achieved an acceptable accuracy rate.

2) Multivariate: A key contribution of this work is applying our proposed GAN-AD method to solve the multivariate anomaly detection problem for time series data. For dimensionality reduction, instead of directly feeding the high-dimensional data to the GAN-AD model, we used PCA to project the raw data into a lower-dimensional principal space, as described in Eq. (7).

We plot the variance rate of the first 10 Principal Components (PC) in Fig. 5. As shown in the figure, there is one main PC that explains more than 50% of the variance of the SWaT data. Also, the PCs after the 5th one contribute little to the overall variance (near 0). As such, we projected the SWaT data onto the most variant PC (the first one) as well as onto the first 5, and then applied GAN-AD to detect anomalies in the projected data. For comparison, we also performed standard PCA-based anomaly detection by inspecting the testing dataset with the Squared Prediction Error (SPE, i.e., the residual distances calculated by the PCA projection) method.
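The SPE baseline mentioned here is the squared residual of the PCA reconstruction; one common formulation is sketched below (an assumption on our part, since the exact implementation is not given):

```python
import numpy as np

def spe(X, P):
    # Squared Prediction Error: squared distance between each sample and its
    # reconstruction from the retained principal components (rows of P).
    X_hat = (X @ P.T) @ P              # project onto the PC subspace and back
    return np.sum((X - X_hat) ** 2, axis=1)
```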
The bottom part of Table II shows the performance of multivariate anomaly detection using SPE and our proposed GAN-AD. The results show about a 3% to 12% improvement with the proposed GAN-AD. The GAN-AD method also achieved 50% to 60% higher precision and 5% to 40% higher recall compared with SPE, by assigning more true positives (correctly detected anomalies).

We also compared both GAN-AD and SPE based on PC=1 and PC=5. That is, we conducted SPE with the first one and the first five principal components, while the raw data were projected onto the first one and the first five principal components correspondingly before being fed into GAN-AD. It is interesting to see that, for both GAN-AD and SPE, the recall rates with PC=5 (which retains more than 90% of the variance) were obviously higher than those with PC=1 (which only contains around 50% of the variance, as shown in Fig. 5), which implies that using more principal components could reduce false negatives.

In terms of false positive rates, GAN-AD with PC=1 achieved the best FPR amongst all methods (both univariate and multivariate). Although GAN-AD with PC=5 outperformed the others in terms of accuracy, precision, recall, and F1, its false positive rate was slightly higher than that of GAN-AD with PC=1. A similar phenomenon could be observed in the multivariate detection by SPE. This indicates that the improvement in detection accuracy (as well as precision and recall) was built upon the sacrifice of more false positives, due to the noisy information brought in by adding four more, less important, PC dimensions.

The results in Table II also show that:
• Generally, the univariate detection cannot compete with [...]