TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks
Fig. 1. An illustration of time series anomaly detection using unsupervised learning. Given a multivariate time series, the goal is to find a set of anomalous time segments that have unusual values and do not follow the expected temporal patterns.
a model that either predicts or reconstructs a time series signal, and makes a comparison between the real and the predicted or reconstructed values. High prediction or reconstruction errors suggest the presence of anomalies.

Deep learning methods [10] are extremely capable of handling non-linearity in complex temporal correlations, and have excellent learning ability. For this reason, they have been used in a number of time series anomaly detection methods [6], [11], [12], including tools created by companies such as Microsoft [8]. Generative Adversarial Networks (GANs) [13] have also been shown to be very successful at generating time series sequences and outperforming state-of-the-art benchmarks [14]. Such a proliferation of methods invites the question: do these new, complex approaches actually perform better than a simple baseline statistical method? To evaluate the new methods, we used 11 datasets (real and synthetic) that collectively have 492 signals and thousands of known anomalies to set up a benchmarking system (see the details in Section VI and Table IV). We implemented 5 of the most recent deep learning techniques introduced between 2016 and 2019 and compared their performance with that of a baseline method from the 1970s, ARIMA. While some methods were able to beat ARIMA on 50% of the datasets, two methods failed to outperform it at all (cf. Table I).

One of the foundational challenges of deep learning-based approaches is that their remarkable ability to fit data carries the risk that they fit anomalous data as well. For example, autoencoders using an L2 objective function can fit and reconstruct data extremely accurately, thus fitting the anomalies as well. On the other hand, GANs may be ineffective at learning a Generator that fully captures the data's hidden distribution, thus causing false alarms. Here, we mix the two methods, creating a more nuanced approach. Additionally, works in this domain frequently emphasize improving the deep learning model itself. However, as we show in this paper, improving the post-processing steps can aid significantly in reducing the number of false positives.

In this work, we introduce a novel GAN architecture, TadGAN, for the time series domain. We use TadGAN to reconstruct time series and assess errors in a contextual manner to identify anomalies. We explore different ways to compute anomaly scores based on the outputs from Generators and Critics. We benchmark our method against several well-known classical and deep learning-based methods on eleven time series datasets. The detailed results can be found in Table IV.

The key contributions of this paper are as follows:
• We propose a novel unsupervised GAN-reconstruction-based anomaly detection method for time series data. In particular, we introduce a cycle-consistent GAN architecture for time-series-to-time-series mapping.
• We identify two time series similarity measures suitable for evaluating the contextual similarity between original and GAN-reconstructed sequences. Our novel approach leverages the GAN's Generator and Critic to compute robust anomaly scores at every time step.
• We conduct an extensive evaluation using 11 time series datasets from 3 reputable entities (NASA, Yahoo, and Numenta), demonstrating that our approach outperforms 8 other baselines. We further provide several insights into anomaly detection for time series data using GANs.
• We develop a benchmarking system for time series anomaly detection. The system is open-sourced and can be extended with additional approaches and datasets.¹ At the time of this writing, the benchmark includes 9 anomaly detection pipelines, 13 datasets, and 2 evaluation mechanisms.

¹ The software is available at GitHub (https://github.com/signals-dev/Orion).

The rest of this paper is structured as follows. We formally lay out the problem of time series anomaly detection in Section II. Section III presents an overview of the related literature. Section IV introduces the details of our GAN model. We describe how to use GANs for anomaly detection in Section V, and evaluate our proposed framework in Section VI. Finally, Section VII summarizes the paper and reports our key findings.

II. UNSUPERVISED TIME SERIES ANOMALY DETECTION

Given a time series X = (x_1, x_2, ..., x_T), where x_i ∈ R^{M×1} indicates M types of measurements at time step i, the goal of unsupervised time series anomaly detection is to find a set of anomalous time segments A_seq = {a_seq^1, a_seq^2, ..., a_seq^k}, where a_seq^i is a continuous sequence of data points in time that show anomalous or unusual behaviors (Figure 1) – values within the segment that appear not to comply with the expected temporal behavior of the signal. A few aspects of this problem make it both distinct from and more difficult than time series classification [15] or supervised time series anomaly detection [16], as well as more pertinent to many industrial applications. We highlight them here:
– No a priori knowledge of anomalies or possible anomalies: Unlike with supervised time series anomaly detection, we do not have any previously identified "known anomalies" with which to train and optimize the model. Rather, we train the model to learn the time series patterns, ask it to detect anomalies, and then check whether the detector identified anything relevant to end users.
– Non-availability of "normal baselines": For many real-world systems, such as wind turbines and aircraft engines, simulation engines can produce a signal that resembles normal conditions, which can be tweaked for different control regimes or to account for degradation or aging. Such simulation engines are often physics-based and provide "normal baselines," which can be used to train models such that any deviations from them are considered anomalous. Unsupervised time series anomaly detection strategies do not rely on the availability of such baselines, instead learning time series patterns from real-world signals – signals that may themselves include anomalies or problematic patterns.
– Not all detected anomalies are problematic: Detected "anomalies" may not actually indicate problems, and could instead result from external phenomena (such as sudden shifts in environmental conditions), auxiliary information (such as the fact that a test run is being performed), or other variables that the algorithm did not consider, such as regime or control setting changes. Ultimately, it is up to the end user, the domain expert, to assess whether the anomalies identified by the model are problematic. Figure 1 highlights how a trained unsupervised machine learning model can be used in real time on incoming data.
– No clear segmentation possible: Many signals, such as those associated with periodic time series, can be segmented – for example, an electrocardiogram (ECG) signal can be separated into similar segments that pertain to periods [16], [17]. The resulting segment clusters may reveal different collective patterns, along with anomalous patterns. We focus on signals that cannot be clearly segmented, making these approaches unfeasible. The length of a_seq^i is also variable and is not known a priori, which further increases the difficulty.
– How do we evaluate these competing approaches? For this, we rely on several datasets that contain "known anomalies," the details of which are introduced in Section VI-A. Presumably, the "anomalies" are time segments that have been manually identified as such by some combination of algorithmic approaches and human expert annotation. These "anomalies" are used to evaluate the efficacy of our proposed unsupervised models. More details about this can be found in Section VI-B3.

III. RELATED WORK

Over the past several years, the rich variety of anomaly types, data types and application scenarios has spurred a range of anomaly detection approaches [1], [18]–[20]. In this section, we discuss some of the unsupervised approaches. The simplest of these are out-of-limit methods, which flag regions where values exceed a certain threshold [21], [22]. While these methods are intuitive, they are inflexible and incapable of detecting contextual anomalies. To overcome this, more advanced techniques have been proposed, namely proximity-based, prediction-based, and reconstruction-based anomaly detection (Table II).

TABLE II: Unsupervised approaches to time series anomaly detection.
Methodology            | Papers
Proximity              | [23]–[25]
Prediction             | [2], [6], [26], [27]
Reconstruction         | [5], [28]–[30]
Reconstruction (GANs)  | [7], [14], [31]

A. Anomaly Detection for Time Series Data

Proximity-based methods first use a distance measure to quantify similarities between objects – single data points for point anomalies, or fixed-length sequences of data points for collective anomalies. Objects that are distant from others are considered anomalies. This detection type can be further divided into distance-based methods, such as K-Nearest Neighbors (KNN) [24] – which use a given radius to define the neighbors of an object, and the number of neighbors to determine an anomaly score – and density-based methods, such as Local Outlier Factor (LOF) [23] and Clustering-Based Local Outlier Factor [25], which further consider the density of an object and that of its neighbors. There are two major drawbacks to applying proximity-based methods to time series data: (1) a priori knowledge about anomaly duration and the number of anomalies is required; (2) these methods are unable to capture temporal correlations.

Prediction-based methods learn a predictive model to fit the given time series data, and then use that model to predict future values. A data point is identified as an anomaly if the difference between its predicted input and the original input exceeds a certain threshold. Statistical models, such as ARIMA [26], Holt-Winters [26], and FDA [27], can serve this purpose, but are sensitive to parameter selection, and often require strong assumptions and extensive domain knowledge about the data. Machine learning-based approaches attempt to overcome these limitations. [2] introduce Hierarchical Temporal Memory (HTM), an unsupervised online sequence memory algorithm, to detect anomalies in streaming data. HTM encodes the current input to a hidden state and predicts the next hidden state. A prediction error is measured by computing the difference between the current hidden state and the predicted hidden state. Hundman et al. [6] propose Long Short-Term Memory (LSTM) Recurrent Neural Networks to predict future time steps and flag large deviations from predictions.
Reconstruction-based methods learn a model to capture the latent structure (low-dimensional representations) of the given time series data and then create a synthetic reconstruction of the data. Reconstruction-based methods assume that anomalies lose information when they are mapped to a lower-dimensional space and thereby cannot be effectively reconstructed; thus, high reconstruction errors suggest a high chance of being anomalous.

Principal Component Analysis (PCA) [28], a dimensionality-reduction technique, can be used to reconstruct data, but this is limited to linear reconstruction and requires data to be highly correlated and to follow a Gaussian distribution [29]. More recently, deep learning-based techniques have been investigated, including those that use Auto-Encoders (AE) [30], Variational Auto-Encoders (VAE) [30] and LSTM Encoder-Decoders [5]. However, without proper regularization, these reconstruction-based methods can easily become overfitted, resulting in low performance. In this work, we propose the use of adversarial learning to allow for time series reconstruction. We introduce an intuitive approach for regularizing reconstruction errors. The trained Generators can be directly used to reconstruct more concise time series data – thereby providing more accurate reconstruction errors – while the Critics can offer scores as a powerful complement to the reconstruction errors when computing an anomaly score.

B. Anomaly Detection Using GANs

Generative adversarial networks can successfully perform many image-related tasks, including image generation [13], image translation [32], and video prediction [33], and researchers have recently demonstrated the effectiveness of GANs for anomaly detection in images [34], [35].

Adversarial learning for images. Schlegl et al. [36] use the Critic network in a GAN to detect anomalies in medical images. They also attempt to use the reconstruction loss as an additional anomaly detection method, and find the inverse mapping from the data space to the latent space. This mapping is done in a separate step, after the GAN is trained. However, Zenati et al. [37] indicate that this method has proven impractical for large data sets or real-time applications. They propose a bi-directional GAN for anomaly detection in tabular and image data sets, which allows for simultaneous training of the inverse mapping through an encoding network. The idea of training both encoder and decoder networks was developed by Donahue et al. [38] and Dumoulin et al. [39], who show how to achieve bidirectional GANs by trying to match joint distributions. In an optimal situation, the joint distributions are the same, and the Encoder and Decoder must be inverses of each other. A cycle-consistent GAN was introduced by Zhu et al. [32], who have two networks try to map into opposite dimensions, such that samples can be mapped from one space to the other and vice versa.

Adversarial learning for time series. Prior GAN-related work has rarely involved time series data, because the complex temporal correlations within this type of data pose significant challenges to generative modeling. Three works published in 2019 are of note. First, to use GANs for anomaly detection in time series, Li et al. [7] propose using a vanilla GAN model to capture the distribution of a multivariate time series, and using the Critic to detect anomalies. Another approach in this line is BeatGAN [31], an Encoder-Decoder GAN architecture that allows for the use of the reconstruction error for anomaly detection in heartbeat signals. More recently, Yoon et al. [14] propose a time series GAN which adopts the same idea but introduces temporal embeddings to assist network training. However, their work is designed for time series representation learning instead of anomaly detection. To the best of our knowledge, we are the first to introduce cycle-consistent GAN architectures for time series data, such that Generators can be directly used for time series reconstructions. In addition, we systematically investigate how to utilize Critic and Generator outputs for anomaly score computation. A complete framework of time series anomaly detection is introduced to work with GANs.

IV. ADVERSARIAL LEARNING FOR TIME SERIES RECONSTRUCTION

The core idea behind reconstruction-based anomaly detection methods is to learn a model that can encode a data point (in our case, a segment of a time series) and then decode the encoded one (i.e., reconstruct it). An effective model should not be able to reconstruct anomalies as well as "normal" instances, because anomalies will lose information during encoding. In our model, we learn two mapping functions between two domains X and Z, namely E : X → Z and G : Z → X (Fig. 2). X denotes the input data domain, describing the given training samples {x_i^{1...t}}_{i=1}^N, x_i^{1...t} ∈ X. Z represents the latent domain, where we sample random vectors z to represent white noise. We follow a standard multivariate normal distribution, i.e., z ∼ P_Z = N(0, 1). For notational convenience we use x_i to denote a time sequence of length t starting at time step i. With the mapping functions, we can reconstruct the input time series: x_i → E(x_i) → G(E(x_i)) ≈ x̂_i.

We propose leveraging adversarial learning approaches to obtain the two mapping functions E and G. As illustrated in Fig. 2, we view the two mapping functions as Generators. Note that E serves as an Encoder, which maps the time series sequences into the latent space, while G serves as a Decoder, which transforms the latent space into the reconstructed time series. We further introduce two adversarial Critics (aka discriminators), Cx and Cz. The goal of Cx is to distinguish between the real time series sequences from X and the generated time series sequences from G(z), while Cz measures the performance of the mapping into latent space. In other words, G tries to fool Cx by generating real-looking sequences. Thus, our high-level objective consists of two terms: (1) Wasserstein losses [40], to match the distribution of generated time series sequences to the data distribution in the target domain; and (2) cycle consistency losses [32], to prevent contradiction between E and G.
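As a concrete illustration of this setup, the sketch below lays out the four networks E, G, Cx and Cz as Keras models. Layer types and sizes follow the configuration reported later in Section VI-B2 (a 1-layer bidirectional LSTM with 100 units for E, a 2-layer bidirectional LSTM with 64 units each for G, and a 1-D convolutional layer in each Critic); the dense heads, dropout rate and activations are illustrative assumptions, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

t, k = 100, 20  # input window length (domain X) and latent dimension (domain Z)

def build_E():  # Encoder E : X -> Z
    return models.Sequential([
        layers.Bidirectional(layers.LSTM(100, return_sequences=True), input_shape=(t, 1)),
        layers.Flatten(),
        layers.Dense(k),          # assumed projection head to the latent sequence
        layers.Reshape((k, 1)),
    ], name="E")

def build_G():  # Decoder G : Z -> X
    return models.Sequential([
        layers.Bidirectional(layers.LSTM(64, return_sequences=True), input_shape=(k, 1)),
        layers.Dropout(0.2),      # "dropout is applied"; the rate is an assumption
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Flatten(),
        layers.Dense(t),          # assumed upsampling head back to window length t
        layers.Reshape((t, 1)),
    ], name="G")

def build_Cx():  # Critic on the time series domain
    return models.Sequential([
        layers.Conv1D(64, 5, activation="relu", input_shape=(t, 1)),
        layers.Flatten(),
        layers.Dense(1),          # unbounded Wasserstein score, no sigmoid
    ], name="Cx")

def build_Cz():  # Critic on the latent domain
    return models.Sequential([
        layers.Conv1D(64, 5, activation="relu", input_shape=(k, 1)),
        layers.Flatten(),
        layers.Dense(1),
    ], name="Cz")

E, G, Cx, Cz = build_E(), build_G(), build_Cx(), build_Cz()
x_hat = G(E(tf.zeros((1, t, 1))))  # traces the reconstruction path x -> E(x) -> G(E(x))
```

The last line simply traces the reconstruction path used throughout the paper; the two loss terms that train these networks are introduced next.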
A. Wasserstein Loss

The original formulation of GAN, which applies the standard adversarial losses (Eq. 1), suffers from the mode collapse problem.

L = E_{x∼P_X}[log Cx(x)] + E_{z∼P_Z}[log(1 − Cx(G(z)))]    (1)

where Cx produces a probability score from 0 to 1 indicating the realness of the input time series sequence. To be specific, the Generator tends to learn only a small fraction of the variability of the data, such that it cannot perfectly converge to the target distribution. This is mainly because the Generator prefers to produce those samples that have already been found to be good at fooling the Critic, and is reluctant to produce new ones, even though new ones might be helpful to capture other "modes" in the data.

To overcome this limitation, we apply Wasserstein loss [40] as the adversarial loss to train the GAN. We make use of the Wasserstein-1 distance when training the Critic network. Formally, let P_X be the distribution over X. For the mapping function G : Z → X and its Critic Cx, we have the following objective:

min_G max_{Cx ∈ C_x} V_X(Cx, G)    (2)

where

V_X(Cx, G) = E_{x∼P_X}[Cx(x)] − E_{z∼P_Z}[Cx(G(z))]    (3)

and C_x denotes the set of 1-Lipschitz functions. In practice, this constraint is enforced with a gradient-penalty term gp(·, ·) when updating the Critic, which penalizes Critic gradients not equal to 1 (cf. line 5 of Algorithm 1).

Following a similar approach, we introduce a Wasserstein loss for the mapping function E : X → Z and its Critic Cz. The objective is expressed as:

min_E max_{Cz ∈ C_z} V_Z(Cz, E)    (4)

The purpose of the second Critic Cz is to distinguish between random latent samples z ∼ P_Z and encoded samples E(x) with x ∼ P_X. We present the model type and architecture for E, G, Cx, Cz in Section VI-B.

B. Cycle Consistency Loss

The purpose of our GAN is to reconstruct the input time series: x_i → E(x_i) → G(E(x_i)) ≈ x̂_i. However, training the GAN with adversarial losses (i.e., Wasserstein losses) alone cannot guarantee mapping an individual input x_i to a desired output z_i which will be further mapped back to x̂_i. To reduce the possible mapping function search space, we adapt the cycle consistency loss to time series reconstruction, which was first introduced by Zhu et al. [32] for image translation tasks. We train the generative networks E and G with the adapted cycle consistency loss by minimizing the L2 norm of the difference between the original and the reconstructed samples:

V_L2(E, G) = E_{x∼P_X}[ ||x − G(E(x))||_2 ]    (5)

Combining (2), (4) and (5), the full training objective becomes:

min_{E,G} max_{Cx ∈ C_x, Cz ∈ C_z} V_X(Cx, G) + V_Z(Cz, E) + V_L2(E, G)    (6)

Fig. 2. Model architecture: Generator E serves as an Encoder which maps the time series sequences into the latent space, while Generator G serves as a Decoder that transforms the latent space into the reconstructed time series. Critic Cx distinguishes between real time series sequences from X and generated time series sequences from G(z), whereas Critic Cz measures the goodness of the mapping into the latent space.

The full architecture of our model can be seen in Figure 2. The benefits of this architecture with respect to anomaly detection are twofold. First, we have a Critic Cx that is trained to distinguish between real and fake time series sequences, hence the score of the Critic can directly serve as an anomaly measure. Second, the two Generators trained with cycle consistency loss allow us to encode and decode a time series sequence. The difference between the original sequence and the decoded sequence can be used as a second anomaly detection measure. For detailed training steps, please refer to the pseudo-code in Algorithm 1 (cf. lines 1–14). The following section will introduce the details of using TadGAN for anomaly detection.
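To make these loss terms concrete, the following sketch implements them for one batch: the two Wasserstein Critic losses with a gradient penalty (lines 5 and 7 of Algorithm 1) and the generator loss with the L2 cycle-consistency term (line 12). It assumes the E, G, Cx, Cz models from the earlier sketch; the penalty weight and the interpolation scheme are standard WGAN-GP choices assumed here, not values stated in the paper.

```python
import tensorflow as tf

def gradient_penalty(critic, real, fake):
    # Push the Critic's gradient norm toward 1 on random interpolates of real
    # and generated samples (the gp(., .) term in Algorithm 1).
    eps = tf.random.uniform(tf.shape(real)[:1])[:, None, None]
    interp = eps * real + (1.0 - eps) * fake
    with tf.GradientTape() as tape:
        tape.watch(interp)
        scores = critic(interp)
    grads = tape.gradient(scores, interp)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2]) + 1e-12)
    return tf.reduce_mean((norms - 1.0) ** 2)

def critic_x_loss(x, z, gp_weight=10.0):
    # Cx maximizes E[Cx(x)] - E[Cx(G(z))]; we minimize the negation plus the penalty (line 5).
    fake = G(z)
    return (tf.reduce_mean(Cx(fake)) - tf.reduce_mean(Cx(x))
            + gp_weight * gradient_penalty(Cx, x, fake))

def critic_z_loss(x, z, gp_weight=10.0):
    # Cz maximizes E[Cz(z)] - E[Cz(E(x))] (line 7).
    enc = E(x)
    return (tf.reduce_mean(Cz(enc)) - tf.reduce_mean(Cz(z))
            + gp_weight * gradient_penalty(Cz, z, enc))

def generator_loss(x, z):
    # E and G minimize the two Wasserstein terms plus the L2 cycle-consistency
    # term ||x - G(E(x))||_2 (Eqs. 2, 4, 5; line 12 of Algorithm 1).
    recon = G(E(x))
    cycle = tf.reduce_mean(tf.norm(tf.squeeze(x - recon, axis=-1), axis=-1))
    return -tf.reduce_mean(Cx(G(z))) - tf.reduce_mean(Cz(E(x))) + cycle
```

In the alternating scheme of Algorithm 1, the two Critic losses are minimized n_critic times per epoch before a single update of E and G with the generator loss.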
V. TIME-SERIES GAN FOR ANOMALY DETECTION (TADGAN)

Let us assume that the given time series is X = (x_1, x_2, ..., x_T), where x_i ∈ R^{M×1} indicates M types of measurements at time step i. For simplicity, we use M = 1 in the later description. Therefore, X is now a univariate time series and x_i is a scalar. The same steps can be applied for multivariate time series (i.e., when M > 1).

To obtain the training samples, we introduce a sliding window with window size t and step size s to divide the original time series into N sub-sequences X = {x_i^{1...t}}_{i=1}^N, where N = (T − t)/s. In practice, it is difficult to know the ground truth, and anomalous data points are rare. Hence, we assume all the training sample points are normal. In addition, we generate Z = {z_i^{1...k}}_{i=1}^N from a random space following a normal distribution, where k denotes the dimension of the latent space. Then, we feed X and Z to our GAN model and train it with the objective defined in (6). With the trained model, we are able to compute anomaly scores (or likelihoods) at every time step by leveraging the reconstruction error and Critic output (cf. lines 15–20).

Algorithm 1: TadGAN
Require: m, batch size; epoch, number of iterations over the data; n_critic, number of iterations of the critic per epoch; η, step size.
1  for each epoch do
2    for κ = 0, ..., n_critic do
3      Sample {x_i^{1...t}}_{i=1}^m from the real data.
4      Sample {z_i^{1...k}}_{i=1}^m from the random latent space.
5      g_wCx = ∇_wCx [ (1/m) Σ_{i=1}^m Cx(x_i) − (1/m) Σ_{i=1}^m Cx(G(z_i)) + gp(x_i, G(z_i)) ]
6      wCx = wCx + η · adam(wCx, g_wCx)
7      g_wCz = ∇_wCz [ (1/m) Σ_{i=1}^m Cz(z_i) − (1/m) Σ_{i=1}^m Cz(E(x_i)) + gp(z_i, E(x_i)) ]
8      wCz = wCz + η · adam(wCz, g_wCz)
9    end
10   Sample {x_i^{1...t}}_{i=1}^m from the real data.
11   Sample {z_i^{1...k}}_{i=1}^m from the random latent space.
12   g_wG,E = ∇_{wG,wE} [ (1/m) Σ Cx(x_i) − (1/m) Σ Cx(G(z_i)) + (1/m) Σ Cz(z_i) − (1/m) Σ Cz(E(x_i)) + (1/m) Σ ||x_i − G(E(x_i))||_2 ]
13   wG,E = wG,E + η · adam(wG,E, g_wG,E)
14 end
15 X = {x_i^{1...t}}_{i=1}^n
16 for i = 1, ..., n do
17   x̂_i = G(E(x_i))
18   RE(x_i) = f(x_i, x̂_i)
19   score = α·Z_RE(x_i) + (1 − α)·Z_Cx(x̂_i)
20 end

A. Estimating Anomaly Scores Using Reconstruction Errors

Given a sequence x_i^{1...t} of length t (denoted as x_i later), TadGAN generates a reconstructed sequence of the same length: x_i → E(x_i) → G(E(x_i)) ≈ x̂_i. Therefore, for each time point j, we have a collection of reconstructed values {x̂_i^q, i + q = j}. We take the median of the collection as the final reconstructed value x̂_j. Note that in preliminary experiments, we found that using the median achieved better performance than using the mean. Now, the reconstructed time series is (x̂_1, x̂_2, ..., x̂_T). Here we propose three different types of functions (cf. line 18) for computing the reconstruction error at each time step (assuming the interval between neighboring time steps is the same).

Point-wise difference. This is the most intuitive way to define the reconstruction error; it computes the difference between the true value and the reconstructed value at every time step:

s_t = x_t − x̂_t    (7)

Area difference. This is applied over windows of a certain length to measure the similarity between local regions. It is defined as the average difference between the areas beneath two curves of length l:

s_t = (1 / (2l)) · | ∫_{t−l}^{t+l} (x_t − x̂_t) dx |    (8)

Although this seems intuitive, it is not often used in this context – however, we will show in our experiments that this approach works well in many cases. Compared with the point-wise difference, the area difference is good at identifying regions where small differences exist over a long period of time. Since we are only given fixed samples of the functions, we use the trapezoidal rule to calculate the definite integral in the implementation.

Dynamic time warping (DTW). DTW aims to calculate the optimal match between two given time sequences [42] and is used to measure the similarity between local regions. We have two time series X = (x_{t−l}, x_{t−l+1}, ..., x_{t+l}) and X̂ = (x̂_{t−l}, x̂_{t−l+1}, ..., x̂_{t+l}), and let W ∈ R^{2l×2l} be a matrix such that the (i, j)-th element is a distance measure between x_i and x̂_j, denoted as w_k. We want to find the warp path W* = (w_1, w_2, ..., w_K) that defines the minimum distance between the two curves, subject to boundary conditions at the start and end, as well as constraints on continuity and monotonicity. The DTW distance between time series X and X̂ is defined as follows:

s_t = W* = DTW(X, X̂) = min_W [ (1/K) √( Σ_{k=1}^K w_k ) ]    (9)

Similar to the area difference, DTW is able to identify regions of small difference over a long period of time, but DTW can handle time-shift issues as well.
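The three error functions above, together with the median aggregation over overlapping windows, can be sketched in a few lines of NumPy. This is an illustrative implementation only: the `fastdtw` package is an assumed stand-in for a DTW routine, the half-window length l is an arbitrary default, and the per-step normalization of the DTW distance is a simplification of Eq. (9).

```python
import numpy as np
from fastdtw import fastdtw   # assumed stand-in for a DTW routine

def aggregate_median(windows, step=1):
    # Every time step j is covered by several reconstructed windows; take the
    # median of all reconstructed values that fall on j (Sec. V-A).
    n, t = windows.shape
    T = (n - 1) * step + t
    buckets = [[] for _ in range(T)]
    for i, w in enumerate(windows):
        for q, value in enumerate(w):
            buckets[i * step + q].append(value)
    return np.array([np.median(b) for b in buckets])

def pointwise_error(x, x_hat):
    return np.abs(x - x_hat)                                  # Eq. (7)

def area_error(x, x_hat, l=10):
    s = np.zeros(len(x))
    for j in range(len(x)):
        lo, hi = max(0, j - l), min(len(x), j + l)
        s[j] = abs(np.trapz((x - x_hat)[lo:hi])) / (2 * l)     # Eq. (8), trapezoidal rule
    return s

def dtw_error(x, x_hat, l=10):
    s = np.zeros(len(x))
    for j in range(len(x)):
        lo, hi = max(0, j - l), min(len(x), j + l)
        dist, _ = fastdtw(x[lo:hi], x_hat[lo:hi], dist=lambda a, b: abs(a - b))
        s[j] = dist / (hi - lo)                                # Eq. (9), up to normalization details
    return s
```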
NASA Yahoo S5 NAB
Property SMAP MSL A1 A2 A3 A4 Art AdEx AWS Traf Tweets
# SIGNALS 53 27 67 100 100 100 6 5 17 7 10
# ANOMALIES 67 36 178 200 939 835 6 11 30 14 33
point (len = 1) 0 0 68 33 935 833 0 0 0 0 0
collective (len > 1) 67 36 110 167 4 2 6 11 30 14 33
# ANOMALY POINTS 54696 7766 1669 466 943 837 2418 795 6312 1560 15651
# out-of-dist 18126 642 861 153 21 49 123 15 210 86 520
(% tot.) 33.1% 8.3% 51.6% 32.8% 2.2% 5.9% 5.1% 1.9% 3.3% 5.5% 3.3%
# DATA POINTS 562800 132046 94866 142100 168000 168000 24192 7965 67644 15662 158511
IS SYNTHETIC ? X X X X
TABLE III: Dataset summary. Overall, the benchmark dataset contains a total of 492 signals and 2,349 anomalies.
B. Estimating Anomaly Scores with Critic Outputs

During the training process, the Critic Cx has to distinguish between real input sequences and synthetic ones. Because we use the Wasserstein-1 distance when training Cx, the outputs can be seen as an indicator of how real (larger value) or fake (smaller value) a sequence is. Therefore, once the Critic is trained, it can directly serve as an anomaly measure for time series sequences.

Similar to the reconstruction errors, at time step j we have a collection of Critic scores (c_i^q, i + q = j). We apply kernel density estimation (KDE) on the collection and then take the maximum value as the smoothed value c_j. Now the Critic score sequence is (c_1, c_2, ..., c_T). We show in our experiments that the Critic does indeed assign different scores to anomalous regions than to normal regions. This allows for the use of thresholding techniques to identify anomalous regions.

C. Combining Both Scores

The reconstruction errors RE(x) and Critic outputs Cx(x) cannot be directly used together as anomaly scores. Intuitively, a larger RE(x) and a smaller Cx(x) indicate higher anomaly scores. Therefore, we first compute the mean and standard deviation of RE(x) and Cx(x), and then calculate their respective z-scores Z_RE(x) and Z_Cx(x) to normalize both. Larger z-scores indicate higher anomaly scores.

We have explored different ways to leverage Z_RE(x) and Z_Cx(x). As shown in Table V (rows 1–4), we first tested the three types of Z_RE(x) and Z_Cx(x) individually. We then explored two different ways to combine them (row 5 to the last row). First, we attempt to merge them into a single value a(x) with a convex combination (cf. line 19) [7], [36]:

a(x) = α·Z_RE(x) + (1 − α)·Z_Cx(x)    (10)

where α controls the relative importance of the two terms (by default α = 0.5). Second, we try to multiply both scores to emphasize high values:

a(x) = α·Z_RE(x)·Z_Cx(x)    (11)

where α = 1 by default. Both methods result in robust anomaly scores. The results are reported in Section VI-C.

D. Identifying Anomalous Sequences

Finding anomalous sequences with locally adaptive thresholding: Once we obtain anomaly scores at every time step, thresholding techniques can be applied to identify anomalous sequences. We use sliding windows to compute thresholds, and empirically set the window size as T/3 and the step size as T/(3·10). This is helpful for identifying contextual anomalies whose contextual information is usually unknown. The sliding window size determines the number of historical anomaly scores used to evaluate the current threshold. For each sliding window, we use a simple static threshold defined as 4 standard deviations from the mean of the window. We can then identify those points whose anomaly score is larger than the threshold as anomalous. Thus, continuous time points compose anomalous sequences (or windows): {a_seq^i, i = 1, 2, ..., K}, where a_seq^i = (a_start(i), ..., a_end(i)).

Mitigating false positives: The use of sliding windows can increase recall of anomalies but may also produce many false positives. We employ an anomaly pruning approach inspired by Hundman et al. [6] to mitigate false positives. First, for each anomalous sequence, we use the maximum anomaly score to represent it, obtaining a set of maximum values {a_max^i, i = 1, 2, ..., K}. Once these values are sorted in descending order, we can compute the decrease percent p^i = (a_max^{i−1} − a_max^i) / a_max^{i−1}. When the first p^i does not exceed a certain threshold θ (by default θ = 0.1), we reclassify all subsequent sequences (i.e., {a_seq^j, i ≤ j ≤ K}) as normal.
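The post-processing just described (Sections V-C and V-D) amounts to a few array operations. The sketch below shows one way to implement it: z-score normalization and combination of the two scores, locally adaptive thresholding over sliding windows, grouping of flagged points into sequences, and pruning. The defaults (window T/3, step T/30, 4 standard deviations, θ = 0.1) follow the text; the sign convention for the Critic z-score and the helper structure are illustrative assumptions.

```python
import numpy as np

def zscore(s):
    return (s - s.mean()) / s.std()

def combine_scores(re_scores, critic_scores, alpha=1.0, multiply=True):
    # Larger reconstruction error and smaller Critic output mean "more anomalous";
    # negating the Critic z-score (an illustrative choice) makes both point the same way.
    z_re, z_c = zscore(re_scores), zscore(-critic_scores)
    return alpha * z_re * z_c if multiply else alpha * z_re + (1 - alpha) * z_c

def sliding_threshold(scores, k=4.0):
    # Locally adaptive thresholding: window size T/3, step size T/30,
    # threshold = mean + 4 standard deviations within each window.
    T = len(scores)
    window, step = T // 3, max(T // 30, 1)
    flagged = np.zeros(T, dtype=bool)
    for start in range(0, T - window + 1, step):
        w = scores[start:start + window]
        flagged[start:start + window] |= w > w.mean() + k * w.std()
    return flagged

def to_sequences(flagged):
    # Group consecutive flagged time steps into (start, end) windows.
    idx = np.flatnonzero(flagged)
    if idx.size == 0:
        return []
    breaks = np.where(np.diff(idx) > 1)[0]
    starts = np.r_[idx[:1], idx[breaks + 1]]
    ends = np.r_[idx[breaks], idx[-1:]]
    return list(zip(starts, ends))

def prune(sequences, scores, theta=0.1):
    # Anomaly pruning inspired by Hundman et al. [6]: represent each sequence by its
    # maximum score, sort descending, and reclassify everything after the first
    # decrease percent that does not exceed theta as normal.
    if not sequences:
        return []
    max_scores = np.array([scores[s:e + 1].max() for s, e in sequences])
    order = np.argsort(max_scores)[::-1]
    keep = [sequences[order[0]]]
    for prev, cur in zip(order[:-1], order[1:]):
        decrease = (max_scores[prev] - max_scores[cur]) / max_scores[prev]
        if decrease <= theta:
            break
        keep.append(sequences[cur])
    return keep
```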
VI. EXPERIMENTAL RESULTS

A. Datasets

To measure the performance of TadGAN, we evaluate it on multiple time series datasets. In total, we have collected 11 datasets (a total of 492 signals) across a variety of application domains. We use spacecraft telemetry signals provided by NASA², consisting of two datasets: Mars Science Laboratory (MSL) and Soil Moisture Active Passive (SMAP). In addition, we use Yahoo S5, which contains four different sub-datasets.³ The A1 dataset is based on real production traffic to Yahoo computing systems, while A2, A3 and A4 are all synthetic datasets. Lastly, we use the Numenta Anomaly Benchmark (NAB). NAB [43] includes multiple types of time series data from various application domains.⁴ We have picked five datasets: Art, AdEx, AWS, Traf, and Tweets.

² Spacecraft telemetry data: https://s3-us-west-2.amazonaws.com/telemanom/data.zip
³ Yahoo S5 data can be requested here: https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70
⁴ NAB data: https://github.com/numenta/NAB/tree/master/data

Datasets from different sources contain different numbers of signals and anomalies, and the locations of anomalies are known for each signal. Basic information for each dataset is summarized in Table III. For each dataset, we present the total number of signals and the number of anomalies pertaining to them. We also observe whether the anomalies in the dataset are single "point" anomalies, or one or more collections. In order to suss out the ease of anomaly identification, we measure how out-of-the-ordinary each anomaly point is by categorizing it as "out-of-dist" if it falls 4 standard deviations away from the mean of all the data for a signal. As each dataset has some quality that makes detecting its anomalies more challenging, this diverse selection will help us identify the effectiveness and limitations of each baseline.

B. Experimental Setup

1) Data preparation: For each dataset, we first normalize the data to [−1, 1]. Then we find a proper interval over which to aggregate the data, such that we have several thousand equally spaced points in time for each signal. We then set a window size t = 100 and step size s = 1 to obtain training samples for TadGAN. Because many signals in the Yahoo datasets contain linear trends, we apply a simple detrending function (which subtracts the result of a linear least-squares fit to the signal) before training and testing.

2) Architecture: In our experiments, inputs to TadGAN are time series sequences of length 100 (domain X), and the latent space (domain Z) is 20-dimensional. We use a 1-layer bidirectional Long Short-Term Memory (LSTM) with 100 hidden units as Generator E, and a 2-layer bidirectional LSTM with 64 hidden units each as Generator G, where dropout is applied. We add a 1-D convolutional layer to both Critics, with the intention of capturing local temporal features that can determine how anomalous a sequence is. The model is trained on a specific signal from one dataset for 2000 iterations, with a batch size of 64.

3) Evaluation metrics: We measure the performance of the different methods using the commonly used metrics Precision, Recall and F1-score. In many real-world application scenarios, anomalies are rare and usually window-based (i.e., a continuous sequence of points; see Sec. V-D). From the perspective of end users, the best outcome is to receive timely true alarms without too many false positives (FPs), as these may waste time and resources. To penalize high FPs and reward timely true alarms, we use the following window-based rules: (1) If a known anomalous window overlaps any predicted windows, a TP is recorded. (2) If a known anomalous window does not overlap any predicted windows, an FN is recorded. (3) If a predicted window does not overlap any labeled anomalous region, an FP is recorded. This method is also used in Hundman et al.'s work [6].

4) Baselines: The baseline methods can be divided into three categories: prediction-based methods, reconstruction-based methods, and online commercial tools.

ARIMA (Prediction-based). An autoregressive integrated moving average (ARIMA) model is a popular statistical analysis model that learns autocorrelations in the time series for future value prediction. We use point-wise prediction errors as the anomaly scores to detect anomalies.

HTM (Prediction-based). Hierarchical Temporal Memory (HTM) [2] has shown better performance than many statistical analysis models in the Numenta Anomaly Benchmark. It encodes the current input to a hidden state and predicts the next hidden state. Prediction errors are computed as the differences between the predicted state and the true state, which are then used as the anomaly scores for anomaly detection.

LSTM (Prediction-based). The neural network used in our experiments consists of two LSTM layers with 80 units each, and a subsequent dense layer with one unit which predicts the value at the next time step (similar to the one used by Hundman et al. [6]). Point-wise prediction errors are used for anomaly detection.

AutoEncoder (Reconstruction-based). Our approach can be viewed as a special instance of "adversarial autoencoders" [44], E ◦ G : X → X. Thus, we compare our method with standard autoencoders with dense layers or LSTM layers [5]. The dense autoencoder consists of three dense layers with 60, 20 and 60 units, respectively. The LSTM autoencoder contains two LSTM layers, each with 60 units. Again, a point-wise reconstruction error is used to detect anomalies.

MAD-GAN (Reconstruction-based). This method [7] uses a vanilla GAN along with an optimal-instance searching strategy in latent space to support multivariate time series reconstruction. We use MAD-GAN to compute the anomaly scores at every time step and then apply the same anomaly detection method introduced in Sec. V-D to find anomalies.

Microsoft Azure Anomaly Detector (Commercial tool). Microsoft uses Spectral Residual Convolutional Neural Networks (SR-CNN), in which the models are applied serially [8]. The SR model is responsible for saliency detection, and the CNN is responsible for learning a discriminating threshold. The output of the model is a sequence of binary labels attributed to each timestamp.

Amazon DeepAR (Commercial tool). DeepAR is a probabilistic forecasting model with autoregressive recurrent networks [9]. We use this model in a similar manner to LSTM, in that it is a prediction-based approach. Anomaly scores are presented as the regression errors, which are computed as the distance between the median of the predicted value and the true value.
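Before turning to the results, the window-based rules of Section VI-B3 translate directly into a short scoring routine. The sketch below is an illustrative implementation of the three overlap rules and the resulting F1 computation, not the benchmark's exact code.

```python
def overlaps(a, b):
    # Two (start, end) windows overlap if neither ends before the other begins.
    return a[0] <= b[1] and b[0] <= a[1]

def window_based_f1(true_windows, pred_windows):
    tp = sum(any(overlaps(t, p) for p in pred_windows) for t in true_windows)  # rule (1)
    fn = len(true_windows) - tp                                                # rule (2)
    fp = sum(not any(overlaps(p, t) for t in true_windows) for p in pred_windows)  # rule (3)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# e.g. window_based_f1([(100, 130), (400, 420)], [(95, 110), (300, 310)]) -> 0.5
```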
NASA Yahoo S5 NAB
Baseline MSL SMAP A1 A2 A3 A4 Art AdEx AWS Traf Tweets Mean±SD
TadGAN 0.623 0.704 0.8 0.867 0.685 0.6 0.8 0.8 0.644 0.486 0.609 0.700±0.123
(P) LSTM 0.46 0.69 0.744 0.98 0.772 0.645 0.375 0.538 0.474 0.634 0.543 0.623±0.163
(P) Arima 0.492 0.42 0.726 0.836 0.815 0.703 0.353 0.583 0.518 0.571 0.567 0.599±0.148
(C) DeepAR 0.583 0.453 0.532 0.929 0.467 0.454 0.545 0.615 0.39 0.6 0.542 0.555±0.130
(R) LSTM AE 0.507 0.672 0.608 0.871 0.248 0.163 0.545 0.571 0.764 0.552 0.542 0.549±0.193
(P) HTM 0.412 0.557 0.588 0.662 0.325 0.287 0.455 0.519 0.571 0.474 0.526 0.489±0.108
(R) Dense AE 0.507 0.7 0.472 0.294 0.074 0.09 0.444 0.267 0.64 0.333 0.057 0.353±0.212
(R) MAD-GAN 0.111 0.128 0.37 0.439 0.589 0.464 0.324 0.297 0.273 0.412 0.444 0.35±0.137
(C) MS Azure 0.218 0.118 0.352 0.612 0.257 0.204 0.125 0.066 0.173 0.166 0.118 0.219±0.145
TABLE IV: F1-scores of baseline models using window-based rules. Color encodes the performance of the F1 score: the range from 0 to 1 is evenly divided into 10 bins, each associated with one color; from dark red to dark blue, the F1 score increases from 0 to 1.
C. Benchmarking Results

TadGAN outperformed all the baseline methods by having the highest averaged F1 score (0.7) across all the datasets. Table IV ranks all the methods based on their averaged F1 scores (the last column) across the eleven datasets. The second-best (LSTM, 0.623) and third-best (ARIMA, 0.599) methods are both prediction-based, and TadGAN outperformed them by 12.46% and 16.86%, respectively, in averaged F1 score.

Baseline models in comparison to ARIMA. Figure 3 depicts the performance of all baseline models with respect to ARIMA. It shows how much improvement in F1-score is gained by each model. The F1-score presented is the average across the eleven datasets. TadGAN achieves the best overall improvement, with over 15% improvement in score, followed by LSTM with a little over 4%. It is worth noting that all the remaining models struggle to beat ARIMA.

Fig. 3. Comparing average F1-scores of baseline models across all datasets to ARIMA. The x-axis represents the percentage of improvement over the ARIMA score by each of the baseline models. (Values shown in the chart: TadGAN +15.3%, LSTM +4.1%, DeepAR −7.2%, LSTM AE −8.2%, HTM −18.3%, Dense AE −41.1%, MAD-GAN −41.5%, MS Azure −63.4%.)

Synthetic data vs. real-world datasets. Although TadGAN outperforms all baselines on average, we note that it ranks below ARIMA when detecting anomalies within synthetic datasets with point anomalies. Specifically, TadGAN achieved an average of 0.717 while ARIMA scored an average of 0.784. However, TadGAN still produces competitive results in both scenarios.

How well do AutoEncoders perform? To view the superiority of GANs, we compare our method to other reconstruction-based methods such as LSTM AE and Dense AE. One striking result is that the autoencoder alone does not perform well on point anomalies. We observe this as LSTM AE and Dense AE obtained average F1 scores on A3 and A4 of 0.205 and 0.082, respectively, while TadGAN and MAD-GAN achieved higher scores of 0.643 and 0.527, respectively. One potential reason could be that AutoEncoders optimize an L2 objective and strictly attempt to fit the data, with the result that anomalies get fitted as well. However, adversarial learning does not have this type of issue.

TadGAN vs. MAD-GAN. Overall, TadGAN (0.7) outperformed MAD-GAN (0.219) significantly. This fully demonstrates the value of the forward cycle-consistency loss (Eq. 5), which prevents the contradiction between the two Generators E and G and paves the most direct way to the optimal z_i that corresponds to the testing sample x_i. MAD-GAN uses only a vanilla GAN and does not include any regularization mechanisms to guarantee the mapping route x_i → z_i → x̂_i. Their approach to finding the optimal z_i is to first sample a random z from the latent space and then optimize it with the gradient descent algorithm by optimizing the anomaly detection loss.

D. Ablation Study

We evaluated multiple variations of TadGAN, using different anomaly score computation methods for each (Sec. V-C). The results are summarized in Table V. Here we report some noteworthy insights.

Using the Critic alone is unstable, because it has the lowest average F1 score (0.29) and the highest standard deviation (0.237). While only using the Critic can achieve a good performance on some datasets, such as SMAP and Art, its performance may also be unexpectedly bad, such as on A2, A3, A4, AdEx, and Traf. No clear shared characteristics are identified among these five datasets (see Table III). For example, some datasets contain only collective anomalies (Traf, AdEx), while other datasets, like A3 and A4, have point anomalies as the majority type.
NASA Yahoo S5 NAB
Variation MSL SMAP A1 A2 A3 A4 Art AdEx AWS Traf Tweets Mean±SD
Critic 0.393 0.672 0.285 0.118 0.008 0.024 0.625 0 0.35 0.167 0.548 0.290±0.237
Point 0.585 0.588 0.674 0.758 0.628 0.6 0.588 0.611 0.551 0.383 0.571 0.594±0.086
Area 0.525 0.655 0.681 0.82 0.567 0.523 0.625 0.645 0.59 0.435 0.559 0.602±0.096
DTW 0.514 0.581 0.697 0.794 0.613 0.547 0.714 0.69 0.633 0.455 0.559 0.618±0.095
Critic×Point 0.619 0.675 0.703 0.75 0.685 0.536 0.588 0.579 0.576 0.4 0.59 0.609±0.091
Critic+Point 0.529 0.653 0.8 0.78 0.571 0.44 0.625 0.595 0.644 0.439 0.592 0.606±0.111
Critic×Area 0.578 0.704 0.719 0.867 0.587 0.46 0.8 0.6 0.6 0.4 0.571 0.625±0.131
Critic+Area 0.493 0.692 0.789 0.847 0.483 0.367 0.75 0.75 0.607 0.474 0.6 0.623±0.148
Critic×DTW 0.623 0.68 0.667 0.82 0.631 0.497 0.667 0.667 0.61 0.455 0.605 0.629±0.091
Critic+DTW 0.462 0.658 0.735 0.857 0.523 0.388 0.667 0.8 0.632 0.486 0.609 0.620±0.139
Mean 0.532 0.655 0.675 0.741 0.529 0.438 0.664 0.593 0.579 0.409 0.580
SD 0.068 0.039 0.137 0.211 0.182 0.154 0.067 0.209 0.081 0.087 0.02
TABLE V: F1-scores of all the variations of our model.
One explanation could be that the Critic's behavior is unpredictable when confronted with anomalies (x ≁ P_X), because it is only taught to distinguish real time segments (x ∼ P_X) from generated ones.

DTW slightly outperforms the other two reconstruction error types. Among all variations, Critic×DTW has the best score (0.629). Further, its standard deviation is smaller than most of the other variations except for Point, indicating that this combination is more stable than others. Therefore, this combination should be the safe choice when encountering new datasets without labels.

Combining Critic outputs and reconstruction errors does improve performance in most cases. In all datasets except A4, combinations achieve the best performance. Let us take the MSL dataset as an example. We observe that when using DTW alone, the F1 score is 0.514. Combining this with the Critic score, we obtain a score of 0.623, despite the fact that the F1 score when using the Critic alone is 0.393. In addition, we find that after combining the Critic scores, the averaged F1 score improves for each of the individual reconstruction error computation methods. However, one interesting pattern is that for dataset A4, which consists mostly of point anomalies, using only point-wise errors achieves the best performance.

Multiplication is a better option than convex combination. Multiplication consistently leads to a higher averaged F1 score than convex combination does when using the same reconstruction error type (e.g., Critic×Point vs. Critic+Point). Multiplication also has consistently smaller standard deviations. Thus, multiplication is the recommended way to combine reconstruction scores and Critic scores. This can be explained by the fact that multiplication can better amplify high anomaly scores.

E. Limitations and Discussion

such as Time-Series GAN [14]. Due to our modular design, any reconstruction-based algorithm for time series can employ our anomaly scoring method for time series anomaly detection. In the future, we plan to investigate various strategies for time series reconstruction and compare their performance to the current state of the art. Moreover, it is worth understanding how better signal reconstruction affects the performance of anomaly detection. In fact, it is expected that better reconstruction might overfit to anomalies. Therefore, further experiments are required to understand the relationship between reconstruction and detecting anomalies.

VII. CONCLUSION

In this paper, we presented a novel framework, TadGAN, that allows for time series reconstruction and effective anomaly detection, showing how GANs can be effectively used for anomaly detection in time series data. We explored point-wise and window-based methods to compute reconstruction errors. We further proposed two different ways to combine reconstruction errors and Critic outputs to obtain anomaly scores at every time step. We have also tested several anomaly-scoring techniques and reported the best-suited one in this work. Our experimental results showed that (1) TadGAN outperformed all the baseline methods by having the highest averaged F1 score across all the datasets, and showed superior performance over baseline methods in 6 out of 11 datasets; (2) window-based reconstruction errors outperformed the point-wise method; and (3) the combination of both reconstruction errors and Critic outputs offers more robust anomaly scores, which help to reduce the number of false positives as well as increase the number of true positives. Finally, our code is open source and is available as a tool for benchmarking time series datasets for anomaly detection.