lous time series. It used time-series warping for data augmentation to improve accuracy and demonstrated their effectiveness in fast inference. Another method, named MSCRED [24], proposed a novel attention-based ConvLSTM architecture to take temporal dependency into account. OmniAnomaly [19] and USAD [1] also adopted reconstruction-based model architectures. However, none of these recent works considers temporal resolution at varying scales.

2.2. Learning with Multi-Resolution

There have been a few approaches that extract temporal features at multiple time scales. The Temporal Hierarchical One-Class (THOC) network fuses and processes multi-resolution features through a hierarchical network with dilated recurrent neural networks (RNNs) [17]. More recently, the recurrent autoencoder with multi-resolution ensemble decoding (RAMED) uses decoders of different decoding lengths and proposes a coarse-to-fine fusion mechanism for computing multi-resolution outputs [18]. Both methods adopt a multi-resolution feature extraction technique and a fusion mechanism to capture the temporal dynamics across resolutions of varying scales.

3. The Mr.TAD Framework

In this section, we describe our proposed algorithm Mr.TAD, which captures both intra- and inter-resolution features. First, we introduce the overall architecture of Mr.TAD and then explain the details of the encoding and decoding parts.

3.1. Problem Statement

A time series T = <s_1, s_2, ..., s_T> is a series of data points indexed chronologically. Each data point s_t = {x_t^1, x_t^2, ..., x_t^d} represents the d features of an entity at a specific timestamp t. A time series anomaly detection model aims to compute an anomaly score for each data point in a given sequence. A data point s_t at time t is classified as an anomaly if its anomaly score exceeds a predefined threshold.

3.2. Architecture

Figure 1 depicts the overall architecture of Mr.TAD, which is based on an ensemble of sequence-to-sequence autoencoders at different resolutions to learn multi-resolution features. A latent vector h^(E*), which represents the common information between different resolutions, is shared by the encoders. Thus, h^(E*) helps the model learn inter-resolution features from the various resolutions. For each resolution k, there also exists a hidden state h^(E_k). It reflects our claim that each resolution has its own unique characteristics and helps the model learn the intra-resolution features.

3.3. Multi-resolution with Skip RNN Autoencoders

We adopt the idea of recurrent skip connection networks (RSCNs) [22] to learn the latent representations from the encoders and decoders of each resolution. As shown in Figure 2, we make sparse skip connections to learn the lower-resolution features and dense skip connections for the higher-resolution features. Because RSCNs use a sparseness weight vector w_t that decides which connections should be active at each time step t, we make the skip connections according to the resolution level.

3.3.1 Encoder

We employ a recurrent neural network (RNN) to encode the time series data. Given a time series T of length T, the hidden state h of the encoder at time t is computed as

    h_t^(E) = f^(E)([s_t ; h_{t-1}^(E)]),                                      (1)

where h_{t-1}^(E) is the previous state and f^(E) is a nonlinear activation function.

In the encoder, the time series T is fed into each of the RNN units (here, GRU is selected as it performs well in preliminary experiments) of the different resolutions. The hidden state h_t is computed as Equation 2:

    h_t^(E) = f^(E)([s_t ; (w_1 h_{t-1}^(E) + w_2 h_{t-L}^(E)) / (|w_1| + |w_2|)]),      (2)

where (w_1, w_2) is randomly sampled from {(1, 0), (0, 1), (1, 1)} at each time step. We use L as the skip length, formulated as 2^(k-1) for the k-th resolution. Furthermore, the k-th resolution is represented as Equation 3, which is the concatenated result of intra- and inter-resolution features. The hidden state of the k-th resolution is defined as

    h_k = f_MLP(concat[h^(E_k), h^(E*)]),                                      (3)

where h^(E_k) denotes the intra-resolution hidden state and h^(E*) denotes the inter-resolution hidden state, calculated as

    h^(E*) = f_MLP(concat[h^(E_1), h^(E_2), ..., h^(E_K)]).                    (4)

The inter-resolution representation is responsible for capturing the global features of multiple resolutions through the shared latent vector.

3.3.2 Decoder

The compressed representation of the encoders, h_T^(E), is fed into the decoder, which also uses a GRU. The outputs (i.e., the reconstructed time series) of the decoder are generally produced in reverse chronological order in time series anomaly detection [10, 22], since this has been shown to be more efficient.
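To make Equations 2-4 concrete, the sketch below is a minimal NumPy illustration of the skip-connection encoder update and the intra-/inter-resolution fusion. The cell and MLP here are single tanh layers standing in for the GRU and f_MLP of the paper, and the names gru_like_cell, mlp, and skip_gru_encode are our own placeholders, not the released Mr.TAD code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_like_cell(x, params):
    # Placeholder recurrent cell: one tanh layer standing in for the GRU.
    W, b = params
    return np.tanh(W @ x + b)

def mlp(x, params):
    # Placeholder fusion network (f_MLP in Eq. 3 and 4).
    W, b = params
    return np.tanh(W @ x + b)

def skip_gru_encode(series, k, cell_params, hidden_dim):
    """Encoder of the k-th resolution with skip length L = 2^(k-1) (Eq. 2)."""
    T, d = series.shape
    L = 2 ** (k - 1)
    h = np.zeros((T + 1, hidden_dim))            # h[0] is the initial state
    choices = [(1, 0), (0, 1), (1, 1)]
    for t in range(1, T + 1):
        w1, w2 = choices[rng.integers(len(choices))]   # sample (w_1, w_2)
        h_skip = h[max(t - L, 0)]                # h_{t-L}, clipped at the start
        mixed = (w1 * h[t - 1] + w2 * h_skip) / (abs(w1) + abs(w2))
        x = np.concatenate([series[t - 1], mixed])     # [s_t ; mixed state]
        h[t] = gru_like_cell(x, cell_params)
    return h[T]                                  # final hidden state h_T^(E_k)

# Toy setup: 3 resolutions, 2-dimensional series of length 16.
T, d, hidden = 16, 2, 8
series = rng.normal(size=(T, d))
cell_params = (rng.normal(size=(hidden, d + hidden)) * 0.1, np.zeros(hidden))

# Intra-resolution states h^(E_k) for k = 1, 2, 3.
intra = [skip_gru_encode(series, k, cell_params, hidden) for k in (1, 2, 3)]

# Inter-resolution state h^(E*) = f_MLP(concat[h^(E_1), ..., h^(E_K)])   (Eq. 4).
inter_params = (rng.normal(size=(hidden, hidden * 3)) * 0.1, np.zeros(hidden))
h_star = mlp(np.concatenate(intra), inter_params)

# Per-resolution fused state h_k = f_MLP(concat[h^(E_k), h^(E*)])        (Eq. 3).
fuse_params = (rng.normal(size=(hidden, hidden * 2)) * 0.1, np.zeros(hidden))
h_k = [mlp(np.concatenate([h, h_star]), fuse_params) for h in intra]
print([h.shape for h in h_k])
```

A decoder for resolution k would mirror this loop, unrolling a GRU from the fused state h_k and emitting the reconstructed window in reverse chronological order, as described in Section 3.3.2.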
Figure 1. Overall architecture of Mr.TAD. Skip-RNN encoders E_1, ..., E_K (from higher to lower resolution) produce the per-resolution hidden states h^(E_1), ..., h^(E_K) and share the latent vector h^(E*); the corresponding Skip-RNN decoders D_1, ..., D_K reconstruct the input time series s_1, s_2, ..., s_T as the outputs ŝ_T^(i), ŝ_{T-1}^(i), ..., ŝ_1^(i).
Table 1. Dataset Information

Datasets              Dim.   # Training   # Testing   % Anomaly
Yahoo S5      A1        1       55,790       23,883       4.39
              A2        1       72,400       69,700       0.67
              A3        1      108,192       46,368       0.65
              A4        1       96,432       41,328       0.58
NASA          SMAP     25      140,825      444,035      12.85
              MSL      55       58,317       73,729      10.48
SMD                    38      708,405      708,420       4.16
ECG           A         2        1,833        1,841      14.61
              B         2        2,439        1,287      12.35
              C         2       10,863        3,348       4.45
              D         2        2,610        1,121      11.51
              E         2        2,011        1,447       9.61
              F         2        2,943        2,255       8.38
              G         2       34,735        9,882       2.01
              H         2        2,373        2,721       9.52
              I         2        3,152        1,756      21.58
Power Demand            1        1,513        1,596      13.22
2D Gesture              2        8,451        2,800      26.04

them. For datasets that do not have a predefined train-test split, we select 70% of the dataset for training and 30% for testing. Further, 30% of the training set is used for validation.

Yahoo S5¹. This dataset consists of real and synthetic time series with labeled anomaly points. The dataset tests the detection of various anomaly types, including outliers and change-points. The synthetic data contains time series with varying trend, noise, and seasonality, while the real data represents the metrics of various Yahoo services.

Power Demand [9]. The Power Demand dataset contains one year of power consumption records measured by a Dutch research facility in 1997.

ECG [9]. The ECG dataset contains anomalous beats from electrocardiograms. It has nine sub-datasets.

2D Gesture [9]. The 2D Gesture dataset contains time series of X-Y coordinates of an actor's right hand. The data is extracted from a video in which the actor grabs a gun from his hip-mounted holster, moves it to the target, and returns it to the holster. The anomalous region is the area where the actor fails to return his gun to the holster.

NASA Dataset [8]. The NASA dataset consists of the Soil Moisture Active Passive (SMAP) satellite and Mars Science Laboratory (MSL) rover datasets. SMAP and MSL are two real-world public datasets labeled by experts at NASA. They contain data from 55/27 entities, each monitored by m = 25/55 metrics, respectively.

Server Machine Dataset (SMD) [19]. SMD is a new 5-week-long dataset from a large Internet company, collected and made publicly available. It contains data from 28 server machines, each monitored by m = 38 metrics.

¹ https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70

4.2. Comparison Baselines

We compare Mr.TAD against the following competitive baselines, including 3 traditional methods and 3 deep learning-based models with their variants.

Traditional Methods. (1) Local Outlier Factor (LOF) [2] is a popular density-based outlier detection method. (2) Isolation Forest (IF) [13] is an unsupervised learning algorithm based on a forest of randomized isolation trees. (3) One-Class Support Vector Machine (OC-SVM) [16] is also an unsupervised learning algorithm, based on a kernel method for outlier detection.

Deep Learning-based Methods. (1) Auto-Encoder (AE) [15] is a generative unsupervised deep learning algorithm that reconstructs high-dimensional input data using a neural network with a bottleneck latent layer containing the compressed representation of the input. (2) Variational Auto-Encoder (VAE) [23] is a modified reconstruction model designed to prevent the model from reconstructing abnormal samples well; by adding a constraint on the latent space, the model is forced to generate latent variables similar to those of the training samples. (3) Generative Adversarial Network (GAN) [7] is composed of a generator, which compresses and decompresses the input to generate a time series, and a discriminator, which tries to distinguish whether the time series is normal or not. (4) Sparsely-connected RNNs with Shared Framework (S-RNNS SF) [10] is an ensemble network of sparsely connected RNNs with randomly removed connections; a shared framework is employed to enable interactions among the autoencoders. (5) Dilated LSTM AE is our baseline with the manual (fixed) structure of Mr.TAD; during time-series preprocessing, a multi-resolution skip connection is implemented.

Variations of the AE-based models are CNN, LSTM, GRU, and Bi-directional GRU, while for the others, CNN, LSTM, and GRU are applied.

4.3. Evaluation Metrics

Following the common practices adopted by recent studies [21, 19, 17, 18], including the point-adjust approach, we adopt Precision (P), Recall (R), best F1 score (F1), and AUC as our evaluation metrics:

    P = TP / (TP + FP),    R = TP / (TP + FN),    F1 = 2 · P · R / (P + R),

where TP is True Positives, FP is False Positives, and FN is False Negatives. Also, we evaluate the area under the precision-recall curve (PRAUC), which represents the average of the precision scores calculated at each recall threshold, and the area under the ROC curve (ROCAUC), which represents the trade-off between the true positive rate and the false positive rate. Finally, all metrics are computed with 1000 thresholds generated uniformly from 0 to the maximum anomaly score over all time steps in the test set [22].
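To make the evaluation protocol concrete, the sketch below sweeps 1000 uniformly spaced thresholds over the anomaly scores, records P, R, and F1 at each, keeps the best F1, and approximates PRAUC and ROCAUC by integrating the swept curves. This is a simplified stand-in (the point-adjust step is omitted and the function name sweep_metrics is ours), not the exact evaluation code.

```python
import numpy as np

def trapezoid(y, x):
    # Trapezoidal integration after sorting by the x-axis.
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def sweep_metrics(scores, labels, n_thresholds=1000):
    """Best F1, PRAUC, and ROCAUC from anomaly scores and binary labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.linspace(0.0, scores.max(), n_thresholds)

    precisions, recalls, fprs, f1s = [], [], [], []
    for th in thresholds:
        pred = (scores > th).astype(int)
        tp = int(np.sum((pred == 1) & (labels == 1)))
        fp = int(np.sum((pred == 1) & (labels == 0)))
        fn = int(np.sum((pred == 0) & (labels == 1)))
        tn = int(np.sum((pred == 0) & (labels == 0)))
        p = tp / (tp + fp) if tp + fp else 1.0   # convention for empty predictions
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        fprs.append(fp / (fp + tn) if fp + tn else 0.0)

    prauc = trapezoid(precisions, recalls)       # area under precision-recall curve
    rocauc = trapezoid(recalls, fprs)            # area under TPR-FPR curve
    return max(f1s), prauc, rocauc

# Toy usage with random scores and sparse labels.
rng = np.random.default_rng(0)
scores = rng.random(500)
labels = (rng.random(500) < 0.05).astype(int)
print(sweep_metrics(scores, labels))
```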
Table 2. Overall Performance Comparison
Yahoo S5 NASA SMD ECG Power Demand 2D Gesture
F1 PRAUC ROCAUC F1 PRAUC ROCAUC F1 PRAUC ROCAUC F1 PRAUC ROCAUC F1 PRAUC ROCAUC F1 PRAUC ROCAUC
IF 0.3226 0.0913 0.2494 0.0440 0.0048 0.0017 0.3199 0.2468 0.1788 0.2269 0.1003 0.0855 0.1826 0.0323 0.0566 0.5091 0.3883 0.3604
LOF 0.5224 0.2364 0.1934 0.2256 0.0427 0.0350 0.3354 0.2640 0.3674 0.2819 0.1897 0.0718 0.0323 0.0008 0.0006 0.4533 0.2553 0.0785
OC-SVM 0.4667 0.1243 0.4889 0.1857 0.0473 0.0578 0.3348 0.1621 0.5072 0.2545 0.1488 0.2435 0.2451 0.0720 0.1301 0.5072 0.3946 0.3199
CNN-AE 0.6146 0.3557 0.7839 0.2259 0.3218 0.5047 0.4206 0.4185 0.7069 0.2483 0.1481 0.5260 0.2691 0.0996 0.3295 0.4132 0.3075 0.4596
LSTM-AE 0.6306 0.3695 0.7838 0.2123 0.3052 0.4916 0.4483 0.4674 0.7277 0.3011 0.2129 0.5636 0.2632 0.1046 0.3477 0.4699 0.3260 0.5948
GRU-AE 0.6509 0.3870 0.8103 0.2123 0.3141 0.4929 0.4469 0.4780 0.7215 0.2821 0.2065 0.5873 0.2816 0.1012 0.3259 0.4435 0.3640 0.5666
BiGRU-AE 0.7680 0.4867 0.8735 0.2123 0.3135 0.4953 0.4430 0.4817 0.7110 0.2762 0.1948 0.5768 0.2643 0.1205 0.3998 0.4401 0.3259 0.5665
CNN-VAE 0.5351 0.3063 0.7292 0.2131 0.3145 0.4567 0.4379 0.4445 0.7376 0.2688 0.1756 0.5482 0.2679 0.0953 0.2934 0.4548 0.2815 0.5773
LSTM-VAE 0.4383 0.2332 0.6749 0.2123 0.3146 0.4571 0.4420 0.4388 0.7508 0.2598 0.1725 0.5539 0.2643 0.1059 0.3646 0.4262 0.2573 0.5188
GRU-VAE 0.3822 0.2074 0.6147 0.2123 0.2003 0.4757 0.4426 0.4377 0.7517 0.2509 0.1710 0.5568 0.2643 0.1152 0.4110 0.4167 0.2446 0.4980
CNN-GAN 0.3906 0.1791 0.4934 0.2123 0.3127 0.4596 0.4078 0.3991 0.7005 0.2444 0.1583 0.5314 0.2632 0.0989 0.3181 0.4132 0.2318 0.4680
LSTM-GAN 0.2001 0.0819 0.2514 0.2123 0.1501 0.4731 0.4297 0.4188 0.7047 0.2153 0.1351 0.4983 0.4524 0.3270 0.6852 0.4444 0.3424 0.5506
GRU-GAN 0.2318 0.0957 0.2751 0.2123 0.2347 0.4434 0.4147 0.4020 0.6942 0.2735 0.1819 0.6097 0.4167 0.3270 0.6852 0.4132 0.2402 0.4904
S-RNNS SF 0.4712 0.2508 0.6407 0.2123 0.3157 0.4635 0.4492 0.4472 0.7477 0.2802 0.1831 0.5527 0.2749 0.0960 0.3189 0.4640 0.2836 0.5646
Dilated LSTM AE 0.6542 0.3933 0.8031 0.2123 0.3145 0.4943 0.4579 0.4589 0.7435 0.2901 0.2055 0.5420 0.2802 0.0990 0.3051 0.4770 0.3831 0.6374
Mr.TAD 0.5000 0.2679 0.6635 0.2123 0.3144 0.4900 0.4553 0.4632 0.7411 0.2710 0.1842 0.5572 0.2755 0.0968 0.3152 0.4759 0.3952 0.6329
[Figure 3: bar charts of F1, PRAUC, and ROCAUC, averaged over all datasets, for each method.]
Table 3. Effect of Intra- and Inter-Resolution

                        F1       PRAUC    ROCAUC
w/o Intra-resolution    0.3244   0.2220   0.5425
w/o Inter-resolution    0.3158   0.2122   0.5307
Mr.TAD                  0.3372   0.2396   0.5744

Table 4. Effect of Skip Length Strategy

                          F1       PRAUC    ROCAUC
linear skip connection    0.3356   0.2367   0.5720
random skip connection    0.3340   0.2348   0.5681
Mr.TAD                    0.3372   0.2396   0.5744

Table 5. Effect of Weighting Strategy

                     F1       PRAUC    ROCAUC
w/o weights          0.3343   0.2343   0.5648
w/ uniform weight    0.3380   0.2327   0.5661
Mr.TAD               0.3372   0.2396   0.5744

Table 6. Effect of Regularization Techniques

                      F1       PRAUC    ROCAUC
w/o regularization    0.3360   0.2395   0.5762
w/ L1                 0.3369   0.2391   0.5709
Mr.TAD                0.3372   0.2396   0.5744

4.4. Implementation Details

As in the recent study [18], we use three encoders and three decoders. Each encoder and decoder is a single-layer GRU with 64 units. Due to the time limitation, we do not perform any hyperparameter tuning. All hyperparameter settings were taken from relevant prior studies [17, 18, 10], except for the sliding window size and the stride length. In this work, we set the sliding window to the following lengths: 16 for Power Demand; 8 for Yahoo S5, NASA, and SMD; and 4 for ECG and 2D Gesture. The stride length is half of the sliding window size. We had to use these very short sequence lengths due to the limited time and resources.

We implement all algorithms in Python 3.7. The machine learning methods (i.e., LOF, IF, and OC-SVM) are implemented using scikit-learn 0.24.2. We implement the deep learning-based methods, including ours, using TensorFlow 2.4. The Adam optimizer [11] is used with an initial learning rate of 0.001. Early stopping and learning rate decay are also adopted to avoid overfitting. We set λ to 0.005. Mr.TAD and all baselines are trained on the same platform.
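As a small illustration of the preprocessing described above, the sketch below slices a series into overlapping windows with a stride of half the window size. The helper name make_windows and the dictionary of per-dataset window lengths are our own; the lengths simply mirror the values listed in this section.

```python
import numpy as np

# Window lengths used in this work (stride is always half the window size).
WINDOW_SIZES = {
    "Power Demand": 16,
    "Yahoo S5": 8, "NASA": 8, "SMD": 8,
    "ECG": 4, "2D Gesture": 4,
}

def make_windows(series, window, stride=None):
    """Slice a (T, d) array into overlapping (n_windows, window, d) windows."""
    if stride is None:
        stride = window // 2                 # stride = half the window size
    T = series.shape[0]
    starts = range(0, T - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

# Toy usage: a univariate series of length 100 with the Yahoo S5 setting.
series = np.random.randn(100, 1)
windows = make_windows(series, WINDOW_SIZES["Yahoo S5"])
print(windows.shape)   # (24, 8, 1) -> 24 windows of length 8
```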
4.5. Experimental Results

4.5.1 Performance Comparison

We report the performance of all models grouped by dataset in Table 2 on all metrics and visualize the globally averaged performance grouped by metric in Figure 3. In general, all methods give similar F1 scores, even the traditional models. This is possible because any model can achieve a reasonably good result, given 1000 thresholds. However, the deep learning-based methods perform better when measured with PRAUC and ROCAUC in all cases, with a significant margin. This indicates that the deep learning-based models have lower false positive rates and higher true negative rates, which is highly desirable for anomaly detection. Besides, it is observable that AE-based methods outperform both VAEs and GANs, especially with bidirectional learning. Still, according to Tables 2, 7, 8, and 9, some VAE-based and GAN-based models can achieve higher scores than the AEs. This is probably because specific methods may favor particular patterns over others. Therefore, developing a model that can overcome this issue is a promising research direction. Concerning our Mr.TAD, it does not outperform all baselines on all datasets, yet it obtains reasonable results. We suspect that this results from the short sequence length, which degrades the performance of our method, similarly to S-RNNS SF. However, our proposed method is expected to achieve better results with a longer sequence length, i.e., a larger window size. Additionally, the Dilated LSTM AE still attains adequate performance because its skip connection is made at the input level; thus, there is no accumulated loss between the hidden states of each timestamp.

4.5.2 Ablation Study

To examine the contributions of each design choice, we perform ablation studies on the following aspects.

Effect of intra- and inter-resolution learning. In this experiment, we use the identical architecture as the full model but remove one latent vector at a time. As presented in Table 3, it is evident that using both latent representations results in better performance. Notably, when removing the inter-resolution (i.e., joint) representation, the performance decreases significantly. These results are also comparable to the previous study [10].

Effect of skip length strategy. In this experiment, we study the difference between skip length strategies by changing 2^(k-1) to k for a linear case, or to a random value between 2 and T. The results, presented in Table 4, indicate minor differences between the strategies. Thus, we believe that the model can still perform multi-resolution learning, given the same number of autoencoders.

Effect of weighting strategy. We study the effect of the weights w_1 and w_2 in this experiment. For the first case, we simply remove all the weights, so that the input to the next hidden state is the sum of the previous hidden state and the hidden state L steps before. For the second case, we set w_1 = w_2 = 1. Table 5 presents the results of this experiment. We observe that the model performs better with weight control, whether the weights are set uniformly or selected randomly.

Effect of regularization. In [10], the authors suggest that L1 regularization has the effect of making the shared hidden state h_T^(E) sparse, resulting in more robust decoders. However, we question whether this is the case in our method. Accordingly, we first remove the regularizer. Then, we try L1 as suggested. However, we see that using L2 regularization achieves better performance on average. Therefore, we decide to use the L2 regularizer with Mr.TAD instead of L1.

5. Conclusion

In this paper, we introduce a sequence-to-sequence-based network called Mr.TAD for time series anomaly detection. Mr.TAD learns intra-resolution features, which represent the internal characteristics within each resolution, through skip connections whose length depends on the resolution level. Moreover, it learns inter-resolution features, for which the learned representations are shared across all resolution levels. Through extensive experiments on commonly used benchmarks, we show that Mr.TAD outperforms classical algorithms and is comparable to deep learning-based methods.

References

[1] Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A. Zuluaga. USAD: Unsupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3395-3404, 2020.
[2] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104, 2000.
[3] Wanpracha Art Chaovalitwongse, Ya-Ju Fan, and Rajesh C. Sachdeo. On the time series k-nearest neighbor classification of abnormal brain activity. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 37(6):1005-1016, 2007.
[4] Jinghui Chen, Saket Sathe, Charu Aggarwal, and Deepak Turaga. Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 90-98. SIAM, 2017.
[5] Chih-Chun Chia and Zeeshan Syed. Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 125-134, 2014.
[6] Alexander Geiger, D. Liu, Sarah Alnegheimish, Alfredo Cuesta-Infante, and K. Veeramachaneni. TadGAN: Time series anomaly detection using generative adversarial networks. In 2020 IEEE International Conference on Big Data (Big Data), pages 33-43, 2020.
[7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[8] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 387-395, 2018.
[9] Eamonn Keogh, Jessica Lin, and Ada Fu. HOT SAX: Efficiently finding the most unusual time series subsequence. In Fifth IEEE International Conference on Data Mining (ICDM'05), pages 8 pp. IEEE, 2005.
[10] Tung Kieu, Bin Yang, Chenjuan Guo, and Christian S. Jensen. Outlier detection for time series with recurrent autoencoder ensembles. In IJCAI, pages 2725-2732, 2019.
[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] Istvan Kiss, Béla Genge, Piroska Haller, and Gheorghe Sebestyén. Data clustering-based anomaly detection in industrial control systems. In 2014 IEEE 10th International Conference on Intelligent Computer Communication and Processing (ICCP), pages 275-281. IEEE, 2014.
[13] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413-422. IEEE, 2008.
[14] Junshui Ma and Simon Perkins. Time-series novelty detection using one-class support vector machines. In Proceedings of the International Joint Conference on Neural Networks, volume 3, pages 1741-1745. IEEE, 2003.
[15] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148, 2016.
[16] Larry M. Manevitz and Malik Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, 2(Dec):139-154, 2001.
[17] Lifeng Shen, Zhuocong Li, and James Kwok. Timeseries anomaly detection using temporal hierarchical one-class network. Advances in Neural Information Processing Systems, 33:13016-13026, 2020.
[18] Lifeng Shen, Zhongzhong Yu, Qianli Ma, and James T. Kwok. Time series anomaly detection with multiresolution ensemble decoding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9567-9575, 2021.
[19] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2828-2837, 2019.
[20] Renjie Wu and Eamonn J. Keogh. Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. arXiv preprint arXiv:2009.13807, 2020.
[21] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the 2018 World Wide Web Conference, pages 187-196, 2018.
[22] Yong-Ho Yoo, Ue-Hwan Kim, and Jong-Hwan Kim. Recurrent reconstructive network for sequential anomaly detection. IEEE Transactions on Cybernetics, 51:1704-1715, 2021.
[23] Chunkai Zhang and Yingyang Chen. Time series anomaly detection with variational autoencoders. arXiv preprint arXiv:1907.01702, 2019.
[24] Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V. Chawla. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1409-1416, 2019.
[25] Bin Zhou, Shenghua Liu, Bryan Hooi, Xueqi Cheng, and Jing Ye. BeatGAN: Anomalous rhythm detection using adversarially generated time series. In IJCAI, pages 4433-4439, 2019.

Appendix

A. Additional Experimental Results

Due to the space limit, the additional experimental results for all datasets in F1, PRAUC, and ROCAUC are presented in Tables 7, 8, and 9, respectively.

B. KDD Cup Competition Results

We illustrate the results of our submissions to the competition in Figure 4. The submissions were evaluated by computing the correct percentile over 250 files (i.e., datasets); the maximum achievable score is 100%. As presented, the highest score we obtained is 58.4. With this score, among 1967 submissions, we were ranked 58th out of 544 teams and 661 competitors.

B.1. Datasets

KDD Cup 2021 UCR Dataset³. The KDDCup21 datasets are created for KDDCup21 and are designed to mitigate the limitations of existing benchmark datasets.

³ https://compete.hexagon-ml.com/practice/competition/39/#data
[Figure 4: percentile scores of our KDD Cup submissions for each method (y-axis: Percentile; x-axis: Methods).]
B.2. Challenges

Through the KDD Cup 2021, the competition organizers pointed out the limitations of the benchmark datasets previously used for the time-series anomaly detection task [20]. Existing datasets have four flaws: triviality, unrealistic anomaly density, mislabeled ground truth, and run-to-failure bias. (1) Triviality: a time-series anomaly detection problem is trivial if it can be solved with a single line of standard-library MATLAB code (a Python analogue is sketched after this list); the overall 86.1% number reported in [20] is competitive with most papers that have examined these datasets. (2) Unrealistic anomaly density: more than half of the test data exemplars consist of a contiguous region marked as anomalies; there are many regions marked as anomalies, or the annotated anomalies lie very close to each other. (3) Mislabeled ground truth: all of the benchmark datasets appear to contain mislabeled data, both false positives and false negatives; for example, two data points may have similar values, yet one is labeled as an anomaly and the other as normal. (4) Run-to-failure bias: many of the anomalies appear towards the end of the test datasets. Many real-world systems are run-to-failure, so in many cases there is no data to the right of the last anomaly, and a naive algorithm that simply labels the last point as an anomaly has an excellent chance of being correct.
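For illustration only, the sketch below is a Python analogue of the kind of trivial detector [20] warns about: it scores each point by its absolute deviation from a moving average and flags the maximum. It is our own toy example, not the MATLAB one-liner used in that study.

```python
import numpy as np

def trivial_detector(series, window=10):
    """Score points by |value - moving average|; the core idea fits in one line."""
    kernel = np.ones(window) / window
    moving_avg = np.convolve(series, kernel, mode="same")
    scores = np.abs(series - moving_avg)
    return scores, int(np.argmax(scores))      # anomaly scores and top candidate

# Toy usage: a sine wave with one injected spike.
t = np.linspace(0, 20, 500)
series = np.sin(t)
series[300] += 3.0                             # the "anomaly"
scores, top = trivial_detector(series)
print(top)                                     # 300
```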
Table 7. Performance Comparison in F1 scores.
Datasets IF LOF OC-SVM CNN-AE LSTM-AE GRU-AE BiGRU-AE CNN-VAE LSTM-VAE GRU-VAE CNN-GAN LSTM-GAN GRU-GAN S-RNNS SF Dilated LSTM AE Mr.TAD
Yahoo A1 0.7261 0.6397 0.7292 0.8166 0.8123 0.8084 0.8230 0.7394 0.6477 0.5706 0.6432 0.4233 0.4880 0.6967 0.8256 0.7078
Yahoo A2 0.0631 0.7697 0.4334 0.8730 0.8862 0.9136 0.9394 0.7540 0.6804 0.6259 0.6191 0.0589 0.1498 0.6990 0.9442 0.7259
Yahoo A3 0.2745 0.3939 0.3976 0.4072 0.4213 0.4617 0.6779 0.3321 0.2193 0.1713 0.1451 0.1656 0.1661 0.2776 0.4418 0.3244
Yahoo A4 0.2267 0.2862 0.3066 0.3617 0.4026 0.4199 0.6316 0.3150 0.2059 0.1612 0.1548 0.1524 0.1232 0.2113 0.4052 0.2418
NASA-SMAP 0.0685 0.2769 0.1324 0.2568 0.2296 0.2296 0.2296 0.2312 0.2296 0.2296 0.2296 0.2296 0.2296 0.2296 0.2296 0.2296
NASA-MSL 0.0195 0.1742 0.2391 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950 0.1950
SMD 0.3199 0.3354 0.3348 0.4206 0.4483 0.4469 0.4430 0.4379 0.4420 0.4426 0.4078 0.4297 0.4147 0.4492 0.4579 0.4553
ECG 1 0.4056 0.4181 0.4471 0.2785 0.4466 0.4470 0.4667 0.4399 0.4375 0.4017 0.2947 0.3136 0.4211 0.5419 0.5291 0.4975
ECG 2 0.4307 0.6293 0.4053 0.2941 0.3874 0.3210 0.3550 0.3874 0.3433 0.3308 0.3067 0.2241 0.4167 0.3548 0.3824 0.3309
ECG 3 0.2103 0.2156 0.1976 0.0953 0.2796 0.2796 0.2857 0.1987 0.1732 0.1572 0.1988 0.1209 0.1173 0.1800 0.2143 0.1633
ECG 4 0.2410 0.1146 0.2135 0.2533 0.2715 0.2600 0.2645 0.2486 0.2698 0.2545 0.2214 0.2407 0.2461 0.2698 0.2331 0.2612
ECG 5 0.3152 0.5811 0.4771 0.4106 0.4878 0.4481 0.3556 0.3963 0.3636 0.3636 0.3614 0.1885 0.3679 0.4020 0.4859 0.3822
ECG 6 0.0932 0.1333 0.0996 0.2781 0.1571 0.1571 0.1605 0.1571 0.1601 0.1610 0.2043 0.1876 0.2030 0.1571 0.1571 0.1646
ECG 7 0.0650 0.1301 0.0643 0.0612 0.1460 0.0909 0.0643 0.0548 0.0563 0.0547 0.0587 0.0774 0.0465 0.0580 0.0744 0.1054
ECG 8 0.1112 0.0353 0.1264 0.2054 0.1758 0.1758 0.1758 0.1758 0.1763 0.1760 0.1954 0.2220 0.1874 0.1880 0.1758 0.1758
ECG 9 0.1704 0.2792 0.2595 0.3580 0.3580 0.3591 0.3580 0.3605 0.3580 0.3585 0.3583 0.3633 0.4552 0.3705 0.3583 0.3580
Power Demand 0.1826 0.0323 0.2451 0.2691 0.2632 0.2816 0.2643 0.2679 0.2643 0.2643 0.2632 0.4524 0.4167 0.2749 0.2802 0.2755
2D Gesture 0.5091 0.4533 0.5072 0.4132 0.4699 0.4435 0.4401 0.4548 0.4262 0.4167 0.4132 0.4444 0.4132 0.4640 0.4770 0.4759