453677-anomaly-detection-using-prediction-error-cc5b2ed6
2, 2022 7
Abstract—In this paper, we propose a novel method for video anomaly detection motivated by an existing architecture
for sequence-to-sequence prediction and reconstruction using a spatio-temporal convolutional Long Short-Term Memory
(convLSTM). As in previous work on anomaly detection, anomalies arise as spatially localised failures in reconstruction or
prediction. In experiments with five benchmark datasets, we show that using prediction gives superior performance to using
reconstruction. We also compare performance with different length input/output sequences. Overall, our results using prediction
are comparable with the state of the art on the benchmark datasets.
Index Terms—Convolutional LSTM; convolutional autoencoder; prediction error; reconstruction error; anomaly detection.
1. Introduction
Fig. 1: The encoding-decoding structure used for future prediction or reconstruction with video volumes of τ frames.
normal video frame and extracts feature maps. The encoding features are then used to retrieve prototypical normal patterns in the memory items and to update the memory. The feature maps and aggregated memory items are then fed into the decoder to reconstruct the input video frame or predict the next frame. The global memory is read and written using matching probabilities between incoming encoding features and memory items, computed with cosine similarity and the softmax function. Since normal patterns in the training and testing sets may differ, the memory items are updated at both training and testing time, with a predefined threshold to prevent updates on anomalous patterns [18]. However, it is impossible to find an optimal threshold to distinguish between normal and abnormal patterns under various scenarios. Meta-learning is introduced into a Dynamic Prototype Unit (DPU) to learn prototypes for encoding normal dynamics and to enable fast adaptation to a new scene with only a few training frames [19]. As in previous work [18], the DPU takes as input the encoding feature maps, which are outputs of the encoder part of U-net, to generate a pool of dynamic prototypes. However, it is trained in a fully differentiable attention manner in which attention mapping functions are implemented as fully connected layers and updated by gradient descent. After training the AE backbone using only the frame prediction loss, the DPU module is trained in a meta-training phase using frame pairs sampled from videos of diverse scenes. In the testing phase, to adapt the model to a new scene, the first few frames of the sequence in that scene are used to construct K-shot input-output frame pairs. The results show that the DPU is more memory-efficient than the memory module in previous work [17], [18].

Another approach to learning regular spatio-temporal patterns is to use a convolutional LSTM [13], [20]. The motivation is that reconstruction over a longer duration using the memory of the LSTM should capture more complex flow patterns. A convolutional network is used to encode each frame, and the encoding tensors are then fed to convolutional LSTMs to memorize the change of appearance, which corresponds to motion information [20]. Two Deconvolutional Networks (DeconvNet) are used: one for reconstructing past frames, to identify whether an anomaly occurs, and one for reconstructing the current frame. Thus the reconstruction error is an indicator of the change in appearance or motion. The temporal unit in [13], [20] is applied on the final spatial stage, which encodes high-level representations. Interleaving RNNs between spatial convolution layers has recently been shown to improve performance on precipitation nowcasting [21]. The model can learn temporal information on hierarchical spatial representations from low-level to high-level. In our work, we adopt the same architecture, except that we retain convolutional LSTMs instead of the more complex trajGRU RNN [21]. Our results show a comparable level of performance to the state of the art on benchmark datasets, with fewer model parameters than state of the art models. Moreover, using prediction gives better performance than reconstruction. Finally, performance varies as expected with different prediction windows.

2. Architecture

Figure 1 illustrates the encoding-decoding structure for future prediction or reconstruction, motivated by earlier work [21] and adapted for anomaly detection. At each time step, the network takes a video volume of τ video frames $F_{t-\tau+1}, \ldots, F_t$ and generates an output volume of the same size, either predicting the future $F_{t+1}, \ldots, F_{t+\tau}$ or reconstructing the input in reverse order $F_t, \ldots, F_{t-\tau+1}$.
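As an illustration of how input and target volumes are paired for the two tasks, the following is a minimal NumPy sketch and not the authors' Caffe code. The function name `make_volume_pairs`, the `(T, H, W)` array layout (the paper stacks frames along the 4th dimension instead) and the `stride` argument are assumptions made for this example; `stride` mirrors the frame-skipping augmentation with strides of 1, 2 and 3 mentioned in the data preparation text.

```python
import numpy as np


def make_volume_pairs(frames, tau, mode="prediction", stride=1):
    """Pair each input volume of tau frames with its target volume.

    frames: array of shape (T, H, W), one grayscale frame per index
            (an assumed layout chosen for this sketch).
    mode:   "prediction"     -> target is the next tau frames
            "reconstruction" -> target is the input in reverse order
    stride: temporal step between sampled frames (1 = consecutive frames).
    """
    if mode not in ("prediction", "reconstruction"):
        raise ValueError(f"unknown mode: {mode}")
    frames = np.asarray(frames)
    span = (tau - 1) * stride + 1          # frames covered by one input volume
    last_start = len(frames) - span        # last valid start index for the input
    if mode == "prediction":
        last_start -= tau * stride         # leave room for the future target
    pairs = []
    for start in range(0, last_start + 1):
        idx = start + stride * np.arange(tau)
        inputs = frames[idx]
        if mode == "prediction":
            target = frames[idx + tau * stride]   # F_{t+1}, ..., F_{t+tau}
        else:
            target = inputs[::-1]                 # F_t, ..., F_{t-tau+1}
        pairs.append((inputs, target))
    return pairs
```

With τ = 5 and stride 1, each prediction pair spans 10 consecutive frames, whereas a reconstruction pair spans only the 5 input frames.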
Hanh T. M. Tran et al.: ANOMALY DETECTION USING PREDICTION ERROR WITH SPATIO-TEMPORAL CONVOLUTIONAL LSTM 9
2.1. Encoding-decoding model
The structure consists of two networks, an encoding network and a decoding network (Fig. 1). The encoder contains three convolutional layers, each followed by leaky ReLU with negative slope equal to 0.2 [22]. In order to do down-sampling, we use all three convolutional layers with stride. The strided convolution allows the network to learn its own spatial down-sampling. Similarly, three deconvolution layers are used in the decoder to learn its own spatial up-sampling. The goal of temporal encoding is to capture and compress changes due to motion in the input sequence into encoding hidden states that allow the decoder to reconstruct the input or predict the future.

range [0, 1]. We stack τ frames in the 4th dimension into video volumes and use them as the input of size 227 × 227 × 1 × τ to the encoder. Following [10], we generate more video sequences by concatenating frames with skipping strides of 1, 2 and 3, thereby simulating faster motion patterns. Although speed can be important in anomaly detection, we still carry out this augmentation to minimise over-fitting and to have a fair comparison with [10], [13]. Unlike [10], we do not stack precomputed optical flow into our input volume, in the expectation that the network can learn the necessary patterns of motion.

3. Training
The weights $W_l$ and biases $b_l$ of each layer $l$ are learned by minimizing the regularized least squares error:

$$\frac{1}{2N\tau}\sum_{n=1}^{N}\lVert \theta_n - \hat{\theta}_n \rVert_2^2 + \frac{\lambda}{2}\sum_{l}\lVert W_l \rVert_2^2 \tag{1}$$
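To make the two terms of Eq. 1 concrete, here is a minimal NumPy sketch of the objective. It assumes that $\theta_n$ is the target volume of the n-th training sample and $\hat{\theta}_n$ the corresponding network output, which the surrounding text does not state explicitly; the function name `regularized_l2_loss` is hypothetical and the authors' Caffe implementation may differ.

```python
import numpy as np


def regularized_l2_loss(targets, outputs, weights, lam, tau):
    """Evaluate Eq. 1: squared error scaled by 1/(2*N*tau) plus weight decay.

    targets, outputs: arrays of shape (N, ...) holding theta_n and theta_hat_n
                      (assumed here to be target and output volumes).
    weights:          list of per-layer weight arrays W_l; biases b_l are not
                      regularized, matching the second term of Eq. 1.
    lam:              regularization coefficient lambda.
    tau:              number of frames per volume.
    """
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    n = targets.shape[0]
    # First term: sum over samples of the squared L2 distance, scaled by 1/(2*N*tau).
    data_term = np.sum((targets - outputs) ** 2) / (2.0 * n * tau)
    # Second term: (lambda / 2) times the sum of squared weights over all layers.
    decay_term = 0.5 * lam * sum(np.sum(np.square(w)) for w in weights)
    return data_term + decay_term
```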
AUC/EER (%)

| Method | UCSDPed1 | UCSDPed2 | CUHK Avenue | Subway Entrance | Subway Exit |
|---|---|---|---|---|---|
| Conv-WTA [7] | 91.6/14.8 | 95/9.5 | 81/26.5 | - | - |
| AMDN [8] | 92.1/16 | 90.8/17.1 | - | - | - |
| GAN [11] | - | 93.5/15.6 | - | - | - |
| Conv-AE [10] | 81/27.9 | 90/21.7 | 70.2/25.1 | 94.3/26.0 | 80.7/9.9 |
| ST-AE [13] | 89.9/12.5 | 87.4/12.0 | 80.3/20.7 | 84.7/23.7 | 94.0/9.5 |
| Past-Current-LSTM [20] | 75.5/- | 88.1/- | 77/- | 93.3/- | 87.7/- |
| STAE-3D [16] | 92.3/15.3 | 91.2/16.7 | 77.1/33.8 | - | - |
| FlowNet-Unet-GAN [14] | 83.1/- | 95.4/- | 85.1/- | - | - |
| Two-streams AE [15] | - | 94.1/- | 83.3/- | - | - |
| MemAE [17] | - | 96.2/- | 86.9/- | - | - |
| LMN [18] | - | 97/- | 88.5/- | - | - |
| MPD* [19] | 83.2/- | 95.1/- | 84.0/- | - | - |
| MPD [19] | 85.1/- | 96.9/- | 89.5/- | - | - |
| Ours (prediction) | 80.8/25.1 | 92.3/14.4 | 84.8/22.4 | 90.2/15.9 | 95/8 |
frames are summed up to form the prediction error for a volume as follows:

$$e(t) = \sum_{i=t+1}^{t+\tau} \lVert \hat{F}_i - F_i \rVert_2 \tag{2}$$

The prediction error is then normalized to compute a regularity score s(t) of a testing volume as follows [10]:

$$s(t) = 1 - \frac{e(t) - \min_{t'} e(t')}{\max_{t'} e(t')} \tag{3}$$

where $\min_{t'} e(t')$ and $\max_{t'} e(t')$ are calculated over the prediction errors of all volumes in the same test video. If the regularity score s(t) is less than a threshold, the corresponding test volume is abnormal.

We also use the same architecture for reconstruction in our experiments. Instead of using the next τ frames as the target sequence, we use the input sequence in reverse order as the target. Replacing the target sequence in Eq. 2, we obtain the reconstruction error and use it for anomaly detection with the reconstruction model.

5. Experiments

Our method is evaluated both quantitatively and qualitatively. We modify and use Caffe [27] for all our experiments. Code and trained models are available at https://github.com/t2mhanh/convLSTM_Prediction_AnomalyDetection.

5.1. Datasets
Our models are trained on five of the most commonly used datasets for anomaly detection: UCSD (UCSDPed1 and UCSDPed2) [2], CUHK Avenue [4], Subway (Entrance and Exit) [5]. The UCSD and CUHK datasets have separate training videos which contain mostly normal events. The first 12 minutes of Subway Entrance and the first 5 minutes of Subway Exit are used for training.

5.2. Anomalous event detection
Two performance metrics are employed for evaluation and comparison with state of the art results: Equal Error Rate (EER) and Area Under the ROC Curve (AUC). The regularity score of each volume determines whether it is normal or abnormal. We follow the intuition that testing video volumes containing normal events generate high regularity scores (Eq. 3) since they are similar to training data. A testing video sequence containing an anomaly gives a lower score. By setting different thresholds on the regularity score, volumes are classified into those that contain an anomaly and those that do not. These predictions are compared with ground-truth to give the EER and the AUC of the resulting ROC curve (TPR versus FPR) generated by varying an acceptance threshold. Good performance has a low EER and a high AUC.

Table 1 shows that the model trained for prediction performs comparably to state of the art results. Performance on UCSDPed1 is relatively poor, whilst for CUHK Avenue, the AUC is better than most methods, except FlowNet-Unet-GAN [14], MemAE [17], LMN [18] and MPD [19]. However, MemAE [17], LMN [18] and MPD [19] have more parameters than our models, as shown in Table 3.

TABLE 2: Comparison of AUC/EER with different models. τ is the number of frames in an input sequence and a target sequence.

AUC/EER (%)

| Method | UCSDPed1 | UCSDPed2 | CUHK Avenue |
|---|---|---|---|
| Reconstruction | 75.6/28.9 | 87.5/17.1 | 81.4/26.1 |
| Prediction, τ = 2 | 78.3/27.1 | 86.1/21.1 | 85.1/22.5 |
| Prediction, τ = 5 | 80.8/25.1 | 92.3/14.4 | 84.8/22.4 |
| Prediction, τ = 8 | 79/26.5 | 89.6/18.5 | 83.2/23.2 |

Table 2 shows the results when different models are used. In the table, “Reconstruction” is for a model trained for reconstructing a sequence of 5 frames and “Prediction” is for models trained to predict τ frames.
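The scoring and evaluation steps (Eq. 2, Eq. 3 and the threshold sweep of Section 5.2) can be sketched in NumPy as follows. This is an illustrative sketch rather than the released evaluation code: the function names are hypothetical, volumes are assumed to be arrays of shape `(tau, H, W)`, and the EER is approximated by the ROC operating point where the false positive and false negative rates are closest.

```python
import numpy as np


def prediction_error(predicted, actual):
    """Eq. 2: sum over the tau frames of the L2 norm of each frame difference."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    per_frame = np.linalg.norm(diff.reshape(diff.shape[0], -1), axis=1)
    return float(per_frame.sum())


def regularity_scores(errors):
    """Eq. 3: s(t) = 1 - (e(t) - min e) / max e, over all volumes of one test video."""
    errors = np.asarray(errors, dtype=float)
    return 1.0 - (errors - errors.min()) / errors.max()


def roc_auc_eer(scores, is_anomaly):
    """Sweep a threshold on the regularity score and return (AUC, approximate EER).

    A volume is flagged abnormal when its score is below the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_anomaly, dtype=bool)
    if labels.all() or not labels.any():
        raise ValueError("need both normal and anomalous volumes")
    # Thresholds from "flag nothing" (-inf) to "flag everything" (+inf).
    thresholds = np.concatenate(([-np.inf], np.unique(scores), [np.inf]))
    tpr = np.array([(scores[labels] < th).mean() for th in thresholds])
    fpr = np.array([(scores[~labels] < th).mean() for th in thresholds])
    # Trapezoidal area under the ROC curve (TPR versus FPR).
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    # EER: operating point where FPR is closest to FNR = 1 - TPR.
    fnr = 1.0 - tpr
    k = int(np.argmin(np.abs(fpr - fnr)))
    eer = float((fpr[k] + fnr[k]) / 2.0)
    return auc, eer
```

Because Eq. 3 normalizes within a single test video, `regularity_scores` would be called once per video before the scores are pooled for the ROC analysis.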
The model trained for future prediction gives better results than the reconstruction model. This may be because prediction always tends to draw back towards normality, whereas reconstruction has already seen the anomalous sequence it must reproduce. A qualitative comparison between reconstruction and prediction is shown in Fig. 3.

state of the art are compared in Table 3. We achieve 75 fps for anomaly detection with a GeForce GTX TITAN X, faster than other state of the art methods with the same setting [18].

TABLE 3: Comparison of model complexity and testing speed.
[6] Wang, Siqi, En Zhu, Jianping Yin, and Fatih Porikli, "Anomaly detection in crowded scenes by SL-HOF descriptor and foreground classification," In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3398-3403, IEEE, 2016.
[7] Tran, Hanh TM, and David Hogg, "Anomaly detection using a convolutional winner-take-all autoencoder," In Proceedings of the British Machine Vision Conference 2017, British Machine Vision Association, 2017.
[8] Xu, Dan, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe, "Learning deep representations of appearance and motion for anomalous event detection," arXiv preprint arXiv:1510.01553 (2015).
[9] Tran, Thi Minh Hanh, "Anomaly Detection in Video," PhD diss., University of Leeds, 2018.
[10] Hasan, Mahmudul, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, and Larry S. Davis, "Learning temporal regularity in video sequences," In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733-742, 2016.
[11] Ravanbakhsh, Mahdyar, Enver Sangineto, Moin Nabi, and Nicu Sebe, "Training adversarial discriminators for cross-channel abnormal event detection in crowds," In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1896-1904, IEEE, 2019.
[12] Zhao, Bin, Li Fei-Fei, and Eric P. Xing, "Online detection of unusual events in videos via dynamic sparse coding," In CVPR 2011, pp. 3313-3320, IEEE, 2011.
[13] Chong, Yong Shean, and Yong Haur Tay, "Abnormal event detection in videos using spatiotemporal autoencoder," In International symposium on neural networks, pp. 189-196, Springer, Cham, 2017.
[14] Liu, Wen, Weixin Luo, Dongze Lian, and Shenghua Gao, "Future frame prediction for anomaly detection–a new baseline," In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6536-6545, 2018.
[15] Nguyen, Trong-Nguyen, and Jean Meunier, "Anomaly detection in video sequence with appearance-motion correspondence," In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1273-1283, 2019.
[16] Zhao, Yiru, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua, "Spatio-temporal autoencoder for video anomaly detection," In Proceedings of the 25th ACM international conference on Multimedia, pp. 1933-1941, 2017.
[17] Gong, Dong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel, "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1705-1714, 2019.
[18] Park, Hyunjong, Jongyoun Noh, and Bumsub Ham, "Learning memory-guided normality for anomaly detection," In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14372-14381, 2020.
[19] Lv, Hui, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang, "Learning normal dynamics in videos with meta prototype network," In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15425-15434, 2021.
[20] Luo, Weixin, Wen Liu, and Shenghua Gao, "Remembering history with convolutional lstm for anomaly detection," In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 439-444, IEEE, 2017.
[21] Shi, Xingjian, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo, "Deep learning for precipitation nowcasting: A benchmark and a new model," Advances in neural information processing systems, 30 (2017).
[22] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," In Proc. icml, vol. 30, no. 1, p. 3, 2013.
[23] Shi, Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Advances in neural information processing systems, 28 (2015).
[24] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," In Proceedings of the IEEE international conference on computer vision, pp. 1026-1034, 2015.
[25] Kingma, Diederik P., and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980 (2014).
[26] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in neural information processing systems, 25 (2012).
[27] Jia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, "Caffe: Convolutional architecture for fast feature embedding," In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675-678, 2014.

Hanh T. M. Tran is currently a Lecturer with the Department of Electronics and Telecommunications, the University of Danang - University of Science and Technology, Vietnam, which she joined in 2009. She received the B.Eng. and M.Eng. degrees in Electronics and Telecommunications from the University of Danang - University of Science and Technology in 2008 and 2011, respectively. She obtained the Ph.D. degree from the University of Leeds, United Kingdom, in 2018. She was a Visiting Researcher with the Arizona State University, Arizona, USA, in 2012. Her main research interests include image/video processing, machine learning, deep learning, anomaly detection, object detection and recognition.

David Hogg is Professor of Artificial Intelligence at the University of Leeds. His research is on artificial intelligence, particularly computer vision. He has been Pro-Vice-Chancellor for Research and Innovation at the University of Leeds, visiting professor at the MIT Media Lab, chair of the EPSRC ICT Strategic Advisory Team, and chair of the Academic Advisory Group of the Worldwide Universities Network. He is a Fellow of the European Association for Artificial Intelligence (EurAI), a Distinguished Fellow of the British Machine Vision Association, and a Fellow of the International Association for Pattern Recognition. He is Director of the UKRI Centre for Doctoral Training in Artificial Intelligence for Medical Diagnosis and Care at the University of Leeds.