Grid HTM: Hierarchical Temporal Memory for Anomaly Detection in Videos

Vladimir Monakhov
University of Oslo and SimulaMet, Norway

Vajira Thambawita
SimulaMet, Norway
An issue for deep-learning models in general is that they are susceptible to noise in the dataset [5, 6], which leads to decreased model accuracy and poor prediction results. Due to the nature of training deep learning models, they are also in most cases not self-supervised and therefore require constant tuning in order to stay effective on changing data. In addition, they require a lot of data before they can be considered effective, and performance increases logarithmically with the volume of training data [7]. Deep learning models also suffer from issues with out-of-distribution generalization [8], where a model might perform well on the dataset it is tested on but performs poorly when deployed in real life. This could be caused by selection bias in the dataset or by differences in the causal structure between the training domain and the deployment domain [9]. Another challenge with deep learning models is that they generally suffer from a lack of explainability [10]. While it is known how the models make their decisions, their huge parametric spaces make it unfeasible to know why they make those predictions. Combined with the vast potential that deep learning offers in critical sectors such as medicine, this makes approaches that offer explainability highly attractive.

The HTM theory [2] introduces a machine learning algorithm which works on the same principles as the brain and therefore solves some of the issues that deep learning has. HTM is considered noise resistant and can perform online learning, meaning that it learns as it observes more data. HTM replicates the structure of the neocortex, which is made up of cortical regions, which in turn are made up of mini-columns and then neurons.

The data in an HTM model is represented using a Sparse Distributed Representation (SDR), which is a sparse bit array. An encoder converts real-world values into SDRs, and there are currently encoders for numbers, geospatial locations, categories, and dates. One of the difficulties with HTM is making it work on visual data, where creating a good encoder for visual data is still being researched [11, 12, 13]. The learning mechanism consists of two parts, the Spatial Pooler (SP) and the Temporal Memory (TM). The SP learns to extract semantically important information into output SDRs. The TM learns sequences of patterns of SDRs and forms a prediction in the form of a predictive SDR. A research study [14] has shown that HTM is very capable of performing anomaly detection on low-dimensional data and is able to outperform other anomaly detection methods. However, related works, such as Daylidyonok, Frolenkova, and Panov [13], show that HTM struggles with higher-dimensional data. Therefore, a natural conclusion is that HTM should be applied differently, and that a new type of architecture using HTM should be explored for the purpose of video anomaly detection and surveillance.

3 GRID HTM

This paper proposes and explores a new type of architecture, named Grid HTM, for anomaly detection in videos using HTM, and proposes to use segmentation techniques to simplify the data into an SDR-friendly format. These segmentation techniques could be anything from simple binary thresholding to deep learning instance segmentation. Even keypoint detectors such as Oriented FAST and Rotated BRIEF (ORB) [15] could in theory be applied. When explaining Grid HTM, the examples will be taken from deep learning instance segmentation of cars on a video from the VIRAT [16] dataset. An example segmentation is shown in Figure 1.

Figure 1: Segmentation result of cars, which is suited to be used as an SDR. Original frame taken from VIRAT [16].

The idea is that the SP will learn to find an optimal general representation of cars. How general this representation is can be configured using the various SP parameters, but ideally they should be set so that different cars will be represented similarly while trucks and motorcycles will be represented differently. An example representation by the SP is shown in Figure 2.

Figure 2: The SDR (left) and its corresponding SP representation (right). Note that the SP is untrained.

The task of the TM will then be to learn the common patterns that the cars exhibit; their speed, shape, and positioning will be taken into account. Finally, the learning will be set so that new patterns are learned quickly but forgotten slowly. This will allow the model to quickly learn the norm, even if there is little activity, while still reacting to anomalies. This requires that the input is stationary, which in our example means that the camera is not moving.

It is possible to split different segmentation classes into their respective SDRs. This will give the SP and the TM the ability to learn different things for each of the classes. For instance, if there are two classes "person" and "car", then the TM will learn that it is normal for objects belonging to "person" to be on the sidewalk, while objects belonging to "car" will be marked as anomalous when on the sidewalk.

Ideally, the architecture will have a calibration period spanning several days or weeks, during which the architecture is not performing any anomaly detection but is just learning the patterns.

4 IMPROVEMENTS

Daylidyonok, Frolenkova, and Panov [13] tested only the base HTM version and showed that the algorithm cannot handle subtle anomalies; therefore, multiple improvements needed to be introduced to increase effectiveness.

Invariance. One issue that becomes evident is the lack of invariance, due to the TM learning the global patterns. Using the example, it learns that it is normal for cars to drive along the road, but only in the context of there being cars parked in the parking lot. It is instead desired that the TM learns that it is normal for cars to drive along the road, regardless of whether there are cars in the parking lot. We propose a solution based on dividing the encoder output into a grid and having a separate SP and TM for each cell in the grid. The anomaly scores of all the cells are then aggregated into a single anomaly score using an aggregation function.
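As an illustration of this grid division, the sketch below binarizes a per-class segmentation mask and splits it into cells, each of which would feed its own SP and TM. This is a minimal NumPy sketch under simplifying assumptions; the function name `mask_to_sdr_grid`, the default cell size, and the requirement that the frame dimensions divide evenly are illustrative, not part of the original implementation:

```python
import numpy as np

def mask_to_sdr_grid(mask: np.ndarray, cell_size: int = 16) -> dict:
    """Binarize a class segmentation mask and split it into grid cells.

    `mask` is a 2D array in which non-zero pixels belong to the target
    class (e.g. "car"). Each cell becomes a small binary SDR that would
    be fed to that cell's own Spatial Pooler. For simplicity the frame
    dimensions are assumed to be divisible by `cell_size`.
    """
    sdr = (mask > 0).astype(np.uint8)  # class mask -> sparse bit array
    h, w = sdr.shape
    cells = {}
    for row in range(0, h, cell_size):
        for col in range(0, w, cell_size):
            cells[(row // cell_size, col // cell_size)] = \
                sdr[row:row + cell_size, col:col + cell_size]
    return cells

# Example: an 8x8 frame with a 2x2 "car" blob, split into four 4x4 cells.
frame = np.zeros((8, 8), dtype=np.uint8)
frame[1:3, 1:3] = 255
cells = mask_to_sdr_grid(frame, cell_size=4)
print(cells[(0, 0)].sum())  # 4 active bits in the top-left cell
```

Each dictionary entry is then an independent SDR stream, which is what allows every cell to maintain its own local model of what is normal.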
Aggregation Function. Selecting the correct aggregation function is important because it affects the final anomaly output. For instance, it might be tempting to use the mean of all the anomaly scores as the aggregation function:

$$X : \{x \in \mathbb{R} : x \geq 0\}$$

$$\mathit{Anomaly\_Score} = \frac{\sum_{x \in X} x}{|X|}$$

where X denotes the set of anomaly scores x, one from each grid cell. However, this leads to problems with normalization, meaning that an overall anomaly score of 1 is hard to achieve due to many cells having a zero anomaly score. In fact, it becomes unclear what a high anomaly score even is. Using the mean also means that anomalies that take up a lot of space will be weighted higher than anomalies that take up little space, which might not be desirable. To solve the aforementioned problem, and if the data has little noise, a potential aggregation function could be the non-zero mean:

$$X : \{x \in \mathbb{R} : x > 0\}$$

$$\mathit{Anomaly\_Score} = \begin{cases} \frac{\sum_{x \in X} x}{|X|} & \text{if } |X| > 0 \\ 0 & \text{otherwise} \end{cases}$$

This means that only the cells with a strictly positive anomaly score contribute to the overall anomaly score, which helps solve the aforementioned normalization and weighting problems. On the other hand, the non-zero mean will perform poorly when the architecture is exposed to noisy data, which could lead to there always being one or more cells with a high anomaly score. Figure 3 illustrates the effect of an aggregation function for noisy data, where the non-zero mean is rendered useless due to the noise. On the other hand, Figure 4 shows how the non-zero mean gives a clearer anomaly score when the data is clean.

Figure 3: Aggregation function performance on noisy data. (a) Mean. (b) Non-zero mean.

Figure 4: Aggregation function performance on clean data. (a) Mean. (b) Non-zero mean.

Explainability. Having the encoder output divided into a grid has the added benefit of introducing explainability into the model. By using Grid HTM it is now possible to determine where in the input an anomaly has occurred by simply observing which cell has a high anomaly score. It is also possible to estimate the number of predictions for each cell, which can be used as a measure of certainty, where fewer predictions means higher certainty. Making it possible to measure certainty per cell creates a new source of information which can be used for explainability or robustness purposes.

Flexibility and Performance. In addition, it is also possible to configure the SP and the TM in each cell independently, giving the architecture increased flexibility, and to use a non-uniform grid, meaning that some cells can have different sizes. Last but not least, dividing the frame into smaller cells makes it possible to run each cell in parallel for increased performance.

Reviewing Encoder Rules. A potential challenge with the grid approach is that the rules for creating a good encoder may not be respected and therefore should be reviewed:

• Semantically similar data should result in SDRs with overlapping active bits. In this example, a car at one position will produce an SDR with a high amount of bits overlapping with another car at a similar position in the input image.
• The same input should always produce the same SDR. The segmentation model produces a deterministic output given the same input.
• The output must have the same dimensionality (total number of bits) for all inputs. The segmentation model output has a fixed dimensionality.
• The output should have similar sparsity (similar number of one-bits) for all inputs and have enough one-bits to handle noise and subsampling. The segmentation model does not respect this. An example is that there can be no cars (zero active bits), one car (n active bits), or two cars (2n active bits), and that this will fluctuate over time.

The solution for the last rule is two-fold, and consists of imposing a soft upper bound and a hard lower bound for the number of active pixels within a cell. The purpose is to lower the variation in the number of active pixels, while still retaining enough semantic information for the HTM to work:

• Pick a cell size so that the distribution of the number of active pixels is as tight as possible, while containing enough semantic information and also being small enough so that the desired invariance is achieved. The cell size acts as a soft upper bound for the possible number of active pixels.
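The two aggregation functions defined above, the plain mean and the non-zero mean, can be sketched directly from their formulas. This is an illustration of the formulas, not the reference implementation:

```python
def mean_aggregation(scores):
    """Plain mean over all cell anomaly scores (X : {x >= 0})."""
    return sum(scores) / len(scores) if scores else 0.0

def nonzero_mean_aggregation(scores):
    """Mean over only the strictly positive scores (X : {x > 0}).

    Returns 0 when no cell reports an anomaly, which keeps the output
    normalized even when most cells are silent.
    """
    positive = [x for x in scores if x > 0]
    return sum(positive) / len(positive) if positive else 0.0

# 64 cells, only two of which report anomalies:
scores = [0.0] * 62 + [0.8, 1.0]
print(mean_aggregation(scores))          # 0.028125 -- diluted by silent cells
print(nonzero_mean_aggregation(scores))  # 0.9
```

The example makes the normalization problem concrete: the plain mean of two genuinely anomalous cells is diluted toward zero by the 62 silent cells, while the non-zero mean stays close to 1.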
[Figure: histograms of the number of active pixels per cell (y-axis: frames, log scale; x-axis: number of active pixels). (a) Without empty pattern (σ = 3.78). (b) With empty pattern and a minimum sparsity requirement of 5 (σ = 1.41).]

Figure 6: Example Grid HTM output and the corresponding input. The color represents the anomaly score for each of the cells, where red means high anomaly score and green means zero anomaly score. Two of the cars are marked as anomalous because they are moving, which is something Grid HTM has not seen before during its 300 frame (top right) long lifetime.
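One reading of the empty-pattern mechanism with a minimum sparsity requirement of 5 is to substitute a fixed, constant pattern whenever a cell's input falls below that bound, so that "empty" becomes a stable, learnable input rather than an under-sparse one. The sketch below assumes flattened binary cell inputs; the function and variable names are illustrative, and the exact mechanism used by Grid HTM may differ:

```python
import numpy as np

MIN_ACTIVE_BITS = 5  # hard lower bound on active bits per cell

def enforce_min_sparsity(cell_sdr: np.ndarray, empty_pattern: np.ndarray):
    """Replace under-sparse cell input with a constant 'empty' pattern.

    If a cell has fewer than MIN_ACTIVE_BITS active bits (e.g. no car,
    or only a sliver of one), the TM would otherwise see an input too
    sparse to be robust to noise and subsampling. Substituting one
    fixed pattern keeps the sparsity above the hard lower bound.
    """
    if cell_sdr.sum() < MIN_ACTIVE_BITS:
        return empty_pattern
    return cell_sdr

# A fixed (seeded, therefore constant) empty pattern with 5 active bits:
rng = np.random.default_rng(seed=0)
empty = np.zeros(256, dtype=np.uint8)
empty[rng.choice(256, size=MIN_ACTIVE_BITS, replace=False)] = 1

sparse_cell = np.zeros(256, dtype=np.uint8)  # no active pixels at all
print(enforce_min_sparsity(sparse_cell, empty).sum())  # 5
```

Because the substitute pattern is identical on every empty frame, it also satisfies the rule that the same input must always produce the same SDR.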
timestamps that are suitable to be used as contextual anchors, so as a replacement, the past observations are encoded instead. Concatenating past observations together will force the TM input to be unique for when an object is in motion and when an object is still. High-framerate videos can benefit the most from this, and the effect will be more pronounced for higher values of n.

A potential side effect of introducing temporal patterns is that, because the TM is now exposed to multiple frames at once, it will be more tolerant to temporal noise. An example of temporal noise is when an object disappears for a single frame due to falling below the classification threshold of the deep learning segmentation model encoder. The reason for the noise tolerance is that instead of the temporal noise making up the entire input for the TM, it now only makes up 1/n of the TM input.

Use Cases. The most intuitive use case is to use Grid HTM for semi-active surveillance, where personnel only have to look at segments containing anomalies, leading to drastically increased efficiency. One example is making it possible to have an entire city be monitored by a few people. This is made possible by making it so that people only have to look at segments that Grid HTM has found anomalous, which is what drastically lowers the manpower requirement for active monitoring of the entire city.

[Figure: anomaly score output over roughly 160,000 frames; smoothed scores range between 0.00 and 0.03.]

Figure 9: Anomaly score output from Grid HTM.

A moving average (n = 200) was applied to smooth out the anomaly score output, otherwise the graph would be too noisy.

With the aggregation functions presented in this paper in mind, it is safe to conclude that looking at the anomaly score output is meaningless for complex data such as a surveillance video. This, however, does not mean that Grid HTM is completely useless, and this can be observed by looking at the visual output of Grid HTM. The visual output during which the first segment anomaly occurs can be seen in Figure 10. Here, it is observed that Grid HTM correctly marks the sudden change of cars when the current segment ends and a new segment begins.
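The temporal-pattern encoding described earlier, in which the past n observations are concatenated so that a single noisy frame contributes only 1/n of the TM input, can be sketched with a sliding window. The class and parameter names below are illustrative, not taken from the original implementation:

```python
from collections import deque
import numpy as np

class TemporalConcatenator:
    """Concatenate the last n binarized frames into one TM input."""

    def __init__(self, n: int, frame_bits: int):
        self.n = n
        # Pre-fill with zeros so the output size is constant from frame one.
        self.window = deque([np.zeros(frame_bits, dtype=np.uint8)] * n,
                            maxlen=n)

    def push(self, frame_sdr: np.ndarray) -> np.ndarray:
        """Add the newest frame and return the concatenated TM input."""
        self.window.append(frame_sdr)
        return np.concatenate(list(self.window))

enc = TemporalConcatenator(n=4, frame_bits=8)
out = enc.push(np.ones(8, dtype=np.uint8))
print(out.shape, int(out.sum()))  # (32,) 8 -- one frame is 1/4 of the input
```

A frame that flickers out for a single step (temporal noise) therefore only zeroes 1/n of the concatenated input, leaving the rest of the TM input intact.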
Figure 11: Visual output when a car is driving along a road.

Figure 13: Anomaly output when there is no frame repeating; where it should have repeated is marked in red. The blue circle highlights the object of interest.

for each cell during the calibration phase. It is also possible to improve explainability and robustness by implementing a measure of certainty for each cell.

Finally, experiments should be performed to validate the possibility of having the TM in each cell grow synapses to neighboring cells in order to solve the issue with unstable anomaly output.

REFERENCES
[1] Divyanshi Tewari. 2019. U.S. Video Surveillance Market by Component (Solution, Service, and Connectivity Technology), Application (Commercial, Military & Defense, Infrastructure, Residential, and Others), and Customer Type (B2B and B2C): Opportunity Analysis and Industry Forecast, 2020–2027. Online. (March 2019). https://www.alliedmarketresearch.com/us-video-surveillance-market-A06741.
[2] J. Hawkins, S. Ahmad, S. Purdy, and A. Lavin. 2016. Biological and Machine Intelligence (BAMI). Initial online release 0.4. https://numenta.com/resources/biological-and-machine-intelligence/.
[3] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. 2021. Deep Learning for Anomaly Detection. ACM Computing Surveys, 54, 2, (April 2021), 1–38. issn: 1557-7341. doi: 10.1145/3439950.
[4] Sijie Zhu, Chen Chen, and Waqas Sultani. 2020. Video Anomaly Detection for Smart Surveillance. Online. (2020). doi: 10.48550/ARXIV.2004.00222.
[5] Shivani Gupta and Atul Gupta. 2019. Dealing with Noise Problem in Machine Learning Data-sets: A Systematic Review. Procedia Computer Science, 161, 466–474. The Fifth Information Systems International Conference, 23–24 July 2019, Surabaya, Indonesia. issn: 1877-0509. doi: 10.1016/j.procs.2019.11.146.
[6] Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. Online. (2019). doi: 10.48550/ARXIV.1903.12261.
[7] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. 2017. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. Online. (2017). doi: 10.48550/ARXIV.1707.02968.
[8] Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. 2021. Towards Out-Of-Distribution Generalization: A Survey. (2021). doi: 10.48550/ARXIV.2108.13624.
[9] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, and D. Sculley. 2020. Underspecification Presents Challenges for Credibility in Modern Machine Learning. (2020). doi: 10.48550/ARXIV.2011.03395.
[10] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115. issn: 1566-2535. doi: 10.1016/j.inffus.2019.12.012.
[11] Y. Zou, Y. Shi, Y. Wang, Y. Shu, Q. Yuan, and Y. Tian. 2018. Hierarchical Temporal Memory Enhanced One-Shot Distance Learning for Action Recognition. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), 1–6. doi: 10.1109/ICME.2018.8486447.
[12] David McDougall (ctrl-z 9000-times). 2019. Online. (September 2019). https://github.com/htm-community/htm.core/issues/259#issuecomment-533333336.
[13] Daylidyonok, Frolenkova, and Panov. 2019. Extended Hierarchical Temporal Memory for Motion Anomaly Detection. In Biologically Inspired Cognitive Architectures 2018, Alexei V. Samsonovich, editor. Springer International Publishing, Cham, 69–81. isbn: 978-3-319-99316-4. doi: 10.1007/978-3-319-99316-4_10.
[14] Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. 2017. Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262, 134–147. Online Real-Time Learning Strategies for Data Streams. issn: 0925-2312. doi: 10.1016/j.neucom.2017.04.070.
[15] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), 2564–2571. doi: 10.1109/ICCV.2011.6126544.
[16] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, J. K. Aggarwal, Hyungtae Lee, Larry Davis, Eran Swears, Xiaoyang Wang, Qiang Ji, Kishore Reddy, Mubarak Shah, Carl Vondrick, Hamed Pirsiavash, Deva Ramanan, Jenny Yuen, Antonio Torralba, Bi Song, Anesco Fong, Amit Roy-Chowdhury, and Mita Desai. 2011. A large-scale benchmark dataset for event recognition in surveillance video. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3153–3160. doi: 10.1109/CVPR.2011.5995586.
[17] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. 2019. PointRend: Image Segmentation as Rendering. Online. (2019). doi: 10.48550/ARXIV.1912.08193.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. doi: 10.1109/CVPR.2016.90.
[19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255. doi: 10.1109/CVPR.2009.5206848.