Unsupervised real-time anomaly detection for streaming data


Subutai Ahmad a,∗, Alexander Lavin a, Scott Purdy a, Zuha Agha a,b

a Numenta, Redwood City, CA, USA
b Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA

∗ Corresponding author. E-mail address: [email protected] (S. Ahmad).

Article info

Article history:
Received 9 August 2016
Revised 19 April 2017
Accepted 22 April 2017
Available online xxx

Keywords:
Anomaly detection
Hierarchical Temporal Memory
Streaming data
Unsupervised learning
Concept drift
Benchmark dataset

Abstract

We are seeing an enormous increase in the availability of streaming, time-series data. Largely driven by the rise of connected real-time data sources, this data presents technical challenges and opportunities. One fundamental capability for streaming analytics is to model each stream in an unsupervised fashion and detect unusual, anomalous behaviors in real-time. Early anomaly detection is valuable, yet it can be difficult to execute reliably in practice. Application constraints require systems to process data in real-time, not batches. Streaming data inherently exhibits concept drift, favoring algorithms that learn continuously. Furthermore, the massive number of independent streams in practice requires that anomaly detectors be fully automated. In this paper we propose a novel anomaly detection algorithm that meets these constraints. The technique is based on an online sequence memory algorithm called Hierarchical Temporal Memory (HTM). We also present results using the Numenta Anomaly Benchmark (NAB), a benchmark containing real-world data streams with labeled anomalies. The benchmark, the first of its kind, provides a controlled open-source environment for testing anomaly detection algorithms on streaming data. We present results and analysis for a wide range of algorithms on this benchmark, and discuss future challenges for the emerging field of streaming analytics.

© 2017 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

With sensors pervading our everyday lives, we are seeing an exponential increase in the availability of streaming, time-series data. Largely driven by the rise of the Internet of Things (IoT) and connected real-time data sources, we now have an enormous number of applications with sensors that produce important data that changes over time. Analyzing these streams effectively can provide valuable insights for any use case and application.

The detection of anomalies in real-time streaming data has practical and significant applications across many industries. Use cases such as preventative maintenance, fraud prevention, fault detection, and monitoring can be found throughout numerous industries such as finance, IT, security, medical, energy, e-commerce, agriculture, and social media. Detecting anomalies can give actionable information in critical scenarios, but reliable solutions do not yet exist. To this end, we propose a novel and robust solution to tackle the challenges presented by real-time anomaly detection.

Consistent with [1], we define an anomaly as a point in time where the behavior of the system is unusual and significantly different from previous, normal behavior. An anomaly may signify a negative change in the system, like a fluctuation in the turbine rotation frequency of a jet engine, possibly indicating an imminent failure. An anomaly can also be positive, like an abnormally high number of web clicks on a new product page, implying stronger than normal demand. Either way, anomalies in data identify abnormal behavior with potentially useful information. Anomalies can be spatial, where an individual data instance can be considered anomalous with respect to the rest of the data, independent of where it occurs in the data stream, like the first and third anomalous spikes in Fig. 1. An anomaly can also be temporal, or contextual, if the temporal sequence of data is relevant; i.e., a data instance is anomalous only in a specific temporal context, but not otherwise. Temporal anomalies, such as the middle anomaly of Fig. 1, are often subtle and hard to detect in real data streams. Detecting temporal anomalies in practical applications is valuable, as they can serve as an early warning for problems with the underlying system.

1.1. Streaming applications

Streaming applications impose unique constraints and challenges for machine learning models. These applications involve analyzing a continuous sequence of data occurring in real-time. In contrast to batch processing, the full dataset is not available.


Fig. 1. The figure shows real-world temperature sensor data from an internal component of a large industrial machine. Anomalies are labeled with circles. The first anomaly
was a planned shutdown. The third anomaly was a catastrophic system failure. The second anomaly, a subtle but observable change in the behavior, indicated the actual
onset of the problem that led to the eventual system failure. The anomalies were hand-labeled by an engineer working on the machine. This file is included in the Numenta
Anomaly Benchmark corpus [2].

The system observes each data record in sequential order as it arrives, and any processing or learning must be done in an online fashion. Let the vector x_t represent the state of a real-time system at time t. The model receives a continuous stream of inputs:

..., x_{t−2}, x_{t−1}, x_t, x_{t+1}, x_{t+2}, ...

Consider, for example, the task of monitoring a datacenter. Components of x_t might include CPU usage for various servers, bandwidth measurements, latency of servicing requests, etc. At each point in time t we would like to determine whether the behavior of the system is unusual. The determination must be made in real-time, before time t + 1. That is, before seeing the next input (x_{t+1}), the algorithm must consider the current and previous states to decide whether the system behavior is anomalous, as well as perform any model updates and retraining. Unlike batch processing, data is not split into train/test sets, and algorithms cannot look ahead.

Practical applications impose additional constraints on the problem. Typically, the sensor streams are large in number and high in velocity, leaving little opportunity for human, let alone expert, intervention; manual parameter tweaking and data labeling are not viable. Thus, operating in an unsupervised, automated fashion is often a necessity.

In many scenarios the statistics of the system can change over time, a problem known as concept drift [3,4]. Consider again the example of a production datacenter. Software upgrades and configuration changes can occur at any time and may alter the behavior of the system (Fig. 2). In such cases models must adapt to a new definition of "normal" in an unsupervised, automated fashion.

In streaming applications, early detection of anomalies is valuable in almost any use case. Consider a system that continuously monitors the health of a cardiac patient's heart. An anomaly in the data stream could be a precursor to a heart attack. Detecting such an anomaly minutes in advance is far better than detecting it a few seconds ahead, or detecting it after the fact. Detection of anomalies often gives critical information, and we want this information early enough that it is actionable, possibly preventing system failure. There is a tradeoff between early detections and false positives, as an algorithm that makes frequent inaccurate detections is likely to be ignored.

Given the above requirements, we define the ideal characteristics of a real-world anomaly detection algorithm as follows (a minimal interface sketch capturing these constraints appears after the list):

1. Predictions must be made online; i.e., the algorithm must identify state x_t as normal or anomalous before receiving the subsequent x_{t+1}.
2. The algorithm must learn continuously without a requirement to store the entire stream.
3. The algorithm must run in an unsupervised, automated fashion, i.e., without data labels or manual parameter tweaking.
4. Algorithms must adapt to dynamic environments and concept drift, as the underlying statistics of the data stream are often non-stationary.
5. Algorithms should make anomaly detections as early as possible.
6. Algorithms should minimize false positives and false negatives (this is true for batch scenarios as well).

Taken together, the above requirements suggest that anomaly detection for streaming applications is a fundamentally different problem than static batch anomaly detection.
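The contract implied by this list can be made concrete as a minimal streaming interface. The sketch below is illustrative only (the class and method names are ours, not from the paper or the NAB codebase): the detector consumes one record at a time and must classify it before the next record arrives, with learning folded into the same call.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class StreamingAnomalyDetector(ABC):
    """Contract implied by requirements 1-6: one record in, one decision out.

    Learning and detection are interleaved; the detector never sees
    x_{t+1} before it has classified x_t, and there is no train/test split.
    """

    @abstractmethod
    def handle_record(self, x_t: Sequence[float]) -> bool:
        """Classify x_t as anomalous (True) or normal, then update the model."""


def run_stream(detector: StreamingAnomalyDetector, stream) -> None:
    # Records are consumed strictly in arrival order: no batching, no look-ahead.
    for t, x_t in enumerate(stream):
        if detector.handle_record(x_t):
            print(f"anomaly detected at t={t}")
```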


Fig. 2. CPU utilization (percent) for an Amazon EC2 instance (data from the Numenta Anomaly Benchmark [2]). A modification to the software running on the machine
caused the CPU usage to change. The initial anomaly represents a changepoint, and the new system behavior that follows is an example of concept drift. Continuous learning
is essential for performing anomaly detection on streaming data like this.

As discussed further below, the majority of existing anomaly detection algorithms (even those designed for time-series data) are not applicable to streaming applications.

1.2. Related work

Anomaly detection in time-series is a heavily studied area of data science and machine learning, dating back to [5]. Many anomaly detection approaches exist, both supervised (e.g. support vector machines and decision trees [6]) and unsupervised (e.g. clustering), yet the vast majority of anomaly detection methods are for processing data in batches and are unsuitable for real-time streaming applications. Examples from industry include Netflix's robust principal component analysis (RPCA) method [7] and Yahoo's EGADS [8], both of which require analyzing the full dataset. Likewise, Symbolic Aggregate Approximation (SAX) [9] involves decomposing the full time series to generate symbols prior to anomaly detection. Other recent techniques include [10,11]. Although these techniques may work well in certain situations, they are traditional batch methods, and the focus of this paper is on methods for online anomaly detection. For reviews of anomaly detection in general we recommend [1,6,12,13]. For prior work on data stream mining and concept drift in general, see [3,14–17].

Some anomaly detection algorithms are partially online. They either have an initial phase of offline learning, or rely on look-ahead to flag previously-seen anomalous data. Most clustering-based approaches fall under the umbrella of such algorithms. Some examples include the Distributed Matching-based Grouping Algorithm (DMGA) [18], the Online Novelty and Drift Detection Algorithm (OLINDDA) [19], and the MultI-class learNing Algorithm for data Streams (MINAS) [20]. Another example is self-adaptive and dynamic k-means [21], which uses training data to learn weights prior to anomaly detection. Kernel-based recursive least squares (KRLS), proposed in [22], also violates the principle of no look-ahead, as it resolves temporarily flagged data instances a few time steps later to decide if they were anomalous. However, some kernel methods, such as EXPoSE [23], adhere to our criteria of real-time anomaly detection (see the evaluation section below).

For streaming anomaly detection, the majority of methods used in practice are statistical techniques that are computationally lightweight. These techniques include sliding thresholds, outlier tests such as extreme studentized deviate (ESD, also known as Grubbs') and k-sigma (e.g., [24,25]), changepoint detection [26], statistical hypothesis testing, and exponential smoothing such as Holt–Winters [27]. Typicality and eccentricity analysis [28,29] is an efficient technique that requires no user-defined parameters. Most of these techniques focus on spatial anomalies, limiting their usefulness in applications with temporal dependencies.

More advanced time-series modeling and forecasting models are capable of detecting temporal anomalies in complex scenarios. ARIMA is a general-purpose technique for modeling temporal data with seasonality [30]. It is effective at detecting anomalies in data with regular daily or weekly patterns. Extensions of ARIMA enable the automatic determination of seasonality [31] for certain applications. A more recent example capable of handling temporal anomalies is the technique in [32], based on relative entropy.

Model-based approaches have been developed for specific use cases, but require explicit domain knowledge and are not generalizable. Domain-specific examples include anomaly detection in aircraft engine measurements [33], cloud datacenter temperatures [34], and ATM fraud detection [35]. Kalman filtering is a common technique, but the parameter tuning often requires domain knowledge and choosing specific residual error models [12,36–38]. Model-based approaches are often computationally efficient, but their lack of generalizability limits their applicability to general streaming applications.


Fig. 3. (a) A block diagram outlining the primary functional steps used to create a complete anomaly detection system based on HTM. Our process takes the output of an
HTM system and then performs two additional post-processing steps: computing the prediction error followed by computing an anomaly likelihood measure. (b) Breakdown
of the core algorithm components within an HTM system.

There are a number of other restrictions that can make methods unsuitable for real-time streaming anomaly detection, such as computational constraints that impede scalability. An example is Lytics Anomalyzer [39], which runs in O(n²), limiting its usefulness in practice where streams are arbitrarily long. Dimensionality is another factor that can make some methods restrictive. For instance, online variants of principal component analysis (PCA) such as osPCA [40] or window-based PCA [41] can only work with high-dimensional, multivariate data streams that can be projected onto a low-dimensional space. Techniques that require data labels, such as supervised classification-based methods [42], are typically unsuitable for real-time anomaly detection and continuous learning.

Additional techniques for general purpose anomaly detection on streaming data include [9,43,44]. Twitter has an open-source method based on Seasonal Hybrid ESD [45]. Skyline is another popular open-source project, which uses an ensemble of statistical techniques for detecting anomalies in streaming data [24]. We include comparisons to both of these methods in our Results section.

1.3. Outline

The contributions of this paper are twofold: a novel anomaly detection technique built for real-time applications, and a comprehensive set of results on a benchmark designed for evaluating anomaly detection algorithms on streaming data. In Section 2 we show how to use Hierarchical Temporal Memory (HTM) networks [46–48] to robustly detect anomalies on a variety of data streams. The resulting system is efficient, extremely tolerant to noisy data, continuously adapts to changes in the statistics of the data, and detects subtle temporal anomalies while minimizing false positives. The HTM implementation and documentation are available as open-source.¹ In Section 3 we review the Numenta Anomaly Benchmark (NAB) [2], a rigorous benchmark dataset and scoring methodology we created for evaluating real-time anomaly detection algorithms. In Section 4 we present results on NAB for ten algorithms, many of which are commonly used in industry and academia. Section 5 concludes with a summary and directions for future work.

2. Anomaly detection using HTM

Based on known properties of cortical neurons, Hierarchical Temporal Memory (HTM) is a theoretical framework for sequence learning in the cortex [46]. HTM implementations operate in real-time and have been shown to work well for prediction tasks [47,49]. HTM networks continuously learn and model the spatiotemporal characteristics of their inputs, but they do not directly model anomalies and do not output a usable anomaly score. In this section we describe our technique for applying HTM to anomaly detection.

Fig. 3(a) shows an overview of our process. At each point in time, the input data x_t is fed to a standard HTM network. We perform two additional computations on the output of the HTM. We first compute a measure of prediction error, s_t. Then, using a probabilistic model of s_t, we compute L_t, a likelihood that the system is in an anomalous state. A threshold on this likelihood determines whether an anomaly is detected. In the following subsections, we provide an overview of HTM systems and then describe our techniques for the additional steps of computing the prediction error and anomaly likelihood. Taken together, the algorithm fulfills the requirements for streaming applications outlined in Section 1.1.

2.1. Overview of HTM

Fig. 3(b) shows the core algorithm components and representations within a typical HTM system [49]. The current input, x_t, is fed to an encoder [50] and then a sparse spatial pooling process [51,52]. The resulting vector, a(x_t), is a sparse binary vector representing the current input. The heart of the system is the sequence memory component. This component models temporal patterns in a(x_t) and outputs a prediction in the form of another sparse vector, π(x_t). π(x_t) is thus a prediction for a(x_{t+1}).

HTM sequence memory consists of a layer of HTM neurons organized into a set of columns (Fig. 4). The network accepts a stream of inputs encoded as sparse vectors. It models high-order sequences (sequences with long-term dependencies) using a composition of two separate sparse representations: the current input, x_t, and the previous sequence context, ..., x_{t−3}, x_{t−2}, x_{t−1}, are simultaneously encoded using a dynamically updated sparse distributed representation. The network uses these representations to make predictions about the future in the form of a sparse vector.

¹ HTM implementation is available at https://github.com/numenta/nupic.
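The overall pipeline of Fig. 3(a) can be summarized as glue code around the two post-processing steps developed in Sections 2.2 and 2.3 below. The following is a hypothetical sketch under stated assumptions: htm_network stands in for any HTM implementation (e.g. the open-source nupic package), its step method and the helper names are ours, and prediction_error and likelihood_model correspond to the sketches given in the next two subsections.

```python
def detect_anomalies(stream, htm_network, likelihood_model):
    """Fig. 3(a): encode -> HTM -> prediction error -> anomaly likelihood."""
    prev_prediction = None  # pi(x_{t-1}), the prediction made at the previous step
    for x_t in stream:
        # a_t = a(x_t) and pi_t = pi(x_t), both sparse binary vectors.
        a_t, pi_t = htm_network.step(x_t)
        if prev_prediction is not None:
            s_t = prediction_error(prev_prediction, a_t)  # Eq. (1), Section 2.2
            if likelihood_model.update(s_t):              # Eqs. (2)-(6), Section 2.3
                yield x_t                                 # report an anomaly
        prev_prediction = pi_t
```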


Fig. 4. The HTM sequence memory. (a) HTM sequence memory models one layer of cortex. The layer consists of a set of mini-columns, with each mini-column containing
multiple neurons. (b) An HTM neuron (left) models the dendritic structure of pyramidal neurons in cortex (right). An HTM neuron models dendrites as an array of coincidence
detectors, each with a set of synapses. Context dendrites receive lateral input from other neurons within the layer. Each dendrite represents one transition in a sequence.
Sufficient lateral activity on a context dendrite will cause the cell to enter a predicted state. (c) Representing high-order Markov sequences with shared subsequences (ABCD
vs. XBCY). Each sequence element invokes a sparse set of cells within mini-columns. Cells that are predicted through lateral connections prevent other cells in the same
column from firing through intra-column inhibition resulting in a highly sparse representation. As shown in the figure, such a representation can maintain past context.
Because different cells respond to “C” in the two sequences (C’ and C’’), they can then invoke the correct prediction of either D or Y depending on the input from two time
steps ago.

Fig. 4(c) shows how the sparse representations are used to represent temporal patterns and disambiguate sequences with long-term dependencies. When receiving the next input, the network uses the difference between the predicted input and the actual input to update its synaptic connections. Learning happens at every time step, but since the representations are highly sparse only a tiny percentage of the synapses are updated.

The details of the HTM learning algorithm and the properties of its representation are beyond the scope of this paper but are described in depth in [46,53]. In our implementation we use the standard HTM system [49] and a standard set of parameters (see Supplementary Section S3 for the complete list).

2.2. Computing the prediction error

Given the current input, x_t, a(x_t) is a sparse encoding of the current input, and π(x_{t−1}) is the sparse vector representing the HTM network's internal prediction of a(x_t). The dimensionality of both vectors is equal to the number of columns in the HTM network (we use a standard value of 2048 for the number of columns in all our experiments). Let the prediction error, s_t, be a scalar value inversely proportional to the number of bits common between the actual and predicted binary vectors:

s_t = 1 − (π(x_{t−1}) · a(x_t)) / |a(x_t)|    (1)

where |a(x_t)| is the scalar norm, i.e. the total number of 1 bits in a(x_t). In Eq. (1) the error s_t will be 0 if the current a(x_t) perfectly matches the prediction, and 1 if the two binary vectors are orthogonal (i.e. they share no common 1 bits). s_t thus gives us an instantaneous measure of how well the underlying HTM model predicts the current input x_t.

Changes to the underlying statistics are handled automatically due to the continuous learning nature of HTMs. If there is a shift in the behavior of the system, the prediction error will be high at the point of the shift, but will automatically degrade to zero as the model adapts to the "new normal". Fig. 5 shows an example stream and the behavior of the prediction error s_t. Shifts in the temporal characteristics of the system are handled in addition to spatial shifts in the underlying metric values.

An interesting aspect of this metric is that branching sequences are handled correctly. In HTMs, multiple predictions are represented in π(x_t) as a binary union of each individual prediction. Similar to Bloom filters, as long as the vectors are sufficiently sparse and of sufficient dimensionality, a moderate number of predictions can be represented simultaneously with an exponentially small chance of error [53,54]. The above error handles branching sequences gracefully in the following sense: if two completely different inputs are both possible and predicted, receiving either input will lead to a 0 error; any other input will generate a positive error.
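Since a(x_t) and π(x_{t−1}) are binary vectors of equal dimensionality, Eq. (1) reduces to counting shared 1 bits. A minimal NumPy sketch of the computation follows (variable names are illustrative, not from the reference implementation):

```python
import numpy as np


def prediction_error(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Eq. (1): s_t = 1 - (pi(x_{t-1}) . a(x_t)) / |a(x_t)|.

    Both arguments are binary (0/1) vectors with one entry per column
    of the HTM network (e.g. 2048 columns).
    """
    overlap = float(np.dot(predicted, actual))  # number of shared 1 bits
    n_active = float(actual.sum())              # |a(x_t)|: count of 1 bits
    return 1.0 - overlap / n_active


# Example: 40 active bits, half of them correctly predicted, gives s_t = 0.5.
actual = np.zeros(2048)
actual[:40] = 1
predicted = np.zeros(2048)
predicted[20:60] = 1
print(prediction_error(predicted, actual))  # 0.5
```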
2.3. Computing anomaly likelihood

The prediction error described above represents an instantaneous measure of the predictability of the current input stream. As shown in Fig. 5, it works well for certain scenarios. In some applications, however, the underlying system is inherently very noisy and unpredictable, and instantaneous predictions are often incorrect. As an example, consider Fig. 6(a). This data shows the latency of a load balancer in serving HTTP requests on a production website. Although the latency is generally low, it is not unusual to have occasional random jumps, leading to corresponding spikes in prediction error as shown in Fig. 6(b). The true anomaly is actually later in the stream, corresponding to a sustained increase in the frequency of high latency requests. Thresholding the prediction error directly would lead to many false positives.


Fig. 5. An example stream along with prediction error. (a) This plot shows CPU usage on a database server over time. There are two unusual behaviors in this stream: the temporary spike up to 75% and the sustained shift up to 30% usage. (b) This plot shows the prediction error while the HTM trains on this stream. Early during training the prediction error is high while the HTM model learns the data. There is a spike in prediction error corresponding to the temporary spike in CPU usage that quickly drops once usage goes back near normal. Finally, there is an increase in prediction error corresponding to the sustained shift, which drops after the HTM has learned the new behavior.

To handle this class of scenarios, we introduce a second step. Rather than thresholding the prediction error s_t directly, we model the distribution of error values as an indirect metric, and use this distribution to check for the likelihood that the current state is anomalous. The anomaly likelihood is thus a probabilistic metric defining how anomalous the current state is based on the prediction history of the HTM model. To compute the anomaly likelihood we maintain a window of the last W error values. We model the distribution as a rolling normal distribution², where the sample mean, μ_t, and variance, σ_t², are continuously updated from previous error values as follows:

μ_t = ( Σ_{i=0}^{W−1} s_{t−i} ) / W    (2)

σ_t² = ( Σ_{i=0}^{W−1} (s_{t−i} − μ_t)² ) / (W − 1)    (3)

We then compute a recent short-term average of prediction errors, and apply a threshold to the Gaussian tail probability (Q-function [55]) to decide whether or not to declare an anomaly.³ We define the anomaly likelihood as the complement of the tail probability:

L_t = 1 − Q( (μ̃_t − μ_t) / σ_t )    (4)

where:

μ̃_t = ( Σ_{i=0}^{W′−1} s_{t−i} ) / W′    (5)

Here W′ is a window for a short-term moving average, where W′ ≪ W, the duration for computing the distribution of prediction errors.⁴ We threshold L_t based on a user-defined parameter ε to report an anomaly:

anomaly detected_t ≡ L_t ≥ 1 − ε    (6)

² The distribution of prediction errors is not technically a normal distribution. We have also attempted to model the errors using a number of other distributions and distribution-free bounds, such as Chebyshev's inequality. Modeling errors as a simple normal distribution worked significantly better than these other attempts. It is nevertheless still possible that modeling another distribution could improve results.
³ The tail probability is the probability that a variable will be larger than x standard deviations above the mean.
⁴ In practice we find that system performance is not sensitive to W as long as it is large enough to compute a reliable distribution. We use a generous value of W = 8000, and W′ = 10.


Fig. 6. A very noisy, unpredictable stream. (a) The data shows the latency (in seconds) of a load balancer on a production website. The anomaly (indicated by the dot) is an
unusual sustained increase in latencies around April 10. (b) The prediction error from an HTM model on the latency values. The unpredictable nature of the latencies results
in frequent spikes in the prediction error that cannot be distinguished from the true positives. The fact that the unpredictable metric values are spikes and the rest of the
latencies are close to zero results in the coincidental similarity between the latencies and resulting prediction error. (c) A log-scale plot of the anomaly likelihood computed
from the prediction error. Unlike the prediction error plot, there is a clear peak right around the real anomaly.

Since thresholding L_t involves thresholding a tail probability, there is an inherent upper limit on the number of alerts and a corresponding upper bound on the number of false positives. With ε very close to 0, it would be unlikely to get alerts with probability much higher than ε. In practice we have found that ε = 10⁻⁵ works well across a large range of domains, and the user does not normally need to specify a domain-dependent threshold.

Fig. 6(c) shows an example of the anomaly likelihood, L_t, on noisy load balancer data. The figure demonstrates that the anomaly likelihood provides clearer peaks in extremely noisy scenarios compared to pure prediction error.

It is important to note that L_t is based on the distribution of prediction errors, not on the distribution of underlying metric values x_t. As such, it is a measure of how well the model is able to predict, relative to the recent history. In clean, predictable scenarios L_t behaves similarly to s_t. In these cases, the distribution of errors will have very small variance and will be centered near 0. Any spike in s_t will similarly lead to a corresponding spike in L_t. However, in scenarios with some inherent randomness or noise, the variance will be wider and the mean further from 0. A single spike in s_t will not lead to a significant increase in L_t, but a series of spikes will. Importantly, a scenario that goes from wildly random to completely predictable will also trigger an anomaly.

2.4. Extensions

Modeling multiple streams simultaneously can enable the system to detect anomalies that cannot be detected from one stream alone. In Supplementary Section S4, we discuss this scenario and describe an extension for performing anomaly detection across multiple streams. We show how to combine independent models while accounting for temporal drift. This is particularly useful when there are many sensors and the combinations that enable detection are unknown.

Our algorithm is agnostic to the data type, as long as the data can be encoded as a sparse binary vector that captures the semantic characteristics of the data. We present an interesting extension with streaming geospatial data in Supplementary Section S2, demonstrating the applicability in diverse industries.


Fig. 7. Several data streams from the NAB corpus, showing a variety of data sources and characteristics. From top left proceeding clockwise: click-through prices for online advertisements, an artificial stream with some noise but no anomalies, AWS CloudWatch CPU utilization data, autoscaling group data for a server cluster, a stream of tweet volumes related to FB stock, and hourly demand for New York City taxis.

3. Evaluation of streaming anomaly detection algorithms

Numerous benchmarks exist for anomaly detection [1,56], but these benchmarks are generally designed for static datasets. Even benchmarks containing time-series data typically do not capture the requirements of real-time streaming applications. It is also difficult to find examples of real-world data that is labeled with anomalies. Yahoo released a dataset for anomaly detection in time-series data [8], but it is not available outside of academia and does not incorporate the requirements of streaming applications. As such, we have created the Numenta Anomaly Benchmark (NAB) with the following goals:

1. Provide a dataset of labeled data streams from real-world streaming applications.
2. Provide a scoring methodology and set of constraints designed for streaming applications.
3. Provide a controlled open repository for researchers to evaluate and compare anomaly detection algorithms for streaming applications.

We briefly describe each of these below (full details can be found in [2]).

3.1. Benchmark dataset

The aim of the NAB dataset is to present algorithms with the challenges they will face in real-world scenarios, such as a mix of spatial and temporal anomalies, clean and noisy data, and data streams where the statistics evolve over time. The best way to do this is to provide data streams from real-world use cases, from a variety of domains and applications. The data currently in the NAB corpus represents a variety of sources, ranging from server network utilization to temperature sensors on industrial machines to social media chatter.

NAB version 1.0 contains 58 data streams, each with 1000–22,000 records, for a total of 365,551 data points. Also included are some artificially-generated data files that test anomalous behaviors not yet represented in the corpus's real data, as well as several data files without any anomalies. All data files are labeled, either because we know the root cause for the anomalies from the provider of the data, or as a result of the well-defined NAB labeling procedure.⁵ These labels define the ground truth anomalies used in the NAB scoring process. Fig. 1 is an example of noisy sensor data with spatial and temporal anomalies. Fig. 2 shows two related anomalies preceding a shift in the underlying statistics of the stream.


Fig. 8. A zoomed in view showing the anomaly window around the first anomaly in Fig. 1. The window size is set large enough to reward early detections of the anomaly,
yet small enough such that poor detections count as false positives.

Fig. 7 shows several data streams from the benchmark dataset, sourced from a variety of domains and exhibiting diverse characteristics such as temporal noise and short- and long-term periodicities.

3.2. NAB scoring

The NAB scoring system formalizes a set of rules to determine the overall quality of streaming anomaly detection relative to the ideal, real-world anomaly detection algorithm that we defined earlier. There are three key aspects of scoring in NAB: anomaly windows, the scoring function, and application profiles. These are described below; a detailed discussion can be found in [2].

To incorporate the value of early detection into scoring, we define anomaly windows that label ranges of the data streams as anomalous, and a scoring function that uses these windows to reward early detections (and penalize later detections). When an algorithm is tested on NAB, the resulting detections must be scored. After an algorithm has processed a data file, the windows are used to identify true and false detections, and the scoring function is applied relative to each window to give value to the resulting true positives and false positives. Detections within a window correctly identify anomalous data and are true positives (TP), increasing the NAB score. If a detection occurs at the beginning of a window, it is given more value by the scoring function than a detection towards the end of the window. Multiple detections within a single window identify the same anomalous data, so we only use the earliest (most valuable) detection for the score contribution. Detections falling outside the windows are false positives, giving a negative contribution to the NAB score. The value of false positives is also calculated with the scoring function, such that a false positive (FP) that occurs just after a window hurts the NAB score less than a FP that occurs far away from the window. Missing a window completely, or a false negative (FN), results in a negative contribution to the score. Refer to Supplementary Section S1 for details on the scoring equations. All the scoring code and documentation is available in the repository.⁵

How large should the windows be? Large windows promote early detection of anomalies, but the tradeoff is that random or unreliable detections would be regularly reinforced. Using the underlying assumption that anomalous data is rare, the anomaly window length is defined to be 10%⁶ of the length of a data file, divided by the number of anomalies in the given file. This scheme provides a generous window to reward early detections and gives partial credit for detections slightly after the true anomaly, yet is small enough that poor detections are likely to be counted negatively (Fig. 8). Note that the streaming algorithms themselves have no information regarding the windows or the data length. Anomaly windows are only used as part of the benchmark and scoring system to gauge end performance.

NAB also includes a mechanism to evaluate algorithms on their bias towards false positives or false negatives. Depending on the application, a FN may be much more significant than a FP, such as when detecting anomalies in ECG data, where a missed detection could be fatal. These scenarios are formalized in NAB by defining application profiles that vary the relative values of these metrics. For example, a datacenter application would be interested in the "Reward Low FP" profile, where false positives are weighted more heavily than in the other profiles.

The combination of anomaly windows, a smooth temporal scoring function, and application profiles allows researchers to evaluate online anomaly detector implementations against the requirements of the ideal detector. The NAB scoring system evaluates real-time performance, prefers earlier detection of anomalies, penalizes "spam" (i.e. FPs), and provides realistic costs for the standard classification evaluation metrics TP, FP, TN, and FN.

3.3. NAB is open-source

With the intent of fostering innovation in the field of anomaly detection, NAB is designed to be an accessible and reliable framework for all to use. Included with the open-source data and code is extensive documentation and examples on how to test algorithms.

⁵ The NAB repository contains all the open-source data, code, and extensive documentation, including the labeling procedure and examples for running anomaly detection algorithms on NAB: https://github.com/numenta/NAB.
⁶ We tested a range of window sizes (between 5% and 20%) and found that, partly due to the scaled scoring function, the end score was not sensitive to this percentage.


Table 1
NAB scoreboard showing results of each algorithm on v1.0 of NAB. Note that the Random detector scores reflect the mean over a range of random seeds. HTM AL denotes the HTM algorithm using prediction error plus anomaly likelihood, as described in Section 2.3. HTM PE denotes the HTM algorithm using prediction error only (Section 2.2). + Algorithms that were winners of the WCCI NAB competition.

Detector               Standard Profile   Reward Low FP   Reward Low FN
Perfect                100                100             100
HTM AL                 70.1               63.1            74.3
CAD OSE+               69.9               67.0            73.2
nab-comportex+         64.6               58.8            69.6
KNN–CAD+               58.0               43.4            64.8
Relative Entropy       54.6               47.6            58.8
HTM PE                 53.6               34.2            61.9
Twitter ADVec          47.1               33.6            53.5
Etsy Skyline           35.7               27.1            44.5
Sliding Threshold      30.7               12.1            38.3
Bayesian Changepoint   17.7               3.2             32.2
EXPoSE                 16.4               3.2             26.9
Random                 11.0               1.2             19.5
Null                   0                  0               0

Table 2
Comparison of properties for algorithms we implemented and tested on NAB (based on published information; excludes competition entries). Latency measures the time taken to process a single data point for anomaly detection. The latency reported is an average over three runs on a single large NAB data file consisting of 22,695 data records. Timing was performed on an eight-core laptop with a 2.6 GHz Intel Core i7 processor. NAB Score reflects the standard profile scores.

Detector               Latency (ms)   Spatial Anomaly   Temporal Anomaly   Concept Drift   Non-Parametric   NAB Score
HTM                    11.3           ✔                 ✔                  ✔               ✔                70.1
Relative Entropy       0.05           ✔                 ✔                  ✔               ✔                54.6
Twitter ADVec          3.0            ✔                 ✔                  ✔               ✗                47.1
Etsy Skyline           414.2          ✔                 ✗                  ✗               ✗                35.7
Sliding Threshold      0.4            ✔                 ✗                  ✗               ✗                30.7
Bayesian Changepoint   3.5            ✔                 ✗                  ✔               ✗                17.7
EXPoSE                 2.6            ✔                 ✔                  ✔               ✔                16.4

The NAB repository contains source code for commonly used algorithms for online anomaly detection, as well as some algorithms submitted by the community.

4. Results & discussion

In this section we discuss NAB results and the comparative performance of a collection of real-time anomaly detection algorithms drawn from industry and academia. Our goals are to evaluate the performance of our HTM anomaly detection algorithm and, through a detailed discussion of the findings, to facilitate further research on unsupervised real-time anomaly detection algorithms.

4.1. Tested algorithms and parameter tuning

We considered a number of anomaly detection methods, but the list was heavily filtered based on the criteria discussed in Sections 1.1 and 1.2. The algorithms evaluated include HTM, Twitter's Anomaly Detection, Etsy's Skyline, Multinomial Relative Entropy [32], EXPoSE [23], Bayesian Online Changepoint detection [57], and a simple sliding threshold. Some of these algorithms have open-source implementations, and we implemented the rest based on their respective papers. We performed extensive parameter tuning for each algorithm; the resulting parameters are optimal to the best of our knowledge. Most of the algorithms also involve setting thresholds. The parameters were kept constant across all streams, and a single fixed threshold for each algorithm was used for the entire dataset, as required by NAB. Additional implementation details for these algorithms are included in Supplementary Section S5.

In addition, during the summer of 2016 we ran a NAB competition in collaboration with IEEE WCCI⁷ to encourage additional algorithm testing. We include the results of the three competition winners below.

We use the HTM algorithm as described in Section 2. The core HTM algorithm by its nature is not highly sensitive to parameters. We used the architecture shown in Fig. 3 and the standard HTM parameter set (see Supplementary Section S3). The parameters specific to anomaly detection, ε, W, and W′, were set to 10⁻⁵, 8000, and 10, respectively. Timestamps and metric values in the NAB dataset were encoded using standard HTM datetime and scalar encoders [50]. In the results below, we show two variations: NAB results obtained by using prediction error only (i.e. thresholding s_t directly), and results obtained by using the anomaly likelihood, as defined in Section 2.3.

For transparency and reproducibility, we have incorporated the source code and parameter settings for all of the above algorithms into the NAB repository.

4.2. Comparison of results

Table 1 summarizes the NAB scores for each algorithm across all application profiles, including the three NAB competition winners. In addition to the algorithms described above, we also use three control detectors in NAB. A "null" detector runs through the dataset passively, making no detections and accumulating all false negatives. A "perfect" detector is an oracle that outputs detections that would maximize the NAB score; i.e., it outputs only true positives at the beginning of each window. The raw scores from these two detectors are used to scale the score for all other algorithms between 0 and 100.

⁷ http://www.wcci2016.org.
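The scaling against the two control detectors described above can be expressed as an affine rescaling in which the null detector's raw score maps to 0 and the perfect detector's to 100. The sketch below is our reading of that description, not the reference implementation (which is in the NAB repository):

```python
def normalized_nab_score(raw: float, raw_null: float, raw_perfect: float) -> float:
    """Map a detector's raw score so that null -> 0 and perfect -> 100."""
    return 100.0 * (raw - raw_null) / (raw_perfect - raw_null)
```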


Fig. 9. These plots show detector results for two example NAB data streams. In both cases we show a subset of the full data stream. The plotted shapes correspond to the
detections of seven different detectors: HTM, Multinomial Relative Entropy, Twitter ADVec, Skyline, Sliding Threshold, Bayesian Online Changepoint, and EXPoSE. Shapes that
correspond to the same data point have been spaced vertically for clarity. For a given detector, true positive detections within each window (red shaded regions) are labeled
in black. All false positive detections are colored red. (a) Detection results for a production server’s CPU metric. The second anomaly shows a sustained shift that requires
algorithms to adjust to a new normal behavior. (b) Results for the data stream shown in Fig. 2. Here we see a subtle temporal anomaly that preceded a large, obvious spike
in the data. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

We also include a "random" detector that outputs a random anomaly probability for each data instance, which is then thresholded across the dataset for a range of random seeds. The score from this detector offers some intuition for chance-level performance on NAB.

Overall we see that the HTM detector using anomaly likelihood gets the best score, followed by CAD-OSE, nab-comportex, KNN–CAD, and the Multinomial Relative Entropy detector. The HTM detector using only prediction error performs moderately well, but using the anomaly likelihood step significantly improves the scores. Most of the algorithms perform much better than a random detector. There is a wide range of scores, but none are close to perfect, suggesting there is still significant room for improvement.

Table 2 summarizes the various algorithmic properties of each of the algorithms we implemented. Each algorithm is categorized based on its ability to detect spatial and temporal anomalies, handle concept drift, and automatically update parameters; these characteristics are based on published information, which may or may not reflect the actual performance. We also list the measured latency of processing each data point. Several algorithms claim to have all of the listed properties, but their actual anomaly detection performance on the benchmark varies significantly. In general there is a rough correlation between the number of properties satisfied and the NAB ranking (with the exception of EXPoSE; see discussion below).

Based on a more detailed analysis of the results, we highlight four factors that were important for achieving good performance on NAB: concept drift, the ability to detect temporal anomalies, assumptions regarding distribution, and assumptions regarding the number of data points. We discuss each of these qualitatively below, along with more detailed file-level comparisons.


Table 3
Comparison of average NAB scores for some algorithms across data from different kinds of sources provided by the benchmark. For each source, average scores are shown separately for data files that contain only spatial anomalies, only temporal anomalies, and a mixture of spatial and temporal anomalies. The average score is given by

avg(a_s) = ( Σ_{i ∈ D(a_s)} score_i ) / ( Σ_{i ∈ D(a_s)} count_{s,i} )

where D(a_s) denotes the data files from a given source s consisting of anomalies of the given type a, score_i denotes the NAB score for the ith file, and count_{s,i} denotes the number of occurrences of the given anomaly type in the ith file. See Supplementary Section S6 for the scores and anomaly counts for each individual file. The best average scores across each row of algorithms are shown in bold. Note that scores are computed over the standard profile, with equal weights assigned to false positives and false negatives.

Source Anomaly Type Numenta HTM CAD OSE KNN CAD Relative Entropy Twitter AdVec Skyline Sliding Threshold

Artificial Spatial only 0.70 0.84 −0.06 0.78 0.55 −0.39 −1.00
Temporal only 0.11 0.08 −0.13 −0.52 0.52 −0.38 −0.90
Spatial + Temporal N/A N/A N/A N/A N/A N/A N/A
Online advertisement clicks Spatial only 0.75 0.21 0.54 −0.55 −0.03 −1.00 −0.38
Temporal only 0.53 0.83 0.54 −0.53 −1.00 −1.00 −1.00
Spatial + Temporal 0.47 0.47 0.37 −0.15 0.10 −0.47 0.41
AWS server metrics Spatial only 0.61 0.74 0.54 0.11 0.28 0.40 −0.42
Temporal only 0.70 0.29 0.03 0.53 −0.66 −0.77 −1.33
Spatial + Temporal 0.29 0.20 0.01 −0.23 −0.45 −0.14 −0.34
Miscellaneous known causes Spatial only 0.19 0.33 0.23 0.13 0.00 −0.38 −0.41
Temporal only −0.60 −0.79 0.18 −1.00 −0.91 −1.14 −0.96
Spatial + Temporal 0.32 −0.15 −0.34 0.30 −0.48 −0.79 −0.92
Freeway traffic Spatial only 0.83 0.85 −0.06 0.62 0.85 0.87 −0.38
Temporal only 0.55 0.86 −1.44 0.75 −1.00 −1.00 −1.00
Spatial + Temporal 0.51 0.80 0.26 0.50 −0.27 0.40 −0.83
Tweets volume Spatial only 0.38 0.74 0.10 0.64 0.04 −1.46 −0.17
Temporal only N/A N/A N/A N/A N/A N/A N/A
Spatial + Temporal 0.34 0.43 0.29 0.13 0.20 −0.25 0.05
Total Average 0.40 0.39 0.16 0.10 −0.03 −0.28 −0.36

The ability of each algorithm to learn continuously and handle concept drift is one of the contributing factors to obtaining a good NAB score. Fig. 9(a) demonstrates one example of this. This file shows CPU usage on a production server over time, and contains two anomalies. The first is a simple spike that is easily detectable by all algorithms. The second is a sustained shift in the usage, where the initial change in sequence is an anomalous changepoint, but the new behavior persists. Most algorithms detect the change and then adapt to the new normal. However, the Twitter ADVec algorithm fails to detect the changepoint and continues to generate anomalies for several days. The NAB corpus contains several similar examples, where an anomaly sets off a change in sequence that redefines the normal behavior. This is representative of one of the key challenges in streaming data. An inability to handle concept drift effectively results in a greater number of false positive detections, lowering the NAB score.

The ability to detect subtle temporal anomalies, while limiting false positives, is a second major factor in obtaining good NAB scores. In practical applications, one major benefit of detecting temporal patterns is that it often leads to the early detection of anomalies. Fig. 1 shows a representative example. The temporal anomaly in the middle of this figure precedes the actual failure by several days. In Fig. 2, the strong spike is preceded by a subtle temporal shift several hours earlier. Fig. 9(b) shows detection results on that data stream. The Twitter ADVec, Skyline, and Bayesian Online Changepoint algorithms easily detect the spike, but there are subtle changes in the data preceding the obvious anomaly. The Multinomial Relative Entropy and HTM detectors both flag anomalous data well before the large spike and thus obtain higher scores. It is challenging to detect such changes in noisy data without a large number of false positives. Both the HTM and Multinomial Relative Entropy detectors perform well in this regard across the NAB dataset.

The third major factor concerns assumptions regarding the underlying distribution of data. A general lesson is that algorithms making fewer assumptions regarding the distribution perform better. This is particularly important for streaming applications, where algorithms must be unsupervised, automated, and applicable across a variety of domains. Techniques such as the sliding threshold and Bayesian Online Changepoint detectors make strong assumptions regarding the data and suffer as a result. Note that the Gaussian used in our anomaly likelihood technique models the distribution of prediction errors, not the underlying metric data. As such, it is a non-parametric technique with respect to the data.

Another interesting factor is demonstrated by the performance of EXPoSE. Theoretically, EXPoSE possesses all the properties in Table 2; however, it performs poorly on the benchmark. One of the reasons for this behavior is that EXPoSE has a dependence on the size of the dataset and is more suitable for large-scale datasets with high-dimensional features [58]. The technique computes an approximate mean kernel embedding, and small or moderate datasets do not provide a sufficiently good proxy for this approximation. The average NAB data file contains 6300 records and is representative of real streaming applications. This issue highlights the need to output reliable anomalies relatively quickly.

4.3. Detailed NAB results

A breakdown of the algorithms' performance on the benchmark is shown in Table 3. Results have been aggregated across data sources ranging from artificially generated streams to real streams from advertisement clicks, server metrics, traffic data, and Twitter volume. Data streams are characterized by spatial anomalies, temporal anomalies, or a combination thereof. Grouping the streams by their anomaly types in Table 3 helps us inspect the characteristics of the algorithms identified earlier in Table 2.

Results show that both HTM and CAD-OSE yield the best overall aggregate scores on almost all data sources and anomaly types, with the exceptions of Twitter AdVec on artificial temporal streams and KNN–CAD on miscellaneous known causes. The difference between aggregate scores for HTM and CAD-OSE for the majority of the data streams is less than 0.20. For some stream types HTM significantly outperforms CAD-OSE, such as spatial advertisement streams, temporal server metric streams, and spatial/temporal miscellaneous streams. In particular, the results show HTM performing well on server metrics and online advertisement data, while CAD-OSE performs well on traffic and Twitter streams.

4.3. Detailed NAB results

A breakdown of the algorithms' performance on the benchmark is shown in Table 3. Results have been aggregated across data sources ranging from artificially generated streams to real streams from advertisement clicks, server metrics, traffic data, and Twitter volume. Data streams are characterized by spatial anomalies, temporal anomalies, or a combination thereof. Grouping the streams by their anomaly types in Table 3 helps us inspect the characteristics of the algorithms identified earlier in Table 2.

Results show that both HTM and CAD-OSE yield the best overall aggregate scores on almost all data sources and anomaly types, with the exceptions of Twitter AdVec on artificial temporal streams and KNN-CAD on miscellaneous known causes. The difference between the aggregate scores of HTM and CAD-OSE is less than 0.20 for the majority of the data streams. For some stream types, such as spatial advertisement streams, temporal server metric streams, and spatial/temporal miscellaneous streams, HTM significantly outperforms CAD-OSE. In particular, the results show HTM performing well on server metrics and online advertisement data, while CAD-OSE performs well on traffic and Twitter streams.

In addition, the results in Table 3 demonstrate that statistical techniques with assumptions on the data distribution, such as the sliding threshold, Twitter AdVec, and Skyline, may be able to capture spatial anomalies (e.g., Skyline on spatial traffic streams) but are not effective enough at capturing temporal anomalies. This is reflected in the negative scores for most temporal and spatial/temporal anomaly streams for these algorithms. This further reinforces the correlation between non-parametric techniques and detection of temporal anomalies.

5. Conclusion

With the increase in connected real-time sensors, the detection of anomalies in streaming data is becoming increasingly important. The use cases cut across a large number of industries. We believe anomaly detection represents one of the most significant near-term applications for machine learning in IoT.

In this paper we have discussed a set of requirements for unsupervised real-time anomaly detection on streaming data and proposed a novel anomaly detection algorithm for such applications. Based on HTM, the algorithm is capable of detecting spatial and temporal anomalies in predictable and noisy domains. The algorithm meets the requirements of real-time, continuous, online detection without look-ahead or supervision.

We also reviewed NAB, an open benchmark for real-world streaming applications. We showed results of running a number of algorithms on this benchmark. We highlighted three key factors that impacted performance: concept drift, detection of temporal anomalies, and assumptions regarding the distribution and size of data.

There are several areas for future work. The error analysis from NAB indicates that the errors across various algorithms (including HTM) are not always correlated. An ensemble-based approach might therefore provide a significant increase in accuracy (see the sketch below). The current NAB benchmark is limited to data streams containing a single metric plus a timestamp. Adding real-world multivariate data streams labeled with anomalies, such as the data available in the DAMADICS dataset [59], would be a valuable addition.
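As a sketch of what such an ensemble could look like (a hypothetical illustration, not an experiment reported in this paper), the per-record anomaly scores of several detectors can be combined with a simple weighted mean:

```python
import numpy as np

def ensemble_score(scores, weights=None):
    """Combine per-detector anomaly scores (each in [0, 1]) for a single
    record. Because the detectors' errors are not fully correlated, a
    weighted mean can suppress false positives unique to one detector."""
    scores = np.asarray(scores, dtype=float)
    weights = np.ones_like(scores) if weights is None else np.asarray(weights, dtype=float)
    return float(np.dot(weights, scores) / weights.sum())

# Hypothetical example: three detectors score the same record.
print(ensemble_score([0.95, 0.10, 0.20]))             # plain mean: ~0.42
print(ensemble_score([0.95, 0.10, 0.20], [2, 1, 1]))  # trust the first more: 0.55
```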
Acknowledgments

We thank the anonymous reviewers for their feedback and helpful suggestions. We also thank Yuwei Cui, Jeff Hawkins, and Ian Danforth for many helpful comments, discussions, and suggestions.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.neucom.2017.04.070.

References

[1] V. Chandola, V. Mithal, V. Kumar, Comparative evaluation of anomaly detection techniques for sequence data, in: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 743–748, doi:10.1109/ICDM.2008.151.
[2] A. Lavin, S. Ahmad, Evaluating real-time anomaly detection algorithms – the Numenta anomaly benchmark, in: Proceedings of the 14th International Conference on Machine Learning and Applications, Miami, Florida, IEEE, 2015, doi:10.1109/ICMLA.2015.141.
[3] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Comput. Surv. 46 (2014) 1–37, doi:10.1145/2523813.
[4] M. Pratama, J. Lu, E. Lughofer, G. Zhang, S. Anavatti, Scaffolding type-2 classifier for incremental learning under concept drifts, Neurocomputing 191 (2016) 304–329, doi:10.1016/j.neucom.2016.01.049.
[5] A.J. Fox, Outliers in time series, J. R. Stat. Soc. Ser. B 34 (1972) 350–363.
[6] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. 41 (2009) 1–72, doi:10.1145/1541880.1541882.
[7] J. Wong, Netflix Surus, Online Code Repos. (2015). https://github.com/Netflix/Surus.
[8] N. Laptev, S. Amizadeh, I. Flint, Generic and scalable framework for automated time-series anomaly detection, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1939–1947.
[9] E. Keogh, J. Lin, A. Fu, HOT SAX: efficiently finding the most unusual time series subsequence, in: Proceedings of the IEEE International Conference on Data Mining, ICDM, 2005, pp. 226–233, doi:10.1109/ICDM.2005.79.
[10] P. Malhotra, L. Vig, G. Shroff, P. Agarwal, Long short term memory networks for anomaly detection in time series, Eur. Symp. Artif. Neural Netw. (2015) 22–24.
[11] H.N. Akouemo, R.J. Povinelli, Probabilistic anomaly detection in natural gas time series data, Int. J. Forecast. 32 (2015) 948–956, doi:10.1016/j.ijforecast.2015.06.001.
[12] J. Gama, Knowledge Discovery from Data Streams, Chapman and Hall/CRC, Boca Raton, Florida, 2010.
[13] M.A.F. Pimentel, D.A. Clifton, L. Clifton, L. Tarassenko, A review of novelty detection, Signal Process. 99 (2014) 215–249, doi:10.1016/j.sigpro.2013.12.026.
[14] M.M. Gaber, A. Zaslavsky, S. Krishnaswamy, Mining data streams, ACM SIGMOD Rec. 34 (2005) 18.
[15] M. Sayed-Mouchaweh, E. Lughofer, Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, 2012.
[16] M. Pratama, J. Lu, E. Lughofer, G. Zhang, M.J. Er, Incremental learning of concept drift using evolving type-2 recurrent fuzzy neural network, IEEE Trans. Fuzzy Syst. (2016) 1, doi:10.1109/TFUZZ.2016.2599855.
[17] M. Pratama, S.G. Anavatti, M.J. Er, E.D. Lughofer, pClass: an effective classifier for streaming examples, IEEE Trans. Fuzzy Syst. 23 (2015) 369–386, doi:10.1109/TFUZZ.2014.2312983.
[18] P.Y. Chen, S. Yang, J.A. McCann, Distributed real-time anomaly detection in networked industrial sensing systems, IEEE Trans. Ind. Electron. 62 (2015) 3832–3842, doi:10.1109/TIE.2014.2350451.
[19] E.J. Spinosa, A.P.D.L.F. De Carvalho, J. Gama, OLINDDA: a cluster-based approach for detecting novelty and concept drift in data streams, in: Proceedings of the 2007 ACM Symposium on Applied Computing, 2007, pp. 448–452, doi:10.1145/1244002.1244107.
[20] E.R. Faria, J. Gama, A.C. Carvalho, Novelty detection algorithm for data streams multi-class problems, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing, 2013, pp. 795–800, doi:10.1145/2480362.2480515.
[21] S. Lee, G. Kim, S. Kim, Self-adaptive and dynamic clustering for online anomaly detection, Expert Syst. Appl. 38 (2011) 14891–14898, doi:10.1016/j.eswa.2011.05.058.
[22] T. Ahmed, M. Coates, A. Lakhina, Multivariate online anomaly detection using kernel recursive least squares, in: Proceedings of the 26th IEEE International Conference on Computer Communications, 2007, pp. 625–633, doi:10.1109/INFCOM.2007.79.
[23] M. Schneider, W. Ertel, F. Ramos, Expected similarity estimation for large-scale batch and streaming anomaly detection, Mach. Learn. 105 (2016) 305–333, doi:10.1007/s10994-016-5567-7.
[24] A. Stanway, Etsy Skyline, Online Code Repos. (2013). https://github.com/etsy/skyline.
[25] A. Bernieri, G. Betta, C. Liguori, On-line fault detection and diagnosis obtained by implementing neural algorithms on a digital signal processor, IEEE Trans. Instrum. Meas. 45 (1996) 894–899, doi:10.1109/19.536707.
[26] M. Basseville, I.V. Nikiforov, Detection of Abrupt Changes, 1993.
[27] M. Szmit, A. Szmit, Usage of modified Holt-Winters method in the anomaly detection of network traffic: case studies, J. Comput. Networks Commun. (2012), doi:10.1155/2012/192913.
[28] P. Angelov, Anomaly detection based on eccentricity analysis, in: Proceedings of the 2014 IEEE Symposium on Evolving and Autonomous Learning Systems, 2014, doi:10.1109/EALS.2014.7009497.
[29] B.S.J. Costa, C.G. Bezerra, L.A. Guedes, P.P. Angelov, Online fault detection based on typicality and eccentricity data analytics, in: Proceedings of the International Joint Conference on Neural Networks, 2015, doi:10.1109/IJCNN.2015.7280712.
[30] A.M. Bianco, M. García Ben, E.J. Martínez, V.J. Yohai, Outlier detection in regression models with ARIMA errors using robust estimates, J. Forecast. 20 (2001) 565–579.
[31] R.J. Hyndman, Y. Khandakar, Automatic time series forecasting: the forecast package for R, J. Stat. Softw. 27 (2008) 1–22.
[32] C. Wang, K. Viswanathan, L. Choudur, V. Talwar, W. Satterfield, K. Schwan, Statistical techniques for online anomaly detection in data centers, in: Proceedings of the 12th IFIP/IEEE International Symposium on Integrated Network Management, 2011, pp. 385–392, doi:10.1109/INM.2011.5990537.
[33] D.L. Simon, A.W. Rinehart, A model-based anomaly detection approach for analyzing streaming aircraft engine measurement data, in: Proceedings of Turbo Expo 2014: Turbine Technical Conference and Exposition, ASME, 2014, pp. 665–672, doi:10.1115/GT2014-27172.
[34] E.K. Lee, H. Viswanathan, D. Pompili, Model-based thermal anomaly detection in cloud datacenters, in: Proceedings of the IEEE International Conference on Distributed Computing in Sensor Systems, 2013, pp. 191–198, doi:10.1109/DCOSS.2013.8.
[35] T. Klerx, M. Anderka, H.K. Buning, S. Priesterjahn, Model-based anomaly detection for discrete event systems, in: Proceedings of the 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, IEEE, 2014, pp. 665–672, doi:10.1109/ICTAI.2014.105.
[36] F. Knorn, D.J. Leith, Adaptive Kalman filtering for anomaly detection in software appliances, in: Proceedings of the IEEE INFOCOM, 2008, doi:10.1109/INFOCOM.2008.4544581.
[37] A. Soule, K. Salamatian, N. Taft, Combining filtering and statistical methods for anomaly detection, in: Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, 2005, p. 1, doi:10.1145/1330107.1330147.
[38] H. Lee, S.J. Roberts, On-line novelty detection using the Kalman filter and extreme value theory, in: Proceedings of the 19th International Conference on Pattern Recognition, 2008, pp. 1–4, doi:10.1109/ICPR.2008.4761918.
[39] A. Morgan, Lytics Anomalyzer Blog, (2015). https://www.getlytics.com/blog/post/check_out_anomalyzer.
[40] Y.J. Lee, Y.R. Yeh, Y.C.F. Wang, Anomaly detection via online oversampling principal component analysis, IEEE Trans. Knowl. Data Eng. 25 (2013) 1460–1470, doi:10.1109/TKDE.2012.99.
[41] A. Lakhina, M. Crovella, C. Diot, Diagnosing network-wide traffic anomalies, ACM SIGCOMM Comput. Commun. Rev. 34 (2004) 219, doi:10.1145/1030194.1015492.
[42] N. Görnitz, M. Kloft, K. Rieck, U. Brefeld, Toward supervised anomaly detection, J. Artif. Intell. Res. 46 (2013) 235–262, doi:10.1613/jair.3623.
[43] U. Rebbapragada, P. Protopapas, C.E. Brodley, C. Alcock, Finding anomalous periodic time series: an application to catalogs of periodic variable stars, Mach. Learn. 74 (2009) 281–313, doi:10.1007/s10994-008-5093-3.
[44] T. Pevný, Loda: lightweight on-line detector of anomalies, Mach. Learn. 102 (2016) 275–304, doi:10.1007/s10994-015-5521-0.
[45] A. Kejariwal, Twitter Engineering: Introducing practical and robust anomaly detection in a time series [Online blog], (2015). http://bit.ly/1xBbX0Z.
[46] J. Hawkins, S. Ahmad, Why neurons have thousands of synapses, a theory of sequence memory in neocortex, Front. Neural Circuits 10 (2016) 1–13, doi:10.3389/fncir.2016.00023.
[47] D.E. Padilla, R. Brinkworth, M.D. McDonnell, Performance of a hierarchical temporal memory network in noisy sequence learning, in: Proceedings of the International Conference on Computational Intelligence and Cybernetics, IEEE, 2013, pp. 45–51, doi:10.1109/CyberneticsCom.2013.6865779.
[48] D. Rozado, F.B. Rodriguez, P. Varona, Extending the bioinspired hierarchical temporal memory paradigm for sign language recognition, Neurocomputing 79 (2012) 75–86, doi:10.1016/j.neucom.2011.10.005.
[49] Y. Cui, S. Ahmad, J. Hawkins, Continuous online sequence learning with an unsupervised neural network model, Neural Comput. 28 (2016) 2474–2504, doi:10.1162/NECO_a_00893.
[50] S. Purdy, Encoding data for HTM systems, 2016, arXiv:1602.05925 [cs.NE].
[51] J. Mnatzaganian, E. Fokoué, D. Kudithipudi, A mathematical formalization of hierarchical temporal memory's spatial pooler, Front. Robot. AI 3 (2017) 81, doi:10.3389/frobt.2016.00081.
[52] Y. Cui, S. Ahmad, J. Hawkins, The HTM spatial pooler: a neocortical algorithm for online sparse distributed coding, bioRxiv, 2016, doi:10.1101/085035.
[53] S. Ahmad, J. Hawkins, Properties of sparse distributed representations and their application to hierarchical temporal memory, 2015, arXiv:1503.07469 [q-bio.NC].
[54] B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM 13 (1970) 422–426, doi:10.1145/362686.362692.
[55] G.K. Karagiannidis, A.S. Lioumpas, An improved approximation for the Gaussian Q-function, IEEE Commun. Lett. 11 (2007) 644–646.
[56] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. (2009) 1–72.
[57] R.P. Adams, D.J.C. MacKay, Bayesian online changepoint detection, 2007, arXiv:0710.3742 [stat.ML].
[58] M. Schneider, W. Ertel, G. Palm, Constant time expected similarity estimation using stochastic optimization, 2015, arXiv:1511.05371 [cs.LG].
[59] M. Bartyś, R. Patton, M. Syfert, S. de las Heras, J. Quevedo, Introduction to the DAMADICS actuator FDI benchmark study, Control Eng. Pract. 14 (2006) 577–596, doi:10.1016/j.conengprac.2005.06.015.

Subutai Ahmad received his A.B. in Computer Science from Cornell University in 1986, and his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 1991. He is the VP of Research at Numenta. His research interests are in computational neuroscience, machine learning, computer vision, and real-time systems.

Alexander Lavin received his B.S. in Mechanical Engineering from Cornell University in 2012, and his M.S. in Mechanical Engineering from Carnegie Mellon University in 2014, specializing in spacecraft engineering. He is currently a Senior Research Engineer at Vicarious, building general artificial intelligence for robotics. His main research interests are computational neuroscience, computer vision systems, and robotics.

Scott Purdy received his B.S. and M.Eng. degrees in Computer Science in 2010 and 2011, respectively, from the College of Engineering at Cornell University. He is Director of Engineering at Numenta. Scott's research interests are computational neuroscience, machine learning, and robotics.

Zuha Agha received her B.S. in Computer Science from Lahore University of Management Sciences, Pakistan, in 2014, and her M.S. in Computer Science from the University of Pittsburgh in 2017. She was formerly an Algorithms Intern at Numenta, and will join Apple in the summer of 2017. Her research interests include data science, machine learning, and computer vision.
