Manifold Learning Techniques For Unsupervised Anomaly Detection
Article history: Received 25 July 2016; Revised 31 July 2017; Accepted 1 August 2017; Available online 25 August 2017

Keywords: Manifolds; Manifold learning; Image processing; Anomaly detection; Target detection

Abstract: Appropriately identifying outlier data is a critical requirement in the decision-making process of many expert and intelligent systems deployed in a variety of fields including finance, medicine, and defense. Classical outlier detection schemes typically rely on the assumption that normal/background data of interest are distributed according to an assumed statistical model and search for data that deviate from that assumption. However, it is frequently the case that performance is reduced because the underlying distribution does not follow the assumed model. Manifold learning techniques offer improved performance by learning better models of the background but can be too computationally expensive due to the need to calculate a distance measure between all data points. Here, we study a general framework that allows manifold learning techniques to be used for unsupervised anomaly detection by reducing computational expense via a uniform random sampling of a small fraction of the data. A background manifold is learned from the sample and then an out-of-sample extension is used to project unsampled data into the learned manifold space and construct an anomaly detection statistic based on the prediction error of the learned manifold. The method works well for unsupervised anomaly detection because, by definition, the ratio of anomalous to non-anomalous data points is small and the sampling will be dominated by background points. However, a variety of parameters that affect detection performance are introduced, so we use here a low-dimensional toy problem to investigate their effect on the performance of four learning algorithms (kernel PCA, two versions of diffusion map, and the Parzen density estimator). We then apply the methods to the detection of watercraft in an ensemble of 22 infrared maritime scenes where we find kernel PCA to be superior and show that it outperforms a commonly employed baseline algorithm. The framework is not limited to the tested image processing example and can be used for any unsupervised anomaly detection task.

Published by Elsevier Ltd.
1. Introduction

We consider the problem of detecting points that are rare within a data set dominated by the presence of ordinary background points. The goal is to assign unknown data to either a background or anomaly class, and the numerous algorithms that have been devised to handle this problem can be categorized as supervised, semi-supervised, or unsupervised depending on how much information is available to the training algorithm.

Supervised approaches require labeled training data for both classes. Models that maximize the difference between classes are then constructed; some common algorithms include neural networks (Markou & Singh, 2003b), Gaussian mixture models (Tarassenko, Nairac, Townsend, & Cowley, 1999), principal components analysis (for linearly separable data), and kernel support vector machines (SVMs) (Schölkopf & Smola, 2002). Semi-supervised approaches, some examples of which can be found in Fujimaki, Yairi, and Machida (2005) and Bouchachia (2007), only require labels for the background class. Unsupervised algorithms are the most generically applicable and are often based on measures of similarity between data vectors. Examples include thresholding distances between neighboring data vectors (Knorr, Ng, & Tucakov, 2000), the local outlier factor (Breunig, Kriegel, Ng, & Sander, 2000), one-class SVMs (Schölkopf, Williamson, Smola, Shawe-Taylor, & Platt, 2000), and fuzzy c-means clustering (Bezdek, Ehrlich, & Full, 1984). Some reviews of the anomaly detection problem are provided in Markou and Singh (2003a) and Chandola, Banerjee, and Kumar (2009).

Supervised techniques are preferred over the unsupervised case whenever possible as we would expect the presence of training data to improve classification. We are not, however, always afforded this luxury. This is the case in many image-based detection scenarios where neither the background pixels nor the anomalous (often man-made) pixels are expected to be consistent
between scenes. Variations in the type of background pixels as well as changes in lighting and scene viewing angle can invalidate a priori assumptions about scene composition. Thus, we are motivated to develop unsupervised detection techniques.

Kernel and spectral methods comprise a family of algorithms commonly used for clustering and classification. Of these, spectral methods in particular are also used for dimensionality reduction. Algorithms such as Laplacian eigenmaps (Belkin & Niyogi, 2003), locally-linear embedding (Roweis & Saul, 2000), Isomap (Tenenbaum, de Silva, & Langford, 2000), and diffusion map (Coifman & Lafon, 2006) (which we discuss below) assume the observed high-dimensional data were actually generated by a lower-dimensional process and that the associations between the two can be learned. The goal of a manifold learning algorithm is therefore to map the original data onto a new coordinate system in which the classification problem is made simpler. These methods are similar in that they organize the data into clusters based on the eigenvalues and eigenvectors of a distance (adjacency) matrix calculated from the data. The data are viewed as nodes in a graph and the edges connecting the nodes are weighted by the similarity between the data as determined by a distance-measuring kernel.

In recent years such methods have been used for the analysis of hyperspectral images. For example, they have been used for classification (Bachmann, Ainsworth, & Fusina, 2005; Chen, Crawford, & Ghosh, 2005), target detection (Ziemann & Messinger, 2015; Ziemann, Theiler, & Messinger, 2015), and change detection (Albano, Messinger, Schlamm, & Basener, 2011). Within the context of anomaly detection, Kwon and Nasrabadi (2005) introduced a kernelized version of the standard RX algorithm (Reed & Yu, 1990) under the assumption that background and target would be described by Gaussian distributions in the high-dimensional feature space describing kernelized spectra. The TAD approach was introduced by Basener et al. and found to perform well against a variety of benchmark algorithms (Basener, Ientilucci, & Messinger, 2007). Messinger and Albano also considered the anomaly detection problem by measuring the connectivity of individual pixels within a locally-constructed graph (Messinger & Albano, 2011).

The motivation for using such kernel-based or manifold-learning algorithms is that a background model that is more appropriate to the specifics of a given scene can be learned using data-driven techniques rather than assuming a statistical model a priori as is done with, for example, RX. Estimating the parameters governing an assumed statistical distribution and constructing decision surfaces as a function of the learned parameters would be preferred, but real-world data frequently fail to follow assumed distribution models and it has been shown (see, e.g., Theiler, Foy, & Fraser, 2007) that sensitivity to outliers may be reduced if the assumptions underlying the model are not met by the data.

Adoption of such data-driven techniques is hampered, however, by the expense of calculating an adjacency matrix. In Olson, Nichols, Michalowicz, and Bucholtz (2010) we proposed a statistically uniform "skeleton" subsampling of a hyperspectral scene to reduce the computational cost of building an adjacency matrix and performed a preliminary study of out-of-sample extension (Bengio et al., 2004; Lafon, Keller, & Coifman, 2006) as a means of developing a detection statistic for the remaining unsampled points. We performed an additional study of the subsampling method in Olson and Doster (2016). Bachmann et al. (2005) have previously considered the use of subsampled pixel sets as a means of building a global manifold backbone against which local manifolds built from sub-segments of a scene could be aligned, but they found the method to be too computationally expensive for classification and did not consider the anomaly detection problem. Graph-based methods have been used previously in a semi-supervised manner for classification tasks (Blum & Chawla, 2001; Szummer & Jaakkola, 2002) and, more recently, Belkin and Niyogi (2002) and Belkin, Niyogi, and Sindhwani (2006) demonstrated that semi-supervised techniques can be used to learn a data manifold for classification. Although similar to our method, we are not aware of any work besides our own that extends these techniques to unsupervised anomaly detection in imagery.

Building an adjacency matrix from a subset of the data is conceptually simple but enables application of the wide variety of data-driven learning techniques to the anomaly detection problem and offers the prospect of improved detection performance over classical techniques. The tradeoff is the introduction of a set of unique considerations relative to previous approaches. The following are a few of the most fundamental considerations: (1) What data-driven learning algorithm should be applied to the sampled skeleton subset? (2) What fraction of the data set must be sampled in order to guarantee with some probability that all background classes will be sufficiently sampled without over-sampling the anomalous class? (3) What should the parameter settings be for a given learning algorithm and how are they affected by the size of the subsample? (4) How stable is detection performance as a function of parameter settings and subsample size? (5) How best to extend the learned model space to the out-of-sample points?

In this work we primarily focus on considerations (1), (3), and (4). In particular we address consideration (1) by using kernel PCA (Schölkopf, Smola, & Müller, 1998), two versions of diffusion map (Coifman & Lafon, 2006), and the Parzen density estimator (Parzen, 1962) to learn background models for panchromatic (not hyperspectral) images that have been tiled to form super-pixels. With all three techniques the basic idea is the same: learn a model based on previously acquired background data, project in new pixel data, and compute a measure of error between data and model as our detection statistic. The performance of the algorithms on a toy problem and real-world data set are quantified using receiver operating characteristic (ROC) curves (Kay, 1998) over a wide range of algorithm parameter settings (consideration 3) and over multiple skeleton samples (consideration 4). In all cases, good detection performance is obtained on the toy problem; however, kernel PCA outperforms the other learning algorithms on the real-world target detection task. We provide a more complete description of each technique in Section 2, describe the experiments and compare to an established algorithm in Section 3, and discuss results in Section 4 before concluding in Section 5.

2. Methods and motivation

The idea behind any anomaly detection approach is to model the background distribution using either assumed physical principles or by learning its description from the data. It is the latter route that we consider here. We begin with the set of N pixel intensities x_i ∈ R^M, i = 1⋯N, that comprise an image. Most of the pixels are assumed to contain background information while only a very few (< 1%) are assumed to contain a "target" point of interest.

In general, we seek to find a function, f(·), that maps the x_i into a new coordinate system where we can draw decision surfaces that more accurately separate anomaly from background. We don't, however, know f(·) a priori and must form an estimate, f̂(·), from our data. In this work we compare a number of methods, both linear and nonlinear, for learning f̂(·) and compare their resulting detection performance (although we drop the f̂(·) from here on out and work with f(·) for notational parsimony).

Given f(·), each datum can be represented in the new coordinate system by performing an analysis step θ_i = f(x_i) where θ_i ∈ R^M. Conversely, we may model (synthesize) each datum as x̂_i = f⁻¹(θ_i) where we allow f⁻¹: R^m → R^M with m ≤ M. Of course, a unique inverse and x̂_i = x_i can only be guaranteed when m = M.
In anomaly detection one would like to use only background points to construct f(·) and then test a new point x ∈ R^M to see whether or not it is consistent with this model. Here, the error between the data and the background model representation of that data, ‖x − f⁻¹(θ)‖₂², is then taken as a reasonable test statistic for deciding whether the point is background or anomaly.

In our approach the background model is learned from a uniformly sampled subset of the original data. This subset is henceforth referred to as the data skeleton and is denoted S = {x_si : s_i ∼ U(1, N), i = 1⋯N_S} where N_S < N. The assumption here is that for the purpose of anomaly detection, the randomly chosen N_S points are sufficient to build f(·) and are unlikely to contain anomalies. It is further assumed that these points lie on a (possibly) nonlinear manifold such that standard PCA provides a poor decision boundary.

A standard PCA analysis seeks to find linear correlations in the data which are then represented by the set of principal axes that result from finding the eigenvectors of the data covariance matrix. There is, however, no reason to assume that real-world data sets are defined solely by linear correlations. Furthermore, there is evidence (Lafon et al., 2006) to suggest that some data sets may in fact reside on manifolds with intrinsic dimensionality m < M. If so, a proper choice of f(·) can yield θ_i that provide a more parsimonious description of the data. We discuss how this might lead to better decision boundaries in the following section.

The process of selecting a data skeleton and relying on the rarity of anomalous pixels to make the assumption that the skeleton manifold is a good background model is a simple concept but does not seem to have been discussed in the literature. This procedure is reminiscent of semi-supervised learning, but actually falls under the definition of unsupervised learning (Chandola et al., 2009). In the semi-supervised case a small set of labeled samples is used to train a classifier for a much larger data set, as opposed to the unsupervised case where no labeled training data are available.

In what follows we briefly review kernel PCA, the Parzen density, and diffusion map. Each of these methods offers a different approach to learning the background model f(·). Construction of the error measure (test statistic) for each method is also described as it varies by approach. In each case our ultimate goal is to learn a model of our background imagery against which we can compare test pixels and decide background or anomaly. We begin with kernel PCA because the Parzen density arises naturally from calculation of the kernel PCA test statistic. For kernel PCA our discussion is informed by the descriptions given in Hoffmann (2007) and Schölkopf et al. (1998), and for diffusion map the works of Coifman and Lafon (2006) and Lafon et al. (2006).

2.1. Kernel PCA test statistic

Before describing the kernel PCA approach it is first useful to consider in more detail conventional PCA and its use in anomaly detection. In this standard approach, one takes the eigenvectors of the data covariance matrix C = E[(x_i − x₀)(x_i − x₀)^T] ≈ (1/N_s) Σ_i (x_i − x₀)(x_i − x₀)^T and assumes the linear model x̂_i = U_m θ_i, where x₀ is the data mean and we use U_m to denote the collection of m ≤ M eigenvectors, taken from U in decreasing order of the associated eigenvalues. Thus, the mapping f⁻¹(·) alluded to in the previous section is simply a linear matrix multiplication. It can be shown that this model minimizes the average error D̄ = (1/N_s) Σ_i ‖x_i − U_m U_m^T x_i‖₂² between the points and their projections onto the subspace spanned by the U_m. The assumption is that by retaining only the m most influential coordinates, a simpler (lower-dimensional) classification problem results (with the error in making this assumption quantified by D̄).

The extent to which this simple linear model is useful depends on the structure of the skeleton. For example, if the skeleton samples (background) reside on an (M − 1)-dimensional hyperplane in R^M then each point in the skeleton can indeed be uniquely mapped to the lower-dimensional coordinate θ_m ∈ R^m with m = M − 1. In this case, a test point x drawn from the background will have the model representation x̂ = U_m θ and a correspondingly small projection error D = ‖x − U_m U_m^T x‖₂². If we assume the anomalous points are spread sparsely but uniformly throughout R^M then test points that produce a large error are inconsistent with the background hyperplane and are flagged as an anomaly. It should be pointed out that one can equivalently specify this error in terms of the manifold coordinates

D = ‖U^T x − θ̄_m‖₂² = x·x − (U_m^T x)·(U_m^T x)    (1)

where θ̄_m is equal to the full coefficient vector, θ = U^T x ∈ R^M, but with dimensions m + 1, ..., M set equal to zero. This alternative formulation simply states that the projection error is equal to the norm of the coefficients that were discarded in the projection. In this case, the principal axes will be aligned with the manifold coordinates.

Now, if the skeleton is geometrically nonlinear, the linear mapping is no longer necessarily an accurate description of the data (or even a unique mapping) and a new background point can easily be mapped to a distant location (consider a 2-dimensional plane in R³ that has been folded into an "S"). In this case, PCA will produce a large error measure D, even if the test point is chosen from the background, hence there will be an unacceptable number of false positives in the anomaly detector. It will also fail to detect anomalous points that are not on the "S" but reside between the folds of the "S". Conversely, a manifold learning technique will generate coordinates that are aligned with the two directions on the sheet and the direction perpendicular to the manifold. Thresholding the projection error in the orthogonal dimension will produce a decision boundary that blankets the "S" on each side, yielding fewer false alarms and more detections.

The possibility of data lying on a geometrically nonlinear surface has motivated a number of different modeling approaches, among them kernel PCA. Kernel PCA was introduced by Schölkopf et al. (1998) and adapted to the anomaly detection problem by Hoffmann (2007). The idea is to map data that are not linearly separable in the original (ambient) space into a high-dimensional feature space in which linear decision surfaces can be constructed. Kernel PCA can be thought of as a nonlinear version of PCA, based on calculating the principal components of the data after the nonlinear mapping has been applied. If a nonlinear PCA model of the training data has been built, then a test point can be declared anomalous if the reconstruction error of the PCA model for that point is large. Hoffmann showed that thresholding kernel PCA reconstruction error yields an anomaly detection statistic that outperforms linear PCA, one-class SVM, and the Parzen density estimator (Parzen, 1962) on a number of toy problems and real-world data sets (Hoffmann, 2007).

In more detail, the idea behind kernel PCA is to project the data into a new, higher- (in theory, possibly infinite-) dimensional feature space, F, via

x_i → Φ(x_i).    (2)

Because this transformation must be constructed using finite data, the transformed space can be of at most dimension N_S, as will be seen in the subsequent development. If we form a linear decision boundary in F then we effectively define a nonlinear decision surface in the ambient (M-dimensional) space that better separates the background and anomalous data distributions.

The difficulty with this approach is that the mapping (2) could be expensive or impossible to compute.
However, we never actually need to know what the data look like in the higher-dimensional space, i.e., we never have to evaluate (2). Recall that what is ultimately required for anomaly detection is a measure of error between data and model. It turns out that any error measure that involves inner products among data vectors in the high-dimensional space can be written as a function of an appropriately chosen kernel function k(·, ·) applied to our finite, M-dimensional data.

To see how this is accomplished, assume that we can transform our training set S into the feature space using Eq. (2); we again require the eigenvectors Ψ = (ψ_1, ψ_2, ⋯, ψ_{N_S}) and corresponding eigenvalues λ_1 > λ_2 > ⋯ > λ_{N_S} of the high-dimensional data covariance matrix

C̃ = (1/N) Σ_{i=1}^{N} Φ̃(x_i) Φ̃(x_i)^T.    (3)

In practice one works with the kernel (Gram) matrix of the centered feature-space data,

K̃ ≡ K̃_ij = Φ̃(x_i) · Φ̃(x_j),    (8)

where

K̃_ij = K_ij − (1/N_S) Σ_{s=1}^{N_S} K_is − (1/N_S) Σ_{r=1}^{N_S} K_rj + (1/N_S²) Σ_{r,s=1}^{N_S} K_rs.    (9)

The eigenvectors of K̃, denoted α^l ∈ R^{N_S}, l = 1, ..., N_S, can then be related to the eigenvectors via (Schölkopf et al., 1998)

ψ_l = Σ_{i=1}^{N_S} α_i^l Φ̃(x_i),   l = 1⋯N_S    (10)

where the eigenvectors of K̃ are normalized such that ‖α^l‖ = 1/λ_l, with λ_l the eigenvalue corresponding to α^l. This normalization ensures that ‖ψ_l‖ = 1 (Hoffmann, 2007).

Recall from (7) that we require the inner products (Ψ_m^T Φ̃(x)) · (Ψ_m^T Φ̃(x)) in forming the error measure. Using (10) we can write each element of Ψ^T Φ̃(x) as the inner product

Φ̃(x) · ψ_l = g_l(x)
  = Σ_{i=1}^{N_S} α_i^l [ k(x, x_i) − (1/N_S) Σ_{s=1}^{N_S} k(x_i, x_s) − (1/N_S) Σ_{s=1}^{N_S} k(x, x_s) + (1/N_S²) Σ_{r,s=1}^{N_S} k(x_r, x_s) ],   l = 1⋯N_S    (11)

thus we never have to actually compute the high-dimensional covariance matrix or its eigenvectors ψ_l in order to form Ψ^T Φ̃(x). The needed inner product term in (7) is therefore also a function of the kernel, such that the total error becomes

D_S(x) = k(x, x) − (2/N_S) Σ_{i=1}^{N_S} k(x, x_i) + (1/N_S²) Σ_{i,j=1}^{N_S} k(x_i, x_j).    (15)

For the Gaussian kernel, the first and last terms in (15) are constant and can be ignored (the last term is only a function of the model data and not the test point). Thus, the potential is governed by the middle term, which is proportional to the Parzen density estimator (Hoffmann, 2007).

2.3. Diffusion map

Diffusion map is one of a host of manifold learning techniques that have become prevalent in the literature in recent years (Coifman & Lafon, 2006).
Rather than attempting to map the data into a higher-dimensional space in which a linear model is appropriate, manifold learning approaches attempt to model the nonlinear data space directly. This approach to background modeling is significantly different from kernel PCA, thus the error between model and test data must be calculated differently. We present one such approach here and demonstrate how the diffusion map can be used for the anomaly detection problem. Specifically, we present two different variants of the diffusion map algorithm and compare their performance.

The diffusion map construction views the data as a set of points in a graph where again we use a kernel function k(x_i, x_j) to measure the similarity between data points. Here, we define K_ij = k_G(x_i, x_j) as the matrix associated with the Gaussian kernel given by (13). Given a bandwidth σ, the weights determine the local connectivity of the data. If one further normalizes this connectivity matrix so that

p_ij ≡ p(x_i, x_j) = K_ij / Σ_j K_ij,    (16)

the result can be interpreted as the probability that a random walker will jump from x_i to x_j in one time step (Coifman & Lafon, 2006). Moreover, one can find the probability of a random walker moving from x_i to x_j in t time steps by simply raising (16) to the tth power, i.e., forming p_ij^t. The matrix (16) carries information about the local geometry of the data set, with larger transition probabilities being associated with nearby pairs of points.

As with kernel PCA, the eigenstructure of this connectivity matrix is central to a coordinate transformation. Specifically, the eigenvalues, λ_i, and eigenvectors, ψ_i, of the tth transition matrix provide the diffusion map coordinates

f_t(x_i) = (λ_1^t ψ_i1, λ_2^t ψ_i2, ..., λ_m^t ψ_im) = (θ_i1, θ_i2, ⋯, θ_im)    (17)

where the dimension of θ_i = f_t(x_i) ∈ R^m is equal to the number of retained eigenvectors. As with kernel PCA, the largest m eigenvectors are assumed to accurately model the data while the remaining coordinates model the unimportant features (e.g. noise) and serve only to complicate the classification problem. Increasing t allows the diffusion map to adjust between local and global information on the data manifold. Larger values of t reduce the required m, as large scale structures in the data require fewer retained eigenvectors for an accurate reconstruction.

Coifman and Lafon (2006) and Lafon et al. (2006) discuss the difference between the geometry and sampling density of a manifold M on which the observed data reside. In some cases, the sampling density of points on the manifold may not provide any information about the underlying process and will distort proper understanding of manifold geometry. In other cases, the density of points on the manifold provides useful information about the underlying data-producing process. These two extremes are addressed by introducing the transition kernel

k_D(x_i, x_j) = K^(α)_ij / Σ_j K^(α)_ij    (18)

where

K^(α)_ij = K_ij / (q(x_i)^α · q(x_j)^α)    (19)

and

q(x_i) = Σ_k K_ik    (20)

where α ∈ R controls how sensitive the diffusion embedding is to the distribution of points on M. For α = 0 the embedding combines information from both the density and the geometry and for a Gaussian kernel is equivalent to a Laplacian eigenmap (Belkin & Niyogi, 2003), the eigenvectors of which were shown by Coifman and Lafon (2006) to converge to the Schrödinger operator, Δ + E, as N → ∞ and σ → 0. E is a density-dependent scalar potential and Δ is the Laplace-Beltrami operator on M (Lafon et al., 2006). When α = 1 the diffusion map approximates only Δ and the embedding is independent of density. Here, we examine both extremes.

It was demonstrated in Coifman and Lafon (2006) that a useful measure of distance between any two points on the graph is the diffusion distance

D_t²(x_i, x_j) ∝ Σ_k (p_t(x_i, x_k) − p_t(x_j, x_k))²    (21)

which represents the distance between conditional probabilities that quantify the influence of each point on the rest of the graph. Computation of (21) can be simplified by considering a slightly different formulation. It turns out that the Euclidean distance between points in the diffusion embedding space is given by

D_t²(x_i, x_j) = Σ_{l≥1} λ_l^{2t} (ψ_il − ψ_jl)²    (22)

and is equivalent to calculating the diffusion distance between points in the graph (Coifman & Lafon, 2006). The diffusion map corresponding to t quantifies the influence of a point in the graph on all other points in the graph residing within a given bandwidth (distance) of that point. A small diffusion distance indicates many short paths between two points with rapid diffusion between them. Large distances indicate points that reside in separate clusters or that are separated by bottlenecks on the manifold. In short, the diffusion coordinates offer a natural organization of the data based on clusters whose scale is determined by t. What remains is to embed new points that were not originally included in the eigendecomposition that produced f_t.

2.4. Out-of-sample extension

A datum that was not originally included in the eigendecomposition can be projected into the diffusion space by weighting the diffusion coordinates by the kernel distances between the new point and the points in the original ambient space. More precisely, a diffusion coordinate of the test point x ∉ S is given by

ψ̄_k(x) = (1/λ_k^t) Σ_{j=1}^{N_S} k_D(x, x_j) ψ_jk    (23)

with the complete embedding given by

f̄_t(x) = (ψ̄_1(x), ψ̄_2(x), ..., ψ̄_m(x))    (24)

where the overbar has been included to explicitly indicate that the new point has been extended from the manifold M_S learned from the skeleton set. We use the notation k_D(·, ·) to indicate that for the extension it is possible to use a different kernel bandwidth in the Gaussian kernel k_G(·, ·) than the original bandwidth σ that was used to construct M_S. In fact, we are not even restricted to the same kernel for the extension (Lafon et al., 2006).

This process is based on the Nyström extension and was first used to reduce the computation cost of kernel-based methods for solving integral equations (Baker, 1977; Press, Teukolsky, Vetterling, & Flannery, 1988). More recently the method has been adopted by the manifold learning community as a means of extending points that were not originally included in the manifold learning process (Bengio et al., 2004; Lafon et al., 2006). We use this procedure as a means of determining how similar a test point is to the learned manifold. In essence, the eigenvectors that have been learned from S are a basis for the manifold M_S and points that ...
... of that same set. All sampled skeletons are the same for each of the four methods. Although there is some deviation between the different samplings, the differences are not extreme. The most extreme deviation would occur if all 100 anomalies were included in the skeleton, but the probability of such an occurrence is extremely low.

All four methods perform reasonably well on this problem. The Laplace-Beltrami version of diffusion map (α = 1) produced the smallest variance among the resulting ROC curves; however, kernel PCA produced, on average, the highest Pd for low Pfa. It should be noted that the low variance for Laplace-Beltrami is consistent with the idea that Δ is sensitive to manifold geometry and is independent of sampling density. Both the Parzen density ...
Fig. 4. ROC curves for the (a) Parzen density estimator, (b) kernel PCA, (c) graph Laplacian (diffusion map with α = 0), and (d) Laplace-Beltrami (diffusion map with α = 1) anomaly detectors. Data set composed of 10,000 square samples and 100 anomalies. Each curve in each figure represents a different 5% skeleton sampling of the same data set.
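For illustration, the sampling-and-scoring loop behind curves such as those in Fig. 4 can be sketched in a few lines of Python. This is only a sketch and is not the code used to generate the results; score_background_error is a placeholder for any of the four detection statistics, and the parameter names and default values (5% skeleton fraction, 10 skeletons) are taken from the experiment description above.

    # Sketch of the evaluation protocol: draw a uniform random skeleton,
    # score every point with a detector fit to the skeleton, and sweep the
    # threshold to trace out a ROC curve (Pfa, Pd).
    import numpy as np

    def roc_curve(scores, labels, n_thresh=200):
        """Pd and Pfa over a sweep of thresholds (labels: 1 = anomaly, 0 = background)."""
        thresholds = np.quantile(scores, np.linspace(0.0, 1.0, n_thresh))
        pd, pfa = [], []
        for t in thresholds:
            detected = scores > t
            pd.append(detected[labels == 1].mean())   # detection probability
            pfa.append(detected[labels == 0].mean())  # false-alarm probability
        return np.array(pfa), np.array(pd)

    def skeleton_rocs(data, labels, score_background_error, frac=0.05, n_skel=10, seed=0):
        """Repeat the sample-learn-score loop for several random skeletons."""
        rng = np.random.default_rng(seed)
        curves = []
        for _ in range(n_skel):
            idx = rng.choice(len(data), size=int(frac * len(data)), replace=False)
            skeleton = data[idx]                      # assumed to be anomaly-poor
            scores = score_background_error(skeleton, data)
            curves.append(roc_curve(scores, labels))
        return curves

Each call draws an independent skeleton, so the spread of the resulting curves reflects the sampling variability examined above.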
[Figure: Pd versus retained dimension for the graph Laplacian, Laplace-Beltrami, and kernel PCA detectors (bandwidth = max distance), and Pd versus bandwidth multiple for the graph Laplacian, Laplace-Beltrami, and kernel PCA detectors (d = 10) and the spherical potential detector (d = 48).]
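As a concrete illustration of the kernel statistics whose bandwidth and dimension are swept above, the following Python sketch computes the Gaussian-kernel spherical potential of Eq. (15) and a kernel PCA reconstruction error in the spirit of Hoffmann (2007), in which the squared projections g_l(x) of Eq. (11) onto the leading q feature-space eigenvectors (Eqs. (8)-(10)) are subtracted from the potential. The parameter names (sigma, q) and implementation details are ours; this is a sketch, not the implementation used for the reported results.

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_kernel(A, B, sigma):
        return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma**2))

    def kpca_scores(skeleton, test, sigma, q):
        """Spherical potential (Eq. 15) and a Hoffmann-style kernel PCA reconstruction error."""
        N = len(skeleton)
        K = gaussian_kernel(skeleton, skeleton, sigma)
        k_mean_rows = K.mean(axis=0)        # (1/N) sum_s k(x_i, x_s)
        k_mean_all = K.mean()               # (1/N^2) sum_{r,s} k(x_r, x_s)

        # Centered Gram matrix (Eq. 9) and its leading eigenvectors (Eq. 10).
        one = np.full((N, N), 1.0 / N)
        K_tilde = K - one @ K - K @ one + one @ K @ one
        evals, evecs = np.linalg.eigh(K_tilde)
        evals, evecs = evals[::-1][:q], evecs[:, ::-1][:, :q]
        alphas = evecs / np.sqrt(np.maximum(evals, 1e-12))  # unit-norm psi_l

        # Kernel between test points and the skeleton, then centered (Eq. 11).
        k_x = gaussian_kernel(test, skeleton, sigma)         # one row per test point
        k_x_tilde = k_x - k_x.mean(axis=1, keepdims=True) - k_mean_rows[None, :] + k_mean_all

        # Spherical potential (Eq. 15); k(x, x) = 1 for the Gaussian kernel.
        potential = 1.0 - 2.0 * k_x.mean(axis=1) + k_mean_all

        # Reconstruction error: potential minus squared projections g_l(x).
        g = k_x_tilde @ alphas
        recon_error = potential - np.sum(g**2, axis=1)
        return potential, recon_error

Thresholding potential alone corresponds (up to constants) to the Parzen density detector, while thresholding recon_error corresponds to the kernel PCA detector.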
3.3. Baseline comparison

Here we compare our method to the established RX algorithm (Reed & Yu, 1990), which is frequently used for anomaly detection in hyperspectral images (Chang & Chiang, 2002). Each pixel in a hyperspectral image is a vector of spectral intensities over a band of wavelengths, and RX-based anomaly detection is performed for each pixel by comparing its spectrum to that of its neighbors in the scene. The spectral signature of the anomaly and the covariance matrix of the background clutter, C_b, are both assumed to be unknown. Ultimately, background pixels are assumed to be drawn from a zero-mean Gaussian distribution with estimated covariance Ĉ_b while anomalous pixels are assumed to be drawn from a non-zero-mean Gaussian with the same covariance.

Although we are dealing with panchromatic imagery in this study rather than spectral data cubes, the RX algorithm may still be applied to the vectorized tiles of intensity information. Rather than detecting anomalous spectral signatures we are detecting anomalous spatial features. Using our notation the RX detection statistic is given by

D_RX(x) = (x − μ̂_b)^T Ĉ_b⁻¹ (x − μ̂_b)    (28)

where μ̂_b is an estimate of the mean background and we classify the test point as an anomaly when D_RX exceeds a specified threshold (note the similarity to the Mahalanobis distance).

Both the covariance and mean in (28) can be estimated either locally or globally. The goal in either case is to better estimate the true background covariance structure, which is consistent with our motivation for building a manifold model of the background skeleton. In our framework we have a choice between using all image tiles to estimate background statistics versus using only tiles sampled for the skeleton. We tested the importance of this choice by constructing ROC curves for a variety of maritime images using all the image tiles to estimate the background mean and covariance and compared the results to ROC curves generated using only skeleton tiles for background statistics estimation. Ten different skeletons were sampled for each tested image, and in most cases the performance of the skeleton-derived ROC curves was better than that of the globally-derived ROC curves. This can be explained by noting that the global method is guaranteed to incorporate anomalous pixels into the covariance estimation while the proportion of anomalous to background information is more likely to be improved with the skeleton sampling.

We compare detection performance between the kernel PCA and RX algorithms by sampling 10 different skeletons for each of 22 maritime images (including the images in Fig. 5) and calculating the ROC curves for each skeleton. The two algorithms are tested on the same set of skeleton points for all sampled skeletons. Fixing the false alarm rate at Pfa = 1e−4, we find the corresponding probability of detection for each of the 10 ROC curves calculated for each of the images and estimate the probability density function corresponding to the probability of detection, f(Pd), as shown in Fig. 10.

The RX algorithm performs better than kernel PCA on three of the 22 test images, but is generally outperformed for the same set of skeleton points. This performance difference is reflected by the fitted Gamma distribution shown in Fig. 10c.
One is more likely to have a higher probability of detection when using kernel PCA on maritime imagery of interest.

We note that RX performs miserably on the toy problem because each threshold defines a fixed-diameter circle centered on the origin in the original data space. As the threshold is increased, points within the circle are declared background and points outside are anomalous. Thus the algorithm will never declare any of the points within the square strip as anomalous without also incorrectly declaring all background points on the strip as anomalous. When background points are correctly classified as background, all points within the square will be missed detections. Finally, we note that manifold techniques have been used for spectral segmentation in hyperspectral imagery (Bachmann et al., 2005) and that the RX algorithm has been kernelized for hyperspectral applications by Kwon and Nasrabadi (2005).

4. Discussion

The toy problem results indicate similar detection performance among the algorithms tested, with kernel PCA possessing a slight edge over the other approaches. The image results, however, clearly favor kernel PCA, with the Parzen density estimator emerging as a reasonable alternative, particularly for low bandwidths. The latter is, however, highly sensitive to the choice of bandwidth, hence we do not recommend it as a general purpose, automated approach. Given the robustness of kernel PCA and its excellent detection performance, it is clearly favored over the other methods in the image anomaly detection problem. The improved performance of our method over RX on an ensemble of maritime images suggests that the methodology should be tested on additional anomaly detection tasks.
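For reference, the RX baseline of Eq. (28) in Section 3.3 amounts to a Mahalanobis-type distance computed with background statistics estimated from the tiles. A minimal Python sketch follows; it is illustrative only, and the use of a pseudo-inverse (to guard against an ill-conditioned covariance estimate) is our choice rather than part of the original algorithm description.

    import numpy as np

    def rx_scores(tiles, background_tiles):
        """RX statistic of Eq. (28) with mean/covariance estimated from background_tiles
        (e.g., the skeleton tiles, as in Section 3.3)."""
        mu = background_tiles.mean(axis=0)
        C = np.cov(background_tiles, rowvar=False)
        C_inv = np.linalg.pinv(C)                      # robust to a singular estimate
        diff = tiles - mu
        return np.einsum("ij,jk,ik->i", diff, C_inv, diff)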
[Fig. 10: histograms (bin counts) of the probability of detection Pd and the fitted probability density f(Pd).]
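The construction summarized in Fig. 10 can be sketched as follows: read off Pd at the fixed operating point Pfa = 1e−4 from each skeleton's ROC curve, collect the Pd values across skeletons and images, and fit a Gamma density to the samples. The interpolation scheme and function names below are our own illustration of that procedure.

    import numpy as np
    from scipy import stats

    def pd_at_pfa(pfa, pd, target_pfa=1e-4):
        """Interpolate a ROC curve to read off Pd at a fixed false-alarm rate."""
        order = np.argsort(pfa)
        return float(np.interp(target_pfa, pfa[order], pd[order]))

    def fit_pd_density(pd_samples):
        """Fit a Gamma density f(Pd) to the collected Pd samples, as in Fig. 10."""
        a, loc, scale = stats.gamma.fit(pd_samples)
        return lambda x: stats.gamma.pdf(x, a, loc=loc, scale=scale)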
Effort was made to improve the diffusion map results by decreasing the bandwidth as suggested in Lafon et al. (2006). The bandwidth of the extension was also modified for given values of the skeleton embedding bandwidth. No set of bandwidths was capable of outperforming kernel PCA. No claims are made that these empirical results are representative of those that would be obtained in other anomaly detection problems using other types of imagery. Rather, we find that kernel PCA works quite well for the presented unsupervised detection framework and would expect it to work well when given similar automated detection tasks.

It is also worth pointing out that the ROC curves shown in Fig. 8 convey the Type-I and Type-II errors associated with a single tile, when what is really sought are the ROC curves associated with targets of varying spatial extent. Close examination of Fig. 9, for example, reveals that nearly all of the targets have a significant number of correct detections with very few false alarms (Pfa = 0.0001). A ROC curve designed to indicate the probability of detecting a target consisting of multiple tiles would better illustrate algorithm performance. Converting tile-level ROC curves to target-level ROC curves is a straightforward process. Assume, for example, a target with a spatial extent covering n tiles. The probability of falsely declaring k of those tiles a detection by chance is given by the binomial distribution

P_fa^(target) = C(n, k) P_fa^k (1 − P_fa)^(n−k).    (29)

For example, if Pfa = 0.001 for an individual tile, the probability of getting k = 5 false alarms in a target spanning n = 8 tiles becomes P_fa^(target) = 5.58 × 10⁻¹⁴. Similarly, we may define

P_d^(target) = 1.0 − C(n, k) (1 − P_d)^k P_d^(n−k).    (30)

5. Conclusion

We have provided empirical evidence that our proposed method of subsample (a fraction of the data)-learn (a manifold from the subsample)-extend (unsampled points into the manifold space) for anomaly detection can perform better than the commonly employed baseline RX algorithm on an ensemble of maritime images when kernel PCA is the employed learning algorithm. All four learning algorithms perform reasonably well on the low-dimensional toy problem but kernel PCA is clearly favored for the maritime anomaly detection task.

This study relies on numerical experiments, so a number of theoretical questions remain to be answered. For example:

- How sensitive is the eigendecomposition of a kernel matrix to the inclusion of outliers, and can this sensitivity be determined as a function of both the number and some measure of anomalousness of the included outliers?
- Given a known distribution of background classes, what can sampling theory predict about the minimum size of the skeleton sampling required for the eigendecomposition to converge to a stable description of the data, and can we use this to place bounds on the detection performance?
- What is the effect on detection performance of model-based methods such as RX given some measure of deviation from an assumed statistical model?
- Is there some measure of data complexity (perhaps information-theoretic) that can be used to better inform parameter settings such as bandwidth, dimension, and skeleton sampling percentage?

A comprehensive study of performance versus skeleton sampling size would be a good next step for this research. Some additional research directions include: testing the performance of other manifold learning algorithms; defining a detection metric for diffusion map that is more precise than the current nearest-neighbor-based distance metric; studying the effect of using a local bandwidth for each extended point (rather than the current global method); testing alternative out-of-sample extension techniques; and applying the method to data sets from other fields (e.g., fraud detection).

References

Albano, J., Messinger, D. W., Schlamm, A., & Basener, B. (2011). Graph theoretic metrics for spectral imagery with application to change detection. In Proceedings SPIE: 8048.
Bachmann, C. M., Ainsworth, T. L., & Fusina, R. A. (2005). Exploiting manifold geometry in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 441–454.
Baker, C. T. H. (1977). The numerical treatment of integral equations. Oxford: Clarendon Press.
Basener, B., Ientilucci, E. J., & Messinger, D. W. (2007). Anomaly detection using topology. In Proceedings SPIE: 6565.
Belkin, M., & Niyogi, P. (2002). Using manifold structure for partially labeled classification. Advances in neural information processing systems: 14. Cambridge, MA, USA: The MIT Press.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.
Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Roux, N. L., & Ouimet, M. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in neural information processing systems: 16. Cambridge, MA, USA: The MIT Press.
Bezdek, J., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers and Geosciences, 10(2), 191–203.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th international conference on machine learning. Williamstown, MA, USA: ICML.
Bouchachia, A. (2007). Learning with partly labeled data. Neural Computing and Applications, 16, 267–293.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD international conference on management of data. Dallas, TX, USA: ACM Press.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), Article 15.
Chang, C.-I., & Chiang, S.-S. (2002). Anomaly detection and classification for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 40(6), 1314–1325.
Chen, Y., Crawford, M. M., & Ghosh, J. (2005). Applying nonlinear manifold learning to hyperspectral data for land cover classification. IGARSS: 5.
Coifman, R. R., & Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21, 5–30.
Fujimaki, R., Yairi, T., & Machida, K. (2005). An approach to spacecraft anomaly detection problem using kernel feature space. In Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery in data mining (pp. 401–410). Chicago, IL, USA: ACM Press.
Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition, 40, 863–874.
Kay, S. M. (1998). Fundamentals of statistical signal processing: Detection theory. New Jersey: Prentice Hall.
Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. The VLDB Journal, 8, 237–253.
Kwon, H., & Nasrabadi, N. (2005). Kernel RX-algorithm: A nonlinear anomaly detector for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 43(2), 388–397.
Lafon, S., Keller, Y., & Coifman, R. R. (2006). Data fusion and multicue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1784–1797.
Markou, M., & Singh, S. (2003a). Novelty detection: A review, part 1: Statistical approaches. Signal Processing, 83(12), 2481–2497.
Markou, M., & Singh, S. (2003b). Novelty detection: A review, part 2: Neural network based approaches. Signal Processing, 83(12), 2499–2521.
McDonough, R. N., & Whalen, A. D. (1995). Detection of signals in noise (Second edition). San Diego: Academic Press.
Messinger, D. W., & Albano, J. (2011). A graph theoretic approach to anomaly detection in hyperspectral imagery. IEEE Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS).
Mitra, N. J., & Nguyen, A. (2003). Estimating surface normals in noisy point cloud data. In Proceedings 19th annual symposium on computational geometry (pp. 322–328).
Olson, C. C., & Doster, T. (2016). A parametric study of unsupervised anomaly detection performance in maritime imagery using manifold learning techniques. SPIE Defense + Security: 984016. International Society for Optics and Photonics.
Olson, C. C., Nichols, J. M., Michalowicz, J. V., & Bucholtz, F. (2010). Improved outlier identification in hyperspectral imaging via nonlinear dimensionality reduction. In Proceedings SPIE: 7695.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1988). Numerical recipes in C. Cambridge: Cambridge University Press.
Reed, I. S., & Yu, X. (1990). Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(10), 1760–1770.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A. J., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., & Platt, J. (2000). Support vector method for novelty detection. Advances in Neural Information Processing Systems, 12, 582–588.
Szummer, M., & Jaakkola, T. (2002). Partially labeled classification with Markov random walks. Advances in neural information processing systems: 14. Cambridge, MA, USA: The MIT Press.
Tarassenko, L., Nairac, A., Townsend, N., & Cowley, P. (1999). Novelty detection in jet engines. IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health (Ref. No. 1999/034). Birmingham, UK: IEE.
Tax, D. M. J., & Duin, R. P. W. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.
Theiler, J., Foy, B. R., & Fraser, A. M. (2007). Beyond the adaptive matched filter: Nonlinear detectors for weak signals in high-dimensional clutter. In Proceedings SPIE: 6565.
Ziemann, A., & Messinger, D. (2015). An adaptive locally linear embedding manifold approach for hyperspectral target detection. In Proceedings SPIE: 9472.
Ziemann, A., Theiler, J., & Messinger, D. (2015). Hyperspectral target detection using manifold learning and multiple target spectra. IEEE Applied Imagery Pattern Recognition Workshop.