Manifold Learning Techniques For Unsupervised Anomaly Detection
Article history: Received 25 July 2016; Revised 31 July 2017; Accepted 1 August 2017; Available online 25 August 2017

Keywords: Manifolds; Manifold learning; Image processing; Anomaly detection; Target detection

Abstract: Appropriately identifying outlier data is a critical requirement in the decision-making process of many expert and intelligent systems deployed in a variety of fields including finance, medicine, and defense. Classical outlier detection schemes typically rely on the assumption that normal/background data of interest are distributed according to an assumed statistical model and search for data that deviate from that assumption. However, it is frequently the case that performance is reduced because the underlying distribution does not follow the assumed model. Manifold learning techniques offer improved performance by learning better models of the background but can be too computationally expensive due to the need to calculate a distance measure between all data points. Here, we study a general framework that allows manifold learning techniques to be used for unsupervised anomaly detection by reducing computational expense via a uniform random sampling of a small fraction of the data. A background manifold is learned from the sample and then an out-of-sample extension is used to project unsampled data into the learned manifold space and construct an anomaly detection statistic based on the prediction error of the learned manifold. The method works well for unsupervised anomaly detection because, by definition, the ratio of anomalous to non-anomalous data points is small and the sampling will be dominated by background points. However, a variety of parameters that affect detection performance are introduced, so we use here a low-dimensional toy problem to investigate their effect on the performance of four learning algorithms (kernel PCA, two versions of diffusion map, and the Parzen density estimator). We then apply the methods to the detection of watercraft in an ensemble of 22 infrared maritime scenes where we find kernel PCA to be superior and show that it outperforms a commonly employed baseline algorithm. The framework is not limited to the tested image processing example and can be used for any unsupervised anomaly detection task.

Published by Elsevier Ltd.
1. Introduction

We consider the problem of detecting points that are rare within a data set dominated by the presence of ordinary background points. The goal is to assign unknown data to either a background or anomaly class, and the numerous algorithms that have been devised to handle this problem can be categorized as supervised, semi-supervised, or unsupervised depending on how much information is available to the training algorithm.

Supervised approaches require labeled training data for both classes. Models that maximize the difference between classes are then constructed; some common algorithms include neural networks (Markou & Singh, 2003b), Gaussian mixture models (Tarassenko, Nairac, Townsend, & Cowley, 1999), principal components analysis (for linearly separable data), and kernel support vector machines (SVMs) (Schölkopf & Smola, 2002). Semi-supervised approaches, some examples of which can be found in Fujimaki, Yairi, and Machida (2005) and Bouchachia (2007), only require labels for the background class. Unsupervised algorithms are the most generically applicable and are often based on measures of similarity between data vectors. Examples include thresholding distances between neighboring data vectors (Knorr, Ng, & Tucakov, 2000), the local outlier factor (Breunig, Kriegel, Ng, & Sander, 2000), one-class SVMs (Schölkopf, Williamson, Smola, Shawe-Taylor, & Platt, 2000), and fuzzy c-means clustering (Bezdek, Ehrlich, & Full, 1984). Some reviews of the anomaly detection problem are provided in Markou and Singh (2003a) and Chandola, Banerjee, and Kumar (2009).

Supervised techniques are preferred over the unsupervised case whenever possible as we would expect the presence of training data to improve classification. We are not, however, always afforded this luxury. This is the case in many image-based detection scenarios where neither the background pixels nor the anomalous (often man-made) pixels are expected to be consistent
between scenes. Variations in the type of background pixels as well as changes in lighting and scene viewing angle can invalidate a priori assumptions about scene composition. Thus, we are motivated to develop unsupervised detection techniques.

Kernel and spectral methods comprise a family of algorithms commonly used for clustering and classification. Of these, spectral methods in particular are also used for dimensionality reduction. Algorithms such as Laplacian eigenmaps (Belkin & Niyogi, 2003), locally-linear embedding (Roweis & Saul, 2000), Isomap (Tenenbaum, de Silva, & Langford, 2000), and diffusion map (Coifman & Lafon, 2006) (which we discuss below) assume the observed high-dimensional data were actually generated by a lower-dimensional process and that the associations between the two can be learned. The goal of a manifold learning algorithm is therefore to map the original data onto a new coordinate system in which the classification problem is made simpler. These methods are similar in that they organize the data into clusters based on the eigenvalues and eigenvectors of a distance (adjacency) matrix calculated from the data. The data are viewed as nodes in a graph and the edges connecting the nodes are weighted by the similarity between the data as determined by a distance-measuring kernel.

In recent years such methods have been used for the analysis of hyperspectral images. For example, they have been used for classification (Bachmann, Ainsworth, & Fusina, 2005; Chen, Crawford, & Ghosh, 2005), target detection (Ziemann & Messinger, 2015; Ziemann, Theiler, & Messinger, 2015), and change detection (Albano, Messinger, Schlamm, & Basener, 2011). Within the context of anomaly detection, Kwon and Nasrabadi (2005) introduced a kernelized version of the standard RX algorithm (Reed & Yu, 1990) under the assumption that background and target would be described by Gaussian distributions in the high-dimensional feature space describing kernelized spectra. The TAD approach was introduced by Basener et al. and found to perform well against a variety of benchmark algorithms (Basener, Ientilucci, & Messinger, 2007). Messinger and Albano also considered the anomaly detection problem by measuring the connectivity of individual pixels within a locally-constructed graph (Messinger & Albano, 2011).

The motivation for using such kernel-based or manifold-learning algorithms is that a background model that is more appropriate to the specifics of a given scene can be learned using data-driven techniques rather than assuming a statistical model a priori as is done with, for example, RX. Estimating the parameters governing an assumed statistical distribution and constructing decision surfaces as a function of the learned parameters would be preferred, but real-world data frequently fail to follow assumed distribution models and it has been shown (see, e.g., Theiler, Foy, & Fraser, 2007) that sensitivity to outliers may be reduced if the assumptions underlying the model are not met by the data.

Adoption of such data-driven techniques is hampered, however, by the expense of calculating an adjacency matrix. In Olson, Nichols, Michalowicz, and Bucholtz (2010) we proposed a statistically uniform "skeleton" subsampling of a hyperspectral scene to reduce the computational cost of building an adjacency matrix and performed a preliminary study of out-of-sample extension (Bengio et al., 2004; Lafon, Keller, & Coifman, 2006) as a means of developing a detection statistic for the remaining unsampled points. We performed an additional study of the subsampling method in Olson and Doster (2016). Bachmann et al. (2005) have previously considered the use of subsampled pixel sets as a means of building a global manifold backbone against which local manifolds built from sub-segments of a scene could be aligned, but they found the method to be too computationally expensive for classification and did not consider the anomaly detection problem. Graph-based methods have been used previously in a semi-supervised manner for classification tasks (Blum & Chawla, 2001; Szummer & Jaakkola, 2002) and, more recently, Belkin and Niyogi (2002) and Belkin, Niyogi, and Sindhwani (2006) demonstrated that semi-supervised techniques can be used to learn a data manifold for classification. Although similar to our method, we are not aware of any work besides our own that extends these techniques to unsupervised anomaly detection in imagery.

Building an adjacency matrix from a subset of the data is conceptually simple but enables application of the wide variety of data-driven learning techniques to the anomaly detection problem and offers the prospect of improved detection performance over classical techniques. The tradeoff is the introduction of a set of unique considerations relative to previous approaches. The following are a few of the most fundamental considerations: (1) What data-driven learning algorithm should be applied to the sampled skeleton subset? (2) What fraction of the data set must be sampled in order to guarantee with some probability that all background classes will be sufficiently sampled without over-sampling the anomalous class? (3) What should the parameter settings be for a given learning algorithm and how are they affected by the size of the subsample? (4) How stable is detection performance as a function of parameter settings and subsample size? (5) How best to extend the learned model space to the out-of-sample points?

In this work we primarily focus on considerations (1), (3), and (4). In particular we address consideration (1) by using kernel PCA (Schölkopf, Smola, & Müller, 1998), two versions of diffusion map (Coifman & Lafon, 2006), and the Parzen density estimator (Parzen, 1962) to learn background models for panchromatic (not hyperspectral) images that have been tiled to form super-pixels. With all three techniques the basic idea is the same: learn a model based on previously acquired background data, project in new pixel data, and compute a measure of error between data and model as our detection statistic. The performance of the algorithms on a toy problem and real-world data set are quantified using receiver operating characteristic (ROC) curves (Kay, 1998) over a wide range of algorithm parameter settings (consideration 3) and over multiple skeleton samples (consideration 4). In all cases, good detection performance is obtained on the toy problem; however, kernel PCA outperforms the other learning algorithms on the real-world target detection task. We provide a more complete description of each technique in Section 2, describe the experiments and compare to an established algorithm in Section 3, and discuss results in Section 4 before concluding in Section 5.

2. Methods and motivation

The idea behind any anomaly detection approach is to model the background distribution using either assumed physical principles or by learning its description from the data. It is the latter route that we consider here. We begin with the set of N pixel intensities x_i ∈ R^M, i = 1⋯N, that comprise an image. Most of the pixels are assumed to contain background information while only a very few (< 1%) are assumed to contain a "target" point of interest.

In general, we seek to find a function, f(·), that maps the x_i into a new coordinate system where we can draw decision surfaces that more accurately separate anomaly from background. We don't, however, know f(·) a priori and must form an estimate, f̂(·), from our data. In this work we compare a number of methods, both linear and nonlinear, for learning f̂(·) and compare their resulting detection performance (although we drop the f̂(·) from here on out and work with f(·) for notational parsimony).

Given f(·), each datum can be represented in the new coordinate system by performing an analysis step θ_i = f(x_i) where θ_i ∈ R^M. Conversely, we may model (synthesize) each datum as x̂_i = f⁻¹(θ_i) where we allow f⁻¹: R^m → R^M with m ≤ M. Of course, a unique inverse and x̂_i = x_i can only be guaranteed when m = M.
In anomaly detection one would like to use only background points to construct f(·) and then test a new point x ∈ R^M to see whether or not it is consistent with this model. Here, the error between the data and the background model representation of that data, ‖x − f⁻¹(θ)‖₂², is then taken as a reasonable test statistic for deciding whether the point is background or anomaly.

In our approach the background model is learned from a uniformly sampled subset of the original data. This subset is henceforth referred to as the data skeleton and is denoted S = {x_si : s_i ∼ U(1, N), i = 1⋯N_S} where N_S < N. The assumption here is that for the purpose of anomaly detection, the randomly chosen N_S points are sufficient to build f(·) and are unlikely to contain anomalies. It is further assumed that these points lie on a (possibly) nonlinear manifold such that standard PCA provides a poor decision boundary.

A standard PCA analysis seeks to find linear correlations in the data which are then represented by the set of principal axes that result from finding the eigenvectors of the data covariance matrix. There is, however, no reason to assume that real-world data sets are defined solely by linear correlations. Furthermore, there is evidence (Lafon et al., 2006) to suggest that some data sets may in fact reside on manifolds with intrinsic dimensionality m < M. If so, a proper choice of f(·) can yield θ_i that provide a more parsimonious description of the data. We discuss how this might lead to better decision boundaries in the following section.

The process of selecting a data skeleton and relying on the rarity of anomalous pixels to make the assumption that the skeleton manifold is a good background model is a simple concept but does not seem to have been discussed in the literature. This procedure is reminiscent of semi-supervised learning, but actually falls under the definition of unsupervised learning (Chandola et al., 2009). In the semi-supervised case a small set of labeled samples is used to train a classifier for a much larger data set, as opposed to the unsupervised case where no labeled training data are available.

In what follows we briefly review kernel PCA, the Parzen density, and diffusion map. Each of these methods offers a different approach to learning the background model f(·). Construction of the error measure (test statistic) for each method is also described as it varies by approach. In each case our ultimate goal is to learn a model of our background imagery against which we can compare test pixels and decide background or anomaly. We begin with kernel PCA because the Parzen density arises naturally from calculation of the kernel PCA test statistic. For kernel PCA our discussion is informed by the descriptions given in Hoffmann (2007) and Schölkopf et al. (1998), and for diffusion map the works of Coifman and Lafon (2006) and Lafon et al. (2006).

2.1. Kernel PCA test statistic

Before describing the kernel PCA approach it is first useful to consider in more detail conventional PCA and its use in anomaly detection. In this standard approach, one takes the eigenvectors of the data covariance matrix C = E[(x_i − x₀)(x_i − x₀)^T] ≈ (1/N_s) Σ_i (x_i − x₀)(x_i − x₀)^T and assumes the linear model x̂_i = U_m θ_i, where x₀ is the data mean and we use U_m to denote the collection of m ≤ M eigenvectors, taken from U in decreasing order of the associated eigenvalues. Thus, the mapping f⁻¹(·) alluded to in the previous section is simply a linear matrix multiplication. It can be shown that this model minimizes the average error D̄ = (1/N_s) Σ_i ‖x_i − U_m U_m^T x_i‖₂² between the points and their projections onto the subspace spanned by the U_m. The assumption is that by retaining only the m most influential coordinates, a simpler (lower-dimensional) classification problem results (with the error in making this assumption quantified by D̄).

The extent to which this simple linear model is useful depends on the structure of the skeleton. For example, if the skeleton samples (background) reside on an (M − 1)-dimensional hyperplane in R^M then each point in the skeleton can indeed be uniquely mapped to the lower-dimensional coordinate θ_m ∈ R^m with m = M − 1. In this case, a test point x drawn from the background will have the model representation x̂ = U_m θ and a correspondingly small projection error D = ‖x − U_m U_m^T x‖₂². If we assume the anomalous points are spread sparsely but uniformly throughout R^M then test points that produce a large error are inconsistent with the background hyperplane and are flagged as an anomaly. It should be pointed out that one can equivalently specify this error in terms of the manifold coordinates

D = ‖U^T x − θ̄_m‖₂² = x·x − (U_m^T x)·(U_m^T x)    (1)

where θ̄_m is equal to the full coefficient vector, θ = U^T x ∈ R^M, but with dimensions m + 1, ..., M set equal to zero. This alternative formulation simply states that the projection error is equal to the norm of the coefficients that were discarded in the projection. In this case, the principal axes will be aligned with the manifold coordinates.

Now, if the skeleton is geometrically nonlinear, the linear mapping is no longer necessarily an accurate description of the data (or even a unique mapping) and a new background point can easily be mapped to a distant location (consider a 2-dimensional plane in R³ that has been folded into an "S"). In this case, PCA will produce a large error measure D, even if the test point is chosen from the background, hence there will be an unacceptable number of false positives in the anomaly detector. It will also fail to detect anomalous points that are not on the "S" but reside between the folds of the "S". Conversely, a manifold learning technique will generate coordinates that are aligned with the two directions on the sheet and the direction perpendicular to the manifold. Thresholding the projection error in the orthogonal dimension will produce a decision boundary that blankets the "S" on each side, yielding fewer false alarms and more detections.

The possibility of data lying on a geometrically nonlinear surface has motivated a number of different modeling approaches, among them kernel PCA. Kernel PCA was introduced by Schölkopf et al. (1998) and adapted to the anomaly detection problem by Hoffmann (2007). The idea is to map data that are not linearly separable in the original (ambient) space into a high-dimensional feature space in which linear decision surfaces can be constructed. Kernel PCA can be thought of as a nonlinear version of PCA, based on calculating the principal components of the data after the nonlinear mapping has been applied. If a nonlinear PCA model of the training data has been built, then a test point can be declared anomalous if the reconstruction error of the PCA model for that point is large. Hoffmann showed that thresholding kernel PCA reconstruction error yields an anomaly detection statistic that outperforms linear PCA, one-class SVM, and the Parzen density estimator (Parzen, 1962) on a number of toy problems and real-world data sets (Hoffmann, 2007).

In more detail, the idea behind kernel PCA is to project the data into a new, higher- (in theory, possibly infinite-) dimensional feature space, F, via

x_i → Φ(x_i).    (2)

Because this transformation must be constructed using finite data, the transformed space can be of at most dimension N_S, as will be seen in the subsequent development. If we form a linear decision boundary in F then we effectively define a nonlinear decision surface in the ambient (M-dimensional) space that better separates the background and anomalous data distributions.

The difficulty with this approach is that the mapping (2) could be expensive or impossible to compute.
However, we never actually need to know what the data look like in the higher-dimensional space, i.e., we never have to evaluate (2). Recall that what is ultimately required for anomaly detection is a measure of error between data and model. It turns out that any error measure that involves inner products among data vectors in the high-dimensional space can be written as a function of an appropriately chosen kernel function k(·, ·) applied to our finite, M-dimensional data.

To see how this is accomplished, assume that we can transform our training set S into the feature space using Eq. (2); we again require the eigenvectors Ψ = (ψ_1, ψ_2, ⋯, ψ_{N_S}) and corresponding eigenvalues λ_1 > λ_2 > ⋯ > λ_{N_S} of the high-dimensional data covariance matrix

C̃ = (1/N) Σ_{i=1}^{N} Φ̃(x_i) Φ̃(x_i)^T.    (3)

In practice one works with the kernel (Gram) matrix of the centered feature-space data,

K̃ ≡ K̃_ij = Φ̃(x_i) · Φ̃(x_j),    (8)

where

K̃_ij = K_ij − (1/N_S) Σ_{s=1}^{N_S} K_is − (1/N_S) Σ_{r=1}^{N_S} K_rj + (1/N_S²) Σ_{r,s=1}^{N_S} K_rs.    (9)

The eigenvectors of K̃, denoted α^l ∈ R^{N_S}, l = 1, ..., N_S, can then be related to the eigenvectors via (Schölkopf et al., 1998)

ψ_l = Σ_{i=1}^{N_S} α_i^l Φ̃(x_i),   l = 1⋯N_S    (10)

where the eigenvectors of K̃ are normalized such that ‖α^l‖ = 1/λ_l, with λ_l the eigenvalue corresponding to α^l. This normalization ensures that ‖ψ_l‖ = 1 (Hoffmann, 2007).

Recall from (7) that we require the inner products (Ψ_m^T Φ̃(x)) · (Ψ_m^T Φ̃(x)) in forming the error measure. Using (10) we can write each element of Ψ^T Φ̃(x) as the inner product

Φ̃(x) · ψ_l = g_l(x)
  = Σ_{i=1}^{N_S} α_i^l [ k(x, x_i) − (1/N_S) Σ_{s=1}^{N_S} k(x_i, x_s) − (1/N_S) Σ_{s=1}^{N_S} k(x, x_s) + (1/N_S²) Σ_{r,s=1}^{N_S} k(x_r, x_s) ],   l = 1⋯N_S    (11)

thus we never have to actually compute the high-dimensional covariance matrix or its eigenvectors ψ_l in order to form Ψ^T Φ̃(x). The needed inner product term in (7) is therefore also a function of the kernel, such that the total error becomes

D_S(x) = k(x, x) − (2/N_S) Σ_{i=1}^{N_S} k(x, x_i) + (1/N_S²) Σ_{i,j=1}^{N_S} k(x_i, x_j).    (15)

For the Gaussian kernel, the first and last terms in (15) are constant and can be ignored (the last term is only a function of the model data and not the test point). Thus, the potential is governed by the middle term, which is proportional to the Parzen density estimator (Hoffmann, 2007).

2.3. Diffusion map

Diffusion map is one of a host of manifold learning techniques that have become prevalent in the literature in recent years (Coifman & Lafon, 2006).
Rather than attempting to map the data into a higher-dimensional space in which a linear model is appropriate, manifold learning approaches attempt to model the nonlinear data space directly. This approach to background modeling is significantly different from kernel PCA, thus the error between model and test data must be calculated differently. We present one such approach here and demonstrate how the diffusion map can be used for the anomaly detection problem. Specifically, we present two different variants of the diffusion map algorithm and compare their performance.

The diffusion map construction views the data as a set of points in a graph where again we use a kernel function k(x_i, x_j) to measure the similarity between data points. Here, we define K_ij = k_G(x_i, x_j) as the matrix associated with the Gaussian kernel given by (13). Given a bandwidth σ, the weights determine the local connectivity of the data. If one further normalizes this connectivity matrix so that

p_ij ≡ p(x_i, x_j) = K_ij / Σ_j K_ij,    (16)

the result can be interpreted as the probability that a random walker will jump from x_i to x_j in one time step (Coifman & Lafon, 2006). Moreover, one can find the probability of a random walker moving from x_i to x_j in t time steps by simply raising (16) to the tth power, i.e., forming p_ij^t. The matrix (16) carries information about the local geometry of the data set, with larger transition probabilities being associated with nearby pairs of points.

As with kernel PCA, the eigenstructure of this connectivity matrix is central to a coordinate transformation. Specifically, the eigenvalues, λ_i, and eigenvectors, ψ_i, of the tth transition matrix provide the diffusion map coordinates

f_t(x_i) = (λ_1^t ψ_i1, λ_2^t ψ_i2, ..., λ_m^t ψ_im) = (θ_i1, θ_i2, ⋯, θ_im)    (17)

where the dimension of θ_i = f_t(x_i) ∈ R^m is equal to the number of retained eigenvectors. As with kernel PCA, the largest m eigenvectors are assumed to accurately model the data while the remaining coordinates model the unimportant features (e.g. noise) and serve only to complicate the classification problem. Increasing t allows the diffusion map to adjust between local and global information on the data manifold. Larger values of t reduce the required m, as large scale structures in the data require fewer retained eigenvectors for an accurate reconstruction.

Coifman and Lafon (2006) and Lafon et al. (2006) discuss the difference between the geometry and sampling density of a manifold M on which the observed data reside. In some cases, the sampling density of points on the manifold may not provide any information about the underlying process and will distort proper understanding of manifold geometry. In other cases, the density of points on the manifold provides useful information about the underlying data-producing process. These two extremes are addressed by introducing the transition kernel

k_D(x_i, x_j) = K^(α)_ij / Σ_j K^(α)_ij    (18)

where

K^(α)_ij = K_ij / (q(x_i)^α · q(x_j)^α)    (19)

and

q(x_i) = Σ_k K_ik    (20)

where α ∈ R controls how sensitive the diffusion embedding is to the distribution of points on M. For α = 0 the embedding combines information from both the density and the geometry and for a Gaussian kernel is equivalent to a Laplacian eigenmap (Belkin & Niyogi, 2003), the eigenvectors of which were shown by Coifman and Lafon (2006) to converge to the Schrödinger operator, Δ + E, as N → ∞ and σ → 0. E is a density-dependent scalar potential and Δ is the Laplace-Beltrami operator on M (Lafon et al., 2006). When α = 1 the diffusion map approximates only Δ and the embedding is independent of density. Here, we examine both extremes.

It was demonstrated in Coifman and Lafon (2006) that a useful measure of distance between any two points on the graph is the diffusion distance

D_t²(x_i, x_j) ∝ Σ_k (p_t(x_i, x_k) − p_t(x_j, x_k))²    (21)

which represents the distance between conditional probabilities that quantify the influence of each point on the rest of the graph. Computation of (21) can be simplified by considering a slightly different formulation. It turns out that the Euclidean distance between points in the diffusion embedding space is given by

D_t²(x_i, x_j) = Σ_{l≥1} λ_l^{2t} (ψ_il − ψ_jl)²    (22)

and is equivalent to calculating the diffusion distance between points in the graph (Coifman & Lafon, 2006). The diffusion map corresponding to t quantifies the influence of a point in the graph on all other points in the graph residing within a given bandwidth (distance) of that point. A small diffusion distance indicates many short paths between two points with rapid diffusion between them. Large distances indicate points that reside in separate clusters or that are separated by bottlenecks on the manifold. In short, the diffusion coordinates offer a natural organization of the data based on clusters whose scale is determined by t. What remains is to embed new points that were not originally included in the eigendecomposition that produced f_t.

2.4. Out-of-sample extension

A datum that was not originally included in the eigendecomposition can be projected into the diffusion space by weighting the diffusion coordinates by the kernel distances between the new point and the points in the original ambient space. More precisely, a diffusion coordinate of the test point x ∉ S is given by

ψ̄_k(x) = (1/λ_k^t) Σ_{j=1}^{N_S} k_D(x, x_j) ψ_jk    (23)

with the complete embedding given by

f̄_t(x) = (ψ̄_1(x), ψ̄_2(x), ..., ψ̄_m(x))    (24)

where the overbar has been included to explicitly indicate that the new point has been extended from the manifold M_S learned from the skeleton set. We use the notation k_D(·, ·) to indicate that for the extension it is possible to use a different kernel bandwidth in the Gaussian kernel k_G(·, ·) than the original bandwidth σ that was used to construct M_S. In fact, we are not even restricted to the same kernel for the extension (Lafon et al., 2006).

This process is based on the Nyström extension and was first used to reduce the computation cost of kernel-based methods for solving integral equations (Baker, 1977; Press, Teukolsky, Vetterling, & Flannery, 1988). More recently the method has been adopted by the manifold learning community as a means of extending points that were not originally included in the manifold learning process (Bengio et al., 2004; Lafon et al., 2006). We use this procedure as a means of determining how similar a test point is to the learned manifold. In essence, the eigenvectors that have been learned from S are a basis for the manifold M_S and points that ...
... of that same set. All sampled skeletons are the same for each of the four methods. Although there is some deviation between the different samplings, the differences are not extreme. The most extreme deviation would occur if all 100 anomalies were included in the skeleton, but the probability of such an occurrence is extremely low.

All four methods perform reasonably well on this problem. The Laplace-Beltrami version of diffusion map (α = 1) produced the smallest variance among the resulting ROC curves; however, kernel PCA produced, on average, the highest Pd for low Pfa. It should be noted that the low variance for Laplace-Beltrami is consistent with the idea that Δ is sensitive to manifold geometry and is independent of sampling density. Both the Parzen density ...
Fig. 4. ROC curves for the (a) Parzen density estimator, (b) kernel PCA, (c) graph Laplacian (diffusion map with α = 0), and (d) Laplace-Beltrami (diffusion map with α = 1) anomaly detectors. Data set composed of 10,000 square samples and 100 anomalies. Each curve in each figure represents a different 5% skeleton sampling of the same data set.
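For illustration, the sampling-and-scoring loop behind curves such as those in Fig. 4 can be sketched in a few lines of Python. This is only a sketch and is not the code used to generate the results; score_background_error is a placeholder for any of the four detection statistics, and the parameter names and default values (5% skeleton fraction, 10 skeletons) are taken from the experiment description above.

    # Sketch of the evaluation protocol: draw a uniform random skeleton,
    # score every point with a detector fit to the skeleton, and sweep the
    # threshold to trace out a ROC curve (Pfa, Pd).
    import numpy as np

    def roc_curve(scores, labels, n_thresh=200):
        """Pd and Pfa over a sweep of thresholds (labels: 1 = anomaly, 0 = background)."""
        thresholds = np.quantile(scores, np.linspace(0.0, 1.0, n_thresh))
        pd, pfa = [], []
        for t in thresholds:
            detected = scores > t
            pd.append(detected[labels == 1].mean())   # detection probability
            pfa.append(detected[labels == 0].mean())  # false-alarm probability
        return np.array(pfa), np.array(pd)

    def skeleton_rocs(data, labels, score_background_error, frac=0.05, n_skel=10, seed=0):
        """Repeat the sample-learn-score loop for several random skeletons."""
        rng = np.random.default_rng(seed)
        curves = []
        for _ in range(n_skel):
            idx = rng.choice(len(data), size=int(frac * len(data)), replace=False)
            skeleton = data[idx]                      # assumed to be anomaly-poor
            scores = score_background_error(skeleton, data)
            curves.append(roc_curve(scores, labels))
        return curves

Each call draws an independent skeleton, so the spread of the resulting curves reflects the sampling variability examined above.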
[Figure: Pd versus retained dimension for the graph Laplacian, Laplace-Beltrami, and kernel PCA detectors (bandwidth = max distance), and Pd versus bandwidth multiple for the graph Laplacian, Laplace-Beltrami, and kernel PCA detectors (d = 10) and the spherical potential detector (d = 48).]
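As a concrete illustration of the kernel statistics whose bandwidth and dimension are swept above, the following Python sketch computes the Gaussian-kernel spherical potential of Eq. (15) and a kernel PCA reconstruction error in the spirit of Hoffmann (2007), in which the squared projections g_l(x) of Eq. (11) onto the leading q feature-space eigenvectors (Eqs. (8)-(10)) are subtracted from the potential. The parameter names (sigma, q) and implementation details are ours; this is a sketch, not the implementation used for the reported results.

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_kernel(A, B, sigma):
        return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma**2))

    def kpca_scores(skeleton, test, sigma, q):
        """Spherical potential (Eq. 15) and a Hoffmann-style kernel PCA reconstruction error."""
        N = len(skeleton)
        K = gaussian_kernel(skeleton, skeleton, sigma)
        k_mean_rows = K.mean(axis=0)        # (1/N) sum_s k(x_i, x_s)
        k_mean_all = K.mean()               # (1/N^2) sum_{r,s} k(x_r, x_s)

        # Centered Gram matrix (Eq. 9) and its leading eigenvectors (Eq. 10).
        one = np.full((N, N), 1.0 / N)
        K_tilde = K - one @ K - K @ one + one @ K @ one
        evals, evecs = np.linalg.eigh(K_tilde)
        evals, evecs = evals[::-1][:q], evecs[:, ::-1][:, :q]
        alphas = evecs / np.sqrt(np.maximum(evals, 1e-12))  # unit-norm psi_l

        # Kernel between test points and the skeleton, then centered (Eq. 11).
        k_x = gaussian_kernel(test, skeleton, sigma)         # one row per test point
        k_x_tilde = k_x - k_x.mean(axis=1, keepdims=True) - k_mean_rows[None, :] + k_mean_all

        # Spherical potential (Eq. 15); k(x, x) = 1 for the Gaussian kernel.
        potential = 1.0 - 2.0 * k_x.mean(axis=1) + k_mean_all

        # Reconstruction error: potential minus squared projections g_l(x).
        g = k_x_tilde @ alphas
        recon_error = potential - np.sum(g**2, axis=1)
        return potential, recon_error

Thresholding potential alone corresponds (up to constants) to the Parzen density detector, while thresholding recon_error corresponds to the kernel PCA detector.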
3.3. Baseline comparison

Here we compare our method to the established RX algorithm (Reed & Yu, 1990), which is frequently used for anomaly detection in hyperspectral images (Chang & Chiang, 2002). Each pixel in a hyperspectral image is a vector of spectral intensities over a band of wavelengths, and RX-based anomaly detection is performed for each pixel by comparing its spectrum to that of its neighbors in the scene. The spectral signature of the anomaly and the covariance matrix of the background clutter, C_b, are both assumed to be unknown. Ultimately, background pixels are assumed to be drawn from a zero-mean Gaussian distribution with estimated covariance Ĉ_b while anomalous pixels are assumed to be drawn from a non-zero-mean Gaussian with the same covariance.

Although we are dealing with panchromatic imagery in this study rather than spectral data cubes, the RX algorithm may still be applied to the vectorized tiles of intensity information. Rather than detecting anomalous spectral signatures we are detecting anomalous spatial features. Using our notation the RX detection statistic is given by

D_RX(x) = (x − μ̂_b)^T Ĉ_b⁻¹ (x − μ̂_b)    (28)

where μ̂_b is an estimate of the mean background and we classify the test point as an anomaly when D_RX exceeds a specified threshold (note the similarity to the Mahalanobis distance).

Both the covariance and mean in (28) can be estimated either locally or globally. The goal in either case is to better estimate the true background covariance structure, which is consistent with our motivation for building a manifold model of the background skeleton. In our framework we have a choice between using all image tiles to estimate background statistics versus using only tiles sampled for the skeleton. We tested the importance of this choice by constructing ROC curves for a variety of maritime images using all the image tiles to estimate the background mean and covariance and compared the results to ROC curves generated using only skeleton tiles for background statistics estimation. Ten different skeletons were sampled for each tested image, and in most cases the performance of the skeleton-derived ROC curves was better than that of the globally-derived ROC curves. This can be explained by noting that the global method is guaranteed to incorporate anomalous pixels into the covariance estimation while the proportion of anomalous to background information is more likely to be improved with the skeleton sampling.

We compare detection performance between the kernel PCA and RX algorithms by sampling 10 different skeletons for each of 22 maritime images (including the images in Fig. 5) and calculating the ROC curves for each skeleton. The two algorithms are tested on the same set of skeleton points for all sampled skeletons. Fixing the false alarm rate at Pfa = 1e−4, we find the corresponding probability of detection for each of the 10 ROC curves calculated for each of the images and estimate the probability density function corresponding to the probability of detection, f(Pd), as shown in Fig. 10.

The RX algorithm performs better than kernel PCA on three of the 22 test images, but is generally outperformed for the same set of skeleton points. This performance difference is reflected by the fitted Gamma distribution shown in Fig. 10c.
One is more likely to have a higher probability of detection when using kernel PCA on maritime imagery of interest.

We note that RX performs miserably on the toy problem because each threshold defines a fixed-diameter circle centered on the origin in the original data space. As the threshold is increased, points within the circle are declared background and points outside are anomalous. Thus the algorithm will never declare any of the points within the square strip as anomalous without also incorrectly declaring all background points on the strip as anomalous. When background points are correctly classified as background, all points within the square will be missed detections. Finally, we note that manifold techniques have been used for spectral segmentation in hyperspectral imagery (Bachmann et al., 2005) and that the RX algorithm has been kernelized for hyperspectral applications by Kwon and Nasrabadi (2005).

4. Discussion

The toy problem results indicate similar detection performance among the algorithms tested, with kernel PCA possessing a slight edge over the other approaches. The image results, however, clearly favor kernel PCA, with the Parzen density estimator emerging as a reasonable alternative, particularly for low bandwidths. The latter is, however, highly sensitive to the choice of bandwidth, hence we do not recommend it as a general purpose, automated approach. Given the robustness of kernel PCA and its excellent detection performance, it is clearly favored over the other methods in the image anomaly detection problem. The improved performance of our method over RX on an ensemble of maritime images suggests that the methodology should be tested on additional anomaly detection tasks.
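For reference, the RX baseline of Eq. (28) in Section 3.3 amounts to a Mahalanobis-type distance computed with background statistics estimated from the tiles. A minimal Python sketch follows; it is illustrative only, and the use of a pseudo-inverse (to guard against an ill-conditioned covariance estimate) is our choice rather than part of the original algorithm description.

    import numpy as np

    def rx_scores(tiles, background_tiles):
        """RX statistic of Eq. (28) with mean/covariance estimated from background_tiles
        (e.g., the skeleton tiles, as in Section 3.3)."""
        mu = background_tiles.mean(axis=0)
        C = np.cov(background_tiles, rowvar=False)
        C_inv = np.linalg.pinv(C)                      # robust to a singular estimate
        diff = tiles - mu
        return np.einsum("ij,jk,ik->i", diff, C_inv, diff)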
[Fig. 10: histograms (bin counts) of the probability of detection Pd and the fitted probability density f(Pd).]
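The construction summarized in Fig. 10 can be sketched as follows: read off Pd at the fixed operating point Pfa = 1e−4 from each skeleton's ROC curve, collect the Pd values across skeletons and images, and fit a Gamma density to the samples. The interpolation scheme and function names below are our own illustration of that procedure.

    import numpy as np
    from scipy import stats

    def pd_at_pfa(pfa, pd, target_pfa=1e-4):
        """Interpolate a ROC curve to read off Pd at a fixed false-alarm rate."""
        order = np.argsort(pfa)
        return float(np.interp(target_pfa, pfa[order], pd[order]))

    def fit_pd_density(pd_samples):
        """Fit a Gamma density f(Pd) to the collected Pd samples, as in Fig. 10."""
        a, loc, scale = stats.gamma.fit(pd_samples)
        return lambda x: stats.gamma.pdf(x, a, loc=loc, scale=scale)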
Effort was made to improve the diffusion map results by decreasing the bandwidth as suggested in Lafon et al. (2006). The bandwidth of the extension was also modified for given values of the skeleton embedding bandwidth. No set of bandwidths was capable of outperforming kernel PCA. No claims are made that these empirical results are representative of those that would be obtained in other anomaly detection problems using other types of imagery. Rather, we find that kernel PCA works quite well for the presented unsupervised detection framework and would expect it to work well when given similar automated detection tasks.

It is also worth pointing out that the ROC curves shown in Fig. 8 convey the Type-I and Type-II errors associated with a single tile, when what is really sought are the ROC curves associated with targets of varying spatial extent. Close examination of Fig. 9, for example, reveals that nearly all of the targets have a significant number of correct detections with very few false alarms (Pfa = 0.0001). A ROC curve designed to indicate the probability of detecting a target consisting of multiple tiles would better illustrate algorithm performance. Converting tile-level ROC curves to target-level ROC curves is a straightforward process. Assume, for example, a target with a spatial extent covering n tiles. The probability of falsely declaring k of those tiles a detection by chance is given by the binomial distribution

P_fa^(target) = C(n, k) P_fa^k (1 − P_fa)^(n−k).    (29)

For example, if Pfa = 0.001 for an individual tile, the probability of getting k = 5 false alarms in a target spanning n = 8 tiles becomes P_fa^(target) = 5.58 × 10⁻¹⁴. Similarly, we may define

P_d^(target) = 1.0 − C(n, k) (1 − P_d)^k P_d^(n−k).    (30)

5. Conclusion

We have provided empirical evidence that our proposed method of subsample (a fraction of the data)-learn (a manifold from the subsample)-extend (unsampled points into the manifold space) for anomaly detection can perform better than the commonly employed baseline RX algorithm on an ensemble of maritime images when kernel PCA is the employed learning algorithm. All four learning algorithms perform reasonably well on the low-dimensional toy problem but kernel PCA is clearly favored for the maritime anomaly detection task.

This study relies on numerical experiments, so a number of theoretical questions remain to be answered. For example:

- How sensitive is the eigendecomposition of a kernel matrix to the inclusion of outliers, and can this sensitivity be determined as a function of both the number and some measure of anomalousness of the included outliers?
- Given a known distribution of background classes, what can sampling theory predict about the minimum size of the skeleton sampling required for the eigendecomposition to converge to a stable description of the data, and can we use this to place bounds on the detection performance?
- What is the effect on detection performance of model-based methods such as RX given some measure of deviation from an assumed statistical model?
- Is there some measure of data complexity (perhaps information-theoretic) that can be used to better inform parameter settings such as bandwidth, dimension, and skeleton sampling percentage?

A comprehensive study of performance versus skeleton sampling size would be a good next step for this research. Some additional research directions include: testing the performance of other manifold learning algorithms; defining a detection metric for diffusion map that is more precise than the current nearest-neighbor-based distance metric; studying the effect of using a local bandwidth for each extended point (rather than the current global method); testing alternative out-of-sample extension techniques; and applying the method to data sets from other fields (e.g., fraud detection).

References

Albano, J., Messinger, D. W., Schlamm, A., & Basener, B. (2011). Graph theoretic metrics for spectral imagery with application to change detection. In Proceedings SPIE: 8048.
Bachmann, C. M., Ainsworth, T. L., & Fusina, R. A. (2005). Exploiting manifold geometry in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 441–454.
Baker, C. T. H. (1977). The numerical treatment of integral equations. Oxford: Clarendon Press.
Basener, B., Ientilucci, E. J., & Messinger, D. W. (2007). Anomaly detection using topology. In Proceedings SPIE: 6565.
Belkin, M., & Niyogi, P. (2002). Using manifold structure for partially labeled classification. Advances in neural information processing systems: 14. Cambridge, MA, USA: The MIT Press.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.
Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Roux, N. L., & Ouimet, M. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in neural information processing systems: 16. Cambridge, MA, USA: The MIT Press.
Bezdek, J., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers and Geosciences, 10(2), 191–203.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th international conference on machine learning. Williamstown, MA, USA: ICML.
Bouchachia, A. (2007). Learning with partly labeled data. Neural Computing and Applications, 16, 267–293.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD international conference on management of data. Dallas, TX, USA: ACM Press.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), Article 15.
Chang, C.-I., & Chiang, S.-S. (2002). Anomaly detection and classification for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 40(6), 1314–1325.
Chen, Y., Crawford, M. M., & Ghosh, J. (2005). Applying nonlinear manifold learning to hyperspectral data for land cover classification. IGARSS: 5.
Coifman, R. R., & Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21, 5–30.
Fujimaki, R., Yairi, T., & Machida, K. (2005). An approach to spacecraft anomaly detection problem using kernel feature space. In Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery in data mining (pp. 401–410). Chicago, IL, USA: ACM Press.
Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition, 40, 863–874.
Kay, S. M. (1998). Fundamentals of statistical signal processing: Detection theory. New Jersey: Prentice Hall.
Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. The VLDB Journal, 8, 237–253.
Kwon, H., & Nasrabadi, N. (2005). Kernel RX-algorithm: A nonlinear anomaly detector for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 43(2), 388–397.
Lafon, S., Keller, Y., & Coifman, R. R. (2006). Data fusion and multicue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1784–1797.
Markou, M., & Singh, S. (2003a). Novelty detection: A review, part 1: Statistical approaches. Signal Processing, 83(12), 2481–2497.
Markou, M., & Singh, S. (2003b). Novelty detection: A review, part 2: Neural network based approaches. Signal Processing, 83(12), 2499–2521.
McDonough, R. N., & Whalen, A. D. (1995). Detection of signals in noise (Second edition). San Diego: Academic Press.
Messinger, D. W., & Albano, J. (2011). A graph theoretic approach to anomaly detection in hyperspectral imagery. IEEE Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS).
Mitra, N. J., & Nguyen, A. (2003). Estimating surface normals in noisy point cloud data. In Proceedings 19th annual symposium on computational geometry (pp. 322–328).
Olson, C. C., & Doster, T. (2016). A parametric study of unsupervised anomaly detection performance in maritime imagery using manifold learning techniques. SPIE Defense + Security: 984016. International Society for Optics and Photonics.
Olson, C. C., Nichols, J. M., Michalowicz, J. V., & Bucholtz, F. (2010). Improved outlier identification in hyperspectral imaging via nonlinear dimensionality reduction. In Proceedings SPIE: 7695.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1988). Numerical recipes in C. Cambridge: Cambridge University Press.
Reed, I. S., & Yu, X. (1990). Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(10), 1760–1770.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A. J., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., & Platt, J. (2000). Support vector method for novelty detection. Advances in Neural Information Processing Systems, 12, 582–588.
Szummer, M., & Jaakkola, T. (2002). Partially labeled classification with Markov random walks. Advances in neural information processing systems: 14. Cambridge, MA, USA: The MIT Press.
Tarassenko, L., Nairac, A., Townsend, N., & Cowley, P. (1999). Novelty detection in jet engines. IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health (Ref. No. 1999/034). Birmingham, UK: IEE.
Tax, D. M. J., & Duin, R. P. W. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.
Theiler, J., Foy, B. R., & Fraser, A. M. (2007). Beyond the adaptive matched filter: Nonlinear detectors for weak signals in high-dimensional clutter. In Proceedings SPIE: 6565.
Ziemann, A., & Messinger, D. (2015). An adaptive locally linear embedding manifold approach for hyperspectral target detection. In Proceedings SPIE: 9472.
Ziemann, A., Theiler, J., & Messinger, D. (2015). Hyperspectral target detection using manifold learning and multiple target spectra. IEEE Applied Imagery Pattern Recognition Workshop.