On Normalization and Algorithm Selection For Unsupervised Outlier Detection
http://business.monash.edu/econometrics-and-business-statistics/research/publications
September 2018
Abstract
This paper demonstrates that the performance of various outlier detection methods depends sensitively
on both the data normalization scheme employed and the characteristics of the datasets.
Recasting the challenge of understanding these dependencies as an algorithm selection problem, we
perform the first instance space analysis of outlier detection methods. Such analysis enables the strengths
form the first instance space analysis of outlier detection methods. Such analysis enables the strengths
and weaknesses of unsupervised outlier detection methods to be visualized and insights gained into
which method and normalization scheme should be selected to obtain the most likely best performance
for a given dataset.
1 Introduction
An increasingly important challenge in a data-rich world is to efficiently analyze datasets for patterns of
regularity and predictability, and to find outliers that deviate from the expected patterns. The significance
of detecting such outliers with high accuracy, minimizing costly false positives and dangerous false
negatives, is clear when we consider just a few societally critical examples of outliers: e.g. fraudulent
credit card transactions amongst billions of legitimate ones, fetal anomalies in pregnancies, chromosomal
anomalies in tumours, emerging terrorist plots in social media and early signs of stock market crashes.
There are many outlier detection methods already available in the literature, with new methods
emerging at a steady rate (Zimek et al. 2012). The diversity of applications makes it unlikely that a
single method will out-perform all others in all scenarios (Wolpert et al. 1995, Wolpert & Macready
1997, Culberson 1998, Ho & Pepyne 2002, Igel & Toussaint 2005). As such, it is advantageous to know
the strengths and weaknesses of any method, and how specific properties of a dataset might render it
more or less ideally suited to detect outliers than other methods. What kinds of properties would enable
a given method to perform well on one dataset, but maybe poorly on another? How sensitive are the
existing methods to variations in dataset characteristics? How can we objectively evaluate a portfolio of
outlier detection methods to learn these relationships? Given a problem, can we learn to predict the best-
suited outlier detection method(s)? And given that normalization of a dataset is a typical pre-processing
step adopted by all outlier detection methods, but rescaling the data can change the relationships between
the data points, what impact does a normalization scheme have on outlier detection accuracy? These are
some of the questions that motivate this work.
When evaluating outlier detection methods, an important issue that needs to be considered is the def-
inition of an outlier - according to both an algorithm’s definition of an outlier and a human who may have
labelled training data. Generically, Hawkins (1980) defines an outlier as an observation which deviates
so much from other observations as to arouse suspicion it was generated by a different mechanism. Bar-
nett & Lewis (1974) define an outlier as an observation (or subset of observations) which appears to be
inconsistent with the remainder of that set of data. Both these definitions indicate that outliers are quite
[Figure 1: two example datasets, panels (a) and (b), with the outliers identified by KNN, LOF and COF marked; the three methods do not agree on which points are the outliers.]
different from non-outlying observations. Barnett & Lewis (1974) also note that it is a matter of sub-
jective judgement on the part of the observer whether or not some observation is picked out for scrutiny.
The subjectivity of outlier detection is not only due to human judgement, but extends to differences in
how outlier detection methods define an outlier. Indeed, there are many instances where a set of outlier
detection methods may not agree on the location of outliers due to their different definitions, whether
they be related to nearest neighbour distances, density arguments or other quantitative metrics. Figure 1
illustrates the lack of consensus of three popular outlier detection methods, namely KNN (Ramaswamy
et al. 2000), LOF (Breunig et al. 2000) and COF (Tang et al. 2002), and highlights the opportunity to ex-
ploit knowledge of the combination of dataset characteristics and the algorithm’s definition of an outlier
to enhance selection of the most suitable method.
Evaluation of unsupervised outlier detection methods has received growing attention in recent years.
Campos et al. (2016) conducted an experimental evaluation of 12 methods based on nearest neighbours
using the ELKI software suite (Achtert et al. 2008). While the methods considered all used a similar
nearest neighbor distance definition of an outlier, the study is relevant to ours since it contributed a
useful repository of around 1000 benchmark datasets generated by modifying 23 source datasets that can
be used for further outlier detection analysis. It is common practice to test outlier detection algorithms
on datasets with known ground truth labels, for example classification datasets where observations of the
minority class have been down-sampled. We will be extending their approach to dataset generation for
our comprehensive experimental study. Goldstein & Uchida (2016) conducted a comparative evaluation
of 19 unsupervised outlier detection methods, which fall into three categories, namely nearest neighbour
based, clustering based, and methods based on other algorithms such as one-class SVM and robust PCA. They
used 10 datasets for their evaluation, and their algorithms are released with the RapidMiner data mining
software. Emmott et al. (2015) conducted a meta-analysis of 12 outlier detection methods, which fall into
four categories, namely nearest neighbours based, density based, model based or projection based. These
studies focus on the evaluation of outlier detection methods, which is much needed in the contemporary
literature due to the sheer volume of new methods being developed. However, they do not address the
critical algorithm selection problem for outlier detection, i.e. given a dataset which outlier detection
method(s) is expected to give the best performance, and why? This is one of the main contributions of
our work.
The algorithm selection problem has been extensively studied in various research communities (Rice
1976, Smith-Miles 2009) for challenges such as meta-learning in machine learning (Brazdil et al. 2008),
black-box optimization (Bischl et al. 2012), and algorithm portfolio selection in SAT solvers (Leyton-
Brown et al. 2003). Smith-Miles and co-authors have extended the Rice (1976) framework for algorithm
selection and developed a methodology known as instance space analysis to visualize and gain insights
into the strengths and weaknesses of algorithms across the broadest possible set of test instances, rather
than a finite set of common benchmarks (Smith-Miles et al. 2014, Smith-Miles & Bowly 2015, Muñoz
et al. 2018). We will use this framework to gain an understanding of strengths and weaknesses of the
outlier detection methods discussed by Campos et al. (2016).
In addition to tackling the algorithm selection problem for outlier detection for the first time, we
will also focus on a topic that is generally overlooked, namely normalization. One of the main pre-
processing steps in outlier detection is normalizing or standardizing the data. Traditionally, the min-max
normalization method, which normalizes each column of a dataset to the interval [0, 1], is used routinely
in outlier detection (Campos et al. 2016, Goldstein & Uchida 2016). However, there are many different
methods that can be used for normalizing or standardizing the data. Whether the choice of normalization
method impacts the effectiveness of the outlier detection method is a question which has not been given
much attention. In fact, we have not come across any studies which focus on the effect of normalization
on outlier detection. We explore this relationship and show that the performance of outlier methods
can change significantly depending on the normalization method. This is a further contribution of our
work. In addition, we make available a repository of more than 12000 datasets, which is generated from
approximately 200 source datasets, providing a comprehensive basis for future evaluation of outlier
detection methods.
This paper is organized as follows. We start by investigating the impact of normalization on outlier
detection methods in Section 2. Firstly, from a theoretical perspective we present mathematical argu-
ments in Section 2.1 to show how various normalization schemes can change the nearest neighbours
and densities of the data, and hence why we intuitively expect that the impact of normalization can be
significant depending on the definition of an outlier adopted by an algorithm. In Section 2.2 we present
comprehensive experimental evidence that this theoretical sensitivity is observed in practice across a set
of over 12000 datasets. We show that both the normalization method and the outlier detection method,
in combination, have variable performance across the datasets, suggesting that some datasets possess
properties that some methods can exploit well, while others are not as well suited. This experimen-
tal and theoretical evidence then motivates the remainder of the paper, where we adapt instance space
analysis to gain insights into the strengths and weaknesses of outlier detection methods. Section 3 first
describes the methodological framework for the algorithm selection problem and instance space analysis
introduced by Smith-Miles et al. (2014). This section then discusses a novel set of features that capture
properties of outlier detection datasets, and shows how these features can be used to predict performance
of outlier detection methods. The instance space is then constructed, and objective assessment of outlier
detection method strengths and weaknesses is presented in the form of footprint analysis. The instance
space shows that the datasets considered in this study are more diverse and comprehensive than previ-
ous studies, and suitable outlier detection methods are identified for various parts of the instance space.
Finally, in Section 4 we present the conclusions of this work and future avenues of research.
1. Minimum and maximum normalization (Min-Max)
Each column x is transformed to (x − min(x)) / (max(x) − min(x)), where min(x) and max(x) are the
minimum and maximum values of x respectively.
2. Mean and standard deviation normalization (Mean-SD)
Each column x is transformed to (x − mean(x)) / sd(x), where mean(x) and sd(x) are the mean and
standard deviation values of x respectively.
3. Median and the IQR normalization (Median-IQR)
Each column x is transformed to (x − median(x)) / IQR(x), where median(x) and IQR(x) are the median
and IQR of x respectively.
4. Median and median absolute deviation normalization (Median-MAD)
Each column x is transformed to (x − median(x)) / MAD(x), where MAD(x) = median(|x − median(x)|).
We note that Min-Max and Mean-SD are influenced by outliers while Median-IQR and Median-
MAD are more robust to outliers. A detailed account of the usage of robust statistics in outlier detection
is covered by Rousseeuw & Hubert (2017).
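For concreteness, the four schemes can be written as a short Python sketch (illustrative code of our own, not the implementation used in the experiments):

```python
# Column-wise normalization schemes; X is an n x d numeric array.
import numpy as np

def normalize(X, method="min-max"):
    X = np.asarray(X, dtype=float)
    if method == "min-max":
        loc, scale = X.min(axis=0), X.max(axis=0) - X.min(axis=0)
    elif method == "mean-sd":
        loc, scale = X.mean(axis=0), X.std(axis=0, ddof=1)
    elif method == "median-iqr":
        q1, q3 = np.percentile(X, [25, 75], axis=0)
        loc, scale = np.median(X, axis=0), q3 - q1
    elif method == "median-mad":
        loc = np.median(X, axis=0)
        scale = np.median(np.abs(X - loc), axis=0)   # MAD(x)
    else:
        raise ValueError(f"unknown method: {method}")
    return (X - loc) / scale
```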
Generally, normalization scales axes differently causing some axes to compress and some axes to
expand, thus changing the nearest neighbour structure. As nearest neighbour distances play an important
role in many outlier detection techniques, such normalization impacts outlier detection method results,
as will be explained theoretically and then demonstrated experimentally in the following sections.
Here x_i^* is the normalized observation, µ is either the minimum, mean or median of the data and
S is a diagonal matrix containing column-wise range, standard deviation, IQR or MAD. Let S =
diag(s_1, s_2, s_3, ..., s_d). Let dist(x, y) denote the Euclidean distance between the two points x and
y, i.e. dist(x, y) = ‖x − y‖, where we use the L2 norm. So we have

dist(x_i^*, x_j^*)^2 = ‖S^{-1}(x_i − x_j)‖^2 = \sum_{m=1}^{d} (x_{im} − x_{jm})^2 / s_m^2 ,

giving us

dist(x_i^*, x_j^*)^2 = ⟨w, y_{ij}⟩ ,    (4)

where w = (1/s_1^2, 1/s_2^2, ..., 1/s_d^2)^T and y_{ij} = ((x_{i1} − x_{j1})^2, (x_{i2} − x_{j2})^2, ..., (x_{id} − x_{jd})^2)^T.
The advantage of this representation is that we can explore the effect of normalization without re-
stricting ourselves to the normalized space. That is, suppose we want to compare two normalization
methods given by matrices S1 and S2 . By working with the corresponding vectors w1 and w2 we can
stay in the space of yij for different normalization methods. Here, the space where yij lives is differ-
ent from the space of xi and xj . From equation (4) the components of yij correspond to the squared
component differences between xi and xj . As such the vector yij cannot contain negative values, i.e.
yij ∈ Rd+ where Rd+ is the positive orthant or hyperoctant in Rd . Similarly, w has positive coordinates
and w ∈ Rd+ \{0}.
To understand more about the space of yij , we give it separate notation. Let us denote the space
of yij by Y and the space of observations by O. It is true that Y is isomorphic to Rd+ and O to Rd .
However, because the original observations in O×O map to Y in a slightly different way when compared
with the standard partitioning of Rd+ from Rd , it makes sense to detach Y from Rd+ and O from Rd for
a moment. From (4) we have
y_{ij} = ( (x_{i1} − x_{j1})^2 , (x_{i2} − x_{j2})^2 , . . . , (x_{id} − x_{jd})^2 )^T ,
       = ( (x_{i1} + η_1 − x_{j1} − η_1)^2 , (x_{i2} + η_2 − x_{j2} − η_2)^2 , . . . , (x_{id} + η_d − x_{jd} − η_d)^2 )^T .
As such, if the points xi and xj give rise to yij , so do the points xi + η and xj + η for any η ∈ Rd .
Thus, the mapping from O × O to Y is translation-invariant. This shows that Y is obtained from O × O
in a different way to the standard partitioning of Rd+ from Rd . However, we will not use the translation
invariant property of Y in the next sections.
If x_{l_1}^* is the nearest neighbour of x_i^*, we define A_{i2} as A_{i2} = A_{i1}\{l_1} (where A_{i1} denotes the set of
indices of all other observations), giving us the second nearest neighbour distance

nnd(x_i^*, 2) = min_{j ∈ A_{i2}} √⟨w, y_{ij}⟩ .

More generally, let x_{l_{k−1}}^* be the (k − 1)st nearest neighbour of x_i^* and A_{ik} = A_{i(k−1)}\{l_{k−1}}. Proceeding in a
similar way, the k-th nearest neighbour distance can be written as

nnd(x_i^*, k) = min_{j ∈ A_{ik}} √⟨w, y_{ij}⟩ .    (9)
As the outlier detection method KNN declares the points with the highest k-nearest neighbour distance
as outliers we write an expression for the point with the highest knn distance:
point with highest knn distance = argmax_i ( nnd(x_i^*, k) ) = argmax_i ( min_{j ∈ A_{ik}} √⟨w, y_{ij}⟩ ) .    (10)
From (10) we can see that w has a role in determining the data-point with the highest knn distance.
A different w may produce a different data-point having the highest knn distance. Therefore, the method
of normalization plays an important role in nearest neighbour computations.
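The following small numerical illustration (our own synthetic data and function names, with k chosen arbitrarily) computes the k-nearest neighbour distances of equation (9) for two normalization vectors w and checks whether the point attaining the maximum in (10) changes:

```python
import numpy as np

def knn_distances(X, w, k):
    """k-th nearest neighbour distance of every point, computed via <w, y_ij>."""
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2     # y_ij for all pairs (n x n x d)
    d2 = diff2 @ w                                   # <w, y_ij>
    np.fill_diagonal(d2, np.inf)                     # exclude the point itself
    return np.sqrt(np.sort(d2, axis=1)[:, k - 1])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 50.0])   # very different column scales
k = 5

w_minmax = 1.0 / (X.max(axis=0) - X.min(axis=0)) ** 2                  # Min-Max weights
w_iqr = 1.0 / np.subtract(*np.percentile(X, [75, 25], axis=0)) ** 2    # Median-IQR weights

# The index of the point with the highest knn distance may change with w.
print(np.argmax(knn_distances(X, w_minmax, k)), np.argmax(knn_distances(X, w_iqr, k)))
```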
Figure 2: The vectors yoa , ynm , w with angles θoa and θnm .
Proposition 2.1. Let x_o be an outlier and x_n a non-outlier. Let x_a and x_m be x_o's and x_n's respective k-
nearest neighbours according to the normalization scheme defined by w. Let θ_oa and θ_nm be the angles
that y_oa, y_nm ∈ Y make with w. If

‖y_oa‖ / ‖y_nm‖ < cos θ_nm / cos θ_oa ,

then

nnd(x_o^*, k) < nnd(x_n^*, k) ,

where x_o^* and x_n^* are the normalized coordinates of x_o and x_n according to w. Thus a non-outlier has
a higher knn distance than an outlier with respect to w.
As illustrated in Figure 2, the angle between the normalization vector w and ynm has an effect on
the ordering of k-nearest neighbour distances. Thus the normalization vector w can mask outliers and
favour non-outliers, reducing the performance of outlier detection methods.
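A one-line derivation of Proposition 2.1 (a sketch; it uses only equation (9), the polar form of the inner product, and the fact that w, y_oa and y_nm lie in the positive orthant, so both cosines are positive):

```latex
% nnd(x_o^*,k)^2 = <w, y_oa> and nnd(x_n^*,k)^2 = <w, y_nm> by equation (9), so
\mathrm{nnd}(x_o^*,k)^2 = \|w\|\,\|y_{oa}\|\cos\theta_{oa}, \qquad
\mathrm{nnd}(x_n^*,k)^2 = \|w\|\,\|y_{nm}\|\cos\theta_{nm}.
% Dividing, the stated condition is equivalent to the ordering of the two distances:
\frac{\|y_{oa}\|}{\|y_{nm}\|} < \frac{\cos\theta_{nm}}{\cos\theta_{oa}}
\iff \|y_{oa}\|\cos\theta_{oa} < \|y_{nm}\|\cos\theta_{nm}
\iff \mathrm{nnd}(x_o^*,k) < \mathrm{nnd}(x_n^*,k).
```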
[Figure 3: percentage of data-points whose nearest neighbours change with the normalization method, plotted against dimension, for randomly generated data with and without outliers.]
density(x_i^*) = #{ x_j^* : ‖x_j^* − x_i^*‖^2 ≤ 1 }
             = #{ x_j^* : dist(x_j^*, x_i^*)^2 ≤ 1 }
             = #{ x_j : ⟨w, y_{ij}⟩ ≤ 1 } ,
where the last equality follows from (4).
Again we see that the vector w, which comes from the method of normalization, plays a role in
determining the density of data-points. As many outlier detection methods are based on density estimates, we see
that normalization affects density based outlier detection methods as well.
We now show that this theoretical sensitivity is observed when using common benchmark datasets,
and that the performance of outlier detection methods depends on normalization as well as characteristics
of the datasets.
changes from 5% to 25% as the dimension changes from 2 to 20. That is, for a 20-dimensional dataset
without outliers, the nearest neighbours of 25% of the data depend on the method of normalization.
Similarly, for a 20-dimensional dataset with one outlier, the nearest neighbours of 30% of the data depend
on the normalization method. This observation has the important implication that as the dimensionality
of the dataset increases while keeping the number of observations constant, the nearest neighbours of a
data-point are highly sensitive to the method of normalization. Thus, given an outlier detection problem,
the normalization method as well as the outlier detection technique play an important role. Of course, it
is important to validate this hypothesis on other datasets, rather than randomly generated data, to see if
structured data from benchmark datasets is also sensitive to normalization.
In the remainder of this section we evaluate the impact of normalization on 12 outlier detection
methods coupled with the above-mentioned 4 normalization methods, across a set of over 12000 datasets
described below.
2.2.1 Datasets
We generate outlier detection datasets by adopting the approach used in recent studies (Campos et al.
2016, Goldstein & Uchida 2016), which takes a classification dataset and down-samples the minority
class to label outliers. Campos et al. (2016) start with 23 datasets, from which different variants are
obtained mainly by downsampling the outlier class at rates 20%, 10%, 5%, 2% and transforming cate-
gorical variables to numeric values. This process results in approximately 1000 datasets. Goldstein &
Uchida (2016) use 10 datasets for their evaluation study, with some overlap with Campos et al. (2016). In
order to obtain a more comprehensive and diverse set of benchmark test datasets, we extend the approach
to utilise a set of 170 base classification datasets recently used by Muñoz et al. (2018) obtained primarily
from the UCI machine learning repository. These classification datasets were not intended for outlier
detection evaluation, and so the following issues need to be addressed to generate meaningful outlier
detection benchmarks:
1. Labelling of outliers - as outliers are rare events, the proportion of outliers is typically 5% or less
for most outlier datasets. In contrast, the classification datasets sometimes have more than 2 classes
and the proportion of observations belonging to each class is often similar and much larger than
10%.
2. Categorical variables - while some classification algorithms such as random forests and decision
trees are capable of handling categorical variables, most outlier detection methods need distances
or densities to find outliers, which requires only numerical attributes.
3. Duplicate observations and missing values - the classification datasets contain data challenges that
we wish to eliminate at this stage to focus on understanding how the underlying mechanism of
outlier detection behaves in the presence of complete data.
Therefore, we modify the 170 classification datasets used in Muñoz et al. (2018) to make them more
applicable for outlier detection, as described below:
Down-sampling: If a dataset has observations belonging to k classes, then each class in turn is desig-
nated the outlier-class and observations belonging to that class are down-sampled, while the observations
belonging to the other k − 1 classes are deemed non-outliers. We conduct the down-sampling such that
the percentage of outliers is p% for p ∈ {2, 5}. For a given outlier class and for each value of p, the
down-sampling is randomly carried out 10 times. Hence, for a given outlier class there are 20 down-
sampled files generated. This procedure is done for all classes in the dataset, e.g. if a base classification
dataset has 3 classes, then there are 3 × 2 × 10 down-sampled files generated from that base dataset.
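A sketch of this down-sampling step (the function and variable names are ours; the repetition over p ∈ {2, 5} and the 10 random draws are indicated in the comment):

```python
import numpy as np

def downsample_outlier_class(X, y, outlier_class, p, rng):
    """Keep all non-outlier-class observations; subsample the outlier class to p% of the result."""
    normal = np.flatnonzero(y != outlier_class)
    outlier = np.flatnonzero(y == outlier_class)
    # n_out / (len(normal) + n_out) = p/100  =>  n_out = p * len(normal) / (100 - p)
    n_out = int(round(p * len(normal) / (100 - p)))
    keep = np.concatenate([normal, rng.choice(outlier, size=n_out, replace=False)])
    labels = (y[keep] == outlier_class).astype(int)    # 1 = outlier, 0 = non-outlier
    return X[keep], labels

# For each class designated as the outlier class:
#   for p in (2, 5): for rep in range(10): downsample_outlier_class(X, y, c, p, rng)
```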
Categorical Variables: While a range of techniques for transforming categorical variables to numer-
ical variables are available, there is little consensus on which approach is best suited for a given task.
For each source down-sampled dataset, we create two versions: one with categorical variables removed,
and one with categorical variables converted using the method of inverse data frequency (Campos et al.
2016), which creates a new variable IDF (x) = ln(N/nx ) where N is the total number of observations
in the dataset and nx is the number of times the categorical variable takes the value x. IDF maps the
rarer values to higher numbers and common values to lower numbers.
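As a small illustration, the IDF transform of a single categorical column can be sketched as follows (assuming a pandas Series of category values):

```python
import numpy as np
import pandas as pd

def idf_encode(col: pd.Series) -> pd.Series:
    """Replace each categorical value x by IDF(x) = ln(N / n_x)."""
    counts = col.value_counts()          # n_x for every distinct value x
    return np.log(len(col) / col.map(counts))
```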
Duplicate observations: As the nearest neighbour distance for a duplicate observation is zero, this
can create division by zero errors causing numerical instability when computing densities and other
metrics. As such, we remove duplicate observations from the datasets.
Missing values: We use the method in Campos et al. (2016) to treat missing values. For each attribute
in each dataset, the number of missing values are computed. If an attribute has less than 10% of missing
values, the observations containing the missing values are removed, otherwise, the attribute is removed.
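A sketch of this rule (assuming a pandas data frame and the 10% threshold stated above):

```python
import pandas as pd

def treat_missing(df: pd.DataFrame, threshold: float = 0.10) -> pd.DataFrame:
    """Drop attributes with >= 10% missing values, then drop rows with remaining missing values."""
    frac_missing = df.isna().mean()
    df = df.drop(columns=frac_missing[frac_missing >= threshold].index)
    return df.dropna()
```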
The above procedures were followed on the 170 base classification datasets used in (Muñoz et al.
2018). In addition, we augmented our benchmark collection by considering the 1000 datasets used in
Campos et al. (2016) and selected the ones with 5% and 2% outliers (but not the ones with 10% and
20% outliers). With the datasets from (Campos et al. 2016, Goldstein & Uchida 2016), along with the
datasets we prepared from Muñoz et al. (2018), our final set of benchmarks for this experimental study
contains approximately 12200 datasets suitable for outlier detection evaluation.
Figure 4: The dataset is plotted in figure (a) with the outlier at (3, 2). The
outlier score ratios are plotted in figure (b) for both KNN and LOF. We see
the outlier score ratio decrease with increasing k for most k values.
Thus, we have a structure where the outlier detection method, normalization method and the source
dataset play a combined role in influencing performance that we seek to understand.
We use two mixed models to ascertain the significance of normalization. The first model uses outlier
detection methods and normalization methods as fixed effects, and source datasets as a random effect.
We do not have any interaction terms for this model. We write the first model (using the R formula
notation) as:
y ∼ Out + Norm + (1|Source) . (19)
Here y is the performance, Out is the outlier detection method, Norm is the normalization method and
Source is the source dataset. The term (1|Source) means that source is a random effect and the intercept
changes according to the source dataset. We can also write this model in the following way:
yijkl = µ + ci + dj + hk + εijkl , (20)
where yijkl is the performance of outlier detection method i using normalization method j on a dataset
variant l from source k. The term µ denotes the intercept, ci the coefficient of the ith outlier detection
method, dj the coefficient of the j th normalization method, hk the random effect due to the source dataset,
and εijkl the error term. While the fixed effects coefficients ci and dj are parameters, the random effects
coefficients hk are modelled as random variables, i.e. hk ∼ N(0, σ_h^2). The errors εijkl are assumed to
be normally distributed, i.e. εijkl ∼ N(0, σ_ε^2).
The second model uses an additional interaction term as follows:
y ∼ Out ∗ Norm + (1|Source) . (21)
We can also write the second model as follows:
yijkl = µ + gij + hk + εijkl . (22)
The difference between the first and the second model results from the interaction term, which gives
rise to a separate regression coefficient gij for each pair of outlier and normalization methods, rather
than assuming their effects are additive. The second model can be used to determine if normalization
affects each outlier detection method differently.
As the two models are nested, we perform a likelihood-ratio test and obtain a p-value of 2.2 × 10^{-16} in
favour of the second model, making it clear that there are significant interactions between normalization
methods and outlier detection methods. In other words, the effect of normalization is different from one
outlier method to another.
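The models (19) and (21) are written in R formula notation; an equivalent sketch in Python with statsmodels (assuming a long-format data frame `data` with columns y, Out, Norm and Source) is:

```python
import statsmodels.formula.api as smf
from scipy import stats

m1 = smf.mixedlm("y ~ Out + Norm", data, groups=data["Source"]).fit(reml=False)   # model (19)
m2 = smf.mixedlm("y ~ Out * Norm", data, groups=data["Source"]).fit(reml=False)   # model (21)

# Likelihood-ratio test of the nested models.
lr = 2 * (m2.llf - m1.llf)
df_diff = m2.fe_params.size - m1.fe_params.size    # extra interaction coefficients
p_value = stats.chi2.sf(lr, df_diff)
```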
Figure 5 shows the effect of normalization methods on each outlier detection method using plotting
tools described in Breheny & Burchett (2012). The letters D, Q, M and X denote the normalization
methods Mean-SD, Median-IQR, Median-MAD and Min-Max respectively. The plotted value for each
normalization and outlier method is yij. = µ + gij + h̄ from equation (22) where h̄ denotes the mode
of hk , pertaining to the source connectionist vowel. For any other source, the values yij. are a vertical
translation of the values shown in Figure 5. A higher value of yij. denotes better performance while a lower
value denotes poorer performance. The main quantities of interest of the second model are the
values yij. . As such, we make the following remarks about yij. using Figure 5.
1. KNNW has the highest yij. values, making it the most effective outlier method on average.
2. The three best outlier methods are KNNW, KNN and FAST ABOD.
3. KDEOS has the lowest yij. values, making it the least effective outlier method on average. The
second least effective outlier method is INFLO.
4. For most outlier methods, Min-Max and Mean-SD outperform Median-IQR and Median-MAD.
5. For most outlier methods, Min-Max and Mean-SD give similar yij. values, and Median-IQR and
Median-MAD also give similar yij. values.
6. LOF, LOOP and SIMLOF are quite similar in terms of yij. .
7. The effect of the outlier method on yij. is greater than the effect of the normalization method.
As a result of these insights we only consider normalization methods Min-Max and Median-IQR in
the following sections, so as to elicit higher contrasts in performance arising from normalization.
[Figure 5: yij. values (vertical axis, approximately 0.55 to 0.65) for each outlier detection method (COF, FAST ABOD, INFLO, KDEOS, KNN, KNNW, LDF, LDOF, LOF, LOOP, ODIN, SIMLOF) under the normalization methods D, Q, M and X.]
Table 1: Percentage of datasets for which Median-IQR or Min-Max gives
better performance for each outlier method.
of dataset and outlier detection method? For example, if the performance of an outlier method α1 on
dataset x fluctuates due to normalization, will a different outlier method α2 on x give fluctuating re-
sults as well? We start this investigation by offering a definition of the dataset attribute “sensitivity to
normalization”.
Definition 2.2. For a given dataset, we say that an outlier detection method is ξ−sensitive to normal-
ization if the difference between the maximum performance and the minimum performance across all
normalization schemes for that outlier detection method is greater than ξ.
We use this definition with ξ = 0.05, 0.10, 0.15 and 0.20 to investigate the effects of normalization.
Table 2 reports the number of our datasets that are ξ-sensitive to normalization for each outlier detection
method. By Definition 2.2, the number of datasets sensitive to normalization decreases as ξ increases.
By looking at Table 2 we seek to identify if there are common datasets that are sensitive to normal-
ization for multiple outlier detection methods. To this end, we compute the number of datasets that are
sensitive to normalization for exactly n outlier methods for n ∈ {0, 1, 2, . . . , 12}. Table 3 summarizes
these results for ξ = 0.05, 0.10, 0.15 and 0.20. For ξ = 0.05 we see that 2029 datasets are sensitive
to normalization for all 12 outlier methods and 412 datasets are not sensitive to normalization for any
outlier method. In addition to these two extremes, there are different numbers of datasets sensitive to
normalization for n number of outlier methods for n ∈ {1, . . . , 11}. In particular, there are 1002 datasets
that are sensitive to normalization for exactly 1 outlier method. Similarly there are 656 datasets that are
Table 3: Number of datasets that are ξ-sensitive to normalization for n
outlier methods.
sensitive to normalization for exactly 2 outlier methods for ξ = 0.05. Again, each dataset may have a
different combination of outlier methods that are sensitive to normalization.
From Table 3 we observe a subtle interplay of dataset characteristics and outlier detection methods
affecting the sensitivity to normalization. In order to understand which outlier methods are more sensitive
to normalization, we examine the datasets which are sensitive to normalization for only one outlier
method. Table 4 contains these results.
From Table 4 we see that FAST ABOD is the method most sensitive to normalization followed by
KDEOS. This is consistent for ξ = 0.05, 0.10, 0.15 and 0.20. The outlier detection methods KNNW
and SIMLOF are the least sensitive to normalization, with LOOP and KNN achieving comparable
results. This outcome further validates the results of the second mixed model in section 2.3 given by
equation (21). In other words, we explicitly see evidence of normalization affecting outlier detection
methods differently.
This section has provided comprehensive evidence, both theoretical and experimental, that normal-
ization can have significant impact on some outlier detection methods, and that the complex interplay
of dataset characteristics, outlier detection method and normalization scheme makes it challenging to
ensure the best algorithm is selected for a given dataset. We now turn to recent advances in instance
space analysis to address the challenges of this algorithm selection problem.
3.1 Features
Given that our outlier detection datasets have been generated by down-sampling classification datasets,
we are able to borrow many of the features that have been used to summarize interesting properties
of classification datasets, and then add some additional features that are unique to the outlier detection
challenge. We start with a set of standard classification meta-features categorised as follows:
1. Simple features - These are related to the basic structure of a dataset. For our study these include
the number of observations, number of attributes, ratio of observations to attributes, number of
binary attributes, number of numerical attributes and the ratio of binary to numerical attributes.
[Figure 6: the instance space analysis framework for the algorithm selection problem: a subset I of the problem space P is summarized by features f(x) ∈ F, algorithms α ∈ A are evaluated to give performance y ∈ Y, dimensionality reduction and visualisation g(f(x)) ∈ R² defines the instance space, from which the selection mapping α* = S(g(f(x))) is learned and algorithm footprints ϕ(y(x, α)) are computed.]
2. Statistical features - These include statistical properties of skewness, kurtosis, mean to standard
deviation ratio, and IQR to standard deviation ratio for all attributes of a dataset, i.e. if a dataset
has d attributes, then there are d values for each statistical property. For skewness and kurtosis
we include the mean, median, maximum and the 95th percentile of these d values as features.
For the IQR to standard deviation ratio we include the maximum and the 95th percentile. We also
include the average mean to standard deviation ratio. As a correlation measure, we compute the
correlation between all attributes and include the mean absolute correlation. We also perform
Principal Component Analysis and include the standard deviation explained by the first Principal
Component.
3. Information theoretic features - for measures of the amount of information present in a dataset, we
compute the entropy of each attribute and include the mean in our feature set. Also we include the
entropy of the whole dataset and the mutual information.
The above set of features are generic features which measure various aspects of a dataset but are not
particularly tailored towards outlier detection. While it is relevant for us to consider these features, they
shed little light on the outlier structure of a dataset.
4. Outlier features - In order to make our feature set richer we include density-based, residual-based and
graph-based features. We also include features that are based on section 2.1, which are related to the
normalization vector w. We compute these features for different subsets of the dataset, namely outliers,
non-outliers and proxi-outliers. We define proxi-outliers as data points that are either far away from
“normal” data or residing in low density regions. If there is a significant overlap between proxi-outliers
and actual outliers, then we expect outlier detection methods to perform well on such datasets. Formally,
we define proxi-outliers as data-points which have the top 3% of knn-distances, for k defined in (18).
The density, residual and graph-based features we consider are all ratios. It is either a ratio between
proxi-outliers and outliers, or a ratio between outliers and non-outliers. An example is the ratio between
average density of non-outliers and average density of outliers. We explain the outlier features below:
i) Density based features
The density based features are computed either using the density based clustering algorithm DB-
SCAN (Ester et al. 1996) or kernel density estimation as follows:
(a) DBSCAN features
We perform principal component analysis (PCA) and use DBSCAN for clustering in a lower-
dimensional space. We focus on data-points that either belong to very small clusters, or do
not belong to any cluster. Let us call these points dbscan-proxi-outliers, henceforth named
dbscan-proxies. Once again, if dbscan-proxies are outliers then we expect density based out-
lier algorithms to perform well on such datasets. As features we include i. the percentage of
dbscan-proxies that are outliers, ii. the percentage of dbscan-proxies that are outliers which
do not belong to any cluster and iii. the percentage of dbscan-proxies that are outliers which
belong to very small clusters.
(b) Kernel density estimate (KDE) related features
Here too, we perform PCA and compute kernel density estimates (KDE) on two dimensional
PC spaces to reduce computational burden. We compute KDE as detailed by Duong (2018) for
the first 10 principal component (PC) pairs, and for each PC pair we find proxi-outliers. We
compute the mean, median, standard deviation, IQR, minimum, maximum, 5th , and 95th per-
centiles of KDE values for outliers, non-outliers, proxi-outliers and non-proxi-outliers in each
PC space. Next we take the computed summary statistics ratios of outliers to non-outliers,
and proxi-outliers to non-proxi-outliers; for example the ratio between the mean KDE of
non-outliers and the mean KDE of outliers. These ratios are computed for the first 10 two-
dimensional PC spaces. As features, we include the average of each ratio for the set of PC
spaces. In addition we also include the percentage of proxi-outliers that are outliers in our
feature set.
(c) Local density features
We compute all features explained in i)(b) using a local density measure based on KDE instead
of using KDE itself. The local density is computed by dividing the KDE value of a point by the
mean KDE value of its k-nearest neighbours. Here for each data-point its k-nearest neighbours
are computed with k as in equation (18).
ii) Residual based features
These features include summary statistics of residuals from linear models. First, we fit a series of
linear models by randomly choosing the dependent variable and treating the rest of attributes as
independent variables. For each model, data-points which have the top 3% of absolute residual
values are deemed as proxi-outliers. Similar to KDE features, the mean, median, standard deviation,
IQR, minimum, maximum, 5th , and 95th percentiles of residual values for outliers, non-outliers,
proxi-outliers and non-proxi-outliers are computed. Next the respective ratios are computed for
each linear model. Finally, the average of each ratio for all the models is included in the feature set.
We also include the percentage of proxi-outliers that are outliers as a feature.
iii) Graph based features
These features are based on graph-theoretic measures such as vertex degree, shortest path and con-
nected components. First, we convert the dataset to a directed-graph based on each data-points’
k-nearest neighbours using the software igraph (Csardi & Nepusz 2006). Next, we compute the
degree of the vertices and label ones with the lowest degree as proxi-outliers. Then, similar to
residual based features, we find the summary statistics of degree values for outliers, non-outliers,
proxi-outliers and non-proxi-outliers and include ratios of outliers to non-outliers and proxi-outliers
to non-proxi-outliers in our feature set. We also include the percentage of proxi-outliers which
are actual outliers. Another set of graph features come from connected components. We compute
the number of vertices in each connected component and, similar to degree calculations, compute
summary statistics of these values for outliers, non-outliers, proxi-outliers and non-proxi-outliers
and include the ratios as above. We also compute the shortest distances from outlier vertices to
non-outlier vertices. Here the shortest distance from vertex a to vertex b is the minimum number
of edges that connect a and b. We include some statistics about these distances: the percentage of
outliers which have infinite shortest distance to all non-outlier vertices, i.e. outlier vertices which
are not connected. For outliers that have finite shortest distances to some non-outlying vertex, we
compute the percentage of outliers for which the shortest distance is 1.
iv) Normalization related features
These features are related to quantities described in section 2.1 and Proposition 2.1. First, we
compute the normalization vector w as in equation (4) for Min-Max and Median-IQR. Next, we
compute vectors yij as in equation (4) for outliers and non-outliers based on each data-point’s k-
nearest neighbours. Then we compute hw, yij i for the two different normalization vectors w that
correspond to Min-Max and Median-IQR. The purpose of this exercise is to compare the quantity
hw, yij i obtained from outliers with that of non-outliers for each normalization technique. So we
compute hw, yij i ratios of outliers to non-outliers and include the minimum, maximum, mean,
median, standard deviation and IQR as features. We also include percentage of ratio values less
than 1, as this is a quantity of interest and relates to equation (15) in Proposition 2.1.
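As an illustration of the proxi-outlier construction that several of the features above rely on, a sketch of the "percentage of proxi-outliers that are actual outliers" feature (the function name is ours; k is as in equation (18)):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pct_proxi_outliers_that_are_outliers(X, is_outlier, k):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own 0-th neighbour
    dist, _ = nn.kneighbors(X)
    knn_dist = dist[:, -1]                            # k-th nearest neighbour distance
    proxi = knn_dist >= np.quantile(knn_dist, 0.97)   # top 3% of knn distances
    return 100.0 * is_outlier[proxi].mean()
```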
This concludes the list of features - both classification and outlier based - that we compute for each
dataset. From this list of features, those that are based on density, residuals and graphs depend on nearest
neighbours and as such are sensitive to the method of normalization. Hence we calculate features for
each of the two selected normalization methods that we have earlier shown are not correlated, namely
Min-Max and Median-IQR. The choice of these two is justified since: 1. Min-Max is the most commonly
used normalization method for outlier detection, 2. Median-IQR is one of the methods which is robust
to outliers. By combining density, residual and graph based features computed on datasets normalized
by 2 different methods with standard features and normalization based features we end up with a total of
346 candidate features. Table 5 provides a summary of features by category.
Before we can proceed with the Instance Space Analysis, it is important to validate that the features
contain sufficient information about the similarities, differences and difficulties of datasets that they
are reliable as instance summaries. To this end, we first demonstrate that reasonable accuracy can be
obtained using the features to predict sensitivity to normalization, and outlier detection method perfor-
mance, given the characteristics of a dataset summarized by the features.
12 random forest classifiers (Liaw & Wiener 2002), one for each outlier method, with all 346 features
as input to predict the binary output of ξ-sensitivity to normalization with ξ = 0.05. The results are
given in Table 6. As shown in Table 6 prediction accuracy of sensitivity to normalization ranges from
71% to 80% with FAST ABOD, which was the method most sensitive to normalization, achieving the
highest prediction accuracy. Also, it is insightful to compare these prediction accuracies with the actual
percentages of datasets that are sensitive to normalization, which is given in column 2 of Table 6. In
general, we can correctly predict if a dataset is sensitive to normalization with respect to an outlier
detection method with an accuracy greater than 70%, suggesting that the feature set must contain some
useful summaries of relevant dataset properties.
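One such per-method classifier can be sketched as follows (using scikit-learn rather than the R randomForest package of Liaw & Wiener (2002); `features` and `sensitive` denote the assumed feature matrix and binary sensitivity labels, and the cross-validation setting here is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=500, random_state=0)
accuracy = cross_val_score(clf, features, sensitive, cv=5, scoring="accuracy").mean()
```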
Next, we investigate which normalization method gives better performance if a dataset is sensitive to
normalization for a given outlier detection method. We only consider the normalization methods Min-
Max and Median-IQR, and datasets that are ξ-sensitive to normalization for each outlier method with
ξ = 0.05, 0.10, and 0.15. Using features of ξ-sensitive datasets as input to a random forest classifier
using 5-fold cross-validation, we predict the normalization method that gives better performance with
results shown in Table 7. From Table 7 we see that prediction accuracy generally increases with ξ. This
is to be expected because it is easier for the classifier to predict the preferred normalization method as the
sensitivity to normalization increases. Also, prediction accuracy is higher for KNN, KNNW and FAST
ABOD than for other outlier methods.
From the results of the mixed models in section 2.3 we know that normalization methods affect
outlier methods differently. As such, one of the reasons for high fluctuations in prediction accuracy
seen in Table 7 may be because the set of features do not sufficiently explain these effects for all outlier
methods equally. Indeed, the features were pooled together with the intent of discovering strengths and
weaknesses of outlier detection methods, not of normalization methods. Only a handful of features
focus on normalization as seen in Table 5. When comparing with Table 6 which predicts the sensitivity
to normalization, Table 7 has higher contrasts in terms of accuracy. However, from both these tables we
see that we can reasonably predict if a dataset is sensitive to normalization given an outlier method, and
if it is sensitive to normalization which normalization method to recommend.
In effect we are proposing a strategy to select the normalization method to maximize performance.
First for a preferred outlier method, we find if a dataset is sensitive to normalization using features and a
classifier. If it is sensitive, then we find which normalization method gives better performance. One may
ask how one selects the preferred outlier method. This question will be answered in detail in Section 3.4.
Table 7: Best normalization method prediction accuracy
algorithm performance). Therefore, we follow a systematic procedure to select a small subset of features
that best meet these requirements, using a subset of 2018 instances for which there are three or fewer
well-performing algorithms. This choice is somewhat arbitrary, motivated by the fact that we are less
interested in studying datasets where many algorithms perform well or poorly, given our aim is to under-
stand strengths and weaknesses of outlier detection methods. We will later project the full set of datasets
into the constructed instance space, but consider this reduced set of datasets sufficient for performance
prediction and instance space construction.
We pre-process the data such that it becomes amenable to machine learning and dimensionality
projection methods. Given that some features are ratios, it is often the case that a feature can
produce excessively large or infinite values. Hence, we bound all the features between their median
plus or minus five times their interquartile range. Any not-a-number value is converted to zero, while
all infinite values are set equal to the bounds. Next, we apply a Box-Cox transformation to each feature to
stabilize the variance and make the data more normal-like. Finally, we apply a z-transform to standardize
each feature.
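A sketch of this per-feature pre-processing (the positivity shift before the Box-Cox step is our own workaround, since the transformation requires positive inputs; the paper does not state how this is handled):

```python
import numpy as np
from scipy import stats

def preprocess_feature(x):
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    iqr = np.nanpercentile(x, 75) - np.nanpercentile(x, 25)
    x = np.where(np.isnan(x), 0.0, x)                # not-a-number values become zero
    x = np.clip(x, med - 5 * iqr, med + 5 * iqr)     # bound (also maps +/- infinity to the bounds)
    x, _ = stats.boxcox(x - x.min() + 1e-6)          # shift so that all values are positive
    return (x - x.mean()) / x.std()                  # z-transform
```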
We start with the full set of 346 features. First, we determine whether the feature has unique values
for most instances. A feature provides little information if it produces the same value for the majority of
the datasets. Hence, we discard those that have less than 30% unique values. After this step we are left
with 255 features. Then, we check the correlation between the features and the algorithm performance
measure, which is the area under the ROC curve. Sorting the absolute value of the correlation from
highest to lowest, we pick the top three features per algorithm, some of which can be repeated. After
this step we are left with 26 features. Next, we identify groups of similar features. We use a clustering
approach, with k-means as the clustering algorithm and 1 − |ρ|, where ρ is the correlation between two
features, as the dissimilarity measure. As a result, we obtain eight clusters. Taking only one feature
from each cluster, there are 7200 possible combinations of eight features out of the 26. Finally, we determine
the best of these 7200 subsets for the dimensionality reduction. We use PCA with two components to
project the datasets into a 2-d instance space using a candidate feature subset. Then, we fit a random
forest model per algorithm to classify the instances in the trial instance space into groups of ε-good and
bad performance, which is defined as follows:
Definition 3.1. For a given dataset x, an outlier detection algorithm α gives ε-good performance if the
difference in area under the ROC curve to the best algorithm is less than ε. Formally, α gives ε-good
performance on x if

max_{β ∈ A} AUC(β, x) − AUC(α, x) < ε ,

where A denotes the algorithm space and AUC the area under the ROC curve.
That is, we fit 12 models for each feature combination. We define the best combination as the
one producing the lowest average out-of-bag (OOB) error across all models. Table 9 shows the
combinations which produce the lowest OOB error for each algorithm, and the one selected as the best
compromise. We observe that: (a) the most selected feature in a cluster does not always belong to the
lowest average set; (b) some features never belong to a best performing subset; and (c) each algorithm
has a unique subset of preferred features, even if it shares underlying principles with other algorithms.
The final set of selected features from which we construct the instance space is listed in table 10.
Table 9: Feature combinations that produce the lowest out-of-bag (OOB) error for each algorithm, and the one that produces the lowest average across all models (LO-AVG).
cluster  Feature                    % selected  LO-AVG
1        OPO DenOut Out 95P 1           17
1        OPO LocDenOut Out 95P 1        83        X
2        OPO Out DenOut 1 3             50        X
2        OPO Out LocDenOut 1 3          33
2        OPO Out LocDenOut 2 3          17
3        Skew 95                        25
3        Kurto Max                       0
3        OPO Res KNOut 95P 1            42        X
3        OPO Res ResOut Median 3        33
4        OPO Res Out SD 1                8
4        OPO Res Out IQR 1               0
4        OPO Res Out 95P 1              42
4        OPO Res Out IQR 3              25
4        OPO Res Out 95P 3              25        X
5        Total Entropy Dataset          67
5        OPO ResOut Out Min 1            0
5        OPO ResOut Out Min 3            0
5        OPO GComp PO Mean 3            17
5        OPO GComp PO Q95 3             17        X
6        OPO Den Out SD 3               33        X
6        OPO Den Out IQR 3              42
6        OPO Den Out 95P 3              25
7        SNR                            83        X
7        OPO LocDen Out IQR 1           17
8        OPO GDeg Out Mean 1            58        X
8        OPO GDeg Out Mean 3            42
OOB error (%)   19 25 2 12 24 24 20 10 11 8 8 10 16
final projection matrix is defined by Equation 23 to represent each dataset as a 2-d vector Z depending
on its 8-d feature vector.
        [ −0.0506   0.1731 ]T  [ SNR                     ]
        [ −0.0865  −0.1091 ]   [ OPO Res KNOut 95P 1     ]
        [ −0.1697   0.0159 ]   [ OPO Out DenOut 1 3      ]
   Z =  [  0.0041  −0.0465 ]   [ OPO Den Out SD 3        ]        (23)
        [ −0.2158   0.0398 ]   [ OPO Res Out 95P 3       ]
        [  0.0087  −0.0053 ]   [ OPO LocDenOut Out 95P 1 ]
        [  0.0420  −0.2056 ]   [ OPO GDeg Out Mean 1     ]
        [ −0.0661  −0.1864 ]   [ OPO GComp PO Q95 3      ]
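Applying the projection is a single matrix product; a sketch (the helper name is ours, and f must contain the pre-processed values of the eight features in the order shown above):

```python
import numpy as np

A = np.array([[-0.0506,  0.1731],   # SNR
              [-0.0865, -0.1091],   # OPO_Res_KNOut_95P_1
              [-0.1697,  0.0159],   # OPO_Out_DenOut_1_3
              [ 0.0041, -0.0465],   # OPO_Den_Out_SD_3
              [-0.2158,  0.0398],   # OPO_Res_Out_95P_3
              [ 0.0087, -0.0053],   # OPO_LocDenOut_Out_95P_1
              [ 0.0420, -0.2056],   # OPO_GDeg_Out_Mean_1
              [-0.0661, -0.1864]])  # OPO_GComp_PO_Q95_3

def project(f):
    """Map an 8-d feature vector to its 2-d instance space coordinates (z1, z2)."""
    return A.T @ np.asarray(f, dtype=float)
```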
Figure 7 illustrates the resulting instance space, including the full set of more than 12000 instances,
discriminated by their source. The sets by Campos et al. (2016) and Goldstein & Uchida (2016) are
mostly located in the lower left area of the space; whereas the set produced by down-sampling the
Table 10: Feature descriptions
UCI repository provides a greater coverage of the instance space and hence more diversity of features.
Finally, Figures 8 and 9 show the distribution of feature values and outlier method performance across
the instance space respectively, based only on the subset of 2018 instances. The scale has been adjusted
to the [0, 1] range. We observe that:
1. Low values of the feature SNR and high values of OPO Res KNOut 95P 1 are found at the bottom
of the space, which correlates with good performance of LDF.
2. Both OPO Out DenOut 1 3 and OPO Res Out 95P 3 tend to decrease from left to right of the
space. Both features tend to correlate with high performance of KDEOS and low performance of
KNN and KNNW.
3. Performance of FAST ABOD tends to increase from the bottom up, which tends to be correlated
with the feature OPO GDeg Out Mean 1.
4. There are no distinguishable linear patterns for some algorithms, such as COF and LOF. This
indicates that either PBLDR cannot find a predictive projection for these algorithms, or that we
lack representative instances for these algorithms. This will be reflected in weaker footprint results
for these methods.
Figure 7: Instance space including the full set of more than 12000 in-
stances, discriminated by their source.
maximum distance respectively. The density threshold, ρ, is set to 10, and the purity threshold, π, is set
to 75%. We then remove any contradictions that could appear when two different conclusions could be
drawn from the same section of the instance space due to overlapping footprints, e.g., when comparing
two algorithms. This is achieved by comparing the area lost by the overlapping footprints when the
contradicting sections are removed. The algorithm that would lose the largest area in a contradiction
gets to keep it, as long as it maintains the density and purity thresholds.
Table 11 presents the results from the footprint analysis. The best algorithm is the one with the
largest area under the ROC curve for the given instance, assuming that the most suitable normalization
method was selected. The results are expressed as a percentage of the total area (3.5520) and density
(565.8808) of the convex hull that encloses all instances. Further results are also illustrated in Figure 10,
which shows the ε-good footprint as black areas and ε-bad instances as grey marks. Table 11 demon-
strates that most algorithms have very small footprints. This can be corroborated by Figure 10, which
shows that some algorithms do not have pockets of good performance. Instead, some algorithms such as
INFLO, LDOF and SIMLOF present good performance in scattered regions of the instance space; hence,
we fail to find a well-defined area that fulfils the density and purity requirements. On the other hand,
FAST ABOD, KNN and KNNW possess the largest footprints. FAST ABOD, with a footprint covering
29.8%, of the instance space tends to dominate the upper left areas, while KNN and KNNW tend to
dominate the lower left areas of the instance space. KDEOS and LDF are special cases. If we only
consider their ε-good performance, we could think that both are unremarkable, as their footprints only
cover 4.9% and 2.2% of the space respectively. However, their footprints increase to 12.9% and 6.4%
respectively when considering their best performance, suggesting that they have some unique strengths.
Observing Figures 10d and 10g, we observe that KDEOS and LDF tend to dominate the upper right and
lower areas of the instance space respectively. Given that the footprint calculation minimises contradic-
tions, their ε-good performance is diminished when it is compared with the three dominating algorithms,
FAST ABOD, KNN and KNNW. In fact, most algorithms have areas of superior performance, which are
masked by good performance of the three dominating ones.
[Figure 8: scaled values of the eight selected features (SNR, OPO_Res_KNOut_95P_1, OPO_Out_DenOut_1_3, OPO_Den_Out_SD_3, OPO_Res_Out_95P_3, OPO_LocDenOut_Out_95P_1, OPO_GDeg_Out_Mean_1, OPO_GComp_PO_Q95_3) across the instance space.]
Figure 9: Scaled area under the curve for each outlier detection algorithm on the instance space.
Figure 10: Footprints of the algorithms in the instance space, assuming ε-good performance.
Table 11: Footprint analysis of the algorithms. α_N is the area, d_N the
density and p the purity. The footprint areas (and their density and purity)
are shown where algorithm performance is ε-good and best, with ε = 0.05.
We use support vector machines (SVM) for this partitioning. Of the 12 outlier methods we consider
FAST ABOD, KDEOS, KNN and KNNW, as these methods have bigger footprints that span a contigu-
ous region of the instance space. For these outlier methods, we use ε-good performance as the output and
the instance space coordinates as the input for the SVM. In this way, we train 4 SVMs, each SVM on a
single outlier method. The prediction accuracies using 5-fold cross validation along with the percentage
of instances in the majority class are given in Table 12. From Table 12 we see that for FAST ABOD,
KNN and KNNW the SVM accuracy is greater than the majority class percentage, and for KDEOS
it is equal.
The regions of strength resulting from this experiment are given in Figure 11. From Figure 11 we
see an overlap of regions for FAST ABOD, KNN and KNNW. By combining these regions of strength
we obtain a partitioning of the instance space shown in Figure 12. To break ties, we use the prediction
probability of the SVM and choose the method with the highest prediction probability. One can also use
a different approach, such as the sensitivity to normalization criterion, to break ties.
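One such per-method SVM can be sketched as follows (the RBF kernel and other settings are assumptions, since the text does not specify them; `Z` holds the instance-space coordinates and `good` the binary ε-good labels for one method):

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svm = SVC(kernel="rbf", probability=True, random_state=0)
accuracy = cross_val_score(svm, Z, good, cv=5).mean()   # 5-fold CV accuracy (as in Table 12)
svm.fit(Z, good)
tie_break_score = svm.predict_proba(Z)[:, 1]            # prediction probability used to break ties
```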
From Figure 12 we see that no outlier method is recommended for a large part of the instance space.
This highlights the opportunity for new outlier methods which perform well in this part of the space to
be developed. In addition, we see that KDEOS, which was the overall least effective method (see Figure
5), has a niche in the instance space where no other outlier method performs well. This insight was missed by
the standard statistical analysis.
Figure 11: Regions of strength for FAST ABOD, KDEOS, KNN and KNNW.
4 Conclusions
In this study we have investigated the effect of normalization and the algorithm selection problem for 12
unsupervised outlier methods. Normalization is a topic which has received little attention in the outlier
detection literature. We show its relevance to outlier detection mathematically, and further illustrate
experimentally that the performance of an outlier method may change significantly depending on the
normalization method. Indeed, we show that the effect of normalization differs from one outlier method to
another. Furthermore, certain datasets and outlier methods are more sensitive to normalization than others,
creating a subtle interplay between datasets and outlier methods that determines sensitivity to normalization.
One main conclusion of this research is that normalization should not be treated as a fixed strategy;
rather, a normalization method should be selected to maximize performance. To aid this selection, we have
proposed an approach whereby we first predict the sensitivity of a dataset to normalization, and then predict
the normalization method best suited to a given outlier detection method. Our models predict with
reasonable accuracy, with some outlier methods achieving higher accuracy than others.
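The two-stage selection can be sketched as below. This is a schematic illustration under stated assumptions rather than the paper's implementation: the feature table dataset_features.csv, its column names, and the use of random-forest classifiers here are all assumptions made for the example.

# Two-stage sketch (assumed data layout and classifier choice): stage 1 predicts
# whether a dataset is sensitive to normalization; stage 2 predicts the best
# normalization scheme for the sensitive datasets.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feats = pd.read_csv("dataset_features.csv")                # hypothetical: one row of dataset features per dataset
X = feats.drop(columns=["sensitive", "best_norm"]).values

# Stage 1: is the dataset sensitive to the choice of normalization?
stage1 = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, feats["sensitive"])

# Stage 2: for sensitive datasets, which normalization scheme works best for the chosen outlier method?
mask = (feats["sensitive"] == 1).values
stage2 = RandomForestClassifier(n_estimators=500, random_state=0).fit(X[mask], feats.loc[mask, "best_norm"])

def select_normalization(x_new):
    """Recommend a normalization scheme for a new dataset's feature vector x_new of shape (1, n_features)."""
    if stage1.predict(x_new)[0] == 0:
        return "insensitive: any reasonable scheme"        # stage 2 only matters for sensitive datasets
    return stage2.predict(x_new)[0]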
In addition to normalization, we also studied the algorithm selection problem for unsupervised outlier
detection. Given measurable features of a dataset, we can find the outlier method best suited to it with
reasonable accuracy. This is important because each method has its strengths and weaknesses and no single
method out-performs all others on all instances. We have explored the strengths and weaknesses of outlier
methods by analysing their footprints in the constructed instance space. Moreover, we have identified
different regions of the instance space that reveal the relative strengths of different outlier detection
methods. Our work clearly demonstrates, for example, that KDEOS, which gives poor performance on
average, has a region of strength in the instance space where no other algorithm excels.
In addition to these contributions, we hope to have laid some important foundations for future research
into new and improved outlier detection methods, in the following ways: 1. enabling evaluation of the
sensitivity to normalization of new outlier methods; 2. enabling rigorous evaluation of new methods using
the comprehensive corpus of over 12,000 datasets with diverse characteristics that we have made available;
3. enabling the strengths and weaknesses of new outlier methods to be identified using the instance space,
and their uniqueness compared to existing methods described. Equally valuable, the instance space analysis
can also reveal whether a new outlier method is similar to existing outlier methods or offers a unique
contribution to the available repertoire of techniques.
As a word of caution, we note that our current instance space is computed from our set of datasets,
outlier methods and features. Thus, we do not claim to have constructed the definitive instance space
for all unsupervised outlier detection methods; the selected features for the instance space may change
as the corpus of datasets and outlier methods expands. Future research paths include expanding the
instance space by generating new and diverse instances, and considering other classes of outlier detection
methods, such as subspace approaches. To aid this expansion and future research, we make all of our data
and implementation scripts available on our website.
Broadening the scope of this work, we have been adapting the instance space methodology to problems
beyond outlier detection, for example machine learning classification (Muñoz et al. 2018) and time series
forecasting (Kang et al. 2017). Part of this larger project is to build freely accessible web-tools that carry
out the instance space analysis automatically, including testing of algorithms on new instances. Such
tools will be available at matilda.unimelb.edu.au in the near future.
Acknowledgements
Funding was provided by the Australian Research Council through the Australian Laureate Fellowship
FL140100012, and Linkage Project LP160101885. This research was supported in part by the Monash
eResearch Centre and eSolutions-Research Support Services through the MonARCH HPC Cluster.
References
Achtert, E., Kriegel, H.-P. & Zimek, A. (2008), ELKI: a software system for evaluation of subspace clustering
algorithms, in ‘International Conference on Scientific and Statistical Database Management’, Springer, pp. 580–
585.
Angiulli, F. & Pizzuti, C. (2002), Fast outlier detection in high dimensional spaces, in ‘European Conference on
Principles of Data Mining and Knowledge Discovery’, Springer, pp. 15–27.
Billor, N., Hadi, A. S. & Velleman, P. F. (2000), ‘BACON: blocked adaptive computationally efficient outlier
nominators’, Computational Statistics & Data Analysis 34(3), 279–298.
Bischl, B., Mersmann, O., Trautmann, H. & Preuß, M. (2012), Algorithm selection based on exploratory landscape
analysis and cost-sensitive learning, in ‘Proceedings of the 14th annual conference on Genetic and evolutionary
computation’, ACM, pp. 313–320.
Brazdil, P., Giraud-Carrier, C., Soares, C. & Vilalta, R. (2008), Metalearning: Applications to data mining, Cogni-
tive Technologies, Springer.
Breunig, M. M., Kriegel, H.-P., Ng, R. T. & Sander, J. (2000), LOF: identifying density-based local outliers, in ‘ACM
SIGMOD Record’, Vol. 29, ACM, pp. 93–104.
Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., Assent, I. & Houle, M. E.
(2016), ‘On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study’, Data
Mining and Knowledge Discovery 30(4), 891–927.
Csardi, G. & Nepusz, T. (2006), ‘The igraph software package for complex network research’, InterJournal, Com-
plex Systems 1695(5), 1–9.
Culberson, J. C. (1998), ‘On the futility of blind search: An algorithmic view of no free lunch’, Evolutionary
Computation 6(2), 109–127.
Emmott, A., Das, S., Dietterich, T., Fern, A. & Wong, W.-K. (2015), ‘A meta-analysis of the anomaly detection
problem’, arXiv preprint arXiv:1503.01158 .
Ester, M., Kriegel, H.-P., Sander, J., Xu, X. et al. (1996), A density-based algorithm for discovering clusters in large
spatial databases with noise, in ‘KDD’, Vol. 96, pp. 226–231.
Goix, N. (2016), ‘How to evaluate the quality of unsupervised anomaly detection algorithms?’, arXiv preprint
arXiv:1607.01152 .
Goldstein, M. & Uchida, S. (2016), ‘A comparative evaluation of unsupervised anomaly detection algorithms for
multivariate data’, PloS one 11(4), e0152173.
Hansen, N. (2009), Benchmarking a bi-population CMA-ES on the BBOB-2009 function testbed, in ‘GECCO ’09’,
ACM, pp. 2389–2396.
Hautamaki, V., Karkkainen, I. & Franti, P. (2004), Outlier detection using k-nearest neighbour graph, in ‘Pattern
Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on’, Vol. 3, IEEE, pp. 430–
433.
Ho, Y.-C. & Pepyne, D. L. (2002), ‘Simple explanation of the no-free-lunch theorem and its implications’, Journal
of optimization theory and applications 115(3), 549–570.
Hubert, M. & Van der Veeken, S. (2008), ‘Outlier detection for skewed data’, Journal of chemometrics 22(3-4), 235–
246.
Igel, C. & Toussaint, M. (2005), ‘A no-free-lunch theorem for non-uniform distributions of target functions’, Journal
of Mathematical Modelling and Algorithms 3(4), 313–322.
Jin, W., Tung, A. K., Han, J. & Wang, W. (2006), Ranking outliers using symmetric neighborhood relationship, in
‘Pacific-Asia Conference on Knowledge Discovery and Data Mining’, Springer, pp. 577–593.
Kang, Y., Hyndman, R. & Smith-Miles, K. (2017), ‘Visualising forecasting algorithm performance using time series
instance spaces’, Int. J. Forecast 33(2), 345–358.
Kriegel, H.-P., Kröger, P., Schubert, E. & Zimek, A. (2009), LoOP: local outlier probabilities, in ‘Proceedings of the
18th ACM conference on Information and knowledge management’, ACM, pp. 1649–1652.
Kriegel, H.-P., Zimek, A. et al. (2008), Angle-based outlier detection in high-dimensional data, in ‘Proceedings of
the 14th ACM SIGKDD international conference on Knowledge discovery and data mining’, ACM, pp. 444–452.
Latecki, L. J., Lazarevic, A. & Pokrajac, D. (2007), Outlier detection with kernel density functions, in ‘International
Workshop on Machine Learning and Data Mining in Pattern Recognition’, Springer, pp. 61–75.
Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J. & Shoham, Y. (2003), A portfolio approach to algo-
rithm selection, in ‘IJCAI’, Vol. 3, pp. 1542–1543.
Liaw, A. & Wiener, M. (2002), ‘Classification and regression by randomForest’, R News 2(3), 18–22.
URL: http://CRAN.R-project.org/doc/Rnews/
Muñoz, M. A., Villanova, L., Baatar, D. & Smith-Miles, K. (2018), ‘Instance spaces for machine learning classifi-
cation’, Machine Learning 107(1), 109–147.
Muñoz, M. & Smith-Miles, K. (2017), ‘Performance analysis of continuous black-box optimization algorithms via
footprints in instance space’, Evol. Comput. 25(4), 529–554.
Ramaswamy, S., Rastogi, R. & Shim, K. (2000), Efficient algorithms for mining outliers from large data sets, in
‘ACM SIGMOD Record’, Vol. 29, ACM, pp. 427–438.
Rice, J. (1976), The algorithm selection problem, in ‘Advances in Computers’, Vol. 15, Elsevier, pp. 65–118.
Rousseeuw, P. J. & Hubert, M. (2017), ‘Anomaly detection by robust statistics’, Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery.
Schubert, E., Zimek, A. & Kriegel, H.-P. (2014a), Generalized outlier detection with flexible kernel density esti-
mates, in ‘Proceedings of the 2014 SIAM International Conference on Data Mining’, SIAM, pp. 542–550.
Schubert, E., Zimek, A. & Kriegel, H.-P. (2014b), ‘Local outlier detection reconsidered: a generalized view on lo-
cality with applications to spatial, video, and network outlier detection’, Data Mining and Knowledge Discovery
28(1), 190–237.
Smith-Miles, K. A. (2009), ‘Cross-disciplinary perspectives on meta-learning for algorithm selection’, ACM Com-
puting Surveys (CSUR) 41(1), 6.
Smith-Miles, K., Baatar, D., Wreford, B. & Lewis, R. (2014), ‘Towards objective measures of algorithm perfor-
mance across instance space’, Comput. Oper. Res. 45, 12–24.
Smith-Miles, K. & Bowly, S. (2015), ‘Generating new test instances by evolving in instance space’, Computers &
Operations Research 63, 102–113.
Smith-Miles, K. & Tan, T. T. (2012), Measuring algorithm footprints in instance space, in ‘Evolutionary Computa-
tion (CEC), 2012 IEEE Congress on’, IEEE, pp. 1–8.
Talagala, P., Hyndman, R., Smith-Miles, K., Kandanaarachchi, S., Munoz, M. et al. (2018), Anomaly detection in
streaming nonstationary temporal data, Technical report, Monash University, Department of Econometrics and
Business Statistics.
Tang, J., Chen, Z., Fu, A. W.-C. & Cheung, D. W. (2002), Enhancing effectiveness of outlier detections for low
density patterns, in ‘Pacific-Asia Conference on Knowledge Discovery and Data Mining’, Springer, pp. 535–
548.
Wilkinson, L. (2018), ‘Visualizing big data outliers through distributed aggregation’, IEEE transactions on visual-
ization and computer graphics 24(1), 256–266.
Wolpert, D. H. & Macready, W. G. (1997), ‘No free lunch theorems for optimization’, IEEE transactions on evolu-
tionary computation 1(1), 67–82.
Wolpert, D. H., Macready, W. G. et al. (1995), No free lunch theorems for search, Technical report, Technical Report
SFI-TR-95-02-010, Santa Fe Institute.
Zhang, E. & Zhang, Y. (2009), Average precision, in ‘Encyclopedia of database systems’, Springer, pp. 192–193.
Zhang, K., Hutter, M. & Jin, H. (2009), A new local distance-based outlier detection approach for scattered real-
world data, in ‘Pacific-Asia Conference on Knowledge Discovery and Data Mining’, Springer, pp. 813–822.
Zimek, A., Schubert, E. & Kriegel, H.-P. (2012), ‘A survey on unsupervised outlier detection in high-dimensional
numerical data’, Statistical Analysis and Data Mining: The ASA Data Science Journal 5(5), 363–387.