Distance Based Outlier Detection
by
Jyoti Ranjan Sethi
June 2013
Computer Science and Engineering
National Institute of Technology Rourkela
Rourkela-769 008, India. www.nitrkl.ac.in
Certificate
This is to certify that the work in the project entitled Study of Distance-Based Outlier
Detection Methods by Jyoti Ranjan Sethi, bearing roll number 109cs0189, is a record
of an original research work carried out under my supervision and guidance in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology
in Computer Science and Engineering during the session 2012-2013. Neither this
thesis nor any part of it has been submitted for any degree or academic award elsewhere.
Prof. B. K. Patra
i
Acknowledgements
I express my sincere gratitude to Prof. B. K. Patra and Prof. Korra Sathya Babu for their
motivation during the course of the project, which served as a spur to keep the work on
schedule. I convey my regards to all faculty members of the Department of Computer Science
and Engineering, NIT Rourkela for their valuable guidance and advice at appropriate
times. Finally, I would like to thank my friends for their help and assistance all through
this project.
ii
Abstract
Certificate i
Acknowledgements ii
Abstract iii
List of Figures vi
1 Introduction 1
1.1 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Causes of Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Outlier Detection Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Outlier Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Data Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5.1 Supervised Anomaly Detection . . . . . . . . . . . . . . . . . . . . 4
1.5.2 Semi-supervised Anomaly Detection . . . . . . . . . . . . . . . . . 4
1.5.3 Unsupervised Anomaly Detection . . . . . . . . . . . . . . . . . . . 5
1.6 Types of Anomaly Detection Techniques . . . . . . . . . . . . . . . . . . . 5
1.7 Output of Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.8 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.9 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.10 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.3 Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 Experiment Details 17
5.1 Iris Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2 Haberman’s Survival Data Set . . . . . . . . . . . . . . . . . . . . . . . . 18
5.3 Seeds Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.4 Breast Cancer Wisconsin (Original) Data Set . . . . . . . . . . . . . . . . 20
7 Conclusion 31
Bibliography 32
List of Figures
1.1 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Example of anomalies in 2-D dataset . . . . . . . . . . . . . . . . . . . . . 3
Chapter 1
Introduction
Outlier detection is defined as finding data in a dataset that do not show normal behavior.
Data which do not conform are called anomalies, outliers, or exceptions; the terms
anomaly and outlier are used interchangeably. Anomaly detection is important because
anomalies in data translate to significant information in a wide variety of applications.
1.1 Outliers
not of interest to the analyst but acts as a hindrance to data analysis, whereas in outlier
detection the data of interest are the outliers themselves.
A common approach is to define a region representing normal behavior and declare any
observation in the data that does not belong to this normal region an outlier.
• Noise data tend to be similar to the actual anomalies, and hence are very
difficult to distinguish and remove.
These challenges make the anomaly detection problem hard to solve in general. In
fact, most existing outlier detection techniques solve a specific formulation of the
problem.
1. Point Anomalies
An instance is called a point outlier if it is anomalous with respect to the rest of
the data. This is the simplest type of anomaly and the focus of the majority of
research on anomaly detection.
2. Contextual Anomalies
A data instance is a contextual anomaly if it is anomalous in a specific context
but not otherwise. The notion of a context is induced by the structure of the
dataset and has to be specified as part of the problem formulation. Each data
object is defined using the following two sets of attributes:
• Contextual attributes : The contextual attributes determine the context of
an instance. For time-series data, the contextual attribute is the time, which
determines the position of an instance in the entire sequence.
• Behavioral attributes : The behavioral attributes define the non-contextual
characteristics of an instance. Consider a spatial dataset describing the
average rainfall across the entire world: at any location, the amount of
rainfall is the behavioral attribute.
The labels associated with data instances denote whether an instance is normal or
anomalous. Based on the extent to which labels are available, anomaly detection can
operate in one of the following three modes:
Techniques trained in supervised mode assume the availability of a training dataset with
labeled instances for both the normal and anomaly classes. The typical approach is to
build a predictive model for the normal versus the outlier classes; any unseen data object
is compared against this model to determine which class it belongs to. Two major
issues arise in supervised anomaly detection. First, anomalous instances are far fewer
than normal instances in the training data, and the issues arising from such imbalanced
class distributions have been addressed in the data mining and machine learning
literature. Second, it is usually challenging to obtain accurate and representative labels
for the anomaly class. A number of techniques have been proposed that inject artificial
anomalies into a normal dataset to obtain a labeled training dataset. Apart from these
two issues, the supervised anomaly detection problem is similar to building predictive
models.
Techniques that operate in a semi-supervised mode assume that the training data has
labeled objects only for the normal class. Since they do not require labels for the
anomaly class, they are more widely applicable than supervised techniques. For example,
in spacecraft fault detection, an accident is an anomaly scenario that is not easy to
model. The typical approach in such techniques is to build a model for the class
corresponding to normal behavior and use this model to identify anomalies in the test
data. A limited set of anomaly detection techniques assumes the availability of only
anomaly instances for training; because it is difficult to obtain a training dataset that
covers every possible anomalous behavior that can occur in the data, such techniques
are not commonly used.
Techniques that operate in unsupervised mode do not require training data and are thus
the most widely applicable. Techniques in this category assume that normal objects are
far more frequent than outliers in the test data; if this assumption does not hold, they
suffer from a high false positive rate. Many semi-supervised techniques can be made to
operate in an unsupervised mode by using a sample of the unlabeled dataset as training
data. This adaptation assumes that the test data contains very few anomalies and that
the model learned during training is robust to these few anomalies.
1. Statistical-based Detection
(a) Distribution-based
(b) Depth-based
2. Deviation-based Method
3. Sequential exception.
4. Distance-based Detection
(a) Index-based
(b) Nested-loop
(c) Local-outliers
5. Density-based Detection
6. Clustering-based Detection
Anomaly detection techniques produce one of the following two types of output:
1. Scores : An anomaly score is assigned to each object in the test data according to
the degree to which that object is considered an anomaly. The output of these
techniques is a ranked list of anomalies; an analyst may choose to either analyze
the top few anomalies or use a cutoff threshold to select the anomalies.
2. Labels : A label (normal or anomalous) is assigned to each object in the test
data. Such techniques do not directly allow the analyst to rank the anomalies.
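The thesis gives no code, but the score-based output described above can be sketched as follows; the function name and score values are illustrative only, not from the thesis.

```python
# Illustrative sketch (not from the thesis): turning anomaly scores into a
# result set, either by taking the top-n ranked objects or by applying a
# cutoff threshold. Higher score = more anomalous.
def select_outliers(scores, top_n=None, threshold=None):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        return [obj for obj, _ in ranked[:top_n]]
    return [obj for obj, s in ranked if s >= threshold]

scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.1}
print(select_outliers(scores, top_n=2))        # ['a', 'c']
print(select_outliers(scores, threshold=0.5))  # ['a', 'c']
```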
1.8 Challenges
• The method is unsupervised, so validation can be quite challenging (just as for clustering).
1.9 Assumptions
The number of normal observations is much larger than the number of abnormal
observations (anomalies) in the data.
1.10 Applications
1. Fraud detection
Credit card fraud detection : An amount spent in a transaction that is very high
compared to the normal transactions of that person is a point anomaly.
2. Intrusion detection
Network intrusion detection : Detection of anomalous activity (break-ins, pene-
trations, and other forms of computer abuse) in a computer network. Hackers
launch cyber attacks to gain unauthorized access to information from the network.
This domain uses semi-supervised and unsupervised anomaly detection techniques.
3. Fault detection
Fault detection in mechanical units : Monitoring the performance of industrial
components such as oil flow in pipelines, turbines, and motors, and detecting
defects that might occur due to wear and tear or other unforeseen circumstances.
Chapter 2
• Distance-based outliers are well defined for k-dimensional datasets for any value
of k [2].
By analyzing the multidimensional indexing schemes [3], we find that for variants of
R-trees [4], k-d trees [5], and X-trees [6], range search has a lower bound complexity of
Ω(N^(1−1/k)). As k increases, range search quickly degrades to O(N), giving at best a
constant-time improvement over sequential search. So, the procedure for finding all
DB(p, D)-outliers has a worst-case complexity of O(kN²). Compared to the depth-based
approaches, which have a lower bound complexity of Ω(N^⌈k/2⌉), DB-outliers scale
much better with dimensionality. The framework of DB-outliers is applicable and
computationally feasible for datasets that have many attributes (i.e., k ≥ 5). This is a
significant improvement on the current state of the art, where existing methods can
only realistically deal with two attributes [7, 8]. The above analysis considers only
search time. When it comes to using an index-based algorithm, for the kinds of
data-mining applications under consideration it is a very strong assumption that the
right index already exists. In other words, a huge hidden cost of an index-based
algorithm is the effort to build the index in the first place. The index-building cost
alone, even without counting the search cost, almost always renders the index-based
algorithms non-competitive.
Steps
1. An index is used to search for the neighbors of each object O within radius D
around that object.
The nested-loop algorithm calculates the distances from each instance to all other
objects to find the instance's k nearest neighbors. It has a complexity of O(kN²) (k is
the number of dimensions and N is the number of data objects), and the number of
passes over the dataset is linear in N. The quadratic complexity with respect to the
number of objects makes it inefficient for mining large databases: the calculation of
distances between objects is the major cost. Though the nested loop is a good choice
for high-dimensional datasets, the large number of calculations makes it unsuitable for
large ones [9].
Steps:
1. Divide the buffer space into two halves (the first and second arrays).
2. Break the data into blocks and then feed two blocks into the arrays.
3. Directly compute the distance between each pair of objects, inside an array or
between the arrays.
Pros:
• Avoids index structure construction
• Tries to minimize the I/Os
The previous outlier detection schemes perform only moderately when it comes to
detecting outliers in real-world scattered datasets. LDOF uses the relative distance
from an object to its neighbors to measure how much the object deviates from its
scattered neighborhood: the higher the outlier factor, the more likely the point is an
outlier. It has been observed that outlier detection schemes are more reliable when used
in a top-n manner, meaning that the objects with the top n factors are taken as outliers,
where n is decided by the user as per his requirements [11].
We use a test case with three clusters C1, C2, and C3 and four outlying points
O1, O2, O3, O4. A problem arises when we set a value of k greater than the cardinality
of a cluster; here C3 is the smallest cluster, with cardinality 10, so values k > 10 are
problematic. We solve this problem by the following method.
• k-nearest-neighbor distance of xp
Let Np be the set of the k nearest neighbors of object xp (excluding xp). The
k-nearest-neighbor distance of xp equals the average distance from xp to all objects
in Np. More formally, let dist(x, x′) ≥ 0 be a distance measure between objects x
and x′. The k-nearest-neighbor distance of object xp is defined as:
d̄_xp = (1/k) · Σ_{xi ∈ Np} dist(xi, xp)
• k-th inner distance of xp
Given the k-nearest-neighbor set Np of object xp, the k-nearest-neighbor inner
distance of xp is defined as the average distance among the objects in Np:
D̄_xp = (1/(k(k−1))) · Σ_{xi, xi′ ∈ Np, i ≠ i′} dist(xi, xi′)
• LDOF of xp
LDOF(xp) = d̄_xp / D̄_xp
Here d̄_xp denotes the average distance between the specified point and the points in
the k-neighborhood set of the point, and D̄_xp represents the average distance among
the points in the k-neighborhood set [12]. Another way to minimize calculation is to
find the median of the k-neighborhood points, call it x′, and then calculate the distances
between the points in the k-neighborhood set and x′. This minimizes calculations.
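The definitions above can be sketched in Python as follows; this is a brute-force illustration of the stated formulas, not an efficient or reference implementation, and the toy dataset is ours.

```python
import math
from itertools import combinations

# Brute-force LDOF (Zhang et al. [11]): for each object, the ratio of its
# kNN distance to the kNN inner distance of its neighborhood.
def ldof(data, k):
    scores = []
    for p_idx, xp in enumerate(data):
        # k nearest neighbors of xp (excluding xp itself)
        nbrs = sorted((x for i, x in enumerate(data) if i != p_idx),
                      key=lambda x: math.dist(x, xp))[:k]
        # kNN distance: average distance from xp to its neighbors
        d_knn = sum(math.dist(x, xp) for x in nbrs) / k
        # kNN inner distance: average pairwise distance among the neighbors
        # (the sum over ordered pairs i != i' is twice the unordered-pair sum)
        d_inner = 2 * sum(math.dist(a, b)
                          for a, b in combinations(nbrs, 2)) / (k * (k - 1))
        scores.append(d_knn / d_inner)
    return scores

data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
scores = ldof(data, k=3)
# The isolated point (10, 10) gets by far the highest LDOF score.
print(max(range(len(scores)), key=scores.__getitem__))  # 4
```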
Furthermore, LDOF is often used in a top-n manner (top-n LDOF), as follows [11]:
Input : A given dataset D, natural numbers n and k.
Steps :
1. Calculate the k-nearest-neighbor distance and the k-nearest-neighbor inner
distance of each object o.
2. Calculate the LDOF of each object o. Objects with LDOF < LDOF_lb are directly
discarded.
3. Sort the remaining objects by their LDOF values and output the top n objects
with the highest LDOF values as outliers.
3.1 Motivation
3.2 Objective
14
Chapter 4
The disadvantage of the Local Distance-based Outlier Factor is false detection, i.e., its
false positive error is high. We introduce a new outlier detection method, the Modified
Local Distance-Based Outlier Factor (MLDOF), to overcome this and to improve the
accuracy of the existing algorithm.
4.1 Features
2. It finds the nearest neighbors of the points and decides whether each is an outlier or not.
4.2 Algorithm
2. The LDOF is calculated for each object p; objects having LDOF < LDOF_lb are
discarded;
4.3 Explanation
In this algorithm, each object is taken from the dataset and its k nearest neighbors are
computed individually, using some underlying distance function. The algorithm then
finds the kNN distance with respect to the object's neighbors, the kNN inner distance,
and from these the LDOF of each object. If the LDOF of an object is less than the
lower bound, the object is discarded. The remaining objects are sorted according to
their LDOF values, which identifies the indices of points that may be outliers. The
algorithm then checks the nearest neighbors of each point identified by LDOF: if the
number of neighbors within the D-neighborhood of a point is greater than M, where
M = N(1 − p) and p is the minimum fraction of objects in the dataset that must lie
outside the D-neighborhood of an outlier, then the point is regarded as a non-outlier.
By doing this, the algorithm discards some of the points identified by LDOF, which
reduces the false detection of normal data as outliers.
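The procedure just described can be sketched as follows. This is our illustrative reading, not a reference implementation: we assume the pruning bound M = N(1 − p) with N the dataset size, and all function and parameter names are ours.

```python
import math
from itertools import combinations

def ldof(data, k):
    # Brute-force LDOF (Zhang et al. [11]): kNN distance / kNN inner distance.
    scores = []
    for p_idx, xp in enumerate(data):
        nbrs = sorted((x for i, x in enumerate(data) if i != p_idx),
                      key=lambda x: math.dist(x, xp))[:k]
        d_knn = sum(math.dist(x, xp) for x in nbrs) / k
        d_inner = 2 * sum(math.dist(a, b)
                          for a, b in combinations(nbrs, 2)) / (k * (k - 1))
        scores.append(d_knn / d_inner)
    return scores

def mldof_outliers(data, k, n, ldof_lb, D, p):
    scores = ldof(data, k)
    N = len(data)
    M = N * (1 - p)  # assumed bound: max D-neighbors an outlier may have
    # keep the top-n candidates whose LDOF is at least the lower bound
    candidates = sorted((i for i in range(N) if scores[i] >= ldof_lb),
                        key=lambda i: scores[i], reverse=True)[:n]
    outliers = []
    for i in candidates:
        close = sum(1 for j in range(N)
                    if j != i and math.dist(data[i], data[j]) <= D)
        if close <= M:  # too many close neighbors => regard as normal
            outliers.append(i)
    return outliers

data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(mldof_outliers(data, k=3, n=2, ldof_lb=2.0, D=2.0, p=0.7))  # [4]
```

The final loop is the MLDOF post-filter: a candidate flagged by LDOF is demoted to normal if it has more than M neighbors within distance D, which is what reduces the false positives relative to plain top-n LDOF.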
Chapter 5
Experiment Details
These algorithms mainly use the k nearest neighbors of an object to determine whether
the object is an outlier or not.
5.1 Iris Data Set
Attribute Information :
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. Class
Iris Setosa
Iris Versicolour
Iris Virginica
Source :
Information :
• This is perhaps the best known database to be found in the pattern recog-
nition literature. Fisher’s paper is a classic in the field and is referenced
frequently to this day (see Duda & Hart, for example). The dataset contains
3 classes of 50 instances each, where each class refers to a type of iris plant.
One class is linearly separable from the other 2; the latter are NOT linearly
separable from each other.
• Predicted attribute: class of iris plant.
• This is an exceedingly simple domain.
• This data differs from the data presented in Fisher's article (identified by
Steve Chadwick, [email protected]): the 35th sample should be 4.9,
3.1, 1.5, 0.2, "Iris-setosa", where the error is in the fourth feature; the 38th
sample should be 4.9, 3.6, 1.4, 0.1, "Iris-setosa", where the errors are in the
second and third features.
5.2 Haberman's Survival Data Set
Attribute Information :
Source :
Information :
Dataset contains cases from a study that was conducted between 1958 and 1970 at
the University of Chicago’s Billings Hospital on the survival of patients who had
undergone surgery for breast cancer.
5.3 Seeds Data Set
Attribute Information :
To construct the data, seven geometric parameters of wheat kernels were measured:
Source :
Information :
5.4 Breast Cancer Wisconsin (Original) Data Set
Attribute Information :
11. Class:
(2 for benign,
4 for malignant)
Source :
Information :
• Samples arrive periodically as Dr. Wolberg reports his clinical cases. The
database therefore reflects this chronological grouping of the data. This group-
ing information appears immediately below, having been removed from the
data itself :
1. Group 1: 367 instances (January 1989)
2. Group 2: 70 instances (October 1989)
3. Group 3: 31 instances (February 1990)
4. Group 4: 17 instances (April 1990)
5. Group 5: 48 instances (August 1990)
6. Group 6: 49 instances (Updated January 1991)
7. Group 7: 31 instances (June 1991)
8. Group 8: 86 instances (November 1991)
9. Total: 699 points (as of the donated database on 15 July 1992)
• Note that the results summarized above in Past Usage refer to a dataset of
size 369, while Group 1 has only 367 instances. This is because it originally
contained 369 instances; 2 were removed.
Chapter 6
We implemented both LDOF and MLDOF outlier detection methods on the datasets
over a large range of k.
In a classification task, the precision for a class is the number of true positives divided
by the total number of elements labeled as belonging to the positive class. A perfect
precision score of 1 means that every result obtained by a search was relevant.
Example: A precision value of 1 for a class X means that every item labeled as belonging
to class X does indeed belong to class X.
Here precision is calculated to evaluate the performance of each algorithm.
Precision = ND / NR
where
ND = the number of detected real outliers in the dataset,
NR = the number of real outliers in the dataset.
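The measure above can be computed directly; the index sets below are our own illustrative representation, not the actual experimental data.

```python
# Sketch of the precision measure defined above: ND (detected real outliers)
# divided by NR (real outliers in the dataset). Objects are identified by index.
def precision(detected, real_outliers):
    nd = len(set(detected) & set(real_outliers))  # ND
    nr = len(real_outliers)                       # NR
    return nd / nr

real = set(range(50, 60))     # e.g. 10 injected outliers at indices 50..59
detected = real | {3, 7, 21}  # 13 detections, 3 of them false positives
print(precision(detected, real))  # 1.0
```

Note that by this formula a method that detects every real outlier scores 1 even if it also flags some normal points; the false positives show up in the confusion matrices rather than in this ratio.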
We took a total of 60 instances, of which 50 are normal and 10 are outliers. The 10
outliers were introduced deliberately to test whether the algorithms work correctly. We
ran both algorithms for k = 12: LDOF flagged 13 instances as outliers, of which 3 are
normal data falsely detected as outliers and the other 10 are real outliers, whereas
MLDOF flagged exactly the 10 real outliers. Confusion matrices for both algorithms are
drawn in Figure 6.1 and Figure 6.2. To evaluate the performance of the algorithms, we
plot precision versus neighborhood size k, shown in Figure 6.3.
We took a total of 229 instances, of which 225 are normal and 4 are outliers. The 4
outliers were introduced deliberately to test whether the algorithms work correctly. We
ran both algorithms for k = 6: LDOF flagged 11 instances as outliers, of which 7 are
normal data falsely detected as outliers and the other 4 are real outliers, whereas
MLDOF flagged 5 instances, of which 4 are real outliers and 1 is normal data.
Confusion matrices for both algorithms are drawn in Figure 6.4 and Figure 6.5. To
evaluate the performance of the algorithms, we plot precision versus neighborhood size
k, shown in Figure 6.7. A 3-D projection of the dataset is drawn showing the normal
data and outliers.
We took a total of 150 instances, of which 140 are normal and 10 are outliers. The 10
outliers were introduced deliberately to test whether the algorithms work correctly. We
ran both algorithms for k = 12: LDOF flagged 14 instances as outliers, of which 4 are
normal data falsely detected as outliers and the other 10 are real outliers, whereas
MLDOF flagged exactly the 10 real outliers. Confusion matrices for both algorithms are
drawn in Figure 6.8 and Figure 6.9. To evaluate the performance of the algorithms, we
plot precision versus neighborhood size k, shown in Figure 6.10.
We took a total of 449 instances, of which 444 are normal and 5 are outliers. The 5
outliers were introduced deliberately to test whether the algorithms work correctly. We
ran both algorithms for k = 10: LDOF flagged 11 instances as outliers, of which 6 are
normal data falsely detected as outliers and the other 5 are real outliers, whereas
MLDOF flagged 7 instances, of which 5 are real outliers and 2 are normal data.
Confusion matrices for both algorithms are drawn in Figure 6.11 and Figure 6.12. To
evaluate the performance of the algorithms, we plot precision versus neighborhood size
k, shown in Figure 6.13.
Conclusion
In this thesis we have introduced a new outlier detection method, the Modified Local
Distance-Based Outlier Factor (MLDOF). Though the k-nearest-neighbor algorithm is
easy to implement, determining the value of k a priori is a difficult task: many
iterations were performed to determine the optimal value of k, which was a very
time-consuming process. We have shown that MLDOF improves the accuracy of outlier
detection with respect to LDOF and reduces the false positive error.
Bibliography
[1] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A
survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[2] Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers
in large datasets. In Proceedings of the 24th International Conference on Very Large
Data Bases, VLDB '98, pages 392–403, 1998.
[4] Antonin Guttman. R-trees: a dynamic index structure for spatial searching, vol-
ume 14. ACM, 1984.
[5] Jon Louis Bentley. Multidimensional binary search trees used for associative search-
ing. Communications of the ACM, 18(9):509–517, 1975.
[6] Stefan Berchtold, Daniel A Keim, and Hans-Peter Kriegel. The x-tree: An in-
dex structure for high-dimensional data. Readings in multimedia computing and
networking, page 451, 2001.
[7] Theodore Johnson, Ivy Kwok, and Raymond Ng. Fast computation of 2-dimensional
depth contours. In Proc. KDD, volume 1998, pages 224–228. Citeseer, 1998.
[8] Ida Ruts and Peter J Rousseeuw. Computing depth contours of bivariate point
clouds. Computational Statistics & Data Analysis, 23(1):153–168, 1996.
[9] Edward Hung and David W Cheung. Parallel mining of outliers in large database.
Distributed and Parallel Databases, 12(1):5–26, 2002.
[10] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for
mining outliers from large data sets. In ACM SIGMOD Record, volume 29, pages
427–438. ACM, 2000.
[11] Ke Zhang, Marcus Hutter, and Huidong Jin. A new local distance-based outlier de-
tection approach for scattered real-world data. In Advances in Knowledge Discovery
and Data Mining, pages 813–822. Springer, 2009.
[12] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof:
identifying density-based local outliers. In ACM Sigmod Record, volume 29, pages
93–104. ACM, 2000.