What is Clustering?
Clustering is one of the most important research areas in the field of data mining.
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense or
another) to each other than to those in other groups (clusters).
It is an unsupervised learning technique.
Data clustering is the subject of active research in several fields such as statistics,
pattern recognition and machine learning.
From a practical perspective clustering plays an outstanding role in data mining
applications in many domains.
The main advantage of clustering is that interesting patterns and structures can be
found directly from very large data sets with little or no background knowledge.
Clustering algorithms can be applied in many areas, such as marketing, biology, libraries,
insurance, city planning, earthquake studies and WWW document classification.
Applications of Clustering
Real life examples where we use clustering:
• Marketing
• Finding groups of customers with similar behavior given a large database of customers.
• Data containing their properties and past buying records (Conceptual Clustering).
• Biology
• Classification of plants and animals based on the properties under observation (Conceptual
Clustering).
• Insurance
• Identifying groups of car insurance policy holders with a high average claim cost (Conceptual
Clustering).
• City-Planning
• Grouping houses according to their house type, value and geographical location; this can be both
Conceptual Clustering and Distance-Based Clustering.
• Libraries
• It is used in clustering different books on the basis of topics and information.
• Earthquake studies
• By studying earthquake-affected areas, we can determine the dangerous zones.
What is Partitioning?
Clustering is a division of data into groups of similar objects.
Each group, called cluster, consists of objects that are similar between
themselves and dissimilar to objects of other groups.
It represents many data objects by few clusters and hence, it models
data by its clusters.
A cluster is a set of points such that any point in a cluster is closer (or
more similar) to every other point in the cluster than to any point not in
the cluster.
K-MEANS Algorithm
K-Means is one of the simplest unsupervised learning algorithms that solve the
well-known clustering problem.
The procedure follows a simple and easy way to classify a given data set
through a certain number of clusters (k-clusters).
The main idea is to define k centroids, one for each cluster.
A centroid is “the center of mass of a geometric object of uniform
density”, though here, we'll consider mean vectors as centroids.
It is a method of classifying/grouping items into k groups (where k is the
number of pre-chosen groups).
The grouping is done by minimizing the sum of squared distances
between items or objects and the corresponding centroid.
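The update loop this describes can be written in a few lines of NumPy. The sketch below is illustrative only (the function name, random initialization and stopping test are choices of this sketch, not of the slides), and it assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal batch k-means. X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids (random initialization).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # (Assumes no cluster ends up empty.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```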
K-MEANS Algorithm Cont..
A clustered scatter plot.
The black dots are data points.
The red lines illustrate the partitions
created by the k-means algorithm.
The blue dots represent the centroids
which define the partitions.
K-MEANS Algorithm Cont..
The initial partitioning can be done in a variety of ways.
Dynamically Chosen
• This method is good when the amount of data is expected to grow.
• The initial cluster means can simply be the first few items of data from the set.
• For instance, if the data will be grouped into 3 clusters, then the initial cluster
means will be the first 3 items of data.
Randomly Chosen
• Almost self-explanatory, the initial cluster means are randomly chosen values
within the same range as the highest and lowest of the data values.
Choosing from Upper and Lower Bounds
• Depending on the types of data in the set, the highest and lowest of the data
range are chosen as the initial cluster means.
K-Means Algorithm - Example
Sr.   Height   Weight
1     185      72
2     170      56
3     168      60
4     179      68
5     182      72
6     188      77
7     180      71
8     180      70
9     183      84
10    180      88
11    180      67
12    177      76
Initial centroids: K1 = (185, 72), K2 = (170, 56)
K-Means Algorithm – Example Cont..
The remaining points are processed one at a time: each point is assigned to the nearer centroid, and that centroid is then recomputed as the mean of its members.
• Point 3 (168, 60) joins K2, so K2 moves from (170, 56) to (169, 58)
• Point 4 (179, 68) joins K1, so K1 moves from (185, 72) to (182, 70)
• Point 5 (182, 72) joins K1, so K1 moves to (182, 71)
• Point 6 (188, 77) joins K1, so K1 moves to (183, 72)
• The remaining points 7–12 are likewise assigned to K1, giving the final result:
Cluster K1 = {1,4,5,6,7,8,9,10,11,12}
Cluster K2 = {2,3}
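As a cross-check on this result, a library implementation groups the same data the same way. The snippet below uses scikit-learn, which the slides do not mention, so treat it purely as an optional verification sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Height/weight rows Sr. 1..12 from the table above.
X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
              [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # points Sr. 2 and 3 fall in one cluster, the rest in the other
print(km.cluster_centers_)  # batch means, close to the incrementally updated centroids
```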
K-Means Algorithm Cont..
Let us assume two clusters, and each individual's scores include two
variables.
Step-1
• Choose the number of clusters.
Step-2
• Set the initial partition, and the initial mean vectors for each cluster.
Step-3
• For each remaining individual...
Step-4
• Get averages for comparison to Cluster 1:
• Add individual's A value to the sum of A values of the individuals in Cluster 1, then divide by
the total number of scores that were summed.
• Add individual's B value to the sum of B values of the individuals in Cluster 1, then divide by
the total number of scores that were summed.
K-Means Algorithm Cont..
Step-5
• Get averages for comparison to Cluster 2:
• Add individual's A value to the sum of A values of the individuals in Cluster 2, then divide by the total number of scores that were
summed.
• Add individual's B value to the sum of B values of the individuals in Cluster 2, then divide by the total number of scores that were
summed.
Step-6
• If the averages found in Step 4 are closer to the mean values of Cluster 1, then this individual belongs to
Cluster 1, and the averages found now become the new mean vectors for Cluster 1.
• If closer to Cluster 2, then it goes to Cluster 2, along with the averages as new mean vectors.
Step-7
• If there are more individuals to process, continue again with Step 4. Otherwise go to Step 8.
Step-8
• Now compare each individual’s distance to its own cluster's mean vector, and to that of the opposite
cluster.
• The distance to its own cluster's mean vector should be smaller than its distance to the other cluster's mean vector.
• If not, relocate the individual to the opposite cluster.
K-Means Algorithm Cont..
Step-9
• If any relocations occurred in Step 8, the algorithm must continue again with
Step 3, using all individuals and the new mean vectors.
• If no relocations occurred, stop. Clustering is complete.
What is Medoid?
• Medoids are similar in concept to means or centroids, but medoids are
always restricted to be members of the data set.
• Medoids are most commonly used on data when a mean
or centroid cannot be defined, such as graphs.
• Note: A medoid is not equivalent to a median.
K-Medoids Clustering Algorithm (PAM)
• The k-medoids algorithm is a clustering algorithm related to the k-means algorithm
and to the medoid shift algorithm.
• Both the k-means and k-medoids algorithms are partitional (breaking the dataset up
into groups).
• In contrast to the k-means algorithm, k-medoids chooses datapoints as centers
(medoids or exemplars).
• K-medoids is also a partitioning technique of clustering that clusters the data set
of n objects into k clusters with k known a priori.
• It can be more robust to noise and outliers than k-means because it minimizes a sum
of general pairwise dissimilarities instead of a sum of squared Euclidean distances.
• A medoid of a finite dataset is a data point from this set whose average dissimilarity
to all the data points is minimal, i.e., it is the most centrally located point in the set.
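Before the worked example, here is a minimal sketch of the PAM idea: keep k data points as medoids and greedily accept medoid/non-medoid swaps that lower the total cost. Manhattan distance is assumed because the example that follows uses it; the function names are illustrative, not part of the slides.

```python
import numpy as np

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    # Each point contributes its dissimilarity to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    points = [tuple(p) for p in points]
    rng = np.random.default_rng(seed)
    medoids = [points[i] for i in rng.choice(len(points), size=k, replace=False)]
    improved = True
    while improved:
        improved = False
        # Try every medoid / non-medoid swap; keep a swap if it lowers the total cost.
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                candidate = [p if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate
                    improved = True
    return medoids
```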
K-Medoids Clustering Algorithm (PAM) Cont..
K-Medoids Clustering Algorithm - Example
Sr. X Y
0 8 7
1 3 7
2 4 9
3 9 6
4 8 5
5 5 8
6 7 3
7 8 4
8 7 5
9 4 5
Step 1:
Randomly select 2 medoids: with k = 2, let C1 = (4, 5) and C2 = (8, 5) be the two medoids.
The dissimilarity of each non-medoid point with the medoids is calculated and
tabulated:
Sr.  X  Y   Dissimilarity from C1    Dissimilarity from C2
0    8  7   |8-4| + |7-5| = 6        |8-8| + |7-5| = 2
1    3  7   3                        7
2    4  9   4                        8
3    9  6   6                        2
5    5  8   4                        6
6    7  3   5                        3
7    8  4   5                        1
8    7  5   3                        1
K-Medoids Clustering Algorithm – Example Cont..
• Each point is assigned to the cluster of the medoid with the smaller dissimilarity (see the table above).
• The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
• The cost = (3 + 4 + 4) + (2 + 2 + 3 + 1 + 1) = 20
K-Medoids Clustering Algorithm – Example Cont..
Step 3: Randomly select one non-medoid point and recalculate the cost.
Let the randomly selected point be (8, 4). The dissimilarity of each non-medoid point
with the candidate medoids C1 (4, 5) and C2 (8, 4) is calculated and tabulated:

Sr.  X  Y   Dissimilarity from C1    Dissimilarity from C2
0    8  7   6                        3
1    3  7   3                        8
2    4  9   4                        9
3    9  6   6                        3
4    8  5   4                        1
5    5  8   4                        7
6    7  3   5                        2
8    7  5   3                        2
K-Medoids Clustering Algorithm – Example Cont..
• Each point is assigned to the cluster of the medoid with the smaller dissimilarity (see the table above). So the points 1, 2, 5 go to cluster C1 and 0, 3, 4, 6, 8 go to cluster C2.
• The new cost = (3 + 4 + 4) + (3 + 3 + 1 + 2 + 2) = 22
• Swap cost = new cost – previous cost = 22 – 20 = 2
• Since the swap cost is positive, the previous medoid set is better: the total cost with medoid (8, 4) is greater than the total cost when (8, 5) was the medoid, and it generates the same clusters as before.
• If the swap cost had been negative, we would keep the new medoid and recalculate.
K-Medoids Clustering Algorithm – Example Cont..
• As the swap cost is not less than zero, we undo the
swap.
• Hence (4, 5) and (8, 5) are the final medoids.
• The final clustering is: Cluster C1 (medoid (4, 5)) = {1, 2, 5, 9} and Cluster C2 (medoid (8, 5)) = {0, 3, 4, 6, 7, 8}.
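The two costs used in this example (20 for medoids (4, 5) and (8, 5), 22 after the candidate swap to (8, 4)) can be reproduced in a few lines; a small sketch, assuming Manhattan dissimilarity as in the tables above.

```python
points = {0: (8, 7), 1: (3, 7), 2: (4, 9), 3: (9, 6), 4: (8, 5),
          5: (5, 8), 6: (7, 3), 7: (8, 4), 8: (7, 5), 9: (4, 5)}

def cost(medoids):
    # Each non-medoid point contributes its Manhattan distance to the nearest medoid.
    return sum(min(abs(x - mx) + abs(y - my) for (mx, my) in medoids)
               for (x, y) in points.values() if (x, y) not in medoids)

print(cost([(4, 5), (8, 5)]))  # 20 -> cost with the original medoids
print(cost([(4, 5), (8, 4)]))  # 22 -> cost after the candidate swap
```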
K-Medoids Clustering Algorithm (Try Yourself!!)
Sr. X Y
0 2 6
1 3 4
2 3 8
3 4 7
4 6 2
5 6 4
6 7 3
7 7 4
8 8 5
9 7 6
Hierarchical Clustering
• Hierarchical Clustering is a technique to group objects based on distance or
similarity.
• Hierarchical Clustering is an unsupervised learning technique.
• This is because the machine (computer) learns only from the objects and their features,
and then automatically categorizes those objects into groups.
• This clustering technique is divided into two types:
• Agglomerative
• In this technique, initially each data point is considered as an individual cluster.
• At each iteration, the similar clusters merge with other clusters until one cluster or K clusters are formed.
• Divisive
• Divisive Hierarchical clustering is exactly the opposite of the Agglomerative Hierarchical clustering.
• In Divisive Hierarchical clustering, we consider all the data points as a single cluster and in each iteration, we
separate the data points from the cluster which are not similar.
• Each data point which is separated is considered as an individual cluster. In the end, we’ll be left with n
clusters.
Agglomerative Hierarchical Clustering Technique
• In this technique, initially each data point is considered as an individual
cluster.
• At each iteration, the similar clusters merge with other clusters until one
cluster or K clusters are formed.
• The basic algorithm of agglomerative clustering is straightforward:
• Compute the proximity matrix
• Let each data point be a cluster
• Repeat: Merge the two closest clusters and update the proximity matrix
• Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters.
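A minimal sketch of this loop for the single-link (MIN) case follows; it is deliberately naive (it re-scans all cluster pairs on every iteration) and only meant to make the pseudo-code concrete.

```python
import numpy as np

def single_link_agglomerative(X, k):
    """Merge clusters until only k remain, using MIN (single-link) proximity."""
    clusters = [[i] for i in range(len(X))]                    # each point starts as its own cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # proximity (distance) matrix
    merges = []
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]                # merge cluster b into cluster a
        del clusters[b]
    return clusters, merges
```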
Agglomerative Hierarchical Clustering - Example
• To understand better let’s see a pictorial representation of the
Agglomerative Hierarchical clustering Technique.
• Let's say we have six data points {A, B, C, D, E, F}.
• Step- 1:
• In the initial step, we calculate the proximity of individual points and consider all
the six data points as individual clusters as shown in the image below.
Agglomerative Hierarchical Clustering - Example
• Step- 2:
• In step two, similar clusters are merged together and formed as a single cluster.
• Let’s consider B,C, and D,E are similar clusters that are merged in step two.
• Now, we’re left with four clusters which are A, BC, DE, F.
• Step- 3:
• We again calculate the proximity of new clusters and merge the similar clusters to
form new clusters A, BC, DEF.
• Step- 4:
• Calculate the proximity of the new clusters.
• The clusters DEF and BC are similar and merged together to form a new cluster.
• We’re now left with two clusters A, BCDEF.
• Step- 5:
• Finally, all the clusters are merged together and form a single cluster.
Agglomerative Hierarchical Clustering - Example
The Hierarchical clustering Technique can be visualized using a
Dendrogram.
A Dendrogram is a tree-like diagram that records the sequences of
merges or splits.
Divisive Hierarchical Clustering Technique
• Divisive Hierarchical clustering is exactly the opposite of the
Agglomerative Hierarchical clustering.
• In Divisive Hierarchical clustering, we consider all the data points as a
single cluster and in each iteration, we separate the data points from the
cluster which are not similar.
• Each data point which is separated is considered as an individual
cluster.
• In the end, we’ll be left with n clusters.
• As we are dividing a single cluster into n clusters, it is named
Divisive Hierarchical clustering.
• It is not much used in the real world.
Divisive Hierarchical Clustering Technique Cont..
• Calculating the similarity between two clusters is important to merge or
divide the clusters.
• There are certain approaches which are used to calculate the similarity
between two clusters:
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Ward’s Method
MIN
• MIN, also known as the single-linkage criterion, defines the similarity of two
clusters C1 and C2 as the minimum of the similarities between points Pi and Pj
such that Pi belongs to C1 and Pj belongs to C2.
• Mathematically this can be written as,
• Sim(C1,C2) = Min Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2
• In simple words, pick the two closest points such that one point lies in
cluster one and the other point lies in cluster two, take their similarity,
and declare it as the similarity between the two clusters.
MAX
• MAX, known as the complete-linkage criterion, is exactly the opposite
of the MIN approach.
• The similarity of two clusters C1 and C2 is equal to the maximum of the
similarity between points Pi and Pj such that Pi belongs to C1 and Pj
belongs to C2.
• Mathematically this can be written as,
• Sim(C1,C2) = Max Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2
• In simple words, pick the two farthest points such that one point lies in
cluster one and the other point lies in cluster two, take their similarity,
and declare it as the similarity between the two clusters.
Group Average
• Take all the pairs of points and compute their similarities and calculate
the average of the similarities.
• Mathematically this can be written as,
• Sim(C1,C2) = ∑ Sim(Pi, Pj) / (|C1| * |C2|), where Pi ∈ C1 & Pj ∈ C2
Distance between centroids
• Compute the centroids of two clusters C1 & C2 and take the similarity
between the two centroids as the similarity between two clusters.
• This is a less popular technique in the real world.
Ward’s Method
• This approach to calculating the similarity between two clusters is the same as
Group Average, except that Ward's method uses the squares of the distances
between Pi and Pj.
• Mathematically this can be written as,
• Sim(C1,C2) = ∑ (dist(Pi, Pj))² / (|C1| * |C2|)
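These criteria can be written directly from the formulas above. The sketch below works on Euclidean distances between points; note that the Ward-style function follows the slide's simplified formula (mean of squared pairwise distances), which is not the variance-based Ward criterion most libraries implement.

```python
import numpy as np
from itertools import product

def pair_dists(c1, c2):
    # All pairwise Euclidean distances between points of cluster c1 and cluster c2.
    return np.array([np.linalg.norm(np.asarray(p) - np.asarray(q)) for p, q in product(c1, c2)])

def min_link(c1, c2):      return pair_dists(c1, c2).min()          # MIN / single link
def max_link(c1, c2):      return pair_dists(c1, c2).max()          # MAX / complete link
def group_average(c1, c2): return pair_dists(c1, c2).mean()         # sum / (|C1| * |C2|)
def ward_like(c1, c2):     return (pair_dists(c1, c2) ** 2).mean()  # slide's formula: squared distances
def centroid_dist(c1, c2): return np.linalg.norm(np.mean(c1, axis=0) - np.mean(c2, axis=0))
```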
Agglomerative Hierarchical Clustering - Example
      X      Y
P1    0.40   0.53
P2    0.22   0.38
P3    0.35   0.32
P4    0.26   0.19
P5    0.08   0.41
P6    0.45   0.30

Compute the Euclidean distance between every pair of points, e.g. dist(P1, P2) = sqrt((0.40 − 0.22)² + (0.53 − 0.38)²) ≈ 0.23, and build the (rounded) distance matrix:

      P1     P2     P3     P4     P5     P6
P1    0
P2    0.23   0
P3    0.22   0.15   0
P4    0.37   0.20   0.15   0
P5    0.34   0.14   0.28   0.29   0
P6    0.23   0.25   0.11   0.22   0.39   0

Using MIN (single link), the smallest entry is dist(P3, P6) = 0.11, so P3 and P6 are merged first.
Merging P3 and P6 into cluster (P3, P6), update the distance matrix using MIN:
• dist[(P3, P6), P1] = MIN(dist(P3, P1), dist(P6, P1)) = MIN(0.22, 0.23) = 0.22
• dist[(P3, P6), P2] = MIN(dist(P3, P2), dist(P6, P2)) = MIN(0.15, 0.25) = 0.15
• dist[(P3, P6), P4] = MIN(dist(P3, P4), dist(P6, P4)) = MIN(0.15, 0.22) = 0.15
• dist[(P3, P6), P5] = MIN(dist(P3, P5), dist(P6, P5)) = MIN(0.28, 0.39) = 0.28
The updated distance matrix for cluster (P3, P6):

         P1     P2     (P3,P6)  P4     P5
P1       0
P2       0.23   0
(P3,P6)  0.22   0.15   0
P4       0.37   0.20   0.15     0
P5       0.34   0.14   0.28     0.29   0

The smallest entry is now dist(P2, P5) = 0.14, so P2 and P5 are merged next.
Merging P2 and P5 into cluster (P2, P5), update the distance matrix using MIN:
• dist[(P2, P5), P1] = MIN(dist(P2, P1), dist(P5, P1)) = MIN(0.23, 0.34) = 0.23
• dist[(P2, P5), (P3, P6)] = MIN(dist(P2, (P3, P6)), dist(P5, (P3, P6))) = MIN(0.15, 0.28) = 0.15
• dist[(P2, P5), P4] = MIN(dist(P2, P4), dist(P5, P4)) = MIN(0.20, 0.29) = 0.20
The updated distance matrix with clusters (P2, P5) and (P3, P6):

         P1     (P2,P5)  (P3,P6)  P4
P1       0
(P2,P5)  0.23   0
(P3,P6)  0.22   0.15     0
P4       0.37   0.20     0.15     0
The smallest remaining distance is 0.15, between (P2, P5) and (P3, P6), so these two clusters are merged next.
Merging (P2, P5) and (P3, P6) into cluster (P2, P5, P3, P6), update the distance matrix:
• dist[(P2, P5, P3, P6), P1] = MIN(dist((P2, P5), P1), dist((P3, P6), P1)) = MIN(0.23, 0.22) = 0.22
• dist[(P2, P5, P3, P6), P4] = MIN(dist((P2, P5), P4), dist((P3, P6), P4)) = MIN(0.20, 0.15) = 0.15

                 P1     (P2,P5,P3,P6)  P4
P1               0
(P2,P5,P3,P6)    0.22   0
P4               0.37   0.15           0

The smallest entry is 0.15, so P4 joins the cluster (P2, P5, P3, P6) next.
Merging P4 into the cluster, update the distance matrix:
• dist[(P2, P5, P3, P6, P4), P1] = MIN(dist((P2, P5, P3, P6), P1), dist(P4, P1)) = MIN(0.22, 0.37) = 0.22

                    P1     (P2,P5,P3,P6,P4)
P1                  0
(P2,P5,P3,P6,P4)    0.22   0

Finally, P1 joins at distance 0.22 and all points form a single cluster.
The resulting dendrogram has the leaf order P3, P6, P2, P5, P4, P1, recording the merge sequence (P3, P6), (P2, P5), ((P2, P5), (P3, P6)), then P4, and finally P1.
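The whole worked example can be reproduced with SciPy (not mentioned in the slides, so this is only a cross-check). The hand computation rounds distances to two decimals, so tie-breaking in the merge order may differ slightly from the table above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform

P = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # P1..P6

print(np.round(squareform(pdist(P)), 2))   # the pairwise Euclidean distance matrix above
Z = linkage(P, method='single')            # MIN / single-link merges
print(np.round(Z, 2))                      # each row: the two clusters merged and the merge distance
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])   # needs matplotlib to display
```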
55
What is Cluster Analysis?
■ Cluster: A collection of data objects
■ similar (or related) to one another within the same group
■ dissimilar (or unrelated) to the objects in other groups
■ Cluster analysis (or clustering, data segmentation, …)
■ Finding similarities between data according to the characteristics found
in the data and grouping similar data objects into clusters
■ Unsupervised learning: no predefined classes (i.e., learning by observations
vs. learning by examples: supervised)
■ Typical applications
■ As a stand-alone tool to get insight into data distribution
■ As a preprocessing step for other algorithms
56
Clustering for Data Understanding and
Applications
■ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and
species
■ Information retrieval: document clustering
■ Land use: Identification of areas of similar land use in an earth observation database
■ Marketing: Help marketers discover distinct groups in their customer bases, and then
use this knowledge to develop targeted marketing programs
■ City-planning: Identifying groups of houses according to their house type, value, and
geographical location
■ Earthquake studies: Observed earthquake epicenters should be clustered along
continent faults
■ Climate: understanding earth climate, find patterns of atmospheric and ocean
■ Economic Science: market research
57
Clustering as a Preprocessing Tool (Utility)
■ Summarization:
■ Preprocessing for regression, PCA, classification, and
association analysis
■ Compression:
■ Image processing: vector quantization
■ Finding K-nearest Neighbors
■ Localizing search to one or a small number of clusters
■ Outlier detection
■ Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
■ A good clustering method will produce high quality clusters
■ high intra-class similarity: cohesive within clusters
■ low inter-class similarity: distinctive between clusters
■ The quality of a clustering method depends on
■ the similarity measure used by the method
■ its implementation, and
■ Its ability to discover some or all of the hidden patterns
58
Measure the Quality of Clustering
■ Dissimilarity/Similarity metric
■ Similarity is expressed in terms of a distance function, typically metric:
d(i, j)
■ The definitions of distance functions are usually rather different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
■ Weights should be associated with different variables based on
applications and data semantics
■ Quality of clustering:
■ There is usually a separate “quality” function that measures the
“goodness” of a cluster.
■ It is hard to define “similar enough” or “good enough”
■ The answer is typically highly subjective
59
Considerations for Cluster Analysis
■ Partitioning criteria
■ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
■ Separation of clusters
■ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
■ Similarity measure
■ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g.,
density or contiguity)
■ Clustering space
■ Full space (often when low dimensional) vs. subspaces (often in high-dimensional
clustering)
60
Requirements and Challenges
■ Scalability
■ Clustering all the data instead of only on samples
■ Ability to deal with different types of attributes
■ Numerical, binary, categorical, ordinal, linked, and mixture of these
■ Constraint-based clustering
■ User may give inputs on constraints
■ Use domain knowledge to determine input parameters
■ Interpretability and usability
■ Others
■ Discovery of clusters with arbitrary shape
■ Ability to deal with noisy data
■ Incremental clustering and insensitivity to input order
■ High dimensionality
61
Major Clustering Approaches (I)
■ Partitioning approach:
■ Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of square errors
■ Typical methods: k-means, k-medoids, CLARANS
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or objects) using some criterion
■ Typical methods: Diana, Agnes, BIRCH, CAMELEON
■ Density-based approach:
■ Based on connectivity and density functions
■ Typical methods: DBSCAN, OPTICS, DenClue
■ Grid-based approach:
■ based on a multiple-level granularity structure
■ Typical methods: STING, WaveCluster, CLIQUE
62
Major Clustering Approaches (II)
■ Model-based:
■ A model is hypothesized for each of the clusters and the method tries to find the best fit
of the data to the given model
■ Typical methods: EM, SOM, COBWEB
■ Frequent pattern-based:
■ Based on the analysis of frequent patterns
■ Typical methods: p-Cluster
■ User-guided or constraint-based:
■ Clustering by considering user-specified or application-specific constraints
■ Typical methods: COD (obstacles), constrained clustering
■ Link-based clustering:
■ Objects are often linked together in various ways
■ Massive links can be used to cluster objects: SimRank, LinkClus
63
Partitioning Algorithms: Basic Concept
■ Partitioning method: Partitioning a database D of n objects into a set of k clusters,
such that the sum of squared distances is minimized (where ci is the centroid or
medoid of cluster Ci)
■ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
■ Global optimal: exhaustively enumerate all partitions
■ Heuristic methods: k-means and k-medoids algorithms
■ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the
center of the cluster
■ k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):
Each cluster is represented by one of the objects in the cluster
64
The K-Means Clustering Method
■ Given k, the k-means algorithm is implemented in four steps:
■ Partition objects into k nonempty subsets
■ Compute seed points as the centroids of the clusters of the
current partitioning (the centroid is the center, i.e., mean
point, of the cluster)
■ Assign each object to the cluster with the nearest seed point
■ Go back to Step 2, stop when the assignment does not
change
65
An Example of K-Means Clustering
K = 2
(Figure: starting from the initial data set, arbitrarily partition the objects into k groups, update the cluster centroids, reassign objects to the nearest centroid, and loop if needed.)
■ Partition objects into k nonempty subsets
■ Repeat
■ Compute the centroid (i.e., mean point) for each partition
■ Assign each object to the cluster of its nearest centroid
■ Until no change
What Is the Problem of the K-Means Method?
■ The k-means algorithm is sensitive to outliers !
■ Since an object with an extremely large value may substantially distort the
distribution of the data
■ K-Medoids: Instead of taking the mean value of the object in a cluster as a
reference point, medoids can be used, which is the most centrally located object
in a cluster
PAM: A Typical K-Medoids Algorithm
(Figure: with K = 2, arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid (Total Cost = 20). Randomly select a non-medoid object, Orandom, and compute the total cost of swapping (Total Cost = 26). Swap O and Orandom if the quality is improved; repeat the loop until no change.)
The K-Medoid Clustering Method
■ K-Medoids Clustering: Find representative objects (medoids) in clusters
■ PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
■ Starts from an initial set of medoids and iteratively replaces one of the medoids by one
of the non-medoids if it improves the total distance of the resulting clustering
■ PAM works effectively for small data sets, but does not scale well for large data sets
(due to the computational complexity)
■ Efficiency improvement on PAM
■ CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
■ CLARANS (Ng & Han, 1994): Randomized re-sampling
69
Hierarchical Clustering
■ Use distance matrix as clustering criteria. This method does not
require the number of clusters k as an input, but needs a termination
condition
(Figure: Step 0 → Step 4, agglomerative (AGNES): a and b merge into (a, b); d and e merge into (d, e); c joins to form (c, d, e); finally (a, b) and (c, d, e) merge into (a, b, c, d, e). Step 4 → Step 0, divisive (DIANA): the splits proceed in the reverse order.)
70
AGNES (Agglomerative Nesting)
■ Introduced in Kaufmann and Rousseeuw (1990)
■ Implemented in statistical packages, e.g., Splus
■ Use the single-link method and the dissimilarity matrix
■ Merge nodes that have the least dissimilarity
■ Go on in a non-descending fashion
■ Eventually all nodes belong to the same cluster
71
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram
A clustering of the data objects is obtained by cutting the dendrogram at the
desired level, then each connected component forms a cluster
72
DIANA (Divisive Analysis)
■ Introduced in Kaufmann and Rousseeuw (1990)
■ Implemented in statistical analysis packages, e.g., Splus
■ Inverse order of AGNES
■ Eventually each node forms a cluster on its own
73
Distance between Clusters
■ Single link: smallest distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = min(tip, tjq)
■ Complete link: largest distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = max(tip, tjq)
■ Average: avg distance between an element in one cluster and an element in the other, i.e.,
dist(Ki, Kj) = avg(tip, tjq)
■ Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
■ Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
■ Medoid: a chosen, centrally located object in the cluster
74
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
■ Centroid: the “middle” of a cluster (the mean of its points)
■ Radius: square root of the average squared distance from the points of the cluster to
its centroid
■ Diameter: square root of the average squared distance between all
pairs of points in the cluster
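The formulas themselves are not reproduced on the slide; the sketch below uses the usual BIRCH-style definitions (average squared distances under the square root), so treat it as an assumed reading rather than the slide's exact equations.

```python
import numpy as np

def centroid(X):
    return X.mean(axis=0)

def radius(X):
    # Square root of the average squared distance from the cluster's points to its centroid.
    return np.sqrt(((X - centroid(X)) ** 2).sum(axis=1).mean())

def diameter(X):
    # Square root of the average squared distance over all pairs of distinct points (needs n >= 2).
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq.sum() / (n * (n - 1)))
```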
75
BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies)
■ Zhang, Ramakrishnan & Livny, SIGMOD’96
■ Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for
multiphase clustering
■ Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the
data that tries to preserve the inherent clustering structure of the data)
■ Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
■ Scales linearly: finds a good clustering with a single scan and improves the quality with a few
additional scans
■ Weakness: handles only numeric data, and sensitive to the order of the data record
76
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points (componentwise)
SS: square sum of the N points (componentwise)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))
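The CF triple for the five example points can be checked directly; SS is taken componentwise, as in the example.

```python
import numpy as np

pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(pts)                  # 5
LS = pts.sum(axis=0)          # linear sum  -> [16 30]
SS = (pts ** 2).sum(axis=0)   # square sum  -> [ 54 190]
print(N, LS, SS)              # CF = (5, (16, 30), (54, 190))
```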
77
CF-Tree in BIRCH
■ Clustering feature:
■ Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd
moments of the subcluster from the statistical point of view
■ Registers crucial measurements for computing clusters and utilizes storage
efficiently
■ A CF tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering
■ A nonleaf node in a tree has descendants or “children”
■ The nonleaf nodes store sums of the CFs of their children
■ A CF tree has two parameters
■ Branching factor: max # of children
■ Threshold: max diameter of sub-clusters stored at the leaf nodes
78
The CF Tree Structure
(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and the non-leaf nodes hold CF entries (CF1, CF2, ..., CF6) with child pointers; the leaf nodes hold CF entries and are chained together by prev/next pointers.)
The Birch Algorithm
■ Cluster Diameter
■ For each point in the input
■ Find closest leaf entry
■ Add point to leaf entry and update CF
■ If entry diameter > max_diameter, then split leaf, and possibly parents
■ Algorithm is O(n)
■ Concerns
■ Sensitive to insertion order of data points
■ Since the size of leaf nodes is fixed, the clusters may not be so natural
■ Clusters tend to be spherical given the radius and diameter measures
80
Density-Based Clustering Methods
■ Clustering based on density (local cluster criterion), such as density-connected
points
■ Major features:
■ Discover clusters of arbitrary shape
■ Handle noise
■ One scan
■ Need density parameters as termination condition
■ Several interesting studies:
■ DBSCAN: Ester, et al. (KDD’96)
■ OPTICS: Ankerst, et al (SIGMOD’99).
■ DENCLUE: Hinneburg & D. Keim (KDD’98)
■ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
81
Density-Based Clustering: Basic Concepts
■ Two parameters:
■ Eps: Maximum radius of the neighbourhood
■ MinPts: Minimum number of points in an Eps-neighbourhood of that
point
■ NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
■ Directly density-reachable: A point p is directly density-reachable from a
point q w.r.t. Eps, MinPts if
■ p belongs to NEps(q)
■ core point condition:
|NEps (q)| ≥ MinPts
(Figure: p lies in the Eps-neighborhood of the core point q, with MinPts = 5 and Eps = 1 cm.)
Density-Reachable and Density-Connected
■ Density-reachable:
■ A point p is density-reachable from a point q
w.r.t. Eps, MinPts if there is a chain of points
p1, …, pn, p1 = q, pn = p such that pi+1 is
directly density-reachable from pi
■ Density-connected
■ A point p is density-connected to a point q
w.r.t. Eps, MinPts if there is a point o such that
both, p and q are density-reachable from o
w.r.t. Eps and MinPts
(Figures: p is density-reachable from q through an intermediate point p1; p and q are both density-reachable from o, so p and q are density-connected.)
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
■ Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases
with noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
■ Arbitrarily select a point p
■ Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
■ If p is a core point, a cluster is formed
■ If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
■ Continue the process until all of the points have been
processed
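A usage sketch with scikit-learn's implementation (not part of the slides): Eps maps to the `eps` parameter, MinPts to `min_samples`, and points labelled -1 are reported as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))        # made-up 2-D data in the unit square

labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int((labels == -1).sum()), "noise points")
```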
85
DBSCAN: Sensitive to Parameters
86
Grid-Based Clustering Method
■ Using multi-resolution grid data structure
■ Several interesting methods
■ STING (a STatistical INformation Grid approach) by Wang,
Yang and Muntz (1997)
■ WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
■ A multi-resolution clustering approach using wavelet
method
■ CLIQUE: Agrawal, et al. (SIGMOD’98)
■ Both grid-based and subspace clustering
87
STING: A Statistical Information Grid Approach
■ Wang, Yang and Muntz (VLDB’97)
■ The spatial area is divided into rectangular cells
■ There are several levels of cells corresponding to different
levels of resolution
88
The STING Clustering Method
■ Each cell at a high level is partitioned into a number of smaller cells in the
next lower level
■ Statistical info of each cell is calculated and stored beforehand and is
used to answer queries
■ Parameters of higher level cells can be easily calculated from parameters
of lower level cell
■ count, mean, s, min, max
■ type of distribution—normal, uniform, etc.
■ Use a top-down approach to answer spatial data queries
■ Start from a pre-selected layer—typically with a small number of cells
■ For each cell in the current level compute the confidence interval
89
STING Algorithm and Its Analysis
■ Remove the irrelevant cells from further consideration
■ When finish examining the current layer, proceed to the next lower
level
■ Repeat this process until the bottom layer is reached
■ Advantages:
■ Query-independent, easy to parallelize, incremental update
■ O(K), where K is the number of grid cells at the lowest level
■ Disadvantages:
■ All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected
90
91
CLIQUE (Clustering In QUEst)
■ Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
■ Automatically identifying subspaces of a high dimensional data space that allow better
clustering than original space
■ CLIQUE can be considered as both density-based and grid-based
■ It partitions each dimension into the same number of equal-length intervals
■ It partitions an m-dimensional data space into non-overlapping rectangular units
■ A unit is dense if the fraction of total data points contained in the unit exceeds the
input model parameter
■ A cluster is a maximal set of connected dense units within a subspace
92
CLIQUE: The Major Steps
■ Partition the data space and find the number of points that lie inside each
cell of the partition.
■ Identify the subspaces that contain clusters using the Apriori principle
■ Identify clusters
■ Determine dense units in all subspaces of interests
■ Determine connected dense units in all subspaces of interests.
■ Generate minimal description for the clusters
■ Determine maximal regions that cover a cluster of connected dense
units for each cluster
■ Determination of minimal cover for each cluster
93
(Figures: grid partitions of the (age, salary) and (age, vacation) subspaces — salary in units of $10,000, vacation in weeks, age from 20 to 60 — with density threshold τ = 3.)
Strength and Weakness of CLIQUE
■ Strength
■ automatically finds subspaces of the highest dimensionality
such that high density clusters exist in those subspaces
■ insensitive to the order of records in input and does not
presume some canonical data distribution
■ scales linearly with the size of input and has good scalability as
the number of dimensions in the data increases
■ Weakness
■ The accuracy of the clustering result may be degraded at the
expense of simplicity of the method
Summary
■ Cluster analysis groups objects based on their similarity and has wide applications
■ Measure of similarity can be computed for various types of data
■ Clustering algorithms can be categorized into partitioning methods, hierarchical
methods, density-based methods, grid-based methods, and model-based methods
■ K-means and K-medoids algorithms are popular partitioning-based clustering
algorithms
■ Birch and Chameleon are interesting hierarchical clustering algorithms, and there are
also probabilistic hierarchical clustering algorithms
■ DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
■ STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace
clustering algorithm
■ Quality of clustering results can be evaluated in various ways
95
96
What Are Outliers?
■ Outlier: A data object that deviates significantly from the normal objects as if it were
generated by a different mechanism
■ Ex.: Unusual credit card purchase; sports: Michael Jordan, Wayne Gretzky, ...
■ Outliers are different from the noise data
■ Noise is random error or variance in a measured variable
■ Noise should be removed before outlier detection
■ Outliers are interesting: It violates the mechanism that generates the normal data
■ Outlier detection vs. novelty detection: early stage, outlier; but later merged into the
model
■ Applications:
■ Credit card fraud detection
■ Telecom fraud detection
■ Customer segmentation
■ Medical analysis
97
Types of Outliers (I)
■ Three kinds: global, contextual and collective outliers
■ Global outlier (or point anomaly)
■ Object is Og if it significantly deviates from the rest of the data set
■ Ex. Intrusion detection in computer networks
■ Issue: Find an appropriate measurement of deviation
■ Contextual outlier (or conditional outlier)
■ Object is Oc if it deviates significantly based on a selected context
■ Ex. 80° F in Urbana: outlier? (depending on summer or winter?)
■ Attributes of data objects should be divided into two groups
■ Contextual attributes: defines the context, e.g., time & location
■ Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g.,
temperature
■ Can be viewed as a generalization of local outliers—whose density significantly
deviates from its local area
■ Issue: How to define or formulate meaningful context?
Global Outlier
98
Types of Outliers (II)
■ Collective Outliers
■ A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
■ Applications: E.g., intrusion detection:
■ When a number of computers keep sending
denial-of-service packages to each other
Collective Outlier
■ Detection of collective outliers
■ Consider not only behavior of individual objects, but also that of groups
of objects
■ Need to have the background knowledge on the relationship among
data objects, such as a distance or similarity measure on objects.
■ A data set may have multiple types of outlier
■ One object may belong to more than one type of outlier
99
Challenges of Outlier Detection
■ Modeling normal objects and outliers properly
■ Hard to enumerate all possible normal behaviors in an application
■ The border between normal and outlier objects is often a gray area
■ Application-specific outlier detection
■ Choice of distance measure among objects and the model of relationship among objects
are often application-dependent
■ E.g., clinic data: a small deviation could be an outlier; while in marketing analysis, larger
fluctuations
■ Handling noise in outlier detection
■ Noise may distort the normal objects and blur the distinction between normal objects and
outliers. It may help hide outliers and reduce the effectiveness of outlier detection
■ Understandability
■ Understand why these are outliers: Justification of the detection
■ Specify the degree of an outlier: the unlikelihood of the object being generated by a
normal mechanism
Outlier Detection I: Supervised Methods
■ Two ways to categorize outlier detection methods:
■ Based on whether user-labeled examples of outliers can be obtained:
■ Supervised, semi-supervised vs. unsupervised methods
■ Based on assumptions about normal data and outliers:
■ Statistical, proximity-based, and clustering-based methods
■ Outlier Detection I: Supervised Methods
■ Modeling outlier detection as a classification problem
■ Samples examined by domain experts used for training & testing
■ Methods for Learning a classifier for outlier detection effectively:
■ Model normal objects & report those not matching the model as outliers, or
■ Model outliers and treat those not matching the model as normal
■ Challenges
■ Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some artificial
outliers
■ Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not
mislabeling normal objects as outliers)
100
Outlier Detection II: Unsupervised Methods
■ Assume the normal objects are somewhat “clustered” into multiple groups, each having
some distinct features
■ An outlier is expected to be far away from any groups of normal objects
■ Weakness: Cannot detect collective outlier effectively
■ Normal objects may not share any strong patterns, but the collective outliers may
share high similarity in a small area
■ Ex. In some intrusion or virus detection, normal activities are diverse
■ Unsupervised methods may have a high false positive rate but still miss many real
outliers.
■ Supervised methods can be more effective, e.g., identify attacking some key resources
■ Many clustering methods can be adapted for unsupervised methods
■ Find clusters, then outliers: not belonging to any cluster
■ Problem 1: Hard to distinguish noise from outliers
■ Problem 2: Costly since first clustering: but far fewer outliers than normal objects
■ Newer methods: tackle outliers directly
101
Outlier Detection III: Semi-Supervised Methods
■ Situation: In many applications, the number of labeled data is often small: Labels could be
on outliers only, normal objects only, or both
■ Semi-supervised outlier detection: Regarded as applications of semi-supervised learning
■ If some labeled normal objects are available
■ Use the labeled examples and the proximate unlabeled objects to train a model for
normal objects
■ Those not fitting the model of normal objects are detected as outliers
■ If only some labeled outliers are available, a small number of labeled outliers may not
cover the possible outliers well
■ To improve the quality of outlier detection, one can get help from models for normal
objects learned from unsupervised methods
102
Outlier Detection (1): Statistical Methods
■ Statistical methods (also known as model-based methods) assume
that the normal data follow some statistical model (a stochastic model)
■ The data not following the model are outliers.
■ Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
■ There are rich alternatives to use various statistical models
■ E.g., parametric vs. non-parametric
■ Example (right figure): First use Gaussian distribution to model the
normal data
■ For each object y in region R, estimate gD(y), the probability of y
fits the Gaussian distribution
■ If gD(y) is very low, y is unlikely generated by the Gaussian
model, thus an outlier
Outlier Detection (2): Proximity-Based Methods
■ An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set
■ The effectiveness of proximity-based methods highly relies on the proximity measure.
■ In some applications, proximity or distance measures cannot be obtained easily.
■ Often have a difficulty in finding a group of outliers which stay close to each other
■ Two major types of proximity-based outlier detection
■ Distance-based vs. density-based
■ Example (right figure): Model the proximity of an object using its 3
nearest neighbors
■ Objects in region R are substantially different from other objects in
the data set.
■ Thus the objects in R are outliers
Outlier Detection (3): Clustering-Based Methods
■ Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
■ Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
■ Clustering is expensive: straightforward adaption of a
clustering method for outlier detection can be costly and
does not scale up well for large data sets
■ Example (right figure): two clusters
■ All points not in R form a large cluster
■ The two points in R form a tiny cluster, thus are
outliers
Statistical Approaches
■ Statistical approaches assume that the objects in a data set are generated by a stochastic
process (a generative model)
■ Idea: learn a generative model fitting the given data set, and then identify the objects in low
probability regions of the model as outliers
■ Methods are divided into two categories: parametric vs. non-parametric
■ Parametric method
■ Assumes that the normal data is generated by a parametric distribution with parameter θ
■ The probability density function of the parametric distribution f(x, θ) gives the probability
that object x is generated by the distribution
■ The smaller this value, the more likely x is an outlier
■ Non-parametric method
■ Does not assume an a priori statistical model; determines the model from the input data
■ Not completely parameter-free, but the number and nature of the parameters are
flexible and not fixed in advance
■ Examples: histogram and kernel density estimation
106
Parametric Methods I: Detecting Univariate Outliers Based on the Normal
Distribution
■ Univariate data: A data set involving only one attribute or variable
■ Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability as outliers
■ Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
■ Use the maximum likelihood method to estimate μ and σ
■ Taking derivatives with respect to μ and σ², we derive the maximum likelihood
estimates (the sample mean and the average squared deviation)
■ For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.51
■ Then (24 – 28.61) / 1.51 = –3.04 < –3, so 24.0 is an outlier, since it lies more than
3 standard deviations from the estimated mean
Parametric Methods I: The Grubbs Test
■ Univariate outlier detection: the Grubbs test (maximum normed residual
test) ─ another statistical method under the normal distribution assumption
■ For each object x in a data set, compute its z-score; x is an outlier if the
z-score exceeds a critical value derived from the value taken by a
t-distribution at a significance level of α/(2N), where N is the # of objects
in the data set
108
Parametric Methods II: Detection of
Multivariate Outliers
■ Multivariate data: A data set involving two or more attributes or variables
■ Transform the multivariate outlier detection task into a univariate outlier detection problem
■ Method 1. Compute the Mahalanobis distance
■ Let ō be the mean vector for a multivariate data set. The Mahalanobis distance for an object o
to ō is MDist(o, ō) = (o – ō)ᵀ S⁻¹ (o – ō), where S is the covariance matrix
■ Use the Grubbs test on this measure to detect outliers
■ Method 2. Use χ2 –statistic:
■ where Ei is the mean of the i-dimension among all objects, and n is the dimensionality
■ If χ2 –statistic is large, then object oi is an outlier
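A minimal sketch of Method 1 in NumPy, computing the slide's MDist score for every object (the Grubbs-style cutoff on the resulting scores is left out); function and variable names are illustrative.

```python
import numpy as np

def mahalanobis_scores(X):
    """MDist(o, o_bar) = (o - o_bar)^T S^(-1) (o - o_bar) for each row o of the data matrix X."""
    mean = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # covariance matrix of the attributes
    S_inv = np.linalg.inv(S)
    diffs = X - mean
    return np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)

# Objects with the largest scores are the multivariate outlier candidates.
```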
109
Parametric Methods III: Using Mixture of Parametric
Distributions
■ Assuming the data are generated by a single normal distribution can sometimes be
overly simplistic
■ Example (right figure): The objects between the two clusters cannot
be captured as outliers since they are close to the estimated mean
■ To overcome this problem, assume the normal data is generated by two normal distributions. For any object o in the data set, the
probability that o is generated by the mixture of the two distributions is given by
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
■ Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
■ An object o is an outlier if it does not belong to any cluster
Non-Parametric Methods: Detection Using Histogram
■ The model of normal data is learned from the input data
without any a priori structure.
■ Often makes fewer assumptions about the data, and thus can
be applicable in more scenarios
■ Outlier detection using histogram:
■ Figure shows the histogram of purchase amounts in transactions
■ A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an
amount higher than $5,000
■ Problem: Hard to choose an appropriate bin size for histogram
■ Too small bin size → normal objects in empty/rare bins, false positive
■ Too big bin size → outliers in some frequent bins, false negative
■ Solution: Adopt kernel density estimation to estimate the probability density distribution of the data.
If the estimated density function is high, the object is likely normal. Otherwise, it is likely an outlier.
Proximity-Based Approaches: Distance-Based vs. Density-
Based Outlier Detection
■ Intuition: Objects that are far away from the others are outliers
■ Assumption of proximity-based approach: The proximity of an
outlier deviates significantly from that of most of the others in the
data set
■ Two types of proximity-based outlier detection methods
■ Distance-based outlier detection: An object o is an outlier if its
neighborhood does not have enough other points
■ Density-based outlier detection: An object o is an outlier if its
density is relatively much lower than that of its neighbors
112
Distance-Based Outlier Detection
■ For each object o, examine the # of other objects in the r-neighborhood of o, where r is a user-
specified distance threshold
■ An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away
from o, i.e., not in the r-neighborhood of o
■ An object o is a DB(r, π)-outlier if its r-neighborhood does not contain enough of the other
objects in D (with π acting as the fraction threshold)
■ Equivalently, one can check the distance between o and its k-th nearest neighbor ok (with k
chosen from π and n): o is an outlier if dist(o, ok) > r
■ Efficient computation: Nested loop algorithm
■ For any object oi, calculate its distance from other objects, and count the # of other objects in the
r-neighborhood.
■ If π∙n other objects are within r distance, terminate the inner loop
■ Otherwise, oi is a DB(r, π) outlier
■ Efficiency: Actually CPU time is not O(n2) but linear to the data set size since for most non-outlier
objects, the inner loop terminates early
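A minimal sketch of the nested-loop test as described above: an object is kept as an outlier unless at least ⌈π·n⌉ other objects fall within distance r of it. The parameter names are illustrative.

```python
import numpy as np

def db_outliers(X, r, pi):
    """Indices of DB(r, pi)-outliers via the nested-loop test."""
    n = len(X)
    needed = int(np.ceil(pi * n))            # enough neighbours within r => not an outlier
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if i != j and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count >= needed:          # early termination of the inner loop
                    break
        else:
            outliers.append(i)               # inner loop never broke: too few close neighbours
    return outliers
```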
113
Distance-Based Outlier Detection: A Grid-Based Method
■ Why efficiency is still a concern? When the complete set of objects cannot be held into main
memory, cost I/O swapping
■ The major cost: (1) each object tests against the whole data set, why not only its close
neighbor? (2) check objects one by one, why not group by group?
■ Grid-based method (CELL): Data space is partitioned into a multi-D grid. Each cell is a hyper
cube with diagonal length r/2
■ Pruning using the level-1 & level 2 cell properties:
■ For any possible point x in cell C and any possible point y in a level-1
cell, dist(x,y) ≤ r
■ For any possible point x in cell C and any point y such that dist(x,y) ≥ r,
y is in a level-2 cell
■ Thus we only need to check the objects that cannot be pruned, and even for such
an object o, only need to compute the distance between o and the objects in the
level-2 cells (since beyond level-2, the distance from o is more than r)
Density-Based Outlier Detection
■ Local outliers: Outliers comparing to their local
neighborhoods, instead of the global data distribution
■ In the figure, o1 and o2 are local outliers relative to C1, o3 is a global
outlier, but o4 is not an outlier. However, proximity-based detection using the
global distribution cannot identify o1 and o2 as outliers (e.g., when comparing
them with o4).
■ Intuition (density-based outlier detection): The density around an outlier object is
significantly different from the density around its neighbors
■ Method: Use the relative density of an object against its neighbors as the indicator of the
degree of the object being outliers
■ k-distance of an object o, distk(o): distance between o and its k-th NN
■ k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
■ Nk(o) could be bigger than k since multiple objects may have identical distance to o
Local Outlier Factor: LOF
■ Reachability distance from o’ to o:
■ where k is a user-specified parameter
■ Local reachability density of o:
■ The LOF (Local Outlier Factor) of an object o is the average of the ratios of the local
reachability density of o's k-nearest neighbors to that of o
■ The lower the local reachability density of o, and the higher the local reachability density of the
kNN of o, the higher LOF
■ This captures a local outlier whose local density is relatively low comparing to the local densities
of its kNN
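scikit-learn ships an implementation of this score (not part of the slides); `fit_predict` marks local outliers with -1 and `negative_outlier_factor_` stores the negated LOF values.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.default_rng(0).normal(size=(200, 2))
X = np.vstack([X, [[6.0, 6.0]]])                   # append one obvious outlier

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                        # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_             # higher score = more outlying
print(labels[-1], round(float(scores[-1]), 2))
```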
Clustering-Based Outlier Detection (1 & 2):
Not belong to any cluster, or far from the closest one
■ An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance
between the object and its closest cluster , or (3) it belongs to a small or sparse cluster
■ Case I: Not belong to any cluster
■ Identify animals not part of a flock: Using a density-based clustering
method such as DBSCAN
■ Case 2: Far from its closest cluster
■ Using k-means, partition the data points into clusters
■ For each object o, assign an outlier score based on its distance from
its closest center
■ If dist(o, co)/avg_dist(co) is large, likely an outlier
■ Ex. Intrusion detection: Consider the similarity between data points and the
clusters in a training data set
■ Use a training set to find patterns of “normal” data, e.g., frequent itemsets in each
segment, and cluster similar connections into groups
■ Compare new data points with the clusters mined—Outliers are possible attacks
Clustering-Based Outlier Detection (3):
Detecting Outliers in Small Clusters
■ FindCBLOF: Detect outliers in small clusters
■ Find clusters, and sort them in decreasing size
■ To each data point, assign a cluster-based local outlier factor (CBLOF):
■ If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
■ If p belongs to a small one, CBLOF = cluster size × similarity between p and the closest large cluster
■ Ex. In the figure, o is an outlier since its closest large cluster is C1 but the similarity between o and C1 is small. For any point in C3, the closest large cluster is C2 but its similarity to C2 is low; in addition, |C3| = 3 is small. A rough sketch of this scoring follows.
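The sketch below mimics the FindCBLOF idea with k-means: clusters are sorted by size, the largest clusters covering about 90% of the data are treated as "large", and 1/(1 + distance) is used as a stand-in for the similarity in the slides. The 90% cutoff, the choice of k-means, and all names are our assumptions.

import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, n_clusters=3, alpha=0.9):
    """Smaller score means a more likely outlier, as in the slides."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    order = np.argsort(sizes)[::-1]                       # clusters by decreasing size
    covered = np.cumsum(sizes[order]) / len(X)
    large = set(order[: np.searchsorted(covered, alpha) + 1].tolist())
    scores = np.empty(len(X))
    for i, (x, c) in enumerate(zip(X, km.labels_)):
        if c in large:                                    # large cluster: its own center
            center = km.cluster_centers_[c]
        else:                                             # small cluster: closest large center
            center = min((km.cluster_centers_[l] for l in large),
                         key=lambda ctr: np.linalg.norm(x - ctr))
        similarity = 1.0 / (1.0 + np.linalg.norm(x - center))
        scores[i] = sizes[c] * similarity                 # cluster size times similarity
    return scores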
Clustering-Based Methods: Strengths and Weaknesses
■ Strengths
■ Detect outliers without requiring any labeled data
■ Work for many types of data
■ Clusters can be regarded as summaries of the data
■ Once the clusters are obtained, each object only needs to be compared against the clusters to determine whether it is an outlier (fast)
■ Weaknesses
■ Effectiveness depends highly on the clustering method used; it may not be optimized for outlier detection
■ High computational cost: the clusters must be found first
■ A method to reduce the cost: fixed-width clustering (a minimal sketch follows)
■ A point is assigned to a cluster if the cluster's center is within a pre-defined distance threshold (the width) of the point
■ If a point cannot be assigned to any existing cluster, a new cluster is created; the distance threshold may, under certain conditions, be learned from the training data
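To make the cost-reduction idea concrete, here is a minimal sketch of single-pass fixed-width clustering, assuming the width w is given (e.g., chosen from training data); the names are ours.

import numpy as np

def fixed_width_clusters(X, w):
    """Single pass: assign each point to the nearest center within w, else open a new cluster."""
    centers, counts = [], []
    for x in np.asarray(X, dtype=float):
        d = [np.linalg.norm(x - c) for c in centers]
        if d and min(d) <= w:
            counts[int(np.argmin(d))] += 1
        else:
            centers.append(x)                 # the point becomes the center of a new cluster
            counts.append(1)
    return centers, counts                    # clusters with small counts are outlier candidates

centers, counts = fixed_width_clusters(
    [[0, 0], [0.1, 0.1], [5, 5], [5.1, 5.0], [9, 9]], w=1.0)
print(len(centers), counts)                   # 3 clusters; the last holds a single point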
  • 1. What is Clustering? Clustering is one of the most important research areas in the field of data mining. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is an unsupervised learning technique. Data clustering is the subject of active research in several fields such as statistics, pattern recognition and machine learning. From a practical perspective clustering plays an outstanding role in data mining applications in many domains. The main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or none of the background knowledge. Clustering algorithms an be applied in many areas, like marketing, biology, libraries, insurance, city-planning, earthquake studies and www document classification.
  • 2. Applications of Clustering Real life examples where we use clustering: • Marketing • Finding group of customers with similar behavior given a large data-base of customers. • Data containing their properties and past buying records (Conceptual Clustering). • Biology • Classification of Plants and Animals Based on the properties under observation (Conceptual Clustering). • Insurance • Identifying groups of car insurance policy holders with a high average claim cost (Conceptual Clustering). • City-Planning • Groups of houses according to their house type, value and geographical location it can be both (Conceptual Clustering and Distance Based Clustering) • Libraries • It is used in clustering different books on the basis of topics and information. • Earthquake studies • By learning the earthquake-affected areas we can determine the dangerous zones.
  • 3. What is Partitioning? Clustering is a division of data into groups of similar objects. Each group, called cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. It represents many data objects by few clusters and hence, it models data by its clusters. A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
  • 4. K-MEANS Algorithm K-Means is one of the simplest unsupervised learning algorithm that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (k-clusters). The main idea is to define k centroids, one for each cluster. A centroid is “the center of mass of a geometric object of uniform density”, though here, we'll consider mean vectors as centroids. It is a method of classifying/grouping items into k groups (where k is the number of pre-chosen groups). The grouping is done by minimizing the sum of squared distances between items or objects and the corresponding centroid.
  • 5. K-MEANS Algorithm Cont.. A clustered scatter plot. The black dots are data points. The red lines illustrate the partitions created by the k-means algorithm. The blue dots represent the centroids which define the partitions.
  • 6. K-MEANS Algorithm Cont.. The initial partitioning can be done in a variety of ways. Dynamically Chosen • This method is good when the amount of data is expected to grow. • The initial cluster means can simply be the first few items of data from the set. • For instance, if the data will be grouped into 3 clusters, then the initial cluster means will be the first 3 items of data. Randomly Chosen • Almost self-explanatory, the initial cluster means are randomly chosen values within the same range as the highest and lowest of the data values. Choosing from Upper and Lower Bounds • Depending on the types of data in the set, the highest and lowest of the data range are chosen as the initial cluster means.
  • 7. K-Means Algorithm - Example Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (185,72 ) (170,56 )
  • 8. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (185,72 ) (170,56 )
  • 9. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (185,72 ) (169,5 8)
  • 10. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (185,72 ) (169,58 )
  • 11. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (182,70 ) (169,5 8)
  • 12. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (182,70 ) (169,58 )
  • 13. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (182,71 ) (169,5 8)
  • 14. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (182,71 ) (169,58 )
  • 15. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (183,72 ) (169,5 8)
  • 16. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 (183,72 ) (169,58 )
  • 17. K-Means Algorithm – Example Cont.. Sr. Heigh t Weigh t 1 185 72 2 170 56 3 168 60 4 179 68 5 182 72 6 188 77 7 180 71 8 180 70 9 183 84 10 180 88 11 180 67 12 177 76 K 1 K 2 Cluster K1 = {1,4,5,6,7,8,9,10,11,12} Cluster K2 = {2,3}
  • 18. K-Means Algorithm Cont.. Let us assume two clusters, and each individual's scores include two variables. Step-1 • Choose the number of clusters. Step-2 • Set the initial partition, and the initial mean vectors for each cluster. Step-3 • For each remaining individual... Step-4 • Get averages for comparison to the Cluster 1: • Add individual's A value to the sum of A values of the individuals in Cluster 1, then divide by the total number of scores that were summed. • Add individual's B value to the sum of B values of the individuals in Cluster 1, then divide by the total number of scores that were summed.
  • 19. K-Means Algorithm Cont.. Step-5 • Get averages for comparison to the Cluster 2: • Add individual's A value to the sum of A values of the individuals in Cluster 2, then divide by the total number of scores that were summed. • Add individual's B value to the sum of B values of the individuals in Cluster 2, then divide by the total number of scores that were summed. Step-6 • If the averages found in Step 4 are closer to the mean values of Cluster 1, then this individual belongs to Cluster 1, and the averages found now become the new mean vectors for Cluster 1. • If closer to Cluster 2, then it goes to Cluster 2, along with the averages as new mean vectors. Step-7 • If there are more individual's to process, continue again with Step 4. Otherwise go to Step 8. Step-8 • Now compare each individual’s distance to its own cluster's mean vector, and to that of the opposite cluster. • The distance to its cluster's mean vector should be smaller than it distance to the other vector. • If not, relocate the individual to the opposite cluster.
  • 20. K-Means Algorithm Cont.. Step-9 • If any relocations occurred in Step 8, the algorithm must continue again with Step 3, using all individuals and the new mean vectors. • If no relocations occurred, stop. Clustering is complete.
  • 21. What is Medoid? • Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. • Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. • Note: A medoid is not equivalent to a median.
  • 22. K-Medoids Clustering Algorithm (PAM) • The k-medoids algorithm is a clustering algorithm related to the k-means algorithm also called as the medoid shift algorithm. • Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups). • In contrast to the k-means algorithm, k-medoids chooses datapoints as centers (medoids or exemplars). • K-medoids is also a partitioning technique of clustering that clusters the data set of n objects into k clusters with k known a priori. • It could be more robust to noise and outliers as compared to k-means because it minimizes a sum of general pairwise dissimilarities instead of a sum of squared Euclidean distances. • A medoid of a finite dataset is a data point from this set, whose average dissimilarity to all the data points is minimal i.e. it is the most centrally located point in the set.
  • 24. K-Medoids Clustering Algorithm - Example Sr. X Y 0 8 7 1 3 7 2 4 9 3 9 6 4 8 5 5 5 8 6 7 3 7 8 4 8 7 5 9 4 5 Step 1: Let the randomly selected 2 medoids, so select k = 2 and let C1 -(4, 5) and C2 - (8, 5) are the two medoids. The dissimilarity of each non-medoid point with the medoids is calculated and tabulated: Sr . X Y Dissimilarity From C1 Dissimilarity From C2 0 8 7 |(8-4)|+|(7-5)| = 6 |(8-8)|+|(7-5)| = 2 1 3 7 3 7 2 4 9 4 8 3 9 6 6 2 5 5 8 4 6 6 7 3 5 3 7 8 4 5 1 8 7 5 3 1
  • 25. K-Medoids Clustering Algorithm – Example Cont.. Sr. X Y Dissimilar ity From C1 Dissimilar ity From C2 0 8 7 6 2 1 3 7 3 7 2 4 9 4 8 3 9 6 6 2 5 5 8 4 6 6 7 3 5 3 7 8 4 5 1 8 7 5 3 1 • Each point is assigned to the cluster of that medoid whose dissimilarity is less. • The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2. • The Cost = (3 + 4 + 4) + (2 + 2 + 3 + 1 + 1) = 20
  • 26. K-Medoids Clustering Algorithm – Example Cont.. Sr. X Y Dissimilar ity From C1 Dissimilar ity From C2 0 8 7 6 3 1 3 7 3 8 2 4 9 4 9 3 9 6 6 3 4 8 5 4 1 5 5 8 4 7 6 7 3 5 2 8 7 5 3 2 • Step 3: randomly select one non-medoid point and recalculate the cost. • Let the randomly selected point be (8, 4). • The dissimilarity of each non-medoid point with the medoids – C1 (4, 5) and C2 (8, 4) is calculated and tabulated.
  • 27. K-Medoids Clustering Algorithm – Example Cont.. Sr. X Y Dissimilar ity From C1 Dissimilar ity From C2 0 8 7 6 3 1 3 7 3 8 2 4 9 4 9 3 9 6 6 3 4 8 5 4 1 5 5 8 4 7 6 7 3 5 2 8 7 5 3 2 • Each point is assigned to that cluster whose dissimilarity is less. So, the points 1, 2, 5 go to cluster C1 and 0, 3, 4, 6, 8 go to cluster C2. • The New cost, = (3 + 4 + 4) + (3 + 3 + 1 + 2 + 2) = 22 • Swap Cost = New Cost – Previous Cost = 22 – 20 = 2 • So, 2>0 that is positive, now our previous medoid is best. • The total cost of Medoid (8,4) > the total cost when (8,5) was the medoid earlier & it generates the same clusters as earlier. • If you get negative then you have to take new medoid and recalculate again.
  • 28. K-Medoids Clustering Algorithm – Example Cont.. • As the swap cost is not less than zero, we undo the swap. • Hence (4, 5) and (8, 5) are the final medoids. • The clustering would be in the following way Sr. X Y 0 8 7 1 3 7 2 4 9 3 9 6 4 8 5 5 5 8 6 7 3 7 8 4 8 7 5 9 4 5
  • 29. K-Medoids Clustering Algorithm (Try Yourself!!) Sr. X Y 0 2 6 1 3 4 2 3 8 3 4 7 4 6 2 5 6 4 6 7 3 7 7 4 8 8 5 9 7 6
  • 30. Hierarchical Clustering • Hierarchical Clustering is a technique to group objects based on distance or similarity. • Hierarchical Clustering is called as unsupervised learning. • Because, the machine (computer) learns mere from objects with their features and then the machine will automatically categorize those objects into groups. • This clustering technique is divided into two types: • Agglomerative • In this technique, initially each data point is considered as an individual cluster. • At each iteration, the similar clusters merge with other clusters until one cluster or K clusters are formed. • Divisive • Divisive Hierarchical clustering is exactly the opposite of the Agglomerative Hierarchical clustering. • In Divisive Hierarchical clustering, we consider all the data points as a single cluster and in each iteration, we separate the data points from the cluster which are not similar. • Each data point which is separated is considered as an individual cluster. In the end, we’ll be left with n clusters.
  • 31. Agglomerative Hierarchical Clustering Technique • In this technique, initially each data point is considered as an individual cluster. • At each iteration, the similar clusters merge with other clusters until one cluster or K clusters are formed. • The basic algorithm of Agglomerative is straight forward. • Compute the proximity matrix • Let each data point be a cluster • Repeat: Merge the two closest clusters and update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters.
  • 32. Agglomerative Hierarchical Clustering - Example • To understand better let’s see a pictorial representation of the Agglomerative Hierarchical clustering Technique. • Lets say we have six data points {A,B,C,D,E,F}. • Step- 1: • In the initial step, we calculate the proximity of individual points and consider all the six data points as individual clusters as shown in the image below.
  • 33. Agglomerative Hierarchical Clustering - Example • Step- 2: • In step two, similar clusters are merged together and formed as a single cluster. • Let’s consider B,C, and D,E are similar clusters that are merged in step two. • Now, we’re left with four clusters which are A, BC, DE, F. • Step- 3: • We again calculate the proximity of new clusters and merge the similar clusters to form new clusters A, BC, DEF. • Step- 4: • Calculate the proximity of the new clusters. • The clusters DEF and BC are similar and merged together to form a new cluster. • We’re now left with two clusters A, BCDEF. • Step- 5: • Finally, all the clusters are merged together and form a single cluster.
  • 34. Agglomerative Hierarchical Clustering - Example The Hierarchical clustering Technique can be visualized using a Dendrogram. A Dendrogram is a tree-like diagram that records the sequences of merges or splits.
  • 35. Divisive Hierarchical Clustering Technique • Divisive Hierarchical clustering is exactly the opposite of the Agglomerative Hierarchical clustering. • In Divisive Hierarchical clustering, we consider all the data points as a single cluster and in each iteration, we separate the data points from the cluster which are not similar. • Each data point which is separated is considered as an individual cluster. • In the end, we’ll be left with n clusters. • As we are dividing the single clusters into n clusters, it is named as Divisive Hierarchical clustering. • It is not much used in the real world.
  • 36. Divisive Hierarchical Clustering Technique Cont.. • Calculating the similarity between two clusters is important to merge or divide the clusters. • There are certain approaches which are used to calculate the similarity between two clusters: • MIN • MAX • Group Average • Distance Between Centroids • Ward’s Method
  • 37. MIN • Min is known as single-linkage algorithm can be defined as the similarity of two clusters C1 and C2 is equal to the minimum of the similarity between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2. • Mathematically this can be written as, • Sim(C1,C2) = Min Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2 • In simple words, pick the two closest points such that one point lies in cluster one and the other point lies in cluster 2 and takes their similarity and declares it as the similarity between two clusters.
  • 38. MAX • Max is known as the complete linkage algorithm, this is exactly opposite to the MIN approach. • The similarity of two clusters C1 and C2 is equal to the maximum of the similarity between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2. • Mathematically this can be written as, • Sim(C1,C2) = Max Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2 • In simple words, pick the two farthest points such that one point lies in cluster one and the other point lies in cluster 2 and takes their similarity and declares it as the similarity between two clusters.
  • 39. Group Average • Take all the pairs of points and compute their similarities and calculate the average of the similarities. • Mathematically this can be written as, • Sim(C1,C2) = ∑ Sim(Pi, Pj)/|C1|*|C2|, where, Pi ∈ C1 & Pj ∈ C2
  • 40. Distance between centroids • Compute the centroids of two clusters C1 & C2 and take the similarity between the two centroids as the similarity between two clusters. • This is a less popular technique in the real world.
  • 41. Ward’s Method • This approach of calculating the similarity between two clusters is exactly the same as Group Average except that Ward’s method calculates the sum of the square of the distances Pi and Pj. • Mathematically this can be written as, • Sim(C1,C2) = ∑ (dist(Pi, Pj))²/|C1|*|C2|
  • 42. Agglomerative Hierarchical Clustering - Example X Y P1 0.40 0.53 P2 0.22 0.38 P3 0.35 0.32 P4 0.26 0.19 P5 0.08 0.41 P6 0.45 0.30 P1 P2 P3 P4 P5 P6 P1 0 P2 0.2 3 0 P3 0 P4 0 P5 0 P6 0
  • 43. Agglomerative Hierarchical Clustering - Example X Y P1 0.40 0.53 P2 0.22 0.38 P3 0.35 0.32 P4 0.26 0.19 P5 0.08 0.41 P6 0.45 0.30 P1 P2 P3 P4 P5 P6 P1 0 P2 0.2 3 0 P3 0.2 2 0 P4 0 P5 0 P6 0
  • 44. Agglomerative Hierarchical Clustering - Example X Y P1 0.40 0.53 P2 0.22 0.38 P3 0.35 0.32 P4 0.26 0.19 P5 0.08 0.41 P6 0.45 0.30 P1 P2 P3 P4 P5 P6 P1 0 P2 0.23 0 P3 0.22 0.15 0 P4 0.37 0.20 0.15 0 P5 0.34 0.14 0.28 0.29 0 P6 0.23 0.25 0.11 0.22 0.39 0 3 6
  • 45. Agglomerative Hierarchical Clustering - Example To Update the distance matrix MIN[dist(P3,P6),P1] MIN (dist(P3,P1),(P6,P1)) Min[(0.22,0.23)] 0.22 To Update the distance matrix MIN[dist(P3,P6),P2] MIN (dist(P3,P2),(P6,P2)) Min[(0.15,0.25)] 0.15 P1 P2 P3 P4 P5 P6 P1 0 P2 0.23 0 P3 0.22 0.1 5 0 P4 0.37 0.2 0 0.1 5 0 P5 0.34 0.1 4 0.2 8 0.2 9 0 P6 0.23 0.2 5 0.1 1 0.2 2 0.3 9 0
  • 46. Agglomerative Hierarchical Clustering - Example To Update the distance matrix MIN[dist(P3,P6),P4] MIN (dist(P3,P4),(P6,P4)) Min[(0.15,0.22)] 0.15 To Update the distance matrix MIN[dist(P3,P6),P5] MIN (dist(P3,P5),(P6,P5)) Min[(0.28,0.39)] 0.28 P1 P2 P3 P4 P5 P6 P1 0 P2 0.23 0 P3 0.22 0.1 5 0 P4 0.37 0.2 0 0.1 5 0 P5 0.34 0.1 4 0.2 8 0.2 9 0 P6 0.23 0.2 5 0.1 1 0.2 2 0.3 9 0
  • 47. Agglomerative Hierarchical Clustering - Example The Updated distance matrix for cluster P3, P6 P1 P2 P3,P 6 P4 P5 P1 0 P2 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0 P5 0.34 0.14 0.28 0.29 0 P1 P2 P3 P4 P5 P6 P1 0 P2 0.23 0 P3 0.22 0.1 5 0 P4 0.37 0.2 0 0.1 5 0 P5 0.34 0.1 4 0.2 8 0.2 9 0 P6 0.23 0.2 5 0.1 1 0.2 2 0.3 9 0 2 5
  • 48. Agglomerative Hierarchical Clustering - Example To Update the distance matrix MIN[dist(P2,P5),P1] MIN (dist(P2,P1),(P5,P1)) Min[(0.23,0.34)] 0.23 To Update the distance matrix MIN[dist(P2,P5),(P3,P6)] MIN [(dist(P2,(P3,P6)),(P5,(P3,P6))] Min[(0.15,0.28)] 0.15 P1 P2 P3,P 6 P4 P5 P1 0 P2 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0 P5 0.34 0.14 0.28 0.29 0
  • 49. Agglomerative Hierarchical Clustering - Example To Update the distance matrix MIN[dist(P2,P5),P4] MIN (dist(P2,P4),(P5,P4)) Min[(0.20,0.29)] 0.20 P1 P2 P3,P 6 P4 P5 P1 0 P2 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0 P5 0.34 0.14 0.28 0.29 0
  • 50. Agglomerative Hierarchical Clustering - Example P1 P2 P3,P 6 P4 P5 P1 0 P2 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0 P5 0.34 0.14 0.28 0.29 0 P1 P2,P 5 P3,P 6 P4 P1 0 P2,P 5 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0
  • 51. Agglomerative Hierarchical Clustering - Example P1 P2,P 5 P3,P 6 P4 P1 0 P2,P 5 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0 P1 P2,P 5 P3,P 6 P4 P1 0 P2,P 5 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0 3 6 2 5
  • 52. Agglomerative Hierarchical Clustering - Example P1 P2,P 5 P3,P 6 P4 P1 0 P2,P 5 0.23 0 P3,P 6 0.22 0.15 0 P4 0.37 0.20 0.15 0 To Update the distance matrix MIN[dist(P2,P5),(P3,P6)),P1] MIN (dist(P2,P5),P1),((P3,P6),P1)] Min[(0.23,0.22)] 0.22 To Update the distance matrix MIN[dist(P2,P5),(P3,P6)),P4] MIN (dist(P2,P5),P4),((P3,P6),P4)] Min[(0.20,0.15)] 0.15 P1 P2,P5,P3, P6 P4 P1 0 P2,P5,P3, P6 0.22 0 P4 0.37 0.15 0 3 6 2 5 4
  • 53. Agglomerative Hierarchical Clustering - Example To Update the distance matrix MIN[dist(P2,P5,P3,P6),P4] MIN (dist(P2,P5,P3,P6),P1),(P4,P1)] Min[(0.22,0.37)] 0.22 P1 P2,P5,P3, P6 P4 P1 0 P2,P5,P3, P6 0.22 0 P4 0.37 0.15 0 P1 P2,P5,P3,P6, P4 P1 0 P2,P5,P3,P6, P4 0.22 0 3 6 2 5 4
  • 54. Agglomerative Hierarchical Clustering - Example 3 6 2 5 4 1 X Y P1 0.40 0.53 P2 0.22 0.38 P3 0.35 0.32 P4 0.26 0.19 P5 0.08 0.41 P6 0.45 0.30
  • 55. 55 What is Cluster Analysis? ■ Cluster: A collection of data objects ■ similar (or related) to one another within the same group ■ dissimilar (or unrelated) to the objects in other groups ■ Cluster analysis (or clustering, data segmentation, …) ■ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters ■ Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) ■ Typical applications ■ As a stand-alone tool to get insight into data distribution ■ As a preprocessing step for other algorithms
  • 56. 56 Clustering for Data Understanding and Applications ■ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species ■ Information retrieval: document clustering ■ Land use: Identification of areas of similar land use in an earth observation database ■ Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs ■ City-planning: Identifying groups of houses according to their house type, value, and geographical location ■ Earthquake studies: Observed earthquake epicenters should be clustered along continent faults ■ Climate: understanding earth climate, find patterns of atmospheric and ocean ■ Economic Science: market research
  • 57. 57 Clustering as a Preprocessing Tool (Utility) ■ Summarization: ■ Preprocessing for regression, PCA, classification, and association analysis ■ Compression: ■ Image processing: vector quantization ■ Finding K-nearest Neighbors ■ Localizing search to one or a small number of clusters ■ Outlier detection ■ Outliers are often viewed as those “far away” from any cluster
  • 58. Quality: What Is Good Clustering? ■ A good clustering method will produce high quality clusters ■ high intra-class similarity: cohesive within clusters ■ low inter-class similarity: distinctive between clusters ■ The quality of a clustering method depends on ■ the similarity measure used by the method ■ its implementation, and ■ Its ability to discover some or all of the hidden patterns 58
  • 59. Measure the Quality of Clustering ■ Dissimilarity/Similarity metric ■ Similarity is expressed in terms of a distance function, typically metric: d(i, j) ■ The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables ■ Weights should be associated with different variables based on applications and data semantics ■ Quality of clustering: ■ There is usually a separate “quality” function that measures the “goodness” of a cluster. ■ It is hard to define “similar enough” or “good enough” ■ The answer is typically highly subjective 59
  • 60. Considerations for Cluster Analysis ■ Partitioning criteria ■ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable) ■ Separation of clusters ■ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class) ■ Similarity measure ■ Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity) ■ Clustering space ■ Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) 60
  • 61. Requirements and Challenges ■ Scalability ■ Clustering all the data instead of only on samples ■ Ability to deal with different types of attributes ■ Numerical, binary, categorical, ordinal, linked, and mixture of these ■ Constraint-based clustering ■ User may give inputs on constraints ■ Use domain knowledge to determine input parameters ■ Interpretability and usability ■ Others ■ Discovery of clusters with arbitrary shape ■ Ability to deal with noisy data ■ Incremental clustering and insensitivity to input order ■ High dimensionality 61
  • 62. Major Clustering Approaches (I) ■ Partitioning approach: ■ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors ■ Typical methods: k-means, k-medoids, CLARANS ■ Hierarchical approach: ■ Create a hierarchical decomposition of the set of data (or objects) using some criterion ■ Typical methods: Diana, Agnes, BIRCH, CAMELEON ■ Density-based approach: ■ Based on connectivity and density functions ■ Typical methods: DBSACN, OPTICS, DenClue ■ Grid-based approach: ■ based on a multiple-level granularity structure ■ Typical methods: STING, WaveCluster, CLIQUE 62
  • 63. Major Clustering Approaches (II) ■ Model-based: ■ A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other ■ Typical methods: EM, SOM, COBWEB ■ Frequent pattern-based: ■ Based on the analysis of frequent patterns ■ Typical methods: p-Cluster ■ User-guided or constraint-based: ■ Clustering by considering user-specified or application-specific constraints ■ Typical methods: COD (obstacles), constrained clustering ■ Link-based clustering: ■ Objects are often linked together in various ways ■ Massive links can be used to cluster objects: SimRank, LinkClus 63
  • 64. Partitioning Algorithms: Basic Concept ■ Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci) ■ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion ■ Global optimal: exhaustively enumerate all partitions ■ Heuristic methods: k-means and k-medoids algorithms ■ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster ■ k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 64
  • 65. The K-Means Clustering Method ■ Given k, the k-means algorithm is implemented in four steps: ■ Partition objects into k nonempty subsets ■ Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster) ■ Assign each object to the cluster with the nearest seed point ■ Go back to Step 2, stop when the assignment does not change 65
  • 66. An Example of K-Means Clustering K=2 Arbitrarily partition objects into k groups Update the cluster centroids Update the cluster centroids Reassign objects Loop if needed 66 The initial data set ■ Partition objects into k nonempty subsets ■ Repeat ■ Compute centroid (i.e., mean point) for each partition ■ Assign each object to the cluster of its nearest centroid ■ Until no change
  • 67. What Is the Problem of the K-Means Method? ■ The k-means algorithm is sensitive to outliers ! ■ Since an object with an extremely large value may substantially distort the distribution of the data ■ K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 1 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 1 0 67
  • 68. 68 PAM: A Typical K-Medoids Algorithm Total Cost = 20 0 1 2 3 4 5 6 7 8 9 1 0 0 1 2 3 4 5 6 7 8 9 1 0 K=2 Arbitrary choose k object as initial medoids Assign each remainin g object to nearest medoids Randomly select a nonmedoid object,Oramdom Compute total cost of swapping 0 1 2 3 4 5 6 7 8 9 1 0 0 1 2 3 4 5 6 7 8 9 1 0 Total Cost = 26 Swapping O and Oramdom If quality is improved. Do loop Until no change 0 1 2 3 4 5 6 7 8 9 1 0 0 1 2 3 4 5 6 7 8 9 1 0
  • 69. The K-Medoid Clustering Method ■ K-Medoids Clustering: Find representative objects (medoids) in clusters ■ PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987) ■ Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering ■ PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity) ■ Efficiency improvement on PAM ■ CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples ■ CLARANS (Ng & Han, 1994): Randomized re-sampling 69
  • 70. Hierarchical Clustering ■ Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition [Figure: objects a, b, c, d, e merged step by step (steps 0 to 4) by agglomerative clustering (AGNES), and split in the reverse order by divisive clustering (DIANA)] 70
  • 71. AGNES (Agglomerative Nesting) ■ Introduced in Kaufmann and Rousseeuw (1990) ■ Implemented in statistical packages, e.g., Splus ■ Use the single-link method and the dissimilarity matrix ■ Merge nodes that have the least dissimilarity ■ Go on in a non-descending fashion ■ Eventually all nodes belong to the same cluster 71
  • 72. Dendrogram: Shows How Clusters are Merged Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster 72
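A short sketch of agglomerative (AGNES-style) clustering followed by a dendrogram cut, assuming SciPy is available. The data and the choice of two clusters are illustrative.

```python
# Build a single-link merge tree and cut it to obtain flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 0.2]])

Z = linkage(X, method='single')                   # merge nodes with least dissimilarity first
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
print(labels)
```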
  • 73. DIANA (Divisive Analysis) ■ Introduced in Kaufmann and Rousseeuw (1990) ■ Implemented in statistical analysis packages, e.g., Splus ■ Inverse order of AGNES ■ Eventually each node forms a cluster on its own 73
  • 74. Distance between Clusters ■ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(dist(t_ip, t_jq)) ■ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(dist(t_ip, t_jq)) ■ Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(dist(t_ip, t_jq)) ■ Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj) ■ Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj) ■ Medoid: a chosen, centrally located object in the cluster 74
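The definitions above can be computed directly. A small sketch for two toy clusters Ki and Kj (the data and variable names are ours):

```python
# Single, complete, average, and centroid inter-cluster distances for two clusters.
import numpy as np
from scipy.spatial.distance import cdist

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 0.0], [5.0, 1.0]])

D = cdist(Ki, Kj)                      # all pairwise distances between the two clusters
single   = D.min()                     # smallest element-to-element distance
complete = D.max()                     # largest element-to-element distance
average  = D.mean()                    # average element-to-element distance
centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))  # distance of centroids
print(single, complete, average, centroid)
```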
  • 75. Centroid, Radius and Diameter of a Cluster (for numerical data sets) ■ Centroid: the “middle” of a cluster ■ Radius: square root of the average squared distance from the points of the cluster to its centroid ■ Diameter: square root of the average squared distance between all pairs of points in the cluster 75
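Written out, the standard formulas matching these verbal definitions are given below (the notation t_i, C_m, R_m, D_m is ours):

```latex
% Centroid, radius, and diameter of a cluster of N points t_1, ..., t_N
C_m = \frac{1}{N}\sum_{i=1}^{N} t_i
\qquad
R_m = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert t_i - C_m \rVert^2}
\qquad
D_m = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N} \lVert t_i - t_j \rVert^2}
```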
  • 76. BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) ■ Zhang, Ramakrishnan & Livny, SIGMOD’96 ■ Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering ■ Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data) ■ Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree ■ Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans ■ Weakness: handles only numeric data, and is sensitive to the order of the data records 76
  • 77. Clustering Feature Vector in BIRCH ■ Clustering Feature (CF): CF = (N, LS, SS) ■ N: number of data points ■ LS: linear sum of the N points ■ SS: square sum of the N points ■ Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)) 77
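A tiny sketch that computes the CF triple for the five example points and reproduces the values on the slide:

```python
# Compute the Clustering Feature CF = (N, LS, SS) for the slide's example points.
import numpy as np

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N  = len(points)                 # number of data points
LS = points.sum(axis=0)          # linear sum of the points       -> [16 30]
SS = (points ** 2).sum(axis=0)   # square sum, per dimension      -> [54 190]
print(N, LS, SS)                 # CF = (5, (16, 30), (54, 190))

# CF vectors are additive: the CF of two merged subclusters is the sum of their CFs.
```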
  • 78. CF-Tree in BIRCH ■ Clustering feature: ■ Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the subcluster from the statistical point of view ■ Registers crucial measurements for computing clusters and utilizes storage efficiently ■ A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering ■ A nonleaf node in the tree has descendants or “children” ■ The nonleaf nodes store sums of the CFs of their children ■ A CF tree has two parameters ■ Branching factor: max # of children ■ Threshold: max diameter of sub-clusters stored at the leaf nodes 78
  • 79. The CF Tree Structure [Figure: a root and non-leaf nodes hold entries of the form (CFi, childi); leaf nodes hold CF entries and are chained by prev/next pointers; in the illustration the branching factor is B = 7 and the leaf capacity is L = 6] 79
  • 80. The BIRCH Algorithm ■ The cluster diameter is used to decide when a leaf entry must be split ■ For each point in the input ■ Find the closest leaf entry ■ Add the point to the leaf entry and update its CF ■ If the entry diameter > max_diameter, then split the leaf, and possibly its parents ■ Algorithm is O(n) ■ Concerns ■ Sensitive to the insertion order of data points ■ Since the size of leaf nodes is fixed, the clusters may not correspond to natural clusters ■ Clusters tend to be spherical given the radius and diameter measures 80
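For orientation, a brief sketch using scikit-learn's Birch estimator (assuming scikit-learn is available; the data and parameter values are illustrative). The threshold plays the role of the leaf diameter bound and branching_factor the maximum number of children.

```python
# BIRCH sketch: Phase 1 builds the CF tree; n_clusters drives the global clustering phase.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

model = Birch(branching_factor=50, threshold=0.5, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))        # sizes of the two recovered clusters
```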
  • 81. Density-Based Clustering Methods ■ Clustering based on density (local cluster criterion), such as density-connected points ■ Major features: ■ Discover clusters of arbitrary shape ■ Handle noise ■ One scan ■ Need density parameters as termination condition ■ Several interesting studies: ■ DBSCAN: Ester, et al. (KDD’96) ■ OPTICS: Ankerst, et al (SIGMOD’99). ■ DENCLUE: Hinneburg & D. Keim (KDD’98) ■ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based) 81
  • 82. Density-Based Clustering: Basic Concepts ■ Two parameters: ■ Eps: maximum radius of the neighbourhood ■ MinPts: minimum number of points in an Eps-neighbourhood of that point ■ NEps(p): {q belongs to D | dist(p,q) ≤ Eps} ■ Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if ■ p belongs to NEps(q) ■ core point condition: |NEps(q)| ≥ MinPts [Figure: example with MinPts = 5, Eps = 1 cm] 82
  • 83. Density-Reachable and Density-Connected ■ Density-reachable: ■ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi ■ Density-connected: ■ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts 83
  • 84. DBSCAN: Density-Based Spatial Clustering of Applications with Noise ■ Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points ■ Discovers clusters of arbitrary shape in spatial databases with noise [Figure: core, border, and outlier points with Eps = 1 cm, MinPts = 5] 84
  • 85. DBSCAN: The Algorithm ■ Arbitrarily select a point p ■ Retrieve all points density-reachable from p w.r.t. Eps and MinPts ■ If p is a core point, a cluster is formed ■ If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database ■ Continue the process until all of the points have been processed 85
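A short usage sketch with scikit-learn's DBSCAN (assumed available); eps and min_samples correspond to Eps and MinPts above, and the toy data are ours. Points labeled -1 are noise.

```python
# DBSCAN sketch: density-connected clusters plus noise labels.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, size=(40, 2)),
               rng.normal(3, 0.2, size=(40, 2)),
               [[10.0, 10.0]]])                 # an isolated point, expected to be noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # eps ~ Eps, min_samples ~ MinPts
print(set(labels))                               # cluster ids, with -1 marking noise
```

As the next slide notes, the result is sensitive to the choice of eps and min_samples, so these values typically need tuning per data set.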
  • 86. DBSCAN: Sensitive to Parameters 86
  • 87. Grid-Based Clustering Method ■ Using multi-resolution grid data structure ■ Several interesting methods ■ STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997) ■ WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) ■ A multi-resolution clustering approach using wavelet method ■ CLIQUE: Agrawal, et al. (SIGMOD’98) ■ Both grid-based and subspace clustering 87
  • 88. STING: A Statistical Information Grid Approach ■ Wang, Yang and Muntz (VLDB’97) ■ The spatial area is divided into rectangular cells ■ There are several levels of cells corresponding to different levels of resolution 88
  • 89. The STING Clustering Method ■ Each cell at a high level is partitioned into a number of smaller cells at the next lower level ■ Statistical info of each cell is calculated and stored beforehand and is used to answer queries ■ Parameters of higher-level cells can be easily calculated from parameters of lower-level cells ■ count, mean, standard deviation (s), min, max ■ type of distribution (normal, uniform, etc.) ■ Use a top-down approach to answer spatial data queries ■ Start from a pre-selected layer, typically with a small number of cells ■ For each cell in the current level, compute the confidence interval 89
  • 90. STING Algorithm and Its Analysis ■ Remove the irrelevant cells from further consideration ■ When finished examining the current layer, proceed to the next lower level ■ Repeat this process until the bottom layer is reached ■ Advantages: ■ Query-independent, easy to parallelize, incremental update ■ O(K), where K is the number of grid cells at the lowest level ■ Disadvantages: ■ All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected 90
  • 91. CLIQUE (Clustering In QUEst) ■ Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98) ■ Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space ■ CLIQUE can be considered as both density-based and grid-based ■ It partitions each dimension into the same number of equal-length intervals ■ It partitions an m-dimensional data space into non-overlapping rectangular units ■ A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter ■ A cluster is a maximal set of connected dense units within a subspace
  • 92. CLIQUE: The Major Steps ■ Partition the data space and find the number of points that lie inside each cell of the partition ■ Identify the subspaces that contain clusters using the Apriori principle ■ Identify clusters ■ Determine dense units in all subspaces of interest ■ Determine connected dense units in all subspaces of interest ■ Generate a minimal description for the clusters ■ Determine the maximal regions that cover each cluster of connected dense units ■ Determine the minimal cover for each cluster 92
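The first step above (partition the space and count points per unit) can be sketched with a grid histogram. This is a simplified illustration under our own assumptions: the names xi and tau stand for the grid resolution and density threshold, and the Apriori-style subspace search is omitted.

```python
# Sketch of CLIQUE's dense-unit counting on a 2-d toy data set.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(500, 2))            # toy 2-d data

xi, tau = 10, 0.02                               # intervals per dimension, density threshold
counts, edges = np.histogramdd(X, bins=xi, range=[(0, 10), (0, 10)])
dense_units = np.argwhere(counts / len(X) > tau) # units whose point fraction exceeds tau
print(len(dense_units), "dense units")
```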
  • 93. [Figure: CLIQUE example — grids over Salary (10,000) vs. age (20–60) and Vacation (week) vs. age, with density threshold τ = 3] 93
  • 94. Strength and Weakness of CLIQUE ■ Strength ■ Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces ■ Insensitive to the order of records in the input and does not presume some canonical data distribution ■ Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases ■ Weakness ■ The accuracy of the clustering result may be degraded in exchange for the simplicity of the method
  • 95. Summary ■ Cluster analysis groups objects based on their similarity and has wide applications ■ Measures of similarity can be computed for various types of data ■ Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods ■ K-means and K-medoids are popular partitioning-based clustering algorithms ■ BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms ■ DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms ■ STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm ■ Quality of clustering results can be evaluated in various ways 95
  • 96. What Are Outliers? ■ Outlier: A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism ■ Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ... ■ Outliers are different from noise data ■ Noise is random error or variance in a measured variable ■ Noise should be removed before outlier detection ■ Outliers are interesting: they violate the mechanism that generates the normal data ■ Outlier detection vs. novelty detection: at an early stage an object is an outlier; later it may be merged into the model ■ Applications: ■ Credit card fraud detection ■ Telecom fraud detection ■ Customer segmentation ■ Medical analysis
  • 97. Types of Outliers (I) ■ Three kinds: global, contextual and collective outliers ■ Global outlier (or point anomaly) ■ Object Og is a global outlier if it significantly deviates from the rest of the data set ■ Ex. Intrusion detection in computer networks ■ Issue: Find an appropriate measurement of deviation ■ Contextual outlier (or conditional outlier) ■ Object Oc is a contextual outlier if it deviates significantly with respect to a selected context ■ Ex. 80 °F in Urbana: an outlier? (depends on whether it is summer or winter) ■ Attributes of data objects should be divided into two groups ■ Contextual attributes: define the context, e.g., time & location ■ Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature ■ Can be viewed as a generalization of local outliers: objects whose density significantly deviates from that of their local area ■ Issue: How to define or formulate a meaningful context?
  • 98. Types of Outliers (II) ■ Collective Outliers ■ A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers ■ Applications: e.g., intrusion detection: ■ When a number of computers keep sending denial-of-service packets to each other ■ Detection of collective outliers ■ Consider not only the behavior of individual objects, but also that of groups of objects ■ Need background knowledge on the relationship among data objects, such as a distance or similarity measure on objects ■ A data set may have multiple types of outliers ■ One object may belong to more than one type of outlier
  • 99. Challenges of Outlier Detection ■ Modeling normal objects and outliers properly ■ Hard to enumerate all possible normal behaviors in an application ■ The border between normal and outlier objects is often a gray area ■ Application-specific outlier detection ■ The choice of distance measure among objects and the model of relationships among objects are often application-dependent ■ E.g., in clinical data a small deviation could be an outlier, while in marketing analysis only larger fluctuations would be ■ Handling noise in outlier detection ■ Noise may distort the normal objects and blur the distinction between normal objects and outliers; it may help hide outliers and reduce the effectiveness of outlier detection ■ Understandability ■ Understand why these are outliers: justification of the detection ■ Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism
  • 100. Outlier Detection I: Supervised Methods ■ Two ways to categorize outlier detection methods: ■ Based on whether user-labeled examples of outliers can be obtained: ■ Supervised, semi-supervised vs. unsupervised methods ■ Based on assumptions about normal data and outliers: ■ Statistical, proximity-based, and clustering-based methods ■ Outlier Detection I: Supervised Methods ■ Modeling outlier detection as a classification problem ■ Samples examined by domain experts are used for training & testing ■ Methods for learning a classifier for outlier detection effectively: ■ Model normal objects & report those not matching the model as outliers, or ■ Model outliers and treat those not matching the model as normal ■ Challenges ■ Imbalanced classes, i.e., outliers are rare: boost the outlier class and generate some artificial outliers ■ Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers) 100
  • 101. Outlier Detection II: Unsupervised Methods ■ Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features ■ An outlier is expected to be far away from any group of normal objects ■ Weakness: Cannot detect collective outliers effectively ■ Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area ■ Ex. In some intrusion or virus detection, normal activities are diverse ■ Unsupervised methods may have a high false positive rate and still miss many real outliers ■ Supervised methods can be more effective, e.g., at identifying attacks on key resources ■ Many clustering methods can be adapted for unsupervised outlier detection ■ Find clusters first, then outliers: objects not belonging to any cluster ■ Problem 1: Hard to distinguish noise from outliers ■ Problem 2: Costly, since clustering comes first, yet there are far fewer outliers than normal objects ■ Newer methods: tackle outliers directly 101
  • 102. Outlier Detection III: Semi-Supervised Methods ■ Situation: In many applications, the amount of labeled data is often small: labels could be on outliers only, normal objects only, or both ■ Semi-supervised outlier detection: regarded as an application of semi-supervised learning ■ If some labeled normal objects are available ■ Use the labeled examples and the nearby unlabeled objects to train a model for normal objects ■ Those not fitting the model of normal objects are detected as outliers ■ If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well ■ To improve the quality of outlier detection, one can get help from models for normal objects learned from unsupervised methods 102
  • 103. Outlier Detection (1): Statistical Methods ■ Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model) ■ The data not following the model are outliers 103 ■ Effectiveness of statistical methods: depends highly on whether the assumed statistical model holds for the real data ■ There are rich alternatives among statistical models ■ E.g., parametric vs. non-parametric ■ Example (right figure): First use a Gaussian distribution to model the normal data ■ For each object y in region R, estimate gD(y), the probability that y fits the Gaussian distribution ■ If gD(y) is very low, y is unlikely to be generated by the Gaussian model and is thus an outlier
  • 104. Outlier Detection (2): Proximity-Based Methods ■ An object is an outlier if its nearest neighbors are far away, i.e., the proximity of the object deviates significantly from the proximity of most of the other objects in the same data set 104 ■ The effectiveness of proximity-based methods highly relies on the proximity measure ■ In some applications, proximity or distance measures cannot be obtained easily ■ Often have difficulty finding a group of outliers which stay close to each other ■ Two major types of proximity-based outlier detection ■ Distance-based vs. density-based ■ Example (right figure): Model the proximity of an object using its 3 nearest neighbors ■ Objects in region R are substantially different from the other objects in the data set ■ Thus the objects in R are outliers
  • 105. Outlier Detection (3): Clustering-Based Methods ■ Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster 105 ■ Since there are many clustering methods, there are many clustering-based outlier detection methods as well ■ Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale up well for large data sets ■ Example (right figure): two clusters ■ All points not in R form a large cluster ■ The two points in R form a tiny cluster, and thus are outliers
  • 106. Statistical Approaches ■ Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model) ■ Idea: learn a generative model fitting the given data set, and then identify the objects in low-probability regions of the model as outliers ■ Methods are divided into two categories: parametric vs. non-parametric ■ Parametric method ■ Assumes that the normal data are generated by a parametric distribution with parameter θ ■ The probability density function of the parametric distribution, f(x, θ), gives the probability that object x is generated by the distribution ■ The smaller this value, the more likely x is an outlier ■ Non-parametric method ■ Does not assume an a priori statistical model; the model is determined from the input data ■ Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance ■ Examples: histogram and kernel density estimation 106
  • 107. Parametric Methods I: Detecting Univariate Outliers Based on the Normal Distribution ■ Univariate data: A data set involving only one attribute or variable ■ Often assume that the data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers ■ Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4} ■ Use the maximum likelihood method to estimate μ and σ 107 ■ Taking derivatives with respect to μ and σ2, we derive the maximum likelihood estimates μ̂ = (1/n) Σ xi and σ̂2 = (1/n) Σ (xi – μ̂)2 ■ For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.51 ■ Then (24 – 28.61) / 1.51 = –3.04 < –3, so 24 is an outlier, since under the normal assumption a value more than 3σ away from the mean occurs with very low probability (the 3σ rule)
  • 108. Parametric Methods I: The Grubbs Test ■ Univariate outlier detection: the Grubbs test (maximum normed residual test), another statistical method under the normal distribution ■ For each object x in a data set, compute its z-score: z = |x – x̄| / s ■ x is an outlier if z ≥ ((N – 1) / √N) · √( t2α/(2N),N–2 / (N – 2 + t2α/(2N),N–2) ), where tα/(2N),N–2 is the value taken by a t-distribution with N – 2 degrees of freedom at a significance level of α/(2N), and N is the # of objects in the data set 108
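A sketch of the Grubbs test applied to the temperature data from the previous slide, assuming SciPy is available for the t-distribution quantile:

```python
# Grubbs test: compare the max normed residual against the t-based critical value.
import numpy as np
from scipy import stats

x = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
N, alpha = len(x), 0.05

G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)        # maximum normed residual
t = stats.t.ppf(1 - alpha / (2 * N), N - 2)              # t quantile at level alpha/(2N)
G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))

print(G, G_crit, G > G_crit)   # roughly 2.83 > 2.29, so 24.0 is flagged as an outlier
```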
  • 109. Parametric Methods II: Detection of Multivariate Outliers ■ Multivariate data: A data set involving two or more attributes or variables ■ Transform the multivariate outlier detection task into a univariate outlier detection problem ■ Method 1. Compute the Mahalanobis distance ■ Let ō be the mean vector for a multivariate data set. The Mahalanobis distance of an object o from ō is MDist(o, ō) = (o – ō)T S–1 (o – ō), where S is the covariance matrix ■ Use the Grubbs test on this measure to detect outliers ■ Method 2. Use the χ2-statistic: χ2 = Σi (oi – Ei)2 / Ei, where Ei is the mean of the i-th dimension among all objects and n is the dimensionality ■ If the χ2-statistic is large, then object o is an outlier 109
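A small sketch of Method 1: computing the Mahalanobis distance of each object from the mean vector with the sample covariance matrix (the toy data and variable names are ours):

```python
# Mahalanobis distance (o - mean)^T S^-1 (o - mean) for every object.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=200)
X = np.vstack([X, [[6.0, -6.0]]])                 # one multivariate outlier appended

mean = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))    # inverse of the covariance matrix
diff = X - mean
md2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff) # quadratic form per object

print(np.argmax(md2), md2.max())                  # the appended point scores highest
```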
  • 110. Parametric Methods III: Using a Mixture of Parametric Distributions ■ Assuming the data are generated by a single normal distribution can sometimes be overly simplistic ■ Example (right figure): The objects between the two clusters cannot be captured as outliers, since they are close to the estimated mean 110 ■ To overcome this problem, assume the normal data are generated by two normal distributions. For any object o in the data set, the probability that o is generated by the mixture of the two distributions is given in terms of fθ1 and fθ2, the probability density functions of θ1 and θ2 ■ Then use the EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from the data ■ An object o is an outlier if it does not belong to any cluster
  • 111. Non-Parametric Methods: Detection Using a Histogram ■ The model of normal data is learned from the input data without any a priori structure ■ Often makes fewer assumptions about the data, and thus can be applicable in more scenarios ■ Outlier detection using a histogram: 111 ■ The figure shows the histogram of purchase amounts in transactions ■ A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000 ■ Problem: Hard to choose an appropriate bin size for the histogram ■ Too small a bin size → normal objects fall in empty/rare bins: false positives ■ Too big a bin size → outliers fall in some frequent bins: false negatives ■ Solution: Adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density at an object is high, the object is likely normal; otherwise, it is likely an outlier
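A sketch of the kernel-density alternative mentioned above, assuming SciPy is available; the toy purchase amounts are ours:

```python
# Score objects by their estimated density; low-density objects are outlier candidates.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
amounts = np.concatenate([rng.normal(100, 30, size=500), [7500.0]])  # toy purchase amounts

kde = gaussian_kde(amounts)           # bandwidth chosen automatically
density = kde(amounts)                # estimated density at each observation
print(amounts[np.argmin(density)])    # the $7,500 transaction has the lowest density
```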
  • 112. Proximity-Based Approaches: Distance-Based vs. Density-Based Outlier Detection ■ Intuition: Objects that are far away from the others are outliers ■ Assumption of proximity-based approach: The proximity of an outlier deviates significantly from that of most of the others in the data set ■ Two types of proximity-based outlier detection methods ■ Distance-based outlier detection: An object o is an outlier if its neighborhood does not have enough other points ■ Density-based outlier detection: An object o is an outlier if its density is relatively much lower than that of its neighbors 112
  • 113. Distance-Based Outlier Detection ■ For each object o, examine the # of other objects in the r-neighborhood of o, where r is a user-specified distance threshold ■ An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o ■ An object o is a DB(r, π) outlier if |{o’ | dist(o, o’) ≤ r}| / ||D|| ≤ π ■ Equivalently, one can check the distance between o and its k-th nearest neighbor ok, where k = ⌈π ||D||⌉; o is an outlier if dist(o, ok) > r ■ Efficient computation: nested-loop algorithm ■ For any object oi, calculate its distance from the other objects, and count the # of other objects in the r-neighborhood ■ If π∙n other objects are within distance r, terminate the inner loop ■ Otherwise, oi is a DB(r, π) outlier ■ Efficiency: Actual CPU time is not O(n2) but linear in the data set size, since for most non-outlier objects the inner loop terminates early 113
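A sketch of the nested-loop detector described above; r and π are the slide's parameters, while the function name and toy data are ours:

```python
# DB(r, pi) nested-loop detector with early termination of the inner loop.
import numpy as np

def db_outliers(X, r, pi):
    n = len(X)
    need = pi * n                          # neighbors required to count as "normal"
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if i != j and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count >= need:          # enough neighbors found: stop early
                    break
        if count < need:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])
print(db_outliers(X, r=1.5, pi=0.05))      # the appended far-away point is reported
```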
  • 114. Distance-Based Outlier Detection: A Grid-Based Method ■ Why is efficiency still a concern? When the complete set of objects cannot be held in main memory, there is an I/O swapping cost ■ The major cost: (1) each object is tested against the whole data set, why not only against its close neighbors? (2) objects are checked one by one, why not group by group? ■ Grid-based method (CELL): The data space is partitioned into a multi-dimensional grid. Each cell is a hypercube with diagonal length r/2 114 ■ Pruning using the level-1 & level-2 cell properties: ■ For any possible point x in cell C and any possible point y in a level-1 cell, dist(x, y) ≤ r ■ For any possible point x in cell C and any point y such that dist(x, y) ≥ r, y is in a level-2 cell ■ Thus we only need to check the objects that cannot be pruned, and even for such an object o, we only need to compute the distance between o and the objects in the level-2 cells (since beyond level-2, the distance from o is more than r)
  • 115. Density-Based Outlier Detection ■ Local outliers: outliers relative to their local neighborhoods, instead of the global data distribution ■ In the figure, o1 and o2 are local outliers relative to C1, o3 is a global outlier, but o4 is not an outlier. However, proximity-based methods cannot identify o1 and o2 as outliers (e.g., when compared with o4) 115 ■ Intuition (density-based outlier detection): The density around an outlier object is significantly different from the density around its neighbors ■ Method: Use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier ■ k-distance of an object o, distk(o): distance between o and its k-th nearest neighbor ■ k-distance neighborhood of o, Nk(o) = {o’ | o’ in D, dist(o, o’) ≤ distk(o)} ■ Nk(o) could be bigger than k since multiple objects may have identical distance to o
  • 116. Local Outlier Factor: LOF ■ Reachability distance from o’ to o: reachdistk(o ← o’) = max{distk(o), dist(o, o’)}, where k is a user-specified parameter ■ Local reachability density of o: lrdk(o) = ||Nk(o)|| / Σo’∈Nk(o) reachdistk(o’ ← o) 116 ■ LOF (Local Outlier Factor) of an object o is the average of the ratios of the local reachability densities of o’s k-nearest neighbors to that of o: LOFk(o) = ( Σo’∈Nk(o) lrdk(o’) / lrdk(o) ) / ||Nk(o)|| ■ The lower the local reachability density of o, and the higher the local reachability densities of the kNN of o, the higher the LOF ■ This captures a local outlier whose local density is relatively low compared to the local densities of its kNN
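A usage sketch with scikit-learn's LocalOutlierFactor (assumed available); the toy data, with a point placed between a dense and a sparse cluster, are ours:

```python
# LOF sketch: score each object's local density against that of its k-nearest neighbors.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),    # a dense cluster
               rng.normal(5, 2.0, size=(100, 2)),    # a sparser cluster
               [[2.5, 2.5]]])                        # a candidate local outlier between them

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 marks detected outliers
scores = -lof.negative_outlier_factor_      # higher score means more outlying
print(labels[-1], scores[-1])
```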
  • 117. Clustering-Based Outlier Detection (1 & 2): Not Belonging to Any Cluster, or Far from the Closest One ■ An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster ■ Case 1: Does not belong to any cluster ■ Identify animals not part of a flock: use a density-based clustering method such as DBSCAN ■ Case 2: Far from its closest cluster ■ Using k-means, partition the data points into clusters ■ For each object o, assign an outlier score based on its distance from its closest center co ■ If dist(o, co)/avg_dist(co) is large, o is likely an outlier ■ Ex. Intrusion detection: consider the similarity between data points and the clusters in a training data set ■ Use a training set to find patterns of “normal” data, e.g., frequent itemsets in each segment, and cluster similar connections into groups ■ Compare new data points with the mined clusters: outliers are possible attacks 117
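A sketch of Case 2, assuming scikit-learn for the k-means step; scoring each object by dist(o, co)/avg_dist(co) follows the idea on the slide, while the data and names are ours:

```python
# Score objects by distance to their k-means center, normalized per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(6, 0.5, size=(100, 2)),
               [[3.0, 10.0]]])                       # far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)   # dist to own center
avg = np.array([d[km.labels_ == j].mean() for j in range(2)])
score = d / avg[km.labels_]                          # dist(o, c_o) / avg_dist(c_o)
print(np.argmax(score), score.max())                 # the appended point scores highest
```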
  • 118. Clustering-Based Outlier Detection (3): Detecting Outliers in Small Clusters ■ FindCBLOF: detect outliers in small clusters ■ Find clusters, and sort them in decreasing size ■ To each data point, assign a cluster-based local outlier factor (CBLOF): ■ If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster ■ If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster 118 ■ Ex. In the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2 but its similarity to C2 is low; in addition, |C3| = 3 is small
  • 119. Clustering-Based Method: Strength and Weakness ■ Strength ■ Detects outliers without requiring any labeled data ■ Works for many types of data ■ Clusters can be regarded as summaries of the data ■ Once the clusters are obtained, one need only compare an object against the clusters to determine whether it is an outlier (fast) ■ Weakness ■ Effectiveness depends highly on the clustering method used; it may not be optimized for outlier detection ■ High computational cost: need to first find the clusters ■ A method to reduce the cost: fixed-width clustering ■ A point is assigned to a cluster if the center of the cluster is within a pre-defined distance threshold of the point ■ If a point cannot be assigned to any existing cluster, a new cluster is created; the distance threshold may be learned from the training data under certain conditions