8 Clustering
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Dissimilarity matrix (one mode):

$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types:
Standardize data
Calculate the mean absolute deviation:
$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

where $m_f$ is the mean of variable $f$.
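A minimal NumPy sketch of this standardization step (function and variable names are illustrative; the final division is the standardized measurement that typically follows the mean absolute deviation):

import numpy as np

def standardize(x):
    # x: 1-D array holding the values x_1f, ..., x_nf of one variable f
    m_f = x.mean()                    # mean of the variable
    s_f = np.abs(x - m_f).mean()      # mean absolute deviation s_f
    return (x - m_f) / s_f            # standardized measurement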
If q = 2, d is Euclidean distance:
$$ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} $$
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
Assign each object to the cluster with the nearest seed point.
Go back to Step 2; stop when no more new assignments are made (a k-means sketch follows the figure below).
[Figure: the K-Means process on a 2-D data set — objects are assigned to the nearest seed point and the cluster means are recomputed until assignments stabilize.]
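A minimal NumPy sketch of the loop above (assumes Euclidean distance and that no cluster becomes empty; names are illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects as the initial seed points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the cluster with the nearest seed point
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Step 3: recompute each cluster mean
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop: no more new assignment
            break
        centers = new_centers
    return labels, centers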
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Handling categorical data: k-modes (sketched below)
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: the k-prototype method
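A minimal sketch of the two k-modes ingredients named above — a matching-based dissimilarity and a frequency-based mode update — in plain Python (names are illustrative):

from collections import Counter

def mismatch(a, b):
    # k-modes dissimilarity: number of attributes on which a and b differ
    return sum(ai != bi for ai, bi in zip(a, b))

def mode_of(cluster):
    # frequency-based update: the most frequent value of each attribute
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

# mode_of([("red", "s"), ("red", "m"), ("blue", "m")]) -> ("red", "m")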
The K-Medoids Clustering Method
[Figure: PAM swapping — four 2-D panels showing how an object j is reassigned when the current medoid i is swapped with a non-medoid h, relative to another medoid t.]

Total swapping cost: $TC_{ih} = \sum_j C_{jih}$, where, depending on how j is reassigned, e.g. $C_{jih} = d(j, t) - d(j, i)$ or $C_{jih} = d(j, h) - d(j, t)$.
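A sketch of this swapping cost, assuming a dissimilarity function d and medoids given as a list of indices into X (all names are illustrative):

def swap_cost(X, d, medoids, i, h):
    # TC_ih: change in total cost if medoid i is replaced by non-medoid h
    after_medoids = [m for m in medoids if m != i] + [h]
    total = 0.0
    for j in range(len(X)):
        before = min(d(X[j], X[m]) for m in medoids)        # cost with i
        after = min(d(X[j], X[m]) for m in after_medoids)   # cost with h
        total += after - before                              # C_jih
    return total  # perform the swap whenever TC_ih < 0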
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built into statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output (sketched below)
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the
sample is biased
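A sketch of that sampling loop; pam() is a hypothetical PAM routine returning k medoid points, and the medoids are scored against the whole data set, not just the sample (all names are illustrative):

import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = float("inf"), None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        medoids = pam(X[idx], k)  # hypothetical: run PAM on the sample
        # score the sample's medoids against the WHOLE data set
        cost = sum(min(np.linalg.norm(x - m) for m in medoids) for x in X)
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids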
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition
[Figure: hierarchical clustering of objects a–e. Agglomerative (AGNES) proceeds bottom-up from Step 0 to Step 4, merging {a,b} → ab, {d,e} → de, {c,de} → cde, and finally abcde; divisive (DIANA) traverses the same hierarchy top-down, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S-Plus
Uses the single-link method and the dissimilarity matrix
Merges the nodes that have the least dissimilarity (a single-link sketch follows the figure below)
Goes on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: AGNES merging clusters on a 2-D data set over successive stages, and the inverse divisive (DIANA) process splitting them back apart.]
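A minimal single-link sketch in plain Python (d is a dissimilarity function; this version stops at k clusters rather than building the full dendrogram; names are illustrative):

def single_link(points, d, k=1):
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the least single-link dissimilarity
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: min(
            d(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]))
        clusters[a].extend(clusters.pop(b))  # merge the two nearest clusters
    return clusters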
CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
Weakness: handles only numeric data and is sensitive to the order of the data records.
Clustering Feature Vector
[Figure: five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) with clustering feature CF = (5, (16,30), (54,190)), where CF = (N, LS, SS): N the number of points, LS their linear sum, and SS their square sum.]
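A direct transcription of the CF definition; additivity of CFs is what lets BIRCH insert points and merge subclusters incrementally:

import numpy as np

def cf(points):
    # CF = (N, LS, SS): count, linear sum, and square sum of the points
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_merge(a, b):
    # CFs of two disjoint point sets simply add component-wise
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

# cf([(3,4), (2,6), (4,5), (4,7), (3,8)]) -> (5, [16, 30], [54, 190])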
[Figure: a non-leaf node of the CF tree stores entries CF1, CF2, CF3, …, CF5, each paired with a pointer child1, child2, child3, …, child5 to a subtree.]
Cure: Shrinking Representative Points
[Figure: a cluster's representative points on 2-D data, before and after shrinking toward the cluster center.]
$$ Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2 $$
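This is the Jaccard coefficient; T1 = {1, 2, 3} and T2 = {3, 4, 5} are one choice of sets consistent with the numbers above:

def jaccard(t1, t2):
    # Sim(T1, T2) = |T1 n T2| / |T1 u T2|
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

# jaccard({1, 2, 3}, {3, 4, 5}) -> 0.2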
[Figure: pipeline from the data set through partitioning and merging to the final clusters.]
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that each p_{i+1} is directly density-reachable from p_i.
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
[Figure: DBSCAN point types — core, border, and outlier points, with Eps = 1 cm and MinPts = 5.]
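For experimentation, scikit-learn ships a DBSCAN implementation whose eps and min_samples parameters correspond to the Eps and MinPts above (a minimal sketch on toy data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))      # toy 2-D data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # Eps, MinPts
# labels == -1 marks noise (outlier) points; other values index the clusters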
OPTICS: Some Extensions from DBSCAN
Index-based:
k = number of dimensions
N = 20
p = 75%
M = N(1 − p) = 5
Complexity: O(kN²)
Core distance and reachability distance:

$$ \text{reachability-distance}(p, o) = \max(\text{core-distance}(o),\ d(o, p)) $$

[Figure: with MinPts = 5 and ε = 3 cm, r(p1, o) = 2.8 cm and r(p2, o) = 4 cm.]
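The reachability distance is a one-liner; reading the figure as core-distance(o) = 2.8 cm (an assumption on my part) reproduces both values:

def reachability_distance(core_dist_o, d_o_p):
    # max(core-distance(o), d(o, p))
    return max(core_dist_o, d_o_p)

# reachability_distance(2.8, 2.0) -> 2.8  (p1: raised to the core distance;
#                                          2.0 is an illustrative d(o, p1))
# reachability_distance(2.8, 4.0) -> 4.0  (p2: the actual distance dominates)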
[Figure: reachability plot — the reachability-distance of each object (undefined where no predecessor exists) plotted against the cluster order of the objects; valleys in the plot correspond to clusters.]
DENCLUE: using density functions
DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
But needs a large number of parameters
Example: Gaussian influence, density, and gradient functions:

$$ f_{\text{Gaussian}}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}} $$

$$ f^D_{\text{Gaussian}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}} $$

$$ \nabla f^D_{\text{Gaussian}}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}} $$
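A direct NumPy transcription of the density and gradient functions, where data is an (N, p) array of the points x_i (names are illustrative):

import numpy as np

def gaussian_density(x, data, sigma):
    # f^D_Gaussian(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2))
    d2 = ((data - x) ** 2).sum(axis=1)
    return float(np.exp(-d2 / (2 * sigma ** 2)).sum())

def gaussian_gradient(x, data, sigma):
    # gradient: sum_i (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2));
    # hill-climbing along it leads to the density attractors
    d2 = ((data - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((data - x) * w[:, None]).sum(axis=0)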
Incremental update
O(K), where K is the number of grid cells at the lowest level
Disadvantages:
All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
Multi-resolution
Cost efficiency
Major features:
Complexity O(N)
[Figure: CLIQUE example — dense units with density threshold τ = 3 are found in the (salary, age) and (vacation (week), age) planes over ages 20–60, then intersected to form a candidate dense region in the (vacation, salary, age) space around ages 30–50.]
Strength
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces (a dense-unit sketch follows)
COBWEB: builds a classification tree
CLASSIT: an extension of COBWEB for incremental clustering of continuous data; suffers similar problems as COBWEB
AutoClass: uses Bayesian statistical analysis to estimate the number of clusters
Popular in industry
Clustering is performed by having several units (neurons) compete for the current object
Neurons compete in a “winner-takes-all” fashion for the object currently being presented
Example: exceptional athletes such as Wayne Gretzky, ...
Problem
Find top n outlier points
Applications:
Credit card fraud detection
Customer segmentation
Medical analysis
Drawbacks
Most tests are for a single attribute
In many cases, the data distribution may not be known
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
Algorithms for mining distance-based outliers (the nested-loop variant is sketched after this list)
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
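A sketch of the nested-loop variant, taken straight from the DB(p, D) definition; dist is a distance function and all names are illustrative:

def db_outliers(X, p, D, dist):
    # O is a DB(p, D)-outlier if at least a fraction p of the other
    # objects lie at distance greater than D from O
    n, outliers = len(X), []
    for i in range(n):
        far = sum(1 for j in range(n) if j != i and dist(X[i], X[j]) > D)
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers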
Outlier Discovery: Deviation-
Based Approach
Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
sequential exception technique
simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
OLAP data cube technique
uses data cubes to identify regions of anomalies in
large multidimensional data
Problems and Challenges