Data Mining Clustering
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning:
no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary
Efforts
Business intelligence
• Organize customers into groups, improve project management
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or support other spatial mining tasks
Image Processing
WWW
Document classification
Cluster Web blog data to discover groups of similar access patterns
Clustering: Application Examples
Biology:
taxonomy of living things: kingdom, phylum, class, order, family, genus
and species
Information retrieval: document clustering
Land use:
Identification of areas of similar land use in an earth observation
database
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered
along continental faults
Climate: understanding Earth's climate, finding patterns of atmospheric and
ocean behavior
Economic Science: market research
Requirements and Challenges
Scalability
Clustering all the data instead of only samples; a large database
may contain millions or even billions of data objects
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
New applications – to deal with graphs, sequences, images and
documents
Discovery of clusters with arbitrary shape
Algorithms based on distance measures tend to find only spherical clusters with similar size and density
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
Capability of clustering high-dimensionality data
Requirements of domain knowledge (e.g., the required number of clusters)
Constraint-based clustering
• User may give inputs on constraints
Interpretability and usability
Quality: What Is Good Clustering?
Clustering Problem
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically
metric: d(i, j)
The definitions of distance functions are usually rather different
for interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables
Weights should be associated with different variables based on
applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures the
“goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
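For illustration, a minimal sketch (not from the slides) of one possible dissimilarity metric: a Euclidean distance with per-attribute weights; the weights and points are arbitrary assumptions.

import numpy as np

def weighted_euclidean(x, y, w=None):
    # d(i, j): Euclidean distance with per-attribute weights reflecting data semantics
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return np.sqrt(np.sum(w * (x - y) ** 2))

# weight the second attribute twice as heavily as the first
print(weighted_euclidean([1.0, 2.0], [4.0, 6.0], w=[1.0, 2.0]))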
Major Clustering Approaches (I)
Partitioning approach:
Given n objects, a partitioning method constructs k partitions of the data, where
each partition represents a cluster and k ≤ n
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: Diana, Agnes, BIRCH, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based approach:
Quantize the object space into a finite number of cells that form a grid
structure
Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the
best fit of the data to the given model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Partitioning Algorithms: Basic Concept
Partitioning method:
Given a dataset D of n objects, partition it into a set of k clusters such that
the sum of squared distances is minimized (where an object p ∈ Ci and ci is the
centroid or medoid of cluster Ci)
Clustering by k-means minimizes
E = \sum_{i=1}^{k} \sum_{p \in C_i} (d(p, c_i))^2
An Example of K-Means Clustering (K = 2)
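Since the worked figures are not reproduced, here is a minimal NumPy sketch of Lloyd's k-means that minimizes the objective E above; the dataset, K = 2, the random seed, and the function name kmeans are illustrative assumptions.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Lloyd's k-means: alternate assignment and centroid update until stable
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # pick k initial centers
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # the objective E
    return labels, centroids, sse

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels, centers, sse = kmeans(X, k=2)
print(labels, sse)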
Comments on the K-Means Method
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
Comment: often terminates at a local optimum
Weakness
Applicable only to objects in a continuous n-dimensional space
For nominal data – a variant of k-means known as k-modes can
be used.
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
What Is the Problem of the K-Means Method?
(figure omitted: two example scatter plots on a 0–10 grid)
PAM: A Typical K-Medoids Algorithm
Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid (total cost = 20 in the example)
Do loop until no change:
  Randomly select a non-medoid object O_random
  Compute the total cost of swapping a medoid O with O_random
  If the quality is improved, perform the swap
(figure omitted: 0–10 scatter plots illustrating each step)
The K-Medoid Clustering Method
PAM
To replace an existing medoid, calculate the cost of replacement for every pair of a
current medoid and a non-medoid object
E.g., with medoids A, B and non-medoids C, D, E, we have to find TC_AC, TC_AD, TC_AE, TC_BC, TC_BD, and TC_BE
Here TC_AC, the total cost of replacing medoid A with non-medoid C, is calculated as the sum over all objects j of the change in cost C_jAC caused by the swap
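A small sketch (an assumption, not the original slide's worked example) of how swap costs such as TC_AC can be computed from a distance matrix; the matrix, object indices, and function names are illustrative.

import numpy as np

def total_cost(dist, medoids):
    # sum of distances from every object to its nearest medoid
    return dist[:, medoids].min(axis=1).sum()

def swap_cost(dist, medoids, out_medoid, candidate):
    # TC of replacing out_medoid with non-medoid candidate (negative = improvement)
    new_medoids = [m for m in medoids if m != out_medoid] + [candidate]
    return total_cost(dist, new_medoids) - total_cost(dist, medoids)

# illustrative symmetric distance matrix over 5 objects; medoids start as objects 0 and 1
dist = np.array([[0, 2, 6, 7, 9],
                 [2, 0, 5, 6, 8],
                 [6, 5, 0, 1, 3],
                 [7, 6, 1, 0, 2],
                 [9, 8, 3, 2, 0]], dtype=float)
medoids = [0, 1]
for m in medoids:
    for c in [2, 3, 4]:
        print(f"TC of swapping medoid {m} with object {c}: {swap_cost(dist, medoids, m, c):+.1f}")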
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(diagram) Agglomerative (AGNES), steps 0 to 4: start from singletons a, b, c, d, e; merge into {a, b} and {d, e}; then {c, d, e}; finally {a, b, c, d, e}
Divisive (DIANA), steps 4 to 0: start from {a, b, c, d, e} and split in the reverse order
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
(figure omitted: three 0–10 scatter plots showing successive single-link merges)
Dendrogram: Shows How Clusters are Merged
DIANA (Divisive Analysis)
Inverse order of AGNES: eventually each node forms its own cluster
(figure omitted: three 0–10 scatter plots showing successive splits)
Distance between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} d(t_{ip}, t_{jq})
Single Link Algorithm
Single link - example
Complete link algorithm
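A brief usage sketch of single-link and complete-link agglomerative clustering with SciPy (scipy.cluster.hierarchy); the choice of library, the sample points, and the cut into three clusters are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# illustrative 2-D points
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.3, 0.4],
              [5.0, 5.0], [5.2, 4.8], [9.0, 0.2]])

# single link: merge the pair of clusters with the smallest minimum inter-point distance
Z_single = linkage(X, method='single')
# complete link: merge the pair with the smallest maximum inter-point distance
Z_complete = linkage(X, method='complete')

# cut each dendrogram into 3 flat clusters
print(fcluster(Z_single, t=3, criterion='maxclust'))
print(fcluster(Z_complete, t=3, criterion='maxclust'))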
Extensions to Hierarchical Clustering
Major weaknesses of agglomerative clustering methods: merge (or split) decisions can
never be undone, and the methods do not scale well, with time complexity of at least O(n²) for n objects
Clustering Feature Vector in BIRCH
CF = (N, LS, SS): N is the number of data points in the subcluster, LS = \sum_{i=1}^{N} X_i (linear sum), and SS = \sum_{i=1}^{N} X_i^2 (square sum)
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8): CF = (5, (16,30), (54,190))
(figure omitted: the five points plotted on a 0–10 grid)
CF-Tree in BIRCH
Clustering feature:
Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
Registers crucial measurements for computing clusters and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: max # of children
Threshold: max diameter of sub-clusters stored at the leaf
BIRCH Algorithm
Input:  D = {t1, t2, …, tn}
        T   // threshold for CF tree construction
Output: K   // set of clusters
Algorithm BIRCH:
  for each ti ∈ D do
    determine the correct leaf node for insertion of ti
    if the threshold condition is not violated then
      add ti to the cluster and update the CF triple
    else
      if there is room to insert ti then
        insert ti as a single cluster and update the CF triples
      else
        split the leaf node and redistribute the CF features
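A minimal sketch of the CF triple (N, LS, SS) and its additive update, mirroring the insertion step in the pseudocode above; the CF class, the diameter-based threshold test, and the threshold value 6.0 are simplifying assumptions.

import numpy as np

class CF:
    # clustering feature: (N, LS, SS) summarising a subcluster additively
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), p * p

    def merged_with(self, point):
        # return the CF triple after absorbing one more point (CFs are additive)
        p = np.asarray(point, dtype=float)
        out = CF(p)
        out.N, out.LS, out.SS = self.N + 1, self.LS + p, self.SS + p * p
        return out

    def diameter(self):
        # average pairwise distance of the subcluster, derived from (N, LS, SS) alone
        if self.N < 2:
            return 0.0
        sq = (2 * self.N * self.SS.sum() - 2 * (self.LS ** 2).sum()) / (self.N * (self.N - 1))
        return np.sqrt(max(sq, 0.0))

cf = CF([3, 4])
for p in [[2, 6], [4, 5], [4, 7], [3, 8]]:
    candidate = cf.merged_with(p)
    cf = candidate if candidate.diameter() <= 6.0 else cf  # absorb only if threshold T holds
print(cf.N, cf.LS, cf.SS)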
The CF Tree Structure
(figure: a CF tree; the root and non-leaf nodes hold entries CF1, CF2, … up to the branching factor, each pointing to a child node; leaf nodes hold CF entries and are chained together with prev/next pointers)
The Birch Algorithm
Cluster diameter: D = \sqrt{\frac{1}{n(n-1)} \sum_{i \neq j} (x_i - x_j)^2}
CHAMELEON: overall framework
Data set → construct a sparse k-NN graph (p and q are connected if q is among the top-k closest neighbors of p) → partition the graph → merge partitions → final clusters
Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
Relative closeness: closeness of c1 and c2 over their internal closeness
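A sketch of the first step only, building the sparse k-NN graph with scikit-learn's kneighbors_graph; the partitioning and merging phases are left to external graph tools, and the data, k = 5, and library choice are assumptions.

import numpy as np
from sklearn.neighbors import kneighbors_graph

# illustrative data: two elongated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [2.0, 0.3], size=(50, 2)),
               rng.normal([0, 4], [2.0, 0.3], size=(50, 2))])

# sparse k-NN graph: p is connected to its k closest neighbors q
A = kneighbors_graph(X, n_neighbors=5, mode='distance', include_self=False)
print(A.shape, A.nnz)  # adjacency in sparse CSR form, ready for graph partitioning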
CHAMELEON (Clustering Complex Objects)
Probabilistic Hierarchical Clustering
Algorithmic hierarchical clustering
Nontrivial to choose a good distance measure
Hard to handle missing attribute values
Optimization goal not clear: heuristic, local search
Probabilistic hierarchical clustering
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
In practice, assume the generative models adopt common distribution
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
Generative Model
Given a set of 1-D points X = {x1, …, xn} for clustering analysis and assuming
they are generated by a Gaussian distribution N(\mu, \sigma^2), the density at a point x is
P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
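A short sketch fitting this single Gaussian by maximum likelihood (sample mean and 1/n variance) and evaluating the log-likelihood of the points; the data and function names are illustrative assumptions.

import numpy as np

def fit_gaussian(x):
    # maximum-likelihood estimates of mu and sigma^2 for 1-D points
    x = np.asarray(x, dtype=float)
    return x.mean(), x.var()  # np.var uses the 1/n (MLE) form by default

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.array([1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9])
mu, sigma2 = fit_gaussian(x)
log_likelihood = np.log(gaussian_pdf(x, mu, sigma2)).sum()
print(mu, sigma2, log_likelihood)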
Gaussian Distribution
(figure: a bean machine, in which balls dropped through rows of pins pile up in an approximately Gaussian shape, alongside plots of 1-d and 2-d Gaussian densities; from Wikipedia and http://home.dei.polimi.it)
A Probabilistic Hierarchical Clustering Algorithm
For a set of objects partitioned into m clusters C1, …, Cm, the quality can be
measured by Q(\{C_1, \ldots, C_m\}) = \prod_{i=1}^{m} P(C_i), the product of the probabilities of the clusters under the generative model
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-
based)
Density-Based Clustering: Basic
Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
N_Eps(q): {p ∈ D | dist(p, q) ≤ Eps}
Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if
p ∈ N_Eps(q) and q is a core object, i.e., |N_Eps(q)| ≥ MinPts
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn with p1 = q and pn = p such that
each pi+1 is directly density-reachable from pi
(figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5)
DBSCAN: The Algorithm
Input: D, a data set containing n objects; ε, the radius parameter; MinPts, the neighborhood density threshold
Method:
  mark all objects as unvisited
  do
    randomly select an unvisited object p
    mark p as visited
    if the ε-neighborhood of p has at least MinPts objects then
      create a new cluster C, and add p to C
      let N be the set of objects in the ε-neighborhood of p
      for each point p' in N
        if p' is unvisited then
          mark p' as visited
          if the ε-neighborhood of p' has at least MinPts points then
            add those points to N
        if p' is not yet a member of any cluster then add p' to C
      end for
      output C
    else
      mark p as noise
  until no object is unvisited
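Rather than re-implementing the pseudocode, a usage sketch with scikit-learn's DBSCAN on data with non-convex clusters; eps and min_samples correspond to ε and MinPts, and the dataset and parameter values are assumptions.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# non-convex clusters that k-means would split incorrectly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of the radius ε, min_samples of MinPts; label -1 marks noise
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels, return_counts=True))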
DBSCAN: Sensitive to Parameters
(figures omitted: DBSCAN results for different Eps/MinPts settings)
(figure: OPTICS reachability plot of reachability distance, with 'undefined' values, against the cluster ordering of the objects)
Density-Based Clustering: OPTICS & Applications
demo: http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo
DENCLUE: Using Statistical Density
Functions
DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
Using statistical density functions:
Influence of y on x: f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}
Total influence on x: f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
Gradient of x in the direction of x_i: \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
Significantly faster than existing algorithms (e.g., DBSCAN)
But needs a large number of parameters
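A small sketch evaluating the Gaussian density function and its gradient from the formulas above; the points, the query location, and σ = 0.5 are illustrative assumptions.

import numpy as np

def density(x, points, sigma):
    # total influence f^D(x) = sum_i exp(-||x - x_i||^2 / (2 sigma^2))
    d2 = ((points - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def gradient(x, points, sigma):
    # sum_i (x_i - x) exp(-||x - x_i||^2 / (2 sigma^2)); points uphill toward a density attractor
    d2 = ((points - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((points - x) * w[:, None]).sum(axis=0)

pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0]])
x = np.array([1.5, 1.5])
print(density(x, pts, sigma=0.5), gradient(x, pts, sigma=0.5))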
Denclue: Technical Essence
Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a tree-based
access structure
Influence function: describes the impact of a data point within its
neighborhood
Overall density of the data space can be calculated as the sum of the
influence function of all data points
Clusters can be determined mathematically by identifying density
attractors
Density attractors are local maxima of the overall density function
Center defined clusters: assign to each density attractor the points
density attracted to it
Arbitrary shaped cluster: merge density attractors that are connected
through paths of high density (> threshold)
Density Attractor
Center-Defined and Arbitrary-Shaped Clusters
Entropy-Based Measure (II):
Normalized mutual information (NMI)
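A usage sketch of NMI as an extrinsic quality measure, using scikit-learn's normalized_mutual_info_score; the ground-truth and predicted label vectors are made-up examples.

from sklearn.metrics import normalized_mutual_info_score

# ground-truth labels vs. labels produced by a clustering algorithm
truth     = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted = [1, 1, 1, 0, 0, 2, 2, 2, 2]

# NMI = 1 for a perfect match (up to label permutation), near 0 for independent labelings
print(normalized_mutual_info_score(truth, predicted))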
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways