Data Mining: Concepts and Techniques
— Chapter 7 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Standardize data
Calculate the mean absolute deviation:
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
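A minimal sketch of the standardization step, assuming the usual z-score z_if = (x_if - m_f) / s_f, where m_f is the mean of variable f and s_f is the mean absolute deviation. The function name `standardize` is illustrative only.

```python
def standardize(values):
    """Return z-scores of one variable f, using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical, so s_f is zero")
    return [(x - m_f) / s_f for x in values]


# m_f = 5 and s_f = 2 for this sample.
print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```

The mean absolute deviation is less affected by outliers than the standard deviation, because deviations are not squared.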
If q = 2, d is Euclidean distance:
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{in} - x_{jn}|^2}$
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
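A small sketch of the distance with exponent q, where q = 2 gives the Euclidean case, followed by a numeric check of the four properties on three sample points. The function name `minkowski` and the sample points are assumptions for illustration.

```python
def minkowski(x, y, q=2):
    """Distance between two equal-length numeric vectors; q = 2 is Euclidean."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


i, j, k = (1.0, 2.0), (4.0, 6.0), (0.0, 0.0)
print(minkowski(i, j))        # 5.0 (Euclidean)
print(minkowski(i, j, q=1))   # 7.0 (Manhattan)

# Spot checks of the listed properties.
print(minkowski(i, j) >= 0)
print(minkowski(i, i) == 0)
print(minkowski(i, j) == minkowski(j, i))
print(minkowski(i, j) <= minkowski(i, k) + minkowski(k, j))
```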
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y Y N N N N
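One way to work this example is sketched below. It assumes that Gender is a symmetric attribute and is left out, that the remaining attributes are asymmetric binary with Y and P coded as 1 and N as 0, and that dissimilarity is the asymmetric binary (Jaccard-style) measure (r + s) / (q + r + s). The names `records` and `asymmetric_binary_dissimilarity` are illustrative.

```python
# Fever, Cough, Test-1 .. Test-4, coded 1 for Y/P and 0 for N (assumed coding).
records = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}


def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s); joint absences (0, 0) are ignored."""
    q = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)  # both 1
    r = sum(1 for u, v in zip(a, b) if u == 1 and v == 0)  # only a is 1
    s = sum(1 for u, v in zip(a, b) if u == 0 and v == 1)  # only b is 1
    if q + r + s == 0:
        return 0.0
    return (r + s) / (q + r + s)


for left, right in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(records[left], records[right])
    print(left, right, round(d, 2))
```

Under these assumptions Jack and Mary come out as the most similar pair, and Jim and Mary as the least similar.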
Methods:
treat them like interval-scaled variables: not a good choice, because the scale can be distorted
apply a logarithmic transformation to a ratio-scaled variable f having value xif for object i:
yif = log(xif)
treat xif as continuous ordinal data and treat their ranks as interval-valued
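The second and third methods can be sketched as follows. The natural logarithm is assumed (any base only rescales the result), and giving tied values the same rank is one possible convention. The function names are illustrative.

```python
import math


def log_transform(values):
    """y_if = log(x_if); values must be positive."""
    return [math.log(x) for x in values]


def ranks(values):
    """Rank 1 for the smallest distinct value; tied values share a rank."""
    order = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [order[v] for v in values]


x = [445.0, 22.0, 164.0, 1210.0]
print([round(y, 2) for y in log_transform(x)])
print(ranks(x))  # [3, 1, 2, 4]
```

After either transformation the values can be handled like interval-scaled data.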
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
June 3, 2024 Data Mining: Concepts and Techniques 34
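Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the “middle” of a cluster
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
Diameter: square root of the average squared distance between all pairs of points in the cluster
$D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{i=1}^{N} (t_{ip} - t_{iq})^2}{N(N-1)}}$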
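A sketch of the three cluster summaries for points given as numeric tuples. The radius formula is not shown above; the code assumes the common definition, the square root of the mean squared distance from each point to the centroid. The diameter is taken over all ordered pairs of points and needs at least two of them.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def radius(points):
    """Assumed definition: sqrt of the mean squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))


def diameter(points):
    """sqrt of the summed squared pairwise distances over N(N-1); needs N >= 2."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))


square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(centroid(square))            # [1.0, 1.0]
print(round(radius(square), 4))    # 1.4142
print(round(diameter(square), 4))  # 2.3094
```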
Example
[Figure: k-means with K = 2 on 0 to 10 axes. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign]
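The assign/update/reassign loop in this example can be sketched as below. This is a plain Lloyd-style k-means under stated assumptions: squared Euclidean distance, initial centers passed in by the caller, and an empty cluster keeping its old center. The function names and sample points are illustrative.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, centers, max_iter=100):
    """Return (centers, clusters); clusters come from the last assignment step."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update the cluster means.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])   # empty cluster keeps its center
        if new_centers == centers:               # no change: stop
            break
        centers = new_centers
    return centers, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (2, 1)])
print(clusters)
```

Even with both initial centers placed in the same group, the means drift apart after one update and the two groups separate on the next reassignment.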
[Figure: a k-medoids run on 0 to 10 axes. Panels: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; compute the total cost of swapping; swap O and O_random if quality is improved; do loop until no change]
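A rough sketch of the swap loop in this figure. It is a simplification: every non-medoid object is tried as a swap candidate in turn instead of a randomly chosen one, Manhattan distance is assumed, and a swap is accepted as soon as it lowers the total cost. All names are illustrative.

```python
def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def k_medoids(points, medoids):
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:                          # do loop until no change
        improved = False
        for i in range(len(medoids)):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:              # swap only if quality improves
                    medoids, best, improved = trial, cost, True
    return medoids, best


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
medoids, cost = k_medoids(points, medoids=[(1, 1), (1, 2)])
print(sorted(medoids), cost)  # [(1, 1), (8, 8)] 4
```

Unlike a mean, each medoid is always one of the input objects, which is why a single far-away value distorts it less.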
[Figure: four panels on 0 to 10 axes showing objects t, i, h, and j in different relative positions]
[Figure: scatter plot of the points (3,4), (2,6), (4,5), (4,7), (3,8) on 0 to 10 axes]
Clustering feature:
summary of the statistics for a given subcluster: the 0-th, 1st and
2nd moments of the subcluster from the statistical point of view.
registers crucial measurements for computing clusters and uses storage efficiently
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: the maximum number of children per node
Threshold: the maximum diameter of sub-clusters stored at the leaf nodes
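A sketch of a clustering feature as the triple (N, LS, SS): the count, the linear sum, and the square sum of the points. Keeping SS per dimension is an assumption; a single scalar sum is also common. The sample data are the five points (3,4), (2,6), (4,5), (4,7), (3,8). Because CFs add component-wise, a non-leaf entry can be obtained by summing its children, and the diameter used for the threshold test can be computed from the CF alone.

```python
import math


def cf(points):
    """Clustering feature (N, LS, SS) of a list of d-dimensional points."""
    d = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(d))       # 1st moment
    ss = tuple(sum(p[k] ** 2 for p in points) for k in range(d))  # 2nd moment
    return (len(points), ls, ss)


def merge(cf1, cf2):
    """Additivity: the CF of a union is the component-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))


def diameter_from_cf(entry):
    """sqrt((2*N*SS - 2*|LS|^2) / (N*(N-1))), with no access to the raw points."""
    n, ls, ss = entry
    return math.sqrt((2 * n * sum(ss) - 2 * sum(v * v for v in ls)) / (n * (n - 1)))


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(points))                                           # (5, (16, 30), (54, 190))
print(merge(cf(points[:2]), cf(points[2:])) == cf(points))  # True
print(round(diameter_from_cf(cf(points)), 4))               # 2.5298
```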
The CF Tree Structure
[Figure: CF tree. The root and each non-leaf node hold entries CF1, CF2, CF3, CF5, each paired with a pointer child1, child2, child3, child5]
Major ideas
Use links to measure similarity/proximity
Not distance-based
Experiments
Congressional voting, mushroom data
[Figure: overall framework: Data Set, Construct Sparse Graph, Partition the Graph, Merge Partition, Final Clusters]
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q)
core point condition: |NEps(q)| >= MinPts
[Figure: p inside the Eps-neighbourhood of q; MinPts = 5, Eps = 1 cm]
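These definitions translate fairly directly into code. The sketch below assumes Euclidean distance and that a point counts as a member of its own Eps-neighbourhood (its distance to itself is 0). The function names are illustrative, and `math.dist` needs Python 3.8 or later.

```python
import math


def eps_neighbourhood(D, p, eps):
    """N_Eps(p) = {q in D | dist(p, q) <= Eps}; p itself is included."""
    return [q for q in D if math.dist(p, q) <= eps]


def is_core(D, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighbourhood(D, q, eps)) >= min_pts


def directly_density_reachable(D, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return math.dist(p, q) <= eps and is_core(D, q, eps, min_pts)


D = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0), (3, 0)]
print(directly_density_reachable(D, (0, 1), (0, 0), eps=1.0, min_pts=5))  # True
print(directly_density_reachable(D, (0, 0), (0, 1), eps=1.0, min_pts=5))  # False
```

The two calls show that the relation is not symmetric: (0, 0) is a core point, while (0, 1) is not.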
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density Based Spatial Clustering of
Applications with Noise
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]
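A compact, unoptimized sketch of how clusters can be grown from core points. It assumes Euclidean distance, counts a point as part of its own neighbourhood, and computes all neighbourhoods up front in O(N^2) rather than with a spatial index. Label -1 marks outliers (noise). The function name `dbscan` and the sample data are illustrative.

```python
import math


def dbscan(D, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    neighbours = [[j for j, q in enumerate(D) if math.dist(p, q) <= eps] for p in D]
    labels = [None] * len(D)
    cluster = -1
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1                # provisional noise, may become border
            continue
        cluster += 1                      # i is a core point: start a new cluster
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                frontier.extend(neighbours[j])   # only core points expand
    return labels


D = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5),
     (10, 0)]
print(dbscan(D, eps=1.0, min_pts=4))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, -1]
```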
OPTICS: Some Extension from DBSCAN
Index-based:
k = number of dimensions
N = 20
p = 75%
M = N(1-p) = 5
Complexity: O(kN^2)
Core Distance
Reachability Distance
Max (core-distance (o), d (o, p))
[Figure: object o with neighbours p1 and p2; MinPts = 5, ε = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm]
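A sketch of the two distances. It assumes that the core-distance of o is the distance to its MinPts-th nearest object with o itself counted, and that it is undefined (returned as `None`) when that distance exceeds Eps. Conventions differ on whether o counts, so this is one reading rather than the definitive one. Function names and sample data are illustrative.

```python
import math


def core_distance(D, o, eps, min_pts):
    """Distance to the MinPts-th nearest object (o included), or None if o is not core."""
    dists = sorted(math.dist(o, q) for q in D)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]


def reachability_distance(D, p, o, eps, min_pts):
    """max(core-distance(o), d(o, p)); None (undefined) if o is not a core object."""
    core = core_distance(D, o, eps, min_pts)
    if core is None:
        return None
    return max(core, math.dist(o, p))


D = [(0, 0), (1, 0), (0, 1), (-2, 0), (0, -3), (4, 0)]
o = (0, 0)
print(core_distance(D, o, eps=3.0, min_pts=4))                   # 2.0
print(reachability_distance(D, (1, 0), o, eps=3.0, min_pts=4))   # 2.0
print(reachability_distance(D, (4, 0), o, eps=3.0, min_pts=4))   # 4.0
print(reachability_distance(D, o, (4, 0), eps=3.0, min_pts=4))   # None
```

A point closer to o than the core-distance is assigned the core-distance itself, while a point farther away keeps its actual distance.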
[Figure: reachability plot. Vertical axis: reachability-distance (may be undefined); horizontal axis: cluster-order of the objects]
Density-Based Clustering: OPTICS & Its Applications
$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
$\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
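The two Gaussian formulas can be evaluated directly, as in the sketch below. It assumes Euclidean distance for d and a single smoothing parameter sigma; the function names are illustrative.

```python
import math


def gaussian_density(x, data, sigma):
    """Sum over the data set of exp(-d(x, x_i)^2 / (2 sigma^2))."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)


def gaussian_gradient(x, data, sigma):
    """Sum over the data set of (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))."""
    grad = [0.0] * len(x)
    for xi in data:
        weight = math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2))
        for k in range(len(x)):
            grad[k] += (xi[k] - x[k]) * weight
    return grad


data = [(0.0, 0.0), (2.0, 0.0)]
print(round(gaussian_density((1.0, 0.0), data, sigma=1.0), 4))  # 1.2131
print(gaussian_gradient((1.0, 0.0), data, sigma=1.0))           # [0.0, 0.0]
print(gaussian_gradient((0.0, 0.0), data, sigma=1.0)[0] > 0)    # True
```

The gradient points toward regions of higher density. Midway between two equal points it cancels out, and at one of the points it pulls toward the other.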
Major features:
Complexity O(N)
Maximization step:
Estimation of model parameters
objects
Finds characteristic description for each concept (class)
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a
classification tree
Each node refers to a concept and contains a
Competitive learning
Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify the subspaces that contain clusters using the
Apriori principle
Identify clusters
Determine dense units in all subspaces of interest
Determine connected dense units in all subspaces of interest
Generate minimal description for the clusters
Determine maximal regions that cover a cluster of
[Figure: grids of age (20 to 60) against Salary and against Vacation (week), each on a 0 to 7 scale, with threshold = 3; a third panel combines Salary, Vacation, and age, with ages 30 and 50 marked]
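A toy sketch of the first steps: partition each dimension into equal-width intervals, count the points per cell, and use the Apriori principle so that a 2-D unit is counted only if both of its 1-D projections are dense. The interval widths, the threshold `tau`, and the sample rows are made-up illustration values, and the sketch stops before the connection and minimal-description steps.

```python
from collections import Counter
from itertools import combinations


def dense_units_1d(rows, width, tau):
    """Map each dimension to the set of interval indices holding >= tau points."""
    dense = {}
    for dim in range(len(rows[0])):
        counts = Counter(int(r[dim] // width[dim]) for r in rows)
        dense[dim] = {cell for cell, n in counts.items() if n >= tau}
    return dense


def dense_units_2d(rows, width, tau, dense_1d):
    """Apriori step: only cells whose 1-D projections are both dense are counted."""
    dense = {}
    for a, b in combinations(sorted(dense_1d), 2):
        counts = Counter()
        for r in rows:
            cell = (int(r[a] // width[a]), int(r[b] // width[b]))
            if cell[0] in dense_1d[a] and cell[1] in dense_1d[b]:
                counts[cell] += 1
        dense[(a, b)] = {cell for cell, n in counts.items() if n >= tau}
    return dense


# (age, salary) rows; age intervals of width 10, salary intervals of width 1.
rows = [(31, 4.0), (33, 4.5), (35, 4.2), (52, 2.0), (24, 6.0), (38, 4.8)]
one_d = dense_units_1d(rows, width=(10, 1), tau=3)
print(one_d)                                                # {0: {3}, 1: {4}}
print(dense_units_2d(rows, width=(10, 1), tau=3, dense_1d=one_d))  # {(0, 1): {(3, 4)}}
```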
Strength
automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
insensitive to the order of records in input and does not
Where
$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}$, $d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}$, $d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} d_{ij}$
A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
Problems with bi-cluster
No downward closure property,
Due to averaging, it may contain outliers but still within δ-threshold
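A sketch of the δ-cluster test. The formula for H(I, J) is not reproduced above; the code assumes the usual mean squared residue, the average over the submatrix of (d_ij - d_iJ - d_Ij + d_IJ)^2, built from the row, column, and overall means. The function name and the sample matrix are illustrative.

```python
def mean_squared_residue(matrix, I, J):
    """H(I, J) for the submatrix given by row indices I and column indices J."""
    d_iJ = {i: sum(matrix[i][j] for j in J) / len(J) for i in I}    # row means
    d_Ij = {j: sum(matrix[i][j] for i in I) / len(I) for j in J}    # column means
    d_IJ = sum(matrix[i][j] for i in I for j in J) / (len(I) * len(J))
    return sum((matrix[i][j] - d_iJ[i] - d_Ij[j] + d_IJ) ** 2
               for i in I for j in J) / (len(I) * len(J))


matrix = [
    [1, 2, 3],
    [2, 3, 4],   # row 0 shifted by +1
    [5, 6, 9],   # does not follow the same pattern
]
print(mean_squared_residue(matrix, I=[0, 1], J=[0, 1, 2]))        # 0.0
print(mean_squared_residue(matrix, I=[0, 1, 2], J=[0, 1, 2]) > 0)  # True
```

Rows that differ only by a constant shift give a residue of zero, so the first submatrix is a δ-cluster for any δ > 0.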
p-Clustering: Clustering
by Pattern Similarity
Given objects x, y in O and features a, b in T, a pCluster is a 2 by 2 matrix
$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$
A pair (O, T) is in δ-pCluster if for any 2 by 2 matrix X in (O, T),
pScore(X) ≤ δ for some δ > 0
Properties of δ-pCluster
Downward closure
Clusters are more homogeneous than bi-cluster (thus the name:
pair-wise Cluster)
Pattern-growth algorithm has been developed for efficient mining
For scaling patterns, one can observe that taking the logarithm of $\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}$ leads to the pScore form
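A brute-force sketch of the δ-pCluster test: compute pScore for every pair of objects and every pair of features, and require each score to be at most δ. This checks the definition directly and is not the pattern-growth algorithm mentioned above. Function names and the sample matrix are illustrative.

```python
from itertools import combinations


def p_score(d_xa, d_xb, d_ya, d_yb):
    """|(d_xa - d_xb) - (d_ya - d_yb)| for one 2 by 2 submatrix."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(matrix, objects, features, delta):
    """True if every 2 by 2 submatrix of (objects, features) has pScore <= delta."""
    return all(
        p_score(matrix[x][a], matrix[x][b], matrix[y][a], matrix[y][b]) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(features, 2)
    )


matrix = [
    [1, 4, 2],
    [3, 6, 4],      # row 0 shifted by +2
    [10, 13, 12],   # shifted by +9, except the last feature
]
print(is_delta_pcluster(matrix, [0, 1], [0, 1, 2], delta=0.5))     # True
print(is_delta_pcluster(matrix, [0, 1, 2], [0, 1, 2], delta=0.5))  # False
print(is_delta_pcluster(matrix, [0, 1, 2], [0, 1, 2], delta=1))    # True
```

Since the test quantifies over all 2 by 2 submatrices, any subset of a δ-pCluster passes as well, which is the downward closure property.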
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Why Constraint-Based Cluster Analysis?
Need user feedback: Users know their applications the best
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacle & desired clusters
Customer segmentation
Medical analysis
Drawbacks
most tests are for a single attribute
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the
objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
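The DB(p, D) definition can be checked with a plain double loop, sketched below. This is only the naive O(N^2) form of the idea; it does not reproduce the block-wise nested-loop, index-based, or cell-based algorithms. Euclidean distance is assumed, and the function name and sample data are illustrative.

```python
import math


def db_outliers(T, p, D):
    """Objects O in T such that at least a fraction p of T lies farther than D from O."""
    outliers = []
    for o in T:
        far = sum(1 for other in T if math.dist(o, other) > D)
        if far >= p * len(T):
            outliers.append(o)
    return outliers


T = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(T, p=0.8, D=3.0))  # [(10, 10)]
```

Here four of the five objects (a fraction of 0.8) lie farther than 3 from (10, 10), so it qualifies; each of the other objects has only one object that far away.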