Clustering 1
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary
What is Cluster Analysis?
Finding groups of objects such that the objects in a group are
similar (or related) to one another and different from (or
unrelated to) the objects in other groups.
[Figure: intra-cluster distances are minimized; inter-cluster
distances are maximized.]
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Rich Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature
spaces
Detect spatial clusters and use them in other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Applications of Cluster Analysis
Understanding
Group related documents; e.g., the discovered cluster of stocks
below all belong to the industry group Technology1-DOWN:
[Table: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,
Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN,
INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down,
Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN,
Sun-DOWN]
Summarization
Reduce the size of large data sets
[Figure: clustering precipitation in Australia.]
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters with
high intra-class similarity and low inter-class similarity. The
quality of a clustering result depends on both the similarity
measure used by the method and its implementation, and on the
method's ability to discover some or all of the hidden patterns.
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input
parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary
Data Structures
Data matrix (two modes): n objects described by p variables, an
n-by-p matrix:

    x11  ...  x1f  ...  x1p
    ...  ...  ...  ...  ...
    xi1  ...  xif  ...  xip
    ...  ...  ...  ...  ...
    xn1  ...  xnf  ...  xnp

Dissimilarity matrix (one mode): pairwise dissimilarities d(i, j);
only the lower triangle needs to be stored:

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  ...  0
Type of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued variables
Standardize the data. First calculate the mean of each variable f:
    mf = (x1f + x2f + ... + xnf) / n
then the mean absolute deviation:
    sf = (|x1f - mf| + |x2f - mf| + ... + |xnf - mf|) / n
and finally the standardized measurement (z-score):
    zif = (xif - mf) / sf
Using the mean absolute deviation is more robust to outliers than
using the standard deviation.
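As a quick illustration, the standardization above can be sketched in Python (the function name and sample values are illustrative, not from the slides):

```python
def standardize(values):
    """Standardize one interval-scaled variable f: compute the mean
    m_f, the mean absolute deviation s_f, and return the z-scores
    z_if = (x_if - m_f) / s_f."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]

# Example: a variable measured as 2, 4, 6 has m_f = 4, s_f = 4/3,
# giving z-scores of roughly -1.5, 0.0, 1.5
print(standardize([2, 4, 6]))
```

Because the deviations are not squared, a single extreme value inflates sf less than it would inflate a standard deviation, which is exactly the robustness the slide mentions.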
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:
    d(i, j) = sqrt( |xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2 )
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
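A small sketch tying the distance formula to the earlier data structures: it computes Minkowski distances (q = 2 gives the Euclidean case) and fills a one-mode dissimilarity matrix from a two-mode data matrix. Function names and sample points are illustrative:

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two objects; q = 1 is Manhattan,
    q = 2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def dissimilarity_matrix(data, q=2):
    """Lower-triangular (one-mode) dissimilarity matrix for an
    n-by-p data matrix."""
    return [[minkowski(data[i], data[j], q) for j in range(i)] + [0.0]
            for i in range(len(data))]

pts = [(0, 0), (3, 4), (6, 8)]
print(dissimilarity_matrix(pts))  # [[0.0], [5.0, 0.0], [10.0, 5.0, 0.0]]
```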
Binary Variables
Symmetric binary variable: both states (0 and 1) are equally
valuable, so we cannot decide which outcome should be coded 0 and
which should be coded 1.
Example: marital status of a person, "married" or "unmarried".
Asymmetric binary variable: the outcomes of the states are not
equally important. A typical example is the presence or absence of
a relatively rare attribute.
Example: a person is "handicapped" or "not handicapped".
Binary Variables
A contingency table for binary data (objects i and j):

              Object j
              1     0     sum
Object i  1   a     b     a+b
          0   c     d     c+d
        sum   a+c   b+d   p

Example:

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
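For the table above, a common choice is the Jaccard-style distance for asymmetric binary variables, d = (b + c) / (a + b + c), which ignores the 0/0 matches. A sketch (coding Y/P as 1 and N as 0, and leaving the symmetric attribute Gender out):

```python
def asymmetric_binary_distance(x, y):
    """d(i, j) = (b + c) / (a + b + c): a = 1/1 matches, b = 1s only
    in x, c = 1s only in y; the 0/0 matches (d) are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1, N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```

Jack and Mary come out most similar, which is intuitive: they differ only in Test-3.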
Example (nominal variable): four objects with one nominal variable
Test-1, so p = 1 and d(i, j) = 0 if the values match, 1 otherwise.

Object  Test-1
1       A
2       B
3       C
4       A

Dissimilarity matrix:

    0
    1  0
    1  1  0
    0  1  1  0
Ordinal Variables
An ordinal variable can be discrete or continuous; the order of its
values is important (e.g., rank). It can be treated like an
interval-scaled variable: replace each value xif by its rank
rif in {1, ..., Mf}, map the rank onto [0, 1] by
    zif = (rif - 1) / (Mf - 1)
and then compute dissimilarities with the methods for
interval-scaled variables.
Example (ordinal variable Test-2, with Fair < Good < Excellent,
so M = 3):

Object  Test-2      Rank  z = (r-1)/(M-1)
1       Excellent   3     1.0
2       Fair        1     0.0
3       Good        2     0.5
4       Excellent   3     1.0

Dissimilarity matrix:

    0
    1.0  0
    0.5  0.5  0
    0    1.0  0.5  0
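The rank-normalization step can be sketched as follows (function and variable names are illustrative):

```python
def ordinal_to_interval(values, order):
    """Replace each ordinal value by its rank r in 1..M (per the
    given low-to-high order), then map it to [0, 1] via
    (r - 1) / (M - 1)."""
    rank = {state: i + 1 for i, state in enumerate(order)}
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]

test2 = ["Excellent", "Fair", "Good", "Excellent"]
z = ordinal_to_interval(test2, ["Fair", "Good", "Excellent"])
print(z)                 # [1.0, 0.0, 0.5, 1.0]
print(abs(z[2] - z[0]))  # 0.5, the d(3, 1) entry of the matrix
```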
Ratio-Scaled Variables
Ratio-scaled variables follow a nonlinear scale; a common treatment
is to apply a logarithmic transform yif = log(xif) and then handle
the result as interval-scaled. Dissimilarity matrix for the four
objects (after the log transform):

    0
    1.31  0
    0.44  0.87  0
    0.43  1.74  0.87  0
Vector Objects
Vector objects arise, e.g., as keyword vectors for documents or
gene features in micro-arrays. A common measure is the cosine
similarity:
    s(x, y) = (x . y) / (|x| |y|)
where x . y is the inner product and |x| is the length of vector x.
Let x = (1, 1, 0, 0) and y = (0, 1, 1, 0). Then
    s(x, y) = (0 + 1 + 0 + 0) / (sqrt(2) . sqrt(2)) = 1/2 = 0.5
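A minimal cosine-similarity sketch reproducing this calculation:

```python
import math

def cosine_similarity(x, y):
    """s(x, y) = x.y / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine_similarity([1, 1, 0, 0], [0, 1, 1, 0]))  # ~0.5
```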
Mixed Types
Combining the three per-variable dissimilarity matrices for the
four objects (nominal Test-1, ordinal Test-2, and the ratio-scaled
variable):

Nominal:        Ordinal:            Ratio-scaled:
0               0                   0
1  0            1.0  0              1.31  0
1  1  0         0.5  0.5  0         0.44  0.87  0
0  1  1  0      0    1.0  0.5  0    0.43  1.74  0.87  0

Normalize the ratio-scaled matrix by dividing each entry by the
largest value (1.74):

0
0.75  0
0.25  0.50  0
0.25  1.00  0.50  0

Combined matrix, averaging the three dissimilarities per pair,
e.g. D(2,1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92:

0
0.92  0
0.58  0.67  0
0.08  1.00  0.67  0
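With all indicator weights equal to 1, the combined dissimilarity is just the entry-wise average of the per-variable matrices. A sketch using the three lower-triangular matrices above (transcribed as nested lists):

```python
def combine(matrices):
    """Average several lower-triangular dissimilarity matrices
    entry by entry (all variables weighted equally)."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices)
             for j in range(i + 1)]
            for i in range(n)]

nominal = [[0], [1, 0], [1, 1, 0], [0, 1, 1, 0]]
ordinal = [[0], [1.0, 0], [0.5, 0.5, 0], [0, 1.0, 0.5, 0]]
ratio   = [[0], [0.75, 0], [0.25, 0.50, 0], [0.25, 1.00, 0.50, 0]]
d = combine([nominal, ordinal, ratio])
print(round(d[1][0], 2))  # 0.92, matching D(2,1) above
print(round(d[3][0], 2))  # 0.08
```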
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Typical Alternatives to Calculate the Distance
between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
Diameter: the mean squared distance between all pairs of points in
a cluster:
    Dm = [ Σ(i=1..N) Σ(j=1..N) (tip − tjq)^2 ] / ( N (N − 1) )
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n
objects into a set of k clusters that optimizes a chosen
partitioning criterion (e.g., minimizing the sum of squared
distances to the cluster representatives).
Given a k, find the partition of k clusters that optimizes the
criterion:
k-means: each cluster is represented by the center (mean) of the
cluster
k-medoids (PAM): each cluster is represented by one of the objects
in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm proceeds in four steps:
1. Partition the objects into k nonempty subsets (or pick k
   initial means).
2. Assign each object to the cluster whose mean (centroid) is
   nearest.
3. Recompute the mean of each cluster from its current members.
4. Repeat steps 2-3 until the assignments no longer change.
The K-Means Clustering Method: Example
[Figure: K = 2; arbitrarily choose K objects as the initial
cluster means; assign each object to the most similar center;
update the cluster means; then reassign objects and update again,
repeating until the clusters no longer change.]
Comments on the K-Means Method
Strength: relatively efficient, O(tkn), where n is the number of
objects, k the number of clusters, and t the number of iterations;
normally k, t << n.
Weaknesses: applicable only when a mean is defined (a problem for
categorical data); the number of clusters k must be specified in
advance; sensitive to noisy data and outliers; not suitable for
discovering clusters with non-convex shapes.
Variations of the K-Means Method
Most variants of k-means differ in the selection of the initial k
means, the dissimilarity calculations, and the strategies for
calculating cluster means (e.g., k-modes replaces means with modes
to handle categorical data).
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers: an object with an
extremely large value can substantially distort the distribution of
the data, pulling the cluster mean toward it. K-medoids addresses
this by using the medoid, the most centrally located object in a
cluster, as the representative instead of the mean.
[Figure: two clusterings of the same points, showing how an
outlier shifts the k-means centers.]
Example
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}
Assume the number of clusters k = 2.
Randomly assign means: m1 = 3, m2 = 4.
K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16
K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18
K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6
K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25
K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25} — stop (no change)
Alternative start: randomly assign the values to the two clusters
Number of cluster = 2, therefore
K1 = {2,10,3,30, 25}, Mean = 14
K2 = {4,12, 20, 11}, Mean = 11.75
Re-assign
K1 = {20, 30, 25}, Mean = 25
K2 = {2,4, 10, 12, 3, 11}, Mean= 7
Re-assign
K1 = {20, 30, 25}, Mean = 25
K2 = {2,4, 10, 12, 3, 11}, Mean= 7
So the final answer is K1={2,3,4,10,11,12},K2={20,30,25}
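The iterations above can be reproduced with a short 1-D k-means sketch (the function is illustrative, not from the slides; it assumes no cluster ever becomes empty, which holds for this data):

```python
def kmeans_1d(points, means, iters=100):
    """Plain 1-D k-means: assign each point to its nearest mean,
    recompute the means, and repeat until they stop changing."""
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            break
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [3, 4])
print(sorted(clusters[0]), means[0])  # [2, 3, 4, 10, 11, 12] 7.0
print(sorted(clusters[1]), means[1])  # [20, 25, 30] 25.0
```

Both initializations from the example converge to the same final clusters, but in general k-means can reach different local optima from different starting means.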
Use the k-means algorithm to create 3 clusters for the given set
of values:
{2, 3, 6, 8, 9, 12, 15, 18, 22}
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters. PAM
(Partitioning Around Medoids, 1987) starts from an initial set of
k medoids and iteratively replaces one of the medoids with a
non-medoid object whenever doing so improves the total distance of
the resulting clustering. PAM works effectively for small data
sets but does not scale well to large ones.
A Typical K-Medoids Algorithm (PAM)
[Figure: K = 2, total cost = 20; arbitrarily choose k objects as
initial medoids; assign each remaining object to the nearest
medoid; then loop: pick a non-medoid object O, compute the total
cost of swapping a medoid with O, and perform the swap if quality
is improved; repeat until no change.]
PAM (Partitioning Around Medoids) (1987)
[Figure: the four reassignment cases considered when computing the
cost of swapping current medoid i with candidate h — each
non-selected object j (or t) either stays with its current medoid,
moves to h, or moves to another current medoid t; the cost of the
swap is the sum of the resulting changes.]
Number x coordinate y coordinate
1 1.0 4.0
2 5.0 1.0
3 5.0 2.0
4 5.0 4.0
5 10.0 4.0
6 25.0 4.0
7 25.0 6.0
8 25.0 7.0
9 25.0 8.0
10 29.0 7.0
Objects 1 and 5 are the selected representative objects (medoids)
initially. The average dissimilarity of each object to its closest
representative is 9.37.
After swapping, objects 4 and 8 become the representative objects:

Object  Dissimilarity  Dissimilarity  Minimal        Closest
number  from object 4  from object 8  dissimilarity  representative
1       4.00           24.19          4.00           4
2       3.00           20.88          3.00           4
3       2.00           20.62          2.00           4
4       0.00           20.22          0.00           4
5       5.00           15.30          5.00           4
6       20.00          3.00           3.00           8
7       20.10          1.00           1.00           8
8       20.22          0.00           0.00           8
9       20.40          1.00           1.00           8
10      24.19          4.00           4.00           8

Average: 2.30
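The two averages (9.37 for medoids {1, 5} and 2.30 for {4, 8}) can be checked with a short sketch over the coordinate table; `math.dist` is the Euclidean distance:

```python
import math

# (x, y) coordinates from the table above
points = {1: (1, 4), 2: (5, 1), 3: (5, 2), 4: (5, 4), 5: (10, 4),
          6: (25, 4), 7: (25, 6), 8: (25, 7), 9: (25, 8), 10: (29, 7)}

def avg_dissimilarity(medoids):
    """Average distance from every object to its nearest medoid --
    the quantity PAM tries to reduce when it swaps medoids."""
    total = sum(min(math.dist(p, points[m]) for m in medoids)
                for p in points.values())
    return total / len(points)

print(round(avg_dissimilarity((1, 5)), 2))  # 9.37
print(round(avg_dissimilarity((4, 8)), 2))  # 2.3
```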
CLARA (Clustering Large Applications) (1990)
CLARA draws multiple samples of the data set, applies PAM on each
sample, and returns the best clustering found; this lets the
k-medoids approach scale to larger data sets than PAM alone.
Hierarchical Clustering: Basic Agglomerative Algorithm
1. Compute the proximity matrix (i.e., the distance matrix)
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
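The basic agglomerative steps above can be sketched as a naive single-link routine (the function name and the toy distances are illustrative):

```python
def agglomerative(dist, n):
    """Basic agglomerative loop: start from singleton clusters and
    repeatedly merge the two closest (single link) until one
    cluster remains. `dist` maps frozenset({i, j}) -> distance."""
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[frozenset({i, j})]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] | clusters[b]
        merges.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)  # the "matrix update" is implicit here
    return merges

# three 1-D points at 0, 1 and 5
toy = {frozenset({0, 1}): 1, frozenset({0, 2}): 5, frozenset({1, 2}): 4}
print(agglomerative(toy, 3))  # [([0, 1], 1), ([0, 1, 2], 4)]
```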
Single-linkage clustering:
    D(r, s) = min{ d(i, j) : object i is in cluster r and object j is in cluster s }
Complete-linkage clustering:
    D(r, s) = max{ d(i, j) : object i is in cluster r and object j is in cluster s }
Average-linkage clustering:
    D(r, s) = mean{ d(i, j) : object i is in cluster r and object j is in cluster s }
AGNES (Agglomerative Nesting)
Uses the single-link method and the dissimilarity matrix:
repeatedly merges the nodes with the least dissimilarity,
continuing until eventually all nodes belong to the same cluster.
[Figure: three scatter plots showing clusters being progressively
merged.]
Dendrogram: Shows How the Clusters are Merged
A dendrogram decomposes the data objects into a tree of nested
clusterings. A clustering of the objects is obtained by cutting
the dendrogram at the desired level: each connected component then
forms a cluster.
DIANA (Divisive Analysis)
The inverse order of AGNES: starts with all objects in one cluster
and repeatedly splits it, until eventually each node forms a
cluster of its own.
[Figure: three scatter plots showing one cluster being
progressively divided.]
Assume that the database D is given by the table below. Follow the
single-link technique to find clusters in D, using the Euclidean
distance measure.

         X     Y
    P1   0.40  0.53
    P2   0.22  0.38
    P3   0.35  0.32
    P4   0.26  0.19
    P5   0.08  0.41
    P6   0.45  0.30
Distance matrix (rounded to two decimals):

         p1    p2    p3    p4    p5    p6
    p1   0
    p2   0.24  0
    p3   0.22  0.15  0
    p4   0.37  0.20  0.15  0
    p5   0.34  0.14  0.28  0.29  0
    p6   0.23  0.25  0.11  0.22  0.39  0
Merge p3 and p6 (distance 0.11):

              p1    p2    (p3,p6)  p4    p5
    p1        0
    p2        0.24  0
    (p3,p6)   0.22  0.15  0
    p4        0.37  0.20  0.15     0
    p5        0.34  0.14  0.28     0.29  0

    dist( (p3,p6), p1 ) = MIN( dist(p3,p1), dist(p6,p1) )
                        = MIN( 0.22, 0.23 ) = 0.22
Merge p2 and p5 (distance 0.14):

              p1    (p2,p5)  (p3,p6)  p4
    p1        0
    (p2,p5)   0.24  0
    (p3,p6)   0.22  0.15     0
    p4        0.37  0.20     0.15     0

    dist( (p3,p6), (p2,p5) ) = MIN( dist(p3,p2), dist(p6,p2), dist(p3,p5), dist(p6,p5) )
                             = MIN( 0.15, 0.25, 0.28, 0.39 ) = 0.15
Merge (p2, p5) and (p3, p6) (distance 0.15):

                    p1    (p2,p5,p3,p6)  p4
    p1              0
    (p2,p5,p3,p6)   0.22  0
    p4              0.37  0.15           0

Finally, p4 joins at distance 0.15 and p1 joins at 0.22, leaving a
single cluster.
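The whole merge sequence can be replayed with a short single-link sketch over the distance-matrix values. Note that the two ties at 0.15 may merge in a different order than in the worked example, but the sequence of merge distances comes out the same:

```python
# pairwise distances, as in the matrix above
d = {("p1", "p2"): 0.24, ("p1", "p3"): 0.22, ("p1", "p4"): 0.37,
     ("p1", "p5"): 0.34, ("p1", "p6"): 0.23, ("p2", "p3"): 0.15,
     ("p2", "p4"): 0.20, ("p2", "p5"): 0.14, ("p2", "p6"): 0.25,
     ("p3", "p4"): 0.15, ("p3", "p5"): 0.28, ("p3", "p6"): 0.11,
     ("p4", "p5"): 0.29, ("p4", "p6"): 0.22, ("p5", "p6"): 0.39}

def single_link(ca, cb):
    """D(r, s) = MIN over all cross-cluster point pairs."""
    return min(d[min(i, j), max(i, j)] for i in ca for j in cb)

clusters = [["p1"], ["p2"], ["p3"], ["p4"], ["p5"], ["p6"]]
heights = []
while len(clusters) > 1:
    # find and merge the closest pair of clusters
    a, b = min(((x, y) for i, x in enumerate(clusters)
                for y in clusters[i + 1:]),
               key=lambda pair: single_link(*pair))
    clusters.remove(a)
    clusters.remove(b)
    clusters.append(a + b)
    heights.append(round(single_link(a, b), 2))
print(heights)  # [0.11, 0.14, 0.15, 0.15, 0.22]
```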
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What Is Outlier Discovery?
What are outliers? Objects that are considerably dissimilar from
the remainder of the data, e.g., exceptional transactions in
credit card usage. Problem: given n objects and a number k, find
the top k objects that are most dissimilar from the rest.
Applications include credit card and telecom fraud detection,
customer segmentation, and medical analysis.
Outlier Discovery: Statistical Approaches
Assume a model of the underlying distribution that generates the
data set (e.g., a normal distribution) and apply discordancy
tests. Such tests depend on the assumed data distribution, its
parameters (e.g., mean, variance), and the expected number of
outliers. Drawbacks: most tests apply to a single attribute, and
in many cases the true data distribution is unknown.
Density-Based Local Outlier Detection
Distance-based outlier detection is based on the global distance
distribution, so it has difficulty identifying outliers when the
data is not uniformly distributed.
Example: C1 contains 400 loosely distributed points, C2 contains
100 tightly condensed points, and there are two outlier points o1
and o2. A distance-based method cannot identify o2 as an outlier;
we need the concept of a local outlier.
Local outlier factor (LOF): assumes being an outlier is not crisp
but a matter of degree; each point receives an LOF score, and
points with high LOF are flagged as local outliers.
Outlier Discovery: Deviation-Based Approach
Identifies outliers by examining the main characteristics of the
objects in a group; objects that "deviate" from this description
are considered outliers.
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary
Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis