Hierarchical Clustering: DSCI 5240 Data Mining and Machine Learning For Business
Javier Rubio-Herrero
Hierarchical Clustering
Dendrogram
• Dendrogram: a graphical representation of the hierarchical structure of the clusters
• Clusters are obtained by “cutting” the dendrogram at a chosen distance (i.e., we specify that clusters more than x distance apart should not be combined)
[Figure: dendrogram with objects 1–5 on the horizontal axis (Object) and Distance on the vertical axis]
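Cutting a dendrogram can be sketched in a few lines of Python. The merge heights below come from the five-observation centroid example in this deck, with the final height (3.48) computed from the AB and CDE centroids. `clusters_at` is a hypothetical helper written for illustration, not a library function, and it assumes the merge heights never decrease.

```python
# Merge history as (height, left cluster, right cluster).
merges = [
    (1.00, "C", "D"),
    (2.00, "A", "B"),
    (3.04, "CD", "E"),
    (3.48, "AB", "CDE"),
]

def clusters_at(cut, labels, merges):
    """Return the clusters that exist when the dendrogram is cut at `cut`."""
    clusters = {label: {label} for label in labels}
    for height, left, right in sorted(merges):
        if height > cut:
            break  # merges above the cut line are not applied
        clusters[left + right] = clusters.pop(left) | clusters.pop(right)
    return sorted("".join(sorted(members)) for members in clusters.values())

for cut in (0.5, 2.5, 3.2, 4.0):
    print(cut, clusters_at(cut, "ABCDE", merges))
```

Raising the cut never increases the number of clusters; in this example a cut at 2.5 leaves AB, CD, and E.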
Reading a Dendrogram
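[Figure: example dendrogram. Annotations: cutting at distance 4.6 gives 2 clusters; cutting at distance 3.6 gives 6 clusters; one annotation marks the distance between objects 2 and 22.]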
Agglomerative Nesting (AGNES)
Agglomerative Example
Step 1: Each observation is a cluster
Obs   X   Y
A     1   1
B     3   1
C     2   4
D     2   5
E     5   4
[Figure: scatter plot of the five observations, X axis from 0.5 to 5.5, Y axis from 0 to 6]
Iteration 1
Agglomerative Example
Step 2: Compute distances between clusters (centroid method)
       A      B      C      D      E
A     0.00
B     2.00   0.00
C     3.16   3.16   0.00
D     4.12   4.12   1.00   0.00
E     5.00   3.61   3.00   3.16   0.00
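The matrix can be reproduced with a short script. This is a minimal sketch that assumes Euclidean distance, which is consistent with the values shown; the function name `lower_triangle` is illustrative. At this stage every cluster holds one observation, so each centroid is the observation itself.

```python
import math

# Each cluster holds one observation, so its centroid is that observation.
centroids = {"A": (1, 1), "B": (3, 1), "C": (2, 4), "D": (2, 5), "E": (5, 4)}

def lower_triangle(centroids):
    """Return {row name: distances to every earlier cluster and to itself}."""
    names = list(centroids)
    rows = {}
    for i, row_name in enumerate(names):
        rows[row_name] = [
            math.dist(centroids[row_name], centroids[col_name])
            for col_name in names[: i + 1]
        ]
    return rows

for name, row in lower_triangle(centroids).items():
    print(name, "  ".join(f"{d:.2f}" for d in row))
```

Only the lower triangle is computed because the distance from A to B equals the distance from B to A.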
Agglomerative Example
Step 3: Merge the two closest clusters
The smallest off-diagonal distance is 1.00 (C–D), so C and D are merged.
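Picking the pair to merge is an argmin over the off-diagonal entries. The sketch below types in the Iteration 1 distances by hand; the dictionary layout is an assumption made for illustration.

```python
# Off-diagonal entries of the Iteration 1 distance matrix (centroid method).
# B-E is entered as 3.61 = sqrt(13), the value the Iteration 2 matrix reports.
distances = {
    ("A", "B"): 2.00,
    ("A", "C"): 3.16, ("B", "C"): 3.16,
    ("A", "D"): 4.12, ("B", "D"): 4.12, ("C", "D"): 1.00,
    ("A", "E"): 5.00, ("B", "E"): 3.61, ("C", "E"): 3.00, ("D", "E"): 3.16,
}

# The pair with the smallest distance is the next merge.
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

If two pairs were tied, `min` would return whichever appears first in the dictionary, so a real implementation would need an explicit tie-breaking rule.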
Clusters after the merge (CD is represented by its centroid):
Obs   X   Y
A     1   1
B     3   1
CD    2   4.5
E     5   4
[Figure: scatter plot of the clusters, X vs. Y]
Iteration 2
Agglomerative Example
Step 2: Compute distances between clusters (centroid method)
       A      B      CD     E
A     0.00
B     2.00   0.00
CD    3.64   3.64   0.00
E     5.00   3.61   3.04   0.00
Agglomerative Example
Step 3: Merge the two closest clusters
The smallest off-diagonal distance is 2.00 (A–B), so A and B are merged.
Clusters after the merge (AB is represented by its centroid):
Obs   X   Y
AB    2   1
CD    2   4.5
E     5   4
[Figure: scatter plot of the clusters, X vs. Y]
Iteration 3
Agglomerative Example
Step 2: Compute distances between clusters (centroid method)
       AB     CD     E
AB    0.00
CD    3.50   0.00
E     4.24   3.04   0.00
Agglomerative Example
Step 3: Merge the two closest clusters
The smallest off-diagonal distance is 3.04 (CD–E), so CD and E are merged.
Clusters after the merge (CDE is represented by its centroid):
Obs   X   Y
AB    2   1
CDE   3   4.3
[Figure: scatter plot of the clusters, X vs. Y]
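The CDE centroid (3, 4.3) is the mean of the three member observations, not the midpoint between the CD centroid and E. Equivalently, the two centroids being merged are weighted by cluster size. A minimal sketch, with `merged_centroid` as an illustrative name:

```python
def merged_centroid(c1, n1, c2, n2):
    """Centroid of the union of two clusters, given each centroid and size."""
    total = n1 + n2
    return tuple((n1 * a + n2 * b) / total for a, b in zip(c1, c2))

# CD has centroid (2, 4.5) and two members; E is a single observation.
cde = merged_centroid((2, 4.5), 2, (5, 4), 1)
print(tuple(round(v, 2) for v in cde))

# An unweighted midpoint would ignore cluster sizes and land elsewhere.
midpoint = tuple((a + b) / 2 for a, b in zip((2, 4.5), (5, 4)))
print(midpoint)
```

The weighted result is about (3, 4.33), while the unweighted midpoint is (3.5, 4.25).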
Iteration 4
Agglomerative Example
Step 2: Compute distances between clusters (centroid method)
       AB     CDE
AB    0.00
CDE   3.48   0.00
Agglomerative Example
Step 3: Merge the two closest clusters
Only AB and CDE remain (distance 3.48), so they are merged.
All observations now form a single cluster:
Obs     X     Y
ABCDE   2.6   3.0
[Figure: scatter plot of the clusters, X vs. Y]
Recap
[Figure: scatter plot of observations A–E (X vs. Y) next to the resulting dendrogram (Object vs. Distance)]
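The four iterations can be replayed with one loop. This is a sketch of agglomerative clustering with the centroid method under the assumption of Euclidean distance; the function names are illustrative, and libraries such as SciPy (`scipy.cluster.hierarchy.linkage`) provide production implementations.

```python
import math
from itertools import combinations

def centroid(members):
    """Mean of the member observations, coordinate by coordinate."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def agnes_centroid(observations):
    """Return the merge history [(left, right, distance), ...]."""
    clusters = {name: [point] for name, point in observations.items()}
    history = []
    while len(clusters) > 1:
        # Step 2: compare every pair of clusters by centroid distance.
        # Step 3: the closest pair is merged.
        left, right = sorted(min(
            combinations(clusters, 2),
            key=lambda pair: math.dist(
                centroid(clusters[pair[0]]), centroid(clusters[pair[1]])
            ),
        ))
        d = math.dist(centroid(clusters[left]), centroid(clusters[right]))
        history.append((left, right, round(d, 2)))
        clusters[left + right] = clusters.pop(left) + clusters.pop(right)
    return history

observations = {"A": (1, 1), "B": (3, 1), "C": (2, 4), "D": (2, 5), "E": (5, 4)}
for step in agnes_centroid(observations):
    print(step)
```

The merge distances in the history are the heights at which the corresponding branches join in the dendrogram.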
• Which point within a cluster should represent it in distance calculations?
• There are several options, and they can produce different cluster structures
Calculating Distances
• MIN (Single Linkage) – Distance between two clusters is the distance between the two closest points in the (different) clusters
• MAX (Complete Linkage) – Distance between two clusters is the distance between the two farthest points in the (different) clusters
• Group Average (Average Linkage) – Distance between two clusters is the average pairwise distance between points in the (different) clusters
• Distance Between Centers – Distance between two clusters is the distance between the cluster centroids
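The four definitions can be compared on the same pair of clusters. The sketch below uses clusters AB and CDE from the worked example and assumes Euclidean distance; `linkage_distances` is an illustrative name.

```python
import math
from itertools import product

def centroid(members):
    """Mean of the member observations, coordinate by coordinate."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def linkage_distances(c1, c2):
    """Distance between two clusters under each of the four definitions."""
    # Every point in c1 paired with every point in c2.
    pairwise = [math.dist(p, q) for p, q in product(c1, c2)]
    return {
        "MIN (single)": min(pairwise),
        "MAX (complete)": max(pairwise),
        "Group average": sum(pairwise) / len(pairwise),
        "Between centers": math.dist(centroid(c1), centroid(c2)),
    }

ab = [(1, 1), (3, 1)]
cde = [(2, 4), (2, 5), (5, 4)]
for name, value in linkage_distances(ab, cde).items():
    print(f"{name}: {value:.2f}")
```

For this pair the four definitions give 3.16, 5.00, 3.86, and 3.48 respectively, so the choice of definition can change which clusters look closest.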
MIN
• MIN is better at handling non-elliptical shapes
• It will likely result in cleaner (more interpretable) clusters
MAX
• MAX has a tendency to “jump gaps”
• It often breaks large clusters
• Its results would be less interpretable
MIN
• MIN is more sensitive to noise and outliers
• In this instance, the clusters would be difficult to interpret
MAX
• MAX manages noise and outliers better
• It produces better clusters in this situation
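The effect of noise on MIN and MAX can be seen on a small made-up data set. The sketch below uses one-dimensional points for readability; the data and the `agglomerate` helper are assumptions for illustration, not the data behind the slides.

```python
from itertools import combinations

def agglomerate(points, linkage, k):
    """Merge 1-D points bottom-up until k clusters remain.

    `linkage` reduces the pairwise distances between two clusters to one
    number: the built-in `min` gives MIN (single linkage), `max` gives MAX.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        absorbed = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(absorbed)
    return sorted(sorted(c) for c in clusters)

# Made-up data: a left group, two bridging noise points (3.4 and 4.9),
# a right group, and one far outlier (12.0).
data = [0.0, 1.0, 2.1, 3.4, 4.9, 6.5, 7.4, 8.2, 12.0]
print("MIN:", agglomerate(data, min, 2))
print("MAX:", agglomerate(data, max, 2))
```

With this data MIN chains through the noise points and leaves the outlier as its own cluster (sizes 8 and 1), while MAX splits the points into clusters of four and five.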
Validating Clusters
• Goal is to obtain meaningful/useful clusters
• Random chance can produce apparent clusters
• Different clustering methods produce different results
• Hence, we need to
  • Obtain summary statistics
  • Review clusters with respect to variables not used in clustering
  • Label clusters
• Look for
  • Stability – low sensitivity to minor input changes
  • Separation – ratio of intra-cluster to inter-cluster variation (lower means better-separated clusters)
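One way to put a number on separation is sketched below. It assumes that "variation" means sum of squared Euclidean distances: within-cluster variation is measured to each cluster's centroid, and between-cluster variation from each centroid to the overall mean, weighted by cluster size. Other definitions are possible, and `separation_ratio` is an illustrative name.

```python
def centroid(members):
    """Mean of the member observations, coordinate by coordinate."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def separation_ratio(clusters):
    """Within-cluster sum of squares divided by between-cluster sum of squares."""
    everything = [p for members in clusters for p in members]
    grand_mean = centroid(everything)
    within = sum(sq_dist(p, centroid(m)) for m in clusters for p in m)
    between = sum(len(m) * sq_dist(centroid(m), grand_mean) for m in clusters)
    return within / between

two = [[(1, 1), (3, 1)], [(2, 4), (2, 5), (5, 4)]]      # AB | CDE
three = [[(1, 1), (3, 1)], [(2, 4), (2, 5)], [(5, 4)]]  # AB | CD | E
print(round(separation_ratio(two), 2), round(separation_ratio(three), 2))
```

The ratio tends to shrink as the number of clusters grows, so it is most informative when comparing solutions with the same number of clusters.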