

[Figure 10.10 appears here: left panel, a dendrogram of the nine observations with fusion heights from 0.0 to 3.0 on the vertical axis; right panel, the nine observations plotted against X1 and X2.]

FIGURE 10.10. An illustration of how to properly interpret a dendrogram with
nine observations in two-dimensional space. Left: a dendrogram generated using
Euclidean distance and complete linkage. Observations 5 and 7 are quite similar
to each other, as are observations 1 and 6. However, observation 9 is no more
similar to observation 2 than it is to observations 8, 5, and 7, even though obser-
vations 9 and 2 are close together in terms of horizontal distance. This is because
observations 2, 8, 5, and 7 all fuse with observation 9 at the same height, approx-
imately 1.8. Right: the raw data used to generate the dendrogram can be used to
confirm that indeed, observation 9 is no more similar to observation 2 than it is
to observations 8, 5, and 7.

Now that we understand how to interpret the left-hand panel of Figure 10.9,
we can move on to the issue of identifying clusters on the basis
of a dendrogram. In order to do this, we make a horizontal cut across the
dendrogram, as shown in the center and right-hand panels of Figure 10.9.
The distinct sets of observations beneath the cut can be interpreted as clus-
ters. In the center panel of Figure 10.9, cutting the dendrogram at a height
of nine results in two clusters, shown in distinct colors. In the right-hand
panel, cutting the dendrogram at a height of five results in three clusters.
Further cuts can be made as one descends the dendrogram in order to ob-
tain any number of clusters, between 1 (corresponding to no cut) and n
(corresponding to a cut at height 0, so that each observation is in its own
cluster). In other words, the height of the cut to the dendrogram serves
the same role as the K in K-means clustering: it controls the number of
clusters obtained.
Figure 10.9 therefore highlights a very attractive aspect of hierarchical
clustering: one single dendrogram can be used to obtain any number of
clusters. In practice, people often look at the dendrogram and select by eye
a sensible number of clusters, based on the heights of the fusions and the
number of clusters desired. In the case of Figure 10.9, one might choose to
select either two or three clusters. However, often the choice of where to
cut the dendrogram is not so clear.
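To make the height-of-cut idea concrete, here is a minimal sketch using SciPy's hierarchical clustering routines. The data matrix X and the cut heights are placeholders standing in for the observations of Figure 10.8, so the number of clusters returned at these particular heights will generally differ from the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder data standing in for the observations of Figure 10.8.
rng = np.random.default_rng(0)
X = rng.normal(size=(45, 2))

# Complete linkage with Euclidean distance, as in Figure 10.9.
Z = linkage(X, method="complete", metric="euclidean")

# Cutting the dendrogram at a given height plays the role of K in K-means:
# every group of branches hanging entirely below the cut becomes one cluster.
labels_high_cut = fcluster(Z, t=9, criterion="distance")  # cut at height 9
labels_low_cut = fcluster(Z, t=5, criterion="distance")   # cut at height 5

print(np.unique(labels_high_cut).size, np.unique(labels_low_cut).size)
```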

[Figure 10.9 appears here: three dendrogram panels, each with fusion heights from 0 to 10 on the vertical axis.]
FIGURE 10.9. Left: dendrogram obtained from hierarchically clustering the data
from Figure 10.8 with complete linkage and Euclidean distance. Center: the den-
drogram from the left-hand panel, cut at a height of nine (indicated by the dashed
line). This cut results in two distinct clusters, shown in different colors. Right:
the dendrogram from the left-hand panel, now cut at a height of five. This cut
results in three distinct clusters, shown in different colors. Note that the colors
were not used in clustering, but are simply used for display purposes in this figure.

The height at which two observations are first fused, as measured on the vertical
axis, indicates how different the two observations are. Thus, observations that fuse at the very
bottom of the tree are quite similar to each other, whereas observations
that fuse close to the top of the tree will tend to be quite different.
This highlights a very important point in interpreting dendrograms that
is often misunderstood. Consider the left-hand panel of Figure 10.10, which
shows a simple dendrogram obtained from hierarchically clustering nine
observations. One can see that observations 5 and 7 are quite similar to
each other, since they fuse at the lowest point on the dendrogram. Obser-
vations 1 and 6 are also quite similar to each other. However, it is tempting
but incorrect to conclude from the figure that observations 9 and 2 are
quite similar to each other on the basis that they are located near each
other on the dendrogram. In fact, based on the information contained in
the dendrogram, observation 9 is no more similar to observation 2 than it
is to observations 8, 5, and 7. (This can be seen from the right-hand panel
of Figure 10.10, in which the raw data are displayed.) To put it mathe-
matically, there are $2^{n-1}$ possible reorderings of the dendrogram, where n
is the number of leaves. This is because at each of the n − 1 points where
fusions occur, the positions of the two fused branches could be swapped
without affecting the meaning of the dendrogram. Therefore, we cannot
draw conclusions about the similarity of two observations based on their
proximity along the horizontal axis. Rather, we draw conclusions about
the similarity of two observations based on the location on the vertical axis
where branches containing those two observations first are fused.
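The height at which two observations are first fused is exactly their cophenetic distance, so the point about reading similarity off the vertical axis can be checked numerically. The sketch below assumes a placeholder 9 x 2 array X in place of the nine observations of Figure 10.10.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Placeholder for the nine observations of Figure 10.10.
rng = np.random.default_rng(1)
X = rng.normal(size=(9, 2))

Z = linkage(X, method="complete", metric="euclidean")

# cophenet(Z) returns, in condensed form, the height at which each pair of
# observations is first merged into a common cluster.
coph = squareform(cophenet(Z))

# Pairs that fuse near the bottom of the tree have small cophenetic
# distance; pairs that first meet near the top have large cophenetic
# distance, regardless of how close their leaves sit horizontally.
i, j = 0, 1  # an arbitrary pair of (0-based) observation indices
print(coph[i, j])
```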

[Figure 10.11 appears here: four scatter-plot panels of the nine observations on axes X1 and X2, tracing the first fusions of the algorithm.]

FIGURE 10.11. An illustration of the first few steps of the hierarchical
clustering algorithm, using the data from Figure 10.10, with complete linkage
and Euclidean distance. Top Left: initially, there are nine distinct clusters,
{1}, {2}, . . . , {9}. Top Right: the two clusters that are closest together, {5} and
{7}, are fused into a single cluster. Bottom Left: the two clusters that are closest
together, {6} and {1}, are fused into a single cluster. Bottom Right: the two clus-
ters that are closest together using complete linkage, {8} and the cluster {5, 7},
are fused into a single cluster.
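The fusions drawn in Figure 10.11 correspond to the first rows of the merge table produced by an agglomerative routine. As a rough illustration (again with placeholder data rather than the figure's actual coordinates), the first few rows can be inspected directly:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(9, 2))  # stand-in for the nine observations

Z = linkage(X, method="complete", metric="euclidean")

# Each row of Z records one fusion: the indices of the two clusters that
# merge, the height (inter-cluster dissimilarity) at which they merge, and
# the size of the newly formed cluster. Indices >= 9 refer to clusters
# created by earlier rows.
for step, (a, b, height, size) in enumerate(Z[:3], start=1):
    print(f"step {step}: fuse {int(a)} and {int(b)} at height {height:.3f} "
          f"(new cluster of size {int(size)})")
```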


Choice of Dissimilarity Measure


Thus far, the examples in this chapter have used Euclidean distance as the
dissimilarity measure. But sometimes other dissimilarity measures might
be preferred. For example, correlation-based distance considers two obser-
vations to be similar if their features are highly correlated, even though the
observed values may be far apart in terms of Euclidean distance.
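As a concrete sketch of the alternative, the snippet below computes both Euclidean and correlation-based dissimilarities for the same placeholder data and passes the latter to hierarchical clustering; correlation-based distance is most natural when each observation has many features.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Placeholder data: 20 observations measured on 50 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 50))

# Euclidean distance between observation vectors.
d_euclidean = pdist(X, metric="euclidean")

# Correlation-based distance: one minus the correlation between each pair
# of observation vectors, so observations with highly correlated feature
# profiles are "close" even if their raw values are far apart.
d_correlation = pdist(X, metric="correlation")

# Either condensed dissimilarity vector can be handed directly to linkage().
Z = linkage(d_correlation, method="complete")
```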

Algorithm 10.2 Hierarchical Clustering


1. Begin with n observations and a measure (such as Euclidean distance)
of all the $\binom{n}{2} = n(n-1)/2$ pairwise dissimilarities. Treat each
observation as its own cluster.
2. For i = n, n − 1, . . . , 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i
clusters and identify the pair of clusters that are least dissimilar
(that is, most similar). Fuse these two clusters. The dissimilarity
between these two clusters indicates the height in the dendro-
gram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among
the i − 1 remaining clusters.
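A minimal from-scratch sketch of Algorithm 10.2 with complete linkage is given below, assuming the observations are supplied as a NumPy array X. It mirrors the two steps above for clarity rather than efficiency; in practice a library routine such as scipy.cluster.hierarchy.linkage would be used.

```python
import numpy as np
from itertools import combinations

def hierarchical_complete(X):
    """Greedy agglomerative clustering (Algorithm 10.2) with complete linkage."""
    n = X.shape[0]
    # Step 1: the n(n-1)/2 pairwise Euclidean dissimilarities; each
    # observation starts out as its own cluster.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = {i: [i] for i in range(n)}
    merges = []  # (cluster a, cluster b, fusion height)

    # Step 2: for i = n, n-1, ..., 2, fuse the least dissimilar pair.
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Complete linkage: the largest pairwise dissimilarity between
            # the observations in cluster a and the observations in cluster b.
            link = max(d[i, j] for i in clusters[a] for j in clusters[b])
            if best is None or link < best[0]:
                best = (link, a, b)
        height, a, b = best
        merges.append((a, b, height))            # height at which the fusion is drawn
        clusters[a] = clusters[a] + clusters[b]  # fuse cluster b into cluster a
        del clusters[b]
    return merges

# Example: hierarchical_complete(np.random.default_rng(0).normal(size=(9, 2)))
```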

Linkage     Description

Complete    Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single      Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.

Average     Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid    Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.

TABLE 10.2. A summary of the four most commonly-used types of linkage in
hierarchical clustering.
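One way to see how the four linkages in Table 10.2 behave is to run the same placeholder data through each of SciPy's corresponding methods; note that SciPy's centroid linkage is only defined for raw observations with Euclidean distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))  # placeholder observations

# The four linkages of Table 10.2, under SciPy's names.
for method in ("complete", "single", "average", "centroid"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=3, criterion="maxclust")  # ask for three clusters
    print(method, np.bincount(labels)[1:])  # cluster sizes under each linkage
```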

Average and complete linkage are generally preferred over single linkage, as they tend to yield
more balanced dendrograms. Centroid linkage is often used in genomics,
but suffers from a major drawback in that an inversion can occur, whereby
two clusters are fused at a height below either of the individual clusters in
the dendrogram. This can lead to difficulties in visualization as well as in
interpretation of the dendrogram. The dissimilarities computed in Step 2(b)
of the hierarchical clustering algorithm will depend on the type of linkage
used, as well as on the choice of dissimilarity measure. Hence, the resulting
dendrogram typically depends quite strongly on the type of linkage used,
as is shown in Figure 10.12.
