One common initialization strategy selects each successive medoid as the point that has the highest sum of similarities to all non-medoid points, excluding those points that are more similar to one of the currently chosen initial medoids.
While the K-medoids algorithm is relatively simple, it should be clear that it is expensive compared to K-means. More recent improvements of the K-medoids algorithm are more efficient than the basic algorithm, but they are still relatively expensive and will not be discussed here. Relevant references may be found in the bibliographic remarks.
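The selection heuristic just described is easy to express in code. The following is a minimal sketch, assuming a precomputed pairwise similarity matrix; the function pick_initial_medoids and its interface are illustrative, not a standard library routine.

import numpy as np

def pick_initial_medoids(sim, k):
    # Greedily select k initial medoids from an (n, n) similarity matrix.
    # Each new medoid is the non-medoid point with the highest sum of
    # similarities to the remaining non-medoid points; points that are
    # already more similar to a chosen medoid are excluded from the sum.
    n = sim.shape[0]
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        best, best_score = None, -np.inf
        for c in candidates:
            score = 0.0
            for j in candidates:
                if j == c:
                    continue
                # Skip points more similar to an existing medoid than to c.
                if medoids and max(sim[j, m] for m in medoids) > sim[j, c]:
                    continue
                score += sim[j, c]
            if score > best_score:
                best, best_score = c, score
        medoids.append(best)
    return medoids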
7 Hierarchical Clustering

Sample Data
In the examples that follow, we shall use the following data, which consists of six two-dimensional points, to illustrate the behavior of the various hierarchical clustering algorithms. The x and y coordinates of the points and the distances between them are shown, respectively, in Tables 5.6 and 5.7. The points themselves are shown in Figure 5.24.
Table 5.6. x and y coordinates of the six sample points.

point    x coordinate    y coordinate
p1       0.4005          0.5306
p2       0.2148          0.3854
p3       0.3457          0.3156
p4       0.2652          0.1875
p5       0.0789          0.4139
p6       0.4548          0.3022
Table 5.7. Distance matrix for the six sample points.

        p1      p2      p3      p4      p5      p6
p1    0.0000  0.2357  0.2218  0.3688  0.3421  0.2347
p2    0.2357  0.0000  0.1483  0.2042  0.1388  0.2540
p3    0.2218  0.1483  0.0000  0.1513  0.2843  0.1100
p4    0.3688  0.2042  0.1513  0.0000  0.2932  0.2216
p5    0.3421  0.1388  0.2843  0.2932  0.0000  0.3921
p6    0.2347  0.2540  0.1100  0.2216  0.3921  0.0000

[Figure 5.24. Set of Six Two-dimensional Points.]
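The entries of Table 5.7 can be reproduced from the coordinates in Table 5.6; the following minimal sketch assumes the distances are Euclidean, which matches the tabulated values.

import numpy as np

# x and y coordinates of the six points from Table 5.6.
points = np.array([
    [0.4005, 0.5306],  # p1
    [0.2148, 0.3854],  # p2
    [0.3457, 0.3156],  # p3
    [0.2652, 0.1875],  # p4
    [0.0789, 0.4139],  # p5
    [0.4548, 0.3022],  # p6
])

# Pairwise Euclidean distances; for example, dist[0, 1] is about 0.2357,
# matching the p1-p2 entry of Table 5.7.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 4))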
Figure 5.25. Minimum Spanning Tree for Set of Six Two-dimensional Points.
A minimum spanning tree can be built by starting with a tree that consists of any single point; in successive steps, we look for the closest pair of points (p, q) such that one point, p, is in the current tree and one, q, is not. We add q to the tree and put an edge between p and q. Figure 5.25 shows the MST for the points in Figure 5.24.
The MST divisive hierarchical algorithm is shown below. This approach is the divisive version of
the single link agglomerative technique that we will see shortly. Indeed, the hierarchical clustering
produced by MST is the same as that produced by single link. See Figure 5.27.
Algorithm 5 MST Divisive Hierarchical Clustering Algorithm
1: Compute a minimum spanning tree for the proximity graph of the points.
2: repeat
3:    Find the remaining edge with the largest distance (smallest similarity).
4:    Create a new cluster by breaking (removing) that edge.
5: until Only singleton clusters remain
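A minimal Python sketch of the whole procedure follows, assuming the dist matrix computed earlier; the helper names (mst_edges, components, mst_divisive) are ours, chosen for illustration.

def mst_edges(dist):
    # Grow a minimum spanning tree: repeatedly attach the closest point q
    # not yet in the tree to some point p that is already in it.
    n = dist.shape[0]
    in_tree = {0}  # start from an arbitrary point
    edges = []
    while len(in_tree) < n:
        p, q = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: dist[e])
        edges.append((p, q, dist[p, q]))
        in_tree.add(q)
    return edges

def components(n, edges):
    # Cluster label for each point, given the remaining MST edges.
    label = list(range(n))
    for p, q, _ in edges:
        old, new = label[q], label[p]
        label = [new if lab == old else lab for lab in label]
    return label

def mst_divisive(dist):
    # Algorithm 5: break the remaining edge with the largest distance at
    # each step until only singleton clusters remain.
    n = dist.shape[0]
    edges = sorted(mst_edges(dist), key=lambda e: e[2])  # ascending by distance
    while edges:
        edges.pop()  # remove the largest remaining edge
        print(components(n, edges))  # current cluster label of each point

On the sample data, the first split removes the largest MST edge (between p1 and p3, distance 0.2218) and separates p1 from the other five points, mirroring the fact that single link merges p1 last.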
[Figure: (a) MIN and (b) MAX clusterings of the six two-dimensional points.]
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters:

$$\operatorname{proximity}(\text{Cluster}_1, \text{Cluster}_2) = \frac{\sum_{p_1 \in \text{Cluster}_1} \sum_{p_2 \in \text{Cluster}_2} \operatorname{proximity}(p_1, p_2)}{\operatorname{size}(\text{Cluster}_1) \times \operatorname{size}(\text{Cluster}_2)} \qquad (5.17)$$
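In code, Equation 5.17 is a straightforward average over all inter-cluster pairs. A minimal sketch, assuming the dist matrix from the earlier sketch:

def group_average_proximity(dist, cluster1, cluster2):
    # Equation 5.17: average pairwise proximity between two clusters.
    total = sum(dist[p1, p2] for p1 in cluster1 for p2 in cluster2)
    return total / (len(cluster1) * len(cluster2))

# With 0-based indices (p1 is 0, ..., p6 is 5), the proximity of {p3, p6}
# to {p4} is (0.1513 + 0.2216) / 2, or about 0.1865.
print(group_average_proximity(dist, [2, 5], [3]))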
Table 5.8. Table of Lance-Williams Coefficients for Common Hierarchical Clustering Approaches

Clustering Method   αA                    αB                    β                   γ
MIN                 1/2                   1/2                   0                   -1/2
MAX                 1/2                   1/2                   0                   1/2
Group Average       nA/(nA+nB)            nB/(nA+nB)            0                   0
Centroid            nA/(nA+nB)            nB/(nA+nB)            -nA·nB/(nA+nB)²     0
Ward's              (nA+nQ)/(nA+nB+nQ)    (nB+nQ)/(nA+nB+nQ)    -nQ/(nA+nB+nQ)      0
If clusters A and B are merged to form a new cluster R, then the proximity of the new cluster, R, to an existing cluster, Q, is a linear function of the proximities of Q to the original clusters A and B. Table 5.8 shows the values of these coefficients for the techniques that we have discussed; nA, nB, and nQ are the number of points in clusters A, B, and Q, respectively.

$$p(R, Q) = \alpha_A\, p(A, Q) + \alpha_B\, p(B, Q) + \beta\, p(A, B) + \gamma\, |p(A, Q) - p(B, Q)| \qquad (5.18)$$
Any hierarchical technique that can be phrased in this way does not need the original points,
only the proximity matrix, which is updated as clustering occurs. However, while a general formula
is nice, especially for implementation, it is often easier to understand the different hierarchical
methods by looking directly at the definition of cluster proximity that each method uses, which was
the approach taken in our previous discussion.
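To make the update concrete, here is a minimal sketch of a single Lance-Williams step; the function name is illustrative, and the worked numbers come from Table 5.7.

def lance_williams(pAQ, pBQ, pAB, alphaA, alphaB, beta, gamma):
    # Equation 5.18: proximity of the merged cluster R = A u B to cluster Q.
    return alphaA * pAQ + alphaB * pBQ + beta * pAB + gamma * abs(pAQ - pBQ)

# Merge A = {p3} and B = {p6} and update their proximity to Q = {p2},
# using the MIN (single link) coefficients from Table 5.8.  With
# gamma = -1/2 the update reduces to min(pAQ, pBQ).
p = lance_williams(pAQ=0.1483, pBQ=0.2540, pAB=0.1100,
                   alphaA=0.5, alphaB=0.5, beta=0.0, gamma=-0.5)
print(p)  # 0.1483 = min(0.1483, 0.2540)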
5.8.1 DBSCAN
DBSCAN is a density-based clustering algorithm that works with a number of different distance
metrics. After DBSCAN has processed a set of data points, a point will either be in a cluster or will
be classified as a noise point. Furthermore, DBSCAN also makes a distinction between the points in
clusters, classifying some as core points, i.e., points in the interior of a cluster, and some as border
points, i.e., points on the edge of a cluster. Informally, any two core points that are close enough
are put in the same cluster. Likewise, any border point that is close enough to a core point is put
in the same cluster as the core point. Noise points are discarded.
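As an illustration of the core/border/noise distinction, the following sketch classifies points but stops short of forming clusters; eps and min_pts correspond to DBSCAN's usual Eps and MinPts parameters, and the rest of the code is an assumption made for illustration.

import numpy as np

def classify_points(points, eps, min_pts):
    # Label each point CORE, BORDER, or NOISE (a sketch, not full DBSCAN).
    # A core point has at least min_pts points (itself included) within eps;
    # a border point is not core but lies within eps of some core point;
    # everything else is noise.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    n_neighbors = (dist <= eps).sum(axis=1)  # counts the point itself
    is_core = n_neighbors >= min_pts
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("CORE")
        elif any(is_core[j] and dist[i, j] <= eps for j in range(len(points))):
            labels.append("BORDER")
        else:
            labels.append("NOISE")
    return labels

Full DBSCAN then places core points that are within eps of each other in the same cluster, attaches each border point to the cluster of a nearby core point, and discards the noise points, exactly as described above.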