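Cluster Analysis
Figure: classification of clustering procedures. The procedures divide into nonhierarchical and hierarchical; hierarchical procedures are either divisive or agglomerative; agglomerative procedures comprise linkage methods (single linkage, complete linkage, average linkage), variance methods (Ward's method), and centroid methods.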
592 PART II Data Collection, Preparation, Analysis, and Reporting
Figure 20.5 Linkage Methods of Clustering
Panels: Single Linkage (minimum distance), Complete Linkage (maximum distance), Average Linkage (average distance), each drawn between Cluster 1 and Cluster 2.
complete linkage
Linkage method that is based on maximum distance or the furthest neighbor approach.

average linkage
A linkage method based on the average distance between all pairs of objects, where one member of the pair is from each of the clusters.

variance methods
An agglomerative method of hierarchical clustering in which clusters are generated to minimize the within-cluster variance.

Ward's procedure
Variance method in which the squared euclidean distance to the cluster means is minimized.

centroid methods
A variance method of hierarchical clustering in which the distance between two clusters is the distance between their centroids (means for all the variables).

In the single linkage method, the first two objects clustered are those that have the smallest distance between them. The next shortest distance is identified, and either the third object is clustered with the first two, or a new two-object cluster is formed. At every stage, the distance between two clusters is the distance between their two closest points (see Figure 20.5). Two clusters are merged at any stage by the single shortest link between them. This process is continued until all objects are in one cluster. The single linkage method does not work well when the clusters are poorly defined. The complete linkage method is similar to single linkage, except that it is based on the maximum distance or the furthest neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two furthest points. The average linkage method works similarly. However, in this method, the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is from each of the clusters (Figure 20.5). As can be seen, the average linkage method uses information on all pairs of distances, not merely the minimum or maximum distances. For this reason, it is usually preferred to the single and complete linkage methods.
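The three linkage rules differ only in how they summarize the distances between pairs of objects drawn one from each cluster. The following minimal sketch illustrates this with ordinary euclidean distance; the function names and the four points are made up for illustration and do not come from the chapter's data.

```python
import math
from itertools import product


def euclidean(p, q):
    """Straight-line distance between two objects given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linkage_distances(cluster_a, cluster_b):
    """Return the single, complete, and average linkage distances."""
    # Every pair has one member from each of the two clusters.
    pair_distances = [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pair_distances),    # two closest points
        "complete": max(pair_distances),  # two furthest points
        "average": sum(pair_distances) / len(pair_distances),
    }


# Hypothetical points on a line, chosen so the result can be checked by hand.
cluster_1 = [(0.0, 0.0), (1.0, 0.0)]
cluster_2 = [(4.0, 0.0), (6.0, 0.0)]
distances = linkage_distances(cluster_1, cluster_2)
print(distances)  # {'single': 3.0, 'complete': 6.0, 'average': 4.5}
```

The four pairwise distances here are 4, 6, 3, and 5, so single linkage reports 3, complete linkage reports 6, and average linkage reports their mean, 4.5.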
nonhierarchical clustering
A procedure that first assigns or determines a cluster center and then groups all objects within a prespecified threshold value from the center.

sequential threshold method
A nonhierarchical clustering method in which a cluster center is selected and all objects within a prespecified threshold value from the center are grouped together.

The variance methods attempt to generate clusters to minimize the within-cluster variance. A commonly used variance method is the Ward's procedure. For each cluster, the means for all the variables are computed. Then, for each object, the squared euclidean distance to the cluster means is calculated (Figure 20.6). These distances are summed for all the objects. At each stage, the two clusters with the smallest increase in the overall sum of squares within cluster distances are combined. In the centroid methods, the distance between two clusters is the distance between their centroids (means for all the variables), as shown in Figure 20.6. Every time objects are grouped, a new centroid is computed. Of the hierarchical methods, average linkage and Ward's methods have been shown to perform better than the other procedures.1
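The merging rule in Ward's procedure and the distance used by the centroid methods can be sketched in a few lines. This is an illustration under stated assumptions, not the routine of any statistical package: the function names and the three small clusters are hypothetical, and the centroid distance is computed as a squared euclidean distance.

```python
def centroid(cluster):
    """Means for all the variables over the objects in a cluster."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_ss(cluster):
    """Sum of squared euclidean distances from each object to the cluster means."""
    center = centroid(cluster)
    return sum(squared_euclidean(obj, center) for obj in cluster)


def ward_increase(cluster_a, cluster_b):
    """Increase in the overall within-cluster sum of squares if the two merge."""
    merged = cluster_a + cluster_b
    return within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)


def centroid_distance(cluster_a, cluster_b):
    """Centroid method: squared euclidean distance between the two centroids."""
    return squared_euclidean(centroid(cluster_a), centroid(cluster_b))


def best_ward_merge(clusters):
    """Indices (i, j) of the pair whose merge raises the sum of squares least."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]))


# Hypothetical clusters of two-variable objects.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 0.0)], [(9.0, 0.0), (9.0, 2.0)]]
print(best_ward_merge(clusters))                    # (0, 1)
print(centroid_distance(clusters[0], clusters[1]))  # 16.0
```

Merging the first two clusters raises the within-cluster sum of squares by about 10.67, against about 11.33 and 65 for the other two pairs, so Ward's procedure would combine them at this stage.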
593 CHAPTER 20 Cluster Analysis

Figure 20.6 Other Agglomerative Clustering Methods
Panels: Ward's Method, Centroid Method.

parallel threshold method
Nonhierarchical clustering method that specifies several cluster centers at once. All objects that are within a prespecified threshold value from the center are grouped together.

optimizing partitioning method
Nonhierarchical clustering method that allows for later reassignment of objects to clusters to optimize an overall criterion.

The second type of clustering procedures, the nonhierarchical clustering method, is frequently referred to as k-means clustering. These methods include sequential threshold, parallel threshold, and optimizing partitioning. In the sequential threshold method, a cluster center is selected and all objects within a prespecified threshold value from the center are grouped together. Then a new cluster center or seed is selected, and the process is repeated for the unclustered points. Once an object is clustered with a seed, it is no longer considered for clustering with subsequent seeds. The parallel threshold method operates similarly, except that several cluster centers are selected simultaneously and objects within the threshold level are grouped with the nearest center. The optimizing partitioning method differs from the two threshold procedures in that objects can later be reassigned to clusters to optimize an overall criterion, such as average within-cluster distance for a given number of clusters.
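The two threshold procedures can be sketched as follows. The text does not say how each seed is chosen, so this sketch assumes that the next seed is simply the first object not yet clustered; the function names, the points, and the threshold value are likewise hypothetical.

```python
import math


def sequential_threshold(points, threshold):
    """Cluster around one seed at a time; return a cluster number per object."""
    labels = [None] * len(points)
    cluster_id = 0
    for seed_index, seed in enumerate(points):
        if labels[seed_index] is not None:
            continue  # already clustered with an earlier seed
        cluster_id += 1
        for i, point in enumerate(points):
            # Only unclustered objects are considered for the current seed.
            if labels[i] is None and math.dist(seed, point) <= threshold:
                labels[i] = cluster_id
    return labels


def parallel_threshold(points, centers, threshold):
    """Use all centers at once; group each object with the nearest center."""
    labels = []
    for point in points:
        distances = [math.dist(point, center) for center in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        # Objects outside the threshold of every center stay unclustered (None).
        labels.append(nearest + 1 if distances[nearest] <= threshold else None)
    return labels


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (0.5, 0.5)]
print(sequential_threshold(points, 2.0))                            # [1, 1, 2, 2, 1]
print(parallel_threshold(points, [(0.0, 0.0), (10.0, 0.0)], 2.0))   # [1, 1, 2, 2, 1]
```

Neither function ever moves an object once it has been assigned, which is the feature that distinguishes the threshold procedures from optimizing partitioning.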
Two major disadvantages of the nonhierarchical procedures are that the number of clusters must be prespecified and the selection of cluster centers is arbitrary. Furthermore, the clustering results may depend on how the centers are selected. Many nonhierarchical programs select the first k (k = number of clusters) cases without missing values as initial cluster centers. Thus, the clustering results may depend on the order of observations in the data. Yet nonhierarchical clustering is faster than hierarchical methods and has merit when the number of objects or observations is large. It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters and cluster centroids so obtained are used as inputs to the optimizing partitioning method.2
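The second step of the tandem approach can be sketched as a simple reassignment loop. This is a minimal illustration, assuming that the initial centroids have already been obtained from a hierarchical solution; the function name, the data, and the two starting centroids are hypothetical, and nearest-center reassignment stands in for whatever optimizing criterion a given program uses.

```python
import math


def optimize_partition(points, initial_centers, max_iter=100):
    """Reassign objects to the nearest center and update centers until stable."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object changed cluster
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


# Hypothetical data and centroids, as if taken from a two-cluster hierarchical solution.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
hierarchical_centroids = [(1.5, 1.5), (8.5, 8.5)]
labels, centers = optimize_partition(points, hierarchical_centroids)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Supplying the centroids from the hierarchical step fixes both the number of clusters and the starting centers, which are the two inputs the nonhierarchical procedure would otherwise have to be given arbitrarily.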
Choice of a clustering method and choice of a distance measure are interrelated. For example, squared euclidean distances should be used with the Ward's and centroid methods. Several nonhierarchical procedures also use squared euclidean distances.

We will use the Ward's procedure to illustrate hierarchical clustering. The output obtained by clustering the data of Table 20.1 is given in Table 20.2. Useful information is contained in the agglomeration schedule, which shows the number of cases or clusters being combined at each stage. The first line represents stage 1, with 19 clusters. Respondents 14 and 16 are combined at this stage, as shown in the columns labeled "Clusters Combined." The squared euclidean distance between these two respondents is given under the column labeled "Coefficients." The column entitled "Stage Cluster First Appears" indicates the stage at which a cluster is first formed. To illustrate, an entry of 1 at stage 6 indicates that respondent 14 was first grouped at stage 1. The last column, "Next Stage," indicates the stage at which another case (respondent) or cluster is combined with this one. Because the number in the first line of the last column is 6, we see that at stage 6, respondent 10 is combined with 14 and 16 to form a single cluster. Similarly, the second line represents stage 2 with 18 clusters. In stage 2, respondents 6 and 7 are grouped together.

Another important part of the output is contained in the icicle plot given in Figure 20.7. The columns correspond to the objects being clustered, in this case respondents labeled 1 through 20. The rows correspond to the number of clusters. This figure is read from bottom to top. At first, all cases are considered as individual clusters. Because there are 20 respondents, there are 20 initial clusters. At the first step, the two closest objects are combined, resulting in
TABLE 20.2
Results of Hierarchical Clustering

CASE PROCESSING SUMMARY (a, b)

                        CASES
VALID             MISSING           TOTAL
N     Percent     N     Percent     N     Percent
20    100.0       0     0.0         20    100.0

a. Squared Euclidean Distance used
b. Ward Linkage
WARD LINKAGE
AGGLOMERATION SCHEDULE

         CLUSTER COMBINED                        STAGE CLUSTER FIRST APPEARS
STAGE    CLUSTER 1    CLUSTER 2    COEFFICIENTS    CLUSTER 1    CLUSTER 2    NEXT STAGE
  1         14           16            1.000           0            0            6
  2          6            7            2.000           0            0            7
  3          2           13            3.500           0            0           15
  4          5           11            5.000           0            0           11
  5          3            8            6.500           0            0           16
  6         10           14            8.167           0            1            9
  7          6           12           10.500           2            0           10
  8          9           20           13.000           0            0           11
  9          4           10           15.583           0            6           12
 10          1            6           18.500           0            7           13
 11          5            9           23.000           4            8           15
 12          4           19           27.750           9            0           17
 13          1           17           33.100          10            0           14
 14          1           15           41.333          13            0           16
 15          2            5           51.833           3           11           18
 16          1            3           64.500          14            5           19
 17          4           18           79.667          12            0           18
 18          2            4          172.667          15           17           19
 19          1            2          328.600          16           18            0
CLUSTER MEMBERSHIP

CASE    4 CLUSTERS    3 CLUSTERS    2 CLUSTERS
  1         1             1             1
  2         2             2             2
  3         1             1             1
  4         3             3             2
  5         2             2             2
  6         1             1             1
  7         1             1             1
  8         1             1             1
  9         2             2             2
 10         3             3             2
 11         2             2             2
 12         1             1             1
 13         2             2             2
 14         3             3             2
 15         1             1             1
 16         3             3             2
 17         1             1             1
 18         4             3             2
 19         3             3             2
 20         2             2             2
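The cluster membership columns follow directly from the agglomeration schedule: replaying the first 20 minus k merges leaves k clusters. The sketch below illustrates this. The merge pairs are as read from the "Clusters Combined" columns of Table 20.2; several cells are hard to read in this copy, so the list should be treated as an assumption and checked against the printed table. The function name and the rule for numbering clusters (by their lowest case number) are also assumptions made for illustration.

```python
# Merge pairs (cluster 1, cluster 2) for stages 1 to 19, as read from the
# agglomeration schedule of Table 20.2 (assumed; verify against the printed table).
MERGES = [
    (14, 16), (6, 7), (2, 13), (5, 11), (3, 8), (10, 14), (6, 12), (9, 20),
    (4, 10), (1, 6), (5, 9), (4, 19), (1, 17), (1, 15), (2, 5), (1, 3),
    (4, 18), (2, 4), (1, 2),
]


def cluster_membership(merges, n_cases, n_clusters):
    """Replay the schedule until n_clusters remain; return {case: cluster number}."""
    # Each cluster is keyed by the label it carries in the schedule.
    clusters = {case: {case} for case in range(1, n_cases + 1)}
    for first, second in merges[: n_cases - n_clusters]:
        # The merged cluster keeps the label listed under "Cluster 1".
        clusters[first] |= clusters.pop(second)
    # Number the remaining clusters in order of their labels.
    membership = {}
    for number, label in enumerate(sorted(clusters), start=1):
        for case in clusters[label]:
            membership[case] = number
    return membership


four = cluster_membership(MERGES, 20, 4)
three = cluster_membership(MERGES, 20, 3)
two = cluster_membership(MERGES, 20, 2)
print(four[18], three[18], two[18])  # 4 3 2
```

With these merge pairs, respondent 18 forms a cluster alone in the four-cluster solution and joins the cluster containing respondent 4 at stage 17.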
Figure 20.7 Vertical Icicle Plot Using Ward's Procedure
Axes: Case (columns), Number of Clusters (rows).
Figure: plot "Using Ward's ..." with cases listed by label and number along the axis "Rescaled Distance Cluster Combine."