Cluster Analysis
© 2014 by G. David Garson and Statistical Associates Publishing. All rights reserved
worldwide in all media. No permission is granted to any user to copy or post this work in
any format or any media.
The author and publisher of this eBook and accompanying materials make no
representation or warranties with respect to the accuracy, applicability, fitness, or
completeness of the contents of this eBook or accompanying materials. The author and
publisher disclaim any warranties (express or implied), merchantability, or fitness for any
particular purpose. The author and publisher shall in no event be held liable to any party for
any direct, indirect, punitive, special, incidental or other consequential damages arising
directly or indirectly from any use of this material, which is provided “as is”, and without
warranties. Further, the author and publisher do not warrant the performance,
effectiveness or applicability of any sites listed or linked to in this eBook or accompanying
materials. All links are for information purposes only and are not warranted for content,
accuracy or any other implied or explicit purpose. This eBook and accompanying materials are
copyrighted by G. David Garson and Statistical Associates Publishing. No part of this work may
be copied, changed in any format, sold, or used in any way under any circumstances other than
reading by the downloading individual.
Contact:
Email: [email protected]
Web: www.statisticalassociates.com
Table of Contents
Overview ....................................................................................................................................... 10
Data examples in this volume ....................................................................................................... 10
Key Concepts and Terms............................................................................................................... 12
Terminology ............................................................................................................................. 12
Distances (proximities) ........................................................................................................ 12
Cluster formation................................................................................................................. 12
Cluster validity ..................................................................................................................... 12
Types of cluster analysis........................................................................................................... 14
Types of cluster analysis by software package .................................................................... 14
Disjoint clustering ................................................................................................................ 15
Hierarchical clustering ......................................................................................................... 15
Overlapping clustering......................................................................................................... 16
Fuzzy clustering ................................................................................................................... 16
Hierarchical cluster analysis in SPSS ............................................................................................. 16
SPSS Input for hierarchical clustering ...................................................................................... 16
Example ............................................................................................................................... 16
The main “Hierarchical Cluster Analysis” dialog ................................................................. 17
Statistics button ................................................................................................................... 18
Plots button ......................................................................................................................... 19
Methods button................................................................................................................... 20
SPSS output for hierarchical cluster analysis ........................................................................... 21
Proximity table..................................................................................................................... 21
Cluster membership table ................................................................................................... 22
Agglomeration Schedule ..................................................................................................... 22
Dendrogram .......................................................................................................................... 24
Icicle plots
Summary measures
Hierarchical cluster analysis in SAS
SAS input for hierarchical cluster analysis
Example
Data setup
SAS syntax
SAS output for hierarchical cluster analysis
Simple statistics table
Eigenvalues of the covariance matrix table
Root mean square coefficients
K-means (PROC FASTCLUS) results with original vs. transformed data
SAS PROC VARCLUS: Oblique principal components cluster analysis
Overview
The PROC VARCLUS default method
PROC VARCLUS variations
Example
SAS input
SAS output
The dendrogram from PROC TREE
The cluster summary table
The R-squared table
The standardized scoring coefficients table
The cluster structure table
The table of inter-cluster correlations
The cluster history summary statistics table
Cluster membership
Cluster scores
SAS PROC MODECLUS: Nonparametric density cluster analysis
Overview
Interpreting p-values
Example
SAS input
PROC MODECLUS specifications
PROC MODECLUS command syntax
SAS output
First pass: Selecting the optimal radius
Second pass: Generating main output
PROC MODECLUS: Nearest neighbor analysis
SAS syntax for nearest neighbor lists/distances
SAS output for nearest neighbor analysis
Kohonen clustering in SAS Enterprise Miner
Overview of Kohonen clustering
Kohonen Clustering in SAS Enterprise Miner: Setup
Kohonen Clustering in SAS Enterprise Miner: Modeling
Overview
The flow chart model
It is acknowledged that k-means and hierarchical clustering are inefficient and inaccurate for large datasets, but what is the evidence that two-step clustering does better?
Can I cluster variables instead of cases?
Can I cluster repeated measures data?
Isn't discriminant analysis the same as cluster analysis?
What is the ratio of distance measure used in autoclustering in two-step cluster analysis?
How does SAS’s PROC MODECLUS work?
How does joining and dissolving work in SAS PROC MODECLUS?
What is the rationale for the stability value criterion in SAS PROC MODECLUS?
What does the content of OUTSTAT= files look like for PROC VARCLUS?
What is BIRCH clustering?
What is ClustanGraphics?
What is SaTScan?
Where can I find cluster software for R?
How does cluster analysis compare with factor analysis and multidimensional scaling?
Acknowledgments
Bibliography
Cluster Analysis
Overview
Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to
identify homogeneous subgroups of cases in a population. That is, cluster analysis
is used when the researcher does not know the number of groups in advance but
wishes to establish groups and then analyze group membership. Contrast this with
discriminant function analysis, which analyzes membership in known groups
pre-specified by the researcher. Cluster analysis implements this by
seeking to identify a set of groups which both minimize within-group variation
and maximize between-group variation. Later, group membership values may be
saved as a case-level variable and used in other procedures such as
crosstabulation.
While sometimes described as a method of clustering observations rather than
variables, it is always possible to transpose the data matrix so that variables are
clustered instead. Some software options allow the researcher to select whether
clustering of observations or of variables is desired, without need for data
transposition.
Data examples in this volume
The judges dataset, drawn from SPSS data samples, is a hypothetical data file
focusing on the scores given by trained judges plus one "enthusiast" to 300
gymnastic performances. Each row represents a separate performance. All judges
viewed and rated the same performances.
• Click here to download judges.sav for SPSS.
• Click here to download judges.sas7bdat for SAS.
• Click here to download judges.dta for Stata.
Two-step clustering in SPSS and PROC MODECLUS in SAS use the “cars” dataset,
also drawn from SPSS data samples. This dataset contains variables dealing with
engine size, number of cylinders, and other attributes of automobiles from
selected countries, for 406 automobile models.
• Click here to download cars.sav for SPSS.
• Click here to download cars.sas7bdat for SAS.
• Click here to download cars.dta for Stata.
Nearest neighbor analysis in SPSS uses the auto.sav data file as an example. It is
also used in the section on SOM/Kohonen clustering with SAS Enterprise Miner.
This is not the same dataset as cars.sav above. Variables are described below.
• Click here to download auto.sav for SPSS.
• Click here to download auto.sas7bdat for SAS.
The PROC VARCLUS example for SAS, below, uses the subset.sas7bdat file. This is
a modified version of the GSS93subset.sav General Social Survey data file supplied
in the SPSS Samples directory.
• Click here to download subset.sav for SPSS.
• Click here to download subset.sas7bdat for SAS.
• Click here to download subset.dta for Stata.
The PROC ACECLUS example for SAS, below, uses a version of the “Iris” sample file
supplied with SPSS Amos and widely used elsewhere for instruction. Variables are
described below.
• Click here to download iris.sas7bdat for SAS.
Cluster formation
Cluster formation is the selection of the procedure for determining how clusters
are created, and how the calculations are done. In agglomerative hierarchical
clustering every case is initially considered a cluster, then the two cases with the
lowest distance (or highest similarity) are combined into a cluster. The case with
the lowest distance to either of the first two is considered next. If that third case
is closer to a fourth case than it is to either of the first two, the third and fourth
cases become the second two-case cluster; if not, the third case is added to the
first cluster. The process is repeated, adding cases to existing clusters, creating
new clusters, or combining clusters to get to the desired final number of clusters.
There is also divisive clustering, which works in the opposite direction, starting
with all cases in one large cluster. Hierarchical cluster analysis, discussed below,
can use either agglomerative or divisive clustering strategies.
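The agglomerative process just described can be sketched in a few lines of pure Python. This is a hypothetical toy example (single-linkage merging of one-dimensional scores, with all names and values invented for illustration), not output from any statistical package:

```python
from itertools import combinations

# Hypothetical 1-D scores for five cases; distance = absolute difference.
scores = {1: 1.0, 2: 1.2, 3: 5.0, 4: 5.3, 5: 9.0}

def dist(a, b):
    # Single linkage: smallest pairwise distance between members of two clusters.
    return min(abs(scores[i] - scores[j]) for i in a for j in b)

# Start with every case as its own cluster, then repeatedly merge the
# closest pair until the desired number of clusters remains.
clusters = [frozenset([k]) for k in scores]
while len(clusters) > 2:
    a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]

print(sorted(sorted(c) for c in clusters))  # → [[1, 2], [3, 4, 5]]
```

Cases 1 and 2 merge first (distance 0.2), then 3 and 4 (0.3), and finally case 5 joins the 3-4 cluster, mirroring the narrative above in which a case joins whichever existing cluster it is closest to.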
Cluster validity
By whatever method the researcher forms clusters, the utility of clusters must be
assessed by multiple criteria:
1. Meaningfulness
2. Separation
Clusters are more meaningful if they are distinct from each other. Cluster
separation plots, discussed below, are one method of assessing
separation.
3. Size
All clusters should have enough cases to be meaningful. One or more very
small clusters indicates that the researcher has requested too many
clusters. Analysis resulting in a very large, dominant cluster may indicate
too few clusters have been requested.
4. Criterion validity
Crosstabulating the cluster membership (id) numbers with other variables
known from theory or prior research to correlate with the concept that
clustering is supposed to reflect should reveal the expected direction
and level of association.
Failure to meet these criteria may indicate the researcher has requested too
many or too few clusters, or possibly that an inappropriate distance measure has
been selected. It is also possible that the hypothesized basis for clustering does
not exist, resulting in arbitrary clusters.
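The size and criterion-validity checks above can be illustrated with a short sketch; the membership numbers and the external criterion variable below are hypothetical, invented for illustration:

```python
from collections import Counter

# Hypothetical cluster memberships and an external criterion variable
# known (by assumption here) to correlate with the clustering concept.
membership = [1, 1, 1, 2, 2, 2, 2, 3, 1, 2]
criterion  = ['lo', 'lo', 'lo', 'hi', 'hi', 'hi', 'lo', 'hi', 'lo', 'hi']

# Size check: very small clusters suggest too many clusters were requested.
sizes = Counter(membership)
small = [c for c, n in sizes.items() if n < 2]
print("sizes:", dict(sizes), "suspiciously small:", small)

# Criterion validity: crosstabulate membership by the external variable;
# a meaningful clustering should show the expected pattern of association.
crosstab = Counter(zip(membership, criterion))
for (c, level), n in sorted(crosstab.items()):
    print(f"cluster {c} x {level}: {n}")
```

Here cluster 3 has a single member (flagged by the size check), and the crosstab shows cluster 1 aligned with the 'lo' level and cluster 2 with 'hi', the sort of pattern criterion validity requires.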
Disjoint clustering
In disjoint clustering, each object is classified into exactly one cluster, and
clusters themselves are not nested within larger clusters. K-means clustering
and two-step cluster analysis, both discussed below, are of this type.
Hierarchical clustering
Hierarchical clustering is appropriate for smaller samples (typically < 250). When
sample size is large, the algorithm may be very slow to reach a solution and when
very large may exceed the capacity of some desktop computers. To accomplish
hierarchical clustering, the researcher must specify how similarity or distance is
defined and how clusters are aggregated (or divided). Hierarchical clustering
generates solutions for every possible number of clusters, from 1 to n. Unlike
in most other methods, in hierarchical clustering the clusters are nested
rather than mutually exclusive: larger clusters created at later stages may
contain smaller clusters created at earlier stages of agglomeration.
The researcher may wish to use the hierarchical cluster procedure on a sample of
cases (ex., 200) to inspect results for different numbers of clusters. The optimum
number of clusters depends on the research purpose. Identifying "typical" types
may call for few clusters and identifying "exceptional" types may call for many
clusters, and in either case the resulting clusters must be meaningful. After using
hierarchical clustering to determine the desired number of clusters, the
researcher may wish then to analyze the entire dataset with k-means clustering,
specifying that number of clusters.
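The second stage of that workflow, k-means, relies on iteratively assigning cases to the nearest center and recomputing the centers (Lloyd's algorithm). A minimal one-dimensional sketch, with toy data and hand-picked seeds (both hypothetical; real software chooses seeds more carefully):

```python
# Minimal k-means (Lloyd's algorithm) in one dimension, assuming k = 3 was
# chosen beforehand from a hierarchical run on a sample. Toy data.
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1]
k = 3
centers = [data[0], data[3], data[6]]  # hand-picked, spread-out seeds

for _ in range(20):  # alternate assignment and update steps
    groups = {i: [] for i in range(k)}
    for x in data:
        nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
        groups[nearest].append(x)
    # Recompute each center as the mean of its assigned cases
    # (keep the old center if a group happens to empty out).
    centers = [sum(g) / len(g) if g else centers[i] for i, g in groups.items()]

print([round(c, 2) for c in sorted(centers)])  # → [1.0, 5.0, 9.05]
```

Unlike the hierarchical procedure, this scales to large datasets because each pass touches every case only k times, which is why the text recommends k-means for the full dataset once k is known.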
Overlapping clustering
In overlapping clustering, objects may be in more than one cluster, even at the
same level.
Fuzzy clustering
In fuzzy clustering, each object is assigned a degree of membership in every
cluster rather than being placed in a single cluster.
Hierarchical cluster analysis in SPSS
Example
This example uses the SPSS example file judges.sav (see access above), where
columns (variables) are judges from eight countries and rows are 300 fictional
cases of gymnastic performances rated on a 0-10 scale, illustrated below.
Statistics button
Under the Statistics button, the dialog for which is shown below, the researcher
may request the agglomeration schedule and the proximity matrix, both described
below in the section on output. The researcher may also specify a range of
cluster solutions for which to report membership (3 to 6 is common), a single
specific number of clusters, or none.
Plots button
Under the “Plots” button dialog, the researcher may request dendrograms and
icicle plots, also described below in the section on output. Also, the orientation
(vertical or horizontal) of icicle plots may be specified.
Methods button
Under the “Method” button dialog, the researcher selects the similarity/distance
measure. For interval data, the most common choice is squared Euclidean
distance, the SPSS default. For count data, the most common is chi-square
distance. For binary data, squared Euclidean distance is perhaps the most
common among a large number of alternatives. Alternative similarity/distance
measures are discussed in the FAQ section below.
Proximity table
This table shows the distance from each case to each other case. The type of
distance was determined by the researcher’s selection under the “Method” button
discussed above. In this case the default, squared Euclidean distance, is used. The
table can be very large but for this example, variables were clustered and judges,
eight in number, were the variables, resulting in the small table shown below. The
distances show how far apart the row judge is from the column judge, with larger
numbers representing greater distances. The “Enthusiast” judge can be seen to be
further from other judges than any other judge, with few exceptions (one
exception is that China is further from France than is the Enthusiast).
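Each cell of this table is simply the sum of squared rating differences between two judge variables across all rated performances. A minimal sketch with hypothetical mini ratings (the real table sums over 300 rows):

```python
# Squared Euclidean distance between two "judge" variables across rated
# performances. Ratings here are invented; the judges.sav table uses 300 rows.
judge_a = [8.0, 7.5, 9.0, 6.5]
judge_b = [8.2, 7.0, 9.1, 6.0]

sq_euclidean = sum((x - y) ** 2 for x, y in zip(judge_a, judge_b))
print(round(sq_euclidean, 2))  # → 0.55
```

A judge who consistently rates unlike the others, such as the Enthusiast, accumulates large squared differences in every row and therefore shows large distances across the whole column of the proximity table.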
The cluster membership table shows variables as rows (this example clusters
variables, not cases, where variables were country judges) and columns are
alternative numbers of clusters in the solution (as specified in the "Range of
Solution" option under the Statistics button, here 3 - 6).
Cell entries show the number of the cluster to which the case belongs in the 3-
cluster solution through the 6-cluster solution. From this table, the researcher can
see which variables (judges in this example) are in which cluster, depending on
the number of clusters in the solution. In each of the four solutions, the
Enthusiast judge is in a unique cluster not shared by any country judge.
In SPSS, the “Save” button allows the researcher to save the cluster membership
number to file for use as a variable in future analyses only when clustering
observations (cases). It does not support saving cluster membership number
when clustering variables (here, judges) as in the current example.
Agglomeration Schedule
The agglomeration schedule lists the stages at which cases or clusters are
combined; for n cases there are n - 1 stages, so this 8-judge example has 7
stages. The (n - 1)th stage (here Stage 7) includes all the cases in one cluster.
There are also two "Cluster Combined" columns, giving the case or cluster
numbers for combination at each stage. In agglomerative clustering using a
distance measure like Euclidean distance, stage 1 combines the two cases which
have lowest proximity (distance) score. The cluster number goes by the lower of
the cases or clusters combined, where cases are initially numbered 1 to n.
The figure above reflects 8 judges rating 300 objects. The agglomeration schedule
shows, for instance, that in Stage 1, judges 3 and 5 are combined in a cluster (the
cluster is labeled 3). Then judges 2 and 4 become cluster 2. Then judge 6 is added
to cluster 2. Then at Stage 4, the new cluster 3 formed at stage 1 is combined with
judge 7 to form a larger cluster, also now labeled 3. Then cluster 3 is joined to
judge 1 and is labeled cluster 1. Then cluster 2 is joined to cluster 1 and is labeled
cluster 1. Finally, judge 8 (the "enthusiast" judge, who is most different from
others) is joined to cluster 1, which then is the only remaining cluster.
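The labeling rule, that the combined cluster takes the lower of the two labels combined, can be replayed in a short sketch using the merge order from this narrative:

```python
# Replay the cluster-label bookkeeping of the agglomeration schedule above:
# at each stage the combined cluster keeps the LOWER of the two labels.
merges = [(3, 5), (2, 4), (2, 6), (3, 7), (1, 3), (1, 2), (1, 8)]

label = {j: j for j in range(1, 9)}  # each judge starts as its own cluster
for stage, (a, b) in enumerate(merges, start=1):
    new = min(a, b)
    for judge, lab in label.items():
        if lab in (a, b):
            label[judge] = new
    print(f"Stage {stage}: combine {a} and {b} -> cluster {new}")

print(label)  # after Stage 7, every judge carries label 1
```

After Stage 5 the labels show judges 1, 3, 5, and 7 in cluster 1 and judges 2, 4, and 6 in cluster 2, with judge 8 still alone, which is exactly the three-cluster pattern discussed under the dendrogram below.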
Dendrogram
Also called hierarchical tree diagrams or plots, dendrograms are one of two types
of linkage plots output by SPSS (the other is icicle plots). Dendrograms show the
relative size of the proximity coefficients at which cases were combined. The
bigger the distance coefficient (or the smaller the similarity coefficient), the
more the clustering involved combining unlike entities, which may be undesirable.
Trees are usually depicted horizontally, not vertically, with each row
representing a case on the Y axis, while the X axis is a rescaled version of the
proximity coefficients. When the number of variables (when clustering variables)
or the number of cases (when clustering observations) is large, dendrograms can
become hard to read.
The figure above shows 8 judges who rated 300 objects. The inset showing the
labels for judges 1 – 8 is not part of dendrogram output but was lifted from the
main hierarchical cluster analysis dialog, where the researcher entered the
variables (judges). The dendrogram shows judges 3 & 5 (Romania and China) to be
in one of the two earliest clusters, with judge 7 (Russia) affiliated with the
3 & 5 cluster, but at a greater distance.
In general, the dendrogram shows the pattern of clustering among the judges,
with connecting lines further to the right indicating more distance between
judges and clusters. The final linkage to judge 8 ("Enthusiast") shows this judge to
be least like the others, but the largest jump occurs a step earlier. If the
researcher decided that making that large jump combined objects which were too
dissimilar, there would be a three-cluster solution:
1. Judges 3, 5, 7, 1
2. Judges 2, 4, 6
3. Judge 8
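Spotting "the largest jump" in the agglomeration coefficients can be automated. The sketch below uses hypothetical coefficients shaped like this example (8 objects, 7 stages) and stops merging just before the biggest increase in merge distance:

```python
# "Largest jump" heuristic: stop merging just before the biggest increase
# in merge distance. Coefficients here are invented, one per stage.
coefficients = [0.2, 0.3, 0.9, 1.1, 1.4, 6.0, 7.5]  # 8 objects, 7 stages

jumps = [b - a for a, b in zip(coefficients, coefficients[1:])]
cut_stage = jumps.index(max(jumps)) + 1         # last stage to accept (1-based)
n_clusters = len(coefficients) + 1 - cut_stage  # objects minus completed stages

print(cut_stage, n_clusters)  # → 5 3
```

Completing five of the seven stages and refusing the sixth (the big jump from 1.4 to 6.0) leaves three clusters, matching the three-cluster reading of the dendrogram above.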
The researcher may also cluster cases by so selecting in the main “Hierarchical
Cluster Analysis” dialog shown above. The dendrogram below is for the clustering
of 50 performances (objects) by the 8 judges, with performances 10, 38, 17, 16,
18, 43, 2, 46, and 27 forming one of the first clusters:
An illustrated tutorial and introduction to cluster analysis using SPSS, SAS, SAS Enterprise
Miner, and Stata for examples. Suitable for introductory graduate-level study.
The 2014 edition is a major update to the 2012 edition.
Table of Contents
Overview 53
Example 54
SAS input for k-means cluster analysis 54
SAS output for k-means cluster analysis 55
The "Statistics for Variables" table 55
Criteria for determining k 57
The "Cluster Summary" table 60
Cluster membership and distance values 61
Crosstabulation tables 61
Cluster separation plots 62
K-Means cluster analysis in Stata 64
Example 64
Stata input for k-means cluster analysis 64
The main kmeans clustering command 64
Obtaining descriptive statistics 65
Obtaining distance information 65
Obtaining cluster separation plots 65
Comparing kmeans and kmedian solutions 66
Stata output for k-means cluster analysis 66
Cluster membership assignments 66
Descriptive statistics 67
Distance coefficients 69
Cluster separation plots 70
Comparing kmeans and kmedians solutions 71
Two-step cluster analysis in SPSS 72
Overview 72
Cluster feature tree (CF tree) 73
Proximity 73
Example 74
SPSS input for two-step clustering 74
The main two-step clustering dialog 74
Options button dialog 75
Output button dialog 78
SPSS output for two-step clustering 79
Autoclustering table 79
Cluster distribution table 81
Centroids (cluster profiles) table 81
Model summary 82
The "Cluster Quality" graph 82
The "Cluster Sizes" pie chart 82
The "Predictor Importance" chart 83
The "Clusters" table 84
The "Cell Distribution" chart 85
The "Cluster Comparison" chart 86
Nearest neighbor analysis in SPSS 87
Overview 87
Target variables 87
Selecting k 87
Feature variables 88
Focal cases 88
Case labels 89
Partitions and cross-validation 89
Example 89
SPSS input 90
The user interface 90
The "Variables" tab 90
What is the rationale for the stability value criterion in SAS PROC MODECLUS? 198
What does the content of OUTSTAT= files look like for PROC VARCLUS? 199
What is BIRCH clustering? 200
What is ClustanGraphics? 200
What is SaTScan? 201
Where can I find cluster software for R? 201
How does cluster analysis compare with factor analysis and multidimensional
scaling? 201
Acknowledgments 201
Bibliography 201
Pagecount: 207
Copyright 1998, 2008, 2009, 2010, 2012, 2014 by G. David Garson and Statistical Associates
Publishers. Worldwide rights reserved in all languages and on all media. Do not copy or post in
any format or on any medium. Last updated 8 June 2014.