Cluster Analysis: Prof. Vandith Pamuru
Choice of variables in Clustering
http://images.indiatvnews.com/mainnational/Mumbai-Dabbawal38721.jpg
Important to keep in mind
Objective: align the cluster analysis with the
business objective(s).
Data for 25 undergraduate programs at business schools in US universities in 1995.
• This dataset excludes image variables
Columbia      1310  76  24  12  31,510  88
Cornell       1280  83  33  13  21,864  90
Georgetown    1255  74  24  12  20,126  92
Harvard       1400  91  14  11  39,525  97
JohnsHopkins  1305  75  44   7  58,691  87
MIT           1380  94  30  10  34,870  91
Northwestern  1260  85  39  11  28,052  89
NotreDame     1255  81  42  13  15,122  94
PennState     1081  38  54  18  10,185  80
Princeton     1375  91  14   8  30,220  95
Purdue        1005  28  90  19   9,066  69
Stanford      1360  90  20  12  36,450  93
TexasA&M      1075  49  67  25   8,704  67
UCBerkeley    1240  95  40  17  15,140  78
Distance
Distance Requirements:
Non-negative ( d_ij ≥ 0 )
d_ii = 0
Symmetry ( d_ij = d_ji )
Triangle inequality ( d_ij + d_jk ≥ d_ik )
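The four requirements can be checked numerically on a handful of records. The sketch below is illustrative only: it uses three rows from the universities table in these slides with plain Euclidean distance, and a hypothetical set of collinear points (`line`) to show a measure that fails the triangle inequality. The helper name `meets_requirements` is an assumption, not something from the slides.

```python
import math
from itertools import permutations

# Three rows copied from the universities table in these slides
# (six unscaled numeric measurements per school).
records = {
    "Columbia":   [1310, 76, 24, 12, 31510, 88],
    "Cornell":    [1280, 83, 33, 13, 21864, 90],
    "Georgetown": [1255, 74, 24, 12, 20126, 92],
}

def euclidean(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Squared Euclidean distance; used here only as a counterexample."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def meets_requirements(points, dist, tol=1e-9):
    """Return True if dist satisfies all four requirements on these points."""
    keys = list(points)
    for i in keys:
        if abs(dist(points[i], points[i])) > tol:             # d_ii = 0
            return False
        for j in keys:
            d_ij = dist(points[i], points[j])
            if d_ij < 0:                                      # non-negative
                return False
            if abs(d_ij - dist(points[j], points[i])) > tol:  # symmetry
                return False
    for i, j, k in permutations(keys, 3):                     # triangle inequality
        d_ij = dist(points[i], points[j])
        d_jk = dist(points[j], points[k])
        if d_ij + d_jk + tol < dist(points[i], points[k]):
            return False
    return True

line = {"p": [0.0], "q": [1.0], "r": [2.0]}  # hypothetical collinear points
print(meets_requirements(records, euclidean))       # True
print(meets_requirements(line, squared_euclidean))  # False, since 1 + 1 < 4
```

A check like this can only refute a candidate measure on the points tried; it does not prove the requirements hold in general.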
Distance between two universities
Notation:
Example:
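As a hedged illustration, the sketch below computes the Euclidean distance between Columbia and Cornell twice: once on the raw values and once after z-scoring each column. It assumes only the 14 rows that survive in the table on these slides, so the means and standard deviations are not those of the full 25-school dataset. The column names are not preserved in the slides, so columns are referenced by position only.

```python
import math
import statistics

# The 14 rows that survive in the table on these slides.
rows = {
    "Columbia":     [1310, 76, 24, 12, 31510, 88],
    "Cornell":      [1280, 83, 33, 13, 21864, 90],
    "Georgetown":   [1255, 74, 24, 12, 20126, 92],
    "Harvard":      [1400, 91, 14, 11, 39525, 97],
    "JohnsHopkins": [1305, 75, 44, 7, 58691, 87],
    "MIT":          [1380, 94, 30, 10, 34870, 91],
    "Northwestern": [1260, 85, 39, 11, 28052, 89],
    "NotreDame":    [1255, 81, 42, 13, 15122, 94],
    "PennState":    [1081, 38, 54, 18, 10185, 80],
    "Princeton":    [1375, 91, 14, 8, 30220, 95],
    "Purdue":       [1005, 28, 90, 19, 9066, 69],
    "Stanford":     [1360, 90, 20, 12, 36450, 93],
    "TexasA&M":     [1075, 49, 67, 25, 8704, 67],
    "UCBerkeley":   [1240, 95, 40, 17, 15140, 78],
}

def euclidean(x, y):
    """Square root of the summed squared differences across all columns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(table):
    """Z-score each column using its mean and sample standard deviation."""
    columns = list(zip(*table.values()))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return {
        name: [(v - m) / s for v, m, s in zip(values, means, sds)]
        for name, values in table.items()
    }

z = standardize(rows)
raw = euclidean(rows["Columbia"], rows["Cornell"])
scaled = euclidean(z["Columbia"], z["Cornell"])
print(f"raw distance:          {raw:.2f}")
print(f"standardized distance: {scaled:.2f}")
```

On the raw values the fifth column (the one in the thousands) accounts for almost all of the distance, which is why standardizing before clustering is usually worth considering.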
              Person 3
              N   Y
Person 1   N  a   b
           Y  c   d
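The counts a, b, c, d from the 2x2 table can be turned into a similarity score. The slides do not preserve which measure was shown, so the sketch below assumes two common ones: the matching coefficient, (a + d) / (a + b + c + d), and Jaccard's coefficient, d / (b + c + d), which ignores joint "N" answers. The answers for the two persons are hypothetical.

```python
def contingency_counts(p1, p3):
    """Count a, b, c, d for two equal-length lists of 'Y'/'N' answers.

    Layout follows the 2x2 table: rows are Person 1, columns are Person 3.
    """
    pairs = list(zip(p1, p3))
    a = sum(x == "N" and y == "N" for x, y in pairs)
    b = sum(x == "N" and y == "Y" for x, y in pairs)
    c = sum(x == "Y" and y == "N" for x, y in pairs)
    d = sum(x == "Y" and y == "Y" for x, y in pairs)
    return a, b, c, d

def matching_coefficient(a, b, c, d):
    """Share of variables on which the two persons agree (Y-Y or N-N)."""
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(a, b, c, d):
    """Share of agreement on Y among variables where at least one is Y.

    Undefined (division by zero) when neither person has any Y.
    """
    return d / (b + c + d)

# Hypothetical answers to six yes/no questions.
person1 = ["Y", "N", "N", "Y", "N", "N"]
person3 = ["Y", "N", "Y", "N", "N", "N"]

a, b, c, d = contingency_counts(person1, person3)
print(a, b, c, d)                                      # 3 1 1 1
print(round(matching_coefficient(a, b, c, d), 3))      # 0.667
print(round(jaccard_coefficient(a, b, c, d), 3))       # 0.333
```

Both are similarities in [0, 1]; one minus the similarity can serve as the corresponding dissimilarity.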
Distance
• Start with n clusters
(1 record in each cluster)
Height of the branch denotes the distance between the entities that are getting merged at that level
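A minimal sketch of the bottom-up procedure: start with n single-record clusters, repeatedly merge the two closest clusters, and record the distance at which each merge happens (the branch height). It assumes complete linkage, which a later slide names for the business-schools dendrogram, and uses hypothetical one-dimensional points. The function name `agglomerate` is illustrative.

```python
def agglomerate(points, dist):
    """Complete-linkage agglomerative clustering.

    Returns a list of (members_a, members_b, height) tuples, one per merge,
    where height is the largest pairwise distance between the two clusters.
    """
    clusters = [[i] for i in range(len(points))]  # n clusters, 1 record each
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                height = max(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical 1-D records; distance is the absolute difference.
points = [0.0, 1.0, 5.0, 6.0, 20.0]
merges = agglomerate(points, lambda x, y: abs(x - y))
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height}")
```

With n records there are always n - 1 merges, and with complete linkage the recorded heights never decrease from one merge to the next.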
UG Business Programs:
Universities Clustering.xls
Data for 25 undergraduate programs at business schools in US universities in 1995.
[Figure labels: Placement, Student Quality, Program]
1 Brown
2 CalTech
3 CMU
4 Columbia
5 Cornell
6 Dartmouth
7 Duke
8 Georgetown
9 Harvard
10 JohnsHopkins
11 MIT
12 Northwestern
13 NotreDame
14 PennState
15 Princeton
16 Purdue
17 Stanford
18 TexasA&M
19 UCBerkeley
20 UChicago
21 UMichigan
22 UPenn
23 UVA
24 UWisconsin
25 Yale
From Dendrograms to Clusters
• After dendrogram is obtained, cut it to create
clusters. How?
• Examine distance levels
• Cutpoint determines # clusters
• Obtain statistics on resulting clusters
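Cutting a dendrogram at a chosen height amounts to keeping only the merges that happened at or below that height. The sketch below assumes a hypothetical merge list in the form (members_a, members_b, height) and heights that never decrease from one merge to the next; `cut_dendrogram` is an illustrative name, not a library function.

```python
def cut_dendrogram(n_records, merges, cut_height):
    """Return the clusters obtained by applying only merges at or below cut_height."""
    parent = list(range(n_records))

    def find(i):
        # Follow parent links to the representative of i's cluster.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for left, right, height in merges:
        if height <= cut_height:
            parent[find(left[0])] = find(right[0])

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical merge history for five records (indices 0..4).
merges = [
    ([0], [1], 1.0),
    ([2], [3], 1.0),
    ([0, 1], [2, 3], 6.0),
    ([4], [0, 1, 2, 3], 20.0),
]

for cut in (0.5, 3.0, 10.0, 25.0):
    clusters = cut_dendrogram(5, merges, cut)
    print(f"cut at {cut}: {len(clusters)} clusters -> {clusters}")
```

A large gap between consecutive merge heights (here between 6 and 20) is a common place to look for a cutpoint; cluster statistics can then be computed per group.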
Evaluating usefulness of clustering
Dendrogram for business schools
Euclidean distance & Complete linkage
Distances for Mixed (numerical +
categorical) Data
• Simple: standardize numerical
variables, then use Euclidean distance
for all
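– In R (reference:
https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html),
daisy computes a contribution d_ij^(k) for each variable k between records i and j:
• d_ij^(k) = |x_ik − x_jk| / R_k for a numerical variable k, where R_k is the range of variable k;
• d_ij^(k) = 0 if k is a categorical variable and i and j have the same value, 1 otherwise.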
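A rough Python sketch of this kind of mixed-data distance, assuming equal weights and no missing values: numerical variables contribute their absolute difference divided by the variable's range, categorical variables contribute 0 or 1, and the contributions are averaged. This is an illustration of the idea, not a port of R's daisy; the records, `kinds`, and `ranges` below are hypothetical.

```python
def gower_distance(x, y, kinds, ranges):
    """Average per-variable contribution between records x and y.

    kinds[k] is "num" or "cat"; ranges[k] is the range (max - min) of
    numerical variable k and is ignored for categorical variables.
    """
    contributions = []
    for x_k, y_k, kind, r_k in zip(x, y, kinds, ranges):
        if kind == "num":
            contributions.append(abs(x_k - y_k) / r_k)
        else:
            contributions.append(0.0 if x_k == y_k else 1.0)
    return sum(contributions) / len(contributions)

# Hypothetical records: (test score, school type).
records = [(1000, "public"), (1400, "private"), (1200, "public")]
kinds = ["num", "cat"]
scores = [r[0] for r in records]
ranges = [max(scores) - min(scores), None]

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        d = gower_distance(records[i], records[j], kinds, ranges)
        print(f"d({i}, {j}) = {d:.2f}")
```

Because every contribution lies between 0 and 1, no single variable can dominate the result the way an unscaled column can under Euclidean distance.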
Non-Hierarchical Clustering:
K-Means Clustering
• Computationally cheap
– Useful for large datasets
K-means clustering
• Predetermined number (K) of non-overlapping clusters
• Choice is subjective!
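A minimal sketch of one common way to run K-means (Lloyd's algorithm), under stated assumptions: initial centroids are K records drawn at random, each record is assigned to its nearest centroid, centroids are recomputed as cluster means, and the loop stops when assignments no longer change. The slides do not spell out these steps, so treat the details (the `seed` parameter, the handling of empty clusters) as illustrative choices. The points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids, wss) for one K-means run."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random starting records
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no record changed cluster
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss

# Hypothetical 2-D records forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, wss = kmeans(points, k=2)
print(labels)
print(centroids)
print(round(wss, 3))
```

Each pass costs on the order of n × K distance computations, which is what makes the method cheap enough for large datasets.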
Elbow Curve/Scree plot
Cluster variability
Discussion point
• What if cluster variability with 5
clusters is higher than that with 3
clusters?
• Is it even possible? Why or why not?
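One way to explore the discussion point empirically is to compute cluster variability (within-cluster sum of squares, WSS) for several values of K, comparing a single K-means run against the best of several random restarts. The sketch below assumes a simple Lloyd-style K-means with random starting records and hypothetical 2-D points; `kmeans_wss` and `best_wss` are illustrative names. What a single run reports depends on its random start, so the printed numbers are worth inspecting rather than taking as given.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wss(points, k, seed, max_iter=100):
    """One K-means run from a random start; returns its within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def best_wss(points, k, n_init=20):
    """Smallest WSS over n_init runs started from seeds 0..n_init-1."""
    return min(kmeans_wss(points, k, seed) for seed in range(n_init))

# Hypothetical 2-D records forming three groups.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

for k in range(1, 6):
    single = kmeans_wss(points, k, seed=0)
    best = best_wss(points, k)
    print(f"k={k}: single run WSS = {single:.2f}, best of 20 runs = {best:.2f}")
```

Plotting the best-of-restarts WSS against K gives the elbow curve; the K where the drop flattens is a candidate choice, though the decision remains subjective.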
Universities Example with k=4
Cluster 1: CalTech, JohnsHopkins
Cluster 2: PennState, Purdue, TexasA&M,
UWisconsin
Cluster 3: CMU, Cornell, Georgetown, Northwestern,
NotreDame, UCBerkeley, UChicago, UMichigan,
UPenn, UVA
Cluster 4: Brown, Columbia, Dartmouth, Duke,
Harvard, MIT, Princeton, Stanford, Yale
Evaluating usefulness of clustering
http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means
Stuck in a local minimum
Clustering non-clustered data
Unevenly Sized Clusters
A few more examples of cluster analysis
Cluster securities based on financial
performance info (return, volatility, beta) and
other info (industry and market
capitalization). What can you do with it?