
Cluster Analysis

Prof. Vandith Pamuru


Primary Objective behind Clustering

• Cluster Analysis (“data segmentation”) is an exploratory method for identifying homogeneous groups (“clusters”) of records
• Similar records should belong to the same cluster
• Dissimilar records should belong to different clusters
Example: Fitting the Troops
(from Data Mining Techniques by Berry & Linoff)
• The US army recently commissioned a study on how to redesign the uniforms of female soldiers. The army’s goal is to reduce the number of different uniform sizes that have to be kept in inventory while still providing each soldier with well-fitting khakis.

• Researchers Ashdown and Paal @ Cornell University designed a new set of sizes based on the actual shapes of women in the army. Unlike traditional clothing size systems, the new sizes are not an ordered set of graduated sizes where all dimensions increase together.

• Instead, they came up with sizes that fit particular body types (e.g., short-legged, small-waisted, large-busted women with long torsos, average arms, broad shoulders, and skinny necks).
More examples of cluster analysis

Cluster Analysis of 77 Community areas in Chicago based on crimes
Choice of variables in Clustering

http://images.indiatvnews.com/mainnational/Mumbai-Dabbawal38721.jpg
Important to keep in mind

Objective: align the cluster analysis with the business objective(s).

• What entities are you clustering?
• Based on what attributes?
• In order to achieve what?

Recall the earlier examples.


UG Business Programs: Universities Clustering.xls

• Data for 25 undergraduate programs at business schools in US universities in 1995.
• This dataset excludes image variables (student satisfaction, employer satisfaction, deans’ opinions).

Univ SAT Top10 Accept SFRatio Expenses GradRate
Brown 1310 89 22 13 22,704 94
CalTech 1415 100 25 6 63,575 81
CMU 1260 62 59 9 25,026 72
Columbia 1310 76 24 12 31,510 88
Cornell 1280 83 33 13 21,864 90
Dartmouth 1340 89 23 10 32,162 95
Duke 1315 90 30 12 31,585 95
Georgetown 1255 74 24 12 20,126 92
Harvard 1400 91 14 11 39,525 97
JohnsHopkins 1305 75 44 7 58,691 87
MIT 1380 94 30 10 34,870 91
Northwestern 1260 85 39 11 28,052 89
NotreDame 1255 81 42 13 15,122 94
PennState 1081 38 54 18 10,185 80
Princeton 1375 91 14 8 30,220 95
Purdue 1005 28 90 19 9,066 69
Stanford 1360 90 20 12 36,450 93
TexasA&M 1075 49 67 25 8,704 67
UCBerkeley 1240 95 40 17 15,140 78
UChicago 1290 75 50 13 38,380 87
UMichigan 1180 65 68 16 15,470 85
UPenn 1285 80 36 11 27,553 90
UVA 1225 77 44 14 13,349 92
UWisconsin 1085 40 69 15 11,857 71
Yale 1375 95 19 11 43,514 96
Why cluster universities?
• How can clustering help a prospective applicant?

• How can clustering help a business school dean?

• Any other potential stakeholder for the exercise?
• Simple Clustering: 1-2 variables
• Visual inspection of data
• Two approaches:
  – Compute “multivariate distance” between records, and group “close” records
    • Hierarchical Clustering
  – Group records to increase within-group homogeneity
    • K-Means clustering
Hierarchical methods - agglomerative: Hierarchical Clustering

• Begin with n records; sequentially merge similar records or groups of records until all are put in one large group.
• Useful when the goal is to arrange the clusters into a natural hierarchy.
• Requires specifying a distance measure to find similarity.
Hierarchical Clustering

• Start with n clusters (1 record in each cluster)

• Step 1: the two closest records are merged into one cluster

• At every step, the pair of records/clusters with the smallest distance is merged:
  – two records are merged,
  – or a single record is added to an existing cluster,
  – or two existing clusters are combined

• Dendrogram: tree-like diagram that summarizes the clustering process

• How do you know two entities are closest?
  – requires a definition of distance
Pairwise distance between Records

dij = distance between records i and j

Distance Requirements:
Non-negativity ( dij ≥ 0 )
dii = 0
Symmetry ( dij = dji )
Triangle inequality ( dij + djk ≥ dik )
Distance between two universities

Notation: each university i is represented by its vector of attribute values, xi = (xi1, xi2, …, xip)

Example:
• Caltech = (1415, 100, 25, 6, 63575, 81)
• Cornell = (1280, 83, 33, 13, 21864, 90)

Euclidean Distance

• 6-dimensional Euclidean distance between Caltech and Cornell:

  Sqrt[ (1415-1280)² + (100-83)² + (25-33)² + (6-13)² + (63575-21864)² + (81-90)² ] = 41,711.22
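As a quick check, the same calculation can be sketched in R; the two vectors below are simply the CalTech and Cornell rows of the table typed in by hand:

# Raw (unstandardized) attribute vectors: SAT, Top10, Accept, SFRatio, Expenses, GradRate
caltech <- c(1415, 100, 25, 6, 63575, 81)
cornell <- c(1280,  83, 33, 13, 21864, 90)

# 6-dimensional Euclidean distance; the result (~41,711) is dominated by Expenses
sqrt(sum((caltech - cornell)^2))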
Standardize when there are multiple variables of different scales

• Euclidean distance is influenced by the scale of the different measurements

• Solution: standardize (= normalize) each variable before measuring distances
Standardizing Example

Univ Z_SAT Z_Top10 Z_Accept Z_SFRatio Z_Expenses Z_GradRate


Brown 0.401994 0.644235 -0.87189 0.068840897 -0.32471667 0.80372917
CalTech 1.370988 1.210256 -0.71981 -1.65218153 2.508651168 -0.631501491
CMU -0.05943 -0.74509 1.003685 -0.91460049 -0.16374483 -1.625122718
Columbia 0.401994 -0.0247 -0.77051 -0.17701945 0.285756214 0.141315019
Cornell 0.125139 0.335496 -0.31429 0.068840897 -0.38294938 0.362119736
Dartmouth 0.67885 0.644235 -0.8212 -0.66874014 0.330955887 0.914131529
Euclidean distance between standardized Caltech and Cornell:

Sqrt[ (1.371-0.125)² + (1.210-0.335)² + … + (-0.632-0.362)² ] = 3.84
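A minimal R sketch of the standardize-then-measure workflow, assuming the 25 rows have been read into a data frame univ whose row names are the university names (the object and column names here are illustrative):

num_cols <- c("SAT", "Top10", "Accept", "SFRatio", "Expenses", "GradRate")

# z-score each variable: subtract its mean, divide by its standard deviation
univ_z <- scale(univ[, num_cols])

# Pairwise Euclidean distances on the standardized data
d <- dist(univ_z, method = "euclidean")

# Distance between two specific schools; should reproduce the ~3.84 above
as.matrix(d)["CalTech", "Cornell"]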
Lots of other distance metrics

• Manhattan distance: dij = Σk | xik − xjk |

• Statistical (Mahalanobis) distance: dij = Sqrt[ (xi − xj)′ S⁻¹ (xi − xj) ], where S is the covariance matrix
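Both alternatives can be sketched with base R functions (univ, num_cols and univ_z are the illustrative objects from the previous sketch):

# Manhattan (city-block) distance: sum of absolute coordinate differences
dist(univ_z, method = "manhattan")

# Mahalanobis distance between two records, using the covariance matrix S
S  <- cov(univ[, num_cols])
xi <- as.numeric(univ["CalTech", num_cols])
xj <- as.numeric(univ["Cornell", num_cols])
sqrt(t(xi - xj) %*% solve(S) %*% (xi - xj))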
Distances for Binary Data

• Similarity-based metrics based on a 2x2 table of counts

              Married?  Smoke?  Manager?
  Person 1       Y         Y        Y
  Person 2       N         Y        N
  Person 3       N         N        Y

Counts for Person 1 vs. Person 3:
                     Person 3
                     N     Y
  Person 1    N      0     0
              Y      2     1

In general:
                     Person 3
                     N     Y
  Person 1    N      a     b
              Y      c     d

• Binary Euclidean Distance: (b+c)/(a+b+c+d)
• Simple matching Coefficient: (a+d)/(a+b+c+d)
• Jaccard’s index: d/(b+c+d)
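A small sketch of the 2x2 counts and the three measures for Person 1 vs. Person 3 in R (coding 1 = Y, 0 = N):

p1 <- c(Married = 1, Smoke = 1, Manager = 1)   # Person 1: Y Y Y
p3 <- c(Married = 0, Smoke = 0, Manager = 1)   # Person 3: N N Y

cnt_a <- sum(p1 == 0 & p3 == 0)   # a: both N         -> 0
cnt_b <- sum(p1 == 0 & p3 == 1)   # b: 1 is N, 3 is Y -> 0
cnt_c <- sum(p1 == 1 & p3 == 0)   # c: 1 is Y, 3 is N -> 2
cnt_d <- sum(p1 == 1 & p3 == 1)   # d: both Y         -> 1
n     <- cnt_a + cnt_b + cnt_c + cnt_d

(cnt_b + cnt_c) / n               # binary Euclidean distance   = 2/3
(cnt_a + cnt_d) / n               # simple matching coefficient = 1/3
cnt_d / (cnt_b + cnt_c + cnt_d)   # Jaccard's index             = 1/3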
Revisit the hierarchical clustering algorithm

• Start with n clusters (1 record in each cluster)

• Step 1: the two closest records are merged into one cluster

• At every step, the pair of records/clusters with the smallest distance is merged:
  – two records are merged,
  – or a single record is added to an existing cluster,
  – or two existing clusters are combined
Distances Between Clusters: ‘single linkage’ (‘nearest neighbor’)
• Distance between 2 clusters = minimum distance between members of the two clusters

Distances Between Clusters: ‘complete linkage’ (‘farthest neighbor’)
• Distance between 2 clusters = greatest distance between members of the two clusters

Distances Between Clusters: ‘average linkage’
• Distance between 2 clusters = average of all distances between members of the two clusters

Distances Between Clusters: ‘centroid linkage’
• Distance between 2 clusters = distance between their centroids (centers)
Pairwise distance between Clusters

• Single linkage (nearest neighbor): minimum distance between members of the two clusters

• Complete linkage (farthest neighbor): greatest distance between members of the two clusters

• Average linkage: average of all distances between members of the two clusters

• Centroid linkage: distance between their centroids (centers)
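In R these linkage choices map onto the method argument of hclust(); a minimal sketch using the standardized matrix univ_z from the earlier example:

d <- dist(univ_z, method = "euclidean")

hc_single   <- hclust(d, method = "single")    # nearest neighbor
hc_complete <- hclust(d, method = "complete")  # farthest neighbor
hc_average  <- hclust(d, method = "average")   # average linkage
hc_centroid <- hclust(d, method = "centroid")  # centroid linkage
# (note: 'centroid' is conventionally run on squared Euclidean distances)

plot(hc_complete, labels = rownames(univ), main = "Business schools dendrogram")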
• Insert << Appendix 1 - Hierarchical
clustering by hand >>
Hierarchical Clustering: The Dendrogram

• The height of a branch denotes the distance between the entities that are getting merged at that level
UG Business Programs:
Universities Clustering.xls
Data for 25 undergraduate programs at business schools in US universities in 1995. (Slide callouts group the variables under Student Quality, Program, and Placement.)

Univ SAT Top10 Accept SFRatio Expenses GradRate


Brown 1310 89 22 13 22,704 94
CalTech 1415 100 25 6 63,575 81
CMU 1260 62 59 9 25,026 72
Columbia 1310 76 24 12 31,510 88
Cornell 1280 83 33 13 21,864 90
Dartmouth 1340 89 23 10 32,162 95
Duke 1315 90 30 12 31,585 95
Georgetown 1255 74 24 12 20,126 92
Harvard 1400 91 14 11 39,525 97
JohnsHopkins 1305 75 44 7 58,691 87
MIT 1380 94 30 10 34,870 91
Northwestern 1260 85 39 11 28,052 89
NotreDame 1255 81 42 13 15,122 94
PennState 1081 38 54 18 10,185 80
Princeton 1375 91 14 8 30,220 95
Purdue 1005 28 90 19 9,066 69
Stanford 1360 90 20 12 36,450 93
TexasA&M 1075 49 67 25 8,704 67
UCBerkeley 1240 95 40 17 15,140 78
UChicago 1290 75 50 13 38,380 87
UMichigan 1180 65 68 16 15,470 85
UPenn 1285 80 36 11 27,553 90
UVA 1225 77 44 14 13,349 92
UWisconsin 1085 40 69 15 11,857 71
Yale 1375 95 19 11 43,514 96

This dataset excludes image variables (student satisfaction, employer satisfaction, deans’ opinions).
Dendrogram for Business Schools
Euclidean distance & Single linkage
Row Id. University

1 Brown
2 CalTech
3 CMU
4 Columbia
5 Cornell
6 Dartmouth
7 Duke
8 Georgetown
9 Harvard
10 JohnsHopkins
11 MIT
12 Northwestern
13 NotreDame
14 PennState
15 Princeton
16 Purdue
17 Stanford
18 TexasA&M
19 UCBerkeley
20 UChicago
21 UMichigan
22 UPenn
23 UVA
24 UWisconsin
25 Yale
From Dendrograms to Clusters
• After the dendrogram is obtained, cut it to create clusters. How?
• Examine distance levels
• Cutpoint determines the # of clusters
• Obtain statistics on resulting clusters
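A hedged sketch of cutting the tree in R, continuing with hc_complete and univ from the earlier sketches; the choice of 4 clusters is purely illustrative:

membership <- cutree(hc_complete, k = 4)   # cut the dendrogram into 4 clusters
table(membership)                          # cluster sizes
split(rownames(univ), membership)          # which universities fall in each cluster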
Evaluating usefulness of clustering

• What characterizes each cluster?

• Can you give a “name” to each cluster?

• Does this give us any insight?

• Insert << Tableau exercise post HC >>
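Besides the Tableau exercise, a quick numeric profile can be sketched in R by averaging the original variables within each cluster (membership, univ and num_cols are the illustrative objects from the earlier sketches); the cluster means are a first step toward naming the clusters:

# Mean of each original variable within each cluster
aggregate(univ[, num_cols], by = list(cluster = membership), FUN = mean)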
Recap & Agenda
• Hierarchical clustering
• Homework from the last class:
– By hand appendix
– R code
– Gower’s similarity
– Case: Mall of America
– Try Tableau (Optional)
• Today:
– K-Means Clustering
– Case
Insights? Anything Interesting?
Row Id. University

1 Brown
2 CalTech
3 CMU
4 Columbia
5 Cornell
6 Dartmouth
7 Duke
8 Georgetown
9 Harvard
10 JohnsHopkins
11 MIT
12 Northwestern
13 NotreDame
14 PennState
15 Princeton
16 Purdue
17 Stanford
18 TexasA&M
19 UCBerkeley
20 UChicago
21 UMichigan
22 UPenn
23 UVA
24 UWisconsin
25 Yale
Dendrogram for Business Schools
Euclidean distance & Complete linkage
Distances for Mixed (numerical + categorical) Data

• Simple: standardize numerical variables, then use Euclidean distance for all

• Gower's General Dissimilarity Coefficient (next page)
Distances for Mixed (numerical + categorical) Data

• Gower's General Dissimilarity Coefficient:

  dij = Σk ( wijk · dijk ) / Σk wijk

  – dijk = distance contributed by the kth variable.
  – wijk = usually 1 or 0, depending on whether or not the comparison is valid for the kth variable (for example, the value may be missing).

  – In R (reference: https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html),
    • dijk = |xik − xjk| / Rk (Rk = range of variable k), if k is a numerical variable;
    • dijk = 0 if k is a categorical variable and i and j have the same value, 1 otherwise.
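A minimal sketch with the cluster package's daisy() function; the mixed-type data frame customers and its contents are hypothetical:

library(cluster)

# Gower's coefficient handles numeric and factor (categorical) columns together
gd <- daisy(customers, metric = "gower")

# The resulting dissimilarities can feed hierarchical clustering directly
hc_mixed <- hclust(as.dist(gd), method = "average")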
Non-Hierarchical Clustering: K-Means Clustering

• Gives a predetermined number (K) of non-overlapping clusters
  – Requires specifying the number of clusters

• Assigns records to clusters so as to improve homogeneity within each group
  – Clusters are homogeneous yet dissimilar to other clusters

• Needs measures of within-cluster similarity (homogeneity) and between-cluster similarity

• No hierarchy (no dendrogram)! The end product is the final cluster memberships

• Computationally cheap
  – Useful for large datasets
K-means clustering

• Predetermined number (K) of non-overlapping clusters

• Clusters are homogeneous yet dissimilar to other clusters

• Need measures of within-cluster similarity (homogeneity) and between-cluster similarity

• No hierarchy (no dendrogram)! End-product is final cluster memberships

• Useful for large datasets
K-means clustering

Algorithm minimizes within-cluster variance (heterogeneity):

1. For a user-specified value of K, partition the dataset into K initial clusters (next slide).
2. For each record, assign it to the cluster with the closest centroid.
3. Re-calculate centroids for the “losing” and “receiving” clusters. Can be done
   • after reassignment of each record, or
   • after one complete pass through all records (cheaper)
4. Repeat Steps 2-3 until no more reassignment is necessary.
Initial partition into K clusters
Initial partitions can be obtained by either
1. user-specified initial partitions, or
2. user-specified initial centers, or
3. random partitions (by software)
• Insert << Appendix 2 - K-Means
clustering by hand >>
Why might multiple start points (initial partitions) be necessary?

• K-means clustering is a minimization problem: it minimizes the within-cluster sum of squares
• Existence of multiple local minima
Convergence/robustness of K-means

• Procedure might oscillate indefinitely
• Convergence criterion:
  – Stop when a cluster centroid moves less than a % of the smallest distance between any of the centroids
  – Specify the maximum number of iterations
kmeans(x, centers, iter.max, nstart, ...)

• x: standardized data matrix
• centers:
  – either the number of clusters (a random set of distinct rows in x is chosen as the initial centers)
  – or a set of initial (distinct) cluster centers
• iter.max: the maximum number of iterations allowed
• nstart: if centers is a number, the number of random starts
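A hedged usage sketch on the standardized university data (univ_z from the earlier sketch); nstart = 25 runs 25 random starts and keeps the best solution found:

set.seed(1)                       # reproducible random starts
km <- kmeans(univ_z, centers = 4, iter.max = 100, nstart = 25)

km$cluster                        # cluster membership for each school
km$centers                        # centroids (in standardized units)
km$tot.withinss                   # total within-cluster sum of squares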
Selecting K

• Re-run the algorithm for different values of K

• Tradeoff: simplicity (interpretation) vs. adequacy (within-cluster homogeneity)

• Plot cluster variability (total within-cluster sum of squares) vs. K

• Choice is subjective!
Elbow Curve/Scree plot (figure: cluster variability, i.e., total within-cluster sum of squares, plotted against K)
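A minimal sketch of producing such a plot in R (univ_z as before; the range of K is illustrative):

k_values <- 1:8
wss <- sapply(k_values, function(k)
  kmeans(univ_z, centers = k, nstart = 25)$tot.withinss)

plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")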
Discussion point
• What if cluster variability with 5 clusters is higher than that with 3 clusters?
• Is it even possible? Why or why not?
Universities Example with k = 4

Cluster 1: CalTech, JohnsHopkins
Cluster 2: PennState, Purdue, TexasA&M, UWisconsin
Cluster 3: CMU, Cornell, Georgetown, Northwestern, NotreDame, UCBerkeley, UChicago, UMichigan, UPenn, UVA
Cluster 4: Brown, Columbia, Dartmouth, Duke, Harvard, MIT, Princeton, Stanford, Yale
Evaluating usefulness of clustering

• What characterizes each cluster?

• Can you give a “name” to each cluster?

• Does this give us any insight?
Final checks

• Cluster stability: do cluster assignments change dramatically if some inputs are slightly altered?
  – run the algorithm with different initial centers/partitions/data subsets

• Cluster separation: compare between-cluster variation to within-cluster variation
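One way to sketch the stability check in R: re-run k-means from different random starts (or on a data subset) and cross-tabulate the two membership vectors; counts concentrated in a few large cells suggest stable clusters (labels may simply be permuted between runs):

set.seed(2);  km_a <- kmeans(univ_z, centers = 4, nstart = 1)
set.seed(99); km_b <- kmeans(univ_z, centers = 4, nstart = 1)

table(run_a = km_a$cluster, run_b = km_b$cluster)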
K-Means vs. Hierarchical

K-Means: The Good
• Computationally fast for large datasets
• Useful when a certain K is needed

K-Means: The Bad
• Can take long to terminate
• Final solution not guaranteed to be “globally optimal”
• Different initial partitions can lead to different solutions
• Must re-run the algorithm for different values of K
• No dendrogram

Hierarchical: The Good
• Finds “natural” grouping – no need to specify the number of clusters
• Dendrogram: transparency of process, good for presentation

Hierarchical: The Bad
• Requires computation & storage of the n x n distance matrix
• Low stability: reordering data or dropping a few records can lead to a different solution
• Most distances are sensitive to outliers
Discussion point: What will be the outcome of cluster analysis in this case?

(Figures from http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means)

• Stuck in a local minimum
• Clustering non-clustered data
• Unevenly Sized Clusters
A few more examples of cluster analysis

• Cluster securities based on financial performance info (return, volatility, beta) and other info (industry and market capitalization). What can you do with it?

• For a given industry, cluster firms based on growth rate, profitability, market size, product range, and presence in various international markets. What can you do with it?