
Cluster Analysis

Prof. Vandith Pamuru


Primary Objective behind Clustering

• Cluster Analysis (“data segmentation”) is an exploratory method for identifying homogeneous groups (“clusters”) of records
• Similar records should belong to the same cluster
• Dissimilar records should belong to different clusters
Example: Fitting the Troops
(from Data Mining Techniques by Berry & Linoff)
• The US army recently commissioned a study on how to redesign the uniforms of female soldiers. The army’s goal is to reduce the number of different uniform sizes that have to be kept in inventory while still providing each soldier with well-fitting khakis.

• Researchers Ashdown and Paal @ Cornell University designed a new set of sizes based on the actual shapes of women in the army. Unlike traditional clothing size systems, the new sizes are not an ordered set of graduated sizes where all dimensions increase together.

• Instead, they came up with sizes that fit particular body types (e.g., short-legged, small-waisted, large-busted women with long torsos, average arms, broad shoulders, and skinny necks).
More examples of cluster analysis

Cluster Analysis of 77 Community areas in Chicago based on crimes
Choice of variables in Clustering

http://images.indiatvnews.com/mainnational/Mumbai-Dabbawal38721.jpg
Important to keep in mind

Objective: align the cluster analysis with the business objective(s).

• What entities are you clustering?
• Based on what attributes?
• In order to achieve what?

Recall the earlier examples.


UG Business Programs: Universities Clustering.xls

• Data for 25 undergraduate programs at business schools in US universities in 1995.
• This dataset excludes image variables (student satisfaction, employer satisfaction, deans’ opinions).

Univ SAT Top10 Accept SFRatio Expenses GradRate
Brown 1310 89 22 13 22,704 94
CalTech 1415 100 25 6 63,575 81
CMU 1260 62 59 9 25,026 72
Columbia 1310 76 24 12 31,510 88
Cornell 1280 83 33 13 21,864 90
Dartmouth 1340 89 23 10 32,162 95
Duke 1315 90 30 12 31,585 95
Georgetown 1255 74 24 12 20,126 92
Harvard 1400 91 14 11 39,525 97
JohnsHopkins 1305 75 44 7 58,691 87
MIT 1380 94 30 10 34,870 91
Northwestern 1260 85 39 11 28,052 89
NotreDame 1255 81 42 13 15,122 94
PennState 1081 38 54 18 10,185 80
Princeton 1375 91 14 8 30,220 95
Purdue 1005 28 90 19 9,066 69
Stanford 1360 90 20 12 36,450 93
TexasA&M 1075 49 67 25 8,704 67
UCBerkeley 1240 95 40 17 15,140 78
UChicago 1290 75 50 13 38,380 87
UMichigan 1180 65 68 16 15,470 85
UPenn 1285 80 36 11 27,553 90
UVA 1225 77 44 14 13,349 92
UWisconsin 1085 40 69 15 11,857 71
Yale 1375 95 19 11 43,514 96
Why cluster universities?
• How can clustering help a prospective applicant?

• How can clustering help a business school dean?

• Any other potential stakeholder for the exercise?
• Simple Clustering: 1-2 variables
• Visual inspection of data
• Two approaches:
  – Compute “multivariate distance” between records, and group “close” records
    • Hierarchical Clustering
  – Group records to increase within-group homogeneity
    • K-Means clustering
Hierarchical methods - agglomerative: Hierarchical Clustering

• Begin with n records; sequentially merge similar records or groups of records until all are put in one large group.
• Useful when the goal is to arrange the clusters into a natural hierarchy.
• Requires specifying a distance measure to find similarity.
Hierarchical Clustering

• Start with n clusters (1 record in each cluster)

• Step 1: the two closest records are merged into one cluster

• At every step, the pair of records/clusters with the smallest distance is merged:
  – two records are merged,
  – or a single record is added to an existing cluster,
  – or two existing clusters are combined

• Dendrogram: tree-like diagram that summarizes the clustering process

• How do you know two entities are closest?
  – requires a definition of distance
Pairwise distance between Records

dij = distance between records i and j

Distance Requirements:
Non-negativity ( dij ≥ 0 )
dii = 0
Symmetry ( dij = dji )
Triangle inequality ( dij + djk ≥ dik )
Distance between two universities

Notation: each university i is represented by its vector of attribute values, xi = (xi1, xi2, …, xip)

Example:
• Caltech = (1415, 100, 25, 6, 63575, 81)
• Cornell = (1280, 83, 33, 13, 21864, 90)

Euclidean Distance

• 6-dimensional Euclidean distance between Caltech and Cornell:

  Sqrt[ (1415-1280)² + (100-83)² + (25-33)² + (6-13)² + (63575-21864)² + (81-90)² ] = 41,711.22
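As a quick check, the same calculation can be sketched in R; the two vectors below are simply the CalTech and Cornell rows of the table typed in by hand:

# Raw (unstandardized) attribute vectors: SAT, Top10, Accept, SFRatio, Expenses, GradRate
caltech <- c(1415, 100, 25, 6, 63575, 81)
cornell <- c(1280,  83, 33, 13, 21864, 90)

# 6-dimensional Euclidean distance; the result (~41,711) is dominated by Expenses
sqrt(sum((caltech - cornell)^2))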
Standardize when there are multiple variables of different scales

• Euclidean distance is influenced by the scale of the different measurements

• Solution: standardize (= normalize) each variable before measuring distances
Standardizing Example

Univ Z_SAT Z_Top10 Z_Accept Z_SFRatio Z_Expenses Z_GradRate


Brown 0.401994 0.644235 -0.87189 0.068840897 -0.32471667 0.80372917
CalTech 1.370988 1.210256 -0.71981 -1.65218153 2.508651168 -0.631501491
CMU -0.05943 -0.74509 1.003685 -0.91460049 -0.16374483 -1.625122718
Columbia 0.401994 -0.0247 -0.77051 -0.17701945 0.285756214 0.141315019
Cornell 0.125139 0.335496 -0.31429 0.068840897 -0.38294938 0.362119736
Dartmouth 0.67885 0.644235 -0.8212 -0.66874014 0.330955887 0.914131529
Euclidean distance between standardized Caltech and Cornell:

Sqrt[ (1.371-0.125)² + (1.210-0.335)² + … + (-0.632-0.362)² ] = 3.84
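A minimal R sketch of the standardize-then-measure workflow, assuming the 25 rows have been read into a data frame univ whose row names are the university names (the object and column names here are illustrative):

num_cols <- c("SAT", "Top10", "Accept", "SFRatio", "Expenses", "GradRate")

# z-score each variable: subtract its mean, divide by its standard deviation
univ_z <- scale(univ[, num_cols])

# Pairwise Euclidean distances on the standardized data
d <- dist(univ_z, method = "euclidean")

# Distance between two specific schools; should reproduce the ~3.84 above
as.matrix(d)["CalTech", "Cornell"]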
Lots of other distance metrics

• Manhattan distance: dij = Σk | xik − xjk |

• Statistical (Mahalanobis) distance: dij = Sqrt[ (xi − xj)′ S⁻¹ (xi − xj) ], where S is the covariance matrix
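Both alternatives can be sketched with base R functions (univ, num_cols and univ_z are the illustrative objects from the previous sketch):

# Manhattan (city-block) distance: sum of absolute coordinate differences
dist(univ_z, method = "manhattan")

# Mahalanobis distance between two records, using the covariance matrix S
S  <- cov(univ[, num_cols])
xi <- as.numeric(univ["CalTech", num_cols])
xj <- as.numeric(univ["Cornell", num_cols])
sqrt(t(xi - xj) %*% solve(S) %*% (xi - xj))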
Distances for Binary Data

• Similarity-based metrics based on a 2x2 table of counts

              Married?  Smoke?  Manager?
  Person 1       Y         Y        Y
  Person 2       N         Y        N
  Person 3       N         N        Y

Counts for Person 1 vs. Person 3:
                     Person 3
                     N     Y
  Person 1    N      0     0
              Y      2     1

In general:
                     Person 3
                     N     Y
  Person 1    N      a     b
              Y      c     d

• Binary Euclidean Distance: (b+c)/(a+b+c+d)
• Simple matching Coefficient: (a+d)/(a+b+c+d)
• Jaccard’s index: d/(b+c+d)
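A small sketch of the 2x2 counts and the three measures for Person 1 vs. Person 3 in R (coding 1 = Y, 0 = N):

p1 <- c(Married = 1, Smoke = 1, Manager = 1)   # Person 1: Y Y Y
p3 <- c(Married = 0, Smoke = 0, Manager = 1)   # Person 3: N N Y

cnt_a <- sum(p1 == 0 & p3 == 0)   # a: both N         -> 0
cnt_b <- sum(p1 == 0 & p3 == 1)   # b: 1 is N, 3 is Y -> 0
cnt_c <- sum(p1 == 1 & p3 == 0)   # c: 1 is Y, 3 is N -> 2
cnt_d <- sum(p1 == 1 & p3 == 1)   # d: both Y         -> 1
n     <- cnt_a + cnt_b + cnt_c + cnt_d

(cnt_b + cnt_c) / n               # binary Euclidean distance   = 2/3
(cnt_a + cnt_d) / n               # simple matching coefficient = 1/3
cnt_d / (cnt_b + cnt_c + cnt_d)   # Jaccard's index             = 1/3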
Revisit the hierarchical clustering algorithm

• Start with n clusters (1 record in each cluster)

• Step 1: the two closest records are merged into one cluster

• At every step, the pair of records/clusters with the smallest distance is merged:
  – two records are merged,
  – or a single record is added to an existing cluster,
  – or two existing clusters are combined
Distances Between Clusters: ‘single linkage’ (‘nearest neighbor’)
• Distance between 2 clusters = minimum distance between members of the two clusters

Distances Between Clusters: ‘complete linkage’ (‘farthest neighbor’)
• Distance between 2 clusters = greatest distance between members of the two clusters

Distances Between Clusters: ‘average linkage’
• Distance between 2 clusters = average of all distances between members of the two clusters

Distances Between Clusters: ‘centroid linkage’
• Distance between 2 clusters = distance between their centroids (centers)
Pairwise distance between Clusters

• Single linkage (nearest neighbor): minimum distance between members of the two clusters

• Complete linkage (farthest neighbor): greatest distance between members of the two clusters

• Average linkage: average of all distances between members of the two clusters

• Centroid linkage: distance between their centroids (centers)
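In R these linkage choices map onto the method argument of hclust(); a minimal sketch using the standardized matrix univ_z from the earlier example:

d <- dist(univ_z, method = "euclidean")

hc_single   <- hclust(d, method = "single")    # nearest neighbor
hc_complete <- hclust(d, method = "complete")  # farthest neighbor
hc_average  <- hclust(d, method = "average")   # average linkage
hc_centroid <- hclust(d, method = "centroid")  # centroid linkage
# (note: 'centroid' is conventionally run on squared Euclidean distances)

plot(hc_complete, labels = rownames(univ), main = "Business schools dendrogram")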
• Insert << Appendix 1 - Hierarchical
clustering by hand >>
Hierarchical Clustering: The Dendrogram

• The height of a branch denotes the distance between the entities that are getting merged at that level
UG Business Programs:
Universities Clustering.xls
Data for 25 undergraduate programs at business schools in US universities in 1995. (Slide callouts group the variables under Student Quality, Program, and Placement.)

Univ SAT Top10 Accept SFRatio Expenses GradRate


Brown 1310 89 22 13 22,704 94
CalTech 1415 100 25 6 63,575 81
CMU 1260 62 59 9 25,026 72
Columbia 1310 76 24 12 31,510 88
Cornell 1280 83 33 13 21,864 90
Dartmouth 1340 89 23 10 32,162 95
Duke 1315 90 30 12 31,585 95
Georgetown 1255 74 24 12 20,126 92
Harvard 1400 91 14 11 39,525 97
JohnsHopkins 1305 75 44 7 58,691 87
MIT 1380 94 30 10 34,870 91
Northwestern 1260 85 39 11 28,052 89
NotreDame 1255 81 42 13 15,122 94
PennState 1081 38 54 18 10,185 80
Princeton 1375 91 14 8 30,220 95
Purdue 1005 28 90 19 9,066 69
Stanford 1360 90 20 12 36,450 93
TexasA&M 1075 49 67 25 8,704 67
UCBerkeley 1240 95 40 17 15,140 78
UChicago 1290 75 50 13 38,380 87
UMichigan 1180 65 68 16 15,470 85
UPenn 1285 80 36 11 27,553 90
UVA 1225 77 44 14 13,349 92
UWisconsin 1085 40 69 15 11,857 71
Yale 1375 95 19 11 43,514 96

This dataset excludes image variables (student satisfaction, employer satisfaction, deans’ opinions).
Dendrogram for Business Schools
Euclidean distance & Single linkage
Row Id. University

1 Brown
2 CalTech
3 CMU
4 Columbia
5 Cornell
6 Dartmouth
7 Duke
8 Georgetown
9 Harvard
10 JohnsHopkins
11 MIT
12 Northwestern
13 NotreDame
14 PennState
15 Princeton
16 Purdue
17 Stanford
18 TexasA&M
19 UCBerkeley
20 UChicago
21 UMichigan
22 UPenn
23 UVA
24 UWisconsin
25 Yale
From Dendrograms to Clusters
• After the dendrogram is obtained, cut it to create clusters. How?
• Examine distance levels
• Cutpoint determines the # of clusters
• Obtain statistics on resulting clusters
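A hedged sketch of cutting the tree in R, continuing with hc_complete and univ from the earlier sketches; the choice of 4 clusters is purely illustrative:

membership <- cutree(hc_complete, k = 4)   # cut the dendrogram into 4 clusters
table(membership)                          # cluster sizes
split(rownames(univ), membership)          # which universities fall in each cluster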
Evaluating usefulness of clustering

• What characterizes each cluster?

• Can you give a “name” to each cluster?

• Does this give us any insight?

• Insert << Tableau exercise post HC >>
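Besides the Tableau exercise, a quick numeric profile can be sketched in R by averaging the original variables within each cluster (membership, univ and num_cols are the illustrative objects from the earlier sketches); the cluster means are a first step toward naming the clusters:

# Mean of each original variable within each cluster
aggregate(univ[, num_cols], by = list(cluster = membership), FUN = mean)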
Recap & Agenda
• Hierarchical clustering
• Homework from the last class:
– By hand appendix
– R code
– Gower’s similarity
– Case: Mall of America
– Try Tableau (Optional)
• Today:
– K-Means Clustering
– Case
Insights? Anything Interesting?
Row Id. University

1 Brown
2 CalTech
3 CMU
4 Columbia
5 Cornell
6 Dartmouth
7 Duke
8 Georgetown
9 Harvard
10 JohnsHopkins
11 MIT
12 Northwestern
13 NotreDame
14 PennState
15 Princeton
16 Purdue
17 Stanford
18 TexasA&M
19 UCBerkeley
20 UChicago
21 UMichigan
22 UPenn
23 UVA
24 UWisconsin
25 Yale
Dendrogram for Business Schools
Euclidean distance & Complete linkage
Distances for Mixed (numerical + categorical) Data

• Simple: standardize numerical variables, then use Euclidean distance for all

• Gower's General Dissimilarity Coefficient (next page)
Distances for Mixed (numerical + categorical) Data

• Gower's General Dissimilarity Coefficient:

  dij = Σk ( wijk · dijk ) / Σk wijk

  – dijk = distance contributed by the kth variable.
  – wijk = usually 1 or 0, depending on whether or not the comparison is valid for the kth variable (for example, the value may be missing).

  – In R (reference: https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html),
    • dijk = |xik − xjk| / Rk (Rk = range of variable k), if k is a numerical variable;
    • dijk = 0 if k is a categorical variable and i and j have the same value, 1 otherwise.
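A minimal sketch with the cluster package's daisy() function; the mixed-type data frame customers and its contents are hypothetical:

library(cluster)

# Gower's coefficient handles numeric and factor (categorical) columns together
gd <- daisy(customers, metric = "gower")

# The resulting dissimilarities can feed hierarchical clustering directly
hc_mixed <- hclust(as.dist(gd), method = "average")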
Non-Hierarchical Clustering: K-Means Clustering

• Gives a predetermined number (K) of non-overlapping clusters
  – Requires specifying the number of clusters

• Assigns records to clusters so as to improve homogeneity within each group
  – Clusters are homogeneous yet dissimilar to other clusters

• Needs measures of within-cluster similarity (homogeneity) and between-cluster similarity

• No hierarchy (no dendrogram)! The end product is the final cluster memberships

• Computationally cheap
  – Useful for large datasets
K-means clustering

• Predetermined number (K) of non-overlapping clusters

• Clusters are homogeneous yet dissimilar to other clusters

• Need measures of within-cluster similarity (homogeneity) and between-cluster similarity

• No hierarchy (no dendrogram)! End-product is final cluster memberships

• Useful for large datasets
K-means clustering

Algorithm minimizes within-cluster variance (heterogeneity):

1. For a user-specified value of K, partition the dataset into K initial clusters (next slide).
2. For each record, assign it to the cluster with the closest centroid.
3. Re-calculate centroids for the “losing” and “receiving” clusters. Can be done
   • after reassignment of each record, or
   • after one complete pass through all records (cheaper)
4. Repeat Steps 2-3 until no more reassignment is necessary.
Initial partition into K clusters
Initial partitions can be obtained by either
1. user-specified initial partitions, or
2. user-specified initial centers, or
3. random partitions (by software)
• Insert << Appendix 2 - K-Means
clustering by hand >>
Why might multiple start points (initial partitions) be necessary?

• K-means clustering is a minimization problem: it minimizes the within-cluster sum of squares
• Existence of multiple local minima
Convergence/robustness of K-means

• Procedure might oscillate indefinitely
• Convergence criterion:
  – Stop when a cluster centroid moves less than a % of the smallest distance between any of the centroids
  – Specify the maximum number of iterations
kmeans(x, centers, iter.max, nstart, ...)

• x: standardized data matrix
• centers:
  – either the number of clusters (a random set of distinct rows in x is chosen as the initial centers)
  – or a set of initial (distinct) cluster centers
• iter.max: the maximum number of iterations allowed
• nstart: if centers is a number, the number of random starts
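A hedged usage sketch on the standardized university data (univ_z from the earlier sketch); nstart = 25 runs 25 random starts and keeps the best solution found:

set.seed(1)                       # reproducible random starts
km <- kmeans(univ_z, centers = 4, iter.max = 100, nstart = 25)

km$cluster                        # cluster membership for each school
km$centers                        # centroids (in standardized units)
km$tot.withinss                   # total within-cluster sum of squares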
Selecting K

• Re-run the algorithm for different values of K

• Tradeoff: simplicity (interpretation) vs. adequacy (within-cluster homogeneity)

• Plot cluster variability (total within-cluster sum of squares) vs. K

• Choice is subjective!
Elbow Curve/Scree plot (figure: cluster variability, i.e., total within-cluster sum of squares, plotted against K)
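A minimal sketch of producing such a plot in R (univ_z as before; the range of K is illustrative):

k_values <- 1:8
wss <- sapply(k_values, function(k)
  kmeans(univ_z, centers = k, nstart = 25)$tot.withinss)

plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")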
Discussion point
• What if cluster variability with 5 clusters is higher than that with 3 clusters?
• Is it even possible? Why or why not?
Universities Example with k = 4

Cluster 1: CalTech, JohnsHopkins
Cluster 2: PennState, Purdue, TexasA&M, UWisconsin
Cluster 3: CMU, Cornell, Georgetown, Northwestern, NotreDame, UCBerkeley, UChicago, UMichigan, UPenn, UVA
Cluster 4: Brown, Columbia, Dartmouth, Duke, Harvard, MIT, Princeton, Stanford, Yale
Evaluating usefulness of clustering

• What characterizes each cluster?

• Can you give a “name” to each cluster?

• Does this give us any insight?
Final checks

• Cluster stability: do cluster assignments change dramatically if some inputs are slightly altered?
  – run the algorithm with different initial centers/partitions/data subsets

• Cluster separation: compare between-cluster variation to within-cluster variation
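One way to sketch the stability check in R: re-run k-means from different random starts (or on a data subset) and cross-tabulate the two membership vectors; counts concentrated in a few large cells suggest stable clusters (labels may simply be permuted between runs):

set.seed(2);  km_a <- kmeans(univ_z, centers = 4, nstart = 1)
set.seed(99); km_b <- kmeans(univ_z, centers = 4, nstart = 1)

table(run_a = km_a$cluster, run_b = km_b$cluster)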
K-Means vs. Hierarchical

K-Means: The Good
• Computationally fast for large datasets
• Useful when a certain K is needed

K-Means: The Bad
• Can take long to terminate
• Final solution not guaranteed to be “globally optimal”
• Different initial partitions can lead to different solutions
• Must re-run the algorithm for different values of K
• No dendrogram

Hierarchical: The Good
• Finds “natural” grouping – no need to specify the number of clusters
• Dendrogram: transparency of process, good for presentation

Hierarchical: The Bad
• Requires computation & storage of the n x n distance matrix
• Low stability: reordering data or dropping a few records can lead to a different solution
• Most distances are sensitive to outliers
Discussion point: What will be the outcome of cluster analysis in this case?

(Figures from http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means)

• Stuck in a local minimum
• Clustering non-clustered data
• Unevenly Sized Clusters
A few more examples of cluster analysis

• Cluster securities based on financial performance info (return, volatility, beta) and other info (industry and market capitalization). What can you do with it?

• For a given industry, cluster firms based on growth rate, profitability, market size, product range, and presence in various international markets. What can you do with it?