Cluster Analysis Notes

Chapter 15 of GBUS515 focuses on cluster analysis, which aims to group similar records for applications like market segmentation. It discusses hierarchical and non-hierarchical clustering methods, including algorithms like K-Means, and emphasizes the importance of validating clusters for meaningful insights. The chapter highlights the need for careful selection of parameters and the potential for random chance to affect clustering results.

GBUS515 – Business Intelligence and Information Systems

Chapter 15 – Cluster Analysis

Instructor – Dr. Sunita Goel


Adapted from Shmueli, Bruce & Patel, Data Mining for Business Analytics, 3e

© Galit Shmueli and Peter Bruce 2010


Clustering: The Main Idea

Goal: Form groups (clusters) of similar records

Used for segmenting markets into groups of similar customers

Example: Claritas segmented US neighborhoods based on demographics & income: “Furs & station wagons,” “Money & Brains,” …
Other Applications

— Periodic table of the elements
— Classification of species
— Grouping securities in portfolios
— Grouping firms for structural analysis of the economy
— Army uniform sizes
Example: Public Utilities

Goal: find clusters of similar utilities

Data: 22 firms, 8 variables:
— Fixed-charge covering ratio
— Rate of return on capital
— Cost per kilowatt capacity
— Annual load factor
— Growth in peak demand
— Sales
— % nuclear
— Fuel costs per kWh
Company Fixed_charge RoR Cost Load_factor Demand_growth Sales Nuclear Fuel_Cost
Arizona 1.06 9.2 151 54.4 1.6 9077 0 0.628
Boston 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
Central 1.43 15.4 113 53 3.4 9212 0 1.058
Commonwealth 1.02 11.2 168 56 0.3 6423 34.3 0.7
Con Ed NY 1.49 8.8 192 51.2 1 3300 15.6 2.044
Florida 1.32 13.5 111 60 -2.2 11127 22.5 1.241
Hawaiian 1.22 12.2 175 67.6 2.2 7642 0 1.652
Idaho 1.1 9.2 245 57 3.3 13082 0 0.309
Kentucky 1.34 13 168 60.4 7.2 8406 0 0.862
Madison 1.12 12.4 197 53 2.7 6455 39.2 0.623
Nevada 0.75 7.5 173 51.5 6.5 17441 0 0.768
New England 1.13 10.9 178 62 3.7 6154 0 1.897
Northern 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
Oklahoma 1.09 12 96 49.8 1.4 9673 0 0.588
Pacific 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
Puget 1.16 9.9 252 56 9.2 15991 0 0.62
San Diego 0.76 6.4 136 61.9 9 5714 8.3 1.92
Southern 1.05 12.6 150 56.7 2.7 10140 0 1.108
Texas 1.16 11.7 104 54 -2.1 13507 0 0.636
Wisconsin 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
United 1.04 8.6 204 61 3.5 6650 0 2.116
Virginia 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
Sales & Fuel Cost:
3 rough clusters can be seen (in the scatter plot sketched below):

— High fuel cost, low sales
— Low fuel cost, high sales
— Low fuel cost, low sales
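
The slide's scatter plot is not reproduced here, but the pattern is easy to recreate. A minimal matplotlib sketch (Python is an assumption; the slides themselves use XLMiner), with values taken from the table above:

import matplotlib.pyplot as plt

# The Sales and Fuel_Cost columns from the 22-utility table above
sales = [9077, 5088, 9212, 6423, 3300, 11127, 7642, 13082, 8406, 6455,
         17441, 6154, 7179, 9673, 6468, 15991, 5714, 10140, 13507, 7287,
         6650, 10093]
fuel_cost = [0.628, 1.555, 1.058, 0.7, 2.044, 1.241, 1.652, 0.309, 0.862,
             0.623, 0.768, 1.897, 0.527, 0.588, 1.4, 0.62, 1.92, 1.108,
             0.636, 0.702, 2.116, 1.306]

plt.scatter(sales, fuel_cost)
plt.xlabel("Sales")
plt.ylabel("Fuel cost per kWh")
plt.show()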


Extension to More Than 2 Dimensions

In the prior example, clustering was done by eye

Multiple dimensions require a formal algorithm with
— A distance measure
— A way to use the distance measure in forming clusters

We will consider two types of algorithms: hierarchical and non-hierarchical
Hierarchical Clustering

Hierarchical Methods

Agglomerative Methods
— Begin with n clusters (each record is its own cluster)
— Keep joining records into clusters until one cluster (the entire data set) is left
— Most popular

Divisive Methods
— Start with one all-inclusive cluster
— Repeatedly divide it into smaller clusters

A dendrogram shows the cluster hierarchy
Measuring Distance

— Between records
— Between clusters

Measuring Distance Between Records

Distance Between Two Records

Euclidean distance is most popular. For records i and j measured on p variables:

d_ij = sqrt[ (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 ]
Normalizing

Problem: raw distance measures are highly influenced by the scale of measurements

Solution: normalize (standardize) the data first
— Subtract the mean, divide by the standard deviation
— The results are also called z-scores
Example: Normalization

For the 22 utilities:
— Avg. sales = 8,914
— Std. dev. = 3,550

Normalized score for Arizona sales (reproduced in the sketch below):

(9,077 - 8,914)/3,550 = 0.046
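
A minimal Python sketch (not part of the original slides) that reproduces this normalization using the Sales column from the table above; the Arizona figure comes out of the last line:

# Sales column from the 22-utility table above
sales = [9077, 5088, 9212, 6423, 3300, 11127, 7642, 13082, 8406, 6455,
         17441, 6154, 7179, 9673, 6468, 15991, 5714, 10140, 13507, 7287,
         6650, 10093]

mean = sum(sales) / len(sales)
# Sample standard deviation (n - 1 in the denominator)
std = (sum((x - mean) ** 2 for x in sales) / (len(sales) - 1)) ** 0.5

# Normalized (z) score for Arizona, the first record
print((sales[0] - mean) / std)  # roughly 0.046
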
For Categorical Data: Similarity

To measure the distance between two records in terms of p binary (0/1) variables, create a table of counts (rows: record 1’s values; columns: record 2’s values; p = a + b + c + d):

  0 1
0 a b
1 c d

Similarity metrics based on this table (see the sketch below):
— Matching coef. = (a+d)/p
— Jaccard’s coef. = d/(b+c+d)
— Use Jaccard’s in cases where a matching “1” is much greater evidence of similarity than a matching “0” (e.g., “owns Corvette”)
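
A minimal Python sketch (an illustration, not from the slides) of both coefficients for one pair of hypothetical records over p = 5 binary variables:

rec1 = [1, 0, 1, 1, 0]  # hypothetical 0/1 records
rec2 = [1, 0, 0, 1, 1]

a = sum(x == 0 and y == 0 for x, y in zip(rec1, rec2))  # both 0
b = sum(x == 0 and y == 1 for x, y in zip(rec1, rec2))
c = sum(x == 1 and y == 0 for x, y in zip(rec1, rec2))
d = sum(x == 1 and y == 1 for x, y in zip(rec1, rec2))  # both 1
p = a + b + c + d                                       # number of variables

print((a + d) / p)      # matching coefficient: 0.6
print(d / (b + c + d))  # Jaccard's coefficient: 0.5 (ignores 0-0 matches)
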
Other Distance Measures

— Correlation-based similarity
— Statistical distance (Mahalanobis)
— Manhattan distance (absolute differences)
— Maximum coordinate distance
— Gower’s similarity (for mixed variable types: continuous & categorical)
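
Several of these are available directly in scipy (an assumption; the slides name no software for this step). A minimal sketch on two toy records:

from scipy.spatial.distance import euclidean, cityblock, chebyshev

u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, 3.0]

print(euclidean(u, v))  # sqrt(9 + 4 + 0), about 3.606
print(cityblock(u, v))  # Manhattan distance: 3 + 2 + 0 = 5
print(chebyshev(u, v))  # maximum coordinate distance: 3
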
Measuring Distance Between Clusters

Minimum Distance (Cluster A to Cluster B)
— Also called single linkage
— Distance between two clusters is the distance between the pair of records Ai and Bj that are closest

Maximum Distance (Cluster A to Cluster B)
— Also called complete linkage
— Distance between two clusters is the distance between the pair of records Ai and Bj that are farthest from each other

Average Distance
— Also called average linkage
— Distance between two clusters is the average of all possible pairwise distances

Centroid Distance
— Distance between two clusters is the distance between the two cluster centroids
— The centroid is the vector of variable averages for all records in a cluster
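
A minimal NumPy/scipy sketch (illustration only, not from the slides) computing all four between-cluster distances for two toy clusters A and B:

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # two records in cluster A
B = np.array([[4.0, 0.0], [5.0, 0.0]])  # two records in cluster B

pairwise = cdist(A, B)  # all pairwise record-to-record distances
print(pairwise.min())   # single linkage (minimum distance): 3.0
print(pairwise.max())   # complete linkage (maximum distance): 5.0
print(pairwise.mean())  # average linkage: 4.0
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # centroid distance: 4.0
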
The Hierarchical Clustering Steps (Agglomerative Method)

1. Start with n clusters (each record is its own cluster)
2. Merge the two closest records into one cluster
3. At each successive step, the two clusters closest to each other are merged

The dendrogram, read from the bottom up, illustrates the process (a code sketch follows below)
— Records 12 & 21 are closest & form the first cluster
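
The slides use XLMiner; as a rough Python equivalent (an assumption, using scipy and matplotlib), here is a minimal sketch of agglomerative clustering and its dendrogram on a few normalized rows of the utilities table:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.stats import zscore

names = ["Arizona", "Boston", "Central", "Con Ed NY", "Nevada"]
X = np.array([  # all 8 variables for five of the 22 utilities
    [1.06,  9.2, 151, 54.4, 1.6,  9077,  0.0, 0.628],
    [0.89, 10.3, 202, 57.9, 2.2,  5088, 25.3, 1.555],
    [1.43, 15.4, 113, 53.0, 3.4,  9212,  0.0, 1.058],
    [1.49,  8.8, 192, 51.2, 1.0,  3300, 15.6, 2.044],
    [0.75,  7.5, 173, 51.5, 6.5, 17441,  0.0, 0.768],
])

Z = linkage(zscore(X, axis=0), method="single")  # single linkage; try "complete", "average"
dendrogram(Z, labels=names)
plt.ylabel("Distance between clusters")
plt.show()
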
Reading the Dendrogram

See the process of clustering: lines connected lower down are merged earlier
— 10 and 13 will be merged next, after 12 & 21

Determining the number of clusters: for a given “distance between clusters,” draw a horizontal line at that height; the branches it cuts define the clusters (see the sketch below)
— E.g., at a distance of 4.6 (red line in the next slide), the data can be reduced to 2 clusters; the smaller of the two is circled
— At a distance of 3.6 (green line), the data can be reduced to 6 clusters, including the circled cluster
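
Continuing the scipy sketch above (again an assumption, not the slides' XLMiner workflow), cutting the dendrogram at a chosen height is a one-liner; the 4.6 threshold mirrors the red line described here:

from scipy.cluster.hierarchy import fcluster

# Z is the linkage matrix from the previous sketch
labels = fcluster(Z, t=4.6, criterion="distance")  # undo all merges above height 4.6
print(labels)  # cluster number for each record
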
Validating Clusters

Interpretation
Goal: obtain meaningful and useful clusters

Caveats:
(1) Random chance can often produce apparent clusters
(2) Different clustering methods produce different results

Solutions:
— Obtain summary statistics
— Also review clusters in terms of variables not used in the clustering
— Label the cluster (e.g., a clustering of financial firms in 2008 might yield a label like “midsize, sub-prime loser”)
Desirable Cluster Features

Stability – are clusters and cluster assignments sensitive to slight changes in inputs? Are cluster assignments in partition B similar to those in partition A?

Separation – check the ratio of between-cluster variation to within-cluster variation (higher is better; a sketch follows below)
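
A minimal NumPy sketch (an illustration, not from the slides) of the separation check, using the standard between/within sum-of-squares decomposition:

import numpy as np

def separation_ratio(X, labels):
    overall = X.mean(axis=0)
    between = within = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall) ** 2)
        within += np.sum((members - centroid) ** 2)
    return between / within  # higher means better-separated clusters

X = np.array([[0.0], [0.2], [5.0], [5.2]])          # toy data
print(separation_ratio(X, np.array([0, 0, 1, 1])))  # large ratio: well separated
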
Nonhierarchical Clustering: K-Means Clustering

K-Means Clustering Algorithm (sketched in code below)
1. Choose the desired number of clusters, k
2. Start with a partition into k clusters, often based on a random selection of k centroids
3. At each step, move each record to the cluster with the closest centroid
4. Recompute the centroids and repeat step 3
5. Stop when moving any record would increase within-cluster dispersion
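
A minimal NumPy sketch of these steps (an illustration of the algorithm, not the XLMiner implementation); it assumes no cluster ever empties out:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random start
    for _ in range(n_iter):
        # step 3: move each record to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute centroids (assumes every cluster keeps >= 1 record)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # step 5: stop once assignments stabilize
            break
        centroids = new_centroids
    return labels, centroids
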
K-Means Algorithm: Choosing k and Initial Partitioning

Choose k based on how the results will be used
— e.g., “How many market segments do we want?”

Also experiment with slightly different k’s (see the sketch below)

The initial partition into clusters can be random, or based on domain knowledge
— If using a random partition, repeat the process with different random partitions
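
With scikit-learn (an assumption; `X` stands for the 22 x 8 utilities matrix, assumed loaded), trying several k's with many random starts takes a few lines; `inertia_` is the total within-cluster dispersion:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_norm = StandardScaler().fit_transform(X)  # normalize, as discussed earlier
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_norm)
    print(k, km.inertia_)  # compare within-cluster dispersion across values of k
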
XLMiner Output: Cluster Centroids

Cluster    Fixed_charge  RoR   Cost  Load_factor
Cluster-1  0.89          10.3  202   57.9
Cluster-2  1.43          15.4  113   53.0
Cluster-3  1.06           9.2  151   54.4

We chose k = 3

4 of the 8 variables are shown

Distance Between Clusters

Distance between clusters:

           Cluster-1   Cluster-2   Cluster-3
Cluster-1  0           5.03216253  3.16901457
Cluster-2  5.03216253  0           3.76581196
Cluster-3  3.16901457  3.76581196  0

Clusters 1 and 2 are relatively well separated from each other, while cluster 3 is not as well separated
Within-Cluster Dispersion

Data summary (in original coordinates):

Cluster    #Obs  Average distance in cluster
Cluster-1  12    1748.348058
Cluster-2  3     907.6919822
Cluster-3  7     3625.242085
Overall    22    2230.906692

Clusters 1 and 2 are relatively tight; cluster 3 is very loose

Conclusion: clusters 1 & 2 are well defined, but cluster 3 is not

Next step: try again with k=2 or k=4
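
A minimal NumPy sketch of one plausible reading of the "Average distance in cluster" column above (mean distance from each record to its own cluster centroid; `X` and `labels` are assumed to hold the original-coordinate data and the k-means assignments):

import numpy as np

for k in np.unique(labels):
    members = X[labels == k]  # records assigned to cluster k
    centroid = members.mean(axis=0)
    avg_dist = np.linalg.norm(members - centroid, axis=1).mean()
    print(k, len(members), avg_dist)  # cluster, #obs, avg. distance in cluster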


Summary

— Cluster analysis is an exploratory tool; it is useful only when it produces meaningful clusters
— Hierarchical clustering gives a visual representation of clustering at different levels
— On the other hand, due to its non-iterative nature, hierarchical clustering can be unstable, can vary greatly depending on settings, and is computationally expensive
— Non-hierarchical clustering is computationally cheap and more stable, but requires the user to set k
— Can use both methods
— Be wary of chance results; the data may not have definitive “real” clusters
Chapter Exercises
(Updated in Canvas)
