Hierarchical Clustering: DSCI 5240 Data Mining and Machine Learning For Business

The document discusses hierarchical clustering, which identifies hierarchical relationships in data through a nested tree structure, unlike partitional clustering. It describes agglomerative clustering, which starts with each object as its own cluster and iteratively merges the closest clusters. A dendrogram provides a graphical representation of the hierarchical structure, showing cluster merges and their distances. Common hierarchical approaches include agglomerative (bottom-up) and divisive (top-down). An example demonstrates the agglomerative process.

DSCI 5240

Hierarchical Clustering
DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero

Hierarchies are Everywhere!

“Every seeming equality conceals a hierarchy.”

Mason Cooley


Hierarchical Clustering


What is Hierarchical Clustering?

• Hierarchical clustering identifies hierarchical levels within the data
• The goal is to identify the hierarchies among the n objects in a dataset so that they can be represented in a nested tree structure
• Hierarchies are not identified in partitional clustering (e.g., k-means)


Partitional vs. Hierarchical Clustering

k-Means Cluster Analysis
• Single partition of the data
• Number of clusters must be specified a priori
• Relatively fast on large datasets
• Ideally clusters are hyper-spherical
• In theory, non-repeatable due to random selection of initial seeds. In practice, we fix the random seed so that results can be reproduced.

Hierarchical Cluster Analysis
• Multiple partitions of the data depending on the level of hierarchy
• Number of clusters is not required in advance
• Slow on large datasets
• May be used (with caution) on differently shaped data
• Repeatable results once you select a method to calculate distances

Neither is categorically better than the other. They are just different!

Note: SAS EM (SAS Enterprise Miner) does not do hierarchical clustering. But that is OK! Remember, this course is about the algorithms.

Dendrogram
• Dendrogram: a graphical representation of the hierarchical structure of the clusters
• The height of each connection reflects the distance between the clusters it joins
• Clusters are obtained by “cutting” the dendrogram at the desired distance (i.e., we specify that clusters more than x distance apart should not be combined)

[Figure: dendrogram of objects 1 to 5, with Object on the horizontal axis and Distance on the vertical axis]
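
The "cutting" idea can be sketched in a few lines of Python. This is only an illustration: the function name `cut_dendrogram`, the merge-history format, and the distances below are all hypothetical, and the sketch assumes the merges are listed in increasing order of distance.

```python
def cut_dendrogram(labels, merges, max_distance):
    """Return the clusters obtained by cutting at `max_distance`.

    `merges` is a list of (cluster_a, cluster_b, distance) tuples, ordered
    by increasing distance; clusters are tuples of labels. Merges made at a
    distance greater than `max_distance` are not applied.
    """
    clusters = {(label,) for label in labels}
    for a, b, dist in merges:
        if dist > max_distance:
            break  # every later merge is at least this far apart
        clusters.discard(a)
        clusters.discard(b)
        clusters.add(a + b)
    return sorted(clusters)

# Hypothetical merge history for five objects (distances are made up).
labels = [1, 2, 3, 4, 5]
merges = [
    ((1,), (2,), 0.5),
    ((4,), (5,), 0.8),
    ((3,), (4, 5), 1.5),
    ((1, 2), (3, 4, 5), 3.0),
]

print(cut_dendrogram(labels, merges, 1.0))  # three clusters
print(cut_dendrogram(labels, merges, 2.0))  # two clusters
```

Raising the cut height applies more merges and therefore yields fewer, larger clusters.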


Reading a Dendrogram

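[Figure: dendrogram with two horizontal cut lines. Cutting at a distance of 4.6 gives 2 clusters; cutting at 3.6 gives 6 clusters. The height of the link joining objects 2 and 22 shows how far apart they are.]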

Common Hierarchical Approaches

Agglomerative (Bottom Up)
• Begin with n clusters (each observation is a singleton cluster)
• Keep joining the clusters with the smallest distance until one cluster is left (the entire data set)
• Most popular approach

Divisive (Top Down)
• Start with one all-inclusive cluster
• Repeatedly divide into smaller clusters
• One common method is to recursively use k-means
• Less popular

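[Figure: agglomerative (AGNES) vs. divisive (e.g., DIANA) clustering of objects a to e along a distance axis. Read left to right (Step 0 to Step 4), the agglomerative approach merges a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde. Read right to left (Step 0 to Step 4), the divisive approach performs the same splits in reverse.]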

Agglomerative Nesting (AGNES)

Basic Agglomerative Algorithm

1. Let each data point be a cluster
2. Compute distances between clusters (this turns out to be somewhat complicated!)
3. Merge the two closest clusters
4. Repeat steps 2 and 3
5. End when a single cluster remains
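
The five steps can be sketched in plain Python. This is a minimal, unoptimized illustration rather than a reference implementation: the function names, the merge-history format, and the toy data are hypothetical, and the distance between clusters is assumed to be the distance between their centroids (other choices can be passed in as `cluster_distance`).

```python
import math

def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def centroid_distance(cluster_a, cluster_b):
    """Distance between two clusters, measured between their centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

def agglomerate(points, cluster_distance=centroid_distance):
    """Run steps 1-5 and return the merge history.

    `points` maps a label to an (x, y) tuple. Each history entry is
    (labels_of_first_cluster, labels_of_second_cluster, distance).
    """
    # Step 1: every data point starts as its own cluster.
    clusters = {(label,): [xy] for label, xy in points.items()}
    history = []
    # Step 5: stop once a single cluster remains.
    while len(clusters) > 1:
        # Step 2: compute the distance between every pair of clusters.
        keys = list(clusters)
        pairs = [
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for i, a in enumerate(keys)
            for b in keys[i + 1:]
        ]
        # Step 3: merge the two closest clusters.
        dist, a, b = min(pairs)
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, dist))
        # Step 4: the loop repeats steps 2 and 3.
    return history

# Hypothetical toy data, chosen only for illustration.
toy = {"p": (0, 0), "q": (0, 1), "r": (4, 0), "s": (4, 3)}
for first, second, dist in agglomerate(toy):
    print(first, "+", second, "at distance", round(dist, 2))
```

Recomputing every pairwise distance on each pass keeps the sketch short, but it is also why the naive algorithm is slow on large datasets.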

Agglomerative Example
Step 1: Each observation is a cluster

Obs   X   Y
A     1   1
B     3   1
C     2   4
D     2   5
E     5   4

[Figure: scatter plot of the five observations, X vs. Y]

Iteration 1

Agglomerative Example
Step 2: Compute distances between clusters (centroid method)

      A     B     C     D     E
A   0.00
B   2.00  0.00
C   3.16  3.16  0.00
D   4.12  4.12  1.00  0.00
E   5.00  3.61  3.00  3.16  0.00
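
The first distance matrix can be recomputed from the Step 1 coordinates. The sketch below assumes Euclidean distance, which is consistent with the tabulated values; the helper name `distance_matrix` is hypothetical.

```python
import math

# Observations from the Step 1 table.
points = {"A": (1, 1), "B": (3, 1), "C": (2, 4), "D": (2, 5), "E": (5, 4)}

def distance_matrix(pts):
    """Euclidean distance between every pair of labelled points."""
    return {(a, b): math.dist(pts[a], pts[b]) for a in pts for b in pts}

dist = distance_matrix(points)
labels = list(points)

# Print the lower triangle with two decimals, one row per observation.
for i, a in enumerate(labels):
    row = "  ".join(f"{dist[a, b]:.2f}" for b in labels[: i + 1])
    print(a, row)

# The closest pair of distinct points is the first merge candidate.
# Note that B-E evaluates to sqrt(13), about 3.61.
closest = min((d, pair) for pair, d in dist.items() if pair[0] < pair[1])
print(closest)
```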

Agglomerative Example
Step 3: Merge the two closest clusters

The smallest off-diagonal distance is 1.00, between C and D, so these two clusters are merged. The merged cluster CD is represented by its centroid:

Obs   X   Y
A     1   1
B     3   1
CD    2   4.5
E     5   4

Iteration 2

Agglomerative Example
Step 2: Compute distances between clusters (centroid method)

      A     B     CD    E
A   0.00
B   2.00  0.00
CD  3.64  3.64  0.00
E   5.00  3.61  3.04  0.00

Agglomerative Example
Step 3: Merge the two closest clusters

The smallest distance is now 2.00, between A and B, so they are merged. The merged cluster AB is represented by its centroid:

Obs   X   Y
AB    2   1
CD    2   4.5
E     5   4

Iteration 3

Agglomerative Example
Step 2: Compute distances between clusters (centroid method)

      AB    CD    E
AB  0.00
CD  3.50  0.00
E   4.24  3.04  0.00

Agglomerative Example
Step 3: Merge the two closest clusters

The smallest distance is 3.04, between CD and E, so they are merged. The centroid of CDE is the mean of points C, D, and E:

Obs   X   Y
AB    2   1
CDE   3   4.33

Iteration 4

Agglomerative Example
Step 2: Compute distances between clusters (centroid method)

      AB    CDE
AB   0.00
CDE  3.48  0.00

Agglomerative Example
Step 3: Merge the two closest clusters

Only two clusters remain, 3.48 apart. Merging them leaves a single cluster containing every observation, so the algorithm stops:

Obs     X     Y
ABCDE   2.6   3.0
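
The centroids reported after each merge are size-weighted means of the centroids being merged, which is the same as averaging all member points. A small sketch of that arithmetic, where the helper name `merge_centroids` is hypothetical:

```python
def merge_centroids(c_a, n_a, c_b, n_b):
    """Centroid of the union of two clusters, from their centroids and sizes."""
    total = n_a + n_b
    return tuple((n_a * a + n_b * b) / total for a, b in zip(c_a, c_b))

cd = merge_centroids((2, 4), 1, (2, 5), 1)   # C + D
ab = merge_centroids((1, 1), 1, (3, 1), 1)   # A + B
cde = merge_centroids(cd, 2, (5, 4), 1)      # CD (2 points) + E
abcde = merge_centroids(ab, 2, cde, 3)       # AB (2 points) + CDE (3 points)

for name, c in [("CD", cd), ("AB", ab), ("CDE", cde), ("ABCDE", abcde)]:
    print(name, tuple(round(v, 2) for v in c))
```

Weighting by cluster size matters: averaging the CD centroid and E with equal weights would not give the mean of the three points.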

Recap

[Figure: scatter plot of observations A to E (X vs. Y) alongside the resulting dendrogram (Object vs. Distance)]

But How Should We Measure Distance?

• In k-Means, we only ever calculated the distance between points and centroids

• In hierarchical clustering, we have to determine the distance between
  • Two points (Step 1)
  • A point and a group of points (cluster)
  • A group of points (cluster) and another group of points (cluster)

• What point within a cluster should represent the cluster in distance calculations?

• It turns out there are many options and they can result in different cluster structures


Calculating Distances

Distance between what?

• MIN (Single Linkage) – Distance between two clusters is the distance between the two closest points in the (different) clusters
• MAX (Complete Linkage) – Distance between two clusters is the distance between the two farthest points in the (different) clusters
• Group Average (Average Linkage) – Distance between two clusters is the average pairwise distance between points in the (different) clusters
• Distance Between Centers – Distance between two clusters is the distance between the cluster centroids
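
The four definitions translate almost directly into code. The sketch below assumes Euclidean distance between 2-D points; the function names and the two small clusters are hypothetical and only serve to show that the methods can disagree on the same data.

```python
import math
from itertools import product

def single_linkage(a, b):
    """MIN: distance between the two closest points across the clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """MAX: distance between the two farthest points across the clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Group average: mean of all pairwise cross-cluster distances."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid_linkage(a, b):
    """Distance between centers: distance between the cluster centroids."""
    def centroid(points):
        return tuple(sum(coord) / len(points) for coord in zip(*points))
    return math.dist(centroid(a), centroid(b))

# Two hypothetical clusters, chosen only for illustration.
left = [(0, 0), (0, 2)]
right = [(3, 0), (5, 2)]

for name, fn in [
    ("MIN", single_linkage),
    ("MAX", complete_linkage),
    ("Group average", average_linkage),
    ("Centers", centroid_linkage),
]:
    print(f"{name}: {fn(left, right):.2f}")
```

Any of these functions could serve as the cluster-to-cluster distance in an agglomerative loop; the choice changes which pair looks "closest" at each step.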

[Figures: illustrations of MIN (single linkage), MAX (complete linkage), group average (average linkage), and distance between centers]

How Does this Affect Cluster Structure?

• The clusters produced by each distance calculation method depend heavily on the shape of the original data
• Assume we have data that looks like this:

MIN
• MIN is better at handling non-elliptical shapes
• It will likely result in cleaner (more interpretable) clusters


MAX
• MAX has a tendency to “jump gaps”
• It often breaks large clusters
• Its results would be less interpretable


How Does this Affect Cluster Structure?

• Noise and outliers can also have an impact
• Assume we have data that looks like this:


MIN
• MIN is more sensitive to noise and outliers
• In this instance, the clusters would be difficult to interpret


MAX
• MAX manages noise and outliers better
• It produces better clusters in this situation


Hierarchical Clustering Issues

• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  • Sensitivity to noise and outliers
  • Difficulty handling clusters of different sizes and non-convex shapes
  • Breaking large clusters


Validating Clusters
• Goal is to obtain meaningful/useful clusters
• Random chance can produce apparent clusters
• Different clustering methods produce different results
• Hence, we need to
  • Obtain summary statistics
  • Review clusters with respect to variables not used in clustering
  • Label clusters
• Look for
  • Stability – sensitivity to minor input changes
  • Separation – ratio of intra-cluster and inter-cluster variations
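
The separation check can be sketched as follows. The slide does not define the ratio precisely, so this is an assumption: intra-cluster variation is taken as the within-cluster sum of squared distances to each cluster centroid, and inter-cluster variation as the size-weighted sum of squared distances from each cluster centroid to the overall centroid. The function names are hypothetical.

```python
def centroid(points):
    """Mean of a list of points given as coordinate tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def separation_ratio(clusters):
    """Within-cluster variation divided by between-cluster variation.

    Assumed definition (see the text above); lower values mean tighter,
    better separated clusters. Needs at least two distinct centroids.
    """
    all_points = [p for cluster in clusters for p in cluster]
    grand = centroid(all_points)
    within = sum(
        squared_distance(p, centroid(cluster))
        for cluster in clusters
        for p in cluster
    )
    between = sum(
        len(cluster) * squared_distance(centroid(cluster), grand)
        for cluster in clusters
    )
    return within / between

# The two-cluster solution from the agglomerative example (AB and CDE).
ab = [(1, 1), (3, 1)]
cde = [(2, 4), (2, 5), (5, 4)]
print(round(separation_ratio([ab, cde]), 2))
```

Stability could be probed in a similar spirit by re-running the clustering after small perturbations of the input and comparing the resulting assignments.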
