
Hierarchical Clustering: DSCI 5240 Data Mining and Machine Learning For Business

The document discusses hierarchical clustering, which identifies hierarchical relationships in data through a nested tree structure, unlike partitional clustering. It describes agglomerative clustering, which starts with each object as its own cluster and iteratively merges the closest clusters. A dendrogram provides a graphical representation of the hierarchical structure, showing cluster merges and their distances. Common hierarchical approaches include agglomerative (bottom-up) and divisive (top-down). An example demonstrates the agglomerative process.
DSCI 5240

Hierarchical Clustering
DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero

Hierarchies are Everywhere!

“Every seeming equality conceals a hierarchy.”


Mason Cooley


Hierarchical Clustering


What is Hierarchical Clustering?

• Hierarchical clustering identifies hierarchical levels within the data
• Goal is to identify the hierarchies among the n objects in a dataset such that they can be represented in a nested tree structure
• Hierarchies are not identified in partitional clustering (k-means)

Partitional vs. Hierarchical Clustering

[Figure: two panels, k-Means Cluster Analysis and Hierarchical Cluster Analysis]

k-Means Cluster Analysis
• Single partition of the data
• Number of clusters must be specified a priori
• Relatively fast on large datasets
• Ideally clusters are hyper-spherical
• In theory, non-repeatable due to random selection of initial seeds. In practice, we fix the random seed so results can be reproduced.

Hierarchical Cluster Analysis
• Multiple partitions of the data depending on the level of hierarchy
• Number of clusters is not required in advance
• SLOW on large datasets
• May be used (with caution) on differently shaped data
• Repeatable results once you select a method to calculate distances

Neither is categorically better than the other. They are just different!

Note: SAS EM does not do hierarchical clustering. But that is OK! Remember, this course is about the algorithms.

Dendrogram
• Dendrogram - a graphical representation of the hierarchical structure of the clusters
• Height of each connection reflects the distance between clusters
• Clusters are obtained by “cutting” the dendrogram at the desired distance (i.e., we specify that clusters more than a distance x apart should not be combined)

[Figure: dendrogram of objects 1 to 5; x-axis: Object, y-axis: Distance]
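
The cutting rule can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the merge history for objects 1 to 5 is hypothetical, and `cut_dendrogram` is a made-up helper name. It applies only the merges whose distance is at or below the chosen cut and reports the clusters that remain.

```python
def cut_dendrogram(n_objects, merges, max_distance):
    """Return the clusters formed by merges at or below max_distance.

    merges: list of (i, j, distance); each entry joins the clusters that
    currently contain objects i and j (0-based indices).
    """
    parent = list(range(n_objects))

    def find(i):
        # Follow parent links up to the representative of i's cluster.
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j, distance in merges:
        if distance <= max_distance:
            parent[find(j)] = find(i)

    clusters = {}
    for obj in range(n_objects):
        clusters.setdefault(find(obj), []).append(obj + 1)  # 1-based labels
    return sorted(clusters.values())


# Hypothetical merge history for objects 1..5 (stored with 0-based indices).
merges = [(0, 1, 1.0), (3, 4, 1.5), (2, 3, 2.5), (0, 2, 4.0)]

print(cut_dendrogram(5, merges, max_distance=2.0))  # [[1, 2], [3], [4, 5]]
print(cut_dendrogram(5, merges, max_distance=3.0))  # [[1, 2], [3, 4, 5]]
```

Raising the cut produces fewer, larger clusters; lowering it produces more, smaller ones.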

Reading a Dendrogram

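[Figure: dendrogram with three annotations: cutting at a distance of 4.6 gives 2 clusters; cutting at 3.6 gives 6 clusters; the height of the link joining objects 2 and 22 shows how far apart they are]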

Common Hierarchical Approaches

Agglomerative (Bottom Up)
• Begin with n clusters (each observation is a singleton cluster)
• Keep joining the two clusters with the smallest distance until one cluster is left (the entire data set)
• Most popular approach

Divisive (Top Down)
• Start with one all-inclusive cluster
• Repeatedly divide into smaller clusters
• One common method is to recursively use k-means
• Less popular

Common Hierarchical Approaches


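[Figure: objects a, b, c, d, e along a Distance axis. Agglomerative (AGNES) reads left to right, Step 0 to Step 4: a and b merge into ab, d and e into de, c and de into cde, and ab and cde into abcde. Divisive (e.g., DIANA) reads right to left, Step 0 to Step 4, splitting abcde back into single objects.]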

Agglomerative Nesting
(AGNES)


Basic Agglomerative Algorithm

1. Let each data point be a cluster
2. Compute distances between clusters (this turns out to be somewhat complicated!)
3. Merge the two closest clusters
4. Repeat steps 2 and 3
5. End when a single cluster remains
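
The five steps can be sketched as a short Python loop. This is an illustrative sketch, not the course's implementation: the function names are made up, the centroid distance is only one possible choice for step 2, ties go to the first pair found, and the data are the five observations from the worked example in this deck.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean position of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))


def agglomerate(data, cluster_distance=centroid_distance):
    """Merge clusters bottom-up and return the merge history."""
    # Step 1: every data point starts as its own cluster.
    clusters = {label: [point] for label, point in data.items()}
    history = []
    # Steps 4 and 5: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: evaluate the distance between every pair of clusters.
        # Step 3: pick the closest pair.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = cluster_distance(clusters[a], clusters[b])
        merged = "".join(sorted(a + b))
        clusters[merged] = clusters.pop(a) + clusters.pop(b)
        history.append((merged, round(distance, 2)))
    return history


data = {"A": (1, 1), "B": (3, 1), "C": (2, 4), "D": (2, 5), "E": (5, 4)}
for label, distance in agglomerate(data):
    print(label, distance)
```

Passing a different `cluster_distance` function changes how step 2 is carried out without touching the rest of the loop.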


Agglomerative Example
Step 1: Each observation is a cluster

Obs  X  Y
A    1  1
B    3  1
C    2  4
D    2  5
E    5  4

[Figure: scatter plot of the five observations; axes X and Y]

Iteration 1


Agglomerative Example
Step 2: Compute distances between clusters (centroid method)

     A     B     C     D     E
A  0.00
B  2.00  0.00
C  3.16  3.16  0.00
D  4.12  4.12  1.00  0.00
E  5.00  3.61  3.00  3.16  0.00

[Figure: scatter plot of the five observations; axes X and Y]
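
The pairwise distances can be recomputed with a short Python sketch (an illustration added here, using only the coordinates given in the example). Because every cluster is still a single observation at this point, the centroid distance is simply the Euclidean distance between the two points.

```python
import math

# Coordinates of the five observations from the example.
points = {"A": (1, 1), "B": (3, 1), "C": (2, 4), "D": (2, 5), "E": (5, 4)}

labels = list(points)
# distances[p][q] is the Euclidean distance between observations p and q.
distances = {
    p: {q: math.dist(points[p], points[q]) for q in labels} for p in labels
}

# Print the lower triangle, one row per observation.
print("   " + "".join(f"{q:>6}" for q in labels))
for i, p in enumerate(labels):
    row = "".join(f"{distances[p][q]:6.2f}" for q in labels[: i + 1])
    print(f"{p:<3}{row}")

# The pair with the smallest distance is the one merged in step 3.
closest = min(
    ((p, q) for i, p in enumerate(labels) for q in labels[:i]),
    key=lambda pair: distances[pair[0]][pair[1]],
)
print("closest pair:", closest)
```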


Agglomerative Example
Step 3: Merge the two closest clusters

     A     B     C     D     E
A  0.00
B  2.00  0.00
C  3.16  3.16  0.00
D  4.12  4.12  1.00  0.00
E  5.00  3.61  3.00  3.16  0.00

Closest pair: C and D (distance 1.00)

[Figure: scatter plot of the five observations; axes X and Y]

Agglomerative Example
Step 3: Merge the two closest clusters

Obs  X  Y
A    1  1
B    3  1
CD   2  4.5
E    5  4

[Figure: scatter plot with C and D replaced by cluster CD; axes X and Y]
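
Under the centroid method, the merged cluster CD is represented by the mean of its members. A minimal Python sketch of that arithmetic follows; the helper name `centroid` is illustrative, not from the slides.

```python
import math


def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


C, D = (2, 4), (2, 5)
cd = centroid([C, D])
print(cd)  # (2.0, 4.5)

# Distance from A = (1, 1) to the new cluster, used in the next iteration.
print(round(math.dist((1, 1), cd), 2))  # 3.64
```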


Iteration 2


Agglomerative Example
Step 2: Compute distances between clusters (centroid method)

      A     B     CD    E
A   0.00
B   2.00  0.00
CD  3.64  3.64  0.00
E   5.00  3.61  3.04  0.00

[Figure: scatter plot of clusters A, B, CD, and E; axes X and Y]

Agglomerative Example
Step 3: Merge the two closest clusters

      A     B     CD    E
A   0.00
B   2.00  0.00
CD  3.64  3.64  0.00
E   5.00  3.61  3.04  0.00

Closest pair: A and B (distance 2.00)

[Figure: scatter plot of clusters A, B, CD, and E; axes X and Y]

Agglomerative Example
Step 3: Merge the two closest clusters
Obs  X  Y
AB   2  1
CD   2  4.5
E    5  4

[Figure: scatter plot of clusters AB, CD, and E; axes X and Y]

Iteration 3


Agglomerative Example
Step 2: Compute distances between clusters (centroid method)

      AB    CD    E
AB  0.00
CD  3.50  0.00
E   4.24  3.04  0.00

[Figure: scatter plot of clusters AB, CD, and E; axes X and Y]

Agglomerative Example
Step 3: Merge the two closest clusters

      AB    CD    E
AB  0.00
CD  3.50  0.00
E   4.24  3.04  0.00

Closest pair: CD and E (distance 3.04)

[Figure: scatter plot of clusters AB, CD, and E; axes X and Y]

Agglomerative Example
Step 3: Merge the two closest clusters
Obs  X  Y
AB   2  1
CDE  3  4.3

[Figure: scatter plot of clusters AB and CDE; axes X and Y]
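
The CDE centroid is the mean of all three member observations, which is not the midpoint of the CD centroid and E. A short Python sketch of the size-weighted update is given below; `merge_centroids` is a hypothetical helper name used only for illustration.

```python
def merge_centroids(c1, n1, c2, n2):
    """Centroid of the union of two clusters, given centroids and sizes."""
    total = n1 + n2
    return tuple((n1 * a + n2 * b) / total for a, b in zip(c1, c2))


cd = (2.0, 4.5)  # centroid of CD, which holds 2 observations
e = (5.0, 4.0)   # E, a single observation

cde = merge_centroids(cd, 2, e, 1)    # weighted by cluster size
naive = merge_centroids(cd, 1, e, 1)  # unweighted midpoint, for contrast

print([round(v, 2) for v in cde])  # [3.0, 4.33]
print(naive)                       # (3.5, 4.25)
```

The weighted result agrees with averaging C, D, and E directly; the unweighted midpoint would give E as much weight as C and D combined.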


Iteration 4


Agglomerative Example
Step 2: Compute distances between clusters (centroid method)
       AB    CDE
AB   0.00
CDE  3.48  0.00

[Figure: scatter plot of clusters AB and CDE; axes X and Y]

Agglomerative Example
Step 3: Merge the two closest clusters

       AB    CDE
AB   0.00
CDE  3.48  0.00

Only one pair remains: AB and CDE (distance 3.48)

[Figure: scatter plot of clusters AB and CDE; axes X and Y]

Agglomerative Example
Step 3: Merge the two closest clusters

Obs    X    Y
ABCDE  2.6  3.0

[Figure: scatter plot of the single cluster ABCDE; axes X and Y]

Recap


Recap

[Figure: scatter plot of the observations (axes X and Y) alongside the resulting dendrogram (x-axis: Object, with leaves A to E; y-axis: Distance)]
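
The merge history of the worked example can be replayed to list the partition at every level of the hierarchy. The following Python sketch is an added illustration; the distances are the merge distances from the four iterations.

```python
# Merge history from the worked example: (cluster formed, centroid distance).
history = [("CD", 1.00), ("AB", 2.00), ("CDE", 3.04), ("ABCDE", 3.48)]

partition = ["A", "B", "C", "D", "E"]
levels = [(0.0, list(partition))]
for merged, distance in history:
    # Drop the clusters absorbed by this merge, then add the merged cluster.
    partition = [c for c in partition if not set(c) <= set(merged)]
    partition.append(merged)
    levels.append((distance, sorted(partition)))

for distance, clusters in levels:
    print(f"{distance:4.2f}  {len(clusters)} clusters: {clusters}")
```

Each printed row corresponds to one possible cut of the dendrogram, from five singleton clusters down to a single cluster.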

But How Should We Measure Distance?

• In k-means, we only ever calculated the distance between points and centroids
• In hierarchical clustering, we have to determine the distance between
  • Two points (Step 1)
  • A point and a group of points (cluster)
  • A group of points (cluster) and another group of points (cluster)
• What point within a cluster should represent the cluster in distance calculations?
• It turns out there are many options and they can result in different cluster structures

Calculating Distances

Distance between what?

• MIN (Single Linkage) – Distance between two clusters is the distance between the two closest points in the (different) clusters
• MAX (Complete Linkage) – Distance between two clusters is the distance between the two farthest points in the (different) clusters
• Group Average (Average Linkage) – Distance between two clusters is the average pairwise distance between points in the (different) clusters
• Distance Between Centers – Distance between two clusters is the distance between the cluster centroids
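
The four definitions translate directly into code. The Python sketch below is illustrative only: the function names are made up, Euclidean distance between points is assumed, and the two clusters are AB and CDE from the worked example in this deck.

```python
import math
from itertools import product


def single_linkage(a, b):
    """MIN: distance between the two closest points in different clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """MAX: distance between the two farthest points in different clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Group average: mean of all pairwise distances across the clusters."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_linkage(a, b):
    """Distance between the cluster centroids."""
    def centroid(points):
        return tuple(sum(coord) / len(points) for coord in zip(*points))
    return math.dist(centroid(a), centroid(b))


ab = [(1, 1), (3, 1)]           # observations A and B
cde = [(2, 4), (2, 5), (5, 4)]  # observations C, D, and E

for linkage in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(f"{linkage.__name__}: {linkage(ab, cde):.2f}")
```

The same two clusters are 3.16, 5.00, 3.86, or 3.48 apart depending on the definition, which is why the choice can change which pair gets merged.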

[Figures: illustrations of MIN (Single Linkage), MAX (Complete Linkage), Group Average (Average Linkage), and Distance Between Centers]

How Does this Affect Cluster Structure?


• The result of each distance calculation method depends heavily on the shape of the original data
• Assume we have data that looks like this:


MIN
• MIN is better at handling non-elliptical shapes
• It will likely result in cleaner (more interpretable) clusters


MAX
• MAX has a tendency to “jump gaps”
• It often breaks large clusters
• Its results would be less interpretable


How Does this Affect Cluster Structure?


• Noise and outliers can also have an impact
• Assume we have data that looks like this:


MIN
• MIN is more sensitive to noise and outliers
• In this instance, the clusters would be difficult to interpret


MAX
• MAX manages noise and outliers better
• It produces better clusters in this situation


Hierarchical Clustering Issues


• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  • Sensitivity to noise and outliers
  • Difficulty handling different sized clusters and non-convex shapes
  • Breaking large clusters

Validating Clusters
• Goal is to obtain meaningful/useful clusters
• Random chance can produce apparent clusters
• Different clustering methods produce different results
• Hence, need to
  • Obtain summary statistics
  • Review clusters with respect to variables not used in clustering
  • Label clusters
• Look for
  • Stability – low sensitivity to minor input changes
  • Separation – ratio of intra-cluster to inter-cluster variation
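
One way to put a number on separation is sketched below in Python. The slides do not define the variations, so this sketch assumes sums of squared Euclidean distances: within-cluster variation is measured around each cluster centroid, and between-cluster variation is the size-weighted squared distance of each centroid from the overall mean. The clusters are AB and CDE from the worked example.

```python
def centroid(points):
    """Mean of a list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


clusters = {
    "AB": [(1, 1), (3, 1)],
    "CDE": [(2, 4), (2, 5), (5, 4)],
}
all_points = [p for members in clusters.values() for p in members]
grand = centroid(all_points)

# Intra-cluster variation: spread of points around their own centroid.
within = sum(sum_of_squares(m, centroid(m)) for m in clusters.values())
# Inter-cluster variation: spread of the centroids around the overall mean.
between = sum(
    len(m) * sum_of_squares([centroid(m)], grand) for m in clusters.values()
)
total = sum_of_squares(all_points, grand)

print(f"within={within:.2f} between={between:.2f} ratio={within / between:.2f}")
```

Under these assumed definitions, a smaller ratio indicates tighter and better separated clusters, and the within and between terms add up to the total variation.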
