9 - IAI5101 Unsupervised Learning - 20-40

How to Determine K?

Find the point where:

▪ Adding more clusters will not improve the solution considerably
▪ Having a smaller number of clusters will increase the error significantly

[Figure: SSE vs. number of clusters; a clear 'elbow' is visible at 5 clusters, hence K = 5]

The value of k should be chosen at the elbow: increasing k beyond this point leaves the SSE nearly constant (a short code sketch follows).
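A minimal sketch of this elbow heuristic, assuming scikit-learn's KMeans and a hypothetical blob dataset (the data and the range of k are illustrative, not from the slides):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 5 underlying groups
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# SSE (inertia) for each candidate k; look for the bend ('elbow') in the curve
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()
```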



Hierarchical Clustering



Hierarchical Clustering
▪ Works by grouping data objects into a hierarchy or tree of clusters

▪ Two approaches are used when building the hierarchy of clusters:

▪ Agglomerative (Bottom-up):
▪ Start with each instance in its own (singleton) cluster
▪ At each step, join the two closest clusters

▪ Divisive (Top-down):
▪ Start with one universal cluster containing all instances
▪ Split it into two clusters
▪ Proceed recursively on each subset

▪ Both approaches produce a dendrogram



Dendrogram
▪ A tree-like diagram that illustrates hierarchical clustering techniques
▪ Each level shows the clusters for that level
▪ Leaves – individual (singleton) clusters
▪ Root – one all-inclusive cluster
▪ A cluster at level i is the union (agglomerative) or division (divisive) of clusters at level i+1

[Figure: dendrogram, read bottom-up for agglomerative and top-down for divisive clustering]



Agglomerative vs. Divisive Clustering
▪ Agglomerative: uses distance as the clustering criterion, e.g., merges the pair of clusters with the minimum Euclidean distance between them
▪ Divisive: uses the opposite principle, e.g., the maximum Euclidean distance; an algorithm splits a cluster & reassigns its data instances around the most distant pair of instances



AGNES (Agglomerative Nesting)
▪ Introduced in Kaufmann and Rousseeuw (1990)
▪ Implemented in statistical packages, e.g., S-PLUS
▪ Uses the single-link method & the dissimilarity matrix (see the sketch below)
▪ Merges the nodes that have the least dissimilarity
▪ Continues in a non-descending fashion (merge distances never decrease)
▪ Eventually all nodes belong to the same cluster
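A minimal sketch of AGNES-style single-link clustering, using scipy's hierarchy module on a hypothetical 2-D dataset (the data and point labels are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical 2-D observations
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Single-link agglomerative clustering; Z encodes the sequence of merges
Z = linkage(X, method="single", metric="euclidean")

# The dendrogram shows the non-descending merge distances
dendrogram(Z, labels=[chr(ord("a") + i) for i in range(len(X))])
plt.show()
```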



Agglomerative Nesting

[Figure: AGNES example, clusters merged step by step]


DIANA (Divisive Analysis)
▪ Introduced in Kaufmann and Rousseeuw (1990)
▪ Implemented in statistical analysis packages, e.g., S-PLUS
▪ Inverse order of AGNES
▪ Eventually each node forms a cluster on its own



Distance between Clusters
▪ Singleton clusters are iteratively combined, based on the linkage method used
▪ Linkage measures used:
▪ Single link (Nearest Neighbor): Smallest distance between two observations, one from each cluster
▪ Complete link: Largest distance between an element in one cluster and an element in the other
▪ Average link: Average distance between an element in one cluster & an element in the other
▪ The measure of proximity is based on distance, e.g., Euclidean distance
Linkage Measures Derivation
▪ 4 widely used linkage measures for the distance between clusters $C_i$ and $C_j$ (with means $m_i$, $m_j$ and sizes $n_i$, $n_j$) are listed below, followed by a short sketch computing them:

Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\ q \in C_j} \lVert p - q \rVert$

Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\ q \in C_j} \lVert p - q \rVert$

Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = \lVert m_i - m_j \rVert$

Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} \lVert p - q \rVert$
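A minimal sketch computing the four measures for two hypothetical clusters with scipy (the point coordinates are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 1.0]])

pairwise = cdist(Ci, Cj)  # all pairwise Euclidean distances |p - q|

print("minimum :", pairwise.min())                                     # d_min
print("maximum :", pairwise.max())                                     # d_max
print("mean    :", np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)))  # d_mean
print("average :", pairwise.mean())                                    # d_avg
```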



Exercise



Exercise 1 - Dendrogram
▪ The daily expenditures on food (X1) & clothing (X2) of 5 persons are shown in a table, and the distance between each pair of observations is given in a distance matrix
▪ For example, the Euclidean distance between a & b is: $d(a, b) = \sqrt{(X_{1a} - X_{1b})^2 + (X_{2a} - X_{2b})^2}$

[Table of expenditures and the derived 5 x 5 distance matrix not reproduced here]



Dendrogram
▪ After deriving the distance matrix, form the clusters:
1. Find the minimum value present in the matrix
▪ Min = 1 (join b & e in the dendrogram)
2. Recalculate the distance matrix & update. To update, use the minimum (single link):
▪ Min[(be, a)] = min[(b,a), (e,a)] = min(6, 7) = 6
▪ Min[(be, c)] = min[(b,c), (e,c)] = 1
▪ Min[(be, d)] = min[(b,d), (e,d)] = 7
3. Reconstruct the matrix:

Cluster   a   be   c   d
a         0    6   7   1
be        6    0   1   7
c         7    1   0   8
d         1    7   8   0

4. Repeat (go to step 1)



Dendrogram
1. Find the minimum value present in the matrix
▪ Min = 1 (join a & d in the dendrogram)

Cluster   a   be   c   d
a         0    6   7   1
be        6    0   1   7
c         7    1   0   8
d         1    7   8   0

2. Recalculate the distance matrix & update. To update, use the minimum (single link):
▪ Min[(ad, be)] = min[(d,be), (a,be)] = min(7, 6) = 6
▪ Min[(ad, c)] = min[(d,c), (a,c)] = min(8, 7) = 7
3. Reconstruct the matrix:

Cluster   ad   be   c
ad         0    6   7
be         6    0   1
c          7    1   0

4. Repeat (go to step 1)



Dendrogram
1. Find the minimum value present in the matrix
▪ Min = 1 (join c with be in the dendrogram)

Cluster   ad   be   c
ad         0    6   7
be         6    0   1
c          7    1   0

2. Recalculate the distance matrix & update. To update, use the minimum (single link):
▪ Min[(bec, ad)] = min[(be, ad), (c, ad)] = min(6, 7) = 6
3. Reconstruct the matrix:

Cluster   ad   bec
ad         0     6
bec        6     0

(The remaining merge joins ad & bec at distance 6; the full sequence is replayed in the sketch below.)
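A minimal sketch of the single-link update rule used in these steps. It starts from the 4 x 4 matrix above (after b & e were merged, taken from the slides) and replays the remaining merges; the implementation itself is illustrative, not from the slides:

```python
import numpy as np

# Distance matrix after the first merge (b & e joined), from the slides
labels = ["a", "be", "c", "d"]
D = np.array([[0., 6., 7., 1.],
              [6., 0., 1., 7.],
              [7., 1., 0., 8.],
              [1., 7., 8., 0.]])

while len(labels) > 1:
    # Step 1: find the minimum off-diagonal entry
    masked = D.copy()
    np.fill_diagonal(masked, np.inf)
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    i, j = min(i, j), max(i, j)
    print(f"merge {labels[i]} & {labels[j]} at distance {D[i, j]:g}")

    # Step 2: single-link update -- the merged cluster's distances are the
    # element-wise minimum of the two merged rows
    merged = np.minimum(D[i], D[j])
    D = np.delete(np.delete(D, j, axis=0), j, axis=1)
    D[i, :] = np.delete(merged, j)
    D[:, i] = D[i, :]
    D[i, i] = 0.0
    labels[i] += labels[j]
    del labels[j]
```

Running it prints the merges in the same order as the worked example: a & d at 1, be & c at 1, and finally ad & bec at 6.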



Exercise II - Dendrogram
▪ Show your results by drawing a dendrogram. The dendrogram
should clearly show the order in which the points are merged.
▪ How many sets of clusters can you deduce from the dendrogram?



Model Evaluation



Performance Metrics - Clustering
▪ More difficult than classification due to absence of ground truth (i.e.,
absence of true labels in the data)
▪ Approaches:
1. External Validation: supervised, i.e., the ground truth is
available
▪ Compare clustering against the ground truth using certain clustering
quality measure
▪ Popular Quality Metrics:
▪ Homogeneity: All clusters contain only data points that are members
of a single class (based on the true class labels)
▪ Completeness: All data points of a specific ground truth class label are
also elements of the same cluster
▪ V-measure: Harmonic mean of homogeneity & completeness scores
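A minimal sketch of these three metrics using scikit-learn (the label vectors are hypothetical):

```python
from sklearn.metrics import (completeness_score, homogeneity_score,
                             v_measure_score)

# Hypothetical ground-truth classes and predicted cluster assignments
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(f"homogeneity:  {homogeneity_score(labels_true, labels_pred):.3f}")
print(f"completeness: {completeness_score(labels_true, labels_pred):.3f}")
print(f"v-measure:    {v_measure_score(labels_true, labels_pred):.3f}")
```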



Example: External Validation

[Table: homogeneity, completeness & V-measure scores for a 2-cluster model and a 5-cluster model]

Note:
▪ Values are typically bounded between 0 & 1
▪ Higher values are better
▪ The V-measure of the first model (2 clusters) is better than that of the 5-cluster model because of its higher completeness score



Performance Metrics - Clustering
2. Internal Validation: unsupervised, i.e., the ground truth is unavailable
▪ Validate a clustering model by defining metrics that capture the expected behavior of a good clustering model

▪ A good clustering model can be identified by 2 traits:
▪ Compactness, i.e., the data points within a cluster are close to each other
▪ Separation, i.e., any 2 clusters are distant from each other

▪ Define metrics (e.g., based on Euclidean distance) that mathematically calculate the goodness of these 2 traits & use them to evaluate clustering models
▪ Example: the Silhouette coefficient



Example - Silhouette Coefficient
▪ Metric combines the 2 traits of a good clustering model
▪ Uses a combination of similarity to the data points in a cluster & dissimilarity to the data points not in the cluster (see the sketch below)

[Figure: example clustering model]

SC for each sample:

$SC = \frac{b - a}{\max(a, b)}$

where $a$ is the mean distance between the sample & the other points in the same cluster, and $b$ is the mean distance between the sample & the points in the nearest cluster.

The SC value is bounded between -1 (incorrect clustering) and +1 (highly dense, well-separated clusters); values near 0 indicate overlapping clusters.
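A minimal sketch of computing the Silhouette coefficient with scikit-learn, assuming a hypothetical blob dataset and K-means cluster assignments:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Mean SC across all samples, for several candidate values of k
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: mean silhouette = {silhouette_score(X, labels):.3f}")
```

The best k should show the highest mean score, since its clusters are the most compact & well separated.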
