3 Clustering

Clustering is an unsupervised machine learning technique that groups similar data points together. It relies on a distance metric between data points to determine clusters. The k-means algorithm partitions observations into k clusters by iteratively updating cluster centroids so as to minimize the within-cluster sum of squares.


Clustering

Clustering
• Unsupervised algorithm
• Groups a number of similar things
• Clustering algorithms rely on a distance metric between data points
• Distance calculation is the most important step

Can you group these items?

Distance Metrics

• Similarity and dissimilarity are measured in terms of distance
• Distance measures should satisfy the conditions below:
  1. d(x, y) \ge 0
  2. d(x, y) = 0 iff x = y
  3. d(x, y) = d(y, x)
  4. d(x, z) \le d(x, y) + d(y, z)

• euclidean distance = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}
• manhattan distance = \sum_{i=1}^{k} |x_i - y_i|

• Cosine distance
• Edit distance
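A minimal NumPy sketch of the two metrics written out above (the function names are my own, for illustration):

```python
# Euclidean and Manhattan distance between two points, as defined on this slide.
import numpy as np

def euclidean_distance(x, y):
    """sqrt(sum_i (x_i - y_i)^2)"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    """sum_i |x_i - y_i|"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(manhattan_distance([0, 0], [3, 4]))   # 7.0
```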
Good Clustering Algorithm
1. The ability to discover some or all of the hidden clusters.
2. Within-cluster similarity and between-cluster dissimilarity.
3. Ability to deal with various types of attributes.
4. Can deal with noise and outliers.
5. Can handle high dimensionality.
6. Scalable, interpretable and usable.

Error Metrics

Sum of squared errors
• Cohesion
  • Intra-cluster distance
  • Sum of squared errors between all the points inside a cluster
  • Minimum is better
• Separation
  • Inter-cluster distance
  • Sum of squared errors between points in different clusters
  • Maximum is better
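A small NumPy sketch of these two quantities. The separation formula used here is the size-weighted between-group sum of squares (distances of cluster centroids to the overall mean), which is one common way to make the slide's description concrete; the helper names are my own.

```python
import numpy as np

def cohesion(X, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    sse = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        sse += np.sum((pts - pts.mean(axis=0)) ** 2)
    return sse

def separation(X, labels):
    """Size-weighted squared distances of cluster centroids to the overall mean."""
    overall = X.mean(axis=0)
    sse = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        sse += len(pts) * np.sum((pts.mean(axis=0) - overall) ** 2)
    return sse

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(cohesion(X, labels), separation(X, labels))  # 1.0 50.0 — tight, well-separated clusters
```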
• Cohesion: minimize
• Separation: maximize
• A = mean intra-cluster distance
• B = mean nearest-cluster distance
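These A and B definitions match the silhouette coefficient, s = (B − A) / max(A, B), computed per point (my reading; the slide does not name the metric). A minimal sketch using scikit-learn's silhouette_score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# silhouette_score averages (B - A) / max(A, B) over all points: close to 1 is good.
print(silhouette_score(X, labels))
```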

Between-Cluster Distance Functions
• Single linkage: the distance between two clusters is defined as the shortest distance between two points, one in each cluster.
• Complete linkage: the distance between two clusters is defined as the longest distance between two points, one in each cluster.
• Average linkage: the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster.
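A minimal sketch of the three definitions above, computed from the pairwise point distances between two clusters (the helper name is my own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distances(cluster_a, cluster_b):
    d = cdist(cluster_a, cluster_b)    # all pairwise point-to-point distances
    return {"single": d.min(),         # shortest pairwise distance
            "complete": d.max(),       # longest pairwise distance
            "average": d.mean()}       # mean over all pairs

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [6.0, 0.0]])
print(linkage_distances(a, b))  # {'single': 3.0, 'complete': 6.0, 'average': 4.5}
```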

Hierarchical Clustering
• Agglomerative clustering (bottom-up)
• Divisive clustering (top-down)

Agglomerative approach (Hierarchical)
[Dendrogram: individual points A–H are merged bottom-up into progressively larger clusters, e.g. {B, C} → {B, C, G} and {A, D, F} → {A, D, F, H}, until a single cluster remains.]
Flow of Agglomerative Method
1. N objects given
2. Assign each point to its own cluster
3. Calculate the distance between these clusters
4. Merge the two clusters with minimum distance
5. Repeat steps 3 and 4 until you reach one big cluster (see the sketch below)
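A minimal SciPy sketch of this flow: linkage performs the repeated merging (steps 3–4), and fcluster cuts the resulting tree into a flat partition.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)

# Start from one cluster per point and repeatedly merge the two closest clusters
# until a single cluster (the full merge tree) remains.
Z = linkage(X, method="average")

# Cut the dendrogram to recover a flat partition, here 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```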

Divisive approach (Hierarchical)
[Dendrogram read top-down: the full set of points A–H is split into progressively smaller clusters, e.g. {A, D, F, H} → {A, D, F} and {B, C, G} → {B, C}, until each point is its own cluster.]

Flow of Divisive Method
1. N objects given
2. Assign all the points to a single cluster
3. Partition the cluster into two clusters with maximum distance between them (or the two clusters which are least similar)
4. Continue step 3 until you finally reach clusters of individual points (N clusters) (see the sketch below)
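Neither SciPy nor scikit-learn ships a generic divisive routine, so the rough sketch below approximates the flow with repeated 2-means splits (a bisecting-k-means-style stand-in for the slide's "least similar halves" rule, stopping at a chosen number of clusters instead of N singletons; the helper name is my own).

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]                       # step 2: one big cluster
    while len(clusters) < n_clusters:
        # pick the loosest cluster (largest within-cluster SSE) to split next
        sse = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        # step 3: partition that cluster into two halves (2-means as a stand-in)
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]
    labels = np.empty(len(X), dtype=int)
    for lab, idx in enumerate(clusters):
        labels[idx] = lab
    return labels

X = np.array([[0, 0], [0, 1], [8, 8], [8, 9], [20, 20], [20, 21]], dtype=float)
print(divisive(X, 3))   # e.g. [2 2 1 1 0 0] — the three well-separated pairs
```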

K-means Clustering
• Partitions n objects into k clusters
• Each object belongs to the cluster with the nearest mean
• Produces exactly k clusters
• The objective of k-means clustering is to minimize the total intra-cluster variance, i.e., the squared error function
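Written out, that squared-error objective is the standard within-cluster sum of squares (the formula itself is not shown on the slide):

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, where \mu_j is the mean (centroid) of cluster C_j.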

Flow of k-Means Clustering Method
1. The objective is to partition n given points into k (non-empty) clusters
2. Randomly select k points as the initial cluster centroids
3. Assign every point to its closest cluster centroid according to the Euclidean distance function
4. Calculate the new centroid, i.e., the mean of all objects in each cluster
5. Repeat steps 3 and 4 until the centroids are fixed and no longer change (the same points are assigned to each cluster in consecutive rounds) (see the sketch below)
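A from-scratch NumPy sketch of these steps, for illustration only (no empty-cluster handling; the function name is my own):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random init
    labels = None
    for _ in range(max_iter):
        # step 3: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # step 5: assignments are fixed
        labels = new_labels
        # step 4: recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1]
```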

K-means Algorithm
1. Randomly initialize k points (centroids)
2. Assign each object to the cluster of the nearest seed point, measured with a specific distance metric
3. Compute new centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster)

How many clusters in k-means?

Clustering Demos
• Agglomerative Clustering: https://youtu.be/XJ3194AmH40?t=4m47s
• K-Means Algorithm: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

