Clustering for new discovery in data
Mean shift clustering
Hierarchical clustering
- Kunal Parmar
Houston Machine
Learning Meetup
1/21/2017
Clustering: A world
without labels
• Finding hidden structure in data when we don’t
have labels/classes for the data
• We group data
together based
on some notion
of similarity in
the feature space
Clustering approaches
covered in previous lecture
• k-means clustering
o Iterative partitioning into k clusters based on proximity of an observation to
the cluster mean
Clustering approaches
covered in previous lecture
• DBSCAN
o Partition the feature space based on density
In this segment:
• Mean shift clustering
• Hierarchical clustering
Mean shift clustering
• Mean shift clustering is a non-parametric, iterative,
mode-based clustering technique built on kernel
density estimation.
• It is very commonly used in the field of computer
vision because of its high efficiency in image
segmentation.
Mean shift clustering
• It assumes that our data is sampled from an
underlying probability distribution
• The algorithm finds the modes (peaks) of that
distribution; each mode corresponds to a cluster, and
points that shift to the same mode are grouped together
Kernel density estimation
(figure: a “Set of points” and the “KDE surface” estimated from it)
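To make the figure concrete, here is a minimal sketch of kernel density estimation using SciPy's gaussian_kde; the two-blob sample data is invented for illustration:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Two Gaussian blobs -> a KDE surface with two modes
points = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 200)])

kde = gaussian_kde(points)        # bandwidth picked by Scott's rule by default
xs = np.linspace(-5, 7, 400)
density = kde(xs)                 # estimated density at each grid point
print(xs[np.argmax(density)])     # location of the tallest mode
```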
Algorithm: Mean shift
1. Define a window (bandwidth of the kernel to be
used for estimation) and place the window on a
data point
2. Calculate mean of all the points within the window
3. Move the window to the location of the mean
4. Repeat steps 2-3 until convergence
• On convergence, all data points within that window
form a cluster.
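As a rough illustration of steps 1-4, here is a minimal NumPy sketch using a flat (uniform) kernel; the data, bandwidth, and tolerance are all assumptions made for the example:

```python
import numpy as np

def mean_shift_point(x, X, bandwidth, tol=1e-4, max_iter=100):
    """Shift one window until it converges to a mode of X."""
    for _ in range(max_iter):
        # Step 2: mean of all points inside the window
        in_window = np.linalg.norm(X - x, axis=1) < bandwidth
        new_x = X[in_window].mean(axis=0)
        # Step 4: stop when the window no longer moves
        if np.linalg.norm(new_x - x) < tol:
            break
        x = new_x
    return x

X = np.random.default_rng(1).normal(size=(300, 2))
modes = np.array([mean_shift_point(x, X, bandwidth=1.0) for x in X])
# Points whose windows converged to (nearly) the same mode form one cluster
```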
Example: Mean shift
(figures: four slides showing the window shifting step by step toward a mode)
Types of kernels
• Generally, a Gaussian kernel is used for probability
estimation in mean shift clustering.
• However, other kinds of kernels that can be used
are,
o Rectangular kernel
o Flat kernel, etc.
• The choice of kernel affects the clustering result
Types of kernels
• The choice of the bandwidth of the kernel (window)
will also impact the clustering result
o Small kernels will result in lots of clusters, some even being individual data
points
o Big kernels will result in one or two huge clusters
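The bandwidth effect is easy to see with scikit-learn's MeanShift and estimate_bandwidth (a sketch; the data and quantile values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

for quantile in (0.05, 0.2, 0.5):          # small -> large window
    bw = estimate_bandwidth(X, quantile=quantile)
    labels = MeanShift(bandwidth=bw).fit_predict(X)
    print(f"bandwidth={bw:.2f} -> {len(np.unique(labels))} clusters")
```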
Pros and cons: Mean Shift
• Pros
o Model-free, doesn’t assume predefined shape of clusters
o Only relies on one parameter: kernel bandwidth h
o Robust to outliers
• Cons
o The selection of window size is not trivial
o Computationally expensive; O(n²)
o Sensitive to selection of kernel bandwidth; small h will slow down convergence,
large h speeds it up but might merge two modes
Applications: Mean Shift
• Clustering and segmentation
Hierarchical Clustering
• Hierarchical clustering creates clusters that have a
predetermined ordering from top to bottom.
• There are two types of hierarchical clustering:
o Divisive
• Top to bottom approach
o Agglomerative
• Bottom to top approach
Algorithm:
Hierarchical agglomerative clustering
1. Place each data point in its own singleton group
2. Iteratively merge the two closest groups
3. Repeat step 2 until all the data points are merged
into a single cluster
• We obtain a dendrogram (tree-like structure) at the
final step. We cut the dendrogram at a certain level
to obtain the final set of clusters.
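A minimal sketch of this procedure using SciPy's hierarchical-clustering utilities; the data and the cut height are invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(2).normal(size=(20, 2))

# Steps 1-3: linkage() records the full sequence of merges
Z = linkage(X, method='complete', metric='euclidean')

# Cut the dendrogram at height 2.0 to get the final clusters
labels = fcluster(Z, t=2.0, criterion='distance')

dendrogram(Z)  # draws the tree (requires matplotlib)
```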
Cluster similarity or
dissimilarity
• Distance metric
o Euclidean distance
o Manhattan distance
o Jaccard index, etc.
• Linkage criteria
o Single linkage
o Complete linkage
o Average linkage
Linkage criteria
• The linkage criterion quantifies the distance between
sets of observations (the intermediate clusters formed
during the agglomeration process)
Single linkage
• Distance between two clusters is the shortest
distance between any two points, one from each cluster
Complete linkage
• Distance between two clusters is the longest
distance between any two points, one from each cluster
Average linkage
• Distance between clusters is the average distance
between each point in one cluster and every point in
the other cluster (all three criteria are sketched in code below)
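The three criteria can be written directly in NumPy; A and B below are two made-up clusters of 2-D points:

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

# Pairwise distances between every point in A and every point in B
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

print("single  :", D.min())   # shortest pair distance
print("complete:", D.max())   # longest pair distance
print("average :", D.mean())  # mean over all pairs
```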
Example: Hierarchical
clustering
• We consider a small dataset with seven samples;
o (A, B, C, D, E, F, G)
• Metrics used in this example
o Distance metric: Jaccard index
o Linkage criteria: Complete linkage
Example: Hierarchical
clustering
• We construct a dissimilarity matrix based on the
Jaccard index (dissimilarity = 1 − Jaccard similarity).
• B and F are merged in this step as they have the
lowest dissimilarity
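The original dataset behind the matrix is not shown on the slide, so the sketch below invents boolean sample vectors to show how such a matrix could be built with SciPy (pdist with metric='jaccard' returns the Jaccard dissimilarity):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

samples = "ABCDEFG"
X = np.random.default_rng(3).integers(0, 2, size=(7, 8)).astype(bool)

# Pairwise Jaccard dissimilarity = 1 - Jaccard index
D = squareform(pdist(X, metric='jaccard'))

# The first merge is the off-diagonal minimum (eye() masks the diagonal zeros)
i, j = np.unravel_index(np.argmin(D + np.eye(7)), D.shape)
print(f"first merge: ({samples[i]}, {samples[j]})")
```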
Example: Hierarchical
clustering
• How do we calculate distance of (B,F) with other
clusters?
o This is where the choice of linkage criteria comes in
o Since we are using complete linkage, we take the maximum of the pairwise
distances between the two clusters
o So,
• Dissimilarity(B, A) : 0.5000
• Dissimilarity(F, A) : 0.6250
• Hence, Dissimilarity((B,F), A) : 0.6250
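In code, the complete-linkage update is just a max over the values quoted above (single linkage would take the min, average linkage the mean):

```python
# Complete-linkage update using the dissimilarities quoted above
d_B_A, d_F_A = 0.5000, 0.6250
d_BF_A = max(d_B_A, d_F_A)
print(d_BF_A)  # 0.625
```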
Example: Hierarchical
clustering
• We iteratively merge clusters at each step until all
the data points are covered,
i. merge the two clusters with the lowest dissimilarity
ii. update the dissimilarity matrix based on the merged clusters
Dendrogram
• At the end of the agglomeration process, we
obtain a dendrogram that looks like this:
(figure: the dendrogram for the seven samples)
Cutting the tree
• We cut the dendrogram at a level where there is a
jump in the clustering levels/dissimilarities
Cutting the tree
• If we cut the tree at 0.5, then we can say that within
each cluster the samples have more than 50%
similarity (this cut is sketched in code after the list below)
• So our final set of clusters is,
i. (B,F),
ii. (A,E,C,G) and
iii. (D)
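With SciPy, cutting at 0.5 is one call to fcluster; this sketch reuses invented boolean data, since the original samples are not shown on the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(3).integers(0, 2, size=(7, 8)).astype(bool)
Z = linkage(pdist(X, metric='jaccard'), method='complete')

labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)  # samples sharing a label form one final cluster
```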
Final set of clusters
Impact of metrics
• The metrics chosen for hierarchical clustering can
lead to vastly different clusters.
• Distance metric
o In a 2-dimensional space, the distance between the point (1,1) and the
origin (0,0) is 2 under Manhattan distance but √2 under Euclidean
distance.
• Linkage criteria
o Distance between two clusters can be different based on linkage criteria
used
Linkage criteria
• Complete linkage is the most popular linkage criterion
used for hierarchical clustering. It is less sensitive to
outliers.
• Single linkage can handle non-elliptical shapes. But
single linkage can lead to clusters that are quite
heterogeneous internally, and it is more sensitive to
outliers and noise
Pros and Cons:
Hierarchical Clustering
• Pros
o No assumption of a particular number of clusters
o May correspond to meaningful taxonomies
• Cons
o Once a decision is made to combine two clusters, it can’t be undone
o Too slow for large data sets, O(n² log n)
References
i. https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/
ii. http://vision.stanford.edu/teaching/cs131_fall1314_nope/lectures/lecture13_kmeans_cs131.pdf
iii. http://84.89.132.1/~michael/stanford/maeb7.pdf
Thank you!