Clustering
§ Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data
§ It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different
§ In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure, such as Euclidean distance or correlation-based distance
§ Unlike supervised learning, clustering is considered an unsupervised learning method since we don't have the ground truth (a known output or dependent variable, Y) to compare the output of the clustering algorithm against true labels to evaluate its performance
§ A typical clustering workflow:
1. Preprocess the data (data cleaning, scaling)
2. Select a similarity measure
3. Cluster the data
4. Analyze the results (convert the clusters to classes, visualize the clusters)
Applications
§ Marketing
  § Discovering groups in customer databases, e.g. who makes long-distance calls or who earns more or less
§ Insurance
§ Land use (at the city, state, or country level)
Types of Clustering
§ Exclusive clustering
  § An item belongs exclusively to one cluster and not several (e.g. splitting on age: age < 50 → C1, age ≥ 50 → C2)
§ Overlapping clustering
  § An item may belong to several clusters at once (clusters C1 and C2 overlap)
§ Hierarchical clustering
  § Clusters are nested in a tree: a root cluster contains sub-clusters (here C1, itself a cluster with 4 small clusters; C2; C3), which in turn contain smaller clusters down to single items (i1, i2, …)
K-Means Clustering
§ The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group
§ It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible
§ It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum
§ The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster (the similarity measure here is a distance)
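Written out (a standard formulation consistent with the description above; $C_k$ denotes cluster $k$, $\mu_k$ its centroid, and $J$ the quantity being minimized):

```latex
% K-means objective: the total within-cluster sum of squared distances.
% k-means alternates between assigning points and recomputing centroids
% so as to (locally) minimize J.
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```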
§ Specify the number of clusters K
§ Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement
§ Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing (see the sketch below):
  § Compute the sum of the squared distances between the data points and all centroids
  § Assign each data point to the closest cluster (centroid)
  § Compute the new centroids for the clusters by taking the average of all the data points that belong to each cluster
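A minimal sketch of these steps in Python with NumPy (the function name `kmeans` and its arguments are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Plain k-means following the steps above: sample k data points as
    the initial centroids, then alternate between assigning points to
    the nearest centroid and re-averaging the centroids."""
    rng = np.random.default_rng(seed)
    # Initialization: k distinct data points, chosen without replacement
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid -> (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each data point to its closest centroid
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # no centroid moved: the assignment has stabilized
        centroids = new_centroids
    return centroids, labels
```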
[Figure: scatter plots of the data split into clusters C1 and C2, with × marking each cluster's centroid]
K-Means Clustering - Algorithm
§ Initialization
  § Randomly choose K initial centroids, e.g. K points picked from the dataset (note: it is not required that the initial centroids be actual data points)
§ Cluster Assignment
  § Assign each data point to its nearest centroid, forming K clusters
§ Move Centroid
  § Take the old centroids and iteratively reposition them (to the mean of their assigned points) for optimization
§ Optimization
  § Iteratively repeat the cluster-assignment and move-centroid steps
§ Convergence
  § Finally, the k-means clustering algorithm converges and divides the data points into clusters (here, two) that are clearly visible in the scatter plot
K-Means Clustering - Example
§ Suppose we want to group the visitors to a website using just their age (a one-dimensional space):
  15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65
§ N = 19, K = 2
§ Initial centroids: c1 = 16, c2 = 22
§ At each iteration, every point xi is assigned to the nearer centroid by comparing |xi − c1| and |xi − c2|, and the centroids are then recomputed
§ Iteration I
  § Before: c1 = 16, c2 = 22
  § After: c1 = 15.33, c2 = 36.25
§ Iteration II
  § Before: c1 = 15.33, c2 = 36.25
  § After: c1 = 18.56, c2 = 45.9
§ Iteration III
  § Before: c1 = 18.56, c2 = 45.9
  § After: c1 = 19.50, c2 = 47.89
§ Iteration IV
  § Before: c1 = 19.50, c2 = 47.89
  § After: c1 = 19.50, c2 = 47.89
§ The centroids did not change in iteration IV, so the algorithm has converged
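The same trace can be reproduced with a few lines of NumPy (a sketch; note the tie at age 19 in the first iteration, which the example resolves toward c2):

```python
import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float)
c1, c2 = 16.0, 22.0  # initial centroids from the example

for it in range(1, 5):
    # Assign each age to the nearer centroid; the tie at age 19 in the
    # first iteration is broken toward c2, matching the trace above
    in_c2 = np.abs(ages - c2) <= np.abs(ages - c1)
    c1, c2 = ages[~in_c2].mean(), ages[in_c2].mean()
    print(f"Iteration {it}: c1 = {c1:.2f}, c2 = {c2:.2f}")

# Iteration 1: c1 = 15.33, c2 = 36.25
# Iteration 2: c1 = 18.56, c2 = 45.90
# Iteration 3: c1 = 19.50, c2 = 47.89
# Iteration 4: c1 = 19.50, c2 = 47.89   <- no change: converged
```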
Elbow Method
§ The within-cluster sum of squares (WSS) is the sum of squared (Euclidean) distances between the items and their corresponding centroid, taken over all clusters:
  $\mathrm{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, c_k)^2$
§ Plot WSS against the number of clusters K: the curve drops steeply at first and then flattens; the K at the bend (the "elbow") is a good choice for the number of clusters
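In practice the WSS curve is usually computed with a library; a sketch assuming scikit-learn, whose `KMeans` exposes the WSS as `inertia_` (reusing the ages from the example above):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
              35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float).reshape(-1, 1)

# Fit k-means for a range of K and record the WSS (inertia) for each
wss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)

for k, w in enumerate(wss, start=1):
    print(f"K = {k}: WSS = {w:.1f}")
# Plot K vs. WSS and look for the elbow where the curve flattens out
```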
§ Alternatively, partition the data on different metrics and see how well it performs for that particular case, e.g. deciding how many clothing sizes to offer:
  § K = 3: if you want to provide only 3 sizes (S, M, L) so that prices are cheaper, you will divide the data set into 3 clusters
  § K = 5: if you want to provide more comfort and variety to your customers with more sizes (XS, S, M, L, XL), then you will divide the data set into 5 clusters
§ It is very popular and used in a variety of applications such as market segmentation and document clustering
§ It always tries to construct a spherical shape around each centroid; as soon as the clusters have complicated geometric shapes, k-means does a poor job of clustering the data
§ It doesn't let data points that are far away from each other share the same cluster, even when they obviously belong to the same cluster
§ Hierarchical clustering can be agglomerative (bottom-up merging) or divisive (top-down splitting)
[Figure: dendrogram over items A, B, C, D, E; cutting the dendrogram with a horizontal line (e.g. across the biggest vertical gap) yields the flat clusters]
Hierarchical Clustering
§ Dendrogram
  § A tree that records the sequence of merges: the root is the single cluster containing everything, and each parent node joins its child clusters
Agglomerative Clustering
§ Start by assigning each item to its own cluster: if you have N items, you now have N clusters, each containing just one item
§ These clusters are then joined greedily, taking the two most similar clusters together:
  § Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less
  § Compute the distances (similarities) between the new cluster and each of the old clusters
  § Repeat the two steps above until all items are clustered into a single cluster of size N (the root cluster)
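These steps are essentially what `scipy.cluster.hierarchy.linkage` implements; a small sketch (the five 2-D points are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five illustrative points; each one starts out as its own cluster
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Each row of Z records one greedy merge:
# (cluster index a, cluster index b, distance at merge, size of new cluster)
Z = linkage(X, method="single")
print(Z)

# Cut the tree into a flat clustering, here at most 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 2]: the two left-hand points vs. the rest
```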
§ A linkage criterion defines the distance between clusters, i.e. which pair of clusters should be merged:
  § Single-linkage
  § Complete-linkage
  § Average-linkage
§ Single linkage
  § The distance between two groups is defined as the distance between their two closest members
  § It often yields clusters in which individuals are added sequentially to a single group
Agglomerative Clustering
§ Complete linkage
  § Also known as furthest-neighbour clustering
  § The distance between two groups is defined as the distance between their two farthest-apart members
§ Average linkage
  § The distance between two groups is defined as the average of the distances between all pairs of their members
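The three criteria differ only in how they aggregate the pairwise member distances; a NumPy sketch (function names are illustrative):

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances between members of group A and group B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = lambda A, B: pairwise(A, B).min()    # two closest members
complete = lambda A, B: pairwise(A, B).max()    # two farthest-apart members
average  = lambda A, B: pairwise(A, B).mean()   # average over all pairs

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single(A, B), complete(A, B), average(A, B))  # 3.0 6.0 4.5
```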
Divisive Clustering
§ Start with all the data points in a single root cluster
§ Then, using a parametric clustering method like k-means, divide the cluster into multiple clusters
§ For each resulting cluster, repeat the process to find sub-clusters, until the desired number of clusters is found
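A sketch of this top-down scheme, repeatedly bisecting the largest cluster with 2-means (scikit-learn assumed; the `divisive` helper is illustrative and assumes every cluster it splits has at least 2 points):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters, random_state=0):
    """Top-down clustering: start with one cluster holding all points,
    then repeatedly split the largest cluster in two with k-means."""
    clusters = [np.arange(len(X))]  # the single root cluster
    while len(clusters) < n_clusters:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)  # take the biggest cluster and split it
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X[idx])
        clusters.append(idx[km.labels_ == 0])
        clusters.append(idx[km.labels_ == 1])
    return clusters

# Example: split the ages from the k-means example into 3 clusters
ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float).reshape(-1, 1)
for c in divisive(ages, 3):
    print(ages[c].ravel())
```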