2 Clustering
This document provides an overview of machine learning clustering techniques. Clustering is an unsupervised learning method used to group unlabeled data points into clusters such that points within a cluster are similar to each other and dissimilar to points in other clusters. The document discusses different types of clustering, including exclusive, overlapping, and hierarchical clustering. It also details the k-means clustering algorithm, which iteratively assigns data points to the closest cluster centroid to minimize the distance between points and their assigned centroid.


Machine Learning

Sunbeam Infotech www.sunbeaminfo.com


Clustering

§ Works on unlabelled data
§ Goal: finding groups/clusters in the data
§ Used for exploratory data analysis


Overview

§ Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of unlabelled data
§ It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different
§ In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure, such as Euclidean distance or correlation-based distance
§ The decision of which similarity measure to use is application-specific
§ Unlike supervised learning, clustering is considered an unsupervised learning method since we don't have ground-truth labels (an output/dependent variable Y) to compare the clustering algorithm's output against to evaluate its performance


Overview

§ A typical clustering workflow: preprocess the data -> select a similarity measure -> cluster -> analyze the clusters
§ Preprocessing: data cleaning, scaling
§ Analysis: visualize the clusters, convert the clusters to classes
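As an illustration, the workflow above might be sketched with scikit-learn (assuming it is installed; the three-blob dataset is synthetic and made up for this example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabelled data: three well-separated 2-D blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

X_scaled = StandardScaler().fit_transform(X)  # preprocess: scaling
# cluster with Euclidean distance (KMeans' similarity measure)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# analyze: treat the cluster labels as class ids, count points per cluster
print(np.bincount(labels))
```

With well-separated blobs, each of the three clusters recovers one blob of 50 points.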


Use Cases

§ Marketing
§ Discovering groups in customer databases, e.g. who makes long-distance calls, who earns more, or who spends more
§ Insurance
§ Identifying groups of insurance policy holders with a high claim rate
§ Land use
§ Identification of areas of similar land use in a GIS database, e.g. at the city, state, or country level


Types

§ Exclusive clustering
§ An item belongs exclusively to one cluster, not to several
§ E.g. K-Means clustering
§ Example: splitting items on age into two clusters, C1 (age < 50) and C2 (age >= 50)


Types

§ Overlapping clustering
§ An item can belong to multiple clusters
§ Its degree of association with each cluster is known
§ E.g. Fuzzy C-means clustering


Types

§ Hierarchical clustering
§ When two clusters have a parent-child relationship
§ It forms a tree-like structure (a dendrogram): the root is a cluster containing smaller clusters, which in turn contain smaller clusters, down to single items
§ E.g. Hierarchical clustering


KMeans


Overview

§ The K-means algorithm is an iterative algorithm (it performs its steps multiple times) that tries to partition the dataset into distinct non-overlapping subgroups (clusters), where each data point belongs to only one group
§ It tries to make the intra-cluster data points as similar as possible, while also keeping the clusters as different (far apart) as possible
§ It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum
§ The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster
§ The similarity measure used is distance


How does it work?

§ Specify the number of clusters K
§ Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement
§ Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing:
§ Compute the squared distance between each data point and all centroids
§ Assign each data point to the closest cluster (centroid)
§ Recompute the centroid of each cluster by taking the average of all data points that belong to it
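The steps above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation; the helper name `kmeans` and the empty-cluster handling are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means following the steps above: random init without
    replacement, then alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # initialize centroids: K random data points, without replacement
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)  # assign each point to the closest centroid
        # recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: no change
            break
        centroids = new_centroids
    return centroids, labels
```

For two well-separated groups of points, `kmeans(X, 2)` recovers the two groups regardless of which points are drawn as initial centroids.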


K-Means Clustering - Algorithm

§ Initialization
§ Randomly initialise two points called the cluster centroids (C1 and C2)
§ It is not required that the initial centroids are part of the dataset; you may select centroids outside of the dataset as well


K-Means Clustering - Algorithm

§ Cluster Assignment
§ Compute the distance between each of the points and the centroids
§ Depending on the minimum distance from the centroids, divide the points into two clusters


K-Means Clustering - Algorithm

§ Move Centroid
§ Consider the old centroids as data points
§ Take the old centroids and iteratively reposition them (to the mean of their assigned points) for optimization
§ Optimization
§ Repeat the steps until the cluster centroids stop changing position


K-Means Clustering - Algorithm

§ Convergence
§ Finally, the k-means clustering algorithm converges and divides the data points into clearly separated clusters


K-Means Clustering - Example

§ Suppose we want to group the visitors to a website using just their age (a one-dimensional space) as follows:

15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65

N = 19


K-Means Clustering - Example

§ Initial clusters (random centroids or averages)

k = 2
c1 = 16
c2 = 22


K-Means Clustering - Example

§ Iteration I: for each age xi, compute |xi - c1| and |xi - c2| and assign xi to the nearer centroid

Before:
c1 = 16
c2 = 22

After:
c1 = 15.33
c2 = 36.25


K-Means Clustering - Example

§ Iteration II: recompute |xi - c1| and |xi - c2| with the updated centroids and reassign

Before:
c1 = 15.33
c2 = 36.25

After:
c1 = 18.56
c2 = 45.9


K-Means Clustering - Example

§ Iteration III

Before:
c1 = 18.56
c2 = 45.9

After:
c1 = 19.50
c2 = 47.89


K-Means Clustering - Example

§ Iteration IV

Before:
c1 = 19.50
c2 = 47.89

After:
c1 = 19.50
c2 = 47.89

§ The centroids did not change, so the algorithm has converged
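The iterations above can be reproduced with a few lines of Python. This is a sketch; note that to match the slide's numbers, ties (such as age 19 with centroids 16 and 22) must go to the second centroid:

```python
ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]

c1, c2 = 16.0, 22.0  # initial centroids from the example
while True:
    # assignment: each age goes to the nearer centroid (ties to c2)
    g1 = [a for a in ages if abs(a - c1) < abs(a - c2)]
    g2 = [a for a in ages if abs(a - c1) >= abs(a - c2)]
    # move centroids: mean of each group
    new_c1, new_c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    if (new_c1, new_c2) == (c1, c2):
        break  # converged: centroids stopped changing
    c1, c2 = new_c1, new_c2
    print(f"c1 = {c1:.2f}, c2 = {c2:.2f}")
```

This prints c1 = 15.33, c2 = 36.25; then 18.56, 45.90; then 19.50, 47.89; after which the centroids stop changing, matching iterations I-IV.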


K-Means Clustering

§ How to find the optimum number of clusters?
§ Elbow Method
§ Purpose Method (intuition)


Elbow Method
§ Total within-cluster variation
§ Also known as Within Sum of Squares (WSS)
§ The sum of squared distances (Euclidean) between the items and the corresponding centroid, calculated for all points within each cluster and summed over all clusters
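Written out for K clusters C_1, ..., C_K with centroids mu_j, the quantity described above is:

```latex
\mathrm{WSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Minimizing this quantity is exactly the objective described in the K-means overview.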


Elbow Method
§ Draw a curve between WSS (within sum of squares) and the number of clusters
§ It is called the elbow method because the curve looks like a human arm, and the elbow point gives us the optimum number of clusters
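The WSS-vs-K curve for the age data from the earlier example can be computed as follows. This is a self-contained sketch; the quantile-based initialization is my own simplification for determinism, not part of the standard algorithm:

```python
import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float)

def kmeans_1d(x, k, iters=50):
    # deterministic init: k evenly spaced quantiles of the data
    centroids = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        labels = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        centroids = np.array([x[labels == j].mean() if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

for k in range(1, 6):
    labels, centroids = kmeans_1d(ages, k)
    wss = sum(((ages[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    print(k, round(float(wss), 1))
```

Plotting WSS against K, the curve drops steeply at first and then flattens; the bend (elbow) suggests the optimum K for this data.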


Purpose Method
§ Get different clusters based on a variety of purposes
§ Partition the data on different metrics and see how well it performs for that particular case
§ Example: clustering customers by size measurements
§ K=3: If you want to provide only 3 sizes (S, M, L) so that prices are cheaper, you divide the data set into 3 clusters
§ K=5: If you want to provide more comfort and variety to your customers with more sizes (XS, S, M, L, XL), you divide the data set into 5 clusters


Applications

§ K-means is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation, image compression, etc.


Disadvantages

§ It always tries to construct a nice spherical shape around the centroid, which means that the moment the clusters have complicated geometric shapes, k-means does a poor job of clustering the data
§ It doesn't let data points that are far away from each other share the same cluster, even when they obviously belong to the same cluster


Hierarchical Clustering


Hierarchical Clustering

§ Separating data into different groups based on some measure of similarity, such as distance
§ Types
§ Agglomerative (bottom-up)
§ Divisive (top-down)



[Figure: dendrogram built bottom-up by linkage over items A, B, C, D, E; the biggest cluster is the root]
Hierarchical Clustering
§ Dendrogram
§ A diagram that shows the hierarchical relationship between objects; each parent node joins its child clusters, up to the root


Agglomerative Clustering
§ Also called bottom-up clustering as it uses a bottom-up approach
§ Each data point starts in its own cluster
§ These clusters are then joined greedily, by merging the two most similar clusters together


Agglomerative Clustering
§ Start by assigning each item to a cluster
§ If you have N items, you now have N clusters, each containing just one item
§ Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less
§ Compute the distances (similarities) between the new cluster and each of the old clusters
§ Repeat the previous two steps until all N items are clustered into a single root cluster
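A minimal sketch of these steps in plain Python, run on five made-up 1-D points, using the single-linkage distance (defined on the following slides) to pick the closest pair of clusters:

```python
points = [1.0, 2.0, 9.0, 10.0, 25.0]

clusters = [[p] for p in points]  # start: one cluster per item
merges = []                       # record of (cluster, cluster, distance)
while len(clusters) > 1:
    # find the closest pair of clusters (single linkage: nearest members)
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merges.append((clusters[i], clusters[j], d))
    # replace the pair with their merged cluster
    clusters = ([c for idx, c in enumerate(clusters) if idx not in (i, j)]
                + [clusters[i] + clusters[j]])

for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d}")
```

The merge order and distances are exactly what a dendrogram records: nearby points merge first, and the final merge at the largest distance produces the root cluster.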


Agglomerative Clustering
§ The step of computing distances between clusters (step 3) can be done in different ways, which determine how the clusters are merged
§ Single-linkage
§ Complete-linkage
§ Average-linkage


Agglomerative Clustering

§ Single linkage
§ Also known as nearest-neighbour clustering
§ The distance between two groups is defined as the distance between their two closest members
§ It often yields clusters in which individuals are added sequentially to a single group

Sunbeam Infotech www.sunbeaminfo.com
Agglomerative Clustering

§ Complete linkage
§ Also known as furthest-neighbour clustering
§ The distance between two groups is defined as the distance between their two farthest-apart members


Agglomerative Clustering

§ Average linkage
§ Also referred to as the unweighted pair-group method
§ The distance between two groups is defined as the average distance between each pair of their members
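The three linkage definitions above can be written directly as functions over two clusters of 1-D points (a sketch with made-up values):

```python
def single_linkage(a, b):
    # distance between the two closest members
    return min(abs(x - y) for x in a for y in b)

def complete_linkage(a, b):
    # distance between the two farthest-apart members
    return max(abs(x - y) for x in a for y in b)

def average_linkage(a, b):
    # average distance over all member pairs
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

a, b = [1.0, 2.0], [5.0, 9.0]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
# -> 3.0 8.0 5.5
```

Single linkage always gives the smallest value and complete linkage the largest, with average linkage in between, which is why they produce progressively more compact merges.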


Divisive Clustering
§ Also called top-down clustering as it uses a top-down approach
§ All data points start in the same cluster (the root)
§ Then, using a flat clustering algorithm like k-means, divide the cluster into multiple clusters
§ For each cluster, repeat the process to find sub-clusters until the desired number of clusters is found
