Clustering
§ Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data
§ It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different
§ In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure, such as Euclidean distance or correlation-based distance
§ Unlike supervised learning, clustering is considered an unsupervised learning method since we don't have the ground truth (a known output or dependent variable, Y) to compare the output of the clustering algorithm against true labels to evaluate its performance
§ A typical clustering workflow:
1. Preprocess the data (data cleaning, scaling)
2. Select a similarity measure
3. Cluster the data
4. Analyze the results (convert the clusters to classes, visualize the clusters)
Applications
§ Marketing
  § Discovering groups in customer databases, e.g. who makes long-distance calls or who earns more or less
§ Insurance
§ Land use (at the city, state, or country level)
Types of Clustering
§ Exclusive clustering
  § An item belongs exclusively to one cluster and not several (e.g. splitting on age: age < 50 → C1, age ≥ 50 → C2)
§ Overlapping clustering
  § An item may belong to several clusters at once (clusters C1 and C2 overlap)
§ Hierarchical clustering
  § Clusters are nested in a tree: a root cluster contains sub-clusters (here C1, itself a cluster with 4 small clusters; C2; C3), which in turn contain smaller clusters down to single items (i1, i2, …)
K-Means Clustering
§ The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group
§ It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible
§ It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum
§ The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster (the similarity measure here is a distance)
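Written out (a standard formulation consistent with the description above; $C_k$ denotes cluster $k$, $\mu_k$ its centroid, and $J$ the quantity being minimized):

```latex
% K-means objective: the total within-cluster sum of squared distances.
% k-means alternates between assigning points and recomputing centroids
% so as to (locally) minimize J.
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```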
§ Specify the number of clusters K
§ Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement
§ Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing (see the sketch below):
  § Compute the sum of the squared distances between the data points and all centroids
  § Assign each data point to the closest cluster (centroid)
  § Compute the new centroids for the clusters by taking the average of all the data points that belong to each cluster
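A minimal sketch of these steps in Python with NumPy (the function name `kmeans` and its arguments are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Plain k-means following the steps above: sample k data points as
    the initial centroids, then alternate between assigning points to
    the nearest centroid and re-averaging the centroids."""
    rng = np.random.default_rng(seed)
    # Initialization: k distinct data points, chosen without replacement
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid -> (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each data point to its closest centroid
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # no centroid moved: the assignment has stabilized
        centroids = new_centroids
    return centroids, labels
```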
[Figure: scatter plots of the data split into clusters C1 and C2, with × marking each cluster's centroid]
K-Means Clustering - Algorithm
§ Initialization
  § Randomly choose K initial centroids, e.g. K points picked from the dataset (note: it is not required that the initial centroids be actual data points)
§ Cluster Assignment
  § Assign each data point to its nearest centroid, forming K clusters
§ Move Centroid
  § Take the old centroids and iteratively reposition them (to the mean of their assigned points) for optimization
§ Optimization
  § Iteratively repeat the cluster-assignment and move-centroid steps
§ Convergence
  § Finally, the k-means clustering algorithm converges and divides the data points into clusters (here, two) that are clearly visible in the scatter plot
K-Means Clustering - Example
§ Suppose we want to group the visitors to a website using just their age (a one-dimensional space):
  15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65
§ N = 19, K = 2
§ Initial centroids: c1 = 16, c2 = 22
§ At each iteration, every point xi is assigned to the nearer centroid by comparing |xi − c1| and |xi − c2|, and the centroids are then recomputed
§ Iteration I
  § Before: c1 = 16, c2 = 22
  § After: c1 = 15.33, c2 = 36.25
§ Iteration II
  § Before: c1 = 15.33, c2 = 36.25
  § After: c1 = 18.56, c2 = 45.9
§ Iteration III
  § Before: c1 = 18.56, c2 = 45.9
  § After: c1 = 19.50, c2 = 47.89
§ Iteration IV
  § Before: c1 = 19.50, c2 = 47.89
  § After: c1 = 19.50, c2 = 47.89
§ The centroids did not change in iteration IV, so the algorithm has converged
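The same trace can be reproduced with a few lines of NumPy (a sketch; note the tie at age 19 in the first iteration, which the example resolves toward c2):

```python
import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float)
c1, c2 = 16.0, 22.0  # initial centroids from the example

for it in range(1, 5):
    # Assign each age to the nearer centroid; the tie at age 19 in the
    # first iteration is broken toward c2, matching the trace above
    in_c2 = np.abs(ages - c2) <= np.abs(ages - c1)
    c1, c2 = ages[~in_c2].mean(), ages[in_c2].mean()
    print(f"Iteration {it}: c1 = {c1:.2f}, c2 = {c2:.2f}")

# Iteration 1: c1 = 15.33, c2 = 36.25
# Iteration 2: c1 = 18.56, c2 = 45.90
# Iteration 3: c1 = 19.50, c2 = 47.89
# Iteration 4: c1 = 19.50, c2 = 47.89   <- no change: converged
```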
Elbow Method
§ The within-cluster sum of squares (WSS) is the sum of squared (Euclidean) distances between the items and their corresponding centroid, taken over all clusters:
  $\mathrm{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, c_k)^2$
§ Plot WSS against the number of clusters K: the curve drops steeply at first and then flattens; the K at the bend (the "elbow") is a good choice for the number of clusters
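In practice the WSS curve is usually computed with a library; a sketch assuming scikit-learn, whose `KMeans` exposes the WSS as `inertia_` (reusing the ages from the example above):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
              35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float).reshape(-1, 1)

# Fit k-means for a range of K and record the WSS (inertia) for each
wss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)

for k, w in enumerate(wss, start=1):
    print(f"K = {k}: WSS = {w:.1f}")
# Plot K vs. WSS and look for the elbow where the curve flattens out
```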
§ Alternatively, partition the data on different metrics and see how well it performs for that particular case, e.g. deciding how many clothing sizes to offer:
  § K = 3: if you want to provide only 3 sizes (S, M, L) so that prices are cheaper, you will divide the data set into 3 clusters
  § K = 5: if you want to provide more comfort and variety to your customers with more sizes (XS, S, M, L, XL), then you will divide the data set into 5 clusters
§ It is very popular and used in a variety of applications such as market segmentation and document clustering
§ It always tries to construct a spherical shape around each centroid; as soon as the clusters have complicated geometric shapes, k-means does a poor job of clustering the data
§ It doesn't let data points that are far away from each other share the same cluster, even when they obviously belong to the same cluster
§ Hierarchical clustering can be agglomerative (bottom-up merging) or divisive (top-down splitting)
[Figure: dendrogram over items A, B, C, D, E; cutting the dendrogram with a horizontal line (e.g. across the biggest vertical gap) yields the flat clusters]
Hierarchical Clustering
§ Dendrogram
  § A tree that records the sequence of merges: the root is the single cluster containing everything, and each parent node joins its child clusters
Agglomerative Clustering
§ Start by assigning each item to its own cluster: if you have N items, you now have N clusters, each containing just one item
§ These clusters are then joined greedily, taking the two most similar clusters together:
  § Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less
  § Compute the distances (similarities) between the new cluster and each of the old clusters
  § Repeat the two steps above until all items are clustered into a single cluster of size N (the root cluster)
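These steps are essentially what `scipy.cluster.hierarchy.linkage` implements; a small sketch (the five 2-D points are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five illustrative points; each one starts out as its own cluster
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Each row of Z records one greedy merge:
# (cluster index a, cluster index b, distance at merge, size of new cluster)
Z = linkage(X, method="single")
print(Z)

# Cut the tree into a flat clustering, here at most 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 2]: the two left-hand points vs. the rest
```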
§ A linkage criterion defines the distance between clusters, i.e. which pair of clusters should be merged:
  § Single-linkage
  § Complete-linkage
  § Average-linkage
§ Single linkage
  § The distance between two groups is defined as the distance between their two closest members
  § It often yields clusters in which individuals are added sequentially to a single group
Agglomerative Clustering
§ Complete linkage
  § Also known as furthest-neighbour clustering
  § The distance between two groups is defined as the distance between their two farthest-apart members
§ Average linkage
  § The distance between two groups is defined as the average of the distances between all pairs of their members
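The three criteria differ only in how they aggregate the pairwise member distances; a NumPy sketch (function names are illustrative):

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances between members of group A and group B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = lambda A, B: pairwise(A, B).min()    # two closest members
complete = lambda A, B: pairwise(A, B).max()    # two farthest-apart members
average  = lambda A, B: pairwise(A, B).mean()   # average over all pairs

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single(A, B), complete(A, B), average(A, B))  # 3.0 6.0 4.5
```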
Divisive Clustering
§ Start with all the data points in a single root cluster
§ Then, using a parametric clustering method like k-means, divide the cluster into multiple clusters
§ For each resulting cluster, repeat the process to find sub-clusters, until the desired number of clusters is found
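A sketch of this top-down scheme, repeatedly bisecting the largest cluster with 2-means (scikit-learn assumed; the `divisive` helper is illustrative and assumes every cluster it splits has at least 2 points):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters, random_state=0):
    """Top-down clustering: start with one cluster holding all points,
    then repeatedly split the largest cluster in two with k-means."""
    clusters = [np.arange(len(X))]  # the single root cluster
    while len(clusters) < n_clusters:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)  # take the biggest cluster and split it
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X[idx])
        clusters.append(idx[km.labels_ == 0])
        clusters.append(idx[km.labels_ == 1])
    return clusters

# Example: split the ages from the k-means example into 3 clusters
ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float).reshape(-1, 1)
for c in divisive(ages, 3):
    print(ages[c].ravel())
```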