Lec09 Clustering
What is Clustering?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis: grouping a set of data objects into clusters
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
• Urban planning: Identifying groups of houses according to their house
type, value, and geographical location
• Seismology: Observed earthquake epicenters should be clustered along
continent faults
What Is a Good Clustering?
• A good clustering produces clusters with high intra-cluster similarity and
low inter-cluster similarity
Similarity and Dissimilarity Between Objects
• Euclidean distance:

$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
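A minimal NumPy sketch of this distance (the function name euclidean is our own, not from the lecture):

```python
import numpy as np

def euclidean(xi, xj):
    """Euclidean distance between two p-dimensional points."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    return np.sqrt(np.sum(np.abs(xi - xj) ** 2))

print(euclidean([0, 0], [3, 4]))  # 5.0
```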
Major Clustering Approaches
• Partitioning methods (e.g., k-means, k-medoids)
• Hierarchical methods (agglomerative, divisive)
• Density-based methods
• Model-based methods
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters
• Given k, find a partition into k clusters that optimizes the chosen
partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented by the
center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw, 1987): Each cluster is represented by one of the objects
in the cluster
K-Means Clustering
• Given k, the k-means algorithm consists of four steps:
– Select initial centroids at random.
– Assign each object to the cluster with the nearest centroid.
– Compute each centroid as the mean of the objects assigned
to it.
– Repeat the previous two steps until no change (see the sketch below).
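A minimal NumPy sketch of these four steps (function and parameter names are our own; an empty cluster keeps its old centroid, one common convention among several):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, p) data matrix X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: select initial centroids at random (without replacement).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        # Step 4: repeat until no change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```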
Algorithm Definition
• The K-Means algorithm is a method for clustering objects, based on their
attributes, into k partitions.
• It implicitly assumes roughly spherical clusters, such as those produced
by isotropic Gaussian distributions.
• It assumes that the object attributes form a vector space.
• Its objective is to minimize the total intra-cluster variance.
Algorithm Fitness Function
• The K-Means algorithm attempts to minimize the squared error over all
elements in all clusters:

$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$

• where E is the sum of squared errors over all elements in the data set,
p is a given element, and $m_i$ is the mean of cluster $C_i$.
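The criterion E can be computed directly from the notation above; a small sketch (the function name sse is ours):

```python
import numpy as np

def sse(X, labels, centroids):
    """E: sum of squared distances from each element to its cluster mean."""
    return sum(
        np.sum((X[labels == i] - centroids[i]) ** 2)
        for i in range(len(centroids))
    )
```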
K-Means Algorithm
• Input
– k: the number of clusters
– D: a dataset containing n elements
K-Means Clustering (Example)
(Figure: four scatter plots on a 0–10 grid illustrating successive iterations of k-means on the example data.)
Comments on the K-Means Method
• Strengths
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be
found using techniques such as simulated annealing and genetic
algorithms
• Weaknesses
– Applicable only when mean is defined (what about categorical data?)
– Need to specify k, the number of clusters, in advance
– Trouble with noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
K-medoids Clustering
• K-means is appropriate when we can work with Euclidean distances
• Thus, K-means can work only with numerical, quantitative variable
types
• Euclidean distances do not work well in at least two situations
– Some variables are categorical
– Outliers can severely distort the cluster means
• A general version of K-means algorithm called K-medoids can work
with any distance measure
• K-medoids clustering is computationally more intensive
K-medoids Algorithm
• Step 1: For a given cluster assignment C, find the observation in each
cluster minimizing the total distance to the other points in that cluster:

$i_k^{*} = \underset{\{i \,:\, C(i)=k\}}{\arg\min} \sum_{C(j)=k} d(x_i, x_j), \qquad m_k = x_{i_k^{*}}, \; k = 1, \ldots, K$

• Step 2: Given the current set of cluster centers $\{m_1, \ldots, m_K\}$,
minimize the total error by assigning each observation to the closest
(current) cluster center:

$C(i) = \underset{1 \le k \le K}{\arg\min}\; d(x_i, m_k), \qquad i = 1, \ldots, N$

• Iterate steps 1 and 2 until the assignments do not change
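A minimal sketch of this alternation, assuming a precomputed (n, n) pairwise distance matrix D (any dissimilarity works; the names are our own):

```python
import numpy as np

def kmedoids(D, k, max_iter=100, seed=0):
    """Alternating k-medoids on an (n, n) pairwise distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    labels = D[:, medoids].argmin(axis=1)
    for _ in range(max_iter):
        # Step 1: within each cluster, pick the observation minimizing
        # the total distance to the other points in that cluster.
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                sub = D[np.ix_(members, members)]
                medoids[j] = members[sub.sum(axis=1).argmin()]
        # Step 2: assign each observation to the closest medoid.
        new_labels = D[:, medoids].argmin(axis=1)
        if np.array_equal(new_labels, labels):  # iterate until no change
            break
        labels = new_labels
    return medoids, labels
```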
K-medoids Summary
• Generalized K-means
• Computationally much costlier than K-means
• Apply when dealing with categorical data
• Apply when the data points themselves are not available and only pairwise
distances are
• Converges to a local minimum
Hierarchical Clustering
• Two types: (1) agglomerative (bottom up), (2) divisive (top down)
• Agglomerative: two groups are merged if distance between them is less
than a threshold
• Divisive: one group is split into two if the intergroup distance is more
than a threshold
• When the process is monotonic (the dissimilarity between merged clusters
increases at each step), it can be displayed in an excellent graphical
representation called a "dendrogram". Agglomerative clustering possesses
this property; not all divisive methods do.
• Heights of nodes in a dendrogram are proportional to the threshold
value that produced them.
An Example Hierarchical Clustering
Hierarchical Clustering
• Use distance matrix as clustering criteria.
• This method does not require the number of clusters k as an input, but
needs a termination condition
(Diagram: agglomerative clustering (bottom up) proceeds from singletons {a}, {b}, {c}, {d}, {e} at step 0, merging {a, b} and {d, e}, then {c, d, e}, and finally {a, b, c, d, e} at step 4; divisive clustering (top down) performs the same sequence in reverse.)
Agglomerative Nesting (Bottom Up)
• Produces tree of clusters (nodes)
• Initially: each object is a cluster (leaf)
• Recursively merges nodes that have the least dissimilarity
• Criteria: min distance, max distance, avg distance, center distance
• Eventually all nodes belong to the same cluster (root)
(Figure: three scatter plots on a 0–10 grid illustrating successive agglomerative merging steps.)
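In practice this is typically done with a library; a sketch using SciPy's hierarchical-clustering routines on toy data (the data, linkage criterion, and cluster count are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).random((20, 2))     # toy data
Z = linkage(pdist(X), method="average")          # also: "single", "complete", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # terminate when 3 clusters remain
```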
A Dendrogram Shows How the Clusters Are Merged Hierarchically
• Decompose data objects into several levels of nested partitioning (tree
of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram
at the desired level. Then each connected component forms a cluster.
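To visualize the tree and cut it at a desired level, SciPy's dendrogram and fcluster can be used (Z is the linkage matrix from the earlier sketch; the cut height 1.0 is an arbitrary assumption):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster

dendrogram(Z)                                      # draw the merge tree
plt.show()
labels = fcluster(Z, t=1.0, criterion="distance")  # cut the dendrogram at height 1.0
```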
Divisive Analysis (Top Down)
• Inverse order of Agglomerative
• Start with root cluster containing all objects
• Recursively divide into subclusters
• Eventually each cluster contains a single object
(Figure: three scatter plots on a 0–10 grid illustrating successive divisive splitting steps.)
Linkage Functions
• We know how to measure the distance between two objects, but defining the distance
between an object and a cluster, or between two clusters, is not obvious.
– Single linkage (nearest neighbor): the distance between two clusters is
determined by the distance of the two closest objects (nearest neighbors)
in the different clusters:

$d_{SL}(G, H) = \min_{i \in G,\; j \in H} d_{ij}$

– Group average linkage: the distance between two clusters is calculated as
the average distance between all pairs of objects in the two different
clusters:

$d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d_{ij}$
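Both definitions reduce to simple reductions over the cross-cluster distance block; a sketch with SciPy (the function names are ours):

```python
from scipy.spatial.distance import cdist

def single_linkage(G, H):
    """d_SL(G, H): distance between the two closest points across clusters."""
    return cdist(G, H).min()

def group_average(G, H):
    """d_GA(G, H): mean of all N_G * N_H cross-cluster distances."""
    return cdist(G, H).mean()
```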
Linkage Functions
• SL considers only a single pair of data points; if this pair is close
enough then action is taken. So, SL can form a “chain” by combining
relatively far apart data points.
• SL often violates the compactness property of a cluster: it can produce
clusters with large diameters, where the diameter of a cluster G is

$D_G = \max_{i \in G,\; j \in G} d_{ij}$
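The diameter is just the largest pairwise distance within a cluster; a one-line sketch (the function name is ours):

```python
from scipy.spatial.distance import pdist

def diameter(G):
    """D_G: maximum pairwise distance within cluster G."""
    return pdist(G).max() if len(G) > 1 else 0.0
```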
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
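The slide does not name a specific algorithm; DBSCAN is the canonical density-based method, and a scikit-learn sketch looks like this (eps and min_samples are the density parameters; the data and values are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))            # toy data
labels = DBSCAN(eps=0.08, min_samples=5).fit_predict(X)  # label -1 marks noise
```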
Model-Based Clustering
• Assume the data are generated from K probability distributions
• Typically Gaussian distributions: a soft or probabilistic version of
K-means clustering
• Need to find the distribution parameters
• EM algorithm
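A sketch of this soft version using scikit-learn's GaussianMixture, which fits the distribution parameters with EM (the data and n_components are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).random((200, 2))  # toy data
gm = GaussianMixture(n_components=3).fit(X)    # EM estimates means/covariances
probs = gm.predict_proba(X)                    # soft (probabilistic) assignments
labels = gm.predict(X)                         # hard labels, k-means-style
```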