
Unsupervised Learning

(Clustering)

Lecture 7 slides for CC282 Machine Learning, R. Palaniappan, 2008


Lecture 07 – Outline

• Clustering: introduction
• Clustering approaches
• Exclusive clustering: K-means algorithm
• Agglomerative clustering: Hierarchical algorithm
• Overlapping clustering: Fuzzy C-means algorithm
• Cluster validity problem
• Cluster quality criteria: Davies-Bouldin index



Clustering (introduction)
• Clustering is a type of unsupervised machine learning
• It is distinguished from supervised learning by the fact that there is no a
priori output (i.e. no labels)
– The task is to learn the classification/grouping from the data
• A cluster is a collection of objects which are similar in some way
• Clustering is the process of grouping similar objects into groups
• Eg: a group of people clustered based on their height and weight
• Normally, clusters are created using distance measures
– Two or more objects belong to the same cluster if they are “close” according to a given
distance (in this case a geometrical distance such as Euclidean or Manhattan)
• Another measure is conceptual
– Two or more objects belong to the same cluster if they share a concept common to
all of them
– In other words, objects are grouped according to their fit to descriptive concepts, not
according to simple similarity measures
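
As a quick illustration, a minimal MATLAB sketch of the two geometric distances (the
points p and q are made-up values, not from the slides):

p = [1 2]; q = [4 6];            % two made-up 2-D points
euclid = norm(p - q)             % Euclidean: sqrt(3^2 + 4^2) = 5
manhattan = sum(abs(p - q))      % Manhattan: |1-4| + |2-6| = 7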



Clustering (introduction)
• Example: using distance-based clustering

• This was easy, but what if you had to create 4 clusters?


• Some possibilities are shown below, but which is correct?



Clustering (introduction – ctd)
• So, the goal of clustering is to determine the intrinsic grouping in a set of
unlabeled data
• But how to decide what constitutes a good clustering?
• It can be shown that there is no absolute “best” criterion which would be
independent of the final aim of the clustering
• Consequently, it is the user who must supply this criterion, to suit the
application
• Some possible applications of clustering
– data reduction – reduce data that are homogeneous (similar)
– find “natural clusters” and describe their unknown properties
– find useful and suitable groupings
– find unusual data objects (i.e. outlier detection)



Clustering – Major approaches
• Exclusive (partitioning)
– Data are grouped in an exclusive way: each data point can belong to only one cluster
– Eg: K-means
• Agglomerative
– Every data point is initially its own cluster, and iterative mergers of the two nearest
clusters reduce the number of clusters
– Eg: Hierarchical clustering
• Overlapping
– Uses fuzzy sets to cluster data, so that each point may belong to two or more clusters
with different degrees of membership
– In this case, each data point will be associated with an appropriate membership value
– Eg: Fuzzy C-Means
• Probabilistic
– Uses probability distribution measures to create the clusters
– Eg: Gaussian mixture model clustering (which can be viewed as a probabilistic
generalisation of K-means)
– Will not be discussed in this course



Exclusive (partitioning) clustering
• Aim: Construct a partition of a database D of N objects into a set of K
clusters
• Method: Given a K, find a partition of K clusters that optimises the chosen
partitioning criterion

• K-means (MacQueen, 1967) is one of the most commonly used clustering
algorithms
• It is a heuristic method where each cluster is represented by the centre of
the cluster (i.e. the centroid)

• Note: One- and two-dimensional (i.e. with one and two features) data are
used in this lecture for simplicity of explanation
• In general, clustering algorithms are used with much higher dimensions



K-means clustering algorithm

Given K, the K-means algorithm is implemented in four steps:


1. Choose K points at random as cluster centres (centroids)
2. Assign each instance to its closest cluster centre using a chosen
distance measure (usually Euclidean or Manhattan)
3. Calculate the centroid of each cluster and use it as the new
cluster centre (one measure of centroid is the mean)
4. Go back to Step 2; stop when the cluster centres no longer
change



K-means – an example

• Say we have the data: {20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2, 14, 7, 8, 23, 6, 12,
18} and we are asked to use K-means to cluster these data into 3 groups
• Assume we use Manhattan distance*
• Step one: Choose K points at random to be cluster centres
• Say 6, 12, 18 are chosen

*Note: for one-dimensional data, Manhattan distance = Euclidean distance



K-means – an example (ctd)

• Step two: Assign each instance to its closest cluster centre using
Manhattan distance
• For instance:
– 20 is assigned to cluster 3 (its closest centre is 18)
– 3 is assigned to cluster 1 (its closest centre is 6)



K-means – Example (ctd)

• Step two continued: 9 is equidistant from centres 6 and 12, so it can be
assigned to cluster 1 or cluster 2; let us say it is arbitrarily assigned to
cluster 2
• Repeat for all the rest of the instances



K-Means – Example (ctd)

• And after exhausting all instances (assigning both 9s to cluster 2), we have:
– Cluster 1 (centre 6): {3, 3, 1, 8, 5, 3, 2, 7, 8, 6}
– Cluster 2 (centre 12): {9, 10, 9, 14, 12}
– Cluster 3 (centre 18): {20, 24, 23, 18}

• Step three: Calculate the centroid (i.e. mean) of each cluster and use it as the new
cluster centre – here the new centres are 4.6, 10.8 and 21.25

• End of iteration 1
• Step four: Iterate (repeat steps 2 and 3) until the cluster centres do not change
any more (a MATLAB sketch of the full procedure follows below)
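
To make the iteration concrete, here is a minimal hand-rolled MATLAB sketch of the four
steps on the example data (an illustration, not the built-in ‘kmeans’; it assumes a
recent MATLAB with implicit expansion, and it breaks ties toward the lower-indexed
centre, unlike the arbitrary choice above):

X = [20 3 9 10 9 3 1 8 5 3 24 2 14 7 8 23 6 12 18];
c = [6 12 18];                           % Step 1: initial cluster centres
prev = [Inf Inf Inf];
while any(c ~= prev)
    prev = c;
    [~, idx] = min(abs(X' - c), [], 2);  % Step 2: assign each point to its nearest centre
    for j = 1:3                          % Step 3: recompute each centre as the cluster mean
        c(j) = mean(X(idx == j));
    end
end                                      % Step 4: stop once the centres no longer change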



K-means

• Strengths
– Relatively efficient: O(TKN), where N is the no. of objects, K is the no. of
clusters, and T is the no. of iterations. Normally, K, T << N.
– The procedure always terminates successfully (but see below)

• Weaknesses
– Does not necessarily find the optimal configuration (it may converge to a
local optimum)
– Significantly sensitive to the initial randomly selected cluster centres
– Applicable only when the mean is defined (i.e. can be computed)
– Need to specify K, the number of clusters, in advance



K-means in MATLAB
• Use the built-in ‘kmeans’ function
• Example: for the data that we saw earlier (a sketch follows below)
• The ‘ind’ output gives the cluster index of each data point, while ‘c’ gives
the final cluster centres
• For Manhattan distance, use …‘distance’, ‘cityblock’…
• For Euclidean (the default), there is no need to specify a distance
measure
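
The original slide shows the code as a screenshot; a sketch of the call it describes
(this requires the Statistics Toolbox) might look like:

X = [20 3 9 10 9 3 1 8 5 3 24 2 14 7 8 23 6 12 18]';  % data as a column vector
[ind, c] = kmeans(X, 3, 'distance', 'cityblock');      % Manhattan distance
%[ind, c] = kmeans(X, 3);                              % Euclidean (the default)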



Agglomerative clustering
• The K-means approach starts out with a fixed number of clusters
and allocates all data into exactly that number of clusters
• Agglomeration, in contrast, does not require the number of clusters K
as an input
• Agglomeration starts out by treating each data point as its own cluster
– So, a data set of N objects will start with N clusters
• Next, using some distance (or similarity) measure, it reduces
the number of clusters (one in each iteration) by a merging
process
• Finally, we have one big cluster that contains all the objects
• But then what is the point of having one big cluster in the
end?



Dendrogram (ctd)
• While merging clusters one by one, we can draw a tree diagram known as a
dendrogram
• Dendrograms are used to represent agglomerative clustering
• From a dendrogram, we can obtain any number of clusters
• Eg: say we wish to have 2 clusters; then cut the top link
– Cluster 1: q, r
– Cluster 2: x, y, z, p
• Similarly, for 3 clusters, cut the 2 top links
– Cluster 1: q, r
– Cluster 2: x, y, z
– Cluster 3: p

(Figure: a dendrogram example)

Hierarchical clustering - algorithm
• The hierarchical clustering algorithm is a type of agglomerative clustering

• Given a set of N items to be clustered, the hierarchical clustering algorithm:

1. Start by assigning each item to its own cluster, so that if you have N items, you
now have N clusters, each containing just one item
2. Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that you now have one less cluster
3. Compute pairwise distances between the new cluster and each of the old clusters
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N
5. Draw the dendrogram; with the complete hierarchical tree, if you want K
clusters you just have to cut the K−1 top links

Note: any distance measure can be used: Euclidean, Manhattan, etc.



Hierarchical clustering algorithm – step 3
• Computing distances between clusters for Step 3 can be implemented in
different ways (a small sketch follows the list):
– Single-linkage clustering: the distance between one cluster and another is
computed as the shortest distance from any member of one cluster to any
member of the other cluster
– Complete-linkage clustering: the distance between one cluster and another is
computed as the greatest distance from any member of one cluster to any
member of the other cluster
– Centroid clustering: the distance between one cluster and another is computed
as the distance from one cluster centroid to the other cluster centroid




Hierarchical clustering – an example
• Assume X = [3 7 10 17 18 20]
1. There are 6 items, so create 6 clusters initially

2. Compute pairwise distances between the clusters (assume Manhattan distance)

The closest clusters are {17} and {18} (with distance = 1), so merge these two
clusters together

3. Repeat step 2 (assume single-linkage):

The closest clusters are now {17, 18} and {20} (with single-linkage distance
|18 − 20| = 2), so merge these two clusters together



Hierarchical clustering – an example (ctd)
• Go on repeating cluster mergers until one big cluster remains

• Draw the dendrogram (draw it in the reverse order of the cluster mergers) –
remember that the height of each link corresponds to the distance at which
the two clusters were merged


Hierarchical clustering – an example (ctd) – using MATLAB

%Hierarchical clustering example
X=[3 7 10 17 18 20]; %data
Y=pdist(X', 'cityblock'); %compute pairwise Manhattan distances
Z=linkage(Y, 'single'); %do clustering using single-linkage method
dendrogram(Z); %draw dendrogram – note only item indices are shown

(The dendrogram leaves appear in the order 4 5 6 1 2 3)


Comparing agglomerative vs exclusive clustering

• Agglomerative - advantages
– Preferable for detailed data analysis
– Provides more information than exclusive clustering
– We can decide on any number of clusters without needing to redo
the algorithm – in exclusive clustering, K has to be decided first; if a
different K is used, the whole exclusive clustering algorithm must be
redone
– One unique answer
• Disadvantages
– Less efficient than exclusive clustering
– No backtracking, i.e. it can never undo previous steps



Overlapping clustering – Fuzzy C-means algorithm
• Both agglomerative and exclusive clustering allow one data point to be in
one cluster only
• Fuzzy C-means (FCM) is a method of clustering which allows one piece of
data to belong to more than one cluster
• In other words, each data point is a member of every cluster, but with a
certain degree known as the membership value
• This method (developed by Dunn in 1973 and improved by Bezdek in
1981) is frequently used in pattern recognition
• Fuzzy partitioning is carried out through an iterative procedure that
updates the memberships u_ij and the cluster centroids c_j by

$$u_{ij} = \frac{1}{\sum_{k=1}^{C}\left(\frac{\|x_i - c_j\|}{\|x_i - c_k\|}\right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$$

where m > 1 represents the degree of fuzziness (typically, m = 2)

Overlapping clusters?
• Using both agglomerative and exclusive clustering methods, data point X1 will be a
member of cluster 1 only, while X2 will be a member of cluster 2 only
• However, using FCM, a data point can be a member of both clusters
• FCM uses a distance measure too, so the further a data point is from a cluster
centroid, the smaller its membership value for that cluster will be
• For example, the membership value for X1 from cluster 1 is u11 = 0.73 and the
membership value for X1 from cluster 2 is u12 = 0.27
• Similarly, the membership value for X2 from cluster 2 is u22 = 0.2 and the membership
value for X2 from cluster 1 is u21 = 0.8

• Note: membership values are in the range 0 to 1, and the membership values for each
data point across all the clusters add up to 1
Fuzzy C-means algorithm
• Choose the number of clusters C and the fuzziness m (typically 2)
1. Initialise all membership values u_ij randomly – matrix U(0)

2. At step k: Compute the centroids c_j using

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$$

3. Compute the new membership values u_ij using

$$u_{ij} = \frac{1}{\sum_{k=1}^{C}\left(\frac{\|x_i - c_j\|}{\|x_i - c_k\|}\right)^{\frac{2}{m-1}}}$$

4. Update U(k+1) with the new membership values
5. Repeat steps 2-4 until the change in membership values is very small:
||U(k+1) − U(k)|| < ε, where ε is some small value, typically 0.01

Note: ||·|| denotes Euclidean distance and |·| Manhattan distance.
However, if the data is one-dimensional (like the examples here), Euclidean
distance = Manhattan distance
Fuzzy C-means algorithm – an example
• X = [3 7 10 17 18 20] and assume C = 2
• Initially, set U randomly:

U(0) = [ 0.1 0.2 0.6 0.3 0.1 0.5
         0.9 0.8 0.4 0.7 0.9 0.5 ]

• Compute the centroids c_j using $c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$, assuming m = 2

• c1 = 13.16; c2 = 11.81

• Compute the new membership values u_ij using $u_{ij} = \frac{1}{\sum_{k=1}^{C}\left(\frac{\|x_i - c_j\|}{\|x_i - c_k\|}\right)^{\frac{2}{m-1}}}$

• New U:

U = [ 0.43 0.38 0.24 0.65 0.62 0.59
      0.57 0.62 0.76 0.35 0.38 0.41 ]

• Repeat the centroid and membership computation until the changes in
membership values are smaller than, say, 0.01


Fuzzy C-means algorithm – using MATLAB
• Use the ‘fcm’ function in MATLAB (a sketch follows below)

• The final membership values U give an indication of the similarity of each
item to the clusters

• For eg: item 3 (value 10) is more similar to cluster 1 than to cluster 2, but
item 2 (value 7) is even more similar to cluster 1
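
The slide's code is again a screenshot; a sketch of the call it describes
(this requires the Fuzzy Logic Toolbox) might be:

X = [3 7 10 17 18 20]';   % the example data, one observation per row
[c, U] = fcm(X, 2);       % 2 clusters; c = centroids, U = membership values
% U(j, i) is the membership of item i in cluster j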



Fuzzy C-means algorithm – using MATLAB
• The ‘fcm’ function requires the Fuzzy Logic toolbox
• So, using MATLAB but without the ‘fcm’ function, the two update equations
given earlier can be implemented directly, as in the sketch below:
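
A minimal sketch, assuming m = 2, the earlier example data, and a recent MATLAB with
implicit expansion (‘eps’ guards against division by zero when a point coincides with
a centroid):

X = [3 7 10 17 18 20];                  % data (1-D), N items
C = 2; m = 2; N = numel(X);             % number of clusters, degree of fuzziness
U = rand(C, N); U = U ./ sum(U, 1);     % random memberships, columns sum to 1
for iter = 1:100
    c = (U.^m * X') ./ sum(U.^m, 2);    % centroid update (C-by-1)
    d = abs(c - X) + eps;               % distances |x_i - c_j| (C-by-N)
    Unew = 1 ./ (d.^(2/(m-1)) .* sum(d.^(-2/(m-1)), 1));  % membership update
    if max(abs(Unew(:) - U(:))) < 0.01  % stop when memberships barely change
        U = Unew; break
    end
    U = Unew;
end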



Clustering validity problem
• Problem 1
• A problem we face in clustering is deciding the optimal number
of clusters that fits a data set
• Problem 2
• The various clustering algorithms behave in different ways
depending on
– the features of the data set (geometry and density distribution of clusters)
– the input parameter values (eg: for K-means, the initial cluster choices
influence the result)
• So, how do we know which clustering method is better/suitable?
• We need clustering quality criteria



Clustering validity problem
• In general, good clusters should have
– High intra-cluster similarity, i.e. low variance among intra-cluster members,
where the variance of x is defined by

$$\mathrm{var}(x) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

with $\bar{x}$ as the mean of x
• For eg: if x = [2 4 6 8], then $\bar{x}$ = 5, so var(x) = 6.67

• Computing intra-cluster similarity is simple

• For eg: for the clusters shown (these match cluster 1 = {17, 18, 20} and
cluster 2 = {3, 7, 10} from the earlier example data)

• var(cluster1) = 2.33 while var(cluster2) = 12.33

• So, cluster 1 is better (tighter) than cluster 2
• Note: use the ‘var’ function in MATLAB to compute variance (checked below)
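
For example, checking the numbers above:

var([2 4 6 8])     % returns 6.6667, as computed above
var([17 18 20])    % 2.33 – the tighter cluster
var([3 7 10])      % 12.33 – the looser cluster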



Clustering Quality Criteria
• But this does not tell us anything about how good the overall clustering is,
or about the suitable number of clusters!
• To solve this, we also need to compute the inter-cluster variance
• Good clusters will also have low inter-cluster similarity (i.e. high variance
between cluster centres) in addition to high intra-cluster similarity
(i.e. low variance among intra-cluster members)
• One good measure of clustering quality is the Davies-Bouldin (DB) index
• Others are
– Dunn’s Validity Index
– Silhouette method
– C-index
– Goodman-Kruskal index
• So, we compute the DB index for different numbers of clusters K; the best
value of the DB index tells us the appropriate K, or how good the
clustering method is
Davies-Bouldin index
• It is a function of the ratio of within-cluster (i.e. intra-cluster)
scatter to between-cluster (i.e. inter-cluster) separation
• Because low scatter and a high distance between clusters lead to low
values of R_ij, minimising the DB index is desired
• Let C = {C_1, ..., C_k} be a clustering of a set of N objects:

$$DB = \frac{1}{k}\sum_{i=1}^{k} R_i \qquad R_i = \max_{j=1,\dots,k,\ j\neq i} R_{ij} \qquad R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\|c_i - c_j\|}$$

where C_i is the ith cluster and c_i is the centroid of cluster i

• The numerator of R_ij is a measure of intra-cluster scatter, while the
denominator is a measure of inter-cluster separation
• Note: R_ij = R_ji
Davies-Bouldin index example
• For eg: for the 3 clusters shown (these match {3}, {7, 10} and {17, 18, 20}
from the earlier example data)

• Compute $R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\|c_i - c_j\|}$
– Note: the variance of a one-element cluster is zero, and its centroid is simply
the element itself

• var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
• The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33
• So, R12 = 4.5/5.5 = 0.82, R13 = 2.33/15.33 = 0.152, R23 = 6.83/9.83 = 0.69
• Now, compute $R_i = \max_{j \neq i} R_{ij}$
• R1 = 0.82 (max of R12 and R13); R2 = 0.82 (max of R21 and R23); R3 = 0.69
(max of R31 and R32)
• Finally, compute $DB = \frac{1}{k}\sum_{i=1}^{k} R_i$
• DB = (0.82 + 0.82 + 0.69)/3 = 0.78
Davies-Bouldin index example (ctd)
• For eg: for the 2 clusters shown (cluster 1 = {3, 7, 10}, cluster 2 = {17, 18, 20})

• Compute $R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\|c_i - c_j\|}$

• Only 2 clusters here

• var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while c2 = 18.33
• R12 = (12.33 + 2.33)/|6.67 − 18.33| = 1.26
• Now compute $R_i = \max_{j \neq i} R_{ij}$
• Since we have only 2 clusters here, R1 = R12 = 1.26; R2 = R21 = 1.26
• Finally, compute $DB = \frac{1}{k}\sum_{i=1}^{k} R_i$
• DB = 1.26
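
A minimal MATLAB sketch of the whole computation (‘dbindex’ is a made-up helper name;
for the 2-cluster example above it returns 1.26):

function DB = dbindex(clusters)        % clusters: cell array of 1-D vectors
    k = numel(clusters);
    v = cellfun(@var, clusters);       % within-cluster variances (0 for one element)
    c = cellfun(@mean, clusters);      % cluster centroids (means)
    R = zeros(k);
    for i = 1:k
        for j = 1:k
            if i ~= j
                R(i,j) = (v(i) + v(j)) / abs(c(i) - c(j));
            end
        end
    end
    DB = mean(max(R, [], 2));          % average the worst-case ratios R_i
end

% Eg: dbindex({[3 7 10], [17 18 20]}) returns 1.26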



Davies-Bouldin index example (ctd)
• DB with 2 clusters = 1.26; DB with 3 clusters = 0.78
• So, K = 3 is better than K = 2 (the smaller the DB index, the better the clusters)
• In general, we repeat the DB index computation for all cluster
sizes from 2 to N−1
• So, if we have 10 data items, we will do clustering with K = 2, ..., 9 and
then compute DB for each value of K
– K = 10 is not done since each item would then be its own cluster
• Then, we decide that the best clustering size (and the best set of
clusters) is the one with the minimum value of the DB index (a sketch follows below)
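
A sketch of that scan (using the made-up ‘dbindex’ helper from the previous slide and
the built-in ‘kmeans’):

X = [3 7 10 17 18 20]';
best = Inf;
for K = 2:numel(X)-1
    ind = kmeans(X, K);                % cluster the data into K clusters
    clusters = arrayfun(@(j) X(ind == j)', 1:K, 'UniformOutput', false);
    db = dbindex(clusters);            % score this clustering
    if db < best, best = db; bestK = K; end
end                                    % bestK is the K with the minimum DB index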



Lecture 7: Study Guide
At the end of this lecture, you should be able to
• Define clustering
• Name major clustering approaches and differentiate between
them
• State the K-means algorithm and apply it to a given data set
• State the hierarchical algorithm and apply it to a given data set
• Compare exclusive and agglomerative clustering methods
• State FCM algorithm and apply it to a given data set
• Identify major problems with clustering techniques
• Define and use cluster validity measures such as DB index on a
given data set

