6. Unsupervised Learning (Clustering): Lecture 7 Slides
• Clustering: introduction
• Clustering approaches
• Exclusive clustering: K-means algorithm
• Agglomerative clustering: Hierarchical algorithm
• Overlapping clustering: Fuzzy C-means algorithm
• Cluster validity problem
• Cluster quality criteria: Davies-Bouldin index
• Note: one- and two-dimensional data (i.e. with one or two features) are used in this lecture for simplicity of explanation
• In general, clustering algorithms are used with much higher-dimensional data
• Say, we have the data: {20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2, 14, 7, 8, 23, 6, 12,
18} and we are asked to use K-means to cluster these data into 3 groups
• Assume we use Manhattan distance (for 1-D data this is simply |x - c|)
• Step one: Choose K points at random to be cluster centres
• Say 6, 12, 18 are chosen
• Step two: Assign each instance to its closest cluster centre using
Manhattan distance
• For instance:
– 20 is assigned to cluster 3 (distances to the centres 6, 12, 18 are 14, 8 and 2)
– 3 is assigned to cluster 1 (distances to the centres 6, 12, 18 are 3, 9 and 15)
• Step three: Calculate the centroid (i.e. mean) of each cluster, use it as the new
cluster centre
• End of iteration 1
• Step four: Iterate (repeat steps 2 and 3) until the cluster centres do not change
any more
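A minimal sketch of these steps in Python, using the data above, Manhattan distance and the initial centres 6, 12, 18; the function name kmeans_1d and its structure are illustrative, not from the slides:

def kmeans_1d(data, centres, max_iter=100):
    """K-means on 1-D data, using the Manhattan distance |x - c|."""
    centres = list(centres)
    for _ in range(max_iter):
        # Step 2: assign each instance to its closest cluster centre
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            clusters[nearest].append(x)
        # Step 3: the centroid (mean) of each cluster becomes the new centre
        new_centres = [sum(c) / len(c) if c else centres[j]
                       for j, c in enumerate(clusters)]
        # Step 4: stop when the cluster centres no longer change
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

data = [20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2, 14, 7, 8, 23, 6, 12, 18]
centres, clusters = kmeans_1d(data, [6, 12, 18])
print(centres)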
• Strengths
– Relatively efficient: O(TKN), where N is the no. of objects, K the no. of clusters and T the no. of iterations; normally K, T << N
– Procedure always terminates successfully (but see below)
• Weaknesses
– Does not necessarily find the optimal configuration (it may converge to a local optimum)
– Significantly sensitive to the initial randomly selected cluster centres
– Applicable only when mean is defined (i.e. can be computed)
– Need to specify K, the number of clusters, in advance
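The sensitivity to the initial centres can be seen by rerunning the sketch above from a different random start (purely illustrative; the clusters found may differ from run to run):

import random

random.seed(1)                      # a different seed may give a different result
init = random.sample(data, 3)       # three random data points as initial centres
print(kmeans_1d(data, init)[0])     # compare with the centres found above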
A dendrogram example
[Dendrogram figure with leaves 4, 5, 6, 1, 2, 3]
Hierarchical clustering - algorithm
• The hierarchical clustering algorithm is a type of agglomerative clustering: it starts with each instance as its own cluster and repeatedly merges the two closest clusters, producing a dendrogram
• Agglomerative - advantages
– Preferable for detailed data analysis
– Provides more information than exclusive clustering
– We can decide on any number of clusters without needing to redo the algorithm (see the sketch after this list); in exclusive clustering, K has to be decided first, and if a different K is wanted the whole exclusive clustering algorithm must be rerun
– One unique answer
• Disadvantages
– Less efficient than exclusive clustering
– No backtracking, i.e. can never undo previous steps
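A minimal sketch of agglomerative clustering on the same 1-D data, assuming SciPy is available; the single-linkage choice and the cityblock metric are assumptions, not specified in the slides. Note how any number of clusters can be read off the finished tree without rerunning the algorithm:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = [20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2, 14, 7, 8, 23, 6, 12, 18]
X = np.array(data, dtype=float).reshape(-1, 1)

# Build the full merge tree (the dendrogram) bottom-up
Z = linkage(X, method='single', metric='cityblock')

# Cut the tree at different numbers of clusters without redoing the clustering
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k, labels)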
• The cluster centroids c_j and the membership values u_ij are computed as
$$c_j=\frac{\sum_{i=1}^{N}u_{ij}^{m}\,x_i}{\sum_{i=1}^{N}u_{ij}^{m}},\qquad u_{ij}=\frac{1}{\sum_{k=1}^{C}\left(\frac{\|x_i-c_j\|}{\|x_i-c_k\|}\right)^{\frac{2}{m-1}}}$$
• Note: membership values lie in the range 0 to 1, and for each data point the membership values over all clusters sum to 1
Fuzzy C-means algorithm
• Choose the number of clusters C and the fuzziness exponent m, typically m = 2
1. Initialise all membership values u_ij randomly, giving the matrix U^(0)
2. Compute the centroids c_j using
$$c_j=\frac{\sum_{i=1}^{N}u_{ij}^{m}\,x_i}{\sum_{i=1}^{N}u_{ij}^{m}}$$
3. Compute new membership values u_ij using
$$u_{ij}=\frac{1}{\sum_{k=1}^{C}\left(\frac{\|x_i-c_j\|}{\|x_i-c_k\|}\right)^{\frac{2}{m-1}}}$$
4. Update the membership matrix from U^(k) to U^(k+1)
5. Repeat steps 2-4 until the change in membership values is very small, i.e. ||U^(k+1) - U^(k)|| < ε, where ε is some small value, typically 0.01
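A minimal sketch of steps 1-5, assuming m = 2 and 1-D data; the function name, the fixed random seed and the use of the earlier K-means data as a demonstration are all illustrative assumptions:

import numpy as np

def fuzzy_c_means(x, C=2, m=2.0, eps=0.01, max_iter=100):
    """Fuzzy C-means on 1-D data; returns centroids and the membership matrix U."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Step 1: random memberships, normalised so each data point's memberships sum to 1
    rng = np.random.default_rng(0)
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centroids  c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        Um = U ** m
        c = (Um * x[:, None]).sum(axis=0) / Um.sum(axis=0)
        # Step 3: new memberships  u_ij = 1 / sum_k (||x_i - c_j|| / ||x_i - c_k||)^(2/(m-1))
        d = np.abs(x[:, None] - c[None, :]) + 1e-12     # avoid division by zero
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        # Steps 4-5: replace U and stop once the change is below eps
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return c, U

c, U = fuzzy_c_means([20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2,
                      14, 7, 8, 23, 6, 12, 18], C=3)
print(c)
print(U.sum(axis=1))   # each row sums to 1, as noted above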
Fuzzy C-means example
• Compute the centroids c_j using
$$c_j=\frac{\sum_{i=1}^{N}u_{ij}^{m}\,x_i}{\sum_{i=1}^{N}u_{ij}^{m}}$$
assuming m = 2
• c1=13.16; c2=11.81
• Compute new membership values u_ij using
$$u_{ij}=\frac{1}{\sum_{k=1}^{C}\left(\frac{\|x_i-c_j\|}{\|x_i-c_k\|}\right)^{\frac{2}{m-1}}}$$
• New U:
$$U=\begin{bmatrix}0.43 & 0.38 & 0.24 & 0.65 & 0.62 & 0.59\\ 0.57 & 0.62 & 0.76 & 0.35 & 0.38 & 0.41\end{bmatrix}$$
• Each further iteration recomputes the centroids and membership values with the same formulas:
$$c_j=\frac{\sum_{i=1}^{N}u_{ij}^{m}\,x_i}{\sum_{i=1}^{N}u_{ij}^{m}},\qquad u_{ij}=\frac{1}{\sum_{k=1}^{C}\left(\frac{\|x_i-c_j\|}{\|x_i-c_k\|}\right)^{\frac{2}{m-1}}}$$
Davies-Bouldin index
• The Davies-Bouldin index is
$$DB=\frac{1}{k}\sum_{i=1}^{k}R_i,\quad\text{with}\quad R_i=\max_{j=1,\dots,k,\;j\neq i}R_{ij}\quad\text{and}\quad R_{ij}=\frac{\mathrm{var}(C_i)+\mathrm{var}(C_j)}{\|c_i-c_j\|}$$
Davies-Bouldin index example
• Compute
$$R_{ij}=\frac{\mathrm{var}(C_i)+\mathrm{var}(C_j)}{\|c_i-c_j\|}$$
• Note: the variance of a one-element cluster is zero and its centroid is simply the element itself
• var(C1)=0, var(C2)=5.5, var(C3)=2.33
• Centroid is simply the mean here, so c1=3, c2=8.5, c3=18.33
• So, R12=1, R13=0.152, R23=0.797
• Now, compute
$$R_i=\max_{j=1,\dots,k,\;j\neq i}R_{ij}$$
• R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and R23); R3 = 0.797 (max of R31 and R32)
• Finally, compute
$$DB=\frac{1}{k}\sum_{i=1}^{k}R_i$$
• DB = 0.932
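A minimal sketch of this computation using the centroids and variances quoted above (the function name is illustrative); it reproduces R12 = 1, R23 ≈ 0.797 and DB ≈ 0.93:

def davies_bouldin(variances, centroids):
    """DB = (1/k) * sum_i R_i, with R_ij = (var(C_i) + var(C_j)) / ||c_i - c_j||."""
    k = len(centroids)
    R = [[(variances[i] + variances[j]) / abs(centroids[i] - centroids[j])
          for j in range(k) if j != i] for i in range(k)]
    Ri = [max(row) for row in R]          # R_i = max over j != i of R_ij
    return sum(Ri) / k

# Variances 0, 5.5, 2.33 and centroids 3, 8.5, 18.33 from the worked example
print(davies_bouldin([0.0, 5.5, 2.33], [3.0, 8.5, 18.33]))   # about 0.93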
Davies-Bouldin index example (ctd)
• For example, for the clusters shown
• Compute
$$R_{ij}=\frac{\mathrm{var}(C_i)+\mathrm{var}(C_j)}{\|c_i-c_j\|}$$
• DB = 1.26