Clustering
• Clustering: introduction
• Clustering approaches
• Exclusive clustering: K-means algorithm
• Agglomerative clustering: Hierarchical algorithm
• Overlapping clustering: Fuzzy C-means algorithm
• Cluster validity problem
• Cluster quality criteria: Davies-Bouldin index
Clustering (introduction)
• Clustering is a type of unsupervised machine learning
• It is distinguished from supervised learning by the fact that there is no a
priori output (i.e. no labels)
– The task is to learn the classification/grouping from the data
• A cluster is a collection of objects which are similar in some way
E.g. a group of people clustered based on their height and weight
• Normally, clusters are created using distance measures
– Two or more objects belong to the same cluster if they are “close” according to a given
distance (in this case geometrical distance like Euclidean or Manhattan)
• Another measure is conceptual
– Two or more objects belong to the same cluster if the cluster defines a concept common to
all of those objects
– In other words, objects are grouped according to their fit to descriptive concepts, not
according to simple similarity measures
Clustering (introduction)
• Example: using distance-based clustering
• Note: One and two dimensional (i.e. with one and two features) data are
used in this lecture for simplicity of explanation
• In general, clustering algorithms are used with much higher dimensions
K-means clustering algorithm
• Say, we have the data: {20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2, 14, 7, 8, 23, 6, 12,
18} and we are asked to use K-means to cluster these data into 3 groups
• Assume we use the Manhattan distance
• Step one: Choose K points at random to be cluster centres
• Say 6, 12, 18 are chosen
• Step two: Assign each instance to its closest cluster centre using
Manhattan distance
• For instance:
– 20 is assigned to cluster 3
– 3 is assigned to cluster 1
K-means – Example (ctd)
• Step three: Calculate the centroid (i.e. mean) of each cluster and use it as the new
cluster centre
• End of iteration 1
• Step four: Iterate (repeat steps 2 and 3) until the cluster centres do not change any
more
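• As an illustration, here is a minimal MATLAB sketch of these four steps on the 1-D data above, written out by hand with the initial centres 6, 12 and 18 (the built-in kmeans function is shown later); it assumes a MATLAB version with implicit expansion (R2016b or later).

```matlab
% Minimal sketch of the K-means steps above on the 1-D example data
% (manual implementation, not the built-in kmeans function)
x = [20 3 9 10 9 3 1 8 5 3 24 2 14 7 8 23 6 12 18];   % data
c = [6 12 18];                                         % step 1: initial cluster centres
for iter = 1:100
    % Step 2: assign each point to its closest centre (Manhattan = |x - c| in 1-D)
    [~, idx] = min(abs(x' - c), [], 2);
    % Step 3: recompute each centre as the mean of its members
    newC = arrayfun(@(k) mean(x(idx == k)), 1:numel(c));
    if isequal(newC, c), break; end   % step 4: stop when the centres no longer change
    c = newC;
end
disp(c)      % final cluster centres
disp(idx')   % cluster index of each data point
```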
Exercise
Given Points   Distance from centre     Distance from centre     Distance from centre       Point belongs
               (2, 10) of Cluster-01    (6, 6) of Cluster-02     (1.5, 3.5) of Cluster-03   to Cluster
A1(2, 10)      0                        8                        7                          C1
A2(2, 5)       5                        5                        2                          C3
A3(8, 4)       12                       4                        7                          C2
A4(5, 8)       5                        3                        8                          C2
A5(7, 5)       10                       2                        7                          C2
A6(6, 4)       10                       2                        5                          C2
A7(1, 2)       9                        9                        2                          C3
A8(4, 9)       3                        5                        8                          C1
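• The table can be checked with a short MATLAB sketch that computes the Manhattan distance from each point to the three given centres and picks the closest one (the variable names P and C are only illustrative).

```matlab
% Sketch for checking the exercise table: Manhattan distance from each point
% to the three given centres, and the resulting cluster assignment
P = [2 10; 2 5; 8 4; 5 8; 7 5; 6 4; 1 2; 4 9];   % points A1..A8
C = [2 10; 6 6; 1.5 3.5];                        % centres of clusters 1..3
for i = 1:size(P, 1)
    d = sum(abs(C - P(i, :)), 2);   % Manhattan distance to each centre
    [~, k] = min(d);                % closest centre = assigned cluster
    fprintf('A%d -> distances %g %g %g, cluster C%d\n', i, d, k);
end
```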
K-means
• Strengths
– Relatively efficient: O(TKN), where N is the number of objects, K is the number of clusters, and T is
the number of iterations. Normally, K, T << N.
– Procedure always terminates successfully (but see below)
• Weaknesses
– Does not necessarily find the optimal configuration
– Significantly sensitive to the initial randomly selected cluster centres
– Applicable only when mean is defined (i.e. can be computed)
– Need to specify K, the number of clusters, in advance
K-means in MATLAB
• Use the built-in ‘kmeans’ function
• Example: for the data that we saw earlier (see the sketch below)
• The ‘ind’ output gives the cluster index of each data point, while ‘c’ gives the final cluster centres
• For Manhattan distance, use …‘distance’, ‘cityblock’…
• For Euclidean distance (the default), there is no need to specify a distance measure
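• A possible call, for the 1-D data used earlier (this assumes the Statistics and Machine Learning Toolbox; the output names ‘ind’ and ‘c’ follow the slide):

```matlab
% Using the built-in kmeans on the earlier 1-D data
x = [20 3 9 10 9 3 1 8 5 3 24 2 14 7 8 23 6 12 18]';   % data as a column vector
[ind, c] = kmeans(x, 3, 'distance', 'cityblock');       % Manhattan ('cityblock') distance
% [ind, c] = kmeans(x, 3);                              % Euclidean (default) distance
disp(ind')   % cluster index of each data point
disp(c')     % final cluster centres
```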
Nearest Neighbour Clustering
Agglomerative clustering
• The K-means approach starts out with a fixed number of clusters
and allocates all the data into exactly that number of clusters
• Agglomerative clustering, however, does not require the number of clusters K
as an input
• Agglomeration starts out by treating each data point as its own cluster
– So, a data set of N objects starts with N clusters
• Next, using some distance (or similarity) measure, it reduces
the number of clusters (by one in each iteration) through a merging
process
• Finally, we have one big cluster that contains all the objects
• But then what is the point of having one big cluster in the
end?
Dendrogram (ctd)
• While merging clusters one by one, we can draw a tree diagram known as a
dendrogram
• Dendrograms are used to represent agglomerative clustering
• From dendrograms, we can get any number of clusters
• E.g. say we wish to have 2 clusters: cut the topmost link
– Cluster 1: q, r
– Cluster 2: x, y, z, p
• Similarly for 3 clusters, cut 2 top links
– Cluster 1: q, r
– Cluster 2: x, y, z
– Cluster 3: p
A dendrogram example
Hierarchical clustering - algorithm
• Hierarchical clustering algorithm is a type of agglomerative clustering
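• As a sketch, agglomerative (hierarchical) clustering might be run in MATLAB as follows, assuming the Statistics and Machine Learning Toolbox; the data points and the choice of single linkage are only illustrative.

```matlab
% A small sketch of agglomerative (hierarchical) clustering in MATLAB
X = [1 1; 1.5 1.2; 5 5; 5.2 4.8; 9 1; 9.1 1.3];   % six illustrative 2-D points
Z = linkage(X, 'single');        % merge clusters one by one (single linkage)
dendrogram(Z);                   % draw the dendrogram
T2 = cluster(Z, 'maxclust', 2);  % "cut" the tree to obtain 2 clusters
T3 = cluster(Z, 'maxclust', 3);  % ... or 3 clusters, without redoing the merging
```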
Comparing agglomerative vs exclusive clustering
• Agglomerative - advantages
– Preferable for detailed data analysis
– Provides more information than exclusive clustering
– We can decide on any number of clusters without the need to redo
the algorithm; in exclusive clustering, K has to be decided first, and if a
different K is used, the whole exclusive clustering algorithm has to be redone
– One unique answer
• Disadvantages
– Less efficient than exclusive clustering
– No backtracking, i.e. can never undo previous steps
Overlapping clustering – Fuzzy C-means algorithm
• Both agglomerative and exclusive clustering allow a data point to belong to one
cluster only
• Fuzzy C-means (FCM) is a method of clustering which allows one piece of
data to belong to more than one cluster
• In other words, each data point is a member of every cluster, but with a certain
degree known as its membership value
• This method (developed by Dunn in 1973 and improved by Bezdek in 1981)
is frequently used in pattern recognition
• Fuzzy partitioning is carried out through an iterative procedure that updates
membership uij and the cluster centroids cj by
$$u_{ij} = \frac{1}{\sum_{k=1}^{C}\left(\dfrac{\|x_i - c_j\|}{\|x_i - c_k\|}\right)^{\frac{2}{m-1}}}
\qquad\qquad
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$$
• Note: membership values are in the range 0 to 1 and membership values for each data
from all the clusters will add to 1
Fuzzy C-means algorithm
• Choose the number of clusters C and the fuzziness exponent m (typically m = 2)
1. Initialise all membership values $u_{ij}$ randomly – matrix $U^{(0)}$
2. At step k: compute the centroids using $c_j = \dfrac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$
3. Compute the new membership values using $u_{ij} = \dfrac{1}{\sum_{k=1}^{C}\left(\dfrac{\|x_i - c_j\|}{\|x_i - c_k\|}\right)^{\frac{2}{m-1}}}$
4. Update the membership matrix from $U^{(k)}$ to $U^{(k+1)}$
5. Repeat steps 2–4 until the change in membership values is very small, $\|U^{(k+1)} - U^{(k)}\| < \varepsilon$, where $\varepsilon$ is some small value, typically 0.01
Fuzzy C-means – example (ctd)
• Compute the centroids using $c_j = \dfrac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$, assuming m = 2
• c1 = 13.16; c2 = 11.81
• Compute the new membership values using $u_{ij} = \dfrac{1}{\sum_{k=1}^{C}\left(\dfrac{\|x_i - c_j\|}{\|x_i - c_k\|}\right)^{\frac{2}{m-1}}}$
• New U:
$$U = \begin{pmatrix} 0.43 & 0.38 & 0.24 & 0.65 & 0.62 & 0.59 \\ 0.57 & 0.62 & 0.76 & 0.35 & 0.38 & 0.41 \end{pmatrix}$$
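• Below is a minimal sketch of the FCM iteration described above, written out by hand for illustrative 1-D data with m = 2 and stopping threshold 0.01; if the Fuzzy Logic Toolbox is available, its fcm function could be used instead.

```matlab
% Minimal manual sketch of Fuzzy C-means (illustrative data and variable names)
x = [1 2 3 11 12 13];                    % illustrative 1-D data
C = 2; m = 2; epsilon = 0.01; N = numel(x);
U = rand(C, N); U = U ./ sum(U, 1);      % step 1: random memberships, each column sums to 1
for it = 1:100
    % step 2: centroids  c_j = sum_i u_ij^m x_i / sum_i u_ij^m
    c = (U.^m * x') ./ sum(U.^m, 2);     % C-by-1 vector of centroids
    % step 3: new memberships  u_ij = 1 / sum_k (||x_i - c_j|| / ||x_i - c_k||)^(2/(m-1))
    d = abs(c - x) + eps;                % C-by-N distances (eps avoids division by zero)
    Unew = d.^(-2/(m-1)) ./ sum(d.^(-2/(m-1)), 1);
    % steps 4-5: update U and stop when the change in memberships is small
    if max(abs(Unew(:) - U(:))) < epsilon, U = Unew; break; end
    U = Unew;
end
disp(c')   % final centroids
disp(U)    % final membership matrix (columns sum to 1)
```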
Clustering validity problem
• Problem 1
• A problem we face in clustering is to decide the optimal number
of clusters that fits a data set
• Problem 2
• The various clustering algorithms behave differently
depending on
– the features of the data set (geometry and density distribution of clusters)
– the input parameters values (eg: for K-means, initial cluster choices
influence the result)
• So, how do we know which clustering method is better/suitable?
• We need a clustering quality criterion
Clustering validity problem
• In general, good clusters should have
– High intra-cluster similarity, i.e. low variance among intra-cluster members,
where the variance of x is defined by
$$\mathrm{var}(x) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
with $\bar{x}$ as the mean of x
• E.g. if x = [2 4 6 8], then $\bar{x} = 5$, so var(x) = 6.67
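• This matches MATLAB's built-in var function, which uses the same 1/(N-1) definition:

```matlab
% Quick check of the variance example
x = [2 4 6 8];
mean(x)   % 5
var(x)    % 6.6667
```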
Cluster quality criteria: Davies-Bouldin index
• The Davies-Bouldin (DB) index is defined by
$$DB = \frac{1}{k}\sum_{i=1}^{k} R_i
\qquad \text{with} \qquad
R_i = \max_{j=1,\dots,k,\; j\neq i} R_{ij}
\qquad \text{and} \qquad
R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\|c_i - c_j\|}$$
Davies-Bouldin index example
• E.g. for the clusters shown, compute $R_{ij} = \dfrac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\|c_i - c_j\|}$ for every pair of clusters
Davies-Bouldin index example (ctd)
• Now, compute $R_i = \max_{j=1,\dots,k,\; j\neq i} R_{ij}$
• R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and R23); R3 = 0.797 (max of R31 and R32)
• Finally, compute $DB = \frac{1}{k}\sum_{i=1}^{k} R_i$
• DB = 0.932
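• The same computation can be sketched in MATLAB for illustrative 1-D clusters (these are not the clusters from the example above, so the resulting DB value differs):

```matlab
% Sketch of the Davies-Bouldin computation defined above for k clusters of 1-D data
clusters = {[1 2 3], [10 11 12], [20 22 24]};   % illustrative clusters C1..C3
k = numel(clusters);
c = cellfun(@mean, clusters);                   % cluster centres
v = cellfun(@var,  clusters);                   % intra-cluster variances
R = zeros(k);                                   % R_ij = (var(Ci) + var(Cj)) / ||ci - cj||
for i = 1:k
    for j = 1:k
        if i ~= j, R(i, j) = (v(i) + v(j)) / abs(c(i) - c(j)); end
    end
end
Ri = max(R, [], 2);                             % R_i = max over j ~= i
DB = mean(Ri)                                   % Davies-Bouldin index
```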
Basic concepts:
• The basic concept of this algorithm is to select an object as the cluster centre (one
representative object per cluster) instead of taking the mean value of the objects
in a cluster (as in the K-means algorithm).
• We call this cluster representative a cluster medoid, or simply a medoid.
1.Initially, it selects a random set of k objects as the set of medoids.
2. Then at each step, all objects that are not currently medoids are examined one by
one to see if they should become medoids (see the sketch below).
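• A rough sketch of this medoid-swap idea (illustrative data; a non-medoid object replaces a medoid whenever the swap lowers the total distance of all objects to their closest medoid; pdist2 assumes the Statistics and Machine Learning Toolbox):

```matlab
% Rough sketch of medoid-based clustering by swapping (not optimised)
X = [1 1; 2 1; 1 2; 8 8; 9 8; 8 9];          % illustrative 2-D data
k = 2;
med = randperm(size(X, 1), k);               % step 1: random initial medoids (row indices)
cost = @(m) sum(min(pdist2(X, X(m, :)), [], 2));   % total distance to the closest medoid
improved = true;
while improved
    improved = false;
    for o = setdiff(1:size(X, 1), med)       % step 2: examine each non-medoid object
        for i = 1:k
            trial = med; trial(i) = o;       % try using object o as a medoid
            if cost(trial) < cost(med)
                med = trial; improved = true;
            end
        end
    end
end
X(med, :)                                    % final medoid objects
```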