Data clustering

Data Clustering
An Unsupervised Learning Approach
Garima Shakya
garimashakya24@gmail.com
Department of Computer Science and Technology
IIEST,Shibpur,Howrah
28 June 2016

Outline
1 Data Clustering
Feature Selection Methods
Distance based Algorithm
2 Relevant Clustering Algorithms
K-means algorithm
Fuzzy C-means Algorithm
Advantages and disadvantages of K-means and Fuzzy C-means Algorithms
3 Clustering validation
Dunn index
Davies Bouldin index

Data Clustering:
"The task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar to each other
than to those in other groups (clusters)."

Applications of Clustering:
The applications of clustering include [2]:
1.) An intermediate step for other fundamental data mining problems.
2.) Collaborative filtering.
3.) Customer segmentation.
4.) Data summarisation.
5.) Multimedia data analysis.
6.) Biological data analysis.
7.) Social network analysis.

Feature Selection Methods:
A preprocessing step in which a subset of the original features is
selected.
Needed in order to improve the quality of the underlying
clustering.
Noisy and irrelevant features are pruned from contention.

Distance based Algorithm:

K-means algorithm
An unsupervised learning algorithm.
Operates on a given data set in m-dimensional hyperspace.
The pre-processing steps are 'Handling missing values' and
'Scaling'.
Scaling:
If an attribute A has range [A_min, A_max], then a value A_x of A is
scaled as:
$$A_x^{\text{scaled}} = \frac{A_x - A_{\min}}{A_{\max} - A_{\min}}$$
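The scaling formula above can be sketched in a few lines of Python. This is only an illustration: the function name `min_max_scale` and the choice to map a constant attribute to 0.0 are assumptions, not part of the slides.

```python
def min_max_scale(values):
    """Scale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute has no spread, so map it to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute values, chosen only for illustration.
scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
print(scaled)
```

After scaling, every attribute lies in [0, 1], so no single attribute dominates the distance computations used later by K-means.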

K-means Algorithm:
Handling missing values:
For example:
1.) Replace the missing values by zero (if numerical).
2.) Replace them by the maximum possible value.
3.) Fill in missing values manually based on your domain knowledge.
4.) Replace them with the variable mean (if numerical) or the most
frequent value (if categorical).
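Option 4 in the list above (mean for numerical attributes, most frequent value for categorical ones) might look like the sketch below. The helper name `impute` and the use of `None` as the missing-value marker are assumptions, and the column is assumed to contain at least one present value.

```python
from collections import Counter


def impute(column):
    """Fill None entries: mean for numeric columns, mode otherwise."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        # Numerical attribute: use the mean of the observed values.
        fill = sum(present) / len(present)
    else:
        # Categorical attribute: use the most frequent observed value.
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


# Hypothetical columns, chosen only for illustration.
ages = impute([20, None, 40])
colours = impute(["red", "blue", None, "red"])
print(ages, colours)
```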
Input: The data set, value of K (number of clusters).
Output: The clustered data set (each data element is assigned to
exactly one of the clusters).

In steps, the algorithm is as follows:
Step 1.) Initialise the k centroids for the k clusters by randomly
selecting each as a point in the m-dimensional hyperspace. Label them
uniquely.
Step 2.) For each data element, do:
2.i) Calculate its distance from every cluster centroid.
2.ii) Compare the distances and give the data element the label of
the centroid nearest to it.
Step 3.) For each cluster, do:
Calculate the mean of the data elements within the cluster. Shift
the centroid to that mean.
Step 4.) 4.i) Calculate the change in position of each cluster
centroid and add them all.
4.ii) If the calculated sum is greater than the pre-specified
threshold and the number of iterations is still below the limit, then
go to Step 2.
Step 5.) Terminate. The data set with cluster labels is the result.
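The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the initial centroids are drawn from the data points themselves (the slides pick random points in the hyperspace), Euclidean distance is used, and the parameter names and default values are hypothetical.

```python
import math
import random


def kmeans(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1 (assumption): start from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: label each point with its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(c) / len(members) for c in zip(*members)]
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])
        # Step 4: total centroid movement decides whether to continue.
        shift = sum(
            math.dist(a, b) for a, b in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift <= threshold:
            break
    # Step 5: the labelled data set is the result.
    return labels, centroids


# Two hypothetical, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, 2)
print(labels, centroids)
```

The loop stops either when the centroids have effectively stopped moving or when `max_iter` is reached, which matches the two stopping conditions in Step 4.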

Example: three-means on the iris data
[Figures: cluster assignments after the first and second iterations]

Fuzzy C-means Algorithm:
Need for fuzzy:
In case of overlapping clusters, hard clustering is not feasible.
To extract such overlapping structures, Fuzzy C-means is used.
• Fuzzy C-means allows a data point to be assigned to more than
one cluster, so each data point has a degree of membership
(or probability) of belonging to each cluster.
Algorithm: Fuzzy clustering is carried out through an iterative
optimization of the objective function:
$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{C} w_{ij}^{m} \, \lVert x_i - c_j \rVert^2$$
where,
m is any real number greater than 1 and determines the level of
cluster fuzziness. A large m results in smaller memberships w_ij and
hence fuzzier clusters.
w_ij is the degree of membership of x_i in cluster j,
x_i is the ith d-dimensional measured data point,
c_j is the d-dimensional center of cluster j, and
||x_i − c_j|| is any norm expressing the similarity (or dissimilarity)
between a measured data point x_i and the centroid c_j.
Input: Data set X = {x_1, x_2, x_3, ..., x_n}, value of C (number of
clusters), value for m.

Output: A set of cluster centers {c_1, c_2, c_3, ..., c_C}, and a
partitioning matrix W:
$$W = [w_{ij}], \quad w_{ij} \in [0, 1], \quad i = 1, \dots, n, \quad j = 1, \dots, C$$
In steps, the algorithm is as follows:
Step 1.) Initialise the C centroids for the C clusters by randomly
selecting each as a point in the d-dimensional hyperspace. Label them
uniquely.
Step 2.) For each data element x_i, do:
2.i) Calculate the distance (or similarity measure) ||x_i − c_j|| from
every cluster centroid.
2.ii) Calculate the fuzzy membership w_ij of x_i in cluster c_j by:
$$w_{ij} = \frac{1}{\sum_{l=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{\frac{2}{m-1}}}$$
and fill the value of w_ij into the matrix W.
Step 3.) For each cluster, do:
3.i) Calculate the new position (or value) of the centroid c_j by:
$$c_j = \frac{\sum_{i=1}^{n} w_{ij}^{m} \, x_i}{\sum_{i=1}^{n} w_{ij}^{m}}$$
3.ii) Shift the centroid to the position (or value) calculated in the
previous step.
Step 4.) 4.i) Update the elements of W(k) to obtain W(k+1).
4.ii) If ||W(k+1) − W(k)|| > β, then go to Step 2. Else the present
value of W is the resultant partitioning matrix.
where k is the iteration step,
β is the termination criterion, β ∈ [0, 1], and
W(k) is the fuzzy membership matrix in the kth iteration.
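A compact sketch of the Fuzzy C-means loop is given below, using the standard membership and centroid updates. Several choices are assumptions made for illustration: Euclidean distance as the norm, the largest absolute change in any membership as ||W(k+1) − W(k)||, initial centroids sampled from the data, and a point that coincides with a centroid receiving full membership in it. All function and parameter names are hypothetical.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, beta=1e-5, max_iter=200, seed=0):
    """Return (centers, W), where W[i][j] is the membership of point i."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Step 1 (assumption): start from c distinct data points.
    centers = [list(p) for p in rng.sample(points, c)]

    def memberships(current):
        # Step 2: membership of every point in every cluster.
        exponent = 2.0 / (m - 1.0)
        rows = []
        for x in points:
            dists = [math.dist(x, cj) for cj in current]
            if 0.0 in dists:
                # Assumption: a point on a centre belongs fully to it.
                hit = dists.index(0.0)
                rows.append([1.0 if j == hit else 0.0 for j in range(c)])
            else:
                rows.append([
                    1.0 / sum((dj / dl) ** exponent for dl in dists)
                    for dj in dists
                ])
        return rows

    W = memberships(centers)
    for _ in range(max_iter):
        # Step 3: move each centre to the weighted mean of all points.
        centers = []
        for j in range(c):
            weights = [W[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * x[a] for w, x in zip(weights, points)) / total
                for a in range(d)
            ])
        # Step 4: stop once the memberships barely change.
        W_new = memberships(centers)
        change = max(
            abs(W_new[i][j] - W[i][j]) for i in range(n) for j in range(c)
        )
        W = W_new
        if change <= beta:
            break
    return centers, W


# Two hypothetical groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, W = fuzzy_c_means(points, 2)
hard = [row.index(max(row)) for row in W]  # most likely cluster per point
print(hard)
```

Each row of W sums to 1, so the entries can be read as degrees of membership. Taking the largest entry per row, as in `hard`, gives a hard clustering when one is needed.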

Advantages and disadvantages of K-means and Fuzzy
C-means Algorithms:
The K-means algorithm is fast, robust and easier to
understand. FCM is more complex than K-means.
K-means gives better results when the clusters are distinct or well
separated from each other, that is, for non-overlapping
clusters. FCM is better for overlapping clusters.
A limitation of both is that the number of clusters must be known
in advance.
K-means converges to a local optimum; it is applicable only when
the mean is defined, hence it fails for categorical data, and it is
unable to handle noisy data and outliers [1].
The time complexity of the K-means algorithm is O(tkdn) and
that of the FCM algorithm is O(ndk^2t), where n
is the number of data objects, k is the number of clusters, d is the
dimension of each object and t is the number of iterations.
Normally, k, t, d << n [5].

How do we know that a particular clustering is good or that it
meets the needs of the application?
Given a particular clustering, how do we know what its quality
really is?

Cluster validation
Evaluation of clustering results is sometimes referred to as cluster
validation [7].
Validation measures are used to compare the quality of the results of
different clustering algorithms.
The measures are classified as:
Internal indices.
External indices.

Internal Indices:
Evaluation is based on the clustered data itself.
For example:
Dunn index.
Davies-Bouldin index.

Dunn index:
Proposed by Dunn [3].
Identifies clusters which are well separated and compact.
The goal is [6]:
to maximize the inter-cluster distance.
to minimize the intra-cluster distance.
The Dunn index for k clusters is defined by:
$$D = \min_{1 \le i < j \le k} \frac{d(c_i, c_j)}{\max_{1 \le l \le k} \operatorname{diam}(c_l)}$$
where,
d(c_i, c_j) is the dissimilarity between clusters c_i and c_j; and
diam(c_l) is the intra-cluster function (or diameter) of cluster c_l.
• If the Dunn index is large, it means that compact and well separated
clusters exist.
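As a rough illustration, the Dunn index can be computed as below. The slides do not fix the dissimilarity or the diameter, so this sketch assumes the smallest point-to-point Euclidean distance between two clusters and the largest pairwise distance within a cluster. The function name is hypothetical, and at least one cluster must contain two or more points.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    # Assumption: dissimilarity = closest pair of points across clusters.
    min_separation = min(
        math.dist(x, y)
        for a, b in combinations(clusters, 2)
        for x in a
        for y in b
    )
    # Assumption: diameter = farthest pair of points inside one cluster.
    max_diameter = max(
        math.dist(x, y)
        for cluster in clusters
        for x, y in combinations(cluster, 2)
    )
    return min_separation / max_diameter


# The same four hypothetical points, grouped in two different ways.
compact = dunn_index([[(0, 0), (0, 1)], [(3, 0), (3, 1)]])
stretched = dunn_index([[(0, 0), (3, 0)], [(0, 1), (3, 1)]])
print(compact, stretched)
```

The compact grouping scores higher than the stretched one. The nested loops over all point pairs also show why the index is computationally expensive on large data sets.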

The Dunn index is:
Computationally expensive
Sensitive to noisy data
Useful for identifying clean clusters in data sets.

Davies Bouldin index:
The Davies Bouldin index [8] is based on a similarity measure of
clusters (R_ij).
The dispersion (s_i) of a cluster and the dissimilarity between
clusters (d_ij) are used to compute the Davies-Bouldin (DB)
index.
The similarity measure of clusters (R_ij) must satisfy the
conditions:
R_ij >= 0
R_ij = R_ji
if s_i = 0 and s_j = 0 then R_ij = 0
if s_j > s_k and d_ij = d_ik then R_ij > R_ik
if s_j = s_k and d_ij < d_ik then R_ij > R_ik

Usually R_ij is defined in the following way [4]:
$$R_{ij} = \frac{s_i + s_j}{d_{ij}}$$
Then the Davies Bouldin index is defined as:
$$DB = \frac{1}{k} \sum_{i=1}^{k} R_i$$
where,
$$R_i = \max_{j \ne i} R_{ij}, \quad i = 1, \dots, k$$
and k is the number of clusters.

The Davies Bouldin index measures the average similarity
between each cluster and its most similar one.
As the clusters have to be compact and separated, a lower
Davies Bouldin index means a better cluster configuration.
The Davies-Bouldin index gives good results for distinct groups.
It is not designed to accommodate overlapping clusters.
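A small sketch of the Davies-Bouldin computation follows, using R_ij = (s_i + s_j) / d_ij. The concrete choices are assumptions: s_i is the mean Euclidean distance of a cluster's points to its centroid, and d_ij is the Euclidean distance between centroids. The function name is hypothetical, and at least two clusters with distinct centroids are required.

```python
import math


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    k = len(clusters)
    centroids = [
        [sum(coord) / len(pts) for coord in zip(*pts)] for pts in clusters
    ]
    # Assumption: dispersion s_i = mean distance of points to the centroid.
    s = [
        sum(math.dist(p, v) for p in pts) / len(pts)
        for pts, v in zip(clusters, centroids)
    ]
    total = 0.0
    for i in range(k):
        # R_i: similarity of cluster i to its most similar other cluster.
        total += max(
            (s[i] + s[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


# The same four hypothetical points, grouped in two different ways.
compact = davies_bouldin([[(0, 0), (0, 1)], [(3, 0), (3, 1)]])
stretched = davies_bouldin([[(0, 0), (3, 0)], [(0, 1), (3, 1)]])
print(compact, stretched)
```

Here the compact grouping gets the lower value, in line with the rule that a lower Davies Bouldin index means a better configuration.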

References
[1] site/dataclusteringalgorithms/k-means-clustering-algorithm.
[2] Aggarwal, C. C., and Reddy, C. K. Data Clustering. CRC Press.
[3] Dunn, J. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4 (1974), 95-104.
[4] Kovács, F., Legány, C., and Babos, A. Cluster validity measurement techniques.
[5] Ghosh, S., and Dubey, S. K. Comparative analysis of k-means and fuzzy c-means algorithms. (IJACSA) International Journal of Advanced Computer Science and Applications 4(4) (2013).
[6] Saitta, S., Raphael, B., and Smith, I. F. A bounded index for cluster validity.
[7] Wikipedia. https://en.wikipedia.org/wiki/cluster analysis#evaluation and assessm

Appendix
Thank You!
You can contact me at garimashakya24@gmail.com
