
ICTACT JOURNAL ON SOFT COMPUTING, JANUARY 2014, VOLUME: 04, ISSUE: 02

SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA

S. Anitha Elavarasi(1) and J. Akilandeswari(2)
(1) Department of Computer Science and Engineering, Sona College of Technology, India
E-mail: [email protected]
(2) Department of Information Technology, Sona College of Technology, India
E-mail: [email protected]

Abstract

Learning is the process of generating useful information from a huge volume of data. Learning can be either supervised (e.g. classification) or unsupervised (e.g. clustering). Clustering is the process of grouping a set of physical objects into classes of similar objects. Objects in the real world consist of both numerical and categorical data. Categorical data cannot be analyzed like numerical data because of the absence of an inherent ordering. This paper describes ten different clustering algorithms, their methodology and the factors influencing their performance. Each algorithm is evaluated using real-world datasets and its pros and cons are specified. The various similarity/dissimilarity measures applied to categorical data and their performance are also discussed. Time complexity defines the amount of time taken by an algorithm to perform its elementary operations. The time complexity of the various algorithms is discussed and their performance on real-world data such as mushroom, zoo, soybean, cancer, vote, car and iris is measured. In this survey, cluster accuracy and error rate are evaluated for four different clustering algorithms (K-modes, fuzzy K-modes, ROCK and Squeezer), two different similarity measures (DISC and Overlap), and DILCA applied to hierarchical and partition algorithms.

Keywords:
Clustering, Categorical Data, Time Complexity, Similarity Measure, Data Mining Tools

1. INTRODUCTION

Data mining is the process of extracting useful information from a given data set. Data mining techniques include clustering, classification, regression, association, outlier detection, etc. Clustering is the process of grouping objects with similar properties [1], and is a form of unsupervised learning. Any clustering process should exhibit high intra-class similarity and low inter-class similarity. Clustering algorithms can be broadly divided into hierarchical and partition algorithms. A hierarchical clustering algorithm groups data objects to form a tree-shaped structure, and can be classified into agglomerative hierarchical clustering (bottom-up approach) and divisive hierarchical clustering (top-down approach). A partition clustering algorithm splits the data points into k partitions, where each partition represents a cluster; the partitioning is done based on a certain objective function [2]. A similarity or dissimilarity measure used by a clustering algorithm should exhibit properties such as:

1. Symmetry: Sim(x, y) = Sim(y, x)
2. Non-negativity: 0 ≤ Sim(x, y) ≤ 1
3. Triangle inequality: Sim(x, y) + Sim(y, z) ≥ Sim(x, z)

Data in the real world are either numerical or categorical in nature. Numerical data are continuous, whereas categorical data consist of a set of categories. Categorical data are divided into dichotomous and multi-categorical data [23]. Dichotomous variables can have only two values. Multi-categorical data can take three forms: 1) ordinal variables (ordered in nature, e.g. high, medium, low), 2) nominal variables (unordered in nature, e.g. mode of transport preferred by persons) and 3) quantitative variables. Categorical data are used in the health care, educational, marketing and biomedical fields.

This paper describes various clustering algorithms and the similarity/dissimilarity measures applied to categorical data. The paper is organized as follows: Section 2 gives an overview of different categorical clustering algorithms and their methodologies. Section 3 describes the time complexity of various categorical clustering algorithms. Section 4 discusses the various similarity measures used for categorical data. Section 5 discusses the performance of the various algorithms and similarity measures on real-world data sets. Finally, Section 6 provides conclusions.

2. EXISTING CATEGORICAL ALGORITHMS

2.1 K-MODES ALGORITHM

The K-means algorithm is a well-known partition clustering algorithm. It is efficient for processing large data sets, but it is sensitive to outliers and suitable only for numerical data. The author [12] extends K-means by using a simple matching dissimilarity function suitable for categorical data, using the mode value instead of the mean value, and applying a frequency-based method for updating modes during the clustering process, which reduces the cost function.

2.1.1 Methodology:

1. Choose K initial mode values.
2. The objective function used for categorical objects is

   d_c(X, Y) = Σ_{j=1}^{m} δ(x_j, y_j)    (1)

   where

   δ(x_j, y_j) = 0 if x_j = y_j, and 1 if x_j ≠ y_j    (2)

   and X, Y represent categorical objects and m refers to the number of categorical attributes.
3. Allocate each object to the cluster whose mode is at minimum dissimilarity. Update the mode after each allocation until all objects have been processed.
4. Test the dissimilarity of each object against the current modes. If an object's nearest mode belongs to a different cluster than its current one, reallocate the object to the new cluster.
5. Repeat steps 2, 3 and 4 until no such reallocation occurs.

The K-modes algorithm produces only a local optimum. The author compares the performance and scalability of K-modes with the K-prototypes algorithm. Cluster performance is verified using cluster accuracy and error rate on the soybean disease dataset. Soybean has 47 instances with 35 attributes each and can be classified under four disease types. The K-modes algorithm was tested on the soybean dataset and produced 200 clustering results with two different mode selection methods. A misclassification matrix is generated to analyse the cluster results against the disease classification. Scalability is verified against the number of clusters for a given number of objects, and against the number of objects for a given number of clusters, using the motor insurance dataset. Motor insurance has 690 instances described by 6 numerical and 9 categorical attributes with two possible classes (only 666 instances are used). The K-prototypes algorithm produced 100 clustering results, and a misclassification matrix is generated to analyse the clusters against the original classes.
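As an illustration of Eq.(1), Eq.(2) and the mode update of step 3, the following Python sketch implements one assignment-and-update pass of K-modes. It is an illustrative reconstruction, not the implementation of [12]; the toy data and the choice of the first objects as initial modes are assumptions made only for the example.

```python
from collections import Counter

def matching_dissimilarity(x, y):
    # Eq.(1)-(2): count the attributes on which the two objects disagree.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def update_mode(cluster_objects):
    # The mode takes, per attribute, the most frequent category in the cluster.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster_objects))

def assign_and_update(objects, modes):
    # One K-modes pass: assign each object to its nearest mode, then refresh the modes.
    clusters = {k: [] for k in range(len(modes))}
    for obj in objects:
        k = min(range(len(modes)), key=lambda i: matching_dissimilarity(obj, modes[i]))
        clusters[k].append(obj)
    new_modes = [update_mode(objs) if objs else modes[k] for k, objs in clusters.items()]
    return clusters, new_modes

# Toy categorical data (hypothetical values); K = 2 with two objects as initial modes.
data = [("red", "small", "round"), ("red", "small", "oval"),
        ("blue", "large", "round"), ("blue", "large", "oval")]
clusters, modes = assign_and_update(data, [data[0], data[2]])
print(clusters, modes)
```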


2.2 SQUEEZER

Squeezer [9][3] is a categorical data clustering algorithm. The main data structures involved are the Cluster Summary and the Cluster Structure. The Summary holds the set of attribute-value pairs and their corresponding support values, and the Cluster Structure (CS) holds the cluster together with its summary information. The advantages of the Squeezer algorithm are: 1) it produces high quality clustering results, 2) it has good scalability, and 3) it makes only one scan over the dataset, so it is highly efficient with respect to I/O cost. Its disadvantage is that the quality of the clusters depends on the threshold value s. The space complexity is O(n + k*p*m), where n represents the size of the data set, m the number of attributes, k the final number of clusters and p the number of distinct attribute values.

2.2.1 Methodology:

1. Read the first tuple.
2. Generate the Cluster Structure (CS).
3. Read the next tuple and compute its similarity to each cluster using the support measure

   Sim(C, tid) = Σ_{i=1}^{m} sup(a_i) / Σ_j sup(a_j)    (3)

   where sup(a_i) is the support of the tuple's i-th attribute value in the cluster summary.
4. If the similarity is greater than the threshold s, add the tuple to the existing Cluster Structure; else assign it to a new Cluster Structure.
5. Repeat steps 2 through 4 until the end of the tuples.

The author implements the algorithm in Java and compares Squeezer with ROCK using the Congressional vote dataset and the Mushroom data set. The Congressional vote dataset has 435 tuples with 16 attributes and 2 classes (democrat and republican). Mushroom has 8125 tuples with 22 attributes and 2 classes (poisonous and edible). The threshold values are assumed as 10 and 16 for the vote and mushroom datasets respectively. The author concludes that both algorithms produce high quality clusters; the only parameter that affects the clustering result and the speed of the algorithm is the threshold value s.
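The support-based similarity of Eq.(3) can be sketched as follows. This Python fragment assumes that a cluster summary stores, per attribute, the support count of each value seen in the cluster (the Cluster Summary of [9]); it is an illustrative sketch rather than the authors' Java implementation, and the threshold value is chosen arbitrarily for the example.

```python
from collections import Counter

def new_summary(num_attrs):
    # One Counter of attribute values (with support counts) per attribute position.
    return [Counter() for _ in range(num_attrs)]

def add_tuple(summary, tup):
    for i, value in enumerate(tup):
        summary[i][value] += 1

def similarity(summary, tup):
    # Eq.(3): for each attribute, the support of the tuple's value divided by the
    # total support of all values seen for that attribute, summed over attributes.
    sim = 0.0
    for i, value in enumerate(tup):
        total = sum(summary[i].values())
        if total:
            sim += summary[i][value] / total
    return sim

def squeezer(tuples, threshold):
    clusters, summaries = [], []
    for tup in tuples:
        sims = [similarity(s, tup) for s in summaries]
        if sims and max(sims) >= threshold:
            best = max(range(len(sims)), key=sims.__getitem__)
            clusters[best].append(tup)
            add_tuple(summaries[best], tup)
        else:
            clusters.append([tup])
            summaries.append(new_summary(len(tup)))
            add_tuple(summaries[-1], tup)
    return clusters

print(squeezer([("y", "n"), ("y", "y"), ("n", "n")], threshold=1.0))
```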
2.3 ROCK

ROCK stands for RObust Clustering using linKs [4]. It is an agglomerative hierarchical clustering algorithm that uses links to measure the similarity between data points. Initially each tuple is assigned to a separate cluster; clusters are then merged based on the closeness between clusters, where closeness is measured as the sum of the number of links between all pairs of tuples. It is suitable for Boolean and categorical data; in the traditional approach, categorical data are treated as Boolean values. The scalability of the algorithm depends on the sample size. The criterion function and goodness measure used are given in Eq.(4) and Eq.(5).

Criterion function:

   E_l = Σ_{i=1}^{k} n_i * Σ_{p_q, p_r ∈ C_i} link(p_q, p_r) / n_i^{1 + 2f(θ)}    (4)

where p_q and p_r represent two points in a cluster, C_i represents the i-th cluster and n_i represents the size of the i-th cluster.

Goodness measure:

   g(C_i, C_j) = link[C_i, C_j] / ( (n_i + n_j)^{1 + 2f(θ)} − n_i^{1 + 2f(θ)} − n_j^{1 + 2f(θ)} )    (5)

2.3.1 Methodology:

1. Draw a random sample.
2. Compute the link similarity.
3. Cluster using the links.
4. Label the data on disk.

The author uses the Congressional Vote dataset and the Mushroom data set from the UCI repository and compares the ROCK algorithm with a traditional centroid-based hierarchical algorithm. Experiments were conducted on a Sun Ultra-2/200 machine running the Solaris 2.5 operating system. On the vote dataset the republican cluster contains only 12% democrats, whereas the traditional approach yields 25% democrats, with θ = 0.73. For the Mushroom data set ROCK uses θ = 0.8 and 20 desired clusters; it discovers pure clusters in the sense that the mushrooms in every cluster are either edible or poisonous.
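The link count and the goodness measure of Eq.(5) can be sketched as below. The sketch assumes the Jaccard coefficient as the neighbour criterion and f(θ) = (1 − θ)/(1 + θ), the choice discussed in [4]; the data and θ are illustrative, and a point is counted as its own neighbour.

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def neighbours(points, theta):
    # Two points are neighbours when their Jaccard similarity is at least theta
    # (a point is also a neighbour of itself).
    return {i: {j for j in range(len(points))
                if jaccard(points[i], points[j]) >= theta}
            for i in range(len(points))}

def links(nbrs, i, j):
    # link(p_i, p_j) = number of common neighbours.
    return len(nbrs[i] & nbrs[j])

def goodness(ci, cj, nbrs, theta):
    # Eq.(5): cross links between the two clusters, normalised by the expected links.
    f = (1.0 - theta) / (1.0 + theta)
    cross = sum(links(nbrs, i, j) for i in ci for j in cj)
    ni, nj = len(ci), len(cj)
    expected = (ni + nj) ** (1 + 2 * f) - ni ** (1 + 2 * f) - nj ** (1 + 2 * f)
    return cross / expected

# Illustrative transactions; pick the pair of singleton clusters with highest goodness.
pts = [("a", "b", "c"), ("a", "b", "d"), ("x", "y", "z"), ("x", "y", "w")]
nbrs = neighbours(pts, theta=0.4)
best = max(combinations(range(len(pts)), 2),
           key=lambda p: goodness([p[0]], [p[1]], nbrs, 0.4))
print("merge clusters", best)
```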
2.4 K-HISTOGRAM

K-Histogram extends the K-means algorithm to the categorical domain by replacing the mean with a histogram and dynamically updating the histograms during the clustering process [13]. The K-means algorithm cannot cluster categorical data efficiently; to make it work for categorical data, two modifications are made. First, the mean value is replaced with a histogram. Second, a new dissimilarity measure between a categorical object and a histogram is applied. The dissimilarity function and cost measure applied for K-histogram are given in Eq.(6) and Eq.(7).

The dissimilarity function used is

   d(H, Y) = Σ_{j=1}^{m} h_j(y_j) / n    (6)

where h_j(y_j) is the frequency of value y_j in the j-th histogram of H and n is the number of objects in the cluster.


The cost function used is

   P(W, H) = Σ_{l=1}^{k} Σ_{i=1}^{n} w_{i,l} d(X_i, H_l)    (7)

Histograms are also widely used in computer vision applications. The results of K-histogram depend on the initial selection of the histograms and the order in which the data are processed; hence it produces only locally optimal results.

2.4.1 Methodology:

1. Initialize the K value.
2. Apply the cost function.
3. Allocate each object to the cluster whose histogram is nearest to it.
4. Update the histogram after each assignment.
5. Repeat the steps until no object changes its cluster.

The author compares K-histogram with the K-modes algorithm on the Congressional vote dataset and the Mushroom data set. The algorithms were implemented in Java, and both K-histogram and K-modes use the same initial point selection method. Four comparisons were made: 1) cluster error vs. number of clusters, 2) number of objects vs. number of clusters, 3) number of iterations vs. number of clusters, and 4) pure clusters vs. number of clusters.
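A cluster histogram and the allocation/update of steps 3 and 4 can be sketched as follows. Reading Eq.(6) so that an object is allocated to the cluster under whose histogram its attribute values are most frequent is an assumption made for this illustration; the snippet is not the implementation of [13].

```python
from collections import Counter

def make_histogram(objects):
    # One frequency counter per attribute position, plus the cluster size n.
    counters = [Counter(col) for col in zip(*objects)]
    return counters, len(objects)

def frequency_score(histogram, obj):
    # Eq.(6) as read here: sum over attributes of the frequency of the object's
    # value in the cluster histogram, normalised by the cluster size n.
    counters, n = histogram
    return sum(counters[j][v] for j, v in enumerate(obj)) / n

def assign(objects, histograms):
    # Step 3: allocate each object to the cluster whose histogram fits it best.
    assignment = []
    for obj in objects:
        best = max(range(len(histograms)),
                   key=lambda l: frequency_score(histograms[l], obj))
        assignment.append(best)
    return assignment

# Toy run: two clusters seeded by hand (hypothetical data), one assignment pass.
c0 = [("sunny", "hot"), ("sunny", "mild")]
c1 = [("rainy", "cool"), ("rainy", "mild")]
hists = [make_histogram(c0), make_histogram(c1)]
print(assign([("sunny", "hot"), ("rainy", "cool")], hists))  # -> [0, 1]
```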
2.5 ANALYSIS OF THE AGGLOMERATIVE HIERARCHICAL CLUSTERING ALGORITHM FOR CATEGORICAL ATTRIBUTES

The author describes the implementation details of K-pragna [11], an agglomerative hierarchical clustering algorithm. The data structures used are the Domain Array (DOM[m][n]), the Similarity Matrix and Cluster[m]. The Domain Array holds the values of the data set, the Similarity Matrix holds the similarity between tuples/clusters, and Cluster[m] is a single-dimensional array that holds the updated values whenever a merge occurs. The implementation language is C.

2.5.1 Methodology:

1. Input the value of k (the expected number of clusters).
2. Calculate the similarity.
3. Find the largest merge.
4. Repeat steps 2 and 3 till the end.
5. Display the contents of each cluster.

The author used the mushroom data set taken from the UCI Machine Learning repository and tested the algorithm for k = 3. The accuracy of the algorithm was found to be 0.95.

2.6 HIERARCHICAL CLUSTERING ON FEATURE SELECTION FOR CATEGORICAL DATA OF BIOMEDICAL APPLICATIONS

The author [14] focuses on feature association mining. Based on the contingency table, the distance (closeness) between features is calculated, and then hierarchical agglomerative clustering is applied. The clustered results help domain experts to identify the feature associations of their own interest. The drawback of this system is that it works only for categorical data.
2.7 FUZZY RULE BASED CLUSTERING ALGORITHM

Fuzzy Rule Based Clustering (FRBC) employs a supervised classification approach to perform unsupervised clustering [19]. It explores the potential clusters in the data patterns and identifies them with fuzzy rules. Fuzzy clustering is applied when the cluster boundaries are vague. The advantages of the fuzzy model are that it works with imprecise data, elements can belong to more than one cluster with a specified degree of membership, and the knowledge obtained is human readable. FRBC is robust to noise and outliers.

2.7.1 Methodology:

1. Assume all unlabeled data patterns belong to class 1.
2. Generate uniformly distributed instances as auxiliary data and mark them as class 2.
3. Apply the SGERD rule generator (Steady-state Genetic algorithm to Extract fuzzy classification Rules from Data) to produce fuzzy rules solving the two-class problem.
4. Select the best rule for class 1 and check whether it is below the threshold; if so, decrement the number of clusters and go to step 3. Otherwise, increment the number of clusters, remove the covered patterns from class 1 and go to step 3.

The author applies FRBC to 11 classification datasets and 2 clustering datasets obtained from the UCI repository, and compares it with other fuzzy clustering algorithms. The threshold (rule effectiveness measure) is set to 0.1 for all datasets, and the clusters specified by the fuzzy rules are human understandable with good accuracy.

2.8 DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise [23]. It is based on the notion of density reachability. It requires two parameters: 1) Eps (the maximum radius of the neighborhood) and 2) MinPts (the minimum number of points in the Eps-neighborhood). The advantages of DBSCAN are that it does not require the number of clusters a priori, it is insensitive to the order of the points, and it finds arbitrarily shaped clusters. The drawback of the algorithm is that its quality depends on the distance function used.

2.8.1 Methodology:

1. Select a point p.
2. Retrieve all points density-reachable from p with respect to Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point and no points are density-reachable from it, visit the next point of the database.
5. Repeat the process till all the points have been processed.

The author tests the efficiency against CLARANS using the SEQUOIA dataset. DBSCAN is implemented in C++ based on the R*-tree. Running times are compared for various numbers of points; DBSCAN outperforms CLARANS by a factor of more than 200.
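A minimal sketch of the DBSCAN expansion loop is given below. It assumes Euclidean distance and a brute-force neighbourhood query instead of the R*-tree used in the original implementation; the points, Eps and MinPts values are illustrative.

```python
import math

def region_query(points, i, eps):
    # All indices within distance eps of point i (including i itself).
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbourhood = region_query(points, i, eps)
        if len(neighbourhood) < min_pts:
            labels[i] = NOISE              # border or noise for now
            continue
        labels[i] = cluster                # i is a core point: start a new cluster
        seeds = list(neighbourhood)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point claimed by this cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            jn = region_query(points, j, eps)
            if len(jn) >= min_pts:         # j is also a core point: keep expanding
                seeds.extend(jn)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))     # two clusters plus one noise point
```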


2.9 CURE (CLUSTERING USING REPRESENTATIVES)

CURE represents each cluster with a fixed number of points that are produced by selecting well-scattered points from the cluster and then shrinking them towards the center of the cluster [20]. The scattered points after shrinking are the representatives of that cluster, and the clusters with the closest pair of representatives are merged repeatedly. It is an approach between the centroid-based and the all-points extremes. The time complexity of CURE is O(s²) for low-dimensional data, where s is the sample size.

2.9.1 Methodology:

1. Draw a random sample from the given data set.
2. Partition the sample.
3. Partially cluster the partitions.
4. Eliminate outliers.
5. Cluster the partial clusters.
6. Label the data on disk.

The parameters that affect the CURE algorithm are the shrinking factor (α), the number of representative points (c), the sample size (s) and the number of partitions (p). The performance of CURE is compared with BIRCH and a Minimum Spanning Tree (MST) based method. Results show that CURE can discover clusters with interesting shapes, is less sensitive to outliers and needs less execution time.
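The selection and shrinking of representative points can be sketched as follows, assuming numerical coordinates, the farthest-point heuristic for picking well-scattered points and an illustrative shrinking factor α; it is a sketch of the idea, not the full CURE implementation of [20].

```python
def centroid(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def scatter_points(cluster, c):
    # Farthest-point heuristic: repeatedly pick the point farthest from those chosen.
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    reps = [max(cluster, key=lambda p: dist(p, centroid(cluster)))]
    while len(reps) < min(c, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    return reps

def shrink(reps, center, alpha):
    # Move each representative a fraction alpha of the way towards the cluster centre.
    return [tuple(ri + alpha * (ci - ri) for ri, ci in zip(r, center)) for r in reps]

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0)]
reps = shrink(scatter_points(cluster, c=2), centroid(cluster), alpha=0.3)
print(reps)
```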
2.10 k-ANMI

The author [24] uses the average normalized mutual information (an entropy-based criterion) as the objective for a k-modes style algorithm. The objective function is defined as

   ANMI(Λ, λ) = (1/r) Σ_{q=1}^{r} NMI(λ, λ_q)    (8)

where λ is the candidate clustering and λ_1, ..., λ_r are the r partitions in Λ.

The advantages of k-ANMI are: 1) it is suitable both for categorical data clustering and for cluster ensembles, 2) it can easily be deployed for clustering distributed categorical data, and 3) it is flexible in handling heterogeneous data that contain a mix of categorical and numerical attributes. Its limitations are: 1) implementing k-ANMI efficiently enough to scale to large datasets remains a research challenge, and 2) its ability to find a global or near-optimal solution is limited.

The author applies the k-ANMI algorithm to the Congressional vote, Mushroom and Wisconsin Breast Cancer data sets from the UCI repository. The cancer data set consists of 699 instances with 9 attributes and two classes. The author compares k-ANMI with Squeezer, GAClust, K-modes and ccdByEnsemble; k-ANMI outperforms all the other algorithms with respect to the average clustering error, and its running time increases linearly with the number of objects.
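The NMI term inside Eq.(8) can be computed from the contingency table of two labelings, as sketched below. Normalisation by the geometric mean of the two entropies is one common convention and is assumed here; the snippet is generic and is not the authors' k-ANMI implementation.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in joint.items())

def nmi(a, b):
    # Normalised by the geometric mean of the entropies (one common convention).
    ha, hb = entropy(a), entropy(b)
    return mutual_information(a, b) / ((ha * hb) ** 0.5) if ha and hb else 0.0

def anmi(candidate, partitions):
    # Eq.(8): average NMI between the candidate clustering and each partition.
    return sum(nmi(candidate, p) for p in partitions) / len(partitions)

# Toy example: a clustering of 6 objects against the partitions induced by two attributes.
clustering = [0, 0, 0, 1, 1, 1]
attrs = [["a", "a", "a", "b", "b", "b"], ["x", "x", "y", "y", "z", "z"]]
print(anmi(clustering, attrs))
```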
O(N*K*A)
3. TIME COMPLEXITY N -total number of
transaction
10 CLOPE
Time Complexity of any algorithm defines the amount of time K - No of Cluster
taken by an algorithm to perform the elementary operation. A - average length of the
Table.1 discusses the time complexity of various categorical transaction


4. EXISTING CATEGORICAL SIMILARITY MEASURES

4.1 CHI-SQUARED

Karl Pearson proposed the chi-squared statistic in 1900 [15]. It examines whether any association exists between categorical variables. The range is -1 to +1 for two variables and 0 to +1 for a larger number of variables; a value closer to 1 indicates a stronger relationship between the variables. The chi-square (χ²) statistic is defined as

   χ² = Σ_{i=1}^{n} (O_i − E_i)² / E_i    (9)

where O_i represents the observed value and E_i represents the expected value.

Steps in the chi-square test:

1. Take the observed frequencies.
2. Note the expected frequencies.
3. Apply the chi-square formula.
4. Find the degrees of freedom (df = N − 1).
5. If the obtained value is equal to or greater than the chi-square table value, reject the null hypothesis.

The advantage of chi-square is that it requires no assumptions about the shape of the population distribution from which the sample is drawn, and it can be applied to nominally or ordinally measured variables. The limitations of the chi-square measure are: 1) it needs quantitative (frequency) data, 2) it is sensitive to sample size, 3) it does not give much information about the strength of the relationship, and 4) the expected frequency should not be less than 1.
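A worked chi-square computation following Eq.(9) and the steps above is shown below; the observed and expected frequencies are made-up values, and the critical value for df = 3 at the 0.05 level is supplied directly instead of being read from a table.

```python
def chi_square(observed, expected):
    # Eq.(9): squared deviation between observed and expected counts,
    # scaled by the expected count and summed over categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequencies for a categorical variable with four levels.
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]

stat = chi_square(observed, expected)
df = len(observed) - 1                  # degrees of freedom, df = N - 1
critical = 7.815                        # chi-square critical value at df = 3, alpha = 0.05
print(stat, "reject H0" if stat >= critical else "fail to reject H0")
```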
4.2 COSINE SIMILARITY

Cosine similarity [17] is a popular method in text mining. It is used for comparing documents (word frequencies) and for finding the closeness among data points in clustering. Its range lies between 0 and 1. The similarity between two vectors X and Y is defined as

   CosineSim(X, Y) = (X · Y) / (|X| |Y|)    (10)

One desirable property of cosine similarity is that it is independent of document length. Its limitation is that the terms are assumed to be orthogonal in space. If the value is zero, no similarity exists between the data elements; if the value is 1, the elements are fully similar. Consider two documents X and Y with attributes X = {1, 2, 3, 0, 0} and Y = {2, 4, 0, 0, 1}:

   CosineSim(X, Y) = (1*2 + 2*4 + 3*0 + 0*0 + 0*1) / (√(1*1 + 2*2 + 3*3) * √(2*2 + 4*4 + 1*1)) = 0.5832
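The worked example above can be reproduced in a few lines; the vectors are the ones from the text.

```python
from math import sqrt

def cosine_sim(x, y):
    # Eq.(10): dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

X = [1, 2, 3, 0, 0]
Y = [2, 4, 0, 0, 1]
print(round(cosine_sim(X, Y), 4))   # 0.5832
```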
4.3 OVERLAP

The overlap measure counts the number of attributes that match between two data instances. It uses only the diagonal entries of the per-attribute similarity matrix and sets the off-diagonal entries to 0 [5]. The per-attribute range is 0 to 1, where 0 indicates no match between the attribute values and 1 indicates a match. The overlap similarity is defined as

   S_k(X_k, Y_k) = 1 if X_k = Y_k, 0 otherwise    (11)
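A per-attribute overlap score and one way of aggregating it over all attributes are sketched below; averaging over attributes is an assumption made for the example, since [5] discusses several ways of combining per-attribute scores.

```python
def overlap_per_attribute(xk, yk):
    # Eq.(11): 1 when the two categorical values match, 0 otherwise.
    return 1 if xk == yk else 0

def overlap(x, y):
    # Aggregate the per-attribute scores; the mean over attributes is used here.
    return sum(overlap_per_attribute(a, b) for a, b in zip(x, y)) / len(x)

print(overlap(("red", "small", "round"), ("red", "large", "round")))   # 0.666...
```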
4.4 DISC

DISC, the Data Intensive Similarity measure for Categorical data analysis [6], makes use of a data structure called the categorical information table (CI table). The CI table stores the co-occurrence statistics for the categorical data, and the similarity between two attribute values is measured using a cosine similarity measure.

4.4.1 Methodology:

1. Construct the Categorical Information table (CI table).
2. Initialize the similarity matrix:

   ∀ i, j, k : sim(v_ik, v_jk) = 0 if v_ik ≠ v_jk    (12)
   ∀ i, j, k : sim(v_ik, v_jk) = 1 if v_ik = v_jk    (13)

3. Compute the similarity between two values v_ij and v_ik of attribute A_i as

   Sim(v_ij, v_ik) = (1/(d − 1)) Σ_{m=1, m≠i}^{d} Similarity_m    (14)

   where, for a categorical attribute A_m,

   Similarity_m = CosineProduct(CI[A_i : v_ij][A_m], CI[A_i : v_ik][A_m])

   and, for a numerical attribute A_m,

   Similarity_m = 1 − |CI[A_i : v_ij][A_m] − CI[A_i : v_ik][A_m]| / (max(A_m) − min(A_m))    (15)

   The cosine product is defined as

   CosineProduct = Σ_{v_ml, v_ml' ∈ A_m} CI[A_i : v_ij][A_m : v_ml] * CI[A_i : v_ik][A_m : v_ml'] * sim(v_ml, v_ml') / (NormalVector1 * NormalVector2)    (16)

4. Repeat steps 2 and 3.

The author concludes that DISC outperforms other similarity measures for both classification and regression analysis.


4.5 DILCA

DILCA (DIstance Learning in Categorical Attributes) is the measure used by the author [7, 8]. A co-occurrence table is formed for the features; using symmetric uncertainty a context of relevant features is selected, conditional probabilities are computed over this context, and a Euclidean-style measure on these probabilities gives the distance between the values of an attribute.

4.5.1 Methodology:

1. Context selection (feature selection) is based on symmetric uncertainty (SU), a correlation measure from information theory. The correlation matrix is formed using SU,

   SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))    (17)

   where IG(X|Y) is the information gain and H(X) and H(Y) represent the entropy of the variables X and Y respectively.

2. Distance computation: applying the conditional probabilities over the selected context and the Euclidean distance,

   d(x_i, x_j) = √( Σ_{Y ∈ context(X)} Σ_{y_k ∈ Y} ( P(x_i | y_k) − P(x_j | y_k) )² )    (18)

The author embeds the measure in both partition and hierarchical algorithms. The results are scalable with respect to the number of instances in the dataset.
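The symmetric uncertainty of Eq.(17) and the value distance of Eq.(18) can be illustrated with the small sketch below. The probabilities are estimated from raw counts and a single context attribute is passed in explicitly; this illustrates the formulas only and is not the DILCA implementation of [7, 8].

```python
from collections import Counter
from math import log, sqrt

def entropy(values):
    n = len(values)
    return -sum((c / n) * log(c / n, 2) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    # Eq.(17): SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG = H(X) - H(X|Y).
    n = len(x)
    h_x, h_y = entropy(x), entropy(y)
    h_xy = -sum((c / n) * log(c / n, 2) for c in Counter(zip(x, y)).values())
    ig = h_x + h_y - h_xy                       # = H(X) - H(X|Y)
    return 2 * ig / (h_x + h_y) if (h_x + h_y) else 0.0

def value_distance(xi, xj, target, context_attr):
    # Eq.(18): Euclidean distance between the conditional distributions
    # P(xi | y_k) and P(xj | y_k) over the values y_k of one context attribute.
    total = 0.0
    for yk in set(context_attr):
        denom = context_attr.count(yk)
        p_i = sum(1 for t, y in zip(target, context_attr) if t == xi and y == yk) / denom
        p_j = sum(1 for t, y in zip(target, context_attr) if t == xj and y == yk) / denom
        total += (p_i - p_j) ** 2
    return sqrt(total)

weather = ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy"]
traffic = ["low", "low", "high", "high", "low", "low"]
print(symmetric_uncertainty(weather, traffic))
print(value_distance("sunny", "rainy", weather, traffic))
```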
5. PERFORMANCE ANALYSIS

Cluster validation is the process of evaluating the clustering results in a quantitative and objective manner. Cluster evaluation is done either internally or externally: internal evaluation determines the quality of the clusters, while external evaluation determines the partitioning among the clusters. The results of the different clustering algorithms are validated using cluster accuracy and error rate. Cluster accuracy r is defined as

   r = (Σ_{i=1}^{k} a_i) / n    (19)

where n refers to the number of instances in the dataset, a_i refers to the number of instances occurring both in cluster i and in its corresponding class, and k refers to the final number of clusters. The error rate E is defined as E = 1 − r, where r refers to the cluster accuracy.
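Eq.(19) can be computed by matching each cluster with its majority class, which is one common reading of "its corresponding class"; the cluster and class labels below are illustrative.

```python
from collections import Counter

def cluster_accuracy(cluster_ids, class_labels):
    # Eq.(19): for each cluster, count the instances of its majority (corresponding)
    # class, sum over clusters and divide by the total number of instances n.
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, []).append(label)
    a = sum(Counter(labels).most_common(1)[0][1] for labels in by_cluster.values())
    return a / len(class_labels)

clusters = [0, 0, 0, 1, 1, 1, 1]
classes  = ["A", "A", "B", "B", "B", "B", "A"]
r = cluster_accuracy(clusters, classes)
print(r, 1 - r)   # accuracy 5/7 and error rate E = 1 - r
```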
Real-world datasets: five real-life datasets, Mushroom, Vote, Iris, Cancer and Zoo, were obtained from the UCI machine learning repository [25]. Mushroom: each tuple represents the physical characteristics of a mushroom; the number of instances is 8124 and the number of attributes is 22, classified into edible (4208) and poisonous (3916) mushrooms. Vote: each tuple represents a United States congressional voting record from 1984; the number of instances is 435 and the number of attributes is 16, classified into Democrat (267) and Republican (168). Iris: the data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant; the number of instances is 150 and the number of attributes is 4. Cancer: the number of instances is 699 and the number of attributes is 9. Zoo: the Zoo dataset has 18 attributes and 101 instances, with a class distribution of 7 classes. Soybean: the number of instances is 307 and the number of attributes is 35.

In this survey, cluster accuracy and error rate are evaluated in three ways: 1) comparison of different categorical clustering algorithms, 2) comparison of DILCA combined with partition and hierarchical clustering algorithms, and 3) comparison of categorical similarity measures.

5.1 COMPARISONS OF DIFFERENT CATEGORICAL CLUSTERING ALGORITHMS

The algorithms used for comparison are K-modes [10][12][16] and fuzzy K-modes [16], on the soybean and Zoo data sets, as depicted in Fig.1. Each algorithm was run 100 times and the fuzziness parameter was set to 1.1. The author [16] uses four real-life datasets to report accuracy, precision and recall values. To illustrate the efficiency of the algorithm, synthetic datasets are generated and four different graphs are plotted: 1) time vs. number of clusters, 2) time vs. number of objects, 3) time vs. number of categories and 4) time vs. number of attributes. Fuzzy K-modes outperforms K-modes for both data sets.

[Fig.1. Comparison of K-modes and fuzzy K-modes: accuracy and error rate on the Zoo and Soybean datasets]

The comparison of ROCK and Squeezer [9] is depicted in Fig.2. Squeezer outperforms ROCK on the mushroom data set and ROCK outperforms Squeezer on the vote data set. The author [8] compares ROCK with hierarchical and partitional algorithms on four real-world data sets. Threshold values for ROCK are set from 0.2 to 1 in steps of 0.05.

[Fig.2. Comparison of ROCK and Squeezer: accuracy and error rate on the Mushroom and Vote datasets]


5.2 COMPARISONS OF DILCA COMBINED WITH PARTITION AND HIERARCHICAL CLUSTERING ALGORITHMS

The author [8] determines the quality of the clusters formed using accuracy and normalized mutual information. The partition algorithm (K-modes) and Ward's hierarchical clustering algorithm are combined with DILCA for the mushroom, vote and cancer datasets, as depicted in Fig.3. Both algorithms set the number of clusters equal to the number of classes. The DILCA_Kmodes algorithm is implemented on the WEKA platform and the DILCA_Hierarchical algorithm is implemented in Java on Murtagh's platform.

[Fig.3. Comparison of DILCA for the parametric (DILCA_Kmodes) and non-parametric (DILCA_Hierarchical) algorithms: accuracy and error rate on the Mushroom, Vote and Cancer datasets]

5.3 COMPARISONS OF CATEGORICAL SIMILARITY MEASURES

The author [6] uses 24 different datasets, and 15 similarity measures were assessed for classification and regression using the kNN algorithm. Experiments were conducted in the WEKA environment and results were reported for 10-fold cross validation. The categorical similarity measures compared in Fig.4 are DISC and Overlap, on the car evaluation, iris and cancer datasets, using the kNN algorithm with K = 10. Both similarity measures give similar results for all three datasets.

[Fig.4. Comparison of DISC and Overlap: accuracy and error rate on the Car, Iris and Cancer datasets]

Several open source data mining tools are available on the web. Table.2 lists various data mining tools and a few of the clustering algorithms they provide.

Table.2. Data mining tools and clustering algorithms

Tool Name | Clustering Algorithms
Weka | K-means, X-means, EM, Cobweb, CLOPE, OPTICS, DBSCAN, hierarchical clustering algorithms
R packages | K-means, PAM, DBSCAN, ROCK, hierarchical clustering algorithms
RapidMiner | K-means, DBSCAN, EM, K-medoids, X-means, kernel K-means, fast K-means, hierarchical clustering algorithms

6. CONCLUSION

This paper reviews different clustering methodologies and the similarity measures associated with categorical data clustering. The factors that affect the various clustering algorithms, together with their advantages and limitations, are discussed, and the time complexities of the various categorical clustering algorithms are presented. Cluster accuracy and error rate on real-world data sets are illustrated for different categorical clustering algorithms, for the parametric and non-parametric versions of DILCA, and for the categorical similarity measures.

REFERENCES

[1] J. Han and M. Kamber, "Data Mining Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 2000.
[2] S. A. Elavarasi, J. Akilandeswari and B. Sathiyabhama, "A Survey on Partition Clustering Algorithms", International Journal of Enterprise Computing and Business Systems, Vol. 1, No. 1, pp. 1-14, 2011.
[3] R. Ranjani, S. A. Elavarasi and J. Akilandeswari, "Categorical Data Clustering using Cosine based Similarity for Enhancing the Accuracy of Squeezer Algorithm", International Journal of Computer Applications, Vol. 45, No. 20, pp. 41-45, 2012.
[4] S. Guha, R. Rastogi and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 2000.
[5] S. Boriah, V. Chandola and V. Kumar, "Similarity Measures for Categorical Data: A Comparative Evaluation", Proceedings of the 8th SIAM International Conference on Data Mining, pp. 243-254, 2008.


[6] A. Desai, H. Singh and V. Pudi, "DISC: Data Intensive Similarity Measure for Categorical Data", Proceedings of Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, Vol. 6635, pp. 469-481, 2011.
[7] D. Ienco, R. G. Pensa and R. Meo, "From Context to Distance: Learning Dissimilarity for Categorical Data Clustering", ACM Transactions on Knowledge Discovery from Data, Vol. 6, No. 1, 2012.
[8] D. Ienco, R. G. Pensa and R. Meo, "Context-based Distance Learning for Categorical Data Clustering", Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, pp. 83-94, 2009.
[9] H. Zengyou, X. Xiaofei and D. Shengchun, "Squeezer: An Efficient Algorithm for Clustering Categorical Data", Journal of Computer Science and Technology, Vol. 17, No. 5, pp. 611-624, 2002.
[10] Z. Huang and M. K. Ng, "A Fuzzy k-Modes Algorithm for Clustering Categorical Data", IEEE Transactions on Fuzzy Systems, Vol. 7, No. 4, pp. 446-452, 1999.
[11] P. Agarwal, M. Afshar Alam and R. Biswas, "Analysis the Agglomerative Hierarchical Clustering Algorithm for Categorical Attribute", International Journal of Innovation, Management and Technology, Vol. 1, No. 2, pp. 186-190, 2010.
[12] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
[13] Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset", CoRR, Vol. abs/cs/0509033, 2005.
[14] Y. Lu and L. R. Liang, "Hierarchical Clustering of Features on Categorical Data for Biomedical Applications", Proceedings of the ISCA 21st International Conference on Computer Applications in Industry and Engineering, pp. 26-31, 2008.
[15] Alan Agresti, "An Introduction to Categorical Data Analysis", Wiley Series in Probability and Statistics, Second Edition, Wiley-Interscience, 2007.
[16] Michael K. Ng and Liping Jing, "A New Fuzzy k-Modes Clustering Algorithm for Categorical Data", International Journal of Granular Computing, Rough Sets and Intelligent Systems, Vol. 1, No. 1, pp. 105-119, 2009.
[17] A. Rajaraman and J. D. Ullman, "Mining of Massive Datasets", Cambridge University Press, 2011.
[18] D. K. Roy and L. K. Sharma, "Genetic K-means Clustering Algorithm for Mixed Numerical and Categorical Data Set", International Journal of Artificial Intelligence & Applications, Vol. 1, No. 2, pp. 23-28, 2010.
[19] E. G. Mansoori, "FRBC: Fuzzy Rule-Based Clustering Algorithm", IEEE Transactions on Fuzzy Systems, Vol. 19, No. 5, pp. 960-971, 2011.
[20] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 73-84, 1998.
[21] Y. Yang, X. Guan and J. You, "CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data", Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 682-687, 2002.
[22] H. Rezankova, "Cluster Analysis and Categorical Data", Statistika, pp. 216-232, 2009.
[23] M. Ester, H. P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases", Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 222-231, 1996.
[24] H. Zengyou, X. Xu and S. Deng, "k-ANMI: A Mutual Information Based Clustering Algorithm for Categorical Data", Information Fusion, Vol. 9, No. 2, pp. 223-233, 2008.
[25] UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html.
