than the current one, reallocate the object to the new cluster.

5. Repeat steps 2, 3 and 4 until no such modification exists.

The K-modes algorithm produces only a locally optimal solution. The author compares the performance and scalability of K-modes with the K-prototype algorithm. Cluster performance is verified using cluster accuracy and error rate on the soya bean disease dataset. Soya bean has 47 instances with 35 attributes each and can be classified under four disease types. The K-modes algorithm was tested on the soya bean dataset and produced 200 clustering results with two different mode selection methods. A misclassification matrix is generated to analyse the clustering result against the disease classification. Scalability is verified against the number of clusters for a given number of objects, and against the number of objects for a given number of clusters, using the motor insurance dataset. Motor insurance has 690 instances described by 6 numerical and 9 categorical attributes.
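The reallocation loop in the steps above can be made concrete with a short example. The following is a minimal K-modes style sketch in Python, assuming a simple matching (Hamming-like) dissimilarity and toy categorical tuples; it is an illustration, not the implementation evaluated by the author.

```python
from collections import Counter
import random

def matching_dissimilarity(x, mode):
    """Simple matching distance: number of attributes where the object and mode differ."""
    return sum(1 for a, b in zip(x, mode) if a != b)

def k_modes(data, k, max_iter=100, seed=0):
    random.seed(seed)
    modes = random.sample(data, k)                  # initial modes picked from the data
    labels = [0] * len(data)
    for _ in range(max_iter):
        changed = False
        # assign every object to its nearest mode, reallocating when a closer mode exists
        for i, x in enumerate(data):
            nearest = min(range(k), key=lambda c: matching_dissimilarity(x, modes[c]))
            if nearest != labels[i]:
                labels[i], changed = nearest, True
        # update each mode to the most frequent value of every attribute in its cluster
        for c in range(k):
            members = [data[i] for i in range(len(data)) if labels[i] == c]
            if members:
                modes[c] = tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        if not changed:                             # stop when no reallocation occurs
            break
    return labels, modes

# toy categorical tuples (hypothetical values)
data = [("red", "round"), ("red", "oval"), ("blue", "square"), ("blue", "round")]
print(k_modes(data, k=2))
```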
2.2 SQUEEZER
Squeezer [9] [3] is a categorical data clustering algorithm. The main data structures involved are the Cluster Summary and the Cluster Structure. The Summary holds a set of attribute-value pairs together with their corresponding support values, and the Cluster Structure (CS) holds the cluster along with its summary information. The advantages of the Squeezer algorithm are: 1) it produces high quality clusters, 2) it has good scalability, and 3) it makes only one scan over the dataset, so it is highly efficient when I/O cost is considered. The disadvantage of the Squeezer algorithm is that the quality of the clusters depends on the threshold value s. Its space complexity is O(n + k * p * m), where 'n' represents the size of the data set, 'm' the number of attributes, 'k' the final number of clusters and 'p' the number of distinct attribute values.

2.2.1 Methodology:

1. Read the first tuple.
2. Generate the Cluster Structure (CS).
3. Read the next tuple and compute its similarity to each existing cluster using the support measure given as,

$$Sim(C, tid) = \sum_{i=1}^{m} \frac{Sup(a_i)}{\sum_{j} Sup(a_j)} \qquad (3)$$

4. If the similarity is greater than the threshold 's', add the tuple to that Cluster Structure; else assign it to a new Cluster Structure.
5. Repeat steps 2 through 4 until the end of the tuples.

The author implements the algorithm in Java and compares the Squeezer algorithm with ROCK using the Congressional vote dataset and the Mushroom dataset. The Congressional vote dataset has 435 tuples with 16 attributes and 2 classes (democrat and republican). Mushroom has 8125 tuples with 22 attributes and 2 classes (poisonous and edible). Threshold values are assumed as 10 and 16 for the vote and mushroom datasets respectively. The author concludes that both algorithms produce high quality clusters, and that the only parameter that affects the clustering result and the speed of the algorithm is the threshold value s.
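A minimal sketch of the Squeezer loop described above: each incoming tuple is compared against the summaries of the existing clusters with an Eq.(3)-style support similarity and is either absorbed or used to start a new cluster. The data structures and the toy threshold are illustrative assumptions, not the authors' Java implementation.

```python
from collections import defaultdict

def similarity(summary, tuple_):
    """Eq.(3)-style similarity: for each attribute, the support of the tuple's value
    divided by the total support of that attribute in the cluster summary."""
    sim = 0.0
    for i, value in enumerate(tuple_):
        total = sum(summary[i].values())
        if total:
            sim += summary[i].get(value, 0) / total
    return sim

def squeezer(tuples, threshold):
    clusters = []   # each cluster: {"members": [...], "summary": [attribute-value supports]}
    for t in tuples:
        best, best_sim = None, -1.0
        for c in clusters:
            s = similarity(c["summary"], t)
            if s > best_sim:
                best, best_sim = c, s
        if best is None or best_sim < threshold:
            best = {"members": [], "summary": [defaultdict(int) for _ in t]}
            clusters.append(best)                   # start a new Cluster Structure
        best["members"].append(t)                   # add the tuple to the chosen cluster
        for i, value in enumerate(t):
            best["summary"][i][value] += 1          # update the attribute-value supports
    return clusters

rows = [("y", "n", "y"), ("y", "n", "n"), ("n", "y", "y")]   # toy categorical tuples
print(len(squeezer(rows, threshold=1.5)))                    # -> 2 clusters
```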
2.3 ROCK

ROCK stands for RObust Clustering using linKs [4]. It is an agglomerative hierarchical clustering algorithm that uses links to measure the similarity between data points. Initially each tuple is assigned to a separate cluster; clusters are then merged based on the closeness between clusters, where closeness is measured as the sum of the number of links between all pairs of tuples. It is suitable for Boolean and categorical data; in the traditional approach, categorical data are treated as Boolean values. The scalability of the algorithm depends on the sample size. The criterion function and the goodness measure used are given in Eq.(4) and Eq.(5).

Criterion function:

$$E_l = \sum_{i=1}^{k} n_i \sum_{p_q, p_r \in C_i} \frac{link(p_q, p_r)}{n_i^{1 + 2f(\theta)}} \qquad (4)$$

Goodness measure:

$$g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1 + 2f(\theta)} - n_i^{1 + 2f(\theta)} - n_j^{1 + 2f(\theta)}} \qquad (5)$$

2.3.1 Methodology:

1. Draw a random sample.
2. Compute the link similarity.
3. Cluster with the links.
4. Label the data on the disk.

The author uses the Congressional Vote and Mushroom datasets from the UCI repository and compares the ROCK algorithm with a traditional centroid-based hierarchical algorithm. Experiments were conducted on a Sun Ultra-2/200 machine running the Solaris 2.5 operating system. In the vote dataset, the cluster of republicans contains only 12% democrats, whereas the traditional approach yields 25% democrats, with θ = 0.73. For the Mushroom dataset ROCK uses θ = 0.8 and the number of desired clusters is 20; it discovers pure clusters in the sense that the mushrooms in every cluster were either edible or poisonous.
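To make the link-based measures concrete, here is a small sketch: two tuples are neighbours if their Jaccard similarity is at least θ, link(p, q) counts their common neighbours, and Eq.(5) scores a candidate merge. The choice f(θ) = (1 − θ)/(1 + θ) follows the ROCK paper [4]; the toy transactions below are illustrative assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def neighbors(points, theta):
    """Two points are neighbours if their Jaccard similarity is at least theta."""
    nbrs = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) >= theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def links(nbrs, i, j):
    """link(p_i, p_j): number of common neighbours."""
    return len(nbrs[i] & nbrs[j])

def goodness(ci, cj, nbrs, theta):
    """Eq.(5): goodness of merging clusters ci and cj (lists of point indices)."""
    f = (1.0 - theta) / (1.0 + theta)          # f(theta) as used in the ROCK paper
    cross_links = sum(links(nbrs, i, j) for i in ci for j in cj)
    ni, nj = len(ci), len(cj)
    denom = (ni + nj) ** (1 + 2 * f) - ni ** (1 + 2 * f) - nj ** (1 + 2 * f)
    return cross_links / denom

# toy transactions (sets of categorical items), hypothetical
pts = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"x", "y"}, {"x", "y", "z"}]
nbrs = neighbors(pts, theta=0.3)
print(goodness([0, 1], [2], nbrs, theta=0.3))   # merging the "a/b/c" clusters scores > 0
```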
2.4 K-HISTOGRAM

K-Histogram extends the k-means algorithm to the categorical domain by replacing the mean with a histogram and dynamically updating the histograms during the clustering process [13]. The k-means algorithm cannot cluster categorical data in an efficient way; to make it work for categorical data, two modifications are made. First, the mean value is replaced with a histogram. Second, a new dissimilarity measure between a categorical object and a histogram is applied. The dissimilarity function and cost measure applied for K-Histogram are given in Eq.(6) and Eq.(7).

The dissimilarity function used is,

$$d(H, Y) = \frac{\sum_{j=1}^{m} \varphi(h_j, y_j)}{n} \qquad (6)$$

where φ(h_j, y_j) denotes the per-attribute dissimilarity between the histogram h_j and the categorical value y_j. The cost function used is given in Eq.(7).
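A minimal sketch of a histogram-based dissimilarity in the spirit of Eq.(6): each attribute contributes the share of cluster members whose value differs from the object's value. The frequency-based reading of φ and the toy cluster are assumptions for illustration, not the exact definition from [13].

```python
from collections import Counter

def build_histograms(cluster):
    """One frequency histogram (Counter) per attribute for the objects in a cluster."""
    return [Counter(col) for col in zip(*cluster)]

def dissimilarity(histograms, y, n):
    """Eq.(6)-style dissimilarity between histograms H and a categorical object Y.

    Each attribute contributes 1 - freq(y_j) / n, i.e. the share of cluster
    members that do NOT take the object's value (an assumed reading of phi)."""
    return sum(1.0 - h.get(v, 0) / n for h, v in zip(histograms, y))

cluster = [("red", "round"), ("red", "oval"), ("blue", "round")]   # hypothetical members
H = build_histograms(cluster)
print(dissimilarity(H, ("red", "round"), n=len(cluster)))          # 1/3 + 1/3 = 0.666...
```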
2.9 CURE (CLUSTERING USING REPRESENTATIVES)

CURE represents each cluster with a fixed number of points that are produced by selecting well scattered points from the cluster and then shrinking them towards the centre of the cluster [20]. The scattered points after shrinking are the representatives of that cluster, and the clusters with the closest pair of representatives are merged repeatedly. It is an approach between the centroid-based and the all-points extremes. The time complexity of CURE is O(s²) for low-dimensional data, where s is the sample size of the data.

2.9.1 Methodology:

1. Draw a random sample from the given data set.
2. Partition the sample.
3. Partially cluster the partitions.
4. Eliminate outliers.
5. Cluster the partial clusters.
6. Label the data on the disk.

The parameters that affect the CURE algorithm are the shrinking factor (α), the number of representative points (c), the sample size (s) and the number of partitions (p). The performance of CURE is compared with BIRCH and the Minimum Spanning Tree (MST) approach. Results show that CURE can discover clusters with interesting shapes, is less sensitive to outliers, and needs less execution time.
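The representative-point idea in the steps above can be sketched briefly: pick a few well-scattered points from a cluster and shrink them toward the centroid by the factor α. The farthest-point heuristic, the toy 2-D points and the α value below are illustrative assumptions.

```python
def centroid(points):
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cure_representatives(cluster, c=3, alpha=0.5):
    """Pick c well-scattered points (farthest-point heuristic), then shrink
    each one toward the cluster centroid by the shrinking factor alpha."""
    mean = centroid(cluster)
    scattered = [max(cluster, key=lambda p: distance(p, mean))]   # start far from the centre
    while len(scattered) < min(c, len(cluster)):
        # next representative: the point farthest from the ones already chosen
        nxt = max(cluster, key=lambda p: min(distance(p, s) for s in scattered))
        scattered.append(nxt)
    return [tuple(s_i + alpha * (m_i - s_i) for s_i, m_i in zip(s, mean)) for s in scattered]

cluster = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.1), (8.0, 8.0)]   # toy 2-D points
print(cure_representatives(cluster, c=2, alpha=0.5))
```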
2.10 k-ANMI

The author [24] uses the average normalized mutual information (entropy based) as the criterion for the k-modes algorithm. The objective function is defined as the average normalized mutual information between the resulting clustering and the partitions induced by the individual categorical attributes [24].
A number of categorical clustering algorithms exist in the literature [4, 9, 11, 12, 13, 18, 20, 21, 23, 24]. Table.1 summarizes the time complexity of several of them.

Table.1. Time complexity of various clustering algorithms

Sl. No. | Algorithm | Time Complexity
1 | K-modes | O(tkn), where t = no. of iterations, k = no. of clusters, n = no. of objects
2 | Squeezer | O(n*k*p*m), where n = size of the data set, k = final no. of clusters, m = no. of attributes, p = no. of distinct attribute values
3 | ROCK | O(n² + n·m_m·m_a + n² log n), where n = no. of input data points, m_m = maximum no. of neighbours, m_a = average no. of neighbours
4 | K-Histogram | O(tkn), where t = no. of iterations, k = no. of clusters, n = no. of objects
5 | Agglomerative hierarchical clustering | O(n³), where n = no. of objects
6 | Genetic K-means [18] | O(nd) for the fitness function, O(n²d) for mutation and O(nKd) for the K-means step
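The ANMI criterion used by k-ANMI above is built from the normalized mutual information between two partitions of the same objects. The sketch below computes an ANMI-style score by averaging NMI over the partitions induced by the attributes; normalizing by the mean of the two entropies is an assumed convention, and the toy labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def nmi(a, b):
    """Normalized mutual information between two labelings of the same objects."""
    ha, hb = entropy(a), entropy(b)
    return mutual_information(a, b) / ((ha + hb) / 2) if ha and hb else 0.0

def anmi(clustering, attribute_partitions):
    """Average NMI between a clustering and the partitions induced by each attribute."""
    return sum(nmi(clustering, p) for p in attribute_partitions) / len(attribute_partitions)

clustering = [0, 0, 1, 1]                              # hypothetical cluster labels
attributes = [["a", "a", "b", "b"], ["x", "y", "x", "y"]]
print(anmi(clustering, attributes))                    # 1.0 and 0.0 averaged -> 0.5
```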
4. EXISTING CATEGORICAL SIMILARITY MEASURES

4.1 CHI-SQUARED

Karl Pearson proposed the chi-squared statistic in 1900 [15]. It examines whether there exists any association between the categorical variables. The range is -1 to +1 for two variables and 0 to +1 for a larger number of variables; a value closer to 1 indicates a stronger relationship between the variables. The chi-square (χ²) formula is defined as,

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \qquad (9)$$

where O_i represents the observed value and E_i represents the expected value.

Steps in the chi-square test:

1. Obtain the observed frequencies.
2. Note the expected frequencies.
3. Apply the chi-square formula.
4. Find the degrees of freedom (df = N - 1).
5. If the obtained value is equal to or greater than the chi-square table value, reject the null hypothesis.

The advantage of chi-square is that it requires no assumptions about the shape of the population distribution from which a sample is drawn, and it can be applied to nominal or ordinal variables. The limitations of chi-square similarity are: 1) it needs quantitative data, 2) it is sensitive to sample size, 3) it does not give much information about the strength of the relationship, and 4) the expected frequency should not be less than 1.
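Steps 1-3 above amount to a single summation. A short sketch with made-up observed and expected frequencies:

```python
def chi_square(observed, expected):
    """Eq.(9): sum of (O_i - E_i)^2 / E_i over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical observed counts for a categorical attribute with 4 values
observed = [18, 22, 30, 30]
# expected counts under the null hypothesis of no association (uniform here)
expected = [25, 25, 25, 25]

stat = chi_square(observed, expected)
df = len(observed) - 1            # degrees of freedom, df = N - 1
print(stat, df)                   # compare stat against the chi-square table for df
```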
4.2 COSINE SIMILARITY

Cosine similarity [17] is a popular method for text mining. It is used for comparing documents (word frequencies) and for finding the closeness among data points in clustering. Its range lies between 0 and 1. The similarity between two terms X and Y is defined as,

$$CosineSim(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} \qquad (10)$$

One desirable property of cosine similarity is that it is independent of document length. A limitation is that the terms are assumed to be orthogonal in space. If the value is zero, no similarity exists between the data elements, and if the value is 1, similarity exists between the two elements. Consider two documents X and Y with attributes X = {1 2 3 0 0} and Y = {2 4 0 0 1}:

$$CosineSim(X, Y) = \frac{1 \times 2 + 2 \times 4 + 3 \times 0 + 0 \times 0 + 0 \times 1}{\sqrt{1^2 + 2^2 + 3^2} \, \sqrt{2^2 + 4^2 + 1^2}} = 0.5832$$
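The worked example can be reproduced in a few lines:

```python
import math

def cosine_sim(x, y):
    """Eq.(10): dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

X = [1, 2, 3, 0, 0]
Y = [2, 4, 0, 0, 1]
print(round(cosine_sim(X, Y), 4))   # 0.5832, matching the worked example above
```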
4.3 OVERLAP

The overlap measure counts the number of attributes that match between two data instances. It uses only the diagonal entries of the similarity matrix and sets the off-diagonal entries to 0 [5]. The range of each per-attribute value is 0 to 1: 0 indicates that no match exists between the attribute values and 1 indicates that a match exists. The overlap similarity is defined as,

$$S_k(X_k, Y_k) = \begin{cases} 1 & \text{if } X_k = Y_k \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$
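Per-attribute scores of Eq.(11) are usually averaged over the attributes to give an instance-level similarity; that averaging convention is assumed in the short sketch below.

```python
def overlap(x, y):
    """Per-attribute overlap of Eq.(11), averaged over the attributes
    (the averaging convention is assumed here)."""
    per_attribute = [1 if a == b else 0 for a, b in zip(x, y)]
    return sum(per_attribute) / len(per_attribute)

print(overlap(("red", "round", "small"), ("red", "oval", "small")))   # 2 of 3 match -> 0.666...
```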
4.4 DISC

DISC stands for Data Intensive Similarity Measure for Categorical Data analysis [6]. It makes use of a data structure called the categorical information table (CI table), which stores the co-occurrence statistics for the categorical data. The similarity between two attribute values is measured using the cosine similarity measure.

4.4.1 Methodology:

1. Construct the categorical information table (CI table).
2. Initialize the similarity matrix,

$$\forall i, j, k : sim(v_{ik}, v_{jk}) = 0 \ \text{if } v_{ik} \neq v_{jk} \qquad (12)$$

$$\forall i, j, k : sim(v_{ik}, v_{jk}) = 1 \ \text{if } v_{ik} = v_{jk} \qquad (13)$$

3. Compute the similarity between two values v_{ij} and v_{ik} of attribute A_i using,

$$Sim(v_{ij}, v_{ik}) = \frac{1}{d - 1} \sum_{m = 1, m \neq i}^{d} Similarity_m \qquad (14)$$

where Similarity_m = CosineProduct(CI[A_i:v_{ij}][A_m], CI[A_i:v_{ik}][A_m]) for a categorical attribute A_m, and

$$Similarity_m = 1 - \frac{\left| CI[A_i{:}v_{ij}][A_m] - CI[A_i{:}v_{ik}][A_m] \right|}{\max(A_m) - \min(A_m)} \qquad (15)$$

for a numerical attribute A_m, where the cosine product is defined as,

$$Sim_{CS} = \frac{\sum_{v_{ml}, v'_{ml} \in A_m} CI[A_i{:}v_{ij}][A_m{:}v_{ml}] \times CI[A_i{:}v_{ik}][A_m{:}v'_{ml}] \times sim(v_{ml}, v'_{ml})}{NormalVector1 \times NormalVector2} \qquad (16)$$

4. Repeat steps 2 and 3.

The author concludes that DISC outperforms other similarity measures for both classification and regression analysis.
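The DISC construction in [6] is iterative and uses the value-level similarities inside the cosine product; the sketch below keeps only the core idea — co-occurrence vectors compared with cosine similarity and averaged over the other attributes — so it is an illustration under simplifying assumptions, not the authors' algorithm.

```python
from collections import defaultdict
import math

def ci_table(data, d):
    """Co-occurrence counts: ci[i][v][m][w] = how often value v of attribute i
    occurs together with value w of attribute m (a simplified CI table)."""
    ci = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(int))))
    for row in data:
        for i in range(d):
            for m in range(d):
                if m != i:
                    ci[i][row[i]][m][row[m]] += 1
    return ci

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def disc_like_similarity(ci, i, a, b, d):
    """Eq.(14)-style: average, over the other attributes, of the cosine similarity
    of the co-occurrence vectors of values a and b of attribute i (no iteration,
    identity value-similarity assumed inside the cosine product)."""
    sims = [cosine(ci[i][a][m], ci[i][b][m]) for m in range(d) if m != i]
    return sum(sims) / (d - 1)

data = [("red", "apple"), ("red", "cherry"), ("green", "apple"), ("yellow", "banana")]
ci = ci_table(data, d=2)
print(disc_like_similarity(ci, 0, "red", "green", d=2))   # colours sharing fruits score higher
```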
4.5 DILCA

DILCA (DIstance Learning in Categorical Attributes) is the measure used by the author [7, 8]. A co-occurrence table is formed for all the features; using symmetric uncertainty, a correlation matrix is generated, conditional probabilities are applied, and the results are given to the Euclidean measure to find the similarity between the attribute values.

4.5.1 Methodology:

1. Context selection (feature extraction) is based on symmetric uncertainty (SU), a correlation-based measure from information theory. The correlation matrix is formed using SU,

$$SU(X, Y) = 2 \times \frac{IG(X \mid Y)}{H(X) + H(Y)} \qquad (17)$$

where IG(X|Y) is the information gain, and H(X) and H(Y) represent the entropy of the variables X and Y respectively.

2. Distance computation: conditional probabilities are applied to the correlation matrix and the Euclidean distance is used,

$$d(x_i, x_j) = \sqrt{ \sum_{Y \in context(X)} \sum_{y_k \in Y} \left( P(x_i \mid y_k) - P(x_j \mid y_k) \right)^2 } \qquad (18)$$
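A minimal sketch of the two DILCA steps for a single target attribute: symmetric uncertainty (Eq.(17)) to score candidate context attributes, then the Euclidean distance of Eq.(18) over conditional probabilities. The SU threshold and the toy data are assumptions for the illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), Eq.(17)."""
    hx, hy = entropy(x), entropy(y)
    # H(X|Y): entropy of X within each group of equal y, weighted by group size
    groups = defaultdict(list)
    for xv, yv in zip(x, y):
        groups[yv].append(xv)
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in groups.values())
    ig = hx - h_x_given_y
    return 2 * ig / (hx + hy) if (hx + hy) else 0.0

def conditional_probs(x, y):
    """P(x_value | y_value) estimated from co-occurrence counts."""
    joint, y_counts = defaultdict(int), Counter(y)
    for xv, yv in zip(x, y):
        joint[(xv, yv)] += 1
    return {(xv, yv): c / y_counts[yv] for (xv, yv), c in joint.items()}

def dilca_distance(xi, xj, x, context_columns):
    """Eq.(18): Euclidean distance between two values of the target attribute,
    over their conditional probabilities given every context attribute value."""
    total = 0.0
    for y in context_columns:
        probs = conditional_probs(x, y)
        for yv in set(y):
            total += (probs.get((xi, yv), 0.0) - probs.get((xj, yv), 0.0)) ** 2
    return math.sqrt(total)

# toy data: target attribute "colour", one candidate context attribute "shape"
colour = ["red", "red", "green", "green", "blue"]
shape  = ["round", "round", "square", "round", "square"]
su = symmetric_uncertainty(colour, shape)            # step 1: context selection score
context = [shape] if su > 0.1 else []                # assumed SU threshold of 0.1
print(su, dilca_distance("red", "green", colour, context))
```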
Cluster accuracy 'r' is defined as,

$$r = \frac{\sum_{i=1}^{k} a_i}{n} \qquad (19)$$

where 'n' refers to the number of instances in the dataset, 'a_i' refers to the number of instances occurring in both cluster i and its corresponding class, and 'k' refers to the final number of clusters. The error rate 'E' is defined as E = 1 - r, where 'r' refers to the cluster accuracy.
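A small sketch of Eq.(19) and the error rate, taking a_i as the count of the most frequent true class inside cluster i (an assumed reading of "its corresponding class"); the labels are toy values.

```python
from collections import Counter

def cluster_accuracy(labels, classes):
    """Eq.(19): r = (sum over clusters of the majority-class count a_i) / n."""
    n = len(labels)
    per_cluster = {}
    for cluster, cls in zip(labels, classes):
        per_cluster.setdefault(cluster, []).append(cls)
    a = sum(Counter(members).most_common(1)[0][1] for members in per_cluster.values())
    return a / n

labels  = [0, 0, 0, 1, 1, 1]                 # hypothetical cluster assignments
classes = ["e", "e", "p", "p", "p", "p"]     # hypothetical true classes
r = cluster_accuracy(labels, classes)
print(r, 1 - r)                              # accuracy r and error rate E = 1 - r
```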
Real world datasets: five real life datasets, namely Mushroom, Vote, Iris, Cancer and Zoo, were obtained from the UCI machine learning repository [25]. Mushroom: each tuple represents the physical characteristics of a mushroom; the number of instances is 8124 and the number of attributes is 22, classified into edible (4028) and poisonous (3916) mushrooms. Vote: each tuple represents a United States congressional voting record from 1984; the number of instances is 435 and the number of attributes is 16, classified into Democrat (267) and Republican (168). Iris: the dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant; the number of instances is 150 and the number of attributes is 4. Cancer: the number of instances is 8124 and the number of attributes is 22. Zoo: the Zoo dataset has 18 attributes with 101 instances, and its class distribution has 7 classes. Soyabean: the number of instances is 307 and the number of attributes is 35.

In this survey, cluster accuracy and error rate are evaluated in three ways: 1) comparison of different categorical clustering algorithms, 2) comparison of DILCA combined with partitional and hierarchical clustering algorithms, and 3) comparison of categorical similarity measures.

5.1 COMPARISONS OF DIFFERENT CATEGORICAL CLUSTERING ALGORITHM

The algorithms used for comparison are K-modes [10] [12] [16] and fuzzy K-modes [16], using the Soya bean and Zoo datasets; the results are depicted in Fig.1. The algorithm was run 100 times and the fuzzy parameter is set to 1.1. The author [16] makes use of four real life datasets to show the accuracy, precision and recall values. To illustrate the efficiency of the algorithm, synthetic datasets are generated and four different graphs are plotted: 1) time vs. number of clusters, 2) time vs. number of objects, 3) time vs. number of categories and 4) time vs. number of attributes. Fuzzy K-modes outperforms K-modes for both datasets.

Fig.1. Comparison of K-modes and Fuzzy K-modes (accuracy and error rate for the Zoo and Soya bean datasets)

ROCK and Squeezer [9] for the Mushroom dataset are compared in Fig.2. Squeezer outperforms ROCK for the Mushroom dataset and ROCK outperforms Squeezer for the Vote dataset. The author [8] compares ROCK with hierarchical and partitional algorithms for four real world datasets. Threshold values for ROCK are set from 0.2 to 1 in steps of 0.05.

Fig.2. Comparison of ROCK and Squeezer (accuracy and error rate for the Mushroom and Vote datasets)

5.2 COMPARISONS OF DILCA COMBINED WITH PARTITION AND HIERARCHICAL CLUSTERING ALGORITHM

The author [8] determines the quality of the clusters formed using accuracy and normalized mutual information. The partitional (K-modes) algorithm and Ward's hierarchical clustering algorithm are combined with DILCA for the Mushroom, Vote and Cancer datasets; the results are depicted in Fig.3. Both algorithms set the number
of clusters equal to the number of classes. The DILCA_Kmodes algorithm is implemented on the WEKA platform and the DILCA_Hierarchical algorithm is implemented using Murtagh's Java platform.

Fig.3. Comparison of DILCA_Kmodes and DILCA_Hierarchical (accuracy and error rate for the Mushroom, Vote and Cancer datasets)

Fig.4. Comparison of DISC and Overlap (accuracy and error rate for the Car, Iris and Cancer datasets)

Several open source data mining tools are available on the web. Table.2 illustrates various data mining tools and a few of the clustering algorithms they provide.

Table.2. Data mining tools and clustering algorithms

Tool Name | Clustering Algorithms
Weka | K-Means, X-Means, EM, Cobweb, CLOPE, OPTICS, DBSCAN, hierarchical clustering
R package | K-Means, PAM, DBSCAN, ROCK, hierarchical clustering
Rapidminer | K-Means, DBSCAN, EM, K-Medoids, X-Means, Kernel K-Means, Fast K-Means, hierarchical clustering

REFERENCES

[1] J. Han and M. Kamber, "Data Mining Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 2000.
[2] S. A. Elavarasi, J. Akilandeswari and B. Sathiyabhama, "A Survey on Partition Clustering Algorithms", International Journal of Enterprise Computing and Business Systems, Vol. 1, No. 1, pp. 1-14, 2011.
[3] R. Ranjani, S. A. Elavarasi and J. Akilandeswari, "Categorical Data Clustering using Cosine based Similarity for Enhancing the Accuracy of Squeezer Algorithm", International Journal of Computer Applications, Vol. 45, No. 20, pp. 41-45, 2012.
[4] S. Guha, R. Rastogi and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 2000.
[5] S. Boriah, V. Chandola and V. Kumar, "Similarity Measures for Categorical Data: A Comparative Evaluation", Proceedings of the 8th SIAM International Conference on Data Mining, pp. 243-254, 2008.
[6] A. Desai, H. Singh and V. Pudi, "DISC: Data Intensive Similarity Measure for Categorical Data", Proceedings of Advances in Knowledge Discovery and Data Mining - 15th Pacific Asia Conference, Vol. 6635, pp. 469-481, 2011.
[7] D. Ienco, R. G. Pensa and R. Meo, "From Context to Distance: Learning Dissimilarity for Categorical Data Clustering", ACM Transactions on Knowledge Discovery from Data, Vol. 6, No. 1, 2012.
[8] D. Ienco, R. G. Pensa and R. Meo, "Context-based Distance Learning for Categorical Data Clustering", Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, pp. 83-94, 2009.
[9] H. Zengyou, X. Xiaofei and D. Shengchun, "Squeezer: An Efficient Algorithm for Clustering Categorical Data", Journal of Computer Science & Technology, Vol. 17, No. 5, pp. 611-624, 2002.
[10] Z. Huang and M. K. Ng, "A Fuzzy k-Modes Algorithm for Clustering Categorical Data", IEEE Transactions on Fuzzy Systems, Vol. 7, No. 4, pp. 446-452, 1999.
[11] P. Agarwal, M. Afshar Alam and R. Biswas, "Analysing the Agglomerative Hierarchical Clustering Algorithm for Categorical Attributes", International Journal of Innovation, Management and Technology, Vol. 1, No. 2, pp. 186-190, 2010.
[12] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
[13] Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset", CoRR, Vol. abs/cs/0509033, 2005.
[14] Y. Lu and L. R. Liang, "Hierarchical Clustering of Features on Categorical Data for Biomedical Applications", Proceedings of the ISCA 21st International Conference on Computer Applications in Industry and Engineering, pp. 26-31, 2008.
[15] Alan Agresti, "An Introduction to Categorical Data Analysis", Wiley Series in Probability and Statistics, Second Edition, Wiley-Interscience, 2007.
[16] Michael K. Ng and Liping Jing, "A New Fuzzy k-Modes Clustering Algorithm for Categorical Data", International Journal on Granular Computing, Rough Sets and Intelligent Systems, Vol. 1, No. 1, pp. 105-119, 2009.
[17] A. Rajaraman and J. D. Ullman, "Mining of Massive Datasets", Cambridge University Press, 2011.
[18] D. K. Roy and L. K. Sharma, "Genetic K-means Clustering Algorithm for Mixed Numerical and Categorical Data Set", International Journal of Artificial Intelligence & Applications, Vol. 1, No. 2, pp. 23-28, 2010.
[19] E. G. Mansoori, "FRBC: Fuzzy Rule Based Clustering Algorithm", IEEE Transactions on Fuzzy Systems, Vol. 19, No. 5, pp. 960-971, 2011.
[20] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 73-84, 1998.
[21] Y. Yang, X. Guan and J. You, "CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data", Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 682-687, 2002.
[22] H. Rezankova, "Cluster Analysis and Categorical Data", Statistika, pp. 216-232, 2009.
[23] M. Ester, H. P. Kriegel, J. Sander and X. Xu, "A Density-based Algorithm for Discovering Clusters in Large Spatial Databases", Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 222-231, 1996.
[24] H. Zengyou, X. Xu and S. Deng, "k-ANMI: A Mutual Information based Clustering Algorithm for Categorical Data", Information Fusion, Vol. 9, No. 2, pp. 223-233, 2008.
[25] UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html.