
IJCSES International Journal of Computer Sciences and Engineering Systems, Vol. 1, No. 4, October 2007
CSES International © 2007    ISSN 0973-4406
Manuscript received July 30, 2007; revised September 20, 2007

An Optimized Approach on Applying Genetic Algorithm to Adaptive Cluster Validity Index
Tzu-Chieh LIN(1), Hsiang-Cheh HUANG(2), Bin-Yih LIAO(1) & Jeng-Shyang PAN(1)
(1) Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 807, Taiwan
E-mail: {tclin, byliao, jspan}@bit.kuas.edu.tw
(2) Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan
E-mail: [email protected]

Abstract: Partitioning, or clustering, is an important research branch in the data mining area; it divides a dataset into an arbitrary number of clusters based on the correlation among the attributes of the dataset elements. Most datasets have an original number of clusters, which can be estimated with a cluster validity index, but most existing indices estimate this number incorrectly for many real datasets. To solve this problem, this paper applies the optimization technique of the genetic algorithm (GA) to a new adaptive cluster validity index, called the Gene Index (GI). The algorithm uses GA to adjust the weighting factors of the adaptive cluster validity index in order to train an optimal index. It is tested with many real datasets, and the results show that the proposed algorithm achieves higher performance and estimates the original cluster number of real datasets more accurately than current cluster validity index methods.
Keywords: Clustering, Genetic Algorithm, Cluster Validity Index, Optimization, Data Mining

1. INTRODUCTION

Data partitioning is commonly encountered in real applications, and many schemes have been proposed in the literature to assess the performance of specific algorithms. The main concern of data partitioning is how to correctly divide the data points into clusters. Some algorithms in the literature are specifically designed for certain databases; thus, they may perform well in some cases but not in general. In this paper, we propose a generalized scheme, integrated with optimization techniques, for better partitioning of the data.

A number of indices have been proposed in the literature to assess the performance of data clustering. The main ideas are twofold: (1) data points within the same cluster should be located as close together as possible, and (2) data points in different clusters should be as far apart as possible. Based on these two concepts, a variety of cluster validity indices have been proposed. We perform the necessary simulations and verify that not all of the indices perform well. Therefore, we employ the genetic algorithm (GA) [1] to obtain better performance in data partitioning.

This paper is organized as follows. In Section 2 we describe the data partitioning schemes and the cluster validity indices. In Section 3 we present the proposed algorithm, which integrates the existing indices and trains with GA. Simulation results are demonstrated in Section 4. Finally, we conclude this paper in Section 5.

2. DATA PARTITIONING SCHEMES AND CLUSTER VALIDITY INDICES

In this paper, we employ the fuzzy C-means (FCM) [2] algorithm for data clustering and then make comparisons among several indices. Following the concepts of fuzzy theory, a data point does not belong absolutely to a certain cluster; instead, a floating-point number represents its degree of belonging to each cluster.
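To make the FCM mechanics concrete, the following is a minimal NumPy sketch of the standard FCM iteration (our illustration, not the authors' implementation; the fuzzifier m = 2, the tolerance, and the random initialization are our assumptions):

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means: returns centers V (c x d) and the
    membership matrix U (c x n), whose columns sum to 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                        # enforce sum_i u_ik = 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)           # cluster centers
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                 # guard against zero distances
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)         # u_ik = 1 / sum_j (d_ik/d_jk)^(2/(m-1))
        if np.abs(U_new - U).max() < tol:
            return V, U_new
        U = U_new
    return V, U
```

Every membership u_ik lies in [0, 1], and each data point's memberships across the c clusters sum to one, which is exactly the "degree of belonging" described above.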
The major drawback of FCM and other clustering algorithms is that the correct number of clusters cannot be known exactly in advance. Thus, cluster validity indices with several kinds of representations have been proposed to evaluate the correct number of clusters. Every index has its advantages and drawbacks. We review several commonly encountered indices and perform verifications in this section, and we then combine the advantages of these indices into the proposed genetic-based cluster validity index in Sec. 3.

2.1 Cluster Validity Index: PC Index

The PC (partition coefficient) index [3] was one of the measures used in the early days, with the definition in Eq. (1):

V_{PC}(U) = \frac{1}{n} \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^2,   (1)

where u_{ik} denotes the degree of membership of the k-th data point x_k in cluster i, each x_k being a d-dimensional measured data point (and we use d = 2 here as an example), under the conditions

u_{ik} \in [0, 1], \forall i, k; \qquad \sum_{i=1}^{c} u_{ik} = 1, \forall k.

To assess the effectiveness of a clustering algorithm, the larger the PC index value, the better the performance.

2.2 Cluster Validity Index: PE Index

The PE (partition entropy) index was also proposed in [3], with the definition in Eq. (2):

V_{PE}(U) = -\frac{1}{n} \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} \log(u_{ik}).   (2)

To assess the effectiveness of a clustering algorithm, the smaller the PE index value, the better the performance.
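As a sketch of how Eqs. (1) and (2) translate into code (our transcription, assuming U is the c × n membership matrix returned by the FCM sketch above):

```python
import numpy as np

def pc_index(U):
    """Partition coefficient, Eq. (1): larger is better."""
    n = U.shape[1]
    return np.sum(U ** 2) / n

def pe_index(U, eps=1e-12):
    """Partition entropy, Eq. (2): smaller is better.
    eps avoids log(0) for crisp memberships."""
    n = U.shape[1]
    return -np.sum(U * np.log(U + eps)) / n
```

For a perfectly crisp partition, V_PC reaches 1 and V_PE reaches 0; for the fuzziest partition (u_ik = 1/c everywhere), V_PC drops to 1/c.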

2.3 Cluster Validity Index: XB Index

The XB index was proposed by Xie and Beni in [4], built on the two important concepts of compactness and separation. For a good clustering result, the data points within the same cluster should be as compact as possible, while any two different clusters should be as far apart as possible. It is formulated in Eq. (3):

V_{XB}(U, V; X) = \frac{\sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^2 \| x_j - \nu_i \|^2}{n \cdot \min_{i \neq k} \| \nu_i - \nu_k \|^2},   (3)

where x_j is the j-th d-dimensional measured data point (we use d = 2 here) and \nu_i is the d-dimensional center of cluster i. In Eq. (3), the numerator implies the compactness and the denominator denotes the separation. Therefore, to assess the effectiveness of a clustering algorithm, the smaller the XB index value, the better the performance.

2.4 Cluster Validity Index: K Index

The K index was proposed by Kwon [5] as an improvement of the XB index. From Eq. (3), we find that when c \to n, V_{XB} \to 0, which is generally incorrect for practical applications. By modifying Eq. (3), we obtain Eq. (4):

V_{K}(U, V; X) = \frac{\sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^2 \| x_j - \nu_i \|^2 + \frac{1}{c} \sum_{i=1}^{c} \| \nu_i - \bar{\nu} \|^2}{\min_{i \neq k} \| \nu_i - \nu_k \|^2},   (4)

where \bar{\nu} denotes the geometric center of all data points. To assess the effectiveness of a clustering algorithm, the smaller the K index value, the better the performance.
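A sketch of Eqs. (3) and (4) under the same conventions (our code; X is n × d, V is the c × d matrix of centers, U is c × n):

```python
import numpy as np

def xb_index(U, V, X):
    """Xie-Beni index, Eq. (3): smaller is better."""
    d2 = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) ** 2   # (c, n)
    compact = np.sum((U ** 2) * d2)              # sum_j sum_i u_ij^2 ||x_j - v_i||^2
    sep = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2  # (c, c)
    np.fill_diagonal(sep, np.inf)                # exclude i = k
    return compact / (X.shape[0] * sep.min())

def k_index(U, V, X):
    """Kwon's index, Eq. (4): adds a punishing term so the value
    does not vanish as c approaches n; smaller is better."""
    d2 = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) ** 2
    compact = np.sum((U ** 2) * d2)
    punish = np.sum(np.linalg.norm(V - X.mean(axis=0), axis=1) ** 2) / len(V)
    sep = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2
    np.fill_diagonal(sep, np.inf)
    return (compact + punish) / sep.min()
```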
2.5 Cluster Validity Index: Bcrit Index

The Bcrit index was proposed in [6]. It is also composed of compactness and separation measures in order to obtain the optimal number of clusters, and the two measures are derived independently. First, the separation between clusters is denoted by G(c),

G(c) = \frac{\max_{i,j} \delta(V_i, V_j)}{\min_{i \neq j} \delta(V_i, V_j)},   (5)

where \delta(V_i, V_j) is a distance measure between the geometric centers of clusters i and j, with the definition

\delta(V_i, V_j) = \left[ (V_i - V_j)^{T} A (V_i - V_j) \right]^{1/2},   (6)

and A denotes a positive definite matrix of dimension d \times d (or 2 \times 2 here). For simplicity, the identity matrix I is commonly used in place of A in Eq. (6) when evaluating the distance measure.

Next, the compactness is represented by V_{wt}(c), the ratio between the variances of the data points within every cluster and the variance of the whole data set,

V_{wt}(c) = \frac{1}{c} \cdot \frac{\sum_{q=1}^{d} \sum_{k=1}^{c} \mathrm{var}_q(k)}{\sum_{q=1}^{d} \mathrm{var}_{total}(q)},   (7)

where \mathrm{var}_q(k) denotes the variance of cluster k along dimension q and \mathrm{var}_{total}(q) denotes the variance of the whole data set along dimension q. From experimental results, the value of G(c) is much larger than that of V_{wt}(c), with the ranges G(c) \in [0, 20] and V_{wt}(c) \in [0, 0.8]; thus a weighting factor \alpha is needed to balance the effects of the two factors, and we obtain

B_{crit}(c) = G(c) + \alpha \cdot V_{wt}(c),   (8)

where \alpha = \max G(c) / \max V_{wt}(c) denotes the weighting factor. From the derivations above, the smaller the Bcrit index value, the better the clustering performance.
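The two Bcrit factors can be sketched as follows (our code, taking A = I so that δ is the Euclidean distance, and assuming the additive combination of Eq. (8); `labels` assigns each point crisply to its closest cluster):

```python
import numpy as np

def separation_G(V):
    """G(c), Eq. (5): ratio of the largest to the smallest
    inter-center distance, with delta Euclidean (A = I in Eq. (6))."""
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    off = d[~np.eye(len(V), dtype=bool)]
    return off.max() / off.min()

def compactness_Vwt(X, labels, c):
    """Vwt(c), Eq. (7): within-cluster variances summed over the d
    dimensions, relative to the total variance, averaged over c."""
    within = sum(X[labels == k].var(axis=0).sum() for k in range(c))
    return within / (c * X.var(axis=0).sum())

def bcrit_curve(G, Vwt):
    """Bcrit(c), Eq. (8), over dicts keyed by candidate c;
    alpha = max G(c) / max Vwt(c) balances the two ranges."""
    alpha = max(G.values()) / max(Vwt.values())
    return {c: G[c] + alpha * Vwt[c] for c in G}
```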
2.6 Cluster Validity Index: SV Index

The SV index was proposed in [7]. It also adopts the concepts of compactness and separation. Unlike the Bcrit index in Sec. 2.5, both factors are normalized to values between 0 and 1 to balance their effects. To measure the compactness, the mean within-cluster distance of the c clusters in the data set is calculated,

V_u(c, V; X) = \frac{1}{c} \sum_{i=1}^{c} \left( \frac{1}{n_i} \sum_{x \in X_i} \| V_i - x \| \right),   (9)

where n_i denotes the number of data points within cluster i, V_i is the geometric center of cluster i, and the c mean distances are averaged. The separation measure is simply denoted by V_o(c, V) = c / d_{min}, where d_{min} denotes the minimum distance between any two cluster centers.

Next, V_u and V_o are normalized over the candidate numbers of clusters by

V_{uN}(c, V; X) = \frac{V_u(c, V; X) - \min V_u}{\max V_u - \min V_u},   (10)

V_{oN}(c, V) = \frac{V_o(c, V) - \min V_o}{\max V_o - \min V_o}.   (11)

Finally, the SV index is defined by

V_{SV}(c, V; X) = V_{uN}(c, V; X) + V_{oN}(c, V).   (12)

To assess the effectiveness of a clustering algorithm, the smaller the SV index value, the better the performance.
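A sketch of the SV computation across candidate cluster numbers (our code; the min-max normalization of Eqs. (10)-(11) is taken over the candidate values of c, and `labels` is a crisp assignment):

```python
import numpy as np

def sv_terms(X, V, labels):
    """Vu of Eq. (9) and the over-partition measure Vo = c / d_min."""
    c = len(V)
    vu = np.mean([np.linalg.norm(X[labels == i] - V[i], axis=1).mean()
                  for i in range(c)])
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return vu, c / d.min()

def sv_curve(vu_list, vo_list):
    """Eqs. (10)-(12): normalize each term over the candidate c
    values and sum; the c with the smallest SV is selected."""
    norm = lambda v: (np.asarray(v) - min(v)) / (max(v) - min(v))
    return norm(vu_list) + norm(vo_list)
```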

2.7 Preliminary Results with Existing Indices

To evaluate the effectiveness of the existing indices, we generate a two-dimensional, 2000-point, 9-cluster testing database called 'My_sample', illustrated in Fig. 1. All six indices are examined, and the results are listed in Table 1.

Figure 1: The Two-Dimensional, 2000-Point, 9-Cluster Database My_sample.

Table 1
The Index Values for Clustering from 2 to 10 Clusters in Six Different Schemes for the My_sample Database. The Shaded Blocks Represent the Correct Clustering Results

index   k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC      0.722  0.673  0.625  0.640  0.674  0.720  0.757  0.797  0.770
PE      0.631  0.852  1.051  1.071  1.030  0.936  0.855  0.754  0.834
XB      0.277  0.110  0.188  0.224  0.095  0.100  0.068  0.042  0.628
K       555    221    377    449    191    202    139    87     1300
Bcrit   17.87  10.85  9.86   10.77  8.14   8.79   8.59   8.39   17.42
SV      1.000  0.691  0.619  0.485  0.334  0.312  0.261  0.220  1.000

With this database, we expect the column with k = 9 to perform the best, i.e., the largest PC value and the smallest values of the other five indices should be obtained there. As we can see, not all of the indices indicate that the correct clustering result occurs at k = 9. Moreover, the criterion for PC is to search for its maximum value, while for the rest of the indices the criterion is to find their minimum values. Based on these two findings, optimization techniques can be incorporated into the clustering algorithm to search for better and more correct results.
3. GENETIC-BASED CLUSTER VALIDITY INDEX

As we can see from Sec. 2.1 to 2.6, every index has its own specific concept for data clustering, and the results in Sec. 2.7 show a diversity of performances. Therefore, we employ the genetic algorithm (GA) to find an optimized result based on the concepts of the indices above. GA consists of three major steps: selection, crossover, and mutation. Based on the fitness function, we integrate our clustering scheme with the GA procedures.

3.1 Preprocessing in GA

We need chromosomes to perform the three steps in GA. We employ five popularly used databases, namely auto-mpg [8], bupa [9], cmc [10], iris [11], and wine [12], listed in Table 2, for GA optimization. Half of the data set in each database is used for training, and the other half is used for testing.

Table 2
The Five Databases Used in This Paper

Training database   # of data points   Testing database   # of data points
auto-mpg_train      196                auto-mpg_test      196
bupa_train          173                bupa_test          172
cmc_train           737                cmc_test           736
iris_train          75                 iris_test          75
wine_train          89                 wine_test          89

3.2 Deciding the Fitness Function

After considering practical implementations in GA, and based on the indices described in Sec. 2.1 to 2.6, we propose the genetic-based index for data clustering. The fitness function is denoted by

V_{gene}(c, V; X) = \alpha \cdot \frac{1}{c} \sum_{k=1}^{c} \frac{INTRA(k)}{MSD_t} + \beta \cdot \frac{\max_{i,j} d(V_i, V_j)}{\min_{i \neq j} d(V_i, V_j)}.   (13)

The first term denotes the compactness, with

INTRA(k) = \frac{1}{n_k} \sum_{x_j \in X_k} \| V_k - x_j \|, and   (14)

MSD_t = \frac{1}{n_t} \sum_{j=1}^{n_t} \| V_t - x_j \|,   (15)

where V_k is the geometric center of cluster k, n_k is the number of data points in cluster k, V_t is the geometric center of the whole data set, and n_t is the total number of data points. In the second term, d(V_i, V_j) is the same as that defined in Eq. (6). Also, \alpha and \beta are the weighting factors, which act as the output after GA training.

The goal of the optimization is to minimize the fitness function. Under the best condition, the fitness value reaches 0.
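Under our reading of Eqs. (13)-(15), the fitness function can be sketched as follows (our code; V holds the cluster centers V_k, the global center V_t is the data mean, and `labels` is a crisp assignment):

```python
import numpy as np

def gene_index(X, V, labels, alpha, beta):
    """Gene Index fitness, Eq. (13): alpha-weighted compactness
    plus beta-weighted center-distance ratio; smaller is better."""
    c = len(V)
    Vt = X.mean(axis=0)                                   # global center V_t
    msd_t = np.linalg.norm(X - Vt, axis=1).mean()         # MSD_t, Eq. (15)
    intra = np.mean([np.linalg.norm(X[labels == k] - V[k], axis=1).mean()
                     for k in range(c)])                  # mean INTRA(k), Eq. (14)
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    off = d[~np.eye(c, dtype=bool)]
    return alpha * intra / msd_t + beta * off.max() / off.min()
```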

3.3 Procedures in GA Training

The GA procedures for the optimized cluster validity index are described as follows; a code sketch of the whole training loop is given after Step 5.

Step 1: Producing the chromosomes. 40 chromosomes are produced, and each chromosome denotes the weighting factors in the fitness function, i.e., (\alpha_i, \beta_i), 1 \le i \le 40. Because the fitness function is composed of two opposing conditions, only the ratio between the two weights matters, so we set 0 \le \alpha_i, \beta_i \le 1. Fitness values are calculated from the training databases in Table 2. At the beginning of the first iteration, the chromosome values are randomly set; during training, they are modified based on the output of the previous iteration.
Step 2: Selecting the better chromosomes. All 40 chromosomes are evaluated with the fitness function, and the corresponding fitness scores are calculated. The 20 chromosomes with the smaller fitness values are kept for use in the next iteration, and the other 20 are discarded. The 20 new chromosomes of the next iteration are produced by crossover and mutation from the 20 chromosomes that remain.

Step 3: Crossover of chromosomes. From the 20 remaining chromosomes, we randomly choose 10 of them and gather them into 5 pairs to perform the crossover operation. By swapping the \alpha or \beta values of every pair, 10 new chromosomes are produced.

Step 4: Mutation of chromosomes. The 10 chromosomes not chosen in Step 3 are used in this step. The \alpha values of the first five chromosomes are replaced by randomly set new \alpha values; a similar operation is performed on the \beta values of the other five.

Step 5: The stopping condition. Once the pre-determined number of iterations is reached, or when the fitness value equals 0, the training is stopped, and the weighting factors corresponding to the smallest fitness score in the final iteration, (\alpha, \beta), are the output.
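Steps 1-5 map onto a compact training loop; a minimal sketch (ours, not the authors' code) follows, where `fitness(alpha, beta)` is assumed to average the gene index of Eq. (13) over the training databases:

```python
import numpy as np

def train_weights(fitness, pop_size=40, iters=1000, seed=0):
    """GA training of the weighting factors (alpha, beta), Steps 1-5."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, 2))                 # Step 1: (alpha_i, beta_i) in [0, 1]
    for _ in range(iters):
        scores = np.array([fitness(a, b) for a, b in pop])
        if scores.min() == 0.0:                     # Step 5: fitness reached 0
            break
        best = pop[np.argsort(scores)[:pop_size // 2]]  # Step 2: keep the better 20
        rng.shuffle(best)
        pairs, rest = best[:10].copy(), best[10:].copy()
        for i in range(0, 10, 2):                   # Step 3: swap alpha within 5 pairs
            pairs[i, 0], pairs[i + 1, 0] = pairs[i + 1, 0], pairs[i, 0]
        rest[:5, 0] = rng.random(5)                 # Step 4: new random alpha for five,
        rest[5:, 1] = rng.random(5)                 #         new random beta for five
        pop = np.vstack([best, pairs, rest])        # 20 kept + 20 new = 40
    scores = np.array([fitness(a, b) for a, b in pop])
    return tuple(pop[np.argmin(scores)])            # output (alpha, beta)
```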
4. SIMULATION RESULTS

After training the GA optimization of Sec. 3.3 for 1000 iterations, we obtain the optimized weighting factors (\alpha, \beta) = (0.8561, 0.0826). With these two values, we can compare the GA-optimized result with the indices of Sec. 2.1 to 2.6 by verifying the five test databases in Table 2. We list the detailed results for the auto-mpg database in Table 3, the bupa database in Table 4, the cmc database in Table 5, the iris database in Table 6, and the wine database in Table 7, respectively. The numerical values in Table 3 depict the results for the auto-mpg database, which has three clusters. We can see that only the proposed GA-based index gives the correct result. For the bupa, cmc, iris, and wine databases, similar results are obtained, and the detailed comparisons can be found in Table 4 to Table 7, respectively. In addition, from Table 8, we see that the proposed GI yields the correct cluster number for four of the five test databases. Compared with the other six indices, each of which yields only one correct cluster number, our scheme achieves better performance. Regarding the cmc database, none of the seven indices finds the correct cluster number.
Table 3
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the Auto-mpg Database

index   k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC      0.866  0.801  0.787  0.764  0.726  0.716  0.713  0.704  0.700
PE      0.330  0.522  0.596  0.679  0.804  0.847  0.869  0.915  0.944
XB      0.056  0.073  0.083  0.067  0.145  0.121  0.121  0.104  0.123
K       11.31  15.30  18.21  15.74  35.53  33.51  36.00  32.70  41.44
Bcrit   13.57  8.20   6.94   6.47   9.21   9.89   11.11  11.21  13.24
SV      1.000  0.633  0.466  0.415  0.548  0.592  0.705  0.771  1.000
GI      0.523  0.487  0.521  0.536  0.780  0.854  0.960  0.974  1.148

Table 4
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the Bupa Database

index   k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC      0.882  0.664  0.562  0.476  0.411  0.383  0.346  0.328  0.295
PE      0.304  0.809  1.136  1.435  1.676  1.826  2.011  2.131  2.309
XB      0.065  0.511  0.587  0.623  1.480  1.307  1.073  1.395  1.407
K       11.64  94.16  110.2  118.5  284.2  256.1  212.8  271.9  282.5
Bcrit   59.03  46.69  45.75  47.62  67.55  83.79  49.41  56.06  63.48
SV      1.000  0.718  0.617  0.555  0.702  0.786  0.699  0.882  1.000
GI      1.088  1.225  1.286  1.328  1.754  1.787  1.690  1.903  1.981

Table 5
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the cmc Database

index   k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC      0.809  0.704  0.597  0.528  0.474  0.423  0.378  0.342  0.321
PE      0.459  0.773  1.089  1.323  1.523  1.723  1.905  2.066  2.189
XB      0.096  0.125  0.197  0.222  0.231  0.296  0.388  0.604  0.539
K       70.86  92.96  146.9  165.7  173.6  223.1  293.0  458.3  410.4
Bcrit   18.57  13.26  11.96  13.35  13.37  17.26  16.77  19.17  22.55
SV      1.000  0.580  0.452  0.428  0.440  0.514  0.664  0.935  1.000
GI      0.617  0.595  0.660  0.721  0.771  0.866  0.990  1.214  1.206

Table 6
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the iris Database

index   k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC      0.888  0.790  0.738  0.678  0.610  0.584  0.562  0.538  0.535
PE      0.290  0.559  0.736  0.933  1.108  1.216  1.337  1.435  1.486
XB      0.058  0.115  0.160  0.265  0.316  0.549  0.239  0.227  0.289
K       4.622  9.920  14.72  25.25  33.21  61.43  26.73  28.05  36.45
Bcrit   18.46  12.13  10.40  10.77  17.03  21.21  16.08  16.80  16.53
SV      1.000  0.724  0.598  0.628  0.695  0.907  0.700  0.832  1.000
GI      0.442  0.510  0.602  0.755  0.887  1.147  0.881  0.953  1.091

Table 7
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the Wine Database

index   k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC      0.868  0.783  0.772  0.746  0.751  0.784  0.786  0.760  0.738
PE      0.328  0.572  0.636  0.720  0.738  0.663  0.677  0.764  0.830
XB      0.067  0.141  0.101  0.081  0.123  0.071  0.097  0.209  0.261
K       6.264  13.81  11.28  11.00  18.96  14.83  22.75  50.47  67.97
Bcrit   22.85  14.17  11.64  9.169  11.01  10.06  12.82  19.18  22.89
SV      1.000  0.672  0.569  0.413  0.406  0.357  0.461  0.772  1.000
GI      0.570  0.566  0.605  0.641  0.841  0.828  1.061  1.594  1.896

Table 8
Comparisons of the Seven Indices for the Five Test Databases. Our Scheme Performs the Best

Database   Original clusters   PC   PE   XB   K   Bcrit   SV   GI
auto-mpg   3                   2    2    2    2   5       5    3
bupa       2                   2    2    2    2   5       5    2
cmc        3                   2    2    2    2   4       4    2
iris       3                   2    2    2    2   4       5    3
wine       3                   2    2    2    2   5       7    3

5. CONCLUSION

In this paper, we discussed data clustering schemes and proposed a new cluster validity index based on GA. The GI index outperforms the six existing indices in the literature. However, the clustering results for some databases are still incorrect even after GA training, and this motivates our future research.

ACKNOWLEDGMENTS

This work is partially supported by the National Science Council (Taiwan) under grant NSC95-2218-E-005-034.

REFERENCES

[1] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Boston, MA: Kluwer, 1989.
[2] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: Fuzzy C-Means Algorithm", Computers and Geosciences, Vol. 10, No. 2-3, 1984, pp. 16-20.
[3] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York, NY: Plenum, 1981.
[4] X. L. Xie and G. Beni, "A Validity Measure for Fuzzy Clustering", IEEE Trans. Pattern Anal. Machine Intell., Vol. 13, No. 8, 1991, pp. 841-846.
[5] S. H. Kwon, "Cluster Validity Index for Fuzzy Clustering", Electronics Letters, Vol. 34, No. 22, 1998, pp. 2176-2177.
[6] A. O. Boudraa, "Dynamic Estimation of Number of Clusters in Data Sets", Electronics Letters, Vol. 35, No. 19, 1999, pp. 1606-1608.
[7] D. J. Kim, Y. W. Park, and D. J. Park, "A Novel Validity Index for Determination of the Optimal Number of Clusters", IEICE Trans. Inf. & Syst., Vol. E84-D, No. 2, 2001, pp. 281-285.
[8] R. Quinlan, "Auto-mpg data", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/auto-mpg/, 1993.
[9] BUPA Medical Research Ltd., "BUPA Liver Disorders", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders/, 1990.
[10] T. S. Lim, "Contraceptive method choice", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/cmc/, 1999.
[11] R. A. Fisher, "Iris plants database", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/iris/, 1988.
[12] S. Aeberhard, "Wine recognition data", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine/, 1998.
