An Optimized Approach On Applying Genetic Algorithm To Adaptive Cluster Validity Index
IJCSES International Journal of Computer Sciences and Engineering Systems, Vol. 1, No. 4, October 2007
ISSN 0973-4406 © 2007 CSES International
Abstract: Partitioning, or clustering, is an important research branch in the data mining area: it divides a dataset into an arbitrary number of clusters based on the correlated attributes of the elements of the dataset. Most datasets have an original number of clusters, which can be estimated with a cluster validity index, but most existing methods estimate this number incorrectly for many real datasets. To solve this problem, this paper applies the optimization technique of the genetic algorithm (GA) to a new adaptive cluster validity index, called the Gene Index (GI). The algorithm applies GA to adjust the weighting factors of the adaptive cluster validity index and thereby train an optimal index. It is tested with many real datasets, and the results show that, compared with current cluster validity index methods, the proposed algorithm gives higher performance and accurately estimates the original cluster number of real datasets.
Keywords: Clustering, Genetic Algorithm, Cluster Validity Index, Optimization, Data Mining
$$\sum_{i=1}^{c} u_{ik} = 1, \quad \forall k.$$

To assess the effectiveness of the clustering algorithm, the larger the PC index value, the better the performance.

2.2 Cluster Validity Index: PE Index

The PE (partition entropy) index was also proposed in [3], with the definition in Eq. (2):

$$V_{PE}(U) = -\frac{1}{n} \sum_{k=1}^{n} \sum_{i=1}^{c} \left[ u_{ik} \cdot \log(u_{ik}) \right] \qquad (2)$$

To assess the effectiveness of the clustering algorithm, the smaller the PE index value, the better the performance.

2.3 Cluster Validity Index: XB Index

The XB index was proposed by Xie and Beni in [4], built on the two important concepts of compactness and separation. For a good clustering result, the data points within the same cluster should be as compact as possible, while any two different clusters should be as far apart as possible. This can be formulated by Eq. (3):

$$V_{XB}(U, V; X) = \frac{\sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^{2} \left\| x_{j} - \nu_{i} \right\|^{2}}{n \cdot \min_{i \neq k} \left\| \nu_{i} - \nu_{k} \right\|^{2}} \qquad (3)$$

where $x_{j}$ is the $j$-th $d$-dimensional measured data point (we use $d = 2$ here) and $\nu_{i}$ is the $d$-dimensional center of cluster $i$. In Eq. (3), the numerator implies the compactness and the denominator denotes the separation. Therefore, to assess the effectiveness of the clustering algorithm, the smaller the XB index value, the better the performance.
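As a concrete illustration of the definitions above, the following Python sketch (our own illustration, not code from the paper) computes the PC, PE, and XB indices from a fuzzy membership matrix U of shape (c, n), cluster centers V of shape (c, d), and data X of shape (n, d); the PC formula assumes the usual definition $V_{PC}(U) = \frac{1}{n}\sum_{k}\sum_{i} u_{ik}^{2}$ from Sec. 2.1.

import numpy as np

def pc_index(U):
    """Partition coefficient (Sec. 2.1): larger is better."""
    return np.sum(U ** 2) / U.shape[1]

def pe_index(U):
    """Partition entropy, Eq. (2): smaller is better."""
    return -np.sum(U * np.log(U + 1e-12)) / U.shape[1]   # epsilon guards log(0)

def xb_index(U, V, X):
    """Xie-Beni index, Eq. (3): smaller is better.
    U: (c, n) memberships, V: (c, d) centers, X: (n, d) data."""
    sq_dist = np.sum((X[None, :, :] - V[:, None, :]) ** 2, axis=2)  # (c, n)
    compactness = np.sum((U ** 2) * sq_dist)
    center_sq = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(center_sq, np.inf)                             # exclude i == k
    return compactness / (X.shape[0] * center_sq.min())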
2.4 Cluster Validity Index: K Index

The K index was proposed by Kwon [5] as an improvement of the XB index. From Eq. (3), we find that when $c \to n$, $V_{XB} \to 0$, which is generally incorrect for practical applications. By modifying Eq. (3), we obtain Eq. (4):

$$V_{K}(U, V; X) = \frac{\sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^{2} \left\| x_{j} - \nu_{i} \right\|^{2} + \frac{1}{c} \sum_{i=1}^{c} \left\| \nu_{i} - \bar{\nu} \right\|^{2}}{\min_{i \neq k} \left\| \nu_{i} - \nu_{k} \right\|^{2}} \qquad (4)$$

where $\bar{\nu}$ denotes the geometric center of the data points. To assess the effectiveness of the clustering algorithm, the smaller the K index value, the better the performance.

2.5 Cluster Validity Index: B_crit Index

The B_crit index was proposed in [6]. It is also composed of compactness and separation parameters in order to obtain the optimal number of clusters, and the two measures are derived independently. First, the separation between clusters is denoted by G(c) in Eq. (5), where $\nu_{i}$ and $\nu_{j}$ denote the centers of clusters $i$ and $j$, with the definition of the distance measure given in Eq. (6), in which $A$ denotes a positive definite matrix with dimension $d \times d$ (or $2 \times 2$ here). For simplicity, the identity matrix $I$ is used to replace the matrix $A$ in Eq. (6) in the distance measure.

Next, the compactness is represented by the ratio between the variance of the data points within each cluster and the variance of the whole data set, denoted by $V_{wt}(c)$:

$$V_{wt}(c) = \frac{1}{c} \sum_{k=1}^{c} \frac{\sum_{q=1}^{d} \mathrm{var}_{q}(k)}{\sum_{q=1}^{d} \mathrm{var}_{total}(q)} \qquad (7)$$

where $\mathrm{var}_{q}(k)$ denotes the variance of the current cluster $k$ along dimension $q$ and $\mathrm{var}_{total}(q)$ denotes the variance of the whole data set along dimension $q$. From experimental results, the value of G(c) is much larger than that of $V_{wt}(c)$, with the ranges $G(c) \in [0, 20]$ and $V_{wt}(c) \in [0, 0.8]$; thus we need to include a weighting factor $\alpha$ to balance the effects of both factors, and we obtain

$$B_{crit}(c) = G(c) + \alpha \cdot V_{wt}(c) \qquad (8)$$

where $\alpha = \max G(c) / \max V_{wt}(c)$ denotes the weighting factor. From the derivations above, the smaller the B_crit index, the better the clustering performance.

2.6 Cluster Validity Index: SV Index

The SV index was proposed in [7]. It also adopts the concepts of compactness and separation, but unlike the B_crit index in Sec. 2.5, both factors are normalized to values between 0 and 1 to balance their effects. In measuring the compactness, the mean within-cluster distance over the $c$ clusters in the data set is calculated,

$$V_{u}(c, V; X) = \frac{1}{c} \sum_{i=1}^{c} \left( \frac{1}{n_{i}} \sum_{x \in X_{i}} \left\| V_{i} - x \right\| \right) \qquad (9)$$

where $n_{i}$ denotes the number of data points within cluster $i$, $V_{i}$ is the geometric center of cluster $i$, and the $c$ mean distances are averaged. The separation measure is simply denoted by $V_{o} = d_{min} / c$, where $d_{min}$ denotes the minimum distance between any two clusters.

Next, the compactness and separation measures are normalized by

$$V_{uN}(c, V; X) = \frac{V_{u}(c, V; X) - \min \left( V_{u}(c, V; X) \right)}{\max \left( V_{u}(c, V; X) \right) - \min \left( V_{u}(c, V; X) \right)}, \qquad (10)$$

$$V_{oN}(c, V) = \frac{V_{o}(c, V) - \min \left( V_{o}(c, V) \right)}{\max \left( V_{o}(c, V) \right) - \min \left( V_{o}(c, V) \right)}. \qquad (11)$$
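The K index and the SV building blocks can be sketched in the same style (again our own illustration with hypothetical names, not the paper's code; `labels` holds a hard assignment of points to clusters, and the min-max normalization of Eqs. (10)-(11) is applied across the candidate cluster numbers c):

import numpy as np

def k_index(U, V, X):
    """Kwon's K index, Eq. (4): smaller is better."""
    sq_dist = np.sum((X[None, :, :] - V[:, None, :]) ** 2, axis=2)   # (c, n)
    compactness = np.sum((U ** 2) * sq_dist)
    penalty = np.sum((V - X.mean(axis=0)) ** 2) / V.shape[0]         # (1/c) sum ||v_i - v_bar||^2
    center_sq = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(center_sq, np.inf)
    return (compactness + penalty) / center_sq.min()

def sv_compactness(V, X, labels):
    """Mean within-cluster distance V_u of Eq. (9)."""
    return np.mean([np.linalg.norm(X[labels == i] - V[i], axis=1).mean()
                    for i in range(V.shape[0])])

def sv_separation(V):
    """V_o = d_min / c: minimum center-to-center distance over the cluster count."""
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.min() / V.shape[0]

def min_max_normalize(values):
    """The min-max normalization of Eqs. (10)-(11), applied across candidate c."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())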
2.7 Preliminary Results with Existing Indices

To evaluate the effectiveness of the existing indices, we generate a two-dimensional, 2000-point, 9-cluster testing database called 'My_sample', illustrated in Fig. 1. All six indices are examined, and the results are given in Table 1.

Table 1
Index Values from 2 to 10 Clusters for the Six Existing Indices on the 'My_sample' Database

index    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC       0.722  0.673  0.625  0.640  0.674  0.720  0.757  0.797  0.770
PE       0.631  0.852  1.051  1.071  1.030  0.936  0.855  0.754  0.834
XB       0.277  0.110  0.188  0.224  0.095  0.100  0.068  0.042  0.628
K        555    221    377    449    191    202    139    87     1300
B_crit   17.87  10.85  9.86   10.77  8.14   8.79   8.59   8.39   17.42
SV       1.000  0.691  0.619  0.485  0.334  0.312  0.261  0.220  1.000

With this database, we expect the column with k = 9 to perform the best, i.e., the largest PC value and the smallest values of the other five indices should be obtained there. As Table 1 shows, not all of the indices indicate that the correct clustering result is at k = 9. Moreover, the criterion for PC is to search for its maximum value, while for the remaining indices the criterion is to find the minimum value. Based on these two findings, optimization techniques can be incorporated into the clustering algorithm to search for better and more correct results.
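The paper does not give the generation procedure for 'My_sample'; a comparable two-dimensional, roughly 2000-point, 9-cluster testing set could be generated along the following lines (the Gaussian-blob layout is our assumption):

import numpy as np

rng = np.random.default_rng(0)
centers = [(i, j) for i in (0, 5, 10) for j in (0, 5, 10)]   # 9 well-separated centers
my_sample = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(2000 // 9, 2))        # ~222 points per cluster
    for c in centers
])                                                           # 1998 two-dimensional points
# Each index is then evaluated for k = 2, ..., 10 (e.g., after running fuzzy
# c-means at each k), and the optimizing k is compared against the true value 9.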
3. GENETIC-BASED CLUSTER VALIDITY INDEX

As we can see from Secs. 2.1 to 2.6, every index has its own specific concept for data clustering, and the results in Sec. 2.7 show a diversity of performances. Therefore, we employ the genetic algorithm (GA) to find an optimized result based on the concepts of the indices above. GA consists of three major steps: crossover, mutation, and selection. Based on the fitness function, we integrate our clustering scheme with the GA procedures.

3.1 Preprocessing in GA

We need chromosomes to perform the three steps in GA. We employ five popularly used databases, including auto-mpg [8], bupa [9], cmc [10], iris [11], and wine [12], listed in Table 2, for the GA optimization. Half of the data set in each database is used for training, and the other half is used for testing.
3.2 Deciding the Fitness Function

After considering practical implementations in GA, and based on the indices described in Secs. 2.1 to 2.6, in this paper we propose the genetic-based index for data clustering. The fitness function is denoted by

$$V_{gene}(c, V; X) = \alpha \cdot \frac{1}{c} \sum_{k=1}^{c} \frac{INTRA(k)}{MSD_{t}} + \beta \cdot \frac{\max_{i,j} d(V_{i}, V_{j})}{\min_{i \neq j} d(V_{i}, V_{j})}. \qquad (13)$$

The first term denotes the compactness, with

$$INTRA(k) = \frac{1}{n_{k}} \sum_{x_{j} \in X_{k}} \left\| V_{k} - x_{j} \right\|, \quad \text{and} \qquad (14)$$

$$MSD_{t} = \frac{1}{n_{t}} \sum_{j=1}^{n_{t}} \left\| V_{t} - x_{j} \right\|, \qquad (15)$$

where $n_{k}$ is the number of data points in cluster $k$, $V_{k}$ is the center of cluster $k$, and $n_{t}$ and $V_{t}$ denote the size and the center of the whole data set, respectively. The second term denotes the separation between the cluster centers.
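A minimal Python sketch of the fitness function, under our reading of Eqs. (13)-(15) (the hard labels and all names are ours, not the paper's):

import numpy as np

def gi_fitness(alpha, beta, V, X, labels):
    """V_gene of Eq. (13) for one clustering; smaller is better."""
    c = V.shape[0]
    msd_t = np.linalg.norm(X - X.mean(axis=0), axis=1).mean()     # MSD_t, Eq. (15)
    intra = np.array([np.linalg.norm(X[labels == k] - V[k], axis=1).mean()
                      for k in range(c)])                         # INTRA(k), Eq. (14)
    compactness = (intra / msd_t).mean()                          # (1/c) sum INTRA(k)/MSD_t
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)     # center distances
    separation = d.max() / d[~np.eye(c, dtype=bool)].min()
    return alpha * compactness + beta * separation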
Table 2
The Five Databases Used in This Paper

Training database   # of data points   Testing database   # of data points
auto-mpg_train      196                auto-mpg_test      196
bupa_train          173                bupa_test          172
cmc_train           737                cmc_test           736
iris_train          75                 iris_test          75
wine_train          89                 wine_test          89

3.3 Training with GA

Step 1: Calculating the fitness values: Fitness values are calculated from the training databases in Table 2. At the beginning of the first iteration, the chromosome values are randomly set. During training, the chromosome values are modified based on the output of the previous iteration.

Step 2: Selecting the better chromosomes: All 40 sets of chromosomes are fed into the fitness function, and the corresponding fitness scores are calculated. The 20 chromosomes with the smaller fitness values are kept for use in the next iteration, and the other 20 are discarded. The 20 new chromosomes of the next iteration are produced by crossover and mutation from the 20 chromosomes that remain.

Step 3: Crossover of chromosomes: From the 20 remaining chromosomes, we randomly choose 10 of them and gather them into 5 pairs to perform the crossover operation. By swapping the α or β values of every pair, 10 new chromosomes are produced.

Step 4: Mutation of chromosomes: The 10 chromosomes not chosen in Step 3 are used in this step. The α values of the first five chromosomes are replaced by randomly set new α values; the same operation is performed on the β values of the other five.

Step 5: The stopping condition: Once the pre-determined number of iterations is reached, or when the fitness value equals 0, the training is stopped, and the weighting factors (α, β) corresponding to the smallest fitness score in the final iteration are the output. A sketch of this training loop is given below.
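The following Python sketch shows one reading of Steps 1-5; the surrogate `total_fitness` is a hypothetical stand-in, whereas the real fitness would aggregate V_gene (Eq. (13), previous sketch) over clusterings of the five training databases:

import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(0.0, 1.0, size=(40, 2))          # Step 1: random (alpha, beta)

def total_fitness(chrom):
    # Hypothetical surrogate; in the real system this would sum gi_fitness
    # over the clusterings of the five training databases in Table 2.
    alpha, beta = chrom
    return (alpha - 0.8561) ** 2 + (beta - 0.0826) ** 2

for _ in range(1000):                                     # Step 5: iteration budget
    scores = np.array([total_fitness(ch) for ch in population])
    order = np.argsort(scores)
    keep = population[order[:20]]                         # Step 2: keep the better 20
    if scores[order[0]] == 0.0:                           # Step 5: early stop at zero fitness
        break
    idx = rng.permutation(20)
    chosen, rest = keep[idx[:10]], keep[idx[10:]].copy()
    a, b = chosen[0::2], chosen[1::2]                     # Step 3: 5 pairs swap their
    cross = np.vstack([np.column_stack([b[:, 0], a[:, 1]]),
                       np.column_stack([a[:, 0], b[:, 1]])])  # alpha values
    rest[:5, 0] = rng.uniform(0.0, 1.0, 5)                # Step 4: fresh random alphas
    rest[5:, 1] = rng.uniform(0.0, 1.0, 5)                # ... and fresh random betas
    population = np.vstack([keep, cross, rest])           # 40 chromosomes again

alpha, beta = keep[0]                                     # smallest-fitness chromosome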
4. SIMULATION RESULTS

After training for 1000 iterations with the GA optimization in Sec. 3.3, we obtain the optimized weighting factors (α, β) = (0.8561, 0.0826). With these two values, we can compare the GA-optimized result with those of the indices in Secs. 2.1 to 2.6 on the five test databases in Table 2. We depict the detailed results for the auto-mpg database in Table 3, the bupa database in Table 4, the cmc database in Table 5, the iris database in Table 6, and the wine database in Table 7, respectively.

The numerical values in Table 3 depict the results for the auto-mpg database, which has three clusters. We can see that only the proposed GA-based index gives the correct result. For the bupa, cmc, iris, and wine databases, similar results are obtained, and detailed comparisons can be found in Table 4 to Table 7, respectively. In addition, from Table 8 we see that the proposed GI finds the correct cluster number in four of the five test databases. Compared with the other six indices, each of which finds only one correct cluster number, our scheme achieves better performance. Regarding the cmc database, none of the seven indices finds the correct cluster number.

Table 3
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the Auto-mpg Database

index    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC       0.866  0.801  0.787  0.764  0.726  0.716  0.713  0.704  0.700
PE       0.330  0.522  0.596  0.679  0.804  0.847  0.869  0.915  0.944
XB       0.056  0.073  0.083  0.067  0.145  0.121  0.121  0.104  0.123
K        11.31  15.30  18.21  15.74  35.53  33.51  36.00  32.70  41.44
B_crit   13.57  8.20   6.94   6.47   9.21   9.89   11.11  11.21  13.24
SV       1.000  0.633  0.466  0.415  0.548  0.592  0.705  0.771  1.000
GI       0.523  0.487  0.521  0.536  0.780  0.854  0.960  0.974  1.148

Table 4
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the Bupa Database

index    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC       0.882  0.664  0.562  0.476  0.411  0.383  0.346  0.328  0.295
PE       0.304  0.809  1.136  1.435  1.676  1.826  2.011  2.131  2.309
XB       0.065  0.511  0.587  0.623  1.480  1.307  1.073  1.395  1.407
K        11.64  94.16  110.2  118.5  284.2  256.1  212.8  271.9  282.5
B_crit   59.03  46.69  45.75  47.62  67.55  83.79  49.41  56.06  63.48
SV       1.000  0.718  0.617  0.555  0.702  0.786  0.699  0.882  1.000
GI       1.088  1.225  1.286  1.328  1.754  1.787  1.690  1.903  1.981

5. CONCLUSION

In this paper, we discussed data clustering schemes and proposed a new cluster validity index based on GA. The GI index outperforms the six existing indices in the literature. However, the clustering results for some databases are not correct even after GA training, and this is the motivation for our future research.

ACKNOWLEDGMENTS

This work is partially supported by the National Science Council (Taiwan) under grant NSC95-2218-E-005-034.
REFERENCES

[1] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Reading, MA: Addison-Wesley, 1989.
[2] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: Fuzzy C-Means Algorithm", Computers and Geosciences, Vol. 10, No. 2-3, 1984, pp. 16-20.
[3] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York, NY: Plenum, 1981.
[4] X. L. Xie and G. Beni, "A Validity Measure for Fuzzy Clustering", IEEE Trans. Patt. Anal. Machine Intell., Vol. 13, No. 8, 1991, pp. 841-846.
[5] S. H. Kwon, "Cluster Validity Index for Fuzzy Clustering", Electronics Letters, Vol. 34, No. 22, 1998, pp. 2176-2177.
[6] A. O. Boudraa, "Dynamic Estimation of Number of Clusters in Data Sets", Electronics Letters, Vol. 35, No. 19, 1999, pp. 1606-1608.
[7] D. J. Kim, Y. W. Park, and D. J. Park, "A Novel Validity Index for Determination of the Optimal Number of Clusters", IEICE Trans. Inf. & Syst., Vol. E84-D, No. 2, 2001, pp. 281-285.
[8] R. Quinlan, "Auto-mpg Data", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/auto-mpg/, 1993.
[9] BUPA Medical Research Ltd., "BUPA Liver Disorders", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders/, 1990.
[10] T. S. Lim, "Contraceptive Method Choice", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/cmc/, 1999.
[11] R. A. Fisher, "Iris Plants Database", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/iris/, 1988.
[12] S. Aeberhard, "Wine Recognition Data", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine/, 1998.

Table 5
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the cmc Database

index    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC       0.809  0.704  0.597  0.528  0.474  0.423  0.378  0.342  0.321
PE       0.459  0.773  1.089  1.323  1.523  1.723  1.905  2.066  2.189
XB       0.096  0.125  0.197  0.222  0.231  0.296  0.388  0.604  0.539
K        70.86  92.96  146.9  165.7  173.6  223.1  293.0  458.3  410.4
B_crit   18.57  13.26  11.96  13.35  13.37  17.26  16.77  19.17  22.55
SV       1.000  0.580  0.452  0.428  0.440  0.514  0.664  0.935  1.000
GI       0.617  0.595  0.660  0.721  0.771  0.866  0.990  1.214  1.206

Table 6
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the iris Database

index    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC       0.888  0.790  0.738  0.678  0.610  0.584  0.562  0.538  0.535
PE       0.290  0.559  0.736  0.933  1.108  1.216  1.337  1.435  1.486
XB       0.058  0.115  0.160  0.265  0.316  0.549  0.239  0.227  0.289
K        4.622  9.920  14.72  25.25  33.21  61.43  26.73  28.05  36.45
B_crit   18.46  12.13  10.40  10.77  17.03  21.21  16.08  16.80  16.53
SV       1.000  0.724  0.598  0.628  0.695  0.907  0.700  0.832  1.000
GI       0.442  0.510  0.602  0.755  0.887  1.147  0.881  0.953  1.091

Table 7
Index Values from 2 to 10 Clusters in Seven Different Schemes. Shaded Blocks Show the Correct Results for the Wine Database

index    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC       0.868  0.783  0.772  0.746  0.751  0.784  0.786  0.760  0.738
PE       0.328  0.572  0.636  0.720  0.738  0.663  0.677  0.764  0.830
XB       0.067  0.141  0.101  0.081  0.123  0.071  0.097  0.209  0.261
K        6.264  13.81  11.28  11.00  18.96  14.83  22.75  50.47  67.97
B_crit   22.85  14.17  11.64  9.169  11.01  10.06  12.82  19.18  22.89
SV       1.000  0.672  0.569  0.413  0.406  0.357  0.461  0.772  1.000
GI       0.570  0.566  0.605  0.641  0.841  0.828  1.061  1.594  1.896
Table 8
Comparisons of the Seven Indices for the Five Test Databases.
Our Scheme Performs the Best