Automatic Clustering Using Genetic Algorithms

Keywords: Clustering; Genetic algorithms; Noising method; Davies–Bouldin index; K-means algorithm

Abstract

In face of the clustering problem, many clustering methods usually require the designer to provide the number of clusters as input. Unfortunately, the designer has no idea, in general, about this information beforehand. In this article, we develop a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK). In the AGCUK algorithm, noising selection and division–absorption mutation are designed to keep a balance between selection pressure and population diversity. In addition, the Davies–Bouldin index is employed to measure the validity of clusters. Experimental results on artificial and real-life data sets are given to illustrate the effectiveness of the AGCUK algorithm in automatically evolving the number of clusters and providing the clustering partition.
1. Introduction
Clustering is a fundamental problem that frequently arises in a great variety of application fields such as pattern recog-
nition, machine learning, statistics, etc. It is a formal study of algorithms and methods for grouping or classifying objects
without category labels. The resulting partition should possess two properties: (1) homogeneity within the clusters, i.e. ob-
jects belonging to the same cluster should be as similar as possible, and (2) heterogeneity between the clusters, i.e. objects
belonging to different clusters should be as different as possible. Many clustering techniques have been proposed [1,2].
Among them, the K-means algorithm is an important one. It is an iterative hill-climbing algorithm whose solution depends on the initial clustering. Although the K-means algorithm has been applied successfully to many practical clustering problems, it may converge to a partition that is significantly inferior to the global optimum [3].
Recently, researchers solved the clustering problem by stochastic optimization methods such as genetic algorithms, tabu
search, simulated annealing, etc. Liu et al. [4] integrated a tabu list into the genetic algorithm based clustering algorithm to
prevent several fitter individuals from occupying the population and to maintain population diversity. In addition, an aspi-
ration criterion is adopted to keep selection pressure. Bandyopadhyay and Maulik [5] designed a genetic clustering approach.
They used the K-means algorithm to provide the domain knowledge and improve the search capability of genetic algorithms.
Laszlo and Mukherjee [6] presented a genetic algorithm for evolving the cluster centers in the K-means algorithm. The set of
the cluster centers is represented using a hyper-quadtree constructed on the data. Liu et al. [7] combined the K-means algo-
rithm and the tabu search approach to accelerate the convergence speed of the tabu search based clustering algorithm. Ng
and Wong [8] proposed a tabu search based fuzzy K-modes algorithm for clustering categorical objects. Bandyopadhyay et al.
[9] integrated the K-means algorithm into the simulated annealing based clustering method to modify the cluster centroids.
By redistributing objects among clusters probabilistically, the presented method obtains better results than the K-means
algorithm. Güngör and Ünler [10] combined the K-harmonic means algorithm and the simulated annealing method to deal
with the clustering problem. The simulated annealing method is used to generate non-local moves for the cluster centers and
to select the best solution. Liu et al. [11] adopted the noising method, a metaheuristic technique reported by Charon and
Hudry [12], to solve the clustering problem. The proposed method has a lower computational cost than Bandyopadhyay et al.'s method [9] but is inferior to the latter in terms of solution quality. By modeling the clustering problem as an optimization
problem, Mahdavi et al. [13] proposed a harmony search based clustering algorithm for grouping the web documents. They
hybridized the K-means algorithm and the harmony search method in two ways and designed two hybrid algorithms. Pach-
eco [14] adopted the scatter search approach to deal with the clustering problem under the criterion of minimum sum-of-
squares clustering. Within the framework of the proposed method, greedy randomized adaptive search procedure (GRASP)
based constructions, the H-means+ algorithm, and tabu search are integrated. Jarboui et al. [15] designed a clustering approach
based on the combinatorial particle swarm optimization (CPSO) algorithm. In the CPSO method, each particle is represented
as a string of length n (where n is the number of objects) and the ith element of the string denotes the group number as-
signed to object i. The CPSO algorithm obtains better results than a genetic algorithm based clustering method in some cases.
Shelokar et al. [16] proposed an ant colony optimization method for grouping N objects into K clusters. The presented method employs distributed agents that mimic the way real ants find the shortest path from their nest to a food source and back.
Fathian et al. [17] presented the application of honeybee mating optimization in clustering (HBMK-means). By experimental
simulations, the HBMK-means method is shown to be better than other heuristic clustering algorithms such as genetic algorithms, simulated annealing, tabu search, and ant colony optimization.
The aforementioned clustering techniques [3–11,13–17] require the designer to provide the number of clusters as input.
Unfortunately, in many real-life cases the number of clusters in a data set is not known a priori. How to automatically find a
proper value of the number of clusters and provide the appropriate clustering under this condition becomes a challenge. In
this paper, our aim is to develop a genetic algorithm based clustering method called automatic genetic clustering for un-
known K (AGCUK) to automatically find the number of clusters and provide the proper clustering partition. We design
two new operators, noising selection and division–absorption mutation, to keep the balance between selection pressure
and population diversity. The Davies–Bouldin index is employed as a measure of the validity of clusters. Experimental results
on artificial and real-life data sets are given to illustrate the superiority of the AGCUK algorithm over four known genetic
clustering methods.
The remainder of this paper is organized as follows. The related work on automatic clustering methods based on genetic algorithms is reviewed in Section 2. In Section 3, we propose the AGCUK algorithm and describe it in detail. In Section 4, the choice of the original noise rate rmax and the terminal noise rate rmin is discussed, a way to estimate selection pressure and population diversity is given, and a performance comparison between AGCUK and four known genetic algorithm based clustering methods is conducted on the experimental data sets. Finally, some conclusions are drawn in Section 5.
2. Related work
In this study, we focus on how to solve the automatic clustering problem using genetic algorithms. In this regard, some
attempts have been made to use genetic algorithms for automatically clustering the data. Bandyopadhyay and Maulik [18]
applied the variable string length genetic algorithm, with real encoding of the coordinates of the cluster centers in the chro-
mosome, to the clustering problem. Experimental results on artificial and real-life data sets show that their algorithm is able
to evolve the number of clusters as well as provide the proper clustering. Tseng and Yang [19] proposed a genetic algorithm
based approach for the clustering problem. Their method consists of two stages, nearest neighbor clustering and genetic
optimization. Equipped with a heuristic strategy, the proposed method can search for a proper number of clusters and clas-
sify nonoverlapping objects into these clusters. Bandyopadhyay and Maulik [20] exploited the searching capability of genetic
algorithms for finding the number of clusters as well as the proper clustering of a given data set. A string representation, comprising both real numbers and the "do not care" symbol, is used to encode a variable number of clusters. The effectiveness
of their technique is demonstrated for both artificial and real-life data sets. Lai [21] adopted the hierarchical genetic algo-
rithm to solve the clustering problem. In the proposed method, the chromosome consists of two types of genes, control genes
and parametric genes. The control genes are coded as binary digits; the total number of "1" genes represents the number of clus-
ters. The parametric genes are coded as real numbers to represent the coordinates of the cluster centers. The relationship
between the control genes and the parametric genes is that the activation of the latter is governed by the value of the former.
If the value of a control gene is "1", then the parametric genes associated with that particular control gene are activated; otherwise they are disabled. Experimental results on artificial and real-life data sets show
Lai’s method can search for the number of clusters and provide the proper clustering. Lin et al. [22] presented a genetic clus-
tering algorithm based on the use of a binary chromosome representation. The proposed method selects the cluster centers
directly from the data set. With the aid of a look-up table, the distances between all pairs of objects are saved in advance and
evaluated only once throughout the evolution process. By experimental simulations, the superiority of their algorithm over
Bandyopadhyay and Maulik’s method [20] is shown. Lai and Chang [23] presented a clustering based approach using a hier-
archical evolutionary algorithm (HEA) for medical image segmentation. By means of a hierarchical structure in the chromo-
some, the proposed approach can automatically classify the image into appropriate classes and avoid the difficulty of
searching for the proper number of classes. Saha and Bandyopadhyay [24] reported a genetic algorithm with line symmetry
distance-based clustering technique (GALSD). The GALSD algorithm assigns objects to different clusters based on the line-
symmetry-based distance and evolves the number of clusters as long as the clusters have some line symmetry property.
In addition, a Kd-tree-based nearest neighbor search is utilized to reduce the computational complexity of computing the
line-symmetry-based distance. Chang et al. [25] proposed a genetic clustering algorithm based on dynamic niching with
niche migration (DNNM-clustering). On the basis of a similarity function relating to the approximate density shape estima-
tion, the DNNM-clustering method performs a dynamic identification of the niches with niche migration at each generation
to automatically evolve the number of clusters.
After reviewing some known genetic algorithm based clustering methods, we find that the question of how to preserve the balance between selection pressure and population diversity is neglected in these methods. In this paper, our aim is to design the AGCUK algorithm so as to automatically provide the number of clusters and find the clustering partition. Besides maintaining population diversity so as to avoid the clustering solution becoming trapped in local minima, the AGCUK algorithm should keep selection pressure so as to accelerate the convergence speed of the clustering algorithm.
3. The AGCUK algorithm

In this section, we first briefly introduce two techniques (i.e., genetic algorithms and the noising method), and then describe
the AGCUK algorithm in detail.
Genetic algorithms are randomized search and optimization techniques guided by the principles of evolution and natural genetics [26]. They have a large amount of implicit parallelism and provide near-optimal solutions of the objective or fitness function in complex, large, and multi-modal landscapes. In genetic algorithms, the parameters of the search space are encoded in the form of strings (or chromosomes). A fitness function representing the goodness of the solution encoded in a chromosome is associated with each string. Biologically inspired operators like selection, crossover, and mutation are applied over a number of generations to generate potentially better solutions. After a satisfactory individual is found or the specified number of generations has elapsed, the best individual obtained during the evolution process is taken as the final result. Genetic algorithms have applications in fields as diverse as image processing, information security, information retrieval, protein sequence design, etc. [27–30].
The noising method, which guides the heuristic search procedure in exploring the solution space, is a metaheuristic technique proposed by Charon and Hudry [12]. Instead of taking the genuine data into account directly, the noising method considers the optimal result as the outcome of a series of fluctuating data converging towards the genuine ones. Like some other metaheuristics, the noising method is based on a descent. The main difference from a pure descent is that, when the objective function value of a given solution is considered, a perturbation called a noise is added to this value. This noise is randomly chosen from an interval whose range decreases during the iteration process. The original value of the noise rate rn should therefore be chosen in such a way that, at the beginning of the iteration process, a bad neighboring solution may be accepted. As added noises are chosen from an interval centered on zero, a good neighboring solution may also be rejected. The final solution is the best solution obtained during the iteration process. The structure of the noising method is shown in Fig. 1, where Ni denotes the current number of iterations, Nt denotes the total number of iterations, Nf denotes the number of iterations at a fixed noise rate, and f(Xc), f(X′), and f(Xb) denote the objective function values of the current solution Xc, the neighboring solution X′, and the best known solution Xb, respectively.

[Fig. 1. The structure of the noising method.]

The noising method has been applied to the traveling salesman problem [31], the task allocation problem [32], and the clique partitioning problem [33].
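For concreteness, the following minimal Python sketch illustrates the noising descent just described, for a minimization problem; the helper names f and neighbor and the linear noise-rate schedule are illustrative assumptions, not the implementation used in the experiments.

import random

def noising_descent(f, neighbor, x0, r_max, r_min, n_total, n_fixed):
    # f: objective function to minimize; neighbor: returns a random
    # neighbor of a solution; r_max, r_min: original and terminal noise
    # rates; n_total: total iterations; n_fixed: iterations per noise rate.
    x_current, x_best = x0, x0
    rate = r_max
    for i in range(n_total):
        x_new = neighbor(x_current)
        # The noise is drawn from an interval centered on zero, so a bad
        # neighbor may be accepted and a good one rejected.
        noise = random.uniform(-rate, rate)
        if f(x_new) - f(x_current) + noise < 0:
            x_current = x_new
        if f(x_current) < f(x_best):
            x_best = x_current
        # Decrease the noise rate after every n_fixed iterations.
        if (i + 1) % n_fixed == 0:
            rate = max(rate - (r_max - r_min) / (n_total / n_fixed - 1), r_min)
    return x_best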
The general description of the AGCUK algorithm is given in Fig. 2. Most of its procedures follow the architecture of genetic algorithms. Its individual steps are discussed in the following.
The dispersion measure S_{i,q} of cluster Ci is defined as

S_{i,q} = \left[ \frac{1}{n_i} \sum_{x \in C_i} \|x - c_i\|_2^q \right]^{1/q}, \quad (1)

where ci denotes the cluster center of cluster Ci. Cluster center ci is computed as
c_i = \frac{1}{n_i} \sum_{x \in C_i} x, \quad (2)
where ni denotes the number of the objects belonging to cluster Ci. In Eq. (1), Si,q denotes the qth root of the qth moment of
the objects belonging to cluster Ci with respect to their mean, and is a measure of the dispersion of the objects belonging to
cluster Ci. Specifically, Si,1, used in this article, denotes the average Euclidean distance of the objects belonging to cluster Ci to
their cluster center ci. The distance between clusters Ci and Cj is defined as

d_{ij,t} = \left( \sum_{p=1}^{m} |c_{ip} - c_{jp}|^t \right)^{1/t}, \quad (3)

where d_{ij,t} denotes the Minkowski distance of order t between the centroids which characterize clusters Ci and Cj, and m denotes the number of features. Next, we
define a term Ri,qt for cluster Ci as
R_{i,qt} = \max_{j, j \neq i} \frac{S_{i,q} + S_{j,q}}{d_{ij,t}}. \quad (4)
[Fig. 2. The general description of the AGCUK algorithm: initialize parameters, establish the initial population for clustering, and repeat fitness evaluation, noising selection, and division–absorption mutation until the termination condition is met.]

The Davies–Bouldin (DB) index is then defined as

DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt}. \quad (5)
A small value of this index indicates a good clustering result. We set the fitness Fi of individual i equal to 1/DBi, where DBi is the DB index computed for individual i. As a result, the minimization problem is converted into a maximization one suitable for genetic algorithms.
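As an illustration, the fitness evaluation based on Eqs. (1)–(5) can be sketched in Python as follows, with q = 1 and t = 2 as used in this article; the function names are illustrative, and this is a sketch rather than the authors' code.

import numpy as np

def db_index(data, labels, q=1, t=2):
    # Davies-Bouldin index of a labelled partition (Eqs. (1)-(5)).
    clusters = np.unique(labels)
    centers = np.array([data[labels == k].mean(axis=0) for k in clusters])
    # S_{i,q}: qth root of the qth moment of the objects about their center.
    s = np.array([
        (np.linalg.norm(data[labels == k] - centers[i], axis=1) ** q).mean() ** (1.0 / q)
        for i, k in enumerate(clusters)])
    K = len(clusters)
    r = np.zeros(K)
    for i in range(K):
        others = [j for j in range(K) if j != i]
        # d_{ij,t}: Minkowski distance of order t between centroids (Eq. (3)).
        d = np.array([np.linalg.norm(centers[i] - centers[j], ord=t) for j in others])
        r[i] = np.max((s[i] + s[others]) / d)      # Eq. (4)
    return r.mean()                                # Eq. (5)

def fitness(data, labels):
    return 1.0 / db_index(data, labels)            # F_i = 1 / DB_i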
Step 1: Given population Qt, where t denotes the number of generations, set i = 1 and choose the ith individual X_i^t.
Step 2: If t = 1, then individual X_i^t is selected; proceed to Step 4.
Step 3: Individual X_i^t is compared with the ith individual X_i^{t-1} in population Q_{t-1}; if

F_i^t - F_i^{t-1} + noise > 0, \quad (6)

then individual X_i^t is selected; otherwise individual X_i^{t-1} is selected. Here, the value of noise is a real number randomly generated in the range [-r_n, r_n], and F_i^t and F_i^{t-1} denote the fitness values of individuals X_i^t and X_i^{t-1}, respectively. In addition, the noise range decreases as the algorithm evolves, as described in Section 3.2.8.
Step 4: View the selected individual as the ith individual in the selected population and let i = i + 1; if i ≤ P, then go to Step 2; otherwise return the selected population. Here, P denotes the population size.
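In Python, the noising selection operation might be sketched as below (illustrative names; it assumes the populations of generations t and t-1 are index-aligned lists and that fitness values have already been computed).

import random

def noising_selection(pop_prev, pop_curr, fit_prev, fit_curr, rate, t):
    if t == 1:
        return list(pop_curr)                # Step 2: the first generation
    selected = []
    for i in range(len(pop_curr)):           # Steps 1 and 4: loop over i
        noise = random.uniform(-rate, rate)  # noise drawn from [-r_n, r_n]
        if fit_curr[i] - fit_prev[i] + noise > 0:   # Eq. (6)
            selected.append(pop_curr[i])
        else:
            selected.append(pop_prev[i])
    return selected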
3.2.5.1. Division operation. We adopt proportional selection to choose the cluster Ci to be divided. The selection probability p_i^d is defined as

p_i^d = S_{i,1} \Big/ \sum_{j=1}^{K} S_{j,1}, \quad (7)
where i = 1, . . . , K. That is, the sparser cluster Ci is, the more likely it is to be selected for partitioning, and vice versa. Since the K-means algorithm is simple and computationally attractive, we use it to partition cluster Ci. After the division operation, cluster Ci is divided into two new clusters and the number of clusters increases by one.

[Fig. 3. Illustration of the division and absorption operations (cluster labels C1–C4 and centers c1–c4).]
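A hedged sketch of the division operation follows, using scikit-learn's KMeans for the two-way split (the article uses the K-means algorithm for partitioning; the label-vector and scatter-dictionary data structures are our illustrative assumptions).

import numpy as np
from sklearn.cluster import KMeans

def division(data, labels, scatter, rng):
    # scatter maps each cluster label to its dispersion S_{i,1}.
    clusters = list(scatter)
    probs = np.array([scatter[k] for k in clusters])
    probs = probs / probs.sum()              # Eq. (7): p_i^d proportional to S_{i,1}
    target = rng.choice(clusters, p=probs)   # sparser clusters are more likely
    mask = labels == target
    # Split the selected cluster into two new clusters with 2-means.
    sub = KMeans(n_clusters=2, n_init=10).fit_predict(data[mask])
    new_labels = labels.copy()
    new_labels[np.flatnonzero(mask)[sub == 1]] = labels.max() + 1
    return new_labels

Here rng is a NumPy random generator, e.g. np.random.default_rng().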
3.2.5.2. Absorption operation. Like the division operation, we adopt proportional selection to determine which cluster is to be merged. That is, the closer two clusters are to each other, the more likely one of them is selected as the one to be absorbed, and vice versa. For cluster Ci, we compute the distance between Ci and its nearest neighbor Cj; if Ci has the sparser structure of the pair, then Ci is selected; otherwise its nearest neighbor Cj is selected. That is, in the selected cluster pair, the cluster with the sparser structure is the one to be merged. Suppose cluster Ci is to be absorbed; an object x belonging to cluster Ci is reassigned to cluster Ck if and only if the following condition holds:
\|x - c_k\|_2 < \|x - c_j\|_2, \quad (11)

where cj ≠ ck. After the absorption operation, cluster Ci disappears and the number of clusters decreases by one. After the division–absorption mutation operation, objects are reassigned among different clusters and a new individual is created.
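The absorption operation can be sketched similarly; since the proportional selection on nearest-neighbor distances and the comparison of scatters are only summarized above, the sketch below simplifies the choice of the cluster to absorb to a member of the closest pair of centers, and is therefore only an approximation of the described operation.

import numpy as np

def absorption(data, labels, centers):
    # centers maps each cluster label to its centroid c_k.
    keys = list(centers)
    # Find the closest pair of centers and absorb one member of the pair.
    pairs = [(np.linalg.norm(centers[a] - centers[b]), a)
             for a in keys for b in keys if a < b]
    absorbed = min(pairs)[1]
    remaining = {k: c for k, c in centers.items() if k != absorbed}
    new_labels = labels.copy()
    for idx in np.flatnonzero(labels == absorbed):
        # Eq. (11): reassign x to the cluster k minimizing ||x - c_k||_2.
        new_labels[idx] = min(remaining,
                              key=lambda k: np.linalg.norm(data[idx] - remaining[k]))
    return new_labels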
Step 1: Set the best individual generated from the initial population Q0 as the best known individual Xb.
Step 2: Perform selection and mutation operations to obtain population Qt. The best individual and the worst individual in population Qt are denoted as X_b^t and X_w^t, respectively.
Step 3: If individual X_w^t is worse than individual Xb, then replace individual X_w^t by individual Xb.
Step 4: If individual X_b^t is better than individual Xb, then replace individual Xb by individual X_b^t.
Note that Steps 2, 3, and 4 constitute one iteration of the AGCUK algorithm and are repeated until the termination crite-
rion is satisfied.
Step 1: Initialization. Given the number of generations G, the population size P, the original noise rate rmax, the terminal
noise rate rmin, and the number of iterations at a fixed noise rate Nf, generate the initial population Q0, view the best
individual from population Q0 as the best known individual Xb, and set the current number of generations t = 1.
Step 2: Evaluation and selection. Evaluate the fitness value F_i^t of individual X_i^t, where i = 1, . . . , P, and perform the noising selection operation.
Step 3: Mutation. Perform the division–absorption mutation operation for each individual.
Step 4: Termination check. If t ≤ G, then set t = t + 1, perform the elitist operation, set r_n = r_n - (r_max - r_min)/(N_t/N_f - 1), and go to Step 2. Otherwise output individual Xb. Here, we set N_f = P and the total number of iterations N_t = G · N_f.
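Schematically, the overall loop (Steps 1–4) might be assembled as below; init_pop, evaluate, and mutate are caller-supplied stand-ins for the operations described above, and the offspring-versus-parent comparison folds the noising selection into a single pass.

import random

def agcuk(data, init_pop, evaluate, mutate, P=20, G=50, r_max=1.0, r_min=0.0):
    n_fixed = P                       # N_f = P
    n_total = G * n_fixed             # N_t = G * N_f
    rate = r_max
    pop = init_pop(data, P)
    fit = [evaluate(data, ind) for ind in pop]
    x_best, f_best = max(zip(pop, fit), key=lambda p: p[1])
    for t in range(1, G + 1):
        offspring = [mutate(data, ind) for ind in pop]     # division-absorption
        off_fit = [evaluate(data, ind) for ind in offspring]
        for i in range(P):
            # Noising selection (Eq. (6)).
            noise = random.uniform(-rate, rate)
            if off_fit[i] - fit[i] + noise > 0:
                pop[i], fit[i] = offspring[i], off_fit[i]
        # Elitist operation: protect the best known individual X_b.
        worst = min(range(P), key=lambda i: fit[i])
        if fit[worst] < f_best:
            pop[worst], fit[worst] = x_best, f_best
        best = max(range(P), key=lambda i: fit[i])
        if fit[best] > f_best:
            x_best, f_best = pop[best], fit[best]
        # r_n <- r_n - (r_max - r_min)/(N_t/N_f - 1), bounded below by r_min.
        rate = max(rate - (r_max - r_min) / (n_total / n_fixed - 1), r_min)
    return x_best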
4. Experimental results
In this paper, computer simulations are conducted in Matlab on an Intel Pentium D processor running at 3.4 GHz with
512 MB real memory. The population size P of each experimental algorithm is equal to 20. The number of generations G
of each experimental algorithm is equal to 50. Each experiment includes 20 independent trials.
In this section, we first choose the proper values of the original noise rate rmax and the terminal noise rate rmin, and then
discuss how to estimate selection pressure and population diversity. Due to space limitations, a two-dimensional artificial
data set t8_2 with 400 objects and 8 clusters [22] is chosen to show the comparison results. Similar results are obtained for the other experimental data sets.
There are two parameters to be determined in the noising selection operation, the original noise rate rmax and the termi-
nal noise rate rmin. The average run time when the best result is attained for the first time is shown in Fig. 4. Given different
rmax and rmin, AGCUK obtains the maximum fitness 3.78483149 in each run, finds the correct number of clusters, and provides the correct clustering partition, but requires different run times. In Fig. 4(a), we find that if the original noise rate rmax is too large
or too small, then AGCUK requires more run time to converge to the best solution. Here, the terminal noise rate rmin is set to
be 0. When the value of rmax is equal to 1, the best performance is attained. So, we choose this parameter to be 1. It is found
that, in Fig. 4(b), the larger the value of rmin, the worse the performance. Here, the original value of the noise rate is set to 1. The reason is that an added noise is a random real number drawn with a uniform distribution from the interval [-rn, rn], and the noise rate rn is bounded by the two extreme values rmax and rmin; we therefore let rn decrease to 0 so as to recover the genuine function at the end of the noising selection operation. So, we choose rmin to be 0 in this study.
It is well known that there are two major issues in designing genetic algorithms: selection pressure and population diver-
sity. Selection pressure leads genetic algorithms to exploit information inside the fitter individuals and results in more supe-
rior offspring iteratively. The diversity in genetic algorithms is attributed to the form of population, which contains a certain
number of encoded individuals for exploration. Emphasis on selective pressure accelerates the optimization convergence but
potentially causes premature convergence because of hastened loss of diversity. On the contrary, maintaining diversity can
yield a better solution quality, but often slows down the convergence speed due to the lack of selection pressure. Therefore, a
good scheme should pursue a good balance between exploration and exploitation in consideration of both convergence
speed and solution quality. On one hand, genetic algorithms must maintain certain diversity to explore the unvisited space,
and on the other hand, genetic algorithms must have adequate selective pressure to exploit the relevant solutions [36]. To the best of our knowledge, how to keep the balance between selection pressure and population diversity has been neglected in genetic algorithm based clustering methods.
In this study, we design the noising selection operation to maintain population diversity so as to avoid the solution search
trapping in local minima, and develop the division–absorption mutation operation to keep selection pressure so as to accel-
erate the convergence speed of the clustering algorithm. In AGCUK, the degree of selection pressure is defined as
DSP = N_{si} / P, \quad (12)
where N_{si} denotes the number of super individuals. Here, a super individual is an individual identical to the best individual in the population after the selection operation. That is, the more super individuals there are, the stronger the selection pressure, and vice versa.
[Fig. 4. Determination of the original noise rate rmax and the terminal noise rate rmin.]

In order to estimate the degree of population diversity, we adopt the string-of-group-numbers encoding to record the assignment of objects. That is, we let the length of the string equal the number of objects; the value of the ith element of the string denotes the cluster number assigned to the ith object. As objects in the clustering problem are unlabeled, it is
necessary to label a clustering partition so as to estimate the difference between the individuals and to deal with the prob-
lem of population diversity. In this study, we define the best individual as the reference partition and label it in advance.
Then the difference between the individuals and the reference partition can be calculated by counting the objects whose cluster assignments differ. Here, we use the matrix S^t = [s_{ij}^t]_{P×N} to record the difference between the individuals and the reference partition in the tth generation. If object j in individual i is assigned to the same cluster as in the best individual, then s_{ij}^t = 0; otherwise s_{ij}^t = 1. The degree of population diversity is defined as

DPD = \frac{1}{PN} \sum_{i=1}^{P} \sum_{j=1}^{N} s_{ij}^t. \quad (13)
The degree of population diversity varies with the number of generations. As the number of super individuals increases, the diversity decreases. If the population converges to the best individual, then the diversity decreases to zero.
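Both quantities are easy to compute from the string-of-group-numbers encoding; a short sketch follows (it assumes each individual's label string has already been matched to the labels of the reference partition, which in general requires a label-alignment step).

import numpy as np

def degree_of_selection_pressure(population, best):
    # DSP = N_si / P (Eq. (12)): fraction of super individuals, i.e.
    # individuals identical to the best one after selection.
    n_super = sum(1 for ind in population if np.array_equal(ind, best))
    return n_super / len(population)

def degree_of_population_diversity(population, best):
    # DPD (Eq. (13)): mean fraction of objects assigned differently
    # from the reference (best) partition.
    s = np.array([ind != best for ind in population])   # s_ij^t
    return s.mean()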
Fig. 5(a) shows the selection pressure provided by experimental methods during the evolution process. The average val-
ues of the selection pressure provided by the methods presented by Bandyopadhyay and Maulik [18,20], Lai [21], Lin et al.
[22], and the AGCUK method are 15.27%, 14.25%, 11.57%, 86.88%, and 16.59%, respectively. In this experiment, Lin et al.’s
method leads to the strongest selection pressure. The selection pressure provided by the AGCUK method is weaker than that
provided by Lin et al.’s method but stronger than those provided by the other three methods. Fig. 5(b) shows the population
diversity provided by experimental methods during evolution. The average values of the population diversity provided by
the methods proposed by Bandyopadhyay and Maulik [18,20], Lai [21], Lin et al. [22], and the AGCUK method are 28.35%,
33.04%, 33.68%, 11.80%, and 21.39%, respectively. It is seen that Lin et al.’s method leads to the lowest diversity. The popu-
lation diversity provided by the AGCUK method is higher than that provided by Lin et al.’s but lower than those provided by
the other three methods. As a result, Lin et al.’s method leads to the strongest selection pressure and the lowest population
diversity. The methods presented by Bandyopadhyay and Maulik [18,20], and Lai [21] result in weaker selection pressure and
higher population diversity than Lin et al.'s method and the AGCUK method. These phenomena show that each of these methods emphasizes one aspect and neglects the other. Therefore, it is difficult for them to obtain the best results in different cases. With strong selection pressure and high population diversity, the AGCUK method can preserve the balance between selection pressure and population diversity. As a result, the AGCUK method gives better results in fewer generations than the other four methods, as shown in Fig. 6.
The clustering results of experimental methods for data set t8_2 are shown in Table 1. Seven indicators are given. The first
two indicators are the average and standard deviation values of the DB index (AvgDB and SDDB) provided by experimental
methods. The average and standard deviation number of clusters (AvgNC and SDNC) are used to show the capability of the
clustering algorithm to find the correct number of clusters. The misclassified rate (MR) denotes the number of objects which
are incorrectly partitioned divided by the total number of objects. The success rate (SR) is defined as the number of trials
where the correct partition is obtained divided by the total number of trials. The final indicator is the average run time when
the correct partition is attained for the first time. It is seen that the AGCUK algorithm outputs the minimum average value of
the DB index and its SDDB value is equal to 0. In this experiment, Bandyopadhyay and Maulik’s method [18] provides the
worst AvgDB and SDDB. In each trial, the AGCUK algorithm finds the correct number of clusters and provides the optimal
clustering partition; that is, its misclassified rate is equal to 0 and its success rate is equal to 100%. Meanwhile, we find that the AGCUK algorithm requires more run time than Lin et al.'s method to obtain the proper solution. In addition, the methods presented by Bandyopadhyay and Maulik [18] and Lai [21] fail to provide the correct partition within the specified number of generations.
[Table 1. Clustering results of experimental methods for t8_2.]
In this section, the AGCUK algorithm is applied to experimental data sets and compared with the clustering techniques
provided by Bandyopadhyay and Maulik [18,20], Lai [21], and Lin et al. [22].
Lin et al. [22] used 100 two-dimensional artificial data sets with a variety of numbers (in [Kmin, Kmax] = [2, 11]) of clusters to compare their method with Bandyopadhyay and Maulik's method [20]. There are ten data sets for each number of clusters. Their results show that Lin et al.'s method is better than the latter and finds the correct number of clusters and the correct partitions for the data sets with fewer than 7 clusters, but its performance degrades as the number of clusters increases further. In this paper, we use the latter 50 data sets, with numbers of clusters in [Kmin, Kmax] = [7, 11], to compare the AGCUK algorithm with four known genetic clustering algorithms. These 50 two-dimensional artificial data sets are shown in Fig. 7.
Fig. 8 shows the average number of clusters provided by the experimental algorithms for the 50 artificial data sets. We find that Bandyopadhyay and Maulik's method [18] is the worst and fails to find the correct partitions in most runs. For the data sets with seven clusters, Lai's method provides a number of clusters closer to 7 than Bandyopadhyay and Maulik's method, but for the other data sets their results are close to each other in many cases. Lin et al.'s method is better than the above three methods and provides a more accurate number of clusters in most cases. For each data set, the AGCUK algorithm is the best among the experimental algorithms and finds the correct number of clusters.
Fig. 9(a) shows the average misclassified rates of the experimental algorithms for the 50 artificial data sets. The misclassified rates of the methods given by Bandyopadhyay and Maulik [18,20] and Lai [21] are far larger than 0 in all runs. Lin et al.'s method and the AGCUK method seem to be comparable. After removing the other three methods from the comparison, we find that our method is much better than Lin et al.'s method, as shown in Fig. 9(b). The AGCUK algorithm provides the correct partitions of 21 of the 50 artificial data sets in each trial. Its worst misclassified rate is 0.15%, on data set t9_4.
As the misclassified rates of Lin et al.'s method and the AGCUK method are much lower than those of the other three methods, we further compare these two methods in terms of run time. Fig. 10 shows the average run time when the correct partitions of the data sets are found for the first time. On one hand, with the aid of the look-up table, Lin et al.'s method saves the repeated computation of the distance between each object and its corresponding cluster center, and is not very sensitive to the variation of the data sets. On the other hand, as noises are added to the variation of the fitness value, the AGCUK algorithm may accept bad individuals in the noising selection operation during the evolution process. As a result, although the AGCUK algorithm prevents the selected population from being occupied by several fitter individuals and maintains population diversity, it requires more run time to find the correct partitions as the number of clusters and the size of the data sets increase.
From the abovementioned experiments it follows that the AGCUK method requires more run time to produce solutions of better quality than Lin et al.'s method. In order to make a fairer comparison between the two methods, we further examine whether Lin et al.'s method would perform better than the AGCUK method when it is configured to run longer. Here, we keep the parameter settings of the AGCUK method fixed. In genetic algorithms, the population size P and the number of generations G are the two parameters related to the run time; therefore, the run time of Lin et al.'s method can be increased by increasing P or G. We summarize our findings below.
[Fig. 10. Comparison of the run time of the AGCUK method and Lin et al.'s method.]

In Lin et al.'s method [22], the mutation probability is defined as

p_m = 1.75 / (P \sqrt{l}), \quad (14)

where P denotes the population size and l denotes the chromosome length. The larger the population size P, the smaller the mutation probability p_m, and vice versa. We keep the number of generations G fixed and increase the population size P to 60.
In this experiment, Lin et al.'s method requires more run time than the AGCUK algorithm to provide the correct partitions of the data sets, as shown in Fig. 11(a), and finds the correct partitions of 8 of the 50 artificial data sets in each run. But with the decrease of the mutation probability, Lin et al.'s method leads to higher misclassified rates on some data sets, such as data set t7_8, as shown in Fig. 11(b).
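To see the scale of this effect, take a hypothetical chromosome length of l = 400 (for instance, one gene per object of data set t8_2): Eq. (14) then gives p_m = 1.75/(20 × √400) ≈ 0.0044 at P = 20, but only p_m = 1.75/(60 × √400) ≈ 0.0015 at P = 60, so tripling the population size cuts the mutation probability to a third of its original value.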
In the other experiment, we keep the population size P fixed and increase the number of generations G to 150. In this case, Lin et al.'s method requires more run time than the AGCUK algorithm, as shown in Fig. 12(a), and finds the correct partitions of 6 of the 50 artificial data sets in each run. With the increase of the number of generations, Lin et al.'s method achieves lower misclassified rates on most data sets than before, as shown in Fig. 12(b). But we find that the AGCUK method still outperforms Lin et al.'s method in most cases.
In addition, we use the Wisconsin Breast Cancer data set [37] to compare experimental algorithms. In the Breast Cancer
data set, each pattern has nine features corresponding to clump thickness, cell size uniformity, cell shape uniformity, mar-
ginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two catego-
ries in the data set, malignant and benign, which are known to be linearly inseparable. The total number of patterns is 699 (458 benign and 241 malignant), of which 16 patterns contain a single missing feature. These 16 patterns have been removed and the remaining 683 patterns are used for clustering.
Experimental results for Breast Cancer are shown in Table 2. The AGCUK algorithm and the clustering techniques provided by Bandyopadhyay and Maulik [18,20] find the correct number of clusters in all runs. Among the experimental methods, the misclassified rate of the AGCUK algorithm is the lowest. But we find that none of the experimental algorithms provides the correct partition for this data set within the specified number of generations. We find it difficult for one validity index to deal with different data sets. Other validity indices, such as the PBM-index [38], may be used to find the clustering partition in future research.
[Fig. 11. Comparison of the AGCUK method and Lin et al.'s method at P = 60.]

[Fig. 12. Comparison of the AGCUK method and Lin et al.'s method at G = 150.]

[Table 2. Clustering results of experimental methods for Breast Cancer.]
In the following, we analyze the time complexities of the experimental methods. The time complexities of the clustering methods presented by Bandyopadhyay and Maulik [18,20], Lai [21], and Lin et al. [22] are O(GPKmN), O(GPKmN), O(GPKmN), and O(GPKN + mN²), respectively. The time complexity of the AGCUK algorithm is derived as follows. In each generation, the time complexity of the fitness evaluation of the population is O(PKmN). In the genetic operations, the time complexity of the selection operation is O(P). For the mutation operation: the time complexity of selecting the cluster to be divided is O(K); the time complexity of partitioning the cluster is O(mN); the time complexity of selecting the cluster to be absorbed is O(K); and the time complexity of merging the cluster is O(KmN). The time complexity of the mutation operation is therefore dominated by the computational cost of merging the cluster. Hence, the time complexity of the AGCUK algorithm is O(GPKmN), the same as those of the methods presented by Bandyopadhyay and Maulik [18,20] and Lai [21].
5. Conclusions
Clustering is aimed at discovering structures and patterns of a given data set. As a fundamental problem and technique for
data analysis, clustering has become increasingly important. Many clustering methods require the designer to provide the number of clusters as input. Unfortunately, the number of clusters is in general unknown a priori. In this paper, we propose a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK). We design two operations, noising selection and division–absorption mutation. The reciprocal of the Davies–Bouldin index is used for computing the fitness of individuals.
We adopt the noising selection operation to prevent the selected population from being occupied by several fitter individuals and to maintain population diversity. Noises are added to the variation of the fitness value so as to avoid the clustering method becoming trapped in local minima. According to different clustering partition states, we design the division–absorption mutation operation: three combinations of its two sub-operations, the division operation and the absorption operation, are performed on different kinds of individuals to evolve the number of clusters and provide the correct partition. The AGCUK method
and four genetic clustering methods are compared. Among experimental algorithms, the AGCUK algorithm provides the cor-
rect number of clusters for artificial and real-life data sets. It obtains lower misclassified rates than the other four experimen-
tal methods. In addition, the time complexity of the AGCUK method is the same as those of the methods proposed by
Bandyopadhyay and Maulik [18,20], and Lai [21].
On the other hand, as bad individuals may be accepted in the noising selection operation, the AGCUK algorithm requires more run time than Lin et al.'s method to produce solutions of better quality. We further examined whether Lin et al.'s method would perform better than the AGCUK method when it is configured to run longer. By increasing the
population size P or the number of generations G, we make a fair comparison between the AGCUK method and Lin et al.’s
method. As a result, we find the AGCUK method still outperforms Lin et al.’s method in most cases. How to accelerate the
convergence speed of the AGCUK algorithm without decreasing its searching capability is an important area of future re-
search. In this paper, the Davies–Bouldin index is used for computing the fitness of individuals. But we find it difficult to
use one validity index to deal with different data sets. Combining other indices such as PBM-index with our method to solve
the clustering problem will be an important area of future research.
Acknowledgements
The authors thank Dr. Chih-Chin Lai for his valuable suggestions on our work. This research was supported in part by the
National Natural Science Foundation of China (NSFC) under grants 60903074 and 60828005, the National High Technology
Research and Development Program of China (863 Program) under grant 2008AA01Z119, the National Basic Research Pro-
gram of China (973 Program) under grant 2009CB326203, and the US National Science Foundation (NSF) under grant CCF-
0905337.
References
[1] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (1999) 264–323.
[2] W. Pedrycz, Knowledge-Based Clustering, Wiley, New Jersey, 2005.
[3] S.Z. Selim, M.A. Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 81–87.
[4] Y.G. Liu, K.F. Chen, X.M. Li, A hybrid genetic based clustering algorithm, in: Proceedings of the 3rd International Conference on Machine Learning and
Cybernetics, Shanghai, IEEE, 2004, pp. 1677–1682.
[5] S. Bandyopadhyay, U. Maulik, An evolutionary technique based on K-means algorithm for optimal clustering in RN, Inf. Sci. 146 (2002) 221–237.
[6] M. Laszlo, S. Mukherjee, A genetic algorithm using hyper-quadtrees for low-dimensional K-means clustering, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 533–543.
[7] Y.G. Liu, Y. Liu, L.B. Wang, K.F. Chen, A hybrid tabu search based clustering algorithm, in: R. Khosla, R.J. Howlett, L.C. Jain (Eds.), Knowledge-Based
Intelligent Information and Engineering Systems, LNCS, vol. 3682, Springer, Berlin, 2005, pp. 186–192.
[8] M.K. Ng, J.C. Wong, Clustering categorical data sets using tabu search techniques, Pattern Recogn. 35 (2002) 2783–2790.
[9] S. Bandyopadhyay, U. Maulik, M.K. Pakhira, Clustering using simulated annealing with probabilistic redistribution, Int. J. Pattern Recogn. Artif. Intell. 15
(2001) 269–285.
[10] Z. Güngör, A. Ünler, K-harmonic means data clustering with simulated annealing heuristic, Appl. Math. Comput. 184 (2007) 199–209.
[11] Y.G. Liu, Y. Liu, K.F. Chen, Clustering with noising method, in: X. Li, S.L. Wang, Z.Y. Dong (Eds.), Advanced Data Mining and Applications, LNCS, vol.
3584, Springer, Berlin, 2005, pp. 209–216.
[12] I. Charon, O. Hudry, The noising method: a new method for combinatorial optimization, Oper. Res. Lett. 14 (1993) 133–137.
[13] M. Mahdavi, M. Haghir Chehreghani, H. Abolhassani, R. Forsati, Novel meta-heuristic algorithms for clustering web documents, Appl. Math. Comput.
201 (2008) 441–451.
[14] J.A. Pacheco, A scatter search approach for the minimum sum-of-squares clustering problem, Comput. Oper. Res. 32 (2005) 1325–1335.
[15] B. Jarboui, M. Cheikh, P. Siarry, A. Rebai, Combinatorial particle swarm optimization (CPSO) for partitional clustering problem, Appl. Math. Comput. 192
(2007) 337–345.
[16] P.S. Shelokar, V.K. Jayaraman, B.D. Kulkarni, An ant colony approach for clustering, Anal. Chim. Acta 509 (2004) 187–195.
[17] M. Fathian, B. Amiri, A. Maroosi, Application of honey-bee mating optimization algorithm on clustering, Appl. Math. Comput. 190 (2007) 1502–1513.
[18] S. Bandyopadhyay, U. Maulik, Nonparametric genetic clustering: comparison of validity indices, IEEE Trans. Syst. Man Cybern. Part C – Appl. Rev. 31
(2001) 120–125.
[19] L.Y. Tseng, S.B. Yang, A genetic approach to the automatic clustering algorithm, Pattern Recogn. 34 (2001) 415–424.
[20] S. Bandyopadhyay, U. Maulik, Genetic clustering for automatic evolution of clusters and application to image classification, Pattern Recogn. 35 (2002)
1197–1208.
[21] C.C. Lai, A novel clustering approach using hierarchical genetic algorithms, Intell. Autom. Soft Comput. 11 (2005) 143–153.
[22] H.J. Lin, F.W. Yang, Y.T. Kao, An efficient GA-based clustering technique, Tamkang J. Sci. Eng. 8 (2005) 113–122.
[23] C.C. Lai, C.Y. Chang, A hierarchical evolutionary algorithm for automatic medical image segmentation, Expert Syst. Appl. 36 (2009) 248–259.
[24] S. Saha, S. Bandyopadhyay, A new line symmetry distance and its application to data clustering, J. Comput. Sci. Technol. 24 (2009) 544–556.
[25] D.X. Chang, X.D. Zhang, C.W. Zheng, D.M. Zhang, A robust dynamic niching genetic algorithm with niche migration for automatic clustering problem,
Pattern Recogn. 43 (2010) 1346–1360.
[26] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[27] P. Kumsawat, K. Attakitmongcol, A. Srikaew, A new approach for optimization in image watermarking by using genetic algorithms, IEEE Trans. Signal
Process. 53 (2005) 4707–4719.
[28] P. Ramasubramanian, A. Kannan, A genetic-algorithm based neural network short-term forecasting framework for database intrusion prediction
system, Soft Comput. 10 (2006) 699–714.
[29] Y.C. Chang, S.M. Chen, A new query reweighting method for document retrieval based on genetic algorithms, IEEE Trans. Evol. Comput. 10 (2006) 617–
622.
[30] L.P.B. Scott, J. Chahine, J.R. Ruggiero, Using genetic algorithm to design protein sequence, Appl. Math. Comput. 200 (2008) 1–9.
[31] I. Charon, O. Hudry, Application of the noising method to the traveling salesman problem, Eur. J. Oper. Res. 125 (2000) 266–277.
[32] W.H. Chen, C.S. Lin, A hybrid heuristic to solve a task allocation problem, Comput. Oper. Res. 27 (2000) 287–303.
[33] I. Charon, O. Hudry, Noising methods for a clique partitioning problem, Discrete Appl. Math. 154 (2006) 754–769.
[34] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 224–227.
[35] D. Bhandari, C.A. Murthy, S.K. Pal, Genetic algorithm with elitist model and its convergence, Int. J. Pattern Recognit. Artif. Intell. 10 (1996) 731–747.
[36] C.K. Ting, S.T. Li, C. Lee, On the harmonious mating strategy through tabu search, Inf. Sci. 156 (2003) 189–214.
[37] C.L. Blake, C.J. Merz, UCI Repository of machine learning databases, Irvine, CA: University of California, Department of Information and Computer
Science, 1998. <http://www.ics.uci.edu/mlearn/MLRepository.html>.
[38] M.K. Pakhira, S. Bandyopadhyay, U. Maulik, Validity index for crisp and fuzzy clusters, Pattern Recogn. 37 (2004) 487–501.