Automatic Clustering Using Genetic Algorithms

Keywords: Clustering; Genetic algorithms; Noising method; Davies–Bouldin index; K-means algorithm

Abstract

In face of the clustering problem, many clustering methods usually require the designer to provide the number of clusters as input. Unfortunately, the designer has no idea, in general, about this information beforehand. In this article, we develop a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK). In the AGCUK algorithm, noising selection and division–absorption mutation are designed to keep a balance between selection pressure and population diversity. In addition, the Davies–Bouldin index is employed to measure the validity of clusters. Experimental results on artificial and real-life data sets are given to illustrate the effectiveness of the AGCUK algorithm in automatically evolving the number of clusters and providing the clustering partition.
1. Introduction
Clustering is a fundamental problem that frequently arises in a great variety of application fields such as pattern recog-
nition, machine learning, statistics, etc. It is a formal study of algorithms and methods for grouping or classifying objects
without category labels. The resulting partition should possess two properties: (1) homogeneity within the clusters, i.e. ob-
jects belonging to the same cluster should be as similar as possible, and (2) heterogeneity between the clusters, i.e. objects
belonging to different clusters should be as different as possible. Many clustering techniques have been proposed [1,2].
Among them, the K-means algorithm is an important one. It is an iterative hill-climbing algorithm whose solution depends on the initial clustering. Although the K-means algorithm has been applied successfully to many practical clustering problems, it may converge to a partition that is significantly inferior to the global optimum [3].
Recently, researchers solved the clustering problem by stochastic optimization methods such as genetic algorithms, tabu
search, simulated annealing, etc. Liu et al. [4] integrated a tabu list into the genetic algorithm based clustering algorithm to
prevent several fitter individuals from occupying the population and to maintain population diversity. In addition, an aspi-
ration criterion is adopted to keep selection pressure. Bandyopadhyay and Maulik [5] designed a genetic clustering approach.
They used the K-means algorithm to provide the domain knowledge and improve the search capability of genetic algorithms.
Laszlo and Mukherjee [6] presented a genetic algorithm for evolving the cluster centers in the K-means algorithm. The set of
the cluster centers is represented using a hyper-quadtree constructed on the data. Liu et al. [7] combined the K-means algo-
rithm and the tabu search approach to accelerate the convergence speed of the tabu search based clustering algorithm. Ng
and Wong [8] proposed a tabu search based fuzzy K-modes algorithm for clustering categorical objects. Bandyopadhyay et al.
[9] integrated the K-means algorithm into the simulated annealing based clustering method to modify the cluster centroids.
By redistributing objects among clusters probabilistically, the presented method obtains better results than the K-means
algorithm. Güngör and Ünler [10] combined the K-harmonic means algorithm and the simulated annealing method to deal
with the clustering problem. The simulated annealing method is used to generate non-local moves for the cluster centers and
to select the best solution. Liu et al. [11] adopted the noising method, a metaheuristic technique reported by Charon and
Hudry [12], to solve the clustering problem. The proposed method has a lower computational cost than Bandyopadhyay et al.'s method [9] but is inferior to the latter in terms of solution quality. By modeling the clustering problem as an optimization
problem, Mahdavi et al. [13] proposed a harmony search based clustering algorithm for grouping the web documents. They
hybridized the K-means algorithm and the harmony search method in two ways and designed two hybrid algorithms. Pach-
eco [14] adopted the scatter search approach to deal with the clustering problem under the criterion of minimum sum-of-
squares clustering. Within the framework of the proposed method, greedy randomized adaptive search procedure (GRASP)
based constructions, the H-means+ algorithm, and tabu search are integrated. Jarboui et al. [15] designed a clustering approach
based on the combinatorial particle swarm optimization (CPSO) algorithm. In the CPSO method, each particle is represented
as a string of length n (where n is the number of objects) and the ith element of the string denotes the group number as-
signed to object i. The CPSO algorithm obtains better results than a genetic algorithm based clustering method in some cases.
Shelokar et al. [16] proposed an ant colony optimization method for grouping N objects into K clusters. The presented method employs distributed agents that mimic the way real ants find the shortest path from their nest to a food source and back.
Fathian et al. [17] presented the application of honeybee mating optimization in clustering (HBMK-means). By experimental
simulations, the HBMK-means method is shown to be better than other heuristic clustering algorithms such as genetic algorithms, simulated annealing, tabu search, and ant colony optimization.
The aforementioned clustering techniques [3–11,13–17] require the designer to provide the number of clusters as input.
Unfortunately, in many real-life cases the number of clusters in a data set is not known a priori. How to automatically find a
proper value of the number of clusters and provide the appropriate clustering under this condition becomes a challenge. In
this paper, our aim is to develop a genetic algorithm based clustering method called automatic genetic clustering for un-
known K (AGCUK) to automatically find the number of clusters and provide the proper clustering partition. We design
two new operators, noising selection and division–absorption mutation, to keep the balance between selection pressure
and population diversity. The Davies–Bouldin index is employed as a measure of the validity of clusters. Experimental results
on artificial and real-life data sets are given to illustrate the superiority of the AGCUK algorithm over four known genetic
clustering methods.
The remainder of this paper is organized as follows. The related work on automatic clustering methods based on genetic algorithms is reviewed in Section 2. In Section 3, we propose the AGCUK algorithm and describe it in detail. In Section 4, the choice of the original noise rate rmax and the terminal noise rate rmin is discussed, a way to estimate selection pressure and population diversity is given, and a performance comparison between AGCUK and four known genetic algorithm based clustering methods is conducted on the experimental data sets. Finally, some conclusions are drawn in Section 5.
2. Related work
In this study, we focus on how to solve the automatic clustering problem using genetic algorithms. In this regard, some
attempts have been made to use genetic algorithms for automatically clustering the data. Bandyopadhyay and Maulik [18]
applied the variable string length genetic algorithm, with real encoding of the coordinates of the cluster centers in the chro-
mosome, to the clustering problem. Experimental results on artificial and real-life data sets show that their algorithm is able
to evolve the number of clusters as well as provide the proper clustering. Tseng and Yang [19] proposed a genetic algorithm
based approach for the clustering problem. Their method consists of two stages, nearest neighbor clustering and genetic
optimization. Equipped with a heuristic strategy, the proposed method can search for a proper number of clusters and clas-
sify nonoverlapping objects into these clusters. Bandyopadhyay and Maulik [20] exploited the searching capability of genetic
algorithms for finding the number of clusters as well as the proper clustering of a given data set. A string representation, comprising both real numbers and the "do not care" symbol, is used to encode a variable number of clusters. The effectiveness
of their technique is demonstrated for both artificial and real-life data sets. Lai [21] adopted the hierarchical genetic algo-
rithm to solve the clustering problem. In the proposed method, the chromosome consists of two types of genes, control genes
and parametric genes. The control genes are coded as binary digits; the total number of "1" genes represents the number of clus-
ters. The parametric genes are coded as real numbers to represent the coordinates of the cluster centers. The relationship
between the control genes and the parametric genes is that the activation of the latter is governed by the value of the former.
If the value of a control gene is "1", then the parametric genes associated with that particular control gene are activated; otherwise they are disabled. Experimental results on artificial and real-life data sets show
Lai’s method can search for the number of clusters and provide the proper clustering. Lin et al. [22] presented a genetic clus-
tering algorithm based on the use of a binary chromosome representation. The proposed method selects the cluster centers
directly from the data set. With the aid of a look-up table, the distances between all pairs of objects are saved in advance and
evaluated only once throughout the evolution process. By experimental simulations, the superiority of their algorithm over
Bandyopadhyay and Maulik’s method [20] is shown. Lai and Chang [23] presented a clustering based approach using a hier-
archical evolutionary algorithm (HEA) for medical image segmentation. By means of a hierarchical structure in the chromo-
some, the proposed approach can automatically classify the image into appropriate classes and avoid the difficulty of
searching for the proper number of classes. Saha and Bandyopadhyay [24] reported a genetic algorithm with line symmetry
distance-based clustering technique (GALSD). The GALSD algorithm assigns objects to different clusters based on the line-
symmetry-based distance and evolves the number of clusters as long as the clusters have some line symmetry property.
In addition, a Kd-tree-based nearest neighbor search is utilized to reduce the computational complexity of computing the
line-symmetry-based distance. Chang et al. [25] proposed a genetic clustering algorithm based on dynamic niching with
niche migration (DNNM-clustering). On the basis of a similarity function relating to the approximate density shape estima-
tion, the DNNM-clustering method performs a dynamic identification of the niches with niche migration at each generation
to automatically evolve the number of clusters.
After reviewing some known genetic algorithm based clustering methods, we find that the question of how to preserve the balance between selection pressure and population diversity is neglected in these methods. In this paper, our aim is to design the AGCUK algorithm so as to automatically provide the number of clusters and find the clustering partition. Besides maintaining population diversity so as to avoid the clustering solution becoming trapped in local minima, the AGCUK algorithm should keep selection pressure so as to accelerate the convergence speed of the clustering algorithm.
3. The AGCUK algorithm

In this section, we first briefly introduce two techniques (i.e., genetic algorithms and the noising method), and then describe
the AGCUK algorithm in detail.
Genetic algorithms are randomized search and optimization techniques guided by the principles of evolution and natural genetics [26]. They have a large amount of implicit parallelism and provide near-optimal solutions of the objective or fitness function in complex, large, and multi-modal landscapes. In genetic algorithms, the parameters of the search space are encoded in the form of strings (or chromosomes). A fitness function representing the goodness of the solution encoded in a chromosome is associated with each string. Biologically inspired operators like selection, crossover, and mutation are applied over a number of generations to generate potentially better solutions. After a satisfactory individual is found or the specified number of generations has elapsed, the best individual obtained during the evolution process is taken as the final result. Genetic algorithms have applications in fields as diverse as image processing, information security, information retrieval, protein sequence design, etc. [27–30].
The noising method, which guides the heuristic search procedure in exploring the solution space, is a metaheuristic technique proposed by Charon and Hudry [12]. Instead of taking the genuine data into account directly, the noising method considers the optimal result as the outcome of a series of fluctuating data converging towards the genuine ones. Like some other metaheuristics, the noising method is based on a descent. The main difference from a pure descent is that, when the objective function value of a given solution is considered, a perturbation called a noise is added to this value. This noise is randomly chosen from an interval whose range decreases during the iteration process. The original value of the noise rate rn should therefore be chosen in such a way that, at the beginning of the iteration process, a bad neighboring solution may be accepted. As added noises are chosen from an interval centered on zero, a good neighboring solution may also be rejected. The final solution is the best solution obtained during the iteration process. The structure of the noising method is shown in Fig. 1, where Ni denotes the current number of iterations, Nt denotes the total number of iterations, Nf denotes the number of iterations at a fixed noise rate, and f(Xc), f(X′), and f(Xb) denote the objective function values of the current solution Xc, the neighboring solution X′, and the best known solution Xb, respectively.

[Fig. 1. The structure of the noising method.]

The noising method has been applied to the traveling salesman problem [31], the task allocation problem [32], and the clique partitioning problem [33].
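For concreteness, the following minimal Python sketch illustrates the noising descent just described, for a minimization problem; the helper names f and neighbor and the linear noise-rate schedule are illustrative assumptions, not the implementation used in the experiments.

import random

def noising_descent(f, neighbor, x0, r_max, r_min, n_total, n_fixed):
    # f: objective function to minimize; neighbor: returns a random
    # neighbor of a solution; r_max, r_min: original and terminal noise
    # rates; n_total: total iterations; n_fixed: iterations per noise rate.
    x_current, x_best = x0, x0
    rate = r_max
    for i in range(n_total):
        x_new = neighbor(x_current)
        # The noise is drawn from an interval centered on zero, so a bad
        # neighbor may be accepted and a good one rejected.
        noise = random.uniform(-rate, rate)
        if f(x_new) - f(x_current) + noise < 0:
            x_current = x_new
        if f(x_current) < f(x_best):
            x_best = x_current
        # Decrease the noise rate after every n_fixed iterations.
        if (i + 1) % n_fixed == 0:
            rate = max(rate - (r_max - r_min) / (n_total / n_fixed - 1), r_min)
    return x_best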
The general description of the AGCUK algorithm is given in Fig. 2. Most of its procedures follow the architecture of genetic algorithms. Its individual steps are discussed in the following.
The dispersion measure S_{i,q} of cluster Ci is defined as

S_{i,q} = \left[ \frac{1}{n_i} \sum_{x \in C_i} \|x - c_i\|_2^q \right]^{1/q}, \quad (1)

where ci denotes the cluster center of cluster Ci. Cluster center ci is computed as
c_i = \frac{1}{n_i} \sum_{x \in C_i} x, \quad (2)
where ni denotes the number of the objects belonging to cluster Ci. In Eq. (1), Si,q denotes the qth root of the qth moment of
the objects belonging to cluster Ci with respect to their mean, and is a measure of the dispersion of the objects belonging to
cluster Ci. Specifically, Si,1, used in this article, denotes the average Euclidean distance of the objects belonging to cluster Ci to
their cluster center ci. The distance between clusters Ci and Cj is defined as

d_{ij,t} = \left( \sum_{p=1}^{m} |c_{ip} - c_{jp}|^t \right)^{1/t}, \quad (3)

where d_{ij,t} denotes the Minkowski distance of order t between the centroids which characterize clusters Ci and Cj, and m denotes the number of features. Next, we
define a term Ri,qt for cluster Ci as
R_{i,qt} = \max_{j, j \neq i} \frac{S_{i,q} + S_{j,q}}{d_{ij,t}}. \quad (4)
[Fig. 2. The general description of the AGCUK algorithm: initialize parameters, establish the initial population for clustering, and repeat fitness evaluation, noising selection, and division–absorption mutation until the termination condition is met.]

The Davies–Bouldin (DB) index is then defined as

DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt}. \quad (5)
A small value of this index indicates a good clustering result. We set the fitness Fi of individual i equal to 1/DBi, where DBi is the DB index computed for individual i. As a result, the minimization problem is converted into a maximization one suitable for genetic algorithms.
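As an illustration, the fitness evaluation based on Eqs. (1)–(5) can be sketched in Python as follows, with q = 1 and t = 2 as used in this article; the function names are illustrative, and this is a sketch rather than the authors' code.

import numpy as np

def db_index(data, labels, q=1, t=2):
    # Davies-Bouldin index of a labelled partition (Eqs. (1)-(5)).
    clusters = np.unique(labels)
    centers = np.array([data[labels == k].mean(axis=0) for k in clusters])
    # S_{i,q}: qth root of the qth moment of the objects about their center.
    s = np.array([
        (np.linalg.norm(data[labels == k] - centers[i], axis=1) ** q).mean() ** (1.0 / q)
        for i, k in enumerate(clusters)])
    K = len(clusters)
    r = np.zeros(K)
    for i in range(K):
        others = [j for j in range(K) if j != i]
        # d_{ij,t}: Minkowski distance of order t between centroids (Eq. (3)).
        d = np.array([np.linalg.norm(centers[i] - centers[j], ord=t) for j in others])
        r[i] = np.max((s[i] + s[others]) / d)      # Eq. (4)
    return r.mean()                                # Eq. (5)

def fitness(data, labels):
    return 1.0 / db_index(data, labels)            # F_i = 1 / DB_i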
Step 1: Given population Qt, where t denotes the number of generations, set i = 1 and choose the ith individual X_i^t.
Step 2: If t = 1, then individual X_i^t is selected; proceed to Step 4.
Step 3: Individual X_i^t is compared with the ith individual X_i^{t-1} in population Q_{t-1}; if

F_i^t - F_i^{t-1} + noise > 0, \quad (6)

then individual X_i^t is selected; otherwise individual X_i^{t-1} is selected. Here, the value of noise is a real number randomly generated in the range [-r_n, r_n], and F_i^t and F_i^{t-1} denote the fitness values of individuals X_i^t and X_i^{t-1}, respectively. In addition, the noise range decreases as the algorithm evolves, as described in Section 3.2.8.
Step 4: View the selected individual as the ith individual in the selected population and let i = i + 1; if i ≤ P, then go to Step 2; otherwise return the selected population. Here, P denotes the population size.
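In Python, the noising selection operation might be sketched as below (illustrative names; it assumes the populations of generations t and t-1 are index-aligned lists and that fitness values have already been computed).

import random

def noising_selection(pop_prev, pop_curr, fit_prev, fit_curr, rate, t):
    if t == 1:
        return list(pop_curr)                # Step 2: the first generation
    selected = []
    for i in range(len(pop_curr)):           # Steps 1 and 4: loop over i
        noise = random.uniform(-rate, rate)  # noise drawn from [-r_n, r_n]
        if fit_curr[i] - fit_prev[i] + noise > 0:   # Eq. (6)
            selected.append(pop_curr[i])
        else:
            selected.append(pop_prev[i])
    return selected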
3.2.5.1. Division operation. We adopt proportional selection to choose the cluster Ci to be divided. The selection probability p_i^d is defined as

p_i^d = S_{i,1} \Big/ \sum_{j=1}^{K} S_{j,1}, \quad (7)
where i = 1, . . . , K. That is, the sparser cluster Ci is, the more likely it is to be selected for partitioning, and vice versa. Since the K-means algorithm is simple and computationally attractive, we use it to partition cluster Ci. After the division operation, cluster Ci is divided into two new clusters and the number of clusters increases by one.

[Fig. 3. Illustration of the division and absorption operations (cluster labels C1–C4 and centers c1–c4).]
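A hedged sketch of the division operation follows, using scikit-learn's KMeans for the two-way split (the article uses the K-means algorithm for partitioning; the label-vector and scatter-dictionary data structures are our illustrative assumptions).

import numpy as np
from sklearn.cluster import KMeans

def division(data, labels, scatter, rng):
    # scatter maps each cluster label to its dispersion S_{i,1}.
    clusters = list(scatter)
    probs = np.array([scatter[k] for k in clusters])
    probs = probs / probs.sum()              # Eq. (7): p_i^d proportional to S_{i,1}
    target = rng.choice(clusters, p=probs)   # sparser clusters are more likely
    mask = labels == target
    # Split the selected cluster into two new clusters with 2-means.
    sub = KMeans(n_clusters=2, n_init=10).fit_predict(data[mask])
    new_labels = labels.copy()
    new_labels[np.flatnonzero(mask)[sub == 1]] = labels.max() + 1
    return new_labels

Here rng is a NumPy random generator, e.g. np.random.default_rng().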
3.2.5.2. Absorption operation. Like the division operation, we adopt proportional selection to determine which cluster is to be merged. That is, the closer two clusters are to each other, the more likely one of them is selected as the one to be absorbed, and vice versa. For cluster Ci, we compute the distance between Ci and its nearest neighbor Cj; if Ci has the sparser structure of the pair, then Ci is selected; otherwise its nearest neighbor Cj is selected. That is, in the selected cluster pair, the cluster with the sparser structure is the one to be merged. Suppose cluster Ci is to be absorbed; an object x belonging to cluster Ci is reassigned to cluster Ck if and only if the following condition holds:
\|x - c_k\|_2 < \|x - c_j\|_2, \quad (11)

where cj ≠ ck. After the absorption operation, cluster Ci disappears and the number of clusters decreases by one. After the division–absorption mutation operation, objects are reassigned among different clusters and a new individual is created.
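The absorption operation can be sketched similarly; since the proportional selection on nearest-neighbor distances and the comparison of scatters are only summarized above, the sketch below simplifies the choice of the cluster to absorb to a member of the closest pair of centers, and is therefore only an approximation of the described operation.

import numpy as np

def absorption(data, labels, centers):
    # centers maps each cluster label to its centroid c_k.
    keys = list(centers)
    # Find the closest pair of centers and absorb one member of the pair.
    pairs = [(np.linalg.norm(centers[a] - centers[b]), a)
             for a in keys for b in keys if a < b]
    absorbed = min(pairs)[1]
    remaining = {k: c for k, c in centers.items() if k != absorbed}
    new_labels = labels.copy()
    for idx in np.flatnonzero(labels == absorbed):
        # Eq. (11): reassign x to the cluster k minimizing ||x - c_k||_2.
        new_labels[idx] = min(remaining,
                              key=lambda k: np.linalg.norm(data[idx] - remaining[k]))
    return new_labels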
Step 1: Set the best individual generated from the initial population Q0 as the best known individual Xb.
Step 2: Perform selection and mutation operations to obtain population Qt. The best individual and the worst individual in population Qt are denoted as X_b^t and X_w^t, respectively.
Step 3: If individual X_w^t is worse than individual Xb, then replace individual X_w^t by individual Xb.
Step 4: If individual X_b^t is better than individual Xb, then replace individual Xb by individual X_b^t.
Note that Steps 2, 3, and 4 constitute one iteration of the AGCUK algorithm and are repeated until the termination crite-
rion is satisfied.
Step 1: Initialization. Given the number of generations G, the population size P, the original noise rate rmax, the terminal
noise rate rmin, and the number of iterations at a fixed noise rate Nf, generate the initial population Q0, view the best
individual from population Q0 as the best known individual Xb, and set the current number of generations t = 1.
Step 2: Evaluation and selection. Evaluate the fitness value F_i^t of individual X_i^t, where i = 1, . . . , P, and perform the noising selection operation.
Step 3: Mutation. Perform the division–absorption mutation operation for each individual.
Step 4: Termination check. If t ≤ G, then set t = t + 1, perform the elitist operation, set r_n = r_n - (r_max - r_min)/(N_t/N_f - 1), and go to Step 2. Otherwise output individual Xb. Here, we set N_f = P and the total number of iterations N_t = G · N_f.
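Schematically, the overall loop (Steps 1–4) might be assembled as below; init_pop, evaluate, and mutate are caller-supplied stand-ins for the operations described above, and the offspring-versus-parent comparison folds the noising selection into a single pass.

import random

def agcuk(data, init_pop, evaluate, mutate, P=20, G=50, r_max=1.0, r_min=0.0):
    n_fixed = P                       # N_f = P
    n_total = G * n_fixed             # N_t = G * N_f
    rate = r_max
    pop = init_pop(data, P)
    fit = [evaluate(data, ind) for ind in pop]
    x_best, f_best = max(zip(pop, fit), key=lambda p: p[1])
    for t in range(1, G + 1):
        offspring = [mutate(data, ind) for ind in pop]     # division-absorption
        off_fit = [evaluate(data, ind) for ind in offspring]
        for i in range(P):
            # Noising selection (Eq. (6)).
            noise = random.uniform(-rate, rate)
            if off_fit[i] - fit[i] + noise > 0:
                pop[i], fit[i] = offspring[i], off_fit[i]
        # Elitist operation: protect the best known individual X_b.
        worst = min(range(P), key=lambda i: fit[i])
        if fit[worst] < f_best:
            pop[worst], fit[worst] = x_best, f_best
        best = max(range(P), key=lambda i: fit[i])
        if fit[best] > f_best:
            x_best, f_best = pop[best], fit[best]
        # r_n <- r_n - (r_max - r_min)/(N_t/N_f - 1), bounded below by r_min.
        rate = max(rate - (r_max - r_min) / (n_total / n_fixed - 1), r_min)
    return x_best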
4. Experimental results
In this paper, computer simulations are conducted in Matlab on an Intel Pentium D processor running at 3.4 GHz with
512 MB real memory. The population size P of each experimental algorithm is equal to 20. The number of generations G
of each experimental algorithm is equal to 50. Each experiment includes 20 independent trials.
In this section, we first choose the proper values of the original noise rate rmax and the terminal noise rate rmin, and then
discuss how to estimate selection pressure and population diversity. Due to space limitations, a two-dimensional artificial
data set t8_2 with 400 objects and 8 clusters [22] is chosen to show the comparison results. Similar results are obtained for the other experimental data sets.
There are two parameters to be determined in the noising selection operation, the original noise rate rmax and the termi-
nal noise rate rmin. The average run time when the best result is attained for the first time is shown in Fig. 4. Given different
rmax and rmin, AGCUK obtains the maximum fitness 3.78483149 in each run, finds the correct number of clusters, and provides the correct clustering partition, but requires different run times. In Fig. 4(a), we find that if the original noise rate rmax is too large
or too small, then AGCUK requires more run time to converge to the best solution. Here, the terminal noise rate rmin is set to
be 0. When the value of rmax is equal to 1, the best performance is attained. So, we choose this parameter to be 1. It is found
that, in Fig. 4(b), the larger the value of rmin, the worse the performance. Here, the original value of the noise rate is set to 1. The reason is that an added noise is a random real number drawn with a uniform distribution from the interval [-rn, rn], and the noise rate rn is bounded by the two extreme values rmax and rmin; we therefore let rn decrease to 0 so as to recover the genuine function at the end of the noising selection operation. So, we choose rmin to be 0 in this study.
It is well known that there are two major issues in designing genetic algorithms: selection pressure and population diver-
sity. Selection pressure leads genetic algorithms to exploit information inside the fitter individuals and results in more supe-
rior offspring iteratively. The diversity in genetic algorithms is attributed to the form of population, which contains a certain
number of encoded individuals for exploration. Emphasis on selective pressure accelerates the optimization convergence but
potentially causes premature convergence because of hastened loss of diversity. On the contrary, maintaining diversity can
yield a better solution quality, but often slows down the convergence speed due to the lack of selection pressure. Therefore, a
good scheme should pursue a good balance between exploration and exploitation in consideration of both convergence
speed and solution quality. On one hand, genetic algorithms must maintain certain diversity to explore the unvisited space,
and on the other hand, genetic algorithms must have adequate selective pressure to exploit the relevant solutions [36]. To the best of our knowledge, how to keep the balance between selection pressure and population diversity has been neglected in genetic algorithm based clustering methods.
In this study, we design the noising selection operation to maintain population diversity so as to avoid the solution search
trapping in local minima, and develop the division–absorption mutation operation to keep selection pressure so as to accel-
erate the convergence speed of the clustering algorithm. In AGCUK, the degree of selection pressure is defined as
DSP = N_{si} / P, \quad (12)
where N_{si} denotes the number of super individuals. Here, a super individual is an individual identical to the best individual in the population after the selection operation. That is, the more super individuals there are, the stronger the selection pressure, and vice versa.
[Fig. 4. Determination of the original noise rate rmax and the terminal noise rate rmin.]

In order to estimate the degree of population diversity, we adopt the string-of-group-numbers encoding to record the assignment of objects. That is, we let the length of the string equal the number of objects; the value of the ith element of the string denotes the cluster number assigned to the ith object. As objects in the clustering problem are unlabeled, it is
necessary to label a clustering partition so as to estimate the difference between the individuals and to deal with the prob-
lem of population diversity. In this study, we define the best individual as the reference partition and label it in advance.
Then the difference between the individuals and the reference partition can be calculated by counting the objects whose cluster assignments differ. Here, we use the matrix S^t = [s_{ij}^t]_{P×N} to record the difference between the individuals and the reference partition in the tth generation. If object j in individual i is assigned to the same cluster as in the best individual, then s_{ij}^t = 0; otherwise s_{ij}^t = 1. The degree of population diversity is defined as

DPD = \frac{1}{PN} \sum_{i=1}^{P} \sum_{j=1}^{N} s_{ij}^t. \quad (13)
The degree of population diversity varies with the number of generations. As the number of super individuals increases, the diversity decreases. If the population converges to the best individual, then the diversity decreases to zero.
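Both quantities are easy to compute from the string-of-group-numbers encoding; a short sketch follows (it assumes each individual's label string has already been matched to the labels of the reference partition, which in general requires a label-alignment step).

import numpy as np

def degree_of_selection_pressure(population, best):
    # DSP = N_si / P (Eq. (12)): fraction of super individuals, i.e.
    # individuals identical to the best one after selection.
    n_super = sum(1 for ind in population if np.array_equal(ind, best))
    return n_super / len(population)

def degree_of_population_diversity(population, best):
    # DPD (Eq. (13)): mean fraction of objects assigned differently
    # from the reference (best) partition.
    s = np.array([ind != best for ind in population])   # s_ij^t
    return s.mean()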
Fig. 5(a) shows the selection pressure provided by experimental methods during the evolution process. The average val-
ues of the selection pressure provided by the methods presented by Bandyopadhyay and Maulik [18,20], Lai [21], Lin et al.
[22], and the AGCUK method are 15.27%, 14.25%, 11.57%, 86.88%, and 16.59%, respectively. In this experiment, Lin et al.’s
method leads to the strongest selection pressure. The selection pressure provided by the AGCUK method is weaker than that
provided by Lin et al.’s method but stronger than those provided by the other three methods. Fig. 5(b) shows the population
diversity provided by experimental methods during evolution. The average values of the population diversity provided by
the methods proposed by Bandyopadhyay and Maulik [18,20], Lai [21], Lin et al. [22], and the AGCUK method are 28.35%,
33.04%, 33.68%, 11.80%, and 21.39%, respectively. It is seen that Lin et al.’s method leads to the lowest diversity. The popu-
lation diversity provided by the AGCUK method is higher than that provided by Lin et al.’s but lower than those provided by
the other three methods. As a result, Lin et al.’s method leads to the strongest selection pressure and the lowest population
diversity. The methods presented by Bandyopadhyay and Maulik [18,20], and Lai [21] result in weaker selection pressure and
higher population diversity than Lin et al.'s method and the AGCUK method. These phenomena show that each of these methods emphasizes one aspect and neglects the other. Therefore, it is difficult for them to obtain the best results in different cases. With strong selection pressure and high population diversity, the AGCUK method can preserve the balance between selection pressure and population diversity. As a result, the AGCUK method gives better results in fewer generations than the other four methods, as shown in Fig. 6.
The clustering results of experimental methods for data set t8_2 are shown in Table 1. Seven indicators are given. The first
two indicators are the average and standard deviation values of the DB index (AvgDB and SDDB) provided by experimental
methods. The average and standard deviation number of clusters (AvgNC and SDNC) are used to show the capability of the
clustering algorithm to find the correct number of clusters. The misclassified rate (MR) denotes the number of objects which
are incorrectly partitioned divided by the total number of objects. The success rate (SR) is defined as the number of trials
where the correct partition is obtained divided by the total number of trials. The final indicator is the average run time when
the correct partition is attained for the first time. It is seen that the AGCUK algorithm outputs the minimum average value of
the DB index and its SDDB value is equal to 0. In this experiment, Bandyopadhyay and Maulik’s method [18] provides the
worst AvgDB and SDDB. In each trial, the AGCUK algorithm finds the correct number of clusters and provides the optimal
clustering partition; that is, its misclassified rate is equal to 0 and its success rate is equal to 100%. Meanwhile, we find that the AGCUK algorithm requires more run time than Lin et al.'s method to obtain the proper solution. In addition, the methods presented by Bandyopadhyay and Maulik [18] and Lai [21] fail to provide the correct partition within the specified number of generations.
[Table 1. Clustering results of experimental methods for t8_2.]
In this section, the AGCUK algorithm is applied to experimental data sets and compared with the clustering techniques
provided by Bandyopadhyay and Maulik [18,20], Lai [21], and Lin et al. [22].
Lin et al. [22] used 100 two-dimensional artificial data sets with a variety of numbers (in [Kmin, Kmax] = [2, 11]) of clusters to compare their method with Bandyopadhyay and Maulik's method [20]. There are ten data sets for each number of clusters. Their results show that Lin et al.'s method is better than the latter and finds the correct number of clusters and the correct partitions for the data sets with fewer than 7 clusters, but its performance degrades as the number of clusters increases further. In this paper, we use the latter 50 data sets, with numbers of clusters in [Kmin, Kmax] = [7, 11], to compare the AGCUK algorithm with four known genetic clustering algorithms. These 50 two-dimensional artificial data sets are shown in Fig. 7.
Fig. 8 shows the average number of clusters provided by the experimental algorithms for the 50 artificial data sets. We find that Bandyopadhyay and Maulik's method [18] is the worst and fails to find the correct partitions in most runs. For the data sets with seven clusters, Lai's method provides a number of clusters closer to 7 than Bandyopadhyay and Maulik's method, but for the other data sets their results are close to each other in many cases. Lin et al.'s method is better than the above three methods and provides a more accurate number of clusters in most cases. For each data set, the AGCUK algorithm is the best among the experimental algorithms and finds the correct number of clusters.
Fig. 9(a) shows the average misclassified rates of the experimental algorithms for the 50 artificial data sets. The misclassified rates of the methods given by Bandyopadhyay and Maulik [18,20] and Lai [21] are far larger than 0 in all runs. Lin et al.'s method and the AGCUK method seem to be comparable. After removing the other three methods from the comparison, we find that our method is much better than Lin et al.'s method, as shown in Fig. 9(b). The AGCUK algorithm provides the correct partitions of 21 of the 50 artificial data sets in each trial. Its worst misclassified rate is 0.15%, on data set t9_4.
As the misclassified rates of Lin et al.'s method and the AGCUK method are much lower than those of the other three methods, we further compare these two methods in terms of run time. Fig. 10 shows the average run time when the correct partitions of the data sets are found for the first time. On one hand, with the aid of the look-up table, Lin et al.'s method saves the repeated computation of the distance between each object and its corresponding cluster center, and is not very sensitive to the variation of the data sets. On the other hand, as noises are added to the variation of the fitness value, the AGCUK algorithm may accept bad individuals in the noising selection operation during the evolution process. As a result, although the AGCUK algorithm prevents the selected population from being occupied by several fitter individuals and maintains population diversity, it requires more run time to find the correct partitions as the number of clusters and the size of the data sets increase.
From the abovementioned experiments it follows that the AGCUK method requires more run time to produce solutions of better quality than Lin et al.'s method. In order to make a fairer comparison between the two methods, we further examine whether Lin et al.'s method would perform better than the AGCUK method when it is configured to run longer. Here, we keep the parameter settings of the AGCUK method fixed. In genetic algorithms, the population size P and the number of generations G are the two parameters related to the run time; therefore, the run time of Lin et al.'s method can be increased by increasing P or G. We summarize our findings below.
[Fig. 10. Comparison of the run time of the AGCUK method and Lin et al.'s method.]

In Lin et al.'s method [22], the mutation probability is defined as

p_m = 1.75 / (P \sqrt{l}), \quad (14)

where P denotes the population size and l denotes the chromosome length. The larger the population size P, the smaller the mutation probability p_m, and vice versa. We keep the number of generations G fixed and increase the population size P to 60.
In this experiment, Lin et al.'s method requires more run time than the AGCUK algorithm to provide the correct partitions of the data sets, as shown in Fig. 11(a), and finds the correct partitions of 8 of the 50 artificial data sets in each run. But with the decrease of the mutation probability, Lin et al.'s method leads to higher misclassified rates on some data sets, such as data set t7_8, as shown in Fig. 11(b).
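To see the scale of this effect, take a hypothetical chromosome length of l = 400 (for instance, one gene per object of data set t8_2): Eq. (14) then gives p_m = 1.75/(20 × √400) ≈ 0.0044 at P = 20, but only p_m = 1.75/(60 × √400) ≈ 0.0015 at P = 60, so tripling the population size cuts the mutation probability to a third of its original value.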
In the other experiment, we keep the population size P fixed and increase the number of generations G to 150. In this case, Lin et al.'s method requires more run time than the AGCUK algorithm, as shown in Fig. 12(a), and finds the correct partitions of 6 of the 50 artificial data sets in each run. With the increase of the number of generations, Lin et al.'s method achieves lower misclassified rates on most data sets than before, as shown in Fig. 12(b). But we find that the AGCUK method still outperforms Lin et al.'s method in most cases.
In addition, we use the Wisconsin Breast Cancer data set [37] to compare experimental algorithms. In the Breast Cancer
data set, each pattern has nine features corresponding to clump thickness, cell size uniformity, cell shape uniformity, mar-
ginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two catego-
ries in the data set, malignant and benign, which are known to be linearly inseparable. The total number of patterns is 699 (458 benign and 241 malignant), of which 16 patterns contain a single missing feature. These 16 patterns have been removed and the remaining 683 patterns are used for clustering.
Experimental results for Breast Cancer are shown in Table 2. The AGCUK algorithm and the clustering techniques provided by Bandyopadhyay and Maulik [18,20] find the correct number of clusters in all runs. Among the experimental methods, the misclassified rate of the AGCUK algorithm is the lowest. But we find that none of the experimental algorithms provides the correct partition for this data set within the specified number of generations. We find it difficult for one validity index to deal with different data sets. Other validity indices, such as the PBM-index [38], may be used to find the clustering partition in future research.
[Fig. 11. Comparison of the AGCUK method and Lin et al.'s method at P = 60.]

[Fig. 12. Comparison of the AGCUK method and Lin et al.'s method at G = 150.]

[Table 2. Clustering results of experimental methods for Breast Cancer.]
In the following, we analyze the time complexities of the experimental methods. The time complexities of the clustering methods presented by Bandyopadhyay and Maulik [18,20], Lai [21], and Lin et al. [22] are O(GPKmN), O(GPKmN), O(GPKmN), and O(GPKN + mN²), respectively. The time complexity of the AGCUK algorithm is derived as follows. In each generation, the time complexity of the fitness evaluation of the population is O(PKmN). In the genetic operations, the time complexity of the selection operation is O(P). For the mutation operation: the time complexity of selecting the cluster to be divided is O(K); the time complexity of partitioning the cluster is O(mN); the time complexity of selecting the cluster to be absorbed is O(K); and the time complexity of merging the cluster is O(KmN). The time complexity of the mutation operation is therefore dominated by the computational cost of merging the cluster. Hence, the time complexity of the AGCUK algorithm is O(GPKmN), the same as those of the methods presented by Bandyopadhyay and Maulik [18,20] and Lai [21].
5. Conclusions
Clustering is aimed at discovering structures and patterns of a given data set. As a fundamental problem and technique for
data analysis, clustering has become increasingly important. Many clustering methods require the designer to provide the number of clusters as input. Unfortunately, the number of clusters is in general unknown a priori. In this paper, we propose a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK). We design two operations, noising selection and division–absorption mutation. The reciprocal of the Davies–Bouldin index is used for computing the fitness of individuals.
We adopt the noising selection operation to prevent the selected population from being occupied by several fitter individuals and to maintain population diversity. Noises are added to the variation of the fitness value so as to avoid the clustering method becoming trapped in local minima. According to different clustering partition states, we design the division–absorption mutation operation: three combinations of its two sub-operations, the division operation and the absorption operation, are performed on different kinds of individuals to evolve the number of clusters and provide the correct partition. The AGCUK method
and four genetic clustering methods are compared. Among experimental algorithms, the AGCUK algorithm provides the cor-
rect number of clusters for artificial and real-life data sets. It obtains lower misclassified rates than the other four experimen-
tal methods. In addition, the time complexity of the AGCUK method is the same as those of the methods proposed by
Bandyopadhyay and Maulik [18,20], and Lai [21].
On the other hand, as bad individuals may be accepted in the noising selection operation, the AGCUK algorithm requires more run time than Lin et al.'s method to produce solutions of better quality. We further examined whether Lin et al.'s method would perform better than the AGCUK method when it is configured to run longer. By increasing the
population size P or the number of generations G, we make a fair comparison between the AGCUK method and Lin et al.’s
method. As a result, we find the AGCUK method still outperforms Lin et al.’s method in most cases. How to accelerate the
convergence speed of the AGCUK algorithm without decreasing its searching capability is an important area of future re-
search. In this paper, the Davies–Bouldin index is used for computing the fitness of individuals. But we find it difficult to
use one validity index to deal with different data sets. Combining other indices such as PBM-index with our method to solve
the clustering problem will be an important area of future research.
Acknowledgements
The authors thank Dr. Chih-Chin Lai for his valuable suggestions on our work. This research was supported in part by the
National Natural Science Foundation of China (NSFC) under grants 60903074 and 60828005, the National High Technology
Research and Development Program of China (863 Program) under grant 2008AA01Z119, the National Basic Research Pro-
gram of China (973 Program) under grant 2009CB326203, and the US National Science Foundation (NSF) under grant CCF-
0905337.
References
[1] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (1999) 264–323.
[2] W. Pedrycz, Knowledge-Based Clustering, Wiley, New Jersey, 2005.
[3] S.Z. Selim, M.A. Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 81–87.
[4] Y.G. Liu, K.F. Chen, X.M. Li, A hybrid genetic based clustering algorithm, in: Proceedings of the 3rd International Conference on Machine Learning and
Cybernetics, Shanghai, IEEE, 2004, pp. 1677–1682.
[5] S. Bandyopadhyay, U. Maulik, An evolutionary technique based on K-means algorithm for optimal clustering in RN, Inf. Sci. 146 (2002) 221–237.
[6] M. Laszlo, S. Mukherjee, A genetic algorithm using hyper-quadtrees for low-dimensional K-means clustering, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 533–543.
[7] Y.G. Liu, Y. Liu, L.B. Wang, K.F. Chen, A hybrid tabu search based clustering algorithm, in: R. Khosla, R.J. Howlett, L.C. Jain (Eds.), Knowledge-Based
Intelligent Information and Engineering Systems, LNCS, vol. 3682, Springer, Berlin, 2005, pp. 186–192.
[8] M.K. Ng, J.C. Wong, Clustering categorical data sets using tabu search techniques, Pattern Recogn. 35 (2002) 2783–2790.
[9] S. Bandyopadhyay, U. Maulik, M.K. Pakhira, Clustering using simulated annealing with probabilistic redistribution, Int. J. Pattern Recogn. Artif. Intell. 15
(2001) 269–285.
[10] Z. Güngör, A. Ünler, K-harmonic means data clustering with simulated annealing heuristic, Appl. Math. Comput. 184 (2007) 199–209.
[11] Y.G. Liu, Y. Liu, K.F. Chen, Clustering with noising method, in: X. Li, S.L. Wang, Z.Y. Dong (Eds.), Advanced Data Mining and Applications, LNCS, vol.
3584, Springer, Berlin, 2005, pp. 209–216.
[12] I. Charon, O. Hudry, The noising method: a new method for combinatorial optimization, Oper. Res. Lett. 14 (1993) 133–137.
[13] M. Mahdavi, M. Haghir Chehreghani, H. Abolhassani, R. Forsati, Novel meta-heuristic algorithms for clustering web documents, Appl. Math. Comput.
201 (2008) 441–451.
[14] J.A. Pacheco, A scatter search approach for the minimum sum-of-squares clustering problem, Comput. Oper. Res. 32 (2005) 1325–1335.
[15] B. Jarboui, M. Cheikh, P. Siarry, A. Rebai, Combinatorial particle swarm optimization (CPSO) for partitional clustering problem, Appl. Math. Comput. 192
(2007) 337–345.
[16] P.S. Shelokar, V.K. Jayaraman, B.D. Kulkarni, An ant colony approach for clustering, Anal. Chim. Acta 509 (2004) 187–195.
[17] M. Fathian, B. Amiri, A. Maroosi, Application of honey-bee mating optimization algorithm on clustering, Appl. Math. Comput. 190 (2007) 1502–1513.
[18] S. Bandyopadhyay, U. Maulik, Nonparametric genetic clustering: comparison of validity indices, IEEE Trans. Syst. Man Cybern. Part C – Appl. Rev. 31
(2001) 120–125.
[19] L.Y. Tseng, S.B. Yang, A genetic approach to the automatic clustering algorithm, Pattern Recogn. 34 (2001) 415–424.
[20] S. Bandyopadhyay, U. Maulik, Genetic clustering for automatic evolution of clusters and application to image classification, Pattern Recogn. 35 (2002)
1197–1208.
[21] C.C. Lai, A novel clustering approach using hierarchical genetic algorithms, Intell. Autom. Soft Comput. 11 (2005) 143–153.
[22] H.J. Lin, F.W. Yang, Y.T. Kao, An efficient GA-based clustering technique, Tamkang J. Sci. Eng. 8 (2005) 113–122.
[23] C.C. Lai, C.Y. Chang, A hierarchical evolutionary algorithm for automatic medical image segmentation, Expert Syst. Appl. 36 (2009) 248–259.
[24] S. Saha, S. Bandyopadhyay, A new line symmetry distance and its application to data clustering, J. Comput. Sci. Technol. 24 (2009) 544–556.
[25] D.X. Chang, X.D. Zhang, C.W. Zheng, D.M. Zhang, A robust dynamic niching genetic algorithm with niche migration for automatic clustering problem,
Pattern Recogn. 43 (2010) 1346–1360.
[26] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[27] P. Kumsawat, K. Attakitmongcol, A. Srikaew, A new approach for optimization in image watermarking by using genetic algorithms, IEEE Trans. Signal
Process. 53 (2005) 4707–4719.
[28] P. Ramasubramanian, A. Kannan, A genetic-algorithm based neural network short-term forecasting framework for database intrusion prediction
system, Soft Comput. 10 (2006) 699–714.
[29] Y.C. Chang, S.M. Chen, A new query reweighting method for document retrieval based on genetic algorithms, IEEE Trans. Evol. Comput. 10 (2006) 617–
622.
[30] L.P.B. Scott, J. Chahine, J.R. Ruggiero, Using genetic algorithm to design protein sequence, Appl. Math. Comput. 200 (2008) 1–9.
[31] I. Charon, O. Hudry, Application of the noising method to the traveling salesman problem, Eur. J. Oper. Res. 125 (2000) 266–277.
[32] W.H. Chen, C.S. Lin, A hybrid heuristic to solve a task allocation problem, Comput. Oper. Res. 27 (2000) 287–303.
[33] I. Charon, O. Hudry, Noising methods for a clique partitioning problem, Discrete Appl. Math. 154 (2006) 754–769.
[34] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 224–227.
[35] D. Bhandari, C.A. Murthy, S.K. Pal, Genetic algorithm with elitist model and its convergence, Int. J. Pattern Recognit. Artif. Intell. 10 (1996) 731–747.
[36] C.K. Ting, S.T. Li, C. Lee, On the harmonious mating strategy through tabu search, Inf. Sci. 156 (2003) 189–214.
[37] C.L. Blake, C.J. Merz, UCI Repository of machine learning databases, Irvine, CA: University of California, Department of Information and Computer
Science, 1998. <http://www.ics.uci.edu/mlearn/MLRepository.html>.
[38] M.K. Pakhira, S. Bandyopadhyay, U. Maulik, Validity index for crisp and fuzzy clusters, Pattern Recogn. 37 (2004) 487–501.