
International Journal of Engineering Research & Technology (IJERT)
ISSN: 2278-0181
Vol. 2 Issue 1, January 2013

An Improvement in K-means Clustering Algorithm

Anand Sutariya¹, Prof. Kiran Amin²
¹PG Student, U.V. Patel College of Engineering, Ganpat University, Mehsana, Gujarat
²Head, CE Dept., U.V. Patel College of Engineering, Ganpat University, Mehsana, Gujarat

Abstract

Data Mining is the process of extracting information from large data sets through the use of algorithms and techniques drawn from the fields of Statistics, Machine Learning and Database Management Systems. Cluster analysis is one of the major data analysis methods, and the k-means clustering algorithm is widely used in many practical applications. However, the original k-means algorithm is computationally expensive, and the quality of the resulting clusters depends heavily on the selection of the initial centroids. Several methods have been proposed for improving the performance of the k-means clustering algorithm, but many problems remain in the original algorithm. We therefore propose an improved k-means algorithm that addresses these performance problems.

1. Introduction

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.

Clustering is the grouping of similar objects, and a cluster of a set is a partition of its elements that is chosen to minimize some measure of dissimilarity [1]. Unlike classification, which is a supervised learning technique, clustering is a type of unsupervised learning. In clustering, objects in the dataset are grouped into clusters such that the groups are very different from each other and the objects within the same group are very similar to each other. In this case, clusters are not predefined, which means that the resulting clusters are not known before the execution of the clustering algorithm; they are extracted from the dataset by grouping the objects in it. For some algorithms the number of desired clusters is supplied as input, whereas others determine the number of groups themselves to obtain the best clustering result. Clustering a dataset gives information on both the overall dataset and the characteristics of the objects in it [2].

1.1 Partitioning Methods

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects in different clusters are "far apart" or very different. There are various other criteria for judging the quality of partitions.

1.2 Hierarchical Methods

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy) or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster or a termination condition holds.
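To make the agglomerative description concrete, the following is a minimal Java sketch of bottom-up merging. It is illustrative only: the class and method names are ours, and single-linkage distance and a target cluster count stand in for the unspecified closeness measure and termination condition.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of agglomerative (bottom-up) clustering.
public class AgglomerativeSketch {

    static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    // Single-linkage distance: smallest pairwise distance between two groups.
    static double linkage(List<double[]> a, List<double[]> b) {
        double best = Double.MAX_VALUE;
        for (double[] x : a)
            for (double[] y : b)
                best = Math.min(best, euclidean(x, y));
        return best;
    }

    // Start with each object in its own group and repeatedly merge the two
    // closest groups until only 'target' groups remain (the termination
    // condition mentioned in Section 1.2).
    static List<List<double[]>> cluster(double[][] data, int target) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : data) {
            List<double[]> group = new ArrayList<>();
            group.add(p);
            clusters.add(group);
        }
        while (clusters.size() > target) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double dij = linkage(clusters.get(i), clusters.get(j));
                    if (dij < best) { best = dij; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj)); // bj > bi, safe to remove
        }
        return clusters;
    }
}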


1.3 K-means Algorithm

K-means is a data mining algorithm which performs clustering. As mentioned previously, clustering divides a dataset into a number of groups such that similar items fall into the same group [1]. Clustering uses an unsupervised learning technique, which means that the resulting clusters are not known before the execution of the algorithm, unlike the case in classification. Some clustering algorithms take the number of desired clusters as input, while others decide the number of resulting clusters themselves.

The k-means algorithm uses an iterative procedure to cluster a database [3]. It takes the number of desired clusters and the initial means as inputs, and produces the final means as output; the initial and final means are the means of the clusters. If the algorithm is required to produce K clusters, then there will be K initial means and K final means. On completion, the algorithm produces K final means, which is why it is called K-means.

After termination of k-means clustering, each object in the dataset is a member of exactly one cluster. This cluster is determined by searching through the means to find the cluster whose mean is nearest to the object. The algorithm tries to group the items in the dataset into the desired number of clusters, and to perform this task it iterates until it converges. After each iteration the calculated means are updated so that they become closer to the final means, and finally the algorithm converges and stops iterating.

Steps of Algorithm:

Input:

D = {d1, d2, ..., dn} // set of n data items
k // Number of desired clusters

Output: A set of k clusters.

Steps:

1. Arbitrarily choose k data items from D as initial centroids;
2. Repeat
       Assign each item di to the cluster whose centroid is closest;
       Calculate the new mean for each cluster;
   Until the convergence criterion is met.
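For reference, below is a minimal, self-contained Java sketch of the algorithm above. This is our own illustration, not the implementation evaluated in Section 4; it assumes Euclidean distance and picks the arbitrary initial centroids with a random generator.

import java.util.Random;

// Illustrative sketch of the original k-means algorithm.
public class KMeansSketch {

    static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    // Returns the final cluster id of each of the n data items.
    static int[] cluster(double[][] d, int k, int maxIter) {
        int n = d.length, dim = d[0].length;
        double[][] centroids = new double[k][];
        Random rnd = new Random();
        for (int j = 0; j < k; j++)                  // Step 1: arbitrary initial centroids
            centroids[j] = d[rnd.nextInt(n)].clone();

        int[] clusterId = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {            // Step 2a: assign each item to the
                int best = 0;                        // cluster with the closest centroid
                for (int j = 1; j < k; j++)
                    if (euclidean(d[i], centroids[j]) < euclidean(d[i], centroids[best]))
                        best = j;
                if (best != clusterId[i]) { clusterId[i] = best; changed = true; }
            }
            double[][] sum = new double[k][dim];     // Step 2b: recalculate each mean
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[clusterId[i]]++;
                for (int t = 0; t < dim; t++) sum[clusterId[i]][t] += d[i][t];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0)
                    for (int t = 0; t < dim; t++) centroids[j][t] = sum[j][t] / count[j];
            if (iter > 0 && !changed) break;         // convergence: no item moved
        }
        return clusterId;
    }
}

Note that every iteration computes the distance from every item to all k centroids; this per-iteration cost is exactly what the heuristic proposed in Section 3 reduces.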

2. Related Work

Fang Yuan et al. [4] proposed a systematic method for finding the initial centroids. The centroids obtained by this method are consistent with the distribution of the data, so it produces clusters with better accuracy than the original k-means algorithm. However, Yuan's method does not suggest any improvement to the time complexity of the k-means algorithm.

Sun Shibao et al. [5] proposed an improved k-means algorithm based on weights. This new partitioning clustering algorithm can handle data with numerical attributes as well as data with symbolic attributes. It also reduces the impact of isolated points and noise, and so enhances the efficiency of clustering. However, the method offers no improvement in time complexity.

The k-means algorithm is a popular partitioning algorithm in cluster analysis, but it has limitations when computing resources and time are restricted, especially for huge datasets. Yu-Fang Zhang et al. [6] proposed an improved k-means algorithm as a solution for handling large-scale data. However, this method too offers no improvement in time complexity.

The k-means algorithm has high efficiency and scalability and converges quickly when dealing with large data sets. However, it also has many deficiencies: the number of clusters K must be initialized, the initial cluster centers are arbitrarily selected, and the algorithm is influenced by noise points. In view of these shortcomings of the traditional k-means clustering algorithm, Juntao Wang et al. [7] proposed an improved k-means algorithm using a noise-data filter. The algorithm adds density-based detection methods, based on the characteristics of noise data, so that the discovery and processing of noise data are incorporated into the original algorithm. However, this method offers no improvement in time complexity.

The k-means algorithm is well known for its efficiency in clustering large data sets. However, because it works only on numeric values, it cannot be used to cluster real-world data containing categorical values. Zhexu Huang [10] proposed two algorithms which extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values. However, this method improves neither the time complexity nor the final clusters.

3. Proposed Solution

From the related work above, we can see that these algorithms all suffer in terms of time complexity, accuracy, or both. We therefore propose a new improved k-means algorithm which handles these problems well.

Improved K-means Algorithm

Steps of Algorithm:

Input:

D = {d1, d2, ..., dn} // set of n data items
k // Number of desired clusters

Output: A set of k clusters.

Steps:

Phase 1: Determine the initial centroids of the clusters by using Phase 1.
Phase 2: Assign each data point to the appropriate cluster by using Phase 2.

In the first phase, the initial centroids are determined systematically so as to produce clusters with better accuracy. The second phase assigns each data point to the appropriate cluster. The two phases of the improved method are described below as Phase 1 and Phase 2.

Phase 1: Finding the initial centroids.

Input:

D = {d1, d2, ..., dn} // set of n data items
k // Number of desired clusters

Output: A set of k initial centroids.

Steps:

1. Set m = 1;
2. Compute the distance between each data point and all other data points in the set D;
3. Find the closest pair of data points in the set D and form a data-point set Am (1 <= m <= k) which contains these two data points; delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am; add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(n/k);
6. If m < k, then set m = m + 1, find another pair of data points in D between which the distance is the shortest, form another data-point set Am, delete them from D, and go to step 4;
7. For each data-point set Am (1 <= m <= k), find the arithmetic mean of the vectors of the data points in Am; these means will be the initial centroids.

Phase 1 [4] describes the method for finding the initial centroids of the clusters. Initially, compute the distances between each data point and all other data points in the set. Then find the closest pair of data points and form a set A1 consisting of these two data points, and delete them from the data-point set D. Then determine the data point which is closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. At that point, go back to the second step and form another data-point set A2. Repeat this until 'k' such sets of data points are obtained. Finally, the initial centroids are obtained by averaging all the vectors in each data-point set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids. The distance between one vector X = (x1, x2, ..., xn) and another vector Y = (y1, y2, ..., yn) is obtained as

d(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

The distance between a data point X and a data-point set D is defined as

d(X, D) = min (d(X, Y), where Y ∈ D).
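A minimal Java sketch of Phase 1, written directly against the steps above, is given below. The names are ours and the Euclidean metric is assumed, as in the text; it is a straightforward quadratic-time rendering of the description rather than an optimized implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of Phase 1: systematic selection of initial centroids.
public class Phase1Sketch {

    static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    // d(X, A) = min over Y in A of d(X, Y), as defined above.
    static double distToSet(double[] x, List<double[]> set) {
        double best = Double.MAX_VALUE;
        for (double[] y : set) best = Math.min(best, euclidean(x, y));
        return best;
    }

    static double[][] initialCentroids(double[][] data, int k) {
        List<double[]> d = new ArrayList<>(Arrays.asList(data)); // working copy of D
        int n = data.length, dim = data[0].length;
        int threshold = (int) (0.75 * n / k);       // step 5: 0.75*(n/k) points per set
        double[][] centroids = new double[k][dim];

        for (int m = 0; m < k; m++) {
            // Steps 3 and 6: seed Am with the closest remaining pair.
            int pi = 0, pj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < d.size(); i++)
                for (int j = i + 1; j < d.size(); j++) {
                    double dij = euclidean(d.get(i), d.get(j));
                    if (dij < best) { best = dij; pi = i; pj = j; }
                }
            List<double[]> am = new ArrayList<>();
            am.add(d.get(pi));
            am.add(d.get(pj));
            d.remove(pj);                           // remove the higher index first
            d.remove(pi);
            // Steps 4 and 5: grow Am with the point nearest to it.
            while (am.size() < threshold && !d.isEmpty()) {
                int nearest = 0;
                for (int i = 1; i < d.size(); i++)
                    if (distToSet(d.get(i), am) < distToSet(d.get(nearest), am))
                        nearest = i;
                am.add(d.remove(nearest));
            }
            // Step 7: the arithmetic mean of Am is the m-th initial centroid.
            for (double[] p : am)
                for (int t = 0; t < dim; t++) centroids[m][t] += p[t] / am.size();
        }
        return centroids;
    }
}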


The initial centroids of the clusters are given as input to the second phase, for assigning data points to appropriate clusters. The steps involved in this phase are outlined as Phase 2.

Phase 2: Assigning data points to clusters.

Input:

D = {d1, d2, ..., dn} // set of n data points
C = {c1, c2, ..., ck} // set of k centroids

Output: A set of k clusters.

Steps:

1. Compute the distance of each data point di (1<=i<=n) to all the centroids cj (1<=j<=k) as d(di, cj);
2. For each data point di, find the closest centroid cj and assign di to cluster j;
3. Set ClusterId[i] = j; // j: id of the closest cluster
4. Set Nearest_Dist[i] = d(di, cj);
5. For each cluster j (1<=j<=k), recalculate the centroids;
6. Repeat
7. For each data point di,
   7.1 Compute its distance from the centroid of the present nearest cluster;
   7.2 If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
       Else
       7.2.1 For every centroid cj (1<=j<=k), compute the distance d(di, cj); Endfor;
       7.2.2 Assign the data point di to the cluster with the nearest centroid cj;
       7.2.3 Set ClusterId[i] = j;
       7.2.4 Set Nearest_Dist[i] = d(di, cj);
   Endfor;
8. For each cluster j (1<=j<=k), recalculate the centroids;
Until the convergence criterion is met.

The first step in Phase 2 is to determine the distance between each data point and the initial centroids of all the clusters. The data points are then assigned to the clusters having the closest centroids. This results in an initial grouping of the data points. For each data point, the cluster to which it is assigned (ClusterId) and its distance from the centroid of the nearest cluster (Nearest_Dist) are noted. Inclusion of data points in various clusters may lead to a change in the values of the cluster centroids. For each cluster, the centroids are recalculated by taking the mean of the values of its data points. Up to this step, the procedure is almost the same as the original k-means algorithm, except that the initial centroids are computed systematically.

The next stage is an iterative process which makes use of a heuristic method to improve the efficiency. During the iterations, the data points may get redistributed to different clusters. The method involves keeping track of the distance between each data point and the centroid of its present nearest cluster. At the beginning of each iteration, the distance of each data point from the new centroid of its present nearest cluster is determined. If this distance is less than or equal to the previous nearest distance, the data point stays in that cluster and there is no need to compute its distances to the other centroids. This saves the time required to compute the distances to k-1 cluster centroids. On the other hand, if the new centroid of the present nearest cluster is more distant from the data point than its previous centroid, the data point may belong to another, nearer cluster. In that case, it is required to determine the distance of the data point from all the cluster centroids, identify the new nearest cluster and record the new value of the nearest distance. The loop is repeated until no more data points cross cluster boundaries, which is the convergence criterion. The heuristic method described above results in a significant reduction in the number of computations and thus improves the efficiency.


4. Experiment and Results

The improved k-means algorithm has been designed and implemented in this project to improve the execution time of the k-means algorithm and to obtain better accuracy. The original k-means algorithm has also been implemented, for comparison with the improved algorithm in time and accuracy. Both implementations were written in the Java programming language and tested in the same environment.

The multivariate, sequential Libras Movement dataset has 360 instances and 90 attributes; the Iris dataset has 150 instances and 4 attributes; the Wine Quality dataset has 6497 instances and 12 attributes; the Blood Transfusion dataset has 748 instances and 5 attributes; and the TAE (Teaching Assistant Evaluation) dataset has 151 instances and 6 attributes. All these datasets, taken from the UCI machine learning repository [9], are used for testing the efficiency and accuracy of the improved algorithm. Each dataset is given as input to both the original k-means algorithm and the improved algorithm.

The results of the experiments are tabulated in Table 1 and Table 2. Table 1 compares the efficiency of the two algorithms in terms of time for all the datasets; Table 2 compares their accuracy for all the datasets.

Table 1. Comparison of efficiency of the two algorithms

Time taken (in ms)
Dataset             K-means (centroids     Improved K-means (centroids
                    selected randomly)     computed by program)
Libras Movement     478                    261
Iris                20                     10
TAE                 20                     2
Wine Quality        426                    112
Blood Transfusion   35                     5

[Figure 1. Comparison of efficiency of two algorithms]

Table 2. Comparison of accuracy of the two algorithms

Accuracy (in %)
Dataset             K-means (centroids     Improved K-means (centroids
                    selected randomly)     computed by program)
Libras Movement     59                     61
Iris                66                     90
TAE                 58                     61
Wine Quality        26                     30
Blood Transfusion   59                     64

[Figure 2. Comparison of accuracy of two algorithms]

Figure 1 and Figure 2 depict the performance of the standard k-means algorithm and the improved k-means algorithm in terms of efficiency and accuracy for all the datasets. It can be seen from the results that the improved algorithm significantly outperforms the standard k-means algorithm in terms of both efficiency and accuracy.

5. Conclusion

The original k-means algorithm is widely used for clustering large sets of data. But it does not always guarantee good results, as the accuracy of the final clusters depends on the selection of the initial centroids.


Moreover, the computational complexity of the original algorithm is very high, because it reassigns the data points a number of times during every iteration of the loop. Here we have presented an improved k-means algorithm which combines a systematic method for finding the initial centroids with an efficient way of assigning data points to clusters. This method completes the entire clustering process in less time without sacrificing the accuracy of the clusters.

6. References

[1] S. Kantabutra, "Parallel K-means Clustering Algorithm on NOWs", Department of Computer Science, Tufts University, 1999.
[2] Berkhin Pavel, "A Survey of Clustering Data Mining Techniques", Springer Berlin Heidelberg, 2006.
[3] R. Ali, U. Ghani, A. Saeed, "Data Clustering and Its Applications", Rudjer Boskovic Institute, 2001.
[4] Yuan F., Meng Z. H., Zhang H. X. and Dong C. R., "A New Algorithm to Get the Initial Centroids", Proc. of the 3rd International Conference on Machine Learning and Cybernetics, pages 26-29, August 2004.
[5] Sun Shibao, Qin Keyun, "Research on Modified k-means Data Cluster Algorithm", Computer Engineering, vol. 33, no. 13, pp. 200-201, July 2007.
[6] Yu-Fang Zhang, Jia-Li Mao and Zhong-Yang Xiong, "An Efficient Clustering Algorithm", Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 November 2003.
[7] Juntao Wang and Xiaolong Su, "An improved K-means clustering algorithm".
