An Improvement in K Means Clustering Algorithm IJERTV2IS1385
International Journal of Engineering Research & Technology (IJERT)
ISSN: 2278-0181
Vol. 2 Issue 1, January 2013
the mean of the cluster to which the examined object belongs. The K-means algorithm tries to group the items in a dataset into the desired number of clusters. To perform this task it iterates until it converges: after each iteration the calculated means are updated so that they move closer to the final means, and eventually the algorithm converges and stops iterating.

Steps of Algorithm:

Input:
D = {d1, d2, ..., dn} // set of n data items
k // number of desired clusters

Output: A set of k clusters.

Steps:
1. Arbitrarily choose k data items from D as initial centroids;
2. Repeat
   Assign each item di to the cluster which has the closest centroid;
   Calculate the new mean for each cluster;
   Until the convergence criterion is met.

K-means offers good scalability and converges fast when dealing with large data sets. However, it also has several deficiencies: the number of clusters k must be fixed in advance, the initial cluster centers are selected arbitrarily, and the algorithm is influenced by noise points. In view of these shortcomings of the traditional K-means clustering algorithm, Juntao Wang et al. [7] proposed an improved K-means algorithm using a noise-data filter. Their algorithm adds density-based detection methods, built on the characteristics of noise data, so that the discovery and processing of noise points are folded into the original algorithm. However, this method does not improve the time complexity.

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prevents it from being used to cluster real-world data containing categorical values. Zhexu Huang [10] proposed two algorithms which extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values. However, this method improves neither the time complexity nor the quality of the final clusters.
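The basic K-means procedure described in the steps above can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function name `kmeans`, the random seed, and the convergence test (stop when no assignment changes) are our choices.

```python
import math
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Basic K-means: arbitrary initial centroids, then iterate
    assignment and mean-update until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)          # step 1: arbitrary initial centroids
    assignment = None
    for _ in range(max_iter):                # step 2: repeat until convergence
        # assign each item di to the cluster with the closest centroid
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in data
        ]
        if new_assignment == assignment:     # converged: nothing moved
            break
        assignment = new_assignment
        # calculate the new mean for each cluster
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:                      # keep old centroid if cluster is empty
                centroids[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return centroids, assignment

# e.g. two obvious clusters in the plane:
points = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 3.9)]
centroids, labels = kmeans(points, k=2)
```

Note that the result depends on which initial centroids the random draw picks, which is exactly the sensitivity the related work above tries to remove.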
So, we have proposed a new Improved K-means algorithm which can handle the above problems very well.

4. Find the data point in D that is closest to the data point set Am; add it to Am and delete it from D;
The Blood Transfusion dataset has 748 instances and 5 attributes, and the TAE (Teaching Assistant Evaluation) dataset has 151 instances and 6 attributes. All these datasets, taken from the UCI repository of machine learning databases [9], are used to test the efficiency and accuracy of the improved algorithm. Each data set is given as input to both the original k-means algorithm and the improved algorithm.
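To make the comparison concrete, here is a small self-contained harness in the spirit of the experiment. Everything in it is illustrative rather than the paper's setup: the data are two synthetic Gaussian blobs rather than the UCI datasets, and the "computed" centroids use a simple farthest-point rule as a stand-in for the paper's program-computed initialization.

```python
import math
import random

def kmeans_iters(data, centroids, max_iter=100):
    """Run basic K-means from the given initial centroids and
    return (final centroids, iterations until assignments stabilize)."""
    centroids = list(centroids)
    k = len(centroids)
    assignment = None
    for it in range(1, max_iter + 1):
        new = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
               for p in data]
        if new == assignment:            # converged: nothing moved
            return centroids, it
        assignment = new
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, max_iter

# two well-separated synthetic blobs stand in for a UCI dataset
rng = random.Random(0)
data = ([(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)]
        + [(rng.gauss(6, 0.3), rng.gauss(6, 0.3)) for _ in range(100)])

# standard K-means: initial centroids selected randomly
cents_r, iters_random = kmeans_iters(data, rng.sample(data, 2))

# "computed" centroids: a crude farthest-point stand-in
first = data[0]
second = max(data, key=lambda p: math.dist(p, first))
cents_c, iters_computed = kmeans_iters(data, [first, second])
```

On blobs this well separated, the computed initialization reaches a stable assignment in two passes, while a random draw that lands both centroids in the same blob needs more; the tables below report the analogous effect on the real datasets.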
Time taken (in ms):

  Datasets            K-means (centroids     Improved K-means (centroids
                      selected randomly)     computed by program)
  Libras Movement     478                    261
  Iris                 20                     10
  TAE                  20                     02
  Wine quality        426                    112
  Blood Transfusion    35                     05

Figure 1. Comparison of efficiency of two algorithms

Accuracy (in %):

  Datasets            K-means (centroids     Improved K-means (centroids
                      selected randomly)     computed by program)
  Libras Movement      59                     61
  Iris                 66                     90
  TAE                  58                     61
  Wine quality         26                     30
  Blood Transfusion    59                     64

Figure 2. Comparison of accuracy of two algorithms

Figure 1 and Figure 2 depict the performance of the standard K-means algorithm and the Improved K-means algorithm in terms of efficiency and accuracy for all the datasets. It can be seen from the results that the improved algorithm significantly outperforms the standard K-means algorithm in terms of efficiency and accuracy.

5. Conclusion

The original k-means algorithm is widely used for clustering large sets of data. But it does not always guarantee good results, as the accuracy of the final clusters depends on the selection of the initial centroids.
6. References
[1] S. Kantabutra, "Parallel K-means Clustering Algorithm on NOWs", Department of Computer Science, Tufts University, 1999.
[2] Berkhin Pavel, "A Survey of Clustering Data Mining Techniques", Springer Berlin Heidelberg, 2006.
[3] R. Ali, U. Ghani, A. Saeed, "Data Clustering and Its Applications", Rudjer Boskovic Institute, 2001.
[4] Yuan F., Meng Z. H., Zhang H. X. and Dong C. R., "A New Algorithm to Get the Initial Centroids", Proc. of the 3rd International Conference on Machine Learning and Cybernetics, pages 26–29, August 2004.
[5] Sun Shibao, Qin Keyun, "Research on Modified k-means Data Cluster Algorithm", Computer Engineering, vol. 33, no. 13, pp. 200–201, July 2007.
[6] Yu-Fang Zhang, Jia-li Mao and Zhong-Yang Xiong, "An Efficient Clustering Algorithm",