A Novel Approach For Data Clustering Using Improved K-Means Algorithm
partitions of the data, where each partition represents a cluster, and k ≤ n, the number of data objects [6]. It clusters the data into k (the number of clusters) groups, which together fulfil the following requirements:

i) Each group must contain at least one object, and

ii) Each object must belong to exactly one group [8].

K-means Algorithm [1]:

Input: D = {d1, d2, ..., dn} // D contains the data objects
k // user-defined number of clusters

Output: A set of k clusters.

Steps:
1. Randomly choose k data items from D as the initial centroids;
2. Repeat
Assign each item di to the cluster whose centroid is closest;
Calculate the new mean of each cluster;
Until the convergence criterion is met.
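A minimal Python/NumPy sketch of this procedure is given below for illustration; the function name `kmeans` and its parameters (`max_iter`, `seed`) are our own choices and are not part of the paper's listing.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Basic k-means as in the listing above: random initial centroids,
    then repeat assignment and mean update until no object changes cluster."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data items from D as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assign each item di to the cluster whose centroid is closest (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Convergence criterion: data objects stop moving across cluster boundaries.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Calculate the new mean for each cluster (an empty cluster keeps its old centroid).
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

For example, `kmeans(np.random.rand(200, 2), k=3)` returns three centroids and a cluster label for each of the 200 points.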
As shown in the above algorithm, the original k-means algorithm consists of two phases: the first determines the initial centroids, and the second assigns each data object to the nearest cluster and then recomputes the cluster centroids. The second phase is carried out repeatedly until the clusters become stable, i.e., until data objects stop moving across cluster boundaries [3].

The k-means algorithm is effective in producing good clustering results for many applications [4]. The reasons for its popularity are the ease and simplicity of implementation, scalability, speed of convergence and adaptability to sparse data [4]. K-means is simple, can easily be applied to cluster data in practice, and its time complexity is O(nkt), where n is the number of objects, k is the number of clusters and t is the number of iterations, so it is generally regarded as very fast. However, the original k-means algorithm is computationally expensive, and the quality of the resulting clusters depends heavily on the selection of the initial centroids. K-means clustering is a partitioning clustering technique in which clusters are formed with the help of centroids. On the basis of these centroids, clusters can vary from one another in different iterations; data elements can also move from one cluster to another, as clusters are based on randomly chosen centroids [6]. The k-means algorithm is the most extensively studied clustering algorithm and is generally effective in producing good results, but it is computationally expensive and requires time proportional to the product of the number of data items, the number of clusters and the number of iterations.
3. STUDY OF THE VARIOUS APPROACHES OF MODIFIED K-MEANS ALGORITHMS

Several attempts have been made by researchers to improve the effectiveness and efficiency of the k-means algorithm [1, 3, 4, 6, 7, 8]. All the algorithms reviewed in this paper address the same common problems of the k-means algorithm, such as clustering large datasets, the number of iterations of the algorithm, defining the number of clusters, and the selection of the initial cluster centers, so a comparison of all the algorithms can be made on the basis of these problems.

Tian et al. [1] proposed a systematic method for finding the initial centroids. This technique gives better results and less iterative time than the existing k-means algorithm, and it adds almost no burden to the system. The method decreases the iterative time of the k-means algorithm, making the clustering analysis more efficient. However, the result for small data sets is not very notable, since the refinement algorithm operates over a small subset of a quite large data set.

Abdul Nazeer et al. [3] proposed a systematic method for finding the initial centroids. In this enhanced algorithm, the data objects and the value of k are the only inputs required, since the initial centroids are computed automatically by the algorithm. It provides a systematic method for finding the initial centroids and an efficient way of assigning data objects to clusters. A limitation of the proposed algorithm is that the value of k, the number of desired clusters, still has to be given as an input, regardless of the distribution of the data points.
FAHIM A et al. [4] proposed a systematic method for finding the initial centroids. In this approach, some heuristic information is kept from every iteration so that fewer distance calculations from data objects to centroids are needed in the next iteration: in each iteration a centroid moves closer to some data objects and farther from others, and the points that become closer to their centroid stay in that cluster, so there is no need to compute their distances to the other cluster centroids. Only the points that move farther from their center may change cluster, so distances to the other cluster centers are calculated only for these data objects, which are then assigned to the nearest center. The result is a simple and efficient clustering algorithm based on k-means that is easy to implement, requiring only a simple data structure to keep some information in each iteration for use in the next one. A sketch of this idea is given below.
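The following is a simplified, illustrative Python version of this distance-caching heuristic, written for this review rather than taken from [4]; the names `kmeans_cached` and `cached` are our own, and the caching rule (keep a point in its cluster whenever its distance to its own updated centroid has not grown) is one straightforward reading of the description above.

```python
import numpy as np

def kmeans_cached(data, k, max_iter=100, seed=0):
    """k-means with a cached distance per point: only points whose distance to
    their own (updated) centroid grew are compared against all centroids."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    # Initial full assignment and cached distance to the assigned centroid.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    cached = dists[np.arange(len(data)), labels]
    for _ in range(max_iter):
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        # Distance of every point to its own (possibly moved) centroid.
        own = np.linalg.norm(data - centroids[labels], axis=1)
        stay = own <= cached          # heuristic: these points keep their cluster
        cached[stay] = own[stay]
        moved = np.where(~stay)[0]    # only these need distances to all centroids
        if moved.size == 0:
            break
        d = np.linalg.norm(data[moved, None, :] - centroids[None, :, :], axis=2)
        new_lab = d.argmin(axis=1)
        changed = (new_lab != labels[moved]).any()
        labels[moved] = new_lab
        cached[moved] = d[np.arange(moved.size), new_lab]
        if not changed:
            break
    return centroids, labels
```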
Dr. Urmila R. et al. [6] proposed a systematic method for clustering large datasets. The algorithm in this paper is designed around data-level parallelism: the master processor divides the given data objects into N partitions and assigns each partition to a processor; the master processor then calculates K centroids and broadcasts them to every processor; each processor calculates new centroids and sends them back to the master processor; and the master processor recalculates the global centroids and broadcasts them to every processor. These steps are repeated until the final, stable clusters are obtained. In this algorithm the number of clusters is fixed at three, and the initial centroids are initialized to the minimum value, the maximum value and the N/2-th data point of the total data objects. A rough sketch of this master/worker pattern follows.
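The sketch below illustrates this master/worker scheme with Python's multiprocessing module; it is not the authors' implementation, and the helper names (`parallel_kmeans`, `_partial_sums`) as well as the use of the vector norm to pick the minimum, middle and maximum data object for multi-attribute data are assumptions made for this example.

```python
import numpy as np
from multiprocessing import Pool

def _partial_sums(args):
    """Worker: assign one partition to the broadcast centroids and return
    per-cluster sums and counts for the master to aggregate."""
    part, centroids = args
    d = np.linalg.norm(part[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    k = len(centroids)
    sums = np.array([part[labels == j].sum(axis=0) for j in range(k)])
    counts = np.array([(labels == j).sum() for j in range(k)])
    return sums, counts

def parallel_kmeans(data, n_partitions=4, max_iter=100, tol=1e-6):
    """Master: partition the data, broadcast centroids, aggregate partial
    results into new global centroids, and repeat until they stabilise."""
    data = np.asarray(data, dtype=float)
    # k is fixed at three; initial centroids are the minimum, middle (N/2-th)
    # and maximum data objects, ordered here by vector norm (an assumption).
    order = np.argsort(np.linalg.norm(data, axis=1))
    centroids = data[[order[0], order[len(data) // 2], order[-1]]].copy()
    partitions = np.array_split(data, n_partitions)
    with Pool(n_partitions) as pool:
        for _ in range(max_iter):
            results = pool.map(_partial_sums, [(p, centroids) for p in partitions])
            sums = sum(r[0] for r in results)
            counts = sum(r[1] for r in results)
            new_centroids = np.where(counts[:, None] > 0,
                                     sums / np.maximum(counts, 1)[:, None],
                                     centroids)
            if np.allclose(new_centroids, centroids, atol=tol):
                break
            centroids = new_centroids
    return centroids
```

On platforms that spawn worker processes (e.g. Windows), the call to `parallel_kmeans` should be placed under an `if __name__ == "__main__":` guard.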
Yugal Kumar et al. [7] proposed a systematic method for finding the initial centroids. In this paper, a new algorithm based on the binary search technique is proposed for the problem of selecting the initial cluster centers for the K-Means algorithm. Binary search is a popular searching method used to find an item in a given (sorted) list. The algorithm is designed in such a way that the initial cluster centers are obtained using the binary search property, after which the usual assignment of data objects in the K-Means algorithm is applied to obtain optimal cluster centers for the dataset, as illustrated in the sketch below.
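The paper's exact initialization procedure is not reproduced in this review, so the sketch below is only a hypothetical illustration of the general idea: the data objects are sorted, the sorted index range is bisected repeatedly (the binary search property) to obtain k well-spread candidate positions, and the objects at those positions are used as the initial centers before running the standard k-means assignment. The name `binary_split_centers` and the ordering by vector norm are invented for this example.

```python
import numpy as np

def binary_split_centers(data, k):
    """Pick k initial centers by repeatedly bisecting the sorted data range.
    Hypothetical illustration of a binary-search style initialization: points
    are ordered by vector norm and the midpoints of successively halved index
    intervals are taken as candidate centers."""
    data = np.asarray(data, dtype=float)
    order = np.argsort(np.linalg.norm(data, axis=1))
    intervals = [(0, len(data) - 1)]
    centers = []
    while len(centers) < k and intervals:
        lo, hi = intervals.pop(0)
        mid = (lo + hi) // 2              # binary-search style midpoint
        centers.append(data[order[mid]])
        if lo < mid:                      # bisect the remaining halves
            intervals.append((lo, mid - 1))
        if mid < hi:
            intervals.append((mid + 1, hi))
    return np.array(centers[:k])
```

These centers could then be supplied to a standard k-means routine (for example the `kmeans` sketch shown earlier, with its random initialization replaced by this one).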
Shafeeq et al. [8] proposed a systematic method for dynamic clustering of data. In this paper a dynamic clustering ...
[Results figures: bar charts comparing the K-means algorithm and the proposed algorithm in terms of accuracy, number of iterations and execution time on the Ds1.10 (2000 objects, 5 attributes), bfi (2500 objects, 29 attributes) and iqitems datasets.]

Figure 5: Graph for Execution time

The proposed algorithm is compared with the k-means algorithm in terms of accuracy, number of iterations and efficiency.
Figure 7: Comparison of accuracy on different datasets

7. ACKNOWLEDGMENT
This research was supported by my mentor Shubha Puthran and faculty guide Dr. Dhirendra Mishra. I am grateful to them for sharing their pearls of wisdom with me during the course of this research.

8. REFERENCES
[1]. Farajian, Mohammad Ali, and Shahriar Mohammadi. "Mining the banking customer behavior using clustering and association rules methods." International Journal of Industrial Engineering 21, no. 4 (2010).

[2]. Bhatia, M. P. S., and Deepika Khurana. "Experimental study of Data clustering using k-Means and modified algorithms." International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol 3 (2013).
[3]. Jain, Sapna, M. Afshar Aalam, and M. N. Doja. "K-means clustering using weka interface." In Proceedings of the 4th National Conference. 2010.

[4]. Kumar, M. Varun, M. Vishnu Chaitanya, and M. Madhavan. "Segmenting the Banking Market Strategy by Clustering." International Journal of Computer Applications 45 (2012).

[5]. Namvar, Morteza, Mohammad R. Gholamian, and Sahand KhakAbi. "A two phase clustering method for intelligent customer segmentation." In Intelligent Systems, Modelling and Simulation (ISMS), 2010 International Conference on, pp. 215-219. IEEE, 2010.

[6]. Tian, Jinlan, Lin Zhu, Suqin Zhang, and Lu Liu. "Improvement and parallelism of k-means clustering algorithm." Tsinghua Science & Technology 10, no. 3 (2005): 277-281.

[7]. Zhao, Weizhong, Huifang Ma, and Qing He. "Parallel k-means clustering based on mapreduce." In Cloud Computing, pp. 674-679. Springer Berlin Heidelberg, 2009.

[8]. Nazeer, K. A. Abdul, and M. P. Sebastian. "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm." In Proceedings of the World Congress on Engineering, vol. 1, pp. 1-3. 2009.

[9]. Fahim, A. M., A. M. Salem, F. A. Torkey, and M. A. Ramadan. "An efficient enhanced k-means clustering algorithm." Journal of Zhejiang University SCIENCE A 7, no. 10 (2006): 1626-1633.

[10]. Rasmussen, Edie M., and Peter Willett. "Efficiency of hierarchic agglomerative clustering using the ICL distributed array processor." Journal of Documentation 45, no. 1 (1989): 1-24.

[11]. Pol, Urmila R. "Enhancing K-means Clustering Algorithm and Proposed Parallel K-means Clustering for Large Data Sets." International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014.

[12]. Kumar, Yugal, and G. Sahoo. "A New Initialization Method to Originate Initial Cluster Centers for K-Means Algorithm." International Journal of Advanced Science and Technology 62 (2014): 43-54.

[13]. Shafeeq, Ahamed, and K. S. Hareesha. "Dynamic clustering of data with modified k-means algorithm." In Proceedings of the 2012 conference on information and computer networks, pp. 221-225. 2012.

[14]. Ben-Dor, Amir, Ron Shamir, and Zohar Yakhini. "Clustering gene expression patterns." Journal of Computational Biology 6, no. 3-4 (1999): 281-297.

[16]. Aloise, Daniel, Amit Deshpande, Pierre Hansen, and Preyas Popat. "NP-hardness of Euclidean sum-of-squares clustering." Machine Learning 75, no. 2 (2009): 245-248.

[17]. Wang, Haizhou, and Mingzhou Song. "Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming." The R Journal 3, no. 2 (2011): 29-33.

[18]. Al-Daoud, Moth'D. Belal. "A new algorithm for cluster initialization." In WEC'05: The Second World Enformatika Conference. 2005.

[19]. Wang, X. Y., and Jon M. Garibaldi. "A comparison of fuzzy and non-fuzzy clustering techniques in cancer diagnosis." In Proceedings of the 2nd International Conference in Computational Intelligence in Medicine and Healthcare, BIOPATTERN Conference, Costa da Caparica, Lisbon, Portugal, p. 28. 2005.

[20]. Liu, Ting, Charles Rosenberg, and Henry A. Rowley. "Clustering billions of images with large scale nearest neighbor search." In Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on, pp. 28-28. IEEE, 2007.

[21]. Oyelade, O. J., O. O. Oladipupo, and I. C. Obagbuwa. "Application of k Means Clustering algorithm for prediction of Students Academic Performance." arXiv preprint arXiv:1002.2425 (2010).

[22]. Akkaya, Kemal, Fatih Senel, and Brian McLaughlan. "Clustering of wireless sensor and actor networks based on sensor distribution and connectivity." Journal of Parallel and Distributed Computing 69, no. 6 (2009): 573-587.

[23]. https://sites.google.com/site/dataclusteringalgorithms/clustering-algorithm-applications

[24]. Pakhira, Malay K. "A modified k-means algorithm to avoid empty clusters." International Journal of Recent Trends in Engineering 1, no. 1 (2009).

[25]. Singh, Kehar, Dimple Malik, and Naveen Sharma. "Evolving limitations in K-means algorithm in data mining and their removal." International Journal of Computational Engineering & Management 12 (2011): 105-109.

[26]. Suryawanshi, Rishikesh, and Shubha Puthran. "Review of Various Enhancement for Clustering Algorithms in Big Data Mining." International Journal of Advanced Research in Computer Science and Software Engineering (2016).

[27]. http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html#sec:kmeans

[28]. https://archive.ics.uci.edu/ml/datasets.html