
Global Journal of Computer Science and Technology: C

Software & Data Engineering


Volume 15 Issue 7 Version 1.0 Year 2015
Type: Double Blind Peer Reviewed International Research Journal
Publisher: Global Journals Inc. (USA)
Online ISSN: 0975-4172 & Print ISSN: 0975-4350

A Modified Version of the K-Means Clustering Algorithm


By Juhi Katara & Naveen Choudhary
Maharana Pratap University of Agriculture and Technology, India
Abstract- Clustering is a technique in data mining which divides a given data set into small clusters based on their similarity. The k-means clustering algorithm is a popular, unsupervised and iterative clustering algorithm which divides a given dataset into k clusters. But the traditional k-means clustering algorithm has some drawbacks: it takes more time to run, as it has to calculate the distance between each data object and all centroids in each iteration, and the accuracy of the final clustering result mainly depends on the correctness of the initial centroids, which are selected randomly. This paper proposes a methodology which finds better initial centroids; this method is then combined with an existing improved method for assigning data objects to clusters, which requires two simple data structures to store information about each iteration for use in the next iteration. The proposed algorithm is compared in terms of time and accuracy with the traditional k-means clustering algorithm as well as with a popular improved k-means clustering algorithm.
Keywords: clustering, data mining, initial centroids, k-means clustering.
GJCST-C Classification : B.2.4 B.7.1


Strictly as per the compliance and regulations of:

© 2015. Juhi Katara & Naveen Choudhary. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
A Modified Version of the K-Means Clustering
Algorithm
Juhi Katara α & Naveen Choudhary σ

Abstract- Clustering is a technique in data mining which divides a given data set into small clusters based on their similarity. The k-means clustering algorithm is a popular, unsupervised and iterative clustering algorithm which divides a given dataset into k clusters. But the traditional k-means clustering algorithm has some drawbacks: it takes more time to run, as it has to calculate the distance between each data object and all centroids in each iteration, and the accuracy of the final clustering result mainly depends on the correctness of the initial centroids, which are selected randomly. This paper proposes a methodology which finds better initial centroids; this method is then combined with an existing improved method for assigning data objects to clusters, which requires two simple data structures to store information about each iteration for use in the next iteration. The proposed algorithm is compared in terms of time and accuracy with the traditional k-means clustering algorithm as well as with a popular improved k-means clustering algorithm.

Keywords: clustering, data mining, initial centroids, k-means clustering.

Author α σ: Department of Computer Science & Engineering, College of Technology and Engineering, Maharana Pratap University of Agriculture and Technology, Udaipur, Rajasthan, India. e-mail: [email protected]

I. Introduction

Data mining refers to using a variety of data analysis techniques and tools to discover previously unknown, valid patterns and relationships in large datasets [5]. Data mining techniques like clustering and association can be used to find meaningful patterns for future predictions. Clustering may be defined as a preprocessing step in data mining algorithms in which the data objects are divided into clusters with high intra-cluster similarity and low inter-cluster similarity [3], [10].

Clustering can be applied to a wide range of fields like pattern recognition, marketing, image processing etc. [3]. Clustering algorithms are mainly divided into partitioning, hierarchical, density based, grid based and model based clustering algorithms. A partitioning clustering algorithm first creates an initial set of k partitions, where parameter k is the number of partitions to construct; it then uses an iterative relocation technique that tries to improve the clustering by moving objects from one class to another. A hierarchical clustering algorithm creates a hierarchical decomposition of the dataset using some criterion; the method can be categorized as either agglomerative or divisive, based on how the hierarchical decomposition is designed. A density based clustering algorithm uses the notion of density for clustering data objects: it grows clusters either according to the density of neighborhood objects or according to some density function. A grid based clustering algorithm first quantizes the object space into a finite number of cells that form a grid structure, and then performs clustering on the grid structure. A model based clustering algorithm attempts to optimize the fit between the given data and some mathematical model.

K-means clustering is a partitioning clustering technique in which clusters are formed with the help of centroids. It follows an unsupervised, non-deterministic and iterative approach to clustering, and proceeds by minimizing the average squared Euclidean distance between the data objects and the cluster centroids. The result of the k-means clustering algorithm is affected by the choice of initial centroids: distinct initial centroids might result in distinct final clusters. The centroid of a cluster may be defined as the mean of the objects in the cluster; it is not necessarily a member of the dataset.

II. Traditional K-Means Clustering Algorithm

K-means clustering is the most popular clustering algorithm [9]. In traditional k-means clustering the given dataset is classified into k disjoint clusters, where the value of k is given as input to the algorithm. The algorithm is implemented in two phases. In the first phase k centroids are selected randomly. In the second phase each data object is assigned to the cluster whose centroid is closest. When all data objects have been assigned to one of the k clusters, the first iteration is completed and an early grouping is done. After completion of the first iteration the centroids are recalculated by taking the mean of the data objects of each cluster. As k new centroids are calculated, a new assignment is done between the same data objects and the new centroids, generating a loop that results in a number of iterations. As a result of this loop the k centroids and the data objects may change their position in a step by step manner. Ultimately a situation will occur where the centroids do not update anymore. This means the

© 2015 Global Journals Inc. (US)


convergence criterion for clustering is achieved. In this algorithm the Euclidean distance is generally used to find the distance between data objects and centroids [3]. Between one data object X = (x1, x2, ..., xn) and another data object Y = (y1, y2, ..., yn) the Euclidean distance d(X, Y) is calculated as follows:

d(X, Y) = [(x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2]^(1/2)

Algorithm 1: The Traditional K-Means Clustering Algorithm [3]

Input: D = {d1, d2, ..., dn} // set of n data objects
       k // number of required clusters
Output: k clusters
Steps:
1. Randomly select k data objects from D as initial centroids.
2. Calculate the distance between each data object di (1 <= i <= n) and all k cluster centroids cj (1 <= j <= k), then allocate data object di to the cluster with the closest centroid.
3. Calculate the new mean for each cluster. // the new mean is the updated centroid of the cluster
4. Repeat steps 2 and 3 until there is no change in the centroids of the clusters.

III. Drawbacks of the Traditional K-Means Clustering Algorithm

The traditional k-means clustering algorithm has several drawbacks. The major drawback is that its performance mainly depends on the initial centroids, which are selected randomly, so the resulting clusters differ between runs on the same input dataset. Another drawback is the distance calculation process, which makes the algorithm slow to converge: it calculates the distance from each data object to every cluster centroid in each iteration, even though there is often no need to recalculate these distances each time, since some data objects remain in the same cluster over several iterations. This affects the performance of the algorithm. One more drawback of k-means clustering is that the number of clusters to be formed must be given as input by the user.

IV. Related Work

Xiuyun Li et al. [1] proposed an enhanced k-means clustering algorithm based on fuzzy feature selection. This algorithm generates a feature important factor (FIF) weight to describe the contribution of each feature to the clustering, and makes use of the FIF to improve the similarity measure and thereby the clustering result.

Wang Shunye et al. [5] proposed an improved k-means clustering algorithm with optimal initial centroids based on dissimilarity. This algorithm computes the dissimilarity to reflect the degree of correlation between data objects and then uses a Huffman tree to find the initial centroids. It takes less time because the number of iterations diminishes through the Huffman algorithm.

Shi Na et al. [3] proposed an improved k-means clustering algorithm to increase the efficiency of the k-means clustering algorithm. This algorithm requires two simple data structures to store information in every iteration for use in the next iteration. The improved algorithm does fewer calculations, which saves run time.

Mohammed El Agha et al. [4] proposed an improved k-means clustering algorithm with ElAgha initialization, which uses a guided random technique, since the k-means clustering algorithm suffers from the initial centroids problem. ElAgha initialization outperformed random initialization and enhanced the quality of clustering by a large margin on complex datasets.

K. A. Abdul Nazeer et al. [2] proposed an algorithm to enhance the accuracy and efficiency of the k-means clustering algorithm. This algorithm consists of two phases: the first phase determines initial centroids systematically so as to produce clusters with better accuracy, and the second phase allocates data objects to the appropriate clusters in less time. This algorithm outputs good clusters in a short run time.

V. Proposed Algorithm

In this section a modified algorithm is proposed for improving the performance of the k-means clustering algorithm. In paper [3] the authors proposed an improved k-means clustering algorithm to improve the efficiency of the k-means clustering algorithm, but in that algorithm the initial centroids are selected randomly, so the method is very sensitive to the initial centroids: random selection of initial centroids does not guarantee a unique clustering result. In paper [5] the authors proposed an improved k-means clustering algorithm with optimal initial centroids based on dissimilarity; however, this algorithm is computationally complex and requires more time to run.
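For concreteness, the traditional procedure of Algorithm 1 can be sketched in Python. This is an illustrative sketch, not the paper's MATLAB implementation; the function and variable names are our own, and no guard against empty clusters is included.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Traditional k-means (Algorithm 1): random initial centroids and a full
    object-to-centroid distance computation in every iteration."""
    rng = np.random.default_rng(seed)
    # Phase 1: randomly select k data objects as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Phase 2: distance from every data object to every centroid,
        # recomputed from scratch each iteration (the cost Section III criticises).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # The new mean of each cluster becomes its updated centroid.
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```

Note the full n-by-k distance matrix recomputed in every iteration; this is exactly the cost that the modification below sets out to reduce.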
In this paper we propose a new approach for selecting better initial centroids which outputs a unique clustering result and increases the accuracy of the basic k-means clustering algorithm; this approach is combined with the algorithm of paper [3] for allocating the data objects to the suitable cluster. The algorithm of paper [3] is referred to as the Shi Na improved k-means clustering algorithm in this paper. We compared the traditional k-means clustering algorithm, the Shi Na improved k-means clustering algorithm [3] and the proposed algorithm in terms of time and accuracy.

Algorithm 2: The Modified K-Means Clustering Algorithm

Input: D = {d1, d2, d3, ..., dn} // dataset of n data objects
       k // number of required clusters
Output: a set of k clusters
Steps:
1. Calculate the distance of each data object di in the dataset D from the origin.
2. Sort the data objects according to the distances obtained in step 1.
3. Divide the sorted data objects into k equal sets.
4. Select the middle data object of each set as an initial centroid.
5. Calculate the distance between each data object di (1 <= i <= n) and all k cluster centroids cj (1 <= j <= k) as the Euclidean distance d(di, cj).
6. For each data object di find the closest centroid cj and assign di to cluster j.
7. Repeat steps 8 to 11 until there is no change in the centroids of the clusters.
8. Store the cluster number in array cluster[ ]: set cluster[i] = j.
9. Store the distance of each data object from its closest centroid in array dist[ ]: set dist[i] = d(di, cj).
10. For each cluster j (1 <= j <= k) recalculate the cluster centroid.
11. For each data object di:
    11.1 Compute its distance from the newly computed centroid of its present cluster.
    11.2 If this distance is less than or equal to the previous closest distance, the data object remains in the same cluster.
         Else:
         11.2.1 For every centroid cj (1 <= j <= k) calculate the distance d(di, cj), then assign di to the cluster with the closest centroid.
    End For

In the proposed algorithm the distance of each data object from the origin is calculated. The data objects are then sorted in accordance with these distances; insertion sort is used for sorting in this paper. The sorted data objects are divided into k equal sets, and the middle data object of each set is taken as an initial centroid. This process of selecting centroids outputs a better, unique clustering result. Then for every data object in the dataset the distance from every initial centroid is calculated. The next step is an iterative process which reduces the required run time. The data objects are assigned to the cluster with the closest centroid. Two data structures, cluster[ ] and dist[ ], are required to store information about the completed iteration of the algorithm: array cluster[ ] stores the number of the cluster to which each data object belongs, and array dist[ ] stores the distance of every data object from its closest centroid. Next, for each cluster obtained in the completed iteration, the new centroid is calculated by taking the mean of its data objects.

Then for each data object the distance from the newly calculated centroid of its present cluster is calculated. If this distance is less than or equal to the previous closest distance, the data object remains in the same cluster; otherwise, for every such remaining data object the distance from all the newly calculated centroids is computed and the data object is assigned to the cluster with the closest centroid. The arrays cluster[ ] and dist[ ] are then updated with the new values obtained in this step. This reassignment process is repeated until there is no change in the centroids of the clusters.

VI. Experimental Results and Discussion

All the experiments are carried out on an Intel Core i3 based PC with 4 GB RAM, running the Windows 7 64-bit operating environment; the programming platform is MATLAB version R2013a.

In this paper two datasets are taken from the UCI repository of machine learning databases [6] to test the performance of the proposed k-means clustering algorithm and to compare the traditional k-means clustering algorithm, the Shi Na improved k-means clustering algorithm [3] and the proposed algorithm. The Iris and Wine datasets are selected as the test datasets [6]. The values of the attributes are numeric.

A brief introduction of the datasets used in the experimental evaluation is given in the table below:

Table 1: Characteristics of datasets

Dataset   Number of attributes   Number of instances
Iris      4                      150
Wine      13                     178
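The centroid selection of steps 1-4 and the cluster[ ]/dist[ ] bookkeeping of steps 7-11 can be sketched as follows. This is an illustrative Python sketch, not the paper's MATLAB code: NumPy's sort stands in for the insertion sort used in the paper, and no guard against empty clusters is included.

```python
import numpy as np

def initial_centroids(data, k):
    """Steps 1-4 of Algorithm 2: sort objects by distance from the origin,
    split into k equal sets, take the middle object of each set."""
    order = np.argsort(np.linalg.norm(data, axis=1))  # paper uses insertion sort
    sets = np.array_split(order, k)
    return data[[s[len(s) // 2] for s in sets]]

def modified_kmeans(data, k, max_iter=100):
    """Steps 5-11: k-means with the dist[]/cluster[] shortcut of paper [3]."""
    centroids = initial_centroids(data, k)
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    cluster = d.argmin(axis=1)  # cluster[i]: index of the closest centroid
    dist = d.min(axis=1)        # dist[i]: distance to that centroid
    for _ in range(max_iter):
        # Step 10: recalculate centroids (assumes no cluster becomes empty).
        new_centroids = np.array([data[cluster == j].mean(axis=0) for j in range(k)])
        moved = not np.allclose(new_centroids, centroids)
        centroids = new_centroids
        # Step 11.1: distance to the new centroid of each object's present cluster.
        own = np.linalg.norm(data - centroids[cluster], axis=1)
        keep = own <= dist  # step 11.2: these objects stay put, no full scan needed
        dist[keep] = own[keep]
        if (~keep).any():   # step 11.2.1: only the rest compare with all centroids
            d = np.linalg.norm(data[~keep, None, :] - centroids[None, :, :], axis=2)
            cluster[~keep] = d.argmin(axis=1)
            dist[~keep] = d.min(axis=1)
        if not moved:       # step 7: centroids unchanged, convergence reached
            break
    return cluster, centroids
```

Because the initialization is deterministic, repeated runs on the same dataset return the same clustering, which is the "unique result" property claimed for the proposed method.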


a) Iris dataset

The Iris dataset contains three classes of iris flower: setosa, versicolour and virginica. The dataset contains 150 instances in three classes; each class contains 50 instances with four attributes: sepal length, sepal width, petal length and petal width.

b) Wine dataset

This dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivators. The dataset contains 178 instances in three classes with 13 attributes; the first class contains 59 instances, the second class 71 and the third class 48. The attributes of the dataset are alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavonoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline.

The same datasets are given as input to all the algorithms. The number of clusters k is set to three for both datasets. The experiment compares the proposed k-means clustering algorithm with the traditional k-means clustering algorithm and with the Shi Na improved k-means clustering algorithm [3] in terms of time and accuracy.

Accuracy: the number of correctly predicted instances divided by the total number of instances.

Time: the amount of time that passes from the start of an algorithm to its finish.

The accuracy of clustering is determined by comparing the clustering results with the class labels already available in the UCI datasets [6]. The traditional and the Shi Na improved k-means clustering algorithms give different accuracy and time on every run, as they select the initial centroids randomly, so these algorithms are executed several times and the averages of accuracy and time are taken. The accuracy of the proposed k-means clustering algorithm is the same on every run, but the time differs between runs, so it too is executed several times and the average time is taken.

Fig. 1: Accuracy comparison chart for Iris dataset

Fig. 2: Time comparison chart for Iris dataset
Table 2: Performance comparison on Iris dataset

Parameters        Traditional k-means    Shi Na improved k-means    Proposed k-means
                  clustering algorithm   clustering algorithm       clustering algorithm
Accuracy (in %)   76                     80                         89
Time (in ms)      86                     24                         4
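The paper does not spell out how cluster numbers are matched against the UCI class labels when computing accuracy; one common convention, assumed in the sketch below, is to credit each cluster with its majority class.

```python
import numpy as np
from collections import Counter

def clustering_accuracy(pred_clusters, true_labels):
    """Accuracy = correctly predicted instances / total instances, with each
    cluster mapped to the majority true class among its members (a common
    convention; the paper does not specify its matching rule)."""
    correct = 0
    for j in np.unique(pred_clusters):
        members = true_labels[pred_clusters == j]
        # Count the members belonging to the cluster's most frequent class.
        correct += Counter(members.tolist()).most_common(1)[0][1]
    return correct / len(true_labels)
```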

Fig. 3: Accuracy comparison chart for Wine dataset

Fig. 4: Time comparison chart for Wine dataset

Table 3: Performance comparison on Wine dataset

Parameters        Traditional k-means    Shi Na improved k-means    Proposed k-means
                  clustering algorithm   clustering algorithm       clustering algorithm
Accuracy (in %)   64                     66                         70
Time (in ms)      115                    27                         10

The result of the experiment shows that the proposed k-means clustering algorithm outputs a better, unique clustering result in less time than the traditional k-means clustering algorithm and the Shi Na improved k-means clustering algorithm [3], as it selects better initial centroids, which reduces the number of iterations. The Shi Na improved method [3] of assigning data objects to the appropriate clusters results in a smaller number of distance calculations. The proposed algorithm combines both of these methods and so runs in less time, while at the same time improving the accuracy of the algorithm.

VII. Conclusion

The k-means clustering algorithm is one of the most popular and effective algorithms for clustering datasets and is used in a number of fields, from scientific to commercial applications. However, the algorithm has several drawbacks: the selection of the initial centroids is random, which does not guarantee a unique clustering result, and k-means clustering involves a large number of iterations and distance calculations, which results in a long run time. Various enhancements of the traditional k-means clustering algorithm have been carried out by different researchers, addressing different drawbacks. The proposed algorithm combines a systematic way of selecting initial centroids with an efficient method for assigning data objects to clusters, and is found to be more accurate, efficient and feasible. The value of k, the required number of clusters, is still required as an input to the proposed algorithm; intelligent pre-estimation of the value of k is suggested as future work.

References

1. Xiuyun Li, Jie Yang, Qing Wang, Jinjin Fan, Peng Liu, "Research and Application of Improved K-means Algorithm Based on Fuzzy Feature Selection", Fifth International Conference on Fuzzy Systems and Knowledge Discovery, vol. 1, IEEE, 2008.
2. K. A. Abdul Nazeer and M. P. Sebastian, "Improving the accuracy and efficiency of the k-means clustering algorithm", International Conference on Data Mining and Knowledge Engineering (ICDMKE), Proceedings of the World Congress on Engineering (WCE 2009), vol. 1, London, UK, July 2009.
3. Shi Na, Liu Xumin, Guan Yong, "Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm", Third International Symposium on Intelligent Information Technology and Security Informatics, IEEE, 2010.
4. Mohammed El Agha, Wesam M. Ashour, "Efficient and Fast Initialization Algorithm for K-means Clustering", I.J. Intelligent Systems and Applications, 2012.
5. Wang Shunye, Cui Yeqin, Jin Zuotao, Liu Xinyuan, "K-means algorithm in the optimal initial centroids based on dissimilarity", Journal of Chemical and Pharmaceutical Research, 2013.


6. Merz C. and Murphy P., UCI Repository of Machine Learning Databases, available: ftp://ftp.ics.uci.edu/pub/machine-learning-databases
7. Charles Elkan, "Using the Triangle Inequality to Accelerate k-Means", Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington DC, 2003.
8. Hong Liu and Xiaohong Yu, "Application Research of k-means Clustering Algorithm in Image Retrieval System", Proceedings of the Second Symposium on International Computer Science and Computational Technology (ISCSCT), 2009.
9. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd edition, Morgan Kaufmann, 2006.
10. Osama Abu Abbas, "Comparisons between data clustering algorithms", The International Arab Journal of Information Technology, vol. 5, no. 3, July 2008.
11. Oyelade O. J., Oladipupo O. O., Obagbuwa I. C., "Application of k-Means Clustering algorithm for prediction of Students Academic Performance", International Journal of Computer Science and Information Security, vol. 7, 2010.
12. Chunfei Zhang, Zhiyi Fang, "An Improved K-means Clustering Algorithm", Journal of Information & Computational Science, 2013.