© 2015 Juhi Katara & Naveen Choudhary. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
A Modified Version of the K-Means Clustering
Algorithm
Juhi Katara α & Naveen Choudhary σ
Abstract- Clustering is a technique in data mining which divides a given data set into small clusters based on their similarity. The K-means clustering algorithm is a popular, unsupervised and iterative clustering algorithm which divides a given dataset into k clusters. But the traditional k-means clustering algorithm has some drawbacks: it takes more time to run, as it has to calculate the distance between each data object and all centroids in each iteration, and the accuracy of the final clustering result mainly depends on the correctness of the initial centroids, which are selected randomly. This paper proposes ...

I. INTRODUCTION

Data mining refers to using a variety of data analysis techniques and tools to discover previously unknown, valid patterns and ...

... agglomerative or divisive, based on how the hierarchical decomposition is designed. A density based clustering algorithm uses the notion of density for clustering data objects: it either grows clusters according to the density of neighborhood objects or according to some density function. A grid based clustering algorithm first quantizes the object space into a finite number of cells that form a grid structure, and then performs clustering on the grid structure. A model based clustering algorithm attempts to ...

... distinct final clusters. The centroid of a cluster may be defined as the mean of the objects in the cluster; it is not necessarily a member of the dataset.
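As a concrete illustration of the centroid-as-mean definition (a toy sketch, not code from the paper):

```python
def centroid(cluster):
    """Component-wise mean of the data objects in a cluster."""
    n = len(cluster)
    dims = len(cluster[0])
    return [sum(obj[d] for obj in cluster) / n for d in range(dims)]

# Four 2-D data objects forming one cluster.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
print(centroid(points))  # [4.0, 5.0]
```

Note that (4.0, 5.0) is not one of the four objects, showing that a centroid need not be a member of the dataset.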
Algorithm 1: Traditional k-means clustering algorithm
Input: D = {d1, d2, ......, dn} // set of n data objects
k // number of required clusters
Output: k clusters
Steps:
1. Randomly select k data objects from D as initial centroids.
2. Calculate the distance between each data object di (1<=i<=n) and all k cluster centroids cj (1<=j<=k), then allocate data object di to the cluster which has the closest centroid.
3. Calculate the new mean for each cluster. // the new mean is the updated centroid of the cluster
4. Repeat steps 2 and 3 until there is no change in the centroids of the clusters.

III. DRAWBACKS OF TRADITIONAL K-MEANS CLUSTERING ALGORITHM

The traditional K-means clustering algorithm has several drawbacks. The major drawback is that its performance mainly depends on the initial centroids, which are selected randomly, so the resulting clusters differ between runs on the same input dataset. Another drawback is the distance calculation process, which takes a long time to converge: the algorithm calculates the distance from each data object to every cluster centroid in each iteration, even though there is no need to calculate that distance every time, since some data objects remain in the same cluster after several iterations. This affects the performance of the algorithm. One more drawback of k-means clustering is that the number of clusters to be formed must be given as input by the user.

IV. RELATED WORK

... the dissimilarity to reflect the degree of correlation between data objects, and then uses a Huffman tree to find the initial centroids. It takes less time because the iterations diminish through the Huffman algorithm.

Shi Na et al. [3] proposed an improved k-means clustering algorithm to increase the efficiency of the k-means clustering algorithm. This algorithm requires two simple data structures to store, in every iteration, information that is then used in the next iteration. The improved algorithm does less calculation, which saves run time.

Mohammed El Agha et al. [4] proposed an improved k-means clustering algorithm with ElAgha initialization, which uses a guided random technique, since the k-means clustering algorithm suffers from the initial centroids problem. ElAgha initialization outperformed random initialization and enhanced the quality of clustering by a big margin on complex datasets.

K. A. Abdul Nazeer et al. [2] proposed an algorithm to enhance the accuracy and efficiency of the k-means clustering algorithm. This algorithm consists of two phases. The first phase determines the initial centroids systematically so as to produce clusters with better accuracy. The second phase allocates the data objects to the appropriate clusters in less time. This algorithm outputs good clusters in less run time.

V. PROPOSED ALGORITHM

In this section a modified algorithm is proposed for improving the performance of the k-means clustering algorithm. In paper [3], the authors proposed an improved k-means clustering algorithm to improve the efficiency of the k-means clustering algorithm, but in that algorithm the initial centroids are selected randomly, so the method is very sensitive to the initial centroids, as random selection of initial centroids does not guarantee a unique clustering result. In paper [5], the authors proposed an improved k-means clustering algorithm with optimal initial centroids based on dissimilarity. However, that algorithm is computationally complex and requires more time to run.
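For reference, the traditional k-means procedure of Algorithm 1, which both [3] and the proposed method build on, can be sketched in Python as follows. Euclidean distance is assumed, and all names are illustrative rather than taken from the paper:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two data objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(data, k, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly select k data objects as the initial centroids.
    centroids = rng.sample(data, k)
    while True:
        # Step 2: allocate every object to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for obj in data:
            j = min(range(k), key=lambda c: euclidean(obj, centroids[c]))
            clusters[j].append(obj)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: repeat steps 2 and 3 until the centroids stop changing.
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
clusters, centroids = kmeans(data, k=2)
```

With well-separated points such as these, any random initialization converges to the same two clusters; on harder data the result depends on the initial draw, which is exactly the weakness the proposed method targets.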
© 2015 Global Journals Inc. (US). Global Journal of Computer Science and Technology (C), Volume XV Issue VII Version I.
In this paper we propose a new approach for selecting better initial centroids which outputs a unique clustering result and increases the accuracy of the basic k-means clustering algorithm; the proposed approach is combined with the algorithm of paper [3] for allocating the data objects to the suitable cluster. The algorithm of paper [3] is referred to as the Shi Na improved k-means clustering algorithm in this paper. We compared the traditional k-means clustering algorithm, the Shi Na improved k-means clustering algorithm [3] and the proposed algorithm in terms of time and accuracy.

In the proposed algorithm, the distance of each data object from the origin is calculated. The data objects are then sorted in accordance with these distances; insertion sort is used for sorting in this paper. The sorted data objects are divided into k equal sets, and the middle data object of each set is taken as an initial centroid. This process of selecting centroids outputs a better, unique clustering result. Next, for every data object in the dataset the distance from every initial centroid is calculated; the data objects are assigned to the cluster which has the closest centroid. The subsequent step is an iterative process which reduces the required run time. Two data structures, cluster[] and dist[], are required to store information about the completed iteration of the algorithm: the array cluster[] stores the number of the cluster to which a data object belongs, and the array dist[] stores the distance of every data object from the closest centroid.

Algorithm 2: Modified k-means clustering algorithm
Input: D = {d1, d2, d3, .........., dn} // dataset of n data objects
k // number of required clusters
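The centroid-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: Euclidean distance from the origin is assumed, Python's built-in sort stands in for the insertion sort used in the paper, and any remainder after dividing n objects into k equal sets is simply left in the final set:

```python
import math

def initial_centroids(data, k):
    """Select k initial centroids as described in the proposed algorithm."""
    # 1. Compute the distance of each data object from the origin
    #    and sort the objects by that distance.
    ordered = sorted(data, key=lambda obj: math.sqrt(sum(x * x for x in obj)))
    # 2. Divide the sorted objects into k equal sets.
    size = len(ordered) // k
    # 3. Take the middle data object of each set as an initial centroid.
    return [ordered[i * size + size // 2] for i in range(k)]

data = [(1.0, 1.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.0), (0.5, 0.5), (7.5, 8.5)]
print(initial_centroids(data, k=2))  # [(1.0, 1.0), (7.5, 8.5)]
```

Because the selection is deterministic, repeated runs on the same dataset start from the same centroids, which is what yields the unique clustering result claimed above.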
VI. EXPERIMENTAL RESULTS

a) Iris dataset
The Iris dataset contains three classes of iris flower: setosa, versicolour and virginica. The dataset contains 150 instances; each class contains 50 instances with four attributes: sepal length, sepal width, petal length and petal width.
b) Wine dataset
This dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivators. The dataset contains 178 instances and ...

Algorithm                    Accuracy (in %)    Time (in ms)
Traditional k-means               76                 86
Shi Na improved k-means [3]       80                 24
Proposed algorithm                89                  4
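The paper does not spell out how the accuracy figures are computed. A common convention for labelled benchmark data, assumed here purely for illustration, maps each cluster to its majority true class and scores the fraction of instances that land in a matching cluster:

```python
from collections import Counter

def clustering_accuracy(cluster_labels, true_labels):
    """Majority-vote accuracy: each cluster counts as its most common true class."""
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, []).append(t)
    # For each cluster, the instances of its most frequent class count as correct.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return 100.0 * correct / len(true_labels)

# Toy example: cluster 0 is mostly class 'a', cluster 1 is all class 'b'.
print(clustering_accuracy([0, 0, 0, 1, 1], ['a', 'a', 'b', 'b', 'b']))  # 80.0
```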
Fig. 3: Accuracy comparison chart for the Wine dataset

VII. CONCLUSION

The K-means clustering algorithm is one of the most popular and effective algorithms for clustering datasets, and it is used in a number of fields in scientific and commercial applications. However, this algorithm has several drawbacks: the selection of initial centroids is random, which does not guarantee a unique clustering result, and k-means clustering performs a large number of iterations and distance calculations, which results in a long run time. Various enhancements to the traditional k-means clustering algorithm have been carried out by different researchers ...