
Pattern Recognition 93 (2019) 95–112

Contents lists available at ScienceDirect

Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog

How much can k-means be improved by using better initialization and repeats?

Pasi Fränti, Sami Sieranoja∗
Machine Learning Group, School of Computing, University of Eastern Finland, P.O. Box 111, FIN-80101 Joensuu, Finland

Article info

Article history:
Received 13 August 2018
Revised 11 March 2019
Accepted 13 April 2019
Available online 15 April 2019

Keywords: Clustering algorithms; K-means; Initialization; Clustering accuracy; Prototype selection

Abstract

In this paper, we study what are the most important factors that deteriorate the performance of the k-means algorithm, and how much this deterioration can be overcome either by using a better initialization technique, or by repeating (restarting) the algorithm. Our main finding is that when the clusters overlap, k-means can be significantly improved using these two tricks. Simple furthest point heuristic (Maxmin) reduces the number of erroneous clusters from 15% to 6%, on average, with our clustering benchmark. Repeating the algorithm 100 times reduces it further down to 1%. This accuracy is more than enough for most pattern recognition applications. However, when the data has well separated clusters, the performance of k-means depends completely on the goodness of the initialization. Therefore, if high clustering accuracy is needed, a better algorithm should be used instead.

© 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

∗ Corresponding author. E-mail addresses: pasi.franti@uef.fi (P. Fränti), sami.sieranoja@uef.fi, [email protected].fi (S. Sieranoja).
https://doi.org/10.1016/j.patcog.2019.04.014
0031-3203/© 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

K-means (KM) algorithm [1–3] groups N data points into k clusters by minimizing the sum of squared distances between every point and its nearest cluster mean (centroid). This objective function is called sum-of-squared errors (SSE). Although k-means was originally designed for minimizing SSE of numerical data, it has also been applied for other objective functions (even some non-numeric).

Sometimes the term k-means is used to refer to the clustering problem of minimizing SSE [4–7]. However, we consider here k-means as an algorithm. We study how well it performs as a clustering algorithm to minimize the given objective function. This approach follows the recommendation in [8] to establish a clear distinction between the clustering method (objective function) and the clustering algorithm (how it is optimized).

In real-life applications, the selection of the objective function is much more important. Clustering results depend primarily on the selected objective function, and only secondarily on the selected algorithm. A wrong choice of the function can easily reverse the benefit of a good algorithm, so that a proper objective function with a worse algorithm can provide better clustering than a good algorithm with the wrong objective function. However, it is an open question how much clustering results are biased because of using an inferior algorithm.

There are other algorithms that are known, in many situations, to provide better clustering results than k-means. However, k-means is popular for good reasons. First, it is simple to implement. Second, people often prefer to use an extensively studied algorithm whose limitations are known rather than a potentially better, but less studied, algorithm that might have unknown or hidden limitations. Third, the local fine-tuning capability of k-means is very effective, and for this reason, it is also used as part of better algorithms such as the genetic algorithm [9,10], random swap [11,12], particle swarm optimization [13], spectral clustering [14], and density clustering [15]. Therefore, our results can also help better understand those more complex algorithms that rely on the use of k-means.

K-means starts by selecting k random data points as the initial set of centroids, which is then improved by two subsequent steps. In the assignment step, every point is put into the cluster of the nearest centroid. In the update step, the centroid of every cluster is recalculated as the mean of all data points assigned to the cluster. Together, these two steps constitute one iteration of k-means. These steps fine-tune both the cluster borders and the centroid locations. The algorithm is iterated a fixed number of times, or until convergence (no further improvement is obtained). MacQueen also presented a sequential variant of k-means [2], where the centroid is updated immediately after every single assignment.
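To make the assignment and update steps concrete, the following is a minimal sketch of one k-means iteration in Python with NumPy. It is an illustration only and not the authors' implementation; the handling of empty clusters (keeping the old centroid) is our own assumption.

```python
# Minimal sketch of one k-means iteration (assignment + update).
# Illustration only. X has shape (N, D); centroids has shape (k, D).
import numpy as np

def kmeans_iteration(X, centroids):
    # Assignment step: put every point into the cluster of the nearest centroid.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Update step: recalculate every centroid as the mean of its assigned points.
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:      # keep the old centroid if the cluster became empty
            new_centroids[j] = members.mean(axis=0)
    return new_centroids, labels
```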

K-means has excellent fine-tuning capabilities. Given a rough allocation of the initial cluster centroids, it can usually optimize their locations locally. However, the main limitation of k-means is that it rarely succeeds in optimizing the centroid locations globally. The reason is that the centroids cannot move between the clusters if their distance is big, or if there are other stable clusters in between preventing the movements, see Fig. 1. The k-means result therefore depends a lot on the initialization. Poor initialization can cause the iterations to get stuck in an inferior local minimum.

Fig. 1. K-means is excellent in fine-tuning cluster borders locally but fails to relocate the centroids globally. Here a minus sign (−) represents a centroid that is not needed, and a plus sign (+) a cluster where more centroids would be needed. K-means cannot do it because there are stable clusters in between.

Fortunately, finding the exact optimum is not always important. In pattern recognition applications, the goal can be merely to model the distribution of the data, and the clustering result is used as a part in a more complex system. In [16], the quality of the clustering was shown not to be critical for the speaker recognition performance when any reasonable clustering algorithm, including repeated k-means, was used.

However, if the quality of clustering is important, then the k-means algorithm has problems. For example, if we need to solve the number of clusters, the goodness of the algorithm matters much more. Experiments with three different indexes (WB, DBI, Dunn) have shown that k-means rarely achieves the correct number of clusters whereas random swap succeeded in most cases [17]. Similar observations were made with a stability-based approach in [18].

To compensate for the mentioned weaknesses of k-means, two main approaches have been considered: (1) using a better initialization, (2) repeating k-means several times from different initial solutions. Numerous initialization techniques have been presented in the literature, including the following:

• Random points
• Furthest point heuristic
• Sorting heuristic
• Density-based
• Projection-based
• Splitting technique

Few comparative studies exist [19–22], and there is no consensus on which technique should be used. A clear state of the art is missing. Pena et al. [19] studied four basic variants: random centroids [1] and MacQueen's variant of it [2], random partition, and Kaufman's variant of the Maxmin heuristic [23]. Their results show that random partition and Maxmin outperform the random centroid variants with the three datasets (Iris, Ruspini, Glass).

He et al. [20] studied random centroids, random perturbation of the mean [24], greedy technique [25], Maxmin [26], and Kaufman's variant of Maxmin [23]. They observed that the Maxmin variants provide slightly better performance. Their argument is that the Maxmin variants are based on distance optimization, which tends to help k-means provide better cluster separation.

Steinley and Brusco [21] studied 12 variants including complete algorithms like agglomerative clustering [27] and global k-means [28]. They ended up recommending these two algorithms and Steinley's variant [29] without much reservation. The first two are already complete stand-alone algorithms themselves and not true initialization techniques, whereas the last one is a trivial improvement of the random partition.

Steinley and Brusco also concluded that agglomerative clustering should be used only if the size, dimensionality or the number of clusters is big; and that global k-means (GKM) [28] should be used if there is not enough memory to store the N² pairwise distances. However, these recommendations are not sound. First, agglomerative clustering can be implemented without storing the distance matrix [30]. Second, GKM is extremely slow and not practical for bigger datasets. Both these alternatives are also standalone algorithms and they provide better clustering even without k-means.

Celebi et al. [22] performed the most extensive comparison so far with 8 different initialization techniques on 32 real and 12,228 synthetic datasets. They concluded that random centroids and Maxmin often perform poorly and should not be used, and that there are significantly better alternatives with comparable computational requirements. However, their results do not clearly point out a single technique that would be consistently better than the others.

The detailed results in [22] showed that a sub-sampling and repeat strategy [31] performs consistently in the best group and that k-means++ performs generally well. For small datasets, Bradley's sub-sampling strategy or the greedy variant of k-means++ was recommended. For large data, a split-based algorithm was recommended.

The second major improvement, besides the initializations, is to repeat k-means [32]. The idea is simply to restart k-means several times from different initial solutions to produce several candidate solutions, and then keep the best result found as the final solution. This approach requires that the initialization technique produces different starting solutions by involving some randomness in the process. We call this variant repeated k-means (RKM). The number of repeats is typically small, like R = 20 in [33].

Many researchers consider the repeats an obvious and necessary improvement to k-means at the cost of increased processing time. Bradley and Fayyad [31] used a slightly different variant by combining the repeats and sub-sampling to avoid the increase in the processing time. Besides these papers, it is hard to find any systematic study of how the repeats affect k-means. For example, none of the review papers investigate the effect of the repeats on the performance.

To sum up, the existing literature provides merely relative comparisons between the selected initialization techniques. They lack clear answers on the significance of the results, and present no analysis of which type of data the techniques work or fail on. Many of the studies also use classification datasets, which have limited suitability for studying the clustering performance.

We made a brief survey of how recent research papers apply k-means. Random centroids [5,34,35] seems to be the most popular initialization method, along with k-means++ [6,33,36]. Some papers do not specify how they initialize [37], or it had to be concluded indirectly. For example, Boutsidis [5] used the default method available in MATLAB, which was random centroids in the 2014a version and k-means++ starting from the 2014b version. The method in [38] initializes both the centroids and the partition labels at random. However, as they apply the centroid step first, the random partition is effectively applied.

Fig. 2. Three examples of clustering results when using the SSE cost function. A Gaussian cluster is split into several spherical clusters (left); mismatch of the variance causes the larger cluster to be split (middle); mismatch of the cluster sizes does not matter if the clusters are well separated.

The number of k-means repeats varies from a relatively small amount of 10–20 [5,33,35] to a relatively high value of 100 [36]. The most extreme example is [34], where a 20 h time limit is applied. Although they stop iterating if the running time grows to twice that of their proposed algorithm, it is still quite extensive. Several papers do not repeat k-means at all [6,7,37].

The choice of the initialization and the number of repeats might also vary depending on the motivation. The aim of using k-means can be to have a good clustering result, or to provide merely a point of comparison. In the first case, all the good tricks are used, such as more repeats and better initialization. In the second case, some simpler variant is more likely applied. A counter-example is in [34], where serious efforts seem to be made to ensure all algorithms have the best possible performance.

In this paper we study the most popular initialization heuristics. We aim at answering the following questions. First, to what extent can k-means be improved by a better initialization technique? Second, can the fundamental weakness of k-means be eliminated simply by repeating the algorithm several times? Third, can we predict under which conditions k-means works, and under which it fails?

In a recent study [39], it was shown that k-means performs poorly when the clusters are well separated. Here we will answer how much a better initialization or repeats can compensate for this weakness. We will also show that dimensionality does not matter for most variants, and that unbalance of cluster sizes deteriorates the performance of most initializations.

The rest of the paper is organized as follows. In Section 2, we define the methodology and data. We also give a brief review of the properties of the standard k-means algorithm. Different initialization techniques are then studied in Section 3. Experimental analysis is performed in Section 4, and conclusions are drawn in Section 5.

2. Performance of k-means

Following the recommendation of Jain [8], we make a clear distinction between the clustering method and the clustering algorithm. Clustering method refers to the objective function, and clustering algorithm to the process optimizing it. Without this distinction, it would be easy to draw wrong conclusions.

For example, k-means has been reported to work poorly with unbalanced cluster sizes [40], and to cause large clusters to be wrongly split and smaller clusters wrongly merged [41]. These observations themselves are correct but they miss the root cause, which is the SSE objective function. Even an optimal algorithm minimizing SSE would end up with the same incorrect result. Such observations therefore relate to the objective function, and not to the k-means algorithm.

Fig. 2 demonstrates the situation. An algorithm minimizing SSE would find spherical clusters regardless of the data. If the data contain non-spherical clusters, they would be divided into spherical sub-clusters, usually along the direction of the highest variance. Clusters of variable sizes would also cause large clusters to be split, and smaller ones to be merged. In these cases, if natural clusters are wanted, a better clustering result could be achieved by using an objective function based on Mahalanobis distance [42] or a Gaussian mixture model [43] instead of SSE.

2.1. Datasets

In this paper, we focus on the algorithmic performance of k-means rather than the choice of the objective function. We use the clustering basic benchmark [39] as all these datasets can be clustered correctly with SSE. Therefore, any clustering errors made by k-means must originate from the properties of the algorithm, and not from the choice of a wrong objective function. The datasets are summarized in Table 1. They are designed to vary the following properties as defined in [39]:

• Cluster overlap
• Number of clusters
• Dimensionality
• Unbalance of cluster sizes

2.2. Methodology

To measure the success of the algorithm, the value of the objective function itself is the most obvious measure. Existing literature reviews of k-means use either SSE [19,22], or the deviation of the clusters [20], which is also a variant of SSE. It is calculated as:

SSE = \sum_{i=1}^{N} \| x_i - c_j \|^2        (1)

where x_i is a data point and c_j is its nearest centroid. In [39], SSE is also measured relative to the SSE-value of the ground truth solution (SSE_opt):

\varepsilon\text{-ratio} = \frac{SSE - SSE_{opt}}{SSE_{opt}}        (2)

If the ground truth is known, external indexes such as adjusted Rand index (ARI), Van Dongen (VD), variation of information (VI) or normalized mutual information (NMI) can also be used [22]. A comparative study of several suitable indexes can be found in [44]. The number of iterations has also been studied in [19,22], and the time complexities reported in [22].

Table 1
Basic clustering benchmark [39]. The data is publicly available here: http://cs.uef.fi/sipu/datasets/.

Dataset     Varying                Size       Dimensions  Clusters  Per cluster
A           Number of clusters     3000–7500  2           20–50     150
S           Overlap                5000       2           15        333
Dim         Dimensions             1024       32–1024     16        64
G2          Dimensions + overlap   2048       2–1024      2         1024
Birch       Structure              100,000    2           100       1000
Unbalance   Balance                6500       2           8         100–2000

The problem of SSE, and most of the external indexes, is that the raw value does not tell how significant the result is. We therefore use the Centroid Index (CI) [45], which indicates how many cluster centroids are wrongly located. Specifically, the value CI = 0 implies that the clustering structure is correct with respect to the ground truth.

An example is shown in Fig. 3, where k-means provides SSE = 3.08 × 10^10, which is 52% higher than that of the ground truth. But what do these numbers really mean? How significant is the difference? On the other hand, the value CI = 4 tells that exactly four real clusters are missing a centroid.

Fig. 3. Performance of k-means with the A2 dataset: CI = 4, SSE = 3.08 × 10^10, ε = 0.52.

Based on CI, a success rate (%) was also defined in [39] to measure the probability of finding the correct clustering. For example, when running k-means 5000 times with dataset A2 (Fig. 3), CI = 0 was never reached, and thus, its success rate is 0%. Another example with dataset S2 (Fig. 4) results in a success rate of 1/6 = 17%.

The success rate has an important implication. Any value higher than 0% indicates that the correct clustering can be found simply by repeating k-means. For a success rate p, the expected number of repeats is 1/p. For instance, p = 50% indicates that the expected number of repeats is 2; and p = 1% indicates 100 repeats. Even with as low a value as p = 0.1% the correct solution is expected to be found in 1000 repeats. This is time consuming, but feasible. However, for some of our datasets the success rate is so low that the number of repeats would be unreasonably high. For example, even 200,000 repeats produce a 0% success rate in our experiments with some datasets.

2.3. Properties of k-means

We next briefly summarize the main properties of the k-means algorithm. Generally the clustering problem is the easier the more the clusters are separated. However, in [39] it was found that for k-means it is just the opposite; the less overlap, the worse the clustering performance, see Fig. 5. This is a fundamental weakness of the k-means algorithm.

In [39], it was also found that the number of errors has a linear dependency on the number of clusters (k). For example, the CI-values for the A sets with k = 20, 35, 50 clusters were measured as CI = 2.5, 4.5, 6.5, respectively. The relative CI-values (CI/k) correspond to a constant of 13% of centroids being wrongly located. Results with the subsets of Birch2 (varying k from 1 to 100) converge to about 16% when k approaches 100, see Fig. 6.

Two series of datasets are used to study the dimensionality: DIM and G2. The DIM sets have 16 well separated clusters in high-dimensional space with dimensionality varying from D = 32 to 1024. Because of clear cluster separation, these datasets should be easy for any good clustering algorithm to reach CI = 0 and 100% success rate. However, k-means again performs poorly; it obtains the values CI = 3.6 and 0% success rate regardless of the dimensionality. The reason for the poor performance is again the lack of cluster overlap, and not the dimensionality.

The results with the G2 sets confirmed the dependency between the dimensionality and the success rate. We allocated four centroids with 3:1 unbalance so that the first cluster had three centroids and the second only one. We then ran k-means and checked whether it found the expected 2:2 allocation by moving one of the three centroids to the second group. The results in Fig. 7 show that the overlap is the mediating factor for the success rate: the more overlap, the lower the success rate of k-means.

The cluster size unbalance was also shown in [39] to result in poor performance. The main reason for this was the random initialization, which cannot pick the initial centroids in a balanced way. Another reason was the k-means iterations, which fail to improve the initial solution due to lack of cluster overlap.

The effect of the different properties of data on k-means can be summarized as follows:

Property:            Effect:
Cluster overlap      Overlap is good
Number of clusters   Linear dependency
Dimension            No direct effect
Unbalance            Bad

3. K-means initialization techniques

Next we study how much these problems of k-means can be solved by the following two improvements:

• Better initialization
• Repeating k-means

K-means is a good algorithm for local fine-tuning but it has a serious limitation in relocating the centroids when the clusters do not overlap. It is therefore unrealistic to expect the clustering problem to be solved simply by inventing a better initialization for k-means. The question is merely how much a better initialization can compensate for the weakness of k-means.
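The Centroid Index itself is simple to compute. The sketch below follows our reading of the definition in [45]: each centroid of one solution is mapped to its nearest centroid of the other solution, the orphan centroids (those never mapped to) are counted, and CI is the maximum over both mapping directions. The success rate and the expected number of repeats (1/p) follow directly from it, as noted above. Illustration only.

```python
# Sketch of the Centroid Index (CI) [45] and the derived success rate.
import numpy as np

def ci_one_way(A, B):
    # Map every centroid in A to its nearest centroid in B; count orphans in B.
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    mapped = set(dist.argmin(axis=1))
    return B.shape[0] - len(mapped)

def centroid_index(A, B):
    return max(ci_one_way(A, B), ci_one_way(B, A))

# Success rate over repeated runs: fraction of runs reaching CI == 0, e.g.
#   p = np.mean([centroid_index(run, ground_truth) == 0 for run in runs])
# Expected number of repeats to find the correct clustering: R = 1 / p.
```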

Fig. 4. Centroid index measures how many real clusters are missing a centroid (+), or how many centroids are allocated to wrong cluster (−). Six examples are shown for
S2 dataset.

Fig. 5. Success rate (%) of k-means, measured as the probability of finding correct clustering, improves when the cluster overlap increases.

Fig. 6. CI-value of k-means increases linearly with k, and relative CI converges to 16% with the Birch2 subsets.

Any clustering algorithm could be used as an initialization technique for k-means. However, solving the location of the initial centroids is not significantly easier than the original clustering problem itself. Therefore, for an algorithm to be considered as an initialization technique for k-means, in contrast to being a standalone algorithm, we set the following requirements:

1. Simple to implement
2. Lower (or equal) time complexity than k-means
3. No additional parameters

First, the algorithm should be trivial, or at least very easy to implement. Measuring implementation complexity can be subjective. The number of functions and the lines of code were used in [16]. Repeated k-means was counted to have 5 functions and 162 lines of C code. In comparison, random swap [11,12], a fast agglomerative clustering variant [30], and a sophisticated splitting algorithm [46] had 7, 12 and 22 functions, and 226, 317 and 947 lines of code, respectively. Random initialization had 2 functions and 26 lines of code.

Second, the algorithm should have lower or equal time complexity compared to k-means. Celebi et al. [22] categorize the algorithms into linear, log-linear and quadratic based on their time complexities. Spending quadratic time cannot be justified as the fastest agglomerative algorithms are already working in close to quadratic time [30]. A faster O(N log N) time variant also exists [47] but it is significantly more complex to implement and requires calculating the k-nearest neighbors (KNN). K-means requires O(gkN) time, where g is the number of iterations and typically varies from 20 to 50.

The third requirement is that the algorithm should be free of parameters other than k. For instance, there are algorithms [25,48] that select the first centroid using some simple rule, and the rest greedily by cluster growing, based on whether the point is within a given distance. A density-connectivity criterion was also used in [49]. Nevertheless, this approach requires one or more threshold parameters.

Table 2
Summary of the initialization techniques compared in this paper. Time refers to the average processing time with the A3 dataset (N = 7500, k = 50). Randomized refers to whether the technique includes randomness naturally. Randomness will be needed for the repeated k-means variant later.

Technique           Ref.     Complexity     Time    Randomized  Parameters
Random partitions   [3]      O(N)           10 ms   Yes         –
Random centroids    [1,2]    O(N)           13 ms   Yes         –
Maxmin              [54]     O(kN)          16 ms   Modified    –
kmeans++            [59]     O(kN)          19 ms   Yes         –
Bradley             [31]     O(kN + Rk²)    41 ms   Yes         R = 10, s = 10%
Sorting heuristic   [62]     O(N log N)     13 ms   Modified    –
Projection-based    [72]     O(N log N)     14 ms   Yes         –
Luxburg             [50]     O(kN log k)    29 ms   Yes         –
Split               [46,68]  O(N log N)     67 ms   Yes         k = 2

Fig. 7. The effect of overlap on the success of k-means with the G2 datasets. The numbers circled are for the three sample datasets shown above. The dataset names are coded as G2-DIM-SD, where DIM refers to the dimensions and SD to the standard deviation; the higher the SD, the more the two clusters overlap.

The most common heuristics are summarized in Table 2. We categorize them roughly into random, furthest point, sorting, and projection-based heuristics. Two standalone algorithms are also considered: Luxburg [50] and the Split algorithm. For a good review of several others we refer to [51].

3.1. Random centroids

By far the most common technique is to select k random data objects as the set of initial centroids [1,2]. It guarantees that every cluster includes at least one point. We use a shuffling method that swaps the position of every data point with another randomly chosen point. This takes O(N) time. After that, we take the first k points from the array. This guarantees that we do not select the same point twice, and that the selection is independent of the order of the data. For the random number generator we use the method in [52]. We refer to this initialization method as random centroids.

A slightly different variant in [2] selects simply the first k data points. This is the default option of Quick Cluster in IBM SPSS Statistics [53]. If the data is in random order, the result is effectively the same as random centroids, except that it always provides the same selection.

We note that the randomness is actually a required property for the repeated k-means variant. This is because we must be able to produce different solutions at every repeat. Some practitioners might not like the randomness and prefer deterministic algorithms always producing the same result. However, both of these goals can actually be achieved if so wanted. We simply use a pseudo-random number generator with the same seed number. In this way, single runs of k-means will produce different results but the overall algorithm still always produces the same result for the same input.

3.2. Random partitions

An alternative to random centroids is to generate random partitions. Every point is put into a randomly chosen cluster and their centroids are then calculated. The positive effect is that it avoids selecting outliers from the border areas. The negative effect is that the resulting centroids are concentrated in the central area of the data due to the averaging. According to our observations, the technique works well when the clusters are highly overlapped but performs poorly otherwise, see Fig. 8.

According to [19], the random partition avoids the worst case behavior more often than the random centroids. According to our experiments, this is indeed the case but only when the clusters have high overlap. The behavior of the random partition is also more deterministic than that of random centroids. This is because the centroids are practically always near the center of the data. Unfortunately, this also reduces the benefits of the repeated k-means because there is very little variation in the initial solutions, and therefore, also the final solutions often become identical.

Steinley [29] repeats the initialization 5000 times and selects the one with the smallest SSE. However, repeating only the initialization does not fix the problem. Instead, it merely slows down the initialization because it takes 5000N steps, which is typically much more than O(kN).

Thiesson et al. [24] calculate the mean point of the data set and then add random vectors to it. This effectively creates initial centroids like a cloud around the center of the data, with a very similar effect as the random partition. The size of this cloud is a parameter. If it is set high enough, the variant becomes similar to the random centroids technique, with the exception that it can select points also from empty areas.
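A minimal sketch of the two initializations just described (random centroids via shuffling, and random partition); illustration only, using NumPy's Generator in place of the generator of [52], and with our own re-seeding rule for the rare case of an empty random partition.

```python
# Sketches of random centroids (3.1) and random partitions (3.2). Illustration only.
import numpy as np

def random_centroids(X, k, rng):
    # Shuffle the point indices and take the first k points as centroids,
    # which guarantees that no point is selected twice.
    idx = rng.permutation(len(X))[:k]
    return X[idx].copy()

def random_partition(X, k, rng):
    # Put every point into a randomly chosen cluster and use the cluster means.
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                     else X[rng.integers(len(X))]      # re-seed an empty cluster
                     for j in range(k)])

# Usage: rng = np.random.default_rng(seed); a fixed seed keeps the overall run
# reproducible while successive calls still differ, as discussed above.
```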

Fig. 8. Initial centroids created by random partition (left), by Steinley’s variant (middle), and the final result after the k-means iterations (right).

Fig. 8 shows the effect of the random partition and Steinley's variant. Both variants locate the initial centroids near the center of the data. If the clusters have low overlap, k-means cannot provide enough movement and many of the far away clusters will lack centroids in the final solution.

3.3. Furthest point heuristic (Maxmin)

Another popular technique is the furthest point heuristic [54]. It was originally presented as a standalone 2-approximate clustering algorithm but has been widely used to initialize k-means. It selects an arbitrary point as the first centroid and then adds new centroids one by one. At each step, the next centroid is the point that is furthest (max) from its nearest (min) existing centroid. This is also known as Maxmin [19,21,22,55].

A straightforward implementation requires O(k²N) time but it can easily be reduced to O(kN) as follows. For each point, we maintain a pointer to its nearest centroid. When adding a new centroid, we calculate the distance of every point to this new centroid. If the new distance is smaller than that to the previous nearest, then it is updated. This requires N distance calculations. The process is repeated k times, and the time complexity is therefore O(kN) in total, which is the same as one iteration of k-means. Further speedup can be achieved by searching for the furthest point in just a subset of the data [56].

There are several alternative ways to choose the first centroid. In the original variant the selection is arbitrary [54]. In [55], the furthest pair of points are chosen as the first two centroids. Another variant selects the one with maximum distance to the origin [57] because it is likely to be located far from the center. Maximum density has also been used [51,58].

K-means++ [59] is a randomized variant of the furthest point heuristic. It chooses the first centroid randomly and the next ones using a weighted probability p_i = cost_i / SUM(cost_i), where cost_i is the squared distance of the data point x_i to its nearest centroid. This algorithm is an O(log k)-approximation to the problem. We also implement k-means++ for our tests because of its popularity.

Chiang and Mirkin [55] recalculate all the centroids after updating the partitions, and the next centroid is selected as the farthest from the recently added centroid. A slightly more complex variant [23] selects the point that decreases the objective function most. It requires calculation of all distances between every pair of points, which takes O(N²) time. Thus, it does not qualify under our criteria for k-means initialization. With the same amount of computation we could already run an agglomerative clustering algorithm.

Other authors also weight the distances by the density of the point [51,58]. This reduces the probability that outliers are selected. Erisoglu et al. [60] use cumulative distance to all previous centroids instead of the maxmin criterion. However, this performs worse because it can easily choose two nearby points provided that they have a large cumulative distance to all other centroids [61].

We use here a variant that selects the first point randomly [54,59]. This adds randomness to the process as required by the repeated k-means variant. The next centroids we select using the original maxmin criterion, i.e. choosing the point with the biggest distance to its nearest centroid.
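A minimal sketch of the O(kN) Maxmin bookkeeping described above, with the k-means++ selection as a small change (pick the next centroid with probability proportional to the squared distance). Illustration only, not the authors' code.

```python
# Sketch of Maxmin (3.3) and its randomized k-means++ variant. Each point keeps
# its squared distance to the nearest existing centroid, updated in O(N) per
# added centroid, giving O(kN) in total. Illustration only.
import numpy as np

def maxmin_init(X, k, rng, plusplus=False):
    centroids = [X[rng.integers(len(X))]]            # first centroid: random point
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)       # distance to nearest centroid
    for _ in range(k - 1):
        if plusplus:
            # k-means++: probability proportional to the squared distance.
            nxt = rng.choice(len(X), p=d2 / d2.sum())
        else:
            # Maxmin: the point furthest from its nearest existing centroid.
            nxt = int(d2.argmax())
        centroids.append(X[nxt])
        # Update every point's nearest-centroid distance (N operations).
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return np.array(centroids)
```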

Fig. 9. Example of the maxmin heuristic for the S3 dataset. The blue dots are the initial and the red dots the final centroids. The trajectories show their movement during the k-means iterations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The Maxmin technique helps to avoid the worst case behavior of the random centroids, especially when the cluster sizes have serious unbalance. It also has a tendency to pick up outlier points from the border areas, which leads to slightly inferior performance in the case of datasets with high overlap (S3 and S4). However, k-means usually works better with such datasets [39], which compensates for the weakness of Maxmin. Fig. 9 demonstrates the performance of the Maxmin technique.

3.4. Sorting heuristics

Another popular technique is to sort the data points according to some criterion. Sorting requires O(N log N) time, which is less than that of one k-means iteration, O(kN), assuming that log N ≤ k. After sorting, k points are selected from the sorted list using one of the following heuristics:

• First k points.
• First k points while disallowing points closer than ε to already chosen centroids.
• Every (N/k)th point (uniform partition)

For the sorting, at least the following criteria have been considered:

• Distance to center point [62]
• Density [21,63]
• Centrality [64]
• Attribute with the greatest variance [65]

Hartigan and Wong [62] sort the data points according to their distance to the center of the data. The centroids are then selected as every (N/k)th point in this order. We include this variant in our tests. To have randomness, we choose a random data point as a reference point instead of the center. This heuristic fulfills our requirements: it is fast, simple, and requires no additional parameters.

Astrahan [63] calculates density as the number of other points within a distance d1. The first centroid is the point with the highest density, and the remaining k-1 centroids are chosen in decreasing order of density, with the condition that they are not closer than distance d2 to an already chosen centroid. Steinley and Brusco [21] recommend using the average pairwise distance (pd) both for d1 and d2. This makes the technique free from parameters but it is still slow, O(N²) time, for calculating the pairwise distances.

It would be possible to simplify this technique further and use random sampling: select N pairs of points, and use this subsample to estimate the value of pd. However, the calculation of the densities is still the bottleneck, which prevents this approach from meeting the requirements for k-means initialization as such.

Cao et al. [64] proposed a similar approach. They use a primary criterion (cohesion) to estimate how central a point is (how far from the boundary). A secondary threshold criterion (coupling) is used to prevent centroids from being neighbors.

Al-Daoud [65] sorts the data points according to the dimension with the largest variance. The points are then partitioned into k equal size clusters. The median of each cluster is selected instead of the mean. This approach belongs to a more general class of projection-based techniques where the objects are mapped to some linear axis such as the diagonal or principal axis.

The sorting heuristic would work if the clusters were well separated, and all had a different criterion value (such as the distance from the center point). This actually happens with the very high dimensional DIM datasets in our benchmark. However, with most other datasets the clusters tend to be randomly located with respect to the center point, and it is unlikely that all the clusters would have different criterion values. What happens in practice is that the selected centroids are just random data points in the space, see Fig. 10.

3.5. Projection-based heuristics

Sorting heuristics can also be seen as a projection of the points onto a one-dimensional (non-linear) curve in the space. Most criteria would just produce an arbitrary curve connecting the points randomly, lacking convexity or any sensible shape. However, several linear projection-based techniques have been considered in the literature:

• Diagonal axis [65]
• Single axis [66,67]
• Principal axis [46,67–71]
• Two random points [72]
• Furthest points [72]

After the projection is performed, the points are partitioned into k equal size clusters similarly as with the sorting-based heuristics.

Yedla et al. [66] sort the points according to their distance to the origin, and then select every (N/k)th point. If the origin is the center of the data, this is essentially the same technique as in [62]. If the attributes are non-negative, then this is essentially the same as projecting the data onto the diagonal axis. Such a projection is trivial to implement by calculating the average of the attribute values. It has also been used for speeding up nearest neighbor searches in clustering in [73].

Al-Daoud [65] sorts the points according to the dimension with the largest variance. The points are then partitioned into k equal size clusters. The median of each cluster is selected instead of the mean. This adapts to the data slightly better than just using the diagonal.
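To make Sections 3.4 and 3.5 concrete, the sketch below shows the two variants used later in the experiments as we understand them: the sorting heuristic (distance to a random reference point, every (N/k)th point) and the random two-point projection heuristic (equal-size groups along the projection axis). This is an illustrative sketch, not the authors' code; the function names are ours.

```python
# Sketch of the Sorting (3.4) and Projection (3.5) initializations: compute a
# 1-D score per point, sort, and derive k centroids from the sorted order.
import numpy as np

def sorting_init(X, k, rng):
    # Distance to a randomly chosen reference point; take every (N/k)th point.
    ref = X[rng.integers(len(X))]
    order = np.argsort(((X - ref) ** 2).sum(axis=1))   # O(N log N)
    step = max(len(X) // k, 1)
    return X[order[::step][:k]].copy()

def projection_init(X, k, rng):
    # Project onto the line through two random points (assumed distinct), split
    # the sorted order into k equal-size groups, and use their means.
    i, j = rng.choice(len(X), size=2, replace=False)
    axis = X[j] - X[i]
    axis = axis / np.linalg.norm(axis)
    t = (X - X[i]) @ axis                              # scalar projections
    groups = np.array_split(np.argsort(t), k)
    return np.array([X[g].mean(axis=0) for g in groups])
```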

Fig. 10. Examples of sorting and projection-based techniques.

Fig. 11. Examples of the two projection-based heuristics for A2 dataset: random points (left), and the furthest point projections (right) [72].

A more common approach is to use the principal axis, which is the axis of projection that maximizes variance. It has been used effectively in divisive clustering algorithms [46,67–71]. Calculation of the principal axis takes O(DN)–O(D²N) time depending on the variant [46]. A more complex principal curve has also been used for clustering [74].

We consider two simple variants: random and two furthest points projection, as studied in [72]. The first heuristic takes two random data points and projects to the line passing through these two reference points. The key idea is the randomness; a single selection may provide a poor initialization but when repeating several times, the chance of finding one good initialization increases, see Fig. 11. We include this technique in our experiments and refer to it as Projection.

The second heuristic is slightly more deterministic but still random. We start by selecting a random point, and calculate its furthest point. The projection axis is the line passing through these two reference points. We again rely on randomness, but now the choices are expected to be more sensible, potentially providing better results using fewer trials. However, according to [72] this variant does not perform any better than the simpler random heuristic.

Projection works well if the data has a one-dimensional structure. In [72], a projective value is calculated to estimate how well a given projection axis models the data. From our data, Birch2 and G2 have high projective values and are suitable for the projection-based technique. However, with all other datasets, the projection does not make much more sense than the naïve sorting heuristics, see Fig. 10.

We also note that projection-based techniques generalize to segmentation-based clustering, where k-1 dividing planes are searched simultaneously using dynamic programming [74,75]. These clustering results usually require fine-tuning by k-means at the final step, but nevertheless, they are standalone algorithms.

3.6. Density-based heuristics

Density was already used both with the furthest point and the sorting heuristics, but the concept deserves a little further discussion. The idea of using density itself is appealing but it is not trivial how to calculate the density, and how to use it in clustering, especially since the initialization technique should be fast and simple.

The main bottleneck of these algorithms is how the density is estimated for the points. There are three common approaches for this:

• Buckets
• ε-radius circle
• k-nearest neighbors (KNN)

The first approach divides the space by a regular grid, and counts the frequency of the points in every bucket [76]. The density of a point is then inherited from the bucket it is in. This approach is feasible in low-dimensional space but would become impractical in higher-dimensional spaces. In [61], the problem is addressed by processing the dimensions independently in a heuristic manner. Other authors have used a kd-tree [51,57] or a space-filling curve [77] to partition the space into buckets containing roughly the same number of points. In [51,57], the number of buckets is 10k.

The other two approaches calculate the density for every point individually. The traditional one is to define a neighborhood using a cutoff threshold (ε-radius), and then count the number of other points within this neighborhood [21,63,64,78]. The third approach finds the k-nearest neighbors of a point [79], and then calculates the average distance to the points within this neighborhood. Lemke and Keller calculate the density between every pair of points [49].
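As an illustration of the bucket-based estimation above, the sketch below counts points in the cells of a regular grid and lets each point inherit the count of its cell. This is a minimal version of the idea, not code from the paper, and as noted in the text it is practical mainly in low dimensions.

```python
# Sketch of bucket (grid) density estimation: divide the space by a regular
# grid, count points per cell, and assign each point its cell's count.
import numpy as np

def grid_density(X, bins=10):
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Cell index of every point along each dimension (clipped to the last bin).
    cell = np.minimum(((X - lo) / (hi - lo + 1e-12) * bins).astype(int), bins - 1)
    # Count points per occupied cell, keyed by the cell tuple.
    counts = {}
    for c in map(tuple, cell):
        counts[c] = counts.get(c, 0) + 1
    return np.array([counts[tuple(c)] for c in cell])
```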

The bottleneck of the last two approaches is that we need to find the points that are within the neighborhood. This requires O(N²) distance calculations in both cases. Several speed-up techniques and approximate variants exist [80,81] but none that is both fast and simple to implement. Calculating density values only for a subset of size SQRT(N) would reduce the complexity to O(N^1.5), depending on whether the distances are calculated to all points or only within the subset. In [82], density is calculated in each dimension separately, and the final approximation is then obtained by summing up the individual densities. This allows rapid O(DN) time estimation that is more accurate than the sub-sampling approach.

Once calculated, the density can be used jointly with the furthest point heuristic, with the sorting heuristic, or some combination of them. For example, in [51] the furthest point heuristic was modified by weighting the distance by the density so that outliers are less likely to be chosen. The density peaks algorithm in [78] finds for every point its nearest neighbor with higher density. It then applies a sorting heuristic based on one of the two features: density and the distance to its neighbor. The method works as a standalone algorithm and does not require k-means at all.

Luxburg [50] first selects k·SQRT(k) preliminary clusters using k-means and then eliminates the smallest ones. After this, the furthest point heuristic is used to select the k clusters from the preliminary set of clusters. When minimizing SSE, the size of the clusters correlates with their density. Thus, Luxburg's technique indirectly implements a density-based approach which favors clusters of high density. We include this technique in our experiments although it does not satisfy our simplicity criterion.

We also note that there are several standalone clustering algorithms based on density [49,78,83,84]. However, they do not fit our requirements for speed and simplicity. If combined with the faster density estimation in [82], some of these techniques could be made competitive also in speed.

3.7. Splitting algorithm

The Split algorithm puts all points into a single cluster, and then iteratively splits one cluster at a time until k clusters are reached. This approach is seemingly simple and tempting to consider for initializing k-means. However, there are two non-trivial design choices to make: which cluster to split, and how to split it. We therefore consider split mainly as a standalone algorithm, but discuss briefly some existing techniques that have been used within k-means.

Linde et al. [85] use a binary split for initialization of their LBG algorithm in the vector quantization context. Every cluster is split by replacing the original centroid c by c+ε and c−ε, where ε refers to a random vector. Splitting every cluster avoids the question of which cluster to split but it does not have any real speed benefit. In [46], ε was calculated as the standard deviation of the points in the cluster, in each dimension separately.

Projection-based approaches are also suitable for the splitting algorithm. The idea is to divide a chosen cluster according to a hyperplane perpendicular to the projection axis. It is possible to find the optimal choice of the cluster to be split, and the optimal location of the hyperplane, in O(N) time [46,68]. This results in a fast, O(N log N log k) time algorithm, but the implementation is quite complex. It requires 22 functions and 947 lines of code, compared to 5 functions and 162 lines in repeated k-means [16].

There is also a split-kmeans variant that applies a k-means iteration after every split [46], later popularized under the name bisecting k-means in document clustering [86]. However, this would increase the time complexity to O(k²N), which equals O(N²) if k ≈ SQRT(N). Tri-level k-means [87] performs the clustering in two stages. It first creates fewer clusters than k, and then splits the clusters with the highest variation before applying the traditional k-means. All these variants are definitely standalone algorithms, and do not qualify as an initialization technique here.

In this paper, we therefore implement a simpler variant. We always select the biggest cluster to be split. The split is done by selecting two random points in the cluster. K-means is then applied but only within the cluster that was split, as done in [68]. The main difference to the bisecting k-means [86] and its original split+kmeans variant in [46] is that the time complexity sums up to only O(N log N); a proof can easily be derived from the one in [46].

3.8. Repeated k-means

Fig. 12. General principle of repeated k-means (RKM). The key idea is that the initialization includes randomness to produce different solutions at every repeat.

Repeated k-means performs k-means multiple times starting from different initializations, and then keeps the result with the lowest SSE-value. This is sometimes referred to as multi-start k-means. The basic idea of the repeats is to increase the probability of success. Repeated k-means can be formulated as a probabilistic algorithm as follows. If we know that k-means with a certain initialization technique will succeed with a probability of p, the expected number of repeats (R) to find the correct clustering would be:

R = 1/p

In other words, it is enough that k-means succeeds even sometimes (p > 0). It is then merely a question of how many repeats are needed. Only if p ≈ 0 can the number of repeats be unrealistically high. For example, standard k-means with random centroids succeeds 6–26% of the time with the S1–S4 datasets. This corresponds to R = 7 to 14 repeats, on average.

If the initialization technique is deterministic (no randomness), then it either succeeds (p = 100%) or fails (p = 0%) every time. To justify the repeats, a basic requirement is that there is some randomness in the initialization so that the different runs produce different results. Most techniques have the randomness implicitly. The rest of the techniques we modify as follows:

• Rand-P: Already included
• Rand-C: Already included
• Maxmin: First centroid randomly
• Kmeans++: Already included
• Bradley: Already included
• Sorting: Reference point randomly
• Projection: Reference points randomly
• Luxburg: Already included
• Split: Split centroids randomly

Repeats add one new parameter, R. Since p is not known in practice, we cannot derive a value for R automatically. In this paper, we use R = 100 unless otherwise noted. Fig. 12 shows the overall scheme of the repeated k-means.
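Putting the pieces together, here is a minimal sketch of the repeated k-means driver of Fig. 12: run k-means R times from different random initial solutions and keep the lowest-SSE result. It reuses the hypothetical helpers sketched earlier (random_centroids, kmeans_iteration, sse), and a fixed seed makes the overall algorithm deterministic while individual repeats still differ, as discussed in Section 3.1. Illustration only, not the authors' implementation.

```python
# Sketch of repeated k-means (RKM): R restarts, keep the solution with lowest SSE.
import numpy as np

def repeated_kmeans(X, k, R=100, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)        # same seed -> same overall result
    best, best_sse = None, np.inf
    for _ in range(R):
        centroids = random_centroids(X, k, rng)       # any randomized init works
        for _ in range(max_iters):
            new_centroids, labels = kmeans_iteration(X, centroids)
            if np.allclose(new_centroids, centroids):  # converged
                break
            centroids = new_centroids
        cur = sse(X, centroids)
        if cur < best_sse:
            best, best_sse = centroids, cur
    return best, best_sse
```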

into random subsets. For instance, if we divide the data into R sub- k-means brings only little improvement. The average CI-value of
sets of size N/R, the total processing time would be roughly the Luxburg improves only from 1.7 to 1.2 (∼30%), and Bradley from
same as that of a single run. 3.4 to 3.1 (∼10%). The latter is more understandable as k-means is
For example, Bradley and Fayyad [31] apply k-means for a sub- already involved in the iterations. Split heuristic, although a stan-
sample of size N/R, where R = 10 was recommended. Each sample dalone algorithm, leaves more space for k-means to improve (61%).
is clustered by k-means starting with random centroids. However, Number of iterations: The main observation is that the easier
instead of taking the best clustering of the repeats, a new dataset the dataset, and the better the initialization, the fewer the itera-
is created from the Rk centroids. This new dataset is then clus- tions needed. The differences between the initialization vary from
tered by repeated k-means (R repeats). The total time complexity 20 (Luxburg) to 36 (Rand-C); with the exception of random parti-
is Rk(N/R) + Rk2 = kN + Rk2 , where the first part comes from tion (Rand-P), which takes 65 iterations.
clustering the sub-samples, and the second part from clustering
the combined set. If k = SQRT(N), then this would be N1. 5 + RN. 4.2. Cluster overlap
Overall, the algorithm is fast and satisfies the criteria for initializa-
tion technique. The results with the S1–S4 datasets (Table 3) demonstrate the
Bahmani et al. [88] have a similar approach. They repeat k- effect of the overlap in general: the less overlap, the worse the
means++ R = O(logN) times to obtain Rk preliminary centroids, k-means’ performance. Some initialization techniques can compen-
which are then used as a new dataset for clustering by standard k- sate for this weakness. For example, the maxmin variants and the
means. They reported that R = 5 would be sufficient for the num- standalone algorithms reduce this phenomenon but do not remove
ber of repeats. In our experiments, we consider the Bradley and it completely. They provide better initial solution with S1 (less
Fayyad [31] as an initialization, and use R = 100 repeats as with all overlap) than with S4 (more overlap), but the final result after the
techniques. k-means iterations is still not much different. An extreme case is
DIM32, for which all these better techniques provide correct solu-
4. Experimental results tion. However, they do it even without k-means iterations!
Further tests with G2 confirm the observation, see Fig. 13.
We study next the overall performance of different initialization When overlap is less than 2%, the k-means iterations do not help
techniques, and how the results depend on the following factors: much and the result depends mostly on the initialization. If the
• Overlap of clusters correct clustering is found, it is found without k-means. Thus, the
• Number of clusters clustering is solved by a better algorithm, not by better k-means
• Dimensions initialization. In case of high overlap, k-means reaches almost the
• Unbalance of cluster sizes same result (about 88% success rate) regardless of how it was ini-
tialized.
The overall results (CI-values and success rates) are summarized
in Table 3. We also record (as fails) how many datasets provide 4.3. Number of clusters
success rate p = 0%. This means that the algorithm cannot find the
correct clustering even with 50 0 0 repeats. We test the following The results with the A1–A3 datasets (Table 3) show that the
methods: more there are clusters the higher the CI-value and the lower the
• Rand-P success rate. This phenomenon holds for all initialization tech-
• Rand-C niques and it is not specific to k-means algorithm only. If an
• Maxmin algorithm provides correct clustering with success rate p for a
• kmeans++ dataset of size k, then p is expected to decrease when k increases.
• Bradley Fig. 14 confirms this dependency with the Birch2 subsets. Projec-
• Sorting tion heuristic is the only technique that manages to capture the
• Projection hidden 1-dimensional structure in this data. The success rate of all
• Luxburg other true initialization techniques eventually decreases to 0%.
• Split Fig. 15 shows that the CI-value has a near linear dependency
on the number of clusters. In most cases, the relative CI-value con-
4.1. Overall results

CI-values: Random partition works clearly worse (CI = 12.4) than the random centroids (CI = 4.5). The Bradley and sorting heuristics are slightly better (CI = 3.1 and 3.3), but the maxmin heuristics (Maxmin and kmeans++) are the best among the true initialization techniques (CI = 2.2 and 2.3). The standalone algorithms (Luxburg and Split) are better still (CI = 1.2 and 1.2), but even they provide the correct result (CI = 0) only for the easiest dataset: DIM32.

Success rates: The results show that Maxmin is a reasonable heuristic. Its average success rate is 22%, compared to 5% for random centroids. It also fails (success rate = 0%) only in the case of three datasets, namely those with a high number of clusters (A3, Birch1, Birch2). Random partition works with S2, S3 and S4 but fails with all the other 8 datasets. The standalone algorithms (Luxburg and Split) provide 40% success rates, on average, and fail only with Birch1 and Unbalance.

Effect of iterations: From the initial results we can see that Luxburg and Bradley are already standalone algorithms for which the k-means iterations bring only a minor further improvement (29% and 11% on average; see the Impr. column of Table 3).

The standalone algorithms reduce this phenomenon but do not remove it completely. They provide a better initial solution with S1 (less overlap) than with S4 (more overlap), but the final result after the k-means iterations is still not much different. An extreme case is DIM32, for which all these better techniques provide the correct solution. However, they do it even without k-means iterations!

Further tests with G2 confirm the observation, see Fig. 13. When overlap is less than 2%, the k-means iterations do not help much and the result depends mostly on the initialization. If the correct clustering is found, it is found without k-means. Thus, the clustering is solved by a better algorithm, not by a better k-means initialization. In the case of high overlap, k-means reaches almost the same result (about 88% success rate) regardless of how it was initialized.

4.3. Number of clusters

The results with the A1–A3 datasets (Table 3) show that the more clusters there are, the higher the CI-value and the lower the success rate. This phenomenon holds for all initialization techniques and it is not specific to the k-means algorithm only. If an algorithm provides correct clustering with success rate p for a dataset of size k, then p is expected to decrease when k increases. Fig. 14 confirms this dependency with the Birch2 subsets. The projection heuristic is the only technique that manages to capture the hidden 1-dimensional structure in this data. The success rate of all other true initialization techniques eventually decreases to 0%.

Fig. 15 shows that the CI-value has a near linear dependency on the number of clusters. In most cases, the relative CI-value converges to a constant when k approaches its maximum (k = 100). An exception is Luxburg, which is less sensitive to the increase of k, providing values CI = (0.82, 1.25, 1.42, 1.54) for k = (25, 50, 75, 100). Besides this exception, we conclude that the performance has a linear dependency on k regardless of the initialization technique.
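All results below are reported as CI-values, so we also sketch how such a value can be computed. The following is our own illustration of the centroid index idea of [45], where the centroids of one solution are mapped to their nearest centroids in the other and the orphan centroids are counted; the function names are assumptions, not code from the paper.

```python
import numpy as np

def orphan_count(A, B):
    """Map every centroid of A to its nearest centroid in B and count
    how many centroids of B receive no mapping (orphans)."""
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    covered = np.zeros(len(B), dtype=bool)
    covered[np.argmin(dist, axis=1)] = True
    return int((~covered).sum())

def centroid_index(A, B):
    """Symmetric centroid index: CI = 0 indicates a structurally correct result."""
    return max(orphan_count(A, B), orphan_count(B, A))

# Example: two centroids land in the same true cluster, so one true cluster is missed (CI = 1).
truth = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
found = np.array([[0.1, 0.1], [0.0, -0.1], [4.9, 0.2]])
print(centroid_index(found, truth))   # 1
```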
Table 3
Average CI-values before and after k-means iterations, success rates, and the number of iterations performed. The results are averages of 5000 runs. Fail records for how many datasets the correct solution was never found (success rate = 0%). From the DIM datasets we report only DIM32; the results for the others are practically the same. Note: The values for the Impr. and Aver. columns are calculated from precise values and not from the shown rounded values. (For interpretation of the references to color in the Table the reader is referred to the web version of this article.)

CI-values (initial)

Method      s1    s2    s3    s4    a1    a2    a3    unb   b1    b2    dim32  Aver.
Rand-P      12.5  14.0  12.8  14.0  19.0  32.9  48.1  7.0   96.0  96.6  13.1   33.3
Rand-C      5.3   5.5   5.4   5.4   7.3   12.7  18.2  4.6   36.6  36.6  5.8    13.0
Maxmin      1.3   2.9   6.1   6.8   2.1   4.1   5.0   0.9   21.4  9.6   0.0    5.5
kmeans++    1.7   2.3   3.2   3.3   3.1   5.6   7.9   0.8   21.3  10.4  0.1    5.4
Bradley     1.0   0.7   0.6   0.5   1.5   3.4   5.3   3.3   5.7   13.6  1.7    3.4
Sorting     3.3   3.7   4.1   4.4   4.9   10.4  15.6  4.0   34.1  7.2   1.7    8.5
Projection  3.0   3.4   3.9   4.2   4.5   9.8   15.2  4.0   33.7  1.0   1.1    7.6
Luxburg     0.8   0.8   1.1   1.3   0.9   1.1   1.2   4.2   5.6   1.7   0.0    1.7
Split       0.5   0.8   1.4   1.4   1.3   2.4   3.5   4.5   12.0  2.7   0.0    2.8

CI-values (final)

Method      s1   s2   s3   s4   a1   a2    a3    unb  b1    b2    dim32  Aver.  Impr.
Rand-P      3.3  0.6  1.2  0.4  6.0  10.7  17.9  4.0  11.3  75.6  5.3    12.4   63%
Rand-C      1.8  1.4  1.3  0.9  2.5  4.5   6.6   3.9  6.6   16.6  3.6    4.5    65%
Maxmin      0.7  1.0  0.7  1.0  1.0  2.6   2.9   0.9  5.5   7.3   0.0    2.2    62%
kmeans++    1.0  0.9  1.0  0.8  1.5  2.9   4.2   0.5  4.9   7.2   0.1    2.3    57%
Bradley     0.9  0.6  0.5  0.4  1.3  3.0   4.8   3.5  4.6   12.5  1.6    3.1    11%
Sorting     1.3  1.1  1.0  0.7  1.5  3.6   5.5   4.0  5.7   4.3   1.4    2.7    69%
Projection  1.2  0.9  0.8  0.6  1.2  3.3   5.2   4.0  5.3   0.2   0.9    2.2    71%
Luxburg     0.5  0.4  0.6  0.4  0.6  0.9   1.0   4.0  2.7   1.6   0.0    1.2    29%
Split       0.2  0.3  0.4  0.4  0.5  1.1   1.8   4.0  2.8   1.6   0.0    1.2    61%

Success-%

Method      s1   s2   s3   s4   a1   a2   a3   unb  b1   b2   dim32  Aver.  Fails
Rand-P      0%   47%  5%   63%  0%   0%   0%   0%   0%   0%   0%     10%    8
Rand-C      3%   11%  12%  26%  1%   0%   0%   0%   0%   0%   0%     5%     6
Maxmin      37%  16%  36%  9%   15%  1%   0%   22%  0%   0%   100%   22%    3
kmeans++    21%  24%  18%  30%  7%   0%   0%   51%  0%   0%   88%    22%    4
Bradley     21%  46%  49%  64%  7%   0%   0%   0%   0%   0%   2%     17%    5
Sorting     12%  20%  22%  36%  10%  0%   0%   0%   0%   12%  15%    12%    4
Projection  16%  29%  30%  42%  18%  0%   0%   0%   0%   92%  34%    24%    4
Luxburg     52%  60%  45%  61%  45%  33%  31%  0%   0%   17%  95%    40%    2
Split       78%  75%  62%  64%  51%  17%  5%   0%   0%   10%  99%    42%    2

Number of iterations

Method      s1  s2  s3  s4  a1  a2  a3  unb  b1   b2   dim32  Aver.
Rand-P      32  37  37  39  43  58  76  36   228  130  3      65
Rand-C      20  24  27  40  22  26  27  33   117  48   5      36
Maxmin      13  19  24  37  20  18  20  4    92   43   2      26
kmeans++    14  19  24  35  17  20  22  13   89   43   2      27
Bradley     13  12  13  17  12  17  19  24   77   45   2      23
Sorting     17  21  25  37  19  24  26  38   104  33   3      32
Projection  15  20  25  35  17  24  25  36   99   6    3      28
Luxburg     9   12  17  27  11  12  12  33   62   23   2      20
Split       7   11  19  27  12  16  18  35   65   27   2      22
Fig. 13. Average success rates for all G2 datasets before (gray) and after k-means (white). The datasets were divided into two categories: those with low overlap <2% (left),
and those with high overlap ≥2% (right).
Fig. 14. Dependency of the success rate and the number of clusters when using the subsets of Birch2 (B2-sub).

Fig. 15. Dependency of the relative CI-values (CI/k) and the number of clusters when using the subsets of Birch2 (B2-sub).

4.4. Dimensions

We tested the effect of dimensions using the DIM and G2 datasets. Two variants (Maxmin, Split) solve the DIM sets almost every time (99–100%), whereas kmeans++ and Luxburg solve them most of the time (≈95%), see Fig. 16. Interestingly, they find the correct result by the initialization and no k-means iterations are needed. In general, if the initialization technique is able to solve the clustering, it does it regardless of the dimensionality.

The sorting and projection heuristics are exceptions in this sense; their performance actually improves with the highest dimensions. The reason is that when the dimensions increase, the clusters eventually become so clearly separated that even such naïve heuristics will be able to cluster the data. In general, the reason for success or failure is not the dimensionality but the cluster separation.

The results with G2 confirm the above observation, see Fig. 16. With the lowest dimensions, the k-means iterations work because some cluster overlap exists. However, for higher dimensions the overlap eventually disappears and the performance starts to depend mainly on the initialization. We also calculated how much the success rate correlates with the dimensions and the overlap. The results in Table 4 show that the final result correlates much more strongly with the overlap than with the dimensionality.

Since there is causality between dimensions and overlap, it is unclear whether the dimensionality has any role at all. To test this further, we generated additional datasets with D = 2–16 and compared only those with overlap = 2%, 4%, 8%. The results showed that the success of the k-means iterations does not depend on the dimensions even when the clusters overlap.

To sum up, our conclusion is that k-means iterations cannot solve the problem when the clusters are well separated. All techniques that solve these datasets do it already by the initialization technique, without any help of k-means. When there is overlap, k-means works better. But even then, the performance does not depend on the dimensionality.
4.5. Unbalance

The Unbalance dataset shows one weakness of k-means. The problem is not the different densities as such, but the unbalance of the cluster sizes together with the separation of the clusters. If no centroids are selected from the sparse area, the k-means iterations manage to move only one centroid into this area, and all other centroids will remain in the dense area, see Fig. 17. The probability that a single random centroid would be selected from the sparse area is p = 500/6500 = 7%. To pick all the required five centroids from the sparse area would happen with probability 0.01%,¹ i.e. only once every 8430 runs.

¹ $\binom{8}{5}\, p^{5} (1-p)^{3}$.
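The footnote formula above is our reconstruction of a garbled footnote, but its arithmetic reproduces the figures quoted in the text, as the following check shows.

```python
from math import comb

p = 500 / 6500                          # chance that one random centroid falls in the sparse area
prob = comb(8, 5) * p**5 * (1 - p)**3   # exactly five of the eight centroids in the sparse area
print(f"{prob:.6f}")                    # 0.000119, i.e. about 0.01%
print(round(1 / prob))                  # 8429, i.e. roughly once every 8430 runs
```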
Besides Rand-C and Rand-P, the sorting and projection heuristics and the Luxburg and Split algorithms all fail with this data by allocating most centroids to the dense area. Bradley works only slightly better and often allocates two centroids to the sparse area. The maxmin heuristics work best because they rely more on distances than on frequencies. K-means++ typically misses one centroid, whereas Maxmin does the opposite and allocates one too many centroids in the sparse area. They provide success rates of 22% (Maxmin) and 51% (KM++), in contrast to the other techniques that result in 0% success.

To sum up, success depends mainly on the goodness of the initialization; k-means iterations can do very little with this kind of data. If the correct clustering is found, it is found mainly without k-means.

4.6. Repeats

We next investigate to what extent the k-means performance can be improved by repeating the algorithm several times. Table 5 summarizes the results. We can see that a significant improvement is achieved with all initialization techniques. When the success rate of a single run of k-means is 2% or higher, CI = 0 can always be reached thanks to the repeats. However, none of the variants can solve all datasets. The overall performance of the different initialization techniques can be summarized as follows (a sketch of the repeated k-means procedure is given after the list):

• Random partition is almost hopeless and the repeats do not help much. It only works when the clusters have strong overlap. But even then, k-means works relatively well anyway regardless of the initialization.
• Random centroids is improved from CI = 4.5 to 2.1, on average, but still it can solve only three datasets (S2, S3, S4). Two other datasets (S1, A1) could be solved with significantly more repeats, but not the rest.
• The maxmin variants are the best among the simple initialization techniques, providing CI = 0.7, on average, compared to 2.1 of Rand-C. They still fail with four datasets. K-means++ is not significantly better than the simpler Maxmin.
• The standalone algorithms (Luxburg and Split) are the best. They provide an average value of CI = 1.2 without the repeats, and CI = 0.4 with 100 repeats. They fail only with the Unbalance dataset.
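The repeated k-means (RKM) procedure itself is a plain restart loop. The sketch below is our own illustration, assuming scikit-learn's KMeans as the single-run solver and the sum of squared errors (inertia) as the criterion for keeping the best run; the experiments in the paper use a faster activity-based k-means variant [89] instead.

```python
import numpy as np
from sklearn.cluster import KMeans

def repeated_kmeans(X, k, repeats=100, init="k-means++", seed=0):
    """Run k-means `repeats` times and keep the solution with the lowest SSE."""
    best = None
    for r in range(repeats):
        km = KMeans(n_clusters=k, init=init, n_init=1, random_state=seed + r).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best.cluster_centers_, best.labels_

# Usage on synthetic data: 100 repeats with k-means++ initialization
X = np.random.default_rng(0).normal(size=(1000, 2))
centers, labels = repeated_kmeans(X, k=15, repeats=100)
```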
Table 4
Correlation of success rate with increasing overlap (left) and dimensions (right) with
the G2 datasets (3:3 centroid allocation test). Red>0.60, Yellow = 0.30–0.53.
Fig. 16. Dependency of success rate on the dimensions when no overlap (DIM sets), and with overlap (G2 datasets). The results of G2 are average success rates for all
sd = 10–100 (G2-D-sd) with a given dimension D, before and after k-means.
The improvement from the repeats is achieved at the cost of increased processing time. We used the fast k-means variant [89] that utilizes the activity of the centroids. For the smaller datasets the results are close to real-time, but with the largest dataset (Birch1, N = 100,000), the 100 repeats can take 10–30 min. We extended the tests and ran 200,000 repeats for the A3 and Unbalance datasets. The results in Table 6 show that Maxmin would need 216 repeats to reach CI = 0 with A3, on average, whereas k-means++ would require 8696 repeats even though it finds CI = 1 already after 138 repeats. The results also show that the Unbalance dataset is difficult for almost all initialization techniques, but the maxmin heuristics are the most suitable for this type of data.

4.7. Summary

We make the following observations:

• Random partition provides an initial solution of similar quality regardless of overlap, but the errors in the initial solution can be better fixed by k-means iterations when the clusters have high overlap. In this case it can even outperform random centroids. However, repeats do not improve the results much, especially with sets having many clusters (A3, Birch2).
• Cluster overlap is the biggest factor. If there is high overlap, k-means iterations work well regardless of the initialization. If there is no overlap, then the success depends completely on the initialization technique: if it fails, k-means will also fail.
• Practically all initialization techniques perform worse when the number of clusters increases. Success of the k-means depends linearly on the number of clusters. The more clusters, the more errors there are, before and after the iterations.
• Dimensionality does not have a direct effect. It has a slight effect on some initialization techniques, but the k-means iterations are basically independent of the dimensions.
• Unbalance of cluster sizes can be problematic, especially for the random initializations but also for the other techniques. Only the maxmin variants with 100 repeats can overcome this problem.
Table 5
Performance of the repeated k-means (100 repeats). The last two columns show the average results of all datasets without repeats (KM) and
with repeats (RKM). (For interpretation of the references to color in the Table the reader is referred to the web version of this article.)
Table 7 summarizes how the four factors affect the different initialization techniques and the k-means iterations.

5. Conclusions

On average, k-means caused errors with about 15% of the clusters (CI = 4.5). By repeating k-means 100 times this error was reduced to 6% (CI = 2.0). Using a better initialization technique (Maxmin), the corresponding numbers were 6% (CI = 2.1) with k-means as such, and 1% (CI = 0.7) with 100 repeats. For most pattern recognition applications this accuracy is more than enough when clustering is just one component within a complex system.

The most important factor is the cluster overlap. In general, well separated clusters make the clustering problem easier, but for k-means it is just the opposite. When the clusters overlap, k-means iterations work reasonably well regardless of the initialization. This is the expected situation in most pattern recognition applications.

The number of errors has a linear dependency on the number of clusters (k): the more clusters, the more errors k-means makes, but the percentage remains constant. Unbalance of cluster sizes is more problematic. Most initialization techniques fail, and only the maxmin heuristics worked in this case. The clustering result then depends merely on the goodness of the initialization technique.
Table 6
Number of repeats in RKM to reach a certain CI-level. Missing values (−) indicate that this CI-level was never reached during the 200,000 repeats.

A3
CI-value
Initialization 6 5 4 3 2 1 0
Rand-P – – – – – – –
Rand-C 2 4 11 54 428 11,111 –
Maxmin 1 3 14 216
Kmeans++ 1 2 3 14 138 8696
Bradley 1 2 8 58 1058 33,333
Sorting 1 2 4 13 73 1143 –
Projection 1 2 3 9 46 581 18,182
Luxburg 1 3
Split 1 2 9

Unbalance
CI-value
Initialization 6 5 4 3 2 1 0
Rand-P 1 97 8333 – –
Rand-C 1 16 69 1695 100k
Maxmin 1 4
Kmeans++ 1 2
Bradley 1 3 6 70 1471
Sorting 1 – – – –
Projection 1 935 16,667 – –
Luxburg 1 59 16,667 – –
Split 1 9524 – – –
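Table 6 can be read together with the success rates of Table 3 through a simple independence model (our own illustration, not a model used in the paper): if a single run succeeds with probability p, roughly 1/p repeats are needed on average, and R repeats succeed with probability 1 − (1 − p)^R.

```python
def expected_repeats(p):
    """Expected number of independent runs until the first CI = 0 result."""
    return 1.0 / p

def success_within(p, repeats):
    """Probability that at least one of `repeats` independent runs reaches CI = 0."""
    return 1.0 - (1.0 - p) ** repeats

print(expected_repeats(0.02))       # 50 runs on average for a 2% single-run success rate
print(success_within(0.02, 100))    # ~0.87 probability of success within 100 repeats
```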
Table 7
How the four factors have an effect on the performance of the initialization and on the k-means iterations.

Method         Overlap       Clusters      Dimension   Unbalance
Rand-P         No effect     Constant      No effect   Very bad
Rand-C         No effect     Constant      No effect   Very bad
Maxmin         Bad           Constant      No effect   A bit worse
kmeans++       A bit worse   Constant      No effect   A bit worse
Bradley        Good          Constant      No effect   Bad
Sorting        A bit worse   Constant      No effect   Very bad
Projection     A bit worse   Constant      No effect   Very bad
Luxburg        A bit worse   Minor effect  No effect   Very bad
Split          A bit worse   Constant      No effect   Very bad
KM iterations  Good          Constant      No effect   No effect
Dimensionality itself is not a factor. It merely matters how the dimensions affect the cluster overlap. With our data, the clusters became more separated when the dimensions were increased, which in turn worsened the k-means performance. Besides this indirect effect, the dimensions did not matter much.

With real data the effect might be just the opposite. If the features (attributes) are added in the order of their clustering capability, it is expected that the clusters would become more overlapping when adding more features. As a result, k-means would start to work better, but the data itself would become more difficult to cluster, possibly losing the clustering structure. And vice versa, if good feature selection is applied, the clusters can be more separated, which has the danger that k-means would start to perform worse.

Fig. 17. Examples of the initialization techniques on the Unbalance dataset. The only techniques that do not badly fail are the maxmin heuristics. The numbers indicate the order in which the centroids are selected.

Based on these observations, choosing an initialization technique like Maxmin can compensate for the weaknesses of k-means. With unbalanced cluster sizes it might work best overall. However, it is preferable to repeat the k-means 10–100 times; each time taking a random point as the first centroid and selecting the rest using the Maxmin heuristic. This will keep the number of errors relatively small.

However, the fundamental problem of k-means still remains when the clusters are well separated. From all the tested combinations, none was able to solve all the benchmark datasets despite them being seemingly simple. With 100 repeats, Maxmin and k-means++ solved 7 datasets (out of the 11), thus being the best initialization techniques. The better standalone algorithms (Luxburg and Split) managed to solve 9.

To sum up, if the clusters overlap, the choice of the initialization technique does not matter much, and repeated k-means is usually good enough for the application. However, if the data has well-separated clusters, the result of k-means depends merely on the initialization algorithm.

In general, the problem of initialization is not any easier than solving the clustering problem itself. Therefore, if the accuracy of clustering is important, then a better algorithm should be used. Using the same computing time spent on repeating k-means, a simple alternative called random swap (RS) [12] solves all the benchmark datasets. Other standalone algorithms that we have found able to solve all the benchmark sets include the genetic algorithm (GA) [10], the split algorithm [46], split k-means [46], and

density peaks [78]. Agglomerative clustering [30] solves 10 out of 11.

References

[1] E. Forgy, Cluster analysis of multivariate data: efficiency vs. interpretability of classification, Biometrics 21 (1965) 768–780.
[2] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Berkeley Symposium on Mathematical Statistics and Probability, 1, Statistics University of California Press, Berkeley, Calif., 1967, pp. 281–297.
[3] S.P. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137.
[4] L. Wang, C. Pan, Robust level set image segmentation via a local correntropy-based k-means clustering, Pattern Recognit. 47 (2014) 1917–1925.
[5] C. Boutsidis, A. Zouzias, M.W. Mahoney, P. Drineas, Randomized dimensionality reduction for k-means clustering, IEEE Trans. Inf. Theory 61 (2, February) (2015) 1045–1062.
[6] M. Capo, A. Perez, J.A. Lozano, An efficient approximation to the k-means clustering for massive data, Knowl.-Based Syst. 117 (2017) 56–69.
[7] Z. Huang, N. Li, K. Rao, C. Liu, Y. Huang, M. Ma, Z. Wang, Development of a data-processing method based on Bayesian k-means clustering to discriminate aneugens and clastogens in a high-content micronucleus assay, Hum. Exp. Toxicol. 37 (3) (2018) 285–294.
[8] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31 (2010) 651–666.
[9] K. Krishna, M.N. Murty, Genetic k-means algorithm, IEEE Trans. Syst. Man Cybern. Part B 29 (3) (1999) 433–439.
[10] P. Fränti, Genetic algorithm with deterministic crossover for vector quantization, Pattern Recognit. Lett. 21 (1) (2000) 61–68.
[11] P. Fränti, J. Kivijärvi, Randomized local search algorithm for the clustering problem, Pattern Anal. Appl. 3 (4) (2000) 358–369.
[12] P. Fränti, Efficiency of random swap clustering, J. Big Data 5 (13) (2018) 1–29.
[13] S. Kalyani, K.S. Swarup, Particle swarm optimization based K-means clustering approach for security assessment in power systems, Expert Syst. Appl. 32 (9) (2011) 10839–10846.
[14] D. Yan, L. Huang, M.I. Jordan, Fast approximate spectral clustering, ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (2009) 907–916.
[15] L. Bai, X. Cheng, J. Liang, H. Shen, Y. Guo, Fast density clustering strategies based on the k-means algorithm, Pattern Recognit. 71 (2017) 375–386.
[16] T. Kinnunen, I. Sidoroff, M. Tuononen, P. Fränti, Comparison of clustering methods: a case study of text-independent speaker modeling, Pattern Recognit. Lett. 32 (13, October) (2011) 1604–1617.
[17] Q. Zhao, P. Fränti, WB-index: a sum-of-squares based index for cluster validity, Data Knowl. Eng. 92 (July) (2014) 77–89.
[18] M. Rezaei, P. Fränti, Can the number of clusters be solved by external index? Manuscript (submitted).
[19] J.M. Peña, J.A. Lozano, P. Larrañaga, An empirical comparison of four initialization methods for the k-means algorithm, Pattern Recognit. Lett. 20 (10, October) (1999) 1027–1040.
[20] J. He, M. Lan, C.-L. Tan, S.-Y. Sung, H.-B. Low, Initialization of cluster refinement algorithms: a review and comparative study, IEEE Int. Joint Conf. Neural Netw. (2004).
[21] D. Steinley, M.J. Brusco, Initializing k-means batch clustering: a critical evaluation of several techniques, J. Classification 24 (2007) 99–121.
[22] M.E. Celebi, H.A. Kingravi, P.A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl. 40 (2013) 200–210.
[23] L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, 1990.
[24] B. Thiesson, C. Meek, D.M. Chickering, D. Heckerman, Learning mixtures of Bayesian networks, Technical Report MSR-TR-97-30 Cooper & Moral, 1997.
[25] J.T. Tou, R.C. Gonzales, Pattern Recognition Principles, Addison-Wesley, 1974.
[26] T.F. Gonzalez, Clustering to minimize the maximum intercluster distance, Theor. Comput. Sci. 38 (2–3) (1985) 293–306.
[27] J.H. Ward, Hierarchical grouping to optimize an objective function, J. Am. Stat. Assoc. 58 (301) (1963) 236–244.
[28] A. Likas, N. Vlassis, J. Verbeek, The global k-means clustering algorithm, Pattern Recognit. 36 (2003) 451–461.
[29] D. Steinley, Local optima in k-means clustering: what you don't know may hurt you, Psychol. Methods 8 (2003) 294–304.
[30] P. Fränti, T. Kaukoranta, D.-F. Shen, K.-S. Chang, Fast and memory efficient implementation of the exact PNN, IEEE Trans. Image Process. 9 (5, May) (2000) 773–777.
[31] P. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: International Conference on Machine Learning, San Francisco, 1998, pp. 91–99.
[32] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, New York, 1973.
[33] M. Bicego, M.A.T. Figueiredo, Clustering via binary embedding, Pattern Recognit. 83 (2018) 52–63.
[34] N. Karmitsa, A.M. Bagirov, S. Taheri, Clustering in large data sets with the limited memory bundle method, Pattern Recognit. 83 (2018) 245–259.
[35] Y. Zhu, K.M. Ting, M.J. Carman, Grouping points by shared subspaces for effective subspace clustering, Pattern Recognit. 83 (2018) 230–244.
[36] P.B. Frandsen, B. Calcott, C. Mayer, R. Lanfear, Automatic selection of partitioning schemes for phylogenetic analyses using iterative k-means clustering of site rates, BMC Evol. Biol. 15 (13) (2015).
[37] D.G. Márquez, A. Otero, P. Félix, C.A. García, A novel and simple strategy for evolving prototype based clustering, Pattern Recognit. 82 (2018) 16–30.
[38] L. Huang, H.-Y. Chao, C.-D. Wang, Multi-view intact space clustering, Pattern Recognit. 86 (2019) 344–353.
[39] P. Fränti, S. Sieranoja, K-means properties on six clustering benchmark datasets, Appl. Intel. 48 (12) (2018) 4743–4759.
[40] L. Morissette, S. Chartier, The k-means clustering technique: general considerations and implementation in Mathematica, Tutor. Quant. Methods Psychol. 9 (1) (2013) 15–24.
[41] J. Liang, L. Bai, C. Dang, F. Cao, The k-means-type algorithms versus imbalanced data distributions, IEEE Trans. Fuzzy Syst. 20 (4, August) (2012) 728–745.
[42] I. Melnykov, V. Melnykov, On k-means algorithm with the use of Mahalanobis distances, Stat. Probab. Lett. 84 (January) (2014) 88–95.
[43] V. Melnykov, S. Michael, I. Melnykov, Recent developments in model-based clustering with applications, in: M. Celebi (Ed.), Partitional Clustering Algorithms, Springer, Cham, 2015.
[44] M. Rezaei, P. Fränti, Set-matching methods for external cluster validity, IEEE Trans. Knowl. Data Eng. 28 (8, August) (2016) 2173–2186.
[45] P. Fränti, M. Rezaei, Q. Zhao, Centroid index: cluster level similarity measure, Pattern Recognit. 47 (9) (2014) 3034–3045.
[46] P. Fränti, T. Kaukoranta, O. Nevalainen, On the splitting method for VQ codebook generation, Opt. Eng. 36 (11, November) (1997) 3043–3051.
[47] P. Fränti, O. Virmajoki, V. Hautamäki, Fast agglomerative clustering using a k-nearest neighbor graph, IEEE Trans. Pattern Anal. Mach. Intel. 28 (11, November) (2006) 1875–1881.
[48] G.H. Ball, D.J. Hall, A clustering technique for summarizing multivariate data, Syst. Res. Behav. Sci. 12 (2, March) (1967) 153–155.
[49] O. Lemke, B. Keller, Common nearest neighbor clustering: a benchmark, Algorithms 11 (2) (2018) 19.
[50] U.V. Luxburg, Clustering stability: an overview, Found. Trends Mach. Learn. 2 (3) (2010) 235–274.
[51] S.J. Redmond, C. Heneghan, A method for initialising the K-means clustering algorithm using kd-trees, Pattern Recognit. Lett. 28 (8) (2007) 965–973.
[52] S. Tezuka, P.L. Equyer, Efficient portable combined Tausworthe random number generators, ACM Trans. Model. Comput. Simul. 1 (1991) 99–112.
[53] M.J. Norušis, IBM SPSS Statistics 19 Guide to Data Analysis, Prentice Hall, Upper Saddle River, New Jersey, 2011.
[54] T. Gonzalez, Clustering to minimize the maximum intercluster distance, Theor. Comput. Sci. 38 (2–3) (1985) 293–306.
[55] M.M.-T. Chiang, B. Mirkin, Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads, J. Classification 27 (2010) 3–40.
[56] J. Hämäläinen, T. Kärkkäinen, Initialization of big data clustering using distributionally balanced folding, Proceedings of the European Symposium on Artificial Neural Networks, Comput. Intel. Mach. Learn.-ESANN (2016).
[57] I. Katsavounidis, C.C.J. Kuo, Z. Zhang, A new initialization technique for generalized Lloyd iteration, IEEE Signal Process Lett. 1 (10) (1994) 144–146.
[58] F. Cao, J. Liang, L. Bai, A new initialization method for categorical data clustering, Expert Syst. Appl. 36 (7) (2009) 10223–10228.
[59] D. Arthur, S. Vassilvitskii, K-means++: the advantages of careful seeding, ACM-SIAM Symp. on Discrete Algorithms (SODA'07), January 2007.
[60] M. Erisoglu, N. Calis, S. Sakallioglu, A new algorithm for initial cluster centers in k-means algorithm, Pattern Recognit. Lett. 32 (14) (2011) 1701–1705.
[61] C. Gingles, M. Celebi, Histogram-based method for effective initialization of the k-means clustering algorithm, Florida Artificial Intelligence Research Society Conference, May 2014.
[62] J.A. Hartigan, M.A. Wong, Algorithm AS 136: a k-means clustering algorithm, J. R. Stat. Soc. C 28 (1) (1979) 100–108.
[63] M.M. Astrahan, Speech Analysis by Clustering, or the Hyperphome Method, Stanford Artificial Intelligence Project Memorandum AIM-124, Stanford University, Stanford, CA, 1970.
[64] F. Cao, J. Liang, G. Jiang, An initialization method for the k-means algorithm using neighborhood model, Comput. Math. Appl. 58 (2009) 474–483.
[65] M. Al-Daoud, A new algorithm for cluster initialization, in: World Enformatika Conference, 2005, pp. 74–76.
[66] M. Yedla, S.R. Pathakota, T.M. Srinivasa, Enhancing k-means clustering algorithm with improved initial center, Int. J. Comput. Sci. Inf. Technol. 1 (2) (2010) 121–125.
[67] T. Su, J.G. Dy, In search of deterministic methods for initializing k-means and gaussian mixture clustering, Intel. Data Anal. 11 (4) (2007) 319–338.
[68] X. Wu, K. Zhang, A better tree-structured vector quantizer, in: IEEE Data Compression Conference, Snowbird, UT, 1991, pp. 392–401.
[69] C.-M. Huang, R.W. Harris, A comparison of several vector quantization codebook generation approaches, IEEE Trans. Image Process. 2 (1) (1993) 108–112.
[70] D. Boley, Principal direction divisive partitioning, Data Min. Knowl. Discov. 2 (4) (1998) 325–344.
[71] M.E. Celebi, H.A. Kingravi, Deterministic initialization of the k-means algorithm using hierarchical clustering, Int. J. Pattern Recognit. Artif. Intell. 26 (07) (2012) 1250018.
[72] S. Sieranoja, P. Fränti, Random projection for k-means clustering, in: Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), Zakopane, Poland, June 2018, pp. 680–689.
[73] S.-W. Ra, J.-K. Kim, A fast mean-distance-ordered partial codebook search algorithm for image vector quantization, IEEE Trans. Circuits Syst. 40 (September) (1993) 576–579.
[74] I. Cleju, P. Fränti, X. Wu, Clustering based on principal curve, in: Scandinavian Conf. on Image Analysis, LNCS, vol. 3540, Springer, Heidelberg, 2005, pp. 872–881.
[75] X. Wu, Optimal quantization by matrix searching, J. Algorithms 12 (4) (1991) 663–673.
[76] M.B. Al-Daoud, S.A. Roberts, New methods for the initialisation of clusters, Pattern Recognit. Lett. 17 (5) (1996) 451–455.
[77] P. Gourgaris, C. Makris, A density based k-means initialization scheme, EANN Workshops, Rhodes Island, Greece, 2015.
[78] A. Rodriquez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[79] P. Mitra, C. Murthy, S.K. Pal, Density-based multiscale data condensation, IEEE Trans. Pattern Anal. Mach. Intel. 24 (6) (2002) 734–747.
[80] S. Sieranoja, P. Fränti, Constructing a high-dimensional kNN-graph using a Z-order curve, ACM J. Exp. Algorithmics 23 (1, October) (2018) 1.9:1–21.
[81] W. Dong, C. Moses, K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in: Proceedings of the ACM International Conference on World Wide Web, ACM, 2011, pp. 577–586.
[82] P. Fränti, S. Sieranoja, Dimensionally distributed density estimation, in: Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), Zakopane, Poland, June 2018, pp. 343–353.
[83] H.J. Curti, R.S. Wainschenker, FAUM: Fast Autonomous Unsupervised Multidimensional classification, Inf. Sci. 462 (2018) 182–203.
[84] J. Xie, Z.Y. Xiong, Y.F. Zhang, Y. Feng, J. Ma, Density core-based clustering algorithm with dynamic scanning radius, Knowl.-Based Syst. 142 (2018) 68–70.
[85] Y. Linde, A. Buzo, R.M. Gray, An algorithm for vector quantizer design, IEEE Trans. Commun. 28 (1, January) (1980) 84–95.
[86] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, in: KDD Workshop on Text Mining, vol. 400, Boston, 2000, pp. 525–526.
[87] S.-S. Yu, S.-W. Chu, C.-M. Wang, Y.-K. Chan, T.-C. Chang, Two improved k-means algorithms, Appl. Soft Comput. 68 (2018) 747–755.
[88] B. Bahmani, B. Mosley, A. Vattani, R. Kumar, S. Vassilvitski, Scalable k-means++, Proc. VLDB Endow. 5 (7) (2012) 622–633.
[89] T. Kaukoranta, P. Fränti, O. Nevalainen, A fast exact GLA based on code vector activity detection, IEEE Trans. Image Process. 9 (8, August) (2000) 1337–1342.

Pasi Fränti received his MSc and PhD degrees from the University of Turku, 1991 and 1994 in Science. Since 2000, he has been a professor of Computer Science at the University of Eastern Finland (UEF). He has published 81 journals and 167 peer review conference papers, including 14 IEEE transaction papers. His main research interests are in machine learning, data mining, pattern recognition including clustering algorithms and intelligent location-aware systems. Significant contributions have also been made in image compression, image analysis, vector quantization and speech technology.

Sami Sieranoja received the B.Sc. and M.Sc. degrees in University of Eastern Finland, 2014 and 2015. Currently he is a doctoral student at the University of Eastern Finland. His research interests include neighborhood graphs and data clustering.