Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog

Article history: Received 13 August 2018; Revised 11 March 2019; Accepted 13 April 2019; Available online 15 April 2019.
Keywords: Clustering algorithms; K-means; Initialization; Clustering accuracy; Prototype selection.

Abstract: In this paper, we study what are the most important factors that deteriorate the performance of the k-means algorithm, and how much this deterioration can be overcome either by using a better initialization technique, or by repeating (restarting) the algorithm. Our main finding is that when the clusters overlap, k-means can be significantly improved using these two tricks. Simple furthest point heuristic (Maxmin) reduces the number of erroneous clusters from 15% to 6%, on average, with our clustering benchmark. Repeating the algorithm 100 times reduces it further down to 1%. This accuracy is more than enough for most pattern recognition applications. However, when the data has well separated clusters, the performance of k-means depends completely on the goodness of the initialization. Therefore, if high clustering accuracy is needed, a better algorithm should be used instead.

© 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction

K-means (KM) algorithm [1–3] groups N data points into k clusters by minimizing the sum of squared distances between every point and its nearest cluster mean (centroid). This objective function is called sum-of-squared errors (SSE). Although k-means was originally designed for minimizing SSE of numerical data, it has also been applied to other objective functions (even some non-numeric).

Sometimes the term k-means is used to refer to the clustering problem of minimizing SSE [4–7]. However, we consider here k-means as an algorithm. We study how well it performs as a clustering algorithm to minimize the given objective function. This approach follows the recommendation in [8] to establish a clear distinction between the clustering method (objective function) and the clustering algorithm (how it is optimized).

In real-life applications, the selection of the objective function is much more important. Clustering results depend primarily on the selected objective function, and only secondarily on the selected algorithm. A wrong choice of the function can easily reverse the benefit of a good algorithm, so that a proper objective function with a worse algorithm can provide better clustering than a good algorithm with the wrong objective function. However, it is an open question how much clustering results are biased because of using an inferior algorithm.

There are other algorithms that are known, in many situations, to provide better clustering results than k-means. However, k-means is popular for good reasons. First, it is simple to implement. Second, people often prefer to use an extensively studied algorithm whose limitations are known rather than a potentially better, but less studied, algorithm that might have unknown or hidden limitations. Third, the local fine-tuning capability of k-means is very effective, and for this reason, it is also used as a part of better algorithms such as the genetic algorithm [9,10], random swap [11,12], particle swarm optimization [13], spectral clustering [14], and density clustering [15]. Therefore, our results can also help to better understand those more complex algorithms that rely on the use of k-means.

K-means starts by selecting k random data points as the initial set of centroids, which is then improved by two subsequent steps. In the assignment step, every point is put into the cluster of the nearest centroid. In the update step, the centroid of every cluster is recalculated as the mean of all data points assigned to the cluster. Together, these two steps constitute one iteration of k-means. These steps fine-tune both the cluster borders and the centroid locations. The algorithm is iterated a fixed number of times, or until convergence (no further improvement is obtained). MacQueen also presented a sequential variant of k-means [2], where the centroid is updated immediately after every single assignment.
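For illustration, the batch version of these two steps can be sketched as follows (a minimal NumPy sketch, not the implementation used in the experiments; `X` is the N×D data matrix and `C_init` contains the k initial centroids):

```python
import numpy as np

def kmeans(X, C_init, max_iter=100):
    """Basic k-means: alternate assignment and update until no improvement."""
    C = C_init.copy()
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (N, k)
        labels = dist.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        C_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(len(C))])
        if np.allclose(C_new, C):        # convergence: centroids no longer move
            break
        C = C_new
    return C, labels
```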
K-means has excellent fine-tuning capabilities. Given a rough allocation of the initial cluster centroids, it can usually optimize …

∗ Corresponding author. E-mail addresses: pasi.franti@uef.fi (P. Fränti), sami.sieranoja@uef.fi (S. Sieranoja).
https://doi.org/10.1016/j.patcog.2019.04.014
Fig. 2. Three examples of clustering results when using the SSE cost function. A Gaussian cluster is split into several spherical clusters (left); a mismatch of the variance causes the larger cluster to be split (middle); a mismatch of the cluster sizes does not matter if the clusters are well-separated.
The number of k-means repeats varies from a relatively small amount of 10–20 [5,33,35] to a relatively high value of 100 [36]. The most extreme example is [34], where a 20 h time limit is applied. Although they stop iterating if the running time grows to twice that of their proposed algorithm, it is still quite extensive. Several papers do not repeat k-means at all [6,7,37].

The choice of the initialization and the number of repeats might also vary depending on the motivation. The aim of using k-means can be to have a good clustering result, or to provide merely a point of comparison. In the first case, all the good tricks are used, such as more repeats and better initialization. In the second case, some simpler variant is more likely applied. A counter-example is [34], where serious efforts seem to be made to ensure all algorithms have the best possible performance.

In this paper we study the most popular initialization heuristics. We aim at answering the following questions. First, to what extent can k-means be improved by a better initialization technique? Second, can the fundamental weakness of k-means be eliminated simply by repeating the algorithm several times? Third, can we predict under which conditions k-means works, and under which it fails?

In a recent study [39], it was shown that k-means performs poorly when the clusters are well separated. Here we will answer how much a better initialization or repeats can compensate for this weakness. We will also show that dimensionality does not matter for most variants, and that unbalance of cluster sizes deteriorates the performance of most initializations.

The rest of the paper is organized as follows. In Section 2, we define the methodology and data. We also give a brief review of the properties of the standard k-means algorithm. Different initialization techniques are then studied in Section 3. Experimental analysis is performed in Section 4, and conclusions are drawn in Section 5.

2. Performance of k-means

Following the recommendation of Jain [8], we make a clear distinction between the clustering method and algorithm. Clustering method refers to the objective function, and clustering algorithm to the process optimizing it. Without this distinction, it would be easy to draw wrong conclusions.

For example, k-means has been reported to work poorly with unbalanced cluster sizes [40], and to cause large clusters to be wrongly split and smaller clusters wrongly merged [41]. These observations themselves are correct but they miss the root cause, which is the SSE objective function. Even an optimal algorithm minimizing SSE would end up with the same incorrect result. Such observations therefore relate to the objective function, and not to the k-means algorithm.

Fig. 2 demonstrates the situation. An algorithm minimizing SSE would find spherical clusters regardless of the data. If the data contain non-spherical clusters, they would be divided into spherical sub-clusters, usually along the direction of the highest variance. Clusters of variable sizes would also cause large clusters to be split, and smaller ones to be merged. In these cases, if natural clusters are wanted, a better clustering result could be achieved by using an objective function based on Mahalanobis distance [42] or a Gaussian mixture model [43] instead of SSE.

2.1. Datasets

In this paper, we focus on the algorithmic performance of k-means rather than the choice of the objective function. We use the clustering basic benchmark [39] as all these datasets can be clustered correctly with SSE. Therefore, any clustering errors made by k-means must originate from the properties of the algorithm, and not from the choice of a wrong objective function. The datasets are summarized in Table 1. They are designed to vary the following properties as defined in [39]:

• Cluster overlap
• Number of clusters
• Dimensionality
• Unbalance of cluster sizes

2.2. Methodology

To measure the success of the algorithm, the value of the objective function itself is the most obvious measure. Existing literature reviews of k-means use either SSE [19,22], or the deviation of the clusters [20], which is also a variant of SSE. It is calculated as:

SSE = \sum_{i=1}^{N} \left\lVert x_i - c_j \right\rVert^2    (1)

where x_i is a data point and c_j is its nearest centroid. In [39], SSE is also measured relative to the SSE-value of the ground truth solution (SSE_opt):

\varepsilon\text{-ratio} = \frac{SSE - SSE_{opt}}{SSE_{opt}}    (2)

If the ground truth is known, external indexes such as adjusted Rand index (ARI), Van Dongen (VD), variation of information (VI) or normalized mutual information (NMI) can also be used [22]. A comparative study of several suitable indexes can be found in [44]. The number of iterations has also been studied in [19,22], and the time complexities reported in [22].

The problem of SSE, and most of the external indexes, is that the raw value does not tell how significant the result is. We therefore use the Centroid Index (CI) [45], which indicates how many cluster centroids are wrongly located.
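For reference, Eqs. (1)–(2) translate directly into code (a minimal NumPy sketch; the variable name `C_gt` for the ground-truth centroids is ours):

```python
import numpy as np

def sse(X, C):
    """Sum-of-squared errors: squared distance of every point to its nearest centroid (Eq. 1)."""
    dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (N, k) squared distances
    return dist.min(axis=1).sum()

def epsilon_ratio(X, C, C_gt):
    """Relative SSE with respect to the ground-truth solution (Eq. 2)."""
    sse_opt = sse(X, C_gt)
    return (sse(X, C) - sse_opt) / sse_opt
```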
98 P. Fränti and S. Sieranoja / Pattern Recognition 93 (2019) 95–112
Table 1
Basic clustering benchmark [39]. The data is publicly available here: http://cs.uef.fi/sipu/datasets/.
Fig. 4. Centroid index measures how many real clusters are missing a centroid (+), or how many centroids are allocated to a wrong cluster (−). Six examples are shown for the S2 dataset.
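A sketch of how such a comparison can be computed, assuming the nearest-centroid mapping formulation of CI in [45] (the variable names `A` and `B` for the two centroid sets are ours, and the exact implementation in the paper may differ):

```python
import numpy as np

def ci_one_way(A, B):
    """Map every centroid in A to its nearest centroid in B and count
    how many centroids of B receive no mapping (orphan clusters)."""
    dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)
    orphans = set(range(len(B))) - set(nearest.tolist())
    return len(orphans)

def centroid_index(A, B):
    """Symmetric centroid index: CI = 0 means every real cluster got exactly one centroid."""
    return max(ci_one_way(A, B), ci_one_way(B, A))
```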
Fig. 5. Success rate (%) of k-means, measured as the probability of finding correct clustering, improves when the cluster overlap increases.
Table 2
Summary of the initialization techniques compared in this paper. Time refers to the average processing time with the A3 dataset (N = 7500, k = 50). Randomized refers to whether the technique includes randomness naturally. Randomness will be needed for the repeated k-means variant later.
…the same point twice, and that the selection is independent of the order of the data. For the random number generator we use the method in [52]. We refer to this initialization method as random centroids.

A slightly different variant in [2] simply selects the first k data points. This is the default option of Quick Cluster in IBM SPSS Statistics [53]. If the data is in random order, the result is effectively the same as random centroids, except that it always provides the same selection.

We note that randomness is actually a required property for the repeated k-means variant, because we must be able to produce a different solution at every repeat. Some practitioners might not like the randomness and prefer deterministic algorithms that always produce the same result. However, both of these goals can actually be achieved if so wanted: we simply use a pseudo-random number generator with the same seed number. In this way, single runs of k-means produce different results, but the overall algorithm still always produces the same result for the same input.
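This seeding idea can be sketched as follows (NumPy's default generator stands in for the generator of [52], and `kmeans` refers to the earlier sketch; this is an illustration, not the code used in the experiments):

```python
import numpy as np

def random_centroids(X, k, rng):
    """Random centroids: pick k distinct data points, independently of the data order."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].copy()

def repeated_kmeans(X, k, repeats=100, seed=1234):
    """Repeated k-means: a different initialization on every repeat, yet the overall
    result is deterministic because the generator always starts from the same seed."""
    rng = np.random.default_rng(seed)
    best_C, best_sse = None, np.inf
    for _ in range(repeats):
        C, labels = kmeans(X, random_centroids(X, k, rng))
        cur = ((X - C[labels]) ** 2).sum()          # SSE of this run
        if cur < best_sse:
            best_C, best_sse = C, cur
    return best_C, best_sse
```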
Fig. 8. Initial centroids created by random partition (left), by Steinley’s variant (middle), and the final result after the k-means iterations (right).
Fig. 8 shows the effect of the random partition and Steinley's variant. Both variants locate the initial centroids near the center of the data. If the clusters have low overlap, k-means cannot provide enough movement and many of the far away clusters will lack centroids in the final solution.

3.3. Furthest point heuristic (Maxmin)

Another popular technique is the furthest point heuristic [54]. It was originally presented as a standalone 2-approximate clustering algorithm but has been widely used to initialize k-means. It selects an arbitrary point as the first centroid and then adds new centroids one by one. At each step, the next centroid is the point that is furthest (max) from its nearest (min) existing centroid. This is also known as Maxmin [19,21,22,55].

A straightforward implementation requires O(k²N) time, but it can be easily reduced to O(kN) as follows. For each point, we maintain a pointer to its nearest centroid. When adding a new centroid, we calculate the distance of every point to this new centroid. If the new distance is smaller than the distance to the previous nearest centroid, it is updated. This requires N distance calculations. The process is repeated k times, and the time complexity is therefore O(kN) in total, which is the same as one iteration of k-means. Further speedup can be achieved by searching for the furthest point in just a subset of the data [56].

There are several alternative ways to choose the first centroid. In the original variant the selection is arbitrary [54]. In [55], the furthest pair of points are chosen as the first two centroids. Another variant selects the point with maximum distance to the origin [57] because it is likely to be located far from the center. Maximum density has also been used [51,58].

K-means++ [59] is a randomized variant of the furthest point heuristic. It chooses the first centroid randomly and the next ones using a weighted probability p_i = cost_i / \sum_j cost_j, where cost_i is the squared distance of the data point x_i to its nearest existing centroid. This algorithm is an O(log k)-approximation to the problem. We also implement k-means++ for our tests because of its popularity.

Chiang and Mirkin [55] recalculate all the centroids after updating the partitions, and the next centroid is selected as the farthest from the recently added centroid. A slightly more complex variant [23] selects the point that decreases the objective function most. It requires calculation of all distances between every pair of points, which takes O(N²) time. Thus, it does not meet our criteria for k-means initialization; with the same amount of computation we could already run an agglomerative clustering algorithm.

Other authors weight the distances by the density of the point [51,58]. This reduces the probability that outliers are selected. Erisoglu et al. [60] use the cumulative distance to all previous centroids instead of the maxmin criterion. However, this performs worse because it can easily choose two nearby points, provided that they have a large cumulative distance to all other centroids [61].

We use here a variant that selects the first point randomly [54,59]. This adds randomness to the process, as required by the repeated k-means variant. The next centroids are selected using the original maxmin criterion, i.e. choosing the point with the biggest distance to its nearest centroid.
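The O(kN) bookkeeping described above, and the weighted selection of k-means++, can be sketched as follows (our NumPy illustration, not the implementation used in the experiments):

```python
import numpy as np

def maxmin_init(X, k, rng):
    """Furthest point heuristic: random first centroid, then each next centroid is the
    point with the largest distance to its nearest already-chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)       # squared distance to nearest centroid so far
    for _ in range(k - 1):
        centroids.append(X[int(d2.argmax())])        # furthest (max) from nearest (min) centroid
        d2 = np.minimum(d2, ((X - centroids[-1]) ** 2).sum(axis=1))  # O(N) update per new centroid
    return np.array(centroids)

def kmeanspp_init(X, k, rng):
    """k-means++: same bookkeeping, but the next centroid is drawn with
    probability proportional to the squared distance cost_i."""
    centroids = [X[rng.integers(len(X))]]
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
        d2 = np.minimum(d2, ((X - centroids[-1]) ** 2).sum(axis=1))
    return np.array(centroids)
```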
Fig. 11. Examples of the two projection-based heuristics for A2 dataset: random points (left), and the furthest point projections (right) [72].
…[46]. A more complex principal curve has also been used for clustering [74].

We consider two simple variants: random and two furthest points projection, as studied in [72]. The first heuristic takes two random data points and projects the data to the line passing through these two reference points. The key idea is the randomness; a single selection may provide a poor initialization, but when repeated several times, the chance of finding one good initialization increases, see Fig. 11. We include this technique in our experiments and refer to it as Projection.

The second heuristic is slightly more deterministic but still random. We start by selecting a random point, and calculate its furthest point. The projection axis is the line passing through these two reference points. We again rely on randomness, but now the choices are expected to be more sensible, potentially providing better results using fewer trials. However, according to [72] this variant does not perform any better than the simpler random heuristic.

Projection works well if the data has a one-dimensional structure. In [72], a projective value is calculated to estimate how well a given projection axis models the data. Of our data, Birch2 and G2 have high projective values and are suitable for the projection-based technique. However, with all other datasets, the projection does not make much more sense than the naïve sorting heuristics, see Fig. 10.

We also note that projection-based techniques generalize to segmentation-based clustering, where k−1 dividing planes are searched simultaneously using dynamic programming [74,75]. These clustering results usually require fine-tuning by k-means at the final step, but nevertheless, they are standalone algorithms.

3.6. Density-based heuristics

Density was already used both with the furthest point and the sorting heuristics, but the concept deserves a little further discussion. The idea of using density itself is appealing, but it is not trivial how to calculate the density, and how to use it in clustering, especially since the initialization technique should be fast and simple.

The main bottleneck of these algorithms is how the density is estimated for the points. There are three common approaches for this:

• Buckets
• ε-radius circle
• k-nearest neighbors (KNN)

The first approach divides the space by a regular grid, and counts the frequency of the points in every bucket [76]. The density of a point is then inherited from the bucket it is in. This approach is feasible in low-dimensional space but would become impractical in higher-dimensional spaces. In [61], the problem is addressed by processing the dimensions independently in a heuristic manner. Other authors have used a kd-tree [51,57] or a space-filling curve [77] to partition the space into buckets containing roughly the same number of points. In [51,57], the number of buckets is 10k.

The other two approaches calculate the density for every point individually. The traditional one is to define a neighborhood using a cutoff threshold (ε-radius), and then count the number of other points within this neighborhood [21,63,64,78]. The third approach finds the k-nearest neighbors of a point [79], and then calculates the average distance to the points within this neighborhood. Lemke and Keller calculate the density between every pair of points [49].

The bottleneck of the last two approaches is that we need to find the points that are within the neighborhood. This requires O(N²) distance calculations in both cases. Several speed-up techniques and approximate variants exist [80,81], but none that is both fast and simple to implement. Calculating density values only for a subset of size SQRT(N) would reduce the complexity to O(N^1.5) …
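For example, the first (bucket) approach could be sketched as follows for low-dimensional data (an illustration only; the grid resolution and edge handling are our assumptions, not the method of [76] or [61]):

```python
import numpy as np

def bucket_density(X, bins_per_dim=10):
    """Estimate a density value for every point from a regular grid:
    each point inherits the frequency count of the bucket it falls into.
    Feasible only in low dimensions (the number of buckets grows as bins**D)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Bucket index of every point along every dimension (0 .. bins_per_dim-1).
    cell = np.floor((X - lo) / (hi - lo + 1e-12) * bins_per_dim).astype(int)
    cell = np.clip(cell, 0, bins_per_dim - 1)
    # Count points per bucket and map the counts back to the points.
    keys = [tuple(c) for c in cell]
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return np.array([counts[key] for key in keys])
```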
…into random subsets. For instance, if we divide the data into R subsets of size N/R, the total processing time would be roughly the same as that of a single run.

For example, Bradley and Fayyad [31] apply k-means to a sub-sample of size N/R, where R = 10 was recommended. Each sample is clustered by k-means starting with random centroids. However, instead of taking the best clustering of the repeats, a new dataset is created from the Rk centroids. This new dataset is then clustered by repeated k-means (R repeats). The total time complexity is Rk(N/R) + Rk² = kN + Rk², where the first part comes from clustering the sub-samples, and the second part from clustering the combined set. If k = SQRT(N), then this would be N^1.5 + RN. Overall, the algorithm is fast and satisfies the criteria for an initialization technique.

Bahmani et al. [88] have a similar approach. They repeat k-means++ R = O(log N) times to obtain Rk preliminary centroids, which are then used as a new dataset for clustering by standard k-means. They reported that R = 5 would be sufficient for the number of repeats. In our experiments, we consider the Bradley and Fayyad [31] approach as an initialization, and use R = 100 repeats as with all techniques.
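A sketch of the Bradley and Fayyad pipeline as described above (`kmeans` and `random_centroids` refer to the earlier sketches; R = 10 follows the recommendation in [31], and the details are our reading of the description rather than the original code):

```python
import numpy as np

def bradley_fayyad_init(X, k, R=10, seed=1234):
    """Cluster R random sub-samples of size N/R, pool the R*k resulting centroids,
    and cluster that small pooled set again to obtain k initial centroids."""
    rng = np.random.default_rng(seed)
    N = len(X)
    pooled = []
    for _ in range(R):
        sample = X[rng.choice(N, size=N // R, replace=False)]
        C, _ = kmeans(sample, random_centroids(sample, k, rng))
        pooled.append(C)
    pooled = np.vstack(pooled)                      # R*k intermediate centroids
    best_C, best_sse = None, np.inf
    for _ in range(R):                              # repeated k-means on the pooled set
        C, labels = kmeans(pooled, random_centroids(pooled, k, rng))
        cur = ((pooled - C[labels]) ** 2).sum()
        if cur < best_sse:
            best_C, best_sse = C, cur
    return best_C
```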
4. Experimental results

We next study the overall performance of the different initialization techniques, and how the results depend on the following factors:

• Overlap of clusters
• Number of clusters
• Dimensions
• Unbalance of cluster sizes

The overall results (CI-values and success rates) are summarized in Table 3. We also record (as fails) how many datasets provide a success rate of p = 0%. This means that the algorithm cannot find the correct clustering even with 5000 repeats. We test the following methods:

• Rand-P
• Rand-C
• Maxmin
• kmeans++
• Bradley
• Sorting
• Projection
• Luxburg
• Split

4.1. Overall results

CI-values: Random partition works clearly worse (CI = 12.4) than random centroids (CI = 4.5). The Bradley and sorting heuristics are slightly better (CI = 3.1 and 3.3), but the maxmin heuristics (Maxmin and kmeans++) are the best among the true initialization techniques (CI = 2.2 and 2.3). The standalone algorithms (Luxburg and Split) are better (CI = 1.2 and 1.2), but even they provide the correct result (CI = 0) only for the easiest dataset: DIM32.

Success rates: The results show that Maxmin is a reasonable heuristic. Its average success rate is 22% compared to 5% for random centroids. It also fails (success rate = 0%) only in the case of three datasets; the datasets with a high number of clusters (A3, Birch1, Birch2). Random partition works with S2, S3 and S4 but fails with all the other 8 datasets. The standalone algorithms (Luxburg and Split) provide 40% success rates, on average, and fail only with Birch1 and Unbalance.

Effect of iterations: From the initial results we can see that Luxburg and Bradley are already standalone algorithms for which k-means brings only little improvement. The average CI-value of Luxburg improves only from 1.7 to 1.2 (∼30%), and Bradley from 3.4 to 3.1 (∼10%). The latter is more understandable as k-means is already involved in the iterations. The split heuristic, although a standalone algorithm, leaves more space for k-means to improve (61%).

Number of iterations: The main observation is that the easier the dataset, and the better the initialization, the fewer the iterations needed. The differences between the initializations vary from 20 (Luxburg) to 36 (Rand-C); with the exception of random partition (Rand-P), which takes 65 iterations.

4.2. Cluster overlap

The results with the S1–S4 datasets (Table 3) demonstrate the effect of the overlap in general: the less overlap, the worse the k-means performance. Some initialization techniques can compensate for this weakness. For example, the maxmin variants and the standalone algorithms reduce this phenomenon but do not remove it completely. They provide a better initial solution with S1 (less overlap) than with S4 (more overlap), but the final result after the k-means iterations is still not much different. An extreme case is DIM32, for which all these better techniques provide the correct solution. However, they do it even without k-means iterations!

Further tests with G2 confirm the observation, see Fig. 13. When overlap is less than 2%, the k-means iterations do not help much and the result depends mostly on the initialization. If the correct clustering is found, it is found without k-means. Thus, the clustering is solved by a better algorithm, not by better k-means initialization. In the case of high overlap, k-means reaches almost the same result (about 88% success rate) regardless of how it was initialized.

4.3. Number of clusters

The results with the A1–A3 datasets (Table 3) show that the more clusters there are, the higher the CI-value and the lower the success rate. This phenomenon holds for all initialization techniques and is not specific to the k-means algorithm only. If an algorithm provides correct clustering with success rate p for a dataset of size k, then p is expected to decrease when k increases. Fig. 14 confirms this dependency with the Birch2 subsets. The projection heuristic is the only technique that manages to capture the hidden 1-dimensional structure in this data. The success rate of all other true initialization techniques eventually decreases to 0%.

Fig. 15 shows that the CI-value has a near linear dependency on the number of clusters. In most cases, the relative CI-value converges to a constant when k approaches its maximum (k = 100). An exception is Luxburg, which is less sensitive to the increase of k, providing values CI = (0.82, 1.25, 1.42, 1.54) for k = (25, 50, 75, 100). Besides this exception, we conclude that the performance has a linear dependency on k regardless of the initialization technique.

4.4. Dimensions

We tested the effect of dimensions using the DIM and G2 datasets. Two variants (Maxmin, Split) solve the DIM sets almost every time (99–100%), whereas Kmeans++ and Luxburg solve them most of the time (≈95%), see Fig. 16. Interestingly, they find the correct result by the initialization, and no k-means iterations are needed. In general, if the initialization technique is able to solve the clustering, it does it regardless of the dimensionality.

The sorting and projection heuristics are exceptions in this sense; their performance actually improves with the highest dimensions. The reason is that when the dimensions increase, the clusters eventually become so clearly separated that even such naïve heuristics will be able to cluster the data. In general, the reason for success or failure is not the dimensionality but the cluster separation.

The results with G2 confirm the above observation, see Fig. 16. With the lowest dimensions, k-means iterations work because some cluster overlap exists. However, for higher dimensions the overlap eventually disappears and the performance starts to depend mainly on the initialization. We also calculated how much the success rate correlates with the dimensions and the overlap. The results in Table 4 show that the final result correlates much more strongly with the overlap than with the dimensionality.

Since there is causality between dimensions and overlap, it is unclear whether the dimensionality has any role at all. To test this further, we generated additional datasets with D = 2–16 and compared only those with overlap = 2%, 4%, 8%. The results showed that the success of the k-means iterations does not depend on the dimensions even when the clusters overlap.

To sum up, our conclusion is that k-means iterations cannot solve the problem when the clusters are well separated. All techniques that solve these datasets do it already by the initialization technique, without any help from k-means. When there is overlap, k-means works better. But even then, the performance does not depend on the dimensionality.
Table 3
Average CI-values before and after the k-means iterations, success rates, and the number of iterations performed. The results are averages of 5000 runs. Fail records the number of datasets for which the correct solution was never found (success rate = 0%). From the DIM datasets we report only DIM32; the results for the others are practically the same. Note: the values for the Impr. and Aver. columns are calculated from precise values and not from the shown rounded values. (For interpretation of the references to color in the Table the reader is referred to the web version of this article.)

CI-values (initial)
             S1    S2    S3    S4    A1    A2    A3    Unb   B1    B2    D32   Aver.
Rand-P      12.5  14.0  12.8  14.0  19.0  32.9  48.1   7.0  96.0  96.6  13.1  33.3
Rand-C       5.3   5.5   5.4   5.4   7.3  12.7  18.2   4.6  36.6  36.6   5.8  13.0
Maxmin       1.3   2.9   6.1   6.8   2.1   4.1   5.0   0.9  21.4   9.6   0.0   5.5
kmeans++     1.7   2.3   3.2   3.3   3.1   5.6   7.9   0.8  21.3  10.4   0.1   5.4
Bradley      1.0   0.7   0.6   0.5   1.5   3.4   5.3   3.3   5.7  13.6   1.7   3.4
Sorting      3.3   3.7   4.1   4.4   4.9  10.4  15.6   4.0  34.1   7.2   1.7   8.5
Projection   3.0   3.4   3.9   4.2   4.5   9.8  15.2   4.0  33.7   1.0   1.1   7.6
Luxburg      0.8   0.8   1.1   1.3   0.9   1.1   1.2   4.2   5.6   1.7   0.0   1.7
Split        0.5   0.8   1.4   1.4   1.3   2.4   3.5   4.5  12.0   2.7   0.0   2.8

CI-values (final)
             S1    S2    S3    S4    A1    A2    A3    Unb   B1    B2    D32   Aver.  Impr.
Rand-P       3.3   0.6   1.2   0.4   6.0  10.7  17.9   4.0  11.3  75.6   5.3  12.4   63%
Rand-C       1.8   1.4   1.3   0.9   2.5   4.5   6.6   3.9   6.6  16.6   3.6   4.5   65%
Maxmin       0.7   1.0   0.7   1.0   1.0   2.6   2.9   0.9   5.5   7.3   0.0   2.2   62%
kmeans++     1.0   0.9   1.0   0.8   1.5   2.9   4.2   0.5   4.9   7.2   0.1   2.3   57%
Bradley      0.9   0.6   0.5   0.4   1.3   3.0   4.8   3.5   4.6  12.5   1.6   3.1   11%
Sorting      1.3   1.1   1.0   0.7   1.5   3.6   5.5   4.0   5.7   4.3   1.4   2.7   69%
Projection   1.2   0.9   0.8   0.6   1.2   3.3   5.2   4.0   5.3   0.2   0.9   2.2   71%
Luxburg      0.5   0.4   0.6   0.4   0.6   0.9   1.0   4.0   2.7   1.6   0.0   1.2   29%
Split        0.2   0.3   0.4   0.4   0.5   1.1   1.8   4.0   2.8   1.6   0.0   1.2   61%

Success-%

Number of iterations
Fig. 13. Average success rates for all G2 datasets before (gray) and after k-means (white). The datasets were divided into two categories: those with low overlap <2% (left),
and those with high overlap ≥2% (right).
Fig. 14. Dependency of the success rate on the number of clusters when using the subsets of Birch2 (B2-sub).

Fig. 15. Dependency of the relative CI-values (CI/k) on the number of clusters when using the subsets of Birch2 (B2-sub).

4.5. Unbalance of cluster sizes

The k-means iterations manage to move only one centroid into the sparse area, and all other centroids will remain in the dense area, see Fig. 17. The probability that a single random centroid would be selected from the sparse area is p = 500/6500 ≈ 7%. Picking all of the required five centroids from the sparse area would happen with a probability of only 0.01%,¹ i.e. only once every 8430 runs.

¹ \binom{8}{5} p^5 (1 − p)^3.

Besides Rand-C and Rand-P, the sorting and projection heuristics and the Luxburg and Split algorithms all fail with this data by allocating most centroids to the dense area. Bradley works only slightly better and often allocates two centroids to the sparse area. The maxmin heuristics work best because they rely more on distances than on frequencies. K-means++ typically misses one centroid, whereas Maxmin does the opposite and allocates one too many centroids in the sparse area. They provide success rates of 22% (Maxmin) and 51% (KM++), in contrast to the other techniques that result in 0% success.

To sum up, success depends mainly on the goodness of the initialization; k-means iterations can do very little with this kind of data. If the correct clustering is found, it is found mainly without k-means.

4.6. Repeats

We next investigate to what extent the k-means performance can be improved by repeating the algorithm several times. Table 5 summarizes the results. We can see that a significant improvement is achieved with all initialization techniques. When the success rate of a single run of k-means is 2% or higher, CI = 0 can always be reached thanks to the repeats. However, none of the variants can solve all datasets. The overall performance of the different initialization techniques can be summarized as follows:

• Random partition is almost hopeless and the repeats do not help much. It only works when the clusters have strong overlap. But even then, k-means works relatively well anyway regardless of the initialization.
• Random centroids is improved from CI = 4.5 to 2.1, on average, but still it can solve only three datasets (S2, S3, S4). Two other datasets (S1, A1) could be solved with significantly more repeats, but not the rest.
• The maxmin variants are the best among the simple initialization techniques, providing CI = 0.7, on average, compared to 2.1 of Rand-C. They still fail with four datasets. K-means++ is not significantly better than the simpler Maxmin.
• The standalone algorithms (Luxburg and Split) are the best. They provide an average value of CI = 1.2 without the repeats, and CI = 0.4 with 100 repeats. They fail only with the Unbalance dataset.

The improvement from the repeats is achieved at the cost of increased processing time. We used the fast k-means variant [89] that utilizes the activity of the centroids. For the smaller datasets the results are close to real-time, but with the largest dataset (Birch1, N = 100,000), the 100 repeats can take from 10–30 min. We extended the tests and ran 200,000 repeats for the A3 and Unbalance datasets. The results in Table 6 show that Maxmin would need 216 repeats to reach CI = 0 with A3, on average, whereas k-means++ would require 8696 repeats even though it finds CI = 1 already after 138 repeats. The results also show that the Unbalance dataset is difficult for almost all initialization techniques, but the maxmin heuristics are the most suitable for this type of data.
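The probabilities quoted above for the Unbalance data follow directly from footnote 1; a quick numerical check (our verification, not part of the original paper):

```python
from math import comb

p = 500 / 6500                               # chance of one random centroid landing in the sparse area
all_five = comb(8, 5) * p**5 * (1 - p)**3    # exactly 5 of the 8 centroids in the sparse area
print(f"single centroid: {p:.1%}")           # 7.7%, rounded to 7% in the text
print(f"all five: {all_five:.4%}  (about once every {1/all_five:.0f} runs)")  # ~0.012%, ~8400 runs
```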
Table 4
Correlation of success rate with increasing overlap (left) and dimensions (right) with
the G2 datasets (3:3 centroid allocation test). Red>0.60, Yellow = 0.30–0.53.
Fig. 16. Dependency of success rate on the dimensions when no overlap (DIM sets), and with overlap (G2 datasets). The results of G2 are average success rates for all
sd = 10–100 (G2-D-sd) with a given dimension D, before and after k-means.
4.7. Summary

We make the following observations:

• Random partition provides an initial solution of similar quality regardless of overlap, but the errors in the initial solution can be better fixed by the k-means iterations when the clusters have high overlap. In this case it can even outperform random centroids. However, repeats do not improve the results much, especially with sets having many clusters (A3, Birch2).
• Cluster overlap is the biggest factor. If there is high overlap, k-means iterations work well regardless of the initialization. If there is no overlap, then the success depends completely on the initialization technique: if it fails, k-means will also fail.
• Practically all initialization techniques perform worse when the number of clusters increases. The success of k-means depends linearly on the number of clusters: the more clusters, the more errors there are, before and after the iterations.
• Dimensionality does not have a direct effect. It has a slight effect on some initialization techniques, but the k-means iterations are basically independent of the dimensions.
• Unbalance of cluster sizes can be problematic, especially for the random initializations but also for the other techniques. Only the maxmin variants with 100 repeats can overcome this problem.
Table 5
Performance of the repeated k-means (100 repeats). The last two columns show the average results of all datasets without repeats (KM) and
with repeats (RKM). (For interpretation of the references to color in the Table the reader is referred to the web version of this article.)
Table 7 summarizes how the four factors affect the different initialization techniques and the k-means iterations.

5. Conclusions

On average, k-means caused errors in about 15% of the clusters (CI = 4.5). By repeating k-means 100 times, these errors were reduced to 6% (CI = 2.0). Using a better initialization technique (Maxmin), the corresponding numbers were 6% (CI = 2.1) with k-means as such, and 1% (CI = 0.7) with 100 repeats. For most pattern recognition applications this accuracy is more than enough when clustering is just one component within a complex system.

The most important factor is the cluster overlap. In general, well separated clusters make the clustering problem easier, but for k-means it is just the opposite. When the clusters overlap, k-means iterations work reasonably well regardless of the initialization. This is the expected situation in most pattern recognition applications.

The number of errors has a linear dependency on the number of clusters (k): the more clusters, the more errors k-means makes, but the percentage remains constant. Unbalance of cluster sizes is more problematic. Most initialization techniques fail, and only the maxmin heuristics worked in this case. The clustering result then depends merely on the goodness of the initialization technique.

Dimensionality itself is not a factor. It merely matters how the dimensions affect the cluster overlap. With our data, the clusters became more separated when the dimensions were increased, …
Table 6
Number of repeats in RKM needed to reach a certain CI-level. Missing values (−) indicate that this CI-level was never reached during the 200,000 repeats.

A3
CI-value          6        5        4        3        2        1        0
Rand-P            −        −        −        −        −        −        −
Rand-C            2        4       11       54      428   11,111        −
Maxmin                                       1        3       14      216
Kmeans++                   1        2        3       14      138     8696
Bradley                    1        2        8       58     1058   33,333
Sorting           1        2        4       13       73     1143        −
Projection        1        2        3        9       46      581   18,182
Luxburg                                                        1        3
Split                                                 1        2        9

Unbalance
CI-value          6        5        4        3        2        1        0
Rand-P                              1       97     8333        −        −
Rand-C                              1       16       69     1695     100k
Maxmin                                                         1        4
Kmeans++                                                       1        2
Bradley                             1        3        6       70     1471
Sorting                             1        −        −        −        −
Projection                          1      935   16,667        −        −
Luxburg                             1       59   16,667        −        −
Split                               1     9524        −        −        −
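The repeat counts in Table 6 behave roughly like the mean of a geometric distribution: with a single-run success rate p, about 1/p repeats are needed on average. A small illustration using the success rates reported in Section 4.5 (22% for Maxmin and 51% for k-means++ on Unbalance); the confidence-based variant is our own addition, not from the paper:

```python
from math import log, ceil

def expected_repeats(p):
    """Average number of repeats until the first successful run (geometric distribution)."""
    return 1 / p

def repeats_for_confidence(p, conf=0.95):
    """Repeats needed so that at least one run succeeds with the given confidence."""
    return ceil(log(1 - conf) / log(1 - p))

print(expected_repeats(0.22))          # ~4.5 -> matches the ~4 repeats of Maxmin in Table 6
print(expected_repeats(0.51))          # ~2.0 -> matches k-means++ in Table 6
print(repeats_for_confidence(0.22))    # 13 runs for 95% confidence
```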
Table 7
How the four factors affect the performance of the initialization techniques and of the k-means iterations.
…density peaks [78]. Agglomerative clustering [30] solves 10 out of 11.

References

[1] E. Forgy, Cluster analysis of multivariate data: efficiency vs. interpretability of classification, Biometrics 21 (1965) 768–780.
[2] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Berkeley Symposium on Mathematical Statistics and Probability, 1, Statistics University of California Press, Berkeley, Calif., 1967, pp. 281–297.
[3] S.P. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137.
[4] L. Wang, C. Pan, Robust level set image segmentation via a local correntropy-based k-means clustering, Pattern Recognit. 47 (2014) 1917–1925.
[5] C. Boutsidis, A. Zouzias, M.W. Mahoney, P. Drineas, Randomized dimensionality reduction for k-means clustering, IEEE Trans. Inf. Theory 61 (2, February) (2015) 1045–1062.
[6] M. Capo, A. Perez, J.A. Lozano, An efficient approximation to the k-means clustering for massive data, Knowl.-Based Syst. 117 (2017) 56–69.
[7] Z. Huang, N. Li, K. Rao, C. Liu, Y. Huang, M. Ma, Z. Wang, Development of a data-processing method based on Bayesian k-means clustering to discriminate aneugens and clastogens in a high-content micronucleus assay, Hum. Exp. Toxicol. 37 (3) (2018) 285–294.
[8] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31 (2010) 651–666.
[9] K. Krishna, M.N. Murty, Genetic k-means algorithm, IEEE Trans. Syst. Man Cybern. Part B 29 (3) (1999) 433–439.
[10] P. Fränti, Genetic algorithm with deterministic crossover for vector quantization, Pattern Recognit. Lett. 21 (1) (2000) 61–68.
[11] P. Fränti, J. Kivijärvi, Randomized local search algorithm for the clustering problem, Pattern Anal. Appl. 3 (4) (2000) 358–369.
[12] P. Fränti, Efficiency of random swap clustering, J. Big Data 5 (13) (2018) 1–29.
[13] S. Kalyani, K.S. Swarup, Particle swarm optimization based K-means clustering approach for security assessment in power systems, Expert Syst. Appl. 32 (9) (2011) 10839–10846.
[14] D. Yan, L. Huang, M.I. Jordan, Fast approximate spectral clustering, ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (2009) 907–916.
[15] L. Bai, X. Cheng, J. Liang, H. Shen, Y. Guo, Fast density clustering strategies based on the k-means algorithm, Pattern Recognit. 71 (2017) 375–386.
[16] T. Kinnunen, I. Sidoroff, M. Tuononen, P. Fränti, Comparison of clustering methods: a case study of text-independent speaker modeling, Pattern Recognit. Lett. 32 (13, October) (2011) 1604–1617.
[17] Q. Zhao, P. Fränti, WB-index: a sum-of-squares based index for cluster validity, Data Knowl. Eng. 92 (July) (2014) 77–89.
[18] M. Rezaei, P. Fränti, Can the number of clusters be solved by external index? manuscript (submitted).
[19] J.M. Peña, J.A. Lozano, P. Larrañaga, An empirical comparison of four initialization methods for the k-means algorithm, Pattern Recognit. Lett. 20 (10, October) (1999) 1027–1040.
[20] J. He, M. Lan, C.-L. Tan, S.-Y. Sung, H.-B. Low, Initialization of cluster refinement algorithms: a review and comparative study, IEEE Int. Joint Conf. Neural Netw. (2004).
[21] D. Steinley, M.J. Brusco, Initializing k-means batch clustering: a critical evaluation of several techniques, J. Classification 24 (2007) 99–121.
[22] M.E. Celebi, H.A. Kingravi, P.A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl. 40 (2013) 200–210.
[23] L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, 1990.
[24] B. Thiesson, C. Meek, D.M. Chickering, D. Heckerman, Learning mixtures of Bayesian networks, Technical Report MSR-TR-97-30, Cooper & Moral, 1997.
[25] J.T. Tou, R.C. Gonzales, Pattern Recognition Principles, Addison-Wesley, 1974.
[26] T.F. Gonzalez, Clustering to minimize the maximum intercluster distance, Theor. Comput. Sci. 38 (2–3) (1985) 293–306.
[27] J.H. Ward, Hierarchical grouping to optimize an objective function, J. Am. Stat. Assoc. 58 (301) (1963) 236–244.
[28] A. Likas, N. Vlassis, J. Verbeek, The global k-means clustering algorithm, Pattern Recognit. 36 (2003) 451–461.
[29] D. Steinley, Local optima in k-means clustering: what you don't know may hurt you, Psychol. Methods 8 (2003) 294–304.
[30] P. Fränti, T. Kaukoranta, D.-F. Shen, K.-S. Chang, Fast and memory efficient implementation of the exact PNN, IEEE Trans. Image Process. 9 (5, May) (2000) 773–777.
[31] P. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: International Conference on Machine Learning, San Francisco, 1998, pp. 91–99.
[32] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, New York, 1973.
[33] M. Bicego, M.A.T. Figueiredo, Clustering via binary embedding, Pattern Recognit. 83 (2018) 52–63.
[34] N. Karmitsa, A.M. Bagirov, S. Taheri, Clustering in large data sets with the limited memory bundle method, Pattern Recognit. 83 (2018) 245–259.
[35] Y. Zhu, K.M. Ting, M.J. Carman, Grouping points by shared subspaces for effective subspace clustering, Pattern Recognit. 83 (2018) 230–244.
[36] P.B. Frandsen, B. Calcott, C. Mayer, R. Lanfear, Automatic selection of partitioning schemes for phylogenetic analyses using iterative k-means clustering of site rates, BMC Evol. Biol. 15 (13) (2015).
[37] D.G. Márquez, A. Otero, P. Félix, C.A. García, A novel and simple strategy for evolving prototype based clustering, Pattern Recognit. 82 (2018) 16–30.
[38] L. Huang, H.-Y. Chao, C.-D. Wang, Multi-view intact space clustering, Pattern Recognit. 86 (2019) 344–353.
[39] P. Fränti, S. Sieranoja, K-means properties on six clustering benchmark datasets, Appl. Intel. 48 (12) (2018) 4743–4759.
[40] L. Morissette, S. Chartier, The k-means clustering technique: general considerations and implementation in Mathematica, Tutor. Quant. Methods Psychol. 9 (1) (2013) 15–24.
[41] J. Liang, L. Bai, C. Dang, F. Cao, The k-means-type algorithms versus imbalanced data distributions, IEEE Trans. Fuzzy Syst. 20 (4, August) (2012) 728–745.
[42] I. Melnykov, V. Melnykov, On k-means algorithm with the use of Mahalanobis distances, Stat. Probab. Lett. 84 (January) (2014) 88–95.
[43] V. Melnykov, S. Michael, I. Melnykov, Recent developments in model-based clustering with applications, in: M. Celebi (Ed.), Partitional Clustering Algorithms, Springer, Cham, 2015.
[44] M. Rezaei, P. Fränti, Set-matching methods for external cluster validity, IEEE Trans. Knowl. Data Eng. 28 (8, August) (2016) 2173–2186.
[45] P. Fränti, M. Rezaei, Q. Zhao, Centroid index: cluster level similarity measure, Pattern Recognit. 47 (9) (2014) 3034–3045.
[46] P. Fränti, T. Kaukoranta, O. Nevalainen, On the splitting method for VQ codebook generation, Opt. Eng. 36 (11, November) (1997) 3043–3051.
[47] P. Fränti, O. Virmajoki, V. Hautamäki, Fast agglomerative clustering using a k-nearest neighbor graph, IEEE Trans. Pattern Anal. Mach. Intel. 28 (11, November) (2006) 1875–1881.
[48] G.H. Ball, D.J. Hall, A clustering technique for summarizing multivariate data, Syst. Res. Behav. Sci. 12 (2, March) (1967) 153–155.
[49] O. Lemke, B. Keller, Common nearest neighbor clustering: a benchmark, Algorithms 11 (2) (2018) 19.
[50] U.V. Luxburg, Clustering stability: an overview, Found. Trends Mach. Learn. 2 (3) (2010) 235–274.
[51] S.J. Redmond, C. Heneghan, A method for initialising the K-means clustering algorithm using kd-trees, Pattern Recognit. Lett. 28 (8) (2007) 965–973.
[52] S. Tezuka, P. L'Ecuyer, Efficient portable combined Tausworthe random number generators, ACM Trans. Model. Comput. Simul. 1 (1991) 99–112.
[53] M.J. Norušis, IBM SPSS Statistics 19 Guide to Data Analysis, Prentice Hall, Upper Saddle River, New Jersey, 2011.
[54] T. Gonzalez, Clustering to minimize the maximum intercluster distance, Theor. Comput. Sci. 38 (2–3) (1985) 293–306.
[55] M.M.-T. Chiang, B. Mirkin, Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads, J. Classification 27 (2010) 3–40.
[56] J. Hämäläinen, T. Kärkkäinen, Initialization of big data clustering using distributionally balanced folding, Proceedings of the European Symposium on Artificial Neural Networks, Comput. Intel. Mach. Learn.-ESANN (2016).
[57] I. Katsavounidis, C.C.J. Kuo, Z. Zhang, A new initialization technique for generalized Lloyd iteration, IEEE Signal Process. Lett. 1 (10) (1994) 144–146.
[58] F. Cao, J. Liang, L. Bai, A new initialization method for categorical data clustering, Expert Syst. Appl. 36 (7) (2009) 10223–10228.
[59] D. Arthur, S. Vassilvitskii, K-means++: the advantages of careful seeding, ACM-SIAM Symp. on Discrete Algorithms (SODA'07), January 2007.
[60] M. Erisoglu, N. Calis, S. Sakallioglu, A new algorithm for initial cluster centers in k-means algorithm, Pattern Recognit. Lett. 32 (14) (2011) 1701–1705.
[61] C. Gingles, M. Celebi, Histogram-based method for effective initialization of the k-means clustering algorithm, Florida Artificial Intelligence Research Society Conference, May 2014.
[62] J.A. Hartigan, M.A. Wong, Algorithm AS 136: a k-means clustering algorithm, J. R. Stat. Soc. C 28 (1) (1979) 100–108.
[63] M.M. Astrahan, Speech Analysis by Clustering, or the Hyperphoneme Method, Stanford Artificial Intelligence Project Memorandum AIM-124, Stanford University, Stanford, CA, 1970.
[64] F. Cao, J. Liang, G. Jiang, An initialization method for the k-means algorithm using neighborhood model, Comput. Math. Appl. 58 (2009) 474–483.
[65] M. Al-Daoud, A new algorithm for cluster initialization, in: World Enformatika Conference, 2005, pp. 74–76.
[66] M. Yedla, S.R. Pathakota, T.M. Srinivasa, Enhancing k-means clustering algorithm with improved initial center, Int. J. Comput. Sci. Inf. Technol. 1 (2) (2010) 121–125.
[67] T. Su, J.G. Dy, In search of deterministic methods for initializing k-means and Gaussian mixture clustering, Intel. Data Anal. 11 (4) (2007) 319–338.
[68] X. Wu, K. Zhang, A better tree-structured vector quantizer, in: IEEE Data Compression Conference, Snowbird, UT, 1991, pp. 392–401.
[69] C.-M. Huang, R.W. Harris, A comparison of several vector quantization codebook generation approaches, IEEE Trans. Image Process. 2 (1) (1993) 108–112.
[70] D. Boley, Principal direction divisive partitioning, Data Min. Knowl. Discov. 2 (4) (1998) 325–344.
[71] M.E. Celebi, H.A. Kingravi, Deterministic initialization of the k-means algorithm using hierarchical clustering, Int. J. Pattern Recognit. Artif. Intell. 26 (07) (2012) 1250018.
[72] S. Sieranoja, P. Fränti, Random projection for k-means clustering, in: Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), Zakopane, Poland, June 2018, pp. 680–689.
[73] S.-W. Ra, J.-K. Kim, A fast mean-distance-ordered partial codebook search algorithm for image vector quantization, IEEE Trans. Circuits Syst. 40 (September) (1993) 576–579.
[74] I. Cleju, P. Fränti, X. Wu, Clustering based on principal curve, in: Scandinavian Conf. on Image Analysis, LNCS, vol. 3540, Springer, Heidelberg, 2005, pp. 872–881.
[75] X. Wu, Optimal quantization by matrix searching, J. Algorithms 12 (4) (1991) 663–673.
[76] M.B. Al-Daoud, S.A. Roberts, New methods for the initialisation of clusters, Pattern Recognit. Lett. 17 (5) (1996) 451–455.
[77] P. Gourgaris, C. Makris, A density based k-means initialization scheme, EANN Workshops, Rhodes Island, Greece, 2015.
[78] A. Rodriquez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[79] P. Mitra, C. Murthy, S.K. Pal, Density-based multiscale data condensation, IEEE Trans. Pattern Anal. Mach. Intel. 24 (6) (2002) 734–747.
[80] S. Sieranoja, P. Fränti, Constructing a high-dimensional kNN-graph using a Z-order curve, ACM J. Exp. Algorithmics 23 (1, October) (2018) 1–21.
[81] W. Dong, C. Moses, K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in: Proceedings of the ACM International Conference on World Wide Web, ACM, 2011, pp. 577–586.
[82] P. Fränti, S. Sieranoja, Dimensionally distributed density estimation, in: Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), Zakopane, Poland, June 2018, pp. 343–353.
[83] H.J. Curti, R.S. Wainschenker, FAUM: Fast Autonomous Unsupervised Multidimensional classification, Inf. Sci. 462 (2018) 182–203.
[84] J. Xie, Z.Y. Xiong, Y.F. Zhang, Y. Feng, J. Ma, Density core-based clustering algorithm with dynamic scanning radius, Knowl.-Based Syst. 142 (2018) 68–70.
[85] Y. Linde, A. Buzo, R.M. Gray, An algorithm for vector quantizer design, IEEE Trans. Commun. 28 (1, January) (1980) 84–95.
[86] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, in: KDD Workshop on Text Mining, vol. 400, Boston, 2000, pp. 525–526.
[87] S.-S. Yu, S.-W. Chu, C.-M. Wang, Y.-K. Chan, T.-C. Chang, Two improved k-means algorithms, Appl. Soft Comput. 68 (2018) 747–755.
[88] B. Bahmani, B. Mosley, A. Vattani, R. Kumar, S. Vassilvitski, Scalable k-means++, Proc. VLDB Endow. 5 (7) (2012) 622–633.
[89] T. Kaukoranta, P. Fränti, O. Nevalainen, A fast exact GLA based on code vector activity detection, IEEE Trans. Image Process. 9 (8, August) (2000) 1337–1342.

Pasi Fränti received his MSc and PhD degrees from the University of Turku, in 1991 and 1994, in Science. Since 2000, he has been a professor of Computer Science at the University of Eastern Finland (UEF). He has published 81 journal and 167 peer review conference papers, including 14 IEEE transaction papers. His main research interests are in machine learning, data mining, and pattern recognition, including clustering algorithms and intelligent location-aware systems. Significant contributions have also been made in image compression, image analysis, vector quantization and speech technology.

Sami Sieranoja received the B.Sc. and M.Sc. degrees at the University of Eastern Finland, in 2014 and 2015. Currently he is a doctoral student at the University of Eastern Finland. His research interests include neighborhood graphs and data clustering.