Fast and Robust General Purpose Clustering Algorithms
V. Estivill-Castro
School of Computing and Information Technology,
Griffith University, Nathan, QLD 4111, Australia.
J. Yang
School of Electrical Engineering and Computer Science,
The University of Newcastle, Callaghan, NSW 2308, Australia.
Abstract. General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-Means has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable and multidimensional, but is more robust to noise and outliers. We achieve this by using medians rather than means as estimators for the centers of clusters. Comparison with k-Means, Expectation Maximization and Gibbs sampling demonstrates the advantages of our algorithm.
1. Introduction
where
1. $S = \{\vec{s}_1, \vec{s}_2, \ldots, \vec{s}_n\}$ is a set of $n$ data items in $D$-dimensional real space $\mathbb{R}^D$;
2. the weight $w_i > 0$ may reflect relevance of the observation $\vec{s}_i$, and $Euclid(\vec{x}, \vec{y}) = \left( \sum_{j=1}^{D} |x_j - y_j|^2 \right)^{1/2}$ is the Euclidean metric;
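For illustration, the following minimal sketch (not code from the paper; the exact form of Equation (1) is an assumption, inferred from its later description as an $L_2$ loss functional) evaluates the weighted sum of squared errors of a set of representatives under these definitions.

```python
# Illustrative sketch only: the weighted squared-error (L2) criterion,
# assuming Equation (1) is sum_i w_i * Euclid(s_i, rep[s_i, C])^2.
import numpy as np

def euclid(x, y):
    """Euclidean metric Euclid(x, y) = (sum_j |x_j - y_j|^2)^(1/2)."""
    return np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2))

def squared_error(S, w, C):
    """Weighted sum of squared distances from each item to its nearest
    representative in the set C (rep[s_i, C])."""
    total = 0.0
    for s_i, w_i in zip(S, w):
        nearest = min(C, key=lambda c: euclid(s_i, c))
        total += w_i * euclid(s_i, nearest) ** 2
    return total
```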
Figure 1. Conceptual pseudo code of k-Means and Expectation Maximization. Step (1): Construct initial set of representatives.
minimizes the sum of squared errors between the center and the points
in the cluster. Next, using the new representatives, a classification step
obtains new clusters. These steps are repeated until an iteration occurs
in which the clustering does not change; refer to Fig. 1. This conceptual
iteration of k-Means is illustrated in Fig. 1 to highlight its similarity
with Expectation Maximization.
We highlight that the conceptual pseudo code of Fig. 1 is not how k-Means or Expectation Maximization should actually be implemented, because it implies two passes over the data. In both cases, the two conceptual passes can be carried out per data item in an implementation that makes only one pass over the data per iteration (and obtains exactly the same result as the two-pass version).
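A minimal sketch of this fused, single-pass iteration for k-Means follows (an assumed implementation for illustration, not the authors' code): each item is classified and its contribution to the new means is accumulated in the same pass.

```python
# Sketch: one k-Means iteration that classifies each item and accumulates
# the sums for the new (weighted) means in a single pass over the data.
import numpy as np

def kmeans_iteration(X, w, centers):
    """X: (n, D) data, w: (n,) weights, centers: (k, D) current representatives.
    Returns updated centers and the cluster label of each item."""
    k, D = centers.shape
    sums = np.zeros((k, D))
    weights = np.zeros(k)
    labels = np.empty(len(X), dtype=int)
    for i, (x, wi) in enumerate(zip(X, w)):                  # single pass over the data
        j = int(np.argmin(np.sum((centers - x) ** 2, axis=1)))  # classification step
        labels[i] = j
        sums[j] += wi * x                                     # accumulation for the new mean
        weights[j] += wi
    new_centers = np.where(weights[:, None] > 0,
                           sums / np.maximum(weights, 1e-12)[:, None],
                           centers)                           # keep empty clusters unchanged
    return new_centers, labels
```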
3. Because the k central vectors are means of cluster points, they are commonly adopted as representatives of the data points of the cluster. However, it is possible for the arithmetic mean to have no valid interpretation; for example, the average of the coordinates of a group of schools may indicate that the representative school lies in the middle of a lake.
The expectation step estimates the complete data from the incomplete
data. The maximization step takes the “estimated” complete data and
estimates ~θ by maximum likelihood [46, 47].
Titterington et al. [47] show that often (for example, if k = 2, or if the components are assumed to be of the same type and to belong to an exponential density family), the maximization step is explicit, in the sense that the value attaining the maximum of $E[l(\vec{Y}; \vec{\theta}^{(t)})]$ can be found algebraically, without numerical approximation, or it is of no more difficulty than a Maximum Likelihood exercise on 'complete' data.
The Expectation Maximization procedure has solid theoretical results regarding the sequence $\langle \vec{\theta}^{(t)} \rangle$ of approximations. In particular, the estimated parameters produce a sequence of likelihood values that is non-decreasing.
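As a hedged illustration of these two steps (not the paper's code; a two-component one-dimensional Gaussian mixture is assumed), one Expectation Maximization iteration can be written as follows, with an explicit maximization step.

```python
# Illustrative EM iteration for a 1-D, two-component Gaussian mixture.
# E-step: estimate the "complete data" (posterior membership probabilities).
# M-step: re-estimate parameters by weighted maximum likelihood (explicit here).
import numpy as np

def em_iteration(x, pi, mu, sigma):
    """x: (n,) data; pi, mu, sigma: (2,) mixing weights, means, std deviations."""
    # E-step: responsibilities r[i, j] = P(component j | x_i, current parameters)
    dens = np.stack([pi[j] / (sigma[j] * np.sqrt(2 * np.pi))
                     * np.exp(-0.5 * ((x - mu[j]) / sigma[j]) ** 2)
                     for j in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: explicit maximum-likelihood updates on the estimated complete data
    nj = r.sum(axis=0)
    pi_new = nj / len(x)
    mu_new = (r * x[:, None]).sum(axis=0) / nj
    sigma_new = np.sqrt((r * (x[:, None] - mu_new) ** 2).sum(axis=0) / nj)
    return pi_new, mu_new, sigma_new
```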
Unfortunately, the maximization step for $\vec{\theta}_j$ depends on the form of the part $f_j$ of the mixture $\sum_{j=1}^{k} \pi_j f_j(\vec{\theta}_j)$. Thus, a different Expectation Maximization algorithm results for each choice of the component densities $f_j$.
Figure 2. A function $Gravity(\vec{x})$ and its level curves. The minimum is the gravity center (shown with ◦).
3. Our algorithms
The problem with means is that they are not robust estimators of central tendency [43]. Means are very sensitive to noise and outliers. Medians better represent a typical value in skewed distributions and are invariant under monotonic transformations of the random variable, while means are invariant only under linear transformations. However, the median of a distribution is much less tractable from the mathematical point of view than the mean. This is the main reason why traditional statistics usually chooses the mean rather than the median to describe the "center" of a distribution [12]. In clustering, as in vector quantization, the mean is to be a representative of the data points $\vec{x}_i$ that are nearest to it. The mean and the median are both measures of location [12]. Equation (1) represents what statisticians call an $L_2$ loss functional [43]. Thus, an immediate alternative is to use an error evaluation that measures the sum of absolute errors rather than the sum of squared errors. This $L_1$ criterion results in the Fermat-Weber clustering criterion [32],
$$\text{minimize } FW(C) = \sum_{i=1}^{n} w_i \, Euclid(\vec{s}_i, rep[\vec{s}_i, C]). \qquad (4)$$
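A minimal sketch of evaluating this Fermat-Weber criterion for a given set C of representatives follows (illustrative names only; this is not code from the paper).

```python
# Sketch of Equation (4): the weighted sum of (un-squared) Euclidean distances
# from each item to its nearest representative in C.
import numpy as np

def fermat_weber_criterion(S, w, C):
    """FW(C) = sum_i w_i * Euclid(s_i, rep[s_i, C])."""
    S, C = np.asarray(S, float), np.asarray(C, float)
    dists = np.sqrt(((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))  # (n, k)
    return float((np.asarray(w, float) * dists.min(axis=1)).sum())
```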
Figure 3. A Fermat-Weber function and its level curves. The arithmetic mean is shown with ◦.
Our second proposal is to find the data point in each cluster that minimizes $FW(\vec{x})$. That is, we solve a discrete 1-median problem for each cluster. Again, we minimize Equation (5), but now with the additional restriction that the estimator of location be in $C_j$. Our clustering algorithm (k-d-medians) has the same structure as the k-Means presented in Fig. 1. However, the new center of each cluster $C_j$ is the discrete 1-median of the points in $C_j$. This can trivially be solved in $O(n_j^2)$ time.
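A minimal sketch of this brute-force $O(n_j^2)$ computation follows (illustrative only; it is not the faster filtering procedure the paper develops later): evaluate FW at every data point of the cluster and keep the minimizer.

```python
# Brute-force discrete 1-median of a cluster: the data point minimizing the
# weighted sum of Euclidean distances to all points of the cluster.
import numpy as np

def discrete_1_median(Cj, w):
    """Cj: (n_j, D) points of the cluster, w: (n_j,) weights."""
    Cj = np.asarray(Cj, float)
    w = np.asarray(w, float)
    dists = np.sqrt(((Cj[:, None, :] - Cj[None, :, :]) ** 2).sum(axis=2))  # (n_j, n_j)
    costs = dists @ w                  # costs[m] = sum_i w_i * Euclid(x_i, x_m)
    return Cj[int(np.argmin(costs))]
```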
$$\nabla_E FW(\vec{x}) = \begin{cases} \nabla FW(\vec{x}) & \text{if } \vec{x} \notin C_j, \\[4pt] \max\left(1 - \dfrac{1}{\|\nabla FW_{\neg m}(\vec{x}_m)\|}, 0\right) \nabla FW_{\neg m}(\vec{x}_m) & \text{if } \vec{x} = \vec{x}_m \in C_j. \end{cases}$$
can be found in O(nj ) time. The halving strategy will result in a total
of O(nj log nj ) time to compute the discrete 1-median.
As we already pointed out, the arithmetic mean corresponds to the center of mass. Also, by Equation (3) the mass is
$$Gravity(\hat{\vec{x}}) = \frac{1}{2W_j} \sum_{i=1}^{n_j} \sum_{m=1}^{n_j} Euclid^2(\vec{x}_i, \vec{x}_m),$$
where $W_j = \sum_{\vec{x}_i \in C_j} w_i$. Moreover, the level curves are spheres. Thus
Step 2. Find $u$ points $\vec{x}_m$ in the candidate set nearest to the arithmetic mean.
Step 3. If for one of the $u$ points $FW(\vec{x}_m) < FW(\hat{\vec{x}})$, use $\vec{x}_m$ as another judge point.
Step 5. Remove from the candidate list the points in the $u$ filtering zones.
4. Experimental validation
Figure 6. Four types of centers (estimators of location) on the same data set (the legend includes the continuous L1 median, the harmonic mean, and the mean).
4.1. Performance
Figure 7. Comparison of CPU times (seconds) against the size n of the data (up to 100,000 items) for R. N. Neal examples for Bayesian mixture models; the curves correspond to Expectation Maximization, k-Means, k-d-medians and Gibbs sampling.
We believe that this type of dataset constitutes the worst case for our method, and we are pleased to see that it remains comparable to Expectation Maximization. We should remark that we experimented with enlarging the categorical domain of attributes (from two to five or ten values) and found that the performance of k-Means deteriorates while ours improves.
Table I. Misclassification with 95% confidence intervals. The top part of this table is one data set produced with the generator; for each combination of φ, ψ each clustering algorithm is executed 10 times (with different initialization). The bottom part combines 10 different data sets produced with the generator.

n = 300, k = 10, u = 1                One data set (10 runs per set)
Noise            k-Means          k-Means        k-d-medians      EM
 ψ    φ          (Random start)   (MST start)    (Random start)   (Random start)
 0    0          39% ± 7          8%             16% ± 4          25% ± 5
      0.1        30% ± 4          27%            16% ± 4          30% ± 4
      0.2        30% ± 5          30%            22% ± 5          39% ± 3
Figure 9. Data is uniform if projected to either axis, but has a clear pattern.
Table II. Results for 3-D mixture of normals with 20% noise.

Algorithm                   Estimated μ̂_j^T              Σ_{j=1}^{3} ‖μ̂_j − μ_j‖   CPU time
k-Means                     μ̂_1^T = (12.7, 8.4, 6.3)     2.83                       96 sec
                            μ̂_2^T = (3.1, 7.8, 13.5)
                            μ̂_3^T = (16.3, 9.6, 14.8)
Expectation Maximization    μ̂_1^T = (10.1, 9.7, 9.4)     7.77                       5 sec
                            μ̂_2^T = (2.7, 7.1, 14.4)
                            μ̂_3^T = (15.1, 8.6, 11.3)
k-d-medians                 μ̂_1^T = (12.8, 8.0, 6.9)     0.97                       0.7 sec
                            μ̂_2^T = (2.8, 7.1, 14.4)
                            μ̂_3^T = (16.6, 9.6, 14.8)
k-c-L1 medians              μ̂_1^T = (7.8, 16.0, 6.0)     16.8                       0.4 sec
                            μ̂_2^T = (3.8, 7.1, 14.4)
                            μ̂_3^T = (15.0, 8.5, 9.8)
Table III. Results for 10 datasets, each consisting of a 3-D mixture of normals with 20% noise.
k-c-L1 medians finds clusters with over 40% error 90% of the time, and 10% of the time it is totally wrong, producing two clusters separated by the line Y = 0.5. By contrast, k-d-medians performs very well on this data set: 90% of the time the misclassification is only 10%.
The statistical literature has rejected L1-metric optimization because the estimator of location can lie outside the convex hull of the cloud of points for which it is estimating a center [43]. A simple example is the 3-point set (1,0,0), (0,1,0) and (0,0,1) in 3D. The L1 center is (0,0,0), which is outside the convex hull.
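A quick numerical check of this example (illustrative Python, assuming the L1-metric center is the coordinate-wise median):

```python
# The coordinate-wise (L1-metric) median of (1,0,0), (0,1,0), (0,0,1) is
# (0,0,0), which lies outside the convex hull (the triangle on x + y + z = 1).
import numpy as np

points = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
center = np.median(points, axis=0)
print(center)          # [0. 0. 0.]
print(center.sum())    # 0.0, while every point of the hull sums to 1
```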
5. Final remarks
Section 4.1 shows that our algorithms are slightly more costly than k-Means but certainly much faster than alternatives like Expectation Maximization and Gibbs Sampling. Section 4.2 shows that our algorithms provide much more resistance to noise and outliers. Section 4.3 shows that they offer high statistical quality. Section 4.4 shows how our algorithms can be applied successfully to a case study previously used in the Data Mining literature. Our algorithms produce clusterings with improved results.
The algorithms presented here are suitable for exploratory data analysis. They do not depend on the order of the data, as some variants of k-Means do, and they do not demand detailed initialization. Their use brings insight into the structure of a large multidimensional data set. Because they are faster than Expectation Maximization, they can be applied in combination with criteria for determining the number k of clusters. Recall that the most robust criteria estimate the value of k by repeatedly clustering with different values of k [39, 40].
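As a hedged illustration of this last point (this is not the criterion of [39, 40]; the `cluster` routine and the use of the clustering criterion as a score are assumptions for illustration), one can run the clustering for several candidate values of k and inspect how the criterion improves:

```python
# Illustrative sketch: cluster repeatedly for several values of k and collect
# the value of the clustering criterion; `cluster` is any of the algorithms
# above and is assumed to return (centers, criterion_value).
def estimate_k(S, w, cluster, k_values=range(2, 11)):
    scores = {}
    for k in k_values:
        _, value = cluster(S, w, k)   # e.g. k-d-medians scored with FW(C)
        scores[k] = value
    return scores                      # inspect or plot the scores to pick k
```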
References
38. A.T. Murray and V. Estivill-Castro. Cluster discovery techniques for ex-
ploratory spatial data analysis. International Journal of Geographic Infor-
mation Systems, 12(5):431–443, 1998.
39. R.T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In J. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings of the 20th Conference on Very Large Data Bases (VLDB), pages 144–155, Santiago, Chile, 1994. Morgan Kaufmann Publishers, San Francisco, CA.
40. J.J. Oliver, R.A. Baxter, and C.S. Wallace. Unsupervised learning using MML.
In L. Saitta, editor, Proceedings of the 13th Machine Learning Conference,
pages 364–372, San Mateo, CA, July 1996. Morgan Kaufmann Publishers.
41. M.L. Overton. A quadratically convergent method for minimizing a sum of
Euclidean norms. Mathematical Programming, 27:34–63, 1983.
42. G.W. Rogers, B.C. Wallet, and E.J. Wegman. A mixed measure formulation of
the EM algorithm for huge data set applications. In L. Billard and N.I. Fisher,
editors, Proceedings of the 28th Symposium on the Interface between Computer
Science and Statistics, pages 492–497, Sydney, Australia, July 1997. Interface
Foundation of North America.
43. P.J. Rousseeuw and A.M. Leroy. Robust regression and outlier detection. John
Wiley & Sons, NY, USA, 1987.
44. S.Z. Selim and M.A. Ismail. k-means-type algorithms: A generalized conver-
gence theorem and characterization of local optimality. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-6(1):81–86, January 1984.
45. A.F.M. Smith and G.O. Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 55(1):2–23, 1993.
46. M.A. Tanner. Tools for Statistical Inference. Springer-Verlag, NY, US., 1993.
47. D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, UK, 1985.
48. C.S. Wallace and P.R. Freeman. Estimation and inference by compact coding.
Journal of the Royal Statistical Society, Series B, 49(3):240–265, 1987.
49. G. Wesolowsky. The Weber problem: history and perspectives. Location
Science, 1:5–23, 1993.
50. B. Zhang, M. Hsu, and U. Dayal. K-harmonic means — a spatial clus-
tering algorithm with boosting. In J. Roddick and K. Hornsby, editors,
Proceedings of the International Workshop on Temporal, Spatial and Spatio-
Temporal Data Mining - TSDM2000, in conjunction with the 4th European
Conference on Principles and Practices of Knowledge Discovery and Databases,
pages 31–42, Lyon, France, 2000. Springer-Verlag Lecture Notes in Artificial
Intelligence 2007.