Analysis of Agglomerative Clustering

Marcel R. Ackermann¹, Johannes Blömer¹, Daniel Kuntze¹, and Christian Sohler²

¹ Department of Computer Science, University of Paderborn, {mra,bloemer,kuntze}@upb.de
² Department of Computer Science, TU Dortmund, [email protected]
Abstract
The diameter k-clustering problem is the problem of partitioning a finite subset of ℝ^d into k subsets called clusters such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem for all values of k is the agglomerative clustering algorithm with the complete linkage strategy. For decades this algorithm has been widely used by practitioners. However, it is not well studied theoretically. In this paper we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension d is a constant, we show that for any k the solution computed by this algorithm is an O(log k)-approximation to the diameter k-clustering problem. Moreover, our analysis holds not only for the Euclidean distance but for any metric that is based on a norm.
1998 ACM Subject Classification F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems – Geometrical problems and computations; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Clustering; I.5.3 [Pattern Recognition]: Clustering – Algorithms, Similarity measures

Keywords and phrases agglomerative clustering, hierarchical clustering, complete linkage, approximation guarantees
1 Introduction
Clustering is the process of partitioning a set of objects into subsets (called clusters) such
that each subset contains similar objects and objects in different subsets are dissimilar. It
has many applications including data compression [13], analysis of gene expression data [6],
anomaly detection [10], and structuring results of search engines [3]. For every application
a proper objective function is used to measure the quality of a clustering. One particular
objective function is the largest diameter of the clusters. If the desired number of clusters k
is given we call the problem of minimizing this objective function the diameter k-clustering
problem.
One of the earliest and most widely used clustering strategies is agglomerative clustering.
The history of agglomerative clustering goes back at least to the 1950s (see for example
[8, 11]). Later, biological taxonomy became one of the driving forces of cluster analysis.
In [14] the authors, who were the first biologists using computers to classify organisms,
discuss several agglomerative clustering methods.
For all four authors this research was supported by the German Research Foundation (DFG), grants
BL 314/6-2 and SO 514/4-2.
Agglomerative clustering is a bottom-up clustering process. At the beginning, every
input object forms its own cluster. In each subsequent step, the two closest clusters will
be merged until only one cluster remains. This clustering process creates a hierarchy of
clusters, such that for any two different clusters A and B from possibly different levels of the hierarchy we either have A ∩ B = ∅, A ⊆ B, or B ⊆ A. Such a hierarchy is useful
in many applications, for example, when one is interested in hereditary properties of the
clusters (as in some bioinformatics applications) or if the exact number of clusters is a priori
unknown.
In order to define the agglomerative strategy properly, we have to specify a distance measure between clusters. Given a distance function between data objects, the following distance measures between clusters are frequently used. In the single linkage strategy, the distance between two clusters is defined as the distance between their closest pair of data objects. It is not hard to see that this strategy is equivalent to computing a minimum spanning tree of the graph induced by the distance function using Kruskal's algorithm. In case of the complete linkage strategy, the distance between two clusters is defined as the distance between their furthest pair of data objects. In the average linkage strategy the distance is defined as the average distance between data objects from the two clusters.
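For concreteness, the following sketch (our illustration, not part of the paper) implements these three inter-cluster distances for finite point sets in ℝ^d, using the Euclidean norm and representing clusters as lists of numpy vectors.

    # Illustrative sketch of the three linkage distances discussed above
    # (assumes the Euclidean norm; the paper allows any norm-induced metric).
    import numpy as np

    def single_linkage(A, B):
        # distance between the closest pair of points, one from each cluster
        return min(np.linalg.norm(a - b) for a in A for b in B)

    def complete_linkage(A, B):
        # distance between the furthest pair of points, one from each cluster
        return max(np.linalg.norm(a - b) for a in A for b in B)

    def average_linkage(A, B):
        # average distance between data objects from the two clusters
        return sum(np.linalg.norm(a - b) for a in A for b in B) / (len(A) * len(B))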
1.1 Related Work
In this paper we study the agglomerative clustering algorithm using the complete linkage
strategy to find a hierarchical clustering of n points from ℝ^d. The running time is obviously
polynomial in the description length of the input. Therefore, our only goal in this paper is to
give an approximation guarantee for the diameter k-clustering problem. The approximation guarantee is given by a factor such that the cost of the k-clustering computed by the algorithm is at most this factor times the cost of an optimal k-clustering. Although the agglomerative
complete linkage clustering algorithm is widely used, only a few theoretical results considering the quality of the clusterings computed by this algorithm are known. It is known that there exists a certain metric distance function such that this algorithm computes a k-clustering with an approximation factor of Ω(log k) [5]. However, prior to the analysis we present in this paper, no non-trivial upper bound for the approximation guarantee of the classical complete linkage agglomerative clustering algorithm was known, and deriving such a bound has been discussed as one of the open problems in [5].
The diameter k-clustering problem is closely related to the k-center problem. In this
problem, we are searching for k centers and the objective is to minimize the maximum
distance of any input point to the nearest center. When the centers are restricted to come
from the set of the input points, the problem is called the discrete k-center problem. It is
known that for metric distance functions the costs of optimal solutions to all three problems
are within a factor of 2 from each other.
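One standard way to see this factor of 2 (our sketch, not quoted from the paper): write D* for the optimal diameter k-clustering cost, R* for the optimal k-center radius, and R*_disc for the optimal discrete k-center radius of the same input set. Then

$$R^{*} \;\le\; R^{*}_{\mathrm{disc}} \;\le\; D^{*} \;\le\; 2R^{*} \;\le\; 2R^{*}_{\mathrm{disc}} ,$$

since restricting centers to input points can only increase the radius (first inequality), any point of an optimal-diameter cluster can serve as its center (second inequality), and every cluster induced by optimal centers has diameter at most twice the radius (third inequality). Hence any two of the three optimal costs differ by a factor of at most 2.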
For the Euclidean case we know that the diameter k-clustering problem and the k-center
problem are NP-hard. In fact, it is already NP-hard to approximate these two problems within factors below 1.96 and 1.82, respectively [7].
For fixed k, i.e. when we are not interested in a hierarchy of clusterings, there exist provably good approximation algorithms. For the discrete k-center problem, a simple 2-approximation algorithm is known for metric spaces [9], which immediately yields a 4-approximation algorithm for the diameter k-clustering problem. For the k-center problem, a variety of results is known. For example, for the Euclidean metric a $(1+\varepsilon)$-approximation algorithm with running time $2^{O((k \log k)/\varepsilon^2)} \cdot dn$ is shown in [2]. This implies a $(2+\varepsilon)$-approximation algorithm with the same running time for the diameter k-clustering problem.
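For reference, the 2-approximation of [9] is the farthest-point heuristic. A minimal sketch for the discrete k-center problem follows (our illustration, written for the Euclidean metric, although the guarantee of [9] holds in any metric space).

    import numpy as np

    def gonzalez_k_center(points, k):
        # Farthest-point heuristic: repeatedly add the point farthest from the
        # current centers; a 2-approximation for the (discrete) k-center problem.
        pts = [np.asarray(p, dtype=float) for p in points]
        centers = [pts[0]]                                   # arbitrary first center
        dist = [np.linalg.norm(p - centers[0]) for p in pts]
        while len(centers) < min(k, len(pts)):
            i = int(np.argmax(dist))                         # farthest remaining point
            centers.append(pts[i])
            dist = [min(d, np.linalg.norm(p - pts[i])) for p, d in zip(pts, dist)]
        return centers, max(dist)                            # centers and covering radius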
Also, for metric spaces a hierarchical clustering strategy with an approximation guarantee
of 8 for the discrete k-center problem is known [5]. This implies an algorithm with an
approximation guarantee of 16 for the diameter k-clustering problem.
This paper as well as all of the above mentioned work is about static clustering, i.e. in the problem definition we are given the whole set of input points at once. An alternative
model of the input data is to consider sequences of points that are given one after another. In
[4] the authors discuss clustering in a so-called incremental clustering model. They give an
algorithm with constant approximation factor that maintains a hierarchical clustering while
new points are added to the input set. Furthermore, they show a lower bound of Ω(log k) for the agglomerative complete linkage algorithm and the diameter k-clustering problem. However, since their model differs from ours, this result has no bearing on our lower bounds.
1.2 Our contribution
In this paper, we study the agglomerative complete linkage clustering algorithm for input sets X ⊆ ℝ^d, where d is a constant. To measure the distance between data points, we use a metric that is based on a norm, e.g., the Euclidean metric. We prove that in this case the agglomerative clustering algorithm is an O(log k)-approximation algorithm. Here, the O-notation hides a constant that is doubly exponential in d. This approximation guarantee holds for every level of the hierarchy computed by the algorithm. That is, we compare each computed k-clustering with an optimal solution for the particular value of k. These optimal k-clusterings do not necessarily form a hierarchy. In fact, there are simple examples where optimal solutions have no hierarchical structure: for the one-dimensional input set {0, 2, 3, 5}, the unique optimal 2-clustering {{0, 2}, {3, 5}} has cost 2, whereas the unique optimal 3-clustering {{0}, {2, 3}, {5}} has cost 1 and is not a refinement of it.
Our analysis also yields that if we allow 2k instead of k clusters and compare the cost of
the computed 2k-clustering to an optimal solution with k clusters, the approximation factor
is independent of k and depends only on d. Moreover, the techniques of our analysis can be
applied to prove stronger results for the k-center problem and the discrete k-center problem.
For the k-center problem we derive an approximation guarantee that is logarithmic in k and
only single exponential in d. For the discrete k-center problem we derive an approximation
guarantee that is logarithmic in k and the dependence on d is only linear and additive.
Furthermore, we give almost matching upper and lower bounds for the one-dimensional case. These bounds are independent of k. For d ≥ 2 and the metric based on the ℓ∞-norm we provide a lower bound that exceeds the upper bound for d = 1. For d ≥ 3 we give a lower bound for the Euclidean case which is above the lower bound for d = 1. Finally, we construct instances providing lower bounds for any metric based on an ℓ_p-norm with 1 ≤ p ≤ ∞. However, for these instances the lower bounds and the dimension d depend on k.
2 Preliminaries and problem definition
Throughout this paper, we consider input sets that are finite subsets of ℝ^d. Our results hold for arbitrary metrics that are based on a norm, i.e., the distance ‖x − y‖ between two points x, y ∈ ℝ^d is measured using an arbitrary norm ‖·‖. Readers who are not familiar with arbitrary metrics or are only interested in the Euclidean case may assume that ‖·‖₂ is used, i.e. $\lVert x - y \rVert = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$. For r ∈ ℝ and y ∈ ℝ^d we denote the closed d-dimensional ball of radius r centered at y by $B_r^d(y) := \{\, x \mid \lVert x - y \rVert \le r \,\}$.
Given k ∈ ℕ and a finite set X ⊆ ℝ^d with k ≤ |X| we say that C_k = {C_1, . . . , C_k} is a k-clustering of X if the sets C_1, . . . , C_k (called clusters) form a partition of X into k non-empty subsets. We call a collection of k-clusterings of the same finite set X but for different values of k hierarchical if it fulfills the following two properties. First, for any 1 ≤ k ≤ |X| the collection contains at most one k-clustering. Second, for any two of its clusterings C_i, C_j with |C_i| = i < j = |C_j| every cluster in C_i is the union of one or more clusters from C_j. A hierarchical collection of clusterings is called a hierarchical clustering.
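As a small illustration of this definition (ours, not from the paper), the following check verifies the two properties for a collection of clusterings given as lists of frozensets of points.

    def is_hierarchical(clusterings):
        # Property 1: at most one k-clustering for every k.
        sizes = [len(c) for c in clusterings]
        if len(sizes) != len(set(sizes)):
            return False
        # Property 2: every cluster of a coarser clustering is a union of
        # clusters of every finer clustering in the collection.
        for coarse in clusterings:
            for fine in clusterings:
                if len(fine) <= len(coarse):
                    continue
                for A in coarse:
                    if A != frozenset().union(*(B for B in fine if B <= A)):
                        return False
        return True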
For a finite and non-empty set C ⊆ ℝ^d we define the diameter of C to be $\mathrm{diam}(C) := \max_{x,y \in C} \lVert x - y \rVert$. Finally, we define the cost of a k-clustering C_k as its largest diameter, i.e. $\mathrm{cost}(C_k) := \max_{C \in C_k} \mathrm{diam}(C)$.
Problem 1 (diameter k-clustering). Given k ∈ ℕ and a finite set X ⊆ ℝ^d with |X| ≥ k, find a k-clustering C_k of X with minimal cost.
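In code, the objective of Problem 1 reads as follows (our sketch for the Euclidean norm; clusters are collections of numpy vectors).

    import numpy as np
    from itertools import combinations

    def diam(C):
        # diameter of a finite non-empty cluster C
        pts = [np.asarray(p, dtype=float) for p in C]
        return max((np.linalg.norm(x - y) for x, y in combinations(pts, 2)), default=0.0)

    def cost(clustering):
        # cost of a k-clustering: its largest cluster diameter
        return max(diam(C) for C in clustering)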
For our analysis of agglomerative clustering we repeatedly use the volume argument stated in Lemma 3. This argument provides an upper bound on the minimum distance between two points from a finite set of points lying inside the union of finitely many balls. For the application of this argument the following definition is crucial.
Definition 2. Let k ∈ ℕ and r ∈ ℝ. A set X ⊆ ℝ^d is called (k, r)-coverable if there exist y_1, . . . , y_k ∈ ℝ^d with $X \subseteq \bigcup_{i=1}^{k} B_r^d(y_i)$.
Lemma 3. Let k ∈ ℕ, r ∈ ℝ and let P ⊆ ℝ^d be finite and (k, r)-coverable with |P| > k. Then there exist distinct p, q ∈ P such that $\lVert p - q \rVert \le 4r \sqrt[d]{k/|P|}$.
The proof of Lemma 3 can be found in the full version of this paper [1].
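For intuition, the bound of Lemma 3 can be recovered by a standard packing argument along the following lines (our sketch, not necessarily the argument used in [1]). Let ρ denote the minimum distance between two distinct points of P and write $x := \sqrt[d]{|P|/k} > 1$. If |P| ≤ 2^d k, then x ≤ 2 and by the pigeonhole principle two of the |P| > k points lie in a common ball $B_r^d(y_i)$, so
$$\rho \;\le\; 2r \;\le\; \frac{4r}{x} \;=\; 4r \sqrt[d]{k/|P|} .$$
If |P| > 2^d k, the balls $B^d_{\rho/2}(p)$ for p ∈ P have disjoint interiors and are contained in $\bigcup_i B^d_{r+\rho/2}(y_i)$; since volumes scale with the d-th power of the radius for every norm,
$$|P| \, (\rho/2)^d \;\le\; k \, (r + \rho/2)^d \quad\Longrightarrow\quad \rho \;\le\; \frac{2r}{x - 1} \;\le\; \frac{4r}{x} \;=\; 4r \sqrt[d]{k/|P|} ,$$
where the last inequality uses x ≥ 2.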
3 Analysis
In this section we analyze the agglomerative algorithm for Problem 1, stated as Algorithm 1. Given a finite set X ⊆ ℝ^d of input points, the algorithm computes hierarchical k-clusterings for all values of k between 1 and |X|. As mentioned before, the algorithm takes a bottom-up approach. It starts with the |X|-clustering that contains one cluster for each input point and then successively merges those two of the remaining clusters that minimize the diameter of the resulting cluster.
Observation 4. The greedy strategy guarantees that the following holds for all computed
clusterings. First, the cost of the clustering is equal to the diameter of the cluster created
last. Second, the diameter of the union of any two clusters is always an upper bound for
the cost of the clustering to be computed next.
Note that our results hold for any particular tie-breaking strategy. However, to keep the
analysis simple, we assume that there are no ties. Thus, for any input set X the clusterings
computed by Algorithm 1 are uniquely determined.
Our main result is the following theorem.
Theorem 5. Let X ⊆ ℝ^d be a finite set of points. Then for all k ∈ ℕ with k ≤ |X| the partition C_k of X into k clusters as computed by Algorithm 1 satisfies
$$\mathrm{cost}(C_k) = O(\log k) \cdot \mathrm{opt}_k ,$$
where opt_k denotes the cost of an optimal solution to Problem 1, and the constant hidden in the O-notation is doubly exponential in the dimension d.
AgglomerativeCompleteLinkage(X):
  Input: a finite set X ⊆ ℝ^d of points
  1: C_{|X|} := { {x} | x ∈ X }
  2: for i = |X| − 1, . . . , 1 do
  3:    find distinct clusters A, B ∈ C_{i+1} minimizing diam(A ∪ B)
  4:    C_i := (C_{i+1} \ {A, B}) ∪ {A ∪ B}
  5: end for
  6: return C_1, . . . , C_{|X|}

Algorithm 1 The agglomerative complete linkage clustering algorithm.
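A direct (and deliberately unoptimized) transcription of Algorithm 1 into Python might look as follows (our illustration; it uses the Euclidean norm and assumes pairwise distinct input points given as tuples).

    import numpy as np
    from itertools import combinations

    def diam(C):
        pts = [np.asarray(p, dtype=float) for p in C]
        return max((np.linalg.norm(x - y) for x, y in combinations(pts, 2)), default=0.0)

    def agglomerative_complete_linkage(X):
        # Returns a dict mapping i to the i-clustering C_i, for i = |X|, ..., 1.
        clusters = [frozenset([p]) for p in X]
        hierarchy = {len(clusters): list(clusters)}
        while len(clusters) > 1:
            # line 3: find the pair of distinct clusters minimizing diam(A ∪ B)
            A, B = min(combinations(clusters, 2), key=lambda ab: diam(ab[0] | ab[1]))
            # line 4: merge them
            clusters = [C for C in clusters if C not in (A, B)] + [A | B]
            hierarchy[len(clusters)] = list(clusters)
        return hierarchy

Each iteration examines all pairs of current clusters, so this sketch runs in time polynomial in |X|, in line with the remark in Section 1.1 that the running time is not the issue; the interesting question is the approximation guarantee.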
We prove Theorem 5 in two steps. First, Proposition 6 in Section 3.1 provides an upper bound on the cost of the intermediate 2k-clustering. This upper bound is independent of k and |X| and may be of independent interest. Second, in the remainder of Section 3, we analyze the k merge steps of Algorithm 1 down to the computation of the k-clustering.

In the following, let X ⊆ ℝ^d be the finite set of input points for Algorithm 1 and let k ∈ ℕ be a fixed number of clusters with k ≤ |X|. Furthermore, to simplify notation let r := opt_k, where opt_k is the maximum diameter of an optimal solution to Problem 1. Since any cluster C is contained in a ball of radius diam(C), the set X is (k, r)-coverable, a fact that will be used frequently in our analysis. By C_1, . . . , C_{|X|} we denote the clusterings computed by Algorithm 1 on input X.
3.1 Analysis of the 2k-clustering
Proposition 6. Let X ⊆ ℝ^d be finite. Then for all k ∈ ℕ with 2k ≤ |X| the partition C_{2k} of X into 2k clusters as computed by Algorithm 1 satisfies
$$\mathrm{cost}(C_{2k}) < 2^{3\sigma} (28d + 6) \cdot \mathrm{opt}_k ,$$
where σ = (42d)^d and opt_k denotes the cost of an optimal solution to Problem 1.
To prove Proposition 6 we divide the merge steps of Algorithm 1 into two stages. The first stage consists of the merge steps down to a $2^{2^{O(d \log d)}} k$-clustering. The analysis of the first stage is based on the following notion of similarity. Two clusters are called similar if one cluster can be translated such that every point of the translated cluster is near a point of the second cluster. Then, by merging similar clusters, the diameter essentially increases by the length of the translation vector. During the whole first stage we guarantee that there is a sufficiently large number of similar clusters left. The cost of the intermediate $2^{2^{O(d \log d)}} k$-clustering can be upper bounded by O(d) · opt_k.
The second stage consists of the merge steps reducing the number of remaining clusters from $2^{2^{O(d \log d)}} k$ to only 2k. In this stage we are no longer able to guarantee that a sufficiently large number of similar clusters exists. Therefore, we analyze the merge steps of the second stage using a weaker argument. The underlying reasoning of what we do for the second stage is the following. If there are more than 2k clusters left, we are able to find sufficiently many pairs of clusters that intersect with the same cluster of an optimal k-clustering. As long as one of these pairs is left, the cost of merging this pair gives an upper bound on the cost of the next merge step. Therefore, we can bound the diameter of the created cluster by the sum of the diameters of the two clusters plus the diameter of the optimal cluster. We find that the cost of the intermediate 2k-clustering is upper bounded by $2^{2^{O(d \log d)}} \cdot \mathrm{opt}_k$. Let us remark that we do not obtain our main result if we already use this argument for the first stage.
3.2 Stage one
In our analysis the first stage is subdivided into phases, such that in each phase the number of remaining clusters is reduced by one fourth. The following lemma will be used to bound the increase of the cost during a single phase.
Lemma 7. Let α ∈ ℝ with 0 < α < 1 and let m ∈ ℕ with 2^{σ+1} k < m ≤ |X|. Then
$$\mathrm{cost}\bigl(C_{\lceil 3m/4 \rceil}\bigr) < (1 + 2\alpha) \cdot \mathrm{cost}(C_m) + 4r \sqrt[d]{\frac{2^{\sigma+1} k}{m}} . \qquad (1)$$
Proof. Let t := ⌈3m/4⌉ and S := C_m ∩ C_{t+1} be the set of clusters from C_m that still exist in C_{t+1}. For every cluster C ∈ S fix a point p_C ∈ C; then C − p_C ⊆ B^d_R(0) for R := cost(C_m). Furthermore, fix y_1, . . . , y_σ ∈ ℝ^d with $B^d_R(0) \subseteq \bigcup_{i=1}^{\sigma} B^d_{\alpha R}(y_i)$. For C ∈ S we call the set Conf(C) := {y_i | 1 ≤ i ≤ σ and $B^d_{\alpha R}(y_i) \cap (C - p_C) \ne \emptyset$} the configuration of C. That is, we identify each cluster C ∈ S with the subset of the balls $B^d_{\alpha R}(y_1), \ldots, B^d_{\alpha R}(y_\sigma)$ that it intersects after being translated by −p_C. Since there are at most 2^σ possible configurations, sufficiently many clusters of S share one configuration, and applying Lemma 3 to their reference points we obtain two clusters C_a, C_b ∈ S with Conf(C_a) = Conf(C_b) and
$$\lVert p_{C_a} - p_{C_b} \rVert \le 4r \sqrt[d]{\frac{2^{\sigma+1} k}{m}} .$$
Next we want to bound the diameter of the union of the corresponding clusters C_a and C_b. The distance between any two points u, v ∈ C_a or u, v ∈ C_b is at most the cost of C_m. Now let u ∈ C_a and v ∈ C_b. Using the triangle inequality, for any w ∈ ℝ^d we obtain
$$\lVert u - v \rVert \le \lVert p_{C_a} - p_{C_b} \rVert + \lVert u + p_{C_b} - p_{C_a} - w \rVert + \lVert w - v \rVert .$$
For ‖p_{C_a} − p_{C_b}‖ we just derived an upper bound. To bound ‖u + p_{C_b} − p_{C_a} − w‖, we let y ∈ Conf(C_a) = Conf(C_b) such that $u - p_{C_a} \in B^d_{\alpha R}(y)$. Furthermore, we fix w ∈ C_b with $w - p_{C_b} \in B^d_{\alpha R}(y)$. Hence, $\lVert u + p_{C_b} - p_{C_a} - w \rVert = \lVert u - p_{C_a} - (w - p_{C_b}) \rVert$ can be upper bounded by 2αR = 2α · cost(C_m). For w ∈ C_b the distance ‖w − v‖ is bounded by diam(C_b) ≤ cost(C_m). We conclude that merging clusters C_a and C_b results in a cluster whose diameter can be upper bounded by
$$\mathrm{diam}(C_a \cup C_b) < (1 + 2\alpha) \cdot \mathrm{cost}(C_m) + 4r \sqrt[d]{\frac{2^{\sigma+1} k}{m}} .$$
Using Observation 4 and the fact that C_a and C_b are part of the clustering C_{t+1}, we can upper bound the cost of C_t by cost(C_t) ≤ diam(C_a ∪ C_b).
Note that the parameter α from Lemma 7 establishes a trade-off between the two terms on the right-hand side of Inequality (1). To complete the analysis of the first stage, we have to carefully choose α. In the proof of the following lemma we use α = ln(4/3)/(4d) and apply Lemma 7 once per phase, for u := ⌈log_{4/3}(|X| / (2^{σ+1} k))⌉ phases with m_i := ⌈(3/4)^i |X|⌉. Since Algorithm 1 uses a greedy strategy we deduce cost(C_{m_{i+1}}) ≤ cost(C_{⌈3 m_i / 4⌉}) for all i = 0, . . . , u − 1. Using cost(C_{2^{σ+1} k}) ≤ cost(C_{m_u}) and cost(C_{m_0}) = 0 we get
$$\mathrm{cost}\bigl(C_{2^{\sigma+1} k}\bigr) < \sum_{i=0}^{u-1} (1 + 2\alpha)^i \cdot 4r \sqrt[d]{\frac{2^{\sigma+1} k}{(3/4)^{u-1-i} |X|}} = 4r \sqrt[d]{\frac{2^{\sigma+1} k}{(3/4)^{u-1} |X|}} \; \sum_{i=0}^{u-1} \Bigl( (1 + 2\alpha) \sqrt[d]{3/4} \Bigr)^{i} .$$
Using $u - 1 < \log_{3/4}\bigl( 2^{\sigma+1} k / |X| \bigr)$ we get
$$\mathrm{cost}\bigl(C_{2^{\sigma+1} k}\bigr) < 4r \sum_{i=0}^{u-1} \Biggl( \frac{1 + 2\alpha}{\sqrt[d]{4/3}} \Biggr)^{i} . \qquad (2)$$
By taking only the first two terms of the series expansion of the exponential function we get $1 + 2\alpha = 1 + \frac{\ln(4/3)}{2d} < e^{\ln(4/3)/(2d)} = \sqrt[2d]{4/3}$. Substituting this bound into Inequality (2) and extending the sum gives
$$\mathrm{cost}\bigl(C_{2^{\sigma+1} k}\bigr) < 4r \sum_{i=0}^{\infty} \Biggl( \frac{1}{\sqrt[2d]{4/3}} \Biggr)^{i} < 4r \sum_{i=0}^{\infty} \Biggl( \frac{1}{1 + 2\alpha} \Biggr)^{i} .$$
Solving the geometric series leads to
$$\mathrm{cost}\bigl(C_{2^{\sigma+1} k}\bigr) < 4r \Bigl( \frac{1}{2\alpha} + 1 \Bigr) < (28d + 4) \, r .$$
3. For all n ∈ ℕ with ℓ + 1 ≤ n ≤ |Y|, every cluster in P_n is (Z, r)-connected to another cluster in P_n.

Here, the collection P_1, . . . , P_{|Y|} denotes the hierarchical clustering computed by Algorithm 1 on input Y.
Proof. To define Y, Z, and ℓ we consider the (k + 1)-clustering computed by Algorithm 1 on input X. We know that X = ⋃_{A ∈ C_{k+1}} A is (k, r)-coverable. Let E ⊆ C_{k+1} be a minimal subset such that ⋃_{A ∈ E} A is (|E| − 1, r)-coverable, i.e., for all sets F ⊆ C_{k+1} with |F| < |E| the union ⋃_{A ∈ F} A is not (|F| − 1, r)-coverable. Since a set F of size 1 cannot be (|F| − 1, r)-coverable, we get |E| ≥ 2.

Let Y := ⋃_{A ∈ E} A and ℓ := |E| − 1. Then ℓ ≤ k and Y is (ℓ, r)-coverable. Thus, we can define Z ⊂ ℝ^d with |Z| = ℓ and Y ⊆ ⋃_{z ∈ Z} B_r^d(z). Furthermore, we let P_1, . . . , P_{|Y|} be the hierarchical clustering computed by Algorithm 1 on input Y.

Since Y is the union of the clusters from E ⊆ C_{k+1}, each merge step between the computation of C_{|X|} and C_{k+1} merges either two clusters A, B ⊆ Y or two clusters A, B ⊆ X \ Y. The merge steps inside X \ Y have no influence on the clusters inside Y. Furthermore, the merge steps inside Y would be the same in the absence of the clusters inside X \ Y. Therefore, on input Y Algorithm 1 computes the (ℓ + 1)-clustering P_{ℓ+1} = E = C_{k+1} ∩ 2^Y. Thus, P_{ℓ+1} ⊆ C_{k+1}.
To compute P_ℓ, Algorithm 1 then merges two of the clusters from P_{ℓ+1}.
It remains to show that for all n ∈ ℕ with ℓ + 1 ≤ n ≤ |Y| it holds that every cluster in P_n is (Z, r)-connected to another cluster in P_n. We first show the property for n = ℓ + 1. For ℓ = 1 this follows from the fact that B_r^d(z) with Z = {z} has to contain both clusters from P_2. For ℓ > 1 we are otherwise able to remove one cluster from P_{ℓ+1} and get clusters whose union is (ℓ − 1, r)-coverable. This contradicts the definition of E = P_{ℓ+1} as a minimal subset with this property.
To prove 3. for general n, let C_1 ∈ P_n and z ∈ Z with B_r^d(z) ∩ C_1 ≠ ∅. There exists a unique cluster Ĉ_1 ∈ P_{ℓ+1} with C_1 ⊆ Ĉ_1. Then we have B_r^d(z) ∩ Ĉ_1 ≠ ∅. However, B_r^d(z) has to intersect with at least two clusters from P_{ℓ+1}. Thus, there exists another cluster Ĉ_2 ∈ P_{ℓ+1} with B_r^d(z) ∩ Ĉ_2 ≠ ∅. Since every cluster from P_{ℓ+1} is a union of clusters from P_n, there exists at least one cluster C_2 ∈ P_n with C_2 ⊆ Ĉ_2 and B_r^d(z) ∩ C_2 ≠ ∅.
3.5 Analysis of the remaining merge steps
Let Y, Z, ℓ, and P_1, . . . , P_{|Y|} be as given in Lemma 11. Then, Proposition 6 can be used to obtain an upper bound for the cost of P_{2ℓ}. In the following, we analyze the merge steps leading from P_{2ℓ} to P_{ℓ+1} and show how to obtain an upper bound for the cost of P_{ℓ+1}. As in Section 3.1, we analyze the merge steps in phases. The following lemma is used to bound the increase of the cost during a single phase.
Lemma 12. Let m, n ∈ ℕ with n ≤ 2ℓ and ℓ < m ≤ n ≤ |Y|. If there are no two (Z, r)-connected clusters in P_m ∩ P_n, it holds that
$$\mathrm{cost}\bigl(P_{\lceil (m+\ell)/2 \rceil}\bigr) \le \mathrm{cost}(P_m) + 2 \bigl( \mathrm{cost}(P_n) + 2r \bigr) .$$
Proof. We show that there exist at least m − ℓ disjoint pairs of clusters from P_m such that the diameter of their union can be upper bounded by cost(P_m) + 2 (cost(P_n) + 2r). By Observation 4, this upper bounds the cost of the computed clusterings as long as such a pair of clusters remains. Then the lemma follows from the fact that in each iteration of its loop the algorithm can destroy at most two of these pairs.

To bound the number of such pairs of clusters we start with a structural observation. Let S := P_m ∩ P_n be the set of clusters from P_n that still exist in P_m. By our definition of Y, Z, and ℓ we find that any cluster A ∈ S ⊆ P_m is (Z, r)-connected to another cluster B ∈ P_m. If we assume that there are no two (Z, r)-connected clusters in S, this implies B ∈ P_m \ S. Thus, using A ∈ P_n, B ∈ P_m, and Equation (3) the diameter of A ∪ B can be bounded by
$$\mathrm{diam}(A \cup B) \le \mathrm{cost}(P_m) + \mathrm{cost}(P_n) + 2r . \qquad (4)$$
Moreover, using a similar argument, if two clusters A_1, A_2 ∈ S ⊆ P_n are (Z, r)-connected to the same cluster B ∈ P_m \ S we can bound the diameter of A_1 ∪ A_2 by
$$\mathrm{diam}(A_1 \cup A_2) \le \mathrm{cost}(P_m) + 2 \bigl( \mathrm{cost}(P_n) + 2r \bigr) . \qquad (5)$$
Next we show that there exist at least ⌈|S|/2⌉ such disjoint pairs: every cluster of S is (Z, r)-connected to a cluster of P_m \ S, so pairing up clusters of S that are connected to the same cluster of P_m \ S and pairing each remaining cluster of S with its connected cluster in P_m \ S yields at least ⌈|S|/2⌉ disjoint pairs, whose unions are bounded by (5) and (4), respectively.

To complete the proof, we show that m − ℓ ≤ ⌈|S|/2⌉. The n − m merge steps between the computation of P_n and P_m remove at most two clusters of P_n each, so n − |S| ≤ 2(n − m) and hence m ≤ n/2 + |S|/2. Using n ≤ 2ℓ we get m − ℓ ≤ ⌈|S|/2⌉.
Lemma 13. Let n ∈ ℕ with n ≤ 2ℓ and ℓ < n ≤ |Y|. Then
$$\mathrm{cost}(P_{\ell+1}) < 2 (\log_2 \ell + 2) \bigl( \mathrm{cost}(P_n) + 2r \bigr) .$$
Proof. For n = ℓ + 1 there is nothing to show. Hence, assume n > ℓ + 1. Then by definition of Z there exist two (Z, r)-connected clusters in P_n. Now let ñ ∈ ℕ with ñ < n be maximal such that no two (Z, r)-connected clusters exist in P_ñ ∩ P_n. The number ñ is well-defined since at least the set P_1 does not contain two clusters at all. It follows that the same holds for all m ∈ ℕ with m ≤ ñ. We conclude that Lemma 12 is applicable for all m ≤ ñ with ℓ < m.

By the definition of ñ there still exist at least two (Z, r)-connected clusters in P_{ñ+1} ∩ P_n. Then, Observation 4 implies
$$\mathrm{cost}(P_{\tilde{n}}) \le 2 \, \mathrm{cost}(P_n) + 2r . \qquad (6)$$
If ñ ≤ ℓ + 1 then Inequality (6) proves the lemma. For ñ > ℓ + 1 let u := ⌈log_2(ñ − ℓ)⌉ and define m_i := ⌈(1/2)^i (ñ − ℓ)⌉ + ℓ. Since Algorithm 1 uses a greedy strategy we deduce cost(P_{m_{i+1}}) ≤ cost(P_{⌈(m_i + ℓ)/2⌉}) for all i = 0, . . . , u − 1. Applying Lemma 12 in every phase and using Inequality (6) as well as ñ − ℓ < ℓ then yields cost(P_{ℓ+1}) < 2(log_2 ℓ + 2)(cost(P_n) + 2r).

Lemma 14. It holds that
$$\mathrm{cost}(P_{\ell+1}) < 2 (\log_2 \ell + 2) \bigl( 2^{3\sigma} (28d + 6) + 2 \bigr) \, r \quad \text{for } \sigma = (42d)^d .$$
Proof. Let n := min(|Y|, 2ℓ). Then, using Proposition 6, we get cost(P_n) < 2^{3σ}(28d + 6) · r. The lemma follows by using this bound in combination with Lemma 13.
3.6 Proof of Theorem 5

Using Lemma 11 we know that there is a subset Y ⊆ X, a number ℓ ≤ k, and a hierarchical clustering P_1, . . . , P_{|Y|} of Y with cost(C_k) ≤ cost(P_{ℓ+1}). Combining this with Lemma 14 and ℓ ≤ k yields cost(C_k) = O(log k) · opt_k, which proves Theorem 5.
For d ≥ 2 we provide a lower bound for the metric based on the ℓ∞-norm. Note that this exceeds the upper bound from the one-dimensional case. Furthermore, for the Euclidean metric we know a 3-dimensional input set giving a lower bound of 2.56, thus exceeding our lower bound from the one-dimensional case.

Moreover, we can show that there exist input instances such that Algorithm 1 computes an approximation to Problem 1 with a super-constant approximation factor for any metric based on an ℓ_p-norm. In case of the ℓ_1- and the ℓ∞-norm this matches the already known lower bound [5] that has been shown using a rather artificial metric. However, in our instances the dimension d is not fixed but depends on k.

All lower bounds mentioned above are proven in the full version of this paper [1].
4.2 Related clustering problems
As mentioned in the introduction, the costs of optimal solutions to the diameter k-clustering problem, the k-center problem, and the discrete k-center problem are within a factor of 2 of each other. Hence, Algorithm 1 computes an O(log k)-approximation for all three problems.
In case of the k-center problem and the discrete k-center problem, our techniques can be applied in a simplified way and yield stronger bounds. Here, we consider the agglomerative algorithm that minimizes the (discrete) k-center cost function in every merge step. In case of the k-center problem we are able to show an upper bound that is logarithmic in k and only single exponential in d. More precisely, the dependency on d in the upper bound for the cost of the 2k-clustering improves from doubly exponential to only single exponential. Mainly this is achieved because the analysis no longer requires configurations of clusters.
In case of the discrete k-center problem we are able to show an upper bound of 20d + 2 log_2 k + 4 for the approximation factor. Here, the analysis benefits from the fact that we are able to bound the increase of the cost in each phase of the second stage by a term that is only additive.
The lower bound for metrics based on an ℓ_p-norm can be adapted to the discrete k-center problem (see the full version of this paper [1]). In particular, in case of the ℓ_2-norm we obtain an instance in dimension O(log³ k) for which we can show a lower bound that grows both with d and with log k.
4.3 Open problems
The main open problems our contribution raises are:
Can the doubly exponential dependence on d in Theorem 5 be improved?
Are the different dependencies on d in the approximation factors for the discrete k-center problem, the k-center problem, and the diameter k-clustering problem due to the limitations of our analysis or are they inherent to these problems?
Can our results be extended to more general distance measures?
Can the lower bounds for ℓ_p-metrics with 1 < p < ∞ be improved to Ω(log k), matching the lower bound from [5] for all ℓ_p-norms?
References
1 M. R. Ackermann, J. Blömer, D. Kuntze, and C. Sohler. Analysis of Agglomerative Clustering. CoRR: Computing Research Repository, 2010. arXiv:1012.3697 [cs.DS].
2 M. Bădoiu, S. Har-Peled, and P. Indyk. Approximate Clustering via Core-Sets. In Proceedings of STOC '02, pages 250–257, 2002.
3 A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic Clustering of the Web. In Selected papers from the sixth international conference on the World Wide Web, pages 1157–1166, Essex, UK, 1997. Elsevier Science Publishers Ltd.
4 M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental Clustering and Dynamic Information Retrieval. SIAM Journal on Computing, 33:1417–1440, June 2004.
5 S. Dasgupta and P. M. Long. Performance Guarantees for Hierarchical Clustering. Journal of Computer and System Sciences, 70(4):555–569, 2005.
6 M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, December 1998.
7 T. Feder and D. Greene. Optimal algorithms for approximate clustering. In Proceedings of STOC '88, pages 434–444, New York, NY, USA, 1988. ACM.
8 K. Florek, J. Łukaszewicz, J. Perkal, H. Steinhaus, and S. Zubrzycki. Sur la liaison et la division des points d'un ensemble fini. Colloquium Mathematicum, 2:282–285, 1951.
9 T. F. Gonzalez. Clustering to Minimize the Maximum Intercluster Distance. Theoretical Computer Science, 38:293–306, 1985.
10 K. Lee, J. Kim, K. Kwon, Y. Han, and S. Kim. DDoS attack detection method using cluster analysis. Expert Systems with Applications, 34(3):1659–1665, 2008.
11 L. L. McQuitty. Elementary Linkage Analysis for Isolating Orthogonal and Oblique Types and Typal Relevancies. Educational and Psychological Measurement, 17:207–209, 1957.
12 M. Naszódi. Covering a Set with Homothets of a Convex Body. Positivity, 2009.
13 F. Pereira, N. Tishby, and L. Lee. Distributional Clustering of English Words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183–190, 1993.
14 P. H. A. Sneath and R. R. Sokal. Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, 1973.