
Analysis of Agglomerative Clustering

Marcel R. Ackermann¹, Johannes Blömer¹, Daniel Kuntze¹, and Christian Sohler²

1 Department of Computer Science, University of Paderborn
  {mra,bloemer,kuntze}@upb.de
2 Department of Computer Science, TU Dortmund
  [email protected]
Abstract

The diameter k-clustering problem is the problem of partitioning a finite subset of R^d into k subsets called clusters such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem for all values of k is the agglomerative clustering algorithm with the complete linkage strategy. For decades this algorithm has been widely used by practitioners. However, it is not well studied theoretically. In this paper we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension d is a constant, we show that for any k the solution computed by this algorithm is an O(log k)-approximation to the diameter k-clustering problem. Moreover, our analysis holds not only for the Euclidean distance but for any metric that is based on a norm.
1998 ACM Subject Classification F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems - Geometrical problems and computations; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering; I.5.3 [Pattern Recognition]: Clustering - Algorithms, Similarity measures

Keywords and phrases agglomerative clustering, hierarchical clustering, complete linkage, approximation guarantees
1 Introduction
Clustering is the process of partitioning a set of objects into subsets (called clusters) such that each subset contains similar objects and objects in different subsets are dissimilar. It has many applications including data compression [13], analysis of gene expression data [6], anomaly detection [10], and structuring results of search engines [3]. For every application a proper objective function is used to measure the quality of a clustering. One particular objective function is the largest diameter of the clusters. If the desired number of clusters k is given, we call the problem of minimizing this objective function the diameter k-clustering problem.
One of the earliest and most widely used clustering strategies is agglomerative clustering. The history of agglomerative clustering goes back at least to the 1950s (see for example [8, 11]). Later, biological taxonomy became one of the driving forces of cluster analysis. In [14] the authors, who were the first biologists using computers to classify organisms, discuss several agglomerative clustering methods.

* For all four authors this research was supported by the German Research Foundation (DFG), grants BL 314/6-2 and SO 514/4-2.
Agglomerative clustering is a bottom-up clustering process. At the beginning, every input object forms its own cluster. In each subsequent step, the two closest clusters will be merged until only one cluster remains. This clustering process creates a hierarchy of clusters, such that for any two different clusters A and B from possibly different levels of the hierarchy we either have A ∩ B = ∅, A ⊆ B, or B ⊆ A. Such a hierarchy is useful in many applications, for example, when one is interested in hereditary properties of the clusters (as in some bioinformatics applications) or if the exact number of clusters is a priori unknown.
In order to define the agglomerative strategy properly, we have to specify a distance measure between clusters. Given a distance function between data objects, the following distance measures between clusters are frequently used. In the single linkage strategy, the distance between two clusters is defined as the distance between their closest pair of data objects. It is not hard to see that this strategy is equivalent to computing a minimum spanning tree of the graph induced by the distance function using Kruskal's algorithm. In case of the complete linkage strategy, the distance between two clusters is defined as the distance between their furthest pair of data objects. In the average linkage strategy the distance is defined as the average distance between data objects from the two clusters.
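The three cluster distances can be stated in a few lines of code. The following is a minimal sketch (our own illustration, not part of the paper), assuming points are given as tuples of floats and the Euclidean metric is used; the function names are ours.

    import math
    from itertools import product

    def dist(x, y):
        # Euclidean distance; any other norm-induced metric could be substituted.
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def single_linkage(A, B):
        # Distance between the closest pair of points, one from A and one from B.
        return min(dist(a, b) for a, b in product(A, B))

    def complete_linkage(A, B):
        # Distance between the furthest pair of points, one from A and one from B.
        return max(dist(a, b) for a, b in product(A, B))

    def average_linkage(A, B):
        # Average distance over all pairs of points from A and B.
        return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))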
1.1 Related Work
In this paper we study the agglomerative clustering algorithm using the complete linkage strategy to find a hierarchical clustering of n points from R^d. The running time is obviously polynomial in the description length of the input. Therefore, our only goal in this paper is to give an approximation guarantee for the diameter k-clustering problem. The approximation guarantee is given by a factor α such that the cost of the k-clustering computed by the algorithm is at most α times the cost of an optimal k-clustering. Although the agglomerative complete linkage clustering algorithm is widely used, only few theoretical results considering the quality of the clustering computed by this algorithm are known. It is known that there exists a certain metric distance function such that this algorithm computes a k-clustering with an approximation factor of Ω(log k) [5]. However, prior to the analysis we present in this paper, no non-trivial upper bound for the approximation guarantee of the classical complete linkage agglomerative clustering algorithm was known, and deriving such a bound has been discussed as one of the open problems in [5].
The diameter k-clustering problem is closely related to the k-center problem. In this
problem, we are searching for k centers and the objective is to minimize the maximum
distance of any input point to the nearest center. When the centers are restricted to come
from the set of the input points, the problem is called the discrete k-center problem. It is
known that for metric distance functions the costs of optimal solutions to all three problems
are within a factor of 2 from each other.
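The factor-2 relation is not spelled out in the text; the following chain of inequalities (our own addition, using only the triangle inequality) makes it explicit. Writing opt_k^diam, opt_k^cen, and opt_k^dcen for the optimal costs of the diameter k-clustering, k-center, and discrete k-center problem on the same input, we have

    opt_k^cen ≤ opt_k^dcen ≤ opt_k^diam ≤ 2 · opt_k^cen.

The first inequality holds because restricting centers to input points can only increase the radius, the second because choosing an arbitrary input point of each optimal-diameter cluster as its center yields radius at most that cluster's diameter, and the third because any two points assigned to the same center of an optimal k-center solution are within twice its radius of each other.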
For the Euclidean case we know that the diameter k-clustering problem and the k-center
problem are NP-hard. In fact, it is already NP-hard to approximate both problems with
an approximation factor below 1.96 and 1.82 respectively [7].
For fixed k, i.e. when we are not interested in a hierarchy of clusterings, there exist provably good approximation algorithms. For the discrete k-center problem, a simple 2-approximation algorithm is known for metric spaces [9], which immediately yields a 4-approximation algorithm for the diameter k-clustering problem. For the k-center problem, a variety of results is known. For example, for the Euclidean metric in [2] a (1 + ε)-approximation algorithm with running time 2^{O(k log k / ε^2)} · dn is shown. This implies a (2 + ε)-approximation algorithm with the same running time for the diameter k-clustering problem.
Also, for metric spaces a hierarchical clustering strategy with an approximation guarantee of 8 for the discrete k-center problem is known [5]. This implies an algorithm with an approximation guarantee of 16 for the diameter k-clustering problem.

This paper as well as all of the above mentioned work is about static clustering, i.e. in the problem definition we are given the whole set of input points at once. An alternative model of the input data is to consider sequences of points that are given one after another. In [4] the authors discuss clustering in a so-called incremental clustering model. They give an algorithm with constant approximation factor that maintains a hierarchical clustering while new points are added to the input set. Furthermore, they show a lower bound of Ω(log k) for the agglomerative complete linkage algorithm and the diameter k-clustering problem. However, since their model differs from ours, this result has no bearing on our lower bounds.
1.2 Our contribution
In this paper, we study the agglomerative complete linkage clustering algorithm for input sets X ⊆ R^d, where d is constant. To measure the distance between data points, we use a metric that is based on a norm, e.g., the Euclidean metric. We prove that in this case the agglomerative clustering algorithm is an O(log k)-approximation algorithm. Here, the O-notation hides a constant that is doubly exponential in d. This approximation guarantee holds for every level of the hierarchy computed by the algorithm. That is, we compare each computed k-clustering with an optimal solution for the particular value of k. These optimal k-clusterings do not necessarily form a hierarchy. In fact, there are simple examples where optimal solutions have no hierarchical structure.
Our analysis also yields that if we allow 2k instead of k clusters and compare the cost of
the computed 2k-clustering to an optimal solution with k clusters, the approximation factor
is independent of k and depends only on d. Moreover, the techniques of our analysis can be
applied to prove stronger results for the k-center problem and the discrete k-center problem.
For the k-center problem we derive an approximation guarantee that is logarithmic in k and
only single exponential in d. For the discrete k-center problem we derive an approximation
guarantee that is logarithmic in k and the dependence on d is only linear and additive.
Furthermore, we give almost matching upper and lower bounds for the one-dimensional case. These bounds are independent of k. For d ≥ 2 and the metric based on the ℓ_∞-norm we provide a lower bound that exceeds the upper bound for d = 1. For d ≥ 3 we give a lower bound for the Euclidean case which is above the lower bound for d = 1. Finally, we construct instances providing lower bounds for any metric based on an ℓ_p-norm with 1 ≤ p ≤ ∞. However, for these instances the lower bounds and the dimension d depend on k.
2 Preliminaries and problem definition
Throughout this paper, we consider input sets that are finite subsets of R^d. Our results hold for arbitrary metrics that are based on a norm, i.e., the distance ||x − y|| between two points x, y ∈ R^d is measured using an arbitrary norm ||·||. Readers who are not familiar with arbitrary metrics or are only interested in the Euclidean case may assume that the norm ||·||_2 is used, i.e. ||x − y|| = (Σ_{i=1}^d (x_i − y_i)^2)^{1/2}. For r ∈ R and y ∈ R^d we denote the closed d-dimensional ball of radius r centered at y by B_r^d(y) := {x | ||x − y|| ≤ r}.
Given k ∈ N and a finite set X ⊆ R^d with k ≤ |X| we say that C_k = {C_1, ..., C_k} is a k-clustering of X if the sets C_1, ..., C_k (called clusters) form a partition of X into k non-empty subsets. We call a collection of k-clusterings of the same finite set X but for different values of k hierarchical if it fulfills the following two properties. First, for any 1 ≤ k ≤ |X| the collection contains at most one k-clustering. Second, for any two of its clusterings C_i, C_j with |C_i| = i < j = |C_j|, every cluster in C_i is the union of one or more clusters from C_j. A hierarchical collection of clusterings is called a hierarchical clustering.
For a finite and non-empty set C ⊆ R^d we define the diameter of C to be diam(C) := max_{x,y∈C} ||x − y||. Finally, we define the cost of a k-clustering C_k as its largest diameter, i.e. cost(C_k) := max_{C∈C_k} diam(C).
Problem 1 (diameter k-clustering). Given k ∈ N and a finite set X ⊆ R^d with |X| ≥ k, find a k-clustering C_k of X with minimal cost.
For our analysis of agglomerative clustering we repeatedly use the volume argument stated in Lemma 3. This argument provides an upper bound on the minimum distance between two points from a finite set of points lying inside the union of finitely many balls. For the application of this argument the following definition is crucial.
Definition 2. Let k ∈ N and r ∈ R. A set X ⊆ R^d is called (k, r)-coverable if there exist y_1, ..., y_k ∈ R^d with X ⊆ ∪_{i=1}^k B_r^d(y_i).
Lemma 3. Let k ∈ N, r ∈ R and let P ⊆ R^d be finite and (k, r)-coverable with |P| > k. Then there exist distinct p, q ∈ P such that ||p − q|| ≤ 4r · (k/|P|)^{1/d}.
The proof of Lemma 3 can be found in the full version of this paper [1].
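For intuition, here is a small instantiation of the bound (our own numerical example, not taken from the paper): if P is (k, r)-coverable and |P| = 2^d · k, then Lemma 3 guarantees two distinct points p, q ∈ P with ||p − q|| ≤ 4r · (k / (2^d · k))^{1/d} = 4r / 2 = 2r. In other words, once each of the k covering balls holds 2^d points on average, some pair of points must be no further apart than the diameter of a single ball.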
3 Analysis
In this section we analyze the agglomerative algorithm for Problem 1, stated as Algorithm 1. Given a finite set X ⊆ R^d of input points, the algorithm computes hierarchical k-clusterings for all values of k between 1 and |X|. As mentioned before, the algorithm takes a bottom-up approach. It starts with the |X|-clustering that contains one cluster for each input point and then successively merges two of the remaining clusters that minimize the diameter of the resulting cluster.
Observation 4. The greedy strategy guarantees that the following holds for all computed
clusterings. First, the cost of the clustering is equal to the diameter of the cluster created
last. Second, the diameter of the union of any two clusters is always an upper bound for
the cost of the clustering to be computed next.
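A brief justification of Observation 4, ours rather than the paper's: the diameters of the clusters created by successive merges are non-decreasing. If the pair merged in some step does not involve the cluster created in the previous step, that pair was already available then, so the greedy choice implies its union's diameter is at least the previous merge's diameter; if it does involve that cluster, its union contains it, so the diameter cannot be smaller. Hence the most recently created cluster always has the largest diameter among all existing clusters (singletons have diameter 0), which gives the first claim, and the greedy choice of the next merge gives the second.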
Note that our results hold for any particular tie-breaking strategy. However, to keep the
analysis simple, we assume that there are no ties. Thus, for any input set X the clusterings
computed by Algorithm 1 are uniquely determined.
Our main result is the following theorem.
Theorem 5. Let X ⊆ R^d be a finite set of points. Then for all k ∈ N with k ≤ |X| the partition C_k of X into k clusters as computed by Algorithm 1 satisfies

    cost(C_k) = O(log k) · opt_k,

where opt_k denotes the cost of an optimal solution to Problem 1, and the constant hidden in the O-notation is doubly exponential in the dimension d.
AgglomerativeCompleteLinkage(X):
  Input: a finite set X of input points from R^d
  1: C_|X| := { {x} | x ∈ X }
  2: for i = |X| − 1, ..., 1 do
  3:   find distinct clusters A, B ∈ C_{i+1} minimizing diam(A ∪ B)
  4:   C_i := (C_{i+1} \ {A, B}) ∪ {A ∪ B}
  5: end for
  6: return C_1, ..., C_|X|

Algorithm 1 The agglomerative complete linkage clustering algorithm.
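For readers who want to experiment, the following is a direct, unoptimized Python transcription of Algorithm 1. It is our own sketch rather than the authors' implementation, makes no attempt at efficiency (diameters are recomputed from scratch in every step), and assumes the Euclidean metric via math.dist.

    from itertools import combinations
    import math

    def diam(cluster):
        # Largest pairwise distance within a cluster (0 for singletons).
        return max((math.dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)

    def agglomerative_complete_linkage(X):
        # X: an iterable of points, each a tuple of floats; duplicates are dropped,
        # since the input is assumed to be a set. Returns a dict mapping each
        # k in {1, ..., |X|} to the k-clustering produced by the greedy merges.
        current = [frozenset([x]) for x in set(X)]
        clusterings = {len(current): list(current)}
        while len(current) > 1:
            # Merge the pair of clusters whose union has the smallest diameter.
            A, B = min(combinations(current, 2), key=lambda ab: diam(ab[0] | ab[1]))
            current = [C for C in current if C not in (A, B)] + [A | B]
            clusterings[len(current)] = list(current)
        return clusterings

    # Example: agglomerative_complete_linkage([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)])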
We prove Theorem 5 in two steps. First, Proposition 6 in Section 3.1 provides an upper bound on the cost of the intermediate 2k-clustering. This upper bound is independent of k and |X| and may be of independent interest. Second, in the remainder of Section 3, we analyze the k merge steps of Algorithm 1 down to the computation of the k-clustering.

In the following, let X ⊆ R^d be the finite set of input points for Algorithm 1 and let k ∈ N be a fixed number of clusters with k ≤ |X|. Furthermore, to simplify notation let r := opt_k, where opt_k is the maximum diameter of an optimal solution to Problem 1. Since any cluster C is contained in a ball of radius diam(C), the set X is (k, r)-coverable, a fact that will be used frequently in our analysis. By C_1, ..., C_|X| we denote the clusterings computed by Algorithm 1 on input X.
3.1 Analysis of the 2k-clustering

Proposition 6. Let X ⊆ R^d be finite. Then for all k ∈ N with 2k ≤ |X| the partition C_2k of X into 2k clusters as computed by Algorithm 1 satisfies

    cost(C_2k) < 2^{3σ} · (28d + 6) · opt_k,

where σ = (42d)^d and opt_k denotes the cost of an optimal solution to Problem 1.
To prove Proposition 6 we divide the merge steps of Algorithm 1 into two stages. The first stage consists of the merge steps down to a 2^{2^{O(d log d)}} · k-clustering. The analysis of the first stage is based on the following notion of similarity. Two clusters are called similar if one cluster can be translated such that every point of the translated cluster is near a point of the second cluster. Then, by merging similar clusters, the diameter essentially increases by the length of the translation vector. During the whole first stage we guarantee that there is a sufficiently large number of similar clusters left. The cost of the intermediate 2^{2^{O(d log d)}} · k-clustering can be upper bounded by O(d) · opt_k.

The second stage consists of the merge steps reducing the number of remaining clusters from 2^{2^{O(d log d)}} · k to only 2k. In this stage we are no longer able to guarantee that a sufficiently large number of similar clusters exists. Therefore, we analyze the merge steps of the second stage using a weaker argument. The underlying reasoning of what we do for the second stage is the following. If there are more than 2k clusters left, we are able to find sufficiently many pairs of clusters that intersect with the same cluster of an optimal k-clustering. As long as one of these pairs is left, the cost of merging this pair gives an upper bound on the cost of the next merge step. Therefore, we can bound the diameter of the created cluster by the sum of the diameters of the two clusters plus the diameter of the optimal cluster. We find that the cost of the intermediate 2k-clustering is upper bounded by 2^{2^{O(d log d)}} · opt_k. Let us remark that we do not obtain our main result if we already use this argument for the first stage.
3.2 Stage one
In our analysis the first stage is subdivided into phases, such that in each phase the number of remaining clusters is reduced by one fourth. The following lemma will be used to bound the increase of the cost during a single phase.
Lemma 7. Let δ ∈ R with 0 < δ < 1 and let γ := ⌈3/δ⌉^d. Furthermore let m ∈ N with 2^{γ+1}·k < m ≤ |X|. Then

    cost(C_{⌈3m/4⌉}) < (1 + 2δ) · cost(C_m) + 4r · (2^{γ+1}·k / m)^{1/d}.  (1)
Proof. Let t := ⌈3m/4⌉ and let S := C_m ∩ C_{t+1} be the set of clusters from C_m that still exist ⌊m/4⌋ − 1 merge steps after the computation of C_m. In each iteration of its loop, the algorithm can merge at most two clusters from C_m. Thus |S| > m/2.

From every cluster C ∈ S we fix an arbitrary point and denote it by p_C. Let R := cost(C_m). Then the distance from p_C to any q ∈ C is at most R and we get C − p_C ⊆ B_R^d(0). A ball of radius R can be covered by γ balls of radius δR (see [12]). Hence, there exist y_1, ..., y_γ ∈ R^d with B_R^d(0) ⊆ ∪_{i=1}^γ B_{δR}^d(y_i). For C ∈ S we call the set Conf(C) := {y_i | 1 ≤ i ≤ γ and B_{δR}^d(y_i) ∩ (C − p_C) ≠ ∅} the configuration of C. That is, we identify each cluster C ∈ S with the subset of the balls B_{δR}^d(y_1), ..., B_{δR}^d(y_γ) that intersect with C − p_C. Note that no cluster C ∈ S has an empty configuration. The number of possible configurations can be upper bounded by 2^γ. With |S| > m/2 it follows that there exist j > m/2^{γ+1} distinct clusters C_1, ..., C_j ∈ S with the same configuration. Using m > 2^{γ+1}·k we deduce j > k.

Let P := {p_{C_1}, ..., p_{C_j}}. Since X is (k, r)-coverable, so is P ⊆ X. Therefore, by Lemma 3, there exist distinct a, b ∈ {1, ..., j} such that ||p_{C_a} − p_{C_b}|| ≤ 4r · (2^{γ+1}·k / m)^{1/d}.

Next we want to bound the diameter of the union of the corresponding clusters C_a and C_b. The distance between any two points u, v ∈ C_a or u, v ∈ C_b is at most the cost of C_m. Now let u ∈ C_a and v ∈ C_b. Using the triangle inequality, for any w ∈ R^d we obtain

    ||u − v|| ≤ ||p_{C_a} − p_{C_b}|| + ||u + p_{C_b} − p_{C_a} − w|| + ||w − v||.

For ||p_{C_a} − p_{C_b}|| we just derived an upper bound. To bound ||u + p_{C_b} − p_{C_a} − w||, we let y ∈ Conf(C_a) = Conf(C_b) be such that u − p_{C_a} ∈ B_{δR}^d(y). Furthermore, we fix w ∈ C_b with w − p_{C_b} ∈ B_{δR}^d(y). Hence, ||u + p_{C_b} − p_{C_a} − w|| = ||(u − p_{C_a}) − (w − p_{C_b})|| can be upper bounded by 2δR = 2δ · cost(C_m). For w ∈ C_b the distance ||w − v|| is bounded by diam(C_b) ≤ cost(C_m). We conclude that merging clusters C_a and C_b results in a cluster whose diameter can be upper bounded by

    diam(C_a ∪ C_b) < (1 + 2δ) · cost(C_m) + 4r · (2^{γ+1}·k / m)^{1/d}.

Using Observation 4 and the fact that C_a and C_b are part of the clustering C_{t+1}, we can upper bound the cost of C_t by cost(C_t) ≤ diam(C_a ∪ C_b).
Note that the parameter δ from Lemma 7 establishes a trade-off between the two terms on the right-hand side of Inequality (1). To complete the analysis of the first stage, we have to carefully choose δ. In the proof of the following lemma we use δ = ln(4/3)/(4d) and apply Lemma 7 for ⌈log_{4/3}(|X| / (2^{σ+1}·k))⌉ consecutive phases, where σ = (42d)^d. Then, we are able to upper bound the total increase of the cost by a term that is linear in d and r and independent of |X| and k. The number of remaining clusters is independent of the number of input points |X| and depends only on the dimension d and the desired number of clusters k.
Lemma 8. Let 2^{σ+1}·k < |X| for σ = (42d)^d. Then on input X Algorithm 1 computes a clustering C_{2^{σ+1}·k} with cost(C_{2^{σ+1}·k}) < (28d + 4) · r.
Proof. Let u := ⌈log_{3/4}(2^{σ+1}·k / |X|)⌉ and define m_i := ⌈(3/4)^i · |X|⌉ for all i = 0, ..., u. Furthermore let δ := ln(4/3)/(4d). This implies γ ≤ σ for the parameter γ of Lemma 7. Then m_u ≤ 2^{σ+1}·k and m_i > 2^{σ+1}·k ≥ 2^{γ+1}·k for all i = 0, ..., u − 1. We apply Lemma 7 with m = m_i for all i = 0, ..., u − 1. Since ⌈3m_i/4⌉ ≤ m_{i+1} and Algorithm 1 uses a greedy strategy, we deduce cost(C_{m_{i+1}}) ≤ cost(C_{⌈3m_i/4⌉}) for all i = 0, ..., u − 1. Using cost(C_{2^{σ+1}·k}) ≤ cost(C_{m_u}) and cost(C_{m_0}) = 0 we get

    cost(C_{2^{σ+1}·k}) < Σ_{i=0}^{u−1} (1 + 2δ)^i · 4r · (2^{σ+1}·k / ((3/4)^{u−1−i}·|X|))^{1/d}
                        = 4r · (2^{σ+1}·k / ((3/4)^{u−1}·|X|))^{1/d} · Σ_{i=0}^{u−1} ((1 + 2δ) · (3/4)^{1/d})^i.

Using u − 1 < log_{3/4}(2^{σ+1}·k / |X|), i.e. (3/4)^{u−1}·|X| > 2^{σ+1}·k, we get

    cost(C_{2^{σ+1}·k}) < 4r · Σ_{i=0}^{u−1} ((1 + 2δ) / (4/3)^{1/d})^i.  (2)

By taking only the first two terms of the series expansion of the exponential function we get 1 + 2δ = 1 + ln(4/3)/(2d) < e^{ln(4/3)/(2d)} = (4/3)^{1/(2d)}. Substituting this bound into Inequality (2) and extending the sum gives

    cost(C_{2^{σ+1}·k}) < 4r · Σ_{i=0}^{∞} (1 / (4/3)^{1/(2d)})^i < 4r · Σ_{i=0}^{∞} (1 / (1 + 2δ))^i.

Solving the geometric series leads to

    cost(C_{2^{σ+1}·k}) < 4r · (1/(2δ) + 1) < (28d + 4) · r.

3.3 Stage two


The second stage covers the remaining merge steps until Algorithm 1 computes the clustering C_2k. The following lemma is the analogue of Lemma 8. Again, the proof subdivides the merge steps into phases of one fourth of the remaining steps. However, compared to stage one, the analysis of a single phase yields a weaker bound. The proof can be found in the full version of this paper [1].
Lemma 9. Let n ∈ N with n ≤ 2^{σ+1}·k and 2k < n ≤ |X| for σ = (42d)^d. Then on input X Algorithm 1 computes a clustering C_2k with

    cost(C_2k) < 2^{3σ} · (cost(C_n) + 2r).
Proposition 6 follows immediately by combining Lemma 8 and Lemma 9.
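Spelled out (our own arithmetic, for the case 2^{σ+1}·k < |X|): Lemma 8 gives cost(C_{2^{σ+1}·k}) < (28d + 4)·r, and Lemma 9 applied with n = 2^{σ+1}·k then yields cost(C_2k) < 2^{3σ}·((28d + 4)·r + 2r) = 2^{3σ}·(28d + 6)·opt_k. If instead |X| ≤ 2^{σ+1}·k, Lemma 9 can be applied with n = |X|, where cost(C_{|X|}) = 0.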
3.4 Connected instances
For the analysis of the two stages in Section 3.1 we use arguments that are only applicable if there are enough clusters left (at least 2k in case of stage two). To analyze the remaining merge steps, we show that it is sufficient to analyze Algorithm 1 on a subset Y ⊆ X satisfying a certain connectivity property. Using this property we are able to apply a combinatorial approach that relies on the number of merge steps left. This introduces the O(log k) term to the approximation factor of our main result.
We start by defining the connectivity property that will be used to relate clusters to an optimal k-clustering.

Definition 10. Let Z ⊆ R^d and r ∈ R. Two sets A, B ⊆ R^d are called (Z, r)-connected if there exists a z ∈ Z with B_r^d(z) ∩ A ≠ ∅ and B_r^d(z) ∩ B ≠ ∅.

Note that for any two (Z, r)-connected clusters A, B we have

    diam(A ∪ B) ≤ diam(A) + diam(B) + 2r.  (3)
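The short argument behind Inequality (3), spelled out here for completeness: let z ∈ Z witness the connection and fix a' ∈ A ∩ B_r^d(z) and b' ∈ B ∩ B_r^d(z). Then for any a ∈ A and b ∈ B the triangle inequality gives ||a − b|| ≤ ||a − a'|| + ||a' − b'|| + ||b' − b|| ≤ diam(A) + 2r + diam(B), and pairs within A or within B are trivially bounded by diam(A) or diam(B).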
Next, we show that for any input set X we can bound the cost of the k-clustering computed by Algorithm 1 by the cost of the ℓ-clustering computed by the algorithm on a connected subset Y ⊆ X for a proper ℓ ≤ k. Recall that by our convention from the beginning of Section 3, the clusterings computed by Algorithm 1 on a particular input set are uniquely determined.
Lemma 11. Let X ⊆ R^d be finite and k ∈ N with k ≤ |X|. Then there exists a subset Y ⊆ X, a number ℓ ∈ N with ℓ ≤ min(k, |Y|), and a set Z ⊆ R^d with |Z| = ℓ such that:
1. Y is (ℓ, r)-coverable;
2. cost(C_k) ≤ cost(P_ℓ);
3. For all n ∈ N with ℓ + 1 ≤ n ≤ |Y|, every cluster in P_n is (Z, r)-connected to another cluster in P_n.
Here, the collection P_1, ..., P_{|Y|} denotes the hierarchical clustering computed by Algorithm 1 on input Y.
Proof. To define Y, Z, and ℓ we consider the (k + 1)-clustering computed by Algorithm 1 on input X. We know that X = ∪_{A∈C_{k+1}} A is (k, r)-coverable. Let E ⊆ C_{k+1} be a minimal subset such that ∪_{A∈E} A is (|E| − 1, r)-coverable, i.e., for all sets F ⊆ C_{k+1} with |F| < |E| the union ∪_{A∈F} A is not (|F| − 1, r)-coverable. Since a set F of size 1 cannot be (|F| − 1, r)-coverable, we get |E| ≥ 2.

Let Y := ∪_{A∈E} A and ℓ := |E| − 1. Then ℓ ≤ k and Y is (ℓ, r)-coverable. Thus, we can define Z ⊆ R^d with |Z| = ℓ and Y ⊆ ∪_{z∈Z} B_r^d(z). Furthermore, we let P_1, ..., P_{|Y|} be the hierarchical clustering computed by Algorithm 1 on input Y.

Since Y is the union of the clusters from E ⊆ C_{k+1}, each merge step between the computation of C_{|X|} and C_{k+1} merges either two clusters A, B ⊆ Y or two clusters A, B ⊆ X \ Y. The merge steps inside X \ Y have no influence on the clusters inside Y. Furthermore, the merge steps inside Y would be the same in the absence of the clusters inside X \ Y. Therefore, on input Y Algorithm 1 computes the (ℓ + 1)-clustering P_{ℓ+1} = E = C_{k+1} ∩ 2^Y. Thus, P_{ℓ+1} ⊆ C_{k+1}.

To compute P_ℓ, Algorithm 1 on input Y merges two clusters from P_{ℓ+1} that minimize the diameter of the resulting cluster. Analogously, Algorithm 1 on input X merges two clusters from C_{k+1} to compute C_k. Since P_{ℓ+1} ⊆ C_{k+1}, Observation 4 implies cost(C_k) ≤ cost(P_ℓ).

It remains to show that for all n ∈ N with ℓ + 1 ≤ n ≤ |Y| it holds that every cluster in P_n is (Z, r)-connected to another cluster in P_n. We first show the property for n = ℓ + 1. For ℓ = 1 this follows from the fact that B_r^d(z) with Z = {z} has to contain both clusters from P_2. For ℓ > 1 we would otherwise be able to remove one cluster from P_{ℓ+1} and get ℓ clusters whose union is (ℓ − 1, r)-coverable. This contradicts the definition of E = P_{ℓ+1} as a minimal subset with this property.

To prove 3. for general n, let C_1 ∈ P_n and z ∈ Z with B_r^d(z) ∩ C_1 ≠ ∅. There exists a unique cluster C̃_1 ∈ P_{ℓ+1} with C_1 ⊆ C̃_1. Then we have B_r^d(z) ∩ C̃_1 ≠ ∅. However, B_r^d(z) has to intersect with at least two clusters from P_{ℓ+1}. Thus, there exists another cluster C̃_2 ∈ P_{ℓ+1} with B_r^d(z) ∩ C̃_2 ≠ ∅. Since every cluster from P_{ℓ+1} is a union of clusters from P_n, there exists at least one cluster C_2 ∈ P_n with C_2 ⊆ C̃_2 and B_r^d(z) ∩ C_2 ≠ ∅.
3.5 Analysis of the remaining merge steps
Let Y, Z, ℓ, and P_1, ..., P_{|Y|} be as given in Lemma 11. Then, Proposition 6 can be used to obtain an upper bound for the cost of P_{2ℓ}. In the following, we analyze the merge steps leading from P_{2ℓ} to P_{ℓ+1} and show how to obtain an upper bound for the cost of P_{ℓ+1}. As in Section 3.1, we analyze the merge steps in phases. The following lemma is used to bound the increase of the cost during a single phase.
Lemma 12. Let m, n ∈ N with n ≤ 2ℓ and ℓ < m ≤ n ≤ |Y|. If there are no two (Z, r)-connected clusters in P_m ∩ P_n, it holds that

    cost(P_{⌈(m+ℓ)/2⌉}) ≤ cost(P_m) + 2 · (cost(P_n) + 2r).
Proof. We show that there exist at least m − ℓ disjoint pairs of clusters from P_m such that the diameter of their union can be upper bounded by cost(P_m) + 2 · (cost(P_n) + 2r). By Observation 4, this upper bounds the cost of the computed clusterings as long as such a pair of clusters remains. Then the lemma follows from the fact that in each iteration of its loop the algorithm can destroy at most two of these pairs.

To bound the number of such pairs of clusters we start with a structural observation. Let S := P_m ∩ P_n be the set of clusters from P_n that still exist in P_m. By our definition of Y, Z, and ℓ we find that any cluster A ∈ S ⊆ P_m is (Z, r)-connected to another cluster B ∈ P_m. If we assume that there are no two (Z, r)-connected clusters in S, this implies B ∈ P_m \ S. Thus, using A ∈ P_n, B ∈ P_m, and Inequality (3), the diameter of A ∪ B can be bounded by

    diam(A ∪ B) ≤ cost(P_m) + cost(P_n) + 2r.  (4)

Moreover, using a similar argument, if two clusters A_1, A_2 ∈ S ⊆ P_n are (Z, r)-connected to the same cluster B ∈ P_m \ S, we can bound the diameter of A_1 ∪ A_2 by

    diam(A_1 ∪ A_2) ≤ cost(P_m) + 2 · (cost(P_n) + 2r).  (5)

Next we show that there exist at least ⌈|S|/2⌉ disjoint pairs of clusters from P_m such that the diameter of their union can be bounded either by Inequality (4) or by Inequality (5). To do so, we first consider the pairs of clusters from S that are (Z, r)-connected to the same cluster from P_m \ S until no candidates are left. For these pairs we can bound the diameter of their union by Inequality (5). Then, each cluster from P_m \ S is (Z, r)-connected to at most one of the remaining clusters from S. Thus, each remaining cluster A ∈ S can be paired with a different cluster B ∈ P_m \ S such that A and B are (Z, r)-connected. For these pairs we can bound the diameter of their union by Inequality (4). Since for all pairs either one or both of the clusters come from the set S, we can lower bound the number of pairs by ⌈|S|/2⌉.

To complete the proof, we show that m − ℓ ≤ ⌈|S|/2⌉. In each iteration of its loop the algorithm can merge at most two clusters from P_n. Therefore, to compute P_m, at least ⌈(n − |S|)/2⌉ merge steps must have been done since the computation of P_n. Hence, m ≤ n − ⌈(n − |S|)/2⌉ ≤ n/2 + |S|/2. Using n ≤ 2ℓ we get m − ℓ ≤ ⌈|S|/2⌉.
Lemma 13. Let n ∈ N with n ≤ 2ℓ and ℓ < n ≤ |Y|. Then

    cost(P_{ℓ+1}) < 2 · (log_2 ℓ + 2) · (cost(P_n) + 2r).
Proof. For n = ℓ + 1 there is nothing to show. Hence, assume n > ℓ + 1. Then by definition of Z there exist two (Z, r)-connected clusters in P_n. Now let n' ∈ N with n' < n be maximal such that no two (Z, r)-connected clusters exist in P_{n'} ∩ P_n. The number n' is well-defined since at least the set P_1 does not contain two clusters at all. It follows that the same holds for all m ∈ N with m ≤ n'. We conclude that Lemma 12 is applicable for all m ≤ n' with ℓ < m.

By the definition of n' there still exist at least two (Z, r)-connected clusters in P_{n'+1} ∩ P_n. Then, Observation 4 implies

    cost(P_{n'}) ≤ 2 · cost(P_n) + 2r.  (6)

If n' ≤ ℓ + 1 then Inequality (6) proves the lemma. For n' > ℓ + 1 let u := ⌈log_2(n' − ℓ)⌉ and define m_i := ⌈(1/2)^i · (n' − ℓ)⌉ + ℓ > ℓ for all i = 0, ..., u. We apply Lemma 12 with m = m_i for all i = 0, ..., u − 1. Since ⌈(m_i + ℓ)/2⌉ ≤ m_{i+1} and Algorithm 1 uses a greedy strategy, we deduce cost(P_{m_{i+1}}) ≤ cost(P_{⌈(m_i + ℓ)/2⌉}) for all i = 0, ..., u − 1. By summation over all i, we get

    cost(P_{m_u}) < cost(P_{n'}) + 2u · (cost(P_n) + 2r) < 2 · (u + 1) · (cost(P_n) + 2r),

where the last step uses Inequality (6). Since n' < 2ℓ we get u < log_2 ℓ + 1 and the lemma follows using m_u = ℓ + 1.
The following lemma finishes the analysis except for the last merge step.

Lemma 14. Let Y ⊆ R^d be finite and ℓ ≤ |Y| such that Y is (ℓ, r)-coverable. Furthermore, let Z ⊆ R^d with |Z| = ℓ be such that for all n ∈ N with ℓ + 1 ≤ n ≤ |Y| every cluster in P_n is (Z, r)-connected to another cluster in P_n, where P_1, ..., P_{|Y|} denotes the hierarchical clustering computed by Algorithm 1 on input Y. Then

    cost(P_{ℓ+1}) < 2 · (log_2 ℓ + 2) · (2^{3σ} · (28d + 6) + 2) · r

for σ = (42d)^d.

Proof. Let n := min(|Y|, 2ℓ). Then, using Proposition 6, we get cost(P_n) < 2^{3σ} · (28d + 6) · r. The lemma follows by using this bound in combination with Lemma 13.
3.6 Proof of Theorem 5

Using Lemma 11 we know that there is a subset Y ⊆ X, a number ℓ ≤ k, and a hierarchical clustering P_1, ..., P_{|Y|} of Y with cost(C_k) ≤ cost(P_ℓ). Furthermore, there is a set Z ⊆ R^d such that every cluster from P_{ℓ+1} is (Z, r)-connected to another cluster in P_{ℓ+1}. Thus, P_{ℓ+1} contains two clusters A, B that intersect with the same ball of radius r. Hence cost(C_k) ≤ diam(A ∪ B) ≤ 2 · cost(P_{ℓ+1}) + 2r. The theorem follows using Lemma 14 and ℓ ≤ k.
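Spelling out the final combination (our own arithmetic): by Lemma 14, cost(P_{ℓ+1}) < 2 · (log_2 ℓ + 2) · (2^{3σ} · (28d + 6) + 2) · r, so cost(C_k) ≤ 2 · cost(P_{ℓ+1}) + 2r < (4 · (log_2 ℓ + 2) · (2^{3σ} · (28d + 6) + 2) + 2) · r. Since ℓ ≤ k and r = opt_k, this is O(log k) · opt_k with a hidden constant of order 2^{O((42d)^d)}, i.e. doubly exponential in d.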
4 Further results and open problems
4.1 Lower bounds
For d = 1 we are able to show that Algorithm 1 computes an approximation to Problem 1 with an approximation factor between 2.5 and 3. We even know that for any input set X ⊆ R the approximation factor of the computed solution is strictly below 3. However, we do not show an approximation factor of 3 − ε for some ε > 0. The proof of the upper bound is very technical, makes extensive use of the total order of the real numbers, and is too long to be included in this extended abstract.

Furthermore, we know that the dimension d has an impact on the approximation factor of Algorithm 1. This follows from a 2-dimensional input set yielding a lower bound of 3 for the metric based on the ℓ_∞-norm. Note that this exceeds the upper bound from the one-dimensional case. Furthermore, for the Euclidean metric we know a 3-dimensional input set giving a lower bound of 2.56, thus exceeding our lower bound from the one-dimensional case.
Moreover, we can show that there exist input instances such that Algorithm 1 computes an approximation to Problem 1 with an approximation factor of Ω((log k)^{1/p}) for metrics based on an ℓ_p-norm (1 ≤ p < ∞) and Ω(log k) for the metric based on the ℓ_∞-norm. In case of the ℓ_1- and the ℓ_∞-norm this matches the already known lower bound [5] that has been shown using a rather artificial metric. However, in our instances the dimension d is not fixed but depends on k.

All lower bounds mentioned above are proven in the full version of this paper [1].
4.2 Related clustering problems
As mentioned in the introduction, the costs of optimal solutions to the diameter k-clustering problem, the k-center problem, and the discrete k-center problem are within a factor of 2 from each other. Therefore, Algorithm 1 computes an O(log k)-approximation for all three problems.

In case of the k-center problem and the discrete k-center problem, our techniques can be applied in a simplified way and yield stronger bounds. Here, we consider the agglomerative algorithm that minimizes the (discrete) k-center cost function in every merge step. In case of the k-center problem we are able to show an upper bound that is logarithmic in k and single exponential in d. More precisely, the dependency on d in the upper bound for the cost of the 2k-clustering improves from doubly exponential to only single exponential. This is mainly achieved because the analysis no longer requires configurations of clusters.

In case of the discrete k-center problem we are able to show an upper bound of 20d + 2·log_2 k + 4 for the approximation factor. Here, the analysis benefits from the fact that we are able to bound the increase of the cost in each phase of the second stage by a term that is only additive.
The lower bound of Ω((log k)^{1/p}) for any ℓ_p-norm and Ω(log k) for the ℓ_∞-norm can be adapted to the discrete k-center problem (see the full version of this paper [1]). In particular, in case of the ℓ_2-norm we obtain an instance in dimension O(log^3 k) for which we can show a lower bound of Ω((log k)^{1/2}). Applying the upper bound of 20d + 2·log_2 k + 4 to this instance, we see that Algorithm 1 computes a k-clustering whose cost is O(log^3 k) times the cost of an optimal solution. This implies that the approximation factor of Algorithm 1 cannot be simultaneously independent of d and log k. More precisely, the approximation factor cannot be sublinear in d^{1/6} and in (log k)^{1/2}.
4.3 Open problems
The main open problems our contribution raises are:
- Can the doubly exponential dependence on d in Theorem 5 be improved?
- Are the different dependencies on d in the approximation factors for the discrete k-center problem, the k-center problem, and the diameter k-clustering problem due to the limitations of our analysis or are they inherent to these problems?
- Can our results be extended to more general distance measures?
- Can the lower bounds for ℓ_p-metrics with 1 < p < ∞ be improved to Ω(log k), matching the lower bound from [5] for all ℓ_p-norms?
References
1  M. R. Ackermann, J. Blömer, D. Kuntze, and C. Sohler. Analysis of Agglomerative Clustering. CoRR: Computing Research Repository, 2010. arXiv:1012.3697 [cs.DS].
2  M. Badoiu, S. Har-Peled, and P. Indyk. Approximate Clustering via Core-Sets. In Proceedings of STOC '02, pages 250–257, 2002.
3  A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic Clustering of the Web. In Selected papers from the sixth international conference on WWW, pages 1157–1166, Essex, UK, 1997. Elsevier Science Publishers Ltd.
4  M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental Clustering and Dynamic Information Retrieval. SIAM J. Comput., 33:1417–1440, June 2004.
5  S. Dasgupta and P. M. Long. Performance Guarantees for Hierarchical Clustering. JCSS: Journal of Computer and System Sciences, 70(4):555–569, 2005.
6  M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS: Proceedings of the National Academy of Sciences, 95(25):14863–14868, December 1998.
7  T. Feder and D. Greene. Optimal algorithms for approximate clustering. In Proceedings of STOC '88, pages 434–444, New York, NY, USA, 1988. ACM.
8  K. Florek, J. Lukaszewicz, J. Perkal, H. Steinhaus, and S. Zubrzycki. Sur la liaison et la division des points d'un ensemble fini. Colloquium Math., 2:282–285, 1951.
9  T. F. Gonzalez. Clustering to Minimize the Maximum Intercluster Distance. Theoretical Computer Science, 38:293–306, 1985.
10 K. Lee, J. Kim, K. Kwon, Y. Han, and S. Kim. DDoS attack detection method using cluster analysis. Expert Systems with Applications, 34(3):1659–1665, 2008.
11 L. L. McQuitty. Elementary Linkage Analysis for Isolating Orthogonal and Oblique Types and Typal Relevancies. Educational and Psychological Measurement, 17:207–209, 1957.
12 M. Naszódi. Covering a Set with Homothets of a Convex Body. Positivity, 2009.
13 F. Pereira, N. Tishby, and L. Lee. Distributional Clustering of English Words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183–190, 1993.
14 P. H. A. Sneath and R. R. Sokal. Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, 1973.
