Evolutionary Clustering
To develop an evolutionary hierarchical clustering, we first describe a standard agglomerative clustering at a particular fixed timestep t. Let M = M_t = sim(·, ·, t) and U = U_{≤t}. First, we select the pair i, j of objects that maximizes M(i, j). Next, we merge these two objects, creating a new object; we also update the similarity matrix M by replacing the rows and columns corresponding to objects i and j by their average that represents the new object. We then repeat the procedure, building a bottom-up binary tree T whose leaves are the objects in U; the tree C_t = T_t = T represents the clustering of the objects at timestep t.
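For concreteness, the following is a minimal sketch of this bottom-up procedure for a single timestep, assuming a dense symmetric similarity matrix; the function and variable names are illustrative and not taken from the text.

    import numpy as np

    def agglomerate(sim):
        """Greedy agglomerative clustering for one timestep.

        sim: n x n symmetric similarity matrix M over the objects in U.
        Returns the internal nodes of the binary tree T as (left, right, new_id)
        triples, in merge order; the leaves are the object ids 0..n-1.
        """
        n = sim.shape[0]
        # similarity between the current (possibly merged) objects
        S = {i: {j: float(sim[i, j]) for j in range(n) if j != i} for i in range(n)}
        merges, next_id = [], n
        while len(S) > 1:
            # select the pair (i, j) of current objects that maximizes M(i, j)
            i, j = max(((a, b) for a in S for b in S[a] if a < b),
                       key=lambda p: S[p[0]][p[1]])
            # the merged object's similarity to each remaining object is the
            # average of the two merged rows
            new_row = {k: 0.5 * (S[i][k] + S[j][k]) for k in S if k not in (i, j)}
            del S[i]
            del S[j]
            for k in S:
                S[k].pop(i, None)
                S[k].pop(j, None)
                S[k][next_id] = new_row[k]
            S[next_id] = new_row
            merges.append((i, j, next_id))
            next_id += 1
        return merges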
Let the internal nodes of T be labeled m_1, . . . , m_{|U|−1}, and let sim_M(m_i) represent the similarity of the objects that were merged to produce the internal node m_i. Let in(T) be the set of all internal nodes of T. For an internal node m, let m_ℓ be the left child of m, m_r be the right child of m, and leaf(m) be the set of leaves in the subtree rooted at m. Let d_T(i, j) be the tree distance in T between nodes i and j. If T′, T are binary trees with leaf(T) ⊇ leaf(T′), then the binary tree T′|T denotes the projection of T′ onto T.
We refer to this heuristic as Squared, since it greedily minimizes the squared error in Equation 3.

However, we observe that a merge with a particular squared error may become better or worse if it is put off until later. To wit, if two objects are far away in T′, then perhaps we should delay the merge until they are similarly far away in T. However, if two objects are close in T′ but merging them would already make them far in T, then we should encourage the merge despite their high cost, as delaying will only make things worse. Based on this observation, we consider the cost of a merge based on what would change if we delayed the merge until the two merged subtrees became more distant from one another (due to intermediate merges).

Thus, consider a possible merge of subtrees S_1 and S_2. Performing the merge incurs a penalty for nodes that are still too close, and a benefit for nodes that are already too far apart. The benefit and penalty are expressed in terms of the change in cost if either S_1 or S_2 participates in another merge, and hence the elements of S_1 and S_2 increase their average distance by 1. This penalty may be written by taking the partial derivative of the squared cost with respect to the distance of an element to the root. At any point in the execution of the algorithm at time t, let root(i) be the root of the current subtree containing i. For i ∈ S_1 and j ∈ S_2, let d^m_T(i, j) be the merge distance of i and j at time t, i.e., d^m_T(i, j) is the distance between i and j at time t if S_1 and S_2 are merged together. Then,

    d^m_T(i, j) = d_T(i, root(i)) + d_T(j, root(j)) + 2.

The benefit of merging now is given by:

    sim_M(m) − cp · E_{i∈leaf(m_ℓ), j∈leaf(m_r)} [ d_{T′}(i, j) − d^m_T(i, j) ].    (6)

We refer to this heuristic as Linear-Internal. Notice that, as desired, the benefit is positive when the distance in T′ is large, and negative otherwise. Similarly, the magnitude of the penalty depends on the derivative of the squared error (Equation 3).
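To illustrate, here is a minimal sketch of how d^m_T and the benefit (6) might be evaluated for a candidate merge, assuming each element's current distance to the root of its subtree is tracked in a mapping named depth and the previous tree's distances in d_prev; these names and the data layout are assumptions made for illustration only.

    import itertools
    import numpy as np

    def merge_distance(depth, i, j):
        # d^m_T(i, j) = d_T(i, root(i)) + d_T(j, root(j)) + 2, where depth[x]
        # is the distance from x to the root of its current subtree.
        return depth[i] + depth[j] + 2

    def linear_internal_benefit(S1, S2, sim_m, d_prev, depth, cp):
        """Benefit of merging the subtrees with leaf sets S1 and S2 right now.

        sim_m  : similarity sim_M(m) of the two objects being merged
        d_prev : d_prev[i][j] = d_{T'}(i, j), tree distance in the previous tree
        cp     : change parameter trading snapshot quality against history
        """
        diffs = [d_prev[i][j] - merge_distance(depth, i, j)
                 for i, j in itertools.product(S1, S2)]
        # Equation (6): sim_M(m) minus cp times the expected gap between the
        # old tree distance and the distance the merge would create now.
        return sim_m - cp * float(np.mean(diffs))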
As another heuristic, we observe that our decision about merging S_1 with S_2 may also depend on objects that do not belong to either subtree. Assume that elements of S_1 are already too far apart from some subtree S_3. Then merging S_1 with S_2 may introduce additional costs downstream that are not apparent without looking outside the potential merge set. In order to address this problem, we modify (6) to penalize a merge if it increases the distance gap (i.e., the distance at time t versus the distance at time t−1) between elements that participate in the merge and elements that do not. Similarly, we give a benefit to a merge if it decreases the distance gap between elements in the merge and elements not in the merge. The joint formulation, Equation (7), therefore augments the internal term of (6) with an analogous term taken over pairs that cross the boundary of the merged subtree; that external term appears on its own in (8) below.
This heuristic considers the internal cost of merging elements i ∈ S_1 and j ∈ S_2, and the external cost of merging elements i ∈ S_1 ∪ S_2 and j ∉ S_1 ∪ S_2; therefore, we refer to it as Linear-Both. For completeness, we also consider the external cost alone:

    sim_M(m) + cp · E_{i∈leaf(m), j∉leaf(m)} [ d_{T′}(i, j) − d^m_T(i, j) ].    (8)

We refer to this final heuristic as Linear-External.
4.2 k-means clustering

Let the objects to be clustered be normalized to unit vectors in Euclidean space, i.e., the objects at time t are given by U_t = {x_{1,t}, . . .}, where each x_{i,t} ∈ ℜ^ℓ and the distance matrix is M_t(i, j) = dist(i, j, t) = ||x_{i,t} − x_{j,t}||. (See, for instance, [7].)
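For instance, a small sketch of this setup, assuming the objects arrive as rows of a raw feature matrix (the function names are illustrative):

    import numpy as np

    def unit_normalize(X):
        """Scale each row of X (one object per row) to unit Euclidean length."""
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.clip(norms, 1e-12, None)   # guard against zero vectors

    def distance_matrix(X):
        """M_t(i, j) = ||x_i - x_j|| for the unit-normalized rows of X."""
        diff = X[:, None, :] - X[None, :, :]      # pairwise differences
        return np.linalg.norm(diff, axis=2)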
We begin with a description of the traditional k-means algorithm. Let t be a fixed timestep and let U = U_{≤t}, x_i = x_{i,t}, M = M_t. The algorithm begins with a set of k cluster centroids c_1, . . . , c_k, with c_i ∈ ℜ^ℓ; these centroids can be initialized either randomly, or by using the results of the previous clustering C_{t−1} (which is exactly "incremental k-means"). Let closest(j) be the set of all points that are closest to centroid c_j, i.e.,

    closest(j) = { x ∈ U | j = arg min_{j′=1,...,k} ||c_{j′} − x|| }.

The algorithm proceeds in several passes, during each of which it updates each centroid based on the data elements currently assigned to that centroid:

    c_j ← E_{x∈closest(j)}(x),

after which c_j is normalized to have unit length. The algorithm terminates after sufficiently many passes, and the clustering C_t = C is given by the set {c_1, . . . , c_k} of k centroids.
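The following is a minimal sketch of this per-timestep loop, with centroids renormalized to unit length after each pass as described above; the fixed number of passes and the initialization argument are illustrative choices rather than prescriptions from the text.

    import numpy as np

    def kmeans_step(X, centroids):
        """One pass: assign points to their closest centroids, then recompute them."""
        # index of the closest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(centroids.shape[0]):
            members = X[assign == j]                     # closest(j)
            if len(members):
                c = members.mean(axis=0)                 # c_j <- E_{x in closest(j)}(x)
                new_centroids[j] = c / np.linalg.norm(c) # renormalize to unit length
        return new_centroids, assign

    def kmeans(X, init_centroids, passes=20):
        """Traditional k-means at a fixed timestep; init may come from C_{t-1}."""
        centroids = init_centroids.copy()
        for _ in range(passes):
            centroids, assign = kmeans_step(X, centroids)
        return centroids, assign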
We define the snapshot quality of a k-means clustering to be

    sq(C, M) = Σ_{x∈U} (1 − min_{c∈C} ||c − x||).

(Since all points are on the unit sphere, distances are bounded above by 1.)

We define the history cost, i.e., the distance between two clusterings, to be

    hc(C, C′) = min_{f:[k]→[k]} Σ_i ||c_i − c′_{f(i)}||,

where f is a function that maps centroids of C to centroids of C′. That is, the distance between two clusterings is computed by matching each centroid in C to a centroid in C′ in the best possible way, and then adding the distances for these matches.
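For concreteness, both quantities can be computed as in the following sketch, which finds the best matching f by enumerating permutations; this is adequate only for small k, and any assignment-problem solver could be substituted. Names are illustrative.

    from itertools import permutations
    import numpy as np

    def snapshot_quality(X, centroids):
        """sq(C, M) = sum over x in U of (1 - min_{c in C} ||c - x||)."""
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return float((1.0 - d.min(axis=1)).sum())

    def history_cost(C, C_prev):
        """hc(C, C') = min over matchings f of sum_i ||c_i - c'_{f(i)}||."""
        k = C.shape[0]
        return float(min(
            sum(np.linalg.norm(C[i] - C_prev[perm[i]]) for i in range(k))
            for perm in permutations(range(k))
        ))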
As stated earlier, we use a greedy approximation algorithm to choose the next cluster in the sequence. However, in the case of k-means, the greedy algorithm becomes particularly easy. At time t, for a current centroid c^t_j, let c^{t−1}_{f(j)} ∈ C_{t−1} be the closest centroid in C_{t−1}, and let n_j be the number of points currently assigned to c^t_j. The update then takes the form

    c^t_j ← γ · cp · c^{t−1}_{f(j)} + γ · (1 − cp) · E_{x∈closest(j)}(x),

where the normalization γ accounts for the relative sizes of the two clusters. In words, the new centroid c^t_j lies in between the centroid suggested by non-evolutionary k-means and its closest match from the previous timestep, weighted by cp and the relative sizes of these two clusters. Again, this is normalized to unit length, and we continue with the usual k-means iterations.
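A sketch of this evolutionary step is given below, assuming γ is folded into a convex combination followed by renormalization; the particular size-weighting used here is an illustrative assumption, not the exact weighting from the text.

    import numpy as np

    def evolutionary_centroid_update(members, prev_centroid, prev_size, cp):
        """Blend the current cluster mean with its closest centroid from C_{t-1}.

        members       : points currently in closest(j) at time t
        prev_centroid : c^{t-1}_{f(j)}, the closest centroid from the previous step
        prev_size     : size of that previous cluster (weights the history term)
        cp            : change parameter; cp = 0 recovers incremental k-means
        """
        snapshot_mean = members.mean(axis=0)          # E_{x in closest(j)}(x)
        n = len(members)
        # weight each term by cp and by the relative sizes of the two clusters
        w_hist = cp * prev_size
        w_snap = (1.0 - cp) * n
        c = (w_hist * prev_centroid + w_snap * snapshot_mean) / (w_hist + w_snap)
        return c / np.linalg.norm(c)                  # renormalize to unit length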
5. EXPERIMENTS

In this Section, we perform an extensive study of our algorithms under different parameter settings. We show how distance from history can be reduced significantly while still maintaining very high snapshot quality. For our experiments, we use the collection of timestamped photo–tag pairs from flickr.com indicating that at a given time, a certain tag was placed on a photo. A bipartite tag–photo graph is formed for each week, and two tags are considered to be similar if they co-occur on the same photo at the same timestep, as described before in Section 3. Our goal is to apply evolutionary clustering algorithms to this space of tags.
k-means clustering over time. For this experiment, we selected the 5000 most commonly occurring tags in the Flickr data and proceeded to study their clustering. We ran k-means with k = 10 centroids over time t = 0 . . . 67, for several values of cp. Recall that cp = 0 is exactly the same as applying k-means independently to each snapshot, but with the clusters found in the previous step as the starting seed; it is "incremental k-means," in other words.

Figure 1 shows the results. We observe the following: both the snapshot quality and the distance from history decrease as cp increases. In fact, incremental k-means (cp = 0) gives the best snapshot quality and the worst distance from history. This is to be expected, since clustering each snapshot independently should give the best quality performance, but at the cost of high distance from history. Also, even low values of cp lower the distance from history significantly. For example, even when cp is as low as 0.125, k-means incorporates history very well, which results in a significant drop in distance from history.

Agglomerative clustering over time. We empirically find that Linear-Both and Linear-Internal significantly outperform both Linear-External and Squared, so in Figure 2 we plot only the performance of Linear-Both and Linear-Internal over the top 2000 tags. The plots for Linear-Both are smoother than those for Linear-Internal, for all values of the change parameter cp. This demonstrates that the extra processing for Linear-Both improves the cluster tracking ability of the algorithm. Also note that the distance from history plot shows very high values for a few timesteps. We suspect this is due to increased activity during that timeframe; that was when Flickr "took off." Note that this peak also appears during k-means clustering (Figure 1(b)), reinforcing the idea that this is an artifact of the data.
Effect of cp on snapshot quality. Figure 3(a,b) shows the dependence of snapshot quality on cp. The snapshot quality values at time t are normalized by the corresponding value for cp = 0 to remove the effects of any artifacts in the data itself. We observe that the snapshot quality is inversely related to cp: the higher the cp, the more weight is assigned to the distance from history, and thus the worse the performance on snapshot quality.

However, while the snapshot quality decreases linearly and is well-behaved as a function of cp for k-means, the situation is different for agglomerative clustering. The snapshot quality takes a hit as soon as history is incorporated even a little bit, but the degradation after that is gentler. This suggests that k-means can accommodate more of history without compromising the snapshot quality.
Effect of cp on distance from history. Figure 3(c,d) shows the dependence of distance from history on the change parameter cp. The y-axis values are normalized by the corresponding value for cp = 0 at that timestep to remove any data artifacts. We see that the distance from history is inversely related to cp: as the value of cp is increased, our algorithms weigh the distance from history more heavily, and reducing it becomes relatively more important than increasing snapshot quality. Thus, higher cp leads to lower distance from history.
While k-means gets closer to history for small values of cp, the situation is more dramatic with agglomerative clustering. Even values of cp as small as 0.05 reduce the distance from history in a dramatic fashion. This suggests that the agglomerative clustering algorithm is easily influenced by history.

6. CONCLUSIONS

We considered the problem of clustering data over time and proposed an evolutionary clustering framework. This framework requires that the clustering at any point in time should be of high quality while ensuring that the clustering does not change dramatically from one timestep to the next. We presented two instantiations of this framework: k-means and agglomerative hierarchical clustering. Our experiments on Flickr tags showed that these algorithms have the desired properties, obtaining a solution that balances both the current and historical behavior of the data.

It will be interesting to study this framework for a larger family of clustering algorithms. It will also be interesting to investigate tree-based clustering algorithms that construct non-binary and weighted trees.

7. REFERENCES

[1] C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proceedings of the International Conference on Very Large Data Bases, pages 852–863, 2003.
[2] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[3] P. Auer and M. Warmuth. Tracking the best disjunction. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 312–321, 1995.
[4] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, 2004.
[5] C. Chatfield. The Analysis of Time Series. Chapman and Hall, 1984.
[6] S. Chien and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In Proceedings of the International Conference on the World-Wide Web, pages 2–11, 2005.
[7] I. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143–175, 2001.
[8] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2000.
[9] D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139–172, 1987.
[10] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the IEEE Symposium on Foundations of Computer Science, pages 359–366, 2000.
[11] M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.
[12] J. Lin, M. Vlachos, E. Keogh, and D. Gunopulos. Iterative incremental clustering of time series. In Proceedings of the International Conference on Extending Database Technology, pages 106–122, 2004.
[13] M. Meila. Comparing clusterings by the variation of information. In Proceedings of the ACM Conference on Computational Learning Theory, pages 173–187, 2003.
[14] P. Smyth. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems, volume 9, page 648, 1997.
[15] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 131–142, 2004.
[16] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[17] Y. Yang, T. Pierce, and J. Carbonell. A study on retrospective and on-line event detection. In Proceedings of the 21st ACM International Conference on Research and Development in Information Retrieval, pages 28–36, 1998.
[18] J. Zhang, Z. Ghahramani, and Y. Yang. A probabilistic model for online document clustering with applications to novelty detection. In Advances in Neural Information Processing Systems, 2005.
Figure 1: k-means clusters over time: As the change parameter cp increases, both the snapshot quality and the distance from history decrease. The case of cp = 0 is "incremental k-means." (Panels: (a) snapshot quality over time; (b) distance from history over time; curves for cp = 0, 0.125, 0.25, 0.5.)
Figure 2: Performance of agglomerative clustering over time: The plots for Linear-Both are far smoother than those of Linear-Internal. (Panels: (a) Linear-Both snapshot quality (log-linear); (b) Linear-Both distance from history (log-linear); (a) Linear-Internal snapshot quality (log-linear); (b) Linear-Internal distance from history (log-linear); curves for cp between 0 and 0.3.)
Figure 3: Snapshot quality and distance from history, versus the change parameter cp. (Panels: (a) k-means snapshot quality vs. cp; (b) agglomerative snapshot quality vs. cp; (c) k-means distance from history vs. cp; (d) agglomerative distance from history vs. cp; values normalized with respect to cp = 0.)