(TAM-12) Top-Down Vs Bottom-Up Methods of Linkage For Asymmetric Agglomerative Clustering
$$ s(G, G') = \frac{1}{|G|\,|G'|} \sum_{x \in G,\, y \in G'} s(x, y). \tag{4} $$
Two further linkage methods, the centroid method and the
Ward method, assume that objects are points in Euclidean
space. They use dissimilarity measures related to the
Euclidean distance. For example, the centroid method uses
the squared Euclidean distance between the centroids of two
clusters. All five linkage methods mentioned above assume
that the similarity or dissimilarity measure is symmetric.
For the single link, complete link, and average link, it is
known that $m_K$ is monotonic:
$$ m_N \ge m_{N-1} \ge \cdots \ge m_2 \ge m_1. \tag{5} $$
If the monotonicity does not hold, we have a reversal in
a dendrogram: $G$ and $G'$ are merged into
$\hat{G} = G \cup G'$ at level $m = s(G, G')$, after that
$\hat{G}$ and $G''$ are merged at level
$\hat{m} = s(\hat{G}, G'')$, and $\hat{m} > m$ occurs.
Reversals in a dendrogram are observed for the centroid
method. A simple example of a reversal is shown in Fig. 1.
Figure 1. A simple example of reversal.
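A reversal of this kind can be reproduced numerically. The following is a minimal Python sketch, assuming the centroid method with the squared Euclidean distance, so that a reversal appears as a merging level that decreases. The three points and the function names are illustrative choices and are not the data of Fig. 1.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Center of gravity of a list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_merge_levels(points):
    """Run the centroid method and return the dissimilarity at each merge."""
    clusters = [[p] for p in points]
    levels = []
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        d, i, j = min(
            (sq_dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        levels.append(d)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return levels


# Hypothetical points forming a nearly equilateral triangle.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.8)]
levels = centroid_merge_levels(points)
print(levels)
# The second merge happens at a smaller dissimilarity than the first.
print("reversal:", levels[1] < levels[0])
```

The two base points are merged first; their centroid is then closer to the third point than the base points were to each other, which is exactly the situation described above.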
III. CONCEPT OF TOP-DOWN AND BOTTOM-UP METHODS
As briefly noted in the introduction, we classify the five
well-known linkage methods of agglomerative hierarchical
clustering that use symmetric similarity or dissimilarity
measures. In some linkage methods, the similarity between
objects is defined first, and the similarity between clusters
is then defined from it. We call such a linkage method a
"bottom-up" method. In other methods, the similarity measure
between clusters is defined first, and the similarity between
objects is the special case in which each cluster consists of
a single object. Such a method is called a "top-down" method.
The single linkage, complete linkage, and average linkage
are bottom-up methods, since $s(x, y)$ is first defined between
a pair of objects, and then $s(G, G')$ is defined: the single
linkage uses (3), the average linkage uses (4), and the
complete linkage uses
$$ s(G, G') = \min_{x \in G,\, y \in G'} s(x, y). \tag{6} $$
In these methods, $s(G, G')$ cannot be defined directly, that
is, without first defining $s(x, y)$.
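The bottom-up construction can be sketched in a few lines of Python: every cluster-level value is computed from the object-level table $s(x, y)$ alone. This is an illustrative sketch; the similarity values are made up, and the single link is assumed to take the maximum pairwise similarity, since (3) is not reproduced in this part of the text.

```python
def single_link(s, G, H):
    """Largest pairwise similarity (assumed form of the single link)."""
    return max(s[x][y] for x in G for y in H)


def complete_link(s, G, H):
    """Smallest pairwise similarity, as in (6)."""
    return min(s[x][y] for x in G for y in H)


def average_link(s, G, H):
    """Mean pairwise similarity, as in (4)."""
    return sum(s[x][y] for x in G for y in H) / (len(G) * len(H))


# Hypothetical symmetric similarities between four objects.
pairs = {
    ("a", "b"): 0.9, ("a", "c"): 0.2, ("a", "d"): 0.4,
    ("b", "c"): 0.6, ("b", "d"): 0.1, ("c", "d"): 0.8,
}
s = {}
for (x, y), value in pairs.items():
    s.setdefault(x, {})[y] = value
    s.setdefault(y, {})[x] = value  # symmetry: s(x, y) = s(y, x)

G, H = ["a", "b"], ["c", "d"]
print(single_link(s, G, H), complete_link(s, G, H), average_link(s, G, H))
```

None of the three functions can be evaluated without the table `s`, which is what makes these methods bottom-up.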
In contrast, the centroid method and the Ward method are
top-down methods, since
$$ d(G, G') = \| M(G) - M(G') \|^2 \tag{7} $$
is used in the centroid method, where $d(G, G')$ is a
dissimilarity measure and $M(G)$ is the center of gravity
(alias centroid) of $G$.
In the Ward method, we first define
$$ E(G) = \sum_{x_k \in G} \| x_k - M(G) \|^2 \tag{8} $$
and then
$$ d(G, G') = E(G \cup G') - E(G) - E(G'). \tag{9} $$
In both methods, the dissimilarity $d(G, G')$ between
clusters is defined directly, without referring to $d(x, y)$.
In the next section we propose a top-down method for an
asymmetric similarity measure.
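As an illustration of the top-down definitions (7) to (9), the following Python sketch computes both dissimilarities directly from the clusters. The points and function names are hypothetical. The final check relies on the standard identity that the Ward dissimilarity equals $\frac{|G||G'|}{|G|+|G'|} \| M(G) - M(G') \|^2$, which is not stated in the text above.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(G):
    """M(G): center of gravity of a list of points."""
    return tuple(sum(coords) / len(G) for coords in zip(*G))


def centroid_d(G, H):
    """Centroid-method dissimilarity, as in (7)."""
    return sq_dist(centroid(G), centroid(H))


def within_error(G):
    """E(G): sum of squared distances to the centroid, as in (8)."""
    m = centroid(G)
    return sum(sq_dist(x, m) for x in G)


def ward_d(G, H):
    """Ward dissimilarity, as in (9)."""
    return within_error(G + H) - within_error(G) - within_error(H)


# Hypothetical clusters of points in the plane.
G = [(0.0, 0.0), (2.0, 0.0)]
H = [(4.0, 3.0)]
print(centroid_d(G, H), ward_d(G, H))

# Known identity linking the two measures.
weight = len(G) * len(H) / (len(G) + len(H))
print(abs(ward_d(G, H) - weight * centroid_d(G, H)) < 1e-9)
```

Neither function needs a pairwise table $d(x, y)$; for singleton clusters, `centroid_d` reduces to the squared distance between the two objects.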
IV. ASYMMETRIC SIMILARITY MEASURES
Up to now, methods using asymmetric measures have been
bottom-up, as shown in [9], [12]. In this section, we show
two asymmetric agglomerative hierarchical methods, one of
which is top-down and the other bottom-up. We discuss
these two methods here because they have no reversals in
their dendrograms, as shown in the next section.
A. A probabilistic model
We assume a specific example of handling citations between
journals in this section. This example may seem very
specific, but the proposed model can easily be used for a
wide class of real applications. The citation setting is
chosen only for the sake of simplicity.
We hence call objects in $X$ journals. Assume that $n(x, y)$
is the number of citations from $x$ to $y$: journal $x$ cites
$y$ $n(x, y)$ times. Moreover, $n(x)$ is the total number of
citations of $x$, i.e., the number of citations from $x$ to
all journals. We have
$$ n(x) \ge \sum_{y \in X} n(x, y). \tag{10} $$
Note that $n(x) = \sum_{y \in X} n(x, y)$ does not hold in
general, since $X$ does not generally exhaust all journals in
the world.
We can define the estimate of the citation probability from
$x$ to $y$:
$$ p(x, y) = \frac{n(x, y)}{n(x)}, \tag{11} $$
which may be generalized to an inter-cluster similarity:
$$ p(G, G') = \frac{\sum_{x \in G,\, y \in G'} n(x, y)}{\sum_{x \in G} n(x)}. \tag{12} $$
This measure $p(G, G')$ is, however, inconvenient for
clustering, as we discuss in the next section. Hence we define
an asymmetric similarity as follows.
Definition 1. Assume that $n(x, y)$ and $n(x)$ are given as
above, and $G$, $G'$ are two arbitrary clusters of $X$. Then an
average citation probability from $G$ to $G'$ is defined by
$$ r(G, G') = \frac{p(G, G')}{|G'|} = \frac{\sum_{x \in G,\, y \in G'} n(x, y)}{|G'| \sum_{x \in G} n(x)}. \tag{13} $$
We also define
$$ n(G, G') = \sum_{x \in G,\, y \in G'} n(x, y), \tag{14} $$
$$ n(G) = \sum_{x \in G} n(x). \tag{15} $$
We then have
$$ r(G, G') = \frac{n(G, G')}{|G'|\, n(G)}. \tag{16} $$
Note that if $G = \{x\}$ and $G' = \{y\}$, we have
$$ r(G, G') = r(\{x\}, \{y\}) = p(x, y). $$
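Definition 1 can be illustrated with a small Python sketch. The journal names and citation counts below are hypothetical; the totals are chosen larger than the row sums, in line with (10). The function names are illustrative only.

```python
# Hypothetical citation counts n[x][y]: journal x cites journal y.
n = {
    "A": {"A": 10, "B": 30, "C": 5},
    "B": {"A": 20, "B": 15, "C": 10},
    "C": {"A": 2, "B": 8, "C": 12},
}
# Hypothetical totals n(x); each is at least the row sum, as in (10).
total = {"A": 100, "B": 80, "C": 40}


def n_between(G, H):
    """n(G, G') as in (14): citations from journals in G to journals in H."""
    return sum(n[x][y] for x in G for y in H)


def n_total(G):
    """n(G) as in (15): total citations made by journals in G."""
    return sum(total[x] for x in G)


def r(G, H):
    """Average citation probability from G to H, as in (16)."""
    return n_between(G, H) / (len(H) * n_total(G))


print(r(["A"], ["B"]), r(["B"], ["A"]))            # asymmetric in general
print(r(["A", "B"], ["C"]), r(["C"], ["A", "B"]))  # cluster-level values
```

For singletons the value reduces to $n(x, y) / n(x)$, as noted above, and the two directions generally differ.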
Hence this measure is based on the citation probability from
$x$ to $y$. We have the following formula for the updating in
AHC3.
Proposition 1. When $G_p$ and $G_q$ are merged into $G_r$
($G_r = G_p \cup G_q$), the updating formula in AHC3 is:
$$ r(G_r, G'') = r(G_p \cup G_q, G'') = \frac{n(G_p, G'') + n(G_q, G'')}{|G''|\,(n(G_p) + n(G_q))}, \tag{17} $$
$$ r(G'', G_r) = r(G'', G_p \cup G_q) = \frac{n(G'', G_p) + n(G'', G_q)}{(|G_p| + |G_q|)\, n(G'')}. \tag{18} $$
The proof is straightforward and omitted.
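Although the proof is omitted, the updating formulas (17) and (18) can be checked numerically against the direct definition (16). The following Python sketch does this on hypothetical citation counts; all names and numbers are assumptions made for illustration.

```python
# Hypothetical citation counts n[x][y] and totals n(x) for four journals.
n = {
    "A": {"A": 10, "B": 30, "C": 5, "D": 1},
    "B": {"A": 20, "B": 15, "C": 10, "D": 4},
    "C": {"A": 2, "B": 8, "C": 12, "D": 6},
    "D": {"A": 3, "B": 1, "C": 9, "D": 7},
}
total = {"A": 100, "B": 80, "C": 40, "D": 50}


def n_between(G, H):
    """n(G, G') as in (14)."""
    return sum(n[x][y] for x in G for y in H)


def n_total(G):
    """n(G) as in (15)."""
    return sum(total[x] for x in G)


def r_direct(G, H):
    """r(G, G') computed from the definition (16)."""
    return n_between(G, H) / (len(H) * n_total(G))


def r_from_merged(Gp, Gq, Gk):
    """r(Gp U Gq, Gk) via the updating formula (17)."""
    return (n_between(Gp, Gk) + n_between(Gq, Gk)) / (
        len(Gk) * (n_total(Gp) + n_total(Gq))
    )


def r_to_merged(Gk, Gp, Gq):
    """r(Gk, Gp U Gq) via the updating formula (18)."""
    return (n_between(Gk, Gp) + n_between(Gk, Gq)) / (
        (len(Gp) + len(Gq)) * n_total(Gk)
    )


Gp, Gq, Gk = ["A"], ["B"], ["C", "D"]
print(r_from_merged(Gp, Gq, Gk), r_direct(Gp + Gq, Gk))
print(r_to_merged(Gk, Gp, Gq), r_direct(Gk, Gp + Gq))
```

In practice only the quantities $n(G, G')$, $n(G)$, and $|G|$ need to be stored per cluster, since both formulas are sums of these.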
B. Extended updating formula
An extended updating formula was proposed by Yadohisa [12].
Let $G_i$, $G_j$, and $G_k$ be clusters. When $G_i$ and $G_j$
are merged into $G_{ij}$, the updating formula for the
dissimilarity $d_{(ij)k}$ from cluster $G_{ij}$ to another
cluster $G_k$ is:
$$ d_{(ij)k} = \alpha_i^1 f^1(d_{ik}, d_{ki}) + \alpha_j^1 f^1(d_{jk}, d_{kj}) + \beta^1 g^1(d_{ij}, d_{ji}) + \gamma^1 |d_{ik} - d_{jk}|. \tag{19} $$
Similarly, the updating formula for the dissimilarity
$d_{k(ij)}$ is:
$$ d_{k(ij)} = \alpha_i^2 f^2(d_{ki}, d_{ik}) + \alpha_j^2 f^2(d_{kj}, d_{jk}) + \beta^2 g^2(d_{ij}, d_{ji}) + \gamma^2 |d_{ki} - d_{kj}|. \tag{20} $$
Here, $\alpha_i^1$, $\alpha_j^1$, $\alpha_i^2$, $\alpha_j^2$,
$\beta^1$, $\beta^2$, $\gamma^1$, and $\gamma^2$ are constants or
functions. Moreover, we assume $f^1(x, y) = x$,
$f^2(x, y) = y$, and $g^l(x, y) \ge \max\{x, y\}$. This pair of
formulas is called the extended updating formula.
Various linkage methods can be represented by these
formulas [12], [9], [11], but we omit the details.
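The following Python sketch shows one way such an update could be coded. It is a simplified reading and not the formulation of [12]: it assumes that the selector functions pick the dissimilarity in the direction being updated (so $d_{ik}$, $d_{jk}$ for $d_{(ij)k}$ and $d_{ki}$, $d_{kj}$ for $d_{k(ij)}$), and it takes $g = \max$, which satisfies $g(x, y) \ge \max\{x, y\}$. The default parameters are $\alpha = 1/2$, $\beta = \gamma = 0$; the function names and numeric inputs are hypothetical.

```python
def update_from_merged(d_ik, d_jk, d_ij, d_ji,
                       alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=0.0, g=max):
    """d_(ij)k: dissimilarity from the merged cluster G_ij to G_k.

    Assumption: the selector function returns the i->k and j->k values.
    """
    return (alpha_i * d_ik + alpha_j * d_jk
            + beta * g(d_ij, d_ji) + gamma * abs(d_ik - d_jk))


def update_to_merged(d_ki, d_kj, d_ij, d_ji,
                     alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=0.0, g=max):
    """d_k(ij): dissimilarity from G_k to the merged cluster G_ij.

    Assumption: the selector function returns the k->i and k->j values.
    """
    return (alpha_i * d_ki + alpha_j * d_kj
            + beta * g(d_ij, d_ji) + gamma * abs(d_ki - d_kj))


# Hypothetical asymmetric dissimilarities among clusters i, j, k.
d_ik, d_jk, d_ki, d_kj, d_ij, d_ji = 4.0, 6.0, 3.0, 7.0, 1.0, 2.0

# Defaults: plain averaging of the two directed dissimilarities.
print(update_from_merged(d_ik, d_jk, d_ij, d_ji))
print(update_to_merged(d_ki, d_kj, d_ij, d_ji))

# A different, arbitrary parameter choice.
print(update_from_merged(d_ik, d_jk, d_ij, d_ji, beta=0.5, gamma=0.25))
print(update_to_merged(d_ki, d_kj, d_ij, d_ji, beta=0.5, gamma=0.25))
```

Passing the parameters as keyword arguments keeps the two directions independent, mirroring the separate superscripts 1 and 2 in the formulas.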
V. DENDROGRAM WITHOUT A REVERSAL
As noted earlier, the average linkage for symmetric measures
has no reversals in the dendrogram [1], [5], [6]. We will show
that this property also holds for the above two methods.
First, we define
$$ S(K) = \{\, s(G, G') : \forall (G, G') \in \mathcal{G} \times \mathcal{G},\ G \ne G' \,\}, \tag{21} $$
where $K$ is the index in AHC and $\mathcal{G}$ changes as $K$
varies, e.g., $|\mathcal{G}| = K$. Hence $S(K)$ is the set of
all values of similarity for $K$.
We also write $\max S(K)$ for the maximum value of $S(K)$:
it is exactly $m_K$ given by (2). We have the following
lemma.
Lemma 1. If $\max S(K)$ is monotonically non-increasing as
the merging proceeds, that is, as $K$ decreases from $N$ to $1$:
$$ \max S(N) \ge \max S(N-1) \ge \cdots \ge \max S(2) \ge \max S(1), \tag{22} $$
then there is no reversal in the dendrogram.
Proof. The proof is almost trivial, since $\max S(K) = m_K$.
Thus (22) is exactly the same as (5). Q.E.D.
We have the next two propositions regarding the extended
updating formula and the probabilistic model.
Proposition 3. Assume that $r(G, G')$ is used. For
$G, G', G'' \in \mathcal{G}$, we have
$$ r(G \cup G', G'') \le \max\{ r(G, G''), r(G', G'') \}, \tag{23} $$
$$ r(G'', G \cup G') \le \max\{ r(G'', G), r(G'', G') \}. \tag{24} $$
Hence Lemma 1 applies and no reversal occurs.
Proof. The two relations can be proved by easy calculations.
If these inequalities are satisfied, the values in $S(K)$
do not increase, and hence Lemma 1 applies. Q.E.D.
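The statement of Proposition 3 can be observed on a small example. The Python sketch below runs an agglomerative procedure with $r(G, G')$ and records the merging levels. It assumes that, at each step, the ordered pair $(G, G')$ with the largest $r(G, G')$ is merged, which matches $m_K = \max S(K)$ but is otherwise an assumption about the algorithm; the citation counts are hypothetical.

```python
# Hypothetical citation counts n[x][y] and totals n(x) for four journals.
n = {
    "A": {"A": 10, "B": 30, "C": 5, "D": 1},
    "B": {"A": 20, "B": 15, "C": 10, "D": 4},
    "C": {"A": 2, "B": 8, "C": 12, "D": 6},
    "D": {"A": 3, "B": 1, "C": 9, "D": 7},
}
total = {"A": 100, "B": 80, "C": 40, "D": 50}


def r(G, H):
    """Average citation probability from cluster G to cluster H, as in (16)."""
    cited = sum(n[x][y] for x in G for y in H)
    return cited / (len(H) * sum(total[x] for x in G))


def merge_levels():
    """Merge the ordered pair with the largest r until one cluster remains."""
    clusters = [(x,) for x in sorted(total)]
    levels = []
    while len(clusters) > 1:
        level, G, H = max(
            ((r(G, H), G, H) for G in clusters for H in clusters if G != H),
            key=lambda item: item[0],
        )
        levels.append(level)
        clusters = [c for c in clusters if c not in (G, H)] + [G + H]
    return levels


levels = merge_levels()
print(levels)
# No reversal: similarity levels never increase as merging proceeds.
print(all(a >= b for a, b in zip(levels, levels[1:])))
```

Because each merged value is a weighted combination of the values it replaces, the maximum over all ordered pairs cannot grow from one step to the next.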
Proposition 4. Assume that the extended updating formula
is used. If $\alpha_i^l + \alpha_j^l + \beta^l \ge 1$ and
$\gamma^l \ge 0$ ($l = 1, 2$), we have
$$ d_{(ij)k} \ge \min\{ d_{ik}, d_{jk} \}, \tag{25} $$
$$ d_{k(ij)} \ge \min\{ d_{ki}, d_{kj} \}. \tag{26} $$
Hence Lemma 1 applies and no reversal occurs.
Proof. (i) Let us first prove (25).
When $d_{ij} \le d_{ji}$, we have
$$ d_{ik}, d_{jk} \ge d_{ij}. $$
Then we have:
$$ \begin{aligned}
d_{(ij)k} - \min\{ d_{ij}, d_{ji} \}
&= \alpha_i^1 d_{ik} + \alpha_j^1 d_{jk} + \beta^1 g^1(d_{ij}, d_{ji}) + \gamma^1 |d_{ik} - d_{jk}| - d_{ij} \\
&\ge (\alpha_i^1 + \alpha_j^1 + \beta^1 - 1)\, d_{ij} + \gamma^1 |d_{ik} - d_{jk}| \\
&\ge 0,
\end{aligned} \tag{27} $$
where we used the assumption:
$$ \alpha_i^1 + \alpha_j^1 + \beta^1 \ge 1, \quad \gamma^1 \ge 0. $$
The case of $d_{ij} \ge d_{ji}$ is handled in a similar way; we
omit the detail.
(ii) The proof of (26) is similar to (i); we omit the detail.
Finally, the reason why Lemma 1 can be applied is the
same as that for Proposition 3. Q.E.D.
VI. ASYMMETRIC DENDROGRAM
Asymmetric dendrograms have been proposed for agglomerative
hierarchical clustering using asymmetric measures [8], [12].
To plot an asymmetric dendrogram, we use two methods here:
the first was proposed by Okada and Iwamoto [8], while the
second is new.
A. Representation of asymmetry using ratio
The first method is based on the ratio of similarity
measures [8]. We use an asymmetric dendrogram that shows
the ratio of $s(G, G')$ and $s(G', G)$, as in Fig. 2.
Figure 2. Example of asymmetric dendrogram using the ratio.
B. Representation of asymmetry using hypothesis testing
The second method uses hypothesis testing. Here, we use the
chi-square test of a hypothesis $H_0$. When the observed
values and the totals are as in Table I, the statistic
$\chi^2$ is computed as in [2], where we put
$$ O_{11} = n(G_1, G_2), \quad O_{12} = n(G_1) - n(G_1, G_2), $$
$$ O_{21} = n(G_2, G_1), \quad O_{22} = n(G_2) - n(G_2, G_1), $$
$$ n_1 = n(G_1), \quad n_2 = n(G_2). $$
Then we have an asymmetric dendrogram such as those in
Fig. 3, according to the rejection of $H_0$ at two different
significance levels from $\chi^2$ with one degree of freedom [2].
Table I
THE 2 x 2 CONTINGENCY TABLE.
          class 1    class 2    total
class 1   $O_{11}$   $O_{12}$   $n_1$
class 2   $O_{21}$   $O_{22}$   $n_2$
Figure 3. Example of asymmetric dendrogram using the hypothesis testing:
the left figure means that the hypothesis $H_0$ is rejected at the 1% significance
level, the center means that $H_0$ is rejected at the 5% significance level, and
the right means that $H_0$ is not rejected.
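The test behind Fig. 3 can be sketched in Python as follows. Because the formula for $\chi^2$ is not reproduced above, this sketch assumes the standard Pearson chi-square statistic for a 2 x 2 table, $N (O_{11} O_{22} - O_{12} O_{21})^2 / (n_1 n_2 C_1 C_2)$ with column totals $C_1$, $C_2$ and $N = n_1 + n_2$, and compares it with the usual critical values of the chi-square distribution with one degree of freedom (3.841 at 5%, 6.635 at 1%). The function names and input counts are hypothetical.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic of a 2 x 2 table (assumed formula).

    Requires nonzero row and column totals.
    """
    n1, n2 = o11 + o12, o21 + o22      # row totals
    c1, c2 = o11 + o21, o12 + o22      # column totals
    grand = n1 + n2
    return grand * (o11 * o22 - o12 * o21) ** 2 / (n1 * n2 * c1 * c2)


def asymmetry_test(n_g1_g2, n_g1, n_g2_g1, n_g2):
    """Fill Table I from n(G1, G2), n(G1), n(G2, G1), n(G2) and test H0."""
    stat = chi_square_2x2(
        n_g1_g2, n_g1 - n_g1_g2,
        n_g2_g1, n_g2 - n_g2_g1,
    )
    # Critical values of chi-square with one degree of freedom.
    if stat > 6.635:
        return stat, "rejected at 1%"
    if stat > 3.841:
        return stat, "rejected at 5%"
    return stat, "not rejected"


# Hypothetical counts: strongly and weakly asymmetric pairs of clusters.
print(asymmetry_test(10, 100, 30, 100))
print(asymmetry_test(20, 100, 25, 100))
```

The three possible return labels correspond to the three drawing styles described in the caption of Fig. 3.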
VII. NUMERICAL EXAMPLES
Two data sets were used. They are as follows:
1) The first data set is the numbers of citations among eight
journals on statistics. The original data, given in [10],
are omitted here. We call this data set the citation data.
2) The second data set is the numbers of travelers among
twenty countries in 2001. The original data, given in [14],
are omitted here. We call this data set the traveler data.
We used the probability model and the extended updating
formula. In the latter method, the following parameter values
were used:
$$ \alpha_i^1 = \alpha_j^1 = \alpha_i^2 = \alpha_j^2 = \frac{1}{2}, \quad \beta^1 = \beta^2 = 0, \quad \gamma^1 = \gamma^2 = 0. $$
Note that these parameters satisfy the condition for no
reversals in Section V.
A. Citation data
Figures 4 and 5 were obtained from the probability model,
and Figure 6 from the extended updating formula.
Figures 4 and 5 show the two different dendrograms for the
same clusters. In these figures, a cluster of the three
journals 'JASA', 'AnnSt', and 'ComSt' is formed. They
frequently cite one another. An example of asymmetry
of citation is found between 'AnnSt' and 'ComSt'; the
citation from 'ComSt' to 'AnnSt' is stronger than in the
reverse direction in both the probability model and the
updating formula. When the dendrogram with hypothesis testing
was used, the hypotheses at all branches were rejected at the
1% significance level.
B. Traveler data
Figures 7 and 8 were obtained from the probability model
and the extended updating formula, respectively.
We observed similar clusters in both dendrograms.
First, countries in the same area are merged, and then
clusters reflecting geographical closeness are formed.
Moreover, the probability model seems to provide a
better-balanced dendrogram than the extended updating formula.
[Dendrogram image; leaf labels in order: ComSt, JASA, AnnSt, JRSSB, Bioka, Biocs, JRSSC, Tech.]
Figure 4. Dendrogram of journal citation data using the probability model
with the representation of asymmetry using the ratio.
[Dendrogram image; leaf labels in order: ComSt, JASA, AnnSt, JRSSB, Bioka, Biocs, JRSSC, Tech.]
Figure 5. Dendrogram of journal citation data using the probability model
with the representation of asymmetry using the hypothesis testing.
VIII. CONCLUSION
We discussed what we call bottom-up methods and top-down
methods in agglomerative hierarchical clustering using
asymmetric similarity measures. The study of top-down methods
will lead to the development of new theory and algorithms
in agglomerative hierarchical clustering.
The theory of reversals in dendrograms has also been
developed, and it has been proved that the two methods
in this paper have no reversals in the dendrogram.
This theory can be applied to other existing and new linkage
methods.
We have shown applications to small-scale data sets. In the
near future, we will study how to handle large-scale data by
agglomerative hierarchical methods using both symmetric
and asymmetric measures of similarity.
[Dendrogram image; leaf labels in order: Tech, JRSSB, Bioka, JASA, AnnSt, ComSt, Biocs, JRSSC; axis from 1200 to 0.]
Figure 6. Dendrogram of journal citation data using the extended updating
formula with the representation of asymmetry using the ratio.
[Dendrogram image; leaf labels in order: China, Hong Kong, Taiwan, Malaysia, Singapore, Indonesia, Thailand, Australia, New Zealand, India, France, Italy, Switzerland, United Kingdom, South Africa, Turkey, United States, Canada, Japan, Korea.]
Figure 7. Dendrogram of traveler data using the probability model with
the representation of asymmetry using the ratio.
ACKNOWLEDGMENT
This work has partly been supported by the Grant-in-Aid
for Scientific Research, Japan Society for the Promotion of
Science, No. 23500269.
[Dendrogram image; leaf labels in order: South Africa, Turkey, France, United Kingdom, Italy, Switzerland, Canada, United States, Australia, New Zealand, Malaysia, Singapore, Indonesia, Thailand, China, Hong Kong, Taiwan, Korea, Japan, India.]
Figure 8. Dendrogram of traveler data using the extended updating formula
with the representation of asymmetry using the ratio.
REFERENCES
[1] M.R. Anderberg, Cluster Analysis for Applications, Academic
Press, New York, 1960.
[2] W.J. Conover, Practical Nonparametric Statistics, Wiley, New
York, 1971.
[3] B.S. Everitt, Cluster Analysis, 3rd Edition, Arnold, London,
1993.
[4] L. Hubert, Min and max hierarchical clustering using asymmetric
similarity measures, Psychometrika, Vol.38, No.1,
pp.63-72, 1973.
[5] S. Miyamoto, Fuzzy Sets in Information Retrieval and Cluster
Analysis, Kluwer, Dordrecht, 1990.
[6] S. Miyamoto, Introduction to Cluster Analysis, Morikita
Shuppan, Tokyo, 1999 (in Japanese).
[7] S. Miyamoto, Fuzzy multisets and their generalizations, in
C.S. Calude et al., eds., Multiset Processing, Lecture Notes
in Computer Science, LNCS 2235, Springer, Berlin,
pp.225-235, 2001.
[8] A. Okada, T. Iwamoto, A Comparison before and after the
Joint First Stage Achievement Test by Asymmetric Cluster
Analysis, Behaviormetrika, Vol.23, No.2, pp.169-185, 1996.
[9] T. Saito, H. Yadohisa, Data Analysis of Asymmetric
Structures, Marcel Dekker, New York, 2005.
[10] S.M. Stigler, Citation Patterns in the Journals of Statistics and
Probability, Statistical Science, Vol.9, pp.94-108, 1994.
[11] A. Takeuchi, T. Saito, H. Yadohisa, Asymmetric agglomerative
hierarchical clustering algorithms and their evaluations,
Journal of Classification, Vol.24, pp.123-143, 2007.
[12] H. Yadohisa, Formulation of Asymmetric Agglomerative
Clustering and Graphical Representation of Its Result, J. of
Japanese Society of Computational Statistics, Vol.15, No.2,
pp.309-316, 2002 (in Japanese).
[13] S. Miyamoto, K. Nakayama, Similarity Measures Based on
Fuzzy Set Model and Application to Hierarchical Clustering,
IEEE Trans. SMC., pp.479-482, 1986.
[14] http://www.unwto-osaka.org/index.html