Multi-Marginal Optimal Transport Defines a Generalized Metric
* Boston College [email protected]
† Arizona State University [email protected]

where the inf is over measures p^{1,2} on Ω^1 × Ω^2. Problem (1.1) is typically studied under the assumption that Ω^1 and Ω^2 are in a Polish space on which d is a metric, in which case the minimum of (1.1) is the Wasserstein distance (WD). The WD is popular in many applications including shape interpolation [2], generative modeling [3, 4], domain adaptation [5], and dictionary learning [6].

The WD is a metric on the space of probability measures [7], and this property is useful in many ML tasks, e.g., clustering [8, 9], nearest-neighbor search [10, 11, 12], and outlier detection [13]. Indeed, some of these tasks are tractable, or allow theoretical guarantees, when done on a metric space. E.g., finding the nearest neighbor [10, 11, 12] or the diameter [14] of a dataset requires a polylogarithmic computational effort under metric assumptions; approximation algorithms for clustering rely on metric assumptions, whose absence worsens known bounds [15]; also, [16] uses the metric properties.

Multi-Marginal Optimal Transport (MMOT) seeks

(1.2)    inf_p ( ∫_{Ω^1 × ... × Ω^n} d^ℓ dp )^{1/ℓ}    s.t.    ∫_{Ω^{−i}} dp = p^i  ∀i,

where the infimum is taken over measures p on Ω^1 × ... × Ω^n. The term MMOT was coined in [17], and was surveyed by the same authors in [18]. Applications of MMOT include image translation, image registration, multi-agent matching with fairness requirements, and labeling for classification [19, 20].

Unfortunately, there is a lack of discussion about the (generalized) metric properties of MMOT. Much of the discussion on MMOT has focused on the existence of a minimizer, the uniqueness and structure of both Monge and Kantorovich solutions, applications, practical algorithms, and the choice of the cost function [21, 22, 23, 24].

Since the metric property of the WD is useful in so many applications, understanding when the (potential) minimum of (1.2), W(p^1, ..., p^n), a multi-way distance, has metric-like properties is critical, theoretically and practically. For example, just as metric properties can improve distance-based clustering, so too can generalized metrics improve clustering based on multi-way distances. In Figure 1, we preview such an improvement on clustering chemical compounds, which is discussed further in Sec. 5. Importantly, several algorithms in [8]–[15], and more, which use distances (including WD) as input, have guarantees if the distances are metrics. They extend to feeding off multi-distances, and hence can use MMOT, and have guarantees under generalized metrics similar to those under classic metrics. We now exemplify these extensions, and their potential applications.

Example 1: Given a set S with n distributions, we can find its 3-diameter Δ ≜ max_{p^1, p^2, p^3 ∈ S, distinct} W(p^1, p^2, p^3) with (n choose 3) evaluations of W. What if W satisfies the generalized triangle inequality W(p^1, p^2, p^3) ≤ W(p^4, p^2, p^3) + W(p^1, p^4, p^3) + W(p^1, p^2, p^4)? We now know that for at least n/3 distribution triplets W ≥ Δ/3. Indeed, if Δ = W(p^{*1}, p^{*2}, p^{*3}), then for all p^4 ∈ S, we cannot simultaneously have W(p^4, p^{*2}, p^{*3}), W(p^{*1}, p^{*2}, p^4), W(p^{*1}, p^4, p^{*3}) < Δ/3. Therefore, if we evaluate W on random distribution triplets, we are guaranteed to find a (1/3)-approximation of Δ with only O(n^2) evaluations of W on average, an improvement over n^3. Diameter estimation relates to outlier detection [13], which is critical e.g. in cybersecurity [25].

Example 2: Let S be as above. We can find A ≜ (n choose 3)^{−1} Σ_{p^1, p^2, p^3 ∈ S, distinct} W(p^1, p^2, p^3) with (n choose 3) evaluations of W. We can estimate A by averaging W over a set with O(n^2) distinct triplets randomly sampled from S, improving over n^3. If W is a generalized metric, an argument as in Example 1 shows that with high probability we do not miss triplets with large W, which is the critical step to prove that we approximate A well. Average estimation is critical e.g. in differential privacy [26].

Example 3: Let S be as above. Consider building a hypergraph with nodes S and hyperedges defined as follows. For each distinct triplet (p^1, p^2, p^3) for which W(p^1, p^2, p^3) < thr, a constant threshold, include it as a hyperedge. Hypergraphs are increasingly important in modern ML, especially for clustering using multiwise relationships among objects [27]–[28]. Let W satisfy the triangle inequality in Example 1 and be invariant under argument permutations.

Among our contributions, we prove that pairwise MMOT is a generalized metric (Sec. 3.3). Finally, we show that the triangle inequality that MMOT satisfies cannot be improved, up to a linear factor.

2 Definitions and setup

2.1 Lists  Expressions that depend on a list of symbols indexed consecutively are abbreviated using ":". In particular, we write s_1, ..., s_k as s_{1:k}, Ω^1, ..., Ω^k as Ω^{1:k}, and A_{s_1,...,s_k} as A_{s_{1:k}}. Note that A_{s_1:s_k} differs from A_{s_{1:k}}. Assuming s_k > s_1, we have A_{s_1:s_k} ≡ A_{s_1}, A_{s_1+1}, A_{s_1+2}, ..., A_{s_k}. By itself, 1:i has no meaning, and it does not mean 1, ..., i. For i ∈ N, we let [i] ≜ {1, ..., i}. The symbol ⊕ denotes a list join operation with no duplicate removal, e.g. {x, y} ⊕ {x, z} = {x, y, x, z}.

2.2 Bra-ket operator  Given two equi-multidimensional arrays A and B, and ℓ ∈ N, we define ⟨A, B⟩_ℓ ≜ Σ_{s_{1:k}} (A_{s_{1:k}})^ℓ B_{s_{1:k}}, where (·)^ℓ is the ℓth power.

2.3 Probability spaces  To facilitate exposition, we state our main contributions for probability spaces with a finite sample space in Ω, an event σ-algebra which is the power set of the sample space, and a probability measure described by a probability mass function. We refer to probability mass functions using bold letters, e.g. p, q, r, etc.

When talking about n probability spaces, the ith space has sample space Ω^i = {Ω^i_{1:m_i}} ⊆ Ω, an event space 2^{Ω^i}, and a probability mass function p^i, or q^i, or r^i, etc. Variable m_i is the number of atoms in Ω^i. Symbol p^i_s denotes the probability of the atomic event {Ω^i_s}. Without loss of generality (w.l.o.g.) we assume p^i_s > 0, ∀i ∈ [n], ∀s ∈ [m_i]. Our notation assumes that atoms can be indexed, but our results extend beyond this assumption. W.l.o.g., we assume that Ω^i_s = Ω^i_t if and only if s = t.

Symbol p^{i_{1:k}} denotes a mass function for the probability space with sample space Ω^{i_1} × ... × Ω^{i_k} and event space 2^{Ω^{i_1} × ... × Ω^{i_k}}. In particular, p^{i_{1:k}}_{s_{1:k}} (i.e. p^{i_1,...,i_k}_{s_1,...,s_k}) is the probability of the atomic event {(Ω^{i_1}_{s_1}, ..., Ω^{i_k}_{s_k})}. We use p^{i_{1:k}|j_{1:r}} to denote a probability mass function for the probability space with sample space Ω^{i_1} × ... × Ω^{i_k} and event space 2^{Ω^{i_1} × ... × Ω^{i_k}}, such that p^{i_{1:k}|j_{1:r}}_{s_{1:k}|t_{1:r}} ≜ p^{i_1,...,i_k,j_1,...,j_r}_{s_1,...,s_k,t_1,...,t_r} / p^{j_1,...,j_r}_{t_1,...,t_r}, i.e. a conditional probability.
W is an (n, C(n))-metric (Def. 4) if:

1. W^{1,...,n} ≥ 0,

2. W^{1,...,n} = 0 iff p^i = p^j, Ω^i = Ω^j, ∀i, j,

3. W^{1,...,n} = W^{σ(1,...,n)}, for any permutation σ,

4. C(n) W^{1,...,n} ≤ Σ_{r=1}^{n} W^{1,...,r−1,r+1,...,n+1}.

REMARK 1. The equalities p^i = p^j and Ω^i = Ω^j mean that m_i = m_j, and that there exists a bijection b_{i,j}(·) from [m_i] to [m_j] such that p^i_s = p^j_{b_{i,j}(s)} and Ω^i_s = Ω^j_{b_{i,j}(s)}, ∀ s ∈ [m_i].

REMARK 2. We abbreviate (n, 1)-metric by n-metric.

REMARK 3. Our notions of metric and generalized metric are more general than usual in the sense that they support the use of different functions depending on the spaces from which we are drawing elements. This grants an extra layer of generality to our results.

In our setup, the inf in (1.2) is always attained (recall, finite spaces) and amounts to solving an LP. We refer to the minimizing distributions by p^*, q^*, r^*, etc. We define the following map from n probability spaces to R. The definition below amounts to (1.2) when the p's are empirical measures.

LEMMA 3.1. (Gluing lemma) Let p^{1,3} and p^{2,3} be arbitrary mass functions for Ω^1 × Ω^3 and Ω^2 × Ω^3, respectively, with the same marginal, p^3, over Ω^3. There exists a mass function r^{1,2,3} for Ω^1 × Ω^2 × Ω^3 whose marginals over Ω^1 × Ω^3 and Ω^2 × Ω^3 equal p^{1,3} and p^{2,3}, respectively.

The way Lemma 3.1 is used to prove WD's triangle inequality is as follows. Assume d is a metric (Def. 2). Let ℓ = 1 for simplicity. Let p^{*1,2}, p^{*1,3}, and p^{*2,3} be optimal transports such that W^{1,2} = ⟨p^{*1,2}, d^{1,2}⟩, W^{1,3} = ⟨p^{*1,3}, d^{1,3}⟩, and W^{2,3} = ⟨p^{*2,3}, d^{2,3}⟩. Define r^{1,2,3} as in Lemma 3.1, and let r^{1,3}, r^{2,3}, and r^{1,2} be its bivariate marginals. We then have

(3.5)  W^{1,2} = ⟨p^{*1,2}, d^{1,2}⟩ ≤ ⟨r^{1,2}, d^{1,2}⟩ = Σ_{s,t} r^{1,2}_{s,t} d^{1,2}_{s,t}    (r suboptimal)

(3.6)  = Σ_{s,t,l} r^{1,2,3}_{s,t,l} d^{1,2}_{s,t} ≤ Σ_{s,t,l} r^{1,2,3}_{s,t,l} (d^{1,3}_{s,l} + d^{2,3}_{t,l})    (d is a metric)

(3.7)  = ⟨r^{1,3}, d^{1,3}⟩ + ⟨r^{2,3}, d^{2,3}⟩

(3.8)  = ⟨p^{*1,3}, d^{1,3}⟩ + ⟨p^{*2,3}, d^{2,3}⟩ = W^{1,3} + W^{2,3}.    (Lemma 3.1)

Our first roadblock is that Lemma 3.1 does not generalize to higher dimensions. For simplicity, we now omit the sample spaces on which mass functions are defined. When a set of mass functions have all their marginals over the same sample sub-spaces equal, we will say they are compatible.
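In the finite setting, one explicit construction that realizes Lemma 3.1 is r^{1,2,3}_{s,t,l} = p^{1,3}_{s,l} p^{2,3}_{t,l} / p^3_l, i.e. making Ω^1 and Ω^2 conditionally independent given Ω^3. The check below is a sketch of that standard construction with made-up couplings; it is not taken from our appendix.

```python
import numpy as np

def glue(p13, p23):
    """Glue two couplings sharing the marginal p3 over the third space:
    r[s, t, l] = p13[s, l] * p23[t, l] / p3[l]  (cf. Lemma 3.1)."""
    p3 = p13.sum(axis=0)                        # shared marginal over Omega^3
    assert np.allclose(p3, p23.sum(axis=0)), "couplings must be compatible"
    # Broadcasting divides each slice l by p3[l]; p3 > 0 by our w.l.o.g. setup.
    return np.einsum('sl,tl->stl', p13, p23) / p3

# Hypothetical 2x2 couplings with the common marginal p3 = (0.4, 0.6).
p13 = np.array([[0.3, 0.2], [0.1, 0.4]])
p23 = np.array([[0.2, 0.3], [0.2, 0.3]])
r = glue(p13, p23)
assert np.allclose(r.sum(axis=1), p13)   # marginal over Omega^1 x Omega^3
assert np.allclose(r.sum(axis=0), p23)   # marginal over Omega^2 x Omega^3
```

The two final assertions are exactly the conclusion of Lemma 3.1; the No Gluing theorem below shows that no analogous construction can match three bivariate marginals at once.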
THEOREM 3.1. (No Gluing) There exist mass functions p^{1,2,4}, p^{1,3,4}, and p^{2,3,4} with compatible marginals such that there is no mass function r^{1,2,3,4} compatible with them.

Proof. If this were not the case, then it would be true that, given arbitrary mass functions p^{1,2}, p^{1,3}, and p^{2,3} with compatible univariate marginals, we should be able to find r^{1,2,3} whose bivariate marginals equal these three mass functions. But this is not the case. For example, let p^{1,2} = p^{1,3} = [1, 0, 1; 0, 1, 0; 0, 0, 0]/3 and p^{2,3} = [1, 1, 1; 1, 1, 1; 1, 1, 1]/9 (we are using matrix notation for the marginals). These marginals have compatible univariate marginals, namely, p^1 = [2, 1, 0]/3 and p^2 = p^3 = [1, 1, 1]/3. Yet, the following system of equations over {r^{1,2,3}_{i,j,k}}_{i,j,k ∈ [3]} is easily checked to be infeasible: (Σ_i r^{1,2,3}_{i,j,k} = p^{2,3}_{j,k} ∀j, k) ∧ (Σ_j r^{1,2,3}_{i,j,k} = p^{1,3}_{i,k} ∀i, k) ∧ (Σ_k r^{1,2,3}_{i,j,k} = p^{1,2}_{i,j} ∀i, j).

3.2 Cost d being an n-metric is not a sufficient condition for MMOT to be an n-metric  Theorem 3.1 tells us that, even if we assume that d is an n-metric, we cannot adapt the classical proof showing WD is a metric into a proof that MMOT is an n-metric. The question remains, however, whether there exists such a proof at all only under the assumption that d is an n-metric. Theorem 3.2 settles this question in the negative.

THEOREM 3.2. Let W be as in Def. 5 with ℓ = 1. There exist Ω, mass functions p^1, p^2, p^3, and p^4 over Ω, and d : Ω × Ω × Ω → R such that d is an n-metric (n = 3), but

(3.9)  W^{1,2,3} > W^{1,2,4} + W^{1,3,4} + W^{2,3,4}.

REMARK 4. The theorem can be generalized to spaces of dim. > 2, and to n > 3 and ℓ > 1.

Proof. Let Ω be the six points in Figure 2-(left), where we assume that 0 < ε ≪ 1, and hence that there are no three co-linear points and no two equal points. Let p^1, p^2, p^3, and p^4 be as in Figure 2-(left); each is represented by a unique color and is uniformly distributed over the points of the same color. Given any x, y, z ∈ Ω, let d(x, y, z) = γ if exactly two points are equal, and let d(x, y, z) be the area of the corresponding triangle otherwise, where γ lower bounds the area of the triangle formed by any three non-co-linear points, e.g. γ = ε/4. A few geometric considerations (see Appendix A) show that d is an n-metric (n = 3, C(n) = 1) and that (3.9) holds as 1/2 > 1/8 + 1/8 + ε/4 + 1/8 + ε/4.

Figure 2: (Left) Sample space Ω, mass functions {p^i}_{i=1}^4 (with p^3 = p^4 = (1/2, 1/2)), and cost function d(x, y, z) = area of triangle(x, y, z), 0 < ε ≪ 1, that lead to violation (3.9). (Right) Geometric analog of the generalized triangle inequality: the total area of any three faces in a tetrahedron is greater than that of the fourth face.

3.3 Pairwise MMOT is a generalized metric  We will prove that the properties in Def. 4 hold for the following variant of Def. 5.

DEFINITION 6. (Pairwise MMOT distance) Let {d^{i,j}}_{i,j} be a set of distances of the form d^{i,j} : Ω^i × Ω^j → R with d^{i,j}(Ω^i_s, Ω^j_t) ≜ d^{i,j}_{s,t}. The Pairwise MMOT distance associated with {d^{i,j}}_{i,j} for n probability spaces with masses p^{i_{1:n}} over Ω^{i_{1:n}} is W(p^{i_{1:n}}) ≜ W^{i_{1:n}} with

(3.10)  W^{i_{1:n}} = min_{r : r^{i_s} = p^{i_s} ∀s ∈ [n]}  Σ_{1 ≤ s < t ≤ n} ⟨d^{i_s,i_t}, r^{i_s,i_t}⟩_ℓ^{1/ℓ},

where r is a mass over Ω^{i_1} × ... × Ω^{i_n}, with marginals r^{i_s} and r^{i_s,i_t} over Ω^{i_s} and Ω^{i_s} × Ω^{i_t}, respectively.

REMARK 5. Swapping min and Σ gives W^{i_{1:n}} = Σ_{1 ≤ s < t ≤ n} W^{i_s,i_t}, where W^{i_s,i_t} is the WD between Ω^{i_s} and Ω^{i_t}. This is trivially an n-metric (cf. [29]) but is different from eq. (3.10). In particular, it does not provide a joint optimal transport, which is important to many applications.

If n = 2, Def. 6 reduces to the WD. Our definition is a special case of the Kantorovich formulation for the general MMOT problem discussed in [18]. We can get Def. 6 from Def. 5 by defining d^{i_{1:n}} : Ω^{i_1} × ... × Ω^{i_n} → R such that d^{i_{1:n}}(w_{1:n}) = (Σ_{1 ≤ s < t ≤ n} (d^{i_s,i_t}(w_s, w_t))^ℓ)^{1/ℓ}, for some set of distances {d^{i,j}}_{i,j}. It is easy to prove that if the {d^{i,j}}_{i,j} are metrics (Def. 2), then d is an n-metric (Def. 3). However, because of Theorem 3.2, we know that this is not sufficient to guarantee that the pairwise MMOT distance is an n-metric, which only makes the proof of the next theorem all the more interesting.

THEOREM 3.3. If d is a metric (Def. 2), then the pairwise MMOT distance (Def. 6) associated with d is an (n, C(n))-metric, with C(n) ≥ 1.

We currently do not know the most general conditions under which Def. 3 is an n-metric. However, working with Def. 6 allows us to sharply bound the best possible C(n), which would unlikely be possible in a general setting. As Theorem 3.4 shows, the best C(n) is C(n) = Θ(n).

THEOREM 3.4. In Theorem 3.3, the constant C(n) can be made larger than (n − 1)/5 for n > 7, and there exist sample spaces Ω^{1:n}, mass functions p^{1:n}, and a metric d over Ω^{1:n} such that C(n) ≤ n − 1.
REMARK 6. Note that if Ω^i = Ω, ∀i, and d : Ω × ... × Ω → R is such that d(w_{1:n}) = min_{w ∈ Ω} Σ_{s ∈ [n]} d^{1,2}(w_s, w), where d^{1,2} is a metric, then d is an n-metric [29]. One can then prove [30] that Def. 2.4 is equivalent to W(p^{1:n}) = min_p Σ_{s ∈ [n]} W(p^s, p), which is also called the Wasserstein barycenter distance (WBD) [31]. The latter definition makes W(p^{1:n}) a Fermat distance, from which it follows immediately via general results in [29] that it is an n-metric with C(n) = Θ(n). The pairwise MMOT is not a Fermat distance, and Thms. 3.3 and 3.4 do not follow from [29]. A novel proof strategy is required.

4 Main proof ideas

Our main technical contribution is our proof that the generalized triangle inequality – property 4 in Def. 4 – holds with C(n) ≥ (n − 1)/5, n > 7, if d is a metric (Def. 2), i.e. the first part of Theorem 3.4. We give this proof in this section. The other proofs are included in the Appendix. A full proof of Theorem 3.3 is in Appendix D, and the proof of the second part of Theorem 3.4 is in Appendix E.

Before we proceed, we give a short proof that the generalized triangle inequality holds with C(n) = 1 for n = 3 when d is a metric. This avoids some key ideas getting obscured by the heavy index notation that is unavoidable when dealing with a general n and a tighter C(n).

4.1 Proof of the generalized triangle inequality for n = 3, ℓ = 1, and C(n) = 1  We will prove that for any mass functions p^1, ..., p^4 over Ω^1, ..., Ω^4, respectively, if d^{i,j} : Ω^i × Ω^j → R is a metric for any i, j ∈ {1, ..., 4}, i ≠ j, then

(4.11)  W^{1,2,3} ≤ W^{1,2,4} + W^{1,3,4} + W^{2,3,4},

which we write more succinctly as W^{1,2,3} ≤ W^{\3} + W^{\2} + W^{\1}, using a new symbol W^{\r} whose meaning is obvious. We begin by expanding all of the terms in (4.11), namely,

W^{1,2,3} = ⟨d^{1,2}, p^{*1,2}⟩ + ⟨d^{1,3}, p^{*1,3}⟩ + ⟨d^{2,3}, p^{*2,3}⟩,

W^{\3} + W^{\2} + W^{\1} = ⟨d^{1,2}, p^{*(3)1,2}⟩ + ⟨d^{1,4}, p^{*(3)1,4}⟩ + ⟨d^{2,4}, p^{*(3)2,4}⟩
+ ⟨d^{1,3}, p^{*(2)1,3}⟩ + ⟨d^{1,4}, p^{*(2)1,4}⟩ + ⟨d^{3,4}, p^{*(2)3,4}⟩
+ ⟨d^{2,3}, p^{*(1)2,3}⟩ + ⟨d^{2,4}, p^{*(1)2,4}⟩ + ⟨d^{3,4}, p^{*(1)3,4}⟩,

where {p^{*i,j}} are the bivariate marginals of the optimal joint distribution p^{*1,2,3} for W^{1,2,3}, and {p^{*(r)i,j}} are the bivariate marginals of the optimal joint distribution for W^{\r}. Now we define the following probability mass function on Ω^1 × ... × Ω^4, namely, p^{1,2,3,4} such that

p^{1,2,3,4}_{s,t,l,u} = (p^{*(3)1,4}_{s,u} / p^{*4}_u)(p^{*(2)3,4}_{l,u} / p^{*4}_u)(p^{*(1)2,4}_{t,u} / p^{*4}_u) p^{*4}_u,

and recall that w.l.o.g. we assume that no element in Ω^i has zero mass, so the denominators are not zero. We have that

(4.12)  W^{1,2,3} ≤ ⟨d^{1,2}, p^{1,2}⟩ + ⟨d^{1,3}, p^{1,3}⟩ + ⟨d^{2,3}, p^{2,3}⟩,

since the bivariate marginals p^{1,2}, p^{1,3}, p^{2,3} of p^{1,2,3,4} are a feasible but suboptimal choice of minimizer in (3.10) in Def. 6.

It is convenient to introduce the following more compact notation: w_{i,j} ≜ ⟨d^{i,j}, p^{i,j}⟩ and w^*_{i,j,r} ≜ ⟨d^{i,j}, p^{*(r)i,j}⟩. Notice that, for any i, j, k and r, we have w_{i,j} ≤ w_{i,k} + w_{j,k} and w^*_{i,j,r} ≤ w^*_{i,k,r} + w^*_{j,k,r}. This follows directly from the assumption that the {d^{i,j}} are metrics. Without loss of generality, let us prove that w_{1,2} ≤ w_{1,3} + w_{2,3}:

(4.13)  w_{1,2} = Σ_{s,t} d^{1,2}_{s,t} p^{1,2}_{s,t} = Σ_{s,t,l} d^{1,2}_{s,t} p^{1,2,3}_{s,t,l}

(4.14)  ≤ Σ_{s,t,l} (d^{1,3}_{s,l} + d^{2,3}_{t,l}) p^{1,2,3}_{s,t,l} = w_{1,3} + w_{2,3}.

It is also easy to see that w_{i,j} = w_{j,i} and that w^*_{i,j,r} = w^*_{j,i,r}. Using this notation and (4.12) we can write W^{1,2,3} ≤ w_{1,2} + w_{1,3} + w_{2,3} ≤ (w_{1,4} + w_{2,4}) + (w_{1,4} + w_{3,4}) + (w_{2,4} + w_{3,4}), and, noticing that the bivariate marginals of p^{1,2,3,4} satisfy p^{1,4} = p^{*(3)1,4}, p^{3,4} = p^{*(2)3,4}, and p^{2,4} = p^{*(1)2,4},

(4.15)  W^{1,2,3} ≤ (w^*_{1,4,3} + w^*_{2,4,1}) + (w^*_{1,4,3} + w^*_{3,4,2}) + (w^*_{2,4,1} + w^*_{3,4,2}).

At the same time, also using this new notation, we can re-write the r.h.s. of (4.11) as

(4.16)  W^{\3} + W^{\2} + W^{\1} = (w^*_{1,2,3} + w^*_{1,4,3} + w^*_{2,4,3}) + (w^*_{1,3,2} + w^*_{1,4,2} + w^*_{3,4,2}) + (w^*_{2,3,1} + w^*_{2,4,1} + w^*_{3,4,1}).

To finish the proof we show that the r.h.s. of (4.15) can be upper bounded by the r.h.s. of (4.16). We use the triangle inequality of w^*_{i,j,k} and apply it to the 1st, 4th, and 5th terms on the r.h.s. of (4.15) as specified by the parentheses:

(w^*_{1,4,3} + w^*_{2,4,1}) + (w^*_{1,4,3} + w^*_{3,4,2}) + (w^*_{2,4,1} + w^*_{3,4,2}) ≤ ((w^*_{1,2,3} + w^*_{2,4,3}) + w^*_{2,4,1}) + (w^*_{1,4,3} + (w^*_{1,3,2} + w^*_{1,4,2})) + ((w^*_{2,3,1} + w^*_{3,4,1}) + w^*_{3,4,2}),

and observe that the terms in the r.h.s. of this last inequality are accounted for on the r.h.s. of (4.16). This ends the proof.

We note that this last step, figuring out to which terms to apply the triangle inequality property of w^* such that we can "cover" the r.h.s. of (4.15) with the r.h.s. of (4.16), is critical and hard to generalize in a proof for an arbitrary n. Not only that, but the fact that we want to prove that the MMOT triangle inequality holds for C(n) = Θ(n) makes this last step even harder. We define a general procedure for expanding (using the triangle inequality) and matching terms in our general proof using a special hash function, described next. It will play a critical role in our general proof.
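The composite mass p^{1,2,3,4} defined above is a product of conditionals glued along Ω^4, and its key property — that its bivariate marginals over (1,4), (3,4), and (2,4) reproduce the three couplings it was built from — can be checked numerically. A sketch with hypothetical couplings sharing the marginal over Ω^4:

```python
import numpy as np

def compose(q14, q34, q24):
    """p[s,t,l,u] = q14[s,u] * q24[t,u] * q34[l,u] / p4[u]**2, where p4
    is the marginal over Omega^4 shared by the three couplings -- the
    same gluing pattern as p^{1,2,3,4} in the n = 3 proof."""
    p4 = q14.sum(axis=0)
    assert np.allclose(p4, q34.sum(axis=0)) and np.allclose(p4, q24.sum(axis=0))
    return np.einsum('su,tu,lu->stlu', q14, q24, q34) / p4 ** 2

# Hypothetical couplings with common marginal p4 = (0.4, 0.6) over Omega^4.
q14 = np.array([[0.1, 0.5], [0.3, 0.1]])
q34 = np.array([[0.2, 0.2], [0.2, 0.4]])
q24 = np.array([[0.4, 0.3], [0.0, 0.3]])
p = compose(q14, q34, q24)
assert np.allclose(p.sum(), 1.0)             # a valid mass function
assert np.allclose(p.sum(axis=(1, 2)), q14)  # marginal over (1, 4)
assert np.allclose(p.sum(axis=(0, 1)), q34)  # marginal over (3, 4)
assert np.allclose(p.sum(axis=(0, 2)), q24)  # marginal over (2, 4)
```

The three marginal assertions are precisely the identities p^{1,4} = p^{*(3)1,4}, p^{3,4} = p^{*(2)3,4}, and p^{2,4} = p^{*(1)2,4} used to pass from (4.12) to (4.15).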
4.2 Special hash function  To prove that property 4 in Def. 4 holds with C(n) ≥ (n − 1)/5, n > 7, we need the following tool.

DEFINITION 7. The map H′^n transforms a triple (i, j, r), 1 ≤ i < j ≤ n, r ∈ [n − 1], into either 2, 3, or 4 triples according to

(4.17)  (i, j, r) ↦ H′^n(i, j, r) = H′^n_1(i, j, r) ⊕ H′^n_1(j, i, r),

(4.18)  H′^n_1(i, j, r) = {(i, r, h′(i, r))} if j = h′(i, r), and H′^n_1(i, j, r) = {(i, j, h′(i, r)), (j, r, h′(i, r))} if j ≠ h′(i, r),

(4.19)  h′(i, r) = 1 + ((i + r − 1) mod n) if i < n, and h′(i, r) = 1 + (r mod (n − 1)) if i = n.

We assume that the first two components of each output triple are ordered. For example, (i, r, h′(j, r)) ≡ (min{i, r}, max{i, r}, h′(j, r)).

The following property of H′^n is critical to lower bound C(n). Its proof is in Appendix B.

LEMMA 4.1. Let (a, b, c) ∈ H′^n(i, j, r), 1 ≤ i < j ≤ n, r ∈ [n − 1]. Then, 1 ≤ a ≤ b ≤ n, 1 ≤ c ≤ n, and c ∉ {a, b}. Furthermore,

(4.20)  ⊕_{1 ≤ i < j ≤ n, r ∈ [n−1]} H′^n(i, j, r)

has at most 5 copies of each triple, where two triples are equal iff they agree component-wise.

REMARK 7. Note that we might have a = b in a triple (a, b, c) output by H′^n. For example, if n = 4, all 5 triples (1, 2, 3), (1, 3, 2), (2, 3, 2), (2, 3, 3), and (2, 3, 4) map to (2, 3, 1). Also, both (1, 2, 1) and (1, 4, 1) map to (1, 1, 2), whose first two components are equal.

4.3 Useful lemmas  We also need the following lemmas, whose proofs are in Appendix C.

LEMMA 4.2. Let p be as in Def. 1 eq. (2.3) for some q^k and {q^{i|k}}_{i ∈ [n]\k}. Let p^i and p^{i,k}, i ≠ k, be the marginals of p over Ω^i and Ω^i × Ω^k, respectively. Let q^{i,k} = q^{i|k} q^k, i ≠ k, and let q^i be its marginal over Ω^i. We have that p^i = q^i ∀i, and p^{i,k} = q^{i,k} ∀i ≠ k.

LEMMA 4.3. Let d be a metric and p a mass over Ω^1 × ... × Ω^n. Let p^{i,j} be the marginal of p over Ω^i × Ω^j. Define w_{i,j} ≜ ⟨d^{i,j}, p^{i,j}⟩_ℓ^{1/ℓ}. For any i, j, k ∈ [n] and ℓ ∈ N we have that w_{i,j} ≤ w_{i,k} + w_{k,j}.

4.4 Proof of the lower bound on C(n)  We will show that (n − 1) W^{1,...,n} ≤ 5 Σ_{r=1}^{n} W^{1,...,r−1,r+1,...,n+1}. For r ∈ [n], let p^{(*r)} be a minimizer for W^{1,...,r−1,r+1,...,n+1}. We would normally use r^{(*r)} for this minimizer but, to avoid confusion between r and r, we avoid doing so. For i, j ∈ [n + 1]\{r}, let p^{(*r)i,j} be the marginal of p^{(*r)} for the sample space Ω^i × Ω^j. Since p^{(*r)} satisfies the constraints in (3.10), its marginal over Ω^i equals p^i.

Let h′(·, ·) be the map in (4.19). For each r ∈ [n − 1], define the mass function over Ω^1 × ... × Ω^n

(4.21)  q^{(r)} = G(p^r, {p^{(*h′(i,r)) i|r}}_{i ∈ [n]\r}),

where p^{(*h′(i,r)) i|r} satisfies p^{(*h′(i,r)) i|r} p^r = p^{(*h′(i,r)) i,r}. Note that h′(i, r) ∉ {i, r}, ∀ 1 ≤ i ≤ n and r ∈ [n − 1]. Thus, p^{(*h′(i,r)) i,r} and p^{(*h′(i,r)) i|r} exist. Let q^{(r)i} be the marginal of q^{(r)} over Ω^i, and q^{(r)i,j} its marginal over Ω^i × Ω^j.

By Lemma 4.2, we know that q^{(r)i} equals p^i (given) for all i ∈ [n], and hence q^{(r)} satisfies the optimization constraints in (3.10) for W^{1,...,n}. Therefore, we can write

(4.22)  (n − 1) W^{1,...,n} = Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} ⟨d^{i,j}, p^{*i,j}⟩_ℓ^{1/ℓ} ≤ Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} ⟨d^{i,j}, q^{(r)i,j}⟩_ℓ^{1/ℓ},

where p^{*i,j} is the bivariate marginal over Ω^i × Ω^j of the minimizer p^* for W^{1,...,n}.

We now bound each term in the innermost sum on the r.h.s. of (4.22) as

(4.23)  ⟨d^{i,j}, q^{(r)i,j}⟩_ℓ^{1/ℓ} ≤ ⟨d^{i,r}, q^{(r)i,r}⟩_ℓ^{1/ℓ} + ⟨d^{r,j}, q^{(r)r,j}⟩_ℓ^{1/ℓ}    (a)

(4.24)  = ⟨d^{i,r}, q^{(r)i,r}⟩_ℓ^{1/ℓ} + ⟨d^{j,r}, q^{(r)j,r}⟩_ℓ^{1/ℓ}    (b)

(4.25)  = ⟨d^{i,r}, p^{(*h′(i,r)) i,r}⟩_ℓ^{1/ℓ} + ⟨d^{j,r}, p^{(*h′(j,r)) j,r}⟩_ℓ^{1/ℓ},    (c)

where i ≠ r, r ≠ j, and: (a) holds by Lemma 4.3; (b) holds because d is symmetric; and (c) holds because, by Lemma 4.2, q^{(r)i,r} = p^{(*h′(i,r)) i,r} and q^{(r)j,r} = p^{(*h′(j,r)) j,r}.

Bounding the r.h.s. of (4.22) using (4.23)–(4.25), we re-write the resulting inequality using the notation

(4.26)  (n − 1) W^{1,...,n} = Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} w_{(i,j,r)} ≤ Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} ( v_{(i,r,h′(i,r))} + v_{(j,r,h′(j,r))} ),

where we implicitly assume that the first two components of each triple on the r.h.s. of (4.26) are ordered.
That is, if e.g. r < i, then (r, i, h′(i, r)) should be read as (i, r, h′(i, r)). Each w_{(i,j,r)} represents one ⟨d^{i,j}, p^{*i,j}⟩_ℓ^{1/ℓ} on the l.h.s. of (4.22), and each v_{(s,t,l)} represents ⟨d^{s,t}, p^{(*l) s,t}⟩_ℓ^{1/ℓ} if s ≠ t, and is zero if s = t. Since h′(i, r) ∉ {i, r}, when i ≠ r the mass p^{(*h′(i,r)) i,r} exists.

Finally, using this same compact notation, we write

(4.27)  5 Σ_{r=1}^{n} W^{1,...,r−1,r+1,...,n+1} = 5 Σ_{r=1}^{n} Σ_{i,j ∈ [n+1]\{r}, i<j} v_{(i,j,r)},

and now we will show that (4.27) upper-bounds the r.h.s. of (4.26), finishing the proof.

First, by Lemma 4.3 and the symmetry of d, observe that the following inequalities are true

(4.28)  v_{(i,r,h′(i,r))} ≤ v_{(i,j,h′(i,r))} + v_{(j,r,h′(i,r))},

(4.29)  v_{(j,r,h′(j,r))} ≤ v_{(i,j,h′(j,r))} + v_{(i,r,h′(j,r))},

as long as, for each triple (a, b, c) in the above expressions, c ∉ {a, b}. We will use inequalities (4.28) and (4.29) to upper bound some of the terms on the r.h.s. of (4.26), and then we will show that the resulting sum can be upper bounded by (4.27). In particular, for each (i, j, r) considered in the r.h.s. of (4.26), we will apply inequalities (4.28) and (4.29) such that the terms v_{(a,b,c)} that we get after their use have triples (a, b, c) that match the triples in H′^n(i, j, r), defined in Def. 7. To be concrete, for example, if H′^n maps (i, j, r) to {(i, r, h′(i, r)), (r, j, h′(j, r))}, then we do not apply (4.28) and (4.29), and we leave v_{(i,r,h′(i,r))} + v_{(r,j,h′(j,r))} as is on the r.h.s. of (4.26). If, for example, H′^n maps (i, j, r) to {(i, r, h′(i, r)), (i, j, h′(j, r)), (i, r, h′(j, r))}, then we leave the first term in v_{(i,r,h′(i,r))} + v_{(r,j,h′(j,r))} untouched, but we upper bound the second term using (4.29) to get v_{(i,r,h′(i,r))} + v_{(i,j,h′(j,r))} + v_{(i,r,h′(j,r))}.

After proceeding in this fashion, and by Lemma 4.1, we know that all of the terms v_{(a,b,c)} that we obtain have triples (a, b, c) with c ∉ {a, b}, c ∈ [n − 1], and 1 ≤ a ≤ b ≤ n. Therefore, these terms are either zero (if a = b) or appear in (4.27). Also because of Lemma 4.1, each triple (a, b, c) with non-zero v_{(a,b,c)} will not appear more than 5 times. Therefore, the upper bound we build with the help of h′ for the r.h.s. of (4.26) can be upper bounded by (4.27).

5 Numerical experiments

We illustrate how an MMOT which defines an n-metric, n > 2, and pairwise MMOT in particular, improves a task of clustering graphs compared to using an OT that defines a 2-metric, or a non-n-metric MMOT.

We cluster graphs by i) computing their spectrum, ii) treating each spectrum as a probability distribution, iii) using WD and three different MMOTs to compute distances among these distributions, and iv) feeding these distances to distance-based clustering algorithms to recover the true cluster memberships. We use spectral clustering based on normalized random-walk Laplacians [32] to produce one clustering solution out of the pairwise graph distances computed via WD. We also produce clustering solutions out of the graph triple-wise distances computed via Def. 6 (an n-metric), via WBD in Remark 6 (also an n-metric), and via W as in Thrm. 3.2 (a non-n-metric). To do so, we use the hypergraph-based clustering methods NH-Cut [33] and TTM [34, 27]. Code for our experiments and details about our setup are in https://drive.google.com/drive/folders/11_MqRx29Yq-KuZYUSsOOhK7EbmTAIqu9?usp=sharing.

5.1 Synthetic graphs dataset  We generate 7 equal-sized synthetic clusters of graphs by including in the ith cluster multiple random perturbations of: i = 1) a complete graph, i = 2) a complete bipartite graph, i = 3) a cyclic chain, i = 4) a k-dimensional cube, i = 5) a K-hop lattice, i = 6) a periodic 2D grid, or i = 7) an Erdős–Rényi graph. A random class prediction has a 0.857 error rate. We repeat this cluster generation 100 times for 100 independent experiments, and collect performance statistics.

Figure 3: Comparing the effect that different distances and metrics have on clustering synthetic graphs. (Histograms of the fraction of misclassified graphs over the repetitions. Hypergraph clustering via NH-Cut: pairwise MMOT mean 0.615, barycenter mean 0.623, non-n-metric mean 0.707. Hypergraph clustering via TTM: pairwise MMOT mean 0.617, barycenter mean 0.622, non-n-metric mean 0.694. Spectral clustering with WD: mean 0.722.)

Figure 3-(left, center) shows that both TTM and NH-Cut work better when hyperedges are computed using an n-metric, and that pairwise MMOT works better than WBD. To double check that this is due to the n-metric properties, we perturbed W to introduce triangle inequality violations (i.e. violations of Def. 4-4 with C(3) = 1).
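Steps i)–iii) of our clustering pipeline can be sketched in a few lines; the two toy graphs and the use of SciPy's 1-D Wasserstein distance below are illustrative stand-ins, not the setup from our repository.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def rw_laplacian_spectrum(adj):
    """Steps i)-ii): eigenvalues of the normalized random-walk Laplacian
    I - D^{-1} A, sorted, treated as the support of a uniform distribution."""
    deg = adj.sum(axis=1)
    lap = np.eye(len(adj)) - adj / deg[:, None]
    return np.sort(np.linalg.eigvals(lap).real)

# Step iii), pairwise (2-metric) case: WD between two graph spectra.
k4 = np.ones((4, 4)) - np.eye(4)                     # complete graph K4
path4 = np.diag(np.ones(3), 1); path4 = path4 + path4.T  # path on 4 nodes
dist = wasserstein_distance(rw_laplacian_spectrum(k4),
                            rw_laplacian_spectrum(path4))
```

In the triple-wise case one would instead feed three spectra into an MMOT solver (e.g. the LP behind (3.10)) and threshold the resulting multi-distance to form hyperedges for NH-Cut or TTM.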
We observe that this leads to worse MMOT performance. Table 1 shows the effect of adding 20% violations to different MMOT distances. These additions clearly affect pairwise-MMOT and barycenter-MMOT, both n-metrics, but the non-n-metric is not affected much by the injection of triangle inequality violations. Details about how W is perturbed are in our code repository.

    Violations  Clustering  Pairwise  Barycenter  Non-n-metric
    No          NH-Cut      0.615     0.623       0.707
    Yes         NH-Cut      0.632     0.632       0.704
    No          TTM         0.617     0.622       0.694
    Yes         TTM         0.627     0.634       0.696

Table 1: Triangle inequality violations in W degrade clustering performance with n-metrics more than with non-n-metrics.

Figure 3-(right) shows that clustering using only pairwise relationships among graphs leads to worse accuracy than using triple-wise relationships as in Figure 3-(left, center). This has been pointed out in [35].

5.2 Molecular graphs dataset  This experiment is motivated by the important task in chemistry of clustering chemical compounds, represented as graphs, by their structure [36, 37, 38]. We use the molecular dataset in the supplementary material of [39], which can be downloaded at https://pubs.acs.org/doi/abs/10.1021/ci034143r#_i21. It contains five types of compounds: cyclooxygenase-2 inhibitors, benzodiazepine receptor ligands, estrogen receptor ligands, dihydrofolate reductase inhibitors, and monoamine oxidase inhibitors. We randomly sample an equal number of molecules from each type to build each cluster. A random class prediction has a 0.8 error rate. We repeat this sampling 100 times independently for each of 100 trials, and collect performances.

Figure 4: Comparing the effect that different distances and metrics have on clustering molecular graphs. (Histograms of the fraction of misclassified graphs over the repetitions. Hypergraph clustering via NH-Cut: n-metric mean 0.526, barycenter mean 0.645, non-n-metric mean 0.684. Hypergraph clustering via TTM: n-metric mean 0.527, barycenter mean 0.651, non-n-metric mean 0.656. Spectral clustering with WD: mean 0.662.)

Figure 4-(left, center) shows that both TTM and NH-Cut work better when hyperedges are computed using n-metrics, and Figure 4-(right) shows that clustering using pairwise relationships performs worse than using triple-wise relations.

There is a starker difference between n-metrics and non-n-metrics, seen in Figure 1. The number of possible 3-sized hyperedges is cubic in the number of graphs. Thus, in our experiments we randomly sample z triples (i, j, k) and only for these we create a hyperedge with weight W^{i,j,k}. Figure 1 shows the effect of z on performance. Comparing more graphs, i.e. increasing z, should improve clustering. However, for a non-n-metric, as z grows, triangle inequality violations can appear that introduce confusion: a graph can be "close" to two clusters that are far away, confusing TTM and NH-Cut. This offsets the benefits of a high z and results in the flat curves in Figure 1.

6 Future work

We have shown that a generalization of optimal transport to multiple distributions, the pairwise multi-marginal optimal transport (pairwise MMOT), leads to a multi-distance that satisfies generalized metric properties. In particular, we have proved that the generalized triangle inequality that it satisfies cannot be improved, up to a linear factor. This opens the door to using pairwise MMOT in combination with several algorithms whose good performance depends on metric properties. Meanwhile, for a general MMOT, we have proved that the cost function being a generalized metric is not enough to guarantee that MMOT defines a generalized metric. In future work, we seek to find new sufficient conditions under which other variants of MMOT lead to generalized metrics, and, for certain families of MMOT, find necessary conditions for these same properties to hold.

References

[1] L. V. Kantorovich, "On the translocation of masses," in Dokl. Akad. Nauk SSSR, vol. 37, pp. 199–201, 1942.
[2] J. Solomon et al., "Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains," ACM Trans. Graph., vol. 34, no. 4, p. 66, 2015.
[3] M. Arjovsky et al., "Wasserstein generative adversarial networks," in ICML, 2017.
[4] H. Fan, H. Su, and L. J. Guibas, "A point set generation network for 3D object reconstruction from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613, 2017.
[5] B. B. Damodaran et al., "DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation," arXiv preprint arXiv:1803.10081, 2018.
[6] M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly, M. Cuturi, G. Peyré, and J.-L. Starck, “Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning,” SIAM Journal on Imaging Sciences, vol. 11, no. 1, pp. 643–678, 2018.
[7] L. Ambrosio and N. Gigli, “A user’s guide to optimal transport,” in Modelling and Optimisation of Flows on Networks, pp. 1–155, Springer, 2013.
[8] E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with side-information,” in Advances in Neural Information Processing Systems, pp. 521–528, 2003.
[9] J. A. Hartigan, Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[10] K. L. Clarkson, “Nearest-neighbor searching and metric space dimensions,” in Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pp. 15–59, 2006.
[11] K. L. Clarkson, “Nearest neighbor queries in metric spaces,” Discrete & Computational Geometry, vol. 22, no. 1, pp. 63–93, 1999.
[12] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104, 2006.
[13] F. Angiulli and C. Pizzuti, “Fast outlier detection in high dimensional spaces,” in European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15–27, Springer, 2002.
[14] P. Indyk, “Sublinear time algorithms for metric space problems,” in Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pp. 428–434, 1999.
[15] M. R. Ackermann, J. Blömer, and C. Sohler, “Clustering for metric and nonmetric distance measures,” ACM Transactions on Algorithms (TALG), vol. 6, no. 4, pp. 1–26, 2010.
[16] F. Mémoli, “Gromov–Wasserstein distances and the metric approach to object matching,” Foundations of Computational Mathematics, vol. 11, no. 4, pp. 417–487, 2011.
[17] B. Pass, “On the local structure of optimal measures in the multi-marginal optimal transportation problem,” Calculus of Variations and Partial Differential Equations, vol. 43, no. 3-4, pp. 529–536, 2012.
[18] B. Pass, “Multi-marginal optimal transport: theory and applications,” ESAIM: Mathematical Modelling and Numerical Analysis, vol. 49, no. 6, pp. 1771–1790, 2015.
[19] C. T. Li and V. Anantharam, “Pairwise multi-marginal optimal transport and embedding for earth mover’s distance,” arXiv preprint arXiv:1908.01388, 2019.
[20] J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan, “Multi-marginal Wasserstein GAN,” in Advances in Neural Information Processing Systems, pp. 1774–1784, 2019.
[21] B. Pass, “Multi-marginal optimal transport and multi-agent matching problems: uniqueness and structure of solutions,” arXiv preprint arXiv:1210.7372, 2012.
[22] G. Peyré, M. Cuturi, et al., “Computational optimal transport,” Foundations and Trends in Machine Learning, vol. 11, no. 5-6, pp. 355–607, 2019.
[23] A. Gerolin, A. Kausamo, and T. Rajala, “Duality theory for multi-marginal optimal transport with repulsive costs in metric spaces,” ESAIM: Control, Optimisation and Calculus of Variations, vol. 25, p. 62, 2019.
[24] A. Moameni and B. Pass, “Solutions to multi-marginal optimal transport problems concentrated on several graphs,” ESAIM: Control, Optimisation and Calculus of Variations, vol. 23, no. 2, pp. 551–567, 2017.
[25] K. Singh and S. Upadhyaya, “Outlier detection: applications and techniques,” International Journal of Computer Science Issues (IJCSI), vol. 9, no. 1, p. 307, 2012.
[26] C. Dwork and J. Lei, “Differential privacy and robust statistics,” in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 371–380, 2009.
[27] D. Ghoshdastidar and A. Dukkipati, “A provable generalized tensor spectral method for uniform hypergraph partitioning,” in International Conference on Machine Learning, pp. 400–409, 2015.
[28] P. Purkait, T.-J. Chin, A. Sadri, and D. Suter, “Clustering with hypergraphs: the case for large hyperedges,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1697–1711, 2016.
[29] G. Kiss, J.-L. Marichal, and B. Teheux, “A generalization of the concept of distance based on the simplex inequality,” Beiträge zur Algebra und Geometrie/Contributions to Algebra and Geometry, vol. 59, no. 2, pp. 247–266, 2018.
[30] G. Carlier and I. Ekeland, “Matching for teams,” Economic Theory, vol. 42, no. 2, pp. 397–418, 2010.
[31] M. Agueh et al., “Barycenters in the Wasserstein space,” SIAM Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, 2011.
[32] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[33] D. Ghoshdastidar, A. Dukkipati, et al., “Consistency of spectral hypergraph partitioning under planted partition model,” The Annals of Statistics, vol. 45, no. 1, pp. 289–315, 2017.
[34] D. Ghoshdastidar and A. Dukkipati, “Uniform hypergraph partitioning: Provable tensor methods and sampling techniques,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 1638–1678, 2017.
[35] D. Zhou, J. Huang, and B. Schölkopf, “Learning with hypergraphs: Clustering, classification, and embedding,” in Advances in Neural Information Processing Systems, pp. 1601–1608, 2007.
[36] S. J. Wilkens, J. Janes, and A. I. Su, “HierS: hierarchical scaffold clustering using topological chemical graphs,” Journal of Medicinal Chemistry, vol. 48, no. 9, pp. 3182–3193, 2005.
[37] M. Seeland, A. K. Johannes, and S. Kramer, “Structural clustering of millions of molecular graphs,” in Proceedings of the 29th Annual ACM Symposium on Applied Computing, 2014.
[38] M. J. McGregor and P. V. Pallai, “Clustering of large databases of compounds: using the MDL “keys” as structural descriptors,” Journal of Chemical Information and Computer Sciences, vol. 37, no. 3, pp. 443–448, 1997.
[39] J. J. Sutherland, L. A. O’Brien, and D. F. Weaver, “Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships,” Journal of Chemical Information and Computer Sciences, vol. 43, no. 6, pp. 1906–1915, 2003.
[40] L. Torres, P. Suárez-Serrato, and T. Eliassi-Rad, “Non-backtracking cycles: length spectrum theory and graph mining applications,” Applied Network Science, vol. 4, no. 1, p. 41, 2019.
[41] V. Batagelj and M. Zaveršnik, “Fast algorithms for determining (generalized) core groups in social networks,” Advances in Data Analysis and Classification, vol. 5, no. 2, pp. 129–145, 2011.
[42] D. Constantine and J.-F. Lafont, “Marked length rigidity for one-dimensional spaces,” Journal of Topology and Analysis, vol. 11, no. 03, pp. 585–621, 2019.
A Details for proof of Theorem 3.2

…overlaps: H_1^n(i, j) vs. H_2^n(i, j); H_1^n(i', j') vs. H_2^n(i', j'); H_1^n(i, j) vs. H_1^n(i', j'); H_1^n(i', j') vs. H_2^n(i', j'). The two combinations left are H_1^n(i, j) vs. H_2^n(i', j') and H_1^n(i', j') vs. H_2^n(i, j). We notice that they are symmetric and, because the choice of the tuples (i, j), (i', j') is arbitrary, we only need to show that H_1^n(i, j) and H_2^n(i', j') do not have overlaps, given (i, j) ≠ (i', j').

H_1^n(i, j) and H_2^n(i', j') each have two possibilities for the form of their output. Thus, together, there are four possibilities to consider. None of them have an overlap, which we show by contradiction.

1. H_1^n(i, j) = {(i, n+1, h(i))} and H_2^n(i', j') = {(j', n+1, h(j'))}. If these single-element sets have an overlap, that implies that i = j', but, according to the definition, i = 1 and i' = j' − 1, which implies j' > 1.

2. H_1^n(i, j) = {(i, n+1, h(i))} and H_2^n(i', j') = {(i', j', h(j')), (i', n+1, h(j'))}. For them to have an overlap, h(i) = h(j'). That requires i = j', which is contradictory to i = 1 and i' < j' − 1 at the same time.

3. H_1^n(i, j) = {(i, j, h(i)), (j, n+1, h(i))} and H_2^n(i', j') = {(i', j', h(j')), (i', n+1, h(j'))}. For the first two components to be equal, i = i', j = j', and i = j', which is contradictory to i' < j' − 1. For the second two components to be equal, j = i' and i = j', which is contradictory to i < j or i' < j'. Because of the existence of "n+1", the components at different positions cannot collide.

4. H_1^n(i, j) = {(i, j, h(i)), (j, n+1, h(i))} and H_2^n(i', j') = {(j', n+1, h(j'))}. This implies j' = j and j' = i, which is contradictory to i < j.

For example, if n = 3, then the possible tuples (1, 2), (1, 3), and (2, 3) get mapped respectively to (1, 2, 3), (2, 4, 3), (2, 4, 1), and (1, 4, 3), (1, 3, 2), (1, 4, 2), and (2, 3, 1), (3, 4, 1), (3, 4, 2), all of which are different and satisfy the claims in Lemma D.1.

We now prove the four metric properties in order. It is trivial to prove the first three properties given the definition of our distance function for the transport problem. Then, we provide a detailed proof for the triangle inequality.

D.2 Non-Negativity

Proof. The non-negativity of d^{i,j} and r^{i,j} implies that ⟨d^{i,j}, r^{i,j}⟩_ℓ ≥ 0, and hence that W ≥ 0.

…involves a set of distances {d̃^{a,b}}_{a,b} that satisfy d̃^{i,j} = d^{σ(i,j)}. Therefore, each term ⟨d̃^{i,j}, r^{i,j}⟩_ℓ involved in the computation of W(p^{σ(i_{1:n})}) can be rewritten as ⟨d^{σ^{-1}(i,j)}, r^{i,j}⟩_ℓ, which a simple reindexing of the summation Σ_{i<j} allows us to write as ⟨d^{i,j}, r^{σ(i,j)}⟩_ℓ. Since the mass function r has as supporting sample space Ω^{σ(i_1)} × … × Ω^{σ(i_n)}, the marginal r^{σ(i,j)} can be seen as the marginal q^{i,j} of a mass function q with support Ω^{i_1} × … × Ω^{i_n}. Therefore, minimizing Σ_{i<j} ⟨d̃^{i,j}, r^{i,j}⟩_ℓ^{1/ℓ} for r over Ω^{σ(i_1)} × … × Ω^{σ(i_n)} is the same as minimizing Σ_{i<j} ⟨d^{i,j}, q^{i,j}⟩_ℓ^{1/ℓ} for q over Ω^{i_1} × … × Ω^{i_n}.

D.4 Identity

Proof. We prove each direction of the equivalence separately. Recall that {p^i} are given; they are the masses for which we want to compute the pairwise MMOT.

"⇐=": If for each i, j ∈ [n] we have Ω^i = Ω^j, then m^i = m^j, and there exists a bijection b^{i,j}(·) from [m^i] to [m^j] such that Ω^i_s = Ω^j_{b^{i,j}(s)} for all s. If furthermore p^i = p^j, we can define an r for Ω^1 × … × Ω^n such that its univariate marginal over Ω^i, r^i, satisfies r^i = p^i, and such that its bivariate marginal over Ω^i × Ω^j, r^{i,j}, satisfies r^{i,j}_{s,t} = p^i_s if t = b^{i,j}(s), and zero otherwise. Such an r achieves an objective value of 0 in (3.10), the smallest value possible by the first metric property (already proved). Therefore, W^{1,…,n} = 0.

"=⇒": Now let r* be a minimizer of (3.10) for W^{1,…,n}. Let {r*^i} and {r*^{i,j}} be its univariate and bivariate marginals, respectively. If W^{1,…,n} = 0, then ⟨d^{i,j}, r*^{i,j}⟩_ℓ = 0 for all i, j. Let us consider a specific pair i, j, and, without loss of generality, let us assume that m^i ≤ m^j. Since, by assumption, we have that r*^i_s = p^i_s > 0 for all s ∈ [m^i], and r*^j_s = p^j_s > 0 for all s ∈ [m^j], there exists an injection b^{i,j}(·) from [m^i] to [m^j] such that r*^{i,j}_{s,b^{i,j}(s)} > 0 for all s ∈ [m^i]. Therefore, ⟨d^{i,j}, r*^{i,j}⟩_ℓ = 0 implies that d^{i,j}_{s,b^{i,j}(s)} = 0 for all s ∈ [m^i]. Therefore, since d is a metric, it must be that Ω^i_s = Ω^j_{b^{i,j}(s)} for all s ∈ [m^i]. Now let us suppose that there exists an r ∈ [m^j] that is not in the range of b^{i,j}. Since, by assumption, all of the elements of the sample spaces are different, it must be that d^{i,j}_{s,r} > 0 for all s ∈ [m^i]. Therefore, since ⟨d^{i,j}, r*^{i,j}⟩_ℓ = 0, it must be that r*^{i,j}_{s,r} = 0 for all s ∈ [m^i]. This contradicts the fact that Σ_{s∈[m^i]} r*^{i,j}_{s,r} = r*^j_r = p^j_r > 0 (the last inequality being true by assumption). Therefore, m^i = m^j, and the existence of b^{i,j} proves that Ω^i = Ω^j. At the same time, since d^{i,j}_{s,t} > 0 for all t ≠ b^{i,j}(s), it must be that r*^{i,j}_{s,t} = 0 for all t ≠ b^{i,j}(s). Therefore, p^i_s = p^j_{b^{i,j}(s)} for all s, i.e., p^i = p^j.

D.5 Generalized Triangle Inequality

Proof. Let p* be a minimizer for (the optimization problem associated with) W^{1,…,n}, and let p*^{i,j} be the marginal induced by p* for the sample space Ω^i × Ω^j. We would normally use r* for this minimizer, but, to avoid confusion between the index r and the mass function r, we avoid doing so. We can write that

(D.4)  W^{1,…,n} = Σ_{1≤i<j≤n} ⟨d^{i,j}, p*^{i,j}⟩_ℓ^{1/ℓ}.

For r ∈ [n], let p^{(∗r)} be a minimizer for W^{1,…,r−1,r+1,…,n+1}. We would normally use r^{(∗r)} for this minimizer, but, to avoid confusion between r and r, we avoid doing so. For i, j ∈ [n+1]\{r}, let p^{(∗r) i,j} be the marginal of p^{(∗r)} for the sample space Ω^i × Ω^j. Recall that since p^{(∗r)} satisfies the constraints in (3.10), its marginal for the sample space Ω^i is p^i, which is given in advance.

Let h(·) be the map defined in (D.2). Define the following mass function for Ω^1 × … × Ω^{n+1}:

(D.5)  q = G(p*^{n+1}, {p^{(∗h(i)) i|n+1}}_{i∈[n]}),

where p^{(∗h(i)) i|n+1} is defined as the mass function that satisfies p^{(∗h(i)) i|n+1} p*^{n+1} = p^{(∗h(i)) i,n+1}. Notice that since h(i) ∉ {i, n+1}, the probability p^{(∗h(i)) i,n+1} exists for all i ∈ [n].

Let q^{1,…,n} be the marginal of q for sample space Ω^1 × … × Ω^n, and q^{i,j} be the marginal of q for Ω^i × Ω^j. By Lemma 4.2, we know that the ith univariate marginal of q is p^i (given), and hence q^{1,…,n} satisfies the constraints associated with W^{1,…,n}. Therefore, we can write that

(D.6)  Σ_{1≤i<j≤n} ⟨d^{i,j}, p*^{i,j}⟩_ℓ^{1/ℓ} ≤ Σ_{1≤i<j≤n} ⟨d^{i,j}, q^{i,j}⟩_ℓ^{1/ℓ}.

By Lemma 4.3, inequality (a) below holds; because d is symmetric, (b) below holds; by the definition of q, (c) below follows. Therefore,

(D.7)  ⟨d^{i,j}, q^{i,j}⟩_ℓ^{1/ℓ} ≤(a) ⟨d^{i,n+1}, q^{i,n+1}⟩_ℓ^{1/ℓ} + ⟨d^{n+1,j}, q^{n+1,j}⟩_ℓ^{1/ℓ}
       =(b) ⟨d^{i,n+1}, q^{i,n+1}⟩_ℓ^{1/ℓ} + ⟨d^{j,n+1}, q^{j,n+1}⟩_ℓ^{1/ℓ}
(D.8)  =(c) ⟨d^{i,n+1}, p^{(∗h(i)) i,n+1}⟩_ℓ^{1/ℓ} + ⟨d^{j,n+1}, p^{(∗h(j)) j,n+1}⟩_ℓ^{1/ℓ}.

Let w(i,j) denote each term on the r.h.s. of (D.4), and w(i,j,r) denote ⟨d^{i,j}, p^{(∗r) i,j}⟩_ℓ^{1/ℓ}. Combining (D.6)–(D.8), we have

(D.9)  Σ_{1≤i<j≤n} w(i,j) ≤ Σ_{1≤i<j≤n} [w(i,n+1,h(i)) + w(j,n+1,h(j))].

Finally, we write

(D.10)  Σ_{r=1}^{n} W^{1,…,r−1,r+1,…,n+1} = Σ_{r=1}^{n} Σ_{i,j∈[n+1]\{r}, i<j} w(i,j,r),

and show that (D.10) upper-bounds the r.h.s. of (D.9).

First, by Lemma 4.3 and the symmetry of d, we have

(D.11)  w(i,n+1,h(i)) ≤ w(i,j,h(i)) + w(j,n+1,h(i)),
(D.12)  w(j,n+1,h(j)) ≤ w(i,j,h(j)) + w(i,n+1,h(j)),

as long as, for each triple (a, b, c) in the above expressions, c ∉ {a, b}. We will use these inequalities to upper bound some of the terms on the r.h.s. of (D.9), which can be further upper bounded by (D.10). In particular, we will apply inequalities (D.11) and (D.12) such that the terms w(a,b,c) that we get after their use have triples (a, b, c) that match the triples obtained via the map H^n defined in Section 4.3. To be concrete, for example, if H^n maps (i, j) to {(i, n+1, h(i)), (j, n+1, h(j))}, then we do not apply (D.11) and (D.12), and we leave w(i,n+1,h(i)) + w(j,n+1,h(j)) as is on the r.h.s. of (D.9). If, for example, H^n maps (i, j) to {(i, n+1, h(i)), (i, j, h(j)), (i, n+1, h(j))}, then we leave the first term in w(i,n+1,h(i)) + w(j,n+1,h(j)) on the r.h.s. of (D.9) untouched, but we upper bound the second term using (D.12) to get w(i,n+1,h(i)) + w(i,j,h(j)) + w(i,n+1,h(j)).

After proceeding in this fashion, and by Lemma D.1, we know that all of the terms w(a,b,c) that we obtain have triples (a, b, c) with c ∉ {a, b}, with c ∈ [n], and 1 ≤ a < b ≤ n+1. Therefore, these terms appear in (D.10). Also by Lemma D.1, we know that we do not get each triple more than once. Therefore, the upper bound that we just constructed with the help of H^n for the r.h.s. of (D.9) can be upper bounded by (D.10).

E Proof of upper bound in Theorem 3.4

Proof. Consider the following setup. Let m^i = m for all i ∈ [n], and Ω^i_s ∈ R for all i ∈ [n], s ∈ [m]. Define d such that d^{i,j}_{s,t} is |Ω^i_s − Ω^j_t| if s = t, and infinity otherwise. Let p^i_s = 1/m for all i ∈ [n], s ∈ [m].

Any optimal solution r* to the pairwise MMOT problem must have bivariate marginals that satisfy r*^{i,j}_{s,t} = (1/m) δ_{s,t}, and thus ⟨d^{i,j}, r*^{i,j}⟩_ℓ^{1/ℓ} = (1/m^{1/ℓ}) ‖Ω^i − Ω^j‖_ℓ, where we interpret Ω^i as a vector in R^m, and ‖·‖_ℓ is the vector ℓ-norm. Therefore, ignoring the factor 1/m^{1/ℓ}, we only need to prove that 4. in Def. 4 holds with C(n) = n − 1 when W^{1:n} is defined as Σ_{1≤i<j≤n} ‖Ω^i − Ω^j‖_ℓ. This in turn is a standard result, whose proof (in a more general form) can be found, e.g., in Example 2.4 in [29].
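The construction in Appendix E can be checked numerically on a small instance. The sketch below (Python with SciPy, although the paper's released code is Matlab; all names are ours) solves the pairwise MMOT linear program for n = 3 marginals with ℓ = 1 (the objective is then linear in the joint mass), replacing the infinite off-diagonal costs with a large finite constant, and recovers the value (1/m) Σ_{i<j} ‖Ω^i − Ω^j‖_1 attained by the diagonal coupling.

```python
# Sketch: pairwise MMOT LP for the setup in Appendix E (assumptions:
# n = 3 marginals, m = 2 atoms each, ell = 1, uniform p^i; "infinite"
# off-diagonal costs replaced by a large constant BIG).
import itertools
import numpy as np
from scipy.optimize import linprog

m, BIG = 2, 1e6
Omega = np.array([[0.0, 1.0],   # Omega^1
                  [2.0, 3.0],   # Omega^2
                  [5.0, 7.0]])  # Omega^3
n = Omega.shape[0]

# The joint mass r lives on m^n atoms indexed by (s1, s2, s3).
atoms = list(itertools.product(range(m), repeat=n))

def cost(i, j, s, t):
    # d^{i,j}_{s,t}: finite only on the diagonal, as in Appendix E.
    return abs(Omega[i, s] - Omega[j, t]) if s == t else BIG

# Objective: sum over pairs i < j of <d^{i,j}, r^{i,j}> (ell = 1 => linear).
c = [sum(cost(i, j, a[i], a[j]) for i, j in itertools.combinations(range(n), 2))
     for a in atoms]

# Constraints: every univariate marginal equals the uniform p^i.
A_eq, b_eq = [], []
for i in range(n):
    for s in range(m):
        A_eq.append([1.0 if a[i] == s else 0.0 for a in atoms])
        b_eq.append(1.0 / m)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pairwise_sum = sum(np.sum(np.abs(Omega[i] - Omega[j]))
                   for i, j in itertools.combinations(range(n), 2))
print(res.fun, pairwise_sum / m)  # the two values coincide
```

The optimal coupling puts all mass on the atoms (0, 0, 0) and (1, 1, 1), as any mass on a mixed atom would incur the BIG off-diagonal cost.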
F Details of numerical experiments

Graphs are everywhere, and classifying and clustering graphs are important in diverse areas. For example, [36] clusters graphs of apps' code execution to find malware; [38] clusters graphs that represent chemical compounds to understand their anti-cancer and cancer-inducing characteristics; [39] represents text as word-based dependency trees and clusters them to classify cellphone reviews; [40] clusters graphs that represent the secondary structure of proteins; [37] reviews other graph clustering applications.

A powerful and general approach to clustering is distance-based clustering, a type of connectivity-based clustering: objects that are similar, according to a given distance measure, are put into the same cluster. Its use for graph clustering requires a measure of the distance between graphs. The purpose of our experiments is to illustrate, via distance-based graph clustering, i) the advantages of using MMOT over OT, and ii) the advantages of using an n-metric MMOT over a non-n-metric MMOT.

F.1 Synthetic graphs Graphs: We create 7 clusters, each with 10 graphs. Each graph is a random perturbation (edge addition/removal with p = 0.05) of either 1) a complete graph, 2) a complete bipartite graph, 3) a cyclic chain, 4) a k-dimensional cube, 5) a K-hop lattice, 6) a periodic 2D grid, or 7) an Erdős–Rényi graph.

Vector data: We transform the graphs {G_i}_{i=1}^{70} into vectors {v^i}_{i=1}^{70} to be clustered. Each v^i is the (complex-valued) spectrum of a matrix M^i representing non-backtracking walks on G_i, which approximates the length spectrum μ_i of G_i [40]. The object μ_i uniquely identifies (the 2-core [41] of) G_i (up to an isomorphism) [42], but is too abstract to be used directly. Hence, we use its approximation v^i. The lengths of v^i and v^j for equal-sized G_i and G_j can be different, depending on how we approximate μ_i. We use distance-based clustering and OT (multi-)distances, since OT allows comparing objects of different lengths. Note that, unlike the length spectrum, the classical spectrum of a graph (the eigenvalues of, e.g., an adjacency matrix, Laplacian matrix, or random-walk matrix) has the advantage of having the same length for graphs with the same number of nodes. However, it does not uniquely identify a graph. For example, a star graph with 5 nodes and the graph that is the union of a square with an isolated node are co-spectral but are not isomorphic.

Distances: Each v^i is interpreted as a uniform distribution p^i over Ω^i = {v^i_k, k = 1, …}, the eigenvalues of M^i. We compute a sampled version T̂^A of the matrix T^A = {W^{i,j}}_{i,j∈[70]}, where W^{i,j} is the WD between p^i and p^j using a d^{i,j} defined by d^{i,j}_{s,t} = |v^i_s − v^j_t|. We compute a sampled version T̂^B of the tensor T^B = {W^{i,j,k}}_{i,j,k∈[70]}, where W^{i,j,k} is defined as in Def. 6 with d^{i,j} as for T^A. We compute a sampled version T̂^{B'} of the tensor T^{B'} = {W^{i,j,k}}_{i,j,k∈[70]}, where W^{i,j,k} is defined as in Remark 6 with d^{i,j} as for T^A. We compute a sampled version T̂^C of the tensor T^C with T^C_{i,j,k} = W^{i,j,k}, where W^{i,j,k} is defined as in Thrm. 3.2, but now considering points in the complex plane. The sampled tensors T̂^B, T̂^{B'}, and T̂^C are built by randomly selecting 100 triples (i, j, k) and setting T̂^B_{i,j,k} = T^B_{i,j,k}, T̂^{B'}_{i,j,k} = T^{B'}_{i,j,k}, and T̂^C_{i,j,k} = T^C_{i,j,k}. The non-sampled triples are given a very large value. The sampled matrix T̂^A is built by sampling (3/2) × 100 = 150 pairs (i, j) and setting T̂^A_{i,j} = T^A_{i,j}, and setting a large value for non-sampled pairs.

Clustering: We feed the distances T̂^A to a spectral clustering algorithm [32] based on normalized random-walk Laplacians to produce one clustering solution C^A; we feed the distances T̂^B to the two hypergraph-based clustering methods NH-Cut [33] and TTM [34, 27] to produce clustering solutions C^{B1} and C^{B2}, respectively; we use T̂^{B'} and NH-Cut and TTM to produce clustering solutions C^{B'1} and C^{B'2}; and we use T̂^C and NH-Cut and TTM to produce clustering solutions C^{C1} and C^{C2}. Both NH-Cut and TTM determine clusters by finding optimal cuts of a hypergraph where each hyperedge's weight is the MMOT distance among three graphs. Both NH-Cut and TTM require a threshold that is used to prune the hypergraph: edges whose weight (multi-distance) is larger than the threshold are removed. This threshold is tuned to minimize each clustering solution's error. All clustering solutions output 7 clusters.

Errors: For each clustering solution, we compute the fraction of misclassified graphs. In particular, if we use C^x(i) = k, x = A, B1, B2, B'1, B'2, C1, C2, to represent that clustering solution C^x assigns graph G_i to cluster k, then the error of this solution is

(F.13)  min_σ (1/70) Σ_{i=1}^{70} I(C^{ground truth}(i) ≠ C^x(σ(i))),

where the min is over all permutations σ, since the specific cluster IDs output by different algorithms have no real meaning. This experiment is repeated 100 times (random numbers being drawn independently among experiments) and the frequency of the errors is plotted in histograms in Figure 3.

Code: The code is fully written in Matlab 2020a and is available at https://drive.google.com/drive/folders/11_MqRx29Yq-KuZYUSsOOhK7EbmTAIqu9?usp=sharing. It is also included in the supplementary material files. It requires installing CVX, available at http://cvxr.com/cvx/download/. To produce Figure 3, open Matlab and run the file run_me_for_synthetic_experiments.m.
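The error criterion (F.13) minimizes over all identifications between output cluster IDs and ground-truth IDs, which is equivalent to minimizing over relabelings of the predicted clusters. A minimal sketch of that computation (Python here, while the released code is Matlab; names are ours; brute force over permutations is fine for the 5 or 7 clusters used in our experiments):

```python
# Sketch: clustering error minimized over cluster relabelings, as in (F.13).
# Assumption: labels are integers 0..k-1.
from itertools import permutations

def clustering_error(truth, pred, k):
    n = len(truth)
    best = n
    for perm in permutations(range(k)):
        # Relabel predicted cluster p as perm[p] and count mismatches.
        mismatches = sum(t != perm[p] for t, p in zip(truth, pred))
        best = min(best, mismatches)
    return best / n

truth = [0, 0, 1, 1, 2, 2]
pred  = [2, 2, 0, 0, 1, 1]  # a pure relabeling of truth
print(clustering_error(truth, pred, 3))  # -> 0.0
```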
F.2 Injection of triangle inequality violations To test if the better performance of pairwise-MMOT and WBD compared to that of the non-n-metric is due to the n-metric property, we perturb the tensor W^{i,j,k} to introduce triangle inequality violations. In particular, for each set of four different graphs (i, j, k, l), we find which among W^{i,j,k}, W^{i,j,l}, W^{i,l,k}, W^{l,j,k} we can change the least to produce a triangle inequality violation among these values. Let us assume that this value is W^{i,j,k}, and that to violate W^{i,j,k} ≤ W^{i,j,l} + W^{i,l,k} + W^{l,j,k}, the quantity W^{i,j,k} needs to increase at least by δ. We then increase W^{i,j,k} by 1.3 × δ. We repeat this procedure such that, in total, 20% of the entries of the tensor W get changed.
F.3 Real molecular graphs The details of our setup for real data are almost identical to the setup for the synthetic data. We explain only the major differences.

Graphs: The dataset that we use [39] contains the adjacency matrices of 467 graphs for cyclooxygenase-2 inhibitors, 405 graphs for benzodiazepine receptor ligands, 756 inhibitors of dihydrofolate reductase, 1009 estrogen receptor ligands, and 1641 monoamine oxidase inhibitors. In each experiment, we randomly get 10 graphs of each type, and prune them so that they have no node with a degree smaller than 1. We note that, unlike for the synthetic data, the molecular graphs have weighted adjacency matrices, whose entries can be 0, 1, or 2.

Vector data: For each graph, we get an approximation of the length spectrum just as explained for the synthetic data. However, to minimize the time our experiments take to run, we only keep the largest 16 eigenvalues.

Distances: For Figure 4, we compute the tensors T̂^B, T̂^{B'}, and T̂^C by sampling x = 600 triples (i, j, k) (i.e., hyperedges). For Figure 1, we perform different experiments with x ranging from 50 to 500.

Clustering: We force all clustering algorithms to output exactly 5 clusters.

Errors: We compute errors using (F.13), but with 70 replaced by 50. Each experiment is repeated 100 times (random numbers being drawn independently among experiments) and the frequency of the errors is plotted in the histograms in Figure 4. For Figure 1, each point on each curve is the average over 100 trials. Error bars are the standard deviation of the average.

Code: The code for this can be found at the same link as the code for the synthetic dataset. To produce Figure 4, open Matlab and run the file run_me_for_molecular_experiments.m.

References

[37] K. Riesen and H. Bunke, Graph Classification and Clustering Based on Vector Space Embedding, vol. 77, World Scientific, 2010.
[38] X. Kong and P. S. Yu, “Semi-supervised feature selection for graph classification,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 793–802, 2010.
[39] T. Kudo, E. Maeda, and Y. Matsumoto, “An application of boosting to graph classification,” in Advances in Neural Information Processing Systems, pp. 729–736, 2005.
[40] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “graph2vec: Learning distributed representations of graphs,” arXiv preprint arXiv:1707.05005, 2017.