Multi-Marginal Optimal Transport Defines a Generalized Metric
* Boston College [email protected]
† Arizona State University [email protected]

where the inf is over measures p^{1,2} on Ω^1 × Ω^2. Problem (1.1) is typically studied under the assumption that Ω^1 and Ω^2 are in a Polish space on which d is a metric, in which case the minimum of (1.1) is the Wasserstein distance (WD). The WD is popular in many applications including shape interpolation [2], generative modeling [3, 4], domain adaptation [5], and dictionary learning [6].

The WD is a metric on the space of probability measures [7], and this property is useful in many ML tasks, e.g., clustering [8, 9], nearest-neighbor search [10, 11, 12], and outlier detection [13]. Indeed, some of these tasks are tractable, or allow theoretical guarantees, when done on a metric space. E.g., finding the nearest neighbor [10, 11, 12] or the diameter [14] of a dataset requires a polylogarithmic computational effort under metric assumptions; approximation algorithms for clustering rely on metric assumptions, whose absence worsens known bounds [15]; also, [16] uses the metric properties.

Multi-Marginal Optimal Transport (MMOT) seeks

(1.2)    inf_p ( ∫_{Ω^1 × ... × Ω^n} d^ℓ dp )^{1/ℓ}    s.t.    ∫_{Ω^{−i}} dp = p^i  ∀i,

where the infimum is taken over measures p on Ω^1 × ... × Ω^n. The term MMOT was coined in [17], and was surveyed by the same authors in [18]. Applications of MMOT include image translation, image registration, multi-agent matching with fairness requirements, and labeling for classification [19, 20].

Unfortunately, there is a lack of discussion about the (generalized) metric properties of MMOT. Much of the discussion on MMOT has focused on the existence of a minimizer, the uniqueness and structure of both Monge and Kantorovich solutions, applications, practical algorithms, and the choice of the cost function [21, 22, 23, 24].

Since the metric property of the WD is useful in so many applications, understanding when the (potential) minimum of (1.2), W(p^1, ..., p^n), a multi-way distance, has metric-like properties is critical, theoretically and practically. For example, just as metric properties can improve distance-based clustering, so too can generalized metrics improve clustering based on multi-way distances. In Figure 1, we preview such an improvement on clustering chemical compounds, which is discussed further in Sec. 5. Importantly, several algorithms in [8]–[15], and more, which use distances (including WD) as input, have guarantees if the distances are metrics. They extend to feeding off multi-distances, and hence can use MMOT, and have guarantees under generalized metrics similar to those under classic metrics. We now exemplify these extensions, and their potential applications.

Example 1: Given a set S with n distributions, we can find its 3-diameter Δ ≜ max_{p^1, p^2, p^3 ∈ S, distinct} W(p^1, p^2, p^3) with (n choose 3) evaluations of W. What if W satisfies the generalized triangle inequality W(p^1, p^2, p^3) ≤ W(p^4, p^2, p^3) + W(p^1, p^4, p^3) + W(p^1, p^2, p^4)? We now know that for at least n/3 distribution triplets W ≥ Δ/3. Indeed, if Δ = W(p^{*1}, p^{*2}, p^{*3}), then for all p^4 ∈ S, we cannot simultaneously have W(p^4, p^{*2}, p^{*3}), W(p^{*1}, p^{*2}, p^4), W(p^{*1}, p^4, p^{*3}) < Δ/3. Therefore, if we evaluate W on random distribution triplets, we are guaranteed to find a (1/3)-approximation of Δ with only O(n^2) evaluations of W on average, an improvement over n^3. Diameter estimation relates to outlier detection [13], which is critical e.g. in cybersecurity [25].

Example 2: Let S be as above. We can find A ≜ (n choose 3)^{−1} Σ_{p^1, p^2, p^3 ∈ S, distinct} W(p^1, p^2, p^3) with (n choose 3) evaluations of W. We can estimate A by averaging W over a set with O(n^2) distinct triplets randomly sampled from S, improving over n^3. If W is a generalized metric, an argument as in Example 1 shows that with high probability we do not miss triplets with large W, which is the critical step to prove that we approximate A well. Average estimation is critical e.g. in differential privacy [26].

Example 3: Let S be as above. Consider building a hypergraph with nodes S and hyperedges defined as follows. For each distinct triplet (p^1, p^2, p^3) for which W(p^1, p^2, p^3) < thr, a constant threshold, include it as a hyperedge. Hypergraphs are increasingly important in modern ML, especially for clustering using multiwise relationships among objects [27]–[28]. Let W satisfy the triangle inequality in Example 1 and be invariant under argument permutations.

Among our contributions, we prove that pairwise MMOT is a generalized metric (Sec. 3.3). Finally, we show that the triangle inequality that MMOT satisfies cannot be improved, up to a linear factor.

2 Definitions and setup

2.1 Lists  Expressions that depend on a list of symbols indexed consecutively are abbreviated using ":". In particular, we write s_1, ..., s_k as s_{1:k}, Ω^1, ..., Ω^k as Ω^{1:k}, and A_{s_1,...,s_k} as A_{s_{1:k}}. Note that A_{s_1:s_k} differs from A_{s_{1:k}}. Assuming s_k > s_1, we have A_{s_1:s_k} ≡ A_{s_1}, A_{s_1+1}, A_{s_1+2}, ..., A_{s_k}. By itself, 1:i has no meaning, and it does not mean 1, ..., i. For i ∈ N, we let [i] ≜ {1, ..., i}. The symbol ⊕ denotes a list join operation with no duplicate removal, e.g. {x, y} ⊕ {x, z} = {x, y, x, z}.

2.2 Bra-ket operator  Given two equi-multidimensional arrays A and B, and ℓ ∈ N, we define ⟨A, B⟩_ℓ ≜ Σ_{s_{1:k}} (A_{s_{1:k}})^ℓ B_{s_{1:k}}, where (·)^ℓ is the ℓth power.

2.3 Probability spaces  To facilitate exposition, we state our main contributions for probability spaces with a finite sample space in Ω, an event σ-algebra which is the power set of the sample space, and a probability measure described by a probability mass function. We refer to probability mass functions using bold letters, e.g. p, q, r, etc.

When talking about n probability spaces, the ith space has sample space Ω^i = {Ω^i_{1:m_i}} ⊆ Ω, an event space 2^{Ω^i}, and a probability mass function p^i, or q^i, or r^i, etc. Variable m_i is the number of atoms in Ω^i. Symbol p^i_s denotes the probability of the atomic event {Ω^i_s}. Without loss of generality (w.l.o.g.) we assume p^i_s > 0, ∀i ∈ [n], ∀s ∈ [m_i]. Our notation assumes that atoms can be indexed, but our results extend beyond this assumption. W.l.o.g., we assume that Ω^i_s = Ω^i_t if and only if s = t.

Symbol p^{i_{1:k}} denotes a mass function for the probability space with sample space Ω^{i_1} × ... × Ω^{i_k} and event space 2^{Ω^{i_1} × ... × Ω^{i_k}}. In particular, p^{i_{1:k}}_{s_{1:k}} (i.e. p^{i_1,...,i_k}_{s_1,...,s_k}) is the probability of the atomic event {(Ω^{i_1}_{s_1}, ..., Ω^{i_k}_{s_k})}. We use p^{i_{1:k}|j_{1:r}} to denote a probability mass function for the probability space with sample space Ω^{i_1} × ... × Ω^{i_k} and event space 2^{Ω^{i_1} × ... × Ω^{i_k}}, such that p^{i_{1:k}|j_{1:r}}_{s_{1:k}|t_{1:r}} ≜ p^{i_1,...,i_k,j_1,...,j_r}_{s_1,...,s_k,t_1,...,t_r} / p^{j_1,...,j_r}_{t_1,...,t_r}, i.e. a conditional probability.
W is an (n, C(n))-metric (Def. 4) if:

1. W^{1,...,n} ≥ 0,

2. W^{1,...,n} = 0 iff p^i = p^j, Ω^i = Ω^j, ∀i, j,

3. W^{1,...,n} = W^{σ(1,...,n)}, for any permutation σ,

4. C(n) W^{1,...,n} ≤ Σ_{r=1}^{n} W^{1,...,r−1,r+1,...,n+1}.

REMARK 1. The equalities p^i = p^j and Ω^i = Ω^j mean that m_i = m_j, and that there exists a bijection b_{i,j}(·) from [m_i] to [m_j] such that p^i_s = p^j_{b_{i,j}(s)} and Ω^i_s = Ω^j_{b_{i,j}(s)}, ∀ s ∈ [m_i].

REMARK 2. We abbreviate (n, 1)-metric by n-metric.

REMARK 3. Our notions of metric and generalized metric are more general than usual in the sense that they support the use of different functions depending on the spaces from which we are drawing elements. This grants an extra layer of generality to our results.

In our setup, the inf in (1.2) is always attained (recall, finite spaces) and amounts to solving an LP. We refer to the minimizing distributions by p^*, q^*, r^*, etc. We define the following map from n probability spaces to R. The definition below amounts to (1.2) when the p's are empirical measures.

LEMMA 3.1. (Gluing lemma) Let p^{1,3} and p^{2,3} be arbitrary mass functions for Ω^1 × Ω^3 and Ω^2 × Ω^3, respectively, with the same marginal, p^3, over Ω^3. There exists a mass function r^{1,2,3} for Ω^1 × Ω^2 × Ω^3 whose marginals over Ω^1 × Ω^3 and Ω^2 × Ω^3 equal p^{1,3} and p^{2,3}, respectively.

The way Lemma 3.1 is used to prove WD's triangle inequality is as follows. Assume d is a metric (Def. 2). Let ℓ = 1 for simplicity. Let p^{*1,2}, p^{*1,3}, and p^{*2,3} be optimal transports such that W^{1,2} = ⟨p^{*1,2}, d^{1,2}⟩, W^{1,3} = ⟨p^{*1,3}, d^{1,3}⟩, and W^{2,3} = ⟨p^{*2,3}, d^{2,3}⟩. Define r^{1,2,3} as in Lemma 3.1, and let r^{1,3}, r^{2,3}, and r^{1,2} be its bivariate marginals. We then have

(3.5)  W^{1,2} = ⟨p^{*1,2}, d^{1,2}⟩ ≤ ⟨r^{1,2}, d^{1,2}⟩ = Σ_{s,t} r^{1,2}_{s,t} d^{1,2}_{s,t}    (r suboptimal)

(3.6)  = Σ_{s,t,l} r^{1,2,3}_{s,t,l} d^{1,2}_{s,t} ≤ Σ_{s,t,l} r^{1,2,3}_{s,t,l} (d^{1,3}_{s,l} + d^{2,3}_{t,l})    (d is a metric)

(3.7)  = ⟨r^{1,3}, d^{1,3}⟩ + ⟨r^{2,3}, d^{2,3}⟩

(3.8)  = ⟨p^{*1,3}, d^{1,3}⟩ + ⟨p^{*2,3}, d^{2,3}⟩ = W^{1,3} + W^{2,3}.    (Lemma 3.1)

Our first roadblock is that Lemma 3.1 does not generalize to higher dimensions. For simplicity, we now omit the sample spaces on which mass functions are defined. When a set of mass functions have all their marginals over the same sample sub-spaces equal, we will say they are compatible.
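In the finite setting, one explicit construction that realizes Lemma 3.1 is r^{1,2,3}_{s,t,l} = p^{1,3}_{s,l} p^{2,3}_{t,l} / p^3_l, i.e. making Ω^1 and Ω^2 conditionally independent given Ω^3. The check below is a sketch of that standard construction with made-up couplings; it is not taken from our appendix.

```python
import numpy as np

def glue(p13, p23):
    """Glue two couplings sharing the marginal p3 over the third space:
    r[s, t, l] = p13[s, l] * p23[t, l] / p3[l]  (cf. Lemma 3.1)."""
    p3 = p13.sum(axis=0)                        # shared marginal over Omega^3
    assert np.allclose(p3, p23.sum(axis=0)), "couplings must be compatible"
    # Broadcasting divides each slice l by p3[l]; p3 > 0 by our w.l.o.g. setup.
    return np.einsum('sl,tl->stl', p13, p23) / p3

# Hypothetical 2x2 couplings with the common marginal p3 = (0.4, 0.6).
p13 = np.array([[0.3, 0.2], [0.1, 0.4]])
p23 = np.array([[0.2, 0.3], [0.2, 0.3]])
r = glue(p13, p23)
assert np.allclose(r.sum(axis=1), p13)   # marginal over Omega^1 x Omega^3
assert np.allclose(r.sum(axis=0), p23)   # marginal over Omega^2 x Omega^3
```

The two final assertions are exactly the conclusion of Lemma 3.1; the No Gluing theorem below shows that no analogous construction can match three bivariate marginals at once.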
THEOREM 3.1. (No Gluing) There exist mass functions p^{1,2,4}, p^{1,3,4}, and p^{2,3,4} with compatible marginals such that there is no mass function r^{1,2,3,4} compatible with them.

Proof. If this were not the case, then it would be true that, given arbitrary mass functions p^{1,2}, p^{1,3}, and p^{2,3} with compatible univariate marginals, we should be able to find r^{1,2,3} whose bivariate marginals equal these three mass functions. But this is not the case. For example, let p^{1,2} = p^{1,3} = [1, 0, 1; 0, 1, 0; 0, 0, 0]/3 and p^{2,3} = [1, 1, 1; 1, 1, 1; 1, 1, 1]/9 (we are using matrix notation for the marginals). These marginals have compatible univariate marginals, namely, p^1 = [2, 1, 0]/3 and p^2 = p^3 = [1, 1, 1]/3. Yet, the following system of equations over {r^{1,2,3}_{i,j,k}}_{i,j,k ∈ [3]} is easily checked to be infeasible: (Σ_i r^{1,2,3}_{i,j,k} = p^{2,3}_{j,k} ∀j, k) ∧ (Σ_j r^{1,2,3}_{i,j,k} = p^{1,3}_{i,k} ∀i, k) ∧ (Σ_k r^{1,2,3}_{i,j,k} = p^{1,2}_{i,j} ∀i, j).

3.2 Cost d being an n-metric is not a sufficient condition for MMOT to be an n-metric  Theorem 3.1 tells us that, even if we assume that d is an n-metric, we cannot adapt the classical proof showing WD is a metric into a proof that MMOT is an n-metric. The question remains, however, whether there exists such a proof at all only under the assumption that d is an n-metric. Theorem 3.2 settles this question in the negative.

THEOREM 3.2. Let W be as in Def. 5 with ℓ = 1. There exist Ω, mass functions p^1, p^2, p^3, and p^4 over Ω, and d : Ω × Ω × Ω → R such that d is an n-metric (n = 3), but

(3.9)  W^{1,2,3} > W^{1,2,4} + W^{1,3,4} + W^{2,3,4}.

REMARK 4. The theorem can be generalized to spaces of dim. > 2, and to n > 3 and ℓ > 1.

Proof. Let Ω be the six points in Figure 2-(left), where we assume that 0 < ε ≪ 1, and hence that there are no three co-linear points and no two equal points. Let p^1, p^2, p^3, and p^4 be as in Figure 2-(left); each is represented by a unique color and is uniformly distributed over the points of the same color. Given any x, y, z ∈ Ω, let d(x, y, z) = γ if exactly two points are equal, and let d(x, y, z) be the area of the corresponding triangle otherwise, where γ lower bounds the area of the triangle formed by any three non-co-linear points, e.g. γ = ε/4. A few geometric considerations (see Appendix A) show that d is an n-metric (n = 3, C(n) = 1) and that (3.9) holds as 1/2 > 1/8 + 1/8 + ε/4 + 1/8 + ε/4.

Figure 2: (Left) Sample space Ω, mass functions {p^i}_{i=1}^4 (with p^3 = p^4 = (1/2, 1/2)), and cost function d(x, y, z) = area of triangle(x, y, z), 0 < ε ≪ 1, that lead to violation (3.9). (Right) Geometric analog of the generalized triangle inequality: the total area of any three faces in a tetrahedron is greater than that of the fourth face.

3.3 Pairwise MMOT is a generalized metric  We will prove that the properties in Def. 4 hold for the following variant of Def. 5.

DEFINITION 6. (Pairwise MMOT distance) Let {d^{i,j}}_{i,j} be a set of distances of the form d^{i,j} : Ω^i × Ω^j → R with d^{i,j}(Ω^i_s, Ω^j_t) ≜ d^{i,j}_{s,t}. The Pairwise MMOT distance associated with {d^{i,j}}_{i,j} for n probability spaces with masses p^{i_{1:n}} over Ω^{i_{1:n}} is W(p^{i_{1:n}}) ≜ W^{i_{1:n}} with

(3.10)  W^{i_{1:n}} = min_{r : r^{i_s} = p^{i_s} ∀s ∈ [n]}  Σ_{1 ≤ s < t ≤ n} ⟨d^{i_s,i_t}, r^{i_s,i_t}⟩_ℓ^{1/ℓ},

where r is a mass over Ω^{i_1} × ... × Ω^{i_n}, with marginals r^{i_s} and r^{i_s,i_t} over Ω^{i_s} and Ω^{i_s} × Ω^{i_t}, respectively.

REMARK 5. Swapping min and Σ gives W^{i_{1:n}} = Σ_{1 ≤ s < t ≤ n} W^{i_s,i_t}, where W^{i_s,i_t} is the WD between Ω^{i_s} and Ω^{i_t}. This is trivially an n-metric (cf. [29]) but is different from eq. (3.10). In particular, it does not provide a joint optimal transport, which is important to many applications.

If n = 2, Def. 6 reduces to the WD. Our definition is a special case of the Kantorovich formulation for the general MMOT problem discussed in [18]. We can get Def. 6 from Def. 5 by defining d^{i_{1:n}} : Ω^{i_1} × ... × Ω^{i_n} → R such that d^{i_{1:n}}(w_{1:n}) = (Σ_{1 ≤ s < t ≤ n} (d^{i_s,i_t}(w_s, w_t))^ℓ)^{1/ℓ}, for some set of distances {d^{i,j}}_{i,j}. It is easy to prove that if the {d^{i,j}}_{i,j} are metrics (Def. 2), then d is an n-metric (Def. 3). However, because of Theorem 3.2, we know that this is not sufficient to guarantee that the pairwise MMOT distance is an n-metric, which only makes the proof of the next theorem all the more interesting.

THEOREM 3.3. If d is a metric (Def. 2), then the pairwise MMOT distance (Def. 6) associated with d is an (n, C(n))-metric, with C(n) ≥ 1.

We currently do not know the most general conditions under which Def. 3 is an n-metric. However, working with Def. 6 allows us to sharply bound the best possible C(n), which would unlikely be possible in a general setting. As Theorem 3.4 shows, the best C(n) is C(n) = Θ(n).

THEOREM 3.4. In Theorem 3.3, the constant C(n) can be made larger than (n − 1)/5 for n > 7, and there exist sample spaces Ω^{1:n}, mass functions p^{1:n}, and a metric d over Ω^{1:n} such that C(n) ≤ n − 1.
REMARK 6. Note that if Ω^i = Ω, ∀i, and d : Ω × ... × Ω → R is such that d(w_{1:n}) = min_{w ∈ Ω} Σ_{s ∈ [n]} d^{1,2}(w_s, w), where d^{1,2} is a metric, then d is an n-metric [29]. One can then prove [30] that Def. 2.4 is equivalent to W(p^{1:n}) = min_p Σ_{s ∈ [n]} W(p^s, p), which is also called the Wasserstein barycenter distance (WBD) [31]. The latter definition makes W(p^{1:n}) a Fermat distance, from which it follows immediately via general results in [29] that it is an n-metric with C(n) = Θ(n). The pairwise MMOT is not a Fermat distance, and Thms. 3.3 and 3.4 do not follow from [29]. A novel proof strategy is required.

4 Main proof ideas

Our main technical contribution is our proof that the generalized triangle inequality – property 4 in Def. 4 – holds with C(n) ≥ (n − 1)/5, n > 7, if d is a metric (Def. 2), i.e. the first part of Theorem 3.4. We give this proof in this section. The other proofs are included in the Appendix. A full proof of Theorem 3.3 is in Appendix D, and the proof of the second part of Theorem 3.4 is in Appendix E.

Before we proceed, we give a short proof that the generalized triangle inequality holds with C(n) = 1 for n = 3 when d is a metric. This avoids some key ideas getting obscured by the heavy index notation that is unavoidable when dealing with a general n and a tighter C(n).

4.1 Proof of the generalized triangle inequality for n = 3, ℓ = 1, and C(n) = 1  We will prove that for any mass functions p^1, ..., p^4 over Ω^1, ..., Ω^4, respectively, if d^{i,j} : Ω^i × Ω^j → R is a metric for any i, j ∈ {1, ..., 4}, i ≠ j, then

(4.11)  W^{1,2,3} ≤ W^{1,2,4} + W^{1,3,4} + W^{2,3,4},

which we write more succinctly as W^{1,2,3} ≤ W^{\3} + W^{\2} + W^{\1}, using a new symbol W^{\r} whose meaning is obvious. We begin by expanding all of the terms in (4.11), namely,

W^{1,2,3} = ⟨d^{1,2}, p^{*1,2}⟩ + ⟨d^{1,3}, p^{*1,3}⟩ + ⟨d^{2,3}, p^{*2,3}⟩,

W^{\3} + W^{\2} + W^{\1} = ⟨d^{1,2}, p^{*(3)1,2}⟩ + ⟨d^{1,4}, p^{*(3)1,4}⟩ + ⟨d^{2,4}, p^{*(3)2,4}⟩
+ ⟨d^{1,3}, p^{*(2)1,3}⟩ + ⟨d^{1,4}, p^{*(2)1,4}⟩ + ⟨d^{3,4}, p^{*(2)3,4}⟩
+ ⟨d^{2,3}, p^{*(1)2,3}⟩ + ⟨d^{2,4}, p^{*(1)2,4}⟩ + ⟨d^{3,4}, p^{*(1)3,4}⟩,

where {p^{*i,j}} are the bivariate marginals of the optimal joint distribution p^{*1,2,3} for W^{1,2,3}, and {p^{*(r)i,j}} are the bivariate marginals of the optimal joint distribution for W^{\r}. Now we define the following probability mass function on Ω^1 × ... × Ω^4, namely, p^{1,2,3,4} such that

p^{1,2,3,4}_{s,t,l,u} = (p^{*(3)1,4}_{s,u} / p^{*4}_u)(p^{*(2)3,4}_{l,u} / p^{*4}_u)(p^{*(1)2,4}_{t,u} / p^{*4}_u) p^{*4}_u,

and recall that w.l.o.g. we assume that no element in Ω^i has zero mass, so the denominators are not zero. We have that

(4.12)  W^{1,2,3} ≤ ⟨d^{1,2}, p^{1,2}⟩ + ⟨d^{1,3}, p^{1,3}⟩ + ⟨d^{2,3}, p^{2,3}⟩,

since the bivariate marginals p^{1,2}, p^{1,3}, p^{2,3} of p^{1,2,3,4} are a feasible but suboptimal choice of minimizer in (3.10) in Def. 6.

It is convenient to introduce the following more compact notation: w_{i,j} ≜ ⟨d^{i,j}, p^{i,j}⟩ and w^*_{i,j,r} ≜ ⟨d^{i,j}, p^{*(r)i,j}⟩. Notice that, for any i, j, k and r, we have w_{i,j} ≤ w_{i,k} + w_{j,k} and w^*_{i,j,r} ≤ w^*_{i,k,r} + w^*_{j,k,r}. This follows directly from the assumption that the {d^{i,j}} are metrics. Without loss of generality, let us prove that w_{1,2} ≤ w_{1,3} + w_{2,3}:

(4.13)  w_{1,2} = Σ_{s,t} d^{1,2}_{s,t} p^{1,2}_{s,t} = Σ_{s,t,l} d^{1,2}_{s,t} p^{1,2,3}_{s,t,l}

(4.14)  ≤ Σ_{s,t,l} (d^{1,3}_{s,l} + d^{2,3}_{t,l}) p^{1,2,3}_{s,t,l} = w_{1,3} + w_{2,3}.

It is also easy to see that w_{i,j} = w_{j,i} and that w^*_{i,j,r} = w^*_{j,i,r}. Using this notation and (4.12) we can write W^{1,2,3} ≤ w_{1,2} + w_{1,3} + w_{2,3} ≤ (w_{1,4} + w_{2,4}) + (w_{1,4} + w_{3,4}) + (w_{2,4} + w_{3,4}), and, noticing that the bivariate marginals of p^{1,2,3,4} satisfy p^{1,4} = p^{*(3)1,4}, p^{3,4} = p^{*(2)3,4}, and p^{2,4} = p^{*(1)2,4},

(4.15)  W^{1,2,3} ≤ (w^*_{1,4,3} + w^*_{2,4,1}) + (w^*_{1,4,3} + w^*_{3,4,2}) + (w^*_{2,4,1} + w^*_{3,4,2}).

At the same time, also using this new notation, we can re-write the r.h.s. of (4.11) as

(4.16)  W^{\3} + W^{\2} + W^{\1} = (w^*_{1,2,3} + w^*_{1,4,3} + w^*_{2,4,3}) + (w^*_{1,3,2} + w^*_{1,4,2} + w^*_{3,4,2}) + (w^*_{2,3,1} + w^*_{2,4,1} + w^*_{3,4,1}).

To finish the proof we show that the r.h.s. of (4.15) can be upper bounded by the r.h.s. of (4.16). We use the triangle inequality of w^*_{i,j,k} and apply it to the 1st, 4th, and 5th terms on the r.h.s. of (4.15) as specified by the parentheses:

(w^*_{1,4,3} + w^*_{2,4,1}) + (w^*_{1,4,3} + w^*_{3,4,2}) + (w^*_{2,4,1} + w^*_{3,4,2}) ≤ ((w^*_{1,2,3} + w^*_{2,4,3}) + w^*_{2,4,1}) + (w^*_{1,4,3} + (w^*_{1,3,2} + w^*_{1,4,2})) + ((w^*_{2,3,1} + w^*_{3,4,1}) + w^*_{3,4,2}),

and observe that the terms in the r.h.s. of this last inequality are accounted for on the r.h.s. of (4.16). This ends the proof.

We note that this last step, figuring out to which terms to apply the triangle inequality property of w^* such that we can "cover" the r.h.s. of (4.15) with the r.h.s. of (4.16), is critical and hard to generalize in a proof for an arbitrary n. Not only that, but the fact that we want to prove that the MMOT triangle inequality holds for C(n) = Θ(n) makes this last step even harder. We define a general procedure for expanding (using the triangle inequality) and matching terms in our general proof using a special hash function, described next. It will play a critical role in our general proof.
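The composite mass p^{1,2,3,4} defined above is a product of conditionals glued along Ω^4, and its key property — that its bivariate marginals over (1,4), (3,4), and (2,4) reproduce the three couplings it was built from — can be checked numerically. A sketch with hypothetical couplings sharing the marginal over Ω^4:

```python
import numpy as np

def compose(q14, q34, q24):
    """p[s,t,l,u] = q14[s,u] * q24[t,u] * q34[l,u] / p4[u]**2, where p4
    is the marginal over Omega^4 shared by the three couplings -- the
    same gluing pattern as p^{1,2,3,4} in the n = 3 proof."""
    p4 = q14.sum(axis=0)
    assert np.allclose(p4, q34.sum(axis=0)) and np.allclose(p4, q24.sum(axis=0))
    return np.einsum('su,tu,lu->stlu', q14, q24, q34) / p4 ** 2

# Hypothetical couplings with common marginal p4 = (0.4, 0.6) over Omega^4.
q14 = np.array([[0.1, 0.5], [0.3, 0.1]])
q34 = np.array([[0.2, 0.2], [0.2, 0.4]])
q24 = np.array([[0.4, 0.3], [0.0, 0.3]])
p = compose(q14, q34, q24)
assert np.allclose(p.sum(), 1.0)             # a valid mass function
assert np.allclose(p.sum(axis=(1, 2)), q14)  # marginal over (1, 4)
assert np.allclose(p.sum(axis=(0, 1)), q34)  # marginal over (3, 4)
assert np.allclose(p.sum(axis=(0, 2)), q24)  # marginal over (2, 4)
```

The three marginal assertions are precisely the identities p^{1,4} = p^{*(3)1,4}, p^{3,4} = p^{*(2)3,4}, and p^{2,4} = p^{*(1)2,4} used to pass from (4.12) to (4.15).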
4.2 Special hash function  To prove that property 4 in Def. 4 holds with C(n) ≥ (n − 1)/5, n > 7, we need the following tool.

DEFINITION 7. The map H′^n transforms a triple (i, j, r), 1 ≤ i < j ≤ n, r ∈ [n − 1], into either 2, 3, or 4 triples according to

(4.17)  (i, j, r) ↦ H′^n(i, j, r) = H′^n_1(i, j, r) ⊕ H′^n_1(j, i, r),

(4.18)  H′^n_1(i, j, r) = {(i, r, h′(i, r))} if j = h′(i, r), and H′^n_1(i, j, r) = {(i, j, h′(i, r)), (j, r, h′(i, r))} if j ≠ h′(i, r),

(4.19)  h′(i, r) = 1 + ((i + r − 1) mod n) if i < n, and h′(i, r) = 1 + (r mod (n − 1)) if i = n.

We assume that the first two components of each output triple are ordered. For example, (i, r, h′(j, r)) ≡ (min{i, r}, max{i, r}, h′(j, r)).

The following property of H′^n is critical to lower bound C(n). Its proof is in Appendix B.

LEMMA 4.1. Let (a, b, c) ∈ H′^n(i, j, r), 1 ≤ i < j ≤ n, r ∈ [n − 1]. Then, 1 ≤ a ≤ b ≤ n, 1 ≤ c ≤ n, and c ∉ {a, b}. Furthermore,

(4.20)  ⊕_{1 ≤ i < j ≤ n, r ∈ [n−1]} H′^n(i, j, r)

has at most 5 copies of each triple, where two triples are equal iff they agree component-wise.

REMARK 7. Note that we might have a = b in a triple (a, b, c) output by H′^n. For example, if n = 4, all 5 triples (1, 2, 3), (1, 3, 2), (2, 3, 2), (2, 3, 3), and (2, 3, 4) map to (2, 3, 1). Also, both (1, 2, 1) and (1, 4, 1) map to (1, 1, 2), whose first two components are equal.

4.3 Useful lemmas  We also need the following lemmas, whose proofs are in Appendix C.

LEMMA 4.2. Let p be as in Def. 1 eq. (2.3) for some q^k and {q^{i|k}}_{i ∈ [n]\k}. Let p^i and p^{i,k}, i ≠ k, be the marginals of p over Ω^i and Ω^i × Ω^k, respectively. Let q^{i,k} = q^{i|k} q^k, i ≠ k, and let q^i be its marginal over Ω^i. We have that p^i = q^i ∀i, and p^{i,k} = q^{i,k} ∀i ≠ k.

LEMMA 4.3. Let d be a metric and p a mass over Ω^1 × ... × Ω^n. Let p^{i,j} be the marginal of p over Ω^i × Ω^j. Define w_{i,j} ≜ ⟨d^{i,j}, p^{i,j}⟩_ℓ^{1/ℓ}. For any i, j, k ∈ [n] and ℓ ∈ N we have that w_{i,j} ≤ w_{i,k} + w_{k,j}.

4.4 Proof of the lower bound on C(n)  We will show that (n − 1) W^{1,...,n} ≤ 5 Σ_{r=1}^{n} W^{1,...,r−1,r+1,...,n+1}. For r ∈ [n], let p^{(*r)} be a minimizer for W^{1,...,r−1,r+1,...,n+1}. We would normally use r^{(*r)} for this minimizer but, to avoid confusion between r and r, we avoid doing so. For i, j ∈ [n + 1]\{r}, let p^{(*r)i,j} be the marginal of p^{(*r)} for the sample space Ω^i × Ω^j. Since p^{(*r)} satisfies the constraints in (3.10), its marginal over Ω^i equals p^i.

Let h′(·, ·) be the map in (4.19). For each r ∈ [n − 1], define the mass function over Ω^1 × ... × Ω^n

(4.21)  q^{(r)} = G(p^r, {p^{(*h′(i,r)) i|r}}_{i ∈ [n]\r}),

where p^{(*h′(i,r)) i|r} satisfies p^{(*h′(i,r)) i|r} p^r = p^{(*h′(i,r)) i,r}. Note that h′(i, r) ∉ {i, r}, ∀ 1 ≤ i ≤ n and r ∈ [n − 1]. Thus, p^{(*h′(i,r)) i,r} and p^{(*h′(i,r)) i|r} exist. Let q^{(r)i} be the marginal of q^{(r)} over Ω^i, and q^{(r)i,j} its marginal over Ω^i × Ω^j.

By Lemma 4.2, we know that q^{(r)i} equals p^i (given) for all i ∈ [n], and hence q^{(r)} satisfies the optimization constraints in (3.10) for W^{1,...,n}. Therefore, we can write

(4.22)  (n − 1) W^{1,...,n} = Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} ⟨d^{i,j}, p^{*i,j}⟩_ℓ^{1/ℓ} ≤ Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} ⟨d^{i,j}, q^{(r)i,j}⟩_ℓ^{1/ℓ},

where p^{*i,j} is the bivariate marginal over Ω^i × Ω^j of the minimizer p^* for W^{1,...,n}.

We now bound each term in the innermost sum on the r.h.s. of (4.22) as

(4.23)  ⟨d^{i,j}, q^{(r)i,j}⟩_ℓ^{1/ℓ} ≤ ⟨d^{i,r}, q^{(r)i,r}⟩_ℓ^{1/ℓ} + ⟨d^{r,j}, q^{(r)r,j}⟩_ℓ^{1/ℓ}    (a)

(4.24)  = ⟨d^{i,r}, q^{(r)i,r}⟩_ℓ^{1/ℓ} + ⟨d^{j,r}, q^{(r)j,r}⟩_ℓ^{1/ℓ}    (b)

(4.25)  = ⟨d^{i,r}, p^{(*h′(i,r)) i,r}⟩_ℓ^{1/ℓ} + ⟨d^{j,r}, p^{(*h′(j,r)) j,r}⟩_ℓ^{1/ℓ},    (c)

where i ≠ r, r ≠ j, and: (a) holds by Lemma 4.3; (b) holds because d is symmetric; and (c) holds because, by Lemma 4.2, q^{(r)i,r} = p^{(*h′(i,r)) i,r} and q^{(r)j,r} = p^{(*h′(j,r)) j,r}.

Bounding the r.h.s. of (4.22) using (4.23)–(4.25), we re-write the resulting inequality using the notation

(4.26)  (n − 1) W^{1,...,n} = Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} w_{(i,j,r)} ≤ Σ_{r=1}^{n−1} Σ_{1 ≤ i < j ≤ n} ( v_{(i,r,h′(i,r))} + v_{(j,r,h′(j,r))} ),

where we implicitly assume that the first two components of each triple on the r.h.s. of (4.26) are ordered.
That is, if e.g. r < i, then (r, i, h′(i, r)) should be read as (i, r, h′(i, r)). Each w_{(i,j,r)} represents one ⟨d^{i,j}, p^{*i,j}⟩_ℓ^{1/ℓ} on the l.h.s. of (4.22), and each v_{(s,t,l)} represents ⟨d^{s,t}, p^{(*l) s,t}⟩_ℓ^{1/ℓ} if s ≠ t, and is zero if s = t. Since h′(i, r) ∉ {i, r}, when i ≠ r the mass p^{(*h′(i,r)) i,r} exists.

Finally, using this same compact notation, we write

(4.27)  5 Σ_{r=1}^{n} W^{1,...,r−1,r+1,...,n+1} = 5 Σ_{r=1}^{n} Σ_{i,j ∈ [n+1]\{r}, i<j} v_{(i,j,r)},

and now we will show that (4.27) upper-bounds the r.h.s. of (4.26), finishing the proof.

First, by Lemma 4.3 and the symmetry of d, observe that the following inequalities are true

(4.28)  v_{(i,r,h′(i,r))} ≤ v_{(i,j,h′(i,r))} + v_{(j,r,h′(i,r))},

(4.29)  v_{(j,r,h′(j,r))} ≤ v_{(i,j,h′(j,r))} + v_{(i,r,h′(j,r))},

as long as, for each triple (a, b, c) in the above expressions, c ∉ {a, b}. We will use inequalities (4.28) and (4.29) to upper bound some of the terms on the r.h.s. of (4.26), and then we will show that the resulting sum can be upper bounded by (4.27). In particular, for each (i, j, r) considered in the r.h.s. of (4.26), we will apply inequalities (4.28) and (4.29) such that the terms v_{(a,b,c)} that we get after their use have triples (a, b, c) that match the triples in H′^n(i, j, r), defined in Def. 7. To be concrete, for example, if H′^n maps (i, j, r) to {(i, r, h′(i, r)), (r, j, h′(j, r))}, then we do not apply (4.28) and (4.29), and we leave v_{(i,r,h′(i,r))} + v_{(r,j,h′(j,r))} as is on the r.h.s. of (4.26). If, for example, H′^n maps (i, j, r) to {(i, r, h′(i, r)), (i, j, h′(j, r)), (i, r, h′(j, r))}, then we leave the first term in v_{(i,r,h′(i,r))} + v_{(r,j,h′(j,r))} untouched, but we upper bound the second term using (4.29) to get v_{(i,r,h′(i,r))} + v_{(i,j,h′(j,r))} + v_{(i,r,h′(j,r))}.

After proceeding in this fashion, and by Lemma 4.1, we know that all of the terms v_{(a,b,c)} that we obtain have triples (a, b, c) with c ∉ {a, b}, c ∈ [n − 1], and 1 ≤ a ≤ b ≤ n. Therefore, these terms are either zero (if a = b) or appear in (4.27). Also because of Lemma 4.1, each triple (a, b, c) with non-zero v_{(a,b,c)} will not appear more than 5 times. Therefore, the upper bound we build with the help of h′ for the r.h.s. of (4.26) can be upper bounded by (4.27).

5 Numerical experiments

We illustrate how an MMOT which defines an n-metric, n > 2, and pairwise MMOT in particular, improves a task of clustering graphs compared to using an OT that defines a 2-metric, or a non-n-metric MMOT.

We cluster graphs by i) computing their spectrum, ii) treating each spectrum as a probability distribution, iii) using WD and three different MMOTs to compute distances among these distributions, and iv) feeding these distances to distance-based clustering algorithms to recover the true cluster memberships. We use spectral clustering based on normalized random-walk Laplacians [32] to produce one clustering solution out of the pairwise graph distances computed via WD. We also produce clustering solutions out of the graph triple-wise distances computed via Def. 6 (an n-metric), via WBD in Remark 6 (also an n-metric), and via W as in Thrm. 3.2 (a non-n-metric). To do so, we use the hypergraph-based clustering methods NH-Cut [33] and TTM [34, 27]. Code for our experiments and details about our setup are in https://drive.google.com/drive/folders/11_MqRx29Yq-KuZYUSsOOhK7EbmTAIqu9?usp=sharing.

5.1 Synthetic graphs dataset  We generate 7 equal-sized synthetic clusters of graphs by including in the ith cluster multiple random perturbations of: i = 1) a complete graph, i = 2) a complete bipartite graph, i = 3) a cyclic chain, i = 4) a k-dimensional cube, i = 5) a K-hop lattice, i = 6) a periodic 2D grid, or i = 7) an Erdős–Rényi graph. A random class prediction has a 0.857 error rate. We repeat this cluster generation 100 times for 100 independent experiments, and collect performance statistics.

Figure 3: Comparing the effect that different distances and metrics have on clustering synthetic graphs. (Histograms of the fraction of misclassified graphs over the repetitions. Hypergraph clustering via NH-Cut: pairwise MMOT mean 0.615, barycenter mean 0.623, non-n-metric mean 0.707. Hypergraph clustering via TTM: pairwise MMOT mean 0.617, barycenter mean 0.622, non-n-metric mean 0.694. Spectral clustering with WD: mean 0.722.)

Figure 3-(left, center) shows that both TTM and NH-Cut work better when hyperedges are computed using an n-metric, and that pairwise MMOT works better than WBD. To double check that this is due to the n-metric properties, we perturbed W to introduce triangle inequality violations (i.e. violations of Def. 4-4 with C(3) = 1).
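Steps i)–iii) of our clustering pipeline can be sketched in a few lines; the two toy graphs and the use of SciPy's 1-D Wasserstein distance below are illustrative stand-ins, not the setup from our repository.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def rw_laplacian_spectrum(adj):
    """Steps i)-ii): eigenvalues of the normalized random-walk Laplacian
    I - D^{-1} A, sorted, treated as the support of a uniform distribution."""
    deg = adj.sum(axis=1)
    lap = np.eye(len(adj)) - adj / deg[:, None]
    return np.sort(np.linalg.eigvals(lap).real)

# Step iii), pairwise (2-metric) case: WD between two graph spectra.
k4 = np.ones((4, 4)) - np.eye(4)                     # complete graph K4
path4 = np.diag(np.ones(3), 1); path4 = path4 + path4.T  # path on 4 nodes
dist = wasserstein_distance(rw_laplacian_spectrum(k4),
                            rw_laplacian_spectrum(path4))
```

In the triple-wise case one would instead feed three spectra into an MMOT solver (e.g. the LP behind (3.10)) and threshold the resulting multi-distance to form hyperedges for NH-Cut or TTM.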
We observe that this leads to worse MMOT performance. Table 1 shows the effect of adding 20% violations to different MMOT distances. These additions clearly affect pairwise-MMOT and barycenter-MMOT, both n-metrics, but the non-n-metric is not affected much by the injection of triangle inequality violations. Details about how W is perturbed are in our code repository.

    Violations  Clustering  Pairwise  Barycenter  Non-n-metric
    No          NH-Cut      0.615     0.623       0.707
    Yes         NH-Cut      0.632     0.632       0.704
    No          TTM         0.617     0.622       0.694
    Yes         TTM         0.627     0.634       0.696

Table 1: Triangle inequality violations in W degrade clustering performance with n-metrics more than with non-n-metrics.

Figure 3-(right) shows that clustering using only pairwise relationships among graphs leads to worse accuracy than using triple-wise relationships as in Figure 3-(left, center). This has been pointed out in [35].

5.2 Molecular graphs dataset  This experiment is motivated by the important task in chemistry of clustering chemical compounds, represented as graphs, by their structure [36, 37, 38]. We use the molecular dataset in the supplementary material of [39], which can be downloaded at https://pubs.acs.org/doi/abs/10.1021/ci034143r#_i21. It contains five types of compounds: cyclooxygenase-2 inhibitors, benzodiazepine receptor ligands, estrogen receptor ligands, dihydrofolate reductase inhibitors, and monoamine oxidase inhibitors. We randomly sample an equal number of molecules from each type to build each cluster. A random class prediction has a 0.8 error rate. We repeat this sampling 100 times independently for each of 100 trials, and collect performances.

Figure 4: Comparing the effect that different distances and metrics have on clustering molecular graphs. (Histograms of the fraction of misclassified graphs over the repetitions. Hypergraph clustering via NH-Cut: n-metric mean 0.526, barycenter mean 0.645, non-n-metric mean 0.684. Hypergraph clustering via TTM: n-metric mean 0.527, barycenter mean 0.651, non-n-metric mean 0.656. Spectral clustering with WD: mean 0.662.)

Figure 4-(left, center) shows that both TTM and NH-Cut work better when hyperedges are computed using n-metrics, and Figure 4-(right) shows that clustering using pairwise relationships performs worse than using triple-wise relations.

There is a starker difference between n-metrics and non-n-metrics, seen in Figure 1. The number of possible 3-sized hyperedges is cubic in the number of graphs. Thus, in our experiments we randomly sample z triples (i, j, k) and only for these we create a hyperedge with weight W^{i,j,k}. Figure 1 shows the effect of z on performance. Comparing more graphs, i.e. increasing z, should improve clustering. However, for a non-n-metric, as z grows, triangle inequality violations can appear that introduce confusion: a graph can be "close" to two clusters that are far away, confusing TTM and NH-Cut. This offsets the benefits of a high z and results in the flat curves in Figure 1.

6 Future work

We have shown that a generalization of optimal transport to multiple distributions, the pairwise multi-marginal optimal transport (pairwise MMOT), leads to a multi-distance that satisfies generalized metric properties. In particular, we have proved that the generalized triangle inequality that it satisfies cannot be improved, up to a linear factor. This opens the door to using pairwise MMOT in combination with several algorithms whose good performance depends on metric properties. Meanwhile, for a general MMOT, we have proved that the cost function being a generalized metric is not enough to guarantee that MMOT defines a generalized metric. In future work, we seek to find new sufficient conditions under which other variants of MMOT lead to generalized metrics, and, for certain families of MMOT, find necessary conditions for these same properties to hold.

References

[1] L. V. Kantorovich, "On the translocation of masses," in Dokl. Akad. Nauk SSSR, vol. 37, pp. 199–201, 1942.
[2] J. Solomon et al., "Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains," ACM Trans. Graph., vol. 34, no. 4, p. 66, 2015.
[3] M. Arjovsky et al., "Wasserstein generative adversarial networks," in ICML, 2017.
[4] H. Fan, H. Su, and L. J. Guibas, "A point set generation network for 3D object reconstruction from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613, 2017.
[5] B. B. Damodaran et al., "DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation," arXiv preprint arXiv:1803.10081, 2018.
[6] M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly, M. Cuturi, G. Peyré, and J.-L. Starck, “Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning,” SIAM Journal on Imaging Sciences, vol. 11, no. 1, pp. 643–678, 2018.
[7] L. Ambrosio and N. Gigli, “A user’s guide to optimal transport,” in Modelling and Optimisation of Flows on Networks, pp. 1–155, Springer, 2013.
[8] E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with side-information,” in Advances in Neural Information Processing Systems, pp. 521–528, 2003.
[9] J. A. Hartigan, Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[10] K. L. Clarkson, “Nearest-neighbor searching and metric space dimensions,” in Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pp. 15–59, 2006.
[11] K. L. Clarkson, “Nearest neighbor queries in metric spaces,” Discrete & Computational Geometry, vol. 22, no. 1, pp. 63–93, 1999.
[12] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104, 2006.
[13] F. Angiulli and C. Pizzuti, “Fast outlier detection in high dimensional spaces,” in European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15–27, Springer, 2002.
[14] P. Indyk, “Sublinear time algorithms for metric space problems,” in Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pp. 428–434, 1999.
[15] M. R. Ackermann, J. Blömer, and C. Sohler, “Clustering for metric and nonmetric distance measures,” ACM Transactions on Algorithms (TALG), vol. 6, no. 4, pp. 1–26, 2010.
[16] F. Mémoli, “Gromov–Wasserstein distances and the metric approach to object matching,” Foundations of Computational Mathematics, vol. 11, no. 4, pp. 417–487, 2011.
[17] B. Pass, “On the local structure of optimal measures in the multi-marginal optimal transportation problem,” Calculus of Variations and Partial Differential Equations, vol. 43, no. 3-4, pp. 529–536, 2012.
[18] B. Pass, “Multi-marginal optimal transport: theory and applications,” ESAIM: Mathematical Modelling and Numerical Analysis, vol. 49, no. 6, pp. 1771–1790, 2015.
[19] C. T. Li and V. Anantharam, “Pairwise multi-marginal optimal transport and embedding for earth mover’s distance,” arXiv preprint arXiv:1908.01388, 2019.
[20] J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan, “Multi-marginal Wasserstein GAN,” in Advances in Neural Information Processing Systems, pp. 1774–1784, 2019.
[21] B. Pass, “Multi-marginal optimal transport and multi-agent matching problems: uniqueness and structure of solutions,” arXiv preprint arXiv:1210.7372, 2012.
[22] G. Peyré, M. Cuturi, et al., “Computational optimal transport,” Foundations and Trends in Machine Learning, vol. 11, no. 5-6, pp. 355–607, 2019.
[23] A. Gerolin, A. Kausamo, and T. Rajala, “Duality theory for multi-marginal optimal transport with repulsive costs in metric spaces,” ESAIM: Control, Optimisation and Calculus of Variations, vol. 25, p. 62, 2019.
[24] A. Moameni and B. Pass, “Solutions to multi-marginal optimal transport problems concentrated on several graphs,” ESAIM: Control, Optimisation and Calculus of Variations, vol. 23, no. 2, pp. 551–567, 2017.
[25] K. Singh and S. Upadhyaya, “Outlier detection: applications and techniques,” International Journal of Computer Science Issues (IJCSI), vol. 9, no. 1, p. 307, 2012.
[26] C. Dwork and J. Lei, “Differential privacy and robust statistics,” in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 371–380, 2009.
[27] D. Ghoshdastidar and A. Dukkipati, “A provable generalized tensor spectral method for uniform hypergraph partitioning,” in International Conference on Machine Learning, pp. 400–409, 2015.
[28] P. Purkait, T.-J. Chin, A. Sadri, and D. Suter, “Clustering with hypergraphs: the case for large hyperedges,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1697–1711, 2016.
[29] G. Kiss, J.-L. Marichal, and B. Teheux, “A generalization of the concept of distance based on the simplex inequality,” Beiträge zur Algebra und Geometrie/Contributions to Algebra and Geometry, vol. 59, no. 2, pp. 247–266, 2018.
[30] G. Carlier and I. Ekeland, “Matching for teams,” Economic Theory, vol. 42, no. 2, pp. 397–418, 2010.
[31] M. Agueh et al., “Barycenters in the Wasserstein space,” SIAM Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, 2011.
[32] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[33] D. Ghoshdastidar, A. Dukkipati, et al., “Consistency of spectral hypergraph partitioning under planted partition model,” The Annals of Statistics, vol. 45, no. 1, pp. 289–315, 2017.
[34] D. Ghoshdastidar and A. Dukkipati, “Uniform hypergraph partitioning: Provable tensor methods and sampling techniques,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 1638–1678, 2017.
[35] D. Zhou, J. Huang, and B. Schölkopf, “Learning with hypergraphs: Clustering, classification, and embedding,” in Advances in Neural Information Processing Systems, pp. 1601–1608, 2007.
[36] S. J. Wilkens, J. Janes, and A. I. Su, “HierS: hierarchical scaffold clustering using topological chemical graphs,” Journal of Medicinal Chemistry, vol. 48, no. 9, pp. 3182–3193, 2005.
[37] M. Seeland, A. K. Johannes, and S. Kramer, “Structural clustering of millions of molecular graphs,” in Proceedings of the 29th Annual ACM Symposium on Applied Computing, 2014.
[38] M. J. McGregor and P. V. Pallai, “Clustering of large databases of compounds: using the MDL “keys” as structural descriptors,” Journal of Chemical Information and Computer Sciences, vol. 37, no. 3, pp. 443–448, 1997.
[39] J. J. Sutherland, L. A. O’Brien, and D. F. Weaver, “Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships,” Journal of Chemical Information and Computer Sciences, vol. 43, no. 6, pp. 1906–1915, 2003.
[40] L. Torres, P. Suárez-Serrato, and T. Eliassi-Rad, “Non-backtracking cycles: length spectrum theory and graph mining applications,” Applied Network Science, vol. 4, no. 1, p. 41, 2019.
[41] V. Batagelj and M. Zaveršnik, “Fast algorithms for determining (generalized) core groups in social networks,” Advances in Data Analysis and Classification, vol. 5, no. 2, pp. 129–145, 2011.
[42] D. Constantine and J.-F. Lafont, “Marked length rigidity for one-dimensional spaces,” Journal of Topology and Analysis, vol. 11, no. 03, pp. 585–621, 2019.
A Details for proof of Theorem 3.2

…overlaps: H_1^n(i, j) vs. H_2^n(i, j); H_1^n(i', j') vs. H_2^n(i', j'); H_1^n(i, j) vs. H_1^n(i', j'); H_1^n(i', j') vs. H_2^n(i', j'). The two combinations left are H_1^n(i, j) vs. H_2^n(i', j') and H_1^n(i', j') vs. H_2^n(i, j). We notice that they are symmetric and, because the choice of the tuples (i, j), (i', j') is arbitrary, we only need to show that H_1^n(i, j) and H_2^n(i', j') do not have overlaps, given (i, j) ≠ (i', j').

H_1^n(i, j) and H_2^n(i', j') each have two possibilities for the form of their output. Thus, together, there are four possibilities to consider. None of them have an overlap, which we show by contradiction.

1. H_1^n(i, j) = {(i, n+1, h(i))} and H_2^n(i', j') = {(j', n+1, h(j'))}. If these single-element sets have an overlap, that implies that i = j', but, according to the definition, i = 1 and i' = j' − 1, which implies j' > 1.

2. H_1^n(i, j) = {(i, n+1, h(i))} and H_2^n(i', j') = {(i', j', h(j')), (i', n+1, h(j'))}. For them to have an overlap, h(i) = h(j'). That requires i = j', which is contradictory to i = 1 and i' < j' − 1 at the same time.

3. H_1^n(i, j) = {(i, j, h(i)), (j, n+1, h(i))} and H_2^n(i', j') = {(i', j', h(j')), (i', n+1, h(j'))}. For the first two components to be equal, i = i', j = j', and i = j', which is contradictory to i' < j' − 1. For the second two components to be equal, j = i' and i = j', which is contradictory to i < j or i' < j'. Because of the existence of "n+1", the components at different positions cannot collide.

4. H_1^n(i, j) = {(i, j, h(i)), (j, n+1, h(i))} and H_2^n(i', j') = {(j', n+1, h(j'))}. This implies j' = j and j' = i, which is contradictory to i < j.

For example, if n = 3, then the possible tuples (1, 2), (1, 3), and (2, 3) get mapped respectively to (1, 2, 3), (2, 4, 3), (2, 4, 1), and (1, 4, 3), (1, 3, 2), (1, 4, 2), and (2, 3, 1), (3, 4, 1), (3, 4, 2), all of which are different and satisfy the claims in Lemma D.1.

We now prove the four metric properties in order. It is trivial to prove the first three properties given the definition of our distance function for the transport problem. Then, we provide a detailed proof for the triangle inequality.

D.2 Non-Negativity

Proof. The non-negativity of d^{i,j} and r^{i,j} implies that ⟨d^{i,j}, r^{i,j}⟩_ℓ ≥ 0, and hence that W ≥ 0.

…involves a set of distances {d̃^{a,b}}_{a,b} that satisfy d̃^{i,j} = d^{σ(i,j)}. Therefore, each term ⟨d̃^{i,j}, r^{i,j}⟩_ℓ involved in the computation of W(p^{σ(i_{1:n})}) can be rewritten as ⟨d^{σ^{-1}(i,j)}, r^{i,j}⟩_ℓ, which a simple reindexing of the summation Σ_{i<j} allows us to write as ⟨d^{i,j}, r^{σ(i,j)}⟩_ℓ. Since the mass function r has as supporting sample space Ω^{σ(i_1)} × … × Ω^{σ(i_n)}, the marginal r^{σ(i,j)} can be seen as the marginal q^{i,j} of a mass function q with support Ω^{i_1} × … × Ω^{i_n}. Therefore, minimizing Σ_{i<j} ⟨d̃^{i,j}, r^{i,j}⟩_ℓ^{1/ℓ} for r over Ω^{σ(i_1)} × … × Ω^{σ(i_n)} is the same as minimizing Σ_{i<j} ⟨d^{i,j}, q^{i,j}⟩_ℓ^{1/ℓ} for q over Ω^{i_1} × … × Ω^{i_n}.

D.4 Identity

Proof. We prove each direction of the equivalence separately. Recall that {p^i} are given; they are the masses for which we want to compute the pairwise MMOT.

"⇐=": If for each i, j ∈ [n] we have Ω^i = Ω^j, then m^i = m^j, and there exists a bijection b^{i,j}(·) from [m^i] to [m^j] such that Ω^i_s = Ω^j_{b^{i,j}(s)} for all s. If furthermore p^i = p^j, we can define an r for Ω^1 × … × Ω^n such that its univariate marginal over Ω^i, r^i, satisfies r^i = p^i, and such that its bivariate marginal over Ω^i × Ω^j, r^{i,j}, satisfies r^{i,j}_{s,t} = p^i_s if t = b^{i,j}(s), and zero otherwise. Such an r achieves an objective value of 0 in (3.10), the smallest value possible by the first metric property (already proved). Therefore, W^{1,…,n} = 0.

"=⇒": Now let r* be a minimizer of (3.10) for W^{1,…,n}. Let {r*^i} and {r*^{i,j}} be its univariate and bivariate marginals, respectively. If W^{1,…,n} = 0, then ⟨d^{i,j}, r*^{i,j}⟩_ℓ = 0 for all i, j. Let us consider a specific pair i, j, and, without loss of generality, let us assume that m^i ≤ m^j. Since, by assumption, we have that r*^i_s = p^i_s > 0 for all s ∈ [m^i], and r*^j_s = p^j_s > 0 for all s ∈ [m^j], there exists an injection b^{i,j}(·) from [m^i] to [m^j] such that r*^{i,j}_{s,b^{i,j}(s)} > 0 for all s ∈ [m^i]. Therefore, ⟨d^{i,j}, r*^{i,j}⟩_ℓ = 0 implies that d^{i,j}_{s,b^{i,j}(s)} = 0 for all s ∈ [m^i]. Therefore, since d is a metric, it must be that Ω^i_s = Ω^j_{b^{i,j}(s)} for all s ∈ [m^i]. Now let us suppose that there exists an r ∈ [m^j] that is not in the range of b^{i,j}. Since, by assumption, all of the elements of the sample spaces are different, it must be that d^{i,j}_{s,r} > 0 for all s ∈ [m^i]. Therefore, since ⟨d^{i,j}, r*^{i,j}⟩_ℓ = 0, it must be that r*^{i,j}_{s,r} = 0 for all s ∈ [m^i]. This contradicts the fact that Σ_{s∈[m^i]} r*^{i,j}_{s,r} = r*^j_r = p^j_r > 0 (the last inequality being true by assumption). Therefore, m^i = m^j, and the existence of b^{i,j} proves that Ω^i = Ω^j. At the same time, since d^{i,j}_{s,t} > 0 for all t ≠ b^{i,j}(s), it must be that r*^{i,j}_{s,t} = 0 for all t ≠ b^{i,j}(s). Therefore, p^i_s = p^j_{b^{i,j}(s)} for all s, i.e., p^i = p^j.

D.5 Generalized Triangle Inequality

Proof. Let p* be a minimizer for (the optimization problem associated with) W^{1,…,n}, and let p*^{i,j} be the marginal induced by p* for the sample space Ω^i × Ω^j. We would normally use r* for this minimizer, but, to avoid confusion between the index r and the mass function r, we avoid doing so. We can write that

(D.4)  W^{1,…,n} = Σ_{1≤i<j≤n} ⟨d^{i,j}, p*^{i,j}⟩_ℓ^{1/ℓ}.

For r ∈ [n], let p^{(∗r)} be a minimizer for W^{1,…,r−1,r+1,…,n+1}. We would normally use r^{(∗r)} for this minimizer, but, to avoid confusion between r and r, we avoid doing so. For i, j ∈ [n+1]\{r}, let p^{(∗r) i,j} be the marginal of p^{(∗r)} for the sample space Ω^i × Ω^j. Recall that since p^{(∗r)} satisfies the constraints in (3.10), its marginal for the sample space Ω^i is p^i, which is given in advance.

Let h(·) be the map defined in (D.2). Define the following mass function for Ω^1 × … × Ω^{n+1}:

(D.5)  q = G(p*^{n+1}, {p^{(∗h(i)) i|n+1}}_{i∈[n]}),

where p^{(∗h(i)) i|n+1} is defined as the mass function that satisfies p^{(∗h(i)) i|n+1} p*^{n+1} = p^{(∗h(i)) i,n+1}. Notice that since h(i) ∉ {i, n+1}, the probability p^{(∗h(i)) i,n+1} exists for all i ∈ [n].

Let q^{1,…,n} be the marginal of q for sample space Ω^1 × … × Ω^n, and q^{i,j} be the marginal of q for Ω^i × Ω^j. By Lemma 4.2, we know that the ith univariate marginal of q is p^i (given), and hence q^{1,…,n} satisfies the constraints associated with W^{1,…,n}. Therefore, we can write that

(D.6)  Σ_{1≤i<j≤n} ⟨d^{i,j}, p*^{i,j}⟩_ℓ^{1/ℓ} ≤ Σ_{1≤i<j≤n} ⟨d^{i,j}, q^{i,j}⟩_ℓ^{1/ℓ}.

By Lemma 4.3, inequality (a) below holds; because d is symmetric, (b) below holds; by the definition of q, (c) below follows. Therefore,

(D.7)  ⟨d^{i,j}, q^{i,j}⟩_ℓ^{1/ℓ} ≤(a) ⟨d^{i,n+1}, q^{i,n+1}⟩_ℓ^{1/ℓ} + ⟨d^{n+1,j}, q^{n+1,j}⟩_ℓ^{1/ℓ}
       =(b) ⟨d^{i,n+1}, q^{i,n+1}⟩_ℓ^{1/ℓ} + ⟨d^{j,n+1}, q^{j,n+1}⟩_ℓ^{1/ℓ}
(D.8)  =(c) ⟨d^{i,n+1}, p^{(∗h(i)) i,n+1}⟩_ℓ^{1/ℓ} + ⟨d^{j,n+1}, p^{(∗h(j)) j,n+1}⟩_ℓ^{1/ℓ}.

Let w(i,j) denote each term on the r.h.s. of (D.4), and w(i,j,r) denote ⟨d^{i,j}, p^{(∗r) i,j}⟩_ℓ^{1/ℓ}. Combining (D.6)–(D.8), we have

(D.9)  Σ_{1≤i<j≤n} w(i,j) ≤ Σ_{1≤i<j≤n} [w(i,n+1,h(i)) + w(j,n+1,h(j))].

Finally, we write

(D.10)  Σ_{r=1}^{n} W^{1,…,r−1,r+1,…,n+1} = Σ_{r=1}^{n} Σ_{i,j∈[n+1]\{r}, i<j} w(i,j,r),

and show that (D.10) upper-bounds the r.h.s. of (D.9).

First, by Lemma 4.3 and the symmetry of d, we have

(D.11)  w(i,n+1,h(i)) ≤ w(i,j,h(i)) + w(j,n+1,h(i)),
(D.12)  w(j,n+1,h(j)) ≤ w(i,j,h(j)) + w(i,n+1,h(j)),

as long as, for each triple (a, b, c) in the above expressions, c ∉ {a, b}. We will use these inequalities to upper bound some of the terms on the r.h.s. of (D.9), which can be further upper bounded by (D.10). In particular, we will apply inequalities (D.11) and (D.12) such that the terms w(a,b,c) that we get after their use have triples (a, b, c) that match the triples obtained via the map H^n defined in Section 4.3. To be concrete, for example, if H^n maps (i, j) to {(i, n+1, h(i)), (j, n+1, h(j))}, then we do not apply (D.11) and (D.12), and we leave w(i,n+1,h(i)) + w(j,n+1,h(j)) as is on the r.h.s. of (D.9). If, for example, H^n maps (i, j) to {(i, n+1, h(i)), (i, j, h(j)), (i, n+1, h(j))}, then we leave the first term in w(i,n+1,h(i)) + w(j,n+1,h(j)) on the r.h.s. of (D.9) untouched, but we upper bound the second term using (D.12) to get w(i,n+1,h(i)) + w(i,j,h(j)) + w(i,n+1,h(j)).

After proceeding in this fashion, and by Lemma D.1, we know that all of the terms w(a,b,c) that we obtain have triples (a, b, c) with c ∉ {a, b}, with c ∈ [n], and 1 ≤ a < b ≤ n+1. Therefore, these terms appear in (D.10). Also by Lemma D.1, we know that we do not get each triple more than once. Therefore, the upper bound that we just constructed with the help of H^n for the r.h.s. of (D.9) can be upper bounded by (D.10).

E Proof of upper bound in Theorem 3.4

Proof. Consider the following setup. Let m^i = m for all i ∈ [n], and Ω^i_s ∈ R for all i ∈ [n], s ∈ [m]. Define d such that d^{i,j}_{s,t} is |Ω^i_s − Ω^j_t| if s = t, and infinity otherwise. Let p^i_s = 1/m for all i ∈ [n], s ∈ [m].

Any optimal solution r* to the pairwise MMOT problem must have bivariate marginals that satisfy r*^{i,j}_{s,t} = (1/m) δ_{s,t}, and thus ⟨d^{i,j}, r*^{i,j}⟩_ℓ^{1/ℓ} = (1/m^{1/ℓ}) ‖Ω^i − Ω^j‖_ℓ, where we interpret Ω^i as a vector in R^m, and ‖·‖_ℓ is the vector ℓ-norm. Therefore, ignoring the factor 1/m^{1/ℓ}, we only need to prove that 4. in Def. 4 holds with C(n) = n − 1 when W^{1:n} is defined as Σ_{1≤i<j≤n} ‖Ω^i − Ω^j‖_ℓ. This in turn is a standard result, whose proof (in a more general form) can be found, e.g., in Example 2.4 in [29].
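The construction in Appendix E can be checked numerically on a small instance. The sketch below (Python with SciPy, although the paper's released code is Matlab; all names are ours) solves the pairwise MMOT linear program for n = 3 marginals with ℓ = 1 (the objective is then linear in the joint mass), replacing the infinite off-diagonal costs with a large finite constant, and recovers the value (1/m) Σ_{i<j} ‖Ω^i − Ω^j‖_1 attained by the diagonal coupling.

```python
# Sketch: pairwise MMOT LP for the setup in Appendix E (assumptions:
# n = 3 marginals, m = 2 atoms each, ell = 1, uniform p^i; "infinite"
# off-diagonal costs replaced by a large constant BIG).
import itertools
import numpy as np
from scipy.optimize import linprog

m, BIG = 2, 1e6
Omega = np.array([[0.0, 1.0],   # Omega^1
                  [2.0, 3.0],   # Omega^2
                  [5.0, 7.0]])  # Omega^3
n = Omega.shape[0]

# The joint mass r lives on m^n atoms indexed by (s1, s2, s3).
atoms = list(itertools.product(range(m), repeat=n))

def cost(i, j, s, t):
    # d^{i,j}_{s,t}: finite only on the diagonal, as in Appendix E.
    return abs(Omega[i, s] - Omega[j, t]) if s == t else BIG

# Objective: sum over pairs i < j of <d^{i,j}, r^{i,j}> (ell = 1 => linear).
c = [sum(cost(i, j, a[i], a[j]) for i, j in itertools.combinations(range(n), 2))
     for a in atoms]

# Constraints: every univariate marginal equals the uniform p^i.
A_eq, b_eq = [], []
for i in range(n):
    for s in range(m):
        A_eq.append([1.0 if a[i] == s else 0.0 for a in atoms])
        b_eq.append(1.0 / m)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pairwise_sum = sum(np.sum(np.abs(Omega[i] - Omega[j]))
                   for i, j in itertools.combinations(range(n), 2))
print(res.fun, pairwise_sum / m)  # the two values coincide
```

The optimal coupling puts all mass on the atoms (0, 0, 0) and (1, 1, 1), as any mass on a mixed atom would incur the BIG off-diagonal cost.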
F Details of numerical experiments

Graphs are everywhere, and classifying and clustering graphs are important in diverse areas. For example, [36] clusters graphs of apps' code execution to find malware; [38] clusters graphs that represent chemical compounds to understand their anti-cancer and cancer-inducing characteristics; [39] represents text as word-based dependency trees and clusters them to classify cellphone reviews; [40] clusters graphs that represent the secondary structure of proteins; [37] reviews other graph clustering applications.

A powerful and general approach to clustering is distance-based clustering, a type of connectivity-based clustering: objects that are similar, according to a given distance measure, are put into the same cluster. Its use for graph clustering requires a measure of the distance between graphs. The purpose of our experiments is to illustrate, via distance-based graph clustering, i) the advantages of using MMOT over OT, and ii) the advantages of using an n-metric MMOT over a non-n-metric MMOT.

F.1 Synthetic graphs Graphs: We create 7 clusters, each with 10 graphs. Each graph is a random perturbation (edge addition/removal with p = 0.05) of either 1) a complete graph, 2) a complete bipartite graph, 3) a cyclic chain, 4) a k-dimensional cube, 5) a K-hop lattice, 6) a periodic 2D grid, or 7) an Erdős–Rényi graph.

Vector data: We transform the graphs {G_i}_{i=1}^{70} into vectors {v^i}_{i=1}^{70} to be clustered. Each v^i is the (complex-valued) spectrum of a matrix M^i representing non-backtracking walks on G_i, which approximates the length spectrum μ_i of G_i [40]. The object μ_i uniquely identifies (the 2-core [41] of) G_i (up to an isomorphism) [42], but is too abstract to be used directly. Hence, we use its approximation v^i. The lengths of v^i and v^j for equal-sized G_i and G_j can be different, depending on how we approximate μ_i. We use distance-based clustering and OT (multi-)distances, since OT allows comparing objects of different lengths. Note that, unlike the length spectrum, the classical spectrum of a graph (the eigenvalues of, e.g., an adjacency matrix, Laplacian matrix, or random-walk matrix) has the advantage of having the same length for graphs with the same number of nodes. However, it does not uniquely identify a graph. For example, a star graph with 5 nodes and the graph that is the union of a square with an isolated node are co-spectral but are not isomorphic.

Distances: Each v^i is interpreted as a uniform distribution p^i over Ω^i = {v^i_k, k = 1, …}, the eigenvalues of M^i. We compute a sampled version T̂^A of the matrix T^A = {W^{i,j}}_{i,j∈[70]}, where W^{i,j} is the WD between p^i and p^j using a d^{i,j} defined by d^{i,j}_{s,t} = |v^i_s − v^j_t|. We compute a sampled version T̂^B of the tensor T^B = {W^{i,j,k}}_{i,j,k∈[70]}, where W^{i,j,k} is defined as in Def. 6 with d^{i,j} as for T^A. We compute a sampled version T̂^{B'} of the tensor T^{B'} = {W^{i,j,k}}_{i,j,k∈[70]}, where W^{i,j,k} is defined as in Remark 6 with d^{i,j} as for T^A. We compute a sampled version T̂^C of the tensor T^C with T^C_{i,j,k} = W^{i,j,k}, where W^{i,j,k} is defined as in Thrm. 3.2, but now considering points in the complex plane. The sampled tensors T̂^B, T̂^{B'}, and T̂^C are built by randomly selecting 100 triples (i, j, k) and setting T̂^B_{i,j,k} = T^B_{i,j,k}, T̂^{B'}_{i,j,k} = T^{B'}_{i,j,k}, and T̂^C_{i,j,k} = T^C_{i,j,k}. The non-sampled triples are given a very large value. The sampled matrix T̂^A is built by sampling (3/2) × 100 = 150 pairs (i, j) and setting T̂^A_{i,j} = T^A_{i,j}, and setting a large value for non-sampled pairs.

Clustering: We feed the distances T̂^A to a spectral clustering algorithm [32] based on normalized random-walk Laplacians to produce one clustering solution C^A; we feed the distances T̂^B to the two hypergraph-based clustering methods NH-Cut [33] and TTM [34, 27] to produce clustering solutions C^{B1} and C^{B2}, respectively; we use T̂^{B'} and NH-Cut and TTM to produce clustering solutions C^{B'1} and C^{B'2}; and we use T̂^C and NH-Cut and TTM to produce clustering solutions C^{C1} and C^{C2}. Both NH-Cut and TTM determine clusters by finding optimal cuts of a hypergraph where each hyperedge's weight is the MMOT distance among three graphs. Both NH-Cut and TTM require a threshold that is used to prune the hypergraph: edges whose weight (multi-distance) is larger than the threshold are removed. This threshold is tuned to minimize each clustering solution's error. All clustering solutions output 7 clusters.

Errors: For each clustering solution, we compute the fraction of misclassified graphs. In particular, if we use C^x(i) = k, x = A, B1, B2, B'1, B'2, C1, C2, to represent that clustering solution C^x assigns graph G_i to cluster k, then the error of this solution is

(F.13)  min_σ (1/70) Σ_{i=1}^{70} I(C^{ground truth}(i) ≠ C^x(σ(i))),

where the min is over all permutations σ, since the specific cluster IDs output by different algorithms have no real meaning. This experiment is repeated 100 times (random numbers being drawn independently among experiments) and the frequency of the errors is plotted in histograms in Figure 3.

Code: The code is fully written in Matlab 2020a and is available at https://drive.google.com/drive/folders/11_MqRx29Yq-KuZYUSsOOhK7EbmTAIqu9?usp=sharing. It is also included in the supplementary material files. It requires installing CVX, available at http://cvxr.com/cvx/download/. To produce Figure 3, open Matlab and run the file run_me_for_synthetic_experiments.m.
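The error criterion (F.13) minimizes over all identifications between output cluster IDs and ground-truth IDs, which is equivalent to minimizing over relabelings of the predicted clusters. A minimal sketch of that computation (Python here, while the released code is Matlab; names are ours; brute force over permutations is fine for the 5 or 7 clusters used in our experiments):

```python
# Sketch: clustering error minimized over cluster relabelings, as in (F.13).
# Assumption: labels are integers 0..k-1.
from itertools import permutations

def clustering_error(truth, pred, k):
    n = len(truth)
    best = n
    for perm in permutations(range(k)):
        # Relabel predicted cluster p as perm[p] and count mismatches.
        mismatches = sum(t != perm[p] for t, p in zip(truth, pred))
        best = min(best, mismatches)
    return best / n

truth = [0, 0, 1, 1, 2, 2]
pred  = [2, 2, 0, 0, 1, 1]  # a pure relabeling of truth
print(clustering_error(truth, pred, 3))  # -> 0.0
```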
F.2 Injection of triangle inequality violations To test if the better performance of pairwise-MMOT and WBD compared to that of the non-n-metric is due to the n-metric property, we perturb the tensor W^{i,j,k} to introduce triangle inequality violations. In particular, for each set of four different graphs (i, j, k, l), we find which among W^{i,j,k}, W^{i,j,l}, W^{i,l,k}, W^{l,j,k} we can change the least to produce a triangle inequality violation among these values. Let us assume that this value is W^{i,j,k}, and that to violate W^{i,j,k} ≤ W^{i,j,l} + W^{i,l,k} + W^{l,j,k}, the quantity W^{i,j,k} needs to increase at least by δ. We then increase W^{i,j,k} by 1.3 × δ. We repeat this procedure such that, in total, 20% of the entries of the tensor W get changed.
F.3 Real molecular graphs The details of our setup for real data are almost identical to the setup for the synthetic data. We explain only the major differences.

Graphs: The dataset that we use [39] contains the adjacency matrices of 467 graphs for cyclooxygenase-2 inhibitors, 405 graphs for benzodiazepine receptor ligands, 756 inhibitors of dihydrofolate reductase, 1009 estrogen receptor ligands, and 1641 monoamine oxidase inhibitors. In each experiment, we randomly get 10 graphs of each type, and prune them so that they have no node with a degree smaller than 1. We note that, unlike for the synthetic data, the molecular graphs have weighted adjacency matrices, whose entries can be 0, 1, or 2.

Vector data: For each graph, we get an approximation of the length spectrum just as explained for the synthetic data. However, to minimize the time our experiments take to run, we only keep the largest 16 eigenvalues.

Distances: For Figure 4, we compute the tensors T̂^B, T̂^{B'}, and T̂^C by sampling x = 600 triples (i, j, k) (i.e., hyperedges). For Figure 1, we perform different experiments with x ranging from 50 to 500.

Clustering: We force all clustering algorithms to output exactly 5 clusters.

Errors: We compute errors using (F.13), but with 70 replaced by 50. Each experiment is repeated 100 times (random numbers being drawn independently among experiments) and the frequency of the errors is plotted in the histograms in Figure 4. For Figure 1, each point on each curve is the average over 100 trials. Error bars are the standard deviation of the average.

Code: The code for this can be found at the same link as the code for the synthetic dataset. To produce Figure 4, open Matlab and run the file run_me_for_molecular_experiments.m.

References

[37] K. Riesen and H. Bunke, Graph Classification and Clustering Based on Vector Space Embedding, vol. 77, World Scientific, 2010.
[38] X. Kong and P. S. Yu, “Semi-supervised feature selection for graph classification,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 793–802, 2010.
[39] T. Kudo, E. Maeda, and Y. Matsumoto, “An application of boosting to graph classification,” in Advances in Neural Information Processing Systems, pp. 729–736, 2005.
[40] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “graph2vec: Learning distributed representations of graphs,” arXiv preprint arXiv:1707.05005, 2017.