Quantitative Convergence of Quadratically Regularized Linear Programs
Alberto González-Sanz† Marcel Nutz‡
August 9, 2024
Abstract
Linear programs with quadratic regularization are attracting renewed interest due to their
applications in optimal transport: unlike entropic regularization, the squared-norm penalty
gives rise to sparse approximations of optimal transport couplings. It is well known that the
solution of a quadratically regularized linear program over any polytope converges stationarily
to the minimal-norm solution of the linear program when the regularization parameter tends
to zero. However, that result is merely qualitative. Our main result quantifies the convergence
by specifying the exact threshold for the regularization parameter, after which the regularized
solution also solves the linear program. Moreover, we bound the suboptimality of the regu-
larized solution before the threshold. These results are complemented by a convergence rate
for the regime of large regularization. We apply our general results to the setting of optimal
transport, where we shed light on how the threshold and suboptimality depend on the number
of data points.
1 Introduction
Let c ∈ Rd and let P ⊂ Rd be a polytope. Moreover, let ⟨·, ·⟩ be an inner product on Rd and ∥ · ∥
its induced norm. We study the linear program

    min_{x ∈ P} ⟨c, x⟩        (LP)

and its quadratically regularized version

    min_{x ∈ P} ⟨c, x⟩ + ∥x∥²/η,    η > 0.        (QLP)

As η → ∞, the regularization vanishes and the solution xη of (QLP) converges to a particular solution x∗ of (LP), namely the solution with
smallest norm: x∗ = arg minx∈M ∥x∥2 , where M denotes the set of minimizers of (LP). Our main
goal is to describe how quickly this convergence happens.
The convergence is, in fact, stationary: there exists a threshold η ∗ such that xη = x∗ for all
η ≥ η ∗ . This was first established for linear programs in [32, Theorem 1] and [31, Theorem 2.1],
and was more recently rediscovered in the context of optimal transport [16, Property 5]. However,
those results are qualitative: they do not give a value or a bound for η ∗ . We shall characterize
the exact value of the threshold η ∗ (cf. Theorem 2.5), and show how this leads to computable
bounds in applications. This exact result raises the question about the speed of convergence as
η ↑ η ∗ . Specifically, we are interested in the convergence of the error E(η) = ⟨c, xη ⟩ − minx∈P ⟨c, x⟩
measuring how suboptimal the solution xη of (QLP) is when plugged into (LP). In Theorem 2.5, we
show that E(η) = o(η ∗ − η) as η ↑ η ∗ and give an explicit bound for E(η)/(η ∗ − η). After observing
that the curve η ↦ xη is piecewise affine, this linear rate can be understood as the slope of the last
segment of the curve before ending at x∗ . Figure 1 illustrates these quantities in a simple example.
Our results for η → ∞ are complemented by a convergence rate for the large regularization regime
η → 0 where xη tends to arg minx∈P ∥x∥2 ; cf. Proposition 2.7.
Figure 1: Suboptimality E(η) of (QOT) when µ = ν = (1/3) Σ_{i=1}^{3} δ_{i/3} and c(x, y) = ∥x − y∥². The horizontal axis shows η, with the breakpoint η1 and the threshold η∗ marked. Theorem 2.5 characterizes the location of η∗ and bounds the slope to the left of η∗.
While linear programs and their penalized counterparts go back far into the last century, much
of the recent interest is fueled by the surge of optimal transport in applications such as machine
learning (e.g., [26]), statistics (e.g., [37]), language and image processing (e.g., [3, 39]) and economics
(e.g., [22]). In its simplest form, the optimal transport problem between probability measures µ
and ν is

    inf_{γ ∈ Γ(µ,ν)} ∫ c(x, y) dγ(x, y),        (OT)
where Γ(µ, ν) denotes the set of couplings; i.e., probability measures γ with marginals µ and ν (see
[41, 42] for an in-depth exposition). Here c(·, ·) is a given cost function, most commonly c(x, y) =
∥x − y∥². In many applications the marginals represent observed data: data points X1, . . . , XN and Y1, . . . , YN are encoded in their empirical measures µ = (1/N) Σ_i δXi and ν = (1/N) Σ_i δYi. Writing also cij = c(Xi, Yj), the problem (OT) is a particular case of (LP) in dimension d = N × N. The
general linear program (LP) also includes other transport problems of recent interest, such as multi-
marginal optimal transport and Wasserstein barycenters [1], adapted Wasserstein distances [4] or
martingale optimal transport [6].
As the optimal transport problem is computationally costly (e.g., [38]), [15] proposed to reg-
ularize (OT) by penalizing with Kullback–Leibler divergence (entropy). Then, solutions can be
computed using the Sinkhorn–Knopp (or IPFP) algorithm, which has led to an explosion of high-
dimensional applications. Entropic regularization always leads to “dense” solutions (couplings
whose support contains all data pairs (Xi , Yj )) even though the unregularized problem (OT) typi-
cally has a sparse solution. In some applications that is undesirable; for instance, it may correspond
to blurrier images in an image processing task [8]. For that reason, [8] suggested the quadratic penalization

    inf_{γ ∈ Γ(µ,ν)} ∫ c(x, y) dγ(x, y) + (1/η) ∥dγ/d(µ ⊗ ν)∥²_{L²(µ⊗ν)},        (QOT)
where dγ/d(µ ⊗ ν) denotes the density of γ with respect to the product measure µ ⊗ ν. See also [20]
for a similar formulation of minimum-cost flow problems, the predecessors referenced therein, and
[16] for optimal transport with more general convex regularization. Quadratic regularization gives
rise to sparse solutions (see [8], and [34] for a theoretical result). Recent applications of quadrati-
cally regularized optimal transport include manifold learning [44] and image processing [28] while
[33] establishes a connection to maximum likelihood estimation of Gaussian mixtures. Computa-
tional approaches are developed in [18, 23, 24, 28, 40] whereas [30, 17, 5, 34] study theoretical
aspects with a focus on continuous problems. In that context, [29, 19] show Gamma convergence
to the unregularized optimal transport problem in the small regularization limit. Those results
are straightforward in the discrete case considered in the present work. Conversely, the stationary
convergence studied here does not take place in the continuous case.
For linear programs with entropic regularization, [13] established that solutions converge expo-
nentially to the limiting unregularized counterpart. More recently, [43] gave an explicit bound for
the convergence rate. The picture for entropic regularization is quite different from that of quadratic regularization, as the convergence is not stationary. For instance, in optimal transport, the support of the
regularized solution contains all data pairs for any value of the regularization parameter, collapsing
only at the unregularized limit. Nevertheless, our analysis benefits from some of the technical ideas
in [43], specifically for the proof of the slope bound (3). The small regularization limit has also
attracted a lot of attention in continuous optimal transport (e.g., [2, 7, 12, 14, 27, 35, 36]) which
however is technically less related to the present work.
The remainder of this note is organized as follows. Section 2 contains the main results on
the general linear program and its quadratic regularization, Section 3 the application to optimal
transport. Proofs are gathered in Section 4.
2 Main Results
Throughout, ∅ ≠ P ⊂ Rd denotes a polytope. That is, P is the convex hull of its extreme points
(or vertices) exp(P) = {v1 , . . . , vK }, which are in turn minimal with the property of spanning P
(see [10] for detailed definitions). We recall the linear program (LP) and its quadratically penalized
version (QLP) as defined in the Introduction, and in particular their cost vector c ∈ Rd . The set
of minimizers of (LP) is denoted M, and x∗ = arg min_{x∈M} ∥x∥² denotes its minimal-norm element. For η > 0, the objective of (QLP) is

    Φη(x) = ⟨c, x⟩ + ∥x∥²/η.
In view of Φη(x) = (1/η)∥x + ηc/2∥² − (η/4)∥c∥², minimizing Φη(x) over P is equivalent to projecting
−ηc/2 onto P in the Hilbert space (Rd , ⟨·, ·⟩). The projection theorem (e.g., [9, Theorem 5.2]) thus
implies the following result. We denote by ri(C) the relative interior of a set C ⊂ Rd ; i.e., the
topological interior when C is considered as a subset of its affine hull.
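Spelling out the completion of the square behind this identity,

    (1/η)∥x + ηc/2∥² = ∥x∥²/η + ⟨c, x⟩ + (η/4)∥c∥²,

so subtracting the constant (η/4)∥c∥² recovers Φη(x); minimizing Φη over P therefore amounts to minimizing ∥x + ηc/2∥² over P, i.e., projecting −ηc/2.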
Lemma 2.1. Given η > 0, (QLP) admits a unique minimizer xη. It is characterized as the unique xη ∈ P such that

    ⟨−ηc/2 − xη, x − xη⟩ ≤ 0    for all x ∈ P.

In particular, if xη ∈ ri(C) for some convex set C ⊂ P, then also

    ⟨−ηc/2 − xη, x − xη⟩ = 0    for all x ∈ C.
Figure 2 illustrates how xη is obtained as the projection of −ηc/2. The algorithm of [25] solves
the problem of projecting a point onto a polyhedron, hence can be used to find xη numerically.
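As a rough numerical illustration (a minimal sketch of our own, using scipy's general-purpose SLSQP solver rather than the specialized method of [25]; the polytope, cost and helper names are purely illustrative), xη can be computed by projecting −ηc/2 onto a polytope described by linear constraints. For the probability simplex in R³ with cost c = (0, 1, 2), the threshold works out to η∗ = 2 for this instance, so the printed minimizer stops changing once η ≥ 2:

# Minimal sketch: compute x^eta by projecting -eta*c/2 onto
# P = {x >= 0 : sum(x) = 1} (probability simplex); illustrative only.
import numpy as np
from scipy.optimize import minimize

def project_onto_polytope(z, A, b, x0):
    """Euclidean projection of z onto {x : A x = b, x >= 0} via SLSQP."""
    res = minimize(
        fun=lambda x: 0.5 * np.sum((x - z) ** 2),
        x0=x0,
        jac=lambda x: x - z,
        constraints=[{"type": "eq", "fun": lambda x: A @ x - b, "jac": lambda x: A}],
        bounds=[(0.0, None)] * z.size,
        method="SLSQP",
    )
    return res.x

c = np.array([0.0, 1.0, 2.0])        # cost vector; the minimal-norm minimizer is x* = (1, 0, 0)
A = np.ones((1, 3))                  # single equality constraint: sum(x) = 1
b = np.array([1.0])
x0 = np.full(3, 1.0 / 3.0)           # feasible start (also the eta -> 0 limit)
for eta in [0.5, 1.0, 2.0, 4.0]:
    x_eta = project_onto_polytope(-eta * c / 2.0, A, b, x0)
    print(f"eta = {eta}: x^eta ~ {np.round(x_eta, 3)}")

The minimizer moves from the uniform point (1/3, 1/3, 1/3) toward the vertex (1, 0, 0) and stays there for η ≥ 2, in line with the stationary convergence discussed below.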
Figure 2: The minimizer xη of (QLP) is the projection of −ηc/2 onto P; the figure marks the ray (−ηc/2)η≥0 and the projections for η = 0, 1, 2, η∗, ending at x∗. The curve η ↦ xη is piecewise affine and converges stationarily to a point x∗; i.e., xη = x∗ for all η ≥ η∗.
We are mainly interested in the suboptimality

    E(η) := ⟨c, xη⟩ − min_{x ∈ P} ⟨c, x⟩ = ⟨c, xη − x∗⟩,        (1)

measuring how suboptimal the solution xη of (QLP) is when used as control in (LP). It follows from
the optimality of xη for (QLP) that η ↦ E(η) is nonincreasing. (Figure 2 illustrates that it need not be strictly decreasing even on [0, η∗].) The optimality of xη also implies that E(η) ≤ η⁻¹(∥x∗∥² − ∥xη∥²); in fact, an analogous result holds for any regularization. The following improvement is
particular to the quadratic penalty and will be important for our main result.
Lemma 2.2. Let xη be the unique minimizer of (QLP) and let x∗ be any minimizer of (LP).
Then

    E(η) ≤ (∥x∗∥² − ∥xη∥² − ∥x∗ − xη∥²)/η    for all η > 0.
Remark 2.3. The bound in Lemma 2.2 is sharp. Indeed, consider the example P = [0, 1] and
c = −1. Then x∗ = 1 and xη = η/2 for η ∈ (0, 2], whereas xη = x∗ for η ≥ 2. It is straightforward
to check that the inequality in Lemma 2.2 is an equality for all η > 0.
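In detail: for η ∈ (0, 2] we have E(η) = ⟨c, xη − x∗⟩ = 1 − η/2, while the right-hand side of Lemma 2.2 equals

    (∥x∗∥² − ∥xη∥² − ∥x∗ − xη∥²)/η = (1 − η²/4 − (1 − η/2)²)/η = (η − η²/2)/η = 1 − η/2,

and for η ≥ 2 both sides vanish.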
The next lemma details the piecewise linear nature of the curve η ↦ xη. This result is known
(even for some more general norms, see [21] and the references therein), and so is the stationary
convergence [31, Theorem 2.1]. For completeness, we detail a short proof in Section 4.
Lemma 2.4. Let xη be the unique minimizer of (QLP). The curve η ↦ xη is piecewise linear and converges stationarily to x∗ = arg min_{x∈M} ∥x∥². That is, there exist n ∈ N and breakpoints

    0 = η0 < η1 < · · · < ηn =: η∗

such that [ηi, ηi+1] ∋ η ↦ xη is affine for every i ∈ {0, . . . , n − 1}, and moreover,
xη = x∗ for all η ≥ η ∗ .
Correspondingly, the suboptimality E(η) = ⟨c, xη − x∗ ⟩ is also piecewise linear and converges sta-
tionarily to zero.
We can now state our main result for the regime of small regularization: the threshold η∗ beyond
which xη = x∗ and a bound for the slope of the suboptimality E(η) of (1) before the threshold.
See Figures 1 and 2 for illustrations. We recall that M denotes the set of minimizers of (LP) and
exp(P) denotes the extreme points of P.
Theorem 2.5. Let xη be the unique minimizer of (QLP) and let x∗ be the minimizer of (LP) with
minimal norm, x∗ = arg minx∈M ∥x∥2 . Let 0 = η0 < η1 < · · · < ηn = η ∗ be the breakpoints of the
curve η ↦ xη as in Lemma 2.4; in particular, η∗ is the threshold such that xη = x∗ for all η ≥ η∗.

(a) The threshold is given by

    η∗ = 2 max_{x ∈ exp(P)\M} ⟨x∗, x∗ − x⟩ / ⟨c, x − x∗⟩.        (2)

The right-hand side attains its maximum on the set M(P, c∗) of minimizers for the linear program (LP) with the auxiliary cost c∗ := η∗c/2 + x∗. Moreover, we have xη ∈ M(P, c∗) for all η ∈ [ηn−1, η∗], so that η∗ = 2⟨x∗, x∗ − xη⟩/⟨c, xη − x∗⟩ for all η ∈ [ηn−1, η∗].
(b) The slope E(η)/(η∗ − η) of the last segment of the curve η ↦ E(η) satisfies the bound

    E(η)/(η∗ − η) ≤ (1/2) ⟨c, (x∗ − xηn−1)/∥x∗ − xηn−1∥⟩² ≤ ∥c∥²/2,    η ∈ [ηn−1, η∗).        (3)
It is worth noting that the first bound in (3) is in terms of the angle between c and x∗ − xηn−1 .
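In the example of Remark 2.3, for instance, xηn−1 = x0 = 0 and E(η) = 1 − η/2 = (η∗ − η)/2 on [0, η∗), so both inequalities in (3) hold with equality there.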
The formula (2) for η ∗ is somewhat implicit in that it refers to x∗ . The following corollary states
a bound for η ∗ using similar quantities as [43] uses for entropic regularization. In particular, we
define the suboptimality gap of P as

    ∆ := min_{v ∈ exp(P)\M} ⟨c, v⟩ − min_{x ∈ P} ⟨c, x⟩;

it measures the cost difference between the suboptimal and the optimal vertices of P.
Corollary 2.6. Let B = supx∈P ∥x∥ and D = supx,x′ ∈P ∥x − x′ ∥ be the bound and diameter of P,
respectively. Then
    η∗ ≤ 2BD/∆.
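For instance, in the example of Remark 2.3 (P = [0, 1], c = −1) we have B = D = 1, and the only suboptimal vertex is 0 with ⟨c, 0⟩ − ⟨c, 1⟩ = 1, so ∆ = 1 and the corollary gives η∗ ≤ 2, which by Remark 2.3 is the exact threshold.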
For integer programs, where c and the vertices of P have integer coordinates, it is clear that
∆ ≥ 1. In general, the explicit computation of ∆ is not obvious. In Section 3 below we shall find
it more useful to directly use (2).
We conclude this section with a quantitative result for the regime η → 0 of large regularization.
After rescaling with η, the quadratically regularized linear program (QLP) formally tends to the
quadratic program
minimize ∥x∥2 subject to x ∈ P. (QP)
The unique solution x0 of (QP) is simply the projection of the origin onto P. It is known in several
contexts that xη → x0 as η → 0 (e.g., [16, Properties 2,7]). The following result quantifies this
convergence by establishing that ∥xη − x0 ∥ tends to zero at a linear rate.
Proposition 2.7. Let xη and x0 be the minimizers of (QLP) and (QP), respectively. Then

    ∥xη − x0∥ ≤ η∥c∥    for all η > 0.        (4)

If moreover x0 ∈ ri(P), then

    ∥xη − x0∥ ≤ (η/2)∥c∥    for all η > 0.        (5)
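In the example of Remark 2.3, x0 = 0 and xη = min(η/2, 1), so that ∥xη − x0∥ = (η/2)∥c∥ for η ∈ (0, 2]; in particular, the linear dependence on η cannot be improved to a higher power of η.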
3 Application to Optimal Transport

In this section we apply the general results of Section 2 to optimal transport. Recall the optimal transport problem

    inf_{γ ∈ Γ(µ,ν)} ∫ c(x, y) dγ(x, y),        (OT)

where Γ(µ, ν) denotes the set of couplings of (µ, ν), and its quadratically regularized version

    inf_{γ ∈ Γ(µ,ν)} ∫ c(x, y) dγ(x, y) + (1/η) ∥dγ/d(µ ⊗ ν)∥²_{L²(µ⊗ν)}.        (QOT)
Throughout this section, we consider given points Xi , Yi , 1 ≤ i ≤ N (in RD , say) with their
associated empirical measures and cost matrix
    µ = (1/N) Σ_{i=1}^{N} δXi ,    ν = (1/N) Σ_{i=1}^{N} δYi ,    Cij := c(Xi, Yj).
Any coupling γ gives rise to a matrix γij = γ(Xi , Yj ) through its probability mass function. Those
matrices form the set
ΓN = {γ ∈ RN×N : γ1 = N⁻¹1, γ⊤1 = N⁻¹1, γi,j ≥ 0}.
It is more standard to work instead with the Birkhoff polytope of doubly stochastic matrices,
ΠN = {π ∈ RN×N : π1 = 1, π⊤1 = 1, πi,j ≥ 0},
that is obtained through the bijection πij = N γij . By Birkhoff’s theorem (e.g., [11]), the extreme
points exp(ΠN ) are precisely the permutation matrices; i.e., matrices with binary entries whose
rows and columns sum to one. Let ⟨A, B⟩ := Trace(A⊤B) = Σ_{i=1}^{N} Σ_{j=1}^{N} Ai,j Bi,j be the Frobenius inner product on RN×N and ∥ · ∥ the associated norm. Then (QOT) becomes a particular case of
(QLP), namely

    min_{γ ∈ ΓN} ⟨C, γ⟩ + (N²/η) ∥γ∥²    or equivalently    min_{π ∈ ΠN} (1/N) ⟨C, π⟩ + (1/η) ∥π∥²,        (6)

where the factor N² is due to µ ⊗ ν being the uniform measure on N² points. To have the same form as in (QLP) and Section 2, we write (6) as

    min_{π ∈ ΠN} ⟨c, π⟩ + (1/η) ∥π∥²    where cij := Cij/N.        (7)
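In more detail, µ ⊗ ν puts mass 1/N² on each pair (Xi, Yj), so dγ/d(µ ⊗ ν)(Xi, Yj) = N²γij and ∥dγ/d(µ ⊗ ν)∥²_{L²(µ⊗ν)} = Σ_{i,j} (N²γij)²/N² = N²∥γ∥², which is where the factor N² in (6) comes from.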
We can now apply the general results of Theorem 2.5 to (7) and infer the following for the regularized
optimal transport problem (QOT); a detailed proof can be found in Section 4.
Proposition 3.1. (a) The optimal coupling γη of (QOT) is optimal for (OT) if and only if

    η ≥ η∗ := 2N · max_{π ∈ exp(ΠN)\M} ⟨π∗, π∗ − π⟩ / ⟨C, π − π∗⟩,        (8)

where M denotes the set of minimizers of π ↦ ⟨C, π⟩ over ΠN and π∗ = arg min_{π∈M} ∥π∥² is its minimal-norm element.

(b) With γ∗ the optimal coupling corresponding to π∗, the suboptimality satisfies

    lim sup_{η→η∗} ( ∫ c(x, y) dγη(x, y) − ∫ c(x, y) dγ∗(x, y) ) / (η∗ − η)  ≤  (1/2) ∫ ( c(x, y) − ∫ c d(µ ⊗ ν) )² d(µ ⊗ ν)(x, y).        (9)
The following example shows that Proposition 3.1 is sharp.
Example 3.1. Let c(Xi, Yj) = −δij, so that π∗ = Id is the identity matrix and C = −Id. Note also that π0 has entries π0i,j = 1/N. It follows from (8) that η∗ = 2N, and the right-hand side of (9) evaluates to (N − 1)/(2N²). We show below that [0, η∗] ∋ η ↦ xη is affine, or more explicitly, that

    πη = ((2N − η)/(2N)) π0 + (η/(2N)) π∗ =: π̃η.

As a consequence, we have for every η ∈ [0, η∗) that
    ∫ c(x, y) dγη(x, y) − ∫ c(x, y) dγ∗(x, y) = (η∗ − η)(N − 1)/(2N²);

that is, the bound (9) holds with equality.
Next, we focus on a more representative class of transport problems. Our main interest is to
see how our key quantities scale with N , the number of data points.
Corollary 3.2. Assume that there is a permutation σ∗ : {1, . . . , N} → {1, . . . , N} such that

    κ := min_{i ∈ {1,...,N}, j ≠ σ∗(i)} c(Xi, Yj) > 0    and    c(Xi, Yσ∗(i)) = 0 for all i ∈ {1, . . . , N}.
Then

    4N/κ′ ≤ η∗ ≤ 2N/κ,        (10)

where κ′ := min_{i ∈ {1,...,N}, j ≠ σ∗(i)} [c(Xi, Yj) + c(Xj, Yi)]. If the cost is symmetric, i.e., c(Xi, Yj) = c(Xj, Yi) for all i, j ∈ {1, . . . , N}, then

    η∗ = 2N/κ.        (11)
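Indeed, under symmetry κ′ = 2κ, so the two bounds in (10) coincide and (11) follows.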
The proof is detailed in Section 4. We illustrate Proposition 3.1 and Corollary 3.2 with a
representative example for scalar data.
Example 3.2. Consider the quadratic cost c(x, y) = ∥x − y∥² and Xi = Yi = i/N, 1 ≤ i ≤ N, with N ≥ 2, leading to the cost matrix

    Cij = |i − j|²/N².
Then

    η∗ = 2N³

and we have the following bound for the slope of the suboptimality,

    lim sup_{η→η∗} ⟨c, πη − Id⟩ / (η∗ − η)  ≤  (N − 1)/N⁶.        (12)
Indeed, the value of η ∗ follows directly from (11) with κ = 1/N 2 and σ ∗ being the identity. The
proof of (12) is longer and relegated to Section 4.
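To make the threshold concrete, the following minimal sketch (our own; it only assumes numpy and uses the fact that the identity is the unique optimal permutation for this cost) enumerates all permutation matrices for a small N and evaluates the right-hand side of (8), recovering η∗ = 2N³:

# Minimal check of (8) for the cost of Example 3.2 with small N; illustrative only.
import itertools
import numpy as np

def threshold_from_8(C):
    """2N * max over non-identity permutations of <pi*, pi* - P_sigma> / <C, P_sigma - pi*>,
    assuming pi* = Id is the unique optimal permutation matrix."""
    N = C.shape[0]
    pi_star = np.eye(N)
    ratios = []
    for sigma in itertools.permutations(range(N)):
        P = np.zeros((N, N))
        P[np.arange(N), sigma] = 1.0             # permutation matrix of sigma
        if np.array_equal(P, pi_star):
            continue
        num = np.sum(pi_star * (pi_star - P))    # <pi*, pi* - P_sigma>
        den = np.sum(C * (P - pi_star))          # <C, P_sigma - pi*>
        ratios.append(num / den)
    return 2 * N * max(ratios)

N = 4
idx = np.arange(1, N + 1)
C = (idx[:, None] - idx[None, :]) ** 2 / N ** 2  # C_ij = |i - j|^2 / N^2
print(threshold_from_8(C), 2 * N ** 3)           # prints 128.0 and 128

The maximum in (8) is attained at transpositions of adjacent indices, mirroring the role of κ = 1/N² in Corollary 3.2.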
To study the accuracy of the bound (12), we compute numerically the limit

    LN := lim_{η↑η∗} ⟨c, πη − Id⟩ / (η∗ − η)

for N = 30j with j = 2, . . . , 16. Figure 3 shows N ↦ LN in blue and the upper bound N ↦ (N − 1)/N⁶ in red (in double logarithmic scale). We observe that both have the same order as a function of N.

Figure 3: The numerically computed limit LN (blue) and the upper bound (N − 1)/N⁶ (red) against N, in double logarithmic scale.
4 Proofs
Proof of Lemma 2.2. For any x ∈ P, Lemma 2.1 implies the inequality in

    Φη(x) = Φη(xη) + ⟨c + 2xη/η, x − xη⟩ + ∥x − xη∥²/η ≥ Φη(xη) + ∥x − xη∥²/η.

Therefore,

    0 ≥ Φη(xη) − Φη(x) + ∥x − xη∥²/η = ⟨c, xη − x⟩ + (∥xη∥² − ∥x∥² + ∥x − xη∥²)/η

and in particular choosing x = x∗ gives

    E(η) = ⟨c, xη − x∗⟩ ≤ (∥x∗∥² − ∥xη∥² − ∥x∗ − xη∥²)/η,

which is the claim.
Proof of Lemma 2.4 and Theorem 2.5. Step 1. Let η(1) < η(2). We claim that if xη(1), xη(2) ∈ ri(F) for some face¹ F of P, then [η(1), η(2)] ∋ η ↦ xη is affine. Indeed, xη(i) = projP(−η(i)c/2) is the projection of −η(i)c/2 onto P. As xη(i) ∈ ri(F), it follows that xη(i) = projA(−η(i)c/2) is also the projection onto the affine hull A of F. Since A is an affine space, the map η ↦ projA(−ηc/2) is affine. For η(1) ≤ η ≤ η(2), convexity of ri(F) then implies projA(−ηc/2) ∈ ri(F), which in turn implies projA(−ηc/2) = projF(−ηc/2) = projP(−ηc/2) = xη.
Step 2. We can now define η1, . . . , ηn recursively as follows. Recall first that each x ∈ P is in the relative interior of exactly one face of P (possibly P itself), namely the smallest face containing x [10, Theorem 5.6]. Let F0 be the unique face such that x0 := arg min_{x∈P} ∥x∥ ∈ ri(F0) and define

    η1 := inf{η > 0 : xη ∉ ri(F0)},

where we use the convention that inf ∅ = +∞. Then (0, η1) ∋ η ↦ xη is affine by Step 1. For i > 1, if ηi−1 < ∞, let Fi−1 be the face such that xηi−1 ∈ ri(Fi−1) and define

    ηi := inf{η > ηi−1 : xη ∉ ri(Fi−1)}.
On the other hand, the fact that x∗ is the projection of the origin onto M yields
    Σ_{i: vi ∈ exp(M)} λi ⟨x∗, vi − x∗⟩ ≥ 0.

Together,

    ⟨ηc/2, x − x∗⟩ ≥ − Σ_{i: vi ∈ exp(P)\exp(M)} λi ⟨x∗, vi − x∗⟩ ≥ − Σ_{i: vi ∈ exp(P)} λi ⟨x∗, vi − x∗⟩ = −⟨x∗, x − x∗⟩.
As x ∈ P was arbitrary, Lemma 2.1 now shows that x∗ = xη . This completes the proof of Lemma 2.4
and (2).
¹A nonempty face F of the polytope P can be defined as a subset F ⊂ P such that there exists an affine hyperplane H with P contained in one of the closed halfspaces bounded by H and F = P ∩ H; by convention, P itself is also considered a face.
Finally, note that x attains the maximum in (2) if and only if ⟨c∗, x − x∗⟩ = 0. Moreover, ⟨c∗, x − x∗⟩ ≥ 0 for all x ∈ P by Lemma 2.1. Hence the set of maximizers of (2) equals the set of minimizers M(P, c∗). Next, Lemma 2.1 yields

    ⟨−ηc/2 − xη, xη′ − xη⟩ = 0    for all η′ ∈ [ηn−1, η∗], η ∈ (ηn−1, η∗),
and by continuity, the previous display also holds for η ∈ [ηn−1 , η ∗ ]. In summary, we have
    ⟨−η∗c/2 − x∗, x − x∗⟩ ≤ 0    for all x ∈ P        (14)

and

    ⟨−η∗c/2 − x∗, xηn−1 − x∗⟩ = 0.

Therefore, xηn−1 ∈ M(P, c∗). On the other hand, (14) also states that xη∗ = x∗ ∈ M(P, c∗), and
then convexity implies the claim.
Step 5. It remains to prove (b). Let η ∈ (ηn−1, η∗). Then Lemma 2.4 implies that xη = λxηn−1 + (1 − λ)x∗ for some λ ∈ (0, 1) and thus

    E(η) = ⟨c, xη − x∗⟩ = λ⟨c, xηn−1 − x∗⟩.

Using

    ∥xη∥² = ∥x∗∥² + λ²∥x∗ − xηn−1∥² + 2λ⟨x∗, xηn−1 − x∗⟩

and ∥xη − x∗∥² = λ²∥x∗ − xηn−1∥², it follows from Lemma 2.2 that

    λη⟨c, xηn−1 − x∗⟩ ≤ 2λ⟨x∗, x∗ − xηn−1⟩ − 2λ²∥x∗ − xηn−1∥²

and hence

    λ ≤ (2⟨x∗, x∗ − xηn−1⟩ − η⟨c, xηn−1 − x∗⟩) / (2∥x∗ − xηn−1∥²).        (15)

Moreover, by part (a),

    η∗ = 2⟨x∗, x∗ − xηn−1⟩ / ⟨c, xηn−1 − x∗⟩.
Inserting this in (15) yields
    λ ≤ (η∗ − η)⟨c, xηn−1 − x∗⟩ / (2∥x∗ − xηn−1∥²)

and now it follows that

    E(η) = λ⟨c, xηn−1 − x∗⟩ ≤ (η∗ − η)⟨c, xηn−1 − x∗⟩² / (2∥x∗ − xηn−1∥²)
as claimed.
Proof of Proposition 2.7. Consider the function
Θη (x) := η⟨c, x⟩ + ∥x∥2 .
We have Θ0 = ∥ · ∥2 and hence rearranging the inner product gives
Θ0 (xη ) = Θ0 (x0 ) + 2⟨x0 , xη − x0 ⟩ + ∥x0 − xη ∥2 .
Since x0 is the projection of the origin onto P, it holds that ⟨x0 , x − x0 ⟩ ≥ 0 for all x ∈ P, so that
∥x0 − xη ∥2 ≤ Θ0 (xη ) − Θ0 (x0 ).
Noting further that 0 ≤ Θη (x0 ) − Θη (xη ) by the optimality of xη , we conclude
∥x0 − xη ∥2 ≤ Θ0 (xη ) − Θ0 (x0 )
≤ Θ0 (xη ) − Θ0 (x0 ) + Θη (x0 ) − Θη (xη )
= η⟨c, x0 − xη ⟩
≤ η∥c∥∥xη − x0 ∥
and the bound (4) follows. To prove (5), we observe that Lemma 2.1 yields
    0 ≤ ⟨ηc/2 + xη, x0 − xη⟩ = ⟨ηc/2, x0 − xη⟩ + ⟨xη, x0 − xη⟩.
In view of the additional condition, it follows that
    ∥x0 − xη∥² = ⟨−xη, x0 − xη⟩ ≤ ⟨ηc/2, x0 − xη⟩ ≤ (η/2)∥c∥ ∥x0 − xη∥
as claimed.
Proof of Proposition 3.1. Theorem 2.5(a) directly yields (8). As for (9), a direct application of Theorem 2.5(b) only yields

    lim sup_{η→η∗} ( ∫ c(x, y) dγη(x, y) − ∫ c(x, y) dγ∗(x, y) ) / (η∗ − η)  ≤  (1/2) ∫ c(x, y)² d(µ ⊗ ν)(x, y).
To improve this bound, note that the optimizer of (QOT) does not change if the cost c(x, y) is
changed by an additive constant. Moreover, for any m ∈ R,
    ∫ c(x, y) dγη(x, y) − ∫ c(x, y) dγ∗(x, y) = ∫ (c(x, y) − m) dγη(x, y) − ∫ (c(x, y) − m) dγ∗(x, y).

Applying Theorem 2.5 with the modified cost c(x, y) − m for the choice m := ∫ c(x, y) d(µ ⊗ ν)(x, y) yields (9).
Proof of Corollary 3.2. Assume without loss of generality that σ ∗ is the identity, so that π ∗ = Id
is the identity matrix. Let Pσ be the permutation matrix associated with a permutation σ :
{1, . . . , N } → {1, . . . , N }. We define N (σ) = {i ∈ {1, . . . , N } : σ(i) = i}. Then
    ⟨π∗, π∗ − Pσ⟩ / ⟨C, Pσ − π∗⟩ = (N − |N(σ)|) / Σ_{i ∉ N(σ)} [c(Xi, Yσ(i)) − c(Xi, Yi)],        (16)
and Proposition 3.1 again yields the claim. It remains to observe that the bounds in (10) match
when the cost is symmetric.
Proof for Example 3.2. Corollary 3.2 applies with σ∗ being the identity and κ = 1/N². As a consequence, the critical value η∗ is 2N³.
To prove (12), write πηn−1 = Σ_{i=1}^{k} λi Pσi with λi ∈ (0, 1] and Σ_{i=1}^{k} λi = 1. Recall from Theorem 2.5(a) that 0 = ⟨c∗, πηn−1 − π∗⟩. With the optimality of π∗ = πη∗ for ⟨c∗, ·⟩, this implies

    0 = ⟨c∗, Pσi − π∗⟩ = ⟨η∗C/(2N) + π∗, Pσi − π∗⟩ = ⟨N²C + π∗, Pσi − π∗⟩    for all i = 1, . . . , k.
Let c̄ := A ⊙ c be the entry-wise product, meaning that entries of c outside the three principal
diagonals are set to zero. As πηn−1 − Id vanishes outside those diagonals, we have

    ⟨πηn−1 − Id, c⟩ = ⟨πηn−1 − Id, c̄⟩.
We can now use Theorem 2.5(b) and the Cauchy–Schwarz inequality to find
    lim sup_{η→η∗} ⟨πη − Id, c⟩ / (η∗ − η) ≤ ⟨πηn−1 − Id, c⟩² / (2∥πηn−1 − Id∥²) = ⟨πηn−1 − Id, c̄⟩² / (2∥πηn−1 − Id∥²) ≤ ∥c̄∥²/2 = (1/2) · 2(N − 1)/(N²N⁴) = (N − 1)/N⁶
as claimed in (12).
References
[1] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM J. Math. Anal., 43(2):904–924,
2011.
[2] J. M. Altschuler, J. Niles-Weed, and A. J. Stromme. Asymptotics for semidiscrete entropic optimal
transport. SIAM J. Math. Anal., 54(2):1718–1741, 2022.
[3] D. Alvarez-Melis and T. Jaakkola. Gromov-Wasserstein alignment of word embedding spaces. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages
1881–1890, 2018.
[4] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and M. Eder. All adapted topologies are equal. Probab.
Theory Related Fields, 178(3-4):1125–1172, 2020.
[5] E. Bayraktar, S. Eckstein, and X. Zhang. Stability and sample complexity of divergence regularized
optimal transport. Preprint arXiv:2212.00367v1, 2022.
[6] M. Beiglböck, P. Henry-Labordère, and F. Penkner. Model-independent bounds for option prices: a
mass transport approach. Finance Stoch., 17(3):477–501, 2013.
[7] E. Bernton, P. Ghosal, and M. Nutz. Entropic optimal transport: Geometry and large deviations.
Duke Math. J., 171(16):3363–3400, 2022.
[8] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. volume 84 of Proceedings
of Machine Learning Research, pages 880–889, 2018.
[9] H. Brezis. Functional analysis, Sobolev spaces and partial differential equations. Universitext. Springer,
New York, 2011.
[10] A. Brøndsted. An introduction to convex polytopes, volume 90 of Graduate Texts in Mathematics.
Springer-Verlag, New York-Berlin, 1983.
[11] R. A. Brualdi. Combinatorial matrix classes, volume 108 of Encyclopedia of Mathematics and its
Applications. Cambridge University Press, Cambridge, 2006.
[12] G. Carlier, V. Duval, G. Peyré, and B. Schmitzer. Convergence of entropic schemes for optimal
transport and gradient flows. SIAM J. Math. Anal., 49(2):1385–1418, 2017.
[13] R. Cominetti and J. San Martín. Asymptotic analysis of the exponential penalty trajectory in linear
programming. Math. Programming, 67(2, Ser. A):169–187, 1994.
[14] G. Conforti and L. Tamanini. A formula for the time derivative of the entropic cost and applications.
J. Funct. Anal., 280(11):108964, 2021.
[15] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural
Information Processing Systems 26, pages 2292–2300. 2013.
[16] A. Dessein, N. Papadakis, and J.-L. Rouas. Regularized optimal transport and the rot mover’s distance.
J. Mach. Learn. Res., 19(15):1–53, 2018.
[17] S. Di Marino and A. Gerolin. Optimal transport losses and Sinkhorn algorithm with general convex
regularization. Preprint arXiv:2007.00976v1, 2020.
[18] S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via
penalization and neural networks. Appl. Math. Optim., 83(2):639–667, 2021.
[19] S. Eckstein and M. Nutz. Convergence rates for regularized optimal transport via quantization. Math.
Oper. Res., 49(2):1223–1240, 2024.
[20] M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs. SIAM J. Sci.
Comput., 40(4):A1961–A1986, 2018.
[21] M. Finzel and W. Li. Piecewise affine selections for piecewise polyhedral multifunctions and metric
projections. J. Convex Anal., 7(1):73–94, 2000.
[22] A. Galichon. Optimal transport methods in economics. Princeton University Press, Princeton, NJ,
2016.
[23] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal trans-
port. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016.
[24] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein
GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems,
pages 5769–5779, 2017.
[25] W. W. Hager and H. Zhang. Projection onto a polyhedron that exploits sparsity. SIAM J. Optim.,
26(3):1773–1798, 2016.
[26] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal
processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
[27] C. Léonard. From the Schrödinger problem to the Monge-Kantorovich problem. J. Funct. Anal.,
262(4):1879–1920, 2012.
[28] L. Li, A. Genevay, M. Yurochkin, and J. Solomon. Continuous regularized Wasserstein barycenters. In
H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems, volume 33, pages 17755–17765. Curran Associates, Inc., 2020.
[29] D. Lorenz and H. Mahler. Orlicz space regularization of continuous optimal transport problems. Appl.
Math. Optim., 85(2):Paper No. 14, 33, 2022.
[30] D. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal transport. Appl. Math. Optim.,
83(3):1919–1949, 2021.
[31] O. L. Mangasarian. Normal solutions of linear programs. Math. Programming Stud., 22:206–216, 1984.
Mathematical programming at Oberwolfach, II (Oberwolfach, 1983).
[32] O. L. Mangasarian and R. R. Meyer. Nonlinear perturbation of linear programs. SIAM J. Control
Optim., 17(6):745–752, 1979.
[33] G. Mordant. Regularised optimal self-transport is approximate Gaussian mixture maximum likelihood.
Preprint arXiv:2310.14851v1, 2023.
[34] M. Nutz. Quadratically regularized optimal transport: Existence and multiplicity of potentials.
Preprint arXiv:2404.06847v1, 2024.
[35] M. Nutz and J. Wiesel. Entropic optimal transport: convergence of potentials. Probab. Theory Related
Fields, 184(1-2):401–424, 2022.
[36] S. Pal. On the difference between entropic cost and the optimal transport cost. Ann. Appl. Probab.,
34(1B):1003–1028, 2024.
[37] V. M. Panaretos and Y. Zemel. Statistical aspects of Wasserstein distances. Annu. Rev. Stat. Appl.,
6:405–431, 2019.
[38] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foun-
dations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[39] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval.
Int. J. Comput. Vis., 40:99–121, 2000.
[40] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal
transport and mapping estimation. In International Conference on Learning Representations, 2018.
[41] C. Villani. Topics in optimal transportation, volume 58 of Graduate Studies in Mathematics. American
Mathematical Society, Providence, RI, 2003.
[42] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der Mathematischen Wis-
senschaften. Springer-Verlag, Berlin, 2009.
[43] J. Weed. An explicit analysis of the entropic penalty in linear programming. volume 75 of Proceedings
of Machine Learning Research, pages 1841–1855, 2018.
[44] S. Zhang, G. Mordant, T. Matsumoto, and G. Schiebinger. Manifold learning with sparse regularised
optimal transport. Preprint arXiv:2307.09816v1, 2023.