Quadratically Regularized Optimal Transport
1 Introduction
problem formulation we use here is the one of Kantorovich [15]. Let us fix some notation and formulate the problem: Let $\Omega_1 \subset \mathbb{R}^{d_1}$, $\Omega_2 \subset \mathbb{R}^{d_2}$ be two compact domains, denote $\Omega = \Omega_1 \times \Omega_2$, and assume we are given two positive regular Radon measures $\mu_1$ and $\mu_2$ on $\Omega_1$ and $\Omega_2$, respectively. Further we assume that a cost function $c : \Omega_1 \times \Omega_2 \to \mathbb{R}$ is given that models the cost of transporting a unit of mass from $x_1 \in \Omega_1$ to $x_2 \in \Omega_2$. The optimal transport problem asks to find a transport plan $\pi$, which is a Radon measure on $\Omega$, that has minimal overall transport cost $\int_\Omega c(x_1, x_2)\,\mathrm{d}\pi(x_1, x_2)$ among all measures $\pi$ which have $\mu_1$ and $\mu_2$ as first and second marginals, respectively, i.e. for all Borel sets $A \subset \Omega_1$ it holds that $\pi(A \times \Omega_2) = \mu_1(A)$ and for all Borel sets $B \subset \Omega_2$ it holds that $\pi(\Omega_1 \times B) = \mu_2(B)$.
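To make the formulation concrete in the discrete case, here is a minimal sketch that solves a small Kantorovich problem as a linear program (the sizes, random data, and use of scipy are our own illustration, not taken from this paper):

```python
import numpy as np
from scipy.optimize import linprog

# Tiny discrete Kantorovich problem: pi is a nonnegative M x N matrix with
# row sums mu1 and column sums mu2 that minimizes <c, pi>.  All names and
# sizes here are illustrative choices.
M, N = 3, 4
rng = np.random.default_rng(0)
c = rng.random((M, N))
mu1 = np.full(M, 1 / M)
mu2 = np.full(N, 1 / N)
A_eq = np.vstack([np.kron(np.eye(M), np.ones(N)),   # row-sum constraints
                  np.kron(np.ones(M), np.eye(N))])  # column-sum constraints
res = linprog(c.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu1, mu2]),
              bounds=(0, None))
pi = res.x.reshape(M, N)   # optimal plans are supported on few entries
```

Already in this tiny instance one can observe the phenomenon discussed next: the optimal plan is supported on very few entries.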
This problem has been studied extensively and we refer to the books [18, 19, 23, 24, 21]. One particular result is that an optimal plan $\pi^*$ exists and that the support of optimal plans is contained in the so-called c-superdifferential of a c-concave function [1, Theorem 1.13]. For many cost functions c, this means that optimal transport plans are supported on small sets and that they are in fact singular with respect to the Lebesgue measure on Ω. This makes the numerical treatment of optimal transport problems difficult, and one can employ regularization to obtain approximately optimal plans π that are functions on Ω. The regularization method that has received the most attention recently is regularization with the negative entropy of π, and we refer to [16, 10, 4]. Entropic regularization has become popular in machine learning applications because it allows for the very simple Sinkhorn algorithm (in the discrete case), see [9, 13], and also [17] for a recent and thorough review of the computational aspects of optimal transport.
Regularizations different from entropic regularization have been much less studied. We are only aware of works in the discrete case, e.g. [3, 11]. In this work we investigate the case where we regularize the problem in $L^2(\Omega)$. The paper is organized as follows: In Section 2 we state the problem and analyze existence and duality. It will turn out that existence of solutions of the dual problem is quite tricky to show, but we will show that dual solutions exist in respective $L^2$ spaces and that a straightforward optimality system characterizes primal-dual optimality. In Section 3 we derive two different algorithms for the discrete version of the quadratically regularized optimal transport problem, and in Section 4 we comment on a simple discretization scheme and report numerical examples.
Notation. We abbreviate $x_+ = \max(x, 0)$ (and apply this also to functions and to measures, where $+$ means the positive part from the Hahn-Jordan decomposition). By $C(\Omega)$ we denote the space of continuous functions on Ω (we will always work on compact sets) equipped with the supremum norm $\|\cdot\|_\infty$, and by $\mathcal{M}(\Omega)$ we denote the space of Radon measures on a compact domain, equipped with the norm $\|\mu\|_{\mathcal{M}} = \sup\{\int f \,\mathrm{d}\mu \mid f \in C(\Omega),\ |f| \leq 1\}$. The Lebesgue measure will be λ (and we also use $\lambda_1$ and $\lambda_2$ to specify the Lebesgue measure on the sets $\Omega_1$ and $\Omega_2$, respectively). For convenience, we use $|\Omega|$ for the Lebesgue measure of the set Ω. Furthermore, for a Radon measure $w \in \mathcal{M}$, we denote the absolutely continuous and singular part arising from the Lebesgue decomposition with respect to the Lebesgue measure by $w_{ac}$ and $w_s$, i.e. they satisfy $w_{ac} \ll \lambda$ and $w_s \perp \lambda$. Duality pairings are denoted by $\langle \cdot, \cdot \rangle$. If both arguments of the duality pairing are positive and the duality pairing does not necessarily exist, e.g. for $\psi \in \mathcal{M}(\Omega)$ and $x \in L^2(\Omega)$, we set $\langle \psi, x \rangle := +\infty$.
Proof Assume that there is an optimal solution $\pi^* \in L^2(\Omega_1 \times \Omega_2)$. By Jensen's inequality we get
$$\int_{\Omega_1} \mu_1(x_1)^2 \,\mathrm{d}\lambda_1 = \int_{\Omega_1} \Big( \int_{\Omega_2} \pi^*(x_1, x_2) \,\mathrm{d}\lambda_2 \Big)^2 \mathrm{d}\lambda_1 \leq |\Omega_2| \iint_{\Omega_1 \times \Omega_2} \pi^*(x_1, x_2)^2 \,\mathrm{d}\lambda_1 \,\mathrm{d}\lambda_2 < \infty,$$
which shows $\mu_1 \in L^2(\Omega_1)$. The argument for $\mu_2$ is similar. Non-negativity of $\mu_1$ and $\mu_2$ follows from non-negativity of $\pi^*$. Finally, by Fubini's theorem,
$$\int_{\Omega_1} \mu_1(x_1) \,\mathrm{d}\lambda_1 = \iint_{\Omega_1 \times \Omega_2} \pi^*(x_1, x_2) \,\mathrm{d}\lambda_1 \,\mathrm{d}\lambda_2 = \int_{\Omega_2} \mu_2(x_2) \,\mathrm{d}\lambda_2.$$
In the following section, we apply the classical Lagrange duality to the linear-
quadratic program (1). To this end, let us define the Lagrangian associated with
(1). In order to shorten the notation, we set
µ := γ µ1 ⊗ µ2 .
Furthermore, we define
$$P_1 : L^2(\Omega) \ni \pi \mapsto \int_{\Omega_2} \pi \,\mathrm{d}\lambda_2 \in L^2(\Omega_1), \qquad P_2 : L^2(\Omega) \ni \pi \mapsto \int_{\Omega_1} \pi \,\mathrm{d}\lambda_1 \in L^2(\Omega_2), \tag{2}$$
and the Lagrangian
$$L : L^2(\Omega) \times L^2(\Omega_1) \times L^2(\Omega_2) \times L^2(\Omega) \to \mathbb{R},$$
$$L(\pi, \alpha_1, \alpha_2, \rho) := E_\gamma(\pi) - \langle \rho, \pi \rangle_{L^2(\Omega)} + \langle \alpha_1, P_1\pi - \mu_1 \rangle_{L^2(\Omega_1)} + \langle \alpha_2, P_2\pi - \mu_2 \rangle_{L^2(\Omega_2)}.$$
The main part of the upcoming analysis is devoted to the existence of solutions to (DP). Once this is established, the necessary and sufficient optimality condition associated with (1) in the form of a variational inequality will allow us to derive an optimality system that is also amenable to numerical computations.
To show existence for (DP), we first reformulate the dual problem. Since L is quadratic w.r.t. π, the inner inf-problem is solved by
$$\pi = \frac{1}{\gamma}\,(\rho + \alpha_1 \oplus \alpha_2 - c). \tag{4}$$
Remark 2.2 The map ⊕ is related to the adjoints of the projections $P_1$ and $P_2$ from (2) by $\alpha_1 \oplus \alpha_2 = P_1^*\alpha_1 + P_2^*\alpha_2$.
Again, the inner optimization problem is quadratic w.r.t. ρ so that its solution is given by
$$\rho = -(\alpha_1 \oplus \alpha_2 - c)_-. \tag{7}$$
Inserted in (6), this results in the following dual problem:
$$\min \ \Phi(\alpha_1, \alpha_2) := \tfrac{1}{2}\big\|(\alpha_1 \oplus \alpha_2 - c)_+\big\|_{L^2(\Omega)}^2 - \gamma\langle \alpha_1, \mu_1 \rangle - \gamma\langle \alpha_2, \mu_2 \rangle \quad \text{s.t.} \quad \alpha_i \in L^2(\Omega_i), \ i = 1, 2. \tag{D}$$
To prove existence of solutions for this problem, we need to require the following assumption.

Assumption 1 The domains $\Omega_1$ and $\Omega_2$ are compact. Moreover, the cost function c is in $L^2(\Omega)$ and fulfills $c \geq \underline{c} > -\infty$. Furthermore, the marginals $\mu_1$ and $\mu_2$ satisfy $\mu_i \in L^2(\Omega_i)$ and $\mu_i \geq \delta > 0$, $i = 1, 2$. In addition we assume that $\int_{\Omega_1} \mu_1 \,\mathrm{d}\lambda_1 = \int_{\Omega_2} \mu_2 \,\mathrm{d}\lambda_2 = 1$.
Observe that the objective Φ in (D) is also well defined for functions $\alpha_i \in L^1(\Omega_i)$ with $(\alpha_1 \oplus \alpha_2 - c)_+ \in L^2(\Omega)$. This gives rise to the following auxiliary dual problem:
$$\min \ \Phi(\alpha_1, \alpha_2) \quad \text{s.t.} \quad \alpha_i \in L^1(\Omega_i), \ i = 1, 2, \quad (\alpha_1 \oplus \alpha_2 - c)_+ \in L^2(\Omega). \tag{D'}$$
Our strategy to prove existence of solutions to (D) is now as follows:
1. First, we show that (D') admits a solution $(\alpha_1^*, \alpha_2^*) \in L^1(\Omega_1) \times L^1(\Omega_2)$, see Proposition 2.9.
2. Then, we prove that $\alpha_1^*$ and $\alpha_2^*$ possess higher regularity, namely that they are functions in $L^2(\Omega_i)$, $i = 1, 2$, cf. Theorem 2.10.
3. Thus, $(\alpha_1^*, \alpha_2^*)$ is feasible for (D) and, since the feasible set of (D') contains the one of (D), while the objective of (D') restricted to $L^2$-functions coincides with the objective in (D), this finally gives that $(\alpha_1^*, \alpha_2^*)$ is indeed optimal for (D).
The reason to consider (D') is essentially that the objective Φ is not coercive in $L^2(\Omega)$, but only in $L^1(\Omega)$ (at least w.r.t. the negative part of $\alpha_i$). Therefore, we have to deal with weakly* converging sequences in the space of Radon measures within the proof of existence of solutions. For this purpose, we need to extend the objective to a suitable set. To that end, let us define
$$G : L^2(\Omega) \ni w \mapsto \int_\Omega \big(\tfrac{1}{2} w_+^2 - w\mu\big) \,\mathrm{d}\lambda \in \mathbb{R}. \tag{8}$$
Note that, thanks to $\int_{\Omega_1} \mu_1 \,\mathrm{d}\lambda_1 = \int_{\Omega_2} \mu_2 \,\mathrm{d}\lambda_2 = 1$, it holds that
$$\Phi(\alpha_1, \alpha_2) = G(\alpha_1 \oplus \alpha_2 - c) - \int_\Omega c\,\mu \,\mathrm{d}\lambda \qquad \forall\, \alpha_i \in L^2(\Omega_i),\ i = 1, 2. \tag{9}$$
Of course, G is also well defined as a functional on the feasible set of (D') and we will denote this functional by the same symbol to ease notation. In order to extend G to the space of Radon measures, consider for a given measure $w \in \mathcal{M}(\Omega)$ the Hahn-Jordan decomposition $w = w_+ - w_-$ and assume that $w_+ \in L^2(\Omega)$. Then, we set
$$G(w) = \int_\Omega \tfrac{1}{2} w_+^2 \,\mathrm{d}\lambda - \int_\Omega \mu \,\mathrm{d}w.$$
With a slight abuse of notation, we denote this mapping by G, too. Furthermore, for $w_+ \in L^2(\Omega)$, the integral $\int_\Omega w_+ \mu \,\mathrm{d}\lambda$ is finite for $\mu \in L^2(\Omega)$ as in Assumption 1. Regarding the negative part, we define $\int_\Omega \mu \,\mathrm{d}w_- := \infty$ where this expression is not properly defined, as $w_-$ and µ are both positive. Combining this, we obtain that $-\int_\Omega \mu \,\mathrm{d}w \in \mathbb{R} \cup \{\infty\}$.
Note in this context that, if the singular part of w (w.r.t. the Lebesgue measure)
vanishes, then also w+ ∈ L1 (Ω ) and w+ (x) = max{0, w(x)} λ-a.e. in Ω so that both
functionals coincide on L2 (Ω ), which justifies this notation. Furthermore, we also
generalize the map ⊕ to the measure space by setting
α1 ⊕ α2 := α1 ⊗ λ2 + λ1 ⊗ α2 , αi ∈ M(Ωi ), i = 1, 2.
Again, it is easily seen that, for αi ∈ L2 (Ωi ), i = 1, 2, this definition boils down to
the one in (5). Also Remark 2.2 applies in that we can express α1 ⊕ α2 in terms
of the adjoints of P1 and P2 from (2) when defined appropriately.
The next lemma is rather obvious and covers the coercivity of G in L1 (Ω ) as
indicated above.
Lemma 2.5 Let Assumption 1 hold and suppose that a sequence $\{w^n\} \subset L^2(\Omega)$ fulfills
$$G(w^n) \leq C < \infty \qquad \forall\, n \in \mathbb{N}.$$
Then, the sequences $\{w^n_+\}$ and $\{w^n_-\}$ are bounded in $L^2(\Omega)$ and $L^1(\Omega)$, respectively.
Proof We rewrite G as $G(w) = \int_\Omega \big(\tfrac{1}{2}w_+^2 - w_+\mu\big) \,\mathrm{d}\lambda + \int_\Omega w_-\mu \,\mathrm{d}\lambda$. The positivity of µ then implies
$$\tfrac{1}{2}\|w^n_+\|_{L^2(\Omega)}^2 = G(w^n) + \int_\Omega w^n_+ \mu \,\mathrm{d}\lambda - \int_\Omega w^n_- \mu \,\mathrm{d}\lambda \leq C + \|\mu\|_{L^2(\Omega)}\, \|w^n_+\|_{L^2(\Omega)},$$
which gives the first assertion. To see the second one, we use $\mu \geq \delta$ to estimate
$$C \geq G(w^n) = \int_\Omega \tfrac{1}{2}(w^n_+ - \mu)^2 \,\mathrm{d}\lambda - \int_\Omega \mu^2/2 \,\mathrm{d}\lambda + \int_\Omega w^n_- \mu \,\mathrm{d}\lambda \geq -\int_\Omega \mu^2/2 \,\mathrm{d}\lambda + \delta\,\|w^n_-\|_{L^1(\Omega)},$$
which proves the second assertion. ⊓⊔
The next lemma provides a lower semicontinuity result for G w.r.t. weak∗
convergence in M(Ω ). Note that, here, we need the extension of G as introduced
above.
Proof By virtue of Lemma 2.5, $\{w^n_+\}$ is bounded in $L^2(\Omega)$ and thus there is a subsequence of $\{w^n_+\}$, to ease notation denoted by the same symbol, that converges weakly in $L^2(\Omega)$ to some $\theta_+ \in L^2(\Omega)$. Since the set $\{v \in L^2(\Omega) : v \geq 0 \text{ a.e. in } \Omega\}$ is clearly weakly closed, we have $\theta_+ \geq 0$ a.e. in Ω. With a little abuse of notation, we denote the Radon measure induced by $C(\Omega) \ni \varphi \mapsto \int_\Omega \theta_+ \varphi \,\mathrm{d}\lambda \in \mathbb{R}$ by $\theta_+$, too. If we define $\theta_- := \theta_+ - w^* \in \mathcal{M}(\Omega)$, then $w^n_- = w^n_+ - w^n \rightharpoonup^* \theta_-$ in $\mathcal{M}(\Omega)$ with $\theta_- \geq 0$. Thus we have $w^* = \theta_+ - \theta_-$ with two positive Radon measures $\theta_+, \theta_-$. The maximality property of the Hahn-Jordan decomposition then implies $w^*_+ \leq \theta_+$. Since $\theta_+$ is absolutely continuous w.r.t. λ, the same thus holds for $w^*_+$, i.e. $w^*_+ \in L^1(\Omega)$. Applying again $w^*_+ \leq \theta_+$, which clearly also holds for the densities pointwise λ-almost everywhere, we moreover deduce from the weak convergence of $w^n_+$ in $L^2(\Omega)$ that
$$\int_\Omega (w^*_+)^2 \,\mathrm{d}\lambda \leq \int_\Omega \theta_+^2 \,\mathrm{d}\lambda \leq \liminf_{n\to\infty} \int_\Omega (w^n_+)^2 \,\mathrm{d}\lambda, \tag{11}$$
which implies $w^*_+ \in L^2(\Omega)$ as claimed. Since the above reasoning applies to every subsequence of $\{w^n_+\}$ that is weakly converging in $L^2(\Omega)$, (11) holds for the whole sequence $\{w^n_+\}$, which together with the weak* convergence of $w^n$ and the definition of G, gives (10). ⊓⊔
Before we are in the position to prove existence for (D'), we need two additional results on the ⊕-operator in the space of Radon measures.

Lemma 2.7 If $\alpha_i \in \mathcal{M}(\Omega_i)$, $i = 1, 2$, and $\int_{\Omega_2} \mathrm{d}\alpha_2 = 0$, then it holds that
$$\|\alpha_1\|_{\mathcal{M}} \leq \frac{1}{|\Omega_2|}\,\|\alpha_1 \oplus \alpha_2\|_{\mathcal{M}} \qquad \text{and} \qquad \|\alpha_2\|_{\mathcal{M}} \leq \frac{2}{|\Omega_1|}\,\|\alpha_1 \oplus \alpha_2\|_{\mathcal{M}}.$$
Proof We estimate
$$\begin{aligned} \|\alpha_1 \oplus \alpha_2\|_{\mathcal{M}} &= \sup_{\|\varphi\|_\infty \leq 1} \iint_{\Omega_1\times\Omega_2} \varphi(x_1, x_2) \,\mathrm{d}(\alpha_1 \oplus \alpha_2)(x_1, x_2) \\ &\geq \sup_{\substack{\|\varphi_1\|_\infty \leq 1\\ \|\varphi_2\|_\infty \leq 1}} \iint_{\Omega_1\times\Omega_2} \varphi_1(x_1)\varphi_2(x_2) \,\mathrm{d}(\alpha_1 \oplus \alpha_2)(x_1, x_2) \\ &= \sup_{\substack{\|\varphi_1\|_\infty \leq 1\\ \|\varphi_2\|_\infty \leq 1}} \bigg[ \iint_{\Omega_1\times\Omega_2} \varphi_1(x_1)\varphi_2(x_2) \,\mathrm{d}\alpha_1(x_1)\,\mathrm{d}\lambda_2(x_2) + \iint_{\Omega_1\times\Omega_2} \varphi_1(x_1)\varphi_2(x_2) \,\mathrm{d}\lambda_1(x_1)\,\mathrm{d}\alpha_2(x_2) \bigg]. \end{aligned} \tag{12}$$
Taking $\varphi_2 \equiv 1$ and using $\int_{\Omega_2} \mathrm{d}\alpha_2 = 0$ gives
$$\|\alpha_1 \oplus \alpha_2\|_{\mathcal{M}} \geq \sup_{\|\varphi_1\|_\infty \leq 1} \bigg[ |\Omega_2| \int_{\Omega_1} \varphi_1(x_1) \,\mathrm{d}\alpha_1(x_1) + \int_{\Omega_2} \mathrm{d}\alpha_2(x_2) \int_{\Omega_1} \varphi_1(x_1) \,\mathrm{d}\lambda_1 \bigg] = |\Omega_2|\,\|\alpha_1\|_{\mathcal{M}}.$$
Now we start again at (12) and estimate from below by taking $\varphi_1 \equiv 1$ to get
$$\|\alpha_1 \oplus \alpha_2\|_{\mathcal{M}} \geq \sup_{\|\varphi_2\|_\infty \leq 1} \bigg[ \int_{\Omega_1} \mathrm{d}\alpha_1(x_1) \int_{\Omega_2} \varphi_2(x_2) \,\mathrm{d}\lambda_2 + |\Omega_1| \int_{\Omega_2} \varphi_2(x_2) \,\mathrm{d}\alpha_2(x_2) \bigg] \geq -|\Omega_2|\, \Big|\int_{\Omega_1} \mathrm{d}\alpha_1\Big| + |\Omega_1|\,\|\alpha_2\|_{\mathcal{M}},$$
which, together with $\big|\int_{\Omega_1} \mathrm{d}\alpha_1\big| \leq \|\alpha_1\|_{\mathcal{M}}$ and the first estimate, implies
$$|\Omega_1|\,\|\alpha_2\|_{\mathcal{M}} \leq \|\alpha_1 \oplus \alpha_2\|_{\mathcal{M}} + |\Omega_2|\,\|\alpha_1\|_{\mathcal{M}} \leq 2\,\|\alpha_1 \oplus \alpha_2\|_{\mathcal{M}},$$
which completes the proof. ⊓⊔
The next lemma will be used to show that the negative part of the minimizer of (D) does not have a singular part.

Lemma 2.8 Let $c \in L^1(\Omega)$ and $\alpha_i \in \mathcal{M}(\Omega_i)$ for $i \in \{1, 2\}$ with Lebesgue decompositions $\alpha_i = f_i + \eta_i$ satisfying $f_i \ll \lambda$ and $\eta_i \perp \lambda$ for $i \in \{1, 2\}$.
1. It holds that
$$(\alpha_1 \oplus \alpha_2 - c)_+ = \big(f_1 \oplus f_2 - c + (\eta_1)_+ \oplus (\eta_2)_+\big)_+. \tag{13}$$
2. If $(\alpha_i)_+$ is absolutely continuous for $i = 1, 2$, then for $\tilde\alpha_i = \alpha_i - (\eta_i)_-$, $i = 1, 2$, it holds that
$$\Phi(\tilde\alpha_1, \tilde\alpha_2) \leq \Phi(\alpha_1, \alpha_2).$$
Proof We first prove point 1. The measures $f_i, \eta_i$ exist by Lebesgue's decomposition theorem, see Theorem 1.155 in [12]. We combine these decompositions with $\alpha_1 \oplus \alpha_2 = \alpha_1 \otimes \lambda + \lambda \otimes \alpha_2$ to arrive at Lebesgue's decomposition of $\alpha_1 \oplus \alpha_2$ with respect to $\lambda \otimes \lambda$, namely
$$\alpha_1 \oplus \alpha_2 - c = f_1 \oplus f_2 - c + \eta_1 \oplus \eta_2, \tag{14}$$
$$f_1 \oplus f_2 - c \ll \lambda \otimes \lambda, \tag{15}$$
$$\eta_1 \oplus \eta_2 \perp \lambda \otimes \lambda \tag{16}$$
(which holds true because $c \in L^1(\Omega) \hookrightarrow \mathcal{M}(\Omega)$). Now, we consider the Hahn-Jordan decomposition of $\eta_1$,
$$\eta_1 = (\eta_1)_+ - (\eta_1)_-, \qquad (\eta_1)_+ \perp (\eta_1)_-, \tag{17}$$
and obtain from (14) that
$$\begin{aligned} \alpha_1 \oplus \alpha_2 - c &= (f_1 + \eta_1) \oplus (f_2 + \eta_2) - c = f_1 \oplus f_2 + \eta_1 \oplus \eta_2 - c \\ &= f_1 \oplus f_2 + \big((\eta_1)_+ - (\eta_1)_-\big) \oplus \eta_2 - c \\ &= f_1 \oplus f_2 + (\eta_1)_+ \otimes \lambda - (\eta_1)_- \otimes \lambda + \lambda \otimes \eta_2 - c \\ &= f_1 \oplus f_2 - c + (\eta_1)_+ \oplus \eta_2 - (\eta_1)_- \otimes \lambda. \end{aligned}$$
Furthermore,
$$(\eta_1)_- \otimes \lambda \perp f_1 \oplus f_2 - c + (\eta_1)_+ \oplus \eta_2,$$
where the singularity with respect to $f_1 \oplus f_2 - c$ is due to (15) and (16), and the singularity with respect to $(\eta_1)_+ \oplus \eta_2$ is due to (17). Thus, $(\eta_1)_- \otimes \lambda$ does not contribute to the positive part of $\alpha_1 \oplus \alpha_2 - c$, and arguing analogously for $(\eta_2)_-$ yields (13).
Proposition 2.9 Under Assumption 1 the minimization problem (D’) admits a solu-
tion (α1∗ , α2∗ ) ∈ L1 (Ω1 ) × L1 (Ω2 ).
Proof We proceed via the classical direct method of the calculus of variations. For this purpose, let $\{(\alpha_1^n, \alpha_2^n)\} \subset L^1(\Omega_1) \times L^1(\Omega_2)$ with $(\alpha_1^n \oplus \alpha_2^n - c)_+ \in L^2(\Omega)$ be a minimizing sequence for (D'), where we shift $\alpha_1^n$ and $\alpha_2^n$ by adding and subtracting constants such that we obtain $\int_{\Omega_2} \alpha_2^n \,\mathrm{d}\lambda_2 = 0$. Note that, due to its additive structure, this does not change the objective Φ in (D'), cf. Remark 2.4.

Next, let us define $w^n := \alpha_1^n \oplus \alpha_2^n - c$. Then, thanks to (9) and Lemma 2.5, the sequence $\{w^n\}$ is bounded in $L^1(\Omega)$. Hence, there is a weakly* converging subsequence, which we denote by the same symbol w.l.o.g., i.e. $w^n \rightharpoonup^* \tilde w$ in $\mathcal{M}(\Omega)$. Now, Lemma 2.6 applies, giving that
$$\tilde w_+ \in L^2(\Omega), \tag{18}$$
$$G(\tilde w) \leq \liminf_{n\to\infty} G(w^n). \tag{19}$$
Since $\{w^n\}$ is bounded in $\mathcal{M}(\Omega)$, the same holds for $\{\alpha_1^n \oplus \alpha_2^n\}$ and, as $\alpha_2^n$ is normalized, Lemma 2.7 gives that $\{\alpha_i^n\}$ is bounded in $\mathcal{M}(\Omega_i)$, $i = 1, 2$. Therefore, we can select a further (sub-)subsequence, still denoted by the same symbol to ease notation, such that $\alpha_i^n \rightharpoonup^* \tilde\alpha_i$ in $\mathcal{M}(\Omega_i)$, $i = 1, 2$.
Since the mapping $\mathcal{M}(\Omega_1) \times \mathcal{M}(\Omega_2) \ni (\alpha_1, \alpha_2) \mapsto \alpha_1 \oplus \alpha_2 \in \mathcal{M}(\Omega)$ is the adjoint of the projection mapping $C(\Omega) \ni \varphi \mapsto \big(\int_{\Omega_2} \varphi \,\mathrm{d}\lambda_2, \int_{\Omega_1} \varphi \,\mathrm{d}\lambda_1\big) \in C(\Omega_1) \times C(\Omega_2)$, see Remark 2.2, it is weakly* continuous, so that $\alpha_1^n \oplus \alpha_2^n \rightharpoonup^* \tilde\alpha_1 \oplus \tilde\alpha_2$ in $\mathcal{M}(\Omega)$ and hence $\tilde w = \tilde\alpha_1 \oplus \tilde\alpha_2 - c$.
Next, we investigate the singular parts of $\tilde\alpha_1$ and $\tilde\alpha_2$. We start with the positive part and employ Lebesgue's decomposition of $\tilde\alpha_1$ and $\tilde\alpha_2$:
$$\tilde\alpha_i = \alpha_i^* + \tilde\eta_i, \qquad \alpha_i^* \ll \lambda_i, \quad \tilde\eta_i \perp \lambda_i, \quad i = 1, 2.$$
Proof We again consider the positive and the negative part separately and start with $(\alpha_1^*)_-$. Let $\varphi \in C_c^\infty(\Omega_1)$ and $t > 0$ be fixed, but arbitrary. Then, thanks to
$$0 \leq \big((\alpha_1^* + t\varphi) \oplus \alpha_2^* - c\big)_+ \leq (\alpha_1^* \oplus \alpha_2^* - c)_+ + t\,\varphi_+,$$
Proposition 2.9 implies that $((\alpha_1^* + t\varphi) \oplus \alpha_2^* - c)_+ \in L^2(\Omega)$, so that $(\alpha_1^* + t\varphi, \alpha_2^*)$ is feasible for (D'). Therefore, the optimality of $(\alpha_1^*, \alpha_2^*)$ for (D') yields
$$\int_\Omega \frac{1}{2t}\Big[\big((\alpha_1^* + t\varphi) \oplus \alpha_2^* - c\big)_+^2 - \big(\alpha_1^* \oplus \alpha_2^* - c\big)_+^2\Big] \,\mathrm{d}\lambda - \gamma \int_{\Omega_1} \mu_1 \varphi \,\mathrm{d}\lambda_1 \geq 0 \qquad \forall\, t > 0.$$
Owing to the continuous differentiability of $\mathbb{R} \ni r \mapsto r_+^2 \in \mathbb{R}$ (with derivative $2r_+$), the first integrand converges to $(\alpha_1^* \oplus \alpha_2^* - c)_+ \varphi$ λ-a.e. in Ω for $t \searrow 0$. Moreover, the Lipschitz continuity of the max-function gives that, for $t \leq 1$,
$$\frac{1}{2t}\Big|\big((\alpha_1^* + t\varphi) \oplus \alpha_2^* - c\big)_+^2 - \big(\alpha_1^* \oplus \alpha_2^* - c\big)_+^2\Big| \leq |\varphi|^2 + 2\,|\varphi|\,(\alpha_1^* \oplus \alpha_2^* - c)_+ \qquad \text{a.e. in } \Omega,$$
so that Lebesgue's dominated convergence theorem allows passing to the limit $t \searrow 0$ in the inequality above.
Since $\varphi \in C_c^\infty(\Omega_1)$ was arbitrary, the fundamental lemma of the calculus of variations thus gives
$$\int_{\Omega_2} (\alpha_1^* \oplus \alpha_2^* - c)_+ \,\mathrm{d}\lambda_2 = \gamma\mu_1 \qquad \lambda_1\text{-a.e. in } \Omega_1, \tag{23}$$
where $\delta > 0$ is the threshold for $\mu_1$ from Assumption 1. Now assume that $\alpha_1^* \leq -N$ $\lambda_1$-a.e. on a set $E \subset \Omega_1$ of positive Lebesgue measure. Then
$$\int_{\Omega_2} (\alpha_1^* \oplus \alpha_2^* - c)_+ \,\mathrm{d}\lambda_2 \leq \int_{\Omega_2} \big((-N) \oplus \alpha_2^* - c\big)_+ \,\mathrm{d}\lambda_2 < \gamma\delta \leq \gamma\mu_1 \qquad \lambda_1\text{-a.e. in } E,$$
which contradicts (23). Therefore, $\alpha_1^* > -N$ $\lambda_1$-a.e. in $\Omega_1$, which even implies that $(\alpha_1^*)_- \in L^\infty(\Omega_1)$. Concerning $(\alpha_2^*)_-$, one can argue in exactly the same way to conclude that $(\alpha_2^*)_- \in L^\infty(\Omega_2)$, too.
For the positive parts we find, using (21) and the boundedness of the negative parts proven above, that $(\alpha_i^*)_+ \in L^2(\Omega_i)$, $i = 1, 2$. Note that the constant shift, potentially needed to ensure $\int_{\Omega_2} \alpha_2^* \,\mathrm{d}\lambda_2 = 0$, has no effect on the equation in (21) due to the additive structure of ⊕.

We have thus shown that $(\alpha_1^*, \alpha_2^*)$ is feasible for (D). Since $(\alpha_1^*, \alpha_2^*)$ solves (D'), whose objective is the same as in (D) while its feasible set is larger, this implies that we have found a solution to (D). ⊓⊔
We now show that, if $\pi^*$ is of the form $\pi^* = \gamma^{-1}(\alpha_1^* \oplus \alpha_2^* - c)_+$ with two functions $\alpha_i^* \in L^2(\Omega_i)$, $i = 1, 2$, and has the marginals $\mu_1$ and $\mu_2$, respectively, then it solves the necessary and sufficient optimality conditions of the primal problem (1) in the form of the following variational inequality:
$$\pi^* \in F, \qquad \langle \gamma\pi^* + c,\ \pi - \pi^* \rangle_{L^2} \geq 0 \quad \forall\, \pi \in F. \tag{VI}$$
Herein, F is the (convex) feasible set of (1), i.e.
$$F := \Big\{ \pi \in L^2(\Omega) \,:\, \pi \geq 0 \ \lambda\text{-a.e. in } \Omega,\ \int_{\Omega_2} \pi \,\mathrm{d}\lambda_2 = \mu_1 \ \lambda_1\text{-a.e. in } \Omega_1,\ \int_{\Omega_1} \pi \,\mathrm{d}\lambda_1 = \mu_2 \ \lambda_2\text{-a.e. in } \Omega_2 \Big\}.$$
For this purpose, let $\pi \in F$ be fixed but arbitrary. Multiplying the equality constraints in F with $\alpha_1^*$ and $\alpha_2^*$, respectively, integrating the arising equations, and adding them yields
$$\begin{aligned} \int_{\Omega_1} \mu_1 \alpha_1^* \,\mathrm{d}\lambda_1 + \int_{\Omega_2} \mu_2 \alpha_2^* \,\mathrm{d}\lambda_2 &= \int_\Omega \pi\,(\alpha_1^* \oplus \alpha_2^*) \,\mathrm{d}\lambda \\ &= \int_\Omega \pi\,\big[(\alpha_1^* \oplus \alpha_2^* - c)_+ + c\big] \,\mathrm{d}\lambda - \int_\Omega \pi\,(\alpha_1^* \oplus \alpha_2^* - c)_- \,\mathrm{d}\lambda \\ &\leq \int_\Omega \pi\,(\gamma\pi^* + c) \,\mathrm{d}\lambda, \end{aligned} \tag{25}$$
where we used $\pi \geq 0$ for the last inequality. Using the feasibility of $\pi^*$, we find similarly
$$\begin{aligned} \int_{\Omega_1} \mu_1 \alpha_1^* \,\mathrm{d}\lambda_1 + \int_{\Omega_2} \mu_2 \alpha_2^* \,\mathrm{d}\lambda_2 &= \int_\Omega \pi^*\big[(\alpha_1^* \oplus \alpha_2^* - c) + c\big] \,\mathrm{d}\lambda \\ &= \int_\Omega \gamma^{-1}(\alpha_1^* \oplus \alpha_2^* - c)_+ \big[(\alpha_1^* \oplus \alpha_2^* - c) + c\big] \,\mathrm{d}\lambda \\ &= \int_\Omega \pi^*(\gamma\pi^* + c) \,\mathrm{d}\lambda. \end{aligned} \tag{26}$$
Combining (25) and (26) now yields (VI). As (1) is a strictly convex minimization
problem, this shows that, if π ∗ has the form π ∗ = γ −1 (α1∗ ⊕ α2∗ − c)+ with functions
αi∗ ∈ L2 (Ωi ) and satisfies π ∗ ∈ F , then it is a solution of (1). On the other
hand, we know from Theorem 2.10 that, under Assumption 1 (more or less needed
for the existence of solutions of (1) anyway), there always exist αi∗ ∈ L2 (Ωi ) so
that π ∗ = γ −1 (α1∗ ⊕ α2∗ − c)+ satisfies the equality constraints in F . Therefore, in
summary we have deduced the following:
Theorem 2.11 (Necessary and Sufficient Optimality Conditions for (1)) Under Assumption 1, $\pi^* \in L^2(\Omega)$ is a solution of (1) if and only if there exist functions $\alpha_i^* \in L^2(\Omega_i)$, $i = 1, 2$, such that the following optimality system is fulfilled:
$$\pi^* - \tfrac{1}{\gamma}\big(\alpha_1^* \oplus \alpha_2^* - c\big)_+ = 0 \qquad \lambda\text{-a.e. in } \Omega, \tag{27a}$$
$$\int_{\Omega_2} \big(\alpha_1^* \oplus \alpha_2^* - c\big)_+ \,\mathrm{d}\lambda_2 = \gamma\mu_1 \qquad \lambda_1\text{-a.e. in } \Omega_1, \tag{27b}$$
$$\int_{\Omega_1} \big(\alpha_1^* \oplus \alpha_2^* - c\big)_+ \,\mathrm{d}\lambda_1 = \gamma\mu_2 \qquad \lambda_2\text{-a.e. in } \Omega_2. \tag{27c}$$
The significance of Theorem 2.11 lies in the fact that we can characterize optimality of π by just two equalities in $L^2(\Omega_1)$ and $L^2(\Omega_2)$, respectively, namely (27b) and (27c). Thus, we effectively reduce the size of the problem from searching for one function on $\Omega = \Omega_1 \times \Omega_2$ to searching for two functions, one on $\Omega_1$ and one on $\Omega_2$ (similarly as for entropic regularization, cf. [4]). This will be exploited numerically in Section 3.
As seen before, the dual problem in (D) is not uniquely solvable. One source of non-uniqueness is of course the kernel of the map $(\alpha_1, \alpha_2) \mapsto \alpha_1 \oplus \alpha_2$. This kernel is one-dimensional and is spanned by the pair of constant functions $(1, -1)$, which could easily be taken into account in an algorithmic framework. However, there is another source of non-uniqueness due to the max-operator that cuts off the negative part. Here is a simple example where dual solutions are not unique: For $\Omega_1 = \Omega_2 = [0, 1]$, $\mu_1 = \mu_2 \equiv 1$, $\gamma = 1$, and
$$c(x, y) := \begin{cases} C, & \text{if } \tfrac12 \leq x \leq 1 \text{ and } \tfrac12 \leq y \leq 1, \\ 0, & \text{else}, \end{cases} \qquad \text{with } C > 4,$$
one can show by a straightforward calculation that, for every $\delta \in [0, \tfrac{C-4}{2}]$, the tuple
$$\alpha_1^*(x) = \begin{cases} -1 - \delta, & \text{if } x \in [0, \tfrac12), \\ 1 + \delta, & \text{if } x \in [\tfrac12, 1], \end{cases} \qquad \alpha_2^*(y) = \begin{cases} 1 - \delta, & \text{if } y \in [0, \tfrac12), \\ 3 + \delta, & \text{if } y \in [\tfrac12, 1], \end{cases}$$
solves the optimality system (27b)–(27c).
solves the optimality system (27b)–(27c). This shows that the potential structure
of non-uniqueness might become fairly intricate. A situation like this can certainly
happen in the discretized problem we will derive in Section 2.4 and can lead to
problems when we derive algorithms for the discrete problem since non-unique
solutions imply a degenerate Hessian at the optimum.
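As a sanity check, here is a small numerical verification of the family of dual solutions in the example above (a sketch assuming the reconstruction given there, in particular γ = 1; the value C = 6 and the grid size are arbitrary choices):

```python
import numpy as np

C, n = 6.0, 2000
x = (np.arange(n) + 0.5) / n                  # midpoint grid on [0, 1]
cost = C * ((x[:, None] >= 0.5) & (x[None, :] >= 0.5))
for delta in [0.0, 0.5, (C - 4) / 2]:
    a1 = np.where(x < 0.5, -1 - delta, 1 + delta)
    a2 = np.where(x < 0.5, 1 - delta, 3 + delta)
    plan = np.maximum(a1[:, None] + a2[None, :] - cost, 0.0)
    # both marginals of the plan equal gamma * mu = 1, for every delta
    assert np.allclose(plan.mean(axis=1), 1.0)
    assert np.allclose(plan.mean(axis=0), 1.0)
```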
Therefore, we investigate the following regularization of the dual problem:
$$\min \ \Phi_\varepsilon(\alpha_1, \alpha_2) := \Phi(\alpha_1, \alpha_2) + \frac{\varepsilon}{2}\Big(\|\alpha_1\|_{L^2(\Omega_1)}^2 + \|\alpha_2\|_{L^2(\Omega_2)}^2\Big) \quad \text{s.t.} \quad \alpha_i \in L^2(\Omega_i), \ i = 1, 2. \tag{Dε}$$
Proposition 2.12 Let $\{\varepsilon_n\} \subset \mathbb{R}_+$ be a sequence converging to zero and denote the solutions of (Dε) with $\varepsilon = \varepsilon_n$ by $(\alpha_1^n, \alpha_2^n) \in L^2(\Omega_1) \times L^2(\Omega_2)$. Then the sequence $\{(\alpha_1^n, \alpha_2^n)\}$ admits a weak accumulation point, and every weak accumulation point is also a strong one and a solution of the original dual problem (D).
Proof Let $(\alpha_1^*, \alpha_2^*) \in L^2(\Omega_1) \times L^2(\Omega_2)$ denote an arbitrary globally optimal solution of (D) (whose existence is guaranteed by Theorem 2.10). Then the optimality of $(\alpha_1^*, \alpha_2^*)$ for (D) and of $(\alpha_1^n, \alpha_2^n)$ for (Dε) (with $\varepsilon = \varepsilon_n$) gives
$$\Phi(\alpha_1^*, \alpha_2^*) + \frac{\varepsilon_n}{2}\Big(\|\alpha_1^n\|_{L^2(\Omega_1)}^2 + \|\alpha_2^n\|_{L^2(\Omega_2)}^2\Big) \leq \Phi_{\varepsilon_n}(\alpha_1^n, \alpha_2^n) \leq \Phi_{\varepsilon_n}(\alpha_1^*, \alpha_2^*),$$
which implies
$$\|\alpha_1^n\|_{L^2(\Omega_1)}^2 + \|\alpha_2^n\|_{L^2(\Omega_2)}^2 \leq \|\alpha_1^*\|_{L^2(\Omega_1)}^2 + \|\alpha_2^*\|_{L^2(\Omega_2)}^2. \tag{28}$$
Thus, $\{(\alpha_1^n, \alpha_2^n)\}$ is bounded in $L^2(\Omega_1) \times L^2(\Omega_2)$, which in turn gives the existence of a weak accumulation point as claimed.
Now assume that $(\tilde\alpha_1, \tilde\alpha_2)$ is such a weak accumulation point, i.e.,
$$(\alpha_1^n, \alpha_2^n) \rightharpoonup (\tilde\alpha_1, \tilde\alpha_2) \quad \text{in } L^2(\Omega_1) \times L^2(\Omega_2) \tag{29}$$
for a subsequence. Using again the optimality of $(\alpha_1^*, \alpha_2^*)$ and $(\alpha_1^n, \alpha_2^n)$, respectively, we obtain
$$\Phi(\alpha_1^*, \alpha_2^*) \leq \Phi(\alpha_1^n, \alpha_2^n) \leq \Phi_{\varepsilon_n}(\alpha_1^n, \alpha_2^n) \leq \Phi_{\varepsilon_n}(\alpha_1^*, \alpha_2^*) \to \Phi(\alpha_1^*, \alpha_2^*). \tag{30}$$
On the other hand, by convexity and weak lower semicontinuity of Φ, we get from (29) and (30) that
$$\Phi(\tilde\alpha_1, \tilde\alpha_2) \leq \liminf_{n\to\infty} \Phi(\alpha_1^n, \alpha_2^n) = \lim_{n\to\infty} \Phi(\alpha_1^n, \alpha_2^n) = \Phi(\alpha_1^*, \alpha_2^*),$$
which in turn gives the optimality of the weak limit. Estimate (28) for the choice $(\alpha_1^*, \alpha_2^*) = (\tilde\alpha_1, \tilde\alpha_2)$ shows that
$$\|\alpha_1^n\|_{L^2(\Omega_1)}^2 + \|\alpha_2^n\|_{L^2(\Omega_2)}^2 \leq \|\tilde\alpha_1\|_{L^2(\Omega_1)}^2 + \|\tilde\alpha_2\|_{L^2(\Omega_2)}^2,$$
and hence
$$\liminf_{n\to\infty}\Big(\|\alpha_1^n\|_{L^2(\Omega_1)}^2 + \|\alpha_2^n\|_{L^2(\Omega_2)}^2\Big) \leq \|\tilde\alpha_1\|_{L^2(\Omega_1)}^2 + \|\tilde\alpha_2\|_{L^2(\Omega_2)}^2.$$
If the subsequence in (29) did not converge strongly, then the weak convergence would yield
$$\|\tilde\alpha_1\|_{L^2(\Omega_1)}^2 + \|\tilde\alpha_2\|_{L^2(\Omega_2)}^2 < \liminf_{n\to\infty}\Big(\|\alpha_1^n\|_{L^2(\Omega_1)}^2 + \|\alpha_2^n\|_{L^2(\Omega_2)}^2\Big),$$
a contradiction. Thus, every weak accumulation point is also a strong one. ⊓⊔
Theorem 2.13 Let $\{\varepsilon_n\} \subset \mathbb{R}_+$ be a sequence converging to zero and denote the solutions of (Dε) with $\varepsilon = \varepsilon_n$ again by $(\alpha_1^n, \alpha_2^n) \in L^2(\Omega_1) \times L^2(\Omega_2)$. Moreover, define
$$\pi_n := \tfrac{1}{\gamma}\big(\alpha_1^n \oplus \alpha_2^n - c\big)_+. \tag{31}$$
Then $\pi_n$ converges strongly in $L^2(\Omega)$ to the unique solution $\pi^*$ of (1).
Proof From (28), we know that $\{(\alpha_1^n, \alpha_2^n)\}$ is bounded and hence $\{\pi_n\}$ is bounded in $L^2(\Omega)$. Thus,
$$\pi_n \rightharpoonup \tilde\pi \quad \text{in } L^2(\Omega) \tag{32}$$
for some subsequence. Now we show that $\tilde\pi$ is optimal for (1). Weak closedness of $\{\pi \in L^2(\Omega) : \pi(x_1, x_2) \geq 0 \text{ a.e. in } \Omega\}$ implies $\tilde\pi \geq 0$. Integrating the first-order optimality conditions for (Dε),
$$\int_{\Omega_2} \big(\alpha_1^n \oplus \alpha_2^n - c\big)_+ \,\mathrm{d}\lambda_2 + \varepsilon_n \alpha_1^n = \gamma\mu_1 \qquad \lambda_1\text{-a.e. in } \Omega_1, \tag{33}$$
$$\int_{\Omega_1} \big(\alpha_1^n \oplus \alpha_2^n - c\big)_+ \,\mathrm{d}\lambda_1 + \varepsilon_n \alpha_2^n = \gamma\mu_2 \qquad \lambda_2\text{-a.e. in } \Omega_2, \tag{34}$$
against some $\varphi_1 \in C_c^\infty(\Omega_1)$, inserting the definition of $\pi_n$, and integrating over $\Omega_1$ yields
$$\int_{\Omega_1} \Big(\int_{\Omega_2} \pi_n \,\mathrm{d}\lambda_2\Big)\,\varphi_1 \,\mathrm{d}\lambda_1 = \int_{\Omega_1} \mu_1 \varphi_1 \,\mathrm{d}\lambda_1 - \frac{\varepsilon_n}{\gamma} \int_{\Omega_1} \alpha_1^n \varphi_1 \,\mathrm{d}\lambda_1.$$
Passing to the limit, we obtain
$$\int_{\Omega_1} \Big(\int_{\Omega_2} \tilde\pi \,\mathrm{d}\lambda_2\Big)\,\varphi_1 \,\mathrm{d}\lambda_1 = \int_{\Omega_1} \mu_1 \varphi_1 \,\mathrm{d}\lambda_1,$$
and thus $\tilde\pi$ satisfies the first equality constraint in (1). The second equality constraint can be verified analogously.
To show optimality of $\tilde\pi$, we test the optimality conditions (33) and (34) with $\alpha_1^n$ and $\alpha_2^n$, respectively, and get
$$\begin{aligned} \Phi_{\varepsilon_n}(\alpha_1^n, \alpha_2^n) &= \frac{\gamma^2}{2}\|\pi_n\|_{L^2(\Omega)}^2 - \gamma\int_\Omega \pi_n\,(\alpha_1^n \oplus \alpha_2^n) \,\mathrm{d}\lambda - \frac{\varepsilon_n}{2}\|\alpha_1^n\|_{L^2(\Omega_1)}^2 - \frac{\varepsilon_n}{2}\|\alpha_2^n\|_{L^2(\Omega_2)}^2 \\ &= -\frac{\gamma^2}{2}\|\pi_n\|_{L^2(\Omega)}^2 - \gamma\int_\Omega c\,\pi_n \,\mathrm{d}\lambda - \frac{\varepsilon_n}{2}\|\alpha_1^n\|_{L^2(\Omega_1)}^2 - \frac{\varepsilon_n}{2}\|\alpha_2^n\|_{L^2(\Omega_2)}^2 \\ &= -\gamma E_\gamma(\pi_n) - \frac{\varepsilon_n}{2}\|\alpha_1^n\|_{L^2(\Omega_1)}^2 - \frac{\varepsilon_n}{2}\|\alpha_2^n\|_{L^2(\Omega_2)}^2. \end{aligned}$$
Analogously, testing the optimality system (27b)–(27c) with $\alpha_1^*$ and $\alpha_2^*$ yields
$$\Phi(\alpha_1^*, \alpha_2^*) = -\gamma E_\gamma(\pi^*),$$
where $\pi^* \in L^2(\Omega)$ is the unique solution of (1) and $(\alpha_1^*, \alpha_2^*) \in L^2(\Omega_1) \times L^2(\Omega_2)$ solves the dual problem (D). Now, putting everything so far together, we obtain
$$\lim_{n\to\infty} E_\gamma(\pi_n) = \lim_{n\to\infty}\Big(-\tfrac{1}{\gamma}\Phi_{\varepsilon_n}(\alpha_1^n, \alpha_2^n) - \tfrac{\varepsilon_n}{2\gamma}\|\alpha_1^n\|_{L^2(\Omega_1)}^2 - \tfrac{\varepsilon_n}{2\gamma}\|\alpha_2^n\|_{L^2(\Omega_2)}^2\Big) = -\tfrac{1}{\gamma}\Phi(\alpha_1^*, \alpha_2^*) = E_\gamma(\pi^*).$$
This gives $E_\gamma(\tilde\pi) \leq \liminf_{n\to\infty} E_\gamma(\pi_n) = E_\gamma(\pi^*)$ by weak lower semicontinuity of $E_\gamma$, hence the optimality of $\tilde\pi$, and by strict convexity also uniqueness, i.e. $\tilde\pi = \pi^*$. Thus, the weak limit is unique, and a well-known argument by contradiction therefore implies the weak convergence of the whole sequence $\{\pi_n\}$ to $\pi^*$. Finally, strong convergence follows from a standard argument. ⊓⊔
3 Algorithms
The optimality system (36b), (36c) for the smooth and convex problem (D) can be solved by different methods. In [3] the authors propose to use a generic L-BFGS solver and also derive an alternating minimization scheme, which is similar to the non-linear Gauss-Seidel method in the next section but differs slightly in the numerical realization; [20] also uses an off-the-shelf solver. Here we propose methods that exploit the special structure of the optimality system: a non-linear Gauss-Seidel method and a semismooth Newton method.
The method in this section is similar to the one described in the appendix of [3], but we describe it here for the sake of completeness. A close look at the optimality system
$$\sum_{j=1}^{N} (\alpha_i + \beta_j - c_{ij})_+ = \gamma\nu_i, \qquad i = 1, \dots, M, \tag{38a}$$
$$\sum_{i=1}^{M} (\alpha_i + \beta_j - c_{ij})_+ = \gamma\mu_j, \qquad j = 1, \dots, N, \tag{38b}$$
shows that we can solve all M equations in (38a) for the $\alpha_i$ in parallel (for fixed β), since the i-th equation depends on $\alpha_i$ only. Similarly, all N equations in (38b) can be solved for the $\beta_j$ if α is fixed. Hence, we can perform a non-linear Gauss-Seidel method for these non-smooth equations (also known as alternating minimization, nonlinear SOR, or coordinate descent for Φ [6, 25]), i.e. alternatingly solving the equations (38a) for α (for fixed β) and then the equations (38b) for β (for fixed α). The whole method is stated in Algorithm 1; a sketch is also given below. Since Φ is convex with Lipschitz continuous gradient (cf. Lemma 2.14), the convergence of the algorithm follows from results in [2].
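The following NumPy sketch illustrates this iteration (our own condensed rendering, not Algorithm 1 verbatim); it assumes a routine solve_scalar(y, b) for the one-dimensional equations of type (39) treated next:

```python
import numpy as np

def gauss_seidel_qot(c, nu, mu, gamma, solve_scalar, iters=1000, tol=1e-8):
    """Nonlinear Gauss-Seidel for (38a)-(38b): with beta fixed, the i-th
    row equation determines alpha_i alone, and vice versa; solve_scalar
    solves the one-dimensional equations of type (39)."""
    M, N = c.shape
    alpha, beta = np.zeros(M), np.zeros(N)
    for _ in range(iters):
        for i in range(M):   # (38a): sum_j (alpha_i + beta_j - c_ij)_+ = gamma nu_i
            alpha[i] = solve_scalar(c[i, :] - beta, gamma * nu[i])
        for j in range(N):   # (38b): sum_i (alpha_i + beta_j - c_ij)_+ = gamma mu_j
            beta[j] = solve_scalar(c[:, j] - alpha, gamma * mu[j])
        plan = np.maximum(alpha[:, None] + beta[None, :] - c, 0.0) / gamma
        if (np.abs(plan.sum(axis=1) - nu).max() < tol
                and np.abs(plan.sum(axis=0) - mu).max() < tol):
            break
    return alpha, beta, plan
```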
Of course, one can solve this problem by bisection, but here are two other, more efficient methods to solve equations of the type (39).

Direct search. If we denote by $y_{[j]}$ the j-th smallest entry of y (i.e. we sort y in ascending order), we get that
$$f(x) = \sum_{j=1}^{n} (x - y_{[j]})_+ = \begin{cases} 0, & x \leq y_{[1]}, \\ kx - \sum_{j=1}^{k} y_{[j]}, & y_{[k]} \leq x \leq y_{[k+1]}, \ k = 1, \dots, n-1, \\ nx - \sum_{j=1}^{n} y_{[j]}, & x \geq y_{[n]}. \end{cases}$$
To obtain the solution of (39) we evaluate f at the break points $y_{[j]}$ until we find the interval $[y_{[k]}, y_{[k+1]})$ in which the solution lies (by finding k such that $f(y_{[k]}) \leq b < f(y_{[k+1]})$), and then set
$$x = \frac{b + \sum_{j=1}^{k} y_{[j]}}{k}.$$
The complexity of the method is dominated by the sorting of the vector y, i.e. it is $O(n \log(n))$.
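A sketch of this direct search in NumPy (the function name and interface are our own; it solves $\sum_j (x - y_j)_+ = b$ for $b > 0$):

```python
import numpy as np

def solve_scalar(y, b):
    """Direct search for the x solving f(x) = sum_j (x - y_j)_+ = b (b > 0):
    sort y, evaluate f at the break points, then invert the linear piece."""
    y = np.sort(np.asarray(y, dtype=float))
    prefix = np.cumsum(y)
    # f at the sorted break points: f(y_[k]) = (k-1) y_[k] - sum_{j<k} y_[j]
    f_break = np.arange(y.size) * y - np.concatenate(([0.0], prefix[:-1]))
    # number of active terms on the linear piece containing the solution
    k = int(np.searchsorted(f_break, b, side="right"))
    return (b + prefix[k - 1]) / k
```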
Semismooth Newton. Although f is non-smooth, we may perform Newton's method here. The function f is piecewise linear, and on each interval $(y_{[j]}, y_{[j+1]})$ it has slope j (a simple situation with n = 3 is shown in Figure 1). At the break points we may define $f'(y_{[j]}) = j$ and then we iterate
$$x^{k+1} = x^k - \frac{f(x^k)}{f'(x^k)}.$$

Fig. 1 The piecewise linear function f(x) for n = 3 with break points $y_{[1]}, y_{[2]}, y_{[3]}$.
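A compact sketch of this one-dimensional semismooth Newton iteration, here written for $f(x) = b$ rather than $f(x) = 0$ (our own formulation; starting at $\max_j y_j$ keeps every iterate on a piece with positive slope):

```python
import numpy as np

def newton_scalar(y, b, tol=1e-12, max_iter=100):
    """Semismooth Newton for f(x) = sum_j (x - y_j)_+ = b, with the slope
    f'(x) taken as the number of active terms #{j : y_j <= x}."""
    y = np.asarray(y, dtype=float)
    x = y.max()                   # start on a piece with positive slope
    for _ in range(max_iter):
        active = y <= x
        fx = np.sum(x - y[active])
        if abs(fx - b) <= tol:
            break
        x -= (fx - b) / max(int(active.sum()), 1)
    return x
```

Since f is convex and piecewise linear, this iteration terminates after finitely many steps.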
As seen in Lemma 2.14, the mapping F is semismooth and hence, we may use a
semismooth Newton method [5, 7].
A simple calculation proves the following lemma.
Lemma 3.1 A Newton derivative of F from (37) at (α, β) is given by
$$G = \begin{pmatrix} \operatorname{diag}(\sigma \mathbb{1}_N) & \sigma \\ \sigma^\top & \operatorname{diag}(\sigma^\top \mathbb{1}_M) \end{pmatrix} \in \mathbb{R}^{(M+N)\times(M+N)},$$
where $\sigma \in \mathbb{R}^{M \times N}$ is given by
$$\sigma_{ij} = \begin{cases} 1, & \alpha_i + \beta_j - c_{ij} \geq 0, \\ 0, & \text{otherwise}. \end{cases}$$
A step of the semismooth Newton method for the solution of $F(\alpha, \beta) = 0$ would consist of setting
$$\begin{pmatrix} \alpha^{k+1} \\ \beta^{k+1} \end{pmatrix} = \begin{pmatrix} \alpha^k \\ \beta^k \end{pmatrix} - \begin{pmatrix} \delta_\alpha^k \\ \delta_\beta^k \end{pmatrix}, \qquad \text{where} \quad F(\alpha^k, \beta^k) = G \begin{pmatrix} \delta_\alpha^k \\ \delta_\beta^k \end{pmatrix}.$$
However, the next lemma shows that G has a non-trivial kernel.

Lemma 3.2 Let G be the Newton derivative of F at (α, β) defined in Lemma 3.1. Then the following holds true:
1. $G \in \mathbb{R}^{(M+N)\times(M+N)}$ is symmetric,
2. G is positive semi-definite,
3. $(a, b) \in \ker(G)$ if and only if $\sigma_{ij}(a_i + b_j) = 0$ for all $1 \leq i \leq M$, $1 \leq j \leq N$.
Proof Symmetry of G is clear by construction. To see that G is positive semi-definite, we calculate
$$(a, b)^\top G\,(a, b) = \sum_{j=1}^{N}\sum_{i=1}^{M} \sigma_{ij} a_i^2 + \sum_{j=1}^{N}\sum_{i=1}^{M} \sigma_{ij} b_j^2 + 2\sum_{j=1}^{N}\sum_{i=1}^{M} \sigma_{ij} a_i b_j = \sum_{j=1}^{N}\sum_{i=1}^{M} \sigma_{ij} (a_i + b_j)^2 \geq 0.$$
4 Numerical examples
4.1 Illustration of γ → 0
In our first numerical example we illustrate how the solutions $\pi^*$ of the regularized problem converge for vanishing regularization parameter $\gamma \to 0$. We generate some marginals, fix a transport cost, compute solutions of the discretized transport problems (35) for a sequence $\gamma_n \to 0$, and illustrate the optimal transport plans (and the related regularized transport costs). Our marginals are non-negative functions sampled at equidistant points $x_i$, $y_j$ in the interval $[0, 1]$ with $M = N = 400$, and the cost $c_{ij} = (x_i - y_j)^2$ is the squared distance between the sampling points. The results are shown in Figure 2. One observes that the optimal transport plans converge to a measure that is singular and supported on the graph of a monotonically increasing function, exactly as the fundamental theorem of optimal transport [1] predicts.
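For illustration, a hypothetical driver for this experiment reusing the gauss_seidel_qot and solve_scalar sketches from Section 3 (the marginals below are stand-ins of our own, not the ones used for Figure 2):

```python
import numpy as np

N = 400
x = (np.arange(N) + 0.5) / N                 # equidistant sample points
c = (x[:, None] - x[None, :]) ** 2           # squared-distance cost
nu = 1.0 / (1.0 + 200 * (x - 0.3) ** 2)      # stand-in marginals
mu = 1.0 / (1.0 + 100 * (x - 0.7) ** 2)
nu, mu = nu / nu.sum(), mu / mu.sum()        # normalize total mass to 1
for gamma in [1e-1, 1e-2, 1e-3]:
    # small gamma may require many Gauss-Seidel sweeps to converge
    alpha, beta, plan = gauss_seidel_qot(c, nu, mu, gamma, solve_scalar)
    # plans concentrate near the graph of a monotone map as gamma -> 0
    print(gamma, np.count_nonzero(plan))
```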
We repeat the same experiment where the cost is the (non-squared) distance $c_{ij} = |x_i - y_j|$. Here we had to choose larger regularization parameters, as it turned out that values similar to Figure 2 would lead to almost indistinguishable results. The results are shown in Figure 3. Note the different structure of the transport plan (which is again in agreement with the predicted results from the fundamental theorem of optimal transport). In Figure 4 we show the results for the concave but increasing cost $c_{ij} = \sqrt{|x_i - y_j|}$ and again observe the expected effect that a concave transport cost encourages as much mass as possible to stay in place (as can be seen by the concentration of mass along the diagonal of the transport plan).
While we did not analyze our algorithms in the continuous case, we made an
experiment to see how the methods converge when we change the mesh size of the
discretization. To that end, we did a simple piecewise constant approximation of
the marginals, the cost and the transport plan as described in Appendix A. This
derivation shows that one has to scale up the marginals for finer discretization (or,
equivalently, scale down the regularization parameter γ ) to get consistent results.
We also took care to adapt the termination criteria so that we terminate the
algorithms when the continuous counterpart of the termination criteria is satisfied
(again, see Table 1 in Appendix A for details).
We used marginals $\mu, \nu : [0, 1] \to [0, \infty[$ of the form
$$\mu(x) = r\,\frac{1}{1 + m(x-a)^2}, \qquad \nu(x) = s\left(\frac{1}{1 + m_1(x-a_1)^2} + \frac{1}{1 + m_2(x-a_2)^2}\right).$$
Fig. 5 Number of iterations for the semismooth Newton method to achieve a desired accuracy, plotted over the discretization size M = N. Each graph corresponds to one instance of the problem.
Fig. 6 Number of iterations for the nonlinear Gauss-Seidel method to achieve a desired accuracy, plotted over the discretization size M = N. Each graph corresponds to one instance of the problem.

The optimal transport problem (1) with these two marginals does not fulfill Assumption 1, since the marginals are not $L^2$-functions. However, we can consider it as a discrete optimal transport problem in the form (35) when we denote
$c_{ij} = c(x_i, y_j)$ (for some cost c) and marginals $\mathbb{1}_M$ and $\mathbb{1}_N$, respectively. We solve this discrete optimal transport problem and obtain a transport plan $\pi^*$. Since we use quadratic regularization, the plan will be sparse and hence we can visualize it by plotting arrows from $x_i$ to $y_j$, making the thickness of the arrows proportional to the size of the entry $\pi^*_{ij}$. In other words: the thickness of the arrow from $x_i$ to $y_j$ indicates how much of the mass in $x_i$ has been transported to $y_j$.
In Figure 7 we show the result for N = 80 samples from an anisotropic Gaussian distribution (centered at the origin) and M = 120 samples from a uniform distribution on a segment of an annulus. We used $c(x_i, y_j) = \|x_i - y_j\|_2$ with the Euclidean norm and regularization parameter γ = 1. The resulting plan $\pi^*$ has 212 non-zero entries. For comparison, we show the result of entropically regularized optimal transport in the same situation in Figure 8. We used γ = 0.05 (which is the smallest value for which our naive implementation of the Sinkhorn algorithm is still stable). The resulting plan has 6730 nonzero entries, and we only plot lines for the transports which are larger than 1% of the largest entry in the optimal transport plan.
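For reference, the naive Sinkhorn iteration used for this comparison can be sketched as follows (a standard textbook version; the exact implementation behind Figure 8 is not shown here):

```python
import numpy as np

def sinkhorn(c, nu, mu, gamma, iters=2000):
    """Naive Sinkhorn iteration for entropically regularized OT: the plan
    diag(u) K diag(v) is strictly positive, i.e. dense, in contrast to
    the sparse plans of quadratic regularization."""
    K = np.exp(-c / gamma)        # overflows/underflows for small gamma
    u = np.ones(c.shape[0])
    for _ in range(iters):
        v = mu / (K.T @ u)        # match column marginals
        u = nu / (K @ v)          # match row marginals
    return u[:, None] * K * v[None, :]
```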
Fig. 7 Illustration of the quadratically regularized optimal transport between empirical distributions. Left: source distribution µ̂ denoted by blue stars and target distribution ν̂ denoted by red circles, together with lines that indicate the transport. Right: the transport plan and its histogram in semi-log scale.

Fig. 8 Illustration of the entropically regularized optimal transport between empirical distributions. Left: source distribution µ̂ denoted by blue stars and target distribution ν̂ denoted by red circles, together with lines that indicate the transport. Right: the transport plan and its histogram in semi-log scale.

5 Conclusion
One direction for future work is to investigate how special structure of the cost function c may help to reduce the cost of assembling the sparse matrix σ.
Acknowledgements We would like to thank the reviewer for helpful suggestions that led to an improved presentation, and also Stephan Walther (TU Dortmund) for helping with the construction of the counterexample in Section 2.3.
For the sake of brevity, we just consider an equidistant discretization of [0, 1] into N intervals using piecewise constant ansatz functions, i.e.
$$\pi(x, y) := \sum_{i,j=0}^{N-1} \pi_{ij}\, \chi_{(\frac{i}{N}, \frac{i+1}{N}) \times (\frac{j}{N}, \frac{j+1}{N})}(x, y)$$
for coefficients $\pi_{ij}$, and assume analogous definitions for the quantities c, $\mu^+$, $\mu^-$, α, and β. They have to coincide on average over the intervals. Again, we study this for π and obtain that the identity
$$\int_{\frac{i}{N}}^{\frac{i+1}{N}} \int_{\frac{j}{N}}^{\frac{j+1}{N}} \pi(x, y) \,\mathrm{d}y\,\mathrm{d}x = \int_{\frac{i}{N}}^{\frac{i+1}{N}} \int_{\frac{j}{N}}^{\frac{j+1}{N}} \sum_{k,l=0}^{N-1} \pi_{kl}\, \chi_{(\frac{k}{N}, \frac{k+1}{N}) \times (\frac{l}{N}, \frac{l+1}{N})}(x, y) \,\mathrm{d}y\,\mathrm{d}x = \frac{1}{N^2}\,\pi_{ij}$$
holds. Again, analogous identities hold for the quantities c, $\mu^+$, $\mu^-$, α, and β. The ones with one-dimensional domain are scaled by $\frac{1}{N}$ instead of $\frac{1}{N^2}$.
Now, we consider the discrete Algorithm 2, which operates on discrete quantities, and establish a consistent mapping of the quantities from the discretization to the ones of the solver. We denote its input quantities by $\bar c_{ij}$, $\bar\mu_i^-$, $\bar\mu_j^+$ and its output quantities by $\bar\alpha_i$, $\bar\beta_j$, $\bar\pi_{ij}$, and $\bar E$. It solves for
$$\sum_{j=0}^{N-1} \bar\pi_{ij} = \gamma\,\bar\mu_i^+,$$
Plugging in $N\mu_i^- = \bar\mu_i^-$, $N\mu_j^+ = \bar\mu_j^+$ and $\gamma\pi_{ij} = \bar\pi_{ij}$ gives
$$\bar E = \frac{\gamma^2}{2} \sum_{i,j=0}^{N-1} \pi_{ij}^2 - \gamma N \left( \sum_{i=0}^{N-1} \bar\alpha_i \mu_i^- + \sum_{j=0}^{N-1} \bar\beta_j \mu_j^+ \right).$$
i=0 j=0
1
Thus, the consistent identity E = γN 2
Ē follows if we choose ᾱi := αi and β̄i := βi . The solver
computes ᾱi as the solution of
N −1
(ᾱi + β̄j − c̄ij )+ = γ µ̄−
X
i ,
j=0
26 Dirk A. Lorenz et al.
N −1
1 X
(αi + βj − cij )+ = γµ−
i
N j=0
in terms of the coefficients. Plugging in the choices αi = ᾱi , βj = β̄j , cij = c̄ij and N µ− −
i = µ̄i
yields equivalence of the latter equation to
N −1
1 X 1
(ᾱi + β̄j − c̄ij )+ = γ µ̄− ,
N j=0 N j
which is equivalent to the equation that is solved by Algorithm 2. The argument for µ̄+
j is
carried out analogously.
Regarding termination, the solver checks the criteria
$$\Big| \frac{1}{\gamma} \sum_{j=0}^{N-1} \bar\pi_{ij} - \bar\mu_i^- \Big| < \tau \qquad \text{and} \qquad \Big| \frac{1}{\gamma} \sum_{i=0}^{M-1} \bar\pi_{ij} - \bar\mu_j^+ \Big| < \tau.$$
We only consider the first and plug the identity $\gamma\pi_{ij} = \bar\pi_{ij}$ into it, which gives equivalence to
$$\Big| \sum_{j=0}^{N-1} \pi_{ij} - N\mu_i^- \Big| < \tau.$$
We summarize the choices for the consistent mapping of quantities arising from the discretization to quantities the solver operates on in Table 1. Finally, we make a note on the calculation of the coefficients $c_{ij}$ for the cost function $c(x, y) := (x-y)^2$:
$$c_{ij} = N^2 \int_{\frac{i}{N}}^{\frac{i+1}{N}} \int_{\frac{j}{N}}^{\frac{j+1}{N}} (x - y)^2 \,\mathrm{d}y\,\mathrm{d}x = \dots = \frac{1}{N^2}\Big( (i-j)^2 + \frac{1}{6} \Big).$$
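The omitted computation can be verified symbolically, e.g. with sympy (a small check of the stated closed form):

```python
import sympy as sp

# Symbolic verification of the cell-averaged cost coefficients for
# c(x, y) = (x - y)^2, matching the closed form stated above.
x, y = sp.symbols('x y')
i, j, N = sp.symbols('i j N', positive=True)
cij = N**2 * sp.integrate(
    sp.integrate((x - y)**2, (y, j/N, (j + 1)/N)),
    (x, i/N, (i + 1)/N))
assert sp.simplify(cij - ((i - j)**2 + sp.Rational(1, 6)) / N**2) == 0
```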
Conflict of Interest: The authors declare that they have no conflict of interest.
References
1. Luigi Ambrosio and Nicola Gigli. A user's guide to optimal transport. In Modelling and optimisation of flows on networks, pages 1–155. Springer, 2013.
2. Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific Optimization and Com-
putation Series. Athena Scientific, Belmont, MA, third edition, 2016.
3. Mathieu Blondel, Vivien Seguy, and Antoine Rolet. Smooth and sparse optimal transport.
In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First In-
ternational Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings
of Machine Learning Research, pages 880–889, Playa Blanca, Lanzarote, Canary Islands,
09–11 Apr 2018. PMLR.
4. Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of
entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical
Analysis, 49(2):1385–1418, 2017.
5. Xiaojun Chen. Superlinear convergence of smoothing quasi-Newton methods for nonsmooth equations. Journal of Computational and Applied Mathematics, 80(1):105–126, 1997.
6. Xiaojun Chen. On convergence of SOR methods for nonsmooth equations. Numer. Linear
Algebra Appl., 9(1):81–92, 2002.
7. Xiaojun Chen, Zuhair Nashed, and Liqun Qi. Smoothing methods and semismooth meth-
ods for nondifferentiable operator equations. SIAM J. Numer. Anal., 38(4):1200–1216,
2000.
8. Christian Clason, Dirk A Lorenz, Hinrich Mahler, and Benedikt Wirth. Entropic regu-
larization of continuous optimal transport problems. arXiv preprint arXiv:1906.01333,
2019.
9. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In
Advances in neural information processing systems, pages 2292–2300, 2013.
10. Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational Wasserstein
problems. SIAM J. Imaging Sci., 9(1):320–343, 2016.
11. Montacer Essid and Justin Solomon. Quadratically regularized optimal transport on
graphs. SIAM Journal on Scientific Computing, 40(4):A1961–A1986, 2018.
12. Irene Fonseca and Giovanni Leoni. Modern methods in the calculus of variations: Lp
spaces. Springer Monographs in Mathematics. Springer, New York, 2007.
13. Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617, 2018.
14. Michael Hinze, René Pinnau, Michael Ulbrich, and Stefan Ulbrich. Optimization with
PDE constraints, volume 23. Springer Science & Business Media, 2008.
15. Leonid V. Kantorovič. On the translocation of masses. C. R. (Doklady) Acad. Sci. URSS
(N.S.), 37:199–201, 1942.
16. Nicolas Papadakis, Gabriel Peyré, and Edouard Oudet. Optimal transport with proximal
splitting. SIAM J. Imaging Sci., 7(1):212–238, 2014.
17. Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and
Trends in Machine Learning, 11(5-6):355–607, 2019.
18. Svetlozar T. Rachev and Ludger Rüschendorf. Mass transportation problems. Vol. I.
Probability and its Applications (New York). Springer-Verlag, New York, 1998. Theory.
19. Svetlozar T. Rachev and Ludger Rüschendorf. Mass transportation problems. Vol. II.
Probability and its Applications (New York). Springer-Verlag, New York, 1998. Applica-
tions.
20. Lucas Roberts, Leo Razoumov, Lin Su, and Yuyang Wang. Gini-regularized optimal trans-
port with an application to spatio-temporal forecasting. arXiv preprint arXiv:1712.02512,
2017.
21. Filippo Santambrogio. Optimal transport for applied mathematicians, volume 87 of
Progress in Nonlinear Differential Equations and their Applications. Birkhäuser/Springer,
Cham, 2015. Calculus of variations, PDEs, and modeling.
22. Fredi Tröltzsch. Regular Lagrange multipliers for control problems with mixed pointwise
control-state constraints. SIAM Journal on Optimization, 15:616–634, 2005.
23. Cédric Villani. Topics in optimal transportation, volume 58 of Graduate Studies in Math-
ematics. American Mathematical Society, Providence, RI, 2003.
24. Cédric Villani. Optimal transport. Old and new, volume 338 of Grundlehren der Mathe-
matischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-
Verlag, Berlin, 2009.
25. Stephen J. Wright. Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–34,
2015.