Article
Sparse Regularized Optimal Transport with Deformed q-Entropy
Han Bao 1, * and Shinsaku Sakaue 2
1 Graduate School of Informatics and The Hakubi Center for Advanced Research, Kyoto University,
Kyoto 604-8103, Japan
2 Department of Mathematical Informatics, Graduate School of Information Science and Technology,
The University of Tokyo, Tokyo 153-8505, Japan
* Correspondence: [email protected]
Abstract: Optimal transport is a mathematical tool that has been widely used to measure the distance between two probability distributions. To mitigate the cubic computational complexity of the vanilla formulation of the optimal transport problem, regularized optimal transport has received attention in recent years; it is a convex program that minimizes the linear transport cost with an added convex regularizer. Sinkhorn optimal transport is the most prominent one, regularized with the negative Shannon entropy, leading to densely supported solutions, which are often undesirable in light of the interpretability of transport plans. In this paper, we report that a deformed entropy designed with the q-algebra, a popular generalization of the standard algebra studied in Tsallis statistical mechanics, makes optimal transport solutions sparsely supported. This entropy with a deformation parameter q interpolates between the negative Shannon entropy (q = 1) and the squared 2-norm (q = 0), and the solution becomes sparser as q tends to zero. Our theoretical analysis reveals that a larger q leads to faster convergence when the problem is optimized with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. In summary, the deformation induces a trade-off between sparsity and convergence speed.
Keywords: optimal transport; Sinkhorn algorithm; convex analysis; entropy; quasi-Newton method
1. Introduction
Optimal transport (OT) is a classic problem in operations research, and it is used to compute a transport plan between suppliers and demanders with a minimum transportation cost. The minimum transportation cost can be interpreted as the closeness between the distributions when considering suppliers and demanders as two probability distributions. The OT problem has been extensively studied (also as the Wasserstein distance) [1] and used in robust machine learning [2], domain adaptation [3], generative modeling [4], and natural language processing [5], owing to its many useful properties, such as serving as a distance between two probability distributions. Recently, the OT problem has been employed for various modern applications, such as interpretable word alignment [6] and the locality-aware evaluation of object detection [7], because it can capture the geometry of data and provide a measurement method for closeness and alignment among different objects. From a computational perspective, a naïve approach is to use a network simplex algorithm or an interior point method to solve the OT problem as a usual linear program; this approach requires supercubic time complexity [8] and is not scalable. A number of approaches have been suggested to accelerate the computation of the OT problem: entropic regularization [9,10], accelerated gradient descent [11], and approximation with tree [12] and graph metrics [13]. We focused our attention on entropic-regularized OT because it allows a unique solution attributed to strong convexity and transforms the original constrained optimization into an unconstrained problem with a clear primal–dual relationship. The
2. Background
2.1. Preliminaries
For $x \in \mathbb{R}$, let $[x]_+ = x$ if $x > 0$ and $0$ otherwise, and let $[x]_+^p$ represent $([x]_+)^p$ hereafter. For a convex function $f: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ represents a Euclidean vector space equipped with an inner product $\langle\cdot,\cdot\rangle$, the Fenchel–Legendre conjugate $f^\star: \mathcal{X} \to \mathbb{R}$ is defined as $f^\star(\mathbf{y}) := \sup_{\mathbf{x}\in\mathcal{X}} \langle\mathbf{x},\mathbf{y}\rangle - f(\mathbf{x})$. The relative interior of a set $S$ is denoted by $\mathrm{ri}\,S$, and the effective domain of a function $f$ is denoted by $\mathrm{dom}(f)$. A differentiable function $f$ is said to be $M$-strongly convex over $S \subseteq \mathrm{ri}\,\mathrm{dom}(f)$ if, for all $\mathbf{x}, \mathbf{y} \in S$, we have $f(\mathbf{x}) - f(\mathbf{y}) \le \langle\nabla f(\mathbf{x}), \mathbf{x} - \mathbf{y}\rangle - \frac{M}{2}\|\mathbf{x} - \mathbf{y}\|_2^2$. If $f$ is twice differentiable, the strong convexity is equivalent to $\nabla^2 f(\mathbf{x}) \succeq MI$ for all $\mathbf{x} \in S$. Similarly, a differentiable function $f$ is said to be $M$-smooth over $S \subseteq \mathrm{ri}\,\mathrm{dom}(f)$ if, for all $\mathbf{x}, \mathbf{y} \in S$, we have $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le M\|\mathbf{x} - \mathbf{y}\|_2$, which is equivalent to $\nabla^2 f(\mathbf{x}) \preceq MI$ for all $\mathbf{x} \in S$ if $f$ is twice differentiable.
The transport polytope $U$ defines the constraints on the row/column marginals of a transport matrix $\Pi$. These constraints are often referred to as coupling constraints. For notational simplicity, the matrix $D_{ij} := d(x_i, y_j)$ and the expectation $\langle D, \Pi\rangle := \sum_{i=1}^n\sum_{j=1}^m D_{ij}\Pi_{ij}$ are used hereafter. $T(\mu, \nu)$ is known as the 1-Wasserstein distance, which defines a metric space over histograms [1].
Equation (1) is a linear program and can be solved by well-studied algorithms such as the interior point and network simplex methods. However, its computational complexity is $O(n^3\log n)$ (assuming $n = m$), so it is not scalable to large datasets [8].
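To make the baseline concrete, the following is a minimal sketch (an illustrative assumption, not the implementation referenced in this paper) that solves the linear program in Equation (1) with SciPy's generic LP solver; the helper name ot_linprog and the variable names a, b, and D are chosen for illustration.

```python
# A minimal sketch: the unregularized OT problem of Equation (1) as a linear program.
import numpy as np
from scipy.optimize import linprog

def ot_linprog(a, b, D):
    """Solve min_Pi <D, Pi> s.t. Pi 1_m = a, Pi^T 1_n = b, Pi >= 0."""
    n, m = D.shape
    # Row-marginal constraints: sum_j Pi_ij = a_i.
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-marginal constraints: sum_i Pi_ij = b_j.
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    res = linprog(D.ravel(), A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun
```

A generic LP solver like this scales poorly as n and m grow, which is exactly the motivation for the regularized formulations discussed next.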
where $\lambda > 0$ represents the regularization strength. Let us review the derivation of the updates of the Sinkhorn algorithm. The Lagrangian of the optimization problem in Equation (3) is
$$\mathcal{L}(\Pi, \alpha, \beta) := \sum_{i=1}^n\sum_{j=1}^m\left(D_{ij}\Pi_{ij} + \lambda(\Pi_{ij}\log\Pi_{ij} - \Pi_{ij})\right) + \sum_{i=1}^n\alpha_i([\Pi\mathbf{1}_m]_i - a_i) + \sum_{j=1}^m\beta_j([\Pi^\top\mathbf{1}_n]_j - b_j), \quad (4)$$
$$\Pi_{ij} = \exp\left(-\frac{\alpha_i + \beta_j + D_{ij}}{\lambda}\right). \quad (6)$$
The decomposition $\Pi_{ij} = \exp\left(-\frac{D_{ij}}{\lambda}\right)\big/\exp\left(\frac{\alpha_i+\beta_j}{\lambda}\right)$ suggests that the stationary point is the (normalized) Gibbs kernel $\exp\left(-\frac{D_{ij}}{\lambda}\right)$. One can easily infer that the Sinkhorn solution is dense because the Gibbs kernel is supported on the entire $\mathbb{R}_{\ge 0}$, i.e., $\exp\left(-\frac{z}{\lambda}\right) > 0$ for all $z \ge 0$.
The following Sinkhorn updates are used to make Equation (7) meet the marginal constraints:
$$u' \leftarrow a/(Kv), \qquad v' \leftarrow b/(K^\top u), \quad (8)$$
where $z/\eta$ represents the element-wise division of the two vectors $z$ and $\eta$. The computational complexity is $O(Knm)$ because the Sinkhorn updates involve only matrix–vector multiplications and element-wise divisions; $K$ represents the number of Sinkhorn updates. A finer analysis of the number of updates required to meet the error tolerance is provided in the literature [25].
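For concreteness, the following is a minimal sketch of the Sinkhorn iteration in Equation (8), assuming the Gibbs kernel K = exp(−D/λ) and uniform initial scalings; it is illustrative only and not the implementation evaluated in Section 5.

```python
# A minimal sketch of the Sinkhorn updates in Equation (8).
import numpy as np

def sinkhorn(a, b, D, lam, n_iters=10000, tol=1e-6):
    K = np.exp(-D / lam)                       # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                        # u' <- a / (K v)
        v = b / (K.T @ u)                      # v' <- b / (K^T u)
        Pi = u[:, None] * K * v[None, :]       # current transport plan
        err = (np.abs(Pi.sum(axis=1) - a).sum()
               + np.abs(Pi.sum(axis=0) - b).sum())
        if err < tol:                          # l1 error of the coupling constraints
            break
    return Pi
```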
Next, we derive its dual by Lagrange duality. The Lagrangian of Equation (9) is defined as
$$\mathcal{L}(\Pi, \alpha, \beta) := \langle D, \Pi\rangle + \sum_{i,j}\Omega(\Pi_{ij}) + \langle\alpha, \Pi\mathbf{1}_m - a\rangle + \langle\beta, \Pi^\top\mathbf{1}_n - b\rangle, \quad (10)$$
with dual variables $\alpha\in\mathbb{R}^n$ and $\beta\in\mathbb{R}^m$. Then, the primal can be rewritten in terms of the Lagrangian
$$T_\Omega(\mu, \nu) = \inf_{\Pi\in\mathbb{R}_{\ge 0}^{n\times m}}\ \sup_{\alpha\in\mathbb{R}^n,\,\beta\in\mathbb{R}^m}\mathcal{L}(\Pi, \alpha, \beta). \quad (11)$$
In this Lagrangian formulation, we let the constraints $\Pi\in\mathbb{R}_{\ge 0}^{n\times m}$ remain for a technical reason. The constrained optimization problem in (11) can be reformulated into the following unconstrained one with an indicator function $I_{\mathbb{R}_{\ge 0}^{n\times m}}$.
which corresponds to an optimization problem with the convex objective function $\langle D, \Pi\rangle + \sum_{i,j}\Omega(\Pi_{ij}) + I_{\mathbb{R}_{\ge 0}^{n\times m}}(\Pi)$ with only the linear constraints $\Pi\mathbf{1}_m = a$ and $\Pi^\top\mathbf{1}_n = b$. By invoking the Sinkhorn–Knopp theorem [26], the existence of a strictly feasible solution, namely, a solution satisfying $\Pi\mathbf{1}_m = a$ and $\Pi^\top\mathbf{1}_n = b$, can be confirmed. Hence, we see that the Slater condition is satisfied, and the strong duality holds as follows:
Although each element of the transport plans ranges over $[0, 1]$, it is sufficient to define the Fenchel–Legendre conjugate as the supremum over $\mathbb{R}_{\ge 0}$ because of how $\Omega^\star$ emerges in the strong duality (13). According to Danskin's theorem [27], the supremum of the Fenchel–Legendre conjugate can be attained at
Example 1 (Negative Shannon entropy). Let $\Omega(\pi) = -\lambda H(\pi) = \lambda(\pi\log\pi - \pi)$; then $\Omega^\star(\eta) = \lambda e^{\eta/\lambda}$ and $\nabla\Omega^\star(\eta) = e^{\eta/\lambda}$. The optimal solution represented with the optimal dual variables $(\alpha^\star, \beta^\star)$ is $\Pi_{ij}^\star = \exp\left(-\frac{D_{ij}+\alpha_i^\star+\beta_j^\star}{\lambda}\right)$. This recovers the stationary point of the Sinkhorn OT in Equation (6). The solution is dense because the regularizer $\Omega$ induces the Gibbs kernel $\nabla\Omega^\star(\eta) = e^{\eta/\lambda} > 0$ for all $\eta\in\mathbb{R}$.
Table 1. Summary of $\Omega(\pi)$, $\Omega^\star(\eta)$, and $\nabla\Omega^\star(\eta)$ for several regularizers. The relationship between $\Omega$, its conjugate, and their derivatives is summarized in Bao and Sugiyama [28].
$$\log_q(x) := \begin{cases}\frac{x^{1-q}-1}{1-q} & \text{if } q\in[0,1)\\ \log(x) & \text{if } q = 1\end{cases}, \qquad \exp_q(x) := \begin{cases}[1+(1-q)x]_+^{1/(1-q)} & \text{if } q\in[0,1)\\ \exp(x) & \text{if } q = 1\end{cases}. \quad (17)$$
The q logarithm is defined only for $x > 0$, as is the natural logarithm; the two functions are inverse to each other (in an appropriate domain), and they recover the natural logarithm and exponential as $q \nearrow 1$. Their derivatives are $(\log_q(x))' = \frac{1}{x^q}$ and $(\exp_q(x))' = \exp_q(x)^q$, respectively. The additive factorization property $\exp(x+y) = \exp(x)\exp(y)$ satisfied by the natural exponential no longer holds for the q exponential: $\exp_q(x+y) \ne \exp_q(x)\exp_q(y) = \exp_q(x + y + (1-q)xy)$. Instead, we can construct another algebraic structure by introducing the other operation called the q product $\otimes_q$:
$$x \otimes_q y = \left[x^{1-q} + y^{1-q} - 1\right]_+^{1/(1-q)}. \quad (18)$$
With this product, the pseudoadditive factorization $\exp_q(x+y) = \exp_q(x)\otimes_q\exp_q(y)$ holds. Thus, the q algebra captures rich nonlinear structures, and it is often used to extend the Shannon entropy to the Tsallis entropy [23]
$$T_q(\pi) = -\sum_{i=1}^n \pi_i^q\log_q(\pi_i). \quad (19)$$
One can see that the Tsallis entropy has an equivalent power formulation $T_q(\pi) = \sum_{i=1}^n\frac{\pi_i^q - \pi_i}{1-q}$, which means that it is often suitable for modeling heavy-tailed phenomena such as the power law. Although the introduced q logarithm and exponential can look arbitrary, they can be axiomatically derived by assuming the essential properties of the algebra (see Naudts [29]). For more physical insights, we recommend that readers refer to the literature [30].
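The q-algebra above can be summarized in a few lines of code. The following sketch implements the q-logarithm, q-exponential, q-product, and Tsallis entropy of Equations (17)–(19); the helper names are hypothetical and chosen for illustration.

```python
# A small sketch of the q-algebra in Equations (17)-(19).
import numpy as np

def log_q(x, q):
    return np.log(x) if q == 1.0 else (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_product(x, y, q):
    # x (*)_q y = [x^{1-q} + y^{1-q} - 1]_+^{1/(1-q)}
    return np.maximum(x ** (1.0 - q) + y ** (1.0 - q) - 1.0, 0.0) ** (1.0 / (1.0 - q))

def tsallis_entropy(pi, q):
    # T_q(pi) = -sum_i pi_i^q log_q(pi_i)
    return -np.sum(pi ** q * log_q(pi, q))

# Pseudoadditive factorization: exp_q(x + y) = exp_q(x) (*)_q exp_q(y).
x, y, q = 0.3, -0.7, 0.5
assert np.isclose(exp_q(x + y, q), q_product(exp_q(x, q), exp_q(y, q), q))
```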
Next, we introduce the q-exponential distribution. We introduce a simpler form for our
purpose, whereas more general formulations of the q-exponential distribution have been
introduced in the literature [22]. Given the form of the Gibbs kernel k(ξ ) := exp(−ξ/λ),
we define the q-Gibbs kernel as follows:
Definition 3 (q-Gibbs kernel). For ξ ≥ 0, we define the q-Gibbs kernel as k q (ξ ) := expq (−ξ/λ)
for a deformation parameter q ∈ [0, 1] and a temperature parameter λ ∈ R>0 .
Figure 1. Plots of the q-Gibbs kernels with different q (λ = 1).
By definition, the support of the q-Gibbs kernel is $\mathrm{supp}(k_q) = \left[0, \frac{\lambda}{1-q}\right]$ for $q\in[0,1)$ and $\mathrm{supp}(k_q) = \mathbb{R}_{\ge 0}$ for $q = 1$. This indicates that the q-Gibbs kernel ignores the effect of a too-large $\xi$ (or too large a distance between two points); its threshold is smoothly controlled by the temperature parameter $\lambda$ and the deformation parameter $q$.
Finally, we derive an entropic regularizer that induces sparsity by using the q-Gibbs kernel. Given the stationary condition in Equation (15), we impose the following functional form on the dual map:
$$\pi = \nabla\Omega^\star(\eta) = \exp_q\left(\frac{\eta}{\lambda}\right), \quad (20)$$
where $(\pi, \eta) = (\Pi_{ij}^\star, -D_{ij} - \alpha_i - \beta_j)$. Equation (20) results in the factorization
$$\Pi_{ij}^\star = \exp_q\left(-\frac{D_{ij}}{\lambda}\right)\otimes_q\exp_q\left(-\frac{\alpha_i}{\lambda}\right)\otimes_q\exp_q\left(-\frac{\beta_j}{\lambda}\right), \quad (21)$$
and a sufficiently large input distance $D_{ij}$ drives $\Pi_{ij}^\star$ to zero, although $\exp_q(-D_{ij}/\lambda) = 0$ does not immediately imply $\Pi_{ij}^\star = 0$ because the q-product $\otimes_q$ lacks an absorbing element. By solving Equation (20),
$$\nabla\Omega(\pi) = \lambda\log_q(\pi), \qquad \Omega(\pi) = \frac{\lambda}{2-q}\left(\pi\log_q(\pi) - \pi\right). \quad (22)$$
For completeness, its derivation is shown in Appendix A. Hence, we define the deformed q entropy as follows:
by solving its dual counterpart. The deformed q entropy differs from the Tsallis entropy $T_q$ (see Equation (19)) in that the Tsallis entropy and the deformed q entropy are defined with the q expectation $\langle\pi^q, \cdot\rangle$ [34] and the usual expectation $\langle\pi, \cdot\rangle$, respectively, while both are defined by the q logarithm.
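To make the dual map concrete, the following sketch implements the deformed q-entropy regularizer of Equation (22) and its dual map from Equation (20); it reuses the hypothetical log_q and exp_q helpers defined in the previous sketch.

```python
# A sketch of the deformed q-entropy regularizer and its dual map.
import numpy as np

def omega(pi, q, lam):
    # Omega(pi) = (lambda / (2 - q)) * (pi log_q(pi) - pi), Equation (22)
    return lam / (2.0 - q) * (pi * log_q(pi, q) - pi)

def grad_omega_conjugate(eta, q, lam):
    # Dual map: pi = grad Omega^*(eta) = exp_q(eta / lambda), Equation (20); it is
    # exactly zero whenever eta / lambda <= -1 / (1 - q) for q < 1 (source of sparsity).
    return exp_q(eta / lam, q)

# The dual map inverts grad Omega(pi) = lambda * log_q(pi).
lam, q, eta = 1.0, 0.5, -0.3
pi = grad_omega_conjugate(eta, q, lam)
assert np.isclose(lam * log_q(pi, q), eta)
```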
Remark 1. The primary reason we picked the deformed q entropy $H_q$ to design the regularizer is its natural connection to the q-Gibbs kernel through the dual map, $\nabla(-\lambda H_q)^\star(\eta) = \exp_q(\eta/\lambda)$. When the Tsallis entropy $T_q$ is used, the dual map is
$$\nabla(-\lambda T_q)^\star(\eta) = \frac{q^{1/(1-q)}}{\exp_q(-\eta/\lambda)}, \quad (25)$$
which is not naturally connected to the q-Gibbs kernel. Muzellec et al. [35] proposed regularized OT with the Tsallis entropy, but they did not discuss its sparsity. As we show in Appendix D.1, the Tsallis entropy does not empirically induce sparsity.
Figure 2. Plots of deformed q entropy with different q values. A constant term is ignored in the plots
so that the end points are calibrated to zero.
$$T_{-\lambda H_q}(\mu, \nu) = \sup_{\alpha\in\mathbb{R}^n,\,\beta\in\mathbb{R}^m}\underbrace{-\langle a, \alpha\rangle - \langle b, \beta\rangle - \frac{\lambda}{2-q}\sum_{i,j}\exp_q\left(-\frac{D_{ij}+\alpha_i+\beta_j}{\lambda}\right)^{2-q}}_{:=-F(z)}, \quad (27)$$
are the differences of the dual variables and gradients between the next and current steps, respectively. Furthermore, let $(\gamma, \gamma')$ be the tolerance parameters for the Wolfe conditions, i.e., update directions and step sizes satisfy the conditions
$$\tilde F(z^{(k)} + \rho^{(k)}d^{(k)}) \le \tilde F(z^{(k)}) + \gamma'\rho^{(k)}g^{(k)\top}d^{(k)}, \quad\text{(Armijo condition)} \quad (30)$$
$$g^{(k+1)\top}d^{(k)} \ge \gamma g^{(k)\top}d^{(k)}. \quad\text{(curvature condition)} \quad (31)$$
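The two conditions can be checked directly; the following hypothetical helper (not part of the paper's Algorithm 1 listing) tests a candidate step against Equations (30) and (31), where F_tilde and grad_F_tilde stand for the $\ell_2$-regularized dual objective and its gradient.

```python
# A sketch of the Armijo and curvature checks in Equations (30) and (31).
def satisfies_wolfe(F_tilde, grad_F_tilde, z, d, rho, gamma, gamma_prime):
    g = grad_F_tilde(z)
    z_next = z + rho * d
    armijo = F_tilde(z_next) <= F_tilde(z) + gamma_prime * rho * (g @ d)   # Eq. (30)
    curvature = grad_F_tilde(z_next) @ d >= gamma * (g @ d)                # Eq. (31)
    return armijo and curvature
```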
Assumption 1. Let $z^\star$ be the global optimum of $\tilde F$. For $\tau\in(0,1)$, we define the set $\mathcal{Z}_\tau \subseteq \mathrm{ri}\,\mathrm{dom}(\tilde F)$ as
$$\mathcal{Z}_\tau := \left\{ z \,\middle|\, \nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) \le \tau \text{ for all } i, j \right\}. \quad (32)$$
The dual map $\nabla\Omega^\star$ translates dual variables into primal variables, as in Equation (15). It is easy to confirm that $\mathcal{Z}_\tau$ is a closed convex set, attributed to the convexity of $\nabla\Omega^\star$. Assumption 1 essentially assumes that all elements of the primal matrix (of $z^{(K)}$ and $z^\star$) are strictly less than 1; this always holds for $z^\star$ (unless $n = m = 1$) because of the strong duality. Moreover, this assumption is natural for $z^{(K)}$ sufficiently close to the optimum $z^\star$. The bound parameter $\tau$ is a key element for characterizing the convergence speed.
Theorem 1. Let $N := \max\{n, m\}$. Under Assumption 1, Algorithm 1 with the parameter choice $\kappa = 2N\tau^q\lambda^{-1}$ returns a point $z^{(K)}$ satisfying
$$\|g^{(K)}\|_2 < \sqrt{\frac{16(\tilde F(z^{(0)}) - \tilde F^\star)N\tau^q}{\lambda}}\; r^K, \quad (33)$$
where $\tilde F^\star := \inf_z \tilde F(z)$ represents the optimal value of the $\ell_2$-regularized dual objective and $0 < r < 1$ is an absolute constant independent from $(\lambda, \tau, q, N)$.
The proof is shown in Section 4.3. We conclude that a larger deformation parameter $q$ yields better convergence because the coefficient in Equation (33) is $O(\tau^{q/2})$ with the base $\sqrt{\tau} < 1$. Therefore, the deformation parameter introduces a new trade-off: $q \searrow 0$ yields a sparser solution but slows down the convergence, whereas $q \nearrow 1$ ameliorates the convergence while sacrificing sparsity. One may obtain the solution faster than with the squared 2-norm regularizer used in Blondel et al. [20], which corresponds to the case $q = 0$, by modulating the deformation parameter $q$.
In regularized OT, it is a common approach to use weaker regularization (i.e., a smaller $\lambda$) to obtain a solution that is sparser and closer to the unregularized solution; however, a smaller $\lambda$ results in numerical instability and slow computation [37]. This can be observed from Equation (33) because a smaller $\lambda$ makes its upper bound considerably larger.
Subsequently, we compared the computational complexity of q-DOT with the BFGS method and the Sinkhorn algorithm. Altschuler et al. [25] showed that the Sinkhorn algorithm satisfies the coupling constraints within the $\ell_1$ error $\varepsilon$ in $O(N^2(\log N)\varepsilon^{-3})$ time, which is a sublinear convergence rate. In contrast, our convergence rate in Equation (33) is translated into the iteration complexity $K = O(\log(N\varepsilon^{-1}))$, where $\|g^{(K)}\|_2 \le \varepsilon$. The gradient of $\tilde F$ is
$$\nabla\tilde F(z) = \begin{bmatrix}\vdots\\ a_i - \sum_{j=1}^m\nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \kappa\alpha_i\\ \vdots\\ b_j - \sum_{i=1}^n\nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \kappa\beta_j\\ \vdots\end{bmatrix}, \quad (34)$$
and $\nabla\Omega^\star(\cdot)$ represents the mapping from the dual variables $(\alpha_i, \beta_j)$ to the primal transport matrix $\Pi_{ij}$ in Equation (15). Therefore, the gradient norm of $F$ and the coupling constraint error are comparable when the $\ell_2$-regularization parameter $\kappa$ is sufficiently small. The overall computational complexity is $O(N^2\log(N\varepsilon^{-1}))$ because one step of Algorithm 1 runs in $O(N^2)$ time; this is a linear convergence rate. To confirm that one step of Algorithm 1 runs in $O(N^2)$ time, we note that the update direction can be computed in $O(N^2)$ time by using the Sherman–Morrison formula to invert $B^{(k)}$. In addition, the Hessian estimate can be updated in $O(N^2)$ time because $B^{(k)}$ is updated by rank-one corrections and the computation of its inverse only requires matrix–vector products of size $N$. Hence, Algorithm 1 exhibits better convergence in terms of the stopping criterion $\varepsilon$. The comparison is summarized in Table 2.
Table 2. Comparison of the computational complexity of the Sinkhorn algorithm and deformed
q-optimal transport. N = max{n, m}.
Sinkhorn q-DOT
O( N 2 (log N )ε−3 ) O( N 2 log( Nε−1 ))
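As an illustration of how the dual objective in Equation (27) and the gradient in Equation (34) fit together, the following is a minimal sketch of a q-DOT solver that minimizes the $\ell_2$-regularized dual with SciPy's L-BFGS-B routine, as in the experiments of Section 5. It is an illustrative reimplementation under stated assumptions (the hypothetical exp_q helper from the Section 2 sketch), not the authors' code.

```python
# A sketch of q-DOT: minimize F_tilde(z) = <a, alpha> + <b, beta>
#   + (lambda/(2-q)) sum_ij exp_q(-(D_ij+alpha_i+beta_j)/lambda)^{2-q} + (kappa/2)||z||^2.
import numpy as np
from scipy.optimize import minimize

def qdot(a, b, D, q, lam, kappa=1e-6, tol=1e-6):
    n, m = D.shape

    def f_and_grad(z):
        alpha, beta = z[:n], z[n:]
        eta = -(D + alpha[:, None] + beta[None, :])
        Pi = exp_q(eta / lam, q)                      # primal map, Equations (15)/(20)
        val = (a @ alpha + b @ beta
               + lam / (2.0 - q) * np.sum(Pi ** (2.0 - q))
               + 0.5 * kappa * (z @ z))
        grad = np.concatenate([a - Pi.sum(axis=1) + kappa * alpha,   # Eq. (34), alpha block
                               b - Pi.sum(axis=0) + kappa * beta])   # Eq. (34), beta block
        return val, grad

    res = minimize(f_and_grad, np.zeros(n + m), jac=True, method="L-BFGS-B",
                   options={"gtol": tol, "maxiter": 10000})
    alpha, beta = res.x[:n], res.x[n:]
    return exp_q(-(D + alpha[:, None] + beta[None, :]) / lam, q)
```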
4.3. Proofs
To prove Theorem 1, we leveraged several lemmas shown below. Lemma 2 is based
on Powell [24] and Byrd et al. [36]. The missing proofs are provided in Appendix C.
Lemma 1. For the initial point $z^{(0)}$ and sequence $z^{(1)}, z^{(2)}, \ldots, z^{(K)}$ obtained by Algorithm 1, we define the following set and its bound:
$$\mathcal{Z} := \mathrm{conv}\left\{z^{(0)}, z^{(1)}, z^{(2)}, \ldots, z^{(K)}\right\}, \qquad R := \sup_{z\in\mathcal{Z}}\max_{i,j}\nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j), \quad (35)$$
where $\mathrm{conv}(S)$ represents the convex hull of the set $S$. Then, $\tilde F: \mathbb{R}^{n+m}\to\mathbb{R}$ is $M_1$-strongly convex and $M_2$-smooth over $\mathcal{Z}$, where $M_1 = \kappa$ and $M_2 \le \kappa + 2NR^q\lambda^{-1}$. Moreover, $\tilde F$ is $M_2'$-smooth over $\mathcal{Z}_\tau$ (defined in Equation (32)), where $M_2' \le \kappa + 2N\tau^q\lambda^{-1}$.
Lemma 2. Let $z^{(1)}, z^{(2)}, \ldots, z^{(K)}$ be a sequence generated by Algorithm 1 given an initial point $z^{(0)}$. In addition, let $c_1, c_2, c_3, c_4$, and $c_5$ be the constants
$$c_1 := \frac{1-\gamma}{M_2}, \quad c_2 := \frac{n+m}{K} + M_2, \quad c_3 := \frac{(c_2K)^{\frac{n+m+K}{K}}}{(n+m)^{\frac{n+m}{K}}}, \quad c_4 := \frac{c_3}{1-\gamma}, \quad c_5 := \frac{2(1-\gamma')}{M_1}. \quad (36)$$
Then,
$$\tilde F(z^{(K)}) - \tilde F^\star \le \left(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\right)^{K/2}\left(\tilde F(z^{(0)}) - \tilde F^\star\right). \quad (37)$$
Proof of Theorem 1. Because $\tilde F$ is differentiable and strongly convex, there exists an optimum $z^\star$ such that $g^\star := \nabla\tilde F(z^\star) = 0$; this implies $\|g^{(K)}\|_2 = \|g^{(K)} - g^\star\|_2$.
By using Assumption 1 and Lemma 1, we obtain $\|g^{(K)} - g^\star\|_2 = \|\nabla\tilde F(z^{(K)}) - \nabla\tilde F(z^\star)\|_2 \le M_2'\|z^{(K)} - z^\star\|_2$. In addition, $\|z^{(K)} - z^\star\|_2^2 \le \frac{2}{M_1}(\tilde F(z^{(K)}) - \tilde F^\star)$ as $\tilde F$ is $M_1$-strongly convex over $\mathcal{Z}$ and the stationary condition $\nabla\tilde F(z^\star) = 0$ holds. We obtain the convergence bound by using Lemmas 2 and 3 as
$$\begin{aligned}
\|g^{(K)}\|_2 &= \|g^{(K)} - g^\star\|_2 \\
&\le M_2'\,\|z^{(K)} - z^\star\|_2 \\
&\le M_2'\sqrt{\frac{2(\tilde F(z^{(K)}) - \tilde F^\star)}{M_1}} \\
&\le M_2'\sqrt{\frac{2(\tilde F(z^{(0)}) - \tilde F^\star)}{M_1}\left(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\right)^{K/2}} \quad (39)\\
&< M_2'\sqrt{\frac{2(\tilde F(z^{(0)}) - \tilde F^\star)}{M_1}\left(1 - \frac{(1-\gamma)^3\gamma'e^{-2(n+m)/e}}{8(1-\gamma')^2}\left(\frac{M_1}{M_2}\right)^3\right)^{K/2}} \\
&\le \left(\kappa + \frac{2N\tau^q}{\lambda}\right)\sqrt{\frac{2(\tilde F(z^{(0)}) - \tilde F^\star)}{\kappa}\left(1 - \frac{C}{(1 + 2NR^q\lambda^{-1}\kappa^{-1})^3}\right)^{K/2}},
\end{aligned}$$
where we define $C := \frac{(1-\gamma)^3\gamma'e^{-2(n+m)/e}}{8(1-\gamma')^2}$ and Lemma 1 is used at the last inequality to replace $M_1$, $M_2$, and $M_2'$. We can immediately confirm $C \le \frac{1}{16}$ from $0 < \gamma' < \gamma < 1$, $\gamma' < \frac{1}{2}$, and $e^{-2(n+m)/e} < 1$. Finally, by choosing $\kappa = 2N\tau^q\lambda^{-1}$,
$$\begin{aligned}
\|g^{(K)}\|_2 &\le \sqrt{\frac{16(\tilde F(z^{(0)}) - \tilde F^\star)N\tau^q}{\lambda}\left(1 - \frac{C}{(1 + (R/\tau)^q)^3}\right)^{K/2}} \\
&\le \sqrt{\frac{16(\tilde F(z^{(0)}) - \tilde F^\star)N\tau^q}{\lambda}}\; r^K, \quad (40)
\end{aligned}$$
where we use $(R/\tau)^q \ge 1$ (owing to $R \ge \tau$ by definition) and let $r := (1 - C/8)^{1/4}$, with $\sqrt[4]{127/128} \le r < 1$.
Remark 2. More precisely, Altschuler et al. [25] showed that the Sinkhorn algorithm converges in $O(N^2L^3(\log N)\varepsilon^{-3})$ time, where $L := \|D\|_\infty$. For q-DOT, its computational complexity is not directly comparable to that of the Sinkhorn algorithm in $L$; instead, the following analysis provides us with a qualitative comparison. First, the convergence rate of q-DOT in Equation (33) is translated into the iteration complexity $K = O(\log(N\varepsilon^{-1})/\log(1/r))$. The rate $r$ is introduced in the proof of Theorem 1 (see Equation (40)): $r \ge \left(1 - \frac{C}{(1+(R/\tau)^q)^3}\right)^{1/4}$. Then, by the Taylor expansion, we have a rough estimate $K \approx O(N^2R^{-3q}\log(N\varepsilon^{-1}))$, where $R$ is a bound on the possible primal variables defined in Equation (35). We cannot directly compare $R^{-q}$ and $L$; nevertheless, $R^{-q}$ and $L$ can be considered to be of the same magnitude given a reasonably sized domain $\mathcal{Z}$, noting that $\nabla\Omega(\pi) \approx O(\pi^{1-q})$. Hence, it is reasonable to suppose that both the Sinkhorn algorithm and q-DOT roughly converge in cubic time with respect to $L$.
5. Numerical Experiments
5.1. Sparsity
All the simulations described in this section were executed on a 2.7 GHz quad-core Intel® Core™ i7 processor. We used the following synthetic dataset: $(x_i)_{i=1}^n \sim \mathcal{N}(\mathbf{1}_2, I_2)$, $(y_j)_{j=1}^m \sim \mathcal{N}(-\mathbf{1}_2, I_2)$, and $n = m = 30$, where $\mathcal{N}(\mu, \Sigma)$ represents the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. For each of the unregularized OT, q-DOT, and the Sinkhorn algorithm, we computed the transport matrices. For q-DOT and the Sinkhorn algorithm, different regularization parameters $\lambda$ were compared: $\lambda \in \{1\times10^{-2}, 1\times10^{-1}, 1\}$; and $\varepsilon = 1\times10^{-6}$ was used as the stopping criterion: q-DOT stopped after the gradient norm fell below $\varepsilon$, and the Sinkhorn algorithm stopped after the $\ell_1$ error of the coupling constraints fell below $\varepsilon$. We compared different deformation parameters $q \in \{0, 0.25, 0.5, 0.75\}$ and fixed the dual $\ell_2$-regularization parameter $\kappa = 1\times10^{-6}$ for q-DOT. The q-DOT with $q = 0$ corresponds to the regularized OT with the squared 2-norm proposed by Blondel et al. [20]. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT, we used the L-BFGS-B method (instead of the vanilla BFGS) provided by the SciPy package [39]. To determine zero entries in the transport matrix, we did not impose any positive threshold to disregard small values (as in Swanson et al. [6]) but regarded entries smaller than machine epsilon as zero.
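A rough sketch of this setup is given below (not the exact script behind Table 3 and Figure 3); it assumes the POT package [38] and the qdot sketch shown earlier, and the seed and parameter values are illustrative.

```python
# A sketch of the synthetic comparison between unregularized OT and q-DOT.
import numpy as np
import ot  # Python optimal transport package [38]

rng = np.random.default_rng(0)
n = m = 30
x = rng.normal(loc=1.0, scale=1.0, size=(n, 2))     # (x_i) ~ N(1_2, I_2)
y = rng.normal(loc=-1.0, scale=1.0, size=(m, 2))    # (y_j) ~ N(-1_2, I_2)
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
D = ot.dist(x, y, metric="euclidean")

Pi_exact = ot.emd(a, b, D)                          # unregularized OT
Pi_qdot = qdot(a, b, D, q=0.5, lam=0.1)             # q-DOT (sketch above)

eps = np.finfo(float).eps
print("sparsity:", np.mean(Pi_qdot <= eps))         # ratio of zero entries
print("cost:", np.sum(D * Pi_qdot), "vs exact:", np.sum(D * Pi_exact))
```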
The simulation results are shown in Table 3 and Figure 3. First, we qualitatively evaluated each method using Figure 3: q-DOT obtained a transport matrix very similar to the unregularized OT solution. The solution became slightly blurred as q and λ increased. In contrast, the Sinkhorn algorithm output considerably blurred transport matrices. Furthermore, the Sinkhorn algorithm was numerically unstable with very small regularization such as λ = 0.01.
From Table 3, we further observed the behavior quantitatively. The transport matrices obtained by q-DOT were very sparse in most cases, and the sparsity was close to that of the unregularized OT. Furthermore, we observed a tendency for smaller q and λ to yield sparser solutions. Significantly, the Sinkhorn algorithm obtained completely dense matrices (sparsity = 0). Although the transport matrices of q-DOT with (q, λ) = (0.5, 1), (0.75, 1) appear somewhat similar to the Sinkhorn solutions in Figure 3, the former are much sparser. This suggests that a deformation parameter q slightly smaller than 1 is sufficient for q-DOT to output a sparse transport matrix.
Figure 3. Comparison of transport matrices. Wasserstein represents the result of the unregularized
OT. Sinkhorn (λ = 0.01) does not work well because of numerical instability.
For the obtained cost values $\langle D, \widehat\Pi\rangle$, we did not see a clear advantage of using a specific q and λ from the results of q-DOT. Nevertheless, it is evident that q-DOT estimated the Wasserstein cost more accurately than the Sinkhorn algorithm regardless of the q and λ used in this simulation.
Table 3. Comparison of the sparsity and cost with the synthetic dataset. Sparsity indicates the ratio of
zero entries in each transport matrix. We counted the number of entries smaller than machine epsilon
to measure the sparsity instead of imposing a small positive threshold for determining zero entries.
Sinkhorn (λ = 0.01) does not work well because of numerical instability.
Sparsity    Cost $\langle D, \widehat\Pi\rangle$
Figure 4. Runtime comparison of q-DOT and Sinkhorn algorithm (q = 1). The error bars indicate the
standard errors of 20 trials.
Figure 5. Wasserstein approximation error of q-DOT and the Sinkhorn algorithm (q = 1). The line
shades indicate the standard errors of 20 trials.
Author Contributions: Conceptualization, H.B.; methodology, H.B.; validation, H.B. and S.S.; formal
analysis, H.B. and S.S.; writing—original draft preparation, H.B.; writing—review and editing, H.B.
and S.S.; funding acquisition, H.B. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was supported by the Hakubi Project, Kyoto University, and JST ERATO
Grant JPMJER1903. The APC was covered by the Hakubi Project.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
BFGS Broyden–Fletcher–Goldfarb–Shanno
q-DOT Deformed q-optimal transport
L-BFGS Limited-memory BFGS
OT Optimal transport
Note that $z^{(k)} + ts^{(k)} \in \mathcal{Z}$ follows by the definition of $\mathcal{Z}$ in Equation (35). Thus, the first statement is proven.
$$\frac{\|\zeta^{(k)}\|_2^2}{\zeta^{(k)\top}s^{(k)}} = \frac{s^{(k)\top}\bar G^{(k)2}s^{(k)}}{s^{(k)\top}\bar G^{(k)}s^{(k)}} = \frac{(s^{(k)\top}\bar G^{(k)1/2})\,\bar G^{(k)}\,(\bar G^{(k)1/2}s^{(k)})}{\|\bar G^{(k)1/2}s^{(k)}\|_2^2} = \int_0^1\frac{(s^{(k)\prime})^\top\left[\nabla^2\tilde F(z^{(k)} + ts^{(k)})\right](s^{(k)\prime})}{\|s^{(k)\prime}\|_2^2}\,dt \le M_2, \quad (A7)$$
where $s^{(k)\prime} := \bar G^{(k)1/2}s^{(k)}$.
$$\frac{M_1}{2}\|z^{(k)} - z^\star\|_2^2 \le \tilde F(z^\star) - \tilde F(z^{(k)}) - \left\langle\nabla\tilde F(z^{(k)}), z^\star - z^{(k)}\right\rangle \le \|g^{(k)}\|_2\|z^{(k)} - z^\star\|_2, \quad (A9)$$
where $c_2 := \frac{n+m}{K} + M_2$ is defined in Lemma 2.
Proof. To prove Equation (A10), we use the linearity of the trace and $\mathrm{tr}(ba^\top) = a^\top b$ to evaluate $\mathrm{tr}(B^{(k+1)})$ as follows:
$$\begin{aligned}
\mathrm{tr}(B^{(k+1)}) &= \mathrm{tr}\left(B^{(k)} - \frac{B^{(k)}s^{(k)}s^{(k)\top}B^{(k)}}{s^{(k)\top}B^{(k)}s^{(k)}} + \frac{\zeta^{(k)}\zeta^{(k)\top}}{\zeta^{(k)\top}s^{(k)}}\right)\\
&= \mathrm{tr}(B^{(k)}) - \underbrace{\mathrm{tr}\left(\frac{B^{(k)}s^{(k)}s^{(k)\top}B^{(k)}}{s^{(k)\top}B^{(k)}s^{(k)}}\right)}_{\ge 0} + \mathrm{tr}\left(\frac{\zeta^{(k)}\zeta^{(k)\top}}{\zeta^{(k)\top}s^{(k)}}\right)\\
&\le \mathrm{tr}(B^{(k)}) + \frac{\|\zeta^{(k)}\|_2^2}{\zeta^{(k)\top}s^{(k)}}\\
&\le \mathrm{tr}(B^{(0)}) + \sum_{j=0}^{k}\frac{\|\zeta^{(j)}\|_2^2}{\zeta^{(j)\top}s^{(j)}}\\
&\le \mathrm{tr}(B^{(0)}) + (k+1)M_2,
\end{aligned} \quad (A13)$$
where Lemma A1 is used at the last inequality. Note that the trace is the sum of the eigenvalues, whereas the determinant is the product of the eigenvalues. Then, we can use the AM–GM inequality to translate the determinant into the trace as follows:
$$\det(B^{(k+1)}) \le \left(\frac{1}{n+m}\mathrm{tr}(B^{(k+1)})\right)^{n+m} \le \left(\frac{\mathrm{tr}(B^{(0)}) + M_2(k+1)}{n+m}\right)^{n+m}. \quad (A14)$$
Lemma A4. For each $k$, $\|s^{(k)}\|_2 \le c_5\|g^{(k)}\|_2\cos\theta_k$, where $\theta_k$ is the angle between $s^{(k)}$ and $-g^{(k)}$, and $c_5 := \frac{2(1-\gamma')}{M_1}$ is defined in Lemma 2.
$$\begin{aligned}
& s^{(k)\top}g^{(k)} + \frac{M_1}{2}\|s^{(k)}\|_2^2 \le \gamma'g^{(k)\top}s^{(k)}\\
\Rightarrow\ & (1-\gamma')\left(-s^{(k)\top}g^{(k)}\right) \ge \frac{M_1}{2}\|s^{(k)}\|_2^2\\
\Rightarrow\ & \|s^{(k)}\|_2 \le \underbrace{\frac{2(1-\gamma')}{M_1}}_{=c_5}\underbrace{\frac{-s^{(k)\top}g^{(k)}}{\|s^{(k)}\|_2\|g^{(k)}\|_2}}_{=\cos\theta_k}\|g^{(k)}\|_2,
\end{aligned} \quad (A21)$$
Lemma A5. For each $k$, let $\theta_k$ be the angle between $s^{(k)}$ and $-g^{(k)}$. Then,
$$\prod_{k=0}^{K-1}\left(1 - \frac{\gamma'c_1M_1\cos^2\theta_k}{2}\right) \le \left(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\right)^{K/2}, \quad (A22)$$
where $c_3 := \frac{(c_2K)^{\frac{n+m+K}{K}}}{(n+m)^{\frac{n+m}{K}}}$ is defined in Lemma 2. By using $B^{(k)}s^{(k)} = -\rho^{(k)}g^{(k)}$ and $\zeta^{(k)\top}s^{(k)} \ge -(1-\gamma)g^{(k)\top}s^{(k)}$ (shown in Equation (A33)),
$$\prod_{k=0}^{K-1}\frac{\|B^{(k)}s^{(k)}\|_2^2}{s^{(k)\top}B^{(k)}s^{(k)}}\cdot\frac{\zeta^{(k)\top}s^{(k)}}{s^{(k)\top}B^{(k)}s^{(k)}} = \prod_{k=0}^{K-1}\frac{\|g^{(k)}\|_2^2\cdot\zeta^{(k)\top}s^{(k)}}{\left(-s^{(k)\top}g^{(k)}\right)^2} \ge (1-\gamma)^K\cdot\prod_{k=0}^{K-1}\frac{\|g^{(k)}\|_2^2}{-s^{(k)\top}g^{(k)}}. \quad (A24)$$
Hence,
$$\prod_{k=0}^{K-1}\frac{\|g^{(k)}\|_2}{\|s^{(k)}\|_2\cos\theta_k} \le \left(\frac{c_3}{1-\gamma}\right)^K = c_4^K. \quad (A25)$$
$$\prod_{k=0}^{K-1}\cos^2\theta_k \ge \prod_{k=0}^{K-1}\frac{1}{c_4}\cdot\frac{\|g^{(k)}\|_2\cos\theta_k}{\|s^{(k)}\|_2} \ge \left(\frac{1}{c_4c_5}\right)^K. \quad (A26)$$
$$\left(\frac{1}{c_4c_5}\right)^K \le \prod_{k=0}^{K-1}\cos^2\theta_k \le \left(\frac{1}{c_4c_5}\right)^{2\widehat K}, \quad (A27)$$
implying that $\widehat K$ is at most $\frac{K}{2}$ (note that $\frac{1}{c_4c_5} < 1$ from Equation (A26)). Therefore,
$$\prod_{k=0}^{K-1}\left(1 - \frac{\gamma'c_1M_1\cos^2\theta_k}{2}\right) \le \left(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\right)^{K/2}. \quad (A28)$$
where $P(z)^q$ is the element-wise power of $P(z)$. Then, by invoking the Gershgorin circle theorem (Theorem 7.2.1 of [41]), the eigenvalues of $H$ can be upper bounded by the following value:
$$\max\Biggl\{\underbrace{\sum_j P_{ij}(z)^q}_{\text{center of }i\text{-th disc}} + \underbrace{\left[P(z)^q\mathbf{1}_m\right]_i}_{\text{radius of }i\text{-th disc}},\ \sum_i P_{ij}(z)^q + \left[(P(z)^q)^\top\mathbf{1}_n\right]_j\Biggr\} \le \max\Biggl\{2\sum_{j=1}^m P_{ij}(z)^q,\ 2\sum_{i=1}^n P_{ij}(z)^q\Biggr\} \le 2NR^q, \quad (A31)$$
where we use $0 \le P_{ij}(z) \le R$ for all $i$, $j$, and $z\in\mathcal{Z}$ at the last inequality. Hence, $M_2 \le \kappa + \frac{2NR^q}{\lambda}$.
$M_2' \le \kappa + \frac{2N\tau^q}{\lambda}$ is confirmed by noting that $0 \le P_{ij}(z) \le \tau$ for all $i$, $j$, and $z\in\mathcal{Z}_\tau$ and that $\mathcal{Z}_\tau$ is a closed convex set.
where Equation (A32) is used at the first inequality; Equation (A34) is used at the second inequality; Lemma A2 is used at the third inequality; and a consequence of the convexity, $\tilde F(z^{(k)}) - \tilde F(z^\star) \le \langle g^{(k)}, z^{(k)} - z^\star\rangle \le \|g^{(k)}\|_2\|z^{(k)} - z^\star\|_2$, is used at the fourth inequality.
Next, recursively invoking the inequality in Equation (A35), we obtain
$$\tilde F(z^{(K)}) - \tilde F^\star \le \left\{\prod_{k=0}^{K-1}\left(1 - \frac{\gamma'c_1M_1\cos^2\theta_k}{2}\right)\right\}\left(\tilde F(z^{(0)}) - \tilde F^\star\right) \le \left(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\right)^{K/2}\left(\tilde F(z^{(0)}) - \tilde F^\star\right), \quad (A36)$$
$$\begin{aligned}
\frac{\gamma'c_1M_1}{c_4^2c_5^2} &= \frac{\gamma'\cdot\frac{1-\gamma}{M_2}\cdot M_1}{\left(\frac{c_3}{1-\gamma}\right)^2\left(\frac{2(1-\gamma')}{M_1}\right)^2}\\
&= \frac{(1-\gamma)^3M_1^3\gamma'}{4(1-\gamma')^2c_3^2M_2}\\
&= \frac{M_1^3(1-\gamma)^3\gamma'}{4M_2(1-\gamma')^2}\cdot(n+m)^{\frac{2(n+m)}{K}}\cdot\left(\frac{n+m}{K}+M_2\right)^{-\frac{2(n+m+K)}{K}}\cdot K^{-\frac{2(n+m+K)}{K}} \quad (A37)\\
&> \frac{M_1^3(1-\gamma)^3\gamma'}{4M_2(1-\gamma')^2}\cdot 1\cdot M_2^{-2}\,K^{-\frac{2(n+m)}{K}}\\
&\ge \frac{(1-\gamma)^3\gamma'e^{-2(n+m)/e}}{4(1-\gamma')^2}\left(\frac{M_1}{M_2}\right)^3,
\end{aligned}$$
where, at the first inequality, we invoke $(n+m)^{\frac{2(n+m)}{K}} > 1$ and
$$\left(\frac{n+m}{K}+M_2\right)^{-\frac{2(n+m+K)}{K}}\cdot K^{-\frac{2(n+m+K)}{K}} \ge M_2^{-\frac{2(n+m+K)}{K}}\,K^{-\frac{2(n+m+K)}{K}} = M_2^{-\frac{2(n+m+K)}{K}}\,K^{-2}\,K^{-\frac{2(n+m)}{K}} \ge M_2^{-2}\,K^{-\frac{2(n+m)}{K}}, \quad (A38)$$
and we use $K^{-\frac{2(n+m)}{K}} \ge e^{-\frac{2(n+m)}{e}}$ for all $K$ at the second inequality. Hence, the desired inequality is proven.
The simulations in this section were executed on a 2.7 GHz quad-core Intel® Core™ i7 processor. We used the following synthetic dataset: $(x_i)_{i=1}^n \sim \mathcal{N}(\mathbf{1}_2, I_2)$, $(y_j)_{j=1}^m \sim \mathcal{N}(-\mathbf{1}_2, I_2)$, and $n = m = 100$. For q-DOT and Tsallis-regularized OT, different regularization parameters $\lambda \in \{0.5, 1\}$ were compared, and $\varepsilon = 1\times10^{-6}$ was used as the stopping criterion on the gradient norm. The range of regularization parameters differed from that in Section 5.1 because Tsallis-regularized OT does not converge with too-small regularization parameters such as $\lambda = 0.01$. We compared different deformation parameters $q \in \{0, 0.25, 0.5, 0.75\}$. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT and Tsallis-regularized OT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we regarded entries smaller than machine epsilon as zero.
Table A1. Comparison of the sparsity and absolute error on the synthetic dataset. Sparsity indicates
the ratio of zero entries in each transport matrix. We counted the number of entries smaller than ma-
chine epsilon to measure the sparsity instead of imposing a small positive threshold for determining
zero entries. Abs. error indicates the absolute error of the computed cost with respect to 1-Wasserstein
distance. Tsallis-regularized OT with q = 0.00 does not work due to numerical instability.
Sparsity (q-DOT) Abs. Error (q-DOT) Sparsity (Tsallis) Abs. Error (Tsallis)
q = 0.00, λ = 0.50 0.984 0.001 — —
q = 0.00, λ = 1.00 0.981 0.011 — —
q = 0.25, λ = 0.50 0.977 0.008 0.000 3.362
q = 0.25, λ = 1.00 0.973 0.010 0.000 3.388
q = 0.50, λ = 0.50 0.959 0.015 0.000 3.153
q = 0.50, λ = 1.00 0.944 0.022 0.000 3.283
q = 0.75, λ = 0.50 0.861 0.052 0.000 1.962
q = 0.75, λ = 1.00 0.776 0.099 0.000 2.582
As can be seen from the results in Table A1, the Tsallis entropic regularizer neither induces sparsity nor achieves a better approximation of the 1-Wasserstein distance than the deformed q entropy. Note that the Tsallis entropy induces the dual map $\nabla\Omega^\star(\eta) = q^{1/(1-q)}/\exp_q(-\eta/\lambda)$ shown in Equation (25), which has dense support for $q > 0$ and becomes the source of dense transport matrices. This verifies that the design of the regularizer is important for regularized optimal transport.
Table A2. Hyperparameter sensitivity of q-DOT and Sinkhorn algorithm. In these tables, q = 1.00
corresponds to the Sinkhorn algorithm. (q, λ) = (1.00, 0.01) did not work well because of numerical
instability. The results shown in the tables are the means of 10 random trials. Bold typeface indicates
the best result for each of sparsity, absolute error, and runtime.
Runtime Runtime
(N = 100) Sparsity Abs. error (N = 100) Sparsity Abs. error
[ms] [ms]
q = 0.00, λ = 0.01 0.990 2.28 × 10−2 4366.142 q = 0.00, λ = 0.01 0.997 1.30 × 100 33,592.026
q = 0.00, λ = 0.10 0.988 3.63 × 10−3 1236.346 q = 0.00, λ = 0.10 0.996 2.15 × 10−2 14,641.740
q = 0.00, λ = 1.00 0.982 6.20 × 10−3 842.253 q = 0.00, λ = 1.00 0.994 2.03 × 10−2 7749.233
q = 0.25, λ = 0.01 0.989 8.18 × 10−3 3182.535 q = 0.25, λ = 0.01 0.996 7.07 × 10−2 36,167.445
q = 0.25, λ = 0.10 0.986 5.54 × 10−3 1131.784 q = 0.25, λ = 0.10 0.994 1.83 × 10−2 15,176.970
q = 0.25, λ = 1.00 0.973 1.16 × 10−2 668.734 q = 0.25, λ = 1.00 0.990 2.69 × 10−2 5848.561
q = 0.50, λ = 0.01 0.987 9.91 × 10−3 2388.176 q = 0.50, λ = 0.01 0.994 1.99 × 10−2 25,940.619
q = 0.50, λ = 0.10 0.977 7.66 × 10−3 1040.818 q = 0.50, λ = 0.10 0.991 2.41 × 10−2 8304.774
q = 0.50, λ = 1.00 0.946 2.40 × 10−2 339.978 q = 0.50, λ = 1.00 0.976 3.52 × 10−2 2713.598
q = 0.75, λ = 0.01 0.979 1.16 × 10−2 2396.353 q = 0.75, λ = 0.01 0.991 2.97 × 10−2 18,820.365
q = 0.75, λ = 0.10 0.950 1.31 × 10−2 731.564 q = 0.75, λ = 0.10 0.973 3.34 × 10−2 4823.098
q = 0.75, λ = 1.00 0.786 1.02 × 10−1 200.654 q = 0.75, λ = 1.00 0.864 9.57 × 10−2 1654.697
q = 1.00, λ = 0.01 — — — q = 1.00, λ = 0.01 — — —
q = 1.00, λ = 0.10 0.000 5.83 × 10−2 1132.516 q = 1.00, λ = 0.10 0.000 7.39 × 10−2 2014.341
q = 1.00, λ = 1.00 0.000 7.51 × 10−1 31.284 q = 1.00, λ = 1.00 0.000 8.15 × 10−1 207.094
Runtime
(N = 100) Sparsity Abs. error (N = 100) Sparsity Abs. error Runtime [s]
[ms]
q = 0.00, λ = 0.01 0.999 2.48 × 100 86,046.395 q = 0.00, λ = 0.01 1.000 6.39 × 100 336.207
q = 0.00, λ = 0.10 0.997 3.91 × 10−2 49,523.995 q = 0.00, λ = 0.10 0.999 8.76 × 10−2 286.879
q = 0.00, λ = 1.00 0.996 4.10 × 10−2 27,357.659 q = 0.00, λ = 1.00 0.998 8.22 × 10−2 133.223
q = 0.25, λ = 0.01 0.998 2.36 × 10−1 104,346.641 q = 0.25, λ = 0.01 0.999 4.27 × 100 413.775
q = 0.25, λ = 0.10 0.996 5.12 × 10−2 41,810.473 q = 0.25, λ = 0.10 0.998 1.01 × 10−1 221.787
q = 0.25, λ = 1.00 0.994 4.22 × 10−2 18,415.400 q = 0.25, λ = 1.00 0.997 9.01 × 10−2 87.945
q = 0.50, λ = 0.01 0.996 4.52 × 10−2 78,618.996 q = 0.50, λ = 0.01 0.998 8.61 × 10−2 374.123
q = 0.50, λ = 0.10 0.994 4.50 × 10−2 25,512.371 q = 0.50, λ = 0.10 0.997 9.37 × 10−2 120.605
q = 0.50, λ = 1.00 0.984 4.92 × 10−2 8266.048 q = 0.50, λ = 1.00 0.990 9.49 × 10−2 41.435
q = 0.75, λ = 0.01 0.994 4.55 × 10−2 57,839.639 q = 0.75, λ = 0.01 0.996 1.05 × 10−1 275.101
q = 0.75, λ = 0.10 0.979 5.07 × 10−2 14,257.452 q = 0.75, λ = 0.10 0.985 1.02 × 10−1 67.301
q = 0.75, λ = 1.00 0.890 1.00 × 10−1 4362.478 q = 0.75, λ = 1.00 0.917 1.34 × 10−1 21.536
q = 1.00, λ = 0.01 — — — q = 1.00, λ = 0.01 — — —
q = 1.00, λ = 0.10 0.000 7.92 × 10−2 5731.333 q = 1.00, λ = 0.10 0.000 8.62 × 10−2 57.739
q = 1.00, λ = 1.00 0.000 8.35 × 10−1 562.722 q = 1.00, λ = 1.00 0.000 8.51 × 10−1 2.215
(N = 100) Sparsity Abs. error Runtime [s] (N = 100) Sparsity Abs. error Runtime [s]
q = 0.00, λ = 0.01 1.000 3.59 × 100 1386.554 q = 0.00, λ = 0.01 1.000 4.09 × 100 3257.314
q = 0.00, λ = 0.10 0.999 2.25 × 10−1 1245.867 q = 0.00, λ = 0.10 1.000 8.56 × 10−1 3108.889
q = 0.00, λ = 1.00 0.999 1.85 × 10−1 823.011 q = 0.00, λ = 1.00 0.999 2.68 × 10−1 2355.733
q = 0.25, λ = 0.01 1.000 5.88 × 100 1555.064 q = 0.25, λ = 0.01 1.000 3.78 × 100 3821.319
q = 0.25, λ = 0.10 0.999 1.86 × 10−1 1201.656 q = 0.25, λ = 0.10 0.999 2.94 × 10−1 3532.833
q = 0.25, λ = 1.00 0.998 1.86 × 10−1 492.324 q = 0.25, λ = 1.00 0.999 2.76 × 10−1 1530.838
q = 0.50, λ = 0.01 0.999 6.66 × 10−1 1494.270 q = 0.50, λ = 0.01 1.000 1.85 × 100 3669.894
q = 0.50, λ = 0.10 0.998 1.97 × 10−1 589.379 q = 0.50, λ = 0.10 0.999 2.93 × 10−1 1637.985
q = 0.50, λ = 1.00 0.994 1.85 × 10−1 210.008 q = 0.50, λ = 1.00 0.995 2.71 × 10−1 644.164
q = 0.75, λ = 0.01 0.998 2.00 × 10−1 1300.517 q = 0.75, λ = 0.01 0.998 2.98 × 10−1 3560.379
q = 0.75, λ = 0.10 0.989 2.00 × 10−1 321.221 q = 0.75, λ = 0.10 0.991 2.91 × 10−1 853.451
q = 0.75, λ = 1.00 0.937 2.08 × 10−1 106.334 q = 0.75, λ = 1.00 0.946 2.83 × 10−1 270.046
q = 1.00, λ = 0.01 — — — q = 1.00, λ = 0.01 — — —
q = 1.00, λ = 0.10 0.000 9.06 × 10−2 147.372 q = 1.00, λ = 0.10 0.000 8.94 × 10−2 272.210
q = 1.00, λ = 1.00 0.000 8.62 × 10−1 8.575 q = 1.00, λ = 1.00 0.000 8.62 × 10−1 20.120
References
1. Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338.
2. Shafieezadeh-Abadeh, S.; Mohajerin Esfahani, P.M.; Kuhn, D. Distributionally robust logistic regression. Adv. Neural Inf. Process.
Syst. 2015, 28. https://dl.acm.org/doi/10.5555/2969239.2969415.
3. Courty, N.; Flamary, R.; Habrard, A.; Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. Adv.
Neural Inf. Process. Syst. 2017, 30. https://dl.acm.org/doi/10.5555/3294996.3295130.
4. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International
Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR, pp. 214–223.
5. Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From word embeddings to document distances. In Proceedings of the 32nd
International Conference on Machine Learning, Lille, France, 7–9 July 2015; PMLR, pp. 957–966.
6. Swanson, K.; Yu, L.; Lei, T. Rationalizing text matching: Learning sparse alignments via optimal transport. In Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5609–5626.
7. Otani, M.; Togashi, R.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Satoh, S. Optimal correction cost for object detection evaluation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022;
pp. 21107–21115.
8. Pele, O.; Werman, M. Fast and robust Earth Mover’s Distances. In Proceedings of the 2009 IEEE 12th International Conference on
Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: New York, NY, USA, 2009; pp. 460–467.
9. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300.
10. Dessein, A.; Papadakis, N.; Rouas, J.L. Regularized optimal transport and the rot mover’s distance. J. Mach. Learn. Res. 2018,
19, 590–642.
11. Dvurechensky, P.; Gasnikov, A.; Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent
is better than by Sinkhorn’s algorithm. In Proceedings of the 36th International Conference on Machine Learning, Stockholm,
Sweden, 10–15 July 2018; PMLR, pp. 1367–1376.
12. Le, T.; Yamada, M.; Fukumizu, K.; Cuturi, M. Tree-sliced variants of Wasserstein distances. Adv. Neural Inf. Process. Syst. 2019,
32, 12304–12315.
13. Le, T.; Nguyen, T.; Phung, D.; Nguyen, V.A. Sobolev transport: A scalable metric for probability measures with graph metrics.
In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Online, 28–30 March 2022; PMLR,
pp. 9844–9868.
14. Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; Poggio, T.A. Learning with a Wasserstein loss. Adv. Neural Inf. Process. Syst. 2015,
28, 2053–2061.
15. Cuturi, M.; Teboul, O.; Vert, J.P. Differentiable ranking and sorting using optimal transport. Adv. Neural Inf. Process. Syst. 2019,
32, 6861–6871.
16. Blondel, M.; Martins, A.F.; Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
17. Birkhoff, G. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucum’an Rev. Ser. A 1946, 5, 147–154.
18. Brualdi, R.A. Combinatorial Matrix Classes; Cambridge University Press: Cambridge, UK, 2006; Volume 13.
19. Alvarez-Melis, D.; Jaakkola, T. Gromov–Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1881–1890.
20. Blondel, M.; Seguy, V.; Rolet, A. Smooth and sparse optimal transport. In Proceedings of the 21st International Conference on
Artificial Intelligence and Statistics, Canary Islands, Spain, 9–11 April 2018; PMLR, pp. 880–889.
21. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
[CrossRef]
22. Amari, S.i.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185. [CrossRef]
23. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [CrossRef]
24. Powell, M.J.D. Some global convergence properties of a variable metric algorithm for minimization without exact line searches.
In Proceedings of the Nonlinear Programming, SIAM-AMS Proceedings, New York, NY, USA, 1 January 1976; Volume 9.
25. Altschuler, J.; Niles-Weed, J.; Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration.
Adv. Neural Inf. Process. Syst. 2017, 30, 1961–1971.
26. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348.
[CrossRef]
27. Danskin, J.M. The theory of max-min, with applications. SIAM J. Appl. Math. 1966, 14, 641–664. [CrossRef]
28. Bao, H.; Sugiyama, M. Fenchel-Young losses with skewed entropies for class-posterior probability estimation. In Proceedings of
the 24th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 13–15 April 2021; pp. 1648–1656.
29. Naudts, J. Deformed exponentials and logarithms in generalized thermostatistics. Phys. A Stat. Mech. Its Appl. 2002, 316, 323–334.
[CrossRef]
30. Suyari, H. The unique non self-referential q-canonical distribution and the physical temperature derived from the maximum
entropy principle in Tsallis statistics. Prog. Theor. Phys. Suppl. 2006, 162, 79–86. [CrossRef]
31. Ding, N.; Vishwanathan, S. t-Logistic regression. Adv. Neural Inf. Process. Syst. 2010, 23, 514–522.
32. Futami, F.; Sato, I.; Sugiyama, M. Expectation propagation for t-exponential family using q-algebra. Adv. Neural Inf. Process. Syst.
2017, 30. https://dl.acm.org/doi/10.5555/3294771.3294985.
33. Amid, E.; Warmuth, M.K.; Anil, R.; Koren, T. Robust bi-tempered logistic loss based on bregman divergences. Adv. Neural Inf.
Process. Syst. 2019, 32, 15013–15022.
34. Martins, A.F.; Figueiredo, M.A.; Aguiar, P.M.; Smith, N.A.; Xing, E.P. Nonextensive entropic kernels. In Proceedings of the 25th
International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 640–647.
35. Muzellec, B.; Nock, R.; Patrini, G.; Nielsen, F. Tsallis regularized optimal transport and ecological inference. In Proceedings of the
31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
36. Byrd, R.H.; Nocedal, J.; Yuan, Y.X. Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer.
Anal. 1987, 24, 1171–1190. [CrossRef]
37. Schmitzer, B. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM J. Sci. Comput. 2019,
41, A1443–A1481. [CrossRef]
38. Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M.Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N.;
et al. POT: Python optimal transport. J. Mach. Learn. Res. 2021, 22, 1–8.
39. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.;
Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272. doi:
10.1038/s41592-019-0686-2. [CrossRef] [PubMed]
40. Weed, J. An explicit analysis of the entropic penalty in linear programming. In Proceedings of the the 31st Conference on
Learning Theory, Stockholm, Sweden, 5–9 July 2018; PMLR, pp. 1841–1855.
41. Golub, G.H.; van Loan, C.F. Matrix Computations; The Johns Hopkins University Press: Baltimore, MD, USA, 2013.