
Article
Sparse Regularized Optimal Transport with Deformed
q-Entropy
Han Bao 1, * and Shinsaku Sakaue 2

1 Graduate School of Informatics and The Hakubi Center for Advanced Research, Kyoto University,
Kyoto 604-8103, Japan
2 Department of Mathematical Informatics, Graduate School of Information Science and Technology,
The University of Tokyo, Tokyo 153-8505, Japan
* Correspondence: [email protected]

Abstract: Optimal transport is a mathematical tool that has been widely used to measure the distance between two probability distributions. To mitigate the cubic computational complexity of the vanilla formulation of the optimal transport problem, regularized optimal transport has received attention in recent years; it is a convex program that minimizes the linear transport cost plus an added convex regularizer. Sinkhorn optimal transport is the most prominent example, regularized with the negative Shannon entropy, which leads to densely supported solutions that are often undesirable in light of the interpretability of transport plans. In this paper, we report that a deformed entropy designed with the q-algebra, a popular generalization of the standard algebra studied in Tsallis statistical mechanics, makes optimal transport solutions sparsely supported. This entropy with a deformation parameter q interpolates between the negative Shannon entropy (q = 1) and the squared 2-norm (q = 0), and the solution becomes sparser as q tends to zero. Our theoretical analysis reveals that a larger q leads to faster convergence when the problem is optimized with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. In summary, the deformation induces a trade-off between sparsity and convergence speed.

Keywords: optimal transport; Sinkhorn algorithm; convex analysis; entropy; quasi-Newton method
Citation: Bao, H.; Sakaue, S. Sparse Regularized Optimal Transport with Deformed q-Entropy. Entropy 2022, 24, 1634. https://doi.org/10.3390/e24111634
Academic Editor: Sotiris Kotsiantis

Received: 18 September 2022; Accepted: 7 November 2022; Published: 10 November 2022

1. Introduction

Optimal transport (OT) is a classic problem in operations research, used to compute a transport plan between suppliers and demanders with a minimum transportation cost. When the suppliers and demanders are regarded as two probability distributions, the minimum transportation cost can be interpreted as the closeness between the distributions. The OT problem has been extensively studied (also as the Wasserstein distance) [1] and used in robust machine learning [2], domain adaptation [3], generative modeling [4], and natural language processing [5], owing to its many useful properties, such as serving as a distance between two probability distributions. Recently, the OT problem has been employed in various modern applications, such as interpretable word alignment [6] and the locality-aware evaluation of object detection [7], because it captures the geometry of data and provides a measure of closeness and alignment among different objects. From a computational perspective, a naïve approach is to use a network simplex algorithm or interior point method to solve the OT problem as a usual linear program; this approach requires supercubic time complexity [8] and is not scalable. A number of approaches have been suggested to accelerate the computation of the OT problem: entropic regularization [9,10], accelerated gradient descent [11], and approximation with tree [12] and graph metrics [13]. We focused our attention on entropic-regularized OT because it admits a unique solution owing to strong convexity and transforms the original constrained optimization into an unconstrained problem with a clear primal–dual relationship. The celebrated Sinkhorn algorithm solves entropic-regularized OT with square-time complexity [9]. Furthermore, the Sinkhorn algorithm is amenable to differentiable programming and is easily incorporated into end-to-end learning pipelines [14,15].
Despite the popularity of the Sinkhorn algorithm, one of its main drawbacks is that the Shannon entropy blurs the OT solution, i.e., solutions of entropic-regularized OT are always densely supported. The Shannon entropy induces a probability distribution that has strictly positive values everywhere on its support [16], whereas the vanilla (unregularized) OT produces extremely sparse transport plans located on the boundaries of a polytope [17,18]. If we are interested in alignment and matching between different objects (such as in several applications of natural language processing [6,19]), dense transport plans are hard to interpret because the matching information between objects may be obfuscated by unimportant small densities contained in the transport plans. One attempt toward realizing sparse OT is to use the squared 2-norm as an alternative regularizer. Blondel et al. [20] showed that the dual of this optimization problem can be solved via the L-BFGS method [21]; the primal solution corresponds to a transport plan recovered from the dual solution in a closed form, which is sparse. Although they successfully obtained a sparse OT formulation with a numerically stable algorithm, the degree of sparsity cannot be easily modulated when we wish to control the sparsity for a final application. Furthermore, theoretical convergence rates for solving regularized OT remain unknown.
In this study, we aimed to examine the relationship between the sparsity of transport plans and the convergence guarantee of regularized OT. Specifically, we propose yet another entropic regularizer called the deformed q-entropy, with a deformation parameter q that allows us to control the solution sparsity. To introduce the new regularizer, we start from the dual solution of entropic-regularized OT, which is given by the Gibbs kernel; the Gibbs kernel associated with the Shannon entropy induces nonsparsity, and, therefore, we replace it with a sparse kernel based on the q-exponential distribution [22], following the idea of Tsallis statistics [23]. The deformed q-entropy is derived from the dual solution characterized by the sparse kernel. Interestingly, the deformed q-entropy recovers the Shannon entropy in the limit q ↗ 1 and matches the (negative) squared 2-norm at q = 0; that is, the deformed q-entropy interpolates between the two regularizers. We confirm that the solution becomes increasingly sparse as q approaches zero. We call the OT regularized with the deformed q-entropy deformed q-optimal transport (q-DOT). The q-DOT reveals an interesting connection between the OT solution and the q-exponential distribution, which is of independent interest. From the optimization perspective, the unconstrained dual of q-DOT can be solved with many standard solvers, as reported in Blondel et al. [20]. Our analysis of the convergence rate of the dual optimization shows that the convergence of the BFGS method [24] becomes faster as the deformation parameter q approaches one. Therefore, a weaker deformation (larger q) leads to faster convergence while sacrificing sparsity. Finally, we demonstrate the trade-off between sparsity and convergence in numerical experiments.

Our contributions can be summarized as follows: (i) showing a clear connection between the regularized OT problem and the q-exponential distribution; (ii) demonstrating the trade-off of q-DOT between sparsity and convergence; (iii) providing a formal convergence guarantee of q-DOT when solved with the BFGS method. The rest of this paper is organized as follows: Section 2 introduces the necessary background on the OT problem and entropic regularization. In Section 3, the Lagrange dual of the regularized OT problem is first derived; then, the dual optimality condition and the q-exponential distribution are connected to sparsify the transport matrix. Section 4 focuses on the optimization perspective of the regularized OT problem and provides a convergence guarantee with the BFGS method, which shows the theoretical trade-off between sparsity and convergence. Finally, the empirical behavior and the trade-off of the regularized OT are numerically confirmed in Section 5.

2. Background
2.1. Preliminaries

For $x \in \mathbb{R}$, let $[x]_+ = x$ if $x > 0$ and $0$ otherwise, and let $[x]_+^p$ represent $([x]_+)^p$ hereafter. For a convex function $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ represents a Euclidean vector space equipped with an inner product $\langle \cdot, \cdot \rangle$, the Fenchel–Legendre conjugate $f^\star : \mathcal{X} \to \mathbb{R}$ is defined as $f^\star(\mathbf{y}) := \sup_{\mathbf{x} \in \mathcal{X}} \langle \mathbf{x}, \mathbf{y} \rangle - f(\mathbf{x})$. The relative interior of a set $S$ is denoted by $\mathrm{ri}\,S$, and the effective domain of a function $f$ is denoted by $\mathrm{dom}(f)$. A differentiable function $f$ is said to be $M$-strongly convex over $S \subseteq \mathrm{ri}\,\mathrm{dom}(f)$ if, for all $\mathbf{x}, \mathbf{y} \in S$, we have $f(\mathbf{x}) - f(\mathbf{y}) \le \langle \nabla f(\mathbf{x}), \mathbf{x} - \mathbf{y} \rangle - \frac{M}{2}\|\mathbf{x} - \mathbf{y}\|_2^2$. If $f$ is twice differentiable, the strong convexity is equivalent to $\nabla^2 f(\mathbf{x}) \succeq M\mathbf{I}$ for all $\mathbf{x} \in S$. Similarly, a differentiable function $f$ is said to be $M$-smooth over $S \subseteq \mathrm{ri}\,\mathrm{dom}(f)$ if, for all $\mathbf{x}, \mathbf{y} \in S$, we have $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le M\|\mathbf{x} - \mathbf{y}\|_2$, which is equivalent to $\nabla^2 f(\mathbf{x}) \preceq M\mathbf{I}$ for all $\mathbf{x} \in S$ if $f$ is twice differentiable.

2.2. Optimal Transport

The OT is a mathematical problem to find a transport plan between two probability distributions with the minimum transport cost. The discussions in this paper are restricted to discrete distributions. Let $(\mathcal{X}, d)$, $\delta_{\mathbf{x}}$, and $\triangle^{n-1} := \{\mathbf{p} \in [0,1]^n \mid \langle \mathbf{p}, \mathbf{1}_n \rangle = 1\}$ represent a metric space, the Dirac measure at point $\mathbf{x}$, and the $(n-1)$-dimensional probability simplex, respectively. Let $\mu = \sum_{i=1}^{n} a_i \delta_{\mathbf{x}_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{\mathbf{y}_j}$ be histograms supported on the finite sets of points $(\mathbf{x}_i)_{i=1}^{n} \subseteq \mathcal{X}$ and $(\mathbf{y}_j)_{j=1}^{m} \subseteq \mathcal{X}$, respectively, where $\mathbf{a} \in \triangle^{n-1}$ and $\mathbf{b} \in \triangle^{m-1}$ are probability vectors. The OT between two discrete probability measures $\mu$ and $\nu$ is the optimization problem

$$ T(\mu, \nu) := \inf_{\Pi \in U(\mu,\nu)} \sum_{i=1}^{n} \sum_{j=1}^{m} d(\mathbf{x}_i, \mathbf{y}_j)\,\Pi_{ij}, \qquad (1) $$

where $U$ represents the transport polytope, defined as

$$ U(\mu, \nu) := \Big\{ \Pi \in \mathbb{R}_{\ge 0}^{n \times m} \;\Big|\; \Pi \mathbf{1}_m = \mathbf{a},\ \Pi^\top \mathbf{1}_n = \mathbf{b} \Big\}. \qquad (2) $$

The transport polytope $U$ defines the constraints on the row/column marginals of a transport matrix $\Pi$. These constraints are often referred to as coupling constraints. For notational simplicity, the matrix $D_{ij} := d(\mathbf{x}_i, \mathbf{y}_j)$ and the expectation $\langle \mathbf{D}, \Pi \rangle := \sum_{i=1}^{n}\sum_{j=1}^{m} D_{ij}\Pi_{ij}$ are used hereafter. $T(\mu, \nu)$ is known as the 1-Wasserstein distance, which defines a metric space over histograms [1].

Equation (1) is a linear program and can be solved by well-studied algorithms such as the interior point and network simplex methods. However, its computational complexity is $O(n^3 \log n)$ (assuming $n = m$), so it is not scalable to large datasets [8].
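As a point of reference, the following is a minimal sketch of solving the linear program in Equation (1) with the Python optimal transport (POT) package, which is also the implementation used for the unregularized OT in Section 5; the Gaussian point clouds, uniform marginals, and Euclidean cost are illustrative assumptions rather than prescriptions of the text.

```python
import numpy as np
import ot  # Python optimal transport (POT) package

rng = np.random.default_rng(0)
n, m = 30, 30
x = rng.normal(loc=1.0, size=(n, 2))    # samples supporting mu (illustrative)
y = rng.normal(loc=-1.0, size=(m, 2))   # samples supporting nu (illustrative)

a = np.full(n, 1.0 / n)                 # uniform marginal a
b = np.full(m, 1.0 / m)                 # uniform marginal b
D = ot.dist(x, y, metric='euclidean')   # cost matrix D_ij = d(x_i, y_j)

# Solve the linear program (1) with POT's network simplex solver.
P = ot.emd(a, b, D)
print("transport cost <D, P> =", np.sum(D * P))
```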

2.3. Entropic Regularization and Sinkhorn Algorithm

The entropic-regularized formulation is commonly used to reduce the computational burden. Here, we introduce regularized OT with the negative Shannon entropy [9] as

$$ T_{-\lambda H}(\mu, \nu) := \inf_{\Pi \in U(\mu,\nu)} \langle \mathbf{D}, \Pi \rangle + \lambda\underbrace{\sum_{i=1}^{n}\sum_{j=1}^{m}\big(\Pi_{ij}\log\Pi_{ij} - \Pi_{ij}\big)}_{\text{negative Shannon entropy}}, \qquad (3) $$

where $\lambda > 0$ represents the regularization strength. Let us review the derivation of the updates of the Sinkhorn algorithm. The Lagrangian of the optimization problem in Equation (3) is

$$ L(\Pi, \boldsymbol{\alpha}, \boldsymbol{\beta}) := \sum_{i=1}^{n}\sum_{j=1}^{m}\big(D_{ij}\Pi_{ij} + \lambda(\Pi_{ij}\log\Pi_{ij} - \Pi_{ij})\big) + \sum_{i=1}^{n}\alpha_i\big([\Pi\mathbf{1}_m]_i - a_i\big) + \sum_{j=1}^{m}\beta_j\big([\Pi^\top\mathbf{1}_n]_j - b_j\big), \qquad (4) $$

where $\boldsymbol{\alpha} \in \mathbb{R}^n$ and $\boldsymbol{\beta} \in \mathbb{R}^m$ represent the Lagrangian multipliers. Equation (4) ignores the constraints $\Pi_{ij} \ge 0$ (for all $i \in [n]$ and $j \in [m]$); however, they will be automatically satisfied. Taking the derivative with respect to $\Pi_{ij}$,

$$ \nabla_{\Pi_{ij}} L = D_{ij} + \lambda\log\Pi_{ij} + \alpha_i + \beta_j, \qquad (5) $$

and, hence, the stationary condition $\nabla_{\Pi_{ij}} L = 0$ induces the solution

$$ \Pi_{ij} = \exp\!\Big(-\frac{\alpha_i + \beta_j + D_{ij}}{\lambda}\Big). \qquad (6) $$

The decomposition $\Pi_{ij} = \exp\!\big(-\tfrac{D_{ij}}{\lambda}\big)\big/\exp\!\big(\tfrac{\alpha_i+\beta_j}{\lambda}\big)$ suggests that the stationary point is the (normalized) Gibbs kernel $\exp\!\big(-\tfrac{D_{ij}}{\lambda}\big)$. One can easily infer that the Sinkhorn solution is dense because the Gibbs kernel is supported on the entire $\mathbb{R}_{\ge 0}$, i.e., $\exp\!\big(-\tfrac{z}{\lambda}\big) > 0$ for all $z \in \mathbb{R}_{\ge 0}$. We can write Equation (6) in matrix form by applying the variable transforms $u_i := \exp\!\big(-\tfrac{\alpha_i}{\lambda}\big)$, $v_j := \exp\!\big(-\tfrac{\beta_j}{\lambda}\big)$, and $K_{ij} := \exp\!\big(-\tfrac{D_{ij}}{\lambda}\big)$ as

$$ \Pi = \underbrace{\mathrm{diag}(\mathbf{u})}_{:=\mathbf{U}}\,\mathbf{K}\,\underbrace{\mathrm{diag}(\mathbf{v})}_{:=\mathbf{V}}. \qquad (7) $$

The following Sinkhorn updates are used to make Equation (7) meet the marginal constraints:

$$ \begin{cases} \mathbf{u}' \leftarrow \mathbf{a} / (\mathbf{K}\mathbf{v}) \\ \mathbf{v}' \leftarrow \mathbf{b} / (\mathbf{K}^\top\mathbf{u}) \end{cases}, \qquad (8) $$

where $\mathbf{z}/\boldsymbol{\eta}$ represents the element-wise division of the two vectors $\mathbf{z}$ and $\boldsymbol{\eta}$. The computational complexity is $O(Knm)$ because the Sinkhorn updates involve only matrix–vector multiplications and element-wise divisions, where $K$ here represents the number of Sinkhorn updates. A finer analysis of the number of updates required to meet a given error tolerance is provided in the literature [25].
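For illustration, a minimal NumPy sketch of the Sinkhorn updates in Equation (8) might look as follows; the regularization strength, tolerance, and iteration cap are illustrative choices rather than values prescribed by the text.

```python
import numpy as np

def sinkhorn(a, b, D, lam=0.1, tol=1e-6, max_iter=10_000):
    """Entropic-regularized OT via the Sinkhorn updates in Equation (8)."""
    K = np.exp(-D / lam)                 # Gibbs kernel K_ij = exp(-D_ij / lam)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(max_iter):
        u = a / (K @ v)                  # u' <- a / (K v)
        v = b / (K.T @ u)                # v' <- b / (K^T u)
        P = u[:, None] * K * v[None, :]  # Pi = diag(u) K diag(v), Equation (7)
        # Stop once the coupling constraints are met within an l1 error of tol.
        err = np.abs(P.sum(axis=1) - a).sum() + np.abs(P.sum(axis=0) - b).sum()
        if err < tol:
            break
    return P
```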

3. Deformed q-Entropy and q-Regularized Optimal Transport

3.1. Regularized Optimal Transport and Its Dual

Let us consider the following primal problem with a general regularization function $\Omega$.

Definition 1 (Primal of regularized OT).

$$ T_\Omega(\mu, \nu) = \inf_{\Pi \in U(\mu,\nu)} \langle \mathbf{D}, \Pi \rangle + \sum_{i,j} \Omega(\Pi_{ij}), \qquad (9) $$

where $\Omega : \mathbb{R} \to \mathbb{R}$ represents a proper closed convex function.

Next, we derive its dual by Lagrange duality. The Lagrangian of Equation (9) is defined as

$$ L(\Pi, \boldsymbol{\alpha}, \boldsymbol{\beta}) := \langle \mathbf{D}, \Pi \rangle + \sum_{i,j} \Omega(\Pi_{ij}) + \langle \boldsymbol{\alpha}, \Pi\mathbf{1}_m - \mathbf{a} \rangle + \langle \boldsymbol{\beta}, \Pi^\top\mathbf{1}_n - \mathbf{b} \rangle, \qquad (10) $$

with dual variables $\boldsymbol{\alpha} \in \mathbb{R}^n$ and $\boldsymbol{\beta} \in \mathbb{R}^m$. Then, the primal can be rewritten in terms of the Lagrangian

$$ T_\Omega(\mu, \nu) = \inf_{\Pi \in \mathbb{R}_{\ge 0}^{n\times m}} \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m} L(\Pi, \boldsymbol{\alpha}, \boldsymbol{\beta}). \qquad (11) $$

In this Lagrangian formulation, we let the constraints $\Pi \in \mathbb{R}_{\ge 0}^{n\times m}$ remain for a technical reason. The constrained optimization problem in (11) can be reformulated into the following unconstrained one with an indicator function $I_{\mathbb{R}_{\ge 0}^{n\times m}}$:

$$ T_\Omega(\mu, \nu) = \inf_{\Pi \in \mathbb{R}^{n\times m}} \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m} L(\Pi, \boldsymbol{\alpha}, \boldsymbol{\beta}) + I_{\mathbb{R}_{\ge 0}^{n\times m}}(\Pi), \qquad (12) $$

which corresponds to an optimization problem with the convex objective function $\langle \mathbf{D}, \Pi \rangle + \sum_{i,j}\Omega(\Pi_{ij}) + I_{\mathbb{R}_{\ge 0}^{n\times m}}(\Pi)$ with only the linear constraints $\Pi\mathbf{1}_m = \mathbf{a}$ and $\Pi^\top\mathbf{1}_n = \mathbf{b}$. By invoking the Sinkhorn–Knopp theorem [26], the existence of a strictly feasible solution, namely, a solution satisfying $\Pi\mathbf{1}_m = \mathbf{a}$ and $\Pi^\top\mathbf{1}_n = \mathbf{b}$, can be confirmed. Hence, we see that the Slater condition is satisfied, and strong duality holds as follows:

$$
\begin{aligned}
T_\Omega(\mu, \nu) &= \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m}\;\inf_{\Pi \in \mathbb{R}_{\ge 0}^{n\times m}} L(\Pi, \boldsymbol{\alpha}, \boldsymbol{\beta}) \\
&= \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m} -\langle \mathbf{a}, \boldsymbol{\alpha} \rangle - \langle \mathbf{b}, \boldsymbol{\beta} \rangle + \inf_{\Pi \in \mathbb{R}_{\ge 0}^{n\times m}} \sum_{i,j}(D_{ij} + \alpha_i + \beta_j)\Pi_{ij} + \Omega(\Pi_{ij}) \\
&= \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m} -\langle \mathbf{a}, \boldsymbol{\alpha} \rangle - \langle \mathbf{b}, \boldsymbol{\beta} \rangle - \sup_{\Pi \in \mathbb{R}_{\ge 0}^{n\times m}} \sum_{i,j} -(D_{ij} + \alpha_i + \beta_j)\Pi_{ij} - \Omega(\Pi_{ij}) \\
&= \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m} -\langle \mathbf{a}, \boldsymbol{\alpha} \rangle - \langle \mathbf{b}, \boldsymbol{\beta} \rangle - \sum_{i,j}\Omega^\star(-D_{ij} - \alpha_i - \beta_j),
\end{aligned} \qquad (13)
$$

where $\Omega^\star$ represents the Fenchel–Legendre conjugate of $\Omega : \mathbb{R} \to \mathbb{R}$,

$$ \Omega^\star(\eta) := \sup_{\pi \ge 0}\ \eta\pi - \Omega(\pi). \qquad (14) $$

Although each element of the transport plans ranges over $[0, 1]$, it is sufficient to define the Fenchel–Legendre conjugate as the supremum over $\mathbb{R}_{\ge 0}$ because of how $\Omega^\star$ emerges in the strong duality (13). According to Danskin's theorem [27], the supremum of the Fenchel–Legendre conjugate is attained at

$$ \Pi_{ij}^\star = \nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j). \qquad (15) $$

Therefore, the dual of regularized OT is formulated as follows:

Definition 2 (Dual of regularized OT).

$$ T_\Omega(\mu, \nu) = \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m} -\langle \mathbf{a}, \boldsymbol{\alpha} \rangle - \langle \mathbf{b}, \boldsymbol{\beta} \rangle - \sum_{i,j}\Omega^\star(-D_{ij} - \alpha_i - \beta_j), \qquad (16) $$

where $\Omega^\star$ represents the Fenchel–Legendre conjugate $\Omega^\star(\eta) := \sup_{\pi \ge 0}\eta\pi - \Omega(\pi)$. The optimal solution of the primal is given by the dual map $\nabla\Omega^\star$ such that $\Pi_{ij}^\star = \nabla\Omega^\star(-D_{ij} - \alpha_i^\star - \beta_j^\star)$, where $(\boldsymbol{\alpha}^\star, \boldsymbol{\beta}^\star)$ represents the dual optimal solution.

Next, we see several examples, which are summarized in Table 1.

Example 1 (Negative Shannon entropy). Let $\Omega(\pi) = -\lambda H(\pi) = \lambda(\pi\log\pi - \pi)$; then $\Omega^\star(\eta) = \lambda e^{\eta/\lambda}$ and $\nabla\Omega^\star(\eta) = e^{\eta/\lambda}$. The optimal solution represented with the optimal dual variables $(\boldsymbol{\alpha}^\star, \boldsymbol{\beta}^\star)$ is $\Pi_{ij}^\star = \exp\!\big(-\tfrac{D_{ij}+\alpha_i^\star+\beta_j^\star}{\lambda}\big)$. This recovers the stationary point of the Sinkhorn OT in Equation (6). The solution is dense because the regularizer $\Omega$ induces the Gibbs kernel $\nabla\Omega^\star(\eta) = e^{\eta/\lambda} > 0$ for all $\eta \in \mathbb{R}$.

Example 2 (Squared 2-norm). Let $\Omega(\pi) = \tfrac{\lambda}{2}\pi^2$; then $\Omega^\star(\eta) = \tfrac{1}{2\lambda}[\eta]_+^2$ and $\nabla\Omega^\star(\eta) = \tfrac{1}{\lambda}[\eta]_+$. The optimal solution represented with the optimal dual variables $(\boldsymbol{\alpha}^\star, \boldsymbol{\beta}^\star)$ is $\Pi_{ij}^\star = \tfrac{1}{\lambda}\big[-D_{ij} - \alpha_i^\star - \beta_j^\star\big]_+$. As mentioned by Blondel et al. [20], the squared 2-norm can sparsify the solution because $\nabla\Omega^\star(\eta) = \tfrac{1}{\lambda}[\eta]_+$ may take the value 0.

Table 1. Summary of $\Omega(\pi)$, $\Omega^\star(\eta)$, and $\nabla\Omega^\star(\eta)$ for several regularizers. The relationships between $\Omega$, its conjugate, and the derivatives are summarized in Bao and Sugiyama [28].

| Regularizer | $\Omega(\pi)$ | $\Omega^\star(\eta)$ | $\nabla\Omega^\star(\eta)$ |
|---|---|---|---|
| Negative entropy | $\lambda(\pi\log\pi - \pi)$ | $\lambda e^{\eta/\lambda}$ | $e^{\eta/\lambda}$ |
| Squared 2-norm | $\frac{\lambda}{2}\pi^2$ | $\frac{1}{2\lambda}[\eta]_+^2$ | $\frac{1}{\lambda}[\eta]_+$ |
| Deformed q-entropy | $\frac{\lambda}{2-q}(\pi\log_q(\pi) - \pi)$ | $\frac{\lambda}{2-q}\exp_q(\eta/\lambda)^{2-q}$ | $\exp_q(\eta/\lambda)$ |
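To make the role of the dual map concrete, the following sketch evaluates $\nabla\Omega^\star$ for the negative Shannon entropy and the squared 2-norm from Table 1 on a small grid of dual residuals $\eta = -D_{ij} - \alpha_i - \beta_j$; the grid and the value of $\lambda$ are illustrative.

```python
import numpy as np

lam = 0.1
eta = np.linspace(-1.0, 0.5, 7)   # illustrative dual residuals -D_ij - alpha_i - beta_j

# Negative Shannon entropy: the Gibbs kernel is positive everywhere -> dense plans.
plan_entropy = np.exp(eta / lam)

# Squared 2-norm: the dual map is a rectifier and can output exact zeros -> sparse plans.
plan_sqnorm = np.maximum(eta, 0.0) / lam

print(plan_entropy)   # all entries strictly positive
print(plan_sqnorm)    # entries with eta <= 0 are exactly 0
```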

3.2. q-Algebra and Deformed Entropy

As shown in the last few examples, the dual map $\nabla\Omega^\star$ plays an important role in the sparsity of the OT solution. In addition, the induced $\nabla\Omega^\star$ is the Gibbs kernel when the negative Shannon entropy is used as $\Omega$. Therefore, one may think of designing a regularizer from $\nabla\Omega^\star$ by utilizing a kernel function that induces sparsity. One candidate is the q-exponential distribution. We begin with some basics required to formulate q-exponential distributions.

First, we introduce the q-algebra, which has been well studied in the field of Tsallis statistical mechanics [23,29,30]. The q-algebra has been used in the machine-learning literature for regression [31], Bayesian inference [32], and robust learning [33]. For a deformation parameter $q \in [0, 1]$, the q-logarithm and q-exponential functions are defined as

$$ \log_q(x) := \begin{cases} \dfrac{x^{1-q}-1}{1-q} & \text{if } q \in [0,1) \\ \log(x) & \text{if } q = 1 \end{cases}, \qquad \exp_q(x) := \begin{cases} [1 + (1-q)x]_+^{1/(1-q)} & \text{if } q \in [0,1) \\ \exp(x) & \text{if } q = 1 \end{cases}. \qquad (17) $$

The q-logarithm is defined only for $x > 0$, as is the natural logarithm; the two functions are inverses of each other (on an appropriate domain), and they recover the natural logarithm and exponential as $q \nearrow 1$. Their derivatives are $(\log_q(x))' = x^{-q}$ and $(\exp_q(x))' = \exp_q(x)^q$, respectively. The additive factorization property $\exp(x+y) = \exp(x)\exp(y)$ satisfied by the natural exponential no longer holds for the q-exponential: $\exp_q(x+y) \ne \exp_q(x)\exp_q(y) = \exp_q(x + y + (1-q)xy)$. Instead, we can construct another algebraic structure by introducing another operation called the q-product $\otimes_q$:

$$ x \otimes_q y = \big[x^{1-q} + y^{1-q} - 1\big]_+^{1/(1-q)}. \qquad (18) $$

With this product, the pseudoadditive factorization $\exp_q(x+y) = \exp_q(x) \otimes_q \exp_q(y)$ holds. Thus, the q-algebra captures rich nonlinear structures, and it is often used to extend the Shannon entropy to the Tsallis entropy [23]

$$ T_q(\boldsymbol{\pi}) = -\sum_{i=1}^{n}\pi_i^q\log_q(\pi_i). \qquad (19) $$

One can see that the Tsallis entropy has an equivalent power formulation $T_q(\boldsymbol{\pi}) = \sum_{i=1}^{n}\frac{\pi_i^q - \pi_i}{1-q}$, which means that it is often suitable for modeling heavy-tailed phenomena such as the power law. Although the q-logarithm and q-exponential introduced here may look arbitrary, they can be derived axiomatically by assuming the essential properties of the algebra (see Naudts [29]). For more physical insights, we refer readers to the literature [30].
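A minimal NumPy sketch of the q-logarithm and q-exponential in Equation (17), together with the q-Gibbs kernel of Definition 3 below, might look as follows; the function names are our own.

```python
import numpy as np

def log_q(x, q):
    """q-logarithm of Equation (17); defined for x > 0."""
    if q == 1.0:
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential of Equation (17); the inverse of log_q on its domain."""
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def gibbs_q(xi, lam, q):
    """q-Gibbs kernel k_q(xi) = exp_q(-xi / lam) of Definition 3."""
    return exp_q(-xi / lam, q)

# The q-Gibbs kernel truncates large distances: it vanishes for xi >= lam / (1 - q).
print(gibbs_q(np.array([0.5, 1.5, 3.0]), lam=1.0, q=0.5))
```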
Next, we introduce the q-exponential distribution. We introduce a simpler form for our purpose, whereas more general formulations of the q-exponential distribution have been introduced in the literature [22]. Given the form of the Gibbs kernel $k(\xi) := \exp(-\xi/\lambda)$, we define the q-Gibbs kernel as follows:

Definition 3 (q-Gibbs kernel). For $\xi \ge 0$, we define the q-Gibbs kernel as $k_q(\xi) := \exp_q(-\xi/\lambda)$ for a deformation parameter $q \in [0, 1]$ and a temperature parameter $\lambda \in \mathbb{R}_{>0}$.

If we take $\xi$ as the (centered) squared distance, then $k_q(\xi)$ represents the q-Gaussian distribution [22]. We illustrate the q-Gibbs kernel with different deformation parameters in Figure 1.

[Figure 1. Plots of the q-Gibbs kernels $k_q(\xi)$ for $q \in \{0, 0.25, 0.5, 0.75, 1\}$ ($\lambda = 1$).]

By definition, the support of the q-Gibbs kernel is $\mathrm{supp}(k_q) = \big[0, \tfrac{\lambda}{1-q}\big]$ for $q \in [0, 1)$ and $\mathrm{supp}(k_q) = \mathbb{R}_{\ge 0}$ for $q = 1$. This indicates that the q-Gibbs kernel ignores the effect of a too-large $\xi$ (i.e., too large a distance between two points); the threshold is smoothly controlled by the temperature parameter $\lambda$ and the deformation parameter $q$.
Finally, we derive an entropic regularizer that induces sparsity by using the q-Gibbs kernel. Given the stationary condition in Equation (15), we impose the following functional form on the dual map:

$$ \pi = \nabla\Omega^\star(\eta) = \exp_q\!\Big(\frac{\eta}{\lambda}\Big), \qquad (20) $$

where $(\pi, \eta) = (\Pi_{ij}^\star, -D_{ij} - \alpha_i - \beta_j)$. Equation (20) results in the factorization

$$ \Pi_{ij}^\star = \exp_q\!\Big(-\frac{D_{ij}}{\lambda}\Big) \otimes_q \exp_q\!\Big(-\frac{\alpha_i}{\lambda}\Big) \otimes_q \exp_q\!\Big(-\frac{\beta_j}{\lambda}\Big), \qquad (21) $$

and a sufficiently large input distance $D_{ij}$ drives $\Pi_{ij}$ to zero, though $\exp_q(-D_{ij}/\lambda) = 0$ does not immediately imply $\Pi_{ij}^\star = 0$ because the q-product $\otimes_q$ lacks an absorbing element. By solving Equation (20),

$$ \nabla\Omega(\pi) = \lambda\log_q(\pi), \qquad \Omega(\pi) = \frac{\lambda}{2-q}\big(\pi\log_q(\pi) - \pi\big). \qquad (22) $$

For completeness, the derivation is shown in Appendix A. Hence, we define the deformed q-entropy as follows:

Definition 4 (Deformed q-entropy). For $\boldsymbol{\pi} \in \triangle^{n-1}$, the deformed q-entropy is defined as

$$ H_q(\boldsymbol{\pi}) = -\frac{1}{2-q}\sum_{i=1}^{n}\big(\pi_i\log_q(\pi_i) - \pi_i\big). \qquad (23) $$

The deformed q-entropic regularizer for an element $\pi_i$ is $\Omega(\pi_i) = \frac{\lambda}{2-q}(\pi_i\log_q(\pi_i) - \pi_i)$.

The deformed q-entropy recovers the Shannon entropy in the limit $q \nearrow 1$: $H_1(\boldsymbol{\pi}) = -\sum_i(\pi_i\log(\pi_i) - \pi_i)$. In addition, the limit $q \searrow 0$ recovers the negative of the squared 2-norm: $H_0(\boldsymbol{\pi}) = -\frac{1}{2}\sum_i(\pi_i^2 - 2\pi_i) = -\frac{1}{2}\|\boldsymbol{\pi}\|_2^2 + 1$. Therefore, the deformed q-entropy interpolates between the Shannon entropy and the squared 2-norm. Hereafter, we consider the regularized OT with the deformed q-entropy

$$ T_{-\lambda H_q}(\mu, \nu) = \inf_{\Pi \in U(\mu,\nu)} \langle \mathbf{D}, \Pi \rangle - \lambda H_q(\Pi), \qquad (24) $$

by solving its dual counterpart. The deformed q-entropy differs from the Tsallis entropy $T_q$ (see Equation (19)) in that the Tsallis entropy and the deformed q-entropy are defined by the q-expectation $\langle\boldsymbol{\pi}^q, \cdot\rangle$ [34] and the usual expectation $\langle\boldsymbol{\pi}, \cdot\rangle$, respectively, while both are defined through the q-logarithm.
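Building on the previous sketch, the deformed q-entropic regularizer, its conjugate from Table 1 (cf. Equation (26)), and the dual map of Equation (20) can be sketched as follows; the helper names are our own, and `log_q`/`exp_q` are assumed from the earlier sketch.

```python
import numpy as np

def omega(pi, lam, q):
    """Deformed q-entropic regularizer Omega(pi) = lam/(2-q) (pi log_q(pi) - pi)."""
    return lam / (2.0 - q) * (pi * log_q(pi, q) - pi)

def omega_conj(eta, lam, q):
    """Conjugate Omega*(eta) = lam/(2-q) exp_q(eta/lam)^(2-q), cf. Equation (26)."""
    return lam / (2.0 - q) * exp_q(eta / lam, q) ** (2.0 - q)

def dual_map(eta, lam, q):
    """Dual map nabla Omega*(eta) = exp_q(eta/lam), cf. Equation (20)."""
    return exp_q(eta / lam, q)

# q -> 0 recovers a rectifier-like map (sparse plans); q -> 1 recovers the Gibbs kernel.
eta = np.linspace(-1.0, 0.5, 7)
print(dual_map(eta, lam=0.1, q=0.0))   # exact zeros for eta <= -lam
print(dual_map(eta, lam=0.1, q=1.0))   # strictly positive everywhere
```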

Remark 1. The primary reason we picked the deformed q-entropy $H_q$ to design the regularizer is its natural connection to the q-Gibbs kernel through the dual map, $\nabla(-\lambda H_q)^\star(\eta) = \exp_q(\eta/\lambda)$. When the Tsallis entropy $T_q$ is used, the dual map is

$$ \nabla(-\lambda T_q)^\star(\eta) = \frac{q^{1/(1-q)}}{\exp_q(-\eta/\lambda)}, \qquad (25) $$

which is not naturally connected to the q-Gibbs kernel. Muzellec et al. [35] proposed regularized OT with the Tsallis entropy, but they did not discuss its sparsity. As we show in Appendix D.1, the Tsallis entropy does not empirically induce sparsity.

In Figure 2, the deformed q-entropy is plotted for the one-dimensional simplex $\triangle^1$ with different deformation parameters. One can easily confirm that $H_q(\boldsymbol{\pi})$ is concave for $\boldsymbol{\pi} \in \mathbb{R}_{\ge 0}^n$, as illustrated in the figure.

[Figure 2. Plots of the deformed q-entropy $H_q(\pi)$ for $q \in \{0, 0.25, 0.5, 0.75, 1\}$. A constant term is ignored in the plots so that the end points are calibrated to zero.]

4. Optimization and Convergence Analysis

4.1. Optimization Algorithm

We occasionally write $\Omega = -\lambda H_q$ to simplify the notation in this section. By simple algebra, we confirm

$$ \Omega^\star(\eta) = \frac{\lambda}{2-q}\exp_q\!\Big(\frac{\eta}{\lambda}\Big)^{2-q}, \qquad (26) $$

which is convex because of the concavity of $H_q$. To solve Equation (24), we solve the dual

$$ T_{-\lambda H_q}(\mu, \nu) = \sup_{\boldsymbol{\alpha} \in \mathbb{R}^n, \boldsymbol{\beta} \in \mathbb{R}^m}\ \underbrace{-\langle \mathbf{a}, \boldsymbol{\alpha} \rangle - \langle \mathbf{b}, \boldsymbol{\beta} \rangle - \frac{\lambda}{2-q}\sum_{i,j}\exp_q\!\Big(-\frac{D_{ij} + \alpha_i + \beta_j}{\lambda}\Big)^{2-q}}_{:=-F(\mathbf{z})}, \qquad (27) $$

where $\mathbf{z} := (\boldsymbol{\alpha}, \boldsymbol{\beta})$ denotes the dual variables. As Equation (27) is an unconstrained optimization problem, many standard optimization solvers can be used to solve it; here, we use the BFGS method [24]. For the sake of the convergence analysis (Section 4.2), we optimize the convex $\ell_2$-regularized dual objective

$$ \text{minimize}\quad \widetilde{F}(\mathbf{z}) := \langle \mathbf{a}, \boldsymbol{\alpha} \rangle + \langle \mathbf{b}, \boldsymbol{\beta} \rangle + \sum_{i,j}\Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \frac{\kappa}{2}\|\mathbf{z}\|_2^2, \qquad (28) $$

where $\kappa > 0$ represents the $\ell_2$-regularization parameter. In practice, the $\ell_2$ regularization hardly affects the performance when $\kappa$ is sufficiently small. We can characterize the convergence rate by introducing a (small) $\ell_2$ regularization, which makes the objective strongly convex, whereas a convergence guarantee without a rate is still possible without the $\ell_2$ regularization [36].

We briefly summarize the algorithm in Algorithm 1, where $\mathbf{d}^{(k)}$, $\rho^{(k)}$, and $\mathbf{g}^{(k)} := \nabla\widetilde{F}(\mathbf{z}^{(k)})$ represent the $k$th update direction, $k$th step size, and gradient at the current variable $\mathbf{z}^{(k)}$, respectively.

$$ \mathbf{s}^{(k)} := \mathbf{z}^{(k+1)} - \mathbf{z}^{(k)} \quad\text{and}\quad \boldsymbol{\zeta}^{(k)} := \mathbf{g}^{(k+1)} - \mathbf{g}^{(k)} \qquad (29) $$

are the differences of the dual variables and gradients between the next and current steps, respectively. Furthermore, let $(\gamma, \gamma')$ be the tolerance parameters for the Wolfe conditions, i.e., update directions and step sizes satisfy the conditions

$$ \widetilde{F}(\mathbf{z}^{(k)} + \rho^{(k)}\mathbf{d}^{(k)}) \le \widetilde{F}(\mathbf{z}^{(k)}) + \gamma'\rho^{(k)}\mathbf{g}^{(k)\top}\mathbf{d}^{(k)}, \quad\text{(Armijo condition)} \qquad (30) $$
$$ \mathbf{g}^{(k+1)\top}\mathbf{d}^{(k)} \ge \gamma\,\mathbf{g}^{(k)\top}\mathbf{d}^{(k)}. \quad\text{(curvature condition)} \qquad (31) $$

Algorithm 1: BFGS algorithm for dual regularized OT

Input: initial point $\mathbf{z}^{(0)}$; tolerance parameter $0 < \gamma' < \tfrac{1}{2}$ for the Armijo condition; tolerance parameter $\gamma' < \gamma < 1$ for the curvature condition; initial Hessian estimate $\mathbf{B}^{(0)} = \mathbf{I}$

1: for $k = 0, \dots, K-1$ do
2:   $\mathbf{d}^{(k)} \leftarrow -[\mathbf{B}^{(k)}]^{-1}\mathbf{g}^{(k)}$  ▷ calculate update direction
3:   $\rho^{(k)} \leftarrow \mathrm{line\_search}(\mathbf{d}^{(k)}, \mathbf{g}^{(k)}, \gamma, \gamma')$  ▷ determine step size by line search
4:   $\mathbf{z}^{(k+1)} \leftarrow \mathbf{z}^{(k)} + \rho^{(k)}\mathbf{d}^{(k)}$  ▷ update dual variables
5:   $\mathbf{B}^{(k+1)} \leftarrow \mathbf{B}^{(k)} - \dfrac{\mathbf{B}^{(k)}\mathbf{s}^{(k)}\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}} + \dfrac{\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}$  ▷ update Hessian estimate
6: end for
7: return $(\widehat{\boldsymbol{\alpha}}, \widehat{\boldsymbol{\beta}}) \leftarrow \mathbf{z}^{(K)}$

After obtaining the dual solution $(\widehat{\boldsymbol{\alpha}}, \widehat{\boldsymbol{\beta}})$, the primal solution can be recovered from Equation (15).
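As a concrete illustration of this recovery, the $\ell_2$-regularized dual objective in Equation (28) and its gradient in Equation (34) can be handed to an off-the-shelf quasi-Newton solver; the experiments in Section 5 use SciPy's L-BFGS-B in this spirit. The sketch below reuses the `omega_conj` and `dual_map` helpers assumed above, and the parameter defaults are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qdot(a, b, D, lam=0.1, q=0.5, kappa=1e-6):
    """Solve the l2-regularized dual (28) and recover the primal plan via Equation (15)."""
    n, m = D.shape

    def objective(z):
        alpha, beta = z[:n], z[n:]
        eta = -D - alpha[:, None] - beta[None, :]       # eta_ij = -D_ij - alpha_i - beta_j
        f = a @ alpha + b @ beta + omega_conj(eta, lam, q).sum() + 0.5 * kappa * (z @ z)
        P = dual_map(eta, lam, q)                       # nabla Omega*(eta_ij)
        grad_alpha = a - P.sum(axis=1) + kappa * alpha  # cf. Equation (34)
        grad_beta = b - P.sum(axis=0) + kappa * beta
        return f, np.concatenate([grad_alpha, grad_beta])

    res = minimize(objective, np.zeros(n + m), jac=True, method='L-BFGS-B',
                   options={'gtol': 1e-6, 'maxiter': 10_000})
    alpha, beta = res.x[:n], res.x[n:]
    return dual_map(-D - alpha[:, None] - beta[None, :], lam, q)  # primal transport plan
```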

4.2. Convergence Analysis

We provide a convergence guarantee for Algorithm 1. A technical assumption is stated beforehand.

Assumption 1. Let $\mathbf{z}^\star$ be the global optimum of $\widetilde{F}$. For $\tau \in (0, 1)$, we define the set $\mathcal{Z}_\tau \subseteq \mathrm{ri}\,\mathrm{dom}(\widetilde{F})$ as

$$ \mathcal{Z}_\tau := \big\{\mathbf{z} \mid \nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) \le \tau \ \text{for all } i, j\big\}. \qquad (32) $$

Assume that $\mathbf{z}^{(K)}$ obtained by Algorithm 1 and $\mathbf{z}^\star$ are contained in $\mathcal{Z}_\tau$.

The dual map $\nabla\Omega^\star$ translates dual variables into primal variables, as in Equation (15). It is easy to confirm that $\mathcal{Z}_\tau$ is a closed convex set owing to the convexity of $\nabla\Omega^\star$. Assumption 1 essentially assumes that all elements of the primal matrix (of $\mathbf{z}^{(K)}$ and $\mathbf{z}^\star$) are strictly less than 1; this always holds for $\mathbf{z}^\star$ (unless $n = m = 1$) because of the strong duality. Moreover, this assumption is natural for $\mathbf{z}^{(K)}$ sufficiently close to the optimum $\mathbf{z}^\star$. The bound parameter $\tau$ is a key element for characterizing the convergence speed.

Theorem 1. Let $N := \max\{n, m\}$. Under Assumption 1, Algorithm 1 with the parameter choice $\kappa = 2N\tau^q\lambda^{-1}$ returns a point $\mathbf{z}^{(K)}$ satisfying

$$ \|\mathbf{g}^{(K)}\|_2 < \sqrt{\frac{16\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big)N\tau^q}{\lambda}}\; r^K, \qquad (33) $$

where $\widetilde{F}^\star := \inf_{\mathbf{z}}\widetilde{F}(\mathbf{z})$ represents the optimal value of the $\ell_2$-regularized dual objective and $0 < r < 1$ is an absolute constant independent of $(\lambda, \tau, q, N)$.

The proof is shown in Section 4.3. We conclude that a larger deformation parameter q yields better convergence because the coefficient in Equation (33) is $O(\tau^{q/2})$ with the base $\tau < 1$. Therefore, the deformation parameter introduces a new trade-off: $q \searrow 0$ yields a sparser solution but slows down the convergence, whereas $q \nearrow 1$ ameliorates the convergence while sacrificing sparsity. One may obtain the solution faster than with the squared 2-norm regularizer used in Blondel et al. [20], which corresponds to the case q = 0, by modulating the deformation parameter q.

In regularized OT, it is a common approach to use weaker regularization (i.e., a smaller λ) to obtain a solution that is sparser and closer to the unregularized solution; however, a smaller λ results in numerical instability and slow computation [37]. This can also be observed from Equation (33) because a smaller λ makes its upper bound considerably larger.
Subsequently, we compare the computational complexity of q-DOT with the BFGS method and of the Sinkhorn algorithm. Altschuler et al. [25] showed that the Sinkhorn algorithm satisfies the coupling constraints within an $\ell_1$ error of $\varepsilon$ in $O(N^2(\log N)\varepsilon^{-3})$ time, which is a sublinear convergence rate. In contrast, our convergence rate in Equation (33) is translated into the iteration complexity $K = O(\log(N\varepsilon^{-1}))$, where $\|\mathbf{g}^{(K)}\|_2 \le \varepsilon$. The gradient of $\widetilde{F}$ is

$$ \nabla\widetilde{F}(\mathbf{z}) = \begin{pmatrix} \vdots \\ a_i - \sum_{j=1}^{m}\nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \kappa\alpha_i \\ \vdots \\ b_j - \sum_{i=1}^{n}\nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \kappa\beta_j \\ \vdots \end{pmatrix}, \qquad (34) $$

and $\nabla\Omega^\star(\cdot)$ represents the mapping from the dual variables $(\alpha_i, \beta_j)$ to the primal transport matrix $\Pi_{ij}$ in Equation (15). Therefore, the gradient norm of $\widetilde{F}$ and the coupling constraint error are comparable when the $\ell_2$-regularization parameter $\kappa$ is sufficiently small. The overall computational complexity is $O(N^2\log(N\varepsilon^{-1}))$ because one step of Algorithm 1 runs in $O(N^2)$ time; this is a linear convergence rate. To confirm that one step of Algorithm 1 runs in $O(N^2)$ time, we note that the update direction can be computed in $O(N^2)$ time by using the Sherman–Morrison formula to invert $\mathbf{B}^{(k)}$. In addition, the Hessian estimate can be updated in $O(N^2)$ time because the update of $\mathbf{B}^{(k)}$ is a sum of rank-1 terms and the computation of its inverse only requires matrix–vector products of size $N$. Hence, Algorithm 1 exhibits better convergence in terms of the stopping criterion $\varepsilon$. The comparison is summarized in Table 2.

Table 2. Comparison of the computational complexity of the Sinkhorn algorithm and deformed q-optimal transport, where $N = \max\{n, m\}$.

| Sinkhorn | q-DOT |
|---|---|
| $O(N^2(\log N)\varepsilon^{-3})$ | $O(N^2\log(N\varepsilon^{-1}))$ |

4.3. Proofs

To prove Theorem 1, we leveraged several lemmas shown below. Lemma 2 is based on Powell [24] and Byrd et al. [36]. The missing proofs are provided in Appendix C.

Lemma 1. For the initial point $\mathbf{z}^{(0)}$ and the sequence $\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(K)}$ obtained by Algorithm 1, we define the following set and its bound:

$$ \mathcal{Z} := \mathrm{conv}\big\{\mathbf{z}^{(0)}, \mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(K)}\big\}, \qquad R := \sup_{\mathbf{z}\in\mathcal{Z}}\max_{i,j}\nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j), \qquad (35) $$

where $\mathrm{conv}(S)$ represents the convex hull of the set $S$. Then, $\widetilde{F} : \mathbb{R}^{n+m} \to \mathbb{R}$ is $M_1$-strongly convex and $M_2$-smooth over $\mathcal{Z}$, where $M_1 = \kappa$ and $M_2 \le \kappa + 2NR^q\lambda^{-1}$. Moreover, $\widetilde{F}$ is $M_2'$-smooth over $\mathcal{Z}_\tau$ (defined in Equation (32)), where $M_2' \le \kappa + 2N\tau^q\lambda^{-1}$.

Lemma 2. Let $\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(K)}$ be the sequence generated by Algorithm 1 given an initial point $\mathbf{z}^{(0)}$. In addition, let $c_1, c_2, c_3, c_4$, and $c_5$ be the constants

$$ c_1 := \frac{1-\gamma}{M_2}, \quad c_2 := \frac{n+m}{K} + M_2, \quad c_3 := \Big(\frac{K}{n+m}\Big)^{\frac{n+m}{K}}c_2^{\frac{n+m+K}{K}}, \quad c_4 := \frac{c_3}{1-\gamma}, \quad c_5 := \frac{2(1-\gamma')}{M_1}. \qquad (36) $$

Then,

$$ \widetilde{F}(\mathbf{z}^{(K)}) - \widetilde{F}^\star \le \Big(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\Big)^{K/2}\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big). \qquad (37) $$

Lemma 3. Let $c_1, c_2, c_3, c_4$, and $c_5$ be the same constants defined in Lemma 2. Then,

$$ \frac{\gamma'c_1M_1}{c_4^2c_5^2} > \frac{(1-\gamma)^3\gamma'e^{-2(n+m)/e}}{4(1-\gamma')^2}\Big(\frac{M_1}{M_2}\Big)^3. \qquad (38) $$

Proof of Theorem 1. Because $\widetilde{F}$ is differentiable and strongly convex, there exists an optimum $\mathbf{z}^\star$ such that $\mathbf{g}^\star := \nabla\widetilde{F}(\mathbf{z}^\star) = \mathbf{0}$; this implies $\|\mathbf{g}^{(K)}\|_2 = \|\mathbf{g}^{(K)} - \mathbf{g}^\star\|_2$.

By using Assumption 1 and Lemma 1, we obtain $\|\mathbf{g}^{(K)} - \mathbf{g}^\star\|_2 = \|\nabla\widetilde{F}(\mathbf{z}^{(K)}) - \nabla\widetilde{F}(\mathbf{z}^\star)\|_2 \le M_2'\|\mathbf{z}^{(K)} - \mathbf{z}^\star\|_2$. In addition, $\|\mathbf{z}^{(K)} - \mathbf{z}^\star\|_2^2 \le \frac{2}{M_1}\big(\widetilde{F}(\mathbf{z}^{(K)}) - \widetilde{F}^\star\big)$ as $\widetilde{F}$ is $M_1$-strongly convex over $\mathcal{Z}$ and the stationary condition $\nabla\widetilde{F}(\mathbf{z}^\star) = \mathbf{0}$ holds. We obtain the convergence bound by using Lemmas 2 and 3 as

$$
\begin{aligned}
\|\mathbf{g}^{(K)}\|_2 &= \|\mathbf{g}^{(K)} - \mathbf{g}^\star\|_2 \le M_2'\|\mathbf{z}^{(K)} - \mathbf{z}^\star\|_2 \le M_2'\sqrt{\frac{2\big(\widetilde{F}(\mathbf{z}^{(K)}) - \widetilde{F}^\star\big)}{M_1}} \\
&\le M_2'\sqrt{\frac{2\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big)}{M_1}\Big(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\Big)^{K/2}} \\
&< M_2'\sqrt{\frac{2\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big)}{M_1}\left(1 - \frac{(1-\gamma)^3\gamma'e^{-2(n+m)/e}}{8(1-\gamma')^2}\Big(\frac{M_1}{M_2}\Big)^3\right)^{K/2}} \\
&\le \Big(\kappa + \frac{2N\tau^q}{\lambda}\Big)\sqrt{\frac{2\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big)}{\kappa}\left(1 - \frac{C}{(1 + 2NR^q\lambda^{-1}\kappa^{-1})^3}\right)^{K/2}},
\end{aligned} \qquad (39)
$$

where we define $C := \frac{(1-\gamma)^3\gamma'e^{-2(n+m)/e}}{8(1-\gamma')^2}$ and Lemma 1 is used at the last inequality to replace $M_1$, $M_2$, and $M_2'$. We can immediately confirm $C \le \frac{1}{16}$ from $0 < \gamma' < \gamma < 1$, $\gamma' < \frac{1}{2}$, and $e^{-2(n+m)/e} < 1$. Finally, by choosing $\kappa = 2N\tau^q\lambda^{-1}$,

$$ \|\mathbf{g}^{(K)}\|_2 \le \sqrt{\frac{16\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big)N\tau^q}{\lambda}\left(1 - \frac{C}{(1 + (R/\tau)^q)^3}\right)^{K/2}} \le \sqrt{\frac{16\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big)N\tau^q}{\lambda}}\; r^K, \qquad (40) $$

where we use $(R/\tau)^q \ge 1$ (owing to $R \ge \tau$ by definition) and let $r := (1 - C/8)^{1/4}$, with $\sqrt[4]{127/128} \le r < 1$.

Remark 2. More precisely, Altschuler et al. [25] showed that the Sinkhorn algorithm converges in $O(N^2L^3(\log N)\varepsilon^{-3})$ time, where $L := \|\mathbf{D}\|_\infty$. For q-DOT, its computational complexity is not directly comparable to that of the Sinkhorn algorithm in terms of $L$; instead, the following analysis provides a qualitative comparison. First, the convergence rate of q-DOT in Equation (33) is translated into the iteration complexity $K = O(\log(N\varepsilon^{-1})/\log(1/r))$. The rate $r$ is introduced in the proof of Theorem 1 (see Equation (40)): $r \ge \big(1 - \frac{C}{(1+(R/\tau)^q)^3}\big)^{1/4}$. Then, by the Taylor expansion, we have a rough estimate $K \approx O(N^2R^{-3q}\log(N\varepsilon^{-1}))$, where $R$ is a bound on the possible primal variables defined in Equation (35). We cannot directly compare $R^{-q}$ and $L$; nevertheless, $R^{-q}$ and $L$ can be considered to be of the same magnitude given a reasonably sized domain $\mathcal{Z}$, noting that $\nabla\Omega(\pi) \approx O(\pi^{1-q})$. Hence, it is reasonable to suppose that both the Sinkhorn algorithm and q-DOT roughly converge in cubic time with respect to $L$.

5. Numerical Experiments
5.1. Sparsity

All the simulations described in this section were executed on a 2.7 GHz quad-core Intel® Core™ i7 processor. We used the following synthetic dataset: $(\mathbf{x}_i)_{i=1}^{n} \sim \mathcal{N}(\mathbf{1}_2, \mathbf{I}_2)$, $(\mathbf{y}_j)_{j=1}^{m} \sim \mathcal{N}(-\mathbf{1}_2, \mathbf{I}_2)$, and $n = m = 30$, where $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ represents the Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. For each of the unregularized OT, q-DOT, and the Sinkhorn algorithm, we computed the transport matrices. For q-DOT and the Sinkhorn algorithm, different regularization parameters $\lambda \in \{1\times10^{-2}, 1\times10^{-1}, 1\}$ were compared, and $\varepsilon = 1\times10^{-6}$ was used as the stopping criterion: q-DOT stopped once the gradient norm fell below $\varepsilon$, and the Sinkhorn algorithm stopped once the $\ell_1$ error of the coupling constraints fell below $\varepsilon$. We compared different deformation parameters $q \in \{0, 0.25, 0.5, 0.75\}$ and fixed the dual $\ell_2$-regularization parameter $\kappa = 1\times10^{-6}$ for q-DOT. The q-DOT with q = 0 corresponds to the regularized OT with the squared 2-norm proposed by Blondel et al. [20]. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT, we used the L-BFGS-B method (instead of the vanilla BFGS) provided by the SciPy package [39]. To determine zero entries in the transport matrix, we did not impose any positive threshold to disregard small values (as in Swanson et al. [6]) but regarded entries smaller than machine epsilon as zero.
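For reference, the zero-entry criterion described above can be written as a one-liner; the helper name is our own.

```python
import numpy as np

def sparsity(P):
    """Fraction of entries of a transport matrix treated as zero (below machine epsilon)."""
    return float(np.mean(P <= np.finfo(P.dtype).eps))
```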
The simulation results are shown in Table 3 and Figure 3. First, we qualitatively evaluated each method using Figure 3: q-DOT obtained transport matrices very similar to the unregularized OT solution, and the solution became slightly more blurred as q and λ increased. In contrast, the Sinkhorn algorithm output considerably blurred transport matrices. Furthermore, the Sinkhorn algorithm was numerically unstable with very small regularization such as λ = 0.01.

From Table 3, we further quantitatively observed this behavior. The transport matrices obtained by q-DOT were very sparse in most cases, and their sparsity was close to that of the unregularized OT. Furthermore, we observed a tendency for smaller q and λ to yield sparser solutions. Significantly, the Sinkhorn algorithm obtained completely dense matrices (sparsity = 0). Although the transport matrices of q-DOT with (q, λ) = (0.5, 1), (0.75, 1) appear somewhat similar to the Sinkhorn solutions in Figure 3, the former are much sparser. This suggests that a deformation parameter q slightly smaller than 1 is sufficient for q-DOT to output a sparse transport matrix.
[Figure 3. Comparison of transport matrices for the unregularized OT (Wasserstein), q-DOT with q ∈ {0, 0.25, 0.5, 0.75} and λ ∈ {0.01, 0.1, 1}, and the Sinkhorn algorithm (q = 1). Wasserstein represents the result of the unregularized OT. Sinkhorn (λ = 0.01) does not work well because of numerical instability.]
For the obtained cost values $\langle\mathbf{D}, \widehat{\Pi}\rangle$, we did not see a clear advantage of using a specific q and λ among the q-DOT results. Nevertheless, it is evident that q-DOT estimated the Wasserstein cost more accurately than the Sinkhorn algorithm, regardless of the q and λ used in this simulation.
Table 3. Comparison of the sparsity and cost with the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Sinkhorn (λ = 0.01) does not work well because of numerical instability.

| Method | Sparsity | Cost $\langle\mathbf{D}, \widehat{\Pi}\rangle$ |
|---|---|---|
| Wasserstein (unregularized) | 0.967 | 7.126 |
| q-DOT (q = 0.00, λ = 0.01) | 0.962 | 7.129 |
| q-DOT (q = 0.00, λ = 0.10) | 0.961 | 7.126 |
| q-DOT (q = 0.00, λ = 1.00) | 0.950 | 7.144 |
| q-DOT (q = 0.25, λ = 0.01) | 0.963 | 7.129 |
| q-DOT (q = 0.25, λ = 0.10) | 0.959 | 7.126 |
| q-DOT (q = 0.25, λ = 1.00) | 0.912 | 7.133 |
| q-DOT (q = 0.50, λ = 0.01) | 0.963 | 7.136 |
| q-DOT (q = 0.50, λ = 0.10) | 0.946 | 7.127 |
| q-DOT (q = 0.50, λ = 1.00) | 0.879 | 7.155 |
| q-DOT (q = 0.75, λ = 0.01) | 0.948 | 7.127 |
| q-DOT (q = 0.75, λ = 0.10) | 0.897 | 7.136 |
| q-DOT (q = 0.75, λ = 1.00) | 0.647 | 7.245 |
| Sinkhorn (λ = 0.01) | — | — |
| Sinkhorn (λ = 0.10) | 0.000 | 7.164 |
| Sinkhorn (λ = 1.00) | 0.000 | 7.788 |

5.2. Runtime Comparison

We compared the runtimes of q-DOT and the Sinkhorn algorithm using the same dataset as in Section 5.1, but with different dataset sizes: we chose $n = m \in \{100, 300, 500, 1000\}$. The parameter choices were the same as in Section 5.1, except that the regularization parameter was fixed to λ = 0.1. The result is shown in Figure 4; the larger deformation parameter q makes q-DOT converge faster when n = m = 100. When n = m ≥ 300, the difference between q = 0, q = 0.25, and q = 0.5 was not as evident. This may be partly because we fixed the parameter choice $\kappa = 1\times10^{-6}$ for all experiments, unlike the oracle parameter choice $\kappa = 2N\tau^q\lambda^{-1}$ (in Theorem 1), which depends on q. Nonetheless, q = 0.75 is clearly superior to the smaller values of q. From these observations, the trade-off between sparsity and computation speed induced by the deformation parameter q, which is theoretically established in Theorem 1, was also observed empirically.

[Figure 4. Runtime comparison of q-DOT (q ∈ {0, 0.25, 0.5, 0.75}) and the Sinkhorn algorithm (q = 1) for (a) n = m = 100, (b) n = m = 300, (c) n = m = 500, and (d) n = m = 1000. The error bars indicate the standard errors of 20 trials.]

5.3. Approximation of the 1-Wasserstein Distance

Finally, we compared the approximation errors of the 1-Wasserstein distance, $|\langle\mathbf{D}, \widehat{\Pi}\rangle - \langle\mathbf{D}, \Pi^\natural\rangle|$, of q-DOT and the Sinkhorn algorithm with different q and λ, where $\widehat{\Pi}$ represents the computed transport matrix and $\Pi^\natural \in \arg\min_{\Pi\in U(\mu,\nu)}\langle\mathbf{D},\Pi\rangle$ represents the LP solution. We used the same dataset and stopping criterion ε as described in Section 5.1. For the range of q, we used q ∈ {0.00, 0.25, 0.50, 0.75}. For the range of λ, we used λ ∈ {0.05, 0.1, 0.5}.

The result is shown in Figure 5. The difference was not significant when q was small, such as q ∈ {0.00, 0.25}. Once q became larger, such as q ∈ {0.50, 0.75}, the approximation error evidently worsened. The Sinkhorn algorithm always exhibited worse approximation errors than q-DOT with q in the range used in this simulation, regardless of λ. Formal guarantees for the 1-Wasserstein approximation error (such as in Altschuler et al. [25] and Weed [40]) will be considered in future work.

[Figure 5. Wasserstein approximation error of q-DOT and the Sinkhorn algorithm (q = 1) as a function of q, for λ ∈ {0.05, 0.1, 0.5}. The line shades indicate the standard errors of 20 trials.]

Author Contributions: Conceptualization, H.B.; methodology, H.B.; validation, H.B. and S.S.; formal
analysis, H.B. and S.S.; writing—original draft preparation, H.B.; writing—review and editing, H.B.
and S.S.; funding acquisition, H.B. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was supported by the Hakubi Project, Kyoto University, and JST ERATO
Grant JPMJER1903. The APC was covered by the Hakubi Project.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

BFGS Broyden–Fletcher–Goldfarb–Shanno
q-DOT Deformed q-optimal transport
L-BFGS Limited-memory BFGS
OT Optimal transport

Appendix A. Derivation of the Deformed q-Entropy

Given the functional relationship $\pi = \nabla\Omega^\star(\eta) = \exp_q(\eta/\lambda)$ in Equation (20), we derive the deformed q-entropy.

First, the derivative of the regularizer, $\nabla\Omega$, is simply the inverse of the dual map $\nabla\Omega^\star$ by Danskin's theorem [27]; hence, $\nabla\Omega(\pi) = \lambda\log_q(\pi)$. The (negative of the) deformed q-entropy is recovered by integrating $\nabla\Omega$:

$$ \Omega(\pi) = \lambda\int_0^\pi \log_q(p)\,\mathrm{d}p = \lambda\int_0^\pi \frac{p^{1-q}-1}{1-q}\,\mathrm{d}p = \frac{\lambda}{1-q}\left(\frac{\pi^{2-q}}{2-q} - \pi\right) = \frac{\lambda}{2-q}\left(\pi\,\frac{\pi^{1-q}-1}{1-q} - \pi\right) = \frac{\lambda}{2-q}\big(\pi\log_q(\pi) - \pi\big). \qquad (A1) $$

Appendix B. Additional Lemmas

Note again that we let $M_1, M_2 > 0$ be the strong-convexity and smoothness constants of $\widetilde{F}$ over $\mathcal{Z}$, $N := \max\{n, m\}$, and $\mathbf{z}^\star \in \arg\min_{\mathbf{z}\in\mathcal{Z}}\widetilde{F}(\mathbf{z})$.

Lemma A1. For all $k$,

$$ M_1\|\mathbf{s}^{(k)}\|_2^2 \le \boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} \le M_2\|\mathbf{s}^{(k)}\|_2^2. \qquad (A2) $$

In addition,

$$ \frac{\|\boldsymbol{\zeta}^{(k)}\|_2^2}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}} \le M_2. \qquad (A3) $$
Proof. Let $\bar{\mathbf{G}}^{(k)} := \int_0^1 \nabla^2\widetilde{F}(\mathbf{z}^{(k)} + t\mathbf{s}^{(k)})\,\mathrm{d}t$. Then, by the chain rule and the fundamental theorem of calculus,

$$ \bar{\mathbf{G}}^{(k)}\mathbf{s}^{(k)} = \int_0^1 \frac{\partial\nabla\widetilde{F}(\mathbf{z}^{(k)} + t\mathbf{s}^{(k)})}{\partial t}\,\mathrm{d}t = \nabla\widetilde{F}(\mathbf{z}^{(k)} + \mathbf{s}^{(k)}) - \nabla\widetilde{F}(\mathbf{z}^{(k)}) = \mathbf{g}^{(k+1)} - \mathbf{g}^{(k)} = \boldsymbol{\zeta}^{(k)}. \qquad (A4) $$

Because $\widetilde{F}$ is $M_1$-strongly convex and $M_2$-smooth (over $\mathcal{Z}$), we have

$$ M_1\|\mathbf{w}\|_2^2 \le \mathbf{w}^\top\big[\nabla^2\widetilde{F}(\mathbf{z})\big]\mathbf{w} \le M_2\|\mathbf{w}\|_2^2 \qquad (A5) $$

for all $\mathbf{z}\in\mathcal{Z}$ and $\mathbf{w}$. By choosing $\mathbf{z} = \mathbf{z}^{(k)} + t\mathbf{s}^{(k)}$ and $\mathbf{w} = \mathbf{s}^{(k)}$, we have

$$ M_1\|\mathbf{s}^{(k)}\|_2^2 \le \int_0^1 \mathbf{s}^{(k)\top}\big[\nabla^2\widetilde{F}(\mathbf{z}^{(k)} + t\mathbf{s}^{(k)})\big]\mathbf{s}^{(k)}\,\mathrm{d}t = \mathbf{s}^{(k)\top}\bar{\mathbf{G}}^{(k)}\mathbf{s}^{(k)} = \boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} \le M_2\|\mathbf{s}^{(k)}\|_2^2. \qquad (A6) $$

Note that $\mathbf{z}^{(k)} + t\mathbf{s}^{(k)} \in \mathcal{Z}$ follows from the definition of $\mathcal{Z}$ in Equation (35). Thus, the first statement is proven.

The second statement is proven as follows:

$$ \frac{\|\boldsymbol{\zeta}^{(k)}\|_2^2}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}} = \frac{\mathbf{s}^{(k)\top}\bar{\mathbf{G}}^{(k)2}\mathbf{s}^{(k)}}{\mathbf{s}^{(k)\top}\bar{\mathbf{G}}^{(k)}\mathbf{s}^{(k)}} = \frac{(\bar{\mathbf{G}}^{(k)1/2}\mathbf{s}^{(k)})^\top\bar{\mathbf{G}}^{(k)}(\bar{\mathbf{G}}^{(k)1/2}\mathbf{s}^{(k)})}{\|\bar{\mathbf{G}}^{(k)1/2}\mathbf{s}^{(k)}\|_2^2} = \int_0^1 \frac{(\mathbf{s}^{(k)\prime})^\top\big[\nabla^2\widetilde{F}(\mathbf{z}^{(k)} + t\mathbf{s}^{(k)})\big](\mathbf{s}^{(k)\prime})}{\|\mathbf{s}^{(k)\prime}\|_2^2}\,\mathrm{d}t \le M_2, \qquad (A7) $$

where $\mathbf{s}^{(k)\prime} := \bar{\mathbf{G}}^{(k)1/2}\mathbf{s}^{(k)}$.

Lemma A2. For all $k$,

$$ \frac{M_1}{2}\|\mathbf{z}^{(k)} - \mathbf{z}^\star\|_2 \le \|\mathbf{g}^{(k)}\|_2. \qquad (A8) $$

Proof. Because $\widetilde{F}$ is $M_1$-strongly convex over $\mathcal{Z}$,

$$ \frac{M_1}{2}\|\mathbf{z}^{(k)} - \mathbf{z}^\star\|_2^2 \le \widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}(\mathbf{z}^\star) \le \big\langle\nabla\widetilde{F}(\mathbf{z}^{(k)}), \mathbf{z}^{(k)} - \mathbf{z}^\star\big\rangle \le \|\mathbf{g}^{(k)}\|_2\|\mathbf{z}^{(k)} - \mathbf{z}^\star\|_2, \qquad (A9) $$

where the first inequality follows from strong convexity and the optimality of $\mathbf{z}^\star$ (i.e., $\nabla\widetilde{F}(\mathbf{z}^\star) = \mathbf{0}$), the second from convexity, and the third from the Cauchy–Schwarz inequality.

Lemma A3. The following equations hold:

$$ \det(\mathbf{B}^{(K)}) \le \Big(\frac{c_2K}{n+m}\Big)^{n+m}, \qquad (A10) $$
$$ \prod_{k=0}^{K-1}\frac{\|\mathbf{B}^{(k)}\mathbf{s}^{(k)}\|_2^2}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}} \le c_2^K, \qquad (A11) $$
$$ \frac{\det(\mathbf{B}^{(K)})}{\det(\mathbf{B}^{(0)})} = \prod_{k=0}^{K-1}\frac{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}}, \qquad (A12) $$

where $c_2 := \frac{n+m}{K} + M_2$ is defined in Lemma 2.

Proof. To prove Equation (A10), we use the linearity of the trace and $\mathrm{tr}(\mathbf{b}\mathbf{a}^\top) = \mathbf{a}^\top\mathbf{b}$ to evaluate $\mathrm{tr}(\mathbf{B}^{(k+1)})$ as follows:

$$
\begin{aligned}
\mathrm{tr}(\mathbf{B}^{(k+1)}) &= \mathrm{tr}\!\left(\mathbf{B}^{(k)} - \frac{\mathbf{B}^{(k)}\mathbf{s}^{(k)}\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}} + \frac{\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}\right) \\
&= \mathrm{tr}(\mathbf{B}^{(k)}) - \underbrace{\mathrm{tr}\!\left(\frac{\mathbf{B}^{(k)}\mathbf{s}^{(k)}\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}}\right)}_{\ge 0} + \mathrm{tr}\!\left(\frac{\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}\right) \\
&\le \mathrm{tr}(\mathbf{B}^{(k)}) + \frac{\|\boldsymbol{\zeta}^{(k)}\|_2^2}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}} \le \mathrm{tr}(\mathbf{B}^{(0)}) + \sum_{j=0}^{k}\frac{\|\boldsymbol{\zeta}^{(j)}\|_2^2}{\boldsymbol{\zeta}^{(j)\top}\mathbf{s}^{(j)}} \le \mathrm{tr}(\mathbf{B}^{(0)}) + (k+1)M_2,
\end{aligned} \qquad (A13)
$$

where Lemma A1 is used at the last inequality. Note that the trace is the sum of the eigenvalues, whereas the determinant is the product of the eigenvalues. Then, we can use the AM–GM inequality to translate the determinant into the trace as follows:

$$ \det(\mathbf{B}^{(k+1)}) \le \left(\frac{1}{n+m}\mathrm{tr}(\mathbf{B}^{(k+1)})\right)^{n+m} \le \left(\frac{\mathrm{tr}(\mathbf{B}^{(0)}) + M_2(k+1)}{n+m}\right)^{n+m}. \qquad (A14) $$

Hence, by substituting $k = K-1$ and $\mathrm{tr}(\mathbf{B}^{(0)}) = n+m$, Equation (A10) is proven.

To prove Equation (A11), we evaluate $\mathrm{tr}(\mathbf{B}^{(k+1)})$ in a way similar to Equation (A13). From Lemma A1,

$$
\begin{aligned}
0 \le \mathrm{tr}(\mathbf{B}^{(k+1)}) &= \mathrm{tr}(\mathbf{B}^{(k)}) - \frac{\|\mathbf{B}^{(k)}\mathbf{s}^{(k)}\|_2^2}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}} + \frac{\|\boldsymbol{\zeta}^{(k)}\|_2^2}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}} \\
&= \mathrm{tr}(\mathbf{B}^{(0)}) - \sum_{j=0}^{k}\frac{\|\mathbf{B}^{(j)}\mathbf{s}^{(j)}\|_2^2}{\mathbf{s}^{(j)\top}\mathbf{B}^{(j)}\mathbf{s}^{(j)}} + \sum_{j=0}^{k}\frac{\|\boldsymbol{\zeta}^{(j)}\|_2^2}{\boldsymbol{\zeta}^{(j)\top}\mathbf{s}^{(j)}} \\
&\le \mathrm{tr}(\mathbf{B}^{(0)}) - \sum_{j=0}^{k}\frac{\|\mathbf{B}^{(j)}\mathbf{s}^{(j)}\|_2^2}{\mathbf{s}^{(j)\top}\mathbf{B}^{(j)}\mathbf{s}^{(j)}} + (k+1)M_2.
\end{aligned} \qquad (A15)
$$

By the AM–GM inequality,

$$ \prod_{j=0}^{k}\frac{\|\mathbf{B}^{(j)}\mathbf{s}^{(j)}\|_2^2}{\mathbf{s}^{(j)\top}\mathbf{B}^{(j)}\mathbf{s}^{(j)}} \le \left(\frac{1}{k+1}\sum_{j=0}^{k}\frac{\|\mathbf{B}^{(j)}\mathbf{s}^{(j)}\|_2^2}{\mathbf{s}^{(j)\top}\mathbf{B}^{(j)}\mathbf{s}^{(j)}}\right)^{k+1}. \qquad (A16) $$

Hence, by substituting $k = K-1$ and $\mathrm{tr}(\mathbf{B}^{(0)}) = n+m$, Equation (A11) is proven.

To prove Equation (A12), we use the matrix determinant lemma to expand $\det(\mathbf{B}^{(k+1)})$ as follows:

$$ \det(\mathbf{B}^{(k+1)}) = \left\{1 - \frac{1}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}}\cdot\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\left(\mathbf{B}^{(k)} + \frac{\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}\right)^{-1}\mathbf{B}^{(k)}\mathbf{s}^{(k)}\right\}\cdot\det\!\left(\mathbf{B}^{(k)} + \frac{\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}\right). \qquad (A17) $$

Further, by the Sherman–Morrison formula, we have

$$ \left(\mathbf{B}^{(k)} + \frac{\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}\right)^{-1} = \mathbf{B}^{(k)-1} - \frac{\mathbf{B}^{(k)-1}\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}\mathbf{B}^{(k)-1}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} + \boldsymbol{\zeta}^{(k)\top}\mathbf{B}^{(k)-1}\boldsymbol{\zeta}^{(k)}}. \qquad (A18) $$

By plugging Equation (A18) into Equation (A17), we have

$$
\begin{aligned}
\det(\mathbf{B}^{(k+1)}) &= \frac{(\mathbf{s}^{(k)\top}\boldsymbol{\zeta}^{(k)})^2}{(\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)})(\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} + \boldsymbol{\zeta}^{(k)\top}\mathbf{B}^{(k)-1}\boldsymbol{\zeta}^{(k)})}\det\!\left(\mathbf{B}^{(k)} + \frac{\boldsymbol{\zeta}^{(k)}\boldsymbol{\zeta}^{(k)\top}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}\right) \\
&= \frac{(\mathbf{s}^{(k)\top}\boldsymbol{\zeta}^{(k)})^2}{(\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)})(\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} + \boldsymbol{\zeta}^{(k)\top}\mathbf{B}^{(k)-1}\boldsymbol{\zeta}^{(k)})}\left(1 + \frac{\boldsymbol{\zeta}^{(k)\top}\mathbf{B}^{(k)-1}\boldsymbol{\zeta}^{(k)}}{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}\right)\det(\mathbf{B}^{(k)}) \\
&= \det(\mathbf{B}^{(k)})\,\frac{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}},
\end{aligned} \qquad (A19)
$$

where the matrix determinant lemma is invoked again at the second identity. Recursively applying Equation (A19) with $\det(\mathbf{B}^{(0)}) = 1$, we obtain Equation (A12).

Lemma A4. For all $k$, $\|\mathbf{s}^{(k)}\|_2 \le c_5\|\mathbf{g}^{(k)}\|_2\cos\theta_k$, where $\theta_k$ is the angle between $\mathbf{s}^{(k)}$ and $-\mathbf{g}^{(k)}$, and $c_5 := \frac{2(1-\gamma')}{M_1}$ is defined in Lemma 2.

Proof. By the Armijo condition (30), we have

$$ \widetilde{F}(\mathbf{z}^{(k+1)}) - \widetilde{F}(\mathbf{z}^{(k)}) \le \gamma'\rho^{(k)}\mathbf{g}^{(k)\top}\mathbf{d}^{(k)} = \gamma'\mathbf{g}^{(k)\top}\mathbf{s}^{(k)}. \qquad (A20) $$

Additionally, as $\widetilde{F}$ is $M_1$-strongly convex over $\mathcal{Z}$, it holds that $\widetilde{F}(\mathbf{z}^{(k+1)}) - \widetilde{F}(\mathbf{z}^{(k)}) \ge \mathbf{s}^{(k)\top}\mathbf{g}^{(k)} + \frac{1}{2}M_1\|\mathbf{s}^{(k)}\|_2^2$. Hence,

$$
\begin{aligned}
\mathbf{s}^{(k)\top}\mathbf{g}^{(k)} + \frac{1}{2}M_1\|\mathbf{s}^{(k)}\|_2^2 \le \gamma'\mathbf{g}^{(k)\top}\mathbf{s}^{(k)}
&\;\Rightarrow\; (1-\gamma')\big(-\mathbf{s}^{(k)\top}\mathbf{g}^{(k)}\big) \ge \frac{1}{2}M_1\|\mathbf{s}^{(k)}\|_2^2 \\
&\;\Rightarrow\; \|\mathbf{s}^{(k)}\|_2 \le \underbrace{\frac{2(1-\gamma')}{M_1}}_{=c_5}\cdot\underbrace{\frac{-\mathbf{s}^{(k)\top}\mathbf{g}^{(k)}}{\|\mathbf{s}^{(k)}\|_2\|\mathbf{g}^{(k)}\|_2}}_{=\cos\theta_k}\,\|\mathbf{g}^{(k)}\|_2,
\end{aligned} \qquad (A21)
$$

which is the desired inequality.

Lemma A5. For each $k$, let $\theta_k$ be the angle between $\mathbf{s}^{(k)}$ and $-\mathbf{g}^{(k)}$. Then,

$$ \prod_{k=0}^{K-1}\Big(1 - \frac{\gamma'c_1M_1\cos^2\theta_k}{2}\Big) \le \Big(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\Big)^{K/2}, \qquad (A22) $$

where $c_1$, $c_4$, and $c_5$ are defined in Lemma 2.

Proof. By multiplying each side of Equations (A10)–(A12), we have

$$ \prod_{k=0}^{K-1}\frac{\|\mathbf{B}^{(k)}\mathbf{s}^{(k)}\|_2^2}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}}\cdot\frac{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}} \le c_3^K, \qquad (A23) $$

where $c_3 := \big(\frac{K}{n+m}\big)^{\frac{n+m}{K}}c_2^{\frac{n+m+K}{K}}$ is defined in Lemma 2. By using $\mathbf{B}^{(k)}\mathbf{s}^{(k)} = -\rho^{(k)}\mathbf{g}^{(k)}$ and $\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} \ge -(1-\gamma)\mathbf{g}^{(k)\top}\mathbf{s}^{(k)}$ (shown in Equation (A33)),

$$ \prod_{k=0}^{K-1}\frac{\|\mathbf{B}^{(k)}\mathbf{s}^{(k)}\|_2^2}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}}\cdot\frac{\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}{\mathbf{s}^{(k)\top}\mathbf{B}^{(k)}\mathbf{s}^{(k)}} = \prod_{k=0}^{K-1}\frac{\|\mathbf{g}^{(k)}\|_2^2\cdot\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)}}{\big(-\mathbf{s}^{(k)\top}\mathbf{g}^{(k)}\big)^2} \ge (1-\gamma)^K\cdot\prod_{k=0}^{K-1}\frac{\|\mathbf{g}^{(k)}\|_2^2}{-\mathbf{s}^{(k)\top}\mathbf{g}^{(k)}}. \qquad (A24) $$

Hence,

$$ \prod_{k=0}^{K-1}\frac{\|\mathbf{g}^{(k)}\|_2}{\|\mathbf{s}^{(k)}\|_2\cos\theta_k} \le \Big(\frac{c_3}{1-\gamma}\Big)^K = c_4^K. \qquad (A25) $$

By Lemma A4, we can confirm

$$ \prod_{k=0}^{K-1}\cos^2\theta_k \ge \prod_{k=0}^{K-1}\frac{1}{c_4}\,\frac{\|\mathbf{g}^{(k)}\|_2\cos\theta_k}{\|\mathbf{s}^{(k)}\|_2} \ge \Big(\frac{1}{c_4c_5}\Big)^K. \qquad (A26) $$

Let $\widehat{K}$ be the number of indices $k = 0, 1, \dots, K-1$ such that $\cos\theta_k \le \frac{1}{c_4c_5}$; then

$$ \Big(\frac{1}{c_4c_5}\Big)^K \le \prod_{k=0}^{K-1}\cos^2\theta_k \le \Big(\frac{1}{c_4c_5}\Big)^{2\widehat{K}}, \qquad (A27) $$

implying that $\widehat{K}$ is at most $\frac{K}{2}$ (note that $\frac{1}{c_4c_5} \le 1$ by Equation (A26)). Therefore,

$$ \prod_{k=0}^{K-1}\Big(1 - \frac{\gamma'c_1M_1\cos^2\theta_k}{2}\Big) \le \Big(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\Big)^{K/2}. \qquad (A28) $$

Appendix C. Deferred Proofs

Appendix C.1. Proof of Lemma 1

Proof. It is easy to confirm $M_1 = \kappa$ because $\widetilde{F}$ is the sum of $F$ (convex) and $\frac{\kappa}{2}\|\mathbf{z}\|_2^2$.

Because $\widetilde{F}$ is twice differentiable and $\mathcal{Z}$ is a closed convex set, we evaluate the smoothness parameter $M_2$ (over $\mathcal{Z}$) by the eigenvalues of $\nabla^2\widetilde{F}(\mathbf{z})$. We begin by evaluating the eigenvalues of $\nabla^2F(\mathbf{z})$ and then evaluate the eigenvalues of $\nabla^2\widetilde{F}(\mathbf{z})$ via $\nabla^2\widetilde{F}(\mathbf{z}) = \nabla^2F(\mathbf{z}) + \kappa\mathbf{I}$. Let $\mathbf{P}(\mathbf{z}) \in \mathbb{R}^{n\times m}$ be a matrix such that $P_{ij}(\mathbf{z}) := \nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j)$. Here, $P_{ij}(\mathbf{z})$ is the primal variable corresponding to the dual variables $(\alpha_i, \beta_j)$ (see Equation (15)). The gradient of $F$ is

$$ \nabla F(\mathbf{z}) = \begin{pmatrix} \vdots \\ a_i - \sum_{j=1}^{m}\nabla\Omega^\star(-D_{ij}-\alpha_i-\beta_j) \\ \vdots \\ b_j - \sum_{i=1}^{n}\nabla\Omega^\star(-D_{ij}-\alpha_i-\beta_j) \\ \vdots \end{pmatrix} = \begin{pmatrix} \vdots \\ a_i - \sum_{j=1}^{m}P_{ij}(\mathbf{z}) \\ \vdots \\ b_j - \sum_{i=1}^{n}P_{ij}(\mathbf{z}) \\ \vdots \end{pmatrix}, \qquad (A29) $$

and the Hessian of $F$ is

$$ \nabla^2F(\mathbf{z}) = \frac{1}{\lambda}\cdot\underbrace{\begin{bmatrix} \mathrm{diag}\big(\sum_j P_{ij}(\mathbf{z})^q\big) & \mathbf{P}(\mathbf{z})^{q} \\ (\mathbf{P}(\mathbf{z})^{q})^\top & \mathrm{diag}\big(\sum_i P_{ij}(\mathbf{z})^q\big) \end{bmatrix}}_{:=\mathbf{H}}, \qquad (A30) $$

where $\mathbf{P}(\mathbf{z})^q$ is the element-wise power of $\mathbf{P}(\mathbf{z})$. Then, by invoking the Gershgorin circle theorem (Theorem 7.2.1 of [41]), the eigenvalues of $\mathbf{H}$ can be upper bounded by

$$ \max_{i,j}\Big\{\underbrace{\textstyle\sum_j P_{ij}(\mathbf{z})^q}_{\text{center of $i$-th disc}} + \underbrace{[\mathbf{P}(\mathbf{z})^q\mathbf{1}_m]_i}_{\text{radius of $i$-th disc}},\ \textstyle\sum_i P_{ij}(\mathbf{z})^q + [(\mathbf{P}(\mathbf{z})^q)^\top\mathbf{1}_n]_j\Big\} \le \max\Big\{2\textstyle\sum_{j=1}^m P_{ij}(\mathbf{z})^q,\ 2\textstyle\sum_{i=1}^n P_{ij}(\mathbf{z})^q\Big\} \le 2NR^q, \qquad (A31) $$

where we use $0 \le P_{ij}(\mathbf{z}) \le R$ for all $i$, $j$, and $\mathbf{z}\in\mathcal{Z}$ at the last inequality. Hence, $M_2 \le \kappa + \frac{2NR^q}{\lambda}$.

$M_2' \le \kappa + \frac{2N\tau^q}{\lambda}$ is confirmed by noting that $0 \le P_{ij}(\mathbf{z}) \le \tau$ for all $i$, $j$, and $\mathbf{z}\in\mathcal{Z}_\tau$ and that $\mathcal{Z}_\tau$ is a closed convex set.
Appendix C.2. Proof of Lemma 2

Proof. First, we evaluate the ratio between $\widetilde{F}(\mathbf{z}^{(k+1)}) - \widetilde{F}^\star$ and $\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}^\star$ for $k = 0, 1, 2, \dots, K-1$. Let $\theta_k$ be the angle between the vectors $\mathbf{s}^{(k)}$ and $-\mathbf{g}^{(k)}$. By the Armijo condition (Equation (30)), the difference $\widetilde{F}(\mathbf{z}^{(k+1)}) - \widetilde{F}(\mathbf{z}^{(k)})$ can be evaluated as follows:

$$ \widetilde{F}(\mathbf{z}^{(k+1)}) - \widetilde{F}(\mathbf{z}^{(k)}) \le \gamma'\rho^{(k)}\mathbf{g}^{(k)\top}\mathbf{d}^{(k)} = \gamma'\mathbf{g}^{(k)\top}(\mathbf{z}^{(k+1)} - \mathbf{z}^{(k)}) = \gamma'\mathbf{g}^{(k)\top}\mathbf{s}^{(k)} = \gamma'\big(-\|\mathbf{s}^{(k)}\|_2\|\mathbf{g}^{(k)}\|_2\cos\theta_k\big). \qquad (A32) $$

In addition, by the curvature condition (Equation (31)),

$$ \boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} = \underbrace{\mathbf{g}^{(k+1)\top}\mathbf{s}^{(k)}}_{=\rho^{(k)}\mathbf{g}^{(k+1)\top}\mathbf{d}^{(k)}\ \ge\ \rho^{(k)}\gamma\,\mathbf{g}^{(k)\top}\mathbf{d}^{(k)}} - \mathbf{g}^{(k)\top}\mathbf{s}^{(k)} \ge \gamma\,\mathbf{g}^{(k)\top}\mathbf{s}^{(k)} - \mathbf{g}^{(k)\top}\mathbf{s}^{(k)} = -(1-\gamma)\mathbf{g}^{(k)\top}\mathbf{s}^{(k)}, \qquad (A33) $$

which, together with Lemma A1, implies $\|\mathbf{s}^{(k)}\|_2^2 \ge \frac{1}{M_2}\boldsymbol{\zeta}^{(k)\top}\mathbf{s}^{(k)} \ge -\frac{1-\gamma}{M_2}\mathbf{g}^{(k)\top}\mathbf{s}^{(k)} = \frac{1-\gamma}{M_2}\|\mathbf{s}^{(k)}\|_2\|\mathbf{g}^{(k)}\|_2\cos\theta_k$. Hence, we have

$$ \|\mathbf{s}^{(k)}\|_2 \ge c_1\|\mathbf{g}^{(k)}\|_2\cos\theta_k, \qquad (A34) $$

where $c_1 := \frac{1-\gamma}{M_2}$. Then,

$$
\begin{aligned}
\widetilde{F}(\mathbf{z}^{(k+1)}) - \widetilde{F}^\star &\le \big(\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}^\star\big) + \gamma'\big(-\|\mathbf{s}^{(k)}\|_2\|\mathbf{g}^{(k)}\|_2\cos\theta_k\big) \\
&\le \big(\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}^\star\big) - \gamma'c_1\|\mathbf{g}^{(k)}\|_2^2\cos^2\theta_k \\
&\le \big(\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}^\star\big) - \gamma'c_1(M_1/2)\|\mathbf{g}^{(k)}\|_2\|\mathbf{z}^{(k)} - \mathbf{z}^\star\|_2\cos^2\theta_k \\
&\le \big(\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}^\star\big) - \gamma'c_1(M_1/2)\cos^2\theta_k\big(\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}^\star\big) \\
&= \big(1 - \gamma'c_1M_1\cos^2\theta_k/2\big)\big(\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}^\star\big),
\end{aligned} \qquad (A35)
$$

where Equation (A32) is used at the first inequality; Equation (A34) at the second; Lemma A2 at the third; and a consequence of convexity, $\widetilde{F}(\mathbf{z}^{(k)}) - \widetilde{F}(\mathbf{z}^\star) \le \langle\mathbf{g}^{(k)}, \mathbf{z}^{(k)} - \mathbf{z}^\star\rangle \le \|\mathbf{g}^{(k)}\|_2\|\mathbf{z}^{(k)} - \mathbf{z}^\star\|_2$, at the fourth. Next, recursively invoking Equation (A35), we obtain

$$ \widetilde{F}(\mathbf{z}^{(K)}) - \widetilde{F}^\star \le \left\{\prod_{k=0}^{K-1}\Big(1 - \frac{\gamma'c_1M_1\cos^2\theta_k}{2}\Big)\right\}\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big) \le \Big(1 - \frac{\gamma'c_1M_1}{2c_4^2c_5^2}\Big)^{K/2}\big(\widetilde{F}(\mathbf{z}^{(0)}) - \widetilde{F}^\star\big), \qquad (A36) $$

which is the desired bound. The last inequality is due to Lemma A5.

Appendix C.3. Proof of Lemma 3

Proof. By substituting the definitions of the constants $c_1, c_2, c_3, c_4$, and $c_5$,

$$
\begin{aligned}
\frac{\gamma'c_1M_1}{c_4^2c_5^2} &= \frac{\gamma'\cdot\frac{1-\gamma}{M_2}\cdot M_1}{\big(\frac{c_3}{1-\gamma}\big)^2\big(\frac{2(1-\gamma')}{M_1}\big)^2} = \frac{(1-\gamma)^3M_1^3\gamma'}{4(1-\gamma')^2c_3^2M_2} \\
&= \frac{M_1^3(1-\gamma)^3\gamma'}{4M_2(1-\gamma')^2}\cdot(n+m)^{\frac{2(n+m)}{K}}\cdot\left[\Big(\frac{n+m}{K}+M_2\Big)\cdot K^{\frac{n+m}{n+m+K}}\right]^{-\frac{2(n+m+K)}{K}} \\
&> \frac{M_1^3(1-\gamma)^3\gamma'}{4M_2(1-\gamma')^2}\cdot 1\cdot M_2^{-2}K^{-\frac{2(n+m)}{K}} \\
&\ge \frac{(1-\gamma)^3\gamma'e^{-2(n+m)/e}}{4(1-\gamma')^2}\Big(\frac{M_1}{M_2}\Big)^3,
\end{aligned} \qquad (A37)
$$

where, at the first inequality, we invoke $(n+m)^{\frac{2(n+m)}{K}} > 1$ and

$$ \left[\Big(\frac{n+m}{K}+M_2\Big)\cdot K^{\frac{n+m}{n+m+K}}\right]^{-\frac{2(n+m+K)}{K}} \ge \Big(M_2 K^{\frac{n+m}{n+m+K}}\Big)^{-\frac{2(n+m+K)}{K}} = M_2^{-\frac{2(n+m+K)}{K}}K^{-\frac{2(n+m)}{K}} \ge M_2^{-2}K^{-\frac{2(n+m)}{K}}, \qquad (A38) $$

and we use $K^{-\frac{2(n+m)}{K}} \ge e^{-\frac{2(n+m)}{e}}$ for all $K$ at the second inequality. Hence, the desired inequality is proven.

Appendix D. Additional Experiments

Appendix D.1. Comparison with the Tsallis Entropy

In this study, we used the deformed q-entropy instead of the Tsallis entropy [23] as the sparse regularizer. Here, we briefly and empirically analyze what happens if the Tsallis entropy is used instead. We compare the dual optimization objective in Definition 2 under the deformed q-entropy and under the Tsallis entropy. We use the following convex regularizer formed by the Tsallis entropy:

$$ \Omega(\boldsymbol{\pi}) = \lambda\sum_{i=1}^{n}\pi_i^q\log_q(\pi_i). \qquad (A39) $$

The simulations in this section were executed on a 2.7 GHz quad-core Intel® Core™ i7 processor. We used the following synthetic dataset: $(\mathbf{x}_i)_{i=1}^{n} \sim \mathcal{N}(\mathbf{1}_2, \mathbf{I}_2)$, $(\mathbf{y}_j)_{j=1}^{m} \sim \mathcal{N}(-\mathbf{1}_2, \mathbf{I}_2)$, and $n = m = 100$. For q-DOT and Tsallis-regularized OT, different regularization parameters $\lambda \in \{0.5, 1\}$ were compared, and $\varepsilon = 1\times10^{-6}$ was used as the stopping criterion on the gradient norm. The range of regularization parameters differs from that in Section 5.1 because Tsallis-regularized OT does not converge with too-small regularization parameters such as λ = 0.01. We compared different deformation parameters q ∈ {0, 0.25, 0.5, 0.75}. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT and Tsallis-regularized OT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we regarded entries smaller than machine epsilon as zero.

Table A1. Comparison of the sparsity and absolute error on the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Abs. error indicates the absolute error of the computed cost with respect to the 1-Wasserstein distance. Tsallis-regularized OT with q = 0.00 does not work due to numerical instability.

| Setting | Sparsity (q-DOT) | Abs. Error (q-DOT) | Sparsity (Tsallis) | Abs. Error (Tsallis) |
|---|---|---|---|---|
| q = 0.00, λ = 0.50 | 0.984 | 0.001 | — | — |
| q = 0.00, λ = 1.00 | 0.981 | 0.011 | — | — |
| q = 0.25, λ = 0.50 | 0.977 | 0.008 | 0.000 | 3.362 |
| q = 0.25, λ = 1.00 | 0.973 | 0.010 | 0.000 | 3.388 |
| q = 0.50, λ = 0.50 | 0.959 | 0.015 | 0.000 | 3.153 |
| q = 0.50, λ = 1.00 | 0.944 | 0.022 | 0.000 | 3.283 |
| q = 0.75, λ = 0.50 | 0.861 | 0.052 | 0.000 | 1.962 |
| q = 0.75, λ = 1.00 | 0.776 | 0.099 | 0.000 | 2.582 |

As can be seen from the results in Table A1, the Tsallis entropic regularizer neither
induces sparsity nor achieves a better approximation of the 1-Wasserstein distance than
the deformed q-entropy. Note that the Tsallis entropy induces the dual map $\nabla\Omega^\star(\eta) = q^{1/(1-q)}/\exp_q(-\eta/\lambda)$ shown in Equation (25), which has dense support for q > 0 and becomes
the source of dense transport matrices. This verifies that the design of the regularizer
is important for regularized optimal transport.

Appendix D.2. Hyperparameter Sensitivity


In this section, we summarize more comprehensive experimental results of q-DOT
and the Sinkhorn algorithm to show the performance dependence on the hyperparameters q
and λ. Subsequently, we describe experiments to show the sparsity of transport matrices,
the absolute error of the computed costs with respect to the 1-Wasserstein distance, and the
runtime on differently sized datasets.

The simulations in this section were executed on a 2.7 GHz Intel® Xeon® Gold
6258R processor (different from the processor that we used in Section 5). We used the
following synthetic dataset: $(x_i)_{i=1}^{n} \sim \mathcal{N}(\mathbf{1}_2, I_2)$ and $(y_j)_{j=1}^{m} \sim \mathcal{N}(-\mathbf{1}_2, I_2)$, with different
N (= n = m) ∈ {100, 300, 500, 1000, 2000, 3000}. For q-DOT and the Sinkhorn algorithm, different regularization parameters λ ∈ {0.01, 0.1, 1} were compared, and ε = 1 × 10⁻⁶
was used as the stopping criterion. We compared different deformation parameters
q ∈ {0, 0.25, 0.5, 0.75}. For the unregularized OT, we used the implementation of the
Python optimal transport package [38]. For q-DOT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we
regarded entries smaller than machine epsilon as zero.
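For the q = 1.00 rows (the Sinkhorn algorithm), a runnable point of reference can be obtained directly from POT, as sketched below. We assume here that POT's entropic regularization argument `reg` plays the role of λ; the seed and helper code are ours.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
N = 100
xs = rng.normal(1.0, 1.0, size=(N, 2))
ys = rng.normal(-1.0, 1.0, size=(N, 2))
a = b = np.full(N, 1.0 / N)
C = ot.dist(xs, ys, metric='euclidean')

w1 = float(ot.emd2(a, b, C))                 # unregularized 1-Wasserstein reference
P_sinkhorn = ot.sinkhorn(a, b, C, reg=1.0)   # entropic plan, analogous to (q, lambda) = (1.00, 1.00)

eps = np.finfo(float).eps
print("sparsity :", float(np.mean(P_sinkhorn < eps)))            # ~0.000: densely supported
print("abs error:", abs(float(np.sum(P_sinkhorn * C)) - w1))
```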
The results are shown in Table A2. In these tables, the results with q = 1.00 correspond
to the Sinkhorn algorithm. The results for (q, λ) = (1.00, 0.01) are missing because they did
not work well due to numerical instability. In general, we observed similar behavior as we
described in Section 5: sparsity intensified as q and λ decreased, thereby increasing runtime.
As N increased, nonmonotonic trends in runtime were observed with respect to q: for a
fixed λ, larger q accelerated the computation, while q = 0.25 seemed to be the slowest. This
apparent discrepancy from Theorem 1 may be partly because Theorem 1 relies on an oracle
parameter choice $\kappa = 2N\tau^q \lambda^{-1}$ as we discussed in Section 5.2, which is hardly known in
practice. Nevertheless, it is remarkable that even q = 0.75 gives very sparse solutions with
a reasonable amount of runtime. Regarding the absolute error, smaller q tends to perform
better with relatively small datasets, such as N ≤ 1000, while q = 1.00 performs better for
larger datasets, such as N = 2000 and 3000. As we mentioned in Section 5.3, theoretical
analysis of the approximation error is still unclear, and will be left for future work.

Table A2. Hyperparameter sensitivity of q-DOT and Sinkhorn algorithm. In these tables, q = 1.00
corresponds to the Sinkhorn algorithm. (q, λ) = (1.00, 0.01) did not work well because of numerical
instability. The results shown in the tables are the means of 10 random trials. Bold typeface indicates
the best result for each of sparsity, absolute error, and runtime.

(N = 100)            Sparsity   Abs. error      Runtime [ms]
q = 0.00, λ = 0.01   0.990      2.28 × 10^−2    4366.142
q = 0.00, λ = 0.10   0.988      3.63 × 10^−3    1236.346
q = 0.00, λ = 1.00   0.982      6.20 × 10^−3    842.253
q = 0.25, λ = 0.01   0.989      8.18 × 10^−3    3182.535
q = 0.25, λ = 0.10   0.986      5.54 × 10^−3    1131.784
q = 0.25, λ = 1.00   0.973      1.16 × 10^−2    668.734
q = 0.50, λ = 0.01   0.987      9.91 × 10^−3    2388.176
q = 0.50, λ = 0.10   0.977      7.66 × 10^−3    1040.818
q = 0.50, λ = 1.00   0.946      2.40 × 10^−2    339.978
q = 0.75, λ = 0.01   0.979      1.16 × 10^−2    2396.353
q = 0.75, λ = 0.10   0.950      1.31 × 10^−2    731.564
q = 0.75, λ = 1.00   0.786      1.02 × 10^−1    200.654
q = 1.00, λ = 0.01   —          —               —
q = 1.00, λ = 0.10   0.000      5.83 × 10^−2    1132.516
q = 1.00, λ = 1.00   0.000      7.51 × 10^−1    31.284

(N = 300)            Sparsity   Abs. error      Runtime [ms]
q = 0.00, λ = 0.01   0.997      1.30 × 10^0     33,592.026
q = 0.00, λ = 0.10   0.996      2.15 × 10^−2    14,641.740
q = 0.00, λ = 1.00   0.994      2.03 × 10^−2    7749.233
q = 0.25, λ = 0.01   0.996      7.07 × 10^−2    36,167.445
q = 0.25, λ = 0.10   0.994      1.83 × 10^−2    15,176.970
q = 0.25, λ = 1.00   0.990      2.69 × 10^−2    5848.561
q = 0.50, λ = 0.01   0.994      1.99 × 10^−2    25,940.619
q = 0.50, λ = 0.10   0.991      2.41 × 10^−2    8304.774
q = 0.50, λ = 1.00   0.976      3.52 × 10^−2    2713.598
q = 0.75, λ = 0.01   0.991      2.97 × 10^−2    18,820.365
q = 0.75, λ = 0.10   0.973      3.34 × 10^−2    4823.098
q = 0.75, λ = 1.00   0.864      9.57 × 10^−2    1654.697
q = 1.00, λ = 0.01   —          —               —
q = 1.00, λ = 0.10   0.000      7.39 × 10^−2    2014.341
q = 1.00, λ = 1.00   0.000      8.15 × 10^−1    207.094

(N = 500)            Sparsity   Abs. error      Runtime [ms]
q = 0.00, λ = 0.01   0.999      2.48 × 10^0     86,046.395
q = 0.00, λ = 0.10   0.997      3.91 × 10^−2    49,523.995
q = 0.00, λ = 1.00   0.996      4.10 × 10^−2    27,357.659
q = 0.25, λ = 0.01   0.998      2.36 × 10^−1    104,346.641
q = 0.25, λ = 0.10   0.996      5.12 × 10^−2    41,810.473
q = 0.25, λ = 1.00   0.994      4.22 × 10^−2    18,415.400
q = 0.50, λ = 0.01   0.996      4.52 × 10^−2    78,618.996
q = 0.50, λ = 0.10   0.994      4.50 × 10^−2    25,512.371
q = 0.50, λ = 1.00   0.984      4.92 × 10^−2    8266.048
q = 0.75, λ = 0.01   0.994      4.55 × 10^−2    57,839.639
q = 0.75, λ = 0.10   0.979      5.07 × 10^−2    14,257.452
q = 0.75, λ = 1.00   0.890      1.00 × 10^−1    4362.478
q = 1.00, λ = 0.01   —          —               —
q = 1.00, λ = 0.10   0.000      7.92 × 10^−2    5731.333
q = 1.00, λ = 1.00   0.000      8.35 × 10^−1    562.722

(N = 1000)           Sparsity   Abs. error      Runtime [s]
q = 0.00, λ = 0.01   1.000      6.39 × 10^0     336.207
q = 0.00, λ = 0.10   0.999      8.76 × 10^−2    286.879
q = 0.00, λ = 1.00   0.998      8.22 × 10^−2    133.223
q = 0.25, λ = 0.01   0.999      4.27 × 10^0     413.775
q = 0.25, λ = 0.10   0.998      1.01 × 10^−1    221.787
q = 0.25, λ = 1.00   0.997      9.01 × 10^−2    87.945
q = 0.50, λ = 0.01   0.998      8.61 × 10^−2    374.123
q = 0.50, λ = 0.10   0.997      9.37 × 10^−2    120.605
q = 0.50, λ = 1.00   0.990      9.49 × 10^−2    41.435
q = 0.75, λ = 0.01   0.996      1.05 × 10^−1    275.101
q = 0.75, λ = 0.10   0.985      1.02 × 10^−1    67.301
q = 0.75, λ = 1.00   0.917      1.34 × 10^−1    21.536
q = 1.00, λ = 0.01   —          —               —
q = 1.00, λ = 0.10   0.000      8.62 × 10^−2    57.739
q = 1.00, λ = 1.00   0.000      8.51 × 10^−1    2.215

(N = 2000)           Sparsity   Abs. error      Runtime [s]
q = 0.00, λ = 0.01   1.000      3.59 × 10^0     1386.554
q = 0.00, λ = 0.10   0.999      2.25 × 10^−1    1245.867
q = 0.00, λ = 1.00   0.999      1.85 × 10^−1    823.011
q = 0.25, λ = 0.01   1.000      5.88 × 10^0     1555.064
q = 0.25, λ = 0.10   0.999      1.86 × 10^−1    1201.656
q = 0.25, λ = 1.00   0.998      1.86 × 10^−1    492.324
q = 0.50, λ = 0.01   0.999      6.66 × 10^−1    1494.270
q = 0.50, λ = 0.10   0.998      1.97 × 10^−1    589.379
q = 0.50, λ = 1.00   0.994      1.85 × 10^−1    210.008
q = 0.75, λ = 0.01   0.998      2.00 × 10^−1    1300.517
q = 0.75, λ = 0.10   0.989      2.00 × 10^−1    321.221
q = 0.75, λ = 1.00   0.937      2.08 × 10^−1    106.334
q = 1.00, λ = 0.01   —          —               —
q = 1.00, λ = 0.10   0.000      9.06 × 10^−2    147.372
q = 1.00, λ = 1.00   0.000      8.62 × 10^−1    8.575

(N = 3000)           Sparsity   Abs. error      Runtime [s]
q = 0.00, λ = 0.01   1.000      4.09 × 10^0     3257.314
q = 0.00, λ = 0.10   1.000      8.56 × 10^−1    3108.889
q = 0.00, λ = 1.00   0.999      2.68 × 10^−1    2355.733
q = 0.25, λ = 0.01   1.000      3.78 × 10^0     3821.319
q = 0.25, λ = 0.10   0.999      2.94 × 10^−1    3532.833
q = 0.25, λ = 1.00   0.999      2.76 × 10^−1    1530.838
q = 0.50, λ = 0.01   1.000      1.85 × 10^0     3669.894
q = 0.50, λ = 0.10   0.999      2.93 × 10^−1    1637.985
q = 0.50, λ = 1.00   0.995      2.71 × 10^−1    644.164
q = 0.75, λ = 0.01   0.998      2.98 × 10^−1    3560.379
q = 0.75, λ = 0.10   0.991      2.91 × 10^−1    853.451
q = 0.75, λ = 1.00   0.946      2.83 × 10^−1    270.046
q = 1.00, λ = 0.01   —          —               —
q = 1.00, λ = 0.10   0.000      8.94 × 10^−2    272.210
q = 1.00, λ = 1.00   0.000      8.62 × 10^−1    20.120

References
1. Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338.
2. Shafieezadeh-Abadeh, S.; Mohajerin Esfahani, P.M.; Kuhn, D. Distributionally robust logistic regression. Adv. Neural Inf. Process.
Syst. 2015, 28. https://dl.acm.org/doi/10.5555/2969239.2969415.
3. Courty, N.; Flamary, R.; Habrard, A.; Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. Adv.
Neural Inf. Process. Syst. 2017, 30. https://dl.acm.org/doi/10.5555/3294996.3295130.
4. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International
Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR, pp. 214–223.
5. Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From word embeddings to document distances. In Proceedings of the 32nd
International Conference on Machine Learning, Lille, France, 7–9 July 2015; PMLR, pp. 957–966.
6. Swanson, K.; Yu, L.; Lei, T. Rationalizing text matching: Learning sparse alignments via optimal transport. In Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5609–5626.
7. Otani, M.; Togashi, R.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Satoh, S. Optimal correction cost for object detection evaluation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022;
pp. 21107–21115.
8. Pele, O.; Werman, M. Fast and robust Earth Mover’s Distances. In Proceedings of the 2009 IEEE 12th International Conference on
Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: New York, NY, USA, 2009; pp. 460–467.
9. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300.
10. Dessein, A.; Papadakis, N.; Rouas, J.L. Regularized optimal transport and the rot mover’s distance. J. Mach. Learn. Res. 2018,
19, 590–642.
11. Dvurechensky, P.; Gasnikov, A.; Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent
is better than by Sinkhorn’s algorithm. In Proceedings of the 36th International Conference on Machine Learning, Stockholm,
Sweden, 10–15 July 2018; PMLR, pp. 1367–1376.
12. Le, T.; Yamada, M.; Fukumizu, K.; Cuturi, M. Tree-sliced variants of Wasserstein distances. Adv. Neural Inf. Process. Syst. 2019,
32, 12304–12315.
13. Le, T.; Nguyen, T.; Phung, D.; Nguyen, V.A. Sobolev transport: A scalable metric for probability measures with graph metrics.
In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Online, 28–30 March 2022; PMLR,
pp. 9844–9868.
14. Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; Poggio, T.A. Learning with a Wasserstein loss. Adv. Neural Inf. Process. Syst. 2015,
28, 2053–2061.
15. Cuturi, M.; Teboul, O.; Vert, J.P. Differentiable ranking and sorting using optimal transport. Adv. Neural Inf. Process. Syst. 2019,
32, 6861–6871.
16. Blondel, M.; Martins, A.F.; Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
17. Birkhoff, G. Tres observaciones sobre el álgebra lineal. Univ. Nac. Tucumán Rev. Ser. A 1946, 5, 147–154.
18. Brualdi, R.A. Combinatorial Matrix Classes; Cambridge University Press: Cambridge, UK, 2006; Volume 13.
19. Alvarez-Melis, D.; Jaakkola, T. Gromov–Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1881–1890.
20. Blondel, M.; Seguy, V.; Rolet, A. Smooth and sparse optimal transport. In Proceedings of the 21st International Conference on
Artificial Intelligence and Statistics, Canary Islands, Spain, 9–11 April 2018; PMLR, pp. 880–889.
21. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
[CrossRef]
22. Amari, S.-I.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185. [CrossRef]
23. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [CrossRef]
24. Powell, M.J.D. Some global convergence properties of a variable metric algorithm for minimization without exact line searches.
In Proceedings of the Nonlinear Programming, SIAM-AMS Proceedings, New York, NY, USA, 1 January 1976; Volume 9.
25. Altschuler, J.; Niles-Weed, J.; Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration.
Adv. Neural Inf. Process. Syst. 2017, 30, 1961–1971.
26. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348.
[CrossRef]
27. Danskin, J.M. The theory of max-min, with applications. SIAM J. Appl. Math. 1966, 14, 641–664. [CrossRef]
28. Bao, H.; Sugiyama, M. Fenchel-Young losses with skewed entropies for class-posterior probability estimation. In Proceedings of
the 24th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 13–15 April 2021; pp. 1648–1656.
29. Naudts, J. Deformed exponentials and logarithms in generalized thermostatistics. Phys. A Stat. Mech. Its Appl. 2002, 316, 323–334.
[CrossRef]
30. Suyari, H. The unique non self-referential q-canonical distribution and the physical temperature derived from the maximum
entropy principle in Tsallis statistics. Prog. Theor. Phys. Suppl. 2006, 162, 79–86. [CrossRef]
31. Ding, N.; Vishwanathan, S. t-Logistic regression. Adv. Neural Inf. Process. Syst. 2010, 23, 514–522.
32. Futami, F.; Sato, I.; Sugiyama, M. Expectation propagation for t-exponential family using q-algebra. Adv. Neural Inf. Process. Syst.
2017, 30. https://dl.acm.org/doi/10.5555/3294771.3294985.

33. Amid, E.; Warmuth, M.K.; Anil, R.; Koren, T. Robust bi-tempered logistic loss based on Bregman divergences. Adv. Neural Inf.
Process. Syst. 2019, 32, 15013–15022.
34. Martins, A.F.; Figueiredo, M.A.; Aguiar, P.M.; Smith, N.A.; Xing, E.P. Nonextensive entropic kernels. In Proceedings of the 25th
International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 640–647.
35. Muzellec, B.; Nock, R.; Patrini, G.; Nielsen, F. Tsallis regularized optimal transport and ecological inference. In Proceedings of the
31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
36. Byrd, R.H.; Nocedal, J.; Yuan, Y.X. Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer.
Anal. 1987, 24, 1171–1190. [CrossRef]
37. Schmitzer, B. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM J. Sci. Comput. 2019,
41, A1443–A1481. [CrossRef]
38. Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M.Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N.;
et al. POT: Python optimal transport. J. Mach. Learn. Res. 2021, 22, 1–8.
39. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.;
Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272. doi:
10.1038/s41592-019-0686-2. [CrossRef] [PubMed]
40. Weed, J. An explicit analysis of the entropic penalty in linear programming. In Proceedings of the the 31st Conference on
Learning Theory, Stockholm, Sweden, 5–9 July 2018; PMLR, pp. 1841–1855.
41. Golub, G.H.; van Loan, C.F. Matrix Computations; The Johns Hopkins University Press: Baltimore, MD, USA, 2013.
