Stochastic Hamiltonian Gradient Methods for Smooth Games
Nicolas Loizou¹, Hugo Berard¹ ², Alexia Jolicoeur-Martineau¹, Pascal Vincent†¹ ², Simon Lacoste-Julien†¹, Ioannis Mitliagkas†¹
2018; Abernethy et al., 2019). It is only very recently that last-iterate convergence guarantees over a non-compact domain appeared in the literature for the stochastic problem (Palaniappan & Bach, 2016; Chavdarova et al., 2019; Hsieh et al., 2019; Mishchenko et al., 2020), under the assumption of strong monotonicity. Strong monotonicity, a generalization of strong convexity to general operators, seems to be an essential condition for fast convergence in optimization. Here, we make no strong monotonicity assumption.

The algorithms we consider belong to a recently introduced family of computationally-light second-order methods which in each step require the computation of a Jacobian-vector product. Methods that belong to this family are the consensus optimization (CO) method (Mescheder et al., 2017) and Hamiltonian gradient descent (Balduzzi et al., 2018; Abernethy et al., 2019). Even though some convergence results for these methods are known for the deterministic problem, there is no available analysis for the stochastic problem. We close this gap. We study stochastic Hamiltonian gradient descent (SHGD), and propose the first stochastic variance reduced Hamiltonian method, named L-SVRHG. Our contributions are summarized as follows:

• Our results provide the first set of global non-asymptotic last-iterate convergence guarantees for a stochastic game over a non-compact domain, in the absence of strong monotonicity assumptions.
• The proposed stochastic Hamiltonian methods use novel unbiased estimators of the gradient of the Hamiltonian function. This is an essential point for providing convergence guarantees. Existing practical variants of SHGD use biased estimators (Mescheder et al., 2017).
• We provide the first efficient convergence analysis of stochastic Hamiltonian methods. In particular, we focus on solving two classes of stochastic smooth games:
  – Stochastic Bilinear Games.
  – Stochastic games satisfying the "sufficiently bilinear" condition, or simply Stochastic Sufficiently Bilinear Games. The deterministic variant of this class of games was first introduced by Abernethy et al. (2019) to study the deterministic problem, and notably includes some non-monotone problems.
• For the above two classes of games, we provide convergence guarantees for SHGD with a constant step-size (linear convergence to a neighborhood of a stationary point), SHGD with a variable step-size (sub-linear convergence to a stationary point), and L-SVRHG. For the latter, we guarantee a linear rate.
• We show the benefits of the proposed methods by performing numerical experiments on simple stochastic bilinear and sufficiently bilinear problems, as well as on toy GAN problems for which the optimal solution is known. Our numerical findings corroborate our theoretical results.

2. Further Related Work

In recent years, several second-order methods have been proposed for solving the min-max optimization problem (1). Some of them require the computation or inversion of a Jacobian, which is a highly inefficient operation (Wang et al., 2019; Mazumdar et al., 2019). In contrast, second-order methods like the ones presented in Mescheder et al. (2017); Balduzzi et al. (2018); Abernethy et al. (2019) and in this work are more efficient, as they only rely on the computation of a Jacobian-vector product in each step.

Abernethy et al. (2019) provide the first last-iterate convergence rates for deterministic Hamiltonian gradient descent (HGD) for several classes of games, including games satisfying the sufficiently bilinear condition. The authors briefly touch upon the stochastic setting and, using the convergence results of Karimi et al. (2016), explain how a stochastic variant of HGD with decreasing step-size behaves. Their approach was purely theoretical, and they did not provide an efficient way of selecting the unbiased estimators of the gradient of the Hamiltonian. In addition, they assumed a bounded gradient of the Hamiltonian function, which is restrictive for functions satisfying the Polyak-Lojasiewicz (PL) condition (Gower et al., 2020). In this work we provide the first efficient variants and analysis of SHGD. We do this by choosing a practical unbiased estimator of the full gradient and by using the recently proposed assumptions of expected smoothness (Gower et al., 2019) and expected residual (Gower et al., 2020) in our analysis. The proposed theory of SHGD allows us to obtain, as a corollary, tight convergence guarantees for deterministic HGD, recovering the result of Abernethy et al. (2019) for sufficiently bilinear games.

In another line of work, Carmon et al. (2019) analyze variance reduction methods for constrained finite-sum problems, and Ryu et al. (2019) provide an ODE-based analysis and guarantees in the monotone but potentially non-smooth case. Chavdarova et al. (2019) show that both alternated stochastic descent-ascent and stochastic extragradient diverge on an unconstrained stochastic bilinear problem. In the same paper, Chavdarova et al. (2019) propose the stochastic variance reduced extragradient (SVRE) algorithm with restart, which empirically achieves last-iterate convergence on this problem. However, it came with no theoretical guarantees. In Section 7, we observe in our experiments that SVRE is slower than the proposed L-SVRHG on both the stochastic bilinear and sufficiently bilinear games that we tested.

In concurrent work, Yang et al. (2020) provide global convergence guarantees for stochastic alternating gradient descent-ascent (and its variance reduction variant) for a subclass of nonconvex-nonconcave objectives satisfying a so-called two-sided Polyak-Lojasiewicz inequality, but this does not include the stochastic bilinear problem that we cover.
3. Technical Preliminaries

In this section, we present the necessary background and the basic notation used in the paper. We also describe the update rule of the deterministic Hamiltonian method.

3.1. Optimization Background: Basic Definitions

We start by presenting some definitions that we will later use in the analysis of the proposed methods.

Definition 3.1. A function $f: \mathbb{R}^d \to \mathbb{R}$ is $\mu$-quasi-strongly convex if there exists a constant $\mu > 0$ such that for all $x \in \mathbb{R}^d$: $f^* \geq f(x) + \langle\nabla f(x), x^* - x\rangle + \frac{\mu}{2}\|x^* - x\|^2$, where $f^*$ is the minimum value of $f$ and $x^*$ is the projection of $x$ onto the solution set $\mathcal{X}^*$ minimizing $f$.

Definition 3.2. We say that a function satisfies the Polyak-Lojasiewicz (PL) condition if there exists $\mu > 0$ such that
$$\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu\,[f(x) - f^*] \quad \forall x \in \mathbb{R}^d, \qquad (4)$$
where $f^*$ is the minimum value of $f$.

An analysis of several stochastic optimization methods under the assumption of the PL condition (Polyak, 1987) was recently proposed in Karimi et al. (2016). A function can satisfy the PL condition and not be strongly convex, or even convex. However, if the function is $\mu$-quasi-strongly convex, then it satisfies the PL condition with the same $\mu$ (Karimi et al., 2016).

Definition 3.3. A function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if there exists $L > 0$ such that $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

If $f = \frac{1}{n}\sum_{i=1}^n f_i(x)$, then a more refined analysis of stochastic gradient methods has been proposed under new notions of smoothness. In particular, the notions of expected smoothness (ES) and expected residual (ER) have been introduced and used in the analysis of SGD in Gower et al. (2019) and Gower et al. (2020), respectively. ES and ER are generic and remarkably weak assumptions. In Section 6 and Appendix B.2, we provide more details on their generality. We state their definitions below.

Definition 3.4 (Expected smoothness, (Gower et al., 2019)). We say that the function $f = \frac{1}{n}\sum_{i=1}^n f_i(x)$ satisfies the expected smoothness condition if there exists $\mathcal{L} > 0$ such that for all $x \in \mathbb{R}^d$,
$$\mathbb{E}_i\left[\|\nabla f_i(x) - \nabla f_i(x^*)\|^2\right] \leq 2\mathcal{L}\,(f(x) - f(x^*)). \qquad (5)$$

Definition 3.5 (Expected residual, (Gower et al., 2020)). We say that the function $f = \frac{1}{n}\sum_{i=1}^n f_i(x)$ satisfies the expected residual condition if there exists $\rho > 0$ such that for all $x \in \mathbb{R}^d$,
$$\mathbb{E}_i\left[\|\nabla f_i(x) - \nabla f_i(x^*) - (\nabla f(x) - \nabla f(x^*))\|^2\right] \leq 2\rho\,(f(x) - f(x^*)). \qquad (6)$$

3.2. Smooth Min-Max Optimization

We use standard notation used previously in Mescheder et al. (2017); Balduzzi et al. (2018); Abernethy et al. (2019); Letcher et al. (2019).

Let $x = (x_1, x_2)^\top \in \mathbb{R}^d$ be the column vector obtained by stacking $x_1$ and $x_2$ one on top of the other. With $\xi(x) := (\nabla_{x_1} g, -\nabla_{x_2} g)^\top$ we denote the signed vector of partial derivatives evaluated at the point $x$. Thus, $\xi(x): \mathbb{R}^d \to \mathbb{R}^d$ is a vector function. We use
$$\mathrm{J} = \nabla\xi = \begin{pmatrix} \nabla^2_{x_1,x_1} g & \nabla^2_{x_1,x_2} g \\ -\nabla^2_{x_2,x_1} g & -\nabla^2_{x_2,x_2} g \end{pmatrix} \in \mathbb{R}^{d\times d}$$
to denote the Jacobian of the vector function $\xi$. Note that using the above notation, the simultaneous gradient descent/ascent (SGDA) update can be written simply as $x^{k+1} = x^k - \eta_k \xi(x^k)$.

Definition 3.6. The objective function $g$ of problem (1) is $L_g$-smooth if there exists $L_g > 0$ such that $\|\xi(x) - \xi(y)\| \leq L_g\|x - y\|$ for all $x, y \in \mathbb{R}^d$. We also say that $g$ is $L$-smooth in $x_1$ (in $x_2$) if $\|\nabla_{x_1} g(x_1, x_2) - \nabla_{x_1} g(x_1', x_2)\| \leq L\|x_1 - x_1'\|$ (if $\|\nabla_{x_2} g(x_1, x_2) - \nabla_{x_2} g(x_1, x_2')\| \leq L\|x_2 - x_2'\|$) for all $x_1, x_1' \in \mathbb{R}^{d_1}$ (for all $x_2, x_2' \in \mathbb{R}^{d_2}$).

Definition 3.7. A stationary point of a function $f: \mathbb{R}^d \to \mathbb{R}$ is a point $x^* \in \mathbb{R}^d$ such that $\nabla f(x^*) = 0$. Using the above notation, in the min-max problem (1), a point $x^* \in \mathbb{R}^d$ is a stationary point when $\xi(x^*) = 0$.

As mentioned in the introduction, in this work we focus on smooth games satisfying the following assumption.

Assumption 3.8. The objective function $g$ of problem (3) has at least one stationary point, and all of its stationary points are global min-max solutions.

With Assumption 3.8, we can guarantee convergence to a min-max solution of problem (3) by proving convergence to a stationary point. This assumption is true for several classes of games, including strongly convex-strongly concave and convex-concave games. However, it can also be true for some classes of non-convex non-concave games (Abernethy et al., 2019). In Section 4, we describe in more detail the two classes of games that we study. Both satisfy this assumption.
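To make the notation above concrete, here is a minimal numpy sketch (an illustration added here, not code from the paper) that builds the signed gradient $\xi(x)$ and the Jacobian $\mathrm{J}$ for a purely bilinear objective $g(x_1, x_2) = x_1^\top A x_2 + b^\top x_1 + c^\top x_2$ and takes one SGDA step; the instance $(A, b, c)$ is a hypothetical placeholder.

```python
import numpy as np

# Toy instance of the notation of Section 3.2 for the (hypothetical) bilinear
# objective g(x1, x2) = x1^T A x2 + b^T x1 + c^T x2.
rng = np.random.default_rng(0)
d1 = d2 = 3
A = rng.standard_normal((d1, d2))
b, c = rng.standard_normal(d1), rng.standard_normal(d2)

def xi(x1, x2):
    # Signed vector of partial derivatives: xi(x) = (grad_{x1} g, -grad_{x2} g).
    return np.concatenate([A @ x2 + b, -(A.T @ x1 + c)])

# Jacobian of xi; for this bilinear g the diagonal blocks vanish.
J = np.block([[np.zeros((d1, d1)), A],
              [-A.T, np.zeros((d2, d2))]])

# One SGDA step: x^{k+1} = x^k - eta_k * xi(x^k).
x1, x2, eta = rng.standard_normal(d1), rng.standard_normal(d2), 0.1
x_next = np.concatenate([x1, x2]) - eta * xi(x1, x2)
```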
3.3. Deterministic Hamiltonian Gradient Descent

Hamiltonian gradient descent (HGD) has been proposed as an efficient method for solving min-max problems in Balduzzi et al. (2018). To the best of our knowledge, the first convergence analysis of the method is presented in Abernethy et al. (2019), where the authors prove non-asymptotic linear last-iterate convergence rates for several classes of games.

In particular, HGD converges to saddle points of problem (1) by performing gradient descent on a particular objective function $H$, which is called the Hamiltonian function (Balduzzi et al., 2018) and has the following form:
$$\min_x H(x) = \frac{1}{2}\|\xi(x)\|^2. \qquad (7)$$
That is, HGD is a gradient descent method that minimizes the squared norm of the gradient $\xi(x)$. Note that under Assumption 3.8, solving problem (7) is equivalent to solving problem (1). The equivalence comes from the fact that $H$ only achieves its minimum at stationary points. The update rule of HGD can be expressed using a Jacobian-vector product (Balduzzi et al., 2018; Abernethy et al., 2019):
$$x^{k+1} = x^k - \eta_k \nabla H(x^k) = x^k - \eta_k \mathrm{J}^\top \xi, \qquad (8)$$
making HGD a second-order method. However, as discussed in Balduzzi et al. (2018), the Jacobian-vector product can be efficiently evaluated in tasks like training neural networks, and the computation time of the gradient and the Jacobian-vector product is comparable (Pearlmutter, 1994).

4. Stochastic Smooth Games and Stochastic Hamiltonian Function

In this section, we provide the two classes of stochastic games that we study. We define the stochastic counterpart to the Hamiltonian function as a step towards solving problem (3) and present its main properties.

Let us start by presenting the basic notation for the stochastic setting. Let $\xi(x) = \frac{1}{n}\sum_{i=1}^n \xi_i(x)$, where $\xi_i(x) := (\nabla_{x_1} g_i, -\nabla_{x_2} g_i)^\top$ for all $i \in [n]$, and let
$$\mathrm{J} = \frac{1}{n}\sum_{i=1}^n \mathrm{J}_i, \quad \text{where } \mathrm{J}_i = \begin{pmatrix} \nabla^2_{x_1,x_1} g_i & \nabla^2_{x_1,x_2} g_i \\ -\nabla^2_{x_2,x_1} g_i & -\nabla^2_{x_2,x_2} g_i \end{pmatrix}.$$
Using the above notation, the stochastic variant of SGDA can be written as $x^{k+1} = x^k - \eta_k \xi_i(x^k)$, where $\mathbb{E}_i[\xi_i(x^k)] = \xi(x^k)$.³

In this work, we focus on stochastic smooth games of the form (3) that satisfy the following assumption.

Assumption 4.1. The functions $g_i: \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ of problem (3) are twice differentiable and $L_i$-smooth with $S_i$-Lipschitz Jacobian. That is, for each $i \in [n]$ there are constants $L_i > 0$ and $S_i > 0$ such that $\|\xi_i(x) - \xi_i(y)\| \leq L_i\|x - y\|$ and $\|\mathrm{J}_i(x) - \mathrm{J}_i(y)\| \leq S_i\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

4.1. Classes of Stochastic Games

Here we formalize the two families of stochastic smooth games under study: (i) stochastic bilinear, and (ii) stochastic sufficiently bilinear. Both families satisfy Assumption 3.8. Interestingly, the latter family includes some non-convex non-concave games, i.e., non-monotone problems.

Stochastic Bilinear Games. A stochastic bilinear game is the stochastic smooth game (3) in which the function $g$ has the following structure:
$$g(x_1, x_2) = \frac{1}{n}\sum_{i=1}^n \left[x_1^\top b_i + x_1^\top A_i x_2 + c_i^\top x_2\right]. \qquad (9)$$
While this game appears simple, standard methods diverge on it (Chavdarova et al., 2019), and L-SVRHG gives the first stochastic method with last-iterate convergence guarantees.

Stochastic Sufficiently Bilinear Games. A game of the form (3) is called stochastic sufficiently bilinear if it satisfies the following definition.

Definition 4.2. Let Assumption 4.1 be satisfied and let the objective function $g$ of problem (3) be $L$-smooth in $x_1$ and $L$-smooth in $x_2$. Assume that a constant $C > 0$ exists such that $\mathbb{E}_i\|\xi_i(x)\| < C$. Assume the cross derivative $\nabla^2_{x_1,x_2} g$ is full rank, with $0 < \delta \leq \sigma_i\big(\nabla^2_{x_1,x_2} g\big) \leq \Delta$ for all $x \in \mathbb{R}^d$ and for all singular values $\sigma_i$. Let $\rho^2 = \min_{x_1,x_2} \lambda_{\min}\big((\nabla^2_{x_1,x_1} g(x_1, x_2))^2\big)$ and $\beta^2 = \min_{x_1,x_2} \lambda_{\min}\big((\nabla^2_{x_2,x_2} g(x_1, x_2))^2\big)$. Finally, let the following condition be true:
$$(\delta^2 + \rho^2)(\delta^2 + \beta^2) - 4L^2\Delta^2 > 0. \qquad (10)$$

Note that the definition of the stochastic sufficiently bilinear game places no restriction on the convexity of the functions $g_i(x)$ and $g(x)$. The most important condition that needs to be satisfied is the expression in equation (10). Following the terminology of Abernethy et al. (2019), we call condition (10) the "sufficiently bilinear" condition. Later, in our numerical evaluation, we present stochastic non-convex non-concave min-max problems that satisfy condition (10).

We highlight that the deterministic counterpart of the above game was first proposed in Abernethy et al. (2019). The deterministic variant of Abernethy et al. (2019) can be obtained as a special case of the above class of games when $n = 1$ in problem (3).

³Here the expectation is over the uniform distribution. That is, $\mathbb{E}_i[\xi_i(x)] = \frac{1}{n}\sum_{i=1}^n \xi_i(x)$.
Algorithm 2 Loopless Stochastic Variance Reduced Hamiltonian Gradient (L-SVRHG)
Input: Starting stepsize $\gamma > 0$. Choose initial points $x^0 = w^0 \in \mathbb{R}^d$. Distribution $\mathcal{D}$ of samples. Probability $p \in (0, 1]$.
for $k = 0, 1, 2, \dots, K - 1$ do
  Generate fresh samples $i \sim \mathcal{D}$ and $j \sim \mathcal{D}$ and evaluate $\nabla H_{i,j}(x^k)$.
  Evaluate $g^k = \nabla H_{i,j}(x^k) - \nabla H_{i,j}(w^k) + \nabla H(w^k)$.
  Set $x^{k+1} = x^k - \gamma g^k$.
  Set $w^{k+1} = x^k$ with probability $p$, and $w^{k+1} = w^k$ with probability $1 - p$.
end for
Output:
  Option I: the last iterate $x = x^k$.
  Option II: $x$ chosen uniformly at random from $\{x^i\}_{i=0}^K$.

Algorithm 3 L-SVRHG (with Restart)
Input: Starting stepsize $\gamma > 0$. Choose initial points $x^0 = w^0 \in \mathbb{R}^d$. Distribution $\mathcal{D}$ of samples. Probability $p \in (0, 1]$. Number of restarts $T$.
for $t = 0, 1, 2, \dots, T$ do
  Set $x^{t+1} = \text{L-SVRHG}_{II}(x^t, K, \gamma, p)$.
end for
Output: the last iterate $x^T$.

Variance reduced methods are stochastic gradient algorithms for solving finite-sum optimization problems. These algorithms, by reducing the variance of the stochastic gradients, are able to guarantee convergence to the exact solution of the optimization problem, with faster convergence than classical SGD. For example, for smooth strongly convex functions, variance reduced methods can guarantee linear convergence to the optimum. This is a vast improvement on the sub-linear convergence of SGD with decreasing step-size. In the past several years, many efficient variance-reduced methods have been proposed. Some popular examples of variance reduced algorithms are SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013) and SARAH (Nguyen et al., 2017). For more examples of variance reduced methods in different settings, see Defazio (2016); Konečný et al. (2016); Gower et al. (2018); Sebbouh et al. (2019).

In our second method, Algorithm 2, we propose a variance reduced Hamiltonian method for solving (3). Our method is inspired by the recently introduced and well-behaved variance reduced algorithm Loopless-SVRG (L-SVRG), first proposed in Hofmann et al. (2015); Kovalev et al. (2020) and further analyzed under different settings in Qian et al. (2019); Gorbunov et al. (2020); Khaled et al. (2020). We name our method loopless stochastic variance reduced Hamiltonian gradient (L-SVRHG). The method works by selecting at each step the unbiased estimator $g^k = \nabla H_{i,j}(x^k) - \nabla H_{i,j}(w^k) + \nabla H(w^k)$ of the full gradient. As we will prove in the next section, this method guarantees linear convergence to the min-max solution of the stochastic bilinear game (9).

To get a linearly convergent algorithm in the more general setup of sufficiently bilinear games (Definition 4.2), we had to propose a restarted variant of Alg. 2, presented in Alg. 3, which calls at each step Alg. 2 with the second option of output, that is, L-SVRHG$_{II}$. Using the property from Proposition 4.4 that the Hamiltonian function (11) satisfies the PL condition (3.2), we show that Alg. 3 converges linearly to the solution of the sufficiently bilinear game (Theorem 6.8).

6. Convergence Analysis

We provide theorems giving the performance of the previously described stochastic Hamiltonian methods for solving the two classes of stochastic smooth games: stochastic bilinear and stochastic sufficiently bilinear. In particular, we present three main theorems for each one of these classes, describing the convergence rates for (i) SHGD with constant step-size, (ii) SHGD with decreasing step-size, and (iii) L-SVRHG and its restart variant (Algorithm 3).

The proposed results depend on the two main parameters $\mu_H$, $L_H$ evaluated in Propositions 4.3 and 4.4. In addition, the theorems related to bilinear games (where the Hamiltonian function is quasi-strongly convex) use the expected smoothness constant $\mathcal{L}$ (5), while the theorems related to sufficiently bilinear games (where the Hamiltonian function satisfies the PL condition) use the expected residual constant $\rho$ (6). We note that the expected smoothness and expected residual constants can take several values according to the well-defined distributions $\mathcal{D}$ selected in our algorithms, and the proposed theory will still hold (Gower et al., 2019; 2020).

As a concrete example, in the case of $\tau$-minibatch sampling,⁴ the expected smoothness and expected residual parameters take the following values:
$$\mathcal{L}(\tau) = \frac{n^2(\tau-1)}{\tau(n^2-1)} L_H + \frac{n^2-\tau}{\tau(n^2-1)} L_{\max} \qquad (13)$$
$$\rho(\tau) = \frac{n^2-\tau}{(n^2-1)\tau} L_{\max} \qquad (14)$$
where $L_{\max} = \max_{i,j\in\{1,\dots,n\}} L_{H_{i,j}}$ is the maximum smoothness constant of the functions $H_{i,j}$. By using the expressions (13) and (14), it is easy to see that for single-element sampling, where $\tau = 1$ (the one we use in our experiments), $\mathcal{L} = \rho = L_{\max}$.

⁴In each step we draw uniformly at random $\tau$ components of the $n^2$ possible choices of the stochastic Hamiltonian function (11). For more details on $\tau$-minibatch sampling see Appendix B.2.
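Before the formal rates, the following numpy sketch mirrors the update loop of Algorithm 2 on a small stochastic bilinear instance; it is a toy illustration under randomly generated data and a heuristically chosen step-size, not the paper's experiment code.

```python
import numpy as np

# Toy stochastic bilinear game: g_i(x1, x2) = x1^T b_i + x1^T A_i x2 + c_i^T x2.
rng = np.random.default_rng(2)
n, d = 5, 4
As = rng.standard_normal((n, d, d))
bs, cs = rng.standard_normal((n, d)), rng.standard_normal((n, d))

def xi(i, x):
    x1, x2 = x[:d], x[d:]
    return np.concatenate([As[i] @ x2 + bs[i], -(As[i].T @ x1 + cs[i])])

def Jmat(i):
    return np.block([[np.zeros((d, d)), As[i]], [-As[i].T, np.zeros((d, d))]])

def grad_H(i, j, x):
    # Unbiased estimator: 0.5 * (J_i^T xi_j + J_j^T xi_i).
    return 0.5 * (Jmat(i).T @ xi(j, x) + Jmat(j).T @ xi(i, x))

def full_grad_H(x):
    return np.mean([grad_H(i, j, x) for i in range(n) for j in range(n)], axis=0)

x = w = rng.standard_normal(2 * d)
gamma, p = 1e-2, 1.0 / n          # heuristic choices; theory suggests gamma = 1/(6 L_H)
for k in range(3000):
    i, j = rng.integers(n), rng.integers(n)
    g = grad_H(i, j, x) - grad_H(i, j, w) + full_grad_H(w)  # recomputed for simplicity
    x = x - gamma * g
    if rng.random() < p:          # loopless reference-point update
        w = x.copy()
```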
In the other limit case, where a full batch is used ($\tau = n^2$), that is, we run the deterministic Hamiltonian gradient descent, these values become $\mathcal{L} = L_H$ and $\rho = 0$, and as we explain below, the proposed theorems include the convergence of the deterministic method as a special case.

6.1. Stochastic Bilinear Games

We start by presenting the convergence of SHGD with constant step-size and explain how we can also obtain an analysis of HGD (8) as a special case. Then we move to the convergence of SHGD with decreasing step-size and of L-SVRHG, where we are able to guarantee convergence to a min-max solution $x^*$. In the results related to SHGD we use $\sigma^2 := \mathbb{E}_{i,j}[\|\nabla H_{i,j}(x^*)\|^2]$ to denote the finite gradient noise at the solution.

Theorem 6.1 (Constant stepsize). Let us have the stochastic bilinear game (9). Then the iterates of SHGD with constant step-size $\gamma^k = \gamma \in (0, \frac{1}{2\mathcal{L}}]$ satisfy:
$$\mathbb{E}\|x^k - x^*\|^2 \leq (1 - \gamma\mu_H)^k \|x^0 - x^*\|^2 + \frac{2\gamma\sigma^2}{\mu_H}. \qquad (15)$$

That is, Theorem 6.1 shows linear convergence to a neighborhood of the min-max solution. Using Theorem 6.1 and following the approach of Gower et al. (2019), we can obtain the following corollary on the convergence of deterministic Hamiltonian gradient descent (HGD) (8). Note that for the deterministic case $\sigma = 0$ and $\mathcal{L} = L_H$ (13).

Corollary 6.2. Let us have a deterministic bilinear game. Then the iterates of HGD with step-size $\gamma = \frac{1}{2L_H}$ satisfy:
$$\|x^k - x^*\|^2 \leq (1 - \gamma\mu_H)^k \|x^0 - x^*\|^2. \qquad (16)$$

To the best of our knowledge, Corollary 6.2 provides the first linear convergence guarantees for HGD in terms of $\|x^k - x^*\|^2$ (Abernethy et al. (2019) gave guarantees only on $H(x^k)$). Let us now select a decreasing step-size rule (switching strategy) that guarantees sub-linear convergence to the exact min-max solution for SHGD.

Theorem 6.3 (Decreasing stepsizes/switching strategy). Let us have the stochastic bilinear game (9). Let $K := \mathcal{L}/\mu_H$ and let
$$\gamma^k = \begin{cases} \frac{1}{2\mathcal{L}} & \text{for } k \leq 4\lceil K\rceil \\ \frac{2k+1}{(k+1)^2\mu_H} & \text{for } k > 4\lceil K\rceil. \end{cases} \qquad (17)$$
If $k \geq 4\lceil K\rceil$, then SHGD given in Algorithm 1 satisfies:
$$\mathbb{E}\|x^k - x^*\|^2 \leq \frac{\sigma^2}{\mu_H^2}\frac{8}{k} + \frac{16\lceil K\rceil^2}{e^2 k^2}\|x^0 - x^*\|^2. \qquad (18)$$

Lastly, in the following theorem, we show under what selection of step-size L-SVRHG converges linearly to a min-max solution.

Theorem 6.4 (L-SVRHG). Let us have the stochastic bilinear game (9). Let the step-size be $\gamma = 1/6L_H$ and $p \in (0, 1]$. Then L-SVRHG with Option I for output, as given in Algorithm 2, converges linearly to the min-max solution $x^*$ and satisfies:
$$\mathbb{E}[\Phi^k] \leq \max\left\{1 - \frac{\mu_H}{6L_H},\; 1 - \frac{p}{2}\right\}^k \Phi^0,$$
where $\Phi^k := \|x^k - x^*\|^2 + \frac{4\gamma^2}{pn^2}\sum_{i,j=1}^n \|\nabla H_{i,j}(w^k) - \nabla H_{i,j}(x^*)\|^2$.

6.2. Stochastic Sufficiently-Bilinear Games

As in the previous section, we start by presenting the convergence of SHGD with constant step-size and explain how we can obtain an analysis of HGD (8) as a special case. Then we move to the convergence of SHGD with decreasing step-size and of L-SVRHG (with restart), where we are able to guarantee linear convergence to a min-max solution $x^*$. In contrast to the results on bilinear games, the convergence guarantees of the following theorems are given in terms of the Hamiltonian function $\mathbb{E}[H(x^k)]$. In all theorems we call "sufficiently-bilinear game" the game described in Definition 4.2. With $\sigma^2 := \mathbb{E}_{i,j}[\|\nabla H_{i,j}(x^*)\|^2]$ we denote the finite gradient noise at the solution.

Theorem 6.5. Let us have a stochastic sufficiently-bilinear game. Then the iterates of SHGD with constant step-size $\gamma^k = \gamma \leq \frac{\mu_H}{L_H(\mu_H + 2\rho)}$ satisfy:
$$\mathbb{E}[H(x^k)] \leq (1 - \gamma\mu_H)^k [H(x^0)] + \frac{L_H\gamma\sigma^2}{\mu_H}. \qquad (19)$$

Using the above theorem and by following the approach of Gower et al. (2020), we can obtain the following corollary on the convergence of deterministic Hamiltonian gradient descent (HGD) (8). It shows linear convergence of HGD to the min-max solution. Note that for the deterministic case $\sigma = 0$ and $\rho = 0$ (14).

Corollary 6.6. Let us have a deterministic sufficiently-bilinear game. Then the iterates of HGD with step-size $\gamma = \frac{1}{L_H}$ satisfy:
$$H(x^k) \leq (1 - \gamma\mu_H)^k H(x^0). \qquad (20)$$

The result of Corollary 6.6 is equivalent to the convergence of HGD as proposed in Abernethy et al. (2019). Let us now show that with a decreasing step-size (switching strategy), SHGD can converge (with a sub-linear rate) to the min-max solution.
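For reference, the switching step-size rules used in the theorems of this section translate directly into the following helper functions (an illustration added here; the constants $\mathcal{L}$, $\mu_H$, $L_H$, $\rho$ are assumed to be known problem parameters).

```python
import numpy as np

def stepsize_bilinear(k, L_exp, mu_H):
    """Schedule (17): constant 1/(2*L_exp) up to the switch point, then O(1/k).
    L_exp is the expected smoothness constant, mu_H the quasi-strong convexity
    constant of the Hamiltonian."""
    K = L_exp / mu_H
    if k <= 4 * np.ceil(K):
        return 1.0 / (2.0 * L_exp)
    return (2 * k + 1) / ((k + 1) ** 2 * mu_H)

def stepsize_suff_bilinear(k, L_H, mu_H, rho):
    """Schedule (21): constant mu_H/(L_H*(mu_H + 2*rho)), then O(1/k)."""
    k_star = 2 * (L_H / mu_H) * (1 + 2 * rho / mu_H)
    if k <= np.ceil(k_star):
        return mu_H / (L_H * (mu_H + 2 * rho))
    return (2 * k + 1) / ((k + 1) ** 2 * mu_H)
```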
Theorem 6.7 (Decreasing stepsizes/switching strategy). Let us have a stochastic sufficiently-bilinear game. Let $k^* := 2\frac{L_H}{\mu_H}\left(1 + 2\frac{\rho}{\mu_H}\right)$ and
$$\gamma^k = \begin{cases} \frac{\mu_H}{L_H(\mu_H + 2\rho)} & \text{for } k \leq \lceil k^*\rceil \\ \frac{2k+1}{(k+1)^2\mu_H} & \text{for } k > \lceil k^*\rceil. \end{cases} \qquad (21)$$
If $k \geq \lceil k^*\rceil$, then SHGD given in Algorithm 1 satisfies:
$$\mathbb{E}[H(x^k)] \leq \frac{4L_H\sigma^2}{\mu_H^2}\frac{1}{k} + \frac{(k^*)^2}{k^2 e^2}[H(x^0)].$$

In the next theorem we show how the updates of L-SVRHG with Restart (Algorithm 3) converge linearly to the min-max solution. We highlight that each step $t$ of Alg. 3 requires $K = \frac{4}{\mu_H\gamma}$ updates of L-SVRHG.

Theorem 6.8 (L-SVRHG with Restart). Let us have a stochastic sufficiently-bilinear game. Let $p \in (0, 1]$, let
$$\gamma \leq \min\left\{\frac{1}{4L_H},\; \frac{p^{2/3}}{36^{1/3}(L_H\rho)^{1/3}},\; \frac{\sqrt{p}}{\sqrt{6\rho}}\right\}$$
and let $K = \frac{4}{\mu_H\gamma}$. Then the iterates of L-SVRHG (with Restart) given in Algorithm 3 satisfy
$$\mathbb{E}[H(x^t)] \leq (1/2)^t [H(x^0)].$$

7. Numerical Evaluation

In this section, we compare the algorithms proposed in this paper to existing methods in the literature. Our goal is to illustrate the good convergence properties of the proposed algorithms as well as to explore how these algorithms behave in settings not covered by the theory. We propose to compare the following algorithms: SHGD with constant step-size and decreasing step-size, a biased version of SHGD (Mescheder et al., 2017), L-SVRHG with and without restart, consensus optimization (CO)⁵ (Mescheder et al., 2017), the stochastic variant of SGDA, and finally the stochastic variance-reduced extragradient with restart (SVRE) proposed in Chavdarova et al. (2019). For all our experiments, we ran the different algorithms with 10 different seeds and plot the mean and 95% confidence intervals. We provide further details about the experiments and the choice of hyperparameters for the different methods in Appendix F.

7.1. Bilinear Games

First we compare the different methods on the stochastic bilinear problem (9). Similarly to Chavdarova et al. (2019), we choose $n = d_1 = d_2 = 100$, $[A_i]_{kl} = 1$ if $i = k = l$ and $0$ otherwise, and $[b_i]_k, [c_i]_k \sim \mathcal{N}(0, 1/n)$.

We show the convergence of the different algorithms in Fig. 1a. As predicted by the theory, SHGD with decreasing step-size converges at a sub-linear rate, while L-SVRHG converges at a linear rate. Among all the methods we compared to, L-SVRHG is the fastest to converge; however, the speed of convergence depends a lot on the parameter $p$. We observe that setting $p = 1/n$ yields the best performance. To further illustrate the behavior of the Hamiltonian methods, we look at the trajectory of the methods on a simple 2D version of the bilinear game, where we choose $x_1$ and $x_2$ to be scalars. We observe that while previously proposed methods such as SGDA and SVRE suffer from rotations, which slow down their convergence and can even make them diverge, the Hamiltonian methods converge much faster by removing the rotation and converging "straight" to the solution.

7.2. Sufficiently-Bilinear Games

In Section 6.2, we showed that Hamiltonian methods are also guaranteed to converge when the problem is non-convex non-concave but satisfies the sufficiently-bilinear condition (10). To illustrate these results, we propose to look at the following game inspired by Abernethy et al. (2019):
$$\min_{x_1\in\mathbb{R}^d}\;\max_{x_2\in\mathbb{R}^d}\; F(x_1) + \frac{1}{n}\sum_{i=1}^n\left[\delta\, x_1^\top A_i x_2 + b_i^\top x_1 + c_i^\top x_2\right] - F(x_2), \qquad (22)$$
where $F(x)$ is a non-linear function (see details in Appendix F.2). This game is non-convex non-concave and satisfies the sufficiently-bilinear condition if $\delta > 2L$, where $L$ is the smoothness constant of $F(x)$. Thus, the results and theorems from Section 6.2 hold.

Results are shown in Fig. 1b. Similarly to the bilinear case, the methods follow the theory very closely. We highlight that while the proposed theory for this setting only guarantees convergence for L-SVRHG with restart, in practice using restart is not strictly necessary: L-SVRHG with the correct choice of stepsize also converges in our experiment. Finally, we show the trajectories of the different methods on a 2D version of the problem. We observe that, contrary to the bilinear case, stochastic SGDA converges, but it still suffers from rotation compared to the Hamiltonian methods.

In the previous experiments, we verified the proposed theory for the stochastic bilinear and sufficiently-bilinear games. Although we do not have theoretical results for more complex games, we wanted to test our algorithms on a simple GAN setting, which we call GaussianGAN.

⁵CO is a mix between SGDA and SHGD, with the following update rule: $x^{k+1} = x^k - \eta_k\big(\xi_i(x^k) + \lambda\nabla H_{i,j}(x^k)\big)$ (see Appendix F.5).
[Figure 1. (a) Stochastic bilinear game; (b) stochastic sufficiently-bilinear game. Left panels: relative Hamiltonian $H(x^k)/H(x^0)$ versus number of samples for SHGD (constant and decreasing step-size), biased SHGD, L-SVRHG (with restart in (b)), SGDA and SVRE. Right panels: 2D trajectories $(x_1, x_2)$ of the methods from the starting point $x^0$ to the optimum $x^*$.]
In GaussianGAN, we have a dataset of real data $x_{real}$ and a latent variable $z$ drawn from a normal distribution with mean 0 and standard deviation 1. The generator is defined as $G(z) = \mu + \sigma z$ and the discriminator as $D(x_{data}) = \phi_0 + \phi_1 x_{data} + \phi_2 x_{data}^2$, where $x_{data}$ is either real data ($x_{real}$) or fake generated data ($G(z)$). In this setting, the parameters are $x = (x_1, x_2) = ([\mu, \sigma], [\phi_0, \phi_1, \phi_2])$. In GaussianGAN, we can directly measure the L2 distance between the generator's parameters and the true optimal parameters: $\|\hat{\mu} - \mu\| + \|\hat{\sigma} - \sigma\|$, where $\hat{\mu}$ and $\hat{\sigma}$ are the sample mean and standard deviation.

We consider three possible min-max games: Wasserstein GAN (WGAN) (Arjovsky et al., 2017), saturating GAN (satGAN) (Goodfellow et al., 2014), and non-saturating GAN (nsGAN) (Goodfellow et al., 2014). We present the results for WGAN and satGAN in Figure 2. We provide the nsGAN results in Appendix G.2 and details for the different experiments in Appendix F.3.

[Figure 2. The Hamiltonian $H(x^k)/H(x^0)$ (left) and the distance to the optimal generator (right) as a function of the number of samples seen during training for WGAN (top) and satGAN (bottom; panels (c) Hamiltonian for satGAN and (d) distance to optimum for satGAN). The compared methods are CO, SGDA, SHGD (constant step-size) and L-SVRHG. The distance to the optimal generator corresponds to $\frac{\|\hat{\mu}-\mu_k\|+\|\hat{\sigma}-\sigma_k\|}{\|\hat{\mu}-\mu_0\|+\|\hat{\sigma}-\sigma_0\|}$.]

For WGAN, we see that stochastic SGDA fails to converge and that L-SVRHG is the only method to converge linearly on the Hamiltonian. For satGAN, SGDA seems to perform best. Algorithms that take into account the Hamiltonian have high variance. We looked at individual runs and found that, in 3 out of 10 runs, the algorithms other than stochastic SGDA fail to converge, and the Hamiltonian does not significantly decrease over time. While WGAN is guaranteed to have a unique critical point, which is the solution of the game, this is not the case for satGAN and nsGAN, due to the non-linear component. Thus, as expected, Assumption 3.8 is very important in order for the proposed stochastic Hamiltonian methods to perform well.

8. Conclusion and Extensions

We introduce new variants of SHGD (through a novel unbiased estimator and step-size selection) and present the first variance reduced Hamiltonian method, L-SVRHG. Using tools from the optimization literature, we provide convergence guarantees for the two methods and we show how they can efficiently solve stochastic unconstrained bilinear games and the more general class of games that satisfy the "sufficiently bilinear" condition. An important result of our analysis is that it could work as a first step in closing the gap between stochastic optimization algorithms and methods for solving stochastic games, and it can open up many avenues for further development and research in both areas. A natural extension of our results will be the proposal of accelerated Hamiltonian methods that use momentum (Loizou & Richtárik, 2017; Assran & Rabbat, 2020) on top of the Hamiltonian gradient update. We speculate that ideas similar to the ones presented in this work can be used for the development of efficient decentralized methods (Assran et al., 2019; Koloskova et al., 2020) for solving problem (3).
Acknowledgements

The authors would like to thank Reyhane Askari, Gauthier Gidel and Lewis Liu for useful discussions and feedback. Nicolas Loizou acknowledges support by the IVADO postdoctoral funding program. This work was partially supported by the FRQNT new researcher program (2019-NC-257943), the NSERC Discovery grants (RGPIN-2017-06936 and RGPIN-2019-06512) and the Canada CIFAR AI chairs program. Ioannis Mitliagkas acknowledges support by an IVADO startup grant and a Microsoft Research collaborative grant. Simon Lacoste-Julien acknowledges support by a Google Focused Research award. Simon Lacoste-Julien and Pascal Vincent are CIFAR Associate Fellows in the Learning in Machines & Brains program.

References

Abernethy, J., Lai, K. A., and Wibisono, A. Last-iterate convergence rates for min-max optimization. arXiv preprint arXiv:1906.02027, 2019.

Albuquerque, I., Monteiro, J., Falk, T. H., and Mitliagkas, I. Adversarial target-invariant representation learning for domain generalization. arXiv preprint arXiv:1911.00804, 2019.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017.

Assran, M. and Rabbat, M. On the convergence of Nesterov's accelerated gradient method in stochastic settings. arXiv preprint arXiv:2002.12414, 2020.

Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In ICML, 2019.

Azizian, W., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In AISTATS, 2020a.

Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. Accelerating smooth games by manipulating spectral shapes. In AISTATS, 2020b.

Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. In ICML, 2018.

Carmon, Y., Jin, Y., Sidford, A., and Tian, K. Variance reduction for matrix games. In NeurIPS, 2019.

Chavdarova, T., Gidel, G., Fleuret, F., and Lacoste-Julien, S. Reducing noise in GAN training with variance reduced extragradient. In NeurIPS, 2019.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In ICLR, 2018.

Defazio, A. A simple practical accelerated method for finite sums. In NeurIPS, 2016.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NeurIPS, 2014.

Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. A variational inequality perspective on generative adversarial networks. In ICLR, 2018.

Gidel, G., Hemmat, R. A., Pezeshki, M., Le Priol, R., Huang, G., Lacoste-Julien, S., and Mitliagkas, I. Negative momentum for improved game dynamics. In AISTATS, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

Gorbunov, E., Hanzely, F., and Richtárik, P. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In AISTATS, 2020.

Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. SGD: General analysis and improved rates. In ICML, 2019.

Gower, R. M., Sebbouh, O., and Loizou, N. SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. arXiv preprint arXiv:2006.10311, 2020.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: stability of stochastic gradient descent. In ICML, 2016.

Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In NeurIPS, 2015.

Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. On the convergence of single-call stochastic extra-gradient methods. In NeurIPS, 2019.

Ibrahim, A., Azizian, W., Gidel, G., and Mitliagkas, I. Linear lower bounds and conditioning of differentiable games. arXiv preprint arXiv:1906.07300, 2019.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NeurIPS, 2013.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In ECML-PKDD, 2016.

Khaled, A., Sebbouh, O., Loizou, N., Gower, R. M., and Richtárik, P. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. arXiv preprint arXiv:2006.11573, 2020.

Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. U. A unified theory of decentralized SGD with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020.

Konečný, J., Liu, J., Richtárik, P., and Takáč, M. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.

Korpelevich, G. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.

Kovalev, D., Horváth, S., and Richtárik, P. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Algorithmic Learning Theory, 2020.

Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foerster, J. N., Tuyls, K., and Graepel, T. Differentiable game mechanics. Journal of Machine Learning Research, 20(84):1–40, 2019.

Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

Loizou, N., Vaswani, S., Laradji, I., and Lacoste-Julien, S. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. arXiv preprint arXiv:2002.10542, 2020.

Mazumdar, E. V., Jordan, M. I., and Sastry, S. S. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.

Mescheder, L., Nowozin, S., and Geiger, A. The numerics of GANs. In NeurIPS, 2017.

Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., and Malitsky, Y. Revisiting stochastic extragradient. In AISTATS, 2020.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In AISTATS, 2020.

Necoara, I., Nesterov, Y., and Glineur, F. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pp. 1–39, 2018.

Nemirovski, A. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Nemirovski, A. and Yudin, D. B. On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Mathematics Doklady, 19, 1978.

Nemirovski, A. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley Interscience, 1983.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. SGD and Hogwild! Convergence without the bounded gradients assumption. In ICML, 2018.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017.

Palaniappan, B. and Bach, F. Stochastic variance reduction methods for saddle-point problems. In NeurIPS, 2016.

Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

Pfau, D. and Vinyals, O. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.

Polyak, B. Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, 1987.

Qian, X., Qu, Z., and Richtárik, P. L-SVRG and L-Katyusha with arbitrary sampling. arXiv preprint arXiv:1906.01481, 2019.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Ryu, E. K., Yuan, K., and Yin, W. ODE analysis of stochastic gradient methods with optimism and anchoring for minimax problems and GANs. arXiv preprint arXiv:1905.10899, 2019.
In the Appendix we present the proofs of the main propositions and theorems of the main paper, together with additional experiments on different bilinear and sufficiently bilinear games.

In particular, in Section A we start by presenting the pseudo-code of the stochastic optimization algorithms SGD and L-SVRG, on which we build our stochastic Hamiltonian methods. In Section B we provide more details on the assumptions and definitions used in the main paper. In Section D we present the proofs of the two main propositions, and in Section E we explain how these propositions can be combined with existing convergence results in order to obtain the theorems of Section 6. Finally, in Sections F and G we present the experimental details and provide additional experiments.
for all $x \in \mathbb{R}^d$. For simplicity, we will write $(f, \mathcal{D}) \sim ES(\mathcal{L})$ to say that expected smoothness holds.

Assumption B.3 (Expected Residual (ER)). We say that $f$ satisfies the expected residual assumption if there exists $\rho = \rho(f, \mathcal{D}) > 0$ such that
$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_v(x) - \nabla f_v(x^*) - (\nabla f(x) - \nabla f(x^*))\|^2\right] \leq 2\rho\,(f(x) - f(x^*)) \qquad (28)$$
for all $x \in \mathbb{R}^d$. For simplicity, we will write $(f, \mathcal{D}) \sim ER(\rho)$ to say that expected residual holds.
As we explain in Section 6, in this work we focus on τ -minibatch sampling, where in each step we select uniformly at
random a minibatch of size τ ∈ [n2 ] (recall that the Hamiltonian function (11) has n2 components). However we highlight
that the proposed analysis of the stochastic Hamiltonian methods holds for any form of sampling vector following the
results presented in Gower et al. (2019; 2020) for the case of SGD and Qian et al. (2019) for the case of L-SVRG methods,
including importance sampling variants.
Let us provide a formal definition of the τ -minibatch sampling when τ ∈ [n].
Definition B.4 ($\tau$-Minibatch sampling). Let $\tau \in [n]$. We say that $v \in \mathbb{R}^n$ is a $\tau$-minibatch sampling if for every subset $S \subseteq [n]$ with $|S| = \tau$ we have that
$$\mathbb{P}\left[v = \frac{n}{\tau}\sum_{i\in S} e_i\right] = 1\Big/\binom{n}{\tau} = \tau!(n-\tau)!/n!.$$
It is easy to verify, using a double counting argument, that if $v$ is a $\tau$-minibatch sampling, it is also a valid sampling vector ($\mathbb{E}[v_i] = 1$) (Gower et al., 2019).
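A minimal sketch of drawing such a $\tau$-minibatch sampling vector (an illustration added here, not code from the paper):

```python
import numpy as np

def tau_minibatch_vector(n, tau, rng):
    """Definition B.4: pick a uniformly random subset S of size tau and set
    v = (n/tau) * sum_{i in S} e_i, so that E[v_i] = 1 for every coordinate."""
    v = np.zeros(n)
    S = rng.choice(n, size=tau, replace=False)
    v[S] = n / tau
    return v

rng = np.random.default_rng(0)
v = tau_minibatch_vector(10, 3, rng)
print(v.sum())  # always equals n = 10; averaging many draws approaches all-ones
```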
Let $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, with the functions $f_i$ being $L_i$-smooth and the function $f$ being $L$-smooth, and let $L_{\max} = \max_{i\in\{1,\dots,n\}} L_i$. In this setting, as was shown in Gower et al. (2019; 2020) for the case of $\tau$-minibatch sampling ($\tau \in [n]$), the expected smoothness and expected residual parameters and the finite gradient noise $\sigma^2$ take the following form:
$$\mathcal{L}(\tau) = \frac{n(\tau-1)}{\tau(n-1)}\,L + \frac{n-\tau}{\tau(n-1)}\,L_{\max} \qquad (29)$$
$$\rho(\tau) = \frac{n-\tau}{(n-1)\tau}\,L_{\max} \qquad (30)$$
$$\sigma^2(\tau) := \mathbb{E}_{\mathcal{D}}[\|\nabla f_v(x^*)\|^2] = \frac{1}{\tau}\,\frac{n-\tau}{n-1}\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^*)\|^2. \qquad (31)$$
In the full-batch case ($\tau = n$), these parameters become $\mathcal{L} = L$ and $\rho = \sigma^2 = 0$. Note that these are exactly the values for $\mathcal{L}$, $\rho$ and $\sigma^2$ we use in Section 6, with the only difference that there $\tau \in [n^2]$, because the stochastic Hamiltonian function (11) has $n^2$ components $H_{i,j}$.

In particular, as we explained in Section 6, for the theorems related to SHGD we use $\sigma^2 := \mathbb{E}_{i,j}[\|\nabla H_{i,j}(x^*)\|^2]$. From the above expression, for the case of $\tau$-minibatch sampling with $\tau \in [n^2]$ this is equivalent to:
$$\sigma^2(\tau) = \frac{1}{\tau}\,\frac{n^2-\tau}{n^2-1}\,\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \|\nabla H_{i,j}(x^*)\|^2.$$
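The expressions (29)–(31) translate directly into code; the sketch below assumes the smoothness constants $L$, $L_{\max}$ and the values $\|\nabla f_i(x^*)\|^2$ are given (for the Hamiltonian case one would call it with $n^2$ components in place of $n$).

```python
import numpy as np

def minibatch_constants(tau, n, L_smooth, L_max, grad_norms_sq):
    """Constants (29)-(31) for tau-minibatch sampling of an n-term finite sum."""
    L_tau = (n * (tau - 1)) / (tau * (n - 1)) * L_smooth \
            + (n - tau) / (tau * (n - 1)) * L_max
    rho_tau = (n - tau) / ((n - 1) * tau) * L_max
    sigma2_tau = (1.0 / tau) * (n - tau) / (n - 1) * np.mean(grad_norms_sq)
    return L_tau, rho_tau, sigma2_tau

# Full batch (tau = n): L(n) = L_smooth and rho = sigma^2 = 0, as stated above.
print(minibatch_constants(4, 4, 1.0, 3.0, np.array([0.5, 0.1, 0.2, 0.3])))
```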
Connection between τ -minibatch sampling and sampling step of main algorithms. Note that one of the main steps
of Algorithms 1 and 5 is the generation of fresh samples i ∼ D and j ∼ D and the evaluation of ∇Hi,j (xk ). In the case of
uniform single element sampling, the samples i and j are selected with probability pi = 1/n and pj = 1/n respectively.
This is equivalent to selecting the pair $(i, j)$ uniformly at random from the $n^2$ components of the Hamiltonian function. In both cases the probability of selecting the component $H_{i,j}$ is equal to $p_{H_{i,j}} = 1/n^2$.
In other words, for the case of 1-minibatch sampling (uniform single-element sampling), one can simply substitute the sampling step of SHGD and L-SVRHG, "Generate fresh samples $i \sim \mathcal{D}$ and $j \sim \mathcal{D}$ and evaluate $\nabla H_{i,j}(x^k)$", with "Sample uniformly at random the component $H_{i,j}$ and evaluate $\nabla H_{i,j}(x^k)$".
Trivially, using Definition B.4 and the above notion of a sampling vector, this connection can be extended to capture the more general $\tau$-minibatch sampling where $\tau \in [n^2]$. In this case, we will have
$$\nabla H_v(x) := \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n v_{i,j}\,\nabla H_{i,j}(x),$$
where $v \in \mathbb{R}^{n^2}_+$ is a random sampling vector such that $\mathbb{E}[v_{i,j}] = 1$ for $i = 1,\dots,n$ and $j = 1,\dots,n$, and $H_v(x) := \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n v_{i,j}\,H_{i,j}(x)$. Note that it follows immediately from this definition of the sampling vector that $\mathbb{E}[\nabla H_v(x)] = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}[v_{i,j}]\,\nabla H_{i,j}(x) = \nabla H(x)$.

In this case the update rule of SHGD (Algorithm 1) simply becomes $x^{k+1} = x^k - \gamma^k \nabla H_v(x^k)$, and the proposed theoretical results still hold.
when $f$ is strongly convex (Nguyen et al., 2018). Recall that the class of $\mu$-strongly convex functions is a special case of both the class of $\mu$-quasi-strongly convex functions and of functions satisfying the PL condition (see (25)).

Using ES and ER in the proposed theorems, we do not need to assume such a bound. Instead, we use the following direct consequences of expected smoothness and expected residual to bound the expected norm of the stochastic gradients.
Lemma B.6. (Gower et al., 2019) If (f, D) ∼ ES(L), then
Similar upper bound on the stochastic gradients can be obtained if one assumed expected residual:
Lemma B.7. (Gower et al., 2020) If (f, D) ∼ ER(ρ) then
That is, the Hamiltonian function $H(x)$ can be expressed as a finite sum with $n^2$ components. We have
$$\nabla H_{i,j}(x) = \frac{1}{2}\nabla\langle \xi_i(x), \xi_j(x)\rangle = \frac{1}{2}\big[\langle \nabla\xi_i(x), \xi_j(x)\rangle + \langle \xi_i(x), \nabla\xi_j(x)\rangle\big] = \frac{1}{2}\big(\mathrm{J}_i^\top \xi_j + \mathrm{J}_j^\top \xi_i\big), \qquad (35)$$
and it is an unbiased estimator of the full gradient. That is, $\nabla H(x) = \mathbb{E}_{i,j}[\nabla H_{i,j}(x)]$:
$$\begin{aligned}
\nabla H(x) &= \nabla\frac{1}{2}\|\xi(x)\|^2 = \frac{1}{2}\nabla\langle \xi(x), \xi(x)\rangle = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \frac{1}{2}\nabla\langle \xi_i(x), \xi_j(x)\rangle \\
&= \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \frac{1}{2}\big[\langle \nabla\xi_i(x), \xi_j(x)\rangle + \langle \xi_i(x), \nabla\xi_j(x)\rangle\big] \\
&= \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \underbrace{\frac{1}{2}\big(\mathrm{J}_i^\top \xi_j + \mathrm{J}_j^\top \xi_i\big)}_{\nabla H_{i,j}(x)} \\
&= \frac{1}{n}\sum_{i=1}^n \frac{1}{n}\sum_{j=1}^n \nabla H_{i,j}(x) = \mathbb{E}_i\mathbb{E}_j[\nabla H_{i,j}(x)] = \mathbb{E}_{i,j}[\nabla H_{i,j}(x)]. \qquad (36)
\end{aligned}$$
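The identity (36) is easy to verify numerically; the sketch below (a toy check on randomly generated bilinear data, added here for illustration) confirms that averaging $\nabla H_{i,j}(x)$ over all $(i, j)$ pairs recovers $\nabla H(x) = \mathrm{J}^\top \xi(x)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 3
As = rng.standard_normal((n, d, d))
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

def xi_i(i):
    # xi_i(x) for g_i(x1, x2) = x1^T A_i x2 (bilinear, no linear terms).
    return np.concatenate([As[i] @ x2, -(As[i].T @ x1)])

def J_i(i):
    return np.block([[np.zeros((d, d)), As[i]], [-As[i].T, np.zeros((d, d))]])

grads = [0.5 * (J_i(i).T @ xi_i(j) + J_i(j).T @ xi_i(i))
         for i in range(n) for j in range(n)]
Jbar = sum(J_i(i) for i in range(n)) / n
xibar = sum(xi_i(i) for i in range(n)) / n
print(np.allclose(np.mean(grads, axis=0), Jbar.T @ xibar))  # True
```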
where $\zeta$ is a random variable obeying some distribution. Then $\xi(x) = \mathbb{E}_\zeta[\xi(x, \zeta)]$, $\mathrm{J} = \mathbb{E}_\zeta[\mathrm{J}(x, \zeta)]$, and the stochastic Hamiltonian function becomes
$$H(x) = \mathbb{E}_{\zeta_i}\mathbb{E}_{\zeta_j}\Big[\underbrace{\tfrac{1}{2}\langle \xi(x, \zeta_i), \xi(x, \zeta_j)\rangle}_{H_{i,j}(x)}\Big].$$
In this case $\nabla H_{i,j}(x) = \frac{1}{2}\big(\mathrm{J}(x, \zeta_i)^\top \xi(x, \zeta_j) + \mathrm{J}(x, \zeta_j)^\top \xi(x, \zeta_i)\big)$ and $\nabla H(x) = \mathbb{E}_{\zeta_i}\mathbb{E}_{\zeta_j}[\nabla H_{i,j}(x)]$.

In this case SHGD will execute the following update in each step $k \in \{0, 1, 2, \dots, K\}$: $x^{k+1} = x^k - \gamma^k \nabla H_{i,j}(x^k)$.
Eigenvalues, singular values. Let $\mathbf{A} \in \mathbb{R}^{n\times n}$. We denote with $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ its eigenvalues. Let $\lambda_{\min} = \lambda_1$ be the smallest non-zero eigenvalue, and $\lambda_{\max} = \lambda_n$ be the largest eigenvalue. With $\sigma_1 \leq \sigma_2 \leq \cdots \leq \sigma_n$ we denote its singular values. With $\sigma_{\max}$ and $\sigma_{\min}$ we denote the maximum singular value and the minimum non-zero singular value of the matrix $\mathbf{A}$.
$$\begin{aligned}
H_{i,j}(x) &= \frac{1}{2}\langle \xi_i(x), \xi_j(x)\rangle \\
&= \frac{1}{2}\left\langle \big(A_i x_2 + b_i,\; -[A_i^\top x_1 + c_i]\big),\; \big(A_j x_2 + b_j,\; -[A_j^\top x_1 + c_j]\big)\right\rangle \\
&= \frac{1}{2}(x_1, x_2)^\top \begin{pmatrix} A_i A_j^\top & 0 \\ 0 & A_i^\top A_j \end{pmatrix}\begin{pmatrix} x_1 \\ x_2\end{pmatrix} + \frac{1}{2}\big(c_j^\top A_i^\top + c_i^\top A_j^\top,\; b_j^\top A_i + b_i^\top A_j\big)\begin{pmatrix} x_1 \\ x_2\end{pmatrix} + \frac{1}{2}c_i^\top c_j + \frac{1}{2}b_i^\top b_j \\
&= \frac{1}{2}x^\top Q_{i,j}\, x + q_{i,j}^\top x + \ell_{i,j}, \qquad (37)
\end{aligned}$$
where $Q_{i,j} = \begin{pmatrix} A_i A_j^\top & 0 \\ 0 & A_i^\top A_j\end{pmatrix}$, $q_{i,j}^\top = \frac{1}{2}\big(c_j^\top A_i^\top + c_i^\top A_j^\top,\; b_j^\top A_i + b_i^\top A_j\big)$ and $\ell_{i,j} = \frac{1}{2}c_i^\top c_j + \frac{1}{2}b_i^\top b_j$.
Using the finite-sum structure of the Hamiltonian function (11), the stochastic Hamiltonian function takes the following form:
$$H(x) = \frac{1}{n^2}\sum_{i,j=1}^n H_{i,j}(x) = \frac{1}{n^2}\sum_{i,j=1}^n \left[\frac{1}{2}x^\top Q_{i,j}\, x + q_{i,j}^\top x + \ell_{i,j}\right] = \frac{1}{2}x^\top Q x + q^\top x + \ell, \qquad (38)$$
where $Q = \frac{1}{n^2}\sum_{i,j=1}^n Q_{i,j} = \begin{pmatrix} \mathbf{A}\mathbf{A}^\top & 0 \\ 0 & \mathbf{A}^\top\mathbf{A}\end{pmatrix}$ with $\mathbf{A} = \frac{1}{n}\sum_{i=1}^n A_i$, and $q^\top = \frac{1}{n^2}\sum_{i,j=1}^n q_{i,j}^\top$ and $\ell = \frac{1}{n^2}\sum_{i,j=1}^n \ell_{i,j}$.
Writing $Q = \mathrm{L}_Q^\top \mathrm{L}_Q$, the Hamiltonian $H(x) = \frac{1}{2}x^\top Q x + q^\top x + \ell$ can be expressed as the composition $H(x) = \phi(\mathrm{L}_Q x)$, where the function $\phi(y) = \frac{1}{2}\|y\|^2 - (\mathrm{L}_Q x^*)^\top y + \ell$. In addition, note that the function $\phi$ is 1-strongly convex with 1-Lipschitz continuous gradient.

Thus, using Lemma D.1 we have that the Hamiltonian function is an $L_H$-smooth, $\mu_H$-quasi-strongly convex function with constants $L_H = \|\mathrm{L}_Q\|_2^2 = \lambda_{\max}(\mathrm{L}_Q^\top\mathrm{L}_Q) = \lambda_{\max}(Q) = \sigma_{\max}^2(\mathbf{A})$ and $\mu_H = \sigma_{\min}^2(\mathrm{L}_Q) = \lambda_{\min}^+(Q) = \sigma_{\min}^2(\mathbf{A})$.
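These constants are easy to sanity-check numerically: for a random full-rank $\mathbf{A}$, the extreme eigenvalues of $Q$ equal the squared extreme singular values of $\mathbf{A}$ (a toy verification added here for illustration).

```python
import numpy as np

# Q = blockdiag(A A^T, A^T A) has eigenvalues {sigma_i(A)^2}, each twice.
rng = np.random.default_rng(4)
d = 5
A = rng.standard_normal((d, d))
Q = np.block([[A @ A.T, np.zeros((d, d))],
              [np.zeros((d, d)), A.T @ A]])
svals = np.linalg.svd(A, compute_uv=False)
eigs = np.linalg.eigvalsh(Q)
print(np.isclose(eigs.max(), svals.max() ** 2))  # L_H  = sigma_max(A)^2
print(np.isclose(eigs.min(), svals.min() ** 2))  # mu_H = sigma_min(A)^2 (A full rank)
```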
Thus,
$$\begin{aligned}
\|\nabla H(x) - \nabla H(y)\| &\overset{(39)}{=} \left\|\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathrm{J}_i^\top(x)\xi_j(x) - \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathrm{J}_i^\top(y)\xi_j(y)\right\| \\
&= \left\|\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \big[\mathrm{J}_i^\top(x)\xi_j(x) - \mathrm{J}_i^\top(y)\xi_j(y)\big]\right\| \\
&\overset{\text{Jensen}}{\leq} \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \left\|\mathrm{J}_i^\top(x)\xi_j(x) - \mathrm{J}_i^\top(y)\xi_j(y)\right\| \\
&\overset{(*)}{=} \mathbb{E}_i\mathbb{E}_j \left\|\mathrm{J}_i^\top(x)\xi_j(x) - \mathrm{J}_i^\top(y)\xi_j(y)\right\| \\
&= \mathbb{E}_i\mathbb{E}_j \left\|\big[\mathrm{J}_i(x) - \mathrm{J}_i(y)\big]^\top \xi_j(x) + \mathrm{J}_i^\top(y)\big[\xi_j(x) - \xi_j(y)\big]\right\| \\
&\leq \mathbb{E}_i\mathbb{E}_j \left\|\big[\mathrm{J}_i(x) - \mathrm{J}_i(y)\big]^\top \xi_j(x)\right\| + \mathbb{E}_i\mathbb{E}_j \left\|\mathrm{J}_i^\top(y)\big[\xi_j(x) - \xi_j(y)\big]\right\|,
\end{aligned}$$
where in $(*)$ we use that $i$ and $j$ are sampled from the uniform distribution.
PL Condition. To show that the Hamiltonian function satisfies the PL condition (3.2), we use a linear algebra lemma from Abernethy et al. (2019).

Lemma D.2 (Lemma H.2 in Abernethy et al. (2019)). Let $\mathrm{M} = \begin{pmatrix} \mathbf{A} & \mathbf{C} \\ -\mathbf{C}^\top & -\mathbf{B} \end{pmatrix}$, where the matrix $\mathbf{C}$ is a square full-rank matrix. Let
$$c = \big(\sigma_{\min}^2(\mathbf{C}) + \lambda_{\min}(\mathbf{A}^2)\big)\big(\lambda_{\min}(\mathbf{B}^2) + \sigma_{\min}^2(\mathbf{C})\big) - \sigma_{\max}^2(\mathbf{C})\big(\|\mathbf{A}\| + \|\mathbf{B}\|\big)^2$$
and assume that $c > 0$. Here $\lambda_{\min}$ denotes the smallest eigenvalue, and $\sigma_{\min}$ and $\sigma_{\max}$ the smallest and largest singular values, respectively. Then if $\lambda$ is an eigenvalue of $\mathrm{M}\mathrm{M}^\top$, it holds that:
$$\lambda > \frac{\big(\sigma_{\min}^2(\mathbf{C}) + \lambda_{\min}(\mathbf{A}^2)\big)\big(\lambda_{\min}(\mathbf{B}^2) + \sigma_{\min}^2(\mathbf{C})\big) - \sigma_{\max}^2(\mathbf{C})\big(\|\mathbf{A}\| + \|\mathbf{B}\|\big)^2}{2\sigma_{\min}^2(\mathbf{C}) + \lambda_{\min}(\mathbf{A}^2) + \lambda_{\min}(\mathbf{B}^2)}.$$

In addition, note that if there exists $\mu > 0$ such that $\mathrm{J}(x)\mathrm{J}^\top(x) \succeq \mu\mathrm{I}$, then the Hamiltonian function satisfies the PL condition with parameter $\mu$.
Lemma D.3. Let $g(x)$ of min-max problem (3) be a twice differentiable function. If there exists $\mu > 0$ such that $\mathrm{J}(x)\mathrm{J}^\top(x) \succeq \mu\mathrm{I}$ for all $x \in \mathbb{R}^d$, then the Hamiltonian function $H(x)$ (11) satisfies the PL condition (3.2) with parameter $\mu$.

Proof.
$$\frac{1}{2}\|\nabla H(x)\|^2 = \frac{1}{2}\|\mathrm{J}^\top(x)\xi(x)\|^2 = \frac{1}{2}\xi(x)^\top \mathrm{J}(x)\mathrm{J}^\top(x)\xi(x) \overset{\mathrm{J}\mathrm{J}^\top \succeq \mu\mathrm{I}}{\geq} \frac{\mu}{2}\xi(x)^\top\xi(x) = \mu\,\frac{1}{2}\|\xi(x)\|^2 = \mu\,[H(x)] \overset{H(x^*)=0}{=} \mu\,[H(x) - H(x^*)]. \qquad (41)$$
Combining the above two lemmas, we can now show that for sufficiently bilinear games the Hamiltonian function satisfies the PL condition with parameter $\mu_H = \frac{(\delta^2+\rho^2)(\delta^2+\beta^2) - 4L^2\Delta^2}{2\delta^2+\rho^2+\beta^2}$.

Recall that $\mathrm{J} = \nabla\xi = \begin{pmatrix} \nabla^2_{x_1,x_1} g & \nabla^2_{x_1,x_2} g \\ -\nabla^2_{x_2,x_1} g & -\nabla^2_{x_2,x_2} g\end{pmatrix} \in \mathbb{R}^{d\times d}$. Now let $\mathbf{C}(x) = \nabla^2_{x_1,x_2} g(x)$ and note that this is a square full-rank matrix. In particular, by the assumption of sufficiently bilinear games, the cross derivative $\nabla^2_{x_1,x_2} g$ is a full-rank matrix with $0 < \delta \leq \sigma_i\big(\nabla^2_{x_1,x_2} g\big) \leq \Delta$ for all $x \in \mathbb{R}^d$ and for all singular values $\sigma_i$. In addition, since we assume that the sufficiently bilinear condition (10) holds, we can apply Lemma D.2 with $\mathrm{M} = \mathrm{J}$. Since the function $g$ is smooth in $x_1$ and $x_2$, and using the bounds on the singular values of the matrix $\mathbf{C}(x)$, we have that:
$$\mathrm{J}\mathrm{J}^\top \succeq \frac{(\delta^2+\rho^2)(\delta^2+\beta^2) - 4L^2\Delta^2}{2\delta^2+\rho^2+\beta^2}\,\mathrm{I},$$
where $\rho^2 = \min_{x_1,x_2}\lambda_{\min}\big((\nabla^2_{x_1,x_1} g(x_1,x_2))^2\big)$ and $\beta^2 = \min_{x_1,x_2}\lambda_{\min}\big((\nabla^2_{x_2,x_2} g(x_1,x_2))^2\big)$. Using Lemma D.3, it is clear that the Hamiltonian function of sufficiently bilinear games satisfies the PL condition with $\mu_H = \frac{(\delta^2+\rho^2)(\delta^2+\beta^2) - 4L^2\Delta^2}{2\delta^2+\rho^2+\beta^2}$. This completes the proof.
In the rest of this section we present the theorems on the convergence of SGD (Algorithm 4) and L-SVRG (Algorithm 5) for solving the finite-sum problem (23), as presented in the above papers, with some brief comments on their convergence. As we explained above, combining these results with Propositions 4.3 and 4.4 yields the theorems presented in Section 6.

Theorem E.1 (Constant stepsize). Assume $f$ is $\mu$-quasi-strongly convex and that $(f, \mathcal{D}) \sim ES(\mathcal{L})$. Then the iterates of SGD with constant step-size $\gamma^k = \gamma \in (0, \frac{1}{2\mathcal{L}}]$ satisfy:
$$\mathbb{E}\|x^k - x^*\|^2 \leq (1 - \gamma\mu)^k \|x^0 - x^*\|^2 + \frac{2\gamma\sigma^2}{\mu}. \qquad (42)$$
Theorem E.2 (Decreasing stepsizes/switching strategy). Assume $f$ is $\mu$-quasi-strongly convex and that $(f, \mathcal{D}) \sim ES(\mathcal{L})$. Let $K := \mathcal{L}/\mu$ and
$$\gamma^k = \begin{cases} \frac{1}{2\mathcal{L}} & \text{for } k \leq 4\lceil K\rceil \\ \frac{2k+1}{(k+1)^2\mu} & \text{for } k > 4\lceil K\rceil. \end{cases} \qquad (43)$$
If $k \geq 4\lceil K\rceil$, then the iterates of SGD satisfy:
$$\mathbb{E}\|x^k - x^*\|^2 \leq \frac{\sigma^2}{\mu^2}\frac{8}{k} + \frac{16\lceil K\rceil^2}{e^2 k^2}\|x^0 - x^*\|^2. \qquad (44)$$
Theorem E.3. Assume $f$ is $\mu$-quasi-strongly convex. Let the step-size be $\gamma = 1/6L$, let $p \in (0, 1]$ and let $\mathcal{D}$ be the uniform distribution. Then L-SVRG$_I$ given in Algorithm 5 converges to the optimum and satisfies:
$$\mathbb{E}[\Phi^k] \leq \max\left\{1 - \gamma\mu,\; 1 - \frac{p}{2}\right\}^k \Phi^0, \quad \text{where } \Phi^k := \|x^k - x^*\|^2 + \frac{4\gamma^2}{pn}\sum_{i=1}^n \|\nabla f_i(w^k) - \nabla f_i(x^*)\|^2.$$

Note that in the statement of Theorem 6.4 we replace $n$ in the above expression with $n^2$, because the Hamiltonian function has a finite-sum structure with $n^2$ components. As explained before, to obtain Theorem 6.4 of the main paper one can simply combine the above theorem with Proposition 4.3.

We highlight that in Qian et al. (2019) a convergence theorem of L-SVRG for smooth strongly convex functions was presented under the arbitrary sampling paradigm (well-defined distribution $\mathcal{D}$). This result can be trivially extended to capture the class of smooth quasi-strongly convex functions, and as such it can also be used in combination with Proposition 4.3. In this case the step-size becomes $\gamma = 1/6\mathcal{L}$, where $\mathcal{L}$ is the expected smoothness parameter. Using this, one can guarantee linear convergence of L-SVRG, and as a result of L-SVRHG, with a more general distribution $\mathcal{D}$ (beyond uniform sampling). For other well-defined choices of the distribution $\mathcal{D}$ we refer the interested reader to Qian et al. (2019).
E.3. Convergence of Stochastic Optimization Methods for Functions Satisfying the PL Condition

As we have already mentioned, the theorems on the convergence of the stochastic Hamiltonian methods for solving the sufficiently bilinear games can be obtained by combining Proposition 4.4 with existing results on the analysis of SGD and L-SVRG for functions satisfying the PL condition.

In particular, in this subsection we present the main convergence theorems of Gower et al. (2020) for the analysis of SGD for functions satisfying the PL condition, and we explain how we can extend the results of Qian et al. (2019) in order to provide an analysis of L-SVRG with restart.

The main assumption of these theorems is that the function $f$ of problem (23) satisfies the PL condition and that the expected residual condition holds. Note that, again, no assumption on the convexity of the $f_i$ is made.

An important remark that we need to highlight is that all convergence results are presented in terms of the function suboptimality $\mathbb{E}[f(x^k) - f(x^*)]$. When these results are used for the Hamiltonian function, for which we know that $H(x^*) = 0$, they can be written in terms of $\mathbb{E}[H(x^k)]$. This is exactly the quantity for which we show convergence in Theorems 6.5, 6.7 and 6.8.
Theorem E.4 (Constant step-size). Let $f$ be $L$-smooth and satisfy the PL condition (4) with parameter $\mu$. Assume expected residual $ER(\rho)$ and let $\sigma^2 := \mathbb{E}(\|\nabla f_i(x^*)\|^2)$, where $x^* = \arg\min_x f(x)$. Let $\gamma_k = \gamma \leq \frac{\mu}{L(\mu+2\rho)}$ for all $k$. Then the iterates of SGD satisfy:
$$\mathbb{E}[f(x^k) - f^*] \leq (1 - \gamma\mu)^k [f(x^0) - f^*] + \frac{L\gamma\sigma^2}{\mu}. \qquad (45)$$
Theorem E.5 (Decreasing step sizes/switching strategy). Let $f$ be $L$-smooth. Assume expected residual and that $f(x)$ satisfies the PL condition (4). Let $k^* := 2\frac{L}{\mu}\left(1 + 2\frac{\rho}{\mu}\right)$ and
$$\gamma_k = \begin{cases} \frac{\mu}{L(\mu + 2\rho)} & \text{for } k \leq \lceil k^*\rceil \\ \frac{2k+1}{(k+1)^2\mu} & \text{for } k > \lceil k^*\rceil. \end{cases} \qquad (46)$$
If $k \geq \lceil k^*\rceil$, then the iterates of SGD satisfy:
$$\mathbb{E}[f(x^k) - f^*] \leq \frac{4L\sigma^2}{\mu^2}\frac{1}{k} + \frac{(k^*)^2}{k^2 e^2}[f(x^0) - f^*]. \qquad (47)$$
Assumption E.6 is similar to the expected residual condition presented in the main paper. For the case of $\tau$-minibatch sampling, it was shown in Qian et al. (2019) that the parameter $\rho_{nc}$ of the assumption can be upper bounded by
$$\rho_{nc} \leq \frac{n^2-\tau}{(n^2-1)\tau}\,\frac{1}{n}\sum_i L_i^2,$$
where $L_i$ is the smoothness parameter of the function $f_i$.
Under the expected residual Assumption E.6, the following lemma was proven in Qian et al. (2019).

Lemma E.7 (Theorem 5.1 in Qian et al. (2019)). Let $f$ be a nonconvex and smooth function. Let Assumption E.6 be satisfied and let $p \in (0, 1]$. Consider the Lyapunov function $\Psi^k = f(x^k) + \alpha\|x^k - w^k\|^2$, where $\alpha = 3\gamma^2 L\rho_{nc}/p$. If the stepsize $\gamma$ satisfies:
$$\gamma \leq \min\left\{\frac{1}{4L},\; \frac{p^{2/3}}{36^{1/3}(L\rho_{nc})^{1/3}},\; \frac{\sqrt{p}}{\sqrt{6\rho_{nc}}}\right\}, \qquad (49)$$
then the update of L-SVRG (Algorithm 5) satisfies
$$\mathbb{E}_i[\Psi^{k+1}] \leq \Psi^k - \frac{\gamma}{4}\|\nabla f(x^k)\|^2.$$
Having Lemma E.7 in hand, let us now present the main theorem describing the convergence of L-SVRG with restart, presented in Algorithm 6. We run L-SVRG with a step-size γ that satisfies (49) and select the output x^u of the method according to Option II, that is, x^u is chosen uniformly at random from {x^i}_{i=0}^{K}. In this case we denote the method L-SVRG_II(x^0 = w^0, K, γ, p ∈ (0, 1]).
Theorem E.8 (Convergence of Algorithm 6). Let f be an L-smooth function that satisfies the PL condition (4) with parameter µ. Let Assumption E.6 be satisfied and let p ∈ (0, 1]. If the step-size γ satisfies

γ ≤ min{ 1/(4L), p^{2/3}/(36^{1/3}(Lρ_nc)^{1/3}), √p/√(6ρ_nc) }

and K = 4/(µγ), then the update of Algorithm 6 satisfies

E[f(x^t) − f(x*)] ≤ (1/2)^t [f(x^0) − f(x*)],  (50)

and

E‖∇f(x^t)‖² ≤ (1/2)^t ‖∇f(x^0)‖².  (51)
Convergence on function values. The derivation above, (52), shows that the iterates of Algorithm 6 satisfy:

E‖∇f(x^t)‖² ≤ (4/(γK)) E[f(x^{t−1}) − f(x*)].

Substituting the specified value K = 4/(γµ) into the above inequality gives E‖∇f(x^t)‖² ≤ µ E[f(x^{t−1}) − f(x*)], and since the function satisfies the PL condition we have (1/2)‖∇f(x)‖² ≥ µ[f(x) − f(x*)], which means that E[f(x^t) − f(x*)] ≤ (1/(2µ)) E‖∇f(x^t)‖². Thus,

E[f(x^t) − f(x*)] ≤ (1/2) E[f(x^{t−1}) − f(x*)],

and by unrolling the recurrence we obtain (50).
Convergence on norm of the gradient. Similarly to the previous case, using (52), the iterates of Algorithm 6 satisfy:

E‖∇f(x^t)‖² ≤ (4/(γK)) E[f(x^{t−1}) − f(x*)]
           ≤ (4/(γK)) · (1/(2µ)) E‖∇f(x^{t−1})‖²   (by the PL condition (4))
           = (2/(γµK)) E‖∇f(x^{t−1})‖².  (53)

Using the specified value K = 4/(γµ) in the above inequality, we have

E‖∇f(x^t)‖² ≤ (1/2) E‖∇f(x^{t−1})‖²,  (54)

and by unrolling the recurrence we obtain (51).
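To illustrate how Theorem E.8 is used, here is a hedged Python sketch of a restart scheme in the spirit of Algorithm 6: each stage runs L-SVRG for K = 4/(µγ) steps and restarts from an iterate chosen uniformly at random (Option II). All names and the grad_i oracle are our own; this is a sketch under the stated assumptions, not the authors' implementation.

import numpy as np

def l_svrg_option2(grad_i, n, x0, gamma, p, K, rng):
    """One L-SVRG run returning an iterate chosen uniformly at random
    from {x^0, ..., x^K} (Option II)."""
    x, w = x0.copy(), x0.copy()
    full_grad_w = np.mean([grad_i(i, w) for i in range(n)], axis=0)
    iterates = [x.copy()]
    for _ in range(K):
        i = rng.integers(n)
        g = grad_i(i, x) - grad_i(i, w) + full_grad_w
        x = x - gamma * g
        if rng.random() < p:
            w = x.copy()
            full_grad_w = np.mean([grad_i(i, w) for i in range(n)], axis=0)
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]

def restarted_l_svrg(grad_i, n, x0, gamma, p, mu, num_restarts, rng=None):
    """Restarted L-SVRG: each stage runs for K = 4 / (mu * gamma) steps,
    halving E[f - f*] per stage by Theorem E.8."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = int(np.ceil(4.0 / (mu * gamma)))
    x = x0.copy()
    for _ in range(num_restarts):
        x = l_svrg_option2(grad_i, n, x, gamma, p, K, rng)
    return x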
F. Experimental Details
In the experimental section we compare several different algorithms; we provide a short explanation of each of them here:
• SHGD with constant and decreasing step-size: This is Alg. 1 proposed in the paper.
• Biased SHGD: This is a biased version of Alg. 1 that was proposed by Mescheder et al. (2017), where ∇H_{i,j}(x) = (1/2)∇⟨ξ_i(x), ξ_j(x)⟩ is replaced by ∇Ĥ_{i,j}(x) = (1/2)∇‖ξ_i(x) + ξ_j(x)‖². Note that this is a biased estimator of ∇H(x) (see the sketch after this list).
• L-SVRHG with or without restart: This is Alg. 2 proposed in the paper, with Option II for the restarted version and Option I for the version without restart. Restart is not used unless specified.
• CO: This is the Consensus Optimization algorithm proposed in Mescheder et al. (2017). We provide more details in
App. F.5.
• SGDA: This is the stochastic version of the simultaneous gradient descent/ascent algorithm, which uses the update x^{k+1} = x^k − η_k ξ_i(x^k).
• SVRE with restart: This is Alg. 3 described in Chavdarova et al. (2019).
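To make the difference between the two Hamiltonian estimators concrete, here is a minimal numpy sketch for a game whose stochastic vector fields are affine, ξ_i(x) = J_i x + r_i (so the Jacobians are the constant matrices J_i); this affine setup and all names are our own illustration, not the paper's code.

import numpy as np

def xi(Js, rs, i, x):
    """Affine stochastic vector field xi_i(x) = J_i x + r_i (illustrative)."""
    return Js[i] @ x + rs[i]

def grad_H_unbiased(Js, rs, i, j, x):
    """Unbiased estimator: (1/2) * grad <xi_i(x), xi_j(x)>
    = (1/2) * (J_i^T xi_j(x) + J_j^T xi_i(x)), with independent i, j."""
    return 0.5 * (Js[i].T @ xi(Js, rs, j, x) + Js[j].T @ xi(Js, rs, i, x))

def grad_H_biased(Js, rs, i, j, x):
    """Biased estimator (Mescheder et al., 2017):
    (1/2) * grad ||xi_i(x) + xi_j(x)||^2 = (J_i + J_j)^T (xi_i + xi_j)."""
    u = xi(Js, rs, i, x) + xi(Js, rs, j, x)
    return (Js[i] + Js[j]).T @ u

Averaging grad_H_unbiased over independent pairs (i, j) recovers ∇H(x), whereas grad_H_biased does not, because the expectation of the squared norm picks up the variance of the ξ_i.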
In the following sections we provide the details for the different hyper-parameters used in our different experiments.
F.1. Bilinear Experiments

We instantiate the bilinear game (9) with:

n = d = 100,
A_i ∈ R^{d×d}, [A_i]_{kl} = 1 if i = k = l and 0 otherwise,
b_i, c_i ∈ R^d, [b_i]_k, [c_i]_k ∼ N(0, 1/d).
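A short snippet reproducing this setup (a sketch: we read N(0, 1/d) as zero mean and variance 1/d, and the seed is our own choice):

import numpy as np

n = d = 100
rng = np.random.default_rng(0)

# [A_i]_{kl} = 1 if i = k = l, else 0: each A_i has one nonzero diagonal entry.
A = np.zeros((n, d, d))
for i in range(n):
    A[i, i, i] = 1.0  # valid because n = d

# b_i, c_i with entries N(0, 1/d): std = sqrt(1/d) if 1/d is the variance.
b = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))
c = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))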
The hyper-parameters used for the different algorithms are described in Table 1:
Table 1. Hyper-parameters used for the different algorithms in the Bilinear Experiments (section 7.1).
The optimal constant step-size suggested by the theory for SHGD is γ = 1/(2L). In this experiment we have L = 1, so the optimal step-size is 0.5, which is also what we observed in practice. However, while the theory recommends decreasing the step-size after 4⌈K⌉ = 40,000 iterations, we observe in this experiment that the method actually converges faster if we decrease the step-size a bit earlier, after only 10,000 iterations.
F.2. Sufficiently-Bilinear Experiments

We instantiate the sufficiently-bilinear game from Section 7.2 with:

n = d = 100 and δ = 7,
A_i ∈ R^{d×d}, [A_i]_{kl} = 1 if i = k = l and 0 otherwise,

and

F(x) = (1/d) Σ_{k=1}^{d} f(x_k),  where  f(x) = −3(x + π/2) for x ≤ −π/2,  f(x) = −3 cos x for −π/2 < x ≤ π/2,  f(x) = −cos x + 2x − π for x > π/2.  (56)
Note that this game satisfies the sufficiently-bilinear condition as long as δ > 2L, where L is the smoothness constant of F(x); in our case L = 3. Thus we choose δ = 7 so that the sufficiently-bilinear condition is satisfied.
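A small sketch of (56); one can check numerically that the pieces of f match in value and slope at ±π/2 and that |f''| ≤ 3, consistent with L = 3 (names are ours):

import numpy as np

def f(x):
    """Piecewise function from (56): continuously differentiable, |f''| <= 3."""
    x = np.asarray(x, dtype=float)
    return np.where(
        x <= -np.pi / 2, -3.0 * (x + np.pi / 2),
        np.where(x <= np.pi / 2, -3.0 * np.cos(x), -np.cos(x) + 2.0 * x - np.pi),
    )

def F(x):
    """F(x) = (1/d) * sum_k f(x_k)."""
    return np.mean(f(x))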
The hyper-parameters used for the different algorithms are described in Table 2:
Table 2. Hyper-parameters used for the different algorithms in the sufficiently-bilinear experiments (section 7.2).
F.3. GANs
In this section we present the details of the GAN experiments. We first present the different problems we try to solve.
satGAN solves the following problem:

min_{µ,σ} max_{φ₀,φ₁,φ₂} (1/n) Σ_{i=1}^{n} [ log(sigmoid(φ₀ + φ₁ y_i + φ₂ y_i²)) + log(1 − sigmoid(φ₀ + φ₁(µ + σz_i) + φ₂(µ + σz_i)²)) ]
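A direct numpy transcription of this objective (a sketch; the parameter packing and function names are our own):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sat_gan_objective(mu, sigma, phi, y, z):
    """satGAN value: the discriminator is the quadratic logit
    phi0 + phi1*u + phi2*u^2; the generator produces mu + sigma*z."""
    phi0, phi1, phi2 = phi
    def logit(u):
        return phi0 + phi1 * u + phi2 * u ** 2
    fake = mu + sigma * z
    return np.mean(np.log(sigmoid(logit(y)))
                   + np.log(1.0 - sigmoid(logit(fake))))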
All discriminator and generator parameters are initialized randomly from U(−1, 1). The data is set as y_i ∼ N(0, 1), z_i ∼ N(0, 1). We run all experiments 10 times (with seeds 1, 2, . . . , 10).
The hyper-parameters used for the different algorithms are described in Table 3:
Table 3. Hyper-parameters used for the different algorithms in the GAN Experiments (section 7.3).
ALGORITHMS | STEP-SIZE γ_k | PROBABILITY p | SAMPLE SIZE | MINI-BATCH SIZE
As per Mescheder et al. (2017), we used λ = 10 in all the experiments and the biased estimator of the Hamiltonian gradient ∇Ĥ_{i,j}(x) = (1/2)∇‖ξ_i(x) + ξ_j(x)‖². Note that we also tried the unbiased estimator proposed in Section 5.1 but found no significant difference in our results, and thus we only include results for the original algorithm proposed by Mescheder et al. (2017), which uses the biased estimator.
• SHGD: We can write H_{i,j}(x) = (1/2)⟨ξ_i(x), ξ_j(x)⟩; thus at every iteration we need to compute two gradients, ξ_i(x) and ξ_j(x), which leads to a cost of 2 per iteration.
• Biased SHGD: The biased estimate is based on Ĥ_{i,j}(x) = (1/2)‖ξ_i(x) + ξ_j(x)‖², which also requires the computation of two gradients and thus also has a cost of 2 per iteration.
• L-SVRHG: At every iteration we need to compute two Hamiltonian updates, which cost 2 each, and with probability p we need to compute the full Hamiltonian, which costs 2n (see App. F.4). This leads to an expected cost of 4 + p · 2n per iteration.
• SVRE: At each iteration SVRE performs an extrapolation step and an update step; both require the evaluation of two gradients, and with probability p we need to compute the full gradient, which costs n. This leads to a total expected cost of 4 + p · n per iteration (these costs are tabulated in Table 4 and in the small helper after this list).
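These expected costs can be tabulated with a trivial helper (our own, mirroring Table 4):

def expected_cost_per_iter(method, n, p):
    """Expected number of gradient computations per iteration (see Table 4)."""
    costs = {
        "SHGD": 2,
        "Biased SHGD": 2,
        "L-SVRHG": 4 + p * 2 * n,
        "SVRE": 4 + p * n,
    }
    return costs[method]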
Table 4. Number of the gradient computations per iteration for the different algorithms compared in the paper.
G. Additional Experiments
In this section we provide additional experiments that we could not include in the main paper. These experiments provide further observations on the behavior of our proposed methods in different settings.
G.1. Bilinear and Sufficiently-Bilinear Games with Positive Definite Matrices

In all the experiments presented in the paper, the matrices A_i have a particular structure: they are very sparse, and the matrix A = (1/n) Σ_{i=1}^{n} A_i is the identity. We thus propose here to compare the methods on the bilinear game (9) and the sufficiently-bilinear game from Section 7.2 but with different matrices A_i. We choose the A_i to be random symmetric positive definite matrices. For the sufficiently-bilinear experiments we choose δ such that the sufficiently-bilinear condition is satisfied. We show the results in Fig. 3; we observe results very similar to those of Sections 7.1 and 7.2: the experiments again show that our proposed methods closely follow the theory and that L-SVRHG is the fastest method to converge.
(a) Bilinear game (b) Sufficiently-bilinear game
Figure 3. Results for the bilinear game and the sufficiently-bilinear game with symmetric positive definite matrices (H(x^k)/H(x^0) against the number of samples, ×10²). We observe results very similar to those of Sections 7.1 and 7.2: the experiments again show that our proposed methods closely follow the theory and that L-SVRHG is the fastest method to converge.
If a game satisfies the interpolation condition, then SHGD with constant step-size converges linearly to the solution.
In the bilinear game (9) and the sufficiently-bilinear game from Section 7.2, if we set b_i = c_i = 0 for all i, then both problems satisfy the interpolation condition. We provide additional experiments in this particular setting, comparing SHGD with constant step-size, Biased SHGD, and L-SVRHG. We show the results in Fig. 4. We observe that all methods converge linearly to the solution; surprisingly, in this setting Biased SHGD converges much faster than all other methods. We argue that this is because Biased SHGD optimizes an upper bound on the Hamiltonian. Indeed, using Jensen's inequality, we can show that:
H(x) = (1/2)‖ξ(x)‖² = (1/2)‖(1/n) Σ_{i=1}^{n} ξ_i(x)‖² ≤ (1/(2n)) Σ_{i=1}^{n} ‖ξ_i(x)‖² = (1/n) Σ_{i=1}^{n} (1/2)‖ξ_i(x)‖² = (1/n) Σ_{i=1}^{n} H_i(x),  (59)

where the inequality follows from Jensen's inequality.
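The inequality (59) is easy to verify numerically; a quick check with random stand-ins for the ξ_i(x) at a fixed x (our own toy setup):

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
xis = rng.normal(size=(n, d))  # stand-ins for xi_i(x) at a fixed x

H = 0.5 * np.linalg.norm(xis.mean(axis=0)) ** 2          # (1/2) ||xi(x)||^2
H_bar = np.mean(0.5 * np.linalg.norm(xis, axis=1) ** 2)  # (1/n) sum_i H_i(x)
assert H <= H_bar + 1e-12  # Jensen's inequality (59)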
If the interpolation condition is satisfied, then at the optimum x* the inequality becomes an equality:

H(x*) = (1/n) Σ_{i=1}^{n} H_i(x*) = 0.  (60)
Thus in this particular setting Biased SHGD also converges to the solution. Furthermore, notice that because the A_i are very sparse, ∇H_{i,j}(x) = 0 for all i ≠ j. Thus most of the time SHGD will not update the current iterate, which is not the case for Biased SHGD, which only uses the terms ∇H_{i,i}(x) for its update and thus always has signal. The convergence of SHGD could therefore be improved by using non-uniform sampling; we leave this for future work.
(a) Bilinear game (b) Sufficiently-bilinear game
Figure 4. Results for the bilinear game and sufficiently-bilinear game when b_i = c_i = 0 for all i (‖x^k − x*‖²/‖x^0 − x*‖² and H(x^k)/H(x^0) against the number of samples, ×10², for SHGD, Biased SHGD, and L-SVRHG). We observe that all the methods converge linearly in this setting. Surprisingly, Biased SHGD is the fastest method to converge; we give a brief informal explanation of why this is the case above.
G.2. GANs
We present the missing experiments for satGAN (with batch size 100) in Figure 5. As can be observed, the results for satGAN are very similar to the results for nsGAN (see Figure 2).
[Figure 5: results for satGAN. Left: H(x^k)/H(x^0); right: generator L2 distance to the optimum, both against the number of samples (×10⁷); the legend includes CO and SGDA.]