0% found this document useful (0 votes)
4 views

Stochastic Hamiltonian Gradient Methods For Smooth Games

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

Stochastic Hamiltonian Gradient Methods For Smooth Games

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 31

Stochastic Hamiltonian Gradient Methods for Smooth Games

Nicolas Loizou 1 Hugo Berard 1 2 Alexia Jolicoeur-Martineau 1 Pascal Vincent† 1 2 Simon Lacoste-Julien† 1
Ioannis Mitliagkas† 1

Abstract for every x1 ∈ Rd1 and x2 ∈ Rd2 . We call point, x∗ , a sad-


The success of adversarial formulations in ma- dle point, min-max solution or Nash equilibrium of (1). In
chine learning has brought renewed motivation its general form, this problem is hard. In this work we focus
on the simplest family of problems where some important
arXiv:2007.04202v1 [cs.LG] 8 Jul 2020

for smooth games. In this work, we focus on the


class of stochastic Hamiltonian methods and pro- questions are still open: the case where all stationary points
vide the first convergence guarantees for certain are global min-max solutions.
classes of stochastic smooth games. We propose a Motivated by recent applications in machine learning, we
novel unbiased estimator for the stochastic Hamil- are particularly interested in cases where the objective, g, is
tonian gradient descent (SHGD) and highlight its naturally expressed as a finite sum
benefits. Using tools from the optimization lit- n
erature we show that SHGD converges linearly 1X
min max g(x1 , x2 ) = gi (x1 , x2 ) (3)
to the neighbourhood of a stationary point. To x1 ∈Rd1 x2 ∈Rd2 n i=1
guarantee convergence to the exact solution, we
analyze SHGD with a decreasing step-size and where each component function gi : Rd1 × Rd2 → R is
we also present the first stochastic variance re- assumed to be smooth. Indeed, in problems like domain gen-
duced Hamiltonian method. Our results provide eralization (Albuquerque et al., 2019), generative adversarial
the first global non-asymptotic last-iterate con- networks (Goodfellow et al., 2014), and some formulations
vergence guarantees for the class of stochastic in reinforcement learning (Pfau & Vinyals, 2016), empirical
unconstrained bilinear games and for the more risk minimization yields finite sums of the form of (3). We
general class of stochastic games that satisfy a refer to this formulation as a stochastic smooth game.1 We
“sufficiently bilinear” condition, notably including call problem (1) a deterministic game.
some non-convex non-concave problems. We sup- The deterministic version of the problem has been stud-
plement our analysis with experiments on stochas- ied in a number of classic (Korpelevich, 1976; Nemirovski,
tic bilinear and sufficiently bilinear games, where 2004) and recent results (Mescheder et al., 2017; Ibrahim
our theory is shown to be tight, and on simple et al., 2019; Gidel et al., 2018; Daskalakis et al., 2018; Gidel
adversarial machine learning formulations. et al., 2019; Mokhtari et al., 2020; Azizian et al., 2020a;b)
in various settings. Importantly, the majority of these results
provide last-iterate convergence guarantees. In contrast,
1. Introduction for the stochastic setting, guarantees on the classic extra-
We consider the min-max optimization problem gradient method and its variants rely on iterate averaging
over compact domains (Nemirovski, 2004). However, Chav-
min max g(x1 , x2 ) (1) darova et al. (2019) highlighted a possibility of pathological
x1 ∈Rd1 x2 ∈Rd2
behavior where the iterates diverge towards and then ro-
where g : Rd1 × Rd2 → R is a smooth objective. Our goal tate near the boundary of the domain, far from the solution,
is to find x∗ = (x∗1 , x∗2 )> ∈ Rd where d = d1 + d2 such while their average is shown to converge to the solution (by
that convexity).2 This behavior is also problematic in the context
g(x∗1 , x2 ) ≤ g(x∗1 , x∗2 ) ≤ g(x1 , x∗2 ), (2) of applying the method on non-convex problems, where av-
eraging do not necessarily yield a solution (Daskalakis et al.,
1
Mila, Université de Montréal † Canada CIFAR AI Chair
2 1
Facebook AI Research. Correspondence to: Nicolas Loizou We note that all of our results except the one on variance
<[email protected]>. reduction do not require the finite-sum assumption and can be
easily adapted to the stochastic setting (see Appendix C.3).
Proceedings of the 37 th International Conference on Machine 2
This is qualitatively very different to stochastic minimization
Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by where the iterates converge towards a neighborhood of the solution
the author(s). and averaging is only used to stabilize the method.
Stochastic Hamiltonian Gradient Methods for Smooth Games

2018; Abernethy et al., 2019). It is only very recently that 2. Further Related work
last-iterate convergence guarantees over a non-compact
domain appeared in literature for the stochastic problem In recent years, several second-order methods have been
(Palaniappan & Bach, 2016; Chavdarova et al., 2019; Hsieh proposed for solving the min-max optimization problem (1).
et al., 2019; Mishchenko et al., 2020) under the assumption Some of them require the computation or inversion of a
of strong monotonicity. Strong monotonicity, a generaliza- Jacobian which is a highly inefficient operation (Wang et al.,
tion of strong convexity for general operators, seems to be 2019; Mazumdar et al., 2019). In contrast, second-order
an essential condition for fast convergence in optimization. methods like the ones presented in Mescheder et al. (2017);
Here, we make no strong monotonicity assumption. Balduzzi et al. (2018); Abernethy et al. (2019) and in this
work are more efficient as they only rely on the computation
The algorithms we consider belong to a recently intro- of a Jacobian-vector product in each step.
duced family of computationally-light second order methods
which in each step require the computation of a Jacobian- Abernethy et al. (2019) provide the first last-iterate con-
vector product. Methods that belong to this family are the vergence rates for the deterministic Hamiltonian gradient
consensus optimization (CO) method (Mescheder et al., descent (HGD) for several classes of games including games
2017) and Hamiltonian gradient descent (Balduzzi et al., satisfying the sufficiently bilinear condition. The authors
2018; Abernethy et al., 2019). Even though some con- briefly touch upon the stochastic setting and by using the
vergence results for these methods are known for the de- convergence results of Karimi et al. (2016), explain how a
terministic problem, there is no available analysis for the stochastic variant of HGD with decreasing stepsize behaves.
stochastic problem. We close this gap. We study stochastic Their approach was purely theoretical and they did not pro-
Hamiltonian gradient descent (SHGD), and propose the first vide an efficient way of selecting the unbiased estimators
stochastic variance reduced Hamiltonian method, named L- of the gradient of the Hamiltonian. In addition, they as-
SVRHG. Our contributions are summarized as follows: sumed bounded gradient of the Hamiltonian function which
is restrictive for functions satisfying the Polyak-Lojasiewicz
• Our results provide the first set of global non-asymptotic (PL) condition (Gower et al., 2020). In this work we provide
last-iterate convergence guarantees for a stochastic game the first efficient variants and analysis of SHGD. We did
over a non-compact domain, in the absence of strong that by choosing practical unbiased estimator of the full
monotonicity assumptions. gradient and by using the recently proposed assumptions
• The proposed stochastic Hamiltonian methods use novel of expected smoothness (Gower et al., 2019) and expected
unbiased estimators of the gradient of the Hamiltonian residual (Gower et al., 2020) in our analysis. The proposed
function. This is an essential point for providing conver- theory of SHGD allow us to obtain as a corollary tight con-
gence guarantees. Existing practical variants of SHGD vergence guarantees for the deterministic HGD recovering
use biased estimators (Mescheder et al., 2017). the result of Abernethy et al. (2019) for the sufficiently
• We provide the first efficient convergence analysis of bilinear games.
stochastic Hamiltonian methods. In particular, we focus In another line of work, Carmon et al. (2019) analyze vari-
on solving two classes of stochastic smooth games: ance reduction methods for constrained finite-sum problems
– Stochastic Bilinear Games. and Ryu et al. (2019) provide an ODE-based analysis and
– Stochastic games satisfying the “sufficiently bilin- guarantees in the monotone but potentially non-smooth case.
ear” condition or simply Stochastic Sufficiently Bi- Chavdarova et al. (2019) show that both alternate stochas-
linear Games. The deterministic variant of this class tic descent-ascent and stochastic extragradient diverge on
of games was firstly introduced by Abernethy et al. an unconstrained stochastic bilinear problem. In the same
(2019) to study the deterministic problem and no- paper, Chavdarova et al. (2019) propose the stochastic vari-
tably includes some non-monotone problems. ance reduced extragradient (SVRE) algorithm with restart,
which empirically achieves last-iterate convergence on this
• For the above two classes of games, we provide conver-
problem. However, it came with no theoretical guarantees.
gence guarantees for SHGD with a constant step-size (lin-
In Section 7, we observe in our experiments that SVRE is
ear convergence to a neighborhood of stationary point),
slower than the proposed L-SVRHG for both the stochastic
SHGD with a variable step-size (sub-linear convergence
bilinear and sufficiency bilinear games that we tested.
to a stationary point) and L-SVRHG. For the latter, we
guarantee a linear rate. In concurrent work, Yang et al. (2020) provide global conver-
• We show the benefits of the proposed methods by per- gence guarantees for stochastic alternate gradient descent-
forming numerical experiments on simple stochastic bilin- ascent (and its variance reduction variant) for a subclass
ear and sufficiently bilinear problems, as well as toy GAN of nonconvex-nonconcave objectives satisfying a so-called
problems for which the optimal solution is known. Our two-sided Polyak-Lojasiewicz inequality, but this does not
numerical findings corroborate our theoretical results. include the stochastic bilinear problem that we cover.
Stochastic Hamiltonian Gradient Methods for Smooth Games

3. Technical Preliminaries expected residual condition if there exists ρ > 0 such that
for all x ∈ Rd ,
In this section, we present the necessary background and
the basic notation used in the paper. We also describe the h
2
i
update rule of the deterministic Hamiltonian method. Ei k∇fi (x) − ∇fi (x∗ ) − (∇f (x) − ∇f (x∗ ))k
≤ 2ρ (f (x) − f (x∗ )) . (6)
3.1. Optimization Background: Basic Definitions
We start by presenting some definitions that we will later 3.2. Smooth Min-Max Optimization
use in the analysis of the proposed methods.
We use standard notation used previously in Mescheder
Definition 3.1. Function f : Rd → R is µ–quasi-strongly et al. (2017); Balduzzi et al. (2018); Abernethy et al. (2019);
convex if there exists a constant µ > 0 such that ∀x ∈ Rd : Letcher et al. (2019).
2
f ∗ ≥ f (x) + h∇f (x), x∗ − xi + µ2 kx∗ − xk , where f ∗
∗ Let x = (x1 , x2 )> ∈ Rd be the column vector obtained by
is the minimum value of f and x is the projection of x
stacking x1 and x2 one on top of the other. With ξ(x) :=
onto the solution set X ∗ minimizing f . >
(∇x1 g, −∇x2 g) , we denote the signed vector of partial
derivatives evaluated at point x. Thus, ξ(x) : Rd → Rd is a
Definition 3.2. We say that a function satisfies the Polyak-
vector function. We use
Lojasiewicz (PL) condition if there exists µ > 0 such that  2
∇2x1 ,x2 g

∇x1 ,x1 g
1 J = ∇ξ = ∈ Rd×d
k∇f (x)k2 ≥ µ [f (x) − f ∗ ] ∀x ∈ Rd , (4) −∇2x2 ,x1 g −∇2x2 ,x2 g
2
where f ∗ is the minimum value of f . to denote the Jacobian of the vector function ξ. Note
that using the above notation, the simultaneous gradient
An analysis of several stochastic optimization methods un- descent/ascent (SGDA) update can be written simply as:
der the assumption of PL condition (Polyak, 1987) was xk+1 = xk − ηk ξ(xk ).
recently proposed in Karimi et al. (2016). A function can Definition 3.6. The objective function g of problem (1)
satisfy the PL condition and not be strongly convex, or is Lg -smooth if there exist Lg > 0 such that:
even convex. However, if the function is µ−quasi strongly kξ(x) − ξ(y)k ≤ Lg kx − yk ∀x, y ∈ Rd .
convex then it satisfies the PL condition with the same µ
(Karimi et al., 2016). We also say that g is L-smooth in x1 (in x2 ) if
k∇x1 g(x1 , x2 ) − ∇x1 g(x01 , x2 )k ≤ Lkx1 − x01 k (if
Definition 3.3. Function f : Rd → R is L-smooth if k∇x2 g(x1 , x2 ) − ∇x2 g(x1 , x02 )k ≤ Lkx2 − x02 k)
there exists L > 0 such that: for all x1 , x01 ∈ Rd1 (for all x2 , x02 ∈ Rd2 ).
k∇f (x) − ∇f (y)k ≤ Lkx − yk ∀x, y ∈ Rd .
Pn Definition 3.7. A stationary point of function f : Rd →
If f = n1 i=1 fi (x), then a more refined analysis of R is a point x∗ ∈ Rd such that ∇f (x∗ ) = 0. Using the
stochastic gradient methods has been proposed under new above notation, in min-max problem (1), point x∗ ∈ Rd
notions of smoothness. In particular, the notions of expected is a stationary point when ξ(x∗ ) = 0.
smoothness (ES) and expected residual (ER) have been in-
troduced and used in the analysis of SGD in Gower et al.
As mentioned in the introduction, in this work we focus on
(2019) and Gower et al. (2020) respectively. ES and ER are
smooth games satisfying the following assumption.
generic and remarkably weak assumptions. In Section 6 and
Appendix B.2, we provide more details on their generality. Assumption 3.8. The objective function g of problem (3)
We state their definitions below. has at least one stationary point and all of its stationary
Definition 3.4 (Expected smoothness, P (Gower et al., points are global min-max solutions.
n
2019)). We say that the function f = n1 i=1 fi (x) sat-
isfies the expected smoothness condition if there exists With Assumption 3.8, we can guarantee convergence to a
L > 0 such that for all x ∈ Rd , min-max solution of problem (3) by proving convergence to
a stationary point. This assumption is true for several classes
of games including strongly convex-strongly concave and
h i
2
Ei k∇fi (x) − ∇fi (x∗ )k ≤ 2L(f (x) − f (x∗ )), (5)
convex-concave games. However, it can also be true for
some classes of non-convex non-concave games (Abernethy
Definition 3.5 (Expected residual,
Pn(Gower et al., 2020)). et al., 2019). In Section 4, we describe in more details
We say that the function f = n1 i=1 fi (x) satisfies the the two classes of games that we study. Both satisfy this
assumption.
Stochastic Hamiltonian Gradient Methods for Smooth Games

3.3. Deterministic Hamiltonian Gradient Descent Assumption 4.1. Functions gi : Rd1 × Rd2 → R of
problem (3) are twice differentiable, Li -smooth with Si -
Hamiltonian gradient descent (HGD) has been proposed as Lipschitz Jacobian. That is, for each i ∈ [n] there are
an efficient method for solving min-max problems in Bal- constants Li > 0 and Si > 0 such that kξi (x) − ξi (y)k ≤
duzzi et al. (2018). To the best of our knowledge, the first Li kx − yk and kJi (x) − Ji (y)k ≤ Si kx − yk for all
convergence analysis of the method is presented in Aber- x, y ∈ Rd .
nethy et al. (2019) where the authors prove non-asymptotic
linear last-iterate convergence rates for several classes of
4.1. Classes of Stochastic Games
games.
In particular, HGD converges to saddle points of problem Here we formalize the two families of stochastic smooth
(1) by performing gradient descent on a particular objec- games under study: (i) stochastic bilinear, and (ii) stochastic
tive function H, which is called the Hamiltonian function sufficiently bilinear. Both families satisfy Assumption 3.8.
(Balduzzi et al., 2018), and has the following form: Interestingly, the latter family includes some non-convex
non-concave games, i.e. non-monotone problems.
1
min H(x) = kξ(x)k2 . (7)
x 2 Stochastic Bilinear Games. A stochastic bilinear game
is the stochastic smooth game (3) in which function g has
That is, HGD is a gradient descent method that minimizes the following structure:
the square norm of the gradient ξ(x). Note that under As- n
sumption 3.8, solving problem (7) is equivalent to solving 1X >
x1 bi + x> >

g(x1 , x2 ) = 1 Ai x2 + ci x2 . (9)
problem (1). The equivalence comes from the fact that H n i=1
only achieves its minimum at stationary points. The up-
While this game appears simple, standard methods diverge
date rule of HGD can be expressed using a Jacobian-vector
on it (Chavdarova et al., 2019) and L-SVRHG gives the first
product (Balduzzi et al., 2018; Abernethy et al., 2019):
stochastic method with last-iterate convergence guarantees.
xk+1 = xk − ηk ∇H(x) = xk − ηk J> ξ ,
 
(8)
Stochastic sufficiently bilinear games. A game of the
making HGD a second-order method. However, as dis- form (3) is called stochastic sufficiently bilinear if it satisfies
cussed in Balduzzi et al. (2018), the Jacobian-vector prod- the following definition.
uct can be efficiently evaluated in tasks like training neural
networks and the computation time of the gradient and the Definition 4.2. Let Assumption 4.1 be satisfied and let
Jacobian-vector product is comparable (Pearlmutter, 1994). the objective function g of problem (3) be L-smooth
in x1 and L-smooth in x2 . Assume that a constant
C > 0 exists, such that Ei kξi (x)k < C. Assume the
4. Stochastic Smooth Games and Stochastic cross derivative ∇2x1 ,x2 g be full rank with 0 < δ ≤
Hamiltonian Function

σi ∇2x1 ,x2 g ≤ ∆ for all x ∈ Rd and for all singular
 2
In this section, we provide the two classes of stochastic values σi . Let ρ2 = minx1 ,x2 λmin ∇2x1 ,x1 g(x1 , x2 )
 2
games that we study. We define the stochastic counterpart and β 2 = minx1 ,x2 λmin ∇2x2 ,x2 g(x1 , x2 ) . Finally let
to the Hamiltonian function as a step towards solving prob- the following condition to be true:
lem (3) and present its main properties.
(δ 2 + ρ2 )(δ 2 + β 2 ) − 4L2 ∆2 > 0. (10)
Let us start by presenting theP basic notation for the stochas-
n
tic setting. Let ξ(x) = n1 i=1 ξi (x), where ξi (x) := Note that the definition of the stochastic sufficiently bilinear
>
(∇x1 gi , −∇x2 gi ) , for all i ∈ [n] and let game has no restriction on the convexity of functions gi (x)
n  2 and g(x). The most important condition that needs to be
∇2x1 ,x2 gi

1X ∇x1 ,x1 gi satisfied is the expression in equation (10). Following the
J= Ji , where Ji = .
n −∇2x2 ,x1 gi −∇2x2 ,x2 gi terminology of Abernethy et al. (2019), we call the con-
i=1
dition (10): “sufficiently bilinear” condition. Later in our
Using the above notation, the stochastic variant of SGDA
numerical evaluation, we present stochastic non convex-non
can be written as xk+1 = xk −ηk ξi (xk ) where Ei [ξi (xk )] =
concave min-max problems that satisfy condition (10).
ξ(xk ).3
We highlight that the deterministic counterpart of the above
In this work, we focus on stochastic smooth games of the
game was first proposed in Abernethy et al. (2019). The
form (3) that satisfy the following assumption.
deterministic variant of Abernethy et al. (2019) can be ob-
3
Here the expectation is over the uniform distribution. That is, tained as special case of the above class of games when
Ei [ξi (x)] = n1 n
P
i=1 ξi (x). n = 1 in problem (3).
Stochastic Hamiltonian Gradient Methods for Smooth Games

4.2. Stochastic Hamiltonian Function Algorithm 1 Stochastic Hamiltonian Gradient Descent


(SHGD)
Having presented the two main classes of stochastic smooth
games, in this section we focus on the structure of the Input: Starting stepsize γ 0 > 0. Choose initial points
stochastic Hamiltonian function and highlight some of its x0 ∈ Rd . Distribution D of samples.
properties. for k = 0, 1, 2, · · · , K do
Generate fresh samples i ∼ D and j ∼ D and evaluate
∇Hi,j (xk ).
Finite-Sum Structure Hamiltonian Function. Having
Set step-size γ k following one of the selected choices
the objective function g of problem (3) to be stochastic
(constant, decreasing)
and in particular to be a finite-sum function, leads to the
Set xk+1 = xk − γ k ∇Hi,j (xk )
following expression for the Hamiltonian function:
end for
n
1 X 1
H(x) = hξi (x), ξj (x)i . (11)
n2 i,j=1 |2 {z } 5.1. Unbiased Estimator
Hi,j (x)
One of the most important elements of stochastic gradient-
based optimization algorithms for solving finite-sum prob-
That is, the Hamiltonian function H(x) can be expressed as
lems of the form (11) is the selection of unbiased estimators
a finite-sum with n2 components.
of the full gradient ∇H(x) in each step. In our proposed
optimization algorithms for solving (11), at each step we
Properties of the Hamiltonian Function. As we will see use the gradient of only one component function Hi,j (x):
in the following sections, the finite-sum structure of the
stochastic Hamiltonian function (11) allows us to use popu- 1 >
J ξ j + J>

∇Hi,j (x) = j ξi . (12)
lar stochastic optimization problems for solving problem (7). 2 i
However in order to be able to provide convergence guar- It can easily be shown that this selection is an unbiased
antees of the proposed stochastic Hamiltonian methods, we estimator of ∇H(x). That is, Ei,j [∇Hi,j (x)] = ∇H(x).
need to show that the stochastic Hamiltonian function (11)
satisfies specific properties for the two classes of games we 5.2. Stochastic Hamiltonian Gradient Descent (SHGD)
study. This is what we do in the following two proposi-
tions. Stochastic gradient descent (SGD) (Robbins & Monro,
1951; Nemirovski & Yudin, 1978; 1983; Nemirovski et al.,
Proposition 4.3. For stochastic bilinear games of the
2009; Hardt et al., 2016; Gower et al., 2019; 2020; Loizou
form (9), the stochastic Hamiltonian function (11) is
et al., 2020) is the workhorse for training supervised ma-
a smooth quadratic µH –quasi-strongly convex function
2 2 chine learning problems. In Algorithm 1, we apply SGD
with constantsPLH = σmax (A) and µH = σmin (A)
1 n to (11), yielding stochastic Hamiltonian gradient descent
where A = n i=1 Ai and σmax and σmin are the maxi-
(SHGD) for solving problem (3). Note that at each step,
mum and minimum non-zero singular values of A.
i ∼ D and j ∼ D are sampled from a given well-defined
distribution D and then are used to evaluate ∇Hi,j (xk ) (un-
Proposition 4.4. For stochastic sufficiently bilinear biased estimator of the full gradient). In our analysis, we
games, the stochastic Hamiltonian function (11) is a provide rates for two selections of step-sizes for SHGD.
LH = S̄C + L̄2 smooth function and satisfies the PL These are the constant step-size γ k = γ and the decreasing
2 2
)(δ 2 +β 2 )−4L2 ∆2
condition (4) with µH = (δ +ρ 2δ 2 +ρ2 +β 2 . Here step-size (switching rule which describe when one should
S̄ = Ei [Si ] and L̄ = Ei [Li ]. switch from a constant to a decreasing stepsize regime).

5.3. Loopless Stochastic Variance Reduced


5. Stochastic Hamiltonian Gradient Methods
Hamiltonian Gradient (L-SVRHG)
In this section we present the proposed stochastic Hamil-
One of the main disadvantage of Algorithm 1 with constant
tonian methods for solving the stochastic min-max prob-
step-size selection is that it guarantees linear convergence
lem (3). Our methods could be seen as extensions of pop-
only to a neighborhood of the min-max solution x∗ . As we
ular stochastic optimization methods into the Hamiltonian
will present in Section 6, the decreasing step-size selection
setting. In particular, the two algorithms that we build upon
allow us to obtain exact convergence to the min-max but at
are the popular stochastic gradient descent (SGD) and the re-
the expense of slower rate (sublinear).
cently introduced loopless stochastic variance reduced gradi-
ent (L-SVRG). For completeness, we present their form for One of the most remarkable algorithmic breakthroughs in re-
solving finite-sum optimization problems in Appendix A. cent years was the development of variance-reduced stochas-
Stochastic Hamiltonian Gradient Methods for Smooth Games

Algorithm 2 Loopless Stochastic Variance Reduced Hamil- Algorithm 3 L-SVRHG (with Restart)
tonian Gradient (L-SVRHG) Input: Starting stepsize γ > 0. Choose initial points
Input: Starting stepsize γ > 0. Choose initial points x0 = w0 ∈ Rd . Distribution D of samples. Probability
x0 = w0 ∈ Rd . Distribution D of samples. Probability p ∈ (0, 1], T
p ∈ (0, 1] for t = 0, 1, 2, · · · , T do
for k = 0, 1, 2, · · · , K − 1 do Set xt+1 = L-SVRHGII (xt , K, γ, p ∈ (0, 1])
Generate fresh samples i ∼ D and j ∼ D and evaluate end for
∇Hi,j (xk ). Output: The last iterate xT .
Evaluate g k = ∇Hi,j (xk ) − ∇Hi,j (wk ) + ∇H(wk ).
Set xk+1 = xk − γg k
Set restarted variant of Alg. 2, presented in Alg. 3, which calls
( at each step Alg. 2 with the second option of output, that is
k+1 xk with probability p L-SVRHGII . Using the property from Proposition 4.4 that
w =
wk with probability 1 − p the Hamiltonian function (11) satisfy the PL condition 3.2,
we show that Alg. 3 converges linearly to the solution of the
end for sufficiently bilinear game (Theorem 6.8).
Output:
Option I: The last iterate x = xk .
Option II: x is chosen uniformly at random from {xi }K 6. Convergence Analysis
i=0 .
We provide theorems giving the performance of the previ-
ously described stochastic Hamiltonian methods for solving
tic gradient algorithms for solving finite-sum optimization the two classes of stochastic smooth games: stochastic bi-
problems. These algorithms, by reducing the variance of linear and stochastic sufficiently bilinear. In particular, we
the stochastic gradients, are able to guarantee convergence present three main theorems for each one of these classes
to the exact solution of the optimization problem with faster describing the convergence rates for (i) SHGD with con-
convergence than classical SGD. For example, for smooth stant step-size, (ii) SHGD with decreasing step-size and (iii)
strongly convex functions, variance reduced methods can L-SVRHG and its restart variant (Algorithm 3).
guarantee linear convergence to the optimum. This is a vast
The proposed results depend on the two main parameters
improvement on the sub-linear convergence of SGD with
µH , LH evaluated in Propositions 4.3 and 4.4. In addition,
decreasing step-size. In the past several years, many effi-
the theorems related to the bilinear games (the Hamiltonian
cient variance-reduced methods have been proposed. Some
function is quasi-strongly convex) use the expected smooth-
popular examples of variance reduced algorithms are SAG
ness constant L (5), while the theorems related to the suffi-
(Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG
ciently bilinear games (the Hamiltonian function satisfied
(Johnson & Zhang, 2013) and SARAH (Nguyen et al., 2017).
the PL condition) use the expected residual constant ρ (6).
For more examples of variance reduced methods in different
We note that the expected smoothness and expected residual
settings, see Defazio (2016); Konečný et al. (2016); Gower
constants can take several values according to the well-
et al. (2018); Sebbouh et al. (2019).
defined distributions D selected in our algorithms and the
In our second method Algorithm 2, we propose a vari- proposed theory will still hold (Gower et al., 2019; 2020).
ance reduced Hamiltonian method for solving (3). Our
As a concrete example, in the case of τ -minibatch sam-
method is inspired by the recently introduced and well
pling,4 the expected smoothness and expected residual pa-
behaved variance reduced algorithm, Loopless-SVRG (L-
rameters take the following values:
SVRG) first proposed in Hofmann et al. (2015); Kovalev
et al. (2020) and further analyzed under different settings n2 (τ −1) n2 −τ
in Qian et al. (2019); Gorbunov et al. (2020); Khaled et al. L(τ ) = τ (n2 −1) LH + τ (n2 −1) Lmax (13)
2
(2020). We name our method loopless stochastic variance −τ
ρ(τ ) = Lmax (nn2 −1)τ (14)
reduced Hamiltonian gradient (L-SVRHG). The method
works by selecting at each step the unbiased estimator where Lmax = max{1,...,n2 } {LHi,j } is the maximum
g k = ∇Hi,j (xk ) − ∇Hi,j (wk ) + ∇H(wk ) of the full gra- smoothness constant of the functions Hi,j . By using the
dient. As we will prove in the next section, this method expressions (13) and (14), it is easy to see that for single
guarantees linear convergence to the min-max solution of element sampling where τ = 1 (the one we use in our ex-
the stochastic bilinear game (9).
4
In each step we draw uniformly at random τ components of
To get a linearly convergent algorithm in the more general the n2 possible choices of the stochastic Hamiltonian function (11).
setup of sufficiently bilinear games 4.2, we had to propose a For more details on the τ -minibatch sampling see Appendix B.2.
Stochastic Hamiltonian Gradient Methods for Smooth Games

periments) L = ρ = Lmax . On the other limit case where a lection of step-size L-SVRHG convergences linearly to a
full-batch is used (τ = n2 ), that is we run the deterministic min-max solution.
Hamiltonian gradient descent, these values become L = LH Theorem 6.4 (L-SVRHG). Let us have the stochastic bi-
and ρ = 0 and as we explain below, the proposed theorems linear game (9). Let step-size γ = 1/6LH and p ∈ (0, 1].
include the convergence of the deterministic method as spe- Then L-SVRHG with Option I for output as given in Al-
cial case. gorithm 2 convergences linearly to the min-max solution
x∗ and satisfies:
6.1. Stochastic Bilinear Games  k
µ p
We start by presenting the convergence of SHGD with con- E[Φk ] ≤ max 1 − ,1 − Φ0
6LH 2
stant step-size and explain how we can also obtain an anal-
ysis of the HGD (8) as special case. Then we move to the 4γ 2 Pn
where Φk := kxk − x∗ k2 + pn2 i,j=1 k∇Hi,j (wk ) −
convergence of SHGD with decreasing step-size and the
L-SVRHG where we are able to guarantee convergence to ∇Hi,j (x∗ )k2 .
a min-max solution x∗ . In the results related to SHGD we
2
use σ 2 := Ei,j [k∇Hi,j (x∗ )k ] to denote the finite gradient 6.2. Stochastic Sufficiently-Bilinear Games
noise at the solution.
As in the previous section, we start by presenting the con-
Theorem 6.1 (Constant stepsize). Let us have the stochas- vergence of SHGD with constant step-size and explain how
tic bilinear game (9). Then iterates of SHGD with constant we can obtain an analysis of the HGD (8) as special case.
1
step-size γ k = γ ∈ (0, 2L ] satisfy: Then we move to the convergence of SHGD with decreasing
step-size and the L-SVRHG (with restart) where we are able
2γσ 2
k
Ekxk − x∗ k2 ≤ (1 − γµH ) kx0 − x∗ k2 + . (15) to guarantee linear convergence to a min-max solution x∗ .
µ In contrast to the results on bilinear games, the convergence
guarantees of the following theorems are given in terms of
That is, Theorem 6.1 shows linear convergence to a neigh-
the Hamiltonian function E[H(xk )]. In all theorems we call
borhood of the min-max solution. Using Theorem 6.1 and
“sufficiently-bilinear game” the game described in Defini-
following the approach of Gower et al. (2019), we can obtain 2
tion 4.2. With σ 2 := Ei,j [k∇Hi,j (x∗ )k ], we denote the
the following corollary on the convergence of deterministic
finite gradient noise at the solution.
Hamiltonian gradient descent (HGD) (8). Note that for the
deterministic case σ = 0 and L = L (13). Theorem 6.5. Let us have a stochastic sufficiently-
bilinear game. Then the iterates of SHGD with constant
Corollary 6.2. Let us have a deterministic bilinear game. µ
1 steps-size γ k = γ ≤ L(µ+2ρ) satisfy:
Then the iterates of HGD with step-size γ = 2L satisfy:
k LH γσ 2
kxk − x∗ k2 ≤ (1 − γµH ) kx0 − x∗ k2 (16) k
E[H(xk )] ≤ (1 − γµH ) [H(x0 )] + . (19)
µH
To the best of our knowledge, Corollary 6.2 provides the
first linear convergence guarantees for HGD in terms of Using the above Theorem and by following the approach of
kxk − x∗ k2 (Abernethy et al. (2019) gave guarantees only Gower et al. (2020), we can obtain the following corollary
on H(xk )). Let us now select a decreasing step-size rule on the convergence of deterministic Hamiltonian gradient
(switching strategy) that guarantees a sublinear convergence descent (HGD) (8). It shows linear convergence of HGD to
to the exact min-max solution for the SHGD. the min-max solution. Note that for the deterministic case
Theorem 6.3 (Decreasing stepsizes/switching strategy). σ = 0 and ρ = 0 (14).
Let us have the stochastic bilinear game (9). Let K := Corollary 6.6. Let us have a deterministic sufficiently-
L/ µH . Let bilinear game. Then the iterates of HGD with step-size
 γ = L1H satisfy:
1
 2L for k ≤ 4dKe
k k
γ = (17) H(xk ) ≤ (1 − γµH ) H(x0 ) (20)
 2k+1 for k > 4dKe.
2
(k+1) µH
The result of Corollary 6.6 is equivalent to the conver-
If k ≥ 4dKe, then SHGD given in Algorithm 1 satisfy: gence of HGD as proposed in Abernethy et al. (2019).
σ2 8 16dKe2
Ekxk − x∗ k2 ≤ µ2H k
+ e2 k2 kx
0
− x∗ k2 . (18) Let us now show that with decreasing step-size (switching
strategy), SHGD can converge (with sub-linear rate) to the
Lastly, in the following theorem, we show under what se- min-max solution.
Stochastic Hamiltonian Gradient Methods for Smooth Games

Theorem 6.7 (Decreasing stepsizes/switching strategy). We show the convergence of the different algorithms in
Let us have
 a stochastic
 sufficiently-bilinear game. Let Fig. 1a. As predicted by theory, SHGD with decreasing
∗ L ρ
k := 2 µ 1 + 2 µ and step-size converges at a sublinear rate while L-SVRHG
converges at a linear rate. Among all the methods we com-
pared to, L-SVRHG is the fastest to converge; however, the

 LH (µµHH+2ρ) for k ≤ dk ∗ e
γk = (21) speed of convergence depends a lot on parameter p. We
 2k+1
(k+1)2 µH for k > dk ∗ e. observe that setting p = 1/n yields the best performance.
To further illustrate the behavior of the Hamiltonian meth-
If k ≥ dk ∗ e, then SHGD given in Algorithm 1 satisfy: ods, we look at the trajectory of the methods on a simple
4LH σ 2 1 (k∗ )2 2D version of the bilinear game, where we choose x1 and
E[H(xk )] ≤ µ2H k
+ 0
k2 e2 [H(x )]. x2 to be scalars. We observe that while previously proposed
methods such as SGDA and SVRE suffer from rotations
which slow down their convergence and can even make them
In the next Theorem we show how the updates of L-SVRHG diverge, the Hamiltonian methods converge much faster by
with Restart (Algorithm 3) converges linearly to the min- removing rotation and converging “straight” to the solution.
max solution. We highlight that each step t of Alg. 3 requires
K = µH4 γ updates of the L-SVRHG. 7.2. Sufficiently-Bilinear Games
Theorem 6.8 (L-SVRHG with Restart). Let us have a In section 6.2, we showed that Hamiltonian methods are also
2/3
o Let p ∈ (0, 1] and
stochasticnsufficiently-bilinear game.
√ guaranteed to converge when the problem is non-convex
p
γ ≤ min 4L1H , 361/3p(LH ρ)1/3 , √6ρ and let K = µH4 γ . non-concave but satisfies the sufficiently-bilinear condi-
Then the iterates of L-SVRHG (with Restart) given in tion (10). To illustrate these results, we propose to look
Algorithm 3 satisfies at the following game inspired by Abernethy et al. (2019):
t n
E[H(xt )] ≤ (1/2) [H(x0 )]. 1X
min max F (x1 ) + δ x>
1 A i x2 +
x1 ∈Rd x2 ∈Rd n i=1
7. Numerical Evaluation b> >

i x1 + ci x2 − F (x2 ) , (22)

In this section, we compare the algorithms proposed in this where F (x) is a non-linear function (see details in Ap-
paper to existing methods in the literature. Our goal is to pendix F.2). This game is non-convex non-concave and
illustrate the good convergence properties of the proposed satisfies the sufficiently-bilinear condition if δ > 2L, where
algorithms as well as to explore how these algorithms be- L is the smoothness of F (x). Thus, the results and theorems
have in settings not covered by the theory. We propose from Section 6.2 hold.
to compare the following algorithms: SHGD with con-
stant step-size and decreasing step-size, a biased version Results are shown in Fig.1b. Similarly to the bilinear case,
of SHGD (Mescheder et al., 2017), L-SVRHG with and the methods follow very closely the theory. We highlight
without restart, consensus optimization (CO)5 (Mescheder that while the proposed theory for this setting only guar-
et al., 2017), the stochastic variant of SGDA, and finally antees convergence for L-SVRHG with restart, in practice
the stochastic variance-reduced extragradient with restart using restart is not strictly necessary: L-SVRHG with the
SVRE proposed in (Chavdarova et al., 2019). For all our ex- correct choice of stepsize also converges in our experiment.
periments, we ran the different algorithms with 10 different Finally we show the trajectories of the different methods on
seeds and plot the mean and 95% confidence intervals. We a 2D version of the problem. We observe that contrary to the
provide further details about the experiments and choice of bilinear case, stochastic SGDA converges but still suffers
hyperparameters for the different methods in Appendix F. from rotation compared to Hamiltonian methods.

7.1. Bilinear Games 7.3. GANs

First we compare the different methods on the stochastic In previous experiments, we verify the proposed theory for
bilinear problem (9). Similarly to Chavdarova et al. (2019), the stochastic bilinear and sufficiently-bilinear games. Al-
we choose n = d1 = d2 = 100, [Ai ]kl = 1 if i = k = l though we do not have theoretical results for more complex
and 0 otherwise, and [bi ]k , [ci ]k ∼ N (0, 1/n). games, we wanted to test our algorithms on a simple GAN
setting, which we call GaussianGAN.
5
CO is a mix between SGDA and SHGD, with the follow-
ing update rule xk+1 = xk − ηk (ξi (xk ) + λ∇Hi,j (xk )) (See In GaussianGAN, we have a dataset of real data xreal and
Appendix F.5) latent variable z from a normal distribution with mean
Stochastic Hamiltonian Gradient Methods for Smooth Games

100 1.0 100 10.0


0.5 7.5
10 2 0.0 10 2 5.0
0.5 2.5
||x0 x * ||2
||xk x * ||2

1.0 0.0

H(x0)
H(xk)
10 4 10 4

x2

x2
1.5 SHGD (constant step-size) 2.5
SHGD (constant step-size) 2.0 SHGD (decreasing step-size)
10 6 SHGD (decreasing step-size) SGDA SVRE 10 6 Biased SHGD 5.0 SGDA SVRE
Biased SHGD 2.5 SHGD (constant step-size) Starting point x0 L-SVRHG 7.5 SHGD (constant step-size) Starting point x0
L-SVRHG SHGD (decreasing step-size) Optimum point x * L-SVRHG with restart SHGD (decreasing step-size) Optimum point x *
SVRE 3.0 L-SVRHG SVRE 10.0 L-SVRHG with restart
10 8 10 8
0 200 400 600 800 1000 3 2 1 0 1 0 1000 2000 3000 4000 5000 10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0
Num of samples 1e2 x1 Num of samples 1e2 x1

(a) Bilinear game (b) Sufficiently-bilinear game


∗ 2
Figure 1. a) Comparison of different methods on the stochastic bilinear game (9). Left: Distance to optimality ||x k −x ||
||x0 −x∗ ||2
as a function of
the number of samples seen during training. Right: The trajectory of the different methods on a 2D version of the problem.
b) Comparison of different methods on the sufficiently bilinear games (22). Left: The Hamiltonian H(x k)
H(x0 )
as a function of the number of
samples seen during training. Right: The trajectory of the different methods on a 2D version of the problem.

0 and standard deviation 1. The generator is defined as For WGAN, we see that stochastic SGDA fails to converge
G(z) = µ + σz and the discriminator as D(xdata ) = and that L-SVRHG is the only method to converge linearly
φ0 + φ1 xdata + φ2 x2data , where xdata is either real data on the Hamiltonian. For satGAN, SGDA seems to perform
(xreal ) or fake generated data (G(z)). In this setting, the best. Algorithms that take into account the Hamiltonian
parameters are x = (x1 , x2 ) = ([µ, σ], [φ0 , φ1 , φ2 ]). In have high variance. We looked at individual runs and found
GaussianGAN, we can directly measure the L2 distance that, in 3 out of 10 runs, the algorithms other than stochas-
between the generator’s parameters and the true optimal tic SGDA fail to converge, and the Hamiltonian does not
parameters: ||µ̂ − µ|| + ||σ̂ − σ||, where µ̂ and σ̂ are the significantly decrease over time. While WGAN is guaran-
sample’s mean and standard deviation. teed to have a unique critical point, which is the solution
of the game, this is not the case for satGAN and nsGAN
We consider three possible minmax games: Wasserstein
due to the non-linear component. Thus, as expected, As-
GAN (WGAN) (Arjovsky et al., 2017), saturating GAN
sumption 3.8 is very important in order for the proposed
(satGAN) (Goodfellow et al., 2014), and non-saturating
stochastic Hamiltonian methods to perform well.
GAN (nsGAN) (Goodfellow et al., 2014). We present the
results for WGAN and satGAN in Figure 2. We provide the
nsGAN results in Appendix G.2 and details for the different 8. Conclusion and Extensions
experiments in Appendix F.3.
We introduce new variants of SHGD (through novel unbi-
ased estimator and step-size selection) and present the first
101
variance reduced Hamiltonian method L-SVRHG. Using
100
Generator L2 distance to optimum

10−2 100
tools from optimization literature, we provide convergence
10−4
10−1
guarantees for the two methods and we show how they can
H(x0)
H(xk)

10−6
10−2
efficiently solve stochastic unconstrained bilinear games and
10−8
CO CO

10 −10
SGDA
SHGD (constant step-size)
10−3
SGDA
SHGD (constant step-size)
the more general class of games that satisfy the “sufficiently
L-SVRHG L-SVRHG
10−12
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
bilinear condition. An important result of our analysis is
Number of samples 1e7 Number of samples 1e7

the first set of global non-asymptotic last-iterate conver-


(a) Hamiltonian for WGAN (b) Distance to optimum for gence guarantees for a stochastic game over a non-compact
WGAN
domain, in the absence of strong monotonicity assumptions.
101
CO CO
100 SGDA SGDA
We believe that our results and the Hamiltonian viewpoint
Generator L2 distance to optimum

SHGD (constant step-size) SHGD (constant step-size)


100
L-SVRHG L-SVRHG
10−2

10−4 10−1
could work as a first step in closing the gap between the
H(x0)
H(xk)

10−6
10−2 stochastic optimization algorithms and methods for solving
10−8

10 −10
10−3 stochastic games and can open up many avenues for further
10−12
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
10−4
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
development and research in both areas. A natural extension
Number of samples 1e7 Number of samples 1e7
of our results will be the proposal of accelerated Hamil-
(c) Hamiltonian for satGAN (d) Distance to optimum for tonian methods that use momentum (Loizou & Richtárik,
satGAN 2017; Assran & Rabbat, 2020) on top of the Hamiltonian
gradient update. We speculate that similar ideas to the
Figure 2. The Hamiltonian H(x k)
H(x0 )
(left) and the distance to the
optimal generator (right) as a function of the number of samples
ones presented in this work can be used for the develop-
seen during training for WGAN and satGAN. The distance to the ment of efficient decentralized methods (Assran et al., 2019;
optimal generator corresponds to ||||µ̂−µ k ||+||σ̂−σk ||
. Koloskova et al., 2020) for solving problem (3).
µ̂−µ0 ||+||σ̂−σ0 ||
Stochastic Hamiltonian Gradient Methods for Smooth Games

Acknowledgements Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Train-
ing gans with optimism. In ICLR, 2018.
The authors would like to thank Reyhane Askari, Gauthier
Gidel and Lewis Liu for useful discussions and feedback. Defazio, A. A simple practical accelerated method for finite
Nicolas Loizou acknowledges support by the IVADO post- sums. In NeurIPS, 2016.
doctoral funding program. This work was partially sup- Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A
ported by the FRQNT new researcher program (2019- fast incremental gradient method with support for non-
NC-257943), the NSERC Discovery grants (RGPIN-2017- strongly convex composite objectives. In NeurIPS, 2014.
06936 and RGPIN-2019-06512) and the Canada CIFAR AI
chairs program. Ioannis Mitliagkas acknowledges support Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-
by an IVADO startup grant and a Microsoft Research collab- Julien, S. A variational inequality perspective on genera-
orative grant. Simon Lacoste-Julien acknowledges support tive adversarial networks. In ICLR, 2018.
by a Google Focused Research award. Simon Lacoste-
Julien and Pascal Vincent are CIFAR Associate Fellows in Gidel, G., Hemmat, R. A., Pezeshki, M., Le Priol, R., Huang,
the Learning in Machines & Brains program. G., Lacoste-Julien, S., and Mitliagkas, I. Negative mo-
mentum for improved game dynamics. In AISTATS, 2019.
References Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Abernethy, J., Lai, K. A., and Wibisono, A. Last-iterate con- Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y.
vergence rates for min-max optimization. arXiv preprint Generative adversarial nets. In NeurIPS, 2014.
arXiv:1906.02027, 2019.
Gorbunov, E., Hanzely, F., and Richtárik, P. A unified theory
Albuquerque, I., Monteiro, J., Falk, T. H., and Mitliagkas, of sgd: Variance reduction, sampling, quantization and
I. Adversarial target-invariant representation learning for coordinate descent. In AISTATS, 2020.
domain generalization. arXiv preprint arXiv:1911.00804,
Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-
2019.
gradient methods: Variance reduction via Jacobian sketch-
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein ing. arxiv:1805.02632, 2018.
generative adversarial networks. In ICML, 2017.
Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A.,
Assran, M. and Rabbat, M. On the convergence of nesterov’s Shulgin, E., and Richtárik, P. SGD: General analysis
accelerated gradient method in stochastic settings. arXiv and improved rates. In ICML, 2019.
preprint arXiv:2002.12414, 2020.
Gower, R. M., Sebbouh, O., and Loizou, N. SGD for struc-
Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochas- tured nonconvex functions: Learning rates, minibatch-
tic gradient push for distributed deep learning. ICML, ing and interpolation. arXiv preprint arXiv:2006.10311,
2019. 2020.

Azizian, W., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. Hardt, M., Recht, B., and Singer, Y. Train faster, generalize
A tight and unified analysis of gradient-based methods for better: stability of stochastic gradient descent. In ICML,
a whole spectrum of differentiable games. In AISTATS, 2016.
2020a.
Hofmann, T., Lucchi, A., Lacoste-Julien, S., and
Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., McWilliams, B. Variance reduced stochastic gradient
and Gidel, G. Accelerating smooth games by manipulat- descent with neighbors. In NeurIPS, 2015.
ing spectral shapes. AISTATS, 2020b.
Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos,
Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, P. On the convergence of single-call stochastic extra-
K., and Graepel, T. The mechanics of n-player differen- gradient methods. In NeurIPS, 2019.
tiable games. In ICML, 2018.
Ibrahim, A., Azizian, W., Gidel, G., and Mitliagkas, I.
Carmon, Y., Jin, Y., Sidford, A., and Tian, K. Variance Linear lower bounds and conditioning of differentiable
reduction for matrix games. In NeurIPS, 2019. games. arXiv preprint arXiv:1906.07300, 2019.

Chavdarova, T., Gidel, G., Fleuret, F., and Lacoste-Julien, Johnson, R. and Zhang, T. Accelerating stochastic gradient
S. Reducing noise in gan training with variance reduced descent using predictive variance reduction. In NeurIPS,
extragradient. In NeurIPS, 2019. 2013.
Stochastic Hamiltonian Gradient Methods for Smooth Games

Karimi, H., Nutini, J., and Schmidt, M. Linear conver- Necoara, I., Nesterov, Y., and Glineur, F. Linear convergence
gence of gradient and proximal-gradient methods under of first order methods for non-strongly convex optimiza-
the Polyak-łojasiewicz condition. In ECML-PKDD, 2016. tion. Math. Program., pp. 1–39, 2018.

Khaled, A., Sebbouh, O., Loizou, N., Gower, R. M., and Nemirovski, A. Prox-method with rate of convergence o
Richtrik, P. Unified analysis of stochastic gradient meth- (1/t) for variational inequalities with lipschitz continuous
ods for composite convex and smooth optimization. arXiv monotone operators and smooth convex-concave saddle
preprint arXiv:2006.11573, 2020. point problems. SIAM Journal on Optimization, 15(1):
229–251, 2004.
Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and
Stich, S. U. A unified theory of decentralized SGD with Nemirovski, A. and Yudin, D. B. On Cezari’s convergence
changing topology and local updates. arXiv preprint of the steepest descent method for approximating saddle
arXiv:2003.10422, 2020. point of convex-concave functions. Soviet Mathetmatics
Doklady, 19, 1978.
Konečný, J., Liu, J., Richtárik, P., and Takáč, M. Mini-batch
semi-stochastic gradient descent in the proximal setting. Nemirovski, A. and Yudin, D. B. Problem complexity and
IEEE Journal of Selected Topics in Signal Processing, 10 method efficiency in optimization. Wiley Interscience,
(2):242–255, 2016. 1983.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Ro-
Korpelevich, G. The extragradient method for finding saddle
bust stochastic approximation approach to stochastic pro-
points and other problems. Matecon, 12:747–756, 1976.
gramming. SIAM Journal on Optimization, 19(4):1574–
Kovalev, D., Horváth, S., and Richtárik, P. Dont jump 1609, 2009.
through hoops and remove those loops: SVRG and Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P.,
Katyusha are better without the outer loop. In Algorithmic Scheinberg, K., and Takáč, M. SGD and hogwild! Con-
Learning Theory, 2020. vergence without the bounded gradients assumption. In
ICML, 2018.
Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foer-
ster, J. N., Tuyls, K., and Graepel, T. Differentiable game Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. Sarah:
mechanics. Journal of Machine Learning Research, 20 a novel method for machine learning problems using
(84):1–40, 2019. stochastic recursive gradient. In ICML, 2017.
Loizou, N. and Richtárik, P. Momentum and stochastic Palaniappan, B. and Bach, F. Stochastic variance reduction
momentum for stochastic gradient, Newton, proximal methods for saddle-point problems. In NeurIPS, 2016.
point and subspace descent methods. arXiv preprint
arXiv:1712.09677, 2017. Pearlmutter, B. A. Fast exact multiplication by the hessian.
Neural computation, 6(1):147–160, 1994.
Loizou, N., Vaswani, S., Laradji, I., and Lacoste-Julien,
Pfau, D. and Vinyals, O. Connecting generative adversar-
S. Stochastic polyak step-size for SGD: An adap-
ial networks and actor-critic methods. arXiv preprint
tive learning rate for fast convergence. arXiv preprint
arXiv:1610.01945, 2016.
arXiv:2002.10542, 2020.
Polyak, B. Introduction to optimization. translations series
Mazumdar, E. V., Jordan, M. I., and Sastry, S. S. On finding in mathematics and engineering. Optimization Software,
local nash equilibria (and only local nash equilibria) in 1987.
zero-sum games. arXiv preprint arXiv:1901.00838, 2019.
Qian, X., Qu, Z., and Richtárik, P. L-SVRG and L-
Mescheder, L., Nowozin, S., and Geiger, A. The numerics Katyusha with arbitrary sampling. arXiv preprint
of gans. In NeurIPS, 2017. arXiv:1906.01481, 2019.
Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., Robbins, H. and Monro, S. A stochastic approximation
and Malitsky, Y. Revisiting stochastic extragradient. In method. The Annals of Mathematical Statistics, pp. 400–
AISTATS, 2020. 407, 1951.
Mokhtari, A., Ozdaglar, A., and Pattathil, S. A unified Ryu, E. K., Yuan, K., and Yin, W. ODE analysis of
analysis of extra-gradient and optimistic gradient methods stochastic gradient methods with optimism and anchor-
for saddle point problems: Proximal point approach. In ing for minimax problems and GANs. arXiv preprint
AISTATS, 2020. arXiv:1905.10899, 2019.
Stochastic Hamiltonian Gradient Methods for Smooth Games

Schmidt, M., Le Roux, N., and Bach, F. Minimizing fi-


nite sums with the stochastic average gradient. Math.
Program., 162(1-2):83–112, 2017.
Sebbouh, O., Gazagnadou, N., Jelassi, S., Bach, F., and
Gower, R. Towards closing the gap between the theory
and practice of SVRG. In NeurIPS, 2019.

Wang, Y., Zhang, G., and Ba, J. On solving minimax opti-


mization locally: A follow-the-ridge approach. In ICLR,
2019.
Yang, J., Kiyavash, N., and He, N. Global conver-
gence and variance-reduced optimization for a class
of nonconvex-nonconcave minimax problems. arXiv
preprint arXiv:2002.09621, 2020.
Supplementary Material
Stochastic Hamiltonian Gradient Methods for Smooth Games

In the Appendix we present the proofs of the main Propositions and Theorems proposed in the main paper together with
additional experiments on different Bilinear and sufficiently bilinear games.
In particular in Section A, we start by presenting the pseudo-codes of the stochastic optimization algorithms SGD and
L-SVRG based on which we build our stochastic Hamiltonian methods. In Section B we provide more details on the
assumptions and definitions used in the main paper. In Section D we present the proofs of the two main propositions and
in Section E we explain how these propositions can be combined with existing convergence results in order to obtain the
Theorems of Section 6. Finally in Sections F and G we present the experimental details and provide additional experiments.

A. Stochastic Optimization Algorithms


In this section we present the pseudocodes of SGD and L-SVRG for solving the finite-sum optimization problem:
" n
#
1X
min f (x) = fi (x) . (23)
x∈Rd n i=1

Algorithm 4 Stochastic Gradient Descent (SGD)


Input: Starting stepsize γ 0 > 0. Choose initial points x0 ∈ Rd . Distribution D of samples.
for k = 0, 2, · · · , K do
Generate fresh sample i ∼ D and evaluate ∇fi (xk ).
Set step-size γ k following one of the selected choices (constant, decreasing)
Set xk+1 = xk − γ k ∇fi (xk )
end for
Output: The last iterate xk .

Algorithm 5 Loopless Stochastic Variance Reduced Gradient (L-SVRG)


Input: Starting stepsize γ > 0. Choose initial points x0 = w0 ∈ Rd . Distribution D of samples. Probability p ∈ (0, 1].
for k = 0, 2, · · · , K do
Generate fresh sample i ∼ D evaluate ∇fi (xk ).
Evaluate g k = ∇fi (xk ) − ∇fi (wk ) + ∇f (wk ).
Set xk+1 = xk − γg k
Set (
k+1 xk with probability p
w =
wk with probability 1 − p
end for
Output:
Option I: The last iterate x = xk .
Option II: x is chosen uniformly at random from {xi }K
i=0 .
Stochastic Hamiltonian Gradient Methods for Smooth Games

Algorithm 6 L-SVRG (with Restart)


Input: Starting stepsize γ > 0. Choose initial points x0 = w0 ∈ Rd . Distribution D of samples. Probability p ∈ (0, 1],
T
for t = 0, 1, 2, · · · , T do
Set xt+1 = L-SVRGII (xt , K, γ, p ∈ (0, 1])
end for
Output: The last iterate xT .

B. Connections of Main Assumptions and Definitions


As we mentioned above SGD (Algorithm 4) and L-SVRG (Algorithm 5) are popular methods for solving the stochastic
optimization problem (23). Several convergence analyses of the two algorithms have been proposed under different
assumptions on the functions f and fi . In this section we describe in more details the assumptions used in the analysis of
the stochastic Hamiltonian methods in the main paper.

B.1. On Quasi-strong convexity and PL condition


In Section 3.1 we present the definitions of quasi-strong convexity and the PL condition and later in Section 4.2 we explain
that for the two classes of games (Bilinear and Sufficiently bilinear) the stochastic Hamiltonian function (11) satisfies one
of these conditions. Here using Karimi et al. (2016) we explain the connection between these conditions and the more
well-known definition of strong convexity.
Definition B.1 (Strong Convexity). A differentiable function f : Rn → R, is µ-strongly convex, if there exists a constant
µ > 0 such that ∀x, y ∈ Rn :
µ 2
f (x) ≥ f (y) + h∇f (y), x − yi + kx − yk (24)
2

In particular the following connection hold:


SC ⊆ QSC ⊆ P L, (25)
where SC denotes the class of strongly convex functions, QSC the class of quasi-strongly convex (Definition 3.1) and P L
the class of functions satisfy the PL condition (Definition 3.2). For more details on the connections of the µ parameter
between these methods we refer the reader to Karimi et al. (2016) and Necoara et al. (2018).

B.2. On Smoothness and Expected Smoothness / Expected Residual


In Section 3.1 we present the definitions of Expected Smoothness (ES) and Expected Residual (ER). In the main theoretical
results of Section 6 we also use the expected smoothness parameter L and the expected residual parameter ρ to provide the
convergence guarantees of SHGD and L-SVRHG. In this section we provide more details on these assumptions as presented
in Gower et al. (2019; 2020).
As explained in Gower et al. (2019; 2020) expected smoothness and expected residual are assumptions that combine both
the properties of the distribution D of drawing samples and the smoothness properties of function f . In particular, ES and
ER can be seen as two different ways to measure how far the gradient estimate ∇fi (x) is from the true gradient ∇f (x)
where i ∼ D.
ES was first used for the analysis of SGD in Gower et al. (2019) for solving stochastic optimization problems of the form (23), where the objective function f is assumed to be µ–quasi-strongly convex (see Definition 3.1). Later, in Gower et al. (2020), a similar analysis of SGD was proposed for functions satisfying the PL condition. As explained in Gower et al. (2020), assuming ES in the analysis of SGD for functions satisfying the PL condition is not ideal, as it does not allow the recovery of the best known dependence on the condition number for deterministic Gradient Descent (full batch). For this reason, Gower et al. (2020) used the notion of Expected Residual (ER) in the proposed analysis and explained its benefits.
In both, Gower et al. (2019) and (Gower et al., 2020), the ES and ER assumptions have been used in combination with the
arbitrary sampling paradigm. That is, the proposed theorems of Gower et al. (2019; 2020) that describe the convergence of
SGD include an infinite array of variants of SGD as special cases. Each one of these variants is associated with a specific
probability law governing the data selection rule used to form minibatches.

B.2.1. Formal Definitions


Let us present the definitions of ES and ER as given in Gower et al. (2019) and Gower et al. (2020). In Gower et al. (2019; 2020), to allow for any form of minibatching, the arbitrary sampling notation was used. That is,

    ∇f_v(x) := (1/n) ∑_{i=1}^n v_i ∇f_i(x),   (26)

where v ∈ R^n_+ is a random sampling vector such that E[v_i] = 1, for i = 1, . . . , n, and f_v(x) := (1/n) ∑_{i=1}^n v_i f_i(x). Note that it follows immediately from this definition of sampling vector that E[∇f_v(x)] = (1/n) ∑_{i=1}^n E[v_i] ∇f_i(x) = ∇f(x).

In addition, note that using the notion of arbitrary sampling the update rule of SGD is simply: x^{k+1} = x^k − γ^k ∇f_v(x^k).
Under the notion of arbitrary sampling the expected smoothness assumption (Gower et al., 2019) and the expected residual
assumption (Gower et al., 2020) take the following form (generalization of the definitions presented in Section 3.1).
Assumption B.2 (Expected Smoothness (ES)). We say that f is L–smooth in expectation with respect to a distribution D if there exists L = L(f, D) > 0 such that

    E_D[‖∇f_v(x) − ∇f_v(x*)‖²] ≤ 2L (f(x) − f(x*)),   (27)

for all x ∈ R^d. For simplicity, we will write (f, D) ∼ ES(L) to say that expected smoothness holds.

Assumption B.3 (Expected Residual (ER)). We say that f satisfies the expected residual assumption if there exists ρ = ρ(f, D) > 0 such that

    E_D[‖∇f_v(x) − ∇f_v(x*) − (∇f(x) − ∇f(x*))‖²] ≤ 2ρ (f(x) − f(x*)),   (28)

for all x ∈ R^d. For simplicity, we will write (f, D) ∼ ER(ρ) to say that expected residual holds.

As we explain in Section 6, in this work we focus on τ-minibatch sampling, where in each step we select uniformly at random a minibatch of size τ ∈ [n²] (recall that the Hamiltonian function (11) has n² components). However, we highlight that the proposed analysis of the stochastic Hamiltonian methods holds for any form of sampling vector, following the results presented in Gower et al. (2019; 2020) for the case of SGD and Qian et al. (2019) for the case of L-SVRG methods, including importance sampling variants.
Let us provide a formal definition of the τ-minibatch sampling when τ ∈ [n].

Definition B.4 (τ-Minibatch sampling). Let τ ∈ [n]. We say that v ∈ R^n is a τ–minibatch sampling if for every subset S ⊆ [n] with |S| = τ we have that P[v = (n/τ) ∑_{i∈S} e_i] = 1/(n choose τ) := τ!(n − τ)!/n!.

It is easy to verify by using a double counting argument that if v is a τ–minibatch sampling, it is also a valid sampling vector (E[v_i] = 1) (Gower et al., 2019).
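To make Definition B.4 concrete, here is a small NumPy check (our own illustration, not part of the paper) that a τ-minibatch sampling vector v satisfies E[v_i] = 1 for every i, so that ∇f_v(x) is indeed unbiased.

    import numpy as np

    def tau_minibatch_vector(n, tau, rng):
        S = rng.choice(n, size=tau, replace=False)   # uniformly random subset, |S| = τ
        v = np.zeros(n)
        v[S] = n / tau                               # v = (n/τ) Σ_{i∈S} e_i
        return v

    rng = np.random.default_rng(0)
    n, tau = 10, 3
    mean_v = np.mean([tau_minibatch_vector(n, tau, rng) for _ in range(200000)], axis=0)
    print(np.allclose(mean_v, np.ones(n), atol=1e-2))  # E[v_i] ≈ 1 for every i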
Let f(x) = (1/n) ∑_{i=1}^n f_i(x) with each f_i being L_i–smooth, f being L-smooth, and L_max = max_{i∈{1,...,n}} L_i. In this setting, as shown in Gower et al. (2019; 2020), for τ-minibatch sampling (τ ∈ [n]) the expected smoothness and expected residual parameters and the finite gradient noise σ² take the following form:

    L(τ) = [n(τ − 1) / (τ(n − 1))] L + [(n − τ) / (τ(n − 1))] L_max,   (29)

    ρ(τ) = [(n − τ) / ((n − 1)τ)] L_max,   (30)

    σ²(τ) := E_D[‖∇f_v(x*)‖²] = (1/τ) [(n − τ)/(n − 1)] (1/n) ∑_{i=1}^n ‖∇f_i(x*)‖².   (31)

Using the above expressions (29), (30) and (31), it is easy to see that for single element sampling, where τ = 1, it holds that L = ρ = L_max and σ² = (1/n) ∑_{i=1}^n ‖∇f_i(x*)‖². In the other limit case, where a full batch is used (τ = n), these values become L = L and ρ = σ² = 0. Note that these are exactly the values for L, ρ and σ² we use in Section 6, with the only difference that τ ∈ [n²] because the stochastic Hamiltonian function (11) has n² components H_{i,j}.
In particular, as we explained in Section 6, for the Theorems related to SHGD we use σ² := E_{i,j}[‖∇H_{i,j}(x*)‖²]. From the above expression, for τ-minibatch sampling with τ ∈ [n²], this is equivalent to:

    σ² := E_{i,j}[‖∇H_{i,j}(x*)‖²] = (1/τ) [(n² − τ)/(n² − 1)] (1/n²) ∑_{i=1}^n ∑_{j=1}^n ‖∇H_{i,j}(x*)‖².
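The expressions (29)–(31) are simple enough to evaluate directly; the following small helper (our own sketch, with hypothetical argument names) computes them for a given minibatch size τ.

    import numpy as np

    def minibatch_constants(tau, n, L, L_max, grad_norms_sq_at_opt):
        """Evaluate (29)-(31) for τ-minibatch sampling over n components.
        grad_norms_sq_at_opt: array of the n values ‖∇f_i(x*)‖²."""
        L_tau = n * (tau - 1) / (tau * (n - 1)) * L + (n - tau) / (tau * (n - 1)) * L_max
        rho_tau = (n - tau) / ((n - 1) * tau) * L_max
        sigma2_tau = (1 / tau) * (n - tau) / (n - 1) * np.mean(grad_norms_sq_at_opt)
        return L_tau, rho_tau, sigma2_tau

    # Sanity check of the two limit cases discussed above:
    print(minibatch_constants(1, 10, L=1.0, L_max=3.0, grad_norms_sq_at_opt=np.ones(10)))
    # τ = 1 gives (3.0, 3.0, 1.0), i.e. L = ρ = L_max
    print(minibatch_constants(10, 10, L=1.0, L_max=3.0, grad_norms_sq_at_opt=np.ones(10)))
    # τ = n gives (1.0, 0.0, 0.0), i.e. L = L and ρ = σ² = 0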

Connection between τ-minibatch sampling and the sampling step of the main algorithms. Note that one of the main steps of Algorithms 1 and 2 is the generation of fresh samples i ∼ D and j ∼ D and the evaluation of ∇H_{i,j}(x^k). In the case of uniform single element sampling, the samples i and j are selected with probability p_i = 1/n and p_j = 1/n respectively. This is equivalent to selecting the sample {i, j} uniformly at random from the n² components of the Hamiltonian function: in both cases the probability of selecting the component H_{i,j} is p_{H_{i,j}} = 1/n². A small numerical check of this equivalence is sketched below.

In other words, for the case of 1-minibatch sampling (uniform single element sampling), one can simply substitute the sampling step of SHGD and L-SVRHG, “Generate fresh samples i ∼ D and j ∼ D and evaluate ∇H_{i,j}(x^k),” with “Sample uniformly at random the component H_{i,j} and evaluate ∇H_{i,j}(x^k).”
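As a quick illustration (our own, not from the paper), one can verify numerically that the two sampling procedures induce the same distribution over the n² components:

    import numpy as np

    rng = np.random.default_rng(0)
    n, T = 4, 500000
    i = rng.integers(n, size=T)                          # i ~ D (uniform over [n])
    j = rng.integers(n, size=T)                          # j ~ D (uniform over [n])
    counts = np.zeros((n, n))
    np.add.at(counts, (i, j), 1)                         # empirical frequency of each H_{i,j}
    print(np.allclose(counts / T, 1 / n**2, atol=2e-3))  # p_{H_{i,j}} = 1/n² for all pairs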
Trivially, using Definition B.4 and the above notion of sampling vector, this connection can be extended to capture the more general τ-minibatch sampling where τ ∈ [n²]. In this case, we will have

    ∇H_v(x) := (1/n²) ∑_{i=1}^n ∑_{j=1}^n v_{i,j} ∇H_{i,j}(x),

where v ∈ R^{n²}_+ is a random sampling vector such that E_{i,j}[v_{i,j}] = 1, for i = 1, . . . , n and j = 1, . . . , n, and H_v(x) := (1/n²) ∑_{i=1}^n ∑_{j=1}^n v_{i,j} H_{i,j}(x). Note that it follows immediately from this definition of sampling vector that E[∇H_v(x)] = (1/n²) ∑_{i=1}^n ∑_{j=1}^n E_{i,j}[v_{i,j}] ∇H_{i,j}(x) = ∇H(x).

In this case the update rule of SHGD (Algorithm 1) will simply be x^{k+1} = x^k − γ^k ∇H_v(x^k), and the proposed theoretical results still hold.

B.2.2. Sufficient Conditions and Connections between Notions of Smoothness


In Gower et al. (2019) it was proved that convexity and L_i–smoothness of the f_i in problem (23) imply expected smoothness of the function f. However, the opposite implication does not hold: the expected smoothness assumption can hold even when the f_i’s and f are not convex. See (Gower et al., 2019) for more details.

Similar results have been shown in Gower et al. (2020) for the case of expected residual. More specifically, it was shown that if the functions f_i of problem (23) are L_i–smooth and also x*-convex for x* ∈ X* (where X* is the solution set of f), then the function f satisfies the expected residual condition, that is, (f, D) ∼ ER(ρ), and the expected residual parameter ρ has a meaningful expression.

Another interesting connection between the smoothness parameters is the following (Gower et al., 2019). If we assume that the function f of problem (23) is L-smooth and that each f_i is L_i-smooth, then the expected smoothness constant L(f, D) is bounded as follows:

    L ≤ L(f, D) ≤ L_max,

where L_max = max_{i∈[n]} L_i.
Let us also present the following lemma, proved in (Gower et al., 2020), which connects the ES and ER assumptions.

Lemma B.5 (Expected smoothness implies Expected Residual, from (Gower et al., 2020)). If the function f of problem (23) satisfies expected smoothness, (f, D) ∼ ES(L), then it satisfies the expected residual, (f, D) ∼ ER(ρ), with ρ = L. If in addition the function satisfies the PL condition, then it satisfies the expected residual with ρ = L − µ.

B.2.3. Bounds on the Stochastic Gradient


A common assumption used to prove the convergence of SGD is uniform boundedness of the stochastic gradients: there exists 0 < c < ∞ such that E‖∇f_v(x)‖² ≤ c for all x. However, this assumption often does not hold, such as in the case when f is strongly convex (Nguyen et al., 2018). Recall that the class of µ-strongly convex functions is a special case of both the class of µ-quasi strongly convex functions and the class of functions satisfying the PL condition (see (25)).
Using ES and ER in the proposed theorems we do not need to assume such a bound. Instead, we use the following direct
consequence of expected smoothness and expected residual to bound the expected norm of the stochastic gradients.
Lemma B.6 (Gower et al., 2019). If (f, D) ∼ ES(L), then

    E_D[‖∇f_v(x)‖²] ≤ 4L (f(x) − f(x*)) + 2σ²,   (32)

where σ² := E_D[‖∇f_v(x*)‖²].

A similar upper bound on the stochastic gradients can be obtained if one assumes expected residual:

Lemma B.7 (Gower et al., 2020). If (f, D) ∼ ER(ρ), then

    E_D[‖∇f_v(x)‖²] ≤ 4ρ (f(x) − f*) + ‖∇f(x)‖² + 2σ²,   (33)

where σ² := E_D[‖∇f_v(x*)‖²].

C. On Stochastic Hamiltonian Function and Unbiased Estimator of the Gradient


C.1. Finite-Sum Structure of Hamiltonian Methods
Having g be a finite-sum function leads to the following derivation for the Hamiltonian function:

    H(x) = (1/2)‖ξ(x)‖² = (1/2)⟨ξ(x), ξ(x)⟩ = (1/2)⟨(1/n) ∑_{i=1}^n ξ_i(x), (1/n) ∑_{j=1}^n ξ_j(x)⟩
         = (1/n²) ∑_{i=1}^n ∑_{j=1}^n (1/2)⟨ξ_i(x), ξ_j(x)⟩ = (1/n²) ∑_{i,j=1}^n H_{i,j}(x),   (34)

where H_{i,j}(x) := (1/2)⟨ξ_i(x), ξ_j(x)⟩. That is, the Hamiltonian function H(x) can be expressed as a finite sum with n² components.

C.2. Unbiased Estimator of the Full Gradient


The gradient of H_{i,j}(x) has the following form:

    ∇H_{i,j}(x) = (1/2) ∇⟨ξ_i(x), ξ_j(x)⟩ = (1/2) [⟨∇ξ_i(x), ξ_j(x)⟩ + ⟨ξ_i(x), ∇ξ_j(x)⟩] = (1/2)(J_i^T ξ_j + J_j^T ξ_i),   (35)

and it is an unbiased estimator of the full gradient, that is, ∇H(x) = E_{i,j}[∇H_{i,j}(x)]:

    ∇H(x) = ∇ (1/2)‖ξ(x)‖² = ∇ (1/2)⟨ξ(x), ξ(x)⟩ = (1/n²) ∑_{i=1}^n ∑_{j=1}^n (1/2) ∇⟨ξ_i(x), ξ_j(x)⟩
          = (1/n²) ∑_{i=1}^n ∑_{j=1}^n (1/2) [⟨∇ξ_i(x), ξ_j(x)⟩ + ⟨ξ_i(x), ∇ξ_j(x)⟩]
          = (1/n²) ∑_{i=1}^n ∑_{j=1}^n (1/2)(J_i^T ξ_j + J_j^T ξ_i)          [each term is ∇H_{i,j}(x)]
          = (1/n) ∑_{i=1}^n (1/n) ∑_{j=1}^n ∇H_{i,j}(x)
          = E_i E_j [∇H_{i,j}(x)] = E_{i,j}[∇H_{i,j}(x)].   (36)
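As a sanity check (our own illustration, not code from the paper), the unbiasedness E_{i,j}[∇H_{i,j}(x)] = ∇H(x) can be verified numerically for the bilinear setup of Section D.1 below, where ξ_i(x) = (A_i x_2 + b_i, −[A_i^T x_1 + c_i]) and J_i = [[0, A_i], [−A_i^T, 0]]:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 5, 4
    A = rng.standard_normal((n, d, d))
    b, c = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

    def xi(i):   # ξ_i(x) = (A_i x2 + b_i, −[A_i^T x1 + c_i])
        return np.concatenate([A[i] @ x2 + b[i], -(A[i].T @ x1 + c[i])])

    def J(i):    # J_i = ∇ξ_i(x), constant for the bilinear game
        Z = np.zeros((d, d))
        return np.block([[Z, A[i]], [-A[i].T, Z]])

    grad_Hij = np.mean([0.5 * (J(i).T @ xi(j) + J(j).T @ xi(i))
                        for i in range(n) for j in range(n)], axis=0)
    xi_bar = np.mean([xi(i) for i in range(n)], axis=0)
    J_bar = np.mean([J(i) for i in range(n)], axis=0)
    print(np.allclose(grad_Hij, J_bar.T @ xi_bar))  # E_{i,j}[∇H_{i,j}(x)] = J^T ξ(x) = ∇H(x)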

C.3. Beyond Finite-Sum


All results presented in the main paper related to SHGD can be trivially extended beyond finite-sum problems (with exactly the same rates). The finite-sum structure is required only for the variance reduced method L-SVRHG. In the stochastic case, problem (3) becomes

    min_{x_1∈R^{d_1}} max_{x_2∈R^{d_2}} g(x_1, x_2) = E_ζ[g(x, ζ)],

where ζ is a random variable obeying some distribution. Then ξ(x) = E_ζ[ξ(x, ζ)], J(x) = E_ζ[J(x, ζ)], and the stochastic Hamiltonian function becomes

    H(x) = E_{ζ_i} E_{ζ_j} [ (1/2)⟨ξ(x, ζ_i), ξ(x, ζ_j)⟩ ],

where H_{i,j}(x) := (1/2)⟨ξ(x, ζ_i), ξ(x, ζ_j)⟩. In this case ∇H_{i,j}(x) = (1/2)[J(x, ζ_i)^T ξ(x, ζ_j) + J(x, ζ_j)^T ξ(x, ζ_i)] and ∇H(x) = E_{ζ_i} E_{ζ_j}[∇H_{i,j}(x)].

In this setting, SHGD executes the following updates in each step k ∈ {0, 1, 2, · · · , K}:

1. Generate i.i.d. random variables ζ_i and ζ_j and evaluate ∇H_{i,j}(x^k).

2. Set step-size γ^k following one of the selected choices (constant, decreasing).

3. Set x^{k+1} = x^k − γ^k ∇H_{i,j}(x^k).

D. Proofs of Main Propositions


Let us first present the main notation used for eigenvalues and singular values (similar to the main paper).

Eigenvalues, singular values. Let A ∈ R^{n×n}. We denote by λ_1 ≤ λ_2 ≤ · · · ≤ λ_n its eigenvalues, with λ_min = λ_1 the smallest non-zero eigenvalue and λ_max = λ_n the largest eigenvalue. By σ_1 ≤ σ_2 ≤ · · · ≤ σ_n we denote its singular values, with σ_max the maximum singular value and σ_min the minimum non-zero singular value of the matrix A.

D.1. Proof of Proposition 4.3


In our proof we use the following result proved in Necoara et al. (2018).

Lemma D.1 (Necoara et al. (2018)). Let the function z : R^m → R be µ_z-strongly convex with L_z-Lipschitz continuous gradient and let A ∈ R^{m×n} be a nonzero matrix. Then the convex function f(x) = z(Ax) is a smooth µ–quasi-strongly convex function with constants L = L_z ‖A‖² and µ = µ_z σ²_min(A), where σ_min(A) is the smallest nonzero singular value of the matrix A and ‖A‖ is the spectral norm.

Proof. Recall that in the stochastic bilinear game we have:

    g(x_1, x_2) = (1/n) ∑_{i=1}^n [ x_1^T A_i x_2 + b_i^T x_1 + c_i^T x_2 ],

where g_i(x_1, x_2) := x_1^T A_i x_2 + b_i^T x_1 + c_i^T x_2. By evaluating the partial derivatives we obtain:

• ∇_{x_1} g_i(x_1, x_2) = A_i x_2 + b_i ∈ R^{d_1×1}, ∀i ∈ [n]

• ∇_{x_2} g_i(x_1, x_2) = A_i^T x_1 + c_i ∈ R^{d_2×1}, ∀i ∈ [n]

Thus, from the definition of ξ_i(x) we get:

    ξ_i(x) = (∇_{x_1} g_i, −∇_{x_2} g_i) = ( A_i x_2 + b_i , −[A_i^T x_1 + c_i] ),   ∀i ∈ [n],
and as a result, by simple computations:

    H_{i,j}(x) = (1/2)⟨ξ_i(x), ξ_j(x)⟩
               = (1/2)⟨( A_i x_2 + b_i , −[A_i^T x_1 + c_i] ), ( A_j x_2 + b_j , −[A_j^T x_1 + c_j] )⟩
               = (1/2) x^T Q_{i,j} x + q_{i,j}^T x + ℓ_{i,j},   (37)

where

    Q_{i,j} = [[ A_i A_j^T , 0 ], [ 0 , A_i^T A_j ]],
    q_{i,j}^T = (1/2)( c_j^T A_i^T + c_i^T A_j^T , b_j^T A_i + b_i^T A_j ),   and   ℓ_{i,j} = (1/2) c_i^T c_j + (1/2) b_i^T b_j.

Using the finite-sum structure of the Hamiltonian function (11), the stochastic Hamiltonian function takes the following form:

    H(x) = (1/n²) ∑_{i,j=1}^n H_{i,j}(x)
         = (1/n²) ∑_{i,j=1}^n [ (1/2) x^T Q_{i,j} x + q_{i,j}^T x + ℓ_{i,j} ]
         = (1/2) x^T Q x + q^T x + ℓ,   (38)

where

    Q = (1/n²) ∑_{i,j=1}^n Q_{i,j} = [[ A A^T , 0 ], [ 0 , A^T A ]]   with   A = (1/n) ∑_{i=1}^n A_i,

and q^T = (1/n²) ∑_{i,j=1}^n q_{i,j}^T and ℓ = (1/n²) ∑_{i,j=1}^n ℓ_{i,j}.

Thus, the stochastic Hamiltonian function

    H(x) = (1/2) x^T Q x + q^T x + ℓ

is a quadratic smooth function (not necessarily strongly convex).

Note also that throughout the paper we assume that a stationary point of the function g exists. This, in combination with Assumption 3.8, guarantees that the Hamiltonian function H(x) has at least one minimum point x*. That is, there exists x* such that ∇H(x*) = Qx* + q = 0.

Here the matrix Q is symmetric and Q ⪰ 0. Thus, let Q = L_Q^T L_Q be the Cholesky decomposition of the matrix Q. In addition, since we already assume that the Hamiltonian function H(x) has at least one minimum point x*, we have that q = −Qx* = −L_Q^T L_Q x*.

Using this, note that:

    H(x) = φ(L_Q x),

where φ(y) = (1/2)‖y‖² − (L_Q x*)^T y + ℓ. In addition, note that the function φ is 1-strongly convex with 1-Lipschitz continuous gradient.

Thus, using Lemma D.1, we have that the Hamiltonian function is an L_H-smooth, µ_H–quasi-strongly convex function with constants L_H = ‖L_Q‖² = λ_max(L_Q^T L_Q) = λ_max(Q) = σ²_max(A) and µ_H = σ²_min(L_Q) = λ_min(Q) = σ²_min(A).

This completes the proof.
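As a quick numerical illustration of these constants (our own check, not part of the proof), the eigenvalues of Q = blockdiag(AA^T, A^T A) are exactly the squared singular values of A, so that L_H = σ²_max(A) and µ_H = σ²_min(A):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 5, 4
    A = np.mean(rng.standard_normal((n, d, d)), axis=0)            # A = (1/n) Σ A_i
    Q = np.block([[A @ A.T, np.zeros((d, d))],
                  [np.zeros((d, d)), A.T @ A]])
    eigs = np.linalg.eigvalsh(Q)
    svals = np.linalg.svd(A, compute_uv=False)
    print(np.isclose(eigs.max(), svals.max()**2))                  # L_H = σ_max²(A)
    print(np.isclose(eigs[eigs > 1e-10].min(), svals.min()**2))    # µ_H = σ_min²(A)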



D.2. Proof of Proposition 4.4


Proof. Our proof follows closely the approach of Abernethy et al. (2019). The main difference is that our game is by definition stochastic, and as a result quantities like S̄ = E_i[S_i] and L̄ = E_i[L_i] appear in the expression of L_H. We divide the proof into two parts: in the first we show that the Hamiltonian function is smooth, and in the second that it satisfies the PL condition.

Smoothness. Note that:

    ∇H(x) = (1/n²) ∑_{i=1}^n ∑_{j=1}^n (1/2)(J_i^T ξ_j + J_j^T ξ_i)                                   [by (36)]
          = (1/n²) [ (1/2) ∑_{i=1}^n ∑_{j=1}^n J_i^T ξ_j + (1/2) ∑_{j=1}^n ∑_{i=1}^n J_j^T ξ_i ]
          = (1/n²) ∑_{i=1}^n ∑_{j=1}^n J_i^T ξ_j.   (39)

Thus,

    ‖∇H(x) − ∇H(y)‖
      = ‖ (1/n²) ∑_{i=1}^n ∑_{j=1}^n J_i^T(x) ξ_j(x) − (1/n²) ∑_{i=1}^n ∑_{j=1}^n J_i^T(y) ξ_j(y) ‖   [by (39)]
      = ‖ (1/n²) ∑_{i=1}^n ∑_{j=1}^n [ J_i^T(x) ξ_j(x) − J_i^T(y) ξ_j(y) ] ‖
      ≤ (1/n²) ∑_{i=1}^n ∑_{j=1}^n ‖ J_i^T(x) ξ_j(x) − J_i^T(y) ξ_j(y) ‖                              [Jensen]
      = E_i E_j ‖ J_i^T(x) ξ_j(x) − J_i^T(y) ξ_j(y) ‖                                                 (∗)
      = E_i E_j ‖ J_i^T(x) ξ_j(x) − J_i^T(y) ξ_j(x) + J_i^T(y) ξ_j(x) − J_i^T(y) ξ_j(y) ‖
      = E_i E_j ‖ [J_i^T(x) − J_i^T(y)] ξ_j(x) + J_i^T(y) [ξ_j(x) − ξ_j(y)] ‖
      ≤ E_i E_j ‖ [J_i^T(x) − J_i^T(y)] ξ_j(x) ‖ + E_i E_j ‖ J_i^T(y) [ξ_j(x) − ξ_j(y)] ‖
      ≤ E_i ‖ J_i^T(x) − J_i^T(y) ‖ E_j ‖ξ_j(x)‖ + E_i ‖J_i^T(y)‖ E_j ‖ξ_j(x) − ξ_j(y)‖
      ≤ E_i [S_i ‖x − y‖] E_j ‖ξ_j(x)‖ + E_i[L_i] E_j[L_j ‖x − y‖]                                    [Assumption 4.1]
      ≤ C S̄ ‖x − y‖ + L̄² ‖x − y‖                                                                      [E_j ‖ξ_j(x)‖ ≤ C]
      = (S̄ C + L̄²) ‖x − y‖,   (40)

where in (∗) we use that i and j are sampled from a uniform distribution.

PL Condition. To show that the Hamiltonian function satisfies the PL condition (3.2) we use a linear algebra lemma from (Abernethy et al., 2019).

Lemma D.2 (Lemma H.2 in (Abernethy et al., 2019)). Let the matrix M = [[ A , C ], [ −C^T , −B ]], where the matrix C is square and full rank. Let

    c = (σ²_min(C) + λ_min(A²)) (λ_min(B²) + σ²_min(C)) − σ²_max(C) (‖A‖ + ‖B‖)²

and assume that c > 0. Here λ_min denotes the smallest eigenvalue and σ_min and σ_max the smallest and largest singular values respectively. Then if λ is an eigenvalue of M M^T it holds that:

    λ > [ (σ²_min(C) + λ_min(A²)) (λ_min(B²) + σ²_min(C)) − σ²_max(C) (‖A‖ + ‖B‖)² ] / [ 2σ²_min(C) + λ_min(A²) + λ_min(B²) ].

In addition, note that if there exists µ > 0 such that J(x) J^T(x) ⪰ µI, then the Hamiltonian function satisfies the PL condition with parameter µ:

Lemma D.3. Let g(x) of the min-max problem (3) be a twice differentiable function. If there exists µ > 0 such that J(x) J^T(x) ⪰ µI for all x ∈ R^d, then the Hamiltonian function H(x) (11) satisfies the PL condition (3.2) with parameter µ.

Proof.

    (1/2)‖∇H(x)‖² = (1/2)‖J^T(x) ξ(x)‖² = (1/2) ξ(x)^T J(x) J^T(x) ξ(x)
                  ≥ (µ/2) ξ(x)^T ξ(x)                  [J(x) J^T(x) ⪰ µI]
                  = µ (1/2)‖ξ(x)‖²
                  = µ H(x)
                  = µ [H(x) − H(x*)].   (41)           [H(x*) = 0]

Combining the above two lemmas we can now show that for the sufficiently bilinear games the Hamiltonian function satisfies the PL condition with parameter

    µ_H = [ (δ² + ρ²)(δ² + β²) − 4L²∆² ] / ( 2δ² + ρ² + β² ).

Recall that

    J = ∇ξ = [[ ∇²_{x_1,x_1} g , ∇²_{x_1,x_2} g ], [ −∇²_{x_2,x_1} g , −∇²_{x_2,x_2} g ]] ∈ R^{d×d}.

Now let C(x) = ∇²_{x_1,x_2} g(x) and note that this is a square full rank matrix. In particular, by the assumption of the sufficiently bilinear games we have that the cross derivative ∇²_{x_1,x_2} g is a full rank matrix with 0 < δ ≤ σ_i(∇²_{x_1,x_2} g) ≤ ∆ for all x ∈ R^d and for all singular values σ_i. In addition, since we assume that the sufficiently bilinear condition (10) holds, we can apply Lemma D.2 with M = J. Since the function g is smooth in x_1 and x_2, and using the bounds on the singular values of the matrix C(x), we have that:

    J J^T ⪰ [ (δ² + ρ²)(δ² + β²) − 4L²∆² ] / ( 2δ² + ρ² + β² ) · I,

where ρ² = min_{x_1,x_2} λ_min( (∇²_{x_1,x_1} g(x_1, x_2))² ) and β² = min_{x_1,x_2} λ_min( (∇²_{x_2,x_2} g(x_1, x_2))² ). Using Lemma D.3, it is clear that the Hamiltonian function of the sufficiently bilinear games satisfies the PL condition with µ_H = [ (δ² + ρ²)(δ² + β²) − 4L²∆² ] / ( 2δ² + ρ² + β² ). This completes the proof.

E. Proofs of Main Theorems


In this section we present the proofs of the convergence Theorems of Section 6 for SHGD (constant and decreasing step-size) and L-SVRHG (and its restarted variant for PL functions) for solving bilinear games and sufficiently bilinear games.

E.1. Derivation of Convergence Results


The Theorems of the paper can be obtained by combining existing and new optimization convergence results with the two main propositions proved in the previous section (Propositions 4.3 and 4.4). In particular, we use the following combinations of results:

• For the Bilinear Games:


– Convergence of SHGD with constant step-size (Theorem 6.1): Combination of constant step-size SGD theorem
from Gower et al. (2019) and Proposition 4.3.
– Convergence of SHGD with decreasing step-size (Theorem 6.3): Combination of decreasing step-size SGD
theorem from Gower et al. (2019) and Proposition 4.3.
– Convergence of L-SVRHG (Theorem 6.4): Combination of the convergence of L-SVRG from Kovalev et al.
(2020); Gorbunov et al. (2020) and Proposition 4.3.

• For the Sufficiently Bilinear Games:


– Convergence of SHGD with constant step-size (Theorem 6.5): Combination of constant step-size SGD theorem
from Gower et al. (2020) and Proposition 4.4.
– Convergence of SHGD with decreasing step-size (Theorem 6.7): Combination of decreasing step-size SGD
theorem from Gower et al. (2020) and Proposition 4.4.
– Convergence of L-SVRHG with Restarts (Theorem 6.8): Combination of Theorem E.8 describing the convergence
of Algorithm 6 in the optimization setting (extension of the convergence results from Qian et al. (2019)) and
Proposition 4.4.

In the rest of this section we present the Theorems on the convergence of SGD (Algorithm 4) and L-SVRG (Algorithm 5) for solving the finite-sum problem (23), as given in the above papers, with some brief comments. As explained above, combining these results with Propositions 4.3 and 4.4 yields the Theorems presented in Section 6.

E.2. Convergence of Stochastic Optimization Methods for µ–Quasi-strongly Convex Functions


In this subsection we present the main convergence results as given in Gower et al. (2019) for SGD and in Kovalev et al. (2020); Gorbunov et al. (2020) for L-SVRG. The main assumption of these Theorems is that the function f of problem (23) is µ–quasi-strongly convex and that expected smoothness is satisfied. Note that no assumption on the convexity of the f_i is made.

E.2.1. Convergence of SHGD


Two Theorems were presented in Gower et al. (2019) for the convergence of SGD, one with a constant step-size and one with a decreasing step-size. In particular, the second theorem provides an insightful step-size-switching rule which describes when one should switch from a constant to a decreasing step-size regime. As expected, with a constant step-size SGD converges with a linear rate to a neighborhood of the solution, while with a decreasing step-size it converges to the exact optimum but with a slower sublinear rate.

This is exactly the behavior of SHGD: with a constant step-size the method converges to a neighborhood of the min-max solution, while with a decreasing step-size the method converges, at a slower rate, to the min-max solution of problem (3).
Theorem E.1 (Constant Stepsize). Assume f is µ-quasi-strongly convex and that (f, D) ∼ ES(L). Choose γ^k = γ ∈ (0, 1/(2L)] for all k. Then the iterates of Stochastic Gradient Descent (SGD) given in Algorithm 4 satisfy:

    E‖x^k − x*‖² ≤ (1 − γµ)^k ‖x^0 − x*‖² + 2γσ²/µ.   (42)

Theorem E.2 (Decreasing stepsizes/switching strategy). Assume f is µ-quasi-strongly convex and that (f, D) ∼ ES(L). Let K := L/µ and

    γ^k = 1/(2L)                 for k ≤ 4⌈K⌉,
    γ^k = (2k + 1)/((k + 1)² µ)  for k > 4⌈K⌉.   (43)

If k ≥ 4⌈K⌉, then the iterates of Stochastic Gradient Descent (SGD) given in Algorithm 4 satisfy:

    E‖x^k − x*‖² ≤ (σ²/µ²)(8/k) + (16⌈K⌉²/(e² k²)) ‖x^0 − x*‖².   (44)
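For reference, the switching rule (43) is straightforward to implement; the following helper (our own sketch) returns the step-size for iteration k given the expected smoothness constant L and the quasi-strong convexity parameter µ:

    import math

    def switching_stepsize(k, L, mu):
        """Step-size rule (43): constant 1/(2L) up to iteration 4⌈K⌉ with K = L/µ,
        then the decreasing O(1/k) schedule that yields the rate (44)."""
        threshold = 4 * math.ceil(L / mu)
        if k <= threshold:
            return 1.0 / (2.0 * L)     # constant regime
        return (2 * k + 1) / ((k + 1) ** 2 * mu)   # decreasing regime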

E.2.2. Convergence of L-SVRG


As we explained in the main paper, L-SVRG is a variance reduced method, which allows us to obtain convergence to the exact solution of the problem. The method is analyzed in Kovalev et al. (2020) for the case of strongly convex functions and extended to the class of µ-quasi strongly convex functions in Gorbunov et al. (2020). Following the theorem proposed in Gorbunov et al. (2020), it can be shown that the method converges to the solution x* with a linear rate as follows:

Theorem E.3. Assume f is µ-quasi-strongly convex. Let the step-size be γ = 1/(6L), let p ∈ (0, 1] and let D be the uniform distribution. Then L-SVRG with Option I, given in Algorithm 5, converges to the optimum and satisfies:

    E[Φ^k] ≤ max{1 − µ/(6L), 1 − p/2}^k Φ^0,

where Φ^k = ‖x^k − x*‖² + (4γ²/(pn)) ∑_{i=1}^n ‖∇f_i(w^k) − ∇f_i(x*)‖².

Note that in the statement of Theorem 6.4 we replace n in the above expression with n², because the Hamiltonian function has a finite-sum structure with n² components. As explained before, to obtain Theorem 6.4 of the main paper one can simply combine the above Theorem with Proposition 4.3.

We highlight that in (Qian et al., 2019) a convergence Theorem of L-SVRG for smooth strongly convex functions was presented under the arbitrary sampling paradigm (well-defined distribution D). This result can be trivially extended to capture the class of smooth quasi-strongly convex functions, and as such it can also be used in combination with Proposition 4.3. In this case the step-size becomes γ = 1/(6L), where L is the expected smoothness parameter. Using this, one can guarantee linear convergence of L-SVRG, and as a result of L-SVRHG, with a more general distribution D (beyond uniform sampling). For other well-defined choices of the distribution D we refer the interested reader to (Qian et al., 2019).

E.3. Convergence of Stochastic Optimization Methods for Functions Satisfying the PL Condition
As we have already mentioned, the Theorems on the convergence of the stochastic Hamiltonian methods for solving the sufficiently bilinear games can be obtained by combining Proposition 4.4 with existing results on the analysis of SGD and L-SVRG for functions satisfying the PL condition.

In particular, in this subsection we present the main convergence Theorems as given in Gower et al. (2020) for the analysis of SGD for functions satisfying the PL condition, and we explain how we can extend the results of Qian et al. (2019) in order to provide an analysis of L-SVRG with restart.

The main assumption of these Theorems is that the function f of problem (23) satisfies the PL condition and that expected residual is satisfied. Note that, again, no assumption on the convexity of the f_i is made.

An important remark is that all convergence results are presented in terms of the function suboptimality E[f(x^k) − f(x*)]. When these results are used for the Hamiltonian methods, for which we know that H(x*) = 0, the suboptimality can be written simply as E[H(x^k)]. This is exactly the quantity for which we show convergence in Theorems 6.5, 6.7 and 6.8.

E.3.1. Convergence of SHGD


In Gower et al. (2020) several convergence theorems describing the performance of SGD were presented for two large classes of structured nonconvex functions: (i) the quasar (strongly) convex functions and (ii) the functions satisfying the Polyak-Lojasiewicz (PL) condition. The proposed analysis of Gower et al. (2020) relied on the Expected Residual assumption (6). The authors proved that this assumption is weaker than previous assumptions used for the analysis of SGD and highlighted the benefits of using it when the function satisfies the PL condition.

In particular, one of the main contributions of that work is a novel analysis of minibatch SGD for PL functions which recovers the best known dependence on the condition number for Gradient Descent (GD) while also matching the current state-of-the-art rate for SGD. Recall that this is what we used to obtain the convergence of the deterministic Hamiltonian method (equivalent to GD) in Corollary 6.6.

In Gower et al. (2020) two theorems have been proposed for the convergence of SGD for PL functions, one with a constant step-size and one with a decreasing step-size. In particular, the second theorem provides an insightful step-size-switching rule which describes when one should switch from a constant to a decreasing step-size regime. For the case of constant step-size, SGD converges with a linear rate to a neighborhood of the solution, while with the decreasing step-size it converges to the exact optimum but with a slower sublinear rate.
Theorem E.4. Let f be L-smooth. Assume expected residual and that f(x) satisfies the PL condition (4). Set σ² = E[‖∇f_i(x*)‖²], where x* = arg min_x f(x). Let γ_k = γ ≤ µ/(L(µ + 2ρ)) for all k. Then the iterates of SGD satisfy:

    E[f(x^k) − f*] ≤ (1 − γµ)^k [f(x^0) − f*] + Lγσ²/µ.   (45)

Theorem E.5 (Decreasing step sizes/switching strategy). Let f be L-smooth. Assume expected residual and that f(x) satisfies the PL condition (4). Let k* := 2(L/µ)(1 + 2ρ/µ) and

    γ_k = µ/(L(µ + 2ρ))          for k ≤ ⌈k*⌉,
    γ_k = (2k + 1)/((k + 1)² µ)  for k > ⌈k*⌉.   (46)

If k ≥ ⌈k*⌉, then SGD given in Algorithm 4 satisfies:

    E[f(x^k) − f*] ≤ (4Lσ²/µ²)(1/k) + ((k*)²/(e² k²)) [f(x^0) − f*].   (47)

E.3.2. Convergence of L-SVRG (with Restart)


In Qian et al. (2019) an analysis of L-SVRG (Algorithm 5) was provided for nonconvex and smooth functions. In particular, Lemma E.7 below was proved under the expected residual Assumption E.6.

Assumption E.6. There is a constant ρ_nc > 0 such that

    E[‖∇f_i(x) − ∇f_i(y) − [∇f(x) − ∇f(y)]‖²] ≤ ρ_nc ‖x − y‖².   (48)

Assumption E.6 is similar to the expected residual condition presented in the main paper. For the case of τ-minibatch sampling, it was shown in Qian et al. (2019) that the parameter ρ_nc of the assumption can be upper bounded by ρ_nc ≤ [(n² − τ)/((n² − 1)τ)] (1/n) ∑ L_i², where L_i is the smoothness parameter of the function f_i.
Under the expected residual Assumption E.6 the following lemma was proven in Qian et al. (2019).

Lemma E.7 (Theorem 5.1 in Qian et al. (2019)). Let f be a nonconvex and smooth function. Let Assumption E.6 be satisfied and let p ∈ (0, 1]. Consider the Lyapunov function Ψ^k = f(x^k) + α‖x^k − w^k‖², where α = 3γ²Lρ_nc/p. If the step-size γ satisfies

    γ ≤ min{ 1/(4L) , p^{2/3}/(36^{1/3} (Lρ_nc)^{1/3}) , √p/√(6ρ_nc) },   (49)

then the update of L-SVRG (Algorithm 5) satisfies

    E_i[Ψ^{k+1}] ≤ Ψ^k − (γ/4) ‖∇f(x^k)‖².

Having the result of Lemma E.7, let us now present the main Theorem describing the convergence of L-SVRG with restart, presented in Algorithm 6. Let us run L-SVRG with a step-size γ that satisfies (49) and select the output x^u of the method to be its Option II; that is, x^u is chosen uniformly at random from {x^i}_{i=0}^K. In this case we name the method L-SVRG_II(x^0 = w^0, K, γ, p ∈ (0, 1]).
Theorem E.8 (Convergence of Algorithm 6). Let f be an L-smooth function that satisfies the PL condition (4) with parameter µ. Let Assumption E.6 be satisfied and let p ∈ (0, 1]. If the step-size γ satisfies

    γ ≤ min{ 1/(4L) , p^{2/3}/(36^{1/3} (Lρ_nc)^{1/3}) , √p/√(6ρ_nc) }

and K = 4/(µγ), then the update of Algorithm 6 satisfies

    E[f(x^t) − f(x*)] ≤ (1/2)^t [f(x^0) − f(x*)],   (50)

and

    E‖∇f(x^t)‖² ≤ (1/2)^t ‖∇f(x^0)‖².   (51)

Proof. Using Lemma E.7 we obtain:

    E_i[Ψ^{k+1}] ≤ Ψ^k − (γ/4) ‖∇f(x^k)‖².

By taking expectation again and rearranging:

    E‖∇f(x^k)‖² ≤ (4/γ) ( E[Ψ^k] − E[Ψ^{k+1}] ).

By letting x^u be chosen uniformly at random from {x^i}_{i=0}^K we obtain:

    E‖∇f(x^u)‖² ≤ (1/K) ∑_{i=0}^{K−1} E‖∇f(x^i)‖²
                ≤ (1/K)(4/γ) ∑_{i=0}^{K−1} ( E[Ψ^i] − E[Ψ^{i+1}] )
                = (1/K)(4/γ) ( Ψ^0 − E[Ψ^K] )
                = (1/K)(4/γ) ( f(x^0) − E[f(x^K)] − αE[‖x^K − w^K‖²] )
                ≤ (1/K)(4/γ) ( f(x^0) − f(x*) ).   (52)

Convergence of function values. The above derivation (52) shows that the iterates of Algorithm 6 satisfy:

    E‖∇f(x^t)‖² ≤ (4/(γK)) E[f(x^{t−1}) − f(x*)].

Substituting the specified value K = 4/(γµ) in the above inequality, we have

    E‖∇f(x^t)‖² ≤ µ E[f(x^{t−1}) − f(x*)],

and since the function satisfies the PL condition we have (1/2)‖∇f(x)‖² ≥ µ [f(x) − f(x*)], which means that:

    2µ E[f(x^t) − f(x*)] ≤ E‖∇f(x^t)‖² ≤ µ E[f(x^{t−1}) − f(x*)].

Thus,

    E[f(x^t) − f(x*)] ≤ (1/2) E[f(x^{t−1}) − f(x*)],

and by unrolling the recurrence we obtain (50).

Convergence of the norm of the gradient. Similarly to the previous case, using (52), the iterates of Algorithm 6 satisfy:

    E‖∇f(x^t)‖² ≤ (4/(γK)) E[f(x^{t−1}) − f(x*)]
                ≤ (4/(γK)) (1/(2µ)) E‖∇f(x^{t−1})‖²        [by the PL condition (4)]
                = (2/(γµK)) E‖∇f(x^{t−1})‖².   (53)

Using the specified value K = 4/(γµ) in the above inequality, we have:

    E‖∇f(x^t)‖² ≤ (1/2) E‖∇f(x^{t−1})‖²,   (54)

and by unrolling the recurrence we obtain (51).

F. Experimental Details
In the experimental section we compare several different algorithms; we provide a short explanation of each of them here:

• SHGD with constant and decreasing step-size: This is Alg. 1 proposed in the paper.

• Biased SHGD: This is a biased version of Alg. 1 that was proposed by Mescheder et al. (2017), where ∇H_{i,j}(x) = (1/2)∇⟨ξ_i(x), ξ_j(x)⟩ is replaced by ∇Ĥ_{i,j}(x) = (1/2)∇‖ξ_i(x) + ξ_j(x)‖²; note that this is a biased estimator of ∇H(x).

• L-SVRHG with or without restart: This is Alg. 2 proposed in the paper, with Option II for the restart and Option I for the version without restart. Restart is not used unless specified.

• CO: This is the Consensus Optimization algorithm proposed in Mescheder et al. (2017). We provide more details in App. F.5.

• SGDA: This is the stochastic version of the Simultaneous Gradient Descent/Ascent algorithm, which uses the following update: x^{k+1} = x^k − η_k ξ_i(x^k).

• SVRE with restart: This is Alg. 3 described in Chavdarova et al. (2019).

In the following sections we provide the details of the different hyper-parameters used in our experiments.

F.1. Bilinear Game


We first provide the details about the bilinear experiments presented in Section 7.1:

    min_{x_1∈R^d} max_{x_2∈R^d} (1/n) ∑_{i=1}^n [ x_1^T A_i x_2 + b_i^T x_1 + c_i^T x_2 ],   (55)

where:

    n = d = 100,
    A_i ∈ R^{d×d},  [A_i]_{kl} = 1 if i = k = l and 0 otherwise,
    b_i, c_i ∈ R^d,  [b_i]_k, [c_i]_k ∼ N(0, 1/d).

The hyper-parameters used for the different algorithms are described in Table 1:

Table 1. Hyper-parameters used for the different algorithms in the bilinear experiments (Section 7.1).

Algorithms                       | Step-size γ^k                                               | Probability p | Restart
SHGD with constant step-size     | 0.5                                                         | N/A           | N/A
SHGD with decreasing step-size   | 0.5 for k ≤ 10,000; (2k+1)/(2500(k+1)²) for k > 10,000      | N/A           | N/A
Biased SHGD                      | 0.5                                                         | N/A           | N/A
L-SVRHG                          | 10                                                          | 1/n = 0.01    | N/A
SVRE                             | 0.3                                                         | 1/n = 0.01    | Restart with probability 0.1

The optimal constant step-size suggested by the theory for SHGD is γ = 1/(2L). In this experiment we have L = 1, so the optimal step-size is 0.5, which is also what we observed in practice. However, while the theory recommends decreasing the step-size after 4⌈K⌉ = 40,000 iterations, we observe in this experiment that the method actually converges faster if we decrease the step-size a bit earlier, after only 10,000 iterations. A sketch of SHGD on this problem is given below.
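The following compact NumPy sketch (our reconstruction under the stated setup, not the authors' code) runs SHGD with constant step-size γ = 0.5 on problem (55), exploiting the sparsity of the A_i when forming ∇H_{i,j}(x) = (1/2)(J_i^T ξ_j(x) + J_j^T ξ_i(x)):

    import numpy as np

    rng = np.random.default_rng(0)
    n = d = 100
    b = rng.normal(0, np.sqrt(1 / d), (n, d))
    c = rng.normal(0, np.sqrt(1 / d), (n, d))

    def xi(i, x1, x2):
        # With [A_i]_{kl} = 1 iff i = k = l, A_i x2 has a single nonzero entry.
        g1 = b[i].copy(); g1[i] += x2[i]              # ∇_{x1} g_i = A_i x2 + b_i
        g2 = c[i].copy(); g2[i] += x1[i]              # ∇_{x2} g_i = A_i^T x1 + c_i
        return np.concatenate([g1, -g2])              # ξ_i(x)

    def J_T_times(i, v):
        # J_i^T v for J_i = [[0, A_i], [−A_i^T, 0]], using the sparsity of A_i.
        out = np.zeros(2 * d)
        out[i], out[d + i] = -v[d + i], v[i]
        return out

    x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
    gamma = 0.5
    for _ in range(10000):
        i, j = rng.integers(n), rng.integers(n)
        g = 0.5 * (J_T_times(i, xi(j, x1, x2)) + J_T_times(j, xi(i, x1, x2)))
        x1, x2 = x1 - gamma * g[:d], x2 - gamma * g[d:]   # x^{k+1} = x^k − γ∇H_{i,j}(x^k)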

F.2. Sufficiently-Bilinear Games


In this section we provide more details about the sufficiently bilinear experiments of Section 7.2:

    min_{x_1∈R^d} max_{x_2∈R^d} F(x_1) + (1/n) ∑_{i=1}^n [ δ x_1^T A_i x_2 + b_i^T x_1 + c_i^T x_2 ] − F(x_2),

where:

    n = d = 100 and δ = 7,
    A_i ∈ R^{d×d},  [A_i]_{kl} = 1 if i = k = l and 0 otherwise,
    b_i, c_i ∈ R^d,  [b_i]_k, [c_i]_k ∼ N(0, 1/d),

and F(x) = (1/d) ∑_{k=1}^d f(x_k), with

    f(x) = −3(x + π/2)        for x ≤ −π/2,
    f(x) = −3 cos x           for −π/2 < x ≤ π/2,
    f(x) = −cos x + 2x − π    for x > π/2.   (56)

Note that this game satisfies the sufficiently bilinear condition as long as δ > 2L, where L is the smoothness constant of F(x); in our case L = 3, and thus we choose δ = 7 so that the sufficiently bilinear condition is satisfied.
The hyper-parameters used for the different algorithms are described in Table 2:

Table 2. Hyper-parameters used for the different algorithms in the sufficiently bilinear experiments (Section 7.2).

Algorithms                       | Step-size γ^k                                               | Probability p | Restart
SHGD with constant step-size     | 0.02                                                        | N/A           | N/A
SHGD with decreasing step-size   | 0.02 for k ≤ 10,000; (2k+1)/(2500(k+1)²) for k > 10,000     | N/A           | N/A
Biased SHGD                      | 0.01                                                        | N/A           | N/A
L-SVRHG                          | 0.1                                                         | 1/n = 0.01    | N/A
L-SVRHG with restart             | 0.1                                                         | 1/n = 0.01    | Restart every 1,000 iterations
SVRE                             | 0.05                                                        | 0.1           | Restart with probability 0.1

F.3. GANs
In this section we present the details for the GAN experiments. We first present the different problems we try to solve.

satGAN solves the following problem:

    min_{µ,σ} max_{φ_0,φ_1,φ_2} (1/n) ∑_{i=1}^n [ log(sigmoid(φ_0 + φ_1 y_i + φ_2 y_i²)) + log(1 − sigmoid(φ_0 + φ_1(µ + σz_i) + φ_2(µ + σz_i)²)) ].

nsGAN solves the following pair of problems:

    max_{φ_0,φ_1,φ_2} (1/n) ∑_{i=1}^n [ log(sigmoid(φ_0 + φ_1 y_i + φ_2 y_i²)) + log(1 − sigmoid(φ_0 + φ_1(µ + σz_i) + φ_2(µ + σz_i)²)) ],

    max_{µ,σ} (1/n) ∑_{i=1}^n [ log(sigmoid(φ_0 + φ_1(µ + σz_i) + φ_2(µ + σz_i)²)) + log(1 − sigmoid(φ_0 + φ_1 y_i + φ_2 y_i²)) ].

WGAN solves the following problem:

    min_{µ,σ} max_{φ_1,φ_2} (1/n) ∑_{i=1}^n [ (φ_1 y_i + φ_2 y_i²) − (φ_1(µ + σz_i) + φ_2(µ + σz_i)²) ].

All discriminator and generator parameters are initialized randomly from a U(−1, 1) prior. The data is set as y_i ∼ N(0, 1) and z_i ∼ N(0, 1). We run all experiments 10 times (with seeds 1, 2, . . . , 10).
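As an illustration (our own sketch, not the authors' code), the satGAN objective above can be written in a few lines of PyTorch; the generator is N(µ, σ²) reparameterized as µ + σz and the discriminator is a logistic model with quadratic features (φ_0, φ_1, φ_2):

    import torch

    def satgan_objective(mu, sigma, phi, y, z):
        fake = mu + sigma * z                                     # generator samples
        d_real = torch.sigmoid(phi[0] + phi[1] * y + phi[2] * y**2)
        d_fake = torch.sigmoid(phi[0] + phi[1] * fake + phi[2] * fake**2)
        # maximized over φ by the discriminator, minimized over (µ, σ) by the generator
        return (torch.log(d_real) + torch.log(1.0 - d_fake)).mean()

    # Example usage with random initial parameters and data:
    phi = torch.empty(3).uniform_(-1, 1).requires_grad_()
    mu, sigma = torch.randn(()).requires_grad_(), torch.randn(()).requires_grad_()
    y, z = torch.randn(100), torch.randn(100)
    print(satgan_objective(mu, sigma, phi, y, z))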
The hyper-parameters used for the different algorithms are described in Table 3:

Table 3. Hyper-parameters used for the different algorithms in the GAN experiments (Section 7.3).

Algorithms | Step-size γ^k | Probability p | Sample size | Mini-batch size
CO         | 0.02          | N/A           | 10K         | 100
SGDA       | 0.02          | N/A           | 10K         | 100
SHGD       | 0.02          | N/A           | 10K         | 100
L-SVRHG    | 0.02          | 1/n = 0.01    | 10K         | 100

F.4. Implementation details for L-SVRHG


L-SVRHG requires the computation of the gradient of the full Hamiltonian with probability p. As a reminder, the Hamiltonian can be written as the sum of n² terms (see Eq. (11)); a naive implementation would thus require n² operations, computing each ∇H_{i,j}(x) to get the full gradient. A more efficient alternative is to notice that the Hamiltonian can be written as H(x) = (1/2)‖ξ(x)‖²: by first computing ξ(x) and then using back-propagation to compute the gradient of H(x), we can reduce the cost of computing the full gradient to 2n instead of n².
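The following is a minimal PyTorch sketch of this trick (a hypothetical interface, not the authors' code): build ξ(x) = (1/n) ∑_i ξ_i(x) once with a differentiable graph, then backpropagate through H(x) = (1/2)‖ξ(x)‖², which costs O(n) gradient evaluations instead of the O(n²) of summing every ∇H_{i,j}(x).

    import torch

    def full_hamiltonian_grad(params, xi_fn, samples):
        # xi_fn(params, s) must return ξ_s(x) as a flat tensor built with
        # create_graph=True inside, so that ξ stays differentiable in params.
        n = len(samples)
        xi = sum(xi_fn(params, s) for s in samples) / n    # ξ(x), still on the graph
        H = 0.5 * xi.pow(2).sum()                          # H(x) = ½‖ξ(x)‖²
        return torch.autograd.grad(H, params)              # ∇H(x) by back-propagation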

F.5. Details for Consensus Optimization (CO)


Consensus optimization can be formulated as solving the following problem using SGDA:

    min_{x_1∈R^{d_1}} max_{x_2∈R^{d_2}} g(x_1, x_2) + λH(x_1, x_2).   (57)

Using SGDA to solve this problem is equivalent to performing the following update:

    x^{k+1} = x^k − η_k ( ξ(x^k) + λ∇H_{i,j}(x^k) ).   (58)

As per Mescheder et al. (2017), we used λ = 10 in all the experiments and the biased estimator of the Hamiltonian gradient ∇Ĥ_{i,j}(x) = (1/2)∇‖ξ_i(x) + ξ_j(x)‖². Note that we also tried using the unbiased estimator proposed in Section 5.1 but found no significant difference in our results, and thus only included the results for the original algorithm proposed by Mescheder et al. (2017) that uses the biased estimator.

F.6. Cost per iteration


In the experimental section, we compare the different algorithms as a function of the number of gradient computations. In Table 4, we give the number of gradient computations per iteration for all the algorithms compared in the paper. We also give a brief explanation of the cost per iteration of each method:

• SHGD: We can write H_{i,j}(x) = (1/2)⟨ξ_i(x), ξ_j(x)⟩; thus at every iteration we need to compute two gradients, ξ_i(x) and ξ_j(x), which leads to a cost of 2 per iteration.

• Biased SHGD: The biased estimate is based on Ĥ_{i,j}(x) = (1/2)‖ξ_i(x) + ξ_j(x)‖², which also requires the computation of the two gradients ξ_i(x) and ξ_j(x), and thus also has a cost of 2 per iteration.

• L-SVRHG: At every iteration we need to compute two Hamiltonian gradient estimates, which cost 2 each, and with probability p we need to compute the full Hamiltonian gradient, which costs 2n (see App. F.4). This leads to a cost of 4 + p · 2n.

• SVRE: At each iteration SVRE needs to perform an extrapolation step and an update step; both require the evaluation of two gradients, and with probability p we need to compute the full gradient, which costs n. This leads to a total cost of 4 + p · n.

Table 4. Number of gradient computations per iteration for the different algorithms compared in the paper.

Algorithm    | Cost per iteration
SGDA         | 1
SHGD         | 2
Biased SHGD  | 2
L-SVRHG      | 4 + p · 2n
SVRE         | 4 + p · n

G. Additional Experiments
In this section we provide additional experiments that we could not include in the main paper. These experiments provide further observations on the behavior of our proposed methods in different settings.

G.1. Bilinear and Sufficiently-Bilinear Games


G.1.1. Symmetric Positive Definite Matrix

In all the experiments presented in the paper, the matrices A_i have a particular structure: they are very sparse, and the matrix A = (1/n) ∑_{i=1}^n A_i is the identity. We thus propose here to compare the methods on the bilinear game (9) and the sufficiently bilinear game from Section 7.2 but with different matrices A_i. We choose the A_i to be random symmetric positive definite matrices. For the sufficiently bilinear experiments we choose δ such that the sufficiently bilinear condition is satisfied. We show the results in Fig. 3. We observe results very similar to those of Sections 7.1 and 7.2: the experiments again show that our proposed methods follow the theory closely and that L-SVRHG is the fastest method to converge.

[Figure 3: (a) Bilinear game; (b) Sufficiently-bilinear game. The plots show the relative distance to the optimum ‖x^k − x*‖²/‖x^0 − x*‖² and the relative Hamiltonian H(x^k)/H(x^0) against the number of samples, for SHGD (constant and decreasing step-size), Biased SHGD, L-SVRHG (with restart in (b)) and SVRE.]

Figure 3. Results for the bilinear game and sufficiently-bilinear game with symmetric positive definite matrices. We observe results very similar to those of Sections 7.1 and 7.2: the experiments again show that our proposed methods follow the theory closely and that L-SVRHG is the fastest method to converge.

G.1.2. Interpolated Games


In this section we present a particular class of games that we call interpolated games.
Definition G.1 (Interpolated Games). If a game is such that at the equilibrium x∗ , we have ∀i ξi (x∗ ) = 0, then we say
that the game satisfies the interpolation condition.

If a game satisfies the interpolation condition, then SHGD with constant step-size converges linearly to the solution.
In the bilinear game (9) and the sufficiently-bilinear game from Section 7.2, if we set b_i = c_i = 0 for all i, then both problems satisfy the interpolation condition. We provide additional experiments in this particular setting, where we compare SHGD with constant step-size, Biased SHGD, and L-SVRHG. We show the results in Fig. 4. We observe that all methods converge linearly to the solution; surprisingly, in this setting Biased SHGD converges much faster than all other methods. We argue that this is due to the fact that Biased SHGD is optimizing an upper bound on the Hamiltonian. Indeed, using Jensen's inequality, we can show that:

    H(x) = (1/2)‖ξ(x)‖² = (1/2)‖(1/n) ∑_{i=1}^n ξ_i(x)‖² ≤ (1/(2n)) ∑_{i=1}^n ‖ξ_i(x)‖² = (1/n) ∑_{i=1}^n (1/2)‖ξ_i(x)‖² = (1/n) ∑_{i=1}^n H_i(x),   (59)

where H_i(x) := (1/2)‖ξ_i(x)‖². If the interpolation condition is satisfied, then at the optimum x* the inequality becomes an equality:

    H(x*) = (1/n) ∑_{i=1}^n H_i(x*) = 0.   (60)

Thus in this particular setting Biased SHGD also converges to the solution. Furthermore, we can notice that, because the A_i are very sparse, ∇H_{i,j}(x) = 0 for all i ≠ j. Thus most of the time SHGD will not update the current iterate, which is not the case for Biased SHGD, whose estimator always contains the nonzero diagonal terms ∇H_{i,i}(x) and thus always has signal. The convergence of SHGD could thus be improved by using non-uniform sampling; we leave this for future work.

[Figure 4: (a) Bilinear game; (b) Sufficiently-bilinear game. The plots show the relative distance to the optimum ‖x^k − x*‖²/‖x^0 − x*‖² and the relative Hamiltonian H(x^k)/H(x^0) against the number of samples, for SHGD, Biased SHGD and L-SVRHG.]

Figure 4. Results for the bilinear game and sufficiently-bilinear game when b_i = c_i = 0 for all i. We observe that all the methods converge linearly in this setting. Surprisingly, Biased SHGD is the fastest method to converge; we give a brief informal explanation of why this is the case above.

G.2. GANs
We present the missing experiments for nsGAN (with batch size 100) in Figure 5. As can be observed, the results for nsGAN are very similar to the results for satGAN (see Figure 2).

[Figure 5: (a) Hamiltonian with nsGAN; (b) Distance to optimum with nsGAN. The plots show H(x^k)/H(x^0) and the generator L2 distance to the optimum against the number of samples, for CO, SGDA, SHGD (constant step-size) and L-SVRHG.]

Figure 5. nsGAN with batch size of 100.
