First Order Optimization Algorithms Via Discretization of Finite Time Convergent Flows

This document summarizes a paper that proposes three discretization methods for two first-order finite-time optimization flows called the rescaled-gradient flow and signed-gradient flow. These flows converge locally to minima in finite time for gradient-dominated functions. The discretization methods are analyzed theoretically and tested empirically on neural network training tasks using CIFAR10, SVHN and MNIST datasets. Results show the proposed schemes converge faster than standard optimization methods while achieving equivalent or better accuracy.


Under review as a conference paper at ICLR 2021

FIRST-ORDER OPTIMIZATION ALGORITHMS VIA DISCRETIZATION OF FINITE-TIME CONVERGENT FLOWS

Anonymous authors
Paper under double-blind review

ABSTRACT
In this paper, we investigate the performance of several discretization algorithms for two first-order
finite-time optimization flows. These flows are, namely, the rescaled-gradient
flow (RGF) and the signed-gradient flow (SGF), and consist of non-Lipschitz or
discontinuous dynamical systems that converge locally in finite time to the minima
of gradient-dominated functions. We introduce three discretization methods for
these first-order finite-time flows, and provide convergence guarantees. We then
apply the proposed algorithms in training neural networks and empirically test their
performance on three standard datasets, namely, CIFAR10, SVHN, and MNIST.
Our results show that our schemes converge faster than standard
optimization alternatives, while achieving equivalent or better accuracy.

1 INTRODUCTION
Consider the unconstrained minimization problem for a given cost function $f : \mathbb{R}^n \to \mathbb{R}$. When $f$ is
sufficiently regular, the standard algorithm in continuous time (dynamical system) is given by
$$\dot{x} = F_{GF}(x) \triangleq -\nabla f(x) \tag{1}$$
with $\dot{x} \triangleq \frac{d}{dt} x(t)$, known as the gradient flow (GF). Generalizing GF, the q-rescaled GF
(q-RGF) Wibisono et al. (2016), given by
$$\dot{x} = -c\, \frac{\nabla f(x)}{\|\nabla f(x)\|_2^{\frac{q-2}{q-1}}} \tag{2}$$
with $c > 0$ and $q \in (1, \infty]$, has an asymptotic convergence rate $f(x(t)) - f(x^\star) = O\left(\frac{1}{t^{q-1}}\right)$ under
mild regularity, for $\|x(0) - x^\star\| > 0$ small enough, where $x^\star \in \mathbb{R}^n$ denotes a local minimizer of $f$.
However, we recently proved Romero & Benosman (2020) that the q-RGF, as well as our proposed
q-signed GF (q-SGF)
$$\dot{x} = -c\, \|\nabla f(x)\|_1^{\frac{1}{q-1}} \operatorname{sign}(\nabla f(x)), \tag{3}$$
where sign(·) denotes the sign function, applied element-wise for (real-valued) vectors, are both
finite-time convergent, provided that $f$ is gradient dominated of order $p \in (1, q)$. In particular, if $f$ is
strongly convex, then q-RGF and q-SGF are finite-time convergent for any $q \in (2, \infty]$, since $f$ must
be gradient dominated of order $p = 2$.
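
The right-hand sides of (2) and (3) are cheap to evaluate once a gradient is available. The following minimal Python sketch illustrates them for finite q using plain lists; it is not the authors' implementation, and the names `sign`, `rgf_field`, and `sgf_field`, as well as the convention of returning zero at a stationary point, are assumptions made here for illustration.

```python
import math

def sign(v):
    """Sign of a real number: -1, 0, or 1."""
    return (v > 0) - (v < 0)

def rgf_field(grad, c=1.0, q=3.0):
    """Right-hand side of the q-RGF (2), given the gradient as a list of floats."""
    norm2 = math.sqrt(sum(g * g for g in grad))
    if norm2 == 0.0:
        # Assumption: the field is set to zero at a stationary point.
        return [0.0 for _ in grad]
    scale = c / norm2 ** ((q - 2.0) / (q - 1.0))
    return [-scale * g for g in grad]

def sgf_field(grad, c=1.0, q=3.0):
    """Right-hand side of the q-SGF (3), given the gradient as a list of floats."""
    norm1 = sum(abs(g) for g in grad)
    scale = c * norm1 ** (1.0 / (q - 1.0))
    return [-scale * sign(g) for g in grad]

grad = [3.0, -4.0]  # hypothetical gradient value
print(rgf_field(grad, q=3.0))
print(sgf_field(grad, q=3.0))
```

For q = 2 the exponent in (2) vanishes, so `rgf_field` reduces to the plain gradient-flow direction scaled by c.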

CONTRIBUTION

In this paper, we explore three discretization schemes for the q-RGF (2) and q-SGF (3) and provide
some convergence guarantees using results from hybrid dynamical control theory. In particular, we
explore a forward-Euler discretization of RGF/SGF, followed by an explicit Runge-Kutta discretization,
and finally a novel Nesterov-like discretization. We then test their performance on both synthetic
and real-world data in the context of deep learning, namely, over the well-known datasets CIFAR10,
SVHN, and MNIST.

RELATED WORK

Propelled by the work of Wang & Elia (2011) and Su et al. (2014), there has been a recent and
significant research effort dedicated to analyzing optimization algorithms from the perspective of
dynamical systems and control theory, especially in continuous time Wibisono et al. (2016); Wilson
(2018); Lessard et al. (2016); Fazlyab et al. (2017b); Scieur et al. (2017); França et al. (2018); Fazlyab
et al. (2018); Fazlyab et al. (2018); Taylor et al. (2018); França et al. (2019a); Orvieto & Lucchi
(2019); Muehlebach & Jordan (2019). A major focus within this initiative is on acceleration, both in
terms of trying to gain new insight into more traditional optimization algorithms from this perspective,
and in terms of exploiting the interplay between continuous-time systems and their potential discretizations
for novel algorithm design Muehlebach & Jordan (2019); Fazlyab et al. (2017a); Shi et al. (2018);
Zhang et al. (2018); França et al. (2019b); Wilson et al. (2019). Many of these papers also focus on
explicit mappings and matchings of convergence rates from the continuous-time domain into discrete
time.
For older work connecting ordinary differential equations (ODEs) and their numerical analysis, with
optimization algorithms, see Botsaris (1978a;b); Zghier (1981); Snyman (1982; 1983); Brockett
(1988); Brown (1989). In Helmke & Moore (1994), the authors studied relationships between linear
programming, ODEs, and general matrix theory. Further, Schropp (1995) and Schropp & Singer
(2000) explored several aspects linking nonlinear dynamical systems to gradient-based optimization,
including nonlinear constraints.
Tools from Lyapunov stability theory are often employed in these analyses, mainly because a rich
body of work already exists within the nonlinear systems and control theory community.
In particular, previous works typically seek asymptotically Lyapunov stable
gradient-based systems with an equilibrium (stationary point) at an isolated extremum of the given
cost function, thus certifying local convergence. Naturally, global asymptotic stability leads to global
convergence, though such an analysis will typically require the cost function to be strongly convex
everywhere.
For physical systems, a Lyapunov function can often be constructed from first principles via some
physically meaningful measure of energy (e.g., total energy = potential energy + kinetic energy). In
optimization, the situation is somewhat similar in the sense that a suitable Lyapunov function may
often be constructed by taking simple surrogates of the objective function as candidates. For instance,
$V(x) \triangleq f(x) - f(x^\star)$ can be a good initial candidate. Further, if $f$ is continuously differentiable
and $x^\star$ is an isolated stationary point, then another alternative is $V(x) \triangleq \|\nabla f(x)\|^2$. However, most
fundamental and applied research conducted in systems and control regarding Lyapunov stability
theory deals exclusively with continuous-time systems. Unfortunately, (dynamical) stability properties
are generally not preserved for simple forward-Euler and sample-and-hold discretizations and control
laws Stuart & Humphries (1998). Furthermore, practical implementations of optimization algorithms
in modern digital computers demand discrete-time. Nonetheless, it has been extensively noted that a
vast amount of general Lyapunov-based results appear to have a discrete-time equivalent.
In that sense, we aim here to start from the q-RGF and q-SGF continuous flows, characterized by
their Lyapunov-based finite-time convergence, and seek discretization schemes that allow us to
‘shadow’ the solutions of these flows in discrete time, hoping to achieve an acceleration of the discrete
methods inspired by the finite-time convergence characteristics of the continuous flows.

2 OPTIMIZATION ALGORITHMS AS DISCRETE-TIME SYSTEMS

Generalizing (1), (2), and (3), consider a continuous-time algorithm (dynamical system) modeled via
an ordinary differential equation (ODE)
$$\dot{x} = F(x) \tag{4}$$
for $t \ge 0$, or, more generally, a differential inclusion
$$\dot{x}(t) \in \mathcal{F}(x(t)) \tag{5}$$
a.e. $t \ge 0$ (e.g. for the $q = \infty$ case), such that $x(t) \to x^\star$ as $t \to t^\star$. In the case of the q-RGF (2)
and q-SGF (3) for $f$ gradient dominated of order $p \in (1, q)$, we have finite-time convergence, and
thus $t^\star = t^\star(x(0)) < \infty$.
Most of the popular numerical optimization schemes can be written in a state-space form (i.e.,
recursively), as
$$X_{k+1} = F_d(k, X_k) \tag{6a}$$
$$x_k = G(X_k) \tag{6b}$$
for $k \in \mathbb{Z}_+ \triangleq \{0, 1, 2, \ldots\}$ and a given $X_0 \in \mathbb{R}^m$ (typically $m \ge n$), where $F_d : \mathbb{Z}_+ \times \mathbb{R}^m \to \mathbb{R}^m$
and $G : \mathbb{R}^m \to \mathbb{R}^n$.
Naturally, (6) can be seen as a discrete-time dynamical system constructed by discretizing (4) in
time. In particular, we have $x_k \approx x(t_k)$, where $\{0 = t_0 < t_1 < t_2 < \ldots\}$ denotes a time partition
and $x(\cdot)$ a solution to (4) or (5) as appropriate. Therefore, we call $X_k$ and $x_k$, respectively, the state
and output at time step $k$. Whenever $F_d(k, X)$ does not depend on $k$, we will drop $k$ and thus write
$F_d(X)$ instead. Whenever $m = n$, we will take $G(X) \triangleq X$ and replace $X$ and $X_k$ by $x$ and
$x_k$, respectively.
Example 1. The standard gradient descent (GD) algorithm
$$x_{k+1} = x_k - \eta \nabla f(x_k) \tag{7}$$
with step size (learning rate) $\eta > 0$ can be readily written in the form (6) by taking $m = n$,
$F_d(x) \triangleq x - \eta \nabla f(x)$, and $G(x) \triangleq x$.

• If the step sizes are adaptive, i.e. if we replace $\eta$ by a sequence $\{\eta_k\}$ with $\eta_k > 0$, then we
only need to replace $F_d(k, x) \triangleq x - \eta_k \nabla f(x)$, provided that $\{\eta_k\}$ is not computed using
feedback from $\{x_k\}$ (e.g. through a line search method).
• If we do wish to use feedback$^1$ (and no memory past the most recent output and step size),
then we can set $m = n + 1$, $G([x; \eta]) \triangleq x$, and $F_d([x; \eta]) \triangleq [F_d^{(1)}([x; \eta]); F_d^{(2)}([x; \eta])]$,
where $F_d^{(1)}([x; \eta]) \triangleq x - \eta \nabla f(x)$, and $F_d^{(2)}$ is a user-defined function that dictates the
updates in the step size. In particular, an open-loop (no feedback) adaptive step size $\{\eta_k\}$ may
also be achieved under this scenario, provided that it is possible to write $\eta_{k+1} = F_d^{(2)}(\eta_k)$.
If this is not possible (and still open-loop step size), then we may take $F_d^{(2)}(k, X) \triangleq \eta_{k+1}$,
and of course add a $k$-argument in $F_d$.
• If we wish to use individual step sizes for each of the $n$ components of $\{x_k\}$, then it suffices
to take $\eta_k$ as an $n$-dimensional vector (thus $m = 2n$), and make appropriate changes in $F_d$
and $G$.

In each of these cases, GD can be seen as a forward-Euler discretization of the GF (1), i.e.,
$$x_{k+1} = x_k + \Delta t_k F_{GF}(x_k) \tag{8}$$
with $F_{GF} = -\nabla f$ and adaptive time step $\Delta t_k \triangleq t_{k+1} - t_k$ chosen as $\Delta t_k = \eta_k$.
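
To make the state-space form (6) concrete, the following minimal Python sketch writes GD with a fixed step size as a pair (F_d, G) with m = n and checks that one GD step coincides with one forward-Euler step (8) of the GF. The quadratic test cost and the names `make_gd_map`, `iterate`, and `euler_step` are hypothetical choices made for illustration.

```python
def make_gd_map(grad_f, eta):
    """State-transition map F_d(x) = x - eta * grad f(x) of (7), with m = n."""
    def F_d(x):
        return [xi - eta * gi for xi, gi in zip(x, grad_f(x))]
    return F_d

def G(X):
    """Output map: with m = n the state is the output."""
    return X

def iterate(F_d, G, X0, num_steps):
    """Run X_{k+1} = F_d(X_k), x_k = G(X_k) and collect the outputs."""
    X, outputs = X0, [G(X0)]
    for _ in range(num_steps):
        X = F_d(X)
        outputs.append(G(X))
    return outputs

def grad_f(x):
    """Gradient of the hypothetical cost f(x) = 0.5 * (x1^2 + 10 * x2^2)."""
    return [x[0], 10.0 * x[1]]

def euler_step(x, dt):
    """Forward-Euler step (8) of the gradient flow with F_GF = -grad f."""
    return [xi + dt * (-gi) for xi, gi in zip(x, grad_f(x))]

outputs = iterate(make_gd_map(grad_f, eta=0.1), G, [1.0, 1.0], 50)
print(outputs[1], euler_step([1.0, 1.0], 0.1))
```

An adaptive open-loop step size would only require passing k to `F_d`, as described in the first bullet above.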
Example 2. The proximal point algorithm (PPA)
$$x_{k+1} = \arg\min_{x \in \mathbb{R}^n} \left\{ f(x) + \frac{1}{2\eta_k} \|x - x_k\|^2 \right\} \tag{9}$$
with step size $\eta_k > 0$ (open loop, for simplicity) can also be written in the form (6), by taking $m = n$,
$F_d(k, x) \triangleq \arg\min_{x' \in \mathbb{R}^n} \{ f(x') + \frac{1}{2\eta_k} \|x' - x\|^2 \}$, and $G(x) \triangleq x$. Naturally, we need to assume
sufficient regularity for $F_d(k, x)$ to exist and we must design a consistent way to choose $F_d(k, x)$
when multiple minimizers exist in the definition of $F_d(k, x)$. Alternatively, these conditions must
be satisfied, at the very least, at every $(k, x) \in \{(0, x_0), (1, x_1), (2, x_2), \ldots\}$ for a particular chosen
initial $x_0 \in \mathbb{R}^n$.
Assuming sufficient regularity, we have $\nabla_x \{ f(x) + \frac{1}{2\eta_k} \|x - x_k\|^2 \} |_{x = x_{k+1}} = 0$, and thus
$$\nabla f(x_{k+1}) + \frac{1}{\eta_k} (x_{k+1} - x_k) = 0 \iff x_{k+1} = x_k + \Delta t_k F_{GF}(x_{k+1}) \tag{10}$$
with $\Delta t_k = \eta_k$, which is precisely the backward-Euler discretization of the GF (1).
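
The implicit relation (10) can be checked on a case where the proximal step has a closed form. The sketch below assumes the hypothetical scalar cost f(x) = 0.5·a·x², for which minimizing (9) gives x_{k+1} = x_k / (1 + ηa); the names `ppa_step` and `gd_step` are illustrative only. With the deliberately large step chosen here, forward Euler (GD) diverges because |1 − ηa| = 4 > 1, while backward Euler (PPA) still contracts.

```python
def ppa_step(x, eta, a):
    """Proximal point step (9) for f(x) = 0.5 * a * x**2 (closed form)."""
    return x / (1.0 + eta * a)

def gd_step(x, eta, a):
    """Forward-Euler (GD) step for the same cost, for comparison."""
    return x - eta * a * x

a, eta = 10.0, 0.5   # hypothetical curvature and (large) step size
x_ppa = x_gd = 1.0
for _ in range(20):
    x_ppa = ppa_step(x_ppa, eta, a)
    x_gd = gd_step(x_gd, eta, a)

# Backward-Euler identity (10): x_next = x + eta * F_GF(x_next), F_GF(x) = -a * x
x_next = ppa_step(1.0, eta, a)
print(x_next, 1.0 + eta * (-a * x_next))
print(x_ppa, x_gd)
```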
$^1$ Also known as closed-loop design in control-theoretic and reinforcement learning terminology, meaning
that $\eta_k = \varphi(k, x_k)$ for some $\varphi : \mathbb{Z}_+ \times \mathbb{R}^n \to \mathbb{R}_+$ that does not depend on $\{X_0, X_1, X_2, \ldots\}$. On the other
hand, open-loop design can be seen as closed loop with $\varphi(k, \cdot)$ constant for each $k \in \mathbb{Z}_+$.


Example 3. The Nesterov accelerated gradient descent (N-AGD)
$$y_k = x_k + \beta_k (x_k - x_{k-1}) \tag{11a}$$
$$x_{k+1} = y_k - \eta_k \nabla f(y_k) \tag{11b}$$
with step size $\eta_k > 0$ and momentum coefficient $\beta_k > 0$ (both open loop, for simplicity), can be
written in the form (6) by taking $m = 2n$,
$$F_d(k, [y; x]) \triangleq \begin{bmatrix} (1 + \beta_{k+1})(y - \eta_k \nabla f(y)) - \beta_{k+1} x \\ y - \eta_k \nabla f(y) \end{bmatrix} \tag{12}$$
and $G([y; x]) \triangleq x$ for $y, x \in \mathbb{R}^n$. In other words, $X_k = [y_k; x_k]$. Traditionally, $\beta_k = \frac{k-1}{k+2}$, but
clearly, if we set $\eta_k = \eta > 0$ and $\beta_k = \beta \in (0, 1)$ (in practice, $\eta \approx 0$ and $\beta \approx 1$), then we can drop $k$
from $F_d(k, [y; x])$.
There exist a few approaches in the literature on the interpretation of N-AGD (11) as the
discretization of a second-order continuous-time dynamical system, namely via a vanishing step size
argument Su et al. (2014), or via symplectic Euler schemes of crafted Hamiltonian systems
Muehlebach & Jordan (2019); França et al. (2019b).
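
The state-space map (12) can be checked against the two-term recursion (11) numerically. The sketch below assumes constant η and β, a hypothetical quadratic cost, and the initialization x_{-1} = x_0 (so that y_0 = x_0); the names `make_nagd_map` and `nagd_direct` are illustrative.

```python
def make_nagd_map(grad_f, eta, beta):
    """State-transition map (12) with constant eta and beta; the state is X = (y, x)."""
    def F_d(X):
        y, x = X
        x_next = [yi - eta * gi for yi, gi in zip(y, grad_f(y))]            # (11b)
        y_next = [(1.0 + beta) * xn - beta * xi for xn, xi in zip(x_next, x)]
        return (y_next, x_next)
    return F_d

def G(X):
    """Output map G([y; x]) = x."""
    return X[1]

def nagd_direct(grad_f, x0, eta, beta, num_steps):
    """Recursion (11) written directly, assuming x_{-1} = x_0."""
    x_prev, x, xs = x0, x0, [x0]
    for _ in range(num_steps):
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]            # (11a)
        x_prev, x = x, [yi - eta * gi for yi, gi in zip(y, grad_f(y))]      # (11b)
        xs.append(x)
    return xs

def grad_f(x):
    """Gradient of the hypothetical cost f(x) = 0.5 * (x1^2 + 10 * x2^2)."""
    return [x[0], 10.0 * x[1]]

eta, beta, x0 = 0.05, 0.5, [1.0, 1.0]
F_d = make_nagd_map(grad_f, eta, beta)
X = (x0, x0)                      # X_0 = [y_0; x_0] with y_0 = x_0
state_space = [G(X)]
for _ in range(100):
    X = F_d(X)
    state_space.append(G(X))
direct = nagd_direct(grad_f, x0, eta, beta, 100)
print(state_space[-1], direct[-1])
```

Both formulations produce the same output sequence up to floating-point rounding; the state-space version simply carries the look-ahead point y as part of the state.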

3 PROPOSED ALGORITHMS VIA DISCRETIZATION

In this section, we propose three classes of optimization algorithms via discretization of the q-RGF (2)
and q-SGF (3). But first, we review the necessary conditions to ensure finite-time convergence of
these flows.
Given $q \in (1, \infty]$, let $F_{q\text{-RGF}}(x)$ and $F_{q\text{-SGF}}(x)$ be defined, respectively, by the RHS of (2)
and (3). The hyperparameter $c > 0$ is not explicitly denoted in $F_{q\text{-RGF}}$, $F_{q\text{-SGF}}$. Next, borrowing
terminology from Wilson et al. (2019), we say that $f$ (assumed continuously differentiable) is
$\mu$-gradient dominated of order $p \in (1, \infty]$ (with $\mu > 0$) near the local minimizer $x^\star$ if
$$\frac{p-1}{p} \|\nabla f(x)\|^{\frac{p}{p-1}} \ge \mu^{\frac{1}{p-1}} (f(x) - f^\star) \tag{13}$$
for every $x \in \mathbb{R}^n$ near $x = x^\star$, where $f^\star = f(x^\star)$. When $\mu > 0$ is unknown or unimportant,
but known to exist, we will omit it in the previous definition. It can be proved that continuously
differentiable strongly convex functions are gradient dominated of order $p = 2$. Furthermore, if $f$ is
gradient dominated (of any order) w.r.t. $x^\star$, then $x^\star$ is an isolated stationary point of $f$.
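
As a sanity check of definition (13) in the familiar case p = 2, the sketch below evaluates the gap between the two sides on random points for a hypothetical strongly convex quadratic, taking μ as its smallest curvature. The cost, the sampling box, and the name `dominance_gap` are assumptions made for illustration.

```python
import math
import random

def dominance_gap(grad_norm, f_gap, mu, p):
    """Left-hand side minus right-hand side of (13); nonnegative when (13) holds."""
    lhs = (p - 1.0) / p * grad_norm ** (p / (p - 1.0))
    rhs = mu ** (1.0 / (p - 1.0)) * f_gap
    return lhs - rhs

a = [1.0, 10.0]   # hypothetical curvatures of f(x) = 0.5 * sum(a_i * x_i^2)
mu = min(a)       # strong convexity constant of this quadratic

def f(x):
    """Cost value; the minimum value f* is 0, attained at the origin."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad_norm(x):
    """Euclidean norm of the gradient (a_1 * x_1, a_2 * x_2)."""
    return math.sqrt(sum((ai * xi) ** 2 for ai, xi in zip(a, x)))

random.seed(0)
points = [[random.uniform(-2.0, 2.0) for _ in a] for _ in range(1000)]
gaps = [dominance_gap(grad_norm(x), f(x), mu, p=2.0) for x in points]
print(min(gaps))
```

For this quadratic the gap equals 0.5·Σ a_i (a_i − μ) x_i², so it is nonnegative everywhere and vanishes along the direction of smallest curvature.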
Remark 1. For strongly convex functions, gradient dominance of order $p = 2$ can be established. In
fact, gradient dominance is usually defined exclusively for order $p = 2$, often referred to as the
Polyak-Łojasiewicz (PL) inequality, which was introduced by Polyak (1963) to relax the (strong) convexity
assumption commonly used to show convergence of the GD algorithm (7). The PL inequality can also
be used to relax convexity assumptions of similar gradient and proximal-gradient methods Karimi et al.
(2016); Attouch & Bolte (2009). Our adopted generalized notion of gradient dominance is strongly
tied to the Łojasiewicz gradient inequality from real analytic geometry, established by Łojasiewicz
(1963; 1965)$^2$ independently and simultaneously from Polyak (1963), and generalizing the PL
inequality. More precisely, this inequality is typically written as
$$\|\nabla f(x)\| \ge C \cdot |f(x) - f^\star|^\theta \tag{14}$$
for every $x \in \mathbb{R}^n$ in a small enough open neighborhood of the stationary point $x = x^\star$, for some
$C > 0$ and $\theta \in \left[\frac{1}{2}, 1\right)$. This inequality is guaranteed for analytic functions Łojasiewicz (1965). More
precisely, when $x^\star$ is a local minimizer of $f$, the aforementioned relationship is explicitly given by
$$C = \left(\frac{p}{p-1}\right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}}, \qquad \theta = \frac{p-1}{p}. \tag{15}$$
Therefore, analytic functions are always gradient dominated. However, while analytic functions are
always smooth, smoothness is not required to attain gradient dominance.
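
The correspondence (15) between the two inequalities follows from a short rearrangement of (13). The worked steps below are a sketch that assumes a finite order p > 1 and f(x) ≥ f⋆ near the minimizer.

```latex
\begin{align*}
\frac{p-1}{p}\,\|\nabla f(x)\|^{\frac{p}{p-1}} \ge \mu^{\frac{1}{p-1}}\,(f(x)-f^\star)
&\iff \|\nabla f(x)\|^{\frac{p}{p-1}} \ge \frac{p}{p-1}\,\mu^{\frac{1}{p-1}}\,(f(x)-f^\star) \\
&\iff \|\nabla f(x)\| \ge \left(\frac{p}{p-1}\right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}}\,(f(x)-f^\star)^{\frac{p-1}{p}} .
\end{align*}
```

The last line is (14) with the constants of (15). For p = 2 it reads ‖∇f(x)‖ ≥ (2μ)^{1/2} (f(x) − f⋆)^{1/2}, which is the PL inequality.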
$^2$ For more modern treatments in English, see Łojasiewicz & Zurro (1999); Bolte et al. (2007).


We are now ready to state the finite-time convergence of the q-RGF (2) and q-SGF (3).
Theorem 1. Romero & Benosman (2020) Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable
and $\mu$-gradient dominated of order $p \in (1, \infty)$ near a strict local minimizer $x^\star \in \mathbb{R}^n$. Let $c > 0$ and
$q \in (p, \infty]$. Then, any maximal solution $x(\cdot)$, in the sense of Filippov, to the q-RGF (2) or q-SGF (3)
will converge in finite time to $x^\star$, provided that $\|x(0) - x^\star\| > 0$ is sufficiently small. More precisely,
$\lim_{t \to t^\star} x(t) = x^\star$, where the convergence time $t^\star < \infty$ may depend on which flow is used, but in both
cases is upper bounded by
$$t^\star \le \frac{\|\nabla f(x_0)\|^{\frac{1}{\theta} - \frac{1}{\theta'}}}{c\, C^{\frac{1}{\theta}} \left(1 - \frac{\theta}{\theta'}\right)}, \tag{16}$$
where $x_0 = x(0)$, $C = \left(\frac{p}{p-1}\right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}}$, $\theta = \frac{p-1}{p}$, and $\theta' = \frac{q-1}{q}$. In particular, given any compact
and positively invariant subset $S \subset D$, both flows converge in finite time with the aforementioned
convergence time upper bound (which can be tightened by replacing $D$ with $S$) for any $x_0 \in S$.
Furthermore, if $D = \mathbb{R}^n$, then we have global finite-time convergence, i.e. finite-time convergence of
any maximal solution (in the sense of Filippov) $x(\cdot)$ with arbitrary $x_0 \in \mathbb{R}^n$.

In essence, the analysis (introduced in Romero & Benosman (2020)) consists of leveraging the
gradient dominance to show that the energy function $E(t) \triangleq f(x(t)) - f^\star$ satisfies the Lyapunov-like
differential inequality $\dot{E}(t) = O(E(t)^\alpha)$ for some $\alpha < 1$. The detailed proof is recalled in Appendix
C for completeness.
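
For intuition, the argument can be sketched for the q-RGF (2) with finite q. The steps below assume a classical (differentiable) solution along which ∇f(x(t)) ≠ 0, use inequality (14) with θ = (p−1)/p, and write θ' = (q−1)/q; they are an informal outline under these assumptions, not a substitute for the Filippov-solution proof referenced above.

```latex
\begin{align*}
\dot{E}(t) &= \langle \nabla f(x), \dot{x} \rangle
            = -c\,\|\nabla f(x)\|_2^{\,2-\frac{q-2}{q-1}}
            = -c\,\|\nabla f(x)\|_2^{\frac{q}{q-1}}
            = -c\,\|\nabla f(x)\|_2^{\frac{1}{\theta'}} \\
           &\le -c\,C^{\frac{1}{\theta'}}\,E(t)^{\alpha},
            \qquad \alpha = \frac{\theta}{\theta'} < 1 \ \text{ since } p < q, \\
\frac{d}{dt}\,E(t)^{1-\alpha} &= (1-\alpha)\,E(t)^{-\alpha}\,\dot{E}(t)
            \le -(1-\alpha)\,c\,C^{\frac{1}{\theta'}}, \\
E(t)^{1-\alpha} &\le E(0)^{1-\alpha} - (1-\alpha)\,c\,C^{\frac{1}{\theta'}}\,t .
\end{align*}
```

Since the right-hand side of the last line reaches zero at t = E(0)^{1−α} / ((1−α) c C^{1/θ'}), the energy must vanish no later than that time, which is the finite-time behavior that an exponent α < 1 makes possible.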

3.1 MAIN RESULT 1: DISCRETIZATION ALGORITHMS AND THEIR CONVERGENCE ANALYSIS

3.1.1 FORWARD-EULER DISCRETIZATION

First, we propose a simple forward-Euler discretization of the finite-time convergent flows
$$x_{k+1} = x_k + \eta F(x_k), \quad \eta > 0 \tag{17}$$
where $F \in \{F_{q\text{-RGF}}, F_{q\text{-SGF}}\}$. We show later, in Theorem 2, that this simple method leads, for
small enough $\eta > 0$, to solutions that are $\epsilon$-close to the finite-time flows.
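
A minimal Python sketch of the forward-Euler scheme (17) applied to the q-RGF field is given below. The quadratic cost, the values c = 1, q = 3, η = 0.01, and the helper names `rgf_field` and `forward_euler` are hypothetical choices for illustration, not the settings used in the experiments.

```python
import math

def rgf_field(grad, c, q):
    """Right-hand side of the q-RGF (2); zero at a stationary point (assumption)."""
    norm2 = math.sqrt(sum(g * g for g in grad))
    if norm2 == 0.0:
        return [0.0 for _ in grad]
    scale = c / norm2 ** ((q - 2.0) / (q - 1.0))
    return [-scale * g for g in grad]

def forward_euler(F, x0, eta, num_steps):
    """Iterate x_{k+1} = x_k + eta * F(x_k), as in (17), and return all iterates."""
    xs = [list(x0)]
    for _ in range(num_steps):
        x = xs[-1]
        xs.append([xi + eta * Fi for xi, Fi in zip(x, F(x))])
    return xs

def f(x):
    """Hypothetical cost f(x) = 0.5 * (x1^2 + 10 * x2^2)."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 10.0 * x[1]]

def F(x):
    """q-RGF field evaluated at x, with c = 1 and q = 3."""
    return rgf_field(grad_f(x), c=1.0, q=3.0)

xs = forward_euler(F, [1.0, 1.0], eta=0.01, num_steps=1000)
print(f(xs[0]), f(xs[-1]))
```

With a fixed step size the iterates do not settle exactly at the minimizer; they end up oscillating in a small neighborhood whose size shrinks with η, which is consistent with the ε-closeness statement above.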

3.1.2 EXPLICIT RUNGE-KUTTA DISCRETIZATION

We propose to use the following discretization
$$x_{k+1} = x_k + \eta \sum_{i=1}^{K} \alpha_i F(y_k^i), \quad y_k^1 = x_k, \quad \sum_{i=1}^{K} \alpha_i = 1, \tag{18a}$$
$$y_k^i = x_k + \eta \sum_{j=1}^{i-1} \beta_j F(y_k^j), \quad i > 1, \tag{18b}$$
for $\eta > 0$, $K \in \{1, 2, 3, \ldots\}$, and $F \in \{F_{q\text{-RGF}}, F_{q\text{-SGF}}\}$. This method is well known to be
numerically stable under the consistency condition $\sum_{i=1}^{K} \alpha_i = 1$, Stuart & Humphries (1996).
However, in our optimization framework, we want to be able to guarantee that the stable numerical
solution of (18) remains close to the solution of the continuous flows. In other words, we seek
arbitrarily small global error, also known as shadowing. This will be discussed in Theorem 2.
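
One step of (18) can be sketched in Python as follows, for a generic field F and user-supplied weights. The function name `explicit_rk_step` and the example coefficients (K = 2, α = (1/2, 1/2), β_1 = 1, which gives Heun's method) are illustrative assumptions; the linear field used in the usage lines is only there to make the arithmetic easy to follow.

```python
def explicit_rk_step(F, x, eta, alphas, betas):
    """One step of (18): stage i uses the field values of stages 1, ..., i-1."""
    stages = [F(x)]                                   # F(y^1) with y^1 = x_k
    for i in range(1, len(alphas)):
        y = [xd + eta * sum(betas[j] * stages[j][d] for j in range(i))
             for d, xd in enumerate(x)]               # (18b)
        stages.append(F(y))
    return [xd + eta * sum(a * s[d] for a, s in zip(alphas, stages))
            for d, xd in enumerate(x)]                # (18a)

def linear_field(x):
    """Hypothetical test field F(x) = -x."""
    return [-v for v in x]

heun = explicit_rk_step(linear_field, [1.0], 0.1, alphas=[0.5, 0.5], betas=[1.0])
euler = explicit_rk_step(linear_field, [1.0], 0.1, alphas=[1.0], betas=[])
print(heun, euler)
```

With K = 1 the scheme reduces to the forward-Euler step (17); replacing `linear_field` by the q-RGF or q-SGF field composed with the gradient gives the discretization proposed here.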

3.1.3 NESTEROV-LIKE DISCRETIZATION

First, we rewrite Nesterov’s accelerated GD as
$$x_{k+1} = x_k + \beta y_k - \eta \nabla f(x_k + \beta y_k) \tag{19a}$$
$$y_{k+1} = x_{k+1} - x_k, \tag{19b}$$
where $y_k$ now serves as a momentum term. We argue that Nesterov’s acceleration can be interpreted
as applying the discretization given by (19) to the GF (1), i.e., by seeing the term $-\eta \nabla f(x_k + \beta y_k)$
as a mapping applied to the GF flow (1) at $x_k + \beta y_k$, namely as $\eta F_{GF}(x_k + \beta y_k)$.


Therefore, given any optimization flow represented by the continuous-time system $\dot{x} = F(x)$, locally
convergent to a local minimizer $x^\star \in \mathbb{R}^n$ of a cost function $f : \mathbb{R}^n \to \mathbb{R}$, we can replicate Nesterov’s
acceleration of (1). More precisely, we obtain the algorithm
$$x_{k+1} = x_k + \eta F(x_k + \beta y_k) + \beta y_k \tag{20a}$$
$$y_{k+1} = x_{k+1} - x_k. \tag{20b}$$
Based on this idea, we propose two ‘Nesterov-like’ discrete optimization algorithms. The first one,
based on the q-RGF continuous flow, is defined as:
$$x_{k+1} = x_k + \eta \left( -c\, \frac{\nabla f(x_k + \beta y_k)}{\|\nabla f(x_k + \beta y_k)\|_2^{\frac{q-2}{q-1}}} \right) + \beta y_k \tag{21a}$$
$$y_{k+1} = x_{k+1} - x_k. \tag{21b}$$
The second algorithm is based on the q-SGF continuous flow, and is given by:
$$x_{k+1} = x_k + \eta \left( -c\, \|\nabla f(x_k + \beta y_k)\|_1^{\frac{1}{q-1}} \operatorname{sign}(\nabla f(x_k + \beta y_k)) \right) + \beta y_k \tag{22a}$$
$$y_{k+1} = x_{k+1} - x_k. \tag{22b}$$
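
The generic update (20) is short enough to sketch directly. In the Python example below, `nesterov_like_step` takes an arbitrary field F; passing the gradient-flow field recovers (19), while passing the q-RGF or q-SGF field composed with the gradient would give (21) or (22). The quadratic cost, the values η = 0.05 and β = 0.5, and all function names are hypothetical.

```python
def nesterov_like_step(F, x, y, eta, beta):
    """One step of (20): F is evaluated at the look-ahead point x + beta * y."""
    look_ahead = [xi + beta * yi for xi, yi in zip(x, y)]
    F_val = F(look_ahead)
    x_next = [xi + eta * Fi + beta * yi for xi, Fi, yi in zip(x, F_val, y)]   # (20a)
    y_next = [xn - xi for xn, xi in zip(x_next, x)]                           # (20b)
    return x_next, y_next

def grad_f(x):
    """Gradient of the hypothetical cost f(x) = 0.5 * (x1^2 + 10 * x2^2)."""
    return [x[0], 10.0 * x[1]]

def gf_field(x):
    """Gradient-flow field F_GF = -grad f, which turns (20) into (19)."""
    return [-g for g in grad_f(x)]

x, y = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    x, y = nesterov_like_step(gf_field, x, y, eta=0.05, beta=0.5)
print(x)
```

Keeping the field as an argument makes it easy to swap flows without touching the momentum logic.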

3.1.4 CONVERGENCE ANALYSIS

We present here some convergence results for the three proposed discretizations. The analysis
summarized in Theorem 2 is based on tools from hybrid control theory, and is detailed in Appendix
D.$^3$
Theorem 2. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable, locally $L_f$-Lipschitz, and
$\mu$-gradient dominated of order $p \in (2, \infty)$ in a compact neighborhood $S$ of a strict local minimizer
$x^\star \in \mathbb{R}^n$. Let $c > 0$ and $q \in (p, \infty]$. Then, for a given initial condition $x_0 \in S$ and any maximal solution
$x(t)$, $x(0) = x_0$ (in the sense of Filippov), to the q-RGF (2) or the q-SGF (3), there
exists an arbitrarily small $\epsilon > 0$ such that the solution $x_k$ of any of the discrete algorithms (17),
(18), (21), or (22), with sufficiently small $\eta > 0$, is $\epsilon$-close to $x(t)$, i.e., $\|x_k - x\| \le \epsilon$, and s.t. the
following convergence bound holds
$$\|f(x_k) - f(x^\star)\| \le L_f \epsilon + \left[ (f(x_0) - f(x^\star))^{(1-\alpha)} - \tilde{c}(1-\alpha)\eta k \right]^{\frac{1}{1-\alpha}}, \quad L_f > 0, \; k \le k^\star, \tag{23}$$
where $\alpha = \frac{\theta}{\theta'}$, $\theta = \frac{p-1}{p}$, $\theta' = \frac{q-1}{q}$, $\tilde{c} = c \left( \left(\frac{p}{p-1}\right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}} \right)^{\frac{1}{\theta'}}$, and $k^\star = \frac{(f(x_0) - f(x^\star))^{(1-\alpha)}}{\tilde{c}(1-\alpha)\eta}$.

Theorem 2 thus shows that $\epsilon$-convergence of $x_k \to x^\star$ can be achieved in a finite number of
steps upper bounded by $k^\star = \frac{(f(x_0) - f(x^\star))^{(1-\alpha)}}{\tilde{c}(1-\alpha)\eta}$. This is a preliminary convergence result, which is
meant to show the existence of discrete solutions obtained via the proposed discretization algorithms,
which are $\epsilon$-close to the continuous solutions of the finite-time flows. We also underline here that
after $x_k$ reaches an $\epsilon$-neighborhood of $x^\star$, then $x_{k+1} \approx x_k$, $\forall k > k^\star$, since $x^\star$ is an equilibrium
point of the continuous flows; see Definition 2 in Appendix B.
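
The constants appearing in the bound (23) are simple to evaluate. The Python sketch below computes α, c̃, and k⋆ for finite q from the formulas of Theorem 2; the function names, the example values (p = 3, q = 4, μ = 1, c = 1, η = 0.01, initial gap 2), and the clipping of the bracketed term at zero for k > k⋆ are assumptions made for illustration.

```python
def theorem2_constants(p, q, mu, c, eta, f_gap0):
    """Return (alpha, c_tilde, k_star) as defined after (23), for finite q."""
    theta = (p - 1.0) / p
    theta_prime = (q - 1.0) / q
    alpha = theta / theta_prime
    C = (p / (p - 1.0)) ** theta * mu ** (1.0 / p)
    c_tilde = c * C ** (1.0 / theta_prime)
    k_star = f_gap0 ** (1.0 - alpha) / (c_tilde * (1.0 - alpha) * eta)
    return alpha, c_tilde, k_star

def decay_term(k, alpha, c_tilde, eta, f_gap0):
    """Bracketed term of (23); clipped at zero past k_star (an assumption here)."""
    base = f_gap0 ** (1.0 - alpha) - c_tilde * (1.0 - alpha) * eta * k
    return max(base, 0.0) ** (1.0 / (1.0 - alpha))

alpha, c_tilde, k_star = theorem2_constants(p=3.0, q=4.0, mu=1.0, c=1.0,
                                            eta=0.01, f_gap0=2.0)
print(alpha, c_tilde, k_star)
print(decay_term(0, alpha, c_tilde, 0.01, 2.0))
```

The step bound k⋆ scales as 1/η, so halving the step size doubles the number of steps allowed by the bound.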

4 MAIN RESULT 2: NUMERICAL EXPERIMENTS

4.1 NUMERICAL TESTS ON AN ACADEMIC EXAMPLE

Let us first show on a simple numerical example that the acceleration in convergence, proven
in continuous time for a certain range of the hyperparameters, can translate to some convergence
acceleration in discrete time, as shown in Theorem 2. We consider the Rosenbrock function
$f : \mathbb{R}^2 \to \mathbb{R}$, given by $f(x_1, x_2) = (a - x_1)^2 + b(x_2 - x_1^2)^2$, with parameters $a, b \in \mathbb{R}$. This function
admits exactly one stationary point $(x_1^\star, x_2^\star) = (a, a^2)$ for $b \ge 0$, and is locally strongly convex,
hence locally satisfies gradient dominance of order $p = 2$, which allows us to select $q > 2$ in q-RGF
and q-SGF to achieve finite-time convergence in continuous time. We report in Figure 1 the mean
performance of all three discretizations for q-RGF and q-SGF$^4$ with fixed step size$^5$, for several
values of $q$, for 10 random initial conditions in [0, 2]. We observe for all three discretizations that,
as expected from the continuous flow analysis, for $q$ close to 2, q-RGF behaves similarly to GD in
terms of convergence rate, whereas for $q > 2$ the finite-time convergence in continuous time seems to
translate to some acceleration in this simple discretization method. Similarly for q-SGF, $q$ closer to 2
translates to less accelerated algorithms, with a behavior similar to GD, whereas larger $q$ values lead
to accelerated convergence.

$^3$ Note that there might be several ways of approaching this proof. For instance, one could follow the general
results on stochastic approximation of set-valued dynamical systems, using the notion of perturbed solutions to
differential inclusions presented in Benaïm et al. (2005).

[Figure 1 panels, each plotting f(x_k) − f* against the iteration k (0 to 200) with fixed step size. Panel 1: GD versus RGF with Euler discretization for q = 2.2, 3, 6, 10. Panel 2: GD with Nesterov acceleration versus SGF with Nesterov-like discretization for q = 2.2, 3, 6, 10. Panel 3: GD with Euler discretization versus RGF with Runge-Kutta discretization for q = 2.2, 3, 6, 10.]

Figure 1: Example of the proposed discretization algorithms of finite-time q-RGF and q-SGF
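
A setup of this kind can be sketched in a few lines of Python. The example below compares GD with the forward-Euler discretization of the q-RGF on the Rosenbrock function; the parameter values a = b = 1, the initial point (0, 2), the step size 0.01, q = 3, c = 1, and all function names are assumptions made for illustration and are not the settings behind Figure 1.

```python
import math

def rosenbrock(x, a=1.0, b=1.0):
    """Rosenbrock function f(x1, x2) = (a - x1)^2 + b * (x2 - x1^2)^2."""
    return (a - x[0]) ** 2 + b * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x, a=1.0, b=1.0):
    """Gradient of the Rosenbrock function."""
    r = x[1] - x[0] ** 2
    return [-2.0 * (a - x[0]) - 4.0 * b * x[0] * r, 2.0 * b * r]

def rgf_field(grad, c, q):
    """Right-hand side of the q-RGF (2); zero at a stationary point (assumption)."""
    norm2 = math.sqrt(sum(g * g for g in grad))
    if norm2 == 0.0:
        return [0.0 for _ in grad]
    scale = c / norm2 ** ((q - 2.0) / (q - 1.0))
    return [-scale * g for g in grad]

def run(field, x0, eta, num_steps):
    """Forward-Euler iteration; returns the cost value at every iterate."""
    x, values = list(x0), [rosenbrock(x0)]
    for _ in range(num_steps):
        x = [xi + eta * di for xi, di in zip(x, field(x))]
        values.append(rosenbrock(x))
    return values

gd = run(lambda x: [-g for g in rosenbrock_grad(x)], [0.0, 2.0], 0.01, 200)
rgf = run(lambda x: rgf_field(rosenbrock_grad(x), 1.0, 3.0), [0.0, 2.0], 0.01, 200)
print(gd[-1], rgf[-1])
```

Averaging such runs over several random initial points and values of q would give curves comparable in spirit to the panels described above.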

4.2 NUMERICAL EXPERIMENTS ON REAL-WORLD DATA

We report here the results of our experiments using deep neural network (DNN) training on three
well-known datasets, namely, CIFAR10, MNIST, and SVHN. We report results on CIFAR10 and
SVHN in the sequel, while results on MNIST can be found in Appendix E. We use the PyTorch
platform to conduct all the tests reported in this paper, and do not use GPUs. We underline here that
DNNs are non-convex globally; however, one could assume at least local convexity, hence local
gradient dominance of order $p = 2$. Thus, we select $q > 2$ in our experiments (see Remark 2 in
Appendix E for more explanations on the choice of $q$).

4.2.1 EXPERIMENT ON CIFAR10

In this experiment, we use the proposed algorithms to train a VGG16 CNN model with cross entropy
loss, e.g., Simonyan & Zisserman (2015), on the CIFAR10 dataset. We divided the dataset into a
training set spread over 50 batches of 1000 images each, and a test set of 10 batches with 1000 images
each. We ran 20 epochs of training over all the training batches. Since Nesterov accelerated GD is
one of the most efficient methods in DNN applications, to conduct fair comparisons we implemented
our Nesterov-like discretization of q-RGF ($c = 1$, $q = 3$, $\eta = 0.04$, $\mu = 0.9$), and the Nesterov-like
discretization of q-SGF ($c = 10^{-3}$, $q = 3$, $\eta = 0.04$, $\mu = 0.9$). We compare against the
mainstream algorithms$^6$, such as Nesterov’s accelerated gradient descent (GD), Adam, Adaptive
gradient (AdaGrad), per-dimension learning rate method for gradient descent (AdaDelta), and Root
Mean Square Propagation (RMSprop)$^7$. Note that all algorithms have been implemented in their
stochastic version$^8$, i.e., using mini-batches, with constant step size. For instance, in
Figures 2 and 3$^9$, we see the training loss for the different optimization algorithms. We notice that the
proposed algorithms, RGF and SGF, quickly separate from the GD and RMSprop algorithms in terms
of convergence speed, and also end up with better overall performance on the test set: 84% vs.
83% for GD and RMSprop. We also note that other algorithms, such as AdaGrad and AdaDelta, behave
similarly to RGF in terms of convergence speed, but lag behind in terms of final performance, at 75%
and 68%, respectively. Finally, in Figure 3, we notice that Adam is slower in terms of computation
time than SGF and RGF, with an average lag of 8 min and 80 min, respectively. However, to be fair,
one has to underline that Adam is an adaptive method that relies on the extra computations and memory
overhead of lower-order moment estimates and bias correction, Kingma & Ba (2015). Furthermore, to
better compare the performance of these algorithms, we report in Figure 2 the loss on the test dataset
over the learning iterations. We confirm that RGF and SGF perform better than GD, RMSprop, and
Adam, while avoiding the overfitting observed with AdaGrad and AdaDelta.

$^4$ To avoid overloading the figures we had to choose one flow at a time, either q-RGF or q-SGF. More results
can be found in Appendix E.
$^5$ We did multiple iterations to find the best step size for each algorithm (best values were between $10^{-4}$ and
$10^{-2}$ depending on the algorithm). Details of the step size for each test are given in Appendix E.

Figure 2: Losses for several optimization algorithms- VGG-16- CIFAR10: Train loss (left), test loss
(right)- We add an S before the name of an algorithm to denote its stochastic implementation.

Figure 3: Training loss vs. computation time for several optimization algorithms- VGG-16- CIFAR10

$^6$ We ran several tests by trying to optimally tune the parameters of each algorithm on a validation set (tenth
of training set), and we are reporting the best final accuracy performance we could obtain for each one. We have
implemented all algorithms with the same total number of iterations, so that we can compare the convergence
speed of all algorithms against a common iteration range. Details of the hyper-parameter values are given in
Appendix.
$^7$ Original reference for each method can be found in: https://pytorch.org/docs/stable/optim.html
$^8$ We want to underline here that our first tests were done in the deterministic setting; however, to compare the
proposed optimization methods against the best optimization algorithms available for DNN training, we decided
to also conduct comparison tests in the stochastic setting. Since the results remain qualitatively unchanged, we
only report here the results for the stochastic setting.
$^9$ To avoid overloading the figures we reported only the computation time plots of the three most competitive
methods: RGF, SGF and Adam.


4.2.2 EXPERIMENTS ON SVHN DATASET

We test the proposed algorithms to train the same VGG16 CNN model with cross entropy loss on
the SVHN dataset. We divided the dataset into a training set of 70 batches with 1000 images each,
and a test set of 10 batches of 1000 images each, and ran 20 epochs of training over all the training batches.
We tested the Nesterov-like discretization of q-RGF ($c = 1$, $q = 3$, $\eta = 0.04$, $\mu = 0.09$), and the
Nesterov-like discretization of q-SGF ($c = 10^{-3}$, $q = 11$, $\eta = 0.04$, $\mu = 0.09$) against Nesterov’s
accelerated gradient descent (GD) and Adam$^{10}$. From Figures 4 and 5, it is clear that RGF and
SGF give good performance in terms of convergence speed and final test performance (93%). We
can also observe in Figure 5 that SGF and RGF are faster than GD, and that all three methods are faster
than Adam (on average, 41 min for GD, 58 min for SGF, and 75 min for RGF), as expected since Adam is an
adaptive scheme with more computation steps (see our discussion of Adam in Section 4.2.1). More
numerical results on MNIST, and on SVHN using Euler and Runge-Kutta discretizations, showing
similar qualitative results, can be found in Appendix E.

Figure 4: Losses for several optimization algorithms - CNN- SVHN: Train loss (left), test loss (right)

Figure 5: Training loss vs. computation time for several optimization algorithms- VGG-16- SVHN

5 CONCLUSION
We studied connections between optimization algorithms and continuous-time representations
(dynamical systems) via discretization. We then reviewed two families of non-Lipschitz or discontinuous
first-order optimization flows for continuous-time optimization, namely the q-RGF and q-SGF, whose
distinguishing characteristic is their finite-time convergence. We then proposed three discretization
methods for these flows, namely a forward-Euler discretization, followed by an explicit Runge-Kutta
discretization, and finally a novel Nesterov-like discretization. Based on tools from hybrid systems
control theory, we proved a convergence bound for these algorithms. Finally, we conducted
numerical experiments on known deep neural net benchmarks, which showed that the proposed discrete
algorithms can outperform some state-of-the-art algorithms, when tested on large DNN models.
$^{10}$ We also tested Adaptive gradient (AdaGrad), per-dimension learning rate method for gradient descent
(AdaDelta), and Root Mean Square Propagation (RMSprop). However, since their performance was not
competitive we decided not to report the plots to avoid overloading the figures.


R EFERENCES
Hedy Attouch and Jerome Bolte. On the convergence of the proximal algorithm for nonsmooth
functions involving analytic features. Mathematical Programming B, 116(1):5–16, 2009.
Andrea Bacciotti and Francesca Ceragioli. Stability and stabilization of discontinuous systems and
nonsmooth lyapunov functions. ESAIM: Control, Optimisation and Calculus of Variations, 4:
361–376, 1999.
Michel Benaı̈m, Josef Hofbauer, and Sylvain Sorin. Stochastic approximations and differential
inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.
Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth suban-
alytic functions with applications to subgradient dynamical systems. Society for Industrial and
Applied Mathematics, 17:1205–1223, January 2007.
C.A. Botsaris. A class of methods for unconstrained minimization based on stable numerical
integration techniques. Journal of Mathematical Analysis and Applications, 63(3):729–749, 1978a.
C.A. Botsaris. Differential gradient methods. Journal of Mathematical Analysis and Applications, 63
(1):177–198, 1978b.
R.W. Brockett. Dynamical systems that sort lists, diagonalize matrices and solve linear programming
problems. In IEEE Conference on Decision and Control, pp. 799–803, 1988.
A.A. Brown. Some effective methods for unconstrained optimization based on the solution of
systems of ordinary differential equations. Journal of Optimization Theory and Applications, 62
(2):211–224, August 1989.
Frank H. Clarke. Generalized gradients of Lipschitz functionals. Advances in Mathematics, 40(1):
52–67, 1981.
Jorge Cortés. Finite-time convergent gradient flows with applications to network consensus. Auto-
matica, 42(11):1993–2000, November 2006.
Jorge Cortés. Discontinuous dynamical systems. IEEE Control Systems Magazine, 28(3):36–73, June
2008.
Jorge Cortés and Francesco Bullo. Coordination and geometric optimization via distributed dynamical
systems. SIAM Journal on Control and Optimization, 44(5):1543–1574, October 2005.
M. Fazlyab, A. Koppel, V. M. Preciado, and A. Ribeiro. A variational approach to dual methods for
constrained convex optimization. In 2017 American Control Conference (ACC), pp. 5269–5275,
May 2017a.
M. Fazlyab, A. Koppel, V. M. Preciado, and A. Ribeiro. A variational approach to dual methods for
constrained convex optimization. In 2017 American Control Conference (ACC), pp. 5269–5275,
May 2017b.
M. Fazlyab, M. Morari, and V. M. Preciado. Design of first-order optimization algorithms via sum-
of-squares programming. In IEEE Conference on Decision and Control (CDC), pp. 4445–4452,
December 2018.
Mahyar Fazlyab, Alejandro Ribeiro, Manfred Morari, and Victor M. Preciado. Analysis of optimiza-
tion algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM J. Optim,
28(3):2654–2689, 2018.
Aleksei Fedorovich Filippov and F. M. Arscott. Differential equations with discontinuous righthand
sides. Kluwer Academic Publishers Group, Dordrecht, Netherlands, 1988.
G. França, D. Robinson, and R. Vidal. ADMM and accelerated ADMM as continuous dynamical
systems. July 2018.
G. França, D.P. Robinson, and R. Vidal. A dynamical systems perspective on nonsmooth constrained
optimization. arXiv preprint 1808.04048, 2019a.


G. França, J. Sulam, D. Robinson, and R. Vidal. Conformal symplectic and relativistic optimization.
arXiv preprint 1903.04100, 2019b.
Uwe Helmke and John Barratt Moore. Optimization and Dynamical Systems. Springer-Verlag, 1994.
Qing Hui, Wassim Haddad, and Sanjay Bhat. Semistability, finite-time stability, differential inclusions,
and discontinuous dynamical systems having a continuum of equilibria. IEEE Transactions on
Automatic Control, 54:2465–2470, November 2009.
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-
gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
Diederik P. Kingma and Jimmy Lei Ba. Adam: a method for stochastic optimization. International
Conference on Learning Representations, pp. 1–15, May 2015.
L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral
quadratic constraints. SIAM J. Optim, 26(1):57–95, 2016.
S. Łojasiewicz. A topological property of real analytic subsets (in French). Les équations aux
dérivées partielles, pp. 87–89, 1963.
S. Łojasiewicz. Ensembles semi-analytiques. Centre de Physique Theorique de l’Ecole Polytechnique,
1965. URL https://perso.univ-rennes1.fr/michel.coste/Lojasiewicz.pdf.
Stanisław Łojasiewicz and Maria-Angeles Zurro. On the gradient inequality. Bulletin of the Polish
Academy of Sciences, Mathematics, 47, January 1999.
Michael Muehlebach and Michael Jordan. A dynamical systems perspective on Nesterov acceleration.
In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International
Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.
4656–4662, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
A. Orvieto and A. Lucchi. Shadowing properties of optimization algorithms. In Neural Information
Processing Systems, December 2019.
Bradley Paden and Shankar Sastry. A calculus for computing Filippov’s differential inclusion with
application to the variable structure control of robot manipulators. IEEE Transactions on Circuits
and Systems, 34:73–82, February 1987.
Boris Polyak. Gradient methods for the minimisation of functionals (in Russian). USSR Computa-
tional Mathematics and Mathematical Physics, 3:864–878, December 1963.
O. Romero and M. Benosman. Finite-time convergence in continuous-time optimization. In Interna-
tional Conference on Machine Learning, Vienna, Austria, July 2020.
R. G. Sanfelice and A. R. Teel. Dynamical properties of hybrid systems simulators. Automatica, 46:
239–248, 2010.
Johannes Schropp. Using dynamical systems methods to solve minimization problems. Applied
Numerical Mathematics, 18(1):321–335, 1995.
Johannes Schropp and I Singer. A dynamical systems approach to constrained minimization. Numer-
ical Functional Analysis and Optimization, 21:537–551, May 2000.
D. Scieur, V. Roulet, F. Bach, and A. d’Aspremont. Integration methods and optimization algorithms.
In Neural Information Processing Systems, December 2017.
Bin Shi, Simon Du, Michael Jordan, and Weijie Su. Understanding the acceleration phenomenon via
high-resolution differential equations. arXiv preprint 1810.08907, October 2018.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. International Conference on Learning Representations, pp. 1–14, May 2015.


J.A. Snyman. A new and dynamic method for unconstrained minimization. Applied Mathematical
Modelling, 6(6):448–462, December 1982.
J.A. Snyman. An improved version of the original leap-frog dynamic method for unconstrained
minimization: LFOP1(b). Applied Mathematical Modelling, 7(3):216–218, June 1983.
A. M. Stuart and A. R. Humphries. Dynamical systems and numerical analysis. Cambridge University
Press, 1996.
Andrew M. Stuart and A. R. Humphries. Dynamical systems and numerical analysis. Cambridge
University Press, first edition, November 1998.
W. Su, S. Boyd, and E. J. Candes. A differential equation for modeling Nesterov’s accelerated
gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp.
2510–2518. Curran Associates, Inc., 2014.
Adrien Taylor, Bryan Van Scoy, and Laurent Lessard. Lyapunov functions for first-order methods:
Tight automated convergence guarantees. In International Conference on Machine Learning,
Stockholm, Sweden, July 2018.
J. Wang and N. Elia. A control perspective for centralized and distributed convex optimization. In
IEEE Conference on Decision and Control and European Control Conference, pp. 3800–3805,
December 2011.
Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated
methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358,
2016.
A. Wilson. Lyapunov Arguments in Optimization. PhD thesis, UC Berkeley, 2018.
Ashia Wilson, Lester Mackey, and Andre Wibisono. Accelerating rescaled gradient descent: Fast
optimization of smooth functions. In Advances in Neural Information Processing Systems, December
2019.
Abbas K. Zghier. The use of differential equations in optimization. PhD thesis, Loughborough
University, 1981.
J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie. Direct Runge-Kutta discretization achieves
acceleration. In Neural Information Processing Systems, December 2018.


A DISCONTINUOUS SYSTEMS AND DIFFERENTIAL INCLUSIONS ¹¹

Recall that for an initial value problem (IVP)
\[ \dot{x}(t) = F(x(t)) \tag{24a} \]
\[ x(0) = x_0 \tag{24b} \]
with $F : \mathbb{R}^n \to \mathbb{R}^n$, the typical way to check for existence of solutions is by establishing continuity
of $F$. Likewise, to establish uniqueness of the solution, we typically seek Lipschitz continuity. When
$F$ is discontinuous, we may understand (24a) as the Filippov differential inclusion
\[ \dot{x}(t) \in K[F](x(t)), \tag{25} \]
where $K[F] : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ denotes the Filippov set-valued map given by
\[ K[F](x) \triangleq \bigcap_{\delta > 0} \, \bigcap_{\mu(S) = 0} \overline{\mathrm{co}}\, F(B_\delta(x) \setminus S), \tag{26} \]
where $\mu$ denotes the usual Lebesgue measure and $\overline{\mathrm{co}}$ the convex closure, i.e., the closure of the convex
hull $\mathrm{co}$. For more details, see Paden & Sastry (1987). We can generalize (25) to the differential
inclusion (Bacciotti & Ceragioli, 1999)
\[ \dot{x}(t) \in \mathcal{F}(x(t)), \tag{27} \]
where $\mathcal{F} : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is some set-valued map.
Definition 1 (Carathéodory/Filippov solutions). We say that $x : [0, \tau) \to \mathbb{R}^n$ with $0 < \tau \le \infty$ is
a Carathéodory solution to (27) if $x(\cdot)$ is absolutely continuous and (27) is satisfied a.e. in every
compact subset of $[0, \tau)$. Furthermore, we say that $x(\cdot)$ is a maximal Carathéodory solution if no
other Carathéodory solution $x'(\cdot)$ exists with $x = x'|_{[0,\tau)}$. If $\mathcal{F} = K[F]$, then Carathéodory solutions
are referred to as Filippov solutions.

For a comprehensive overview of discontinuous systems, including sufficient conditions for existence
(Proposition 3) and uniqueness (Propositions 4 and 5) of Filippov solutions, see the work of Cortés
(2008). In particular, it can be established that Filippov solutions to (24) exist, provided that the
following assumption (Assumption 1) holds.
Assumption 1 (Existence of Filippov solutions). F : Rn → Rn is defined almost everywhere (a.e.)
and is Lebesgue-measurable in a non-empty open neighborhood U ⊂ Rn of x0 ∈ Rn . Further, F is
locally essentially bounded in U , i.e., for every point x ∈ U , F is bounded a.e. in some bounded
neighborhood of x.

More generally, Carathéodory solutions to (27) exist (now with arbitrary x0 ∈ Rn ), provided that the
following assumption (Assumption 2) holds.
Assumption 2 (Existence of Carathéodory solutions). F : Rn ⇒ Rn has nonempty, compact, and
convex values, and is upper semi-continuous.

Filippov & Arscott (1988) proved that, for the Filippov set-valued map F = K[F ], Assumptions 1
and 2 are equivalent (with arbitrary x0 ∈ Rn in Assumption 1).
Uniqueness of the solution requires further assumptions. Nevertheless, we can characterize the
Filippov set-valued map in a similar manner to Clarke’s generalized gradient, as seen in the following
proposition.
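Proposition 1 (Theorem 1 of Paden & Sastry (1987)). Under Assumption 1, we have
\[ K[F](x) = \mathrm{co} \left\{ \lim_{k \to \infty} F(x_k) : x_k \in \mathbb{R}^n \setminus (N_F \cup S) \ \text{s.t.}\ x_k \to x \right\} \tag{28} \]
for some (Lebesgue) zero-measure set $N_F \subset \mathbb{R}^n$ and any other zero-measure set $S \subset \mathbb{R}^n$, where
$\mathrm{co}$ denotes the convex hull. In particular, if $F$ is continuous at a fixed $x$, then $K[F](x) = \{F(x)\}$.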
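To make the Filippov map concrete, the following minimal sketch approximates $K[F](x)$ for a scalar field by sampling $F$ on a small ball with the point $x$ itself removed. The field $F(x) = -\mathrm{sign}(x)$, the function names, and the sampling parameters are illustrative assumptions and are not taken from the paper; in one dimension the closed convex hull of the sampled values is just the interval between their minimum and maximum.

```python
import math

def sign_field(x):
    """Illustrative discontinuous field F(x) = -sign(x); it is never evaluated at x = 0 below."""
    return -math.copysign(1.0, x)

def filippov_interval(field, x, delta=1e-6, n_samples=1000):
    """Approximate K[F](x) for a scalar field as the interval [min, max] of F
    over B_delta(x) with the measure-zero set S = {x} removed."""
    values = []
    for i in range(1, n_samples + 1):
        offset = delta * i / n_samples
        values.append(field(x - offset))  # sample to the left of x
        values.append(field(x + offset))  # sample to the right of x
    return (min(values), max(values))

print(filippov_interval(sign_field, 0.0))  # (-1.0, 1.0): the whole interval at the discontinuity
print(filippov_interval(sign_field, 0.5))  # (-1.0, -1.0): a singleton where F is continuous
```

At the discontinuity the map is set-valued, while away from it the map reduces to the singleton $\{F(x)\}$, in line with the last statement of Proposition 1.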
11 The notions introduced here are solely needed for rigorously dealing with the singular discontinuity at the
equilibrium point of the q-RGF and q-SGF flows. However, the reader can skip these definitions and still be able
to intuitively follow the proofs of Theorems 1 and 2.


For instance, for the GF (1), we have $K[-\nabla f](x) = \{-\nabla f(x)\}$ for every $x \in \mathbb{R}^n$, provided that $f$
is continuously differentiable. Furthermore, if $f$ is only locally Lipschitz continuous and regular (see
Definition 3 of Appendix B), then $K[-\nabla f](x) = -\partial f(x)$, where
\[ \partial f(x) \triangleq \mathrm{co} \left\{ \lim_{k \to \infty} \nabla f(x_k) : x_k \in \mathbb{R}^n \setminus N_f \ \text{s.t.}\ x_k \to x \right\} \tag{29} \]
denotes Clarke’s generalized gradient (Clarke, 1981) of $f$, with $N_f$ denoting the zero-measure set
over which $f$ is not differentiable (Rademacher’s theorem). It can be established that $\partial f$ coincides
with the subdifferential of $f$, provided that $f$ is convex. Therefore, the GF (1) interpreted as a Filippov
differential inclusion may also be seen as a continuous-time variant of subgradient descent methods.

B FINITE-TIME STABILITY OF DIFFERENTIAL INCLUSIONS


We are now ready to focus on extending some notions from traditional Lipschitz continuous systems
to differential inclusions.
Definition 2. We say that $x^\star \in \mathbb{R}^n$ is an equilibrium of (27) if $x(t) \equiv x^\star$ on some small enough
non-degenerate interval is a Carathéodory solution to (27). In other words, if and only if $0 \in \mathcal{F}(x^\star)$.
We say that (27) is (Lyapunov) stable at $x^\star \in \mathbb{R}^n$ if, for every $\varepsilon > 0$, there exists some $\delta > 0$ such that,
for every maximal Carathéodory solution $x(\cdot)$ of (27), we have $\|x_0 - x^\star\| < \delta \implies \|x(t) - x^\star\| < \varepsilon$
for every $t \ge 0$ in the interval where $x(\cdot)$ is defined. Note that, under Assumption 2, if (27) is stable
at $x^\star$, then $x^\star$ is an equilibrium of (27) (Bacciotti & Ceragioli, 1999). Furthermore, we say that (27)
is (locally and strongly) asymptotically stable at $x^\star \in \mathbb{R}^n$ if it is stable at $x^\star$ and there exists some
$\delta > 0$ such that, for every maximal Carathéodory solution $x : [0, \tau) \to \mathbb{R}^n$ of (27), if $\|x_0 - x^\star\| < \delta$
then $x(t) \to x^\star$ as $t \to \tau$. Finally, (27) is (locally and strongly) finite-time stable at $x^\star$ if it is
asymptotically stable and there exist some $\delta > 0$ and $T : B_\delta(x^\star) \to [0, \infty)$ such that, for every
maximal Carathéodory solution $x(\cdot)$ of (27) with $x_0 \in B_\delta(x^\star)$, we have $\lim_{t \to T(x_0)} x(t) = x^\star$.

We will now construct a Lyapunov-based criterion adapted from the literature of finite-time stability
of Lipschitz continuous systems.
Lemma 1. Let $E(\cdot)$ be an absolutely continuous function satisfying the differential inequality
\[ \dot{E}(t) \le -c\, E(t)^{\alpha} \tag{30} \]
a.e. in $t \ge 0$, with $c, E(0) > 0$ and $\alpha < 1$. Then, there exists some $t^\star > 0$ such that $E(t) > 0$ for
$t \in [0, t^\star)$ and $E(t^\star) = 0$. Furthermore, $t^\star > 0$ can be bounded by
\[ t^\star \le \frac{E(0)^{1-\alpha}}{c(1-\alpha)}, \tag{31} \]
with this bound tight whenever (30) holds with equality. In that case, but now with $\alpha \ge 1$,
$E(t) > 0$ for every $t \ge 0$, with $\lim_{t \to \infty} E(t) = 0$. This will be represented by $t^\star = \infty$, with
$E(\infty) \triangleq \lim_{t \to \infty} E(t)$.
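As a quick numerical illustration of Lemma 1, the sketch below evaluates the settling-time bound (31) and the closed-form solution of the equality case $\dot{E} = -c E^{\alpha}$, clipped at zero once the settling time is reached. The function names and the values $E(0) = 4$, $c = 1$, $\alpha = 1/2$ are illustrative assumptions.

```python
def settling_time_bound(e0, c, alpha):
    """Upper bound (31): E(0)^(1-alpha) / (c (1-alpha)), valid only for alpha < 1."""
    if alpha >= 1:
        raise ValueError("a finite settling time requires alpha < 1")
    return e0 ** (1.0 - alpha) / (c * (1.0 - alpha))

def tight_solution(t, e0, c, alpha):
    """Solution of dE/dt = -c E^alpha with alpha < 1, held at zero after the settling time."""
    base = e0 ** (1.0 - alpha) - c * (1.0 - alpha) * t
    return max(base, 0.0) ** (1.0 / (1.0 - alpha))

e0, c, alpha = 4.0, 1.0, 0.5
t_star = settling_time_bound(e0, c, alpha)
print(t_star)                                      # 4.0
print(tight_solution(0.5 * t_star, e0, c, alpha))  # 1.0, still positive before t_star
print(tight_solution(t_star, e0, c, alpha))        # 0.0, zero is reached exactly at t_star
```

In the equality case the energy hits zero exactly at the bound, which is the sense in which (31) is tight.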

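Proof. Suppose that $E(t) > 0$ for every $t \in [0, T]$ with $T > 0$. Let $t^\star$ be the supremum of all such
$T$’s, thus satisfying $E(t) > 0$ for every $t \in [0, t^\star)$. We will now investigate $E(t^\star)$. First, by continuity
of $E$, it follows that $E(t^\star) \ge 0$. Now, by rewriting
\[ \dot{E}(t) \le -c\, E(t)^{\alpha} \iff \frac{d}{dt}\left[ \frac{E(t)^{1-\alpha}}{1-\alpha} \right] \le -c, \tag{32} \]
a.e. in $t \in [0, t^\star)$, we can thus integrate to obtain
\[ \frac{E(t)^{1-\alpha}}{1-\alpha} - \frac{E(0)^{1-\alpha}}{1-\alpha} \le -c\, t, \tag{33} \]
everywhere in $t \in [0, t^\star)$, which in turn leads to
\[ E(t) \le \left[ E(0)^{1-\alpha} - c(1-\alpha)t \right]^{1/(1-\alpha)} \tag{34} \]
and
\[ t \le \frac{E(0)^{1-\alpha} - E(t)^{1-\alpha}}{c(1-\alpha)} \le \frac{E(0)^{1-\alpha}}{c(1-\alpha)}, \tag{35} \]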


where the last inequality follows from $E(t) > 0$ for every $t \in [0, t^\star)$. Taking the supremum in (35)
then leads to the upper bound (31). Finally, we conclude that $E(t^\star) = 0$, since $E(t^\star) > 0$ is
impossible given that it would mean, due to continuity of $E$, that there exists some $T > t^\star$ such that
$E(t) > 0$ for every $t \in [0, T]$, thus contradicting the construction of $t^\star$.
Finally, notice that if $E$ is such that (30) holds with equality, then (34) and the first inequality
in (35) hold with equality as well. The tightness of the bound (31) thus follows immediately.
Furthermore, notice that if $\alpha \ge 1$, and $E$ is a tight solution to the differential inequality (30), i.e.
$E(t) = [E(0)^{1-\alpha} - c(1-\alpha)t]^{1/(1-\alpha)}$, then clearly $E(t) > 0$ for every $t \ge 0$ and $E(t) \to 0$ as
$t \to \infty$. ∎

Cortés & Bullo (2005) proposed (Proposition 2.8) a Lyapunov-based criterion to establish finite-
time stability of discontinuous systems, which fundamentally coincides with our Lemma 1 for the
particular choice of exponent α = 0. Their proposition was, however, directly based on Theorem 2
of Paden & Sastry (1987). Later, Cortés (2006) proposed a second-order Lyapunov criterion, which,
on the other hand, fundamentally translates to $E(t) \triangleq V(x(t))$ being strongly convex. Finally, Hui
et al. (2009) generalized Proposition 2.8 of Cortés & Bullo (2005) in their Corollary 3.1, to establish
semistability. Indeed, these two results coincide for isolated equilibria.
We now present a novel result that generalizes the aforementioned first-order Lyapunov-based results,
by exploiting our Lemma 1. More precisely, given a Lyapunov candidate function $V(\cdot)$, the objective
is to set $E(t) \triangleq V(x(t))$ and check that the conditions of Lemma 1 hold. To do this, and
assuming $V$ to be locally Lipschitz continuous, we first borrow and adapt from Bacciotti & Ceragioli
(1999) the definition of the set-valued time derivative of $V : D \to \mathbb{R}$ w.r.t. the differential inclusion (27),
given by
\[ \dot{V}(x) \triangleq \{ a \in \mathbb{R} : \exists v \in \mathcal{F}(x) \ \text{s.t.}\ a = p \cdot v, \ \forall p \in \partial V(x) \}, \tag{36} \]
for each $x \in D$. Notice that, under Assumption 2 for Filippov differential inclusions $\mathcal{F} = K[F]$,
the set-valued time derivative of $V$ thus coincides with the set-valued Lie derivative $L_F V(\cdot)$.
Indeed, more generally $\dot{V}$ could be seen as a set-valued Lie derivative $L_{\mathcal{F}} V$ w.r.t. the set-valued map
$\mathcal{F}$.
Definition 3. $V(\cdot)$ is said to be regular if every directional derivative, given by
\[ V'(x; v) \triangleq \lim_{h \to 0^+} \frac{V(x + h v) - V(x)}{h}, \tag{37} \]
exists and is equal to
\[ V^{\circ}(x; v) \triangleq \limsup_{x' \to x,\ h \to 0^+} \frac{V(x' + h v) - V(x')}{h}, \tag{38} \]
known as Clarke’s upper generalized derivative (Clarke, 1981).

In practice, regularity is a fairly mild and easy to guarantee condition. For instance, it would suffice
that $V$ is convex or continuously differentiable to ensure that it is locally Lipschitz and regular.
Assumption 3. $V : D \to \mathbb{R}$ is locally Lipschitz continuous and regular, with $D \subseteq \mathbb{R}^n$ open.

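Under Assumption 3, Clarke’s generalized gradient
\[ \partial V(x) \triangleq \{ p \in \mathbb{R}^n : V^{\circ}(x; v) \ge p \cdot v, \ \forall v \in \mathbb{R}^n \} \tag{39} \]
is non-empty for every $x \in D$, and is also given by
\[ \partial V(x) = \mathrm{co} \left\{ \lim_{k \to \infty} \nabla V(x_k) : x_k \in \mathbb{R}^n \setminus N_V \ \text{s.t.}\ x_k \to x \right\}, \tag{40} \]
where $N_V$ denotes the set of points in $D \subseteq \mathbb{R}^n$ where $V$ is not differentiable (Rademacher’s
theorem) (Clarke, 1981).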
Through the following lemma (Lemma 2), we can formally establish the correspondence between the
set-valued time derivative of $V$ and the derivative of the energy function $E(t) \triangleq V(x(t))$ associated
with an arbitrary Carathéodory solution $x(\cdot)$ to the differential inclusion (27).


Lemma 2 (Lemma 1 of Bacciotti & Ceragioli (1999)). Under Assumption 3, given any Carathéodory
solution $x : [0, \tau) \to \mathbb{R}^n$ to (27), $E(t) \triangleq V(x(t))$ is absolutely continuous and
$\dot{E}(t) = \frac{d}{dt} V(x(t)) \in \dot{V}(x(t))$ a.e. in $t \in [0, \tau)$.

We are now ready to state and prove our Lyapunov-based sufficient condition for finite-time stability
of differential inclusions.
Theorem 3. Suppose that Assumptions 2 and 3 hold for some set-valued map $\mathcal{F} : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ and
some function $V : D \to \mathbb{R}$, where $D \subseteq \mathbb{R}^n$ is an open and positively invariant neighborhood of a
point $x^\star \in \mathbb{R}^n$. Suppose that $V$ is positive definite w.r.t. $x^\star$ and that there exist constants $c > 0$ and
$\alpha < 1$ such that
\[ \sup \dot{V}(x) \le -c\, V(x)^{\alpha} \tag{41} \]
a.e. in $x \in D$. Then, (27) is finite-time stable at $x^\star$, with settling time upper bounded by
\[ t^\star \le \frac{V(x_0)^{1-\alpha}}{c(1-\alpha)}, \tag{42} \]
where $x(0) = x_0$. In particular, any Carathéodory solution $x(\cdot)$ with $x(0) = x_0 \in D$ will converge
in finite time to $x^\star$ under the upper bound (42). Furthermore, if $D = \mathbb{R}^n$, then (27) is globally
finite-time stable. Finally, if $\dot{V}(x)$ is a singleton a.e. in $x \in D$ and (41) holds with equality, then the
bound (42) is tight.
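A short worked example may help fix ideas. It is not taken from the paper: it assumes the scalar field $F(x) = -\mathrm{sign}(x)$ with $\mathcal{F} = K[F]$, the candidate $V(x) = |x|$ (convex, hence locally Lipschitz and regular), and $D = \mathbb{R}$, and simply instantiates (36), (41) and (42).

```latex
% Illustrative instance of Theorem 3 (assumed example): F(x) = -sign(x), V(x) = |x|, D = R.
\begin{align*}
\mathcal{F}(x) &= K[F](x) =
  \begin{cases} \{-\mathrm{sign}(x)\}, & x \neq 0, \\ [-1, 1], & x = 0, \end{cases}
  \qquad \partial V(x) = \{\mathrm{sign}(x)\} \ \text{ for } x \neq 0, \\
\dot{V}(x) &= \{\mathrm{sign}(x) \cdot (-\mathrm{sign}(x))\} = \{-1\} \quad \text{for } x \neq 0, \\
\sup \dot{V}(x) &= -1 = -c\, V(x)^{\alpha} \quad \text{with } c = 1,\ \alpha = 0, \\
t^\star &\le \frac{V(x_0)^{1-\alpha}}{c(1-\alpha)} = |x_0| .
\end{align*}
```

Here (41) holds with equality everywhere except at the zero-measure set $\{0\}$, and the solution $x(t) = \mathrm{sign}(x_0)(|x_0| - t)$ reaches the origin exactly at $t = |x_0|$, so the bound is attained.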

Proof. Note that, by Proposition 1 of Bacciotti & Ceragioli (1999), we know that (27) is Lyapunov
stable at $x^\star$. All that remains to be shown is local convergence towards $x^\star$ (which must be an
equilibrium) in finite time. Indeed, given any maximal solution $x : [0, t^\star) \to \mathbb{R}^n$ to (27) with
$x(0) = x_0 \neq x^\star$, we know by Lemma 2 that $E(t) = V(x(t))$ is absolutely continuous with
$\dot{E}(t) \in \dot{V}(x(t))$ a.e. in $t \in [0, t^\star]$. Therefore, we have
\[ \dot{E}(t) \le \sup \dot{V}(x(t)) \le -c\, V(x(t))^{\alpha} = -c\, E(t)^{\alpha} \tag{43} \]
a.e. in $t \in [0, t^\star]$. Since $E(0) = V(x_0) > 0$, given that $x_0 \neq x^\star$, the result then follows by invoking
Lemma 1 and noting that $E(t^\star) = 0 \iff V(x(t^\star)) = 0 \iff x(t^\star) = x^\star$. ∎

Finite-time stability still follows without Assumption 2, provided that $x^\star$ is an equilibrium of (27). In
practical terms, this means that trajectories starting arbitrarily close to $x^\star$ may not actually exist, but
nevertheless there exists a neighborhood $D$ of $x^\star$ over which any trajectory $x(\cdot)$ that indeed exists
and starts at $x(0) = x_0 \in D$ must converge in finite time to $x^\star$, with settling time upper bounded by
$T(x_0)$ (the bound still being tight in the case that (41) holds with equality).

C PROOF OF THEOREM 1
Let us focus on the q-RGF (2) (the case of q-SGF (3) follows exactly the same steps) with the
candidate Lyapunov function $V \triangleq f - f^\star$. Clearly, $V$ is locally Lipschitz continuous and regular (given that
it is continuously differentiable). Furthermore, $V$ is positive definite w.r.t. $x^\star$.
Notice that, due to the dominated gradient assumption, $x^\star$ must be an isolated stationary point of
$f$. To see this, notice that, if $x^\star$ were not an isolated stationary point, then there would have to exist
some $\tilde{x}^\star$ sufficiently near $x^\star$ such that $\tilde{x}^\star$ is both a stationary point of $f$, and satisfies $f(\tilde{x}^\star) > f^\star$,
since $x^\star$ is a strict local minimizer of $f$. But then, we would have
\[ 0 = \frac{p-1}{p} \|\nabla f(\tilde{x}^\star)\|^{\frac{p}{p-1}} \ge \mu^{\frac{1}{p-1}} \left( f(\tilde{x}^\star) - f^\star \right) > 0, \tag{44} \]
and subsequently $0 > 0$, which is absurd.
Therefore, $F(x) \triangleq -c \nabla f(x) / \|\nabla f(x)\|^{\frac{q-2}{q-1}}$ is continuous for every $x \in D \setminus \{x^\star\}$, for some small
enough open neighborhood $D$ of $x^\star$. Let us assume that $D$ is positively invariant w.r.t. (2), which can
be achieved, for instance, by replacing $D$ with its intersection with some small enough strict sublevel
set of $f$. Notice that $\|F(x)\| = c \|\nabla f(x)\|^{\frac{1}{q-1}}$ with $q \in (p, \infty] \subset (1, \infty]$, i.e., $\frac{1}{q-1} \in [0, \infty)$. If
$q = \infty$, which results in the normalized gradient flow $\dot{x} = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$ proposed by Cortés (2006),


then $\|F(x)\| = c > 0$. We can thus show that $F(x)$ is discontinuous at $x = x^\star$ for $q = \infty$. On
the other hand, if $q \in (p, \infty) \subset (1, \infty)$, then we have $\|F(x)\| \to 0$ as $x \to x^\star$, and thus $F(x)$ is
continuous (but not Lipschitz) at $x = x^\star$. Regardless, we may freely focus exclusively on $D \setminus \{x^\star\}$
since $\{x^\star\}$ is obviously a zero-measure set.
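Let $\mathcal{F} \triangleq K[F]$. We thus have, for each $x \in D \setminus \{x^\star\}$,
\begin{align}
\sup \dot{V}(x) &= \sup \{ a \in \mathbb{R} : \exists v \in \mathcal{F}(x) \ \text{s.t.}\ a = p \cdot v, \ \forall p \in \partial V(x) \} \tag{45a} \\
&= \sup \{ \nabla V(x) \cdot v : v \in \mathcal{F}(x) \} \tag{45b} \\
&= \nabla V(x) \cdot F(x) \tag{45c} \\
&= -c \|\nabla f(x)\|^{2 - \frac{q-2}{q-1}} \tag{45d} \\
&= -c \|\nabla f(x)\|^{\frac{1}{\theta'}} \tag{45e} \\
&\le -c \left[ C (f(x) - f^\star)^{\theta} \right]^{\frac{1}{\theta'}} \tag{45f} \\
&= -c\, C^{\frac{1}{\theta'}} V(x)^{\frac{\theta}{\theta'}}. \tag{45g}
\end{align}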

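Since $\frac{\theta}{\theta'} < 1$, given that $s \mapsto \frac{s-1}{s}$ is strictly increasing for $s > 1$, the conditions of Theorem 3 are
satisfied. In particular, we have finite-time stability at $x^\star$ with a settling time $t^\star$ upper bounded by
\[ t^\star \le \frac{(f(x_0) - f^\star)^{1 - \frac{\theta}{\theta'}}}{c\, C^{\frac{1}{\theta'}} \left(1 - \frac{\theta}{\theta'}\right)} \le \frac{\left( \|\nabla f(x_0)\| / C \right)^{\frac{1}{\theta}\left(1 - \frac{\theta}{\theta'}\right)}}{c\, C^{\frac{1}{\theta'}} \left(1 - \frac{\theta}{\theta'}\right)} = \frac{\|\nabla f(x_0)\|^{\frac{1}{\theta} - \frac{1}{\theta'}}}{c\, C^{\frac{1}{\theta}} \left(1 - \frac{\theta}{\theta'}\right)} \tag{46} \]
for each $x_0 \in D$, which completes the proof.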
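The first bound in (46) can be sanity-checked on a toy problem where everything is available in closed form. The sketch below assumes the scalar objective $f(x) = x^2/2$, which satisfies the gradient-dominance inequality used in (44) with $p = 2$ and $\mu = 1$, and takes $C = (p/(p-1))^{\theta} \mu^{1/p}$, the constant obtained by solving that inequality for $\|\nabla f\|$. The function names and numerical values are illustrative assumptions.

```python
def settling_bound_46(f_gap, c, C, theta, theta_p):
    """First bound in (46): (f(x0) - f*)^(1 - theta/theta') / (c C^(1/theta') (1 - theta/theta'))."""
    alpha = theta / theta_p
    return f_gap ** (1.0 - alpha) / (c * C ** (1.0 / theta_p) * (1.0 - alpha))

def exact_settling_time_quadratic(x0, c, q):
    """Settling time of dx/dt = -c sign(x) |x|^(1/(q-1)), i.e. the q-RGF for f(x) = x^2 / 2,
    obtained by separation of variables (requires q > 2)."""
    beta = 1.0 / (q - 1.0)
    return abs(x0) ** (1.0 - beta) / (c * (1.0 - beta))

# Illustrative values: p = 2 and mu = 1 for f(x) = x^2 / 2, with q = 3 > p.
p, q, c, mu, x0 = 2.0, 3.0, 1.0, 1.0, 4.0
theta, theta_p = (p - 1.0) / p, (q - 1.0) / q
C = (p / (p - 1.0)) ** theta * mu ** (1.0 / p)   # here C = sqrt(2)
bound = settling_bound_46(0.5 * x0 ** 2, c, C, theta, theta_p)
exact = exact_settling_time_quadratic(x0, c, q)
print(bound, exact)  # the two values agree up to rounding
```

For this objective the gradient-dominance inequality holds with equality, so the bound coincides with the true settling time, consistent with the tightness statement of Theorem 3.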

D PROOF OF THEOREM 2
To prove Theorem 2, we borrow some tools and results from hybrid control systems theory. Hybrid
control systems are characterized by continuous flows with discrete jumps between the continuous
flows. They are often modeled by differential inclusions combined with discrete mappings that model the
jumps between the differential inclusions. We see the optimization flows proposed
here as a simple case of a hybrid system with one differential inclusion, with a possible jump or
discontinuity at the optimum. Based on this, we will use the tools and results of Sanfelice & Teel
(2010), who study how a certain class of hybrid systems behaves after discretization with a certain
class of discretization algorithms. In other words, Sanfelice & Teel (2010) quantify, under some
conditions, how close the solutions of the discretized hybrid dynamical system are to the solutions of
the original hybrid system.
In this section we will denote the differential inclusion of the continuous optimization flow by
F : Rn ⇒ Rn , and its discretization in time by Fd : Rn ⇒ Rn . We first recall a definition, which
we will adapt from the general case of jumps between multiple differential inclusions (Definition 3.2,
Sanfelice & Teel (2010)) to our case of one differential inclusion or flow.
Definition 4. ($(T, \epsilon)$-closeness). Given $T > 0$, $\epsilon > 0$, $\eta > 0$, two solutions $x_t : [0, T] \to \mathbb{R}^n$ and
$x_k : \{0, 1, 2, \ldots\} \to \mathbb{R}^n$ are $(T, \epsilon)$-close if:
(a) for all $t \le T$ there exists $k \in \{1, 2, \ldots\}$ such that $|t - k\eta| < \epsilon$ and $\|x_t(t) - x_k(k)\| < \epsilon$,
(b) for all $k \in \{1, 2, \ldots\}$ there exists $t \le T$ such that $|t - k\eta| < \epsilon$ and $\|x_t(t) - x_k(k)\| < \epsilon$.
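Conditions (a) and (b) can be checked numerically for scalar trajectories, as in the following rough sketch. It makes two simplifying assumptions that are not part of the definition: the continuous solution is only sampled on a uniform grid of $[0, T]$, and condition (b) is only tested for indices with $k\eta \le T$. The function name and the test problem $\dot{x} = -x$ are hypothetical.

```python
import math

def is_close(x_cont, x_disc, T, eps, eta, n_grid=400):
    """Check (T, eps)-closeness of a scalar continuous solution x_cont(t) and a
    discrete solution x_disc[k] with step eta, on a uniform grid of [0, T]."""
    ts = [T * i / n_grid for i in range(n_grid + 1)]
    ks = range(1, len(x_disc))

    def matched(t, k):
        return abs(t - k * eta) < eps and abs(x_cont(t) - x_disc[k]) < eps

    # (a): every sampled time t <= T has a nearby discrete index k >= 1.
    if not all(any(matched(t, k) for k in ks) for t in ts):
        return False
    # (b): every discrete index with k * eta <= T (our restriction) has a nearby time t.
    return all(any(matched(t, k) for t in ts) for k in ks if k * eta <= T)

eta, T = 0.01, 2.0
euler = [(1.0 - eta) ** k for k in range(301)]  # forward Euler for dx/dt = -x, x(0) = 1
exact = lambda t: math.exp(-t)
print(is_close(exact, euler, T, eps=0.05, eta=eta))        # True
print(is_close(exact, [1.0] * 301, T, eps=0.05, eta=eta))  # False: a constant sequence drifts away
```

The check is quadratic in the number of samples, which is acceptable for illustration but would need a smarter search for long horizons.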

Next, we recall Theorem 5.2 of Sanfelice & Teel (2010), adapted to our special case of
a hybrid system with one differential inclusion.¹²
Theorem 4. (Closeness of continuous and discrete solutions on compact domains) Consider the
differential inclusion
\[ \dot{X}(t) \in \mathcal{F}(X(t)), \tag{47} \]
for a given set-valued mapping $\mathcal{F} : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ assumed to be outer semicontinuous, locally bounded,
nonempty, and with convex values for every $x \in C$, for some closed set $C \subseteq \mathbb{R}^n$. Consider the
discrete-time system represented by the flow $F_d : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$, such that, for each compact set $K \subset \mathbb{R}^n$,
there exist $\rho \in \mathcal{K}_\infty$ and $\eta^\star > 0$ such that for each $x \in K$ and each $\eta \in (0, \eta^\star]$,
\[ F_d(x) \subset x + \eta\, \mathrm{con}\, \mathcal{F}(x + \rho(\eta)\mathbb{B}) + \eta \rho(\eta) \mathbb{B}. \tag{48} \]
Then, for every compact set $K \subset \mathbb{R}^n$, every $\epsilon > 0$, and every time horizon $T \in \mathbb{R}_{\ge 0}$ there exists
$\eta^\star > 0$ such that: for any $\eta \in (0, \eta^\star]$ and any discrete solution $x_k$ with $x_k(0) \in K + \delta\mathbb{B}$, $\delta > 0$,
there exists a continuous solution $x_t$ with $x_t(0) \in K$ such that $x_k$ and $x_t$ are $(T, \epsilon)$-close.

12 A set-valued mapping $\mathcal{F} : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is outer semicontinuous if for each sequence $\{x_i\}_{i=1}^{\infty}$ converging
to a point $x \in \mathbb{R}^n$ and each sequence $y_i \in \mathcal{F}(x_i)$ converging to a point $y$, it holds that $y \in \mathcal{F}(x)$. It
is locally bounded if, for each $x \in \mathbb{R}^n$, there exist compact sets $K, K' \subset \mathbb{R}^n$ such that $x \in K$ and
$\mathcal{F}(K) \triangleq \cup_{x \in K} \mathcal{F}(x) \subset K'$. In what follows, we use the following notation: given a set $A$, $\mathrm{con}\, A$ denotes the
convex hull, and $\mathbb{B}$ denotes the closed unit ball in a Euclidean space.

To prove Theorem 2 we will use the results of Theorem 4, where we will have to check that condition
(48) is satisfied for the three proposed discretizations.
We are now ready to prove Theorem 2. First, note that outer semicontinuity follows from the upper
semicontinuity and the closedness of the Filippov differential inclusion map. Furthermore, local
boundedness follows from continuity everywhere outside stationary points, which are isolated.
Now, let us examine their discretization by the three proposed algorithms:

FORWARD-EULER DISCRETIZATION

The mapping $F_d$ in this case is a singleton, given by
\[ F_d(x) \triangleq x + \eta F(x), \tag{49} \]
where $\eta > 0$, which clearly satisfies condition (48).
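The update (49) is straightforward to sketch in code. The example below applies it to the q-RGF field $F(x) = -c\nabla f(x)/\|\nabla f(x)\|^{\frac{q-2}{q-1}}$ from the proof of Theorem 1. The toy objective $f(x) = \|x\|^2/2$, the helper names, the convention of returning zero at a stationary point, and the values $\eta = 10^{-3}$, $q = 3$ are assumptions made for illustration only.

```python
import math

def q_rgf(grad, q, c=1.0):
    """q-RGF field F(x) = -c grad / ||grad||^((q-2)/(q-1)); zero at a stationary point."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    scale = c / norm ** ((q - 2.0) / (q - 1.0))
    return [-scale * g for g in grad]

def euler_step(x, grad_f, eta, q, c=1.0):
    """One forward-Euler step (49): x + eta * F(x)."""
    direction = q_rgf(grad_f(x), q, c)
    return [xi + eta * di for xi, di in zip(x, direction)]

def grad_f(x):
    """Gradient of the toy objective f(x) = ||x||^2 / 2 (an assumption for illustration)."""
    return list(x)

x = [1.0, -2.0]
for _ in range(5000):
    x = euler_step(x, grad_f, eta=1e-3, q=3.0)
final_norm = math.sqrt(sum(xi * xi for xi in x))
print(final_norm)  # settles in a small neighborhood of the minimizer
```

Because the field is not Lipschitz at the minimizer, the fixed-step iterates do not converge exactly; they end up in a neighborhood whose size shrinks with the step size, which is the kind of residual captured by the closeness term in the convergence bound.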

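RUNGE-KUTTA DISCRETIZATION

Once again, the mapping $F_d$ is a singleton, this time given by
\[ F_d(x) \triangleq x + \eta \sum_{i=1}^{K} \alpha_i \mathcal{F}(y^i) \tag{50a} \]
\[ y^i = x + \eta \sum_{j=1}^{i-1} \beta_j \mathcal{F}(y^j), \tag{50b} \]
where $\eta, \alpha_1, \ldots, \alpha_K, \beta_1, \ldots, \beta_{K-1} > 0$ are such that $\sum_{i=1}^{K} \alpha_i = 1$.
By equation (50b) one can establish a function $\rho \in \mathcal{K}_\infty$ such that for each $x_k \in K \subset \mathbb{R}^n$ and each
$\eta > 0$, $y_k^i \in x_k + \eta\rho(\eta)\mathbb{B}$. Next, by equation (50a) together with the condition $\sum_{i=1}^{K} \alpha_i = 1$,
one can write that for any $x_k \in K$ and $\eta > 0$, $F_d(x_k) \subset x_k + \eta\, \mathrm{con}\, \mathcal{F}(y_k^i) \subset x_k + \eta\, \mathrm{con}\, \mathcal{F}(x_k +
\eta\rho(\eta)\mathbb{B}) + \eta\rho(\eta)\mathbb{B}$.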
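The $K$-stage scheme (50) can be sketched as follows, again on the q-RGF field. The toy objective $f(x) = \|x\|^2/2$ and the helper names are assumptions for illustration; the coefficients $K = 2$, $\eta = 10^{-2}$, $\beta_1 = 0.09$, $\alpha_1 = \alpha_2 = 0.5$ mirror values listed in Appendix E.1, and $q = 3$ is one of the values tested there.

```python
import math

def q_rgf(grad, q, c=1.0):
    """q-RGF field F(x) = -c grad / ||grad||^((q-2)/(q-1)); zero at a stationary point."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    scale = c / norm ** ((q - 2.0) / (q - 1.0))
    return [-scale * g for g in grad]

def runge_kutta_step(x, grad_f, eta, q, alphas, betas, c=1.0):
    """One explicit K-stage step following (50): stage point y^i adds eta * beta_j * F(y^j)
    for j < i, and the update combines the stage fields F(y^i) with weights alpha_i."""
    assert abs(sum(alphas) - 1.0) < 1e-12 and len(betas) == len(alphas) - 1
    stage_fields = []
    for i in range(len(alphas)):
        y = list(x)
        for beta_j, f_j in zip(betas[:i], stage_fields):
            y = [yi + eta * beta_j * fj for yi, fj in zip(y, f_j)]
        stage_fields.append(q_rgf(grad_f(y), q, c))
    x_next = list(x)
    for alpha_i, f_i in zip(alphas, stage_fields):
        x_next = [xi + eta * alpha_i * fi for xi, fi in zip(x_next, f_i)]
    return x_next

def grad_f(x):
    """Gradient of the toy objective f(x) = ||x||^2 / 2 (an assumption for illustration)."""
    return list(x)

x = [1.0, -2.0]
for _ in range(1000):
    x = runge_kutta_step(x, grad_f, eta=1e-2, q=3.0, alphas=[0.5, 0.5], betas=[0.09])
final_norm = math.sqrt(sum(xi * xi for xi in x))
print(final_norm)  # settles in a small neighborhood of the minimizer
```

With a single stage ($K = 1$, $\alpha_1 = 1$, no $\beta$ coefficients) the function reduces to the forward-Euler update (49).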

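NESTEROV-LIKE DISCRETIZATION

In this case the discrete-time flow $F_d$ is defined as
\[ F_d(x_k) = x_k + \eta \mathcal{F}(x_k + \mu y_k) + \mu y_k \tag{51a} \]
\[ y_k = x_k - x_{k-1}. \tag{51b} \]
In this case, to take into account the integral effect of the Nesterov-like discretization, let us extend
the continuous-time flow as
\[ \dot{\tilde{x}}_t = \begin{pmatrix} \dot{x}_t \\ \dfrac{d}{dt} \dfrac{\int_t^{t+\eta} x_t(s)\, ds}{\eta} \end{pmatrix} = \begin{pmatrix} \mathcal{F}(x_t) \\ \dfrac{x_{t+\eta} - x_t}{\eta} \end{pmatrix}, \tag{52} \]
we then compare the solution of the extended continuous-time system (52) with the extended discrete-
time system
\[ \tilde{x}_{k+1} = \begin{pmatrix} x_{k+1} \\ x_{k+1} - x_k \end{pmatrix} = \begin{pmatrix} x_k + \eta \mathcal{F}((1+\mu)x_k - \mu x_{k-1}) + \mu(x_k - x_{k-1}) \\ \eta \mathcal{F}((1+\mu)x_k - \mu x_{k-1}) + \mu(x_k - x_{k-1}) \end{pmatrix}, \tag{53} \]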
18
Under review as a conference paper at ICLR 2021

which could be rearranged as
\[ F_d(\tilde{x}_k) = \tilde{x}_{k+1} = \tilde{x}_k + M \tilde{x}_k + \eta \tilde{F}_t(\tilde{x}_k), \tag{54} \]
where $M = \begin{pmatrix} 0 & \mu \\ 0 & -1 + \mu \end{pmatrix}$ and $\tilde{F}_t(\tilde{x}_k) = \begin{pmatrix} \mathcal{F}(\tilde{x}_k) \\ \mathcal{F}(\tilde{x}_k) \end{pmatrix}$, which shows that for any $\tilde{x}_k \in K \subset \mathbb{R}^{2n}$
and $\eta > 0$, we have (using similar recursive reasoning as above) that $F_d(\tilde{x}_k) \subset \tilde{x}_k + \eta\, \mathrm{con}\, \tilde{F}_t(\tilde{x}_k +
\eta\rho(\eta)\mathbb{B}) + \eta\rho(\eta)\mathbb{B}$. Then, using Theorem 4, we conclude the $(T, \epsilon)$-closeness between the
continuous-time solutions of the flows $\mathcal{F}$, namely the q-RGF (2) and q-SGF (3), and the discrete-time solutions
of their respective discretization by any of the three discretization methods. Furthermore, for the
Nesterov-like discretization, we can also conclude the $(T, \epsilon)$-closeness of the integral of the
continuous-time solutions $\tilde{x}_t(2)$ and its discretization $\tilde{x}_k(2)$.
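A minimal sketch of the Nesterov-like update (51) on the q-RGF field is given below. The toy objective $f(x) = \|x\|^2/2$, the helper names, the choice $x_{-1} = x_0$ (zero initial momentum), and the values $\eta = 10^{-3}$, $\mu = 0.9$, $q = 3$ are illustrative assumptions rather than the paper's experimental setup.

```python
import math

def q_rgf(grad, q, c=1.0):
    """q-RGF field F(x) = -c grad / ||grad||^((q-2)/(q-1)); zero at a stationary point."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    scale = c / norm ** ((q - 2.0) / (q - 1.0))
    return [-scale * g for g in grad]

def nesterov_like_step(x, x_prev, grad_f, eta, mu, q, c=1.0):
    """One step of (51): evaluate the field at the look-ahead point x + mu * y,
    with y = x - x_prev, and return the pair (x_next, x)."""
    y = [xi - xpi for xi, xpi in zip(x, x_prev)]                      # (51b)
    lookahead = [xi + mu * yi for xi, yi in zip(x, y)]
    direction = q_rgf(grad_f(lookahead), q, c)
    x_next = [li + eta * di for li, di in zip(lookahead, direction)]  # (51a)
    return x_next, x

def grad_f(x):
    """Gradient of the toy objective f(x) = ||x||^2 / 2 (an assumption for illustration)."""
    return list(x)

x, x_prev = [1.0, -2.0], [1.0, -2.0]
for _ in range(5000):
    x, x_prev = nesterov_like_step(x, x_prev, grad_f, eta=1e-3, mu=0.9, q=3.0)
final_norm = math.sqrt(sum(xi * xi for xi in x))
print(final_norm)  # ends close to the minimizer after an initial momentum-driven overshoot
```

The pair `(x, x_prev)` plays the role of the extended state $\tilde{x}_k$ in (53), since $y_k = x_k - x_{k-1}$ is recovered from it at each step.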
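Finally, using the Lyapunov function $V = f - f^\star$ as defined in the proof of Theorem 1, together
with inequalities (45g), (34), and a local Lipschitz bound on $f$, one can derive the convergence bound
given by (23), as follows:
\begin{align*}
\| f(x_k) - f(x^\star) - (f(x_t) - f(x^\star)) \| &= \| f(x_k) - f(x_t) \| \le \tilde{\epsilon} = L_f \epsilon, \quad L_f > 0,\ \epsilon > 0, \\
\| f(x_k) - f(x^\star) \| - \| f(x_t) - f(x^\star) \| &\le \| f(x_k) - f(x_t) \| \le \tilde{\epsilon}, \\
\| f(x_k) - f(x^\star) \| &\le \tilde{\epsilon} + \| f(x_t) - f(x^\star) \|, \\
\| f(x_k) - f(x^\star) \| &\le \tilde{\epsilon} + \left[ (f(x_0) - f(x^\star))^{(1-\alpha)} - \tilde{c}(1-\alpha)\eta k \right]^{1/(1-\alpha)}, \quad \text{for } k \le k^\star,
\end{align*}
where $\alpha = \frac{\theta}{\theta'}$, $\theta = \frac{p-1}{p}$, $\theta' = \frac{q-1}{q}$, $\tilde{c} = c \left[ \left( \frac{p}{p-1} \right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}} \right]^{\frac{1}{\theta'}}$, and $k^\star = \frac{(f(x_0) - f(x^\star))^{(1-\alpha)}}{\tilde{c}(1-\alpha)\eta}$.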
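The final bound is easy to evaluate numerically, as in the sketch below. It assumes the constant $\tilde{c} = c\, C^{1/\theta'}$ with $C = (p/(p-1))^{(p-1)/p} \mu^{1/p}$, which is what (45f) and (45g) give when the gradient-dominance inequality of (44) is solved for $\|\nabla f\|$. The function name, the residual $\tilde{\epsilon} = 10^{-3}$, and the other numerical values are hypothetical.

```python
def discrete_bound(k, f_gap0, eps_tilde, c, mu, p, q, eta):
    """Evaluate eps_tilde + [f_gap0^(1-alpha) - c_tilde (1-alpha) eta k]^(1/(1-alpha))
    and return it together with k_star; the bound is only stated for k <= k_star."""
    theta, theta_p = (p - 1.0) / p, (q - 1.0) / q
    alpha = theta / theta_p
    c_tilde = c * ((p / (p - 1.0)) ** theta * mu ** (1.0 / p)) ** (1.0 / theta_p)
    k_star = f_gap0 ** (1.0 - alpha) / (c_tilde * (1.0 - alpha) * eta)
    if k > k_star:
        raise ValueError("the bound is only stated for k <= k_star")
    base = f_gap0 ** (1.0 - alpha) - c_tilde * (1.0 - alpha) * eta * k
    return eps_tilde + max(base, 0.0) ** (1.0 / (1.0 - alpha)), k_star

# Illustrative values: p = 2, mu = 1 (e.g. f(x) = x^2 / 2), q = 3, eta = 1e-3, f(x0) - f* = 8.
for k in (0, 1000, 2000, 3000):
    value, k_star = discrete_bound(k, 8.0, 1e-3, 1.0, 1.0, 2.0, 3.0, 1e-3)
    print(k, value, k_star)
```

For these values $k^\star$ is approximately 4000 iterations, and the bracketed term decays polynomially to zero as $k$ approaches $k^\star$, leaving only the residual $\tilde{\epsilon}$.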

E ADDITIONAL DETAILS AND NUMERICAL RESULTS


In this section, we expand upon the numerical experiments discussed in the paper. In
particular, we report more details on the hyperparameter values used¹³ for the numerical tests, and
report some results for the MNIST experiments.

E.1 HYPERPARAMETER VALUES USED IN THE TESTS OF SECTION 4.1


• GD fixed step size: $\eta = 10^{-3}$
• RGF Euler disc. w/fixed step size: $q = 2.2$, $\eta = 10^{-3}$
• RGF Euler disc. w/fixed step size: $q = 3$, $\eta = 10^{-2}$
• RGF Euler disc. w/fixed step size: $q = 6$, $\eta = 10^{-2}$
• RGF Euler disc. w/fixed step size: $q = 10$, $\eta = 10^{-2}$
• GD Nesterov acceleration fixed step size: $\eta = 10^{-4}$, $\mu = 0.9$
• SGF Nesterov-like disc. w/fixed step size: $q = 2.2$, $\eta = 10^{-4}$, $\mu = 0.9$
• SGF Nesterov-like disc. w/fixed step size: $q = 3$, $\eta = 10^{-3}$, $\mu = 0.9$
• SGF Nesterov-like disc. w/fixed step size: $q = 6$, $\eta = 10^{-3}$, $\mu = 0.9$
• SGF Nesterov-like disc. w/fixed step size: $q = 10$, $\eta = 10^{-2}$, $\mu = 0.09$
• RGF Runge Kutta disc. w/fixed step size: $q = 2.2$, $K = 2$, $\eta = 10^{-2}$, $\beta_1 = 0.09$, $\alpha_1 = \alpha_2 = 0.5$
• RGF Runge Kutta disc. w/fixed step size: $q = 3$, $K = 2$, $\eta = 10^{-2}$, $\beta_1 = 0.09$, $\alpha_1 = \alpha_2 = 0.5$
• RGF Runge Kutta disc. w/fixed step size: $q = 6$, $K = 2$, $\eta = 10^{-2}$, $\beta_1 = 0.09$, $\alpha_1 = \alpha_2 = 0.5$
• RGF Runge Kutta disc. w/fixed step size: $q = 10$, $K = 2$, $\eta = 10^{-2}$, $\beta_1 = 0.09$, $\alpha_1 = \alpha_2 = 0.5$
13 In all the tests, for q-RGF and q-SGF, $c = 1$ unless otherwise stated.


E.2 HYPERPARAMETER VALUES USED IN THE TESTS OF SECTION 4.2

Note that the description of the coefficients for each of the prior art methods can be found in:
https://pytorch.org/docs/stable/optim.html.

• GD: $\eta = 4 \cdot 10^{-2}$, $\mu = 0.9$, Nesterov=True
• RGF: $\eta = 4 \cdot 10^{-2}$, $\mu = 0.9$
• SGF: $\eta = 4 \cdot 10^{-3}$, $\mu = 0.9$
• ADAM: $\eta = 8 \cdot 10^{-4}$ (remaining coefficients = nominal values)
• RMS: $\eta = 10^{-3}$ (remaining coefficients = nominal values)
• ADAGRAD: $\eta = 10^{-3}$ (remaining coefficients = nominal values)
• ADADELTA: $\eta = 4 \cdot 10^{-2}$, $\rho = 0.9$, $\epsilon = 10^{-6}$, weight decay = 0

E.3 MORE TESTS ON THE ROSENBROCK FUNCTION

Due to space limitations, we report in the main paper only one test for q-RGF with
Euler discretization, one test for q-SGF with Nesterov-like discretization, and one test for q-RGF
with Runge Kutta discretization. For the sake of completeness, we report here the remaining tests for
each algorithm. Figure 6 shows qualitative behavior similar to that observed in the
results of Section 4.5.
The step size and other hyperparameters for each test are given below:

• GD fixed step size: η = 10^{-3}
• SGF Euler disc. w/fixed step size: q = 2.1, η = 10^{-3}
• SGF Euler disc. w/fixed step size: q = 2.5, η = 10^{-3}
• SGF Euler disc. w/fixed step size: q = 2.8, η = 10^{-3}
• SGF Euler disc. w/fixed step size: q = 100, η = 10^{-2}
• GD Nesterov acceleration fixed step size: η = 10^{-4}, µ = 0.9
• RGF Nesterov-like disc. w/fixed step size: q = 2.2, η = 10^{-4}, µ = 0.9
• RGF Nesterov-like disc. w/fixed step size: q = 3, η = 10^{-3}, µ = 0.9
• RGF Nesterov-like disc. w/fixed step size: q = 6, η = 10^{-3}, µ = 0.9
• RGF Nesterov-like disc. w/fixed step size: q = 10, η = 10^{-3}, µ = 0.9
• SGF Runge-Kutta disc. w/fixed step size: q = 2.2, K = 2, η = 10^{-3}, β_1 = 0.09, α_1 = α_2 = 0.5
• SGF Runge-Kutta disc. w/fixed step size: q = 3, K = 2, η = 10^{-2}, β_1 = 0.9, α_1 = α_2 = 0.5
• SGF Runge-Kutta disc. w/fixed step size: q = 6, K = 2, η = 10^{-2}, β_1 = 0.09, α_1 = α_2 = 0.5
• SGF Runge-Kutta disc. w/fixed step size: q = 10, K = 2, η = 10^{-2}, β_1 = 0.09, α_1 = α_2 = 0.5
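To make the Euler entries in the list above concrete, the following is a minimal sketch of one forward-Euler step of each flow on a two-dimensional Rosenbrock function. It rests on several assumptions that are not stated in this appendix: the flows are taken to be ẋ = −c ∇f(x)/‖∇f(x)‖_2^{(q−2)/(q−1)} for q-RGF and ẋ = −c ‖∇f(x)‖_1^{1/(q−1)} sign(∇f(x)) for q-SGF, the Rosenbrock coefficients are the common choice a = 1, b = 100, and the starting point (−1.2, 1) is illustrative. All function names are hypothetical.

```python
import math

def rosenbrock(x, y, a=1.0, b=100.0):
    # Assumed 2-D Rosenbrock form; the paper's exact test function may differ.
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return (dx, dy)

def rgf_direction(g, q):
    # Assumed q-RGF right-hand side: gradient rescaled by ||g||_2^((q-2)/(q-1)).
    norm2 = math.hypot(g[0], g[1])
    if norm2 == 0.0:
        return (0.0, 0.0)  # stationary point: no movement
    scale = norm2 ** ((q - 2.0) / (q - 1.0))
    return (g[0] / scale, g[1] / scale)

def sgf_direction(g, q):
    # Assumed q-SGF right-hand side: ||g||_1^(1/(q-1)) times the sign of g.
    norm1 = abs(g[0]) + abs(g[1])
    mag = norm1 ** (1.0 / (q - 1.0))
    sign = lambda v: (v > 0) - (v < 0)
    return (mag * sign(g[0]), mag * sign(g[1]))

def euler_step(point, direction_fn, q, eta, c=1.0):
    # Forward-Euler update x_{k+1} = x_k - eta * c * direction(grad f(x_k)).
    g = rosenbrock_grad(*point)
    d = direction_fn(g, q)
    return (point[0] - eta * c * d[0], point[1] - eta * c * d[1])

def run_euler(point, direction_fn, q, eta, steps, c=1.0):
    # Iterate the Euler step and record f(x_k) at every iterate.
    losses = [rosenbrock(*point)]
    for _ in range(steps):
        point = euler_step(point, direction_fn, q, eta, c)
        losses.append(rosenbrock(*point))
    return point, losses
```

Under these assumptions, a call such as `run_euler((-1.2, 1.0), sgf_direction, q=2.1, eta=1e-3, steps=200)` would mirror the first SGF Euler entry of the list. Setting q = 2 in `rgf_direction` recovers the plain gradient, so the same loop also covers fixed-step GD.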
Remark 2. Choice of q: The settling-time upper bound (15) decreases as q → ∞, which appears to
lead to faster convergence when discretized. On the other hand, the larger q is, the stiffer the ODE
and the more prone it is to numerical instability, so q cannot be too large. Therefore, assuming p is not too
large, it appears that q ∈ (p, p + δ] works best, with δ > 0 as small as needed to avoid numerical
issues. For instance, if we know the cost function to be (locally) strongly convex, then we first search for
q slightly larger than p = 2 and keep increasing it until performance deteriorates. If, on
the other hand, we do not know the order p > 1, then it is currently unclear how to choose q. We will
investigate this further in future work. Furthermore, there is evidence that gradient dominance does
hold locally in many deep learning contexts (Zhou and Liang, 2017, https://arxiv.org/abs/1710.06910).
Indeed, since convexity readily leads to gradient dominance of order p = ∞, it suffices that a slightly
stronger form of it holds (but weaker than strong convexity) in order to have p < ∞, and thus for us
to be able to choose q > p.
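One way to read the trade-off in Remark 2 is through the length of an explicit Euler step. The short derivation below is a sketch that assumes the q-RGF takes the form ẋ = −c ∇f(x)/‖∇f(x)‖^{(q−2)/(q−1)}, which is not restated in this appendix; η denotes the fixed step size, as in the hyperparameter lists.

```latex
\begin{align*}
\|\dot{x}\| &= c\,\frac{\|\nabla f(x)\|}{\|\nabla f(x)\|^{\frac{q-2}{q-1}}}
             = c\,\|\nabla f(x)\|^{\frac{1}{q-1}}, \\
\|x_{k+1}-x_k\| &= \eta\,c\,\|\nabla f(x_k)\|^{\frac{1}{q-1}}
  \;\xrightarrow[\,q\to\infty\,]{}\; \eta\,c
  \qquad \text{for } \nabla f(x_k)\neq 0 .
\end{align*}
```

For q = 2 the step length is η c ‖∇f(x_k)‖, which shrinks with the gradient near a minimizer. For large q the step length stays close to η c until the gradient is extremely small, so under this assumption a fixed-step scheme may oscillate in a neighborhood whose radius is of the order of η c. This is consistent with the recommendation to keep q only slightly above p.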


[Figure 6 plots: f(x_k) − f* versus iteration k (0 to 200), in three panels. Panel 1: GD (fixed step size) vs. SGF with Euler discretization, fixed step, q = 2.1, 2.5, 2.8, 100. Panel 2: GD (Nesterov acceleration, fixed step size) vs. RGF with Nesterov-like discretization, fixed step, q = 2.2, 3, 6, 10. Panel 3 (logarithmic vertical axis): GD (fixed step size) vs. SGF with Runge-Kutta discretization, fixed step, q = 2.2, 3, 6, 10.]

Figure 6: Example of the proposed discretization algorithms of finite-time q-RGF and q-SGF

E.4 EXPERIMENT 3: MNIST CNN MODEL

In this experiment we optimize the CNN described by the PyTorch code in
MODEL 1, with a negative log-likelihood loss, on the MNIST dataset.
We use 10 epochs of training, with 60 batches of 1000 images for training and 10 batches of 1000
images for testing. We tested Algorithm 1 RGF (c = 1, q = 3, η = 0.06, µ = 0.9) and Algorithm
2 SGF (c = 0.001, q = 2.1, η = 0.06, µ = 0.9) against Nesterov's accelerated gradient descent
(GD) (η = 0.06, µ = 0.9, Nesterov=True) and Adam (η = 0.004, remaining coefficients at the nominal
values of torch.optim). We also tested RMSprop, AdaGrad, and AdaDelta,
but since neither their convergence nor their test performance was competitive with
GD in this experiment, we do not report them here, to avoid overloading the graphs.
Figures 7 and 8 show the training loss over the training iterations and over computation time. GD, RGF,
and SGF converge faster than Adam (a lead of about 20 sec on average), and
the test performance is 98% for Adam, 98% for GD, and 99% for both RGF and SGF.
RGF and SGF converge slightly faster than GD. The gain is relatively
small (5 sec to 10 sec on average), which is expected for such a small network (please refer
to the VGG16-CIFAR10 and VGG16-SVHN test results for larger computation-time gains). We also
notice in Figure 7 that all algorithms behave well in terms of avoiding overfitting.

Figure 7: Losses for several optimization algorithms- CNN- MNIST: Train loss (left), test loss (right)


Figure 8: Training loss vs. computation time for several optimization algorithms- MNIST

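MODEL 1: MNIST-CNN
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Two 5x5 convolutions (1 -> 10 -> 20 channels), dropout on the second.
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        # 20 channels x 4 x 4 spatial positions = 320 features for 28x28 inputs.
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        # Log-probabilities over the 10 classes, for use with the NLL loss.
        return F.log_softmax(x, dim=1)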

E.5 EXPERIMENT 4: EULER DISCRETIZATION ON SVHN DATASET


In the main paper, due to space limitations, we reported the results of the Nesterov-like discretization only.
This also seems like a fair comparison, since our Nesterov-like discretization of the q-RGF and q-SGF flows can be
compared against the Nesterov implementation of the GD algorithm. However, we also wanted to test the performance
of the simple Euler discretization of the proposed flows against the simple GD algorithm. To do so, we ran some
extra tests on the SVHN dataset. These results are presented below.
We tested the proposed Euler algorithms to train the same VGG16 CNN model with cross-entropy loss. We
divided the dataset into a training set of 70 batches of 1000 images each and a test set of 10 batches of 1000 images
each, and ran 20 epochs of training over all the training batches. We tested the Euler discretization of q-RGF
(c = 1, q = 2.1, η = 0.04, µ = 0.9) and the Euler discretization of q-SGF (c = 10^{-3}, q = 2.1, η =
0.04, µ = 0.9) against gradient descent (GD) and Adam (same optimal tuning as in Section 4.2). All algorithms
have been implemented in their stochastic version.
In Figures 9 and 10 we can see that both algorithms, Euler q-RGF and Euler q-SGF, converge faster (a lead of about 40 min
on average) than SGD and Adam for these tests, and reach an overall better performance on the test set.

E.6 EXPERIMENT 5: RUNGE-KUTTA DISCRETIZATION ON SVHN DATASET


Finally, for completeness, we also wanted to test the performance of the Runge-Kutta discretization of the
proposed flows against SGD. To do so, we ran some extra tests on the SVHN dataset. These results are presented
below.
We tested the proposed Runge-Kutta algorithms to train the same VGG16 CNN model with cross-entropy loss.
We divided the dataset into a training set of 70 batches of 1000 images each and a test set of 10 batches of 1000
images each, and ran 20 epochs of training over all the training batches. We tested the Runge-Kutta discretization
of q-RGF (c = 1, q = 2.1, K = 2, η = 10^{-2}, β_1 = 10^{-2}, α_1 = α_2 = 0.5) and the Runge-Kutta
discretization of q-SGF (c = 10^{-3}, q = 2.1, K = 2, η = 10^{-2}, β_1 = 10^{-2}, α_1 = α_2 = 0.5) against


Figure 9: Losses for several optimization algorithms- SVHN: Train loss (left), test loss (right)

Figure 10: Training loss vs. computation time for several optimization algorithms- VGG-16- SVHN

gradient descent (GD) and Adam (same optimal tuning as in Section 4.2). All algorithms have been implemented
in their stochastic version.
In Figures 11 and 12 we can see that both algorithms, Runge-Kutta q-RGF and Runge-Kutta q-SGF, converge faster
(a lead of about 40 min on average) than SGD and Adam for these tests.

Figure 11: Losses for several optimization algorithms- SVHN: Train loss (left), test loss (right)


Figure 12: Training loss vs. computation time for several optimization algorithms- VGG-16- SVHN
