First Order Optimization Algorithms Via Discretization of Finite Time Convergent Flows
Anonymous authors
Paper under double-blind review
ABSTRACT
In this paper we study the performance of several discretization algorithms for two first-order
finite-time optimization flows, namely the rescaled-gradient flow (RGF) and the signed-gradient
flow (SGF). These flows are non-Lipschitz or discontinuous dynamical systems that converge
locally in finite time to the minima of gradient-dominated functions. We introduce three
discretization methods for these first-order finite-time flows and provide convergence guarantees.
We then apply the proposed algorithms to the training of neural networks and empirically test their
performance on three standard datasets, namely CIFAR10, SVHN, and MNIST. Our results show
that our schemes converge faster than standard optimization alternatives, while achieving
equivalent or better accuracy.
1 INTRODUCTION
Consider the unconstrained minimization problem for a given cost function f : Rn → R. When f is
sufficiently regular, the standard algorithm in continuous time (dynamical system) is given by

ẋ = F_GF(x) ≜ −∇f(x),   (1)

with ẋ ≜ (d/dt)x(t), known as the gradient flow (GF). Generalizing GF, the q-rescaled GF (q-RGF) Wibisono et al. (2016), given by

ẋ = −c ∇f(x) / ‖∇f(x)‖_2^((q−2)/(q−1)),   (2)

with c > 0 and q ∈ (1, ∞], has an asymptotic convergence rate f(x(t)) − f(x⋆) = O(1/t^(q−1)) under
mild regularity, for ‖x(0) − x⋆‖ > 0 small enough, where x⋆ ∈ Rn denotes a local minimizer of f.
However, we recently proved Romero & Benosman (2020) that the q-RGF, as well as our proposed
q-signed GF (q-SGF)

ẋ = −c ‖∇f(x)‖_1^(1/(q−1)) sign(∇f(x)),   (3)

where sign(·) denotes the sign function, applied element-wise for (real-valued) vectors, are both
finite-time convergent, provided that f is gradient dominated of order p ∈ (1, q). In particular, if f is
strongly convex, then q-RGF and q-SGF are finite-time convergent for any q ∈ (2, ∞], since f must
be gradient dominated of order p = 2.
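To make the flows concrete, the following is a minimal NumPy sketch of the three vector fields (1)–(3); the function names and the eps safeguard near ∇f(x) = 0 are our own illustrative choices rather than part of the flows' definitions.

import numpy as np

def F_gf(grad):
    # Gradient flow (1): x_dot = -grad f(x).
    return -grad

def F_rgf(grad, c=1.0, q=3.0, eps=1e-12):
    # q-RGF (2): x_dot = -c * grad f(x) / ||grad f(x)||_2^((q-2)/(q-1)).
    # eps guards the singularity at grad f(x) = 0 (illustrative choice).
    norm = np.linalg.norm(grad, 2)
    return -c * grad / max(norm ** ((q - 2.0) / (q - 1.0)), eps)

def F_sgf(grad, c=1.0, q=3.0):
    # q-SGF (3): x_dot = -c * ||grad f(x)||_1^(1/(q-1)) * sign(grad f(x)).
    return -c * np.linalg.norm(grad, 1) ** (1.0 / (q - 1.0)) * np.sign(grad)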
CONTRIBUTION
In this paper, we explore three discretization schemes for the q-RGF (2) and q-SGF (3) and provide
some convergence guarantees using results from hybrid dynamical control theory. In particular, we
explore a forward-Euler discretization of RGF/SGF, followed by an explicit Runge-Kutta discretiza-
tion, and finally a novel Nesterov-like discretization. We then test their performance on both synthetic
and real-world data in the context of deep learning, namely, over the well-known datasets CIFAR10,
SVHN, and MNIST.
RELATED WORK
Propelled by the work of Wang & Elia (2011) and Su et al. (2014), there has been a recent and
significant research effort dedicated to analyzing optimization algorithms from the perspective of
dynamical systems and control theory, especially in continuous time Wibisono et al. (2016); Wilson
(2018); Lessard et al. (2016); Fazlyab et al. (2017b); Scieur et al. (2017); França et al. (2018); Fazlyab
et al. (2018); Fazlyab et al. (2018); Taylor et al. (2018); França et al. (2019a); Orvieto & Lucchi
(2019); Muehlebach & Jordan (2019). A major focus within this initiative is acceleration, both in
terms of gaining new insight into more traditional optimization algorithms from this perspective,
and in terms of exploiting the interplay between continuous-time systems and their potential
discretizations for novel algorithm design Muehlebach & Jordan (2019); Fazlyab et al. (2017a);
Shi et al. (2018); Zhang et al. (2018); França et al. (2019b); Wilson et al. (2019). Many of these papers also focus on
explicit mappings and matchings of convergence rates from the continuous-time domain into discrete
time.
For older work connecting ordinary differential equations (ODEs) and their numerical analysis with
optimization algorithms, see Botsaris (1978a;b); Zghier (1981); Snyman (1982; 1983); Brockett
(1988); Brown (1989). In Helmke & Moore (1994), the authors studied relationships between linear
programming, ODEs, and general matrix theory. Further, Schropp (1995) and Schropp & Singer
(2000) explored several aspects linking nonlinear dynamical systems to gradient-based optimization,
including nonlinear constraints.
Tools from Lyapunov stability theory are often employed for this purpose, mainly because a rich
body of relevant work already exists within the nonlinear systems and control theory community.
In particular, previous works typically seek asymptotically Lyapunov stable
gradient-based systems with an equilibrium (stationary point) at an isolated extremum of the given
cost function, thus certifying local convergence. Naturally, global asymptotic stability leads to global
convergence, though such an analysis will typically require the cost function to be strongly convex
everywhere.
For physical systems, a Lyapunov function can often be constructed from first principles via some
physically meaningful measure of energy (e.g., total energy = potential energy + kinetic energy). In
optimization, the situation is somewhat similar in the sense that a suitable Lyapunov function may
often be constructed by taking simple surrogates of the objective function as candidates. For instance,
V(x) ≜ f(x) − f(x⋆) can be a good initial candidate. Further, if f is continuously differentiable
and x⋆ is an isolated stationary point, then another alternative is V(x) ≜ ‖∇f(x)‖². However, most
fundamental and applied research conducted in systems and control regarding Lyapunov stability
theory deals exclusively with continuous-time systems. Unfortunately, (dynamical) stability properties
are generally not preserved for simple forward-Euler and sample-and-hold discretizations and control
laws Stuart & Humphries (1998). Furthermore, practical implementations of optimization algorithms
in modern digital computers demand discrete-time implementations. Nonetheless, it has been extensively noted that a
vast amount of general Lyapunov-based results appear to have a discrete-time equivalent.
In that sense, we aim here to start from the q-RGF and q-SGF continuous flows, characterized by
their Lyapunov-based finite-time convergence, and seek discretization schemes that allow us to
‘shadow’ the solutions of these flows in discrete time, hoping to achieve an acceleration of the discrete
methods inspired by the finite-time convergence characteristics of the continuous flows.
• If the step sizes are adaptive, i.e. if we replace η by a sequence {η_k} with η_k > 0, then we
only need to replace F_d(k, x) ≜ x − η_k ∇f(x), provided that {η_k} is not computed using
feedback from {x_k} (e.g. through a line search method).
• If we do wish to use feedback¹ (and no memory past the most recent output and step size),
then we can set m = n + 1, G([x; η]) ≜ x, and F_d([x; η]) ≜ [F_d^(1)([x; η]); F_d^(2)([x; η])],
where F_d^(1)([x; η]) ≜ x − η∇f(x), and F_d^(2) is a user-defined function that dictates the
updates in the step size. In particular, an open-loop (no feedback) adaptive step size {η_k} may
also be achieved under this scenario, provided that it is possible to write η_{k+1} = F_d^(2)(η_k).
If this is not possible (and still open-loop step size), then we may take F_d^(2)(k, X) ≜ η_{k+1},
and of course add a k-argument in F_d.
• If we wish to use individual step sizes for each of the n components of {x_k}, then it suffices
to take η_k as an n-dimensional vector (thus m = 2n), and make appropriate changes in F_d
and G.
In each of these cases, GD can be seen as a forward-Euler discretization of the GF (1), i.e.,

x_{k+1} = x_k + Δt_k F_GF(x_k),   (8)

with F_GF = −∇f and adaptive time step Δt_k ≜ t_{k+1} − t_k chosen as Δt_k = η_k.
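As a sanity check, the forward-Euler reading (8) of GD is a one-line loop; the quadratic test function below is an illustrative choice.

import numpy as np

def euler_gd(grad_f, x0, steps=100, eta=0.1):
    # Forward-Euler discretization (8) of the GF (1) with dt_k = eta_k = eta.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + eta * (-grad_f(x))  # x_{k+1} = x_k + dt_k * F_GF(x_k)
    return x

# Example: f(x) = 0.5*||x||^2, so grad f(x) = x and the minimizer is the origin.
x_min = euler_gd(lambda x: x, x0=[1.0, -2.0])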
Example 2. The proximal point algorithm (PPA)

x_{k+1} = argmin_{x∈Rn} { f(x) + (1/(2η_k)) ‖x − x_k‖² },   (9)

with step size η_k > 0 (open loop, for simplicity) can also be written in the form (6), by taking m = n,
F_d(k, x) ≜ argmin_{x′∈Rn} { f(x′) + (1/(2η_k)) ‖x′ − x‖² }, and G(x) ≜ x. Naturally, we need to assume
sufficient regularity for F_d(k, x) to exist and we must design a consistent way to choose F_d(k, x)
when multiple minimizers exist in the definition of F_d(k, x). Alternatively, these conditions must
be satisfied, at the very least, at every (k, x) ∈ {(0, x_0), (1, x_1), (2, x_2), . . .} for a particular chosen
initial x_0 ∈ Rn.
Assuming sufficient regularity, we have ∇_x { f(x) + (1/(2η_k)) ‖x − x_k‖² } |_{x=x_{k+1}} = 0, and thus

∇f(x_{k+1}) + (1/η_k)(x_{k+1} − x_k) = 0 ⟺ x_{k+1} = x_k + Δt_k F_GF(x_{k+1}),   (10)

with Δt_k = η_k, which is precisely the backward-Euler discretization of the GF (1).
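A backward-Euler (proximal) step can likewise be sketched by solving the inner minimization in (9) numerically; the use of scipy.optimize.minimize below is an illustrative solver choice, not part of the algorithm's definition.

import numpy as np
from scipy.optimize import minimize

def prox_step(f, x_k, eta=0.5):
    # One PPA step (9), i.e. a backward-Euler step (10) of the GF (1):
    # x_{k+1} = argmin_x { f(x) + ||x - x_k||^2 / (2*eta) }.
    obj = lambda x: f(x) + np.sum((x - x_k) ** 2) / (2.0 * eta)
    return minimize(obj, x_k).x  # warm-start the inner solver at x_k

# Example: f(x) = 0.5*||x||^2 has the closed form x_{k+1} = x_k / (1 + eta).
x_next = prox_step(lambda x: 0.5 * np.sum(x ** 2), np.array([1.0, 2.0]))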
¹Also known as closed-loop design in control-theoretic and reinforcement learning terminology, meaning
that η_k = ϕ(k, x_k) for some ϕ : Z₊ × Rn → R₊ that does not depend on {X_0, X_1, X_2, . . .}. On the other
hand, open-loop design can be seen as closed loop with ϕ(k, ·) constant for each k ∈ Z₊.
and G([y; x]) ≜ x for y, x ∈ Rn. In other words, X_k = [y_k; x_k]. Traditionally, β_k = (k−1)/(k+2), but
clearly, if we set η_k = η > 0 and β_k = β ∈ (0, 1) (in practice, η ≈ 0 and β ≈ 1), then we can drop k
from F_d(k, [y; x]).
There exist a few approaches in the literature on the interpretation of N-AGD (11b) as the dis-
cretization of a second-order continuous-time dynamical system, namely via a vanishing step size
argument Su et al. (2014), or via symplectic Euler schemes of crafted Hamiltonian systems Muehle-
bach & Jordan (2019); França et al. (2019b).
We are now ready to state the finite-time convergence of the q-RGF (2) and q-SGF (3).
Theorem 1 (Romero & Benosman (2020)). Suppose that f : Rn → R is continuously differentiable
and µ-gradient dominated of order p ∈ (1, ∞) near a strict local minimizer x⋆ ∈ Rn. Let c > 0 and
q ∈ (p, ∞]. Then, any maximal solution x(·), in the sense of Filippov, to the q-RGF (2) or q-SGF (3)
will converge in finite time to x⋆, provided that ‖x(0) − x⋆‖ > 0 is sufficiently small. More precisely,
lim_{t→t⋆} x(t) = x⋆, where the convergence time t⋆ < ∞ may depend on which flow is used, but in both
cases is upper bounded by

t⋆ ≤ ‖∇f(x_0)‖^(1/θ − 1/θ′) / (c C^(1/θ) (1 − θ/θ′)),   (16)

where x_0 = x(0), C = (p/(p−1))^((p−1)/p) µ^(1/p), θ = (p−1)/p, and θ′ = (q−1)/q. In particular, given any compact
and positively invariant subset S ⊂ D, both flows converge in finite time with the aforementioned
convergence time upper bound (which can be tightened by replacing D with S) for any x_0 ∈ S.
Furthermore, if D = Rn, then we have global finite-time convergence, i.e. finite-time convergence for
any maximal solution (in the sense of Filippov) x(·) with arbitrary x_0 ∈ Rn.
In essence, the analysis (introduced in Romero & Benosman (2020)) consists of leveraging the
gradient dominance to show that the energy function E(t) ≜ f(x(t)) − f⋆ satisfies the Lyapunov-like
differential inequality Ė(t) ≤ −c̃ E(t)^α for some c̃ > 0 and α < 1. The detailed proof is recalled in Appendix
C for completeness.
for η > 0, K ∈ {1, 2, 3, . . .}, and F ∈ {F_q-RGF, F_q-SGF}. This method is well known to be
numerically stable under the consistency condition ∑_{i=1}^{K} α_i = 1 Stuart & Humphries (1996).
However, in our optimization framework, we want to be able to guarantee that the stable numerical
solution of (18) remains close to the solution of the continuous flows. In other words, we seek
arbitrarily small global error, also known as shadowing. This will be discussed in Theorem 2.
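As one admissible instance of such a scheme, the classical 4-stage explicit Runge-Kutta method (with weights (1, 2, 2, 1)/6, which indeed sum to 1) applied to a flow F reads as follows; this is a hedged sketch, not necessarily the exact stage configuration used in our experiments.

import numpy as np

def rk4_step(F, x, eta):
    # One classical 4-stage explicit Runge-Kutta step for x_dot = F(x);
    # the stage weights (1, 2, 2, 1)/6 satisfy the consistency condition.
    k1 = F(x)
    k2 = F(x + 0.5 * eta * k1)
    k3 = F(x + 0.5 * eta * k2)
    k4 = F(x + eta * k3)
    return x + (eta / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)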
Therefore, given any optimization flow represented by the continuous-time system ẋ = F(x), locally
convergent to a local minimizer x⋆ ∈ Rn of a cost function f : Rn → R, we can replicate Nesterov's
acceleration of (1). More precisely, we obtain the algorithm

x_{k+1} = x_k + ηF(x_k + βy_k) + βy_k,   (20a)
y_{k+1} = x_{k+1} − x_k.   (20b)

Based on this idea, we propose two ‘Nesterov-like’ discrete optimization algorithms. The first one,
based on the q-RGF continuous flow, is defined as:

x_{k+1} = x_k + η ( −c ∇f(x_k + βy_k) / ‖∇f(x_k + βy_k)‖^((q−2)/(q−1)) ) + βy_k,   (21a)
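A minimal sketch of the resulting iteration, combining (21a) with the momentum update y_{k+1} = x_{k+1} − x_k from (20b); the eps safeguard near vanishing gradients is our own illustrative choice.

import numpy as np

def nesterov_rgf(grad_f, x0, steps=200, eta=1e-3, beta=0.9, c=1.0, q=3.0, eps=1e-12):
    # Nesterov-like discretization (20)/(21a) of the q-RGF (2).
    x = np.asarray(x0, dtype=float)
    y = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x + beta * y)                       # gradient at look-ahead point
        rgf = -c * g / max(np.linalg.norm(g) ** ((q - 2.0) / (q - 1.0)), eps)
        x_next = x + eta * rgf + beta * y              # (21a)
        y = x_next - x                                 # (20b)
        x = x_next
    return x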
Theorem 2 thus shows that ε-convergence of x_k → x⋆ can be achieved in a finite number of
steps upper bounded by k⋆ = (f(x_0) − f(x⋆))^(1−α) / (c̃(1−α)η). This is a preliminary convergence result, which is
meant to show the existence of discrete solutions obtained via the proposed discretization algorithms,
which are ε-close to the continuous solutions of the finite-time flows. We also underline here that
after x_k reaches an ε-neighborhood of x⋆, we have x_{k+1} ≈ x_k, ∀k > k⋆, since x⋆ is an equilibrium
point of the continuous flows; see Definition 2 in Appendix B.
Let us first show, using a simple numerical example, that the acceleration in convergence, proven
in continuous time for a certain range of the hyperparameters, can translate to some convergence
acceleration in discrete time, as shown in Theorem 2. We consider the Rosenbrock function f :
R² → R, given by f(x_1, x_2) = (a − x_1)² + b(x_2 − x_1²)², with parameters a, b ∈ R. This function
admits exactly one stationary point (x_1⋆, x_2⋆) = (a, a²) for b > 0, and is locally strongly convex,
hence locally satisfies gradient dominance of order p = 2, which allows us to select q > 2 in q-RGF
³Note that there might be several ways of approaching this proof. For instance, one could follow the general
results on stochastic approximation of set-valued dynamical systems, using the notion of perturbed solutions to
differential inclusions presented in Benaïm et al. (2005).
and q-SGF to achieve finite-time convergence in continuous time. We report in Figure 1 the mean
performance of all three discretizations for q-RGF and q-SGF⁴ with fixed step size⁵, for several
values of q, for 10 random initial conditions in [0, 2]. We observe for all three discretizations that,
as expected from the continuous-flow analysis, for q close to 2, q-RGF behaves similarly to GD in
terms of convergence rate, whereas for q > 2 the finite-time convergence in continuous time seems to
translate to some acceleration in this simple discretization method. Similarly for q-SGF: q closer to 2
translates to less accelerated algorithms, with behavior similar to GD, whereas larger q values lead
to accelerated convergence.
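This experiment can be reproduced in a few lines; the following hedged sketch runs the Euler-discretized q-RGF on the Rosenbrock function (the values a = 1, b = 100 and the step size are our illustrative choices, not necessarily those used to generate Figure 1).

import numpy as np

a, b = 1.0, 100.0
grad_f = lambda x: np.array([
    -2.0 * (a - x[0]) - 4.0 * b * x[0] * (x[1] - x[0] ** 2),
    2.0 * b * (x[1] - x[0] ** 2),
])

def euler_rgf(x0, steps=200, eta=1e-3, c=1.0, q=3.0, eps=1e-12):
    # Forward-Euler discretization of the q-RGF (2) on the Rosenbrock function.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        x = x - eta * c * g / max(np.linalg.norm(g) ** ((q - 2.0) / (q - 1.0)), eps)
    return x

x_final = euler_rgf(np.random.uniform(0.0, 2.0, size=2))  # random start in [0, 2]^2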
[Figure 1 plots f(x_k) − f⋆ versus iteration k, comparing GD (fixed step size and Nesterov acceleration) against the Euler-discretized q-RGF, the Nesterov-like-discretized q-SGF, and the Runge-Kutta-discretized q-RGF, each with fixed steps and q ∈ {2.2, 3, 6, 10}.]
Figure 1: Example of the proposed discretization algorithms of finite-time q-RGF and q-SGF
We report here the results of our experiments on deep neural network (DNN) training on three
well-known datasets, namely CIFAR10, MNIST, and SVHN. We report results on CIFAR10 and
SVHN in the sequel, while results on MNIST can be found in Appendix E. Note that we use the
PyTorch platform to conduct all the tests reported in this paper, and do not use GPUs. We underline
here that the DNN training losses are non-convex globally; however, one could assume at least local
convexity, hence local gradient dominance of order p = 2, and thus we select q > 2 in our experiments
(see Remark 2, Appendix E, for more explanation on the choice of q).
mainstream algorithms⁶, such as Nesterov's accelerated gradient descent (GD), Adam, Adaptive
Gradient (AdaGrad), the per-dimension learning rate method for gradient descent (AdaDelta), and
Root Mean Square Propagation (RMSprop)⁷. Note that all algorithms have been implemented in their
stochastic version⁸, i.e., using a mini-batch implementation with constant step size. For instance, in
Figures 2 and 3⁹, we see the training loss for the different optimization algorithms. We notice that the
proposed algorithms, RGF and SGF, quickly separate from the GD and RMS algorithms in terms
of convergence speed, but also end up with an overall better performance on the test set: 84% vs.
83% for GD and RMS. We also note that other algorithms, such as AdaGrad and AdaDelta, behave
similarly to RGF in terms of convergence speed, but lag behind in terms of final performance: 75%
and 68%, respectively. Finally, in Figure 3, we notice that Adam is slower in terms of computation
time w.r.t. SGF and RGF, with an average lag of 8 min and 80 min, respectively. However, to be fair,
one has to underline that Adam is an adaptive method based on the extra computation and memory
overhead of lower-order moment estimates and bias correction, Kingma & Ba (2015). Furthermore, to
better compare the performance of these algorithms, we report in Figure 2 the loss on the test dataset
over the learning iterations. We confirm that RGF and SGF perform better than GD, RMSprop, and
Adam, while avoiding the overfitting observed with AdaGrad and AdaDelta.
Figure 2: Losses for several optimization algorithms- VGG-16- CIFAR10: Train loss (left), test loss
(right)- We add an S before the name of an algorithm to denote its stochastic implementation.
Figure 3: Training loss vs. computation time for several optimization algorithms- VGG-16- CIFAR10
⁶We ran several tests to optimally tune the parameters of each algorithm on a validation set (a tenth
of the training set), and we report the best final accuracy performance we could obtain for each one. We
implemented all algorithms with the same total number of iterations, so that we can compare the convergence
speed of all algorithms over a common iteration range. Details of the hyper-parameter values are given in
the Appendix.
⁷The original reference for each method can be found at: https://pytorch.org/docs/stable/optim.html
⁸We want to underline here that our first tests were done in the deterministic setting; however, to compare the
proposed optimization methods against the best optimization algorithms available for DNN training, we decided
to also conduct comparison tests in the stochastic setting. Since the results remain qualitatively unchanged, we
only report here the results from the stochastic setting.
⁹To avoid overloading the figures, we report only the computation-time plots of the three most competitive
methods: RGF, SGF, and Adam.
Figure 4: Losses for several optimization algorithms - CNN- SVHN: Train loss (left), test loss (right)
Figure 5: Training loss vs. computation time for several optimization algorithms- VGG-16- SVHN
Numerical results on MNIST, and on SVHN using Euler and Runge-Kutta discretizations, showing
similar qualitative behavior, can be found in Appendix E.
5 CONCLUSION
We studied connections between optimization algorithms and continuous-time representations (dy-
namical systems) via discretization. We then reviewed two families of non-Lipschitz or discontinuous
first-order optimization flows for continuous-time optimization, namely the q-RGF and q-SGF, whose
distinguishing characteristic is their finite-time convergence. We then proposed three discretization
methods for these flows, namely a forward-Euler discretization, followed by an explicit Runge-Kutta
discretization, and finally a novel Nesterov-like discretization. Based on tools from hybrid systems
control theory, we proved a convergence bound for these algorithms. Finally, we conducted numerical
experiments on well-known deep neural network benchmarks, which showed that the proposed
discrete algorithms can outperform some state-of-the-art algorithms when tested on large DNN models.
¹⁰We also tested Adaptive Gradient (AdaGrad), the per-dimension learning rate method for gradient descent
(AdaDelta), and Root Mean Square Propagation (RMSprop). However, since their performance was not
competitive, we decided not to report the plots, to avoid overloading the figures.
REFERENCES
Hedy Attouch and Jérôme Bolte. On the convergence of the proximal algorithm for nonsmooth
functions involving analytic features. Mathematical Programming B, 116(1):5–16, 2009.
Andrea Bacciotti and Francesca Ceragioli. Stability and stabilization of discontinuous systems and
nonsmooth Lyapunov functions. ESAIM: Control, Optimisation and Calculus of Variations, 4:
361–376, 1999.
Michel Benaı̈m, Josef Hofbauer, and Sylvain Sorin. Stochastic approximations and differential
inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.
Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth suban-
alytic functions with applications to subgradient dynamical systems. Society for Industrial and
Applied Mathematics, 17:1205–1223, January 2007.
C.A Botsaris. A class of methods for unconstrained minimization based on stable numerical integra-
tion techniques. Journal of Mathematical Analysis and Applications, 63(3):729–749, 1978a.
C.A. Botsaris. Differential gradient methods. Journal of Mathematical Analysis and Applications, 63
(1):177–198, 1978b.
R.W. Brockett. Dynamical systems that sort lists, diagonalize matrices and solve linear programming
problems. In IEEE Conference on Decision and Control, pp. 799–803, 1988.
A.A. Brown. Some effective methods for unconstrained optimization based on the solution of
systems of ordinary differential equations. Journal of Optimization Theory and Applications, 62
(2):211–224, August 1989.
Frank H. Clarke. Generalized gradients of Lipschitz functionals. Advances in Mathematics, 40(1):
52–67, 1981.
Jorge Cortés. Finite-time convergent gradient flows with applications to network consensus. Auto-
matica, 42(11):1993–2000, November 2006.
Jorge Cortés. Discontinuous dynamical systems. IEEE Control Systems Magazine, 28(3):36–73, June
2008.
Jorge Cortés and Francesco Bullo. Coordination and geometric optimization via distributed dynamical
systems. SIAM Journal on Control and Optimization, 44(5):1543–1574, October 2005.
M. Fazlyab, A. Koppel, V. M. Preciado, and A. Ribeiro. A variational approach to dual methods for
constrained convex optimization. In 2017 American Control Conference (ACC), pp. 5269–5275,
May 2017a.
M. Fazlyab, A. Koppel, V. M. Preciado, and A. Ribeiro. A variational approach to dual methods for
constrained convex optimization. In 2017 American Control Conference (ACC), pp. 5269–5275,
May 2017b.
M. Fazlyab, M. Morari, and V. M. Preciado. Design of first-order optimization algorithms via sum-
of-squares programming. In IEEE Conference on Decision and Control (CDC), pp. 4445–4452,
December 2018.
Mahyar Fazlyab, Alejandro Ribeiro, Manfred Morari, and Victor M. Preciado. Analysis of optimiza-
tion algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM J. Optim,
28(3):2654–2689, 2018.
Aleksei Fedorovich Filippov and F. M. Arscott. Differential equations with discontinuous righthand
sides. Kluwer Academic Publishers Group, Dordrecht, Netherlands, 1988.
G. França, D. Robinson, and R. Vidal. ADMM and accelerated ADMM as continuous dynamical
systems. In International Conference on Machine Learning, July 2018.
G. França, D.P. Robinson, and R. Vidal. A dynamical systems perspective on nonsmooth constrained
optimization. arXiv preprint 1808.04048, 2019a.
G. França, J. Sulam, D. Robinson, and R. Vidal. Conformal symplectic and relativistic optimization.
arXiv preprint 1903.04100, 2019b.
Uwe Helmke and John Barratt Moore. Optimization and Dynamical Systems. Springer-Verlag, 1994.
Qing Hui, Wassim Haddad, and Sanjay Bhat. Semistability, finite-time stability, differential inclusions,
and discontinuous dynamical systems having a continuum of equilibria. IEEE Transactions on
Automatic Control, 54:2465–2470, November 2009.
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-
gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
Diederik P. Kingma and Jimmy Lei Ba. Adam: a method for stochastic optimization. International
Conference on Learning Representations, pp. 1–15, May 2015.
L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral
quadratic constraints. SIAM J. Optim, 26(1):57–95, 2016.
S. Łojasiewicz. A topological property of real analytic subsets (in French). Les équations aux
dérivées partielles, pp. 87–89, 1963.
S. Łojasiewicz. Ensembles semi-analytiques. Centre de Physique Theorique de l'Ecole Polytechnique,
1965. URL https://perso.univ-rennes1.fr/michel.coste/Lojasiewicz.pdf.
Stanisław Łojasiewicz and Maria-Angeles Zurro. On the gradient inequality. Bulletin of the Polish
Academy of Sciences, Mathematics, 47, January 1999.
Michael Muehlebach and Michael Jordan. A dynamical systems perspective on Nesterov acceleration.
In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International
Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.
4656–4662, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
A. Orvieto and A. Lucchi. Shadowing properties of optimization algorithms. In Neural Information
Processing Systems, December 2019.
Bradley Paden and Shankar Sastry. A calculus for computing Filippov's differential inclusion with
application to the variable structure control of robot manipulators. IEEE Transactions on Circuits
and Systems, 34:73–82, February 1987.
Boris Polyak. Gradient methods for the minimisation of functionals (in Russian). USSR Computa-
tional Mathematics and Mathematical Physics, 3:864–878, December 1963.
O. Romero and M. Benosman. Finite-time convergence in continuous-time optimization. In Interna-
tional Conference on Machine Learning, Vienna, Austria, July 2020.
R. G. Sanfelice and A. R. Teel. Dynamical properties of hybrid systems simulators. Automatica, 46:
239–248, 2010.
Johannes Schropp. Using dynamical systems methods to solve minimization problems. Applied
Numerical Mathematics, 18(1):321–335, 1995.
Johannes Schropp and I Singer. A dynamical systems approach to constrained minimization. Numer-
ical Functional Analysis and Optimization, 21:537–551, May 2000.
D. Scieur, V. Roulet, F. Bach, and A. d'Aspremont. Integration methods and optimization algorithms.
In Neural Information Processing Systems, December 2017.
Bin Shi, Simon Du, Michael Jordan, and Weijie Su. Understanding the acceleration phenomenon via
high-resolution differential equations. arXiv preprint 1810.08907, October 2018.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. International Conference on Learning Representations, pp. 1–14, May 2015.
J.A. Snyman. A new and dynamic method for unconstrained minimization. Applied Mathematical
Modelling, 6(6):448–462, December 1982.
J.A. Snyman. An improved version of the original leap-frog dynamic method for unconstrained
minimization: LFOP1(b). Applied Mathematical Modelling, 7(3):216–218, June 1983.
A. M. Stuart and A. R. Humphries. Dynamical systems and numerical analysis. Cambridge University
Press, 1996.
Andrew M. Stuart and A. R. Humphries. Dynamical systems and numerical analysis. Cambridge
University Press, first edition, November 1998.
W. Su, S. Boyd, and E. J. Candes. A differential equation for modeling Nesterov’s accelerated
gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp.
2510–2518. Curran Associates, Inc., 2014.
Adrien Taylor, Bryan Van Scoy, and Laurent Lessard. Lyapunov functions for first-order methods:
Tight automated convergence guarantees. In International Conference on Machine Learning,
Stockholm, Sweden, July 2018.
J. Wang and N. Elia. A control perspective for centralized and distributed convex optimization. In
IEEE Conference on Decision and Control and European Control Conference, pp. 3800–3805,
December 2011.
Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated
methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358,
2016.
A. Wilson. Lyapunov Arguments in Optimization. PhD thesis, UC Berkeley, 2018.
Ashia Wilson, Lester Mackey, and Andre Wibisono. Accelerating rescaled gradient descent: Fast op-
timization of smooth functions. In Advances in Neural Information Processing Systems. December
2019.
Abbas K. Zghier. The use of differential equations in optimization. PhD thesis, Loughborough
University, 1981.
J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie. Direct Runge-Kutta discretization achieves
acceleration. In Neural Information Processing Systems, December 2018.
where µ denotes the usual Lebesgue measure and c̄o the convex closure, i.e., the closure of the convex
hull co. For more details, see Paden & Sastry (1987). We can generalize (25) to the differential
inclusion Bacciotti & Ceragioli (1999)
ẋ(t) ∈ F(x(t)), (27)
where F : Rn ⇒ Rn is some set-valued map.
Definition 1 (Carathéodory/Filippov solutions). We say that x : [0, τ) → Rn with 0 < τ ≤ ∞ is
a Carathéodory solution to (27) if x(·) is absolutely continuous and (27) is satisfied a.e. in every
compact subset of [0, τ). Furthermore, we say that x(·) is a maximal Carathéodory solution if no
other Carathéodory solution x′(·) exists with x = x′|_{[0,τ)}. If F = K[F], then Carathéodory solutions
are referred to as Filippov solutions.
For a comprehensive overview of discontinuous systems, including sufficient conditions for existence
(Proposition 3) and uniqueness (Propositions 4 and 5) of Filippov solutions, see the work of Cortés
(2008). In particular, it can be established that Filippov solutions to (24) exist, provided that the
following assumption (Assumption 1) holds.
Assumption 1 (Existence of Filippov solutions). F : Rn → Rn is defined almost everywhere (a.e.)
and is Lebesgue-measurable in a non-empty open neighborhood U ⊂ Rn of x0 ∈ Rn . Further, F is
locally essentially bounded in U , i.e., for every point x ∈ U , F is bounded a.e. in some bounded
neighborhood of x.
More generally, Carathéodory solutions to (27) exist (now with arbitrary x0 ∈ Rn ), provided that the
following assumption (Assumption 2) holds.
Assumption 2 (Existence of Carathéodory solutions). F : Rn ⇒ Rn has nonempty, compact, and
convex values, and is upper semi-continuous.
Filippov & Arscott (1988) proved that, for the Filippov set-valued map F = K[F ], Assumptions 1
and 2 are equivalent (with arbitrary x0 ∈ Rn in Assumption 1).
Uniqueness of the solution requires further assumptions. Nevertheless, we can characterize the
Filippov set-valued map in a similar manner to Clarke’s generalized gradient, as seen in the following
proposition.
Proposition 1 (Theorem 1 of Paden & Sastry (1987)). Under Assumption 1, we have

K[F](x) = c̄o { lim_{k→∞} F(x_k) : x_k ∈ Rn \ (N_F ∪ S) s.t. x_k → x },   (28)

for some (Lebesgue) zero-measure set N_F ⊂ Rn and any other zero-measure set S ⊂ Rn. In
particular, if F is continuous at a fixed x, then K[F](x) = {F(x)}.
¹¹The notions introduced here are solely needed for rigorously dealing with the singular discontinuity at the
equilibrium point of the q-RGF and q-SGF flows. However, the reader can skip these definitions and still be able
to intuitively follow the proofs of Theorems 1 and 2.
For instance, for the GF (1), we have K[−∇f](x) = {−∇f(x)} for every x ∈ Rn, provided that f
is continuously differentiable. Furthermore, if f is only locally Lipschitz continuous and regular (see
Definition 3 of Appendix B), then K[−∇f](x) = −∂f(x), where

∂f(x) ≜ co { lim_{k→∞} ∇f(x_k) : x_k ∈ Rn \ N_f s.t. x_k → x }   (29)

denotes Clarke's generalized gradient Clarke (1981) of f, with N_f denoting the zero-measure set
over which f is not differentiable (Rademacher's theorem). It can be established that ∂f coincides
with the subgradient of f, provided that f is convex. Therefore, the GF (1), interpreted as a Filippov
differential inclusion, may also be seen as a continuous-time variant of subgradient descent methods.
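For a concrete instance, take f(x) = |x| on R, which is convex (hence locally Lipschitz and regular): (29) yields ∂f(0) = [−1, 1], so the Filippov differential inclusion ẋ ∈ −∂f(x) reduces to ẋ = −sign(x) away from the origin and to ẋ ∈ [−1, 1] at x = 0, exactly the continuous-time counterpart of subgradient descent.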
We will now construct a Lyapunov-based criterion adapted from the literature of finite-time stability
of Lipschitz continuous systems.
Lemma 1. Let E(·) be an absolutely continuous function satisfying the differential inequality

Ė(t) ≤ −c E(t)^α   (30)

a.e. in t ≥ 0, with c, E(0) > 0 and α < 1. Then, there exists some t⋆ > 0 such that E(t) > 0 for
t ∈ [0, t⋆) and E(t⋆) = 0. Furthermore, t⋆ > 0 can be bounded by

t⋆ ≤ E(0)^(1−α) / (c(1 − α)),   (31)

with this bound tight whenever (30) holds with equality. In the same setting, but now with α ≥ 1, we
instead have E(t) > 0 for every t ≥ 0, with lim_{t→∞} E(t) = 0. This will be represented by t⋆ = ∞, with
E(∞) ≜ lim_{t→∞} E(t).
Proof. Suppose that E(t) > 0 for every t ∈ [0, T] with T > 0. Let t⋆ be the supremum of all such
T's, thus satisfying E(t) > 0 for every t ∈ [0, t⋆). We will now investigate E(t⋆). First, by continuity
of E, it follows that E(t⋆) ≥ 0. Now, by rewriting

Ė(t) ≤ −c E(t)^α ⟺ (d/dt) [ E(t)^(1−α) / (1 − α) ] ≤ −c,   (32)

a.e. in t ∈ [0, t⋆), we can thus integrate to obtain

E(t)^(1−α)/(1 − α) − E(0)^(1−α)/(1 − α) ≤ −c t,   (33)

everywhere in t ∈ [0, t⋆), which in turn leads to

E(t) ≤ [E(0)^(1−α) − c(1 − α)t]^(1/(1−α))   (34)

and

t ≤ (E(0)^(1−α) − E(t)^(1−α)) / (c(1 − α)) ≤ E(0)^(1−α) / (c(1 − α)),   (35)
where the last inequality follows from E(t) > 0 for every t ∈ [0, t⋆). Taking the supremum in (35)
then leads to the upper bound (31). Finally, we conclude that E(t⋆) = 0, since E(t⋆) > 0 is
impossible, given that it would mean, due to continuity of E, that there exists some T > t⋆ such that
E(t) > 0 for every t ∈ [0, T], thus contradicting the construction of t⋆.
Finally, notice that if E is such that (30) holds with equality, then (34) and the first inequality
in (35) hold with equality as well. The tightness of the bound (31) thus follows immediately.
Furthermore, notice that if α ≥ 1, and E is a tight solution to the differential inequality (30), i.e.
E(t) = [E(0)^(1−α) − c(1 − α)t]^(1/(1−α)), then clearly E(t) > 0 for every t ≥ 0 and E(t) → 0 as
t → ∞.
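The settling-time bound (31) is easy to check numerically; the following sketch integrates the tight case Ė = −cE^α with forward Euler (the parameter values are our illustrative choices).

c, alpha, E0, dt = 1.0, 0.5, 4.0, 1e-5
t_star_bound = E0 ** (1.0 - alpha) / (c * (1.0 - alpha))  # bound (31): here 4.0

# Forward-Euler integration of the tight case E_dot = -c * E^alpha.
E, t = E0, 0.0
while E > 0.0:
    E -= dt * c * max(E, 0.0) ** alpha
    t += dt
print(t, t_star_bound)  # the hitting time approaches the bound as dt -> 0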
Cortés & Bullo (2005) proposed (Proposition 2.8) a Lyapunov-based criterion to establish finite-
time stability of discontinuous systems, which fundamentally coincides with our Lemma 1 for the
particular choice of exponent α = 0. Their proposition was, however, directly based on Theorem 2
of Paden & Sastry (1987). Later, Cortés (2006) proposed a second-order Lyapunov criterion, which,
on the other hand, fundamentally translates to E(t) ≜ V(x(t)) being strongly convex. Finally, Hui
et al. (2009) generalized Proposition 2.8 of Cortés & Bullo (2005) in their Corollary 3.1, to establish
semistability. Indeed, these two results coincide for isolated equilibria.
We now present a novel result that generalizes the aforementioned first-order Lyapunov-based results,
by exploiting our Lemma 1. More precisely, given a Lyapunov candidate function V(·), the objective
is to set E(t) ≜ V(x(t)), and we aim to check that the conditions of Lemma 1 hold. To do this, and
assuming V to be locally Lipschitz continuous, we first borrow and adapt from Bacciotti & Ceragioli
(1999) the definition of the set-valued time derivative of V : D → R w.r.t. the differential inclusion (27),
given by

V̇(x) ≜ {a ∈ R : ∃v ∈ F(x) s.t. a = p · v, ∀p ∈ ∂V(x)},   (36)

for each x ∈ D. Notice that, under Assumption 2 for Filippov differential inclusions F = K[F],
the set-valued time derivative of V thus coincides with the set-valued Lie derivative L_F V(·).
Indeed, more generally V̇ could be seen as a set-valued Lie derivative L_F V w.r.t. the set-valued map
F.
Definition 3. V(·) is said to be regular if every directional derivative, given by

V′(x; v) ≜ lim_{h→0} (V(x + hv) − V(x)) / h,   (37)

exists and is equal to

V°(x; v) ≜ lim sup_{x′→x, h→0⁺} (V(x′ + hv) − V(x′)) / h,   (38)

known as Clarke's upper generalized derivative Clarke (1981).
In practice, regularity is a fairly mild and easy-to-guarantee condition. For instance, it suffices
that V is convex or continuously differentiable to ensure that it is locally Lipschitz and regular.
Assumption 3. V : D → R is locally Lipschitz continuous and regular, with D ⊆ Rn open.
Lemma 2 (Lemma 1 of Bacciotti & Ceragioli (1999)). Under Assumption 3, given any Carathéodory
solution x : [0, τ) → Rn to (27), the function E(t) ≜ V(x(t)) is absolutely continuous and
Ė(t) = (d/dt)V(x(t)) ∈ V̇(x(t)) a.e. in t ∈ [0, τ).
We are now ready to state and prove our Lyapunov-based sufficient condition for finite-time stability
of differential inclusions.
Theorem 3. Suppose that Assumptions 2 and 3 hold for some set-valued map F : Rn ⇒ Rn and
some function V : D → R, where D ⊆ Rn is an open and positively invariant neighborhood of a
point x⋆ ∈ Rn. Suppose that V is positive definite w.r.t. x⋆ and that there exist constants c > 0 and
α < 1 such that

sup V̇(x) ≤ −c V(x)^α   (41)

a.e. in x ∈ D. Then, (27) is finite-time stable at x⋆, with settling time upper bounded by

t⋆ ≤ V(x_0)^(1−α) / (c(1 − α)),   (42)

where x(0) = x_0. In particular, any Carathéodory solution x(·) with x(0) = x_0 ∈ D will converge
in finite time to x⋆ under the upper bound (42). Furthermore, if D = Rn, then (27) is globally
finite-time stable. Finally, if V̇(x) is a singleton a.e. in x ∈ D and (41) holds with equality, then the
bound (42) is tight.
Proof. Note that, by Proposition 1 of Bacciotti & Ceragioli (1999), we know that (27) is Lyapunov
stable at x⋆. All that remains to be shown is local convergence towards x⋆ (which must be an
equilibrium) in finite time. Indeed, given any maximal solution x : [0, t⋆) → Rn to (27) with
x(0) = x_0 ≠ x⋆, we know by Lemma 2 that E(t) = V(x(t)) is absolutely continuous with
Ė(t) ∈ V̇(x(t)) a.e. in t ∈ [0, t⋆]. Therefore, we have

Ė(t) ≤ sup V̇(x(t)) ≤ −c V(x(t))^α = −c E(t)^α   (43)

a.e. in t ∈ [0, t⋆]. Since E(0) = V(x_0) > 0, given that x_0 ≠ x⋆, the result then follows by invoking
Lemma 1 and noting that E(t⋆) = 0 ⟺ V(x(t⋆)) = 0 ⟺ x(t⋆) = x⋆.
Finite-time stability still follows without Assumption 2, provided that x⋆ is an equilibrium of (27). In
practical terms, this means that trajectories starting arbitrarily close to x⋆ may not actually exist, but
nevertheless there exists a neighborhood D of x⋆ over which any trajectory x(·) that does exist
and starts at x(0) = x_0 ∈ D must converge in finite time to x⋆, with settling time upper bounded by
T(x_0) (the bound still tight in the case that (41) holds with equality).
C PROOF OF THEOREM 1
Let us focus on the q-RGF (2) (the case of q-SGF (3) follows exactly the same steps) with the
candidate Lyapunov function V ≜ f − f⋆. Clearly, V is Lipschitz continuous and regular (given that
it is continuously differentiable). Furthermore, V is positive definite w.r.t. x⋆.
Notice that, due to the gradient dominance assumption, x⋆ must be an isolated stationary point of
f. To see this, notice that, if x⋆ were not an isolated stationary point, then there would have to exist
some x̃⋆ sufficiently near x⋆ such that x̃⋆ is both a stationary point of f and satisfies f(x̃⋆) > f⋆,
since x⋆ is a strict local minimizer of f. But then, we would have

0 = ((p−1)/p) ‖∇f(x̃⋆)‖^(p/(p−1)) ≥ µ^(1/(p−1)) (f(x̃⋆) − f⋆) > 0,   (44)

and subsequently 0 > 0, which is absurd.
Therefore, F(x) ≜ −c∇f(x)/‖∇f(x)‖^((q−2)/(q−1)) is continuous for every x ∈ D \ {x⋆}, for some small
enough open neighborhood D of x⋆. Let us assume that D is positively invariant w.r.t. (2), which can
be achieved, for instance, by replacing D with its intersection with some small enough strict sublevel
set of f. Notice that ‖F(x)‖ = c‖∇f(x)‖^(1/(q−1)) with q ∈ (p, ∞] ⊂ (1, ∞], i.e., 1/(q−1) ∈ [0, ∞). If
q = ∞, which results in the normalized gradient flow ẋ = −∇f(x)/‖∇f(x)‖ proposed by Cortés (2006),
then ‖F(x)‖ = c > 0. We can thus show that F(x) is discontinuous at x = x⋆ for q = ∞. On
the other hand, if q ∈ (p, ∞) ⊂ (1, ∞), then we have ‖F(x)‖ → 0 as x → x⋆, and thus F(x) is
continuous (but not Lipschitz) at x = x⋆. Regardless, we may freely focus exclusively on D \ {x⋆},
since {x⋆} is obviously a zero-measure set.
Let F ≜ K[F]. We thus have, for each x ∈ D \ {x⋆},

sup V̇(x) = sup {a ∈ R : ∃v ∈ F(x) s.t. a = p · v, ∀p ∈ ∂V(x)}   (45a)
         = sup {∇V(x) · v : v ∈ F(x)}   (45b)
         = ∇V(x) · F(x)   (45c)
         = −c‖∇f(x)‖^(2 − (q−2)/(q−1))   (45d)
         = −c‖∇f(x)‖^(1/θ′)   (45e)
         ≤ −c [C(f(x) − f⋆)^θ]^(1/θ′)   (45f)
         = −c C^(1/θ′) V(x)^(θ/θ′),   (45g)

where (45e) uses 2 − (q−2)/(q−1) = q/(q−1) = 1/θ′. Since θ/θ′ < 1, given that s ∈ (1, ∞) ↦ (s−1)/s is
strictly increasing, the conditions of Theorem 3 are satisfied (with exponent α = θ/θ′ and constant
cC^(1/θ′)). In particular, we have finite-time stability at x⋆ with a settling time t⋆ upper bounded by
t⋆ ≤ V(x_0)^(1−θ/θ′) / (cC^(1/θ′)(1 − θ/θ′)), which, combined with ‖∇f(x_0)‖ ≥ C V(x_0)^θ, yields the bound (16).
D PROOF OF THEOREM 2
To prove Theorem 2, we borrow some tools and results from hybrid control systems theory. Hybrid
control systems are characterized by continuous flows with discrete jumps between the continuous
flows. They are often modeled by differential inclusions combined with discrete mappings that model
the jumps between the differential inclusions. We see the case of the optimization flows proposed
here as a simple case of a hybrid system with one differential inclusion, with a possible jump or
discontinuity at the optimum. Based on this, we will use the tools and results of Sanfelice & Teel
(2010), which study how a certain class of hybrid systems behaves after discretization with a certain
class of discretization algorithms. In other words, Sanfelice & Teel (2010) quantify, under some
conditions, how close the solutions of the discretized hybrid dynamical system are to the solutions of
the original hybrid system.
In this section we will denote the differential inclusion of the continuous optimization flow by
F : Rn ⇒ Rn , and its discretization in time by Fd : Rn ⇒ Rn . We first recall a definition, which
we will adapt from the general case of jumps between multiple differential inclusions (Definition 3.2,
Sanfelice & Teel (2010)) to our case of one differential inclusion or flow.
Definition 4 ((T, ε)-closeness). Given T > 0, ε > 0, η > 0, two solutions x_t : [0, T] → Rn and
x_k : {0, 1, 2, ...} → Rn are (T, ε)-close if:
(a) for all t ≤ T there exists k ∈ {1, 2, ...} such that |t − kη| < ε and ‖x_t(t) − x_k(k)‖ < ε;
(b) for all k ∈ {1, 2, ...} there exists t ≤ T such that |t − kη| < ε and ‖x_t(t) − x_k(k)‖ < ε.
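For sampled trajectories, Definition 4 can be checked mechanically; below is a hedged sketch (the trajectory representations, a callable for x_t and an array for x_k, are our own choices).

import numpy as np

def is_T_eps_close(x_t, x_k, T, eps, eta):
    # Check (T, eps)-closeness (Definition 4) of a continuous solution x_t
    # (callable on [0, T]) against a discrete solution x_k (array of iterates),
    # testing condition (a) on a fine time grid and condition (b) per index.
    ks = range(1, len(x_k))
    for t in np.linspace(0.0, T, 1000):   # (a): each t <= T has a close-by k
        if not any(abs(t - k * eta) < eps and
                   np.linalg.norm(x_t(t) - x_k[k]) < eps for k in ks):
            return False
    for k in ks:                          # (b): each k has a close-by t <= T
        ts = np.linspace(max(0.0, k * eta - eps), min(T, k * eta + eps), 100)
        if not any(np.linalg.norm(x_t(t) - x_k[k]) < eps for t in ts):
            return False
    return True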
Next, we will recall Theorem 5.2 in Sanfelice & Teel (2010), while adapting it to our special case of
a hybrid system with one differential inclusion¹².
Theorem 4. (Closeness of continuous and discrete solutions on compact domains) Consider the
differential inclusion
Ẋ(t) ∈ F(X(t)), (47)
¹²A set-valued mapping F : Rn ⇒ Rn is outer semicontinuous if for each sequence {x_i}_{i=1}^∞ converging
to a point x ∈ Rn and each sequence y_i ∈ F(x_i) converging to a point y, it holds that y ∈ F(x). It
is locally bounded if, for each x ∈ Rn, there exist compact sets K, K′ ⊂ Rn such that x ∈ K and
F(K) ≜ ∪_{x∈K} F(x) ⊂ K′. In what follows, we use the following notation: given a set A, con A denotes its
convex hull, and B denotes the closed unit ball in a Euclidean space.
To prove Theorem 2 we will use the results of Theorem 4, where we will have to check that condition
(48) is satisfied for the three proposed discretizations.
We are now ready to prove Theorem 2. First, note that outer semicontinuity follows from the upper
semicontinuity and the closedness of the Filippov differential inclusion map. Furthermore, local
boundedness follows from continuity everywhere outside stationary points, which are isolated.
Now, let us examine their discretization by the three proposed algorithms:
we then compare the solution of the extended continuous-time system (52) with the extended discrete-
time system

x̃_{k+1} = [x_{k+1}; x_{k+1} − x_k] = [x_k + ηF((1+µ)x_k − µx_{k−1}) + µ(x_k − x_{k−1}); ηF((1+µ)x_k − µx_{k−1}) + µ(x_k − x_{k−1})],   (53)
Note that the description of the coefficients for each of the prior-art methods can be found at:
https://pytorch.org/docs/stable/optim.html.
Due to space limitations, we decided to report in the main paper only one test for q-RGF with
Euler discretization, one test for q-SGF with Nesterov-like discretization, and one test for q-RGF
with Runge-Kutta discretization. For the sake of completeness, we report here the remaining tests for
each algorithm. One can observe similar qualitative behavior in Figure 6 to that noticed in the
results of Section 4.5.
The step-size and other hyper-parameters for each test are given below:
[Figure 6 plots f(x_k) − f⋆ versus iteration k: the Euler-discretized q-SGF (q ∈ {2.1, 2.5, 2.8, 100}) against GD with fixed step size (left), and the Nesterov-like-discretized q-RGF (q ∈ {2.2, 3, 6, 10}) against GD with Nesterov acceleration (right), all with fixed steps.]
Figure 6: Example of the proposed discretization algorithms of finite-time q-RGF and q-SGF
In this experiment we optimize the CNN described by the PyTorch code sequence in MODEL 1,
with a negative log-likelihood loss, on the MNIST dataset.
We use 10 epochs of training, with 60 batches of 1000 images for training and 10 batches of 1000
images for testing. We tested Algorithm 1 RGF (c = 1, q = 3, η = 0.06, µ = 0.9) and Algorithm
2 SGF (c = 0.001, q = 2.1, η = 0.06, µ = 0.9) against Nesterov's accelerated gradient descent
(GD) (η = 0.06, µ = 0.9, Nesterov=True) and Adam (η = 0.004, remaining coefficients set to the
nominal values of torch.optim). We also tested other algorithms such as RMSprop, AdaGrad, and
AdaDelta, but since their convergence performance, as well as their test performance, were not very
competitive w.r.t. GD on this experiment, we decided not to report them here, to avoid overloading
the graphs. In Figures 7 and 8 we can see the training loss over the training iterations, where GD,
RGF, and SGF perform better than Adam in terms of convergence speed (a 20 s lead on average);
test accuracies are 98% for Adam, 98% for GD, and 99% for both RGF and SGF. RGF and SGF
perform slightly better than GD in terms of convergence speed. The gain is relatively small (5 s to
10 s on average), which is expected for such a small DNN (please refer to the VGG16-CIFAR10 and
VGG16-SVHN test results for larger computation-time gains). We also notice, in Figure 7, that all
algorithms behave well in terms of avoiding overfitting the model.
Figure 7: Losses for several optimization algorithms- CNN- MNIST: Train loss (left), test loss (right)
Figure 8: Training loss vs. computation time for several optimization algorithms- MNIST
MODEL 1: MNIST-CNN

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
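The excerpt lists only the layers; a forward pass consistent with the 320-dimensional flatten (20 channels × 4 × 4 after two conv/pool stages on 28 × 28 inputs) and the negative log-likelihood loss could look as follows — a hedged sketch, not necessarily the exact code used in our experiments.

import torch.nn.functional as F

# Hypothetical forward pass for MODEL 1, consistent with the layer sizes above:
def forward(self, x):
    x = F.relu(F.max_pool2d(self.conv1(x), 2))                   # 1x28x28 -> 10x12x12
    x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))  # -> 20x4x4
    x = x.view(-1, 320)                                          # flatten to 320
    x = F.relu(self.fc1(x))
    x = F.dropout(x, training=self.training)
    return F.log_softmax(self.fc2(x), dim=1)  # pairs with nn.NLLLoss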
Figure 9: Losses for several optimization algorithms- SVHN: Train loss (left), test loss (right)
Figure 10: Training loss vs. computation time for several optimization algorithms- VGG-16- SVHN
gradient descent (GD) and Adam (same optimal tuning as in Section 4.2). All algorithms have been implemented
in their stochastic version.
In Figures 11 and 12 we can see that both algorithms, Runge-Kutta q-RGF and Runge-Kutta q-SGF,
converge faster (a 40 min lead on average) than SGD and Adam in these tests.
Figure 11: Losses for several optimization algorithms- SVHN: Train loss (left), test loss (right)
Figure 12: Training loss vs. computation time for several optimization algorithms- VGG-16- SVHN