0% found this document useful (0 votes)
31 views

First-Order Optimization Algorithms Via Inertial Systems With Hessian Driven Damping

Uploaded by

vipin
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views

First-Order Optimization Algorithms Via Inertial Systems With Hessian Driven Damping

Uploaded by

vipin
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 43

Mathematical Programming (2022) 193:113–155

https://doi.org/10.1007/s10107-020-01591-1

FULL LENGTH PAPER


Series A

First-order optimization algorithms via inertial systems


with Hessian driven damping

Hedy Attouch1 · Zaki Chbani2 · Jalal Fadili3 · Hassan Riahi2

Received: 24 July 2019 / Accepted: 3 November 2020 / Published online: 16 November 2020
© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2020

Abstract
In a Hilbert space setting, for convex optimization, we analyze the convergence rate of
a class of first-order algorithms involving inertial features. They can be interpreted as
discrete time versions of inertial dynamics involving both viscous and Hessian-driven
dampings. The geometrical damping driven by the Hessian intervenes in the dynamics
in the form ∇ 2 f (x(t))ẋ(t). By treating this term as the time derivative of ∇ f (x(t)),
this gives, in discretized form, first-order algorithms in time and space. In addition to
the convergence properties attached to Nesterov-type accelerated gradient methods,
the algorithms thus obtained are new and show a rapid convergence towards zero of
the gradients. On the basis of a regularization technique using the Moreau envelope,
we extend these methods to non-smooth convex functions with extended real values.
The introduction of time scale factors makes it possible to further accelerate these
algorithms. We also report numerical results on structured problems to support our
theoretical findings.

Keywords Hessian driven damping · Inertial optimization algorithms · Nesterov


accelerated gradient method · Ravine method · Time rescaling

B Jalal Fadili
[email protected]
Hedy Attouch
[email protected]
Zaki Chbani
[email protected]
Hassan Riahi
[email protected]

1 IMAG, Univ. Montpellier, CNRS, Montpellier, France


2 Faculty of Sciences Semlalia, Mathematics, Cadi Ayyad University, 40000 Marrakech, Morocco
3 Normandie Université-ENSICAEN, CNRS, GREYC, Caen, France

123
114 H. Attouch et al.

Mathematics Subject Classification 37N40 · 46N10 · 49M30 · 65B99 · 65K05 ·


65K10 · 90B50 · 90C25

1 Introduction

Unless specified, throughout the paper we make the following assumptions




⎨H is a real Hilbert space;
f : H → R is a convex function of class C 2 , S := argminH f = ∅; (H)


γ , β, b : [t0 , +∞[→ R+ are non-negative continuous functions, t0 > 0.

As a guide in our study, we will rely on the asymptotic behavior, when t → +∞,
of the trajectories of the inertial system with Hessian-driven damping

ẍ(t) + γ (t)ẋ(t) + β(t)∇ 2 f (x(t))ẋ(t) + b(t)∇ f (x(t)) = 0,

γ (t) and β(t) are damping parameters, and b(t) is a time scale parameter.
The time discretization of this system will provide a rich family of first-order
methods for minimizing f . At first glance, the presence of the Hessian may seem to
entail numerical difficulties. However, this is not the case as the Hessian intervenes in
the above ODE in the form ∇ 2 f (x(t))ẋ(t), which is nothing but the derivative w.r.t.
time of ∇ f (x(t)). This explains why the time discretization of this dynamic provides
first-order algorithms. Thus, the Nesterov extrapolation scheme [25,26] is modified
by the introduction of the difference of the gradients at consecutive iterates. This gives
algorithms of the form

yk = xk + αk (xk − xk−1 ) − βk (∇ f (xk ) − ∇ f (xk−1 ))
xk+1 = T (yk ),

where T , to be specified later, is an operator involving the gradient or the proximal


operator of f .
Coming back to the continuous dynamic, we will pay particular attention to the
following two cases, specifically adapted to the properties of f :
• For a general convex function f , taking γ (t) = αt , gives

α
(DIN-AVD)α,β,b ẍ(t) + ẋ(t) + β(t)∇ 2 f (x(t))ẋ(t) + b(t)∇ f (x(t)) = 0.
t
In the case β ≡ 0, α = 3, b(t) ≡ 1, it can be interpreted as a continuous version
of the Nesterov accelerated
  gradient method [31]. According to this, in this case,
we will obtain O t −2 convergence rates for the objective values.
• For a μ-strongly convex function f , we will rely on the autonomous inertial system
with Hessian driven damping

(DIN)2√μ,β ẍ(t) + 2 μẋ(t) + β∇ 2 f (x(t))ẋ(t) + ∇ f (x(t)) = 0,

123
First-order optimization algorithms via inertial systems… 115

and show exponential (linear) convergence rate for both objective values and gra-
dients.
For an appropriate setting of the parameters, the time discretization of these dynamics
provides first-order algorithms with fast convergence properties. Notably, we will show
a rapid convergence towards zero of the gradients.

1.1 A historical perspective

B. Polyak initiated the use of inertial dynamics to accelerate the gradient method in
optimization. In [27,28], based on the inertial system with a fixed viscous damping
coefficient γ > 0

(HBF) ẍ(t) + γ ẋ(t) + ∇ f (x(t)) = 0,

he introduced the Heavy Ball with Friction method. For a strongly convex function
f , (HBF) provides convergence at exponential rate of f (x(t)) to minH f . For gen-
eral convex functions, the asymptotic convergence rate of (HBF) is O( 1t ) (in the
worst case). This is however not better than the steepest descent. A decisive step to
improve (HBF) was taken by Alvarez–Attouch–Bolte–Redont [2] by introducing the
Hessian-driven damping term β∇ 2 f (x(t))ẋ(t), that is (DIN)0,β . The next important
step was accomplished by Su–Boyd–Candès [31] with the introduction of a vanishing
viscous damping coefficient γ (t) = αt , that is (AVD)α (see Sect. 1.1.2). The system
(DIN-AVD)α,β,1 (see Sect. 2) has emerged as a combination of (DIN)0,β and (AVD)α .
Let us review some basic facts concerning these systems.

1.1.1 The (DIN),ˇ dynamic

The inertial system

(DIN)γ ,β ẍ(t) + γ ẋ(t) + β∇ 2 f (x(t))ẋ(t) + ∇ f (x(t)) = 0,

was introduced in [2]. In line with (HBF), it contains a fixed positive friction coefficient
γ . The introduction of the Hessian-driven damping makes it possible to neutralize the
transversal oscillations likely to occur with (HBF), as observed in [2] in the case of
the Rosenbrook function. The need to take a geometric damping adapted to f had
already been observed by Alvarez [1] who considered

ẍ(t) + Γ ẋ(t) + ∇ f (x(t)) = 0,

where Γ : H → H is a linear positive anisotropic operator. But still this damping oper-
ator is fixed. For a general convex function, the Hessian-driven damping in (DIN)γ ,β
performs a similar operation in a closed-loop adaptive way. The terminology (DIN)
stands shortly for Dynamical Inertial Newton. It refers to the natural link between this
dynamic and the continuous Newton method.

123
116 H. Attouch et al.

1.1.2 The (AVD)˛ dynamic

The inertial system

α
(AVD)α ẍ(t) + ẋ(t) + ∇ f (x(t)) = 0,
t

was introduced in the context of convex optimization in [31]. For general convex func-
tions it provides a continuous version of the accelerated gradient method of Nesterov.
For α ≥ 3, each trajectory x(·) of (AVD)
 α satisfies the asymptotic rate of convergence
of the values f (x(t))−inf H f = O 1/t 2 . As a specific feature, the viscous damping
coefficient αt vanishes (tends to zero) as time t goes to infinity, hence the terminology.
The convergence properties of the dynamic (AVD)α have been the subject of many
recent studies, see [3–6,8–10,14,15,24,31]. They helped to explain why αt is a wise
choise of the damping coefficient.
In [20], the authors showed that a vanishing damping coefficient γ (·) dissipates
the energy, and hence makes the dynamic interesting for optimization, as long as
+∞
t0 γ (t)dt = +∞. The damping coefficient can go to zero asymptotically but not
too fast. The smallest which is admissible is of order 1t . It enforces the inertial effect
with respect to the friction effect.
The tuning of the parameter α in front of 1t comes from the Lyapunov analysis and
the optimality of the convergence rates obtained. The case α = 3, which corresponds
to Nesterov’s historical algorithm, is critical. In the case α = 3, the question of the
convergence of the trajectories remains an open problem (except in one dimension
where convergence holds [9]). As a remarkable property, for α > 3, it has been shown
by Attouch–Chbani–Peypouquet–Redont [8] that each trajectory converges weakly to
a minimizer. The corresponding algorithmic result has been obtained by Chambolle–
Dossal [21]. For α > 3, it is shown in [10,24] that the asymptotic convergence rate
of the values is actually o(1/t 2 ). The subcritical case α ≤ 3 has been examined by
Apidopoulos–Aujol–Dossal [3] and Attouch–Chbani–Riahi [9], with the convergence

rate of the objective values O t − 3 . These rates are optimal, that is, they can be
reached, or approached arbitrarily close:

• α ≥ 3: the optimal rate O t −2 is achieved by taking f (x) = x r with r → +∞


( f become very flat around its minimum), see [8].

• α < 3: the optimal rate O t − 3 is achieved by taking f (x) = x , see [3].

The inertial system with a general damping coefficient γ (·) was recently studied by
Attouch–Cabot in [4,5], and Attouch–Cabot–Chbani–Riahi in [6].

1.1.3 The (DIN-AVD)˛,ˇ dynamic

The inertial system

α
(DIN-AVD)α,β ẍ(t) + ẋ(t) + β∇ 2 f (x(t))ẋ(t) + ∇ f (x(t)) = 0,
t

123
First-order optimization algorithms via inertial systems… 117

was introduced in [11]. It combines the two types of damping considered above.
Its formulation looks at a first glance more complicated than (AVD)α . Attouch-
Peypouquet-Redont [12] showed that (DIN-AVD)α,β is equivalent to the first-order
system in time and space


⎨ ẋ(t) + β∇ f (x(t)) − 1
− α
x(t) + 1
y(t) = 0;
β t β
⎩ ẏ(t) − 1
− α
+ αβ
x(t) + 1
y(t) = 0.
β t t2 β

This provides a natural extension to f : H → R∪{+∞} proper lower semicontinuous


and convex, just replacing the gradient by the subdifferential.
To get better insight, let us compare the two dynamics (AVD)α and (DIN-AVD)α,β
on a simple quadratic minimization problem, in which case the trajectories can be
computed in closed form as explained in Appendix A.3. Take H = R2 and f (x1 , x2 ) =
2 (x 1 + 1000x 2 ), which is ill-conditioned. We take parameters α = 3.1, β = 1, so
1 2 2

as to obey the condition α > 3. Starting with initial conditions: (x1 (1), x2 (1)) =
(1, 1), (ẋ1 (1), ẋ2 (1)) = (0, 0), we have the trajectories displayed in Fig. 1. This
illustrates the typical situation of an ill-conditioned minimization problem, where the
wild oscillations of (AVD)α are neutralized by the Hessian damping in (DIN-AVD)α,β
(see Appendix A.3 for further details).

1.2 Main algorithmic results

Let us describe our main convergence rates for the gradient type algorithms. Corre-
sponding results for the proximal algorithms are also obtained.

Fig. 1 Evolution of the objective (left) and trajectories (right) for (AVD)α (α = 3.1) and (DIN-AVD)α,β
(α = 3.1, β = 1) on an ill-conditioned quadratic problem in R2

123
118 H. Attouch et al.

General convex function Let f : H → R be a convex function whose gradient is L-


Lipschitz continuous. Based on the discretization of (DIN-AVD)α,β,1+ β , we consider
t

   √ √
β s
yk = xk + 1 − αk (xk − xk−1 ) − β s (∇ f (xk ) − ∇ f (xk−1 )) − k ∇ f (xk−1 )
xk+1 = yk − s∇ f (yk ).


Suppose that α ≥ 3, 0 < β < 2 s, s L ≤ 1. In Theorem 6, we show that

1
(i) f (xk ) − min f = O 2 as k → +∞;
H k
 
(ii) k 2 ∇ f (yk ) 2 < +∞ and k 2 ∇ f (xk ) 2
< +∞.
k k

Strongly convex function When f : H → R is μ-strongly convex for some μ > 0,



our analysis relies on the autonomous dynamic (DIN)γ ,β with γ = 2 μ. Based on
its time discretization, we obtain linear convergence results for the values (hence the
trajectory) and the gradients terms. Explicit discretization gives the inertial gradient
algorithm

√ √
1 − μs β s
xk+1 = xk + √ (xk − xk−1 ) − √ (∇ f (xk ) − ∇ f (xk−1 ))
1 + μs 1 + μs
s
− √ ∇ f (xk ).
1 + μs

1
Assuming that ∇ f is L-Lipschitz continuous, L sufficiently small and β ≤ √ , it is
μ
1
shown in Theorem 11 that, with q = √ ( 0 < q < 1)
1 + 21 μs

 
f (xk ) − min f = O q k and xk − x   = O q k/2 as k → +∞,
H

Moreover, the gradients converge exponentially fast to zero.

1.3 Contents

The paper is organized as follows. Sections 2 and 3 deal with the case of general
convex functions, respectively in the continuous case and the algorithmic cases. We
improve the Nesterov convergence rates by showing in addition fast convergence of
the gradients. Sections 4 and 5 deal with the same questions in the case of strongly
convex functions, in which case, linear convergence results are obtained. Section 6 is
devoted to numerical illustrations. We conclude with some perspectives.

123
First-order optimization algorithms via inertial systems… 119

2 Inertial dynamics for general convex functions

Our analysis deals with the inertial system with Hessian-driven damping
α
(DIN-AVD)α,β,b ẍ(t) + ẋ(t) + β(t)∇ 2 f (x(t))ẋ(t) + b(t)∇ f (x(t)) = 0.
t

2.1 Convergence rates

We start by stating a fairly general theorem on the convergence rates and integrability
properties of (DIN-AVD)α,β,b under appropriate conditions on the parameter functions
β(t) and b(t). As we will discuss shortly, it turns out that for some specific choices of
the parameters, one can recover most of the related results existing in the literature.
The following quantities play a central role in our analysis:

β(t)
w(t) := b(t) − β̇(t) − and δ(t) := t 2 w(t). (1)
t

Theorem 1 Consider (DIN-AVD)α,β,b , where (H) holds. Take α ≥ 1. Let x :


[t0 , +∞[→ H be a solution trajectory of (DIN-AVD)α,β,b . Suppose that the following
growth conditions are satisfied:

β(t)
(G2 ) b(t) > β̇(t) + ;
t
(G3 ) t ẇ(t) ≤ (α − 3)w(t).

Then, w(t) is positive and

1
(i) f (x(t)) − min f = O 2 as t → +∞;
H t w(t)
 +∞
(ii) t 2 β(t)w(t) ∇ f (x(t)) 2 dt < +∞;
t0
 +∞
(iii) t (α − 3)w(t) − t ẇ(t) ( f (x(t)) − min f )dt < +∞.
t0 H

Proof Given x  ∈ argminH f , define for t ≥ t0

1
E(t) := δ(t)( f (x(t)) − f (x  )) + v(t) 2
, (2)
2

where v(t) := (α − 1)(x(t) − x  ) + t (ẋ(t) + β(t)∇ f (x(t)).


The function E(·) will serve as a Lyapunov function. Differentiating E gives

d
E(t) = δ̇(t)( f (x(t)) − f (x  )) + δ(t)∇ f (x(t)), ẋ(t) + v(t), v̇(t). (3)
dt

123
120 H. Attouch et al.

Using equation (DIN-AVD)α,β,b , we have


 
v̇(t) = α ẋ(t) + β(t)∇ f (x(t)) + t ẍ(t) + β̇(t)∇ f (x(t)) + β(t)∇ 2 f (x(t))ẋ(t)
 α 
= α ẋ(t) + β(t)∇ f (x(t)) + t − ẋ(t) + (β̇(t) − b(t))∇ f (x(t))
t
 β(t) 
= t β̇(t) + − b(t) ∇ f (x(t)).
t
Hence,

β(t)
v(t), v̇(t) = (α − 1)t β̇(t) + − b(t) ∇ f (x(t)), x(t) − x  
t
β(t)
+ t 2 β̇(t) + − b(t) ∇ f (x(t)), ẋ(t)
t
β(t)
+ t 2 β(t) β̇(t) + − b(t) ∇ f (x(t)) 2 .
t

Let us go back to (3). According to the choice of δ(t), the terms ∇ f (x(t)), ẋ(t)
cancel, which gives

d (α − 1)
E(t) = δ̇(t)( f (x(t)) − f (x  )) + δ(t)∇ f (x(t)), x  − x(t)
dt t
−β(t)δ(t) ∇ f (x(t)) 2 .

Condition (G2 ) gives δ(t) > 0. Combining this equation with convexity of f ,

f (x  ) − f (x(t)) ≥ ∇ f (x(t)), x  − x(t),

we obtain the inequality

d  (α − 1) 
E(t) + β(t)δ(t) ∇ f (x(t)) 2
+ δ(t) − δ̇(t) ( f (x(t)) − f (x  )) ≤ 0.
dt t
(4)

Then note that


(α − 1)
δ(t) − δ̇(t) = t (α − 3)w(t) − t ẇ(t) . (5)
t

Hence, condition (G3 ) writes equivalently

(α − 1)
δ(t) − δ̇(t) ≥ 0, (6)
t
d
which, by (4), gives E(t) ≤ 0. Therefore, E(·) is non-increasing, and hence E(t) ≤
dt
E(t0 ). Since all the terms that enter E(·) are nonnegative, we obtain (i). Then, by

123
First-order optimization algorithms via inertial systems… 121

integrating (4) we get


 +∞
β(t)δ(t) ∇ f (x(t)) 2
dt ≤ E(t0 ) < +∞,
t0

and
 +∞
t (α − 3)w(t) − t ẇ(t) ( f (x(t)) − f (x  ))dt ≤ E(t0 ) < +∞,
t0

which gives (ii) and (iii), and completes the proof. 




2.2 Particular cases

As anticipated above, by specializing the functions β(t) and b(t), we recover most
known results in the literature; see hereafter for each specific case and related literature.
For all these cases, we will argue also on the interest of our generalization.
Case 1 The (DIN-AVD)α,β system corresponds to β(t) ≡ β and b(t) ≡ 1. In this
case, w(t) = 1 − βt . Conditions (G2 ) and (G3 ) are satisfied by taking α > 3 and
t > α−2
α−3 β. Hence, as a consequence of Theorem 1, we obtain the following result of
Attouch–Peypouquet–Redont [12]:

Theorem 2 [12] Let x : [t0 , +∞[→ H be a trajectory of the dynamical system


(DIN-AVD)α,β . Suppose α > 3. Then
 ∞
1
f (x(t)) − min f = O and t 2 ∇ f (x(t)) 2 dt < +∞.
H t2 t0

Case 2 The system(DIN-AVD)α,β,1+ β , which corresponds to β(t) ≡ β and b(t) =


t
1 + βt , was considered in [30]. Compared to (DIN-AVD)α,β it has the additional
coefficient βt in front of the gradient term. This vanishing coefficient will facilitate
the computational aspects while keeping the structure of the dynamic. Observe that
in this case, w(t) ≡ 1. Conditions (G2 ) and (G3 ) boil down to α ≥ 3. Hence, as a
consequence of Theorem 1, we obtain

Theorem 3 Let x : [t0 , +∞[→ H be a solution trajectory of the dynamical system


(DIN-AVD)α,β,1+ β . Suppose α ≥ 3. Then
t

 ∞
1
f (x(t)) − min f = O 2 and t 2 ∇ f (x(t)) 2 dt < +∞.
H t t0

Case 3 The dynamical system (DIN-AVD)α,0,b , which corresponds to β(t) ≡ 0, was


considered by Attouch–Chbani–Riahi in [7]. It comes also naturally from the time

123
122 H. Attouch et al.

scaling of (AVD)α . In this case, we have w(t) = b(t). Condition (G2 ) is equivalent to
b(t) > 0. (G3 ) becomes

t ḃ(t) ≤ (α − 3)b(t),

which is precisely the condition introduced in [7, Theorem 8.1]. Under this condition,
we have the convergence rate

1
f (x(t)) − min f = O as t → +∞.
H t 2 b(t)

This makes clear the acceleration effect due to the time scaling. For b(t) = t r , we
1
have f (x(t)) − minH f = O 2+r , under the assumption α ≥ 3 + r .
t
Case 4 Let us illustrate our results in the case b(t) = ct b , β(t) = t β . We have
w(t) = ct b − (β + 1)t β−1 , w  (t) = cbt b−1 − (β 2 − 1)t β−2 . The conditions (G2 ), (G3 )
can be written respectively as:

ct b > (β + 1)t β−1 and c(b − α + 3)t b ≤ (β + 1)(β − α + 2)t β−1 . (7)

When b = β − 1, the conditions (7) are equivalent to β < c − 1 and β ≤ α − 2,


1
which gives the convergence rate f (x(t)) − minH f = O β+1 .
t
Discussion Let us first apply the above choices of (α, β(t), b(t)) for each case to the
quadratic function f : (x1 , x2 ) ∈ R2 → (x1 + x2 )2 /2. f is convex but not strongly
so, and argminR2 f = {(x1 , x2 ) ∈ R2 : x2 = −x1 }. The closed-form solution of
(DIN-AVD)α,β,b with each choice of β(t) and b(t) is given in Appendix A.3. For all
cases, we set α = 5. For case 1, we set β = b = 1. For case 2, we take β = 1. As
for case 3, we set r = 2. For case 4, we choose β = 3, b = β − 1 = 2 and c = 5 in
order to satisfy condition (7). The left panel of Fig. 2 depicts the convergence
 profile
 
of the function value as well as the predicted convergence rates O 1/t 2 and O 1/t 4
(the latter is for cases with time (re)scaling). The right panel of Fig. 2 displays the
associated trajectories for the different scenarios of the parameters.
The rates one can achieve in our Theorem 1 look similar to those in Theorem 2 and
Theorem 3. Thus one may wonder whether our framework allowing for more general
variable parameters is necessary. The answer is affirmative for several reasons. First,
our framework can be seen as a one-stop shop allowing for a unified analysis with an
unprecedented level of generality. It also handles time (re)scaling straightforwardly by
appropriately setting the functions β(t) and b(t) (see Case 3 and 4 above). In addition,
though these convergence rates appear similar, one has to keep in mind that these are
upper-bounds. It turns out from our detailed example in the quadratic case introduced
above in Fig. 2, that not only the oscillations are reduced due to the presence of Hessian
damping, but also the trajectory and the objective can be made much less oscillatory
thanks to the flexible choice of the parameters allowed by our framework. This is yet
again another evidence of the interest of our setting.

123
First-order optimization algorithms via inertial systems… 123

Fig. 2 Convergence of the objective values and trajectories associated with the system (DIN-AVD)α,β,b
for different choices of β(t) and b(t)

3 Inertial algorithms for general convex functions

3.1 Proximal algorithms

3.1.1 Smooth case

Writing the term ∇ 2 f (x(t))ẋ(t) in (DIN-AVD)α,β,b as the time derivative of


∇ f (x(t)), and taking the implicit time discretization of this system, with step size
h > 0, gives

xk+1 − 2xk + xk−1 α xk+1 − xk βk


+ + (∇ f (xk+1 ) − ∇ f (xk )) + bk ∇ f (xk+1 ) = 0.
h2 kh h h

Equivalently

k(xk+1 − 2xk + xk−1 ) + α(xk+1 − xk ) + βk hk(∇ f (xk+1 ) − ∇ f (xk ))


+bk h 2 k∇ f (xk+1 ) = 0. (8)

Observe that this requires f to be only of class C 1 . Set now s = h 2 . We obtain the
following algorithm with βk and bk varying with k:

(IPAHD): Inertial Proximal Algorithm with Hessian Damping.


Step k : Set μk := k+α
k
(βk s + sbk ).
 √
α α
yk = xk + 1 − k+α (xk − xk−1 ) + βk s 1 − ∇ f (xk )
(IPAHD) k+α
xk+1 = proxμk f (yk ).

123
124 H. Attouch et al.

Theorem 4 Assume that f : H → R is a convex C 1 function. Suppose that α ≥ 1. Set

δk := h bk hk − βk+1 − k(βk+1 − βk ) (k + 1), (9)

and suppose that the following growth conditions are satisfied:

(G2dis ) bk hk − βk+1 − k(βk+1 − βk ) > 0;


δk
(G3dis ) δk+1 − δk ≤ (α − 1) .
k+1

Then, δk is positive and, for any sequence (xk )k∈N generated by (IPAHD)
 
1 1
(i) f (xk ) − min f = O =O  
H δk k(k + 1) bk h − βk+1
k − (βk+1 − βk )

(ii) δk βk+1 ∇ f (xk+1 ) 2 < +∞.
k

Before delving into the proof, the following remarks on the choice/growth of the
parameters are in order.
Remark 1 We first observe that condition (G2dis ) is nothing but a forward (explicit)
discretization of its continuous analogue (G2 ). In addition, in view of (1), (G3 ) equiv-
alently reads

t δ̇(t) ≤ (α − 1)δ(t).

In turn, (9) and (G3dis ) are explicit discretizations of (1) and (G3 ) respectively.

Remark 2 The convergence rate on the objective values in Theorem 4(i) is


O (1/((k + 1)k) with the proviso that

βk+1
inf (bk h − − (βk+1 − βk )) > 0, (10)
k k

which in turn implies (G2dis ). If, in addition to 


(10), we also have inf k βk > 0, then the
summability property in Theorem 4(ii) reads k k(k + 1) ∇ f (xk+1 ) 2 < +∞. For
instance, if βk is non-increasing and bk ≥ c + βkh k+1
, c > 0, then (10) is in force with
c as a lower-bound on the infimum. In summary, we get O (1/((k + 1)k) under fairly
general assumptions on the growth of the sequences (βk )k∈N and (bk )k∈N .
Let us now exemplify choices of βk and bk that have the appropriate growth as
above and comply with (10) (hence (G2dis )) as well as (G3dis ).
• Let us take βk = β > 0 and bk = 1, which is the discrete analogue of the
continuous case 1 considered in Sect. 2.2 (recall that the continuous version was
analyzed in [12]). Note however that [12] did not study the discrete (algorithmic)
case and thus our result is new even for this system. In such a case, δk = h 2 (k +

123
First-order optimization algorithms via inertial systems… 125

1)(k − β/h) and βk is obviously non-increasing. Thus, if α > 3, then one easily
β
checks that (10) (hence (G2dis )) and (G3dis ) are in force for all k ≥ α−2
α−3 h + α−3 .
2

• Consider now the discrete counterpart of case 2 in Sect. 2.2. Take βk = β > 0
and bk = 1 + β/(hk)1 . Thus δk = h 2 (k + 1)k. This case was studied in [30] both
in the continuous setting and for the gradient algorithm, but not for the proximal
algorithm. This choice is a special case of the one discussed above since βk is the
constant sequence and c = 1. Thus (10) (hence (G2dis )) holds. (G3dis ) is also verified
for all k ≥ α−3
2
as soon as α > 3.
Proof Given x  ∈ argminH f , set

1
E k := δk ( f (xk ) − f (x  )) + vk 2
,
2
where

vk := (α − 1)(xk − x  ) + k(xk − xk−1 + βk h∇ f (xk )),

and (δk )k∈N is a positive sequence that will be adjusted. Observe that E k is nothing
but the discrete analogue of the Lyapunov function (2). Set ΔE k := E k+1 − E k , i.e.,

1
ΔE k = (δk+1 − δk )( f (xk+1 ) − f (x  )) + δk ( f (xk+1 ) − f (xk )) + ( vk+1 2
− vk 2
)
2

Let us evaluate the last term of the above expression with the help of the three-point
identity 21 vk+1 2 − 21 vk 2 = vk+1 − vk , vk+1  − 21 vk+1 − vk 2 .
Using successively the definition of vk and (8), we get

vk+1 − vk = (α − 1)(xk+1 − xk ) + (k + 1)(xk+1 − xk + βk+1 h∇ f (xk+1 ))


−k(xk − xk−1 + βk h∇ f (xk ))
= α(xk+1 − xk ) + k(xk+1 − 2xk + xk−1 ) + βk+1 h∇ f (xk+1 )
+hk(βk+1 ∇ f (xk+1 ) − βk ∇ f (xk ))
= [α(xk+1 − xk ) + k(xk+1 − 2xk + xk−1 ) + khβk (∇ f (xk+1 ) − ∇ f (xk ))]
+βk+1 h∇ f (xk+1 ) + kh(βk+1 − βk )∇ f (xk+1 )
= −bk h 2 k∇ f (xk+1 ) + βk+1 h∇ f (xk+1 ) + kh(βk+1 − βk )∇ f (xk+1 )
= h βk+1 + k(βk+1 − βk ) − bk hk ∇ f (xk+1 ).

Set shortly Ck = βk+1 + k(βk+1 − βk ) − bk hk. We have obtained

1 1 h2
vk+1 2 − vk 2 = − Ck2 ∇ f (xk+1 ) 2
2 2 2
∇ f (xk+1 ), (α − 1)(xk+1 − x  ) + (k + 1)(xk+1 − xk + βk+1 h∇ f (xk+1 ))

1 One can even consider the more general case b(t) = 1 + b/(hk), b > 0 for which our discussion remains
true under minor modifications. But we do not pursue this for the sake of simplicity.

123
126 H. Attouch et al.

1 2
= −h 2 C − Ck βk+1 ∇ f (xk+1 ) 2 − (α − 1)hCk ∇ f (xk+1 ), x  − xk+1 
2 k
−hCk (k + 1)∇ f (xk+1 ), xk − xk+1 .

By virtue of (G2dis ), we have

−Ck = bk hk − βk+1 − k(βk+1 − βk ) > 0.

Then, in the above expression, the coefficient of ∇ f (xk+1 ) 2 is less or equal than
zero, which gives

1 1  
vk+1 2 − vk 2 ≤ −(α − 1)hCk ∇ f (xk+1 ), x  − xk+1
2 2
−hCk (k + 1) ∇ f (xk+1 ), xk − xk+1  .

According to the (convex) subdifferential inequality and Ck < 0 (by (G2dis )), we infer

1 1
vk+1 2 − vk 2 ≤ −(α − 1)hCk ( f (x  ) − f (xk+1 ))
2 2
−hCk (k + 1)( f (xk ) − f (xk+1 )).

Take δk := −hCk (k + 1) = h bk hk − βk+1 − k(βk+1 − βk ) (k + 1) so that the terms


f (xk ) − f (xk+1 ) cancel in E k+1 − E k . We obtain

E k+1 − E k ≤ δk+1 − δk − (α − 1)h(bk hk − βk+1 − k(βk+1 − βk )) ( f (xk+1 ) − f (x  ))

Equivalently

δk
E k+1 − E k ≤ δk+1 − δk − (α − 1) ( f (xk+1 ) − f (x  )).
k+1

δk
By assumption (G3dis ), we have δk+1 − δk − (α − 1) k+1 ≤ 0. Therefore, the sequence
(E k )k∈N is non-increasing, which, by definition of E k , gives, for k ≥ 0

E0
f (xk ) − min f ≤ .
H δk

By summing the inequalities

h
E k+1 − E k + h (βk+1 + k(βk+1 − βk ) − bk hk)2 + δk βk+1 ∇ f (xk+1 ) 2
≤0
2

k δk βk+1 ∇ f (xk+1 ) < +∞. 

we finally obtain 2

123
First-order optimization algorithms via inertial systems… 127

3.1.2 Non-smooth case

Let f : H → R ∪ {+∞} be a proper lower semicontinuous and convex function. We


rely on the basic properties of the Moreau-Yosida regularization. Let f λ be the Moreau
envelope of f of index λ > 0, which is defined by:

 
1
f λ (x) = min f (z) + z−x 2
, for any x ∈ H.
z∈H 2λ

We recall that f λ is a convex function, whose gradient is λ−1 -Lipschitz continuous,


such that argminH f λ = argminH f . The interested reader may refer to [17,19] for a
comprehensive treatment of the Moreau envelope in a Hilbert setting. Since the set of
minimizers is preserved by taking the Moreau envelope, the idea is to replace f by
f λ in the previous algorithm, and take advantage of the fact that f λ is continuously
differentiable. The Hessian dynamic attached to f λ becomes

α
ẍ(t) + ẋ(t) + β∇ 2 f λ (x(t))ẋ(t) + b(t)∇ f λ (x(t)) = 0.
t

However, we do not really need to work on this system (which requires f λ to be C 2 ),


but with the discretized form which only requires the function to be continuously
differentiable, as is the case of f λ . Then, algorithm (IPAHD) applied to f λ now reads

 √
α α
yk = xk + 1 − k+α (xk − xk−1 ) + β s 1 − k+α ∇ f λ (xk )
xk+1 = prox k (β √s+sbk ) fλ (yk ).
k+α

By applying Theorem 4 we obtain that under the assumption (G2dis ) and (G3dis ),

1 
f λ (xk ) − min f = O , δk βk+1 ∇ f λ (xk+1 ) 2
< +∞.
H δk
k

Thus, we just need to formulate these results in terms of f and its proximal mapping.
This is straightforward thanks to the following formulae from proximal calculus [17]:

1  
x − prox (x))2 ,
f λ (x) = f (proxλ f (x)) + λf (11)

1 
∇ f λ (x) = x − proxλ f (x) , (12)
λ
λ θ
proxθ fλ (x) = x+ prox(λ+θ) f (x). (13)
λ+θ λ+θ

123
128 H. Attouch et al.

We obtain the following relaxed inertial proximal algorithm (NS stands for Non-
Smooth):

(IPAHD-NS) :

λ(k+α)
Set μk := λ(k+α)+k(β √
s+sbk )
⎧ √  
⎨ yk = xk + (1 − α )(xk − xk−1 ) + β s α
k+α λ 1− k+α xk − proxλ f (xk )
⎩xk+1 = μk yk + (1 − μk ) prox λ
f (yk ).
μk

Theorem 5 Let f : H → R ∪ {+∞} be a convex, lower semicontinuous, proper


function. Let the sequence (δk )k∈N as defined in (9), and suppose that the growth
conditions (G2dis ) and (G3dis ) in Theorem 4 are satisfied. Then, for any sequence (xk )k∈N
generated by (IPAHD-NS) , the following holds

1   2
f (proxλ f (xk )) − min f = O , δk βk+1 xk+1 − proxλ f (xk+1 ) < +∞.
H δk
k

3.2 Gradient algorithms

Take f a convex function whose gradient is L-Lipschitz continuous. Our analysis is


based on the dynamic (DIN-AVD)α,β,1+ β considered in Theorem 3 with damping
t
parameters α ≥ 3, β ≥ 0. Consider the time discretization of (DIN-AVD)α,β,1+ β
t

1 α β
(xk+1 − 2xk + xk−1 ) + (xk − xk−1 ) + √ (∇ f (xk ) − ∇ f (xk−1 ))
s ks s
β
+ √ ∇ f (xk−1 ) + ∇ f (yk ) = 0,
k s

with yk inspired by Nesterov’s accelerated scheme. We obtain the following scheme:

(IGAHD) : Inertial Gradient Algorithm with Hessian Damping.

Step k: αk = 1 − αk .
 √ √
β s
yk = xk + αk (xk − xk−1 ) − β s (∇ f (xk ) − ∇ f (xk−1 )) − k ∇ f (xk−1 )
xk+1 = yk − s∇ f (yk )

Following [5], set tk+1 = α−1


k
, whence tk = 1 + tk+1 αk .

Given x ∈ argminH f , our Lyapunov analysis is based on the sequence (E k )k∈N

1
E k := tk2 ( f (xk ) − f (x  )) + vk 2
(14)
2s

123
First-order optimization algorithms via inertial systems… 129


vk := (xk−1 − x  ) + tk xk − xk−1 + β s∇ f (xk−1 ) . (15)

Theorem 6 Let f : H → R be a convex function whose gradient is L-Lipschitz


continuous.√Let (xk )k∈N be a sequence generated by algorithm (IGAHD) , where α ≥ 3,
0 ≤ β < 2 s and s ≤ 1/L. Then the sequence (E k )k∈N defined by (14)–(15) is non-
increasing, and the following convergence rates are satisfied:

1
(i) f (xk ) − min f = O as k → +∞;
H k2
(ii) Suppose that β > 0. Then
 
k 2 ∇ f (yk ) 2 < +∞ and k 2 ∇ f (xk ) 2
< +∞.
k k

Proof We rely on the following reinforced version of the gradient descent lemma
(Lemma 1 in “Appendix A.1”). Since s ≤ L1 , and ∇ f is L-Lipschitz continuous,

s s
f (y − s∇ f (y)) ≤ f (x) + ∇ f (y), y − x − ∇ f (y) 2
− ∇ f (x) − ∇ f (y) 2
2 2
for all x, y ∈ H. Let us write it successively at y = yk and x = xk , then at y = yk ,
x = x  . According to xk+1 = yk − s∇ f (yk ) and ∇ f (x  ) = 0, we get
s s
f (xk+1 ) ≤ f (xk ) + ∇ f (yk ), yk − xk  − ∇ f (yk )
∇ f (xk ) − ∇ f (yk ) 2
2

2 2
(16)
  s s
f (xk+1 ) ≤ f (x  ) + ∇ f (yk ), yk − x  − ∇ f (yk ) 2 − ∇ f (yk ) 2 . (17)
2 2
Multiplying (16) by tk+1 − 1 ≥ 0, then adding (17), we derive that

tk+1 ( f (xk+1 ) − f (x  )) ≤ (tk+1 − 1)( f (xk ) − f (x  ))


s
+∇ f (yk ), (tk+1 − 1)(yk − xk ) + yk − x   − tk+1 ∇ f (yk ) 2 .
2
s s
− (tk+1 − 1) ∇ f (xk ) − ∇ f (yk ) 2 − ∇ f (yk ) 2 . (18)
2 2
Let us multiply (18) by tk+1 to make appear E k . We obtain
2
tk+1 ( f (xk+1 ) − f (x  )) ≤ (tk+1
2
− tk+1 − tk2 )( f (xk ) − f (x  )) + tk2 ( f (xk ) − f (x  ))
s 2
+tk+1 ∇ f (yk ), (tk+1 − 1)(yk − xk ) + yk − x   − tk+1 ∇ f (yk ) 2
2
s 2 s
− (tk+1 − tk+1 ) ∇ f (xk ) − ∇ f (yk ) 2 − tk+1 ∇ f (yk ) 2 .
2 2

Since α ≥ 3 we have tk+1


2 − tk+1 − tk2 ≤ 0, which gives

2
tk+1 ( f (xk+1 − f (x  )) ≤ tk2 ( f (xk ) − f (x  ))

123
130 H. Attouch et al.

s 2
+tk+1 ∇ f (yk ), (tk+1 − 1)(yk − xk ) + yk − x   − tk+1 ∇ f (yk ) 2
2
s 2 s
− (tk+1 − tk+1 ) ∇ f (xk ) − ∇ f (yk ) 2 − tk+1 ∇ f (yk ) 2 .
2 2

According to the definition of E k , we infer

s 2
E k+1 − E k ≤ tk+1 ∇ f (yk ), (tk+1 − 1)(yk − xk ) + yk − x   − tk+1 ∇ f (yk ) 2
2
s 2 s
− (tk+1 − tk+1 ) ∇ f (xk ) − ∇ f (yk ) 2 − tk+1 ∇ f (yk ) 2
2 2
1 1
+ vk+1 2 − vk 2 .
2s 2s

Let us compute this last expression with the help of the elementary identity

1 1 1
vk+1 2
− vk 2
= vk+1 − vk , vk+1  − vk+1 − vk 2
.
2 2 2

By definition of vk , according to (IGAHD) and tk − 1 = tk+1 αk , we have


vk+1 − vk = xk − xk−1 + tk+1 (xk+1 − xk + β s∇ f (xk ))

−tk (xk − xk−1 + β s∇ f (xk−1 ))

= tk+1 (xk+1 − xk ) − (tk − 1)(xk − xk−1 ) + β s tk+1 ∇ f (xk ) − tk ∇ f (xk−1 )

= tk+1 xk+1 − (xk + αk (xk − xk−1 ) + β s tk+1 ∇ f (xk ) − tk ∇ f (xk−1 )

√ β s
= tk+1 (xk+1 − yk ) − tk+1 β s(∇ f (xk ) − ∇ f (xk−1 )) − tk+1 ∇ f (xk−1 )
k

+β s(tk+1 ∇ f (xk ) − tk ∇ f (xk−1 ))
√ 1
= tk+1 (xk+1 − yk ) + β s tk+1 1 − − tk ∇ f (xk−1 )
k
= tk+1 (xk+1 − yk ) = −stk+1 ∇ f (yk ).

Hence

1 1 s 2
vk+1 2 − vk 2
= − tk+1 ∇ f (yk ) 2
2s  2s 2 

−tk+1 ∇ f (yk ), xk − x  + tk+1 xk+1 − xk + β s∇ f (xk ) .

Collecting the above results, we obtain

E k+1 − E k ≤ tk+1 ∇ f (yk ), (tk+1 − 1)(yk − xk ) + yk − x   − stk+1


2
∇ f (yk ) 2
 √ 

−tk+1 ∇ f (yk ), xk − x + tk+1 xk+1 − xk + β s∇ f (xk )
s 2 s
− (tk+1 − tk+1 ) ∇ f (xk ) − ∇ f (yk ) 2
− tk+1 ∇ f (yk ) 2 .
2 2

123
First-order optimization algorithms via inertial systems… 131

Equivalently

E k+1 − E k ≤ tk+1 ∇ f (yk ), Ak  − stk+1


2
∇ f (yk ) 2
s 2 s
− (tk+1 − tk+1 ) ∇ f (xk ) − ∇ f (yk ) 2 − tk+1 ∇ f (yk ) 2 ,
2 2

with

Ak := (tk+1 − 1)(yk − xk ) + yk − xk − tk+1 xk+1 − xk + β s∇ f (xk )

= tk+1 yk − tk+1 xk − tk+1 (xk+1 − xk ) − tk+1 β s∇ f (xk )

= tk+1 (yk − xk+1 ) − tk+1 β s∇ f (xk )

= stk+1 ∇ f (yk ) − tk+1 β s∇ f (xk )

Consequently

E k+1 − E k ≤ tk+1 ∇ f (yk ), stk+1 ∇ f (yk ) − tk+1 β s∇ f (xk )
s 2 s
−stk+1
2
∇ f (yk ) 2 − (tk+1 − tk+1 ) ∇ f (xk ) − ∇ f (yk ) 2 − tk+1 ∇ f (yk ) 2
2 2
√ s 2
= −tk+1 β s∇ f (yk ), ∇ f (xk ) − (tk+1 − tk+1 ) ∇ f (xk ) − ∇ f (yk ) 2
2
2
s
− tk+1 ∇ f (yk ) 2
2
= −tk+1 Bk ,

where
√ s s
Bk := tk+1 β s∇ f (yk ), ∇ f (xk ) + (tk+1 − 1) ∇ f (xk ) − ∇ f (yk ) 2
+ ∇ f (yk ) 2 .
2 2

When β = 0 we have Bk ≥ 0. Let us analyze the sign of Bk in the case β > 0. Set
Y = ∇ f (yk ), X = ∇ f (xk ). We have

s s √
Bk = Y 2 + (tk+1 − 1) Y − X 2 + tk+1 β sY , X 
2 2
s  √  s
= tk+1 Y 2 + tk+1 (β s − s) + s Y , X  + (tk+1 − 1) X 2
2 2
s  √  s
≥ tk+1 Y 2 − tk+1 (β s − s) + s Y X + (tk+1 − 1) X 2 .
2 2

Elementary algebra gives that the above quadratic form is non-negative when
 √ 2
tk+1 (β s − s) + s ≤ s 2 tk+1 (tk+1 − 1).

Recall
√ that tk is of order k. Hence, this inequality
√ is satisfied for k large enough if
(β s − s)2 < s 2 , which is equivalent to β < 2 s. Under this condition E k+1 − E k ≤

123
132 H. Attouch et al.


0, which gives conclusion (i). Similar argument gives
√ that for 0 < < 2 sβ − β 2
(such exists according to assumption 0 < β < 2 s)

1 2
E k+1 − E k + t ∇ f (yk ) 2
≤ 0.
2 k+1

After summation of these inequalities, we obtain conclusion (ii). 




Remark 3 In [32, Theorem 8], the same convergence rate as in Theorem 6 on


the objective values is obtained for a very different discretization of the sys-

tem (DIN-AVD) √ α s . Their scheme is thus related but quite different from
α,b s,1+ 2t
(IGAHD) . Their claims require also intricate conditions relating (α, b, s, L) to hold
true. √
In Theorem 6, the condition β < 2 s essentially reveals that in order to preserve
acceleration offered by the viscous damping, the geometric damping should not be
too large. It is an open question whether this constraint is a technical artifact or is
fundamental to acceleration. We leave it to a future work.

Remark 4 From k k 2 ∇ f (xk ) 2 < +∞ we immediately infer that for k ≥ 1


k 
k 
inf ∇ f (xi ) 2
i2 ≤ i 2 ∇ f (xi ) 2
≤ i 2 ∇ f (xi ) 2
< +∞.
i=1,··· ,k
i=1 i=1 i∈N

A similar argument holds for yk . Hence

1 1
inf ∇ f (xi ) 2
=O , inf ∇ f (yi ) 2
=O .
i=1,...,k k3 i=1,...,k k3

Remark 5 In Theorem 6, the convergence property of the values is expressed according


to the sequence (xk )k∈N . It is natural to know if a similar result is true for the sequence
(yk )k∈N . This is an open question in the case of Nesterov’s accelerated gradient method
and the corresponding FISTA algorithm for structured minimization [18,26]. In the
case of the Hessian-driven damping algorithms, we give a partial answer to this ques-
tion. By the classical descent lemma, and the monotonicity of ∇ f we have

L
f (yk ) ≤ f (xk+1 ) + yk − xk+1 , ∇ f (xk+1 ) + yk − xk+1 2
2
L
≤ f (xk+1 ) + yk − xk+1 , ∇ f (yk ) + yk − xk+1 2
2

According to xk+1 = yk − s∇ f (yk ) we obtain

s2 L
f (yk ) − min f ≤ f (xk+1 ) − min f + s ∇ f (yk ) 2
+ ∇ f (yk ) 2 .
H H 2

123
First-order optimization algorithms via inertial systems… 133

From Theorem 6 we deduce that

1 s2 L 1 1
f (yk ) − min f ≤ O + s+ ∇ f (yk ) 2
=O +o .
H k2 2 k2 k2

Remark 6 When f is a proper lower semicontinuous proper function, but not neces-
sarily smooth, we follow the same reasoning as in Sect. 3.1.2. We consider minimizing
the Moreau envelope f λ of f , whose gradient is 1/λ-Lipschitz continuous, and then
apply (IGAHD) to f λ . We omit the details for the sake of brevity. This observation
will be very useful to solve even structured composite problems as we will describe
in Sect. 6.

4 Inertial dynamics for strongly convex functions

4.1 Smooth case

Recall the classical definition of strong convexity:


Definition 1 A function f : H → R is said to be μ-strongly convex for some μ > 0
if f − μ2 · 2 is convex.
For strongly convex functions, a suitable choice of γ and β in (DIN)γ ,β provides
exponential decay of the value function (hence of the trajectory), and of the gradients.

Theorem 7 Suppose that (H) holds where f : H → R is in addition μ-strongly convex


for some μ > 0. Let x(·) : [t0 , +∞[→ H be a solution trajectory of

ẍ(t) + 2 μẋ(t) + β∇ 2 f (x(t))ẋ(t) + ∇ f (x(t)) = 0. (19)

Suppose that 0 ≤ β ≤ 1

2 μ. Then, the following hold:

(i) for all t ≥ t0

μ  √
μ
x(t) − x  2 ≤ f (x(t)) − min f ≤ Ce− 2 (t−t0 )
2 H

where C := f (x(t0 )) − minH f + μ x(t0 ) − x  2 + ẋ(t0 ) + β∇ f (x(t0 )) 2 .


(ii) There exists some constant C1 > 0 such that, for all t ≥ t0


 t √ √
μ
e− μt
e μs
∇ f (x(s)) 2 ds ≤ C1 e− 2 t .
t0


μ

Moreover, t0 e
2 t ẋ(t) 2 dt < +∞.
 √ 
When β = 0, we have f (x(t)) − minH f = O e− μt as t → +∞.

123
134 H. Attouch et al.

Remark 7 When β = 0, Theorem 7 recovers [29, Theorem 2.2]. In the case β > 0,
a result on a related but different dynamical system can be found in [32, Theorem 1]
(their rate is also sligthtly worse than ours). Our gradient estimate is distinctly new in
the literature.

Proof (i) Let x  be the unique minimizer of f . Define E : [t0 , +∞[→ R+ by

1 √
E(t) := f (x(t)) − min f + μ(x(t) − x  ) + ẋ(t) + β∇ f (x(t)) 2 .
H 2

Set v(t) = μ(x(t) − x  ) + ẋ(t) + β∇ f (x(t)). Derivation of E(·) gives

d √
E(t) := ∇ f (x(t)), ẋ(t) + v(t), μẋ(t) + ẍ(t) + β∇ 2 f (x(t))ẋ(t).
dt

Using (19), we get

d √
E(t) = ∇ f (x(t)), ẋ(t) + v(t), − μẋ(t) − ∇ f (x(t)).
dt

After developing and simplification, we obtain

d √ √
E(t) + μ∇ f (x(t)), x(t) − x   + μx(t) − x  , ẋ(t) + μ ẋ(t) 2
dt

+β μ∇ f (x(t)), ẋ(t) + β ∇ f (x(t)) 2 = 0.

By strong convexity of f we have

μ
∇ f (x(t)), x(t) − x   ≥ f (x(t)) − f (x  ) + x(t) − x  2 .
2

Thus, combining the last two relations we obtain

d √
E(t) + μA ≤ 0,
dt

where (the variable t is omitted to lighten the notation)

μ √
A := f (x) − f (x  ) + x − x 2
+ μx − x  , ẋ + ẋ 2
+ β∇ f (x), ẋ
2
β
+ √ ∇ f (x) 2
μ

Let us formulate A with E(t).

1 √ √
A=E− ẋ + β∇ f (x) 2
− μx − x  , ẋ + β∇ f (x) + μx − x  , ẋ + ẋ 2
2

123
First-order optimization algorithms via inertial systems… 135

β
+β∇ f (x), ẋ + √ ∇ f (x) 2 .
μ

After developing and simplifying, we obtain


   
d √ 1 2 β β2 2 √ 
E(t) + μ E(t) + ẋ + √ − ∇ f (x) − β μx − x , ∇ f (x) ≤ 0.
dt 2 μ 2

Since 0 ≤ β ≤ √β − β2
≥ β
2 μ.
√1 , we immediately get √ Hence
μ μ 2

d √ 1 β √
E (t) + μ E (t) + ẋ 2
+ √ ∇ f (x) 2
− β μx − x  , ∇ f (x) ≤ 0.
dt 2 2 μ

Let us use again the strong convexity of f to write

1 1 1 1  1 μ
E (t) = E (t) + E (t) ≥ E (t) + f (x(t)) − f (x  ) ≥ E (t) + x(t) − x  2 .
2 2 2 2 2 4

By combining the two inequalities above, we obtain


√ √
d μ μ √
E(t) + E(t) + ẋ(t) 2
+ μB ≤ 0,
dt 2 2

where B = μ4 x(t) − x  2 + 2√β μ ∇ f (x) 2 − β μ x − x  ∇ f (x) .
Set X = x − x  , Y = ∇ f (x) . Elementary algebraic computation gives that,
under the condition 0 ≤ β ≤ 2√1 μ

μ 2 β √
X + √ Y 2 − β μX Y ≥ 0.
4 2 μ

Hence for 0 ≤ β ≤ 1

2 μ

√ √
d μ μ
E(t) + E(t) + ẋ(t) 2
≤ 0.
dt 2 2

By integrating the differential inequality above we obtain



μ
E(t) ≤ E(t0 )e− 2 (t−t0 ) .

By definition of E(t), we infer



μ
f (x(t)) − min f ≤ E(t0 )e− 2 (t−t0 ) ,
H

123
136 H. Attouch et al.

and

√ μ
μ(x(t) − x  ) + ẋ(t) + β∇ f (x(t)) 2
≤ 2E(t0 )e− 2 (t−t0 ) .

μ
(ii) Set C = 2E(t0 )e 2 t0 . Developing the above expression, we obtain
√  
μ x(t) − x  2 + ẋ(t) 2 + β 2 ∇ f (x(t)) 2 + 2β μ x(t) − x  , ∇ f (x(t))

 √  μ
+ ẋ(t), 2β∇ f (x(t)) + 2 μ(x(t) − x  ) ≤ Ce− 2 t .

By convexity of f we have x(t) − x  , ∇ f (x(t)) ≥ f (x(t)) − f (x  ). More-


over,
 √ 
ẋ(t), 2β∇ f (x(t)) + 2 μ(x(t) − x  )
d √
= 2β( f (x(t)) − f (x  )) + μ x(t) − x  2
.
dt
Combining the above results, we obtain
√ √
μ[2β( f (x(t)) − f (x  )) + μ x(t) − x  2 ] + β 2 ∇ f (x(t)) 2

d √ μ
+ 2β( f (x(t)) − f (x  )) + μ x(t) − x  2 ≤ Ce− 2 t .
dt

Set Z (t) := 2β( f (x(t)) − f (x  )) + μ x(t) − x  2 ]. We have

d √ μ
Z (t) + μZ (t) + β 2 ∇ f (x(t)) 2
≤ Ce− 2 t .
dt
By integrating this differential inequality, elementary computation gives


 t √ √
μ
e− μt
e μs
∇ f (x(s)) 2 ds ≤ Ce− 2 t .
t0
√ √
Noticing that the integral of e μs over [t0 , t] is of order e μt , the above estimate
reflects the fact, as t → +∞, the gradient terms ∇ f (x(t)) 2 tend to zero at
exponential rate (in average, not pointwise).



Remark 8 Let us justify the choice of γ = 2 μ in Theorem 7. Indeed, considering

ẍ(t) + 2γ ẋ(t) + β∇ 2 f (x(t)) + ∇ f (x(t)) = 0,

a similar proof to that described above can be performed on the basis of the Lyapunov
function
1
E(t) := f (x(t)) − min f + γ (x(t) − x  ) + ẋ(t) + β∇ f (x(t)) 2 .
H 2

123
First-order optimization algorithms via inertial systems… 137

√ μ
Under the conditions γ ≤ μ and β ≤ 2γ 3
we obtain the exponential convergence
rate
γ
f (x(t)) − min f = O e− 2 t as t → +∞.
H


Taking γ = μ gives the best convergence rate, and the result of Theorem 7.

4.2 Non-smooth case

Following [2], (DIN)γ ,β is equivalent to the first-order system



⎨ẋ(t) + β∇ f (x(t)) + γ − 1
x(t) + 1
y(t) = 0;
β β
.
⎩ ẏ(t) + γ − 1
x(t) + 1
y(t) = 0.
β β

This permits to extend (DIN)γ ,β to the case of a proper lower semicontinuous convex
function f : H → R ∪ {+∞}. Replacing the gradient of f by its subdifferential, we
obtain its Non-Smooth version :

⎨ẋ(t) + β∂ f (x(t)) + γ − 1
x(t) + 1
y(t)  0;
β β
(DIN-NS)γ ,β
⎩ ẏ(t) + γ − 1
x(t) + 1
y(t) = 0.
β β

Most properties of (DIN)γ ,β are still valid for this generalized version. To illustrate
it, let us consider the following extension of Theorem 7.

Theorem 8 Suppose that f : H → R ∪ {+∞} is lower semicontinuous and μ-


strongly convex for some μ > 0. Let x(·) be a trajectory of (DIN-NS)2√μ,β . Suppose
that 0 ≤ β ≤ 2√1 μ . Then

μ  √
μ
x(t) − x  2 ≤ f (x(t)) − min f = O e− 2 t as t → +∞,
2 H
 ∞ √
μ
and e 2 t ẋ(t) 2 dt < +∞.
t0

Proof Let us introduce E : [t0 , +∞[→ R+ defined by

1 √ √ 1 1
E(t) := f (x(t)) − min f + μ(x(t) − x  ) − 2 μ − x(t) − y(t) 2 ,
H 2 β β

that will serve as a Lyapunov function. Then, the proof follows the same lines as that
of Theorem 7, with the use of the derivation rule of Brezis [19, Lemme 3.3, p. 73]. 


123
138 H. Attouch et al.

5 Inertial algorithms for strongly convex functions

We will show in this section that the exponential convergence of Theorem 7 for the
inertial system (19) translates into linear convergence in the algorithmic case under
proper discretization.

5.1 Proximal algorithms

5.1.1 Smooth case

Consider the inertial dynamic (19). Its implicit discretization similar to that performed
before gives

1 2 μ β
(x k+1 − 2x k + x k−1 ) + (xk+1 − xk ) + (∇ f (xk+1 ) − ∇ f (xk )) + ∇ f (xk+1 ) = 0,
h2 h h

where h is the positive step size. Set s = h 2 . We obtain the following inertial proximal
algorithm with hessian damping (SC refers to Strongly Convex):

(IPAHD-SC)
⎧ √
2 μs √ √
2 μs
⎨ yk = xk + 1 − √ (xk − xk−1 ) + β s 1 − √ ∇ f (xk )
1+2 μs 1+2 μs
⎩xk+1 = prox √
β s+s

(yk ).
1+2 μs
f

Theorem 9 Assume that f : H → R is a convex C 1 function and μ-strongly convex,


μ > 0, and suppose that

1 √
0 ≤ β ≤ √ and s ≤ β.
2 μ

Set q = 1
√ , which satisfies 0 < q < 1. Then, the sequence (xk )k∈N generated
1+ 21 μs
by the algorithm (IPAHD-SC) obeys, for any k ≥ 1
μ 
xk − x  2 ≤ f (xk ) − min f ≤ E 1 q k−1 ,
2 H


where E 1 = f (x1 )− f (x  )+ 21 μ(x1 − x  )+ √1s (x1 − x0 )+β∇ f (x1 ) 2 . Moreover,
the gradients converge exponentially fast to zero: setting θ = 1

1+ μs which belongs
to ]0, 1[, we have


k−2
θk θ − j ∇ f (x j ) 2
= O qk as k → +∞.
j=0

123
First-order optimization algorithms via inertial systems… 139

Remark 9 We are not aware of any result of this kind for such a proximal algorithm.

Proof Let x  be the unique minimizer of f , and consider the sequence (E k )k∈N

1
E k := f (xk ) − f (x  ) + vk 2
,
2

where vk = μ(xk − x  ) + √1s (xk − xk−1 ) + β∇ f (xk ).
We will use the following equivalent formulation of the algorithm (IPAHD-SC)

1 √
√ (xk+1 − 2xk + xk−1 ) + 2 μ(xk+1 − xk )
s

+β(∇ f (xk+1 ) − ∇ f (xk )) + s∇ f (xk+1 ) = 0. (20)

We have

1 1
E k+1 − E k = f (xk+1 ) − f (xk ) + vk+1 2
− vk 2
.
2 2

Using successively the definition of vk and (20), we get

√ 1
vk+1 − vk = μ(xk+1 − xk ) + √ (xk+1 − 2xk + xk−1 ) + β(∇ f (xk+1 ) − ∇ f (xk ))
s
√ √ √
= μ(xk+1 − xk ) − 2 μ(xk+1 − xk ) − s∇ f (xk+1 )
√ √
= = − μ(xk+1 − xk ) − s∇ f (xk+1 ).

√ √
Write shortly Bk = μ(xk+1 − xk ) + s∇ f (xk+1 ). We have

1 1   1
vk+1 2 − vk 2 = vk+1 − vk , vk+1 − vk+1 − vk 2
2  2 2 
√  1 1
= − Bk , μ(xk+1 − x ) + √ (xk+1 − xk ) + β∇ f (xk+1 ) − Bk 2
s 2
  μ √  
= −μ xk+1 − xk , xk+1 − x  − xk+1 − xk 2 − β μ ∇ f (xk+1 ), xk+1 − xk
s
√     √
− μs ∇ f (xk+1 ), xk+1 − x  − ∇ f (xk+1 ), xk+1 − xk − β s ∇ f (xk+1 ) 2
1 1 √  
− μ xk+1 − xk 2 − s ∇ f (xk+1 2 − μs ∇ f (xk+1 ), xk+1 − xk
2 2

By virtue of strong convexity of f

μ
f (xk ) ≥ f (xk+1 ) + ∇ f (xk+1 ), xk − xk+1  + xk+1 − xk 2 ;
2
  μ
f (x  ) ≥ f (xk+1 ) + ∇ f (xk+1 ), x  − xk+1 + xk+1 − x  2 .
2

123
140 H. Attouch et al.


Combining the above results, and after dividing by s, we get

1 √ μ
√ (E k+1 − E k ) + μ[ f (xk+1 ) − f (x  ) + xk+1 − x  2 ]
s 2

μ  
 μ
≤ − √ xk+1 − xk , xk+1 − x − xk+1 − xk 2
s s
μ μ
−β ∇ f (xk+1 ), xk+1 − xk  − √ xk+1 − xk 2 − β ∇ f (xk+1 ) 2
s 2 s
μ 1 √ √
− √ xk+1 − xk 2 − s ∇ f (xk+1 2 − μ ∇ f (xk+1 ), xk+1 − xk  ,
2 s 2

which gives, after developing and simplification

1 √  
√ (E k+1 − E k ) + μE k+1 − βμ ∇ f (xk+1 ), xk+1 − x 
s
√  √ √ 
μ μ β2 μ s
≤− +√ xk+1 − xk − β −
2
+ ∇ f (xk+1 ) 2
2s s 2 2

− μ ∇ f (xk+1 ), xk+1 − xk  .

β2 μ 3β
According to 0 ≤ β ≤ 2
1

μ , we have β − 2 ≥ 4 , which, with Cauchy-Schwarz
inequality, gives

1 √ μ μ 3β
√ (E k+1 − E k ) + μE k+1 + +√ xk+1 − xk 2 + ∇ f (xk+1 ) 2
s 2s s 4

−βμ ∇ f (xk+1 ) xk+1 − x  − μ ∇ f (xk+1 ) xk+1 − xk ≤ 0.

Let us use again the strong convexity of f to write

1 1  1 μ
E k+1 ≥ E k+1 + f (xk+1 ) − f (x  ) ≥ E k+1 + xk+1 − x  2 .
2 2 2 4

Combining the two inequalities above, we get


1 1√ √ μ μ μ
√ (E k+1 − E k ) + μE k+1 + μ xk+1 − x  2 + +√ xk+1 − xk 2
s 2 4 2s s
3β √
+ ∇ f (xk+1 ) 2 − βμ ∇ f (xk+1 ) xk+1 − x  − μ ∇ f (xk+1 ) xk+1 − xk ≤ 0.
4

Let us rearrange the terms as follows

1 1√
√ (E k+1 − E k ) + μE k+1
s 2

123
First-order optimization algorithms via inertial systems… 141

√ μ β
+ μ xk+1 − x  2
+ ∇ f (xk+1 ) 2
− βμ ∇ f (xk+1 ) xk+1 − x 
4 2
! "# $
Term 1

μ μ β √
+ +√ xk+1 − xk 2
+ ∇ f (xk+1 ) 2
− μ ∇ f (xk+1 ) xk+1 − xk ≤0
2s s 4
! "# $
Term 2

Let us examine the sign of the last two terms in the rhs of inequality above.
Term 1 Set X = xk+1 − x  , Y = ∇ f (xk+1 ) . Elementary algebra gives that

√ μ 2 β 2
μ X + Y − βμX Y ≥ 0,
4 2

holds true under the condition 0 ≤ β ≤ 1



2 μ. Hence, under this condition

√ μ β
μ xk+1 − x  2
+ ∇ f (xk+1 ) 2
− βμ ∇ f (xk+1 ) xk+1 − x  ≥ 0.
4 2
Term 2 Set X = xk+1 − xk , Y = ∇ f (xk+1 ) . Elementary algebra gives

μ μ β 2 √
+√ X2 + Y − μX Y ≥ 0
2s s 4

μ μ μ
holds true under the condition 2s + √
s
≥ β. Hence, under this condition

μ μ β √
+√ xk+1 − xk 2
+ ∇ f (xk+1 ) 2
− μ ∇ f (xk+1 ) xk+1 − xk ≥ 0.
2s s 4

μ √ %
μ μ β
2s + s ≥
is equivalent to s ≤ 1+ 1+ .
√ 2

In turn, the condition β 2 β μ

Clearly, this condition is satisfied if s ≤ β.
Let us put the above
√ results together. We have obtained that, under the conditions
0 ≤ β ≤ 2√1 μ and s ≤ β,

1 1√
√ (E k+1 − E k ) + μE k+1 ≤ 0.
s 2

Set q = 1
√ , which satisfies 0 < q < 1. From this, we infer E k ≤ q E k−1 which
1+ 21 μs
gives

E k ≤ E 1 q k−1 . (21)

Since E k ≥ f (xk ) − f (x  ), we finally obtain

f (xk ) − f (x  ) ≤ E 1 q k−1 = O q k .

123
142 H. Attouch et al.

Let us now estimate the convergence rate of the gradients to zero. According to the
exponential decay of (E k )k∈N , as given in (21), and by definition of E k , we have, for
all k ≥ 1

√ 1
μ(xk − x  ) + √ (xk − xk−1 ) + β∇ f (xk ) 2
≤ 2E k ≤ 2E 1 q k−1 .
s

After developing, we get

1 √  
μ xk − x  2
+ xk − xk−1 2
+ β 2 ∇ f (xk ) 2
+ 2β μ xk − x  , ∇ f (xk )
s
1  √ 
+ √ xk − xk−1 , 2β∇ f (xk ) + 2 μ(xk − x  ) ≤ 2E 1 q k−1 .
s

By convexity of f , we have
 
xk − x  , ∇ f (xk ) ≥ f (xk ) − f (x  ) and xk − xk−1 , ∇ f (xk ) ≥ f (xk ) − f (xk−1 )

Moreover, xk − xk−1 , xk − x   ≥ 21 xk − x  2 − 1


2 xk−1 − x  2 .
Combining the above results, we obtain

√ √  2
μ 2β( f (xk ) − f (x  )) + μ xk − x   + β 2 ∇ f (xk ) 2

1 √  2
+ √ 2β( f (xk ) − f (x  )) + μ xk − x  
s
1 √  2
− √ 2β( f (xk−1 ) − f (x  )) + μ xk−1 − x   ≤ 2E 1 q k−1 .
s


Set Z k := 2β( f (xk ) − f (x  )) + μ xk − x  2 . We have, for all k ≥ 1

1 √
√ (Z k − Z k−1 ) + μZ k + β 2 ∇ f (xk ) 2
≤ 2E 1 q k−1 . (22)
s

Set θ = 1

1+ μs which belongs to ]0, 1[. Equivalently

√ √
Z k + θβ 2 s ∇ f (xk ) 2
≤ θ Z k−1 + 2E 1 θ sq k−1 .

Iterating this linear recursive inequality gives

√ k−2
√ k−2
Z k + θβ 2 s θ p ∇ f (xk− p ) 2
≤ θ k−1 Z 1 + 2E 1 θ s θ p q k− p−1 . (23)
p=0 p=0

123
First-order optimization algorithms via inertial systems… 143


θ 1+ 21 μs
Then notice that q = √
1+ μs < 1, which gives


k−2 
k−2
θ p
1
θ p q k− p−1 = q k−1 ≤2 1+ √ q k−1 .
q μs
p=0 p=0

Collecting the above results, we obtain

√ k−2
4E 1
θβ 2 s θ p ∇ f (xk− p ) 2
≤ θ k−1 Z 1 + √ q k−1 . (24)
μ
p=0

Using again the inequality θ < q, and after reindexing, we finally obtain


k−2
θk θ − j ∇ f (x j ) 2
= O qk .
p=0




5.1.2 Non-smooth case

Let f : H → R ∪ {+∞} be a proper, lower semicontinuous and convex function. We


argue as in Sect. 3.1.2 by replacing f with its Moreau envelope f λ . The key observation
is that the Moreau-Yosida regularization also preserves strong convexity, though with
a different modulus as shown by the following result.

Proposition 1 Suppose that f : H → R ∪ {+∞} is a proper, lower semicontinuous


convex function. Then, for any λ > 0 and μ > 0

μ
f is μ-strongly convex ⇒ f λ is strongly convex with modulus .
1 + λμ

Proof If f is strongly convex with constant μ > 0, we have f = g + μ2 · 2 for


some convex function g. Elementary calculus (see e.g., [17, Exercise 12.6]) gives,
λ
with θ = 1+λμ ,

1 μ
f λ (x) = gθ x + x 2
.
1 + λμ 2(1 + λμ)

Since x → gθ 1
1+λμ x is convex, the above formula shows that f λ is strongly convex
μ
with constant 1+λμ . 


123
144 H. Attouch et al.

According to the expressions


% (12) and (13), (IPAHD-SC) becomes with θ =
√ μ
β %s+s 2 1+λμ s
μ
and a = %
μ
:
1+2 1+λμ s 1+2 1+λμ s

(IPAHD-NS-SC)
 √  
yk = xk + (1 − a)(xk − xk−1 ) + β λ s (1 − a) xk − proxλ f (xk )
λ θ
xk+1 = λ+θ yk + λ+θ prox(λ+θ) f (yk )

It is a relaxed inertial proximal algorithm whose coefficients are constant. As a


result, its computational burden is equivalent to (actually twice) that of the classical
proximal algorithm. A direct application of the conclusions of Theorem 9 to f λ gives
the following statement.
Theorem 10 Suppose that f : H → R ∪ {+∞} is a proper, lower semicontinuous
and convex function which is μ-strongly convex for some μ > 0. Take λ > 0. Suppose
that
&
1 1 √
0≤β≤ λ+ and s ≤ β.
2 μ

1
Set q = % , which satisfies 0 < q < 1. Then, for any sequence (xk )k∈N
μ
1+ 1
1+λμ s
2
generated by algorithm (IPAHD-NS-SC)
 
xk − x   = O q k/2 and f (proxλ f (xk )) − min f = O q k as k → +∞,
H

and

xk − proxλ f (xk ) 2
= O qk as k → +∞.

5.2 Inertial gradient algorithms

Let us embark from the continuous dynamic (19) whose linear convergence rate was
established in Theorem 7. Its explicit time discretization with centered finite differ-
ences for speed and acceleration gives

1 μ 1
(xk+1 − 2xk + xk−1 ) + √ (xk+1 − xk−1 ) + β √ (∇ f (xk ) − ∇ f (xk−1 )) + ∇ f (xk ) = 0.
s s s

Equivalently,

√ √
(xk+1 − 2xk + xk−1 ) + μs(xk+1 − xk−1 ) + β s(∇ f (xk ) − ∇ f (xk−1 )) + s∇ f (xk ) = 0,

123
First-order optimization algorithms via inertial systems… 145

(25)

which gives the inertial gradient algorithm with Hessian damping (SC stands for
Strongly Convex):

(IGAHD-SC)
√ √
1− μs β s
xk+1 = xk + 1+ μs (x k
√ − xk−1 ) − √
1+ μs (∇ f (xk ) − ∇ f (xk−1 ))

− 1+√
s
μs ∇ f (x k ).

Let us analyze the linear convergence rate of (IGAHD-SC) .

Theorem 11 Let f : H → R be a C 1 and μ-strongly convex function for some μ > 0,


and whose gradient ∇ f is L-Lipschitz continuous. Suppose that
⎧ √ ⎫
1 ⎨ √μ μ
2s + s
μ
√ ⎬
β ≤ √ and L ≤ min , √ . (26)
μ ⎩ 8β 2βμ + √1 + μ ⎭
s 2

1
Set q = 1√
, which satisfies 0 < q < 1. Then, for any sequence (xk )k∈N
1+ 2 μs
generated by algorithm (IGAHD-SC) , we have

 
xk − x   = O q k/2 and f (xk ) − min f = O q k as k → +∞.
H

Moreover, the gradients converge exponentially fast to zero: setting θ = 1



1+ μs which
belongs to ]0, 1[, we have


k−2
θk θ − j ∇ f (x j ) 2
= O qk as k → +∞.
p=0

Remark 10 1. (IGAHD-SC) can be seen as an extension of the Nesterov accelerated


method for strongly convex functions that corresponds to the particular case β =
0. Actually, in this very specific case, (IGAHD-SC) is nothing but the (HBF)√
1− μs
method with stepsize parameter a = 1+√ s
μs and momentum parameter b = √
1+ μs ;
see [28, (2) in Section 3.2]. Thus, if f is also of class C 2 at x  , one can obtain
linear convergence of the iterates (xk )k∈N (but not the objective values) from [28,
Theorem 1] under the assumption that s < 4/L (which can be√shown to be weaker
than (26) since the latter is equivalent for β = 0 to s L ≤ ( 1 − c + c2 − (1 −
c))2 /c ≤ 1, where c = μ/L).

123
146 H. Attouch et al.

xk − x 
2. In fact, even for β > 0, by lifting the problem to the vector z k = as
xk−1 − x 
is standard in the (HBF) method, one can write (IGAHD-SC) as

(1 + b)I − (a + d)∇ f 2 (x  ) −bI + d∇ f 2 (x  )


z k+1 = z k + o(z k ),
I 0

where d = 1+β √μss
. Linear convergence of the iterates (xk )k∈N can then be obtained
by studying the spectral properties of the above matrix.
3. For β = 0, Theorem 11 recovers [29, Theorem 3.2], though the author uses a
slightly different discretization, requires only s ≤ 1/L and his convergence rate is
 √ −1
1 + μs , which is slightly better than ours for this special case. In the case
β > 0, a result on a scheme related but different from (IGAHD-SC) can be found
in [32, Theorem 3] (their rate is also slightly worse than ours). Our estimate are
also new in the literature.
Proof The proof is based on Lyapunov analysis, and the decrease property at linear
rate of the sequence (E k )k∈N defined by

1
E k := f (xk ) − f (x  ) + vk 2
,
2
where x  is the unique minimizer of f , and

√ 1
vk = μ(xk−1 − x  ) + √ (xk − xk−1 ) + β∇ f (xk−1 ).
s

We have E k+1 − E k = f (xk+1 ) − f (xk ) + 1


2 vk+1 2 − 1
2 vk 2. Using successively
the definition of vk and (25), we obtain

√ 1
vk+1 − vk = μ(xk − xk−1 ) + √ (xk+1 − 2xk + xk−1 ) + β(∇ f (xk ) − ∇ f (xk−1 ))
s
1 √ √
= √ (xk+1 − 2xk + xk−1 ) + μs(xk − xk−1 ) + β s(∇ f (xk ) − ∇ f (xk−1 ))
s
1 √ √
= √ − s∇ f (xk ) − μs(xk+1 − xk−1 ) + μs(xk − xk−1 ))
s
√ √
= − μ(xk+1 − xk ) − s∇ f (xk ).

Since 1
2 vk+1 2 − 1
2 vk 2 = vk+1 − vk , vk+1  − 1
2 vk+1 − vk 2, we have

1 1 1 √ √
vk+1 2 − vk 2 = − μ(xk+1 − xk ) + s∇ f (xk ) 2
2  2 2 
√ √ √ ∗ 1
− μ(xk+1 − xk ) + s∇ f (xk ), μ(xk − x ) + √ (xk+1 − xk ) + β∇ f (xk )
s
  μ √
= −μ xk+1 − xk , xk − x ∗ − xk+1 − xk 2 − β μ ∇ f (xk ), xk+1 − xk 
s

123
First-order optimization algorithms via inertial systems… 147

√   √
− μs ∇ f (xk ), xk − x ∗ − ∇ f (xk ), xk+1 − xk  − β s ∇ f (xk ) 2

1 1 √
− μ xk+1 − xk 2 − s ∇ f (xk 2 − μs ∇ f (xk ), xk+1 − xk  .
2 2

By strong convexity of f and L-Lipschitz continuity of ∇f we have

f(x*) ≥ f(x_k) + ⟨∇f(x_k), x* − x_k⟩ + (μ/2) ‖x_k − x*‖²
f(x_k) ≥ f(x_{k+1}) + ⟨∇f(x_{k+1}), x_k − x_{k+1}⟩ + (μ/2) ‖x_{k+1} − x_k‖²
       ≥ f(x_{k+1}) + ⟨∇f(x_k), x_k − x_{k+1}⟩ + (μ/2 − L) ‖x_{k+1} − x_k‖².

Combining the results above, and after dividing by √s, we get

(1/√s)(E_{k+1} − E_k) + √μ [ f(x_{k+1}) − f(x*) + (μ/2)‖x_k − x*‖² ] + √μ ( f(x_k) − f(x_{k+1}) )
  ≤ −(μ/√s) ⟨x_{k+1} − x_k, x_k − x*⟩ − (√μ/s) ‖x_{k+1} − x_k‖² − β√(μ/s) ⟨∇f(x_k), x_{k+1} − x_k⟩
    + (1/√s)(L − μ/2) ‖x_{k+1} − x_k‖² − (μ/(2√s)) ‖x_{k+1} − x_k‖²
    − (β + √s/2) ‖∇f(x_k)‖² − √μ ⟨∇f(x_k), x_{k+1} − x_k⟩.

Let us now make E_{k+1} appear:

(1/√s)(E_{k+1} − E_k) + √μ E_{k+1} ≤ √μ ⟨∇f(x_k), x_{k+1} − x_k⟩ + (√μ L/2) ‖x_{k+1} − x_k‖²
    + (√μ/2) ‖(1/√s)(x_{k+1} − x_k) + β∇f(x_k)‖² + μ ⟨ x_k − x*, (1/√s)(x_{k+1} − x_k) + β∇f(x_k) ⟩
    − (μ/√s) ⟨x_{k+1} − x_k, x_k − x*⟩ − (√μ/s) ‖x_{k+1} − x_k‖² − β√(μ/s) ⟨∇f(x_k), x_{k+1} − x_k⟩
    + (1/√s)(L − μ/2) ‖x_{k+1} − x_k‖² − (μ/(2√s)) ‖x_{k+1} − x_k‖²
    − (β + √s/2) ‖∇f(x_k)‖² − √μ ⟨∇f(x_k), x_{k+1} − x_k⟩.

After developing and simplifying, we get

(1/√s)(E_{k+1} − E_k) + √μ E_{k+1} ≤ − [ √μ/(2s) + μ/√s − L(1/√s + √μ/2) ] ‖x_{k+1} − x_k‖²
    − [ β − β²√μ/2 + √s/2 ] ‖∇f(x_{k+1})‖² + βμ ⟨∇f(x_k), x_k − x*⟩.
2 2


Let us majorize this last term by using the Lipschitz continuity of ∇f:

⟨∇f(x_k), x_k − x*⟩ = ⟨∇f(x_k) − ∇f(x*), x_k − x*⟩ ≤ L ‖x_k − x*‖² ≤ 2L ‖x_{k+1} − x*‖² + 2L ‖x_{k+1} − x_k‖².

Therefore

(1/√s)(E_{k+1} − E_k) + √μ E_{k+1} + [ √μ/(2s) + μ/√s − L(2βμ + 1/√s + √μ/2) ] ‖x_{k+1} − x_k‖²
    + [ β − β²√μ/2 + √s/2 ] ‖∇f(x_{k+1})‖² − 2βμL ‖x_{k+1} − x*‖² ≤ 0.

According to 0 ≤ β ≤ 1/√μ, we have β − β²√μ/2 ≥ β/2, which gives

(1/√s)(E_{k+1} − E_k) + √μ E_{k+1} + [ √μ/(2s) + μ/√s − L(2βμ + 1/√s + √μ/2) ] ‖x_{k+1} − x_k‖²
    + (β/2) ‖∇f(x_{k+1})‖² − 2βμL ‖x_{k+1} − x*‖² ≤ 0.
2

Let us use again the strong convexity of f to write

E_{k+1} ≥ ½ E_{k+1} + ½ ( f(x_{k+1}) − f(x*) ) ≥ ½ E_{k+1} + (μ/4) ‖x_{k+1} − x*‖².

Combining the two above relations we get

(1/√s)(E_{k+1} − E_k) + ½√μ E_{k+1} + [ μ√μ/4 − 2βμL ] ‖x_{k+1} − x*‖²
    + [ √μ/(2s) + μ/√s − L(2βμ + 1/√s + √μ/2) ] ‖x_{k+1} − x_k‖² + (β/2) ‖∇f(x_{k+1})‖² ≤ 0.

Let us examine the sign of the above quantities. Under the condition L ≤ √μ/(8β) we have μ√μ/4 − 2βμL ≥ 0. Under the condition L ≤ (√μ/(2s) + μ/√s)/(2βμ + 1/√s + √μ/2) we have √μ/(2s) + μ/√s − L(2βμ + 1/√s + √μ/2) ≥ 0. Therefore, under the above conditions,

(1/√s)(E_{k+1} − E_k) + ½√μ E_{k+1} + (β/2) ‖∇f(x_{k+1})‖² ≤ 0.


Set q = 1/(1 + ½√(μs)), which satisfies 0 < q < 1. By a similar argument as in Theorem 9,

E_k ≤ E_1 q^{k−1}.

Since, by definition, E_k ≥ f(x_k) − f(x*), we finally obtain

f(x_k) − f(x*) = O(q^k),

together with the linear convergence of x_k to x* and that of the gradients to zero. □




6 Numerical results

Here, we illustrate our results on the composite problem on H = R^n,

min_{x∈R^n}  { f(x) := ½ ‖y − Ax‖² + g(x) },        (RLS)

where A is a linear operator from R^n to R^m, m ≤ n, and g : R^n → R ∪ {+∞} is a proper lsc convex function which acts as a regularizer. Problem (RLS) is extremely popular in a variety of fields ranging from inverse problems in signal/image processing to machine learning and statistics. Typical examples of g include the ℓ₁ norm (Lasso), the ℓ₁ − ℓ₂ norm (group Lasso), the total variation, or the nuclear norm (the ℓ₁ norm of the singular values of x ∈ R^{N×N} identified with a vector in R^n with n = N²). To avoid trivialities, we assume that the set of minimizers of (RLS) is non-empty.
Though (RLS) is a composite non-smooth problem, it fits perfectly well into our framework. Indeed, the key idea is to appropriately choose the metric. For a symmetric positive definite matrix S ∈ R^{n×n}, denote the scalar product in the metric S by ⟨·, ·⟩_S and the corresponding norm by ‖·‖_S. When S = I, we simply use the shorthand notation ⟨·, ·⟩ and ‖·‖ for the Euclidean scalar product and norm. For a proper convex lsc function h, we denote by h_S and prox_h^S its Moreau envelope and proximal mapping in the metric S, i.e.

h_S(x) = min_{z∈R^n} ½ ‖z − x‖²_S + h(z),        prox_h^S(x) = argmin_{z∈R^n} ½ ‖z − x‖²_S + h(z).

Similarly, when S = I , we drop S in the above notation.


Let M = s^{−1} I − A*A. With the proviso that 0 < s‖A‖² < 1, M is a symmetric positive definite matrix. It can be easily shown (we provide a proof in Appendix A.2 for completeness; see also the discussion in [22, Section 4.6]) that the proximal mapping, in the metric M, of f as defined in (RLS) is

prox_f^M(x) = prox_{sg}( x + sA*(y − Ax) ),        (27)

which is nothing but the forward-backward fixed-point operator for the objective in (RLS). Moreover, f_M is a continuously differentiable convex function whose gradient (again in the metric M) is given by the standard identity

∇f_M(x) = x − prox_f^M(x),

and ‖∇f_M(x) − ∇f_M(z)‖_M ≤ ‖x − z‖_M, i.e. ∇f_M is Lipschitz continuous in the metric M. In addition, a standard argument shows that

argmin_H f = Fix(prox_f^M) = argmin_H f_M.

We are then in position to solve (RLS) by simply applying (IGAHD) (see Sect. 3.2) to f_M. We infer from Theorem 6 and the properties of f_M that

f( prox_f^M(x_k) ) − min_{R^n} f = O(k^{−2}).
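As a concrete illustration of this construction, here is a minimal NumPy sketch (synthetic data, illustrative parameter values) of the gradient of f_M computed through (27) in the Lasso case g = λ‖·‖₁, where prox_{sg} is soft-thresholding. Any of the first-order schemes discussed in the paper can then be run on f_M by calling grad_f_M; all names below are ours.

```python
import numpy as np

# Sketch: gradient of the envelope f_M in the metric M = s^{-1} I - A^T A,
# computed via (27) for g = lam * ||.||_1 (Lasso). Data and parameters are
# purely illustrative.
rng = np.random.default_rng(1)
m, n = 20, 50
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
lam = 0.1
s = 0.9 / np.linalg.norm(A, 2) ** 2          # ensures 0 < s ||A||^2 < 1

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def prox_f_M(x):
    # forward-backward fixed-point operator, cf. (27)
    return soft_threshold(x + s * A.T @ (y - A @ x), s * lam)

def grad_f_M(x):
    # gradient of f_M in the metric M: x - prox_f^M(x)
    return x - prox_f_M(x)

x = np.zeros(n)
print(np.linalg.norm(grad_f_M(x)))           # feed grad_f_M to (IGAHD), FISTA, etc.
```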

(IGAHD) and FISTA (i.e. (IGAHD) with β = 0) were applied to f_M with four instances of g: the ℓ₁ norm, the ℓ₁ − ℓ₂ norm, the total variation, and the nuclear norm. The results are depicted in Fig. 3. One can clearly see that the convergence profiles observed for both algorithms agree with the predicted rate. Moreover, (IGAHD) exhibits, as expected, fewer oscillations than FISTA, and eventually converges faster.

7 Conclusion, perspectives

As a guideline to our study, the inertial dynamics with Hessian driven damping give
rise to a new class of first-order algorithms for convex optimization. While retaining
the fast convergence of the function values reminiscent of the Nesterov accelerated
algorithm, they benefit from additional favorable properties among which the most
important are:
• fast convergence of gradients towards zero;
• global convergence of the iterates to optimal solutions;
• extension to the non-smooth setting;
• acceleration via time scaling factors.
This article contains the core of our study with a particular focus on the gradient and
proximal methods. The results thus obtained pave the way to new research avenues.
For instance:
• as initiated in Sect. 6, apply these results to structured composite optimization
problems beyond (RLS) and develop corresponding splitting algorithms;
• with the additional gradient estimates, we can expect restart methods to work better in the presence of the Hessian damping term;
• deepen the link between our study and the Newton and Levenberg–Marquardt dynamics and algorithms (e.g., [13]), and with the Ravine method [23];
• the inertial dynamic with Hessian-driven damping goes well with tame analysis and the Kurdyka–Łojasiewicz property [2], suggesting that the corresponding algorithms be developed in a non-convex (or even non-smooth) setting.


Fig. 3 Evolution of f(prox_f^M(x_k)) − min_{R^n} f, where x_k is the iterate of either (IGAHD) or FISTA, when solving (RLS) with different regularizers g

A Auxiliary results

A.1 Extended descent lemma

Lemma 1 Let f : H → R be a convex function whose gradient is L-Lipschitz continuous. Let s ∈ ]0, 1/L]. Then for all (x, y) ∈ H², we have

f( y − s∇f(y) ) ≤ f(x) + ⟨∇f(y), y − x⟩ − (s/2) ‖∇f(y)‖² − (s/2) ‖∇f(x) − ∇f(y)‖².        (28)
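As a quick numerical sanity check (not part of the original argument; data and tolerance are illustrative), inequality (28) can be verified on a random convex quadratic with s = 1/L:

```python
import numpy as np

# Numerical sanity check of (28) for f(x) = 0.5 <Qx, x> with Q >= 0,
# gradient Qx, Lipschitz constant L = ||Q||, and s = 1/L (illustrative data).
rng = np.random.default_rng(2)
n = 30
B = rng.standard_normal((n, n))
Q = B.T @ B                                   # symmetric positive semidefinite
L = np.linalg.eigvalsh(Q).max()
s = 1.0 / L

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

violations = 0
for _ in range(1000):
    x, yv = rng.standard_normal(n), rng.standard_normal(n)
    gy, gx = grad(yv), grad(x)
    lhs = f(yv - s * gy)
    rhs = f(x) + gy @ (yv - x) - 0.5 * s * gy @ gy - 0.5 * s * (gx - gy) @ (gx - gy)
    violations += lhs > rhs + 1e-10
print(violations)                             # expected: 0
```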

Proof Denote y⁺ = y − s∇f(y). By the standard descent lemma applied to y⁺ and y, and since sL ≤ 1, we have

f(y⁺) ≤ f(y) − (s/2)(2 − Ls) ‖∇f(y)‖² ≤ f(y) − (s/2) ‖∇f(y)‖².        (29)


We now argue by duality between strong convexity and Lipschitz continuity of the gradient of a convex function. Indeed, using the Fenchel identity, we have

f(y) = ⟨∇f(y), y⟩ − f*(∇f(y)).

L-Lipschitz continuity of the gradient of f is equivalent to 1/L-strong convexity of its conjugate f*. This, together with the fact that (∇f)^{−1} = ∂f*, gives for all (x, y) ∈ H²,

f*(∇f(y)) ≥ f*(∇f(x)) + ⟨x, ∇f(y) − ∇f(x)⟩ + (1/(2L)) ‖∇f(x) − ∇f(y)‖².

Inserting this inequality into the Fenchel identity above yields

f(y) ≤ −f*(∇f(x)) + ⟨∇f(y), y⟩ − ⟨x, ∇f(y) − ∇f(x)⟩ − (1/(2L)) ‖∇f(x) − ∇f(y)‖²
     = −f*(∇f(x)) + ⟨x, ∇f(x)⟩ + ⟨∇f(y), y − x⟩ − (1/(2L)) ‖∇f(x) − ∇f(y)‖²
     = f(x) + ⟨∇f(y), y − x⟩ − (1/(2L)) ‖∇f(x) − ∇f(y)‖²
     ≤ f(x) + ⟨∇f(y), y − x⟩ − (s/2) ‖∇f(x) − ∇f(y)‖².

Inserting the last bound into (29) completes the proof. □




A.2 Proof of (27)

Proof We have

prox_f^M(x) = argmin_{z∈R^n} ½ ‖z − x‖²_M + f(z)
            = argmin_{z∈R^n} (1/(2s)) ‖z − x‖² − ½ ‖A(z − x)‖² + ½ ‖y − Az‖² + g(z).

By the Pythagoras relation, we then get

prox_f^M(x) = argmin_{z∈R^n} (1/(2s)) ‖z − x‖² + ½ ‖y − Ax‖² − ⟨A(x − z), Ax − y⟩ + g(z)
            = argmin_{z∈R^n} (1/(2s)) ‖z − x‖² − ⟨z − x, A*(y − Ax)⟩ + g(z)
            = argmin_{z∈R^n} (1/(2s)) ‖z − x − sA*(y − Ax)‖² + g(z)
            = prox_{sg}( x + sA*(y − Ax) ).        □





A.3 Closed-form solutions of (DIN-AVD)_{α,β,b} for quadratic functions

We here provide the closed-form solutions to (DIN-AVD)_{α,β,b} for the quadratic objective f : R^n → R, f(x) = ⟨Ax, x⟩, where A is a symmetric positive definite matrix. The case of a positive semidefinite matrix A can be treated similarly by restricting the analysis to ker(A)^⊥. Projecting (DIN-AVD)_{α,β,b} onto the eigenspaces of A, one has to solve n independent one-dimensional ODEs of the form

ẍ_i(t) + ( α/t + β(t)λ_i ) ẋ_i(t) + λ_i b(t) x_i(t) = 0,   i = 1, . . . , n,

where λ_i > 0 is an eigenvalue of A. In the following, we drop the subscript i.

Case β(t) ≡ β, b(t) = b + γ/t, β ≥ 0, b > 0, γ ≥ 0. The ODE reads

ẍ(t) + ( α/t + βλ ) ẋ(t) + λ( b + γ/t ) x(t) = 0.        (30)

• If β²λ² ≠ 4bλ: set

ξ = √(β²λ² − 4bλ),   κ = λ(γ − αβ/2)/ξ,   σ = (α − 1)/2.

Using the relationship between the Whittaker functions and Kummer's confluent hypergeometric functions M and U, see [16], the solution to (30) can be shown to take the form

x(t) = ξ^{α/2} e^{−(βλ+ξ)t/2} [ c₁ M(α/2 − κ, α, ξt) + c₂ U(α/2 − κ, α, ξt) ],

where c₁ and c₂ are constants given by the initial conditions.
• If β²λ² = 4bλ: set ζ = 2√(λ(γ − αβ/2)). The solution to (30) takes the form

x(t) = t^{−(α−1)/2} e^{−βλt/2} [ c₁ J_{α−1}(ζ√t) + c₂ Y_{α−1}(ζ√t) ],

where J_ν and Y_ν are the Bessel functions of the first and second kind.
When β > 0, one can clearly see the exponential decay forced by the Hessian. From the asymptotic expansions of M, U, J_ν and Y_ν for large t, straightforward computations provide the behaviour of |x(t)| for large t as follows (a numerical sanity check is sketched after the list):
• If β²λ² > 4bλ, we have

|x(t)| = O( t^{−α/2 + |κ|} e^{−(βλ−ξ)t/2} ) = O( e^{−(2bλ/(βλ+ξ)) t − (α/2 − |κ|) log t} ).

• If β²λ² < 4bλ, whence ξ ∈ iR₊* and κ ∈ iR, we have

|x(t)| = O( t^{−α/2} e^{−βλt/2} ).


• If β²λ² = 4bλ, we have

|x(t)| = O( t^{−(2α−1)/4} e^{−βλt/2} ).
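The following small script (our own sketch; it uses SciPy's solve_ivp, and the parameters are illustrative, not taken from the paper) integrates the scalar ODE (30) numerically and checks that |x(t)| indeed stays within the envelope t^{−α/2} e^{−βλt/2} predicted above in the oscillatory case β²λ² < 4bλ:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Numerical check (illustrative parameters) of the decay predicted for (30)
# when beta^2*lam^2 < 4*b*lam:  |x(t)| = O(t^{-alpha/2} e^{-beta*lam*t/2}).
alpha, beta, lam, b, gamma = 3.0, 1.0, 2.0, 1.0, 0.0   # beta^2 lam^2 = 4 < 8 = 4 b lam
t0, t1 = 1.0, 12.0

def rhs(t, z):
    x, v = z
    return [v, -(alpha / t + beta * lam) * v - lam * (b + gamma / t) * x]

sol = solve_ivp(rhs, (t0, t1), [1.0, 0.0], rtol=1e-10, atol=1e-12, dense_output=True)

t = np.linspace(4.0, t1, 2000)
weighted = np.abs(sol.sol(t)[0]) * np.exp(beta * lam * t / 2) * t ** (alpha / 2)
print(weighted.max())        # stays bounded: the exponential envelope is confirmed
```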

Case β(t) = t^β, b(t) = c t^{β−1}, β ≥ 0, c > 0. The ODE now reads

ẍ(t) + ( α/t + λ t^β ) ẋ(t) + cλ t^{β−1} x(t) = 0.

Let us make the change of variable t := τ^{1/(β+1)} and set y(τ) := x(τ^{1/(β+1)}). By the standard chain rule, it is straightforward to show that y obeys the ODE

ÿ(τ) + ( (α+β)/((1+β)τ) + λ/(1+β) ) ẏ(τ) + ( cλ/((1+β)²τ) ) y(τ) = 0.

It is clear that this is a special case of (30). Since β ≥ 0 and λ > 0, set

ξ = λ/(1+β),   κ = −(α + β − 2c)/(2(1+β)),   σ = (α+β)/(2(1+β)) − 1/2.

It follows from the first case above that, with τ = t^{1+β},

x(t) = ξ^{σ+1/2} e^{−λτ/(1+β)} [ c₁ M( σ − κ + 1/2, (α+β)/(1+β), ξτ ) + c₂ U( σ − κ + 1/2, (α+β)/(1+β), ξτ ) ].

Asymptotic estimates can also be derived similarly to above. We omit the details for
the sake of brevity.

References
1. Álvarez, F.: On the minimizing property of a second-order dissipative system in Hilbert spaces. SIAM
J. Control Optim. 38(4), 1102–1119 (2000)
2. Álvarez, F., Attouch, H., Bolte, J., Redont, P.: A second-order gradient-like dissipative dynamical
system with Hessian-driven damping. Application to optimization and mechanics. J. Math. Pures
Appl. 81(8), 747–779 (2002)
3. Apidopoulos, V., Aujol, J.-F., Dossal, C.: Convergence rate of inertial Forward–Backward algorithm
beyond Nesterov’s rule. Math. Program. Ser. B. 180, 137–156 (2020)
4. Attouch, H., Cabot, A.: Asymptotic stabilization of inertial gradient dynamics with time-dependent
viscosity. J. Differ. Equ. 263, 5412–5458 (2017)
5. Attouch, H., Cabot, A.: Convergence rates of inertial forward–backward algorithms. SIAM J. Optim.
28(1), 849–874 (2018)
6. Attouch, H., Cabot, A., Chbani, Z., Riahi, H.: Rate of convergence of inertial gradient dynamics with
time-dependent viscous damping coefficient. Evol. Equ. Control Theory 7(3), 353–371 (2018)
7. Attouch, H., Chbani, Z., Riahi, H.: Fast proximal methods via time scaling of damped inertial dynamics.
SIAM J. Optim. 29(3), 2227–2256 (2019)
8. Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algo-
rithms with asymptotic vanishing viscosity. Math. Program. Ser. B. 168, 123–175 (2018)
9. Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the Nesterov accelerated gradient method
in the subcritical case α ≤ 3. ESAIM Control Optim. Calc. Var. 25, 2–35 (2019)


10. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward–backward
method is actually faster than 1/k 2 . SIAM J. Optim. 26(3), 1824–1834 (2016)
11. Attouch, H., Peypouquet, J., Redont, P.: A dynamical approach to an inertial forward–backward algo-
rithm for convex minimization. SIAM J. Optim. 24(1), 232–256 (2014)
12. Attouch, H., Peypouquet, J., Redont, P.: Fast convex minimization via inertial dynamics with Hessian
driven damping. J. Diffe. Equ. 261(10), 5734–5783 (2016)
13. Attouch, H., Svaiter, B. F.: A continuous dynamical Newton-Like approach to solving monotone
inclusions. SIAM J. Control Optim. 49(2), 574–598 (2011). Global convergence of a closed-loop
regularized Newton method for solving monotone inclusions in Hilbert spaces. J. Optim. Theory Appl.
157(3), 624–650 (2013)
14. Aujol, J.-F., Dossal, Ch.: Stability of over-relaxations for the forward-backward algorithm, application
to FISTA. SIAM J. Optim. 25(4), 2408–2433 (2015)
15. Aujol, J.-F., Dossal, C.: Optimal rate of convergence of an ODE associated to the Fast Gradient Descent
schemes for b > 0 (2017). https://hal.inria.fr/hal-01547251v2
16. Bateman, H.: Higher Transcendental Functions, vol. 1. McGraw-Hill, New York (1953)
17. Bauschke, H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces.
CMS Books in Mathematics, Springer (2011)
18. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM J. Imaging Sci. 2(1), 183–202 (2009)
19. Brézis, H.: Opérateurs maximaux monotones dans les espaces de Hilbert et équations d’évolution,
Lecture Notes 5, North Holland, (1972)
20. Cabot, A., Engler, H., Gadat, S.: On the long time behavior of second order differential equations with
asymptotically small dissipation. Trans. Am. Math. Soc. 361, 5983–6017 (2009)
21. Chambolle, A., Dossal, Ch.: On the convergence of the iterates of the fast iterative shrinkage thresh-
olding algorithm. J. Optim. Theory Appl. 166, 968–982 (2015)
22. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25,
161–319 (2016)
23. Gelfand, I.M., Zejtlin, M.: Printsip nelokal'nogo poiska v sistemakh avtomaticheskoi optimizatsii. Dokl. AN SSSR 137, 295–298 (1961) (in Russian)
24. May, R.: Asymptotic for a second-order evolution equation with convex potential and vanishing damp-
ing term. Turk. J. Math. 41(3), 681–685 (2017)
25. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k 2 ).
Sov. Math. Doklady 27, 372–376 (1983)
26. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 152(1–
2), 381–404 (2015)
27. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. U.S.S.R. Comput.
Math. Math. Phys. 4, 1–17 (1964)
28. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)
29. Siegel, W.: Accelerated first-order methods: differential equations and Lyapunov functions.
arXiv:1903.05671v1 [math.OC] (2019)
30. Shi, B., Du, S.S., Jordan, M.I., Su, W.J.: Understanding the acceleration phenomenon via high-
resolution differential equations. arXiv:submit/2440124 [cs.LG] 21 Oct 2018
31. Su, W.J., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient
method: theory and insights. NIPS’14 27, 2510–2518 (2014)
32. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization.
arXiv:1611.02635 (2016)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
