First-Order Optimization Algorithms Via Inertial Systems With Hessian Driven Damping
https://doi.org/10.1007/s10107-020-01591-1
Received: 24 July 2019 / Accepted: 3 November 2020 / Published online: 16 November 2020
© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2020
Abstract
In a Hilbert space setting, for convex optimization, we analyze the convergence rate of
a class of first-order algorithms involving inertial features. They can be interpreted as
discrete time versions of inertial dynamics involving both viscous and Hessian-driven
dampings. The geometrical damping driven by the Hessian intervenes in the dynamics
in the form $\nabla^2 f(x(t))\dot{x}(t)$. By treating this term as the time derivative of $\nabla f(x(t))$, the discretization yields algorithms that are first-order in time and space. In addition to
the convergence properties attached to Nesterov-type accelerated gradient methods,
the algorithms thus obtained are new and show a rapid convergence towards zero of
the gradients. On the basis of a regularization technique using the Moreau envelope,
we extend these methods to non-smooth convex functions with extended real values.
The introduction of time scale factors makes it possible to further accelerate these
algorithms. We also report numerical results on structured problems to support our
theoretical findings.
B Jalal Fadili
[email protected]
Hedy Attouch
[email protected]
Zaki Chbani
[email protected]
Hassan Riahi
[email protected]
114 H. Attouch et al.
1 Introduction
As a guide in our study, we will rely on the asymptotic behavior, when t → +∞,
of the trajectories of the inertial system with Hessian-driven damping
\[
\ddot{x}(t) + \gamma(t)\dot{x}(t) + \beta(t)\nabla^2 f(x(t))\,\dot{x}(t) + b(t)\nabla f(x(t)) = 0,
\]
where $\gamma(t)$ and $\beta(t)$ are damping parameters, and $b(t)$ is a time scale parameter.
The time discretization of this system will provide a rich family of first-order
methods for minimizing f . At first glance, the presence of the Hessian may seem to
entail numerical difficulties. However, this is not the case as the Hessian intervenes in
the above ODE in the form $\nabla^2 f(x(t))\dot{x}(t)$, which is nothing but the derivative w.r.t. time of $\nabla f(x(t))$. This explains why the time discretization of this dynamic provides
first-order algorithms. Thus, the Nesterov extrapolation scheme [25,26] is modified
by the introduction of the difference of the gradients at consecutive iterates. This gives
algorithms of the form
\[
\begin{cases}
y_k = x_k + \alpha_k(x_k - x_{k-1}) - \beta_k\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr)\\
x_{k+1} = T(y_k),
\end{cases}
\]
\[
\text{(DIN-AVD)}_{\alpha,\beta,b}\qquad \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta(t)\nabla^2 f(x(t))\,\dot{x}(t) + b(t)\nabla f(x(t)) = 0.
\]
In the case $\beta \equiv 0$, $\alpha = 3$, $b(t) \equiv 1$, it can be interpreted as a continuous version of the Nesterov accelerated gradient method [31]. Accordingly, in this case we will obtain $\mathcal{O}(t^{-2})$ convergence rates for the objective values.
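As a rough numerical illustration of the two-step template above (our sketch, not the paper's algorithm verbatim), the following runs $y_k = x_k + \alpha_k(x_k - x_{k-1}) - \beta_k(\nabla f(x_k) - \nabla f(x_{k-1}))$, $x_{k+1} = T(y_k)$ with $T$ a plain gradient step, on the toy function $f(x) = x^2/2$; the step size $s$, the correction weight $\beta$, and the Nesterov-style coefficient $\alpha_k = k/(k+3)$ are illustrative choices.

```python
# Illustrative sketch (assumed parameters, not from the paper) of the template
#   y_k = x_k + a_k (x_k - x_{k-1}) - b_k (grad f(x_k) - grad f(x_{k-1})),
#   x_{k+1} = T(y_k),
# with T a plain gradient step, run on f(x) = x^2 / 2 in one dimension.

def grad(x):
    return x                      # gradient of f(x) = 0.5 * x**2

s, beta = 0.1, 0.05               # step size and correction weight (assumed values)
x_prev = x = 1.0
for k in range(1, 201):
    a_k = k / (k + 3.0)           # Nesterov-style extrapolation coefficient
    y = x + a_k * (x - x_prev) - beta * (grad(x) - grad(x_prev))
    x_prev, x = x, y - s * grad(y)    # T = gradient step

assert abs(x) < 1e-3              # iterates approach the minimizer x* = 0
```

The gradient-difference term is what distinguishes this template from plain Nesterov extrapolation; it is the discrete footprint of the Hessian-driven damping.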
• For a μ-strongly convex function f , we will rely on the autonomous inertial system
with Hessian driven damping
\[
\text{(DIN)}_{2\sqrt{\mu},\beta}\qquad \ddot{x}(t) + 2\sqrt{\mu}\,\dot{x}(t) + \beta\nabla^2 f(x(t))\,\dot{x}(t) + \nabla f(x(t)) = 0,
\]
and show exponential (linear) convergence rate for both objective values and gra-
dients.
For an appropriate setting of the parameters, the time discretization of these dynamics
provides first-order algorithms with fast convergence properties. Notably, we will show
a rapid convergence towards zero of the gradients.
B. Polyak initiated the use of inertial dynamics to accelerate the gradient method in
optimization. In [27,28], based on the inertial system with a fixed viscous damping coefficient $\gamma > 0$,
\[
\text{(HBF)}\qquad \ddot{x}(t) + \gamma\dot{x}(t) + \nabla f(x(t)) = 0,
\]
he introduced the Heavy Ball with Friction method. For a strongly convex function
$f$, (HBF) provides convergence of $f(x(t))$ to $\min_{\mathcal H} f$ at an exponential rate. For general convex functions, the asymptotic convergence rate of (HBF) is $\mathcal{O}(1/t)$ (in the worst case). This, however, is not better than the steepest descent. A decisive step to
improve (HBF) was taken by Alvarez–Attouch–Bolte–Redont [2] by introducing the
Hessian-driven damping term β∇ 2 f (x(t))ẋ(t), that is (DIN)0,β . The next important
step was accomplished by Su–Boyd–Candès [31] with the introduction of a vanishing
viscous damping coefficient $\gamma(t) = \alpha/t$, that is (AVD)$_\alpha$ (see Sect. 1.1.2). The system
(DIN-AVD)α,β,1 (see Sect. 2) has emerged as a combination of (DIN)0,β and (AVD)α .
Let us review some basic facts concerning these systems. The inertial system
\[
\text{(DIN)}_{\gamma,\beta}\qquad \ddot{x}(t) + \gamma\dot{x}(t) + \beta\nabla^2 f(x(t))\,\dot{x}(t) + \nabla f(x(t)) = 0
\]
was introduced in [2]. In line with (HBF), it contains a fixed positive friction coefficient
γ . The introduction of the Hessian-driven damping makes it possible to neutralize the
transversal oscillations likely to occur with (HBF), as observed in [2] in the case of the Rosenbrock function. The need to take a geometric damping adapted to $f$ had already been observed by Alvarez [1], who considered
\[
\ddot{x}(t) + \Gamma\dot{x}(t) + \nabla f(x(t)) = 0,
\]
where $\Gamma: \mathcal{H} \to \mathcal{H}$ is a linear positive anisotropic operator. But still this damping operator is fixed. For a general convex function, the Hessian-driven damping in (DIN)$_{\gamma,\beta}$
performs a similar operation in a closed-loop adaptive way. The terminology (DIN)
stands shortly for Dynamical Inertial Newton. It refers to the natural link between this
dynamic and the continuous Newton method.
\[
\text{(AVD)}_{\alpha}\qquad \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla f(x(t)) = 0,
\]
was introduced in the context of convex optimization in [31]. For general convex func-
tions it provides a continuous version of the accelerated gradient method of Nesterov.
For $\alpha \ge 3$, each trajectory $x(\cdot)$ of (AVD)$_\alpha$ satisfies the asymptotic convergence rate of the values $f(x(t)) - \inf_{\mathcal H} f = \mathcal{O}(1/t^2)$. As a specific feature, the viscous damping coefficient $\alpha/t$ vanishes (tends to zero) as time $t$ goes to infinity, hence the terminology.
The convergence properties of the dynamic (AVD)α have been the subject of many
recent studies, see [3–6,8–10,14,15,24,31]. They helped to explain why $\alpha/t$ is a wise choice of the damping coefficient.
In [20], the authors showed that a vanishing damping coefficient γ (·) dissipates
the energy, and hence makes the dynamic interesting for optimization, as long as
\[
\int_{t_0}^{+\infty} \gamma(t)\,dt = +\infty.
\]
The damping coefficient can thus go to zero asymptotically, but not too fast. The smallest admissible decay is of order $1/t$. It enforces the inertial effect with respect to the friction effect.
The tuning of the parameter $\alpha$ in front of $1/t$ comes from the Lyapunov analysis and
the optimality of the convergence rates obtained. The case α = 3, which corresponds
to Nesterov’s historical algorithm, is critical. In the case α = 3, the question of the
convergence of the trajectories remains an open problem (except in one dimension
where convergence holds [9]). As a remarkable property, for α > 3, it has been shown
by Attouch–Chbani–Peypouquet–Redont [8] that each trajectory converges weakly to
a minimizer. The corresponding algorithmic result has been obtained by Chambolle–
Dossal [21]. For α > 3, it is shown in [10,24] that the asymptotic convergence rate
of the values is actually o(1/t 2 ). The subcritical case α ≤ 3 has been examined by
Apidopoulos–Aujol–Dossal [3] and Attouch–Chbani–Riahi [9], with the convergence rate of the objective values $\mathcal{O}\bigl(t^{-2\alpha/3}\bigr)$. These rates are optimal, that is, they can be reached, or approached arbitrarily closely.
The inertial system with a general damping coefficient γ (·) was recently studied by
Attouch–Cabot in [4,5], and Attouch–Cabot–Chbani–Riahi in [6].
\[
\text{(DIN-AVD)}_{\alpha,\beta}\qquad \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta\nabla^2 f(x(t))\,\dot{x}(t) + \nabla f(x(t)) = 0,
\]
was introduced in [11]. It combines the two types of damping considered above.
Its formulation looks at a first glance more complicated than (AVD)α . Attouch-
Peypouquet-Redont [12] showed that (DIN-AVD)α,β is equivalent to the first-order
system in time and space
\[
\begin{cases}
\dot{x}(t) + \beta\nabla f(x(t)) - \Bigl(\dfrac{1}{\beta} - \dfrac{\alpha}{t}\Bigr)x(t) + \dfrac{1}{\beta}y(t) = 0;\\[8pt]
\dot{y}(t) - \Bigl(\dfrac{1}{\beta} - \dfrac{\alpha}{t} + \dfrac{\alpha\beta}{t^2}\Bigr)x(t) + \dfrac{1}{\beta}y(t) = 0.
\end{cases}
\]
The parameters are chosen so as to obey the condition $\alpha > 3$. Starting with initial conditions $(x_1(1), x_2(1)) =$
(1, 1), (ẋ1 (1), ẋ2 (1)) = (0, 0), we have the trajectories displayed in Fig. 1. This
illustrates the typical situation of an ill-conditioned minimization problem, where the
wild oscillations of (AVD)α are neutralized by the Hessian damping in (DIN-AVD)α,β
(see Appendix A.3 for further details).
Let us describe our main convergence rates for the gradient type algorithms. Corre-
sponding results for the proximal algorithms are also obtained.
Fig. 1 Evolution of the objective (left) and trajectories (right) for (AVD)α (α = 3.1) and (DIN-AVD)α,β
(α = 3.1, β = 1) on an ill-conditioned quadratic problem in R2
\[
\text{(IGAHD)}\qquad
\begin{cases}
y_k = x_k + \Bigl(1 - \dfrac{\alpha}{k}\Bigr)(x_k - x_{k-1}) - \beta\sqrt{s}\,\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr) - \dfrac{\beta\sqrt{s}}{k}\nabla f(x_{k-1})\\[6pt]
x_{k+1} = y_k - s\nabla f(y_k).
\end{cases}
\]
Suppose that $\alpha \ge 3$, $0 < \beta < 2\sqrt{s}$, and $sL \le 1$. In Theorem 6, we show that
\begin{itemize}
\item[(i)] $f(x_k) - \min_{\mathcal H} f = \mathcal{O}\bigl(\frac{1}{k^2}\bigr)$ as $k \to +\infty$;
\item[(ii)] $\sum_k k^2\|\nabla f(y_k)\|^2 < +\infty$ and $\sum_k k^2\|\nabla f(x_k)\|^2 < +\infty$.
\end{itemize}
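The scheme above is simple to run. The following sketch (ours, with illustrative parameters satisfying the hypotheses: $\alpha = 3$, $s = 0.1$ so $sL \le 1$ for $f(x) = x^2/2$, and $\beta = 0.5 < 2\sqrt{s}$) applies the (IGAHD)-type iteration in one dimension.

```python
# Sketch (assumed parameters, ours) of the (IGAHD)-type iteration on
# f(x) = x^2 / 2, for which L = 1 and grad f(x) = x.
import math

def grad(x):
    return x

alpha, s, beta = 3.0, 0.1, 0.5
rs = beta * math.sqrt(s)          # the combination beta * sqrt(s) used by the scheme
x_prev = x = 1.0
for k in range(1, 2001):
    a_k = 1.0 - alpha / k
    y = x + a_k * (x - x_prev) - rs * (grad(x) - grad(x_prev)) - (rs / k) * grad(x_prev)
    x_prev, x = x, y - s * grad(y)

assert abs(x) < 1e-2              # consistent with the O(1/k^2) decay of the values
```

On this strongly convex toy problem the observed decay is in fact much faster than the worst-case $\mathcal{O}(1/k^2)$ guarantee.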
\[
x_{k+1} = x_k + \frac{1 - \sqrt{\mu s}}{1 + \sqrt{\mu s}}(x_k - x_{k-1}) - \frac{\beta\sqrt{s}}{1 + \sqrt{\mu s}}\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr) - \frac{s}{1 + \sqrt{\mu s}}\nabla f(x_k).
\]
Assuming that $\nabla f$ is $L$-Lipschitz continuous with $sL$ sufficiently small, and $\beta \le \frac{1}{\sqrt{\mu}}$, it is shown in Theorem 11 that, with $q = \frac{1}{1 + \frac12\sqrt{\mu s}}$ ($0 < q < 1$),
\[
f(x_k) - \min_{\mathcal H} f = \mathcal{O}(q^k) \quad\text{and}\quad \|x_k - x^\star\| = \mathcal{O}(q^{k/2}) \quad\text{as } k \to +\infty,
\]
1.3 Contents
The paper is organized as follows. Sections 2 and 3 deal with the case of general
convex functions, respectively in the continuous case and the algorithmic cases. We
improve the Nesterov convergence rates by showing in addition fast convergence of
the gradients. Sections 4 and 5 deal with the same questions in the case of strongly
convex functions, in which case, linear convergence results are obtained. Section 6 is
devoted to numerical illustrations. We conclude with some perspectives.
Our analysis deals with the inertial system with Hessian-driven damping
\[
\text{(DIN-AVD)}_{\alpha,\beta,b}\qquad \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta(t)\nabla^2 f(x(t))\,\dot{x}(t) + b(t)\nabla f(x(t)) = 0.
\]
We start by stating a fairly general theorem on the convergence rates and integrability
properties of (DIN-AVD)α,β,b under appropriate conditions on the parameter functions
β(t) and b(t). As we will discuss shortly, it turns out that for some specific choices of
the parameters, one can recover most of the related results existing in the literature.
The following quantities play a central role in our analysis:
\[
w(t) := b(t) - \dot{\beta}(t) - \frac{\beta(t)}{t} \quad\text{and}\quad \delta(t) := t^2 w(t). \tag{1}
\]
\[
(\mathrm{G}_2)\quad b(t) > \dot{\beta}(t) + \frac{\beta(t)}{t};\qquad
(\mathrm{G}_3)\quad t\dot{w}(t) \le (\alpha - 3)\,w(t).
\]
\begin{itemize}
\item[(i)] $f(x(t)) - \min_{\mathcal H} f = \mathcal{O}\Bigl(\dfrac{1}{t^2 w(t)}\Bigr)$ as $t \to +\infty$;
\item[(ii)] $\displaystyle\int_{t_0}^{+\infty} t^2\beta(t)w(t)\,\|\nabla f(x(t))\|^2\,dt < +\infty$;
\item[(iii)] $\displaystyle\int_{t_0}^{+\infty} t\bigl((\alpha-3)w(t) - t\dot{w}(t)\bigr)\bigl(f(x(t)) - \min_{\mathcal H} f\bigr)\,dt < +\infty$.
\end{itemize}
\[
E(t) := \delta(t)\bigl(f(x(t)) - f(x^\star)\bigr) + \tfrac12\|v(t)\|^2, \tag{2}
\]
\[
\frac{d}{dt}E(t) = \dot{\delta}(t)\bigl(f(x(t)) - f(x^\star)\bigr) + \delta(t)\langle\nabla f(x(t)), \dot{x}(t)\rangle + \langle v(t), \dot{v}(t)\rangle. \tag{3}
\]
\[
\begin{aligned}
\langle v(t), \dot{v}(t)\rangle &= (\alpha - 1)\,t\Bigl(\dot{\beta}(t) + \frac{\beta(t)}{t} - b(t)\Bigr)\langle\nabla f(x(t)), x(t) - x^\star\rangle\\
&\quad + t^2\Bigl(\dot{\beta}(t) + \frac{\beta(t)}{t} - b(t)\Bigr)\langle\nabla f(x(t)), \dot{x}(t)\rangle\\
&\quad + t^2\beta(t)\Bigl(\dot{\beta}(t) + \frac{\beta(t)}{t} - b(t)\Bigr)\|\nabla f(x(t))\|^2.
\end{aligned}
\]
Let us go back to (3). According to the choice of $\delta(t)$, the terms $\langle\nabla f(x(t)), \dot{x}(t)\rangle$ cancel, which gives
\[
\frac{d}{dt}E(t) = \dot{\delta}(t)\bigl(f(x(t)) - f(x^\star)\bigr) + \frac{(\alpha - 1)}{t}\delta(t)\langle\nabla f(x(t)), x^\star - x(t)\rangle - \beta(t)\delta(t)\|\nabla f(x(t))\|^2.
\]
Condition $(\mathrm{G}_2)$ gives $\delta(t) > 0$. Combining this equation with convexity of $f$,
\[
\frac{d}{dt}E(t) + \beta(t)\delta(t)\|\nabla f(x(t))\|^2 + \Bigl(\frac{(\alpha - 1)}{t}\delta(t) - \dot{\delta}(t)\Bigr)\bigl(f(x(t)) - f(x^\star)\bigr) \le 0. \tag{4}
\]
\[
\frac{(\alpha - 1)}{t}\delta(t) - \dot{\delta}(t) \ge 0, \tag{6}
\]
which, by (4), gives $\frac{d}{dt}E(t) \le 0$. Therefore, $E(\cdot)$ is non-increasing, and hence $E(t) \le E(t_0)$. Since all the terms that enter $E(\cdot)$ are nonnegative, we obtain (i). Then, by integrating (4) over $[t_0, +\infty)$, we obtain
\[
\int_{t_0}^{+\infty} \beta(t)\delta(t)\,\|\nabla f(x(t))\|^2\,dt \le E(t_0) < +\infty,
\]
which gives (ii),
and
\[
\int_{t_0}^{+\infty} t\bigl((\alpha - 3)w(t) - t\dot{w}(t)\bigr)\bigl(f(x(t)) - f(x^\star)\bigr)\,dt \le E(t_0) < +\infty,
\]
which gives (iii).
As anticipated above, by specializing the functions β(t) and b(t), we recover most
known results in the literature; see hereafter for each specific case and related literature.
For all these cases, we will also argue for the interest of our generalization.
Case 1 The (DIN-AVD)$_{\alpha,\beta}$ system corresponds to $\beta(t) \equiv \beta$ and $b(t) \equiv 1$. In this case, $w(t) = 1 - \beta/t$. Conditions $(\mathrm{G}_2)$ and $(\mathrm{G}_3)$ are satisfied by taking $\alpha > 3$ and $t > \frac{\alpha-2}{\alpha-3}\beta$. Hence, as a consequence of Theorem 1, we obtain the following result of Attouch–Peypouquet–Redont [12]:
\[
f(x(t)) - \min_{\mathcal H} f = \mathcal{O}\Bigl(\frac{1}{t^2}\Bigr) \quad\text{and}\quad \int_{t_0}^{\infty} t^2\|\nabla f(x(t))\|^2\,dt < +\infty.
\]
scaling of (AVD)$_\alpha$. In this case, we have $w(t) = b(t)$. Condition $(\mathrm{G}_2)$ is equivalent to $b(t) > 0$, and $(\mathrm{G}_3)$ becomes
\[
t\dot{b}(t) \le (\alpha - 3)\,b(t),
\]
which is precisely the condition introduced in [7, Theorem 8.1]. Under this condition,
we have the convergence rate
\[
f(x(t)) - \min_{\mathcal H} f = \mathcal{O}\Bigl(\frac{1}{t^2 b(t)}\Bigr) \quad\text{as } t \to +\infty.
\]
This makes clear the acceleration effect due to the time scaling. For $b(t) = t^r$, we have $f(x(t)) - \min_{\mathcal H} f = \mathcal{O}\bigl(\frac{1}{t^{2+r}}\bigr)$, under the assumption $\alpha \ge 3 + r$.
Case 4 Let us illustrate our results in the case $b(t) = c\,t^{b}$, $\beta(t) = t^{\beta}$. We have $w(t) = c\,t^{b} - (\beta + 1)t^{\beta - 1}$ and $\dot{w}(t) = cb\,t^{b-1} - (\beta^2 - 1)t^{\beta - 2}$. The conditions $(\mathrm{G}_2)$, $(\mathrm{G}_3)$ can be written respectively as:
\[
c\,t^{b} > (\beta + 1)t^{\beta - 1} \quad\text{and}\quad c(b - \alpha + 3)t^{b} \le (\beta + 1)(\beta - \alpha + 2)t^{\beta - 1}. \tag{7}
\]
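These conditions are easy to test numerically. The following sanity check (ours, not from the paper) evaluates $(\mathrm{G}_2)$ and $(\mathrm{G}_3)$ at a few sample times for the illustrative exponent choices $c = 1$, $b = 1$, $\beta = 0$ (constant $\beta(t)$, linear time scaling) with $\alpha = 5$.

```python
# Numeric sanity check (ours) of conditions (G2), (G3) for the power choices
# b(t) = c * t**r and beta(t) = t**p, with illustrative values
# c = 1, r = 1, p = 0, alpha = 5 (exponent names changed to avoid clashes).
c, r, p, alpha = 1.0, 1.0, 0.0, 5.0

def w(t):        # w(t) = b(t) - beta'(t) - beta(t)/t = c*t**r - (p+1)*t**(p-1)
    return c * t**r - (p + 1) * t**(p - 1)

def wdot(t):     # w'(t) = c*r*t**(r-1) - (p**2 - 1)*t**(p-2)
    return c * r * t**(r - 1) - (p**2 - 1) * t**(p - 2)

samples = (2.0, 5.0, 10.0, 100.0)
ok_G2 = all(c * t**r > (p + 1) * t**(p - 1) for t in samples)          # (G2)
ok_G3 = all(t * wdot(t) <= (alpha - 3) * w(t) for t in samples)        # (G3)
assert ok_G2 and ok_G3
```

For this choice the conditions hold for all $t \ge \sqrt{3}$, in agreement with (7).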
Fig. 2 Convergence of the objective values and trajectories associated with the system (DIN-AVD)α,β,b
for different choices of β(t) and b(t)
Equivalently
Observe that this requires f to be only of class C 1 . Set now s = h 2 . We obtain the
following algorithm with βk and bk varying with k:
Step $k$: Set $\mu_k := \dfrac{k}{k+\alpha}\bigl(\beta_k\sqrt{s} + s b_k\bigr)$, and iterate
\[
\text{(IPAHD)}\qquad
\begin{cases}
y_k = x_k + \Bigl(1 - \dfrac{\alpha}{k+\alpha}\Bigr)(x_k - x_{k-1}) + \beta_k\sqrt{s}\,\Bigl(1 - \dfrac{\alpha}{k+\alpha}\Bigr)\nabla f(x_k)\\[6pt]
x_{k+1} = \operatorname{prox}_{\mu_k f}(y_k).
\end{cases}
\]
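For a quadratic the proximal step is available in closed form, which makes (IPAHD) easy to try out. The sketch below (ours, with assumed constant parameters $\beta_k \equiv \beta$, $b_k \equiv 1$) runs the scheme on $f(x) = x^2/2$, where $\operatorname{prox}_{\mu f}(y) = y/(1+\mu)$.

```python
# Sketch of the (IPAHD)-type iteration (assumed parameters, ours) on
# f(x) = x^2 / 2, for which prox_{mu f}(y) = y / (1 + mu) in closed form.
import math

alpha, s, beta = 4.0, 0.04, 0.1   # illustrative values; beta_k = beta, b_k = 1
rs = beta * math.sqrt(s)
x_prev = x = 1.0
for k in range(1, 2001):
    m = k / (k + alpha)           # = 1 - alpha / (k + alpha)
    mu_k = m * (rs + s)           # mu_k = (k/(k+alpha)) * (beta*sqrt(s) + s*b_k)
    y = x + m * (x - x_prev) + rs * m * x       # grad f(x) = x
    x_prev, x = x, y / (1.0 + mu_k)             # proximal step

assert abs(x) < 1e-3              # iterates approach the minimizer x* = 0
```

Being implicit in the gradient, the proximal step keeps the iteration stable even for step sizes where an explicit scheme would require care.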
Then, $\delta_k$ is positive and, for any sequence $(x_k)_{k\in\mathbb{N}}$ generated by (IPAHD),
\begin{itemize}
\item[(i)] $f(x_k) - \min_{\mathcal H} f = \mathcal{O}\Bigl(\dfrac{1}{\delta_k}\Bigr) = \mathcal{O}\Bigl(\dfrac{1}{k(k+1)\bigl(b_k h - \frac{\beta_{k+1}}{k} - (\beta_{k+1} - \beta_k)\bigr)}\Bigr)$;
\item[(ii)] $\sum_k \delta_k\beta_{k+1}\|\nabla f(x_{k+1})\|^2 < +\infty$.
\end{itemize}
Before delving into the proof, the following remarks on the choice/growth of the
parameters are in order.
Remark 1 We first observe that condition $(\mathrm{G}_2^{dis})$ is nothing but a forward (explicit) discretization of its continuous analogue $(\mathrm{G}_2)$. In addition, in view of (1), $(\mathrm{G}_3)$ equivalently reads
\[
t\dot{\delta}(t) \le (\alpha - 1)\,\delta(t).
\]
In turn, (9) and $(\mathrm{G}_3^{dis})$ are explicit discretizations of (1) and $(\mathrm{G}_3)$ respectively.
\[
\inf_k\Bigl(b_k h - \frac{\beta_{k+1}}{k} - (\beta_{k+1} - \beta_k)\Bigr) > 0, \tag{10}
\]
$\delta_k = h^2(k+1)(k - \beta/h)$ and $\beta_k$ is obviously non-increasing. Thus, if $\alpha > 3$, then one easily checks that (10) (hence $(\mathrm{G}_2^{dis})$) and $(\mathrm{G}_3^{dis})$ are in force for all $k \ge \frac{\alpha-2}{\alpha-3}\,\frac{\beta}{h} + \frac{2}{\alpha-3}$.
• Consider now the discrete counterpart of case 2 in Sect. 2.2. Take $\beta_k \equiv \beta > 0$ and $b_k = 1 + \beta/(hk)$.$^1$ Thus $\delta_k = h^2(k+1)k$. This case was studied in [30], both in the continuous setting and for the gradient algorithm, but not for the proximal algorithm. This choice is a special case of the one discussed above since $\beta_k$ is a constant sequence and $c = 1$. Thus (10) (hence $(\mathrm{G}_2^{dis})$) holds. $(\mathrm{G}_3^{dis})$ is also verified for all $k \ge \frac{2}{\alpha-3}$ as soon as $\alpha > 3$.
Proof Given $x^\star \in \operatorname{argmin}_{\mathcal H} f$, set
\[
E_k := \delta_k\bigl(f(x_k) - f(x^\star)\bigr) + \tfrac12\|v_k\|^2,
\]
where $v_k$ is the discrete analogue of the quantity $v(t)$ of the continuous case, and $(\delta_k)_{k\in\mathbb{N}}$ is a positive sequence that will be adjusted. Observe that $E_k$ is nothing but the discrete analogue of the Lyapunov function (2). Set $\Delta E_k := E_{k+1} - E_k$, i.e.,
\[
\Delta E_k = (\delta_{k+1} - \delta_k)\bigl(f(x_{k+1}) - f(x^\star)\bigr) + \delta_k\bigl(f(x_{k+1}) - f(x_k)\bigr) + \tfrac12\bigl(\|v_{k+1}\|^2 - \|v_k\|^2\bigr).
\]
Let us evaluate the last term of the above expression with the help of the three-point identity
\[
\tfrac12\|v_{k+1}\|^2 - \tfrac12\|v_k\|^2 = \langle v_{k+1} - v_k, v_{k+1}\rangle - \tfrac12\|v_{k+1} - v_k\|^2.
\]
Using successively the definition of $v_k$ and (8), we get
\[
\tfrac12\|v_{k+1}\|^2 - \tfrac12\|v_k\|^2 = -\frac{h^2}{2}C_k^2\|\nabla f(x_{k+1})\|^2 - hC_k\bigl\langle\nabla f(x_{k+1}),\ (\alpha - 1)(x_{k+1} - x^\star) + (k+1)\bigl(x_{k+1} - x_k + \beta_{k+1}h\nabla f(x_{k+1})\bigr)\bigr\rangle
\]
$^1$ One can even consider the more general case $b_k = 1 + b/(hk)$, $b > 0$, for which our discussion remains true under minor modifications. But we do not pursue this for the sake of simplicity.
\[
= -h^2\Bigl(\tfrac12 C_k^2 - C_k\beta_{k+1}\Bigr)\|\nabla f(x_{k+1})\|^2 - (\alpha - 1)hC_k\langle\nabla f(x_{k+1}), x^\star - x_{k+1}\rangle - hC_k(k+1)\langle\nabla f(x_{k+1}), x_k - x_{k+1}\rangle.
\]
Then, in the above expression, the coefficient of $\|\nabla f(x_{k+1})\|^2$ is less than or equal to zero, which gives
\[
\tfrac12\|v_{k+1}\|^2 - \tfrac12\|v_k\|^2 \le -(\alpha - 1)hC_k\langle\nabla f(x_{k+1}), x^\star - x_{k+1}\rangle - hC_k(k+1)\langle\nabla f(x_{k+1}), x_k - x_{k+1}\rangle.
\]
According to the (convex) subdifferential inequality and Ck < 0 (by (G2dis )), we infer
\[
\tfrac12\|v_{k+1}\|^2 - \tfrac12\|v_k\|^2 \le -(\alpha - 1)hC_k\bigl(f(x^\star) - f(x_{k+1})\bigr) - hC_k(k+1)\bigl(f(x_k) - f(x_{k+1})\bigr).
\]
Equivalently,
\[
E_{k+1} - E_k \le \Bigl(\delta_{k+1} - \delta_k - (\alpha - 1)\frac{\delta_k}{k+1}\Bigr)\bigl(f(x_{k+1}) - f(x^\star)\bigr).
\]
By assumption $(\mathrm{G}_3^{dis})$, we have $\delta_{k+1} - \delta_k - (\alpha - 1)\frac{\delta_k}{k+1} \le 0$. Therefore, the sequence $(E_k)_{k\in\mathbb{N}}$ is non-increasing, which, by definition of $E_k$, gives, for $k \ge 0$,
\[
f(x_k) - \min_{\mathcal H} f \le \frac{E_0}{\delta_k}.
\]
This proves (i). Coming back to the decrease of $E_k$, we also have
\[
E_{k+1} - E_k + h\Bigl(\frac{h}{2}\bigl(\beta_{k+1} + k(\beta_{k+1} - \beta_k) - b_k h k\bigr)^2 + \delta_k\beta_{k+1}\Bigr)\|\nabla f(x_{k+1})\|^2 \le 0.
\]
Summing these inequalities, we finally obtain
\[
\sum_k \delta_k\beta_{k+1}\|\nabla f(x_{k+1})\|^2 < +\infty.
\]
\[
f_\lambda(x) = \min_{z\in\mathcal H}\Bigl\{ f(z) + \frac{1}{2\lambda}\|z - x\|^2 \Bigr\}, \quad\text{for any } x \in \mathcal{H}.
\]
\[
\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta\nabla^2 f_\lambda(x(t))\,\dot{x}(t) + b(t)\nabla f_\lambda(x(t)) = 0.
\]
\[
\begin{cases}
y_k = x_k + \Bigl(1 - \dfrac{\alpha}{k+\alpha}\Bigr)(x_k - x_{k-1}) + \beta\sqrt{s}\,\Bigl(1 - \dfrac{\alpha}{k+\alpha}\Bigr)\nabla f_\lambda(x_k)\\[6pt]
x_{k+1} = \operatorname{prox}_{\frac{k}{k+\alpha}(\beta\sqrt{s} + s b_k)\, f_\lambda}(y_k).
\end{cases}
\]
By applying Theorem 4 we obtain that, under the assumptions $(\mathrm{G}_2^{dis})$ and $(\mathrm{G}_3^{dis})$,
\[
f_\lambda(x_k) - \min_{\mathcal H} f = \mathcal{O}\Bigl(\frac{1}{\delta_k}\Bigr), \qquad \sum_k \delta_k\beta_{k+1}\|\nabla f_\lambda(x_{k+1})\|^2 < +\infty.
\]
Thus, we just need to formulate these results in terms of f and its proximal mapping.
This is straightforward thanks to the following formulae from proximal calculus [17]:
\[
f_\lambda(x) = f\bigl(\operatorname{prox}_{\lambda f}(x)\bigr) + \frac{1}{2\lambda}\|x - \operatorname{prox}_{\lambda f}(x)\|^2, \tag{11}
\]
\[
\nabla f_\lambda(x) = \frac{1}{\lambda}\bigl(x - \operatorname{prox}_{\lambda f}(x)\bigr), \tag{12}
\]
\[
\operatorname{prox}_{\theta f_\lambda}(x) = \frac{\lambda}{\lambda + \theta}\,x + \frac{\theta}{\lambda + \theta}\operatorname{prox}_{(\lambda+\theta)f}(x). \tag{13}
\]
We obtain the following relaxed inertial proximal algorithm (NS stands for Non-
Smooth):
(IPAHD-NS): Set
\[
\mu_k := \frac{\lambda(k+\alpha)}{\lambda(k+\alpha) + k(\beta\sqrt{s} + s b_k)},
\]
and iterate
\[
\begin{cases}
y_k = x_k + \Bigl(1 - \dfrac{\alpha}{k+\alpha}\Bigr)(x_k - x_{k-1}) + \dfrac{\beta\sqrt{s}}{\lambda}\Bigl(1 - \dfrac{\alpha}{k+\alpha}\Bigr)\bigl(x_k - \operatorname{prox}_{\lambda f}(x_k)\bigr)\\[6pt]
x_{k+1} = \mu_k\, y_k + (1 - \mu_k)\operatorname{prox}_{\frac{\lambda}{\mu_k} f}(y_k).
\end{cases}
\]
\[
f\bigl(\operatorname{prox}_{\lambda f}(x_k)\bigr) - \min_{\mathcal H} f = \mathcal{O}\Bigl(\frac{1}{\delta_k}\Bigr), \qquad \sum_k \delta_k\beta_{k+1}\|x_{k+1} - \operatorname{prox}_{\lambda f}(x_{k+1})\|^2 < +\infty.
\]
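Since every quantity in (IPAHD-NS) is expressed through $\operatorname{prox}_{\lambda f}$, the scheme is directly implementable for simple non-smooth functions. The sketch below (ours, with assumed parameters) runs it on $f(x) = |x|$, whose prox is soft-thresholding.

```python
# Sketch of the (IPAHD-NS)-type iteration (assumed parameters, ours) on the
# non-smooth f(x) = |x|, whose prox is the soft-thresholding map.
import math

def soft(x, t):
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

alpha, s, beta, lam = 4.0, 0.04, 0.1, 1.0
rs = beta * math.sqrt(s)
x_prev = x = 1.0
for k in range(1, 1001):
    m = k / (k + alpha)           # = 1 - alpha / (k + alpha)
    mu_k = lam * (k + alpha) / (lam * (k + alpha) + k * (rs + s))   # b_k = 1
    y = x + m * (x - x_prev) + (rs / lam) * m * (x - soft(x, lam))
    x_prev, x = x, mu_k * y + (1 - mu_k) * soft(y, lam / mu_k)

assert abs(x) < 1e-3              # converges to the minimizer x* = 0 of |x|
```

No smoothing of $|x|$ is performed explicitly: the Moreau envelope enters only through the identities (11)–(13).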
\[
\frac{1}{s}(x_{k+1} - 2x_k + x_{k-1}) + \frac{\alpha}{ks}(x_k - x_{k-1}) + \frac{\beta}{\sqrt{s}}\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr) + \frac{\beta}{k\sqrt{s}}\nabla f(x_{k-1}) + \nabla f(y_k) = 0,
\]
Step $k$: set $\alpha_k := 1 - \dfrac{\alpha}{k}$, and iterate
\[
\begin{cases}
y_k = x_k + \alpha_k(x_k - x_{k-1}) - \beta\sqrt{s}\,\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr) - \dfrac{\beta\sqrt{s}}{k}\nabla f(x_{k-1})\\[4pt]
x_{k+1} = y_k - s\nabla f(y_k)
\end{cases}
\]
\[
E_k := t_k^2\bigl(f(x_k) - f(x^\star)\bigr) + \frac{1}{2s}\|v_k\|^2 \tag{14}
\]
\[
v_k := (x_{k-1} - x^\star) + t_k\bigl(x_k - x_{k-1} + \beta\sqrt{s}\,\nabla f(x_{k-1})\bigr). \tag{15}
\]
\begin{itemize}
\item[(i)] $f(x_k) - \min_{\mathcal H} f = \mathcal{O}\bigl(\frac{1}{k^2}\bigr)$ as $k \to +\infty$;
\item[(ii)] Suppose that $\beta > 0$. Then
$\sum_k k^2\|\nabla f(y_k)\|^2 < +\infty$ and $\sum_k k^2\|\nabla f(x_k)\|^2 < +\infty$.
\end{itemize}
Proof We rely on the following reinforced version of the gradient descent lemma (Lemma 1 in “Appendix A.1”). Since $s \le \frac{1}{L}$, and $\nabla f$ is $L$-Lipschitz continuous,
\[
f\bigl(y - s\nabla f(y)\bigr) \le f(x) + \langle\nabla f(y), y - x\rangle - \frac{s}{2}\|\nabla f(y)\|^2 - \frac{s}{2}\|\nabla f(x) - \nabla f(y)\|^2
\]
for all $x, y \in \mathcal{H}$. Let us write it successively at $y = y_k$ and $x = x_k$, then at $y = y_k$, $x = x^\star$. According to $x_{k+1} = y_k - s\nabla f(y_k)$ and $\nabla f(x^\star) = 0$, we get
\[
f(x_{k+1}) \le f(x_k) + \langle\nabla f(y_k), y_k - x_k\rangle - \frac{s}{2}\|\nabla f(y_k)\|^2 - \frac{s}{2}\|\nabla f(x_k) - \nabla f(y_k)\|^2, \tag{16}
\]
\[
f(x_{k+1}) \le f(x^\star) + \langle\nabla f(y_k), y_k - x^\star\rangle - \frac{s}{2}\|\nabla f(y_k)\|^2 - \frac{s}{2}\|\nabla f(y_k)\|^2. \tag{17}
\]
Multiplying (16) by $t_{k+1} - 1 \ge 0$, then adding (17), we derive that
\[
\begin{aligned}
t_{k+1}^2\bigl(f(x_{k+1}) - f(x^\star)\bigr) &\le t_k^2\bigl(f(x_k) - f(x^\star)\bigr)\\
&\quad + t_{k+1}\bigl\langle\nabla f(y_k),\ (t_{k+1} - 1)(y_k - x_k) + y_k - x^\star\bigr\rangle - \frac{s}{2}t_{k+1}^2\|\nabla f(y_k)\|^2\\
&\quad - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2.
\end{aligned}
\]
\[
\begin{aligned}
E_{k+1} - E_k &\le t_{k+1}\bigl\langle\nabla f(y_k),\ (t_{k+1} - 1)(y_k - x_k) + y_k - x^\star\bigr\rangle - \frac{s}{2}t_{k+1}^2\|\nabla f(y_k)\|^2\\
&\quad - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2\\
&\quad + \frac{1}{2s}\|v_{k+1}\|^2 - \frac{1}{2s}\|v_k\|^2.
\end{aligned}
\]
Let us compute this last expression with the help of the elementary identity
\[
\tfrac12\|v_{k+1}\|^2 - \tfrac12\|v_k\|^2 = \langle v_{k+1} - v_k, v_{k+1}\rangle - \tfrac12\|v_{k+1} - v_k\|^2.
\]
\[
\begin{aligned}
v_{k+1} - v_k &= x_k - x_{k-1} + t_{k+1}\bigl(x_{k+1} - x_k + \beta\sqrt{s}\,\nabla f(x_k)\bigr) - t_k\bigl(x_k - x_{k-1} + \beta\sqrt{s}\,\nabla f(x_{k-1})\bigr)\\
&= t_{k+1}(x_{k+1} - x_k) - (t_k - 1)(x_k - x_{k-1}) + \beta\sqrt{s}\,\bigl(t_{k+1}\nabla f(x_k) - t_k\nabla f(x_{k-1})\bigr)\\
&= t_{k+1}\bigl(x_{k+1} - (x_k + \alpha_k(x_k - x_{k-1}))\bigr) + \beta\sqrt{s}\,\bigl(t_{k+1}\nabla f(x_k) - t_k\nabla f(x_{k-1})\bigr)\\
&= t_{k+1}(x_{k+1} - y_k) - t_{k+1}\beta\sqrt{s}\,\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr) - t_{k+1}\frac{\beta\sqrt{s}}{k}\nabla f(x_{k-1})\\
&\qquad + \beta\sqrt{s}\,\bigl(t_{k+1}\nabla f(x_k) - t_k\nabla f(x_{k-1})\bigr)\\
&= t_{k+1}(x_{k+1} - y_k) + \beta\sqrt{s}\,\Bigl(t_{k+1}\Bigl(1 - \frac{1}{k}\Bigr) - t_k\Bigr)\nabla f(x_{k-1})\\
&= t_{k+1}(x_{k+1} - y_k) = -s\,t_{k+1}\nabla f(y_k).
\end{aligned}
\]
Hence
\[
\frac{1}{2s}\|v_{k+1}\|^2 - \frac{1}{2s}\|v_k\|^2 = -\frac{s}{2}t_{k+1}^2\|\nabla f(y_k)\|^2 - t_{k+1}\bigl\langle\nabla f(y_k),\ x_k - x^\star + t_{k+1}\bigl(x_{k+1} - x_k + \beta\sqrt{s}\,\nabla f(x_k)\bigr)\bigr\rangle.
\]
Equivalently,
\[
E_{k+1} - E_k \le t_{k+1}\langle\nabla f(y_k), A_k\rangle - s\,t_{k+1}^2\|\nabla f(y_k)\|^2 - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2,
\]
with
\[
\begin{aligned}
A_k &:= (t_{k+1} - 1)(y_k - x_k) + y_k - x_k - t_{k+1}\bigl(x_{k+1} - x_k + \beta\sqrt{s}\,\nabla f(x_k)\bigr)\\
&= t_{k+1}y_k - t_{k+1}x_k - t_{k+1}(x_{k+1} - x_k) - t_{k+1}\beta\sqrt{s}\,\nabla f(x_k)\\
&= t_{k+1}(y_k - x_{k+1}) - t_{k+1}\beta\sqrt{s}\,\nabla f(x_k)\\
&= s\,t_{k+1}\nabla f(y_k) - t_{k+1}\beta\sqrt{s}\,\nabla f(x_k).
\end{aligned}
\]
Consequently,
\[
\begin{aligned}
E_{k+1} - E_k &\le t_{k+1}\bigl\langle\nabla f(y_k),\ s\,t_{k+1}\nabla f(y_k) - t_{k+1}\beta\sqrt{s}\,\nabla f(x_k)\bigr\rangle - s\,t_{k+1}^2\|\nabla f(y_k)\|^2\\
&\quad - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2\\
&= -t_{k+1}^2\beta\sqrt{s}\,\langle\nabla f(y_k), \nabla f(x_k)\rangle - \frac{s}{2}(t_{k+1}^2 - t_{k+1})\|\nabla f(x_k) - \nabla f(y_k)\|^2 - \frac{s}{2}t_{k+1}\|\nabla f(y_k)\|^2\\
&= -t_{k+1}B_k,
\end{aligned}
\]
where
\[
B_k := t_{k+1}\beta\sqrt{s}\,\langle\nabla f(y_k), \nabla f(x_k)\rangle + \frac{s}{2}(t_{k+1} - 1)\|\nabla f(x_k) - \nabla f(y_k)\|^2 + \frac{s}{2}\|\nabla f(y_k)\|^2.
\]
When $\beta = 0$ we have $B_k \ge 0$. Let us analyze the sign of $B_k$ in the case $\beta > 0$. Set $Y = \nabla f(y_k)$, $X = \nabla f(x_k)$. We have
\[
\begin{aligned}
B_k &= \frac{s}{2}\|Y\|^2 + \frac{s}{2}(t_{k+1} - 1)\|Y - X\|^2 + t_{k+1}\beta\sqrt{s}\,\langle Y, X\rangle\\
&= \frac{s}{2}t_{k+1}\|Y\|^2 + \bigl(t_{k+1}(\beta\sqrt{s} - s) + s\bigr)\langle Y, X\rangle + \frac{s}{2}(t_{k+1} - 1)\|X\|^2\\
&\ge \frac{s}{2}t_{k+1}\|Y\|^2 - \bigl(t_{k+1}(\beta\sqrt{s} - s) + s\bigr)\|Y\|\,\|X\| + \frac{s}{2}(t_{k+1} - 1)\|X\|^2.
\end{aligned}
\]
Elementary algebra gives that the above quadratic form is non-negative when
\[
\bigl(t_{k+1}(\beta\sqrt{s} - s) + s\bigr)^2 \le s^2\, t_{k+1}(t_{k+1} - 1).
\]
Recall that $t_k$ is of order $k$. Hence, this inequality is satisfied for $k$ large enough if $(\beta\sqrt{s} - s)^2 < s^2$, which is equivalent to $\beta < 2\sqrt{s}$. Under this condition $E_{k+1} - E_k \le$
√
0, which gives conclusion (i). Similar argument gives
√ that for 0 < < 2 sβ − β 2
(such exists according to assumption 0 < β < 2 s)
1 2
E k+1 − E k + t ∇ f (yk ) 2
≤ 0.
2 k+1
Summing these inequalities gives (ii). Moreover,
\[
\Bigl(\inf_{i=1,\dots,k}\|\nabla f(x_i)\|^2\Bigr)\sum_{i=1}^{k} i^2 \le \sum_{i=1}^{k} i^2\|\nabla f(x_i)\|^2 \le \sum_{i\in\mathbb{N}} i^2\|\nabla f(x_i)\|^2 < +\infty.
\]
Since $\sum_{i=1}^{k} i^2$ is of order $k^3$, we deduce
\[
\inf_{i=1,\dots,k}\|\nabla f(x_i)\|^2 = \mathcal{O}\Bigl(\frac{1}{k^3}\Bigr), \qquad \inf_{i=1,\dots,k}\|\nabla f(y_i)\|^2 = \mathcal{O}\Bigl(\frac{1}{k^3}\Bigr).
\]
\[
\begin{aligned}
f(y_k) &\le f(x_{k+1}) + \langle y_k - x_{k+1}, \nabla f(x_{k+1})\rangle + \frac{L}{2}\|y_k - x_{k+1}\|^2\\
&\le f(x_{k+1}) + \langle y_k - x_{k+1}, \nabla f(y_k)\rangle + \frac{L}{2}\|y_k - x_{k+1}\|^2.
\end{aligned}
\]
Since $y_k - x_{k+1} = s\nabla f(y_k)$, this gives
\[
f(y_k) - \min_{\mathcal H} f \le f(x_{k+1}) - \min_{\mathcal H} f + s\|\nabla f(y_k)\|^2 + \frac{s^2 L}{2}\|\nabla f(y_k)\|^2.
\]
\[
f(y_k) - \min_{\mathcal H} f \le \mathcal{O}\Bigl(\frac{1}{k^2}\Bigr) + \Bigl(s + \frac{s^2 L}{2}\Bigr)\|\nabla f(y_k)\|^2 = \mathcal{O}\Bigl(\frac{1}{k^2}\Bigr) + o\Bigl(\frac{1}{k^2}\Bigr).
\]
Remark 6 When $f$ is a proper, lower semicontinuous convex function, but not necessarily smooth, we follow the same reasoning as in Sect. 3.1.2. We consider minimizing the Moreau envelope $f_\lambda$ of $f$, whose gradient is $1/\lambda$-Lipschitz continuous, and then apply (IGAHD) to $f_\lambda$. We omit the details for the sake of brevity. This observation will be very useful for solving structured composite problems, as we will describe in Sect. 6.
Suppose that $0 \le \beta \le \dfrac{1}{2\sqrt{\mu}}$. Then, the following hold:
\begin{itemize}
\item[(i)] $\dfrac{\mu}{2}\|x(t) - x^\star\|^2 \le f(x(t)) - \min_{\mathcal H} f \le C\,e^{-\frac{\sqrt{\mu}}{2}(t - t_0)}$;
\item[(ii)] $e^{-\sqrt{\mu}\,t}\displaystyle\int_{t_0}^{t} e^{\sqrt{\mu}\,s}\,\|\nabla f(x(s))\|^2\,ds \le C_1\,e^{-\frac{\sqrt{\mu}}{2}t}$.
\end{itemize}
Moreover, $\int_{t_0}^{+\infty} e^{\frac{\sqrt{\mu}}{2}t}\,\|\dot{x}(t)\|^2\,dt < +\infty$.
When $\beta = 0$, we have $f(x(t)) - \min_{\mathcal H} f = \mathcal{O}\bigl(e^{-\sqrt{\mu}\,t}\bigr)$ as $t \to +\infty$.
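The exponential decay is easy to observe numerically. The following sketch (ours, illustrative) integrates the dynamic for $f(x) = x^2/2$, so that $\mu = 1$ and $\nabla^2 f \equiv 1$ and the ODE reduces to a linearly damped oscillator, using an explicit Euler scheme with a small step.

```python
# Explicit Euler integration (ours, illustrative) of
#   x'' + 2*sqrt(mu)*x' + beta*f''(x)*x' + f'(x) = 0
# for f(x) = x^2 / 2 (mu = 1, f'' = 1), from x(0) = 1, x'(0) = 0.
import math

mu, beta, dt, T = 1.0, 0.2, 1e-3, 20.0
x, v = 1.0, 0.0
for _ in range(int(T / dt)):
    a = -(2 * math.sqrt(mu) + beta) * v - x   # acceleration from the ODE
    x, v = x + dt * v, v + dt * a

assert abs(x) < 1e-3                          # exponential decay of the trajectory
```

For this quadratic the continuous-time decay rate can be read off the characteristic roots of $r^2 + (2\sqrt{\mu}+\beta)r + 1 = 0$, consistent with the exponential rate above.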
Remark 7 When $\beta = 0$, Theorem 7 recovers [29, Theorem 2.2]. In the case $\beta > 0$, a result on a related but different dynamical system can be found in [32, Theorem 1] (their rate is also slightly worse than ours). Our gradient estimate is distinctly new in the literature.
\[
E(t) := f(x(t)) - \min_{\mathcal H} f + \frac12\bigl\|\sqrt{\mu}\,(x(t) - x^\star) + \dot{x}(t) + \beta\nabla f(x(t))\bigr\|^2.
\]
Set $v(t) = \sqrt{\mu}\,(x(t) - x^\star) + \dot{x}(t) + \beta\nabla f(x(t))$. Differentiation of $E(\cdot)$ gives
\[
\frac{d}{dt}E(t) = \langle\nabla f(x(t)), \dot{x}(t)\rangle + \bigl\langle v(t),\ \sqrt{\mu}\,\dot{x}(t) + \ddot{x}(t) + \beta\nabla^2 f(x(t))\dot{x}(t)\bigr\rangle.
\]
According to the dynamic, this gives
\[
\frac{d}{dt}E(t) = \langle\nabla f(x(t)), \dot{x}(t)\rangle + \bigl\langle v(t),\ -\sqrt{\mu}\,\dot{x}(t) - \nabla f(x(t))\bigr\rangle.
\]
Developing the scalar products yields
\[
\frac{d}{dt}E(t) + \sqrt{\mu}\,\langle\nabla f(x(t)), x(t) - x^\star\rangle + \mu\langle x(t) - x^\star, \dot{x}(t)\rangle + \sqrt{\mu}\,\|\dot{x}(t)\|^2 + \beta\sqrt{\mu}\,\langle\nabla f(x(t)), \dot{x}(t)\rangle + \beta\|\nabla f(x(t))\|^2 = 0.
\]
By strong convexity of $f$,
\[
\langle\nabla f(x(t)), x(t) - x^\star\rangle \ge f(x(t)) - f(x^\star) + \frac{\mu}{2}\|x(t) - x^\star\|^2.
\]
Hence
\[
\frac{d}{dt}E(t) + \sqrt{\mu}\,A \le 0,
\]
where
\[
A := f(x) - f(x^\star) + \frac{\mu}{2}\|x - x^\star\|^2 + \sqrt{\mu}\,\langle x - x^\star, \dot{x}\rangle + \|\dot{x}\|^2 + \beta\langle\nabla f(x), \dot{x}\rangle + \frac{\beta}{\sqrt{\mu}}\|\nabla f(x)\|^2.
\]
+ √ ∇ f (x) 2
μ
1 √ √
A=E− ẋ + β∇ f (x) 2
− μx − x , ẋ + β∇ f (x) + μx − x , ẋ + ẋ 2
2
123
First-order optimization algorithms via inertial systems… 135
β
+β∇ f (x), ẋ + √ ∇ f (x) 2 .
μ
Since $0 \le \beta \le \frac{1}{\sqrt{\mu}}$, we immediately get $\frac{\beta}{\sqrt{\mu}} - \frac{\beta^2}{2} \ge \frac{\beta}{2\sqrt{\mu}}$. Hence
\[
\frac{d}{dt}E(t) + \sqrt{\mu}\Bigl(E(t) + \frac12\|\dot{x}\|^2 + \frac{\beta}{2\sqrt{\mu}}\|\nabla f(x)\|^2 - \beta\sqrt{\mu}\,\langle x - x^\star, \nabla f(x)\rangle\Bigr) \le 0.
\]
Now use
\[
E(t) = \tfrac12 E(t) + \tfrac12 E(t) \ge \tfrac12 E(t) + \tfrac12\bigl(f(x(t)) - f(x^\star)\bigr) \ge \tfrac12 E(t) + \frac{\mu}{4}\|x(t) - x^\star\|^2,
\]
together with the elementary fact that, with $X = \|x - x^\star\|$ and $Y = \|\nabla f(x)\|$,
\[
\frac{\mu}{4}X^2 + \frac{\beta}{2\sqrt{\mu}}Y^2 - \beta\sqrt{\mu}\,XY \ge 0.
\]
Hence for $0 \le \beta \le \frac{1}{2\sqrt{\mu}}$,
\[
\frac{d}{dt}E(t) + \frac{\sqrt{\mu}}{2}E(t) + \frac{\sqrt{\mu}}{2}\|\dot{x}(t)\|^2 \le 0.
\]
and
\[
\bigl\|\sqrt{\mu}\,(x(t) - x^\star) + \dot{x}(t) + \beta\nabla f(x(t))\bigr\|^2 \le 2E(t_0)\,e^{-\frac{\sqrt{\mu}}{2}(t - t_0)}.
\]
(ii) Set $C = 2E(t_0)e^{\frac{\sqrt{\mu}}{2}t_0}$. Developing the above expression, we obtain
\[
\mu\|x(t) - x^\star\|^2 + \|\dot{x}(t)\|^2 + \beta^2\|\nabla f(x(t))\|^2 + 2\beta\sqrt{\mu}\,\langle x(t) - x^\star, \nabla f(x(t))\rangle + \bigl\langle\dot{x}(t),\ 2\beta\nabla f(x(t)) + 2\sqrt{\mu}\,(x(t) - x^\star)\bigr\rangle \le C e^{-\frac{\sqrt{\mu}}{2}t}.
\]
\[
e^{-\sqrt{\mu}\,t}\int_{t_0}^{t} e^{\sqrt{\mu}\,s}\|\nabla f(x(s))\|^2\,ds \le C e^{-\frac{\sqrt{\mu}}{2}t}.
\]
Noticing that the integral of $e^{\sqrt{\mu}\,s}$ over $[t_0, t]$ is of order $e^{\sqrt{\mu}\,t}$, the above estimate reflects the fact that, as $t \to +\infty$, the gradient terms $\|\nabla f(x(t))\|^2$ tend to zero at an exponential rate (in average, not pointwise).
Remark 8 Let us justify the choice of $\gamma = 2\sqrt{\mu}$ in Theorem 7. Indeed, a proof similar to the one described above can be performed on the basis of the Lyapunov function
\[
E(t) := f(x(t)) - \min_{\mathcal H} f + \frac12\bigl\|\gamma(x(t) - x^\star) + \dot{x}(t) + \beta\nabla f(x(t))\bigr\|^2.
\]
Under the conditions $\gamma \le \sqrt{\mu}$ and $\beta \le \frac{\mu}{2\gamma^3}$, we obtain the exponential convergence rate
\[
f(x(t)) - \min_{\mathcal H} f = \mathcal{O}\bigl(e^{-\frac{\gamma}{2}t}\bigr) \quad\text{as } t \to +\infty.
\]
Taking $\gamma = \sqrt{\mu}$ gives the best convergence rate, and the result of Theorem 7.
This makes it possible to extend (DIN)$_{\gamma,\beta}$ to the case of a proper lower semicontinuous convex function $f: \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$. Replacing the gradient of $f$ by its subdifferential, we obtain the non-smooth version:
\[
\text{(DIN-NS)}_{\gamma,\beta}\qquad
\begin{cases}
\dot{x}(t) + \beta\,\partial f(x(t)) + \Bigl(\gamma - \dfrac{1}{\beta}\Bigr)x(t) + \dfrac{1}{\beta}y(t) \ni 0;\\[8pt]
\dot{y}(t) + \Bigl(\gamma - \dfrac{1}{\beta}\Bigr)x(t) + \dfrac{1}{\beta}y(t) = 0.
\end{cases}
\]
Most properties of (DIN)γ ,β are still valid for this generalized version. To illustrate
it, let us consider the following extension of Theorem 7.
\[
\frac{\mu}{2}\|x(t) - x^\star\|^2 \le f(x(t)) - \min_{\mathcal H} f = \mathcal{O}\bigl(e^{-\frac{\sqrt{\mu}}{2}t}\bigr) \quad\text{as } t \to +\infty,
\]
and $\int_{t_0}^{+\infty} e^{\frac{\sqrt{\mu}}{2}t}\|\dot{x}(t)\|^2\,dt < +\infty$.
\[
E(t) := f(x(t)) - \min_{\mathcal H} f + \frac12\Bigl\|\sqrt{\mu}\,(x(t) - x^\star) - \Bigl(2\sqrt{\mu} - \frac{1}{\beta}\Bigr)x(t) - \frac{1}{\beta}y(t)\Bigr\|^2,
\]
that will serve as a Lyapunov function. Then, the proof follows the same lines as that
of Theorem 7, with the use of the derivation rule of Brezis [19, Lemme 3.3, p. 73].
We will show in this section that the exponential convergence of Theorem 7 for the
inertial system (19) translates into linear convergence in the algorithmic case under
proper discretization.
Consider the inertial dynamic (19). Its implicit discretization similar to that performed
before gives
\[
\frac{1}{h^2}(x_{k+1} - 2x_k + x_{k-1}) + \frac{2\sqrt{\mu}}{h}(x_{k+1} - x_k) + \frac{\beta}{h}\bigl(\nabla f(x_{k+1}) - \nabla f(x_k)\bigr) + \nabla f(x_{k+1}) = 0,
\]
where $h$ is the positive step size. Set $s = h^2$. We obtain the following inertial proximal algorithm with Hessian damping (SC refers to Strongly Convex):
\[
\text{(IPAHD-SC)}\qquad
\begin{cases}
y_k = x_k + \Bigl(1 - \dfrac{2\sqrt{\mu s}}{1 + 2\sqrt{\mu s}}\Bigr)(x_k - x_{k-1}) + \beta\sqrt{s}\,\Bigl(1 - \dfrac{2\sqrt{\mu s}}{1 + 2\sqrt{\mu s}}\Bigr)\nabla f(x_k)\\[8pt]
x_{k+1} = \operatorname{prox}_{\frac{\beta\sqrt{s} + s}{1 + 2\sqrt{\mu s}}\, f}(y_k).
\end{cases}
\]
Suppose that
\[
0 \le \beta \le \frac{1}{2\sqrt{\mu}} \quad\text{and}\quad \sqrt{s} \le \beta.
\]
Set $q = \dfrac{1}{1 + \frac12\sqrt{\mu s}}$, which satisfies $0 < q < 1$. Then, the sequence $(x_k)_{k\in\mathbb{N}}$ generated by the algorithm (IPAHD-SC) obeys, for any $k \ge 1$,
\[
\frac{\mu}{2}\|x_k - x^\star\|^2 \le f(x_k) - \min_{\mathcal H} f \le E_1 q^{k-1},
\]
where $E_1 = f(x_1) - f(x^\star) + \frac12\bigl\|\sqrt{\mu}\,(x_1 - x^\star) + \frac{1}{\sqrt{s}}(x_1 - x_0) + \beta\nabla f(x_1)\bigr\|^2$. Moreover, the gradients converge exponentially fast to zero: setting $\theta = \frac{1}{1 + \sqrt{\mu s}}$, which belongs to $]0, 1[$, we have
\[
\theta^k \sum_{j=0}^{k-2} \theta^{-j}\|\nabla f(x_j)\|^2 = \mathcal{O}(q^k) \quad\text{as } k \to +\infty.
\]
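The geometric rate is visible in practice. The sketch below (ours, with illustrative parameters satisfying $\beta \le 1/(2\sqrt{\mu})$ and $\sqrt{s} \le \beta$) runs (IPAHD-SC) on $f(x) = x^2/2$, where $\operatorname{prox}_{\theta f}(y) = y/(1+\theta)$ in closed form.

```python
# Sketch of (IPAHD-SC) (assumed parameters, ours) on the mu-strongly convex
# f(x) = x^2 / 2 (mu = 1), with prox_{theta f}(y) = y / (1 + theta).
import math

mu, beta, s = 1.0, 0.5, 0.2       # sqrt(s) ~ 0.447 <= beta = 0.5 <= 1/(2 sqrt(mu))
r = 2 * math.sqrt(mu * s)
coef = 1 - r / (1 + r)            # = 1 / (1 + 2*sqrt(mu*s))
theta = (beta * math.sqrt(s) + s) / (1 + r)
x_prev = x = 1.0
for k in range(200):
    y = x + coef * (x - x_prev) + beta * math.sqrt(s) * coef * x   # grad f(x) = x
    x_prev, x = x, y / (1 + theta)                                 # prox step

assert abs(x) < 1e-3              # linear (geometric) convergence
```

On this quadratic the observed contraction factor per step is below the theoretical $q$, as expected from a worst-case bound.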
Remark 9 We are not aware of any result of this kind for such a proximal algorithm.
Proof Let $x^\star$ be the unique minimizer of $f$, and consider the sequence $(E_k)_{k\in\mathbb{N}}$,
\[
E_k := f(x_k) - f(x^\star) + \tfrac12\|v_k\|^2,
\]
where $v_k = \sqrt{\mu}\,(x_k - x^\star) + \frac{1}{\sqrt{s}}(x_k - x_{k-1}) + \beta\nabla f(x_k)$.
We will use the following equivalent formulation of the algorithm (IPAHD-SC):
\[
\frac{1}{\sqrt{s}}(x_{k+1} - 2x_k + x_{k-1}) + 2\sqrt{\mu}\,(x_{k+1} - x_k) + \beta\bigl(\nabla f(x_{k+1}) - \nabla f(x_k)\bigr) + \sqrt{s}\,\nabla f(x_{k+1}) = 0. \tag{20}
\]
We have
\[
E_{k+1} - E_k = f(x_{k+1}) - f(x_k) + \tfrac12\|v_{k+1}\|^2 - \tfrac12\|v_k\|^2.
\]
By the definition of $v_k$ and (20),
\[
\begin{aligned}
v_{k+1} - v_k &= \sqrt{\mu}\,(x_{k+1} - x_k) + \tfrac{1}{\sqrt{s}}(x_{k+1} - 2x_k + x_{k-1}) + \beta\bigl(\nabla f(x_{k+1}) - \nabla f(x_k)\bigr)\\
&= \sqrt{\mu}\,(x_{k+1} - x_k) - 2\sqrt{\mu}\,(x_{k+1} - x_k) - \sqrt{s}\,\nabla f(x_{k+1})\\
&= -\sqrt{\mu}\,(x_{k+1} - x_k) - \sqrt{s}\,\nabla f(x_{k+1}).
\end{aligned}
\]
Write shortly $B_k = \sqrt{\mu}\,(x_{k+1} - x_k) + \sqrt{s}\,\nabla f(x_{k+1})$. We have
\[
\begin{aligned}
\tfrac12\|v_{k+1}\|^2 - \tfrac12\|v_k\|^2 &= \langle v_{k+1} - v_k, v_{k+1}\rangle - \tfrac12\|v_{k+1} - v_k\|^2\\
&= -\Bigl\langle B_k,\ \sqrt{\mu}\,(x_{k+1} - x^\star) + \frac{1}{\sqrt{s}}(x_{k+1} - x_k) + \beta\nabla f(x_{k+1})\Bigr\rangle - \tfrac12\|B_k\|^2\\
&= -\mu\langle x_{k+1} - x_k, x_{k+1} - x^\star\rangle - \frac{\sqrt{\mu}}{\sqrt{s}}\|x_{k+1} - x_k\|^2 - \beta\sqrt{\mu}\,\langle\nabla f(x_{k+1}), x_{k+1} - x_k\rangle\\
&\quad - \sqrt{\mu s}\,\langle\nabla f(x_{k+1}), x_{k+1} - x^\star\rangle - \langle\nabla f(x_{k+1}), x_{k+1} - x_k\rangle - \beta\sqrt{s}\,\|\nabla f(x_{k+1})\|^2\\
&\quad - \frac12\mu\|x_{k+1} - x_k\|^2 - \frac12 s\|\nabla f(x_{k+1})\|^2 - \sqrt{\mu s}\,\langle\nabla f(x_{k+1}), x_{k+1} - x_k\rangle.
\end{aligned}
\]
By strong convexity of $f$,
\[
\begin{aligned}
f(x_k) &\ge f(x_{k+1}) + \langle\nabla f(x_{k+1}), x_k - x_{k+1}\rangle + \frac{\mu}{2}\|x_{k+1} - x_k\|^2;\\
f(x^\star) &\ge f(x_{k+1}) + \langle\nabla f(x_{k+1}), x^\star - x_{k+1}\rangle + \frac{\mu}{2}\|x_{k+1} - x^\star\|^2.
\end{aligned}
\]
Combining the above results, and after dividing by $\sqrt{s}$, we get
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}(E_{k+1} - E_k) + \sqrt{\mu}\Bigl[f(x_{k+1}) - f(x^\star) + \frac{\mu}{2}\|x_{k+1} - x^\star\|^2\Bigr]\\
&\quad \le -\frac{\mu}{\sqrt{s}}\langle x_{k+1} - x_k, x_{k+1} - x^\star\rangle - \frac{\sqrt{\mu}}{s}\|x_{k+1} - x_k\|^2\\
&\qquad - \beta\sqrt{\frac{\mu}{s}}\,\langle\nabla f(x_{k+1}), x_{k+1} - x_k\rangle - \frac{\mu}{2\sqrt{s}}\|x_{k+1} - x_k\|^2 - \beta\|\nabla f(x_{k+1})\|^2\\
&\qquad - \frac{\mu}{2\sqrt{s}}\|x_{k+1} - x_k\|^2 - \frac{\sqrt{s}}{2}\|\nabla f(x_{k+1})\|^2 - \sqrt{\mu}\,\langle\nabla f(x_{k+1}), x_{k+1} - x_k\rangle,
\end{aligned}
\]
that is,
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}(E_{k+1} - E_k) + \sqrt{\mu}\,E_{k+1} - \beta\mu\,\langle\nabla f(x_{k+1}), x_{k+1} - x^\star\rangle\\
&\quad \le -\Bigl(\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}}\Bigr)\|x_{k+1} - x_k\|^2 - \Bigl(\beta - \frac{\beta^2\sqrt{\mu}}{2} + \frac{\sqrt{s}}{2}\Bigr)\|\nabla f(x_{k+1})\|^2 - \sqrt{\mu}\,\langle\nabla f(x_{k+1}), x_{k+1} - x_k\rangle.
\end{aligned}
\]
According to $0 \le \beta \le \frac{1}{2\sqrt{\mu}}$, we have $\beta - \frac{\beta^2\sqrt{\mu}}{2} \ge \frac{3\beta}{4}$, which, with the Cauchy–Schwarz inequality, gives
\[
\frac{1}{\sqrt{s}}(E_{k+1} - E_k) + \sqrt{\mu}\,E_{k+1} + \Bigl(\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}}\Bigr)\|x_{k+1} - x_k\|^2 + \frac{3\beta}{4}\|\nabla f(x_{k+1})\|^2 - \beta\mu\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x^\star\| - \sqrt{\mu}\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x_k\| \le 0.
\]
Now use
\[
E_{k+1} = \tfrac12 E_{k+1} + \tfrac12 E_{k+1} \ge \tfrac12 E_{k+1} + \tfrac12\bigl(f(x_{k+1}) - f(x^\star)\bigr) \ge \tfrac12 E_{k+1} + \frac{\mu}{4}\|x_{k+1} - x^\star\|^2.
\]
This gives
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}(E_{k+1} - E_k) + \frac12\sqrt{\mu}\,E_{k+1} + \sqrt{\mu}\,\frac{\mu}{4}\|x_{k+1} - x^\star\|^2 + \Bigl(\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}}\Bigr)\|x_{k+1} - x_k\|^2\\
&\quad + \frac{3\beta}{4}\|\nabla f(x_{k+1})\|^2 - \beta\mu\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x^\star\| - \sqrt{\mu}\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x_k\| \le 0.
\end{aligned}
\]
Regrouping the terms, we arrive at
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}(E_{k+1} - E_k) + \frac12\sqrt{\mu}\,E_{k+1}\\
&\quad + \underbrace{\sqrt{\mu}\,\frac{\mu}{4}\|x_{k+1} - x^\star\|^2 + \frac{\beta}{2}\|\nabla f(x_{k+1})\|^2 - \beta\mu\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x^\star\|}_{\text{Term 1}}\\
&\quad + \underbrace{\Bigl(\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}}\Bigr)\|x_{k+1} - x_k\|^2 + \frac{\beta}{4}\|\nabla f(x_{k+1})\|^2 - \sqrt{\mu}\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x_k\|}_{\text{Term 2}} \le 0.
\end{aligned}
\]
Let us examine the sign of the last two terms in the inequality above.
Term 1. Set $X = \|x_{k+1} - x^\star\|$, $Y = \|\nabla f(x_{k+1})\|$. Elementary algebra gives that
\[
\sqrt{\mu}\,\frac{\mu}{4}X^2 + \frac{\beta}{2}Y^2 - \beta\mu XY \ge 0
\]
holds true since $0 \le \beta \le \frac{1}{2\sqrt{\mu}}$, so that
\[
\sqrt{\mu}\,\frac{\mu}{4}\|x_{k+1} - x^\star\|^2 + \frac{\beta}{2}\|\nabla f(x_{k+1})\|^2 - \beta\mu\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x^\star\| \ge 0.
\]
Term 2. Set $X = \|x_{k+1} - x_k\|$, $Y = \|\nabla f(x_{k+1})\|$. Elementary algebra gives that
\[
\Bigl(\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}}\Bigr)X^2 + \frac{\beta}{4}Y^2 - \sqrt{\mu}\,XY \ge 0
\]
holds true under the condition $\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}} \ge \frac{\mu}{\beta}$. Hence, under this condition,
\[
\Bigl(\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}}\Bigr)\|x_{k+1} - x_k\|^2 + \frac{\beta}{4}\|\nabla f(x_{k+1})\|^2 - \sqrt{\mu}\,\|\nabla f(x_{k+1})\|\,\|x_{k+1} - x_k\| \ge 0.
\]
In turn, the condition $\frac{\mu}{2s} + \frac{\sqrt{\mu}}{\sqrt{s}} \ge \frac{\mu}{\beta}$ is equivalent to
\[
\sqrt{s} \le \frac{\beta}{2\sqrt{\mu}}\Bigl(1 + \sqrt{1 + \frac{2\mu}{\beta}}\Bigr).
\]
Clearly, this condition is satisfied if $\sqrt{s} \le \beta$.
Let us put the above results together. We have obtained that, under the conditions $0 \le \beta \le \frac{1}{2\sqrt{\mu}}$ and $\sqrt{s} \le \beta$,
\[
\frac{1}{\sqrt{s}}(E_{k+1} - E_k) + \frac{\sqrt{\mu}}{2}E_{k+1} \le 0.
\]
Set $q = \frac{1}{1 + \frac12\sqrt{\mu s}}$, which satisfies $0 < q < 1$. From this, we infer $E_k \le q E_{k-1}$, which gives
\[
E_k \le E_1 q^{k-1}. \tag{21}
\]
In particular, $f(x_k) - f(x^\star) \le E_1 q^{k-1} = \mathcal{O}(q^k)$.
Let us now estimate the convergence rate of the gradients to zero. According to the exponential decay of $(E_k)_{k\in\mathbb{N}}$, as given in (21), and by definition of $E_k$, we have, for all $k \ge 1$,
\[
\Bigl\|\sqrt{\mu}\,(x_k - x^\star) + \frac{1}{\sqrt{s}}(x_k - x_{k-1}) + \beta\nabla f(x_k)\Bigr\|^2 \le 2E_k \le 2E_1 q^{k-1}.
\]
Developing this expression, we obtain
\[
\mu\|x_k - x^\star\|^2 + \frac{1}{s}\|x_k - x_{k-1}\|^2 + \beta^2\|\nabla f(x_k)\|^2 + 2\beta\sqrt{\mu}\,\langle x_k - x^\star, \nabla f(x_k)\rangle + \frac{1}{\sqrt{s}}\bigl\langle x_k - x_{k-1},\ 2\beta\nabla f(x_k) + 2\sqrt{\mu}\,(x_k - x^\star)\bigr\rangle \le 2E_1 q^{k-1}.
\]
By convexity of $f$, we have $\langle x_k - x^\star, \nabla f(x_k)\rangle \ge f(x_k) - f(x^\star)$ and $\langle x_k - x_{k-1}, \nabla f(x_k)\rangle \ge f(x_k) - f(x_{k-1})$, whence
\[
\begin{aligned}
&\sqrt{\mu}\,\Bigl[2\beta\bigl(f(x_k) - f(x^\star)\bigr) + \sqrt{\mu}\,\|x_k - x^\star\|^2\Bigr] + \beta^2\|\nabla f(x_k)\|^2\\
&\quad + \frac{1}{\sqrt{s}}\Bigl[2\beta\bigl(f(x_k) - f(x^\star)\bigr) + \sqrt{\mu}\,\|x_k - x^\star\|^2\Bigr]\\
&\quad - \frac{1}{\sqrt{s}}\Bigl[2\beta\bigl(f(x_{k-1}) - f(x^\star)\bigr) + \sqrt{\mu}\,\|x_{k-1} - x^\star\|^2\Bigr] \le 2E_1 q^{k-1}.
\end{aligned}
\]
Set $Z_k := 2\beta\bigl(f(x_k) - f(x^\star)\bigr) + \sqrt{\mu}\,\|x_k - x^\star\|^2$. We have, for all $k \ge 1$,
\[
\frac{1}{\sqrt{s}}(Z_k - Z_{k-1}) + \sqrt{\mu}\,Z_k + \beta^2\|\nabla f(x_k)\|^2 \le 2E_1 q^{k-1}. \tag{22}
\]
Set $\theta = \frac{1}{1 + \sqrt{\mu s}}$, which belongs to $]0, 1[$. Equivalently,
\[
Z_k + \theta\beta^2\sqrt{s}\,\|\nabla f(x_k)\|^2 \le \theta Z_{k-1} + 2E_1\theta\sqrt{s}\,q^{k-1}.
\]
\[
Z_k + \theta\beta^2\sqrt{s}\sum_{p=0}^{k-2}\theta^p\|\nabla f(x_{k-p})\|^2 \le \theta^{k-1}Z_1 + 2E_1\theta\sqrt{s}\sum_{p=0}^{k-2}\theta^p q^{k-p-1}. \tag{23}
\]
Then notice that $\dfrac{\theta}{q} = \dfrac{1 + \frac12\sqrt{\mu s}}{1 + \sqrt{\mu s}} < 1$, which gives
\[
\sum_{p=0}^{k-2}\theta^p q^{k-p-1} = q^{k-1}\sum_{p=0}^{k-2}\Bigl(\frac{\theta}{q}\Bigr)^p \le 2\Bigl(1 + \frac{1}{\sqrt{\mu s}}\Bigr)q^{k-1}.
\]
Dropping the nonnegative term $Z_k$ and using $\theta\big(\sqrt{s}+\frac{1}{\sqrt{\mu}}\big)=\frac{1}{\sqrt{\mu}}$, we deduce
\[
\theta\beta^2\sqrt{s}\sum_{p=0}^{k-2}\theta^p\|\nabla f(x_{k-p})\|^2\ \le\ \theta^{k-1}Z_1+\frac{4E_1}{\sqrt{\mu}}\,q^{k-1}. \tag{24}
\]
Using again the inequality $\theta<q$, and after reindexing ($j=k-p$), we finally obtain
\[
\theta^{k}\sum_{j=2}^{k}\theta^{-j}\|\nabla f(x_j)\|^2=\mathcal{O}\big(q^k\big).
\]
Recall that
\[
f\ \text{is $\mu$-strongly convex}\ \Longrightarrow\ f_\lambda\ \text{is strongly convex with modulus}\ \frac{\mu}{1+\lambda\mu}.
\]
Indeed, writing $f=g+\frac{\mu}{2}\|\cdot\|^2$ with $g$ convex, and setting $\theta=\frac{\lambda}{1+\lambda\mu}$, a direct computation gives
\[
f_\lambda(x)=g_\theta\Big(\frac{1}{1+\lambda\mu}\,x\Big)+\frac{\mu}{2(1+\lambda\mu)}\|x\|^2 .
\]
Since $x\mapsto g_\theta\big(\frac{1}{1+\lambda\mu}x\big)$ is convex, the above formula shows that $f_\lambda$ is strongly convex with constant $\frac{\mu}{1+\lambda\mu}$.
(IPAHD-NS-SC)
\[
\begin{cases}
y_k=x_k+(1-a)(x_k-x_{k-1})+\dfrac{\beta\sqrt{s}}{\lambda}\,(1-a)\big(x_k-\operatorname{prox}_{\lambda f}(x_k)\big)\\[2mm]
x_{k+1}=\dfrac{\lambda}{\lambda+\theta}\,y_k+\dfrac{\theta}{\lambda+\theta}\operatorname{prox}_{(\lambda+\theta)f}(y_k)
\end{cases}
\]
Set $q=\dfrac{1}{1+\frac{1}{2}\sqrt{\frac{\mu}{1+\lambda\mu}\,s}}$, which satisfies $0<q<1$. Then, for any sequence $(x_k)_{k\in\mathbb{N}}$ generated by algorithm (IPAHD-NS-SC),
\[
\|x_k-x^\star\|=\mathcal{O}\big(q^{k/2}\big)
\quad\text{and}\quad
f\big(\operatorname{prox}_{\lambda f}(x_k)\big)-\min_{\mathcal{H}}f=\mathcal{O}\big(q^{k}\big)\quad\text{as }k\to+\infty,
\]
and
\[
\|x_k-\operatorname{prox}_{\lambda f}(x_k)\|^2=\mathcal{O}\big(q^{k}\big)\quad\text{as }k\to+\infty.
\]
Let us start from the continuous dynamic (19), whose linear convergence rate was established in Theorem 7. Its explicit time discretization, with centered finite differences for speed and acceleration, gives
\[
\frac{1}{s}\big(x_{k+1}-2x_k+x_{k-1}\big)+\frac{\sqrt{\mu}}{\sqrt{s}}\big(x_{k+1}-x_{k-1}\big)+\frac{\beta}{\sqrt{s}}\big(\nabla f(x_k)-\nabla f(x_{k-1})\big)+\nabla f(x_k)=0.
\]
Equivalently,
\[
\big(x_{k+1}-2x_k+x_{k-1}\big)+\sqrt{\mu s}\,\big(x_{k+1}-x_{k-1}\big)+\beta\sqrt{s}\,\big(\nabla f(x_k)-\nabla f(x_{k-1})\big)+s\,\nabla f(x_k)=0, \tag{25}
\]
which gives the inertial gradient algorithm with Hessian damping (SC stands for Strongly Convex):

(IGAHD-SC)
\[
x_{k+1}=x_k+\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}\,(x_k-x_{k-1})-\frac{\beta\sqrt{s}}{1+\sqrt{\mu s}}\,\big(\nabla f(x_k)-\nabla f(x_{k-1})\big)-\frac{s}{1+\sqrt{\mu s}}\,\nabla f(x_k).
\]
Set $q=\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}$, which satisfies $0<q<1$. Then, for any sequence $(x_k)_{k\in\mathbb{N}}$ generated by algorithm (IGAHD-SC), we have
\[
\|x_k-x^\star\|=\mathcal{O}\big(q^{k/2}\big)
\quad\text{and}\quad
f(x_k)-\min_{\mathcal{H}}f=\mathcal{O}\big(q^{k}\big)\quad\text{as }k\to+\infty,
\]
as well as, with $\theta=\frac{1}{1+\sqrt{\mu s}}$,
\[
\theta^{k}\sum_{j=2}^{k}\theta^{-j}\|\nabla f(x_j)\|^2=\mathcal{O}\big(q^k\big)\quad\text{as }k\to+\infty.
\]
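For concreteness, the (IGAHD-SC) recursion is easy to run in a few lines. The following sketch (entirely our own illustrative setup: a diagonal strongly convex quadratic, parameter values chosen to satisfy the conditions above, and a hypothetical tolerance — none of it taken from the paper's experiments) applies the recursion and observes the decay of the values:

```python
import numpy as np

# f(x) = 0.5 * x^T D x with D = diag(d): mu = min(d), L = max(d), grad f(x) = d * x.
d = np.array([0.1, 1.0, 10.0])
mu, L = d.min(), d.max()

def grad(x):
    return d * x

s = 1.0 / L                    # step size
beta = np.sqrt(mu) / (8 * L)   # so that L <= sqrt(mu) / (8 * beta)
r = np.sqrt(mu * s)

x_prev = np.ones(3)
x = np.ones(3)                 # start with x_0 = x_1
for _ in range(300):
    # (IGAHD-SC): momentum + Hessian-driven damping (gradient differences) + gradient step
    x_next = (x
              + (1 - r) / (1 + r) * (x - x_prev)
              - beta * np.sqrt(s) / (1 + r) * (grad(x) - grad(x_prev))
              - s / (1 + r) * grad(x))
    x_prev, x = x, x_next

f_val = 0.5 * np.sum(d * x**2)  # min f = 0; the values decay linearly, like q^k
print(f_val)
```

The gradient-difference term costs nothing extra per iteration (the gradient at the previous iterate is already available), which is one practical appeal of Hessian-driven damping.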
2. In fact, even for $\beta>0$, by lifting the problem to the vector $z_k=\begin{pmatrix}x_k-x^\star\\ x_{k-1}-x^\star\end{pmatrix}$, as is standard in the (HBF) method, one can write (IGAHD-SC) as a first-order recursion in $z_k$.

Consider the discrete energy
\[
E_k:=f(x_k)-f(x^\star)+\frac{1}{2}\|v_k\|^2,
\]
where $x^\star$ is the unique minimizer of $f$, and
\[
v_k=\sqrt{\mu}\,(x_{k-1}-x^\star)+\frac{1}{\sqrt{s}}(x_k-x_{k-1})+\beta\nabla f(x_{k-1}).
\]
Then
\[
\begin{aligned}
v_{k+1}-v_k&=\sqrt{\mu}\,(x_k-x_{k-1})+\frac{1}{\sqrt{s}}\big(x_{k+1}-2x_k+x_{k-1}\big)+\beta\big(\nabla f(x_k)-\nabla f(x_{k-1})\big)\\
&=\frac{1}{\sqrt{s}}\Big[\big(x_{k+1}-2x_k+x_{k-1}\big)+\sqrt{\mu s}\,(x_k-x_{k-1})+\beta\sqrt{s}\,\big(\nabla f(x_k)-\nabla f(x_{k-1})\big)\Big]\\
&=\frac{1}{\sqrt{s}}\Big[-s\nabla f(x_k)-\sqrt{\mu s}\,(x_{k+1}-x_{k-1})+\sqrt{\mu s}\,(x_k-x_{k-1})\Big]\qquad\text{(by (25))}\\
&=-\sqrt{\mu}\,(x_{k+1}-x_k)-\sqrt{s}\,\nabla f(x_k).
\end{aligned}
\]
Since $\frac{1}{2}\|v_{k+1}\|^2-\frac{1}{2}\|v_k\|^2=\langle v_{k+1}-v_k,\,v_{k+1}\rangle-\frac{1}{2}\|v_{k+1}-v_k\|^2$, we have
\[
\begin{aligned}
\frac{1}{2}\|v_{k+1}\|^2-\frac{1}{2}\|v_k\|^2
&=-\frac{1}{2}\Big\|\sqrt{\mu}\,(x_{k+1}-x_k)+\sqrt{s}\,\nabla f(x_k)\Big\|^2\\
&\quad-\Big\langle \sqrt{\mu}\,(x_{k+1}-x_k)+\sqrt{s}\,\nabla f(x_k),\ \sqrt{\mu}\,(x_k-x^\star)+\frac{1}{\sqrt{s}}(x_{k+1}-x_k)+\beta\nabla f(x_k)\Big\rangle\\
&=-\mu\langle x_{k+1}-x_k,\,x_k-x^\star\rangle-\sqrt{\frac{\mu}{s}}\,\|x_{k+1}-x_k\|^2-\beta\sqrt{\mu}\,\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle\\
&\quad-\sqrt{\mu s}\,\langle\nabla f(x_k),\,x_k-x^\star\rangle-\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle-\beta\sqrt{s}\,\|\nabla f(x_k)\|^2\\
&\quad-\frac{\mu}{2}\|x_{k+1}-x_k\|^2-\frac{s}{2}\|\nabla f(x_k)\|^2-\sqrt{\mu s}\,\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle.
\end{aligned}
\]
By $\mu$-strong convexity of $f$,
\[
f(x^\star)\ \ge\ f(x_k)+\langle\nabla f(x_k),\,x^\star-x_k\rangle+\frac{\mu}{2}\|x_k-x^\star\|^2,
\]
and, using in addition that $\nabla f$ is $L$-Lipschitz continuous,
\[
\begin{aligned}
f(x_k)&\ \ge\ f(x_{k+1})+\langle\nabla f(x_{k+1}),\,x_k-x_{k+1}\rangle+\frac{\mu}{2}\|x_{k+1}-x_k\|^2\\
&\ \ge\ f(x_{k+1})+\langle\nabla f(x_k),\,x_k-x_{k+1}\rangle+\Big(\frac{\mu}{2}-L\Big)\|x_{k+1}-x_k\|^2 .
\end{aligned}
\]
Combining the results above, and after dividing by $\sqrt{s}$, we get
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}\big(E_{k+1}-E_k\big)+\sqrt{\mu}\,\Big[f(x_{k+1})-f(x^\star)+\frac{\mu}{2}\|x_k-x^\star\|^2\Big]+\sqrt{\mu}\,\big(f(x_k)-f(x_{k+1})\big)\\
&\quad\le\ -\frac{\mu}{\sqrt{s}}\langle x_{k+1}-x_k,\,x_k-x^\star\rangle-\frac{\sqrt{\mu}}{s}\|x_{k+1}-x_k\|^2-\beta\sqrt{\frac{\mu}{s}}\,\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle\\
&\qquad+\frac{1}{\sqrt{s}}\Big(L-\frac{\mu}{2}\Big)\|x_{k+1}-x_k\|^2-\frac{\mu}{2\sqrt{s}}\|x_{k+1}-x_k\|^2\\
&\qquad-\Big(\beta+\frac{1}{2}\sqrt{s}\Big)\|\nabla f(x_k)\|^2-\sqrt{\mu}\,\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle .
\end{aligned}
\]
Hence
\[
\begin{aligned}
\frac{1}{\sqrt{s}}\big(E_{k+1}-E_k\big)+\sqrt{\mu}\,E_{k+1}
&\le\ \sqrt{\mu}\,\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle+\frac{\sqrt{\mu}\,L}{2}\|x_{k+1}-x_k\|^2\\
&\quad+\frac{\sqrt{\mu}}{2}\Big\|\frac{1}{\sqrt{s}}(x_{k+1}-x_k)+\beta\nabla f(x_k)\Big\|^2+\mu\Big\langle x_k-x^\star,\ \frac{1}{\sqrt{s}}(x_{k+1}-x_k)+\beta\nabla f(x_k)\Big\rangle\\
&\quad-\frac{\mu}{\sqrt{s}}\langle x_{k+1}-x_k,\,x_k-x^\star\rangle-\frac{\sqrt{\mu}}{s}\|x_{k+1}-x_k\|^2-\beta\sqrt{\frac{\mu}{s}}\,\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle\\
&\quad+\frac{1}{\sqrt{s}}\Big(L-\frac{\mu}{2}\Big)\|x_{k+1}-x_k\|^2-\frac{\mu}{2\sqrt{s}}\|x_{k+1}-x_k\|^2\\
&\quad-\Big(\beta+\frac{1}{2}\sqrt{s}\Big)\|\nabla f(x_k)\|^2-\sqrt{\mu}\,\langle\nabla f(x_k),\,x_{k+1}-x_k\rangle .
\end{aligned}
\]
Using $\langle x_k-x^\star,\nabla f(x_k)\rangle\le L\|x_k-x^\star\|^2$ together with $\|x_k-x^\star\|^2\le 2\|x_{k+1}-x^\star\|^2+2\|x_{k+1}-x_k\|^2$, we obtain
\[
\langle x_k-x^\star,\nabla f(x_k)\rangle\ \le\ 2L\,\|x_{k+1}-x^\star\|^2+2L\,\|x_{k+1}-x_k\|^2 .
\]
Therefore
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}\big(E_{k+1}-E_k\big)+\sqrt{\mu}\,E_{k+1}
+\Big(\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}-L\Big(2\beta\mu+\frac{1}{\sqrt{s}}+\frac{\sqrt{\mu}}{2}\Big)\Big)\|x_{k+1}-x_k\|^2\\
&\quad+\Big(\beta-\frac{\beta^2\sqrt{\mu}}{2}+\frac{\sqrt{s}}{2}\Big)\|\nabla f(x_{k+1})\|^2-2\beta\mu L\,\|x_{k+1}-x^\star\|^2\ \le\ 0.
\end{aligned}
\]
According to $0\le\beta\le\frac{1}{\sqrt{\mu}}$, we have $\beta-\frac{\beta^2\sqrt{\mu}}{2}\ge\frac{\beta}{2}$, which gives
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}\big(E_{k+1}-E_k\big)+\sqrt{\mu}\,E_{k+1}
+\Big(\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}-L\Big(2\beta\mu+\frac{1}{\sqrt{s}}+\frac{\sqrt{\mu}}{2}\Big)\Big)\|x_{k+1}-x_k\|^2\\
&\quad+\frac{\beta}{2}\|\nabla f(x_{k+1})\|^2-2\beta\mu L\,\|x_{k+1}-x^\star\|^2\ \le\ 0.
\end{aligned}
\]
Moreover, by strong convexity of $f$,
\[
E_{k+1}\ \ge\ \frac{1}{2}E_{k+1}+\frac{1}{2}\big(f(x_{k+1})-f(x^\star)\big)\ \ge\ \frac{1}{2}E_{k+1}+\frac{\mu}{4}\|x_{k+1}-x^\star\|^2 .
\]
Therefore
\[
\begin{aligned}
&\frac{1}{\sqrt{s}}\big(E_{k+1}-E_k\big)+\frac{1}{2}\sqrt{\mu}\,E_{k+1}+\Big(\frac{\mu\sqrt{\mu}}{4}-2\beta\mu L\Big)\|x_{k+1}-x^\star\|^2\\
&\quad+\Big(\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}-L\Big(2\beta\mu+\frac{1}{\sqrt{s}}+\frac{\sqrt{\mu}}{2}\Big)\Big)\|x_{k+1}-x_k\|^2+\frac{\beta}{2}\|\nabla f(x_{k+1})\|^2\ \le\ 0 .
\end{aligned}
\]
Let us examine the sign of the above quantities. Under the condition $L\le\frac{\sqrt{\mu}}{8\beta}$, we have $\frac{\mu\sqrt{\mu}}{4}-2\beta\mu L\ge 0$. Under the condition
\[
L\ \le\ \frac{\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}}{2\beta\mu+\frac{1}{\sqrt{s}}+\frac{\sqrt{\mu}}{2}},
\]
we have $\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}-L\big(2\beta\mu+\frac{1}{\sqrt{s}}+\frac{\sqrt{\mu}}{2}\big)\ge 0$. Therefore, under the above conditions,
\[
\frac{1}{\sqrt{s}}\big(E_{k+1}-E_k\big)+\frac{1}{2}\sqrt{\mu}\,E_{k+1}+\frac{\beta}{2}\|\nabla f(x_{k+1})\|^2\ \le\ 0 .
\]
Set $q=\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}$, which satisfies $0<q<1$. By a similar argument as in Theorem 9, we obtain $E_k\le E_1 q^{k-1}$, and hence
\[
f(x_k)-f(x^\star)=\mathcal{O}\big(q^k\big).
\]
6 Numerical results
Given a symmetric positive definite matrix $S$ and a proper lsc convex function $h$, define
\[
h_S(x)=\min_{z\in\mathbb{R}^n}\Big\{\frac{1}{2}\|z-x\|_S^2+h(z)\Big\},
\qquad
\operatorname{prox}_h^S(x)=\operatorname*{argmin}_{z\in\mathbb{R}^n}\Big\{\frac{1}{2}\|z-x\|_S^2+h(z)\Big\}.
\]
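For a diagonal metric $S$ and $h=\|\cdot\|_1$, this prox separates across coordinates into scalar soft-thresholding with coordinate-dependent thresholds $1/S_{ii}$. A minimal sanity check (our own toy example, not from the paper) compares this closed form with a brute-force grid minimization:

```python
import numpy as np

def prox_l1_metric(x, s_diag):
    # prox_h^S for h = ||.||_1 and S = diag(s_diag):
    # argmin_z 0.5 * (z - x)^T S (z - x) + ||z||_1
    # separates into soft-thresholding with threshold 1/s_i per coordinate.
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / s_diag, 0.0)

x = np.array([1.3, -0.2, 0.8])
s_diag = np.array([2.0, 0.5, 4.0])
z_closed = prox_l1_metric(x, s_diag)

# brute-force check, coordinate by coordinate, on a fine grid
grid = np.linspace(-3.0, 3.0, 60001)
z_grid = np.empty_like(x)
for i in range(x.size):
    obj = 0.5 * s_diag[i] * (grid - x[i]) ** 2 + np.abs(grid)
    z_grid[i] = grid[np.argmin(obj)]

print(np.max(np.abs(z_closed - z_grid)))  # small (limited by the ~1e-4 grid resolution)
```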
which is nothing but the forward–backward fixed-point operator for the objective in (RLS). Moreover, $f_M$ is a continuously differentiable convex function whose gradient is
\[
\nabla f_M(x)=x-\operatorname{prox}_f^M(x),
\]
and
\[
\operatorname*{argmin}_{\mathcal{H}}f=\operatorname{Fix}\big(\operatorname{prox}_f^M\big)=\operatorname*{argmin}_{\mathcal{H}}f_M .
\]
We are then in a position to solve (RLS) by simply applying (IGAHD) (see Sect. 3.2) to $f_M$. We infer from Theorem 6 and the properties of $f_M$ that
\[
f\big(\operatorname{prox}_f^M(x_k)\big)-\min_{\mathbb{R}^n}f=\mathcal{O}\big(k^{-2}\big).
\]
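To illustrate the mechanics, note that a unit gradient step on $f_M$, $x_{k+1}=x_k-\nabla f_M(x_k)=\operatorname{prox}_f^M(x_k)$, is exactly one forward–backward step on (RLS). The sketch below (our own toy Lasso instance with $g=\lambda\|\cdot\|_1$; data, sizes, and tolerances are illustrative assumptions) uses the closed form of $\operatorname{prox}_f^M$ derived in the Appendix and checks the Lasso optimality condition at the limit point:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = rng.standard_normal(m)
lam = 0.1                                # weight of g = lam * ||.||_1
s = 0.9 / np.linalg.norm(A, 2) ** 2      # ensures s * ||A||^2 < 1

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_M_f(x):
    # closed form from the Appendix: prox_f^M(x) = prox_{s g}(x - s A^T (A x - y));
    # for g = lam * ||.||_1, prox_{s g} is soft-thresholding at s * lam.
    return soft(x - s * A.T @ (A @ x - y), s * lam)

x = np.zeros(n)
for _ in range(5000):
    x = prox_M_f(x)   # unit gradient step on f_M == one forward-backward step on (RLS)

fixed_point_gap = np.linalg.norm(x - prox_M_f(x))  # equals ||grad f_M(x)||
res = A.T @ (A @ x - y)
print(fixed_point_gap, np.max(np.abs(res)))  # gap ~ 0; max |res| <= lam at optimality
```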
(IGAHD) and FISTA (i.e. (IGAHD) with $\beta=0$) were applied to $f_M$ with four instances of $g$: the $\ell_1$ norm, the $\ell_1-\ell_2$ norm, the total variation, and the nuclear norm. The results are depicted in Fig. 3. One can clearly see that the convergence profiles observed for both algorithms agree with the predicted rate. Moreover, (IGAHD) exhibits, as expected, fewer oscillations than FISTA, and eventually converges faster.
7 Conclusion, perspectives
As a guideline for our study, inertial dynamics with Hessian-driven damping give rise to a new class of first-order algorithms for convex optimization. While retaining
the fast convergence of the function values reminiscent of the Nesterov accelerated
algorithm, they benefit from additional favorable properties among which the most
important are:
• fast convergence of gradients towards zero;
• global convergence of the iterates to optimal solutions;
• extension to the non-smooth setting;
• acceleration via time scaling factors.
This article contains the core of our study with a particular focus on the gradient and
proximal methods. The results thus obtained pave the way to new research avenues.
For instance:
• as initiated in Sect. 6, apply these results to structured composite optimization
problems beyond (RLS) and develop corresponding splitting algorithms;
• with the additional gradient estimates, we can expect restart methods to work better in the presence of the Hessian damping term;
• deepen the link between our study and the Newton and Levenberg–Marquardt dynamics and algorithms (e.g., [13]), and with the Ravine method [23];
• the inertial dynamic with Hessian driven damping goes well with tame analysis and
Kurdyka–Lojasiewicz property [2], suggesting that the corresponding algorithms
be developed in a non-convex (or even non-smooth) setting.
A Auxiliary results
Let $f$ be convex with $L$-Lipschitz continuous gradient, and let $0<s\le 1/L$. Then, for all $x,y\in\mathcal{H}$,
\[
f\big(y-s\nabla f(y)\big)\ \le\ f(x)+\langle\nabla f(y),\,y-x\rangle-\frac{s}{2}\|\nabla f(y)\|^2-\frac{s}{2}\|\nabla f(x)-\nabla f(y)\|^2 . \tag{28}
\]
Moreover, setting $y^{+}=y-s\nabla f(y)$,
\[
f(y^{+})\ \le\ f(y)-\frac{s}{2}(2-Ls)\|\nabla f(y)\|^2\ \le\ f(y)-\frac{s}{2}\|\nabla f(y)\|^2 . \tag{29}
\]
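Both inequalities are straightforward to check numerically. A minimal sketch (a random convex quadratic with $s=1/L$; the test setup is entirely our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.standard_normal((n, n))
Q = B.T @ B + np.eye(n)            # symmetric positive definite Hessian
L = np.linalg.eigvalsh(Q).max()    # Lipschitz constant of the gradient
s = 1.0 / L

def f(x):
    return 0.5 * x @ Q @ x

def grad(x):
    return Q @ x

ok = True
for _ in range(100):
    x, v = rng.standard_normal(n), rng.standard_normal(n)
    # inequality (28), with y = v
    lhs28 = f(v - s * grad(v))
    rhs28 = (f(x) + grad(v) @ (v - x)
             - 0.5 * s * np.linalg.norm(grad(v)) ** 2
             - 0.5 * s * np.linalg.norm(grad(x) - grad(v)) ** 2)
    # inequality (29), with y+ = v - s * grad(v)
    rhs29 = f(v) - 0.5 * s * (2 - L * s) * np.linalg.norm(grad(v)) ** 2
    ok = ok and lhs28 <= rhs28 + 1e-8 and lhs28 <= rhs29 + 1e-8
print(ok)  # True
```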
We now argue by duality between strong convexity and Lipschitz continuity of the gradient of a convex function. Indeed, using the Fenchel identity $f^*(\nabla f(x))=\langle x,\nabla f(x)\rangle-f(x)$ together with the $\frac{1}{L}$-strong convexity of $f^*$, we have
\[
f^*(\nabla f(y))\ \ge\ f^*(\nabla f(x))+\langle x,\,\nabla f(y)-\nabla f(x)\rangle+\frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^2 .
\]
Hence
\[
\begin{aligned}
f(y)&\le -f^*(\nabla f(x))+\langle\nabla f(y),\,y\rangle-\langle x,\,\nabla f(y)-\nabla f(x)\rangle-\frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^2\\
&=-f^*(\nabla f(x))+\langle x,\nabla f(x)\rangle+\langle\nabla f(y),\,y-x\rangle-\frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^2\\
&=f(x)+\langle\nabla f(y),\,y-x\rangle-\frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^2\\
&\le f(x)+\langle\nabla f(y),\,y-x\rangle-\frac{s}{2}\|\nabla f(x)-\nabla f(y)\|^2 .
\end{aligned}
\]
Proof We have
\[
\begin{aligned}
\operatorname{prox}_f^M(x)&=\operatorname*{argmin}_{z\in\mathbb{R}^n}\Big\{\frac{1}{2}\|z-x\|_M^2+f(z)\Big\}\\
&=\operatorname*{argmin}_{z\in\mathbb{R}^n}\Big\{\frac{1}{2s}\|z-x\|^2-\frac{1}{2}\|A(z-x)\|^2+\frac{1}{2}\|y-Az\|^2+g(z)\Big\}.
\end{aligned}
\]
Developing the squares and discarding the terms that do not depend on $z$, we get
\[
\begin{aligned}
\operatorname{prox}_f^M(x)&=\operatorname*{argmin}_{z\in\mathbb{R}^n}\Big\{\frac{1}{2s}\|z-x\|^2+\frac{1}{2}\|y-Ax\|^2-\langle A(x-z),\,Ax-y\rangle+g(z)\Big\}\\
&=\operatorname*{argmin}_{z\in\mathbb{R}^n}\Big\{\frac{1}{2s}\|z-x\|^2-\langle z-x,\,A^*(y-Ax)\rangle+g(z)\Big\}\\
&=\operatorname*{argmin}_{z\in\mathbb{R}^n}\Big\{\frac{1}{2s}\big\|z-\big(x-sA^*(Ax-y)\big)\big\|^2+g(z)\Big\}\\
&=\operatorname{prox}_{sg}\big(x-sA^*(Ax-y)\big).
\end{aligned}
\]
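Each step of this chain only drops terms independent of $z$ (the $g(z)$ term is common to both ends). As a quick numerical sanity check (our own random instance, not from the paper), the first and last smooth parts indeed differ by a $z$-independent constant:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 6
A = rng.standard_normal((m, n))
s = 0.5 / np.linalg.norm(A, 2) ** 2   # s * ||A||^2 < 1, so M = Id/s - A^T A is positive definite
x = rng.standard_normal(n)
y = rng.standard_normal(m)

def F1(z):
    # 0.5 * ||z - x||_M^2 + 0.5 * ||y - A z||^2, with ||v||_M^2 = ||v||^2 / s - ||A v||^2
    return (np.linalg.norm(z - x) ** 2 / (2 * s)
            - np.linalg.norm(A @ (z - x)) ** 2 / 2
            + np.linalg.norm(y - A @ z) ** 2 / 2)

def F2(z):
    # (1/(2s)) * || z - (x - s A^T (A x - y)) ||^2
    return np.linalg.norm(z - (x - s * A.T @ (A @ x - y))) ** 2 / (2 * s)

z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
c1, c2 = F1(z1) - F2(z1), F1(z2) - F2(z2)
print(abs(c1 - c2))  # ~ 0: the two objectives share the same argmin
```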
We here provide the closed-form solutions to (DIN-AVD)$_{\alpha,\beta,b}$ for the quadratic objective $f:\mathbb{R}^n\to\mathbb{R}$, $f(x)=\frac{1}{2}\langle Ax,x\rangle$, where $A$ is a symmetric positive definite matrix. The case of a positive semidefinite matrix $A$ can be treated similarly by restricting the analysis to $\ker(A)^{\perp}$. Projecting (DIN-AVD)$_{\alpha,\beta,b}$ on the eigenspaces of $A$, one has to solve $n$ independent one-dimensional ODEs of the form
\[
\ddot{x}_i(t)+\Big(\frac{\alpha}{t}+\beta(t)\lambda_i\Big)\dot{x}_i(t)+\lambda_i b(t)\,x_i(t)=0,\qquad i=1,\dots,n,
\]
where $\lambda_i>0$ is an eigenvalue of $A$. In the following, we drop the subscript $i$.
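This decoupling can be verified numerically: integrating the vector system and the $n$ scalar modal ODEs with the same explicit Euler scheme produces identical trajectories up to round-off. A minimal sketch (our own choice of $A$, parameters, and discretization; constant $\beta(t)\equiv\beta$ and $b(t)\equiv b$ for simplicity):

```python
import numpy as np

# (DIN-AVD) for f(x) = 0.5 <Ax, x>:  x'' + (alpha/t) x' + beta * A x' + b * A x = 0
alpha, beta, b = 3.0, 0.5, 1.0
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lams, V = np.linalg.eigh(A)            # A = V diag(lams) V^T

dt, t0, T = 1e-3, 1.0, 5.0
x0, v0 = np.array([1.0, -1.0]), np.zeros(2)

# explicit Euler on the vector system
x, v, t = x0.copy(), v0.copy(), t0
while t < T:
    acc = -(alpha / t) * v - beta * (A @ v) - b * (A @ x)
    x, v, t = x + dt * v, v + dt * acc, t + dt

# explicit Euler on the decoupled modes: xi'' + (alpha/t + beta*lam) xi' + lam*b xi = 0
xi, vi, t = V.T @ x0, V.T @ v0, t0
while t < T:
    acci = -(alpha / t + beta * lams) * vi - b * lams * xi
    xi, vi, t = xi + dt * vi, vi + dt * acci, t + dt

print(np.max(np.abs(x - V @ xi)))  # ~ 0: the two integrations coincide
```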
Case $\beta(t)\equiv\beta$, $b(t)=b+\gamma/t$, $\beta\ge 0$, $b>0$, $\gamma\ge 0$. The ODE reads
\[
\ddot{x}(t)+\Big(\frac{\alpha}{t}+\beta\lambda\Big)\dot{x}(t)+\lambda\Big(b+\frac{\gamma}{t}\Big)x(t)=0. \tag{30}
\]
• If $\beta^2\lambda^2\neq 4b\lambda$: set
\[
\xi=\sqrt{\beta^2\lambda^2-4b\lambda},\qquad
\kappa=\lambda\,\frac{\gamma-\alpha\beta/2}{\xi},\qquad
\sigma=\frac{\alpha-1}{2}.
\]
Using the relationship between the Whittaker functions and Kummer's confluent hypergeometric functions $M$ and $U$ (see [16]), the solution to (30) can be expressed in closed form in terms of $M$ and $U$; in the critical case $\beta^2\lambda^2=4b\lambda$, it involves the Bessel functions $J_\nu$ and $Y_\nu$ of the first and second kind.
When $\beta>0$, one can clearly see the exponential decrease forced by the Hessian. From the asymptotic expansions of $M$, $U$, $J_\nu$ and $Y_\nu$ for large $t$, straightforward computations provide the behaviour of $|x(t)|$ for large $t$ as follows:

• If $\beta^2\lambda^2>4b\lambda$, we have
\[
|x(t)|=\mathcal{O}\Big(t^{-\frac{\alpha}{2}}\,e^{-\frac{\beta\lambda}{2}t}\Big).
\]
• If $\beta^2\lambda^2=4b\lambda$, we have
\[
|x(t)|=\mathcal{O}\Big(t^{-\frac{2\alpha-1}{4}}\,e^{-\frac{\beta\lambda}{2}t}\Big).
\]
\[
\ddot{y}(\tau)+\Big(\frac{\alpha+\beta}{(1+\beta)\tau}+\frac{\lambda}{1+\beta}\Big)\dot{y}(\tau)+\frac{c\lambda}{(1+\beta)^2\tau}\,y(\tau)=0.
\]
It is clear that this is a special case of (30). Since $\beta$ and $\lambda>0$, set
\[
\xi=\frac{\lambda}{1+\beta},\qquad
\kappa=-\frac{\alpha+\beta-2c}{2(1+\beta)},\qquad
\sigma=\frac{\alpha+\beta}{2(1+\beta)}-\frac{1}{2}.
\]
Asymptotic estimates can also be derived similarly to the above. We omit the details for the sake of brevity.
References
1. Álvarez, F.: On the minimizing property of a second-order dissipative system in Hilbert spaces. SIAM
J. Control Optim. 38(4), 1102–1119 (2000)
2. Álvarez, F., Attouch, H., Bolte, J., Redont, P.: A second-order gradient-like dissipative dynamical
system with Hessian-driven damping. Application to optimization and mechanics. J. Math. Pures
Appl. 81(8), 747–779 (2002)
3. Apidopoulos, V., Aujol, J.-F., Dossal, C.: Convergence rate of inertial Forward–Backward algorithm
beyond Nesterov’s rule. Math. Program. Ser. B. 180, 137–156 (2020)
4. Attouch, H., Cabot, A.: Asymptotic stabilization of inertial gradient dynamics with time-dependent
viscosity. J. Differ. Equ. 263, 5412–5458 (2017)
5. Attouch, H., Cabot, A.: Convergence rates of inertial forward–backward algorithms. SIAM J. Optim.
28(1), 849–874 (2018)
6. Attouch, H., Cabot, A., Chbani, Z., Riahi, H.: Rate of convergence of inertial gradient dynamics with
time-dependent viscous damping coefficient. Evol. Equ. Control Theory 7(3), 353–371 (2018)
7. Attouch, H., Chbani, Z., Riahi, H.: Fast proximal methods via time scaling of damped inertial dynamics.
SIAM J. Optim. 29(3), 2227–2256 (2019)
8. Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algo-
rithms with asymptotic vanishing viscosity. Math. Program. Ser. B. 168, 123–175 (2018)
9. Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the Nesterov accelerated gradient method
in the subcritical case α ≤ 3. ESAIM Control Optim. Calc. Var. 25, 2–35 (2019)
10. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov's accelerated forward–backward method is actually faster than 1/k². SIAM J. Optim. 26(3), 1824–1834 (2016)
11. Attouch, H., Peypouquet, J., Redont, P.: A dynamical approach to an inertial forward–backward algo-
rithm for convex minimization. SIAM J. Optim. 24(1), 232–256 (2014)
12. Attouch, H., Peypouquet, J., Redont, P.: Fast convex minimization via inertial dynamics with Hessian driven damping. J. Differ. Equ. 261(10), 5734–5783 (2016)
13. Attouch, H., Svaiter, B.F.: A continuous dynamical Newton-like approach to solving monotone inclusions. SIAM J. Control Optim. 49(2), 574–598 (2011); Global convergence of a closed-loop regularized Newton method for solving monotone inclusions in Hilbert spaces. J. Optim. Theory Appl. 157(3), 624–650 (2013)
14. Aujol, J.-F., Dossal, Ch.: Stability of over-relaxations for the forward-backward algorithm, application
to FISTA. SIAM J. Optim. 25(4), 2408–2433 (2015)
15. Aujol, J.-F., Dossal, C.: Optimal rate of convergence of an ODE associated to the Fast Gradient Descent
schemes for b>0 (2017). https://hal.inria.fr/hal-01547251v2
16. Bateman, H.: Higher Transcendental Functions, vol. 1. McGraw-Hill, New York (1953)
17. Bauschke, H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces.
CMS Books in Mathematics, Springer (2011)
18. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM J. Imaging Sci. 2(1), 183–202 (2009)
19. Brézis, H.: Opérateurs maximaux monotones dans les espaces de Hilbert et équations d’évolution,
Lecture Notes 5, North Holland, (1972)
20. Cabot, A., Engler, H., Gadat, S.: On the long time behavior of second order differential equations with
asymptotically small dissipation. Trans. Am. Math. Soc. 361, 5983–6017 (2009)
21. Chambolle, A., Dossal, Ch.: On the convergence of the iterates of the fast iterative shrinkage thresh-
olding algorithm. J. Optim. Theory Appl. 166, 968–982 (2015)
22. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25,
161–319 (2016)
23. Gelfand, I.M., Zejtlin, M.: Printsip nelokal'nogo poiska v sistemakh avtomaticheskoi optimizatsii. Dokl. AN SSSR 137, 295–298 (1961) (in Russian)
24. May, R.: Asymptotic for a second-order evolution equation with convex potential and vanishing damp-
ing term. Turk. J. Math. 41(3), 681–685 (2017)
25. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k²). Sov. Math. Doklady 27, 372–376 (1983)
26. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 152(1–
2), 381–404 (2015)
27. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. U.S.S.R. Comput.
Math. Math. Phys. 4, 1–17 (1964)
28. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)
29. Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions. arXiv:1903.05671 (2019)
30. Shi, B., Du, S.S., Jordan, M.I., Su, W.J.: Understanding the acceleration phenomenon via high-resolution differential equations. arXiv:1810.08907 (2018)
31. Su, W.J., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient
method: theory and insights. NIPS’14 27, 2510–2518 (2014)
32. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization.
arXiv:1611.02635 (2016)
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.