Introductory Course On Non-Smooth Optimisation: Lecture 01 - Gradient Methods

This document introduces gradient descent methods for unconstrained smooth optimisation: minimising a convex function F(x) by an iterative process. At each iteration k, a descent direction dk is chosen such that the inner product ⟨∇F(xk), dk⟩ is negative; an exact or backtracking line search then selects a step-size γk that (approximately) minimises F along the ray xk + γdk, giving the next iterate xk+1. The goal is for xk to converge to a minimiser x⋆ of F(x) as k → ∞.


Introductory Course on Non-smooth Optimisation

Lecture 01 - Gradient methods

Jingwei Liang

Department of Applied Mathematics and Theoretical Physics


Table of contents

1 Unconstrained smooth optimisation

2 Descent methods

3 Gradient of convex functions

4 Gradient descent

5 Heavy-ball method

6 Nesterov’s optimal schemes

7 Dynamical system
Convexity

Convex set
A set S ⊂ Rn is convex if for any θ ∈ [0, 1] and two points x, y ∈ S,
θx + (1 − θ)y ∈ S.

Convex function
Function F : Rn → R is convex if dom(F) is convex and for all x, y ∈ dom(F) and θ ∈ [0, 1],
F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y).

Proper convex: F(x) < +∞ for at least one x and F(x) > −∞ for all x.

1st-order condition: F is continuously differentiable


F(y) ≥ F(x) + ⟨∇F(x), y − x⟩, ∀x, y ∈ dom(F).

2nd-order condition: if F is twice differentiable


∇²F(x) ⪰ 0, ∀x ∈ dom(F).



Unconstrained smooth optimisation

Problem
Unconstrained smooth optimisation
min F(x),
x∈Rn

where F : Rn → R is proper convex and smooth (differentiable).

Optimality condition: let x⋆ be a minimiser of F(x); then


0 = ∇F(x⋆).

[Figure: gradient ∇F(x) of a convex function F; ∇F(x⋆) = 0 at the minimiser.]



Example: quadratic minimisation

Quadratic programming
General quadratic programming problem
min ½ xᵀAx + bᵀx + c,
x∈Rn

where A ∈ Rn×n is symmetric positive definite, b ∈ Rn and c ∈ R.

Optimality condition:
0 = Ax⋆ + b.

Special case: least squares


||Ax − b||² = xᵀ(AᵀA)x − 2(Aᵀb)ᵀx + bᵀb.

Optimality condition
AᵀAx⋆ = Aᵀb.
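As a quick numerical check (not from the original slides; the matrix A and vector b below are made up), a minimal NumPy sketch that solves the normal equations and verifies that the gradient of ||Ax − b||² vanishes at the solution:

import numpy as np

# Quick check of the normal equations (illustration only; A, b are made up):
# solve A^T A x = A^T b and verify that the gradient of ||Ax - b||^2 vanishes.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x_star = np.linalg.solve(A.T @ A, A.T @ b)   # optimality condition A^T A x* = A^T b
grad = 2 * A.T @ (A @ x_star - b)            # gradient of ||Ax - b||^2
print(np.linalg.norm(grad))                  # ~ 0 up to rounding error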



Example: geometric programming

Geometric programming
min_{x∈Rn} log( Σ_{i=1}^m exp(aᵢᵀx + bᵢ) ).

Optimality condition:
0 = (1 / Σ_{j=1}^m exp(aⱼᵀx⋆ + bⱼ)) · Σ_{i=1}^m exp(aᵢᵀx⋆ + bᵢ) aᵢ.
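As an illustration (not from the slides), a small Python sketch that evaluates the geometric-programming objective and its gradient; the max-shift is the usual trick for numerical stability, and the data a, b below are made up:

import numpy as np

# Value and gradient of F(x) = log(sum_i exp(a_i^T x + b_i)), with the usual
# max-shift for numerical stability. The data `a`, `b` below are made up.
def logsumexp_value_grad(a, b, x):
    z = a @ x + b                      # z_i = a_i^T x + b_i
    zmax = z.max()
    w = np.exp(z - zmax)               # unnormalised weights
    value = zmax + np.log(w.sum())
    grad = a.T @ (w / w.sum())         # grad F(x) = sum_i softmax_i(z) a_i
    return value, grad

a = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]])
b = np.array([0.1, -0.2, 0.0])
print(logsumexp_value_grad(a, b, np.zeros(2)))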



Outline

1 Unconstrained smooth optimisation

2 Descent methods

3 Gradient of convex functions

4 Gradient descent

5 Heavy-ball method

6 Nesterov’s optimal schemes

7 Dynamical system
Problem

Unconstrained smooth optimisation


Consider minimising
min F(x),
x∈Rn

where F : Rn → R is proper convex and smooth (differentiable).




Problem

Unconstrained smooth optimisation


Consider minimising
min F(x),
x∈Rn

where F : Rn → R is proper convex and smooth (differentiable).

The set of minimisers, i.e.


Argmin(F) = {x ∈ Rn : F(x) = min_{x∈Rn} F(x)}

is non-empty.

However, in general there is no closed-form expression for a minimiser x⋆ ∈ Argmin(F).

Iterative strategy to find one x⋆ ∈ Argmin(F): start from x0 and generate a sequence
{xk}k∈N such that
lim_{k→∞} xk = x⋆ ∈ Argmin(F).



Problem

Unconstrained smooth optimisation


Consider minimising
min F(x),
x∈Rn

where F : Rn → R is proper convex and smooth (differentiable).

[Figure: iterates xk−1, xk, xk+1, xk+2 approaching the minimiser x⋆.]



Descent methods

Iterative scheme
For each k = 1, 2, ..., find γk > 0 and dk ∈ Rn and then
xk+1 = xk + γk dk ,

where
dk is called search/descent direction.
γk is called step-size.

Descent methods
An algorithm is called a descent method if there holds
F(xk+1 ) < F(xk ).

NB: if xk ∈ Argmin(F), then xk+1 = xk ...



Conditions

From convexity of F, we have


F(xk+1) ≥ F(xk) + ⟨∇F(xk), xk+1 − xk⟩,

which gives
⟨∇F(xk), xk+1 − xk⟩ ≥ 0 =⇒ F(xk+1) ≥ F(xk).

Since xk+1 − xk = γk dk , the direction dk should be such that


⟨∇F(xk), dk⟩ < 0.

[Figure: the gradient ∇F(xk) at the iterate xk, with the minimiser x⋆.]



General descent method

General descent method


initial : x0 ∈ dom(F);
repeat :
1. Find a descent direction dk .
2. Choose a step-size γk : line search.
3. Update xk+1 = xk + γk dk .
until : stopping criterion is satisfied.

Stopping criterion: ε > 0 is the tolerance,


Function value: |F(xk+1) − F(xk)| ≤ ε (can be time consuming).
Sequence: ||xk+1 − xk|| ≤ ε.
Optimality condition: ||∇F(xk)|| ≤ ε.
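The loop above can be written schematically as follows (a sketch, not from the slides; direction_fn and step_fn are hypothetical placeholders for steps 1 and 2):

import numpy as np

# Schematic sketch of the general descent method above. `direction_fn` and
# `step_fn` are hypothetical placeholders for steps 1 and 2 (e.g. the negative
# gradient and a line search); the stopping rule is ||grad F(x_k)|| <= eps.
def descent_method(grad_F, direction_fn, step_fn, x0, eps=1e-8, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        g = grad_F(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion
            break
        d = direction_fn(x, g)         # 1. descent direction, <grad F(x_k), d_k> < 0
        gamma = step_fn(x, d)          # 2. step-size from a line search
        x = x + gamma * d              # 3. update x_{k+1} = x_k + gamma_k d_k
    return x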



Exact line search

Exact line search


Suppose that the direction dk is given. Choose γk such that F(x) is minimised along the ray
xk + γdk , γ > 0:
γk = argmin_{γ>0} F(xk + γdk).

Useful when the minimisation problem for γk is simple.


γk can be found analytically for special cases.

[Figure: F(xk + γdk) as a function of γ, minimised at γ = γk.]
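For instance, for a quadratic F(x) = ½xᵀAx + bᵀx with A positive definite, setting the derivative of γ ↦ F(xk + γdk) to zero gives γk = −⟨∇F(xk), dk⟩ / ⟨dk, Adk⟩; a small Python sketch (illustration only, not from the slides):

import numpy as np

# Exact line search for the quadratic F(x) = 0.5 x^T A x + b^T x:
# d/dgamma F(x + gamma d) = 0 gives gamma_k = -<grad F(x), d> / <d, A d>.
def exact_step_quadratic(A, b, x, d):
    g = A @ x + b                      # grad F(x)
    return -(g @ d) / (d @ (A @ d))    # requires d^T A d > 0 (A positive definite)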



Backtracking/inexact line search

Backtracking line search


Suppose that the direction dk is given. Choose δ ∈ ]0, 0.5[ and β ∈ ]0, 1[, let γ = 1:
while F(xk + γdk) > F(xk) + δγ⟨∇F(xk), dk⟩ : γ = βγ.

Reduce F enough along the direction dk .


Since dk is a descent direction
⟨∇F(xk), dk⟩ < 0.

Stopping criterion for backtracking:


F(xk + γdk) ≤ F(xk) + δγ⟨∇F(xk), dk⟩.

When γ is small enough,


F(xk + γdk) ≈ F(xk) + γ⟨∇F(xk), dk⟩ < F(xk) + δγ⟨∇F(xk), dk⟩,

which means the backtracking loop will eventually terminate.
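A minimal Python sketch of this backtracking rule (illustration only; F and grad_F stand for the objective and its gradient):

import numpy as np

# Sketch of the backtracking rule above: shrink gamma by beta until the
# sufficient-decrease condition holds; delta in ]0, 0.5[, beta in ]0, 1[.
def backtracking(F, grad_F, x, d, delta=0.3, beta=0.5):
    gamma = 1.0
    slope = grad_F(x) @ d              # <grad F(x_k), d_k> < 0 for a descent direction
    while F(x + gamma * d) > F(x) + delta * gamma * slope:
        gamma *= beta
    return gamma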



Backtracking/inexact line search

Backtracking line search


Suppose that the direction dk is given. Choose δ ∈ ]0, 0.5[ and β ∈ ]0, 1[, let γ = 1:
while F(xk + γdk) > F(xk) + δγ⟨∇F(xk), dk⟩ : γ = βγ.

[Figure: F(xk + γdk) versus the lines F(xk) + γ⟨∇F(xk), dk⟩ and F(xk) + δγ⟨∇F(xk), dk⟩; backtracking accepts a γ where the curve lies below the relaxed line.]



Outline

1 Unconstrained smooth optimisation

2 Descent methods

3 Gradient of convex functions

4 Gradient descent

5 Heavy-ball method

6 Nesterov’s optimal schemes

7 Dynamical system
Monotonicity

Monotonicity of gradient
Let F : Rn → R be proper convex and smooth differentiable, then
⟨∇F(x) − ∇F(y), x − y⟩ ≥ 0, ∀x, y ∈ dom(F).

C1 : proper convex and smooth (differentiable) functions on Rn.

Proof Owing to convexity, given x, y ∈ dom(F), we have


F(y) ≥ F(x) + ⟨∇F(x), y − x⟩

and
F(x) ≥ F(y) + ⟨∇F(y), x − y⟩.

Summing them up yields


⟨∇F(x) − ∇F(y), x − y⟩ ≥ 0.

NB: Let F ∈ C1 , F is convex if and only if ∇F(x) is monotone.



Lipschitz continuous gradient

Lipschitz continuity
The gradient of F is L-Lipschitz continuous if there exists L > 0 such that
||∇F(x) − ∇F(y)|| ≤ L||x − y||, ∀x, y ∈ dom(F).

CL1 : proper convex functions with L-Lipschitz continuous gradient on Rn .

If F ∈ CL1 , then
H(x) := (L/2)||x||² − F(x)
is convex.
Hint: monotonicity of ∇H(x), i.e.
⟨∇H(x) − ∇H(y), x − y⟩ = L||x − y||² − ⟨∇F(x) − ∇F(y), x − y⟩
                        ≥ L||x − y||² − L||x − y||²
                        = 0.



Descent lemma

Descent lemma, quadratic upper bound


Let F ∈ CL1 , then there holds
F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L/2)||y − x||², ∀x, y ∈ dom(F).

Proof Define H(t) = F(x + t(y − x)), then


F(y) − F(x) = H(1) − H(0) = ∫_0^1 H′(t) dt = ∫_0^1 (y − x)ᵀ∇F(x + t(y − x)) dt
            = ∫_0^1 (y − x)ᵀ∇F(x) dt + ∫_0^1 (y − x)ᵀ( ∇F(x + t(y − x)) − ∇F(x) ) dt
            ≤ (y − x)ᵀ∇F(x) + ∫_0^1 ||y − x|| ||∇F(x + t(y − x)) − ∇F(x)|| dt
            ≤ (y − x)ᵀ∇F(x) + ||y − x|| ∫_0^1 tL||y − x|| dt
            = (y − x)ᵀ∇F(x) + (L/2)||y − x||².

NB: this is the first-order condition of convexity for H(x) := (L/2)||x||² − F(x).



Descent lemma: consequences

Corollary
Let F ∈ CL1 and x⋆ ∈ Argmin(F), then
(1/(2L))||∇F(x)||² ≤ F(x) − F(x⋆) ≤ (L/2)||x − x⋆||², ∀x ∈ dom(F).

Proof Right-hand inequality: ∇F(x⋆) = 0,


F(x) ≤ F(x⋆) + ⟨∇F(x⋆), x − x⋆⟩ + (L/2)||x − x⋆||², ∀x ∈ dom(F).

Left-hand inequality:
F(x⋆) ≤ min_{y∈dom(F)} { F(x) + ⟨∇F(x), y − x⟩ + (L/2)||y − x||² }
      = F(x) − (1/(2L))||∇F(x)||².

The minimising y is y = x − (1/L)∇F(x).



Co-coercivity of gradient

Co-coercivity
Let F ∈ CL1 , then
⟨x − y, ∇F(x) − ∇F(y)⟩ ≥ (1/L)||∇F(x) − ∇F(y)||².

Co-coercivity implies Lipschitz continuity


For F ∈ CL1 , with H(x) := (L/2)||x||² − F(x):
Lipschitz continuity of ∇F =⇒ convexity of H(x)
                           =⇒ co-coercivity of ∇F(x)
                           =⇒ Lipschitz continuity of ∇F.



Co-coercivity of gradient

Co-coercivity
Let F ∈ CL1 , then
⟨x − y, ∇F(x) − ∇F(y)⟩ ≥ (1/L)||∇F(x) − ∇F(y)||².

Proof Define R(z) = F(z) − ⟨∇F(x), z⟩, then ∇R(x) = 0.


Recall the lemma
F ∈ CL1 and x⋆ ∈ Argmin(F): (1/(2L))||∇F(x)||² ≤ F(x) − F(x⋆) ≤ (L/2)||x − x⋆||².

Then we have
F(y) − F(x) − ⟨∇F(x), y − x⟩ = R(y) − R(x) ≥ (1/(2L))||∇R(y)||²
                                           = (1/(2L))||∇F(y) − ∇F(x)||².

Similarly, define S(z) = F(z) − ⟨∇F(y), z⟩, then


F(x) − F(y) − ⟨∇F(y), x − y⟩ = S(x) − S(y) ≥ (1/(2L))||∇F(x) − ∇F(y)||².

Summing the two inequalities gives the co-coercivity claim.



Strongly convex function

Strong convexity
Function F : Rn → R is strongly convex if dom(F) is convex and there exists α > 0 such that for all x, y ∈ dom(F) and θ ∈ [0, 1],
F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y) − (α/2)θ(1 − θ)||x − y||².

F is strongly convex with parameter α > 0 if


G(x) := F(x) − (α/2)||x||²

is convex.
Monotonicity:
⟨∇F(x) − ∇F(y), x − y⟩ ≥ α||x − y||², ∀x, y ∈ dom(F).

Second-order condition for strong convexity: if F ∈ C2 ,


∇²F(x) ⪰ αId, ∀x ∈ dom(F).



Quadratic lower bound

Quadratic lower bound


Let F ∈ C1 be α-strongly convex, then
F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + (α/2)||y − x||², ∀x, y ∈ dom(F).

Proof First-order condition of convexity for G(x) := F(x) − (α/2)||x||².

Corollary
Let F ∈ C1 be α-strongly convex and x⋆ ∈ Argmin(F), then
(α/2)||x − x⋆||² ≤ F(x) − F(x⋆) ≤ (1/(2α))||∇F(x)||², ∀x ∈ dom(F).

Proof Left-hand inequality: quadratic lower bound.


Right-hand inequality:
F(x⋆) ≥ min_{y∈dom(F)} { F(x) + ⟨∇F(x), y − x⟩ + (α/2)||y − x||² }
      = F(x) − (1/(2α))||∇F(x)||².



Extension of co-coercivity

If F ∈ CL1 and α-strongly convex, then


G(x) := F(x) − (α/2)||x||²

is convex, and ∇G is (L − α)-Lipschitz continuous.

The co-coercivity of ∇G yields


⟨∇F(x) − ∇F(y), x − y⟩ ≥ (αL/(α + L))||x − y||² + (1/(α + L))||∇F(x) − ∇F(y)||²

for all x, y ∈ dom(F).

S1α,L : functions in CL1 that are α-strongly convex.



Rate of convergence

Sequence {xk} converges linearly to x⋆ if


lim_{k→+∞} ||xk+1 − x⋆|| / ||xk − x⋆|| = ρ

holds for some ρ ∈ ]0, 1[, and ρ is called the rate of convergence.


If xk converges, let ρk = ||xk+1 − x⋆|| / ||xk − x⋆||:
– if lim_{k→+∞} ρk = 0: super-linear convergence.
– if lim_{k→+∞} ρk = 1: sub-linear convergence.

Super-linear convergence of order q > 1:


lim_{k→+∞} ||xk+1 − x⋆|| / ||xk − x⋆||^q < η

for some η ∈ ]0, 1[.


– q = 2: quadratic convergence.
– q = 3: cubic convergence.



Outline

1 Unconstrained smooth optimisation

2 Descent methods

3 Gradient of convex functions

4 Gradient descent

5 Heavy-ball method

6 Nesterov’s optimal schemes

7 Dynamical system
Unconstrained smooth optimisation

Unconstrained smooth optimisation


Consider minimising
min F(x),
x∈Rn

where F : Rn → R is proper convex and smooth (differentiable).

Assumptions:
F ∈ C1 is convex.
∇F(x) is L-Lipschitz continuous for some L > 0.
Set of minimisers is non-empty, i.e. Argmin(F) ≠ ∅.



Gradient descent

Descent direction: let d = −∇F(x), then


⟨∇F(x), d⟩ = −||∇F(x)||² ≤ 0.

Gradient descent
initial : x0 ∈ dom(F);
repeat :
1. Choose step-size γk > 0
2. Update xk+1 = xk − γk ∇F(xk )
until : stopping criterion is satisfied.
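A minimal sketch of gradient descent with the constant step-size γ = 1/L on the least-squares objective ||Ax − b||² (not from the slides; the problem data are made up and L is taken as 2λmax(AᵀA)):

import numpy as np

# Minimal gradient descent sketch on F(x) = ||Ax - b||^2 with constant
# step-size gamma = 1/L, where L = 2 * lambda_max(A^T A) is the Lipschitz
# constant of grad F. The problem data are made up for illustration.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

L = 2 * np.linalg.eigvalsh(A.T @ A).max()
gamma = 1.0 / L
x = np.zeros(10)
for k in range(5000):
    g = 2 * A.T @ (A @ x - b)          # grad F(x_k)
    if np.linalg.norm(g) <= 1e-10:     # stopping criterion
        break
    x = x - gamma * g                  # x_{k+1} = x_k - gamma * grad F(x_k)
print(np.linalg.norm(2 * A.T @ (A @ x - b)))   # ~ 0: the optimality condition holds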



Convergence analysis: constant step-size

Owing to the quadratic upper bound,


F(xk+1) ≤ F(xk) + ⟨∇F(xk), xk+1 − xk⟩ + (L/2)||xk+1 − xk||²
        = F(xk) − γ||∇F(xk)||² + (γ²L/2)||∇F(xk)||²
        = F(xk) − γ(1 − γL/2)||∇F(xk)||².

Hence
F(xk) − F(xk+1) ≥ γ(1 − γL/2)||∇F(xk)||².

Let γ ∈ ]0, 2/L[; summing over i = 0, ..., k,


γ(1 − γL/2) Σ_{i=0}^{k} ||∇F(xi)||² ≤ F(x0) − F(xk+1) ≤ F(x0) − F(x⋆).

Since F(x⋆) > −∞, the right-hand side is a finite constant.

Letting k → +∞ on the left-hand side,


lim_{k→+∞} ||∇F(xk)||² = 0.

NB: convexity is not required here.



Convergence analysis: constant step-size

Let γ ∈ ]0, 1/L], then γ(1 − γL/2) ≥ γ/2, and

F(xk+1) ≤ F(xk) − (γ/2)||∇F(xk)||²
(convexity of F at xk) ≤ F(x⋆) + ⟨∇F(xk), xk − x⋆⟩ − (γ/2)||∇F(xk)||²
        = F(x⋆) + (1/(2γ)) ( ||xk − x⋆||² − ||xk − x⋆ − γ∇F(xk)||² )
        = F(x⋆) + (1/(2γ)) ( ||xk − x⋆||² − ||xk+1 − x⋆||² ).


Summability of F(xk) − F(x⋆):


Σ_{i=1}^{k} ( F(xi) − F(x⋆) ) ≤ (1/(2γ)) Σ_{i=1}^{k} ( ||xi−1 − x⋆||² − ||xi − x⋆||² )
                              = (1/(2γ)) ( ||x0 − x⋆||² − ||xk − x⋆||² )
                              ≤ (1/(2γ)) ||x0 − x⋆||².

Since F(xk) − F(x⋆) is decreasing,


F(xk) − F(x⋆) ≤ (1/k) Σ_{i=1}^{k} ( F(xi) − F(x⋆) ) ≤ (1/(2γk)) ||x0 − x⋆||².



Convergence analysis: strongly convex F

Besides the basic assumptions, let’s further assume F ∈ S1α,L .

Recall that, for all x, y ∈ dom(F),


⟨∇F(x) − ∇F(y), x − y⟩ ≥ (αL/(α + L))||x − y||² + (1/(α + L))||∇F(x) − ∇F(y)||².

Analysis for constant step-size: let γ ∈ ]0, 2/(α + L)],


||xk+1 − x⋆||² = ||xk − γ∇F(xk) − x⋆||²
              = ||xk − x⋆||² − 2γ⟨∇F(xk), xk − x⋆⟩ + γ²||∇F(xk)||²
 (∇F(x⋆) = 0) ≤ (1 − 2γαL/(α + L))||xk − x⋆||² + γ(γ − 2/(α + L))||∇F(xk)||²
              ≤ (1 − 2γαL/(α + L))||xk − x⋆||².



Convergence analysis: strongly convex F

Distance to minimiser: with ρ = 1 − 2γαL/(α + L),

||xk − x⋆||² ≤ ρᵏ ||x0 − x⋆||²,

i.e. linear convergence. For γ = 2/(α + L),

ρ = ( (L − α)/(L + α) )².

Convergence rate of objective function value:


F(xk) − F(x⋆) ≤ (L/2)||xk − x⋆||² ≤ (ρᵏ L/2)||x0 − x⋆||².

Number of iterations k needed for F(xk) − F(x⋆) ≤ ε:


F ∈ CL1 : O(1/ε).
F ∈ S1α,L : O(log(1/ε)).



Limits on convergence rate of gradient descent

First-order method: xk is an element of the set


x0 + span{ ∇F(x0), ..., ∇F(xi), ..., ∇F(xk−1) }.    (4.1)

Problem class: CL1

Nesterov’s lower bound


For every integer k ≤ (n − 1)/2 and every x0, there exist functions in the problem class such that,
for any first-order method satisfying (4.1),

F(xk) − F(x⋆) ≥ (3/32) · L||x0 − x⋆||² / (k + 1)²,
||xk − x⋆||² ≥ (1/8) ||x0 − x⋆||².

Suggests O(1/k) is not the optimal rate.


Accelerated gradient methods can achieve an O(1/k²) rate.



Outline

1 Unconstrained smooth optimisation

2 Descent methods

3 Gradient of convex functions

4 Gradient descent

5 Heavy-ball method

6 Nesterov’s optimal schemes

7 Dynamical system
Observations

Gradient descent:
−γ∇F(xk ) = xk+1 − xk .

Consider the angle θk := angle(∇F(xk+1), ∇F(xk)); then
lim_{k→+∞} θk = 0.
Exercise: prove this claim for least squares.

Let a > 0 be some constant,


−∇F(xk+1 ) ≈ a(xk+1 − xk ).

[Figure: iterates xk−1, xk, xk+1, xk+2 approaching the minimiser x⋆.]



Heavy-ball method

Heavy-ball method (Polyak)


Initial : x0 ∈ dom(F) and γ ∈]0, 2/L[;
yk = xk + ak (xk − xk−1 ), ak ∈ [0, 1],
xk+1 = yk − γ∇F(xk ).

[Figure: heavy-ball iterates: the extrapolated point yk = xk + ak(xk − xk−1) and the new iterate xk+1, approaching the minimiser x⋆.]



Heavy-ball method

Heavy-ball method (Polyak)


Initial : x0 ∈ dom(F) and γ ∈]0, 2/L[;
yk = xk + ak (xk − xk−1 ), ak ∈ [0, 1],
xk+1 = yk − γ∇F(xk ).

xk − xk−1 is called the inertial term or momentum term.

ak is called the inertial parameter.

Convergence can be proved by studying the Lyapunov function


E(xk) := F(xk) + (ak/(2γ))||xk − xk−1||².

In general, no convergence rate for F ∈ CL1 . Local rate for F ∈ S2α,L .
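A minimal sketch of the heavy-ball iteration on a made-up strongly convex quadratic (illustration only; the parameters a, γ follow the choice in the theorem on the next slide):

import numpy as np

# Sketch of the heavy-ball iteration on a made-up strongly convex quadratic
# F(x) = 0.5 x^T Q x (minimiser x* = 0). The constants a, gamma follow the
# choice in the theorem on the next slide (illustration only).
rng = np.random.default_rng(2)
M = rng.standard_normal((10, 10))
Q = M.T @ M + np.eye(10)               # alpha = lambda_min(Q), L = lambda_max(Q)
evals = np.linalg.eigvalsh(Q)
alpha, Lc = evals[0], evals[-1]

a = ((np.sqrt(Lc) - np.sqrt(alpha)) / (np.sqrt(Lc) + np.sqrt(alpha))) ** 2
gamma = 4.0 / (np.sqrt(Lc) + np.sqrt(alpha)) ** 2

x_prev = x = rng.standard_normal(10)
for k in range(500):
    y = x + a * (x - x_prev)           # y_k = x_k + a (x_k - x_{k-1})
    x_prev, x = x, y - gamma * (Q @ x) # x_{k+1} = y_k - gamma * grad F(x_k)
print(np.linalg.norm(x))               # distance to the minimiser, should be tiny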



Convergence rate

Theorem
Let x⋆ be a (local) minimiser of F such that αId ⪯ ∇²F(x⋆) ⪯ LId, and choose a, γ with
a ∈ [0, 1[, γ ∈ ]0, 2(1 + a)/L[. There exists ρ̄ < 1 such that if ρ̄ < ρ < 1 and if x0, x1 are close
enough to x⋆, one has
||xk − x⋆|| ≤ Cρᵏ.

Moreover, if
a = ( (√L − √α)/(√L + √α) )²,   γ = 4/(√L + √α)²,   then   ρ = (√L − √α)/(√L + √α).

Starting points need to be close enough to x⋆.


Almost the optimal rate that can be achieved by a gradient method (or first-order method).
For gradient descent,
ρ = (L − α)/(L + α).



Convergence rate: proof

Taylor expansion
xk+1 = xk + a(xk − xk−1) − γ∇²F(x⋆)(xk − x⋆) + o(||xk − x⋆||).

Let zk = (xk − x⋆, xk−1 − x⋆)ᵀ and H = ∇²F(x⋆), then


zk+1 = M zk + o(||zk||),    where    M = [ (1 + a)Id − γH   −aId ;  Id   0 ]  (a 2×2 block matrix).

Spectral radius ρ(M): with η = 1 − γα,


0 = ρ² − (a + η)ρ + aη.

ρ(M) is a function of a and η (essentially of γ).



Outline

1 Unconstrained smooth optimisation

2 Descent methods

3 Gradient of convex functions

4 Gradient descent

5 Heavy-ball method

6 Nesterov’s optimal schemes

7 Dynamical system
Convergence rate of gradient descent

Gradient descent with constant step-size:


F ∈ CL1 :
F(xk) − F(x⋆) ≤ L||x0 − x⋆||² / (k + 4).

F ∈ S1α,L :
F(xk) − F(x⋆) ≤ (L/2) ( (L − α)/(L + α) )^{2k} ||x0 − x⋆||².

[Figure: iterates xk−1, xk, xk+1, xk+2 and the extrapolated point yk, approaching the minimiser x⋆.]



Nesterov’s optimal scheme

Optimal scheme with constant step-size


initial : Choose x0 ∈ Rn, φ0 ∈ ]0, 1[; let y0 = x0 and q = α/L.
repeat :
1. Compute φk+1 ∈ ]0, 1[ from the equation
   φk+1² = (1 − φk+1)φk² + qφk+1.
   Let ak = φk(1 − φk)/(φk² + φk+1) and
   yk = xk + ak(xk − xk−1).
2. Update xk+1 by
   xk+1 = yk − (1/L)∇F(yk).
until : stopping criterion is satisfied.
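A minimal sketch of this scheme on a made-up strongly convex quadratic (illustration only; φk+1 is obtained as the positive root of the quadratic equation in step 1):

import numpy as np

# Sketch of the scheme above on a made-up strongly convex quadratic
# F(x) = 0.5 x^T Q x. phi_{k+1} is the positive root of
# phi^2 + (phi_k^2 - q) phi - phi_k^2 = 0, which rearranges step 1.
rng = np.random.default_rng(3)
M = rng.standard_normal((10, 10))
Q = M.T @ M + np.eye(10)
evals = np.linalg.eigvalsh(Q)
alpha, Lc = evals[0], evals[-1]

q = alpha / Lc
phi = np.sqrt(q)                       # phi_0 in ]0, 1[
x_prev = x = rng.standard_normal(10)
for k in range(500):
    c = phi ** 2 - q
    phi_next = (-c + np.sqrt(c ** 2 + 4 * phi ** 2)) / 2
    a_k = phi * (1 - phi) / (phi ** 2 + phi_next)
    y = x + a_k * (x - x_prev)         # y_k = x_k + a_k (x_k - x_{k-1})
    x_prev, x = x, y - (Q @ y) / Lc    # x_{k+1} = y_k - (1/L) grad F(y_k)
    phi = phi_next
print(np.linalg.norm(x))               # distance to the minimiser x* = 0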



Convergence rate

Convergence rate
Let φ0 ≥ √(α/L), then

F(xk) − F(x⋆) ≤ min{ (1 − √(α/L))^k, 4L/(2√L + k√ν)² } × ( F(x0) − F(x⋆) + (ν/2)||x0 − x⋆||² ),

where ν = φ0(φ0 L − α)/(1 − φ0).

Parameter choices:
F ∈ CL1 : φ0 = 1,
q = 0, φk ≈ 2/(k + 1) → 0 and ak ≈ (1 − φk)/(1 + φk) → 1.

F ∈ S1α,L : φ0 = √(α/L),
q = α/L, φk ≡ √(α/L) and ak ≡ (√L − √α)/(√L + √α).



Outline

1 Unconstrained smooth optimisation

2 Descent methods

3 Gradient of convex functions

4 Gradient descent

5 Heavy-ball method

6 Nesterov’s optimal schemes

7 Dynamical system
Dynamical system of gradient descent

From gradient descent,


(xk+1 − xk)/γ = −∇F(xk).

Letting γ be small enough, we obtain the gradient flow
Ẋ(t) + ∇F(X(t)) = 0.

Discretisation
Explicit Euler method:
Ẋ(t) = (X(t + h) − X(t))/h.
Implicit Euler method:
Ẋ(t) = (X(t) − X(t − h))/h.
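To make the correspondence explicit (a short remark, not on the original slide): plugging the explicit Euler approximation into Ẋ(t) + ∇F(X(t)) = 0 gives

X(t + h) = X(t) − h∇F(X(t)),

i.e. one gradient-descent step with step-size γ = h; the implicit Euler approximation instead gives X(t) = X(t − h) − h∇F(X(t)), an equation that is implicit in X(t).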



Dynamical system of inertial schemes

Given a 2nd order dynamical system


Ẍ(t) + λ(t)Ẋ(t) + ∇F(X(t)) = 0.

Discretisation:
2nd-order term:
Ẍ(t) = (X(t + h) − 2X(t) + X(t − h))/h².
Implicit Euler method:
Ẋ(t) = (X(t) − X(t − h))/h.
Combine together:
X(t + h) − X(t) − (1 − hλ(t))(X(t) − X(t − h)) + h²∇F(X(t)) = 0.

Choices:
Heavy-ball: hλ(t) ∈]0, 1[.
Nesterov: λ(t) = d/t, d > 3.
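Reading the combined discretisation as an iteration over xk = X(kh) (a short remark, not on the original slide):

xk+1 = xk + (1 − hλ(tk))(xk − xk−1) − h²∇F(xk),

so the inertial parameter is ak = 1 − hλ(tk) and the step-size is γ = h²; a constant λ gives a heavy-ball-type scheme, while λ(t) = d/t gives ak = 1 − d/k, a Nesterov-type inertial parameter.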



Reference

S. Boyd and L. Vandenberghe. “Convex optimization”. Cambridge university press, 2004.


B. Polyak. “Introduction to optimization”. Optimization Software, 1987.
Y. Nesterov. “Introductory lectures on convex optimization: A basic course”. Vol. 87. Springer
Science & Business Media, 2013.
W. Su, S. Boyd, and E. Candès. “A differential equation for modeling Nesterov’s accelerated
gradient method: Theory and insights”. Advances in Neural Information Processing Systems.
2014.
