Introductory Course On Non-Smooth Optimisation: Lecture 01 - Gradient Methods
Jingwei Liang
Outline:
Descent methods
Gradient descent
Heavy-ball method
Dynamical system
Convexity
Convex set
A set S ⊂ Rn is convex if for any θ ∈ [0, 1] and two points x, y ∈ S,
θx + (1 − θ)y ∈ S.
Convex function
Function F : Rn → R is convex if dom(F) is convex and for all x, y ∈ dom(F) and θ ∈ [0, 1],
F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y).
Proper convex: F(x) < +∞ for at least one x and F(x) > −∞ for all x.
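As a quick numerical illustration of the convexity inequality (a minimal sketch; the function F(x) = ||x||² and the sample points below are illustrative choices, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
F = lambda x: x @ x                               # F(x) = ||x||^2 is convex
x, y = rng.standard_normal(5), rng.standard_normal(5)
for theta in np.linspace(0.0, 1.0, 11):
    lhs = F(theta * x + (1 - theta) * y)          # F(θx + (1 − θ)y)
    rhs = theta * F(x) + (1 - theta) * F(y)       # θF(x) + (1 − θ)F(y)
    assert lhs <= rhs + 1e-12                     # convexity inequality holds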
Problem
Unconstrained smooth optimisation
min_{x∈Rn} F(x).
[Figure: a smooth function F, its gradient ∇F(x) at a generic point, and ∇F(x⋆) = 0 at the minimiser.]
Quadratic programming
General quadratic programming problem
min_{x∈Rn} (1/2)xᵀAx + bᵀx + c.
Optimality condition:
0 = Ax⋆ + b.
Least squares, min_{x∈Rn} (1/2)||Ax − b||², optimality condition (normal equations):
AᵀAx⋆ = Aᵀb.
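Both optimality conditions can be checked numerically; a minimal sketch (the data A, b below are randomly generated for illustration, with A made positive definite so the quadratic programme is convex):

import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M.T @ M + np.eye(4)                       # symmetric positive definite
b = rng.standard_normal(4)

# quadratic programming: gradient Ax + b vanishes at the minimiser
x_qp = np.linalg.solve(A, -b)
print(np.linalg.norm(A @ x_qp + b))           # ~ 0

# least squares: normal equations A^T A x = A^T b
x_ls = np.linalg.solve(A.T @ A, A.T @ b)
print(np.linalg.norm(A.T @ (A @ x_ls - b)))   # ~ 0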
Geometric programming
min_{x∈Rn} log( Σ_{i=1}^m exp(aiᵀx + bi) ).
Optimality condition:
0 = (1 / Σ_{i=1}^m exp(aiᵀx⋆ + bi)) Σ_{i=1}^m exp(aiᵀx⋆ + bi) ai.
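The optimality condition says that a softmax-weighted average of the ai vanishes at x⋆. A sketch using SciPy (the data ai, bi are random illustrative values, symmetrised so the problem is bounded below):

import numpy as np
from scipy.special import logsumexp, softmax
from scipy.optimize import minimize

rng = np.random.default_rng(2)
a_half = rng.standard_normal((6, 3))
a = np.vstack([a_half, -a_half])              # 0 in the hull of the a_i: bounded below
b = rng.standard_normal(12)

F = lambda x: logsumexp(a @ x + b)            # log sum_i exp(a_i^T x + b_i)
gradF = lambda x: softmax(a @ x + b) @ a      # sum_i softmax_i(x) a_i

res = minimize(F, np.zeros(3), jac=gradF)
print(np.linalg.norm(gradF(res.x)))           # ~ 0: optimality condition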
Descent methods
Problem
Assume the set of minimisers Argmin(F) is non-empty.
Iterative strategy to find one x⋆ ∈ Argmin(F): start from x0 and generate a sequence {xk}k∈N such that
lim_{k→∞} xk = x⋆ ∈ Argmin(F).
[Figure: iterates xk−1, xk, xk+1, xk+2 approaching x⋆.]
Iterative scheme
For each k = 1, 2, ..., find γk > 0 and dk ∈ Rn and then
xk+1 = xk + γk dk ,
where
dk is called the search (descent) direction.
γk is called the step-size.
Descent methods
An algorithm is called a descent method if
F(xk+1) < F(xk).
By convexity, F(xk+1) ≥ F(xk) + ⟨∇F(xk), xk+1 − xk⟩, hence
⟨∇F(xk), xk+1 − xk⟩ ≥ 0 =⇒ F(xk+1) ≥ F(xk),
so a descent direction must satisfy ⟨∇F(xk), dk⟩ < 0.
[Figures: descent directions at xk relative to ∇F(xk); the line-search function γ ↦ F(xk + γdk) with the chosen step-size γk.]
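A generic descent method in code, as a sketch: dk = −∇F(xk) and a standard backtracking (Armijo) line search stand in for the unspecified choices of direction and step-size.

import numpy as np

def descent_method(F, gradF, x0, max_iter=500, tol=1e-8):
    """Iterate x_{k+1} = x_k + gamma_k d_k with F(x_{k+1}) < F(x_k)."""
    x = x0.copy()
    for _ in range(max_iter):
        g = gradF(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                          # descent direction: <grad F(x), d> < 0
        gamma = 1.0
        # backtracking: shrink gamma until sufficient decrease holds
        while F(x + gamma * d) > F(x) + 1e-4 * gamma * (g @ d):
            gamma *= 0.5
        x = x + gamma * d
    return x

# example on F(x) = ||x||^2 (illustrative)
x_hat = descent_method(lambda x: x @ x, lambda x: 2 * x, np.array([3.0, -2.0]))
print(x_hat)   # ~ [0, 0]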
Monotonicity
Monotonicity of gradient
Let F : Rn → R be proper, convex and differentiable, then
⟨∇F(x) − ∇F(y), x − y⟩ ≥ 0, ∀x, y ∈ dom(F),
and
F(x) ≥ F(y) + ⟨∇F(y), x − y⟩.
Lipschitz continuity
The gradient of F is L-Lipschitz continuous if there exists L > 0 such that
||∇F(x) − ∇F(y)|| ≤ L||x − y||, ∀x, y ∈ dom(F).
If F ∈ C_L^1, then
H(x) := (L/2)||x||² − F(x)
is convex.
Hint: monotonicity of ∇H(x), i.e.
⟨∇H(x) − ∇H(y), x − y⟩ = L||x − y||² − ⟨∇F(x) − ∇F(y), x − y⟩
  ≥ L||x − y||² − L||x − y||²
  = 0.
Corollary
Let F ∈ C_L^1 and x⋆ ∈ Argmin(F), then
(1/(2L))||∇F(x)||² ≤ F(x) − F(x⋆) ≤ (L/2)||x − x⋆||², ∀x ∈ dom(F).
Left-hand inequality:
F(x⋆) ≤ min_{y∈dom(F)} { F(x) + ⟨∇F(x), y − x⟩ + (L/2)||y − x||² }
  = F(x) − (1/(2L))||∇F(x)||².
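A numerical check of this corollary on a quadratic, where L is the largest eigenvalue of A (the test data are illustrative):

import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)                 # positive definite, minimiser x* = 0
L = np.linalg.eigvalsh(A).max()         # Lipschitz constant of the gradient
F = lambda x: 0.5 * x @ A @ x           # F(x*) = 0

x = rng.standard_normal(5)
g = A @ x
assert (g @ g) / (2 * L) <= F(x) <= L / 2 * (x @ x)   # the sandwich inequality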
Co-coercivity
Let F ∈ C_L^1, then
⟨x − y, ∇F(x) − ∇F(y)⟩ ≥ (1/L)||∇F(x) − ∇F(y)||².
Proof sketch: fix x and define R(y) := F(y) − ⟨∇F(x), y⟩, which lies in C_L^1 and is minimised at y = x. The previous corollary applied to R gives
F(y) − F(x) − ⟨∇F(x), y − x⟩ = R(y) − R(x) ≥ (1/(2L))||∇R(y)||² = (1/(2L))||∇F(y) − ∇F(x)||².
Adding this inequality to the same one with x and y exchanged yields the claim.
Strong convexity
Function F : Rn → R is α-strongly convex if dom(F) is convex and there exists α > 0 such that for all x, y ∈ dom(F) and θ ∈ [0, 1],
F(θx + (1 − θ)y) ≤ θF(x) + (1 − θ)F(y) − (α/2)θ(1 − θ)||x − y||².
Equivalently, G(x) := F(x) − (α/2)||x||² is convex.
Monotonicity:
⟨∇F(x) − ∇F(y), x − y⟩ ≥ α||x − y||², ∀x, y ∈ dom(F).
Proof: first-order condition of convexity for G(x) := F(x) − (α/2)||x||².
Corollary
Let F ∈ C¹ be α-strongly convex and x⋆ ∈ Argmin(F), then
(α/2)||x − x⋆||² ≤ F(x) − F(x⋆) ≤ (1/(2α))||∇F(x)||², ∀x ∈ dom(F).
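The strongly convex counterpart admits the same kind of check, now with α the smallest eigenvalue (illustrative data):

import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)                 # alpha-strongly convex quadratic, x* = 0
alpha = np.linalg.eigvalsh(A).min()
F = lambda x: 0.5 * x @ A @ x

x = rng.standard_normal(5)
g = A @ x
assert alpha / 2 * (x @ x) <= F(x) <= (g @ g) / (2 * alpha)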
Gradient descent
Unconstrained smooth optimisation
Assumptions:
F ∈ C1 is convex.
∇F(x) is L-Lipschitz continuous for some L > 0.
Set of minimisers is non-empty, i.e. Argmin(F) ≠ ∅.
Gradient descent
initial : x0 ∈ dom(F);
repeat :
1. Choose step-size γk > 0
2. Update xk+1 = xk − γk ∇F(xk )
until : stopping criterion is satisfied.
For step-size γ ∈ ]0, 1/L], each step of gradient descent satisfies
F(xk+1) ≤ F(x⋆) + (1/(2γ))(||xk − x⋆||² − ||xk+1 − x⋆||²).
Summing over the first iterations, the right-hand side telescopes:
Σ_{i=0}^{k} (F(xi+1) − F(x⋆)) ≤ (1/(2γ))(||x0 − x⋆||² − ||xk+1 − x⋆||²)
  ≤ (1/(2γ))||x0 − x⋆||².
Since F(xk) is non-increasing, this gives the sub-linear rate F(xk) − F(x⋆) ≤ ||x0 − x⋆||²/(2γk).
If F is moreover α-strongly convex, the distance to the minimiser contracts linearly:
||xk+1 − x⋆||² ≤ ρ||xk − x⋆||², with ρ = 1 − 2γαL/(α + L).
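A minimal implementation of the scheme above with the fixed step γ = 1/L (the quadratic test problem is an illustrative choice):

import numpy as np

def gradient_descent(gradF, x0, gamma, max_iter=5000, tol=1e-10):
    x = x0.copy()
    for k in range(max_iter):
        g = gradF(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - gamma * g               # x_{k+1} = x_k − γ ∇F(x_k)
    return x, k

rng = np.random.default_rng(5)
M = rng.standard_normal((10, 10))
A = M.T @ M + np.eye(10)                # F(x) = (1/2) x^T A x + b^T x
b = rng.standard_normal(10)
L = np.linalg.eigvalsh(A).max()

x_hat, iters = gradient_descent(lambda x: A @ x + b, np.zeros(10), gamma=1.0 / L)
print(iters, np.linalg.norm(A @ x_hat + b))   # gradient norm ~ 0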
Heavy-ball method
Observations
Gradient descent:
−γ∇F(xk ) = xk+1 − xk .
Consider the angle θk := angle(∇F(xk+1), ∇F(xk)); then
lim_{k→+∞} θk = 0,
i.e. consecutive gradients become asymptotically aligned.
Exercise: prove this claim for least squares.
[Figures: zigzagging gradient descent iterates approaching x⋆; inertial iterates with extrapolation point yk.]
Heavy-ball method: add an inertial term to gradient descent,
xk+1 = xk + a(xk − xk−1) − γ∇F(xk),
with inertia parameter a and step-size γ.
Theorem
Let x⋆ be a (local) minimiser of F such that αId ⪯ ∇²F(x⋆) ⪯ LId and choose a, γ with a ∈ [0, 1[, γ ∈ ]0, 2(1 + a)/L[. There exists ρ̄ < 1 such that, for any ρ with ρ̄ < ρ < 1, if x0, x1 are close enough to x⋆, one has
||xk − x⋆|| ≤ Cρᵏ.
Moreover, if
a = ((√L − √α)/(√L + √α))², γ = 4/(√L + √α)², then ρ = (√L − √α)/(√L + √α).
Taylor expansion:
xk+1 = xk + a(xk − xk−1) − γ∇²F(x⋆)(xk − x⋆) + o(||xk − x⋆||).
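A heavy-ball sketch using the optimal a, γ from the theorem; on the same quadratic test problem it typically needs far fewer iterations than plain gradient descent (data illustrative):

import numpy as np

def heavy_ball(gradF, x0, a, gamma, max_iter=5000, tol=1e-10):
    """x_{k+1} = x_k + a (x_k − x_{k−1}) − γ ∇F(x_k)."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(max_iter):
        g = gradF(x)
        if np.linalg.norm(g) < tol:
            break
        x, x_prev = x + a * (x - x_prev) - gamma * g, x
    return x, k

rng = np.random.default_rng(6)
M = rng.standard_normal((10, 10))
A = M.T @ M + np.eye(10)
b = rng.standard_normal(10)
eigs = np.linalg.eigvalsh(A)
alpha, L = eigs.min(), eigs.max()

a = ((np.sqrt(L) - np.sqrt(alpha)) / (np.sqrt(L) + np.sqrt(alpha))) ** 2
gamma = 4.0 / (np.sqrt(L) + np.sqrt(alpha)) ** 2
x_hb, iters = heavy_ball(lambda x: A @ x + b, np.zeros(10), a, gamma)
print(iters, np.linalg.norm(A @ x_hb + b))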
Convergence rate of gradient descent
[Figure: gradient descent iterates compared with accelerated iterates using the extrapolation point yk.]
Accelerated scheme:
initial : x0 ∈ dom(F), x−1 = x0;
repeat :
1. Update yk = xk + ak (xk − xk−1)
2. Update xk+1 by
xk+1 = yk − (1/L)∇F(yk).
until : stopping criterion is satisfied.
Convergence rate
Let φ0 ≥ √(α/L), then
F(xk) − F(x⋆) ≤ min{ (1 − √(α/L))ᵏ, 4L/(2√L + k√ν)² } × (F(x0) − F(x⋆) + (ν/2)||x0 − x⋆||²),
where ν = φ0(φ0L − α)/(1 − φ0).
Parameter choices:
F ∈ C_L^1: φ0 = 1, q = 0, φk ≈ 2/(k + 1) → 0 and ak ≈ (1 − φk)/(1 + φk) → 1.
F ∈ S_{α,L}^1: φ0 = √(α/L), q = α/L, φk ≡ √(α/L) and ak ≡ (√L − √α)/(√L + √α).
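A sketch of the accelerated scheme with the constant strongly convex choice ak ≡ (√L − √α)/(√L + √α) (the quadratic test problem is illustrative):

import numpy as np

def accelerated_gd(gradF, x0, a, L, max_iter=5000, tol=1e-10):
    """y_k = x_k + a (x_k − x_{k−1}); x_{k+1} = y_k − (1/L) ∇F(y_k)."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(max_iter):
        y = x + a * (x - x_prev)
        g = gradF(y)
        if np.linalg.norm(g) < tol:
            break
        x, x_prev = y - g / L, x
    return x, k

rng = np.random.default_rng(7)
M = rng.standard_normal((10, 10))
A = M.T @ M + np.eye(10)
b = rng.standard_normal(10)
eigs = np.linalg.eigvalsh(A)
alpha, L = eigs.min(), eigs.max()

a = (np.sqrt(L) - np.sqrt(alpha)) / (np.sqrt(L) + np.sqrt(alpha))
x_acc, iters = accelerated_gd(lambda x: A @ x + b, np.zeros(10), a, L)
print(iters, np.linalg.norm(A @ x_acc + b))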
Dynamical system of gradient descent
Discretisation
Explicit Euler method:
Ẋ(t) = (X(t + h) − X(t))/h.
Implicit Euler method:
Ẋ(t) = (X(t) − X(t − h))/h.
Applying the explicit Euler method to the gradient flow Ẋ(t) = −∇F(X(t)) recovers gradient descent: X(t + h) = X(t) − h∇F(X(t)).
Discretisation of the damped second-order system Ẍ(t) + λ(t)Ẋ(t) + ∇F(X(t)) = 0:
2nd-order term:
Ẍ(t) = (X(t + h) − 2X(t) + X(t − h))/h².
Implicit Euler method:
Ẋ(t) = (X(t) − X(t − h))/h.
Combining the two:
X(t + h) − X(t) − (1 − hλ(t))(X(t) − X(t − h)) + h²∇F(X(t)) = 0.
Choices:
Heavy-ball: hλ(t) ∈ ]0, 1[.
Nesterov: λ(t) = d/t, d > 3.
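A sketch iterating the combined discretisation above for both damping choices (the step h, the constant hλ = 0.5 and d = 3.5 are illustrative values; the quadratic data are random):

import numpy as np

def inertial_discretisation(gradF, x0, h, lam, n_steps=3000):
    """X(t+h) = X(t) + (1 − h λ(t))(X(t) − X(t−h)) − h² ∇F(X(t))."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, n_steps):
        t = k * h
        x, x_prev = x + (1 - h * lam(t)) * (x - x_prev) - h**2 * gradF(x), x
    return x

rng = np.random.default_rng(8)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)
b = rng.standard_normal(5)
gradF = lambda x: A @ x + b
h = 1.0 / np.sqrt(np.linalg.eigvalsh(A).max())    # so that h² = 1/L

x_hb  = inertial_discretisation(gradF, np.zeros(5), h, lam=lambda t: 0.5 / h)  # hλ ≡ 0.5 ∈ ]0, 1[
x_nes = inertial_discretisation(gradF, np.zeros(5), h, lam=lambda t: 3.5 / t)  # λ(t) = d/t, d > 3
print(np.linalg.norm(gradF(x_hb)), np.linalg.norm(gradF(x_nes)))               # both ~ 0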