Optimization for Machine Learning

Lecture 7: First-order methods


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

11 Mar, 2021

First-order methods

Gradient Descent

    x ← x − η ∇f(x)

Descent methods

[Figure: iterates x^k → x^{k+1} → ··· converging to x*, where ∇f(x*) = 0]

Descent methods

▸ Suppose we have a vector x ∈ Rⁿ for which ∇f(x) ≠ 0
▸ Consider updating x using

      x(η) = x + ηd,

  where the direction d ∈ Rⁿ is obtuse to ∇f(x), i.e.,

      ⟨∇f(x), d⟩ < 0.

▸ Again, we have the Taylor expansion

      f(x(η)) = f(x) + η⟨∇f(x), d⟩ + o(η),

  where the term η⟨∇f(x), d⟩ dominates o(η) for small η

▸ Since d is obtuse to ∇f(x), this implies f(x(η)) < f(x) for all sufficiently small η > 0

Descent methods

[Figure: at a point x, the gradient ∇f(x), the negative gradient −∇f(x), a descent direction d, and candidate steps x − α∇f(x), x − δ∇f(x), and x + α₂ d]

Gradient-based methods

1. Start with some guess x⁰
2. For each k = 0, 1, . . .
      x^{k+1} ← x^k + η_k d^k
   Stop somehow (e.g., if ‖∇f(x^{k+1})‖ ≤ ε)
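
A minimal sketch of this template in Python/NumPy, using d^k = −∇f(x^k) and a constant stepsize; the helper name gradient_descent and the quadratic test problem are illustrative additions, not part of the lecture.

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Iterate x_{k+1} = x_k - eta * grad(x_k) until ||grad(x_k)|| <= eps."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:      # stopping test from the slide
            break
        x = x - eta * g                   # d_k = -grad f(x_k), constant stepsize
    return x, k

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T A x - b^T x
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                   # positive definite
b = rng.standard_normal(5)
L = np.linalg.eigvalsh(A).max()           # Lipschitz constant of the gradient

x_hat, iters = gradient_descent(lambda x: A @ x - b, np.zeros(5), eta=1.0 / L)
print(iters, np.linalg.norm(A @ x_hat - b))
```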

Gradient-based methods

    x^{k+1} = x^k + η_k d^k,   k = 0, 1, . . .

Stepsize η_k ≥ 0; usually chosen to ensure f(x^{k+1}) < f(x^k)

Descent direction d^k satisfies

    ⟨∇f(x^k), d^k⟩ < 0

Numerous ways to select η_k and d^k

Many methods seek monotonic descent

    f(x^{k+1}) < f(x^k)

Gradient methods – direction

    x^{k+1} = x^k + η_k d^k,   k = 0, 1, . . .

▸ Different choices of direction d^k
  ◦ Scaled gradient: d^k = −D_k ∇f(x^k), with D_k ≻ 0
  ◦ Newton's method: D_k = [∇²f(x^k)]⁻¹
  ◦ Quasi-Newton: D_k ≈ [∇²f(x^k)]⁻¹
  ◦ Steepest descent: D_k = I
  ◦ Diagonally scaled: D_k diagonal with D_k^{ii} ≈ [∂²f(x^k)/∂x_i²]⁻¹
  ◦ Discretized Newton: D_k = [H(x^k)]⁻¹, H obtained via finite differences
  ◦ ...

Exercise: Verify that ⟨∇f(x^k), d^k⟩ < 0 for the above choices.

Stepsize selection

▸ Constant: η_k = 1/L (for a suitable value of L)

▸ Diminishing: η_k → 0 but Σ_k η_k = ∞.
  Exercise: Prove that the latter condition ensures that x^k does not converge to nonstationary points.

  Sketch: Say x^k → x̄; then for sufficiently large m and n (with m > n),

      x^m ≈ x^n ≈ x̄,    x^m ≈ x^n − (Σ_{k=n}^{m−1} η_k) ∇f(x̄).

  The sum can be made arbitrarily large, which contradicts x^m ≈ x̄ unless ∇f(x̄) = 0.

Stepsize selection

▸ Exact: η_k := argmin_{η≥0} f(x^k + ηd^k)

▸ Limited minimization: η_k := argmin_{0≤η≤s} f(x^k + ηd^k)

▸ Armijo rule. Given fixed scalars s, β, σ with 0 < β < 1 and 0 < σ < 1 (chosen experimentally), set

      η_k = β^{m_k} s,

  where m_k is the first m = 0, 1, . . . that gives sufficient descent:

      f(x^k) − f(x^k + β^m s d^k) ≥ −σ β^m s ⟨∇f(x^k), d^k⟩.

  If ⟨∇f(x^k), d^k⟩ < 0, such a stepsize is guaranteed to exist.

  Usually σ is small, in [10⁻⁵, 0.1], while β ranges from 1/2 to 1/10, depending on how confident we are about the initial stepsize s.
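
A sketch of the Armijo rule as a backtracking loop, assuming f and ∇f(x^k) are available as a Python callable and an array; the default values of s, β, σ follow the ranges quoted above but are otherwise illustrative.

```python
import numpy as np

def armijo_stepsize(f, grad_fx, x, d, s=1.0, beta=0.5, sigma=1e-4, max_backtracks=50):
    """Return eta = beta^m * s for the first m with sufficient descent:
    f(x) - f(x + beta^m s d) >= -sigma * beta^m * s * <grad f(x), d>."""
    slope = float(grad_fx @ d)        # must be < 0 for a descent direction
    eta = s
    for _ in range(max_backtracks):
        if f(x) - f(x + eta * d) >= -sigma * eta * slope:
            return eta
        eta *= beta
    return eta                        # safeguard; with a true descent direction this is not reached

# Illustrative use on f(x) = 0.5 ||x||^2 with d = -grad f(x)
f = lambda x: 0.5 * float(x @ x)
x = np.array([3.0, -4.0])
g = x                                 # gradient of f at x
eta = armijo_stepsize(f, g, x, -g)
print(eta, f(x - eta * g) < f(x))
```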

Barzilai–Borwein stepsize

Stepsize computation can be expensive
Convergence analysis depends on monotonic descent

  – Give up the search for stepsizes
  – Use constants or closed-form formulae for stepsizes
  – Don't insist on monotonic descent
    (e.g., diminishing stepsizes give non-monotonic descent)

Barzilai & Borwein stepsizes

    x^{k+1} = x^k − η_k ∇f(x^k),   k = 0, 1, . . .

    η_k = ⟨u^k, v^k⟩ / ‖v^k‖²    or    η_k = ‖u^k‖² / ⟨u^k, v^k⟩,

    where u^k = x^k − x^{k−1},   v^k = ∇f(x^k) − ∇f(x^{k−1}).

Challenge: Analyze the convergence of GD using BB stepsizes.
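
A sketch of gradient descent with the BB formula η_k = ‖u^k‖²/⟨u^k, v^k⟩ on an ill-conditioned quadratic; the small bootstrap step and the safeguard on the denominator are illustrative choices, not part of the slide.

```python
import numpy as np

def gd_barzilai_borwein(grad, x0, eta0=1e-3, iters=200):
    """x_{k+1} = x_k - eta_k * grad(x_k), with eta_k = <u,u>/<u,v>."""
    x_prev = np.array(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - eta0 * g_prev            # bootstrap with a small fixed step
    for _ in range(iters):
        g = grad(x)
        u, v = x - x_prev, g - g_prev     # u^k, v^k from the slide
        denom = float(u @ v)
        eta = float(u @ u) / denom if abs(denom) > 1e-12 else eta0
        x_prev, g_prev = x, g
        x = x - eta * g
    return x

A = np.diag([1.0, 10.0, 100.0])           # ill-conditioned quadratic
b = np.ones(3)
x_bb = gd_barzilai_borwein(lambda x: A @ x - b, np.zeros(3))
print(np.linalg.norm(A @ x_bb - b))       # descent is typically non-monotonic along the way
```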

Exercise

♠ Let D be the (n−1) × n differencing matrix

        ⎡ −1   1               ⎤
        ⎢      −1   1          ⎥
    D = ⎢           ⋱    ⋱     ⎥  ∈ R^{(n−1)×n},
        ⎣               −1   1 ⎦

♠ f(x) = ½‖Dᵀx − b‖₂² = ½(‖Dᵀx‖₂² + ‖b‖₂² − 2⟨Dᵀx, b⟩)
♠ Notice that ∇f(x) = D(Dᵀx − b)
♠ Try different choices of b, and different initial vectors x⁰
♠ Exercise: Experiment to see how large n must be before the gradient method starts outperforming CVX
♠ Exercise: Minimize f(x) for large n; e.g., n = 10⁶, n = 10⁷
♠ Exercise: Repeat the same exercise with constraints: x_i ∈ [−1, 1].
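
One way to set this exercise up in Python with SciPy sparse matrices, assuming x ∈ R^{n−1} so that Dᵀx − b is well defined with b ∈ Rⁿ; the stepsize uses the fact that λ_max(DDᵀ) < 4 for this matrix, and the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

n = 1_000_000                                   # try n = 10**6, 10**7 as in the exercise
rng = np.random.default_rng(0)
b = rng.standard_normal(n)

# (n-1) x n differencing matrix with rows (..., -1, 1, ...)
D = sp.diags([-np.ones(n - 1), np.ones(n - 1)], offsets=[0, 1],
             shape=(n - 1, n), format="csr")

def f(x):
    r = D.T @ x - b
    return 0.5 * float(r @ r)

def grad(x):                                    # grad f(x) = D (D^T x - b)
    return D @ (D.T @ x - b)

# lambda_max(D D^T) < 4 for this second-difference matrix, so L = 4 is a safe constant
L = 4.0
x = np.zeros(n - 1)                             # x lives in R^{n-1} so D^T x - b makes sense
for k in range(500):
    x -= (1.0 / L) * grad(x)
print(f(x))
```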

Convergence (remarks)

Gradient descent – convergence

Assumption: Lipschitz continuous gradient; denoted f ∈ C¹_L:

    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

♣ Gradient vectors of nearby points are close to each other
♣ Objective function has "bounded curvature"
♣ The speed at which the gradient varies is bounded
♣ Exercise: If f ∈ C¹_L is twice differentiable, then ‖∇²f(x)‖₂ ≤ L.

Gradient descent – convergence

Convergence of gradient norm (for the gradient method with η_k = 1/L and f bounded below)

Theorem. Let f ∈ C¹_L. Then ‖∇f(x^k)‖₂ → 0 as k → ∞.

Theorem. Let f ∈ C¹_L. Then min_{1≤k≤T} ‖∇f(x^k)‖₂² = O(1/T).

Convergence rate: function suboptimality

Theorem. Let f ∈ C¹_L be convex, and let x^k be generated as above with η_k = 1/L. Then f(x^{T+1}) − f(x*) = O(1/T).

Theorem. If f ∈ S¹_{L,µ}, η = 2/(L+µ), and x^k is generated by GD, then

    f(x^T) − f* ≤ (L/2) ((κ−1)/(κ+1))^{2T} ‖x⁰ − x*‖₂²,

where κ := L/µ is the condition number.

Proof (sketches)

Key result: The Descent Lemma

Lemma (Descent lemma). Let f ∈ C¹_L. Then,

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖₂²

Proof. By Taylor's theorem, for z_t = y + t(x − y) we have

    f(x) = f(y) + ∫₀¹ ⟨∇f(z_t), x − y⟩ dt.

Adding and subtracting ⟨∇f(y), x − y⟩ we obtain

    |f(x) − f(y) − ⟨∇f(y), x − y⟩| = |∫₀¹ ⟨∇f(z_t) − ∇f(y), x − y⟩ dt|
                                   ≤ ∫₀¹ |⟨∇f(z_t) − ∇f(y), x − y⟩| dt
                                   ≤ ∫₀¹ ‖∇f(z_t) − ∇f(y)‖₂ ‖x − y‖₂ dt
                                   ≤ L ∫₀¹ t ‖x − y‖₂² dt
                                   = (L/2)‖x − y‖₂².

Bounds f(x) above and below with quadratic functions.

Descent lemma – corollary

Corollary 1. If f ∈ C¹_L and 0 < η_k < 2/L, then f(x^{k+1}) < f(x^k)

Proof.
    f(x^{k+1}) ≤ f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L/2)‖x^{k+1} − x^k‖²
              = f(x^k) − η_k ‖∇f(x^k)‖₂² + (η_k² L/2)‖∇f(x^k)‖₂²
              = f(x^k) − η_k (1 − (η_k/2)L) ‖∇f(x^k)‖₂²

If η_k < 2/L we have descent. Minimizing over η_k gives the best bound, attained at η_k = 1/L.

Alternative bigger picture

  Minimize a global upper bound:
    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖²
    f(x) ≤ F(x, y), where F(x, x) = f(x)

  Explore: Other global upper bounds and corresponding algorithms.
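
A quick numerical check of Corollary 1 on an illustrative least-squares objective: stepsizes below 2/L give monotonic descent, while a stepsize above 2/L does not.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()          # gradient Lipschitz constant

for eta in [0.5 / L, 1.0 / L, 1.9 / L, 2.5 / L]:
    x = np.zeros(20)
    drops = []
    for _ in range(50):
        x_new = x - eta * grad(x)
        drops.append(f(x) - f(x_new))          # should be > 0 whenever eta < 2/L
        x = x_new
    print(f"eta = {eta:.3e}: monotonic descent = {all(d > 0 for d in drops)}")
```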

Convergence of gradient norm

▸ We showed that

    f(x^k) − f(x^{k+1}) ≥ (1/2L)‖∇f(x^k)‖₂².

▸ Sum up the above inequalities for k = 0, 1, . . . , T to obtain

    (1/2L) Σ_{k=0}^{T} ‖∇f(x^k)‖₂² ≤ f(x⁰) − f(x^{T+1}) ≤ f(x⁰) − f*.

▸ We assume f* > −∞, so the rhs is some fixed constant
▸ Thus, as T → ∞, the lhs must converge (it is a bounded, increasing partial sum); hence

    ‖∇f(x^k)‖₂ → 0 as k → ∞.

▸ Also, min_{0≤k≤T} ‖∇f(x^k)‖² ≤ (1/(T+1)) Σ_{k=0}^{T} ‖∇f(x^k)‖², so O(1/ε) iterations suffice for ‖∇f‖² ≤ ε
▸ Notice, we did not require f to be convex . . .
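
A sketch that tracks min_{k<T} ‖∇f(x^k)‖² for an illustrative nonconvex objective f(x) = ½‖x‖² + Σᵢ cos(xᵢ) (so L = 2, and f* = d since each coordinate term 0.5t² + cos t is minimized at t = 0 with value 1); the running minimum should sit below the 2L(f(x⁰) − f*)/T bound.

```python
import numpy as np

d = 10
f_star = float(d)                      # per-coordinate min of 0.5 t^2 + cos t is 1, at t = 0

def f(x):
    return 0.5 * float(x @ x) + float(np.sum(np.cos(x)))

def grad(x):
    return x - np.sin(x)

L = 2.0                                # Hessian is I - diag(cos x), so its norm is at most 2
x = np.full(d, 3.0)
f0 = f(x)
best = np.inf
for T in range(1, 2001):
    g = grad(x)
    best = min(best, float(g @ g))     # min over k = 0, ..., T-1 of ||grad f(x^k)||^2
    x = x - (1.0 / L) * g
    if T % 500 == 0:
        bound = 2 * L * (f0 - f_star) / T
        print(T, best <= bound, best, bound)
```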

Convergence rate – strongly convex

Theorem. If f ∈ S¹_{L,µ} and 0 < η < 2/(L+µ), then the gradient method generates a sequence x^k that satisfies

    ‖x^k − x*‖₂² ≤ (1 − 2ηµL/(µ+L))^k ‖x⁰ − x*‖₂².

Moreover, if η = 2/(L+µ), then

    f(x^k) − f* ≤ (L/2) ((κ−1)/(κ+1))^{2k} ‖x⁰ − x*‖₂²,

where κ := L/µ is the condition number.
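
A numerical check of the second bound on an illustrative strongly convex quadratic whose extreme eigenvalues play the roles of µ and L; with η = 2/(L+µ) the observed gap f(x^k) − f* should stay below (L/2)((κ−1)/(κ+1))^{2k}‖x⁰ − x*‖₂².

```python
import numpy as np

rng = np.random.default_rng(2)
Q = np.linalg.qr(rng.standard_normal((20, 20)))[0]
eigs = np.linspace(1.0, 50.0, 20)               # mu = 1, L = 50
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(20)

f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
mu, L = eigs.min(), eigs.max()
kappa = L / mu
eta = 2.0 / (L + mu)

x = np.zeros(20)
r0_sq = np.linalg.norm(x - x_star) ** 2
for k in range(1, 101):
    x = x - eta * (A @ x - b)
    if k % 25 == 0:
        gap = f(x) - f(x_star)
        bound = (L / 2) * ((kappa - 1) / (kappa + 1)) ** (2 * k) * r0_sq
        print(k, gap <= bound + 1e-12, gap, bound)
```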

Convergence – strongly convex case

Assumption: Strong convexity; denoted f ∈ S¹_{L,µ}:

    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (µ/2)‖x − y‖₂²

Descent lemma – convex corollary

Corollary 2. If f ∈ C¹_L is convex, then

    (1/L)‖∇f(x) − ∇f(y)‖₂² ≤ ⟨∇f(x) − ∇f(y), x − y⟩.

Exercise: Prove Cor. 2. (Hint: Consider φ(y) = f(y) − ⟨∇f(x₀), y⟩.)

A valuable refinement for the strongly convex case. . .

Convergence – strongly convex case

Corollary 2. If f ∈ C¹_L is convex, then

    (1/L)‖∇f(x) − ∇f(y)‖₂² ≤ ⟨∇f(x) − ∇f(y), x − y⟩.

Thm 2. Suppose f ∈ S¹_{L,µ}. Then, for any x, y ∈ Rⁿ,

    ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (µL/(µ+L))‖x − y‖₂² + (1/(µ+L))‖∇f(x) − ∇f(y)‖₂².

▸ Consider the convex function φ(x) = f(x) − (µ/2)‖x‖₂²
▸ If µ = L, the claim is immediate from strong convexity and Cor. 2
▸ If µ < L, then φ ∈ C¹_{L−µ}; now invoke Cor. 2:

    ⟨∇φ(x) − ∇φ(y), x − y⟩ ≥ (1/(L−µ)) ‖∇φ(x) − ∇φ(y)‖₂².

Let's put this to use now....

Strongly convex – rate

▸ Key idea: Analyze r_k = ‖x^k − x*‖₂ recursively; consider thus

    r_{k+1}² = ‖x^k − x* − η∇f(x^k)‖₂²
             = r_k² − 2η⟨∇f(x^k), x^k − x*⟩ + η²‖∇f(x^k)‖₂²
             = r_k² − 2η⟨∇f(x^k) − ∇f(x*), x^k − x*⟩ + η²‖∇f(x^k)‖₂²
             ≤ (1 − 2ηµL/(µ+L)) r_k² + η (η − 2/(µ+L)) ‖∇f(x^k)‖₂²,

  where we used Thm. 2 with ∇f(x*) = 0 for the last inequality.

Exercise: Complete the proof of the theorem now.

Convergence rate – (weakly) convex

⋆ Want to prove the well-known O(1/T) rate
⋆ Let η_k = 1/L
⋆ Shorthand notation: g^k = ∇f(x^k), g* = ∇f(x*)
⋆ Let r_k := ‖x^k − x*‖₂ (distance to the optimum)

Lemma. The distance to the minimum shrinks monotonically: r_{k+1} ≤ r_k

Proof. The descent lemma implies that f(x^{k+1}) ≤ f(x^k) − (1/2L)‖g^k‖₂².
Consider r_{k+1}² = ‖x^{k+1} − x*‖₂² = ‖x^k − x* − η_k g^k‖₂². Then

    r_{k+1}² = r_k² + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩
             = r_k² + η_k² ‖g^k‖₂² − 2η_k ⟨g^k − g*, x^k − x*⟩        (as g* = 0)
             ≤ r_k² + η_k² ‖g^k‖₂² − (2η_k/L) ‖g^k − g*‖₂²            (Coroll. 2)
             = r_k² − η_k (2/L − η_k) ‖g^k‖₂².

Since η_k < 2/L, it follows that r_{k+1} ≤ r_k.

Convergence rate

Lemma. Let ∆_k := f(x^k) − f(x*). Then ∆_{k+1} ≤ ∆_k (1 − β_k)

By convexity of f and the Cauchy–Schwarz inequality,

    ∆_k = f(x^k) − f(x*) ≤ ⟨g^k, x^k − x*⟩ ≤ ‖g^k‖₂ ‖x^k − x*‖₂ = ‖g^k‖₂ r_k.

That is, ‖g^k‖₂ ≥ ∆_k / r_k. In particular, since r_k ≤ r₀, we have

    ‖g^k‖₂ ≥ ∆_k / r₀.

Now we have a bound on the gradient norm...

Convergence rate

Recall f(x^{k+1}) ≤ f(x^k) − (1/2L)‖g^k‖₂²; subtracting f* from both sides and using ‖g^k‖₂ ≥ ∆_k/r₀,

    ∆_{k+1} ≤ ∆_k − ∆_k²/(2Lr₀²) = ∆_k (1 − ∆_k/(2Lr₀²)) = ∆_k (1 − β_k).

But we want to bound f(x^{T+1}) − f(x*):

    ∆_{k+1} ≤ ∆_k (1 − β_k)  ⟹  1/∆_{k+1} ≥ (1/∆_k)(1 + β_k) = 1/∆_k + 1/(2Lr₀²).

▸ Sum both sides over k = 0, . . . , T (telescoping) to obtain

    1/∆_{T+1} ≥ 1/∆₀ + (T+1)/(2Lr₀²)

Convergence rate

    1/∆_{T+1} ≥ 1/∆₀ + (T+1)/(2Lr₀²)

▸ Rearrange to conclude

    f(x^T) − f* ≤ 2L∆₀r₀² / (2Lr₀² + T∆₀)

▸ Use the descent lemma to bound ∆₀ ≤ (L/2)‖x⁰ − x*‖₂², and simplify:

    f(x^T) − f(x*) ≤ 2L‖x⁰ − x*‖₂² / (T + 4) = O(1/T).

SGD

    x ← x − η g

Why SGD?

Regularized Empirical Risk Minimization

    min_x  (1/n) Σ_{i=1}^{n} ℓ(y_i, xᵀa_i) + λ r(x)

(e.g., logistic regression, deep learning, SVMs, etc.)

Training data: (a_i, y_i) ∈ R^d × Y (i.i.d.)

Large-scale ML: both d and n are large:
▸ d: dimension of each input sample
▸ n: number of training data points / samples

Assume the training data are "sparse", so the total data size ≪ dn.
Running time O(#nnz)

Finite-sum problems

    min_{x∈R^d}  f(x) = (1/n) Σ_{i=1}^{n} f_i(x).

Gradient / subgradient methods

    x^{k+1} = x^k − η_k ∇f(x^k)
    x^{k+1} = x^k − η_k g(x^k),   g(x^k) ∈ ∂f(x^k)

If n is large, each iteration above is expensive

Stochastic gradients

At iteration k, we randomly pick an integer i(k) ∈ {1, 2, . . . , n} and update

    x^{k+1} = x^k − η_k ∇f_{i(k)}(x^k)

▸ The update requires only the gradient of f_{i(k)}
▸ Uses an unbiased estimate: E[∇f_{i(k)}(x)] = ∇f(x)
▸ One iteration is now n times faster than an iteration using ∇f(x)
▸ Can such a method work? If so, how fast? Why?
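
A minimal sketch of this update for an illustrative least-squares finite sum f(x) = (1/n) Σᵢ ½(aᵢᵀx − yᵢ)², with uniform sampling with replacement and a diminishing stepsize scaled by the largest component smoothness; the data and the schedule are assumptions, not prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

L_max = (A * A).sum(axis=1).max()      # each f_i is ||a_i||^2-smooth

def grad_i(x, i):                      # gradient of f_i(x) = 0.5 (a_i^T x - y_i)^2
    return (A[i] @ x - y[i]) * A[i]

x = np.zeros(d)
for k in range(1, 50_001):
    i = rng.integers(n)                # i(k) ~ Unif{0, ..., n-1}
    x -= grad_i(x, i) / (L_max * np.sqrt(k))

full_grad = A.T @ (A @ x - y) / n
print(np.linalg.norm(full_grad))
```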

Intuition – (Bertsekas)

▸ Assume all variables involved are scalars.

    min_x  f(x) = Σ_{i=1}^{n} ½(a_i x − b_i)²

▸ Solving f′(x) = 0 we obtain

    x* = (Σ_i a_i b_i) / (Σ_i a_i²)

▸ The minimum of a single f_i(x) = ½(a_i x − b_i)² is x*_i = b_i / a_i
▸ Notice now that

    x* ∈ [min_i x*_i, max_i x*_i] =: R

  (Use: Σ_i a_i b_i = Σ_i a_i² (b_i/a_i))

Intuition – (Bertsekas)

▸ Assume all variables involved are scalars.

    min_x  f(x) = Σ_{i=1}^{n} ½(a_i x − b_i)²

▸ Notice: x* ∈ [min_i x*_i, max_i x*_i] =: R
▸ What if we have a scalar x that lies outside R?
▸ We see that

    ∇f_i(x) = a_i(a_i x − b_i),    ∇f(x) = Σ_i a_i(a_i x − b_i)

▸ Outside R, ∇f_i(x) has the same sign as ∇f(x). So using ∇f_i(x) instead of ∇f(x) also ensures progress. (A small numerical demo is sketched below.)
▸ But once inside the region R, there is no guarantee that SGD will make progress towards the optimum.
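
A small numerical illustration of the sign argument with randomly drawn positive aᵢ and arbitrary bᵢ (illustrative): outside R every ∇fᵢ(x) has the same sign as ∇f(x), while at a point inside R the component gradients disagree.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.uniform(0.5, 2.0, size=5)
b = rng.uniform(-1.0, 1.0, size=5)

x_i = b / a                               # minimizers of the individual f_i
R = (x_i.min(), x_i.max())
x_star = (a * b).sum() / (a * a).sum()    # minimizer of the full sum
assert R[0] <= x_star <= R[1]

def signs_agree(x):
    full = np.sum(a * (a * x - b))        # f'(x)
    comp = a * (a * x - b)                # f_i'(x) for every i
    return bool(np.all(np.sign(comp) == np.sign(full)))

print(signs_agree(R[1] + 1.0))            # outside R: True, every f_i' points the same way
print(signs_agree((R[0] + R[1]) / 2))     # inside R: False, component gradients conflict
```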

SGD: two variants

    min_x  (1/n) Σ_i f_i(x)

• Start with a feasible x⁰
• For k = 0, 1, . . . ,
    Option 1: Randomly pick an index i (with replacement)
    Option 2: Pick index i without replacement (e.g., by cycling through a random permutation)
    Use g^k = ∇f_i(x^k) as the "stochastic gradient"
    Update x^{k+1} = x^k − η_k g^k

Explore: Which version would you use? Why? (A sketch of both samplers follows.)
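
A sketch of both options, assuming the without-replacement variant means drawing a fresh random permutation for each pass over the data (one common reading of Option 2); the least-squares components and the stepsize are illustrative.

```python
import numpy as np

def sgd(grad_i, x0, n, eta, epochs, rng, with_replacement=True):
    """Run SGD over n component functions; grad_i(x, i) returns grad f_i(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        if with_replacement:
            order = rng.integers(n, size=n)       # Option 1: i.i.d. indices
        else:
            order = rng.permutation(n)            # Option 2: random reshuffling
        for i in order:
            x -= eta * grad_i(x, i)
    return x

# Illustrative least-squares components f_i(x) = 0.5 (a_i^T x - y_i)^2
rng = np.random.default_rng(0)
n, d = 2_000, 10
A = rng.standard_normal((n, d))
y = A @ np.ones(d)
gi = lambda x, i: (A[i] @ x - y[i]) * A[i]

for mode in (True, False):
    x = sgd(gi, np.zeros(d), n, eta=0.01, epochs=20,
            rng=np.random.default_rng(1), with_replacement=mode)
    print(mode, np.linalg.norm(x - np.ones(d)))
```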

SGD: mini-batches

    min_x  f(x) = (1/n) Σ_{i=1}^{n} f_i(x)

Idea: Use a mini-batch of stochastic gradients

    x^{k+1} = x^k − (η_k / |I_k|) Σ_{j∈I_k} ∇f_j(x^k)

Iteration k samples a set I_k and uses |I_k| stochastic gradients

Increases parallelism, reduces communication

Explore: Large mini-batches are not that "favorable" for DNNs
(also known as "large-batch training")
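
A sketch of the mini-batch update on an illustrative least-squares finite sum; the batch size and the stepsize heuristic are assumptions, not prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 5_000, 20, 64
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

x = np.zeros(d)
eta = 0.5 / (A * A).sum(axis=1).mean()        # rough stepsize from average component smoothness
for k in range(2_000):
    I = rng.integers(n, size=batch)           # sampled index set I_k
    residual = A[I] @ x - y[I]
    g = A[I].T @ residual / batch             # (1/|I_k|) sum_{j in I_k} grad f_j(x)
    x -= eta * g

print(np.linalg.norm(A @ x - y) / np.sqrt(n))
```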

SGD: some theoretical challenges

    x^{k+1} = x^k − η_k ∇f_{i(k)}(x^k)

▸ Proving that it "works"
▸ Theoretical results lag behind practice (without-replacement SGD is widely used, while most published theory studies sampling with replacement)

Explore: Why does SGD work so well for neural networks? (i.e., why does it deliver such low training losses despite nonconvexity, and how does it influence the generalization behavior of neural networks?)

SGD for empirical risk / finite sums

    min_{x∈R^d}  f(x) = (1/n) Σ_{i=1}^{n} f_i(x)

– Iteration: x^{k+1} = x^k − η_k ∇f_{i(k)}(x^k)
  • Sampling with replacement: i(k) ∼ Unif({1, . . . , n})
  • Polyak–Ruppert averaging: x̄^k = (1/(k+1)) Σ_{j=0}^{k} x^j
– Convergence rate if each f_i is convex and L-smooth, and f is µ-strongly convex:

    E[f(x̄^k) − f(x*)] ≤  O(1/√k)             if η_k = 1/(L√k)
                          O(L/(µk)) = O(κ/k)   if η_k = 1/(µk)
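
A sketch of the averaged scheme on the simplest strongly convex finite sum, fᵢ(x) = ½‖x − cᵢ‖² (so µ = L = 1 and the 1/(µk) steps are obviously stable); the data and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
C = rng.standard_normal((n, d)) + 2.0          # f_i(x) = 0.5 ||x - c_i||^2
x_star = C.mean(axis=0)                        # minimizer of f = (1/n) sum_i f_i
mu = 1.0                                       # here mu = L = 1

x = np.zeros(d)
x_avg = np.zeros(d)
for k in range(1, 200_001):
    i = rng.integers(n)
    g = x - C[i]                               # stochastic gradient of f at x
    x -= g / (mu * k)                          # eta_k = 1/(mu k)
    x_avg += (x - x_avg) / k                   # Polyak-Ruppert running average

print(np.linalg.norm(x - x_star), np.linalg.norm(x_avg - x_star))
```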

SGD vs GD (strongly convex case)

[Figure: typical convergence curves of GD vs SGD]

▸ Batch GD:
  – Linear (i.e., exponential) convergence rate, in O(e^{−k/κ})
  – Iteration complexity is linear in n (O(n log 1/ε))
▸ SGD:
  – Sampling with replacement: i(k) a random element of {1, . . . , n}
  – Convergence rate O(κ/k)
  – Iteration complexity independent of n (O(1/ε²))

Convergence (some theory)

SGD: nonconvex (smooth) case

    f(x) = (1/n) Σ_i f_i(x)   and   x^{k+1} = x^k − η_k ∇f_{i_k}(x^k)

Assumption 1: L-smooth components, f_i ∈ C¹_L
Assumption 2: Unbiased gradients, E[∇f_{i_k}(x) − ∇f(x)] = 0
Assumption 3: Bounded noise, E[‖∇f_{i_k}(x) − ∇f(x)‖²] = σ²
Assumption 4: Bounded gradient, ‖∇f_i(x)‖ ≤ G

Theorem. Under the above assumptions, for a suitable stepsize SGD satisfies

    (1/T) Σ_{k=1}^{T} E[‖∇f(x^k)‖²] ≤ (1/√T) ( (f(x¹) − f(x*))/c + (Lc/2) G² ),

for some constant c; hence min_k E[‖∇f(x^k)‖²] = O(1/√T).

SGD: nonconvex (smooth) case

Proof: Using L-smoothness of f (which follows from f_i ∈ C¹_L) and taking expectations, we obtain

    E[f(x^{k+1})] ≤ E[f(x^k)] + E[⟨∇f(x^k), −η_k ∇f_{i_k}(x^k)⟩ + (L/2)‖η_k ∇f_{i_k}(x^k)‖²]
                 ≤ E[f(x^k)] − η_k E[‖∇f(x^k)‖²] + (Lη_k²/2) G².

Rearranging the terms above we obtain

    E[‖∇f(x^k)‖²] ≤ (1/η_k) E[f(x^k) − f(x^{k+1})] + (Lη_k/2) G².

Choose η_k = c/√T for some constant c and sum over k to obtain

    (1/T) Σ_{k=1}^{T} E[‖∇f(x^k)‖²] ≤ (1/(√T c)) E[f(x¹) − f(x^{T+1})] + (Lc/(2√T)) G²
                                     ≤ (1/√T) ( (f(x¹) − f(x*))/c + (Lc/2) G² ).

SGD: convex case

▸ min_{x∈X} f(x) := E[F(x, ξ)]
▸ Let ξ_k denote the randomness at step k
▸ x^k depends on the rvs ξ₁, . . . , ξ_{k−1}, so it is itself random
▸ Of course, x^k does not depend on ξ_k
▸ SGD analysis hinges upon E[‖x^k − x*‖²]
▸ SGD iteration: x^{k+1} ← P_X(x^k − η_k g^k)   (P_X: projection onto X)

Denote: R_k := ‖x^k − x*‖² and r_k := E[R_k] = E[‖x^k − x*‖²]

Bounding R_{k+1} (using that projections are nonexpansive):

    R_{k+1} = ‖x^{k+1} − x*‖₂² = ‖P_X(x^k − η_k g^k) − P_X(x*)‖₂²
            ≤ ‖x^k − x* − η_k g^k‖₂²
            = R_k + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩.

SGD – analysis for strongly cvx

    R_{k+1} ≤ R_k + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩

▸ Assume ‖g^k‖₂ ≤ G, and take expectations:

    r_{k+1} ≤ r_k + η_k² G² − 2η_k E[⟨g^k, x^k − x*⟩].

Unbiasedness (E[g^k | x^k] = ∇f(x^k)) and µ-strong convexity give

    r_{k+1} ≤ r_k + η_k² G² − 2η_k E[f(x^k) − f(x*) + (µ/2)‖x^k − x*‖²].

Rearranging and dividing by 2η_k we get

    E[f(x^k) − f(x*)] ≤ (η_k G²)/2 + ((η_k⁻¹ − µ)/2) r_k − (1/(2η_k)) r_{k+1}.

Put η_k = 1/(µk), and telescope (and one more trick...)

SGD – analysis for strongly cvx

With η_k = 1/(µk),

    E[f(x^k) − f(x*)] ≤ G²/(2µk) + (µ(k−1)/2) r_k − (µk/2) r_{k+1}.    (∗∗)

Using convexity, observe that

    E[ f( (1/T) Σ_{k=1}^{T} x^k ) − f(x*) ] ≤ (1/T) Σ_{k=1}^{T} E[f(x^k) − f(x*)]

Using (∗∗), after telescoping and clearing junk (Verify!) we get

    (1/T) Σ_{k=1}^{T} E[f(x^k) − f(x*)] ≤ (G²/(2µT)) Σ_{k=1}^{T} 1/k ≤ (G²/(2µT)) (1 + log T).

We've obtained the rate O(G² log T / (2µT)).

SGD – exercise

Exercise: Suppose each f_i is convex and f(x) is µ-strongly convex. Let x̄^k := Σ_{i=0}^{k} θ_i x^i, where θ_i = 2(i+1)/((k+1)(k+2)). Show that

    E[f(x̄^k) − f(x*)] ≤ 2G² / (µ(k+1)).

Question: What if we do not want to use averaged iterates?

SGD: weakly convex case

    R_{k+1} ≤ R_k + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩

▸ Assume ‖g^k‖₂ ≤ M on the compact set X
▸ Taking expectations:

    r_{k+1} ≤ r_k + η_k² M² − 2η_k E[⟨g^k, x^k − x*⟩].

▸ We now need to get a handle on the last term
▸ Since x^k is independent of ξ_k, we have

    E[⟨x^k − x*, g(x^k, ξ_k)⟩] = E[ E[⟨x^k − x*, g(x^k, ξ_k)⟩ | ξ_{[1..(k−1)]}] ]
                               = E[ ⟨x^k − x*, E[g(x^k, ξ_k) | ξ_{[1..(k−1)]}]⟩ ]
                               = E[⟨x^k − x*, G^k⟩],   where G^k ∈ ∂F(x^k).

SGD: weakly convex case

It remains to bound E[⟨x^k − x*, G^k⟩]

▸ Since F is convex, F(x) ≥ F(x^k) + ⟨G^k, x − x^k⟩ for any x ∈ X.
▸ Thus, in particular,

    2η_k E[F(x*) − F(x^k)] ≥ 2η_k E[⟨G^k, x* − x^k⟩]

Plug this bound back into the r_{k+1} inequality:

    r_{k+1} ≤ r_k + η_k² M² − 2η_k E[⟨G^k, x^k − x*⟩]
    2η_k E[⟨G^k, x^k − x*⟩] ≤ r_k − r_{k+1} + η_k² M²
    2η_k E[F(x^k) − F(x*)] ≤ r_k − r_{k+1} + η_k² M².

We've bounded the expected progress; what now?

SGD: weakly convex case

    2η_k E[F(x^k) − F(x*)] ≤ r_k − r_{k+1} + η_k² M².

Sum up over i = 1, . . . , k to obtain

    Σ_{i=1}^{k} 2η_i E[F(x^i) − F(x*)] ≤ r₁ − r_{k+1} + M² Σ_i η_i²
                                       ≤ r₁ + M² Σ_i η_i².

Divide both sides by Σ_i η_i:
▸ Set γ_i = η_i / Σ_i η_i.
▸ Thus γ_i ≥ 0 and Σ_i γ_i = 1, and

    E[ Σ_i γ_i (F(x^i) − F(x*)) ] ≤ (r₁ + M² Σ_i η_i²) / (2 Σ_i η_i).

SGD: weakly convex case

▸ The bound looks similar to the bound for the subgradient method
▸ But we wish to say something about the iterates x^k
▸ Since γ_i ≥ 0 and Σ_{i=1}^{k} γ_i = 1, the sum Σ_i γ_i F(x^i) is a convex combination
▸ Easier to talk about the averaged iterate

    x̄^k := Σ_{i=1}^{k} γ_i x^i.

▸ F(x̄^k) ≤ Σ_i γ_i F(x^i) due to convexity

So we finally obtain the inequality

    E[F(x̄^k) − F(x*)] ≤ (r₁ + M² Σ_i η_i²) / (2 Σ_i η_i).

SGD: weakly convex case

♠ Let D_X := max_{x∈X} ‖x − x*‖₂ (actually we only need ‖x¹ − x*‖ ≤ D_X)
♠ Assume η_i = η is a constant. Observe that

    E[F(x̄^k) − F(x*)] ≤ (D_X² + M² k η²) / (2kη)

♠ Minimize the rhs over η > 0 to obtain

    E[F(x̄^k) − F(x*)] ≤ D_X M / √k

♠ If k is not fixed in advance, then choose

    η_i = θ D_X / (M √i),   i = 1, 2, . . .

♠ Analyze E[F(x̄^k) − F(x*)] with this choice of stepsize

We showed an O(1/√k) rate

SGD: weakly convex and smooth

Exercise: Assuming the cost (and its component functions) are L-smooth and convex, study the convergence rate of SGD.
Hint: Use the bounded-noise assumption E[‖∇f_{i_k}(x) − ∇f(x)‖²] = σ².
