Optimization for Machine Learning

Lecture 7: First-order methods


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

11 Mar, 2021

First-order methods

Gradient Descent

    x ← x − η ∇f(x)

Descent methods

[Figure: iterates x^k → x^{k+1} → ··· converging to x*, where ∇f(x*) = 0]

Descent methods

▸ Suppose we have a vector x ∈ Rⁿ for which ∇f(x) ≠ 0
▸ Consider updating x using

      x(η) = x + ηd,

  where the direction d ∈ Rⁿ is obtuse to ∇f(x), i.e.,

      ⟨∇f(x), d⟩ < 0.

▸ Again, we have the Taylor expansion

      f(x(η)) = f(x) + η⟨∇f(x), d⟩ + o(η),

  where the term η⟨∇f(x), d⟩ dominates o(η) for small η

▸ Since d is obtuse to ∇f(x), this implies f(x(η)) < f(x) for all sufficiently small η > 0

Descent methods

[Figure: at a point x, the gradient ∇f(x), the negative gradient −∇f(x), a descent direction d, and candidate steps x − α∇f(x), x − δ∇f(x), and x + α₂ d]

Gradient-based methods

1. Start with some guess x⁰
2. For each k = 0, 1, . . .
      x^{k+1} ← x^k + η_k d^k
   Stop somehow (e.g., if ‖∇f(x^{k+1})‖ ≤ ε)
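
A minimal sketch of this template in Python/NumPy, using d^k = −∇f(x^k) and a constant stepsize; the helper name gradient_descent and the quadratic test problem are illustrative additions, not part of the lecture.

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Iterate x_{k+1} = x_k - eta * grad(x_k) until ||grad(x_k)|| <= eps."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:      # stopping test from the slide
            break
        x = x - eta * g                   # d_k = -grad f(x_k), constant stepsize
    return x, k

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T A x - b^T x
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                   # positive definite
b = rng.standard_normal(5)
L = np.linalg.eigvalsh(A).max()           # Lipschitz constant of the gradient

x_hat, iters = gradient_descent(lambda x: A @ x - b, np.zeros(5), eta=1.0 / L)
print(iters, np.linalg.norm(A @ x_hat - b))
```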

Gradient-based methods

    x^{k+1} = x^k + η_k d^k,   k = 0, 1, . . .

Stepsize η_k ≥ 0; usually chosen to ensure f(x^{k+1}) < f(x^k)

Descent direction d^k satisfies

    ⟨∇f(x^k), d^k⟩ < 0

Numerous ways to select η_k and d^k

Many methods seek monotonic descent

    f(x^{k+1}) < f(x^k)

Gradient methods – direction

    x^{k+1} = x^k + η_k d^k,   k = 0, 1, . . .

▸ Different choices of direction d^k
  ◦ Scaled gradient: d^k = −D_k ∇f(x^k), with D_k ≻ 0
  ◦ Newton's method: D_k = [∇²f(x^k)]⁻¹
  ◦ Quasi-Newton: D_k ≈ [∇²f(x^k)]⁻¹
  ◦ Steepest descent: D_k = I
  ◦ Diagonally scaled: D_k diagonal with D_k^{ii} ≈ [∂²f(x^k)/∂x_i²]⁻¹
  ◦ Discretized Newton: D_k = [H(x^k)]⁻¹, H obtained via finite differences
  ◦ ...

Exercise: Verify that ⟨∇f(x^k), d^k⟩ < 0 for the above choices.

Stepsize selection

▸ Constant: η_k = 1/L (for a suitable value of L)

▸ Diminishing: η_k → 0 but Σ_k η_k = ∞.
  Exercise: Prove that the latter condition ensures that x^k does not converge to nonstationary points.

  Sketch: Say x^k → x̄; then for sufficiently large m and n (with m > n),

      x^m ≈ x^n ≈ x̄,    x^m ≈ x^n − (Σ_{k=n}^{m−1} η_k) ∇f(x̄).

  The sum can be made arbitrarily large, which contradicts x^m ≈ x̄ unless ∇f(x̄) = 0.

Stepsize selection

▸ Exact: η_k := argmin_{η≥0} f(x^k + ηd^k)

▸ Limited minimization: η_k := argmin_{0≤η≤s} f(x^k + ηd^k)

▸ Armijo rule. Given fixed scalars s, β, σ with 0 < β < 1 and 0 < σ < 1 (chosen experimentally), set

      η_k = β^{m_k} s,

  where m_k is the first m = 0, 1, . . . that gives sufficient descent:

      f(x^k) − f(x^k + β^m s d^k) ≥ −σ β^m s ⟨∇f(x^k), d^k⟩.

  If ⟨∇f(x^k), d^k⟩ < 0, such a stepsize is guaranteed to exist.

  Usually σ is small, in [10⁻⁵, 0.1], while β ranges from 1/2 to 1/10, depending on how confident we are about the initial stepsize s.
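
A sketch of the Armijo rule as a backtracking loop, assuming f and ∇f(x^k) are available as a Python callable and an array; the default values of s, β, σ follow the ranges quoted above but are otherwise illustrative.

```python
import numpy as np

def armijo_stepsize(f, grad_fx, x, d, s=1.0, beta=0.5, sigma=1e-4, max_backtracks=50):
    """Return eta = beta^m * s for the first m with sufficient descent:
    f(x) - f(x + beta^m s d) >= -sigma * beta^m * s * <grad f(x), d>."""
    slope = float(grad_fx @ d)        # must be < 0 for a descent direction
    eta = s
    for _ in range(max_backtracks):
        if f(x) - f(x + eta * d) >= -sigma * eta * slope:
            return eta
        eta *= beta
    return eta                        # safeguard; with a true descent direction this is not reached

# Illustrative use on f(x) = 0.5 ||x||^2 with d = -grad f(x)
f = lambda x: 0.5 * float(x @ x)
x = np.array([3.0, -4.0])
g = x                                 # gradient of f at x
eta = armijo_stepsize(f, g, x, -g)
print(eta, f(x - eta * g) < f(x))
```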

Barzilai–Borwein stepsize

Stepsize computation can be expensive
Convergence analysis depends on monotonic descent

  – Give up the search for stepsizes
  – Use constants or closed-form formulae for stepsizes
  – Don't insist on monotonic descent
    (e.g., diminishing stepsizes give non-monotonic descent)

Barzilai & Borwein stepsizes

    x^{k+1} = x^k − η_k ∇f(x^k),   k = 0, 1, . . .

    η_k = ⟨u^k, v^k⟩ / ‖v^k‖²    or    η_k = ‖u^k‖² / ⟨u^k, v^k⟩,

    where u^k = x^k − x^{k−1},   v^k = ∇f(x^k) − ∇f(x^{k−1}).

Challenge: Analyze the convergence of GD using BB stepsizes.
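
A sketch of gradient descent with the BB formula η_k = ‖u^k‖²/⟨u^k, v^k⟩ on an ill-conditioned quadratic; the small bootstrap step and the safeguard on the denominator are illustrative choices, not part of the slide.

```python
import numpy as np

def gd_barzilai_borwein(grad, x0, eta0=1e-3, iters=200):
    """x_{k+1} = x_k - eta_k * grad(x_k), with eta_k = <u,u>/<u,v>."""
    x_prev = np.array(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - eta0 * g_prev            # bootstrap with a small fixed step
    for _ in range(iters):
        g = grad(x)
        u, v = x - x_prev, g - g_prev     # u^k, v^k from the slide
        denom = float(u @ v)
        eta = float(u @ u) / denom if abs(denom) > 1e-12 else eta0
        x_prev, g_prev = x, g
        x = x - eta * g
    return x

A = np.diag([1.0, 10.0, 100.0])           # ill-conditioned quadratic
b = np.ones(3)
x_bb = gd_barzilai_borwein(lambda x: A @ x - b, np.zeros(3))
print(np.linalg.norm(A @ x_bb - b))       # descent is typically non-monotonic along the way
```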

Exercise

♠ Let D be the (n−1) × n differencing matrix

        ⎡ −1   1               ⎤
        ⎢      −1   1          ⎥
    D = ⎢           ⋱    ⋱     ⎥  ∈ R^{(n−1)×n},
        ⎣               −1   1 ⎦

♠ f(x) = ½‖Dᵀx − b‖₂² = ½(‖Dᵀx‖₂² + ‖b‖₂² − 2⟨Dᵀx, b⟩)
♠ Notice that ∇f(x) = D(Dᵀx − b)
♠ Try different choices of b, and different initial vectors x⁰
♠ Exercise: Experiment to see how large n must be before the gradient method starts outperforming CVX
♠ Exercise: Minimize f(x) for large n; e.g., n = 10⁶, n = 10⁷
♠ Exercise: Repeat the same exercise with constraints: x_i ∈ [−1, 1].
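
One way to set this exercise up in Python with SciPy sparse matrices, assuming x ∈ R^{n−1} so that Dᵀx − b is well defined with b ∈ Rⁿ; the stepsize uses the fact that λ_max(DDᵀ) < 4 for this matrix, and the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

n = 1_000_000                                   # try n = 10**6, 10**7 as in the exercise
rng = np.random.default_rng(0)
b = rng.standard_normal(n)

# (n-1) x n differencing matrix with rows (..., -1, 1, ...)
D = sp.diags([-np.ones(n - 1), np.ones(n - 1)], offsets=[0, 1],
             shape=(n - 1, n), format="csr")

def f(x):
    r = D.T @ x - b
    return 0.5 * float(r @ r)

def grad(x):                                    # grad f(x) = D (D^T x - b)
    return D @ (D.T @ x - b)

# lambda_max(D D^T) < 4 for this second-difference matrix, so L = 4 is a safe constant
L = 4.0
x = np.zeros(n - 1)                             # x lives in R^{n-1} so D^T x - b makes sense
for k in range(500):
    x -= (1.0 / L) * grad(x)
print(f(x))
```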

Convergence (remarks)

Gradient descent – convergence

Assumption: Lipschitz continuous gradient; denoted f ∈ C¹_L:

    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

♣ Gradient vectors of nearby points are close to each other
♣ Objective function has "bounded curvature"
♣ The speed at which the gradient varies is bounded
♣ Exercise: If f ∈ C¹_L is twice differentiable, then ‖∇²f(x)‖₂ ≤ L.

Gradient descent – convergence

Convergence of gradient norm (for the gradient method with η_k = 1/L and f bounded below)

Theorem. Let f ∈ C¹_L. Then ‖∇f(x^k)‖₂ → 0 as k → ∞.

Theorem. Let f ∈ C¹_L. Then min_{1≤k≤T} ‖∇f(x^k)‖₂² = O(1/T).

Convergence rate: function suboptimality

Theorem. Let f ∈ C¹_L be convex, and let x^k be generated as above with η_k = 1/L. Then f(x^{T+1}) − f(x*) = O(1/T).

Theorem. If f ∈ S¹_{L,µ}, η = 2/(L+µ), and x^k is generated by GD, then

    f(x^T) − f* ≤ (L/2) ((κ−1)/(κ+1))^{2T} ‖x⁰ − x*‖₂²,

where κ := L/µ is the condition number.

Proof (sketches)

Key result: The Descent Lemma

Lemma (Descent lemma). Let f ∈ C¹_L. Then,

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖₂²

Proof. By Taylor's theorem, for z_t = y + t(x − y) we have

    f(x) = f(y) + ∫₀¹ ⟨∇f(z_t), x − y⟩ dt.

Adding and subtracting ⟨∇f(y), x − y⟩ we obtain

    |f(x) − f(y) − ⟨∇f(y), x − y⟩| = |∫₀¹ ⟨∇f(z_t) − ∇f(y), x − y⟩ dt|
                                   ≤ ∫₀¹ |⟨∇f(z_t) − ∇f(y), x − y⟩| dt
                                   ≤ ∫₀¹ ‖∇f(z_t) − ∇f(y)‖₂ ‖x − y‖₂ dt
                                   ≤ L ∫₀¹ t ‖x − y‖₂² dt
                                   = (L/2)‖x − y‖₂².

Bounds f(x) above and below with quadratic functions.

Descent lemma – corollary

Corollary 1. If f ∈ C¹_L and 0 < η_k < 2/L, then f(x^{k+1}) < f(x^k)

Proof.
    f(x^{k+1}) ≤ f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L/2)‖x^{k+1} − x^k‖²
              = f(x^k) − η_k ‖∇f(x^k)‖₂² + (η_k² L/2)‖∇f(x^k)‖₂²
              = f(x^k) − η_k (1 − (η_k/2)L) ‖∇f(x^k)‖₂²

If η_k < 2/L we have descent. Minimizing over η_k gives the best bound, attained at η_k = 1/L.

Alternative bigger picture

  Minimize a global upper bound:
    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖²
    f(x) ≤ F(x, y), where F(x, x) = f(x)

  Explore: Other global upper bounds and corresponding algorithms.
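
A quick numerical check of Corollary 1 on an illustrative least-squares objective: stepsizes below 2/L give monotonic descent, while a stepsize above 2/L does not.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()          # gradient Lipschitz constant

for eta in [0.5 / L, 1.0 / L, 1.9 / L, 2.5 / L]:
    x = np.zeros(20)
    drops = []
    for _ in range(50):
        x_new = x - eta * grad(x)
        drops.append(f(x) - f(x_new))          # should be > 0 whenever eta < 2/L
        x = x_new
    print(f"eta = {eta:.3e}: monotonic descent = {all(d > 0 for d in drops)}")
```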

Convergence of gradient norm

▸ We showed that

    f(x^k) − f(x^{k+1}) ≥ (1/2L)‖∇f(x^k)‖₂².

▸ Sum up the above inequalities for k = 0, 1, . . . , T to obtain

    (1/2L) Σ_{k=0}^{T} ‖∇f(x^k)‖₂² ≤ f(x⁰) − f(x^{T+1}) ≤ f(x⁰) − f*.

▸ We assume f* > −∞, so the rhs is some fixed constant
▸ Thus, as T → ∞, the lhs must converge (it is a bounded, increasing partial sum); hence

    ‖∇f(x^k)‖₂ → 0 as k → ∞.

▸ Also, min_{0≤k≤T} ‖∇f(x^k)‖² ≤ (1/(T+1)) Σ_{k=0}^{T} ‖∇f(x^k)‖², so O(1/ε) iterations suffice for ‖∇f‖² ≤ ε
▸ Notice, we did not require f to be convex . . .
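
A sketch that tracks min_{k<T} ‖∇f(x^k)‖² for an illustrative nonconvex objective f(x) = ½‖x‖² + Σᵢ cos(xᵢ) (so L = 2, and f* = d since each coordinate term 0.5t² + cos t is minimized at t = 0 with value 1); the running minimum should sit below the 2L(f(x⁰) − f*)/T bound.

```python
import numpy as np

d = 10
f_star = float(d)                      # per-coordinate min of 0.5 t^2 + cos t is 1, at t = 0

def f(x):
    return 0.5 * float(x @ x) + float(np.sum(np.cos(x)))

def grad(x):
    return x - np.sin(x)

L = 2.0                                # Hessian is I - diag(cos x), so its norm is at most 2
x = np.full(d, 3.0)
f0 = f(x)
best = np.inf
for T in range(1, 2001):
    g = grad(x)
    best = min(best, float(g @ g))     # min over k = 0, ..., T-1 of ||grad f(x^k)||^2
    x = x - (1.0 / L) * g
    if T % 500 == 0:
        bound = 2 * L * (f0 - f_star) / T
        print(T, best <= bound, best, bound)
```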

Convergence rate – strongly convex

Theorem. If f ∈ S¹_{L,µ} and 0 < η < 2/(L+µ), then the gradient method generates a sequence x^k that satisfies

    ‖x^k − x*‖₂² ≤ (1 − 2ηµL/(µ+L))^k ‖x⁰ − x*‖₂².

Moreover, if η = 2/(L+µ), then

    f(x^k) − f* ≤ (L/2) ((κ−1)/(κ+1))^{2k} ‖x⁰ − x*‖₂²,

where κ := L/µ is the condition number.
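
A numerical check of the second bound on an illustrative strongly convex quadratic whose extreme eigenvalues play the roles of µ and L; with η = 2/(L+µ) the observed gap f(x^k) − f* should stay below (L/2)((κ−1)/(κ+1))^{2k}‖x⁰ − x*‖₂².

```python
import numpy as np

rng = np.random.default_rng(2)
Q = np.linalg.qr(rng.standard_normal((20, 20)))[0]
eigs = np.linspace(1.0, 50.0, 20)               # mu = 1, L = 50
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(20)

f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
mu, L = eigs.min(), eigs.max()
kappa = L / mu
eta = 2.0 / (L + mu)

x = np.zeros(20)
r0_sq = np.linalg.norm(x - x_star) ** 2
for k in range(1, 101):
    x = x - eta * (A @ x - b)
    if k % 25 == 0:
        gap = f(x) - f(x_star)
        bound = (L / 2) * ((kappa - 1) / (kappa + 1)) ** (2 * k) * r0_sq
        print(k, gap <= bound + 1e-12, gap, bound)
```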

Convergence – strongly convex case

Assumption: Strong convexity; denoted f ∈ S¹_{L,µ}:

    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (µ/2)‖x − y‖₂²

Descent lemma – convex corollary

Corollary 2. If f ∈ C¹_L is convex, then

    (1/L)‖∇f(x) − ∇f(y)‖₂² ≤ ⟨∇f(x) − ∇f(y), x − y⟩.

Exercise: Prove Cor. 2. (Hint: Consider φ(y) = f(y) − ⟨∇f(x₀), y⟩.)

A valuable refinement for the strongly convex case. . .

Convergence – strongly convex case

Corollary 2. If f ∈ C¹_L is convex, then

    (1/L)‖∇f(x) − ∇f(y)‖₂² ≤ ⟨∇f(x) − ∇f(y), x − y⟩.

Thm 2. Suppose f ∈ S¹_{L,µ}. Then, for any x, y ∈ Rⁿ,

    ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (µL/(µ+L))‖x − y‖₂² + (1/(µ+L))‖∇f(x) − ∇f(y)‖₂².

▸ Consider the convex function φ(x) = f(x) − (µ/2)‖x‖₂²
▸ If µ = L, the claim is immediate from strong convexity and Cor. 2
▸ If µ < L, then φ ∈ C¹_{L−µ}; now invoke Cor. 2:

    ⟨∇φ(x) − ∇φ(y), x − y⟩ ≥ (1/(L−µ)) ‖∇φ(x) − ∇φ(y)‖₂².

Let's put this to use now....

Strongly convex – rate

▸ Key idea: Analyze r_k = ‖x^k − x*‖₂ recursively; consider thus

    r_{k+1}² = ‖x^k − x* − η∇f(x^k)‖₂²
             = r_k² − 2η⟨∇f(x^k), x^k − x*⟩ + η²‖∇f(x^k)‖₂²
             = r_k² − 2η⟨∇f(x^k) − ∇f(x*), x^k − x*⟩ + η²‖∇f(x^k)‖₂²
             ≤ (1 − 2ηµL/(µ+L)) r_k² + η (η − 2/(µ+L)) ‖∇f(x^k)‖₂²,

  where we used Thm. 2 with ∇f(x*) = 0 for the last inequality.

Exercise: Complete the proof of the theorem now.

Convergence rate – (weakly) convex

⋆ Want to prove the well-known O(1/T) rate
⋆ Let η_k = 1/L
⋆ Shorthand notation: g^k = ∇f(x^k), g* = ∇f(x*)
⋆ Let r_k := ‖x^k − x*‖₂ (distance to the optimum)

Lemma. The distance to the minimum shrinks monotonically: r_{k+1} ≤ r_k

Proof. The descent lemma implies that f(x^{k+1}) ≤ f(x^k) − (1/2L)‖g^k‖₂².
Consider r_{k+1}² = ‖x^{k+1} − x*‖₂² = ‖x^k − x* − η_k g^k‖₂². Then

    r_{k+1}² = r_k² + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩
             = r_k² + η_k² ‖g^k‖₂² − 2η_k ⟨g^k − g*, x^k − x*⟩        (as g* = 0)
             ≤ r_k² + η_k² ‖g^k‖₂² − (2η_k/L) ‖g^k − g*‖₂²            (Coroll. 2)
             = r_k² − η_k (2/L − η_k) ‖g^k‖₂².

Since η_k < 2/L, it follows that r_{k+1} ≤ r_k.

Convergence rate

Lemma. Let ∆_k := f(x^k) − f(x*). Then ∆_{k+1} ≤ ∆_k (1 − β_k)

By convexity of f and the Cauchy–Schwarz inequality,

    ∆_k = f(x^k) − f(x*) ≤ ⟨g^k, x^k − x*⟩ ≤ ‖g^k‖₂ ‖x^k − x*‖₂ = ‖g^k‖₂ r_k.

That is, ‖g^k‖₂ ≥ ∆_k / r_k. In particular, since r_k ≤ r₀, we have

    ‖g^k‖₂ ≥ ∆_k / r₀.

Now we have a bound on the gradient norm...

Convergence rate

Recall f(x^{k+1}) ≤ f(x^k) − (1/2L)‖g^k‖₂²; subtracting f* from both sides and using ‖g^k‖₂ ≥ ∆_k/r₀,

    ∆_{k+1} ≤ ∆_k − ∆_k²/(2Lr₀²) = ∆_k (1 − ∆_k/(2Lr₀²)) = ∆_k (1 − β_k).

But we want to bound f(x^{T+1}) − f(x*):

    ∆_{k+1} ≤ ∆_k (1 − β_k)  ⟹  1/∆_{k+1} ≥ (1/∆_k)(1 + β_k) = 1/∆_k + 1/(2Lr₀²).

▸ Sum both sides over k = 0, . . . , T (telescoping) to obtain

    1/∆_{T+1} ≥ 1/∆₀ + (T+1)/(2Lr₀²)

Convergence rate

    1/∆_{T+1} ≥ 1/∆₀ + (T+1)/(2Lr₀²)

▸ Rearrange to conclude

    f(x^T) − f* ≤ 2L∆₀r₀² / (2Lr₀² + T∆₀)

▸ Use the descent lemma to bound ∆₀ ≤ (L/2)‖x⁰ − x*‖₂², and simplify:

    f(x^T) − f(x*) ≤ 2L‖x⁰ − x*‖₂² / (T + 4) = O(1/T).

SGD

    x ← x − η g

Why SGD?

Regularized Empirical Risk Minimization

    min_x  (1/n) Σ_{i=1}^{n} ℓ(y_i, xᵀa_i) + λ r(x)

(e.g., logistic regression, deep learning, SVMs, etc.)

Training data: (a_i, y_i) ∈ R^d × Y (i.i.d.)

Large-scale ML: both d and n are large:
▸ d: dimension of each input sample
▸ n: number of training data points / samples

Assume the training data are "sparse", so the total data size ≪ dn.
Running time O(#nnz)

Finite-sum problems

    min_{x∈R^d}  f(x) = (1/n) Σ_{i=1}^{n} f_i(x).

Gradient / subgradient methods

    x^{k+1} = x^k − η_k ∇f(x^k)
    x^{k+1} = x^k − η_k g(x^k),   g(x^k) ∈ ∂f(x^k)

If n is large, each iteration above is expensive

Stochastic gradients

At iteration k, we randomly pick an integer i(k) ∈ {1, 2, . . . , n} and update

    x^{k+1} = x^k − η_k ∇f_{i(k)}(x^k)

▸ The update requires only the gradient of f_{i(k)}
▸ Uses an unbiased estimate: E[∇f_{i(k)}(x)] = ∇f(x)
▸ One iteration is now n times faster than an iteration using ∇f(x)
▸ Can such a method work? If so, how fast? Why?
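
A minimal sketch of this update for an illustrative least-squares finite sum f(x) = (1/n) Σᵢ ½(aᵢᵀx − yᵢ)², with uniform sampling with replacement and a diminishing stepsize scaled by the largest component smoothness; the data and the schedule are assumptions, not prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

L_max = (A * A).sum(axis=1).max()      # each f_i is ||a_i||^2-smooth

def grad_i(x, i):                      # gradient of f_i(x) = 0.5 (a_i^T x - y_i)^2
    return (A[i] @ x - y[i]) * A[i]

x = np.zeros(d)
for k in range(1, 50_001):
    i = rng.integers(n)                # i(k) ~ Unif{0, ..., n-1}
    x -= grad_i(x, i) / (L_max * np.sqrt(k))

full_grad = A.T @ (A @ x - y) / n
print(np.linalg.norm(full_grad))
```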

Intuition – (Bertsekas)

▸ Assume all variables involved are scalars.

    min_x  f(x) = Σ_{i=1}^{n} ½(a_i x − b_i)²

▸ Solving f′(x) = 0 we obtain

    x* = (Σ_i a_i b_i) / (Σ_i a_i²)

▸ The minimum of a single f_i(x) = ½(a_i x − b_i)² is x*_i = b_i / a_i
▸ Notice now that

    x* ∈ [min_i x*_i, max_i x*_i] =: R

  (Use: Σ_i a_i b_i = Σ_i a_i² (b_i/a_i))

Intuition – (Bertsekas)

▸ Assume all variables involved are scalars.

    min_x  f(x) = Σ_{i=1}^{n} ½(a_i x − b_i)²

▸ Notice: x* ∈ [min_i x*_i, max_i x*_i] =: R
▸ What if we have a scalar x that lies outside R?
▸ We see that

    ∇f_i(x) = a_i(a_i x − b_i),    ∇f(x) = Σ_i a_i(a_i x − b_i)

▸ Outside R, ∇f_i(x) has the same sign as ∇f(x). So using ∇f_i(x) instead of ∇f(x) also ensures progress. (A small numerical demo is sketched below.)
▸ But once inside the region R, there is no guarantee that SGD will make progress towards the optimum.
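
A small numerical illustration of the sign argument with randomly drawn positive aᵢ and arbitrary bᵢ (illustrative): outside R every ∇fᵢ(x) has the same sign as ∇f(x), while at a point inside R the component gradients disagree.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.uniform(0.5, 2.0, size=5)
b = rng.uniform(-1.0, 1.0, size=5)

x_i = b / a                               # minimizers of the individual f_i
R = (x_i.min(), x_i.max())
x_star = (a * b).sum() / (a * a).sum()    # minimizer of the full sum
assert R[0] <= x_star <= R[1]

def signs_agree(x):
    full = np.sum(a * (a * x - b))        # f'(x)
    comp = a * (a * x - b)                # f_i'(x) for every i
    return bool(np.all(np.sign(comp) == np.sign(full)))

print(signs_agree(R[1] + 1.0))            # outside R: True, every f_i' points the same way
print(signs_agree((R[0] + R[1]) / 2))     # inside R: False, component gradients conflict
```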

SGD: two variants

    min_x  (1/n) Σ_i f_i(x)

• Start with a feasible x⁰
• For k = 0, 1, . . . ,
    Option 1: Randomly pick an index i (with replacement)
    Option 2: Pick index i without replacement (e.g., by cycling through a random permutation)
    Use g^k = ∇f_i(x^k) as the "stochastic gradient"
    Update x^{k+1} = x^k − η_k g^k

Explore: Which version would you use? Why? (A sketch of both samplers follows.)
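
A sketch of both options, assuming the without-replacement variant means drawing a fresh random permutation for each pass over the data (one common reading of Option 2); the least-squares components and the stepsize are illustrative.

```python
import numpy as np

def sgd(grad_i, x0, n, eta, epochs, rng, with_replacement=True):
    """Run SGD over n component functions; grad_i(x, i) returns grad f_i(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        if with_replacement:
            order = rng.integers(n, size=n)       # Option 1: i.i.d. indices
        else:
            order = rng.permutation(n)            # Option 2: random reshuffling
        for i in order:
            x -= eta * grad_i(x, i)
    return x

# Illustrative least-squares components f_i(x) = 0.5 (a_i^T x - y_i)^2
rng = np.random.default_rng(0)
n, d = 2_000, 10
A = rng.standard_normal((n, d))
y = A @ np.ones(d)
gi = lambda x, i: (A[i] @ x - y[i]) * A[i]

for mode in (True, False):
    x = sgd(gi, np.zeros(d), n, eta=0.01, epochs=20,
            rng=np.random.default_rng(1), with_replacement=mode)
    print(mode, np.linalg.norm(x - np.ones(d)))
```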

SGD: mini-batches

    min_x  f(x) = (1/n) Σ_{i=1}^{n} f_i(x)

Idea: Use a mini-batch of stochastic gradients

    x^{k+1} = x^k − (η_k / |I_k|) Σ_{j∈I_k} ∇f_j(x^k)

Iteration k samples a set I_k and uses |I_k| stochastic gradients

Increases parallelism, reduces communication

Explore: Large mini-batches are not that "favorable" for DNNs
(also known as "large-batch training")
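
A sketch of the mini-batch update on an illustrative least-squares finite sum; the batch size and the stepsize heuristic are assumptions, not prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 5_000, 20, 64
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

x = np.zeros(d)
eta = 0.5 / (A * A).sum(axis=1).mean()        # rough stepsize from average component smoothness
for k in range(2_000):
    I = rng.integers(n, size=batch)           # sampled index set I_k
    residual = A[I] @ x - y[I]
    g = A[I].T @ residual / batch             # (1/|I_k|) sum_{j in I_k} grad f_j(x)
    x -= eta * g

print(np.linalg.norm(A @ x - y) / np.sqrt(n))
```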

SGD: some theoretical challenges

    x^{k+1} = x^k − η_k ∇f_{i(k)}(x^k)

▸ Proving that it "works"
▸ Theoretical results lag behind practice (without-replacement SGD is widely used, while most published theory studies sampling with replacement)

Explore: Why does SGD work so well for neural networks? (i.e., why does it deliver such low training losses despite nonconvexity, and how does it influence the generalization behavior of neural networks?)

SGD for empirical risk / finite sums

    min_{x∈R^d}  f(x) = (1/n) Σ_{i=1}^{n} f_i(x)

– Iteration: x^{k+1} = x^k − η_k ∇f_{i(k)}(x^k)
  • Sampling with replacement: i(k) ∼ Unif({1, . . . , n})
  • Polyak–Ruppert averaging: x̄^k = (1/(k+1)) Σ_{j=0}^{k} x^j
– Convergence rate if each f_i is convex and L-smooth, and f is µ-strongly convex:

    E[f(x̄^k) − f(x*)] ≤  O(1/√k)             if η_k = 1/(L√k)
                          O(L/(µk)) = O(κ/k)   if η_k = 1/(µk)
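
A sketch of the averaged scheme on the simplest strongly convex finite sum, fᵢ(x) = ½‖x − cᵢ‖² (so µ = L = 1 and the 1/(µk) steps are obviously stable); the data and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
C = rng.standard_normal((n, d)) + 2.0          # f_i(x) = 0.5 ||x - c_i||^2
x_star = C.mean(axis=0)                        # minimizer of f = (1/n) sum_i f_i
mu = 1.0                                       # here mu = L = 1

x = np.zeros(d)
x_avg = np.zeros(d)
for k in range(1, 200_001):
    i = rng.integers(n)
    g = x - C[i]                               # stochastic gradient of f at x
    x -= g / (mu * k)                          # eta_k = 1/(mu k)
    x_avg += (x - x_avg) / k                   # Polyak-Ruppert running average

print(np.linalg.norm(x - x_star), np.linalg.norm(x_avg - x_star))
```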

SGD vs GD (strongly convex case)

[Figure: typical convergence curves of GD vs SGD]

▸ Batch GD:
  – Linear (i.e., exponential) convergence rate, in O(e^{−k/κ})
  – Iteration complexity is linear in n (O(n log 1/ε))
▸ SGD:
  – Sampling with replacement: i(k) a random element of {1, . . . , n}
  – Convergence rate O(κ/k)
  – Iteration complexity independent of n (O(1/ε²))

Convergence (some theory)

SGD: nonconvex (smooth) case

    f(x) = (1/n) Σ_i f_i(x)   and   x^{k+1} = x^k − η_k ∇f_{i_k}(x^k)

Assumption 1: L-smooth components, f_i ∈ C¹_L
Assumption 2: Unbiased gradients, E[∇f_{i_k}(x) − ∇f(x)] = 0
Assumption 3: Bounded noise, E[‖∇f_{i_k}(x) − ∇f(x)‖²] = σ²
Assumption 4: Bounded gradient, ‖∇f_i(x)‖ ≤ G

Theorem. Under the above assumptions, for a suitable stepsize SGD satisfies

    (1/T) Σ_{k=1}^{T} E[‖∇f(x^k)‖²] ≤ (1/√T) ( (f(x¹) − f(x*))/c + (Lc/2) G² ),

for some constant c; hence min_k E[‖∇f(x^k)‖²] = O(1/√T).

SGD: nonconvex (smooth) case

Proof: Using L-smoothness of f (which follows from f_i ∈ C¹_L) and taking expectations, we obtain

    E[f(x^{k+1})] ≤ E[f(x^k)] + E[⟨∇f(x^k), −η_k ∇f_{i_k}(x^k)⟩ + (L/2)‖η_k ∇f_{i_k}(x^k)‖²]
                 ≤ E[f(x^k)] − η_k E[‖∇f(x^k)‖²] + (Lη_k²/2) G².

Rearranging the terms above we obtain

    E[‖∇f(x^k)‖²] ≤ (1/η_k) E[f(x^k) − f(x^{k+1})] + (Lη_k/2) G².

Choose η_k = c/√T for some constant c and sum over k to obtain

    (1/T) Σ_{k=1}^{T} E[‖∇f(x^k)‖²] ≤ (1/(√T c)) E[f(x¹) − f(x^{T+1})] + (Lc/(2√T)) G²
                                     ≤ (1/√T) ( (f(x¹) − f(x*))/c + (Lc/2) G² ).

SGD: convex case

▸ min_{x∈X} f(x) := E[F(x, ξ)]
▸ Let ξ_k denote the randomness at step k
▸ x^k depends on the rvs ξ₁, . . . , ξ_{k−1}, so it is itself random
▸ Of course, x^k does not depend on ξ_k
▸ SGD analysis hinges upon E[‖x^k − x*‖²]
▸ SGD iteration: x^{k+1} ← P_X(x^k − η_k g^k)   (P_X: projection onto X)

Denote: R_k := ‖x^k − x*‖² and r_k := E[R_k] = E[‖x^k − x*‖²]

Bounding R_{k+1} (using that projections are nonexpansive):

    R_{k+1} = ‖x^{k+1} − x*‖₂² = ‖P_X(x^k − η_k g^k) − P_X(x*)‖₂²
            ≤ ‖x^k − x* − η_k g^k‖₂²
            = R_k + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩.

SGD – analysis for strongly cvx

    R_{k+1} ≤ R_k + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩

▸ Assume ‖g^k‖₂ ≤ G, and take expectations:

    r_{k+1} ≤ r_k + η_k² G² − 2η_k E[⟨g^k, x^k − x*⟩].

Unbiasedness (E[g^k | x^k] = ∇f(x^k)) and µ-strong convexity give

    r_{k+1} ≤ r_k + η_k² G² − 2η_k E[f(x^k) − f(x*) + (µ/2)‖x^k − x*‖²].

Rearranging and dividing by 2η_k we get

    E[f(x^k) − f(x*)] ≤ (η_k G²)/2 + ((η_k⁻¹ − µ)/2) r_k − (1/(2η_k)) r_{k+1}.

Put η_k = 1/(µk), and telescope (and one more trick...)

SGD – analysis for strongly cvx

With η_k = 1/(µk),

    E[f(x^k) − f(x*)] ≤ G²/(2µk) + (µ(k−1)/2) r_k − (µk/2) r_{k+1}.    (∗∗)

Using convexity, observe that

    E[ f( (1/T) Σ_{k=1}^{T} x^k ) − f(x*) ] ≤ (1/T) Σ_{k=1}^{T} E[f(x^k) − f(x*)]

Using (∗∗), after telescoping and clearing junk (Verify!) we get

    (1/T) Σ_{k=1}^{T} E[f(x^k) − f(x*)] ≤ (G²/(2µT)) Σ_{k=1}^{T} 1/k ≤ (G²/(2µT)) (1 + log T).

We've obtained the rate O(G² log T / (2µT)).

SGD – exercise

Exercise: Suppose each f_i is convex and f(x) is µ-strongly convex. Let x̄^k := Σ_{i=0}^{k} θ_i x^i, where θ_i = 2(i+1)/((k+1)(k+2)). Show that

    E[f(x̄^k) − f(x*)] ≤ 2G² / (µ(k+1)).

Question: What if we do not want to use averaged iterates?

SGD: weakly convex case

    R_{k+1} ≤ R_k + η_k² ‖g^k‖₂² − 2η_k ⟨g^k, x^k − x*⟩

▸ Assume ‖g^k‖₂ ≤ M on the compact set X
▸ Taking expectations:

    r_{k+1} ≤ r_k + η_k² M² − 2η_k E[⟨g^k, x^k − x*⟩].

▸ We now need to get a handle on the last term
▸ Since x^k is independent of ξ_k, we have

    E[⟨x^k − x*, g(x^k, ξ_k)⟩] = E[ E[⟨x^k − x*, g(x^k, ξ_k)⟩ | ξ_{[1..(k−1)]}] ]
                               = E[ ⟨x^k − x*, E[g(x^k, ξ_k) | ξ_{[1..(k−1)]}]⟩ ]
                               = E[⟨x^k − x*, G^k⟩],   where G^k ∈ ∂F(x^k).

SGD: weakly convex case

It remains to bound E[⟨x^k − x*, G^k⟩]

▸ Since F is convex, F(x) ≥ F(x^k) + ⟨G^k, x − x^k⟩ for any x ∈ X.
▸ Thus, in particular,

    2η_k E[F(x*) − F(x^k)] ≥ 2η_k E[⟨G^k, x* − x^k⟩]

Plug this bound back into the r_{k+1} inequality:

    r_{k+1} ≤ r_k + η_k² M² − 2η_k E[⟨G^k, x^k − x*⟩]
    2η_k E[⟨G^k, x^k − x*⟩] ≤ r_k − r_{k+1} + η_k² M²
    2η_k E[F(x^k) − F(x*)] ≤ r_k − r_{k+1} + η_k² M².

We've bounded the expected progress; what now?

SGD: weakly convex case

    2η_k E[F(x^k) − F(x*)] ≤ r_k − r_{k+1} + η_k² M².

Sum up over i = 1, . . . , k to obtain

    Σ_{i=1}^{k} 2η_i E[F(x^i) − F(x*)] ≤ r₁ − r_{k+1} + M² Σ_i η_i²
                                       ≤ r₁ + M² Σ_i η_i².

Divide both sides by Σ_i η_i:
▸ Set γ_i = η_i / Σ_i η_i.
▸ Thus γ_i ≥ 0 and Σ_i γ_i = 1, and

    E[ Σ_i γ_i (F(x^i) − F(x*)) ] ≤ (r₁ + M² Σ_i η_i²) / (2 Σ_i η_i).

SGD: weakly convex case

▸ The bound looks similar to the bound for the subgradient method
▸ But we wish to say something about the iterates x^k
▸ Since γ_i ≥ 0 and Σ_{i=1}^{k} γ_i = 1, the sum Σ_i γ_i F(x^i) is a convex combination
▸ Easier to talk about the averaged iterate

    x̄^k := Σ_{i=1}^{k} γ_i x^i.

▸ F(x̄^k) ≤ Σ_i γ_i F(x^i) due to convexity

So we finally obtain the inequality

    E[F(x̄^k) − F(x*)] ≤ (r₁ + M² Σ_i η_i²) / (2 Σ_i η_i).

SGD: weakly convex case

♠ Let D_X := max_{x∈X} ‖x − x*‖₂ (actually we only need ‖x¹ − x*‖ ≤ D_X)
♠ Assume η_i = η is a constant. Observe that

    E[F(x̄^k) − F(x*)] ≤ (D_X² + M² k η²) / (2kη)

♠ Minimize the rhs over η > 0 to obtain

    E[F(x̄^k) − F(x*)] ≤ D_X M / √k

♠ If k is not fixed in advance, then choose

    η_i = θ D_X / (M √i),   i = 1, 2, . . .

♠ Analyze E[F(x̄^k) − F(x*)] with this choice of stepsize

We showed an O(1/√k) rate

SGD: weakly convex and smooth

Exercise: Assuming the cost (and its component functions) are L-smooth and convex, study the convergence rate of SGD.
Hint: Use the bounded-noise assumption E[‖∇f_{i_k}(x) − ∇f(x)‖²] = σ².
