Optimization for Machine Learning
Suvrit Sra
Massachusetts Institute of Technology
11 Mar, 2021
First-order methods
Gradient Descent
    x ← x − η∇f(x)
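As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this update on a least-squares objective; the data, the objective, and the stepsize choice η = 1/L are assumptions made for the example.

```python
import numpy as np

# Minimal gradient-descent sketch for f(x) = 0.5 * ||A x - b||^2 (illustrative choice).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

def grad_f(x):
    return A.T @ (A @ x - b)          # ∇f(x) = Aᵀ(Ax − b)

L = np.linalg.norm(A, 2) ** 2          # smoothness constant of this particular f
eta = 1.0 / L                          # a safe constant stepsize (cf. descent lemma later)
x = np.zeros(10)
for _ in range(500):
    x = x - eta * grad_f(x)            # x ← x − η ∇f(x)

print(np.linalg.norm(grad_f(x)))       # gradient norm should be small
```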
Descent methods
[Figure: iterates x^k, x^{k+1}, … converging to x*, where ∇f(x*) = 0]
Descent methods
▶ Suppose we have a vector x ∈ R^n for which ∇f(x) ≠ 0
▶ Consider updating x using
    x(η) = x + ηd,
Descent methods
[Figure: the gradient ∇f(x) at a point x, the gradient steps x − α∇f(x) and x − δ∇f(x), and another descent direction d]
Gradient based methods
    x^{k+1} = x^k + η_k d^k,   k = 0, 1, . . .
Gradient methods – direction
    x^{k+1} = x^k + η_k d^k,   k = 0, 1, . . .
Stepsize selection
Stepsize selection?
    η_k = β^{m_k} s,
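The rule η_k = β^{m_k} s is usually read as a backtracking search: start from s and shrink by β until a sufficient-decrease test holds. Below is a hedged sketch of that reading; the Armijo-style test, the constant `sigma`, and the quadratic usage example are my own choices, not taken from the slide.

```python
import numpy as np

def backtracking_step(f, grad_f, x, d, s=1.0, beta=0.5, sigma=1e-4):
    """Return eta = beta**m * s, the first trial step satisfying a
    sufficient-decrease (Armijo-type) condition along direction d."""
    eta = s
    fx, gx = f(x), grad_f(x)
    while f(x + eta * d) > fx + sigma * eta * (gx @ d):
        eta *= beta                    # shrink: eta = beta**m * s
    return eta

# Tiny usage example on f(x) = 0.5 * ||x||^2 with d = -∇f(x).
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([3.0, -4.0])
print(backtracking_step(f, grad_f, x, -grad_f(x)))
```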
Barzilai-Borwein step-size?
    η_k = ⟨u_k, v_k⟩ / ‖v_k‖²,   or   η_k = ‖u_k‖² / ⟨u_k, v_k⟩,
    where u_k = x^k − x^{k−1},  v_k = ∇f(x^k) − ∇f(x^{k−1})
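A small sketch of both BB formulas as they would be used inside gradient descent; the quadratic test problem, the warm-up step, and the safeguard on the denominator are assumptions added for the example.

```python
import numpy as np

def bb_stepsize(x, x_prev, g, g_prev, variant=1, eps=1e-12):
    """Barzilai-Borwein stepsizes: BB1 = <u,v>/||v||^2, BB2 = ||u||^2/<u,v>."""
    u = x - x_prev                      # u_k = x^k − x^{k−1}
    v = g - g_prev                      # v_k = ∇f(x^k) − ∇f(x^{k−1})
    if variant == 1:
        return (u @ v) / max(v @ v, eps)
    return (u @ u) / max(u @ v, eps)

# Usage on f(x) = 0.5 xᵀAx (A positive definite), starting from one plain GD step.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
A = M @ M.T + np.eye(10)
grad = lambda x: A @ x

x_prev = rng.standard_normal(10)
x = x_prev - 1e-3 * grad(x_prev)        # warm-up step so that u_1, v_1 are defined
for _ in range(100):
    g, g_prev = grad(x), grad(x_prev)
    eta = bb_stepsize(x, x_prev, g, g_prev, variant=1)
    x_prev, x = x, x - eta * g
print(np.linalg.norm(grad(x)))
```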
Exercise
♠ Let D be the (n − 1) × n differencing matrix
    D = ⎡ −1   1                ⎤
        ⎢      −1   1           ⎥
        ⎢            ⋱    ⋱     ⎥   ∈ R^{(n−1)×n},
        ⎣                −1   1 ⎦
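To make the structure of D concrete, here is a small sketch constructing it with NumPy (the size n = 6 is just an example): each row has a −1 and a +1 in consecutive columns, so Dx is the vector of consecutive differences x_{i+1} − x_i.

```python
import numpy as np

n = 6  # example size
# (n-1) x n differencing matrix: -1 on the main diagonal, +1 on the first superdiagonal.
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n, k=0)
print(D)

x = np.arange(n, dtype=float) ** 2
print(D @ x)   # consecutive differences x[i+1] - x[i]
```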
Convergence
(remarks)
Gradient descent – convergence
Theorem. If f ∈ S¹_{L,µ}, η = 2/(L+µ), and (x^k) is generated by GD, then
    f(x^T) − f* ≤ (L/2) ((κ−1)/(κ+1))^{2T} ‖x^0 − x*‖²₂,
where κ := L/µ is the condition number.
Proof
(sketches)
Key result: The Descent Lemma
If f ∈ C¹_L, then for all x, y:
    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖²₂
Descent lemma – corollary
Coroll. 1  If f ∈ C¹_L and 0 < η_k < 2/L, then f(x^{k+1}) < f(x^k)
Proof.
    f(x^{k+1}) ≤ f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L/2) ‖x^{k+1} − x^k‖²
              = f(x^k) − η_k ‖∇f(x^k)‖²₂ + (η_k² L/2) ‖∇f(x^k)‖²₂
              = f(x^k) − η_k (1 − (η_k/2) L) ‖∇f(x^k)‖²₂
If η_k < 2/L we have descent. Minimizing over η_k gives the best bound, attained at η_k = 1/L.
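A quick numerical sanity check of Coroll. 1 (my own addition, not from the slides): on an L-smooth quadratic, any stepsize below 2/L yields monotone descent, and the predicted per-step decrease η_k(1 − η_kL/2)‖∇f(x^k)‖² is respected.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((20, 20))
A = M.T @ M + np.eye(20)               # f(x) = 0.5 xᵀAx is L-smooth with L = λ_max(A)
L = np.linalg.eigvalsh(A)[-1]
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for eta in [0.5 / L, 1.0 / L, 1.9 / L]:        # all strictly below 2/L
    x = rng.standard_normal(20)
    for _ in range(50):
        g = grad(x)
        x_next = x - eta * g
        # descent-lemma corollary: f(x^{k+1}) <= f(x^k) - eta*(1 - eta*L/2)*||g||^2
        assert f(x_next) <= f(x) - eta * (1 - eta * L / 2) * (g @ g) + 1e-9
        x = x_next
    print(eta * L, f(x))
```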
Convergence of gradient norm
▶ We showed that
    f(x^k) − f(x^{k+1}) ≥ (1/(2L)) ‖∇f(x^k)‖²₂,
▶ min_{0≤k≤T} ‖∇f(x^k)‖² ≤ (1/(T+1)) Σ_{k=0}^T ‖∇f(x^k)‖², so O(1/ε) iterations for ‖∇f‖² ≤ ε
▶ Notice, we did not require f to be convex . . .
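The step from the per-iteration bound to the second bullet is a telescoping sum; a brief worked version (standard, not spelled out in the extracted slide):

```latex
\sum_{k=0}^{T} \frac{1}{2L}\|\nabla f(x^k)\|_2^2
  \le \sum_{k=0}^{T}\bigl(f(x^k) - f(x^{k+1})\bigr)
  = f(x^0) - f(x^{T+1}) \le f(x^0) - f^* .
% Hence
\min_{0\le k\le T}\|\nabla f(x^k)\|_2^2
  \le \frac{1}{T+1}\sum_{k=0}^{T}\|\nabla f(x^k)\|_2^2
  \le \frac{2L\,(f(x^0)-f^*)}{T+1},
% so \|\nabla f(x^k)\|_2^2 \le \epsilon after O(1/\epsilon) iterations.
```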
Convergence rate – strongly convex
    ‖x^k − x*‖²₂ ≤ (1 − 2ηµL/(µ+L))^k ‖x^0 − x*‖²₂.
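A small numerical illustration of this bound (my own addition) on a strongly convex quadratic with η = 2/(µ+L): the squared distance to x* stays below (1 − 2ηµL/(µ+L))^k ‖x^0 − x*‖².

```python
import numpy as np

rng = np.random.default_rng(3)
mu, L, n = 1.0, 50.0, 30
# Diagonal quadratic with spectrum in [mu, L]: f(x) = 0.5 xᵀ diag(lams) x, so x* = 0.
lams = np.linspace(mu, L, n)
grad = lambda x: lams * x

eta = 2.0 / (mu + L)
rho = 1.0 - 2.0 * eta * mu * L / (mu + L)      # contraction factor from the slide
x0 = rng.standard_normal(n)
x = x0.copy()
for k in range(1, 101):
    x = x - eta * grad(x)
    assert x @ x <= rho ** k * (x0 @ x0) + 1e-12
print("bound holds for 100 iterations")
```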
Convergence – strongly convex case
Coroll. 2  For convex f ∈ C¹_L and all x, y:
    ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L) ‖∇f(x) − ∇f(y)‖²₂
Exercise: Prove Cor. 2. (Hint: Consider φ(y) = f(y) − ⟨∇f(x₀), y⟩).
Strongly convex – rate
Convergence rate – (weakly) convex
Proof. Descent lemma implies that:  f(x^{k+1}) ≤ f(x^k) − (1/(2L)) ‖g^k‖²₂
Consider  r²_{k+1} = ‖x^{k+1} − x*‖²₂ = ‖x^k − x* − η_k g^k‖²₂.
    r²_{k+1} = r²_k + η²_k ‖g^k‖²₂ − 2η_k ⟨g^k, x^k − x*⟩
             = r²_k + η²_k ‖g^k‖²₂ − 2η_k ⟨g^k − g*, x^k − x*⟩    (as g* = 0)
             ≤ r²_k + η²_k ‖g^k‖²₂ − (2η_k/L) ‖g^k − g*‖²₂        (Coroll. 2)
             = r²_k − η_k (2/L − η_k) ‖g^k‖²₂.
In particular, for 0 < η_k ≤ 2/L we get r_{k+1} ≤ r_k, so r_k ≤ r_0.
Convergence rate
    f(x^k) − f(x*) = ∆_k ≤ ⟨g^k, x^k − x*⟩                (convexity of f)
                         ≤ ‖g^k‖₂ ‖x^k − x*‖₂ = ‖g^k‖₂ r_k   (Cauchy–Schwarz)
Since r_k ≤ r_0 (previous slide), this gives
    ‖g^k‖₂ ≥ ∆_k / r_0.
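Combining the two previous slides gives the recursion used next; a short worked version of that step (with η_k = 1/L, so the descent lemma gives the 1/(2L) decrease):

```latex
\Delta_{k+1} = f(x^{k+1}) - f^*
  \le f(x^k) - f^* - \frac{1}{2L}\|g^k\|_2^2
  = \Delta_k - \frac{1}{2L}\|g^k\|_2^2
  \le \Delta_k - \frac{\Delta_k^2}{2L r_0^2},
% using \|g^k\|_2 \ge \Delta_k / r_0 from above.
```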
Convergence rate
    ∆_{k+1} ≤ ∆_k − ∆²_k/(2Lr²_0) = ∆_k (1 − ∆_k/(2Lr²_0)) = ∆_k (1 − β_k).
    ∆_{k+1} ≤ ∆_k (1 − β_k)  ⟹  1/∆_{k+1} ≥ (1/∆_k)(1 + β_k) = 1/∆_k + 1/(2Lr²_0).
▶ Sum both sides over k = 0, . . . , T (telescoping) to obtain
    1/∆_{T+1} ≥ 1/∆_0 + (T+1)/(2Lr²_0)
Convergence rate
    1/∆_{T+1} ≥ 1/∆_0 + (T+1)/(2Lr²_0)
▶ Rearrange to conclude
    f(x^T) − f* ≤ 2L∆_0 r²_0 / (2Lr²_0 + T∆_0)
SGD
x ← x − ηg
Why SGD?
Finite-sum problems
    min_{x∈R^d} f(x) = (1/n) Σ_{i=1}^n f_i(x).
    x^{k+1} = x^k − η_k ∇f(x^k)
    x^{k+1} = x^k − η_k g(x^k),  g ∈ ∂f(x^k)
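A minimal sketch contrasting the full-gradient update with its stochastic counterpart on a least-squares finite sum; the data, the component losses, and the stepsizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# f(x) = (1/n) * sum_i f_i(x), with f_i(x) = 0.5 * (a_iᵀx - b_i)^2
grad_full = lambda x: A.T @ (A @ x - b) / n
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])

# Full-gradient method: x^{k+1} = x^k - eta_k * ∇f(x^k)
x_gd = np.zeros(d)
for k in range(200):
    x_gd -= 0.01 * grad_full(x_gd)

# SGD: x^{k+1} = x^k - eta_k * ∇f_{i_k}(x^k), with i_k sampled uniformly
x_sgd = np.zeros(d)
for k in range(200 * n):                      # same number of component-gradient evaluations
    i = rng.integers(n)
    x_sgd -= 0.01 / (1 + k // n) * grad_i(x_sgd, i)   # slowly decaying stepsize

print(np.linalg.norm(grad_full(x_gd)), np.linalg.norm(grad_full(x_sgd)))
```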
Stochastic gradients
Intuition – (Bertsekas)
▶ ∇f_i(x) has the same sign as ∇f(x). So using ∇f_i(x) instead of ∇f(x) also ensures progress.
▶ But once inside region R, no guarantee that SGD will make progress towards the optimum.
SGD: two variants
    min (1/n) Σ_i f_i(x)
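The slide's description of the two variants is not in the extracted notes; a common reading is index sampling with replacement versus cycling through a fresh random permutation each epoch (random reshuffling). The sketch below shows only that sampling difference and is my assumption, not the slide's definition.

```python
import numpy as np

def sgd_with_replacement(grad_i, x, n, eta, epochs, rng):
    """Each step picks i(k) uniformly at random from {0, ..., n-1}."""
    for _ in range(epochs * n):
        i = rng.integers(n)
        x = x - eta * grad_i(x, i)
    return x

def sgd_random_reshuffling(grad_i, x, n, eta, epochs, rng):
    """Each epoch visits every index exactly once, in a fresh random order."""
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = x - eta * grad_i(x, i)
    return x

# Tiny usage on f_i(x) = 0.5*(a_iᵀx - b_i)^2 (illustrative data).
rng = np.random.default_rng(5)
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
g_i = lambda x, i: A[i] * (A[i] @ x - b[i])
print(sgd_with_replacement(g_i, np.zeros(5), 100, 0.01, 20, rng))
print(sgd_random_reshuffling(g_i, np.zeros(5), 100, 0.01, 20, rng))
```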
SGD: mini-batches
    min_x f(x) = (1/n) Σ_{i=1}^n f_i(x)
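A hedged sketch of the mini-batch variant: average the stochastic gradient over a batch of indices before taking a step. The batch size, stepsize, and least-squares components are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, batch = 500, 20, 32
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def minibatch_grad(x, idx):
    """Average of ∇f_i(x) over the indices i in the mini-batch idx."""
    Ab = A[idx]
    return Ab.T @ (Ab @ x - b[idx]) / len(idx)

x = np.zeros(d)
for k in range(2000):
    idx = rng.choice(n, size=batch, replace=False)   # sample a mini-batch
    x -= 0.05 * minibatch_grad(x, idx)

print(np.linalg.norm(A.T @ (A @ x - b) / n))         # full-gradient norm as a check
```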
SGD: some theoretical challenges
SGD for empirical risk / finite sums
    min_{x∈R^d} f(x) = (1/n) Σ_{i=1}^n f_i(x)
SGD vs GD (strongly convex case)
[Figure: iterate trajectories of GD vs SGD]
▶ Batch GD:
   – Linear (e.g., exponential) convergence rate in O(e^{−k/κ})
   – Iteration complexity is linear in n (O(n log 1/ε))
▶ SGD:
   – Sampling with replacement: i(k) a random element of {1, . . . , n}
   – Convergence rate O(κ/k)
   – Iteration complexity independent of n (O(1/ε²))
Convergence
(some theory)
SGD: nonconvex (smooth) case
    f(x) = (1/n) Σ_i f_i(x)   and   x^{k+1} = x^k − η_k ∇f_{i_k}(x^k)
Assumption 1: L-smooth components f_i ∈ C¹_L
Assumption 2: Unbiased gradients E[∇f_{i_k}(x) − ∇f(x)] = 0
Assumption 3: Bounded noise: E[‖∇f_{i_k}(x) − ∇f(x)‖²] = σ²
Assumption 4: Bounded gradient: ‖∇f_i(x)‖ ≤ G
SGD: nonconvex (smooth) case
    E[f(x^{k+1})] ≤ E[f(x^k)] + E[⟨∇f(x^k), −η_k ∇f_{i_k}(x^k)⟩ + (L/2) ‖η_k ∇f_{i_k}(x^k)‖²]
               ≤ E[f(x^k)] − η_k E[‖∇f(x^k)‖²] + (Lη²_k/2) G².
Summing over k = 1, . . . , T with the constant stepsize η_k ≡ c/√T and dividing by Tη_k = c√T:
    (1/T) Σ_{k=1}^T E[‖∇f(x^k)‖²] ≤ (1/(c√T)) E[f(x^1) − f(x^{T+1})] + (Lc/(2√T)) G²
                                 ≤ (1/√T) [ (f(x^1) − f(x*))/c + (Lc/2) G² ].
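A sketch of this analysis turned into an experiment (my own construction): run SGD with the constant stepsize η = c/√T and report the average squared gradient norm, which the bound says decays roughly like 1/√T. The nonconvex objective (a small sum of sigmoid losses) and the constant c are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 300, 15
A = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

sig = lambda t: 1.0 / (1.0 + np.exp(-t))

def grad_i(x, i):
    # f_i(x) = sigmoid(-y_i * a_iᵀx): smooth, nonconvex, with bounded gradient.
    s = sig(-y[i] * (A[i] @ x))
    return s * (1.0 - s) * (-y[i]) * A[i]

def full_grad(x):
    s = sig(-y * (A @ x))
    return (s * (1.0 - s) * (-y)) @ A / n

c = 1.0
for T in [100, 1000, 10000]:
    eta = c / np.sqrt(T)                  # constant stepsize eta = c / sqrt(T)
    x = np.zeros(d)
    avg_sq_grad = 0.0
    for k in range(T):
        g = full_grad(x)
        avg_sq_grad += g @ g / T          # running average of ||∇f(x^k)||^2
        x -= eta * grad_i(x, rng.integers(n))
    print(T, avg_sq_grad)                 # should shrink roughly like 1/sqrt(T)
```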
SGD: convex case
▶ min_{x∈X} f(x) := E[F(x, ξ)]
▶ Let ξ_k denote the randomness at step k
▶ x^k depends on rvs ξ_1, . . . , ξ_{k−1}, so is itself random
▶ Of course, x^k does not depend on ξ_k
▶ SGD analysis hinges upon: E[‖x^k − x*‖²]
▶ SGD iteration: x^{k+1} ← P_X(x^k − η_k g^k)   (P_X: projection onto X)
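A minimal sketch of the projected iteration x^{k+1} ← P_X(x^k − η_k g^k), with X taken to be a Euclidean ball so the projection has a closed form; the objective, radius, and stepsize schedule are illustrative assumptions.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto X = {x : ||x||_2 <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

rng = np.random.default_rng(8)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
stoch_grad = lambda x, i: A[i] * (A[i] @ x - b[i])   # unbiased for ∇f with f = (1/n) Σ_i f_i

x = np.zeros(d)
for k in range(1, 5001):
    g = stoch_grad(x, rng.integers(n))               # g^k uses the fresh randomness ξ_k
    x = project_ball(x - (0.1 / np.sqrt(k)) * g)     # x^{k+1} = P_X(x^k − η_k g^k)
print(x)
```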
SGD – analysis for strongly cvx
    E[f(x^k) − f(x*)] ≤ (η_k G²)/2 + ((η_k^{−1} − µ)/2) r_k − (1/(2η_k)) r_{k+1}.
SGD – analysis for strongly cvx
Choosing η_k = 1/(µk), this becomes
    E[f(x^k) − f(x*)] ≤ G²/(2µk) + (µ(k−1)/2) r_k − (µk/2) r_{k+1}.   (∗∗)
Using convexity, observe that
    E[f((1/T) Σ_{k=1}^T x^k)] − f(x*) ≤ (1/T) Σ_{k=1}^T E[f(x^k) − f(x*)]
We've obtained the rate O(G² log T / (2µT))
SGD – exercise
    E[f(x̄_k) − f(x*)] ≤ 2G²/(µ(k+1)).
SGD: weakly convex case
    E[F(x̄_k) − F(x*)] ≤ (r_1 + M² Σ_i η²_i) / (2 Σ_i η_i).
SGD: weakly convex case
    E[F(x̄_k) − F(x*)] ≤ (D²_X + M² k η²) / (2kη)
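For completeness (a standard consequence, not shown in the extracted slide): minimizing the right-hand side over the constant stepsize η balances the two terms,

```latex
\eta = \frac{D_X}{M\sqrt{k}}
\quad\Longrightarrow\quad
\mathbb{E}\bigl[F(\bar{x}_k) - F(x^*)\bigr]
  \le \frac{D_X^2 + M^2 k \eta^2}{2k\eta}
  = \frac{D_X M}{\sqrt{k}}.
```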
SGD: weakly convex and smooth