Gradient Descent
Ryan Tibshirani
Convex Optimization 10-725
Last time: canonical convex programs

    min_x    c^T x
    subject to    Dx ≤ d
                  Ax = b
Gradient descent

Consider unconstrained minimization of a convex, differentiable function f:

    min_x    f(x)

Gradient descent: choose an initial point x^(0), and repeat

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),    k = 1, 2, 3, . . .

where t_k > 0 are step sizes.
Gradient descent interpretation

At each iteration, gradient descent minimizes the quadratic approximation

    f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖₂²

formed by replacing the Hessian ∇²f(x) with (1/t)·I. Minimizing over y yields the update

    x⁺ = x − t∇f(x)
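The update rule can be sketched in a few lines of Python; the quadratic objective, step size, and iteration count below are illustrative choices, not taken from the lecture:

```python
import numpy as np

def gradient_descent(grad, x0, t, num_steps):
    """Gradient descent with a fixed step size t: x+ = x - t * grad f(x)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(num_steps):
        x = x - t * grad(x)      # the gradient descent update
        path.append(x.copy())
    return x, path

# Illustrative example: f(x) = (10 x1^2 + x2^2) / 2, minimized at the
# origin, with gradient (10 x1, x2)
grad = lambda x: np.array([10.0 * x[0], x[1]])
x_star, path = gradient_descent(grad, [1.0, 1.0], t=0.1, num_steps=100)
```

With t = 0.1 the iterates contract toward the minimizer at the origin; a larger fixed step (say t = 0.25, so that 1 − 10t < −1) would diverge along the first coordinate.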
Outline
Today:
• How to choose step sizes
• Convergence analysis
• Nonconvex functions
• Gradient boosting
Fixed step size

Simply take t_k = t for all k = 1, 2, 3, . . . Can diverge if t is too big:

[Figure: gradient descent iterates overshooting back and forth across the contours when t is too big]
Can be slow if t is too small. Same example, gradient descent after 100 steps:

[Figure: iterates creeping slowly along the contours toward the optimum]
Converges nicely when t is “just right”. Same example, 40 steps:

[Figure: iterates converging smoothly to the optimum]
Backtracking line search

One way to adaptively choose the step size is to use backtracking line search:

• First fix parameters 0 < β < 1 and 0 < α ≤ 1/2
• At each iteration, start with t = t_init, and while

      f(x − t∇f(x)) > f(x) − αt‖∇f(x)‖₂²

  shrink t = βt. Else perform the gradient descent update

      x⁺ = x − t∇f(x)

(Figure 9.1 of Boyd & Vandenberghe illustrates this: the curve shows f restricted to the line over which we search, the lower dashed line shows the linear extrapolation of f, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that f lies below the upper dashed line, i.e., 0 ≤ t ≤ t₀. For us, the search direction is ∆x = −∇f(x).)
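A direct transcription of the backtracking rule above into Python, assuming t_init = 1 (the function, gradient, and test point are illustrative):

```python
import numpy as np

def backtracking(f, grad, x, alpha=0.5, beta=0.5, t_init=1.0):
    """Shrink t by a factor beta until the sufficient-decrease condition
    f(x - t*g) <= f(x) - alpha * t * ||g||_2^2 holds, with g = grad f(x)."""
    g = grad(x)
    t = t_init
    while f(x - t * g) > f(x) - alpha * t * (g @ g):
        t *= beta
    return t

# Illustrative example: f(x) = (10 x1^2 + x2^2) / 2
f = lambda x: (10.0 * x[0]**2 + x[1]**2) / 2
grad = lambda x: np.array([10.0 * x[0], x[1]])
t = backtracking(f, grad, np.array([1.0, 1.0]))
```

The returned t is then used for the gradient descent update at the current point; the shrinking restarts from t_init at the next iteration.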
Setting α = β = 0.5, backtracking picks up roughly the right step size (12 outer steps, 40 steps total):

[Figure: backtracking iterates converging to the optimum on the same example]
Exact line search

We could also choose the step to do the best we can along the direction of the negative gradient, called exact line search:

    t = argmin_{s ≥ 0} f(x − s∇f(x))

Usually it is not possible to do this minimization exactly, and approximations to it tend not to be worth the effort compared to backtracking.
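For a quadratic f(x) = x^T A x / 2 the exact line search step has a closed form, which makes the idea easy to check numerically: with g = ∇f(x) = Ax, minimizing s ↦ f(x − sg) gives s = (g^T g)/(g^T A g). The matrix A and point x below are illustrative:

```python
import numpy as np

# For f(x) = x^T A x / 2 with A symmetric positive definite, the gradient
# is g = A x, and minimizing s -> f(x - s*g) gives s = (g^T g) / (g^T A g).
A = np.diag([10.0, 1.0])          # illustrative quadratic
x = np.array([1.0, 1.0])
g = A @ x
t_exact = (g @ g) / (g @ A @ g)

# Sanity check against a grid of candidate step sizes
f = lambda z: z @ A @ z / 2
grid = np.linspace(0.0, 1.0, 1001)
t_grid = grid[np.argmin([f(x - s * g) for s in grid])]
```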
Convergence analysis

Assume that f is convex and differentiable, with dom(f) = R^n, and additionally that ∇f is Lipschitz continuous with constant L > 0.

Theorem: gradient descent with fixed step size t ≤ 1/L satisfies

    f(x^(k)) − f* ≤ ‖x^(0) − x*‖₂² / (2tk)

and the same result holds for backtracking, with t replaced by β/L.

We say gradient descent has convergence rate O(1/k); i.e., it finds an ε-suboptimal point in O(1/ε) iterations.
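The O(1/k) bound is easy to verify numerically on a small example. Below, a sketch that checks f(x^(k)) − f* ≤ ‖x^(0) − x*‖₂²/(2tk) at every iteration for a quadratic with a hand-computed Lipschitz constant (all choices illustrative):

```python
import numpy as np

A = np.diag([10.0, 1.0])      # f(x) = x^T A x / 2, so f* = 0 and x* = 0
L = 10.0                      # Lipschitz constant of grad f = lambda_max(A)
t = 1.0 / L
f = lambda z: z @ A @ z / 2

x0 = np.array([1.0, 1.0])
x = x0.copy()
bound_holds = True
for k in range(1, 101):
    x = x - t * (A @ x)       # gradient step
    bound_holds = bound_holds and f(x) <= (x0 @ x0) / (2 * t * k)
```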
Analysis for strong convexity

Now assume additionally that f is strongly convex with parameter m > 0, i.e., f(x) − (m/2)‖x‖₂² is convex. Then gradient descent with fixed step size t ≤ 2/(m + L), or with backtracking line search, satisfies

    f(x^(k)) − f* ≤ γ^k (L/2) ‖x^(0) − x*‖₂²

where 0 < γ < 1. The rate is thus exponentially fast in k.
This is called linear convergence, because the objective-versus-iteration curve looks linear on a semi-log plot. (See Figure 9.6, from B & V page 487: error f(x^(k)) − p⋆ versus iteration k for the gradient method with backtracking and exact line search, for a problem in R^100. These experiments suggest that the effect of the backtracking parameters on the convergence is not large, no more than a factor of two or so.)

Important note: γ = O(1 − m/L). Thus we can write the convergence rate as

    O( (L/m) log(1/ε) )

Higher condition number L/m ⇒ slower rate. This is not only true in theory; it is very apparent in practice too.
A look at the conditions

Consider the least squares criterion f(β) = (1/2)‖y − Xβ‖₂², whose Hessian is ∇²f(β) = X^T X.

Lipschitz continuity of ∇f:
• Recall this means ∇²f(x) ⪯ LI
• As ∇²f(β) = X^T X, we have L = λ_max(X^T X)

Strong convexity of f:
• Recall this means ∇²f(x) ⪰ mI
• As ∇²f(β) = X^T X, we have m = λ_min(X^T X)
• If X is wide (X is n × p with p > n), then λ_min(X^T X) = 0, and f can’t be strongly convex
• Even if σ_min(X) > 0, we can have a very large condition number L/m = λ_max(X^T X)/λ_min(X^T X)
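These eigenvalue formulas can be checked directly with NumPy; the random matrices below are illustrative stand-ins for a design matrix X:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tall = rng.standard_normal((50, 5))    # n = 50 > p = 5: strong convexity holds
X_wide = rng.standard_normal((5, 50))    # p = 50 > n = 5: lambda_min(X^T X) = 0

def strong_convexity_constants(X):
    """Return (m, L) = (lambda_min, lambda_max) of the Hessian X^T X."""
    evals = np.linalg.eigvalsh(X.T @ X)  # eigenvalues in ascending order
    return evals[0], evals[-1]

m_tall, L_tall = strong_convexity_constants(X_tall)
m_wide, L_wide = strong_convexity_constants(X_wide)
```

For the wide X, the Hessian X^T X is rank-deficient, so m = 0 up to floating point error and the least squares criterion is not strongly convex.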
Practicalities

Stopping rule: stop when ‖∇f(x)‖₂ is small (recall that ∇f(x*) = 0 at the solution x*).
Can we do better?

Gradient descent has O(1/ε) convergence rate over the problem class of convex, differentiable functions with Lipschitz gradients. But this is not optimal: first-order methods obey the lower bound

    f(x^(k)) − f* ≥ 3L‖x^(0) − x*‖₂² / (32(k + 1)²)

for some function in this class. Can we attain the rate O(1/k²), i.e., O(1/√ε)? Answer: yes (we’ll see)!
Analysis for nonconvex case

Assume f is differentiable with Lipschitz gradient as before, but now nonconvex. Asking for optimality is too much. So let’s settle for an ε-substationary point x, which means ‖∇f(x)‖₂ ≤ ε.

Theorem: gradient descent with fixed step size t ≤ 1/L satisfies

    min_{i=0,...,k} ‖∇f(x^(i))‖₂ ≤ √( 2(f(x^(0)) − f*) / (t(k + 1)) )

Thus gradient descent has rate O(1/√k), or O(1/ε²), even in the nonconvex case, for finding stationary points.¹

¹ Carmon et al. (2017), “Lower bounds for finding stationary points I”
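A tiny illustration of what settling for stationarity means: on the nonconvex function f(x) = x⁴/4 − x²/2 (an illustrative choice), gradient descent drives ‖∇f‖ to zero but only reaches a local minimizer determined by the starting point:

```python
# f(x) = x**4 / 4 - x**2 / 2 has stationary points at x = -1, 0, 1;
# the local minimizers are x = -1 and x = 1.
grad = lambda x: x**3 - x

x = 0.5                      # starting point (illustrative)
for _ in range(200):
    x = x - 0.1 * grad(x)    # fixed step size, small enough for stability
```

Starting from x = 0.5 the iterates converge to the stationary point x = 1; starting from x = −0.5 they would converge to x = −1 instead.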
Gradient boosting
Given responses yi ∈ R and features xi ∈ Rp , i = 1, . . . , n
...
Pick a loss function L to reflect the setting. For continuous responses, e.g., we could take L(y_i, u_i) = (y_i − u_i)².

We want to solve

    min_β  Σ_{i=1}^n  L( y_i, Σ_{j=1}^M β_j · T_j(x_i) )
Start with an initial model, a single tree: u^(0) = T_0. Then repeat:

• Compute the negative gradient d at the latest prediction u^(k−1):

      d_i = − [ ∂L(y_i, u_i) / ∂u_i ] |_{u_i = u_i^(k−1)},    i = 1, . . . , n

• Fit a tree T_k to the negative gradient d (i.e., take d as the working response)
• Update the model, for a step size α_k:

      u^(k) = u^(k−1) + α_k · T_k
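The loop above can be sketched end to end for squared loss. The stump fitter and all names below are simplified illustrations, not from the lecture (real implementations use full regression trees and choose α_k by line search or shrinkage):

```python
import numpy as np

def fit_stump(x, d):
    """Fit a one-feature decision stump to targets d: find the threshold s
    minimizing squared error, predicting the mean of d on each side."""
    best_err, best_rule = np.inf, None
    for s in x[:-1]:                      # candidate thresholds (x sorted)
        left, right = d[x <= s], d[x > s]
        err = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if err < best_err:
            best_err, best_rule = err, (s, left.mean(), right.mean())
    s, lm, rm = best_rule
    return lambda z: np.where(z <= s, lm, rm)

def boost(x, y, num_rounds=50, alpha=0.1):
    u = np.full_like(y, y.mean())         # initial model u^(0): a constant
    for _ in range(num_rounds):
        d = y - u                         # negative gradient of squared loss
                                          # (up to a factor 2, absorbed in alpha)
        T = fit_stump(x, d)               # fit weak learner T_k to d
        u = u + alpha * T(x)              # u^(k) = u^(k-1) + alpha * T_k
    return u

x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x)
u = boost(x, y)
```

Each round fits the weak learner to the residual direction (the negative gradient of the loss at the current predictions), which is exactly gradient descent in function space.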
References and further reading