Gradient Descent

Ryan Tibshirani
Convex Optimization 10-725
Last time: canonical convex programs

• Linear program (LP): takes the form

  min_x  cᵀx
  subject to  Dx ≤ d,  Ax = b

• Quadratic program (QP): like LP, but with quadratic criterion


• Semidefinite program (SDP): like LP, but with matrices
• Conic program: the most general form of all

Gradient descent

Consider unconstrained, smooth convex optimization

  min_x  f(x)

That is, f is convex and differentiable with dom(f) = Rⁿ. Denote the optimal criterion value by f⋆ = min_x f(x), and a solution by x⋆

Gradient descent: choose initial point x^(0) ∈ Rⁿ, repeat:

  x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),  k = 1, 2, 3, . . .

Stop at some point
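
A minimal sketch of this iteration in Python; the name grad_f, the fixed step size t, and the gradient-norm stopping rule are illustrative assumptions, not part of the slides:

import numpy as np

def gradient_descent(grad_f, x0, t=0.1, max_iter=1000, tol=1e-8):
    # x^(k) = x^(k-1) - t * grad_f(x^(k-1)), k = 1, 2, 3, ...
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:  # "stop at some point": here, small gradient
            break
        x = x - t * g                 # step in the negative gradient direction
    return x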

Gradient descent interpretation

At each iteration, consider the expansion

  f(y) ≈ f(x) + ∇f(x)ᵀ(y − x) + (1/(2t)) ‖y − x‖₂²

Quadratic approximation, replacing usual Hessian ∇²f(x) by (1/t)I

  f(x) + ∇f(x)ᵀ(y − x)   linear approximation to f
  (1/(2t)) ‖y − x‖₂²     proximity term to x, with weight 1/(2t)

Choose next point y = x⁺ to minimize the quadratic approximation:

  x⁺ = x − t∇f(x)
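
To see why this is the minimizer (a step left implicit above), set the gradient of the quadratic approximation, as a function of y, to zero:

  ∇_y [ f(x) + ∇f(x)ᵀ(y − x) + (1/(2t))‖y − x‖₂² ] = ∇f(x) + (1/t)(y − x) = 0

which gives y = x − t∇f(x), the gradient descent update.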

[Figure: blue point is x, red point is x⁺ = argmin_y { f(x) + ∇f(x)ᵀ(y − x) + (1/(2t))‖y − x‖₂² }]
Outline

Today:
• How to choose step sizes
• Convergence analysis
• Nonconvex functions
• Gradient boosting

Fixed step size

Simply take t_k = t for all k = 1, 2, 3, . . .; this can diverge if t is too big.

Consider f(x) = (10x₁² + x₂²)/2, gradient descent after 8 steps:

[Figure: iterates on f with t too big, overshooting back and forth and diverging]
Can be slow if t is too small. Same example, gradient descent after 100 steps:

[Figure: iterates on the same f with t too small, after 100 steps still short of the optimum]
Converges nicely when t is “just right”. Same example, 40 steps:

[Figure: iterates on the same f with a well-chosen t, converging to the optimum in 40 steps]

Convergence analysis later will give us a precise idea of “just right”
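
A small experiment in the spirit of these figures, sketched in Python; only the objective f(x) = (10x₁² + x₂²)/2 comes from the slides, while the starting point and the particular step sizes are illustrative choices:

import numpy as np

# gradient of f(x) = (10*x1^2 + x2^2)/2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

def run(t, steps, x0=(10.0, 10.0)):
    x = np.array(x0)
    for _ in range(steps):
        x = x - t * grad_f(x)   # fixed step size t
    return x

print(run(t=0.25, steps=8))    # too big: x1 is scaled by |1 - 10t| > 1, diverges
print(run(t=0.01, steps=100))  # too small: x2 shrinks by only 0.99 per step
print(run(t=2/11, steps=40))   # roughly "just right" for this quadratic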
Backtracking line search

One way to adaptively choose the step size is to use backtracking line search:
• First fix parameters 0 < β < 1 and 0 < α ≤ 1/2
• At each iteration, start with t = t_init, and while

  f(x − t∇f(x)) > f(x) − αt‖∇f(x)‖₂²

shrink t = βt. Else perform gradient descent update

  x⁺ = x − t∇f(x)

Simple and tends to work well in practice (further simplification: just take α = 1/2)
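
A minimal sketch of the backtracking rule in Python, with f, grad_f, and the default parameter values as illustrative assumptions:

import numpy as np

def backtracking_step(f, grad_f, x, t_init=1.0, alpha=0.5, beta=0.5):
    # shrink t until f(x - t*g) <= f(x) - alpha*t*||g||_2^2, then update x
    g = grad_f(x)
    t = t_init
    while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
        t = beta * t
    return x - t * g  # gradient descent update with the accepted step size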
Backtracking interpretation

[Figure 9.1 from B&V: backtracking line search. The curve shows f, restricted to the line over which we search. The lower dashed line shows the linear extrapolation of f, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that f lies below the upper dashed line, i.e., 0 ≤ t ≤ t₀.]

For us, ∆x = −∇f(x)
Setting α = β = 0.5, backtracking picks up roughly the right step size (12 outer steps, 40 steps total):

[Figure: backtracking iterates on the same example, converging to the optimum]
Exact line search

We could also choose the step to do the best we can along the direction of the negative gradient, called exact line search:

  t = argmin_{s ≥ 0} f(x − s∇f(x))

Usually not possible to do this minimization exactly

Approximations to exact line search are typically not as efficient as backtracking, and it's typically not worth it
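
An aside not on the slides: one case where the minimization is easy is a quadratic f(x) = (1/2)xᵀQx + bᵀx with Q ≻ 0. Writing g = ∇f(x) = Qx + b,

  d/ds f(x − sg) = −gᵀ(g − sQg) = −‖g‖₂² + s · gᵀQg = 0  ⟹  t = ‖g‖₂² / (gᵀQg)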
Convergence analysis

Assume that f is convex and differentiable, with dom(f) = Rⁿ, and additionally that ∇f is Lipschitz continuous with constant L > 0,

  ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂  for any x, y

(Or when twice differentiable: ∇²f(x) ⪯ LI)

Theorem: Gradient descent with fixed step size t ≤ 1/L satisfies

  f(x^(k)) − f⋆ ≤ ‖x^(0) − x⋆‖₂² / (2tk)

and the same result holds for backtracking, with t replaced by β/L

We say gradient descent has convergence rate O(1/k). That is, it finds an ε-suboptimal point in O(1/ε) iterations
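
Spelling out the last claim: requiring the bound above to be at most ε and solving for k gives

  ‖x^(0) − x⋆‖₂² / (2tk) ≤ ε  ⟺  k ≥ ‖x^(0) − x⋆‖₂² / (2tε)

i.e., O(1/ε) iterations suffice.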
Analysis for strong convexity

Reminder: strong convexity of f means f(x) − (m/2)‖x‖₂² is convex for some m > 0 (when twice differentiable: ∇²f(x) ⪰ mI)

Assuming Lipschitz gradient as before, and also strong convexity:

Theorem: Gradient descent with fixed step size t ≤ 2/(m + L), or with backtracking line search, satisfies

  f(x^(k)) − f⋆ ≤ γ^k (L/2) ‖x^(0) − x⋆‖₂²

where 0 < γ < 1

Rate under strong convexity is O(γ^k), exponentially fast! That is, it finds an ε-suboptimal point in O(log(1/ε)) iterations
Called linear convergence: objective versus iteration curve looks linear on a semi-log plot

[Figure 9.6 from B&V, page 487: error f(x^(k)) − p⋆ versus iteration k for the gradient method with backtracking and exact line search, for a problem in R¹⁰⁰]

Important note: γ = O(1 − m/L). Thus we can write the convergence rate as

  O( (L/m) log(1/ε) )

Higher condition number L/m ⇒ slower rate. This is not only true in theory ... very apparent in practice too
A look at the conditions

A look at the conditions for a simple problem, f(β) = (1/2)‖y − Xβ‖₂²

Lipschitz continuity of ∇f:
• Recall this means ∇²f(x) ⪯ LI
• As ∇²f(β) = XᵀX, we have L = λ_max(XᵀX)

Strong convexity of f:
• Recall this means ∇²f(x) ⪰ mI
• As ∇²f(β) = XᵀX, we have m = λ_min(XᵀX)
• If X is wide (X is n × p with p > n), then λ_min(XᵀX) = 0, and f can't be strongly convex
• Even if σ_min(X) > 0, can have a very large condition number L/m = λ_max(XᵀX)/λ_min(XᵀX)
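
These quantities are easy to inspect numerically; a sketch with NumPy, where the problem sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))               # tall X (n > p)
eigs = np.linalg.eigvalsh(X.T @ X)              # eigenvalues of X^T X, ascending
m, L = eigs[0], eigs[-1]                        # m = lambda_min, L = lambda_max
print(L / m)                                    # condition number L/m

X_wide = rng.standard_normal((20, 50))          # wide X (p > n)
print(np.linalg.eigvalsh(X_wide.T @ X_wide)[0]) # ~0: f not strongly convex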
Practicalities

Stopping rule: stop when ‖∇f(x)‖₂ is small
• Recall ∇f(x⋆) = 0 at solution x⋆
• If f is strongly convex with parameter m, then

  ‖∇f(x)‖₂ ≤ √(2mε)  ⟹  f(x) − f⋆ ≤ ε

(this follows since strong convexity gives f(x) − f⋆ ≤ ‖∇f(x)‖₂²/(2m))

Pros and cons of gradient descent:
• Pro: simple idea, and each iteration is cheap (usually)
• Pro: fast for well-conditioned, strongly convex problems
• Con: can often be slow, because many interesting problems aren't strongly convex or well-conditioned
• Con: can't handle nondifferentiable functions
Can we do better?

Gradient descent has O(1/ε) convergence rate over the problem class of convex, differentiable functions with Lipschitz gradients

First-order method: iterative method, which updates x^(k) in

  x^(0) + span{∇f(x^(0)), ∇f(x^(1)), . . . , ∇f(x^(k−1))}

Theorem (Nesterov): For any k ≤ (n − 1)/2 and any starting point x^(0), there is a function f in the problem class such that any first-order method satisfies

  f(x^(k)) − f⋆ ≥ 3L‖x^(0) − x⋆‖₂² / (32(k + 1)²)

Can we attain rate O(1/k²), i.e., O(1/√ε)? Answer: yes (we'll see)!
Analysis for nonconvex case

Assume f is differentiable with Lipschitz gradient, now nonconvex. Asking for optimality is too much. Let's settle for an ε-substationary point x, which means ‖∇f(x)‖₂ ≤ ε

Theorem: Gradient descent with fixed step size t ≤ 1/L satisfies

  min_{i=0,...,k} ‖∇f(x^(i))‖₂ ≤ √( 2(f(x^(0)) − f⋆) / (t(k + 1)) )

Thus gradient descent has rate O(1/√k), or O(1/ε²), even in the nonconvex case for finding stationary points

This rate cannot be improved (over the class of differentiable functions with Lipschitz gradients) by any deterministic algorithm¹

¹ Carmon et al. (2017), “Lower bounds for finding stationary points I”
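
As with the convex case, the iteration count follows by inversion: setting the right-hand side to at most ε gives

  k + 1 ≥ 2(f(x^(0)) − f⋆) / (tε²)

i.e., O(1/ε²) iterations to find an ε-substationary point.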
Gradient boosting

Given responses y_i ∈ R and features x_i ∈ Rᵖ, i = 1, . . . , n

Want to construct a flexible (nonlinear) model for the response based on features. Weighted sum of trees:

  u_i = Σ_{j=1}^{m} β_j · T_j(x_i),  i = 1, . . . , n

Each tree T_j inputs x_i, outputs a predicted response. Typically trees are pretty short
Pick a loss function L to reflect the setting. For continuous responses, e.g., could take L(y_i, u_i) = (y_i − u_i)²

Want to solve

  min_β Σ_{i=1}^{n} L( y_i, Σ_{j=1}^{M} β_j · T_j(x_i) )

Here j indexes all trees of a fixed size (e.g., depth = 5), so M is huge. Space is simply too big to optimize

Gradient boosting: basically a version of gradient descent that is forced to work with trees

First think of optimization as min_u f(u), over predicted values u, subject to u coming from trees
Start with an initial model, a single tree u^(0) = T_0. Repeat:
• Compute negative gradient d at latest prediction u^(k−1),

    d_i = − [ ∂L(y_i, u_i) / ∂u_i ] evaluated at u_i = u_i^(k−1),  i = 1, . . . , n

• Find a tree T_k that is close to d, i.e., according to

    min_{trees T} Σ_{i=1}^{n} (d_i − T(x_i))²

  Not hard to (approximately) solve for a single tree
• Compute step size α_k, and update our prediction:

    u^(k) = u^(k−1) + α_k · T_k

Note: predictions are weighted sums of trees, as desired
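
A compact sketch of this loop for squared-error loss, where the negative gradient is just the residual y − u (taking L(y_i, u_i) = (y_i − u_i)²/2 so the constant is absorbed); the tree-fitting step leans on scikit-learn's DecisionTreeRegressor, and the depth, step size, number of rounds, and zero initial prediction are illustrative simplifications:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=100, alpha=0.1, depth=3):
    u = np.zeros(len(y))                 # initial prediction (simplification)
    trees = []
    for _ in range(n_rounds):
        d = y - u                        # negative gradient of sum (y_i - u_i)^2 / 2
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, d)  # tree close to d
        u = u + alpha * tree.predict(X)  # u^(k) = u^(k-1) + alpha_k * T_k
        trees.append(tree)
    return trees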
References and further reading

• S. Boyd and L. Vandenberghe (2004), “Convex optimization”, Chapter 9
• T. Hastie, R. Tibshirani and J. Friedman (2009), “The elements of statistical learning”, Chapters 10 and 16
• Y. Nesterov (1998), “Introductory lectures on convex optimization: a basic course”, Chapter 2
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012
