Gradient Descent

Ryan Tibshirani
Convex Optimization 10-725
Last time: canonical convex programs

• Linear program (LP): takes the form

  min_x  cᵀx
  subject to  Dx ≤ d,  Ax = b

• Quadratic program (QP): like LP, but with quadratic criterion


• Semidefinite program (SDP): like LP, but with matrices
• Conic program: the most general form of all

Gradient descent

Consider unconstrained, smooth convex optimization

  min_x  f(x)

That is, f is convex and differentiable with dom(f) = Rⁿ. Denote the optimal criterion value by f⋆ = min_x f(x), and a solution by x⋆

Gradient descent: choose initial point x^(0) ∈ Rⁿ, repeat:

  x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),  k = 1, 2, 3, . . .

Stop at some point
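
A minimal sketch of this iteration in Python; the name grad_f, the fixed step size t, and the gradient-norm stopping rule are illustrative assumptions, not part of the slides:

import numpy as np

def gradient_descent(grad_f, x0, t=0.1, max_iter=1000, tol=1e-8):
    # x^(k) = x^(k-1) - t * grad_f(x^(k-1)), k = 1, 2, 3, ...
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:  # "stop at some point": here, small gradient
            break
        x = x - t * g                 # step in the negative gradient direction
    return x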

Gradient descent interpretation

At each iteration, consider the expansion

  f(y) ≈ f(x) + ∇f(x)ᵀ(y − x) + (1/(2t)) ‖y − x‖₂²

Quadratic approximation, replacing usual Hessian ∇²f(x) by (1/t)I

  f(x) + ∇f(x)ᵀ(y − x)   linear approximation to f
  (1/(2t)) ‖y − x‖₂²     proximity term to x, with weight 1/(2t)

Choose next point y = x⁺ to minimize the quadratic approximation:

  x⁺ = x − t∇f(x)
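
To see why this is the minimizer (a step left implicit above), set the gradient of the quadratic approximation, as a function of y, to zero:

  ∇_y [ f(x) + ∇f(x)ᵀ(y − x) + (1/(2t))‖y − x‖₂² ] = ∇f(x) + (1/t)(y − x) = 0

which gives y = x − t∇f(x), the gradient descent update.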

[Figure: blue point is x, red point is x⁺ = argmin_y { f(x) + ∇f(x)ᵀ(y − x) + (1/(2t))‖y − x‖₂² }]
Outline

Today:
• How to choose step sizes
• Convergence analysis
• Nonconvex functions
• Gradient boosting

Fixed step size

Simply take t_k = t for all k = 1, 2, 3, . . .; this can diverge if t is too big.

Consider f(x) = (10x₁² + x₂²)/2, gradient descent after 8 steps:

[Figure: iterates on f with t too big, overshooting back and forth and diverging]
Can be slow if t is too small. Same example, gradient descent after 100 steps:

[Figure: iterates on the same f with t too small, after 100 steps still short of the optimum]
Converges nicely when t is “just right”. Same example, 40 steps:

[Figure: iterates on the same f with a well-chosen t, converging to the optimum in 40 steps]

Convergence analysis later will give us a precise idea of “just right”
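
A small experiment in the spirit of these figures, sketched in Python; only the objective f(x) = (10x₁² + x₂²)/2 comes from the slides, while the starting point and the particular step sizes are illustrative choices:

import numpy as np

# gradient of f(x) = (10*x1^2 + x2^2)/2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

def run(t, steps, x0=(10.0, 10.0)):
    x = np.array(x0)
    for _ in range(steps):
        x = x - t * grad_f(x)   # fixed step size t
    return x

print(run(t=0.25, steps=8))    # too big: x1 is scaled by |1 - 10t| > 1, diverges
print(run(t=0.01, steps=100))  # too small: x2 shrinks by only 0.99 per step
print(run(t=2/11, steps=40))   # roughly "just right" for this quadratic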
Backtracking line search

One way to adaptively choose the step size is to use backtracking line search:
• First fix parameters 0 < β < 1 and 0 < α ≤ 1/2
• At each iteration, start with t = t_init, and while

  f(x − t∇f(x)) > f(x) − αt‖∇f(x)‖₂²

shrink t = βt. Else perform gradient descent update

  x⁺ = x − t∇f(x)

Simple and tends to work well in practice (further simplification: just take α = 1/2)
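
A minimal sketch of the backtracking rule in Python, with f, grad_f, and the default parameter values as illustrative assumptions:

import numpy as np

def backtracking_step(f, grad_f, x, t_init=1.0, alpha=0.5, beta=0.5):
    # shrink t until f(x - t*g) <= f(x) - alpha*t*||g||_2^2, then update x
    g = grad_f(x)
    t = t_init
    while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
        t = beta * t
    return x - t * g  # gradient descent update with the accepted step size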
Backtracking interpretation

[Figure 9.1 from B&V: backtracking line search. The curve shows f, restricted to the line over which we search. The lower dashed line shows the linear extrapolation of f, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that f lies below the upper dashed line, i.e., 0 ≤ t ≤ t₀.]

For us, ∆x = −∇f(x)
Setting α = β = 0.5, backtracking picks up roughly the right step size (12 outer steps, 40 steps total):

[Figure: backtracking iterates on the same example, converging to the optimum]
Exact line search

We could also choose the step to do the best we can along the direction of the negative gradient, called exact line search:

  t = argmin_{s ≥ 0} f(x − s∇f(x))

Usually not possible to do this minimization exactly

Approximations to exact line search are typically not as efficient as backtracking, and it's typically not worth it
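
An aside not on the slides: one case where the minimization is easy is a quadratic f(x) = (1/2)xᵀQx + bᵀx with Q ≻ 0. Writing g = ∇f(x) = Qx + b,

  d/ds f(x − sg) = −gᵀ(g − sQg) = −‖g‖₂² + s · gᵀQg = 0  ⟹  t = ‖g‖₂² / (gᵀQg)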
Convergence analysis

Assume that f is convex and differentiable, with dom(f) = Rⁿ, and additionally that ∇f is Lipschitz continuous with constant L > 0,

  ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂  for any x, y

(Or when twice differentiable: ∇²f(x) ⪯ LI)

Theorem: Gradient descent with fixed step size t ≤ 1/L satisfies

  f(x^(k)) − f⋆ ≤ ‖x^(0) − x⋆‖₂² / (2tk)

and the same result holds for backtracking, with t replaced by β/L

We say gradient descent has convergence rate O(1/k). That is, it finds an ε-suboptimal point in O(1/ε) iterations
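
Spelling out the last claim: requiring the bound above to be at most ε and solving for k gives

  ‖x^(0) − x⋆‖₂² / (2tk) ≤ ε  ⟺  k ≥ ‖x^(0) − x⋆‖₂² / (2tε)

i.e., O(1/ε) iterations suffice.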
Analysis for strong convexity

Reminder: strong convexity of f means f(x) − (m/2)‖x‖₂² is convex for some m > 0 (when twice differentiable: ∇²f(x) ⪰ mI)

Assuming Lipschitz gradient as before, and also strong convexity:

Theorem: Gradient descent with fixed step size t ≤ 2/(m + L), or with backtracking line search, satisfies

  f(x^(k)) − f⋆ ≤ γ^k (L/2) ‖x^(0) − x⋆‖₂²

where 0 < γ < 1

Rate under strong convexity is O(γ^k), exponentially fast! That is, it finds an ε-suboptimal point in O(log(1/ε)) iterations
Called linear convergence: objective versus iteration curve looks linear on a semi-log plot

[Figure 9.6 from B&V, page 487: error f(x^(k)) − p⋆ versus iteration k for the gradient method with backtracking and exact line search, for a problem in R¹⁰⁰]

Important note: γ = O(1 − m/L). Thus we can write the convergence rate as

  O( (L/m) log(1/ε) )

Higher condition number L/m ⇒ slower rate. This is not only true in theory ... very apparent in practice too
A look at the conditions

A look at the conditions for a simple problem, f(β) = (1/2)‖y − Xβ‖₂²

Lipschitz continuity of ∇f:
• Recall this means ∇²f(x) ⪯ LI
• As ∇²f(β) = XᵀX, we have L = λ_max(XᵀX)

Strong convexity of f:
• Recall this means ∇²f(x) ⪰ mI
• As ∇²f(β) = XᵀX, we have m = λ_min(XᵀX)
• If X is wide (X is n × p with p > n), then λ_min(XᵀX) = 0, and f can't be strongly convex
• Even if σ_min(X) > 0, can have a very large condition number L/m = λ_max(XᵀX)/λ_min(XᵀX)
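
These quantities are easy to inspect numerically; a sketch with NumPy, where the problem sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))               # tall X (n > p)
eigs = np.linalg.eigvalsh(X.T @ X)              # eigenvalues of X^T X, ascending
m, L = eigs[0], eigs[-1]                        # m = lambda_min, L = lambda_max
print(L / m)                                    # condition number L/m

X_wide = rng.standard_normal((20, 50))          # wide X (p > n)
print(np.linalg.eigvalsh(X_wide.T @ X_wide)[0]) # ~0: f not strongly convex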
Practicalities

Stopping rule: stop when ‖∇f(x)‖₂ is small
• Recall ∇f(x⋆) = 0 at solution x⋆
• If f is strongly convex with parameter m, then

  ‖∇f(x)‖₂ ≤ √(2mε)  ⟹  f(x) − f⋆ ≤ ε

(this follows since strong convexity gives f(x) − f⋆ ≤ ‖∇f(x)‖₂²/(2m))

Pros and cons of gradient descent:
• Pro: simple idea, and each iteration is cheap (usually)
• Pro: fast for well-conditioned, strongly convex problems
• Con: can often be slow, because many interesting problems aren't strongly convex or well-conditioned
• Con: can't handle nondifferentiable functions
Can we do better?

Gradient descent has O(1/ε) convergence rate over the problem class of convex, differentiable functions with Lipschitz gradients

First-order method: iterative method, which updates x^(k) in

  x^(0) + span{∇f(x^(0)), ∇f(x^(1)), . . . , ∇f(x^(k−1))}

Theorem (Nesterov): For any k ≤ (n − 1)/2 and any starting point x^(0), there is a function f in the problem class such that any first-order method satisfies

  f(x^(k)) − f⋆ ≥ 3L‖x^(0) − x⋆‖₂² / (32(k + 1)²)

Can we attain rate O(1/k²), i.e., O(1/√ε)? Answer: yes (we'll see)!
Analysis for nonconvex case

Assume f is differentiable with Lipschitz gradient, now nonconvex. Asking for optimality is too much. Let's settle for an ε-substationary point x, which means ‖∇f(x)‖₂ ≤ ε

Theorem: Gradient descent with fixed step size t ≤ 1/L satisfies

  min_{i=0,...,k} ‖∇f(x^(i))‖₂ ≤ √( 2(f(x^(0)) − f⋆) / (t(k + 1)) )

Thus gradient descent has rate O(1/√k), or O(1/ε²), even in the nonconvex case for finding stationary points

This rate cannot be improved (over the class of differentiable functions with Lipschitz gradients) by any deterministic algorithm¹

¹ Carmon et al. (2017), “Lower bounds for finding stationary points I”
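
As with the convex case, the iteration count follows by inversion: setting the right-hand side to at most ε gives

  k + 1 ≥ 2(f(x^(0)) − f⋆) / (tε²)

i.e., O(1/ε²) iterations to find an ε-substationary point.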
Gradient boosting

Given responses y_i ∈ R and features x_i ∈ Rᵖ, i = 1, . . . , n

Want to construct a flexible (nonlinear) model for the response based on features. Weighted sum of trees:

  u_i = Σ_{j=1}^{m} β_j · T_j(x_i),  i = 1, . . . , n

Each tree T_j inputs x_i, outputs a predicted response. Typically trees are pretty short
Pick a loss function L to reflect the setting. For continuous responses, e.g., could take L(y_i, u_i) = (y_i − u_i)²

Want to solve

  min_β Σ_{i=1}^{n} L( y_i, Σ_{j=1}^{M} β_j · T_j(x_i) )

Here j indexes all trees of a fixed size (e.g., depth = 5), so M is huge. Space is simply too big to optimize

Gradient boosting: basically a version of gradient descent that is forced to work with trees

First think of optimization as min_u f(u), over predicted values u, subject to u coming from trees
Start with an initial model, a single tree u^(0) = T_0. Repeat:
• Compute negative gradient d at latest prediction u^(k−1),

    d_i = − [ ∂L(y_i, u_i) / ∂u_i ] evaluated at u_i = u_i^(k−1),  i = 1, . . . , n

• Find a tree T_k that is close to d, i.e., according to

    min_{trees T} Σ_{i=1}^{n} (d_i − T(x_i))²

  Not hard to (approximately) solve for a single tree
• Compute step size α_k, and update our prediction:

    u^(k) = u^(k−1) + α_k · T_k

Note: predictions are weighted sums of trees, as desired
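
A compact sketch of this loop for squared-error loss, where the negative gradient is just the residual y − u (taking L(y_i, u_i) = (y_i − u_i)²/2 so the constant is absorbed); the tree-fitting step leans on scikit-learn's DecisionTreeRegressor, and the depth, step size, number of rounds, and zero initial prediction are illustrative simplifications:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=100, alpha=0.1, depth=3):
    u = np.zeros(len(y))                 # initial prediction (simplification)
    trees = []
    for _ in range(n_rounds):
        d = y - u                        # negative gradient of sum (y_i - u_i)^2 / 2
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, d)  # tree close to d
        u = u + alpha * tree.predict(X)  # u^(k) = u^(k-1) + alpha_k * T_k
        trees.append(tree)
    return trees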
References and further reading

• S. Boyd and L. Vandenberghe (2004), “Convex optimization”, Chapter 9
• T. Hastie, R. Tibshirani and J. Friedman (2009), “The elements of statistical learning”, Chapters 10 and 16
• Y. Nesterov (1998), “Introductory lectures on convex optimization: a basic course”, Chapter 2
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012
