
CS289ML: Notes on convergence of gradient descent

Raghu Meka

1 Gradient descent
In class we discussed the following notions:

• f : ℝᵈ → ℝ is convex if for all x, y ∈ ℝᵈ and 0 ≤ λ ≤ 1,

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

• A smooth function f : ℝᵈ → ℝ is L-Lipschitz if for all x, y ∈ ℝᵈ,

    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂.

Examples: x², eˣ are all convex. So is the loss function from least squares regression (LSR), f(x) = (1/n) ∑ⁿⱼ₌₁ (⟨aⱼ, x⟩ − bⱼ)². For what L is this loss Lipschitz? Let A be the n × d matrix whose rows are the vectors aⱼ and let b be the n-dimensional vector consisting of the bⱼ's. Then we can also write f(x) = (1/n)‖Ax − b‖², so that ∇f(x) = (2/n)Aᵀ(Ax − b). Therefore, for x, y ∈ ℝᵈ,

    ‖∇f(x) − ∇f(y)‖ = (2/n)‖Aᵀ(Ax − b) − Aᵀ(Ay − b)‖ = (2/n)‖AᵀA(x − y)‖ ≤ (2/n)‖AᵀA‖₂ · ‖x − y‖,

where for a matrix B, the spectral norm ‖B‖₂ is defined as ‖B‖₂ = maxᵤ ‖Bu‖/‖u‖. Thus, the loss function in LSR is (2/n)‖AᵀA‖₂-Lipschitz.
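To make this concrete, here is a minimal numpy sketch (our own illustration; the names lsr_loss, lsr_grad, and lsr_lipschitz are hypothetical, not from the notes) computing the LSR loss, its gradient, and the Lipschitz constant (2/n)‖AᵀA‖₂:

```python
import numpy as np

def lsr_loss(A, b, x):
    """LSR loss f(x) = (1/n) * ||Ax - b||^2."""
    r = A @ x - b
    return (r @ r) / A.shape[0]

def lsr_grad(A, b, x):
    """Gradient of the LSR loss: (2/n) * A^T (Ax - b)."""
    return (2.0 / A.shape[0]) * (A.T @ (A @ x - b))

def lsr_lipschitz(A):
    """Lipschitz constant of the gradient: (2/n) * ||A^T A||_2 (spectral norm)."""
    return (2.0 / A.shape[0]) * np.linalg.norm(A.T @ A, 2)
```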
In general, while convexity is a strong constraint in practice, Lipschitz-ness is more common. Nevertheless, Lipschitz convex functions are a rich class of functions that cover many common instances in optimization. We next analyze gradient descent for Lipschitz convex functions. Throughout this note, gradient descent (GD) will refer to the following algorithm:

• Choose x₀ ∈ ℝᵈ and step-size t > 0.

• For i = 0, 1, . . ., define

    xᵢ₊₁ = xᵢ − t∇f(xᵢ).
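In code, the GD template is just a few lines; the following sketch (ours, with the gradient supplied as a callable) implements exactly the update above:

```python
import numpy as np

def gradient_descent(grad_f, x0, t, num_iters):
    """GD: repeatedly apply x_{i+1} = x_i - t * grad_f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)
    return x
```

For the LSR example one would pass grad_f = lambda x: lsr_grad(A, b, x) with, per Theorem 1.4 below, step-size t = 1/L for L = lsr_lipschitz(A).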

1.1 Analysis of Gradient Descent


We first need some elementary properties of Lipschitz convex functions; proofs of the claims can be found at the end. The following claim captures the fact that the tangent plane of a convex function lies below the curve:

Claim 1.1. For f : ℝᵈ → ℝ convex, for all x, y,

    f(x) + ⟨∇f(x), y − x⟩ ≤ f(y).
The next claim complements the above property for Lipschitz convex functions.

Claim 1.2. If f : ℝᵈ → ℝ is L-Lipschitz convex, then for all x, y,

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂².

We also need the following property of convex functions.

Claim 1.3. For any convex function f : ℝᵈ → ℝ, and y₁, . . . , yₖ ∈ ℝᵈ,

    f((y₁ + · · · + yₖ)/k) ≤ (f(y₁) + · · · + f(yₖ))/k.
We next prove that gradient descent converges to the global optimum for Lipschitz convex functions.

Theorem 1.4. Let f : ℝᵈ → ℝ be an L-Lipschitz convex function and x* = arg minₓ f(x). Then, GD with step-size t ≤ 1/L satisfies the following:

    f(xₖ) ≤ f(x*) + ‖x₀ − x*‖₂²/(2tk).

In particular, L‖x₀ − x*‖₂²/ε iterations suffice to find an ε-approximate optimal value x (for t = 1/L).
Proof. First, by convexity of f, we have:

    f(xᵢ) ≤ f(x*) + ⟨∇f(xᵢ), xᵢ − x*⟩.    (1.1)

Further, as f is L-Lipschitz, by Claim 1.2,

    f(xᵢ₊₁) ≤ f(xᵢ) + ⟨∇f(xᵢ), xᵢ₊₁ − xᵢ⟩ + (L/2)‖xᵢ₊₁ − xᵢ‖₂²    (1.2)
            = f(xᵢ) − t‖∇f(xᵢ)‖₂² + (Lt²/2)‖∇f(xᵢ)‖₂²
            = f(xᵢ) − t(1 − Lt/2)‖∇f(xᵢ)‖₂²
            ≤ f(xᵢ) − (t/2)‖∇f(xᵢ)‖₂²,

where the last inequality follows as Lt ≤ 1. In particular, the above shows that GD is monotonic: the objective value is non-increasing. Combining the above two equations and the fact that ∇f(xᵢ) = (1/t)(xᵢ − xᵢ₊₁), we get

    f(xᵢ₊₁) ≤ f(x*) + ⟨∇f(xᵢ), xᵢ − x*⟩ − (t/2)‖∇f(xᵢ)‖₂²    (1.3)
            = f(x*) + (1/t)⟨xᵢ − xᵢ₊₁, xᵢ − x*⟩ − (1/2t)‖xᵢ − xᵢ₊₁‖²
            = f(x*) + (1/2t)‖xᵢ − x*‖₂² − (1/2t)(‖xᵢ − x*‖₂² − 2⟨t∇f(xᵢ), xᵢ − x*⟩ + ‖t∇f(xᵢ)‖₂²)
              (we are basically “completing” a square for the last two terms)
            = f(x*) + (1/2t)‖xᵢ − x*‖₂² − (1/2t)‖xᵢ − x* − t∇f(xᵢ)‖₂²
            = f(x*) + (1/2t)(‖xᵢ − x*‖₂² − ‖xᵢ₊₁ − x*‖₂²).

Summing the above equations for i = 0, . . . , k − 1, we get

    ∑ᵏ⁻¹ᵢ₌₀ (f(xᵢ₊₁) − f(x*)) ≤ (1/2t)(‖x₀ − x*‖₂² − ‖xₖ − x*‖₂²) ≤ ‖x₀ − x*‖₂²/(2t).

Finally, by Equation 1.2, f(x₀), . . . , f(xₖ) is non-increasing. Therefore, f(xₖ) − f(x*) ≤ f(xᵢ) − f(x*) for all i < k. Thus,

    k · (f(xₖ) − f(x*)) ≤ ‖x₀ − x*‖₂²/(2t).

The theorem now follows. ∎
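As a quick numerical sanity check (ours, not part of the notes), the following self-contained snippet runs GD on a random LSR instance and verifies the guarantee f(xₖ) − f(x*) ≤ ‖x₀ − x*‖₂²/(2tk) of Theorem 1.4:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 500
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

f = lambda x: np.sum((A @ x - b) ** 2) / n
L = (2.0 / n) * np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of the gradient
t = 1.0 / L                                     # step-size from Theorem 1.4
x0 = np.zeros(d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact minimizer of the LSR loss

x = x0.copy()
for _ in range(k):
    x = x - t * (2.0 / n) * (A.T @ (A @ x - b))  # GD step

gap = f(x) - f(x_star)
bound = np.sum((x0 - x_star) ** 2) / (2 * t * k)
assert gap <= bound, (gap, bound)
```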

2 Stochastic gradient descent


We discussed several advantages of gradient descent. However, one disadvantage of GD is that sometimes it may be too expensive to compute the gradient of a function. Indeed, even for the special case of Least Squares Regression (LSR), the gradient depends on all the data points and thus requires O(nd) time to compute when there are n data points. In many situations, we cannot afford this. The basic idea of stochastic gradient descent (SGD) is to instead use an estimator for the gradient at each iteration. This results in a significant speed-up of the per-iteration cost and does not hurt the number of iterations needed too much. SGD (as applied to ERM) also has several important advantages coming from statistical machine learning. We unfortunately will overlook these aspects completely.
In the following we describe SGD and analyze the rate of convergence for Lipschitz convex functions. As before, we have an L-Lipschitz convex function f : ℝᵈ → ℝ that we are trying to minimize. The basic template for SGD is as follows:

1. Pick x₀ and set step-size t > 0.

2. For i = 0, 1, . . .:

   (a) Let vᵢ be a random vector such that E[vᵢ] = ∇f(xᵢ). That is, vᵢ is an unbiased estimator for ∇f(xᵢ). For simplicity, we will assume that vᵢ is independent of all previous random choices; technically, one only needs the conditional expectation to equal the gradient.

   (b) Set xᵢ₊₁ = xᵢ − tvᵢ.

Example: Let us look at the example of LSR again. We have data points a₁, . . . , aₙ ∈ ℝᵈ with respective values b₁, . . . , bₙ ∈ ℝ and our goal was to find x minimizing

    f(x) = (1/n) ∑ⁿⱼ₌₁ (⟨aⱼ, x⟩ − bⱼ)².

Then, ∇f(x) = (2/n) ∑ⁿⱼ₌₁ (⟨aⱼ, x⟩ − bⱼ)·aⱼ. This takes O(nd) time to compute. Now, define a random vector v(x) as follows: pick a uniformly random j ∈ [n] and set v(x) = 2(⟨aⱼ, x⟩ − bⱼ)·aⱼ. Then, clearly, E[v(x)] = ∇f(x), and v(x) only takes O(d) time to compute. In particular, SGD applied to LSR yields the following algorithm:

1. Pick x₀ and set step-size t > 0.

2. For i = 0, 1, . . .:

   (a) Pick a uniformly random index¹ j ∈ [n] and set vᵢ = 2(⟨aⱼ, xᵢ⟩ − bⱼ)·aⱼ.

   (b) Set xᵢ₊₁ = xᵢ − tvᵢ.

The above algorithm can also be extended straightforwardly to ERM in general.
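A minimal numpy sketch of this algorithm (our own illustration; the name sgd_lsr is hypothetical) follows. It also maintains the running average of the iterates, which is the quantity analyzed in Section 2.1 below:

```python
import numpy as np

def sgd_lsr(A, b, t, num_iters, seed=0):
    """SGD for the LSR loss, using one random data point per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_avg = np.zeros(d)                     # running average (1/k)(x_1 + ... + x_k)
    for i in range(1, num_iters + 1):
        j = rng.integers(n)                 # uniformly random index j in [n]
        v = 2.0 * (A[j] @ x - b[j]) * A[j]  # unbiased estimate of grad f(x): O(d) time
        x = x - t * v                       # x_{i+1} = x_i - t v_i
        x_avg += (x - x_avg) / i            # incremental average of x_1, ..., x_i
    return x_avg
```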

2.1 Analysis of SGD


We now bound the convergence rate for SGD. For a random vector v, define the variance of the vector by Var(v) = E[‖v‖₂²] − ‖E[v]‖₂². To simplify the analysis of our algorithm², we actually look at a variant of SGD where the final output is the average of all the intermediate iterates; that is, after k iterations, we look at x̄ₖ = (1/k)(x₁ + · · · + xₖ). Alternately, the same guarantees hold for the point with the least objective value among the iterates, arg min{f(x₁), f(x₂), . . . , f(xₖ)}. Looking at the average has other theoretical advantages, especially in settings where one cannot compute the objective value easily (cf. online learning).
Theorem 2.1. Let f : ℝᵈ → ℝ be an L-Lipschitz convex function and x* = arg minₓ f(x). Consider an instance of SGD where the estimators vᵢ have bounded variance: for all i ≥ 0, Var(vᵢ) ≤ σ². Then, for any k > 1, SGD with step-size t ≤ 1/L satisfies

    E[f(x̄ₖ)] ≤ f(x*) + ‖x₀ − x*‖₂²/(2tk) + tσ²,

where x̄ₖ = (1/k)(x₁ + · · · + xₖ). In particular, k = (σ² + L‖x₀ − x*‖₂²)²/ε² iterations suffice to find a 2ε-approximate optimal value x—in expectation—by setting t = 1/√k.
Proof. The argument is very similar to that of the analysis of GD. By Claim 1.2,

    f(xᵢ₊₁) ≤ f(xᵢ) + ⟨∇f(xᵢ), xᵢ₊₁ − xᵢ⟩ + (L/2)‖xᵢ₊₁ − xᵢ‖₂²
            = f(xᵢ) − t⟨∇f(xᵢ), vᵢ⟩ + (Lt²/2)‖vᵢ‖₂².

By taking expectations on both sides with respect to the choice of vᵢ, we get

    E[f(xᵢ₊₁)] ≤ f(xᵢ) − t‖∇f(xᵢ)‖₂² + (Lt²/2)(‖∇f(xᵢ)‖₂² + Var(vᵢ))    (2.1)
               ≤ f(xᵢ) − t(1 − Lt/2)‖∇f(xᵢ)‖₂² + (Lt²/2)σ²
               ≤ f(xᵢ) − (t/2)‖∇f(xᵢ)‖₂² + (t/2)σ²,

where the last inequality follows as Lt ≤ 1. Note that, unlike GD, SGD need not be monotonic³. Combining the above two equations we get

    E[f(xᵢ₊₁)] ≤ f(x*) + ⟨∇f(xᵢ), xᵢ − x*⟩ − (t/2)‖∇f(xᵢ)‖₂² + (t/2)σ².
¹ In practice, one would order the data randomly and process them sequentially after that.
² In practice, one would try various tricks to pick the right one.
³ However, it is almost monotonic if σ is small.

We now back-substitute vᵢ into the equation by using E[vᵢ] = ∇f(xᵢ) and ‖∇f(xᵢ)‖₂² = E[‖vᵢ‖₂²] − Var(vᵢ) ≥ E[‖vᵢ‖₂²] − σ²:

    E[f(xᵢ₊₁)] ≤ f(x*) + ⟨E[vᵢ], xᵢ − x*⟩ − (t/2)E[‖vᵢ‖₂²] + tσ²
               = f(x*) + E[⟨vᵢ, xᵢ − x*⟩ − (t/2)‖vᵢ‖₂²] + tσ².

We now repeat the calculations as in the analysis of GD (Equation 1.3) by completing the square for the middle two terms to get:

    E[f(xᵢ₊₁)] ≤ f(x*) + E[(1/2t)(‖xᵢ − x*‖₂² − ‖xᵢ − x* − tvᵢ‖₂²)] + tσ²
               = f(x*) + E[(1/2t)(‖xᵢ − x*‖₂² − ‖xᵢ₊₁ − x*‖₂²)] + tσ².

The above is analogous to Equation 1.3 but for an additional tσ² term (and taking expectations). Summing the above equations for i = 0, . . . , k − 1, we get

    ∑ᵏ⁻¹ᵢ₌₀ (E[f(xᵢ₊₁)] − f(x*)) ≤ (1/2t)(‖x₀ − x*‖₂² − E[‖xₖ − x*‖₂²]) + ktσ² ≤ ‖x₀ − x*‖₂²/(2t) + ktσ².

Finally, by Claim 1.3,

    k · f(x̄ₖ) = k · f((x₁ + · · · + xₖ)/k) ≤ f(x₁) + · · · + f(xₖ).

Thus,

    ∑ᵏ⁻¹ᵢ₌₀ (E[f(xᵢ₊₁)] − f(x*)) = E[f(x₁) + · · · + f(xₖ)] − k·f(x*) ≥ k·E[f(x̄ₖ)] − k·f(x*).

Combining the above equations we get

    E[f(x̄ₖ)] ≤ f(x*) + ‖x₀ − x*‖₂²/(2tk) + tσ².

The main statement of the theorem now follows. The “in particular” part follows by substituting the specific values of k and t. ∎

3 Remarks about GD and SGD


• SGD is widely used in many large-scale machine learning systems. It is simple, efficient, can
be run in parallel, and is ideal for ERM.

• In practice, one uses various heuristics and tricks to implement GD or SGD for choosing the step-size (such as line-search) and the stopping criterion. Choosing the right parameters is a bit of a black art; one common step-size heuristic, backtracking line search, is sketched after this list.

• If you ignore the dependence on L and ‖x₀ − x*‖₂, GD takes O(1/ε) iterations and SGD takes O(1/ε²) iterations to get ε-error. Both these bounds are tight for the specific algorithms.

• Remarkably, there are various accelerated gradient descent algorithms which only need O(1/√ε) iterations. This can be quite significant when you can compute the gradient fast: the accelerated methods—such as the celebrated Nesterov’s AGD—only need two gradient evaluations per iteration.

• Unfortunately, the acceleration does not help in the stochastic setting. When using estimators with variance σ², the best error one can get after k iterations is O(L/k² + σ/√k).

• Even more remarkably, one can show that if only given access to gradient computations, it is not possible to do better than Ω(√(L/ε)) iterations. A good resource for such advanced material is Nesterov’s textbook Introductory Lectures on Convex Optimization.
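For the step-size heuristic mentioned above, here is a sketch of backtracking (Armijo) line search, one standard way to pick the step-size at each GD iteration; the constants and names are illustrative, not from the notes:

```python
import numpy as np

def backtracking_step(f, grad_f, x, t0=1.0, beta=0.5, alpha=0.5):
    """One GD step with backtracking: shrink t until the sufficient-decrease
    condition f(x - t g) <= f(x) - alpha * t * ||g||^2 holds."""
    g = grad_f(x)
    fx = f(x)
    t = t0
    while f(x - t * g) > fx - alpha * t * (g @ g):
        t *= beta  # step too aggressive; shrink and retry
    return x - t * g
```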

3.1 Subgradient Methods


In several applications, the cost function f : ℝᵈ → ℝ, though convex, could be non-differentiable. Such situations can sometimes be handled by subgradient descent. Recall one of the nice properties of the gradient: if f is convex and is differentiable at a point x, then for all y,

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.

The above inequality in fact serves as one of the principles behind gradient descent and motivates the definition of subgradient.

Definition 3.1. For a function f : ℝᵈ → ℝ, and a point x ∈ ℝᵈ, a vector v ∈ ℝᵈ is a subgradient of f at x if for all y ∈ ℝᵈ,

    f(y) ≥ f(x) + ⟨v, y − x⟩.

In words, the hyperplane at x with normal v lies below f. Note that from the definition, there can be multiple subgradients for a function at a point. However, if f is differentiable at x, then ∇f(x) is the only subgradient of f at x. For example, consider f : ℝ → ℝ defined by f(x) = max(x, 0)⁴. Then, for all x > 0, the subgradient of f at x is 1 and for x < 0, the subgradient is 0. How about x = 0? Any number in [0, 1] is a subgradient.
Subgradient descent is given by an update similar to gradient descent: xᵢ₊₁ = xᵢ − tv, where v is any subgradient of f at xᵢ. One can show that for some suitable notion of Lipschitz-ness, subgradient descent also converges to the global minimum of convex Lipschitz functions. We will unfortunately not cover this, but a small illustrative sketch follows the footnote below.
⁴ This is the ReLU function that is used often in neural networks.
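To illustrate the update (our own example, not from the notes), here is subgradient descent on the non-differentiable loss f(x) = (1/n)‖Ax − b‖₁, where sign(Ax − b) yields a subgradient; since the method need not be monotone, we track the best iterate seen:

```python
import numpy as np

def subgradient_descent_l1(A, b, t, num_iters):
    """Minimize f(x) = (1/n) * ||Ax - b||_1 via x_{i+1} = x_i - t * v_i."""
    n, d = A.shape
    x = np.zeros(d)
    best_x, best_f = x.copy(), np.inf
    for _ in range(num_iters):
        r = A @ x - b
        fx = np.abs(r).sum() / n
        if fx < best_f:                 # keep the iterate with least objective value
            best_f, best_x = fx, x.copy()
        v = (A.T @ np.sign(r)) / n      # a subgradient of f at x (sign(0) = 0 works)
        x = x - t * v
    return best_x
```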

4 Missing proofs
Proof of Claim 1.2. By the fundamental theorem of calculus⁵,

    f(y) = f(x) + ∫₀¹ ⟨∇f(x + τ(y − x)), y − x⟩ dτ
         = f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + τ(y − x)) − ∇f(x), y − x⟩ dτ
         ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ‖∇f(x + τ(y − x)) − ∇f(x)‖₂ ‖y − x‖₂ dτ
           (as for all vectors u, v, |⟨u, v⟩| ≤ ‖u‖₂ · ‖v‖₂)
         ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ L‖τ(y − x)‖₂ ‖y − x‖₂ dτ
           (as f is L-Lipschitz)
         = f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂². ∎
Proof of Claim 1.3. The proof is by induction. For k = 2, the claim follows by convexity. Suppose it is true for k − 1. Let y = (y₁ + · · · + yₖ₋₁)/(k − 1). Then,

    f((y₁ + · · · + yₖ)/k) = f(((k − 1)/k)·y + yₖ/k)
                          ≤ ((k − 1)/k)·f(y) + (1/k)·f(yₖ)
                            (by convexity applied with λ = (k − 1)/k)
                          ≤ ((k − 1)/k)·((f(y₁) + · · · + f(yₖ₋₁))/(k − 1)) + (1/k)·f(yₖ)
                            (induction hypothesis applied to y₁, . . . , yₖ₋₁)
                          = (f(y₁) + · · · + f(yₖ))/k. ∎
⁵ Basically, you define a function h : ℝ → ℝ by h(τ) = f(x + τ(y − x)) and apply the fundamental theorem of calculus to h: h(1) = h(0) + ∫₀¹ h′(τ) dτ.
