
CS289ML: Notes on convergence of gradient descent

Raghu Meka

1 Gradient descent
In class we discussed the following notions:

• f : ℝᵈ → ℝ is convex if for all x, y ∈ ℝᵈ and 0 ≤ λ ≤ 1,

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

• A smooth function f : ℝᵈ → ℝ is L-Lipschitz if for all x, y ∈ ℝᵈ,

    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂.

Examples: x², eˣ are all convex. So is the loss function from least squares regression (LSR), f(x) = (1/n) ∑ⁿⱼ₌₁ (⟨aⱼ, x⟩ − bⱼ)². For what L is this loss Lipschitz? Let A be the n × d matrix whose rows are the vectors aⱼ and let b be the n-dimensional vector consisting of the bⱼ's. Then we can also write f(x) = (1/n)‖Ax − b‖², so that ∇f(x) = (2/n)Aᵀ(Ax − b). Therefore, for x, y ∈ ℝᵈ,

    ‖∇f(x) − ∇f(y)‖ = (2/n)‖Aᵀ(Ax − b) − Aᵀ(Ay − b)‖ = (2/n)‖AᵀA(x − y)‖ ≤ (2/n)‖AᵀA‖₂ · ‖x − y‖,

where for a matrix B, the spectral norm ‖B‖₂ is defined as ‖B‖₂ = maxᵤ ‖Bu‖/‖u‖. Thus, the loss function in LSR is (2/n)‖AᵀA‖₂-Lipschitz.
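To make this concrete, here is a minimal numpy sketch (our own illustration; the names lsr_loss, lsr_grad, and lsr_lipschitz are hypothetical, not from the notes) computing the LSR loss, its gradient, and the Lipschitz constant (2/n)‖AᵀA‖₂:

```python
import numpy as np

def lsr_loss(A, b, x):
    """LSR loss f(x) = (1/n) * ||Ax - b||^2."""
    r = A @ x - b
    return (r @ r) / A.shape[0]

def lsr_grad(A, b, x):
    """Gradient of the LSR loss: (2/n) * A^T (Ax - b)."""
    return (2.0 / A.shape[0]) * (A.T @ (A @ x - b))

def lsr_lipschitz(A):
    """Lipschitz constant of the gradient: (2/n) * ||A^T A||_2 (spectral norm)."""
    return (2.0 / A.shape[0]) * np.linalg.norm(A.T @ A, 2)
```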
In general, while convexity is a strong constraint in practice, Lipschitz-ness is more common. Nevertheless, Lipschitz convex functions are a rich class of functions that cover many common instances in optimization. We next analyze gradient descent for Lipschitz convex functions. Throughout this note, gradient descent (GD) will refer to the following algorithm:

• Choose x₀ ∈ ℝᵈ and step-size t > 0.

• For i = 0, 1, . . ., define

    xᵢ₊₁ = xᵢ − t∇f(xᵢ).
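In code, the GD template is just a few lines; the following sketch (ours, with the gradient supplied as a callable) implements exactly the update above:

```python
import numpy as np

def gradient_descent(grad_f, x0, t, num_iters):
    """GD: repeatedly apply x_{i+1} = x_i - t * grad_f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)
    return x
```

For the LSR example one would pass grad_f = lambda x: lsr_grad(A, b, x) with, per Theorem 1.4 below, step-size t = 1/L for L = lsr_lipschitz(A).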

1.1 Analysis of Gradient Descent


We first need some elementary properties of Lipschitz convex functions; proofs of the claims can be found at the end. The following claim captures the fact that the tangent plane of a convex function lies below the curve:

Claim 1.1. For f : ℝᵈ → ℝ convex, for all x, y,

    f(x) + ⟨∇f(x), y − x⟩ ≤ f(y).
The next claim complements the above property for Lipschitz convex functions.

Claim 1.2. If f : ℝᵈ → ℝ is L-Lipschitz convex, then for all x, y,

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂².

We also need the following property of convex functions.

Claim 1.3. For any convex function f : ℝᵈ → ℝ, and y₁, . . . , yₖ ∈ ℝᵈ,

    f((y₁ + · · · + yₖ)/k) ≤ (f(y₁) + · · · + f(yₖ))/k.
We next prove that gradient descent converges to the global optimum for Lipschitz convex functions.

Theorem 1.4. Let f : ℝᵈ → ℝ be an L-Lipschitz convex function and x* = arg minₓ f(x). Then, GD with step-size t ≤ 1/L satisfies the following:

    f(xₖ) ≤ f(x*) + ‖x₀ − x*‖₂²/(2tk).

In particular, L‖x₀ − x*‖₂²/ε iterations suffice to find an ε-approximate optimal value x (for t = 1/L).
Proof. First, by convexity of f, we have:

    f(xᵢ) ≤ f(x*) + ⟨∇f(xᵢ), xᵢ − x*⟩.    (1.1)

Further, as f is L-Lipschitz, by Claim 1.2,

    f(xᵢ₊₁) ≤ f(xᵢ) + ⟨∇f(xᵢ), xᵢ₊₁ − xᵢ⟩ + (L/2)‖xᵢ₊₁ − xᵢ‖₂²    (1.2)
            = f(xᵢ) − t‖∇f(xᵢ)‖₂² + (Lt²/2)‖∇f(xᵢ)‖₂²
            = f(xᵢ) − t(1 − Lt/2)‖∇f(xᵢ)‖₂²
            ≤ f(xᵢ) − (t/2)‖∇f(xᵢ)‖₂²,

where the last inequality follows as Lt ≤ 1. In particular, the above shows that GD is monotonic: the objective value is non-increasing. Combining the above two equations and the fact that ∇f(xᵢ) = (1/t)(xᵢ − xᵢ₊₁), we get

    f(xᵢ₊₁) ≤ f(x*) + ⟨∇f(xᵢ), xᵢ − x*⟩ − (t/2)‖∇f(xᵢ)‖₂²    (1.3)
            = f(x*) + (1/t)⟨xᵢ − xᵢ₊₁, xᵢ − x*⟩ − (1/2t)‖xᵢ − xᵢ₊₁‖²
            = f(x*) + (1/2t)‖xᵢ − x*‖₂² − (1/2t)(‖xᵢ − x*‖₂² − 2⟨t∇f(xᵢ), xᵢ − x*⟩ + ‖t∇f(xᵢ)‖₂²)
              (we are basically “completing” a square for the last two terms)
            = f(x*) + (1/2t)‖xᵢ − x*‖₂² − (1/2t)‖xᵢ − x* − t∇f(xᵢ)‖₂²
            = f(x*) + (1/2t)(‖xᵢ − x*‖₂² − ‖xᵢ₊₁ − x*‖₂²).

Summing the above equations for i = 0, . . . , k − 1, we get

    ∑ᵏ⁻¹ᵢ₌₀ (f(xᵢ₊₁) − f(x*)) ≤ (1/2t)(‖x₀ − x*‖₂² − ‖xₖ − x*‖₂²) ≤ ‖x₀ − x*‖₂²/(2t).

Finally, by Equation 1.2, f(x₀), . . . , f(xₖ) is non-increasing. Therefore, f(xₖ) − f(x*) ≤ f(xᵢ) − f(x*) for all i < k. Thus,

    k · (f(xₖ) − f(x*)) ≤ ‖x₀ − x*‖₂²/(2t).

The theorem now follows. ∎
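As a quick numerical sanity check (ours, not part of the notes), the following self-contained snippet runs GD on a random LSR instance and verifies the guarantee f(xₖ) − f(x*) ≤ ‖x₀ − x*‖₂²/(2tk) of Theorem 1.4:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 500
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

f = lambda x: np.sum((A @ x - b) ** 2) / n
L = (2.0 / n) * np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of the gradient
t = 1.0 / L                                     # step-size from Theorem 1.4
x0 = np.zeros(d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact minimizer of the LSR loss

x = x0.copy()
for _ in range(k):
    x = x - t * (2.0 / n) * (A.T @ (A @ x - b))  # GD step

gap = f(x) - f(x_star)
bound = np.sum((x0 - x_star) ** 2) / (2 * t * k)
assert gap <= bound, (gap, bound)
```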

2 Stochastic gradient descent


We discussed several advantages of gradient descent. However, one disadvantage of GD is that sometimes it may be too expensive to compute the gradient of a function. Indeed, even for the special case of Least Squares Regression (LSR), the gradient depends on all the data points and thus requires O(nd) time to compute when there are n data points. In many situations, we cannot afford this. The basic idea of stochastic gradient descent (SGD) is to instead use an estimator for the gradient at each iteration. This results in a significant speed-up of the per-iteration cost and does not hurt the number of iterations needed too much. SGD (as applied to ERM) also has several important advantages coming from statistical machine learning. We unfortunately will overlook these aspects completely.
In the following we describe SGD and analyze the rate of convergence for Lipschitz convex functions. As before, we have an L-Lipschitz convex function f : ℝᵈ → ℝ that we are trying to minimize. The basic template for SGD is as follows:

1. Pick x₀ and set step-size t > 0.

2. For i = 0, 1, . . .:

   (a) Let vᵢ be a random vector such that E[vᵢ] = ∇f(xᵢ). That is, vᵢ is an unbiased estimator for ∇f(xᵢ). For simplicity, we will assume that vᵢ is independent of all previous random choices; technically, one only needs the conditional expectation to equal the gradient.

   (b) Set xᵢ₊₁ = xᵢ − tvᵢ.

Example: Let us look at the example of LSR again. We have data points a₁, . . . , aₙ ∈ ℝᵈ with respective values b₁, . . . , bₙ ∈ ℝ and our goal was to find x minimizing

    f(x) = (1/n) ∑ⁿⱼ₌₁ (⟨aⱼ, x⟩ − bⱼ)².

Then, ∇f(x) = (2/n) ∑ⁿⱼ₌₁ (⟨aⱼ, x⟩ − bⱼ)·aⱼ. This takes O(nd) time to compute. Now, define a random vector v(x) as follows: pick a uniformly random j ∈ [n] and set v(x) = 2(⟨aⱼ, x⟩ − bⱼ)·aⱼ. Then, clearly, E[v(x)] = ∇f(x), and v(x) only takes O(d) time to compute. In particular, SGD applied to LSR yields the following algorithm:

1. Pick x₀ and set step-size t > 0.

2. For i = 0, 1, . . .:

   (a) Pick a uniformly random index¹ j ∈ [n] and set vᵢ = 2(⟨aⱼ, xᵢ⟩ − bⱼ)·aⱼ.

   (b) Set xᵢ₊₁ = xᵢ − tvᵢ.

The above algorithm can also be extended straightforwardly to ERM in general.
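A minimal numpy sketch of this algorithm (our own illustration; the name sgd_lsr is hypothetical) follows. It also maintains the running average of the iterates, which is the quantity analyzed in Section 2.1 below:

```python
import numpy as np

def sgd_lsr(A, b, t, num_iters, seed=0):
    """SGD for the LSR loss, using one random data point per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_avg = np.zeros(d)                     # running average (1/k)(x_1 + ... + x_k)
    for i in range(1, num_iters + 1):
        j = rng.integers(n)                 # uniformly random index j in [n]
        v = 2.0 * (A[j] @ x - b[j]) * A[j]  # unbiased estimate of grad f(x): O(d) time
        x = x - t * v                       # x_{i+1} = x_i - t v_i
        x_avg += (x - x_avg) / i            # incremental average of x_1, ..., x_i
    return x_avg
```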

2.1 Analysis of SGD


We now bound the convergence rate for SGD. For a random vector v, define the variance of the vector by Var(v) = E[‖v‖₂²] − ‖E[v]‖₂². To simplify the analysis of our algorithm², we actually look at a variant of SGD where the final output is the average of all the intermediate iterates; that is, after k iterations, we look at x̄ₖ = (1/k)(x₁ + · · · + xₖ). Alternately, the same guarantees hold for the point with the least objective value among the iterates, arg min{f(x₁), f(x₂), . . . , f(xₖ)}. Looking at the average has other theoretical advantages, especially in settings where one cannot compute the objective value easily (cf. online learning).
Theorem 2.1. Let f : ℝᵈ → ℝ be an L-Lipschitz convex function and x* = arg minₓ f(x). Consider an instance of SGD where the estimators vᵢ have bounded variance: for all i ≥ 0, Var(vᵢ) ≤ σ². Then, for any k > 1, SGD with step-size t ≤ 1/L satisfies

    E[f(x̄ₖ)] ≤ f(x*) + ‖x₀ − x*‖₂²/(2tk) + tσ²,

where x̄ₖ = (1/k)(x₁ + · · · + xₖ). In particular, k = (σ² + L‖x₀ − x*‖₂²)²/ε² iterations suffice to find a 2ε-approximate optimal value x—in expectation—by setting t = 1/√k.
Proof. The argument is very similar to that of the analysis of GD. By Claim 1.2,

    f(xᵢ₊₁) ≤ f(xᵢ) + ⟨∇f(xᵢ), xᵢ₊₁ − xᵢ⟩ + (L/2)‖xᵢ₊₁ − xᵢ‖₂²
            = f(xᵢ) − t⟨∇f(xᵢ), vᵢ⟩ + (Lt²/2)‖vᵢ‖₂².

By taking expectations on both sides with respect to the choice of vᵢ, we get

    E[f(xᵢ₊₁)] ≤ f(xᵢ) − t‖∇f(xᵢ)‖₂² + (Lt²/2)(‖∇f(xᵢ)‖₂² + Var(vᵢ))    (2.1)
               ≤ f(xᵢ) − t(1 − Lt/2)‖∇f(xᵢ)‖₂² + (Lt²/2)σ²
               ≤ f(xᵢ) − (t/2)‖∇f(xᵢ)‖₂² + (t/2)σ²,

where the last inequality follows as Lt ≤ 1. Note that, unlike GD, SGD need not be monotonic³. Combining the above two equations we get

    E[f(xᵢ₊₁)] ≤ f(x*) + ⟨∇f(xᵢ), xᵢ − x*⟩ − (t/2)‖∇f(xᵢ)‖₂² + (t/2)σ².
¹ In practice, one would order the data randomly and process them sequentially after that.
² In practice, one would try various tricks to pick the right one.
³ However, it is almost monotonic if σ is small.

We now back-substitute vᵢ into the equation by using E[vᵢ] = ∇f(xᵢ) and ‖∇f(xᵢ)‖₂² = E[‖vᵢ‖₂²] − Var(vᵢ) ≥ E[‖vᵢ‖₂²] − σ²:

    E[f(xᵢ₊₁)] ≤ f(x*) + ⟨E[vᵢ], xᵢ − x*⟩ − (t/2)E[‖vᵢ‖₂²] + tσ²
               = f(x*) + E[⟨vᵢ, xᵢ − x*⟩ − (t/2)‖vᵢ‖₂²] + tσ².

We now repeat the calculations as in the analysis of GD (Equation 1.3) by completing the square for the middle two terms to get:

    E[f(xᵢ₊₁)] ≤ f(x*) + E[(1/2t)(‖xᵢ − x*‖₂² − ‖xᵢ − x* − tvᵢ‖₂²)] + tσ²
               = f(x*) + E[(1/2t)(‖xᵢ − x*‖₂² − ‖xᵢ₊₁ − x*‖₂²)] + tσ².

The above is analogous to Equation 1.3 but for an additional tσ² term (and taking expectations). Summing the above equations for i = 0, . . . , k − 1, we get

    ∑ᵏ⁻¹ᵢ₌₀ (E[f(xᵢ₊₁)] − f(x*)) ≤ (1/2t)(‖x₀ − x*‖₂² − E[‖xₖ − x*‖₂²]) + ktσ² ≤ ‖x₀ − x*‖₂²/(2t) + ktσ².

Finally, by Claim 1.3,

    k · f(x̄ₖ) = k · f((x₁ + · · · + xₖ)/k) ≤ f(x₁) + · · · + f(xₖ).

Thus,

    ∑ᵏ⁻¹ᵢ₌₀ (E[f(xᵢ₊₁)] − f(x*)) = E[f(x₁) + · · · + f(xₖ)] − k·f(x*) ≥ k·E[f(x̄ₖ)] − k·f(x*).

Combining the above equations we get

    E[f(x̄ₖ)] ≤ f(x*) + ‖x₀ − x*‖₂²/(2tk) + tσ².

The main statement of the theorem now follows. The “in particular” part follows by substituting the specific values of k and t. ∎

3 Remarks about GD and SGD


• SGD is widely used in many large-scale machine learning systems. It is simple, efficient, can
be run in parallel, and is ideal for ERM.

• In practice, one uses various heuristics and tricks to implement GD or SGD for choosing the step-size (such as line-search) and the stopping criterion. Choosing the right parameters is a bit of a black art; one common step-size heuristic, backtracking line search, is sketched after this list.

• If you ignore the dependence on L and ‖x₀ − x*‖₂, GD takes O(1/ε) iterations and SGD takes O(1/ε²) iterations to get ε-error. Both these bounds are tight for the specific algorithms.

• Remarkably, there are various accelerated gradient descent algorithms which only need O(1/√ε) iterations. This can be quite significant when you can compute the gradient fast: the accelerated methods—such as the celebrated Nesterov’s AGD—only need two gradient evaluations per iteration.

• Unfortunately, the acceleration does not help in the stochastic setting. When using estimators with variance σ², the best error one can get after k iterations is O(L/k² + σ/√k).

• Even more remarkably, one can show that if only given access to gradient computations, it is not possible to do better than Ω(√(L/ε)) iterations. A good resource for such advanced material is Nesterov’s textbook Introductory Lectures on Convex Optimization.
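For the step-size heuristic mentioned above, here is a sketch of backtracking (Armijo) line search, one standard way to pick the step-size at each GD iteration; the constants and names are illustrative, not from the notes:

```python
import numpy as np

def backtracking_step(f, grad_f, x, t0=1.0, beta=0.5, alpha=0.5):
    """One GD step with backtracking: shrink t until the sufficient-decrease
    condition f(x - t g) <= f(x) - alpha * t * ||g||^2 holds."""
    g = grad_f(x)
    fx = f(x)
    t = t0
    while f(x - t * g) > fx - alpha * t * (g @ g):
        t *= beta  # step too aggressive; shrink and retry
    return x - t * g
```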

3.1 Subgradient Methods


In several applications, the cost function f : ℝᵈ → ℝ, though convex, could be non-differentiable. Such situations can sometimes be handled by subgradient descent. Recall one of the nice properties of the gradient: if f is convex and is differentiable at a point x, then for all y,

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.

The above inequality in fact serves as one of the principles behind gradient descent and motivates the definition of subgradient.

Definition 3.1. For a function f : ℝᵈ → ℝ, and a point x ∈ ℝᵈ, a vector v ∈ ℝᵈ is a subgradient of f at x if for all y ∈ ℝᵈ,

    f(y) ≥ f(x) + ⟨v, y − x⟩.

In words, the hyperplane at x with normal v lies below f. Note that from the definition, there can be multiple subgradients for a function at a point. However, if f is differentiable at x, then ∇f(x) is the only subgradient of f at x. For example, consider f : ℝ → ℝ defined by f(x) = max(x, 0)⁴. Then, for all x > 0, the subgradient of f at x is 1 and for x < 0, the subgradient is 0. How about x = 0? Any number in [0, 1] is a subgradient.
Subgradient descent is given by an update similar to gradient descent: xᵢ₊₁ = xᵢ − tv, where v is any subgradient of f at xᵢ. One can show that for some suitable notion of Lipschitz-ness, subgradient descent also converges to the global minimum of convex Lipschitz functions. We will unfortunately not cover this, but a small illustrative sketch follows the footnote below.
⁴ This is the ReLU function that is used often in neural networks.
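To illustrate the update (our own example, not from the notes), here is subgradient descent on the non-differentiable loss f(x) = (1/n)‖Ax − b‖₁, where sign(Ax − b) yields a subgradient; since the method need not be monotone, we track the best iterate seen:

```python
import numpy as np

def subgradient_descent_l1(A, b, t, num_iters):
    """Minimize f(x) = (1/n) * ||Ax - b||_1 via x_{i+1} = x_i - t * v_i."""
    n, d = A.shape
    x = np.zeros(d)
    best_x, best_f = x.copy(), np.inf
    for _ in range(num_iters):
        r = A @ x - b
        fx = np.abs(r).sum() / n
        if fx < best_f:                 # keep the iterate with least objective value
            best_f, best_x = fx, x.copy()
        v = (A.T @ np.sign(r)) / n      # a subgradient of f at x (sign(0) = 0 works)
        x = x - t * v
    return best_x
```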

4 Missing proofs
Proof of Claim 1.2. By the fundamental theorem of calculus⁵,

    f(y) = f(x) + ∫₀¹ ⟨∇f(x + τ(y − x)), y − x⟩ dτ
         = f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + τ(y − x)) − ∇f(x), y − x⟩ dτ
         ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ‖∇f(x + τ(y − x)) − ∇f(x)‖₂ ‖y − x‖₂ dτ
           (as for all vectors u, v, |⟨u, v⟩| ≤ ‖u‖₂ · ‖v‖₂)
         ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ L‖τ(y − x)‖₂ ‖y − x‖₂ dτ
           (as f is L-Lipschitz)
         = f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂². ∎
Proof of Claim 1.3. The proof is by induction. For k = 2, the claim follows by convexity. Suppose it is true for k − 1. Let y = (y₁ + · · · + yₖ₋₁)/(k − 1). Then,

    f((y₁ + · · · + yₖ)/k) = f(((k − 1)/k)·y + yₖ/k)
                          ≤ ((k − 1)/k)·f(y) + (1/k)·f(yₖ)
                            (by convexity applied with λ = (k − 1)/k)
                          ≤ ((k − 1)/k)·((f(y₁) + · · · + f(yₖ₋₁))/(k − 1)) + (1/k)·f(yₖ)
                            (induction hypothesis applied to y₁, . . . , yₖ₋₁)
                          = (f(y₁) + · · · + f(yₖ))/k. ∎
⁵ Basically, you define a function h : ℝ → ℝ by h(τ) = f(x + τ(y − x)) and apply the fundamental theorem of calculus to h: h(1) = h(0) + ∫₀¹ h′(τ) dτ.
