Raghu Meka notes
Raghu Meka
1 Gradient descent
In class we discussed the following notions: a function f : R^d → R is convex if f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all x, y ∈ R^d and λ ∈ [0, 1]; and we call f L-Lipschitz if its gradient is L-Lipschitz, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y.
Examples
The functions x², eˣ are all convex. So is the loss function from least squares regression (LSR),

f(x) = (1/n) Σ_{j=1}^n (⟨a_j, x⟩ − b_j)².

For what L is this loss Lipschitz? Let A be the n × d matrix whose rows are the vectors a_j and let b be the n-dimensional vector consisting of the b_j's. Then we can also write f(x) = (1/n)‖Ax − b‖², so that ∇f(x) = (2/n)Aᵀ(Ax − b). Therefore, for x, y ∈ R^d,

‖∇f(x) − ∇f(y)‖ = (2/n)‖Aᵀ(Ax − b) − Aᵀ(Ay − b)‖ = (2/n)‖AᵀA(x − y)‖ ≤ (2/n)‖AᵀA‖₂ · ‖x − y‖,

where for a matrix B, the spectral norm ‖B‖₂ is defined as ‖B‖₂ = max_u ‖Bu‖/‖u‖. Thus, the loss function in LSR is (2/n)‖AᵀA‖₂-Lipschitz.
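To make this concrete, here is a minimal NumPy sketch of the LSR loss, its gradient, and the Lipschitz constant derived above; the function names and the random instance are ours, not from the notes.

import numpy as np

def lsr_loss(x, A, b):
    # f(x) = (1/n) * ||A x - b||^2
    n = A.shape[0]
    r = A @ x - b
    return (r @ r) / n

def lsr_grad(x, A, b):
    # grad f(x) = (2/n) * A^T (A x - b)
    n = A.shape[0]
    return (2.0 / n) * (A.T @ (A @ x - b))

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Lipschitz constant of the gradient: L = (2/n) * ||A^T A||_2 (spectral norm).
L = (2.0 / n) * np.linalg.norm(A.T @ A, 2)

# Sanity check: ||grad f(x) - grad f(y)|| <= L * ||x - y|| for random x, y.
x, y = rng.standard_normal(d), rng.standard_normal(d)
assert np.linalg.norm(lsr_grad(x, A, b) - lsr_grad(y, A, b)) <= L * np.linalg.norm(x - y) + 1e-9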
In general, while convexity is a strong constraint in practice, Lipschitz-ness is more common.
Nevertheless, Lipschitz convex functions are a rich class of functions that cover many common instances in optimization. We next analyze gradient descent for Lipschitz convex functions. Throughout this note, gradient descent (GD) will refer to the following algorithm:
• Pick a starting point x_0 and a step-size t > 0.
• For i = 0, 1, . . ., define
  x_{i+1} = x_i − t∇f(x_i).
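As an illustration, here is a minimal sketch of this loop in Python/NumPy; the function and parameter names, and the toy objective, are ours.

import numpy as np

def gradient_descent(grad_f, x0, t, num_iters):
    # Run x_{i+1} = x_i - t * grad_f(x_i) for num_iters steps.
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)
    return x

# Example: minimize f(x) = ||x - c||^2, whose gradient is 2(x - c).
c = np.array([1.0, -2.0, 3.0])
x_min = gradient_descent(lambda x: 2 * (x - c), np.zeros(3), t=0.1, num_iters=500)
# x_min is approximately c.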
Recall that convexity gives the first-order lower bound f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for all x, y. The next claim complements this lower bound for Lipschitz convex functions.
Claim 1.2. If f : R^d → R is L-Lipschitz convex, then for all x, y,

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂².
We also need the following property of convex functions.
Claim 1.3. For any convex function f : R^d → R, and y_1, . . . , y_k ∈ R^d,

f((y_1 + · · · + y_k)/k) ≤ (f(y_1) + · · · + f(y_k))/k.
We next prove that gradient descent converges to the global optimum for Lipschitz convex
functions.
Theorem 1.4. Let f : R^d → R be an L-Lipschitz convex function and x^* = arg min_x f(x). Then, GD with step-size t ≤ 1/L satisfies the following:

f(x_k) ≤ f(x^*) + ‖x_0 − x^*‖₂²/(2tk).

In particular, L‖x_0 − x^*‖₂²/ε iterations suffice to find a point x with ε-approximately optimal value (for t = 1/L).
Proof. First, by convexity of f, we have:

f(x_i) ≤ f(x^*) + ⟨∇f(x_i), x_i − x^*⟩.   (1.1)
Further, as f is L-Lipschitz, by Claim 1.2,

f(x_{i+1}) ≤ f(x_i) + ⟨∇f(x_i), x_{i+1} − x_i⟩ + (L/2)‖x_{i+1} − x_i‖₂²   (1.2)
          = f(x_i) − t‖∇f(x_i)‖₂² + (Lt²/2)‖∇f(x_i)‖₂²
          = f(x_i) − t(1 − Lt/2)‖∇f(x_i)‖₂²
          ≤ f(x_i) − (t/2)‖∇f(x_i)‖₂²,

where the last inequality follows as Lt ≤ 1. In particular, the above shows that GD is monotonic: the objective value is non-increasing. Combining the above two equations and the fact that ∇f(x_i) = (1/t)(x_i − x_{i+1}), we get
f(x_{i+1}) ≤ f(x^*) + ⟨∇f(x_i), x_i − x^*⟩ − (t/2)‖∇f(x_i)‖₂²   (1.3)
          = f(x^*) + (1/t)⟨x_i − x_{i+1}, x_i − x^*⟩ − (1/2t)‖x_i − x_{i+1}‖₂²
          = f(x^*) + (1/2t)‖x_i − x^*‖₂² − (1/2t)(‖x_i − x^*‖₂² − 2⟨t∇f(x_i), x_i − x^*⟩ + ‖t∇f(x_i)‖₂²)
            (we are basically “completing” a square for the last two terms)
          = f(x^*) + (1/2t)‖x_i − x^*‖₂² − (1/2t)‖x_i − x^* − t∇f(x_i)‖₂²
          = f(x^*) + (1/2t)(‖x_i − x^*‖₂² − ‖x_{i+1} − x^*‖₂²).
Summing the above equations for i = 0, . . . , k − 1, the right-hand side telescopes and we get

Σ_{i=0}^{k−1} (f(x_{i+1}) − f(x^*)) ≤ (1/2t)(‖x_0 − x^*‖₂² − ‖x_k − x^*‖₂²) ≤ ‖x_0 − x^*‖₂²/(2t).
Finally, by Equation 1.2, f(x_0), . . . , f(x_k) is non-increasing. Therefore, f(x_k) − f(x^*) ≤ f(x_i) − f(x^*) for all i < k. Thus,

k · (f(x_k) − f(x^*)) ≤ ‖x_0 − x^*‖₂²/(2t).

The theorem now follows.
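One can check the 1/(2tk) rate numerically. Below is a small sketch on a random LSR instance with t = 1/L; the test harness is ours, not from the notes.

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

f = lambda x: np.sum((A @ x - b) ** 2) / n
grad = lambda x: (2.0 / n) * (A.T @ (A @ x - b))

L = (2.0 / n) * np.linalg.norm(A.T @ A, 2)   # Lipschitz constant from the LSR example
t = 1.0 / L
x_star = np.linalg.lstsq(A, b, rcond=None)[0]  # exact LSR minimizer

x0 = np.zeros(d)
x = x0.copy()
for k in range(1, 1001):
    x = x - t * grad(x)
    # Theorem 1.4: f(x_k) - f(x^*) <= ||x_0 - x^*||^2 / (2 t k).
    assert f(x) - f(x_star) <= np.linalg.norm(x0 - x_star) ** 2 / (2 * t * k) + 1e-9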
2 Stochastic gradient descent

In many settings, computing the exact gradient is expensive, but a cheap unbiased estimate of it is available. Stochastic gradient descent (SGD) refers to the following algorithm:

1. Pick x_0 and set step-size t > 0.
2. For i = 0, 1, . . .:
   (a) Let v_i be a random vector such that E[v_i] = ∇f(x_i). That is, v_i is an unbiased estimator for ∇f(x_i). For simplicity, we will assume that v_i is independent of all previous random choices; technically, one only needs the conditional expectation to equal the gradient.
   (b) x_{i+1} = x_i − t v_i.
Example: Let us look at the example of LSR again. We have data points a_1, . . . , a_n ∈ R^d with respective values b_1, . . . , b_n ∈ R and our goal was to find x minimizing

f(x) = (1/n) Σ_{j=1}^n (⟨a_j, x⟩ − b_j)².

Then, ∇f(x) = (2/n) Σ_{j=1}^n (⟨a_j, x⟩ − b_j) · a_j. This takes O(nd) time to compute. Now, define a random vector v(x) as follows: pick a uniformly random j ∈ [n] and set v(x) = 2(⟨a_j, x⟩ − b_j) · a_j. Then, clearly, E[v(x)] = ∇f(x), and v(x) takes only O(d) time to compute. In particular, SGD applied to LSR yields the following algorithm:
1. Pick x_0 and set step-size t > 0.
2. For i = 0, 1, . . .:
   (a) Pick a uniformly random index j ∈ [n] and set v_i = 2(⟨a_j, x_i⟩ − b_j) · a_j.
   (b) x_{i+1} = x_i − t v_i.
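A minimal NumPy sketch of this algorithm follows (the names are ours); note that each iteration touches a single row of A and hence costs only O(d).

import numpy as np

def sgd_lsr(A, b, t, num_iters, seed=0):
    # SGD for f(x) = (1/n) * ||A x - b||^2, one random row per step.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(num_iters):
        j = rng.integers(n)                    # uniformly random index j in [n]
        v = 2.0 * (A[j] @ x - b[j]) * A[j]     # unbiased estimate of grad f(x)
        x = x - t * v
    return x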
We now back-substitute v_i into the analysis by using E[v_i] = ∇f(x_i) and ‖∇f(x_i)‖₂² = E[‖v_i‖₂²] − Var(v_i) ≥ E[‖v_i‖₂²] − σ²:

E[f(x_{i+1})] ≤ f(x^*) + ⟨E[v_i], x_i − x^*⟩ − (t/2)E[‖v_i‖₂²] + tσ²
             = f(x^*) + E[⟨v_i, x_i − x^*⟩ − (t/2)‖v_i‖₂²] + tσ².
We now repeat the calculations as in the analysis of GD (Equation 1.3) by completing the square for the middle two terms to get:

E[f(x_{i+1})] ≤ f(x^*) + E[(1/2t)(‖x_i − x^*‖₂² − ‖x_i − x^* − t v_i‖₂²)] + tσ²
             = f(x^*) + E[(1/2t)(‖x_i − x^*‖₂² − ‖x_{i+1} − x^*‖₂²)] + tσ².
The above is analogous to Equation 1.3, except for an additional tσ² term (and taking expectations). Summing the above equations for i = 0, . . . , k − 1, we get

Σ_{i=0}^{k−1} (E[f(x_{i+1})] − f(x^*)) ≤ (1/2t)(‖x_0 − x^*‖₂² − E[‖x_k − x^*‖₂²]) + ktσ² ≤ ‖x_0 − x^*‖₂²/(2t) + ktσ².
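One natural way to conclude, and the step for which Claim 1.3 was stated: divide by k and apply Claim 1.3 (with Jensen-style averaging) to the averaged point x̄ = (x_1 + · · · + x_k)/k, which gives

E[f(x̄)] ≤ (1/k) Σ_{i=0}^{k−1} E[f(x_{i+1})] ≤ f(x^*) + ‖x_0 − x^*‖₂²/(2tk) + tσ².

Choosing t = ‖x_0 − x^*‖₂/(σ√(2k)) to balance the two terms yields E[f(x̄)] − f(x^*) = O(σ‖x_0 − x^*‖₂/√k), which matches the O(1/ε²)-iteration count for SGD mentioned in the remarks below.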
Some remarks:

• In practice, one uses various heuristics and tricks to implement GD or SGD, for choosing the step-size (such as line-search) and the stopping criterion. Choosing the right parameters is a bit of a black art, and you can find some advice here.
• If you ignore the dependence on L and ‖x_0 − x^*‖₂, GD takes O(1/ε) iterations and SGD takes O(1/ε²) iterations to get ε-error. Both these bounds are tight for the specific algorithms.
• Remarkably, there are various accelerated gradient descent algorithms which need only O(1/√ε) iterations. This can be quite significant when you can compute the gradient fast: the accelerated methods, such as the celebrated Nesterov's AGD, need only two gradient evaluations per iteration.
• Unfortunately, the acceleration does not help in the stochastic setting. When using estimators with variance σ², the best error one can get after k iterations is O(L/k² + σ/√k).
• Even more remarkably, one can show that if only given access to gradient computations, it is not possible to do better than Ω(√(L/ε)) iterations. A good resource for such advanced material is Nesterov's textbook Introductory Lectures on Convex Optimization.
3 Subgradients

Recall that for a differentiable convex function f, f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for all x, y. This inequality in fact serves as one of the principles behind gradient descent and motivates the definition of a subgradient: a vector v ∈ R^d is a subgradient of f at x if f(y) ≥ f(x) + ⟨v, y − x⟩ for all y.
In words, the hyperplane at x with normal v lies below f. Note that from the definition, there can be multiple subgradients for a function at a point. However, if f is differentiable at x, then ∇f(x) is the only subgradient of f at x. For example, consider f : R → R defined by f(x) = max(x, 0).[4] Then, for all x > 0, the subgradient of f at x is 1, and for x < 0, the subgradient is 0. How about x = 0? Any number in [0, 1] is a subgradient.
Subgradient descent is given by an update similar to gradient descent: x_{i+1} = x_i − t v, where v is any subgradient of f at x_i. One can show that, for some suitable notion of Lipschitz-ness, subgradient descent also converges to the global minimum of convex Lipschitz functions. We will unfortunately not cover this.
[4] This is the ReLU function that is often used in neural networks.
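To illustrate the update, here is a small sketch of subgradient descent on the toy function f(x) = |x − 3|, which is our own example; the minimizer is x = 3, where f is not differentiable.

import numpy as np

def subgradient_descent(subgrad_f, x0, t, num_iters):
    # x_{i+1} = x_i - t * v, where v is any subgradient of f at x_i.
    x = x0
    for _ in range(num_iters):
        x = x - t * subgrad_f(x)
    return x

# f(x) = |x - 3| is convex but not differentiable at x = 3;
# sign(x - 3) is a valid subgradient everywhere (sign(0) = 0 lies in [-1, 1]).
x = subgradient_descent(lambda x: np.sign(x - 3.0), x0=0.0, t=0.01, num_iters=2000)
# x ends up within t of the minimizer 3 and then oscillates around it.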
4 Missing proofs
Proof of Claim 1.2. By the fundamental theorem of calculus[5],

f(y) = f(x) + ∫₀¹ ⟨∇f(x + τ(y − x)), y − x⟩ dτ
     = f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + τ(y − x)) − ∇f(x), y − x⟩ dτ
     ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ‖∇f(x + τ(y − x)) − ∇f(x)‖₂ ‖y − x‖₂ dτ
       (as for all vectors u, v, |⟨u, v⟩| ≤ ‖u‖₂ · ‖v‖₂)
     ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ L‖τ(y − x)‖₂ ‖y − x‖₂ dτ
       (as f is L-Lipschitz)
     = f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂².
Proof of Claim 1.3. The proof is by induction. For k = 2, the claim follows by convexity. Suppose it is true for k − 1. Let y = (y_1 + · · · + y_{k−1})/(k − 1). Then,

f((y_1 + · · · + y_k)/k) = f(((k − 1)/k) · y + (1/k) · y_k)
    ≤ ((k − 1)/k) · f(y) + (1/k) · f(y_k)
      (by convexity applied with λ = (k − 1)/k)
    ≤ ((k − 1)/k) · (f(y_1) + · · · + f(y_{k−1}))/(k − 1) + (1/k) · f(y_k)
      (induction hypothesis applied to y_1, . . . , y_{k−1})
    = (f(y_1) + · · · + f(y_k))/k.
[5] Basically, you define a function h : R → R by h(τ) = f(x + τ(y − x)) and apply the fundamental theorem of calculus to h: h(1) = h(0) + ∫₀¹ h′(τ) dτ.