06 SG Method
Outline
Today:
Subgradients
Examples and properties
Subgradient method
Convergence rate
Subgradients
Remember that for convex and differentiable f : R^n → R,

f(y) ≥ f(x) + ∇f(x)^T (y − x)   for all x, y
I.e., linear approximation always underestimates f
A subgradient of a convex f : R^n → R at x is any g ∈ R^n such that

f(y) ≥ f(x) + g^T (y − x)   for all y
Always exists
If f is differentiable at x, then g = ∇f(x), uniquely
Actually, the same definition works for nonconvex f (however, in that case a subgradient need not exist)
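As a quick numerical illustration (an added sketch, not part of the original notes), the following checks the subgradient inequality for f(x) = |x| at x = 0, where any g ∈ [−1, 1] works; the helper is_subgradient is hypothetical and only samples random points, so it can refute but not prove the inequality:

```python
import numpy as np

def is_subgradient(f, g, x, trials=1000, tol=1e-12):
    """Sample random y and check f(y) >= f(x) + g*(y - x); evidence only, not a proof."""
    ys = np.random.uniform(-10.0, 10.0, size=trials)
    return bool(np.all(f(ys) >= f(x) + g * (ys - x) - tol))

f = np.abs
print(is_subgradient(f, 0.3, x=0.0))   # True: any g in [-1, 1] is a subgradient at 0
print(is_subgradient(f, 1.5, x=0.0))   # False: inequality fails for some y > 0
```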
Examples
[Figures: example functions, plotted as f(x) in one variable and as f(x) over (x1, x2) in two variables]
Subdifferential
The set of all subgradients of convex f at x is called the subdifferential:

∂f(x) = { g ∈ R^n : g is a subgradient of f at x }

∂f(x) is a closed and convex set (and is nonempty for convex f)
Subgradient calculus
Basic rules for convex functions:
Scaling: ∂(a f) = a · ∂f, provided a > 0
Addition: ∂(f1 + f2) = ∂f1 + ∂f2
Affine composition: if g(x) = f(Ax + b), then
∂g(x) = A^T ∂f(Ax + b)
Finite pointwise maximum: if f(x) = max_{i=1,...,m} fi(x), then

∂f(x) = conv( ∪_{i : fi(x) = f(x)} ∂fi(x) ),

the convex hull of the subdifferentials of the active functions
General pointwise maximum: if f(x) = max_{s ∈ S} fs(x), then

∂f(x) ⊇ cl conv ( ∪_{s : fs(x) = f(x)} ∂fs(x) ),

with equality under some regularity conditions on S and the functions fs
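As an added illustration of the finite pointwise maximum rule (not from the original notes), for a piecewise-linear function each piece is affine, so its subdifferential is a single gradient, and the rule gives:

```latex
\[
  f(x) = \max_{i=1,\dots,m} \big( a_i^T x + b_i \big)
  \quad\Longrightarrow\quad
  \partial f(x) = \mathrm{conv}\big\{ a_i \;:\; a_i^T x + b_i = f(x) \big\},
\]
% the convex hull of the gradients of the pieces that are active (attain the max) at x
```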
Why subgradients?
In short: subgradients give an optimality condition for minimizing nondifferentiable convex functions (next slide), and computing them is all we need to run a simple, very general minimization algorithm, the subgradient method (rest of this lecture)
Optimality condition
For convex f : R^n → R,

f(x*) = min_{x ∈ R^n} f(x)   ⟺   0 ∈ ∂f(x*)

I.e., x* is a minimizer if and only if 0 is a subgradient of f at x*. Why: g = 0 being a subgradient means f(y) ≥ f(x*) + 0^T (y − x*) = f(x*) for all y
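For example (an added worked case, not from the original notes), combining the addition rule for the ℓ1 norm with this optimality condition shows that x = 0 minimizes ||x||_1:

```latex
\[
  \partial \|x\|_1
  = \Big\{ g \in \mathbb{R}^n : g_i = \mathrm{sign}(x_i) \text{ if } x_i \neq 0,\;\;
           g_i \in [-1,1] \text{ if } x_i = 0 \Big\}
\]
% at x = 0 every g with entries in [-1,1] is a subgradient, so 0 \in \partial\|x\|_1,
% and by the optimality condition x^\star = 0 minimizes \|x\|_1
```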
Soft-thresholding
The lasso problem can be parametrized as

min_x  (1/2) ||y − Ax||² + λ ||x||_1

with λ ≥ 0. Consider the simplified problem with A = I:

min_x  (1/2) ||y − x||² + λ ||x||_1

Claim: the solution of the simplified problem is x = S_λ(y), where S_λ is the soft-thresholding operator,

[S_λ(y)]_i =  y_i − λ    if y_i > λ
              0          if −λ ≤ y_i ≤ λ
              y_i + λ    if y_i < −λ
[Figure: soft-thresholding in one variable]
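A minimal NumPy sketch of the soft-thresholding operator (an added illustration; the function name and the example vector are made up):

```python
import numpy as np

def soft_threshold(y, lam):
    """Componentwise soft-thresholding S_lam: zero out small entries, shrink the rest by lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Entries with |y_i| <= lam map to 0; the others move toward 0 by exactly lam.
y = np.array([3.0, 0.4, -1.5, -0.2])
print(soft_threshold(y, lam=0.5))
```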
Subgradient method
Given convex f : R^n → R, not necessarily differentiable
Subgradient method: just like gradient descent, but replacing
gradients with subgradients. I.e., initialize x^(0), then repeat
x^(k) = x^(k−1) − t_k · g^(k−1),   k = 1, 2, 3, . . . ,

where g^(k−1) is any subgradient of f at x^(k−1). This is not necessarily a descent method, so we keep track of the best iterate so far, f(x^(k)_best) = min_{i=1,...,k} f(x^(i))
Step size choices: a fixed step size t_k = t for all k = 1, 2, 3, . . ., or diminishing step sizes chosen to satisfy

∑_{k=1}^∞ t_k² < ∞,   ∑_{k=1}^∞ t_k = ∞,

i.e., square summable but not summable
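A minimal NumPy sketch of the subgradient method with a diminishing step size (an added illustration; the demo objective f(x) = ||x − c||_1 and the 1/k schedule are made up for demonstration):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters=1000):
    """Iterate x^(k) = x^(k-1) - t_k g^(k-1) and track the best iterate,
    since the subgradient method is not a descent method."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)                     # any g in the subdifferential of f at x
        x = x - step(k) * g
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Demo: f(x) = ||x - c||_1 is minimized at c.
c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.abs(x - c))
subgrad = lambda x: np.sign(x - c)         # a valid subgradient of f at x
step = lambda k: 1.0 / k                   # diminishing: sum t_k = inf, sum t_k^2 < inf
print(subgradient_method(f, subgrad, np.zeros(3), step, iters=5000))
```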
Convergence analysis
Assume that f : R^n → R is convex, and also:
f is Lipschitz continuous with constant G > 0, i.e., |f(x) − f(y)| ≤ G ||x − y||_2 for all x, y (equivalently, ||g||_2 ≤ G for every subgradient g)
||x^(1) − x*||_2 ≤ R

Then: for a fixed step size t, the subgradient method satisfies lim_{k→∞} f(x^(k)_best) ≤ f(x*) + G² t / 2, and for diminishing step sizes, lim_{k→∞} f(x^(k)_best) = f(x*)
Basic inequality
Can prove both results from same basic inequality. Key steps:
Using the definition of subgradient,

||x^(k+1) − x*||² ≤ ||x^(k) − x*||² − 2 t_k ( f(x^(k)) − f(x*) ) + t_k² ||g^(k)||²
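Expanding this step (added for completeness): plug the update x^(k+1) = x^(k) − t_k g^(k) into the squared distance and apply the subgradient inequality at x^(k):

```latex
\[
\begin{aligned}
\|x^{(k+1)} - x^\star\|_2^2
  &= \|x^{(k)} - t_k g^{(k)} - x^\star\|_2^2 \\
  &= \|x^{(k)} - x^\star\|_2^2
     - 2 t_k (g^{(k)})^T (x^{(k)} - x^\star)
     + t_k^2 \|g^{(k)}\|_2^2 \\
  &\le \|x^{(k)} - x^\star\|_2^2
     - 2 t_k \big( f(x^{(k)}) - f(x^\star) \big)
     + t_k^2 \|g^{(k)}\|_2^2 ,
\end{aligned}
\]
% last step: the subgradient inequality f(x^\star) \ge f(x^{(k)}) + (g^{(k)})^T (x^\star - x^{(k)})
% rearranges to (g^{(k)})^T (x^{(k)} - x^\star) \ge f(x^{(k)}) - f(x^\star)
```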
Iterating the last inequality,

||x^(k+1) − x*||² ≤ ||x^(1) − x*||² − 2 ∑_{i=1}^k t_i ( f(x^(i)) − f(x*) ) + ∑_{i=1}^k t_i² ||g^(i)||²
Using ||x^(k+1) − x*||² ≥ 0 and letting R = ||x^(1) − x*||_2, this gives

2 ∑_{i=1}^k t_i ( f(x^(i)) − f(x*) ) ≤ R² + ∑_{i=1}^k t_i² ||g^(i)||²

Introducing f(x^(k)_best) = min_{i=1,...,k} f(x^(i)),

2 ∑_{i=1}^k t_i ( f(x^(i)) − f(x*) ) ≥ 2 ∑_{i=1}^k t_i ( f(x^(k)_best) − f(x*) ) = 2 ( f(x^(k)_best) − f(x*) ) ∑_{i=1}^k t_i

Combining the two displays and using ||g^(i)||_2 ≤ G, we get the basic bound

f(x^(k)_best) − f(x*) ≤ ( R² + G² ∑_{i=1}^k t_i² ) / ( 2 ∑_{i=1}^k t_i )
Convergence proofs
For a constant step size t, the basic bound is

f(x^(k)_best) − f(x*) ≤ ( R² + G² t² k ) / ( 2 t k )  →  G² t / 2   as k → ∞

For diminishing step sizes t_k satisfying

∑_{i=1}^∞ t_i² < ∞,   ∑_{i=1}^∞ t_i = ∞,

we get

( R² + G² ∑_{i=1}^k t_i² ) / ( 2 ∑_{i=1}^k t_i )  →  0   as k → ∞
Convergence rate
With a fixed number of iterations k, optimizing the constant-step bound over t gives t = R / (G √k) and

f(x^(k)_best) − f(x*) ≤ R G / √k

I.e., the subgradient method has convergence rate O(1/√k): to guarantee f(x^(k)_best) − f(x*) ≤ ε, we need O(1/ε²) iterations
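The optimization over t behind this rate is a one-line calculation (spelled out here for completeness):

```latex
\[
  \frac{R^2 + G^2 t^2 k}{2tk}
  = \frac{R^2}{2tk} + \frac{G^2 t}{2}
  \;\;\text{is minimized at}\;\;
  t^\star = \frac{R}{G\sqrt{k}},
  \quad\text{giving}\quad
  \frac{RG}{2\sqrt{k}} + \frac{RG}{2\sqrt{k}} = \frac{RG}{\sqrt{k}} .
\]
```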
Intersection of sets
Example from Boyd's lecture notes: suppose we want to find x* ∈ C1 ∩ . . . ∩ Cm, i.e., find a point in the intersection of the closed, convex sets C1, . . . , Cm
First define

f(x) = max_{i=1,...,m} dist(x, Ci),   x ∈ R^n,

and solve min_x f(x); note that f(x*) = 0 exactly when x* lies in the intersection
If Ci achieves the max, i.e., f(x^(k−1)) = dist(x^(k−1), Ci), then a subgradient of f at x^(k−1) is

g^(k−1) = ∇ dist(x^(k−1), Ci) = ( x^(k−1) − P_Ci(x^(k−1)) ) / || x^(k−1) − P_Ci(x^(k−1)) ||,

and the subgradient update with step size t_k = f(x^(k−1)) simplifies to

x^(k) = P_Ci(x^(k−1))

Here we used

f(x^(k−1)) = dist(x^(k−1), Ci) = || x^(k−1) − P_Ci(x^(k−1)) ||
For two sets, this is exactly the famous alternating projections
method, i.e., just keep projecting back and forth
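A small sketch of alternating projections for two simple sets (an added example; the box C1, the halfspace C2, and the starting point are made up, chosen because both projections have closed forms):

```python
import numpy as np

proj_box = lambda x: np.clip(x, -1.0, 1.0)          # projection onto C1 = [-1, 1]^n

a, b = np.array([1.0, 2.0]), 1.5                    # C2 = {x : a^T x <= b}
def proj_halfspace(x):
    viol = a @ x - b
    return x if viol <= 0 else x - (viol / (a @ a)) * a

# Alternating projections: keep projecting back and forth between C1 and C2.
x = np.array([5.0, -4.0])
for _ in range(100):
    x = proj_halfspace(proj_box(x))
print(x, a @ x <= b + 1e-9, np.all(np.abs(x) <= 1 + 1e-9))
```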
Can we do better?
Strength of subgradient method: broad applicability
Lower bound
Theorem (Nesterov): For any k ≤ n − 1 and starting point x^(0), there is a function in the problem class such that any nonsmooth first-order method satisfies

f(x^(k)) − f(x*) ≥ R G / ( 2 (1 + √(k + 1)) )
One can show that the iterates of such a method satisfy x^(i)_{i+1} = · · · = x^(i)_n = 0
Start with i = 1: note g^(0) = e1. Then:
span{g^(0), g^(1)} ⊆ span{e1, e2}
span{g^(0), g^(1), g^(2)} ⊆ span{e1, e2, e3}
. . .
span{g^(0), g^(1), . . . , g^(i−1)} ⊆ span{e1, . . . , ei}
Evaluating the hard function on such iterates then gives the claimed lower bound; for k = n − 1 it reads R G / ( 2 (1 + √n) )
References
S. Boyd, Lecture Notes for EE 264B, Stanford University, Spring 2010-2011
Y. Nesterov (2004), Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers