06 SG Method

The document summarizes the subgradient method for minimizing convex functions that may not be differentiable. It begins by introducing subgradients as generalizations of gradients for nondifferentiable functions. It then provides examples of subgradients for various functions. The subgradient method is presented as an analog of gradient descent that replaces gradients with subgradients. Convergence analysis shows the subgradient method converges at a rate of O(1/sqrt(k)) for k iterations. The document also discusses applications of subgradients to problems like soft thresholding and finding the intersection of convex sets.


Subgradient method

Geoff Gordon & Ryan Tibshirani


Optimization 10-725 / 36-725

Remember gradient descent


We want to solve
    min_{x ∈ R^n} f(x),
for f convex and differentiable

Gradient descent: choose initial x^(0) ∈ R^n, repeat:
    x^(k) = x^(k-1) - t_k ∇f(x^(k-1)),   k = 1, 2, 3, ...

If ∇f Lipschitz, gradient descent has convergence rate O(1/k)


Downsides:
- Can be slow (we will see faster methods later)
- Doesn't work for nondifferentiable functions (today's topic)

Outline

Today:
- Subgradients
- Examples and properties
- Subgradient method
- Convergence rate

Subgradients
Remember that for convex and differentiable f : R^n → R,
    f(y) ≥ f(x) + ∇f(x)^T (y - x)   for all x, y
I.e., the linear approximation always underestimates f

A subgradient of convex f : R^n → R at x is any g ∈ R^n such that
    f(y) ≥ f(x) + g^T (y - x),   for all y

- Always exists
- If f is differentiable at x, then g = ∇f(x) uniquely
- Actually, the same definition works for nonconvex f (however, a
  subgradient need not exist)
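The defining inequality is easy to verify numerically. A small sketch (the function, helper name, and test points are my own illustration, not from the slides), checking candidate subgradients of f(x) = |x|:

```python
import numpy as np

# Numerical check of the subgradient inequality f(y) >= f(x) + g*(y - x)
# for f(x) = |x|: at x = 0 any g in [-1, 1] works; at x != 0 only g = sign(x).
def is_subgradient(f, x, g, ys):
    # g is a subgradient at x iff the affine minorant sits below f everywhere
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)

ys = np.linspace(-5.0, 5.0, 101)

assert is_subgradient(abs, 0.0, 0.3, ys)      # g = 0.3 in [-1, 1] works at x = 0
assert not is_subgradient(abs, 0.0, 1.5, ys)  # g = 1.5 outside [-1, 1] fails
assert is_subgradient(abs, 2.0, 1.0, ys)      # g = sign(2) = 1 works at x = 2
```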

Examples

[Figure: plot of f(x) = |x|]
Consider f : R → R, f(x) = |x|

- For x ≠ 0, unique subgradient g = sign(x)
- For x = 0, subgradient g is any element of [-1, 1]

Consider f : R^n → R, f(x) = ‖x‖ (Euclidean norm)

[Figure: surface plot of f(x) = ‖x‖ over (x_1, x_2)]

- For x ≠ 0, unique subgradient g = x/‖x‖
- For x = 0, subgradient g is any element of {z : ‖z‖ ≤ 1}

Consider f : R^n → R, f(x) = ‖x‖_1

[Figure: surface plot of f(x) = ‖x‖_1 over (x_1, x_2)]

- For x_i ≠ 0, the ith component g_i = sign(x_i) uniquely
- For x_i = 0, the ith component g_i is any element of [-1, 1]

[Figure: plot of two crossing functions and their pointwise maximum]

Let f_1, f_2 : R^n → R be convex and differentiable, and consider
    f(x) = max{f_1(x), f_2(x)}

- For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)
- For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)
- For f_1(x) = f_2(x), subgradient g is any point on the line
  segment between ∇f_1(x) and ∇f_2(x)

Subdifferential

The set of all subgradients of convex f is called the subdifferential:
    ∂f(x) = {g ∈ R^n : g is a subgradient of f at x}

- ∂f(x) is closed and convex (even for nonconvex f)
- Nonempty for convex f (can be empty for nonconvex f)
- If f is differentiable at x, then ∂f(x) = {∇f(x)}
- If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g

Connection to convex geometry


Convex set C ⊆ R^n; consider the indicator function I_C : R^n → R ∪ {∞},
    I_C(x) = I{x ∈ C} = 0 if x ∈ C,  ∞ if x ∉ C

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x:
    N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}

Why? Recall the definition of a subgradient g:
    I_C(y) ≥ I_C(x) + g^T (y - x)   for all y
- For y ∉ C, I_C(y) = ∞, so the inequality holds trivially
- For y ∈ C, this means 0 ≥ g^T (y - x)

Subgradient calculus
Basic rules for convex functions:
- Scaling: ∂(af) = a · ∂f provided a > 0
- Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
- Affine composition: if g(x) = f(Ax + b), then
      ∂g(x) = A^T ∂f(Ax + b)
- Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then
      ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),
  the convex hull of the union of subdifferentials of all active
  functions at x

- General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then
      ∂f(x) ⊇ cl conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ),
  and under some regularity conditions (on S, f_s), we get equality

- Norms: important special case, f(x) = ‖x‖_p. Let q be such that
  1/p + 1/q = 1; then
      ∂f(x) = { y : ‖y‖_q ≤ 1 and y^T x = max_{‖z‖_q ≤ 1} z^T x }
  Why is this a special case? Note
      ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x

Why subgradients?

Subgradients are important for two reasons:


- Convex analysis: optimality characterization via subgradients,
  monotonicity, relationship to duality
- Convex optimization: if you can compute subgradients, then you
  can minimize (almost) any convex function

Optimality condition
For convex f,
    f(x*) = min_{x ∈ R^n} f(x)   ⟺   0 ∈ ∂f(x*)

I.e., x* is a minimizer if and only if 0 is a subgradient of f at x*

Why? Easy: g = 0 being a subgradient means that for all y,
    f(y) ≥ f(x*) + 0^T (y - x*) = f(x*)

Note the analogy to the differentiable case, where ∂f(x) = {∇f(x)}

Soft-thresholding
The lasso problem can be parametrized as
    min_x  (1/2)‖y - Ax‖^2 + λ‖x‖_1
where λ ≥ 0. Consider the simplified problem with A = I:
    min_x  (1/2)‖y - x‖^2 + λ‖x‖_1

Claim: the solution of the simple problem is x* = S_λ(y), where S_λ
is the soft-thresholding operator:

    [S_λ(y)]_i = y_i - λ   if y_i > λ
                 0         if -λ ≤ y_i ≤ λ
                 y_i + λ   if y_i < -λ

Why? Subgradients of f(x) = (1/2)‖y - x‖^2 + λ‖x‖_1 are
    g = x - y + λs,
where s_i = sign(x_i) if x_i ≠ 0 and s_i ∈ [-1, 1] if x_i = 0

[Figure: soft-thresholding in one variable]

Now just plug in x = S_λ(y) and check that we can achieve g = 0
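This claim is easy to check numerically. A sketch (the operator name and test vector are my own choices, not from the slides):

```python
import numpy as np

# Soft-thresholding operator S_lambda(y), applied componentwise
def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, -0.5, 1.2, -4.0])
lam = 1.0
x = soft_threshold(y, lam)

# Optimality: g = x - y + lam*s must be able to equal 0, where
# s_i = sign(x_i) if x_i != 0, and s_i can be anything in [-1, 1] if x_i = 0
nz = x != 0
assert np.allclose(x[nz] - y[nz] + lam * np.sign(x[nz]), 0.0)  # g_i = 0 exactly
assert np.all(np.abs((y[~nz] - x[~nz]) / lam) <= 1.0)          # a valid s_i exists
```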

Subgradient method
Given convex f : R^n → R, not necessarily differentiable

Subgradient method: just like gradient descent, but replacing
gradients with subgradients. I.e., initialize x^(0), then repeat
    x^(k) = x^(k-1) - t_k g^(k-1),   k = 1, 2, 3, ...,
where g^(k-1) is any subgradient of f at x^(k-1)

The subgradient method is not necessarily a descent method, so we
keep track of the best iterate x_best^(k) among x^(1), ..., x^(k) so far, i.e.,
    f(x_best^(k)) = min_{i=1,...,k} f(x^(i))
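The update above can be sketched in a few lines. This is my own illustration (the ℓ1 objective and the t_k = 1/k steps are illustrative picks, not from the slides):

```python
import numpy as np

# Subgradient method sketch on f(x) = ||x||_1, whose subgradient we take
# componentwise as sign(x) (a valid choice: 0 where x_i = 0).
def subgradient_method(x0, f, subgrad, step, iters):
    """Run the subgradient method, tracking the best iterate so far."""
    x = x0.copy()
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = x - step(k) * subgrad(x)       # x^(k) = x^(k-1) - t_k g^(k-1)
        if f(x) < f_best:                  # not a descent method: keep the best
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

f = lambda x: np.sum(np.abs(x))
x0 = np.array([2.0, -1.5, 0.7])

# Diminishing steps t_k = 1/k: square summable but not summable
x_best, f_best = subgradient_method(x0, f, np.sign, lambda k: 1.0 / k, 500)
assert f_best < 0.1    # approaches the minimum value f(0) = 0
```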

Step size choices

Fixed step size: t_k = t, for all k = 1, 2, 3, ...

Diminishing step size: choose t_k to satisfy
    ∑_{k=1}^∞ t_k^2 < ∞,   ∑_{k=1}^∞ t_k = ∞,
i.e., square summable but not summable

Important that step sizes go to zero, but not too fast

Other options too, but an important difference to gradient descent:
all step size options are pre-specified, not adaptively computed

Convergence analysis
Assume that f : R^n → R is convex, and also:

- f is Lipschitz continuous with constant G > 0:
      |f(x) - f(y)| ≤ G‖x - y‖   for all x, y
  Equivalently: ‖g‖ ≤ G for any subgradient of f at any x
- ‖x^(1) - x*‖ ≤ R (equivalently, ‖x^(0) - x*‖ is bounded)

Theorem: For a fixed step size t, the subgradient method satisfies
    lim_{k→∞} f(x_best^(k)) ≤ f(x*) + G^2 t / 2

Theorem: For diminishing step sizes, the subgradient method satisfies
    lim_{k→∞} f(x_best^(k)) = f(x*)

Basic inequality
Can prove both results from same basic inequality. Key steps:
- Using the definition of subgradient,
      ‖x^(k+1) - x*‖^2 ≤ ‖x^(k) - x*‖^2 - 2 t_k (f(x^(k)) - f(x*)) + t_k^2 ‖g^(k)‖^2
- Iterating the last inequality,
      ‖x^(k+1) - x*‖^2 ≤ ‖x^(1) - x*‖^2 - 2 ∑_{i=1}^k t_i (f(x^(i)) - f(x*)) + ∑_{i=1}^k t_i^2 ‖g^(i)‖^2

- Using ‖x^(k+1) - x*‖^2 ≥ 0 and ‖x^(1) - x*‖ ≤ R,
      2 ∑_{i=1}^k t_i (f(x^(i)) - f(x*)) ≤ R^2 + ∑_{i=1}^k t_i^2 ‖g^(i)‖^2
- Introducing f(x_best^(k)),
      ∑_{i=1}^k t_i (f(x^(i)) - f(x*)) ≥ ∑_{i=1}^k t_i (f(x_best^(k)) - f(x*))
- Plugging this in and using ‖g^(i)‖ ≤ G,
      f(x_best^(k)) - f(x*) ≤ (R^2 + G^2 ∑_{i=1}^k t_i^2) / (2 ∑_{i=1}^k t_i)

Convergence proofs
For constant step size t, the basic bound is
    (R^2 + G^2 t^2 k) / (2 t k)  →  G^2 t / 2   as k → ∞

For diminishing step sizes t_k with
    ∑_{i=1}^∞ t_i^2 < ∞,   ∑_{i=1}^∞ t_i = ∞,
we get
    (R^2 + G^2 ∑_{i=1}^k t_i^2) / (2 ∑_{i=1}^k t_i)  →  0   as k → ∞

Convergence rate
After k iterations, what is the complexity of the error f(x_best^(k)) - f(x*)?

Consider taking t_i = R/(G√k), for all i = 1, ..., k. Then the basic
bound is
    (R^2 + G^2 ∑_{i=1}^k t_i^2) / (2 ∑_{i=1}^k t_i) = RG/√k

Can show this choice is the best we can do (i.e., minimizes the bound)

I.e., the subgradient method has convergence rate O(1/√k)

I.e., to get f(x_best^(k)) - f(x*) ≤ ε, we need O(1/ε^2) iterations
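A quick numerical sanity check of this bound, on a 1-D instance of my own choosing: f(x) = |x| is Lipschitz with G = 1, and starting at x^(0) = 1 with minimizer x* = 0 gives R = 1.

```python
import numpy as np

# f(x) = |x| in 1-D: Lipschitz with G = 1; x^(0) = 1, x* = 0, so R = 1
G, R, k = 1.0, 1.0, 400
t = R / (G * np.sqrt(k))          # fixed t_i = R/(G sqrt(k)), i = 1, ..., k

x, f_best = 1.0, 1.0
for _ in range(k):
    x = x - t * np.sign(x)        # subgradient step (sign(0) = 0 keeps x at 0)
    f_best = min(f_best, abs(x))

# basic bound: f(x_best^(k)) - f(x*) <= RG/sqrt(k)
assert f_best <= R * G / np.sqrt(k) + 1e-12
```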

Intersection of sets
Example from Boyd's lecture notes: suppose we want to find
x* ∈ C_1 ∩ ... ∩ C_m, i.e., find a point in the intersection of the
closed, convex sets C_1, ..., C_m

First define
    f(x) = max_{i=1,...,m} dist(x, C_i),
and now solve
    min_{x ∈ R^n} f(x)

Note that f(x*) = 0  ⟺  x* ∈ C_1 ∩ ... ∩ C_m

Recall the distance to a set C:
    dist(x, C) = min{‖x - u‖ : u ∈ C}

For closed, convex C, there is a unique point minimizing ‖x - u‖
over u ∈ C, denoted u* = P_C(x), so dist(x, C) = ‖x - P_C(x)‖

Let f_i(x) = dist(x, C_i), for each i. Then f(x) = max_{i=1,...,m} f_i(x), and:

- For each i, and x ∉ C_i,
      ∇f_i(x) = (x - P_{C_i}(x)) / ‖x - P_{C_i}(x)‖
- If f(x) = f_i(x) ≠ 0, then
      (x - P_{C_i}(x)) / ‖x - P_{C_i}(x)‖ ∈ ∂f(x)

Now apply the subgradient method with step size t_k = f(x^(k-1))
(the Polyak step size; can show that we get convergence)

Hence at iteration k, find C_i so that x^(k-1) is farthest from C_i.
Then update
    x^(k) = x^(k-1) - f(x^(k-1)) · (x^(k-1) - P_{C_i}(x^(k-1))) / ‖x^(k-1) - P_{C_i}(x^(k-1))‖
          = P_{C_i}(x^(k-1))
Here we used
    f(x^(k-1)) = dist(x^(k-1), C_i) = ‖x^(k-1) - P_{C_i}(x^(k-1))‖

For two sets, this is exactly the famous alternating projections
method, i.e., just keep projecting back and forth

[Figure: alternating projections between two sets, from Boyd's notes]
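A minimal alternating-projections sketch for two sets in R^2 (the sets and tolerances are my own choices, not from the slides): C_1 the unit disk and C_2 the half-plane {x : x_1 ≥ 0.5}. With the Polyak step size each iteration is exactly a projection, so for two sets we simply project back and forth.

```python
import numpy as np

# C1 = unit disk, C2 = half-plane {x : x[0] >= 0.5}; both closed and convex
def proj_disk(x):
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n            # radial projection onto the disk

def proj_halfplane(x):
    return np.array([max(x[0], 0.5), x[1]])  # clip the first coordinate

x = np.array([3.0, 2.0])                     # start outside both sets
for _ in range(100):
    x = proj_disk(proj_halfplane(x))         # alternate projections

# the limit point lies (approximately) in the intersection C1 ∩ C2
assert np.linalg.norm(x) <= 1 + 1e-8
assert x[0] >= 0.5 - 1e-6
```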

Can we do better?
Strength of the subgradient method: broad applicability

Downside: the O(1/√k) rate is really slow ... can we do better?


Given starting point x^(0). Setup:

- Problem class: convex functions f with solution x*, with
  ‖x^(0) - x*‖ ≤ R, and f Lipschitz with constant G > 0 on
  {x : ‖x - x^(0)‖ ≤ R}
- Weak oracle: given x, the oracle returns a subgradient g ∈ ∂f(x)
- Nonsmooth first-order methods: iterative methods that start
  with x^(0) and update x^(k) in
      x^(0) + span{g^(0), g^(1), ..., g^(k-1)},
  where the subgradients g^(0), g^(1), ..., g^(k-1) come from the weak oracle

Lower bound
Theorem (Nesterov): For any k ≤ n - 1 and starting point x^(0),
there is a function in the problem class such that any nonsmooth
first-order method satisfies
    f(x^(k)) - f(x*) ≥ RG / (2(1 + √(k+1)))

Proof: We'll do the proof for k = n - 1 and x^(0) = 0; the proof is
similar otherwise. Let
    f(x) = max_{i=1,...,n} x_i + (1/2)‖x‖^2

Solution: x* = (-1/n, ..., -1/n), f(x*) = -1/(2n)

For R = 1/√n, f is Lipschitz with G = 1 + 1/√n

Oracle: returns g = e_j + x, where j is the smallest index such that
x_j = max_{i=1,...,n} x_i

Claim: for any i ∈ {1, ..., n-1}, the ith iterate satisfies
    x_{i+1}^(i) = ... = x_n^(i) = 0

Start with i = 1: note g^(0) = e_1. Then:
    span{g^(0), g^(1)} ⊆ span{e_1, e_2}
    span{g^(0), g^(1), g^(2)} ⊆ span{e_1, e_2, e_3}
    ...
    span{g^(0), g^(1), ..., g^(i-1)} ⊆ span{e_1, ..., e_i}

Therefore f(x^(n-1)) ≥ 0; recall f(x*) = -1/(2n), so
    f(x^(n-1)) - f(x*) ≥ 1/(2n) = RG / (2(1 + √n))

Improving on the subgradient method


To improve, we must go beyond nonsmooth first-order methods

There are many ways to improve for general nonsmooth problems,
e.g., localization methods, filtered subgradients, memory terms

Instead, we'll focus on minimizing functions of the form
    f(x) = g(x) + h(x)
where g is convex and differentiable, and h is convex

For a lot of problems (i.e., functions h), we can recover the O(1/k)
rate of gradient descent with a simple algorithm, which has big
practical consequences

References
- S. Boyd, Lecture Notes for EE 264B, Stanford University, Spring
  2010-2011
- Y. Nesterov (2004), Introductory Lectures on Convex Optimization:
  A Basic Course, Kluwer Academic Publishers, Chapter 3
- B. Polyak (1987), Introduction to Optimization, Optimization
  Software Inc., Chapter 5
- R. T. Rockafellar (1970), Convex Analysis, Princeton University
  Press, Chapters 23-25
- L. Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring
  2011-2012
