Optimization for Machine Learning

Lecture 12: Coordinate Descent, BCD, Altmin


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

01 Apr, 2021
Coordinate descent

So far:  min f(x) = ∑_{i=1}^n f_i(x)

Since x ∈ R^d, now consider

    min f(x) = f(x_1, x_2, ..., x_d)

Previously, we went through f_1, ..., f_n

What if we now go through x_1, ..., x_d one by one?

Explore: Going through both [n] and [d]?

Coordinate descent

▪ For k = 0, 1, ...
    Pick an index i from {1, ..., d}
    Optimize the i-th coordinate:

        x_i^{k+1} ← argmin_{ξ ∈ R} f(x_1^{k+1}, ..., x_{i-1}^{k+1}, ξ, x_{i+1}^k, ..., x_d^k)

    (coordinates 1, ..., i−1: done; ξ: current; i+1, ..., d: todo)
▪ Decide when/how to stop; return x^k

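As a concrete illustration (my own sketch, not part of the lecture), here is a minimal cyclic coordinate descent loop in Python; the function names and the quadratic test problem are invented for the example, and a generic 1-D numerical minimizer stands in for the exact coordinate-wise argmin.

# Minimal sketch: cyclic coordinate descent with a 1-D numerical inner solver.
# Assumes f: R^d -> R is cheap to evaluate.
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, num_epochs=50):
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    for _ in range(num_epochs):
        for i in range(d):                      # cyclic order 1, ..., d
            def f_i(xi, i=i):                   # objective as a function of coordinate i only
                x[i] = xi
                return f(x)
            res = minimize_scalar(f_i)          # argmin over the single coordinate
            x[i] = res.x
    return x

if __name__ == "__main__":
    # Example: a simple convex quadratic f(x) = 0.5 x^T A x - b^T x
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    print(coordinate_descent(f, np.zeros(2)))   # approaches A^{-1} b = (0.6, -0.8)

In practice the inner 1-D problem is solved in closed form when available (as in the least-squares exercise below) or replaced by a coordinate gradient step (as in randomized BCD later in the lecture).
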
Coordinate descent - context
♣ One of the simplest optimization methods
♣ Old idea: Gauss-Seidel, Jacobi methods for linear systems!
♣ Can be “slow”, but sometimes very competitive
♣ Gradient, subgradient, incremental methods also “slow”
♣ But incremental, stochastic gradient methods are scalable
♣ Renewed interest in CD was driven by ML
♣ Notice: in general CD is “derivative free”

CD – which coordinate?

Gauss-Southwell: If f is differentiable, at iteration k, pick the
index i that maximizes |[∇f(x^k)]_i|

Derivative-free rules:
♣ Cyclic order 1, 2, ..., d, 1, ...
♣ Almost cyclic: each coordinate 1 ≤ i ≤ d is picked at least
  once every B successive iterations (B ≥ d)
♣ Double sweep: 1, ..., d then d − 1, ..., 1, repeat
♣ Cyclic with permutation: random order each cycle
♣ Random sampling: pick a random index at each iteration

Which ones would you prefer? Why?

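The derivative-free orderings are easy to express as index generators; the small sketch below (my own illustration, not from the slides) shows cyclic, cyclic-with-permutation, and uniform random sampling.

# Sketch of three index-selection rules for CD (illustrative only).
import numpy as np

def cyclic(d):
    k = 0
    while True:
        yield k % d
        k += 1

def cyclic_with_permutation(d, rng=np.random.default_rng(0)):
    while True:
        for i in rng.permutation(d):   # fresh random order each cycle
            yield int(i)

def random_sampling(d, rng=np.random.default_rng(0)):
    while True:
        yield int(rng.integers(d))     # i.i.d. uniform index each iteration
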
Exercise: CD for least squares

    min_x  ‖Ax − b‖_2^2

Exercise: Obtain an update for the j-th coordinate

Coordinate descent update

    x_j ← ( ∑_{i=1}^m a_ij ( b_i − ∑_{l≠j} a_il x_l ) ) / ( ∑_{i=1}^m a_ij^2 )

(dropped superscripts, since we overwrite)

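A minimal Python sketch of this update (my own illustration; it assumes A has no zero columns), maintaining the residual r = b − Ax so that each coordinate update costs O(m):

# Sketch: cyclic CD for least squares min ||Ax - b||^2, with residual maintenance.
import numpy as np

def cd_least_squares(A, b, num_epochs=200):
    m, d = A.shape
    x = np.zeros(d)
    r = b - A @ x                      # residual b - Ax
    col_sq = (A ** 2).sum(axis=0)      # precompute sum_i a_ij^2 per column
    for _ in range(num_epochs):
        for j in range(d):
            # optimal x_j given the others: x_j += a_j^T r / ||a_j||^2
            delta = A[:, j] @ r / col_sq[j]
            x[j] += delta
            r -= delta * A[:, j]       # keep residual consistent with the new x
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
    x_cd = cd_least_squares(A, b)
    x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
    print(np.linalg.norm(x_cd - x_ls))   # should be tiny
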
Coordinate descent – some remarks
Advantages
♦ Each iteration usually cheap (single variable optimization)
♦ No extra storage vectors needed
♦ No stepsize tuning
♦ No other pesky parameters (usually) that must be tuned
♦ Simple to implement
♦ Can work well for large-scale problems
Disadvantages
♠ Tricky if single variable optimization is hard
♠ Convergence theory can be complicated
♠ Can slow down near optimum
♠ Nonsmooth case more tricky
♠ Explore: not easy to use for deep learning...

BCD
(Basics, Convergence)

Block coordinate descent (BCD)

    min f(x) := f(x_1, ..., x_m),    x ∈ X_1 × X_2 × · · · × X_m.

Gauss-Seidel update

    x_i^{k+1} ← argmin_{ξ ∈ X_i} f(x_1^{k+1}, ..., x_{i-1}^{k+1}, ξ, x_{i+1}^k, ..., x_m^k)

    (blocks 1, ..., i−1: done; ξ: current; i+1, ..., m: todo)

Jacobi update (easy to parallelize)

    x_i^{k+1} ← argmin_{ξ ∈ X_i} f(x_1^k, ..., x_{i-1}^k, ξ, x_{i+1}^k, ..., x_m^k)

    (blocks 1, ..., i−1: don't clobber; ξ: current; i+1, ..., m: todo)
BCD – convergence

Theorem. Let f be C^1 over X := ∏_{i=1}^m X_i. Assume for each
block i and x ∈ X, the minimum

    min_{ξ ∈ X_i} f(x_1, ..., x_{i−1}, ξ, x_{i+1}, ..., x_m)

is uniquely attained. Then, every limit point of the sequence
{x^k} generated by BCD is a stationary point of f.

Corollary. If f is in addition convex, then every limit point of
the BCD sequence {x^k} is a global minimum.

▸ Unique solutions of subproblems not always possible
▸ Above result is only asymptotic (holds in the limit)
▸ Warning! BCD may cycle indefinitely without converging,
  if the number of blocks is > 2 and the objective is nonconvex.

BCD – Two blocks

Two-block BCD

    minimize f(x) = f(x_1, x_2),    x ∈ X_1 × X_2.

Theorem (Grippo & Sciandrone (2000)). Let f be continuously
differentiable. Let X_1, X_2 be closed and convex. Assume
both BCD subproblems have solutions and the sequence {x^k}
has limit points. Then, every limit point of {x^k} is stationary.

▸ No need of unique solutions to subproblems
▸ BCD for 2 blocks is also called: Alternating Minimization

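As an illustration of two-block alternating minimization (my own sketch, not from the slides), consider low-rank factorization min_{U,V} ‖M − U V^T‖_F^2; each block subproblem is a linear least-squares problem with a closed-form solution.

# Sketch: two-block AltMin for  min_{U,V} ||M - U V^T||_F^2,  U in R^{m x r}, V in R^{n x r}.
import numpy as np

def altmin_factorize(M, r, num_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = M.shape
    V = rng.normal(size=(n, r))
    for _ in range(num_iters):
        # Block 1: fix V, solve the least-squares problem for U
        U = np.linalg.lstsq(V, M.T, rcond=None)[0].T
        # Block 2: fix U, solve the least-squares problem for V
        V = np.linalg.lstsq(U, M, rcond=None)[0].T
    return U, V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20))   # exactly rank 3
    U, V = altmin_factorize(M, r=3)
    print(np.linalg.norm(M - U @ V.T))                        # should be ~0
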
CD – projection onto convex sets

    min  (1/2)‖x − y‖_2^2
    s.t. x ∈ C_1 ∩ C_2 ∩ · · · ∩ C_m.

Solution 1: Rewrite using indicator functions

    min  (1/2)‖x − y‖_2^2 + ∑_{i=1}^m δ_{C_i}(x).

▸ Now invoke Douglas-Rachford using the product-space trick

Solution 2: Take dual of the above formulation

Solution 1: Product space technique

▸ Original problem over H = R^n
▸ Suppose we have ∑_{i=1}^n f_i(x)
▸ Introduce n new variables (x_1, ..., x_n)
▸ Now the problem is over the domain H^n := ∏_{i=1}^n H
▸ New constraint: x_1 = x_2 = · · · = x_n

    min_{(x_1, ..., x_n)}  ∑_i f_i(x_i)
    s.t.  x_1 = x_2 = · · · = x_n.

Technique due to: G. Pierra (1976)

Solution 1: Product space technique

    min_x  f(x) + 1_B(x)

where x ∈ H^n and B = {z ∈ H^n | z = (x, x, ..., x)}

▸ Let y = (y_1, ..., y_n)
▸ prox_f(y) = (prox_{f_1}(y_1), ..., prox_{f_n}(y_n))
▸ prox_B ≡ Π_B(y) can be solved as follows:

    min_{z ∈ B} (1/2)‖z − y‖_2^2   ⟺   min_{x ∈ H} ∑_i (1/2)‖x − y_i‖_2^2   =⇒   x = (1/n) ∑_i y_i

Exercise: Work out the details of the Douglas-Rachford
algorithm using the above product space trick.
Remark: This technique commonly exploited in ADMM too

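A minimal sketch (my own, under the stated assumptions) of how the product-space trick can be combined with Douglas-Rachford for the projection problem above: the quadratic term and each indicator δ_{C_i} become the blocks of f, prox_f acts blockwise, and the prox of 1_B is the consensus averaging derived on this slide. The particular sets (a box and a halfspace), the data, and the step size are invented for the example.

# Sketch: Douglas-Rachford in the product space H^n, splitting f(z) = sum_i f_i(z_i)
# against the consensus indicator 1_B, B = {z : z_1 = ... = z_n}.
import numpy as np

def product_space_dr(prox_list, x_shape, num_iters=500, gamma=1.0, seed=0):
    n = len(prox_list)
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n,) + x_shape)          # governing sequence in H^n
    for _ in range(num_iters):
        # prox of f: blockwise prox of each f_i
        x = np.stack([prox_list[i](z[i], gamma) for i in range(n)])
        # prox of 1_B: project onto the consensus set (replicate the average)
        w = np.broadcast_to((2 * x - z).mean(axis=0), z.shape)
        z = z + w - x                             # DR update
    return x.mean(axis=0)                         # consensus estimate

if __name__ == "__main__":
    # Project y onto the intersection of a box and a halfspace {x : a^T x <= b}.
    y = np.array([3.0, -2.0]); a = np.array([1.0, 1.0]); b = 0.5
    prox_quad = lambda v, g: (v + g * y) / (1 + g)               # prox of (1/2)||x - y||^2
    proj_box  = lambda v, g: np.clip(v, -1.0, 1.0)               # prox of a box indicator
    proj_half = lambda v, g: v - max(0.0, a @ v - b) / (a @ a) * a
    print(product_space_dr([prox_quad, proj_box, proj_half], y.shape))

For this data the projection onto the intersection is (1, −1), so the printed consensus point should be close to that.
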
Solution 2: proximal Dykstra

    min  (1/2)‖x − y‖_2^2 + f(x) + h(x)

    L(x, z, w, ν, µ) := (1/2)‖x − y‖_2^2 + f(z) + h(w) + ν^T(x − z) + µ^T(x − w)

    g(ν, µ) := inf_{x,z,w} L(x, z, w, ν, µ)

    x − y + ν + µ = 0  =⇒  x = y − ν − µ

    g(ν, µ) = −(1/2)‖ν + µ‖_2^2 + (ν + µ)^T y − f*(ν) − h*(µ)

Dual as minimization problem

    min  k(ν, µ) := (1/2)‖ν + µ − y‖_2^2 + f*(ν) + h*(µ)

The Proximal-Dykstra method

Apply CD to k(ν, µ) = (1/2)‖ν + µ − y‖_2^2 + f*(ν) + h*(µ)

    ν_{k+1} = argmin_ν k(ν, µ_k)
    µ_{k+1} = argmin_µ k(ν_{k+1}, µ)

▸ 0 ∈ ν + µ_k − y + ∂f*(ν)
▸ 0 ∈ ν_{k+1} + µ − y + ∂h*(µ)
▸ y − µ_k ∈ ν + ∂f*(ν) = (I + ∂f*)(ν)
  =⇒ ν = prox_{f*}(y − µ_k)  =⇒  ν = y − µ_k − prox_f(y − µ_k)
▸ Similarly, we see that
  µ = y − ν_{k+1} − prox_h(y − ν_{k+1})

    ν_{k+1} ← y − µ_k − prox_f(y − µ_k)
    µ_{k+1} ← y − ν_{k+1} − prox_h(y − ν_{k+1})

Proximal-Dykstra as CD

▪ Simplify, and use Lagrangian stationarity to obtain the primal

    x = y − ν − µ  =⇒  y − µ = x + ν

▪ Thus, the CD iteration may be rewritten as

    t_k      ← prox_f(x_k + ν_k)
    ν_{k+1}  ← x_k + ν_k − t_k
    x_{k+1}  ← prox_h(µ_k + t_k)
    µ_{k+1}  ← µ_k + t_k − x_{k+1}

▪ We used: prox_h(y − ν_{k+1}) = y − ν_{k+1} − µ_{k+1} = x_{k+1}

▪ This is the proximal-Dykstra method!
Explore: Pros-cons of Prox-Dykstra versus product space + DR

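A minimal Python sketch of these four updates (my own illustration; prox_f and prox_h are assumed to be available), applied to projecting a point onto the intersection of two convex sets, where both proxes are Euclidean projections:

# Sketch: proximal-Dykstra iteration for  min (1/2)||x - y||^2 + f(x) + h(x).
import numpy as np

def proximal_dykstra(y, prox_f, prox_h, num_iters=200):
    x, nu, mu = y.copy(), np.zeros_like(y), np.zeros_like(y)
    for _ in range(num_iters):
        t = prox_f(x + nu)            # t_k      <- prox_f(x_k + nu_k)
        nu = x + nu - t               # nu_{k+1} <- x_k + nu_k - t_k
        x_new = prox_h(mu + t)        # x_{k+1}  <- prox_h(mu_k + t_k)
        mu = mu + t - x_new           # mu_{k+1} <- mu_k + t_k - x_{k+1}
        x = x_new
    return x

if __name__ == "__main__":
    # Project y onto {x : ||x||_2 <= 1} ∩ {x : x >= 0}; f, h are the two indicators,
    # so their proxes are Euclidean projections.
    y = np.array([2.0, -1.0, 0.5])
    proj_ball = lambda v: v / max(1.0, np.linalg.norm(v))
    proj_nonneg = lambda v: np.maximum(v, 0.0)
    print(proximal_dykstra(y, proj_ball, proj_nonneg))
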
CD – nonsmooth case

CD for nonsmooth convex problems

    min  |x_1 − x_2| + (1/2)|x_1 + x_2|

(At any point with x_1 = x_2 ≠ 0, changing either coordinate alone increases the
objective, yet the minimum value 0 is attained only at the origin: plain CD can get
stuck at non-optimal points when the nonsmooth part couples the coordinates.)
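A quick numerical illustration of the previous remark (my own check, not from the slides):

# Tiny check: at (1, 1) no single-coordinate move improves f, although f(0, 0) = 0 is better.
import numpy as np

f = lambda x1, x2: abs(x1 - x2) + 0.5 * abs(x1 + x2)
ts = np.linspace(-2.0, 2.0, 4001)
print(min(f(1 + t, 1) for t in ts) >= f(1, 1) - 1e-12)   # True: x1-moves don't help
print(min(f(1, 1 + t) for t in ts) >= f(1, 1) - 1e-12)   # True: x2-moves don't help
print(f(0, 0))                                           # 0.0
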
CD for separable nonsmoothness

▸ Nonsmooth part is separable

    min_{x ∈ R^d}  f(x) + ∑_{i=1}^d r_i(x_i)

Theorem. Suppose f is convex and continuously differentiable, each
r_i(x) is closed and convex, and each coordinatewise minimization admits
a unique solution. Further, assume we go through all coordinates in an
essentially cyclic way. Then, the sequence {x^k} generated by
CD is bounded, and every limit point of it is optimal.

Remark: A related result for nonconvex problems with separable non-smoothness
(under more assumptions) can be found in: “Convergence of Block Coordinate
Descent Method for Nondifferentiable Minimization” by P. Tseng (2001).

CD – iteration complexity

CD non-asymptotic rate

▸ So far, we saw CD based on essentially cyclic rules
▸ It is difficult to prove global convergence and almost
  impossible to estimate the global rate of convergence
▸ Above results highlighted at best local (asymptotic) rates

▪ Consider the unconstrained problem min f(x), s.t. x ∈ R^d
▪ Assume f is convex, with componentwise Lipschitz gradients

    |∇_i f(x + h e_i) − ∇_i f(x)| ≤ L_i |h|,    x ∈ R^d, h ∈ R.

  Here e_i denotes the i-th canonical basis vector

Choose x_0 ∈ R^d. Let M = max_i L_i. For k ≥ 0

    i_k     = argmax_{1 ≤ i ≤ d} |∇_i f(x_k)|
    x_{k+1} = x_k − (1/M) ∇_{i_k} f(x_k) e_{i_k}.

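A minimal sketch of this greedy rule in Python (my own illustration; grad_f, the componentwise constants L_i, and the quadratic test problem are assumptions of the example):

# Sketch: greedy CD with step 1/M, M = max_i L_i, picking i_k = argmax_i |grad_i f(x_k)|.
import numpy as np

def greedy_cd(grad_f, L, x0, num_iters=1000):
    x = x0.astype(float)
    M = np.max(L)                       # M = max_i L_i
    for _ in range(num_iters):
        g = grad_f(x)                   # note: needs the full gradient each step
        i = int(np.argmax(np.abs(g)))   # greedy index choice
        x[i] -= g[i] / M                # coordinate step of length 1/M
    return x

if __name__ == "__main__":
    # Quadratic test problem f(x) = 0.5 x^T A x - b^T x; componentwise L_i = A_ii.
    A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    b = np.array([1.0, 0.0, -1.0])
    x = greedy_cd(lambda x: A @ x - b, L=np.diag(A), x0=np.zeros(3))
    print(np.linalg.norm(x - np.linalg.solve(A, b)))   # should be tiny
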
CD – non-asymptotic convergence

Theorem. Let {x_k} be the iterate sequence generated by the above
greedy CD method. Then,

    f(x_k) − f* ≤ 2dM‖x_0 − x*‖_2^2 / (k + 4),    k ≥ 0.

▸ Looks like the gradient-descent O(1/k) bound for C^1_L convex f
▸ Notice the factor of d in the numerator!
▸ But this method is impractical
▸ At each step, it requires access to the full gradient
▸ Might as well use ordinary gradient methods!
▸ Also, if f ∈ C^1_L, the constant dM can easily be much larger than L
▸ So the above rate is, in general, worse than that of gradient methods

BCD – Notation

▸ Decomposition: E = [E_1, ..., E_n] into n blocks
▸ Corresponding decomposition of x is

    (E_1^T x, E_2^T x, ..., E_n^T x) = (x^(1), x^(2), ..., x^(n)),   with N_1 + N_2 + · · · + N_n = N

▸ Observation:

    E_i^T E_j = I_{N_i}       if i = j
    E_i^T E_j = 0_{N_i, N_j}  if i ≠ j

▸ So the E_i's define our partitioning of the coordinates
▸ Just fancier notation for a random partition of coordinates
▸ Now with this notation ...

BCD – formal setup

    min f(x),  where x ∈ R^d

Assume the gradient of block i is Lipschitz continuous**

    ‖∇_i f(x + E_i h) − ∇_i f(x)‖_* ≤ L_i ‖h‖

Block gradient ∇_i f(x) is the projection of the full gradient: E_i^T ∇f(x)

** each block can use its own norm

Block Coordinate “Gradient” Descent

▸ Using the descent lemma, we have blockwise upper bounds

    f(x + E_i h) ≤ f(x) + ⟨∇_i f(x), h⟩ + (L_i/2)‖h‖²,    for i = 1, ..., d.

▸ At each step, minimize these upper bounds!

Randomized BCD

▸ For k ≥ 0 (no init. of x necessary)
▸ Pick a block i from [d] with probability p_i > 0
▸ Optimize the upper bound (partial gradient step) for block i

    h = argmin_h  f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖²
    h = −(1/L_i) ∇_i f(x_k)

▸ Update the impacted coordinates of x, formally

    x_{k+1}^{(i)} ← x_k^{(i)} + h
    x_{k+1}       ← x_k − (1/L_i) E_i ∇_i f(x_k)

Notice: Original BCD had  x_k^{(i)} = argmin_h f(..., h, ...)   (h in the slot of block i)
We’ll call this BCM (Block Coordinate Minimization)
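A minimal sketch of randomized block coordinate gradient descent in Python (my own illustration; the contiguous blocks, uniform probabilities, and the quadratic test problem are assumptions). For clarity it computes the full gradient and slices out one block, which a serious implementation would avoid.

# Sketch: randomized block coordinate gradient descent.
# Each step takes a 1/L_i partial-gradient step on one randomly chosen block.
import numpy as np

def randomized_bcd(grad_f, blocks, L, x0, num_iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    n = len(blocks)
    for _ in range(num_iters):
        i = rng.integers(n)                    # pick a block uniformly (p_i = 1/n)
        idx = blocks[i]
        g_i = grad_f(x)[idx]                   # block gradient E_i^T grad f(x)
        x[idx] -= g_i / L[i]                   # x^{(i)} <- x^{(i)} - (1/L_i) grad_i f(x)
    return x

if __name__ == "__main__":
    # Quadratic test problem f(x) = 0.5 x^T A x - b^T x with two blocks.
    rng = np.random.default_rng(1)
    Q = rng.normal(size=(6, 6)); A = Q @ Q.T + np.eye(6); b = rng.normal(size=6)
    blocks = [np.arange(0, 3), np.arange(3, 6)]
    L = [np.linalg.norm(A[np.ix_(blk, blk)], 2) for blk in blocks]   # blockwise Lipschitz constants
    x = randomized_bcd(lambda x: A @ x - b, blocks, L, np.zeros(6))
    print(np.linalg.norm(x - np.linalg.solve(A, b)))                 # should be small
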
Exercise: proximal extension

    min  f(x) + r(x)

▸ If block separable, r(x) := ∑_{i=1}^n r_i(x^{(i)})

    x_k^{(i)} = argmin_h  f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖² + r_i(E_i^T x_k + h)
    x_k^{(i)} = prox_{r_i}(· · ·)

Exercise: Fill in the dots

    h = prox_{(1/L_i) r_i}( E_i^T x_k − (1/L_i) ∇_i f(x_k) )

Randomized BCD – analysis

    h ← argmin_h  f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖²

Descent:

    x_{k+1} = x_k + E_i h
    f(x_{k+1}) ≤ f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖²

    x_{k+1} = x_k − (1/L_i) E_i ∇_i f(x_k)
    f(x_{k+1}) ≤ f(x_k) − (1/L_i)‖∇_i f(x_k)‖² + (L_i/2)‖−(1/L_i) ∇_i f(x_k)‖²
    f(x_{k+1}) ≤ f(x_k) − (1/(2L_i))‖∇_i f(x_k)‖².

    f(x_k) − f(x_{k+1}) ≥ (1/(2L_i)) ‖∇_i f(x_k)‖²

Randomized BCD – analysis

Expected descent:

    f(x_k) − E[f(x_{k+1}) | x_k] = ∑_{i=1}^d p_i ( f(x_k) − f(x_k − (1/L_i) E_i ∇_i f(x_k)) )
                                 ≥ ∑_{i=1}^d (p_i / (2L_i)) ‖∇_i f(x_k)‖²
                                 = (1/2) ‖∇f(x_k)‖²_W    (suitable W).

Exercise: What’s the expected descent with uniform probabilities?

Descent plus some more (hard) work yields

    O( (d/ε) ∑_i L_i ‖x_0^{(i)} − x_*^{(i)}‖² )

as the iteration complexity of obtaining E[f(x_k)] − f* ≤ ε

BCD – Exercise

▸ Recall the Lasso problem: min (1/2)‖Ax − b‖² + λ‖x‖_1
▸ Here x ∈ R^N
▸ Make n = N blocks
▸ Show what the Randomized BCD iterations look like
▸ Notice, 1D prox operations for λ|·| arise
▸ Try to implement it as efficiently as you can (i.e., do not copy
  or update more vectors / coordinates than necessary); one possible sketch follows below

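One possible implementation sketch in Python (my own, offered as a starting point rather than the intended solution): single-coordinate blocks, soft-thresholding as the 1-D prox for λ|·|, and a maintained residual so each update touches only one column of A (assumed to have no zero columns).

# Sketch: randomized coordinate descent for Lasso, min 0.5*||Ax - b||^2 + lam*||x||_1.
# The residual r = Ax - b is maintained incrementally so a step costs O(m), not O(mN).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def lasso_rbcd(A, b, lam, num_iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    m, N = A.shape
    x = np.zeros(N)
    r = A @ x - b                          # residual Ax - b
    col_sq = (A ** 2).sum(axis=0)          # L_j for coordinate j is ||a_j||^2
    for _ in range(num_iters):
        j = rng.integers(N)                # pick a coordinate uniformly at random
        g_j = A[:, j] @ r                  # coordinate gradient of the smooth part
        # prox step: x_j <- prox_{(lam/L_j)|.|}(x_j - g_j / L_j)
        x_new = soft_threshold(x[j] - g_j / col_sq[j], lam / col_sq[j])
        if x_new != x[j]:
            r += (x_new - x[j]) * A[:, j]  # keep residual consistent
            x[j] = x_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(100, 20)); x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
    b = A @ x_true + 0.01 * rng.normal(size=100)
    print(np.round(lasso_rbcd(A, b, lam=1.0), 2))   # mostly zeros, large entries near x_true
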
Connections

CD – exercise

    min_x  (1/n) ∑_{i=1}^n f_i(x^T a_i) + (λ/2)‖x‖².

Dual problem

    max_α  (1/n) ∑_{i=1}^n −f_i*(−α_i) − (λ/2) ‖ (1/(λn)) ∑_{i=1}^n α_i a_i ‖²

Exercise: Study the SDCA algorithm and derive a connection
between it and the SAG/SAGA family of algorithms.

S. Shalev-Shwartz, T. Zhang. Stochastic Dual Coordinate Ascent Methods for
Regularized Loss Minimization. JMLR (2013).

Other connections

Explore: Block-Coordinate Frank-Wolfe algorithm.

    min_x f(x),   s.t. x ∈ ∏_i X_i

Explore: Doubly stochastic methods

    min f(x) = ∑_i f_i(x_1, ..., x_d)

Being jointly stochastic over the f_i as well as over the coordinates.

Explore: CD with constraints (linear and nonlinear constraints)
