Optimization for Machine Learning

Lecture 12: Coordinate Descent, BCD, Altmin


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

01 Apr, 2021
Coordinate descent

So far:  min f(x) = ∑_{i=1}^n f_i(x)

Since x ∈ R^d, now consider

    min f(x) = f(x_1, x_2, ..., x_d)

Previously, we went through f_1, ..., f_n

What if we now go through x_1, ..., x_d one by one?

Explore: Going through both [n] and [d]?

Coordinate descent

▪ For k = 0, 1, ...
    Pick an index i from {1, ..., d}
    Optimize the i-th coordinate:

        x_i^{k+1} ← argmin_{ξ ∈ R} f(x_1^{k+1}, ..., x_{i-1}^{k+1}, ξ, x_{i+1}^k, ..., x_d^k)

    (coordinates 1, ..., i−1: done; ξ: current; i+1, ..., d: todo)
▪ Decide when/how to stop; return x^k

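As a concrete illustration (my own sketch, not part of the lecture), here is a minimal cyclic coordinate descent loop in Python; the function names and the quadratic test problem are invented for the example, and a generic 1-D numerical minimizer stands in for the exact coordinate-wise argmin.

# Minimal sketch: cyclic coordinate descent with a 1-D numerical inner solver.
# Assumes f: R^d -> R is cheap to evaluate.
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, num_epochs=50):
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    for _ in range(num_epochs):
        for i in range(d):                      # cyclic order 1, ..., d
            def f_i(xi, i=i):                   # objective as a function of coordinate i only
                x[i] = xi
                return f(x)
            res = minimize_scalar(f_i)          # argmin over the single coordinate
            x[i] = res.x
    return x

if __name__ == "__main__":
    # Example: a simple convex quadratic f(x) = 0.5 x^T A x - b^T x
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    print(coordinate_descent(f, np.zeros(2)))   # approaches A^{-1} b = (0.6, -0.8)

In practice the inner 1-D problem is solved in closed form when available (as in the least-squares exercise below) or replaced by a coordinate gradient step (as in randomized BCD later in the lecture).
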
Coordinate descent - context
♣ One of the simplest optimization methods
♣ Old idea: Gauss-Seidel, Jacobi methods for linear systems!
♣ Can be “slow”, but sometimes very competitive
♣ Gradient, subgradient, incremental methods also “slow”
♣ But incremental, stochastic gradient methods are scalable
♣ Renewed interest in CD was driven by ML
♣ Notice: in general CD is “derivative free”

CD – which coordinate?

Gauss-Southwell: If f is differentiable, at iteration k, pick the
index i that maximizes |[∇f(x^k)]_i|

Derivative-free rules:
♣ Cyclic order 1, 2, ..., d, 1, ...
♣ Almost cyclic: each coordinate 1 ≤ i ≤ d is picked at least
  once every B successive iterations (B ≥ d)
♣ Double sweep: 1, ..., d then d − 1, ..., 1, repeat
♣ Cyclic with permutation: random order each cycle
♣ Random sampling: pick a random index at each iteration

Which ones would you prefer? Why?

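The derivative-free orderings are easy to express as index generators; the small sketch below (my own illustration, not from the slides) shows cyclic, cyclic-with-permutation, and uniform random sampling.

# Sketch of three index-selection rules for CD (illustrative only).
import numpy as np

def cyclic(d):
    k = 0
    while True:
        yield k % d
        k += 1

def cyclic_with_permutation(d, rng=np.random.default_rng(0)):
    while True:
        for i in rng.permutation(d):   # fresh random order each cycle
            yield int(i)

def random_sampling(d, rng=np.random.default_rng(0)):
    while True:
        yield int(rng.integers(d))     # i.i.d. uniform index each iteration
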
Exercise: CD for least squares

    min_x  ‖Ax − b‖_2^2

Exercise: Obtain an update for the j-th coordinate

Coordinate descent update

    x_j ← ( ∑_{i=1}^m a_ij ( b_i − ∑_{l≠j} a_il x_l ) ) / ( ∑_{i=1}^m a_ij^2 )

(dropped superscripts, since we overwrite)

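A minimal Python sketch of this update (my own illustration; it assumes A has no zero columns), maintaining the residual r = b − Ax so that each coordinate update costs O(m):

# Sketch: cyclic CD for least squares min ||Ax - b||^2, with residual maintenance.
import numpy as np

def cd_least_squares(A, b, num_epochs=200):
    m, d = A.shape
    x = np.zeros(d)
    r = b - A @ x                      # residual b - Ax
    col_sq = (A ** 2).sum(axis=0)      # precompute sum_i a_ij^2 per column
    for _ in range(num_epochs):
        for j in range(d):
            # optimal x_j given the others: x_j += a_j^T r / ||a_j||^2
            delta = A[:, j] @ r / col_sq[j]
            x[j] += delta
            r -= delta * A[:, j]       # keep residual consistent with the new x
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
    x_cd = cd_least_squares(A, b)
    x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
    print(np.linalg.norm(x_cd - x_ls))   # should be tiny
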
Coordinate descent – some remarks
Advantages
♦ Each iteration usually cheap (single variable optimization)
♦ No extra storage vectors needed
♦ No stepsize tuning
♦ No other pesky parameters (usually) that must be tuned
♦ Simple to implement
♦ Can work well for large-scale problems
Disadvantages
♠ Tricky if single variable optimization is hard
♠ Convergence theory can be complicated
♠ Can slow down near optimum
♠ Nonsmooth case more tricky
♠ Explore: not easy to use for deep learning...

BCD
(Basics, Convergence)

Block coordinate descent (BCD)

    min f(x) := f(x_1, ..., x_m),    x ∈ X_1 × X_2 × · · · × X_m.

Gauss-Seidel update

    x_i^{k+1} ← argmin_{ξ ∈ X_i} f(x_1^{k+1}, ..., x_{i-1}^{k+1}, ξ, x_{i+1}^k, ..., x_m^k)

    (blocks 1, ..., i−1: done; ξ: current; i+1, ..., m: todo)

Jacobi update (easy to parallelize)

    x_i^{k+1} ← argmin_{ξ ∈ X_i} f(x_1^k, ..., x_{i-1}^k, ξ, x_{i+1}^k, ..., x_m^k)

    (blocks 1, ..., i−1: don't clobber; ξ: current; i+1, ..., m: todo)
BCD – convergence

Theorem. Let f be C^1 over X := ∏_{i=1}^m X_i. Assume for each
block i and x ∈ X, the minimum

    min_{ξ ∈ X_i} f(x_1, ..., x_{i−1}, ξ, x_{i+1}, ..., x_m)

is uniquely attained. Then, every limit point of the sequence
{x^k} generated by BCD is a stationary point of f.

Corollary. If f is in addition convex, then every limit point of
the BCD sequence {x^k} is a global minimum.

▸ Unique solutions of subproblems not always possible
▸ Above result is only asymptotic (holds in the limit)
▸ Warning! BCD may cycle indefinitely without converging,
  if the number of blocks is > 2 and the objective is nonconvex.

BCD – Two blocks

Two-block BCD

    minimize f(x) = f(x_1, x_2),    x ∈ X_1 × X_2.

Theorem (Grippo & Sciandrone (2000)). Let f be continuously
differentiable. Let X_1, X_2 be closed and convex. Assume
both BCD subproblems have solutions and the sequence {x^k}
has limit points. Then, every limit point of {x^k} is stationary.

▸ No need of unique solutions to subproblems
▸ BCD for 2 blocks is also called: Alternating Minimization

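As an illustration of two-block alternating minimization (my own sketch, not from the slides), consider low-rank factorization min_{U,V} ‖M − U V^T‖_F^2; each block subproblem is a linear least-squares problem with a closed-form solution.

# Sketch: two-block AltMin for  min_{U,V} ||M - U V^T||_F^2,  U in R^{m x r}, V in R^{n x r}.
import numpy as np

def altmin_factorize(M, r, num_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = M.shape
    V = rng.normal(size=(n, r))
    for _ in range(num_iters):
        # Block 1: fix V, solve the least-squares problem for U
        U = np.linalg.lstsq(V, M.T, rcond=None)[0].T
        # Block 2: fix U, solve the least-squares problem for V
        V = np.linalg.lstsq(U, M, rcond=None)[0].T
    return U, V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20))   # exactly rank 3
    U, V = altmin_factorize(M, r=3)
    print(np.linalg.norm(M - U @ V.T))                        # should be ~0
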
CD – projection onto convex sets

    min  (1/2)‖x − y‖_2^2
    s.t. x ∈ C_1 ∩ C_2 ∩ · · · ∩ C_m.

Solution 1: Rewrite using indicator functions

    min  (1/2)‖x − y‖_2^2 + ∑_{i=1}^m δ_{C_i}(x).

▸ Now invoke Douglas-Rachford using the product-space trick

Solution 2: Take dual of the above formulation

Solution 1: Product space technique

▸ Original problem over H = R^n
▸ Suppose we have ∑_{i=1}^n f_i(x)
▸ Introduce n new variables (x_1, ..., x_n)
▸ Now the problem is over the domain H^n := ∏_{i=1}^n H
▸ New constraint: x_1 = x_2 = · · · = x_n

    min_{(x_1, ..., x_n)}  ∑_i f_i(x_i)
    s.t.  x_1 = x_2 = · · · = x_n.

Technique due to: G. Pierra (1976)

Solution 1: Product space technique

    min_x  f(x) + 1_B(x)

where x ∈ H^n and B = {z ∈ H^n | z = (x, x, ..., x)}

▸ Let y = (y_1, ..., y_n)
▸ prox_f(y) = (prox_{f_1}(y_1), ..., prox_{f_n}(y_n))
▸ prox_B ≡ Π_B(y) can be solved as follows:

    min_{z ∈ B} (1/2)‖z − y‖_2^2   ⟺   min_{x ∈ H} ∑_i (1/2)‖x − y_i‖_2^2   =⇒   x = (1/n) ∑_i y_i

Exercise: Work out the details of the Douglas-Rachford
algorithm using the above product space trick.
Remark: This technique commonly exploited in ADMM too

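A minimal sketch (my own, under the stated assumptions) of how the product-space trick can be combined with Douglas-Rachford for the projection problem above: the quadratic term and each indicator δ_{C_i} become the blocks of f, prox_f acts blockwise, and the prox of 1_B is the consensus averaging derived on this slide. The particular sets (a box and a halfspace), the data, and the step size are invented for the example.

# Sketch: Douglas-Rachford in the product space H^n, splitting f(z) = sum_i f_i(z_i)
# against the consensus indicator 1_B, B = {z : z_1 = ... = z_n}.
import numpy as np

def product_space_dr(prox_list, x_shape, num_iters=500, gamma=1.0, seed=0):
    n = len(prox_list)
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n,) + x_shape)          # governing sequence in H^n
    for _ in range(num_iters):
        # prox of f: blockwise prox of each f_i
        x = np.stack([prox_list[i](z[i], gamma) for i in range(n)])
        # prox of 1_B: project onto the consensus set (replicate the average)
        w = np.broadcast_to((2 * x - z).mean(axis=0), z.shape)
        z = z + w - x                             # DR update
    return x.mean(axis=0)                         # consensus estimate

if __name__ == "__main__":
    # Project y onto the intersection of a box and a halfspace {x : a^T x <= b}.
    y = np.array([3.0, -2.0]); a = np.array([1.0, 1.0]); b = 0.5
    prox_quad = lambda v, g: (v + g * y) / (1 + g)               # prox of (1/2)||x - y||^2
    proj_box  = lambda v, g: np.clip(v, -1.0, 1.0)               # prox of a box indicator
    proj_half = lambda v, g: v - max(0.0, a @ v - b) / (a @ a) * a
    print(product_space_dr([prox_quad, proj_box, proj_half], y.shape))

For this data the projection onto the intersection is (1, −1), so the printed consensus point should be close to that.
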
Solution 2: proximal Dykstra

    min  (1/2)‖x − y‖_2^2 + f(x) + h(x)

    L(x, z, w, ν, µ) := (1/2)‖x − y‖_2^2 + f(z) + h(w) + ν^T(x − z) + µ^T(x − w)

    g(ν, µ) := inf_{x,z,w} L(x, z, w, ν, µ)

    x − y + ν + µ = 0  =⇒  x = y − ν − µ

    g(ν, µ) = −(1/2)‖ν + µ‖_2^2 + (ν + µ)^T y − f*(ν) − h*(µ)

Dual as minimization problem

    min  k(ν, µ) := (1/2)‖ν + µ − y‖_2^2 + f*(ν) + h*(µ)

The Proximal-Dykstra method

Apply CD to k(ν, µ) = (1/2)‖ν + µ − y‖_2^2 + f*(ν) + h*(µ)

    ν_{k+1} = argmin_ν k(ν, µ_k)
    µ_{k+1} = argmin_µ k(ν_{k+1}, µ)

▸ 0 ∈ ν + µ_k − y + ∂f*(ν)
▸ 0 ∈ ν_{k+1} + µ − y + ∂h*(µ)
▸ y − µ_k ∈ ν + ∂f*(ν) = (I + ∂f*)(ν)
  =⇒ ν = prox_{f*}(y − µ_k)  =⇒  ν = y − µ_k − prox_f(y − µ_k)
▸ Similarly, we see that
  µ = y − ν_{k+1} − prox_h(y − ν_{k+1})

    ν_{k+1} ← y − µ_k − prox_f(y − µ_k)
    µ_{k+1} ← y − ν_{k+1} − prox_h(y − ν_{k+1})

Proximal-Dykstra as CD

▪ Simplify, and use Lagrangian stationarity to obtain the primal

    x = y − ν − µ  =⇒  y − µ = x + ν

▪ Thus, the CD iteration may be rewritten as

    t_k      ← prox_f(x_k + ν_k)
    ν_{k+1}  ← x_k + ν_k − t_k
    x_{k+1}  ← prox_h(µ_k + t_k)
    µ_{k+1}  ← µ_k + t_k − x_{k+1}

▪ We used: prox_h(y − ν_{k+1}) = y − ν_{k+1} − µ_{k+1} = x_{k+1}

▪ This is the proximal-Dykstra method!
Explore: Pros-cons of Prox-Dykstra versus product space + DR

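A minimal Python sketch of these four updates (my own illustration; prox_f and prox_h are assumed to be available), applied to projecting a point onto the intersection of two convex sets, where both proxes are Euclidean projections:

# Sketch: proximal-Dykstra iteration for  min (1/2)||x - y||^2 + f(x) + h(x).
import numpy as np

def proximal_dykstra(y, prox_f, prox_h, num_iters=200):
    x, nu, mu = y.copy(), np.zeros_like(y), np.zeros_like(y)
    for _ in range(num_iters):
        t = prox_f(x + nu)            # t_k      <- prox_f(x_k + nu_k)
        nu = x + nu - t               # nu_{k+1} <- x_k + nu_k - t_k
        x_new = prox_h(mu + t)        # x_{k+1}  <- prox_h(mu_k + t_k)
        mu = mu + t - x_new           # mu_{k+1} <- mu_k + t_k - x_{k+1}
        x = x_new
    return x

if __name__ == "__main__":
    # Project y onto {x : ||x||_2 <= 1} ∩ {x : x >= 0}; f, h are the two indicators,
    # so their proxes are Euclidean projections.
    y = np.array([2.0, -1.0, 0.5])
    proj_ball = lambda v: v / max(1.0, np.linalg.norm(v))
    proj_nonneg = lambda v: np.maximum(v, 0.0)
    print(proximal_dykstra(y, proj_ball, proj_nonneg))
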
CD – nonsmooth case

CD for nonsmooth convex problems

    min  |x_1 − x_2| + (1/2)|x_1 + x_2|

(At any point with x_1 = x_2 ≠ 0, changing either coordinate alone increases the
objective, yet the minimum value 0 is attained only at the origin: plain CD can get
stuck at non-optimal points when the nonsmooth part couples the coordinates.)
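A quick numerical illustration of the previous remark (my own check, not from the slides):

# Tiny check: at (1, 1) no single-coordinate move improves f, although f(0, 0) = 0 is better.
import numpy as np

f = lambda x1, x2: abs(x1 - x2) + 0.5 * abs(x1 + x2)
ts = np.linspace(-2.0, 2.0, 4001)
print(min(f(1 + t, 1) for t in ts) >= f(1, 1) - 1e-12)   # True: x1-moves don't help
print(min(f(1, 1 + t) for t in ts) >= f(1, 1) - 1e-12)   # True: x2-moves don't help
print(f(0, 0))                                           # 0.0
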
CD for separable nonsmoothness

▸ Nonsmooth part is separable

    min_{x ∈ R^d}  f(x) + ∑_{i=1}^d r_i(x_i)

Theorem. Suppose f is convex and continuously differentiable, each
r_i(x) is closed and convex, and each coordinatewise minimization admits
a unique solution. Further, assume we go through all coordinates in an
essentially cyclic way. Then, the sequence {x^k} generated by
CD is bounded, and every limit point of it is optimal.

Remark: A related result for nonconvex problems with separable non-smoothness
(under more assumptions) can be found in: “Convergence of Block Coordinate
Descent Method for Nondifferentiable Minimization” by P. Tseng (2001).

CD – iteration complexity

CD non-asymptotic rate

▸ So far, we saw CD based on essentially cyclic rules
▸ It is difficult to prove global convergence and almost
  impossible to estimate the global rate of convergence
▸ Above results highlighted at best local (asymptotic) rates

▪ Consider the unconstrained problem min f(x), s.t. x ∈ R^d
▪ Assume f is convex, with componentwise Lipschitz gradients

    |∇_i f(x + h e_i) − ∇_i f(x)| ≤ L_i |h|,    x ∈ R^d, h ∈ R.

  Here e_i denotes the i-th canonical basis vector

Choose x_0 ∈ R^d. Let M = max_i L_i. For k ≥ 0

    i_k     = argmax_{1 ≤ i ≤ d} |∇_i f(x_k)|
    x_{k+1} = x_k − (1/M) ∇_{i_k} f(x_k) e_{i_k}.

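A minimal sketch of this greedy rule in Python (my own illustration; grad_f, the componentwise constants L_i, and the quadratic test problem are assumptions of the example):

# Sketch: greedy CD with step 1/M, M = max_i L_i, picking i_k = argmax_i |grad_i f(x_k)|.
import numpy as np

def greedy_cd(grad_f, L, x0, num_iters=1000):
    x = x0.astype(float)
    M = np.max(L)                       # M = max_i L_i
    for _ in range(num_iters):
        g = grad_f(x)                   # note: needs the full gradient each step
        i = int(np.argmax(np.abs(g)))   # greedy index choice
        x[i] -= g[i] / M                # coordinate step of length 1/M
    return x

if __name__ == "__main__":
    # Quadratic test problem f(x) = 0.5 x^T A x - b^T x; componentwise L_i = A_ii.
    A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    b = np.array([1.0, 0.0, -1.0])
    x = greedy_cd(lambda x: A @ x - b, L=np.diag(A), x0=np.zeros(3))
    print(np.linalg.norm(x - np.linalg.solve(A, b)))   # should be tiny
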
CD – non-asymptotic convergence

Theorem. Let {x_k} be the iterate sequence generated by the above
greedy CD method. Then,

    f(x_k) − f* ≤ 2dM‖x_0 − x*‖_2^2 / (k + 4),    k ≥ 0.

▸ Looks like the gradient-descent O(1/k) bound for C^1_L convex f
▸ Notice the factor of d in the numerator!
▸ But this method is impractical
▸ At each step, it requires access to the full gradient
▸ Might as well use ordinary gradient methods!
▸ Also, if f ∈ C^1_L, the constant dM can easily be much larger than L
▸ So the above rate is, in general, worse than that of gradient methods

BCD – Notation

▸ Decomposition: E = [E_1, ..., E_n] into n blocks
▸ Corresponding decomposition of x is

    (E_1^T x, E_2^T x, ..., E_n^T x) = (x^(1), x^(2), ..., x^(n)),   with N_1 + N_2 + · · · + N_n = N

▸ Observation:

    E_i^T E_j = I_{N_i}       if i = j
    E_i^T E_j = 0_{N_i, N_j}  if i ≠ j

▸ So the E_i's define our partitioning of the coordinates
▸ Just fancier notation for a random partition of coordinates
▸ Now with this notation ...

BCD – formal setup

    min f(x),  where x ∈ R^d

Assume the gradient of block i is Lipschitz continuous**

    ‖∇_i f(x + E_i h) − ∇_i f(x)‖_* ≤ L_i ‖h‖

Block gradient ∇_i f(x) is the projection of the full gradient: E_i^T ∇f(x)

** each block can use its own norm

Block Coordinate “Gradient” Descent

▸ Using the descent lemma, we have blockwise upper bounds

    f(x + E_i h) ≤ f(x) + ⟨∇_i f(x), h⟩ + (L_i/2)‖h‖²,    for i = 1, ..., d.

▸ At each step, minimize these upper bounds!

Randomized BCD

▸ For k ≥ 0 (no init. of x necessary)
▸ Pick a block i from [d] with probability p_i > 0
▸ Optimize the upper bound (partial gradient step) for block i

    h = argmin_h  f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖²
    h = −(1/L_i) ∇_i f(x_k)

▸ Update the impacted coordinates of x, formally

    x_{k+1}^{(i)} ← x_k^{(i)} + h
    x_{k+1}       ← x_k − (1/L_i) E_i ∇_i f(x_k)

Notice: Original BCD had  x_k^{(i)} = argmin_h f(..., h, ...)   (h in the slot of block i)
We’ll call this BCM (Block Coordinate Minimization)
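A minimal sketch of randomized block coordinate gradient descent in Python (my own illustration; the contiguous blocks, uniform probabilities, and the quadratic test problem are assumptions). For clarity it computes the full gradient and slices out one block, which a serious implementation would avoid.

# Sketch: randomized block coordinate gradient descent.
# Each step takes a 1/L_i partial-gradient step on one randomly chosen block.
import numpy as np

def randomized_bcd(grad_f, blocks, L, x0, num_iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    n = len(blocks)
    for _ in range(num_iters):
        i = rng.integers(n)                    # pick a block uniformly (p_i = 1/n)
        idx = blocks[i]
        g_i = grad_f(x)[idx]                   # block gradient E_i^T grad f(x)
        x[idx] -= g_i / L[i]                   # x^{(i)} <- x^{(i)} - (1/L_i) grad_i f(x)
    return x

if __name__ == "__main__":
    # Quadratic test problem f(x) = 0.5 x^T A x - b^T x with two blocks.
    rng = np.random.default_rng(1)
    Q = rng.normal(size=(6, 6)); A = Q @ Q.T + np.eye(6); b = rng.normal(size=6)
    blocks = [np.arange(0, 3), np.arange(3, 6)]
    L = [np.linalg.norm(A[np.ix_(blk, blk)], 2) for blk in blocks]   # blockwise Lipschitz constants
    x = randomized_bcd(lambda x: A @ x - b, blocks, L, np.zeros(6))
    print(np.linalg.norm(x - np.linalg.solve(A, b)))                 # should be small
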
Exercise: proximal extension

    min  f(x) + r(x)

▸ If block separable, r(x) := ∑_{i=1}^n r_i(x^{(i)})

    x_k^{(i)} = argmin_h  f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖² + r_i(E_i^T x_k + h)
    x_k^{(i)} = prox_{r_i}(· · ·)

Exercise: Fill in the dots

    h = prox_{(1/L_i) r_i}( E_i^T x_k − (1/L_i) ∇_i f(x_k) )

Randomized BCD – analysis

    h ← argmin_h  f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖²

Descent:

    x_{k+1} = x_k + E_i h
    f(x_{k+1}) ≤ f(x_k) + ⟨∇_i f(x_k), h⟩ + (L_i/2)‖h‖²

    x_{k+1} = x_k − (1/L_i) E_i ∇_i f(x_k)
    f(x_{k+1}) ≤ f(x_k) − (1/L_i)‖∇_i f(x_k)‖² + (L_i/2)‖−(1/L_i) ∇_i f(x_k)‖²
    f(x_{k+1}) ≤ f(x_k) − (1/(2L_i))‖∇_i f(x_k)‖².

    f(x_k) − f(x_{k+1}) ≥ (1/(2L_i)) ‖∇_i f(x_k)‖²

Randomized BCD – analysis

Expected descent:

    f(x_k) − E[f(x_{k+1}) | x_k] = ∑_{i=1}^d p_i ( f(x_k) − f(x_k − (1/L_i) E_i ∇_i f(x_k)) )
                                 ≥ ∑_{i=1}^d (p_i / (2L_i)) ‖∇_i f(x_k)‖²
                                 = (1/2) ‖∇f(x_k)‖²_W    (suitable W).

Exercise: What’s the expected descent with uniform probabilities?

Descent plus some more (hard) work yields

    O( (d/ε) ∑_i L_i ‖x_0^{(i)} − x_*^{(i)}‖² )

as the iteration complexity of obtaining E[f(x_k)] − f* ≤ ε

BCD – Exercise

▸ Recall the Lasso problem: min (1/2)‖Ax − b‖² + λ‖x‖_1
▸ Here x ∈ R^N
▸ Make n = N blocks
▸ Show what the Randomized BCD iterations look like
▸ Notice, 1D prox operations for λ|·| arise
▸ Try to implement it as efficiently as you can (i.e., do not copy
  or update more vectors / coordinates than necessary); one possible sketch follows below

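One possible implementation sketch in Python (my own, offered as a starting point rather than the intended solution): single-coordinate blocks, soft-thresholding as the 1-D prox for λ|·|, and a maintained residual so each update touches only one column of A (assumed to have no zero columns).

# Sketch: randomized coordinate descent for Lasso, min 0.5*||Ax - b||^2 + lam*||x||_1.
# The residual r = Ax - b is maintained incrementally so a step costs O(m), not O(mN).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def lasso_rbcd(A, b, lam, num_iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    m, N = A.shape
    x = np.zeros(N)
    r = A @ x - b                          # residual Ax - b
    col_sq = (A ** 2).sum(axis=0)          # L_j for coordinate j is ||a_j||^2
    for _ in range(num_iters):
        j = rng.integers(N)                # pick a coordinate uniformly at random
        g_j = A[:, j] @ r                  # coordinate gradient of the smooth part
        # prox step: x_j <- prox_{(lam/L_j)|.|}(x_j - g_j / L_j)
        x_new = soft_threshold(x[j] - g_j / col_sq[j], lam / col_sq[j])
        if x_new != x[j]:
            r += (x_new - x[j]) * A[:, j]  # keep residual consistent
            x[j] = x_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(100, 20)); x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
    b = A @ x_true + 0.01 * rng.normal(size=100)
    print(np.round(lasso_rbcd(A, b, lam=1.0), 2))   # mostly zeros, large entries near x_true
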
Connections

CD – exercise

    min_x  (1/n) ∑_{i=1}^n f_i(x^T a_i) + (λ/2)‖x‖².

Dual problem

    max_α  (1/n) ∑_{i=1}^n −f_i*(−α_i) − (λ/2) ‖ (1/(λn)) ∑_{i=1}^n α_i a_i ‖²

Exercise: Study the SDCA algorithm and derive a connection
between it and the SAG/SAGA family of algorithms.

S. Shalev-Shwartz, T. Zhang. Stochastic Dual Coordinate Ascent Methods for
Regularized Loss Minimization. JMLR (2013).

Other connections

Explore: Block-Coordinate Frank-Wolfe algorithm.

    min_x f(x),   s.t. x ∈ ∏_i X_i

Explore: Doubly stochastic methods

    min f(x) = ∑_i f_i(x_1, ..., x_d)

Being jointly stochastic over the f_i as well as over the coordinates.

Explore: CD with constraints (linear and nonlinear constraints)
