ADMM
Ryan Tibshirani
Convex Optimization 10-725
Last time: dual ascent
Consider the problem
\[
\min_x \; f(x) \quad \text{subject to } Ax = b
\]
Augmented Lagrangian method modifies the problem, for ρ > 0:
\[
\min_x \; f(x) + \frac{\rho}{2}\|Ax - b\|_2^2 \quad \text{subject to } Ax = b
\]
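To make the recap concrete, here is a minimal sketch of the augmented Lagrangian method (method of multipliers) on an equality-constrained least squares instance. The choice f(x) = (1/2)‖Cx − d‖₂², the names C and d, and the settings of ρ and the iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal sketch of the augmented Lagrangian method (method of multipliers)
# for min f(x) subject to Ax = b, with the illustrative choice
# f(x) = (1/2)||Cx - d||^2 so that the x-update has a closed form.
def method_of_multipliers(C, d, A, b, rho=1.0, iters=100):
    u = np.zeros(A.shape[0])                 # dual variable for Ax = b
    M = C.T @ C + rho * A.T @ A              # normal equations for the x-update
    for _ in range(iters):
        # x-update: argmin_x (1/2)||Cx - d||^2 + u^T (Ax - b) + (rho/2)||Ax - b||^2
        x = np.linalg.solve(M, C.T @ d - A.T @ u + rho * A.T @ b)
        # dual ascent step with step size rho
        u = u + rho * (A @ x - b)
    return x, u
```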
Alternating direction method of multipliers
Alternating direction method of multipliers or ADMM tries for the
best of both methods. Consider a problem of the form:
\[
\min_{x,z} \; f(x) + g(z) \quad \text{subject to } Ax + Bz = c
\]
Convergence guarantees
Note that we do not generically get primal convergence, but this is
true under more assumptions. For details, see Boyd et al. (2010)
Scaled form ADMM
Scaled form: denote w = u/ρ, so augmented Lagrangian becomes
\[
L_\rho(x, z, w) = f(x) + g(z) + \frac{\rho}{2}\|Ax + Bz - c + w\|_2^2 - \frac{\rho}{2}\|w\|_2^2
\]
and ADMM updates become
\[
\begin{aligned}
x^{(k)} &= \operatorname*{argmin}_x \; f(x) + \frac{\rho}{2}\big\|Ax + Bz^{(k-1)} - c + w^{(k-1)}\big\|_2^2 \\
z^{(k)} &= \operatorname*{argmin}_z \; g(z) + \frac{\rho}{2}\big\|Ax^{(k)} + Bz - c + w^{(k-1)}\big\|_2^2 \\
w^{(k)} &= w^{(k-1)} + Ax^{(k)} + Bz^{(k)} - c
\end{aligned}
\]
Note that here the kth iterate w^{(k)} is just a running sum of residuals:
\[
w^{(k)} = w^{(0)} + \sum_{i=1}^{k} \big(Ax^{(i)} + Bz^{(i)} - c\big)
\]
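Here is a minimal sketch of the scaled-form loop above, with the two argmin steps left as user-supplied oracles. The interfaces x_update and z_update are assumed names, not from the slides; any concrete problem would implement them as in the examples that follow.

```python
import numpy as np

# Minimal sketch of scaled-form ADMM for min f(x) + g(z) s.t. Ax + Bz = c.
# x_update(v, rho) is assumed to return argmin_x f(x) + (rho/2)||Ax + v||^2,
# and z_update(v, rho) to return argmin_z g(z) + (rho/2)||Bz + v||^2.
def scaled_admm(x_update, z_update, A, B, c, z0, rho=1.0, iters=100):
    z = z0
    w = np.zeros_like(c, dtype=float)          # scaled dual w = u / rho
    for _ in range(iters):
        x = x_update(B @ z - c + w, rho)       # x-update
        z = z_update(A @ x - c + w, rho)       # z-update
        w = w + A @ x + B @ z - c              # running sum of residuals
    return x, z, w
```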
Outline
Today:
• Examples, practicalities
• Consensus ADMM
• Special decompositions
Connection to proximal operators
Consider ADMM for a problem of the form
\[
\min_{x,z} \; f(x) + g(z) \quad \text{subject to } x - z = 0
\]
i.e., with A = I, B = −I, c = 0 above; then both the x and z
updates reduce to evaluating the proximal operators of f and g at
shifted points
Example: lasso regression
Rewriting the lasso problem as
\[
\min_{\beta,\alpha} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\alpha\|_1 \quad \text{subject to } \beta - \alpha = 0
\]
the ADMM steps are:
\[
\begin{aligned}
\beta^{(k)} &= (X^T X + \rho I)^{-1}\big(X^T y + \rho\,(\alpha^{(k-1)} - w^{(k-1)})\big) \\
\alpha^{(k)} &= S_{\lambda/\rho}\big(\beta^{(k)} + w^{(k-1)}\big) \\
w^{(k)} &= w^{(k-1)} + \beta^{(k)} - \alpha^{(k)}
\end{aligned}
\]
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops, then each β update takes O(p^2) flops
• The α update applies the soft-thresholding operator S_t, which recall is defined as
\[
[S_t(x)]_j = \begin{cases} x_j - t & x_j > t \\ 0 & -t \le x_j \le t \\ x_j + t & x_j < -t \end{cases}, \qquad j = 1, \dots, p
\]
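A minimal sketch of these lasso steps, caching a Cholesky factorization of X^T X + ρI as described in the notes above; the choices of ρ and the iteration count are illustrative, and no convergence check is shown.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def soft_threshold(x, t):
    # elementwise soft-thresholding operator S_t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_admm(X, y, lam, rho=1.0, iters=200):
    n, p = X.shape
    chol = cho_factor(X.T @ X + rho * np.eye(p))   # factor once: O(p^3)
    Xty = X.T @ y
    alpha = np.zeros(p)
    w = np.zeros(p)                                # scaled dual variable
    for _ in range(iters):
        # beta update: solve (X^T X + rho I) beta = X^T y + rho (alpha - w), O(p^2)
        beta = cho_solve(chol, Xty + rho * (alpha - w))
        # alpha update: soft-threshold at level lam / rho
        alpha = soft_threshold(beta + w, lam / rho)
        # scaled dual update: running sum of residuals
        w = w + beta - alpha
    return beta
```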
Comparison of various algorithms for lasso regression: 100 random
instances with n = 200, p = 50
[Figure: suboptimality f^{(k)} − f^⋆ versus iteration k for coordinate descent, proximal gradient, accelerated proximal gradient, and ADMM with ρ = 50, 100, 200]
Practicalities
Example: group lasso regression
Rewrite as:
\[
\min_{\beta,\alpha} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{g=1}^{G} c_g \|\alpha_g\|_2 \quad \text{subject to } \beta - \alpha = 0
\]
ADMM steps:
\[
\begin{aligned}
\beta^{(k)} &= (X^T X + \rho I)^{-1}\big(X^T y + \rho\,(\alpha^{(k-1)} - w^{(k-1)})\big) \\
\alpha^{(k)}_g &= R_{c_g \lambda/\rho}\big(\beta^{(k)}_g + w^{(k-1)}_g\big), \quad g = 1, \dots, G \\
w^{(k)} &= w^{(k-1)} + \beta^{(k)} - \alpha^{(k)}
\end{aligned}
\]
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops, then each β update takes O(p^2) flops
• The α update applies the group soft-thresholding operator R_t, which recall is defined as
\[
R_t(x) = \Big(1 - \frac{t}{\|x\|_2}\Big)_+ x
\]
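A minimal sketch of the group soft-thresholding operator R_t used in the α update. The `groups` argument (a list of index arrays, one per group) is an illustrative way to encode the grouping, not notation from the slides.

```python
import numpy as np

def group_soft_threshold(x, t, groups):
    # groupwise soft-thresholding: scale each block by (1 - t / ||x_g||_2)_+
    out = np.zeros_like(x)
    for g in groups:
        nrm = np.linalg.norm(x[g])
        if nrm > t:
            out[g] = (1.0 - t / nrm) * x[g]
    return out
```

In the α update above, this would be applied to each block β_g + w_g with its own threshold c_g λ/ρ.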
Example: sparse subspace estimation
\[
\mathcal{F}_k = \{Y \in \mathbb{S}^p : 0 \preceq Y \preceq I, \; \mathrm{tr}(Y) = k\}
\]
(the Fantope of order k; see Vu et al., 2013)
Rewrite as:
Example: sparse + low rank decomposition
Given M ∈ R^{n×m}, consider the sparse plus low rank decomposition
problem (Candes et al., 2009):
\[
\min_{L,S} \; \|L\|_{\mathrm{tr}} + \lambda\|S\|_1 \quad \text{subject to } L + S = M
\]
ADMM steps:
\[
\begin{aligned}
L^{(k)} &= S^{\mathrm{tr}}_{1/\rho}\big(M - S^{(k-1)} + W^{(k-1)}\big) \\
S^{(k)} &= S^{\ell_1}_{\lambda/\rho}\big(M - L^{(k)} + W^{(k-1)}\big) \\
W^{(k)} &= W^{(k-1)} + M - L^{(k)} - S^{(k)}
\end{aligned}
\]
where, to distinguish them, we use S^{tr}_{1/ρ} for matrix soft-thresholding
and S^{ℓ1}_{λ/ρ} for elementwise soft-thresholding
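A minimal sketch of these three updates: matrix soft-thresholding is implemented by soft-thresholding the singular values, and the settings of λ, ρ, and the iteration count are illustrative.

```python
import numpy as np

def soft_threshold(X, t):
    # elementwise soft-thresholding S_t^{l1}
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def svd_soft_threshold(X, t):
    # matrix soft-thresholding S_t^{tr}: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def sparse_plus_low_rank(M, lam, rho=1.0, iters=100):
    S = np.zeros_like(M)
    W = np.zeros_like(M)
    for _ in range(iters):
        L = svd_soft_threshold(M - S + W, 1.0 / rho)   # L update
        S = soft_threshold(M - L + W, lam / rho)       # S update
        W = W + M - L - S                              # running sum of residuals
    return L, S
```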
Example from Candes et al. (2009):
Consensus ADMM
Write
\[
\bar{x} = \frac{1}{B} \sum_{i=1}^{B} x_i
\]
and similarly for other variables. Not hard to see that w̄^{(k)} = 0 for all iterations k ≥ 1
Intuition:
• Try to minimize each f_i(x_i), use (squared) ℓ2 regularization to pull each x_i towards the average x̄
• If a variable x_i is bigger than the average, then w_i is increased
• So the regularization in the next step pulls x_i even closer to the average (see the sketch below)
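A minimal sketch of the consensus updates being described here: each block i keeps its own copy x_i, coupled to the others only through the average x̄. The per-block prox oracles passed in `proxes` are assumed inputs (each returns argmin_x f_i(x) + (ρ/2)‖x − v‖₂²); names and defaults are illustrative.

```python
import numpy as np

def consensus_admm(proxes, dim, rho=1.0, iters=100):
    B = len(proxes)
    x = np.zeros((B, dim))           # local copies x_i
    w = np.zeros((B, dim))           # scaled duals; their average is 0 after one step
    xbar = np.zeros(dim)
    for _ in range(iters):
        for i, prox in enumerate(proxes):
            # x_i update: argmin f_i(x_i) + (rho/2)||x_i - xbar + w_i||^2
            x[i] = prox(xbar - w[i], rho)
        xbar = x.mean(axis=0)         # averaging step
        w = w + x - xbar              # if x_i overshoots the average, w_i grows
    return xbar
```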
General consensus ADMM
Consider a problem of the form:
\[
\min_x \; \sum_{i=1}^{B} f_i(a_i^T x + b_i) + g(x)
\]
See Boyd et al. (2010), Parikh and Boyd (2013) for more details
on consensus ADMM, strategies for splitting up into subproblems,
and implementation tips
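Extending the previous sketch with a shared regularizer g, following the general consensus pattern in Boyd et al. (2010): the prox oracles for each f_i (which would absorb the affine maps a_i^T x + b_i) and for g are assumed inputs, and the names and defaults are illustrative.

```python
import numpy as np

def general_consensus_admm(f_proxes, g_prox, dim, rho=1.0, iters=100):
    B = len(f_proxes)
    x = np.zeros((B, dim))           # local copies x_i
    w = np.zeros((B, dim))           # scaled duals
    z = np.zeros(dim)                # shared (consensus) variable
    for _ in range(iters):
        for i, prox in enumerate(f_proxes):
            # x_i update: argmin f_i(x_i) + (rho/2)||x_i - z + w_i||^2
            x[i] = prox(z - w[i], rho)
        # z update: argmin g(z) + (B*rho/2)||z - mean(x + w)||^2
        z = g_prox((x + w).mean(axis=0), B * rho)
        w = w + x - z                # scaled dual update
    return z
```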
Special decompositions
Example: 2d fused lasso
Given data y ∈ R^n observed over the nodes of a 2d grid, the 2d fused lasso (or 2d total variation denoising) problem penalizes differences across the grid edges E:
\[
\min_\theta \; \frac{1}{2}\|y - \theta\|_2^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|
\]
[Figure: variables are black dots; the partitions P and Q are in orange and cyan]
First way to rewrite:
\[
\min_{\theta, z} \; \frac{1}{2}\|y - \theta\|_2^2 + \lambda\|z\|_1 \quad \text{subject to } z = D\theta
\]
where D is the difference operator over the grid edges
Notes:
• The θ update solves a linear system in I + ρL, with L = D^T D the graph Laplacian matrix of the 2d grid, so this can be done efficiently, in roughly O(n) operations
• The z update applies the soft-thresholding operator S_t
• Hence one entire ADMM cycle uses roughly O(n) operations (a code sketch follows below)
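A minimal sketch of one cycle of ADMM for this first formulation. Here D is assumed to be the (sparse) difference operator over the grid edges; in practice one would factorize I + ρL once and reuse it, rather than calling a direct solver every iteration as below.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def fused_lasso_2d_step(y, D, z, w, lam, rho):
    # theta update: solve (I + rho * L) theta = y + rho * D^T (z - w)
    L = D.T @ D                                   # graph Laplacian of the 2d grid
    rhs = y + rho * (D.T @ (z - w))
    theta = spsolve(sp.eye(y.size, format="csc") + rho * L, rhs)
    # z update: soft-threshold D theta + w at level lam / rho
    Dtheta = D @ theta
    z = np.sign(Dtheta + w) * np.maximum(np.abs(Dtheta + w) - lam / rho, 0.0)
    # scaled dual update
    w = w + Dtheta - z
    return theta, z, w
```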
Second way to rewrite:
\[
\min_{H, V} \; \frac{1}{2}\|Y - H\|_F^2 + \lambda \sum_{i,j} \big(|H_{i,j} - H_{i+1,j}| + |V_{i,j} - V_{i,j+1}|\big) \quad \text{subject to } H = V
\]
Notes:
• Both H, V updates solve a (sequence of) 1d fused lassos, where we write
\[
\mathrm{FL}^{\mathrm{1d}}_\tau(a) = \operatorname*{argmin}_x \; \frac{1}{2}\|a - x\|_2^2 + \tau \sum_{i=1}^{d-1} |x_i - x_{i+1}|
\]
• Critical: each 1d fused lasso solution can be computed exactly in O(d) operations with specialized algorithms (e.g., Johnson, 2013; Davies and Kovac, 2001)
• Hence one entire ADMM cycle again uses O(n) operations (see the sketch below)
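A minimal sketch of one cycle of this specialized ADMM. The H and V updates below (a column-wise 1d fused lasso on (Y + ρ(V − W))/(1 + ρ) with penalty λ/(1 + ρ), and a row-wise 1d fused lasso on H + W with penalty λ/ρ) are derived from the scaled-form updates earlier in the lecture rather than copied from the slides, and fl1d(a, tau), a fast 1d fused lasso solver such as Johnson's dynamic programming algorithm, is an assumed helper not shown here.

```python
import numpy as np

def fused_lasso_2d_special_step(Y, V, W, lam, rho, fl1d):
    # H update: 1d fused lasso down each column of (Y + rho*(V - W)) / (1 + rho)
    A = (Y + rho * (V - W)) / (1.0 + rho)
    H = np.stack([fl1d(A[:, j], lam / (1.0 + rho)) for j in range(Y.shape[1])], axis=1)
    # V update: 1d fused lasso along each row of H + W
    B = H + W
    V = np.stack([fl1d(B[i, :], lam / rho) for i in range(Y.shape[0])], axis=0)
    # scaled dual update for the constraint H = V
    W = W + H - V
    return H, V, W
```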
Comparison of 2d fused lasso algorithms: an image of dimension
300 × 200 (so n = 60,000)
Two ADMM algorithms, (say) standard and specialized ADMM:
[Figure: suboptimality f^{(k)} − f^⋆ versus iteration k for the standard and specialized ADMM algorithms]
ADMM iterates visualized after k = 10, 30, 50, 100 iterations:
References and further reading
• A. Barbero and S. Sra (2014), “Modular proximal optimization
for multidimensional total-variation regularization”
• S. Boyd and N. Parikh and E. Chu and B. Peleato and J.
Eckstein (2010), “Distributed optimization and statistical
learning via the alternating direction method of multipliers”
• E. Candes and X. Li and Y. Ma and J. Wright (2009),
“Robust principal component analysis?”
• N. Parikh and S. Boyd (2013), “Proximal algorithms”
• A. Ramdas and R. Tibshirani (2014), “Fast and flexible
ADMM algorithms for trend filtering”
• V. Vu and J. Cho and J. Lei and K. Rohe (2013), "Fantope
projection and selection: a near-optimal convex relaxation of
sparse PCA”
• M. Wytock and S. Sra and Z. Kolter (2014), "Fast Newton
methods for the group fused lasso”