ADMM
Ryan Tibshirani
Convex Optimization 10-725
Last time: dual ascent
Consider the problem
\[
\min_x \; f(x) \quad \text{subject to } Ax = b
\]
Augmented Lagrangian method modifies the problem, for ρ > 0:
\[
\min_x \; f(x) + \frac{\rho}{2}\|Ax - b\|_2^2 \quad \text{subject to } Ax = b
\]
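To make the recap concrete, here is a minimal sketch of the augmented Lagrangian method (method of multipliers) on an equality-constrained least squares instance. The choice f(x) = (1/2)‖Cx − d‖₂², the names C and d, and the settings of ρ and the iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal sketch of the augmented Lagrangian method (method of multipliers)
# for min f(x) subject to Ax = b, with the illustrative choice
# f(x) = (1/2)||Cx - d||^2 so that the x-update has a closed form.
def method_of_multipliers(C, d, A, b, rho=1.0, iters=100):
    u = np.zeros(A.shape[0])                 # dual variable for Ax = b
    M = C.T @ C + rho * A.T @ A              # normal equations for the x-update
    for _ in range(iters):
        # x-update: argmin_x (1/2)||Cx - d||^2 + u^T (Ax - b) + (rho/2)||Ax - b||^2
        x = np.linalg.solve(M, C.T @ d - A.T @ u + rho * A.T @ b)
        # dual ascent step with step size rho
        u = u + rho * (A @ x - b)
    return x, u
```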
Alternating direction method of multipliers
Alternating direction method of multipliers or ADMM tries for the
best of both methods. Consider a problem of the form:
\[
\min_{x,z} \; f(x) + g(z) \quad \text{subject to } Ax + Bz = c
\]
Convergence guarantees
Note that we do not generically get primal convergence, but this is
true under more assumptions. For details, see Boyd et al. (2010)
Scaled form ADMM
Scaled form: denote w = u/ρ, so augmented Lagrangian becomes
\[
L_\rho(x, z, w) = f(x) + g(z) + \frac{\rho}{2}\|Ax + Bz - c + w\|_2^2 - \frac{\rho}{2}\|w\|_2^2
\]
and ADMM updates become
\[
\begin{aligned}
x^{(k)} &= \operatorname*{argmin}_x \; f(x) + \frac{\rho}{2}\big\|Ax + Bz^{(k-1)} - c + w^{(k-1)}\big\|_2^2 \\
z^{(k)} &= \operatorname*{argmin}_z \; g(z) + \frac{\rho}{2}\big\|Ax^{(k)} + Bz - c + w^{(k-1)}\big\|_2^2 \\
w^{(k)} &= w^{(k-1)} + Ax^{(k)} + Bz^{(k)} - c
\end{aligned}
\]
Note that here the kth iterate w^{(k)} is just a running sum of residuals:
\[
w^{(k)} = w^{(0)} + \sum_{i=1}^{k} \big(Ax^{(i)} + Bz^{(i)} - c\big)
\]
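Here is a minimal sketch of the scaled-form loop above, with the two argmin steps left as user-supplied oracles. The interfaces x_update and z_update are assumed names, not from the slides; any concrete problem would implement them as in the examples that follow.

```python
import numpy as np

# Minimal sketch of scaled-form ADMM for min f(x) + g(z) s.t. Ax + Bz = c.
# x_update(v, rho) is assumed to return argmin_x f(x) + (rho/2)||Ax + v||^2,
# and z_update(v, rho) to return argmin_z g(z) + (rho/2)||Bz + v||^2.
def scaled_admm(x_update, z_update, A, B, c, z0, rho=1.0, iters=100):
    z = z0
    w = np.zeros_like(c, dtype=float)          # scaled dual w = u / rho
    for _ in range(iters):
        x = x_update(B @ z - c + w, rho)       # x-update
        z = z_update(A @ x - c + w, rho)       # z-update
        w = w + A @ x + B @ z - c              # running sum of residuals
    return x, z, w
```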
Outline
Today:
• Examples, practicalities
• Consensus ADMM
• Special decompositions
Connection to proximal operators
Consider ADMM for a problem of the form
\[
\min_{x,z} \; f(x) + g(z) \quad \text{subject to } x - z = 0
\]
i.e., with A = I, B = −I, c = 0 above; then both the x and z
updates reduce to evaluating the proximal operators of f and g at
shifted points
Example: lasso regression
Rewriting the lasso problem as
\[
\min_{\beta,\alpha} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\alpha\|_1 \quad \text{subject to } \beta - \alpha = 0
\]
the ADMM steps are:
\[
\begin{aligned}
\beta^{(k)} &= (X^T X + \rho I)^{-1}\big(X^T y + \rho\,(\alpha^{(k-1)} - w^{(k-1)})\big) \\
\alpha^{(k)} &= S_{\lambda/\rho}\big(\beta^{(k)} + w^{(k-1)}\big) \\
w^{(k)} &= w^{(k-1)} + \beta^{(k)} - \alpha^{(k)}
\end{aligned}
\]
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops, then each β update takes O(p^2) flops
• The α update applies the soft-thresholding operator S_t, which recall is defined as
\[
[S_t(x)]_j = \begin{cases} x_j - t & x_j > t \\ 0 & -t \le x_j \le t \\ x_j + t & x_j < -t \end{cases}, \qquad j = 1, \dots, p
\]
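A minimal sketch of these lasso steps, caching a Cholesky factorization of X^T X + ρI as described in the notes above; the choices of ρ and the iteration count are illustrative, and no convergence check is shown.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def soft_threshold(x, t):
    # elementwise soft-thresholding operator S_t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_admm(X, y, lam, rho=1.0, iters=200):
    n, p = X.shape
    chol = cho_factor(X.T @ X + rho * np.eye(p))   # factor once: O(p^3)
    Xty = X.T @ y
    alpha = np.zeros(p)
    w = np.zeros(p)                                # scaled dual variable
    for _ in range(iters):
        # beta update: solve (X^T X + rho I) beta = X^T y + rho (alpha - w), O(p^2)
        beta = cho_solve(chol, Xty + rho * (alpha - w))
        # alpha update: soft-threshold at level lam / rho
        alpha = soft_threshold(beta + w, lam / rho)
        # scaled dual update: running sum of residuals
        w = w + beta - alpha
    return beta
```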
Comparison of various algorithms for lasso regression: 100 random
instances with n = 200, p = 50
[Figure: suboptimality f^{(k)} − f^⋆ versus iteration k for coordinate descent, proximal gradient, accelerated proximal gradient, and ADMM with ρ = 50, 100, 200]
Practicalities
Example: group lasso regression
Rewrite as:
\[
\min_{\beta,\alpha} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{g=1}^{G} c_g \|\alpha_g\|_2 \quad \text{subject to } \beta - \alpha = 0
\]
ADMM steps:
\[
\begin{aligned}
\beta^{(k)} &= (X^T X + \rho I)^{-1}\big(X^T y + \rho\,(\alpha^{(k-1)} - w^{(k-1)})\big) \\
\alpha^{(k)}_g &= R_{c_g \lambda/\rho}\big(\beta^{(k)}_g + w^{(k-1)}_g\big), \quad g = 1, \dots, G \\
w^{(k)} &= w^{(k-1)} + \beta^{(k)} - \alpha^{(k)}
\end{aligned}
\]
Notes:
• The matrix X^T X + ρI is always invertible, regardless of X
• If we compute a factorization (say Cholesky) in O(p^3) flops, then each β update takes O(p^2) flops
• The α update applies the group soft-thresholding operator R_t, which recall is defined as
\[
R_t(x) = \Big(1 - \frac{t}{\|x\|_2}\Big)_+ x
\]
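A minimal sketch of the group soft-thresholding operator R_t used in the α update. The `groups` argument (a list of index arrays, one per group) is an illustrative way to encode the grouping, not notation from the slides.

```python
import numpy as np

def group_soft_threshold(x, t, groups):
    # groupwise soft-thresholding: scale each block by (1 - t / ||x_g||_2)_+
    out = np.zeros_like(x)
    for g in groups:
        nrm = np.linalg.norm(x[g])
        if nrm > t:
            out[g] = (1.0 - t / nrm) * x[g]
    return out
```

In the α update above, this would be applied to each block β_g + w_g with its own threshold c_g λ/ρ.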
Example: sparse subspace estimation
\[
\mathcal{F}_k = \{Y \in \mathbb{S}^p : 0 \preceq Y \preceq I, \; \mathrm{tr}(Y) = k\}
\]
(the Fantope of order k; see Vu et al., 2013)
Rewrite as:
Example: sparse + low rank decomposition
Given M ∈ R^{n×m}, consider the sparse plus low rank decomposition
problem (Candes et al., 2009):
\[
\min_{L,S} \; \|L\|_{\mathrm{tr}} + \lambda\|S\|_1 \quad \text{subject to } L + S = M
\]
ADMM steps:
\[
\begin{aligned}
L^{(k)} &= S^{\mathrm{tr}}_{1/\rho}\big(M - S^{(k-1)} + W^{(k-1)}\big) \\
S^{(k)} &= S^{\ell_1}_{\lambda/\rho}\big(M - L^{(k)} + W^{(k-1)}\big) \\
W^{(k)} &= W^{(k-1)} + M - L^{(k)} - S^{(k)}
\end{aligned}
\]
where, to distinguish them, we use S^{tr}_{1/ρ} for matrix soft-thresholding
and S^{ℓ1}_{λ/ρ} for elementwise soft-thresholding
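A minimal sketch of these three updates: matrix soft-thresholding is implemented by soft-thresholding the singular values, and the settings of λ, ρ, and the iteration count are illustrative.

```python
import numpy as np

def soft_threshold(X, t):
    # elementwise soft-thresholding S_t^{l1}
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def svd_soft_threshold(X, t):
    # matrix soft-thresholding S_t^{tr}: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def sparse_plus_low_rank(M, lam, rho=1.0, iters=100):
    S = np.zeros_like(M)
    W = np.zeros_like(M)
    for _ in range(iters):
        L = svd_soft_threshold(M - S + W, 1.0 / rho)   # L update
        S = soft_threshold(M - L + W, lam / rho)       # S update
        W = W + M - L - S                              # running sum of residuals
    return L, S
```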
Example from Candes et al. (2009):
Consensus ADMM
Write
\[
\bar{x} = \frac{1}{B} \sum_{i=1}^{B} x_i
\]
and similarly for other variables. Not hard to see that w̄^{(k)} = 0 for all iterations k ≥ 1
Intuition:
• Try to minimize each f_i(x_i), use (squared) ℓ2 regularization to pull each x_i towards the average x̄
• If a variable x_i is bigger than the average, then w_i is increased
• So the regularization in the next step pulls x_i even closer to the average (see the sketch below)
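A minimal sketch of the consensus updates being described here: each block i keeps its own copy x_i, coupled to the others only through the average x̄. The per-block prox oracles passed in `proxes` are assumed inputs (each returns argmin_x f_i(x) + (ρ/2)‖x − v‖₂²); names and defaults are illustrative.

```python
import numpy as np

def consensus_admm(proxes, dim, rho=1.0, iters=100):
    B = len(proxes)
    x = np.zeros((B, dim))           # local copies x_i
    w = np.zeros((B, dim))           # scaled duals; their average is 0 after one step
    xbar = np.zeros(dim)
    for _ in range(iters):
        for i, prox in enumerate(proxes):
            # x_i update: argmin f_i(x_i) + (rho/2)||x_i - xbar + w_i||^2
            x[i] = prox(xbar - w[i], rho)
        xbar = x.mean(axis=0)         # averaging step
        w = w + x - xbar              # if x_i overshoots the average, w_i grows
    return xbar
```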
General consensus ADMM
Consider a problem of the form:
\[
\min_x \; \sum_{i=1}^{B} f_i(a_i^T x + b_i) + g(x)
\]
See Boyd et al. (2010), Parikh and Boyd (2013) for more details
on consensus ADMM, strategies for splitting up into subproblems,
and implementation tips
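Extending the previous sketch with a shared regularizer g, following the general consensus pattern in Boyd et al. (2010): the prox oracles for each f_i (which would absorb the affine maps a_i^T x + b_i) and for g are assumed inputs, and the names and defaults are illustrative.

```python
import numpy as np

def general_consensus_admm(f_proxes, g_prox, dim, rho=1.0, iters=100):
    B = len(f_proxes)
    x = np.zeros((B, dim))           # local copies x_i
    w = np.zeros((B, dim))           # scaled duals
    z = np.zeros(dim)                # shared (consensus) variable
    for _ in range(iters):
        for i, prox in enumerate(f_proxes):
            # x_i update: argmin f_i(x_i) + (rho/2)||x_i - z + w_i||^2
            x[i] = prox(z - w[i], rho)
        # z update: argmin g(z) + (B*rho/2)||z - mean(x + w)||^2
        z = g_prox((x + w).mean(axis=0), B * rho)
        w = w + x - z                # scaled dual update
    return z
```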
Special decompositions
Example: 2d fused lasso
Given data y ∈ R^n observed over the nodes of a 2d grid, the 2d fused lasso (or 2d total variation denoising) problem penalizes differences across the grid edges E:
\[
\min_\theta \; \frac{1}{2}\|y - \theta\|_2^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|
\]
[Figure: variables are black dots; the partitions P and Q are in orange and cyan]
First way to rewrite:
\[
\min_{\theta, z} \; \frac{1}{2}\|y - \theta\|_2^2 + \lambda\|z\|_1 \quad \text{subject to } z = D\theta
\]
where D is the difference operator over the grid edges
Notes:
• The θ update solves a linear system in I + ρL, with L = D^T D the graph Laplacian matrix of the 2d grid, so this can be done efficiently, in roughly O(n) operations
• The z update applies the soft-thresholding operator S_t
• Hence one entire ADMM cycle uses roughly O(n) operations (a code sketch follows below)
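A minimal sketch of one cycle of ADMM for this first formulation. Here D is assumed to be the (sparse) difference operator over the grid edges; in practice one would factorize I + ρL once and reuse it, rather than calling a direct solver every iteration as below.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def fused_lasso_2d_step(y, D, z, w, lam, rho):
    # theta update: solve (I + rho * L) theta = y + rho * D^T (z - w)
    L = D.T @ D                                   # graph Laplacian of the 2d grid
    rhs = y + rho * (D.T @ (z - w))
    theta = spsolve(sp.eye(y.size, format="csc") + rho * L, rhs)
    # z update: soft-threshold D theta + w at level lam / rho
    Dtheta = D @ theta
    z = np.sign(Dtheta + w) * np.maximum(np.abs(Dtheta + w) - lam / rho, 0.0)
    # scaled dual update
    w = w + Dtheta - z
    return theta, z, w
```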
Second way to rewrite:
\[
\min_{H, V} \; \frac{1}{2}\|Y - H\|_F^2 + \lambda \sum_{i,j} \big(|H_{i,j} - H_{i+1,j}| + |V_{i,j} - V_{i,j+1}|\big) \quad \text{subject to } H = V
\]
Notes:
• Both H, V updates solve a (sequence of) 1d fused lassos, where we write
\[
\mathrm{FL}^{\mathrm{1d}}_\tau(a) = \operatorname*{argmin}_x \; \frac{1}{2}\|a - x\|_2^2 + \tau \sum_{i=1}^{d-1} |x_i - x_{i+1}|
\]
• Critical: each 1d fused lasso solution can be computed exactly in O(d) operations with specialized algorithms (e.g., Johnson, 2013; Davies and Kovac, 2001)
• Hence one entire ADMM cycle again uses O(n) operations (see the sketch below)
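A minimal sketch of one cycle of this specialized ADMM. The H and V updates below (a column-wise 1d fused lasso on (Y + ρ(V − W))/(1 + ρ) with penalty λ/(1 + ρ), and a row-wise 1d fused lasso on H + W with penalty λ/ρ) are derived from the scaled-form updates earlier in the lecture rather than copied from the slides, and fl1d(a, tau), a fast 1d fused lasso solver such as Johnson's dynamic programming algorithm, is an assumed helper not shown here.

```python
import numpy as np

def fused_lasso_2d_special_step(Y, V, W, lam, rho, fl1d):
    # H update: 1d fused lasso down each column of (Y + rho*(V - W)) / (1 + rho)
    A = (Y + rho * (V - W)) / (1.0 + rho)
    H = np.stack([fl1d(A[:, j], lam / (1.0 + rho)) for j in range(Y.shape[1])], axis=1)
    # V update: 1d fused lasso along each row of H + W
    B = H + W
    V = np.stack([fl1d(B[i, :], lam / rho) for i in range(Y.shape[0])], axis=0)
    # scaled dual update for the constraint H = V
    W = W + H - V
    return H, V, W
```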
Comparison of 2d fused lasso algorithms: an image of dimension
300 × 200 (so n = 60,000)
Two ADMM algorithms, (say) standard and specialized ADMM:
[Figure: suboptimality f^{(k)} − f^⋆ versus iteration k for the standard and specialized ADMM algorithms]
ADMM iterates visualized after k = 10, 30, 50, 100 iterations:
References and further reading
• A. Barbero and S. Sra (2014), “Modular proximal optimization
for multidimensional total-variation regularization”
• S. Boyd and N. Parikh and E. Chu and B. Peleato and J.
Eckstein (2010), “Distributed optimization and statistical
learning via the alternating direction method of multipliers”
• E. Candes and X. Li and Y. Ma and J. Wright (2009),
“Robust principal component analysis?”
• N. Parikh and S. Boyd (2013), “Proximal algorithms”
• A. Ramdas and R. Tibshirani (2014), “Fast and flexible
ADMM algorithms for trend filtering”
• V. Vu and J. Cho and J. Lei and K. Rohe (2013), "Fantope
projection and selection: a near-optimal convex relaxation of
sparse PCA”
• M. Wytock and S. Sra and Z. Kolter (2014), "Fast Newton
methods for the group fused lasso”