Proximal Algorithms in Statistics and Machine Learning
Abstract. Proximal algorithms are useful for obtaining solutions to difficult optimization problems, especially those involving nonsmooth or composite objective functions. A proximal algorithm is one whose basic iterations involve the proximal operator of some function, whose evaluation requires solving a specific optimization problem that is typically easier than the original problem. Many familiar algorithms can be cast in this form, and this "proximal view" turns out to provide a set of broad organizing principles for many algorithms useful in statistics and machine learning. In this paper, we show how a number of recent advances in this area can inform modern statistical practice. We focus on several main themes: (1) variable splitting strategies and the augmented Lagrangian; (2) the broad utility of envelope (or variational) representations of objective functions; (3) proximal algorithms for composite objective functions; and (4) the surprisingly large number of functions for which there are closed-form solutions of proximal operators. We illustrate our methodology with regularized logistic and Poisson regression incorporating a nonconvex bridge penalty and a fused lasso penalty. We also discuss several related issues, including the convergence of nondescent algorithms, acceleration and optimization for nonconvex functions. Finally, we provide directions for future research in this exciting area at the intersection of statistics and optimization.

Key words and phrases: Bayes MAP, shrinkage, sparsity, splitting, Kurdyka–Łojasiewicz, nonconvex, envelopes, regularization, ADMM, optimization, Divide and Concur.
optimization subproblem that is (one hopes) easier than the original problem of interest. By iteratively solving such subproblems, a proximal algorithm converges on the solution to the original problem. Chrétien and Hero III (2000) provide a general relation between EM and proximal point (PP) algorithms and show that the latter can provide dramatic improvements in rates of convergence.

The early foundational work in this area dates to the study of iterative fixed-point algorithms in Banach spaces (Von Neumann, 1951; Brègman, 1967; Hestenes, 1969; Martinet, 1970; Rockafellar, 1976). As these techniques matured, they became widely used in several different fields. As a result, they have been referred to by a diverse set of names, including proximal gradient, proximal point, alternating direction method of multipliers (ADMM) (Boyd et al., 2011), divide and concur (DC), Frank–Wolfe (FW), Douglas–Rachford splitting, operator splitting and alternating split Bregman (ASB) methods. The field of image processing has developed most of these ideas independently of statistics—for example, in the form of total variation (TV) de-noising and half-quadratic (HQ) optimization (Geman and Yang, 1995; Geman and Reynolds, 1992; Nikolova and Ng, 2005). Many other widely-known methods—including, for example, fast iterative shrinkage thresholding (FISTA), expectation maximization (EM), majorization-minimization (MM) and iteratively reweighted least squares (IRLS)—also fall into the proximal framework.

Recently there has been a spike of interest in proximal algorithms, with a handful of recent broad surveys appearing in the last few years (Cevher, Becker and Schmidt, 2014; Komodakis and Pesquet, 2014; Combettes and Pesquet, 2011; Boyd et al., 2011). Indeed, the use of specific proximal algorithms has become commonplace in statistics and machine learning (e.g., Bien, Taylor and Tibshirani, 2013; Tibshirani, 2014; Tansey et al., 2014). However, there has not been a real focus on the general family of approaches that underlie these algorithms, with specific attention to the issues of most direct interest to statisticians. Our review is designed to fill this gap.

The rest of the paper proceeds as follows. Section 1.2 provides notation and basic properties of proximal operators and envelopes. Section 2 describes the proximal operator and Moreau envelope. Section 3 describes the basic proximal algorithms and their extensions. Section 4 describes common algorithms and techniques, such as ADMM and Divide and Concur, that rely on proximal algorithms. Section 5 discusses envelopes and how proximal algorithms can be viewed as envelope gradients. Section 6 considers the general problem of composite operator optimization and shows how to compute the exact proximal operator with a general quadratic envelope and a composite regularization penalty. Section 7 illustrates the methodology with applications to logistic and Poisson regression with fused lasso penalties. A bridge regression penalty illustrates the nonconvex case and we apply our algorithm to the prostate data of Hastie, Tibshirani and Friedman (2009). Finally, Section 8 concludes with directions for future research, while Appendix A discusses convergence results for both convex and nonconvex cases together with Nesterov acceleration.

We also include several useful summaries in table form. Table 1 lists commonly used proximal operators, Table 2 documents several examples of half-quadratic envelopes, and Table 3 provides convergence rates for a variety of algorithms.

1.2 Notation

In this paper we consider optimization problems of the form

(1) minimize F(x) := l(x) + φ(x),

where l(x) is a measure of fit depending implicitly on some observed data y, and φ(x) is a regularization term that imposes structure or effects a favorable bias-variance trade-off. Often l(x) is a smooth function and φ(x) is nonsmooth—like a lasso or bridge penalty—so as to induce sparsity. We will assume that l and φ are convex and lower semi-continuous except when explicitly stated to be nonconvex.

We will pay particular attention to composite penalties of the form φ(Bx), where B is a matrix corresponding to some constraint or structural penalty, such as the discrete difference operator in fused lasso or polynomial trend filtering. We use x = (x_1, ..., x_d) to denote a d-dimensional parameter of interest, y an n-vector of outcomes, A a fixed n × d matrix whose rows are covariates (or features) a_i, and B a fixed k × d matrix, b a prior mean or target for shrinkage, and γ > 0 a regularization parameter that will trace out a solution path. Observations are indexed by i, parameters by j, and iterations of an algorithm by t. Unless stated otherwise, all vectors are column vectors. Putting these together, this paper treats general composite objectives of the form

(2) F(x) := Σ_{i=1}^n l(y_i, a_iᵀx) + γ Σ_{j=1}^k φ([Bx − b]_j).
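To make the notation concrete, the following minimal sketch (ours, not from the paper) builds the composite objective in equation (2) for a Gaussian loss and an ℓ1 penalty on first-order differences, the fused-lasso-style structure mentioned above. All function names and the simulated data are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the composite objective in equation (2): squared-error
# loss, phi(u) = |u|, and B the discrete first-difference operator.
# Names and simulated data are illustrative, not from the paper.

def first_difference_matrix(d):
    """B in equation (2): the (d-1) x d first-difference operator."""
    B = np.zeros((d - 1, d))
    for j in range(d - 1):
        B[j, j], B[j, j + 1] = -1.0, 1.0
    return B

def composite_objective(x, A, y, B, b, gamma):
    """F(x) = sum_i l(y_i, a_i'x) + gamma * sum_j phi([Bx - b]_j)."""
    loss = 0.5 * np.sum((y - A @ x) ** 2)          # l: squared error
    penalty = gamma * np.sum(np.abs(B @ x - b))    # phi(u) = |u|
    return loss + penalty

rng = np.random.default_rng(0)
n, d = 20, 10
A = rng.normal(size=(n, d))
x_true = np.repeat([1.0, -1.0], d // 2)            # piecewise-constant signal
y = A @ x_true + 0.1 * rng.normal(size=n)
B = first_difference_matrix(d)
print(composite_objective(x_true, A, y, B, np.zeros(d - 1), gamma=0.5))
```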
TABLE 1
Commonly used proximal operators, d(x) = (a|x| − γ)_+. Sources: Chaux et al. (2007), Hu, Li and Yang (2015)

Huber dist. (ω, τ ∈ (0, +∞)):
  penalty φ(x) = τx² if |x| ≤ ω/√(2τ); ω√(2τ)|x| − ω²/2 otherwise;
  prox(x) = x/(2τ + 1) if |x| ≤ ω(2τ + 1)/√(2τ); x − ω√(2τ) sgn(x) if |x| > ω(2τ + 1)/√(2τ).

Max-entropy dist. (p ∈ (1, +∞); ω, τ, κ ∈ (0, +∞)):
  penalty φ(x) = ω|x| + τ|x|² + κ|x|^p;
  prox(x) = sgn(x) prox_{κ|·|^p/(2τ+1)}( max(|x| − ω, 0)/(2τ + 1) ).

Smoothed-Laplace dist.:
  penalty φ(x) = ω|x| − ln(1 + ω|x|);
  prox(x) = sgn(x) [ ω|x| − ω² − 1 + √( |ω|x| − ω² − 1|² + 4ω|x| ) ] / (2ω).

Exponential dist.:
  penalty φ(x) = ωx if x ≥ 0; +∞ if x < 0;
  prox(x) = x − ω if x ≥ ω; 0 if x < ω.

Uniform dist.:
  penalty φ(x) = ι_{[−ω,ω]}(x);
  prox(x) = −ω if x < −ω; x if |x| ≤ ω; ω if x > ω.

Triangular dist. (ω ∈ (−∞, 0], ω̂ ∈ (0, ∞)):
  penalty φ(x) = −ln(x − ω) + ln(−ω) if x ∈ (ω, 0); −ln(ω̂ − x) + ln(ω̂) if x ∈ (0, ω̂); +∞ otherwise;
  prox(x) = [ x + ω + √(|x − ω|² + 4) ]/2 if x < 1/ω; [ x + ω̂ − √(|x − ω̂|² + 4) ]/2 if x > 1/ω̂.

Weibull dist. (p ∈ (1, +∞); ω, κ ∈ (−∞, 0]):
  penalty φ(x) = −κ ln x + ωx^p if x > 0; +∞ if x ≤ 0;
  prox(x) = π, where π solves pωπ^p + π² − xπ = κ.

GIG dist. (ω, κ, ρ ∈ (−∞, 0]):
  penalty φ(x) = −κ ln x + ωx + ρ/x if x > 0; +∞ if x ≤ 0;
  prox(x) = π, where π solves π³ + (ω − x)π² − κπ = ρ.
For example, lasso can be viewed as a simple statistical model with the negative log-likelihood from y = Ax + ε, where ε is a standard normal measurement error, corresponding to the norm l(x) = ‖Ax − y‖², and each parameter x_j has independent Laplace priors corresponding to the regularization penalty φ(x) = |x|. To keep the notation light, we overload the symbols l and φ: they can refer either to the overall loss and penalty terms [as in equation (1)] or to the individual component-wise terms that are added to produce the overall loss or penalty [as in equation (2)]. We have taken care to ensure that their meaning will always be clear in context.

We also use the following conventions: sgn(x) is the algebraic sign of x, and x_+ = max(x, 0); ι_C(x) is the set indicator function taking the value 0 if x ∈ C and ∞ if x ∉ C; R_+ = [0, ∞), R_++ = (0, ∞), and R̄ is the extended real line R ∪ {−∞, ∞}.
TABLE 2
Half-quadratic representations φ(t) = min_s {Q(t, s) + ψ(s)}. Minimizers for the multiplicative form, Q(t, s) = (1/2)t²s, are σ(t) = φ′(0+) if t = 0 and σ(t) = φ′(t)/t if t ≠ 0; for the additive form, Q(t, s) = (t − s)², σ(t) = ct − φ′(t). See Nikolova and Ng (2005)

Penalty φ(t)    Minimizer σ(t) (multiplicative form)    Minimizer σ(t) (additive form)
Further preliminaries. We now briefly introduce several useful concepts and definitions to be described further in subsequent sections. First, splitting is a key tool that exploits an equivalence between an unconstrained optimization problem and a constrained one that includes a latent or slack variable z. For example, suppose that the original problem is

minimize_x l(x) + φ(Bx).

To apply splitting to this problem, we formulate the equivalent problem

minimize_{x,z} l(x) + φ(z)
subject to Bx = z,

so that the objective is split into two terms involving separate sets of primal variables.

The convex conjugate of l(x), l*(λ), is defined as

l*(λ) = sup_x { λᵀx − l(x) }.

The conjugate function l*(λ) is the point-wise supremum of a family of affine (and therefore convex) functions; it is convex even when l(x) is not. But if l(x) is convex (and closed and proper), then we also have that l(x) = sup_λ { λᵀx − l*(λ) }, so that l and l* are dual to one another. If l(x) is differentiable, the maximizing value of λ is λ̂(x) = ∇l(x).

The convex conjugate is our first example of an envelope, which is a way of representing functions in terms of a pointwise extremum of a family of functions indexed by a latent variable. Another example is
TABLE 3
See Duckworth (2014)

Algorithm                                    Error rate (convex)   Error rate (strongly convex)              Per-iteration cost
Accelerated gradient descent                 O(1/√ε)               O(log(1/ε))                               O(n)
Proximal gradient descent                    O(1/ε)                O(log(1/ε))                               O(n)
Accelerated proximal gradient descent        O(1/√ε)               O(log(1/ε))                               O(n)
ADMM                                         O(1/ε)                O(log(1/ε))                               O(n)
Frank–Wolfe/conditional gradient algorithm   O(1/ε)                O(1/√ε)                                   O(n)
Newton's method                                                    O(log log(1/ε))                           O(n³)
Conjugate gradient descent                                         O(n)                                      O(n²)
L-BFGS                                       between O(log(1/ε)) and O(log log(1/ε))                         O(n²)
a quadratic envelope, where we represent l as

l(x) = inf_z { (1/2) xᵀΛ(z)x − η(z)ᵀx + ψ(z) }

for some Λ, η, ψ. We will draw heavily on the use of envelope (or variational) representations of functions.

A function g(x) is said to majorize another function f(x) at x_0 if g(x_0) = f(x_0) and g(x) ≥ f(x) for all x ≠ x_0. If the same relation holds with the inequality sign flipped, g(x) is said to be a minorizing function for f(x).

The subdifferential of a function f at the point x is defined as the set

∂f(x) = { v : f(z) ≥ f(x) + vᵀ(z − x), ∀z, x ∈ dom(f) }.

Any such element is called a subgradient. If the function is differentiable, then the subdifferential is a singleton set comprising the ordinary gradient from differential calculus.

Finally, a ρ-strongly convex function satisfies

f(x) ≥ f(z) + uᵀ(x − z) + (ρ/2)‖x − z‖₂², where u ∈ ∂f(z),

while a ρ-smooth function satisfies

f(x) ≤ f(z) + ∇f(z)ᵀ(x − z) + (ρ/2)‖x − z‖₂² ∀x, z.

2. PROXIMAL OPERATORS AND MOREAU ENVELOPES

2.1 Basic Properties

Our perspective throughout this paper will be to view a proximal algorithm as taking a gradient-descent step for a suitably defined envelope function. By constructing different envelopes, one can develop new optimization algorithms. We build up to this perspective by first discussing the basic properties of the proximal operator and its relationship to the gradient of the standard Moreau envelope.

Let f(x) be a lower semi-continuous function, and let γ > 0 be a scalar. The Moreau envelope f^γ(x) and proximal operator prox_{γf}(x) with parameter γ are defined as

(3) f^γ(x) = inf_z { f(z) + (1/2γ)‖z − x‖₂² } ≤ f(x),
    prox_{γf}(x) = argmin_z { f(z) + (1/2γ)‖z − x‖₂² }.

Intuitively, the Moreau envelope is a regularized version of f. It approximates f from below and has the same set of minimizing values (Rockafellar and Wets, 1998, Chapter 1G). The proximal operator specifies the value that solves the minimization problem defined by the Moreau envelope. It balances the two goals of minimizing f and staying near x, with γ controlling the trade-off. Table 1 provides an extensive list of closed-form solutions.

Parikh and Boyd (2013) provide several interesting interpretations of the proximal operator. Each one provides some intuition about why proximal operators might be useful in optimization. We highlight three of these interpretations here.

First, the proximal operator behaves similarly to a gradient-descent step for the function f. There are many ways of motivating this connection, but one simple way is to consider the Moreau envelope f^γ(x). Observe that the Moreau derivative is

∂f^γ(x) = ∂ inf_z { f(z) + (1/2γ)‖z − x‖² } = (1/γ)( x − ẑ(x) ),

where ẑ(x) = prox_{γf}(x) is the value that achieves the minimum. Hence,

prox_{γf}(x) = x − γ ∂f^γ(x).

Thus, evaluating the proximal operator can be viewed as a gradient-descent step for the Moreau envelope, with γ as a step-size parameter.

Second, the proximal operator generalizes the notion of the Euclidean projection. To see this, consider the special case where f(x) = ι_C(x) is the set indicator function of some convex set C. Then prox_f(x) = argmin_{z∈C} ‖x − z‖₂² is the ordinary Euclidean projection of x onto C. This suggests that, for other functions, the proximal operator can be thought of as a generalized projection. A constrained optimization problem min_{x∈C} f(x) has an equivalent solution as an unconstrained proximal operator problem. Proximal approaches are, therefore, directly related to convex relaxation and quadratic majorization, through the addition of terms like (ρ/2)‖x − v‖² to an objective function, where ρ might be a constant that bounds an operator or the Hessian of a function. We can choose where these quadratic terms are introduced, which variables the terms can involve, and the order in which optimization steps are taken. The envelope framework highlights such choices, leading to many distinct and familiar algorithms.

Finally, there is a close connection between proximal operators and fixed-point theory, in that prox_{γf}(x*) = x* if and only if x* minimizes f.
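The three interpretations above are easy to check numerically. The short sketch below (ours, not from the paper) uses the ℓ1 norm and an interval indicator: the prox of the former is soft thresholding, the prox of the latter is Euclidean projection, and a finite-difference gradient of the Moreau envelope reproduces prox_{γf}(x) = x − γ∂f^γ(x). Helper names are our own.

```python
import numpy as np

# A sketch (ours) of three facts from this section: prox of |.| is soft
# thresholding, prox of a set indicator is projection, and the prox equals a
# gradient step on the Moreau envelope with step size gamma.

def prox_l1(x, gamma):
    """prox of f(z) = |z| (componentwise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_env_l1(x, gamma):
    """Moreau envelope f^gamma of the absolute value (a Huber function)."""
    z = prox_l1(x, gamma)
    return np.abs(z) + (x - z) ** 2 / (2.0 * gamma)

def prox_indicator_interval(x, lo, hi):
    """prox of the indicator of [lo, hi] = Euclidean projection onto [lo, hi]."""
    return np.clip(x, lo, hi)

x, gamma, h = 1.7, 0.5, 1e-5
grad_env = (moreau_env_l1(x + h, gamma) - moreau_env_l1(x - h, gamma)) / (2 * h)
print(prox_l1(x, gamma), x - gamma * grad_env)   # the two values agree
print(prox_indicator_interval(np.array([-2.0, 0.3, 5.0]), -1.0, 1.0))
```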
This can be motivated in several ways. We outline what is perhaps the most transparent motivation for statisticians by showing that the proximal gradient is an MM (majorize/minimize) algorithm.

Suppose that l(x) has a Lipschitz-continuous gradient with modulus γ_l. This allows us to construct a majorizing function for l(x), and therefore for the whole objective. Whenever γ ∈ (0, 1/γ_l], we have the majorization

l(x) + φ(x) ≤ l(x_0) + (x − x_0)ᵀ∇l(x_0) + (1/2γ)‖x − x_0‖₂² + φ(x),

with equality at x = x_0. Simple algebra shows that the optimum value of the right-hand side is

x̂ = argmin_x { φ(x) + (1/2γ)‖x − u‖₂² }, where u = x_0 − γ∇l(x_0).

Consider the proximal gradient method applied to a quadratic-form log-likelihood (6), as in a weighted least squares problem, with a penalty function φ(x). Then ∇l(x) = AᵀAx − Aᵀy, and the proximal gradient method becomes

x^{t+1} = prox_{γ^t φ}( x^t − γ^t(AᵀAx^t − Aᵀy) ).

This algorithm has been widely studied under the name of IST, or iterative shrinkage thresholding (Figueiredo and Nowak, 2003). Its primary computational costs at each iteration are as follows: (1) multiplying the current iterate x^t by A, and (2) multiplying the residual Ax^t − y by Aᵀ. Typically, the proximal operator for φ will be simple to compute, as in the case of a quadratic or ℓ1-norm/lasso penalty discussed in the previous section. Thus, the evaluation of the proximal operator will contribute a negligible amount to the overall complexity of the algorithm.
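As a concrete instance of the IST iteration just described, here is a short sketch (ours) for the lasso case, where the proximal operator of the ℓ1 penalty is componentwise soft thresholding; the step-size rule, seed, and simulated data are our assumptions.

```python
import numpy as np

# A sketch of IST for l(x) = 0.5*||Ax - y||^2 with an l1 penalty.
# Step size 1/L with L the Lipschitz constant of grad l; details are ours.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ist(A, y, lam, n_iter=500):
    """x^{t+1} = prox_{gamma*lam*|.|_1}( x^t - gamma * A'(Ax^t - y) )."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of grad l
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - gamma * grad, gamma * lam)
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 200))
x_true = np.zeros(200); x_true[:5] = 3.0
y = A @ x_true + 0.1 * rng.normal(size=50)
print(np.nonzero(ist(A, y, lam=2.0))[0][:10])   # recovered support
```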
A useful feature of proximal algorithms is the ability to use acceleration techniques (Nesterov, 1983), often referred to as Nesterov acceleration. Acceleration leads to nondescent algorithms that can provide substantial increases in efficiency versus their nonaccelerated counterparts.

The idea of acceleration is to add an intermediate "momentum" variable z, prior to evaluating the forward and backward steps:

z^{t+1} = x^t + θ_{t+1}(θ_t^{-1} − 1)(x^t − x^{t−1}),
x^{t+1} = prox_{γ^{-1}φ}( z^{t+1} − γ^{-1}∇l(z^{t+1}) ),

where standard choices are θ_t = 2/(t + 1) and θ_{t+1}(θ_t^{-1} − 1) = (t − 1)/(t + 2). When φ is convex, the proximal problem is strongly convex, and advanced acceleration techniques can be used (Zhang, Saha and Vishwanathan, 2010; Meng and Chen, 2011).

with Lagrange multiplier λ. The dual function is g(λ) = inf_{x,z} L(x, z, λ), and the dual problem is to maximize g(λ).

Let p* and d* be the optimal values of the primal and dual problems, respectively. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. Moreover, we may recover a primal-optimal point (x*, z*) from a dual-optimal point λ* using the fact that

(x*, z*) = argmin_{x,z} L(x, z, λ*) ⟺ 0 ∈ ∂_{x,z} L(x*, z*, λ*).

The idea of dual ascent is to solve the dual problem using gradient ascent, exploiting the fact that

∇g(λ) = ∇_λ L(x̂_λ, ẑ_λ, λ), where (x̂_λ, ẑ_λ) = argmin_{x,z} L(x, z, λ).
Thus, the required gradient is simply the residual for the primal constraint: ∇_λ L(x, z, λ) = x − z. Therefore, dual ascent involves iterating two steps:

(x^{t+1}, z^{t+1}) = argmin_{x,z} L(x, z, λ^t),
λ^{t+1} = λ^t + α_t (x^{t+1} − z^{t+1})

for appropriate step size α_t.

An obvious issue with dual ascent for problem (9) is that the update in x and z must be done jointly, rather than one at a time. This is rarely practical for problems of this form. But a discussion of dual ascent is an important starting point for building up to more realistic algorithms.

We also note that in the case where g is not differentiable, it is possible to replace the gradient with the negative of a subgradient of −g, leading to dual subgradient ascent; see Shor (1985).

Augmented Lagrangian and the method of multipliers. Take problem (9) as before, with Lagrangian L(x, z, λ) = l(x) + φ(z) + λᵀ(x − z). The augmented-Lagrangian approach (also known as the method of multipliers) seeks to stabilize the intermediate steps of dual ascent by adding a ridge-like term to the Lagrangian:

L_γ(x, z, λ) = l(x) + φ(z) + λᵀ(x − z) + (γ/2)‖x − z‖₂²,

where γ is a scale or step-size parameter. One way of viewing this augmented Lagrangian is as the standard Lagrangian for the equivalent problem

minimize_{x,z} l(x) + φ(z) + (γ/2)‖x − z‖₂²
subject to x − z = 0.

We can see that this is equivalent to the original because, for any primal-feasible x and z, the new objective takes the same value as the original objective, and thus has the same minimum. The dual function corresponding to this augmented Lagrangian is g_γ(λ) = inf_{x,z} L_γ(x, z, λ), which is differentiable and strongly convex under mild conditions. (The ordinary dual function need not be either of these things, which is a key advantage of using the augmented Lagrangian.)

The method of multipliers is to use dual ascent for the modified problem, iterating

(x^{t+1}, z^{t+1}) = argmin_{x,z} L_γ(x, z, λ^t),
λ^{t+1} = λ^t + γ(x^{t+1} − z^{t+1}).

Thus, the dual-variable update does not change compared to standard dual ascent. But the joint (x, z) update has a regularization term added to it, whose magnitude depends upon the tuning parameter γ. Notice that the step size γ is used in the dual-update step.

Scaled form. Many proximal algorithms have more concise updates when the dual variable λ is expressed in scaled form. Specifically, rescale the dual variable as u = γ^{-1}λ. We can rewrite the augmented Lagrangian in terms of u as

L_γ(x, z, u) = l(x) + φ(z) + γuᵀ(x − z) + (γ/2)‖x − z‖₂²
             = l(x) + φ(z) + (γ/2)‖r + u‖₂² − (γ/2)‖u‖₂²,

where r = x − z is the primal residual. This leads to the following dual-update formulas:

(x^{t+1}, z^{t+1}) = argmin_{x,z} { l(x) + φ(z) + (γ/2)‖x − z + u^t‖₂² },
u^{t+1} = u^t + x^{t+1} − z^{t+1}.

Bregman iteration. The augmented Lagrangian method for solving ℓ1-norm problems is called "Bregman iteration" in the compressed-sensing literature. Here the goal is to solve the exact-recovery problem via basis pursuit:

minimize_x ‖x‖₁
subject to Ax = y,

where y is measured, x is the unknown signal, and A is a known "short and fat" sensing matrix (meaning more coordinates of x than there are observations).

The scaled-form augmented Lagrangian corresponding to this problem is

L_γ(x, u) = ‖x‖₁ + (γ/2)‖Ax − y + u‖₂² − (γ/2)‖u‖₂²,

with steps

x^{t+1} = argmin_x { ‖x‖₁ + (γ/2)‖Ax − z^t‖₂² },
z^{t+1} = y + z^t − Ax^{t+1},

where we have redefined z^t = y − u^t compared to the usual form of the dual update. Thus, each intermediate step of the Bregman iteration is like a lasso regression problem. (This algorithm also has an alternate derivation in terms of Bregman divergences, hence its name.)
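A minimal sketch (ours) of the scaled-form updates above, specialized to l(x) = (1/2)‖Ax − y‖² and φ(z) = λ‖z‖₁ with the split x = z: the joint update separates into a ridge-type solve in x and a soft-thresholding proximal step in z, followed by the scaled dual update. The Cholesky caching and all names are our choices, not the authors'.

```python
import numpy as np

# A sketch of the scaled-form ADMM updates for the lasso split x = z.
# l(x) = 0.5*||Ax - y||^2, phi(z) = lam*||z||_1; variable names are ours.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, y, lam, gamma=1.0, n_iter=200):
    n, d = A.shape
    x = z = u = np.zeros(d)
    # x-update is a ridge-type solve; factor once since gamma is fixed.
    chol = np.linalg.cholesky(A.T @ A + gamma * np.eye(d))
    Aty = A.T @ y
    for _ in range(n_iter):
        rhs = Aty + gamma * (z - u)
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
        z = soft_threshold(x + u, lam / gamma)      # prox step for phi
        u = u + x - z                               # scaled dual update
    return z
```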
minimize_{x,z} Σ_{j=1}^{J+1} l_j(z_j)
subject to z_j = A_j x,

subject to Ax − z = 0.

For example, in a Poisson model (y_i | μ_i) ∼ Pois(μ_i), μ_i = exp(θ_i) with natural parameter θ_i = a_iᵀx. The cumulant generating function is b(θ) = exp(θ), and thus
d(μ) = μ log μ − μ. After simplification, the divergence D_d(y, μ) = y log(y/μ) + (μ − y). The optimization problem can then be split as

min_{x,z} Σ_{i=1}^N (z_i − y_i log z_i) + φ(x)
subject to a_iᵀx = log z_i.

These same optimization problems arise when one considers scale mixtures, or convex variational forms (Palmer et al., 2005, Polson and Scott, 2015). The connection is made explicit by the dual function for a density and its relationship with scale-mixture decompositions. For instance, one can obtain the following equality for appropriate densities p(x), q(z) and constants μ, κ:

−log p(x) = −sup_{z>0} log[ p_N(x; μ + κ/z, z^{-1}) q(z) ]
          = inf_{z>0} { (z/2)(x − μ − κ/z)² − log(√z q(z)) },

where p_N(x; μ, σ²) is the density function for a normal distribution with mean μ and variance σ². The form resulting from this normal scale-mixture envelope is similar to the half-quadratic envelopes described in Section 5. Polson and Scott (2015) describe these relationships in further detail.

5. ENVELOPE METHODS

In this section we describe several types of envelopes: the forward–backward (FB) envelope, the Douglas–Rachford (DR) envelope, the half-quadratic (HQ) envelope, and the Bregman divergence envelopes. These all build upon the idea of a Moreau envelope and lead to analogous proximal algorithms. Within this framework, various algorithms may be generated in terms of gradient steps for the corresponding envelope. (For instance, ADMM methods will be viewed as the gradient step of the dual FB envelope.) Section 6 dissects these envelopes in further detail, shows their relationship to Lagrangian approaches, and provides a framework within which they can be derived and extended.

5.1 Forward–Backward Envelope

Suppose as in (9) that we have to minimize F = l + φ, under the assumptions that l is strongly convex and possesses a continuous gradient with Lipschitz constant γ_l, so that |∇²l(x)| ≤ γ_l, and that φ is proper lower semi-continuous and convex.

First, we define the FB envelope, F_γ^FB(x), which will possess some desirable properties (see Patrinos and Bemporad, 2013):

F_γ^FB(x) := min_v { l(x) + ∇l(x)ᵀ(v − x) + φ(v) + (1/2γ)‖v − x‖² }
           = l(x) − (γ/2)‖∇l(x)‖² + φ^γ( x − γ∇l(x) ).

If we pick γ ∈ (0, γ_l^{-1}), the matrix I − γ∇²l(x) is symmetric and positive definite. The stationary points of the envelope F_γ^FB(x) are the solutions x* of the original problem, which satisfy x* = prox_{γφ}(x* − γ∇l(x*)). This follows from the derivative information

∇F_γ^FB(x) = ( I − γ∇²l(x) ) G_γ(x),

where G_γ(x) = γ^{-1}(x − P_γ(x)) and P_γ(x) = prox_{γφ}(x − γ∇l(x)).

With these definitions, we can establish the following descent property for gradient steps based on the FB envelope:

F_γ^FB(x) ≤ F(x) − (γ/2)‖G_γ(x)‖²,
F(P_γ(x)) ≤ F_γ^FB(x) − (γ/2)(1 − γγ_l)‖G_γ(x)‖².

Hence, for γ ∈ (0, γ_l^{-1}), the envelope value always decreases on application of the proximal operator of γφ, and we can determine the stationary points. See Appendix A for further details.

5.2 Douglas–Rachford Envelope

Mimicking the forward–backward approach, Patrinos, Lorenzo and Alberto (2014) define the Douglas–Rachford (DR) envelope as

F_γ^DR(x) = l^γ(x) − (γ/2)‖∇l^γ(x)‖² + φ^γ( x − 2γ∇l^γ(x) )
          = min_z { l(x̄) + ∇l(x̄)ᵀ(z − x̄) + φ(z) + (1/2γ)‖z − x̄‖² },

where we recall that l^γ is the Moreau envelope of the function l and x̄ = prox_{γl}(x).

This can be interpreted as a backward–backward envelope. It is a special case of a FB envelope evaluated at the proximal operator of γl, namely,

F_γ^DR(x) = F_γ^FB( prox_{γl}(x) ).
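The forward–backward envelope of Section 5.1 and its descent property can be verified numerically in a simple case. The sketch below (ours, under the stated assumptions) takes l(x) = (1/2)‖Ax − y‖², φ = ‖·‖₁ and γ ∈ (0, 1/γ_l), and checks that F(P_γ(x)) ≤ F_γ^FB(x) ≤ F(x); the Moreau envelope of the ℓ1 norm is computed from its proximal operator.

```python
import numpy as np

# A numerical sketch (ours) of the forward-backward envelope for
# l(x) = 0.5*||Ax - y||^2 and phi = ||.||_1, checking the descent inequalities.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def moreau_env_l1(v, gamma):
    z = soft_threshold(v, gamma)
    return np.sum(np.abs(z) + (v - z) ** 2 / (2.0 * gamma))

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 10)); y = rng.normal(size=30); x = rng.normal(size=10)
gamma_l = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad l
gamma = 0.5 / gamma_l                        # gamma in (0, 1/gamma_l)

l = lambda v: 0.5 * np.sum((A @ v - y) ** 2)
grad_l = lambda v: A.T @ (A @ v - y)
F = lambda v: l(v) + np.sum(np.abs(v))

fb_env = l(x) - 0.5 * gamma * np.sum(grad_l(x) ** 2) \
         + moreau_env_l1(x - gamma * grad_l(x), gamma)
P = soft_threshold(x - gamma * grad_l(x), gamma)   # P_gamma(x) = prox_{gamma*phi}
print(F(P) <= fb_env <= F(x))                      # both inequalities hold
```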
or, in split form,

(10) minimize_{x,z} l(x) + φ(z)
     subject to Bx = z.

Composite penalties arise in statistical models that account for structural constraints or spatiotemporal correlations (e.g., Tibshirani and Taylor, 2011; Tibshirani, 2014; Tansey et al., 2014). The most famous examples of problems in this class are total-variation denoising (Rudin, Osher and Fatemi, 1992) and the fused lasso (Tibshirani et al., 2005).

We start by noting that many approaches for solving this problem, including the ones in Section 4, can be characterized in terms of one of the four general forms of the objective functions/Lagrangians that result from appealing to splitting and conjugate functions:

primal:       F(x) = l(x) + φ(Bx),
split–primal: F_SP(x, z, λ) = l(x) + φ(z) + λᵀ(Bx − z),
primal–dual:  F_PD(x, λ) = l(x) + λᵀ(Bx) − φ*(λ),
split–dual:   F_SD(x, z, λ) = l*(z) + φ*(λ) + xᵀ(−Bᵀλ − z).

From a statistical perspective, it is natural to think of z and λ as latent variables, and of each of these splitting/duality strategies as defining a higher-dimensional objective function. Such ideas are familiar in statistics, where alternating minimization, iterated conditional modes (ICM), EM and MM algorithms have a long history (e.g., Dempster, Laird and Rubin, 1977; Csiszár and Tusnády, 1984; Besag, 1986). Indeed, Polson and Scott (2015) show how many such algorithms that appeal to convex conjugacy have a natural EM-like interpretation in terms of missing data.

For problem (10), the motivation for using the primal–dual and the split forms (see Esser, Zhang and Chan, 2010) lies in how they decouple φ from the linear mapping B; it is precisely the composition of these functions that poses the difficulty for problems like TV denoising and the fused lasso. Note that the primal–dual formulation follows from profiling the slack variable z out of the split–primal objective:

inf_z L(x, z, λ) = inf_z { l(x) + φ(z) + λᵀ(Bx − z) }
                 = l(x) + λᵀBx − φ*(λ),

and the split–dual by a similar argument. These two formulations are related via the Max–Min inequality (Boyd and Vandenberghe, 2004):

sup_q inf_v F(q, v) ≤ inf_v sup_q F(q, v).

In the special case of closed proper convex functions, we have

min_x F(x) = min_x sup_λ F_PD(x, λ)
           = max_λ min_{x,z} F_SP(x, z, λ)
           = max_x min_{λ,z} F_SD(x, z, λ),

where we exploit the fact that

φ(Bx) = sup_z { zᵀBx − φ*(z) }

whenever φ is convex. F_SP(x, z, λ) and F_PD(x, λ) are also related by

min_{z≥0} F_SP(x, z, λ) = min_{z≥0} { φ(z) + l(x) + λᵀ(Bx − z) }
 = l(x) + λᵀBx + min_{z≥0} { φ(z) − λᵀz }
 = l(x) + λᵀBx − φ*(λ)
 = F_PD(x, λ).

6.2 Proximal Solutions

In most statistical problems of form (10), it is typically the case that closed-form expressions for one or more of l(x), l*(z), φ(z) or φ*(λ) will be unavailable or inefficient to compute. However, exact solutions to related problems that share the same critical points may be easily accessible. We now step through several such approaches for solving (10), explaining how they relate to the ideas introduced thus far. We highlight whenever proximal operators enter the analysis. Because proximal operators are so well understood, their presence in an algorithm is convenient: the properties of proximal operators and the associated fixed-point theory can simplify otherwise lengthy constructions and convergence arguments. Moreover, by exploiting the proximal operator's known properties, like the Moreau identity, one can move easily between the different formulations above, and thus between the primal and dual spaces. It is also worth mentioning that the efficacy of certain acceleration techniques can depend on which formulation is used, and therefore implicitly on the specific proximal steps taken. We refer the reader to Beck and Teboulle (2014) for further discussion.
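Before stepping through these approaches, here is a small numerical sketch (ours) of the conjugate representation φ(Bx) = sup_z { zᵀBx − φ*(z) } used above, for φ = ‖·‖₁, whose conjugate is the indicator of the ℓ∞ unit ball; the supremum is then attained at z = sgn(Bx). The matrix B and vector x below are arbitrary test values.

```python
import numpy as np

# A sketch (ours) verifying phi(Bx) = sup_z { z'Bx - phi*(z) } for phi = ||.||_1,
# whose conjugate phi* is the indicator of the l-infinity unit ball.

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 6))
x = rng.normal(size=6)

v = B @ x
primal = np.sum(np.abs(v))        # phi(Bx)
z_star = np.sign(v)               # maximizer over the box ||z||_inf <= 1
dual = z_star @ v                 # z'Bx, with phi*(z) = 0 inside the ball
print(np.isclose(primal, dual))   # True: the two representations agree
```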
First, proximal operators arise naturally whenever we augment the Lagrangian for problem (10), which entails adding a ridge term to the split–primal objective:

(11) L_ρ(x, z, λ) = l(x) + φ(z) + λᵀ(Bx − z) + (ρ/2)‖Bx − z‖²
                  = F_SP(x, z, λ) + (ρ/2)‖Bx − z‖².

As already detailed, this leads naturally to an ADMM algorithm whose intermediate iterates involve proximal operators.

Second, we also are not restricted to using the proximal operators directly implied by one of these four problem formulations, such as those that appear when l, l*, φ and/or φ* contain quadratic terms. We can also apply a surrogate or approximation (e.g., an envelope or majorizer) to certain terms. For example, when exact solutions to the composite proximal operator are not available, one can consider "linearizing" (ρ/2)‖Bx − z‖² with (ρ/2λ_B)‖x − z‖², where σ_max(BᵀB) < λ_B, yielding

F_SP(x, w, z) + (ρ/2)‖Bx − z‖² ≤ F_SP(x, w, z) + (ρ/2λ_B)‖x − z‖².

This approach can be seen as a simple majorization and, when combined with the proximal solution for z, as a forward–backward envelope for the subproblem. Implementations of this approach include the linearized ADMM technique or the split inexact Uzawa method, and are described in the context of Lagrangians by Chen and Teboulle (1994) and primal–dual algorithms in Chambolle and Pock (2011). Magnússon et al. (2014) detail splitting methods in terms of augmented Lagrangians for nonconvex objectives.

Finally, one can represent one of the terms in the objective using one of the envelopes described in Section 5, in which case the iterates of the resulting algorithm will involve proximal operators. In fact, the envelope representation can itself be seen as a way to encode the iterates in each of a problem's latent/slack/splitting terms as proximal operators.

An example: The primal–dual. To demonstrate these ideas, we give an example of how proximal operators and their properties can be used to derive an algorithm starting from the primal–dual formulation

max_λ min_x { l(x) + λᵀ(Bx) − φ*(λ) }.

First, notice that the argmin for the subproblem in x, l(x) + λᵀ(Bx), can be characterized in terms of the following fixed point whenever γ_l > 0:

x* = prox_{γ_l(l(x)+λᵀBx)}(x*).

We now use the fact that

(12) prox_{g(z)+uᵀz}(q) = prox_g(q − u),

for a generic function g(z) and variables q, z and u; this is obtained by completing the square in the definition of the operator. Appealing to (12) gives

x* = prox_{γ_l(l(x)+λᵀBx)}(x*) = prox_{γ_l l}(x* − γ_l Bᵀλ).

Now we're left with only the subproblem in λ:

max_λ { l(x*) + λᵀBx* − φ*(λ) } = −min_λ { φ*(λ) − λᵀBx* − l(x*) }.

We can take yet another proximal step, for the minimization of φ*(λ) − λᵀ(Bx*), in λ with step size γ_φ. Using (12) and (4), we find that the argmin satisfies

λ* = prox_{γ_φ φ*}(λ* + γ_φ Bx*).

Using the Moreau decomposition in (4), we can derive yet another strategy. Note that

prox_{γ_φ φ*}(λ* + γ_φ Bx*) = [ (I − prox_{φ/γ_φ}) ∘ γ_φ ]( (1/γ_φ)λ* + Bx* ).

Hence, we can characterize the solution to the primal–dual problem in terms of fixed points of the following two operators:

(13) x* = prox_{γ_l l}(x* − γ_l Bᵀλ*),
     λ* = [ (I − prox_{φ/γ_φ}) ∘ γ_φ ]( (1/γ_φ)λ* + Bx* ).

If we separate the last step implied by (13) into two steps and simplify by setting γ_l = γ_φ = 1, we arrive at

x* = prox_l(x* − Bᵀu*),
w* = prox_φ(u* + Bx*),
u* = u* − (w* − Bx*).

This has the same basic form of techniques like ADMM, alternating split Bregman, split inexact Uzawa and so forth. See Chen, Huang and Zhang (2013) for more details.
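The derivation above leans on the translation property (12) and the Moreau decomposition in (4). Both are easy to check numerically for scalar functions, where the proximal operator can be evaluated by brute force; the sketch below (ours) does so for g(z) = |z|. All numbers are arbitrary test values.

```python
import numpy as np

# A numerical sketch (ours) of two identities used above:
# (12) prox_{g + u'z}(q) = prox_g(q - u), and the Moreau decomposition
# prox_{gamma f*}(v) = v - gamma * prox_{f/gamma over step 1}(v/gamma),
# checked for scalar functions by a direct grid search.

def prox_numeric(f, q, gamma=1.0, grid=np.linspace(-10, 10, 200001)):
    """Evaluate prox_{gamma f}(q) for a scalar f by direct minimization."""
    vals = f(grid) + (grid - q) ** 2 / (2.0 * gamma)
    return grid[np.argmin(vals)]

# Translation property (12), with g(z) = |z|:
q, u = 1.3, 0.4
lhs = prox_numeric(lambda z: np.abs(z) + u * z, q)
rhs = prox_numeric(np.abs, q - u)
print(np.isclose(lhs, rhs, atol=1e-3))

# Moreau decomposition with f = |.|, so f* is the indicator of [-1, 1]:
gamma, v = 0.7, 2.5
lhs = np.clip(v, -1.0, 1.0)                               # prox of gamma*f* = projection
rhs = v - gamma * prox_numeric(lambda z: np.abs(z) / gamma, v / gamma)
print(np.isclose(lhs, rhs, atol=1e-3))
```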
6.3 Composition in General Quadratic Envelopes

Consider now the most general form of a quadratic envelope involving a composite penalty function:

(14) F(x) = inf_z { (1/2) xᵀΛ(z)x − η(z)ᵀx + φ(Bx) },

where Λ(z) is symmetric positive definite. Such forms can arise when one majorizes l(x) using a second-order approximation around z. This general quadratic case in which Λ(z) is not necessarily diagonal encompasses the approaches of Geman and Yang (1995), Geman and Reynolds (1992), and can be addressed with splitting techniques.

If BᵀB is positive definite, a proximal point solution can be obtained by setting l(x) = xᵀΛ(z)x − ηᵀx in (13). The general solution to a quadratic-form proximal operator (6), together with the split–dual formulation, implies a proximal point algorithm that exploits the fact that the optimal values satisfy

x* = prox_{γ_l l(x)}(x* − γ_l Bᵀz*)
   = (I + γ_l Λ(z))^{-1}( x* − γ_l Bᵀz* + γ_l η ),
z* = [ (I − prox_{φ/γ_φ}) ∘ γ_φ ]( (1/γ_φ)z* + Bx* ).

This formulation introduces the subproblem of solving a system of linear equations. Using the exact solution to this system would reflect methods that involve Levenberg–Marquardt steps, quasi-Newton methods and Tikhonov regularization, and is related to the use of second-order Taylor approximations to an objective function. Naturally, the efficiency of computing exact solutions depends very much on the properties of I + γ_l Λ(z), since the system defined by this term will need to be solved on each iteration of a fixed-point algorithm. When Λ(z) is constant, a decomposition can be performed at the start and reused, so that solutions are computed quickly at each step. For some matrices, this can mean only O(n) operations per iteration. In general, however, the post-startup iteration cost is O(n²).

Other approaches, like those in Chen, Huang and Zhang (2013) and Argyriou et al. (2011), do not attempt to directly solve the aforementioned system of equations. Instead they use a forward–backward algorithm on the dual objective, F_PD. In particular, we call attention to the approach of Argyriou et al. (2011). They show how to evaluate the proximal operator of φ(Bx) directly, by finding the fixed point of the operator

H_κ = κI + (1 − κ)H,

for κ ∈ (0, 1), where

H(v) := (I − prox_{γ^{-1}φ})( BA^{-1}η + (I − γBA^{-1}Bᵀ)v ) ∀v ∈ R^p.

Here 0 < γ < 2/σ_max(BA^{-1}Bᵀ) and A = Λ(z). The operator H is understood to be nonexpansive, so, by Opial's theorem, one is guaranteed convergence; when H is a contraction, this convergence is linear. After finding the fixed point v*, one sets x* = A^{-1}(η − γBᵀv*).

7. APPLICATIONS

7.1 Logit Loss Plus Lasso Penalty

To illustrate our approach, we simulate observations from the model

(y_i | p_i) ∼ Binom(m_i, p_i),
p_i = logit^{-1}(a_iᵀx),

where i = 1, ..., 100, a_i is a row vector of A ∈ R^{100×300}, and x ∈ R^{300}. The A matrix is simulated from N(0, 1) variates and normalized column-wise. The signal x is also simulated from N(0, 1) variates, but with only 10% of entries being nonzero.

Here m_i are the number of trials and y_i the number of successes. The composite objective function for sparse logistic regression is then given by

argmin_x Σ_{i=1}^n { m_i log(1 + e^{a_iᵀx}) − y_i a_iᵀx } + λ Σ_{j=1}^p |x_j|.

To specify a proximal gradient algorithm, all we need is an envelope such as those commonly used in Variational Bayes. In this example, we use the simple quadratic majorizer with Lipschitz constant given by ‖AᵀA‖₂/4 = σ_max(A)/4, and a penalty coefficient λ set to 0.1σ_max(A).

Figure 2 shows the (adjusted) objective values per iteration with and without Nesterov acceleration. We can see the nondescent nature of the algorithm and the clear advantage of adding acceleration.
FIG. 2. (Adjusted) objective values for iterations of the proximal gradient method, with and without acceleration, applied to a logistic regression problem with an ℓ1-norm penalty.
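The accelerated variant compared in Figure 2 simply precedes each proximal gradient step with the momentum extrapolation of Section 3. A sketch (ours), reusing the pieces of the previous example with the standard weight (t − 1)/(t + 2):

```python
import numpy as np

# A sketch (ours) of the Nesterov-accelerated variant compared in Figure 2:
# momentum step z = x^t + (t-1)/(t+2)*(x^t - x^{t-1}), then the same
# gradient/soft-thresholding step as in the previous sketch.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def accelerated_logit_lasso(A, y, m, lam, n_iter=300):
    L = np.max(m) * np.linalg.norm(A, 2) ** 2 / 4.0
    x = x_prev = np.zeros(A.shape[1])
    for t in range(1, n_iter + 1):
        z = x + (t - 1.0) / (t + 2.0) * (x - x_prev)   # momentum (nondescent) step
        p = 1.0 / (1.0 + np.exp(-A @ z))
        grad = A.T @ (m * p - y)
        x_prev, x = x, soft_threshold(z - grad / L, lam / L)
    return x
```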
FIG. 3. Objective values for iterations of two proximal composite formulations applied to a multinomial logistic regression problem with a composite ℓ1-norm penalty. Both are run until the same numeric precision is reached.
where a_i are the column vectors of A and D^{(1)}x is the matrix operator of first-order differences in x. Since the Poisson loss function is not Lipschitz but still convex, we replace the constant gradient step with a back-tracking line-search step. Figure 4 shows the objective value results for each method, with and without acceleration. An alternative approach is given by Green (1990), who describes an implementation of an EM algorithm for penalized likelihood estimation.

7.4 L2-Norm Loss Plus Lq-Norm Penalty for 0 < q < 1

A common nonconvex penalty is the Lq-norm for 0 < q < 1. There are a number of ways of developing a proximal algorithm to solve such problems.
FIG. 4. (Adjusted) objective values for iterations of the proximal gradient method, with and without acceleration, applied to a Poisson regression problem with a fused ℓ1-norm penalty.
The proximal operator of the Lq-norm has a closed-form, multi-valued solution, and convergence results are available for proximal methods in Marjanovic and Solo (2013) and Attouch, Bolte and Svaiter (2013). For this example, we choose the former approach.

The regularization problem involves finding the minimizer of an L2-norm loss with an Lq-norm penalty for 0 < q < 1, where A_i is column i of A, and A_{−i}, x_{−i} have column/element i removed. Applied to a quadratic majorization scheme, we find that at iteration t

x_i^{t+1} = A_iᵀ(y − A_{−i}x_{−i}^t) / (A_iᵀA_i) = A_iᵀr^t/‖A_i‖₂² + x_i^t,

where r^t = y − Ax^t is the full residual.
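A sketch (ours) of one way to carry out this cyclic scheme: each coordinate is updated as above and then passed through the scalar proximal operator of the Lq penalty, which we evaluate here by a direct grid search rather than the closed-form, multi-valued expression of Marjanovic and Solo (2013).

```python
import numpy as np

# A sketch (ours) of cyclic descent for the l2 loss with an lq penalty,
# using a numerically evaluated scalar prox of lam*|.|^q at each coordinate.

def prox_lq_numeric(v, lam, q, grid=np.linspace(-10, 10, 20001)):
    """Scalar prox of lam*|z|^q by direct search (it can be multi-valued)."""
    vals = lam * np.abs(grid) ** q + 0.5 * (grid - v) ** 2
    return grid[np.argmin(vals)]

def cyclic_descent_lq(A, y, lam, q, n_sweeps=20):
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_sweeps):
        for i in range(d):
            r = y - A @ x + A[:, i] * x[i]          # residual with coordinate i removed
            norm_i = A[:, i] @ A[:, i]
            v = A[:, i] @ r / norm_i                # unpenalized coordinate update
            x[i] = prox_lq_numeric(v, lam / norm_i, q)
    return x
```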
FIG. 5. Penalty weight λ vs. MSE and q for an L2-norm error with an Lq-norm penalty, 0 < q < 1, estimated via cyclic descent and proximal solutions.
8. DISCUSSION
Proximal algorithms are a widely used approach for
solving optimization problems. They provide an elegant extension of the classical gradient-descent method
and have properties that—much like EM or MM
algorithms—can be used to derive many different ap-
proaches for solving a given problem.
For readers interested in further historical details, we
recommend Beck and Sabach (2015), who provide
a historical perspective on iterative shrinkage algo-
rithms by focusing mainly on the Weiszfeld algorithm
(Weiszfeld, 1937) for computing an ℓ1 median. The
split Lagrangian methods described here were origi-
nally developed by Hestenes (1969) and Rockafellar
(1974). More recently, there is work being done to ex-
tend the range of applicability of these methods outside
of the class of convex functions to the broader class of
functions satisfying the Kurdyka–Łojasiewicz inequal-
ity (Attouch, Bolte and Svaiter, 2013).
The purpose of our review has been to describe and
apply proximal algorithms to some archetypical opti-
mization problems that arise in statistics. These prob-
lems often involve composite functions that are rep-
resentable by a sum of a linear or quadratic enve-
lope, together with a function that has a closed-form
proximal operator that is easy to evaluate.

FIG. 6. Proximal results for the prostate data example under the Lq-norm penalty.

Many papers demonstrate the efficacy and breadth of application of this approach: for example, Micchelli et al.
(2013) and Micchelli, Shen and Xu (2011) study proximal operators for composite operators for L2-norm and ℓ1-norm/TV denoising models; Argyriou et al. (2011) describe numerical advantages of the proximal operator approach versus traditional fused lasso implementations; and Chen, Huang and Zhang (2013) provide a further class of fixed-point algorithms that advance the proximal approach in the composite setting.

Another nice property of proximal algorithms is the ease with which acceleration techniques can be applied. The most common approach involves Nesterov acceleration; see Nesterov (1983) and Beck and Teboulle (2004), who introduce a momentum term for gradient-descent algorithms applied to nonsmooth composite problems. Attouch and Bolte (2009) and Noll (2014) provide further convergence properties for nonsmooth functions. O'Donoghue and Candes (2015) use adaptive restart to improve the convergence rate of accelerated gradient schemes. Meng and Chen (2011) modify Nesterov's gradient method for strongly convex functions with Lipschitz continuous gradients. Allen-Zhu and Orecchia (2014) provide a simple interpretation of Nesterov's scheme as a two-step algorithm with gradient-descent steps which yield proximal (forward) progress coupled with mirror-descent (backward) steps with dual (backward) progress. By linearly coupling these two steps they improve convergence. Giselsson and Boyd (2014) also show how preconditioning can help with convergence for ill-conditioned problems.

There are a number of directions for future research on proximal methods in statistics, for example, exploring the use of Divide and Concur methods for exponential-family mixed models and studying the relationship between proximal splitting and variational Bayes methods in graphical models. Another interesting area of research involves combining proximal steps with MCMC algorithms (Pereyra, 2013). Of course, the proximal methods developed here are not designed to provide standard errors, and the advantage of MCMC methods is the ability to assess uncertainty through the full posterior distribution.

APPENDIX A: PROXIMAL GRADIENT CONVERGENCE

We now outline convergence results for the proximal gradient solution, given by (4), to the fixed point problem

x* = prox_{φ/λ}( x* − ∇l(x*)/λ ),

when l and φ are convex, lower semi-continuous and ∇l is Lipschitz continuous. We also assume that prox_{φ/λ} is nonempty and can be evaluated independently in each component.

Recalling the translation property of proximal operators stated in (12), we can say

x* = prox_{φ/λ}( x* − ∇l(x*)/λ ) = prox_{(φ(z)+λ∇l(x*)ᵀz)/λ}(x*)
   = argmin_z { φ(z) + ∇l(x*)ᵀ(z − x*) + (λ/2)‖x* − z‖² }.

By the proximal operator's minimizing properties, its solution x* satisfies

φ(x*) + ∇l(x)ᵀ(x* − x) + (λ/2)‖x* − x‖² ≤ φ(x),

providing a quadratic minorizer for F(w) in the form of

l(w) + φ(x*) + ∇l(w)ᵀ(x* − w) + (λ/2)‖w − x*‖² ≤ l(w) + φ(w) ≡ F(w).

The Lipschitz continuity of ∇l(x), that is,

l(x) ≤ l(w) + ∇l(w)ᵀ(x − w) + (γ/2)‖x − w‖²,

also gives us a quadratic majorizer

F(x) ≡ l(x) + φ(x) ≤ l(w) + φ(x) + ∇l(w)ᵀ(x − w) + (γ/2)‖x − w‖²,

which, when evaluated at x = x* and combined with our minorizer, yields

(λ − γ)(1/2)‖x* − w‖² ≤ F(w) − F(x*).

Thus, if we want to ensure that the objective value will decrease in this procedure, we need to fix λ ≥ γ. Furthermore, functional characteristics of l and φ, such as strong convexity, can improve the bounds in the steps above and guarantee good or optimal decreases in F(w) − F(x*).

Finally, when we compound up the errors we obtain an O(1/k) convergence bound. This can be improved by adding a momentum term that includes the first derivative information.

These arguments can be extended to Bregman divergences by way of the general law of cosines inequality:

D_φ(x, z) = D_φ(x, w) + D_φ(w, z) − (∇φ(z) − ∇φ(w))ᵀ(x − w),

so that D_φ(x, z) ≥ D_φ(x, w) + D_φ(w, z) where w = argmin_v D_φ(v, z).
APPENDIX B: NESTEROV ACCELERATION

A powerful addition is Nesterov acceleration. Consider a convex combination, with parameter θ, of upper bounds for the proximal operator inequality z = x and z = x*. We are free to choose variables z = θx* + (1 − θ)x^+ and w. If φ is convex, φ(θx + (1 − θ)x^+) ≤ θφ(x) + (1 − θ)φ(x^+), then we have

F(x^+) − F* − (1 − θ)( F(x) − F* )
  = F(x^+) − θF* − (1 − θ)F(x)
  ≤ λ(x^+ − w)ᵀ( θx* + (1 − θ)x − x^+ ) + (λ/2)‖x^+ − w‖²
  = (λ/2)( ‖w − (1 − θ)x − θx*‖² − ‖x^+ − (1 − θ)x − θx*‖² )
  = (θ²λ/2)( ‖u − x*‖² − ‖u^+ − x*‖² ),

where w is given in terms of the intermediate steps

θu = w − (1 − θ)x,
θu^+ = x^+ − (1 − θ)x,

introducing a sequence θ_t with iteration subscript t. The second identity, θu = x − (1 − θ)x^−, then yields an update for w as the current state x plus a momentum term, depending on the direction (x − x^−), namely,

w = (1 − θ_t)x + θ_t u = x + θ_t(θ_{t−1}^{-1} − 1)(x − x^−).

APPENDIX C: QUASI-CONVEX CONVERGENCE

Consider an optimization problem min_{x∈X} l(x) where l is quasi-convex, continuous and has a nonempty set of finite global minima. Let x^t be generated by the proximal point algorithm

x^{t+1} ∈ argmin_x { l(x) + (λ_t/2)‖x − x^t‖² }.

Papa Quiroz and Oliveira (2009) show that these iterates converge to the global minima, although the proximal operator at each step may be set-valued, due to the nonconvexity of l. A function l is quasi-convex when

l(θx + (1 − θ)z) ≤ max{ l(x), l(z) },

which accounts for a number of nonconvex functions like |x|^q, when 0 < q < 1, and functions involving appropriate ranges of log(x) and tanh(x). In this setting, using the level-sets generated by the sequence, that is, U = {x ∈ dom(l) : l(x) ≤ inf_t l(x^t)}, one finds that U is a nonempty closed convex set and that x^t is a Fejér sequence of finite length, Σ_t ‖x^{t+1} − x^t‖ < ∞, and that it converges to a critical point of l as long as min{l(x) : x ∈ R^d} is nonempty.

APPENDIX D: NONCONVEX: KURDYKA–ŁOJASIEWICZ (KL)

A locally Lipschitz function l : R^d → R satisfies KL at x̄ ∈ R^d if and only if there exist η ∈ (0, ∞), a neighborhood U of x̄ and a concave κ : [0, η] → [0, ∞) with κ(0) = 0, κ ∈ C¹ and κ′ > 0 on (0, η), such that for every x ∈ U with l(x̄) < l(x) < l(x̄) + η we have

κ′( l(x) − l(x̄) ) dist( 0, ∂l(x) ) ≥ 1,

where dist(0, A) := inf_{x∈A} ‖x‖₂.

The KL condition guarantees summability and therefore a finite length of the discrete subgradient trajectory. Using the KL properties of a function, one can show convergence for alternating minimization algorithms for problems like

min_{x,z} L(x, z) := l(x) + Q(x, z) + φ(z),

where ∇Q is Lipschitz continuous (see Attouch et al., 2010, Attouch, Bolte and Svaiter, 2013). A typical application involves solving min_{x∈R^d} { l(x) + φ(x) } via the augmented Lagrangian

L(x, z) = l(x) + φ(z) + λᵀ(x − z) + (ρ/2)‖x − z‖²,

where ρ is a relaxation parameter.

A useful class of functions that satisfy KL is one that possesses uniform convexity,

l(z) ≥ l(x) + uᵀ(z − x) + K‖z − x‖^p,  p ≥ 1, ∀u ∈ ∂l(x).

Then l satisfies KL on dom(l) for κ(s) = pK^{-1/p}s^{1/p}. For explicit convergence rates in the KL setting, see Frankel, Garrigos and Peypouquet (2015).

ACKNOWLEDGMENTS

We thank the participants at the 2014 ASA meetings for their comments. We also thank the Editor, Associate Editor and two anonymous referees for their help in improving the paper.
GREEN, P. J., ŁATUSZYŃSKI, K., PEREYRA, M. and ROBERT, C. P. (2015). Bayesian computation: A perspective on the current state, and sampling backwards and forwards. Preprint. Available at arXiv:1502.01148.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
HESTENES, M. R. (1969). Multiplier and gradient methods. J. Optim. Theory Appl. 4 303–320. MR0271809
HU, Y. H., LI, C. and YANG, X. Q. (2015). Proximal gradient algorithm for group sparse optimization.
KOMODAKIS, N. and PESQUET, J.-C. (2014). Playing with duality: An overview of recent primal–dual approaches for solving large-scale optimization problems. Preprint. Available at arXiv:1406.5429.
MAGNÚSSON, S., WEERADDANA, P. C., RABBAT, M. G. and FISCHIONE, C. (2014). On the convergence of alternating direction Lagrangian methods for nonconvex structured optimization problems. Preprint. Available at arXiv:1409.8033.
MARJANOVIC, G. and SOLO, V. (2013). On exact ℓq denoising. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on 6068–6072. IEEE, New York.
MARTINET, B. (1970). Brève communication. Régularisation d'inéquations variationnelles par approximations successives. ESAIM Math. Modell. Numer. Anal. 4 154–158.
MENG, X. and CHEN, H. (2011). Accelerating Nesterov's method for strongly convex functions with Lipschitz gradient. Preprint. Available at arXiv:1109.6058.
MICCHELLI, C. A., SHEN, L. and XU, Y. (2011). Proximity algorithms for image models: Denoising. Inverse Probl. 27 045009, 30. MR2781033
MICCHELLI, C. A., SHEN, L., XU, Y. and ZENG, X. (2013). Proximity algorithms for the L1/TV image denoising model. Adv. Comput. Math. 38 401–426. MR3019155
NESTEROV, YU. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). Sov. Math., Dokl. 27 372–376.
NIKOLOVA, M. and NG, M. K. (2005). Analysis of half-quadratic minimization methods for signal and image recovery. SIAM J. Sci. Comput. 27 937–966 (electronic). MR2199915
NOLL, D. (2014). Convergence of non-smooth descent methods using the Kurdyka–Łojasiewicz inequality. J. Optim. Theory Appl. 160 553–572. MR3180983
O'DONOGHUE, B. and CANDES, E. (2015). Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15 715–732.
PALMER, J., KREUTZ-DELGADO, K., RAO, B. D. and WIPF, D. P. (2005). Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems 18 1059–1066. Vancouver, BC, Canada.
PAPA QUIROZ, E. A. and OLIVEIRA, P. R. (2009). Proximal point methods for quasiconvex and convex functions with Bregman distances on Hadamard manifolds. J. Convex Anal. 16 49–69. MR2531192
PARIKH, N. and BOYD, S. (2013). Proximal algorithms. Foundations and Trends in Optimization 1 123–231.
PATRINOS, P. and BEMPORAD, A. (2013). Proximal Newton methods for convex composite optimization. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on 2358–2363. IEEE, New York.
PATRINOS, P., LORENZO, S. and ALBERTO, B. (2014). Douglas–Rachford splitting: Complexity estimates and accelerated variants. Preprint. Available at arXiv:1407.6723.
PEREYRA, M. (2013). Proximal Markov chain Monte Carlo algorithms. Preprint. Available at arXiv:1306.0187.
POLSON, N. G. and SCOTT, J. G. (2012). Local shrinkage rules, Lévy processes and regularized regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 287–311. MR2899864
POLSON, N. G. and SCOTT, J. G. (2015). Mixtures, envelopes, and hierarchical duality. J. Roy. Statist. Soc. Ser. B. To appear. Available at arXiv:1406.0177.
ROCKAFELLAR, R. T. (1974). Conjugate duality and optimization. Technical report, DTIC Document, 1973.
ROCKAFELLAR, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14 877–898. MR0410483
ROCKAFELLAR, R. T. and WETS, R. J.-B. (1998). Variational Analysis. Springer, Berlin. MR1491362
RUDIN, L., OSHER, S. and FATEMI, E. (1992). Nonlinear total variation based noise removal algorithms. Phys. D 60 259–268.
SHOR, N. Z. (1985). Minimization Methods for Nondifferentiable Functions. Springer, Berlin. MR0775136
TANSEY, W., KOYEJO, O., POLDRACK, R. A. and SCOTT, J. G. (2014). False discovery rate smoothing. Technical report, Univ. Texas at Austin.
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
TIBSHIRANI, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. Ann. Statist. 42 285–323. MR3189487
TIBSHIRANI, R. J. and TAYLOR, J. (2011). The solution path of the generalized lasso. Ann. Statist. 39 1335–1371. MR2850205
TIBSHIRANI, R., SAUNDERS, M., ROSSET, S., ZHU, J. and KNIGHT, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 91–108. MR2136641
VON NEUMANN, J. (1951). Functional Operators: The Geometry of Orthogonal Spaces. Princeton Univ. Press, Princeton, NJ.
WEISZFELD, E. (1937). Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Math. J. 43 355–386.
WITTEN, D. M., TIBSHIRANI, R. and HASTIE, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.
ZHANG, X., SAHA, A. and VISHWANATHAN, S. V. N. (2010). Regularized risk minimization by Nesterov's accelerated gradient methods: Algorithmic extensions and empirical studies. Preprint. Available at arXiv:1011.0472.
ZOU, H. and HASTIE, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320. MR2137327