
Statistical Science

2015, Vol. 30, No. 4, 559–581


DOI: 10.1214/15-STS530
© Institute of Mathematical Statistics, 2015

Proximal Algorithms in Statistics and Machine Learning

Nicholas G. Polson, James G. Scott and Brandon T. Willard

Nicholas G. Polson is Professor of Econometrics and Statistics and Brandon T. Willard is Research Consultant, Booth School of Business, University of Chicago, 5807 South Woodlawn Avenue, Chicago, Illinois 60637, USA (e-mail: [email protected]; [email protected]). James G. Scott is Associate Professor of Statistics, McCombs School of Business and Department of Statistics and Data Sciences, University of Texas at Austin, 2110 Speedway, B6500, Austin, Texas 78712, USA (e-mail: [email protected]).

Abstract. Proximal algorithms are useful for obtaining solutions to difficult optimization problems, especially those involving nonsmooth or composite objective functions. A proximal algorithm is one whose basic iterations involve the proximal operator of some function, whose evaluation requires solving a specific optimization problem that is typically easier than the original problem. Many familiar algorithms can be cast in this form, and this "proximal view" turns out to provide a set of broad organizing principles for many algorithms useful in statistics and machine learning. In this paper, we show how a number of recent advances in this area can inform modern statistical practice. We focus on several main themes: (1) variable splitting strategies and the augmented Lagrangian; (2) the broad utility of envelope (or variational) representations of objective functions; (3) proximal algorithms for composite objective functions; and (4) the surprisingly large number of functions for which there are closed-form solutions of proximal operators. We illustrate our methodology with regularized logistic and Poisson regression incorporating a nonconvex bridge penalty and a fused lasso penalty. We also discuss several related issues, including the convergence of nondescent algorithms, acceleration and optimization for nonconvex functions. Finally, we provide directions for future research in this exciting area at the intersection of statistics and optimization.

Key words and phrases: Bayes MAP, shrinkage, sparsity, splitting, Kurdyka–Łojasiewicz, nonconvex, envelopes, regularization, ADMM, optimization, Divide and Concur.

1. INTRODUCTION

1.1 Proximal Algorithms for Optimization

Optimization problems that involve a trade-off between model fit and model complexity sit at the heart of modern statistical practice. They arise, for example, in sparse regression (Tibshirani, 1996), spatial smoothing (Tibshirani et al., 2005), covariance estimation (Witten, Tibshirani and Hastie, 2009), image processing (Geman and Reynolds, 1992; Geman and Yang, 1995; Rudin, Osher and Fatemi, 1992), nonlinear curve fitting (Tibshirani, 2014), Bayesian MAP inference (Polson and Scott, 2012), multiple hypothesis testing (Tansey et al., 2014) and shrinkage/sparsity-inducing prior regularization problems (Green et al., 2015).

The goal of this paper is to introduce researchers in statistics and machine learning to the large body of literature on proximal algorithms for solving such optimization problems. By a proximal algorithm, we mean an algorithm whose steps involve evaluating a proximal operator related to some term in the objective function. Both of these concepts will be defined precisely in the next section, but the basic idea is simple. Evaluating a proximal operator requires solving a specific

optimization subproblem that is (one hopes) easier than the original problem of interest. By iteratively solving such subproblems, a proximal algorithm converges on the solution to the original problem. Chrétien and Hero III (2000) provide a general relation between EM and proximal point (PP) algorithms and show that the latter can provide dramatic improvements in rates of convergence.

The early foundational work in this area dates to the study of iterative fixed-point algorithms in Banach spaces (Von Neumann, 1951; Brègman, 1967; Hestenes, 1969; Martinet, 1970; Rockafellar, 1976). As these techniques matured, they became widely used in several different fields. As a result, they have been referred to by a diverse set of names, including proximal gradient, proximal point, alternating direction method of multipliers (ADMM) (Boyd et al., 2011), divide and concur (DC), Frank–Wolfe (FW), Douglas–Rachford splitting, operator splitting and alternating split Bregman (ASB) methods. The field of image processing has developed most of these ideas independently of statistics—for example, in the form of total variation (TV) de-noising and half-quadratic (HQ) optimization (Geman and Yang, 1995; Geman and Reynolds, 1992; Nikolova and Ng, 2005). Many other widely-known methods—including, for example, fast iterative shrinkage thresholding (FISTA), expectation maximization (EM), majorization-minimization (MM) and iteratively reweighted least squares (IRLS)—also fall into the proximal framework.

Recently there has been a spike of interest in proximal algorithms, with a handful of recent broad surveys appearing in the last few years (Cevher, Becker and Schmidt, 2014; Komodakis and Pesquet, 2014; Combettes and Pesquet, 2011; Boyd et al., 2011). Indeed, the use of specific proximal algorithms has become commonplace in statistics and machine learning (e.g., Bien, Taylor and Tibshirani, 2013; Tibshirani, 2014; Tansey et al., 2014). However, there has not been a real focus on the general family of approaches that underlie these algorithms, with specific attention to the issues of most direct interest to statisticians. Our review is designed to fill this gap.

The rest of the paper proceeds as follows. Section 1.2 provides notation and basic properties of proximal operators and envelopes. Section 2 describes the proximal operator and Moreau envelope. Section 3 describes the basic proximal algorithms and their extensions. Section 4 describes common algorithms and techniques, such as ADMM and Divide and Concur, that rely on proximal algorithms. Section 5 discusses envelopes and how proximal algorithms can be viewed as envelope gradients. Section 6 considers the general problem of composite operator optimization and shows how to compute the exact proximal operator with a general quadratic envelope and a composite regularization penalty. Section 7 illustrates the methodology with applications to logistic and Poisson regression with fused lasso penalties. A bridge regression penalty illustrates the nonconvex case and we apply our algorithm to the prostate data of Hastie, Tibshirani and Friedman (2009). Finally, Section 8 concludes with directions for future research, while Appendix A discusses convergence results for both convex and nonconvex cases together with Nesterov acceleration.

We also include several useful summaries in table form. Table 1 lists commonly used proximal operators, Table 2 documents several examples of half-quadratic envelopes, and Table 3 provides convergence rates for a variety of algorithms.

1.2 Notation

In this paper we consider optimization problems of the form

(1)  minimize_x  F(x) := l(x) + φ(x),

where l(x) is a measure of fit depending implicitly on some observed data y, and φ(x) is a regularization term that imposes structure or effects a favorable bias-variance trade-off. Often l(x) is a smooth function and φ(x) is nonsmooth—like a lasso or bridge penalty—so as to induce sparsity. We will assume that l and φ are convex and lower semi-continuous except when explicitly stated to be nonconvex.

We will pay particular attention to composite penalties of the form φ(Bx), where B is a matrix corresponding to some constraint or structural penalty, such as the discrete difference operator in fused lasso or polynomial trend filtering. We use x = (x_1, ..., x_d) to denote a d-dimensional parameter of interest, y an n-vector of outcomes, A a fixed n × d matrix whose rows are covariates (or features) a_i, and B a fixed k × d matrix, b a prior mean or target for shrinkage, and γ > 0 a regularization parameter that will trace out a solution path. Observations are indexed by i, parameters by j, and iterations of an algorithm by t. Unless stated otherwise, all vectors are column vectors. Putting these together, this paper treats general composite objectives of the form

(2)  F(x) := Σ_{i=1}^{n} l(y_i, a_i^T x) + γ Σ_{j=1}^{k} φ([Bx − b]_j).
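To make the notation concrete, here is a minimal sketch (not from the paper) that builds the first-difference matrix B used in fused-lasso-style penalties and evaluates a composite objective of the form (2) with a squared-error loss and an absolute-value penalty. The function and variable names are illustrative choices.

```python
import numpy as np

def first_difference_matrix(d):
    """Discrete first-difference operator, one common choice of B in (2)."""
    B = np.zeros((d - 1, d))
    for j in range(d - 1):
        B[j, j], B[j, j + 1] = -1.0, 1.0
    return B

def composite_objective(x, A, y, B, b, gamma):
    """F(x) = sum_i l(y_i, a_i'x) + gamma * sum_j phi([Bx - b]_j),
    here with squared-error loss l and absolute-value penalty phi."""
    fit = 0.5 * np.sum((y - A @ x) ** 2)     # loss term, summed over observations i
    penalty = np.sum(np.abs(B @ x - b))      # phi(u) = |u|, summed over rows j of B
    return fit + gamma * penalty

rng = np.random.default_rng(0)
n, d = 20, 10
A = rng.standard_normal((n, d))
x = rng.standard_normal(d)
y = A @ x + rng.standard_normal(n)
B = first_difference_matrix(d)               # fused-lasso style structural penalty
print(composite_objective(x, A, y, B, np.zeros(d - 1), gamma=0.5))
```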

TABLE 1
Commonly used proximal operators. Sources: Chaux et al. (2007), Hu, Li and Yang (2015)

Laplace: φ(x) = ω|x|; prox_{γφ}(x) = sgn(x) max(|x| − ω, 0)

Gaussian: φ(x) = τx²; prox_{γφ}(x) = x/(2τ + 1)

Group-sparse, ℓp: φ(x) = κ|x|^p; prox_{γφ}(x) = sgn(x)ρ, with ρ ≥ 0 such that ρ + pκρ^{p−1} = |x|. Special cases:
  p = 4/3: x + (4κ/(3·2^{1/3}))[(χ − x)^{1/3} − (χ + x)^{1/3}], χ = √(x² + 256κ³/729)
  p = 3/2: x + 9κ² sgn(x)[1 − √(1 + 16|x|/(9κ²))]/8
  p = 3: sgn(x)[√(1 + 12κ|x|) − 1]/(6κ)
  p = 4: [(χ + x)/(8κ)]^{1/3} − [(χ − x)/(8κ)]^{1/3}, χ = √(x² + 1/(27κ))

Gamma, Chi: φ(x) = −κ ln x + ωx; prox_{γφ}(x) = ½[x − ω + √((x − ω)² + 4κ)]

Double-Pareto: φ(x) = γ log(1 + |x|/a); prox_{γφ}(x) = (sgn(x)/2){|x| − a + √((a − |x|)² + 4d(x))}, d(x) = (a|x| − γ)₊

Huber dist. (ω, τ ∈ (0, +∞)): φ(x) = τx² if |x| ≤ ω/√(2τ), and ω√(2τ)|x| − ω²/2 otherwise; prox_{γφ}(x) = x/(2τ + 1) if |x| ≤ ω(2τ + 1)/√(2τ), and x − ω√(2τ) sgn(x) if |x| > ω(2τ + 1)/√(2τ)

Max-entropy dist. (p ∈ (1, +∞), p ≠ 2; ω, τ, κ ∈ (0, +∞)): φ(x) = ω|x| + τ|x|² + κ|x|^p; prox_{γφ}(x) = sgn(x) prox_{κ|·|^p/(2τ+1)}( max(|x| − ω, 0)/(2τ + 1) )

Smoothed-Laplace dist.: φ(x) = ω|x| − ln(1 + ω|x|); prox_{γφ}(x) = sgn(x)[ω|x| − ω² − 1 + √((ω|x| − ω² − 1)² + 4ω|x|)]/(2ω)

Exponential dist.: φ(x) = ωx if x ≥ 0, +∞ if x < 0; prox_{γφ}(x) = x − ω if x ≥ ω, 0 if x < ω

Uniform dist.: φ(x) = ι_{[−ω,ω]}(x); prox_{γφ}(x) = −ω if x < −ω, x if |x| ≤ ω, ω if x > ω

Triangular dist. (ω ∈ (−∞, 0], ω̂ ∈ (0, ∞)): φ(x) = −ln(x − ω) + ln(−ω) if x ∈ (ω, 0), −ln(ω̂ − x) + ln(ω̂) if x ∈ (0, ω̂), +∞ otherwise; prox_{γφ}(x) = [x + ω + √(|x − ω|² + 4)]/2 if x < 1/ω, and [x + ω̂ − √(|x − ω̂|² + 4)]/2 if x > 1/ω̂

Weibull dist. (p ∈ (1, +∞); ω, κ ∈ (−∞, 0]): φ(x) = −κ ln x + ωx^p if x > 0, +∞ if x ≤ 0; prox_{γφ}(x) = π such that pωπ^p + π² − xπ = κ

GIG dist. (ω, κ, ρ ∈ (−∞, 0]): φ(x) = −κ ln x + ωx + ρ/x if x > 0, +∞ if x ≤ 0; prox_{γφ}(x) = π such that π³ + (ω − x)π² − κπ = ρ
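As a companion to Table 1, here is a minimal sketch (not from the paper) implementing a few of the simpler closed-form proximal operators element-wise; the formulas follow the table entries above, and the parameter names (omega, tau) are just the table's symbols spelled out.

```python
import numpy as np

def prox_laplace(x, omega):
    """phi(x) = omega*|x|: soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - omega, 0.0)

def prox_gaussian(x, tau):
    """phi(x) = tau*x^2: linear shrinkage toward zero."""
    return x / (2.0 * tau + 1.0)

def prox_huber(x, omega, tau):
    """Huber-type penalty from Table 1, applied element-wise."""
    cutoff = omega * (2.0 * tau + 1.0) / np.sqrt(2.0 * tau)
    small = x / (2.0 * tau + 1.0)                    # quadratic region
    large = x - omega * np.sqrt(2.0 * tau) * np.sign(x)  # linear region
    return np.where(np.abs(x) <= cutoff, small, large)

x = np.array([-3.0, -0.2, 0.0, 0.7, 2.5])
print(prox_laplace(x, omega=1.0))
print(prox_gaussian(x, tau=0.5))
print(prox_huber(x, omega=1.0, tau=0.5))
```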

For example, lasso can be viewed as a simple statistical model with the negative log-likelihood from y = Ax + ε, where ε is a standard normal measurement error, corresponding to the norm l(x) = ‖Ax − y‖², and each parameter x_j has independent Laplace priors corresponding to the regularization penalty φ(x) = |x|. To keep the notation light, we overload the symbols l and φ: they can refer either to the overall loss and penalty terms [as in equation (1)] or to the individual component-wise terms that are added to produce the overall loss or penalty [as in equation (2)]. We have taken care to ensure that their meaning will always be clear in context.

We also use the following conventions: sgn(x) is the algebraic sign of x, and x₊ = max(x, 0); ι_C(x) is the set indicator function taking the value 0 if x ∈ C and ∞ if x ∉ C; R₊ = [0, ∞), R₊₊ = (0, ∞), and R̄ is the extended real line R ∪ {−∞, ∞}.

TABLE 2
Minimizers for the multiplicative and additive half-quadratic forms. For the multiplicative form, σ(t) = φ′(t)/t if t ≠ 0 and σ(t) = φ″(0⁺) if t = 0; for the additive form, σ(t) = ct − φ′(t). See Nikolova and Ng (2005)

Penalty φ(t) = min_s {Q(t, s) + ψ(s)}; minimizer for Q(t, s) = ½t²s (multiplicative); minimizer for Q(t, s) = (t − s)² (additive):

φ(t) = |t|^α, α ∈ (1, 2]: multiplicative σ(t) = α|t|^{α−2}

φ(t) = √(α + t²): multiplicative σ(t) = 1/√(α + t²); additive σ(t) = ct − t/√(α + t²)

φ(t) = |t|/α − log(1 + |t|/α): multiplicative σ(t) = 1/(α(α + |t|)); additive σ(t) = ct − t/(α(α + |t|))

φ(t) = t²/2 if |t| ≤ α, α|t| − α²/2 if |t| > α: multiplicative σ(t) = 1 if |t| ≤ α, α/|t| if |t| > α; additive σ(t) = (c − 1)t if |t| ≤ α, ct − α sgn(t) if |t| > α

φ(t) = log(cosh(αt)): multiplicative σ(t) = α tanh(αt)/t; additive σ(t) = ct − α tanh(αt)

φ(t) = −1/(1 + |t|): multiplicative σ(t) = −2 if t = 0, sgn(t)/(t(|t| + 1)²) otherwise; additive σ(t) = ct − sgn(t)/(|t| + 1)²

φ(t) = −1/(1 + √|t|): multiplicative σ(t) = −∞ if t = 0, 1/(2|t|^{3/2}(√|t| + 1)²) otherwise; additive σ(t) = ct − 1/(2√|t|(√|t| + 1)²)

Further preliminaries. We now briefly introduce several useful concepts and definitions to be described further in subsequent sections. First, splitting is a key tool that exploits an equivalence between an unconstrained optimization problem and a constrained one that includes a latent or slack variable z. For example, suppose that the original problem is

minimize_x  l(x) + φ(Bx).

To apply splitting to this problem, we formulate the equivalent problem

minimize_{x,z}  l(x) + φ(z)  subject to  Bx = z,

so that the objective is split into two terms involving separate sets of primal variables.

The convex conjugate of l(x), denoted l*, is defined as

l*(λ) = sup_x { λ^T x − l(x) }.

The conjugate function l*(λ) is the point-wise supremum of a family of affine (and therefore convex) functions of λ; it is convex even when l(x) is not. But if l(x) is convex (and closed and proper), then we also have that l(x) = sup_λ { λ^T x − l*(λ) }, so that l and l* are dual to one another. If l(x) is differentiable, the maximizing value of λ is λ̂(x) = ∇l(x).

The convex conjugate is our first example of an envelope, which is a way of representing functions in terms of a pointwise extremum of a family of functions indexed by a latent variable. Another example is

TABLE 3
Error rates and per-iteration costs for common algorithms. See Duckworth (2014)

Accelerated gradient descent: convex O(1/√ε); strongly convex O(log(1/ε)); per-iteration cost O(n)
Proximal gradient descent: convex O(1/ε); strongly convex O(log(1/ε)); per-iteration cost O(n)
Accelerated proximal gradient descent: convex O(1/√ε); strongly convex O(log(1/ε)); per-iteration cost O(n)
ADMM: convex O(1/ε); strongly convex O(log(1/ε)); per-iteration cost O(n)
Frank–Wolfe / conditional gradient algorithm: convex O(1/ε); strongly convex O(1/√ε); per-iteration cost O(n)
Newton's method: O(log log(1/ε)); per-iteration cost O(n³)
Conjugate gradient descent: O(n); per-iteration cost O(n²)
L-BFGS: between O(log(1/ε)) and O(log log(1/ε)); per-iteration cost O(n²)

a quadratic envelope, where we represent l as

l(x) = inf_z { ½ x^T Λ(z) x − η(z)^T x + ψ(z) }

for some Λ, η, ψ. We will draw heavily on the use of envelope (or variational) representations of functions.

A function g(x) is said to majorize another function f(x) at x₀ if g(x₀) = f(x₀) and g(x) ≥ f(x) for all x ≠ x₀. If the same relation holds with the inequality sign flipped, g(x) is said to be a minorizing function for f(x).

The subdifferential of a function f at the point x is defined as the set

∂f(x) = { v : f(z) ≥ f(x) + v^T(z − x), ∀z, x ∈ dom(f) }.

Any such element is called a subgradient. If the function is differentiable, then the subdifferential is a singleton set comprising the ordinary gradient from differential calculus.

Finally, a ρ-strongly convex function satisfies

f(x) ≥ f(z) + u^T(x − z) + (ρ/2)‖x − z‖₂², where u ∈ ∂f(z),

while a ρ-smooth function satisfies

f(x) ≤ f(z) + ∇f(z)^T(x − z) + (ρ/2)‖x − z‖₂²  ∀x, z.

2. PROXIMAL OPERATORS AND MOREAU ENVELOPES

2.1 Basic Properties

Our perspective throughout this paper will be to view a proximal algorithm as taking a gradient-descent step for a suitably defined envelope function. By constructing different envelopes, one can develop new optimization algorithms. We build up to this perspective by first discussing the basic properties of the proximal operator and its relationship to the gradient of the standard Moreau envelope.

Let f(x) be a lower semi-continuous function, and let γ > 0 be a scalar. The Moreau envelope f^γ(x) and proximal operator prox_{γf}(x) with parameter γ are defined as

(3)  f^γ(x) = inf_z { f(z) + (1/2γ)‖z − x‖₂² } ≤ f(x),
     prox_{γf}(x) = argmin_z { f(z) + (1/2γ)‖z − x‖₂² }.

Intuitively, the Moreau envelope is a regularized version of f. It approximates f from below and has the same set of minimizing values (Rockafellar and Wets, 1998, Chapter 1G). The proximal operator specifies the value that solves the minimization problem defined by the Moreau envelope. It balances the two goals of minimizing f and staying near x, with γ controlling the trade-off. Table 1 provides an extensive list of closed-form solutions.
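The definitions in (3) can be checked numerically. The following minimal sketch (not from the paper) evaluates the Moreau envelope and proximal operator of f(x) = |x| by brute-force grid minimization and compares the result to the known closed forms (soft thresholding for the prox, and the Huber function for the envelope); the grid resolution and test points are arbitrary choices.

```python
import numpy as np

def moreau_envelope_and_prox(f, x, gamma, grid):
    """Evaluate (3) by brute-force minimization over a grid of z values."""
    vals = f(grid) + (grid - x) ** 2 / (2.0 * gamma)
    k = np.argmin(vals)
    return vals[k], grid[k]          # (f^gamma(x), prox_{gamma f}(x))

f = np.abs
grid = np.linspace(-5, 5, 200001)
gamma = 1.0
for x0 in [-2.0, 0.3, 1.5]:
    env, prox = moreau_envelope_and_prox(f, x0, gamma, grid)
    # Closed forms for f = |.|: the prox is soft thresholding at gamma,
    # and the envelope is the Huber function.
    prox_exact = np.sign(x0) * max(abs(x0) - gamma, 0.0)
    env_exact = x0**2 / (2*gamma) if abs(x0) <= gamma else abs(x0) - gamma/2
    print(x0, (round(env, 4), round(prox, 4)), (round(env_exact, 4), prox_exact))
```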

Parikh and Boyd (2013) provide several interesting interpretations of the proximal operator. Each one provides some intuition about why proximal operators might be useful in optimization. We highlight three of these interpretations here.

First, the proximal operator behaves similarly to a gradient-descent step for the function f. There are many ways of motivating this connection, but one simple way is to consider the Moreau envelope f^γ(x). Observe that the Moreau derivative is

∂f^γ(x) = ∂ inf_z { f(z) + (1/2γ)‖z − x‖² } = (1/γ)(x − ẑ(x)),

where ẑ(x) = prox_{γf}(x) is the value that achieves the minimum. Hence,

prox_{γf}(x) = x − γ ∂f^γ(x).

Thus, evaluating the proximal operator can be viewed as a gradient-descent step for the Moreau envelope, with γ as a step-size parameter.

Second, the proximal operator generalizes the notion of the Euclidean projection. To see this, consider the special case where f(x) = ι_C(x) is the set indicator function of some convex set C. Then prox_f(x) = argmin_{z∈C} ‖x − z‖₂² is the ordinary Euclidean projection of x onto C. This suggests that, for other functions, the proximal operator can be thought of as a generalized projection. A constrained optimization problem min_{x∈C} f(x) has an equivalent solution as an unconstrained proximal operator problem. Proximal approaches are, therefore, directly related to convex relaxation and quadratic majorization, through the addition of terms like (ρ/2)‖x − v‖² to an objective function, where ρ might be a constant that bounds an operator or the Hessian of a function. We can choose where these quadratic terms are introduced, which variables the terms can involve, and the order in which optimization steps are taken. The envelope framework highlights such choices, leading to many distinct and familiar algorithms.

Finally, there is a close connection between proximal operators and fixed-point theory, in that prox_{γf}(x*) = x* if and only if x* is a minimizing value of f(x). To see this informally, consider the proximal minimization algorithm, in which we start from some point x₀ and repeatedly apply the proximal operator:

x^{t+1} = prox_{γf}(x^t) = x^t − γ ∇f^γ(x^t).

At convergence, we reach a minimum point x* of the Moreau envelope, and thus a minimum of the original function. At this minimizing value, we have ∇f^γ(x*) = 0, and thus prox_{γf}(x*) = x*.

Another key property of proximal operators is the Moreau decomposition for the proximal operator of f*, the dual of f:

(4)  x = prox_{λf}(x) + λ prox_{f*/λ}(x/λ),
     (I − prox_{λf})(x) = λ prox_{f*/λ}(x/λ).

The Moreau identity allows one to easily alter steps within a proximal algorithm so that some computations are performed in the dual (or primal) space. Applications of this identity can also succinctly explain the relationship between a number of different optimization algorithms, as described in Section 6.

All three of these ideas—taking gradient-descent steps, projecting points onto constraint regions, and finding fixed points of suitably defined operators—arise routinely in many classical optimization algorithms. It is therefore easy to imagine that the proximal operator, which relates to all these ideas, could also prove useful.

2.2 Simple Examples of Proximal Operators

Many intermediate steps in statistical optimization problems can be written very concisely in terms of proximal operators of log-likelihoods or penalty functions. However, this conciseness is practically useful only if the proximal operator can be evaluated in closed form or at modest computational cost. Here are two simple examples where this holds.

First, Figure 1 shows a simple proximal operator and Moreau envelope. The solid black line shows the function f(x) = |x|, and the dotted line shows the corresponding Moreau envelope f¹(x) with parameter γ = 1. The grey line shows the function |x| + (1/2)(x − x₀)² for x₀ = 1.5, whose minimum (shown as a red cross) defines the Moreau envelope and proximal operator. This point has ordinate prox_f(x₀) = 0.5 and abscissa f¹(x₀) = 1, and is closer than x₀ to the overall minimum at x = 0. The blue circle shows the point (x₀, f¹(x₀)), emphasizing the point-wise construction of the Moreau envelope in terms of a simple optimization problem.

FIG. 1. A simple example of the proximal operator and Moreau envelope. The solid black line shows the function f(x) = |x|, and the dotted line shows the corresponding Moreau envelope with parameter γ = 1. The grey line shows the function |x| + (1/2)(x − x₀)² for x₀ = 1.5, whose minimum (shown as a red cross) defines the Moreau envelope and proximal operator.

Let φ(x) = λ‖x‖₁ and consider the proximal operator prox_{γφ}(x). In this case the proximal operator is clearly separable in the components of x, and the problem that must be solved for each component is

minimize_{z∈R}  λ|z| + (γ/2)(z − x)².

This problem has solution

(5)  ẑ = prox_{λ|·|/γ}(x) = sgn(x)(|x| − λ/γ)₊ = S_{λ/γ}(x),

the soft-thresholding operator with parameter λ/γ.

As a second example, quadratic terms of the form

(6)  l(x) = ½ x^T P x + q^T x + r

are very common in statistics. They correspond to conditionally Gaussian sampling models and arise in weighted least squares problems, in ridge regression and in EM algorithms based on scale-mixtures of normals. For example, if we assume that we observe data (y|x) ∼ N(Ax, Ω⁻¹), then l(x) = (y − Ax)^T Ω (y − Ax)/2, or

P = A^T Ω A,  q = −A^T Ω y,  r = y^T Ω y/2

in the general form given above (6). If l(x) takes this form, its proximal operator (with parameter 1/γ) may be directly computed as

prox_{l/γ}(x) = (P + γI)⁻¹(γx − q),

assuming the relevant inverse exists.
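The two examples above translate directly into code. The sketch below (not from the paper) implements the soft-thresholding solution (5) and the quadratic-form proximal operator, and verifies the latter by checking the first-order optimality condition of its defining problem; the simulated least-squares quantities are illustrative, and Ω is taken to be the identity for simplicity.

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*|.| applied element-wise, as in (5)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_quadratic(x, P, q, gamma):
    """prox_{l/gamma}(x) for l(x) = 0.5 x'Px + q'x + r (r is irrelevant)."""
    d = P.shape[0]
    return np.linalg.solve(P + gamma * np.eye(d), gamma * x - q)

rng = np.random.default_rng(1)
n, d = 30, 5
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)
P, q = A.T @ A, -A.T @ y            # least-squares quadratic form as in (6)
x0 = rng.standard_normal(d)
gamma = 2.0

z = prox_quadratic(x0, P, q, gamma)
# z should satisfy the optimality condition of l(z) + (gamma/2)||z - x0||^2.
grad = P @ z + q + gamma * (z - x0)
print(np.allclose(grad, 0.0, atol=1e-8))
print(soft_threshold(np.array([-2.0, 0.1, 1.5]), 0.5))
```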

3. PROXIMAL ALGORITHMS: SIMPLE EXAMPLES

3.1 The Proximal Gradient Method

We note by way of introduction that starting at a point x₀ and iteratively applying the proximal operator of some function f is the most basic proximal algorithm for finding the minimum of that function. It is usually called the proximal point method or, simply, proximal iteration. It is not widely useful, since taking proximal point steps is typically no easier than simply minimizing f directly.

One of the simplest nontrivial proximal algorithms is the proximal gradient method, which provides an important starting point for the more advanced techniques we describe in subsequent sections. Suppose as in (1) that the objective function is F(x) = l(x) + φ(x), where l(x) is differentiable but φ(x) is not. An archetypal case is that of a generalized linear model with a nondifferentiable penalty designed to encourage sparsity. The proximal gradient method is well suited for such problems. It has only two basic steps which are iterated until convergence:

(1) Gradient step. Define an intermediate point v^t by taking a gradient step with respect to the differentiable term l(x):

v^t = x^t − γ ∇l(x^t).

(2) Proximal operator step. Evaluate the proximal operator of the nondifferentiable term φ(x) at the intermediate point v^t:

(7)  x^{t+1} = prox_{γφ}(v^t) = prox_{γφ}(x^t − γ ∇l(x^t)).

This can be motivated in several ways. We outline what is perhaps the most transparent motivation for statisticians by showing that the proximal gradient is an MM (majorize/minimize) algorithm.

Suppose that l(x) has a Lipschitz-continuous gradient with modulus γ_l. This allows us to construct a majorizing function for l(x), and therefore for the whole objective. Whenever γ ∈ (0, 1/γ_l], we have the majorization

l(x) + φ(x) ≤ l(x₀) + (x − x₀)^T ∇l(x₀) + (1/2γ)‖x − x₀‖₂² + φ(x),

with equality at x = x₀. Simple algebra shows that the optimum value of the right-hand side is

x̂ = argmin_x { φ(x) + (1/2γ)‖x − u‖₂² },

where

u = x₀ − γ ∇l(x₀).

Thus, to find the minimum of the majorizing function, we perform precisely the two steps prescribed by the proximal gradient method: (1) form the intermediate point u by taking the gradient-descent step for l(x) from x₀, and (2) evaluate the proximal operator of φ at this point u.

The fact that we may write this method as an MM algorithm leads to the following basic convergence result. Suppose that:

1. l(x) is convex with domain R^n.
2. ∇l(x) is Lipschitz continuous with modulus γ_l, that is, ‖∇l(x) − ∇l(z)‖₂ ≤ γ_l ‖x − z‖₂ ∀x, z.
3. φ is closed and convex, ensuring that prox_{γφ} makes sense.
4. The optimal value is finite and obtained at x*.

If these conditions are met, then the proximal gradient method converges at rate 1/t with fixed step size γ = 1/γ_l (Beck and Teboulle, 2010).

The proximal gradient method can also be interpreted as a means for finding the fixed point of a "forward–backward" operator derived from the standard optimality conditions from subdifferential calculus. For this reason the method is sometimes referred to as forward–backward splitting. This has connections (not pursued here) with the forward–backward method for solving partial differential equations. We refer the reader to Parikh and Boyd (2013) for details.

3.2 Iterative Shrinkage Thresholding

Consider the proximal gradient method applied to a quadratic-form log-likelihood (6), as in a weighted least squares problem, with a penalty function φ(x). Then ∇l(x) = A^T Ω A x − A^T Ω y, and the proximal gradient method becomes

x^{t+1} = prox_{γ^t φ}(x^t − γ^t A^T Ω (Ax^t − y)).

This algorithm has been widely studied under the name of IST, or iterative shrinkage thresholding (Figueiredo and Nowak, 2003). Its primary computational costs at each iteration are as follows: (1) multiplying the current iterate x^t by A, and (2) multiplying the residual Ax^t − y by A^T. Typically, the proximal operator for φ will be simple to compute, as in the case of a quadratic or ℓ₁-norm/lasso penalty discussed in the previous section. Thus, the evaluation of the proximal operator will contribute a negligible amount to the overall complexity of the algorithm.
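As a concrete illustration of Sections 3.1–3.2, here is a minimal sketch (not from the paper) of the IST/proximal gradient iteration for an ℓ₁-penalized least-squares objective with Ω = I; the fixed step size uses the Lipschitz modulus of the quadratic loss, and the simulated data are illustrative.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Proximal gradient (IST) for 0.5*||Ax - y||^2 + lam*||x||_1."""
    d = A.shape[1]
    x = np.zeros(d)
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/gamma_l
    for _ in range(n_iter):
        v = x - gamma * A.T @ (A @ x - y)        # gradient step on l(x)
        x = soft_threshold(v, gamma * lam)       # proximal step on the penalty
    return x

rng = np.random.default_rng(2)
n, d = 50, 100
A = rng.standard_normal((n, d))
x_true = np.zeros(d); x_true[:5] = 3.0
y = A @ x_true + 0.1 * rng.standard_normal(n)
print(ista(A, y, lam=1.0)[:8].round(2))
```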

3.3 Proximal Newton

As we have described, the proximal gradient method is a generalization of classical gradient approaches. It uses only first-order information about the smooth term l(x). However, one can naturally use higher-order expansions to construct different envelopes that take into account second-order information about l, leading to improvements analogous to the manner in which Newton's method improves upon gradient descent. Consider a family of functions of the form

F_H(x, z) = l(z) + ∇l(z)^T(x − z) + ½(x − z)^T H_z (x − z),

and use this to define an envelope in the manner of (3). Then we can calculate the generalized proximity operator

(8)  prox_{F_H}(z) = z − (γ⁻¹ I + H_z)⁻¹ ∇l(z).

Instead of directly using the Hessian, H_z = ∇²l(z), approximations can be employed, leading to quasi-Newton-style approaches. As we will soon describe, the second-order bound, and approximations to the Hessian, are one way to interpret the half-quadratic (HQ) approach, as well as introduce quasi-Newton methods into the proximal framework. Proximal Newton methods are even possible for some nonconvex problems, as in Chouzenoux, Pesquet and Repetti (2014) and Appendix D.

3.4 Nesterov Acceleration

A useful feature of proximal algorithms is the ability to use acceleration techniques (Nesterov, 1983), often referred to as Nesterov acceleration. Acceleration leads to nondescent algorithms that can provide substantial increases in efficiency versus their nonaccelerated counterparts.

The idea of acceleration is to add an intermediate "momentum" variable z, prior to evaluating the forward and backward steps:

z^{t+1} = x^t + θ_{t+1}(θ_t⁻¹ − 1)(x^t − x^{t−1}),
x^{t+1} = prox_{γ⁻¹φ}(z^{t+1} − γ⁻¹ ∇l(z^{t+1})),

where standard choices are θ_t = 2/(t + 1) and θ_{t+1}(θ_t⁻¹ − 1) = (t − 1)/(t + 2).

When φ is convex, the proximal problem is strongly convex, and advanced acceleration techniques can be used (Zhang, Saha and Vishwanathan, 2010; Meng and Chen, 2011).
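The momentum step of Section 3.4 is a one-line modification of the proximal gradient iteration. The following sketch (not from the paper) adds the weight (t − 1)/(t + 2) to the ℓ₁-penalized least-squares example; the data and penalty level are illustrative choices.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def accelerated_prox_gradient(A, y, lam, n_iter=300):
    """Proximal gradient for 0.5||Ax-y||^2 + lam||x||_1 with the momentum
    step z^{t+1} = x^t + w_t (x^t - x^{t-1}) of Section 3.4."""
    d = A.shape[1]
    x_prev = x = np.zeros(d)
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    for t in range(1, n_iter + 1):
        w = (t - 1.0) / (t + 2.0)              # theta_{t+1}(theta_t^{-1} - 1)
        z = x + w * (x - x_prev)               # momentum variable
        x_prev = x
        v = z - gamma * A.T @ (A @ z - y)      # forward (gradient) step at z
        x = soft_threshold(v, gamma * lam)     # backward (proximal) step
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 80))
x_true = np.zeros(80); x_true[:4] = 2.0
y = A @ x_true + 0.05 * rng.standard_normal(40)
print(accelerated_prox_gradient(A, y, lam=0.5)[:6].round(2))
```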

4. REDUNDANCY, SPLITTING AND THE AUGMENTED LAGRANGIAN

4.1 Overview

In this section we show how the splitting technique described in the Introduction leads to many well-known proximal algorithms. As a running example, consider the problem of minimizing l(x) + φ(x), where we apply the splitting strategy to formulate the equivalent problem

(9)  minimize_{x,z}  l(x) + φ(z)  subject to  x − z = 0.

The advantage of such a variable-splitting approach is that now the fit and penalty terms are decoupled in the objective function of the primal problem. A standard tactic for exploiting this fact is to write down and solve the dual problem corresponding to the original (primal) constrained problem. This is sometimes referred to as dualization. Many well-known references exist on this topic (e.g., Bertsekas, 2011). For this reason we focus on problem formulation and algorithms for solving (9), avoiding standard material on duality or optimality conditions.

4.2 Dual Ascent, the Augmented Lagrangian and Scaled Form

Consider first the ordinary Lagrangian of problem (9):

L(x, z, λ) = l(x) + φ(z) + λ^T(x − z),

with Lagrange multiplier λ. The dual function is g(λ) = inf_{x,z} L(x, z, λ), and the dual problem is to maximize g(λ).

Let p* and d* be the optimal values of the primal and dual problems, respectively. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. Moreover, we may recover a primal-optimal point (x*, z*) from a dual-optimal point λ* using the fact that

(x*, z*) = argmin_{x,z} L(x, z, λ*)  ⟺  0 ∈ ∂_{x,z} L(x*, z*, λ*).

The idea of dual ascent is to solve the dual problem using gradient ascent, exploiting the fact that

∇g(λ) = ∇_λ L(x̂_λ, ẑ_λ, λ),  where  (x̂_λ, ẑ_λ) = argmin_{x,z} L(x, z, λ).

Thus, the required gradient is simply the residual for the primal constraint: ∇_λ L(x, z, λ) = x − z. Therefore, dual ascent involves iterating two steps:

(x^{t+1}, z^{t+1}) = argmin_{x,z} L(x, z, λ^t),
λ^{t+1} = λ^t + α_t (x^{t+1} − z^{t+1})

for appropriate step size α_t.

An obvious issue with dual ascent for problem (9) is that the update in x and z must be done jointly, rather than one at a time. This is rarely practical for problems of this form. But a discussion of dual ascent is an important starting point for building up to more realistic algorithms.

We also note that in the case where g is not differentiable, it is possible to replace the gradient with the negative of a subgradient of −g, leading to dual subgradient ascent; see Shor (1985).

Augmented Lagrangian and the method of multipliers. Take problem (9) as before, with Lagrangian L(x, z, λ) = l(x) + φ(z) + λ^T(x − z). The augmented-Lagrangian approach (also known as the method of multipliers) seeks to stabilize the intermediate steps of dual ascent by adding a ridge-like term to the Lagrangian:

L_γ(x, z, λ) = l(x) + φ(z) + λ^T(x − z) + (γ/2)‖x − z‖₂²,

where γ is a scale or step-size parameter. One way of viewing this augmented Lagrangian is as the standard Lagrangian for the equivalent problem

minimize_{x,z}  l(x) + φ(z) + (γ/2)‖x − z‖₂²  subject to  x − z = 0.

We can see that this is equivalent to the original because, for any primal-feasible x and z, the new objective takes the same value as the original objective, and thus has the same minimum. The dual function corresponding to this augmented Lagrangian is g_γ(λ) = inf_{x,z} L_γ(x, z, λ), which is differentiable and strongly convex under mild conditions. (The ordinary dual function need not be either of these things, which is a key advantage of using the augmented Lagrangian.)

The method of multipliers is to use dual ascent for the modified problem, iterating

(x^{t+1}, z^{t+1}) = argmin_{x,z} L_γ(x, z, λ^t),
λ^{t+1} = λ^t + γ (x^{t+1} − z^{t+1}).

Thus, the dual-variable update does not change compared to standard dual ascent. But the joint (x, z) update has a regularization term added to it, whose magnitude depends upon the tuning parameter γ. Notice that the step size γ is used in the dual-update step.

Scaled form. Many proximal algorithms have more concise updates when the dual variable λ is expressed in scaled form. Specifically, rescale the dual variable as u = γ⁻¹λ. We can rewrite the augmented Lagrangian in terms of u as

L_γ(x, z, u) = l(x) + φ(z) + γ u^T(x − z) + (γ/2)‖x − z‖₂²
             = l(x) + φ(z) + (γ/2)‖r + u‖₂² − (γ/2)‖u‖₂²,

where r = x − z is the primal residual. This leads to the following dual-update formulas:

(x^{t+1}, z^{t+1}) = argmin_{x,z} { l(x) + φ(z) + (γ/2)‖x − z + u^t‖₂² },
u^{t+1} = u^t + (x^{t+1} − z^{t+1}).

Bregman iteration. The augmented Lagrangian method for solving ℓ₁-norm problems is called "Bregman iteration" in the compressed-sensing literature. Here the goal is to solve the exact-recovery problem via basis pursuit:

minimize_x  ‖x‖₁  subject to  Ax = y,

where y is measured, x is the unknown signal, and A is a known "short and fat" sensing matrix (meaning more coordinates of x than there are observations).

The scaled-form augmented Lagrangian corresponding to this problem is

L_γ(x, u) = ‖x‖₁ + (γ/2)‖Ax − y + u‖₂² − (γ/2)‖u‖₂²,

with steps

x^{t+1} = argmin_x { ‖x‖₁ + (γ/2)‖Ax − z^t‖₂² },
z^{t+1} = y + z^t − Ax^{t+1},

where we have redefined z^t = y − u^t compared to the usual form of the dual update. Thus, each intermediate step of the Bregman iteration is like a lasso regression problem. (This algorithm also has an alternate derivation in terms of Bregman divergences, hence its name.)
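The Bregman iteration just described is easy to sketch in code. In the following minimal example (not from the paper), each x-update—a lasso-type problem—is solved only approximately by a fixed number of inner proximal gradient steps; the number of inner/outer iterations and the choice γ = 1 are illustrative assumptions rather than recommendations.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def bregman_basis_pursuit(A, y, gamma=1.0, n_outer=50, n_inner=200):
    """Bregman iteration for: minimize ||x||_1 subject to Ax = y.
    Each x-update is argmin_x ||x||_1 + (gamma/2)||Ax - z||^2, solved
    here (approximately) by inner proximal gradient steps."""
    d = A.shape[1]
    x = np.zeros(d)
    z = y.copy()
    step = 1.0 / (gamma * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_outer):
        for _ in range(n_inner):                       # inner IST solve
            v = x - step * gamma * A.T @ (A @ x - z)
            x = soft_threshold(v, step)
        z = y + z - A @ x                              # Bregman update of z
    return x

rng = np.random.default_rng(4)
n, d = 30, 80                                          # "short and fat" A
A = rng.standard_normal((n, d))
x_true = np.zeros(d); x_true[rng.choice(d, 4, replace=False)] = 1.0
y = A @ x_true
x_hat = bregman_basis_pursuit(A, y)
print(round(np.linalg.norm(A @ x_hat - y), 4), int(np.sum(np.abs(x_hat) > 1e-3)))
```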

4.3 ADMM

The alternating-direction method of multipliers (or ADMM) is a proximal algorithm that combines three ideas for solving problems like (9): splitting, the augmented Lagrangian, and alternating-direction updates. Recall that the scaled-form augmented Lagrangian for this problem is

L_γ(x, z, u) = l(x) + φ(z) + (γ/2)‖x − z + u‖₂² − (γ/2)‖u‖₂².

ADMM is similar to dual ascent for this problem, except that we optimize the Lagrangian in x and z individually, rather than jointly, in each pass. (Hence, "alternating direction.") For our problem, the updates become

x^{t+1} = argmin_x { l(x) + (γ/2)‖x − z^t + u^t‖₂² } = prox_{l/γ}(z^t − u^t),
z^{t+1} = argmin_z { φ(z) + (γ/2)‖x^{t+1} − z + u^t‖₂² } = prox_{φ/γ}(x^{t+1} + u^t),
u^{t+1} = u^t + (x^{t+1} − z^{t+1}).

The first two steps are to evaluate the proximal operators of l and φ, respectively.
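These three ADMM updates translate almost line for line into code. The following minimal sketch (not from the paper) applies them to the illustrative choice l(x) = ½‖Ax − y‖² and φ(z) = λ‖z‖₁ under the splitting x − z = 0; both proximal operators have the closed forms given earlier, and the parameter settings are arbitrary.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def admm_lasso(A, y, lam, gamma=1.0, n_iter=200):
    """Scaled-form ADMM for l(x) = 0.5||Ax - y||^2, phi(z) = lam||z||_1,
    with the splitting x - z = 0 of Section 4.3."""
    d = A.shape[1]
    x = z = u = np.zeros(d)
    # prox_{l/gamma}(v) = (A'A + gamma I)^{-1}(A'y + gamma v); form the system once.
    M = A.T @ A + gamma * np.eye(d)
    Aty = A.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(M, Aty + gamma * (z - u))   # prox_{l/gamma}(z - u)
        z = soft_threshold(x + u, lam / gamma)          # prox_{phi/gamma}(x + u)
        u = u + x - z                                   # scaled dual update
    return z

rng = np.random.default_rng(5)
A = rng.standard_normal((60, 120))
x_true = np.zeros(120); x_true[:6] = 1.5
y = A @ x_true + 0.1 * rng.standard_normal(60)
print(admm_lasso(A, y, lam=1.0)[:8].round(2))
```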

4.4 Divide and Concur

Divide and Concur (e.g., Gravel and Elser, 2008) is another type of splitting strategy that provides a general approach to statistical models that require optimization of a sum of J + 1 composite functions of the form

minimize_x  Σ_{j=1}^{J} l_j(A_j x) + φ(x).

In Divide and Concur, we add slack variables z_j for j ∈ {1, ..., J + 1} to divide the problem, together with equality constraints so that the solutions concur. Specifically, we form the equivalent constrained optimization problem

minimize_{x,z}  Σ_{j=1}^{J+1} l_j(z_j)  subject to  z_j = A_j x,

where l_{J+1} = φ and A_{J+1} = I. This can be solved using an iterative proximal splitting algorithm (e.g., multiple ADMM, split Bregman). For example, under ADMM (Parikh and Boyd, 2013) the updates are

x_j^{t+1} = prox_{λ l_j ∘ A_j}(x̄^t − u_j^t),
u_j^{t+1} = u_j^t + x_j^{t+1} − x̄^{t+1},

where x̄^t = (1/(J + 1)) Σ_{j=1}^{J+1} x_j^t.

Divide and Concur methods provide a natural approach to hierarchical models or to very large problems—for example, where each l_j corresponds to a negative log-likelihood for a subset of the data stored on one machine. In this case, DC allows the overall problem to be broken into many tractable, independently computable subproblems via splitting. Only the intermediate solutions to these subproblems, rather than every subset of the actual data, need to be broadcast between machines.

4.5 Other Forms of Redundancy

Other redundant parameterizations are certainly possible, beyond the basic splitting strategy considered here. For example, consider the case of an exponential-family model for outcome y with cumulant-generating function ψ(z) and with natural parameter z:

p(y) = p₀(y) exp{ yz − ψ(z) }.

There is a unique Bregman divergence associated with every exponential family. It corresponds precisely to the relationship between the natural parameterization and the mean-value parameterization. There is a corresponding class of Bregman proximal point algorithms.

In a generalized linear model, the natural parameter for outcome y_i is a linear regression on covariates, z_i = a_i^T x. In this case l(x) may be written as

l(x) = Σ_{i=1}^{N} l_i(x),  where  l_i(x) = ψ(a_i^T x) − y_i a_i^T x,

up to an additive constant not depending on x. Now introduce slack variables z_i = a_i^T x. This leads to the equivalent primal problem

min_{x,z}  Σ_{i=1}^{N} { ψ(z_i) − y_i z_i } + φ(x)  subject to  Ax − z = 0.

For example, in a Poisson model (y_i | μ_i) ∼ Pois(μ_i), μ_i = exp(θ_i) with natural parameter θ_i = a_i^T x. The cumulant generating function is b(θ) = exp(θ), and thus d(μ) = μ log μ − μ. After simplification, the divergence is D_d(y, μ) = y log(y/μ) − (y − μ). The optimization problem can then be split as

min_{x,z}  Σ_{i=1}^{N} (z_i − y_i log z_i) + φ(x)  subject to  a_i^T x = log z_i.

These same optimization problems arise when one considers scale mixtures, or convex variational forms (Palmer et al., 2005; Polson and Scott, 2015). The connection is made explicit by the dual function for a density and its relationship with scale-mixture decompositions. For instance, one can obtain the following equality for appropriate densities p(x), q(z) and constants μ, κ:

−log p(x) = −sup_{z>0} log{ p_N(x; μ + κ/z, z⁻¹) q(z) }
          = inf_{z>0} { (z/2)(x − μ − κ/z)² − log(√z q(z)) },

where p_N(x; μ, σ²) is the density function for a normal distribution with mean μ and variance σ². The form resulting from this normal scale-mixture envelope is similar to the half-quadratic envelopes described in Section 5. Polson and Scott (2015) describe these relationships in further detail.

5. ENVELOPE METHODS

In this section we describe several types of envelopes: the forward–backward (FB) envelope, the Douglas–Rachford (DR) envelope, the half-quadratic (HQ) envelope, and the Bregman divergence envelopes. These all build upon the idea of a Moreau envelope and lead to analogous proximal algorithms. Within this framework, various algorithms may be generated in terms of gradient steps for the corresponding envelope. (For instance, ADMM methods will be viewed as the gradient step of the dual FB envelope.) Section 6 dissects these envelopes in further detail, shows their relationship to Lagrangian approaches, and provides a framework within which they can be derived and extended.

5.1 Forward–Backward Envelope

Suppose as in (9) that we have to minimize F = l + φ, under the assumptions that l is strongly convex and possesses a continuous gradient with Lipschitz constant γ_l, so that ‖∇²l(x)‖ ≤ γ_l, and that φ is proper lower semi-continuous and convex.

First, we define the FB envelope, F_γ^{FB}(x), which will possess some desirable properties (see Patrinos and Bemporad, 2013):

F_γ^{FB}(x) := min_v { l(x) + ∇l(x)^T(v − x) + φ(v) + (1/2γ)‖v − x‖² }
             = l(x) − (γ/2)‖∇l(x)‖² + φ^γ(x − γ ∇l(x)).

If we pick γ ∈ (0, γ_l⁻¹), the matrix I − γ∇²l(x) is symmetric and positive definite. The stationary points of the envelope F_γ^{FB}(x) are the solutions x* of the original problem, which satisfy x = prox_{γφ}(x − γ∇l(x)). This follows from the derivative information

∇F_γ^{FB}(x) = (I − γ∇²l(x)) G_γ(x),

where G_γ(x) = γ⁻¹(x − P_γ(x)) and P_γ(x) = prox_{γφ}(x − γ∇l(x)).

With these definitions, we can establish the following descent property for gradient steps based on the FB envelope:

F_γ^{FB}(x) ≤ F(x) − (γ/2)‖G_γ(x)‖²,
F(P_γ(x)) ≤ F_γ^{FB}(x) − (γ/2)(1 − γγ_l)‖G_γ(x)‖².

Hence, for γ ∈ (0, γ_l⁻¹), the envelope value always decreases on application of the proximal operator of γφ, and we can determine the stationary points. See Appendix A for further details.

5.2 Douglas–Rachford Envelope

Mimicking the forward–backward approach, Patrinos, Lorenzo and Alberto (2014) define the Douglas–Rachford (DR) envelope as

F_γ^{DR}(x) = l^γ(x) − (γ/2)‖∇l^γ(x)‖² + φ^γ(x − 2γ ∇l^γ(x))
            = min_z { l(x′) + ∇l(x′)^T(z − x′) + φ(z) + (1/2γ)‖z − x′‖² },

where we recall that l^γ is the Moreau envelope of the function l and x′ = prox_{γl}(x).

This can be interpreted as a backward–backward envelope. It is a special case of a FB envelope evaluated at the proximal operator of γl, namely,

F_γ^{DR}(x) = F_γ^{FB}(prox_{γl}(x)).

Again, the gradient of this envelope produces a proximal algorithm (see Patrinos, Lorenzo and Alberto, 2014) which converges to the minimum of {l(x) + φ(x)}. The iterations are

w^{t+1} = prox_{γl}(x^t),
z^{t+1} = prox_{γφ}(2w^{t+1} − x^t),
x^{t+1} = x^t + (z^{t+1} − w^{t+1}).

There are many ways to rearrange the basic DR algorithm. For example, with an intermediate variable, v = w − x, we could equally well iterate

w^{t+1} = prox_{γl}(x^t − v^t),
x^{t+1} = prox_{γφ}(w^{t+1} + v^t),
v^{t+1} = v^t + w^{t+1} − x^{t+1}.
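The Douglas–Rachford iterations above can be sketched directly. The example below (not from the paper) uses the illustrative choice l(x) = ½‖Ax − y‖² and φ = λ‖·‖₁, for which prox_{γl} solves a fixed linear system and prox_{γφ} is soft thresholding; the step size γ and data are arbitrary assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def douglas_rachford(A, y, lam, gamma=1.0, n_iter=300):
    """DR iterations for l(x) = 0.5||Ax - y||^2 and phi = lam*||.||_1:
       w = prox_{gamma l}(x); z = prox_{gamma phi}(2w - x); x <- x + z - w."""
    d = A.shape[1]
    x = np.zeros(d)
    M = gamma * A.T @ A + np.eye(d)        # linear system behind prox_{gamma l}
    gAty = gamma * A.T @ y
    for _ in range(n_iter):
        w = np.linalg.solve(M, gAty + x)               # w^{t+1} = prox_{gamma l}(x^t)
        z = soft_threshold(2 * w - x, gamma * lam)     # z^{t+1} = prox_{gamma phi}(2w - x)
        x = x + z - w
    return np.linalg.solve(M, gAty + x)                # primal point prox_{gamma l}(x)

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 90))
x_true = np.zeros(90); x_true[:5] = 2.0
y = A @ x_true + 0.1 * rng.standard_normal(50)
print(douglas_rachford(A, y, lam=1.0)[:7].round(2))
```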

5.3 Half-Quadratic Envelopes

We now provide an illustration of a quasi-Newton algorithm within the class of half-quadratic (HQ) optimization problems (Geman and Yang, 1995; Geman and Reynolds, 1992). This envelope applies to the commonly used L2-norm where l(x) = ‖Ax − y‖², and can be used in conjunction with some nonconvex φ. See Nikolova and Ng (2005) for convergence rates and comparisons of the different algorithms.

The half-quadratic (HQ) envelope is defined by

F^{HQ}(x) = inf_v { Q(x, v) + ψ(v) },

where

Q(x, v) = vx²  or  (v − x)²,

that is, the function Q(x, v) is "half-quadratic" in the variable v. In the HQ framework, the term ψ(v) is usually understood to be the convex conjugate of some function, for example, ψ(v) = φ*(x).

As an initial example, suppose that we wish to minimize the function

F(x) = ½‖Ax − y‖² + γ Φ(x),  where  Φ(x) = Σ_{i=1}^{d} φ((B^T x − b)_i),

and that the penalty is specified in terms of the representation φ(x) = F^{HQ}(x). Then we need to minimize the higher-dimensional function

F(x, v) = ½‖Ax − y‖² + γ Σ_{i=1}^{d} Q(δ_i, v_i) + γ Σ_{i=1}^{d} ψ(v_i),

where δ_i = (B^T x − b)_i.

This establishes an equivalence between gradient linearization and quasi-Newton. These algorithms give the iterative mappings

x^{t+1} = L(v̂(x^t))⁻¹ A^T y  and  x^{t+1} = x^t − L(x^t)⁻¹ ∇_x F(x^t),

respectively, where L(x^t) is a step-size function. They turn out to be identical, with derivative information

∇_x F(x) = A^T A x − A^T y + γ Σ_{i=1}^{d} (φ′(δ_i)/δ_i) B_i B_i^T x
         = (A^T A + γ B V(x) B^T) x − A^T y
         = L(v̂(x)) x − A^T y

for V(x) = diag(v̂(δ_i), i = 1, ..., d) and L(v̂(x)) = A^T A + γ B V(x) B^T.

See Polson and Scott (2015) for further explanation of the half-quadratic class of penalties.

5.4 Bregman Divergence Envelopes

Many statistical models, such as those generated by an exponential family distribution, can be written in terms of a Bregman divergence. One is then faced with the joint minimization of an objective function of the form F(x, v) = D(x, v) + φ(x) + ψ(v). To minimize over (x, v), we can use an alternating Bregman projection method. To perform the minimization of v given x, we can make use of the D-Moreau envelope, which is defined by

φ^D(x) = inf_v { D(x, v) + φ(v) },

where D(x, v) is a Bregman divergence. A key feature here is that a Bregman divergence satisfies a three-point law-of-cosines triangle inequality, which helps to establish the descent property for proximal algorithms derived from these envelopes (see Appendix A). Many commonly used EM, MM and variational EM algorithms in statistics implicitly use envelopes of this type.

6. PROXIMAL ALGORITHMS FOR COMPOSITE FUNCTIONS

6.1 Overview

Building off the general objective in (1), we now consider the optimization of a general composite objective of the form

F(x) := l(x) + φ(Bx)

or, in split form,

(10)  minimize_{x,z}  l(x) + φ(z)  subject to  Bx = z.

Composite penalties arise in statistical models that account for structural constraints or spatiotemporal correlations (e.g., Tibshirani and Taylor, 2011; Tibshirani, 2014; Tansey et al., 2014). The most famous examples of problems in this class are total-variation denoising (Rudin, Osher and Fatemi, 1992) and the fused lasso (Tibshirani et al., 2005).

We start by noting that many approaches for solving this problem, including the ones in Section 4, can be characterized in terms of one of the four general forms of the objective functions/Lagrangians that result from appealing to splitting and conjugate functions:

primal:        F(x) = l(x) + φ(Bx),
split–primal:  F_SP(x, z, λ) = l(x) + φ(z) + λ^T(Bx − z),
primal–dual:   F_PD(x, λ) = l(x) + λ^T(Bx) − φ*(λ),
split–dual:    F_SD(x, z, λ) = l*(z) + φ*(λ) + x^T(−B^T λ − z).

From a statistical perspective, it is natural to think of z and λ as latent variables, and of each of these splitting/duality strategies as defining a higher-dimensional objective function. Such ideas are familiar in statistics, where alternating minimization, iterated conditional mode (ICM), EM and MM algorithms have a long history (e.g., Dempster, Laird and Rubin, 1977; Csiszár and Tusnády, 1984; Besag, 1986). Indeed, Polson and Scott (2015) show how many such algorithms that appeal to convex conjugacy have a natural EM-like interpretation in terms of missing data.

For problem (10), the motivation for using the primal–dual and the split forms (see Esser, Zhang and Chan, 2010) lies in how they decouple φ from the linear mapping B; it is precisely the composition of these functions that poses the difficulty for problems like TV denoising and the fused lasso. Note that the primal–dual formulation follows from profiling the slack variable z out of the split–primal objective:

inf_z L(x, z, λ) = inf_z { l(x) + φ(z) + λ^T(Bx − z) } = l(x) + λ^T Bx − φ*(λ),

and the split–dual by a similar argument. These two formulations are related via the Max–Min inequality (Boyd and Vandenberghe, 2004):

sup_q inf_v F(q, v) ≤ inf_v sup_q F(q, v).

In the special case of closed proper convex functions, we have

min_x F(x) = min_x sup_λ F_PD(x, λ)
           = max_λ min_{x,z} F_SP(x, z, λ)
           = max_x min_{λ,z} F_SD(x, z, λ),

where we exploit the fact that

φ(Bx) = sup_z { z^T Bx − φ*(z) }

whenever φ is convex. F_SP(x, z, λ) and F_PD(x, λ) are also related by

min_{z≥0} F_SP(x, z, λ) = min_{z≥0} { φ(z) + l(x) + λ^T(Bx − z) }
  = l(x) + λ^T Bx + min_{z≥0} { φ(z) − λ^T z }
  = l(x) + λ^T Bx − φ*(λ)
  = F_PD(x, λ).

6.2 Proximal Solutions

In most statistical problems of form (10), it is typically the case that closed-form expressions for one or more of l(x), l*(z), φ(z) or φ*(λ) will be unavailable or inefficient to compute. However, exact solutions to related problems that share the same critical points may be easily accessible. We now step through several such approaches for solving (10), explaining how they relate to the ideas introduced thus far. We highlight whenever proximal operators enter the analysis. Because proximal operators are so well understood, their presence in an algorithm is convenient: the properties of proximal operators and the associated fixed-point theory can simplify otherwise lengthy constructions and convergence arguments. Moreover, by exploiting the proximal operator's known properties, like the Moreau identity, one can move easily between the different formulations above, and thus between the primal and dual spaces. It is also worth mentioning that the efficacy of certain acceleration techniques can depend on which formulation is used, and therefore implicitly on the specific proximal steps taken. We refer the reader to Beck and Teboulle (2014) for further discussion.

First, proximal operators arise naturally whenever we augment the Lagrangian for problem (10), which entails adding a ridge term to the split–primal objective:

(11)  L_ρ(x, z, λ) = l(x) + φ(z) + λ^T(Bx − z) + (ρ/2)‖Bx − z‖²
                   = F_SP(x, z, λ) + (ρ/2)‖Bx − z‖².

As already detailed, this leads naturally to an ADMM algorithm whose intermediate iterates involve proximal operators.

Second, we also are not restricted to using the proximal operators directly implied by one of these four problem formulations, such as those that appear when l, l*, φ and/or φ* contain quadratic terms. We can also apply a surrogate or approximation (e.g., an envelope or majorizer) to certain terms. For example, when exact solutions to the composite proximal operator are not available, one can consider "linearizing" (ρ/2)‖Bx − z‖² with (ρ/2λ_B)‖x − z‖², where σ_max(B^T B) < λ_B, yielding

F_SP(x, z, λ) + (ρ/2)‖Bx − z‖² ≤ F_SP(x, z, λ) + (ρ/2λ_B)‖x − z‖².

This approach can be seen as a simple majorization and, when combined with the proximal solution for z, as a forward–backward envelope for the subproblem. Implementations of this approach include the linearized ADMM technique or the split inexact Uzawa method, and are described in the context of Lagrangians by Chen and Teboulle (1994) and primal–dual algorithms in Chambolle and Pock (2011). Magnússon et al. (2014) detail splitting methods in terms of augmented Lagrangians for nonconvex objectives.

Finally, one can represent one of the terms in the objective using one of the envelopes described in Section 5, in which case the iterates of the resulting algorithm will involve proximal operators. In fact, the envelope representation can itself be seen as a way to encode the iterates in each of a problem's latent/slack/splitting terms as proximal operators.

An example: the primal–dual. To demonstrate these ideas, we give an example of how proximal operators and their properties can be used to derive an algorithm starting from the primal–dual formulation

max_λ min_x { l(x) + λ^T(Bx) − φ*(λ) }.

First, notice that the argmin for the subproblem in x, l(x) + λ^T(Bx), can be characterized in terms of the following fixed point whenever γ_l > 0:

x = prox_{γ_l(l(x) + λ^T Bx)}(x).

We now use the fact that

(12)  prox_{g(z) + u^T z}(q) = prox_g(q − u),

for a generic function g(z) and variables q, z and u; this is obtained by completing the square in the definition of the operator. Appealing to (12) gives

x* = prox_{γ_l(l(x) + λ^T Bx)}(x*) = prox_{γ_l l}(x* − γ_l B^T λ).

Now we're left with only the subproblem in λ:

max_λ { l(x*) + λ^T Bx* − φ*(λ) } = −min_λ { φ*(λ) − λ^T Bx* − l(x*) }.

We can take yet another proximal step, for the minimization of φ*(λ) − λ^T(Bx*), in λ with step size γ_φ. Using (12) and (4), we find that the argmin satisfies

λ* = prox_{γ_φ φ*}(λ* + γ_φ Bx*).

Using the Moreau decomposition in (4), we can derive yet another strategy. Note that

prox_{γ_φ φ*}(λ + γ_φ Bx*) = (1/γ_φ)(I − prox_{φ/γ_φ})(γ_φ(λ + Bx*)).

Hence, we can characterize the solution to the primal–dual problem in terms of fixed points of the following two operators:

(13)  x* = prox_{γ_l l}(x* − γ_l B^T λ*),
      λ* = (1/γ_φ)(I − prox_{φ/γ_φ})(γ_φ(λ* + Bx*)).

If we separate the last step implied by (13) into two steps and simplify by setting γ_l = γ_φ = 1, we arrive at

x* = prox_l(x* − B^T u*),
w* = prox_φ(u* + Bx*),
u* = u* − (w* − Bx*).

This has the same basic form of techniques like ADMM, alternating split Bregman, split inexact Uzawa and so forth. See Chen, Huang and Zhang (2013) for more details.
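To illustrate a primal–dual iteration of this kind in code, the sketch below (not from the paper) follows the Chambolle and Pock (2011) variant cited above, with explicit step sizes τ and σ that I chose arbitrarily; the dual update uses the Moreau identity exactly as in the derivation above, which for φ = λ‖·‖₁ reduces to a projection onto the box ‖·‖_∞ ≤ λ. The fused-lasso test problem is an illustrative choice.

```python
import numpy as np

def first_difference_matrix(d):
    B = np.zeros((d - 1, d))
    for j in range(d - 1):
        B[j, j], B[j, j + 1] = -1.0, 1.0
    return B

def primal_dual_fused_lasso(A, y, lam, n_iter=500):
    """Primal-dual iterations for 0.5||Ax - y||^2 + lam*||Bx||_1, with B the
    first-difference operator. The dual prox is a box projection (Moreau identity)."""
    n, d = A.shape
    B = first_difference_matrix(d)
    tau = sigma = 0.9 / np.linalg.norm(B, 2)   # step sizes with tau*sigma*||B||^2 < 1
    M = tau * A.T @ A + np.eye(d)              # linear system behind prox_{tau l}
    tAty = tau * A.T @ y
    x = x_bar = np.zeros(d)
    dual = np.zeros(d - 1)
    for _ in range(n_iter):
        dual = np.clip(dual + sigma * B @ x_bar, -lam, lam)          # prox of sigma*phi*
        x_new = np.linalg.solve(M, tAty + x - tau * B.T @ dual)      # prox_{tau l}
        x_bar = 2 * x_new - x                                        # extrapolation
        x = x_new
    return x

rng = np.random.default_rng(7)
A = rng.standard_normal((60, 40))
x_true = np.concatenate([np.zeros(15), 2.0 * np.ones(15), np.zeros(10)])
y = A @ x_true + 0.1 * rng.standard_normal(60)
print(primal_dual_fused_lasso(A, y, lam=2.0).round(1))
```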

6.3 Composition in General Quadratic Envelopes

Consider now the most general form of a quadratic envelope involving a composite penalty function:

(14)  F(x) = inf_z { ½ x^T Λ(z) x − η(z)^T x + φ(Bx) },

where Λ(z) is symmetric positive definite. Such forms can arise when one majorizes l(x) using a second-order approximation of l around z. This general quadratic case, in which Λ(z) is not necessarily diagonal, encompasses the approaches of Geman and Yang (1995) and Geman and Reynolds (1992), and can be addressed with splitting techniques.

If B^T B is positive definite, a proximal point solution can be obtained by setting l(x) = x^T Λ(z) x − η^T x in (13). The general solution to a quadratic-form proximal operator (6), together with the split–dual formulation, implies a proximal point algorithm that exploits the fact that the optimal values satisfy

x* = prox_{γ_l l(x)}(x* − γ_l B^T z*) = (I + γ_l Λ(z*))⁻¹(x* − γ_l B^T z* + γ_l η),
z* = (1/γ_φ)(I − prox_{φ/γ_φ})(γ_φ(z* + Bx*)).

This formulation introduces the subproblem of solving a system of linear equations. Using the exact solution to this system would reflect methods that involve Levenberg–Marquardt steps, quasi-Newton methods and Tikhonov regularization, and is related to the use of second-order Taylor approximations to an objective function. Naturally, the efficiency of computing exact solutions depends very much on the properties of I + γ_l Λ(z), since the system defined by this term will need to be solved on each iteration of a fixed-point algorithm. When Λ(z) is constant, a decomposition can be performed at the start and reused, so that solutions are computed quickly at each step. For some matrices, this can mean only O(n) operations per iteration. In general, however, the post-startup iteration cost is O(n²).

Other approaches, like those in Chen, Huang and Zhang (2013) and Argyriou et al. (2011), do not attempt to directly solve the aforementioned system of equations. Instead they use a forward–backward algorithm on the dual objective, F_PD. In particular, we call attention to the approach of Argyriou et al. (2011). They show how to evaluate the proximal operator of φ(Bx) directly, by finding the fixed point of the operator

H_κ = κI + (1 − κ)H,  for κ ∈ (0, 1),  where

H(v) := (I − prox_{γ⁻¹φ})(BA⁻¹η + (I − γ BA⁻¹B^T)v)  ∀v ∈ R^p.

Here 0 < γ < 2/σ_max(BA⁻¹B^T) and A = Λ(z). The operator H is understood to be nonexpansive, so, by Opial's theorem, one is guaranteed convergence; when H is a contraction, this convergence is linear. After finding the fixed point v*, one sets x* = A⁻¹(η − B^T v*).

7. APPLICATIONS

7.1 Logit Loss Plus Lasso Penalty

To illustrate our approach, we simulate observations from the model

(y_i | p_i) ∼ Binom(m_i, p_i),  p_i = logit⁻¹(a_i^T x),

where i = 1, ..., 100, a_i is a row vector of A ∈ R^{100×300}, and x ∈ R^{300}. The A matrix is simulated from N(0, 1) variates and normalized column-wise. The signal x is also simulated from N(0, 1) variates, but with only 10% of entries being nonzero.

Here m_i are the number of trials and y_i the number of successes. The composite objective function for sparse logistic regression is then given by

argmin_x  Σ_{i=1}^{n} { m_i log(1 + e^{a_i^T x}) − y_i a_i^T x } + λ Σ_{j=1}^{p} |x_j|.

To specify a proximal gradient algorithm, all we need is an envelope such as those commonly used in Variational Bayes. In this example, we use the simple quadratic majorizer with Lipschitz constant given by ‖A^T A‖₂/4 = σ_max(A)/4, and a penalty coefficient λ set to 0.1 σ_max(A).

Figure 2 shows the (adjusted) objective values per iteration with and without Nesterov acceleration. We can see the nondescent nature of the algorithm and the clear advantage of adding acceleration.
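A minimal sketch of the Section 7.1 setup is given below (not from the paper). It runs the proximal gradient method on the binomial logit loss with an ℓ₁ penalty; the step size uses the quadratic bound on the loss, scaled by max_i m_i (the m_i = 1 case matches the ‖A^T A‖₂/4 constant quoted above), and the simulation sizes and penalty level are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_logistic_prox_gradient(A, y, m, lam, n_iter=500):
    """Proximal gradient for sum_i [m_i*log(1+exp(a_i'x)) - y_i*a_i'x] + lam*||x||_1."""
    d = A.shape[1]
    x = np.zeros(d)
    # Lipschitz bound for the loss gradient: (max_i m_i) * ||A'A||_2 / 4.
    gamma = 4.0 / (np.max(m) * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ x))          # logit^{-1}(a_i'x)
        grad = A.T @ (m * p - y)                  # gradient of the binomial logit loss
        x = soft_threshold(x - gamma * grad, gamma * lam)
    return x

rng = np.random.default_rng(8)
n, d = 100, 300
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=0)                    # column-wise normalization
x_true = np.zeros(d)
idx = rng.choice(d, 30, replace=False)            # 10% nonzero entries
x_true[idx] = rng.standard_normal(30)
m = np.full(n, 10)
y = rng.binomial(m, 1.0 / (1.0 + np.exp(-A @ x_true)))
x_hat = sparse_logistic_prox_gradient(A, y, m, lam=1.0)
print(int(np.sum(np.abs(x_hat) > 1e-6)))
```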

FIG. 2. (Adjusted) objective values for iterations of the proximal gradient method, with and without acceleration, applied to a logistic regression problem with an ℓ₁-norm penalty.

7.2 Logit Fused Lasso

To illustrate a logit fused lasso problem, we compare a Geman–Reynolds inspired quadratic envelope for the multinomial logit loss and a fused lasso penalty with the standard Lipschitz-bounded gradient step. We define the following quantities:

Λ(v) = 2 Σ_{i=1}^{n} m_i λ(a_i^T v) a_i a_i^T = 2 A^T diag(m · λ(Av)) A,
η = 2 Σ_{i=1}^{n} (y_i − m_i/2) a_i,

where λ(v) = (1/2v)(1/(1 + e^{−v}) − 1/2). Now we compute x^t, conditional on w, for the envelope φ(x) = ‖D^{(1)}x‖₁:

Σ_{i=1}^{n} { m_i log(1 + e^{a_i^T x}) − y_i a_i^T x } + ‖D^{(1)}x‖₁
  = min_w { ½ x^T Λ(w) x − η^T x + c(w) } + γ ‖D^{(1)}x‖₁.

To do this, we employ the Picard–Opial composite method of Argyriou et al. (2011).

Simulations were performed in a similar fashion as Section 7.1, but with N = 100, M = 400, m = 2, and where D^{(1)}x has a fused lasso construction consisting of first-order differences of x. Figure 3 shows the objective values for iterations of each formulation. With the use of second-order information, we have extremely fast convergence to the solution.

For data preconditioning, we perform the following decompositions: A = UΣV^T, the singular value decomposition (SVD), and Λ⁻¹(v) = ½ A⁻¹ D⁻¹ A⁻ᵀ, where D = diag(m · λ(Av)). This implies that one SVD of A, or generalized inverse, is required to compute all future Λ⁻¹(v), thus providing computational savings.

F IG . 3. Objective values for iterations of two proximal composite formulations applied to a multinomial logistic regression problem with a
composite 1 -norm penalty. Both are run until the same numeric precision is reached.

where ai are the column vectors of A and D (1) x is the implementation of an EM algorithm for penalized like-
matrix operator of first-order differences in x. Since lihood estimation.
the Poisson loss function is not Lipschitz but still con-
7.4 L2 -Norm Loss Plus Lq -Norm Penalty for
vex, we replace the constant gradient step with a back-
0<q <1
tracking line search. This can be accomplished with a
back-tracking line search step. A common nonconvex penalty is the Lq -norm for
Figure 4 shows the objective value results for each 0 < q < 1. There are a number of ways of developing a
method, with and without acceleration. An alternative proximal algorithm to solve such problems. The prox-
approach is given by Green (1990), who describes an imal operator of Lq -norm has a closed-form, multi-

F IG . 4. (Adjusted) objective values for iterations of the proximal gradient method, with and without acceleration, applied to a Poisson
regression problem with a fused 1 -norm penalty.
multi-valued solution, and convergence results are available for proximal methods in Marjanovic and Solo (2013) and Attouch, Bolte and Svaiter (2013). For this example, we choose the former approach.

The regularization problem involves finding the minimizer of an L2-norm loss with an Lq-norm penalty for 0 < q < 1, so that

x̂_λ^q := argmin_x { (1/2) ‖y − Ax‖² + λ Σ_{j=1}^p |x_j|^q }.

The component-wise, set-valued proximal Lq-norm operator is given by

prox_{λφ_q}(v) = 0                       if |v| < h_{λ,q},
prox_{λφ_q}(v) ∈ { 0, sgn(v) b_{λ,q} }   if |v| = h_{λ,q},
prox_{λφ_q}(v) = sgn(v) x̂                if |v| > h_{λ,q},

where

b_{λ,q} = [ 2λ(1 − q) ]^{1/(2−q)},
h_{λ,q} = b_{λ,q} + λ q b_{λ,q}^{q−1},

and x̂ solves x̂ + λ q x̂^{q−1} = |v| with x̂ ∈ [ b_{λ,q}, |v| ].

Attouch, Bolte and Svaiter (2013) describe how the objective for this problem is a Kurdyka–Łojasiewicz (KL) function, which provides convergence results for an inexact (multi-valued proximal operator) forward–backward algorithm given by

x^{t+1} ∈ prox_{λ γ_t ‖·‖_q^q} ( x^t − γ_t A⊤( A x^t − b ) ).

Interestingly, the KL convergence results for forward–backward splitting on appropriate nonconvex continuous functions bounded below imply that the solution choice for multi-valued proximal maps—as in the Lq-norm case—does not affect the convergence properties. See Appendix D for more information.

An alternative approach is the variational representation of the Lq-norm; however, this does not satisfy the convergence conditions of Allain, Idier and Goussard (2006) within the half-quadratic framework.

Marjanovic and Solo (2013) detail how cyclic descent can be used to apply the proximal operator in a per-coordinate fashion under a squared-error loss. The cyclic descent method is derived from the following algebra. First, a single solution to the squared-error loss minimization problem can be given for a component i of x by

0 = ∇_i l(x) = A_i⊤( Ax − y ) = A_i⊤( A_i x_i + A_{−i} x_{−i} − y ),

where A_i is column i of A, and A_{−i}, x_{−i} have column/element i removed. Applied to a quadratic majorization scheme, we find that at iteration t

x_i^{t+1} = A_i⊤( y − A_{−i} x_{−i}^t ) / ( A_i⊤ A_i ) = A_i⊤ r^t / ‖A_i‖² + x_i^t,

with y − A x^t = r^t. In a similar fashion to gradient descent, this involves O(n) operations for updates of A_i⊤ r^t, so one cycle is O(np).

We simulate a data vector y ∈ R^n from a regression model

y = Ax + σε,    where ε ∼ N(0, 1),

with an underlying sparse parameter value x ∈ R^d, where n = 100 and d = 256, in which the true sparse x has 5% nonzero signals generated from N(0, 1). The design matrix A ∈ R^{100×256} is also generated from N(0, 1), then column normalized. We set the signal-to-noise ratio at 16.5 to match the simulated example from Marjanovic and Solo (2013), which gives σ = 0.0369.

Figure 5 plots the mean squared error (MSE) versus the log-regularization penalty and the power in the Lq-norm penalty. Essentially, this consists of contours of log10(MSE(x̂)) on a plot of 0 < q < 1 versus the amount of regularization log10(λ). One interesting feature of this model is that the estimated regression coefficients x̂_λ^q can jump to sparsity as 0 < q < 1, and this will be illustrated in a regularization path for the next example.

7.5 Prostate Data

As a practical example of our methodology, we consider the prostate cancer data set, which examines the relationship between the level of a prostate specific antigen and a number of clinical factors. The variables are log cancer volume (lcavol), log prostate weight (lweight), age (age), log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason) and percent of Gleason scores 4 or 5 (pgg45).

A common regularized approach is to use the lasso and the elastic net; see Tibshirani (1996) and Zou and Hastie (2005), respectively. Alternatively, we fit the regularization path using

x̂_λ^q := argmin_x { (1/2) ‖y − Ax‖² + λ Σ_{j=1}^p |x_j|^q }.

We can use the exact proximal operator for the Lq-norm and solve the harder nonconvex problem. Figure 6 shows the regularization path. The major difference is, again, in the jumps to a sparse solution.
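A minimal sketch of this exact componentwise operator, following the case analysis of Section 7.4, is given below. It solves x̂ + λq x̂^{q−1} = |v| on [b_{λ,q}, |v|] by bisection; the tolerance and the tie-breaking choice at |v| = h_{λ,q} (returning the nonzero root) are our own illustrative choices rather than part of any reference implementation.

```python
import numpy as np

def prox_lq_scalar(v, lam, q, tol=1e-10):
    # Proximal operator of lam*|x|^q for 0 < q < 1, applied to a scalar v.
    b = (2.0 * lam * (1.0 - q)) ** (1.0 / (2.0 - q))   # b_{lam,q}
    h = b + lam * q * b ** (q - 1.0)                    # h_{lam,q}
    if abs(v) < h:
        return 0.0
    # Solve xhat + lam*q*xhat^(q-1) = |v| on [b, |v|]; the left-hand side is increasing there.
    g = lambda x: x + lam * q * x ** (q - 1.0) - abs(v)
    lo, hi = b, abs(v)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    xhat = 0.5 * (lo + hi)
    # At |v| == h both 0 and sign(v)*b minimize; we return the nonzero solution.
    return np.sign(v) * xhat

prox_lq = np.vectorize(prox_lq_scalar)   # componentwise application to a vector
```

Applying prox_lq with threshold λγ_t inside a forward–backward loop, or inside the per-coordinate cyclic descent update above, reproduces the iterations used for Figures 5 and 6 up to the illustrative choices just noted.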
FIG. 5. Penalty weight λ vs. MSE and q for an L2-norm error with an Lq-norm penalty, 0 < q < 1, estimated via cyclic descent and proximal solutions.
8. DISCUSSION
Proximal algorithms are a widely used approach for
solving optimization problems. They provide an ele-
gant extension of the classical gradient descent method
and have properties that—much like EM or MM
algorithms—can be used to derive many different ap-
proaches for solving a given problem.
For readers interested in further historical details, we
recommend Beck and Sabach (2015), who provide
a historical perspective on iterative shrinkage algo-
rithms by focusing mainly on the Weiszfeld algorithm
(Weiszfeld, 1937) for computing an ℓ1 median. The
split Lagrangian methods described here were origi-
nally developed by Hestenes (1969) and Rockafellar
(1974). More recently, there is work being done to ex-
tend the range of applicability of these methods outside
of the class of convex functions to the broader class of
functions satisfying the Kurdyka–Łojasiewicz inequal-
ity (Attouch, Bolte and Svaiter, 2013).
The purpose of our review has been to describe and
apply proximal algorithms to some archetypical opti-
mization problems that arise in statistics. These prob-
lems often involve composite functions that are rep-
resentable by a sum of a linear or quadratic enve-
lope, together with a function that has a closed-form
proximal operator that is easy to evaluate.

FIG. 6. Proximal results for the prostate data example under the Lq-norm penalty.

Many papers demonstrate the efficacy and breadth of application of this approach: for example, Micchelli et al.
(2013) and Micchelli, Shen and Xu (2011) study proximal operators for composite operators for L2-norm and ℓ1-norm/TV denoising models; Argyriou et al. (2011) describe numerical advantages of the proximal operator approach versus traditional fused lasso implementations; and Chen, Huang and Zhang (2013) provide a further class of fixed-point algorithms that advance the proximal approach in the composite setting.

Another nice property of proximal algorithms is the ease with which acceleration techniques can be applied. The most common approach involves Nesterov acceleration; see Nesterov (1983) and Beck and Teboulle (2004), who introduce a momentum term for gradient-descent algorithms applied to nonsmooth composite problems. Attouch and Bolte (2009) and Noll (2014) provide further convergence properties for nonsmooth functions. O'Donoghue and Candes (2015) use adaptive restart to improve the convergence rate of accelerated gradient schemes. Meng and Chen (2011) modify Nesterov's gradient method for strongly convex functions with Lipschitz continuous gradients. Allen-Zhu and Orecchia (2014) provide a simple interpretation of Nesterov's scheme as a two-step algorithm in which gradient-descent steps yield proximal (forward) progress and mirror-descent steps yield dual (backward) progress; by linearly coupling these two steps they improve convergence. Giselsson and Boyd (2014) also show how preconditioning can help with convergence for ill-conditioned problems.

There are a number of directions for future research on proximal methods in statistics, for example, exploring the use of Divide and Concur methods for exponential-family mixed models and studying the relationship between proximal splitting and variational Bayes methods in graphical models. Another interesting area of research involves combining proximal steps with MCMC algorithms (Pereyra, 2013). Of course, the proximal methods developed here are not designed to provide standard errors; the advantage of MCMC methods is the ability to assess uncertainty through the full posterior distribution.

APPENDIX A: PROXIMAL GRADIENT CONVERGENCE

We now outline convergence results for the proximal gradient solution, given by (4), to the fixed point problem

x⋆ = prox_{φ/λ}( x − ∇l(x)/λ ),

when l and φ are convex, lower semi-continuous and ∇l is Lipschitz continuous. We also assume that prox_{φ/λ} is nonempty and can be evaluated independently in each component.

Recalling the translation property of proximal operators stated in (12), we can say

x⋆ = prox_{φ/λ}( x − ∇l(x)/λ ) = prox_{(φ(z) + λ∇l(x)⊤z)/λ}(x)
   = argmin_z { φ(z) + ∇l(x)⊤(z − x) + (λ/2) ‖x − z‖² }.

By the proximal operator's minimizing properties, its solution x⋆ satisfies

φ(x⋆) + ∇l(x)⊤(x⋆ − x) + (λ/2) ‖x − x⋆‖² ≤ φ(x),

providing a quadratic minorizer for F(w) in the form of

l(w) + φ(x⋆) + ∇l(w)⊤(x⋆ − w) + (λ/2) ‖w − x⋆‖² ≤ l(w) + φ(w) ≡ F(w).

The Lipschitz continuity of ∇l(x), that is,

l(x) ≤ l(w) + ∇l(w)⊤(x − w) + (γ/2) ‖x − w‖²,

also gives us a quadratic majorizer

F(x) ≡ l(x) + φ(x) ≤ l(w) + φ(x) + ∇l(w)⊤(x − w) + (γ/2) ‖x − w‖²,

which, when evaluated at x = x⋆ and combined with our minorizer, yields

(λ − γ) (1/2) ‖x⋆ − w‖² ≤ F(w) − F(x⋆).

Thus, if we want to ensure that the objective value will decrease in this procedure, we need to fix λ ≥ γ. Furthermore, functional characteristics of l and φ, such as strong convexity, can improve the bounds in the steps above and guarantee good or optimal decreases in F(w) − F(x⋆).

Finally, when we compound the errors we obtain an O(1/k) convergence bound. This can be improved by adding a momentum term that includes first-derivative information.

These arguments can be extended to Bregman divergences by way of the general law of cosines inequality:

D_φ(x, z) = D_φ(x, w) + D_φ(w, z) − ( ∇φ(z) − ∇φ(w) )⊤ (x − w),

so that D_φ(x, z) ≥ D_φ(x, w) + D_φ(w, z), where w = argmin_v D_φ(v, z).
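The displayed descent inequality can be checked numerically. The sketch below runs the proximal gradient step on a small lasso-type problem with l(x) = (1/2)‖y − Ax‖², φ = pen·‖·‖_1 and γ = ‖A⊤A‖_2, and asserts the bound at every iteration; the problem dimensions and the choice λ = γ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 20
A = rng.standard_normal((n, p)); y = rng.standard_normal(n)
pen = 0.5                                  # weight on the l1 penalty phi
gamma = np.linalg.norm(A.T @ A, 2)         # Lipschitz constant of grad l

def l(x):    return 0.5 * np.sum((y - A @ x) ** 2)
def grad(x): return A.T @ (A @ x - y)
def F(x):    return l(x) + pen * np.sum(np.abs(x))
def prox_phi(v, t):                        # prox of (t*pen)*||.||_1: soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t * pen, 0.0)

lam = gamma                                # any lam >= gamma guarantees descent
w = rng.standard_normal(p)
for _ in range(25):
    x_star = prox_phi(w - grad(w) / lam, 1.0 / lam)
    lhs = 0.5 * (lam - gamma) * np.sum((x_star - w) ** 2)
    rhs = F(w) - F(x_star)
    assert lhs <= rhs + 1e-9               # the bound above; in particular F never increases
    w = x_star
```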
APPENDIX B: NESTEROV ACCELERATION

A powerful addition is Nesterov acceleration. Consider a convex combination, with parameter θ, of the upper bounds for the proximal operator inequality at z = x and z = x⋆. We are free to choose the variables z = θx⋆ + (1 − θ)x and w. If φ is convex, so that φ(θx⋆ + (1 − θ)x) ≤ θφ(x⋆) + (1 − θ)φ(x), then we have

F(x⁺) − F⋆ − (1 − θ)( F(x) − F⋆ )
   = F(x⁺) − θF⋆ − (1 − θ)F(x)
   ≤ λ (x⁺ − w)⊤( θx⋆ + (1 − θ)x − x⁺ ) + (λ/2) ‖x⁺ − w‖²
   = (λ/2) ( ‖w − (1 − θ)x − θx⋆‖² − ‖x⁺ − (1 − θ)x − θx⋆‖² )
   = (θ²λ/2) ( ‖u − x⋆‖² − ‖u⁺ − x⋆‖² ),

where w is given in terms of the intermediate steps

θu = w − (1 − θ)x,    θu⁺ = x⁺ − (1 − θ)x,

introducing a sequence θ_t with iteration subscript t. The second identity, θu = x − (1 − θ)x⁻, then yields an update for w as the current state x plus a momentum term, depending on the direction (x − x⁻), namely,

w = (1 − θ_t)x + θ_t u = x + θ_t( θ_{t−1}^{-1} − 1 )( x − x⁻ ).

APPENDIX C: QUASI-CONVEX CONVERGENCE

Consider an optimization problem min_{x∈X} l(x), where l is quasi-convex, continuous and has a nonempty set of finite global minima. Let x^t be generated by the proximal point algorithm

x^{t+1} ∈ argmin_x { l(x) + (λ_t/2) ‖x − x^t‖² }.

Papa Quiroz and Oliveira (2009) show that these iterates converge to the global minima, although the proximal operator at each step may be set-valued, due to the nonconvexity of l. A function l is quasi-convex when

l( θx + (1 − θ)z ) ≤ max{ l(x), l(z) },

which accounts for a number of nonconvex functions like |x|^q, when 0 < q < 1, and functions involving appropriate ranges of log(x) and tanh(x). In this setting, using the level-sets generated by the sequence, that is,

U = { x ∈ dom(l) : l(x) ≤ inf_t l(x^t) },

one finds that U is a nonempty closed convex set and that x^t is a Fejér sequence of finite length, Σ_t ‖x^{t+1} − x^t‖ < ∞, and that it converges to a critical point of l as long as min{ l(x) : x ∈ R^d } is nonempty.

APPENDIX D: NONCONVEX: KURDYKA–ŁOJASIEWICZ (KL)

A locally Lipschitz function l : R^d → R satisfies KL at x⋆ ∈ R^d if and only if there exist η ∈ (0, ∞), a neighborhood U of x⋆ and a concave κ : [0, η] → [0, ∞) with κ(0) = 0, κ ∈ C¹ and κ′ > 0 on (0, η), such that for every x ∈ U with l(x⋆) < l(x) < l(x⋆) + η we have

κ′( l(x) − l(x⋆) ) dist( 0, ∂l(x) ) ≥ 1,

where dist(0, A) := inf_{x∈A} ‖x‖₂.

The KL condition guarantees summability and therefore a finite length of the discrete subgradient trajectory. Using the KL properties of a function, one can show convergence for alternating minimization algorithms for problems like

min_{x,z} L(x, z) := l(x) + Q(x, z) + φ(z),

where ∇Q is Lipschitz continuous (see Attouch et al., 2010; Attouch, Bolte and Svaiter, 2013). A typical application involves solving min_{x∈R^d} { l(x) + φ(x) } via the augmented Lagrangian

L(x, z) = l(x) + φ(z) + λ⊤(x − z) + (ρ/2) ‖x − z‖²,

where ρ is a relaxation parameter.

A useful class of functions that satisfy KL is one that possesses uniform convexity,

l(z) ≥ l(x) + u⊤(z − x) + K ‖z − x‖^p,    p ≥ 1,  ∀u ∈ ∂l(x).

Then l satisfies KL on dom(l) for κ(s) = p K^{−1/p} s^{1/p}. For explicit convergence rates in the KL setting, see Frankel, Garrigos and Peypouquet (2015).

ACKNOWLEDGMENTS

We thank the participants at the 2014 ASA meetings for their comments. We also thank the Editor, Associate Editor and two anonymous referees for their help in improving the paper.
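As one concrete, purely illustrative way to carry out the alternating minimization of the augmented Lagrangian in Appendix D, the sketch below cycles through an exact x-update for a quadratic l(x) = (1/2)‖y − Ax‖², a soft-thresholding z-update for φ = pen·‖·‖_1, and a multiplier update. The quadratic choice of l, the parameter values and the function names are assumptions for illustration, not the general scheme analyzed in Attouch et al. (2010).

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def split_augmented_lagrangian(A, y, pen, rho=1.0, n_iter=200):
    # Alternating minimization of L(x, z) = l(x) + phi(z) + u'(x - z) + (rho/2)||x - z||^2
    # with l(x) = 0.5*||y - Ax||^2 and phi = pen*||.||_1 (an ADMM-style loop).
    n, p = A.shape
    x = np.zeros(p); z = np.zeros(p); u = np.zeros(p)
    AtA, Aty = A.T @ A, A.T @ y
    M = np.linalg.inv(AtA + rho * np.eye(p))        # reused factor for the exact x-update
    for _ in range(n_iter):
        x = M @ (Aty + rho * z - u)                 # argmin_x l(x) + u'x + (rho/2)||x - z||^2
        z = soft_threshold(x + u / rho, pen / rho)  # argmin_z phi(z) - u'z + (rho/2)||x - z||^2
        u = u + rho * (x - z)                       # multiplier (dual) update
    return z

# Usage on simulated data (assumed settings, for illustration only).
rng = np.random.default_rng(2)
A = rng.standard_normal((80, 40))
x0 = np.zeros(40); x0[:4] = 3.0
y = A @ x0 + 0.1 * rng.standard_normal(80)
x_hat = split_augmented_lagrangian(A, y, pen=1.0)
```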
REFERENCES

Allain, M., Idier, J. and Goussard, Y. (2006). On global and local convergence of half-quadratic algorithms. IEEE Trans. Image Process. 15 1130–1142.
Allen-Zhu, Z. and Orecchia, L. (2014). A novel, simple interpretation of Nesterov's accelerated method as a combination of gradient and mirror descent. Preprint. Available at arXiv:1407.1537.
Argyriou, A., Micchelli, C. A., Pontil, M., Shen, L. and Xu, Y. (2011). Efficient first order methods for linear composite regularizers. Preprint. Available at arXiv:1104.1436.
Attouch, H. and Bolte, J. (2009). On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116 5–16. MR2421270
Attouch, H., Bolte, J. and Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137 91–129. MR3010421
Attouch, H., Bolte, J., Redont, P. and Soubeyran, A. (2010). Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35 438–457. MR2674728
Beck, A. and Sabach, S. (2015). Weiszfeld's method: Old and new results. J. Optim. Theory Appl. 164 1–40. MR3296283
Beck, A. and Teboulle, M. (2004). A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res. 59 235–247. MR2063242
Beck, A. and Teboulle, M. (2010). Gradient-based algorithms with applications to signal recovery problems. In Convex Optimization in Signal Processing and Communications (D. P. Palomar and Y. C. Eldar, eds.) 42–88. Cambridge Univ. Press, Cambridge. MR2840594
Beck, A. and Teboulle, M. (2014). A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. 42 1–6. MR3159144
Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning 2010 1–38.
Besag, J. (1986). On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. Ser. B 48 259–302. MR0876840
Bien, J., Taylor, J. and Tibshirani, R. (2013). A LASSO for hierarchical interactions. Ann. Statist. 41 1111–1141. MR3113805
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge. MR2061575
Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers, Hanover, MA.
Brègman, L. M. (1967). A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7 200–217.
Cevher, V., Becker, S. and Schmidt, M. (2014). Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag. 31 32–43.
Chambolle, A. and Pock, T. (2011). A first-order primal–dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision 40 120–145. MR2782122
Chaux, C., Combettes, P. L., Pesquet, J.-C. and Wajs, V. R. (2007). A variational formulation for frame-based inverse problems. Inverse Probl. 23 1495–1518. MR2348078
Chen, P., Huang, J. and Zhang, X. (2013). A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Probl. 29 025011, 33. MR3020432
Chen, G. and Teboulle, M. (1994). A proximal-based decomposition method for convex minimization problems. Math. Program. 64 81–101. MR1274173
Chouzenoux, E., Pesquet, J.-C. and Repetti, A. (2014). Variable metric forward–backward algorithm for minimizing the sum of a differentiable function and a convex function. J. Optim. Theory Appl. 162 107–132. MR3228518
Chrétien, S. and Hero, A. O. III (2000). Kullback proximal algorithms for maximum-likelihood estimation. IEEE Trans. Inform. Theory 46 1800–1810. MR1790321
Combettes, P. L. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering 185–212. Springer, New York. MR2858838
Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statist. Decisions 1 (supplement issue) 205–237. MR0785210
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537
Duckworth, D. (2014). The big table of convergence rates. Available at https://github.com/duckworthd/duckworthd.github.com/blob/master/blog/big-table-of-convergence-rates.html.
Esser, E., Zhang, X. and Chan, T. F. (2010). A general framework for a class of first order primal–dual algorithms for convex optimization in imaging science. SIAM J. Imaging Sci. 3 1015–1046. MR2763706
Figueiredo, M. A. T. and Nowak, R. D. (2003). An EM algorithm for wavelet-based image restoration. IEEE Trans. Image Process. 12 906–916. MR2008658
Frankel, P., Garrigos, G. and Peypouquet, J. (2015). Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165 874–900.
Geman, D. and Reynolds, G. (1992). Constrained restoration and the recovery of discontinuities. IEEE Trans. Pattern Anal. Mach. Intell. 14 367–383.
Geman, D. and Yang, C. (1995). Nonlinear image recovery with half-quadratic regularization. IEEE Trans. Image Process. 4 932–946.
Giselsson, P. and Boyd, S. (2014). Preconditioning in fast dual gradient methods. In Proceedings of the 53rd Conference on Decision and Control 5040–5045. Los Angeles, CA.
Gravel, S. and Elser, V. (2008). Divide and concur: A general approach to constraint satisfaction. Phys. Rev. E 78 036706.
Green, P. J. (1990). On use of the EM algorithm for penalized likelihood estimation. J. Roy. Statist. Soc. Ser. B 52 443–452. MR1086796
Green, P. J., Łatuszyński, K., Pereyra, M. and Robert, C. P. (2015). Bayesian computation: A perspective on the current state, and sampling backwards and forwards. Preprint. Available at arXiv:1502.01148.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
Hestenes, M. R. (1969). Multiplier and gradient methods. J. Optim. Theory Appl. 4 303–320. MR0271809
Hu, Y. H., Li, C. and Yang, X. Q. (2015). Proximal gradient algorithm for group sparse optimization.
Komodakis, N. and Pesquet, J.-C. (2014). Playing with duality: An overview of recent primal–dual approaches for solving large-scale optimization problems. Preprint. Available at arXiv:1406.5429.
Magnússon, S., Weeraddana, P. C., Rabbat, M. G. and Fischione, C. (2014). On the convergence of alternating direction Lagrangian methods for nonconvex structured optimization problems. Preprint. Available at arXiv:1409.8033.
Marjanovic, G. and Solo, V. (2013). On exact ℓq denoising. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on 6068–6072. IEEE, New York.
Martinet, B. (1970). Brève communication. Régularisation d'inéquations variationnelles par approximations successives. ESAIM Math. Modell. Numer. Anal. 4 154–158.
Meng, X. and Chen, H. (2011). Accelerating Nesterov's method for strongly convex functions with Lipschitz gradient. Preprint. Available at arXiv:1109.6058.
Micchelli, C. A., Shen, L. and Xu, Y. (2011). Proximity algorithms for image models: Denoising. Inverse Probl. 27 045009, 30. MR2781033
Micchelli, C. A., Shen, L., Xu, Y. and Zeng, X. (2013). Proximity algorithms for the L1/TV image denoising model. Adv. Comput. Math. 38 401–426. MR3019155
Nesterov, Yu. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). Sov. Math., Dokl. 27 372–376.
Nikolova, M. and Ng, M. K. (2005). Analysis of half-quadratic minimization methods for signal and image recovery. SIAM J. Sci. Comput. 27 937–966 (electronic). MR2199915
Noll, D. (2014). Convergence of non-smooth descent methods using the Kurdyka–Łojasiewicz inequality. J. Optim. Theory Appl. 160 553–572. MR3180983
O'Donoghue, B. and Candes, E. (2015). Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15 715–732.
Palmer, J., Kreutz-Delgado, K., Rao, B. D. and Wipf, D. P. (2005). Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems 18 1059–1066. Vancouver, BC, Canada.
Papa Quiroz, E. A. and Oliveira, P. R. (2009). Proximal point methods for quasiconvex and convex functions with Bregman distances on Hadamard manifolds. J. Convex Anal. 16 49–69. MR2531192
Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization 1 123–231.
Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on 2358–2363. IEEE, New York.
Patrinos, P., Stella, L. and Bemporad, A. (2014). Douglas–Rachford splitting: Complexity estimates and accelerated variants. Preprint. Available at arXiv:1407.6723.
Pereyra, M. (2013). Proximal Markov chain Monte Carlo algorithms. Preprint. Available at arXiv:1306.0187.
Polson, N. G. and Scott, J. G. (2012). Local shrinkage rules, Lévy processes and regularized regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 287–311. MR2899864
Polson, N. G. and Scott, J. G. (2015). Mixtures, envelopes, and hierarchical duality. J. Roy. Statist. Soc. Ser. B. To appear. Available at arXiv:1406.0177.
Rockafellar, R. T. (1974). Conjugate duality and optimization. Technical report, DTIC Document, 1973.
Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14 877–898. MR0410483
Rockafellar, R. T. and Wets, R. J.-B. (1998). Variational Analysis. Springer, Berlin. MR1491362
Rudin, L., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Phys. D 60 259–268.
Shor, N. Z. (1985). Minimization Methods for Nondifferentiable Functions. Springer, Berlin. MR0775136
Tansey, W., Koyejo, O., Poldrack, R. A. and Scott, J. G. (2014). False discovery rate smoothing. Technical report, Univ. Texas at Austin.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. Ann. Statist. 42 285–323. MR3189487
Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. Ann. Statist. 39 1335–1371. MR2850205
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 91–108. MR2136641
von Neumann, J. (1951). Functional Operators: The Geometry of Orthogonal Spaces. Princeton Univ. Press, Princeton, NJ.
Weiszfeld, E. (1937). Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Math. J. 43 355–386.
Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.
Zhang, X., Saha, A. and Vishwanathan, S. V. N. (2010). Regularized risk minimization by Nesterov's accelerated gradient methods: Algorithmic extensions and empirical studies. Preprint. Available at arXiv:1011.0472.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320. MR2137327