
Math. Program., Ser. B
https://doi.org/10.1007/s10107-018-1284-2

FULL LENGTH PAPER

A simplified view of first order methods for optimization

Marc Teboulle1

Received: 30 January 2018 / Accepted: 26 April 2018
© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2018

Abstract We discuss the foundational role of the proximal framework in the develop-
ment and analysis of some iconic first order optimization algorithms, with a focus on
non-Euclidean proximal distances of Bregman type, which are central to the analysis
of many other fundamental first order minimization relatives. We stress simplification
and unification by highlighting self-contained elementary proof-patterns to obtain con-
vergence rate and global convergence both in the convex and the nonconvex settings,
which in turn also allows us to present some novel results.

Keywords Proximal framework · Non-Euclidean Bregman distance · First order
algorithms · Convex and nonconvex minimization · Descent Lemma · Kurdyka–
Łojasiewicz property

Mathematics Subject Classification 90C25 · 65K05

1 Introduction

The notion of proximal map of a convex function, invented about half a century ago
by Moreau in his seminal work [48], is alive and kicking! This fundamental
regularization process gave birth 5 years later to the so-called proximal minimization
algorithm by Martinet [47], followed by its extension in Rockafellar [59] for solving
monotone inclusions.

This research was partially supported by the Israel Science Foundation, under ISF Grant 1844-16, and the
German Israel Foundation, under GIF Grant 1253304.6-2014.

Marc Teboulle
[email protected]

1 School of Mathematical Sciences, Tel-Aviv University, 69978 Ramat-Aviv, Israel

Nearly abandoned in favor of the sophisticated interior point
methods, and unpopular among many optimization researchers for over three
decades, proximal based methods and their relatives are nowadays “starring” in mod-
ern optimization algorithms based on first order information, e.g., function values
and gradient/subgradients. This resurrection and renewed interest in first order meth-
ods (FOM) is motivated by the current high demand in solving large and huge scale
problems arising in a wide spectrum of disparate fundamental applications (e.g., signal
processing, image sciences, machine learning, communication systems, computerized
tomography, astronomy), which admit structured convex and nonconvex optimization
formulations. The computational simplicity of FOM makes them appealing and ideal
candidates for solving such problems to medium accuracy. Current intensive research
activities can be seen from the very large volume of literature which is constantly
growing at a rapid pace. Clearly, space limitations prevent us from discussing and reviewing
all the past and recent achievements. Instead, to get an initial sense of the intensity of
this body of research which started about a decade ago, we refer the reader to, e.g.,
[54,62] and references therein. For a more recent comprehensive treatment of convex
optimization algorithms, we refer the reader to the book of Bertsekas [23], and to the
just released book of Beck [15], which focuses on first order methods, and provides
a unique self-contained and rigorous study underlying the theoretical foundations of
FOM. Both monographs include many relevant up-to-date and annotated sources, but
note that new works in this area are basically appearing on a daily basis!
In this work we discuss in a concise way the foundational role of the proximal
approach in the development and analysis of first order optimization algorithms, with
a focus on non-Euclidean proximal schemes based on Bregman distances. The aim
of this paper is to stress simplification and unification. Starting with the proximal
framework (Sect. 2), and building on two fundamental pillars, one which is quite
old and one very recent (Sect. 3), elementary proof-patterns emerge. This allows us to
simplify the derivation of the rate of convergence and global convergence of some
of the most fundamental first order algorithmic icons: gradient/subgradient, proximal
gradient, and mirror descent for composite convex optimization models. Note that these
schemes are also central to the analysis of many other first order minimization relatives
that will not be reviewed here, but see Sect. 2 for further discussion and references. In
addition to some of these classical results, we highlight the benefits of the simplicity
of the proof techniques (Sect. 4), which also allow us to derive these results under weaker
assumptions than the usual ones, as well as novel results, see respectively Sect. 4.2,
where we introduce a single condition in terms of the problem’s data allowing to
refine the classical analysis of the Mirror Descent (Theorem 4.3), and Sect. 4.3 where
we derive a descent like property for the more general convexly composite model
(Lemma 4.2), and the corresponding complexity result in Theorem 4.4. The nonconvex
setting is far more difficult than its convex counterpart, and remains highly challenging.
In Sect. 5, we give a brief tour on some very recent advances in this area, describing
a flexible framework to derive global convergence of the Bregman based proximal
gradient scheme to a critical point of a nonconvex composite minimization model. We
end the paper in Sect. 6, with some discussion on the benefits and limitations of the
Bregman proximal framework, and concluding remarks leading to an open challenging
question.


Notation We use standard notation and concepts from convex and variational analysis,
which unless otherwise specified, can all be found in the classical monographs [57,58].
Throughout, E stands for a finite dimensional vector space with inner product ⟨·, ·⟩ and
norm ‖·‖; its usual dual norm, residing in the dual space E∗, is denoted by ‖·‖∗.

2 The proximal framework

We start with the seminal work of Moreau [48], which provides the foundation and
the central ideas underlying the proximal methodology.

2.1 The master chef: Moreau’s proximal map

Given a proper, lower semicontinuous (lsc) and convex function ϕ : E → (−∞, ∞]


and λ > 0, the proximal map of Moreau [48], is defined as the unique minimizer:

    proxλϕ(x) := argmin { ϕ(u) + (1/2λ)‖u − x‖² : u ∈ E },

and the resulting optimal value, that is, the Moreau envelope of ϕ, is the function

    x ↦ ϕλ(x) := min { ϕ(u) + (1/2λ)‖u − x‖² : u ∈ E }.

This regularization process for ϕ produces a very neat function ϕλ which is finite
everywhere, convex and with λ−1 -Lipschitz continuous gradient on E given by:

    ∇ϕλ(x) = λ⁻¹(x − proxλϕ(x)).

Moreover, minimization of ϕ and ϕλ are equivalent in the sense that:

    inf_x ϕλ(x) = inf_x ϕ(x)  and  argmin_x ϕ(x) = argmin_x ϕλ(x),

with minimizers satisfying the equation x = proxλϕ(x).


This naturally provides the rationale for the so-called proximal minimization algo-
rithm of Martinet [47], which consists of iterating the above fixed point equation:

    x⁰ ∈ E,  xᵏ⁺¹ = proxλϕ(xᵏ),  k = 0, 1, . . . .

Alternatively, this is equivalent to the basic gradient iteration with step size λ applied
on the smooth function ϕλ .
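As a concrete one dimensional illustration (ours, not from the paper): for ϕ(x) = |x|, the proximal map is the soft-thresholding operator and the Moreau envelope is the Huber function. The sketch below checks the gradient formula ∇ϕλ(x) = λ⁻¹(x − proxλϕ(x)) and runs the proximal minimization iteration; the function names are our own.

```python
import numpy as np

def prox_abs(x, lam):
    # prox_{lam*phi}(x) for phi = |.|: the soft-thresholding operator.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_abs(x, lam):
    # Moreau envelope phi_lam(x) = min_u { |u| + (u - x)^2 / (2*lam) }: the Huber function.
    u = prox_abs(x, lam)
    return np.abs(u) + (u - x) ** 2 / (2.0 * lam)

lam, x = 0.5, 2.0
# Gradient formula: grad phi_lam(x) = (x - prox_{lam*phi}(x)) / lam.
grad = (x - prox_abs(x, lam)) / lam

# Proximal minimization x^{k+1} = prox_{lam*phi}(x^k) drives x^k to argmin |x| = {0}.
xk = 2.0
for _ in range(10):
    xk = prox_abs(xk, lam)

print(grad, xk, moreau_env_abs(2.0, 0.5))
```

Each iterate shrinks toward 0 by λ per step, as the fixed point equation predicts.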
Despite the obvious difficulties that can often emerge from the practical com-
putational side of a proximal map (which by itself requires to solve a nonsmooth
optimization problem), the proximal framework is fundamental, and nowadays has
regained a starring position in the development and analysis of modern optimiza-
tion algorithms based on first order information, both in their primal and dual forms.


In fact, the proximal methodology not only underlies the essence of the central
first order schemes, e.g., gradient, proximal gradient, and mirror descent, which can
be recovered through a specific choice of ϕ, but is also at the root of many other
fundamental first order minimization methods. For instance, these include incremental
and stochastic versions, coordinate descent and smoothing methods, as well as funda-
mental primal-dual methods, within augmented Lagrangian and their many relatives,
e.g., alternating direction method of multipliers, and related decomposition schemes.
Because of space limitations, we obviously barely scratch the surface, and will not
discuss the many theoretical, computational and applications aspects of past and recent
achieved results in this area. For a taste of these and additional modern coverage on
these topics, we refer the reader to some recent works, e.g., [13,15,18,19,23,30,61]
which also provide ample relevant references.

2.2 Extending the scope of Moreau’s proximal map: motivation

Mimicking Moreau, one can consider replacing the squared Euclidean distance with
a non-Euclidean proximity measure, also called distance-like function. Formally, a
general form of the algorithm for minimizing a function ψ : E → (−∞, ∞] then
reads as

    xᵏ⁺¹ ∈ argmin { ψ(u) + (1/λ) D(u, xᵏ) : u ∈ E },

where D(·, ·) stands for a distance-like function which replaces the squared Euclidean
distance. There exist various ways to define adequate proximity measures, see for
instance the general framework of Auslender and Teboulle [8] and references therein,
which analyzes first order methods based on a variety of distance-like functions, includ-
ing the so-called Csiszar φ-divergence [37] and the Bregman distance [31].
The rationale which underlies the relevance and usefulness of considering more
general proximal measures stems from algorithmic contexts, i.e., toward improv-
ing and extending convergence properties of classical algorithms, as well as from
specific application areas. Let us briefly describe some contexts where the use of
non-Euclidean proximal schemes has been particularly useful, which include for
example:

(a) The possibility to better adapt to the geometry of a given problem, which allows
one to derive simpler proximal computational steps, i.e., by naturally eliminating the
constraint through a suitable choice of D which best matches the problem's data
information and structures, see e.g. [7,9,14]. Moreover, this allows for deriving
a complexity constant in terms of D(x∗, x⁰) between the starting point x⁰ and
an optimal solution x∗, which can be significantly smaller than the usual term
‖x∗ − x⁰‖², and is often nearly optimal for very large scale problems, see e.g.,
[16,20,50].
(b) Augmented Lagrangian Methods. It is well known that the classical augmented
Lagrangian method for the standard inequality constrained convex programming
problem is nothing else but the proximal minimization algorithm applied on its


dual [21]. Likewise, applying a non-Euclidean proximal scheme produces an
augmented Lagrangian function which will involve the conjugate of D. However,
in contrast to its classical quadratic counterpart for inequality constraints,
(which lacks twice continuous differentiability, even if the problem’s data is C 2 ),
here one can obtain smooth augmented Lagrangians, and this provides computa-
tional advantages for their minimization, see Bertsekas [21] for early work in this
direction, and [10,40,56,63] for more examples and results within this approach.
(c) Links with Dynamical Systems In that context, non-Euclidean proximal based
methods are particularly useful to study some continuous dynamical systems
associated with constrained optimization problems, and have resulted in producing
new interior descent methods and elliptic barrier operators [26], and also
allow making connections with physical and other applied science phenomena,
such as, the heavy ball with friction in mechanics [3], and the Lotka–Volterra
systems in biology [2].
The above description is far from being exhaustive, but rather just an appetizer
on the prospect of studying non-Euclidean proximal methods. For further details,
elaborations and enhancements with general non-Euclidean proximal schemes, includ-
ing extensions to variational inequalities, see, for instance, [6,7,9,49] and references
therein, as well as the concluding Sect. 6 for further discussion.
This paper will focus only on schemes based on the Bregman distance, stressing its
natural appearance and usefulness in most of the fundamental basic FOM. We refer the
reader to [12,26,33–35,40,63] and references therein for early foundational papers,
more motivation, and key results on proximal-based methods associated to Bregman
distances. Some very recent works on Bregman based FOM in various contexts can
be found in [14,24,41,52].
The rest of this section provides the basic definitions and preliminaries for working
in the proximal Bregman setting.

2.3 The Bregman distance

To define non-Euclidean proximity measures in general, and here specifically the


Bregman distance, it is convenient to work with Legendre functions. Basically, as we
shall see below, this allows to handle the nonlinear geometry of a problem, e.g., acting
as a natural barrier for a constraint set. This is very flexible and more general than
the classical squared Euclidean distance, allowing adaptation to the structure and data
information of the problem at hand.
To introduce Legendre functions, we first need to recall some basic facts from
convex analysis which can be found in [58, Section 26].
A proper and convex function h : E → (−∞, ∞] is called essentially smooth
if int dom h ≠ ∅, h is differentiable on int dom h, and ‖∇h(xᵏ)‖ → ∞ whenever
{xᵏ}k∈N ⊂ int dom h converges to a boundary point x ∈ dom h as k → +∞.
Recall that an extended valued function on E is smooth, only if it is actually finite and
differentiable throughout E. Clearly, a smooth convex function on E is in particular
essentially smooth. The next result records important facts characterizing essential
smoothness.


Lemma 2.1 (Essential smoothness) [58, Theorem 26.1] Let h : E → (−∞, ∞] be a


proper, lsc and convex function. The following statements are equivalent:
(a) h is essentially smooth.
(b) ∂h(x) = {∇h(x)} when x ∈ int dom h, while ∂h(x) = ∅ for x ∉ int dom h.
(c) dom ∂h = int dom h ≠ ∅.

Definition 2.1 (Legendre function) [58]. A function h : E → (−∞, ∞] which is


proper, lsc, strictly convex and essentially smooth will be called a Legendre function.

In addition to the properties (b) and (c) given in Lemma 2.1, a Legendre function
also enjoys the following useful properties [58, Thm. 26.5]:
• h is Legendre if and only if its conjugate h ∗ is Legendre.
• The gradient of a Legendre function h is a bijection from int dom h to int dom h ∗ ,
and its inverse is the gradient of the conjugate, that is, we have (∇h)−1 = ∇h ∗ .
In this work we adopt the following definition of a Bregman distance, whose origin
goes back to the work of Bregman [31].

Definition 2.2 (Bregman distance) Let h : E → (−∞, ∞] be a Legendre function.


The Bregman distance associated to h, denoted by Dh : dom h × int dom h → R+ is
defined by:
    Dh(x, y) := h(x) − [h(y) + ⟨∇h(y), x − y⟩].

It naturally measures the proximity between the two points (x, y). Indeed, thanks
to the gradient inequality we have,

h is convex if and only if Dh (x, y) ≥ 0, ∀ x ∈ dom h, y ∈ int dom h.

In addition, thanks to the strict convexity of h we have that Dh (x, y) = 0 if and only
if x = y, which implies a basic distance-like property. However, note that Dh is not
symmetric in general, unless h is the energy function, that is, h(·) = (1/2)‖·‖², and
in this case Dh recovers the classical squared Euclidean distance.
We list below some of the most popular and useful choices for h to generate relevant
associated Bregman distances Dh , which are well documented in the literature, see
e.g., [8,12,40,63] where more examples are given, including nonseparable Bregman
distances, as well as, Bregman distances on the space of symmetric matrices.

Example 2.1 To generate a Bregman distance in the n dimensional vector space E =
Rⁿ, it is often enough to consider a one dimensional function h_j : R → (−∞, ∞]
which is convex on R. Then, with h(x) = Σ_{j=1}^n h_j(x_j), we simply get that
Dh(x, y) = Σ_{j=1}^n D_{h_j}(x_j, y_j). Typical and useful examples include the
following h (with n = 1):

• (Energy) h(x) = (1/2)x² with dom h = dom h∗ = R, and likewise, h(x) =
(1/p)|x|^p, p ≥ 2, with h∗(y) = (1/q)|y|^q, p + q = pq.

• (Shannon Entropy) h(x) = x log x with dom h = [0, ∞), (0 log 0 = 0), and
h∗(y) = e^(y−1).


• (Fermi–Dirac) h(x) = x log x + (1 − x) log(1 − x) with dom h = [0, 1], and
h∗(y) = log(1 + e^y).
• (Hellinger) h(x) = −√(1 − x²) with dom h = [−1, 1] and h∗(y) = √(1 + y²).
• (Burg) h(x) = − log x with dom h = (0, ∞), and h∗(y) = − log(−y) − 1.
Note that, in all the above examples, h is Legendre. Moreover, except for the last
example, all the conjugate functions h∗ share the nice and useful property that dom h∗ = R;
see Sect. 6 for further discussion on the relevance of these examples in applications.
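For instance (a sketch of our own, not from the paper), the Shannon and Burg kernels above generate, respectively, the Kullback–Leibler divergence x log(x/y) − x + y and the Itakura–Saito distance x/y − log(x/y) − 1, as a direct computation of Dh shows:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    # D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>  (here in one dimension).
    return h(x) - h(y) - grad_h(y) * (x - y)

# Shannon entropy h(x) = x log x generates the Kullback-Leibler divergence.
d_kl = bregman(lambda x: x * np.log(x), lambda y: np.log(y) + 1.0, 2.0, 1.0)

# Burg entropy h(x) = -log x generates the Itakura-Saito distance.
d_is = bregman(lambda x: -np.log(x), lambda y: -1.0 / y, 2.0, 1.0)

print(d_kl, d_is)
```

Both values agree with the closed forms evaluated at x = 2, y = 1.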

The definition of a Bregman distance varies in the literature. In particular, most


often the function h generating Dh is assumed to be σ -strongly convex, a stringent
assumption that might also prevent the use of classical Dh , e.g., all the last four
functions h given in Example 2.1 are not strongly convex on dom h. In fact, as we
shall see later on, strong convexity of h is often not needed.
The structural form of Dh is also useful when h is not convex. The function Dh
measures the gap between the value of h at a given point x ∈ dom h and its first order
Taylor linear approximation around y ∈ int dom h, and could be naturally dubbed as
the First Order Taylor Gap associated to h. In that case, obviously, the distance-like
property of Dh is lost, yet this interpretation can be beneficially used, see Sect. 4.
We end this section with the three points identity, which naturally extends the
Euclidean Pythagoras type identity. This identity, first observed in Chen and Teboulle
[35] is very simple, yet crucial in the analysis of any optimization method based on
Bregman distances.

Lemma 2.2 (The Three Points Identity) [35, Lemma 3.1] Suppose h : E → (−∞, ∞]
is Legendre. Then, for any three points x, y ∈ int dom h and z ∈ dom h, the following
three points identity holds true:

    Dh(z, x) − Dh(z, y) − Dh(y, x) = ⟨∇h(x) − ∇h(y), y − z⟩.

Note that the Legendre property of h plays no role in the derivation of this identity.
Indeed, for any differentiable function h, the identity follows by simple algebra relying
on the structural definition of Dh. With h(u) := ‖u‖²/2, E = Rⁿ, we recover as
expected, the fundamental Pythagoras identity

    ‖z − x‖² − ‖z − y‖² − ‖x − y‖² = 2⟨x − y, y − z⟩,

where here ‖·‖ stands for the Euclidean norm in Rⁿ.
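Since the identity holds for any differentiable h by pure algebra, it can be checked numerically; a small sketch of our own, using the entropy kernel on random positive points:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(h, grad_h, a, b):
    # Bregman distance D_h(a, b) = h(a) - h(b) - <grad h(b), a - b>.
    return h(a) - h(b) - grad_h(b) @ (a - b)

# Any differentiable h will do; here h(u) = sum_j u_j log u_j on the positive orthant.
h = lambda u: np.sum(u * np.log(u))
gh = lambda u: np.log(u) + 1.0

x, y, z = rng.uniform(0.1, 2.0, size=(3, 4))
# Three points identity: D(z,x) - D(z,y) - D(y,x) = <grad h(x) - grad h(y), y - z>.
lhs = D(h, gh, z, x) - D(h, gh, z, y) - D(h, gh, y, x)
rhs = (gh(x) - gh(y)) @ (y - z)
print(abs(lhs - rhs))  # zero up to rounding
```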

2.4 The Bregman proximal map

Consider the following convex optimization problem

ϕ∗ = inf{ϕ(u) : u ∈ C}.

We make the following standing assumptions.


Assumption A (i) ϕ : E → (−∞, ∞] is proper, lsc and convex,
(ii) C ⊆ E is nonempty, closed and convex,
(iii) h : E → (−∞, ∞] is Legendre,
(iv) (dom ϕ ∩ C) ∩ dom h ≠ ∅.
Mimicking Moreau one more time, we replace the squared Euclidean distance with
Dh (·, ·) to define the Bregman proximal map (with the same notation as the one of
Moreau, except for the presence of h).

Definition 2.3 (Bregman proximal map) Suppose that Assumption A holds. For any
x ∈ int dom h and λ > 0, the Bregman proximal map is defined by

    prox^h_{λϕ}(x) := argmin { ϕ(u) + (1/λ) Dh(u, x) : u ∈ C }.

Various types of assumptions can be made to warrant well-posedness of the Bregman
proximal map, which have been developed in the literature alluded to above.
We state here one result which warrants that the Bregman proximal map associated
with a convex function is well-defined. To make this part self-contained, we
first recall some elementary results which can be found in [58].

Definition 2.4 (Level boundedness) A function ψ : E → (−∞, ∞] is called (lower)


level bounded if for every α ∈ R, the level set Lev(ψ, α) := {x ∈ E : ψ(x) ≤ α} is
bounded (possibly empty).

We then have the following properties, (cf. [58, Corollary 8.7.1]). Let ψ : E →
(−∞, ∞] be a proper, lsc and convex function.
• If the level set Lev(ψ, α) is nonempty and bounded for some α ∈ R, it is bounded
for every α ∈ R.
• If ψ is level bounded, then argmin ψ is nonempty and compact, and inf ψ is finite.
The proof of Lemma 2.3 is patterned after similar approaches used, e.g., in [8,40,
64].

Lemma 2.3 Suppose that Assumption A holds. For any x ∈ int dom h and λ > 0 let

    x⁺ := prox^h_{λϕ}(x) = argmin { ϕ(u) + (1/λ) Dh(u, x) : u ∈ C }.

The following statements hold.


(a) If inf{ϕ(u) : u ∈ C} = ϕ∗ > −∞, then the minimizer x⁺ exists and is unique.
(b) If in addition, ri(dom ϕ ∩ C) ⊂ int dom h, then x⁺ ∈ dom ϕ ∩ int dom h, which
is characterized via its first order optimality condition:

    0 ∈ λ∂ϕ(x⁺) + ∇h(x⁺) − ∇h(x). (2.1)

Proof Fix x ∈ int dom h and λ > 0.


(a) Under the premises of the lemma, a standard argument shows that

    Φλ(u) := ϕ(u) + λ⁻¹ Dh(u, x) + δ_C(u),

is proper, lsc and strictly convex (since h is strictly convex). Thus, if a minimizer
of inf_u Φλ(u) exists, it must be unique. To show existence of a minimizer, we will
prove that Φλ is level bounded. For convenience, let u ↦ d(u) := Dh(u, x). Then
d is proper, lsc and strictly convex, with d(u) ≥ 0, and equal to zero if and only
if x = u. Therefore, Lev(d, 0) = {x}, and hence (cf. above recalled properties)
Lev(d, α) is bounded for any α ≥ 0. Now, let m := inf_u Φλ(u). Since d(u) ≥ 0, we have
m ≥ ϕ∗ > −∞, and the following holds:

    L(m) := Lev(Φλ, m) ⊂ Lev(d, λ(m − ϕ∗)).

Since we have just seen that Lev(d, α) is bounded for any α ≥ 0, then Lev(d, λ(m −
ϕ∗)) is bounded, and hence by the above inclusion, so is L(m), and the claimed
existence result is proved.
For (b), under the premises of the lemma, we can apply standard subdifferential
calculus rules [58], which thanks to the additional fact that h is Legendre,
characterize a well-defined unique optimal x⁺ ∈ int dom h via the inclusion 0 ∈
λ∂ϕ(x⁺) + ∇h(x⁺) − ∇h(x), as desired. □

Throughout the rest of this paper, we take as our blanket assumption the validity
of Lemma 2.3, that is, the proximal map x⁺ = prox^h_{λϕ}(x) is uniquely defined and
satisfies x⁺ ∈ int dom h.

3 Two fundamental pillars

The analysis of basic proximal schemes and their variants, essentially relies on the
following two main pillars:
• One well-known and quite old: the Bregman proximal inequality derived by Chen
and Teboulle [35], which provides a fundamental estimate in the objective function
value gap for Bregman based proximal schemes.
• One very recent: A Lipschitz-Like Convexity Condition, recently introduced by
Bauschke et al. [14], which naturally captures through one convexity condition
the nonlinear geometry information of the problem at hand.

3.1 The Bregman proximal inequality

The first pillar of the analysis is the following result which naturally extends a similar
fundamental estimate for the Euclidean squared proximal map as first derived by
Güler [43].


Lemma 3.1 (Bregman proximal inequality) [35, Lemma 3.2] Let ϕ : E → (−∞, ∞]
be a proper, lsc and convex function. Given λ > 0 and x ∈ int dom h, define:

    x⁺ := prox^h_{λϕ}(x) = argmin_u { ϕ(u) + (1/λ) Dh(u, x) }.

Then, x⁺ ∈ dom ϕ ∩ int dom h, and

    λ(ϕ(x⁺) − ϕ(u)) ≤ Dh(u, x) − Dh(u, x⁺) − Dh(x⁺, x),  ∀u ∈ dom h.

Proof Let x ∈ int dom h and λ > 0. Lemma 2.3 warrants the existence of a unique
x⁺ ∈ int dom h, such that

    λξ + ∇h(x⁺) − ∇h(x) = 0, with ξ ∈ ∂ϕ(x⁺).

Therefore, using the subgradient inequality for the convex function ϕ, it follows that

    λ(ϕ(x⁺) − ϕ(u)) ≤ λ⟨ξ, x⁺ − u⟩ = ⟨∇h(x) − ∇h(x⁺), x⁺ − u⟩, ∀u ∈ dom h.

Invoking the three points identity of Lemma 2.2 with z := u and y := x⁺ gives the
desired result. □

3.2 A Lipschitz-like convexity condition

The second main pillar of the analysis relies on the recent work of Bauschke et al. [14]
which introduces a simple framework capturing the non-Euclidean geometry of a
given problem through one Lipschitz-like/Convexity condition. It elegantly translates
into a general Descent Lemma precisely given in terms of Bregman distances, and in
particular allows one to handle problems lacking a global Lipschitz gradient, a common
assumption used in almost all FOM.
We adopt this framework, with a slight extension, which allows to consider non-
convex functions g, as recently done in Bolte et al. [24], stressing its usefulness and
flexibility in various contexts.
A Lipschitz-like/Convexity Condition. [14] Let h : E → (−∞, ∞] be a Legendre
function and let g : E → (−∞, ∞] be a proper and lsc function with dom g ⊃ dom h,
and g is differentiable on int dom h. Given such a pair of functions (g, h), the Lipschitz-
like/Convexity Condition denoted by (LC) is:

(LC) ∃L > 0 with Lh − g convex on int dom h.

Note that the above definition holds for any convex function h which is differentiable
on any open subset of dom h. Clearly, this condition does not need the Legendre
property on h. Only the convexity of Lh − g plays a central role. Moreover, the
convexity condition (LC) nicely translates in terms of Bregman distances to produce
a new Descent Lemma (called NoLips in [14]).


Lemma 3.2 (NoLips Descent Lemma) [14, Lemma 1] Consider the pair of functions
(g, h) as above. Take L > 0. The following statements are equivalent:
(i) Lh − g is convex on int dom h, i.e., condition (LC) holds.
(ii) Dg(x, y) ≤ L Dh(x, y) ⟺ D_{Lh−g}(x, y) ≥ 0, for all x, y ∈ int dom h.

Proof Simply follows from the gradient inequality for the convex function Lh − g,
and the fact that 0 ≤ D_{Lh−g}(x, y) = L Dh(x, y) − Dg(x, y). □

Note that if L′ ≥ L, then property (LC) holds with L′. When both functions g and h are
assumed to be C² on int dom h, the above statement is equivalent to

    ∃L > 0 :  L∇²h(x) − ∇²g(x) ⪰ 0, for all x ∈ int dom h,

which can be useful to check condition (LC), see [14] for examples.
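A sketch of this Hessian test, for the illustrative pair g(x) = x⁴/4 and kernel h(x) = x⁴/4 + x²/2 (our choice for illustration, not an example taken from [14]): here g″(x) = 3x² is unbounded, so ∇g is not globally Lipschitz and the Euclidean Descent Lemma fails for every fixed L, while L h″(x) − g″(x) = 3(L − 1)x² + L ≥ 0 already for L = 1.

```python
import numpy as np

# g(x) = x**4 / 4 has g''(x) = 3x**2, unbounded: no globally Lipschitz gradient.
# Kernel h(x) = x**4/4 + x**2/2 has h''(x) = 3x**2 + 1, so L*h'' - g'' >= 0 for L = 1,
# i.e., condition (LC) holds with L = 1.
g_hess = lambda x: 3.0 * x**2
h_hess = lambda x: 3.0 * x**2 + 1.0

xs = np.linspace(-100.0, 100.0, 10001)
L = 1.0
lc_holds = bool(np.all(L * h_hess(xs) - g_hess(xs) >= 0.0))
g_hess_max = float(np.max(g_hess(xs)))  # grows without bound as |x| grows

print(lc_holds, g_hess_max)
```

The grid check is of course only a sanity check; the sign condition here is verified exactly by the displayed algebra.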
Clearly, in the Euclidean setting, with h(x) = (1/2)‖x‖₂², item (ii) above recovers the
classical and fundamental Descent Lemma [22] which, under the assumption that g is
convex, is equivalent to the standard smoothness property: ∇g is Lipschitz continuous
with constant L on E. Observe however, that contrary to the usual Descent Lemma
of a differentiable function, the first inequality (ii) of Lemma 3.2 is one-sided. As
recently shown in [24], it can be complemented by asking both Lh − g and Lh + g
to be convex on int dom h, which immediately yields a full Descent Lemma, which
reads compactly:

    |Dg(x, y)| ≤ L Dh(x, y), ∀ x, y ∈ int dom h.

Clearly, when g is convex, the convexity condition on Lh + g trivially holds.


The structural form of Dh allows us to derive another useful and more general
descent-like result than Lemma 3.2.

Lemma 3.3 (The Three Points Descent Lemma) Consider the pair of functions (g, h)
as above. Take L > 0. Then, the function Lh − g is convex on int dom h if and only if
for any (x, y, z) ∈ (int dom h)³:

    g(x) ≤ g(y) + ⟨∇g(z), x − y⟩ + L Dh(x, z) − Dg(y, z). (3.1)

Proof Thanks to the structural form of Dg, it is easy to see that for any (x, y, z) ∈
(int dom h)³, the claimed inequality can be equivalently written as:

    g(x) − g(z) − ⟨∇g(z), x − z⟩ ≤ L Dh(x, z) ⟺ Dg(x, z) ≤ L Dh(x, z),

and the latter inequality is equivalent to D_{Lh−g}(x, z) ≥ 0, namely that Lh − g is
convex. □

When g is also assumed convex, we have that Dg (y, z) ≥ 0, and hence one can drop
the last term in the inequality (3.1), thus recovering [14, Lemma 4].


We end this section by stressing again the relevance of the condition (LC). To
that end, first recall that h is σ-strongly convex (cf. [57, Section 12H]) if there exists
σ > 0 such that

    ⟨u − v, x − y⟩ ≥ σ‖x − y‖², ∀ u ∈ ∂h(x), v ∈ ∂h(y).

For simplicity, here we set σ = 1, as h can always be re-scaled if necessary.


First order methods based on Bregman distances usually require both the smoothness
of g and the 1-strong convexity of h. This is stronger than what is actually
necessary. Indeed, the following very simple result shows that these two assumptions
imply condition (LC), thus showing that condition (LC) is very natural, and has the
nice feature and generality of capturing the essential information in the non-Euclidean
setting.

Proposition 3.1 Let ‖·‖ be an arbitrary norm in E. If ∇g is L-Lipschitz continuous


on int dom h and h is 1-strongly convex, then condition (LC) holds true, that is, Lh − g
is convex on int dom h.

Proof The result follows immediately from the premises made on the pair (g, h).
Indeed, for all x, y ∈ int dom h we have

    ⟨∇g(x) − ∇g(y), x − y⟩ ≤ ‖∇g(x) − ∇g(y)‖∗ ‖x − y‖   [Cauchy–Schwarz]
                           ≤ L‖x − y‖²                   [L-Lipschitz gradient of g]
                           ≤ L⟨∇h(x) − ∇h(y), x − y⟩     [h is 1-strongly convex].

This proves that the gradient of (Lh − g) is monotone on int dom h, and hence the
convexity of Lh − g. □


4 Analysis of non-Euclidean proximal based schemes

Thanks to the two main pillars described above in Lemmas 3.1 and 3.3, we now revisit
the analysis of some of the most fundamental FOM, stressing simplification in the
proofs, which in turn also allows us to present less well-known results, as well as to
derive new ones.
We start with the popular additive convex composite model, which despite its sim-
plicity, underscores the basic elements leading to most fundamental FOM, and also
covers a broad class of problems arising in a wide spectrum of applications.

4.1 The additive composite optimization model

The Problem and Assumptions. Consider the following additive convex composite
model:

    (CM)  v(P) = inf{ Φ(u) := f(u) + g(u) : u ∈ C },


where C ⊆ E is a closed convex set with a nonempty interior. The following assump-
tions are made throughout this section.

Assumption CM (i) f : E → (−∞, ∞] is proper, lsc and convex,
(ii) h : E → (−∞, ∞] is Legendre, with dom h = C,
(iii) g : E → (−∞, ∞] is proper, lsc and convex with dom g ⊃ dom h, which is
differentiable on int dom h, and condition (LC) holds: there exists L > 0 such that
Lh − g is convex on int dom h,
(iv) dom f ∩ int dom h ≠ ∅,
(v) −∞ < v(P) = inf{ Φ(x) : x ∈ C }.
Given a Legendre function h, for all x ∈ int dom h and any λ > 0, we define

    Tλ(x) := argmin { f(u) + ⟨∇g(x), u − x⟩ + (1/λ) Dh(u, x) : u ∈ C }. (4.1)

This map emerges from the usual approach which consists of linearizing the
differentiable part g around x, and regularizing it with a proximal distance from that point.
Again, when h is the energy function, this is nothing else but the classical proximal
gradient map, also known as the proximal forward–backward splitting map [32,36,55].
Throughout, we suppose that assumption CM holds, and recall that we systemati-
cally assume that Tλ is well-defined and resides in int dom h (cf. Sect. 2 and [14]).
We consider solving problem (CM) with the Bregman Proximal Gradient (BPG) method,
that is, we generate a sequence {xᵏ}k∈N via the following fixed point iteration for the
map Tλ(·):

    x⁰ ∈ int dom h,  xᵏ⁺¹ = Tλ(xᵏ),  k = 0, 1, . . . .
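As an illustration of our own (not an example from the paper), take f ≡ 0, C the unit simplex, g(x) = (1/2)‖Ax − b‖², and the entropy kernel h(x) = Σ_j x_j log x_j; then Tλ has a closed-form multiplicative update. The data A, b and the step size choice below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
b = A @ np.array([0.1, 0.4, 0.2, 0.2, 0.1])   # planted solution in the simplex

obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2)    # g(x) = (1/2)||Ax - b||^2
grad_g = lambda x: A.T @ (A @ x - b)

# With h(x) = sum_j x_j log x_j on the unit simplex, the map T_lam reduces to the
# multiplicative update x+_j proportional to x_j * exp(-lam * grad_g(x)_j).
lam = 1.0 / np.linalg.norm(A.T @ A, 2)            # illustrative step-size choice
x = np.full(5, 0.2)                               # start at the simplex barycenter
vals = [obj(x)]
for _ in range(500):
    w = x * np.exp(-lam * grad_g(x))
    x = w / w.sum()                               # exact Bregman prox step T_lam(x)
    vals.append(obj(x))

print(vals[0], vals[-1], x.sum())
```

Note that the simplex constraint is handled "for free" by the kernel: every iterate is automatically positive and sums to one, which is precisely the motivation (a) of Sect. 2.2.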

The Bregman proximal gradient scheme and its convergence rate have been inves-
tigated in other contexts in past studies, e.g., [8,65] under various assumptions. We
will come back to this point at the end of this subsection.
A Mixer in Action: An “All-in-One” Proximal Inequality. Combining the two main
inequalities established in Lemmas 3.1 and 3.3, the mixer in action, gives an “all in
one” proximal inequality. Recall that for the result below, we do not assume that g is
convex.

Lemma 4.1 (All in One Proximal Inequality) Let x ∈ int dom h, λ > 0, and

x^+ = argmin{ f(u) + ⟨∇g(x), u − x⟩ + (1/λ) Dh(u, x) : u ∈ C }.

Then, for any u ∈ dom h, we have

λ(Φ(x^+) − Φ(u)) ≤ Dh(u, x) − Dh(u, x^+) − (1 − λL)Dh(x^+, x) − λDg(u, x).   (4.2)


Proof Writing successively the Bregman proximal inequality (Lemma 3.1) for the convex function u → ϕ(u) := f(u) + ⟨∇g(x), u − x⟩, and the Three Points Descent Lemma 3.3 for g, yields the following inequalities:

λ( f(x^+) − f(u)) ≤ λ⟨∇g(x), u − x^+⟩ + Dh(u, x) − Dh(u, x^+) − Dh(x^+, x),
λ(g(x^+) − g(u)) ≤ λ⟨∇g(x), x^+ − u⟩ + λL Dh(x^+, x) − λDg(u, x).

Adding these two inequalities proves the desired result (4.2). □




As a special case, when g is assumed convex, the term λDg(u, x) is nonnegative and thus can be dropped. In this case we recover the key descent inequality established in Bauschke et al. [14, Lemma 4], that is,

λ(Φ(x^+) − Φ(u)) ≤ Dh(u, x) − Dh(u, x^+) − (1 − λL)Dh(x^+, x), ∀u ∈ dom h.   (4.3)
This inequality is all that is needed to establish the O(1/n) rate of convergence for the BPG method, as established in Bauschke et al. [14, Theorem 1], as well as for its relatives: the Bregman proximal minimization and the Bregman projected gradient schemes (hence including as special cases the classical results with the squared Euclidean norm).

Theorem 4.1 (Basic complexity estimate) [14] Let {x^k}k∈N be the sequence generated by BPG with λ ∈ (0, L^{-1}]. Then,
(a) the sequence {Φ(x^k)}k∈N is nonincreasing.
(b) If λ = L^{-1}, then for any n ≥ 1,

Φ(x^n) − Φ(u) ≤ (L/n) Dh(u, x^0), ∀ u ∈ dom h.

Proof Let k ≥ 0. Invoking Lemma 4.1 (e.g., (4.3)) with x = x^k, x^+ = x^{k+1} = Tλ(x^k) and λ ≤ L^{-1}, we obtain for any u ∈ dom h:

λ(Φ(x^{k+1}) − Φ(u)) ≤ Dh(u, x^k) − Dh(u, x^{k+1}).

For u = x^k we obviously have Dh(x^k, x^k) = 0, and since Dh(·, ·) ≥ 0, the above inequality implies that Φ(x^{k+1}) ≤ Φ(x^k), proving (a). Summing the above inequality over k = 0, . . . , n − 1, we thus obtain, with λ = L^{-1}:

Σ_{k=0}^{n−1} (Φ(x^{k+1}) − Φ(u)) ≤ L(Dh(u, x^0) − Dh(u, x^n)) ≤ L Dh(u, x^0).

Since by (a) the sequence {Φ(x^k)}k∈N is nonincreasing, setting s^k := Φ(x^k) − Φ(u) we have s^{k+1} ≤ s^k, and hence s^n ≤ n^{-1} Σ_{k=0}^{n−1} s^{k+1}, which together with the above inequality proves the desired result (b). □



Remark 4.1 A close inspection of Lemma 4.1 shows again the key role of condition (LC). Indeed, thanks to the structural definition of the Bregman distance, it is easily seen by a simple algebraic manipulation that with λL = 1 in (4.2), the complexity constant derived in Theorem 4.1(b), given by Γ := L Dh(u, x^0), can be replaced by the complexity constant Γ′ := D_{Lh−g}(u, x^0); since g is convex, clearly D_{Lh−g}(u, x^0) ≤ L Dh(u, x^0), thus providing the improved constant Γ′ ≤ Γ.
As a straightforward consequence of the above analysis, much like in the classical Euclidean squared norm setting, assuming here that g − σh is convex on int dom h for some σ > 0 (i.e., that g is σ-strongly convex with respect to h, see [11, Definition 4.1]), we immediately obtain the following linear rate of convergence of BPG for the value gap Φ(x^n) − Φ∗. Here we denote by x∗ the unique minimizer of Φ, and by Φ∗ its minimal value Φ(x∗).
Proposition 4.1 (Linear Rate of BPG) Let {x^k}k∈N be the sequence generated by BPG with λ = L^{-1}, and assume that g − σh is convex on int dom h for some σ > 0. Then, for any n ≥ 0,

Φ(x^{n+1}) − Φ∗ ≤ (1 − σ/L)^{n+1} L Dh(x∗, x^0).

Proof Simply invoke again Lemma 4.1 with u := x∗, x = x^k, and x^+ = x^{k+1}, followed by the convexity of g − σh, i.e., Dg(·, ·) ≥ σ Dh(·, ·), and the nonnegativity of Dh(·, ·), to obtain for all k ≥ 0:

Φ(x^{k+1}) − Φ∗ ≤ L(Dh(x∗, x^k) − Dh(x∗, x^{k+1})) − Dg(x∗, x^k)
             ≤ (L − σ)Dh(x∗, x^k) − L Dh(x∗, x^{k+1})
             ≤ L(1 − σ/L) Dh(x∗, x^k).

Since Φ∗ ≤ Φ(x^{k+1}), applying recursively the second inequality above implies

Dh(x∗, x^k) ≤ (1 − σ/L)^k Dh(x∗, x^0),

and hence the claimed result immediately follows from the third inequality. □
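Proposition 4.1 can be checked numerically in the simplest Euclidean setting (h the energy function, so Dh(u, x) = ½‖u − x‖², f = 0, and g a strongly convex quadratic); the instance below is an illustrative sketch of ours, not part of the original analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Build Q with known extreme eigenvalues: sigma = 0.5, L = 4.0, so that
# g - sigma*h is convex and L*h - g is convex for the energy h.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(0.5, 4.0, d)
Q = U @ np.diag(eigs) @ U.T
b = rng.standard_normal(d)
sigma, L = eigs[0], eigs[-1]

g = lambda v: 0.5 * v @ Q @ v - b @ v
x_star = np.linalg.solve(Q, b)
g_star = g(x_star)

x = np.zeros(d)
Dh0 = 0.5 * np.sum((x_star - x) ** 2)    # D_h(x*, x^0) for the energy h
ok = True
for n in range(200):
    x = x - (1.0 / L) * (Q @ x - b)      # BPG step with f = 0 and lam = 1/L
    bound = (1 - sigma / L) ** (n + 1) * L * Dh0
    ok = ok and (g(x) - g_star <= bound + 1e-12)
assert ok
```

Every iterate satisfies the predicted geometric bound, with room to spare.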

Global Pointwise Convergence of BPG. As usual, deriving the global convergence of the sequence generated by proximal based schemes requires more subtle arguments, in particular to determine the “largest possible step size” in a given algorithm; see for instance the analysis in [36] for the classical proximal gradient. As recently shown in [14], it turns out that the notion of a symmetry coefficient for Dh plays a fundamental role in determining the most aggressive step size for global pointwise convergence of BPG.
Indeed, Bregman distances are in general not symmetric, except when h is the
energy function, for which it simply reduces to the squared Euclidean distance. It is
thus natural to introduce a measure for the lack of symmetry in Dh , as done in [14].


Definition 4.1 (Symmetry coefficient) Given a Legendre function h : E → (−∞, ∞], its symmetry coefficient is defined by

α(h) := inf{ Dh(x, y)/Dh(y, x) : (x, y) ∈ int dom h × int dom h, x ≠ y } ∈ [0, 1].   (4.4)

Clearly, perfect symmetry occurs when α(h) = 1, i.e., when h(·) = 2^{-1}‖·‖^2. Total lack of symmetry, namely α(h) = 0, occurs for the key examples h(x) = x log x and h(x) = −log x, which often arise in applications, while with h(x) = x^4 one obtains α(h) = 2 − √3, [14].
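The reported value α(h) = 2 − √3 for h(x) = x^4 can be probed numerically with a small grid search over the one-dimensional ratio Dh(x, y)/Dh(y, x); the grid and tolerances below are ad hoc illustrative choices (by 4-homogeneity it suffices to fix y = ±1):

```python
import math

def Dh(u, x):
    # Bregman distance of h(t) = t^4: D_h(u, x) = u^4 - x^4 - 4 x^3 (u - x)
    return u**4 - x**4 - 4.0 * x**3 * (u - x)

best = float("inf")
for y in (-1.0, 1.0):
    for k in range(2001):
        x = -6.0 + 0.006 * k
        if abs(x - y) < 1e-8:
            continue                      # exclude x = y (ratio undefined)
        best = min(best, Dh(x, y) / Dh(y, x))

alpha = 2.0 - math.sqrt(3.0)              # the value reported for h(t) = t^4 in [14]
assert best >= alpha - 1e-9               # the ratio never dips below 2 - sqrt(3)
assert best - alpha < 1e-3                # and the grid comes close to it
```

The grid minimum is attained near x/y = −(2 + √3), matching the closed-form value up to grid resolution.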
Equipped with the symmetry coefficient, it was shown in [14] that the most aggressive step size that can be used in BPG satisfies

0 < λ < (1 + α(h))/L.

This nicely recovers the standard step size λ ∈ (0, 2/L) which warrants global convergence in the classical proximal gradient [36], since in that case α(h) = 1.
To establish the global convergence to an optimal solution, additional assumptions
on the Bregman proximal distance (satisfied by all examples in Example 2.1) is needed
to ensure separation properties of the Bregman distance at the boundary, so that we
can use arguments “à la Opial” [53] as done with classical norms.

Assumption H (i) For every x ∈ dom h and β ∈ R, the level set {y ∈ int dom h : Dh(x, y) ≤ β} is bounded.
(ii) If {x^k}k∈N converges to some x in dom h, then Dh(x, x^k) → 0.
(iii) Reciprocally, if x is in dom h and if {x^k}k∈N is such that Dh(x, x^k) → 0, then x^k → x.

We then have the following global convergence result.

Theorem 4.2 (BPG: Global Convergence) [14, Theorem 2] Let {x k }k∈N be the
sequence generated by BPG with λ ∈ (0, (1 + α(h))/L). Assume that C = dom h =
dom h, that the solution set argmin_C Φ of problem (P) is nonempty and compact, and that Assumption H is satisfied. Then, the sequence {x^k}k∈N converges to some solution x∗ of
problem (P).

To conclude this part, it is interesting to re-emphasize the key role played by the Lipschitz-like convexity condition (LC), which is sufficient and weaker than the assumptions used in past studies. For instance, when f = 0 the algorithm BPG is the gradient method with Bregman
distance studied in [8], and later extended in [65] to handle the composite model (CM).
An important difference with both works is that the usual assumptions (∇g globally L-Lipschitz continuous, h strongly convex) are no longer needed;
see also the concluding remarks for further discussion.


4.2 Non-Euclidean projected subgradient schemes: mirror descent

Consider the special case of problem (CM) with g = 0, namely the nonsmooth convex
problem
(N S) min{ f (u) : u ∈ C}.

We use the notation f′(u) for a subgradient element of the subdifferential set ∂ f(u).
To solve problem (NS), we consider the so-called Mirror Descent method introduced by Nemirovsky and Yudin [50, Chapter 4]. As shown in Beck and Teboulle [16], Mirror Descent can be derived and analyzed through the lenses of the proximal framework as a non-Euclidean Bregman projected subgradient scheme, and reads as follows: starting from x^0 ∈ int dom h, with step sizes t_k > 0,

(MD)   x^{k+1} = argmin{ ⟨ f′(x^k), u − x^k⟩ + (1/t_k) Dh(u, x^k) : u ∈ C },   k = 0, 1, . . .
The usual and standard assumptions to establish the rate of convergence and iteration
complexity of MD require:
(a) σ -strong convexity of h,
(b) Lipschitz continuity of the convex function f, namely there exists L_f > 0 such that

| f(x) − f(y)| ≤ L_f ‖x − y‖, for all x, y ∈ dom f,

which implies that for all x ∈ int dom f and f′(x) ∈ ∂ f(x), we have ‖ f′(x)‖∗ ≤ L_f.
As we shall see, we can relax these requirements with a weaker and more general assumption in the spirit of condition (LC), namely a single condition capturing the data information on both f and h, which naturally emerges from a close inspection of the proof derived in [16].
To see this, applying Lemma 3.1 to ϕ(u) = ⟨ f′(x^k), u − x^k⟩ + δ_C(u) with λ = t_k, x = x^k and x^+ = x^{k+1}, we get the iterate produced by (MD), which then yields the first inequality below, while the second follows from the subgradient inequality for the convex function f:

t_k ⟨ f′(x^k), x^{k+1} − u⟩ ≤ Dh(u, x^k) − Dh(u, x^{k+1}) − Dh(x^{k+1}, x^k),
t_k ( f(x^k) − f(u)) ≤ t_k ⟨ f′(x^k), x^k − u⟩ = t_k ⟨ f′(x^k), x^{k+1} − u⟩ + t_k ⟨ f′(x^k), x^k − x^{k+1}⟩.

Adding these two inequalities, we then obtain for all u ∈ dom h:

t_k ( f(x^k) − f(u)) ≤ Dh(u, x^k) − Dh(u, x^{k+1}) + [ t_k ⟨ f′(x^k), x^k − x^{k+1}⟩ − Dh(x^{k+1}, x^k) ].   (4.5)


To continue the standard proof of MD and obtain a meaningful estimate in terms of some objective function gap, observe that using standard telescoping arguments over k = 0, . . . , n, the difference of the first two Bregman terms will behave nicely and collapse to Dh(u, x^0) − Dh(u, x^{n+1}). Therefore, to complete the proof, all that is needed is an adequate bound for the bracketed expression given by the two last terms in (4.5). In the classical proof of MD, this is obtained through the standard Assumptions (a) and (b) alluded to above (cf. [16, Theorem 4.1, inequality (4.21)]). This naturally leads us to consider instead the following weaker condition on the pair [ f, h].
Condition W[f,h] For any t > 0, there exists G > 0 such that

t ⟨ f′(x), x − u⟩ − Dh(u, x) ≤ (t^2/2) G^2, for all u ∈ dom h, x ∈ int dom h.
The very simple result below shows that this condition is weaker (hence the W!) than the usual separate Assumptions (a) and (b) on [ f, h] made in the classical analysis of the MD method.

Proposition 4.2 Assumptions (a) and (b) imply the validity of W[f,h] with G = L_f/√σ.
Proof Using the Fenchel–Young inequality 2⟨a, b⟩ ≤ s^{-1}‖a‖∗^2 + s‖b‖^2 for all a, b ∈ E and s > 0, with a := t f′(x), b := x − u and s := σ, followed by the σ-strong convexity of h, we obtain for any t > 0, u ∈ dom h, and x ∈ int dom f:

t ⟨ f′(x), x − u⟩ − Dh(u, x) ≤ (t^2/2σ)‖ f′(x)‖∗^2 + (σ/2)‖x − u‖^2 − Dh(u, x)
                            ≤ (t^2/2σ)‖ f′(x)‖∗^2 ≤ (t^2/2σ) L_f^2,

proving that W[f,h] holds with G = L_f/√σ. □
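As a small sanity check of Proposition 4.2, the scalar instance f(x) = |x| (so L_f = 1) with the energy kernel h (σ = 1) should satisfy W[f,h] with G = 1; the random probe below is an illustrative sketch of ours:

```python
import random

random.seed(2)
# f(x) = |x|: f'(x) = sign(x), |f'(x)| <= L_f = 1.
# h = energy: D_h(u, x) = 0.5*(u - x)^2, sigma = 1, so G = L_f / sqrt(sigma) = 1.
G = 1.0
worst_gap = float("-inf")
for _ in range(10000):
    t = random.uniform(0.01, 5.0)
    x = random.gauss(0.0, 3.0)
    u = random.gauss(0.0, 3.0)
    sign_x = 1.0 if x >= 0 else -1.0
    lhs = t * sign_x * (x - u) - 0.5 * (u - x) ** 2   # t<f'(x), x - u> - D_h(u, x)
    worst_gap = max(worst_gap, lhs - 0.5 * t**2 * G**2)

# The bound of W[f,h] holds; it is tight at u = x - t*sign(x).
assert worst_gap <= 1e-12
```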

Under our standing assumption on problem (NS), summarizing the above developments for the MD algorithm, we obtain the following main result expressed in terms of the best value function min_{0≤k≤n} f(x^k) (or, likewise, thanks to Jensen's inequality, in terms of the average value function f(x̂^n), where x̂^n = (Σ_{k=0}^n t_k)^{-1} Σ_{k=0}^n t_k x^k).

Theorem 4.3 (Best value function gap estimate for MD) Let {x^k}k∈N be the sequence generated by the MD algorithm, and suppose that condition W[f,h] holds for all u ∈ dom h and x^k ∈ int dom h, t_k > 0. Then, for any n ≥ 0 and u ∈ dom h,

min_{0≤k≤n} f(x^k) − f(u) ≤ [ Dh(u, x^0) + (G^2/2) Σ_{k=0}^n t_k^2 ] / Σ_{k=0}^n t_k.

Proof Thanks to condition W[f,h] with t := t_k and x := x^k, we get from (4.5)

t_k ( f(x^k) − f(u)) ≤ Dh(u, x^k) − Dh(u, x^{k+1}) + (t_k^2/2) G^2, for all u ∈ dom h.


Summing the above inequality over k = 0, . . . , n, we obtain for all u ∈ dom h

Σ_{k=0}^n t_k ( f(x^k) − f(u)) ≤ Dh(u, x^0) − Dh(u, x^{n+1}) + (G^2/2) Σ_{k=0}^n t_k^2,

and hence, the claimed result follows at once. □



From the above result, assuming that the optimal set X∗ of problem (NS) is nonempty, let x∗ ∈ X∗, and suppose that R(x^0) := max_{x∗∈X∗} Dh(x∗, x^0) < ∞. Then, the O(1/√n) rate of convergence result easily follows as done in [16] by minimizing the righthand side above with respect to t_k > 0 (cf. [16, Proposition 4.1]) to obtain t_k = (1/G)√(2R(x^0)/(n + 1)), k = 0, . . . , n, from which we get

min_{0≤k≤n} f(x^k) − f(x∗) ≤ G √(2R(x^0)/(n + 1)).
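For instance, on the unit simplex with the entropy kernel h(x) = Σ x_i log x_i, the MD update has the familiar multiplicative form x^{k+1} ∝ x^k e^{−t_k f′(x^k)}. The sketch below is illustrative only (a linear objective, constant steps chosen as above, using that Dh(u, x^0) ≤ log d for the uniform starting point):

```python
import numpy as np

def mirror_descent_simplex(c, n_iters):
    """MD for min <c, x> over the unit simplex with h = entropy.
    Update: x_{k+1} proportional to x_k * exp(-t_k * f'(x_k))."""
    d = len(c)
    G = np.max(np.abs(c))                      # here f is linear: ||f'(x)||_inf <= G
    R = np.log(d)                              # D_h(u, x0) <= log d for uniform x0
    t = np.sqrt(2.0 * R / (n_iters + 1)) / G   # constant step from the rate analysis
    x = np.full(d, 1.0 / d)
    best = np.inf
    for _ in range(n_iters + 1):
        best = min(best, c @ x)
        w = x * np.exp(-t * c)                 # entropy Bregman projected subgradient step
        x = w / w.sum()
    return best

c = np.array([0.3, -0.1, 0.7, 0.2])
best = mirror_descent_simplex(c, n_iters=2000)
gap = best - c.min()    # Theorem 4.3: gap <= G * sqrt(2 log d / (n + 1))
assert 0.0 <= gap <= np.max(np.abs(c)) * np.sqrt(2 * np.log(len(c)) / 2001) + 1e-12
```

The observed best value gap indeed stays within the O(1/√n) estimate.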

The same line of analysis can be adapted in a straightforward way for the composite model (CM) when both f and g are nonsmooth and C = E, which results in a Bregman proximal-subgradient scheme: with step sizes t_k > 0,

x^{k+1} = argmin{ g(u) + ⟨ f′(x^k), u − x^k⟩ + (1/t_k) Dh(u, x^k) : u ∈ E },   k = 0, 1, . . .
This scheme was analyzed by Duchi et al. [39] under the usual strong convexity of h and Lipschitz continuity of f, and by assuming in addition that g(·) is a nonnegative function. Here, our analysis implies that the proof can be done in an analogous way, but under the weaker condition W[f,h] on f. Alternatively, one can also dispense with the additional nonnegativity assumption on g, at the price of increasing the constant in the estimated rate: the constant G, which was attributed to f, should be replaced with a constant G′ so that the sum f + g satisfies the condition W[f+g,h].

4.3 Extension to convexly composite model

We outline here an extension that can benefit from the above developments, and con-
sider the convexly composite model

(Cvx − CM) inf{c(x) := f (x) + θ (G(x)) : x ∈ Rd },

where G(x) = (g1 (x), . . . , gm (x)).


Here θ : R^m → (−∞, ∞] is a proper, lsc and convex function, which is assumed to be nondecreasing with respect to its i-th argument for each nonlinear g_i. The assumptions are as before for f and h, and we also assume that condition (LC) holds for each pair (g_i, h), i.e., there exists L_i > 0 such that L_i h − g_i is convex, where for simplicity we assume here the same h for each pair (in this respect, see also Remark 4.2 below.)


Under the above assumptions, clearly (Cvx-CM) is a convex problem, which is far more flexible than the additive (CM) (easily recovered with θ being the identity map and g = g_1, m = 1) and covers many important classical convex optimization models, such as minimax of a finite collection, standard constrained convex programming, Lagrangian-type/penalty objectives, etc. This model was popular in the mid 80's and then nearly abandoned. However, it has recently resurfaced; see for instance the recent works: [45], which also includes additional elaborate coverage on proximal methods and earlier references motivating this model; [38], which provides a comprehensive analysis of the role of error bounds in the linear convergence of proximal methods for this class; and the very recent work [25] in the context of nonconvex Lagrangian methods and their variants.
Suppose we want to solve (Cvx-CM) via BPG. Beforehand, it will be convenient to introduce the following notation. For i = 1, . . . , m, let l_{g_i}(x, y) := g_i(y) + ⟨∇g_i(y), x − y⟩, and define

l_G(x, y) := (l_{g_1}(x, y), . . . , l_{g_m}(x, y)),   L := (L_1, . . . , L_m),
I_a := {1 ≤ i ≤ m : g_i is affine}   and   V := {v ∈ R^m_+ : v_i = 0 for i ∈ I_a}.

Let x^0 ∈ dom f ∩ G^{-1}(dom θ) ∩ int dom h. The Bregman Proximal Gradient iteration for (Cvx-CM) reads: for λ > 0,

x^{k+1} = argmin{ f(u) + θ(l_G(u, x^k)) + (1/λ) Dh(u, x^k) : u ∈ R^d },   k = 0, 1, . . .
Throughout, we assume that x k+1 is well defined, and in int dom h (this can be
ensured through additional technical assumptions along the lines of Lemma 2.3; for
simplicity we omit the details).
From the proof-mechanism revealed in the analysis of the previous schemes, it is clear that all that is needed to analyze this case is an appropriate Descent-like Lemma for the composite term θ(G(x)). From there, the rate of convergence analysis of BPG follows in an analogous way. It turns out that we can express the adequate choice of λ in terms of the asymptotic (recession) function of θ evaluated at L, a notion that we now recall with its basic properties, see [58, p. 87] and [5, p. 53].

Proposition 4.3 (Asymptotic function) Let θ : R^m → (−∞, ∞] be a proper, lsc and convex function. Then, the following properties hold for the asymptotic function θ∞ of θ:
(a) θ∞(d) = sup{θ(x + d) − θ(x) : x ∈ dom θ} for all d ∈ R^m.
(b) θ∞ is proper, lsc, convex and positively homogeneous, i.e., θ∞(sd) = sθ∞(d) for every s > 0.
(c) Let L_θ ≥ 0. Then, dom θ = R^m and θ is L_θ-Lipschitz continuous if and only if

θ∞(d) ≤ L_θ ‖d‖, for all d ∈ R^m.


We then have the following descent-like property for the convexly composite function θ(G(x)).

Lemma 4.2 (Descent-like property for a convexly composite function) Let θ : R^m → (−∞, ∞] be proper, lsc, convex, and nondecreasing in its i-th argument for i ∉ I_a, and let x ∈ int dom h. Then, for any u ∈ dom h, the following statements hold:
(a) θ(G(u)) − θ(l_G(u, x)) ≤ θ∞(L)Dh(u, x), with L ∈ V and θ∞(L) ≥ 0.
(b) Suppose that θ : R^m → R is Lipschitz continuous with constant L_θ > 0. Then, we have

θ(G(u)) − θ(l_G(u, x)) ≤ L_θ ‖L‖ Dh(u, x).

Proof For any i = 1, . . . , m, the convexity of g_i together with the condition (LC) for each pair¹ (g_i, h) imply the validity of the following componentwise inequality:

0 ≤ G(u) − l_G(u, x) ≤ L Dh(u, x),

that is, G(u) − l_G(u, x) ∈ V. Since θ is nondecreasing in each of its arguments i ∉ I_a, it follows from Proposition 4.3(a) that θ∞(L) ≥ 0. To establish the claimed inequality, using G(u) − l_G(u, x) ∈ V and the monotonicity of θ, we get:

θ(G(u)) − θ(l_G(u, x)) ≤ θ(l_G(u, x) + G(u) − l_G(u, x)) − θ(l_G(u, x))
                      ≤ θ(l_G(u, x) + L Dh(u, x)) − θ(l_G(u, x))
                      ≤ θ∞(L Dh(u, x))
                      = Dh(u, x) θ∞(L),

where the last inequality and the equality follow from Proposition 4.3(a) and (b), respectively.
(b) Since we assumed that θ is finite-valued and L_θ-Lipschitz continuous on R^m, the desired statement follows from item (a) and Proposition 4.3(c). □


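The descent-like property of Lemma 4.2 can be verified numerically on a toy minimax instance (θ = max, Euclidean h, scalar quadratics g_i); the data below are arbitrary and purely illustrative:

```python
import random

random.seed(3)
# Illustrative minimax instance: theta = max, Euclidean h (D_h(u, x) = (u - x)^2/2),
# and scalar g_i(x) = 0.5*a_i*x^2 + b_i*x, for which (LC) holds with L_i = a_i.
a = [0.5, 2.0, 1.0]
b = [1.0, -1.0, 0.5]

def G_map(x):
    return [0.5 * ai * x * x + bi * x for ai, bi in zip(a, b)]

def lin_G(u, x):
    # componentwise linearization l_{g_i}(u, x) = g_i(x) + g_i'(x)(u - x)
    return [gi + (ai * x + bi) * (u - x) for gi, ai, bi in zip(G_map(x), a, b)]

# Lemma 4.2(a): theta(G(u)) - theta(l_G(u, x)) <= theta_inf(L) * D_h(u, x);
# for theta = max, the asymptotic function is theta_inf(d) = max_i d_i.
theta_inf_L = max(a)
worst = float("-inf")
for _ in range(10000):
    u = random.gauss(0.0, 3.0)
    x = random.gauss(0.0, 3.0)
    Dh = 0.5 * (u - x) ** 2
    worst = max(worst, max(G_map(u)) - max(lin_G(u, x)) - theta_inf_L * Dh)
assert worst <= 1e-9
```

Here the inequality is in fact tight whenever the largest curvature a_i attains the max, since g_i(u) − l_{g_i}(u, x) = a_i Dh(u, x) exactly for quadratics.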
Thanks to this lemma, the convergence rate result of BPG for the convexly composite model (Cvx-CM) follows by the same arguments as in Theorem 4.1 for the additive composite model (CM); the only change is in the choice of λ, producing the same O(1/n) rate for problem (Cvx-CM) with an adjusted complexity constant. This is recorded in the following theorem.

Theorem 4.4 (Complexity estimate of BPG for (Cvx-CM)) Let {x^k}k∈N be the sequence generated by BPG for (Cvx-CM) with λ^{-1} = L_θ ‖L‖. Then, for all n ≥ 1,

c(x^n) − c(u) ≤ (L_θ ‖L‖ Dh(u, x^0))/n, ∀ u ∈ dom h.

¹ Note that when i ∈ I_a, we can take L_i = 0 and hence the condition (LC) still holds, since L_i h − g_i = −g_i with g_i affine.


Note that the same rate with complexity constant θ∞(L) would remain valid when θ is extended-valued, provided one can guarantee in this case that 0 < θ∞(L) < ∞.

Remark 4.2 An extension presuming a different h_i for each g_i, with the condition L_i h_i − g_i convex for each i, can also be established. For brevity we omit the additional technical details. It can be shown that in that case it would result in a simple adjustment in the right hand side of Lemma 4.2, which reads for any u ∈ dom h:

θ(G(u)) − θ(l_G(u, x)) ≤ L_θ ‖L‖ max_{1≤i≤m} D_{h_i}(u, x).

5 From convex to nonconvex models

The nonconvex setting is far more difficult than its convex counterpart, and remains
highly challenging. Convergence to a global optimum is usually out of reach and rate of
convergence results are obviously limited and weaker. For the classical proximal gradient equipped with the squared Euclidean norm, the analysis goes back to Fukushima and Mine [42], and more recent works include e.g., [4,29]. Moreover, these works
also imposed the usual restrictive global Lipschitz continuity of the gradient of g. Very
recently, these results have been significantly improved in Bolte et al. [24] within the
non-Euclidean proximal framework for the fully nonconvex composite model, namely
both f and g being nonconvex, and with g lacking a global Lipschitz continuous gra-
dient.
For brevity, here we describe some of these recent results derived in [24] by focusing
on the following simpler nonconvex model, which consists of minimizing the sum of
a convex nonsmooth function with a nonconvex differentiable function satisfying the
condition (LC). This simplified model is already of sufficient importance in many
applications, and we refer the reader to [24] for details on analyzing the more general
nonconvex model.

5.1 Nonconvex BPG: rate of convergence

Throughout this section we consider the composite problem (P) defined on C ≡ E = R^d, namely

(NC)   v∗ = inf{Φ(u) := f(u) + g(u) : u ∈ R^d},

under the following standing assumptions.

Assumption NC (i) f : R^d → (−∞, ∞] is proper, lsc and convex,
(ii) dom h = R^d and h is σ-strongly convex on R^d,
(iii) g : R^d → R is nonconvex differentiable, and satisfies condition (LC),
(iv) v∗ := inf{Φ(u) : u ∈ R^d} > −∞.


As in the convex case, we consider the BPG iteration, which here we start with x^0 ∈ R^d, and generate the sequence {x^k}k∈N via its Bregman proximal gradient map:

x^{k+1} = Tλ(x^k) := argmin{ f(u) + ⟨∇g(x^k), u − x^k⟩ + (1/λ) Dh(u, x^k) : u ∈ R^d }.

Clearly, here since dom h = Rd and h is strongly convex, the mapping Tλ is well-
defined and single-valued. We have the following rate of convergence properties
recently derived in [24].

Theorem 5.1 Suppose that Assumption NC holds. Let {x^k}k∈N be the sequence generated by BPG with 0 < λL < 1. Then, the sequence {Φ(x^k)}k∈N is nonincreasing, and for every n ≥ 1 we have

min_{1≤k≤n} ‖x^{k+1} − x^k‖ ≤ (1/√n) [ 2λ(Φ(x^0) − Φ∗)/(σ(1 − λL)) ]^{1/2}.

Proof Under the condition (LC), we can invoke Lemma 4.1 with u = x = x^k and x^+ = x^{k+1} = Tλ(x^k) to obtain

λ(Φ(x^{k+1}) − Φ(x^k)) ≤ −(1 − λL)Dh(x^{k+1}, x^k) − Dh(x^k, x^{k+1})
                      ≤ −(1 − λL)Dh(x^k, x^{k+1}).   (5.1)

Since 0 < λL < 1, this proves that {Φ(x^k)}k∈N is nonincreasing. Summing the inequality (5.1) over k = 0, . . . , n, and using the σ-strong convexity of h, it follows that

(σ(1 − λL)/2) n min_{1≤k≤n} ‖x^k − x^{k+1}‖^2 ≤ (1 − λL) Σ_{k=0}^n Dh(x^k, x^{k+1}) ≤ λ(Φ(x^0) − Φ(x^{n+1})),

and with Φ(x^{n+1}) ≥ Φ∗ > −∞, the claimed result follows. □


Note that this result recovers the classical rate of convergence of the proximal gradient algorithm for the nonconvex composite model [18, Theorem 2.3], with h being the energy function, σ = 1 and λL = 1/2. However, a major difference here is that we do not require the gradient of the nonconvex function g to be globally Lipschitz continuous. Instead, once again, condition (LC) plays the central role to adapt to the geometry of the problem at hand. This allows to treat important models; see [24] for an interesting application with simple globally convergent algorithms for the so-called class of quadratic inverse problems with sparsity constraints, which, having a quartic objective, fails to have a global Lipschitz gradient.
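The flavor of these results can be seen on a toy scalar instance of our own (much simpler than the quadratic inverse problems of [24]): take f = 0 and the nonconvex g(x) = ¼(x² − 1)², whose cubic gradient is not globally Lipschitz, with h(x) = ¼x⁴ + ½x². Then (h − g)″ = 2 > 0, so (LC) holds with L = 1, and h is 1-strongly convex. The BPG step solves ∇h(x⁺) = ∇h(x) − λ∇g(x), a monotone cubic equation handled here by bisection:

```python
def g(x):  return 0.25 * (x * x - 1.0) ** 2   # nonconvex; grad g(x) = x^3 - x
def dg(x): return x**3 - x
def dh(x): return x**3 + x                    # h(x) = x^4/4 + x^2/2: h'' - g'' = 2 > 0, so (LC) with L = 1

def solve_dh(c):
    # Invert dh: solve t^3 + t = c for its unique real root (dh is strictly increasing).
    lo, hi = -1.0 - abs(c), 1.0 + abs(c)
    for _ in range(200):                      # plain bisection
        mid = 0.5 * (lo + hi)
        if dh(mid) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = 0.5                                     # 0 < lam*L < 1
x = 3.0
vals = [g(x)]
for _ in range(300):
    x = solve_dh(dh(x) - lam * dg(x))         # BPG step with f = 0
    vals.append(g(x))

assert all(v2 <= v1 + 1e-12 for v1, v2 in zip(vals, vals[1:]))  # monotone decrease (Theorem 5.1)
assert abs(dg(x)) < 1e-6                      # the limit is a critical point (here x -> 1)
```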


5.2 Global convergence to a critical point

To prove the global convergence of the sequence {x^k}k∈N generated by BPG to a critical point of Φ, we first outline a general abstract convergence mechanism as recently developed in [29], which is very flexible and can be applied to any given algorithm
satisfying the premises described below. To this end, let F : Rd → (−∞, +∞] be a
proper and lower semicontinuous function which is bounded from below and consider
the problem  
inf{F(x) : x ∈ R^d}.

Consider a generic algorithm A which generates a sequence {x k }k∈N via the fol-
lowing:

start with any x^0 ∈ R^d and set x^{k+1} ∈ A(x^k), k = 0, 1, . . .

The main goal is to prove that the whole sequence {x k }k∈N , generated by the algo-
rithm A, converges to a critical point x ∗ of F, namely 0 ∈ ∂ F(x ∗ ). Note that in this
general setting, ∂ F stands for the limiting subdifferential of F, cf. [57].

Definition 5.1 (Gradient-like Descent Sequence) A sequence {x^k}k∈N is called a gradient-like descent sequence for F if the following three conditions hold:
(C1) Sufficient decrease property. There exists a positive scalar ρ1 such that

ρ1 ‖x^{k+1} − x^k‖^2 ≤ F(x^k) − F(x^{k+1}), ∀ k ∈ N.

(C2) A subgradient lower bound for the iterates gap. There exists a positive scalar ρ2 such that

‖w^{k+1}‖ ≤ ρ2 ‖x^{k+1} − x^k‖, for some w^{k+1} ∈ ∂F(x^{k+1}), ∀ k ∈ N.

(C3) Let x be a limit point of any subsequence {x^k}k∈K; then lim sup_{k∈K⊂N} F(x^k) ≤ F(x).

These three conditions are typical for any descent type algorithm, and basic to prove
subsequential convergence, see e.g., [1,29,66].
To establish global convergence of the whole sequence, we need an additional
assumption on the class of functions F: it must satisfy the so-called nonsmooth
Kurdyka–Łojasiewicz (KL) property [27] (see [44,46] for smooth cases). We refer
the reader to [28] for an in depth study of the class of KL functions, as well as refer-
ences therein.
Verifying the KL property of a given function might often be a difficult task. How-
ever, thanks to a fundamental result established by Bolte et al. [27], it holds for the
broad class of semi-algebraic functions, which abound in applications, see [29,66],
and references therein.


The last ingredient needed to activate this generic framework is a key uniformization of the KL property proven in [29, Lemma 6, p. 478]. Then, we can prove the following general theorem (see [29] for details leading to this result).

Theorem 5.2 (Global Convergence) Let {x^k}k∈N be a bounded gradient-like descent sequence for F. If F satisfies the KL property, then the sequence {x^k}k∈N has finite length, i.e., Σ_{k=1}^∞ ‖x^{k+1} − x^k‖ < ∞, and it converges to a critical point x∗ of F.

Equipped with this generic result, we can establish the global convergence result of BPG to a critical point of Φ. Thanks to Fermat's rule [57, Theorem 10.1, p. 422], the set of critical points of Φ here is given by

crit Φ = {x ∈ R^d : 0 ∈ ∂Φ(x) ≡ ∂ f(x) + ∇g(x)}.

In brief, let us mention that the sufficient descent property established in Theorem 5.1
(cf. (5.1)) allows to prove conditions C1 and C3, while the following mild additional
assumption on the pair (g, h) is needed to establish condition C2.
Assumption NC[+] ∇h and ∇g are Lipschitz continuous on any bounded subset of
Rd .
We then have the following simplified version of the more general convergence
result proved in [24].

Theorem 5.3 (Convergence of BPG) Suppose that assumptions NC and NC[+] hold. Let {x^k}k∈N be a sequence generated by BPG which is assumed to be bounded, and let 0 < λL < 1. The following assertions hold:
(i) Subsequential convergence. Any limit point of the sequence {x^k}k∈N is a critical point of Φ.
(ii) Global convergence. Suppose that Φ satisfies the KL property on dom Φ, which is in particular true when f and g are semialgebraic. Then, the sequence {x^k}k∈N has finite length, and converges to a critical point x∗ of Φ.

Convergence rate results of the generated sequence described in Theorem 5.3 can
also be derived by applying the generic rate of convergence result of Attouch and
Bolte [1].

6 Concluding remarks

We have provided a concise synthesis outlining the key theoretical elements playing
a central role in the convergence analysis of Non Euclidean Bregman based prox-
imal methods, and their most fundamental first order algorithm relatives, stressing
clarifications and simplifications through elementary proof-patterns. The volume of
literature on first order methods has exploded over the last decade, and we refer the
reader to some of the recent pointers given in the bibliography for further elaborations,
extensions and applications.
Below, we briefly further discuss the benefits and limitations of the non-Euclidean
proximal framework, and end with an open challenging question.


Much like any proximal minimization scheme, the underlying framework depends on how efficiently we can compute the iteration formula of BPG (cf. (4.1)):

x^+ = Tλ(x) := argmin{ f(u) + ⟨∇g(x), u − x⟩ + (1/λ) Dh(u, x) : u ∈ C }.

Although not visible at first sight, it is interesting to observe that Tλ (·) shares the
same structural splitting principle as the classical Euclidean proximal-gradient map,
which can be useful for computational purpose allowing to decompose it through
two specific operators. Indeed, as shown in [14, Section 3.1], writing the optimality
condition for x + , formal computations show that by defining the following operators:
• A Bregman gradient step

pλ(x) := argmin{ ⟨∇g(x), u⟩ + (1/λ) Dh(u, x) : u ∈ E }, x ∈ int dom h,   (6.1)

• A Bregman proximal map

prox^h_{λ f}(y) := argmin{ λ f(u) + Dh(u, y) : u ∈ C }, y ∈ int dom h,   (6.2)

one can rewrite the map Tλ (·) simply as the composition of a Bregman proximal step
with a Bregman gradient step:

x^+ = prox^h_{λ f}(pλ(x)).   (6.3)

Thus, computing Tλ depends on whether the above two operators can output closed-
form solutions or can be numerically computed in an efficient way. In the classical
Euclidean proximal gradient method, i.e., when h is the energy function, the first
computation above reduces to a standard gradient step, pλ (x) = x − λ∇g(x), and
the second one is the usual Moreau proximal map of λ f . In the general case, for any
x ∈ int dom h, define v(x) := ∇h(x) − λ∇g(x). Then, the Bregman gradient step
(6.1) reduces to

pλ (x) = ∇h ∗ (v(x)), with v(x) ∈ dom ∇h ∗ .

Therefore, once we know the conjugate function h∗ of h, the computation of pλ is straightforward. This is the case for the five examples listed in Example 2.1, as well as for the n-dimensional “Hellinger-like function” defined by h(x) = −√(1 − ‖x‖^2) with dom h = {x ∈ R^n : ‖x‖ ≤ 1}, which is relevant for ball constraints. In the latter case, one obtains h∗(y) = √(1 + ‖y‖^2), with dom h∗ = R^n, and hence pλ(x) = (1 + ‖v(x)‖^2)^{-1/2} v(x). For further details and interesting examples evaluating pλ for nonseparable Bregman distances, and for handling conic constraints with Bregman distances, see [14, Section 3.1] and [8, Examples B, C, p. 718].
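The Hellinger-like computation can be verified directly: the optimality condition of (6.1) is ∇h(pλ(x)) = v(x), and pλ(x) = (1 + ‖v(x)‖²)^{-1/2} v(x) satisfies it. In the sketch below, the smooth g is an arbitrary quadratic of our own choosing, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def grad_h(x):
    # h(x) = -sqrt(1 - |x|^2) on the unit ball: grad h(x) = x / sqrt(1 - |x|^2)
    return x / np.sqrt(1.0 - x @ x)

def p_lambda(x, grad_g, lam):
    v = grad_h(x) - lam * grad_g(x)      # v(x) = grad h(x) - lam * grad g(x)
    return v / np.sqrt(1.0 + v @ v)      # = grad h*(v(x)), the Bregman gradient step

# Illustrative smooth g (a convex quadratic); any differentiable g works here.
A = rng.standard_normal((5, 5))
grad_g = lambda x: A.T @ (A @ x)

x = 0.3 * rng.standard_normal(5)
x /= max(1.0, 2 * np.linalg.norm(x))     # keep x strictly inside the unit ball
p = p_lambda(x, grad_g, lam=0.1)

assert np.linalg.norm(p) < 1.0                              # the step stays in int dom h
assert np.allclose(grad_h(p), grad_h(x) - 0.1 * grad_g(x))  # optimality: grad h(p) = v(x)
```

Note that the interior of the ball is preserved automatically, with no projection step: a typical benefit of a kernel h adapted to the constraint geometry.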
We now discuss the computation of the second operator, namely the Bregman
proximal map given in (6.2). In the classical case, the Moreau proximal map of f is in


general computable in closed form when f is “norm-like”, or when f is the indicator of a set whose geometry is favorable to Euclidean projections, which we refer to here as prox-friendly; see e.g., [36, Section 2.6] and [15, pp. 156 and 177] for many such examples. However, beyond such prox-friendly functions/sets, in general computing
the classical proximal map can be very difficult. Concerning (6.2) the situation is
similar. Sets and functions which will be prox-friendly with respect to a Bregman
proximal term Dh , will share a mathematical structure similar to h, which is chosen
to adapt to the geometry of the given function/set. Moreover, to activate BPG, the
choice for h is also intimately related to the step-size expressed in terms of L, which is
determined by the condition (LC). Such computations are illustrated in details for the
convex setting in [14, Section 5], leading to explicit simple algorithms to tackle the
important class of Poisson linear inverse problems which is prevalent in the statistical
and image sciences application areas. In the nonconvex setting, computations of Tλ
and L, can be found in the very recent work [24, Section 5] for the broad class of
quadratic inverse problems described by an unconstrained problem with an l1 -norm
regularizer and a nonconvex sparsity constrained problem, resulting in two simple
closed-form Bregman proximal gradient algorithms. It would be interesting to expand
proximal calculus with Bregman distances that can be of benefit to further relevant
applications.
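As a hedged illustration of such a closed-form scheme, the following sketch implements the multiplicative Bregman proximal gradient update for a Poisson linear inverse problem with the Burg entropy h(x) = −Σⱼ log xⱼ, along the lines of [14, Section 5]; the synthetic data and the step-size λ = 1/‖b‖₁ are assumptions patterned after the bound derived there.

```python
import numpy as np

def kl_grad(x, A, b):
    # Gradient of g(x) = sum_i ( <a_i, x> - b_i * log <a_i, x> ),
    # the negative Poisson log-likelihood of a linear inverse problem.
    Ax = A @ x
    return A.T @ (1.0 - b / Ax)

def bpg_poisson(A, b, x0, n_iter=200):
    # Bregman proximal gradient with the Burg entropy h(x) = -sum_j log x_j.
    # Solving grad h(x+) = grad h(x) - lam * grad g(x) componentwise yields the
    # closed-form multiplicative update  x+_j = x_j / (1 + lam * x_j * grad_j),
    # with lam = 1/L and L = ||b||_1 (the (LC) bound of [14, Section 5]).
    lam = 1.0 / np.sum(b)
    x = x0.copy()
    for _ in range(n_iter):
        x = x / (1.0 + lam * x * kl_grad(x, A, b))
    return x

# Synthetic, noiseless instance (for illustration only).
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(20, 5))
b = A @ rng.uniform(0.5, 2.0, size=5)
x = bpg_poisson(A, b, np.ones(5))
assert np.all(x > 0)  # iterates stay in the open positive orthant
```

Each iteration costs one matrix-vector pass plus a componentwise update, and the iterates remain strictly positive by construction, without any projection.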
This survey would be incomplete without discussing the perspectives for fast non-
Euclidean proximal-based schemes, in the spirit of the classical fast gradient method
of Nesterov [51] and its extension to the composite convex model, FISTA of Beck and
Teboulle [17]. Neither, however, seems to be extendible in its basic formulation to
the Bregman distance setting. A remedy to this situation was addressed by Auslender
and Teboulle [8, Section 5], who proposed a Bregman gradient scheme called IGA
(Improved Interior Gradient Algorithm); we refer to [8, Section 5] for its precise formulation.
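As a rough sketch of the flavor of such three-point schemes (not the exact IGA parameter choices, for which we refer to [8]), the following implements an accelerated Bregman gradient iteration with the coupling aₖ = 2/(k+2) in the style of Tseng [65]; in the Euclidean case h = ½‖·‖² it reduces to a standard fast gradient method, which the sanity check below exploits.

```python
import numpy as np

def accel_bregman_gradient(grad_g, L, h_grad, h_conj_grad, x0, n_iter=50):
    # Three-point scheme in the spirit of IGA [8] / Tseng [65]:
    #   y_k    = (1 - a_k) x_k + a_k z_k
    #   z_{k+1} solves  grad h(z_{k+1}) = grad h(z_k) - (1/(a_k L)) grad g(y_k)
    #   x_{k+1} = (1 - a_k) x_k + a_k z_{k+1}
    # with the classical coupling a_k = 2/(k+2) (an assumed, illustrative choice).
    x = x0.copy()
    z = x0.copy()
    for k in range(n_iter):
        a = 2.0 / (k + 2)
        y = (1 - a) * x + a * z
        z = h_conj_grad(h_grad(z) - grad_g(y) / (a * L))
        x = (1 - a) * x + a * z
    return x

# Euclidean sanity check: h = 0.5*||.||^2, so grad h = grad h* = identity,
# and the scheme is a fast gradient method for g(x) = 0.5*||x - c||^2 with L = 1.
c = np.array([1.0, -2.0, 3.0])
ident = lambda u: u
x = accel_bregman_gradient(lambda u: u - c, 1.0, ident, ident, np.zeros(3))
assert np.linalg.norm(x - c) < 1e-6
```

The Bregman machinery enters only through the accessors for ∇h and ∇h*, so the same skeleton accommodates non-Euclidean kernels such as the entropies discussed earlier.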
This algorithm was proven to achieve the faster rate O(1/n²) [8, Theorem 5.2].
Later on, it was extended by Tseng [65] to handle the additive convex composite
model (CM) as well, where the Bregman gradient z-step is replaced by the Bregman
proximal gradient iteration z_{k+1} = Tλ_k (y_k) with λ_k = t_k L^{−1}. However, for both
algorithms, these results were derived only under the assumptions that g has an L-
globally Lipschitz gradient and h is strongly convex, two assumptions that we have
specifically avoided throughout this paper. As pointed out in [14], one central and
interesting topic for further research is to devise fast FOM capable of lifting both
restrictions. In a recent study [60], numerical experiments based on the AT scheme
(and some other variants) applied to problems satisfying the (LC) condition (i.e.,
lacking a globally Lipschitz gradient) and with h not strongly convex, exhibit the very
same faster rate. The theoretical justification, however, is still lacking. We thus end this
paper with the following, which we believe remains a challenging open question:
Under the framework of this paper, can we produce first order algorithms with
a proven faster convergence rate?

Acknowledgements This synthesis has emerged from my long standing cooperation with Alfred Auslender
on this topic. I am deeply indebted to him, not only for generously sharing with me over the past two decades
his profound mathematical knowledge, enthusiasm, and vision in optimization, but also for his friendship
and the resulting pleasure, fun and adventures beyond our mathematical activities. The material covered
here also benefited greatly, and comes from past and recent works written with Amir Beck, Jérôme Bolte
and Shoham Sabach, with whom I have had the pleasure of working over many years. My deepest thanks
to all of them for this fruitful past and continuing collaboration, and to Heinz Bauschke, for the pleasure of
collaborating with him on this topic over the past 3 years. I am grateful to the editors and the referees of the
paper, whose constructive comments and suggestions were very useful in improving the presentation of the paper.

References
1. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving
analytic features. Math. Program. 116, 5–16 (2009)
2. Attouch, H., Teboulle, M.: A regularized Lotka–Volterra dynamical system as a continuous proximal-
like method in optimization. J. Optim. Theory Appl. 121, 541–570 (2004)
3. Attouch, H., Bolte, J., Redont, P.: Optimizing properties of an inertial dynamical system with geometric
damping: link with proximal methods. Control Cybern. 31, 643–657 (2002)
4. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame
problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods.
Math. Program. 137, 91–129 (2013)
5. Auslender, A., Teboulle, M.: Asymptotic Cones and Functions in Optimization and Variational Inequal-
ities. Springer, New York (2003)
6. Auslender, A., Teboulle, M.: Interior gradient and Epsilon-subgradient methods for constrained convex
minimization. Math. Oper. Res. 29, 1–26 (2004)
7. Auslender, A., Teboulle, M.: Interior projection-like methods for monotone variational inequalities.
Math. Program. 104, 39–68 (2005)
8. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization.
SIAM J. Optim. 16, 697–725 (2006)
9. Auslender, A., Teboulle, M.: Projected subgradient methods with non-Euclidean distances for non-
differentiable convex minimization and variational inequalities. Math. Program. Ser. B 120, 27–48
(2009)
10. Auslender, A., Teboulle, M., Ben-Tiba, S.: Interior proximal and multiplier methods based on second
order homogeneous kernels. Math. Oper. Res. 24, 645–668 (1999)
11. Bartlett, P.L., Hazan, E., Rakhlin, A.: Adaptive online gradient descent. In: Advances in Neural Information Processing Systems, vol. 20 (2007)
12. Bauschke, H.H., Borwein, J.M.: Legendre functions and the method of Bregman projections. J. Convex
Anal. 4(1), 27–67 (1997)
13. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces.
Springer, New York (2011)
14. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first
order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2016)
15. Beck, A.: First Order Methods in Optimization. SIAM, Philadelphia (2017)
16. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)
17. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM J. Imaging Sci. 2(1), 183–202 (2009)

123
A simplified view of first order methods for optimization

18. Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery problems. In:
Palomar, D., Eldar, Y.C. (eds.) Convex Optimization in Signal Processing and Communications, pp.
139–162. Cambridge University Press, Cambridge (2009)
19. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22,
557–580 (2012)
20. Ben-Tal, A., Margalit, T., Nemirovsky, A.: The ordered subsets mirror descent optimization method
with applications to tomography. SIAM J. Optim. 12, 79–108 (2001)
21. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, Cambridge (1982)
22. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
23. Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)
24. Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First order methods beyond convexity and Lipschitz
gradient continuity with applications to quadratic inverse problems. SIAM J. Optim. (2017) (accepted)
25. Bolte, J., Sabach, S., Teboulle, M.: Nonconvex Lagrangian-based optimization: monitoring schemes
and global convergence. Math. Oper. Res. (2018). https://doi.org/10.1287/moor.2017.0900
26. Bolte, J., Teboulle, M.: Barrier operators and associated gradient like dynamical systems for constrained
minimization problems. SIAM J. Control Optim. 42, 1266–1292 (2003)
27. Bolte, J., Daniilidis, A., Lewis, A.S.: The Łojasiewicz inequality for nonsmooth subanalytic functions
with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
28. Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient
flows, talweg, convexity. Trans. Am. Math. Soc. 362, 3319–3363 (2010)
29. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and
nonsmooth problems. Math. Program. 146(1), 459–494 (2014)
30. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning
via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
31. Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application
to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200–217
(1967)
32. Bruck, R.: On the weak convergence of an ergodic iteration for the solution of variational inequalities
for monotone operators in Hilbert space. J. Math. Anal. Appl. 61, 159–164 (1977)
33. Burachik, R.S., Iusem, A.N.: A generalized proximal point algorithm for the variational inequality
problem in a Hilbert space. SIAM J. Optim. 8, 197–216 (1998)
34. Censor, Y., Zenios, S.A.: Proximal minimization algorithm with D-functions. J. Optim. Theory Appl.
73, 451–464 (1992)
35. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3, 538–543 (1993)
36. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward–backward splitting. SIAM Multiscale Model. Simul. 4, 1168–1200 (2005)
37. Csiszár, I.: Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299–318 (1967)
38. Drusvyatskiy, D., Lewis A.S.: Error bounds, quadratic growth, and linear convergence of proximal
methods. Math. Oper. Res. (2018). https://doi.org/10.1287/moor.2017.0889
39. Duchi, J.C., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of 23rd Annual Conference on Learning Theory, pp. 14–26 (2010)
40. Eckstein, J.: Nonlinear proximal point algorithms using Bregman functions, with applications to convex
programming. Math. Oper. Res. 18, 202–226 (1993)
41. Flammarion, N., Bach, F.: Stochastic composite least-squares regression with convergence rate O(1/n).
Proc. Mach. Learn. Res. 65, 1–44 (2017)
42. Fukushima, M., Mine, H.: A generalized proximal point algorithm for certain nonconvex minimization
problems. Int. J. Syst. Sci. 12, 989–1000 (1981)
43. Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM J.
Control Optim. 29(2), 403–419 (1991)
44. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48(3),
769–783 (1998)
45. Lewis, A.S., Wright, S.J.: A proximal method for composite minimization. Math. Program. Ser. A 158,
501–546 (2016)

123
M. Teboulle

46. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. In: Les Équations
aux Dérivées Partielles, pp. 87–89. Éditions du Centre National de la Recherche Scientifique, Paris
(1963)
47. Martinet, B.: Régularisation d’inéquations variationnelles par approximations successives. Rev.
Française Informatique et Recherche Opérationnelle 4, 154–158 (1970)
48. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. Fr. 93(2), 273–299
(1965)
49. Nemirovsky, A.S.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15, 229–251 (2004)
50. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley,
New York (1983)
51. Nesterov, Y.: A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
52. Nguyen, Q.V.: Forward–backward splitting with Bregman distances. Vietnam J. Math. 45, 519–539
(2017)
53. Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive mappings.
Bull. AMS 73, 591–597 (1967)
54. Palomar, D.P., Eldar, Y.C.: Convex Optimization in Signal Processing and Communications. Cambridge
University Press, Cambridge (2010)
55. Passty, G.B.: Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. J.
Math. Anal. Appl. 72, 383–390 (1979)
56. Polyak, R., Teboulle, M.: Nonlinear rescaling and proximal-like methods in convex optimization. Math.
Program. 76, 265–284 (1997)
57. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)
58. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
59. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim.
14(5), 877–898 (1976)
60. Sabach, S., Teboulle, M., Vaisbourd, Y.: Fast non-Euclidean first order algorithms: a numerical study.
Working paper (April 2017)
61. Shefi, R., Teboulle, M.: Rate of convergence analysis of decomposition methods based on the proximal
method of multipliers for convex minimization. SIAM J. Optim. 24, 269–297 (2014)
62. Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. The MIT Press, Cambridge
(2011)
63. Teboulle, M.: Entropic proximal mappings with application to nonlinear programming. Math. Oper.
65. Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. Ser. B 125, 263–295 (2010)
66. Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with
applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789
(2013)