A Simplified View of First Order Methods For Optimization
https://doi.org/10.1007/s10107-018-1284-2
Marc Teboulle
This research was partially supported by the Israel Science Foundation, under ISF Grant 1844-16, and the German Israel Foundation, under GIF Grant 1253304.6-2014.
Abstract We discuss the foundational role of the proximal framework in the develop-
ment and analysis of some iconic first order optimization algorithms, with a focus on
non-Euclidean proximal distances of Bregman type, which are central to the analysis
of many other fundamental first order minimization relatives. We stress simplification
and unification by highlighting self-contained elementary proof-patterns to obtain con-
vergence rate and global convergence both in the convex and the nonconvex settings,
which in turn also allows us to present some novel results.
1 Introduction
The notion of proximal map of a convex function, invented about half a century ago
by Moreau in his seminal work [48], is alive and kicking! This fundamental reg-
ularization process gave birth 5 years later to the so-called proximal minimization
algorithm by Martinet [47], followed by its extension in Rockafellar [59] for solving
maximal monotone inclusions.
Notation We use standard notation and concepts from convex and variational analysis,
which unless otherwise specified, can all be found in the classical monographs [57,58].
Throughout, E stands for a finite dimensional vector space with inner product ⟨·, ·⟩,
norm ‖·‖, and its usual dual norm, denoted by ‖·‖∗, residing in the dual space E∗.
We start with the seminal work of Moreau [48], which provides the foundation and
the central ideas underlying the proximal methodology.
Given a proper, lsc and convex function ϕ : E → (−∞, ∞] and a parameter λ > 0,
consider, for each point x, the quadratically regularized problem below and the
resulting optimal value; that is, the Moreau envelope of ϕ is the function

x ↦ ϕ_λ(x) := min { ϕ(u) + (1/(2λ))‖u − x‖² : u ∈ E }.
This regularization process for ϕ produces a very neat function ϕ_λ which is finite
everywhere, convex, and with λ⁻¹-Lipschitz continuous gradient on E given by

∇ϕ_λ(x) = λ⁻¹(x − prox_{λϕ}(x)),

where prox_{λϕ}(x) denotes the unique minimizer defining ϕ_λ(x), the so-called proximal
map of ϕ. The proximal minimization algorithm then simply iterates this map:

x^0 ∈ E,  x^{k+1} = prox_{λϕ}(x^k),  k = 0, 1, . . . .

Alternatively, this is equivalent to the basic gradient iteration with step size λ applied
on the smooth function ϕ_λ.
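To make these objects concrete, consider the following small numerical illustration (ours, not part of the original development): for ϕ(u) = |u| on E = R, the proximal map is the classical soft-thresholding operator, and the gradient formula for ϕ_λ can be checked against finite differences.

```python
import numpy as np

# Illustrative sketch (ours): phi(u) = |u| on E = R.
# prox_{lam*phi}(x) = argmin_u { |u| + (1/(2*lam))*(u - x)^2 } is soft-thresholding.
def prox_abs(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env(x, lam):
    u = prox_abs(x, lam)
    return np.abs(u) + (u - x) ** 2 / (2 * lam)

lam = 0.5
for x in [-2.0, -0.3, 0.2, 1.7]:
    grad = (x - prox_abs(x, lam)) / lam      # the (1/lam)-Lipschitz gradient of phi_lam
    eps = 1e-6                               # compare with a central finite difference
    fd = (moreau_env(x + eps, lam) - moreau_env(x - eps, lam)) / (2 * eps)
    print(f"x={x:+.2f}  grad={grad:+.6f}  finite-diff={fd:+.6f}")
```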
Despite the obvious difficulties that can often emerge from the practical com-
putational side of a proximal map (which by itself requires to solve a nonsmooth
optimization problem), the proximal framework is fundamental, and nowadays has
regained a starring position in the development and analysis of modern optimiza-
tion algorithms based on first order information, both in their primal and dual forms.
In fact, the proximal methodology is not only underlying the essence of the central
first order schemes, e.g., gradient, proximal gradient, and mirror descent which can
be recovered through a specific choice of ϕ, but also at the root of many other fun-
damental first order minimization methods. For instance, these include incremental
and stochastic versions, coordinate descent and smoothing methods, as well as funda-
mental primal-dual methods within the augmented Lagrangian framework and its many relatives,
e.g., alternating direction method of multipliers, and related decomposition schemes.
Because of space limitations, we obviously barely scratch the surface, and will not
discuss the many theoretical, computational and application aspects of past and recently
achieved results in this area. For a taste of these, and for additional modern coverage of
these topics, we refer the reader to some recent works, e.g., [13,15,18,19,23,30,61],
which also provide ample relevant references.
Mimicking Moreau, one can consider replacing the squared Euclidean distance with
a non-Euclidean proximity measure, also called distance-like function. Formally, a
general form of the algorithm for minimizing a function ψ : E → (−∞, ∞] then
reads as

x^{k+1} ∈ argmin { ψ(u) + (1/λ)D(u, x^k) : u ∈ E },
where D(·, ·) stands for a distance-like function which replaces the squared Euclidean
distance. There exist various ways to define adequate proximity measures, see for
instance the general framework of Auslender and Teboulle [8] and references therein,
which analyzes first order methods based on a variety of distance-like functions, includ-
ing the so-called Csiszár φ-divergence [37] and the Bregman distance [31].
The rationale underlying the relevance and usefulness of considering more
general proximal measures stems from algorithmic contexts, i.e., toward improv-
ing and extending convergence properties of classical algorithms, as well as from
specific application areas. Let us briefly describe some contexts where the use of
non-Euclidean proximal schemes has been particularly useful, including for
example:
(a) The possibility to better adapt to the geometry of a given problem, which allows
one to derive simpler proximal computational steps, e.g., by naturally eliminating the
constraints through a suitable choice of D which best matches the problem's data
and structure, see e.g., [7,9,14]. Moreover, this allows for deriving a complexity
constant in terms of the distance D(x*, x^0) between the starting point x^0 and
an optimal solution x*, which can be significantly smaller than the usual term
‖x* − x^0‖², and is often nearly optimal for very large scale problems, see e.g.,
[16,20,50].
(b) Augmented Lagrangian Methods. It is well known that the classical augmented
Lagrangian method for the standard inequality constrained convex programming
problem is nothing else but the proximal minimization algorithm applied on its dual.
In addition to the properties (b) and (c) given in Lemma 2.1, a Legendre function
also enjoys the following useful properties [58, Thm. 26.5]:
• h is Legendre if and only if its conjugate h ∗ is Legendre.
• The gradient of a Legendre function h is a bijection from int dom h to int dom h ∗ ,
and its inverse is the gradient of the conjugate, that is, we have (∇h)−1 = ∇h ∗ .
In this work we adopt the following definition of a Bregman distance, whose origin
goes back to the work of Bregman [31]: given a Legendre function h, the Bregman
distance D_h : dom h × int dom h → [0, ∞) is defined by

D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩.
It naturally measures the proximity between the two points (x, y). Indeed, thanks
to the gradient inequality we have D_h(x, y) ≥ 0 for every x ∈ dom h and y ∈ int dom h.
In addition, thanks to the strict convexity of h, we have that D_h(x, y) = 0 if and only
if x = y, which implies a basic distance-like property. However, note that D_h is not
symmetric in general, unless h is the energy function, that is, h(·) = (1/2)‖·‖², in
which case D_h recovers the classical squared Euclidean distance.
We list below some of the most popular and useful choices for h to generate relevant
associated Bregman distances D_h, which are well documented in the literature; see,
e.g., [8,12,40,63], where more examples are given, including nonseparable Bregman
distances, as well as Bregman distances on the space of symmetric matrices.
• (Shannon Entropy) h(x) = x log x with dom h = [0, ∞) (with the convention
0 log 0 = 0), and h*(y) = e^{y−1}.
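The following small computational sketch (ours; the generators and test points are chosen for illustration only) evaluates the Bregman distances induced by the energy, Shannon and Burg generators, and exhibits the lack of symmetry outside the Euclidean case.

```python
import numpy as np

# Illustrative sketch (ours): D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>
def bregman(h, grad_h, x, y):
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

energy  = (lambda x: 0.5 * np.dot(x, x),    lambda x: x)
shannon = (lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0)
burg    = (lambda x: -np.sum(np.log(x)),    lambda x: -1.0 / x)

rng = np.random.default_rng(0)
x, y = rng.uniform(0.1, 2.0, 5), rng.uniform(0.1, 2.0, 5)
for name, (h, g) in [("energy", energy), ("Shannon", shannon), ("Burg", burg)]:
    # D_h >= 0, vanishing iff x = y; symmetric only for the energy generator
    print(f"{name:8s}  D(x,y)={bregman(h, g, x, y):.4f}  D(y,x)={bregman(h, g, y, x):.4f}")
```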
Lemma 2.2 (The Three Points Identity) [35, Lemma 3.1] Suppose h : E → (−∞, ∞]
is Legendre. Then, for any three points x, y ∈ int dom h and z ∈ dom h, the following
three points identity holds true:

D_h(z, y) + D_h(y, x) − D_h(z, x) = ⟨∇h(x) − ∇h(y), z − y⟩.

Note that the Legendre property of h plays no role in the derivation of this identity.
Indeed, for any differentiable function h, the identity follows by simple algebra relying
on the structural definition of D_h. With h(u) := ‖u‖²/2 and E = R^n, we recover, as
expected, the fundamental Pythagoras identity

‖z − y‖² + ‖y − x‖² − ‖z − x‖² = 2⟨x − y, z − y⟩.
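Since the identity is a matter of pure algebra, it can be checked mechanically; the snippet below (ours) does so with the Shannon generator at random positive points.

```python
import numpy as np

# Illustrative check (ours) of the three points identity for h(x) = sum x_i*log(x_i)
h  = lambda x: np.sum(x * np.log(x))
gh = lambda x: np.log(x) + 1.0
D  = lambda a, b: h(a) - h(b) - np.dot(gh(b), a - b)

rng = np.random.default_rng(1)
x, y, z = (rng.uniform(0.1, 3.0, 4) for _ in range(3))
lhs = D(z, y) + D(y, x) - D(z, x)
rhs = np.dot(gh(x) - gh(y), z - y)
print(abs(lhs - rhs))   # ~1e-15: the identity holds up to rounding
```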
Throughout, ϕ* := inf{ϕ(u) : u ∈ C} denotes the optimal value.
Definition 2.3 (Bregman proximal map) Suppose that Assumption A holds. For any
x ∈ int dom h and λ > 0, the Bregman proximal map is defined by

prox^h_{λϕ}(x) := argmin { ϕ(u) + (1/λ)D_h(u, x) : u ∈ C }.
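To illustrate the definition (the example below is ours): with the Shannon generator and a linear ϕ(u) = ⟨c, u⟩ on C = R^n_+, the Bregman proximal map admits the componentwise closed form x e^{−λc}, which the following sketch confirms against a generic numerical solver.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative sketch (ours): Bregman proximal map of phi(u) = <c, u> with the
# Shannon generator h on C = R^n_+; closed form: prox^h_{lam*phi}(x) = x*exp(-lam*c).
x = np.array([0.5, 1.0, 2.0]); c = np.array([1.0, -0.5, 0.2]); lam = 0.7

def objective(u):
    D = np.sum(u * np.log(u / x) - u + x)    # D_h(u, x) for the Shannon generator
    return c @ u + D / lam

res = minimize(objective, x0=np.ones(3), bounds=[(1e-9, None)] * 3)
print(res.x)                  # numerical minimizer
print(x * np.exp(-lam * c))   # closed form; both agree to solver tolerance
```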
We then have the following properties (cf. [58, Corollary 8.7.1]). Let ψ : E →
(−∞, ∞] be a proper, lsc and convex function.
• If the level set Lev(ψ, α) is nonempty and bounded for some α ∈ R, it is bounded
for every α ∈ R.
• If ψ is level bounded, then argmin ψ is nonempty and compact, and inf ψ is finite.
The proof of Lemma 2.3 is patterned after similar approaches used, e.g., in [8,40,
64].
Lemma 2.3 Suppose that Assumption A holds. For any x ∈ int dom h and λ > 0, let

x⁺ := prox^h_{λϕ}(x) = argmin { ϕ(u) + (1/λ)D_h(u, x) : u ∈ C }.

Then: (a) x⁺ exists and is unique; (b) x⁺ ∈ int dom h, and it is characterized by the
inclusion 0 ∈ ∂ϕ(x⁺) + λ⁻¹(∇h(x⁺) − ∇h(x)).
Proof (a) Under the premises of the lemma, a standard argument shows that

Φ_λ(u) := ϕ(u) + (1/λ)D_h(u, x)

is proper, lsc and strictly convex (since h is strictly convex). Thus, if a minimizer
of inf_u Φ_λ(u) exists, it must be unique. To show existence of a minimizer, we will
prove that Φ_λ is level bounded. For convenience, let u ↦ d(u) := D_h(u, x). Then
d is proper, lsc and strictly convex, with d(u) ≥ 0, and equal to zero if and only
if x = u. Therefore, Lev(d, 0) = {x}, and hence (cf. the properties recalled above)
Lev(d, α) is bounded for any α ≥ 0. Now, let m := inf_u Φ_λ(u). Since d(u) ≥ 0, then
m ≥ ϕ* > −∞, and the following inclusion holds:

Lev(Φ_λ, m) ⊆ Lev(d, λ(m − ϕ*)).

Since we have just seen that Lev(d, α) is bounded for any α ≥ 0, then Lev(d, λ(m −
ϕ*)) is bounded, and hence by the above inclusion, so is Lev(Φ_λ, m), and the claimed
existence result is proved.
For (b), under the premises of the lemma, we can apply standard subdifferential
calculus rules [58], which, thanks to the additional fact that h is Legendre,
characterize the well-defined unique optimal x⁺ ∈ int dom h via the inclusion 0 ∈
∂ϕ(x⁺) + λ⁻¹(∇h(x⁺) − ∇h(x)), as desired.
Throughout the rest of this paper, we take as our blanket assumption the validity
of Lemma 2.3, that is, the proximal map x⁺ = prox^h_{λϕ}(x) is uniquely defined and
satisfies x⁺ ∈ int dom h.
The analysis of basic proximal schemes and their variants essentially relies on the
following two main pillars:
• One well-known and quite old: the Bregman proximal inequality derived by Chen
and Teboulle [35], which provides a fundamental estimate in the objective function
value gap for Bregman based proximal schemes.
• One very recent: a Lipschitz-like Convexity Condition, recently introduced by
Bauschke et al. [14], which naturally captures, through a single convexity condition,
the nonlinear geometry of the problem at hand.
The first pillar of the analysis is the following result, which naturally extends a similar
fundamental estimate for the squared Euclidean proximal map first derived by
Güler [43].
Lemma 3.1 (Bregman proximal inequality) [35, Lemma 3.2] Let ϕ : E → (−∞, ∞]
be a proper, lsc and convex function. Given λ > 0 and x ∈ int dom h, define

x⁺ := prox^h_{λϕ}(x) = argmin_u { ϕ(u) + (1/λ)D_h(u, x) }.

Then, for all u ∈ dom h,

λ(ϕ(x⁺) − ϕ(u)) ≤ D_h(u, x) − D_h(u, x⁺) − D_h(x⁺, x).
Proof Let x ∈ int dom h and λ > 0. Lemma 2.3 warrants the existence of a unique
x⁺ ∈ int dom h such that

λ⁻¹(∇h(x) − ∇h(x⁺)) ∈ ∂ϕ(x⁺).

Therefore, using the subgradient inequality for the convex function ϕ, it follows that

λ(ϕ(x⁺) − ϕ(u)) ≤ ⟨∇h(x) − ∇h(x⁺), x⁺ − u⟩, for all u ∈ dom h.

Invoking the three points identity of Lemma 2.2 with z := u and y := x⁺ gives the
desired result.
The second main pillar of the analysis relies on the recent work of Bauschke et al. [14]
which introduces a simple framework capturing the non-Euclidean geometry of a
given problem through one Lipschitz-like/Convexity condition. It elegantly translates
into a general Descent Lemma given precisely in terms of Bregman distances, and in
particular allows one to handle problems lacking a globally Lipschitz gradient, an
assumption otherwise used in almost all FOM.
We adopt this framework, with a slight extension which allows us to consider non-
convex functions g, as recently done in Bolte et al. [24], stressing its usefulness and
flexibility in various contexts.
A Lipschitz-like/Convexity Condition. [14] Let h : E → (−∞, ∞] be a Legendre
function and let g : E → (−∞, ∞] be a proper and lsc function with dom g ⊃ dom h,
and with g differentiable on int dom h. Given such a pair of functions (g, h), the Lipschitz-
like/Convexity Condition, denoted by (LC), is:

(LC)  there exists L > 0 such that Lh − g is convex on int dom h.
Note that the above definition makes sense for any convex function h which is differ-
entiable on an open subset of dom h; clearly, this condition does not need the Legendre
property of h. Only the convexity of Lh − g plays a central role. Moreover, the
convexity condition (LC) translates nicely in terms of Bregman distances to produce
a new Descent Lemma (called NoLips in [14]).
Lemma 3.2 (NoLips Descent Lemma) [14, Lemma 1] Consider the pair of functions
(g, h) as above. Take L > 0. The following statements are equivalent:
(i) Lh − g is convex on int dom h, i.e., condition (LC) holds.
(ii) D_g(x, y) ≤ L D_h(x, y) ⟺ D_{Lh−g}(x, y) ≥ 0, for all x, y ∈ int dom h.

Proof Simply follows from the gradient inequality for the convex function Lh − g,
and the fact that 0 ≤ D_{Lh−g}(x, y) = L D_h(x, y) − D_g(x, y).
Note that if L′ ≥ L, property (LC) also holds with L′. When both functions g and h are
assumed to be C² on int dom h, the above statement is equivalent to

L∇²h(x) ⪰ ∇²g(x) for every x ∈ int dom h,

which can be useful to check condition (LC); see [14] for examples.
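For a concrete one-dimensional illustration (our toy example, in the spirit of [14,24]): the quartic g(x) = (x² − b)²/4 has derivative x³ − bx, which is not globally Lipschitz, yet the Hessian test certifies (LC) with h(x) = x⁴/4 + x²/2.

```python
import numpy as np

# Toy check (ours): L*h''(x) - g''(x) = L*(3x^2 + 1) - (3x^2 - b)
#                 = 3x^2*(L - 1) + L + b >= 0 on all of R for L = 1, b >= -1,
# so condition (LC) holds although g'(x) = x^3 - b*x is not globally Lipschitz.
b, L = 1.0, 1.0
xs = np.linspace(-10, 10, 10001)
gap = L * (3 * xs**2 + 1) - (3 * xs**2 - b)
print(bool(gap.min() >= 0))   # True: the Hessian test for (LC) passes on the grid
```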
Clearly, in the Euclidean setting, with h(x) = ½‖x‖₂², item (ii) above recovers the
classical and fundamental Descent Lemma [22], which, under the assumption that g is
convex, is equivalent to the standard smoothness property: ∇g is Lipschitz continuous
with constant L on E. Observe, however, that contrary to the usual Descent Lemma
for a differentiable function, the inequality in (ii) of Lemma 3.2 is one-sided. As
recently shown in [24], it can be complemented by asking both Lh − g and Lh + g
to be convex on int dom h, which immediately yields a full Descent Lemma that
reads compactly:

|D_g(x, y)| ≤ L D_h(x, y), ∀ x, y ∈ int dom h.
Lemma 3.3 (The Three Points Descent Lemma) Consider the pair of functions (g, h)
as above. Take L > 0. Then, the function Lh − g is convex on int dom h if and only if
for any (x, y, z) ∈ (int dom h)³:

g(x) ≤ g(y) + ⟨∇g(z), x − y⟩ + L D_h(x, z) − D_g(y, z).   (3.1)

Proof Thanks to the structural form of D_g, it is easy to see that for any (x, y, z) ∈
(int dom h)³, the claimed inequality can be equivalently written as

g(x) ≤ g(z) + ⟨∇g(z), x − z⟩ + L D_h(x, z),

and the latter inequality is equivalent to D_{Lh−g}(x, z) ≥ 0, namely to the convexity
of Lh − g.
When g is also assumed convex, we have that D_g(y, z) ≥ 0, and hence one can drop
the last term in inequality (3.1), thus recovering [14, Lemma 4].
We end this section by stressing again the relevance of the condition (LC). To
that end, first recall that h is σ-strongly convex (cf. [57, Section 12H]) if there exists
σ > 0 such that h − (σ/2)‖·‖² is convex, which for differentiable h amounts to

⟨∇h(x) − ∇h(y), x − y⟩ ≥ σ‖x − y‖², ∀ x, y ∈ int dom h.

Proposition 3.1 Suppose that ∇g is L_g-Lipschitz continuous on int dom h and that h is
σ-strongly convex. Then condition (LC) holds for the pair (g, h) with any L ≥ L_g/σ.

Proof The result follows immediately from the premises made on the pair (g, h).
Indeed, for all x, y ∈ int dom h we have

⟨∇(Lh − g)(x) − ∇(Lh − g)(y), x − y⟩ ≥ Lσ‖x − y‖² − L_g‖x − y‖² ≥ 0.

This proves that the gradient of (Lh − g) is monotone on int dom h, and hence the
convexity of Lh − g.
Thanks to the two main pillars described above in Lemmas 3.1 and 3.3, we now revisit
the analysis of some of the most fundamental FOM, stressing simplification in the
proofs, which in turn also allows us to present less well-known results, as well as to
derive new ones.
We start with the popular additive convex composite model which, despite its sim-
plicity, underscores the basic elements leading to most fundamental FOM, and also
covers a broad class of problems arising in a wide spectrum of applications.
The Problem and Assumptions. Consider the following additive convex composite
model:

(CM)  min { Φ(x) := f(x) + g(x) : x ∈ C },
where C ⊆ E is a closed convex set with a nonempty interior. The following assump-
tions are made throughout this section. Given a pair (g, h) satisfying condition (LC),
for any x ∈ int dom h and λ > 0, define the Bregman proximal gradient map

T_λ(x) := argmin { f(u) + ⟨∇g(x), u − x⟩ + (1/λ)D_h(u, x) : u ∈ C }.   (4.1)

This map emerges from the usual approach which consists of linearizing the differen-
tiable part g around x and regularizing it with a proximal distance from that point. Again,
when h is the energy function, this is nothing else but the classical proximal gradient
map, also known as the proximal forward–backward splitting map [32,36,55].
Throughout, we suppose that Assumption CM holds, and recall that we systemati-
cally assume that T_λ is well-defined and resides in int dom h (cf. Sect. 2 and [14]).
We consider solving problem (CM) with the Bregman Proximal Gradient Method,
that is, we generate a sequence {x^k}_{k∈N} via the following fixed point iteration on the
map T_λ(·):

(BPG)  x^0 ∈ int dom h;  x^{k+1} = T_λ(x^k),  k = 0, 1, . . . .
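As a computational illustration of the scheme (our sketch of the flagship example treated in [14]; the data and parameters below are ours): take f = 0, C the positive orthant, g(x) the Kullback–Leibler objective KL(b, Ax), and the Burg entropy h(x) = −Σ_j log x_j, for which (LC) holds with L = ‖b‖₁ even though ∇g is not globally Lipschitz. The BPG step then has an explicit componentwise form.

```python
import numpy as np

# Sketch (ours), following the flagship example of [14]: minimize
# g(x) = sum_i [ b_i*log(b_i/(Ax)_i) + (Ax)_i - b_i ]  over x > 0,
# with BPG, f = 0 and the Burg entropy h(x) = -sum_j log(x_j).
rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.uniform(0.1, 1.0, (m, n))
b = A @ rng.uniform(0.5, 2.0, n)           # consistent data, so min g = 0

def grad_g(x):
    return A.T @ (1.0 - b / (A @ x))

L = np.sum(b)                               # (LC) constant for this pair (g, h)
lam = 1.0 / L
x = np.ones(n)                              # x^0 in int dom h
for _ in range(20000):
    # grad h(x+) = grad h(x) - lam*grad g(x) with grad h(x) = -1/x gives the
    # closed-form update x+ = x / (1 + lam * x * grad g(x)), which stays positive.
    x = x / (1.0 + lam * x * grad_g(x))

Ax = A @ x
print(np.sum(b * np.log(b / Ax) + Ax - b))  # objective value, close to its minimum 0
```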
The Bregman proximal gradient scheme and its convergence rate have been inves-
tigated in other contexts in past studies, e.g., [8,65] under various assumptions. We
will come back to this point at the end of this subsection.
A Mixer in Action: An “All-in-One” Proximal Inequality. Combining the two main
inequalities established in Lemmas 3.1 and 3.3 (the mixer in action) gives an “all-in-
one” proximal inequality. Recall that for the result below, we do not assume that g is
convex.
Lemma 4.1 (All in One Proximal Inequality) Let x ∈ int dom h, λ > 0, and

x⁺ = argmin { f(u) + ⟨∇g(x), u − x⟩ + (1/λ)D_h(u, x) : u ∈ C }.

Then, for all u ∈ dom h,

λ(Φ(x⁺) − Φ(u)) ≤ D_h(u, x) − D_h(u, x⁺) − (1 − λL)D_h(x⁺, x) − λD_g(u, x).   (4.2)
Proof Writing successively the Bregman proximal inequality (Lemma 3.1) for the
convex function u ↦ ϕ(u) := f(u) + ⟨∇g(x), u − x⟩, and the Three Points Descent
Lemma 3.3 for g, yields the following inequalities:

λ(ϕ(x⁺) − ϕ(u)) ≤ D_h(u, x) − D_h(u, x⁺) − D_h(x⁺, x),
g(x⁺) ≤ g(u) + ⟨∇g(x), x⁺ − u⟩ + L D_h(x⁺, x) − D_g(u, x).

Adding these two inequalities and rearranging yields (4.2).

As a special case, when g is assumed convex, the term λD_g(u, x) is nonnegative
and thus can be dropped. In this case we recover the key descent inequality established
in Bauschke et al. [14, Lemma 4], that is,

λ(Φ(x⁺) − Φ(u)) ≤ D_h(u, x) − D_h(u, x⁺) − (1 − λL)D_h(x⁺, x), ∀ u ∈ dom h.
Theorem 4.1 (Basic complexity estimate) [14] Let {x^k}_{k∈N} be the sequence generated
by BPG with λ ∈ (0, L⁻¹]. Then:
(a) the sequence {Φ(x^k)}_{k∈N} is nonincreasing;
(b) with λ = L⁻¹, for any n ≥ 1,

Φ(x^n) − Φ(u) ≤ (L/n) D_h(u, x^0), ∀ u ∈ dom h.

Proof Part (a) follows from (4.2) applied with u := x^k and x := x^k. For (b), applying
(4.2) with λ = L⁻¹ and g convex at each iteration, and summing over k = 0, . . . , n − 1,
the telescoping Bregman terms give

Σ_{k=0}^{n−1} (Φ(x^{k+1}) − Φ(u)) ≤ L(D_h(u, x^0) − D_h(u, x^n)) ≤ L D_h(u, x^0),

and the claim follows from the monotonicity established in (a).
Remark 4.1 A close inspection of Lemma 4.1 shows again the key role of condition
(LC). Indeed, thanks to the structural definition of the Bregman distance, it is easily
seen by a simple algebraic manipulation that with λL = 1 in (4.2), the complexity
constant derived in Theorem 4.1(b), given by Λ := L D_h(u, x^0), can be replaced by the
complexity constant Λ′ := D_{Lh−g}(u, x^0); assuming that g is convex, we clearly have
D_{Lh−g}(u, x^0) ≤ L D_h(u, x^0), thus providing the improved constant Λ′ ≤ Λ.
As a straightforward consequence of the above analysis, much as in the classical
squared Euclidean norm setting, assuming here that g − σh is convex on int dom h, for
some σ > 0 (i.e., that g is σ-strongly convex with respect to h, see [11, Definition 4.1]),
we can immediately obtain the following linear rate of convergence of BPG for the
value gap Φ(x^n) − Φ*. Here we denote by x* the unique minimizer of Φ,
and by Φ* its minimal value Φ(x*).
Proposition 4.1 (Linear rate of BPG) Let {x^k}_{k∈N} be the sequence generated by
BPG with λ = L⁻¹, and assume that g − σh is convex on int dom h for some σ > 0.
Then, for any n ≥ 0,

Φ(x^{n+1}) − Φ* ≤ (1 − σ/L)^{n+1} L D_h(x*, x^0).
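In the Euclidean specialization h = (1/2)‖·‖² (a sketch of ours: g − σh convex then just means that g is σ-strongly convex, and BPG with λ = 1/L is plain gradient descent), the bound of Proposition 4.1 can be verified numerically on a strongly convex quadratic.

```python
import numpy as np

# Illustration (ours), Euclidean case: f = 0, h = energy, g a strongly convex
# quadratic; Proposition 4.1 reads g(x^n) - g* <= (1 - sigma/L)^n * (L/2)*||x* - x^0||^2.
rng = np.random.default_rng(0)
d = 10
M = rng.standard_normal((d, d))
Q = M.T @ M + np.eye(d)                    # Q >= I ensures strong convexity
b = rng.standard_normal(d)
eigs = np.linalg.eigvalsh(Q)
sigma, L = eigs[0], eigs[-1]
g = lambda x: 0.5 * x @ Q @ x - b @ x
x_star = np.linalg.solve(Q, b)
x = np.zeros(d)                            # x^0 = 0
for k in range(1, 101):
    x = x - (1.0 / L) * (Q @ x - b)        # BPG step with h = energy, lam = 1/L
    bound = (1 - sigma / L) ** k * (L / 2) * np.sum(x_star**2)
    assert g(x) - g(x_star) <= bound + 1e-12
print("linear-rate bound of Proposition 4.1 verified for 100 iterations")
```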
A useful quantity in this context is the symmetry coefficient of h, introduced in [14]:

α(h) := inf { D_h(x, y)/D_h(y, x) : (x, y) ∈ int dom h × int dom h, x ≠ y } ∈ [0, 1].   (4.4)

Clearly, perfect symmetry occurs when α(h) = 1, i.e., when h(·) = 2⁻¹‖·‖². Total
lack of symmetry, namely α(h) = 0, occurs for the key examples h(x) = x log x and
h(x) = − log x, which often arise in applications, while with h(x) = x⁴ one obtains
α(h) = 2 − √3, [14].
Equipped with the symmetry coefficient, it was shown in [14] that the most aggres-
sive step size that can be used in BPG satisfies

0 < λ < (1 + α(h))/L.

This nicely recovers the standard step size λ ∈ (0, 2/L) which warrants global
convergence in the classical proximal gradient method [36], since in that case α(h) = 1.
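The symmetry coefficient can also be probed numerically; the following sampling sketch (ours) estimates α(h) from above for h(x) = x⁴ in one dimension, approaching the value 2 − √3 ≈ 0.2679 reported in [14].

```python
import numpy as np

# Monte-Carlo sketch (ours): sample pairs to bound alpha(h) from above for h(x) = x^4.
h  = lambda x: x**4
gh = lambda x: 4 * x**3
D  = lambda a, b: h(a) - h(b) - gh(b) * (a - b)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200000)
y = rng.uniform(-2, 2, 200000)
mask = np.abs(x - y) > 1e-9
r = D(x, y)[mask] / D(y, x)[mask]
print(r.min(), 2 - np.sqrt(3))   # sampled minimum approaches 2 - sqrt(3) from above
```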
To establish global convergence to an optimal solution, additional assumptions
on the Bregman proximal distance (satisfied by all the examples in Example 2.1) are
needed to ensure separation properties of the Bregman distance at the boundary, so
that we can use arguments “à la Opial” [53], as done with classical norms.

Assumption H (i) For every x ∈ dom h and β ∈ R, the level set {y ∈ int dom h :
D_h(x, y) ≤ β} is bounded.
(ii) If {x^k}_{k∈N} converges to some x in dom h, then D_h(x, x^k) → 0.
(iii) Reciprocally, if x is in dom h and {x^k}_{k∈N} is such that D_h(x, x^k) → 0, then
x^k → x.
Theorem 4.2 (BPG: Global Convergence) [14, Theorem 2] Let {x^k}_{k∈N} be the
sequence generated by BPG with λ ∈ (0, (1 + α(h))/L). Assume that cl C = cl(dom h),
that the solution set of (P), argmin_C Φ, is nonempty and compact, and that Assump-
tion H is satisfied. Then, the sequence {x^k}_{k∈N} converges to some solution x* of
problem (P).
To conclude this part, it is interesting to re-emphasize the key role played by the
Lipschitz-like convexity condition (LC), which is sufficient and weaker than the
assumptions made in past studies. For instance, when f = 0 the algorithm BPG is the
gradient method with Bregman distance studied in [8], later extended in [65] to handle
the composite model (CM). An important difference with both works is the fact that
the usual assumptions, namely that ∇g is globally L-Lipschitz continuous and that h
is strongly convex, are no longer needed; see also the concluding remarks for further
discussion.
Consider the special case of problem (CM) with g = 0, namely the nonsmooth convex
problem

(NS)  min { f(u) : u ∈ C }.

We use the notation f′(u) for a subgradient, i.e., an element of the subdifferential set ∂f(u).
To solve problem (NS), we consider the so-called Mirror Descent method intro-
duced by Nemirovsky and Yudin [50, Chapter 4]. As shown in Beck and Teboulle [16],
Mirror Descent can be derived and analyzed through the lenses of the proximal
framework as a non-Euclidean Bregman projected subgradient scheme, and reads as
follows:

(MD)  x^0 ∈ int dom h;  x^{k+1} = argmin { ⟨f′(x^k), u − x^k⟩ + (1/t_k)D_h(u, x^k) : u ∈ C },  t_k > 0.
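For illustration (our sketch; the objective and parameters are chosen for the example only): with h the Shannon entropy and C the unit simplex, the MD update has the familiar closed multiplicative form, shown here on the nonsmooth problem min{‖x − c‖₁ : x ∈ Δ}.

```python
import numpy as np

# Sketch (ours): entropic Mirror Descent on the unit simplex Delta for
# f(x) = ||x - c||_1; with h = Shannon entropy the MD step is multiplicative:
# x^{k+1} proportional to x^k * exp(-t_k * f'(x^k)).
rng = np.random.default_rng(0)
n = 20
c = rng.uniform(0, 1, n); c /= c.sum()       # minimizer of f over Delta is x = c

def subgrad(x):
    return np.sign(x - c)                    # an element f'(x) of the subdifferential

x = np.ones(n) / n                           # x^0: uniform distribution
best = np.inf
for k in range(5000):
    t = 0.5 / np.sqrt(k + 1)                 # step sizes t_k = O(1/sqrt(k))
    w = x * np.exp(-t * subgrad(x))
    x = w / w.sum()                          # closed-form Bregman projection step
    best = min(best, np.abs(x - c).sum())
print(best)   # best value gap so far, tending to 0 at the O(1/sqrt(n)) rate
```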
The usual and standard assumptions to establish the rate of convergence and iteration
complexity of MD require:
(a) σ-strong convexity of h;
(b) Lipschitz continuity of the convex function f, namely there exists L_f > 0 such
that

|f(x) − f(y)| ≤ L_f ‖x − y‖, ∀ x, y ∈ C,

which implies that for all x ∈ int dom f and f′(x) ∈ ∂f(x), we have ‖f′(x)‖∗ ≤ L_f.
As we shall see, we can relax these requirements with a weaker and more general
assumption in the spirit of condition (LC), namely a condition capturing the data
information on both f and h within a single condition, and which naturally emerges
from a close inspection of the proof derived in [16].
To see this, applying Lemma 3.1 to ϕ(u) = ⟨f′(x^k), u − x^k⟩ + δ_C(u) with λ = t_k,
x = x^k and x⁺ = x^{k+1}, we get the iterate produced by (MD), which then yields
the first inequality below, while the second inequality follows from the subgradient
inequality for the convex function f:

t_k⟨f′(x^k), x^{k+1} − u⟩ ≤ D_h(u, x^k) − D_h(u, x^{k+1}) − D_h(x^{k+1}, x^k), ∀ u ∈ C,
f(x^k) − f(u) ≤ ⟨f′(x^k), x^k − u⟩.

Combining these two inequalities, the only quantity left to control is t_k⟨f′(x^k), x^k −
x^{k+1}⟩ − D_h(x^{k+1}, x^k), which leads to the following condition on the pair [f, h].

Condition W[f, h]. There exists G > 0 such that

t⟨f′(x), x − u⟩ − D_h(u, x) ≤ (t²/2)G², for all t > 0, u ∈ dom h, x ∈ int dom h.
The very simple result below shows that this condition is weaker (hence the W!) than
the usual separate assumptions (a) and (b) on [f, h] alluded to above, which are made
in the classical analysis of the MD method.

Proposition 4.2 Assumptions (a) and (b) imply the validity of W[f, h] with G = L_f/√σ.
Proof Using the Fenchel–Young inequality 2⟨a, b⟩ ≤ s⁻¹‖a‖∗² + s‖b‖², valid for all
a, b ∈ E and s > 0, with a := t f′(x), b := x − u and s := σ, followed by the σ-strong
convexity of h, which gives D_h(u, x) ≥ (σ/2)‖x − u‖², we obtain for any t > 0,
u ∈ dom h, and x ∈ int dom f:

t⟨f′(x), x − u⟩ − D_h(u, x) ≤ (t²/(2σ))‖f′(x)‖∗² + (σ/2)‖x − u‖² − D_h(u, x)
                            ≤ (t²/(2σ))‖f′(x)‖∗² ≤ (t²/(2σ))L_f²,

proving that W[f, h] holds with G = L_f/√σ.
Under our standing assumption on problem (NS), summarizing the above devel-
opments for the MD algorithm, we obtain the following main result expressed in
terms of the best value function min_{0≤k≤n} f(x^k) (or, likewise, thanks to Jensen's
inequality, in terms of the average value function, namely f(x̂^n), where x̂^n :=
(Σ_{k=0}^n t_k)⁻¹ Σ_{k=0}^n t_k x^k).
Theorem 4.3 (Best value function gap estimate for MD) Let {x^k}_{k∈N} be the sequence
generated by the MD algorithm, and suppose that condition W[f, h] holds for all
u ∈ dom h and x^k ∈ int dom h, t_k > 0. Then, for any n ≥ 0 and u ∈ dom h,

min_{0≤k≤n} f(x^k) − f(u) ≤ ( D_h(u, x^0) + (G²/2) Σ_{k=0}^n t_k² ) / ( Σ_{k=0}^n t_k ).

Proof By construction of the MD iterates and condition W[f, h], for each k we obtain

t_k(f(x^k) − f(u)) ≤ D_h(u, x^k) − D_h(u, x^{k+1}) + (t_k²/2)G², for all u ∈ dom h.
Summing over k = 0, . . . , n and telescoping gives

Σ_{k=0}^n t_k(f(x^k) − f(u)) ≤ D_h(u, x^0) − D_h(u, x^{n+1}) + (G²/2) Σ_{k=0}^n t_k²,

and the claimed estimate follows by bounding the left-hand side from below by
(min_{0≤k≤n} f(x^k) − f(u)) Σ_{k=0}^n t_k and dropping the term −D_h(u, x^{n+1}) ≤ 0.

The O(1/√n) rate of convergence result easily follows, as done in [16], by minimizing
the right-hand side above with respect to t_k > 0 (cf. [16, Proposition 4.1]), to obtain

t_k = (1/G)√(2R(x^0)/(n + 1)),  k = 0, . . . , n,

from which we get

min_{0≤k≤n} f(x^k) − f(x*) ≤ G √(2R(x^0)/(n + 1)).
The same line of analysis can be adapted in a straightforward way to the composite
model (CM) when both f and g are nonsmooth and C = E, which results in a Bregman
proximal-subgradient scheme:

x^{k+1} = argmin { ⟨f′(x^k), u − x^k⟩ + g(u) + (1/t_k)D_h(u, x^k) : u ∈ E }.

This scheme was analyzed by Duchi et al. [39] under the usual strong convexity of h
and Lipschitz continuity of f, and by assuming in addition that g(·) is a nonnegative
function. Here, our analysis implies that the proof can be carried out in an analogous
way, but under the weaker condition W[f, h] on f. Alternatively, one can also dispense
with the additional nonnegativity assumption on g, at the price of increasing the
constant in the rate estimate: the constant G, which was attributed to f, should be
replaced by a constant G′ such that the sum f + g satisfies the condition W[f+g, h].
We outline here an extension that can benefit from the above developments, and con-
sider the convexly composite model

(Cvx-CM)  min { c(x) := f(x) + θ(G(x)) : x ∈ C },  G(x) := (g_1(x), . . . , g_m(x)).
Let x^0 ∈ dom f ∩ G⁻¹(dom θ) ∩ int dom h. The Bregman Proximal Gradient iteration
for (Cvx-CM) reads

x^{k+1} ∈ argmin { f(u) + θ(G(x^k) + ∇G(x^k)(u − x^k)) + (1/λ)D_h(u, x^k) : u ∈ C }.

Throughout, we assume that x^{k+1} is well defined and lies in int dom h (this can be
ensured through additional technical assumptions along the lines of Lemma 2.3; for
simplicity we omit the details).
From the proof-mechanism revealed in the analysis of the previous schemes, it is
clear that all that is needed to analyze this case is an appropriate Descent-like Lemma
for the composite term θ(G(x)). From there, the rate of convergence analysis of BPG
follows in an analogous way. It turns out that we can express the adequate choice
of λ in terms of the asymptotic (recession) function of θ, a notion that we now recall
together with its basic properties; see [58, p. 87] and [5, p. 53].
We then have the following descent-like property for the convexly composite func-
tion θ(G(x)).

Lemma 4.2 (Descent-like property for θ ∘ G) Let L := (L_1, . . . , L_m). Then, for any
x, z ∈ int dom h,

θ(G(x)) ≤ θ(G(z) + ∇G(z)(x − z)) + θ∞(L) D_h(x, z).
Proof For any i = 1, . . . , m, the convexity of g_i together with the condition (LC) for
each pair¹ (g_i, h) imply the validity of the following inequality for each coordinate:

g_i(x) ≤ g_i(z) + ⟨∇g_i(z), x − z⟩ + L_i D_h(x, z).

Composing with θ and invoking Proposition 4.3(a) and (b) then yields the claimed
descent property. When θ is moreover finite-valued and L_θ-Lipschitz continuous
on R^m, the corresponding statement with constant L_θ L follows from the above and
Proposition 4.3(c).
Thanks to this lemma, the convergence rate result of BPG for the convexly com-
posite model (Cvx-CM) follows by the same arguments as in Theorem 4.1 for the
additive composite model (CM); the only change is in the choice of λ, producing the
same O(1/n) rate for problem (Cvx-CM) with an adjusted complexity constant. This
is recorded in the following theorem.
Theorem 4.4 (Complexity estimate of BPG for (Cvx-CM)) Let {x^k}_{k∈N} be the
sequence generated by BPG for (Cvx-CM) with λ⁻¹ = L_θ L. Then, for all n ≥ 1,

c(x^n) − c(u) ≤ (L_θ L D_h(u, x^0))/n, ∀ u ∈ dom h.
¹ Note that when i ∈ I_a, we can take L_i = 0, and hence the condition (LC) still holds, since L_i h − g_i = −g_i
with g_i affine.
Note that the same rate with complexity constant θ∞(L) would remain valid when θ
is extended valued, provided one can warrant in this case that 0 < θ∞(L) < ∞.

Remark 4.2 An extension allowing a different h_i for each g_i, with the condition
L_i h_i − g_i convex for each i, can also be established. For brevity we omit the additional
technical details; it can be shown that this results in a simple adjustment of the
right-hand side of the inequality in Lemma 4.2.
The nonconvex setting is far more difficult than its convex counterpart, and remains
highly challenging. Convergence to a global optimum is usually out of reach, and rate
of convergence results are obviously limited and weaker. For the classical proximal
gradient method equipped with the squared Euclidean norm, the analysis goes back to
Fukushima and Mine [42], and more recent works include, e.g., [4,29]. Moreover, these
works also imposed the usual restrictive global Lipschitz continuity of the gradient of
g. Very recently, these results have been significantly improved in Bolte et al. [24] within
the non-Euclidean proximal framework for the fully nonconvex composite model,
namely with both f and g nonconvex, and with g lacking a globally Lipschitz continuous
gradient.
For brevity, here we describe some of these recent results derived in [24] by focusing
on the following simpler nonconvex model, which consists of minimizing the sum of
a convex nonsmooth function with a nonconvex differentiable function satisfying the
condition (LC):

(NC)  min { Φ(x) := f(x) + g(x) : x ∈ R^d },

where f : R^d → (−∞, ∞] is proper, lsc and convex, g is differentiable (possibly
nonconvex) with the pair (g, h) satisfying (LC), and h is a Legendre function with
dom h = R^d which is also σ-strongly convex. This simplified model is already of
sufficient importance in many applications, and we refer the reader to [24] for details
on the analysis of the more general nonconvex model.
As in the convex case, we consider the BPG iteration, which here starts with
x^0 ∈ R^d and generates the sequence {x^k}_{k∈N} via the Bregman proximal gradient map:

x^{k+1} = T_λ(x^k) := argmin { f(u) + ⟨∇g(x^k), u − x^k⟩ + (1/λ)D_h(u, x^k) : u ∈ R^d }.

Clearly, here, since dom h = R^d and h is strongly convex, the mapping T_λ is well-
defined and single-valued. We have the following rate of convergence properties,
recently derived in [24].
Theorem 5.1 (BPG: sufficient decrease and rate, nonconvex case) [24] Let {x^k}_{k∈N}
be the sequence generated by BPG with 0 < λL < 1. Then the following sufficient
descent property holds:

(1 − λL) D_h(x^k, x^{k+1}) ≤ λ(Φ(x^k) − Φ(x^{k+1})), ∀ k ∈ N,   (5.1)

and hence, for all n ≥ 1,

min_{1≤k≤n} ‖x^{k+1} − x^k‖ ≤ (1/√n) ( λ(Φ(x^0) − Φ*)/(σ(1 − λL)) )^{1/2}.
Proof Under the condition (LC), we can invoke Lemma 4.1 with u = x = x^k and
x⁺ = x^{k+1} = T_λ(x^k) to obtain (5.1). Summing (5.1) over k = 1, . . . , n, and using the
σ-strong convexity of h, which gives D_h(x^k, x^{k+1}) ≥ (σ/2)‖x^k − x^{k+1}‖², we obtain

(σ(1 − λL)/2) · n · min_{1≤k≤n} ‖x^{k+1} − x^k‖² ≤ (1 − λL) Σ_{k=1}^n D_h(x^k, x^{k+1})
≤ λ(Φ(x^1) − Φ(x^{n+1})) ≤ λ(Φ(x^0) − Φ(x^{n+1})),

and with Φ(x^{n+1}) ≥ Φ* > −∞, the claimed result follows.
Note that this result recovers the classical rate of convergence of the proximal
gradient algorithm for the nonconvex composite model [18, Theorem 2.3], with h being
the energy function, σ = 1 and λL = 1/2. However, a major difference here is that
we do not require the gradient of the nonconvex function g to be globally Lipschitz
continuous. Instead, once again, condition (LC) plays the central role in adapting to the
geometry of the problem at hand. This allows one to treat important models; see [24] for
an interesting application with simple globally convergent algorithms for the so-called
class of quadratic inverse problems with sparsity constraints, which, having a quartic
objective, fails to have a globally Lipschitz gradient.
To prove the global convergence of the sequence {x^k}_{k∈N} generated by BPG to a
critical point of Φ, we first outline a general abstract convergence mechanism, as
recently developed in [29], which is very flexible and can be applied to any given
algorithm satisfying the premises described below. To this end, let F : R^d → (−∞, +∞]
be a proper and lower semicontinuous function which is bounded from below, and
consider the problem

inf { F(x) : x ∈ R^d }.
Consider a generic algorithm A which generates a sequence {x^k}_{k∈N} via the fol-
lowing:

start with any x^0 ∈ R^d and set x^{k+1} ∈ A(x^k), k = 0, 1, . . . .

The main goal is to prove that the whole sequence {x^k}_{k∈N} generated by the algo-
rithm A converges to a critical point x* of F, namely 0 ∈ ∂F(x*). Note that in this
general setting, ∂F stands for the limiting subdifferential of F, cf. [57].
(C1) Sufficient decrease property. There exists a positive scalar ρ₁ such that

F(x^{k+1}) + ρ₁‖x^{k+1} − x^k‖² ≤ F(x^k), ∀ k ∈ N.

(C2) A subgradient lower bound for the iterates gap. There exists a positive scalar
ρ₂ such that

‖w^{k+1}‖ ≤ ρ₂‖x^{k+1} − x^k‖, for some w^{k+1} ∈ ∂F(x^{k+1}), ∀ k ∈ N.

(C3) Let x̄ be a limit point of a subsequence {x^k}_{k∈K}; then lim sup_{k∈K⊂N} F(x^k) ≤ F(x̄).
These three conditions are typical of any descent-type algorithm, and are basic for
proving subsequential convergence, see e.g., [1,29,66].
To establish global convergence of the whole sequence, we need an additional
assumption on the class of functions F: it must satisfy the so-called nonsmooth
Kurdyka–Łojasiewicz (KL) property [27] (see [44,46] for the smooth case). We refer
the reader to [28] for an in-depth study of the class of KL functions, as well as to the
references therein.
Verifying the KL property of a given function might often be a difficult task. How-
ever, thanks to a fundamental result established by Bolte et al. [27], it holds for the
broad class of semi-algebraic functions, which abound in applications, see [29,66],
and references therein.
The last ingredient needed to activate this generic framework is a key uniformiza-
tion of the KL property proven in [29, Lemma 6, p. 478]. We can then prove the
following general theorem (see [29] for the details leading to this result).

Theorem 5.2 (Generic global convergence) Let F be a KL function which is bounded
from below, and let {x^k}_{k∈N} be a bounded sequence generated by the algorithm A
satisfying conditions (C1)–(C3). Then the sequence {x^k}_{k∈N} has finite length, that is,
Σ_{k=1}^∞ ‖x^{k+1} − x^k‖ < ∞, and it converges to a critical point x* of F.
Equipped with this generic result, we can establish the global convergence of
BPG to a critical point of Φ. Thanks to Fermat's rule [57, Theorem 10.1, p. 422],
the set of critical points of Φ here is given by

crit Φ = { x ∈ R^d : 0 ∈ ∂Φ(x) ≡ ∂f(x) + ∇g(x) }.
In brief, let us mention that the sufficient descent property established in Theorem 5.1
(cf. (5.1)) allows us to prove conditions (C1) and (C3), while the following mild
additional assumption on the pair (g, h) is needed to establish condition (C2).
Assumption NC[+] ∇h and ∇g are Lipschitz continuous on any bounded subset of
Rd .
We then have the following simplified version of the more general convergence
result proved in [24].
Theorem 5.3 (Convergence of BPG) Suppose that assumptions NC and NC[+] hold.
Let {x^k}_{k∈N} be a sequence generated by BPG, assumed to be bounded, and let
0 < λL < 1. The following assertions hold:
(i) Subsequential convergence. Any limit point of the sequence {x^k}_{k∈N} is a critical
point of Φ.
(ii) Global convergence. Suppose that Φ satisfies the KL property on dom Φ, which
is in particular true when f and g are semialgebraic. Then, the sequence {x^k}_{k∈N}
has finite length, and converges to a critical point x* of Φ.
Convergence rate results for the sequence described in Theorem 5.3 can also be
derived by applying the generic rate of convergence results of Attouch and
Bolte [1].
6 Concluding remarks
We have provided a concise synthesis outlining the key theoretical elements playing
a central role in the convergence analysis of non-Euclidean Bregman-based prox-
imal methods and their most fundamental first order algorithmic relatives, stressing
clarifications and simplifications through elementary proof-patterns. The volume of
literature on first order methods has exploded over the last decade, and we refer the
reader to some of the recent pointers given in the bibliography for further elaborations,
extensions and applications.
Below, we briefly further discuss the benefits and limitations of the non-Euclidean
proximal framework, and end with an open challenging question.
Much like any proximal minimization scheme, the underlying framework depends
on how efficiently we can compute the iteration formula of BPG (cf. (4.1)):

x⁺ = T_λ(x) := argmin { f(u) + ⟨∇g(x), u − x⟩ + (1/λ)D_h(u, x) : u ∈ C }.
Although not visible at first sight, it is interesting to observe that T_λ(·) shares the
same structural splitting principle as the classical Euclidean proximal gradient map,
which can be useful for computational purposes, allowing one to decompose it through
two specific operators. Indeed, as shown in [14, Section 3.1], writing the optimality
condition for x⁺, formal computations show that by defining the following operators:

• A Bregman gradient step

p_λ(x) := argmin { ⟨∇g(x), u⟩ + (1/λ)D_h(u, x) : u ∈ E },  x ∈ int dom h,   (6.1)

• A Bregman proximal step

prox^h_{λf}(x) := argmin { f(u) + (1/λ)D_h(u, x) : u ∈ C },

one can rewrite the map T_λ(·) simply as the composition of a Bregman proximal step
with a Bregman gradient step:

T_λ(x) = prox^h_{λf}(p_λ(x)).
Thus, computing T_λ depends on whether the above two operators admit closed-
form solutions or can be computed efficiently numerically. In the classical
Euclidean proximal gradient method, i.e., when h is the energy function, the first
computation above reduces to a standard gradient step, p_λ(x) = x − λ∇g(x), and
the second one is the usual Moreau proximal map of λf. In the general case, for any
x ∈ int dom h, define v(x) := ∇h(x) − λ∇g(x). Then the Bregman gradient step
(6.1) reduces to the explicit formula

p_λ(x) = ∇h*(v(x)) = (∇h)⁻¹(∇h(x) − λ∇g(x)).
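For instance (our sketch, with an arbitrary smooth g chosen for illustration): with the Shannon generator on the positive orthant, ∇h(x) = 1 + log x and ∇h*(y) = e^{y−1}, so the Bregman gradient step becomes the explicit multiplicative map p_λ(x) = x e^{−λ∇g(x)}.

```python
import numpy as np

# Sketch (ours): Bregman gradient step p_lam = grad h* (grad h(x) - lam*grad g(x))
# for the Shannon generator h(x) = sum x_j*log(x_j) on the positive orthant.
def p_lam(x, grad_g, lam):
    v = (1.0 + np.log(x)) - lam * grad_g(x)   # v(x) = grad h(x) - lam*grad g(x)
    return np.exp(v - 1.0)                    # grad h*(v) = exp(v - 1)

grad_g = lambda x: x - 1.0                    # e.g. g(x) = 0.5*||x - 1||^2 (ours)
x = np.array([0.5, 1.0, 2.0])
print(p_lam(x, grad_g, 0.1))                  # identical to the multiplicative form:
print(x * np.exp(-0.1 * grad_g(x)))
```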
The fast Bregman scheme of Auslender and Teboulle [8], a Nesterov-type accelerated
variant built on the Bregman gradient step, was proven to achieve the faster rate
O(1/n²) [8, Theorem 5.2]. Later on, it was extended by Tseng [65] to handle as well
the additive convex composite model (CM), where the Bregman gradient z-step should
be replaced by the Bregman proximal gradient iteration z^{k+1} = T_{λ_k}(y^k) with
λ_k = t_k L⁻¹. However, for both algorithms, these results were derived only under the
assumptions that g has a globally L-Lipschitz gradient and that h is strongly convex,
two assumptions that we have specifically avoided throughout this paper. As pointed
out in [14], one central and interesting topic for further research is to devise fast FOM
capable of lifting both restrictions. In a recent study [60], numerical experiments based
on the AT scheme (and some other variants), applied to problems satisfying the (LC)
condition (i.e., lacking a globally Lipschitz gradient) and with h not strongly convex,
exhibit the very same faster rate. The theoretical justification, however, is still lacking.
We thus end this paper with the following, which we believe remains a challenging
open question:

Under the framework of this paper, can we produce first order algorithms with
a proven faster convergence rate?
Acknowledgements This synthesis has emerged from my long standing cooperation with Alfred Auslender
on this topic. I am deeply indebted to him, not only for generously sharing with me over the past two decades
his profound mathematical knowledge, enthusiasm, and vision in optimization, but also for his friendship
and the resulting pleasure, fun and adventures beyond our mathematical activities. The material covered
here also benefited greatly from, and draws on, past and recent works written with Amir Beck, Jérôme Bolte
and Shoham Sabach, with whom I have had the pleasure to work over many years. My deepest thanks
to all of them for this fruitful past and continuing collaboration, and to Heinz Bauschke, for the pleasure of
collaborating with him on this topic over the past 3 years. I am grateful to the editors and the referees of the
paper, whose constructive comments and suggestions were very useful to improve the paper presentation.
References
1. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving
analytic features. Math. Program. 116, 5–16 (2009)
2. Attouch, H., Teboulle, M.: A regularized Lotka–Volterra dynamical system as a continuous proximal-
like method in optimization. J. Optim. Theory Appl. 121, 541–570 (2004)
3. Attouch, H., Bolte, J., Redont, P.: Optimizing properties of an inertial dynamical system with geometric
damping: link with proximal methods. Control Cybern. 31, 643–657 (2002)
4. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame
problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods.
Math. Program. 137, 91–129 (2013)
5. Auslender, A., Teboulle, M.: Asymptotic Cones and Functions in Optimization and Variational Inequal-
ities. Springer, New York (2003)
6. Auslender, A., Teboulle, M.: Interior gradient and Epsilon-subgradient methods for constrained convex
minimization. Math. Oper. Res. 29, 1–26 (2004)
7. Auslender, A., Teboulle, M.: Interior projection-like methods for monotone variational inequalities.
Math. Program. 104, 39–68 (2005)
8. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization.
SIAM J. Optim. 16, 697–725 (2006)
9. Auslender, A., Teboulle, M.: Projected subgradient methods with non-Euclidean distances for non-
differentiable convex minimization and variational inequalities. Math. Program. Ser. B 120, 27–48
(2009)
10. Auslender, A., Teboulle, M., Ben-Tiba, S.: Interior proximal and multiplier methods based on second
order homogeneous kernels. Math. Oper. Res. 24, 645–668 (1999)
11. Bartlett, P.L., Hazan, E., Rakhlin, A.: Adaptive online gradient descent. In: Advances in Neural Infor-
mation Processing Systems, vol. 20 (2007)
12. Bauschke, H.H., Borwein, J.M.: Legendre functions and the method of Bregman projections. J. Convex
Anal. 4(1), 27–67 (1997)
13. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces.
Springer, New York (2011)
14. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first
order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2016)
15. Beck, A.: First Order Methods in Optimization. SIAM, Philadelphia (2017)
16. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex opti-
mization. Oper. Res. Lett. 31, 167–175 (2003)
17. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM J. Imaging Sci. 2(1), 183–202 (2009)
123
A simplified view of first order methods for optimization
18. Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery problems. In:
Palomar, D., Eldar, Y.C. (eds.) Convex Optimization in Signal Processing and Communications, pp.
139–162. Cambridge University Press, Cambridge (2009)
19. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22,
557–580 (2012)
20. Ben-Tal, A., Margalit, T., Nemirovsky, A.: The ordered subsets mirror descent optimization method
with applications to tomography. SIAM J. Optim. 12, 79–108 (2001)
21. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, Cam-
bridge (1982)
22. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
23. Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)
24. Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First order methods beyond convexity and Lipschitz
gradient continuity with applications to quadratic inverse problems. SIAM J. Optim. (2017) (accepted)
25. Bolte, J., Sabach, S., Teboulle, M.: Nonconvex Lagrangian-based optimization: monitoring schemes
and global convergence. Math. Oper. Res. (2018). https://doi.org/10.1287/moor.2017.0900
26. Bolte, J., Teboulle, M.: Barrier operators and associated gradient like dynamical systems for constrained
minimization problems. SIAM J. Control Optim. 42, 1266–1292 (2003)
27. Bolte, J., Daniilidis, A., Lewis, A.S.: The Łojasiewicz inequality for nonsmooth subanalytic functions
with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
28. Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient
flows, talweg, convexity. Trans. Am. Math. Soc. 362, 3319–3363 (2010)
29. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and
nonsmooth problems. Math. Program. 146(1), 459–494 (2014)
30. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning
via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
31. Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application
to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200–217
(1967)
32. Bruck, R.: On the weak convergence of an ergodic iteration for the solution of variational inequalities
for monotone operators in Hilbert space. J. Math. Anal. Appl. 61, 159–164 (1977)
33. Burachik, R.S., Iusem, A.N.: A generalized proximal point algorithm for the variational inequality
problem in a Hilbert space. SIAM J. Optim. 8, 197–216 (1998)
34. Censor, Y., Zenios, S.A.: Proximal minimization algorithm with D-functions. J. Optim. Theory Appl.
73, 451–464 (1992)
35. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Breg-
man functions. SIAM J. Optim. 3, 538–543 (1993)
36. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward–backward splitting. SIAM Multi-
scale Model. Simul. 4, 1168–1200 (2005)
37. Csiszár, I.: Information-type measures of difference of probability distributions and indirect observa-
tions. Studia Sci. Mat. Hungar. 2, 299–318 (1967)
38. Drusvyatskiy, D., Lewis A.S.: Error bounds, quadratic growth, and linear convergence of proximal
methods. Math. Oper. Res. (2018). https://doi.org/10.1287/moor.2017.0889
39. Duchi, J.C., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Pro-
ceedings of 23rd Annual Conference on Learning Theory, pp. 14–26. (2010)
40. Eckstein, J.: Nonlinear proximal point algorithms using Bregman functions, with applications to convex
programming. Math. Oper. Res. 18, 202–226 (1993)
41. Flammarion, N., Bach, F.: Stochastic composite least-squares regression with convergence rate O(1/n).
Proc. Mach. Learn. Res. 65, 1–44 (2017)
42. Fukushima, M., Mine, H.: A generalized proximal point algorithm for certain nonconvex minimization
problems. Int. J. Syst. Sci. 12, 989–1000 (1981)
43. Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM J.
Control Optim. 29(2), 403–419 (1991)
44. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48(3),
769–783 (1998)
45. Lewis, A.S., Wright, S.J.: A proximal method for composite minimization. Math. Program. Ser. A 158,
501–546 (2016)
123
M. Teboulle
46. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. In: Les Équations
aux Derivées Partielles, pp. 87–89. Éditions du Centre National de la Recherche Scientifique, Paris
(1963)
47. Martinet, B.: Régularisation d’inéquations variationnelles par approximations successives. Rev.
Française Informatique et Recherche Opérationnelle 4, 154–158 (1970)
48. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. Fr. 93(2), 273–299
(1965)
49. Nemirovsky, A.S.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lip-
schitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J.
Optim. 15, 229–251 (2004)
50. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley,
New York (1983)
51. Nesterov, Y.: A method for solving the convex programming problem with convergence rate O(1/k²).
Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
52. Nguyen, Q.V.: Forward–backward splitting with Bregman distances. Vietnam J. Math. 45, 519–539
(2017)
53. Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive mappings.
Bull. AMS 73, 591–597 (1967)
54. Palomar, D.P., Eldar, Y.C.: Convex Optimization in Signal Processing and Communications. Cambridge
University Press, Cambridge (2010)
55. Passty, G.B.: Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. J.
Math. Anal. Appl. 72, 383–390 (1979)
56. Polyak, R., Teboulle, M.: Nonlinear rescaling and proximal-like methods in convex optimization. Math.
Program. 76, 265–284 (1997)
57. Rockafellar, R.T., Wets, R.: Variational analysis. In: Grundlehren der Mathematischen Wissenschaften,
vol. 317. Springer (1998)
58. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
59. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim.
14(5), 877–898 (1976)
60. Sabach, S., Teboulle, M., Vaisbourd, Y.: Fast non-Euclidean first order algorithms: a numerical study.
Working paper (April 2017)
61. Shefi, R., Teboulle, M.: Rate of convergence analysis of decomposition methods based on the proximal
method of multipliers for convex minimization. SIAM J. Optim. 24, 269–297 (2014)
62. Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. The MIT Press, Cambridge
(2011)
63. Teboulle, M.: Entropic proximal mappings with application to nonlinear programming. Math. Oper.
Res. 17, 670–690 (1992)
64. Teboulle, M.: Convergence of proximal-like algorithms. SIAM J. Optim. 7, 1069–1083 (1997)
65. Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimiza-
tion. Math. Program. Ser. B 125, 263–295 (2010)
66. Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with
applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789
(2013)