Grundlehren der mathematischen Wissenschaften 306: A Series of Comprehensive Studies in Mathematics
Editors
M. Artin  S. S. Chern  J. Coates  J. M. Fröhlich
H. Hironaka  F. Hirzebruch  L. Hörmander
C. C. Moore  J. K. Moser  M. Nagata  W. Schmidt
D. S. Scott  Ya. G. Sinai  J. Tits  M. Waldschmidt
S. Watanabe
Managing Editors
M. Berger B. Eckmann S.R.S. Varadhan
Jean-Baptiste Hiriart-Urruty
Claude Lemaréchal
Convex Analysis
and Minimization
Algorithms II
Advanced Theory
and Bundle Methods
With 64 Figures
Claude Lemaréchal
INRIA, Rocquencourt
Domaine de Voluceau
B.P.105
F-78153 Le Chesnay, France
Typesetting: Camera-ready copy produced from the authors' output file using a
Springer TEX macro package
Printed on acid-free paper
Table of Contents Part II
Introduction ..... xv
IX. Inner Construction of the Subdifferential ..... 1
1 The Elementary Mechanism ..... 2
2 Convergence Properties ..... 9
2.1 Convergence ..... 9
2.2 Speed of Convergence ..... 15
3 Putting the Mechanism in Perspective ..... 24
3.1 Bundling as a Substitute for Steepest Descent ..... 24
3.2 Bundling as an Emergency Device for Descent Methods ..... 27
3.3 Bundling as a Separation Algorithm ..... 29
References ..... 337
Table of Contents Part I
Introduction ..... xv
I. Convex Functions of One Real Variable ..... 1
1 Basic Definitions and Examples ..... 1
1.1 First Definitions of a Convex Function ..... 2
1.2 Inequalities with More Than Two Points ..... 6
1.3 Modern Definition of Convexity ..... 8
2 First Properties ..... 9
2.1 Stability Under Functional Operations ..... 9
2.2 Limits of Convex Functions ..... 11
2.3 Behaviour at Infinity ..... 14
3 Continuity Properties ..... 16
3.1 Continuity on the Interior of the Domain ..... 16
3.2 Lower Semi-Continuity: Closed Convex Functions ..... 17
3.3 Properties of Closed Convex Functions ..... 19
4 First-Order Differentiation ..... 20
4.1 One-Sided Differentiability of Convex Functions ..... 21
4.2 Basic Properties of Subderivatives ..... 24
4.3 Calculus Rules ..... 27
5 Second-Order Differentiation ..... 29
5.1 The Second Derivative of a Convex Function ..... 30
5.2 One-Sided Second Derivatives ..... 32
5.3 How to Recognize a Convex Function ..... 33
6 First Steps into the Theory of Conjugate Functions ..... 36
6.1 Basic Properties of the Conjugate ..... 38
6.2 Differentiation of the Conjugate ..... 40
6.3 Calculus Rules with Conjugacy ..... 43
References ..... 407
In this second part of "Convex Analysis and Minimization Algorithms", the dichotomous aspect implied by the title is again continuously present. Chapter IX introduces the bundling mechanism to construct the subdifferential of a convex function at a given point. It reveals implementation difficulties which serve as a motivation for another concept: the approximate subdifferential. However, a convenient study of this latter set necessitates the so-called Legendre-Fenchel transform, or conjugacy correspondence, which is therefore the subject of Chap. X. This parent of the celebrated Fourier transform is being used more and more, in its natural context of variational problems, as well as in broader fields from natural sciences. Then we can study the approximate subdifferential in Chap. XI, where our point of view is definitely oriented towards computational utilizations.
Chapter XII, lying on the fringe between theory and applications, makes a break in our development. Its subject, Lagrangian relaxation, is probably the most spectacular application of convex analysis; it can be of great help in a huge number of practical optimization problems, but it requires a high level of sophistication. We present this delicate theory in a setting as applied as possible, so that the major part of the chapter should be accessible to the non-specialist. This chapter is quite voluminous, but only its Sections 1 to 3 are essential to understand the subject. Section 4 concerns dual algorithms, and its interest is numerical only. Section 5 definitely deviates towards theory and relates the approach to other chapters of the book, especially Chap. X.
The last three chapters can then be entirely devoted to bundle methods, and to the several ways of arriving at them. Chapter XIII does with the approximate subdifferential the job that was done by Chap. IX with the ordinary subdifferential. This allows the development of bundle methods in dual form (Chap. XIV). Finally, Chap. XV gives these same methods in their primal form, which is probably the one with the most promising future.
A reader mainly interested in the convex analysis part of the book can skip the numerical Chapters IX, XIII, XIV. Even if he jumps directly to Chap. XV, he will get the necessary comprehension of bundle methods, and more generally of algorithms for nonsmooth optimization. This might even be a good idea, in the sense that Chap. XV is probably much easier to read than the other three. However, we do not recommend this for a reader with professional algorithmic motivations, and this for several reasons:
- Chapter XV relies exclusively on convex analysis, and not on the general algorithmic principles making up Chap. II in the first part. This is dangerous because, as explained in §VIII.3.3, there is no clearcut division between smooth objective functions and merely convex ones. A sound algorithm for convex minimization should therefore not ignore its parents; Chap. XIV, precisely, gives an account of bundle methods on the basis of smooth minimization.
- The bundling mechanism of Chap. IX, and the dual bundle method of Chap. XIV, can be readily extended to nonconvex objective functions. Chapter XV is of no help to understand how this extension can be made.
- More generally, this reader would miss many interesting features of bundle methods,
which are important if one wants to do research on the subject. Chapter XV is only
a tree, which should not hide the other chapters making up the forest.
We have tried as much as possible to avoid any prospective development, contenting ourselves with well-established material. This explains why the present book ignores many recent works in convex analysis and minimization algorithms, in particular those of the last decade or two, concerning the second-order differentiation of a nonsmooth function. In fact, these are at present just isolated trials towards solving a challenging question, and a well-organized theory is still hard to foresee. It is not clear which of these works will be of real value in the future: they rather belong to speculative science; but yet, "there still are so many beautiful things to write in C major" (S. Prokofiev).
We recall that references to theorems or equations of another chapter are preceded by the chapter number (in roman numerals). The letter A refers to Appendix A of the first part, in which our main system of notation is overviewed. In order to be reasonably self-contained, we give below a short glossary of symbols and terms appearing throughout.
"↦" denotes a function or mapping, as in x ↦ f(x); when f(x) is a set instead of a singleton, we have a multifunction and we rather use the notation x ⇉ f(x).
⟨·, ·⟩, ‖·‖ and |||·||| are respectively the scalar product, the associated norm and an arbitrary norm (here and below the space of interest is most generally the Euclidean space ℝⁿ). Sometimes, ⟨s, x⟩ will be the standard dot-product sᵀx = Σⱼ₌₁ⁿ sⱼxⱼ.
B(0, 1) := {x : ‖x‖ ≤ 1} is the (closed) unit ball and its boundary is the unit sphere.
cl f is the closure, or lower semi-continuous hull, of the function f; if f is convex and lower semi-continuous on the whole of ℝⁿ, we say that f is closed convex.
Df(x), D₋f(x) and D₊f(x) are respectively the derivative, the left- and the right-derivative of a univariate function f at a point x; and the directional derivative of f (multivariate and convex) at x along d is

f′(x, d) := lim_{t↓0} [f(x + td) − f(x)]/t .
dom f := {x : f(x) < +∞} and epi f := {(x, r) : f(x) ≤ r} are respectively the domain and epigraph of a function f : ℝⁿ → ℝ ∪ {+∞}.
The function d ↦ f′∞(d) := lim_{t→+∞} [f(x + td) − f(x)]/t does not depend on x ∈ dom f and is the asymptotic function of f (∈ Conv ℝⁿ).
H⁻_{s,r} := {x : ⟨s, x⟩ ≤ r} is a half-space, defined by s ≠ 0 and r ∈ ℝ; replacing the inequality by an equality, we obtain the corresponding (affine) hyperplane.
I_S(x) is the indicator function of the set S at the point x; its value is 0 on S, +∞ elsewhere.
If K is a cone, its polar K° is the set of those s such that ⟨s, x⟩ ≤ 0 for all x ∈ K.
If f(x) ≤ g(x) for all x, we say that the function f minorizes the function g. A sequence {x_k} is minimizing for the function f : X → ℝ if f(x_k) → inf_X f.
ℝ₊ = [0, +∞[ is the set of nonnegative numbers; (ℝ₊)ⁿ is the nonnegative orthant of ℝⁿ; a sequence {t_k} is decreasing if k′ > k ⟹ t_{k′} ≤ t_k.
Outer semi-continuity of a multifunction F is what is usually called upper semi-continuity elsewhere. Suppose F(x) is nonempty and included in a fixed compact set for x in a neighborhood of a given x* (i.e. F is locally bounded near x*); we say that F is outer semi-continuous at x* if, for arbitrary ε > 0, F(x) ⊂ F(x*) + B(0, ε) whenever x is close enough to x*.
The steepest-descent direction at x is obtained by solving

min f′(x, d), ‖d‖ = 1,

knowing that more general normalizations can also be considered. A slight modification of the above problem is much more tractable, just by changing the normalization constraint into an inequality. Exploiting positive homogeneity and using some duality theory, one obtains the classical projection problem ŝ := argmin {½‖s‖² : s ∈ ∂f(x)}. The essential results of §VIII.1 are then as follows:
- either ŝ = 0, which means that x minimizes f;
- or ŝ ≠ 0, in which case the Euclidean (non-normalized) steepest-descent direction is −ŝ.
This allows the derivation of a steepest-descent algorithm, in which the direction is
computed as above, and some line-search can be made along this direction.
We have also seen the major drawbacks of this type of algorithm:
(a) It is subject to zigzags, and need not converge to a minimum point.
(b) The projection problem can be solved only in special situations, basically when
the subdifferential is a convex polyhedron, or an ellipsoid.
(c) To compute any kind of descent direction, steepest or not, an explicit description of the whole subdifferential is needed, one way or another. In many applications, this is too demanding.
Our main motivation in the present chapter is to overcome (c), and we will develop
a technique which will eliminate (b) at the same time. As in the previous chapters, we
assume a finite-valued objective function:
f : ℝⁿ → ℝ is convex.
For the sake of simplicity, we limit the study to steepest-descent directions in the sense of the Euclidean norm |||·||| = ‖·‖; adapting our development to other normings is indeed straightforward and only results in additional technicalities. In fact, our situation is fundamentally characterized by the information concerning the objective function f: we need only the value f(x) and a subgradient s(x) ∈ ∂f(x), computed in a black box of the type (U1) in Fig. II.1.2.1, at any x ∈ ℝⁿ that pleases us. Such a computation is possible in theory because dom ∂f = ℝⁿ; and it is also natural from the point of view of applications, for reasons given in §VIII.3.
The idea is to construct the subdifferential of f at x, or at least an approximation
of it, good enough to mimic the computation of the steepest-descent direction. The
approximation in question is obtained by a sequence of compact convex polyhedra,
and this takes care of (b) above. On the other hand, our approximation scheme does
not solve (a): for this, we need additional material from convex analysis; the question
is therefore deferred to Chap. XIII and the chapters that follow it. Furthermore, our
approximation mechanism is not quite implementable; this new difficulty, having the
same remedy as needed for (a), will also be addressed in the same chapters.
The present chapter is fundamental. It introduces the key ingredients for con-
structing efficient algorithms for the minimization of convex functions, as well as
their extensions to the nonconvex case.
In view of the poor information available, our very first problem is twofold: (i) how
can one compute a descent direction at a given x, or (ii) how can one realize that x is
optimal? Let us illustrate this by a simple specific example.
For n = 2 take the maximum of three affine functions f₁, f₂, f₃:

f(ξ, η) := max {f₁(ξ, η), f₂(ξ, η), f₃(ξ, η)} = max {ξ + 2η, ξ − 2η, −ξ}, (1.1)
which is minimal at x = (0, 0). Figure 1.1 shows a level-set of f and the three half-lines of kinks. Suppose that the black box computing s(x) produces the gradient of one of the active functions. For example, the following scheme takes the largest active index:
First, set s(x) = ∇f₁(x) = (1, 2);
if f₂(x) ≥ f₁(x), then set s(x) = ∇f₂(x) = (1, −2);
if f₃(x) ≥ max {f₁(x), f₂(x)}, then set s(x) = ∇f₃(x) = (−1, 0).
(i) When x is as indicated in the picture, s(x) = ∇f₂(x) and −s(x) points in a direction of increasing f. This is confirmed by calculations: with the usual dot-product for ⟨·, ·⟩ in ℝ² (which is implicitly assumed in the picture),

f′(x, −s(x)) ≥ ⟨∇f₁(x), −∇f₂(x)⟩ = 3 > 0.

Thus the choice d = −s(x) is bad. Of course, it would not help to take d = s(x): a rather odd idea since we already know that f′(x, s(x)) ≥ ‖s(x)‖² > 0.
Finally note that the trouble would be just the same if, instead of taking the largest possible index, the black box (U1) chose the smallest one, say, yielding ∇f₁(x).
(ii) When x = 0, s(x) = ∇f₃(x) still does not tell us that x is optimal. It would be extremely demanding of (U1) to ask that it select the only subgradient of interest here, namely s = 0!
It is the aim of the present chapter to solve this double problem. From now on, x is a fixed point in ℝⁿ and we want to answer the double question: is x optimal? and if not, find a descent direction issuing from x.
Remark 1.1 The difficulty is inherent to the combination of the two words: descent direction and convex function. In fact, if we wanted an ascent direction, it would suffice to take the direction of any subgradient, say d = s(x): indeed

f′(x, d) ≥ ⟨s(x), d⟩ = ‖s(x)‖² . □

We will also use the local boundedness of the subdifferential:

∃M > 0, ∃η > 0 such that ‖s(y)‖ ≤ M whenever ‖y − x‖ ≤ η. (1.2)
We will make extensive use of the theory from Chap. VIII, more specifically §VIII.1. In particular, let S [= ∂f(x)] be a closed convex polyhedron and consider the problem

min {σ_S(d) : ‖d‖ ≤ 1} [i.e. min {f′(x, d) : ‖d‖ ≤ 1}].
We recall that a simple solution can be obtained as follows: compute the unique solution ŝ of

min {½‖s‖² : s ∈ S} ;

then take d = −ŝ.
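In code, this recipe is a small quadratic program over the convex multipliers of the generators. Below is a minimal numerical sketch (ours, not the book's); the helper name and the use of scipy's SLSQP solver are our choices, made for self-containedness:

```python
import numpy as np
from scipy.optimize import minimize

def project_origin(generators):
    """Point of minimal Euclidean norm in conv{generators}:
    solve min ||sum_j a_j s_j||^2  s.t.  a >= 0, sum_j a_j = 1."""
    S = np.asarray(generators, dtype=float)
    k = len(S)
    obj = lambda a: 0.5 * np.dot(a @ S, a @ S)
    res = minimize(obj, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    return res.x @ S

# The (non-normalized) steepest-descent direction is then d = -project_origin(...).
```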
The problem of finding a descent direction issuing from the given x will be solved by an iterative process. A sequence of "trial" directions d_k will be constructed, together with a sequence of designated subgradients s_k. These two sequences will be such that: either (i) f′(x, d_k) < 0 for some k, or (ii) k → ∞ and ŝ_k → 0, indicating that x is optimal. For this process to work, the boundedness property (1.2) is absolutely essential.
(a) The Key Idea. We start the process with the computation of s(x) ∈ ∂f(x) coming from the black box (U1). We call it s₁, we form the polyhedron S₁ := {s₁}, and we set k = 1.
Recursively, suppose that k ≥ 1 subgradients s₁, …, s_k are available, making up the compact convex polyhedron

S_k := conv {s₁, …, s_k} ⊂ ∂f(x). (1.3)

Then we choose a direction d_k such that

σ_{S_k}(d_k) < 0, (1.4)

or equivalently

⟨s, d_k⟩ < 0 for all s ∈ S_k, (1.5)

or again

⟨s_j, d_k⟩ < 0 for j = 1, …, k. (1.6)
Only then has d_k a chance of being a descent direction.
The precise computation of d_k satisfying (1.4) will be specified later. Suppose for the moment that this computation is done; then the question is whether or not d_k is a descent direction. The best way to check is to compare f(x + t d_k) with f(x) for t ↓ 0. This is nothing but a line-search, which works schematically as follows.
Algorithm 1.2 (Line-Search for Mere Descent) The direction d (= d_k) is given. Start from some t > 0, for example t = 1.
Step 1. Compute f(x + td). If f(x + td) < f(x) stop. This line-search is successful, d is downhill.
Step 2. Take a smaller t > 0, for example t/2, and loop to Step 1. □
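In concrete terms, Algorithm 1.2 is a few lines of code. The following sketch is ours (not the book's); x and d are assumed to be numpy arrays, and the cutoff t_min is an addition, since a computer cannot let t tend to 0 literally:

```python
def mere_descent_linesearch(f, x, d, t_init=1.0, t_min=1e-12):
    """Schematic version of Algorithm 1.2: halve t until f(x + t*d) < f(x).
    Returns a successful stepsize t, or None when t has shrunk below t_min
    (numerically simulating the case t tending to 0)."""
    fx = f(x)
    t = t_init
    while t > t_min:
        if f(x + t * d) < fx:    # Step 1: success, d is downhill
            return t
        t *= 0.5                 # Step 2: take a smaller t and loop
    return None
```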
Compared to the line-searches of §II.3, the present one is rather special, indeed. No extrapolations are made: t is never too small. Our aim is not to find a new iterate x⁺ = x + td, but just to check the (unknown) sign of f′(x, d). Only the case t ↓ 0 is interesting for our present concern: it indicates that f′(x, d) ≥ 0; a new direction must therefore be computed.
Then the key lies in the following simple result.
Proposition 1.3 Suppose that Algorithm 1.2 loops forever, generating t ↓ 0. Then every cluster point s* of the corresponding subgradients s(x + t d_k) satisfies s* ∈ ∂f(x) and

⟨s*, d_k⟩ ≥ 0.
In other words: not only have we observed that d_k was not a descent direction; but we have also explained why: we have singled out an additional subgradient s* which does not satisfy (1.5); therefore s* ∉ S_k. We can call s_{k+1} this additional subgradient, and recompute a new direction satisfying (1.4) (with k increased by 1). This new direction d_{k+1} will certainly be different from d_k, so the process can be safely repeated.
Remark 1.4 When doing this, we are exploiting the characterization of ∂f(x) detailed in §VI.6.3(b). First of all, Proposition 1.3 goes along with Lemma VI.6.3.4: s* is produced by a directional sequence (x + t d_k)_{t↓0}, and therefore lies in the face of ∂f(x) exposed by d_k. Algorithm 1.2 is one instance of the process described there, constructing the subgradient s* attached to the direction d_k. In view of Theorem VI.6.3.6, we know that, if we take "sufficiently many" such directional sequences, then we shall be able to make up the whole set ∂f(x). More precisely (see also Proposition V.3.1.5), the whole boundary of ∂f(x) is described by the collection of such subgradients s*(d), with d describing the unit sphere of ℝⁿ.
The game is not over, though: it is prohibited to run Algorithm 1.2 infinitely many times, one time per normalized direction - not even mentioning the fact that each run takes an infinite computing time, with t tending to 0. Our main concern is therefore to select the directions d_k carefully, so as to obtain a sufficiently rich approximation of ∂f(x) for some reasonable value of k. □
(b) Choosing the Sequence of Directions. To specify how each d_k should be computed, (1.4) leaves room for infinitely many possibilities. As observed in Remark 1.4 above, this computation is crucial if we do not want to exhaust a whole neighborhood of x. To guide our choice, let us observe that the above mechanism achieves two things at the same time:
- It performs a sequence of line-searches along the trial directions d₁, d₂, …, d_k; the hope is really to obtain a descent direction d, having σ_{∂f(x)}(d) < 0.
- It accumulates the subgradients s₁, s₂, …, s_k, thereby building the inner approximation S_k of ∂f(x).
Once again, the theory of §VIII.1 tells us that it is a good idea to replace the normalization by an inequality; exploiting duality as in §VIII.1, d_k is then obtained from the more tractable projection problem

d_k = −ŝ_k, where ŝ_k solves min {½‖s‖² : s ∈ S_k}. (1.8)

Remark 1.5 It is worth mentioning that, with d_k solving (1.8), −‖d_k‖² is an estimate of f′(x, d_k): there holds (see Remark VIII.1.3.7)

−‖d_k‖² = σ_{S_k}(d_k) ≤ f′(x, d_k) .
Algorithm 1.6 (Quest for a Descent Direction) Given x ∈ ℝⁿ, compute f(x) and s₁ = s(x) ∈ ∂f(x). Set k = 1.
Step 1. With S_k defined by (1.3), solve (1.8) to obtain d_k. If d_k = 0 stop: x is optimal.
Step 2. Execute the line-search 1.2 with its two possible exits:
Step 3. If the line-search has produced t > 0 with f(x + t d_k) < f(x), then stop: d_k is a descent direction.
Step 4. If the line-search has produced t ↓ 0 and s_{k+1} ∈ ∂f(x) such that

⟨s_{k+1}, d_k⟩ ≥ 0, (1.9)

then replace k by k + 1 and loop to Step 1. □
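Assembling the pieces gives the following hedged sketch of Algorithm 1.6 (ours; it reuses project_origin and mere_descent_linesearch defined above). The probing stepsize t_probe is our numerical stand-in for the cluster point of Proposition 1.3; it is exact for piecewise affine functions, cf. Remark 1.9:

```python
import numpy as np

def quest_for_descent(f, subgrad, x, delta=1e-8, max_iter=500, t_probe=1e-10):
    """Sketch of Algorithm 1.6. `subgrad` is the black box (U1), returning
    one arbitrary subgradient at the given point."""
    x = np.asarray(x, dtype=float)
    bundle = [np.asarray(subgrad(x), dtype=float)]       # S_1 = {s_1}
    for _ in range(max_iter):
        s_hat = project_origin(bundle)                   # solve (1.8)
        if np.linalg.norm(s_hat) <= delta:
            return None                                  # d_k = 0: x is (nearly) optimal
        d = -s_hat
        if mere_descent_linesearch(f, x, d) is not None:
            return d                                     # Step 3: d_k is downhill
        # Step 4 (null step): query (U1) just off x along d for a new subgradient
        bundle.append(np.asarray(subgrad(x + t_probe * d), dtype=float))
    return None
```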
Remark 1.7 Another question, which is not raised by the algorithm itself but rather by §VIII.2.2, is the following: if we plug Algorithm 1.6 into a steepest-descent scheme such as VIII.1.1.7 to minimize f, the result will probably converge to a wrong point. A descent direction d_k produced by Algorithm 1.6 is "at best" the steepest-descent direction, which is known to generate zigzags. In view of (ii) above, we have laid down a non-implementable computation, which results in a non-convergent method!
Chapter XIII and its successors will be devoted to remedying this double drawback. For the moment, we just mention once more that the ambition of a minimization method should not be limited to iterating along steepest-descent directions, which are already bad enough in the smooth case. The remedy, precisely, will also result in improvements of the steepest-descent idea. □
Example 1.8 Let us give a geometric illustration of the role of introducing s_{k+1} into S_k. Take again the function f of (1.1) and, for given x, consider the one-dimensional function t ↦ q(t) := f(x − t s(x)).
If q(t) < q(0) for some t > 0, no comment. Suppose, therefore, that q(t) ≥ q(0) for all t ≥ 0. This certainly implies that x = (ξ, η) is a kink; say η = 0, as in Fig. 1.1. More importantly, the function that is active at x - here f₂, say, with s(x) = ∇f₂(x) - is certainly active at no t > 0. Otherwise, we would have for t > 0 small enough
q(t) = f₂(x − t s(x)) = q(0) − t⟨∇f₂(x), s(x)⟩ = q(0) − t‖∇f₂(x)‖² < q(0).

In other words: the black box (U1) is forced to produce some other gradient - here ∇f₁(x) - at any x − t s(x), no matter how small t > 0 is. The discontinuous nature of s(·) is here turned into an advantage. In the particular example of the picture, once this new gradient is picked, our knowledge of ∂f(x) is complete and we can iterate. □
Remark 1.9 Note in Example 1.8 that, f₁ being affine, s(x − t s(x)) is already in ∂f(x) for t > 0. It is not necessary to let t ↓ 0, the argument (ii) above is eliminated. This is a nice feature of piecewise affine functions.
It is interesting to consider this example with §VIII.3.4 in mind. Specifically, observe in the picture that an arbitrary s ∈ ∂f(x) is usually not in ∂f(x − ts), no matter how close the positive t is to 0. There is, however, exactly one ŝ ∈ ∂f(x) satisfying ŝ ∈ ∂f(x − tŝ) for t > 0 small enough; and −ŝ is precisely the Euclidean steepest-descent direction at x. The same observation can be made in the example of §VIII.2.2, even though the geometry is slightly different. Indeed, we have here an "explanation" of the uniqueness result in Theorem VIII.3.4.1, combined with the stability result VIII.3.4.5(i). □
Needless to say, the bundling mechanism is not bound to the Euclidean norm ‖·‖. Choosing an arbitrary |||·||| would amount to solving

min r
⟨s_j, d⟩ ≤ r for j = 1, …, k,
|||d||| ≤ 1

instead of (1.8). This would preserve the key property that σ_{S_k}(d_k) is as negative as possible, in some different sense. As an example, we recall from Proposition VIII.1.3.4 that, if a quadratic norm ⟨d, Qd⟩^{1/2} is chosen, the essential calculations become a projection problem in the metric defined by Q (the direction being then −Q⁻¹ŝ_k).
Among all possible norms, could there be one (possibly depending on k) such that the
bundling Algorithm 1.6 converges as fast as possible? This is a fascinating question indeed,
which is important because the quality of algorithms for nonsmooth optimization depends
directly on it. Unfortunately, no clear answer can be given with our present knowledge. Worse:
the very meaning of this question depends on a proper definition of "as fast as possible", and
this is not quite clear either, in the context of nonsmooth optimization.
On the other hand, Remark 1.9 above reveals a certain property of "stability" of the
steepest-descent direction, which is special to the Euclidean norming.
2 Convergence Properties
In this section, we study the convergence properties of the schematic bundling process described as Algorithm 1.6. Its convergence and speed of convergence will be analyzed by techniques which, once again, serve as a basis for virtually all minimization methods to come.
2.1 Convergence
We will use the notation S_k of (1.3); and, to make the writing less cumbersome, we will denote by Proj v/A the Euclidean projection of a vector v onto the closed convex hull of a set A.
From the point of view of its convergence, the process of §1 can be reduced to the following essentials. It consists of generating two sequences of ℝⁿ: {s_k} (the subgradients) and {ŝ_k} (the projections, or equivalently the directions d_k = −ŝ_k). Knowing that

ŝ_k = Proj 0/{s₁, …, s_k}, (2.1.1)

each line-search produces s_{k+1} such that

⟨s_{k+1}, ŝ_k⟩ ≤ 0, (2.1.2)

which is (1.9). Furthermore, we know from the boundedness property (1.2) that

‖s_k‖ ≤ M for k = 1, 2, … (2.1.3)

All our convergence theory is based on the property that ŝ_k → 0 if Algorithm 1.6 loops forever.
Before giving a proof, we illustrate this last property in Fig. 2.1.1. Suppose ∂f(x) lies entirely in the upper half-space. Start from s₁ = (−1, 0), say (we assume M = 1). Given ŝ_k somewhere as indicated in the picture, s_{k+1} has to lie below the dashed line. "At worst", this s_{k+1} has norm 1 and satisfies equality in (2.1.2). The picture does suggest that, when the operation is repeated, the ordinate of ŝ_{k+1} must tend to 0, implying ŝ_k → 0.
The property ŝ_k → 0 results from the combination of the inequalities (2.1.1), (2.1.2), (2.1.3): they ensure that the triangle formed by 0, ŝ_k and s_{k+1} is not too elongated. Taking (2.1.3) for granted, a further examination of Fig. 2.1.1 reveals what exactly (2.1.1) and (2.1.2) are good for: the important thing is that the projection of the segment [ŝ_k, s_{k+1}] along the vector ŝ_k has a length which is not "infinitely smaller" than ‖ŝ_k‖.
Lemma 2.1.1 Let m > 0 be fixed. Consider two sequences {s_k} and {ŝ_k} satisfying, for k = 1, 2, …, the relations (2.1.1) – (2.1.3). Then there is an index k with ‖ŝ_k‖ ≤ m; in other words, ŝ_k → 0 when k → +∞.
Corollary 2.1.2 Consider the bundling Algorithm 1.6. If x is not optimal, there must be some integer k with f′(x, d_k) < 0.
Figure 2.1.2 is another illustration of this convergence property. Along the first d₁ = −s₁, the s₂ produced by the line-search has to be in the dashed area. In the situation displayed in the picture, d₂ will separate ∂f(x) from the origin, i.e. d₂ will be a descent direction (§VIII.1.1).
Lemma 2.1.1 also provides a stopping criterion for the case of an optimal x. It suffices to insert in Step 1 of Algorithm 1.6 the test

‖d_k‖ ≤ δ, (2.1.7)

where δ > 0 is some pre-specified tolerance. As explained in Remark VIII.1.3.7, −d_k can be viewed as an approximation of "the" gradient of f at x; (2.1.7) thus appears as the equivalent of the "classical" stopping test (II.1.2.1). When (2.1.7) is inserted, Algorithm 1.6 stops in any case: it either produces a descent direction, or signals (approximate) optimality of x.
There are several ways of interpreting (2.1.7) to obtain some information on the quality of x:
- First we have by the Cauchy-Schwarz inequality:

f(y) ≥ f(x) + ⟨ŝ_k, y − x⟩ ≥ f(x) − δ‖y − x‖ for all y ∈ ℝⁿ. (2.1.8)

If x and a minimum of f are a priori known to lie in some ball B(0, R), this gives an estimate of the type

f(x) ≤ inf f + 2δR.
- Second, consider the perturbed function h := f + ½c‖· − x‖², where c > 0: writing s := ŝ_k = −d_k, we have

h(y) − f(x) ≥ ⟨s, y − x⟩ + ½c‖y − x‖² ≥ ‖y − x‖ [½c‖y − x‖ − ‖s‖] ;

thus, the property h(y) ≤ h(x) = f(x) implies ‖y − x‖ ≤ (2/c)‖s‖ ≤ (2/c)δ: up to this accuracy, x minimizes the perturbed function h.
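Both interpretations reduce to plain arithmetic; a minimal sketch (ours, with illustrative values for δ, R and c):

```python
# Interpreting the stopping test (2.1.7): with tolerance delta, an a priori
# ball B(0, R) containing x and a minimum, and a perturbation coefficient c,
# the two estimates above are immediate to evaluate.
delta, R, c = 1e-3, 10.0, 0.1
gap_bound = 2 * delta * R        # f(x) <= inf f + 2*delta*R
radius_bound = 2 * delta / c     # any y with h(y) <= h(x) has ||y - x|| <= this
print(gap_bound, radius_bound)   # 0.02 0.02
```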
Remark 2.1.3 The number m in Lemma 2.1.1 is not used in Corollary 2.1.2. It will appear as a tolerance for numerical implementations of the bundling Algorithm 1.6. In fact, it will be convenient to introduce a relaxed tolerance m′ ∈ ]0, 1[:
- The tolerance m′ plays the role of 0 and will actually be essential for facilitating the line-search: with m′ > 0, the test

⟨s_{k+1}, d_k⟩ ≥ −m′‖d_k‖²

is easier to satisfy than (1.9).
(b) A Quantitative Result and its Consequences. The next result is a more accurate alternative to Lemma 2.1.1; it will be useful for numerical implementations, and also when we study speeds of convergence. Here, s and ŝ play the role of s_{k+1} and ŝ_k respectively; m′ < 1 is the m′ of Remark 2.1.3; as for z, it is an alternative to ŝ_{k+1}.

Lemma 2.1.4 Let ŝ ≠ 0 and s be such that ‖s‖ ≤ M and

⟨s, ŝ⟩ ≤ m′‖ŝ‖², (2.1.10)

and set z := Proj 0/[ŝ, s]. Then

μ := (1 − m′)²‖ŝ‖² / (M² − 2m′‖ŝ‖² + ‖ŝ‖²) > 0 (2.1.11)

and

‖z‖² ≤ (1 − μ)‖ŝ‖². (2.1.12)

Furthermore, if equality holds in (2.1.10) and ‖s‖ = M, equality holds in (2.1.12).
PROOF. Develop

‖αs + (1 − α)ŝ‖² = α²‖s‖² + 2α(1 − α)⟨s, ŝ⟩ + (1 − α)²‖ŝ‖²

and set φ(α) := α²M² + 2α(1 − α)m′‖ŝ‖² + (1 − α)²‖ŝ‖². By assumption, ‖αs + (1 − α)ŝ‖² ≤ φ(α) for α ∈ [0, 1], so we have the bound ‖z‖² ≤ min {φ(α) : α ∈ [0, 1]}. There holds

φ′(0) = −2(1 − m′)‖ŝ‖² ≤ 0 and φ′(1) = 2(M² − m′‖ŝ‖²) ≥ 0,

so the minimum of φ on [0, 1] is actually unconstrained. Then the minimal value is given by straightforward calculations:

φ_min = [(M² − m′²‖ŝ‖²) / (M² − 2m′‖ŝ‖² + ‖ŝ‖²)] ‖ŝ‖²

(which, of course, is nonnegative). Write this equality as φ_min = (1 − μ)‖ŝ‖²; another straightforward calculation shows that this μ is just (2.1.11). It is nonnegative because its denominator is M² + (1 − 2m′)‖ŝ‖², which is positive in all cases of interest (remember ‖ŝ‖ ≤ M and m′ < 1). □
A first use of this additional convergence result appears when considering numerical implementations. In the bundling Algorithm 1.6, S_k is defined by k vectors. Since k has no a priori upper bound, an infinite memory is required for the algorithm to work. Furthermore, the computing time necessary to solve (1.8) is also potentially unbounded. Let us define, however, the following extension of Algorithm 1.6.

Algorithm 2.1.5 Steps 1 to 3 are those of Algorithm 1.6, with the test (2.1.7) inserted in Step 1; Step 4 and a new Step 5 are as follows:
Step 4 (null-step, update the bundle). Here, the line-search has produced t ↓ 0 and s_{k+1} ∈ ∂f(x) such that

⟨s_{k+1}, d_k⟩ ≥ −m′‖d_k‖².

Take for S_{k+1} any compact convex set included in ∂f(x) but containing at least ŝ_k and s_{k+1}.
Step 5 (loop). Replace k by k + 1 and loop to Step 1. □
The idea exploited in this scheme should be clear, but let us demonstrate the mechanism in some detail. Suppose for example that we do not want to store more than 10 generators to characterize the bundle of information S_k. Until the 10th iteration, Algorithm 1.6 can be reproduced normally: the successive subgradients s_k can be accumulated one after the other, S_k is just the convex hull (1.3). After the 10th iteration has been completed, however, there is no room to store s₁₁. Then we can carry out the following operations:
- Extract from {1, 2, …, 10} a subset containing at most 8 indices - call K₁₀ this subset, let it contain k′ ≤ 8 indices.
- Delete all those s_j such that j ∉ K₁₀. After renumbering, the subgradients that are still present can be denoted by s₁, …, s_{k′}. We have here a "compression" of the bundle.
- Append ŝ₁₀ and s₁₁ to the above list of subgradients and define S₁₁ to be the convex hull of s₁, …, s_{k′}, ŝ₁₀, s₁₁.
- Then loop to perform the 11th iteration.
Algorithm 1.6 is then continued: s₁₂, s₁₃, … are appended, until the next compression occurs, at some future k (≤ 19, depending on the size of K₁₀).
This is just an example, other compression mechanisms are possible. An important observation is that, at subsequent iterations, the polyhedron S_k (k > 10) will be generated by vectors of several origins: (i) some will be original subgradients, computed by the black box (U1) during previous iterations; (ii) some will be projections ŝ_t for some 10 ≤ t < k, i.e. convex combinations of those in (i); (iii) some will be "projections of projections" (after two compressions at least have been made), i.e. convex combinations of those in (i) and (ii); and so on. The important thing to understand is that all these generators are in ∂f(x) anyway: we have S_k ⊂ ∂f(x) for all k.
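As an illustration, the bookkeeping just described fits in a few lines (our sketch; the "keep the most recent generators" rule stands for any admissible choice of the index set K):

```python
def compress_bundle(generators, s_hat, s_new, max_size=10):
    """Sketch of the compression in Step 4 of Algorithm 2.1.5: keep a subset
    of the generators and append the projection s_hat and the new subgradient
    s_new. Since s_hat is a convex combination of former generators, every
    vector kept lies in the subdifferential at x, so S_k stays inside it."""
    kept = list(generators[-(max_size - 2):]) if max_size > 2 else []
    return kept + [s_hat, s_new]
```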
Remark 2.1.6 Returning to the notation of §VIII.2.1, suppose that solving the quadratic problem (VIII.2.1.10) at the 10th iteration has produced the multipliers α ∈ Δ ⊂ ℝ¹⁰. Then choose K₁₀. If K₁₀ happens to contain each j ≤ 10 such that α_j > 0, it is not necessary to append subsequently ŝ₁₀: it is already in conv {s₁, …, s_{k′}}.
A possible strategy for defining the compressed set K₁₀ is to delete first all those s_j such that α_j = 0, if any (one is enough). If there is none, then call for a more stringent deletion rule; for example delete the two subgradients with largest norm. □
Lemma 2.1.4 suggests that the choice of K₁₀ above is completely arbitrary - including K₁₀ := ∅ - leaving convergence unaffected. This is confirmed by the next result.
Theorem 2.1.7 Algorithm 2.1.5 stops for some finite iteration index, producing either a downhill d_k or an ŝ_k ∈ ∂f(x) of norm less than δ.
PROOF. Suppose for contradiction that the algorithm runs forever. Because of Step 4, we have at each iteration

‖ŝ_{k+1}‖² ≤ ‖z_k‖² ≤ (1 − μ_k)‖ŝ_k‖², (2.1.13)

where μ_k is defined by (2.1.11) with ŝ replaced by ŝ_k. If the algorithm runs forever, the stop of Step 1 never operates and we have ‖ŝ_k‖ > δ for all k. Observe from (2.1.11) that μ_k is an increasing function of ‖ŝ_k‖²; this implies

μ_k ≥ (1 − m′)²δ² / (M² − 2m′δ² + δ²) =: μ > 0 for k = 1, 2, …

Then (2.1.13) gives ‖ŝ_k‖² ≤ (1 − μ)^{k−1}‖ŝ₁‖² → 0, which contradicts ‖ŝ_k‖ > δ. □
Lemma 2.2.1 Let {δ_k} be a sequence of nonnegative numbers satisfying, with λ > 0 fixed,

δ_{k+1} ≤ λδ_k / (λ + δ_k) for k = 1, 2, … (2.2.1)

Then

δ_k ≤ λδ₁ / (λ + (k − 1)δ₁) for k = 1, 2, …, (2.2.2)

with equality in (2.2.2) when equality holds in (2.2.1).
PROOF. We prove the "="-part by induction on k. First, (2.2.2) obviously holds for k = 1. Assume that the equality in (2.2.2) holds for some k. Plug this value of δ_k into (2.2.1) and work out the algebra to obtain after some manipulations

δ_{k+1} = λδ_k / (λ + δ_k) = λδ₁ / (λ + kδ₁) .

The recurrence is established. The "≤"- and "≥"-parts follow since the function x ↦ λx/(λ + x) is increasing on ℝ₊. □
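The lemma is easy to check numerically (our sketch, with arbitrary λ and δ₁):

```python
# Quick check of Lemma 2.2.1: with equality in (2.2.1), the recurrence
# d_{k+1} = lam*d_k/(lam + d_k) is solved by d_k = lam*d1/(lam + (k-1)*d1),
# which decays like lam/k.
lam, d1 = 0.25, 1.0
d = d1
for k in range(1, 50):
    d = lam * d / (lam + d)                       # this is delta_{k+1}
    assert abs(d - lam * d1 / (lam + k * d1)) < 1e-12
print("closed form (2.2.2) verified")
```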
(a) Worst Possible Behaviour. Our analysis so far allows an immediate bound on the speed of convergence of the bundling algorithm:

Corollary 2.2.2 As long as Algorithm 2.1.5 has not stopped, there holds

‖ŝ_k‖ ≤ [1/(1 − m′)] M/√k . (2.2.3)
PROOF. First check that (2.2.3) holds for k = 1. Then use Lemma 2.1.4: we clearly have in (2.1.11)

μ_k ≥ (1 − m′)²‖ŝ_k‖² / M² ;

plug this majorization into (2.1.12) to obtain

‖ŝ_{k+1}‖² ≤ ‖ŝ_k‖² [1 − (1 − m′)²‖ŝ_k‖²/M²] ≤ λ‖ŝ_k‖² / (λ + ‖ŝ_k‖²), with λ := [M/(1 − m′)]².

Lemma 2.2.1 then applies with δ_k = ‖ŝ_k‖²; since δ₁ ≤ M² ≤ λ, (2.2.2) gives ‖ŝ_k‖² ≤ λ/k, which is (2.2.3). □
Thus a bundling algorithm must reach its stopping criterion after not more than [M/((1 − m′)δ)]² iterations - and we already mentioned that this is a prohibitive number. Suppose that the estimate (2.2.3) is sharp; then it may take like a million iterations, just to obtain a subgradient of norm 10⁻³M. Now, if we examine the proof of Theorem 2.1.7, we see that the only upper bound that is possibly not sharp is the first inequality in (2.1.13). From the second part of Lemma 2.1.4, we see that the equality

‖Proj 0/[ŝ_k, s_{k+1}]‖² = ‖ŝ_k‖²‖s_{k+1}‖² / (‖ŝ_k‖² + ‖s_{k+1}‖²)

is obtained in the worst case, when:
(i) the generated subgradients do not cluster to 0 (say all have roughly the same norm M),
(ii) they are all approximately orthogonal to the corresponding ŝ_k (so (2.1.10) becomes roughly an equality), and
(iii) a cheap form of Algorithm 2.1.5 is used (say, S_{k+1} is systematically taken as the segment [ŝ_k, s_{k+1}], or a compact convex polyhedron hardly richer).
This worst case can actually occur, as the following example shows.
Counter-example 2.2.3 Take a function f whose subdifferential at the origin is the triangle conv {(0, 1), (1, 0), (−1, 0)}, and let x := (0, 0) be the reference point in Algorithm 2.1.5. Suppose s₁ = (0, 1) (why not!). Then the first line-search is done along (0, −1). Suppose it produces s₂ = (1, 0), so that d₂ points along (−1, −1) (no compression is allowed at this stage). Then s₃ is certainly (−1, 0) (see Fig. 2.2.1) and our knowledge of ∂f(0) is complete.
From then on, suppose that the "minimal" bundle {ŝ_k, s_{k+1}} is taken at each iteration. Calling (σ, τ) ∈ ℝ² the generic projection ŝ (note: τ > 0), the generic line-search yields the next subgradient s = (−σ/|σ|, 0). Now observe the symmetry in Fig. 2.2.1: the projections of s₂ and s₃ onto ŝ are opposite, so we have

⟨ŝ, s⟩ = −‖ŝ‖² .

Thus (2.1.10) holds as an equality with m′ := −1, and we conclude from the last part of Lemma 2.1.4 that

‖ŝ_{k+1}‖² = λ_k‖ŝ_k‖² / (λ_k + 4‖ŝ_k‖²)

(here λ_k = 1 − ‖ŝ_k‖² ≤ 1). With Lemma 2.2.1 in mind, it is clear that ‖ŝ_k‖² ~ 1/(4k). □
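The 1/(4k) behaviour can be reproduced from the recurrence just obtained (a sketch under our reconstruction of that recurrence; the starting value is illustrative):

```python
# Illustration of Counter-example 2.2.3: with M = 1 and m' = -1, the squared
# norms obey q_{k+1} = lam*q/(lam + 4q) with lam = 1 - q, hence k*q_k -> 1/4.
q = 0.5                          # q_k stands for ||s_hat_k||^2
for k in range(100000):
    lam = 1.0 - q
    q = lam * q / (lam + 4.0 * q)
print(4 * 100000 * q)            # close to 1, i.e. q_k ~ 1/(4k)
```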
(b) Best Possible Behaviour. We are naturally led to the next question: might the bound established in Corollary 2.2.2 be unduly pessimistic? More specifically: since the only non-sharp majorization in our analysis is ‖ŝ_{k+1}‖ ≤ ‖z‖, where z denotes Proj 0/[ŝ_k, s_{k+1}], do we have ‖ŝ_{k+1}‖ ≪ ‖z‖? The answer is no; the bundling algorithm does suffer a sublinear convergence rate, even if no subgradient is ever discarded:
Counter-example 2.2.4 To reach the stopping criterion ‖ŝ_k‖ ≤ δ, Corollary 2.2.2 tells us that an order of (M/δ)² iterations are sufficient, no matter how {s_k} is generated. As a partial converse to this result, we claim that a particular sequence {s_k} can be constructed, such that an order of M/δ iterations are indeed necessary.
To substantiate our claim, consider again the example of Fig. 2.1.1, with M = 1; but place the first iterate close to 0, say at (−ε, 0), instead of (−1, 0) (see the left part of Fig. 2.2.2); then, place all the subsequent iterates as before, namely take each s_{k+1} of norm 1 and orthogonal to ŝ_k. Look at the left part of Fig. 2.2.2 to see that ŝ_k is, as before, the projection of the origin onto the line-segment [s₁, s_k]; but the new thing is that ‖ŝ_k‖ ≪ ‖s_k‖ already for k = 2: the majorization obtained in Lemma 2.1.4 (which is sharp, knowing that s₁ now plays the role of ŝ) becomes weak from the very beginning.
Fig. 2.2.2. Perverse effect of a short initial subgradient
These observations can be quantified, with the help of some trigonometry: the angle θ_k being as drawn in the right part of Fig. 2.2.2, we have

‖ŝ_k‖ = ε cos θ_k . (2.2.4)

Draw the angle α = θ_{k+1} − θ_k and observe that sin α = ‖ŝ_{k+1}‖ = ε cos θ_{k+1} to obtain

ε cos θ_{k+1} = sin θ_{k+1} cos θ_k − cos θ_{k+1} sin θ_k .

Thus, the variable u_k := tan θ_k satisfies the recurrence formula

u_{k+1} = u_k + ε √(1 + u_k²) . (2.2.5)

Thus the event u_{k+1} ≥ 1 requires k ≥ 1/(ε√2); our claim is proved. □
The above counter-example is slightly artificial, in that s₁ is "good" (i.e. small), and the following subgradients are as nasty as possible. The example can be modified, however, so that all the subgradients generated by the black box (U1) have comparable norms. For this, add a third coordinate to the space, and start with two preliminary iterations: the first subgradient, say s₀, is (−ε, 0, 1), so ŝ₀ = s₀ = (−ε, 0, 1). For the next subgradient, take s₁ = (−ε, 0, −1), which obviously satisfies (2.1.2); it is easy to see that ŝ₁ is then (−ε, 0, 0). From then on, generate the subgradients s₂ = (0, 1, 0) and so on as before, all having their third coordinate null. Draw a picture if necessary to see that the situation is then exactly the same as in 2.2.4, with ŝ₁ = (−ε, 0, 0) replacing the would-be s₁ = (−ε, 0): the third coordinate no longer plays a role.
In this variant, a wide angle (⟨s₁, s₀⟩ ≪ 0) has the same effect as a small initial subgradient. The message from our counter-examples is that, in a bundling algorithm, an exceptionally good event (small subgradient, or wide angle) must subsequently be paid for by a slower convergence.
Among other things, our Counter-example 2.2.4 suggests that the choice of the compressed bundle S_{k+1} plays a minor role. A situation worth mentioning is one in which this choice is strictly irrelevant: when all the subgradients are mutually orthogonal. Remembering Remark II.2.4.6, assume

⟨s_i, s_j⟩ = 0 for i ≠ j . (2.2.6)

Then the projections of the origin onto conv {s₁, …, s_k} and onto [ŝ_{k−1}, s_k] coincide.
This remark has some important consequences. Naturally, (2.2.6) cannot hold forever in a finite-dimensional space. Instead of ℝⁿ, consider therefore the space ℓ² of square-summable sequences s = (σ₁, σ₂, …). Let s_k be the kth vector of the canonical Hilbertian basis of this space: then ŝ_k = (1/k)(s₁ + ⋯ + s_k) for all k. This is because
‖ŝ_k‖² = (1/k²) Σ_{j=1}^k ‖s_j‖² = 1/k = ⟨ŝ_k, s_j⟩ for j = 1, …, k . (2.2.7)
Thus, the above sequence {s_k} could well be generated by the bundling Algorithm 1.6. In fact, such would be the case with

f(x) := sup {σ_j : j = 1, 2, …} for x = (σ₁, σ₂, …),

if (U1) computed the smallest index giving the max, and if the algorithm was initialized on the reference point x = 0. In this case, the majorization (2.2.3) would be definitely sharp, as predicted by our theory and confirmed by (2.2.7). It may be argued that numerical algorithms, intended for implementation on a computer, are not supposed to solve infinite-dimensional problems. Yet, if n is large, say n ≥ 100, then (2.2.6) may hold till k = 100; in this case, it requires 100 iterations to reduce the initial ‖ŝ‖ by a mere factor of 10.
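The orthogonal situation (2.2.6) is immediate to check numerically (our sketch with n = 100):

```python
import numpy as np

# With s_k the canonical basis vectors, the projection of the origin onto
# conv{s_1,...,s_k} is their mean, of norm exactly 1/sqrt(k): the bound
# (2.2.3) is attained (here m' = 0 and M = 1).
n = 100
for k in (1, 4, 25, 100):
    s_hat = np.eye(n)[:k].mean(axis=0)        # (1/k)(s_1 + ... + s_k)
    assert np.isclose(np.linalg.norm(s_hat), 1.0 / np.sqrt(k))
```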
Remark 2.2.5 Thus, a linear rate of convergence, say

‖ŝ_{k+1}‖ ≤ q‖ŝ_k‖ for all k, (2.2.8)

with a rate q < 1 independent of the particular function f, cannot hold for our algorithm - at least when the space is infinite-dimensional. There is actually a reason for that: according to complexity theory, no algorithm can enjoy the "good" behaviour (2.2.8); or if it does, the rate q in (2.2.8) must tend to 1 when the dimension n of the space tends to infinity; in fact q must be "at best" of the form q ~ 1 − 1/n.
Actually, complexity theory also tells us that the estimate (2.2.3) cannot be improved, unless n is taken explicitly into account; a brief and informal account of this complexity result is as follows. Consider an arbitrary algorithm, having called our black box (U1) at k points x₁, …, x_k, and denote by s₁, …, s_k the output from (U1). Then, an estimate of the type

min {‖s‖ : s ∈ conv {s₁, …, s_k}} ≥ cM/√k (2.2.9)

holds for a suitably chosen convex function (c being a positive constant): the order 1/√k in (2.2.3) cannot be beaten. □
(c) Practical Behaviour. Our analysis in (a) and (b) above predicts a certain (mediocre) convergence speed of Algorithm 2.1.5; it also indicates that the choice of S_{k+1} has little influence on this speed. Unfortunately, this is blatantly contradicted empirically. Some numerical experiments will give an idea of what is observed in practice.
Apply Algorithm 2.1.5 to the test-problem TR48 of 2.2.6, the starting point x being a minimum x̄. Then no stop can occur at Step 3 and the algorithm loops between Step 4 and Step 1 until a small enough ‖ŝ_k‖ is found.
We have tested two forms of Algorithm 2.1.5:
- The first form is the most expensive possible, in which no subgradient is ever deleted: at each iteration, S_{k+1} = conv {s₁, …, s_{k+1}}, as in (1.3).
- The second form is the cheapest possible: all subgradients are systematically deleted and S_{k+1} = [ŝ_k, s_{k+1}] at each iteration.
Figure 2.2.3 displays the respective evolutions of ‖ŝ_k‖, as a function of k. We believe that it illustrates well enough how pessimistic (2.1.13) can be. The same phenomenon is observed with the test-problem MAXQUAD of VIII.3.3.3: with the expensive form, the squared norm of the subgradient is divided by 10⁵ in 3 iterations; with the cheap form, it is divided by 10², and then stagnates there.
Fig. 2.2.3. Efficiency of storing subgradients in a bundling algorithm; TR48 (cheap form vs. expensive form)
In view of Remark 2.2.5, one could expect that the cheap and expensive forms tend to equalize when the dimension n increases. To check this, we introduce another example:
Test-problem 2.2.7 (TSP) Consider a complete graph, i.e. a set of n nodes ("cities") and m = n(n − 1)/2 arcs ("routes", linking each pair of cities); let a cost ("length") be associated with each arc. The traveling salesman problem is that of constructing a path visiting all the cities exactly once, and having the shortest length. This is not simple, but a closely related problem is to minimize an ordinary piecewise affine function associated with the graph, say

f(x) := max {f_k(x) : k ∈ T}, each f_k being affine, (2.2.11)

where x is a vector of "dummy costs", associated with the nodes. We do not give a precise statement of this last problem; let us just say that the basis is once again duality, introduced in a fashion to be seen in Chap. XII.
Here the index-set T (the set of "1-trees" in the graph) is finite but intractable: |T| ~ nⁿ. However, finding for given x an index k ∈ T realizing the max in (2.2.11) is a reasonable task, which can be performed in O(m) = O(n²) operations. Furthermore, the resulting s_k is an integer vector.
This example will be used with several datasets corresponding to several values of n, and referred to as TSPn. □
Take first TSP442, having thus n = 442 variables; the minimal value of f turns out to be −50505. Just as with TR48, we start from an optimal x̄ and we run the two forms of Algorithm 2.1.5. With this larger test-problem, Fig. 2.2.4 displays an evolution similar to that of TR48; but at least the cheap variant is no longer "non-convergent": the two curves can be drawn on the same picture.
Fig. 2.2.5. Efficiency of storing subgradients in a bundling algorithm; TSP1173
Figure 2.2.5 is a further illustration, using another dataset called TSP1173. A new feature is now the unexpectedly small number of iterations, which can be partly explained by the special structure of the subgradients s_k:
(i) All their coordinates have very small values, say in the range {−1, 0, 1}, with exceptionally some at ±2; as a result, it becomes easier to form the 0-vector with them.
(ii) More importantly, most of their components are systematically 0, for all x near the optimal reference point: actually, less than 100 of them ever become nonzero; it follows that the sequence {s_k} evolves in a space of dimension not more than 100 (instead of 1173). In TSP442, this space has dimension 150; and we recall that in TR48, it has dimension exactly 47.
Remark 2.2.8 The snaky shape of the curves in Figs. 2.2.3 and 2.2.5 (at least for the expensive variant) is probably due to the piecewise affine character of f. Roughly speaking, ‖ŝ_k‖ decreases fast at the beginning, and then slows down, simply because 1/k follows this pattern; then comes the structure of ∂f(x̄), which is finitely generated: less and less room is left for the next s_{k+1}; the combinatorial aspect of the problem somehow vanishes, and ‖ŝ_k‖ decreases more rapidly. □
Let us conclude this section with a comment similar to Remark II.2.4.4: numerical analysis is difficult, in that theory and practice often contradict each other. More precisely, a theoretical framework establishing specific properties of an algorithm may not be relevant for the actual behaviour of that algorithm. Here, it is difficult to find an appropriate theoretical framework explaining why the bundling algorithm works so much better in its expensive form. It is also difficult to explain why its actual behaviour contradicts so much the theory.
3 Putting the Mechanism in Perspective
In this section, we look at the bundling algorithm from some different points of view, and we relate it to various topics.

3.1 Bundling as a Substitute for Steepest Descent
Remark 3.1.1 As already mentioned in Remark 1.4, each cycle of the bundling mechanism generates a subgradient s_{k+1} lying in the face of ∂f(x) exposed by the direction d_k.
This s_{k+1} is interesting for d_k itself: not only is d_k uphill (because ⟨s_{k+1}, d_k⟩ ≥ 0), but we can say more. In terms of the descent property of d_k, the subgradient s_{k+1} is the worst possible; or, reverting the argument, s_{k+1} is the best possible in terms of the useful information concerning d_k. Figure 2.1.2 shows how to interpret this property geometrically: s_{k+1} (there s₂), known to lie in the dashed area, is even at the bottom tip of it. □
In light of this remark, we see that the subgradients generated by the bundling mechanism are fairly special ones. Their special character is also enhanced by the black box (U1): the way s(x) is computed in practice makes it likely that it will lie on the boundary of ∂f(x), or even at an extreme point. For example, in the case of a minimax problem (§VIII.2), each s(x), and hence each s_{k+1}, is certainly the gradient at x of some differentiable active function.
In summary, the bundling mechanism 1.6 can be viewed as a filter, which extracts some carefully selected subgradients from the whole set ∂f(x). What is even more interesting is that it has a tendency to select the most interesting ones. Let us admit that the directions d_k are not completely random - they are probably closer and closer to directions of interest, namely descent directions. Then the subgradients s_{k+1} generated by the algorithm are also likely to "flirt" with the subgradients of interest, namely those near the face of ∂f(x) exposed by (steepest) descent directions. In other words, our algorithm can be interpreted as selecting those subgradients that are important for defining descent directions, disregarding the "hidden side of the moon"; see Fig. 3.1.1, where the visible side of the moon is the set of faces of ∂f(x) that are exposed by descent directions.
Suppose now that

∂f(x) = conv {s¹, …, s^m} (3.1.1)
is a known compact convex polyhedron, with m possibly large. When Algorithm 1.6 is applied to this example, the full set (3.1.1) is replaced by a growing sequence of internal approximations S_k. Correspondingly, there is a sequence of approximate steepest-descent directions d_k, with its sequence of approximate derivatives −‖d_k‖² (see Remark 1.5).
Now, it is reasonable to admit that each subgradient s_{k+1} generated by the algorithm is extracted from the list {s¹, …, s^m}: first, (U1) is likely to choose s(x) in that list; and second, each s_{k+1} is in the face of ∂f(x) exposed by d_k; remember Remark 1.4. Then, projecting onto S_k is a simpler problem than projecting onto (3.1.1). Except for its stopping test, the bundling algorithm really provides a constructive way to compute ŝ = Proj 0/∂f(x).
Of course, this interpretation is of limited value when a descent direction does exist: Algorithm 1.6 stops much too early, namely as soon as

f′(x, d_k) < 0

(knowing that d_k = −ŝ_k). A real algorithm computing ŝ should stop much later, namely when f′(x, d_k) is "negative enough":

f′(x, d_k) ≤ −‖d_k‖²,

since the latter characterizes ŝ_k as the projection of the origin onto ∂f(x).
On the other hand, when x is optimal, the problem of computing ŝ = 0 becomes that of finding the correct convex multipliers, satisfying Σ_{j=1}^m α_j s^j = 0 in (3.1.1). Then the bundling Algorithm 1.6 is a valid alternative to direct methods for projection. In fact, if (3.1.1) is complex enough, Algorithm 1.6 can be a very efficient approach: it consists of replacing the single convex quadratic problem of projecting the origin onto (3.1.1) by a sequence of similar problems with much fewer generators.
Example 3.1.2 The test-problem TR48 of 2.2.6 illustrates our point in a particularly spectacular way. We have said that, at an optimum x̄, there are an order of 10¹¹ active functions; and there are just as many possible values of s(x̄), as computed by (U1). On the other hand, it is (probably) necessary to have 48 of them to obtain the correct convex multipliers α_j in Σ_j α_j s_j = 0 (this is Carathéodory's theorem in ℝ⁴⁸, taking into account that all subgradients are actually in ℝ⁴⁷ because of degeneracy).
Figure 2.2.3 reveals a remarkable phenomenon: in the expensive variant of Algorithm 2.1.5, ‖d₆₅‖ is zero within roundoff error. In other words, the bundling mechanism has managed to extract only 65 "interesting" subgradients: to obtain a correct set of multipliers, it has made only 17 (= 65 − 48) "mistakes", among the 10¹¹ possible ones. □
To conclude, we add that all intermediate black boxes (U1) are conceivable: in this chapter, we considered only the simplest one, computing just one subgradient; Chap. VIII required the most sophisticated one, computing the full set (3.1.1). When (U1) returns richer information, Algorithm 2.1.5 will converge faster, but possibly at the price of a more expensive iteration; among other things, each individual projection will be more difficult. The bundling approach is open to a proper balance between simplicity (one subgradient) and rapidity (all of them).
3.2 Bundling as an Emergency Device for Descent Methods

Another, more subtle, way of looking at the bundling mechanism comes from (vi) in the introduction to §3.1. The rather delicate material contained in this Section 3.2 is an introduction to the numerical Chaps. XIII and XIV. Consider an "ordinary" method such as those in Chap. II for minimizing a smooth function. Within one iteration, consider the line-search: given the current x and d, and setting q(t) := f(x + td), we want to find a convenient stepsize t > 0, satisfying in particular q(t) < q(0).
Anyone familiar with optimization algorithms has experienced line-searches which "do not work", just because q doggedly refuses to decrease. A frequent reason is that the s(x) computed by (U1) is not the gradient of f at x - because of some programming mistake. Another reason, more fundamental, is ill-conditioning: although q′(0) < 0, the property q(t) < q(0) may hold only for values of t > 0 so small that the computer cannot find any; remember Fig. II.1.3.1. In this situation, any reasonable line-search can only produce a sequence of trial stepsizes tending to 0. In the terminology of §II.3, t_L ≡ 0, t_R ↓ 0: the minimization algorithm is stuck at the present iteration. Normally there is not much to do, then - except checking (U1)!
It is precisely in this situation that the mechanism of §2 can come into play. As stated, Algorithm 1.2 can be viewed as the beginning of a "normal" line-search, of the type of those in §II.3. More precisely, compare the line-search 1.2 and Fig. II.3.3.1 with m = 0. As long as t_L does not become positive, i.e. as long as q(t) ≥ q(0), both line-searches do the same thing: they perform interpolations aimed at pulling t towards the region having q(t) < q(0) (if it is not void). It is only when some t_L > 0 is found, that the difference occurs. Then the line-search 1.2 becomes pointless, simply because the bundling Algorithm 1.6 stops: the local problem of finding a descent direction is solved.
If, however, the question is considered from the higher point of view of really minimizing f, it is certainly not a good idea to stop the line-search 1.2 as we do, when q(t) < q(0). Rather, it is time to call t_L the present t, and to start a second phase in the line-search: our aim is now to find a convenient t, satisfying some criterion (O) in the terminology of §II.3.2. In other words: knowing that the line-search 1.2 is just a slave of Algorithm 1.6, it should be a good idea to complement it with some mechanism coming from §II.3.2.
The result would be a compound method, outlined as Algorithm 3.2.1: a direction-finding Step 1, a line-search Step 2 of the type of §II.3, and a bundling Step 3 acting as an emergency device.
Remark 3.2.2 We do not specify how to compute the direction in Step 1. Let us just observe that, if it is the gradient method of Definition II.2.2.2 that is chosen, then Algorithm 3.2.1 just becomes the method we started from (§2.1), simulating the steepest descent. It is in Chap. XIII that we will start studying approaches making possible a harmonious combination of Steps 1 and 2.
Our interpretation of Algorithm 1.6, as an anti-zigzagging mechanism for a descent method, suggests a way in which one might suppress the difficulty (v) of the introduction to §3.1. Instead of letting t = t_R ↓ 0, it is sensible in Algorithm 3.2.1 to switch to Step 3 when t_R becomes small. To make this kind of decision, we will see that the following considerations are relevant:
- The expected decrease of q in the current direction d should be estimated. If it is small, d can be relinquished.
- Allowing such a "nonzero null-step" implies appending to S_k some vectors which are not in ∂f(x). One should therefore estimate how close s(x + td) is to ∂f(x); this makes sense because of the outer semi-continuity of ∂f.
Thus, the expression "small stepsize" can be given an appropriate meaning, in terms of objective-value differences, and of subgradient proximity. □
An interesting point here is to see how Step 2 could be implemented, along the lines of §II.3. The issue would be to inject into the tests (O), (R), (L) a provision for the switch to Step 3, when t is small enough - whatever this means. In anticipation of Chap. XIV, and without giving too much detail, let us mention the starting ideas:
- First, the initial derivative q′(0) is needed. Although it exists (being the directional derivative f′(x, d) along the current direction d) it is not known. Nevertheless, Remark 1.5 gives a substitute: we have −‖d‖² ≤ q′(0), with equality if the current polyhedron S approximates the subdifferential ∂f(x) well enough.
- Then the descent-test, declaring that t is not too large, is that of (II.3.2.1):

q(t) ≤ q(0) − mt‖d‖², (3.2.1)

while the null-step test is

⟨s(x + td), d⟩ ≥ −m′‖d‖². (3.2.2)
Remark 3.2.3 The need for a null-step appears when (3.2.1) is never satisfied, despite repeated interpolations. If the coefficients m and m′ are properly chosen, namely m < m′, then we have

⟨s(x + td), d⟩ ≥ [q(t) − q(0)]/t > −m‖d‖² > −m′‖d‖² ;

here the first inequality expresses that s(x + td) ∈ ∂f(x + td), and the second is the negation of (3.2.1). We see that (3.2.2) holds automatically in this case; one more manifestation of the second part in Remark 3.1.1. On the other hand, if the black box (U1) is bugged, the line-search may well generate a sequence t_R ↓ 0 which never satisfies (3.2.1) nor (3.2.2). Then, even our rescuing mechanism is helpless. The same phenomenon may happen with a nonconvex f (but a fairly pathological one, indeed). There is a moral: a totally arbitrary function, such as the output of a wrong (U1), cannot be properly minimized! □
3.3 Bundling as a Separation Algorithm

Given a compact convex set S not containing the origin, call separating hyperplane a pair (d, r) such that

⟨d, s⟩ < r for all s ∈ S and r < ⟨d, s′⟩ for all s′ ∈ {0}.

In simpler terms, it is a d ∈ ℝⁿ such that σ_S(d) < 0. If

S := {s ∈ ℝⁿ : c(s) ≤ 0}

is defined by constraints (several constraints c_j can be included in c, simply by setting c = max_j c_j), then answering the question is easy: it amounts to computing c(0) and a subgradient d of c at 0. In fact, 0 ∈ S exactly when c(0) ≤ 0. In case c(0) > 0, then any d ∈ ∂c(0) defines a separating hyperplane. The reason is that the property

c(s) ≥ c(0) + ⟨d, s⟩ for all s ∈ ℝⁿ

implies, for every s ∈ S, ⟨d, s⟩ ≤ c(s) − c(0) ≤ −c(0) < 0.
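For a polyhedron given by linear inequalities, this recipe takes a few lines (our sketch; the representation S = {s : As ≤ b} is an assumption for illustration):

```python
import numpy as np

def separating_hyperplane(A, b):
    """S = {s : <a_i, s> <= b_i} written as c(s) := max_i(<a_i, s> - b_i) <= 0.
    If c(0) > 0 (the origin lies outside S), the gradient a_i* of a row active
    in the max at 0 is a subgradient of c there, and separates: for every s
    in S, <a_i*, s> <= b_i* = -c(0) < 0 = <a_i*, 0>."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    c0 = np.max(-b)                  # c(0) = max_i (-b_i)
    if c0 <= 0:
        return None                  # 0 belongs to S: no separation possible
    return A[np.argmax(-b)]          # a subgradient of c at 0
```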
Remark 3.3.2 It is interesting to recall that the above set of separating hyperplanes (the cone of descent directions!) has a geometric characterization. When S does not contain the origin, σ_S is not minimal at 0. In view of Theorem VI.1.3.4, our set of separating hyperplanes is therefore given by the interior of the polar [cone S]° of the cone generated by S. Incidentally, note also from the compactness of S that the conical hull cone S is closed (Proposition III.1.4.7). □
When a mere descent direction was sought, the bundling Algorithm 1.6 was stopped as soon as σ_S(d_k) became negative. Here, to compute a steepest-descent direction, Algorithm 3.3.1 must continue until σ_S(d_k) comes close to −‖d_k‖², which characterizes the projection (cf. §3.1). We therefore deduce that, if the algorithm ran forever, we would have

⟨ŝ_k, s_j − s_{k+1}⟩ ≥ δ > 0 for j = 1, …, k and k = 1, 2, …;

repeating this inequality infinitely often would lead to a contradiction: we could extract a further subsequence such that s_j − s_{k+1} → 0. □
while the black box for Algorithm 3.3.1 is "optimal": among all possible such s_{k+1},
it selects one having the largest possible such scalar product.
Then a natural question arises: is this reflected by a faster convergence of Al-
gorithm 3.3.1? A complete answer is unknown but we mention the following partial
result.
Assume d_k ≠ 0 (otherwise there is nothing to prove) and, for δ > 0, define s(δ) :=
δd_k/‖d_k‖ ∈ B(0, δ). Because 0 ∈ S, the affine and linear hulls of S coincide, and
s(δ) ∈ aff S. Then, by Definition III.2.1.1 of the relative interior, our assumption
implies that s(δ) ∈ S if δ is small enough. In this case
Of course, nothing of this sort holds for Algorithm 1.6: observe in Fig. 2.1.1
that ∂f(x) may contain some points with negative ordinates, without changing the
sequence {s_k}. The assumption for the above result is interesting, to the extent that
S is a subdifferential and that we are checking optimality of our given x: remember
from the end of §VI.2.2 that, if x = x̄ actually minimizes f, the property 0 ∈ ri ∂f(x̄)
is natural.
We illustrate this by the following numerical experiments. Take a convex compact
set S containing the origin and consider two forms of Algorithm 3.3.1, differing in
their Step 2, i.e. in their black box.
- The first form is Algorithm 3.3.1 itself, in which Step 2 does solve (3.3.1) at each
iteration.
- In the second form, Step 2 merely computes an s_{k+1} such that ⟨s_{k+1}, d_k⟩ = 0 (note:
since 0 ∈ S, σ_S(d) ≥ 0 for all d). This second form is therefore pretty much like
Algorithm 1.6. It is even ideal for illustrating Lemma 2.1.4 with equality in (2.1.10).
Just how these two forms can be constructed will be seen in §XIII.3. Here, it
suffices to say that the test-problem TR48 is used; the set S in question is a certain
enlargement of the subdifferential at a non-optimal x; this enlargement contains the
origin; Algorithm 3.3.1 is implemented via Algorithm 1.6, with two variants of the
line-search 1.2.
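A small experiment of our own, replacing TR48 by a random finite set S ⊂ ℝ⁵ (so that conv S is a polytope) and approximating the projection by a Frank-Wolfe loop, mimics the comparison: the "support" black box answers argmax ⟨s, d_k⟩ over S, while the "orthogonal" black box merely returns an element whose scalar product with d_k is nonnegative and as small as possible.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(30, 5))
S[0] = 0.0                                   # S contains the origin

def proj_hull(P, iters=400):
    """Approximate projection of 0 onto conv(P): a Frank-Wolfe sketch."""
    a = np.full(len(P), 1.0 / len(P))        # convex multipliers
    for k in range(iters):
        g = P @ (P.T @ a)                    # gradient of a -> 0.5*||P^T a||^2
        j = int(np.argmin(g))                # best vertex of the simplex
        step = 2.0 / (k + 2.0)
        a = (1 - step) * a
        a[j] += step
    return P.T @ a

def run(oracle, iters=40):
    bundle = [S[1]]
    norms = []
    for _ in range(iters):
        d = -proj_hull(np.array(bundle))     # current direction d_k
        norms.append(np.linalg.norm(d))
        vals = S @ d
        if oracle == "support":              # solve (3.3.1): max <s, d_k> over S
            s_new = S[int(np.argmax(vals))]
        else:                                # <s_{k+1}, d_k> >= 0, as small as possible
            nonneg = np.where(vals >= 0)[0]
            s_new = S[nonneg[int(np.argmin(vals[nonneg]))]]
        bundle.append(s_new)
    return norms

print(run("support")[-1], run("orthogonal")[-1])  # ||d_k|| shrinks, faster with "support"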
Fig. 3.3.1. Two forms of Algorithm 3.3.1 with TR48 (‖d_k‖ on a logarithmic scale versus the iteration count k; one curve for the support black box, one for the orthogonal black box)
Observe the decrease of ‖d_k‖ at the end of the first variant (finite convergence, cf. Remark 2.2.8).
With another test, based on MAXQUAD of VIII.3.4.2, 0 is a boundary point of S ⊂ ℝ¹⁰.
The results are plotted in Fig. 3.3.2: ‖d_k‖ decreases much more regularly. Finally,
Figs. 3.3.3 and 3.3.4 give another illustration, using the test-problem TSP of 2.2.7. In
all these pictures, note the same qualitative behaviour of ‖d_k‖.
Fig. 3.3.3. Two forms of Algorithm 3.3.1 with TSP120 (again ‖d_k‖ versus k, for the support black box and the orthogonal black box)
"
a
,~
"
__-_2_~--~~----------~~------~~----\~
' k
100 600
Fig. 3.3.4. Two forms of Algorithm 3.3.1 with TSP442
X. Conjugacy in Convex Analysis
Prerequisites. Definitions, properties and operations concerning convex sets (Chap. III)
and convex functions (Chap. IV); definition of sublinear functions and associated convex
sets (Chap. V); definitions of the subdifferential of a finite convex function (§VI.1). One-
dimensional conjugacy (§I.6) can be helpful but is not necessary.
dh = ⟨s, dx⟩ + ⟨ds, x⟩ − ⟨∇f(x), dx⟩ = ⟨s, dx⟩ + ⟨ds, x⟩ − ⟨s, dx⟩ = ⟨ds, x⟩

and this defines x as the gradient of h with respect to s.
is called the Legendre transform (relative to f). It should never be forgotten that, from
the above motivation itself, it is not really the function h which is primarily interesting,
but rather its gradient ∇h. □
The gradients of f and of h are inverse to each other by definition, so they establish
a reciprocal correspondence:
and the Legendre transform - in the classical sense of the term - is well-defined when
this problem has a unique solution.
Let us sum up: if f is convex, (0.3) is a possible way of defining the Legendre
transform when it exists unambiguously. It is easy to see that the latter holds when f
satisfies three properties (a model case is worked out after the list):
- differentiability - so that there is something to invert;
- strict convexity - to have uniqueness in (0.3);
- ∇f(ℝⁿ) = ℝⁿ - so that (0.3) does have a solution for all s ∈ ℝⁿ; this latter
property essentially means that, when ‖x‖ → ∞, f(x) increases faster than any
linear function: f is 1-coercive.
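As a model case (ours) where all three conditions hold, take f(x) = ½⟨Qx, x⟩ with Q symmetric positive definite; the inversion of ∇f is explicit and the Legendre transform can be computed in closed form:

```latex
% f(x) = 1/2 <Qx, x>, Q symmetric positive definite (our worked assumption)
\nabla f(x) = Qx \quad\Longrightarrow\quad x(s) = Q^{-1}s \ \text{ solves (0.3) for every } s,
\qquad
h(s) = \langle s,\, Q^{-1}s\rangle - \tfrac12\langle s,\, Q^{-1}s\rangle
     = \tfrac12\langle s,\, Q^{-1}s\rangle .
```

This agrees with Example 1.1.3: conjugating a positive definite quadratic form amounts to inverting its operator, and ∇h = Q⁻¹ is indeed the inverse of the gradient mapping ∇f.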
In all other cases, there is no well-defined Legendre transform; but then, the
transformation implied by (0.3) can be taken as a new definition, generalizing the
initial inversion of ∇f. We can even extend this definition to nonconvex f, namely
to any function such that (0.3) is meaningful! Finally, an important observation is
that the infimal value in (0.3) is a concave, and even closed, function of s; this is
Proposition IV.2.1.2: the infimand is affine in s, f has little importance, ℝⁿ is nothing
more than an index set.
The concept of conjugacy in convex analysis results from all the observations
above. It is often useful to simplify algebraic calculations; it plays an important role
in deriving duality schemes for convex minimization problems; it is also a basic
operation for formulating variational principles in optimization (convex or not), with
applications in other areas of applied mathematics, such as probability, statistics,
nonlinear elasticity, economics, etc.
Note in particular that this implies f(x) > −∞ for all x. As usual, we use the notation
dom f := {x : f(x) < +∞} ≠ ∅. We know from Proposition IV.1.2.1 that (1.1.1)
holds for example if f ∈ Conv ℝⁿ.
Definition 1.1.1 The conjugate of a function f satisfying (1.1.1) is the function f*
defined by

f*(s) := sup {⟨s, x⟩ − f(x) : x ∈ dom f} .  (1.1.2)

For simplicity, we may also let x run over the whole space instead of dom f.
The mapping f ↦ f* will often be called the conjugacy operation, or the
Legendre-Fenchel transform. □
Furthermore, this inequality is obviously true if x ∉ dom f: it does hold for all
(x, s) ∈ ℝⁿ × ℝⁿ and is called Fenchel's inequality. Another observation is that
f*(s) > −∞ for all s ∈ ℝⁿ; also, if x ↦ ⟨s₀, x⟩ + r₀ is an affine function smaller
than f, we have

−f*(s₀) = inf_x [f(x) − ⟨s₀, x⟩] ≥ r₀ , i.e. f*(s₀) ≤ −r₀ < +∞ .
Example 1.1.4 (Convex Degenerate Quadratic Functions) Take the convex qua-
dratic function (1.1.4), but with Q symmetric positive semi-definite. The supremum
in (1.1.2) is finite only if s − b ∈ (Ker Q)⊥, i.e. s − b ∈ Im Q; it is attained at an x
such that s − ∇f(x) = 0 (optimality condition VI.2.2.1). In a word, we obtain

f*(s) = { +∞ if s ∉ b + Im Q ,
        { ½⟨x, s − b⟩ otherwise,
which is one more illustration of (0.1): add f(x) to both sides and obtain

f*(s) = { +∞ if s ∉ b + Im Q ,
        { ½⟨s − b, Q⁻(s − b)⟩ if s ∈ b + Im Q . □
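A quick numerical confirmation of this formula (our own sketch; the specific Q, b and the grid are arbitrary choices) compares the pseudo-inverse expression with a brute-force evaluation of the supremum in (1.1.2):

```python
import numpy as np

# f(x) = 1/2 <Qx, x> + <b, x> with a singular positive semi-definite Q
Q = np.diag([2.0, 0.0])
b = np.array([1.0, 0.0])
Qpinv = np.linalg.pinv(Q)                     # Moore-Penrose pseudo-inverse Q^-

def fstar_formula(s):
    r = s - b
    if abs(r[1]) > 1e-9:                      # s - b must lie in Im Q = R x {0}
        return np.inf
    return 0.5 * r @ Qpinv @ r

def fstar_bruteforce(s, R=50.0, N=801):
    xs = np.linspace(-R, R, N)
    X, Y = np.meshgrid(xs, xs)
    f = 0.5 * (Q[0, 0] * X**2 + Q[1, 1] * Y**2) + b[0] * X + b[1] * Y
    return np.max(s[0] * X + s[1] * Y - f)

s = np.array([3.0, 0.0])
print(fstar_formula(s), fstar_bruteforce(s))   # both evaluate to 1.0
```

For s with a nonzero second component, the brute-force supremum grows without bound as the grid widens, in agreement with the +∞ branch of the formula.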
(½‖P_H(·)‖²)*(s) = { +∞ if s ∉ H ,
                  { ½‖s‖² if s ∈ H .
Example 1.1.5 Let I_C be the indicator function of a nonempty set C ⊂ ℝⁿ. Then
(I_C)* = σ_C, the support function of C.
1.2 Interpretations
to see that r = f*(s). This means that the best H_{s,r} intersects the vertical axis {0} × ℝ
at the altitude −f*(s).
The picture illustrates another definition of f*: being normal to H_{s,r}, the vector
(s, −1) is also normal to gr f (more precisely to epi f) at the contact when it exists.
σ_{epi f}(s, −u) = { u f*(s/u) if u > 0 ,
                 { σ_{dom f}(s) if u = 0 ,  (1.2.2)
                 { +∞ if u < 0 .
and the first equality is established. As for (1.2.2), the case u < 0 is trivial; when
u > 0, use the positive homogeneity of support functions to get
PROOF. Use direct calculations; or see Proposition IV.2.2.2 and the calculations in
Example IV.3.2.4. □
Now take the polar cone (K_f)° ⊂ ℝⁿ × ℝ × ℝ. We know from Interpretation V.2.1.6
that it is the epigraph (in ℝ^{n+1} × ℝ) of the support function of epi f. In view of
Proposition 1.2.1, its intersection with ℝⁿ × {−1} × ℝ is therefore epi f*. A short
calculation confirms this:
Imposing a = −1, we just obtain the translated epigraph of the function described in
(1.2.1).
Figure 1.2.2 illustrates the construction, with n = 1 and epi f drawn horizontally;
this epi f plays the role of S in Fig. V.2.1.2. Note once more the symmetry: K_f [resp.
(K_f)°] is defined in such a way that its intersection with the hyperplane ℝⁿ × ℝ × {−1}
[resp. ℝⁿ × {−1} × ℝ] is just epi f [resp. epi f*].
Finally, we mention a simple economic interpretation. Suppose ℝⁿ is a set of goods and
its dual (ℝⁿ)* a set of prices: to produce the goods x costs f(x), and to sell x brings an
income ⟨s, x⟩. The net benefit associated with x is then ⟨s, x⟩ − f(x), whose supremal value
f*(s) is the best possible profit, resulting from the given set of unit prices s. Incidentally, this
last interpretation opens the way to nonlinear conjugacy, in which the selling price would be
a nonlinear (but concave) function of x.
Remark 1.2.3 This last interpretation confirms the warning already made on several oc-
casions: ℝⁿ should not be confused with its dual; the arguments of f and of f* are not
comparable, an expression like x + s is meaningless (until an isomorphism is established
between the Euclidean space ℝⁿ and its dual).
On the other hand, f-values and f*-values are comparable indeed: for example, they can
be added to each other - which is just what Fenchel's inequality (1.1.3) does! This is explicitly
due to the particular value "−1" in (1.2.1), which goes together with the "−1" of (1.2.4). □
Some properties of the conjugacy operation f ↦ f* come directly from its definition.
(a) Elementary Calculus Rules Direct arguments easily prove a first result:
and, assuming that ℝⁿ has the scalar product of a product-space,

f*(s₁, …, s_m) = Σ_{j=1}^m f_j*(s_j) . □
Among these results, (iv) deserves comment: it gives the effect of a change of variables
on the conjugate function; this is of interest for example when the scalar product is put in the
form ⟨x, y⟩ = (Ax)ᵀAy, with A invertible. Using Example 1.1.3, an illustration of (vii) is:
the only function f satisfying f = f* is ½‖·‖² (start from Fenchel's inequality); and also,
for symmetric positive definite Q and P:
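The self-conjugacy claim deserves its one-line proof; here is the short derivation (ours), exactly along the hint, starting from Fenchel's inequality (1.1.3) with s = x:

```latex
2 f(x) \;=\; f(x) + f^*(x) \;\geq\; \langle x, x\rangle \;=\; \|x\|^2
\quad\Longrightarrow\quad f \;\geq\; \tfrac12\|\cdot\|^2 .
```

Conjugation reverses inequalities, so f = f* ≤ (½‖·‖²)* = ½‖·‖², and equality f = ½‖·‖² follows.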
Our next result expresses how the conjugate is transformed, when the starting
function is restricted to a subspace.
Proposition 1.3.2 Let f satisfy (1.1.1), let H be a subspace of ℝⁿ, and call P_H the
operator of orthogonal projection onto H. Suppose that there is a point in H where
f is finite. Then f + I_H satisfies (1.1.1) and its conjugate is

(f + I_H)* = (f + I_H)* ∘ P_H .  (1.3.1)

PROOF. When y describes ℝⁿ, P_H y describes H, so we can write, knowing that P_H is
symmetric:
Remark 1.3.3 Thus, if a subspace H and a function g (standing for f + I_H) are such that
dom g ⊂ H, then g*(· + s) = g* for all s ∈ H⊥. It is interesting to note that this property
has a converse: if, for some x₀ ∈ dom g, we have g(x₀ + y) = g(x₀) for all y ∈ H, then
dom g* ⊂ H⊥. The proof is immediate: take x₀ as above and, for s ∉ H⊥, take y₀ ∈ H such
that ⟨s, y₀⟩ = a ≠ 0; there holds ⟨s, λy₀⟩ − g(x₀ + λy₀) = λa − g(x₀) for all λ ∈ ℝ; hence
g*(s) = +∞.
Proposition 1.3.4 For f satisfying (1.1.1), let a subspace V contain the subspace
parallel to aff dom f and set U := V⊥. For any z ∈ aff dom f and any s ∈ ℝⁿ
decomposed as s = s_U + s_V, there holds

f*(s) = ⟨s_U, z⟩ + f*(s_V) .

PROOF. In (1.1.2), the variable x can range through z + V ⊃ aff dom f:
This operation appears as fundamental. The function f** thus defined is the "closed-
convexification" of f, in the sense that its epigraph is the closed convex hull of epi f:
Theorem 1.3.5 For f satisfying (1.1.1), the function f** of (1.3.2) is the pointwise
supremum of all the affine functions on ℝⁿ majorized by f. In other words
PROOF. Call Σ ⊂ ℝⁿ × ℝ the set of pairs (s, r) defining affine functions x ↦ ⟨s, x⟩ − r
majorized by f:
Note: the biconjugate of an f ∈ Conv ℝⁿ is not exactly f itself but its closure:
f** = cl f. Thanks to Theorem 1.3.5, the general notation
can be - and will be - used for a function simply satisfying (1.1.1); it reminds one
more directly that f** is the closed convex hull of f:
PROOF. Immediate. □
Thus, the conjugacy operation defines an involution on the set of closed convex func-
tions. When applied to strictly convex quadratic functions, it corresponds to the inversion of
symmetric positive definite operators (Example 1.1.3 with b = 0). When applied to indicators
of closed convex cones, it corresponds to the polarity correspondence (Example 1.1.5 - note
also in this example that the biconjugate of the square root of a norm is the zero-function).
For general f ∈ Conv ℝⁿ, it has a geometric counterpart, also based on polarity, which is the
correspondence illustrated in Fig. 1.3.1. Of course, this involution property implies a lot of
symmetry, already alluded to, for pairs of conjugate functions.
Fig. 1.3.1. The *-involution (f ∈ Conv ℝⁿ and f* ∈ Conv ℝⁿ exchanged by the operation (·)*)
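This involution can be watched numerically. The sketch below (ours) computes a discrete Legendre-Fenchel transform on a grid: conjugating a nonconvex "double-well" function twice produces its closed convex hull, which flattens the well between the two minima.

```python
import numpy as np

# Discrete Legendre-Fenchel transform on a grid (a sketch): conjugating twice
# returns the closed convex hull of f, illustrating the *-involution above.
x = np.linspace(-2.0, 2.0, 401)
s = np.linspace(-8.0, 8.0, 401)
f = (x**2 - 1.0)**2                          # nonconvex double-well function

fstar = np.max(s[:, None] * x[None, :] - f[None, :], axis=1)      # f*(s)
fss = np.max(x[:, None] * s[None, :] - fstar[None, :], axis=1)    # f**(x)

assert np.all(fss <= f + 1e-9)               # f** minorizes f ...
print(float(fss[200]))                       # ... and f**(0) = 0 < f(0) = 1
```

The printed value is (up to grid error) the closed convex hull of f at 0, which vanishes on [−1, 1] although f(0) = 1.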
(c) Conjugacy and Coercivity A basic question in (1.1.2) is whether the supremum
is going to be +∞. This depends only on the behaviour of f at infinity, so we extend
to non-convex situations the concepts seen in Definition IV.3.2.6:
Definition 1.3.7 A function f satisfying (1.1.1) is said to be 0-coercive [resp. 1-
coercive] when f(x) → +∞ [resp. f(x)/‖x‖ → +∞] for ‖x‖ → +∞.
Remark 1.3.10 If we assume in particular f ∈ Conv ℝⁿ, (i) and (ii) become equiv-
alences:

x₀ ∈ int dom f ⟺ f* − ⟨x₀, ·⟩ is 0-coercive ,
dom f = ℝⁿ ⟺ f* is 1-coercive . □
Theorem 1.4.1 For f satisfying (1.1.1) and ∂f defined by (1.4.1), s ∈ ∂f(x) if and
only if

f*(s) + f(x) − ⟨s, x⟩ = 0 (or ≤ 0) .  (1.4.2)

PROOF. By definition (1.4.1), s ∈ ∂f(x) means that ⟨s, y⟩ − f(y) ≤ ⟨s, x⟩ − f(x) for all y,
i.e.

f*(s) ≤ ⟨s, x⟩ − f(x) ;

but this is indeed an equality, in view of Fenchel's inequality (1.1.3). □
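A one-dimensional instance (ours) of this criterion: take f(x) = |x|, whose conjugate is f* = I_{[−1,1]}. Then (1.4.2) reads

```latex
f^*(s) + f(0) - \langle s, 0\rangle \;=\; I_{[-1,1]}(s) \;\leq\; 0
\iff s \in [-1, 1], \qquad\text{i.e.}\quad \partial f(0) = [-1, 1] .
```

At x > 0, the same criterion forces s ∈ [−1, 1] and x − sx = 0, so ∂f(x) = {1}, as expected.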
As before, ∂f(x) is closed and convex: it is a sublevel-set of the closed convex
function f* − ⟨·, x⟩, namely at the level −f(x). A subgradient of f at x is the slope
of an affine function minorizing f and coinciding with f at x; ∂f(x) can therefore
be empty: for example if epi f has a vertical tangent hyperplane at (x, f(x)); or also
if f is not convex-like near x.

∂f(x) ≠ ∅ ⟹ (co f)(x) = f(x) ;  (1.4.3)
PROOF. Let s be a subgradient of f at x. From the definition (1.4.1) itself, the function
y ↦ ℓ_s(y) := f(x) + ⟨s, y − x⟩ is affine and minorizes f, hence is ≤ co f ≤ f;
because ℓ_s(x) = f(x), this implies (1.4.3).
Now, s ∈ ∂f(x) if and only if
PROOF. This is a rewriting of Theorem 1.4.1, taking into account (1.4.5) and the
symmetric role played by f and f* when f ∈ Conv ℝⁿ. □
This inclusion is strict: think of f(x) = |x|^{1/2}. Equality holds under appropriate
assumptions, as can be seen from our analysis below (see Remark 1.5.7).
The next results relate the smoothness of f and of co f.
Let t ↓ 0 to obtain ⟨∇f(x), d⟩ ≤ ⟨s, d⟩ for all d, and this implies ∇f(x) = s. The
second equality then follows using (1.4.3) and (1.4.4). □
Thus, the global property (ii) is just what is missing in the local property (i) for a
stationary point to be a global minimum. One concludes for example that the differentiable
functions whose stationary points are global minima are those f for which
PROOF. Take a sequence {(x_k, r_k)} in co(epi f) converging to (x, r) for k → +∞; in
order to establish (x, r) ∈ co(epi f), we will prove that (x, p) ∈ co(epi f) for some
p ≤ r.
By definition of a convex hull in ℝ^{n+1}, there are n + 2 sequences {(x_k^i, r_k^i)} with
f(x_k^i) ≤ r_k^i, and a sequence {α_k} in the unit simplex Δ_{n+2} such that

(x_k, r_k) = Σ_{i=1}^{n+2} α_k^i (x_k^i, r_k^i) .
[Step 1] For each i, the sequence {α_k^i r_k^i} is bounded: in fact, because of (1.5.2), f is
bounded from below, say by μ; then we write

α_k^i r_k^i = r_k − Σ_{j≠i} α_k^j r_k^j ≤ r_k − μ(1 − α_k^i) ,

and our claim is true because {r_k} and {α_k^i} are bounded.
The sequences {r_k^i} are also bounded from below by μ. Now, if some {r_k^i} is not
bounded from above, go to Key Step; otherwise go to Last Step.
[Key Step] We proceed to prove that, if {r_k^i} is not bounded from above for some
index i, the corresponding sequence can be omitted from the convex combination.
Assume without loss of generality that 1 is such an index and extract a subsequence
if necessary, so r_k^1 → +∞. Then

α_k^1 = (α_k^1 r_k^1)/r_k^1 → 0 and ‖α_k^1 x_k^1‖ = (α_k^1 r_k^1) ‖x_k^1‖/r_k^1 → 0 ,

the latter because ‖x_k^1‖/r_k^1 is bounded, f being coercive. Set
η_k^i := α_k^{i+1}/(1 − α_k^1) for i = 1, …, n + 1 .

We have

Σ_{i=1}^{n+1} η_k^i x_k^{i+1} = (x_k − α_k^1 x_k^1)/(1 − α_k^1) → x ,
μ ≤ Σ_{i=1}^{n+1} η_k^i r_k^{i+1} = (r_k − α_k^1 r_k^1)/(1 − α_k^1) ≤ (r_k − α_k^1 μ)/(1 − α_k^1) → r .
Let us summarize this key step: starting from the n + 2 sequences of triples
{α_k^i, x_k^i, r_k^i}, we have eliminated one having {r_k^i} unbounded, and thus obtained ℓ =
n + 1 sequences of triples {η_k^i, x_k^i, r_k^i} satisfying

η_k ∈ Δ_ℓ , Σ_{i=1}^ℓ η_k^i x_k^i → x , Σ_{i=1}^ℓ η_k^i r_k^i → p ≤ r .  (1.5.3)

Execute this procedure as many times as necessary, to end up with ℓ ≥ 1 sequences
of triples satisfying (1.5.3), and all having {r_k^i} bounded.
[Last Step] At this point of the proof, each {r_k^i} is bounded; from coercivity of f, each
{x_k^i} is bounded as well. Extracting subsequences if necessary, we are therefore in the
following situation: there are sequences of triples {η_k^i, x_k^i, r_k^i}, i = 1, …, ℓ ≤ n + 2,
satisfying

η_k ∈ Δ_ℓ , η_k → η ∈ Δ_ℓ ;
f(x_k^i) ≤ r_k^i , x_k^i → x^i , r_k^i → r^i for i = 1, …, ℓ ;
Σ_{i=1}^ℓ η_k^i x_k^i → x , Σ_{i=1}^ℓ η_k^i r_k^i → Σ_{i=1}^ℓ η^i r^i = p ≤ r .
Because f is lower semi-continuous, f(x^i) ≤ r^i for i = 1, …, ℓ and the definition
(IV.2.5.3) of co f gives

(co f)(x) ≤ Σ_{i=1}^ℓ η^i f(x^i) ≤ Σ_{i=1}^ℓ η^i r^i = p ≤ r .

In a word, (x, r) ∈ epi co f. □
This closedness property has important consequences:
Proposition 1.5.4 Let f satisfy (1.5.2). Then
(i) co f = cl co f (hence co f ∈ Conv ℝⁿ).
(ii) For any x ∈ dom co f = co dom f, there are x_j ∈ dom f and convex multipliers
α_j for j = 1, …, n + 1 such that

x = Σ_{j=1}^{n+1} α_j x_j and (co f)(x) = Σ_{j=1}^{n+1} α_j f(x_j) .
with f(x_j) ≤ r_j. Actually each r_j has to be f(x_j) if the corresponding α_j is positive,
simply because

Σ_{j=1}^{n+1} α_j (x_j, f(x_j)) ∈ co epi f ,

and, again from the definition of co f(x):

Σ_{j=1}^{n+1} α_j r_j = co f(x) ≤ Σ_{j=1}^{n+1} α_j f(x_j) . □
Theorem 1.5.5 Let f satisfy (1.1.1). For a given x ∈ co dom f, suppose there exists
a family {x_j, α_j} as described in Proposition 1.5.4(ii); set

J := {j : α_j > 0} .

Then
(i) f(x_j) = (co f)(x_j) for all j ∈ J,
(ii) co f is affine on the polyhedron P := co {x_j : j ∈ J}.
[(ii)] Consider the affine function (here {β_j} is a set of convex multipliers)
Theorem 1.5.6 Under the hypotheses and notations of Theorem 1.5.5,

∂(co f)(x) = ∩_{j∈J} ∂f(x_j) ;  (1.5.4)
∀s ∈ ∂(co f)(x), ⟨s, x⟩ − (co f)(x) = ⟨s, x_j⟩ − f(x_j) for all j ∈ J.
PROOF. A subgradient of co f at x is an s characterized by

(co f)*(s) + (co f)(x) − ⟨s, x⟩ = 0  [Theorem 1.4.1]
⟺ f*(s) + Σ_{j∈J} α_j f(x_j) − ⟨s, Σ_{j∈J} α_j x_j⟩ = 0  [(co f)* = f*]
⟺ Σ_{j∈J} α_j [f*(s) + f(x_j) − ⟨s, x_j⟩] = 0
⟺ f*(s) + f(x_j) − ⟨s, x_j⟩ = 0 for all j ∈ J .  [Fenchel (1.1.3)]

The last line means precisely that s ∈ ∂f(x_j) for all j ∈ J; furthermore, we can
write

f(x_j) − ⟨s, x_j⟩ = −f*(s) = −(co f)*(s) = (co f)(x) − ⟨s, x⟩ . □
Note in the right-hand side of (1.5.4) that each ∂f(x_j) could be replaced by
∂(co f)(x_j): this is due to (1.4.4).
Remark 1.5.7 As a supplement to (1.5.1), the above result implies the following: for
f satisfying (1.5.2), Argmin f is a compact set (trivial), whose convex hull coincides
with Argmin(co f). Indeed, use (1.5.4): an x ∈ Argmin(co f) is characterized as
being a convex combination of points {x_j}_{j∈J} such that 0 ∈ ∂f(x_j), i.e. these x_j
minimize f. □
PROOF. Our f satisfies (1.5.2), so Proposition 1.5.4(i) directly gives co f = cl co f. The
latter function is finite everywhere, hence it has a nonempty subdifferential at any x
(Theorem 1.4.2): by virtue of (1.5.4), all the ∂f(x_j)'s are nonempty. Together with
Proposition 1.5.1, we obtain that all these ∂f(x_j)'s are actually the same singleton,
described by (1.5.5). Finally, the continuity of ∇(co f) follows from Theorem VI.6.2.4.
□
It is worth mentioning that, even if we impose more regularity on f (say C^∞),
co f as a rule is not C².
The function f, whose conjugate is to be computed, is often obtained from some other
functions f_i, whose conjugates are known. In this section, we develop a set of calculus
rules expressing f* in terms of the f_i* (some rudimentary such rules were already
given in Proposition 1.3.1).
Let g* be associated with a scalar product ⟨·, ·⟩_m in ℝᵐ; we denote by ⟨·, ·⟩_n the scalar
product in ℝⁿ, with the help of which we want to define (Ag)*. To make sure that Ag
satisfies (1.1.1), some additional assumption is needed: among all the affine functions
minorizing g, there is one with slope in (Ker A)⊥ = Im A*.
Theorem 2.1.1 With the above notation, assume that Im A* ∩ dom g* ≠ ∅. Then
Ag satisfies (1.1.1); its conjugate is

(Ag)* = g* ∘ A* .
PROOF. First, it is clear that Ag ≢ +∞ (take x = Ay, with y ∈ dom g). On the other
hand, our assumption implies the existence of some p₀ = A*s₀ such that g*(p₀) <
+∞; with Fenchel's inequality (1.1.3), we have for all y ∈ ℝᵐ:
For each x ∈ ℝⁿ, take the infimum over those y satisfying Ay = x: the affine function
⟨s₀, ·⟩ − g*(p₀) minorizes Ag. Altogether, Ag satisfies (1.1.1).
Then we have for s ∈ ℝⁿ
where g operates on the product-space ℝⁿ × ℝᵖ. Indeed, just call A the projection
onto ℝⁿ: A(x, z) = x ∈ ℝⁿ, so f is clearly Ag. We obtain:
PROOF. It suffices to observe that, A being the projection defined above, there holds
for all y₁ = (x₁, z₁) ∈ ℝᵐ and x₂ ∈ ℝⁿ,

⟨Ay₁, x₂⟩_n = ⟨x₁, x₂⟩_n = ⟨x₁, x₂⟩_n + ⟨z₁, 0⟩_p = ⟨y₁, (x₂, 0)⟩_m ,

which defines the adjoint A*x = (x, 0) for all x ∈ ℝⁿ. Then apply Theorem 2.1.1.
□
More will be said on this operation in §2.4. Here we consider another application:
the infimal convolution which, to f₁ and f₂ defined on ℝⁿ, associates (see §IV.2.3)

(f₁ ∔ f₂)(x) := inf {f₁(y) + f₂(x − y) : y ∈ ℝⁿ} .  (2.1.3)

Corollary 2.1.3 Let two functions f₁ and f₂ from ℝⁿ to ℝ ∪ {+∞}, not identically
+∞, satisfy dom f₁* ∩ dom f₂* ≠ ∅. Then f₁ ∔ f₂ satisfies (1.1.1), and (f₁ ∔ f₂)* =
f₁* + f₂*.
PROOF. Equipping ℝⁿ × ℝⁿ with the scalar product ⟨·, ·⟩ + ⟨·, ·⟩, we obtain g*(s₁, s₂) =
f₁*(s₁) + f₂*(s₂) (Proposition 1.3.1(ix)) and A*(s) = (s, s). Then apply the definitions.
□
In particular, if f is the indicator of a nonempty set C ⊂ ℝⁿ, the above formulae yield
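For instance (a worked case of Corollary 2.1.3, ours): take f₁ = ½‖·‖² and f₂ = I_C with C ⊂ ℝⁿ nonempty. Their inf-convolution is the squared-distance function, and the corollary gives its conjugate:

```latex
(f_1 \mathbin{\dotplus} f_2)(x) \;=\; \inf_{y\in C}\,\tfrac12\|x-y\|^2 \;=\; \tfrac12\, d_C^2(x),
\qquad
\bigl(\tfrac12 d_C^2\bigr)^* \;=\; f_1^* + f_2^* \;=\; \tfrac12\|\cdot\|^2 + \sigma_C \,.
```

The qualification assumption dom f₁* ∩ dom f₂* = ℝⁿ ∩ dom σ_C ≠ ∅ is automatic here.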
In view of the symmetry of the conjugacy operation, Theorem 2.1.1 suggests that
the conjugate of g ∘ A, when A is a linear mapping, is the image-function A*g*.
In particular, a condition was needed in §2.1 to prevent Ag(x) = −∞ for some x.
Likewise, a condition will be needed here to prevent g ∘ A ≡ +∞. The symmetry is
not quite perfect, though: the composition of a closed convex function with a linear
mapping is still a closed convex function; but an image-function need not be closed,
and therefore cannot be a conjugate function.
We use notation similar to that of §2.1, but we find it convenient to distinguish
between the linear and affine cases.
Theorem 2.2.1 With g ∈ Conv ℝᵐ and A₀ linear from ℝⁿ to ℝᵐ, define A(x) :=
A₀x + y₀ ∈ ℝᵐ and suppose that A(ℝⁿ) ∩ dom g ≠ ∅. Then g ∘ A ∈ Conv ℝⁿ and
its conjugate is the closure of the convex function
Thus, the conjugation of a composition by a linear (or affine) mapping is not quite
straightforward, as it requires a closure operation. A natural question is therefore: when
does (2.2.1) define a closed function of s? In other words, when is an image-function
closed? Also, when is the infimum in (2.2.1) attained at some p? We start with a
technical result.
Lemma 2.2.2 Let g ∈ Conv ℝᵐ be such that 0 ∈ dom g and let A₀ be linear from
ℝⁿ to ℝᵐ. Make the following assumption:
PROOF. To prove (g ∘ A₀)* = A₀*g*, we have to prove that A₀*g* is a closed function,
i.e. that its sublevel-sets are closed (Definition IV.1.2.3). Thus, for given r ∈ ℝ, take
a sequence {s_k} such that
Take also δ_k ↓ 0; from the definition of the image-function, we can find p_k ∈ ℝᵐ
such that

g*(p_k) ≤ r + δ_k and A₀*p_k = s_k .

Let q_k be the orthogonal projection of p_k onto the subspace V := lin dom g + Im A₀.
Because V contains lin dom g, Proposition 1.3.4 (with z = 0) gives g*(p_k) = g*(q_k).
Furthermore, V⊥ = (lin dom g)⊥ ∩ Ker A₀*; in particular, q_k − p_k ∈ Ker A₀*. In
summary, we have singled out q_k ∈ V such that
(2.2.3)
We conclude
Theorem 2.2.3 With g ∈ Conv ℝᵐ and A₀ linear from ℝⁿ to ℝᵐ, define A(x) :=
A₀x + y₀ ∈ ℝᵐ. Make the following assumption:
(2.2.4)
Then Lemma 2.2.2 allows the computation of this conjugate. We have 0 in the
domain of g, and even in its relative interior:

(g ∘ A₀)*(s) = min_p {g*(p) : A₀*p = s} ,
or also
An example will be given in Remark 2.2.4 below, to show that the calculus rule (2.2.5)
does need a qualification assumption such as (2.2.4). First of all, we certainly need to avoid
g ∘ A ≡ +∞, i.e.
(2.2.Q.i)
but this is not sufficient, unless g is a polyhedral function (this situation has essentially been
treated in §V.3.4).
but it is fairly restrictive, implying in particular that dom g is full-dimensional. More tolerant
is

0 ∈ int[dom g − A(ℝⁿ)] ,  (2.2.Q.iii)

which is rather common. Use various results from Chap. V, in particular Theorem V.2.2.3(iii),
to see that (2.2.Q.iii) means
Knowing that σ_{dom g} = (g*)∞′ (Proposition 1.2.2), this condition has already been alluded to
at the end of §IV.3.
Naturally, our assumption (2.2.4) is a further weakening; actually, it is only a slight
generalization of (2.2.Q.iii). It is interesting to note that, if (2.2.4) is replaced by (2.2.Q.iii),
the solution-set in problem (2.2.5) becomes bounded; to see this, read again the proof of
Lemma 2.2.2, in which V becomes ℝᵐ.
Remark 2.2.4 Condition (2.2.4) is rather natural: when it does not hold, almost all informa-
tion on g is ignored by A, and pathological things may happen when conjugating. Consider
the following example with m = n = 2, A(ξ, η) := (ξ, 0),

g*(p, τ) = { exp(τ − 1) if p + τ = 0 ,
           { +∞ otherwise,

A := [1 0; 0 0] , M := [1 1; 1 2] , so that

⟨s, x⟩ = sᵀMx and A* = [2 2; −1 −1] = M⁻¹AᵀM .
Thus dom g ∩ Im A = ℝ × {0}; none of the assumptions in Theorem 2.2.3 are satisfied;
dom g* ∩ Im A* = {0},
and the infimum is not attained "at finite distance". Note also that A*g* is closed (how could
it not be, its domain being a singleton!) and is the conjugate of g ∘ A ≡ 0.
The trouble illustrated by this example is that composition with A extracts from g only
its nasty behaviour, namely its vertical tangent plane for η = 0. □
The message of the above counter-example is that conjugating brutally g ∘ A, with a
scalar product defined on the whole of ℝᵐ, is clumsy; this is the first line in Fig. 2.2.1. Indeed,
the restriction of g to Im A, considered as a Euclidean space, is a closed convex function by
itself, and this is the relevant function to consider: see the second line of Fig. 2.2.1 (but a
scalar product is then needed in Im A). Alternatively, if we insist on working in the whole
environment space ℝᵐ, the relevant function is rather g + I_{Im A} (third line in Fig. 2.2.1,
which requires the conjugate of a sum). All three constructions give the same g ∘ A, but the
intermediate conjugates are quite different.
Corollary 2.2.5 Let g ∈ Conv ℝᵐ and let A be linear from ℝⁿ to ℝᵐ with Im A ∩ dom g ≠ ∅.
Then

(g ∘ A)* = A*(g + I_{Im A})*  (2.2.6)

and, for every s ∈ dom(g ∘ A)*, the problem

inf_p {(g + I_{Im A})*(p) : A*p = s}  (2.2.7)

has a solution.
PROOF. The closed convex functions g ∘ A and (g + I_{Im A}) ∘ A are identical on ℝⁿ, so they have
the same conjugate. Also, g + I_{Im A} is a closed convex function whose domain is included in
Im A, hence

ri dom(g + I_{Im A}) ∩ Im A = ri dom(g + I_{Im A}) ≠ ∅ .

The result follows from Theorem 2.2.3. □
We leave it as an exercise to study how Example 2.2.4 is modified by this result. To
make it more interesting, one can also take g(ξ, η) + ½ξ² instead of g. Theorem 2.2.3
and Corollary 2.2.5 are two different ways of stating essentially the same thing; the former
requires that a qualification assumption such as (2.2.4) be checked; the latter requires that the
conjugate of the additional function g + I_{Im A} be computed. Either result may be simpler to
use, depending on the particular problem under consideration.

inf {(g ∘ P_A)*(P_A p) : A*p = s} ,
(g ∘ P_A)*(P_A p) = A*[(g ∘ P_A)* ∘ P_A](s) = (g ∘ A)*(s) .

When the qualification assumption allows the application of Theorem 2.2.3, we are bound to
realize that

A*g* = A*(g + I_{Im A})* = A*[(g ∘ P_A)* ∘ P_A] .  (2.2.8)

Indeed, Theorem 2.2.3 actually applies also to the couple (P_A, g) in this case: (g ∘ P_A)*
can be developed further, to obtain (since P_A is symmetric)
This ψ is the composition of g with the affine mapping which, to t ∈ ℝ, assigns
x₀ + td ∈ ℝᵐ. It is an exercise to apply Theorem 2.2.1 and obtain the conjugate:
whenever, for example, x₀ and d are such that x₀ + td ∈ ri dom g for some t - this is
(2.2.4). This example can be further particularized to various functions mentioned in
§1 (quadratic, indicator, ...).
The formula for conjugating a sum will supplement Proposition 1.3.1(ii) to obtain
the conjugate of a positive combination of closed convex functions. Summing two
functions is a simple operation (at least it preserves closedness); but Corollary 2.1.3
shows that a sum is the conjugate of something rather involved: an inf-convolution.
Likewise, compare the simplicity of the composition g ∘ A with the complexity of
its dual counterpart Ag; as seen in §2.2, difficulties are therefore encountered when
conjugating a function of the type g ∘ A. The same kind of difficulties must be expected
when conjugating a sum, and the development in this section is quite parallel to that
of §2.2.
Theorem 2.3.1 Let g₁ and g₂ be in Conv ℝⁿ and assume that dom g₁ ∩ dom g₂ ≠ ∅.
The conjugate (g₁ + g₂)* of their sum is the closure of the convex function g₁* ∔ g₂*.
PROOF. Set f_i := g_i*, so that f_i* = g_i for i = 1, 2; apply Corollary 2.1.3: (g₁* ∔ g₂*)* =
g₁ + g₂; then take the conjugate again. □
The above calculus rule is very useful in the following framework: suppose we
want to compute an inf-convolution, say h = f ∔ g with f and g in Conv ℝⁿ. Compute
f* and g*; if their sum happens to be the conjugate of some known function in
Conv ℝⁿ, this function has to be the closure of h.
Just as in the previous section, it is of interest to ask whether the closure operation
is necessary, and whether the inf-convolution is exact.
Theorem 2.3.2 The assumptions are those of Theorem 2.3.1. Assume in addition
that

the relative interiors of dom g₁ and dom g₂ intersect,
or equivalently: 0 ∈ ri(dom g₁ − dom g₂) .  (2.3.1)

PROOF. Define g ∈ Conv(ℝⁿ × ℝⁿ) by g(x₁, x₂) := g₁(x₁) + g₂(x₂) and the linear
operator A : ℝⁿ → ℝⁿ × ℝⁿ by Ax := (x, x). Then g₁ + g₂ = g ∘ A, and we proceed
to use Theorem 2.2.3. As seen in Proposition 1.3.1(ix), g*(p, q) = g₁*(p) + g₂*(q)
and a straightforward calculation shows that A*(p, q) = p + q. Thus, if we can apply
Theorem 2.2.3, we can write
(Proposition III.2.1.11), and this just means that Im A = Δ has a nonempty intersec-
tion with ri dom g. □
As with Theorem 2.2.3, a qualification assumption - taken here as (2.3.1), playing the
role of (2.2.4) - is necessary to ensure the stated properties; but other assumptions exist that
do the same thing. First of all,
is obviously necessary to have g₁ + g₂ ≢ +∞. We saw that this "minimal" condition was
sufficient in the polyhedral case. Here the results of Theorem 2.3.2 do hold if (2.3.1) is
replaced by:

g₁ and g₂ are both polyhedral, or also
g₁ is polyhedral and dom g₁ ∩ ri dom g₂ ≠ ∅ .

The "comfortable" assumption playing the role of (2.2.Q.ii) is
which means

σ_{dom g₁}(s) + σ_{dom g₂}(−s) > 0 for all s ≠ 0 ;

we leave it as an exercise to prove (2.3.Q.ii′) ⇒ (2.3.Q.iii).

(g₁ + g₂)*(p) = min {g₁*(s) + g₂*(p − s) : s ∈ ℝⁿ} .  (2.3.2)
It is appropriate to recall here that conjugate functions are interesting primarily via
their arguments and subgradients; thus, it is the existence of a minimizing s in the
right-hand side of (2.3.2) which is useful, rather than a mere equality between two
real numbers.
Remark 2.3.3 Our proof of Theorem 2.3.2 is based on the fact that the sum of two functions
can be viewed as the composition of a function with a linear mapping. It is interesting to
demonstrate the converse mechanism: suppose we want to prove Theorem 2.2.3, assuming
that Theorem 2.3.2 is already proved.
Given g ∈ Conv ℝᵐ and A : ℝⁿ → ℝᵐ linear, we then select the two functions with
argument (x, y) ∈ ℝⁿ × ℝᵐ:
Hence

g₁*(s, p) = { g*(p) if s = 0 ,
           { +∞ if not ,  (2.3.3)

g₂*(s, p) = { 0 if s + A*p = 0 ,
           { +∞ if not ,  (2.3.4)

so the whole question is to compute the conjugate of g₁ + g₂. If (2.3.1) holds, we have
but, due to the particular structure appearing in (2.3.3) and (2.3.4), the above minimization
problem can be written

inf {g*(p) : s − A*p = 0} = A*g*(s) .

Under these conditions, (2.3.1) expresses the existence of an x₀ such that Ax₀ ∈ ri dom g,
and this is exactly (2.2.4), which was our starting hypothesis. □
To conclude this study of the sum, we mention one among the possible results
concerning its dual operation: the infimal convolution.
and the first part of the claim follows. For closedness of the infimal convolution, we
set g_i := f_i*, i = 1, 2; because of 0-coercivity of f₁, 0 ∈ int dom g₁ (Remark 1.3.10),
and g₂(0) ≥ −μ. Thus, we can apply Theorem 2.3.2 with the qualification assumption
(2.3.Q.ii′). □
Theorem 2.4.1 Let {f_j}_{j∈J} be a collection of functions satisfying (1.1.1) and assume
that there is a common affine function minorizing all of them:
Then their infimum f := inf_{j∈J} f_j satisfies (1.1.1), and its conjugate is the supremum
of the f_j*:

(inf_{j∈J} f_j)* = sup_{j∈J} f_j* .  (2.4.1)
g*(s, 0) = sup_z g_x*(s, z)

holds, where the subscript x of g_x* indicates conjugacy with respect to the first variable only.
In a word, the conjugate with respect to the couple (x, z) is of no use for computing the
conjugate at (·, 0). Beware that this is true only when the scalar product considers x and z
as two separate variables (i.e. is compatible with the product-space). Otherwise, trouble may
occur: see Remark 2.2.4.
a formula already given in Example 2.1.4. The same exercise can be carried out with
the squared distance. □
Example 2.4.3 (Directional Derivative) With f ∈ Conv ℝⁿ, let x₀ be a point where
∂f(x₀) is nonempty and consider for t > 0 the functions

f_t(d) := [f(x₀ + td) − f(x₀)]/t .  (2.4.3)

They are all minorized by ⟨s₀, ·⟩, with s₀ ∈ ∂f(x₀); also, the difference quotient is
an increasing function of the stepsize; their infimum is therefore obtained for t ↓ 0.
This infimum is denoted by f′(x₀, ·), the directional derivative of f at x₀ (already
encountered in Chap. VI, in a finite-valued setting). Their conjugates can be computed
directly or using Proposition 1.3.1:

s ↦ (f_t)*(s) = [f*(s) + f(x₀) − ⟨s, x₀⟩]/t ,  (2.4.4)

so we obtain from (2.4.1)

[f′(x₀, ·)]*(s) = sup_{t>0} [f*(s) + f(x₀) − ⟨s, x₀⟩]/t .

Observe that the supremand is always nonnegative, and that it is 0 if and only if
s ∈ ∂f(x₀) (cf. Theorem 1.4.1). In a word: [f′(x₀, ·)]* = I_{∂f(x₀)}.
Theorem 2.4.4 Let {g_j}_{j∈J} be a collection of functions in Conv ℝⁿ. If their supre-
mum g := sup_{j∈J} g_j is not identically +∞, it is in Conv ℝⁿ, and its conjugate is the
closed convex hull of the g_j*:

(sup_{j∈J} g_j)* = co (inf_{j∈J} g_j*) .  (2.4.5)
so (2.4.5) gives

(d_C)* = co (I_{B(0,1)} + σ_C) . □
Example 2.4.6 (Asymptotic Function) With f ∈ Conv ℝⁿ, x₀ ∈ dom f and f_t as
defined by (2.4.3), consider
(see §IV.3.2). Clearly, f_t ∈ Conv ℝⁿ and f_t(0) = 0 for all t > 0, so we can apply
Theorem 2.4.4: f∞′ ∈ Conv ℝⁿ. In view of (2.4.4), the conjugate of f∞′ is therefore
the closed convex hull of the function

s ↦ inf_{t>0} [f*(s) + f(x₀) − ⟨s, x₀⟩]/t .

Since the infimand is in [0, +∞], the infimum is 0 if s ∈ dom f*, +∞ if not. In
summary, (f∞′)* is the indicator of cl dom f*. Conjugating again, we obtain

f∞′ = σ_{dom f*} ,

a formula already seen in (1.2.3).
The comparison with Example 2.4.3 is interesting. In both cases we extremize
the same function, namely the difference quotient (2.4.3); and in both cases a support
function is obtained (neglecting the closure operation). Naturally, the supported set is
larger in the present case of maximization. Indeed, dom f* and the union of all the
subdifferentials of f have the same closure. □
but then we still have to take the lower semi-continuous hull of the result. Both
possibilities are rather involved; let us mention a situation where the second one takes
an easier form.
Theorem 2.4.7 Let f₁, …, f_p be finitely many convex functions from ℝⁿ to ℝ, and
let f := max_j f_j; denote m := min{p, n + 1}. For every

s ∈ dom f* = co ∪{dom f_j* : j = 1, …, p} ,

there exist m vectors s_j ∈ dom f_j* and convex multipliers α_j such that
In other words: an optimal solution exists in the minimization problem (2.4.6), and
the optimal value is a closed convex function of s.
PROOF. Set g := min_j f_j*, so f* = co g; observe that dom g = ∪_j dom f_j*. Because
each f_j* is closed and 1-coercive, so is g. Then we apply Proposition 1.5.4: first,
co g = cl co g; second, for every
In the last sum, each g(s_j) is a certain f_i*(s_j); furthermore, several s_j's having
the same f_i* can be compressed to a single convex combination (thanks to the convexity
of each f_i*, see Proposition IV.2.5.4). Thus, we obtain (2.4.7). □
The framework in which we proved this result may appear very restrictive; however,
enlarging it is not easy. For example, extended-valued functions f_i create a difficulty: (2.4.7)
does not hold with n = 1 and

f₁(x) = exp x and f₂ = I_{{0}} .
Example 2.4.8 Consider the function g⁺ := max{0, g}, where g : ℝⁿ → ℝ is
convex. With f₁ := g and f₂ ≡ 0, we have f₂* = I_{{0}} and, according to Theorem 2.4.7:
for all s ∈ dom(g⁺)* = co{{0} ∪ dom g*},

(g⁺)*(s) = min {αg*(s₁) : s₁ ∈ dom g*, α ∈ [0, 1], αs₁ = s} .

For s = 0, the feasible solutions have either α = 0 or s₁ = 0. Both cases are covered
by the condensed formulation

(g⁺)*(0) = min {g*(0), 0} ,

which could also have been obtained by successive applications of the formula inf f =
−f*(0). □
For f ∈ Conv ℝⁿ and g ∈ Conv ℝ, the function g ∘ f is in Conv ℝⁿ if g is increasing
(we assume f(ℝⁿ) ∩ dom g ≠ ∅ and we set g(+∞) = +∞). A relevant question is
then how to express its conjugate, in terms of f* and g*.
We start with some preliminary observations, based on the particular nature of
g. The domain of g is an interval unbounded from the left, whose relative interior is
int dom g ≠ ∅. Also, dom g* ⊂ ℝ₊ and (remember §1.3.2)
Theorem 2.5.1 With f and g as above, assume that f(ℝⁿ) ∩ int dom g ≠ ∅. For all
s ∈ dom(g ∘ f)*, define the function ψ_s ∈ Conv ℝ by
Then

(g ∘ f)*(s) = min_{α∈ℝ} ψ_s(α) .

PROOF. By definition,

(g ∘ f)*(s) = min_{p,α} [g*(α) + I_{{−s}}(−p) + σ_{epi f}(p, −α)] = min_α ψ_s(α) ,
- With g(r) = r⁺, we leave it to the reader to compute the relevant conjugates: ψ_s(α) has the
unique minimum-point ᾱ = 0 for all s.
- With g = I_{]−∞,0]}, the qualification assumption in the above theorem is no longer satisfied.
We have (g ∘ f)*(s) = 0 for all s, while min_α ψ_s(α) = ‖s‖ − 1.
Remark 2.5.2 Under the qualification assumption, ψ_s always attains its minimum at some
ᾱ ≥ 0. As a result, if for example σ_{dom f}(s) = (f*)∞′(s) = +∞, or if g*(0) = +∞, we are
sure that ᾱ > 0. □
Example 2.5.3 If g is positively homogeneous, say g = σ_I for some closed interval
I, the domain of ψ_s is contained in I, on which the term g*(α) disappears. The interval
I supported by g has to be contained in ℝ₊ (to preserve monotonicity) and there are
essentially two instances: Example 2.4.8 was one, in which I was [0, 1]; in the other
instance, I is unbounded (a half-line), and the following example gives an application.
Consider the set C := {x ∈ ℝⁿ : f(x) ≤ 0}, with f : ℝⁿ → ℝ convex; its
support function can be expressed in terms of f*. Indeed, write the indicator of C as
the composed function:

I_C = I_{]−∞,0]} ∘ f .

This places us in the present framework: g is the support of ℝ₊. To satisfy the hypothe-
ses of Theorem 2.5.1, we have only to assume the existence of an x₀ with f(x₀) < 0;
then the result is: for 0 ≠ s ∈ dom σ_C, there exists ᾱ > 0 such that

σ_C(s) = min_{α>0} αf*(s/α) = ᾱf*(s/ᾱ) .
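As an instance (ours): take f(x) = ‖x‖ − 1, so that C = B(0, 1) and x₀ = 0 qualifies since f(0) = −1 < 0. Here f* = I_{B(0,1)} + 1, so for s ≠ 0

```latex
\sigma_C(s) \;=\; \min_{\alpha>0}\,\alpha f^*(s/\alpha)
           \;=\; \min_{\alpha \geq \|s\|} \alpha \;=\; \|s\| \,,
```

recovering σ_{B(0,1)} = ‖·‖, with the minimum attained at ᾱ = ‖s‖ > 0 as announced in Remark 2.5.2.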
which is the support function of the elliptic set (see Example V.2.3.4)
The composition of f with the function r ↦ g(r) := ½r² is the quadratic form
associated with Q⁻¹, whose conjugate is known (see Example 1.1.3):
dom f₁ ∩ dom f₂ ≠ ∅ .

Clearly, co f₁ + co f₂ is then a function of Conv ℝⁿ which minorizes f₁ + f₂. Therefore

co f₁ + co f₂ ≤ co (f₁ + f₂) .

This inequality is the only general comparison result one can get, as far as sums are
concerned; and closed convexity of f₁, say, does not help: it is only when f₁ is affine
that equality holds.
Let now {f_j}_{j∈J} be a collection of functions and f := sup f_j. Starting from
f_j ≤ f for j ∈ J, we obtain immediately co f_j ≤ co f and

sup_{j∈J} (co f_j) ≤ co f .

Nothing more accurate can be said in general.
The case of an infimum is hardly more informative:
PROOF. The second relation is trivial. As for the first, Theorem 2.4.1 gives (inf_j f_j)* =
sup_j f_j*. The left-hand function is in Conv ℝⁿ, so we can take the conjugate of both
sides and apply Theorem 2.4.4. □
So far, the calculus developed in the previous sections has been of little help. The
situation improves for the image-function under a linear mapping.
Proposition 2.6.2 Let g : ℝᵐ → ℝ ∪ {+∞} satisfy (1.1.1) and let A be linear from
ℝᵐ to ℝⁿ. Assuming Im A* ∩ ri dom g* ≠ ∅, the equality co(Ag) = A(co g) holds.
Actually, for each x ∈ dom co(Ag), there exists y such that
3 Various Examples
The examples listed below illustrate some applications of the conjugacy operation.
It is an obviously useful tool thanks to its convexification ability; note also that the
knowledge of f* fully characterizes co f, i.e. f in case the latter is already in Conv ℝⁿ;
another fundamental property is the characterization (1.4.2) of the subdifferential. All
this concerns convex analysis more or less directly, but the conjugacy operation is
instrumental in several areas of mathematics.
Let P be a probability measure on ℝⁿ and define its Laplace transform L_P : ℝⁿ →
]0, +∞] by

ℝⁿ ∋ s ↦ L_P(s) := ∫_{ℝⁿ} exp⟨s, z⟩ dP(z) .

The so-called Cramer transform of P is the function (ln L_P)*, the conjugate of the
logarithm of the Laplace transform.
Let S be a nonempty closed set in ℝⁿ and define the function f := ½‖·‖² + I_S:

f*(s) = sup {⟨s, x⟩ − ½‖x‖² : x ∈ S} .  (3.2.2)

From Proposition 1.5.4, co f = cl co f and, for each x ∈ co S, there are x₁, …, x_{n+1}
in S and convex multipliers α₁, …, α_{n+1} such that

x = Σ_{j=1}^{n+1} α_j x_j and (co f)(x) = ½ Σ_{j=1}^{n+1} α_j ‖x_j‖² .

in particular, x ∈ ∂f(x);
(ii) (co f)(x) = f(x) = ½‖x‖² and ∂(co f)(x) = ∂f(x).
Using I_S(x) = 0, this means d_S(s) = ‖s − x‖, i.e. x ∈ P_S(s). In particular, s = x is in
∂f(x), just because P_S(x) = {x}. Hence, ∂f(x) is nonempty and (ii) is a consequence
of (1.4.3), (1.4.4). □
Having thus characterized ∂f, we can compute ∂(co f) using Theorem 1.5.6, and
this in turn allows the computation of ∂f*:
This result is interesting in itself. For example, we deduce that the projection
onto a closed set is almost everywhere a singleton, just because ∇f* exists almost
everywhere. When S is convex, P_S(s) is for any s a singleton {p_S(s)}, which thus
appears as the gradient of f* at s:
Another interesting result is that the converse to this property is true: if d_S² (or equiv-
alently f*) is differentiable, then S is convex and we obtain a convexity criterion:
Theorem 3.2.3 For a nonempty closed set S, the following statements are equivalent:
(i) S is convex;
(ii) the projection operation P_S is single-valued on ℝⁿ;
(iii) the squared distance d_S² is differentiable on ℝⁿ;
(iv) the function f = I_S + ½‖·‖² is convex.
PROOF. We know that (i) ⇒ (ii); when (ii) holds, (3.2.3) tells us that the function
f* = ½(‖·‖² − d_S²) is differentiable, i.e. (iii) holds. When f in (iv) is convex, its
domain S is convex. There remains to prove (iii) ⇒ (iv).
Thus, assuming (iii), we want to prove that f = co f (= cl co f). Take first x ∈
ri dom co f, so that there is an s ∈ ∂(co f)(x) (Theorem 1.4.2). In view of (1.5.4),
there are x_j ∈ S and positive convex multipliers α_j such that
The last property implies that each x_j is the unique element of ∂f*(s): they are all
equal and it follows that (co f)(x) = f(x).
Now use Proposition IV.1.2.5, expressing co f outside the relative interior of its
domain: for x ∈ dom co f, there is a sequence {x_k} ⊂ ri dom co f tending to x such
that

f(x) ≥ (co f)(x) = lim_{k→+∞} (co f)(x_k) = lim_{k→+∞} f(x_k) ≥ f(x) ,

whence f(x) = (co f)(x). □
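Theorem 3.2.3 is easy to visualize numerically; in the toy script below (ours), the nonconvex set S = {−1, +1} ⊂ ℝ produces a two-valued projection at 0 and a kink of d_S² there:

```python
import numpy as np

# Illustration of Theorem 3.2.3 with the nonconvex set S = {-1, +1} in R.
S = np.array([-1.0, 1.0])

def d2(x):                               # squared distance to S
    return np.min((x - S)**2)

eps = 1e-6
left = (d2(0.0) - d2(-eps)) / eps        # ~ +2 : slope from the left at 0
right = (d2(eps) - d2(0.0)) / eps        # ~ -2 : slope from the right at 0
print(left, right)                       # unequal: d_S^2 not differentiable at 0
# Both -1 and +1 are projections of 0 onto S: P_S(0) = {-1, +1}.
```

Replacing S by its convex hull [−1, 1] makes the two one-sided slopes agree everywhere, in accordance with the equivalence (i) ⟺ (iii).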
where P_H is the operator of orthogonal projection onto H and (·)⁻ is the Moore-
Penrose pseudo-inverse.

g*(s) = (f* ∔ I_{H⊥})(s) = min {½⟨p, B⁻p⟩ : p ∈ Im B, s − p ∈ H⊥} ,
This last example fits exactly into the framework of §3.2, and (3.2.2) becomes
Proposition 3.4.1 At each s ∈ co{s₁, …, s_k} = dom f*, the conjugate of f has the
value (Δ_k is the unit simplex)

f*(s) = min {Σ_{i=1}^k α_i b_i : α ∈ Δ_k, Σ_{i=1}^k α_i s_i = s} .  (3.4.2)
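In computational terms, (3.4.2) is a linear program, so f* can be evaluated by any LP solver. A minimal sketch (ours; the data s_i, b_i are arbitrary) using scipy:

```python
import numpy as np
from scipy.optimize import linprog

# Evaluate f*(s) for f(x) = max_i [<s_i, x> - b_i] via the LP (3.4.2).
S = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # the slopes s_i
b = np.array([0.0, 0.0, 1.0])                          # the offsets b_i

def fstar(s):
    # minimize sum(a_i b_i)  s.t.  sum(a_i s_i) = s,  a in the unit simplex
    A_eq = np.vstack([S.T, np.ones(len(b))])
    b_eq = np.append(s, 1.0)
    res = linprog(c=b, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(b))
    return res.fun if res.success else np.inf          # +infinity outside co{s_i}

print(fstar(np.array([1.0, 0.0])))   # 0.0: s = s_1 itself
print(fstar(np.array([2.0, 2.0])))   # inf: s outside co{s_1, s_2, s_3} = dom f*
```

The infeasible case reflects the bounded domain of f* discussed below.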
In ℝⁿ × ℝ (considered here as the dual of the graph-space), each of the k vertical
half-lines {(s_i, r) : r ≥ b_i}, i = 1, …, k is the epigraph of ℓ_i*, and these half-
lines are the "needles" of Fig. IV.2.5.1. Now, the convex hull of these half-lines is a
closed convex polyhedron, which is just epi f*. Needless to say, (3.4.1) is obtained
by conjugating again (3.4.2). This double operation has an interesting application in
the design of minimization algorithms:
Example 3.4.2 Suppose that the only available information about a function f ∈
Conv ℝⁿ is a finite sampling of function- and subgradient-values, say
The resulting bracket on f is drawn in the left part of Fig. 3.4.1. It has a counterpart
in the dual space: φ₁* ≤ f* ≤ φ₂*, illustrated in the right part of Fig. 3.4.1, where
Note: these formulae can be made more symmetric, with the help of the relations
⟨s_j, x_j⟩ − f(x_j) = f*(s_j).
Fig. 3.4.1. Sandwiching a function and its conjugate
Because the f of (3.4.1) increases at infinity no faster than linearly, its conjugate
f* has a bounded domain: it is not piecewise affine, but rather polyhedral. A natural
idea is then to develop a full calculus for polyhedral functions, just as was done in
§3.3 for the quadratic case.
A polyhedral function has the general form g = f + I_P, where P is a closed convex
polyhedron and f is defined by (3.4.1). The conjugate of g is given by Theorem 2.3.2:
g* = f* ∔ σ_P, i.e.
g*(s) = min_{α∈Δ_k} [Σ_{j=1}^k α_j b_j + σ_P(s − Σ_{j=1}^k α_j s_j)] .  (3.4.3)
This formula may take different forms, depending on the particular description of P.
When

P = co {p₁, …, p_m}

is described as a convex hull, σ_P is the maximum of just as many linear functions
⟨p_j, ·⟩ and (3.4.3) becomes an explicit formula. Dual descriptions of P are more
frequent in applications, though.
Example 3.4.3 With the notations above, suppose that P is described as an intersec-
tion (assumed nonempty) of half-spaces:
It is convenient to introduce the following notation: ℝᵏ and ℝᵐ are equipped
with their standard dot-products; A [resp. C] is the linear operator which, to x ∈ ℝⁿ,
associates the vector whose coordinates are ⟨s_j, x⟩ in ℝᵏ [resp. ⟨c_j, x⟩ ∈ ℝᵐ]. Then
it is not difficult to compute the adjoints of A and C: for α ∈ ℝᵏ and y ∈ ℝᵐ,

A*α = Σ_{j=1}^k α_j s_j , C*y = Σ_{j=1}^m y_j c_j .

Then we want to compute the conjugate of the polyhedral function
To compute σ_P, we have its conjugate I_P as the sum of the indicators of the
H_j's; using Theorem 2.3.2 (with the qualification assumption (2.3.Q.i), since all the
functions involved are polyhedral),
Piecing together, we finally obtain g*(s) as the optimal value of the problem in
the pair of variables (α, y):
and this actually was our very motivation for defining f*; see the introduction to this
chapter. Geometrically, the graphs of ∂f and of ∂f* in ℝⁿ × ℝⁿ are images of each other
under the mapping (x, s) ↦ (s, x). Knowing that a convex function is differentiable
when its subdifferential is a singleton, smoothness properties of f* correspond to
monotonicity properties of ∂f, in the sense of §VI.6.1.
Theorem 4.1.1 Let f ∈ Conv ℝⁿ be strictly convex. Then int dom f* ≠ ∅ and f*
is continuously differentiable on int dom f*.
PROOF. For arbitrary x₀ ∈ dom f and nonzero d ∈ ℝⁿ, consider Example 2.4.6. Strict
convexity of f implies that
i.e. dom f* has a positive breadth in every nonzero direction d: its interior is nonempty
- Theorem V.2.2.3(iii).
Now, suppose that there is some s ∈ int dom f* such that ∂f*(s) contains two
distinct points x₁ and x₂. Then s ∈ ∂f(x₁) ∩ ∂f(x₂) and, by convex combination of
the relations

f*(s) + f(x_i) = ⟨s, x_i⟩ for i = 1, 2 ,

we deduce, using Fenchel's inequality (1.1.3):

f*(s) + Σ_{i=1}^2 α_i f(x_i) = ⟨s, Σ_{i=1}^2 α_i x_i⟩ ≤ f*(s) + f(Σ_{i=1}^2 α_i x_i) ,

which implies that f is affine on [x₁, x₂], a contradiction. In other words, ∂f* is
single-valued on int dom f*, and this means f* is continuously differentiable there.
□
For an illustration, consider
whose conjugate is
Here, f is strictly convex (compute ∇²f to check this), dom f* is the unit ball, on
the interior of which f* is differentiable, but ∂f* is empty on the boundary.
Incidentally, observe also that f* is strictly convex; as a result, f is differentiable.
Such is not the case in our next example: with n = 1, take f(x) = ½x² + |x|, whose
conjugate is

f*(s) = { ½(s + 1)² for s ≤ −1 ,
        { 0 for −1 ≤ s ≤ 1 ,
        { ½(s − 1)² for s ≥ 1 .

This example is illustrated by the instructive Fig. 4.1.1. Its left part displays gr f,
made up of two parabolas; f*(s) is obtained by leaning onto the relevant parabola a
straight line with slope s. The right part illustrates Theorem 2.4.4: it displays ½(s − 1)²
and ½(s + 1)², the conjugates of the two functions ½x² + x and ½x² − x making up f;
epi f* is then the convex hull of the union of their epigraphs.
Example 4.1.2 We have studied in §IV.1.3(f) the volume of an ellipsoid: on the Eu-
clidean space (S_n(ℝ), ⟨⟨·, ·⟩⟩) of symmetric matrices with the standard scalar product
of ℝ^{n×n}, the function
Fig. 4.1.1. Strict convexity corresponds to differentiability of the conjugate
∂²f/(∂a_{ij} ∂a_{kl}) (A) = (A⁻¹)_{li} (A⁻¹)_{jk}
Theorem 4.1.3 Let f ∈ Conv ℝⁿ be differentiable on the set D := int dom f. Then
f* is strictly convex on each convex subset C ⊂ ∇f(D).
PROOF. Let C be a convex set as stated. Suppose that there are two distinct points
s₁ and s₂ in C such that f* is affine on the line-segment [s₁, s₂]. Then, setting s :=
½(s₁ + s₂) ∈ C ⊂ ∇f(D), there is x ∈ D such that ∇f(x) = s, i.e. x ∈ ∂f*(s).
Using the affine character of f*, we have

0 = f(x) + f*(s) − ⟨s, x⟩ = ½ Σ_{j=1}^2 [f(x) + f*(s_j) − ⟨s_j, x⟩]
and, in view of Fenchel's inequality (1.1.3), this implies that each term in the bracket is
0: x ∈ ∂f*(s₁) ∩ ∂f*(s₂), i.e. ∂f(x) contains the two points s₁ and s₂, a contradiction
to the existence of ∇f(x). □
The simplest such situation occurs when f is a strictly convex quadratic function,
as in Example 1.1.3, corresponding to an affine Legendre transform. Another example
is f(x) := exp(½‖x‖²).
Along the same lines as in §4.1, better than strict convexity of f and better than
differentiability of f* correspond to each other. We start with the connection between
strong convexity of f and Lipschitz continuity of ∇f*.
(a) Lipschitz Continuity of the Gradient Mapping The next result is stated for a
finite-valued f, mainly because the functions considered in Chap. VI were such; but
this assumption is actually useless.
Theorem 4.2.1 Assume that f : ℝⁿ → ℝ is strongly convex with modulus c > 0 on
ℝⁿ: for all (x₁, x₂) ∈ ℝⁿ × ℝⁿ and α ∈ ]0, 1[,

f(αx₁ + (1 − α)x₂) ≤ αf(x₁) + (1 − α)f(x₂) − ½cα(1 − α)‖x₁ − x₂‖² .

PROOF. We use the various equivalent definitions of strong convexity (see Theo-
rem VI.6.1.2). Fix x₀ and s₀ ∈ ∂f(x₀): for all 0 ≠ d ∈ ℝⁿ and t ≥ 0
hence f∞′(d) = σ_{dom f*}(d) = +∞, i.e. dom f* = ℝⁿ. Also, f is in particular strictly
convex, so we know from Theorem 4.1.1 that f* is differentiable (on ℝⁿ). Finally,
the strong convexity of f can also be written

⟨s₁ − s₂, x₁ − x₂⟩ ≥ c‖x₁ − x₂‖² ,

in which we have s_i ∈ ∂f(x_i), i.e. x_i = ∇f*(s_i), for i = 1, 2. The rest follows from
the Cauchy-Schwarz inequality. □
This result is quite parallel to Theorem 4.1.1: improving the convexity of f from
"strict" to "strong" amounts to improving ∇f* from "continuous" to "Lipschitzian".
The analogy can even be extended to Theorem 4.1.3:
Then f* is strongly convex with modulus 1/L on each convex subset C ⊂ dom ∂f*.
In particular, there holds for all (x₁, x₂) ∈ ℝⁿ × ℝⁿ
(4.2.2)
PROOF. Let s₁ and s₂ be arbitrary in dom ∂f* ⊂ dom f*; take s and s′ on the segment
[s₁, s₂]. To establish the strong convexity of f*, we need to minorize the remainder
term f*(s′) − f*(s) − ⟨x, s′ − s⟩, with x ∈ ∂f*(s). For this, we minorize f*(s′) =
sup_y [⟨s′, y⟩ − f(y)], i.e. we majorize f(y):
Observe that the last supremum is nothing but the value at s′ − s of the conjugate of
½L‖· − x‖². Using the calculus rule 1.3.1, we have therefore proved
for all s, s′ in [s₁, s₂] and all x ∈ ∂f*(s). Replacing s′ in (4.2.3) by s₁ and by s₂, and
setting s = αs₁ + (1 − α)s₂, the strong convexity (4.2.1) for f* is established by
convex combination.
f*(s₂) ≥ f*(s₁) + ⟨s₂ − s₁, x₁⟩ + (1/2L)‖s₂ − s₁‖² for all x₁ ∈ ∂f*(s₁).
In view of the differentiability of f, this is just (4.2.2), which has to hold for all (x₁, x₂)
simply because Im ∂f* = dom ∇f = ℝⁿ. □
Remark 4.2.3 For a convex function, the Lipschitz property of the gradient mapping
thus appears as equivalently characterized by (4.2.2); a result which is of interest in
itself. As an application, let us return to §3.2: for a convex set C, the convex function
½(‖·‖² − d_C²) has gradient p_C, a nonexpansive mapping. Therefore
Naturally, Corollary 4.1.4 has also its equivalent, namely: if f is strongly convex
and has a Lipschitzian gradient-mapping on ℝⁿ, then f* enjoys the same properties.
These properties do not leave much room, though: f (and f*) must be "sandwiched"
between two positive definite quadratic functions.
Lemma 4.2.4 For a directionally quadratic function φ, the following properties are
equivalent:
(i) φ is finite everywhere;
(ii) ∇φ(0) exists (and is equal to 0);
(iii) there is C ≥ 0 such that φ(x) ≤ ½C‖x‖² for all x ∈ ℝⁿ;
(iv) there is c > 0 such that φ*(s) ≥ ½c‖s‖² for all s ∈ ℝⁿ;
(v) φ*(s) > 0 for all s ≠ 0.
PROOF. [(i) ⇔ (ii) ⇒ (iii)] When (i) holds, φ is continuous; call ½C ≥ 0 its maximal
value on the unit sphere: (iii) holds by positive homogeneity. Furthermore, compute
difference quotients to observe that φ′(0, ·) ≡ 0, and (ii) follows from the differen-
tiability criterion IV.4.2.1.
Conversely, existence of ∇φ(0) implies finiteness of φ in a neighborhood of 0
and, by positive homogeneity, on the whole space.
[(iii) ⇒ (iv) ⇒ (v)] When (iii) holds, (iv) comes from the calculus rule 1.3.1(vii), with
for example 0 < c := 1/C (or rather 1/(C + 1), to take care of the case C = 0); this
implies (v) trivially.
[(v) ⇒ (i)] The lower semi-continuous function φ* is positive on the unit sphere, and
has a minimal value c > 0 there. Being positively homogeneous of degree 2, φ* is
then 1-coercive; finiteness of its conjugate φ follows from Proposition 1.3.8. □
Definition 4.2.5 We say that the directionally quadratic function φ_s defines a mi-
norization to second order of f ∈ Conv ℝⁿ at x₀ ∈ dom f, and associated with
s ∈ ∂f(x₀), when
Proposition 4.2.6 With the notation of Definition 4.2.5, suppose that there is a di-
rectionally quadratic function φ_s satisfying, for some c > 0,
(4.2.6)
PROOF. [Preamble] First note that φ_s* is finite everywhere: as already mentioned after
Definition 4.2.5, differentiability of f* will follow from the majorization property of
f*.
We take the tilted function g := f(x₀ + ·) − f(x₀) − ⟨s, ·⟩, so that g(0) = 0.
[Step 1] Given ε > 0, there is δ > 0 such that

g(h) ≥ (1 − ε/c) φ_s(h) ≥ ½(c − ε)‖h‖² whenever ‖h‖ ≤ δ .  (*)

Our aim is then to establish that, even though (*) holds locally only, the calculus
rule 1.3.1(vii) applies, at least in a neighborhood of 0.
[Step 2] For ‖h‖ > δ, write the convexity of g on [0, h] and use g(0) = 0 to obtain,
with (*),

g(h) ≥ ½(c − ε)δ‖h‖ .

Then, if ‖p‖ ≤ ½(c − ε)δ =: δ′, we have

⟨p, h⟩ − g(h) ≤ ½(c − ε)δ‖h‖ − g(h) ≤ 0 whenever ‖h‖ > δ .

Since g* ≥ 0, this certainly implies that
[Step 3] Thus, using the calculus rule 1.3.1(ii) and positive homogeneity of φ_s*, we
have for p ∈ B(0, δ′),
Let us sum up: given ε′ > 0, choose ε in Step 1 such that 2ε/(c(c − 2ε)) ≤ ε′. This
gives δ > 0 and δ′ > 0 in Step 2 yielding the required majorization
Proposition 4.2.7 With the notation of Definition 4.2.5, suppose that there is a
directionally quadratic function φ_s satisfying (4.2.6) for some c > 0 and defining a
majorization to second order of f ∈ Conv ℝⁿ, associated with s ∈ ∂f(x₀). Then φ*_s
defines a minorization to second order of f* at s, associated with x₀.
PROOF. We use the proof pattern of Proposition 4.2.6; using in particular the same
tilting technique, we assume x₀ = 0, f(0) = 0 and s = 0. For any ε > 0, there is by
assumption δ > 0 such that
for all p; we have to show that, for ‖p‖ small enough, the right-hand side is close to
φ*_s(p).
Because of (4.2.6), φ*_s is finite everywhere and ∇φ*_s(0) = 0 (Lemma 4.2.4).
It follows from the outer semi-continuity of ∂φ*_s (Theorem VI.6.2.4) that, for some
δ′ > 0,
   ∅ ≠ ∂φ*_s(p/(1 + 2ε/c)) ⊂ B(0, δ)   for all p ∈ B(0, δ′).
Said otherwise: whenever ‖p‖ ≤ δ′, there exists h₀ ∈ B(0, δ) such that
   φ*_s(p/(1 + 2ε/c)) = ⟨p, h₀⟩/(1 + 2ε/c) − φ_s(h₀).
Thus, the function ⟨p, ·⟩ − (1 + 2ε/c)φ_s attains its (unconstrained) maximum at h₀.
In summary, provided that ‖p‖ is small enough, the index h in (4.2.7) can run in the
whole space, to give
   s_α := Σ_{j∈J} α_j ∇f_j(x₀) ∈ ∂f(x₀)
and
   φ_α(h) := ½ Σ_{j∈J} α_j ⟨A_j h, h⟩;
this last function is convex quadratic and yields the minorization
   ∃ c > 0 such that ⟨∇²f(x)d, d⟩ ≥ c‖d‖²   for all (x, d) ∈ ℝⁿ × ℝⁿ.
Remark 4.2.11 To conclude, let us return to Example 4.2.8, which illustrates a fundamental difficulty for a second-order analysis. Under the conditions of that example,
f can be approximated to second order in a number of ways.
(a) For h ∈ ℝⁿ, define
in other words, the convex hull of the gradients indexed by J_h forms the face of ∂f(x₀)
exposed by h. The function
surface). For d in a neighborhood of this d₀, φ(d) is equal to either ½⟨A₁d, d⟩ or
½⟨A₂d, d⟩, depending on which index prevails at x₀ + td for small t > 0.
It does not suffer the non-uniformity effect of (a), and the approximating function is
now nicely convex. However, it patches together the first- and second-order information and is no longer the sum of a sublinear function and a directionally quadratic
function. □
XI. Approximate Subdifferentials of Convex Functions
Prerequisites. Sublinear functions and associated convex sets (Chap. V); characterization
of the subdifferential via the conjugacy correspondence (§X.1.4); calculus rules on conjugate
functions (§X.2); and also: behaviour at infinity of one-dimensional convex functions (§I.2.3).
Introduction. There are two motivations for the concepts introduced in this chapter:
a practical one, related to descent methods, and a more theoretical one, in the
framework of differential calculus.
- In §VIII.2.2, we have seen that the steepest-descent method is not convergent, es-
sentially because the subdifferential is not a continuous mapping. Furthermore, we
have defined Algorithm IX.l.6 which, to find a descent direction, needs to extract
limits of subgradients: an impossible task on a computer.
- On the theoretical side, we have seen in Chap. VI the directional derivative of a finite
convex function, which supports a convex set: the subdifferential. This latter set was
generalized to extended-valued functions in §X.1.4; and infinite-valued directional
derivatives have also been seen (Proposition I.4.1.3, Example X.2.4.3). A natural
question is then: is the supporting property still true in the extended-valued case?
The answer is not quite yes; see below the example illustrated by Fig. 2.1.1.
The two difficulties above are overcome altogether by the so-called ε-subdifferential of f, denoted ∂_ε f, which is a certain perturbation of the subdifferential studied
in Chap. VI for finite convex functions. While the two sets are identical for ε = 0, the
properties of ∂_ε f turn out to be substantially different from those of ∂f. We therefore
study ∂_ε f with the help of the relevant tools, essentially the conjugate function f*
(which was of no use in Chap. VI). In return, particularizing our study to the case
ε = 0 enables us to generalize the results of Chap. VI to extended-valued functions.
Throughout this chapter, and unless otherwise specified, we therefore have
   f ∈ Conv ℝⁿ and ε ≥ 0.
However, keeping in mind that our development has a practical importance for nu-
merical optimization, we will often pay special attention to the finite-valued case.
Of course, s is still an ε-subgradient if (1.1.1) holds only for y ∈ dom f. The set of
all ε-subgradients of f at x is the ε-subdifferential (of f at x), denoted by ∂_ε f(x).
Even though we will rarely take ε-subdifferentials of functions not in Conv ℝⁿ,
it goes without saying that the defining relation (1.1.1) can be applied to any
function finite at x. Also, one could set ∂_ε f(x) = ∅ for x ∉ dom f. □
   ∂_{αε+(1−α)ε′} f(x) ⊃ α ∂_ε f(x) + (1 − α) ∂_{ε′} f(x)   for all α ∈ ]0, 1[.   (1.1.4)
The last relation means that the graph of the multifunction ℝ₊ ∋ ε ↦ ∂_ε f(x) is a
convex set in ℝⁿ⁺¹; more will be said about this set later (Proposition 1.3.3).
We will continue to use the notation ∂f, rather than ∂₀f, for the exact subdifferential – knowing that ∂_ε f can be called an approximate subdifferential when ε > 0.
Figure 1.1.1 gives a geometric illustration of Definition 1.1.1: s, together with
r ∈ ℝ, defines the affine function y ↦ a_{s,r}(y) := r + ⟨s, y − x⟩; we say that s is
an ε-subgradient of f at x when it is possible to have simultaneously r ≥ f(x) − ε
and a_{s,r} ≤ f. The condition r = f(x), corresponding to exact subgradients, is thus
relaxed by ε; thanks to closed convexity of f, this relaxation makes it possible to find
such an a_{s,r}:
Fig. 1.1.1. Supporting hyperplanes within ε
Incidentally, the above proof shows that, among the ε-subgradients, there is one
for which strict inequality holds in (1.1.1). A consequence of this result is that, for
ε > 0, the domain of the multifunction x ↦ ∂_ε f(x) is the convex set dom f. Here
is a difference with the case ε = 0: we know that dom ∂f need not be the whole of
dom f; it may even be a nonconvex set.
Consider for example the one-dimensional convex function
   f(x) = |x|,   (1.1.5)
whose ε-subdifferential works out to
   ∂_ε f(x) = { [−1, −1 − ε/x]  if x < −ε/2,
                [−1, +1]        if −ε/2 ≤ x ≤ ε/2,
                [1 − ε/x, 1]    if x > ε/2.
The two parts of Fig. 1.1.2 display this set, as a multifunction of ε and of x respectively. It
is always a segment, reduced to the singleton {f′(x)} only for ε = 0 (and x ≠ 0). This
example suggests that the approximate subdifferential is usually a proper enlargement
of the exact subdifferential; this will be confirmed by Proposition 1.2.3 below.
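This closed form is easy to test against the conjugate characterization of ε-subgradients, s ∈ ∂_ε f(x) ⟺ f*(s) + f(x) − ⟨s, x⟩ ≤ ε (anticipating (1.2.1) below); a minimal Python sketch, assuming f(x) = |x| so that f* is the indicator of [−1, 1]:

```python
# Epsilon-subdifferential of f(x) = |x|, checked against the conjugate
# characterization: s in d_eps f(x)  <=>  f*(s) + f(x) - s*x <= eps.
# For f = |.| on R, f*(s) = 0 if |s| <= 1, +infinity otherwise.

def eps_subdiff_abs(x, eps):
    """Closed form of the eps-subdifferential of |.| at x (a segment (lo, hi))."""
    if x < -eps / 2:
        return (-1.0, -1.0 - eps / x)      # note: -eps/x > 0, so hi > -1
    if x > eps / 2:
        return (1.0 - eps / x, 1.0)
    return (-1.0, 1.0)

def is_eps_subgradient(s, x, eps):
    """Definition-based test: f(y) >= f(x) + s*(y - x) - eps for all y,
    i.e. f*(s) + f(x) - s*x <= eps."""
    if abs(s) > 1:                          # f*(s) = +infinity
        return False
    return abs(x) - s * x <= eps + 1e-12

x, eps = 1.0, 0.5
lo, hi = eps_subdiff_abs(x, eps)            # -> (0.5, 1.0)
assert all(is_eps_subgradient(s, x, eps) for s in [lo, hi, 0.5 * (lo + hi)])
assert not is_eps_subgradient(lo - 1e-6, x, eps)   # just outside the segment
print(lo, hi)
```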
Another interesting instance is the indicator function of a nonempty closed convex
set:
   ∂_ε I_C(x) = {s ∈ ℝⁿ : ⟨s, y − x⟩ ≤ ε for all y ∈ C} =: N_{C,ε}(x), the ε-normal set to C at x.   (1.1.6)
The ε-normal set is thus an intersection of half-spaces but is usually not a cone;
it contains the familiar normal cone N_C(x), to which it reduces when ε = 0. A
condensed form of (1.1.6) uses the polar of the set C − {x}:
   ∂_ε I_C(x) = ε (C − x)°   for all x ∈ C and ε > 0.
These examples raise the question of the boundedness of ∂_ε f(x).
Theorem 1.1.4 For ε ≥ 0, ∂_ε f(x) is a closed convex set, which is nonempty and
bounded if and only if x ∈ int dom f.
PROOF. Closedness and convexity come immediately from the definition (1.1.1).
Now, if x ∈ int dom f, then ∂_ε f(x) contains the nonempty set ∂f(x) (Theorem X.1.4.2). Then let δ > 0 be such that B(x, δ) ⊂ int dom f, and let L be a
Lipschitz constant for f on B(x, δ) (Theorem IV.3.1.2). For 0 ≠ s ∈ ∂_ε f(x), take
y = x + δs/‖s‖:
   f(x) + Lδ ≥ f(y) ≥ f(x) + δ⟨s, s/‖s‖⟩ − ε,
i.e. ‖s‖ ≤ L + ε/δ. Thus, the nonempty ∂_ε f(x) is also bounded.
Conversely, take any s₁ in the normal cone to dom f at x ∈ dom f:
   ⟨s₁, y − x⟩ ≤ 0   for all y ∈ dom f.
If ∂_ε f(x) ≠ ∅, add this inequality to (1.1.1) to obtain
   f(y) ≥ f(x) + ⟨s + s₁, y − x⟩ − ε
In the introduction to this chapter, it was mentioned that one motivation for the
ε-subdifferential was practical. The next result gives a first explanation: ∂_ε f can be
used to characterize the ε-solutions of a convex minimization problem – but, starting
from Chap. XIII, we will see that its role is much more important than that.
Theorem 1.1.5 The following two properties are equivalent:
   0 ∈ ∂_ε f(x),
   f(x) ≤ f(y) + ε   for all y ∈ ℝⁿ.
PROOF. Apply the definition. □
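For a function whose minimum is computable, the equivalence is easy to watch numerically. A small sketch, assuming a strictly convex quadratic (hypothetical data) and using the identity f*(0) = −min f:

```python
import numpy as np

# Theorem 1.1.5 for a strictly convex quadratic f(x) = 1/2 <Qx,x> + <b,x>:
# 0 in d_eps f(x)  <=>  f(x) <= min f + eps.  Both sides are computable:
# min f = f(x*) with x* = -Q^{-1} b, and "0 in d_eps f(x)" reads
# f*(0) + f(x) - <0,x> <= eps, i.e. f(x) - min f <= eps (since f*(0) = -min f).

Q = np.array([[2.0, 0.0], [0.0, 8.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x + b @ x

x_star = -np.linalg.solve(Q, b)              # unconstrained minimizer
for x in [x_star, x_star + [0.1, 0.0], x_star + [1.0, 1.0]]:
    gap = f(np.asarray(x)) - f(x_star)       # = f*(0) + f(x) - <0, x>
    print(x, "is a 0.1-minimizer:", gap <= 0.1)
```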
which, remembering that f*(s) is the supremum of the bracket, is equivalent to (1.2.1).
This implies that ∂_ε f(x) ⊂ dom f* and, applying (1.2.1) with f replaced by f*:
Indeed, the inclusion "⊂" comes directly from (1.2.1); conversely, if s ∈ dom f*, we
know from Theorem 1.1.2 that there exists x ∈ ∂_ε f*(s), i.e. s ∈ ∂_ε f(x).
Likewise, for fixed x ∈ dom f,
Here again, the "⊂" comes directly from (1.2.1) while, if s ∈ dom f*, set
   {s ∈ b + Im Q : f(x) + ½⟨s − b, Q⁻(s − b)⟩ ≤ ⟨s, x⟩ + ε}.
This set has a nicer expression if we single out ∇f(x): setting p = s − b − Qx, we
see that s − b ∈ Im Q means p ∈ Im Q and, via some algebra,
When Q is invertible and b = 0, f defines the norm |||x||| = ⟨Qx, x⟩^{1/2}. Its ε-subdifferential is then a neighborhood of the gradient for the metric associated with
the dual norm |||s|||* = ⟨s, Q⁻¹s⟩^{1/2}. □
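A short numerical illustration of this ellipsoidal shape (a sketch assuming b = 0 and Q invertible; the membership test is (1.2.1), and the ellipsoid formula follows from the algebra above):

```python
import numpy as np

# For f(x) = 1/2 <Qx,x> with Q positive definite, the algebra above gives
# d_eps f(x) = { Qx + p : <p, Q^{-1} p> <= 2*eps },
# an ellipsoid of dual-norm radius sqrt(2*eps) around the gradient Qx.

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
Qinv = np.linalg.inv(Q)
x, eps = np.array([1.0, -1.0]), 0.3

f = lambda y: 0.5 * y @ Q @ y
fstar = lambda s: 0.5 * s @ Qinv @ s        # conjugate of f

for _ in range(5):
    p = rng.normal(size=2)
    s = Q @ x + p
    in_ellipsoid = p @ Qinv @ p <= 2 * eps + 2e-12
    in_eps_subdiff = fstar(s) + f(x) - s @ x <= eps + 1e-12   # test (1.2.1)
    assert in_ellipsoid == in_eps_subdiff   # the two descriptions agree
```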
The above example suggests once more that ∂_ε f is usually a proper enlargement
of ∂f; in particular it is "never" reduced to a singleton, in contrast with ∂f, which is
"often" reduced to the gradient of f. This is made precise by the next results, which
somehow describe two opposite situations.
   int ∂_{ε′} f(x) = {s ∈ ℝⁿ : f*(s) + f(x) − ⟨s, x⟩ < ε′} ⊃ ∂_ε f(x).
Proposition 1.2.4 Let f ∈ Conv ℝⁿ and suppose that ∂_{ε₀} f(x₀) is a singleton for
some x₀ ∈ dom f and ε₀ > 0. Then f is affine on ℝⁿ.
PROOF. Denote by s₀ the unique ε₀-subgradient of f at x₀. Let ε ∈ ]0, ε₀[; in view
of the monotonicity property (1.1.2), the nonempty set ∂_ε f(x₀) can only be the same
singleton {s₀}. Then let ε′ > ε₀; the graph-convexity property (1.1.4) easily shows
that ∂_{ε′} f(x₀) is again {s₀}.
Thus, using the characterization (1.2.1) of an approximate subgradient, we have
proved:
   s ≠ s₀ ⟹ f*(s) > ε + ⟨s, x₀⟩ − f(x₀)   for all ε > 0,
ε = 0, we do obtain the exposed face itself: ∂σ_C(d) = F_C(d).
Fig. 1.2.1. An ε-face
Along the same lines, the conjugate of I_C in (1.1.6) is σ_C, so the ε-normal set to
C at x is equivalently defined as
   N_{C,ε}(x) = {s ∈ ℝⁿ : σ_C(s) ≤ ⟨s, x⟩ + ε}.
Beware of the difference with (1.2.6); one set looks like a face, the other like a cone;
the relation linking them connotes the polarity relation, see Remark V.3.2.6.
(a) Elementary Calculus. First, we list some properties coming directly from the
definitions.
Proposition 1.3.1
(i) For the function g(x) := f(x) + r, ∂_ε g(x) = ∂_ε f(x).
(ii) For the function g(x) := αf(x) and α > 0, ∂_ε g(x) = α ∂_{ε/α} f(x).
(iii) For the function g(x) := f(αx) and α ≠ 0, ∂_ε g(x) = α ∂_ε f(αx).
(iv) More generally, if A is an invertible linear operator, ∂_ε(f ∘ A)(x) = A* ∂_ε f(Ax).
(v) For the function g(x) := f(x − x₀), ∂_ε g(x + x₀) = ∂_ε f(x).
(vi) For the function g(x) := f(x) + ⟨s₀, x⟩, ∂_ε g(x) = ∂_ε f(x) + {s₀}.
(vii) If f₁ ≤ f₂ and f₁(x) = f₂(x), then ∂_ε f₁(x) ⊂ ∂_ε f₂(x).
PROOF. Apply (1.1.1), or combine (1.2.1) with the elementary calculus rules X.1.3.1,
whichever seems easier. □
Proposition 1.3.2 Let H be a subspace containing a point of dom f and call P_H the
operator of orthogonal projection onto H. For all x ∈ dom f ∩ H,
i.e.
(b) The Tilted Conjugate Function. From (1.2.1), ∂_ε f(x) appears as the sublevel-set at level ε of the "tilted conjugate function"
   ℝⁿ ∋ s ↦ g*_x(s) := f*(s) + f(x) − ⟨s, x⟩,   (1.3.1)
which is clearly in Conv ℝⁿ (remember x ∈ dom f!) and plays an important role. Its
infimum on ℝⁿ is 0 (Theorem 1.1.2), attained at the subgradients of f at x, if there
Proposition 1.3.3 For x ∈ dom f, the epigraph of g*_x is the graph of the multifunction
ε ↦ ∂_ε f(x):
   (1.3.3)
The support function of this set has, at (d, −u) ∈ ℝⁿ × ℝ, the value
   σ_{epi g*_x}(d, −u) = sup {⟨s, d⟩ − εu : s ∈ ∂_ε f(x), ε ≥ 0}.   (1.3.4)
PROOF. The equality of the two sets in (1.3.3) is trivial, in view of (1.2.1) and (1.3.1).
Then (1.3.4) comes either from direct calculations, or via Proposition X.1.2.1, with f
replaced by g*_x, whose domain is just dom f* (Proposition X.1.3.1 may also be used).
□
Remember §IV.2.2 and Example I.3.2.2 to realize that, up to the closure operation
at u = 0, the function (1.3.4) is just the perspective of the shifted function h ↦
f(x + h) − f(x).
Figure 1.3.1 displays the graph of ε ↦ ∂_ε f(x), with the variable s plotted along
the horizontal axis, as usual (see also the right part of Fig. 1.1.2). Rotating the picture
so that this axis becomes vertical, we obtain epi g*_x.
Fig. 1.3.1. Approximate subdifferential as a sublevel-set in the dual space
The relation expressed in (1.3.5) is rather natural, and can be compared to Theorem 1.1.5. It defines a sort of "critical" value for ε, satisfying the following property:
PROOF. Fix x ∈ dom f. Then set u = 1 in (1.3.4) to obtain σ_{epi g*_x}(h, −1) = g_x(h),
i.e. (1.3.7), which is just a closed form for (1.3.6). □
Theorem 1.3.6 gives a converse to Proposition 1.3.1(vii): if, for a particular x ∈
dom f₁ ∩ dom f₂,
   ∂_ε f₁(x) ⊂ ∂_ε f₂(x)   for all ε > 0,
then
   f₁(y) − f₁(x) ≤ f₂(y) − f₂(x)   for all y ∈ ℝⁿ.
which also expresses a closed convex function as a supremum, but of affine functions.
The index s can be restricted to dom f*, or even to ri dom f*; another better idea:
restrict s to dom ∂f* (⊃ ri dom f*), in which case f*(s) = ⟨s, y⟩ − f(y) for some
y ∈ dom ∂f. In other words, we can write
This formula has a refined form, better suited to numerical optimization, where
only one subgradient at a particular point is usually known (see again the concept of
black box (U1) in §VIII.3.5).
   y_t := x + td ∈ ri dom f   for t ∈ ]0, 1].
Then
   f(x + d) ≥ f(y_t) + ⟨s(y_t), x + d − y_t⟩,
so that
   ⟨s(y_t), d⟩ ≤ (f(x + d) − f(y_t))/(1 − t).
Then write
Throughout this section, x is fixed in dom f. As a (nonempty) closed convex set, the
approximate subdifferential ∂_ε f(x) then has a support function, for any ε > 0. We
denote it by f′_ε(x, ·):
a closed sublinear function. The notation f′_ε is motivated by §VI.1.1: f′(x, ·) supports
∂f(x), so it is natural to denote by f′_ε(x, ·) the function supporting ∂_ε f(x). The
present section is devoted to a study of this support function, which is obtained via
an "approximate difference quotient".
Theorem 2.1.1 For x ∈ dom f and ε > 0, the support function of ∂_ε f(x) is
   ℝⁿ ∋ d ↦ f′_ε(x, d) = inf_{t>0} (f(x + td) − f(x) + ε)/t,   (2.1.1)
which will be called the ε-directional derivative of f at x.
PROOF. We use Proposition 1.3.3: embedding the set ∂_ε f(x) in the larger space ℝⁿ × ℝ,
we view it as the intersection of epi g*_x with the horizontal hyperplane
(rotate and contemplate thoroughly Fig. 1.3.1). Correspondingly, f′_ε(x, d) is the value
at (d, 0) of the support function of our embedded set epi g*_x ∩ H_ε:
   ε ∈ A(ri epi g*_x) = ri[A(epi g*_x)],
where the last equality comes from Proposition III.2.1.12. But we know from Theorem 1.1.2 and Proposition 1.3.3 that A(epi g*_x) is ℝ₊ or ]0, +∞[; in both cases, its relative
interior is ]0, +∞[, which contains the given ε > 0. Our assumption is checked.
As a result, the following problem
When ε > 0, this function does achieve its minimum on ℝ₊; in the t-language, this
means that the case t ↓ 0 never occurs in the minimization problem (2.1.1). On the
other hand, 0 may be the unique minimum point of r_ε, i.e. (2.1.1) may have no solution
"at finite distance".
which cannot be a support function: it is not even closed. On the other hand, the infimum in
(2.1.1), which is attained for x + td on the unit sphere, is easy to compute: for d ≠ 0,
   f′_ε(x, d) = { …  if ε > 0,
                 …  if ε ≤ 0.   (2.1.4)
Note: to obtain (2.1.4), it is fun to use elementary geometry (carry out in Fig. 2.1.1 the
inversion operation from the pole x), but the argument of Remark VI.6.3.7 is more systematic,
and is also interesting.
Remark 2.1.2 We have here an illustration of Example X.2.4.3: closing f′(x, ·) is just what
is needed to obtain the support function of ∂f(x). For x ∈ dom ∂f, the function
   ℝⁿ ∋ d ↦ f′(x, d) = lim_{t↓0} (f(x + td) − f(x))/t.
This property appears also in the proof of Theorem 2.1.1: for ε = 0, the closure operation
must still be performed after (2.1.2) is solved. □
Figure 2.1.2 illustrates (2.1.1) in a less dramatic situation: for ε > 0, the line representing
t ↦ f(x) − ε + t f′_ε(x, d) supports the graph of t ↦ f(x + td), not at t = 0 but at some
point t_ε > 0; among all the slopes joining (0, f(x) − ε) to an arbitrary point on the graph of
f(x + ·d), the right-hand side of (2.1.1) is the smallest possible one.
On the other hand, Fig. 2.1.2 is the trace in ℝ × ℝ of a picture in ℝⁿ × ℝ: among all
the possible hyperplanes passing through (x, f(x) − ε) and supporting epi f, there is one
touching gr f somewhere along the given d; this hyperplane therefore gives the maximal
slope along d, which is the value (2.0.1). The contact x + t_ε d plays an important role in
minimization algorithms and we will return to it later.
Fig. 2.1.2
The same picture illustrates Theorem 1.3.6: consider the point x + t_ε d as fixed and
call it y. Now, for arbitrary δ ≥ 0, draw a hyperplane supporting epi f and passing through
(x, f(x) − δ). Its altitude at y is f(x) − δ + f′_δ(x, y − x) which, by definition of a support,
is not larger than f(y), but equal to it when δ = ε.
In summary: fix (x, d) ∈ dom f × ℝⁿ and consider a pair (ε, y) ∈ ]0, +∞[ × ℝⁿ linked by
the relation y = x + t_ε d.
- To obtain y from ε, use (2.1.1): as a function of the "horizontal" variable t > 0, draw the
line passing through (x, f(x) − ε) and (x + td, f(x + td)); the resulting slope must be
minimized.
- To obtain ε from y, use (1.3.7): as a function of the "vertical" variable δ ≥ 0, draw a support
passing through (x, f(x) − δ); its altitude at y must be maximized.
Example 2.1.3 Take again the convex quadratic function of Example 1.2.2: f′_ε(x, d)
is the optimal value of the one-dimensional minimization problem
   inf_{t>0} (½⟨Qd, d⟩ t² + ⟨∇f(x), d⟩ t + ε)/t.
If ⟨Qd, d⟩ = 0, this infimum is ⟨∇f(x), d⟩. Otherwise, it is attained at
   t_ε = √(2ε/⟨Qd, d⟩)
and
   f′_ε(x, d) = ⟨∇f(x), d⟩ + √(2ε⟨Qd, d⟩).
This is a general formula: it also holds for ⟨Qd, d⟩ = 0.
It is interesting to observe that [f′_ε(x, d) − f′(x, d)]/ε → +∞ when ε ↓ 0, but
   (1/(2ε)) [f′_ε(x, d) − f′(x, d)]² ≡ ⟨Qd, d⟩.
This suggests that, for C² convex functions,
where
is the second derivative of f at x in the direction d. Here is one more motivation for
the approximate subdifferential: it accounts for second-order behaviour of f. □
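The closed formula of Example 2.1.3 is easily confronted with a brute-force minimization of the difference quotient in (2.1.1); a sketch with hypothetical data:

```python
import numpy as np

# Example 2.1.3 numerically: for the quadratic f(x) = 1/2 <Qx,x> + <b,x>,
# the eps-directional derivative (2.1.1) has the closed form
#   f'_eps(x, d) = <grad f(x), d> + sqrt(2*eps*<Qd,d>),
# attained at t_eps = sqrt(2*eps/<Qd,d>).

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([0.5, -2.0])
x, d, eps = np.array([1.0, 1.0]), np.array([1.0, -1.0]), 0.2

f = lambda y: 0.5 * y @ Q @ y + b @ y
quotient = lambda t: (f(x + t * d) - f(x) + eps) / t    # the quotient in (2.1.1)

ts = np.linspace(1e-3, 3.0, 20_000)
brute_force = min(quotient(t) for t in ts)

grad, qdd = Q @ x + b, d @ Q @ d
closed_form = grad @ d + np.sqrt(2 * eps * qdd)
print(brute_force, closed_form)             # should agree to grid accuracy
```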
Our aim is now to study the minimization of q_ε, a relevant problem in view of Theorem 2.1.1.
which describes the behaviour of f(x + td) for t → +∞ (see §IV.3.2 and Example X.2.4.6 for the asymptotic function f′_∞). The other,
concerns the case t ↓ 0: t_ℓ = 0 if f(x + td) has a "positive curvature for t ↓ 0". When
t_ℓ = +∞, i.e. f is affine on the half-line x + ℝ₊d, we have q_ε(t) = f′(x, d) + ε/t;
then it is clear that f′_ε(x, d) = f′(x, d) = f′_∞(d) and that T_ε is empty for ε > 0.
Example 2.2.1 Before making precise statements, let us see in Fig. 2.2.1 what can be expected, with the help of the example
The lower part of the picture gives the correspondence ε ↦ T_ε, with the same abscissa-axis
as the upper part (namely t). Some important properties are thus introduced:
- As already known, q increases from f′(x, d) = 1 to f′_∞(d) = 3;
- indeed, q(t) is constantly equal to its minimal value f′(x, d) for 0 < t ≤ 1/2, and t_ℓ = 1/2;
Fig. 2.2.1 (upper part: the quotient q, between f′(x, d) = 1 and f′_∞(d) = 3; lower part: the correspondence ε ↦ T_ε)
- f′_ε(x, d) stays between f′(x, d) and f′_∞(d), and reaches this last value for ε ≥ 1;
- T₀ is the segment ]0, 1/2]; T₁ is the half-line [1/2, +∞[, and T_ε is empty for ε > 1. □
This example reveals another important number associated with large t, which we will
call ε^∞ (equal to 1 in the example). The statements (i) and (ii) in the next result are already
known, but their proof is more natural in the t-language.
satisfies
   T_ε ≠ ∅ if ε < ε^∞ and T_ε = ∅ if ε > ε^∞.
Because f′(x, ·) ≤ f′_∞, this in turn means that q(t) is constant for t > 0, or that
t_ℓ = +∞.
Now observe that ε ↦ f′_ε(x, d) is increasing (just as ∂_ε f); so E, when nonempty,
is an interval containing 0: ε < ε^∞ implies ε ∈ E, hence T_ε ≠ ∅. Conversely, if
ε > ε^∞, take ε′ in ]ε^∞, ε[ (so ε′ ∉ E) and t > 0 arbitrary:
The example t ↦ f(x + td) = √(1 + t²) illustrates the case ε^∞ < +∞ (here
ε^∞ = 1); it corresponds to f(x + ·d) having the asymptote t ↦ f(x) − ε^∞ + f′_∞(d)t.
Also, one directly sees in (2.2.3) that ε^∞ = +∞ whenever f′_∞(d) is infinite. With the
example f of (1.1.5) (and taking d = 1), we have t_ℓ = 0, f′_∞(1) = 0 and ε^∞ = +∞.
(b) The Closed Convex Function r_ε. For a more accurate study of T_ε, we use now the
function r_ε of (2.1.3), obtained via the change of variable u = 1/t: r_ε(u) = q_ε(1/u)
for u > 0. It is the trace on ℝ of the function (1.3.4), which is known to be in
Conv(ℝⁿ × ℝ); therefore
   r_ε ∈ Conv ℝ.
PROOF. The whole point is to compute the subdifferential of the convex function
u ↦ ψ(u) := uφ(1/u), and this amounts to computing its one-sided derivatives.
Take positive u′ = 1/t′ and u = 1/t, with u ∈ dom r_ε (hence φ(1/u) < +∞), and
compute the difference quotient of ψ (cf. the proof of Theorem I.1.1.6):
   (u′φ(1/u′) − uφ(1/u))/(u′ − u) = φ(t) − t (φ(t′) − φ(t))/(t′ − t).
Knowing that r_ε(u′) = ψ(u′) − u′[f(x) − ε] for all u′ > 0, we readily obtain
   ∂r_ε(u) = φ(t) − t∂φ(t) − f(x) + ε = r_ε(u)/u − t∂φ(t). □
whose graph supports epi φ at the abscissa t. Its value ℓ(0) at the vertical axis is a subderivative
of u ↦ uφ(1/u) at u = 1/t, and t is optimal in (2.1.1) when ℓ(0) reaches the given
f(x) − ε. In this case, T_ε is the contact set between gr ℓ and gr φ. Note: convexity and
Proposition 2.2.2(iii) tell us that f(x) − ε^∞ ≤ ℓ(0) ≤ f(x). □
Let us summarize the results of this section concerning the optimal set T_ε, or U_ε
of (2.2.4).
- First, we have the somewhat degenerate case t_ℓ = +∞, meaning that f is affine
on the half-line x + ℝ₊d. This can be described by one of the following equivalent
statements:
   f′(x, d) = f′_∞(d);
   f′_ε(x, d) = f′(x, d) for all ε > 0;
   ∀t > 0, q_ε(t) > f′_∞(d) for all ε > 0;
   T_ε = ∅ for all ε > 0;
   U_ε = {0} for all ε > 0.
- The second situation, more interesting, is when t_ℓ < +∞; then three essentially
different cases may occur, according to the value of ε.
- When 0 < ε < ε^∞, one has equivalently
   f′(x, d) < f′_ε(x, d) < f′_∞(d);
   ∃ t > 0 such that q_ε(t) < f′_∞(d);
   T_ε is a nonempty compact interval;
   0 ∉ U_ε.
- When ε^∞ < ε:
   f′_ε(x, d) = f′_∞(d);
   ∀ t > 0, q_ε(t) > f′_∞(d);
   T_ε = ∅;
   U_ε = {0}.
- When ε = ε^∞:
   f′_ε(x, d) = f′_∞(d);
   T_ε is empty or unbounded;
   0 ∈ U_ε.
Note: in the last case, T_ε nonempty but unbounded means that f(x + td) touches
its asymptote f(x) − ε + t f′_∞(d) for t large enough.
   v(ε) := { −f′_ε(x, d)  if ε ≥ 0,
             +∞           otherwise.
Then v ∈ Conv ℝ: in fact, dom v is either [0, +∞[ or ]0, +∞[. When ε ↓ 0, v(ε)
tends to −f′(x, d) ∈ ℝ ∪ {+∞}.
For −ε > 0, we have trivially (remember that f′(x, d) < +∞)
and this confirms the relevance of the notation f′_∞ for an asymptotic function.
Theorem 2.3.2 With the notation of Proposition 2.3.1 and (2.2.4),
which exactly means that u ∈ U_ε. The rest follows from the conclusions of §2.2. □
Fig. 2.3.1 (ε ↦ f′_ε(x, d), between f′(x, d) = 1 and f′_∞(d) = 3)
Remark 2.3.3 Some useful formulae follow from (2.3.2): whenever T_ε ≠ ∅, we have
   f′_η(x, d) ≤ f′_ε(x, d) + (η − ε)/t   for all t ∈ T_ε,
   f′_η(x, d) = f′_ε(x, d) + (η − ε)/t̄_ε − o(η − ε)   if η ≥ ε,
   f′_η(x, d) = f′_ε(x, d) + (η − ε)/t̲_ε − o(η − ε)   if η ≤ ε,
where t̲_ε and t̄_ε are respectively the smallest and largest elements of T_ε, and the remainder
terms o(·) are nonnegative (of course, t̲_ε = t̄_ε except possibly for countably many values of
ε). We also have the integral representation (I.4.2.6)
A natural question is now: what happens when ε ↓ 0? We already know that
f′_ε(x, d) → f′(x, d), but at what speed? A qualitative answer is as follows (see also
Example 2.1.3).
   lim_{ε↓0} (f′_ε(x, d) − f′(x, d))/ε = 1/t_ℓ ∈ [0, +∞],
For fixed x and d, use the notation of Remark 2.2.4 and look again at Fig. 2.1.2, with
the results of the present section in mind: it is important to meditate on the correspondence
between the horizontal set of stepsizes and the vertical set of f-decrements.
To any stepsize t ≥ 0 and slope ⟨s, d⟩ of a line supporting epi φ at (t, φ(t)), is associated
a value
   ε_{t,s} := f(x) − ℓ(0) = f(x) − f(x + td) + t⟨s, d⟩ ≥ 0.   (2.3.2)
Likewise, to any decrement ε ≥ 0 from f(x) is associated a stepsize t_ε ≥ 0 via the slope
f′_ε(x, d). This defines a pair of multifunctions t ↦ −∂r(1/t) and ε ↦ T_ε, inverse to each
other, and monotone in the sense that
   for t ∈ T_ε and t′ ∈ T_{ε′},  ε > ε′ ⟺ t > t′.
To go analytically from t to ε, one goes first to the somewhat abstract set of inverse stepsizes
u, from which ε is obtained by the duality correspondence of Lemma 2.3.1. See also the lower
part of Fig. 2.2.1, for an instance of a mapping T (or rather its inverse).
Proposition 2.3.5 With the hypotheses and notation given so far, assume t_ℓ < +∞
(i.e. f is not affine on the whole half-line x + ℝ₊d). For t > 0, there is a unique
ε(t) ∈ [0, ε^∞[ such that
   {v(α) : 0 ≤ α < ε^∞} = −[f′(x, d), f′_∞(d)[.
Figure 2.3.2 emphasizes the difference between this ε(t) and the ε_{t,s} defined in
(2.3.2).
Fig. 2.3.2 (gr φ, with the levels f(x), f(x + td) and f(x) − ε(t))
From Theorem X.2.3.1, the conjugate of a sum of two functions f₁ + f₂ is the closure
of the infimal convolution f₁* ∔ f₂*. Expressing ∂_ε(f₁ + f₂)(x) will require an expression for this infimal convolution, which in turn requires the following basic assumption:
   When s ∈ dom(f₁ + f₂)*, (f₁ + f₂)*(s) = f₁*(p₁) + f₂*(p₂)   (3.1.1)
   for some pᵢ satisfying p₁ + p₂ = s.
This just expresses that the inf-convolution of f₁* and f₂* is exact at s = p₁ + p₂:
the couple (p₁, p₂) actually minimizes the function (p₁, p₂) ↦ f₁*(p₁) + f₂*(p₂).
Furthermore, we know (Theorem X.2.3.2) that this property holds under various
conditions on f₁ and f₂; one of them is
This s is therefore in dom(f₁ + f₂)* and we can apply (3.1.1): with the help of some
p₁ and p₂, we write (3.1.4) as ε₁ + ε₂ ≤ ε, where we have set
Let us compute the ε-subdifferential of this function, i.e. the ε-normal set of Definition 1.1.3. The approximate normal set to H_j⁻ has been given in Example 1.2.5; we
obtain for x ∈ C
   (3.1.7)
(and the normal cone N_C(x) can even be added to the left-hand set). Likewise, set
k(x) := min_j c_j(x). If k(x) > 0,
is not −∞. The ε-minimizers of f on C are those x ∈ C such that f(x) ≤ l_C + ε; clearly
enough, an ε-minimizer is an x such that (remember Theorem 1.1.5)
Here dom f = ℝⁿ: we conclude from Theorem 3.1.1 that an ε-minimizer is an x such
that
   0 ∈ ∂_α f(x) + N_{C,ε−α}(x)   for some α ∈ [0, ε],
i.e. f has at x an α-subgradient whose opposite lies in the (ε − α)-normal set to C. The
situation simplifies in some cases.
An ε-minimizer is then an x ∈ C for which there exists μ = (μ₁, ..., μ_m) ∈ ℝᵐ such that
   Σ_{j=1}^m μ_j s_j + s₀ = 0,
   Σ_{j=1}^m μ_j r_j + ⟨s₀, x⟩ ≤ ε,
   μ_j ≥ 0 for j = 1, ..., m. □
We leave it as an exercise to redo the above examples when
is described in standard form (K being a closed convex polyhedral cone, say the nonnegative
orthant).
Remark 3.1.5 In the space ℝ^{n₁} × ℝ^{n₂} equipped with the scalar product of a product
space, take a decomposable function:
Because of the calculus rule X.1.3.1(ix), the basic assumption (3.1.1) holds automatically, so we always have
Given g ∈ Conv ℝᵐ and an affine mapping A : ℝⁿ → ℝᵐ (Ax = A₀x + y₀ ∈ ℝᵐ
with A₀ linear), take f := g ∘ A ∈ Conv ℝⁿ. As in §3.1, we need an assumption,
which is in this context:
(⟨·, ·⟩ will denote indifferently the scalar product in ℝⁿ or ℝᵐ). As was the case with
(3.1.1), Theorem X.2.2.1 tells us that the above p actually minimizes the function
Theorem 3.2.1 Let g and A be defined as above. For all ε ≥ 0 and x such that
Ax ∈ dom g, there holds
   (3.2.3)
where we have used the property A(y − x) = A₀(y − x). Thus we have proved that
A₀*p ∈ ∂_ε(g ∘ A)(x).
Conversely, let s ∈ ∂_ε(g ∘ A)(x), i.e.
Apply (3.2.1): with the help of some p such that A₀*p = s, (3.2.4) can be written
   ε ≥ g*(p) − ⟨p, y₀⟩ + g(Ax) − ⟨p, A₀x⟩ = g*(p) + g(Ax) − ⟨p, Ax⟩.
This shows that p ∈ ∂_ε g(Ax). Altogether, we have proved that our s is in A₀* ∂_ε g(Ax).
□
Naturally, only the linear part of A counts in the right-hand side of (3.2.3): the
translation is taken care of by Proposition X.1.3.1(v).
As an illustration of this calculus rule, take x₀ ∈ dom g, a direction d ≠ 0, and
compute the approximate subdifferential of the function
We recall that, for g ∈ Conv ℝᵐ and A linear from ℝᵐ to ℝⁿ, the image of g under
A is the function Ag defined by
   ℝⁿ ∋ x ↦ (Ag)(x) := inf {g(y) : Ay = x}.   (3.3.1)
Once again, we need an assumption for characterizing the ε-subdifferential of Ag,
namely that the infimum in (3.3.1) is attained "at finite distance". A sufficient assumption for this is
   Im A* ∩ ri dom g* ≠ ∅,   (3.3.2)
which implies at the same time that Ag ∈ Conv ℝⁿ (see Theorem X.2.2.3). As already
seen for condition (X.2.2.0.iii), this assumption is implied by
   g′_∞(d) > 0 for all nonzero d ∈ Ker A.
Theorem 3.3.1 Let ε ≥ 0 and x ∈ dom Ag = A(dom g). Suppose that there is some
y ∈ ℝᵐ with Ay = x and g(y) = Ag(x); for example assume (3.3.2). Then
   ∂_ε(Ag)(x) = {s ∈ ℝⁿ : A*s ∈ ∂_ε g(y)}.   (3.3.3)
PROOF. To say that s ∈ ∂_ε(Ag)(x) is to say that
where we have made use of the existence and properties of y. Then apply Theorem X.2.1.1: (Ag)* = g* ∘ A*, so
   s ∈ ∂_ε(Ag)(x) ⟺ g*(A*s) + g(y) − ⟨A*s, y⟩ ≤ ε. □
This result can of course be compared to Theorem VI.4.5.1: the hypotheses are
just the same – except for the extended-valuedness possibility. Thus, we see that the
inverse image under A* of ∂_ε g(y_x) does not depend on the particular y_x optimal in
(3.3.1).
We know that a particular case is the marginal function:
   ℝⁿ ∋ x ↦ f(x) := inf {g(x, z) : z ∈ ℝᵖ},   (3.3.4)
where g ∈ Conv(ℝⁿ × ℝᵖ). Indeed, f is the image of g under the projection mapping
from ℝⁿ⁺ᵖ to ℝⁿ defined by A(x, z) = x. The above result can be particularized to
this case:
Corollary 3.3.2 With g ∈ Conv(ℝⁿ × ℝᵖ), let g* be associated with a scalar product
preserving the structure of ℝᵐ = ℝⁿ × ℝᵖ as a product space, namely:
   (3.3.5)
and consider the marginal function f of (3.3.4). Let ε ≥ 0, x ∈ dom f; suppose that
there is some z ∈ ℝᵖ such that g(x, z) = f(x); z exists for example when
   ∃ s₀ ∈ ℝⁿ such that (s₀, 0) ∈ ri dom g*.   (3.3.6)
Then
   ∂_ε f(x) = {s ∈ ℝⁿ : (s, 0) ∈ ∂_ε g(x, z)}.   (3.3.7)
PROOF. Set A : (x, z) ↦ x. With the scalar product (3.3.5), A* : ℝⁿ → ℝⁿ × ℝᵖ
is defined by A*s = (s, 0), Im A* = ℝⁿ × {0}, so (3.3.6) and (3.3.7) are just (3.3.2)
and (3.3.3) respectively. □
where f₁ and f₂ are both in Conv ℝⁿ. With m = 2n, and ℝ²ⁿ being equipped with
the Euclidean structure of a product space, this is indeed an image function:
Theorem 3.4.1 Let ε ≥ 0 and x ∈ dom(f₁ ∔ f₂) = dom f₁ + dom f₂. Suppose that
there are y₁ and y₂ such that the inf-convolution is exact at x = y₁ + y₂; this is the
case for example when
   ri dom f₁* ∩ ri dom f₂* ≠ ∅.   (3.4.2)
Then
so (Proposition III.2.1.11)
so that s ∈ ∂_ε f_i(y_i) for i = 1, 2. Particularizing (3.3.3) to our present situation:
If ∂f₁(y₁) ∩ ∂f₂(y₂) ≠ ∅, the inf-convolution is exact at x = y₁ + y₂ and equality
holds in (3.4.6).
in view of the Fenchel inequality (X.1.1.3), this is actually an equality, i.e. s ∈ ∂f(x).
Now use this last equality as a value for ⟨s, x⟩ in (3.4.7) to obtain
i.e. the inf-convolution is exact at x = y₁ + y₂; equality in (3.4.6) follows from (3.4.5).
□
Remark 3.4.3 Beware that D may be empty on the whole of H while ∂(f₁ ∔ f₂)(x) ≠ ∅.
Take for example
   f₁(y) = exp y and f₂(y) = exp(−y);
then f₁ ∔ f₂ ≡ 0, hence ∂(f₁ ∔ f₂) ≡ {0}. Yet D(y₁, y₂) = ∅ for all (y₁, y₂) ∈ H: the
inf-convolution is nowhere exact.
Note also that (3.4.5) may express the equality between empty sets: take for example
f₁ = I_{[0,+∞[} and
   ℝ ∋ y₂ ↦ f₂(y₂) = { −√y₂  if y₂ ≥ 0,
                       +∞    otherwise.
Example 3.4.4 (Moreau–Yosida Regularizations) For c > 0 and f ∈ Conv ℝⁿ, let
f_{(c)} := f ∔ (c/2)‖·‖², i.e.
   f_{(c)}(x) = min {f(y) + (c/2)‖x − y‖² : y ∈ ℝⁿ};
Using the approximate subdifferential of ½‖·‖² (see Example 1.2.2 if necessary),
Theorem 3.4.1 gives
   ∂_ε f_{(c)}(x) = ∪_{0≤α≤ε} [∂_{ε−α} f(x_c) ∩ B(c(x − x_c), √(2cα))].
Thus f_{(c)} is a differentiable convex function. It can even be said that ∇f_{(c)} is Lipschitzian with constant c on ℝⁿ. To see this, recall that the conjugate
   (f_{(c)})* = f* + (1/(2c))‖·‖²
is strongly convex with modulus 1/c; then apply Theorem X.4.2.1.
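For a one-dimensional feeling of f_{(c)}, take f = |·|: the minimizing y is then given by soft-thresholding, and ∇f_{(c)} is indeed c-Lipschitz (a sketch for this special case; the soft-thresholding formula is standard for this particular f, not taken from the text):

```python
import math

# Moreau-Yosida regularization of f = |.| on R (a sketch for this special case):
# f_(c)(x) = min_y |y| + (c/2)(x - y)^2.  The minimizer is the soft-threshold
# y_c = sign(x) * max(|x| - 1/c, 0), and grad f_(c)(x) = c*(x - y_c),
# i.e. c*x clipped to [-1, 1]: Lipschitz with constant c, as stated.

def moreau_abs(x, c):
    y_c = math.copysign(max(abs(x) - 1.0 / c, 0.0), x)   # argmin y
    value = abs(y_c) + 0.5 * c * (x - y_c) ** 2          # f_(c)(x): Huber-like
    grad = c * (x - y_c)                                 # = max(-1, min(1, c*x))
    return value, grad

for x in [-2.0, -0.1, 0.0, 0.3, 5.0]:
    v, g = moreau_abs(x, c=2.0)
    assert abs(g - max(-1.0, min(1.0, 2.0 * x))) < 1e-12
    print(f"x={x:5.2f}  f_(c)(x)={v:.4f}  grad={g:+.2f}")
```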
In the particular case where c = 1 and f is the indicator function of a closed
convex set C, f_{(c)} = ½d_C² (the squared distance to C) and x_c = P_C(x) (the projection
of x onto C). Using the notation of (1.1.6):
Just as in Example 3.4.4, we can particularize f_{[c]} to the case where f is the
indicator function of a closed convex set C: we get f_{[c]} = c d_C and therefore
   ∂_ε(c d_C)(x) = N_{C,ε}(x) ∩ B(0, c)   for all x ∈ C.
Thus, when ε increases, the set K′ in Fig. V.2.3.1 increases but stays confined within
B(0, c).
Remark 3.4.6 For x ∉ C_c[f], there is some y_c ≠ x yielding the infimum in (3.4.9).
At such a y_c, the Euclidean norm is differentiable, so we obtain that f_{[c]} is differentiable
as well and
   ∇f_{[c]}(x) = c (x − y_c)/‖x − y_c‖ ∈ ∂f(y_c).
Compare this with Example 3.4.4: here y_c need not be unique, but the direction y_c − x
depends only on x. □
An important issue is whether this minimization problem has a solution, and whether
the infimal value is a closed function of s. Here again, the situation is not simple when
the functions are extended-valued.
Theorem 3.5.1 Let f₁, ..., f_p be a finite number of convex functions from ℝⁿ to ℝ
and let f := max_j f_j; set m := min{p, n + 1}. Then s ∈ ∂_ε f(x) if and only if there
exist m vectors s_j ∈ dom f_j*, convex multipliers α_j, and nonnegative ε_j such that
   f*(s) = Σ_{j=1}^m α_j f_j*(s_j),
   Σ_{j=1}^m α_j f_j*(s_j) + f(x) ≤ ε + Σ_{j=1}^m α_j ⟨s_j, x⟩,
which we write
   Σ_{j=1}^m α_j [f_j*(s_j) + f_j(x) − ⟨s_j, x⟩] ≤ ε + Σ_{j=1}^m α_j f_j(x) − f(x).   (3.5.3)
Thus, if s ∈ ∂_ε f(x), i.e. if (3.5.3) holds, we can set
It is important to realize what (3.5.2) means. Its third relation (which, incidentally, could
be replaced by an equality) can be written
   Σ_{j=1}^m (ε_j + α_j e_j) ≤ ε,   (3.5.4)
where, for each j, the number
   e_j := f(x) − f_j(x)
is nonnegative and measures how close f_j comes to the maximal f at x. Using elementary
calculus rules for approximate subdifferentials, a set-formulation of (3.5.2) is
Example 3.5.2 Consider the function f⁺ := max{0, f}, where f : ℝⁿ → ℝ is convex. We
get
   ∂_ε(f⁺)(x) = ∪ {∂_δ(αf)(x) : 0 ≤ α ≤ 1, δ + f⁺(x) − αf(x) ≤ ε}.
Setting δ(α) := ε − f⁺(x) + αf(x), this can be written as
Each ε_j/α_j-subdifferential in (3.5.2) is constantly {s_j}: the ε_j's play no role and can be
eliminated from (3.5.4), which can be used as a mere definition
   ℝⁿ ∋ y ↦ f(y) = f(x) + max {−e_j + ⟨s_j, y − x⟩ : j = 1, ..., p}.
The constant term f(x) is of little importance, as far as subdifferentials are concerned.
Neglecting it, e_j thus appears as the value at x (the point where ∂_ε f is computed) of the j-th
affine function making up f.
Geometrically, ∂_ε f(x) dilates when ε increases, and describes a sort of spider web with
∂f(x) as "kernel". When ε reaches the value max_j e_j, ∂_ε f(x) stops at co{s₁, ..., s_p}. □
Finally, if ε = 0, (3.5.4) can be satisfied only by ε_j = 0 and α_j e_j = 0 for all j. We thus
recover the important Corollary VI.4.3.2. Comparing it with (3.5.6), we see that, for larger e_j
or η_j, the η_j-subgradient s_j is "more remote from ∂f(x)", and as a result, its weight α_j must
be smaller.
PROOF. [⇒] Let α ≥ 0 be a minimum of the function ψ_s in Theorem X.2.5.1; there
are two cases:
(a) α = 0. Because dom f = ℝⁿ, this implies s = 0 and (g ∘ f)*(0) = g*(0). The
characterization of s = 0 ∈ ∂_ε(g ∘ f)(x) is
Thus, (3.6.1) holds with ε₂ = ε, ε₁ = 0 – and α = 0 (note: ∂(0·f) ≡ {0} because f
is finite everywhere).
(b) α > 0. Then (g ∘ f)*(s) = αf*(s/α) + g*(α) and the characterization of
s ∈ ∂_ε(g ∘ f)(x) is
i.e.
   (αf)*(s) + αf(x) − αf(x) + g*(α) + g(f(x)) − ⟨s, x⟩ ≤ ε.
Then we obtain: s ∈ N_{C,ε}(x) if and only if there are nonnegative α, ε₁, ε₂, with
ε₁ + ε₂ = ε, such that
   s ∈ ∂_{ε₁}(αc)(x) and αc(x) + ε₂ ≥ 0.   (3.6.3)
Corollary 3.6.2 Let C be described by (3.6.2) and assume that c(x₀) < 0 for some
x₀. Then
   N_{C,ε}(x) = ∪ {∂_δ(αc)(x) : α ≥ 0, δ ≥ 0, δ − αc(x) ≤ ε}.   (3.6.4)
In particular, if x ∈ bd C, i.e. if c(x) = 0,
   N_{C,ε}(x) = ∪ {∂_ε(αc)(x) : α ≥ 0}.   (3.6.5)
PROOF. Eliminate ε₂ and let δ := ε₁ in (3.6.3) to obtain the set-formulation (3.6.4).
Then remember (VI.1.3.6) and use again the monotonicity of the multifunction δ ↦
∂_δ(αc)(x). □
We will see in this section that the multifunction ∂_ε f is much more regular when ε > 0
than the exact subdifferential, studied in §VI.6.2. We start with two useful properties,
stating that the approximate subdifferential (ε, x) ↦ ∂_ε f(x) has a closed graph,
and is locally bounded on the interior of dom f; see §A.5 for the terminology.
Proposition 4.1.1 Let {(ε_k, x_k, s_k)} be a sequence converging to (ε, x, s), with s_k ∈
∂_{ε_k} f(x_k) for all k. Then s ∈ ∂_ε f(x).
PROOF. By definition,
   f(y) ≥ f(x_k) + ⟨s_k, y − x_k⟩ − ε_k   for all y ∈ ℝⁿ.
Pass to the limit on k and use the lower semi-continuity of f. □
Proposition 4.1.2 Assume int dom f ≠ ∅; let δ > 0 and L be such that f is Lipschitzian with constant L on some ball B(x, δ), where x ∈ int dom f. Then, for all
δ′ < δ,
   ‖s‖ ≤ L + ε/(δ − δ′)   (4.1.1)
   Δ_H(∂_ε f(x), ∂_{ε′} f(x′)) ≤ (K / min{ε, ε′}) (‖x − x′‖ + |ε − ε′|).   (4.1.2)
PROOF. With d of norm 1, use (2.1.1): for any η > 0, there is t_η > 0 such that
   (4.1.3)
where we have used the notation q_ε(x, t) for the approximate difference quotient. By
assumption, we can let δ → +∞ in Proposition 4.1.2: there is a global Lipschitz
constant, say L, such that f′_ε(x, d) ≤ L. From
we therefore obtain
   1/t_η ≤ (2L + η)/ε.
Then we can write, using (2.1.1) and (4.1.3) again:
   … ≤ (2L‖x′ − x‖ + |ε′ − ε|)/t_η ≤ ((2L + η)/ε)(2L‖x′ − x‖ + |ε′ − ε|).
Remembering that η > 0 is arbitrary and inverting (x, ε) with (x′, ε′), we do
obtain
a property already illustrated by Fig. 1.1.2. Remember from §VI.6.2 that the multifunction ∂f need not be inner semi-continuous; when ε = 0, no inclusion resembling
the above can hold, unless x is fixed.
A local version of Theorem 4.1.3 can similarly be proved: (4.1.2) holds on the
compact sets included in int dom f. Here, we consider an extension of the result to
unbounded subdifferentials. Recall that the Hausdorff distance is not convenient for
unbounded sets. A better distance is obtained by comparing the bounded parts of
closed convex sets: for c ≥ 0, we take
Corollary 4.1.4 Let f ∈ Conv ℝⁿ. Suppose that S ⊂ dom f and ℓ > 0 are such
that ∂f(x) ∩ B(0, ℓ) ≠ ∅ for all x ∈ S. Then, for all c > ℓ, there exists K_c such that,
for all x, x′ in S and ε, ε′ positive,
   Δ_{H,c}(∂_ε f(x), ∂_{ε′} f(x′)) ≤ (K_c / min{ε, ε′}) (‖x′ − x‖ + |ε′ − ε|).
PROOF. Consider f_{[c]} := f ∔ c‖·‖ and observe that we are in the conditions of
Proposition 3.4.5: f_{[c]} is Lipschitzian on ℝⁿ and Theorem 4.1.3 applies. The rest
follows because the coincidence set of f_{[c]} and f contains S. □
Applying this result to an f finite everywhere, we obtain for example the following
local Lipschitz continuity:
Corollary 4.1.5 Let f : ℝⁿ → ℝ be convex. For any δ ≥ 0, there is K_δ > 0 such
that
   Δ_H(∂_ε f(x), ∂_ε f(x′)) ≤ (K_δ/ε) ‖x − x′‖   for all x and x′ in B(0, δ).
PROOF. We know from Proposition 4.1.2 that ∂_ε f is bounded on B(0, δ), so the result
is a straightforward application of Corollary 4.1.4. □
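The 1/ε factor in this estimate is genuine, as a quick experiment on f = |·| shows (a sketch reusing the closed form given after (1.1.5); the rate is measured just beyond the kink, where it is worst):

```python
# Corollary 4.1.5 on f = |.|: near the kink, the Lipschitz constant of
# x -> d_eps f(x) (in Hausdorff distance) indeed grows like K/eps.

def eps_subdiff_abs(x, eps):
    if x < -eps / 2: return (-1.0, -1.0 - eps / x)
    if x > eps / 2:  return (1.0 - eps / x, 1.0)
    return (-1.0, 1.0)

def hausdorff(a, b):
    """Hausdorff distance between two closed intervals (lo, hi)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

for eps in [1.0, 0.1, 0.01]:
    x, xp = 0.6 * eps, 0.61 * eps           # two points just beyond the kink
    d = hausdorff(eps_subdiff_abs(x, eps), eps_subdiff_abs(xp, eps))
    print(f"eps={eps:5.2f}  rate = {d / (xp - x):8.2f}  (~ constant/eps)")
```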
In §VI.6.3, we have seen that ∂f(x) can be constructed by piecing together limits of
subgradients along directional sequences. From a practical point of view, this is an
important property: for example, it is the basis for descent schemes in nonsmooth optimization, see Chap. IX. This kind of property is even more important for approximate
subdifferentials: remember from §II.1.2 that the only information obtainable from f
is a black box (U1), which computes an exact subgradient at designated points. The
concept of approximate subgradient is therefore of no use, as long as there is no "black
box" to compute one. Starting from this observation, we study here the problem of
constructing ∂_ε f(x) with the sole help of the same (U1).
Theorem 4.2.1 (A. Brøndsted and R.T. Rockafellar) Let there be given f ∈ Conv ℝⁿ,
x ∈ dom f and ε ≥ 0. For any η > 0 and s ∈ ∂_ε f(x), there exist x_η ∈ B(x, η) and
s_η ∈ ∂f(x_η) such that ‖s_η − s‖ ≤ ε/η.
PROOF. The data are x ∈ dom f, ε > 0 (if ε = 0, just take x_η = x and s_η = s!),
η > 0 and s ∈ ∂_ε f(x). Consider the closed convex function
It is nonnegative (Fenchel's inequality), satisfies φ(x) ≤ ε (cf. (1.2.1)), and its subdifferential is
   ∂φ(y) = ∂f(y) − {s}   for all y ∈ dom φ = dom f.
Perturb φ to the closed convex function
   ℝⁿ ∋ y ↦ ψ(y) := φ(y) + (ε/η)‖y − x‖,
whose subdifferential at y ∈ dom f is (apply Corollary 3.1.2: (3.1.2) obviously holds)
is written
   0 ∈ ∂f(x_η) − {s} + B(0, ε/η).
It remains to prove that x_η ∈ B(x, η). Using the nonnegativity of φ and optimality of
   ∂_ε f(x) ⊂ ∩_{η>0} ∪_{‖y−x‖≤η} {∂f(y) + B(0, ε/η)}.   (4.2.1)
Proposition 4.2.2 (Transportation Formula) With x and x′ in dom f, let s′ ∈
∂f(x′). Then s′ ∈ ∂_ε f(x) if and only if
   f(x′) ≥ f(x) + ⟨s′, x′ − x⟩ − ε.   (4.2.2)
PROOF. The condition is obviously necessary, since the relation of definition (1.1.1)
must in particular hold at y = x′. Conversely, for s′ ∈ ∂f(x′), we have for all y
   f(y) ≥ f(x′) + ⟨s′, y − x′⟩ = f(x) + ⟨s′, y − x⟩ + [f(x′) − f(x) + ⟨s′, x − x′⟩].
Fig. 4.2.1. Linearization errors and the transportation formula
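This transportation mechanism is the basic ingredient of the bundle methods of Chap. XIII: subgradients collected at past iterates are re-interpreted at the current point via their linearization errors. A minimal sketch on f = |·| (hypothetical iterates; e(x, x′, s′) as in Definition 4.2.3 below):

```python
# Transportation formula (4.2.2) in action: a subgradient s_i collected at x_i
# is an e_i-subgradient at the current point x, with linearization error
#   e_i = f(x) - f(x_i) - s_i*(x - x_i) >= 0   (Definition 4.2.3).
# Sketch on f = |.|; the "black box (U1)" returns f(y) and one s in df(y).

def black_box(y):                      # (U1) for f = |.|
    return abs(y), (1.0 if y >= 0 else -1.0)

x = 0.2
fx, _ = black_box(x)
bundle = []                            # pairs (s_i, e_i), all valid at x
for x_i in [-1.0, -0.3, 0.5, 2.0]:     # past iterates (hypothetical)
    f_i, s_i = black_box(x_i)
    e_i = fx - f_i - s_i * (x - x_i)   # linearization error at x
    assert e_i >= 0                    # convexity: the error is nonnegative
    bundle.append((s_i, e_i))
    # check (4.2.2): s_i is an e_i-subgradient at x, on a grid of y's
    assert all(abs(y) >= fx + s_i * (y - x) - e_i - 1e-12
               for y in [k / 10.0 for k in range(-50, 51)])
print(bundle)
```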
Definition 4.2.3 (Linearization Error) For (x, x′, s′) ∈ dom f × dom f × ℝⁿ, the
linearization error made at x, when f is linearized at x′ with slope s′, is the number
   e(x, x′, s′) := f(x) − f(x′) − ⟨s′, x − x′⟩.
This linearization error is of particular interest when s′ ∈ ∂f(x′); then, calling
The definition of s′ ∈ ∂f(x′) via the conjugate function can also be used:
The content of the transportation formula (4.2.2), illustrated by Fig. 4.2.1, is that
any s′ ∈ ∂f(x′) is an e(x, x′, s′)-subgradient of f at x; and also that it is not in a
tighter approximate subdifferential, i.e.
This latter property relies on the contact (at x′) between gr ℓ_{x′,s′} and gr f. In fact, a
result slightly different from Proposition 4.2.2 is:
or equivalently
   e(x, x′, s′) + η ≤ ε.
PROOF. [(i)] Apply Proposition 4.1.2: there exist δ > 0 and a constant K such that,
for all x′ ∈ B(x, δ) and s′ ∈ ∂f(x′),
   e(x, x′, s′) ≤ |f(x′) − f(x)| + ‖s′‖ ‖x′ − x‖ ≤ 2K‖x′ − x‖;
see Fig. 4.2.1: we insert a point between x and x′. A consequence of (4.2.4) will be
Y_ε(x) ⊂ cl V_ε(x) (let t ↑ 1).
To prove our claim, set d := x′ − x and use the function r = r₀ of (2.1.3) to
realize with Remark 2.2.4 that
Then we take arbitrary 1/t = u > 1 and s″ ∈ ∂f(x + td), so that −e(x, x + td, s″) ∈
∂r(u). The monotonicity property (VI.6.1.1) or (I.4.2.1) of the subdifferential ∂r gives
hence e(x, x + td, s″) ≤ e(x, x′, s′) ≤ ε, which proves (4.2.4).
There remains to prove that Y_ε(x) is closed, so let {x_k} be a sequence of Y_ε(x)
converging to some x′. To each x_k, we associate s_k ∈ ∂f(x_k) ∩ ∂_ε f(x); {s_k} is
bounded by virtue of Theorem 1.1.4: extracting a subsequence if necessary, we may
assume that {s_k} has a limit s′. Then, Proposition 4.1.1 and Theorem 1.1.4 show that
s′ ∈ ∂f(x′) ∩ ∂_ε f(x), which means that x′ ∈ Y_ε(x). □
From a practical point of view, the set V_ε(x) and the property (i) above are both important:
provided that y is close enough to x, the s(y) ∈ ∂f(y) computed by the black box (U1)
is guaranteed to be an ε-subgradient at x. Now, V_ε(x) and Y_ε(x) differ very little (they
are numerically indistinguishable), and the latter has a strong intrinsic value: by definition,
x′ ∈ Y_ε(x) if and only if
   ∃ s′ ∈ ∂_ε f(x) such that s′ ∈ ∂f(x′), i.e. such that x′ ∈ ∂f*(s′).
Furthermore the above proof, especially (4.2.5), establishes a connection between Y_ε(x)
and §2.2. First, Y_ε(x) − {x} is star-shaped. Also, consider the intersection of Y_ε(x) with a
direction d issuing from x. Keeping (4.2.3) in mind, we see that this set is the closed interval
where the perturbed difference quotient q_ε is decreasing. In a word, let t_ε(d) be the largest
element of T_ε = T_ε(d) defined by (2.2.2), with the convention t_ε(d) = +∞ if T_ε is empty or
unbounded (meaning that the approximate difference quotient is decreasing on the whole of
]0, +∞[). Then
   Y_ε(x) = {x + td : t ∈ [0, t_ε(d)], d ∈ B(0, 1)}.   (4.2.7)
See again the geometric interpretations of §2, mainly the end of §2.3.
Our neighborhoods V_ε(x) and Y_ε(x) enjoy some interesting properties if additional assumptions are made on f:
Proposition 4.2.6
(i) If f is 1-coercive, then Y_ε(x) is bounded.
(ii) If f is differentiable on ℝⁿ, then V_ε(x) = Y_ε(x) and ∇f(V_ε(x)) ⊂ ∂_ε f(x).
(iii) If f is 1-coercive and differentiable, then ∇f(V_ε(x)) = ∂_ε f(x).
PROOF. In case (i), f* is finite everywhere, so the result follows from (4.2.6) and the
local boundedness of the subdifferential mapping (Proposition VI.6.2.2).
When f is differentiable, the equality between the two neighborhoods has already
been observed, and the stated inclusion comes from the very definition of V_ε(x).
Finally, let us establish the converse inclusion in case (iii): for s ∈ ∂_ε f(x), pick
y ∈ ∂f*(s), i.e. s = ∇f(y). The result follows from (4.2.6): y ∈ Y_ε(x) = V_ε(x).
□
(see Example 4.2.7). However, f is minimal on the unit ball B(0, 1), i.e. 0 ∈ ∂f(x′)
for all x′ ∈ B(0, 1); furthermore, 0 ∉ ∂_ε f(x). In view of its definition (4.2.3), V_ε(x)
therefore does not meet B(0, 1):
On the other hand, it suffices to remove B(0, 1) from D, and this can be seen from the
definition (4.2.3) of Y_ε(x): the linearization error e(x, ·, ·) is left unperturbed by the
max-operation defining f. In summary:
   Y_ε(x) = {(ξ, η) : (ξ − 1)² + (η − 1)² ≤ 1/2, ξ² + η² ≥ 1}.
We thus obtain a nonconvex neighborhood; observe its star-shaped character. □
Another expression is
which shows that V*_ε(x) is closed and convex. Reproduce the proof of Proposition 4.2.5(i) to see that V*_ε(x) is a neighborhood of x. For the quadratic function
of Example 4.2.7, V*_ε(x) and V_ε(x) coincide.
Introduction. The subject of this chapter is by far the most important application of convex
minimization, namely general decomposition in mathematical programming (more exactly:
price decomposition), and dual algorithms. It can be safely ascertained that this subject nearly
coincides with convex minimization, studied in Chaps. VIII and IX:
- on the one hand, algorithms for convex minimization have their best-suited field of applica-
tion in decomposition, more precisely the problem of price-adjustment (another important
field concerns eigenvalue optimization of a varying matrix but there, nonconvexity comes
quickly into play);
- on the other hand, when decomposing a problem - more exactly: when adjusting prices
via a decentralization algorithm in a (usually large-scale) optimization problem -, one is
primarily minimizing a certain convex function, namely the dual function associated with
the problem; and there is no way around this.
   sup φ(u),   u ∈ U,   c_j(u) = 0 for j = 1, ..., m,   (1.1.1)
which we will call the primal problem. As usual, a u ∈ U satisfying the constraints
c_j(u) = 0 will be called feasible. Throughout this chapter, without further mention,
the following assumption will be in force:
We will see in §1.2 how abstract, or how concrete, U can be. This implies in particular that the
objective- and constraint-functions have no structure either, such as convexity, or a
fortiori differentiability: in U, these words are meaningless for the moment. They
will of course appear (to be solvable, an optimization problem such as (1.1.1) must
enjoy some structure), but much later, and it is useful to see how far the theory can
be developed in abstracto. Observe also that assuming φ and each c_j to be finite
everywhere is not really a restriction: if these functions were extended-valued, U
could be replaced by its intersection with their domains.
Actually, the only structure assumed concerns the sets of objective and constraint
values, which are ℝ and ℝᵐ respectively. Then we equip ℝᵐ with the ordinary dot
product: for (λ, c) ∈ ℝᵐ × ℝᵐ,
   λᵀc := Σ_{j=1}^m λ_j c_j.   (1.1.2)
The associated norm will be ‖·‖. As far as the space of constraint values is concerned,
and this space only, we are therefore in the general framework of this book, with scalar
products, duality, metric, and so on.
Then, denoting by c(u) := (c₁(u), ..., c_m(u)) ∈ ℝᵐ the vector of constraint
values at u ∈ U, we can consider the Lagrange function, or Lagrangian, defined
by
   L(u, λ) := φ(u) − λᵀc(u)   for all λ ∈ ℝᵐ and u ∈ U.   (1.1.3)
For a given λ ∈ ℝᵐ, this appears as a perturbation of the objective function of (1.1.1),
which simply says the following: a violation of the constraints is accepted but then, a
price λᵀc(u) must be paid when the control value is u ∈ U.
Remark 1.1.1 Naturally, (1.1.3) is the same Lagrange function that played a central role in
Chap. VII. Once again, however, it is important for a better understanding to realize that L is
defined here on U × ℝᵐ. By contrast, Chap. VII heavily relied on the situation U = ℝⁿ – so
as to define subdifferentiability of φ and c – and little on the concept of a varying λ.
With respect to Chap. VII, beware of the minus sign in (1.1.3); it comes from the fact
that we start from a maximization problem, and will be motivated later. □
No theoretical property is assumed on the data (U, φ, c), other than the duality
structure induced by (1.1.2). For practical purposes, however, we do make a heavy
assumption:
Assumption 1.1.2 The problem
   sup {L(u, λ) : u ∈ U},   (1.1.4)_λ
where λ is fixed in ℝᵐ, is considerably simpler than the primal problem (1.1.1).
We will call (1.1.4)_λ the Lagrange problem associated with λ. □
To quantify a little bit the wording "considerably simpler", we can say for example:
- the simpler (1.1.4)_λ is, the more efficient the approach of this chapter will be;
- Assumption 1.1.2 holds when an efficient methodology exists for (1.1.4)_λ, but not
for (1.1.1);
- or when it costs more to solve (1.1.1) once than to solve (1.1.4)_λ 10²–10⁴ times,
say;
- this still loose quantification can be slightly sharpened by saying that we will rather
be in the 10²-range if (1.1.4)_λ enjoys some more theoretical properties (convexity),
and rather in the 10⁴-range if not.
However vague it is, Assumption 1.1.2 is fundamental; it has to be supplemented
by one more practical assumption, hardly more meaningful (but certainly fairly usual
in mathematical programming):
Assumption 1.1.3 One will be content with an approximate solution of (1.1.1), i.e.
with u ∈ U such that
- φ(u) is possibly slightly less than the supremal value in (1.1.1),
- and/or c(u) is possibly not exactly 0 ∈ ℝᵐ. □
When faced with a practical optimization problem, there are several ways of formulating
the data (U, φ, c) of (1.1.1): constraints may be incorporated in φ via some penalty term
(§VII.3.2), or its extreme version, which is the indicator function of a feasible set; they can
also be considered as making up U; or as making up c, and so on. To choose a formulation
adapted to our present framework, Assumptions 1.1.2 and 1.1.3 must be the central concern.
The case of Assumption 1.1.2 is linked to the decomposability of (1.1.1) and will be seen
more closely in §1.2; as for Assumption 1.1.3, it distinguishes two categories of constraints:
- the hard constraints, which must absolutely be satisfied by u; they will most conveniently
appear in the definition of U;
- the soft constraints, for which some tolerance is accepted; they can go in the c-set.
The methods of this chapter are aimed at solving (1.1.1) with the help of (1.1.3).
A black box (the black box (U1) of Fig. II.1.2.1 that keeps appearing in this book) is
available which, given λ ∈ ℝᵐ, solves (1.1.4)_λ and returns appropriate information.
Our problem in this chapter is therefore as follows:
Problem 1.1.4 Find some suitable value of λ such that the associated Lagrange
problem, as solved by the black box, provides a solution of the primal problem.
The unknown λ will be called the dual variable. □
Seen from the point of view of Problem 1.1.4, the situation is therefore as illustrated by
Fig. 1.1.1: the coordinator (sometimes called the master program) chooses a price
vector λ, sends it to the black box (sometimes called the local problem), and receives in
return information consisting of a vector (c, L) ∈ ℝᵐ × ℝ. The duty of the coordinator
is then to decide about optimality in terms of (1.1.1), and if not satisfied, to modify λ
accordingly.
Remark 1.1.6 The temporary Assumption 1.1.5(i) is of a theoretical nature; it holds essentially if U is a compact set, on which L(·, λ) is upper semi-continuous. A particular and
important case is when U is simply a finite set.
Part (ii), concerning the nature and amount of information computed by the black box, is
typical and useful. We have selected it because it serves best our purpose, mainly illustrative.
Other situations are possible, though:
- A very powerful black box, able to return the maximal L together with c(u) for all
optimal u's, could be conceived of; but we will see in the next sections that it usually makes
little sense.
- By contrast, more meaningful would be a weaker black box, only able to compute some
suboptimal solution, i.e. a u_{λ,ε} ∈ U satisfying, for some ε > 0:
This situation is of theoretical interest: an ε-optimal solution exists whenever the supremum in (1.1.4)_λ is finite; also, from a practical point of view, it demands less from the black
box. This last advantage should not be over-rated, though: for efficiency, ε must be decided
by the coordinator; and just for this reason, life may not be easier for the black box.
- We mention that the values of L and c are readily available after (1.1.4)_λ is solved; a situation
in which only L is returned is hardly conceivable; this poor information would result in a
poor coordinator anyway. □
Let us mention one last point: L(u_λ, λ) is a well-defined number, namely the optimal
value in (1.1.4)_λ; but c(u_λ) is not a well-defined vector, since it depends on the particular
solution selected by the black box. We will see in the next sections that (1.1.4)_λ has a unique
solution for almost all λ ∈ ℝᵐ – and this property holds independently of the data (U, φ, c).
Nevertheless, there may exist "critical" values of λ for which this uniqueness does not hold.
This question of uniqueness is crucial, and comes on top of the practical Assumption 1.1.2 to condition the success of the approach. If (1.1.4)_λ has a unique maximizer for
each λ ∈ ℝᵐ, then the dual approach is extremely powerful; difficulties begin in case of
non-uniqueness, and then some structure in (U, φ, c) becomes necessary.
1.2 Examples
(a) The Knapsack Problem. Our first example is the simplest instance of combinatorial problems, in which U is a finite or countable set. One has a knapsack and one
considers putting in it n objects (toothbrush, saucepan, TV set, ...), each of which
has a price pⁱ – expressing how much one would like to take it – and a volume vⁱ.
Then one wants to make the knapsack of maximal price, knowing that it has a limited
volume v.
For each object, two decisions are possible: take it or not; U is the set of all possible
such decisions for the n objects, i.e. U can be identified with the set {1, ..., 2ⁿ}. To
each u ∈ U are associated the objective- and constraint-values, namely the sum of
respectively the prices and volumes of all objects taken. The problem is solvable in
a finite time, but quite a large time if n is large; actually, problems of this kind are
extremely difficult.
Now consider the Lagrange problem: m = 1 and the number λ is a penalty
coefficient, or the "price of space" in the knapsack: when the i-th object is taken, the
payoff is no longer pⁱ but pⁱ − λvⁱ. The Lagrange problem is straightforward: for
each object i,
   if pⁱ − λvⁱ > 0, take the object;
   if pⁱ − λvⁱ < 0, leave it;
   if pⁱ − λvⁱ = 0, do what you want.
While making each of these n decisions, the corresponding terms are added to the
Lagrange- and constraint-values, and the job to be done in Fig. 1.1.1 is clear enough;
see the sketch below.
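A minimal sketch of such a black box (hypothetical data; the coordinator of Fig. 1.1.1 would call it repeatedly with varying λ and inspect the returned pair (c, L)):

```python
# Black box (U1) for the knapsack Lagrange problem (1.1.4)_lambda:
# maximize over u in {0,1}^n the Lagrangian
#   L(u, lam) = sum_i p[i]*u[i] - lam * (sum_i v[i]*u[i] - vol),
# which splits into n independent decisions: take object i iff p[i] - lam*v[i] > 0.

def knapsack_black_box(lam, p, v, vol):
    u = [1 if p_i - lam * v_i > 0 else 0 for p_i, v_i in zip(p, v)]
    c = sum(v_i * u_i for v_i, u_i in zip(v, u)) - vol   # constraint value c(u)
    L = sum(p_i * u_i for p_i, u_i in zip(p, u)) - lam * c
    return u, c, L

p = [10.0, 7.0, 4.0, 3.0]          # prices (hypothetical data)
v = [4.0, 3.0, 2.0, 2.0]           # volumes
vol = 5.0                          # knapsack capacity

for lam in [0.0, 1.6, 2.6, 5.0]:   # the coordinator adjusts the "price of space"
    u, c, L = knapsack_black_box(lam, p, v, vol)
    print(f"lam={lam:4.1f}  u={u}  c(u)={c:+5.1f}  L={L:7.2f}")
```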
Some preliminary observations are worth mentioning.
- U is totally unstructured: for example, what is the sum of two decisions? On the other hand, the Lagrange function defines, via the coefficients p^i − λv^i, an order in U, which depends on λ;
- from its interpretation, λ should be nonnegative: there is no point in giving a bonus to bulky objects like TV sets;
- there always exists an optimal decision in the Lagrange problem (1.1.4)_λ; and it is well-defined (unique) except when λ is one of the n values p^i/v^i.
Here is a case where the practical Assumption 1.1.2 is "very true". It should not be
hastily concluded, however, that the dual approach is going to solve the problem easily:
in fact, the rule is that Problem 1.1.4 is itself hard when (1.1.1) is combinatorial (so the
"law of conservation of difficulty" applies). The reason comes from the conjunction
of two bad things: uniqueness in the Lagrange problem does not hold - even though it
holds "most of the time" - and U has no nice structure. On the other hand, the knapsack
problem is useful to us, as it illustrates some important points of the approach.
The problem can be given an analytical flavour: assign to the control variable u^i the value 1 if the i-th object is taken, 0 if not. Then we have to solve

    max Σ_{i=1}^n p^i u^i   subject to   u^i ∈ {0,1},   Σ_{i=1}^n v^i u^i ≤ v.
In order to fit with the equality-constrained form (1.1.1), a nonnegative slack variable u⁰ is appended and we obtain the 0-1 programming problem

    max φ(u) := Σ_{i=1}^n p^i u^i  [+ 0·u⁰]
    c(u) := Σ_{i=1}^n v^i u^i + u⁰ − v = 0                             (1.2.3)
    u⁰ ≥ 0,  u^i ∈ {0,1} for i = 1, ..., n   [⇔ u ∈ U].
Needless to say, the formulation (1.2.3) should not hide the fact that U = ℝ₊ × {0,1}^n is still unstructured: one can now define, say, ½(u₁ + u₂) for u₁ ∈ U and u₂ ∈ U; but the half of a toothbrush has little to do with a toothbrush. Another way of saying the same thing is as follows: the constraints u^i ∈ {0,1} can be formulated as

    0 ≤ u^i ≤ 1  and  u^i(1 − u^i) = 0.

Then these constraints are hard; by contrast, the constraint c(u) = 0 can be considered as soft: if some tolerance on the volume of the knapsack is accepted, the constraint does not have to be strictly satisfied.
(b) The Constrained Lotsizing Problem. Suppose a machine produces objects, for
example bottles, "in lots", i.e. at a high rate. Producing one bottle costs p (French
Francs, say); but first, the machine must be installed and there is a setup cost S to
produce one lot. Thus, the cost for producing u bottles in v lots is
Sv+ pu.
When produced, the bottles are sold at a slow rate, say r bottles per unit time. A stock
is therefore formed, which costs s per bottle and per unit time. Denoting by u(t) the
number of bottles in the stock at time t, the total inventory cost over a period [t1, t2] is
    s ∫_{t₁}^{t₂} u(t) dt,

so that the total cost for producing and storing u bottles in v lots comes to

    pu + Sv + (s/2)·u²/(rv).                                           (1.2.5)
Between the two extremes (one lot = high inventory vs. many lots = high setup), there
is an optimum, the economic lotsize, obtained when v minimizes (1.2.5), i.e. when
[Fig. 1.2.1: evolution of the stock u(t) over time.]

    v = u √(s/(2rS)).                                                  (1.2.6)
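As a quick numerical check of formula (1.2.6) as reconstructed above (our own sketch; the data values are arbitrary, not from the book):

    import math
    # Arbitrary illustrative data.
    p, S, s, r, u = 1.0, 50.0, 0.1, 10.0, 1000.0
    cost = lambda v: p * u + S * v + 0.5 * s * u**2 / (r * v)   # formula (1.2.5)
    v_star = u * math.sqrt(s / (2 * r * S))                     # formula (1.2.6)
    # v_star should (approximately) minimize the cost over v > 0:
    assert cost(v_star) <= min(cost(0.9 * v_star), cost(1.1 * v_star))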
Furthermore, production and setup take time, and the total availability of the machine may be limited, say

    t_p u + t_s v ≤ T.

N.B. These times are not taken into account in Fig. 1.2.1 (which assumes t_p = t_s = 0), but they do not change the essence of (1.2.5); also, v is allowed non-integer values, which simply means that sales will continue to deplete the stock left at time t₂.
Now suppose that several machines, and possibly several kinds of bottles, are
involved, each with its own cost and time characteristics. The problem is schematically
formulated as
    min Σ_{i=1}^n Σ_{j=1}^m φ_ij(u_ij, v_ij)                           (i)
    Σ_{j=1}^m u_ij = D_i   for i = 1, ..., n                           (ii)
                                                                       (1.2.7)
    Σ_{i=1}^n (a_ij u_ij + b_ij v_ij) ≤ T_j   for j = 1, ..., m        (iii)
    u_ij ≥ 0, v_ij ≥ 0 for all i and j,                                (iv)

where the cost is given as in (1.2.5), say

    φ_ij(u, v) := 0                            if u = v = 0,
                  α_ij u + β_ij v + γ_ij u²/v  if u, v > 0,
                  +∞                           otherwise.
The notation is clear enough; D_i is the total (known) demand in bottles of type i, T_j is the total availability of machine j; all data are positive. To explain the +∞-value for φ_ij, observe that a positive number of units cannot be processed in 0 lots.
Here, the practical Assumption 1.1.2 comes into play because, barring the time-constraints (iii) in (1.2.7), the problem would be very easy. First of all, the variables v_ij would become unconstrained; assuming each u_ij fixed, an economic lotsize formula like (1.2.6) would give the optimal v_ij. The resulting problem in u would then split into independent problems, one for each product i.
More specifically, define the Lagrange function by "dualizing" the constraints (1.2.7)(iii), and optimize the result with respect to v: for each product i, this yields the problem

    min Σ_{j=1}^m [α_ij + λ_j a_ij + 2√(γ_ij(β_ij + λ_j b_ij))] u_j    (1.2.10)_λ

over the simplex

    U^i := {u : Σ_{j=1}^m u_j = D_i,  u_j ≥ 0 for j = 1, ..., m},      (1.2.9)
admitting that u_j stands for u_ij, but with i fixed. As was the case with the knapsack problem, the dualized constraints are intrinsic inequalities and the infimum is −∞ if λ_j < 0 for some j. Otherwise, we have to minimize a linear objective function on the convex hull of the m extreme points of U^i; it suffices to take one best such extreme point, associated with some j(i) giving the smallest of the following m numbers:

    α_ij + λ_j a_ij + 2√(γ_ij(β_ij + λ_j b_ij)),   j = 1, ..., m.      (1.2.11)
In addition to Ω, the data of the problem is the m + 1 functions ψ, γ₁, ..., γ_m, which send (x, t) ∈ Ω × ℝ to ℝ. The Lagrangian is the integral over Ω of the function
    ℓ(x, u, λ) := ψ(x, u(x)) − Σ_{j=1}^m λ_j γ_j(x, u(x)).
This type of problem appears all the time when some measurements (characterized by the functions γ_j) are made on u; one then seeks the "most probable" u (in the sense of the entropy −ψ) compatible with these measurements.

In a sense, we are in the decomposable situation (1.2.1), but with "sums of infinitely many terms". Indeed, without going into technical details from functional analysis, suppose that the maximization of ℓ with respect to u ∈ L₁(Ω, ℝ) makes sense, and that the standard optimality conditions hold: given λ ∈ ℝ^m, we must solve for each x ∈ Ω the equation in t ∈ ℝ:

    ∂ψ/∂t (x, t) − Σ_{j=1}^m λ_j ∂γ_j/∂t (x, t) = 0.                   (1.2.13)

Take a solution giving the largest ℓ (if there are several), call the result u_λ(x), and this gives the output from the black box.
The favourable cases are those where the function t ↦ ψ(x, t) is strictly concave and each function t ↦ γ_j(x, t) is affine. A typical example is one in which ψ is an entropy (usually not depending on x), as in (1.2.14) or (1.2.15), the j-th constraint reading

    ∫_Ω a_j(x) u(x) dx − b_j = 0,

where each a_j ∈ L_∞(Ω, ℝ) is a given function, for example a_j(x) = cos(2πjx) (Fourier constraints).
Here, the practical Assumption 1.1.2 is satisfied, at least in the case (1.2.14): the Lagrange problem (1.2.13) has the explicit solution (1.2.16). By contrast, (1.2.12) has "infinitely many" variables; even with a suitable discretization, it will likely remain a cumbersome problem.
Remark 1.2.1 On the other hand, the technical details already alluded to must not be forgotten. Take for example (1.2.15): the optimality conditions (1.2.13) for u_λ give (1.2.17). The set of λ for which this function is nice and log-integrable is now much more complicated than the nonnegative orthant of the two previous examples (a) and (b). Actually, several delicate questions have popped up in this problem: when does the Lagrange problem have a solution? and when do the optimality conditions (1.2.13) hold at such a solution? These questions have little to do with our present concern, and are left unanswered here. □
As long as the entropy ψ is strictly concave and the constraints γ_j are affine (and barring the possible technical difficulties from functional analysis), the dual approach is here very efficient. The reason, now, is that the u_λ of the black box is unambiguous: the Lagrange problem has a well-defined unique solution, which behaves itself when λ varies.
2 The Necessary Theory

Throughout this section, which requires some attention from the reader, our leading motivation is as follows: we want to solve the u-problem (1.1.1); but we have replaced it by the λ-problem 1.1.4, which is vaguely stated. Our first aim is to formulate a more explicit λ-problem (hereafter called the dual problem), which will turn out to be very well-posed. Then, we will examine questions regarding its solvability, and its relevance: to what extent does the dual problem really solve the primal u-problem that we started from?

The data (U, φ, c) will still be viewed as totally unstructured, up to the point where we need to require some specific properties from them; in particular, Assumption 1.1.5 will not be in force, unless otherwise specified.
To solve our problem, we must at least find a feasible u in (1.1.1). This turns out to be also sufficient, and the reason lies in what is probably the most important practical link between (1.1.1) and (1.1.4)_λ:
Theorem 2.1.1 (H. Everett) Fix λ ∈ ℝ^m; suppose that (1.1.4)_λ has an optimal solution u_λ ∈ U and set c_λ := c(u_λ) ∈ ℝ^m. Then u_λ is also an optimal solution of

    max φ(u),   u ∈ U,   c(u) = c_λ.                                   (2.1.1)

PROOF. For any u ∈ U, the definition of u_λ gives φ(u) − λᵀc(u) ≤ φ(u_λ) − λᵀc_λ. If, in addition, c(u) = c_λ, then u is feasible in (2.1.1) and the above relation becomes φ(u) ≤ φ(u_λ). □
Thus, once we have solved the Lagrange problem (1.1.4h for some particular A,
we have at the same time solved a perturbation of (1.1.1), in which 0 is replaced by
an a posteriori right-hand side. Immediate consequence:
Corollary 2.1.2 If, for some λ ∈ ℝ^m, (1.1.4)_λ happens to have a solution u_λ which is feasible in (1.1.1), then this u_λ is a solution of (1.1.1). □

We are thus led to solving the equation (in λ)

    c_λ := c(u_λ) = 0,                                                 (2.1.2)

where u_λ is given by the black box; Corollary 2.1.2 then says that it suffices to solve this system. As stated, the problem is not simple: the mapping λ ↦ c_λ is not even well-defined since u_λ, as returned by the black box, is ambiguous.
Example 2.1.3 Take the knapsack problem (1.2.4). In view of the great simplicity of the black box, c_λ is easy to compute. For λ ≥ 0 (otherwise there is no u_λ and no c_λ), u⁰ = 0 does not count; with the index-set J(λ) defined in (1.2.4), the left-hand side of our (univariate) equation (2.1.2) is obviously

    c_λ = Σ_{i∈J(λ)} v^i − v.

When λ varies, c_λ stays constant except at the branch-points where J(λ) changes; then c_λ jumps to another constant value. Said otherwise (see also the text following Remark 1.1.6), the mapping λ ↦ c_λ, which is ill-defined as it depends on the black box, is discontinuous. Such wild behaviour makes it hard to imagine an efficient numerical method. □
Fortunately, another result, just as simple as Theorem 2.1.1, helps a great deal.
First of all, we consider a problem more tolerant than (2.1.2):
    Find λ such that there is u ∈ U maximizing L(·, λ) and satisfying c(u) = 0.    (2.1.3)
Definition 2.1.4 (Dual Function) In the Lagrange problem (1.1.4)_λ, the optimal value is called the dual function, denoted by θ:

    θ(λ) := sup {L(u, λ) : u ∈ U}.                                     (2.1.4)
Theorem 2.1.5 (Weak Duality) For all λ ∈ ℝ^m and all u feasible in (1.1.1), there holds

    θ(λ) ≥ φ(u).                                                       (2.1.5)

PROOF. For such a u, c(u) = 0, hence θ(λ) ≥ L(u, λ) = φ(u) − λᵀc(u) = φ(u). □

The best possible such bound is obtained by solving the dual problem

    inf {θ(λ) : λ ∈ ℝ^m}.                                              (2.1.6)
Theorem 2.1.7 If λ̄ is such that the associated Lagrange problem (1.1.4)_λ̄ is maximized at a feasible u, i.e. if λ̄ solves (2.1.3), then λ̄ is also a solution of the dual problem (2.1.6).

Conversely, if λ̄ solves (2.1.6), and if (2.1.3) has a solution at all, then λ̄ is such a solution.
which shows that θ assumes the least of its values allowed by inequality (2.1.5). Conversely, let (2.1.3) have a solution λ*: some u* ∈ U satisfies c(u*) = 0 and L(u*, λ*) = θ(λ*) = φ(u*). Again from (2.1.5), θ(λ*) has to be the minimal value θ(λ̄) of the dual function and we write
Corollary 2.1.8 Suppose (2.1.3) has a solution. Then, for any solution λ̄ of the dual problem (2.1.6), the primal solutions are those u maximizing L(·, λ̄) that are feasible in (1.1.1).

PROOF. When (2.1.3) has a solution, we already know that λ̄ minimizes θ if and only if: λ̄ solves (2.1.3), and there is some ū such that φ(ū) = L(ū, λ̄) = θ(λ̄). Then, for any u solving (1.1.1), we have c(u) = 0 and φ(u) = φ(ū), hence L(u, λ̄) = θ(λ̄). □
In the above two statements, never forget the property "(2.1.3) has a solution", which need not hold in general. We will see in §2.3 what it implies, in terms of the data (U, φ, c).
Remark 2.1.9 When (2.1.3) has no solution, at least a heuristic resolution of (1.1.1) can be considered, inspired by Everett's Theorem 2.1.1 and the weak duality Theorem 2.1.5.

Apply an iterative algorithm to minimize θ and, during the course of this algorithm, store the primal points computed by the black box. After the dual problem is solved, select among these primal points one giving a best compromise between its φ-value and constraint-violation.

This heuristic heavily depends on Assumption 1.1.3. Its quality can be judged a posteriori from the infimal value of θ; but nothing can be guaranteed in advance: φ may remain far from θ, or the constraints may doggedly refuse to approach 0. □
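A minimal sketch of this heuristic (ours, not from the book): solve_lagrange is the hypothetical knapsack black box of §1.2(a), the dual minimization is a crude grid search, and the final compromise simply prefers feasibility, then φ-value:

    def dual_heuristic(prices, volumes, v_total, lam_grid):
        best_dual = float("inf")
        candidates = []
        for lam in lam_grid:
            theta, c_val, take = solve_lagrange(lam, prices, volumes, v_total)
            best_dual = min(best_dual, theta)          # infimal value of theta on the grid
            phi = sum(prices[i] for i in take)
            candidates.append((abs(c_val), -phi, take))  # store the primal point
        candidates.sort()                              # best compromise first
        return best_dual, candidates[0][2]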
Let us summarize our development so far: starting from the vaguely stated problem
1.1.4, we have introduced a number of formulations, interconnected as follows:
Thus, to solve the ugly equation (2.1.2), we must solve the perfectly well-stated prob-
lem (2.1.6). Furthermore, we have nothing to lose:
- If no primal solution is thus produced, the task was hopeless from the very beginning: no other value of λ could give a primal solution; this is the converse part in Theorem 2.1.7.
- If the technique works, no primal solution can be missed: they all solve the Lagrange
problem associated with any dual optimum; this is Corollary 2.1.8.
Remark 2.2.1 From now on, we will use the word "dual" for everything concerning (2.1.6): we have to minimize the dual function θ, with respect to the dual variable λ (u being the primal variable), possibly via a dual algorithm, yielding a dual solution, and so on.
The symmetry between (1.1.1) and (2.1.6) becomes more suggestive if the primal
constraints are incorporated into the objective function: (1.1.1) can be formulated as
The applications described in §1.2 give good illustrations of the logical chain (2.2.1).

(a) The knapsack problem is a typical example where (2.1.6) has little to do with (2.1.2), or even (2.1.3). Take the (particularly simple!) case n = 1; for example

    max u,   2u + u⁰ − 1 = 0,   u ∈ {0, 1}, u⁰ ≥ 0.                    (2.2.2)

The dual function is a nice polyhedral convex function (cf. Example 2.1.3 if necessary):
    θ(λ) = +∞     if λ < 0,
           1 − λ  if 0 ≤ λ ≤ 1/2,
           λ      if λ ≥ 1/2;

    c_λ = 1                if 0 ≤ λ < 1/2,
          ±1 ambiguously   if λ = 1/2,
          −1               if λ > 1/2.
A good exercise is to do the same calculations with the entropy (1.2.15) (without bothering too much with Remark 1.2.1); observe in passing that, in both cases, θ is convex and −c = ∇θ. Here full equivalence holds in (2.2.1).
In all three examples including (c), the dual problem will be more advantageously viewed as minimizing θ than as solving (2.1.2) or (2.1.3). See the considerations at the end of §II.1: having a "potential" θ to minimize makes it possible to stabilize the resolution algorithm, whatever it is. Indeed, to some extent, the dual problem (2.1.6) is well posed:

Proposition 2.2.2 If not identically +∞, the dual function θ is in Conv ℝ^m. Furthermore, for any u_λ solution of the Lagrange problem (1.1.4)_λ, the corresponding −c_λ = −c(u_λ) is a subgradient of θ at λ.
PROOF. Direct proofs are straightforward; for example, write for each λ and μ:

    θ(μ) ≥ L(u_λ, μ) = L(u_λ, λ) − (μ − λ)ᵀc(u_λ) = θ(λ) + (−c_λ)ᵀ(μ − λ),

so −c_λ ∈ ∂θ(λ). The other claims can be proved similarly. Proposition IV.2.1.2 and Lemma VI.4.4.1 can also be invoked: the dual function is a supremum of functions affine in λ. □
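To make Proposition 2.2.2 concrete, here is a small numerical check on the one-object knapsack (2.2.2), reusing the hypothetical solve_lagrange of §1.2(a) (our own illustration):

    # Subgradient inequality: theta(mu) >= theta(lam) + (-c_lam)*(mu - lam) for all mu.
    lam = 0.3
    theta_lam, c_lam, _ = solve_lagrange(lam, [1.0], [2.0], 1.0)
    for mu in [0.0, 0.2, 0.45, 0.5, 0.8]:
        theta_mu, _, _ = solve_lagrange(mu, [1.0], [2.0], 1.0)
        assert theta_mu >= theta_lam - c_lam * (mu - lam) - 1e-12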
Remark 2.2.3 Once again, it is important to understand what is lost by the dual
problem. Altogether, a proper resolution of our Problem 1.1.4 entails two steps:
(i) First solve (2.1.6), i.e. use the one-way implications in (2.2.1); this must be done
anyway, whether (2.1.6) is equivalent to (2.1.2) or not.
(ii) Then take care of the lost converse implications in (2.2.1) to recover a primal
solution from a dual one.
In view of Proposition 2.2.2, (i) is an easy task; a first glance at how it might
be solved was given in Chap. IX, and we will see in §4 below some other traditional
algorithms. The status of (ii) is quite different and we will see in §2.3 that there are
three cases:
(ii₁) (2.1.2) has no solution - and hence, (i) was actually of little use; example: knapsack.
(ii₂) A primal solution can be found after (i) is done, but this requires some extra work because (2.1.3) ⇔ (2.1.6) while (2.1.3) ⇎ (2.1.2); example: economic lotsize.
(ii₃) A primal solution is readily obtained from the black box after (i) is done, (2.1.6) ⇔ (2.1.2); example: entropy (barring difficulties from functional analysis). □
Usually, Ũ is properly contained in U: for example, the functions (1.2.16) and (1.2.17) certainly describe a very small part of L₁(Ω, ℝ) when λ describes ℝ^m - and the interest of duality precisely lies there. If (and only if) Ũ contains a feasible point, the dual approach will produce a primal solution; otherwise, it will break down. How much it breaks down depends on the problem. Consider the knapsack example (2.2.2): it has a unique dual solution λ̄ = 1/2, at which the dual optimal value provides the upper bound θ(1/2) = 1/2; but the primal optimal value is 0. At λ̄ = 1/2, the Lagrange problem has two (slackened) solutions:

    (u, u⁰) = (0, 0)  and  (u, u⁰) = (1, 0),

with constraint-values −1 and 1: none of them is feasible, of course. On the other hand, the non-slackened solution u = 0 is feasible with respect to the inequality constraint.
This last example illustrates the following important concept:
Definition 2.2.4 (Duality Gap) The difference between the optimal primal and dual values

    inf_{λ∈ℝ^m} θ(λ) − sup {φ(u) : u ∈ U, c(u) = 0}                    (2.2.4)

is called the duality gap.
Suppose the dual problem is solved: some λ̄ ∈ ℝ^m has been found, minimizing the dual function θ on the whole space; a solution of the primal problem (1.1.1) is still to be found. It is now that some assumptions must be made - if only to make sure that (1.1.1) has a solution at all!

A dual solution λ̄ is characterized by 0 ∈ ∂θ(λ̄); now the subdifferentials of θ lie in the dual of the λ-space, i.e. in the space of constraint-values. Indeed, consider the (possibly empty) optimal set in the Lagrange problem (1.1.4)_λ:

    U(λ) := {u ∈ U : L(u, λ) = θ(λ)};                                  (2.3.1)

remembering Proposition 2.2.2,

    ∂θ(λ) ⊃ co {−c(u) : u ∈ U(λ)}.

The converse inclusion is therefore crucial, and motivates the following definition.
Definition 2.3.1 (Filling Property) With U(λ) defined in (2.3.1), we say that the filling property holds at λ ∈ ℝ^m when

    ∂θ(λ) = −co {c(u) : u ∈ U(λ)}.                                     (2.3.2)

Observe that the right-hand side in (2.3.2) is then closed, since the left-hand side is. □
Lemma 2.3.2 Suppose that U in (1.1.1) is a compact set, on which φ is upper semi-continuous, and each c_j is continuous. Then the filling property (2.3.2) holds at each λ ∈ ℝ^m.

PROOF. Under the stated assumptions, the Lagrange function L(·, λ) is upper semi-continuous and has a maximum for each λ: dom θ = ℝ^m. For the supremum of the affine functions λ ↦ φ(u) − λᵀc(u), indexed by u ∈ U, the calculus rule VI.4.4.4 applies. □
We therefore have a sufficient condition for an easy description of ∂θ in terms of primal points. It is rather "normal", in that it goes in harmony with two other important properties: existence of an optimal solution in the primal problem (1.1.1), and in the Lagrange problems (1.1.4)_λ.
Remark 2.3.3 In case of inequality constraints, nonnegative slack variables can be appended to the primal problem, as in §1.2(a). Then, the presence of the (unbounded) nonnegative orthant in the control space kills the compactness property required by the above result. However, we will see in §3.2 that this is a "mild" unboundedness. Just remember here that the calculus rule VI.4.4.4 is still valid when λ has all its coordinates positive. Lemma 2.3.2 applies in this case, to the extent that λ ∈ int dom θ. The slacks play no role: for μ ∈ (ℝ₊)^m (close enough to λ), maximizing the slackened Lagrangian

    φ(u) − μᵀc(u) − μᵀv

with respect to (u, v) ∈ U × (ℝ₊)^m just amounts to maximizing the ordinary Lagrangian L(·, μ) over U. □
Theorem 2.3.4 Let the filling property (2.3.2) hold (for example, make the assumptions of Lemma 2.3.2) and denote by

    C(λ̄) := c(U(λ̄)) = {c(u) : u ∈ U(λ̄)}

the image by c of the set (2.3.1). A dual optimum λ̄ is characterized by the existence of k ≤ m + 1 points u₁, ..., u_k in U(λ̄) and convex multipliers α₁, ..., α_k such that

    Σ_{i=1}^k α_i c(u_i) = 0.

In particular, if C(λ̄) is convex for some optimal λ̄ then, for any optimal λ*, the feasible points in U(λ*) make up all the solutions of the primal problem (1.1.1).

PROOF. When the filling property holds, the minimality condition 0 ∈ ∂θ(λ̄) is exactly the existence of {u_i, α_i} as stated, i.e. 0 ∈ co C(λ̄). If C(λ̄) is convex, the "co"-operation is useless: 0 is already the constraint-value of some u maximizing L(·, λ̄). The rest follows from Corollary 2.1.8. □
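Numerically, the multipliers α of Theorem 2.3.4 can be sought by linear programming, as in the following sketch (ours, not from the book; C_vals collects the constraint-values c(u_i) of Lagrange-optimal points):

    import numpy as np
    from scipy.optimize import linprog

    def convex_multipliers(C_vals):
        """Find alpha in the unit simplex with sum_i alpha_i c(u_i) = 0,
        certifying 0 in co C(lam_bar); return None if no such alpha exists."""
        C = np.asarray(C_vals, dtype=float)            # shape (k, m)
        k, m = C.shape
        # Feasibility LP: zero objective, C^T alpha = 0, sum alpha = 1, alpha >= 0.
        A_eq = np.vstack([C.T, np.ones((1, k))])
        b_eq = np.concatenate([np.zeros(m), [1.0]])
        res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
        return res.x if res.success else None

    # Knapsack (2.2.2) at the dual optimum 1/2: C = {-1, +1} gives alpha = (1/2, 1/2).
    print(convex_multipliers([[-1.0], [1.0]]))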
This is the first instance where convexity comes into play, to guarantee that the set Ũ of (2.2.3) contains a feasible point. Convexity is therefore crucial to rule out a duality gap.
Remark 2.3.5 Once again, a good illustration is obtained from the simple knapsack problem (2.2.2). At the unique dual optimum λ̄ = 1/2, C(λ̄) = {−1, +1} is not convex and there is a duality gap equal to 1/2. Nevertheless, the corresponding (slackened) set of solutions to the Lagrange problem is U(1/2) = {(0, 0), (1, 0)} and the convex multipliers α₁ = α₂ = 1/2 do make up the point (1/2, 0), which satisfies the constraint. Unfortunately, this convex combination is not in the (slackened) control space U = {0, 1} × ℝ₊.

Remember the observation made in Remark VIII.2.1.3: even though the kinks of a convex function (such as θ) form a set of measure zero, a minimum point is usually in this set. Here ∂θ(1/2) = [−1, 1] and the observation is confirmed: the minimum λ̄ = 1/2 is the only kink of θ. □
In practice, the convexity property needed for Theorem 2.3.4 implies that the original
problem itself has the required convex structure: roughly speaking, the only cases in which
there is no duality gap are those described by the following result.
Corollary 2.3.6 Suppose that the filling property (2.3.2) holds. In either of the following situations (i), (ii) below, there is no duality gap; for every dual solution λ* (if any), the feasible points in U(λ*) make up all the solutions of the primal problem (1.1.1).

(i) For some dual solution λ̄, the associated Lagrange function L(·, λ̄) is maximized at a unique ū; then, ū is the unique solution of (1.1.1).
(ii) In (1.1.1), U is convex, φ is concave and c : U → ℝ^m is affine.
PROOF. Straightforward. □
Case (i) means that θ is actually differentiable at λ̄; we are in the situation (ii₃) of Remark 2.2.3. Differentiability of the dual function thus appears as an important property, not only for convenience when minimizing it, but also for a harmonious primal-dual relationship. The entropy problem of §1.2(c) fully enters this framework, at least with the entropy (1.2.14). The calculations in §2.2(c) confirm that the dual function is then differentiable and its minimization amounts to solving ∇θ(λ) = −c(u_λ) = 0. For this example, if Lemma 2.3.2 applies (cf. difficulties from functional analysis), the black box automatically produces the unique primal solution at any optimal λ (if any).

The knapsack problem gives an illustration a contrario: U is not convex and θ is, as a rule, not differentiable at a minimum (see §2.2(a) for example).
These two situations are rather clear; the intermediate case 2.2.3(ii₂) is more delicate. At a dual solution λ̄, the Lagrange problem has several solutions; some of them solve (1.1.1), but they may not be produced by the black box. A constructive character must still be given to Corollary 2.3.6(ii).
Example 2.3.7 Take the lotsizing problem of §1.2(b), and remember that we have only considered a simple black box: for any λ, each product i is assigned entirely to one single machine j(i), even in the ambiguous cases where some product has the same cost for two different machines. For example, the black box is totally unable to yield an allocation in which some product is split between several machines; and it may well be that only such an allocation is optimal in (1.2.7); actually, since θ is piecewise affine, there is almost certainly such an ambiguity at any optimal λ̄; remember Remark 2.3.5.

On the other hand, imagine a sophisticated black box, computing the full optimal set in the Lagrange problem (1.2.10)_λ. This amounts to determining, for each i, all the optimal indices j in (1.2.11); let there be m_i such indices. Each of them defines an extreme point in the simplex (1.2.9); piecing them together, they make up the full solution-set of (1.2.10)_λ, which is the convex hull of p = m₁ × ... × m_n points in U¹ × ... × Uⁿ; let us call them u^(1), ..., u^(p). Now comes the fundamental property: to each u^(k) corresponds via (1.2.8) a point v^(k) := v(u^(k), λ) ∈ ℝ^{n×m}; and because u ↦ v(u, λ) in (1.2.8) is linear, the solution-set of (1.2.10)_λ is the convex hull of the points (u^(1), v^(1)), ..., (u^(p), v^(p)) thus obtained.

Assume for simplicity that λ is optimal and has all its coordinates positive. From Theorem 2.3.4 or Corollary 2.3.6, there is a convex combination of the (u^(k), v^(k)), k = 1, ..., p (hence a solution of the Lagrange problem) which satisfies the constraints (1.2.7)(iii) as equalities: this is a primal optimum. □
Remark 2.3.8 It is interesting to interpret the null-step mechanism of Chap. IX in our present duality framework. Use this mechanism to find a descent direction of the present function θ, but start it from a dual optimum, say λ̄. To make things simple, just consider the knapsack problem (2.2.2).
- At the optimal λ̄ = 1/2, the index-set (1.2.4) is J(1/2) = {1}, thus s₁ = −1; the black box suggests that the object is worth taking and that, to decrease θ, λ must be increased: 1/2 is an insufficient "price of space".
- For all λ > 1/2, J(λ) = ∅; the black box now suggests to leave the object and to decrease λ. During the first line-search in Algorithm IX.1.6, no improvement is obtained from θ and s₂ = 1 ∈ ∂θ(1/2) is eventually produced.
- At this stage, the direction-finding problem (IX.1.8) detects optimality of λ̄; but more importantly, it produces α = 1/2 for the convex combination

    α s₁ + (1 − α) s₂ = 0.

Calling u^(1) (= 1) and u^(2) (= 0) the decisions corresponding to s₁ and s₂, the convex combination ½[u^(1) + u^(2)] is the one revealed by Theorem 2.3.4.
Some important points are raised by this demonstration.
(i) In the favourable cases (no duality gap), constructing the zero vector in ∂θ(λ̄) corresponds to constructing an optimal solution in the primal problem.
(ii) Seen with "dual glasses", a problem with a duality gap behaves just as if Corollary 2.3.6(ii) applied. In the above knapsack problem, u = 1/2 is computed by the dual algorithm, even though half of a TV set is useless. This u = 1/2 would become a primal optimum if U = {0, 1} in (2.2.2) were replaced by its convex relaxation [0, 1]; but the dual algorithm would see no difference.
(iii) The bundling mechanism is able to play the role of the sophisticated black box wanted in Example 2.3.7. For this, access is needed to the primal points computed by a simple black box, in addition to their constraint-values.
(iv) Remember Example IX.3.1.2: it can be hoped that the bundling mechanism generates a primal optimum with only a few calls to the simple black box, when it is started on an optimal λ̄; and it should do this independently of the complexity of the set U(λ̄). Suppose we had illustrated (iii) with the lotsizing problem of Example 2.3.7, rather than a trivial knapsack problem. We would have obtained 0 ∈ ∂θ(λ̄) with probably an order of m iterations; by contrast, the m₁ × ... × m_n primal candidates furnished by a sophisticated black box should all be collected for a direct primal construction. □
We conclude with a comment concerning the two assumptions needed for Theo-
rem 2.3.4:
- The behaviour of the dual function must be fully described by the optimal solutions
of the Lagrange problem. This is the filling property (2.3.2), of topological nature,
and fairly hard to establish (see §VI.4.4). Let us add that situations in which it does
not hold can be considered as pathological.
- Some algebraic structure must exist in the solution-set of the Lagrange problem, at least at an optimal λ (Corollary 2.3.6). It can be considered as a "gift of God": it has no reason to exist a priori but, when it does, everything becomes straightforward.
The question is now whether the dual problem (2.1.6) has a solution λ̄ ∈ ℝ^m. This implies first that the dual function θ is bounded from below; furthermore, it must attain its infimum. For our study, the relevant object is the image

    C(U) := c(U) = {c(u) : u ∈ U}                                      (2.4.1)

of the control set U under the constraint mapping c : U → ℝ^m. Needless to say, an essential prerequisite for the primal problem (1.1.1) to make sense is 0 ∈ C(U); as far as the dual problem is concerned, however, cases in which 0 ∉ C(U) are also of interest.
First of all, denote by Γ the affine hull of C(U), and by Γ₀ the subspace parallel to Γ. Fixing u₀ ∈ U, we have c(u) − c(u₀) ∈ Γ₀ for all u ∈ U, hence

    L(u, λ + μ) = L(u, λ) − μᵀc(u) = L(u, λ) − μᵀc(u₀)  for all (λ, μ) ∈ ℝ^m × Γ₀^⊥,

    θ(λ + μ) = θ(λ) − μᵀc(u₀)  for all λ ∈ ℝ^m and μ ∈ Γ₀^⊥.           (2.4.2)

In other words, θ is affine in the subspace Γ₀^⊥. This observation clarifies two cases:
- If 0 ∉ Γ, let μ₀ ≠ 0 be the projection of the origin onto Γ; in (2.4.2), fix λ and take μ = tμ₀ with t → +∞. Because c(u₀) ∈ Γ, we have with (2.4.2)

    θ(λ + tμ₀) = θ(λ) − t μ₀ᵀc(u₀) = θ(λ) − t ‖μ₀‖² → −∞.

In this case, the primal problem cannot have a feasible point (weak duality Theorem 2.1.5); to become meaningful, (1.1.1) should have its constraints perturbed, say to c(u) = μ₀.
- The only interesting case is therefore 0 ∈ Γ, i.e. Γ = Γ₀. The dual optimal set is then Λ + Γ₀^⊥, where Λ ⊂ Γ₀ is the (possibly empty) optimal set of

    min {θ(λ) : λ ∈ Γ₀}.                                               (2.4.3)

Now let λ₀ be such that θ(λ₀) < +∞, and write for all u ∈ U:
Since u was arbitrary, this implies θ(λ₀ + tμ₀) ≤ θ(λ₀) − tδ, and the latter term tends to −∞ when t → +∞.

[(ii)] There are finitely many points in U, say u₁, ..., u_p, and some α in the unit simplex of ℝ^p, such that 0 = Σ_{j=1}^p α_j c(u_j). Then write for all λ ∈ ℝ^m:

    θ(λ) ≥ Σ_{j=1}^p α_j φ(u_j) − λᵀ Σ_{j=1}^p α_j c(u_j) ≥ min_{1≤j≤p} φ(u_j),    (2.4.4)

If Γ₀ = {0}, i.e. C(U) = {0}, then L(u, λ) = φ(u) for all u ∈ U, so θ is a constant function; and this constant is not +∞ by assumption. If Γ₀ ≠ {0}, take 0 ≠ λ ∈ Γ₀ in (2.4.4) and

    y := −δ λ/‖λ‖ ∈ B(0, δ) ∩ Γ₀;

this y is therefore a convex combination of the c(u_j)'s. The same convex combination in (2.4.4) gives

    θ(λ) ≥ Σ_{j=1}^p α_j φ(u_j) + δ‖λ‖ ≥ min_{1≤j≤p} φ(u_j) + δ‖λ‖.

We conclude that θ(λ) → +∞ if λ grows unboundedly in Γ₀: the closed function θ (Proposition 2.2.2) does have a minimum on Γ₀, and on ℝ^m as well, in view of (2.4.3). □
Figure 2.4.1 summarizes the different situations revealed by our above analysis.
Thus, it is the convex hull of the image-set C (U) which is relevant; and, just as in
§2.3, this convexification is crucial:
Example 2.4.2 In view of the weak duality Theorem 2.1.5, existence of a feasible u in (1.1.1) implies boundedness of θ from below. To confirm the prediction of (ii), that this is not necessary, take a variant of the knapsack problem (2.2.2), in which the knapsack should be completely filled: we want to solve

    max {u : 2u − 1 = 0, u ∈ {0, 1}}.

This problem clearly has no feasible point: its supremal value is −∞. Nevertheless, the dual problem is essentially the same, and it still has the optimal value θ(1/2) = 1/2. Here, the duality gap is infinite. □
The entropy problem of §1.2(c) provides a few examples to show that the topological operations also play their role in Theorem 2.4.1.

Example 2.4.3 Take Ω = {0} (so that L₁(Ω, ℝ) = ℝ) and one equality constraint: our entropy problem is

The corresponding sets C(U) are both convex: in fact C(U₁) = [−u₀, +∞[ and C(U₂) = ]−u₀, +∞[. The dual functions are easy to compute:

we would have

    θ₃(λ) = ½λ² + u₀λ   for λ ≤ 0,
            u₀λ          for λ ≥ 0,

and a dual solution would exist for all u₀ ≥ 0. □
3 Illustrations
The Lagrange function (1.1.3) is not the only possibility to replace the primal problem
(1.1.1) by something supposedly simpler. In the previous sections, we have followed
a path
    primal problem (1.1.1) → Lagrange problem (1.1.4)_λ →
    feasibility problem (2.1.2) or (2.1.3) → dual problem (2.1.6).
This was just one instance of the following formal construction:
- define some set V (playing the role of ℝ^m, the dual of the space of constraint-values), whose elements v ∈ V play the role of λ ∈ ℝ^m;
- define some bivariate function ℓ : U × V → ℝ ∪ {+∞}, playing the role of the Lagrangian L;
- V and ℓ must satisfy the following property, for all u ∈ U:

    inf_{v∈V} ℓ(u, v) = φ(u)  if c(u) = 0,
                        −∞    otherwise.                               (3.1.1)
The above construction may sound artificial but the ordinary Lagrange technique, precisely, discloses an instance where it is not. Its starting ingredient is the pairing (1.1.2) (a natural "pricing" in the space of constraint-values); then (3.1.1) is obviously satisfied by the Lagrange function ℓ = L of (1.1.3); furthermore, the latter is affine with respect to the dual variable v = λ, and this takes care of the requirement (3.1.4) for the dual function Θ = θ, provided that θ ≢ +∞ (Proposition IV.2.1.2).
Remark 3.1.1 In the dual approach, the basic idea is thus to replace the sup-inf problem (1.1.1) = (3.1.2) by its inf-sup form (2.1.6) = (3.1.3) (after having defined an appropriate scalar product in the space of constraint-values). The theory in §VII.4 tells us when this is going to work, and explains the role of the assumptions appearing in Sections 2.3 and 2.4:
- The function ℓ (i.e. L) is already closed and convex in λ; what is missing is its concavity with respect to u.
- Some topological properties (of L, U and/or V) are also necessary, so that the relevant extrema are attained.

Then we can apply Theorem VII.4.3.1: the Lagrange function has a saddle-point, the values in (3.1.2) and (3.1.3) are equal, there is no duality gap; the feasibility problem (2.1.3) is equivalent to the dual problem (2.1.6). This explains also the "invariance" property resulting from Corollary 2.1.8: the set of saddle-points was seen to be a Cartesian product. □
in which the constraints can be condensed as c(u) ∈ −(ℝ₊)^p. Two examples were seen in §1.2; in both of them, nonnegative slack variables were included, so as to recover the equality-constrained framework; then the dual function was +∞ outside the nonnegative orthant.

The ideas of §3.1 can also be used: form the same Lagrange function as before, but take

    V := (ℝ₊)^p = {μ ∈ ℝ^p : μ_j ≥ 0 for j = 1, ..., p}.

This construction satisfies (3.1.1): if u violates some constraint, say c_{j₀}(u) > 0, fix all the μ_j at 0 except μ_{j₀}, which is sent to +∞; then the Lagrange function tends to −∞.

The dual problem associated with (3.2.1) is therefore as before:

(note that the nonnegative orthant (ℝ₊)^p is closed and convex: (3.1.4) is preserved if Θ ≢ +∞).

Instead of introducing slack variables, we thus have a shortcut allowing the primal set U to remain unchanged; the price is modest: just add simple constraints in the dual problem.
    L(u, λ, μ) := φ(u) − Σ_{i=1}^m λ_i a_i(u) − Σ_{j=1}^p μ_j c_j(u)

to obtain the dual function Θ : ℝ^m × ℝ^p → ℝ ∪ {+∞} as before; then solve the dual problem

    min {Θ(λ, μ) : λ ∈ ℝ^m, μ ∈ (ℝ₊)^p}.                               (3.2.2)
Naturally, other combinations are possible: the primal problem may be a minimization one, the linking constraints may be introduced with a "+" sign in the Lagrange function (1.1.3), and so on. Some mental exercise is required to make the link with §VII.4.5. It is wise to adopt a definite strategy, so as to develop unconscious automatisms.
(i) We advise proceeding as follows: formulate the primal problem as

(and also with some equality constraints if applicable). Then take the dual function

    μ ↦ Θ(μ) = sup_{u∈U} [φ(u) + Σ_{j=1}^p μ_j c_j(u)],
(a) Modified Everett's Theorem. For μ ∈ (ℝ₊)^p, let u_μ maximize the Lagrangian φ(u) − μᵀc(u) associated with (3.2.1). Then u_μ solves

    max φ(u),   u ∈ U,
    c_j(u) ≤ c_j(u_μ)   if μ_j > 0,                                    (3.2.3)
    c_j(u) unconstrained otherwise.

Then the original Everett theorem directly applies; observe that any v_j maximizes the associated Lagrangian if μ_j = 0.
Naturally, u_μ is a primal point all the more interesting when (3.2.3) is less constrained; but constraints may be appended ad libitum, under the sole condition that they do not eliminate u_μ. For example, if u_μ is feasible, the last line of (3.2.3) can be replaced by

    c_j(u) ≤ 0 [≥ c_j(u_μ)];

then u_μ solves (3.2.1). The wording "and satisfies the complementarity slackness" must be added to "feasible" wherever applicable in Sections 2.1 and 2.3 (for example in Corollary 2.1.8).
and

    Θ(μ) < +∞ for some μ with μ_j > 0 for j = 1, ..., p.               (3.2.5)

Then μ̄ solves the dual problem if and only if there are k ≤ p + 1 primal points u₁, ..., u_k maximizing L(·, μ̄), and α = (α₁, ..., α_k) ∈ Δ_k such that

    Σ_{i=1}^k α_i c_j(u_i) ≤ 0  and  μ̄_j Σ_{i=1}^k α_i c_j(u_i) = 0  for j = 1, ..., p.

PROOF. The dual optimality condition is 0 ∈ ∂(Θ + I_{(ℝ₊)^p})(μ̄), so we need to compute the subdifferential of this sum of functions. With μ as stated in (3.2.5), there is a ball B(μ, δ) contained in the interior of (ℝ₊)^p; and any such ball intersects ri dom Θ (μ is adherent to ri dom Θ). Then Corollary XI.3.1.2 gives
(c) Existence of Dual Solutions. Finally, let us examine how Proposition 2.4.1 can handle inequality constraints. Using slack variables, the image-set associated with (3.2.4) is

    C'(U) = C(U) + (ℝ₊)^p,

where C(U) is defined in (2.4.1). Clearly, aff C'(U) = ℝ^p and we obtain:
(i) If 0 ∉ cl co[C(U) + (ℝ₊)^p], then inf Θ = −∞; unfortunately this closed convex hull is not easy to describe in simpler terms.
(ii) If 0 ∈ co C(U) + (ℝ₊)^p, then inf Θ > −∞. This comes from the easy relation

    co C'(U)  [= co[C(U) + (ℝ₊)^p]]  = co C(U) + (ℝ₊)^p.

(iii) If 0 ∈ ri[co C(U) + (ℝ₊)^p], then a dual solution exists. Interestingly enough, this is a Slater-type assumption:

Proposition 3.2.3 If there are k ≤ p + 1 points u₁, ..., u_k in U and convex multipliers α₁, ..., α_k such that

    Σ_{i=1}^k α_i c_j(u_i) < 0   for j = 1, ..., p,

then the dual function associated with the optimization problem (3.2.1) has a minimum over the nonnegative orthant (ℝ₊)^p.

PROOF. Our assumption means that 0 lies in co C(U) + int(ℝ₊)^p, an open convex set:

    0 ∈ co C(U) + int(ℝ₊)^p = ri[co C(U) + (ℝ₊)^p].
As a particular case of §3.2, suppose that U is ℝ^n, with some scalar product ⟨·,·⟩, and that (3.2.1) is

    sup ⟨q, u⟩,   ⟨a_j, u⟩ ≤ b_j for j = 1, ..., p.                    (3.3.1)

The Lagrange function

    ℝ^n × ℝ^p ∋ (u, μ) ↦ L(u, μ) := ⟨q − Σ_{j=1}^p μ_j a_j, u⟩ + Σ_{j=1}^p μ_j b_j

is "easy to maximize": Θ(μ) = +∞ if [μ ∉ (ℝ₊)^p or] q − Σ_j μ_j a_j ≠ 0. The dual problem is therefore

    min {Σ_{j=1}^p μ_j b_j : Σ_{j=1}^p μ_j a_j = q, μ ∈ (ℝ₊)^p}.       (3.3.2)

Remark 3.3.1 The dual constraints that appear in (3.2.2) and (3.3.2) simply express that the study can be restricted to dom Θ, since Θ must be minimized. Instead of (3.3.2), for example, the dual of (3.3.1) can obviously be formulated as the unconstrained minimization of the function

□
Continuing with this example, consider the standard form of a linear program: assume that ⟨·,·⟩ is the usual dot-product and that our primal problem is now

    max qᵀu,   Au = b,   u ≥ 0.                                        (3.3.3)

Here the matrix A has m rows, b ∈ ℝ^m. It is natural to take as primal set U := (ℝ₊)^n, so that the dual space is a priori ℝ^m, m being the number of equalities. Then the Lagrange function

    L(u, λ) = qᵀu − λᵀ(Au − b) = (q − Aᵀλ)ᵀu + λᵀb

is again "easy to maximize" over U: the maximal value is finite if and only if the vector q − Aᵀλ has all its coordinates nonpositive. In summary, the dual of (3.3.3) is

    min {bᵀλ : Aᵀλ ≥ q, λ ∈ ℝ^m}.

The nonnegativity constraints in (3.3.3) can be dualized as well, even though the example in (1.2.1) does not suggest doing so because they already have a decomposed form. In fact, nothing much interesting comes out of this extra dualization: the Lagrange function becomes
Let again U be ℝ^n, equipped with the dot-product for simplicity, and consider

    max {qᵀu − ½uᵀQu : Au ≤ b}.                                        (3.4.1)

Here q ∈ ℝ^n, b ∈ ℝ^p and the n × n matrix Q is symmetric positive definite. This problem has a unique solution if the feasible domain is nonempty. Choose the Lagrange function

    L(u, μ) := qᵀu − ½uᵀQu − μᵀ(Au − b);                               (3.4.2)

its maximization in u gives u(μ) = Q⁻¹(q − Aᵀμ) and a dual function θ to be minimized over (ℝ₊)^p. Note that θ is differentiable and its gradient illustrates Proposition 2.2.2:

    ∇θ(μ) = −(Au(μ) − b) = −c(u(μ)).

On the other hand, suppose that the constraints in (3.4.1) are equalities: Au = b; then the dual minimization is performed on the whole of ℝ^p. If they exist, the dual solutions are the solutions of ∇θ(μ) = 0, which is the linear system

    AQ⁻¹Aᵀμ = AQ⁻¹q − b.                                               (3.4.3)

Existence of such a solution implies first that b ∈ Im A, i.e. that there exists a primal feasible u. In this case we claim that the dual problem has at least one solution, and that all such solutions make up via (3.4.2) a unique point (the unique primal solution). This will illustrate Corollary 2.1.8.
This will illustrate Corollary 2.1.8.
The key lies in the following facts:
(i) The two subspaces Im Aᵀ and Ker A are two orthogonal generators of ℝ^n. This implies:
(ii) when applying the positive definite operator Q or Q⁻¹ to one of them, we never obtain a vector in the other:

    Q(Im Aᵀ) ∩ Ker A = {0}  and  Q⁻¹(Im Aᵀ) ∩ Ker A = {0}.             (3.4.4)

Hence:
(iii) we have further decompositions of ℝ^n into two subspaces:

    ℝ^n = Q(Im Aᵀ) ⊕ Ker A = Q⁻¹(Im Aᵀ) ⊕ Ker A.                       (3.4.5)
[Uniqueness] If μ₁ and μ₂ solve (3.4.3), with corresponding u₁ and u₂ maximizing L over ℝ^n,

and we deduce
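As a numerical illustration of the linear system (3.4.3) in the equality-constrained case (our own sketch, under the reconstruction above: max qᵀu − ½uᵀQu subject to Au = b, with arbitrary data):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 5, 2
    M = rng.standard_normal((n, n))
    Q = M @ M.T + n * np.eye(n)            # symmetric positive definite
    A = rng.standard_normal((p, n))
    q = rng.standard_normal(n)
    b = rng.standard_normal(p)

    Qinv_q = np.linalg.solve(Q, q)
    Qinv_At = np.linalg.solve(Q, A.T)
    mu = np.linalg.solve(A @ Qinv_At, A @ Qinv_q - b)   # linear system (3.4.3)
    u = Qinv_q - Qinv_At @ mu                           # u(mu) = Q^{-1}(q - A^T mu)
    assert np.allclose(A @ u, b)        # u is primal feasible, hence optimal (Cor. 2.1.2)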
Consider the direction-finding problem of Chapters VIII and IX, assuming the Euclidean norming on ℝ^n. We had a number of (sub)gradients s₁, ..., s_k and we wanted to solve, for a given normalization parameter κ > 0:

Adapting the notation (u is the couple (d, r), the dual variable is α ∈ ℝ^k, and we now have a minimization problem), define the Lagrange function

    ℝ^n × ℝ × ℝ^k ∋ (d, r, α) ↦ L(d, r, α) := (1 − Σ_{j=1}^k α_j) r + ⟨Σ_{j=1}^k α_j s_j, d⟩,   (3.5.2)

(3.5.3)

The dual function (to be maximized) is "easy to compute": the dual constraint 1 − Σ_{j=1}^k α_j = 0 takes care of the unconstrained linear variable r; altogether, the dual variable has to vary over the unit simplex Δ_k of ℝ^k.

Setting s(α) := Σ_{j=1}^k α_j s_j, the Lagrange problem associated with α ∈ Δ_k then has solutions (d_α, r_α) defined as follows:

and r_α is arbitrary, its coefficient in (3.5.2) being 0. Finally the dual problem is

    max {−κ‖s(α)‖ : α ∈ Δ_k}.
from the dual approach, except the information that 0 is in the convex hull of the s_i's - and this was good enough in the case of interest.

In this example, Corollary 2.3.6(i) applies if s(ᾱ) ≠ 0, which means that s(α) ≠ 0 for all α ∈ Δ_k: even though the original problem (3.5.1) is not convex, the dual function is differentiable everywhere (at least relative to its domain).

By contrast, suppose s(ᾱ) = 0. Then the dual function is differentiable at no dual optimum, simply because ‖·‖ is not differentiable at 0; its subdifferential can be computed with the help of the calculus developed in §VI.4: to within the normalization coefficient κ, we obtain the ellipsoid

    {Σ_{i=1}^k α_i s_i : Σ_{i=1}^k α_i² ≤ 1}.

Having a dual optimal solution ᾱ is absolutely of no help to find a primal solution: the Lagrange function L(·, ·, ᾱ) of (3.5.2) is optimal at all feasible points of (3.5.1); duality has killed the role of the objective function.
Remark 3.5.1 Formulating (3.5.1) with slack variables, say ρ_i, we obtain the new Lagrange function (with obvious notation)

- dom θ ⊂ ℝ₊;
- θ(0) = 0 if 0 ∈ S, +∞ if not;
- θ is finite on ]0, +∞[;
- since θ ∈ Conv ℝ, lim_{α₀↓0} θ(α₀) = θ(0);
- at any α₀ > 0, θ has the derivative θ'(α₀) = κ² − ‖d(α₀)‖², where d(α₀) is the unique optimal solution in (3.5.7);
- if θ is minimal at some α₀ > 0, then d(α₀) has norm κ and solves the original problem (Corollary 2.3.6);
- if θ is minimal at 0, there are two cases, illustrated by Example 3.5.1: either ‖d(α₀)‖ ↑ κ when α₀ ↓ 0, and there is no duality gap (this was the case ρ > 0); or ‖d(α₀)‖ has a limit strictly smaller than κ, and the original problem (3.5.1) is not solvable via duality.
4 Classical Dual Algorithms

In this section, we review the most classical algorithms aimed at solving the dual problem (2.1.6). We neglect for the moment the primal aspect of the question, namely the task (ii) mentioned in Remark 2.2.3. According to the various results of Sections 2.1, 2.2, our situation is as follows:
- we must minimize a convex function (the dual function θ);
- the only information at hand is the black box that solves the Lagrange problem (1.1.4)_λ for each given λ ∈ ℝ^m;
- Assumption 1.1.5 is in force; thus, for each λ ∈ ℝ^m, the black box computes the number θ(λ) and some s(λ) ∈ ∂θ(λ).

The present problem is therefore just the one considered in Chap. IX; formally, it is also that of Chap. II, with the technical difference that the function λ ↦ s(λ) is not continuous. The general approach will also be the same: at the k-th iteration, the black box is called (i.e. the Lagrange problem is solved at λ_k) to obtain θ(λ_k) and s(λ_k); then the (k+1)-st iterate λ_{k+1} is computed. A dual algorithm is thus characterized by the set of rules giving λ_{k+1}.
Note here that our present task is after all nothing but developing general algorithms
which minimize a convex function finite everywhere, and which work with the help of the
above-mentioned information (function- and subgradient-values). It is fair to say that most
of the known algorithms of this type have been actually motivated precisely by the need to
solve dual problems.
Note also that all this is valid again with no assumption on the primal problem (1.1.1),
except Assumption 1.1.5.
The simplest algorithm for convex minimization is directly inspired by the gradient method of Definition II.2.2.2: the next iterate is sought along the direction −s(λ_k) issuing from λ_k. There is a serious difficulty, however, which has been the central issue in Chap. IX: no line-search is possible based on decreasing θ, simply because −s(λ_k) may not be a descent direction - or may be so weak a one that the resulting sequence {λ_k} would not minimize θ. The relevant algorithm is then as follows:

Algorithm 4.1.1 (Subgradient Algorithm) Start from some λ₁ ∈ ℝ^m; at iteration k, compute s(λ_k) and set

    λ_{k+1} = λ_k − t_k s(λ_k)/‖s(λ_k)‖,                               (4.1.1)

where the positive stepsizes t_k are chosen in advance. □
Needless to say, each subgradient is obtained via some u_λ solving the Lagrange problem at the corresponding λ - hence the need for Assumption 1.1.5. Note that the function-values θ(λ_k) are never used by this algorithm, whose motivation is as follows (see Fig. 4.1.1): if μ is better than λ_k, we obtain from the subgradient inequality

    0 > θ(μ) − θ(λ_k) ≥ s_kᵀ(μ − λ_k),

which means that the angle between the direction of move −s_k and the "nice" direction μ − λ_k is acute. If our move along that direction is small enough, we get closer to μ. From this interpretation, the stepsize t_k should be small, and we will require

    t_k → 0 when k → +∞.                                               (4.1.2)
On the other hand, observe that ‖λ_{k+1} − λ_k‖ = t_k, and the triangle inequality implies that all iterates are confined in some ball: for k = 1, 2, ...

    λ_k ∈ B(λ₁, T), where T := Σ_{k=1}^∞ t_k.
Suppose θ has a nonempty set of minima; then Algorithm 4.1.1 has no chance of producing a minimizing sequence if B(λ₁, T) does not intersect this set: in spite of (4.1.2), the stepsizes should not be too small, and we will require

    Σ_{k=1}^∞ t_k = +∞.                                                (4.1.3)
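A minimal sketch of Algorithm 4.1.1 in this setting (our own illustration; oracle(lam) stands for the black box, returning θ(λ) and one s(λ) ∈ ∂θ(λ), and the stepsizes t_k = t₁/k satisfy (4.1.2) and (4.1.3)):

    import numpy as np

    def subgradient_method(oracle, lam0, n_iter=1000, t1=1.0):
        lam = np.asarray(lam0, dtype=float)
        best_val, best_lam = np.inf, lam.copy()
        for k in range(1, n_iter + 1):
            val, s = oracle(lam)
            if val < best_val:                    # record the best value:
                best_val, best_lam = val, lam.copy()  # {theta(lam_k)} is not monotonic
            norm = np.linalg.norm(s)
            if norm == 0.0:                       # 0 in the subdifferential: optimal
                return lam, val
            lam = lam - (t1 / k) * s / norm       # update (4.1.1) with t_k = t1/k
        return best_lam, best_val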
Lemma 4.1.2 Let {t_k} be a sequence of positive numbers satisfying (4.1.2), (4.1.3) and set

    τ_n := Σ_{k=1}^n t_k,   ρ_n := Σ_{k=1}^n t_k².                     (4.1.4)

Then (τ_n → +∞ and) ρ_n/τ_n → 0 when n → +∞.
PROOF. Fix δ > 0; there is some n(δ) such that t_k ≤ δ for all k ≥ n(δ); then

    ρ_n = ρ_{n(δ)−1} + Σ_{k=n(δ)}^n t_k² ≤ ρ_{n(δ)−1} + δ Σ_{k=n(δ)}^n t_k ≤ ρ_{n(δ)−1} + δ τ_n,

so that

    ρ_n/τ_n ≤ ρ_{n(δ)−1}/τ_n + δ   for all n ≥ n(δ);

thus, lim sup ρ_n/τ_n ≤ δ. The result follows since δ > 0 was arbitrary. □
Lemma 4.1.3 Let θ : ℝ^m → ℝ be convex and fix λ̃ ∈ ℝ^m. For all λ ∈ ℝ^m such that θ(λ) > θ(λ̃) and for all s ∈ ∂θ(λ), set

    d(λ) := sᵀ(λ − λ̃)/‖s‖ > 0.                                        (4.1.5)

PROOF. Positivity of d(λ) is clear from the subgradient inequality written at λ. Let μ(λ) be the projection of λ̃ onto the hyperplane

    {x ∈ ℝ^m : sᵀ(x − λ) = 0}.

[Fig. 4.1.2: λ̃, its projection μ(λ), and the level line θ(·) = θ(λ).]
This last result says that d(λ) is a sort of "distance" between λ and λ̃, for which θ has a locally Lipschitzian behaviour.

Finally, we introduce the sequence of best values generated by Algorithm 4.1.1:

    θ_k^best := min {θ(λ₁), ..., θ(λ_k)},

which is needed because {θ(λ_k)} is not monotonic. The whole issue is whether these best function-values tend to the infimum of θ over ℝ^m (a number in ℝ ∪ {−∞}).
Theorem 4.1.4 Apply Algorithm 4.1.1 to the convex function θ : ℝ^m → ℝ and let the stepsizes satisfy (4.1.2), (4.1.3). Then θ_k^best → inf {θ(λ) : λ ∈ ℝ^m}.

PROOF. Assume for contradiction the existence of λ̃ ∈ ℝ^m and η > 0 such that

    θ(λ_k) ≥ θ(λ̃) + η for all k;                                      (4.1.6)

then develop

    ‖λ̃ − λ_{k+1}‖² = ‖λ̃ − λ_k‖² + 2(λ̃ − λ_k)ᵀ(λ_k − λ_{k+1}) + ‖λ_k − λ_{k+1}‖²
                    = ‖λ̃ − λ_k‖² − 2 t_k d(λ_k) + t_k²,

where the notation (4.1.5) is used: the triple (λ̃, λ_k, s_k) enters the framework of Lemma 4.1.3. For n ≥ k, set δ_n := min_{k=1,...,n} d(λ_k), so that

    ‖λ̃ − λ_{k+1}‖² ≤ ‖λ̃ − λ_k‖² − 2 t_k δ_n + t_k²;

summing from 1 to n:

    [‖λ̃ − λ_{n+1}‖² +] 2 δ_n Σ_{k=1}^n t_k ≤ ‖λ̃ − λ₁‖² + Σ_{k=1}^n t_k²   for all n.

From Lemma 4.1.2, it follows that δ_n → 0 when n → +∞.

Thus, we have an infinite subset K of integers such that lim_{k∈K} d(λ_k) = 0. Now apply Lemma 4.1.3: {d(λ_k)}_{k∈K} is bounded and
Remark 4.1.5 We mention here that the convergence assumptions concerning (4.1.4) are somewhat bizarre. A computer cannot represent positive numbers under a certain threshold, say m₀ > 0. As a result,
- Charybdis: to satisfy (4.1.3), the computer must set t_k ≥ m₀ for all k: (4.1.2) becomes impossible.
- Scylla: alternatively, setting t_k = 0 for k large enough is even worse. No update is performed by (4.1.1) and the algorithm has to stop; but at this point, the series Σ t_k has not diverged yet.

On the other hand, this kind of argument is not totally convincing: the mere concept "ε_k → 0" is important in numerical analysis; yet, who can be patient enough to check whether a sequence is convergent, and to obtain its limit? We shall only say that requiring a property resembling (4.1.2), (4.1.3) is a bit sloppy with respect to finite arithmetic. □
To ease our exposition, we assume now that a compact convex set C ⊂ ℝ^m is known to contain a dual solution. In view of the sup-form of θ, our dual problem (2.1.6) can then be written

    min r,   r ∈ ℝ, λ ∈ C,
    r ≥ L(u, λ) for all u ∈ U.                                         (4.2.1)
This is a semi-infinite programming problem, which would become easy if it had only finitely many constraints, i.e. if U were a finite set; taking for example a compact convex polyhedron for C, (4.2.1) would become an ordinary linear program. Then the basic idea of the cutting-plane algorithm is quite natural: accumulate the constraints one after the other in (4.2.1). Furthermore take advantage of the fact that the constraint-index u can be restricted to the smaller set Ũ of (2.2.3).
Algorithm 4.2.1 (Basic Cutting-Plane Algorithm) The compact convex set C and the stopping tolerance ε ≥ 0 are given, together with an initial λ₀ ∈ C; compute θ(λ₀) and s(λ₀), and set k = 1. At iteration k, find (r_k, λ_k) solving

    min r,   r ∈ ℝ, λ ∈ C,
    r ≥ L(u_i, λ) for i = 0, ..., k − 1;                               (4.2.2)

compute θ(λ_k) and s(λ_k); if

    θ(λ_k) − r_k ≤ ε,                                                  (4.2.3)

stop; otherwise replace k by k + 1 and loop. □

Since u_i solves the Lagrange problem at λ_i, there holds L(u_i, λ) = θ(λ_i) + [s(λ_i)]ᵀ(λ − λ_i), so that (4.2.2) is the linear program

    min r,   r ∈ ℝ, λ ∈ C,
    r ≥ θ(λ_i) + [s(λ_i)]ᵀ(λ − λ_i) for i = 0, ..., k − 1.             (4.2.5)
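A compact sketch of Algorithm 4.2.1 (our own illustration, not from the book): oracle(lam) returns θ(λ) and some s(λ) ∈ ∂θ(λ), C is taken as a box described by bounds, and each instance of (4.2.5) is solved with scipy's linprog:

    import numpy as np
    from scipy.optimize import linprog

    def cutting_plane(oracle, lam0, box, eps=1e-6, max_iter=100):
        m = len(lam0)
        val, s = oracle(np.asarray(lam0, dtype=float))
        cuts = [(val, np.asarray(s, dtype=float), np.asarray(lam0, dtype=float))]
        for _ in range(max_iter):
            # LP (4.2.5) in the variables (lam, r): minimize r subject to
            # r >= theta(lam_i) + s(lam_i)^T (lam - lam_i), lam in the box C.
            A_ub = np.array([np.append(s_i, -1.0) for _, s_i, _ in cuts])
            b_ub = np.array([s_i @ l_i - v_i for v_i, s_i, l_i in cuts])
            res = linprog(np.append(np.zeros(m), 1.0), A_ub=A_ub, b_ub=b_ub,
                          bounds=list(box) + [(None, None)])
            lam_k, r_k = res.x[:m], res.x[m]
            val, s = oracle(lam_k)                 # call the black box at lam_k
            if val - r_k <= eps:                   # stopping criterion (4.2.3)
                return lam_k, val
            cuts.append((val, np.asarray(s, dtype=float), lam_k))
        return lam_k, val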
Remark 4.2.2 When we write (4.2.2) in the form (4.2.5), we do nothing but prove again Proposition 2.2.2: c(u_i) ∈ −∂θ(λ_i). When we hope that (4.2.5) does approximate (4.2.1), we rely upon the property proved in Theorem XI.1.3.8: the convex function θ can be expressed as a supremum of affine functions, namely

    θ(λ) = sup {θ(μ) + [s(μ)]ᵀ(λ − μ) : μ ∈ ℝ^m}   for all λ ∈ ℝ^m,

s(μ) being arbitrary in ∂θ(μ). □
Theorem 4.2.3 With C ⊂ ℝ^m convex compact and θ convex from ℝ^m to ℝ, consider the optimal value

    θ_C := min {θ(λ) : λ ∈ C}.

In Algorithm 4.2.1, r_k ≤ θ_C for all k and the following convergence properties hold:
- If ε > 0, the stop occurs at some iteration k_ε with λ_{k_ε} satisfying

    θ(λ_{k_ε}) ≤ θ_C + ε.                                              (4.2.6)

- If ε = 0, the sequences {r_k} and {θ(λ_k)} tend to θ_C when k → +∞.
PROOF. Because (4.2.2) = (4.2.5) is less constrained than (4.2.1), the inequality r_k ≤ θ_C is clear and the optimality condition (4.2.6) does hold when the stopping criterion (4.2.3) is satisfied.

Now suppose for contradiction that θ(λ_k) − r_k > δ for some δ > 0 and k = 1, 2, ...; using the constraint of (4.2.5) written at (r_k, λ_k) with index i < k, we obtain

    −δ > θ(λ_i) − θ(λ_k) − ‖s(λ_i)‖ ‖λ_k − λ_i‖.

Because θ is Lipschitzian on the bounded set C (Theorem IV.3.1.2 and Proposition VI.6.2.2), there is L such that
The dualization of (4.2.7) is interesting. Call α ∈ ℝ^k the dual variable, form the Lagrange function L(r, λ, α), and apply the technique of §3.3 to obtain the dual linear program

    max Σ_{i=0}^{k−1} α_i [θ(λ_i) − [s(λ_i)]ᵀλ_i],   α ∈ ℝ^k,
    Σ_{i=0}^{k−1} α_i s(λ_i) = 0,
    Σ_{i=0}^{k−1} α_i = 1 and α_i ≥ 0 for i = 0, ..., k − 1.

Not unexpectedly, this problem can be written in primal notation: since s(λ_i) = −c(u_i) and knowing that Δ_k is the unit simplex of ℝ^k, we actually have to solve

    max_{α∈Δ_k} {Σ_{i=0}^{k−1} α_i φ(u_i) : Σ_{i=0}^{k−1} α_i c(u_i) = 0}.   (4.2.8)
When stated with (4.2.8) instead of (4.2.7), the cutting-plane Algorithm 4.2.1 (a row-generation mechanism) is called Dantzig-Wolfe's algorithm (a column-generation mechanism). In terms of the original problem (1.1.1), it is much more suggestive: basically, φ(·) and c(·) are replaced in (4.2.8) by appropriate convex combinations. Accordingly, the point

    Δ_k ∋ α ↦ u(α) := Σ_{i=0}^{k−1} α_i u_i

can reasonably be viewed as an approximate primal solution:
Theorem 4.2.5 Let the convexity conditions of Corollary 2.3.6(ii) hold (U convex, φ concave, c affine) and suppose that Algorithm 4.2.1, applied to the convex function θ : ℝ^m → ℝ, can be used with (4.2.7) instead of (4.2.5). When the stop occurs, denote by ᾱ ∈ Δ_k an optimal solution of (4.2.8). Then u(ᾱ) is an ε-solution of the primal problem (1.1.1).

PROOF. Observe first that u(ᾱ) is feasible by construction. Also, there is no duality gap between the two linear programs (4.2.7) and (4.2.8), so we have (use the concavity of φ)

    r_k = Σ_{i=0}^{k−1} ᾱ_i φ(u_i) ≤ φ(u(ᾱ)).

The stopping criterion and the weak duality Theorem 2.1.5 terminate the proof. □
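In code, the primal recovery of Theorem 4.2.5 is immediate (our own sketch; alpha is an optimal solution of (4.2.8), obtainable e.g. with linprog as above, and primal_points stores the u_i returned by the black box):

    import numpy as np

    def primal_recovery(alpha, primal_points):
        # u(alpha) = sum_i alpha_i u_i: feasible by construction under
        # the convexity assumptions of Theorem 4.2.5.
        return sum(a * np.asarray(u, dtype=float) for a, u in zip(alpha, primal_points))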
A final comment: the convergence results above heavily rely on the fact that all the iterates u_i are stored and taken into account for (4.2.2). No trick similar to the compression mechanism of §IX.2.1 seems possible in general. We will return to the cutting-plane algorithm in Chap. XV.
5 Putting the Method in Perspective

On several occasions, we have observed connections between the present dual approach and various concepts from the previous chapters. These connections were by no means fortuitous, and exploiting them allows fruitful interpretations and extensions. In this section, Assumption 1.1.5 is not in force; in particular, θ may assume the value +∞.
From its very definition, the dual function strongly connotes the conjugacy operation of Chap. X. This becomes even more suggestive if we introduce the artificial variable y = c(u) (the coefficient of λ); such a trick is always very fruitful when dealing with duality. Here we consider the following function (remember §IV.2.4):

    ℝ^m ∋ y ↦ P(y) := −sup {φ(u) : u ∈ U, c(u) = y}.                   (5.1.1)

In other words, the right-hand side of the constraints in the primal problem (1.1.1) is considered as a varying parameter y, the supremal value being a function of that parameter; and the original problem is recovered when y is fixed to 0. It is this primal function that yields the expected conjugacy correspondence. The reason for the change of sign is that convexity comes in in a handier way.
PROOF. By definition,
This property holds for example when φ is bounded from above on U (take μ = 0). In view of Proposition IV.1.2.1, it also holds if P ∈ Conv ℝ^m, which in turn is true if (use the notation (2.4.1), Theorem IV.2.4.2, and remember that φ is assumed finite-valued on U ≠ ∅):

Theorem 5.1.1 explains why convexity popped up in Sections 2.3 and 2.4: being a conjugate function, θ does not distinguish the starting function P from its closed convex hull. As a first consequence, the closed convex hull of P is readily obtained:
This sheds some light on the connection between the primal-dual pair of problems (1.1.1) and (2.1.6):
(i) When it exists, the duality gap is the number

    P(0) − co P(0),

i.e. the discrepancy at 0 between P and its closed convex hull. There is no duality gap if P ∈ Conv ℝ^m, or at least if P has a "closed convex behaviour near 0"; see (iii) below. In the situation (5.1.2), P ∈ Conv ℝ^m, so co P = cl P. In this case, there is no duality gap if P(0) = cl P(0).
(ii) The minima of a closed convex function have been characterized in (X.1.4.6), which becomes here

This proves once more Theorem 5.1.1, but also explains Theorem VII.3.3.2, giving an expression for the subdifferential of a closed convex primal function: even though (5.1.1) does not suggest it, P is in this case a sup-function and its subdifferential can be obtained from the calculus rules of §VI.4.4.
(v) The most interesting observation is perhaps that the dual problem actually solves the "closed convex version" of the primal problem.

This latter object is rather complicated, but it simplifies in some cases, for example when the data (U, φ, c) are as follows:

Furthermore, assume that this dual problem has some optimal solution λ̄; then the solutions of (5.1.4) are those

    P*(λ) = (I_U − ⟨q, ·⟩)*(A*λ) + λᵀb = σ_U(A*λ + q) + λᵀb   for all λ ∈ ℝ^m.
Note: it is only for simplicity that we have assumed U bounded; the result would still
hold under finer hypotheses relating Ker A to the asymptotic cone of c̄o U. An interesting
instance of (5.1.3) is integer linear programming. Consider the primal problem in ℕ^n (a sort
of generalized knapsack problem)

inf cᵀx
Ax = a ∈ ℝ^m,   Bx − b ∈ −(ℝ₊)^p   [i.e. Bx ≤ b ∈ ℝ^p]   (5.1.5)
x_i ∈ ℕ,   x_i ≤ x̄   for i = 1, ..., n.

Here x̄ is some positive integer, introduced again to avoid technicalities; then the next result
uses the duality scheme

Corollary 5.1.4 The dual optimal value associated with (5.1.5) is the optimal value of

inf cᵀx
Ax = a,   Bx − b ∈ −(ℝ₊)^p,   (5.1.6)
0 ≤ x_i ≤ x̄   for i = 1, ..., n.

whose closed convex hull is clearly [0, x̄]^n × [0, z̄]^p. Thus Proposition 5.1.3 applies, and it
suffices to eliminate the slacks to obtain (5.1.6). □
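As an illustration of Corollary 5.1.4, the following sketch compares, on a small hypothetical instance of (5.1.5), the brute-force integer optimal value with the value of the continuous relaxation (5.1.6), which by the corollary is also the dual optimal value; all data below are ours.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    # hypothetical data for (5.1.5): min c.x, Ax = a, Bx <= b, x integer in [0, xbar]
    c = np.array([1.0, 1.0])
    A, a = np.array([[2.0, 3.0]]), np.array([4.0])
    B, b = np.array([[1.0, 1.0]]), np.array([3.0])
    xbar = 3

    # integer optimal value, by brute force over {0, ..., xbar}^2
    pts = (np.array(x, float) for x in itertools.product(range(xbar + 1), repeat=2))
    v_int = min(c @ x for x in pts
                if np.allclose(A @ x, a) and np.all(B @ x <= b + 1e-9))

    # value of the continuous relaxation (5.1.6) = dual value, by Corollary 5.1.4
    res = linprog(c, A_ub=B, b_ub=b, A_eq=A, b_eq=a,
                  bounds=[(0, xbar)] * 2, method="highs")
    print("integer value:", v_int, "  relaxation = dual value:", res.fun)

Here the integer value is 2 while the relaxation (hence dual) value is 4/3: the duality gap announced by the theory.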
is evidently equivalent to (1.1.1): it has the same feasible set, and the same objective
function there. In the dual space, however, this equivalence no longer holds: the
Lagrangian associated with (5.2.1) is
called the augmented Lagrangian associated with (1.1.1) (knowing that L = L₀ is the
"ordinary" Lagrangian). Correspondingly, we have the "augmented dual function"
Remark 5.2.1 (Inequality Constraints) When the problem has inequality constraints, as
in (3.2.1), the augmentation goes as follows: with slack variables, we have
Working out the calculations, the "non-slackened augmented Lagrangian" boils down to

U × ℝ^p ∋ (u, μ) ↦ L_t(u, μ) = φ(u) − Σ_{j=1}^p π_t(c_j(u), μ_j),

where

π_t(y, μ) = { μy + ½ty²   if y ≥ −μ/t,
            { −μ²/(2t)    if y ≤ −μ/t.      (5.2.3)

In words: the j-th constraint is "augmentedly dualized" if it is frankly violated; otherwise,
it is neglected, but a correcting term is added to obtain continuity in u. Note: the dual variables
μ_j are no longer constrained, and they appear in a "more strictly convex" way. □
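In code, the term (5.2.3), as reconstructed above, reads as follows; the function name is ours.

    def pi_t(y, mu, t):
        # The term (5.2.3) as reconstructed above (t > 0).  The constraint
        # value y = c_j(u) is "augmentedly dualized" when y >= -mu/t; otherwise
        # only the constant -mu^2/(2t) remains, and the two branches match at
        # y = -mu/t, which is the announced continuity in u.
        if y >= -mu / t:
            return mu * y + 0.5 * t * y * y
        return -mu * mu / (2.0 * t)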
A mere application of the weak duality theorem to the primal-dual pair (5.2.1),
(5.2.2) gives
(5.2.4)
for all t, all λ ∈ ℝ^m and all u feasible in (5.2.1) or (1.1.1). On the other hand, the
obvious inequality L_t ≤ L extends to dual functions: θ_t ≤ θ, i.e. the augmented
Lagrangian approach cannot worsen the duality gap. Indeed, this approach turns out
to be efficient when the "lack of convexity" of the primal function can be corrected
by a quadratic term:
Theorem 5.2.2 With P defined by (5.1.1), suppose that there are t₀ ≥ 0 and λ̄ ∈ ℝ^m
such that

P(y) ≥ P(0) − λ̄ᵀy − ½t₀‖y‖²   for all y ∈ ℝ^m.   (5.2.5)

Then, for all t ≥ t₀, there is no duality gap associated with the augmented Lagrangian
L_t, and actually
PROOF. When (5.2.5) holds, it holds also with t₀ replaced by any t ≥ t₀, and then we
have for all y:
Remember the definition (5.1.1) of P: this means precisely that there is no duality
gap, and that λ̄ minimizes θ_t. □
and (5.2.5) just means ∂P_{t₀}(0) ≠ ∅ (attention: the calculus rule on a sum of subdif-
ferentials is absurd for P_t = P + ½t‖·‖², simply because P is not convex!).
The perturbed primal function establishes an interesting connection between the
augmented Lagrangian and the Moreau-Yosida regularization of Example XI.3.4.4:
- We will see in Chap. XV that there are sound numerical algorithms computing a Moreau-
Yosida regularization, at least approximately; then the gradient ∇θ_t (or at least approxima-
tions of it) will be available "for free", see (XI.3.4.8).
However, the augmented Lagrangian technique suffers from a serious drawback: it usually
kills the practical Assumption 1.1.2. The reader can convince himself that, in all the examples
of §1.2, the augmented Lagrangian no longer has a decomposed structure; roughly speaking,
if a method is conceived to maximize it, the same method will solve the original problem
(1.1.1) as well, or at least its penalized form
(see §VII.3.2 for example). We conclude that, in practice, a crude use of the augmented
Lagrangian is rarely possible. On the other hand, it can be quite useful in theory, particularly
for various interpretational exercises: remember our comments at the end of §VII.3.2.
Example 5.2.4 Consider the simple knapsack problem (2.2.2), whose optimal value is 0,
but for which the minimal value of θ is 1/2 (see Remark 2.3.5 again). The primal function
is plotted in Fig. 5.2.1; observe that the value at 0 of its convex hull is −1/2 (the opposite
of the optimal dual value). The graph of P_t is obtained by bending upwards the graph of P,
i.e. adding ½t‖·‖²; for t ≥ t₀ = 2, the discontinuity at y = 1 is lifted high enough to yield
(5.2.5) with λ̄ = 0. In view of Theorem 5.2.2, the duality gap is suppressed.
Fig. 5.2.1. The primal function in a knapsack problem
Calculations confirm this happy event: with the help of (5.2.3), the augmented dual
function is (we neglect large values of λ):
this shows that, if t ≥ 2 and λ is close to 0, θ_t(λ) = λ²/(2t), whose minimal value is 0.
Examining Fig. 5.2.1, the reader can convince himself that the same phenomenon occurs
for an arbitrary knapsack problem; and even more generally, for any integer program such as
(5.1.5). In other words, any integer linear programming problem is equivalent to minimizing
the (convex) augmented dual function: an easy problem. The price for this "miracle" is the
(expensive) maximization of the augmented Lagrangian. □
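The gap-closing effect can be reproduced on a three-point toy problem (the data are ours; it is written as a minimization, so all signs are flipped with respect to the chapter's sup-form): the ordinary dual value is −1 against a primal value 0, and the augmented dual reaches 0 exactly from t = 2 on, echoing the t₀ = 2 of the example above.

    import numpy as np

    # hypothetical three-point primal: minimize f0(u) over u in U s.t. c(u) = 0
    U  = [0, 1, 2]
    f0 = lambda u: -(u - 1) ** 2
    c  = lambda u: u - 1

    primal = min(f0(u) for u in U if c(u) == 0)          # = 0, at u = 1

    def dual_value(t, grid=np.linspace(-10.0, 10.0, 2001)):
        # augmented dual: theta_t(lam) = min_u [f0(u) + lam*c(u) + (t/2)c(u)^2]
        theta = lambda lam: min(f0(u) + lam * c(u) + 0.5 * t * c(u) ** 2 for u in U)
        return max(theta(lam) for lam in grid)           # maximize over a grid

    for t in [0.0, 1.0, 2.0, 5.0]:
        print(f"t = {t:3.1f}   dual = {dual_value(t):6.3f}   primal = {primal}")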
To conclude, we mention that (5.2.1) is not the only possible augmentation. Closed
convexity of P_t implies ∂P_t(y) ≠ ∅ for "many" y (namely on the whole relative
interior of dom P_t = dom P), a somewhat luxurious property: as far as closing the
duality gap is concerned, it suffices to have ∂P_t(0) ≠ ∅ (Theorem 5.2.2). In terms of
(1.1.1), this latter property means that it is possible to add some stiff enough quadratic
term to P, thus obtaining a perturbed primal function which "looks convex near 0".
Other devices may be more appropriate in some other situations:
Example 5.2.5 Suppose that there are λ̄ ∈ ℝ^m and t₀ > 0 such that
in other words, the same apparent convexity near y = 0 is produced by some steep
enough sublinear term added to P. Then consider for t > 0 the class of problems
This latter augmentation does suppress the duality gap at y = 0: reproduce the
proof of Theorem 5.2.2 to obtain
Concerning the regularization ability proved by Proposition 5.2.3, we have the fol-
lowing: t‖·‖ is the support function of the ball B(0, t) (§V.2.3); its conjugate is the
indicator function of B(0, t) (Example X.1.1.5); computing the infimal convolution
of the latter with a function such as θ gives at λ the value inf {θ(μ) : ‖μ − λ‖ ≤ t}. In
summary, Proposition 5.2.3 can be reproduced to establish the correspondence
Finally note the connection with the exact penalty technique of §VII.1.2. We know
from Corollary VII.3.2.3 that, under appropriate assumptions and for t large enough,
P_t(0) is not changed if the constraint c(u) = 0 is removed from (5.2.7). □
In §1.2, we applied duality to the actual solution of some practical optimization
problems; more examples were seen in §3. We now review a few situations where
duality can also be extremely useful for theoretical purposes.
(a) Constraints with Values in a Cone. Consider abstractly a primal problem posed
under the form

sup {φ(u) : u ∈ U, c(u) ∈ K}.   (5.3.1)
In (1.1.1) for example, we had K = {0} ⊂ ℝ^m; in (3.2.1), K was the nonpositive
orthant of ℝ^p. More generally, we take here for K a closed convex cone in some
finite-dimensional vector space, call it ℝ^m, equipped with a scalar product ⟨·, ·⟩; and
K° will be the polar cone of K.
The Lagrange function is then conveniently defined as
where the last equality simply relates the support and indicator functions of a closed
convex cone; see Example V.2.3.1. The weak duality theorem
a formulation useful for adapting Everett's Theorem 2.1.1. Indeed the slackened Lagrang-
ian is

U × K × K° ∋ (u, v, λ) ↦ φ(u) − ⟨λ, c(u)⟩ + ⟨λ, v⟩,

and a key is to observe that ⟨λ, v⟩ stays constant when v describes the face F_K(λ)
of K exposed by λ ∈ K°. In other words, if u_λ maximizes the Lagrangian (5.3.2),
the whole set {u_λ} × F_K(λ) ⊂ U × K maximizes its slackened version. Everett's
theorem becomes: a maximizer u_λ of the Lagrangian (5.3.2) solves the perturbed
primal problem

sup_{u∈U} {φ(u) : c(u) − c(u_λ) ∈ K − F_K(λ)}.

If c(u_λ) ∈ F_K(λ), the feasible set in this problem contains that of (5.3.1). Thus,
Corollary 2.1.2 becomes: if some u_λ maximizing the Lagrangian is feasible and
satisfies the complementarity slackness condition ⟨λ, c(u_λ)⟩ = 0, then this u_λ is optimal in
(5.3.1).
The results from Sections 2.3 and 2.4 can also be adapted via a slight generalization
of §3.2(b) and (c), which is left to the reader. We just mention that in §3.2, the cones
K and K° were both full-dimensional, yielding some simplification in the formulae.
note the difference with (5.2.1): the extra variable v is free, while it was frozen at 0 in
the case of an augmented Lagrangian. We find ourselves in a constrained optimization
framework, and we can define the Lagrangian
for all (u, v, λ) ∈ U × ℝ^m × ℝ^m. The corresponding dual function is easy to compute:
the same Lagrangian black box as before is used for u (another difference from §5.2),
and the maximization in v is explicit:
Proposition 5.3.1 Let t > 0; with the notation (5.1.1), (5.3.5), there holds
This result was not totally unexpected: in view of Theorem 5.1.1, the conjugate of P_t is
the sum θ_t of two closed convex functions (barring the change of sign). No wonder, then, that
P_t is an infimal convolution: remember Corollary X.2.1.3. Note, however, that the starting
function P is not in Conv ℝ^m, so a proof is really needed.
Remark 5.3.2 Compare Proposition 5.3.1 with Proposition 5.2.3. Augmenting the Lagrang-
ian and penalizing the primal constraints are operations somehow conjugate to each other: in
one case, the primal function is a sum and the dual function is an infimal convolution; and
vice versa. Note an important difference, however: closed convexity of P_t was essential for
Proposition 5.2.3, while Proposition 5.3.1 is totally general.
This difference has a profound reason, outlined in Example 5.2.4: the augmented La-
grangian is able to suppress the duality gap, a very powerful property, which therefore requires
some assumption; by contrast, penalization brings nothing more than the existence of a dual
solution. □
Here f ∈ Conv ℝ^n, C₀ ⊂ ℝ^n is a nonempty closed convex set intersecting dom f,
A is linear from ℝ^n to ℝ^m, and each c_j : ℝ^n → ℝ is convex. Thus we now accept
an extended-valued objective function; but the constraint functions are still assumed
finite-valued.
This problem enters the general framework of the present chapter if we take the
Lagrangian

ℝ^n × ℝ^m × (ℝ₊)^p ∋ (x, λ, μ) ↦ L(x, λ, μ) = f(x) + λᵀ(Ax − b) + μᵀc(x)

and the corresponding dual function

ℝ^m × (ℝ₊)^p ∋ (λ, μ) ↦ θ(λ, μ) = − inf {L(x, λ, μ) : x ∈ C₀}.
The control space is now U = C₀ ∩ dom f. The whole issue is then whether there
is a dual solution, and whether the filling property (2.3.2) holds; altogether, these
properties will guarantee the existence of a saddle-point, i.e. of a primal-dual solution-
pair.
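For concreteness, here is how the dual function just defined can be evaluated on a small smooth instance; the data are ours, and the inner minimization over C₀ is delegated to a bound-constrained solver.

    import numpy as np
    from scipy.optimize import minimize

    # hypothetical smooth instance: f(x) = ||x||^2, one affine constraint Ax = b,
    # one convex inequality c1(x) <= 0, and C0 = the box [-5, 5]^2
    A, b = np.array([[1.0, 1.0]]), np.array([1.0])
    f  = lambda x: x @ x
    c1 = lambda x: x[0] ** 2 - x[1]

    def theta(lam, mu):
        # theta(lam, mu) = -inf { L(x, lam, mu) : x in C0 }, with mu >= 0
        L = lambda x: f(x) + lam @ (A @ x - b) + mu * c1(x)
        res = minimize(L, np.zeros(2), bounds=[(-5.0, 5.0)] * 2)
        return -res.fun

    print(theta(np.array([0.5]), 0.3))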
Theorem 5.3.3 With the above notation, make the following Slater-type assumption:

There is x₀ ∈ (ri dom f) ∩ ri C₀ such that Ax₀ = b and
c_j(x₀) < 0 for j = 1, ..., p.

A solution x̄ of (5.3.6) is characterized by the existence of λ = (λ₁, ..., λ_m) ∈ ℝ^m
and μ = (μ₁, ..., μ_p) ∈ ℝ^p such that

0 ∈ ∂f(x̄) + Aᵀλ + Σ_{j=1}^p μ_j ∂c_j(x̄) + N_{C₀}(x̄),   (5.3.7)
The calculus rule XI.3.1.2 can therefore be invoked: x̄ solves (5.3.6) if and only if
Then it suffices to express the last two normal cones, which was done for example in
§VII.2.2. □
Of course, (5.3.7) means that L(·, λ, μ) has a subgradient at x̄ whose opposite
is normal to C₀ at x̄; thus, x̄ solves the Lagrangian problem associated with (λ, μ) -
and (λ, μ) solves the dual problem.
In view of §2.3, a relevant question is now the following: suppose we have found
a dual solution (λ̄, μ̄); can we reconstruct a primal solution from it? For this, we need
the filling property, which in turn calls for the calculus rule VI.4.4.2. The trick is to
realize that the dual function is not changed (and hence, neither are its subdifferentials)
if the minimization of the Lagrangian is restricted to some sublevel-set: for r large
enough and (λ, μ) in a neighborhood of (λ̄, μ̄),

−θ(λ, μ) = inf {L(x, λ, μ) : x ∈ C₀, L(x, λ, μ) ≤ r}.
On several occasions in this Section 5, connections have appeared between the La-
grangian duality of the present chapter and the conjugacy operation of Chap. X. On
the other hand, it has been observed in (X.2.3.2) that, for two closed convex functions
g₁ and g₂, the optimal value in the "primal problem"
PROOF. Apply (XI.3.4.5) to the functions f_i = g_i* for i = 1, 2: the infimal convolution
g₁* ∔ g₂* is exact at 0, and the subdifferential ∂(g₁* ∔ g₂*)(0) is then (5.4.4).
Now apply Theorem X.2.3.2: this last subdifferential is ∂(g₁ + g₂)*(0), which in turn
is just the solution-set of (5.4.1); see (X.1.4.6). □
(a) From Fenchel to Lagrange. As was done in §5.3 in different situations, the
approach of this chapter can be applied to Fenchel's duality. It suffices to formulate
(5.4.1) as

inf {g₁(x₁) + g₂(x₂) : x₁ − x₂ = 0}.

This is a minimization problem posed in ℝ^n × ℝ^n, with constraint-values in ℝ^n,
which lends itself to Lagrangian duality: taking the dual variable λ ∈ ℝ^n, we form
the Lagrangian

L(x₁, x₂, λ) = g₁(x₁) + g₂(x₂) + λᵀ(x₁ − x₂).
can be written

θ(λ) = − inf_{x₁} [g₁(x₁) + λᵀx₁] − inf_{x₂} [g₂(x₂) − λᵀx₂] = g₁*(−λ) + g₂*(λ),

a form which blatantly displays Fenchel's dual problem (5.4.2).
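A quick numerical sanity check of this correspondence, on two quadratics with explicit conjugates (our toy choice): the primal value inf(g₁ + g₂) and the value −min_λ [g₁*(−λ) + g₂*(λ)] of Fenchel's dual coincide.

    import numpy as np

    # g1(x) = 0.5||x - a||^2 has conjugate g1*(s) = 0.5||s||^2 + <s, a>, etc.
    a, b = np.array([1.0, 0.0]), np.array([0.0, 2.0])

    g1  = lambda x: 0.5 * np.sum((x - a) ** 2)
    g2  = lambda x: 0.5 * np.sum((x - b) ** 2)
    g1c = lambda s: 0.5 * s @ s + s @ a
    g2c = lambda s: 0.5 * s @ s + s @ b

    # primal optimal value: inf (g1 + g2), attained at the midpoint
    primal = g1((a + b) / 2) + g2((a + b) / 2)

    # dual function of the text: theta(lam) = g1*(-lam) + g2*(lam); minimize it
    lam_star = (a - b) / 2               # solves 2*lam + (b - a) = 0
    dual = -(g1c(-lam_star) + g2c(lam_star))

    print(primal, dual)                  # both equal ||a - b||^2 / 4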
(b) From Lagrange to Fenchel. Conversely, suppose we would like to apply Fenchel
duality to the problems encountered in this chapter. This can be done at least formally,
with the help of appropriate functions in (5.4.1):
(i) g₂(x) = I_{{0}}(Ax − b) models affine equality constraints, as in convex instances
of (1.1.1);
(ii) g₂(x) = I_K(c(x)), where K is a closed convex cone, plays the same role for
the problem of §5.3(a); needless to say, inequality constraints correspond to the
nonpositive orthant K of ℝ^p;
(iii) g₂(x) = ½t‖Ax − b‖² is associated with penalized affine constraints, as in §5.3(b);
(iv) in the case of the augmented Lagrangian (5.2.1), we have a sum of three functions:
Many other situations can be imagined; let us consider the case of affine constraints
in some detail.
(c) Qualification Property. With g₁ and h closed convex, A linear from ℝ^n to ℝ^m,
and ℝ^m equipped with the usual dot-product, consider the following form of (5.4.1):
(5.4.5)
The role of the control set U is played by dom g₁, and h can be an indicator, a squared
norm, etc.; the g₂ of (5.4.1) is given by
(d) Dual Problem. The conjugate of g₂ can be computed with the help of the calculus
rule X.2.2.3: assume (5.4.6), so that in particular [A(ℝ^n) − b] ∩ ri dom h ≠ ∅;
we obtain precisely
(e) Nonlinear Constraints. The situation corresponding to (ii) in §5.4(b) also gives
an interesting illustration: with p convex functions c₁, ..., c_p from ℝ^n to ℝ, define

ℝ^n ∋ x ↦ g_j(x) := I_{]−∞,0]}(c_j(x))   for j = 1, ..., p.

Then take an objective function g₀ ∈ Conv ℝ^n, and consider the primal problem
inspired by the form (3.2.1) with inequality constraints:

Proposition 5.4.3 With the above notation, assume the existence of x̄₁, ..., x̄_p such
that c_j(x̄_j) < 0 for j = 1, ..., p. Then the optimal value in (5.4.10) is

inf {θ(μ) : μ ∈ (ℝ₊)^p}

with
PROOF. Our assumptions allow the use of the conjugate calculus given in Exam-
ple X.2.5.3:

g_j*(s) = min_{μ>0} μ c_j*(s/μ)   for j = 1, ..., p.

Then (5.4.10) has actually two (groups of) minimization variables: (s₁, ..., s_p) and
μ = (μ₁, ..., μ_p). We minimize with respect to {s_j} first: the value (5.4.10) is the
infimum over μ ∈ (ℝ₊)^p of the function
The key is to realize that this is the value at s = 0 of the infimal convolution
Note that this is an abstract result, which says nothing about primal-dual rela-
tionships, nor about the existence of primal-dual solutions; in particular, it does not rule out the
case dom θ = ∅.
Let us conclude: there is a two-way correspondence between Lagrange and
Fenchel duality schemes, even though they start from different primal problems; the
difference is mainly a matter of taste. The Lagrangian approach may be deemed more
natural and flexible; in particular, it is often efficient when the initial optimization
problem contains "intermediate" variables, say y_j = c_j(x), which one wants to single
out for some reason. On the other hand, Fenchel's approach is often more direct in
theoretical developments.
XIII. Inner Construction of the Approximate
Subdifferential: Methods of ε-Descent
- Or, if there does not exist any such direction, explain why, i.e. find a subgradient
which is (approximately) 0 - the difficulty being that the black box (U1) is not
supposed to ever answer s(x) = 0.
This mechanism consisted in collecting information about ∂f(x) in a bundle, and
worked as follows (see Algorithm IX.1.6 for example):
- Given a compact convex polyhedron S under-estimating ∂f(x), i.e. with S ⊂ ∂f(x),
one computed a hyperplane separating S and {0}. This hyperplane was essentially
defined by its normal vector d, interpreted as a direction, and one actually computed
the best such hyperplane, namely the projection of 0 onto S.
- Then one made a line-search along this d, with two possible exits:
(a) The hyperplane actually separated ∂f(x) and {0}, in which case the process
was terminated. The line-search was successful and f could be improved in the
direction d, to obtain the next iterate, say x₊.
(b) Or the hyperplane did not separate ∂f(x) and {0}, in which case the line-search
produced a new subgradient, say s₊ ∈ ∂f(x), to improve the current S - the
line-search was unsuccessful and one looped to redo it along a new direction,
issuing from the same x, but obtained from the better S.
We explained that this mechanism could be grafted onto each iteration of a descent
scheme. It would thus allow the construction of descent directions without computing
the full subdifferential explicitly, a definite advantage over the steepest-descent method
of Chap. VIII. A fairly general algorithm would then be obtained, which could for
example minimize dual functions associated with abstract optimization problems;
such an algorithm would thus be directly comparable to those of §XII.4.
Yet the idea is good if a simple precaution is taken: rather than the purely local set
∂f(x), it is ∂_ε f(x) that we must identify. It gathers differential information from a spe-
cific, "finitely small", neighborhood of x - called V_ε(x) in Theorem XI.4.2.5 - instead
of a neighborhood shrinking to {x}, as is the case for ∂f(x) - see Theorem VI.6.3.1.
Accordingly, we can guess that this will fix (i) and (ii) above. In fact:
PROOF. Use for example Theorem XI.2.1.1: for given d ≠ 0, let η := f'_ε(x, d). If
η < 0, we can find t > 0 such that
Remark 1.1.6 This is just the basic scheme, for which several variants can easily be
imagined. For example, ε = ε_p can be chosen at each iteration. A stop at iteration p*
will then mean that x^{p*} is ε_{p*}-minimal. Various conditions will ensure that this stop
does eventually occur; for example

Σ_{p=1}^{p*} ε_p → +∞   when p* → ∞.   (1.1.1)
It is not the first time that we encounter this idea of taking a divergent series:
the subgradient algorithm of §XII.4.1 also used one for its stepsizes. It is appropriate
to recall Remark XII.4.1.5: here again, a condition like (1.1.1) makes little sense in
practice. This is particularly visible with p* appearing explicitly in the summation:
when choosing ε_p, we do not know whether the present p-th iteration is going to be the last
one. In the present context, we just mention one simple "online" rule, more reasonable:
take a fixed ε but, when 0 ∈ ∂_ε f(x^p), diminish ε, say divide it by 2, until it reaches a
final acceptable threshold. It is straightforward to extend the proof of Proposition 1.1.5
to such a strategy.
A general convergence theory of these schemes with varying ε's is rather trivial
and of little interest; we will not insist on this aspect here. The real issue would be
of a practical nature, consisting of a study of the values of ε (= ε_p), and/or of the
norming ‖·‖, to reach maximal efficiency - whatever this means! □
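The "online" rule just described is easy to sketch as a driver loop; eps_step is a hypothetical routine, assumed here to encapsulate one run of the basic scheme at fixed ε.

    def eps_halving_drive(x, eps_step, eps0=1.0, eps_min=1e-3):
        # Sketch of the "online" rule above.  eps_step(x, eps) is a hypothetical
        # routine running the basic scheme at fixed eps and returning the point
        # at which 0 enters the eps-subdifferential.
        eps = eps0
        while True:
            x = eps_step(x, eps)          # stops when 0 is in the eps-subdifferential
            if eps <= eps_min:
                return x, eps             # x is eps-minimal for this final eps
            eps *= 0.5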
Remark 1.1.7 To solve a numerical problem (for example of optimization) one nor-
mally constructs a sequence - say {x^p}, usually infinite - for which one proves some
desired asymptotic property: for example that any, or some, cluster point of {x^p}
solves the initial problem - recall Chap. II, and more particularly (II.1.1.8). In this
sense, the statement of Proposition 1.1.5 appears as rather non-classical: we construct
a finite sequence {x^p}, and then we establish how close the last iterate comes to opti-
mality. This point of view will be systematically adopted here. It makes convergence
results easier to prove in our context, and furthermore we believe that it better reflects
what actually happens on the computer.
Observe the double role played by ε at each iteration of Algorithm 1.1.4: an "ac-
tive" role to compute d^p, and a "passive" role as a stopping criterion. We should really
have two different epsilons: one, possibly variable, used to compute the direction; and
the other to stop the algorithm. Such considerations will be of importance for the next
chapter. □
Having eliminated the difficulty 1.1.1(i), we must now check that 1.1.1(ii) is
eliminated as well; this is not quite trivial and motivates the present chapter. We
therefore study one single iteration of Algorithm 1.1.4, i.e. one execution of Step 1,
regardless of any sophisticated choice of e. In other words, we are interested in the
static aspect of the algorithm, in which the current iterate x P is considered as fixed.
Dropping the index p from our notation, we are given fixed x ∈ ℝ^n and ε > 0,
and we apply the mechanism of Chap. IX to construct compact convex polyhedra in
∂_ε f(x).
If we look at Step 1 of Algorithm 1.1.4 in the light of Algorithm IX.1.6, we see that
it is going to be an iterative subprocess which works as follows; we use the subscript
k as in Chap. IX.
Process 1.1.8 The black box (U1) has computed a first ε-subgradient s₁, and a first
polyhedron S₁ is on hand:
- At stage k, having the current polyhedron S_k ⊂ ∂_ε f(x), compute the best hyperplane
separating S_k and {0}, i.e. compute
(recall that Proj denotes the Euclidean projection onto the closed convex hull of a
set).
- Then determine whether
(a_ε) d_k separates not only S_k but even ∂_ε f(x) from {0}; then the work is finished,
∂_ε f(x) is properly identified, d_k is an ε-descent direction and we can pass to
the next iteration in Algorithm 1.1.4;
or
(b_ε) d_k does not separate ∂_ε f(x) from {0}; then enlarge S_k with a new s_{k+1} ∈ ∂_ε f(x) and
loop to the next k. □
Of course, the above alternatives (a_ε) - (b_ε) play the same role as (a) - (b) of
§1.1. In Chapter IX, (a) - (b) were resolved by a line-search minimizing f along d_k.
In case (b), the optimal stepsize was 0, the event f(x + td_k) < f(x) was impossible
to obtain, and stepsizes t ↓ 0 were produced. The new element s_{k+1}, to be appended
to S_k, was then a by-product of this line-search, namely a corresponding cluster point
of {s(x + td_k)}_{t↓0}.
Here, in order to resolve the alternatives (a_ε) - (b_ε), and detect whether d_k is an
ε-descent direction (instead of a mere descent direction), we can again minimize f
along d_k and see whether ε can thus be dropped from f. This is no good, however: in
case of failure we will not obtain a suitable s_{k+1}.
A much better idea is to use Definition 1.1.2 itself: compute the support function
f'_ε(x, d_k) and check whether it is negative. From Theorem XI.2.1.1, this means min-
imizing the perturbed difference quotient, a problem to which the present §1.2 is devoted.
Clearly enough, the alternatives (a_ε) - (b_ε) will thus be resolved; but in case of
failure, when (b_ε) holds, we will see below that a by-product of this minimization is
an s_{k+1} suitable for our enlargement problem.
Naturally, the material in this section takes a lot from §XI.2. In the sequel, we
drop the index k: given x ∈ ℝ^n, d ≠ 0 and ε > 0, the perturbed difference quotient
is as in §XI.2.2:

q_ε(t) := { +∞                            if t ≤ 0,
          { [f(x + td) − f(x) + ε] / t    if t > 0.     (1.2.2)
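In code, q_ε and its companion r_ε in the variable u = 1/t (the convex function actually minimized below) are simply the following; the function names are ours.

    import math

    def q_eps(f, x, d, eps, t):
        # perturbed difference quotient (1.2.2)
        if t <= 0:
            return math.inf
        return (f(x + t * d) - f(x) + eps) / t

    def r_eps(f, x, d, eps, u):
        # change of variable u = 1/t: r_eps(u) = q_eps(1/u), convex in u > 0
        return q_eps(f, x, d, eps, 1.0 / u)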
Remark 1.2.1 The above problem (1.2.3) looks like a line-search, just as in Chap. II: after
all, it amounts to finding a stepsize along the direction d. We should mention, however, that
such an interpretation is slightly misleading, as the motivation is substantially different. First,
the present "line-search" is aimed at diminishing the perturbed difference quotient, rather
than the objective function itself.
More importantly, as we will see in §2.2 below, q_ε must be minimized rather accurately.
By contrast, we have insisted long enough in §II.3 to make it clear that a line-search has little
to do with one-dimensional minimization: its role is rather to find a "reasonable" stepsize
along the given direction, "reasonable" being understood in terms of the objective function.
We just note here that the present "line-search" will certainly not produce a zero stepsize,
since q_ε(t) → +∞ when t ↓ 0 (cf. Proposition XI.2.2.2). This confirms our observation (ii_ε)
in §1.1. □
here,

e(x, y, s) := f(x) − [f(y) + ⟨s, x − y⟩]   (1.2.5)

is the linearization error made at x when f is linearized at y with slope s (see the
transportation formula of Proposition XI.4.2.2). The function r_ε is minimized over a
nonempty compact interval U_ε = [u̲_ε, ū_ε], with
T_ε := {t = 1/u : u ∈ U_ε and u > 0}

of minima of q_ε is a closed interval (possibly empty, possibly not bounded from above),
which does not contain 0. Denoting by t̲_ε := 1/ū_ε and t̄_ε := 1/u̲_ε the endpoints of
T_ε (the convention 1/0 = +∞ is used), there holds for any t > 0:
(i) t ∈ T_ε ⟺ e(x, x + td, s) = ε for some s ∈ ∂f(x + td);
(ii) t < t̲_ε ⟺ e(x, x + td, s) < ε for all s ∈ ∂f(x + td);
(iii) t > t̄_ε ⟺ e(x, x + td, s) > ε for all s ∈ ∂f(x + td).
Of course, (iii) is pointless if u̲_ε = 0.
PROOF. All this comes from the change of variable t = 1/u in the convex function r_ε
(remember that it is minimized "at finite distance"). The best way to "see" the proof
is probably to draw a picture, for example Figs. 1.2.1 and 1.2.2. □
For our present concern of minimizing q_ε, the two cases illustrated respectively
by Fig. 1.2.1 (ū_ε > 0, T_ε nonempty) and Fig. 1.2.2 (ū_ε = u̲_ε = 0, T_ε empty) are
rather different. For each stepsize t > 0, take s ∈ ∂f(x + td). When t increases,
- the inverse stepsize u = 1/t decreases;
- being the slope of the convex function r_ε, the number ε − e(x, x + td, s) decreases
(not continuously);
- therefore e(x, x + td, s) increases.
Altogether, e(x, x + td, s) starts from 0 and increases with t > 0; a discontinuity
occurs at each t such that ∂f(x + td) has a nonzero breadth along d.
The difference between the two cases in Figs. 1.2.1 and 1.2.2 is whether or not
e(x, x + td, s) crosses the value ε; this property conditions the non-vacuousness of
T_ε.
Lemma 1.2.2 is of course crucial for a one-dimensional search aimed at minimiz-
ing q_ε. First, it shows that q_ε is mildly nonconvex (the technical term is quasi-convex).
Also, the number ε − e, calculated at a given t, contains all the essential information
- It must separate the current S from {0} "sufficiently well", which means it must
make ⟨s₊, d⟩ large enough; see for example Remark IX.2.1.3.
Then the next result will be useful.
PROOF. [(j)] Combine Lemma 1.2.2(i) with (1.2.6) to obtain s ∈ ∂f(x + td) ∩ ∂_ε f(x)
such that

f(x) = f(x + td) − t⟨s, d⟩ + ε;

since t > 0, the property f(x) ≤ f(x + td) + ε implies ⟨s, d⟩ ≥ 0.
[(jj)] In case (jj), every t > 0 comes under case (ii) of Lemma 1.2.2, and any s(t) ∈
∂f(x + td) is in ∂_ε f(x) by virtue of (1.2.6). Also, since d is not a direction of ε-descent,
Remark 1.2.5 Compare with §XI.4.2: for all t smaller than the largest element t_ε(d)
of T_ε, ∂f(x + td) ⊂ ∂_ε f(x) because x + td ∈ V_ε(x). For t = t_ε(d), one can only
ascertain that some subgradient at x + td is an ε-subgradient at x and can fruitfully
enlarge the current polyhedron S. For t > t_ε(d), ∂f(x + td) ∩ ∂_ε f(x) = ∅ because
x + td ∉ V_ε(x). □
In summary, when t minimizes q_ε (including the case "t large enough" if T_ε = ∅),
it is theoretically possible to find in ∂f(x + td) the necessary subgradient s₊ to improve
the current approximation of ∂_ε f(x). We will see in §2 that it is not really possible to
find s₊ in practice; rather, it is a convenient approximation of it which will be obtained
during the process of minimizing q_ε.
We begin to see how Algorithm 1.1.4 will actually work: Step 1 will be a projection
onto a convex polyhedron, followed by a minimization of the convex function r_ε. In
anticipation of §2, we restate Algorithm 1.1.4, with a somewhat more detailed Step 1.
Also, we incorporate a criterion to stop the algorithm without waiting for the (unlikely)
event "0 ∈ S_k".
STEP 1.1 (computing the trial direction). Solve the minimization problem in α

(1.3.1)

where Δ_k is the unit simplex of ℝ^k. Set d_k = −Σ_{i=1}^k α_i s_i; if ‖d_k‖ ≤ δ stop.
STEP 1.2 (line-search). Minimize r_ε and conclude:
(a_ε) either f'_ε(x^p, d_k) < 0; then go to Step 2.
(b_ε) or f'_ε(x^p, d_k) ≥ 0; then obtain a suitable s_{k+1} ∈ ∂_ε f(x^p), replace k by k + 1
and loop to Step 1.1.
STEP 2 (descent). Obtain t > 0 such that f(x^p + td_k) < f(x^p) − ε. Set x^{p+1} =
x^p + td_k, replace p by p + 1 and loop to Step 1.0. □
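Step 1.1 is a small quadratic program over the unit simplex; a possible sketch (our names, with a general-purpose solver standing in for a specialized QP code) is the following.

    import numpy as np
    from scipy.optimize import minimize

    def trial_direction(S, delta):
        # Step 1.1 sketch: project the origin onto conv{s_1, ..., s_k}.
        # S is a (k, n) array whose rows are the eps-subgradients collected
        # so far; we solve min ||sum_i alpha_i s_i||^2 over the unit simplex
        # (problem (1.3.1)) and return d_k = -sum_i alpha_i s_i.
        k = S.shape[0]
        G = S @ S.T                                    # Gram matrix
        res = minimize(lambda a: a @ G @ a, np.full(k, 1.0 / k),
                       jac=lambda a: 2.0 * G @ a,
                       bounds=[(0.0, 1.0)] * k,
                       constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
                       method="SLSQP")
        alpha = res.x
        d = -S.T @ alpha
        return d, alpha, bool(np.linalg.norm(d) <= delta)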
Remark 1.3.2 Within Step 1, k increases by one at each subiteration. The number
of such subiterations is not known in advance, so the number of subgradients to be
stored can grow arbitrarily large, and the complexity of the projection problem as
well. We know, however, that it is not really necessary to keep all the subgradients at
each iteration: Theorem IX.2.1.7 tells us that the (k + 1)st projection must be made
onto a polyhedron which can be as small as the segment [−d_k, s_{k+1}]. In other words,
the number of subgradients to be stored, i.e. the number of variables in the quadratic
problem of Step 1.1, can be as small as 2 (but at the price of a substantial loss in actual
performance: remember Fig. IX.2.2.3). □
Remark 1.3.3 If we compare this algorithm to those of Chap. II, we see that it works
rather similarly: at each iterate x^p, it constructs a direction d_k, along which it per-
forms a line-search. However there are two new features, both characteristic of all the
minimization algorithms to come.
- First, the direction is computed in a rather sophisticated way, as the solution of an
auxiliary minimization problem, instead of being given explicitly in terms of the
"gradient" s_k.
- The second feature brings a rather fundamental difference: it is that the line-search
has two possible exits (a_ε) and (b_ε). The first case is the normal one and could be
called a descent step, in which the current iterate x^p is updated to a better one. In
the second case, the iterate is kept where it is and the next line-search will start from
the same x^p. As in the bundling Algorithm IX.1.6, this can be called a null-step.
□
We can guess that there will be two rather different cases in Step 1.2, corresponding
to those of Figs. 1.2.1 and 1.2.2. This will be seen in more detail in §2, but we give already
here an idea of the convergence proof, assuming a simple situation: only the case of
Fig. 1.2.1 occurs, q_ε is minimized exactly at each line-search, and correspondingly,
an s_{k+1} predicted by Lemma 1.2.4 is found at each null-step. Note that the last two
assumptions are rather unrealistic.
Theorem 1.3.4 Suppose that each execution of Step 1.2 in Algorithm 1.3.1 produces
an optimal u_k > 0 and a corresponding s_{k+1} ∈ ∂f(x^p + (1/u_k)d_k) such that
where

e_k := f(x^p) − f(x^p + (1/u_k)d_k) + ⟨s_{k+1}, (1/u_k)d_k⟩.

Then: either f(x^p) → −∞, or the stop of Step 1.1 occurs for some finite iteration
index p*, at which there holds

f(x^{p*}) ≤ f(y) + ε + δ‖y − x^{p*}‖   for all y ∈ ℝ^n.   (1.3.2)

PROOF. Suppose f(x^p) is bounded from below. As in Proposition 1.1.5, Step 2 cannot
be executed infinitely many times. At some finite iteration index p*, Algorithm 1.3.1
therefore loops between Steps 1.1 and 1.2, i.e. case (b_ε) always occurs at this iteration.
Then Lemma 1.2.4(j) applies for each k: first

s_{k+1} ∈ ∂_{e_k} f(x^{p*}) = ∂_ε f(x^{p*})   for all k;

we deduce in particular that the sequence {s_k} ⊂ ∂_ε f(x^{p*}) is bounded (Theo-
rem XI.1.1.4). Second, ⟨s_{k+1}, d_k⟩ ≥ 0 which, together with the minimality conditions
for the projection −d_k, yields

⟨−d_k, s_j − s_{k+1}⟩ ≥ ‖d_k‖²   for j = 1, ..., k.

Lemma IX.2.1.1 applies (with s_k = −d_k and m = 1): d_k → 0 if k → ∞, so the stop
must eventually occur. Finally, each −d_k is a convex combination of ε-subgradients
s₁, ..., s_k of f at x^{p*} (note that s₁ ∈ ∂_ε f(x^{p*}) by construction) and is therefore an
ε-subgradient itself:

f(y) ≥ f(x^{p*}) − ⟨d_k, y − x^{p*}⟩ − ε   for all y ∈ ℝ^n.
This section contains the computational details that are necessary to implement the
algorithm introduced in §1 and sketched as Algorithm 1.3.1. We mention that these details
are by no means fundamental for the next chapters and can therefore be skipped by
a casual reader - it has already been mentioned that methods of ε-descent are not
advisable for actual use on "real life" problems. On the other hand, these details give a
good idea of the kind of questions that arise when methods for nonsmooth optimization
are implemented.
To obtain a really implementable form of Algorithm 1.3.1, we need to specify
two calculations: in Step 1.1, how the α-problem is solved, and in Step 1.2, how the
line-search is performed. The α-problem is a classical convex quadratic minimization
problem with linear constraints; as such, it poses no particular difficulty. The line-
search, on the contrary, is rather new since it consists of minimizing the nonsmooth
function q_ε or r_ε. It forms the subject of the present section, in which we largely use
the principles of §II.3.
A totally clear implementation of a line-search implies answering three questions.
(i) Initialization: how should the first trial stepsize be chosen? Here, the line-search
must be initialized at each new direction d_k. No really satisfactory initialization
is known - and this is precisely one of the drawbacks of the present algorithm.
We will not study this problem, considering that the choice t = 1 is simplest,
if not excessively sensible! (instead of 1, one must of course at least choose a
number which, when multiplied by ‖d_k‖, gives a reasonable move from x^p).
(ii) Iteration: given the current stepsize, assumed not suitable, how can the next one
be chosen? This will be the subject of §2.1.
(iii) Stopping Criterion: when can the current stepsize be considered suitable
(§2.2)? As is the case with all line-searches, this is by far the most delicate
question in the present context. It is more crucial here than ever: it gives not only
the conditions for stopping the line-search, but also those for determining the
next subgradient to be added to the current approximation of ∂_ε f(x^p).
for the linearization error (1.2.5). This creates no confusion because x and d are fixed
and because, once again, the subgradient s is considered as a single-valued function,
depending on the black box (U1). Recall that ε − e(t) is in ∂r_ε(1/t) (we remind
the reader that the last expression means the subdifferential of the convex function
u ↦ r_ε(u) at the point u = 1/t).
The problem to be solved in this subsection is: during the process of minimizing
q_ε, where should we place each trial stepsize? Applying the principles of §II.3.1, we
must decide whether a given t > 0 is
(O) convenient, so the minimization of q_ε can be stopped;
(L) on the left of the set of convenient stepsizes;
(R) on their right.
The key to designing (L) and (R) lies in the statements 1.2.2 and 1.2.3: in fact,
call the black box (U1) to compute s(t) ∈ ∂f(x + td); then compute e(t) of (2.1.1)
and compare it to ε. There are three possibilities:
(O) := "{e(t) = ε}" (an extraordinary event!). Then this t is optimal and the line-
search is finished.
(L) := "{e(t) < ε}". This t is too small, in the sense that no optimal t can lie on
its left. It can serve as a lower bound for all subsequent stepsizes, therefore set
t_L = t before looping to the next trial.
(R) := "{e(t) > ε}". Not only can T_ε be only on the left of this t, but also s(t) ∉ ∂_ε f(x),
so t is too large (see Proposition XI.4.2.5 and the discussion following it). This
makes two reasons to set t_R = t before looping.
Apart from the stopping criterion, we are now in a position to describe the line-
search in some detail. The notation in the following algorithm is exactly that of §II.3.

t_L ≤ t ≤ t_R

for any t minimizing q_ε. Once an interpolation is made, i.e. once some real t_R has
been found, no extrapolation is ever made again. In this case one can ascertain that q_ε
has a minimum "at finite distance". The new t in Step 4 must be computed so that,
if infinitely many extrapolations [resp. interpolations] are made, then t_L → ∞ [resp.
t_R − t_L → 0]. This is what was called the safeguard-reduction Property II.3.1.3.
Remark 2.1.2 See §II.3.4 for suitable safeguarded strategies in Step 4. Without entering
into technical details, we mention a possibility for the interpolation formula; it is aimed at
minimizing a convex function, so we switch to the (u, r)-notation, instead of (t, q).
Suppose that some u_L > 0 (i.e. t_R = 1/u_L < +∞) has been generated: we do have an
actual bracket [u_L, u_R], with corresponding actual function- and slope-values; let us denote
them by r_L and r'_L, r_R and r'_R respectively. We will subscript by G the endpoint that is better
than the other. Look for example at Fig. 2.1.1: u_G = u_L because r_L < r_R. Finally, call ū the
minimum of r_ε, assumed unique for simplicity.
To place the next iterate, say u₊, two possible ideas can be exploited.
Idea Q (quadratic). Assume that r_ε is smooth near its minimum. Then it is a good idea to adopt
a smooth model for r, for example a convex quadratic function Q:
where c > 0 estimates the local curvature of r_ε. This yields a proposal u_Q := argmin Q
for the next iterate.
Idea P (polyhedral). Assume that r_ε is kinky near its minimum. Then best results will probably
be obtained with a piecewise affine model P:
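The two ideas can be sketched as follows; the explicit formulas below are our reconstruction (a quadratic fitted at the better endpoint, and the intersection of the two tangent lines), not the book's exact models Q and P, and a safeguard keeps the proposal well inside the bracket.

    def next_u(uL, rL, dL, uR, rR, dR, c=1.0, frac=0.1):
        # Our reconstruction of the two interpolation ideas of Remark 2.1.2
        # (not the book's exact formulas), inside an actual bracket [uL, uR]
        # in the variable u = 1/t, with slopes dL < 0 < dR of r_eps.
        uG, rG, dG = (uL, rL, dL) if rL < rR else (uR, rR, dR)
        u_quad = uG - dG / c                   # argmin of a quadratic model Q,
                                               # c > 0 estimating the curvature
        # polyhedral model P: intersection of the two tangent lines
        u_poly = (rR - rL + dL * uL - dR * uR) / (dL - dR)
        u_new = 0.5 * (u_quad + u_poly)        # naive mix of the two proposals
        # safeguard-reduction (Property II.3.1.3): stay well inside the bracket
        lo, hi = uL + frac * (uR - uL), uR - frac * (uR - uL)
        return min(max(u_new, lo), hi)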
It has already been mentioned again and again that the stopping criterion (Step 3 in Al-
gorithm 2.1.1) is by far the most important ingredient of the line-search. Without it, no
actual implementation can even be considered. It is a pivot between Algorithms 2.1.1
and 1.3.1.
As is clear from Step 1.2 in Algorithm 1.3.1, the stopping test consists in
choosing between three possibilities:
- either the present t is not suitable and the line-search must be continued;
- or the present t is suitable because case (a_ε) is detected: f'_ε(x, d) < 0; this simply
amounts to testing the descent inequality
(2.2.1)
(2.2.2)
(2.2.3)
Here, the symbol "0" can be taken as the mere number 0, as was the case in the
framework of Theorem 1.3.4. However, we saw in Remark IX.2.1.3 that "0" could
also be a suitable negative tolerance, namely a fraction of −‖d_k‖². Actually, the
following specification of (2.2.3):
(2.2.4)
(2.2.5)
It can be seen that the two requirements (2.2.4) and (2.2.5) are antagonistic. They
may even be incompatible:

∂f(x) = { {−1}          if x < x̄,
        { [−1, exp x̄]   if x = x̄,
        { {exp x}       if x > x̄.
Now suppose that the ε-descent Algorithm 1.3.1 is initialized at x₁ = 0, where the
direction of search is d₁ = 1. The trace-function t ↦ f(0 + t·1) is differentiable
except at a certain t̄ ∈ [0, 1[, corresponding to x̄. The linearization-error function e
of (2.1.1) is

e(t) = { 0                      if 0 ≤ t < t̄,
       { 2 + (t − 1) exp t      if t > t̄.

At t = t̄, the number e(t̄) is somewhere in [0, e*] (depending on which subgradient
s(t̄) is actually computed by the black box), where

e* := 2 + (t̄ − 1) exp t̄ = 3t̄ − t̄² > 1.
This example is illustrated in Fig. 2.2.1. Finally suppose that ε ≤ e*, for example
ε = 1. Clearly enough,
- if t ∈ [0, t̄[, then ⟨s(t), d₁⟩ = −1 = −‖d₁‖², so (2.2.4) cannot hold, no matter how
m' is chosen in ]0, 1[;
- if t > t̄, then e(t) > e* ≥ ε, so (2.2.5) cannot hold.
In other words, it is impossible to obtain simultaneously (2.2.4) and (2.2.5) unless
the following two extraordinary events happen:
(i) the line-search must produce the particular stepsize t = t̄ (said otherwise, q_ε or
r_ε must be exactly minimized);
(ii) at this x̄ = 0 + t̄·1, the black box must produce a rather particular subgradient s,
namely one between the two extreme points −1 and exp x̄, so that ⟨s, d₁⟩ is large
enough and (2.2.5) has a chance to hold.
Note also that, if ε is large enough, namely ε ≥ 2 − f(x̄), (2.2.1) cannot be
obtained, just by definition of x̄. □
Remark 2.2.2 A reader not familiar with numerical computations may not consider (i) as
so extraordinary. He should however remember how a line-search algorithm has to work (see
§II.3, and especially Example II.3.1.5). Furthermore, t̄ is given by a non-solvable equation
and cannot be computed exactly. This is why we bothered to take a somewhat complicated f
in (2.2.6). The example would be just as good with, say, f(x) = |x − 1|. □
From this example, we draw the following conclusion: it may happen that, for the
given f, x, d and ε, no line-search algorithm can produce a suitable new element
in ∂_ε f(x); and this no matter how m' is chosen in ]0, 1[. When this phenomenon
happens, Algorithm 2.1.1 is stuck: it loops forever between Step 1 and Step 4. Of
course, relaxing condition (2.2.4), say by taking m' ≥ 1, does not help because then
it is Algorithm 1.3.1 which may be stuck, looping within Step 1: d_k will have no
reason to tend to 0 and the stop of Step 1.1 will never occur.
The diagnostic is that (1.2.6), i.e. the transportation formula, does not allow us
to reach those ε-subgradients that we need. The remedy is to use a slightly more
sophisticated formula, to construct ε-subgradients as combinations of subgradients
computed at several other points:

With α = (α₁, ..., α_m) in the unit simplex Δ_m, set e := Σ_{j=1}^m α_j e_j. Then there holds

s := Σ_{j=1}^m α_j s_j ∈ ∂_e f(x).
This result can be applied to our line-search problem. Suppose that the minima
of q_ε have been bracketed by t_L and t_R: in view of Lemma 1.2.2, we have on hand
two subgradients s_L := s(x + t_L d) and s_R := s(x + t_R d) satisfying
On the other hand, Lemma 2.2.3 shows that s_μ ∈ ∂_ε f(x) if μ is large enough,
namely μ ≥ μ̲, with

μ̲ := (e_R − ε)/(e_R − e_L) ∈ ]0, 1[.   (2.2.8)
It turns out (and this will follow from Lemma 2.3.2 below) that, from the property
f(x + td) ≥ f(x) − ε, we have μ̲ ≤ μ̄ for t_R − t_L small enough. When this happens, we
are done because we can find s_{k+1} = s_μ satisfying (2.2.4) and (2.2.5); the line-search
Algorithm 2.1.1 can be stopped, and the current approximation of ∂_ε f(x^p) can be suitably enlarged.
There remains to treat the case in which no finite t_R can ever be found, which
happens if q_ε behaves as in Fig. 1.2.2. Actually, this case is simpler: t_L → ∞ and
Lemma 1.2.4 tells us that s(x + t_L d) is eventually convenient.
A simple and compact way to carry out the calculations is to take for μ the
extreme value μ̲ of (2.2.8), it being understood that μ = 1 if t_R = ∞. This amounts to
systematically taking s_μ in ∂_ε f(x). Then the stopping criterion is:
(a_ε) stop the line-search if f(x + td) < f(x) − ε (and pass to the next x);
(b_ε) stop the line-search if ⟨s_μ, d⟩ ≥ −m'‖d‖² (and pass to the next d).
In all other cases, continue the line-search, i.e. pass to the next trial t.
The consistency of this stopping criterion will be confirmed in the next section.
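Putting §2.1 and this criterion together, one line-search can be sketched as follows (our transcription; doubling and bisection stand in for the safeguarded formulas of Step 8 below).

    import numpy as np

    def eps_line_search(f, sub, x, d, eps, m_prime=0.1, t0=1.0, max_cycles=50):
        # Sketch of one line-search (Steps 5-8 of Algorithm 2.3.1) with the
        # stopping criterion above; f and sub form the black box (U1).
        fx = f(x)
        tL, eL, sL = 0.0, 0.0, sub(x)              # "normal" initializations
        tR, eR, sR = 0.0, None, np.zeros_like(d)   # artificial ones
        t = t0
        for _ in range(max_cycles):
            ft, s = f(x + t * d), sub(x + t * d)
            e = fx - ft + t * (s @ d)              # linearization error (2.1.1)
            if e < eps:                            # case (L): t too small
                tL, eL, sL = t, e, s
            else:                                  # case (R): t too large
                tR, eR, sR = t, e, s
            if ft < fx - eps:                      # test (a_eps): eps-descent
                return "descent", t
            mu = 1.0 if tR == 0.0 else (eR - eps) / (eR - eL)
            s_mu = mu * sL + (1.0 - mu) * sR       # eps-subgradient at x
            if s_mu @ d >= -m_prime * (d @ d):     # test (b_eps): null-step
                return "null", s_mu
            t = 2.0 * t if tR == 0.0 else 0.5 * (tL + tR)
        raise RuntimeError("no exit within max_cycles")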
We are now in a position to summarize this Section 2 and to write down the complete
organization of the proposed algorithm. The initial iterate x¹ is on hand, together
with the black box (U1) which, for each x ∈ ℝ^n, computes f(x) and s(x) ∈ ∂f(x).
Furthermore, a number k̄ ≥ 2 is given, which is the maximal number of n-vectors
that can be stored, in view of the memory allocated in the computer. We choose the
tolerances ε > 0, δ > 0 and m' ∈ ]0, 1[. Our aim is to furnish a final iterate x^{p*}
satisfying

f(y) ≥ f(x^{p*}) − ε − δ‖y − x^{p*}‖   for all y ∈ ℝ^n.   (2.3.1)

This gives some hints on how to choose the tolerances: ε is homogeneous to function-
values; as in Chap. II, δ is homogeneous to gradient-norms; and the value m' = 0.1
is reasonable.
The algorithm below is of course a combination of Algorithms 1.3.1, 2.1.1, and
of the stopping criterion introduced at the end of §2.2. Notes such as (1) refer to
explanations given afterwards.

Algorithm 2.3.1 (Algorithm of ε-Descent) The initial iterate x¹ is given. Set p = 1.
Compute f(x^p) and s(x^p) ∈ ∂f(x^p). Set f⁰ = f(x^p), s₁ = s(x^p).(1)
e = f⁰ − f + t⟨s, d⟩.
STEP 6 (dispatching). If e < ε, set t_L = t, e_L = e, s_L = s.
If e ≥ ε, set t_R = t, e_R = e, s_R = s.
If t_R = 0, set μ = 1; otherwise set μ = (e_R − ε)/(e_R − e_L).(4)
STEP 7 (stopping criterion of the line-search). If f < f⁰ − ε go to Step 11.
Set s_μ = μs_L + (1 − μ)s_R.(5) If ⟨s_μ, d⟩ ≥ −m'‖d‖² go to Step 9.
STEP 8 (iterating the line-search). If t_R = 0, extrapolate, i.e. compute a new
t > t_L.
If t_R ≠ 0, interpolate, i.e. compute a new t in ]t_L, t_R[. Loop to Step 5.
STEP 9 (managing the computer memory). If k = k̄, delete at least two (arbi-
trary) elements from the list s₁, ..., s_k. Insert in the new list the element −d
coming from Step 2 and let k < k̄ be the number of elements thus obtained.(6)
STEP 10 (iterating the direction-finding process). Set s_{k+1} = s_μ.(7) Replace k by
k + 1 and loop to Step 2.
STEP 11 (iterating the descent process). Set x^{p+1} = x^p + td, s₁ = s.(8) Replace p
by p + 1 and loop to Step 1. □
Comments
(1) f⁰ is the objective-value at the origin-point x^p of each line-search; s₁ is the
subgradient computed at this origin-point.
(2) This stop may not be a real stop but a signal to reduce ε and/or δ (cf. Remark 1.3.5).
It should be clear that, at this stage of the algorithm, the current iterate satisfies
the approximate minimality condition (2.3.1).
(3) The initializations of s_L and e_L are "normal": in view of (1) above, s₁ ∈ ∂f(x^p)
and, as is obvious from (2.1.1), e(x^p, x^p, s₁) = 0. On the other hand, the initial-
izations of t_R and s_R are purely artificial (see Algorithm II.3.1.2 and the remark
following it): they simply help define s_μ in the forthcoming Step 7. Initializing
e_R is not necessary.
(4) In view of the values of e_L and e_R, μ becomes < 1 as soon as some t_R > 0 is
found.
(5) If t_R = 0, then μ = 1 and s_μ = s_L. If e_R = ε, then μ = 0 and s_μ = s_R.
Otherwise s_μ is a nontrivial convex combination. In all cases s_μ ∈ ∂_ε f(x^p).
(6) We have chosen to give a specific rule for cleaning the bundle, instead of staying
abstract as in Algorithm IX.2.1.5. If k < k̄, then k can be increased by one and
the next subgradient can be appended to the current approximation of ∂_ε f(x^p).
Otherwise, one must make room to store at least the current projection −d and
the next subgradient s_μ, thus making it possible to use Theorem IX.2.1.7. Note,
however, that if the deletion process turns out to keep every s_j corresponding to
α_j > 0, then the current projection need not be appended: it belongs to the convex
hull of the current subgradients and will not improve the next polyhedron anyway.
(7) Here we arrive from Step 7 via Step 9, and s_μ is the convenient convex combi-
nation found by the line-search. This s_μ is certainly in ∂_ε f(x^p): if μ = 1, then
e_L ≤ ε and the transportation formula (XI.4.2.2) applies; if μ ∈ ]0, 1[, the corre-
sponding linearization error μe_L + (1 − μ)e_R is equal to ε, and it is Lemma 2.2.3
that applies.
(8) At this point, s is the last subgradient computed at Step 5 of the line-search. It is
a subgradient at the next iterate x^{p+1} = x^p + td. □
A loop from Step 11 to Step 1 represents an actual ε-descent, with x^p moved; this
was called a descent-step in Remark 1.3.3. A loop from Step 10 to Step 2 represents
one iteration of the direction-finding procedure, i.e. a null-step: one subgradient is ap-
pended to the current approximation of ∂_ε f(x^p). The detailed line-search is expanded
between Steps 5 and 8. At each cycle, Step 7 decides whether
- to iterate the line-search, by a loop from Step 8 to Step 5,
- to iterate the direction-finding procedure,
- or to iterate the descent process, with an actual ε-descent obtained.
Now we have to prove convergence of this algorithm. First, we make sure that
each line-search terminates.

Lemma 2.3.2 Let f : ℝ^n → ℝ be convex. Suppose that the extrapolation and
interpolation formulae in Step 8 of Algorithm 2.3.1 satisfy the safeguard-reduction
Property II.3.1.3. Then, for each iteration (p, k), the number of loops from Step 8 to
Step 5 is finite.

PROOF. Suppose for contradiction that, for some fixed iteration (p, k), the line-search
does not terminate. Suppose first that no t_R > 0 is ever generated. By construction,
μ = 1 and s_μ = s = s_L forever. One has therefore at each cycle
where the first inequality holds because Step 7 never exits to Step 11; the second is
because s_μ ∈ ∂f(x^p + t_L d). We deduce
(2.3.2)
Now, the assumptions on the extrapolation formulae imply that t_L → ∞. In view of
the test Step 7 → Step 9, (2.3.2) cannot hold infinitely often.
Thus some t_R > 0 must eventually be generated, and Step 6 shows that, from
then on, μe_L + (1 − μ)e_R = ε at every subsequent cycle. This can be expressed as
follows:

f(x^p) − μf(x^p + t_L d) − (1 − μ)f(x^p + t_R d)
= ε − t_L μ⟨s_L, d⟩ − t_R(1 − μ)⟨s_R, d⟩.

Furthermore, non-exit from Step 7 to Step 11 implies that the left-hand side above is
smaller than ε, so
(2.3.3)
By assumption on the interpolation formulae, t_L and t_R have a common limit
t ≥ 0, and we claim that t > 0. If not, the Lipschitz continuity of f in a neighborhood
of x^p (Theorem IV.3.1.2) would imply the contradiction

0 < ε ≤ e_R = f(x^p) − f(x^p + t_R d) + t_R⟨s_R, d⟩ ≤ 2Lt_R → 0

(L being a Lipschitz constant around x^p).
Thus, after division by t > 0, (2.3.3) can be written as

μ⟨s_L, d⟩ + (1 − μ)⟨s_R, d⟩ + η > 0,

where the extra term

η := ((t_L − t)/t) μ⟨s_L, d⟩ + ((t_R − t)/t)(1 − μ)⟨s_R, d⟩

tends to 0 when the number of cycles tends to infinity. This is impossible because of
the test Step 7 → Step 9. □
Remark 2.3.4 This proof reveals the necessity of taking the tolerance m' > 0 to stop each
line-search. If m' were set to 0, we would obtain a non-implementable algorithm of the type
IX.1.6.
Other tolerances can equally be used. In the next section, precisely, we will consider
variations around the stopping criterion of the line-search. Here, we propose the following
exercise: reproduce Lemma 2.3.2 with m' = 0, but with the descent criterion in Step 7
replaced by

If f < f⁰ − ε' go to Step 11,

where ε' is fixed in ]0, ε[. □
As already stated in Remark 1.1.6, we are not interested in giving a specific choice
for ε in algorithms such as 2.3.1. It is important for numerical efficiency only; but
precisely, we believe that these algorithms cannot be made numerically efficient in
their present form. We nevertheless consider two variants, obtained by giving ε
rather special values. They are interesting for the sake of curiosity; furthermore, they
illustrate an aspect of the algorithm not directly related to optimization, but rather to
the separation of closed convex sets - see §IX.3.3.
Suppose that f̄ := inf {f(y) : y ∈ ℝ^n} > −∞ is known and set

ē := f(x) − f̄,

x = x¹ being the starting point of the ε-descent Algorithm 2.3.1. If we take this
value ē for ε in that algorithm, the exit from Step 7 to Step 11 will never occur. The
line-searches will be made along a sequence of successive directions, all issuing from
the same starting x.
Then, we could let the algorithm run with ε = ē, simply as stated in 2.3.1 (the
exit-test from Step 7 to Step 11 - and Step 11 itself - could simply be suppressed with
no harm). However, this does no good in terms of minimizing f: when the algorithm
stops, we will learn that the approximate minimality condition (2.3.1) holds with
x^{p*} = x = x¹ and ε = ē; but we knew this before, even with δ = 0.
Fortunately, there are better things to do. Observe first that, instead of minimizing
f within ē (which has already been done), the idea of the algorithm is now to solve the
following problem: by suitable calls to the black box (U1), compute ē-subgradients at
the given x, so as to obtain 0 - or at least a vector of small norm - in their convex hull.
Even though the definition of ē implies 0 ∈ ∂_ē f(x), the only constructive information
concerning this latter set is the black box (U1), and 0 might never be produced by
(U1). In other words, we must try to separate 0 from ∂_ē f(x). We will fail, but the
point is to explain this failure.
Remark 3.1.1 The above problem is not a theoretical pastime but is of great interest in
Lagrangian relaxation. Suppose that (U1) is a "Lagrange black box", as discussed in §XII.1.1:
each subgradient is of the form s = −c(u), u being a primal variable. When we solve the above
separation problem, we construct primal points u₁, ..., u_k and associated convex multipliers
α₁, ..., α_k such that

Σ_{j=1}^k α_j c(u_j) = 0.
The next result shows that, once again, our task is to minimize r_ε: this s_{k+1} must be
some s ∈ ∂f(x + (1/u)d), with u yielding 0 ∈ ∂r_ε(u).

Theorem 3.1.2 Let x, d ≠ 0 and ε ≥ 0 be given. Suppose that s ∈ ℝ^n satisfies one
of the following two properties:
(i) either, for some t > 0, s ∈ ∂f(x + td) and e(x, x + td, s) = ε;
(ii) or s is a cluster point of a sequence {s(t) ∈ ∂f(x + td)}_t with t → +∞ and
e(x, x + td, s(t)) ≤ ε for all t ≥ 0.
Then

s ∈ ∂_ε f(x) and ⟨s, d⟩ = f'_ε(x, d).

PROOF. The transportation formula (Proposition XI.4.2.2) directly implies that the s
in (i) and the s(t) in (ii) are in ∂_ε f(x). Invoking the closedness of ∂_ε f(x) for case
(ii), we see that s ∈ ∂_ε f(x) in both cases.
Now take an arbitrary s' ∈ ∂_ε f(x), satisfying in particular

0 ≤ t⟨s(t), d⟩ − t⟨s', d⟩ + ε.

Divide by t > 0 and let t → +∞ to see that ⟨s, d⟩ ≥ ⟨s', d⟩. □
In terms of Algorithm 2.3.1, we realize that the s_μ computed in Step 7 precisely
aims at satisfying the conditions in Theorem 3.1.2:
(i) Suppose first that some t_R > 0 is generated (which implies T_ε ≠ ∅). When
sufficiently many interpolations are made, t_L and t_R become close to their com-
mon limit, say t_e. Because ∂f(·) is outer semi-continuous, s_L and s_R are both
close to ∂f(x + t_e d). Their convex combination s_μ is also close to the convex
set ∂f(x + t_e d). Finally, e(x, x + t_e d, s_μ) is close to μe_L + (1 − μ)e_R = ε,
by continuity of f. In other words, s_μ almost satisfies the assumptions of Theo-
rem 3.1.2(i). This is illustrated in Fig. 3.1.1.

[Fig. 3.1.1: the graph of t ↦ f(x + td), against the level −ε]
(ii) If no t_R > 0 is generated and t_L → ∞ (which happens when T_ε = ∅), it is case
(ii) that is satisfied by the corresponding s_L = s_μ.
In a word, Algorithm 2.3.1 becomes a realization of Algorithm IX.3.3.1 if we
let the interpolations [resp. the extrapolations] drive t_R − t_L small enough [resp. t_L
large enough] before exiting from Step 7 to Step 9. It is then not difficult to adapt the
various tolerances so that the approximation is good enough, namely ⟨s_μ, d_k⟩ is close
enough to σ_S(d_k).
These ideas have been used for the numerical experiments of Figs. IX.3.3.1 and
IX.3.3.2. It is the above variant that was the first form of Algorithm IX.3.3.1, alluded to
at the end of §IX.3.3. In the case of TR48 (Example IX.2.2.6), ∂_ē f(x) was identified at
x = 0, a significantly non-optimal point: f(0) = −464816, while f̄ = −638565. All
generated s_k had roughly comparable norms. In the example MAXQUAD of VIII.3.3.3,
this was far from being the case: the norms of the subgradients were fairly dispersed;
remember Table VIII.3.3.1.
Remark 3.1.3 An interesting question is whether this variant produces a minimizing se-
quence. The answer is no: take the example of (IX. 1. 1), where the objective function was
e
Start from the initial x = (2, 1) and take e = = 4. We leave it as an exercise to see
how Algorithm 1.3.1 proceeds: the first iteration, along the direction (-1, -2), produces Y2
anywhere on the half-line {(2 - t, 1 - 2t) : t ~ 1/2} ands2 is unambiguously (1, -2). Then,
d 2 is collinear to (-1,0), Y3 = (-1, 1) and S3 = (-1/3,2/3) (for this last calculation, take
S3 = a( -1,0) + (1 - a)(l, 2), a convex combination of subgradients at n, and adjust a
so that the corresponding linearization error e(x, Y3, S3) is equal to e = 4). Finally d3 = 0,
3 Putting the Mgorithm in Perspective 219
although neither Y2 nor Y3 is even close to the minimum (0, 0). These operations are illustrated
by Fig. 3.1.2.
Note in this example that d3 = 0 is a convex combination of S2 and S3: SI plays no role.
Thus 0 is on the boundary of CO{SI, S2, S3} = 04/(X). If e is decreased, S2 and S3 are slightly
perturbed and 0 ¢ CO{SI, S2, S3}: d3 is no longer zero (although it stays small). The property
o E bdo4/(x) does not hold by chance but was proved in Proposition XI. 1.3.5: e = 4 is the
critical value of e and the normal cone to 04/(X) at 0 is not reduced to {O}. 0
Our second variant is also a sort of separation Algorithm IX.3.3 .1. It also performs one
single descent iteration, the starting point of the line-searches never being updated. It
differs from the previous example, though: it is really aimed at minimizing I, and e
is no longer fixed but depends on the direction-index k. Specifically, for fixed x = Xl
and d =F 0 in Algorithm 2.3.1, we set
A second peculiarity of our variant is therefore that e is not known; and also, it
may be 0 (for simplicity, we assume e(x, d) < +00). Nevertheless, the method can
still be made implementable in this case. In fact, a careful study of Algorithm 2.3.1
shows that the actual value of e is needed in three places:
- for the exit-test from Step 7 to Step 11; with the e of (3 .2.1), this test is never passed
and can be suppressed, just as in §3.1;
- in Step 6, to dispatch t as a t L or t R; Proposition 3.2.1 below shows that the precise
value (3.2.1) is useless for this particular operation;
- in Step 6 again, to compute /L; it is here that lack of knowledge of e has the main
consequences.
PROOF. Letiminimize f(x+, d) overjR+. 1ft = 0, thene(x, d) = oand qe(x ,d) (0) =
f'(x, d) ~ qe(x,d)(t) for all t ~ 0 (monotonicityofthe difference quotient); the proof
is finished.
If i > 0, the definition
f(x + id) = f(x) - e(x, d)
combined with the minimality condition
o E (af(x + id), d}
shows that i satisfies the minimality condition of Lemma 1.2.2(i) for qe(x,d)'
Now let {'fk} C jR+ be a minimizing sequence for f(x + ·d):
f(x + 'fkd ) ~ f(x) - e,
where e is the e(x, d) of(3.2.1). If e > 0, then 0 cannot be a cluster point of {'fd: we
can write
f(x + 'fkd) - f(x) + e
-=--.:.----.:.:~-:......:......:......-~O,
'fk
and this implies f;(x, d) ~ O. Ifte minimizes qe, we have
- First, ifno tR > 0 is ever found (f decreasing along d), there is nothing to change:
in view of Lemma 1.2.2(ii), stt = s(x +tLd) E ae(x,d)f(x) for all tL; furthermore,
Lemma 1.2.4Gj) still applies and eventually, (stt, d)?: - m'lIdIl 2 •
- The other case is when a true bracket [t L, t R] is eventually found; then Lemma 2.2.3
implies stt E aef (x) where
(3.2.5)
e(x, d) ?: ~:= max {f(x) - f(x + tL), f(x) - f(x + tRd)}. (3.2.6)
e
Continuity of f ensures that and~have the common limit e(x, d) when tL andtR
both tend to the optimal (and finite) t ?: O. Then, because of outer semi-continuity
of e ~ aef(x), stt is close to ae(x,d)f(x) when tR and tL are close together.
In a word, without entering into the hair-splitting details of the implementation,
the following variant of Algorithm 2.3.1 is reasonable.
Algorithm 3.2.2 (Totally Static Algorithm - Sketch) The initial point x is given,
together with the tolerances TJ > 0,0 > 0, m' E ]0, 1[. Set k = 1, Sl = s(x).
STEP 1. Compute dk := - Proj O/{Sl, ... , Sk}.
STEP 2. If IIdk II ~ 0 stop.
STEP 3. By an approximate minimization oft 1-+ f(x + tdk) over t ?: 0, find tk > 0
andsk+l such that
(3.2.7)
and
where
ek := f(x) - f(x + tkdk) + TJ·
STEP 4. Replace k by k + I and loop to Step 1. o
Here, there is no longer any descent test, but a new tolerance TJ is introduced, to
make sure that e - ~ of (3.2.5), (3.2.6) is small. Admitting that Step 3 terminates for
each k, this algorithm is convergent:
(3.2.9)
for each k. Coming back to the problem of separating asf(x) =: S and {OJ, we
see that Algorithm 3.2.2 is the second form of Algorithm IX.3.3.l that was used for
Figs. IX.3.3.1 and IX.3.3.2: for each direction d, its black box generates S+ E S with
(s+. d) ~ 0 (instead of computing the support function). It is therefore a convenient
instrument for illustrating the point made at the end of §IX.3.3.
XlV. Dynamic Construction of Approximate
Subdifferentials: Dual Form of Bundle Methods
Prerequisites. Basic concepts of numerical optimization (Chap. II); descent principles for
nonsmooth minimization (Chaps. IX and XIII); definition and elementary properties of ap-
proximate subdifferentials (Chap. XI); and to a lesser extent: minirnality conditions for simple
minimization problems, and elements of duality theory (Chaps. VII and XII).
1.1 Motivation
The methods of s-descent, which were the subject of Chap. XIII, present a number of
deficiencies from the numerical point of view. Their rationale itself - to decrease by s
the objective function at each iteration - is suspect: the fact that s can (theoretically)
be chosen in advance, without any regard for the real behaviour of f, must have its
price, one way or another.
<a> The Choice of e. At first glance, an attractive idea in the algorithms of Chap. XIII
is to choose s fairly large, so as to reduce the number of outer iterations, indexed by
p. Then, of course, a counterpart is that the complexity of each such iteration must
be expected to increase: k will grow larger for fixed p. Let us examine this point in
more detail.
Consider one single iteration, say p = 1, in the schematic algorithm of s-descent
XIII.1.3.1. Set the stopping tolerance £, = 0 and suppose for simplicity that qe (or
re) can be minimized exactly at each subiteration k. In a word, consider the following
algorithm:
(1.1.1)
224 XlV. Dual Form of Bundle Methods
1(8)-+] if8te.
- For 8 ~ e, 0 E aeI (x) and Algorithm 1.1.1 cannot stop in Step 3: it has to loop
between Step 1 and Step 3 - or stop in Step 1. In fact, we obtain for 8 = e nothing
but (a simplified form of) the algorithm of §XIII.3.1. As mentioned there, {ykl need
not be a minimizing sequence; see more particularly Remark XIII.3.1.3. In other
words, it may well be that](e) > j. Altogether, we see that the function] has no
reason to be (left-)continuous at e.
Conclusion: it is hard to believe that (1.1.2) holds in practice, because when
numerical calculations are involved, the discontinuity of ] is likely not to be exactly
at e. When 8 is close to e, ](8) is likely to be far from ],just as it is when 8 = e.
Remark 1.1.2 It is interesting to see what really happens numerically. Toward this end,
take again the examples MAXQUAD ofVIII.3.3.3 and TR48 ofIX.2.2.6. To each of them, apply
Algorithm 1.1.1 with e = S, and Algorithm XIII.3.2.2. With the stopping tolerance;; set to 0,
the algorithms should run forever. Actually, a stop occurs in each case, due to some failure:
in the quadratic program computing the direction, or in the line-search. We call again j the
best I -value obtained when this stop occurs, with either of the two algorithms.
Table 1.1.1 summarizes the results: for each of the four experiments, it gives the number of
iterations, the total number of calls to the black box (UI) of Fig. 11.1.2.1, and the improvement
of the final I-value obtained, relative to the initial value, i.e. the ratio
1 Introduction: The Bundle of Information 225
1(8) - 1
f(x) - j"
Observe that the theory is somewhat contradicted: if one wanted to compare Algo-
rithm 1.1.1 ( supposedlynon-convergent) and Algorithm XIII.3 .2.2 (proved to converge), one
should say that the former is better. We add that each test in Table 1.1.1 was run several times,
with various initial x; all the results were qualitatively the same, as stated in the table. The
lesson is that one must be modest when applying mathematics to computers: proper .assess-
ment of an algorithm implies an experimental study, in addition to establishing its theoretical
properties; see also Remark 11.2.4.4.
MAXQUAD TR48
The apparently good behaviour of Algorithm 1.1.1 can be explained by the rather slow
convergence of {dd to o. The algorithm takes advantage of its large number of iterations to
explore the space around the starting x, and thus locates the optimum more or less accurately.
By contrast, the early stop occurring in the 2-dimensional example of Remark XIII.3.1.3 gives
the algorithm no chance. 0
Knowing that a large e is not necessarily safe, the next temptation is to take e
small. Then, it is hardly necessary to mention the associated danger: the direction may
become that of steepest-descent, which has such a bad reputation, even for smooth
functions (remember the end of §II.2.2). In fact, if f is smooth (or mildly kinky),
every anti-subgradient is a descent direction. Thus, if e is really small, the e-steepest
descent Algorithm XIII. 1.3 .1 will simply reduce to
(1.1.3)
at least until x P becomes nearly optimal. Only then, will the bundling mechanism
enter into play.
The delicate choice of e (not too large, not too small ... ) is a major drawback of
methods of e-descent.
(b) The Role of the Line-Search. In terms of diminishing f, the weakness of the
direction is even aggravated by the choice of the stepsize, supposed to minimize over
jR+ not the function t H- f(x P + tdk) but the perturbed difference quotient qs.
Firstly, trying to minimize qs along a given direction is not a sound idea: after all,
it is f that we would like to reduce. Yet the two functions qs and t H- f(x + td) may
have little to do with each other. Figure 1.1.1 shows Ts of Lemma XIIl.l.2.2 for two
extreme values of e: on the left [right] part of the picture, e is too small [too large]:
the stepsize sought by the e-descentAlgorithm XIII.l.3.1 is too small [too large], and
in both cases, f cannot be decreased efficiently.
226 XlV. Dual Form of Bundle Methods
The second difficulty with the line-search is that, when dk is not an e-descent
direction, the minimization of qe must be carried out rather accurately. In Table 1.1.1,
the number of calls to the black box (U 1) per iteration is in the range 5-10; recall that,
in the smooth case, a line-search typically takes less than two calls per iteration. Even
taking into account the increased difficulty when passing to a nonsmooth function,
the present ratio seems hardly acceptable.
Still concerning the line-search, we mention one more problem: an initial stepsize
is hard to guess. With methods of Chap. II, it was possible to initialize t by estimating
the decrease of the line-search function t f-+ f(x + td) (Remark 11.3.4.2). The
behaviour of this function after changing x and d, i.e. after completing an iteration,
could be reasonably predicted. Here the function of interest is qe of (XIIl.l.2.2), which
behaves wildly when x and d vary; among other things, it is infinite for t = 0, i.e. at
x.
Remark 1.1.3 The difficulties illustrated by Fig. 1.1.1 are avoided by a further variant of
§XIII.3.2: a descent test can be inserted in Step 4 of Algorithm XIII.3.2.2, updating x when
f(x + tkdk) < f(x) - E. Here E > 0 is again chosen in advance, so this variant does
not facilitate the choice of a suitable e, nor of a suitable initial stepsize. Besides, the one-
dimensional minimization of f must again be performed rather accurately. 0
(c) Clumsy Use of the Information. Our last item in the list of deficiencies of e-
descent methods lies in their structure itself. Consider the two possible actions after
terminating an iteration (p, k).
(ae) In one case (e-descent obtained), x P is moved and all the past is forgotten.
The polyhedron Sk is reinitialized to a singleton, which is the poorest possible
approximation of aef(x p + I ). This "Markovian" character is unfortunate: clas-
sically, all efficient optimization algorithms accumulate and exploit information
about f from previous iterations. This is done for example by conjugate gradient
and quasi-Newton methods of §1I.2.
(be) In the second case (Sk enriched), x P is left as it is, the information is now
correctly accumulated. Another argument appears, however: the real problem of
interest, namely to decrease f, is somewhat overlooked. Observe for example
that no f -value is used to compute the directions dk; more importantly, the
iterate x P + tkdk produced by the line-search is discarded, even if it happens to
be (possibly much) better than the current x p.
In fact, the above argument is rather serious, although subtle, and deserves some explana-
tion. The bundling idea is based on a certain separation process, whose speed of convergence
1 Introduction: The Bundle of Information 227
is questionable; see §IX.2.2. The (be)-phase, which relies entirely and exclusively upon this
process, has therefore a questionable speed of convergence as well. Yet the real problem, of
minimizing a convex function I, is addressed by the (ae)-phase only. This sort of dissociation
seems undesimble for performance.
In other words, there are two processes in the game: one is the convergence to 0 of some
sequence of (approximate) subgradients Sk = -dk; the otheris the convergence of {f (x P)} to
f. The first is theoretically essential- after all, it is the only way to get a minimality condition
- but has fragile convergence qualities. One should therefore strive to facilitate its task with
the help of the second process, by diminishing IIdll- and I-values in an harmonious way. The
ovemll convergence will probably not be improved in theory; but the algorithm must be given
a chance to have a better behaviour heuristically.
(d) Conclusion. Let us sum up this Section 1.1: e-descent methods have a number
of deficiencies, and the following points should be addressed when designing new
minimization methods.
(i) They should rely less on e. In particular, the test for moving the current x P should
be less crude than simply decreasing f by an a priori value e.
(ii) Their line-search should be based on reducing the natural objective function f.
In particular, the general and sound principles of §II.3 should be applied.
(iii) They should take advantage as much as possible of all the information accumu-
lated during the successive calls to (U I ), perhaps from the very beginning of the
iterations.
(iv) Their convergence should not be based entirely on that of {dkl to 0: they should
take care as much as possible (on a heuristic basis) of the decrease of f to its
minimal value.
In this chapter, we will introduce a method coping with (i) - (iii), and somehow
with (iv) (to the extent that this last point can be quantified, i.e. very little). This
method will retain the concept of keeping x P fixed in some cases, in order to improve
the approximation of aef(x P ) for some e. In fact, this technique is mandatory when
designing methods for nonsmooth optimization based on a descent principle. The
direction will again be computed by projecting the origin onto the current approxi-
mation of ad(x P ), which will be substantially richer than in Chap. XIII. Also, the
test for descent will be quite different: as in §II.3.2, it will require from f a definite
relative decrease, rather than absolute. Finally, the information will be accumulated
all along the iterations, and it is only for practical reasons, such as lack of memory,
that this information will be discarded.
Definition 1.2.2 The actual bundle (or simply "bundle", if no confusion is possible)
is the bank of information that is
- explicitly obtained from the raw bundle,
- actually stored in the computer,
- and used to compute the current direction. o
There are several possibilities for selecting-compressing the raw bundle, depend-
ing on the role of the actual bundle in terms of the resulting minimization algorithm.
Let us illustrate this point.
Example 1.2.3 (Quasi-Newton Methods) If one wanted to relate the present bundling con-
cept to classical optimization methods, one could first consider the quasi-Newton methods
of §1l.2.3. There the aim of the bundle was to identify the second-order behaviour of f. The
selection consisted of two steps:
(i) Only one triple (y, !, s) per iteration was "selected", namely (Xl. !(Xk), Sk); any inter-
mediate information obtained during each line-search was discarded: the actual number
eof elements in the bundle was the number k of line-searches.
(ii) Only the couple (y, s) was kept from each triple (y, !, s): ! was never used.
Thus, the actual bundle was "compressed" to
with i indexing iterations, rather than calls to (VI). This bundle (1.2.2), made of 2nk real
numbers, was not explicitly stored Rather, it was further "compressed" into a symmetric
1 Introduction: The Bundle ofInfonnation 229
n x n matrix - although, for 2nk :s;; 1/2 n(n + 1), this could hardly be called a compression!
- whose aim was to approximate (V2 f) -I.
In summary, the actual bundle was a set of I /2 n (n + 1) numbers, gathering the infonnation
contained in (1.2.2), and computed by recurrence formulae such as that ofBroyden-Fletcher-
Goldfarb-Shanno (11.2.3.9). 0
The (i)-part of the selection in Example 1.2.3 is rather general in optimization methods.
In practice, L in (1.2.1) is just the current number k of iterations, and i indexes line-searches
rather than calls to (Ul). In most optimization algorithms, the intermediate (j, s) computed
during the line-searches are used only to update the stepsize, but not to compute subsequent
directions.
Example 1.2.4 (Conjugate-Gradient Methods) Still along the same idea, consider now
the conjugate-gradient method of §1I.2.4. We have seen that its original motivation, based on
the algebra of quadratic functions, was not completely clear. Rather, it could be interpreted as
a way of smoothing out the sequence {dk} of successive directions (remember Fig. 11.2.4.1).
There, we had a selection similar to (i) of Example 1.2.3; indeed the direction dk was
defined in terms of
{sj:i=l, ...• k}. (1.2.3)
The above two examples are slightly artificial, in that the bundling concept is a fairly
diverted way of introducing the straightforward calculations for quasi-Newton and conjugate-
gradient methods. However, they have the merit of stressing the importance of the bundling
concept: virtually all efficient minimization methods use it. one way or another. On the other
hand, the cutting-plane algorithm of §XII.4.2 is an instance of an approach in which the
bundling idea is blatant. Actually. that example is even too simple: no compression and no
selection can be made, the raw bundle and actual bundle are there identical.
Example 1.2.5 (e-Descent Methods) In Chap. XIII, the bundle was used to identify
oe f at the current iterate. To disclose the selection-compression mechanism, let us
use the notation (P. k) of that chapter.
First of all, the directions depended only on s-values. When computing the current
direction dk issued from the current iterate x P , one had on hand the set of subgradients
(the raw bundle, in a way)
l Sl
{S I' Sl • S2 S2 s2 . . sP sp}
2.···. k!' I' 2'···· k 2" ' " I'···· k •
where kq, for q = I, ... , p - I, was the total number of line-searches needed by the
qth outer iteration (knowing that the pth is not complete yet).
In the selection, all the information collected from the previous descent-iterations
(those having q < p) was discarded, as well as all f- and y-values. Thus, the actual
bundle was essentially
{sf,···. sf} . (1.2.4)
230 XlV. Dual Form of Bundle Methods
Furthermore, there was a compression mechanism, to cope with possibly large values
ofk. If necessary, an arbitrary subset in (1.2.4) was replaced by the convex combination
(1.2.5)
just computed in the previous iteration (remember Algorithm XII!.2.3.1). In a way, the
aggregate subgradient (1.2.5) could be considered as synthesizing the most essential
information from (1.2.4). 0
The actual bundle in Example 1.2.5, with its Sk'S and Sk'S, is not so easy to
describe with explicit formulae; a clearer view is obtained from inspecting step by
step its evolution along the iterations. For this, we proceed to rewrite the schematic
e-descent Algorithm XII!'I.3.1 in a condensed form, with emphasis on this evolution.
Furthermore, we give up the (p, k)-notation of Chap. XIII, to the advantage of the
"standard" notation, used in Chap. II. Thus, the single index k will denote the number
of iterations done from the very beginning of the algorithm. Accordingly, Xk will
be the current iterate, at which the stopping criterion is checked, the direction dk is
computed, and the line-search along dk is performed.
In terms of the methods of Chap. XIII, this change of notation implies a substantial
change of philosophy: there is only one line-search per iteration (whereas a would-be
outer iteration was a sequence ofline-searches); and this line-search has two possible
exits, corresponding to the two cases (ae) and (be).
Call y+ the final point tried by the current line-search: it is of the form y+ =
Xk + tdk' for some step size t > 0. The two possible exits are:
(ae) When I has decreased bye, the current iterate Xk is updated to Xk+l = y+ =
Xk + tdk. The situation is rather similar to the general scheme of Chap. II and
this is called a descent-iteration, or a descent-step. As for the bundle, it is totally
reset to {sk+d = {S(Xk+l)} c ael(Xk+I). There is no need to use an additional
index p.
(be) When a new line-search must be performed from the same x, the trick is to set
Xk+l = Xb disregarding y+. Here, we have a null- iteration, or a null-step. This
is the new feature with respect to the algorithms of Chap. II: the current iterate
is not changed, only the next direction is going to be changed.
With this notation borrowed from classical optimization (i.e. of smooth functions),
Algorithm XII!'I.3.1 can now be described as follows. We neglect the irrelevant details
of the line-search, but we keep the description of a possible deletion-compression
mechanism.
mm II"ko+l
. {I2 L....i=ko+1 (Xisi 11 2 .. (Xi?'~ 0, "ko+l
L....i=ko+1 (Xi = 1}
and set dk := - L~~t!+1 (XiSi. If IIdkll ~ 8 stop.
STEP 2 (line-search). Minimize re along dk accurately enough to obtain a stepsize
t > 0 and a new iterate y+ = Xk + tdk with S+ E o!(y+) such that
- either! (y+) < ! (Xk) - e; then go to Step 3;
- or (s+, dk) ~ - m'lIdk 112 and S+ E Oe!(Xk); then go to Step 4.
STEP 3 (descent-step). Set Xk+ I = y+, ko = k, .e = 1; replace k by k + 1 and loop to
Step 1.
STEP 4 (null-step). Set Xk+1 = Xk, replace k by k + 1. If.e < £, set sko+l+1 = S+
from Step 2, replace .e by .e + 1 and loop to Step 1. Else go to Step 5.
STEP 5 (compression). Delete at least two vectors from the list {Sko+J, ... , SkoH}.
With .e denoting the number of elements in the new list, set Sko+l+ I = -dk from
Step 1, set Sko+l+2 = s+ from Step 2; replace .e by.e + 2 and loop to Step 1. 0
In a way, we obtain something simpler than Algorithm XIII. 1.3. 1. A reader fa-
miliar with iterative algorithms and computer programming may even observe that
the index k can be dropped: at each iteration, the old (x, d) can be overwritten; the
only important indices are: the current bundle size .e, and the index ko of the last
descent-iteration.
Observe the evolution of the bundle: it is totally refreshed after a descent-iteration
(Step 3); it grows by one element at each null-iteration (Step 4), and it is compressed
when necessary (Step 5). Indeed, the relative complexity of this flow-chart is mainly
due to this compression mechanism. If it were neglected, the algorithm would become
fairly simple: Step 5 would disappear and the number.e ofelements in the bundle would
just become k - ko.
Recall that the compression in Step 5 is not fundamental, but is inserted for
the sake of implementability only. By contrast, Definition 1.2.7 below will give a
different type of compression, which is attached to the conception of the method
itself. The mechanism of Step 5 will therefore be more suitably called aggregation,
while "compression" has a more fundamental meaning, to be seen in (1.2.9).
Examining the e-descent method in its form 1.2.6 suggests a weak point, not so
visible in Chap. XlII: why, after all, should the actual bundle be reset entirely at k o?
Some of the elements {sJ, ... ,Sko} might well be in Oe!(Xk). Then why not use them
to compute dk? The answer, quite simple, is contained in the transportation formula
of Proposition Xl.4.2.2, extensively used in Chap. XIII.
Consider the number
e(x, y, s) := !(x) - [f(y) + (s, x - y)] for all x, y, sin lR,n; (1.2.6)
Note also that s returned from (Ul) at y can be considered as a function of y alone
(remember Remark VIII.3.5.l); so the notation e(x, y) could be preferred to (1.2.6).
An important point here is that, for given x and y, this e(x, y) can be explicitly
computed and stored in the computer whenever f (x), f (y) and S (y) are known. Thus,
we have for example a straightforward test to detect whether a given S in the bundle
can be used to compute the direction. Accordingly, we will use the following notation.
Definition 1.2.7 For each i in the raw bundle (1.2.1), the linearization error at the
iterate Xk is the nonnegative number
(1.2.8)
Only the vectors Sj and the scalars ej are needed by the minimization method of
the present chapter. In other words, we are interested in the set
(1.2.9)
and this is the actual bundle. Note: to be really accurate, the term "actual" should be
reserved for the resulting bundle after aggregation of Step 5 in Algorithm 1.2.6; but
we neglect this question of aggregation for the moment.
It is important to note here that the vectors Yj have been eliminated in (1.2.9);
the memory needed to store (1.2.9) is ken + 1), instead of k(2n + 1) for (1.2.1) (with
L = k): we really have a compression here, and the information contained in (1.2.9)
is definitely poorer than that in (1.2.1). However, (1.2.8) suggests that each ej must
be recomputed at each descent iteration; so, can we really maintain the bundle (1.2.9)
without storing also the y's? The answer is yes, thanks to the following simple result.
Proposition 1.2.8 For any two iterates k and k' and sampling point yj, there holds
(1.2.10)
Remark 1.2.9 An equivalent way of storing the relevant information is to replace (1.2.9) by
(1.2.11)
where we recognize in
!*(Si) := (Si. Yi) - !(Yi)
the value of the conjugate of ! at Si (Theorem X.1.4.1). Clearly enough, the linearization
errors can then be recovered via
(1.2.12)
Compared to (1.2.8), formulae (1.2.11) and (1.2.12) have a more elegant and more sym-
metric appearance (which hides their nice geometric interpretation, though). Also they explain
Proposition 1.2.8: Yi is eliminated from the notation. From a practical point of view, there are
arguments in favor of each form, we will keep the form (1.2.8). It is more suggestive, and it
will be more useful when we need to consider aggregation of bundle-elements. 0
Let us swn up this section. The methods considered in this chapter compute the
direction dk by using the bundle (1.2.9) exclusively. This information is definitely
poorer than the actual bundle (1.2.1), because the vectors Yj are "compressed" to the
scalars ej. Actually this bundle can be considered just as a bunch of ej-subgradients
at Xk, and the question is: are these approximate subgradients good for computing dk?
Here our answer will be: use them to approximate eJef(Xk) for some chosen 8, and
then proceed essentially as in Chap. XIII.
In this section, we show how to compute the direction on the basis of our development
in § 1. The current iterate Xk is given, as well as a bundle {Sj, ej} described by (1.2.8),
(1.2.9). Whenever possible, we will drop the iteration index k: in particular, the current
iterate Xk will be denoted by x. The bundle size will be denoted by l (which is normally
equal to k, at least if no aggregation has been done yet), and we recall that the unit
simplex of Rl is
is contained in ae f (x).
in the form
f(z)~f(x)+(Si,z-x}-ei fori=I, ... ,e
and obtain the results by convex combination. o
We will make constant use of the following notation: for a E Lle, we set
e e
s(a) := Laisi, e(a) := L aiei , (2.1.2)
i=l i=l
so that (2.1.1) can be written in the more compact form
The above result gives way to a strategy for computing the direction, merely
copying the procedure of Chap. XIII. There, we had a set of subgradients, call them
{sJ, ... , se}, obtained via suitable line-searches; their convex hull was a compact
convex polyhedron S included in aef (x). The origin was then projected onto S, so as
to obtain the best hyperplane separating S from {O}. Here, Proposition 2.1.1 indicates
that S can be replaced by the slightly more elaborate S(e) of(2.1.1): we still obtain an
inner approximation of aef (x) and, with this straightforward substitution, the above
strategy can be reproduced.
In a word, all we have to do is project the origin onto S(e): we choose e ~ 0 and
we compute the particular e-subgradient
mm I II Lj=1
. 2 l ajsj 112 a E lR.l , [min ! lis (a) 112] (i)
Lf=laj = 1, aj ~Ofori = 1, ... ,t, [a E L1l] (ii) (2.1.4)
Lf=1 ajej :::;; 8 , [e(a) :::;; 8] (iii)
For a given bundle, the above direction-finding problem differs from the previous one
- (XIII. 1.3.1 ), (IX. 1.8), or (VIII.2.1.6) - by the additional constraint (2.1.4)(iii) only.
The feasible set (2.1.4)(ii), (iii) is a compact convex polyhedron (possibly empty),
namely a portion of the unit simplex L1l, cut by a half-space. The set S(8) of(2.1.1)
is its image under the linear transformation that, to a E lR.l , associates sea) E lR.n by
(2.1.2). This confirms that S(8) too is a compact convex polyhedron (possibly empty).
Note also that the family {S(8)}e ~ 0 is nested and has a maximal element, namely
the convex hull of the available subgradients:
PROOF. Because all ej and aj are nonnegative, the constraints (2.1.4)(ii), (iii) are
consistent if and only if (2.1.6) holds, in which case the feasible domain is nonempty
and compact. The rest is classical. 0
In a way, (2.1.3) is "equivalent" to (2.1.4), and the optimal set ofthe latteris the polyhedron
Uniqueness of st: certainly does not imply that At: is a singleton, though. In fact, consider the
example with n = 2 and.e = 3:
s
It is easy to see that = S2 and that the solutions of (2. 1.4) are those a satisfying (2.1.4)(ii)
together with al = a3. Observe, correspondingly, that the set {e(a) : a E At;} is not a
singleton either, but the whole segment [0, 8/2].
°
Remark 2.1.3 Usually, minj ej = in (2.1.6), i.e. the bundle does contain some element of
8!(Xk) - namely S(Xk). Then the direction-finding problem has an optimal solution for any
8 ~o.
It is even often true that there is only one such minimal ej, Le.:
236 XlV. Dual Form of Bundle Methods
3io E {I, ... , l} such that ejo =0 and ej > 0 for i # io.
This extra property is due to the fact that, usually, the only way to obtain a subgradient at a
given x (here Xk) is to call the black box (UI) at this very x. Such is the case for example
when f is strictly convex (Proposition VI.6.1.3):
f(y) > f(x) + (s, y - x) for any y # x and s E af(x) ,
which implies in (1.2.6) that e(x, y, s) > 0 whenever y # x and S E af(y). On the other
hand, the present extra property does not hold for some special objective functions, typically
piecewise affine. 0
Our new direction-finding problem (2.1.4) generalizes those seen in the previous
chapters:
- When s is large, say s ~ maxi ei , there is no difference between S (s) and S = S (00).
The extra constraint (2.1.4)(iii) is inactive, we obtain nothing but the direction of
the s-descent Algorithm XIII.I.3 .1.
- On the other hand, suppose s lies at its minimal value of (2.1.6), say s = 0 in view
of Remark 2.1.3. Then any feasible a in (2.1.4)(ii), (iii) has to satisfy
We equip the a-space IRl with the standard dot-product. Then consider the function
a ~ 1/2I/s(a) 1/ 2 =: v(a) defined via (2.1.2). Observing that it is the composition
2 Computing the Direction 237
l
JRl '3 a ~ s = I>iSi E JRn ~ ~lIsll2 = v(a) E JR,
i=l
av
aa.(a) = (sj,s(a») forj=l, ... ,e. (2.2.1)
]
Then it is not difficult to write down the minimality conditions for the direction-
finding problem, based on the Lagrange function
(2.2.2)
Theorem 2.2.1 We use the notation (2.1.2); a is a solution of (2.1.4) if and only if:
it satisfies the constraints (2.1.4) (ii), (iii), and there is /L ~ 0 such that
PROOF. Our convex minimization problem falls within the framework of §XII.5.3(c).
All the constraints being affine, the weak Slater assumption holds and the Lagrangian
(2.2.2) must be minimized with respect to a in the first orthant, for suitable values of A
and /L. Then, use (2.2.1) to compute the relevant derivatives and obtain the minimality
conditions as in Example VII.1.1.6:
/L[e(a) - e] = 0,
and, for i = I, ... , e,
(Sj, s(a») + /Lej - A ~ 0,
aj[(sj, s(a») + /Lej - A] = o.
Add the last e equalities to see that A = IIs(a) 112 + /Le, and recognize the Karush-
Kuhn-Tucker conditions of Theorem VII.2.1.4; also, /L does not depend on a (Propo-
sition VII.3.l.1). 0
Thus, the right-hand side in (2.2.3) is the multiplier A of the equality constraint
in (2.1.4)(ii); it is an important number, and we will return to it later.
On the other hand, the extra constraint (2.1.4)(iii) is characteristic ofthe algorithm
we have in mind; therefore, its multiplier /L is even more important than A, and the rest
of this section is devoted to its study. To single it out, we use the special Lagrangian
238 XIv. Dual Form of Bundle Methods
It must be minimized with respect to ex E L\e, to obtain the closed convex dual function
Me = Argmin{B(IL) : IL;;:: O}
is nonempty: it is actually the set of those IL described by Theorem 2.2.1 for some
solution ex E Ae of (2.1.7). Then the minimality conditions have another expression:
Theorem 2.2.2 ~ use the notation (2.1.2); ex E L\e solves (2.1.4) if and only if, for
some IL ;;:: 0, it solves the minimization problem in (2.2.4) and satisfies
e(ex) ~ e with equality if IL is actually positive. (2.2.5)
Proposition 2.2.3 For given IL ;;:: 0, all optimal ex in (2.2.4) make up the same s(ex)
in (2.1.2). If IL > 0, they also make up the same e(ex).
PROOF. Let ex and fJ solve (2.2.4). From convexity, L/L(tex + (1 - t)fJ) is constantly
equal to B(JL) for all t E [0, 1], i.e.
Proposition 2.2.4 The optimal value v(e) := 1/2 lise 112 in (2.1.4) is a convexfunction
of e, and its subdifforential is the negative of Me:
IL E Me <===} -IL E av(e) .
2 Computing the Direction 239
When 8 diminishes, the multipliers can but increase, but they remain bounded:
Proposition 2.2.5 With the notation (2.1.1), suppose S(O) =I- 0 (i.e. mini ei = 0).
Then
K
JL ~ 2 - for all 8 > 0 and JL E Me ,
~
where
PROOF. If JL = 0, there is nothing to prove; if not, the optimal e(a) equals 8. Then
take 8 E ]0, ~[: there must be some j with aj > 0 and ej ~ ~ > 8 (the a's sum up to
1I). For this j, we can write
hence
JLVi. - 8) ~ IIsll2 - (Sj. s) ~ 2K .
knowing that Soo = Proj O/{Sh ... , sil. The function 8 ~ 1/2 lise 112 is minimal for
8 E [8. +oo[ and we have
Proposition 2.2.6 Suppose 8 is such that, for some a solving (2.1.4), there is a j
with aj > 0 and ej =I- 8. Then Me consists of the single element
240 XlV. Dual Form of Bundle Methods
- The situation described by this last result cannot hold when e = 0: all the active weights
are zero, then. Indeed consider the function v of Proposition 2.2.4; Theorem 1.4.2.1 tells us
that Mo = [jL, +oo[ , where jL = - D+ v(O) is the common limit for e -I- 0 of all numbers
J-L E Me. In view of Proposition 2.2.5, jL is a finite number.
- On the other hand, for e > 0, it is usually the case - at least if the bundle is rich enough - that
(2.1.4) produces at least two active subgradients, with corresponding weights bracketing e;
when this holds, Proposition 2.2.6 applies.
- Yet, this last property need not hold: as an illustration in one dimension, take the bundle
with two elements
SI = I, S2 = 2; el = I, e2 = 0; e = I.
Direct calculations show that the set of multipliers J-L in Theorem 2.2.1 is [0, 1].
Thus, the graph of the multifunction e ~ Me may contain vertical intervals other than
{OJ x Mo. On the other hand, we mention that it cannot contain horizontal intervals other than
[e, +oo[x{O}:
PROOF (sketch). A value of e smaller than e has positive multipliers J-L; from Proposi-
tion 2.2.3, the corresponding e(a) is uniquely determined. Said in other terms, the dual
function J-L 1-+ e(J-L) of(2.2.4) is differentiable for J-L > 0; its conjugate is therefore strictly
convex on ]0, e[ (Theorem X.4.1.3); but this conjugate is justthe function e 1-+ 1/2118e 112 (The-
orem XII.5.I.I); with Proposition 2.2.4, this means that the derivative -J-L of e 1-+ 1/2118e 112
is strictly increasing. 0
In summary, the function e 1-+ 1/2 lise 112 behaves as illustrated in Fig. 2.2.1 (which as-
sumes mini ei = 0), and has the following properties.
E
E
Fig. 2.2.1. Typical graph of the squared norm ofthe direction
It is of interest to return to basics and remember why (2.1.4) is relevant for computing
the direction.
- The reason for introducing (2.1.4) was to copy the development of Chap. XIII, with
S(e) of (2. 1.1) replacing the would-be S = Sk = S(oo).
as
- The reason for Chap. XIII was to copy Chap. IX, with f (x) replacing f (x). The a
two approximating polyhedra S were essentially the same, as generated by calls to
(Ul); only their meaning was different: one was contained in ad(x), the other in
af(x).
- Finally, Chap. IX was motivated by an implementation of the steepest-descent al-
gorithm of Chap. VIII: the non-computable af(x) was approximated by S, directly
available.
Now recall §VIlLI. 1: what was needed there was a hyperplane separating {O} and
af(x) "best", i.e. minimizing f'(x, .) on some unit ball; we wanted
and the theory of §VILl.2 told us that, due to positive homogeneity, this minimization
problem was in some sense equivalent to
knowing that we limit ourselves to the case III· III = III . 111* = II . II·
242 XIV: Dual Form of Bundle Methods
Remark 2.3.1 We recall from §VIII.l.2 what the above "in some sense equivalent" means.
First of all, these problems are different indeed if the minimal value in (2.3.4) is non-
negative, i.e. if the optimal sin (2.3.3) is zero. Then 0 E S(e), (2.1.4) produces no useful
direction.
Second, the equivalence in question neglects the normalization. In (2.3.4), the right-hand
side "1" of the constraint should be understood as any K > O. The solution-set stays collinear
with itself when K varies; remember Proposition VIII. 1. l.S.
Passing between the two problems involves the inverse multifunctions
and
(Rn , I I . 111*) 3 s 1---+ Argmax(s, d} c (Rn , I I . liD.
Idl:::; I
This double mapping simplifies when 111·111 is a Euclidean norm, say Illdlf = 1/2 (Qd, d). Then
s
the (unique) solutions J of (2.3.3) and of (2.3.4) are linked by J = _Q-I if we forgetthe s
normalization. In our present particular case of I I . I I = I I . I I * = II . II, this reduces to (2.1.5).
We leave it as an exercise to copy §2.2 with an arbitrary norm. o
The following relation comes directly from the equivalence between (2.3.3) and
(2.3.4).
PROOF. This comes directly from (VIII. 1.2.5), or can easily be checked via the mini-
malityconditions. By definition, any s E S(e) can be writtens = s{a) with a feasible
in (2.1.4)(ii), (iii); then (2.2.3) gives
Thus, the optimal value in the direction-finding problem (2.1.4) readily gives
an (ooder-)estimate of the corresponding e-directional derivative. In the ideal case
CJef(x) = S(e), we would have fi(x, -s) = -lIsIl2.
Proposition 2.1.1 brings directly the following inequality: with the notation
(2.1.5),
for all z E IRn. This is useful for the stopping criterion: when II S II is small, e can be
decreased, unless it is small as well, in which case f can be considered as satisfactorily
minimized. All this was seen already in Chap. XIII.
2 Computing the Direction 243
Remark 2.3.3 A relation with Remark VIII. 1.3 .7 can be established. Combining the bundle
elements, we obtain for any a E ill (not necessarily optimal) and Z E ]Rn
where we have set 8(a) := IIs(a)lI. If e(a) ~ e, then 8(a) ~ - IIsll by definition of sand
we can actually write (2.3.5). In other words, IIsll gives the most accurate inequality (2.3.6)
obtainable from the bundle, when e(a) is restricted not to exceed e. 0
(2.3.7)
If, in addition, all the extreme points of8f(x) appear among the l elements of
the bundle, then equality holds in (2.3.7).
PROOF. If ej = 0, Sj E 8f(x), so
Remark 2.3.5 In the general case, there is no guarantee whether -lIsll 2 -1Lf: is an over- or
under-estimate of f'(x, d): ifno subgradient at x is active in the direction-finding problem,
then (2.3.7) need not hold.
The assumptions (i), (ii) in Proposition 2.3.4 mean that of (x) n S(O) is nonempty and
contains an element which is active for s. The additional assumption means that of (x) c S (0);
244 XIv. Dual Form of Bundle Methods
and this implies equality of the two sets: indeed af(x) :::> S(O) because, as implied by
Proposition XI.4.2.2, a subgradient s at some Y with a positive weight e(x, y, s) cannot be
a subgradient at x. We emphasize: it is really S(O) which is involved in the above relations,
not S(e); for example, the property S(e) n af(x) =1= 0 does not suffice for equality in (2.3.7).
The following counter-example is instructive.
TakelR2 3 (~, 7/) ~ f(~, 7/) = ~2+7/2 and x = (1,0). Suppose that the bundle contains
justtwoelements,computedatYI = (2, -1)andY2 = (-1,2). Inotherwords,sl = (4, -2),
S2 = (-2,4). The reader can check that el = 2, e2 = 8. Finally take e = 4; then it can be
seen that
v f(x) = (2,0) = l(2s1 + S2) E S(e)
(but Vf(x) ¢ S(O) = 0!), and Proposition 2.3.4 does not apply. The idea of the example
is illustrated by Fig. 2.3.1: s is precisely Vf(x), so f'(x, s) = -lIsIl2; yet, p, =1= 0 (actually
p, = 2). Our estimate is therefore certainly not exact (indeed f'(x, s) = 4 < 12 = IIsIl 2+p,e).
o
----lIoo~......,,.------ ~
$1 (e1 =2)
Now we address the following question: is there a convex function having S(e) as ap-
proximate subdifferential? By construction, d = -is would be an e-descent direction
for it: we would be very happy if this function were close to f.
The answer lies in the following concept:
Definition 2.4.1 The cutting-plane function associated with the actual bundle is
In the second expression above, Yi is the point at which (U 1) has computed f (Yi )
and Si = S(Yi) E af(Yi); the first expression singles out the current iterate x = Xk.
The equivalence between these two fonns can readily be checked from the definition
(1.2.8) of the weights ei; see again Example VI.3.4. 0
2 Computing the Direction 245
the linearization of fat Yj with slope Sj E af(Yj); the subgradient inequality means
h ~ f for each i, so the cutting-plane function
i = max {h : i = I •...• l}
is a piecewise affine minorization of f (the terminology "supporting plane" would be
more appropriate than cutting plane: (2.4.2) defines a hyperplane supporting gr f).
Proposition 2.4.2 With i defined by (2.4.1) and S(e) by (2.1.1), there holds for all
e~O
aei(x) = S(e).
(2.4.3)
The above number is actually h(x - I/P,se),for each i such that equality holds in
(2.2.3).
PROOF. For 0 < IL E Me, divide (2.2.3) by IL and change signs to obtain for i =
1•...• l (remember that S (a), S and se denote the same thing):
-e - *"sI ~
2 - ej + (Sj. -*s) = h(x - *s) - f(x);
and equality in (2.2.3) results in an equality above. On the other hand, there is certainly
at least one i E {I •...• l} such that equality holds in (2.2.3). Then the result follows
from the definition of j. 0
246 XlV. Dual Form of Bundle Methods
where the last inequality comes directly by convex combination in (2.4.2) or (2.4.1). What
(2.4.3) says is that, if a solves (2.1.4), then fa (x -I/p., ss) = i(x -1/p.,Ss) for any multiplier
J-t > 0 (if there is one; incidentally, e(a) is then e). In other words, this "optimal" fa gives
one more support of epi f: it even gives a support of epi i at (x - I/p., s, i(x - 1/p.,S»).
- This was a primal interpretation; in the dual space, combine Propositions 2.3.2 and 2.4.2:
i;(x, -ss) = -liss 11 2 , and the value (2.4.3)
fY( X - /iSs
I A
= -e + Js;1 (x, -/iSs
) I A )
f(x+tcI)
f(x) (=0) k:--------+-----------
-e3
-£
-e2
-£ _ 11811 2/11 I------lo.-=
Such a hyperplane supports epi i if and only if s E 8s i(x) = S(e). When a point in this
hyperplane is moved so that its abscissa goes from x to x+h, its altitude varies by (s, h). Fora
normalized h, this variation is minimal (i.e. most negative) and equals -lis II if h = -s / lis II. In
words, -s is the acceleration vector of a drop ofwater rolling on this hyperplane, and subject to
gravity only. Now, the direction -s
of Fig. 2.4.1 defines, among all the hyperplanes supporting
2 Computing the Direction 247
gr j, the one with the smallest possible acceleration. This is the geometric counterpart of the
analytic interpretation of Remark 2.3.3.
The thick line in Fig. 2.4.1 is the intersection of the vertical hyperplane containing the
picture with the "optimal" hyperplane thus obtained, whose equation is
It represents an "aggregate" function ja, whose graph also supports gr f (and gr and h,
which summarizes, in a way conditioned bye, the linearizations h of (2.4.2). Using (2.4.3),
we see that
~ ~
ss
p(x - .is) = f(x) - e - .ill 112 = j(x - .is).
~
Thus, gr P touches epij at the point (x - !/~s, j(x - !/~s)); hence s E oj(x - !/~s).
Note in passing that the step size t = 1/JL thus plays a special role along the direction -s.
It will even become crucial in the next chapter, where these relations will be seen more
thoroughly.
Remark 2.4.4 Suppose JL > O. Each linearization h in the bundle coincides with f and
with j at the corresponding Yi. The aggregate linearization P coincides with j at x - ! / ~ s;
could ja coincide with f itself at some y?Note in Fig. 2.4.1 thaty = x -!/~s would be such
a coincidence point: swould be in of (x - !/~ s). In addition, for each i such that aj > 0 at
some solution of(2.1.4), Sj would also be in of (x -!/~s): this comes from the transportation
formula used in
In Fig. 2.4.1, lift the graph of P as high as possible subject to supporting epi f. Ana-
e,
lytically, this amounts to decreasing e in (2.4.4) to the minimal value, say preserving the
supporting property:
e= inf (e : f(x) - e + (s, z - x) ~ f(z) for all z E lin} = f(x) + f*(s) - (s, x).
I- l-
e = Lajei = f(x) + La;j*(si) - (s, x).
j=! i=!
l-
e- e= La;j*(sj) - f*(s) ~ O. (2.4.5)
j=!
Another definition ofe is that it is the smallest e such that s E oef(x) (transportation formula
again). The aggregate bundle element (s, e) thus appears as somehow corrupted, as compared
to the original elements (Sj, ej), which are "sharp" because Sj ¢ oef(x) if e < ej. Finally, the
gap (2.4.5) also explains our comment at the end of Remark 1.2.9: the use of f* is clumsy
for the aggregate element. 0
248 XIv. Dual Form of Bundle Methods
i
To conclude, remember Example X.3.4.2: is the smallest convex function com-
patible with the information contained in the bundle; ifg is a convex function satisfying
The previous sections have laid down the basic ideas for constructing the algorithms
we have in mind. To complete our description, there remains to specify the line-search,
and to examine some implementation details.
In this section, we are given the current iterate x = Xko and the current direction
d = dk = -se obtained from (2.1.4). We are also given the current 8 > 0 and the
current multiplier f.L ~ 0 of the extra constraint (2.1.4)(iii). Our aim is to describe the
line-search along d, which will produce a suitable Y+ = x + td and its corresponding
s+ E a!(y+) (note: we will often drop the subscript k and use "+" for the subscript
k+ 1).
The algorithms developed in the previous chapters have clearly revealed the double
role of the line-search, which must
(a) either find a descent-step, yielding a Y+ better than the current x, so that the next
iterate x+ can be set to this Y+,
(b) or find a null-step, with no better Y+, but with an s+ providing a useful enrichment
of the approximate subdifferential of ! at x.
On the other hand, we announced in § 1.2(b) that we want the line-search to follow
as much as possible the principles exposed in §II.3. Accordingly, we must test each
trial stepsize t, with three possible cases:
(0) this t is suitable and the line-search can be stopped;
(R) this t is not suitable and no suitable t should be searched on its right;
(L) this t is not suitable and no suitable t should be searched on its left.
Naturally, the (O)-clause is itself double, reflecting the alternatives (a) - (b): we
must define what suitable descent- and null-steps are, respectively. The knowledge of
3 The Implementable Algorithm 249
I' (x, d), and possibly of I' (y+, d), was required for this test in §1I.3 -where they were
called q' (0) and q' (t). Here, both numbers are unknown, but (s+, d) is a reasonable
estimate of the latter; and what we need is a reasonable estimate of the former. Thus,
we suppose that a negative number is on hand, playing the role of I' (x, d). We will
see later how it can be computed, based on §2.3. For the moment, it suffices to call it
v
formally < o.
A very first requirement for a descent-step is copied from (lI.3.2.1): a coefficient
m E ]0, 1[ is chosen and, if t > 0 does not satisfy the descent test
it is declared "too large"; then case (R) occurs. On the other hand, a "too small"
step size will be more conveniently defined after we see what a null-step is.
According to the general principles of Chap. XIII, a null-step is useful when I
can hardly be decreased along d because S (e) of (2.1.1) approximates at: I (x) poorly.
The role of the line-search is then to improve this approximation with a new element
(s+, e+) enriching the bundle (1.2.9). A new projection s+ will be computed, a new
line-search will be performed from the same x, and here comes a crucial point: we
must absolutely have
(3.1.2)
Indeed, if (3.1.2) does not hold, just take some a E ..1l solving (2.1.4), append
the (l + 1)S! multiplier a+ := 0 to it and realize that the old solution (a, f.1.) is again
an optimal primal-dual pair: the minimization algorithm enters a disastrous loop. In a
way, (3.1.2) plays the role of(XIII.2.1.1): it guarantees that the new element (s+, e+)
will really define an S+(e) richer than See); see also Remark IX.2.1.3.
- To keep the same spirit as in Chap. XIII, we ensure (3.1.2) by forcing first e+ to be
v,
small. In addition to another parameter B > 0 is therefore passed to the line-search
which, if no descent is obtained, must at least produce an s+ E as I (x) to enrich the
bundle. Admitting that y+ = x + td ands+ E a/(y+), this amounts to requiring
(3.1.3)
- For (3.1.2) to hold, not only e+ but also (s+, s) = -(s+, d) must be small. This
v
term represents a directional derivative, comparable to of(3.1.1); so we require
(3.1.4)
for some coefficient m'; for convergence reasons, m' is taken in ]m, 1[.
In summary, a null-iteration will be declared when (3.1.3) and (3.1.4) hold simul-
taneously: this is the second part of the (O)-clause, in which case t is accepted as a
"suitable null-step".
Remark 3.1.1 Thus, we require the null-step to satisfy two inequalities, although the single
(3.1.2) would suffice. In a way, this complicates the algorithm but our motivation is to keep
closer to the preceding chapters: (3.1.3) is fully in the line of Chap. XIII, while (3.1.4) connotes
Wolfe's criterion of (II.3.2.4).
250 XlV. Dual Form of Bundle Methods
Note that this strategy implies a careful choice of ii and e: we have to make sure that
[(3.1.4), (3.1.3)] does imply (3.1.2), i.e.
(3.1.5)
an inequality which we will have to keep in mind when choosing the tolerances. o
Finally, we come to the (L )-clause. From their very motivation, the concepts of
null-step and of descent-step are mutually exclusive; and it is normal to reflect this
exclusion in the criteria defining them. In view ofthis, we find it convenient to declare
t as "not too small" when (3.1.3) does not hold: then t is accepted as a descent-step
(if (3.1.1) holds as well), otherwise case (L) occurs. Naturally, observe that (3.1.3)
holds for any t close enough to O.
To sum up, the stopping criterion for the line-search will be as follows. The
v e
tolerances < 0, > 0 are given (satisfying (3.1.5), but this is unimportant for the
moment), as well as the coefficients 0 < m < m' < 1. Accept the stepsize t > 0 with
y+ = x + td, S+ E af(y+) and e+ = e(x, y+, s+) when:
[descent-step1 (3.1.6)
or
[null-step1 (3.1.7)
We are now in a position to design the line-search algorithm. It is based on the strategy
of §IL3 (and more particularly Fig. IL3.3.l), suitably adapted to take §3.l into account;
its general organization is:
(0) t is convenient when it satisfies (3.1.6) or (3.1.7), with the appropriate exit-case;
(R) t is called t R when (3.1.1) does not hold; subsequent trials will be smaller than
tR;
(L) t is called tL in all other cases; subsequent trials will be larger than tL.
Corresponding to these rules, the truth-value Table 3.2.1 specifies the decision
made in each possible combination (T for true, F for false; a star* means an impossible
case, ruled out by convexity). The line-search itself can be realized by the following
algorithm.
Algorithm 3.2.1 (Line-Search, Nonsmooth Case) The initial point x E ]Rn and
v e
step size t > 0, the direction d E ]Rn, the tolerances < 0, > 0, and coefficients
m E ]0,1[, m' E ]m, 1[ are given. SettL = 0, tR = O.
3 The Implementable Algorithm 251
STEP 1. Compute f(x + td) and s = sex + td); compute e = e(x, x + td, s).
STEP 2 (test for a null-step). If (3.1.3) and (3.1.4) hold, stop the line-search with a
null-step. Otherwise proceed to Step 3.
STEP 3 (test for large t). If (3.1.1) does not hold, set tR = t and go to Step 6. Other-
wise proceed to Step 4.
STEP 4 (test for a descent-step; t is not too large). If (3.1.3) does not hold, stop the
line-search with a descent-step. Otherwise set tL = t and proceed to Step 5.
STEP 5 (extrapolation). If tR = 0, find a new t by extrapolation beyond tL and loop
to Step 1. Otherwise proceed to Step 6.
STEP 6 (interpolation). Find a new t by interpolation in ]tL, tR[ and loop to Step 1.
D
Theorem 3.2.2 Let f : IRn -+ IR be a convex jUnction, and assume that the
safeguard-reduction Property 11.3 .1.3 holds. Then Algorithm 3.2.1 either generates a
sequence of stepsizes t -+ +00 with f(x + td) -+ -00, or terminates after afinite
number of cycles, with a null- or a descent-step.
PROOF. In what follows, we will use the notationsL :=s(x+tLd), SR := sex +tRd).
Arguing by contradiction, we suppose that the stop never occurs, neither in Step 2,
nor in Step 4. We observe that, at each cycle,
(3.2.1)
252 XIv. Dual Form of Bundle Methods
I
compute q(t), s(t), e(t)
~. ·8
no
[Extrapolations] Suppose first that Algorithm 3.2.1 loops indefinitely between Step 5
and Step 1: no interpolation is ever made. Then, by construction, every generated t is
v
a tL and tends to +00 by virtue of the safeguard-reduction property. Because < 0,
(3.2.1) shows that f(x + tLd) ~ -00.
[Interpolations] Thus, if f(x + td) is bounded from below, tR becomes positive
at some cycle. From then on, the algorithm loops between Step 6 and Step 1. By
construction, at each subsequent cycle,
(3.2.2)
the sequence {td is increasing, the sequence {tR} is decreasing, every tL is smaller
than every t R, and the safeguard-reduction property implies that these two sequences
have a common limit, say t ~ o.
By continuity, (3.2.1) and (3.2.2) imply
[Case t = 0] Compare (3.2.2) and (3.2.3) to see that tR > t; then pass to the limit in
i.e. (3.1.3) and (3.1.4) are satisfied by t R small enough; the stopping criterion of Step 2
must eventually be satisfied.
[Case t > 0] Then tL becomes positive after some cycle; write the subgradient
inequality
as
(SL, d) ~ f(x + tLd) - f(x)
tL
and pass to the limit: when the number of cycles goes to infinity, use (3.2.3) to see
that the right-hand side tends to mv m'v.
>
Thus, after some cycle, (3.1.4) is satisfied forever by s⁺ = s_L. On the other hand, each t_L satisfies (3.1.3) (otherwise, a descent-step is found); we therefore obtain again a contradiction: t_L should be accepted as a null-step. □
Remark 3.2.3 The present line-search differs from a "smooth" one by the introduction of the null-criterion, of course. With respect to Wolfe's line-search of §II.3.3, a second difference is the (L)-clause, which used to be "not (3.1.4)", and is replaced here by (3.1.3).
However, a slight modification in Algorithm 3.2.1 cancels this last difference. Keep the same definition for a null-step, but take Wolfe's definition of a descent-step; the test (O), (R), (L) becomes
(O) either f(y⁺) ≤ f(x) + m t v̄ and ⟨s⁺, d⟩ ≥ m'v̄ [descent-step],
    or e⁺ ≤ ε̄ and ⟨s⁺, d⟩ ≥ m'v̄ [null-step];
(R) f(y⁺) > f(x) + m t v̄;
(L) all other cases.
Furthermore give priority to the null-criterion; in other words, in the ambiguous case when both criteria of (O) hold, make a null-step.
The effect of this variant on the flow-chart 3.2.1 is to replace the box "e(t) > ε̄ ⇒ descent" by "q'(t) > m'v̄ ⇒ descent"; and the comparison with Fig. II.3.3.1 becomes quite eloquent. The only difference with (II.3.2.5) is now the insertion of the null-criterion in (O).
The proof of Theorem 3.2.2 can be reproduced (easy exercise). Besides, a close look at the logic reveals that (3.1.4) still holds in case of a descent-step. In other words, the present variant is formally identical to its original: it still produces descent-steps satisfying (3.1.6), and null-steps satisfying (3.1.7).
Although not earth-shaking, this observation illustrates an important point, already alluded to in §VIII.3.3: when designing algorithms for nonsmooth minimization, one should keep as close as possible to the "smooth philosophy" and extract its good features. In this sense, the algorithms of §XII.4 are somewhat suspect, as they depart too much from the smooth case. □
Based on the previous developments, the general organization of the overall minimization algorithm becomes clear: at each iteration, having chosen ε, we solve (2.1.4) to obtain a primal-dual solution (α, μ), and s from (2.1.5). If ‖s‖ is small, (2.3.6) provides a minimality estimate. Otherwise the line-search 3.2.1 is done along d = −s, to obtain either a descent-step as in §II.3, or a null-step obviating small moves. For a complete description of the algorithm, it remains to specify a few items: the line-search parameters v̄ and ε̄, the choice of ε (which plays such a central role), and the management of the bundle.
(a) Line-Search Parameters. The line-search needs the two parameters ε̄ and v̄ for (3.1.6), (3.1.7). The status of ε̄ is rather clear: first, it is measured in the same units as ε, i.e. as the objective function. Second, it should not be larger than ε: the purpose of a null-step is to find s⁺ ∈ ∂_ε f(x_k). We therefore set
    ε̄ = m̄ ε, (3.3.1)
for some fixed m̄ ∈ ]0, 1]. This coefficient allows some flexibility; there is no a priori reason to choose it very small nor close to 1.
On the other hand, we have two possible strategies concerning the v̄-question, respectively based on Chaps. II and IX.
- If we follow the ideas of §II.3.2, our aim is to decrease f by a fraction of its initial slope f'(x, d) - see (3.1.1). This f'(x, d) is unknown (and, incidentally, it may be nonnegative) but, from Proposition 2.3.4, it can conveniently be replaced by
    v̄ = −‖s‖² − με. (3.3.2)
If this expression overestimates the real f'(x, d), a descent-step will be easily found; otherwise the null-mechanism will play its role.
- If we decide to follow the strategy of a separation mechanism (§XIII.2.2 for example), the role of v̄ is different: it is aimed at forcing s⁺ away from S(ε) and should rather be σ_{S(ε)}(d), given by Proposition 2.3.2 - see (3.1.4):
    v̄ = −‖s‖². (3.3.3)
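In code, the two candidate values are one-liners; the following helper (names are ours, not the book's) is a direct transcription of (3.3.2) and (3.3.3).

    import numpy as np

    def v_bar(s, mu, eps, with_mu_eps=True):
        # (3.3.3) is minus the squared norm of the projection s;
        # (3.3.2) adds the extra term -mu*eps
        v = -float(np.dot(s, s))
        return v - mu * eps if with_mu_eps else v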
These choices are equally logical; we will keep both possibilities open (actually they make little difference, from theoretical as well as practical points of view).
(b) Stopping Criterion. The rationale for the stopping criterion is of course (2.3.5): when the projection s is close to 0, the current x minimizes f within the current ε, at least approximately. However, some complexity appears, due to the double role played by ε: it is not only useful to stop the algorithm, but it is also essential to compute the direction, via the constraint (2.1.4)(iii).
From this last point of view, ε must be reasonably large, subject to S(ε) being a reasonable approximation of the ε-subdifferential of f at the current x. As a result, the minimization algorithm cannot be stopped directly when s ≈ 0. One must still check that the (approximate) minimality condition is tight enough; if not, ε must be reduced and the projection recomputed. On the other hand, the partial minimality condition thus obtained is useful to safeguard from above the subsequent values of ε.
Accordingly, the general strategy will be as follows. First, a tolerance δ > 0 is chosen, which plays the same role as in Algorithms II.1.3.1 and XIII.1.3.1. It has the same units as a subgradient-norm and the event "s ≈ 0" is quantified by "‖s‖ ≤ δ".
As for ε, three values are maintained throughout the algorithm:
(i) The current value, denoted by ε = ε_k, is used in (2.1.4)(iii) to solve the direction-finding problem.
(ii) The value ε̲ > 0 is the final tolerance, or the lowest useful value for ε_k: the user wishes to obtain a final iterate x that is (approximately) ε̲-minimal.
(iii) The value ε̂ is a safeguard from above, coming from the best minimality estimate obtained so far. It is decreased to the current ε each time (2.1.4) produces s with ‖s‖ ≤ δ.
In practice, having the current ε̂ ≥ ε̲, the direction-finding problem is solved at each iteration with an ε chosen in ]ε̲, ε̂]. The successive values ε̂ form a decreasing sequence {ε̂_k}, minorized by ε̲, and ε̂_{k+1} is normally equal to ε̂_k. In fact, an iteration k + 1 with ε̂_{k+1} < ε̂_k is exceptional; during such an iteration, the direction-finding problem (2.1.4) has been solved several times, until a true direction −s, definitely nonzero, has been produced. The final tolerance ε̲ is fixed for the entire algorithm, which is stopped for good as soon as (2.1.4) is used with ε = ε̲ and produces ‖s‖ ≤ δ.
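The interplay of the three values can be summarized by the following schematic sketch; it is our reading of the mechanism, with solve_qp standing for the direction-finding problem (2.1.4), eps_final for ε̲, eps_up for ε̂, and all names ours.

    from numpy.linalg import norm

    def direction_with_eps_control(solve_qp, eps, eps_final, eps_up, delta):
        while True:
            s, mu = solve_qp(eps)             # (2.1.4) with the current eps
            if norm(s) > delta:               # a definitely nonzero direction
                return s, mu, eps, eps_up
            eps_up = eps                      # safeguard eps from above
            if eps <= eps_final:              # eps already at its final value:
                return None                   # ... stopped for good
            eps = max(eps / 10.0, eps_final)  # reduce eps and re-solve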
Knowing that ε̲ and ε̂ are managed in an automatic way, there remains to fix the current value ε of (i); we distinguish three cases.
- At the first iteration, the question is of minor importance in the sense that the first direction does not depend on ε (the bundle contains only one element, with e₁ = 0); it does play some role, however, via ε̄ of (3.3.1), in case the first direction −s₁ is not downhill. Choosing this initial ε is rather similar to choosing an initial stepsize.
(c) Management of the Bundle. At every "normal" iteration, a new element (s⁺, e⁺) is appended to the bundle. With y⁺ = x_k + t_k d_k and s⁺ ∈ ∂f(y⁺) found by the kth line-search, there are two cases for e⁺.
- If the line-search has produced a null-step, then x_{k+1} = x_k; hence e⁺ = e(x_k, y⁺, s⁺).
- In the case of a descent-step, x_{k+1} = y⁺ and e⁺ = e(x_{k+1}, y⁺, s⁺) = 0. Because the current iterate is moved, we must also update the old linearization errors according to (1.2.10): for each index i in the old bundle, e_i is changed to
    e_i + f(x_{k+1}) − f(x_k) − ⟨s_i, x_{k+1} − x_k⟩,
and this is just what is needed for (1.2.9) to hold. Incidentally, ẽ = ε if the multiplier μ is nonzero; otherwise we cannot even guarantee that ẽ is well-defined, since the optimal α may not be unique.
Thus, when necessary, at least two elements are destroyed from the current bundle, whose size is therefore decreased to a value ℓ' ≤ ℓ̂ − 2. This makes room to append the new element (s⁺, e⁺), together with the aggregate couple
    (s̃, ẽ) := (Σ_i α_i s_i, Σ_i α_i e_i). (3.3.5)
Remark 3.3.1 If all the elements destroyed have α = 0 in the quadratic problem (2.1.4), (s̃, ẽ) brings nothing to the definition of S(ε). In that case, no aggregation is necessary; only one element has to be destroyed. □
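A sketch of these bundle updates, under our own notational conventions (the bundle is a list of pairs (s_i, e_i), alpha holds the optimal multipliers of (2.1.4), and the helpers are hypothetical):

    import numpy as np

    def shift_errors(bundle, f_old, f_new, x_old, x_new):
        # after a descent-step, update the linearization errors as in (1.2.10)
        return [(s, e + f_new - f_old - float(s @ (x_new - x_old)))
                for (s, e) in bundle]

    def compress(bundle, alpha, max_size):
        # when the bundle is full, destroy elements and append the
        # aggregate couple of (3.3.5); zero-weight elements go first
        s_tilde = sum(a * s for a, (s, e) in zip(alpha, bundle))
        e_tilde = sum(a * e for a, (s, e) in zip(alpha, bundle))
        kept = [(s, e) for a, (s, e) in zip(alpha, bundle) if a > 0]
        return kept[:max_size - 2] + [(s_tilde, e_tilde)]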
Now a detailed description of our algorithm can be given. We start with a remark concerning the line-search parameters.
Remark 3.4.1 As already mentioned in Remark 3.1.1, ε̄ (i.e. m̄) and v̄ are not totally independent of each other: they should satisfy (3.1.5); otherwise (3.1.2), which is essential in the case of a null-step, may not hold.
If v̄ is given by (3.3.3), combine (3.1.4) with (3.3.1) to see that the case m' + m̄ > 1 is dangerous: the new subgradient may belong to the old S(ε); in the case of a null-step, s and μ will stay the same - a disaster. Conclusion: if the value (3.3.2) is used for v̄, it is just safe to take
    m' + m̄ ≤ 1. (3.4.1)
Notes 3.4.3
(1) We use this name because the algorithm has been conceived via a development in the dual space, to construct approximate subdifferentials. The next chapter will develop similar algorithms, based on primal arguments only.
(5) This is the same problem as the very first initialization of the stepsize (Remark II.3.4.2): use for ε an estimate of f(x₁) − f̄, where f̄ is the infimal value of f.
(6) We want to detect ε-optimality of the current iterate x. For this, it is a good idea to let δ vary with ε; this is the refinement (3.4.3).
(7) In case μ = 0, the value ẽ is ambiguous. In view of the gap mentioned in Remark 2.4.4, it is useful to preserve as much accuracy as possible; the aggregate weight could therefore be taken as small as possible, and this means the optimal value in the corresponding quadratic problem.
In case of a descent-step, t satisfies
    f(x_k + td) ≤ f(x) + m t v̄ and e(x_k, x_k + td, s⁺) > m̄ε
(meaning that t is neither too small nor too large), while, in the case of a null-step, t is small enough and s⁺ is useful.
(11) If a deletion is made after a null-step, if the subgradient at the current x (i.e. the element with e = 0) is deleted, and if ε is subsequently decreased at Step 2, then (2.1.4) may become inconsistent; see (4) above. To be on the safe side, this element should not be deleted, which means that the maximal bundle size ℓ̂ should actually be at least 3.
(12) Here lies a key of the algorithm: between ε̲ and ε̂, a possibly wide range of values are available for ε. Each of them will give a different direction, thus conditioning the efficiency of the algorithm (remember from Chap. II that a good direction is a key to an efficient algorithm). A rough idea of sensible values is known, since they have the same units as the objective f. This, however, is not enough: full efficiency wants more accurate values. In a way, the requirement (i) in the conclusion of §1.1 is not totally fulfilled. More will be said on this point in §4. □
We now turn to convergence questions. Naturally, the relevant tools are those of Chaps. IX or XIII: the issue will be to show that, although the ε's and the e's are moving, the situation eventually stabilizes and becomes essentially as in §IX.2.1.
In what follows, we call "iteration" a loop from Step 5 to Step 1. During one such iteration, the direction-finding problem may be solved several times, for decreasing values of ε. We will denote by ε_k, s_k, μ_k the values corresponding to the last such resolution, which are those actually used by the line-search. We will also assume that v̄ is given either by (3.3.2) or by (3.3.3). Our aim is to show that there are only finitely many iterations, i.e. the algorithm does stop in Step 2.
Lemma 3.4.4 Let f : ℝⁿ → ℝ be convex and assume that the line-search terminates at each iteration. Then, either f(x_k) → −∞, or the number of descent-steps is finite.
PROOF. The descent test (3.1.1) and the definition (3.3.2) or (3.3.3) of v̄ imply
    f(x_k) − f(x_{k+1}) ≥ m t_k ‖s_k‖² = m ‖s_k‖ ‖x_{k+1} − x_k‖ ≥ m δ ‖x_{k+1} − x_k‖ (3.4.4)
at each descent-iteration, i.e. each time x_k is changed (here δ > 0 is either fixed, or bounded away from 0 in case the refinement (3.4.3) is used). Suppose that {f(x_k)} is bounded from below.
We deduce first that {x_k} is bounded. Then the set ∪_k ∂_ε f(x_k) is bounded (Proposition XI.4.1.2), so {s_k} is also bounded, say by L.
Now, we have from the second half "not (3.1.3)" of the descent-test (3.1.6):
    ε̄_k = m̄ ε_k < f(x_k) − f(x_{k+1}) + ⟨s(x_{k+1}), x_{k+1} − x_k⟩ ≤ 2L ‖x_{k+1} − x_k‖.
Combining with (3.4.4), we see that f decreases at least by the fixed quantity m δ m̄ ε̲ / (2L) each time it is changed: this process must be finite. □
Lemma 3.4.5 The assumptions are those of Lemma 3.4.4; assume also m' + m̄ ≤ 1. Then, between two descent-steps, only finitely many consecutive null-steps can be made without reducing ε̂ in Step 2.
PROOF. Suppose that, starting from some iteration k₀, only null-steps are made. From then on, x_k and ε_k remain fixed at some x and ε; because ε̄_k ≤ ε, we have s_k ∈ S_{k+1}(ε) ⊂ ∂_ε f(x). The sequence {‖s_k‖²} is therefore decreasing; with s_{k+1} = Proj 0|S_{k+1}(ε), we can write
    ‖s_{k+1}‖² ≤ ⟨s_k, s_{k+1}⟩ ≤ ‖s_k‖ ‖s_{k+1}‖ ≤ ‖s_k‖²,
where we have used successively: the characterization of the projection s_{k+1}, the Cauchy-Schwarz inequality, and monotonicity of {‖s_k‖²}. This implies the following convergence properties: when k → +∞,
    ‖s_k‖² ↓ σ² and ⟨s_{k+1} − s_k, s_{k+1}⟩ → 0. (3.4.5)
Because of the stopping criterion, σ² ≥ δ² > 0; and this holds even if the refinement (3.4.3) is used.
Now, we obtain from the minimality condition (3.4.2) at the (k+1)st iteration:
    ⟨s_k, s_{k+1}⟩ ≤ −m' v̄_k ≤ m' (‖s_k‖² + μ_k ε),
the second inequality coming from the definition (3.3.2) or (3.3.3) of v̄ = v̄_k. So, combining with (3.4.6), we obtain a relation involving
    θ_k := m' ‖s_k‖² + ⟨s_{k+1} − s_k, s_{k+1}⟩ − ‖s_{k+1}‖².
In view of (3.4.5), and remembering that {s_{k+1}} ⊂ ∂_{m̄ε} f(x) is bounded, θ_k tends to (m' − 1)σ² < 0.
Thus, when k → +∞, 0 cannot be a cluster point of {θ_k}: at each k large enough, the relation shows that μ_k diminishes at least by a fixed quantity. Since μ_k ≥ 0, this cannot go on forever. □
With these two results, convergence of the overall algorithm is easy to establish; but some care must be exercised in the control of the upper bound ε̂: as mentioned in 3.4.3(8), Step 2 should be organized in such a way that infinitely many decreases would result in an ε̂ tending to 0. This explains the extra assumption made in the following result.
Theorem 3.4.6 Let f : ℝⁿ → ℝ be convex. Assume for example that ε̂ is divided by 10 at each loop from Step 2 to Step 1 of Algorithm 3.4.2; if v̄ is given by (3.3.2), assume also m' + m̄ ≤ 1. Then:
- either f(x_k) → −∞ for k → +∞,
- or the line-search detects at some iteration k that f is unbounded from below,
- or the stop occurs in Step 2 for some finite k, at which there holds the minimality estimate (2.3.6) with ε = ε̲ and ‖s_k‖ ≤ δ.
PROOF. Details are left to the reader. By virtue of Lemma 3.4.4, the sequence {x_k} stops. Then, either Lemma IX.2.1.4 or Lemma 3.4.5 ensures that the event "‖s_k‖ ≤ δ" occurs in Step 2 as many times as necessary to reduce ε̂ to its minimal value ε̲. □
A condition such as
    Σ_{k=1}^{+∞} ε_k ‖s_k‖ = +∞
ensures a sufficient decrease in f at each descent-iteration, which still allows the last argument in the proof of Lemma 3.4.4.
(iv) Finally observe that Lemma 3.4.4 uses very little of the definition of s. As far as descent-steps are concerned, other directions may also work well, provided that (3.1.6) holds (naturally, v̄ must then be given some suitable value, interpreted as a convergence parameter to be driven to 0). It is only when null-steps come into play that the bundling mechanism, i.e. (2.1.4), becomes important. Indeed, observe the similarity between the proofs of Lemma 3.4.4 and of Theorem II.3.3.6.
4 Numerical Illustrations
This section illustrates the numerical behaviour of the dual bundle Algorithm 3.4.2. In particular, we compare to each other various choices of the parameters appearing in the definition of the algorithm. Unless otherwise specified, the experiments below are generally conducted with the default values of the parameters, on a computer working with 6-digit accuracy; as for the stopping criterion, it uses the refinement (3.4.3) and generally ε̲ = 10⁻³(1 + |f̄|), i.e. 3-digit accuracy is required (f̄ being the minimal value of f).
First, we specify how ε can typically be chosen, in Step 5 of the algorithm; since it will enter into play at the next iteration, we call ε_{k+1} or ε⁺ this value (as before, the index k is dropped whenever possible). The following result is useful for this matter.
Proposition 4.1.1 Let the current line-search, made from x along d = −s, produce y⁺ = x + td, s⁺ ∈ ∂f(y⁺), and e⁺ = e(x, y⁺, s⁺) = f(x) − f(y⁺) + t⟨s⁺, d⟩. Then there holds
    −‖s‖² ≤ f'_ε(x, d) ≤ ⟨s⁺, d⟩ + (ε − e⁺)/t = (ε − Δ)/t, where Δ := f(x) − f(y⁺). (4.1.1)
PROOF. The first inequality results from Propositions 2.1.1 and 2.3.2. The second was given in Remark XI.2.3.3 (see Fig. 4.1.1 if necessary): in fact, −1/t ∈ ∂(−f'_(·)(x, d))(e⁺), where the last set denotes the subdifferential of the convex function ε ↦ −f'_ε(x, d) at ε = e⁺. □
See Fig. 4.1.2 for illustration; observe from the definitions that ε ↦ σ_{S(ε)}(d) is piecewise affine, and remember from Example XI.2.1.3 that the actual f'_ε behaves typically like ε^{1/2} for ε ↓ 0. This result can be exploited when a descent-step is made from x_k = x to x_{k+1} = y⁺. For example, the average
    ā_k := ½ (−‖s_k‖² + (ε_k − Δ_k)/t_k)
is a possible estimate of the true f'_{ε_k}(x, d); and then, we can for example use the simple rules making up Strategy 4.1.2.
[Fig. 4.1.2. Approximations of the approximate derivative; curves σ_{S(ε)} and f'_ε]
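In code, the estimate ā_k is a direct transcription of the average of the two outer bounds in (4.1.1) (names are ours):

    import numpy as np

    def a_bar(s, eps, delta_f, t):
        """s: the projection s_k; eps: the current tolerance;
        delta_f = f(x_k) - f(x_{k+1}); t: the accepted stepsize."""
        return 0.5 * (-float(np.dot(s, s)) + (eps - delta_f) / t)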
Our first test-problem is MAXQUAD, a maximum of m convex quadratic functions, where x ∈ ℝ¹⁰, m = 5, and each A_j is positive definite. The minimal value is known to be 0. The top part of Fig. 4.1.3 gives the decrease of f (in logarithmic scale) as a function of the number of calls to the black box (U1); the stopping criterion ε̲ = 10⁻⁴ was obtained after 143 such calculations, making up a total of 33 descent- and 26 null-iterations. Observe the evolution of f: it suggests long levels during which the ε-subdifferential is constructed, and which are separated by abrupt decreases, when it is suitably identified.
The lower part of Fig. 4.1.3 represents the corresponding evolution of ‖s_k‖². Needless to say, the abrupt increases happen mainly when the partial stopping criterion of Step 2 is obtained and ε̂ is decreased. Indeed, Strategy 4.1.2 is of little influence in this example, and most of the iterations have ε = ε̂. Table 4.1.1 displays the information concerning this partial stopping criterion, obtained 7 times during the run (the 7th time being the last).
[Fig. 4.1.3: MAXQUAD; f-values (log scale, top) and ‖s_k‖²-values (bottom) vs. the number of (U1)-calls]
The test-problem TR48 of IX.2.2.6 further illustrates how the algorithm can behave. With ε̲ = 50 (remembering that the optimal value is −638565, this again means 4-digit accuracy), ε̂ is never decreased; the stopping criterion ‖s‖² ≤ 10⁻¹³ is obtained after 86 descent- and 71 null-iterations, for a total of 286 calls to (U1). Figure 4.1.4 shows the evolution of f(x_k), again as a function of the number of calls to (U1); although slightly smoothed, the curve gives a fair account of reality; ε_k follows the same pattern. The behaviour of ‖s_k‖² is still as erratic as in Fig. 4.1.3; same remark for the multiplier μ_k of the constraint (2.1.4)(iii).
[Fig. 4.1.4: TR48; f-values, from −470000 down to −638524, vs. the number of (U1)-calls]
To give an idea of how important the choice of ε_k can be, run the dual bundle Algorithm 3.4.2 with the two examples above, using various strategies for ε_k. Table 4.2.1 records the number of iterations required to reach 3-digit accuracy. We recall from Remark II.3.2.1 that the most important column is the third.
The two strategies using f(x_k) − f̄ are of course rarely possible, as they need a knowledge of the minimal value. They are mentioned mainly to illustrate the range for sensible values of ε; and they suggest that these values should be rather optimistic. From the motivation of the method itself (to decrease f by ε at each iteration), one could think that ε_{k+1} = 0.1[f(x_k) − f̄] should be more realistic; but this is empirically contradicted.
The last strategy in Table 4.2.1 is based on a (too!) simple observation: if everything were harmonious, we would have ε ≈ f(x) − f(x⁺) ≈ −t v̄, a reasonable value for ε⁺. The strategy ε⁺ = f(x) − f(x⁺) is its logical predecessor.
Remark 4.2.1 (ε-Instability) Observe the relative inefficiency of these last two strategies; in fact, they illustrate an interesting difficulty. Suppose that, for some reason, ε is unduly small at some iteration. The resulting direction is bad (remember steepest descent!), so the algorithm makes little progress; as a result, ε⁺ is going to be even smaller, thus amplifying the bad choice of ε. This is exactly what happens with TR48: in the last two experiments in Table 4.2.1, ε_k reaches its lower level ε̲ fairly soon, and stays there forever.
The interesting point is that this difficulty is hard to avoid: to choose ε⁺, the only available information is the observed behaviour of the algorithm during the previous iteration(s). The problem is then to decide whether this behaviour was "normal" or not; if not, the reason is still to be determined: was it because ε was too large, or too small? We claim that here lies a key of all kinds of bundle methods. □
These two series of results do not allow clear conclusions about which implementable strategies seem good. We therefore supplement the analysis with the test-problem TSP442, given in IX.2.2.7. The results are now those of Table 4.2.2; they indicate the importance of ε much more clearly. Observe, naturally, the high proportion of null-steps when ε is constant (i.e. too large); a contrario, the last two strategies have ε too small (cf. Remark 4.2.1), and have therefore a small proportion of null-steps. This table assesses the standard strategy 4.1.2, which had a mediocre behaviour on MAXQUAD.
Another series of experiments tests the initial value for ε, which has a direct influence on the standard Strategy 4.1.2. The idea is to correct the value of ε₁, so as to reduce hazard. The results are illustrated by Table 4.2.3, in which the initialization is: ε₂ = K[f(x₂) − f(x₁)] for K = 0.1, 1, 5, 10; in the last line, ε₄ = f(x₄) − f(x₁) - we admit that ε₁, ε₂ and ε₃ have little importance. From now on, we will call "standard" the ε-Strategy 4.1.2 supplemented with this last initialization.
A star in the table means that the required 3-digit accuracy has not been reached: because of discrepancies between f- and s-values (due to roundoff errors), the line-search has failed before the stopping criterion could be met.
Finally, we recall from §VIII.3.3 that smooth but stiff problems also provide interesting applications. Precisely, the line "nonsmooth" in Table VIII.3.3.2 was obtained by the present algorithm, with the ε-strategy −t_k v̄_k. When the standard strategy is used, the performances are comparable, as recorded in Table 4.2.4 (to be read just as Table VIII.3.3.2; note that π = 0 corresponds to Table 4.2.3).
Among the possible ε-strategies, a simple one is ε_k ≡ +∞, i.e. very large. The resulting direction is then (opposite to) the projection of the origin onto the convex hull of all the subgradients in the bundle - insofar as ε̄ is also large. Naturally, ε̄ = m̄ε is then a clumsy value: unduly many null-steps will be generated. In this case, it is rather ε̄ that must be chosen via a suitable strategy.
We have applied this idea to the three test-problems of Table 4.2.3 with ε̄ = m̄[f(x_k) − f̄] (in view of Tables 4.2.1 and 4.2.2, this choice is a sensible one, if available). The results, displayed in Table 4.3.1, show that the strategy is not too good. Observe for example the very high proportion of null-steps with TSP442: they indicate that the directions are generally bad. We also mention that, when ε̄ is computed according to the standard strategy, the instability 4.2.1 appears, and the method fails to reach the required 3-digit accuracy (except for MAXQUAD, in which the standard strategy itself generates large ε's).
These results show that large values for ε are inconvenient. Despite the indications of Tables 4.2.1 and 4.2.2, one should not be "overly optimistic" when choosing ε: in fact, since d is computed in view of S(ε), this latter set should not be "too much larger" than the actual ∂_ε f(x). Figure 4.3.1 confirms this fact: it displays the evolution of f(x_k) for TR48 in the experiment above, with convergence pushed to 4-digit accuracy. Vertical dashed lines indicate iterations at which the partial stopping criterion is satisfied; then ε̂ is reduced and forces ε_k down to a smaller value; after a while, this value becomes again large in terms of f(x_k) − f̄, the effect of ε̂ vanishes and the performance degrades again. We should mention that this reduction mechanism is triggered by the δ-strategy of (3.4.3); without it, the convergence would be much slower.
Remark 4.3.1 We see that choosing sensible values for ε is a real headache:
- If ε_k is too small, d_k becomes close to the disastrous steepest-descent direction. In the worst case, the algorithm has to stop because of roundoff errors, namely when d_k is not numerically downhill, but nevertheless (3.1.3) cannot be fulfilled in practice, ε̄ being negligible compared to f-values. This is the cause of *stars in Tables 4.2.3 and 4.3.1.
- If ε_k is too large, S(ε_k) is irrelevant and d_k is not good either.
- Of these two extremes, the second is probably the less dangerous, but Remark 4.2.1 tells us that the first is precisely the harder to avoid. Furthermore, Fig. 4.3.1 suggests that an appropriate δ-strategy is then compulsory, although hard to adjust.
From our experience, we even add that, when the current ε is judged inconvenient, no available information can help to decide whether it is too small or too large. Furthermore, a decision made at a given iteration does not yield immediate effects: it takes a few iterations for the algorithm to recover from the old bad choice of ε. All these reasons make it advisable to bias the strategy (4.1.2 or any other) towards large ε-values. □
The interest of the present variant is mainly historical: it establishes a link with conjugate gradients of §II.2.4; and for this reason, it used to be called the conjugate subgradient method. In fact, suppose that f is convex and quadratic, and that the line-searches are exact: at each iteration, f(x_k + td_k) is minimized with respect to t, and no null-step is ever accepted. Then the theory of §II.2.4 can easily be reproduced to realize that:
- the gradients are mutually orthogonal: ⟨s_i, s_j⟩ = 0 for all i ≠ j;
- the direction is actually the projection of the origin onto the affine hull of the gradients;
- this direction is just the same as in the conjugate-gradient method;
- these properties hold for any ℓ̂ ≥ 2.
Remember in particular Remark II.2.4.6. For this interpretation to hold, the assumptions "f convex quadratic and line-searches exact" are essential. From the experience reported above, we consider this interpretation as too thin to justify the present variant in a non-quadratic context.
[...] the algorithm may become highly inefficient: the directions may become close to steepest descent, and ε̄ may also become small, making it hard to find null-steps when needed. Furthermore, this phenomenon is hard to avoid, remember Remark 4.2.1.
In Table 4.2.3, for example, this is exactly what happens on the first line with TR48, in which ε_k is generally too small simply because it is initialized on a small value. Actually, the phenomenon also occurs on each "bad" line, even when ε could be thought of as too large. Take for example the run TSP442 with K = 5: Fig. 4.4.1 shows, in logarithmic scale and relative values, the simultaneous evolution of f(x_k) − f̄ and ε_k (remember from Tables 4.2.1 and 4.2.2 that, in a harmonious run, the two curves should evolve together). It blatantly suffers the instability phenomenon of Remark 4.2.1: in this run, ε₁ is too large; at the beginning, inefficient null-steps are taken; then 4.1.2 comes into play and ε is reduced down to ε̲; and this reduction goes much faster than the concomitant reduction of f. Yet, the ε-strategy systematically takes ε_{k+1} in [0.9ε_k, 4ε_k] - i.e. it is definitely biased towards large ε-values.
Fig. 4.4.1. TSP442 with a large initial ε [log scale: ε-values and f-values vs. the number of (U1)-calls]
(b) The Tail of the Algorithm. Theorem 3.4.6 says that ε_k eventually stabilizes at the threshold ε̲. During this last phase, the algorithm becomes essentially that of §XIII.3.1, realizing the separation process of §IX.3.3. Then, of course, f decreases little (but descent-steps do continue to appear, due to the value m̄ = 0.1) and the relevant convergence is rather that of {‖s_k‖²} to 0.
For a more precise study of this last phase, we have used six more TSP-problems of the type IX.2.2.7, with respectively 14, 29, 100, 120, 614 and 1173 (dual) variables. For each such problem, Table 4.4.1 singles out the iteration k₀ where ε̂ reaches its limit ε̲ (which is here 10⁻⁴|f̄|), and stays there till the final iteration k_f, where the convergence criterion ‖s‖² ≤ δ² = 10⁻⁷ is met. All these experiments have been performed with ℓ̂ = 400, on a 15-digit computer; the strategy ε_k = f(x_k) − f̄ has been used, in order for k₀ to be well-defined. The last column in Table 4.4.1 gives the number of active subgradients in the final direction-finding problem. It supposedly gives a rough idea of the dimensionality of S(ε̲), hence of the ε̲-subdifferential at the final iterate.
Remark 4.4.1 The speed of convergence is illustrated by Fig. 4.4.2, which shows the evolution of ‖s‖² during the 1911 iterations forming the last phase in TSP442. Despite the theoretical predictions of §IX.2.2(b), it does suggest that, in practice, {s_k} converges to 0 at a rate definitely better than sublinear (we believe that the rate - linear - is measured by the first part of the curve, the finitely many possible s(x) explaining the steeper second part, starting roughly at iteration 1500). This observation supplements those in §IX.2.2(c). □
Fig. 4.4.2. Practical convergence of the separation process; TSP442 [‖s‖² in log scale vs. iterations]
To measure the influence of ε̲ on each of the two phases individually, Table 4.4.2 reports on the same experiments, made with TSP442, but for varying ε̲. Full accuracy was required, in the sense that δ was set to 10⁻²⁷ (in view of roundoff errors, this appeared to be a smallest allowable value, below which the method got lost in the line-search or in the quadratic program). The column Δf represents the final relative error [f(x_f) − f̄]/|f̄|; observe that it is of order m̄ε̲.
The positive correlation between ε̲ and the number of active subgradients is easy to understand: when a set decreases, its dimensionality can but decrease. We have no explanation for the small value of k_f − k₀ in the line "10⁻²".
Table 4.5.1. Influence of the maximal bundle size ℓ̂:
    ℓ̂            2         5         10        100
    iter (#U1)   101(184)  117(194)  106(186)  137(203)
The same experiment, conducted with TSP1173, is reported in Table 4.5.2. The results are paradoxical, to say the least; they suggest once more that theoretical predictions in numerical analysis are suspect, as long as they are not checked experimentally.
Remark 4.5.1 One more point concerning ℓ̂: as already mentioned on several occasions, most of the computing time is spent in (U1). If ℓ̂ becomes really large, however, solving the quadratic problem may become expensive. It is interesting to note that, in most of our examples (with a large number of variables, say beyond 10²), the operation that becomes most expensive for growing ℓ̂ is not the quadratic problem itself, but rather the computation of the scalar products ⟨s⁺, s_i⟩, i = 1, ..., ℓ. All this indicates the need for a careful choice of ℓ̂-values, and of compression strategies. □
The parameter m̄ is potentially important, via its direct control of null-steps. Comparative experiments are reported in Table 4.5.3, which just reads as Table 4.2.3 - in which m̄ was 0.1. Note from (3.4.1) that the largest value allowed is m̄ = 0.8.
According to these results, m̄ may not be crucial for performances, but it does play a role. The main reason is that, when m̄ is small, e⁺ can be much smaller than ε; then the approximation (4.1.1) becomes hazardous and the standard ε-strategy may present difficulties. This, for example, causes the failure in the line "0.01" of TR48 via the instability 4.2.1.
Finally, we study the role of v̄, which can take either of the values (3.3.2) or (3.3.3). To illustrate their difference, we have recorded at each iteration of the run "TSP442 + standard algorithm" the ratio
    (‖s‖² + με) / ‖s‖².
Figure 4.5.1 gives the histogram: it shows for example that, for 85% of the iterations, the value (3.3.2) was not larger than 3 times the value (3.3.3). The ratio reached a maximal value of 8, but it was larger than 5 in only 5% of the iterations. The statistics involved 316 iterations and, to avoid a bias, we did not count the tail of the algorithm, in which μ was constantly 0.
Fig. 4.5.1. Comparing two possible values of v̄ [histogram: frequency (%) vs. the ratio]
Thus, the difference between (3.3.2) and (3.3.3) can be absorbed by a proper redefinition of the constants m and m'. We mention that experiments conducted with (3.3.3) give, with respect to the standard strategy (3.3.2), differences which may be non-negligible, but which are certainly not significant: they never exceed 20 calls to (U1), for any required accuracy and any test-problem. This also means that the role of m and m' is minor (a fact already observed with classical algorithms for smooth functions).
A relevant question is now: have we fulfilled the requirements expressed in the conclusion of §1.1? Judging from our experiments, several observations can be made.
(i) The algorithms of this chapter do enter the general framework of Chap. II. A comparison of the line-searches (Figs. II.3.3.1 and 3.2.1) makes this point quite clear.

XV. Primal Forms of Bundle Methods
Introduction. In Chap. XII, we sketched two numerical algorithms for convex minimization: subgradients and cutting planes. Apparently, they have nothing to do with each other; one has a dual motivation, in the sense that it uses a subgradient (a dual object) as a direction of motion from the current iterate; the second is definitely primal: the objective function is replaced by an approximation which is minimized to yield the next iterate.
Here we study more particularly the cutting-plane method, for which we propose a number of accelerated versions. We show that these versions are primal adaptations of the dual bundle methods of Chap. XIV. They define a sort of continuum having two endpoints: the algorithms of subgradients and of cutting planes; a link is thus established between these two methods.
    f : ℝⁿ → ℝ is convex,
and we want to minimize f. As always when dealing with numerical algorithms, we assume the existence of a black box (U1) which, given x ∈ ℝⁿ, computes f(x) together with some subgradient s(x) ∈ ∂f(x).
We have seen in §XII.4.2 that a possible algorithm for minimizing f is the cutting-plane algorithm, and we briefly recall how it works. At iteration k, suppose that the iterates y₁, ..., y_k have been generated and the corresponding (bundle of) information f(y₁), ..., f(y_k), s₁ = s(y₁), ..., s_k = s(y_k) has been collected. The cutting-plane approximation of f, associated with the sampling points y₁, ..., y_k, is the piecewise affine function of Definition XIV.2.4.1:
    ℝⁿ ∋ y ↦ f̂_k(y) := max { f(y_i) + ⟨s_i, y − y_i⟩ : i = 1, ..., k }. (1.0.1)
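Evaluating f̂_k from the bundle is immediate; the following one-dimensional fragment (hypothetical names) also reproduces the instability of Fig. 1.1.1 below on f(x) = ½x²:

    def f_hat(y, points, values, subgrads):
        # the cutting-plane function (1.0.1) built from the bundle
        return max(f_i + s_i * (y - y_i)
                   for y_i, f_i, s_i in zip(points, values, subgrads))

    # instability on f(x) = x**2/2 (see Fig. 1.1.1): with y1 = 1, y2 = -0.01,
    # the minimizer of f_hat jumps to (1 - 0.01)/2 = 0.495, far from 0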
The next iterate y_{k+1} then minimizes f̂_k over a given compact convex set C: this problem - (1.0.2) - does have a solution, which is thus taken as the next iterate. We recall from Theorem XII.4.2.3 the main convergence properties of this algorithm: denoting by f̄_C the minimal value of f over C, we have
    f̂_k(y_{k+1}) → f̄_C and f(y_k) → f̄_C when k → +∞.
To make sure that our original problem is really solved, C should contain at least one minimum point of f; so finding a convenient C is not totally trivial. Furthermore, it is widely admitted that the numerical performance of the cutting-plane algorithm is intolerably low. Both questions are addressed in the present chapter.
Consider the simple example illustrated by Fig. 1.1.1: with n = 1, take f(x) = ½x² and start with two iterates y₁ = 1, y₂ = −ε < 0. Then y₃ is obviously the solution of ℓ₁(y) = ℓ₂(y), where ℓ_i is the affine piece generated at y_i; i.e. y₃ = (1 − ε)/2, far from the minimum 0 although y₂ was already close to it.
Remark 1.1.1 We mention a curious consequence of this phenomenon. Forgetting the artificial set C, consider the linear program expressing (1.0.2):
    inf_{y,r} { r : f(y_i) + ⟨s_i, y − y_i⟩ ≤ r for i = 1, ..., k }.
Taking k Lagrange multipliers α₁, ..., α_k, its dual is (see §XII.3.3 if necessary; Δ_k ⊂ ℝ^k is the unit simplex):
Fig. 1.1.1. Instability of cutting planes
    sup { Σ_{i=1}^k α_i [f(y_i) − ⟨s_i, y_i⟩] : α ∈ Δ_k, Σ_{i=1}^k α_i s_i = 0 }.
Now assume that f is quadratic; then the gradient mapping s = ∇f is affine. It follows that, if α is an optimal solution of the dual problem above, we have:
    ∇f(Σ_{i=1}^k α_i y_i) = Σ_{i=1}^k α_i ∇f(y_i) = Σ_{i=1}^k α_i s_i = 0;
in other words, the convex combination Σ_{i=1}^k α_i y_i of the past iterates minimizes f.
In our example of Fig. 1.1.1, the instability is not too serious; but it can become disastrous in less naive situations. Indeed the next example cooks a black box (U1) for which reducing the initial gap f(y₁) − f̄_C by a factor ε < 1 requires some (1/ε)^{(n−2)/2} iterations: with 20 variables, a billion iterations are needed to obtain just one digit of accuracy!
Example 1.1.2 We use an extra variable η, which plays a role for the first two iterations only. Given ε ∈ ]0, 1/2[, we want to minimize the function
    ℝⁿ × ℝ ∋ (y, η) ↦ f(y, η) := max { |η|, ‖y‖ − 1 + 2ε }
on the unit ball of ℝⁿ × ℝ: C in (1.0.2) is therefore taken as this unit ball. The optimal value is obviously 0, obtained for η = 0 and y anywhere in the ball B(0, 1 − 2ε) ⊂ ℝⁿ. Starting the cutting-plane algorithm at (0, 1) ∈ ℝⁿ × ℝ, which gives the first objective-value 1, the question is: how many iterations will be necessary to obtain an objective-value of at most ε?
The first subgradient is (0, 1) and the second iterate is (0, −1), at which the objective-value is again 1 and the second subgradient is (0, −1). The next cutting-plane problem is then
    min { max{η, −η} : (y, η) ∈ C }. (1.1.1)
Its minimal value is 0, obtained at all points of the form (y, 0) with y describing the unit ball B of ℝⁿ. The third iterate is thus some (y₃, 0) with ‖y₃‖ = 1, giving the objective-value 2ε and the new constraint 2ε + ⟨y₃, y − y₃⟩ ≤ r in the cutting-plane problem.
Look at Fig. 1.1.2(a): the minimal value is still 0 and the effect of the above third constraint is to cut from the second optimal set (B itself) the portion defined by ⟨y₃, y⟩ > 1 − 2ε.
More generally, as long as the kth optimal set B_k ⊂ B contains a vector of norm 1, the (k+1)st iterate will have an objective-value of 2ε, and will cut from B_k a similar portion obtained by rotation. As a result, no ε-optimal solution can be produced before all the vectors of norm 1 are eliminated by these successive cuts. For this, k must be so large that k − 2 times the area S(ε) of the cap
    { y : ‖y‖ = 1, ⟨v, y⟩ > 1 − 2ε }
is at least equal to the area S_n of the boundary of B; see Fig. 1.1.2(a), where the thick line represents S(ε), and v has norm 1 and stands for y₃, y₄, ....
It is known that the area of the boundary of B(0, r) in ℝⁿ is r^{n−1} S_n. The area of the infinitesimal ring displayed in Fig. 1.1.2(b), at distance r from the origin, is therefore
    S_{n−1} r^{n−2} (1 − r²)^{−1/2} dr = S_{n−1} sin^{n−2}θ dθ; (1.1.2)
hence, with θ_ε denoting the half-angle of the cap,
    S(ε) = S_{n−1} ∫₀^{θ_ε} sin^{n−2}θ dθ ≤ S_{n−1} ∫₀^{θ_ε} θ^{n−2} dθ = S_{n−1} θ_ε^{n−1}/(n − 1).
Using (1.1.2) again to bound S_n from below, the required number of iterations S_n/S(ε) is found to be at least 2/θ_ε^n. Knowing that θ_ε ≈ 2√ε, we see the disaster. □
It is interesting to observe that the instability demonstrated above has the same origin as the need to introduce C in (1.0.2): the minima of f̂_k are hardly controlled, in extreme cases they are "at infinity". The artificial set C had primarily a theoretical motivation: to give a meaning to the problem of minimizing f̂_k; but this appears to have a pragmatic supplement: to prevent a wild behaviour of {y_k}. Accordingly, we can even say that C (which after all is not so artificial) should be "small", which also implies that it should be "appropriately located" to catch a minimum of f. Actually, one should rather have a varying set C_k, with diameter shrinking to 0 as the cutting-plane iterations proceed. Such was not the case in our counter-example above: the "stability set" C remained equal to the entire B(0, 1) for an enormous number of iterations. This seemingly innocent remark can be considered as a starting point for the methods in this chapter.
Our stability issue existed already in Chap. II: in §II.2.1, we wanted to minimize a certain directional derivative ⟨s_k, ·⟩, which played the role of f̂_k - both functions are identical for k = 1. Due to positive homogeneity, a normalization had to be introduced; the situation was the same for the steepest-descent problem of §VIII.1.
We choose to follow the same idea here, and we describe abstractly the kth iteration of a stabilized algorithm as follows:
(i) we have a model, call it φ_k, supposed to represent f;
(ii) we choose a stability center, call it x_k;
(iii) we choose a norming, call it |||·|||_k;
(iv) then we compute a next iterate y_{k+1} realizing a compromise between diminishing the model:
    φ_k(y_{k+1}) < φ_k(x_k),
and keeping close to the stability center: |||y_{k+1} − x_k|||_k "small".
Let us now review the list of items (i) - (iv) of the beginning of this Section 1.2.
(i) The model φ_k, i.e. the cutting-plane function f̂_k of (1.0.1), can be enriched by one affine piece for example every time a new iterate y_{k+1} is obtained from the model-problem in (iv). Around this general philosophy, several realizations are possible. Modelling is actually just the same as the bundling concept of §XIV.1.2; in particular, an important ingredient will be the aggregation technique already seen on several occasions.
(ii) The stability center should approximate a minimum point of f, as reasonably as possible; using objective-values to define the word "reasonably", the idea will be to take for x_k one of the sampling points y₁, ..., y_k having the best f-value. Actually, a recursive strategy will be used: at each iteration, the stability center is either left as it is, or moved to the new iterate y_{k+1}, depending on how good f(y_{k+1}) is.
(iii) Norming is quite an issue; our study will be limited to multiples of a fixed norm. Said otherwise, our attention will be focused on the κ or the μ considered above, and |||·|||_k will be kept fixed (normally to the Euclidean norm).
(iv) Section 2 will be devoted to the stabilized problem; three conceptually equivalent possibilities will be given, which can be viewed as primal interpretations of the dual bundle method of Chap. XIV.
In fact, the whole stabilizing idea is primarily characterized by (ii), which relies on the following crucial technique. In addition to the next iterate y_{k+1}, the stabilized problem yields a "nominal decrease" for f; this is a nonnegative number δ_k, giving an idea of the gain f(x_k) − f(y_{k+1}) to be expected from the current stability center x_k to the next iterate y_{k+1}; see Example 1.2.3 below. Then x_k is set to y_{k+1} if the actual gain is at least a fraction of the "ideal" gain δ_k. With emphasis put on the management of the stability center, the resulting algorithm (Algorithm 1.2.2) looks like the following.
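In pseudo-Python, one possible reading of this generic scheme is the following sketch (under our own naming: solve_stabilized is a stand-in for the stabilized problem of item (iv), and m ∈ ]0,1[ is the fraction just mentioned):

    def stabilized_scheme(f, subgrad, solve_stabilized, x, tol, m=0.1, max_iter=500):
        fx = f(x)
        bundle = [(x, fx, subgrad(x))]              # information defining the model
        for _ in range(max_iter):
            y, delta = solve_stabilized(bundle, x)  # next iterate + nominal decrease
            if delta <= tol:
                return x
            fy = f(y)
            bundle.append((y, fy, subgrad(y)))      # enrich the model (item (i))
            if fx - fy >= m * delta:                # actual gain vs. "ideal" gain
                x, fx = y, fy                       # descent-step: move the center
            # otherwise: null-step, the center stays
        return x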
Example 1.2.3 (Line-Search) At the given x_k, take the first-order approximation
    φ_k(y) := f(x_k) + ⟨s_k, y − x_k⟩ (1.2.2)
as model and choose a symmetric positive definite operator Q to define the norming.
(a) Steepest Descent. The stabilized problem is
    min { φ_k(y) : ⟨Q(y − x_k), y − x_k⟩ ≤ κ² } (1.2.3)
for some radius κ > 0; stabilization is thus forced via an explicit constraint. From the minimality conditions of Theorem VII.2.1.4, there is a multiplier μ ≥ 0 such that
    s_k + μ Q(y_{k+1} − x_k) = 0, i.e. y_{k+1} = x_k − (1/μ) Q⁻¹ s_k, with μ = ⟨s_k, Q⁻¹ s_k⟩^{1/2}/κ (1.2.4)
(to obtain the last equality, take the scalar product of (1.2.4) with κ Q⁻¹ s_k). We interpret these relations as follows: between the stability center and a solution of the stabilized problem, the model decreases by
    δ_k := φ_k(x_k) − φ_k(y_{k+1}) = ⟨s_k, x_k − y_{k+1}⟩ = κ ⟨s_k, Q⁻¹ s_k⟩^{1/2}.
Knowing that φ_k(x_k) = f(x_k) and φ_k ≤ f, this δ_k can be viewed as a "nominal decrease" for f. Then Algorithm 1.2.2 corresponds to the following strategy:
(i) If δ_k is small, stop the algorithm, which is justified to the extent that ‖s_k‖ is small. This in turn is the case if κ is not too small, and the maximal eigenvalue of Q is not too large: then the original problem, or its first-order approximation, is not too affected by the stabilization constraint.
(ii) If δ_k is far from zero, the next iterate y_{k+1} = x_k − (1/μ_k) Q⁻¹ s_k is well defined; setting d_k := −Q⁻¹ s_k (a direction) and t_k := 1/μ_k (a stepsize), (1.2.1) can be written
    f(x_k + t_k d_k) ≤ f(x_k) − m δ_k = f(x_k) + m t_k ⟨s_k, d_k⟩.
We recognize the descent test (II.3.2.1), universally used for classical line-searches. This test is thus encountered once more, just as in §XIV.3.1. With Remark II.3.2.3 in mind, observe the interpretative role of δ_k in terms of the initial derivative of a line-search function.
(b) Steepest Descent, Second Order. With the same φ_k of (1.2.2), take the stabilized problem as
    min { φ_k(y) + ½ μ ⟨Q(y − x_k), y − x_k⟩ : y ∈ ℝⁿ } (1.2.5)
for some penalty coefficient μ > 0. The situation is quite comparable to that in (a) above; a slight difference is that the multiplier μ is now explicitly given, as well as the solution x_k − (1/μ) Q⁻¹ s_k.
A more subtle difference concerns the choice of δ_k: the rationale for (a) was to approximate f by φ_k, at least on some restricted region around x_k. Here we are bound to consider that f is approximated by the minimand in (1.2.5); otherwise, why solve (1.2.5) at all? It is therefore more logical to set
    δ_k := f(x_k) − φ_k(y_{k+1}) − ½ μ ⟨Q(y_{k+1} − x_k), y_{k+1} − x_k⟩ = (1/(2μ)) ⟨s_k, Q⁻¹ s_k⟩.
With respect to (a), the nominal decrease of f is divided by two; once again, remember Remark II.3.2.3, and also Fig. II.3.2.2. □
Definition 1.2.4 We will say that, when the stability center is changed in Algorithm 1.2.2, we make a descent-step; otherwise we make a null-step. The set of descent iterations is denoted by K ⊂ ℕ:
    K := { k : x_{k+1} = y_{k+1} }.
The above terminology suggests a link between the bundle methods of the preceding chapters and the line-searches of Chap. II; at the same time it points out the main difference: in case of a null-step, bundle methods enrich the piecewise affine model φ_k, while line-searches act on the norming (shortening κ, increasing μ) and do not change the model.
The ambiguity between cases (a) and (b) in Example 1.2.3 suggests a certain inconsistency in the line-search principle. Let us come back to the situation of Chap. II, with a smooth objective function f, and let us consider a Newton-type approach: we have on hand a symmetric positive definite operator Q representing, possibly equal to, the Hessian ∇²f(x_k). Then the model (quadratic but not positively homogeneous)
    φ_k(y) := f(x_k) + ⟨s_k, y − x_k⟩ + ½ ⟨Q(y − x_k), y − x_k⟩ (1.3.1)
is minimized over the ball B(x_k, κ): this is the stabilized problem (1.3.2).
Algorithm 1.3.1 (Goldstein and Price Curved-Search) The data are: the current iterate x_k, the model φ_k, the descent coefficients m ∈ ]0,1[ and m' ∈ ]m,1[. Set κ_L = 0 and κ_R = 0; take an initial κ > 0.
STEP 0. Solve (1.3.2) to obtain y(κ) and set δ := f(x_k) − φ_k(y(κ)).
STEP 1 (test for large κ). If f(y(κ)) > f(x_k) − mδ, set κ_R = κ and go to Step 4.
STEP 2 (κ is not too large). If f(y(κ)) ≥ f(x_k) − m'δ, stop the curved-search with x_{k+1} = y(κ). Otherwise set κ_L = κ and go to Step 3.
STEP 3 (extrapolation). If κ_R > 0 go to Step 4. Otherwise find a new κ by extrapolation beyond κ_L and loop to Step 0.
STEP 4 (interpolation). Find a new κ by interpolation in ]κ_L, κ_R[ and loop to Step 0. □
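A sketch of this curved-search, with doubling and bisection as placeholder extrapolation/interpolation rules (solve_tr(kappa) is a hypothetical stand-in for problem (1.3.2)):

    def curved_search(f, phi, solve_tr, x, kappa, m=0.1, m_prime=0.7):
        fx = f(x)
        kL, kR = 0.0, 0.0
        while True:
            y = solve_tr(kappa)                   # Step 0: y(kappa)
            delta = fx - phi(y)
            if f(y) > fx - m * delta:             # Step 1: kappa too large
                kR = kappa
            elif f(y) >= fx - m_prime * delta:    # Step 2: accept y(kappa)
                return y
            else:
                kL = kappa                        # Step 3
                if kR == 0.0:
                    kappa = 2.0 * kL              # extrapolation
                    continue
            kappa = 0.5 * (kL + kR)               # Step 4: interpolation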
If φ_k does provide a relevant Newtonian point ȳ, the initial κ should be chosen large enough to produce it. Afterwards, each new κ could be computed by polynomial interpolation, as in §II.3.4; this implies a parametric study of (1.3.2), so as to get the derivative of the curved-search function κ ↦ f(y(κ)). Note also that the same differential information would be needed if we wanted to implement a "Wolfe curved-search".
Remark 1.3.2 This trust-region technique was initially motivated by non-convexity. Suppose for example that Q in (1.3.1) is indefinite, as might well happen with Q = ∇²f(x_k) if f is not convex. Then the Newtonian point ȳ = x_k − Q⁻¹ s_k becomes suspect (although it is still of interest for solving the equation ∇f(x) = 0); it may not even exist if Q is degenerate. Furthermore, a line-search along the associated direction d_k = ȳ − x_k may be disastrous because f'(x_k, d_k) = ⟨s_k, d_k⟩ = −⟨Q d_k, d_k⟩ need not be negative.
Indeed, Newton's method with line-search has little relevance in a non-convex situation. By contrast, the solution y(κ) of (1.3.2) makes a lot of sense in terms of minimizing f:
- It always exists, since the trust-region is compact. Much more general models φ_k could even be handled; the only issue would be the actual computation of y(κ).
- To the extent that φ_k really approximates f to second order near x_k, (1.3.2) is consistent with the original problem.
- If Q happens to be positive definite, the Newtonian point is recovered, provided that κ is chosen large enough.
- If, for some reason, the curved-search produces small values of κ, the move from x_k to y(κ) is made roughly along the steepest-descent direction; and this is good: the steepest-descent direction, precisely, is steepest for small moves. □
Admitting that a model such as φ_k is trusted around x_k only, the trust-region technique is still relevant even when φ_k is nicely convex. In this case, the stabilized problem can also be formulated as
    min { φ_k(y) + ½ μ ‖y − x_k‖² : y ∈ ℝⁿ }
instead of (1.3.2) (Proposition VII.3.1.4). The parameter μ will represent an alternative curvilinear coordinate, giving the explicit solution x_k − (Q + μI)⁻¹ s_k if the model is (1.3.1).
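With the quadratic model (1.3.1), this penalized form has the explicit solution just quoted; in code (a two-line illustration, names ours):

    import numpy as np

    def penalized_newton_step(Q, s, x, mu):
        # solution x_k - (Q + mu I)^{-1} s_k of the penalized problem
        return x - np.linalg.solve(Q + mu * np.eye(len(x)), s)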
We give in this section several algorithms realizing the general scheme of §1.2. They use conceptually equivalent stabilized problems in Step 2 of Algorithm 1.2.2. Three of them are formulated in the primal space ℝⁿ; via an interpretation in the dual space, they are also conceptually equivalent to the bundle method of Chap. XIV.
The notations are those of §1: f̂_k is the cutting-plane function of (1.0.1); the Euclidean norm ‖·‖ is assumed for the stabilization (even though our development is only descriptive, and could accommodate more general situations).
2.1 The Trust-Region Point of View
The first idea that comes to mind is to force the next iterate to be a priori in a ball associated with the given norming, centered at the given stability center, and having a given radius κ. The sequence of iterates is thus defined by the problem (2.1.1) below.
This approach has the same rationale as Example 1.2.3(a). The original model f̂_k is considered as a good approximation of f in B(x_k, κ), a trust-region drawn around the stability center. Accordingly, f̂_k is minimized in this trust-region, any point outside it being disregarded. The resulting algorithm in its crudest form is then as follows.
Algorithm 2.1.1 (Cutting Planes with Trust Region) The initial point x₁ is given, together with a stopping tolerance δ̲ ≥ 0. Choose a trust-region radius κ > 0 and a descent coefficient m ∈ ]0,1[. Initialize the descent-set K = ∅, the iteration-counter k = 1 and y₁ = x₁; compute f(y₁) and s₁ = s(y₁).
STEP 1. Define the model f̂_k of (1.0.1) and compute a solution y_{k+1} of
    min { f̂_k(y) : y ∈ B(x_k, κ) }, (2.1.1)
and set δ_k := f(x_k) − f̂_k(y_{k+1}) ≥ 0.
STEP 2. If δ_k ≤ δ̲ stop.
STEP 3. Compute f(y_{k+1}) and s_{k+1} = s(y_{k+1}).
STEP 4. If f(y_{k+1}) ≤ f(x_k) − m δ_k, set x_{k+1} = y_{k+1} and append k to the set K (descent-step). Otherwise set x_{k+1} = x_k (null-step).
Replace k by k + 1 and loop to Step 1. □
Here K represents the set of descent iterations, see Definition 1.2.4 again. Its role is purely notational and will appear when we study convergence. The algorithm could well be described without any reference to K. Concerning the nominal decrease δ_k, it is useful to understand that f(x_k) = f̂_k(x_k) by construction. When a series of null-steps is taken, the cutting-plane algorithm is applied within the stability set C := B(x_k, κ), which is changed at every descent-step. Such a change happens possibly long before f is minimized on C, and this is crucial for efficiency: minimizing f accurately over C is a pure waste of time if C is far from a minimum of f.
Remark 2.1.2 An efficient solution scheme for the stabilized problem (2.1.1) can be developed, based on duality: for μ > 0, the Lagrange function
    L(y, μ) := f̂_k(y) + ½ μ (‖y − x_k‖² − κ²)
has a unique minimizer y(μ), which can be computed exactly. It suffices to find μ > 0 solving the equation ‖y(μ) − x_k‖ = κ, if there is one; otherwise f̂_k has an unconstrained minimum in B(x_k, κ). Equivalently, the concave function μ ↦ L(y(μ), μ) must be maximized over μ ≥ 0. □
Proposition 2.1.3 Denote by κ∞ ≥ 0 the distance from x_k to the minimum-set of f̂_k (with the convention κ∞ = +∞ if Argmin f̂_k = ∅). For 0 ≤ κ ≤ κ∞, (2.1.1) has a unique solution, which lies at distance κ from x_k. For κ ≥ κ∞, the solution-set of (2.1.1) is Argmin f̂_k ∩ B(x_k, κ).
PROOF. The second statement is trivial (when applicable, i.e. when κ∞ < +∞). Now take η ≥ 0 such that the sublevel-set
    S := { y : f̂_k(y) ≤ η }
is nonempty, and let x* be the projection of x_k onto S. If κ ≤ ‖x* − x_k‖ ≤ κ∞, any solution y of (2.1.1) must be at a distance κ from x_k: otherwise y, lying in the interior of B(x_k, κ), would minimize f̂_k locally, hence globally, and the property ‖y − x_k‖ < κ ≤ ‖x* − x_k‖ would contradict the definition of x*. The solution-set of (2.1.1) is therefore a convex set on the surface of a Euclidean ball: it is a singleton. □
Now we turn to a brief study of convergence, which uses typical arguments. First of all, {f(x_k)} is a decreasing sequence, which has a limit in ℝ ∪ {−∞}. If {f(x_k)} is unbounded from below, f has no minimum and {x_k} is a "minimizing" sequence. The only interesting case is therefore
    f* := lim_k f(x_k) > −∞. (2.1.2)
Lemma 2.1.4 With the notation (2.1.2), and m denoting the descent coefficient in Step 4 of Algorithm 2.1.1, there holds Σ_{k∈K} δ_k ≤ [f(x₁) − f*]/m.
PROOF. Let k' be the successor of k in K. Because the stability center is not changed after a null-step, f(x_{k+1}) = ... = f(x_{k'}) and we have
    δ_{k'} ≤ [f(x_{k'}) − f(x_{k'+1})]/m = [f(x_{k+1}) − f(x_{k'+1})]/m.
The recurrence is thus established: sum over K to obtain the result. □
This shows that, if no null-step were made, the method would converge rather fast: {δ_k} would certainly tend to 0 faster than {k⁻¹} (to be compared with the speed k^{−2/n} of Example 1.1.2). In particular, the stopping criterion would act after finitely many iterations. Note that, when the method stops, we have by construction
    f(y) ≥ f̂_k(y) ≥ f̂_k(y_{k+1}) = f(x_k) − δ_k ≥ f(x_k) − δ̲ for all y ∈ B(x_k, κ).
Using convexity, an approximate optimality condition is derived for x_k; it is this idea that is exploited in Case 2 of the proof below.
Theorem 2.1.5 Let Algorithm 2.1.1 be used with fixed κ > 0, m ∈ ]0,1[ and δ̲ = 0. Then {x_k} is a minimizing sequence.
PROOF. [Case 1: K is infinite] Suppose for contradiction that there is x̄ with f(x̄) < lim_k f(x_k); for k large enough, f(x_k) − f(x̄) ≥ η for some η > 0, and B(x_k, κ) cannot contain x̄. Then consider the construction of Fig. 2.1.1: z_k is between x_k and x̄, at a distance κ from x_k, and we set r_k := ‖x̄ − z_k‖. We have from convexity,
    f(z_k) ≤ f(x_k) + [κ/(κ + r_k)][f(x̄) − f(x_k)].
In other words, since δ_k ≥ f(x_k) − f̂_k(z_k) ≥ f(x_k) − f(z_k),
    δ_k ≥ [κ/(κ + r_k)][f(x_k) − f(x̄)] ≥ [κ/(κ + r_k)] η.
Write this inequality for each k (large enough) in K and sum up. By construction, ‖x_{k+1} − x_k‖ ≤ κ and the triangle inequality implies that r_{k+1} ≤ r_k + κ for all k ∈ K: the left-hand side forms a divergent series; but this is impossible in view of Lemma 2.1.4.
[Case 2: K is finite] In this second case, only null-steps are taken after some iteration k*:
    f(y_{k+1}) > f(x_{k*}) − m δ_k for all k ≥ k*. (2.1.3)
From then on, Algorithm 2.1.1 reduces to the ordinary cutting-plane algorithm applied on the (fixed) compact set C = B(x_{k*}, κ). It is convergent: when k → +∞, Theorem XII.4.2.3 tells us that {f̂_k(y_{k+1})} and {f(y_{k+1})} tend to the minimal value of f on B(x_{k*}, κ), call it f̄*. Passing to the limit in (2.1.3),
    (1 − m) f̄* ≥ (1 − m) f(x_{k*}).
Because m < 1, we conclude that x_{k*} minimizes f on B(x_{k*}, κ), hence on the whole space. □
Note in the above proof that null-steps and descent-steps call for totally different arguments. In one case, similar to §XII.4.2 or §XIV.3.4, the key is the accumulation of the information into the successive models f̂_k. The second case uses the definite decrease of f at each descent iteration, and this connotes the situation in §II.3.3. Another remark, based upon this proof, is that the stop does occur at some finite k if δ̲ > 0: either because of Lemma 2.1.4, or because f̂_k(y_{k+1}) ↑ f(x_{k*}) in Case 2.
The trust-region technique of §2.1 was rather abrupt: the next iterate was controlled by a mere switch, "on" inside the trust-region, "off" outside it. Something more flexible is obtained if the distance from the stability center acts as a weight.
Here we choose a coefficient μ > 0 (the strength of a spring) and our model is
    f̂_k(y) + ½ μ ‖y − x_k‖²;
needless to say, pure cutting planes would be obtained with μ = 0. This strategy is in the spirit of Example 1.2.3(b): it is the model itself that is given a stabilized form; its unconstrained minimization will furnish the next iterate, and its decrease will give the nominal decrease for f.
Remark 2.2.1 This stabilizing term has been known for a long time in the framework of least-squares calculations. Suppose we have to minimize the function
\[ \mathbb R^n \ni x \ \mapsto\ f(x) := \tfrac12 \sum_{j=1}^m f_j^2(x)\,, \]
where each f_j is smooth. According to the general principles of §II.2, a second-order model of f is desirable.
Observe that (we assume the dot-product for simplicity)
\[ \nabla^2 f(x) = Q(x) + \sum_{j=1}^m f_j(x)\,\nabla^2 f_j(x) \quad\text{with}\quad Q(x) := \sum_{j=1}^m \nabla f_j(x)\,[\nabla f_j(x)]^\top ; \]
Q(x) is thus a reasonable approximation of ∇²f(x), at least if the f_j(x)'s or the ∇²f_j(x)'s are small. The so-called method of Gauss-Newton exploits this idea, taking as next iterate a minimum of the corresponding model
\[ x \ \mapsto\ f(x_k) + \langle \nabla f(x_k),\, x - x_k\rangle + \tfrac12\langle Q(x_k)(x - x_k),\, x - x_k\rangle\,. \]
An advantage is that only first-order derivatives are required; furthermore Q(x_k) is automatically positive semi-definite.
However, trouble will appear when Q(x_k) is singular or ill-conditioned. Adding to it a multiple of the identity matrix, say μI, is then a good idea: this is just our present stabilization by penalty. In a Gauss-Newton framework, μ is traditionally called the coefficient of Levenberg and Marquardt. □
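As a concrete illustration (our own sketch, not from the text: the residuals, Jacobian and data below are invented), one damped Gauss-Newton step solves (Q(x_k) + μI)d = −∇f(x_k), with μ in the Levenberg-Marquardt role:

```python
# One damped Gauss-Newton (Levenberg-Marquardt) step for
# f(x) = (1/2) sum_j f_j(x)^2; residuals and Jacobian are toy choices.
import numpy as np

def residuals(x):                      # f_j(x), j = 1..3, n = 2
    return np.array([x[0] - 1.0, x[1] + 2.0, x[0] * x[1]])

def jacobian(x):                       # rows are the gradients of the f_j
    return np.array([[1.0, 0.0], [0.0, 1.0], [x[1], x[0]]])

def lm_step(x, mu):
    r, J = residuals(x), jacobian(x)
    Q = J.T @ J                        # Gauss-Newton Hessian approximation
    g = J.T @ r                        # gradient of f
    # mu > 0 keeps Q + mu I well conditioned even when Q is singular
    return x + np.linalg.solve(Q + mu * np.eye(len(x)), -g)

x = np.array([3.0, 3.0])
for _ in range(50):
    x = lm_step(x, mu=0.1)
print(x)                               # approaches a stationary point of f
```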
Just as in the previous section, we give the resulting algorithm in its crudest form:
Algorithm 2.2.2 (Cutting Planes with Stabilization by Penalty) The initial point x₁ is given, together with a stopping tolerance $\underline\delta$ ≥ 0. Choose a spring-strength μ > 0 and a descent-coefficient m ∈ ]0, 1[. Initialize the descent-set K = ∅, the iteration counter k = 1, and y₁ = x₁; compute f(y₁) and s₁ = s(y₁).
STEP 1. With $\check f_k$ denoting the cutting-plane function (1.0.1), compute the solution y_{k+1} of
\[ \min_y\ \check f_k(y) + \tfrac12\,\mu\,\|y - x_k\|^2 \tag{2.2.1} \]
and set
\[ \delta_k := f(x_k) - \check f_k(y_{k+1}) - \tfrac12\,\mu\,\|y_{k+1} - x_k\|^2 \ \ge\ 0\,. \]
STEP 2. If δ_k ≤ $\underline\delta$ stop.
STEP 3. Compute f(y_{k+1}) and s_{k+1} = s(y_{k+1}). If
\[ f(y_{k+1}) \ \le\ f(x_k) - m\,\delta_k\,, \]
set x_{k+1} = y_{k+1} and append k to the set K (descent-step). Otherwise set x_{k+1} = x_k (null-step).
Replace k by k + 1 and loop to Step 1. □
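To make the pattern concrete, here is a minimal one-dimensional sketch of Algorithm 2.2.2 (our own illustration, not the book's): the function, the tolerance values and the grid-based resolution of the master problem (2.2.1) are all ours, chosen only to keep the sketch dependency-free; a genuine implementation would call a QP solver on the form (2.2.2).

```python
# Minimal 1-D sketch of Algorithm 2.2.2 on f(x) = |x|.  The master problem
# (2.2.1) is solved by brute force on a grid, only to keep the sketch
# self-contained.
import numpy as np

f = abs
s = lambda x: 1.0 if x >= 0 else -1.0        # a subgradient of |.|
mu, m, tol = 1.0, 0.1, 1e-6                  # spring strength, descent coeff
x = 5.0                                      # stability center x_1
bundle = [(x, f(x), s(x))]                   # triples (y_i, f(y_i), s_i)
grid = np.linspace(-10.0, 10.0, 20001)

for k in range(50):
    # cutting-plane model  check_f(y) = max_i [f(y_i) + s_i (y - y_i)]
    check_f = np.max([fy + si * (grid - y) for (y, fy, si) in bundle], axis=0)
    stab = check_f + 0.5 * mu * (grid - x) ** 2      # objective of (2.2.1)
    j = int(np.argmin(stab))
    y_next, delta = float(grid[j]), f(x) - stab[j]   # nominal decrease
    if delta <= tol:
        break                                        # Step 2
    bundle.append((y_next, f(y_next), s(y_next)))
    if f(y_next) <= f(x) - m * delta:                # Step 3: descent test
        x = y_next                                   # descent-step
    # otherwise: null-step, the center x is kept
print(x)                                             # close to the minimizer 0
```

The loop exhibits the two branches of Step 3: descent-steps move the stability center, while null-steps only enrich the bundle.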
With respect to the trust-region variant, a first obvious difference is that the stabilized problem is easier: it is the problem with quadratic objective and affine constraints
\[ \begin{cases} \min\ r + \tfrac12\,\mu\,\|y - x_k\|^2\,, & (y, r) \in \mathbb R^n \times \mathbb R\,,\\[2pt] f(y_i) + \langle s_i,\, y - y_i\rangle \le r & \text{for } i = 1, \dots, k\,. \end{cases} \tag{2.2.2} \]
A second difference is the value of δ_k: here the nominal decrease for f is more logically taken as the decrease of its piecewise quadratic model; this is comparable to the opposition (a) vs. (b) in Example 1.2.3. We will see later that the difference is actually of little significance: Algorithm 2.2.2 would be almost the same with δ_k = f(x_k) − $\check f_k$(y_{k+1}).
In fact, neglecting any δ_k-detail, the two variants are conceptually equivalent in the sense that they produce the same (k + 1)st iterate, provided that κ and μ are properly chosen:
Proposition 2.2.3
(i) For any κ > 0, there is μ ≥ 0 such that any solution of the trust-region problem (2.1.1) also solves the penalized problem (2.2.1).
(ii) Conversely, for any μ ≥ 0, there is κ ≥ 0 such that any solution of (2.2.1) (assumed to exist) also solves (2.1.1).
PROOF. [(i)] When κ > 0, the results of Chap. VII can be applied to the convex minimization problem (2.1.1): Slater's assumption is satisfied, the set of multipliers is nonempty, and there is a μ ≥ 0 such that the solutions of (2.1.1) can be obtained by unconstrained minimization of the Lagrangian
\[ y \ \mapsto\ \check f_k(y) + \tfrac12\,\mu\,\big(\|y - x_k\|^2 - \kappa^2\big) \]
(Proposition VII.3.1.4).
[(ii)] Suppose that (2.2.1) has a solution y_{k+1}. For any y ∈ B(x_k, ‖y_{k+1} − x_k‖),
\[ \check f_k(y) + \tfrac12\mu\|y_{k+1} - x_k\|^2 \ \ge\ \check f_k(y) + \tfrac12\mu\|y - x_k\|^2 \ \ge\ \check f_k(y_{k+1}) + \tfrac12\mu\|y_{k+1} - x_k\|^2\,. \]
Cancelling the common term, $\check f_k$(y) ≥ $\check f_k$(y_{k+1}): thus y_{k+1} minimizes $\check f_k$ on the ball, i.e. solves (2.1.1) with κ := ‖y_{k+1} − x_k‖. □
Section 3 will be entirely devoted to this second variant, including a study of its
convergence.
The point made in the previous section is that the second term ½μ‖y − x_k‖² in (2.2.1) can be interpreted as the dualization of a certain constraint ‖y − x_k‖ ≤ κ, whose right-hand side κ = κ(μ) becomes a function of its multiplier μ. Likewise, we can interpret the first term $\check f_k$(y) as the dualization of a constraint $\check f_k$(y) ≤ ℓ, whose right-hand side ℓ = ℓ(μ) will be a function of its multiplier 1/μ.
In other words, a third possible stabilized problem is
\[ \min_y\ \tfrac12\,\|y - x_k\|^2 \quad\text{subject to}\quad \check f_k(y) \le \ell \tag{2.3.1} \]
for some level ℓ. We leave it to the reader to adapt Proposition 2.2.3, thereby studying the equivalence of (2.3.1) with the previous variants. A difficulty has now appeared, though: what if (2.3.1) has an empty feasible set, i.e. if ℓ < inf $\check f_k$? By contrast, the parameter κ or μ of the previous sections could be given arbitrary positive values throughout the iterations, even if a fixed κ led to possible inefficiencies. The present variant thus needs some more sophistication.
On the other hand, the level used in (2.3.1) suggests an obvious nominal decrease for f: it is natural to set x_{k+1} = y_{k+1} if this results in a definite objective-decrease from f(x_k) towards ℓ. Then the resulting algorithm has the following form, in which we explicitly let the level depend on the iteration.
If the descent test
\[ f(y_{k+1}) \ \le\ f(x_k) - m\,\big[f(x_k) - \ell_k\big] \tag{2.3.2} \]
holds, set x_{k+1} = y_{k+1} and append k to the set K (descent-step). Otherwise set x_{k+1} = x_k (null-step).
Replace k by k + 1 and loop to Step 1. □
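A minimal numerical sketch of the projection step (2.3.1) on invented bundle data follows; scipy (assumed available) and its SLSQP solver are used only because they accept the nonsmooth max-constraint at toy scale, whereas a real implementation would reformulate (2.3.1) as a QP with an extra variable r.

```python
# One iterate of the level variant: project the stability center onto the
# sublevel-set of the cutting-plane model.  All data are ours.
import numpy as np
from scipy.optimize import minimize

x_k = np.array([2.0, 1.0])
S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # subgradients s_i
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # trial points y_i
F = np.array([1.0, 1.0, 1.0])                        # values f(y_i)
level = 0.5

check_f = lambda y: np.max(F + ((y - Y) * S).sum(axis=1))   # model at y
res = minimize(lambda y: 0.5 * ((y - x_k) ** 2).sum(), x_k,
               method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda y: level - check_f(y)}])
print(res.x)        # next iterate y_{k+1}: check_f(res.x) <= level holds
```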
To explain the title of this subsection, consider the problem of solving a system of inequalities: given an index-set J and smooth functions f_j for j ∈ J, find x ∈ ℝⁿ such that
\[ f_j(x) \ \le\ 0 \quad\text{for all } j \in J\,. \tag{2.3.3} \]
It may be a difficult one, especially when J is a large set (possibly infinite), and/or the f_j's are not affine. Several techniques are known for a numerical treatment of this problem.
- The relaxation method addresses large sets J and consists of a dynamic selection of appropriate elements in J. For example: at the current iterate x_k we take a most violated inequality, i.e. we compute an index j_k such that
\[ f_{j_k}(x_k) \ =\ \max_{j\in J}\ f_j(x_k)\,; \]
then we solve
\[ \min\ \big\{ \tfrac12\,\|x - x_k\|^2 \ :\ f_{j_k}(x) \le 0 \big\}\,. \]
Unless x_k is already a solution of (2.3.3), we certainly obtain x_{k+1} ≠ x_k. Refined variants exist, in which more than one index is taken into account at each iteration.
- Newton's principle (§II.2.3) can also be used: each inequality can be linearized at x_k, (2.3.3) being replaced by the linearized system (2.3.5) below.
Remark 2.3.2 The solution of this projection problem can be computed explicitly: the minimality conditions are
Assuming s(x_k) ≠ 0 and $\bar f$ < f(x_k) (hence x ≠ x_k, μ > 0), the next iterate is
\[ x_k - \frac{f(x_k) - \bar f}{\|s(x_k)\|^2}\, s(x_k)\,. \]
This is just the subgradient Algorithm XII.4.1.1, in which the knowledge of $\bar f$ is exploited to provide a special stepsize. □
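A hedged sketch of this special-stepsize subgradient method (the function, the known optimal value and the starting point are our own toy choices):

```python
# Subgradient steps with the special stepsize of Remark 2.3.2, on
# f(x) = ||x||_1 whose optimal value f_bar = 0 is known.
import numpy as np

f = lambda x: np.abs(x).sum()
s = lambda x: np.sign(x) + (x == 0)        # a subgradient of the l1-norm
f_bar = 0.0

x = np.array([4.0, -2.5, 1.0])
for _ in range(200):
    if f(x) <= f_bar + 1e-12:
        break
    g = s(x)
    x = x - (f(x) - f_bar) / (g @ g) * g   # projection onto the linearization
print(x, f(x))                             # f(x) decreases towards f_bar
```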
When formulating our minimization problem as the single inequality f(x) ≤ $\bar f$, the essential resolution step is the linearization of f (Newton's principle). Another possible formulation is via infinitely many cutting-plane inequalities: f can be expressed as a supremum of affine functions,
where the indices i ∈ J of (2.3.3) are rather denoted by y ∈ ℝⁿ. We can apply to this problem the relaxation principle and memorize the inequalities visited during the successive iterations. At the given iterate x_k, we recover a problem of the form (2.3.1).
Let us conclude this comparison: if the infimum of f is known, then Algorithm 2.3.1 with ℓ_k ≡ $\bar f$ is quite a suitable variant. In other cases, the whole issue for this algorithm will be to identify the unknown value $\bar f$, thus indicating suitable rules for the management of {ℓ_k}.
if ν(k) denotes the number of descent-steps that have been made prior to the kth iteration. If the algorithm performs infinitely many descent-steps, f(x_k) − $\bar f$ → 0 and we are done.
[Case 2: K is finite] Suppose now that the sequence {x_k} stops at some iteration k*, so that x_k = x_{k*} for all k ≥ k*. We proceed to prove that $\bar f$ is a cluster value of {f(y_k)}.
First, $\check f_k(\bar x)$ ≤ f($\bar x$) = $\bar f$ for all k: $\bar x$ is feasible in each stabilized problem, so ‖y_k − x_{k*}‖ ≤ ‖$\bar x$ − x_{k*}‖ by construction. We conclude that {y_k} is bounded; from their definition, all the model-functions $\check f_k$ have a fixed Lipschitz constant L, namely a Lipschitz constant of f on B(x_{k*}, ‖$\bar x$ − x_{k*}‖) (Theorem IV.3.1.2).
Then take k and k' arbitrary with k* ≤ k ≤ k'; observe that $\check f_{k'}$(y_k) = f(y_k) by definition of $\check f_{k'}$, so
The comments following Theorem 2.1.5 are still valid, concerning the proof technique; it can even be said that, after a descent-step, all the linearizations appearing in $\check f_k$ can be refreshed, thus reinitializing $\check f_{k+1}$ to the affine function f(x_{k+1}) + ⟨s_{k+1}, · − x_{k+1}⟩. This modification does not affect Case 2; and Case 1 uses no memory mechanism.
Remark 2.3.4 Further consideration of (2.3.4) suggests another interesting technical comment: suppose in Algorithm 2.3.1 that the level ℓ is fixed but the descent test is ignored: only null-steps are taken. Then, providing that {y_k} is bounded, {f(y_k)} reaches any level above ℓ. If ℓ = $\bar f$, this has several implications:
- First, as far as proving convergence is concerned, the concept of descent-step is useless: we could just set m = 1 in the algorithm; the stability center would remain fixed throughout, the minimizing sequence would be {y_k}.
- Even further: the choice of the stability center, to be projected onto the sublevel-set of $\check f_k$, is moderately important. Technically, its role is limited to preserving the boundedness of {y_k}.
- Existence of a minimum of f is required to guarantee this boundedness: in a way, our present algorithm is weaker than the trust-region form, which did not need this existence. However, we leave it as an exercise to reproduce the proof of Theorem 2.3.3 for a variant of Algorithm 2.3.1 using a more conservative choice of the level:
- Our proof of Theorem 2.3.3 is qualitative; compare it with §IX.2.1(a). A quantitative argument could also be conceived of, as in §IX.2.1(b). □
\[ f_j(x_k) + \langle \nabla f_j(x_k),\, x - x_k\rangle \ \le\ 0 \quad\text{for all } j \in J\,. \tag{2.3.5} \]
This makes numerical sense if J is finite (and not large): we have a quadratic program to solve at each iteration. Even in this case, however, the algorithm may not converge, because a Newton method is only locally convergent.
To get global convergence, line-searching is a natural technique: we can take a next iterate along the half-line pointing towards the solution of (2.3.5) and decreasing "substantially" the natural objective function max_{j∈J} f_j; we are right in the framework of §II.3. The message of the present section is that another possible technique is to memorize the linearizations; as already seen in §1.2, a null-step resembles one cycle in a line-search. This technique enables the resolution of (2.3.3) with an infinite index-set J, and also with nonsmooth functions f_j. The way is open to minimization methods with nonsmooth constraints. Compare also the present discussion with §IX.3.2.
\[ \begin{cases} \min\ r + \tfrac12\,\mu\,\|y - x_k\|^2\,, & (y, r) \in \mathbb R^n \times \mathbb R\,,\\[2pt] f(y_i) + \langle s_i,\, y - y_i\rangle \le r & \text{for } i = 1, \dots, k\,. \end{cases} \tag{2.4.1} \]
The dual of this quadratic program can be formulated explicitly, yielding very instructive interpretations. In the result below, Δ_k is the unit simplex of ℝᵏ as usual; the coefficients
\[ e_i := f(x_k) - f(y_i) - \langle s_i,\, x_k - y_i\rangle \tag{2.4.2} \]
are the linearization errors between y_i and x_k, already encountered in several previous chapters (see Definitions XI.4.2.3 and XIV.1.2.7).
Lemma 2.4.1 For μ > 0, the unique solution of the penalized problem (2.2.1) = (2.4.1) is
\[ y_{k+1} = x_k - \frac1\mu\,\hat s \quad\text{with}\quad \hat s := \sum_{i=1}^k \alpha_i s_i\,, \tag{2.4.3} \]
where α solves the dual problem
\[ \min_{\alpha\in\Delta_k}\ \frac1{2\mu}\,\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2 + \sum_{i=1}^k \alpha_i e_i\,. \tag{2.4.4} \]
Furthermore
\[ \check f_k(y_{k+1}) \ =\ f(x_k) - \sum_{i=1}^k \alpha_i e_i - \frac1\mu\,\|\hat s\|^2\,. \tag{2.4.5} \]
PROOF. This is a direct application of Chap. XII (see §XII.3.4 if necessary). Take k nonnegative dual variables α₁, …, α_k and form the Lagrange function which, at (y, r, α) ∈ ℝⁿ × ℝ × ℝᵏ, has the value
\[ r + \tfrac12\,\mu\,\|y - x_k\|^2 + \sum_{i=1}^k \alpha_i\langle s_i, y\rangle + \sum_{i=1}^k \alpha_i\big[f(y_i) - \langle s_i, y_i\rangle - r\big]\,. \]
Its minimization with respect to the primal variables (y, r) implies first the condition Σ_{i=1}^k α_i = 1 (otherwise we get −∞), and results in y given as in (2.4.3). Plugging this value back into the Lagrangian, we obtain the dual problem (2.4.4) associated with (2.4.1). □
With the form (2.4.4) of stabilized problem, we can play the same game as in the previous subsections: the linear term in α can be interpreted as the dualization of a constraint whose right-hand side, say ε, is a function of its multiplier μ. In other words: given μ > 0, there is an ε such that the solution of (2.2.1) is given by (2.4.3), where α solves
\[ \min_{\alpha\in\Delta_k}\ \Big\{ \tfrac12\,\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2 \ :\ \sum_{i=1}^k \alpha_i e_i \le \varepsilon \Big\}\,. \tag{2.4.6} \]
Conversely, let the constraint in this last problem have a positive multiplier μ; then the associated y_{k+1} of (2.4.3) is also the unique solution of (2.2.1) with this μ.
A thorough observation of our notation shows that
\[ e_i \ge 0 \ \text{ for } i = 1, \dots, k \quad\text{and}\quad e_j = 0 \ \text{ for some } j \le k\,. \]
It follows that the correspondence ε ↔ μ involves nonnegative values of ε only. Furthermore, we will see that the values of ε that are relevant for convergence must depend on the iteration index. In summary, our detour into the dual space has revealed a fourth conceptually equivalent stabilized algorithm:
Algorithm 2.4.2 (Cutting Planes with Dual Stabilization) The initial point x₁ is given, together with a stopping tolerance $\underline\delta$ ≥ 0. Choose a descent-coefficient m ∈ ]0, 1[. Initialize the descent-set K = ∅, the iteration counter k = 1, and y₁ = x₁; compute f(y₁) and s₁ = s(y₁).
STEP 1. Choose ε ≥ 0 such that the constraint in (2.4.6) has a positive multiplier μ (for simplicity, we assume this is possible).
STEP 2. Solve (2.4.6) to obtain an optimal α ∈ Δ_k and a multiplier μ > 0. Unless the stopping criterion is satisfied, set
\[ \hat s := \sum_{j=1}^k \alpha_j s_j\,, \qquad y_{k+1} = x_k - \frac1\mu\,\hat s\,. \]
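A small numerical sketch with invented bundle data (s_i, e_i) follows; for simplicity it solves the equivalent penalized dual (2.4.4) for a given μ, which yields the same construction $\hat s$ and y_{k+1} = x_k − $\hat s$/μ. scipy (assumed available) handles the simplex constraint.

```python
# The dual (2.4.4): minimize over the unit simplex
# (1/(2 mu)) ||sum_i a_i s_i||^2 + sum_i a_i e_i, then recover y_{k+1}.
import numpy as np
from scipy.optimize import minimize

x_k = np.zeros(2)
S = np.array([[1.0, 0.5], [-1.0, 0.3], [0.2, -1.0]])  # subgradients s_i
e = np.array([0.0, 0.4, 0.7])                         # linearization errors
mu = 2.0

dual = lambda a: ((S.T @ a) ** 2).sum() / (2 * mu) + a @ e
k = len(e)
res = minimize(dual, np.ones(k) / k, method="SLSQP",
               bounds=[(0.0, 1.0)] * k,
               constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
s_hat = S.T @ res.x
y_next = x_k - s_hat / mu
print(res.x, y_next)
```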
The interesting point about this interpretation is that (2.4.6) is just the direction-finding problem (XIV.2.1.4) of the dual bundle methods. This allows some useful comparisons:
(i) A stabilized cutting-plane method is a particular form of bundle method, in which the "line-search" chooses systematically t = 1/μ (the inverse multiplier of (2.4.6), which must be positive). The privileged role played by this particular stepsize was already outlined in §XIV.2.4; see Fig. XIV.2.4.1 again.
(ii) Alternatively, a dual bundle method can be viewed as a stabilized cutting-plane method, in which a line-search is inserted. Since Remark 1.2.1 warns us against such a mixture, it might be desirable to imagine something different.
(iii) The stopping criterion in Algorithm 2.4.2 is not specified, but we know from the previous chapters that x_k is approximately optimal if both ε and ‖$\hat s$‖ are small; see for example (XIV.2.3.5). This interpretation is also useful for the stopping criteria used in the previous subsections. Indeed (2.4.5) reveals two distinct components in the nominal decrease f(x_k) − $\check f_k$(y_{k+1}). One, ε, is directly comparable to f-values; the other, ‖$\hat s$‖²/μ, connotes shifts in x via the "rate" of decrease ‖$\hat s$‖; see the note XIV.3.4.3(6).
(iv) In §XIV.3.3(a), we mentioned an ambiguity concerning the descent test used by a dual bundle method. If the value (XIV.3.3.2) is used for the line-search parameter, the dual bundle algorithm uses a descent test based on the aggregate linearization, which minorizes f, simply because it minorizes $\check f_k$. The way is thus open to "economic" cutting-plane algorithms (possibly stabilized), in which the above aggregate linearization can take the place of one or more direct linearizations, thereby diminishing the complexity of the cutting-plane function.
Recall from §XIV.3.4 that a rather intricate control of ε_k is necessary, just to ensure convergence of Algorithm 2.4.2. Having said enough about this problem in Chap. XIV, we make here a last remark. To pass from the primal stabilized μ-problem to its dual (2.4.4) or (2.4.6), we applied Lagrange duality to the developed form (2.4.1); and to make the comparison more suggestive, we introduced the linearization errors (2.4.2). The same interpretative work can be done by applying Fenchel duality (§XII.5.4) directly to (2.2.1). This gives something more abstract, but also more intrinsic, involving the conjugate function of $\check f_k$:
Proposition 2.4.3 For μ > 0, the unique solution y_{k+1} of (2.2.1) is
\[ y_{k+1} = x_k - \frac1\mu\,\hat s\,, \]
where $\hat s$ ∈ ∂$\check f_k$(y_{k+1}) is the unique minimizer of the closed convex function
\[ \mathbb R^n \ni s \ \mapsto\ \check f_k^*(s) - \langle s, x_k\rangle + \frac1{2\mu}\,\|s\|^2\,. \tag{2.4.7} \]
PROOF. Set g₁ := $\check f_k$, g₂ := ½μ‖· − x_k‖² and apply Proposition XII.5.4.1: because g₂ is finite everywhere, the qualification assumption (XII.5.4.3) certainly holds. The conjugate of g₂ can be easily computed, and minimizing the function of (2.4.7) is exactly the dual problem of (2.2.1). Denoting its solution by $\hat s$, the solution of (2.2.1) is then in the (therefore nonempty) set ∂$\check f_k^*$($\hat s$) ∩ {−(1/μ)$\hat s$ + x_k}. Then there holds
\[ y_{k+1} = x_k - \frac1\mu\,\hat s\,, \qquad \hat s \in \partial\check f_k(y_{k+1})\,. \] □
Needless to say, (2.4.7) is simply a compact form for the objective function of (2.4.4). To see this, use either of the following two useful expressions for the cutting-plane model:
\[ \check f_k(y) \ =\ \max_{i=1,\dots,k}\ \big\{ f(y_i) + \langle s_i,\, y - y_i\rangle \big\} \ =\ \max_{i=1,\dots,k}\ \big\{ f(x_k) - e_i + \langle s_i,\, y - x_k\rangle \big\}\,. \]
Its conjugate can be computed with the help of various calculus rules from Chap. X (see in particular §X.3.4); we obtain a convex hull of needles:
\[ \check f_k^*(s) \ =\ -f(x_k) + \langle s, x_k\rangle + \min\Big\{ \sum_{i=1}^k \alpha_i e_i \ :\ \alpha\in\Delta_k,\ \sum_{i=1}^k \alpha_i s_i = s \Big\} \ =\ \min\Big\{ \sum_{i=1}^k \alpha_i f^*(s_i) \ :\ \alpha\in\Delta_k,\ \sum_{i=1}^k \alpha_i s_i = s \Big\}\,. \]
Plugging the first expression into (2.4.7) yields the minimization problem (2.4.4); and with the second value, we obtain an equivalent form:
\[ \min_{\alpha\in\Delta_k}\ \Big\{ \tfrac12\,\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2 + \mu \sum_{i=1}^k \alpha_i\big[f^*(s_i) - \langle s_i, x_k\rangle\big] \Big\}\,. \]
2.5 Conclusion
This Section 2 has reviewed a number of possible algorithms for minimizing a (finite-
valued) convex function, based on two possible motivations:
- Three of them work in the primal space. They start from the observation that the
cutting-plane algorithm is unstable: its next iterate is "too far" from the current one,
and should be pulled back.
- One of them works in the dual space. It starts from the observation that the steepest-
descent algorithm is sluggish: the next iterate is "too close" to the current one, and
should be pushed forward.
In both approaches, the eventual aim is to improve the progress towards an optimal
solution, as measured in terms of the objective function.
Remark 2.5.1 We mention here that our list of variants is not exhaustive.
- When starting from the μ-problem (2.2.1), we introduced a trust-region constraint, or a level constraint, ending up with two more primal variants.
- When starting from the same μ-problem in its dual form (2.4.4), we introduced a linearization-error constraint, which corresponded to a level point of view in the dual space. Altogether, we obtained the four possibilities, reviewed above.
- We could likewise introduce a trust-region point of view in the dual space: instead of (2.4.6), we could formulate
\[ \min\ \Big\{ \sum_{i=1}^k \alpha_i e_i \ :\ \alpha\in\Delta_k,\ \Big\|\sum_{i=1}^k \alpha_i s_i\Big\| \le \sigma \Big\} \]
for some dual radius σ (note: −σ represents the value of a certain support function, remember Proposition XIV.2.3.2).
- Another idea would be to take the dual of (2.1.1) or (2.3.1), and then to make analogous changes of parameter.
We do not know what would result from these various exercises. □
All these algorithms follow essentially the same strategy: solve a quadratic program depending on a certain internal parameter: κ, μ, ℓ, ε, or whatever. They are conceptually equivalent, in the sense that they generate the same iterates, provided that their respective parameters are linked by a certain a posteriori relation; and they are characterized precisely by the way this parameter is chosen.
Figure 2.5.1 represents the relations connecting all these parameters; it plots the two model-functions, $\check f_k$ and its piecewise quadratic perturbation, along $\hat s$ interpreted as a direction. The graph of f lies somewhere above gr $\check f_k$ and meets the two model-graphs at (x_k, f(x_k)). The dashed line represents the trace along x_k + ℝ$\hat s$ of the affine function
If the variant mentioned in Remark 2.5.1 is used, take the graph of an affine function of slope −σ, and lift it as high as possible so as to support gr $\check f_k$: this gives the other parameters.
Remark 2.5.2 This picture clearly confirms that the nominal decreases for the trust-region
and penalized variants are not significantly different: in view of (2.4.5), they are respectively
So far, our review has been purely descriptive; no argument has been given to prefer any particular variant. Yet, the numerical illustrations in §XIV.4.2 have clearly demonstrated the importance of a proper choice of the associated parameter (be it ε, κ, μ, ℓ, or whatever). The selection of a particular variant should therefore address two questions, one theoretical, one practical:
- specific rules for choosing the parameter efficiently, in terms of speed of convergence;
- effective resolution of the stabilized problem, which is extremely important in practice: being routinely executed, it must be fast and fail-safe.
With respect to the second criterion, the four variants are approximately even, with a slight advantage to the ε-form: some technical details in quadratic programming make its resolution process more stable than the other three.
The first criterion is also the more decisive; but unfortunately, it does not yield a clear conclusion. We refer to the difficulties illustrated in §XIV.4.2 for efficient choices of ε; and apart from the situation of a known minimal value $\bar f$ (in which case the level-strategy becomes obvious), little can be said about appropriate choices of the other parameters.
It turns out, however, that the variant by penalization has a third possible moti-
vation, which gives it a definite advantage. It will be seen in §4 that it is intimately
related to the Moreau-Yosida regularization of Example XI.3.4.4. The next section is
therefore devoted to a thorough study of this variant.
3 A Class of Primal Bundle Algorithms

In this section, we study more particularly primal bundle methods in penalized form,
introduced in §2.2. Because they work directly in the primal space, they can handle
possible constraints rather easily. We therefore assume that the problem to solve is
actually
\[ \inf\ \{ f(x) \ :\ x \in C \}\,; \tag{3.0.1} \]
f is still a convex function (finite everywhere), and now C is a closed convex subset of ℝⁿ. The only restriction imposed on C is of a practical nature, namely each stabilized problem (2.2.1) must still be solvable, even if the constraint y ∈ C is added. In
practice, this amounts to assuming that C is a closed convex polyhedron:
\[ C \ =\ \{ y \in \mathbb R^n \ :\ \langle a_j, y\rangle \le b_j \ \text{ for } j = 1, \dots, m \}\,, \tag{3.0.2} \]
(a_j, b_j) ∈ ℝⁿ × ℝ being given for j = 1, …, m. As far as this chapter is concerned, the only effect of introducing C is a slight complication in the algebra; the reader may take C = ℝⁿ throughout, if this helps him to follow our development more easily. Note, anyway, that (3.0.1) is the unconstrained minimization of g := f + I_C; because f is assumed finite everywhere, ∂g = ∂f + N_C (Theorem XI.3.1.1) and little is essentially changed with respect to our previous developments.
The model will not be exactly $\check f_k$ of (1.0.1), so we prefer to denote it abstractly by φ, a convex function finite everywhere. Besides, the iteration index k is useless for the moment; the stabilized problem is therefore denoted by
\[ \min_{y\in C}\ \varphi(y) + \tfrac12\,\mu\,\|y - x\|^2\,, \tag{3.1.1} \]
x and μ being the (given) kth stability center and penalty coefficient respectively. Once again, this problem is assumed numerically solvable; such is the case for φ piecewise affine and C polyhedral.
The model φ will incorporate the aggregation technique, already seen in previous chapters, which we proceed to describe in a totally primal language. First we reproduce Lemma 2.4.1.
Lemma 3.1.1 With φ : ℝⁿ → ℝ convex, μ > 0 and C closed convex, (3.1.1) has a unique solution y⁺ characterized by the formulae
\[ y^+ = x - \tfrac1\mu\,(\hat s + \hat p)\,, \qquad \hat s \in \partial\varphi(y^+)\,,\quad \hat p \in N_C(y^+)\,. \tag{3.1.2} \]
Furthermore
\[ \varphi(y) \ \ge\ f(x) + \langle \hat s,\, y - x\rangle - \hat e \quad\text{for all } y \in \mathbb R^n\,, \]
where
\[ \hat e := f(x) - \varphi(y^+) - \langle \hat s,\, x - y^+\rangle\,. \tag{3.1.3} \]
PROOF. The assumptions clearly imply that (3.1.1) has a unique solution. Using the geometric minimality condition VII.1.1.1 and some subdifferential calculus, this solution is seen to be the unique point y⁺ satisfying
\[ 0 \ \in\ \partial\varphi(y^+) + \mu\,(y^+ - x) + N_C(y^+)\,, \]
which is exactly (3.1.2). □
Proposition 3.1.2 With the notation of Lemma 3.1.1, take an arbitrary function ψ : ℝⁿ → ℝ ∪ {+∞} satisfying
PROOF. Use again the same transportation trick: using (3.1.3) and (3.1.2), the relations defining ψ can be written
with equality at y = y⁺. Adding the term ½μ‖y − x‖² to both sides,
again with equality at y = y⁺. Then it suffices to observe from (3.1.2) that the function of the right-hand side is minimized over C at y = y⁺: indeed its gradient at y⁺ is
\[ \hat s + \mu\,(y^+ - x) \ =\ -\hat p\,, \tag{3.1.5} \]
and $\hat p$ ∈ N_C(y⁺): the claim follows. □
Proposition 3.1.2 tells us in particular that the next iterate y⁺ would not be changed if, instead of φ, the model were any convex function ψ sandwiched between the aggregate linearization $\bar f_a$ := f(x) + ⟨$\hat s$, · − x⟩ − $\hat e$ and φ.
Note that the aggregate linearization concerns exclusively f, the function that is modeled by φ. On the other hand, the indicator part of the (unconstrained) objective function f + I_C is treated directly, without any modelling; so aggregation is irrelevant for it.
Remark 3.1.3 We could take for example ψ = $\bar f_a$ in Proposition 3.1.2. In relation to Fig. 2.5.1, this suggests another construction, illustrated by Fig. 3.1.1. For given x and μ > 0, assume $\hat p$ = 0 (so as to place ourselves in the framework of §2); with y of the form x − t$\hat s$, draw the parabola of equation

Fig. 3.1.1. Supporting a convex epigraph with a parabola

The above observation, together with the property of parabolas given in Fig. II.3.2.2, illustrates once more the point made in Remark 2.5.2 concerning nominal decreases. □
Keeping in mind the strategy used in the previous chapters, the above two results can be used as follows. When (3.1.1) is solved, a new affine piece (y⁺, f(y⁺), s⁺) will be introduced to give the new model
Before doing so, we may wish to "simplify" φ to some other ψ, in order to make room in the computer and/or to simplify the next quadratic program. For example, we may wish to discard some old affine pieces; this results in ψ ≤ φ. After such an operation, we must incorporate the affine piece $\bar f_a$ into the definition of the simpler ψ, so that we will have $\bar f_a$ ≤ ψ ≤ φ. This aggregation operation has been seen already in several previous chapters and we can infer that it will not impair convergence. Besides, the simpler function ψ is piecewise affine if φ is such, and the operation can be repeated at any later iteration.
In terms of the objective function f, the aggregate linearization $\bar f_a$ is not attached to any y such that $\hat s$ ∈ ∂f(y), so the notation (1.0.1) is no longer correct for the model. For the same reason, characterizing the model in terms of triples (y, f, s)_j is clumsy: we need only couples (s, r)_j ∈ ℝⁿ × ℝ characterizing affine functions. In addition to its slope s_j, each such affine function could be characterized by its value at 0, but we prefer to characterize it by its value at the current stability center x; furthermore we choose to call f(x) − e_j this value. Calling ℓ the total number of affine pieces, all the necessary information is then characterized by a bundle of couples
\[ (s_j, e_j)\,, \quad j = 1, \dots, \ell\,. \]
When the stability center moves from x to x⁺, these values must be updated; hence
\[ e_j^+ \ =\ e_j + f(x^+) - f(x) - \langle s_j,\, x^+ - x\rangle\,. \]
Finally, we prefer to work with the inverse of the penalty parameter; its interpretation as a stepsize is much more suggestive in a primal context.
In summary, re-introducing the iteration index k, our stabilized problem to be solved at the kth iteration is
\[ \min_{y\in C}\ \varphi_k(y) + \frac1{2t_k}\,\|y - x_k\|^2 \quad\text{with}\quad \varphi_k(y) := \max_{j=1,\dots,\ell}\ \big\{ f(x_k) - e_j + \langle s_j,\, y - x_k\rangle \big\}\,. \tag{3.1.6} \]
Algorithm 3.1.4 (Primal Bundle Method with Penalty) The initial point x₁ is given, together with a stopping tolerance $\underline\delta$ ≥ 0 and a maximal bundle size $\bar\ell$.
STEP 1 (main computation and stopping test). Choose a "stepsize" t_k > 0 and solve (3.1.6). As stated in Lemma 3.1.1, its unique solution is y_{k+1} = x_k − t_k($\hat s_k$ + $\hat p_k$), with $\hat s_k$ ∈ ∂φ_k(y_{k+1}) and $\hat p_k$ ∈ N_C(y_{k+1}). Set
\[ \delta_k := f(x_k) - \varphi_k(y_{k+1}) - \frac1{2t_k}\,\|y_{k+1} - x_k\|^2\,. \]
If δ_k ≤ $\underline\delta$ stop.
STEP 2 (descent test). Compute f(y_{k+1}) and s(y_{k+1}); if the descent test
\[ f(y_{k+1}) \ \le\ f(x_k) - m\,\delta_k \tag{3.1.7} \]
is satisfied, set x_{k+1} = y_{k+1} (descent-step); otherwise set x_{k+1} = x_k (null-step).
Notes 3.1.5
(i) The initialization t₁ can use Remark II.3.4.2 if we have an estimate Δ of the total decrease f(x₁) − $\bar f$. Then the initial t₁ can be obtained from the formula t₁‖s₁‖² = 2Δ. Not surprisingly, we then have δ₁ = Δ.
(ii) Lemma 3.1.1 essentially says that $\hat s_k$ ∈ $\partial_{\hat e_k}$f(x_k) and our convergence analysis will establish that $\hat s_k$ + $\hat p_k$ ∈ ∂_{ε_k}(f + I_C)(x_k), with ε_k ≥ 0 given in Lemma 3.2.1 below. Because the objective function is f + I_C, the whole issue will be to show that ε_k → 0 and $\hat s_k$ + $\hat p_k$ → 0. Accordingly, it may be judged convenient to split the stopping tolerance $\underline\delta$ into two terms: stop when
This algorithm requires access individually to $\hat s_k$ (the aggregate slope) and $\hat p_k$ (to compute y_{k+1} knowing $\hat s_k$), which are of course given from the minimality conditions in the stabilized problem. Thus, instead of (3.1.1) = (3.1.6), one might prefer to solve a dual problem.
\[ \min_{y\in\mathbb R^n}\ \Big[ \varphi_k(y) + \frac1{2t_k}\,\|y - x_k\|^2 + I_C(y) \Big]\,. \]
The conjugates of the three terms making up the above objective function are respectively φ_k*, ½t_k‖·‖² + ⟨·, x_k⟩ and σ_C. The dual problem (for example Fenchel's dual of §XII.5.4) is then the minimization with respect to (s, p) ∈ ℝⁿ × ℝⁿ of
or equivalently
Remark 3.1.6 Note in passing the rather significant role of μ_k = 1/t_k, which appears as more than a coefficient needed for numerical efficiency: multiplication by μ_k actually sends a vector of ℝⁿ to a vector of the dual space (ℝⁿ)*.
Indeed, it is good practice to view μ_k as the operator μ_k I, which could be more generally a symmetric operator Q: ℝⁿ → (ℝⁿ)*. The same remark is valid for the ordinary gradient method of Chap. II: writing y = x − t s(x) should be understood as y = x − Q⁻¹s(x). This mental operation is automatic in the Newton method, in which Q = ∇²f(x). □
(b) More concretely: consider (3.1.6), where C is a closed convex polyhedron as described in (3.0.2). The corresponding conjugate functions φ_k* and σ_C could be computed to specify the dual problem of (a). More simply, we can also follow the example of §XII.3.4: formulate (3.0.2), (3.1.6) as
which uses the variable y − x_k. Taking ℓ multipliers α_j and m multipliers γ_j, we set
from which we obtain $\hat s_k$ = Σ_{j=1}^ℓ α_j s_j and $\hat p_k$ = Σ_{j=1}^m γ_j a_j, as well as $\hat e_k$ = Σ_{j=1}^ℓ α_j e_j. We obtain also a term not directly used by the algorithm:
\[ \hat b_k := \sum_{j=1}^m \gamma_j\,\big[b_j - \langle a_j, x_k\rangle\big]\,; \]
from the transversality conditions,
Knowing that the essential operation is conjugation of the polyhedral function φ_k + I_C, compare the above dual problem with Example X.3.4.3. Our "bundle" could be viewed as made up of two parts: (s_j, e_j)_{j=1,…,ℓ}, as well as (a_j, b_j − ⟨a_j, x_k⟩)_{j=1,…,m}; the multipliers of the second part vary in the nonnegative orthant, instead of the unit simplex.
3.2 Convergence
Naturally, convergence of Algorithm 3.1.4 cannot hold for an arbitrary choice of the stepsizes t_k. When studying convergence, two approaches are therefore possible: either give specific rules to define t_k (and establish convergence of the corresponding implementation), or give abstract conditions on {t_k} for which convergence can be established. We choose the second solution here, even though our conditions on {t_k} lack implementability. In fact, our aim is to demonstrate a technical framework, rather than establish a particular result.
The two observations below give a feeling for "reasonable" abstract conditions. Indeed:
(i) A small t_k, or a large μ_k, is dangerous: it might over-emphasize the role of the stabilization, resulting in unduly small moves from x_k to y_{k+1}. This is suggested by analogy with Chap. II: the descent-test in Step 2 takes care of large stepsizes but, so far, nothing prevents them from being too small.
(ii) It is dangerous to make a null-step with large t_k. Here, we are warned by the convergence theory of Chap. IX: what we need in this situation is $\hat s_k$ + $\hat p_k$ → 0, and this depends crucially on the boundedness of the successive subgradients $\hat s_k$, which in turn comes from the boundedness of the successive iterates y_k. A large t_k gives a y_{k+1} far away.
First of all, we fix the point made in Note 3.1.5(ii).
Lemma 3.2.1 At each iteration of Algorithm 3.1.4, φ_k ≤ f and there holds
\[ \hat s_k + \hat p_k \ \in\ \partial_{\varepsilon_k}(f + I_C)(x_k)\,, \]
with
\[ \varepsilon_k := \delta_k - \tfrac12\,t_k\,\|\hat s_k + \hat p_k\|^2 \ \ge\ 0\,. \]
PROOF. At the first iteration, φ₁ is the (affine) cutting-plane function of (1.0.1): φ₁ ≤ f. Assume recursively that φ_k ≤ f. Keeping Proposition 3.1.2 in mind, the compression of the bundle at Step 4 replaces φ_k by some other convex function ψ ≤ f. Then, when appending at Step 5 the new piece
\[ f(x_k) - e_{k+1} + \langle s_{k+1},\, y - x_k\rangle \ =:\ \bar f_{k+1}(y) \ \le\ f(y) \quad\text{for all } y \in \mathbb R^n\,, \]
the model becomes φ_{k+1} = max{ψ, $\bar f_{k+1}$} ≤ f: the required minoration does hold for all k.
Then add the inequalities
(3.2.1)
If both {ε_k} and {$\hat s_k$ + $\hat p_k$} tend to zero, any accumulation point of {x_k} will be optimal (Proposition XI.4.1.1: the graph of (ε, x) ↦ ∂_ε(f + I_C)(x) is closed). However, convergence can be established even if {x_k} is unbounded, with a more subtle argument, based on inequality (3.2.3) below.
As before, we distinguish two cases: if there are infinitely many descent-steps, the objective function decreases "sufficiently", thanks to the successive descent tests (3.1.7); and if the sequence {x_k} stops, it is the bundling mechanism that does the job.
Theorem 3.2.2 Let Algorithm 3.1.4 be applied to the minimization problem (3.0.1), with the stopping tolerance $\underline\delta$ = 0. Assume that K is an infinite set.
(i) If
(3.2.2)
(3.2.3)
Now {f(x_k)} has a limit f* ∈ [−∞, +∞[. If f* = −∞, the proof is finished. Otherwise the descent test (3.1.7) implies that
\[ m \sum_{k\in K} \delta_k \ \le\ f(x_1) - f^* \ <\ +\infty\,, \tag{3.2.4} \]
where we have used the fact that f(x_{k+1}) − f(x_k) = 0 if k ∉ K (the same argument was used in Lemma 2.1.4).
[(i)] Assume for contradiction that there are y ∈ C and η > 0 such that
If {t_k} is bounded on K, (3.2.4) implies that the series Σ_{k∈K} t_kδ_k is convergent; once again, sum the inequalities (3.2.5) over k ∈ K to see that {x_k} is bounded. Extract some cluster point; in view of (i), this cluster point minimizes f on C and can legitimately be called $\bar x$.
Given an arbitrary η > 0, take some k₁ large enough in K so that
\[ \sum_{k_1 \le k \in K} t_k\,\delta_k \ \le\ \eta/4\,. \]
Perform once more on (3.2.5) the summation process, with k running in K from k₁ to an arbitrary k₂ ≥ k₁ (k₂ ∈ K). We conclude:
Apart from the conditions on the stepsize, the only arguments needed for Theorem 3.2.2 are: the definition (3.1.2) of the next iterate, the inequalities stated in Lemma 3.2.1, and the descent test (3.1.7); altogether, the bundling mechanism is by no means involved. Thus, consider a variant of the algorithm in which, when a descent has been obtained, Step 4 flushes the bundle entirely: ℓ is set to 1 and the aggregate piece ($\hat s_k$, $\hat e_k$) is simply left out. Theorem 3.2.2 still applies to such an algorithm. Numerically, the idea is silly, though: it uses the steepest-descent direction whenever possible, and we have insisted again and again that something worse can hardly be invented.
the aggregate linearization obtained at the kth iteration of Algorithm 3.1.4. For all y ∈ C, there holds
(3.2.6)
Theorem 3.2.4 Consider Algorithm 3.1.4 with the stopping tolerance $\underline\delta$ = 0. Assume that K is finite: for some k₀, each iteration k ≥ k₀ produces a null-step. If
\[ \sum_{k>k_0} \frac{t_k^2}{t_{k-1}} \ =\ +\infty\,, \tag{3.2.8} \]
then x_{k₀} minimizes f on C.
PROOF. In all our development below, k > k₀ and we use the notation of Lemma 3.2.3.
[Step 1] Write the definition of the kth nominal decrease:
\[ f(x_{k_0}) - \delta_k \ =\ \varphi_k(y_{k+1}) + \frac1{2t_k}\,\|y_{k+1} - x_{k_0}\|^2 \ \ge\ \bar f_{k-1}(y_{k+1}) + \frac1{2t_{k-1}}\,\|y_{k+1} - x_{k_0}\|^2 \quad [\text{Step 4 implies } \varphi_k \ge \bar f_{k-1}] \]
\[ \ge\ f(x_{k_0}) - \delta_{k-1} + \frac1{2t_{k-1}}\,\|y_{k+1} - y_k\|^2\,. \quad [\text{from (3.2.6)}] \]
Hence {δ_k} is nonincreasing and
\[ \frac1{2t_{k-1}}\,\|y_{k+1} - y_k\|^2 \ \le\ \delta_{k-1} - \delta_k\,. \]
[Step 2] Because
\[ \|y_{k+1} - x_{k_0}\|^2 \ \le\ 2t_k\delta_k \ \le\ 2t_{k_0}\delta_{k_0} \ =:\ R^2\,, \]
the sequence {y_k} is bounded; L will be a Lipschitz constant for f and each φ_k on B(x_{k₀}, R) and will be used to bound from below the decrease from δ_k to δ_{k+1}. Indeed, Step 5 of the algorithm forces f(y_k) = φ_k(y_k), therefore
\[ f(y_{k+1}) - \varphi_k(y_{k+1}) \ =\ f(y_{k+1}) - f(y_k) + \varphi_k(y_k) - \varphi_k(y_{k+1}) \ \le\ 2L\,\|y_{k+1} - y_k\|\,. \tag{3.2.10} \]
[Step 3] On the other hand, the descent test (3.1.7) is not satisfied:
\[ f(y_{k+1}) \ >\ f(x_{k_0}) - m\,\delta_k\,. \]
Combining with (3.2.10) and Step 1, we obtain, for some constant c > 0 depending on m and L,
\[ c \sum_{k>k_0} \frac{\delta_k^2}{t_{k-1}} \ \le\ \delta_{k_0} \ <\ +\infty\,. \]
Since δ_k ≥ ½t_k‖$\hat s_k$ + $\hat p_k$‖², this gives
\[ \sum_{k>k_0} \frac{t_k^2}{t_{k-1}}\,\|\hat s_k + \hat p_k\|^4 \ <\ +\infty\,. \]
It remains to use (3.2.8): lim inf ‖$\hat s_k$ + $\hat p_k$‖⁴ = 0. Altogether, Lemma 3.2.1 tells us that 0 ∈ ∂(f + I_C)(x_{k₀}). □
Remark 3.2.5 In this proof, as well as that of Theorem 3.2.2, the essential argument is that δ_k → 0. The algorithm does terminate if the stopping tolerance $\underline\delta$ is positive; upon termination, Lemma 3.2.1 then gives an approximate minimality condition, depending on the magnitude of t_k. Thus, a proof can also be given in the spirit of those in the previous chapters, showing how the actual implementation behaves in the computer.
Note also that the proof gives an indication of the speed at which {δ_k} tends to 0: compare (3.2.11) with Lemma IX.2.2.1. Numerically speaking, it is a nice property that the two components of the "compound" convergence parameter
\[ \delta_k \ =\ \varepsilon_k + \tfrac12\,t_k\,\|\hat s_k + \hat p_k\|^2 \]
are simultaneously driven to 0. See again Fig. XIV.4.1.3 and §XIV.4.4(b).
We mentioned in Remark IX.2.1.3 that a tolerance m″ could help an approximate resolution of the quadratic program for dual bundle methods. In an actual implementation of the present algorithm, a convenient stopping criterion should also be specified for the quadratic solver. We omit such details here, admitting that exact arithmetic is used for an exact resolution of (3.1.6). □
Both conditions (3.2.2), (3.2.8) rule out small stepsizes and connote (XII.4.1.3), used to prove convergence of the basic subgradient algorithm; note for example that (3.2.8) holds if t_k = 1/k. Naturally, the comments of Remark XII.4.1.5 apply again, concerning the practical irrelevance of such conditions. The situation is even worse here: because we do not know in advance whether k ∈ K, the possibility of checking (3.2.2) becomes even more remote. Our convergence results 3.2.2 and 3.2.4 are more interesting for their proofs than for their statements. To guarantee (3.2.2), (3.2.8), a simple strategy is to bound t_k from below by a positive threshold $\underline t$ > 0. In view of (3.2.7), it is even safer to impose a fixed stepsize t_k ≡ t > 0; then we get Algorithm 2.2.2, with a fixed penalty parameter μ = 1/t. However, let us say again that no reasonable value for t is a priori available; the corresponding algorithm is hardly efficient in practice.
On the other hand, (3.2.7) is numerically meaningful and says: when a null-step has been made, do not increase the stepsize. This has a practical motivation. Indeed suppose that a null-step has been made, so that the only change for the next iteration is the new piece (s_{k+1}, e_{k+1}) appearing in the bundle. Increasing the stepsize might well reproduce the iterate y_{k+2} = y_{k+1}, in which case the next call to the black box (U1) would be redundant.
Remark 3.2.6 We have already mentioned that (3.0.1) is the problem of minimizing the sum f + I_C of two closed convex functions. More generally, let a minimization problem be posed in the form
\[ \inf\ \{ f(x) + g(x) \ :\ x \in \mathbb R^n \}\,, \]
with f and g closed and convex. Several bundling patterns can be considered:
(i) First of all, f + g can be viewed as one single objective function h, to which the general bundle method can be applied. Provided that h is finite everywhere, there is nothing new so far. We recall here that the local Lipschitz continuity of the objective function is technically important, as it guarantees the boundedness of the sequence {s_k}.
(ii) A second possibility is to apply the bundling mechanism to f alone, keeping g as it is; this is what we have done here, keeping the constraint y ∈ C explicitly in each stabilized problem. Our convergence results state that this approach is valid if f is finite everywhere, while g is allowed the value +∞.
(iii) When both f and g are finite everywhere, a third possibility is to manage two "decomposed" bundles separately. Indeed suppose that the black box (U1) of Fig. II.1.2.1 is able to answer individual objective-values f(y) and g(y) and subgradient-values, say s(y) and p(y), rather than the sums f(y) + g(y) and s(y) + p(y). Then f and g can be modelled separately:
\[ \check f_k(y) := \max_{i=1,\dots,k}\ \big[ f(y_i) + \langle s(y_i),\, y - y_i\rangle \big] \ \le\ f(y)\,, \]
\[ \check g_k(y) := \max_{i=1,\dots,k}\ \big[ g(y_i) + \langle p(y_i),\, y - y_i\rangle \big] \ \le\ g(y)\,. \]
The resulting model is more accurate than the "normal" model
\[ \check h_k(y) := \max_{i=1,\dots,k}\ \big[ (f + g)(y_i) + \langle (s + p)(y_i),\, y - y_i\rangle \big] \ \le\ f(y) + g(y)\,, \]
simply because $\check h_k$ ≤ $\check f_k$ + $\check g_k$ (this is better seen if two different indices i and j are used in the definitions of $\check f$ and $\check g$: the maximum $\check h_k$ of a sum is smaller than the sum $\check f_k$ + $\check g_k$ of maxima).
Exploiting this idea costs in memory (two additional subgradients are stored at each
iteration) but may result in faster convergence. 0
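The domination $\check h_k$ ≤ $\check f_k$ + $\check g_k$ can be checked numerically; in the following one-dimensional sketch (the functions and trial points are our own choices), the decomposed model is strictly better at some points:

```python
# Remark 3.2.6(iii) numerically: the decomposed model check_f + check_g
# dominates the "normal" model check_h built from (f+g)-information only.
import numpy as np

f, df = lambda y: np.abs(y - 1), lambda y: np.sign(y - 1)
g, dg = lambda y: np.abs(y + 1), lambda y: np.sign(y + 1)

Y = np.array([-2.0, 2.0])                     # past trial points y_i
grid = np.linspace(-3, 3, 601)

cuts = lambda vals, slopes: np.max(
    [vals[i] + slopes[i] * (grid - Y[i]) for i in range(len(Y))], axis=0)

check_f = cuts(f(Y), df(Y))
check_g = cuts(g(Y), dg(Y))
check_h = cuts(f(Y) + g(Y), df(Y) + dg(Y))    # model of the sum

assert np.all(check_h <= check_f + check_g + 1e-12)
print((check_f + check_g - check_h).max())    # 2.0: strictly better at y = 0
```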
where $\hat s$(t) = Σ_{i=1}^ℓ α_i s_i can be obtained from the dual problem: Δ_ℓ ⊂ ℝ^ℓ being the unit simplex,
\[ \min_{\alpha\in\Delta_\ell}\ \Big\{ \frac t2\,\Big\|\sum_{i=1}^\ell \alpha_i s_i\Big\|^2 + \sum_{i=1}^\ell \alpha_i e_i \Big\}\,. \tag{3.3.2} \]
(a) Small Stepsizes. When t ↓ 0, the term Σ_{i=1}^ℓ α_i e_i in (3.3.2) is pushed down to its minimal value. Actually, this minimal value is even attained for finitely small t > 0:
Proposition 3.3.1 Assume that e_i = 0 for some i. Then there exists $\underline t$ > 0 such that, for t ∈ ]0, $\underline t$], $\hat s$(t) solves
\[ \min\ \big\{ \tfrac12\,\|s\|^2 \ :\ s \in \operatorname{co}\,\{ s_i : e_i = 0 \} \big\}\,. \tag{3.3.3} \]
EXPLANATION. We do not give a formal proof, but we link this result to the direction-finding problem for dual bundle methods. Interpret 1/t =: μ in (3.3.2) as a multiplier of the constraint in the equivalent formulation (see §2.4)
\[ \min_{\alpha\in\Delta_\ell}\ \Big\{ \tfrac12\,\Big\|\sum_{i=1}^\ell \alpha_i s_i\Big\|^2 \ :\ \sum_{i=1}^\ell \alpha_i e_i \le \varepsilon \Big\}\,. \]
When ε ↓ 0 we know from Proposition XIV.2.2.5 that such a multiplier is bounded, say by $\bar\mu$ =: 1/$\underline t$. □
The $\hat s$(t) obtained from (3.3.3) is nothing but the substitute for steepest-descent, considered in Chap. IX. Note also that we have $\hat s$(t) ∈ ∂f(x): small stepsizes in Algorithm 3.1.4 mimic the basic subgradient algorithm of §XII.4.1. Proposition 3.3.1 tells us that this sort of steepest-descent direction is obtained when t is small. We know already from Chap. II or VIII that such directions are dangerous, and this explains why t = t_k should not be small in Algorithm 3.1.4: not only for theoretical convergence but also for numerical efficiency.
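The following sketch (bundle data invented by us; scipy assumed available) illustrates Proposition 3.3.1: solving the dual (3.3.2) for decreasing t, $\hat s$(t) settles on the projection of the origin onto the convex hull of the subgradients with zero linearization error.

```python
# s_hat(t) from (3.3.2) for decreasing t: the limit is the projection of 0
# onto co{s_i : e_i = 0}, the steepest-descent substitute of Chap. IX.
import numpy as np
from scipy.optimize import minimize

S = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])  # subgradients s_i
e = np.array([0.0, 0.0, 0.8])                          # e_3 > 0: inactive

def s_hat(t):
    obj = lambda a: 0.5 * t * ((S.T @ a) ** 2).sum() + a @ e
    res = minimize(obj, np.ones(3) / 3, method="SLSQP",
                   bounds=[(0.0, 1.0)] * 3,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1}])
    return S.T @ res.x

for t in (1.0, 0.1, 0.01):
    print(t, s_hat(t))    # tends to (1, 1), the projection of 0 onto [s_1, s_2]
```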
(b) Large Stepsizes. The case t → +∞ represents the other extreme, and a first situation is one in which φ is bounded from below on ℝⁿ. In this case,
- being piecewise affine, φ has a nonempty set of minimizers (§3.4 in Chap. V or VIII);
- when t → +∞, y(t) has a limit, which is the minimizer of φ that is closest to x; this follows from Propositions 2.2.3 and 2.1.3. We will even see in Remark 4.2.6 that this limit is attained for finite t, a property which relies upon the piecewise affine character of φ.
Thus, we say: among the minimizers of the cutting-plane model (if there are any), there is a distinguished one which can be reached with a large stepsize in Algorithm 3.1.4. It can also be reached by the trust-region variant, with κ = κ_∞ of Proposition 2.1.3.
On the other hand, suppose that φ is not bounded from below; then there is no cutting-plane iterate, and no limit for y(t) when t → +∞. Nevertheless, minimization of the cutting-plane model can give something meaningful in this case too. In the next result, we use again the notation Proj for the projection onto the closed convex hull of a set.
PROOF. With t > 0, write the minimality conditions of (3.3.2) (see for example Theorem XIV.2.2.1):
Also, as $\hat s$(∞) is in the convex hull of {s₁, …, s_ℓ}, the only possibility for $\hat s$(∞) is to be the stated projection (see for example Proposition VIII.1.3.4). □
Thus, assume $\hat s$(∞) ≠ 0 in the above result. When t → +∞, y(t) is unbounded but [x − y(t)]/t = $\hat s$(t) converges to a nonzero direction $\hat s$(∞), the solution of
\[ \min\ \big\{ \tfrac12\,\|s\|^2 \ :\ s \in \operatorname{co}\,\{ s_1, \dots, s_\ell \} \big\}\,. \tag{3.3.4} \]
To sum up, consider the following hybrid way of computing the next iterate y⁺ in Algorithm 3.1.4.
- If the solution $\hat s$(∞) of (3.3.4) is nonzero, make a line-search along −$\hat s$(∞).
- If $\hat s$(∞) = 0, take the limit as t → +∞ of y(t) described by (3.3.1); in this case too, a line-search can be made to cope with a y(∞) which is very far, and hence has little value.
The essence of this algorithm is to give a meaning to the cutting-plane iteration as much as possible. Needless to say, it should not be recommended.
Remark 3.3.3 Having thus established a connection between conjugate subgradients and cutting planes, another observation can be made. Suppose that f is quadratic; we saw in §XIV.4.3 that, taking the directions of conjugate subgradients, and making exact line-searches along each successive such direction, we obtained the ordinary conjugate-gradient algorithm of §II.2.4. Keeping in mind Remark 1.1.1, we are bound to notice mysterious relationships between piecewise affine and quadratic models of convex functions. □
(c) On-Line Control of the Stepsize. Our development (a)-(b) above clearly shows the difficulty of choosing the stepsize in Algorithm 3.1.4: small [resp. large] values result in some form of steepest descent [resp. cutting plane]; both are disastrous in practice. Once again, this difficulty is not new, it was with us all the way through §XIV.4. To guide our choice, an on-line strategy can be suggested, based on the trust-region principles explained in §1.3, and combining the present primal motivations with the dual viewpoint of Chap. XIV. For this, it suffices to follow the general scheme of Fig. XIV.3.2.1, say, but with the direction recomputed each time the stepsize is changed.
Thus, the idea is to design a test (O), (R), (L) which, upon observation of the actual objective function at the solution y(t) of (3.3.1), (3.3.2), decides:
(O_d) This solution is convenient for a descent-step,
(O_n) or this solution is convenient for a null-step.
(R) This solution corresponds to a t too large.
(L) This solution corresponds to a t too small.
No matter how this test is designed, it will result in the following pattern:
Algorithm 3.3.4 (Curved-Search Pattern, Nonsmooth Case) The data are the initial x, the model φ, and an initial t > 0. Set t_L = t_R = 0.
STEP 1. Solve (3.3.1), (3.3.2) and compute f(y(t)) and s(y(t)) ∈ ∂f(y(t)). Apply the test (O), (R), (L).
STEP 2 (Dispatching). In case (O) stop the line-search, with either a null- or a descent-step. In case (L) [resp. (R)] set t_L = t [resp. t_R = t] and proceed to Step 3.
4 Bundle Methods as Regularizations
Consider one iteration of the primal bundle method of §3. Given the current iterate, model and stepsize, we minimize
\[ \varphi(y) + \frac1{2t}\,\|y - x\|^2 \]
with respect to y ∈ ℝⁿ (assuming an unconstrained situation for simplicity). Now the above optimal value can be viewed as a function of x ∈ ℝⁿ, and we recognize in this function the Moreau-Yosida regularization of φ, already seen on several occasions.
In fact, bundle methods and Moreau-Yosida regularization are intimately related and the aim of this section is to explore this relation.
In this subsection, we collect and complete some results previously given concerning the Moreau-Yosida regularization, seen from the point of view of convex minimization. In contrast with the previous sections, we consider now a general closed convex function: in what follows,
f ∈ Conv ℝⁿ and M is a symmetric positive definite operator.
The function
\[ \mathbb R^n \ni x \ \mapsto\ f_M(x) := \min_{y\in\mathbb R^n}\ \big[ f(y) + \tfrac12\langle M(y - x),\, y - x\rangle \big] \tag{4.1.1} \]
is the Moreau-Yosida regularization of f.
Lemma 4.1.1 The minimization problem in (4.1.1) has a unique solution, characterized as the unique point y ∈ ℝⁿ satisfying
\[ M(x - y) \ \in\ \partial f(y)\,. \tag{4.1.2} \]
PROOF. For each x, the minimand g(x, ·) is a strictly convex function; as such, it has at most one minimum. On the other hand, f is minorized by some affine function; g(x, ·) is therefore 1-coercive and, being also closed, it does have one minimum.
Now, because the quadratic term in g is finite everywhere, the calculus rule XI.3.1.1 on the sum of two convex functions applies, and the subdifferential of g(x, ·) at y is
\[ \partial_y g(x, y) \ =\ \partial f(y) + M(y - x) \]
(with the convention ∅ + {s} = ∅ for all s). Thus (4.1.2) represents the necessary and sufficient minimality condition 0 ∈ ∂_y g(x, y), and has a solution which is the unique minimizer of g(x, ·). □
Note here that the convexity of f is important but not its closedness. The function g(x, .)
would still be strongly convex even if f were not closed; and all minimizing sequences would
have the same unique limit point. Then nothing would be essentially changed if f were
replaced by cl f, in (4.1.1) and (4.1.2) as well.
Definition 4.1.2 (Proximal Point) We will extensively use the following system of notation:
\[ p_M(x) := \operatorname*{argmin}_y\ \big[ f(y) + \tfrac12\langle M(y - x),\, y - x\rangle \big] \]
is called the proximal point of x (associated with f and M); x can be called the proximal center;
\[ s_M(x) := M\big(x - p_M(x)\big) \ \in\ \partial f\big(p_M(x)\big) \tag{4.1.3} \]
is the particular subgradient of f at p_M(x) defined via Lemma 4.1.1; we set W := M⁻¹,
Interpretation 4.1.3 The operator M defines the scalar product ⟨⟨x, x′⟩⟩ := ⟨Mx, x′⟩, with its associated norm |||x||| := √⟨⟨x, x⟩⟩. For this norm, p_M(x) is the projection of x onto a certain sublevel-set of f, namely the one at the level f(p_M(x)). Indeed take y such that f(y) ≤ f(p_M(x)); combining with the subgradient inequality
Note the similarity between this construction and the level-variant of §2.3. Note also from Lemma 4.1.1 that p_M(x) may be on the boundary of dom f but, even in this case, f has a nonempty subdifferential at p_M(x). □
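For f = |·| on ℝ and M = μ (a 1×1 operator), everything is explicit and can be checked numerically: under these assumptions (the example is ours), p_M is the soft-threshold, f_M is the Huber function, and ∇f_M(x) = M(x − p_M(x)) = s_M(x).

```python
# Proximal point and Moreau-Yosida envelope of f = |.| on R, with M = mu.
import numpy as np

mu = 2.0
def p_M(x):                    # argmin_y |y| + (mu/2)(y - x)^2: soft-threshold
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / mu, 0.0)

def f_M(x):                    # the optimal value in (4.1.1)
    y = p_M(x)
    return np.abs(y) + 0.5 * mu * (y - x) ** 2

x = np.linspace(-3, 3, 7)
s_M = mu * (x - p_M(x))        # (4.1.3): a subgradient of f at p_M(x)
h = 1e-6                       # check grad f_M = s_M by central differences
num = (f_M(x + h) - f_M(x - h)) / (2 * h)
print(np.max(np.abs(num - s_M)))    # ~ 0: f_M is differentiable with grad s_M
```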
(4.1.5)
Its conjugate is
\[ \mathbb R^n \ni s \ \mapsto\ (f_M)^*(s) \ =\ f^*(s) + \tfrac12\langle s,\, Ws\rangle \tag{4.1.6} \]
and
Note that f_M can also be viewed as a marginal function associated with the minimand g (convex in (x, y) and differentiable in x); ∇f_M can therefore be derived from (XI.3.3.7); see also Corollary VI.4.5.3 and the comments following it. The global Lipschitz property of ∇f_M can also be proved with Theorem X.4.2.1.
PROOF. There exists an affine function minorizing f: for some (s₀, r₀) ∈ ℝⁿ × ℝ,
(4.1.10)
so that
\[ \langle s_0,\, p_M(x)\rangle - r_0 + \tfrac12\,\lambda_{\min}(M)\,\|p_M(x) - x\|^2 \ \le\ f_M(x)\,. \]
With some algebraic manipulations, we obtain
(4.1.11)
PROOF. Use (4.1.4): the first relation is the definition of f_M; the second is the subgradient inequality. □
Theorem 4.1.7 Minimizing f and f_M are equivalent problems, in the sense that
\[ \inf_{\mathbb R^n} f \ =\ \inf_{\mathbb R^n} f_M \]
(an equality in ℝ ∪ {−∞}), and that the following statements are equivalent:
(i) x minimizes f;
(ii) p_M(x) = x;
(iii) s_M(x) = 0;
(iv) x minimizes f_M;
(v) f(p_M(x)) = f(x);
(vi) f_M(x) = f(x).
The proof establishes the chain
\[ \text{(i)} \Rightarrow \text{(ii)} \Leftrightarrow \text{(iii)} \Leftrightarrow \text{(iv)} \Leftrightarrow \text{(v)} \Rightarrow \text{(vi)}\,. \]
Finally assume that (vi) holds; use (4.1.13) again to see that (iii) = (ii) holds; then
Theorem 4.1.7 gives a number of equivalent formulations for the problem of minimizing f. Among them, (ii) shows that we must find a fixed point of the mapping x ↦ p_M(x), which resembles a projection: see Interpretation 4.1.3. As such, p_M is nonexpansive (for the norm associated with ⟨⟨·, ·⟩⟩) and the iteration formula x_{k+1} = p_M(x_k) appears as a reasonable proposal.
This is known as the proximal point algorithm. Note that this algorithm can be formulated with a varying operator M, say x_{k+1} = p_{M_k}(x_k). In terms of minimizing f, this still makes sense, even though the Moreau-Yosida interpretation disappears.
Algorithm 4.2.1 (Proximal Point Algorithm) Start with an initial point x₁ ∈ ℝⁿ and an initial symmetric positive definite operator M₁. Set k = 1.
STEP 1. Compute the unique solution y = x_{k+1} of
\[ M_k(x_k - y) \ \in\ \partial f(y)\,. \tag{4.2.1} \]
here the notation s_k replaces the former s_{M_k}; furthermore (4.1.13) gives the corresponding decrease of f. The convergence condition is
\[ \sum_{k=1}^{\infty} \lambda_{\min}(W_k) \ =\ +\infty\,. \tag{4.2.3} \]
It rules out unduly small [resp. large] eigenvalues of W_k [resp. M_k].
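When f is a positive definite quadratic, Step 1 is a linear solve, and the whole iteration can be sketched in a few lines (the matrix and the choice μ_k = 1/k are ours, for illustration only):

```python
# Proximal point iteration x_{k+1} = p_{M_k}(x_k) for f(x) = (1/2) x^T A x:
# (4.2.1) reads M_k (x_k - y) = A y, i.e. y = (A + M_k)^{-1} M_k x_k.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])       # positive definite
x = np.array([4.0, -3.0])
for k in range(1, 40):
    M = (1.0 / k) * np.eye(2)                # W_k = k I, so (4.2.3) holds
    x = np.linalg.solve(A + M, M @ x)        # the unique solution of (4.2.1)
print(x)                                     # tends to the minimizer 0
```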
Lemma 4.2.2 Assume that f* is a finite number. Then Σ_{k=1}^∞ δ_k < +∞ and (4.2.3) implies that 0 is a cluster point of {s_k}.
and by summation
\[ \sum_{k=1}^{\infty} \Big[ \delta_k + \tfrac12\,\lambda_{\min}(W_k)\,\|s_k\|^2 \Big] \ \le\ f(x_1) - f^*\,. \]
If the right-hand side is finite, the two series Σδ_k and Σλ_min(W_k)‖s_k‖² are convergent. If the convergence condition (4.2.3) holds, ‖s_k‖² cannot stay away from zero. □
Theorem 4.2.3 In Algorithm 4.2.1, assume that the convergence condition (4.2.3)
holds. If {Xk} is bounded, all its cluster points are minimizers of f.
Theorem 4.2.4 Assume that M_k = μ_kM, with μ_k > 0 for all k, and M symmetric positive definite.
(i) If the convergence condition (4.2.3) holds, i.e. if
\[ \sum_{k=1}^{\infty} \frac1{\mu_k} \ =\ +\infty\,, \]
PROOF. Our proof is rather quick, because the situation is in fact as in Theorem 3.2.2. The usual transportation trick in the subgradient inequality
gives
\[ f(y) \ \ge\ f(x_k) + \langle M_k(x_k - x_{k+1}),\, y - x_k\rangle - \delta_k\,, \tag{4.2.4} \]
where, using (4.2.2),
Denoting by ⟨⟨u, v⟩⟩ the scalar product ⟨Mu, v⟩ (|||·||| will be the associated norm; note that both are independent of k), we write (4.2.4) as
and obtain
Another interesting particular case is when the proximal point Algorithm 4.2.1
does terminate at Step 2 with Sk = 0 and an optimal Xk.
Proposition 4.2.5 Assume that f has a finite infimum and satisfies the following property:
\[ \exists\,\eta > 0 \ \text{ such that }\ \partial f(x) \cap B(0, \eta) \ne \emptyset \ \Longrightarrow\ x \text{ minimizes } f\,. \tag{4.2.5} \]
If the convergence condition (4.2.3) holds, the stop in Algorithm 4.2.1 occurs for some k.
PROOF. Lemma 4.2.2 guarantees that the event ‖s_k‖ ≤ η eventually occurs, implying optimality of x_{k+1}. From Theorem 4.1.7, the algorithm will stop at the next iteration. □
Observe the paradoxical character of the above statement: finite termination does not match divergence of the series Σλ_min(W_k)! Remember that the correct translation of (4.2.3) is: the matrices W_k are computed in such a way that
\[ \forall R \ge 0,\ \exists\, k(R) \in \mathbb N^* \ \text{ such that }\ \sum_{k=1}^{k(R)} \lambda_{\min}(W_k) \ \ge\ R\,. \]
We have purposely used a somewhat sloppy statement, in order to stress once more that properties resembling (4.2.3) have little relevance in reality.
Remark 4.2.6 The meaning of (4.2.5) deserves comment. In words, it says that all subgradients of f at all nonoptimal points are uniformly far from 0. In particular, use the notation s(x) := Proj 0 | ∂f(x) for the subgradient of f at x that has least Euclidean norm. If f satisfies (4.2.5), s(x) is either 0 (x optimal) or larger than η in norm (x not optimal). In the latter case, f can be decreased locally from x at a rate at least η: simply move along the steepest-descent direction −s(x). We say in this case that f has a sharp set of minima.
Suppose for example that f is piecewise affine:
\[ s_J := \operatorname{Proj}\ 0 \,\big|\, \{ s_j : j \in J \} \]
to obtain 2^m − 1 vectors s_J; all the possible s(x) are taken from this finite list. Setting
In this subsection, x and M are fixed; they can be for example Xk and Mk at the
current iteration of the proximal point Algorithm 4.2.1. We address the problem of
minimizing the perturbed objective function of (4.1.1)
- The basic cutting-plane algorithm of §XII.4.2 is still impeded by the need for com-
pactness.
- Bundle methods will take cutting-plane approximations for g(x, .), introduce a sta-
bilizing quadratic term, and solve the resulting quadratic program. This is somewhat
redundant: a quadratic perturbation has already been introduced when passing from
f to g(x, .).
- Then a final possibility suggests itself, especially if we remember Remark 3.2.6(ii):
apply the cutting-plane mechanism to f only, obtaining a "hybrid" bundle method
in penalized form.
- This last technique can also be seen as an "auto-stabilized" cutting-plane method,
in which the quadratic term in the objective function g(x,·) is kept as a stabilizer
(preconditioned by M). Because no artificial stabilizer is introduced, the stability
center x must not be moved and the descent test must be inhibited, so as to take
enough "null-steps", until fM(X) = ming(x, .) is reached. Actually, the quadratic
term in g plays the role of the artificial C introduced for (1.0.2).
For consistency with §3, the index k will still denote the current iteration of the resulting algorithm. The key is then to replace f in the proximal problem (4.1.1) by a model-function φ_k satisfying
\[ \varphi_k \ \le\ f\,; \tag{4.3.1} \]
also, φ_k is piecewise affine, but this is not essential. Then the next iterate is the proximal point of x associated with φ_k:
\[ y_{k+1} := \operatorname*{argmin}_y\ \big[ \varphi_k(y) + \tfrac12\langle M(y - x),\, y - x\rangle \big]\,, \tag{4.3.2} \]
or equivalently, via Lemma 4.1.1,
\[ M(x - y_{k+1}) \ \in\ \partial\varphi_k(y_{k+1})\,. \tag{4.3.3} \]
The corresponding subgradient and aggregate linearization
\[ \hat s_k := M(x - y_{k+1})\,, \qquad \bar f_k(y) := \varphi_k(y_{k+1}) + \langle \hat s_k,\, y - y_{k+1}\rangle \tag{4.3.4} \]
will be useful for the next model φ_{k+1}. With respect to §4.2, beware that $\hat s_k$ does not refer to the proximal point of some varying x_k associated with f: here the proximal center x is fixed, it is the function φ_k that is varying.
We obtain an algorithm which is implementable, in the sense that it needs only a black box (U1) that, given y ∈ ℝⁿ, computes f(y) and s(y) ∈ ∂f(y).
Algorithm 4.3.1 (Hybrid Bundle Method) The preconditioner M and the proximal center x are given. Choose the initial model φ₁ ≤ f, a convex function which can be for example
\[ y \ \mapsto\ \varphi_1(y) = f(x) + \langle s(x),\, y - x\rangle\,, \]
and initialize k = 1.
STEP 1. Compute y_{k+1} from (4.3.2) or (4.3.3).
STEP 2. If f(y_{k+1}) = φ_k(y_{k+1}) stop.
STEP 3. Update the model to any convex function φ_{k+1} satisfying (4.3.1), as well as
\[ \varphi_{k+1} \ \ge\ \bar f_k \tag{4.3.5} \]
and
\[ \varphi_{k+1} \ \ge\ f(y_{k+1}) + \langle s(y_{k+1}),\, \cdot - y_{k+1}\rangle\,. \tag{4.3.6} \]
Replace k by k + 1 and loop to Step 1. □
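A one-dimensional sketch of the resulting inner loop follows (the function, center, penalty and grid-based resolution of (4.3.2) are our own choices): the trial points converge to the proximal point of x, here the soft-threshold of x.

```python
# Algorithm 4.3.1 in 1-D: cutting planes for f(y) = |y|, fixed center x,
# M = mu; the iterates tend to p_M(x).
import numpy as np

f = abs
s = lambda y: 1.0 if y >= 0 else -1.0
mu, x = 2.0, 0.2
cuts = [(f(x), s(x), x)]                     # initial model phi_1
grid = np.linspace(-2.0, 2.0, 40001)

for k in range(60):
    ph = np.max([v + sl * (grid - p) for (v, sl, p) in cuts], axis=0)
    j = int(np.argmin(ph + 0.5 * mu * (grid - x) ** 2))   # (4.3.2)
    y = float(grid[j])
    if f(y) - ph[j] <= 1e-9:                 # Step 2: f = phi_k at y_{k+1}
        break
    cuts.append((f(y), s(y), y))             # enforces (4.3.6); (4.3.5)
                                             # holds since no cut is deleted
p_exact = np.sign(x) * max(abs(x) - 1.0 / mu, 0.0)
print(y, p_exact)                            # y ~ p_M(x) = 0
```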
This algorithm is of course just a form of Algorithm 3.1.4, with a few modifications:
- C = ℝⁿ; a simplification which yields $\hat p_k$ = 0.
- The notation is "bundle-free": no list of affine functions is assumed, and the successive models are allowed more general forms than piecewise affine; only the essential properties (4.3.1), (4.3.5), (4.3.6) are retained.
- The stabilizing operator M is fixed but is not proportional to the identity (a negligible generalization).
- The proximal center x_k = x is never updated; this can be simulated by taking a very large m > 0 in (3.1.7).
- The stopping criterion in Step 2 has no reason to ever occur; in view of (4.3.1), it means that f coincides with its model φ_k at the "trial proximal point" y_{k+1}.
In practice, Step 2 will rather use the test
\[ f(y_{k+1}) - \varphi_k(y_{k+1}) \ \le\ \underline\delta \tag{4.3.7} \]
for some positive tolerance $\underline\delta$. The whole issue for convergence is indeed the property f(y_{k+1}) − φ_k(y_{k+1}) → 0, which we establish with the tools of §3.2. First of all, remember that nothing is changed if the objective function of (4.3.2) is a posteriori replaced by
\[ y \ \mapsto\ \gamma_k(y) := \bar f_k(y) + \tfrac12\langle M(y - x),\, y - x\rangle\,. \]
PROOF. Use the simple identity, valid for all u, v in IRn and symmetric M:
Theorem 4.3.4 For f : ℝⁿ → ℝ convex and M symmetric positive definite, the sequence {y_k} generated by Algorithm 4.3.1 satisfies
\[ y_k \ \to\ p_M(x)\,. \]
PROOF. We assume that the stop never occurs, otherwise Proposition 4.3.2 applies. In
view of (4.3.5),

    ℓ_{k−1}(y_{k+1}) ≤ φ_k(y_{k+1}) for all k > 1 .

Add ½⟨M(y_{k+1} − x), y_{k+1} − x⟩ to both sides and use Lemma 4.3.3 with k replaced
by k − 1 to obtain

    γ_{k−1}(y_k) + ½⟨M(y_{k+1} − y_k), y_{k+1} − y_k⟩ ≤ γ_k(y_{k+1}) ≤ f(x) for k > 1 .

Subtracting f(y_{k+1}) from both sides and using a Lipschitz constant L for f around
x, (4.3.8) follows.
Finally extract a convergent subsequence from {y_k}; more precisely, let K ⊂ ℕ
be such that y_{k+1} → y* for k → +∞ in K. Then φ_k(y_{k+1}) → f(y*) and, passing
to the limit in the subgradient inequality
which can be viewed as a descent criterion. Under these circumstances, we can think of
a δ depending on k, and comparable to f(x) − φ_k(y_{k+1}) (a number which is available
when Step 2 is executed). For example, with some coefficient m ∈ ]0, 1[, we can take

    δ_k := (1 − m)[f(x) − φ_k(y_{k+1})] ,

so that (4.3.7) becomes f(y_{k+1}) ≤ f(x) − m[f(x) − φ_k(y_{k+1})].
We recognize a descent test used in bundle methods: see more particularly the trust-
region Algorithm 2.1.1. The descent iteration of such a bundle method will then update
the current stability center x = x_k to this "approximate proximal point" y_{k+1}.
Conclusion: seen from the proximal point of view of this Section 4, bundle meth-
ods provide three ingredients:
- an inner algorithm to compute a proximal point, based on cutting-plane approxima-
tions of f;
- a way of stopping this internal algorithm dynamically, when a sufficient decrease is
obtained for the original objective function f;
- an outer algorithm to minimize f, using the output of the inner algorithm to mimic
the proximal point formula.
Furthermore, each inner algorithm can use for its initialization the work performed
during the previous outer iteration: this is reflected in the initial φ_1 of Algorithm 4.3.1,
in which all the cutting planes computed from the very beginning of the iterations can
be accumulated.
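The three ingredients just listed, together with this warm-start remark, can be
assembled into a schematic driver. The sketch below is again ours (M = μI, a bundle
of cutting planes for the model, epigraph subproblem solved by scipy's SLSQP); the
descent test with coefficient m ∈ ]0, 1[ stops each inner phase dynamically, choosing
between moving the stability center and a null-step.

import numpy as np
from scipy.optimize import minimize

def prox_step(planes, x, mu):
    # Inner ingredient: proximal point of x for the cutting-plane model
    # max_i (a_i + <b_i, y>), computed in epigraph form.
    n = len(x)
    obj = lambda z: z[n] + 0.5 * mu * np.dot(z[:n] - x, z[:n] - x)
    cons = [{'type': 'ineq',
             'fun': (lambda z, a=a, b=b: z[n] - a - b @ z[:n])}
            for (a, b) in planes]
    z0 = np.append(x, max(a + b @ x for (a, b) in planes))
    res = minimize(obj, z0, constraints=cons, method='SLSQP')
    return res.x[:n], res.x[n]          # trial point y and model value phi(y)

def bundle_minimize(f, subgrad, x, mu=1.0, m=0.2, tol=1e-6, max_iter=200):
    x = np.asarray(x, dtype=float)
    s = subgrad(x)
    planes = [(f(x) - s @ x, s)]        # the bundle, kept across outer iterations
    for _ in range(max_iter):
        y, phi_y = prox_step(planes, x, mu)
        nominal = f(x) - phi_y          # nominal decrease, >= 0 since phi <= f
        if nominal <= tol:
            return x                    # x is (approximately) optimal
        s = subgrad(y)
        planes.append((f(y) - s @ y, s))   # enrich the model in any case
        if f(y) <= f(x) - m * nominal:     # descent test: sufficient decrease
            x = y                          # move the stability center
        # otherwise null-step: same center, refined model
    return x

# Example on the nonsmooth convex function f(y) = |y_0| + (y_1 - 1)^2.
if __name__ == "__main__":
    f = lambda y: abs(y[0]) + (y[1] - 1.0) ** 2
    g = lambda y: np.array([np.sign(y[0]), 2.0 * (y[1] - 1.0)])
    print(bundle_minimize(f, g, np.array([3.0, -2.0])))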
This explains our belief expressed at the very end of §2.5: the penalized form is
probably the most interesting variant among the possible bundle methods.
Bibliographical Comments
Just as we did with the first volume, let us repeat that [159] is a must for convex
analysis in finite dimension. On the other hand, we recommend [89] for an exhaustive
account of bundle methods, with the most refined techniques concerning convergence,
and also various extensions.
Chapter IX. The technique giving birth to the bundling mechanism can be contrasted
with an old separation algorithm going back to [1]. The complexity theory that is
alluded to in §2.2(b) comes from [133] and our Counter-example 2.2.4 is due to
A.S. Nemirovskij.
It is important to know that this bundling mechanism can be extended to noncon-
vex functions without major difficulty (at least theoretically). The approach, pioneered
in [55], goes as follows. First, one considers locally Lipschitzian functions; denote
by ∂f(x) the Clarke generalized gradient of such a function f at x ([36, 37]). This
is just the set co γf(x) of §VI.6.3(a) and would be the subdifferential of f at x if f
were convex. Assume that some s(y) ∈ ∂f(y) can be computed at each y, just as in
the convex case. Given a direction d (= d_k), the line-search 1.2 is performed and the
interesting situation is when t ↓ 0, with no descent obtained; this may happen when

    limsup_{t↓0} [f(x + td) − f(x)] / t ≥ 0 .                        (*)
As explained in §1.3(a), the key is then to produce a cluster point s* of {s(x + td)}_{t↓0}
satisfying ⟨s*, d⟩ ≥ 0; to be on the safe side, we need

    liminf_{t↓0} ⟨s(x + td), d⟩ ≥ 0 .                                (**)
If f is convex, (*) automatically implies (**). If not, the trick is simply to set the
property "(*) ⇒ (**)" as an axiom restricting the class of Lipschitz functions that
can be minimized by this approach.
A possible such axiom is for example

    limsup_{t↓0} [f(x + td) − f(x)] / t ≤ liminf_{t↓0} ⟨s(x + td), d⟩ ,

resulting in the rather convenient class of semi-smooth functions of R. Mifflin [122].
On the other hand, a minimal requirement allowing the bundling mechanism to work
is obtained if one observes that, in the actual line-search, the sequences {tk} giving
f (x + tkd) in (*) and s (x + tkd) in (**) are just the same. What is needed is simply
the axiom defined in [24]:
    limsup_{t↓0} [ (f(x + td) − f(x)) / t − ⟨s(x + td), d⟩ ] ≤ 0 .
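As a toy numerical illustration (ours, not taken from [24]), one can monitor the
bracketed quantity along t ↓ 0; for the convex function f(x) = |x|, at x = 0 and
d = 1, it vanishes identically:

import numpy as np

f = lambda x: abs(x)
s = lambda x: float(np.sign(x)) if x != 0 else 1.0   # a subgradient selection
x, d = 0.0, 1.0
for t in [1e-1, 1e-3, 1e-6, 1e-9]:
    gap = (f(x + t * d) - f(x)) / t - s(x + t * d) * d
    print(t, gap)   # the bracketed quantity of the axiom: here exactly 0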
Convex quadratic programming problems are usually solved by one universal
pivoting technique, see [23]. The starting idea is that a solution can be explicitly
computed if the set of active constraints is known (solve a system of linear equations).
The whole technique is then iterative:
- at each iteration, a subset J of the inequality constraints is selected;
- each constraint in J is replaced by an equality, the other constraints being discarded;
- this produces a point x_J, whose optimality is tested;
- if x_J is not optimal, J is updated and the process is repeated until a correct J* is
identified, yielding a solution of the original problem.
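In outline, one iteration amounts to assembling and solving the KKT system of linear
equations associated with the current guess J. The following naive sketch is ours:
it ignores the step-length control that actual implementations of [23] perform (and
may therefore cycle on degenerate problems), but it shows the scheme for
min ½⟨x, Qx⟩ + ⟨c, x⟩ subject to Ax ≤ b, with Q symmetric positive definite:

import numpy as np

def active_set_qp(Q, c, A, b, max_iter=50):
    # Constraints indexed by J are treated as equalities; x_J and the
    # multipliers lam then solve [Q A_J'; A_J 0](x, lam) = (-c, b_J).
    # Optimality test: lam >= 0 and A x <= b; otherwise J is updated.
    n = Q.shape[0]
    J = set()
    for _ in range(max_iter):
        rows = sorted(J)
        if rows:
            AJ = A[rows]
            K = np.block([[Q, AJ.T],
                          [AJ, np.zeros((len(rows), len(rows)))]])
            sol = np.linalg.solve(K, np.concatenate([-c, b[rows]]))
            x, lam = sol[:n], sol[n:]
        else:
            x, lam = np.linalg.solve(Q, -c), np.array([])
        viol = A @ x - b
        if lam.size and lam.min() < -1e-10:
            J.remove(rows[int(np.argmin(lam))])   # a wrong equality: discard it
        elif viol.max() > 1e-10:
            J.add(int(np.argmax(viol)))           # most violated constraint joins J
        else:
            return x, J                           # the correct J* is identified
    raise RuntimeError("no convergence (naive sketch)")

The quadratic programs produced by bundle methods (a quadratic objective minimized
over finitely many cutting-plane constraints) have exactly this structure.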
Chapter X. The closed convex hull, or biconjugacy, of a function arises naturally in
variational problems ([49, 81]): minimizing an objective function like ∫ l(t, x(t), ẋ(t)) dt
is related to the minimization of the "relaxed" form ∫ co l(t, x(t), ẋ(t)) dt, where co l
denotes the closed convex hull of the function l(t, x, ·). The question leading to
Corollary 1.5.2 was answered
in [43], in a calculus of variations framework; a short and pedagogical proof can be
found in [74]. Lemma 1.5.3 is due to M. Valadier [180, p.69]; the proof proposed
here is more legible and detailed. The results of Proposition 1.5.4 and Theorem 1.5.5
appeared in [64].
The calculus rules of §2 are all rather classical; two of them deserve a particular
comment: post-composition with an increasing convex function, and maximum of
functions. The first is treated in [94] for vector-valued functions. As for max-functions,
we limited ourselves here to finitely many functions but more general cases are treated
similarly. In fact, consider the following situation (cf. §VI.4.4): T is compact in some
metric space; f : T × ℝⁿ → ℝ satisfies

    f(t, ·) =: f_t is a convex function from ℝⁿ to ℝ for all t ∈ T,
    f(·, x) is upper semi-continuous on T for all x ∈ ℝⁿ;

hence f(x) := max_{t∈T} f_t(x) < +∞ for all x ∈ ℝⁿ.
As already said in §VI.4.4, f is then jointly upper semi-continuous on T × ℝⁿ, so that
f* : (t, s) ↦ f*(t, s) := (f_t)*(s) is jointly lower semi-continuous and it follows
that φ := min_{t∈T} f*(t, ·) is lower semi-continuous; besides, φ is also 1-coercive
since φ* = f is finite everywhere. Altogether, the results of §1.5 can be applied to
φ: with the help of Proposition 1.5.4, f* = co φ = cl co φ can be expressed in a way
very similar to Theorem 2.4.7. We have here an alternative technique to compute the
subdifferential of a max-function (Theorem VI.4.4.2).
The equivalence (i) ⇔ (ii) of Theorem 3.2.3 is a result of T.S. Motzkin (1935);
our proof is new. Proposition 3.3.1 was published in [120, §II.3]. Our Section 4.2(b) is
partly inspired from [63], [141]: Corollary 4.2.9 can be found in [63], while J.-P. Penot
was more concerned in [141] by the "one-sided" aspect suitable for the unilateral world
of convex analysis, even though all his functions φ_s were convex quadratic forms. For
Corollary 4.2.10, we mention [39]; see also [170], where this result is illustrated in
various domains of mathematics.
Chapter XI. Approximate subgradients appeared for the first time in [30, §3]. They
were primarily motivated by topological considerations (related to their regularization
properties), and the idea of using them for algorithmic purposes came in [22]. In [137,
138, 139], E.A. Nurminskii used (1.3.5) to design algorithms for convex minimization;
see also [111], and the considerations at the end of [106]. Approximate subgradients
also have intriguing applications in global optimization. Let f and g be two functions
in Conv ℝⁿ. Then x minimizes (globally) the difference f − g if and only if ∂_ε g(x) ⊂
∂_ε f(x) for all ε > 0. This result was published in [75] and, to prove its sufficiency
part, one can for example start from (1.3.7).
The characterization (1.3.4) of the support function of the graph of the multi-
function ε ↦ ∂_ε f(x) also appeared in [75]. The representation of closed convex
functions via approximate directional derivatives (Theorem 1.3.6) was published in-
dependently in [73, §2] and [111, §1]. As for the fundamental expression (2.1.1) of the
approximate directional derivative, it appeared in [131, p. 67] and [159, pp. 219-220],
but with different proofs. The various properties of approximate difference quotients,
detailed in our §2.2, come from [73, §2].
Concerning approximate subdifferential calculus (§3), let us mention that general
results, with vector-valued functions, had been announced in [95]; the case of convex
functions with values in ℝ ∪ {+∞} is detailed in the survey paper [72], which inspired
our exposition. Qualification assumptions can be avoided to develop calculus rules of
a different nature, using the (richer) information contained in {∂_η f_i(x) : 0 < η ≤ η̄},
instead of the mere ∂_ε f_i(x). For example, with f₁ and f₂ in Conv ℝⁿ,

    ∂(f₁ + f₂)(x) = ⋂_{0<η≤η̄} cl [∂_η f₁(x) + ∂_η f₂(x)] ,  where η̄ > 0 is arbitrarily small.
This formula emphasizes once more the smoothing or "viscosification" effect of ap-
proximate subdifferentials. For a more concrete development of approximate mini-
mality conditions, alluded to in §3.1 and the end of §3.6, see [175].
The local Lipschitz property of ∂_ε f(·), with fixed ε > 0, was observed for the
first time in [136]; the overall formalization of this result, and the proof used here,
were published in [71]. Passing from a globally Lipschitzian convex function to an ar-
bitrary convex function was anyway the motivation for introducing the regularization-
approximation technique via the inf-convolution with ε‖·‖ ([71, §2]). Theorem 4.2.1
comes from [30]. The transportation formula (4.2.2), and the neighborhoods of (4.2.3),
were motivated by the desire to make the algorithm of [22] implementable (see [100]).
Chapter XII. References to duality without convexity assumptions are not so com-
mon; we can mention [32] (especially for §3.1), [52], [65]; see also [21]. However,
there is a wealth of works dealing with Lagrangian relaxation in integer program-
ming; a milestone in this subject was [68], which gave birth to the test-problems TSP
of IX.2.2.7.
The subgradient method of §4.1 was discovered by N.Z. Shor in the beginning
of the sixties, and its best account is probably that of [145]; for more recent de-
velopments, and in particular accelerations by dilation of the space, see [171]. Our
proof of convergence is copied from [134]. The original references to cutting planes
are the two independent papers [35] and [84]; their column-generation variant in a
linear-programming context appeared in [40].
The views expressed in §5 go along those of [163]. The most complete theory of
augmented Lagrangians is given in the works of D.P. Bertsekas, see for example [20].
Dualization schemes when constraints take their values in a cone are also explained
in [114]. Several authors have related Fenchel and Lagrange duality schemes. Our
approach is inspired from [115].
Chapters XIII and XIV. The ε-descent algorithm, going back to [100], is the ancestor
of bundle methods. It was made possible thanks to the work [22], which was done at
the same time as the speculations detailed in Chap. IX; the latter were motivated by a
particular economic lot-sizing problem of the type XII.1.2(b), coming from the glass
industry. This observation points out once more how applied mathematics can be a
delicate subject: the publication of purely theoretical papers like [22] may sometimes
be necessary for the resolution of apparently innocent applied problems. Using an
idea of [116], the method can be extended to the resolution of variational inequalities.
Then came the conjugate-subgradient form (§XIV.4.3) in [101] and [188], and
Algorithm XIV.3.4.2 was introduced soon after in [102]. From then on, the main works
concerning these methods dealt with generalizations to the nonconvex case and the
treatment of constraints, in particular by R. Mifflin [123]. At that time, the similarity
with conjugate gradients (Remark II.2.4.6) was felt as an argument in favour of conjugate
subgradients; but we now believe that this similarity is incidental. We mention here that
conjugate gradients might well become obsolete for "smooth" optimization anyway,
see [61, 113] among others. In fact, it is probably with interior point methods, as in
in [62], that the most promising connections of bundle methods remain to be explored.
Chapter XV. Primal bundle methods appeared in [103], after it was realized from
[153] that (dual) bundle methods were connected with sequential quadratic program-
ming (see the bibliographical comments of Chap. VIII). R. Mifflin definitely formal-
ized the approach in [124], with an appropriate treatment of non-convexity. In [88],
K.C. Kiwiel gave the most refined proof of convergence and proved finite termination
for piecewise affine functions. Then he proposed a wealth of adaptations to various
situations: noisy data, sums of functions, ... see for example his review [90]. J. Zowe
contributed a lot to the general knowledge of bundle methods [168]. E.A. Nurminskii
gave an interesting interpretation in [137, 138, 139], which can be sketched as follows:
at iteration k,
- choose a distance in the dual graph space;
- choose a certain point of the form p_k = (0, r_k) in this same space;
- choose an approximation E_k of epi f*; more specifically, take the epigraph of (f̌_k)*;
- project p_k onto E_k in the sense of the distance chosen (which is not necessarily a
norm); this gives a vector (s_k, ε_k), and −s_k can be used for a line-search.
The counter-example in §1.1 was published in [133, §4.3.6]; for the calculations
that it needs in Example 1.1.2, see [18] for example. In classical "smooth" optimiza-
tion, there is a strong tendency to abandon line-searches, to the advantage of the
trust-region technique alluded to in § 1.3 and overviewed in [127].
The trust-region variant of §2.1 has its roots in [117]. The level variant of §2.3 is
due to A.S. Nemirovskij and Yu. Nesterov, see [110]. As for the relaxation variant, it
is strongly reminiscent of [146], which it generalizes; see [46] for an account of the Gauss-Newton
and Levenberg-Marquardt methods.
Our convergence analysis in §3 comes from [91]. Several researchers have felt
attracted by the connection between quadratic models and cutting-plane approxima-
tions (Remark 3.3.3): for example [185], [147], [92].
The Moreau-Yosida regularization is due to K. Yosida for maximal monotone
operators, and was adapted in [130] to the case of a subgradient mapping. The idea of
exploiting it for numerical purposes goes back to [14, Chap. V] for solving ill-posed
systems of linear equations. This was generalized in [119] for the minimization of
convex functions, and was then widely developed in primal-dual contexts: [161] and
its derivatives. The connection with bundle methods was realized in the beginning of
the eighties: [59], [11].
References
1. Aizerman, M.A., Braverman, E.M., Rozonoer, L.I.: The probability problem of pattern
recognition learning and the method of potential functions. Automation and Remote
Control 25,9 (1964) 1307-1323.
2. Alexeev, V., Galeev, E., Tikhomirov, V.: Recueil de Problèmes d'Optimisation. Mir,
Moscow (1984).
3. Alexeev, V., Tikhomirov, V., Fomine, S.: Commande Optimale. Mir, Moscow (1982).
4. Anderson Jr., W.N., Duffin, R.J.: Series and parallel addition of matrices. J. Math. Anal.
Appl. 26 (1969) 576-594.
5. Artstein, Z.: Discrete and continuous bang-bang and facial spaces or: look for the
extreme points. SIAM Review 22,2 (1980) 172-185.
6. Asplund, E.: Differentiability of the metric projection in finite-dimensional Euclidean
space. Proc. Amer. Math. Soc. 38 (1973) 218-219.
7. Aubin, J.-P.: Optima and Equilibria: An Introduction to Nonlinear Analysis. Springer,
Berlin Heidelberg (1993).
8. Aubin, J.-P.: Mathematical Methods of Game and Economic Theory. North-Holland
(1982) (revised edition).
9. Aubin, J.-P., Cellina, A.: Differential Inclusions. Springer, Berlin Heidelberg (1984).
10. Auslender, A.: Optimisation, Méthodes Numériques. Masson, Paris (1976).
11. Auslender, A.: Numerical methods for nondifferentiable convex optimization. In: Non-
linear Analysis and Optimization. Math. Prog. Study 30 (1987) 102-126.
12. Barbu, V., Precupanu, T.: Convexity and Optimization in Banach Spaces. Sijthoff &
Noordhoff (1982).
13. Barndorff-Nielsen, O.: Information and Exponential Families in Statistical Theory.
Wiley & Sons (1978).
14. Bellman, R.E., Kalaba, R.E., Lockett, J.: Numerical Inversion of the Laplace Transform.
Elsevier (1966).
15. Ben Tal, A., Ben Israel, A., Teboulle, M.: Certainty equivalents and information mea-
sures: duality and extremal principles. J. Math. Anal. Appl. 157 (1991) 211-236.
16. Berger, M.: Geometry I, II (Chapters 11, 12). Springer, Berlin Heidelberg (1987).
17. Berger, M.: Convexity. Amer. Math. Monthly 97,8 (1990) 650-678.
18. Berger, M., Gostiaux, B.: Differential Geometry: Manifolds, Curves and Surfaces.
Springer, New York (1990).
19. Bertsekas, D.P.: Necessary and sufficient conditions for a penalty method to be exact.
Math. Prog. 9 (1975) 87-99.
20. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic
Press (1982).
21. Bertsekas, D.P.: Convexification procedures and decomposition methods for nonconvex
optimization problems. J. Optimization Th. Appl. 29,2 (1979) 169-197.
22. Bertsekas, D.P., Mitter, S.K.: A descent numerical method for optimization problems
with nondifferentiable cost functionals. SIAM J. Control 11,4 (1973) 637-652.
23. Best, M.J.: Equivalence of some quadratic programming algorithms. Math. Prog. 30,1
(1984) 71-87.
24. Bihain, A.: Optimization of upper semi-differentiable functions. J. Optimization Th.
Appl. 4 (1984) 545-568.
25. Bonnans, J.F.: Théorie de la pénalisation exacte. Modélisation Mathématique et Analyse
Numérique 24,2 (1990) 197-210.
26. Borwein, J.M.: A note on the existence of subgradients. Math. Prog. 24 (1982) 225-228.
27. Borwein, J.M., Lewis, A.: Convexity, Optimization and Functional Analysis. Wiley
Interscience - Canad. Math. Soc. (in preparation).
28. Brenier, Y.: Un algorithme rapide pour le calcul de transformées de Legendre-Fenchel
discrètes. Note aux C.R. Acad. Sci. Paris 308 (1989) 587-589.
29. Brøndsted, A.: An Introduction to Convex Polytopes. Springer, New York (1983).
30. Brøndsted, A., Rockafellar, R.T.: On the subdifferentiability of convex functions. Proc.
Amer. Math. Soc. 16 (1965) 605-611.
31. Brousse, P.: Optimization in Mechanics: Problems and Methods. North-Holland (1988).
32. Cansado, E.: Dual programming problems as hemi-games. Management Sci. 15,9
(1969) 539-549.
33. Castaing, C., Valadier, M.: Convex Analysis and Measurable Multifunctions. Lecture
Notes in Mathematics, vol. 580. Springer, Berlin Heidelberg (1977).
34. Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées.
Note aux C.R. Acad. Sci. Paris 25 (1847) 536-538.
35. Cheney, E.W., Goldstein, A.A.: Newton's method for convex programming and Tcheby-
cheff approximation. Numer. Math. 1 (1959) 253-268.
36. Clarke, F.H.: Generalized gradients and applications. Trans. Amer. Math. Soc. 205
(1975) 247-262.
37. Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley & Sons (1983), reprinted
by SIAM (1990).
38. Crandall, M.G., Ishii, H., Lions, P.-L.: User's guide to viscosity solutions of second
order partial differential equations. Bull. Amer. Math. Soc. 27,1 (1992) 1-67.
39. Crouzeix, J.-P.: A relationship between the second derivative of a convex function and
of its conjugate. Math. Prog. 13 (1977) 364-365.
40. Dantzig, G.B., Wolfe, P.: A decomposition principle for linear programs. Oper. Res. 8
(1960) 101-111.
41. Davidon, W.C.: Variable metric method for minimization. AEC Report ANL5990,
Argonne National Laboratory (1959).
42. Davidon, W.C.: Variable metric method for minimization. SIAM 1. Optimization 1
(1991) 1-17.
43. Dedieu, J.-P.: Une condition nécessaire et suffisante d'optimalité en optimisation non
convexe et en calcul des variations. Séminaire d'Analyse Numérique, Univ. Paul
Sabatier, Toulouse (1979-80).
44. Demjanov, V.F.: Algorithms for some minimax problems. J. Comp. Syst. Sci. 2 (1968)
342-380.
45. Demjanov, V.F., Malozemov, V.N.: Introduction to Minimax. Wiley & Sons (1974).
46. Dennis, J., Schnabel, R.: Numerical Methods for Unconstrained Optimization and Non-
linear Equations. Prentice Hall (1983).
47. Dubois, J.: Sur la convexité et ses applications. Ann. Sci. Math. Québec 1,1 (1977) 7-31.
48. Dubuc, S.: Problèmes d'Optimisation en Calcul des Probabilités. Les Presses de l'Uni-
versité de Montréal (1978).
49. Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. North-Holland,
Amsterdam (1976).
50. Eggleston, H.G.: Convexity. Cambridge University Press, London (1958).
51. Ellis, R.S.: Entropy, Large Deviations and Statistical Mechanics. Springer, New York
(1985).
52. Everett III, H.: Generalized Lagrange multiplier method for solving problems of opti-
mum allocation of resources. Oper. Res. 11 (1963) 399-417.
53. Fenchel, W.: Convexity through the ages. In: Convexity and its Applications (P.M. Gruber
and J.M. Wills, eds.). Birkhäuser, Basel (1983) 120-130.
54. Fenchel, W.: Obituary for the death of -. Det Kongelige Danske Videnskabernes
Selskabs Aarbok (Oversigten) [Yearbook of the Royal Danish Academy of Sciences]
(1988-89) 163-171.
55. Feuer, A.: An implementable mathematical programming algorithm for admissible fun-
damental functions. Ph.D. Thesis, Columbia Univ. (1974).
56. Fletcher, R.: Practical Methods of Optimization. Wiley & Sons (1987).
57. Fletcher, R., Powell, M.J.D.: A rapidly convergent method for minimization. The Com-
puter Journal 6 (1963) 163-168.
58. Flett, T.M.: Differential Analysis. Cambridge University Press (1980).
59. Fukushima, M.: A descent algorithm for nonsmooth convex programming. Math. Prog.
30,2 (1984) 163-175.
60. Geoffrion, A.M.: Duality in nonlinear programming: a simplified application-oriented
development. SIAM Review 13,1 (1971) 1-37.
61. Gilbert, J.C., Lemarechal, C.: Some numerical experiments with variable-storage quasi-
Newton algorithms. Math. Prog. 45 (1989) 407-435.
62. Goffin, J.-L., Haurie, A, Vial, J.-Ph.: Decomposition and nondifferentiable optimization
with the projective algorithm. Management Sci. 38,2 (1992) 284-302.
63. Gorni, G.: Conjugation and second-order properties of convex functions. J. Math. Anal.
Appl. 158,2 (1991) 293-315.
64. Griewank, A., Rabier, P.J.: On the smoothness of convex envelopes. Trans. Amer. Math.
Soc. 322 (1990) 691-709.
65. Grinold, R.C.: Lagrangian subgradients. Management Sci. 17,3 (1970) 185-188.
66. Gritzmann, P., Klee, V.: Mathematical programming and convex geometry. In: Handbook
of Convex Geometry (Elsevier, North-Holland, to appear).
67. Gruber, P.M.: History of convexity. In: Handbook of Convex Geometry (Elsevier, North-
Holland, to appear).
68. Held, M., Karp, R.M.: The traveling-salesman problem and minimum spanning trees.
Math. Prog. 1,1 (1971) 6-25.
69. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear sys-
tems. J. Res. NBS 49 (1952) 409-436.
70. Hiriart-Urruty, J.-B.: Extension of Lipschitz functions. J. Math. Anal. Appl. 77 (1980)
539-554.
71. Hiriart-Urruty, J.-B.: Lipschitz r-continuity of the approximate subdifferential of a con-
vex function. Math. Scand. 47 (1980) 123-134.
72. Hiriart-Urruty, J.-B.: ε-subdifferential calculus. In: Convex Analysis and Optimization
(J.-P. Aubin and R. Vinter, eds.). Pitman (1982), pp. 43-92.
73. Hiriart-Urruty, J.-B.: Limiting behaviour of the approximate first order and second order
directional derivatives for a convex function. Nonlinear Anal. Theory, Methods & Appl.
6,12 (1982) 1309-1326.
74. Hiriart-Urruty, J.-B.: When is a point x satisfying ∇f(x) = 0 a global minimum of f?
Amer. Math. Monthly 93 (1986) 556-558.
75. Hiriart-Urruty, J.-B.: Conditions nécessaires et suffisantes d'optimalité globale en op-
timisation de différences de fonctions convexes. Note aux C.R. Acad. Sci. Paris 309, I
(1989) 459-462.
76. Hiriart-Urruty, J.-B., Ye, D.: Sensitivity analysis of all eigenvalues of a symmetric matrix.
Preprint Univ. Paul Sabatier, Toulouse (1992).
77. Holmes, R.B.: A Course on Optimization and Best Approximation. Lecture Notes in
Mathematics, vol. 257. Springer, Berlin Heidelberg (1972).
78. Holmes, R.B.: Geometrical Functional Analysis and its Applications. Springer, Berlin
Heidelberg (1975).
79. Hormander, L.: Sur la fonction d'appui des ensembles convexes dans un espace locale-
ment convexe. Ark. Mat. 3,12 (1954) 181-186.
80. Ioffe, A.D., Levin, V.L.: Subdifferentials of convex functions. Trans. Moscow Math.
Soc. 26 (1972) 1-72.
81. Ioffe, A.D., Tikhomirov, V.M.: Theory of Extremal Problems. North-Holland (1979).
82. Israel, R.B.: Convexity in the Theory of Lattice Gases. Princeton University Press (1979).
83. Karlin, S.: Mathematical Methods and Theory in Games, Programming and Economics.
McGraw-Hill, New York (1960).
84. Kelley, J.E.: The cutting plane method for solving convex programs. J. SIAM 8 (1960)
703-712.
85. Kim, K.Y., Nesterov, Yu.E., Cherkassky, B.V.: The estimate of complexity of gradient
computation. Soviet Math. Dokl. 275,6 (1984) 1306-1309.
86. Kiselman, C.O.: How smooth is the shadow of a smooth convex body? J. London Math.
Soc. (2) 33 (1986) 101-109.
87. Kiselman, C.O.: Smoothness of vector sums of plane convex sets. Math. Scand. 60
(1987),239-252.
88. Kiwiel, K.C.: An aggregate subgradient method for nonsmooth convex minimization.
Math. Prog. 27 (1983) 320-341.
89. Kiwiel, K.C.: Methods of Descent for Nondifferentiable Optimization. Lecture Notes in
Mathematics, vol. 1133. Springer, Berlin Heidelberg (1985).
90. Kiwiel, K.C.: A survey of bundle methods for nondifferentiable optimization. In: Pro-
ceedings, XIII. International Symposium on Mathematical Programming, Tokyo (1988).
91. Kiwiel, K.C.: Proximity control in bundle methods for convex nondifferentiable mini-
mization. Math. Prog. 46,1 (1990) 105-122.
92. Kiwiel, K.C.: A tilted cutting plane proximal bundle method for convex nondifferen-
tiable optimization. Oper. Res. Lett. 10 (1991) 75-81.
93. Kuhn, H.W.: Nonlinear programming: a historical view. SIAM-AMS Proceedings 9
(1976) 1-26.
94. Kutateladze, S.S.: Changes of variables in the Young transformation. Soviet Math.
Dokl. 18,2 (1977) 545-548.
95. Kutateladze, S.S.: Convex ε-programming. Soviet Math. Dokl. 20 (1979) 391-393.
96. Kutateladze, S.S.: ε-subdifferentials and ε-optimality. Sib. Math. J. (1981) 404-411.
121. McCormick, G.P., Tapia, R.A.: The gradient projection method under mild differentia-
bility conditions. SIAM J. Control 10,1 (1972) 93-98.
122. Mifflin, R.: Semi-smooth and semi-convex functions in constrained optimization. SIAM
J. Control Opt. 15,6 (1977) 959-972.
123. Mifflin, R.: An algorithm for constrained optimization with semi-smooth functions.
Math. Oper. Res. 2,2 (1977) 191-207.
124. Mifflin, R.: A modification and an extension of Lemarechal's algorithm for nonsmooth
minimization. In: Nondifferential and Variational Techniques in Optimization (D.C.
Sorensen, R.J.-B. Wets, eds.). Math. Prog. Study 17 (1982) 77-90.
125. Minoux, M.: Programmation Mathématique: Théorie et Algorithmes I, II. Dunod, Paris
(1983).
126. Moré, J.J.: Implementation and testing of optimization software. In: Performance
Evaluation of Numerical Software (L.D. Fosdick, ed.). North-Holland (1979).
127. Moré, J.J.: Recent developments in algorithms and software for trust region methods. In:
Mathematical Programming, the State of the Art (A. Bachem, M. Grötschel, B. Korte,
eds.). Springer, Berlin Heidelberg (1983), pp. 258-287.
128. Moré, J.J., Thuente, D.J.: Line search algorithms with guaranteed sufficient decrease.
ACM Transactions on Math. Software; Assoc. for Comp. Machinery (to appear).
129. Moreau, J.-J.: Décomposition orthogonale d'un espace hilbertien selon deux cônes
mutuellement polaires. C.R. Acad. Sci. Paris 255 (1962) 238-240.
130. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France
93 (1965) 273-299.
131. Moreau, J.-J.: Fonctionnelles Convexes. Lecture notes, Séminaire "Equations aux déri-
vées partielles", Collège de France, Paris (1966).
132. Moulin, H., Fogelman-Soulié, F.: La Convexité dans les Mathématiques de la Décision.
Hermann, Paris (1979).
133. Nemirovskij, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Opti-
mization. Wiley-Interscience (1983).
134. Nesterov, Yu.E.: Minimization methods for nonsmooth convex and quasiconvex func-
tions. Matekon 20 (1984) 519-531.
135. Niven, I.: Maxima and Minima Without Calculus. Dolciani Mathematical Expositions
6 (1981).
136. Nurminskii, E.A.: On ε-subgradient mappings and their applications in nondifferen-
tiable optimization. Working paper 78,58 (1978) IIASA, 2361 Laxenburg, Austria.
137. Nurminskii, E.A.: ε-subgradient mapping and the problem of convex optimization.
Cybernetics 21,6 (1986) 796-800.
138. Nurminskii, E.A.: Convex optimization problems with constraints. Cybernetics 23,4
(1988) 470-474.
139. Nurminskii, E.A.: A class of convex programming methods. USSR Comput. Maths
Math. Phys. 26,4 (1988) 122-128.
140. Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimiz-
ing sums of the largest eigenvalues of symmetric matrices. Math. Prog. (to appear).
141. Penot, J.-P.: Subhessians, superhessians and conjugation. Nonlinear Analysis: Theory,
Methods and Appl. (to appear).
142. Peressini, A.L., Sullivan, F.E., Uhl, J.J.: The Mathematics of Nonlinear Programming.
Springer, New York (1988).
143. Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Lecture
Notes in Mathematics, vol. 1364. Springer, Berlin Heidelberg (1989, new edition in
1993).
144. Polak, E.: Computational Methods in Optimization. Academic Press, New York (1971).
145. Poljak, B.T.: A general method for solving extremum problems. Soviet Math. Dokl.
174,8 (1966) 33-36.
146. Poljak, B.T.: Minimization of unsmooth functionals. USSR Comput. Maths Math. Phys.
9 (1969) 14-29.
147. Popova, N.K., Tarasov, V.N.: A modification of the cutting-plane method with accel-
erated convergence. In: Nondifferentiable Optimization: Motivations and Applications
(V.F. Demjanov, D. Pallaschke, eds.). Lecture Notes in Economics and Mathematical
Systems, vol. 255. Springer, Berlin Heidelberg (1984), pp. 284-290.
148. Ponstein, J.: Applying some modern developments to choosing your own Lagrange
multipliers. SIAM Review 25,2 (1983) 183-199.
149. Pourciau, B.H.: Modern multiplier rules. Amer. Math. Monthly 87 (1980) 433-452.
150. Powell, M.J.D.: Nonconvex minimization calculations and the conjugate gradient
method. In: Numerical Analysis (D.F. Griffiths, ed.). Lecture Notes in Mathematics,
vol. 1066. Springer, Berlin Heidelberg (1984), pp. 122-141.
151. Prekopa, A.: On the development of optimization theory. Amer. Math. Monthly 87
(1980) 527-542.
152. Pshenichnyi, B.N.: Necessary Conditions for an Extremum. Marcel Dekker (1971).
153. Pshenichnyi, B.N.: Nonsmooth optimization and nonlinear programming. In: Non-
smooth Optimization (C. Lemarechal, R. Mifflin, eds.), IIASA Proceedings Series 3,
Pergamon Press (1978), pp. 71-78.
154. Pshenichnyi, B.N.: Methods of Linearization. Springer, Berlin Heidelberg (1993).
155. Pshenichnyi, B.N., Danilin, Yu.M.: Numerical Methods for Extremal Problems. Mir,
Moscow (1978).
156. Quadrat, J.-P.: Théorèmes asymptotiques en programmation dynamique. C.R. Acad.
Sci. Paris, 311, Serie I (1990) 745-748.
157. Roberts, A.W., Varberg, D.E.: Convex Functions. Academic Press (1973).
158. Rockafellar, R.T.: Convex programming and systems of elementary monotonic relations.
J. Math. Anal. Appl. 19 (1967) 543-564.
159. Rockafellar, R.T.: Convex Analysis. Princeton University Press (1970).
160. Rockafellar, R.T.: Conjugate Duality and Optimization. SIAM regional conference series
in applied mathematics (1974).
161. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point al-
gorithm in convex programming. Math. Oper. Res. 1,2 (1976) 97-116.
162. Rockafellar, R.T.: Lagrange multipliers in optimization. SIAM-AMS Proceedings 9
(1976) 145-168.
163. Rockafellar, R.T.: Solving a nonlinear programming problem by way of a dual problem.
Symposia Mathematica XIX (1976) 135-160.
164. Rockafellar, R.T.: The Theory of Subgradients and its Applications to Problems of Opti-
mization: Convex and Nonconvex Functions. Heldermann, West-Berlin (1981).
165. Rockafellar, R.T.: Lagrange multipliers and optimality. SIAM Review (to appear, 1993).
166. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis (in preparation).
167. Rosen, J.B.: The gradient projection method for nonlinear programming; part I: linear
constraints. J. SIAM 8 (1960) 181-217.
168. Schramm, H., Zowe, J.: A version of the bundle idea for minimizing a nonsmooth
function: conceptual idea, convergence analysis, numerical results. SIAM J. Opt. 2
(1992) 121-152.
169. Schrijver, A.: Theory of Linear and Integer Programming. Wiley-Interscience (1986).
170. Seeger, A.: Second derivatives of a convex function and of its Legendre-Fenchel trans-
formate. SIAM J. Opt. 2,3 (1992) 405-424.
171. Shor, N.Z.: Minimization Methods for Nondifferentiable Functions. Springer, Berlin
Heidelberg (1985).
172. Smith, K.T.: Primer of Modern Analysis. Springer, New York (1983).
173. Stoer, J., Witzgall, C.: Convexity and Optimization in Finite Dimensions I. Springer,
Berlin Heidelberg (1970).
174. Strang, G.: Introduction to Applied Mathematics. Wellesley - Cambridge Press (1986).
175. Strodiot, J.-J., Nguyen, V.H., Heukemes, N.: ε-optimal solutions in nondifferentiable
convex programming and some related questions. Math. Prog. 25 (1983) 307-328.
176. Tikhomirov, V.M.: Stories about maxima and minima. In: Mathematical World 1, Amer.
Math. Society, Math. Association of America (1990).
177. Troutman, J.L.: Variational Calculus with Elementary Convexity. Springer, New York
(1983).
178. Valadier, M.: Sous-différentiels d'une borne supérieure et d'une somme continue de
fonctions convexes. Note aux C.R. Acad. Sci. Paris, Série A 268 (1969) 39-42.
179. Valadier, M.: Contribution à l'Analyse Convexe. Thèse de doctorat ès sciences mathéma-
tiques, Paris (1970).
180. Valadier, M.: Intégration de convexes fermés notamment d'épigraphes. Inf-convolution
continue. Revue d'Informatique et de Recherche Opérationnelle (1970) 47-53.
181. Van Rooij, A.C.M., Schikhof, W.H.: A Second Course on Real Functions. Cambridge
University Press (1982).
182. Van Tiel, J.: Convex Analysis. An Introductory Text. Wiley & Sons (1984).
183. Wets, R.J.-B.: Grundlagen konvexer Optimierung. Lecture Notes in Economics and
Mathematical Systems, vol. 137. Springer, Berlin Heidelberg (1976).
184. Willem, M.: Analyse Convexe et Optimisation, 3rd edn. Éditions CIACO, Louvain-La-
Neuve (1989).
185. Wolfe, P.: Accelerating the cutting plane method for nonlinear programming. J. SIAM
9,3 (1961) 481-488.
186. Wolfe, P.: Convergence conditions for ascent methods. SIAM Review 11 (1969) 226-
235.
187. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable func-
tions. In: Proceedings, XII. Annual Allerton Conference on Circuit and System The-
ory (P.V. Kokotovic, B.S. Davidson, eds.). Univ. Illinois at Urbana-Champaign (1974),
pp. 8-15.
188. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable func-
tions. In: Nondifferentiable Optimization (M.L. Balinski, P. Wolfe, eds.). Math. Prog.
Study 3 (1975) 145-173.
189. Zarantonello, E.H.: Projections on convex sets in Hilbert spaces and spectral theory. In:
Contributions to Nonlinear Functional Analysis. Academic Press (1971), pp. 237-424.
190. Zeidler, E.: Nonlinear Functional Analysis and its Applications III: Variational Methods
and Optimization. Springer, New York (1985).
Index
indicator function, XVII, 39, 93
inf-convolution, 55, 187
- (exact), 62, 119, 120
interior points, 335
Lagrange, Lagrangian, 138, 237, 307
Legendre transform, 35, 43, 81
Levenberg-Marquardt, 290
line-search, 4, 196, 283
linearization error, 131, 201, 232
Lipschitz, 122, 128
local problem, 140
locally bounded, 127
marginal function, 55, 320
marginal price, 151
master problem, 140
mean-value theorem, 112
minimality conditions (approximate), 115
minimizing sequence, XVII, 218, 288, 309
minimum, minimum point, 49
- (global), 333
minorize, minorization, XVII
model, 101, 279
Moreau-Yosida, 121, 183, 187
multi-valued, see multifunction
multifunction, XVI, 99, 112
Newton, quasi-Newton, 228, 283, 293, 326
nonnegative, XVII
normal set (approximate), 93, 115
normalization, norming, 279, 280
null-step, 7, 156, 204, 231, 248, 283, 295, 305
objective function, 137, 152
orthant, XVII
outer semi-continuous, XVII, 128
penalty, 142
- (exact), 185
perspective-function, XVII, 41, 99
piecewise affine, 76, 125, 156, 245
polar cone, XVII, 45, 186
polyhedral function, 77, 307
positively homogeneous, 84, 280
primal problem, 137
programming problem
- (integer), 142, 181
- (linear), 116, 145, 181, 276
- (quadratic), 234, 272, 299, 304, 332
- (semi-infinite), 174
proximal point, 318, 322
qualification, 58, 62, 72, 125, 191
quasi-convex, 201
rate of convergence, see speed of convergence
recession (cone, function), see asymptotic
relative interior, XVII, 62
relaxation
- (Lagrangian), 181, 216
- (convex), 157, 181
- (method), 174, 293
saddle-point, 188
safeguard-reduction property, 208, 251
semi-smooth, 331
separation, 10, 195, 199, 222, 226, 254
set-valued, see multifunction
slack variable, 142
Slater, 165, 188
speed of convergence, 33, 288, 310, 312
- (fast), 208
- (linear), 20
- (sublinear), 16, 18, 271
stability center, 279
stationary point, 50
steepest-descent direction, 1, 196, 225, 262, 315
strictly convex, 79, 81, 174, 181
strongly convex, 82, 83, 318
subdifferential, subgradient, 47, 151
subgradient algorithm, 171, 315
sublevel-set, XVII
support function, XVII, 30, 40, 66, 97, 200, 299
transportation formula, 131, 211
transportation problem, 21
travelling salesman problem, 22
trust-region, 284
unit ball, XVI
unit simplex, XVI
unit sphere, XVI
vertex, see exposed (point)
Wolfe, 177, 249, 284
zigzag, 1, 7, 28