
Grundlehren der

mathematischen Wissenschaften 306


A Series of Comprehensive Studies in Mathematics

Editors
M. Artin  S. S. Chern  J. Coates  J. M. Fröhlich
H. Hironaka  F. Hirzebruch  L. Hörmander
C. C. Moore  J. K. Moser  M. Nagata  W. Schmidt
D. S. Scott  Ya. G. Sinai  J. Tits  M. Waldschmidt
S. Watanabe

Managing Editors
M. Berger  B. Eckmann  S. R. S. Varadhan

Jean-Baptiste Hiriart-Urruty
Claude Lemaréchal

Convex Analysis
and Minimization
Algorithms II
Advanced Theory
and Bundle Methods

With 64 Figures

Springer-Verlag Berlin Heidelberg GmbH


Jean-Baptiste Hiriart-Urruty
Département de Mathématiques
Université Paul Sabatier
118, route de Narbonne
F-31062 Toulouse, France

Claude Lemaréchal
INRIA, Rocquencourt
Domaine de Voluceau
B.P.105
F-78153 Le Chesnay, France

Mathematics Subject Classification (1991): 65-01, 65K, 49M, 49M27, 49-01, 93B60, 90C

ISBN 978-3-642-08162-0 ISBN 978-3-662-06409-2 (eBook)


DOI 10.1007/978-3-662-06409-2
This work is subject to copyright. All rights are reserved, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilm or in any other way,
and storage in data banks. Duplication of this publication or parts thereof is permitted
only under the provisions of the German Copyright Law of September 9, 1965, in its
current version, and permission for use must always be obtained from Springer-
Verlag Berlin Heidelberg GmbH.
Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1993


Originally published by Springer-Verlag Berlin Heidelberg New York in 1993

Typesetting: Camera-ready copy produced from the authors' output file using a
Springer TEX macro package
41/3140-5432 1 0 Printed on acid-free paper
Table of Contents Part II

Introduction . . . . . . . . . . . . . ... . . . xv
IX. Inner Construction of the Subdifferential 1
1 The Elementary Mechanism . 2
2 Convergence Properties . . 9
2.1 Convergence . . . . . . 9
2.2 Speed of Convergence . 15
3 Putting the Mechanism in Perspective . 24
3.1 Bundling as a Substitute for Steepest Descent. 24
3.2 Bundling as an Emergency Device for Descent Methods 27
3.3 Bundling as a Separation Algorithm 29

X. Conjugacy in Convex Analysis .... 35


1 The Convex Conjugate of a Function 37
1.1 Definition and First Examples. 37
1.2 Interpretations. . . . . . . . . . 40
1.3 First Properties . . . . . . . . . 42
1.4 Subdifferentials of Extended-Valued Functions 47
1.5 Convexification and Subdifferentiability .. . 49
2 Calculus Rules on the Conjugacy Operation. . . . 54
2.1 Image of a Function Under a Linear Mapping. 54
2.2 Pre-Composition with an Affine Mapping. 56
2.3 Sum of Two Functions . . . . . . . . . . . . 61
2.4 Infima and Suprema. . . . . . . . . . . . . . 65
2.5 Post-Composition with an Increasing Convex Function . 69
2.6 A Glimpse of Biconjugate Calculus. 71
3 Various Examples . . . . . . . . . . . . . . . . . . . . . . 72
3.1 The Cramer Transformation. . . . . . . . . . . . . . . 72
3.2 Some Results on the Euclidean Distance to a Closed Set 73
3.3 The Conjugate of Convex Partially Quadratic Functions 75
3.4 Polyhedral Functions . . . . . . . . 76
4 Differentiability of a Conjugate Function . . 79
4.1 First-Order Differentiability. . . . . . . 79
4.2 Towards Second-Order Differentiability. 82

XI. Approximate Subdifferentials of Convex Functions 91


1 The Approximate Subdifferential . . . . . . . . 92
1.1 Definition, First Properties and Examples . . 92
1.2 Characterization via the Conjugate Function 95
1.3 Some Useful Properties . . . . . . . . . . . 98
2 The Approximate Directional Derivative . . . . 102
2.1 The Support Function of the Approximate Subdifferential 102
2.2 Properties of the Approximate Difference Quotient . 106
2.3 Behaviour of f′_ε and T_ε as Functions of ε . . . . 110
3 Calculus Rules on the Approximate Subdifferential . 113
3.1 Sum of Functions . . . . . . . . . . . . . 113
3.2 Pre-Composition with an Affine Mapping. 116
3.3 Image and Marginal Functions . . . 118
3.4 A Study of the Infimal Convolution . . . . 119
3.5 Maximum of Functions . . . . . . . . . . 123
3.6 Post-Composition with an Increasing Convex Function. 125
4 The Approximate Subdifferential as a Multifunction . . . . 127
4.1 Continuity Properties of the Approximate Subdifferential . 127
4.2 Transportation of Approximate Subgradients 129

XII. Abstract Duality for Practitioners . . . . . 137


1 The Problem and the General Approach . 137
1.1 The Rules of the Game 137
1.2 Examples . . . . . . . . . . . . . . 141
2 The Necessary Theory . . . . . . . . . 147
2.1 Preliminary Results: The Dual Problem. 147
2.2 First Properties of the Dual Problem .. 150
2.3 Primal-Dual Optimality Characterizations 154
2.4 Existence of Dual Solutions . 157
3 Illustrations . . . . . . . . . . . 161
3.1 The Minimax Point of View . 161
3.2 Inequality Constraints. . . . 162
3.3 Dualization of Linear Programs. 165
3.4 Dualization of Quadratic Programs 166
3.5 Steepest-Descent Directions 168
4 Classical Dual Algorithms. . . . 170
4.1 Subgradient Optimization. . 171
4.2 The Cutting-Plane Algorithm 174
5 Putting the Method in Perspective . 178
5.1 The Primal Function . . . . . 178
5.2 Augmented Lagrangians . . . 181
5.3 The Dualization Scheme in Various Situations 185
5.4 Fenchel's Duality . . . . . . . . . . . . . . . 190

XIII. Methods of ε-Descent . . . . . . . . . . . 195


1 Introduction. Identifying the Approximate Subdifferential 195
1.1 The Problem and Its Solution . . . . . . . . . . 195
1.2 The Line-Search Function. . . . . . . . . . . 199
1.3 The Schematic Algorithm . . . . . . . . . . . 203
2 A Direct Implementation: Algorithm of ε-Descent 206
2.1 Iterating the Line-Search . . . . . . . . . . . 206
2.2 Stopping the Line-Search . . . . . . . . . . 209
2.3 The ε-Descent Algorithm and Its Convergence 212
3 Putting the Algorithm in Perspective . . . . 216
3.1 A Pure Separation Form. . . . . . . . . 216
3.2 A Totally Static Minimization Algorithm 219

XIV. Dynamic Construction of Approximate Subdifferentials:


Dual Form of Bundle Methods . . . . . . . . . . . . . 223
1 Introduction: The Bundle of Information . . . . 223
1.1 Motivation . . . . . . . . . . . . . . . . . 223
1.2 Constructing the Bundle of Information. . . 227
2 Computing the Direction . . . . . . . 233
2.1 The Quadratic Program . . . . . . . . . . . 233
2.2 Minimality Conditions . . . . . . . . . . . 236
2.3 Directional Derivatives Estimates . . . . 241
2.4 The Role of the Cutting-Plane Function. 244
3 The Implementable Algorithm . . . . . . . 248
3.1 Derivation of the Line-Search . . . . . . 248
3.2 The Implementable Line-Search and Its Convergence . 250
3.3 Derivation of the Descent Algorithm . . . . . . . . 254
3.4 The Implementable Algorithm and Its Convergence 257
4 Numerical Illustrations 263
4.1 Typical Behaviour. . . . . . . . . . . . . . . . . 263
4.2 The Role of ε . . . . . . . . . . . . . . . . . . . 266
4.3 A Variant with Infinite ε: Conjugate Subgradients 268
4.4 The Role of the Stopping Criterion . 269
4.5 The Role of Other Parameters. . . . 271
4.6 General Conclusions . . . . . . . . 273

XV. Acceleration of the Cutting-Plane Algorithm:


Primal Forms of Bundle Methods . . . . . . 275
1 Accelerating the Cutting-Plane Algorithm . . . 275
1.1 Instability of Cutting Planes. . . . . . . 276
1.2 Stabilizing Devices: Leading Principles . 279
1.3 A Digression: Step-Control Strategies 283
2 A Variety of Stabilized Algorithms . . 285
2.1 The Trust-Region Point of View . . . 286

2.2 The Penalization Point of View 289


2.3 The Relaxation Point of View . 292
2.4 A Possible Dual Point of View 295
2.5 Conclusion . . . . . . . . . . 299
3 A Class of Primal Bundle Algorithms . 301
3.1 The General Method . . . . . 301
3.2 Convergence . . . . . . . . . . . 307
3.3 Appropriate Stepsize Values .. . 314
4 Bundle Methods as Regularizations . . 317
4.1 Basic Properties of the Moreau-Yosida Regularization 317
4.2 Minimizing the Moreau-Yosida Regularization 322
4.3 Computing the Moreau-Yosida Regularization 326

Bibliographical Comments . 331

References 337

Index ... 345


Table of Contents Part I

Introduction . . . . . . . . . . . . . . . . . xv
I. Convex Functions of One Real Variable 1
1 Basic Definitions and Examples . . . 1
1.1 First Definitions of a Convex Function 2
1.2 Inequalities with More Than Two Points 6
1.3 Modern Definition of Convexity . . . 8
2 First Properties . . . . . . . . . . . . . . 9
2.1 Stability Under Functional Operations 9
2.2 Limits of Convex Functions . 11
2.3 Behaviour at Infinity . . . . . . . . . 14
3 Continuity Properties . . . . . . . . . . . 16
3.1 Continuity on the Interior of the Domain 16
3.2 Lower Semi-Continuity: Closed Convex Functions 17
3.3 Properties of Closed Convex Functions . . . . . 19
4 First-Order Differentiation . . . . . . . . . . . . . 20
4.1 One-Sided Differentiability of Convex Functions 21
4.2 Basic Properties of Subderivatives 24
4.3 Calculus Rules . . . . . . . . . . . . . . . . 27
5 Second-Order Differentiation . . . . . . . . . . . 29
5.1 The Second Derivative of a Convex Function . 30
5.2 One-Sided Second Derivatives . . . . . . . . 32
5.3 How to Recognize a Convex Function. . . . . 33
6 First Steps into the Theory of Conjugate Functions 36
6.1 Basic Properties of the Conjugate . 38
6.2 Differentiation of the Conjugate. 40
6.3 Calculus Rules with Conjugacy . . 43

II. Introduction to Optimization Algorithms 47


1 Generalities . . . . . . . . . . . . . . 47
1.1 The Problem . . . . . . . . . . . 47
1.2 General Structure of Optimization Schemes 50
1.3 General Structure of Optimization Algorithms 52
2 Defining the Direction. . . . . . . . . . . . . . . 54

2.1 Descent and Steepest-Descent Directions . 54


2.2 First-Order Methods . . . . 56
2.3 Newtonian Methods. . . . . 61
2.4 Conjugate-Gradient Methods 65
3 Line-Searches . . . . . . . . . . 70
3.1 General Structure of a Line-Search 71
3.2 Designing the Test (O), (R), (L) 74
3.3 The Wolfe Line-Search . . 77
3.4 Updating the Trial Stepsize 81

III. Convex Sets . . . . . . . . . . . 87


1 Generalities . . . . . . . . . . 87
1.1 Definition and First Examples . 87
1.2 Convexity-Preserving Operations on Sets . 90
1.3 Convex Combinations and Convex Hulls 94
1.4 Closed Convex Sets and Hulls .. 99
2 Convex Sets Attached to a Convex Set 102
2.1 The Relative Interior . 102
2.2 The Asymptotic Cone . 108
2.3 Extreme Points . . . . 110
2.4 Exposed Faces . . . . 113
3 Projection onto Closed Convex Sets . 116
3.1 The Projection Operator . . . . . 116
3.2 Projection onto a Closed Convex Cone 118
4 Separation and Applications. . . . . . . . 121
4.1 Separation Between Convex Sets . . . 121
4.2 First Consequences of the Separation Properties 124
4.3 The Lemma of Minkowski-Farkas . . . . 129
5 Conical Approximations of Convex Sets . . . . . . 132
5.1 Convenient Definitions of Tangent Cones . . . . 133
5.2 The Tangent and Normal Cones to a Convex Set 136
5.3 Some Properties of Tangent and Normal Cones . 139

IV. Convex Functions of Several Variables ... 143


1 Basic Definitions and Examples . . . . . . 143
1.1 The Definitions of a Convex Function. 143
1.2 Special Convex Functions: Affinity and Closedness . 147
1.3 First Examples . . . . . . . . . . . . . 152
2 Functional Operations Preserving Convexity 157
2.1 Operations Preserving Closedness . . . 158
2.2 Dilations and Perspectives of a Function 160
2.3 Infimal Convolution. . . . . . . . . . . 162
2.4 Image of a Function Under a Linear Mapping. 166
2.5 Convex Hull and Closed Convex Hull of a Function 169

3 Local and Global Behaviour of a Convex Function 173


3.1 Continuity Properties . . . . . . . . 173
3.2 Behaviour at Infinity . . . . . . . . 178
4 First- and Second-Order Differentiation . 183
4.1 Differentiable Convex Functions . . 183
4.2 Nondifferentiable Convex Functions 188
4.3 Second-Order Differentiation . 190

V. Sublinearity and Support Functions . 195


1 Sublinear Functions . . . . . . . . 197
1.1 Definitions and First Properties 197
1.2 Some Examples. . . . . . . . 201
1.3 The Convex Cone of All Closed Sublinear Functions 206
2 The Support Function of a Nonempty Set . 208
2.1 Definitions, Interpretations 208
2.2 Basic Properties. . . . . . . . . . . . 211
2.3 Examples . . . . . . . . . . . . . . . 215
3 The Isomorphism Between Closed Convex Sets
and Closed Sublinear Functions . . . . . . . . 218
3.1 The Fundamental Correspondence . . . . 218
3.2 Example: Norms and Their Duals, Polarity 220
3.3 Calculus with Support Functions . . . . . 225
3.4 Example: Support Functions of Closed Convex Polyhedra 234

VI. Subdifferentials of Finite Convex Functions . . . . . . 237


1 The Subdifferential: Definitions and Interpretations. 238
1.1 First Definition: Directional Derivatives. . . . . 238
1.2 Second Definition: Minorization by Affine Functions . 241
1.3 Geometric Constructions and Interpretations . . . . . 243
1.4 A Constructive Approach to the Existence of a Subgradient 247
2 Local Properties of the Subdifferential 249
2.1 First-Order Developments . 249
2.2 Minimality Conditions 253
2.3 Mean-Value Theorems .. 256
3 First Examples . . . . . . . . 258
4 Calculus Rules with Subdifferentials 261
4.1 Positive Combinations of Functions . 261
4.2 Pre-Composition with an Affine Mapping. 263
4.3 Post-Composition with an Increasing Convex Function
of Several Variables . . . . . . . . . . . . . . 264
4.4 Supremum of Convex Functions . . . . . . . 266
4.5 Image of a Function Under a Linear Mapping. 272
5 Further Examples . . . . . . . . . . . . . . . 275
5.1 Largest Eigenvalue of a Symmetric Matrix . . 275

5.2 Nested Optimization . . . . . . . . . . . . . 277


5.3 Best Approximation of a Continuous Function
on a Compact Interval . . . . . . . . . . . . 278
6 The Subdifferential as a Multifunction . . . . . . 279
6.1 Monotonicity Properties of the Subdifferential 280
6.2 Continuity Properties of the Subdifferential . 282
6.3 Subdifferentials and Limits of Gradients 284

VII. Constrained Convex Minimization Problems:


Minimality Conditions, Elements of Duality Theory . . . . . . . . . . 291
1 Abstract Minimality Conditions . . 292
1.1 A Geometric Characterization. . . . . . . . . 293
1.2 Conceptual Exact Penalty . . . . . . . . . . . 298
2 Minimality Conditions Involving Constraints Explicitly . 301
2.1 Expressing the Normal and Tangent Cones in Terms
of the Constraint-Functions . . . . . 303
2.2 Constraint Qualification Conditions . . . . . . . . . 307
2.3 The Strong Slater Assumption . . . . . . . . . . . 311
2.4 Tackling the Minimization Problem with Its Data Directly 314
3 Properties and Interpretations of the Multipliers. . . 317
3.1 Multipliers as a Means to Eliminate Constraints:
the Lagrange Function . . . . . . . . . . . . . 317
3.2 Multipliers and Exact Penalty . . . . . . . . . . 320
3.3 Multipliers as Sensitivity Parameters with Respect
to Perturbations . . . . . . . . . . . . . . . . 323
4 Minimality Conditions and Saddle-Points . . . . . 327
4.1 Saddle-Points: Definitions and First Properties 327
4.2 Mini-Maximization Problems . . . . 330
4.3 An Existence Result. . . . . . . . . 333
4.4 Saddle-Points of Lagrange Functions 336
4.5 A First Step into Duality Theory . . 338

VIII. Descent Theory for Convex Minimization:


The Case of Complete Information . . . . 343
1 Descent Directions and Steepest-Descent Schemes 343
1.1 Basic Definitions . . . . . . . . . . . 343
1.2 Solving the Direction-Finding Problem 347
1.3 Some Particular Cases . . . . . . . . 351
1.4 Conclusion . . . . . . . . . . . . . . 355
2 Illustration. The Finite Minimax Problem . 356
2.1 The Steepest-Descent Method for Finite Minimax Problems 357
2.2 Non-Convergence of the Steepest-Descent Method. 363
2.3 Connection with Nonlinear Programming. . . . . . . . . . 366

3 The Practical Value of Descent Schemes 371


3.1 Large Minimax Problems. 371
3.2 Infinite Minimax Problems . . . 373
3.3 Smooth but Stiff Functions . . . 374
3.4 The Steepest-Descent Trajectory 377
3.5 Conclusion 383

Appendix: Notations 385


1 Some Facts About Optimization. 385
2 The Set of Extended Real Numbers 388
3 Linear and Bilinear Algebra . . . . 390
4 Differentiation in a Euclidean Space 393
5 Set-Valued Analysis . . . . . . . . . 396
6 A Bird's Eye View of Measure Theory and Integration 399

Bibliographical Comments 401

References 407

Index ... 415


Introduction

In this second part of "Convex Analysis and Minimization Algorithms", the dichoto-
mous aspect implied by the title is again continuously present. Chapter IX introduces
the bundling mechanism to construct the subdifferential of a convex function at a
given point. It reveals implementation difficulties which serve as a motivation for
another concept: the approximate subdifferential. However, a convenient study of this
latter set necessitates the so-called Legendre-Fenchel transform, or conjugacy corre-
spondence, which is therefore the subject of Chap. X. This parent of the celebrated
Fourier transform is being used more and more, in its natural context of variational
problems, as well as in broader fields from natural sciences. Then we can study the
approximate subdifferential in Chap. XI, where our point of view is definitely oriented
towards computational utilizations.
Chapter XII, lying on the fringe between theory and applications, makes a break
in our development. Its subject, Lagrange relaxation, is probably the most spectacular
application of convex analysis; it can be of great help in a huge number of practical
optimization problems, but it requires a high level of sophistication. We expose this
delicate theory in a setting as applied as possible, so that the major part of the chapter
should be accessible to the non-specialist. This chapter is quite voluminous but only
its Sections 1 to 3 are essential to understand the subject. Section 4 concerns dual
algorithms and its interest is numerical only. Section 5 definitely deviates towards
theory and relates the approach with other chapters of the book, especially Chap. X.
The last three chapters can then be entirely devoted to bundle methods, and to the
several ways of arriving at them. Chapter XIII does with the approximate subdifferen-
tial the job that was done by Chap. IX with the ordinary subdifferential. This allows the
development of bundle methods in dual form (Chap. XIV). Finally, Chap. XV gives
these same methods in their primal form, which is probably the one having the more
promising future.
A reader mainly interested in the convex analysis part of the book can skip the
numerical Chapters IX, XIII, XIV. Even if he jumps directly to Chap. XV he will get
the necessary comprehension of bundle methods, and more generally of algorithms for
nonsmooth optimization. This might even be a good idea, in the sense that Chap. XV is
probably much easier to read than the other three. However, we do not recommend this
for a reader with professional algorithmic motivations, and this for several reasons:
- Chapter XV relies exclusively on convex analysis, and not on the general algorith-
mic principles making up Chap. II in the first part. This is dangerous because, as
explained in §VIII.3.3, there is no clearcut division between smooth objective
functions and merely convex ones. A sound algorithm for convex minimization should
therefore not ignore its parents; Chap. XIV, precisely, gives an account of bundle
methods on the basis of smooth minimization.
- The bundling mechanism of Chap. IX, and the dual bundle method of Chap. XIV,
can be readily extended to nonconvex objective functions. Chapter XV is of no help
to understand how this extension can be made.
- More generally, this reader would miss many interesting features of bundle methods,
which are important if one wants to do research on the subject. Chapter XV is only
a tree, which should not hide the other chapters making up the forest.
We have tried as much as possible to avoid any prospective development, con-
tenting ourselves with well-established material. This explains why the present book
ignores many recent works in convex analysis and minimization algorithms, in par-
ticular those of the last decade or two, concerning the second-order differentiation of
a nonsmooth function. In fact, these are at present just isolated trials towards solving
a challenging question, and a well-organized theory is still hard to foresee. It is not
clear which of these works will be of real value in the future: they rather belong to
speculative science; but yet, "there still are so many beautiful things to write in C
major" (S. Prokofiev).
We recall that references to theorems or equations of another chapter are preceded
by the chapter number (in roman numerals). The letter A refers to Appendix A of the
first part, in which our main system of notation is overviewed. In order to be reason-
ably self-contained, we give below a short glossary of symbols and terms appearing
throughout.

"↦" denotes a function or mapping, as in x ↦ f(x); when f(x) is a set instead of
a singleton, we have a multifunction and we rather use the notation x ⇉ f(x).
⟨·,·⟩, ‖·‖ and |||·||| are respectively the scalar product, the associated norm and an
arbitrary norm (here and below the space of interest is most generally the Euclidean
space ℝⁿ). Sometimes, ⟨s, x⟩ will be the standard dot-product sᵀx = Σⱼ₌₁ⁿ sⱼxⱼ.
B(0, 1) := {x : ‖x‖ ≤ 1} is the (closed) unit ball and its boundary is the unit sphere.
cl f is the closure, or lower semi-continuous hull, of the function f; if f is convex
and lower semi-continuous on the whole of ℝⁿ, we say that f is closed convex.
Df(x), D₋f(x) and D₊f(x) are respectively the derivative, the left- and the right-
derivative of a univariate function f at a point x; and the directional derivative of f
(multivariate and convex) at x along d is

    f′(x, d) := lim_{t↓0} [f(x + td) − f(x)] / t .

δᵢⱼ = 0 if i ≠ j, 1 if i = j is the Kronecker symbol.


Δₖ := {(α₁, …, αₖ) : Σᵢ₌₁ᵏ αᵢ = 1, αᵢ ≥ 0 for i = 1, …, k} is the unit simplex
of ℝᵏ. Its elements α₁, …, αₖ are called convex multipliers. An element of the form
Σᵢ₌₁ᵏ αᵢxᵢ, with α ∈ Δₖ, is a convex combination of the xᵢ's.

dom f := {x : f(x) < +∞} and epi f := {(x, r) : f(x) ≤ r} are respectively the
domain and epigraph of a function f : ℝⁿ → ℝ ∪ {+∞}.
The function d ↦ f′∞(d) := lim_{t→+∞} [f(x + td) − f(x)] / t does not depend on
x ∈ dom f and is the asymptotic function of f (∈ Conv ℝⁿ).
H⁻_{s,r} := {x : ⟨s, x⟩ ≤ r} is a half-space, defined by s ≠ 0 and r ∈ ℝ; replacing the
inequality by an equality, we obtain the corresponding (affine) hyperplane.
I_S(x) is the indicator function of the set S at the point x; its value is 0 on S, +∞
elsewhere.
If K is a cone, its polar K° is the set of those s such that ⟨s, x⟩ ≤ 0 for all x ∈ K.
If f(x) ≤ g(x) for all x, we say that the function f minorizes the function g. A
sequence {xₖ} is minimizing for the function f : X → ℝ if f(xₖ) → f̄ := inf_X f.
ℝ₊ = [0, +∞[ is the set of nonnegative numbers; (ℝ₊)ⁿ is the nonnegative orthant
of ℝⁿ; a sequence {tₖ} is decreasing if k′ > k ⟹ tₖ′ ≤ tₖ.
Outer semi-continuity of a multifunction F is what is usually called upper semi-
continuity elsewhere. Suppose F(x) is nonempty and included in a fixed compact set
for x in a neighborhood of a given x* (i.e. F is locally bounded near x*); we say that
F is outer semi-continuous at x* if, for arbitrary ε > 0,

    F(x) ⊂ F(x*) + B(0, ε) for x close enough to x* .

The perspective-function f̃ associated with a (convex) function f is obtained by the
projective construction f̃(x, u) := u f(x/u), for u > 0.
ri C and rbd C are respectively the relative interior and relative boundary of a (convex)
set C; these are the interior and boundary for the topology relative to the affine hull
aff C of C.
S_r(f) := {x : f(x) ≤ r} is the sublevel-set of the function f at the level r ∈ ℝ.
σ_S(d) := sup_{s∈S} ⟨s, d⟩ is the support function of the set S at d, and σ_S(d) + σ_S(−d) is
its breadth in the direction d; d_S(x) is the distance from x to S; p_S(x) is the projection
of x onto S, usually considered for S closed convex; when, in addition, S contains the
origin, its gauge is the function

    γ_S(x) := inf {λ > 0 : x ∈ λS} .


(U1) is a black box characterizing a function f, useful for minimization algorithms:
at any given x, (U1) computes the value f(x) and a subgradient denoted by s(x) ∈
∂f(x).
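
To make the (U1) convention concrete, here is a minimal sketch (ours, not the book's) of such a black box in Python; the example function f(x) = ‖x‖ and all names are illustrative assumptions.

```python
# A minimal sketch (not from the book) of a (U1) black box: given x,
# return the value f(x) and one subgradient s(x) in the subdifferential.
from typing import Callable, Tuple
import numpy as np

BlackBox = Callable[[np.ndarray], Tuple[float, np.ndarray]]  # x -> (f(x), s(x))

def norm_oracle(x: np.ndarray) -> Tuple[float, np.ndarray]:
    """(U1) for f(x) = ||x||: the gradient x/||x|| when x != 0;
    at x = 0, any point of the unit ball is a subgradient, we return 0."""
    n = float(np.linalg.norm(x))
    return (n, x / n) if n > 0 else (0.0, np.zeros_like(x))
```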
IX. Inner Construction of the Subdifferential:
Towards Implementing Descent Methods

Prerequisites. Chapters VI and VIII (essential); Chap. II (recommended).

Introduction. Chapter VIII has explained how to generalize the steepest-descent


method of Chap. II from smooth to merely convex functions; let us briefly recall here
how it works. At a given point x (the current iterate of a steepest-descent algorithm),
we want to find the Euclidean steepest-descent direction, i.e. the solution of

    min {f′(x, d) : ‖d‖ = 1},

knowing that more general normalizations can also be considered. A slight modifica-
tion of the above problem is much more tractable, just by changing the normalization
constraint into an inequality. Exploiting positive homogeneity and using some duality
theory, one obtains the classical projection problem ŝ := argmin_{s∈∂f(x)} ½‖s‖². The
essential results of §VIII.1 are then as follows:
- either ŝ = 0, which means that x minimizes f;
- or ŝ ≠ 0, in which case the Euclidean (non-normalized) steepest-descent direction
is −ŝ.
This allows the derivation of a steepest-descent algorithm, in which the direction is
computed as above, and some line-search can be made along this direction.
We have also seen the major drawbacks of this type of algorithm:
(a) It is subject to zigzags, and need not converge to a minimum point.
(b) The projection problem can be solved only in special situations, basically when
the subdifferential is a convex polyhedron, or an ellipsoid.
(c) To compute any kind of descent direction, steepest or not, an explicit description
of the whole subdifferential is needed, one way or another. In many applications,
this is too demanding.
Our main motivation in the present chapter is to overcome (c), and we will develop
a technique which will eliminate (b) at the same time. As in the previous chapters, we
assume a finite-valued objective function:

    f : ℝⁿ → ℝ is convex.

For the sake of simplicity, we limit the study to steepest-descent directions in the sense
of the Euclidean norm |||·||| = ‖·‖; adapting our development to other normings
is indeed straightforward and only results in additional technicalities. In fact, our
situation is fundamentally characterized by the information concerning the objective
function f: we need only the value f(x) and a subgradient s(x) ∈ ∂f(x), computed
in a black box of the type (U1) in Fig. II.1.2.1, at any x ∈ ℝⁿ that pleases us. Such a
computation is possible in theory because dom ∂f = ℝⁿ; and it is also natural from
the point of view of applications, for reasons given in §VIII.3.
The idea is to construct the subdifferential of f at x, or at least an approximation
of it, good enough to mimic the computation of the steepest-descent direction. The
approximation in question is obtained by a sequence of compact convex polyhedra,
and this takes care of (b) above. On the other hand, our approximation scheme does
not solve (a): for this, we need additional material from convex analysis; the question
is therefore deferred to Chap. XIII and the chapters that follow it. Furthermore, our
approximation mechanism is not quite implementable; this new difficulty, having the
same remedy as needed for (a), will also be addressed in the same chapters.
The present chapter is fundamental. It introduces the key ingredients for con-
structing efficient algorithms for the minimization of convex functions, as well as
their extensions to the nonconvex case.

1 The Elementary Mechanism

In view of the poor information available, our very first problem is twofold: (i) how
can one compute a descent direction at a given x, or (ii) how can one realize that x is
optimal? Let us illustrate this by a simple specific example.
For n = 2 take the maximum of three functions f₁, f₂, f₃:

    f(x) := max {f₁(x), f₂(x), f₃(x)} = max {ξ + 2η, ξ − 2η, −ξ} for x = (ξ, η), (1.1)

which is minimal at x = (0,0). Figure 1.1 shows a level-set of f and the three half-
lines of kinks. Suppose that the black box computing s(x) produces the gradient of
one of the active functions. For example, the following scheme takes the largest active
index:
    First, set s(x) = ∇f₁(x) = (1, 2);
    if f₂(x) ≥ f₁(x), then set s(x) = ∇f₂(x) = (1, −2);
    if f₃(x) ≥ max {f₁(x), f₂(x)}, then set s(x) = ∇f₃(x) = (−1, 0).
(i) When x is as indicated in the picture, s(x) = ∇f₂(x) and −s(x) points in a
direction of increasing f. This is confirmed by calculations: with the usual dot-
product for ⟨·,·⟩ in ℝ² (which is implicitly assumed in the picture),

    ⟨−s(x), ∇f₁(x)⟩ = 3, which implies f′(x, −s(x)) ≥ 3.

Thus the choice d = −s(x) is bad. Of course, it would not help to take d = s(x):
a rather odd idea since we already know that
    f′(x, s(x)) ≥ ‖s(x)‖² > 0.

Fig. 1.1. Getting a descent direction is difficult

Finally note that the trouble would be just the same if, instead of taking the largest
possible index, the black box (U1) chose the smallest one, say, yielding ∇f₁(x).
(ii) When x = 0, s(x) = ∇f₃(x) still does not tell us that x is optimal. It would be
extremely demanding of (U1) to ask that it select the only subgradient of interest
here, namely s = 0!
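
As an illustration (ours, not from the book), this largest-active-index scheme can be written as a (U1) black box for the function (1.1):

```python
import numpy as np

# Gradients of the three affine pieces of (1.1):
# f1 = xi + 2*eta, f2 = xi - 2*eta, f3 = -xi.
GRADS = [np.array([1.0, 2.0]), np.array([1.0, -2.0]), np.array([-1.0, 0.0])]

def oracle_max3(x):
    """(U1) for f of (1.1), breaking ties by the largest active index."""
    vals = [float(g @ x) for g in GRADS]
    f = max(vals)
    i = max(j for j, v in enumerate(vals) if v == f)  # largest active index
    return f, GRADS[i]
```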
It is the aim of the present chapter to solve this double problem. From now on, x
is a fixed point in ℝⁿ and we want to answer the double question: is x optimal? and if
not, find a descent direction issuing from x.

Remark 1.1 The difficulty is inherent to the combination of the two words: descent direction
and convex function. In fact, if we wanted an ascent direction, it would suffice to take the
direction of any subgradient, say d = s(x): indeed

    f′(x, d) ≥ ⟨s(x), d⟩ = ‖s(x)‖².

This confirms our Remark VII.1.1.3: maximizing a convex function is a completely
different problem; here, at least one issue is trivial, namely that of increasing f (barring the
very unlikely event s(x) = 0). □

Because f is finite everywhere, the multifunction ∂f is locally bounded (Propo-
sition VI.6.2.2): we can write

    ∃ M > 0, ∃ η > 0 such that ‖s(y)‖ ≤ M whenever ‖y − x‖ ≤ η. (1.2)

We will make extensive use of the theory from Chap. VIII, more specifically §VIII.1.
In particular, let S [= ∂f(x)] be a closed convex polyhedron and consider the problem

    min {σ_S(d) : ‖d‖ ≤ 1}   [i.e. min {f′(x, d) : ‖d‖ ≤ 1}].

We recall that a simple solution can be obtained as follows: compute the unique
solution ŝ of

    min {½‖s‖² : s ∈ S};

then take d = −ŝ.

The problem of finding a descent direction issuing from the given x will be solved
by an iterative process. A sequence of "trial" directions dₖ will be constructed, together
with a sequence of designated subgradients sₖ. These two sequences will be such that:
either (i) σ_S(dₖ) = f′(x, dₖ) < 0 for some k, or (ii) k → ∞ and ŝₖ → 0, indicating
that x is optimal. For this process to work, the boundedness property (1.2) is absolutely
essential.

(a) The Key Idea. We start the process with the computation of s(x) ∈ ∂f(x) coming
from the black box (U1). We call it s₁, we form the polyhedron S₁ := {s₁}, and we set
k = 1.
Recursively, suppose that k ≥ 1 subgradients

    s₁, …, sₖ with sⱼ ∈ ∂f(x) for j = 1, …, k

have been computed. They generate the compact convex polyhedron

    Sₖ := co {s₁, …, sₖ}. (1.3)

Then we compute the trial direction dₖ. Because we want

    0 > f′(x, dₖ) = σ_∂f(x)(dₖ) ≥ max {⟨sⱼ, dₖ⟩ : j = 1, …, k},

we must certainly take dₖ so that

    ⟨sⱼ, dₖ⟩ < 0 for j = 1, …, k (1.4)

or equivalently

    ⟨s, dₖ⟩ < 0 for all s ∈ Sₖ (1.5)

or again

    σ_Sₖ(dₖ) < 0. (1.6)

Only then has dₖ a chance of being a descent direction.
The precise computation of dₖ satisfying (1.4) will be specified later. Suppose for
the moment that this computation is done; then the question is whether or not dₖ is
a descent direction. The best way to check is to compare f(x + tdₖ) with f(x) for
t ↓ 0. This is nothing but a line-search, which works schematically as follows.

Algorithm 1.2 (Line-Search for Mere Descent) The direction d (= dₖ) is given.
Start from some t > 0, for example t = 1.
STEP 1. Compute f(x + td). If f(x + td) < f(x) stop. This line-search is successful,
d is downhill.
STEP 2. Take a smaller t > 0, for example t/2, and loop to Step 1. □
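
As a sketch (ours), Algorithm 1.2 in Python; since a computer cannot drive t all the way to 0, we add an iteration cap, and we return the last subgradient seen, which plays the role of the cluster point s* of Proposition 1.3 below:

```python
def mere_descent_linesearch(oracle, x, d, f_x, t_init=1.0, max_halvings=50):
    """Algorithm 1.2 (sketch): check whether d is a descent direction at x.
    Returns ('descent', t) on success, otherwise ('null', s) where s is
    the subgradient collected as t (numerically) tends to 0."""
    t = t_init
    for _ in range(max_halvings):
        f_trial, s_trial = oracle(x + t * d)  # Step 1: call the black box
        if f_trial < f_x:
            return 'descent', t               # success: d is downhill
        t *= 0.5                              # Step 2: smaller t, loop
    return 'null', s_trial                    # stand-in for the cluster point s*
```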
Compared to the line-searches of §II.3, the present one is rather special, indeed.
No extrapolations are made: t is never too small. Our aim is not to find a new iterate
x₊ = x + td, but just to check the (unknown) sign of f′(x, d). Only the case t ↓ 0
is interesting for our present concern: it indicates that f′(x, d) ≥ 0; a new direction
must therefore be computed.
Then the key lies in the following simple result.
Proposition 1.3 For any x ∈ ℝⁿ, d ≠ 0 and t > 0 we have

    f(x + td) ≥ f(x) ⟹ ⟨s(x + td), d⟩ ≥ 0.

PROOF. Straightforward: because s(x + td) ∈ ∂f(x + td), the subgradient inequality
written at x gives

    f(x) ≥ f(x + td) + ⟨s(x + td), x − x − td⟩,

i.e. t⟨s(x + td), d⟩ ≥ f(x + td) − f(x) ≥ 0; divide by t > 0. □


This property, which relies upon convexity, has an obvious importance for our
problem. At Step 1 of the Line-search 1.2, we are entitled to use from the black box
(U1) not only f(x + td) but also s(x + td). Now, suppose in Algorithm 1.2 that t ↓ 0,
and suppose that we extract a cluster point s* from the corresponding {s(x + td)}ₜ.
Because the mapping ∂f(·) is outer semi-continuous (§VI.6.2), such an s* exists and
lies in ∂f(x). Furthermore, letting t ↓ 0 in Proposition 1.3 shows that

    ⟨s*, dₖ⟩ ≥ 0.

In other words: not only have we observed that dₖ was not a descent direction;
but we have also explained why: we have singled out an additional subgradient s*
which does not satisfy (1.5); therefore s* ∉ Sₖ. We can call sₖ₊₁ this additional
subgradient, and recompute a new direction satisfying (1.4) (with k increased by 1).
This new direction dₖ₊₁ will certainly be different from dₖ, so the process can be
safely repeated.

Remark 1.4 When doing this, we are exploiting the characterization of ∂f(x) de-
tailed in §VI.6.3(b). First of all, Proposition 1.3 goes along with Lemma VI.6.3.4:
s* is produced by a directional sequence {x + td}ₜ↓₀, and therefore lies in the face
of ∂f(x) exposed by d. Algorithm 1.2 is one instance of the process called there π,
constructing the subgradient s* = s_π(dₖ). In view of Theorem VI.6.3.6, we know
that, if we take "sufficiently many" such directional sequences, then we shall be able
to make up the whole set ∂f(x). More precisely (see also Proposition V.3.1.5), the
whole boundary of ∂f(x) is described by the collection of subgradients s_π(d), with
d describing the unit sphere of ℝⁿ.
The game is not over, though: it is prohibited to run Algorithm 1.2 infinitely many
times, one time per normalized direction - not even mentioning the fact that each run
takes an infinite computing time, with t tending to 0. Our main concern is therefore
to select the directions dₖ carefully, so as to obtain a sufficiently rich approximation
of ∂f(x) for some reasonable value of k. □

(b) Choosing the Sequence of Directions. To specify how each dₖ should be com-
puted, (1.4) leaves room for infinitely many possibilities. As observed in Remark 1.4
above, this computation is crucial if we do not want to exhaust a whole neighborhood
of x. To guide our choice, let us observe that the above mechanism achieves two things
at the same time:
- It performs a sequence of line-searches along the trial directions d₁, d₂, …, dₖ; the
hope is really to obtain a descent direction d, having σ_∂f(x)(d) < 0.
- It builds up a sequence {Sₖ} of compact convex polyhedra defined by (1.3), having
the property that

    {s(x)} = S₁ ⊂ Sₖ ⊂ Sₖ₊₁ ⊂ ∂f(x);

and Sₖ is hoped to "tend" to ∂f(x), in the sense that (1.6) should eventually imply
σ_∂f(x)(dₖ) < 0.

Accordingly, the aim of the game is double as well:
- We want dₖ to make the number σ_Sₖ(dₖ) as negative as possible: this will increase
the chances of having σ_∂f(x)(dₖ) negative as well.
- We also want the sequence of polyhedra {Sₖ} to fill up as much room as possible
within ∂f(x).
Fortunately, the above two requirements are not antagonistic but they really go
together: the sensible idea is to take dₖ minimizing σ_Sₖ(·). It certainly is good for
our first requirement, but also for the second. In fact, if the process is not going to
terminate with the present dₖ, we know in advance that ⟨sₖ₊₁, dₖ⟩ will be nonnegative.
Hence, making the present σ_Sₖ(dₖ) minimal, i.e. as negative as possible, is a good
way of making sure that the next Sₖ₊₁ will be as remote as possible from Sₖ.
When we decide to minimize the (finite) sublinear function σ_Sₖ, we place ourselves
in the framework of Chap. VIII. We know that a norming is necessary and, as agreed
in the introduction, we just choose the Euclidean norm. In summary, dₖ is computed
as the solution of

    min r
    ⟨sⱼ, d⟩ ≤ r for j = 1, …, k, (1.7)
    ‖d‖ = 1.

Once again, the theory of §VIII.1 tells us that it is a good idea to replace the normal-
ization by an inequality, so as to solve instead the more tractable problem

    min {½‖d‖² : −d ∈ Sₖ}. (1.8)
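
Writing the unknown as s = Σⱼ αⱼsⱼ with α in the unit simplex Δₖ, (1.8) becomes a small quadratic program in α. Here is a minimal sketch (ours) using scipy's general-purpose SLSQP solver; a dedicated quadratic-programming code would be used in practice:

```python
import numpy as np
from scipy.optimize import minimize

def project_origin(S):
    """Solve (1.8): return s_hat, the projection of the origin onto
    co{rows of S}, by minimizing 0.5*||sum_j a_j s_j||^2 over the
    unit simplex. The direction of (1.8) is then d = -s_hat."""
    k = len(S)
    obj = lambda a: 0.5 * float(np.dot(a @ S, a @ S))
    res = minimize(obj, np.full(k, 1.0 / k), method='SLSQP',
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}])
    return res.x @ S
```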

In summary: with Sₖ on hand, the best we have to do is to pretend that Sₖ coincides
with ∂f(x), and compute the "pretended steepest-descent direction" accordingly. If
Sₖ is [close to] ∂f(x), dₖ will be [close to] the steepest-descent direction. On the other
hand, a failure (dₖ not downhill) will indicate that Sₖ is too poor an approximation of
∂f(x); this is going to be corrected by an enrichment of it.

Remark 1.5 It is worth mentioning that, with dₖ solving (1.8), −‖dₖ‖² is an estimate of
f′(x, dₖ): there holds (see Remark VIII.1.3.7)

    f′(x, dₖ) ≥ σ_Sₖ(dₖ) = −‖dₖ‖²,

with equality if Sₖ = ∂f(x). This estimate is therefore an underestimate, supposedly improved
at each enrichment of Sₖ. □

(c) The Bundling Process The complete algorithm is now well-defined:

Algorithm 1.6 (Quest for a Descent Direction) Given x ∈ ℝⁿ, compute f(x) and
s₁ = s(x) ∈ ∂f(x). Set k = 1.
STEP 1. With Sₖ defined by (1.3), solve (1.8) to obtain dₖ. If dₖ = 0 stop: x is optimal.
STEP 2. Execute the line-search 1.2 with its two possible exits:
STEP 3. If the line-search has produced t > 0 with f(x + tdₖ) < f(x), then stop: dₖ
is a descent direction.
STEP 4. If the line-search has produced t ↓ 0 and sₖ₊₁ ∈ ∂f(x) such that

    ⟨sₖ₊₁, dₖ⟩ ≥ 0, (1.9)

then replace k by k + 1 and loop to Step 1. □
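
For concreteness, a sketch (ours) of the whole loop, reusing project_origin and mere_descent_linesearch from the sketches above; the exact test dₖ = 0 is replaced by a small tolerance, anticipating the test (2.1.7) of §2:

```python
import numpy as np

def quest_for_descent(oracle, x, tol=1e-6, max_iters=1000):
    """Algorithm 1.6 (sketch): return ('descent', d_k) with a descent
    direction at x, or ('optimal', s_hat) when ||s_hat|| <= tol.
    tol must be compatible with the accuracy of the QP solver."""
    f_x, s = oracle(x)
    bundle = [s]                                  # generators of S_k, cf. (1.3)
    for _ in range(max_iters):
        s_hat = project_origin(np.array(bundle))  # Step 1: solve (1.8)
        if np.linalg.norm(s_hat) <= tol:
            return 'optimal', s_hat               # x is (nearly) optimal
        status, out = mere_descent_linesearch(oracle, x, -s_hat, f_x)
        if status == 'descent':                   # Step 3: downhill direction
            return 'descent', -s_hat
        bundle.append(out)                        # Step 4: null-step, enrich S_k
    raise RuntimeError('iteration limit reached')
```

Note that, started at the minimum x = (0, 0) of (1.1) with oracle_max3 above, any such process must accumulate all three gradients before it can conclude that 0 ∈ Sₖ, since 0 lies in the convex hull of no two of them.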


Two questions must be addressed concerning this algorithm.
(i) Does it solve our problem, i.e. does it really stop at some iteration k? This will
be the subject of §2.
(ii) How can sₖ₊₁ be actually computed at Step 4? (a normal computer cannot extract
a cluster point).

Remark 1.7 Another question, which is not raised by the algorithm itself but rather by
§VIII.2.2, is the following: if we plug Algorithm 1.6 into a steepest-descent scheme such
as VIII.1.1.7 to minimize f, the result will probably converge to a wrong point. A descent
direction dₖ produced by Algorithm 1.6 is "at best" the steepest-descent direction, which is
known to generate zigzags. In view of (ii) above, we have laid down a non-implementable
computation, which results in a non-convergent method!
Chapter XIII and its successors will be devoted to remedying this double drawback. For
the moment, we just mention once more that the ambition of a minimization method should
not be limited to iterating along steepest-descent directions, which are already bad enough
in the smooth case. The remedy, precisely, will also result in improvements of the steepest-
descent idea. □

The scheme described by Algorithm 1.6 will be referred to as a bundling mech-
anism: the pieces of information obtained from f are bundled together, so as to
construct the subdifferential. Looping from Step 4 to Step 1 will be called a null-step.
The difference between a normal line-search and this procedure is that the latter does
not use the new point x + td as such, but only via the information returned by (U1):
the next line-search will start from the same x, which is not updated.

Example 1.8 Let us give a geometric illustration of the key step: the introduction
of sₖ₊₁ into Sₖ. Take again the function f of (1.1) and, for given x, consider the one-
dimensional function t ↦ q(t) := f(x − ts(x)).
If q(t) < q(0) for some t > 0, no comment. Suppose, therefore, that q(t) ≥ q(0)
for all t ≥ 0. This certainly implies that x = (ξ, η) is a kink; say η = 0, as in Fig. 1.1.
More importantly, the function that is active at x - here f₂, say, with s(x) = ∇f₂(x)
- is certainly active at no t > 0. Otherwise, we would have for t > 0 small enough

    q(t) = f₂(x − ts(x)) = q(0) − t⟨∇f₂(x), s(x)⟩ = q(0) − t‖∇f₂(x)‖² < q(0).

In other words: the black box (U1) is forced to produce some other gradient - here
∇f₁(x) - at any x − ts(x), no matter how small t > 0 is. The discontinuous nature
of s(·) is here turned into an advantage. In the particular example of the picture, once
this new gradient is picked, our knowledge of ∂f(x) is complete and we can iterate.
□

Remark 1.9 Note in Example 1.8 that, f₁ being affine, s(x − ts(x)) is already in ∂f(x) for
t > 0. It is not necessary to let t ↓ 0: the argument (ii) above is eliminated. This is a nice
feature of piecewise affine functions.
It is interesting to consider this example with §VIII.3.4 in mind. Specifically, observe
in the picture that an arbitrary s ∈ ∂f(x) is usually not in ∂f(x − ts), no matter how close
the positive t is to 0. There is, however, exactly one ŝ ∈ ∂f(x) satisfying ŝ ∈ ∂f(x − tŝ)
for t > 0 small enough; and −ŝ is precisely the Euclidean steepest-descent direction at x.
The same observation can be made in the example of §VIII.2.2, even though the geometry
is slightly different. Indeed, we have here an "explanation" of the uniqueness result in Theo-
rem VIII.3.4.1, combined with the stability result VIII.3.4.5(i). □

Needless to say, the bundling mechanism is not bound to the Euclidean norm ‖·‖.
Choosing an arbitrary |||·||| would amount to solving

    min r
    ⟨sⱼ, d⟩ ≤ r for j = 1, …, k,
    |||d||| = 1

instead of (1.7), or

    min {|||d|||* : −d ∈ Sₖ}

instead of (1.8). This would preserve the key property that σ_Sₖ(dₖ) is as negative as possible, in
some different sense. As an example, we recall from Proposition VIII.1.3.4 that, if a quadratic
norm ⟨d, Qd⟩^½ is chosen, the essential calculations become

    ŝₖ := argmin {⟨s, Q⁻¹s⟩ : s ∈ Sₖ}, dₖ := −Q⁻¹ŝₖ,

which gives the directional derivative estimate of Remark 1.5:

    f′(x, dₖ) ≥ σ_Sₖ(dₖ) = −⟨dₖ, Qdₖ⟩.
Among all possible norms, could there be one (possibly depending on k) such that the
bundling Algorithm 1.6 converges as fast as possible? This is a fascinating question indeed,
which is important because the quality of algorithms for nonsmooth optimization depends
directly on it. Unfortunately, no clear answer can be given with our present knowledge. Worse:
the very meaning of this question depends on a proper definition of "as fast as possible", and
this is not quite clear either, in the context of nonsmooth optimization.
On the other hand, Remark 1.9 above reveals a certain property of "stability" of the
steepest-descent direction, which is special to the Euclidean norming.

2 Convergence Properties

In this section, we study the convergence properties of the schematic bundling process
described as Algorithm 1.6. Its convergence and speed of convergence will be ana-
lyzed by techniques which, once again, serve as a basis for virtually all minimization
methods to come.

2.1 Convergence

We will use the notation Sₖ of (1.3); and, to make the writing less cumbersome, we
will denote by

    Proj v/A

the Euclidean projection of a vector v onto the closed convex hull of a set A.
From the point of view of its convergence, the process of §1 can be reduced
to the following essentials. It consists of generating two sequences of ℝⁿ: {sₖ} (the
subgradients) and {dₖ} (the directions, or the projections). Knowing that

    −dₖ = ŝₖ := Proj 0/Sₖ ∈ ∂f(x) for k = 1, 2, …,

our sequences satisfy the fundamental properties:

    ⟨sⱼ, ŝₖ⟩ ≥ ‖ŝₖ‖² for j = 1, …, k (2.1.1)

(minimality conditions for the projection) and

    ⟨sₖ₊₁, ŝₖ⟩ ≤ 0, (2.1.2)

which is (1.9). Furthermore, we know from the boundedness property (1.2) that

    ‖ŝₖ‖ ≤ ‖sₖ‖ ≤ M for k = 1, 2, … (2.1.3)

All our convergence theory is based on the property that ŝₖ → 0 if Algorithm 1.6
loops forever.

Before giving a proof, we illustrate this last property in Fig. 2.1.1. Suppose ∂f(x) lies
entirely in the upper half-space. Start from s₁ = (−1, 0), say (we assume M = 1). Given
ŝₖ somewhere as indicated in the picture, sₖ₊₁ has to lie below the dashed line. "At worst",
this sₖ₊₁ has norm 1 and satisfies equality in (2.1.2). The picture does suggest that, when the
operation is repeated, the ordinate of ŝₖ₊₁ must tend to 0, implying ŝₖ → 0.
The property ŝₖ → 0 results from the combination of the inequalities (2.1.1), (2.1.2),
(2.1.3): they ensure that the triangle formed by 0, ŝₖ and sₖ₊₁ is not too elongated. Taking
(2.1.3) for granted, a further examination of Fig. 2.1.1 reveals what exactly (2.1.1) and (2.1.2)
are good for: the important thing is that the projection of the segment [ŝₖ, sₖ₊₁] along the
vector ŝₖ has a length which is not "infinitely smaller" than ‖ŝₖ‖.

Fig. 2.1.1. Why the bundling process converges

(a) A Qualitative Result. Establishing mere convergence of ŝₖ is easy:

Lemma 2.1.1 Let m > 0 be fixed. Consider two sequences {sₖ} and {ŝₖ}, satisfying
for k = 1, 2, …

    ⟨sⱼ − sₖ₊₁, ŝₖ⟩ ≥ m‖ŝₖ‖² for j = 1, …, k. (2.1.4)

If, in addition, {sₖ} is bounded, then ŝₖ → 0 as k → +∞.


PROOF. Using the Cauchy-Schwarz inequality in (2.1.4) gives

    ‖ŝₖ‖ ≤ (1/m)‖sⱼ − sₖ₊₁‖ for all k ≥ 1 and j = 1, …, k.

Suppose that there is a subsequence {ŝₖ}ₖ∈N₁, N₁ ⊂ ℕ, such that

    0 < ε ≤ ‖ŝₖ‖ ≤ (1/m)‖sⱼ − sₖ₊₁‖ for all k ∈ N₁ and j = 1, …, k.

Extract from the (bounded) corresponding subsequence {sₖ₊₁} a convergent subse-
quence and take j preceding k + 1 in that last subsequence to obtain a contradiction.
□

Corollary 2.1.2 Consider the bundling Algorithm 1.6. If x is not optimal, there must
be some integer k with f′(x, dₖ) < 0.

PROOF. For all k ≥ 1 we have from (2.1.1), (2.1.2)

    ⟨sⱼ, ŝₖ⟩ ≥ ‖ŝₖ‖² for j = 1, …, k (2.1.5)

    ⟨sₖ₊₁, ŝₖ⟩ ≤ 0, (2.1.6)

so Lemma 2.1.1 applies (with m = 1). Being in the compact convex set ∂f(x), {ŝₖ}
cannot tend to 0 if 0 ∉ ∂f(x). Therefore Algorithm 1.6 cannot loop forever. Since it
cannot stop in Step 1 either, the only possibility left is a stop in Step 3. □

Figure 2.1.2 is another illustration of this convergence property. Along the first d₁ = −s₁,
the s₂ produced by the line-search has to be in the dashed area. In the situation displayed in the
picture, d₂ will separate ∂f(x) from the origin, i.e. d₂ will be a descent direction (§VIII.1.1).

Fig. 2.1.2. Two successive projections

Lemma 2.1.1 also provides a stopping criterion for the case of an optimal x. It suffices
to insert in Step 1 of Algorithm 1.6 the test

    ‖ŝₖ‖ ≤ δ, (2.1.7)

where δ > 0 is some pre-specified tolerance. As explained in Remark VIII.1.3.7, −dₖ can be
viewed as an approximation of "the" gradient of f at x; (2.1.7) thus appears as the equivalent
of the "classical" stopping test (II.1.2.1). When (2.1.7) is inserted, Algorithm 1.6 stops in any
case: it either produces a descent direction, or signals (approximate) optimality of x.
There are several ways of interpreting (2.1.7) to obtain some information on the quality
of x:
- First we have by the Cauchy-Schwarz inequality:

    f(y) ≥ f(x) + ⟨ŝₖ, y − x⟩ ≥ f(x) − δ‖y − x‖ for all y ∈ ℝⁿ. (2.1.8)

If x and a minimum of f are a priori known to lie in some ball B(0, R), this gives an
estimate of the type

    f(x) ≤ inf f + 2δR.

- Second, x minimizes the perturbed function

    y ↦ f₁(y) := f(y) + δ‖y − x‖;

this can be seen either from (2.1.8) or from the formula ∂f₁(x) = ∂f(x) + B(0, δ).
- Finally consider another perturbation:

    y ↦ f₂(y) := f(y) + ½c‖y − x‖²,

for some fixed c > 0; then B(x, (2/c)δ) contains all the minima of f₂. Indeed, for any s ∈ ∂f(x),

    f₂(y) − f(x) ≥ ⟨s, y − x⟩ + ½c‖y − x‖² ≥ ‖y − x‖ [½c‖y − x‖ − ‖s‖];

thus, the property f₂(y) ≤ f₂(x) = f(x) implies ‖y − x‖ ≤ (2/c)‖s‖.

Remark 2.1.3 The number m in Lemma 2.1.1 is not used in Corollary 2.1.2. It will
appear as a tolerance for numerical implementations of the bundling Algorithm 1.6.
In fact, it will be convenient to set

    m = m″ − m′ with 0 < m′ < m″ < 1. (2.1.9)

- The tolerance m′ plays the role of 0 and will actually be essential for facilitating the
line-search: with m′ > 0, the test

    ⟨sₖ₊₁, dₖ⟩ ≥ −m′‖dₖ‖²

is easier to meet than ⟨sₖ₊₁, dₖ⟩ ≥ 0.
- The tolerance m″ plays the role of 1 and will be helpful for the quadratic program
computing ŝₖ: with m″ < 1, the test

    ⟨sⱼ, ŝₖ⟩ ≥ m″‖ŝₖ‖² for j = 1, 2, …, k

is easier to meet than (2.1.1).
- Finally m″ − m′ must be positive, to allow application of Lemma 2.1.1. Altogether,
(2.1.9) appears as the convenient choice. In practice, values such as m′ = 0.1,
m″ = 0.9 are reasonable.
For an illustration, see Fig. 2.1.3, where the dashed area represents the current Sₖ
of Fig. 2.1.1. Everything is all right if ŝₖ lies above D″ and sₖ₊₁ lies below D′, so that
the thick segment has a definitely positive length. □

Fig. 2.1.3. Tolerances for convergence

(b) A Quantitative Result and its Consequences. The next result is a more accurate
alternative to Lemma 2.1.1; it will be useful for numerical implementations, and also
when we study speeds of convergence. Here, s and ŝ play the role of sₖ₊₁ and ŝₖ
respectively; m′ is the m′ of Remark 2.1.3; as for z, it is an alternative to ŝₖ₊₁.

Lemma 2.1.4 Let s and ŝ (ŝ ≠ 0) be two vectors satisfying the relation

    ⟨s, ŝ⟩ ≤ m′‖ŝ‖² (2.1.10)

for some m′ < 1; let M ≥ max {‖s‖, ‖ŝ‖} be an upper bound for their norms; call z
the projection of the origin onto the segment [s, ŝ] and set

    μ := (1 − m′)² ‖ŝ‖² / (M² − m′²‖ŝ‖²) > 0. (2.1.11)

Then the following inequality holds:

    ‖z‖² ≤ ‖ŝ‖² / (1 + μ). (2.1.12)

Furthermore, if equality holds in (2.1.10) and ‖s‖ = M, equality holds in (2.1.12).
PROOF. Develop

    ‖αŝ + (1 − α)s‖² = α²‖ŝ‖² + 2α(1 − α)⟨s, ŝ⟩ + (1 − α)²‖s‖²

and consider the function

    ℝ ∋ α ↦ φ(α) := α²‖ŝ‖² + 2α(1 − α)m′‖ŝ‖² + (1 − α)²M².

By assumption, ‖αŝ + (1 − α)s‖² ≤ φ(α) for α ∈ [0, 1], so we have the bound

    ‖z‖² ≤ min {φ(α) : α ∈ [0, 1]},

which is exact if (2.1.10) holds as an equality and ‖s‖ = M.
Now observe that

    φ′(0) = 2m′‖ŝ‖² − 2M² ≤ 2(m′ − 1)M² < 0,
    φ′(1) = 2(1 − m′)‖ŝ‖² > 0,

so the minimum of φ on [0, 1] is actually unconstrained. Then the minimal value is
given by straightforward calculations:

    φ_min = [(M² − m′²‖ŝ‖²) / (M² − 2m′‖ŝ‖² + ‖ŝ‖²)] ‖ŝ‖²

(which, of course, is nonnegative). Write this equality as φ_min = c‖ŝ‖². This defines

    μ [= 1/c − 1] := (M² − 2m′‖ŝ‖² + ‖ŝ‖²) / (M² − m′²‖ŝ‖²) − 1

and another straightforward calculation shows that this μ is just (2.1.11). It is non-
negative because its denominator is positive: M² − m′²‖ŝ‖² ≥ (1 − m′²)‖ŝ‖² > 0. □
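
For the reader who wants to check the "straightforward calculations", here they are in full (our expansion, in the shorthand a = ‖ŝ‖², b = m′‖ŝ‖², c = M²):

```latex
\begin{align*}
\varphi(\alpha) &= a\alpha^{2} + 2b\,\alpha(1-\alpha) + c\,(1-\alpha)^{2}
                 = (a-2b+c)\,\alpha^{2} + 2(b-c)\,\alpha + c,\\
\alpha^{*}      &= \frac{c-b}{a-2b+c},\qquad
\varphi_{\min}   = c-\frac{(c-b)^{2}}{a-2b+c}=\frac{ac-b^{2}}{a-2b+c}
                 = \frac{M^{2}-m'^{2}\|\hat s\|^{2}}
                        {M^{2}-2m'\|\hat s\|^{2}+\|\hat s\|^{2}}\,\|\hat s\|^{2}.
\end{align*}
```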

A first use of this additional convergence result appears when considering nu-
merical implementations. In the bundling Algorithm 1.6, Sₖ is defined by k vectors.
Since k has no a priori upper bound, an infinite memory is required for the algorithm
to work. Furthermore, the computing time necessary to solve (1.8) is also potentially
unbounded. Let us define, however, the following extension of Algorithm 1.6.

Algorithm 2.1.5 (Economic Bundling for a Descent Direction) Given x ∈ ℝⁿ,
compute f(x) and s₁ = s(x) ∈ ∂f(x); set S′₁ := {s₁}, k = 1. The tolerances δ > 0
and m′ < 1 are also given.
STEP 1 (direction-finding and stopping criterion). Compute ŝₖ := Proj 0/S′ₖ. Stop if
‖ŝₖ‖ ≤ δ: x is nearly optimal. Otherwise set dₖ := −ŝₖ.
STEP 2 (line-search). Execute the line-search 1.2 with its two possible exits:
STEP 3 (descent direction found). If the line-search has produced a stepsize t > 0
with f(x + tdₖ) < f(x), then stop: dₖ is a descent direction.
STEP 4 (null-step, update S′). Here, the line-search has produced t ↓ 0 and sₖ₊₁ ∈
∂f(x) such that

    ⟨sₖ₊₁, dₖ⟩ ≥ −m′‖dₖ‖².

Take for S′ₖ₊₁ any compact convex set included in ∂f(x) but containing at least
ŝₖ and sₖ₊₁.
STEP 5 (loop). Replace k by k + 1 and loop to Step 1. □

The idea exploited in this scheme should be clear, but let us demonstrate the
mechanism in some detail. Suppose for example that we do not want to store more
than 10 generators to characterize the bundle of information S′ₖ. Until the 10th it-
eration, Algorithm 1.6 can be reproduced normally: the successive subgradients sₖ
can be accumulated one after the other, S′ₖ is just Sₖ. After the 10th iteration has been
completed, however, there is no room to store s₁₁. Then we can carry out the following
operations:
- Extract from {1, 2, …, 10} a subset containing at most 8 indices - call K₁₀ this
subset, let it contain k′ ≤ 8 indices.
- Delete all those sⱼ such that j ∉ K₁₀. After renumbering, the subgradients that are
still present can be denoted by s₁, …, sₖ′. We have here a "compression" of the
bundle.
- Append ŝ₁₀ and s₁₁ to the above list of subgradients and define S′₁₁ to be the convex
hull of s₁, …, sₖ′, ŝ₁₀, s₁₁.
- Then loop to perform the 11th iteration.
This is just an example, other compression mechanisms are possible. An impor-
tant observation is that, at subsequent iterations, the polyhedron Sk (k > 10) will be
generated by vectors of several origins: (i) some will be original subgradients, com-
puted by the black box (Ul) during previous iterations; (ii) some will be projections St
for some 10 :s:;; t < k, i.e. convex combinations of those in (i); (iii) some will be "pro-
jections of projections" (after two compressions at least have been made), i.e. convex
combinations of those in (i) and (ii); and so on. The important thing to understand is
that all these generators are in af(x) anyway: we have Sk c af(x) for all k.
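
A sketch (ours) of one such compression step, with a storage cap of 10 generators; the deletion rule used here, keeping the generators with the largest convex multipliers, is one simple variant of the strategies of Remark 2.1.6 below:

```python
import numpy as np

MAX_BUNDLE = 10  # cap on the number of stored generators

def compress_and_append(bundle, alpha, s_hat, s_new):
    """If appending s_new would exceed the cap, keep only the
    MAX_BUNDLE - 2 old generators with the largest multipliers alpha,
    then re-insert s_hat; Step 4 of Algorithm 2.1.5 requires the new
    bundle to contain at least s_hat and s_{k+1}."""
    bundle = np.asarray(bundle, dtype=float)
    if len(bundle) + 1 > MAX_BUNDLE:
        keep = np.argsort(alpha)[-(MAX_BUNDLE - 2):]  # kept indices, renumbered
        bundle = np.vstack([bundle[keep], s_hat])
    return np.vstack([bundle, s_new])
```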

Remark 2.1.6 Returning to the notation of §VIII.2.1, suppose that solving the quadratic
problem (VIII.2.1.10) at the 10th iteration has produced ᾱ ∈ Δ ⊂ ℝ¹⁰. Then choose K₁₀.
If K₁₀ happens to contain each i ≤ 10 such that ᾱᵢ > 0, it is not necessary to append
subsequently ŝ₁₀: it is already in co {s₁, …, sₖ′}.
A possible strategy for defining the compressed set K₁₀ is to delete first all those sⱼ such
that ᾱⱼ = 0, if any (one is enough). If there is none, then call for a more stringent deletion
rule; for example delete the two subgradients with largest norm. □

Lemma 2.1.4 suggests that the choice of K₁₀ above is completely arbitrary -
including K₁₀ := ∅ - leaving convergence unaffected. This is confirmed by the next
result.

Theorem 2.1.7 Algorithm 2.1.5 stops for some finite iteration index, producing either
a downhill dₖ or an ŝₖ ∈ ∂f(x) of norm less than δ.

PROOF. Suppose for contradiction that the algorithm runs forever. Because of Step 4,
we have at each iteration

    ‖ŝₖ₊₁‖² ≤ ‖ŝₖ‖² / (1 + μₖ), (2.1.13)

where μₖ is defined by (2.1.11) with ŝ replaced by ŝₖ. If the algorithm runs forever,
the stop of Step 1 never operates and we have

    ‖ŝₖ‖ > δ for k = 1, 2, …

Set τ := m′²‖ŝₖ‖²/M² ∈ [0, 1[ and observe that the function τ ↦ τ/(1 − τ) is
increasing on [0, 1[. This implies

    μₖ > (1 − m′)² δ² / (M² − m′²δ²) =: μ̄ > 0 for k = 1, 2, …

and we obtain with (2.1.13) the contradiction

    0 < δ² ≤ ‖ŝₖ₊₁‖² ≤ [1/(1 + μ̄)] ‖ŝₖ‖² for k = 1, 2, … □

2.2 Speed of Convergence

As demonstrated by Fig. 2.1.1, the efficiency of a bundling algorithm such as 2.1.5


directly depends on how fast IISk II decreases at each iteration. In fact, the algorithm
must stop at the latest when IISk II has reached its minimal value, which is either 8 or
a
the distance from the origin to f (x).
Here Lemma 2.1.4 comes in again. It was used first to control the storage needed
by the algorithm; a second use is now to measure its speed. Examining (2.1.11),
(2.1.12) with S = Sk and z = Sk+h we see that IL = ILk tends to 0 with Sk, and it is
impossible to bound IISk 112 by an exponential function ofk: a linear rate ofconvergence
is impossible to obtain. Indeed the majorization expressed by (2.1.12) becomes fairly
weak when k grows. The following general result quantifies this.

Lemma 2.2.1 Let {8k} be a sequence 0/nonnegative numbers satisfying, with A > 0
fixed,

(2.2.1)

Then we have/or k = 1,2, ...


(2.2.2)
16 IX. Inner Construction of the Subdifferential

PROOF. We prove the "="-part by induction on k. First, (2.2.2) obviously holds for
k = 1. Assume that the equality in (2.2.2) holds for some k. Plug this value of 8k into
(2.2.1) and work out the algebra to obtain after some manipulations

>..8 >..8
8k+1 = k8 1 +1 A. = (k + 1)81 +1 A. - 81
.
The recurrence is established. The " ~ "- and" ~ "-parts follow since the function
x 1-+Ax / (A. + x) is increasing on R+. 0

In other words, a nonnegative sequence yielding equality in (2.2.1) tends to 0,


with an asymptotic speed of 1/ k. This is very slow indeed; it is fair to say that an
algorithm behaving so slowly must be considered as not convergent at all. We now
proceed to relate the bundling algorithm to the above analysis.

<a> Worst Possible Behaviour. Our analysis so far allows an immediate bound on
the speed of convergence of the bundling algorithm:

Corollary 2.2.2 There holds at each iteration k ofAlgorithm 2.1.5

A 1
II skll ~ /LM. (2.2.3)
(1 - m').., k

PROOF. First check that (2.2.3) holds for k = 1. Then use Lemma 2.1.4: we clearly
have in (2.1.11)

JLr
~ (1 _
m
')2I1 sM2'
kIl 2 •
plug this majorization into (2.1.12) to obtain

with A. := M 2/(I-m')2. The"~"-partinLemma2.2.1 applies (with8k = IIskll2


and observing that A. > M ~ IIs1112); (2.2.3) follows by taking square roots. 0

Thus a bundling algorithm must reach its stopping criterion after not more than
[M / (1- m')8]2 iterations - and we already mentioned that this is a prohibitive number.
Suppose that the estimate (2.2.3) is sharp; then it may take like a million iterations,
just to obtain a subgradient of norm 10-3 M. Now, if we examine the proof of The-
orem 2.1.7, we see that the only upper bound that is possibly not sharp is the first
inequality in (2.1.13). From the second part of Lemma 2.1.4, we see that the equality

2
+ II SklF II SkA 112
II ProJ. O/{ASb Sk+1 }112 = II Sk+II1S112k+111

does hold at each iteration for which (Sk, Sk+1) = O.


The poor speed resulting from Corollary 2.2.2 is therefore the real one if:
2 Convergence Properties 17

(i) the generated subgradients do not cluster to 0 (say all have roughly the same
norm M),
(ii) they are all approximately orthogonal to the corresponding Sk (so (2.1.10) be-
comes roughly an equality), and
(iii) a cheap form of Algorithm 2.1.5 is used (say, S~+) is systematically taken as the
segment [Sko sk+Il, or a compact convex polyhedron hardly richer).

The following example illustrates our point.

Example 2.2.3 Take the function

and let x := (0,0) be the reference point in Algorithm 2.1.5. Suppose S) = (0, 1)
(why not!). Then the first line-search is done along (0, -1). Suppose it produces
S2 = (1,0), so d2 = (-1, -1) (no compression is allowed at this stage). Then S3 is
certainly (-1, 0) (see Fig. 2.2.1) and our knowledge of af (0) is complete.

-~

Fig. 2.2.1. Compression gives slow convergence

From then on, suppose that the "minimal" bundle {Sk, sk+Il is taken at each
iteration. Calling (0-, i) E ]R2 the generic projection S (note: i > 0), the generic
line-search yields the next subgradient S = (-0- / 10- I, 0). Now observe the symmetry
in Fig. 2.2.1: the projections of S2 and S3 onto S are opposite, so we have

(S, s) = -lIslI 2 .

Thus (2.1.10) holds as an equality with m' := -1, and we conclude from the last
part of Lemma 2.1.4 that

2 1 - II Skll 2 2 Ak 2 t4
II SH III
A A A

= I+3IskIF"skl = Ak+4l1Skll2l1skll ork=2,3, ...

(hereAk = I-lIskll2 ~ 1). With Lemma 2.2. 1 in mind, it is clear that IISkll2 ~ I/(4k).
o
18 IX. Inner Construction of the Subdifferential

(b) Best Possible Behaviour. We are naturally led to the next question: might the
bound established in Corollary 2.2.2 be unduly pessimistic? More specifically: since
the only non-sharp majorization in our analysis is II Sk+ I II :::;; IIzll, where z denotes
Proj O/[Sk, sHtl,dowehave IISHIII « liz II? The answerisno; the bundling algorithm
does suffer a sublinear convergence rate, even if no subgradient is ever discarded:

Counter-example 2.2.4 To reach the stopping criterion IIskll :::;; 8, Corollary 2.2.2
tells us that an order of (M /8)2 iterations are sufficient, no matter how {Sk} is generated.
As a partial converse to this result, we claim that a particular sequence {skI can be
constructed, such that an order of M / 8 iterations are indeed necessary.
To substantiate our claim, consider again the example of Fig. 2.1.1, with M = 1;
but place the first iterate close to 0, say at (-8,0), instead of (-1,0) (see the left part
of Fig. 2.2.2); then, place all the subsequent iterates as before, namely take each sHI
of norm 1 and orthogonal to Sk. Look at the left part of Fig. 2.2.2 to see that Sk is, as
before, the projection of the origin onto the line-segment [Slo Sk]; but the new thing
is that IIsllI « IIsk II already for k = 2: the majorization obtained in Lemma 2.1.4
(which is sharp, knowing that S plays now the role of SI), becomes weak from the very
beginning.

o e
Fig. 2.2.2. Perverse effect of a short initial subgradient

These observations can be quantified, with the help of some trigonometry: the
angle f:)k being as drawn in the right part of Fig. 2.2.2, we have

(2.2.4)

Draw the angle a = f:)HI - f:)k and observe that sina = IISHIII = 8cosf:)HI to
obtain
8cosf:)k+1 = sinf:)k+1 cosf:)k - cosf:)k+1 sinf:)k.
Thus, the variable Uk := tanf:)k satisfies the recurrence formula

(2.2.5)

which shows in particular that {Uk} is increasing.


2 Convergence Properties 19

Nowtakes = 8J2.Inviewof(2.2.4), the event IIskll ~ 8 happens when Ok ~ 1'(/4.


Our claim thus amounts to showing that the process (2.2.5) needs 1/8 ~ 1/s iterations
to reach the value Uk ~ 1. This, however, can be established: we have

which, together with the value U 1 = 0, shows that

uk+1 < ksJ2 if U2 < ... < Uk < 1 .

Thus the event Uk+1 ~ 1 requires k ~ l/(sJ2) = 1/(28); our claim is proved. 0

The above counter-example is slightly artificial, in that SI is "good" (i.e. small), and the
following subgradients are as nasty as possible. The example can be modified, however, so
that all the subgradients generated by the black box (Ul) have comparable norms. For this, add
a third coordinate to the space, and start with two preliminary iterations: the first subgradient,
say So, is (-e, 0, 1), so So = So = (-e, 0, 1). For the next subgradient, take SI = (-e, 0, -1),
which obviously satisfies (2.1.2); it is easy to see that SI is then (-e, 0, 0). From then on,
generate the subgradients S2 = (0, 1, 0) and so on as before, all having their third coordinate
null. Draw a picture if necessary to see that the situation is then exactly the same as in 2.2.4,
with SI = (-e, 0, 0) replacing the would-be SI = (-e, 0): the third coordinate no longer
plays a role.
In this variant, a wide angle «(SI, so) « 0) has the same effect as a small initial subgradi-
ent. The message from our counter-examples is that, in a bundling algorithm, an exceptionally
good event (small subgradient, or wide angle) must subsequently be paid fur by a slower con-
vergence.

Among other things, our Counter-example 2.2.4 suggests that the choice of
S~+I c Sk+1 plays a minor role. A situation worth mentioning is one in which
this choice is strictly irrelevant: when all the gradients are mutually orthogonal. Re-
membering Remark 11.2.4.6, assume

(Sj,Sj)=O foralli,j=l, ... ,kwithi#j. (2.2.6)

Then the projection of the origin onto CO{SI, ••• , skl and onto [Sk-l> Sk] coincide.
This remark has some important consequences. Naturally, (2.2.6) cannot hold
forever in a finite-dimensional space. Instead oflRn , consider therefore the space £2 of
square-summable sequences S = (SI, s2, ... ). Let Sk be the kth vector of the canonical
Hilbertian basis of this space:

Sk := (0, ... , 0, 1, 0, ... ) with the "1" in position k

so that, of course, (Sk,sk') = 8kk ,. Then the projection is particularly simple to


compute:
k
Sk := PrO]
A '

O/{sJ, ... , skl = k1",


~Sj.
J=I

This is because
20 IX. Inner Construction of the Subdifferential

k
IISk 112 = ;2 ~ IISj 112 = ~ = (Sk' Sj) for j = I, ... , k , (2.2.7)
J=I

which characterizes the projection. It is also clear that

Thus, the above sequence {Sk} could well be generated by the bundling Algo-
rithm 1.6. In fact, such would be the case with

if(U I ) computed the smallest index giving the max, and ifthe algorithm was initialized
on the reference point x = o.
In this case, the majorization (2.2.3) would be definitely
sharp, as predicted by our theory and confirmed by (2.2.7). It may be argued that
numerical algorithms, intended for implementation on a computer, are not supposed
to solve infinite-dimensional problems. Yet, if n is large, say n ~ 100, then (2.2.6)
may hold till k = 100; in this case, it requires 100 iterations to reduce the initial lis II
by a mere factor of 10.

Remark2.2.5 Let us conclude. When k ~ +00, IISkll2 in Algorithm 2.1.5 is at least as


"good" as 1/ k, even if the bundle SA: is kept minimal, just containing two subgradients. On
Sk
the other hand, IISkll2 can be as bad as 1/ k2, at least if is kept minimal, or also when the
dimension of the space is really large. In particular, a speed of convergence of the form

(2.2.8)

with a rate q < 1 independent of the particular function f, cannot hold for our algorithm -
at least when the space is infinite-dimensional. There is actually a reason for that: according
to complexity theory, no algorithm can enjoy the "good" behaviour (2.2.8); or if it does, the
rate q in (2.2.8) must tend to 1 when the dimension n of the space tends to infinity; in fact q
must be "at best" of the form q ~ 1 - 1/ n.
Actually, complexity theory also tells us that the estimate (2.2.3) cannot be improved,
unless n is taken explicitly into account; a brief and informal account of this complexity result
is as follows. Consider an arbitrary algorithm, having called our black box (UI) at k points
XI, .•• , Xl> and denote by SI, ••. , Sk the output from (UI). Then, an estimate of the typt.

(2.2.9)

is impossible to prove independently of n and of (UI). In particular: no matter how the


algorithm selects its kth point Xl> an Sk can be produced (like in our Counter-example 2.2.4)
so that (2.2.8) does not hold independently of n. Of course, it is a particularly nasty black
box that does the trick: it uses its own output {(f(Xj), Sj)}j<k to concoct its kth answer
(f(Xk), Sk). The corresponding convex function f is not fixed a priori, but is recursively
constructed, depending on the sequence {Xk} chosen by the algorithm.
This result gives rise to an additional and frustrating observation: comparing the be-
haviour (2.2.9) with (2.2.3), we see that the bundling algorithm, although apparently very
weak, is "optimal" in some sense. 0
2 Convergence Properties 21

(c) Practical Behaviour. Our analysis in (a) and (b) above predicts a certain (medio-
cre) convergence speed of Algorithm 2.1.5; it also indicates that the choice of Sk+ I has
little influence on this speed. Unfortunately, this is blatantly contradicted empirically.
Some numerical experiments will give an idea of what is observed in practice.

Test-problem 2.2.6 (TR48) Given a symmetric n x n matrix A, and two vectors d


and s in Rn , consider the piecewise affine objective function
n
f(x) := I)SiXi + di max {ail - XJ, ai2 - X2, ... , ain - xn}] . (2.2.10)
i=1

To minimize this f originates in the problem of minimizing the cost of trans-


porting goods between pairs of locations (after the duality transformation of §VII.4
is performed on the mathematical formulation of this problem). Formulation (2.2.10)
is not a particularly good one for obtaining effective solutions of transportation prob-
lems; but it is an excellent academic example to illustrate some of our points.
Needless to say, the black box (VI) computing f(x) and s(x) works as follows:
for each i, select some j - call it j (i) - such that aij (i) - Xj (i) is maximal. Then form
n
f(x) = ~)SiXi + di (aij(i) - Xj(i»]
i=1

and compute s(x) accordingly.


In what follows, the standard dot-product (x, y) = X T yon Rn will be used. Call
u the vector of Rn whose coordinates are 1, 1, ... , 1. Because
n
f(x +ru) = f(x) + r ~)Si - di) for all r e R,
i=1

we assume that u T (s-d) = 0; otherwise,f would have no minimum. This assumption


implies
STU = 0 for all x eRn and s e af(x)
(easy exercise). Thus, fixing the starting point at x = 0, we are actually faced with a
problem posed in Rn - I , the subspace o1'!Jlogonal to u.
We have selected a particular instance of the above problem, in which n = 48 and
the minimal value is 1 = -638565. The data are organized in such a way that, at a
minimum point i, there are for each i an average of 2-3 terms that yield the max in
(2.2.10). As it happens, the number of possible outputs s(i) for the black box (VI) is
of the order 1011. 0

Apply Algorithm 2.1.5 to this example, the starting point x being a minimum i.
Then no stop can occur at Step 3 and the algorithm loops between Step 4 and Step 1
until a small enough USkU is found.
We have tested two forms of Algorithm 2.1.5:
- The first form is the most expensive possible, in which no subgradient is ever deleted:
at each iteration, Sk+ I = SH I .
22 IX. Inner Construction of the Subdifferential

- The second form is the cheapest possible: all subgradients are systematically deleted
and Sk+1 = [st, sk+d at each iteration.
Figure 2.2.3 displays the respective evolutions of IIskll, as a function of k. We
believe that it illustrates well enough how pessimistic (2.l.I3) can be. The same phe-
nomenon is observed with the test-problem MAXQUAD ofVIII.3.3.3: with the expensive
form, the squared norm of the subgradient is divided by 105 in 3 iterations; with the
cheap form, it is divided by 102 , and then stagnates there.

cheap form

-1 expensive form
-2~__~~__~~______~~______~~
100 200 300 k
Fig. 2.2.3. Efficiency of storing subgradients in a bundling algorithm; TR48

In view of Remark 2.2.5, one could expect that the cheap and expensive forms
tend to equalize when the dimension n increases. To check this, we introduce another
example:

Test-problem 2.2.7 (TSP) Consider a complete graph, i.e. a set of n nodes ("cities")
and m = 1/2n(n - 1) arcs ("routes", linking each pair of cities); let a cost ("length")
be associated with each arc. The traveling salesman problem is that of constructing
a path visiting all the cities exactly once, and having the shortest length. This is not
simple, but a closely related problem is to minimize an ordinary piecewise affine
function associated with the graph, say

IRn 3 X ~ f(x) := max Is! x + bi : i E T} , (2.2.11)

where x is a vector of "dummy costs", associated with the nodes. We do not give
a precise statement of this last problem; let us just say that the basis is once again
duality, introduced in a fashion to be seen in Chap. XII.
Here the index-set T (the set of "I-trees" in the graph) is finite but intractable:
ITI:::: nn.However,findingforgivenxanindexk E Trealizingthemaxin{2.2.II)isa
reasonable task, which can be performed in 0 (m) =
0 (n 2) operations. Furthermore,
the resulting Sk is an integer vector.
This example will be used with several datasets corresponding to several values
ofn, and referred to as Tspn. 0

Take first TSP442, having thus n = 442 variables; the minimal value of f turns out
to be -50505. Just as with TR48, we start from an optimal i and we run the two forms of
Algorithm 2.1.5. With this larger test-problem, Fig. 2.2.4 displays an evolution similar
to that of TR48; but at least the cheap variant is no longer "non-convergent": the two
curves can be drawn on the same picture.
2 Convergence Properties 23

o
'.
\

-1 "'~~~~~~~_____ cheap form


-2 expensive form----------------------____________ _
-4----~----------------------------------------------------------------~k
200 1000 3000
Fig. 2.2.4. Efficiency of storing subgradients in a bundling algorithm; TSP442

o
-1

-2
k
50 100 600
Fig. 2.2.5. Efficiency of storing subgmdients in a bundling algorithm; TSP1173

Figure 2.2.5 is a further illustration, using another dataset called TSP1173. A new
feature is now the unexpectedly small number of iterations, which can be partly
explained by the special structure of the subgradients Sk:
(i) All their coordinates have very small values, say in the range {-I, 0, I}, with
exceptionally some at ±2; as a result, it becomes easier to form the O-vector with
them.
(ii) More importantly, most of their components are systematically 0, for all x near
the optimal reference point: actually, less than 100 of them ever become nonzero;
it follows that the sequence {Sk} evolves in a space of dimension not more than
100 (instead of 1173). In TSP442, this space has dimension 150; and we recall that
in TR48, it has dimension exactly 47.
Remark 2.2.8 The snaky shape of the curves in Figs. 2.2.3,2.2.5 (at least for the expensive
variant) is probably due to piecewise affine chamcter of f. Roughly speaking, IISk II decreases
fast at the beginning, and then slows down, simply because 1/ k follows this pattern; then
comes the structure of fJf(i), which is finitely generated: less and less room is left for the
next Sk+l; the combinatorial aspect of the problem somehow vanishes, and IIskll decreases
more mpidly. 0

Let us conclude this section with a comment similar to Remark 11.2.4.4: numerical anal-
ysis is difficult, in that theory and practice often contmdict each other. More precisely, a
theoretical fmmework establishing specific properties of an algorithm may not be ~levant for
the actual behaviour of that algorithm. Here, it is difficult to find an appropriate theoretical
framework explaining why the bundling algorithm works so much better in its expensive
form. It is also difficult to explain why its actual behaviour contradicts so much the theory.
24 IX. Inner Construction of the Subdifferential

3 Putting the Mechanism in Perspective

In this section, we look at the bundling algorithm from some different points of view,
and we relate it to various topics.

3.1 Bundling as a Substitute for Steepest Descent

The first obvious way of looking at the mechanism of §2 is to consider it as one


single execution of Step 2 in the steepest-descent Algorithm VIII. 1. 1.7. After all,
the original motivation of this mechanism lay precisely there. (From this point of
view, it is an unfortunate notation to use the same index k in Algorithm 1.6 and in
Algorithm VIII.1.1.7; we have good reasons for doing so, which will appear later
in this book: indeed, we will see that one should not distinguish the outer iteration
of a descent algorithm, from the inner iteration used to find the descent direction).
Somehow, we have reached our aim of making implementable the steepest descent
scheme; the bundling Algorithms 1.6 or 2.1.5 merely requires:
(i) the projection of the origin onto a compact convex polyhedron: a perfectly solv-
able convex quadratic problem;
(ii) the computation of an arbitrary subgradient of the objective-function at each
given x; this can be done in many instances, §VIII.3 was devoted to this point;
anyway, it can be done whenever the whole subdifferential can be computed!
On the other hand, some other difficulties are not resolved yet:
(iii) The descent direction db found via Corollary 2.1.2, is not steepest; this point,
however, is not too serious because of (iv) below.
(iv) As already mentioned, the resulting descent algorithm is likely not to minimize
f, since the direction dk suffers the same deficiencies as a steepest one (non-
continuity as a function of x, short-sightedness, zigzags, ... ).
(v) We have introduced the new problem ofletting t .J.. 0 before Algorithm 2.1.5 can
iterate.
(vi) The applications mentioned in §VIII.3.3 are still not covered: for such prob-
lems, our algorithm trivially terminates after the first iteration. We end up with
the mere gradient method, which is among the worst possible ideas; remember
Table VIII.3.3.2.
At this point, the reader may feel that arguments (iv) - (vi) are due to the same
single cause: all this is theory and, for proper implementations, some small parameter
8 must come into play; Remark VIII.2.3.4 made an allusion to it already. Such is
indeed the case and this will motivate subsequent developments: Chap. XI for theory,
Chap. XIII for a first implementation; Chap. XIV will globalize the approach and
enlarge this 8 so as to definitely escape from the steepest-descent concept.
Here we give some comments on (i) and (ii), and we start with an important
remark.
3 Putting the Mechanism in Perspective 25

Remark 3.1.1 As already mentioned in Remark 1.4, each cycle of the bundling
mechanism generates a subgradient Sk+1 lying in the face of CJf(x) exposed by the
direction dk.
This sk+ I is interesting for dk itself: not only is dk uphill (because (sk+ I, dk) ~ 0),
but we can say more. In terms of the descent property of dk, the subgradient Sk+1 is
the worst possible; or, reverting the argument, sk+1 is the best possible in terms of the
useful information concerning dk. Figure 2.1.2 shows how to interpret this property
geometrically: sk+1 (there S2), known to lie in the dashed area, is even at the bottom
tip of it. 0

In light of this remark, we see that the subgradients generated by the bundling
mechanism are fairly special ones. Their special character is also enhanced by the
black box (VI): the way s(x) is computed in practice makes it likely that it will lie
on the boundary of CJf (x), or even at an extreme point. For example, in the case of a
minimax problem (§VIII.2), each s(x), and hence each Sk+h is certainly the gradient
at x of some differentiable active function.
In summary, the bundling mechanism 1.6 can be viewed as aftlter, which extracts
some carefully selected subgradients from the whole set CJf(x). What is even more
interesting is that it has a tendency to select the most interesting ones. Let us admit
that the directions dk are not completely random - they are probably closer and
closer to directions of interest, namely descent directions. Then the subgradients Sk+ I
generated by the algorithm are also likely to ''flirt'' with the subgradients of interest,
namely those near the face of CJf(x) exposed by (steepest) descent directions. In
other words, our algorithm can be interpreted as selecting those subgradients that are
important for defining descent directions, disregarding the "hidden side of the moon";
see Fig. 3.1.1, where the visible side of the moon is the set of faces of CJf (x) that are
exposed by descent directions.

descent
directions

Fig. 3.1.1. The visible side of the moon

According to these observations, the bundling mechanism can also be viewed as


a constructive way of computing the steepest-descent direction, playing its role even
when a full description of the subdifferential is available. For an illustration, consider
a case in which the subdifferential

(3.1.1)
26 IX. Inner Construction of the Subdifferential

is a known compact convex polyhedron, with m possibly large. When Algorithm 1.6
is applied to this example, the full set (3.1.1) is replaced by a growing sequence
of internal approximations Sk. Correspondingly, there is a sequence of approximate
steepest-descent directions db with its sequence of approximate derivatives -lIdk 112
(see Remark 1.5).
Now, it is reasonable to admit that each subgradient Sk+1 generated by the al-
gorithm is extracted from the list {Sl, ... ,sm}: first, (UI) is likely to choose sO in
that list; and second, each sk+1 is in the face of aI(x) exposed by dk; remember Re-
mark 1.4. Then, projecting onto Sk is a simpler problem than projecting onto (3.1.1).
Except for its stopping test, the bundling algorithm really provides a constructive way
s
to compute = Proj O/aI(x).
Of course, this interpretation is of limited value when a descent direction does
exist: Algorithm 1.6 stops much too early, namely as soon as

(knowing that dk = -Sk). A real algorithm computing S should stop much later,
namely when If (x, dk) is "negative enough":

since the latter characterizes Sk as the projection of the origin onto aI(x).
On the other hand, when x is optimal, the problem of computing S = 0 becomes
that of finding the correct convex multipliers, satisfying Lj=1 Clpj = 0 in (3.1.1).
Then the bundling Algorithm 1.6 is a valid alternative to direct methods for projection.
In fact, if(3.1.1) is complex enough, Algorithm 1.6 can be a very efficient approach:
it consists of replacing the single convex quadratic problem

by a sequence of similar problems with m replaced by 1, 2, ... , k ~ m. From our


discussion above, all of these problems are probably "strictly simpler" than the original
one, insofar as their generators Sl, ... ,sk are extracted from the list {Sl, ... ,sm} in
(3.1.1).

Example 3.1.2 The test-problem TR48 of 2.2.6 illustrates our point in a particularly
spectacular way. We have said that, at an optimum i, there are an order of 1011 active
functions; and there are just as many possible values of s(i), as computed by (Ul).
On the other hand, it is (probably) necessary to have 48 of them to obtain the correct
convex multipliers aj in Lj ajs j = 0 (this is Caratheodory's theorem in ]R48, taking
into account that all subgradients are actually in ]R47 because of degeneracy).
Figure 2.2.3 reveals a remarkable phenomenon: in the expensive variant of Algo-
rithm 2.1.5, IId6S 11 is zero within roundoff error. In other words, the bundling mecha-
nism has managed to extract only 65 "interesting" subgradients: to obtain a correct set
of multipliers, it has made only 17 (= 65 - 48) "mistakes", among the 1011 possible
ones. o
3 Putting the Mechanism in Perspective 27

To conclude, we add that all intermediate black boxes (Ul) are conceivable: in this chapter,
we considered only the simplest one, computing just one subgradient; Chap. VIII required the
most sophisticated one, computing the full set (3.1.1). When (Ul) returns richer information,
Algorithm 2.1.5 will converge faster, but possibly at the price of a more expensive iteration;
among other things, each individual projection will be more difficult. The bundling approach
is open to a proper balance between simplicity (one subgradient) and rapidity (all of them).

3.2 Bundling as an Emergency Device for Descent Methods

Another, more subtle, way oflooking at the bundling mechanism comes from (vi) in
the introduction to §3.I. The rather delicate material contained in this Section 3.2 is
an introduction to the numerical Chaps XIII and XIv. Consider an "ordinary" method
such as those in Chap. II for minimizing a smooth function. Within one iteration,
consider the line-search: given the current x and d, and setting q(t) := f(x + td),
we wantto find a convenient stepsize t > 0, satisfying in particular q(t) < q(O).
Anyone familiar with optimization algorithms has experienced line-searches
which "do not work", just because q doggedly refuses to decrease. A frequent rea-
son is that the s(x) computed by (UI) is not the gradient of f at x - because of
some programming mistake. Another reason, more fundamental, is ill-conditioning:
although q' (0) < 0, the property q(t) < q(O) may hold only for values of t > 0 so
small that the computer cannot find any; remember Fig. 11.1.3 .1. In this situation, any
reasonable line-search can only produce a sequence of trial stepsizes tending to O. In
the terminology of §1I.3, tL == 0, tR ,J.. 0: the minimization algorithm is stuck at the
present iteration. Normally there is not much to do, then - except checking (UI)!
It is precisely in this situation that the mechanism of §2 can come into play. As
stated, Algorithm 1.2 can be viewed as the beginning of a ''normal'' line-search, of the
type of those in §II.3. More precisely, compare the line-search 1.2 and Fig. 11.3.3.1
with m = O. As long as tL does not become positive, i.e. as long as q(t) ~ q(O),
both line-searches do the same thing: they perform interpolations aimed at pulling t
towards the region having q (t) < q(O)(ifit is not void). It is only when some tL > 0 is
found, that the difference occurs. Then the line-search 1.2 becomes pointless, simply
because the bundling Algorithm 1.6 stops: the local problem of finding a descent
direction is solved
If, however, the question is considered from the higher point of view of really
minimizing f, it is certainly not a good idea to stop the line-search 1.2 as we do, when
q (t) < q (0). Rather, it is time to call tL the present t, and to start a second phase in the
line-search: our aim is now to find a convenient t, satisfying some criterion (0) in the
terminology of §II.3.2. In other words: knowing that the line-search 1.2 is just a slave
of Algorithm 1.6, it should be a good idea to complement it with some mechanism
coming from §II.3.2.
The result would produce a compound algorithm, which can be outlined as fol-
lows:

Algorithm 3.2.1 (Compound Descent Scheme) Given the current iterate x:


STEP 1. Compute a direction d according to the principles of §II.2.
28 IX. Inner Construction of the Subdifferential

STEP 2. Do a line-search along d, according to the principles of §11.3. This line-search


has two possible exits:
- Either a descent step is found (tL > 0), and a stopping criterion of the type of
those in §1I.3.2 can be fulfilled. Then loop normally to Step 1 with x appropri-
ately updated.
- Or no descent step can be found (tL == 0, tR .J, 0); then make a null-step: do
not update x but switch to
STEP 3. Follow the rules of Algorithm 1.6 by projecting the origin onto the current
approximation of af(x). Loop to Step 2. 0

We have thus grafted the bundling mechanism on a descent scheme as described in


§11.2. Algorithm 3.2.1 can be interpreted as a way to rescue a good "smooth algorithm"
from some difficult situations, in which ill-conditioning provokes zigzags.

Remark 3.2.2 We do not specify how to compute the direction in Step 1. Let us just observe
that, if it is the gradient method of Definition 11.2.2.2 that is chosen, then Algorithm 3.2.1 just
becomes the method we started from (§2.1), simulating the steepest descent. It is in Chap. XIII
that we will start studying approaches making possible harmonious combination of Steps 1
and 2.
Our interpretation of Algorithm 1.6, as an anti-zigzagging mechanism for a descent
method, suggests a way in which one might suppress the difficulty (v) of the introduction to
§3.1. Instead ofletting t = tR ..j.. 0, it is sensible in Algorithm 3.2.1 to switch to Step 3 when
t R becomes small. To make this kind of decision, we will see that the following considerations
are relevant:
- The expected decrease of q in the current direction d should be estimated. If it is small, d
can be relinquished.
- Allowing such a "nonzero null-step" implies appending to Sk some vectors which are not
in af(x). One should therefore estimate how close s(x + td) is to af(x); this makes sense
because of the outer semi-continuity of af.
Thus, the expression "small stepsize" can be given an appropriate meaning, in terms of
objective-value differences, and of subgradient proximity. 0

An interesting point here is to see how Step 2 could be implemented, along the
lines of §II.3. The issue would be to inject into the test (0), (R), (L) a provision for
the switch to Step 3, when t is small enough - whatever this means. In anticipation of
Chap. :xrv; and without giving too much detail, let us mention the starting ideas:
- First, the initial derivative q' (0) is needed. Although it exists (being the directional
derivative f' (x, d) along the current direction d) it is not known. Nevertheless,
Remark 1.5 gives a substitute: we have -lIdll 2 ~ q'(O), with equality if the current
polyhedron S approximates the subdifferential af(x) well enough.
- Then the descent-test, declaring that t is not too large, is that of (11.3 .2.1):

q(t) ~ q(O) - mtlldll 2 (3.2.1)

for a given coefficient m E ]0, 1[.


3 Putting the Mechanism in Perspective 29

- Now, when switching to Step 3 in Algorithm 3.2.1, a new subgradient s must be


found so as to recompute the direction. Ideally, it should satisfy (s, d) ~ 0; but
this can be relaxed, thanks to Lemma 2.1.4: everything will be all right if the new
subgradient merely satisfies

(s, d) ~ - m'lIdll 2 (3.2.2)

for some m' < 1.


At this point, we are bound to realize from (3.2.1), (3.2.2) that Wolfe~ criterion
(11.3.2.4) suggests itself as the appropriate one for declaring a descent step. The
provision for a null-step will operate when -lIdll 2 is a too optimistic underestimate
of the real f'(x, d). Note that the value -lIdIl 2 , in (3.2.1) and (3.2.2), should not be
taken too literally: it does not depend positively homogeneously on d. What actually
counts is (s, d), where sis the "gradient", in the interpretation of Remark VIII. 1.3 .7.

Remark 3.2.3 The need for a null-step appears when (3.2.1) is never satisfied, despite re-
peated interpolations. If the coefficients m andm' are properly chosen, namely m < m', then
we have
(s(x + td), d) ~ q(t) - q(O) > -mlldll 2 > -m'lIdIl 2 ;
t
here the first inequality expresses that s(x +td) E of (x + td), and the second is the negation
of (3.2.1). We see that (3.2.2) holds automatically in this case; one more manifestation of
the second part in Remark 3.1.1. On the other hand, if the black box (Ul) is bugged, the
line-search may well generate a sequence tR ..j.. 0 which never satisfies (3.2.1) nor (3.2.2).
Then, even our rescuing mechanism is helpless. The same phenomenon may happen with
a nonconvex f (but a fairly pathological one, indeed). There is a moral: a totally arbitrary
function, such as the output of a wrong (Ul), cannot be properly minimized! 0

3.3 Bundling as a Separation Algorithm

In its purest form, the mechanism of §1 can be thought of independently of any


convex function to be minimized. What it actually does is: given a compact convex
set S = of (x) and a point 0, find an affine hyperplane strictly separating to} and S,
if there is one; or conclude that 0 E S.
By definition, a (strictly) separating hyperplane is a pair (d, r) E IRn x IR such
that (see §III.4.1):

(d, s) < r for all s E S and r < (d, s') for all s' E to} .
In simpler terms, it is a d E IRn such that

(d, s) < 0 for all s E S,

or equivalently, us(d) < 0 because S is compact. When S is actually a subdifferential


of a convex function, a separating hyperplane gives a descent direction, while a non-
separating hyperplane corresponds to a d such that neither d nor -d is downhill.
The above separation problem can be more or less trivial, depending on how S is
characterized.
30 IX. Inner Construction of the Subdifferential

-If
S := {s E]Rn : c(s)::::; O}
is defined by constraints (several constraints Cj can be included in c, simply by
setting C = maxj Cj), then answering the question is easy: it amounts to computing
c(O) and a subgradient d of cat O. In fact

o ~ S <==> c(O) > 0 .

In case c(O) > 0, then any d E oc(O) defines a separating hyperplane. The reason
is that the property

S E S =* 0;;:: c(s) > c(s) - c(O) ;;:: (d, s)

gives the equation of our hyperplane:

Hd,-c(o) = {s E]Rn : (s, d) = -c(O)}.


- If S is a compact convex polyhedron characterized as a convex hull, the mere ques-
tion "0 E S?" is already more difficult to answer. From the preceding sections, a
suggestion is to project 0 onto S. If the projection is 0, the answer is yes. If it is
nonzero, a separating hyperplane is found.
Our bundling mechanism works in another situation, which we now explain.
Algorithm 1.6 can be considered as made up of two parts: Step 2 and the rest. Step 2,
i.e. Algorithm 1.2, represents the problem to be solved: it is in charge of providing
the rest with information concerning the function f to minimize, or the set of (x) to
separate. By contrast, the rest of Algorithm 1.6 is the "decision maker", which decides
whether to stop, or what other direction to try, etc. It might even decide what norm to
use for the projection.
If we make a parallel with the hill-climbing problem of §II.1.3, we see that Step 2
is one more black box (the concept of black box is very important in numerical
analysis!). Given d, it computes s E S yielding as(d), or detects that as(d) < O. In
other words, our present mechanism is adapted to situations in which S is not known
explicitly, but in which the only available information is pointwise and concerns the
support function of S. In a word, the main task of our black box of Step 2 is to solve
the problem
max {(d, s) : s E S} (3.3.1)
and to return an optimal s. Remark 3.1.1 makes it clear that this is exactly what is
done when extracting a cluster point of {s(x + td)} for t .,!.. O.
With this in mind, Algorithm 1.6 can be read as follows; we deliberately combine
the two possible exits from the line-search: the following algorithm is supposed to
run forever.

Algorithm 3.3.1 (Elementary Separation Algorithm) To initialize the algorithm,


solve (3.3.1) with do = 0 and obtain s, E S. Set k = 1.
STEP 1. Compute dk= - Proj O/{s" ... , Sk}.
3 Putting the Mechanism in Perspective 31

STEP 2. Solve (3.3.1) with d = dk to obtain sk+1 E S such that


(Sk+I' dk) = as(dk) .

Replace k by k + I and loop to Step 1. o


The black box involved in this form of algorithm is, somehow, an "intelligent
(VI)". Its input is no longer x but rather the couple (x, d); its output is no longer an
arbitrary s(x) E S = af(x), but a somewhat particular Sd(X), lying in the face of S
exposed by d. The initialization is artificial: with d = 0, so(x) is now an arbitrary
subgradient.
The next idea that comes to mind is: how about computing not only a separating
hyperplane, but a "best" one, separating S as much as possible from O? The set of
directions defining a separating hyperplane is an open cone, characterized by the
equation as(d) < O. Minimizing as(d) can be interpreted as seeking a ray which
is central in this cone, i.e. as remote as possible from its boundary. This must be
understood among normalized directions only, because positive homogeneity of as
and properties of cones make the picture invariant by homothety. Finally, the theory
of §VIII.1.2 tells us that we are exactly in the situation of §3.1: what we want to do,
really, is to project the origin onto S.

Remark 3.3.2 It is interesting to recall that the above set of separating hyperplanes (the cone
of descent directions!) has a geometric characterization. When S does not contain the origin,
as is not minimal at O. In view of Theorem VI. 1.3.4, our set of separating hyperplanes is there-
fore given by the interior of the polar [cone S]O of the cone generated by S. Incidentally, note
also from the compactness of S that the conical hull cone S is closed (proposition III. 1.4. 7).
o

When a mere descent direction was sought, the bundling Algorithm 1.6 was
stopped as soon as as(dk) became negative. Here, to compute a steepest-descent
direction, Algorithm 3.3.1 must continue until

which characterizes the projection of the origin onto S.


In the spirit of Lemma 2.1.1, it is not difficult to prove that the present process
does converge.

Theorem 3.3.3 When k --+ +00 in Algorithm 3.3.1,


dk --+ d:= -ProjO/S.
PROOF. It is convenient to set Sk := -db S := -d. Then we have
(Sk, Sj) ~ IIskll2 for i = 1, ... , k andk = 1,2, ...
and, since -d E S,
32 IX. Inner Construction of the Subdifferential

We therefore deduce

(Sk. Sj - Sk+l) ~ (Sk. Sk - s) =


= (s. Sk - s) + (Sk - S. Sk - s)
~ IIsk - s1I2.

Finish as in the proof of Lemma 2.1.1. The inequality

(Sk. Sj - SH1) ~ IIsk - sII2 ~!5 > 0 for i = 1..... k and k = 1.2....
infinitely often would lead to a contradiction: we could extract a further subsequence
such that Sj - Sk+l ~ O. 0

Some comments can be made concerning the speed of convergence. When 0 E S,


Algorithms 1.6 and 3.3.1 both construct a sequence {dk} tending to O. The latter gets
from its black box information of better quality, though. Algorithm 1.6 receives at
each iteration an Sk+l such that

while the black box for Algorithm 3.3.1 is "optimal": among all possible such SHh
it selects one having the largest possible such scalar product.
Then a natural question arises: is this reflected by a faster convergence of Al-
gorithm 3.3.1? A complete answer is unknown but we mention the following partial
result.

Proposition 3.3.4 Suppose 0 E ri S. Then dk = 0 for some finite index k in Algo-


rithm 3.3.1.

PROOF. Remember that -dk E S. As always,

(dk. Sj) ~ - IIdkll2 for all k and i ~ k. (3.3.2)

Assume dk "# 0 (otherwise there is nothing to prove) and, for!5 > 0, define s(!5) :=
!5dk/lldkll E B(O. !5). Because 0 E S, the affine and linear hulls of S coincide, and
s(!5) E aft'S. Then. by Definition 111.2.1.1 of the relative interior, our assumption
implies that s(!5) E S if!5 is small enough. In this case

and we deduce by subtraction from (3.3.2)

From the Cauchy-Schwarz inequality,

IISHI - Sj II ~ !5 > 0 for all k and i ~ k •

which implies that the compact sequence {Sk} is actually finite. o


3 Putting the Mechanism in Perspective 33

Of course, nothing of this sort holds for Algorithm 1.6: observe in Fig. 2.1.1
that of (x) may contain some points with negative ordinates, without changing the
sequence {Sk}. The assumption for the above result is interesting, to the extent that
S is a subdifferential and that we are checking optimality of our given x: remember
from the end of §VI.2.2 that, if x = i actually minimizes f, the property 0 E ri of (i)
is natural.
We illustrate this by the following numerical experiments. Take a convex compact
set S containing the origin and consider two forms of Algorithm 3.3.1, differing in
their Step 2, i.e. in their black box.
- The first form is Algorithm 3.3.1 itself, in which Step 2 does solve (3.3.1) at each
iteration.
- In the second form, Step 2 merely computes an sk+1 such that (Sk+l, dk) = 0 (note:
since 0 E S, oS(d) ~ 0 for all d). This second form is therefore pretty much like
Algorithm 1.6. It is even ideal for illustrating Lemma 2.1.4 with equality in (2.1.10).
Just how these two forms can be constructed will be seen in §XIII.3. Here, it
suffices to say that the test-problem TR48 is used; the set S in question is a certain
enlargement of the subdifferential at a non-optimal x; this enlargement contains the
origin; Algorithm 3.3.1 is implemented via Algorithm 1.6, with two variants of the
line-search 1.2.

o
-1
-2 support-black box
-3
k
100 200 400 600
Fig. 3.3.1. Two forms of Algorithm 3.3.1 with TR48

'."
"" ...... IIt ...

-----------------____ orthogonal black box


support-black bOx ----------------------------___ _
-- .... k
10 50 100 150

Fig. 3.3.2. 1\vo forms of Algorithm 3.3.1 with MAXQUAD

Figure 3.3.1 shows the two corresponding speeds of convergence of {dk} to o.


In this test, 0 is actually an interior point of S C ]R48. This explains the very fast
34 IX. Inner Construction of the Subdifferential

decrease of IIdk II at the end of the first variant (finite convergence, cf. Remark 2.2.8).
With another test, based on MAXQUAD ofVIII.3.4.2, 0 is a boundary point of S c lR lo •
The results are plotted in Fig. 3.3.2: IIdk II decreases much more regularly. Finally,
Figs. 3.3.3 and 3.3.4 give another illustration, using the test-problem TSP of 2.2.7. In
all these pictures, note the same qualitative behaviour of IIdk II.

\
\,
·1 '."
"~- ___ ~ orthogonal black box

-3 SUPPO~:I:~:-:~----~----------------~----.
--~~~------------------------__- - k
500 6000
Fig. 3.3.3. 1\vo forms of Algorithm 3.3.1 with TSPI20

"
a
,~

-~~-----_______ orthogonal black box


-1 ----..........-------__-------------------------__ ~~ ~~

"
__-_2_~--~~----------~~------~~----\~
' k
100 600
Fig. 3.3.4. Two forms of Algorithm 3.3.1 with TSP442
x. Conjugacy in Convex Analysis

Prerequisites. Definitions, properties and operations concerning convex sets (Chap. III)
and convex functions (Chap. IV); definition of sublinear functions and associated convex
sets (Chap. V); definitions of the subdifferential of a finite convex function (§VI.l). One-
dimensional conjugacy (§I.6) can be helpful but is not necessary.

Introduction. In classical real analysis, the gradient of a differentiable function f :


]Rn -+ lR plays a key role - to say the least. Considering this gradient as a mapping
x 1-+ s (x ) = V f (x) from (some subset X of) lRn to (some subset S of) lRn , an
interesting object is then its inverse: to a given s E S, associate the x E X such that
s = V f(x). This question may be meaningless: not all mappings are invertible! but
could for example be considered locally, taking for X x S a neighborhood of some
(xo, So = V f(xo», with V2 f continuous and invertible at Xo (use the local inverse
theorem).
Let us skip for the moment all such technicalities. Geometrically, we want to
find x such that the hyperplane in lRn x lR defined by the given (s, -1), and passing
through (x, f(x», is tangent to gr f at x; the problem is meaningful when this x
exists and is unique. Its construction is rather involved but analytically, an amazing
fact is that the new mapping x (.) = (VI) -I thus defined is itself a gradient mapping:
say x(·) = Vh, with h : S C lRn -+ lR. Even more surprising: this function h has a
simple expression, namely

S 3 s 1-+ h(s) = (s, x(s») - f(x(s». (0.1)

To explain this, do a formal a posteriori calculation in (0.1): a differential ds induces


the differentials dx and dh, which are linked by the relation

dh = (s, dx) + (ds, x) - (V f(x), dx) = (s, dx) + (ds, x) - (s, dx) = (ds, x)
and this defines x as the gradient of h with respect to s.

Remark 0.1 When Vf is invertible, the function

is called the Legendre transform (relative to I). It should never be forgotten that, from
the above motivation itself, it is not really the function h which is primarily interesting,
but rather its gradient Vh. 0
36 X. Conjugacy in Convex Analysis

The gradients of f and of h are inverse to each other by definition, so they establish
a reciprocal correspondence:

s = V f(x) {:::::::} x = Vh(s) . (0.2)

In particular, applying the Legendre transform to h, we have to get back f. This


symmetry appears in the expression of h itself: (0.1) tells us that, for x and s related
by (0.2),
f(x) + h(s) = (s, x) .
Once again, the above statements are rather formal, insofar as they implicitly as-
sume that (Vf)-I is well-defined. Convex analysis, however, provides a nice frame-
work to give this last operation a precise meaning.
First of all, observe that the mapping x ~ Vf (x) is now replaced by a set-valued
mappingx 1---+ of (x) -see Chap. VI. To invert it is to find x such that of (x) contains
a given s; and we can accept a nonunique such x: a set-valued (of) -1 will be obtained,
but the price has been already paid anyway.
Second, the construction of x(s) is now much simpler: s E of (x) means that
o E of (x) - Is} and, thanks to convexity, the last property means that x minimizes
f - (s, .) over]Rn. In other words, to find x (s), we have to solve
inf {f(x) - (s, x) : x E ]Rn}; (0.3)

and the Legendre transform - in the classical sense of the term - is well-defined when
this problem has a unique solution.
Let us sum up: if f is convex, (0.3) is a possible way of defining the Legendre
transform when it exists unambiguously. It is easy to see that the latter holds when f
satisfies three properties:
- differentiability - so that there is something to invert;
- strict convexity - to have uniqueness in (0.3);
- V f(lRn) = ]Rn - so that (0.3) does have a solution for all s E ]Rn; this latter
property essentially means that, when IIxll ~ 00, f(x) increases faster than any
linear function: f is J -coercive.
In all other cases, there is no well-defined Legendre transform; but then, the
transformation implied by (0.3) can be taken as a new definition, generalizing the
initial inversion of Vf. We can even extend this definition to nonconvex f, namely
to any function such that (0.3) is meaningful! Finally, an important observation is
that the infimal value in (0.3) is a concave, and even closed, function of s; this is
Proposition IY.2.1.2: the infimand is affine in s, f has little importance, ]Rn is nothing
more than an index set.
The concept of conjugacy in convex analysis results from all the observations
above. It is often useful to simplify algebraic calculations; it plays an important role
in deriving duality schemes for convex minimization problems; it is also a basic
operation/for formulating variational principles in optimization (convex or not), with
applications in other areas of applied mathematics, such as probability, statistics,
nonlinear elasticity, economics, etc.
1 The Convex Conjugate of a Function 37

1 The Convex Conjugate of a Function

1.1 Definition and First Examples

As suggested by (0.3), conjugating a function f essentially amounts to minimizing a


perturbation of it. There are two degenerate situations we want to avoid:
- the result is +00 for some s; observe that this is the case if and only if the result is
+00 for all s;
- the result is - 00 for all s.
Towards this end, we assume throughout that f : ]Rn ~ ]R U {+oo} (not necessarily
convex) satisfies

1f t= +00, and there is an affine function minorizing f on]Rn .1 (1.1.1)

Note in particular that this implies f (x) > -00 for all x. As usual, we use the notation
doml := {x : I(x) < +oo} #- 0. We know from Proposition IY.1.2.1 that (1.1.1)
holds for example if 1 E Conv]Rn.

Definition 1.1.1 The conjugate ofa function 1 satisfying (1.1.1) is the function j*
defined by

]Rn 3 s ~ I*(s) := sup {(s, x) - I(x) : x E dom f}. (1.1.2)

For simplicity, we may also let x run over the whole space instead of dom I.
The mapping 1 ~ 1* will often be called the conjugacy operation, or the
Legendre-Fenchel transform. 0

A very first observation is that a conjugate function is associated with a scalar


product on ]Rn. Of course, note also with relation to (0.3) that

I*(s) = - inf {f(x) - (s, x) : x E dom f}.

As an immediate consequence of (1.1.2), we have for all (x, s) E dom 1 x ]Rn

I*(s) + I(x) ~ (s, x) . (1.1.3)

Furthermore, this inequality is obviously true if x ¢ dom I: it does hold for all
(x, s) E ]Rn x ]Rn and is called Fenchel's inequality. Another observation is that
j*(s) > -00 for all s E ]Rn; also, if x ~ (so, x) + ro is an affine function smaller
than I, we have

- 1* (so) = inf [f(x) - (so, x)] ~ ro, i.e. I*(so) ~ - ro < +00.
x

In a word, j* satisfies (1.1.1).


Thus, dom j* - the set where j* is finite - is the set of slopes of all the possible
affine functions minorizing lover ]Rn. Likewise, any s E dom 1 is the slope of an
affine function smaller than j*.
38 X. Conjugacy in Convex Analysis

Theorem 1.1.2 For f satisfYing (1. 1.1), the conjugate f* isaclosedconvexfunction:


f* E Conv lRn.

PROOF. See Example rv.2.1.3. o


Example 1.1.3 (Convex Quadratic Functions) Let Q be a symmetric positive def-
inite linear operator on lRn , b E lRn and consider

f(x) := 4(x, Qx) + (b, x) for all x E lRn . (1.1.4)

A straightforward calculation gives the optimal x = Q-I(S - b) in the defining


problem (1.1.2), and the resulting f* is

f*(s) = 4(s - b, Q-I (s - b») for all s E lRn.

In particular, the function 1/211 • 112 is its own conjugate.


Needless to say, the Legendre transformation is present here, in a particularly
simple setting: V f is the affine mapping x 1-+ Qx +b, its inverse is the affine mapping
s 1-+ Q-I(S - b), the gradient of f*. What we have done is a parametrization oflRn
via the change of variable s = V f(x); with respect to the new variable, f is given by

f(x(s» = 4(Q-' (s - b), s - b) + (b, Q-I (s - b») .

Adding f* (s) on both sides gives


f(x(s» + f*(s) = (Q-I(s - b), s) = (x(s), s),
which illustrates (0.1).
Taking b = 0 and applying Fenchel's inequality (1.1.3) gives

(x, Qx) + (s, Q-I s) ~ 2(s, x) for all (x, s),

a generalization of the well-known inequality (obtained for Q = cI, c > 0):

cllxll2 + ~lIsll2 ~ 2(s, x) for all c > 0 and x, s E lRn. o


When Q in the above example is merely positive semi-definite, a meaning can
still be given to (V f) -I provided that two problems are taken care of: first, s must be
restricted to V f(lR n ), which is the affine manifold b + 1m Q; second, V f(x + y) =
Vf (x) for all Y E Ker Q, so (Vf) -I (s) can be defined only up to the subspace Ker Q.
Using the definition of the conjugate function, we obtain the following:

Example 1.1.4 (Convex Degenerate Quadratic Functions) Take the convex qua-
dratic function (1.1.4), but with Q symmetric positive semi-definite. The supremum
in (1.1.2) is finite only if s - b E (Ker Q).l, i.e. s - b E 1m Q; it is attained at an x
such that s - Vf (x) = 0 (optimality condition VI.2.2.1). In a word, we obtain

f* (s) = { too s
if ¢ ~ + 1m Q ,
2(x, s - b) otherwtse,
1 The Convex Conjugate of a Function 39

where x is any element satisfying Qx + b = s. This formulation can be condensed to

f*(Qx + b) = !(x, Qx) for all x E IRn ,

which is one more illustration of(O.1): add f(x) to both sides and obtain

f(x) + f*(Qx + b) = (x, Qx + b) = (x, V f(x»).


It is also interesting to express f* in terms of a pseudo-inverse: for example, Q-
denoting the Moore-Penrose pseudo-inverse of Q (see §A.3.4), f* can be written

* { +00 if s \l b + 1m Q ,
f (s)= !(s-b,Q-(s-b») ifsEb+lmQ. o

In the above example, take b = 0 and let Q = PH be the orthogonal projection


onto a subspace H oflRn. Then 1m Q = Hand Q- = Q, so

*
(PH) (s) =
{+oo if s\l H ,
!lI s ll if s H.
2 E

Another interesting example is when Q is a rank-one operator, i.e. Q = UU T,


with 0 =f. U E IRn (we assume the usual dot-product for (', .). Then 1m Q = lRu and,
for x Elm Q, Qx = lIull 2 x. Therefore,

( UU T)\s) = { 211~1I2S if sand U are collinear ,


+00 otherwise.

Example 1.1.5 Let Ie be the indicator function of a nonempty set C c IRn. Then

(Ic)*(s) = sup [(s, x) - Ic(x)] = sup (s, x)


X EdomIe XEe

is just the support function of C. If C is a closed convex cone, we conclude from


Example V.2.3.1 that (Ic)* is the indicator of its polar Co. If C is a subspace, (Ic)*
is the indicator of its orthogonal C.l.
If C = IRn, Co = {O}; Ie == 0 and the conjugate of Ie is 0 at 0, +00 elsewhere
(this is the indicator of {O}, or the support oflRn). A similar example is the nonconvex
function x f-+ f(x) = Illxlll l / 2 , where III . III is some norm. Then /*(0) = 0; but if
s =f. 0, take x of the form x = ts to realize that

f*(s) ~ sup [tllsll 2 -


t~O
Jt1Si] = +00.
In other words, f* is still the indicator function of {O}. The conjugacy operation has
ignored the difference between f and the zero function, simply because f increases
at infinity more slowly than any linear function. 0
40 X. Conjugacy in Convex Analysis

1.2 Interpretations

Geometrically, the computation of !* can be illustrated in the graph-space Rn x R.


For given s E R n , consider the family of affine functions x ~ (s, x) - r, parametrized
by r E R. They correspond to affine hyperplanes Hs,r orthogonal to (s, -1) E Rn+l;
see Fig. 1.2.1. From (1.1.1), Hs,r is below gr 1 whenever (s, r) is properly chosen,
namely s E dom!* and r is large enough. To construct !*(s), we lift Hs,r as much
as possible subject to supporting gr I. Then, admitting that there is contact at some
(x, I(x», we write

(s, x) - r = I(x) or rather r = (s, x) - I(x) ,

to see that r = I*(s). This means that the best Hs.r intersects the vertical axis {OJ x R
at the altitude - !*(s).

Fig.I.2.1. Computation of f* in the graph-space

Naturally, the horizontal hyperplanes Ho.r correspond to minimizing I:


- 1*(0) = inf {f(x) : x ERn} .

The picture illustrates another definition of 1*: being normal to Hs •r , the vector
(s, -1) is also normal to gr 1 (more precisely to epi f) at the contact when it exists.

Proposition 1.2.1 There holds for all x E Rn

I*(s) = O"epi!(s, -1) = sup {(s, x) - r : (x, r) E epi f}. (1.2.1)

It follows that the support function ofepi 1 has the expression

ifu>O,
ifu=O, (1.2.2)
if u < O.

PROOF. In (1.2.1), the right-most term can be written

sup sup [(s, x) - r] = sup [(s, x) - I(x)]


x r ~ !(x) x
1 The Convex Conjugate of a Function 41

and the first equality is established. As for (1.2.2), the case U < 0 is trivial; when
U > 0, use the positive homogeneity of support functions to get

uepif(s, -u) = uUepif(ks, -I) = uf*(ks) ;

finally, for U = 0, we have by definition

Uepif(s,O) = sup {{s, x) : (x, r) e epi f for some r e lR},

and we recognize udomf(s). o


Assume f e Conv lRn. This result, illustrated in Fig. 1.2.1, confirms that the
contact-set between the optimal hyperplane and epi f is the face of epi f exposed by
the given (s, -1). From (1.2.2), we see that Uepi f and the perspective-function of f*
(§IY.2.2) coincide for U =F 0 - up to the change of variable U ~ -u. As a closed
function, a epi f therefore coincides (still up to the change of sign in u) with the closure
of the perspective of f*. As for u = 0, we obtain a relation which will be used on
several occasions:

Proposition 1.2.2 For f e Conv lRn ,

udomf(s) = uepif(s, 0) = (f*)~(s) for all s e lRn . (1.2.3)

PROOF. Use direct calculations; or see Proposition IY.2.2.2 and the calculations in
Example Iy'3.2.4. 0

The additional variable u introduced in Proposition 1.2.1 gives a second geometric


construction: Fig. 1.2.1 plays the role of Fig. Y.2.1.1, knowing that lRn is replaced here
by lRn+l. Here again, note in passing that the closed convex hull of epi f gives the
same f*. Now, playing the same game as we did for Interpretation Y.2.1.6, f* can
be looked at from the point of view of projective geometry, lRn being identified with
lRn x {-I}.
Consider the set epi f x {-I} C lRn x lR x lR, i.e. the copy of epi f, translated
down vertically by one unit. It generates a cone Kf C lRn x lR x lR:

Kf:= {t(x,r, -1) : t > 0, (x,r) e epif}. (1.2.4)

Now take the polar cone (Kf)O c lRn x lR x lR. We know from Interpretation Y.2.1.6
that it is the epigraph (in lRn+ 1 x lR!) of the support function of epi f. In view of
Proposition 1.2.1, its intersection with lRn x {-I} x lR is therefore epi f*. A short
calculation confirms this:

(Kf)O = {(s,a,,8)elRn xlRxlR: t{s,x)+tar-t,8~O


for all (x, r) e epi f and t > O}
= {(s, a,,8) : (s, x) + ar ~ ,8 for all (x, r) e epi f}.

Imposing a = -1, we just obtain the translated epigraph of the function described in
(1.2.1).
42 X. Conjugacy in Convex Analysis

K pnina/
-.---.------~

epilx(-l}

Fig. 1.2.2. A projective view of conjugacy

Figure 1.2.2 illustrates the construction, with n = 1 and epi f drawn horizontally;
this epi f plays the role of S in Fig. Y.2.l.2. Note once more the symmetry: Kf [resp.
(Kf) 0] is defined in such a way that its intersection with the hyperplane lRn x lR x {-1 }
[resp. lRn x {-I} x lR] is just epi f [resp. epi 1*].

Finally, we mention a simple economic interpretation. Suppose ]Rn is a set of goods and
its dual (]Rn)'1< a set of prices: to produce the goods x costs f(x), and to sell x brings an
income (s, x). The net benefit associated with x is then (s, x) - f(x), whose supremal value
1* (s) is the best possible profit, resulting from the given set of unit prices s. Incidentally, this
last interpretation opens the way to nonlinear conjugacy, in which the selling price would be
a nonlinear (but concave) function of x.

Remark 1.2.3 This last interpretation confirms the warning already made on several oc-
casions: ]Rn should not be confused with its dual; the arguments of f and of f* are not
comparable, an expression like x + s is meaningless (until an isomorphism is established
between the Euclidean space ]Rn and its dual).
On the other hand, f -values and f* -values are comparable, indeed: for example, they can
be added to each other- which is just done by Fenchel's inequality (1.1.3)! This is explicitly
due to the particular value "-I" in (1.2.1), which goes together with the "-I" of(1.2.4). 0

1.3 First Properties

Some properties ofthe conjugacy operation f 1-+ 1* come directly from its definition.

(a) Elementary Calculus Rules Direct arguments prove easily a first result:

Proposition 1.3.1 The functions f, fj appearing below are assumed to satisfy


(1.1.1).
(i) The conjugate of the function g(x) := f(x) + a is g*(s) = I*(s) - a.
1 The Convex Conjugate of a Function 43

(ii) With a > O,theconjugateofthefunctiong(x) :=al(x)isg*(s) =af*(s/a).


(iii) With a #- 0, the conjugate ofthefunction g(x) := I(ax) is g*(s) = f*(s/a).
(iv) More generally: if A is an invertible linear operator, (f 0 A)* = f* 0 (A -1)*.
(v) The conjugate of the function g(x) := I(x - xo) is g*(s) = f*(s) + (s, xo).
(vi) The conjugate ofthe function g(x) := I(x) + (so, x) is g*(s) = f*(s - so).
(vii) If II ~ f2, then ft ~ It
(viii) "Convexity" ofthe conjugation: ifdom/l n domh #- 0 and a E ]0,1[,

[all + (1- a)h]* ~ aft + (1- a)N;


(ix) The Legendre-Fenchel transform preserves decomposition: with
m
lR,n := lR,n. X ••• x lR,nm 3 X ~ I(x) := L Ij(xj)
j=1

and assuming that lR,n has the scalar product ofa product-space,
m
1*(sJ, ... , sm) = L 1j*(Sj)' o
j=1

Among these results, (iv) deserves comment: it gives the effect of a change of variables
on the conjugate function; this is of interest for example when the scalar product is put in the
form (x, y) = (AX)T Ay, with A invertible. Using Example 1.1.3, an illustration of (vii) is:
the only function f satistying f = f* is 1/211' 112 (start from Fenchel's inequality); and also,
for symmetric positive definite Q and P:

Q~P => p- I ~ Q-I ;

and an illustration of (viii) is:

[aQ + (1- a)prl ~ aQ-1 + (1 - a)p-I .

Our next result expresses how the conjugate is trarIsformed, when the starting
function is restricted to a subspace.

Proposition 1.3.2 Let I satisfY (1.1.1), let H be a subspace oflR,n, and call PH the
operator of orthogonal projection onto H. Suppose that there is a point in H where
I isfinite. Then 1+ IH satisfies (1.1.1) and its conjugate is

(1.3.1)

PROOF. When Y describes lR,n, PH Y describes H so we can write, knowing that PHis
symmetric:

(f + IH)*(S) := sup {(s, x) - I(x) : x E H}


= sup {(s, PHY) - I(PHY) : Y E lR,n}
= sup {(PHS, y) - I(PHY) : Y E lR,n}. o
44 X. Conjugacy in Convex Analysis

When conjugating f + IH, we disregard f outside H, i.e. we consider f as a function


defined on the space H; this is reflected in the "f 0 PH"-part of (1.3.1). We then obtain a
"partial" conjugate, say j* E Conv H; the last "oPH" of(1.3.1) says that, to recover the whole
conjugate (f + IH)*, which is a function of Conv lRn , we just have to translate horizontally
the graph of j* along H 1. .

Remark 1.3.3 Thus, if a subspace H and a function g (standing for f + IH) are such that
dom g C H, then g* (. + s) = g* for all s E H 1.. It is interesting to note that this property
has a converse: if, for some Xo E domg, we have g(xo + y) = g(xo} for all y E H, then
dom g* C H 1.. The proof is immediate: take Xo as above and, for s ¢ H 1., take Yo E H such
that (s, Yo) = a ::f:. 0; there holds (s, AYO) - g(xo + AYO) = Aa - g(xo) for all A E lR; hence

g*(s) ~ suP~J(s, Xo + AYO) - g(xo + AYO)]


(s, xo) + sup).[Aa - g(xo)] = +00. o
The above formulae can be considered from a somewhat opposite point of view: suppose
that lRn , the space on which our function f is defined, is embedded in some larger space
lRn+p. Various corresponding extensions of f to the whole oflRn+ p are then possible. One
is to set f := +00 outside lRn (which is often relevant when minimizing j), the other is the
horizontal translation
f(x + y) = f(x) for all Y E lRP ;
these two possibilities are, in a sense, dual to each other (see Proposition 1.3.4 below). Such a
duality, however, holds only if the extended scalar product preserves the structure oflRn x lRP
as a product-space: so is the case of the decomposition H + H 1. appearing in Proposition 1.3.2,
because
(s, x) = (PHS, PHX) + (PHl.S, PH.l x ).

Considering affine manifolds instead of subspaces, we mention the following


useful result:

Proposition 1.3.4 For 1 satisfying (1.1.1), let a subspace V contain the subspace
parallel to affdoml and set U := Vol. For any Z E affdoml and any s E IRn
decomposed as s = su + sv, there holds
+ I*(sv)·
I*(s) = (su, z)
PROOF. In (1.1.2), the variable x can range through z + V :J aff dom I:

I*(s) = sUPveV[(SU + sv, z + v) - I(z + v)]


= (su, z) + sUPveY[(SV, z + v) - I(z + v)]
= (su, z) + j*(sv)· o

(b) The Biconjugate of a Function What happens if we take the conjugate of j*


again? Remember that j* satisfies automatically (1.1.1) if 1 does. We can therefore
compute the biconjugate function of I: for all x E IRn,
I**(x) := (f*)*(x) = sup {(s, x) - I*(s) : s E IRn}. (1.3.2)

This operation appears as fundamental. The function j** thus defined is the "close-
convexification" of I, in the sense that its epigraph is the closed convex hull of epi I:
1 The Convex Conjugate of a Function 45

Theorem 1.3.5 For f satisfYing (1.1.1), the jUnction f** of (1.3 .2) is the pointwise
supremum ofall the affine functions on lR.n majorized by f. In other words

epi f** = co (epi f) . (1.3.3)

PROOF. Call 17 C lR. n x lR. the set of pairs (s, r) defining affine functions x 1-+ (s, x}-r
majorized by f:

(s, r) E 17 {:::::::} f(x) ~ (s, x) - r for all x E lR.n


{:::::::} r~ sup{{s,x}-f(x): x ElR.n}
{:::::::} r ~ f*(s) (and s E dom f*!).

Then we obtain, for x E lR.n,

sUPCs,r)EL'[(s,x}-r] = sup{(s,x}-r: sEdomf*, -r~ -f*(s)}


= sup {(s, x) - f*(s) : s E domf*} = f**(x).

Geometrically, the epigraphs of the affine functions associated with (s, r) E 17


are the (non-vertical) closed half-spaces containing epi f. From §IY.2.5, the epigraph
of their supremum is the closed convex hull of epi f, and this proves (1.3.3). 0

Note: the biconjugate of an f E Conv lR.n is not exactly f itself but its closure:
f** = cl f. Thanks to Theorem 1.3.5, the general notation

cof:= f** (1.3.4)

can be - and will be - used for a function simply satisfying (1.1.1); it reminds one
more directly that f** is the closed convex hull of f:

f**(x) = sup {(s, x) - r : (s, y) - r ~ f(y) for all y E lR.n} . (1.3.5)


r,s

Corollary 1.3.6 If g is a function satisfYing co f ~ g ~ f, then g* = f*. The func-


tion f is equal to its biconjugate f** if and only iff E Conv lR. n.

PROOF. Immediate. o

Thus, the conjugacy operation defines an involution on the set of closed convex func-
tions. When applied to strictly convex quadratic functions, it corresponds to the inversion of
symmetric positive definite operators (Example 1.1.3 with b = 0). When applied to indicators
of closed convex cones, it corresponds to the polarity correspondence (Example 1.1.5 - note
also in this example that the biconjugate of the square root of a norm is the zero-function).
For general f E Conv R n , it has a geometric counterpart, also based on polarity, which is the
correspondence illustrated in Fig. 1.3.1. Of course, this involution property implies a lot of
symmetry, already alluded to, for pairs of conjugate functions.
46 X. Conjugacy in Convex Analysis

( . )*
f E ConvRn
-- - f' E COri'V Rn

( . )*
Fig. 1.3.1. The *-involution

(c) Conjugacy and Coercivity A basic question in (1.1.2) is whether the supremum
is going to be +00. This depends only on the behaviour of f at infinity, so we extend
to non-convex situations the concepts seen in Definition Iy'3.2.6:
Definition 1.3.7 A function f satisfying (1.1.1) is said to be O-coercive [resp. 1-
coercive] when

lim f(x) = +00 [resp. lim f(x) = +00] . 0


IIxll-Hoo IIxll-Hoo IIxll
Proposition 1.3.8 If f satisfying (1.1.1) is I-coercive, then !*(s) < +00 for all
s eRn.
PROOF. For given s, the l-coercivity of f implies the existence of a number R such
that
IIxll ~ R ==> f(x) ~ IIxll(lIsll + 1),
so that we have in (1.1.2)
(s, x) - f(x)::S; - IIxli for all x such that IIxli ~ R,
hence
sup{(s,x)-f(x): IIxll~R}::S; -R.
On the other hand, (1.1.1) implies an upper bound
sup{(s,x)-f(x): IIxll::S;R}::S;M. o
For a converse to this property, we have the following:
Proposition 1.3.9 Let f satisfy (1.1.1). Then
(i) Xo e intdomf =>!* - (xo,') is O-coercive;
(ii) in particular, iff is finite over Rn , then !* is I-coercive.
PROOF. We know from (1.2.3) that O'domf = U*)6o so, using Theorem Y.2.2.3(iii),
Xo e int dom f C int(co dom f) implies
U*)~(s) - (xo, s) > 0 for all s#-O.
By virtue of Proposition IY.3.2.5, this means exactly that !* - (xo, .) has compact
sublevel-sets; (i) is proved.
Then, as demonstrated in Definition IY.3.2.6, O-coercivity of!* - (xo, .) for all
Xo means l-coercivity of f*. 0

Piecing together, we see that the l-coercivity of a function f implies that !* is


finite everywhere, and this in turn implies l-coercivity of co f.
1 The Convex Conjugate of a Function 47

Remark 1.3.10 If we assume in particular f E Conv]Rn, (i) and (ii) become equiv-
alences:
Xo E int dom f <==> f* - (xo, .) is O-coercive ,
dom f = ]Rn <==> f* is I-coercive. o

1.4 Subdifferentials of Extended-Valued Functions

For a function f satisfying (1.1.1), consider the following set:

af(x) := {s E]Rn : f(y) ~ f(x) + (s, y - x) for all y E ]Rn} . (1.4.1)

When f happens to be convex and finite-valued, this is just the subdifferential of f at


x, defined in VI.12.1; but (1.4.1) can be used in a much more general framework. We
therefore keep the terminology subdifferential for the set of (1.4.1), and subgradients
for its elements. Note here that af(x) is empty if x ¢ domj: take y E domf in
(1.4.1).

Theorem 1.4.1 For f satisfying (1.1.1) and af defined by (1.4.1), s E af(x) if and
only if
f*(s) + f(x) - (s, x) = 0 (or ~ 0). (1.4.2)

PROOF. To say that s is in the set (1.4.1) is to say that

(s, y) - f(y) ~ (s, x) - f(x) for all y E domf ,

i.e.
f*(s) ~ (s, x) - f(x) ;
but this is indeed an equality, in view of Fenchel's inequality (1.1.3). o
As before, af(x) is closed and convex: it is a sublevel-set of the closed convex
function f* - (., x), namely at the level - f (x). A subgradient of f at x is the slope
of an affine function minorizing f and coinciding with f at x; af(x) can therefore
be empty: for example if epi f has a vertical tangent hyperplane at (x, f (x»; or also
if f is not convex-like near x.

Theorem 1.4.2 Let f E Conv]Rn. Then af (x) =F 0 whenever x E ri dom f.

PROOF. This is Proposition IY.1.2.1. o


When af(x) =F 0, we obtain a particular relationship at x between f and its
convexified version co f:

Proposition 1.4.3 For f satisfying (1.1.1), thefollowingproperties hold:

af(x) =F 0 ~ = f(x) ;
(co f)(x) (1.4.3)

co f ~ g ~ f and g(x) = f(x) ===> ag(x) = af(x) ; (1.4.4)


s E af(x) ~ x E af*(s). (1.4.5)
48 X. Conjugacy in Convex Analysis

PROOF. Let s be a subgradient of I at x. From the definition (1.4.1) itself, the function
Y ~ is(Y) := I(x) + (s, Y - x) is affine and minorizes I, hence is ~ co I ~ I;
because is (x) = I(x), this implies (1.4.3).
Now, s E 81(x) if and only if

I*(s) + I(x) - (s, x) = o.


From our assumption, /* = g* = (co f)* (see Corollary 1.3.6) and g(x) = I(x);
the above equality can therefore be written.

g*(s) + g(x) - (s, x) =0


which expresses exactly that s E 8g(x), and (1.4.4) is proved.
Finally, we know that /** = co I ~ I; so, when s satisfies (1.4.2), we have

I*(s) + /**(x) - (s, x) = /*(s) + (co f)(x) - (s, x) ~ 0,

which means x E 8/*(s): we have just proved (1.4.5). o


Among the consequences of (1.4.3), we note the following sufficiency condition
for convexity: if 81 (x) is nonempty for all x E JRn, then I is convex and finite-valued
on JR n . Another consequence is important:

Corollary 1.4.4 If I E Conv JRn, the following equivalences hold:

I(x) + I*(s) - (s, x) =0 (or ~ 0) {:::::} s E 81(x) {:::::} x E 81*(s) .

PROOF. This is a rewriting of Theorem 1.4.1, taking into account (1.4.5) and the
symmetric role played by I and /* when I E Conv JRn . 0

If s E 81 (x) (which in particular implies x E dom f), the property s E dom /*


comes immediately; beware that, conversely, dom 1* is not entirely covered by
81(JRn ): take I(x) = expx, for which /*(0) = 0; but 0 is only in the closure of
If (JR).
Even though it is attached to some designated x, the concept of subdifferential,
as defined by (1.4.1), is global, in the sense that it uses the values of Ion the whole
of JRn . For example,
inf {f(x) : x E JR n } = - 1*(0)
and obviously

x minimizes I satisfying (1.1.1) {:::::} 0 E 81 (x) .

Then a consequence of Corollary 1.4.4 is:

Argmin/(x) = 81*(0) if IE ConvJRn . (1.4.6)


XElRn
1 The Convex Conjugate of a Function 49

1.5 Convexification and Subdifferentiability

For a function I satisfying (1.1.1), the biconjugate defined in § l.3(b) is important in


optimization. First of all, minimizing I or co I is almost the same problem:
- As a consequence of Corollary 1.3.6, we have the equality in]R U {-oo}

inf (f(x) : x E ]Rn} = inf {(co f)(x) : x E ]Rn}.

- We also have from Proposition 1.4.3 that the minimizers of I minimize co I as


well; and since the latter form a closed convex set,

co(Argminf) C Argmin(co f). (1.5.1)

This inclusion is strict: think of I (x) = Ix 11 /2. Equality holds under appropriate
assumptions, as can be seen from our analysis below (see Remark 1.5.7).
The next results relate the smoothness of I and of co I.

Proposition 1.5.1 Suppose I satisfying (1.1.1) is Gateaux differentiable at x, and


has at x a nonempty subdifferential in the sense of (1.4.1). Then

a/(x) = {Y'/(x)} = a(co f)(x).


PROOF. Let s E al(x): for all d E ]Rn and t > 0,

I(x + td) - I(x) ~ (s, d) .


t

Let t -I.- 0 to obtain (Y'/(x), d) ~ (s, d) for all d, and this implies Y'/(x) = s. The
second equality then follows using (1.4.3) and (1.4.4). 0

For a given x to minimize a differentiable I, a necessary condition is "Y'I (x) =


0". A natural question is then: what additional condition ensures that a stationary point
is a global minimum of I? Clearly, the missing condition has to be global: it involves
the behaviour of I on the whole space. The conjugacy correspondence turns out to
be an appropriate tool for obtaining such a condition.

Corollary 1.5.2 Let I be Gateaux differentiable on ]Rn. Then, x is a global minimum


of I on ]Rn if and only if
(i) Y'/(x) = 0 and
(ii) (co f)(x) = I(x).
In such a case, co I is differentiable at x and Y'(co f)(x) = o.
PROOF. Let x minimize I; then a/(x) is nonempty, hence (ii) follows from (1.4.3);
furthermore, a(co f)(x) reduces to {OJ (Proposition 1.5.1).
Conversely, let x satisfy (i), (ii); because the differentiable function I is finite
everywhere, the convex function co I is such. Then, by virtue of Proposition 1.5.1,
o E a(co f) (x); we therefore obtain immediately that x minimizes I on ]Rn. 0
50 X. Conjugacy in Convex Analysis

Thus, the global property (ii) is just what is missing in the local property (i) for a
stationary point to be a global minimum. One concludes for example that differentiable
functions whose stationary points are global minima are those f for which

vf(x) = 0 :::::} (co f)(x) = f(x).


We turn now to a characterization of the subdifferential of co f in terms· of that
of f, which needs the following crucial assumption:

f satisfies (1.1.1), and is lower semi-continuous and I-coercive. (1.5.2)

Lemma 1.5.3 For f satisfying (1.5.2), co(epi f) is a closed set.

PROOF. Take a sequence {Xk, rk} in co(epi f) converging to (x, r) for k ---+ +00; in
order to establish (x, r) E co(epi f), we will prove that (x, p) E co(epi f) for some
p~r.
By definition of a convex hull in lRn +l , there are n + 2 sequences {xi, ri} with
f(xi) ~ rk, and a sequence {ak} in the unit simplex L1n+2 such that
n+2
(Xk, rk) =L ai (xi, rk) .
i=1

[Step 1] For each i, the sequence {airk} is bounded: in fact, because of (1.5.2), f is
bounded from below, say by JL; then we write

rk ~airk+ Latf(xi> ~airk+ (l-aOJL,


j=/:i

and our claim is true because Irk} and {ail are bounded.
The sequences Irk} are also bounded from below by JL. Now, if some Irk} is not
bounded from above, go to Key Step; otherwise go to Last Step.
[Key Step] We proceed to prove that, if Irk} is not bounded from above for some
index i, the corresponding sequence can be omitted from the convex combination.
Assume without loss of generality that 1 is such an index and extract a subsequence
if necessary, so rl ---+ +00. Then

this is clear if {xl} is bounded; if not, it is a consequence ofl-coercivity. Remembering


Step 1, we thus have the key properties:

alrl
ak ---+ 0, Ii akxIII
l
k -
- k k
rl:Jllxlll
---+ 0
.

Now, for each k, define 13k E L1 n +1 by


1 The Convex Conjugate of a Function 51

ai+1
11~ := -k11 for i = 1, ... , n + 1.
-ak
We have
n+1 1
L l1~x~ = - - I (Xk - akxk) ~ X,
i=1 l-a k
n+1 1 1
JL ~ LI1~r~ = - - I (rk - akrk) ~ --I (rk - akJL) ~ r.
i=1 l-ak l-a k
Let us summarize this key step: starting from the n + 2 sequences of triples
{al, x~, r~}, we have eliminated one having {r~} unbounded, and thus obtained.e =
n + 1 sequences of triples {11~, xl, r~} satisfying
l l
11k E Lll' LI1~xl ~ x, LI1~r~ ~ p ~ r. (1.5.3)
i=1 i=1
Execute this procedure as many times as necessary, to end up with .e ~ 1 sequences
of triples satisfying (1.5.3), and all having {r~} bounded.
[Last Step] At this point of the proof, each {r~} is bounded; fromcoercivity of f, each
{x~} is bounded as well. Extracting subsequences if necessary, we are therefore in the
following situation: there are sequences of triples {11~, x~, r~}, i = 1, ... ,.e ~ n + 2,
satisfying
11k E Lll' 11k ~ 11 E Lll ;
f(xl) ~ r~, xl ~ xi, r~ ~ ri for i = 1, ... ,.e;
l l l
LI1~x~ ~ x, LI1~r~ ~ Ll1iri = p ~ r.
i=1 i=1 i=1
i
Because f is lower semi-continuous, f(x ) ~ ri for i = 1, ... ,.e and the definition
(lY.2.5.3) of co f gives
l l
(cof)(x)~ Ll1if(xi)~ Ll1iri =p~r.
i=1 i=1
In a word, (x,r) E epicof. o
This c10sedness property has important consequences:
Proposition 1.5.4 Let f satisfy (1.5.2). Then
(i) co f = co f (hence co f E Conv]Rn).
(ii) For any x E dom co f = co dom f, there are Xj E dom f and convex multipliers
aj for j = 1, ... , n + 1 such that
n+1 n+1
x = Lajxj and (cof)(x) = Lajf(xj).
j=1 j=1
52 X. Conjugacy in Convex Analysis

PROOF. By definition, co I is the function whose epigraph is the closure of co epi I;


but the latter set is already closed. Since the relations

coepi Ie epico Ie epico 1= coepil

are always true, (i) is proved.


Using the definition co I = II of Proposition IY.2.S.1, and knowing that co epi I
is closed, we infer that the point (x, co I (x)) is on the boundary of co epi I. Then,
invoking Proposition 111.4.2.3, we can describe (x, co I(x)} as a convex combination
of n + I elements in epi I:
n+1
(x,co/(x}) = L,aj(Xj,rj} ,
j=1

with I(xj} :::;; rj. Actually each rj has to be I(xj} if the corresponding aj is positive,
simply because
n+1
L,aj (Xj' I(xj}) E coepi/,
j=1
and, again from the definition of co I (x):
n+1 n+1
L,ajrj = co/(x}:::;; L,aj/(xj}. o
j=1 j=1

The combinations exhibited in (ii) are useful for an explicit calculation of co I


and a(co f), given in the next two results.

Theorem 1.5.5 Let I satisfy (1.1.1). For a given x E co dom I, suppose there exists
a family {Xj, aj} as described in Proposition l.5.4(ii); set

J := {j : aj > O} .

Then
(i) I(xj} = (co f)(xj}for all j E J,
(ii) co I is affine on the polyhedron P := co {Xj: j E J}.

PROOF. [(0] The function co I is convex and minorizes I:


n+l n+l
(co f)(x}:::;; L, aj(co I)(xj) :::;; L, aj/(xj} = (co f)(x);
j=l j=1

following an argument already seen on several occasions, this implies

and (i) is proved.


1 The Convex Conjugate of a Function 53

[(ii)] Consider the affine function (here {fJj} is a set of convex multipliers)

P 3 X' = LfJjXj H- l(x') := LfJj/(Xj).


jeJ jeJ
In view of (i), the convexity of co / implies that co / ~ l on P, with equality at the
givenx.
Now take x' #- x in P. The affine line passing through x and x' cuts rbd P at two
points y) and Y2, both different from x (since the latter is in ri P). Then write (see
Fig. 1.5.1)
x = fJx' + (l - fJ)y) with fJ e ]0, 1[.
By convexity,
(co f)(x) ~ fJ(co f)(x') + (l - fJ)(co /)(y)
~ fJl(x') + (l :.... fJ)l(y) [because co f :::; l ]
= l(x). [because l is affine]

Since (co f) (x) = l (x ), this is actually a chain of equalities:


fJ[(co f)(x') - l(x')] + (1 - fJ)[(co /)(y) - l(y)] = o.
Once again, we have two nonnegative numbers having a zero convex combination;
they are both zero and (ii) is proved. 0

Fig. 1.5.1. Affinity of a convex hull

n
Theorem 1.5.6 Under the hypotheses and notations of Theorem 1.5.5,
a(co f)(x) = a/(Xj); (1.5.4)
jeJ
Vs e a(co f)(x), (s, x) - (co f)(x) = (s, Xj) - /(Xj) for all j e J.
PROOF. A subgradient of co / at x is an s characterized by
(co f)*(s) + (co f)(x) -
(s, x) 0 = [Theorem 1.4.1]
-<==:} /*(s) + LjeJ aj /(Xj) - (s, LjeJ ajxj) 0 = [(co f)* = /*]
-<==:} LjeJ aj[/*(s) + /(Xj) - (s, Xj)] 0 =
-<==:} /*(s)+/(Xj)-(s,Xj)=O foralljeJ. [Fenchel (1.1.3)]

The last line means precisely that s e a/(Xj) for all j e J; furthermore, we can
write
/(Xj) - (s,Xj) = -/*(s) = -(cof)*(s) = (cof)(x) - (s,x). 0

Note in the right-hand side of (1.5.4) that each a/(Xj) could be replaced by
a(co f)(Xj): this is due to (1.4.4).
54 X. Conjugacy in Convex Analysis

Remark 1.5.7 As a supplement to (1.5.1), the above result implies the following: for
f satisfying (1.5.2), Argmin f is a compact set (trivial), whose convex hull coincides
with Argmin(co f). Indeed, use (1.5.4): an x E Argmin(co f) is characterized as
being a convex combination of points {Xj}jeJ such that 0 E af(xj), i.e. these Xj
minimize f. 0

Corollary 1.5.8 Let the function f : lRn ~ lR be lower semi-continuous, Gateaux


differentiable, and l-coercive. Then co f = co f is continuously differentiable on lRn;
furthermore,
V(cof)(X) = Vf(xj) for all j E J, (1.5.5)
where we have used the notation of Theorem 1.5.5.

= co f. The
PROOF. Our f satisfies (1.5.2), so Proposition 1.5.4(i) directly gives co f
latter function is finite everywhere, hence it has a nonempty subdifferential at any x
(Theorem 1.4.2): by virtue of (1.5.4), all the af(xj)'s are nonempty. Together with
Proposition 1.5.1, we obtain that all these af(xj)'s are actually the same singleton,
described by (1.5.5). Finally, the continuityofV(co f) follows from Theorem VI.6.2.4.
o
It is worth mentioning that, even if we impose more regularity on f (say COO),
co f as a rule is not C2 •

2 Calculus Rules on the Conjugacy Operation

The function f, whose conjugate is to be computed, is often obtained from some other
functions !;, whose conjugates are known. In this section, we develop a set of calculus
rules expressing f* in terms of the (fi)* (some rudimentary such rules were already
given in Proposition 1.3.1).

2.1 Image of a Function Under a Linear Mapping

Given a function g : lRm ~ lR U {+oo} satisfying (1.1.1), and a linear mapping


A : lRm ~ lRn , we recall that the image of g under A is the function defined by

lRn 3 x ~ (Ag)(x) := inf {g(y) : Ay = X}. (2.1.1)

Letg* be associated with a scalar product (., ·)m inlRm; we denote by (., ·)n the scalar
product in lRn , with the help of which we want to define (Ag)*. To make sure that Ag
satisfies (1.1.1), some additional assumption is needed: among all the affine functions
minorizing g, there is one with slope in (Ker A)l. = 1m A *.

Theorem 2.1.1 With the above notation, assume that 1m A * n dom g* =f:. 10. Then
Ag satisfies (1.1.1); its conjugate is

(Ag)* = g* 0 A* .
2 Calculus Rules on the Conjugacy Operation 55

PROOF. First, it is clear that Ag ;fE +00 (take x = Ay, with y E domg). On the other
hand, our assumption implies the existence of some Po = A *So such that g* (Po) <
+00; with Fenchel's inequality (1.1.3), we have for all y E Rm:

g(y) ~ (A*so, Y)m - g*(po) = (so, AY)n - g*(po).

For each x E Rn, take the infimum over those Y satisfying Ay = x: the affine function
(so, .) - g*(po) minorizes Ag. Altogether, Ag satisfies (1.1.1).
Then we have for s E Rn

(Ag)*(s) = SUPXERn[(S, x) - infAy=x g(y)]


= SUPxERn,Ay=x[(S, x) - g(y)]
= SUpYERm[(s, Ay) - g(y)] = g*(A*s). o
For example, when m = n and A is invertible, Ag reduces to goA -I, whose
conjugate is therefore g* 0 A*, given by Proposition 1.3.1(iv).
As a first application of Theorem 2.1.1, it is straightforward to compute the con-
jugate of a marginal function:

I(x) := inf {g(x, z) : Z E RP}, (2.1.2)

where g operates on the product-space R n x RP. Indeed, just call A the projection
onto Rn: A (x, z) = x E Rn, so I is clearly Ag. We obtain:

Corollary 2.1.2 With g : R n x RP =: R m -+ R U {+oo} not identically +00, let


g* be associated with a scalar product preserving the structure ofRm as a product
space:
(-, ')m = (-, ')n + (-, ')p,
and suppose that there is So ERn such that (so, 0) E domg*. Then the conjugate of
I defined by (2.1.2) is

!*(s) = g*(s, 0) for all s E Rn.

PROOF. It suffices to observe that, A being the projection defined above, there holds
for all YI = (XI, ZI) E Rm andx2 ERn,

(AYI> X2)n = (XI> X2)n = (Xlt X2)n + (Zit 0) P = (Yl> (X2, O»m ,
which dtifines the adjoint A *x = (x, 0) for all x E Rn. Then apply Theorem 2.1.1.
o
More will be said on this operation in §2.4. Here we consider another application:
the infimal convolution which, to II and h defined on R n , associates (see §IY.2.3)

(2.1.3)

To use Theorem 2.1.1, we take m = 2n and


56 X. Conjugacy in Convex Analysis

Corollary 2.1.3 Let two functions II and hfrom IRn to IR U {+oo}, not identically
t
+00, satisfy domN n domN '# 0. Then II h satisfies (1.1.1), and (fl 12)* = t
N+lt
PROOF. EquippinglRn xlRn with the scalar product (', ')+{', .), weobtaing*(slo S2) =
N(sl)+ N(S2) (Proposition 1.3.I(ix»andA*(s) = (s, s). Then apply the definitions.
o

Example 2.1.4 As sketched in Proposition 1.2.2.4, a function can be regularized if we take


its infimal convolution with one of the kernels 1/2ell·1I 2or ell· II. Then Corollary 2.1.3 gives,
with the help of Proposition 1.3. 1(iii) and Example 1.1.3:

(J ~ !ell'1I 2)* = f* + tell' 112 ,


(f ~ ell . 11>* = f* + ~IB(O.c) •

In particular, if f is the indicator of a nonempty set CeRn, the above formulae yield

2.2 Pre-Composition with an Affine Mapping

In view of the symmetry of the conjugacy operation, Theorem 2.1.1 suggests that
the conjugate of goA, when A is a linear mapping, is the image-function A *g*.
In particular, a condition was needed in §2.1 to prevent Ag(x) = -00 for some x.
Likewise, a condition will be needed here to prevent goA == +00. The symmetry is
not quite perfect, though: the composition of a closed convex function with a linear
mapping is still a closed convex function; but an image-function need not be closed,
and therefore cannot be a conjugate function.
We use notation similar to that of §2.1, but we find it convenient to distinguish
between the linear and affine cases.

Theorem 2.2.1 With g E ConvlRm and Ao linear from IRn to IRm, define A(x) :=
Aox + Yo E IRm and suppose that A(lRn) n domg '# 0. Then goA E ConvlRn and
its conjugate is the closure of the convex function

IRn 3 s 1-+ inf{g*(p) - {Yo, P)m : A6P


P
= s}. (2.2.1)

PROOF. We start with the linear case (Yo =


0): suppose that h E ConvlRn satisfies
ImAo n domh '# 0. Then Theorem 2.1.1 applied to g := h* and A := Ari gives
(Arih*)* = h 0 Ao; conjugating both sides, we see that the conjugate of h 0 Ao is the
closure of the image-function Arih*.
In the affine case, consider the function h := g(. + Yo) E Conv IRm; its conjugate
is given by Proposition 1.3.I(v): h* = g* - {Yo, ·)m. Furthermore, it is clear that

(g 0 A)(x) = g(Aox + Yo) = h(Aox) = (h 0 Ao)(x) ,

so (2.2.1) follows from the linear case. o


2 Calculus Rules on the Conjugacy Operation 57

Thus, the conjugation of a composition by a linear (or affine) mapping is not quite
straightforward, as it requires a closure operation. A natural question is therefore: when
does (2.2.1) define a closed function of s? In other words, when is an image-function
closed? Also, when is the infimum in (2.2.1) attained at some p? We start with a
technical result.

Lemma 2.2.2 Let g E Conv JR.m be such that 0 E dom g and let Ao be linear from
JR.n to JR. m. Make the following assumption:

ImAo nridomg # 0 i.e. 0 E ridomg - ImA o [= ri(domg - 1m Ao)] .

Then (g 0 Ao)* = Atig* and,for every s E dom(g Ao)*, the problem


0

inf {g*(p) : Atip = s} (2.2.2)


p

has an optimal solution p, which therefore satisfies g*(p) = (goAo)*(s) = Atig*(s).

PROOF. To prove (g 0 Ao)* = Ati g*, we have to prove that Ati g* is a closed function,

i.e. that its sublevel-sets are closed (Definition IY.1.2.3). Thus, for given r E JR., take
a sequence {Sk} such that

(Atig*)(Sk) :::;; r and Sk -+ s.

Take also Ok ..l- 0; from the definition of the image-function, we can find Pk E JR.m
such that
g*(Pk) :::;; r + Ok and AtiPk = Sk.
Let qk be the orthogonal projection of Pk onto the subspace V := lin dom g - 1m Ao.
Because V contains lindomg, Proposition 1.3.4 (withz = 0) gives g*(Pk) = g*(qk).
Furthermore, V..1 = (lin dom g)..1 n Ker Ati; in particular, qk - Pk E Ker Ati. In
summary, we have singled out qk E V such that

(2.2.3)

Suppose we can bound qk. Extracting a subsequence if necessary, we will have


qk -+ q and, passing to the limit, we will obtain (since g* is l.s.c)
g*(q) :::;; liminf g*(qk) :::;; r and Atiq =s .
The required closedness property Ati g* (q) :::;; r will follow by definition. Furthermore,
this q will be a solution of (2.2.2) in the particular case Sk == s and r = (Atig*)(s).
In this case, {qk} will be actually a minimizing sequence of (2.2.2).
To prove bo~dedness of qk> use the assumption: for some e > 0, Bm (0, e) n V
is included in domg - ImA. Thus, for arbitrary z E Bm(O, e) n V, we can find
Y E domg and X E JR.n such that z = y - Aox. Then

{qko Z)m = {qk, Y)m - (Atiqk. x)n


:::;; g(y) + g*(qk) - (Atiqko x)n [Fenchel (1.1.3)]
:::;; g(y) + r + Ok - {Sko x)n . [(2.2.3)]
58 X. Conjugacy in Convex Analysis

We conclude

sup {(qk, Z) : k = 1, 2, ... } is bounded for any Z E Bm (0, e) nV ,


which implies that qk is bounded; this is Proposition V.2.1.3 in the vector space V.
o
Adapting this result to the general case is now a matter of translation:

Theorem 2.2.3 With g E ConvlRm and Ao linear from IRn to IRm, define A(x) :=
Aox + Yo E IRm. Make the following assumption:

(2.2.4)

Then,for every s E dom(g 0 Ao)*. thefollowing minimization problem has a solution:

min {g*(p) - (p, Yo) : A~p


p
= s} = (g 0 A)*(s). (2.2.5)

PROOF. By assumption, we can choose x E IRn such that y := A(x) E ridomg.


Consider the function g := g(y + .) E Conv IRm. Observing that

(g 0 A)(x) = g(A(x) - y) = (g 0 Ao)(x - x) ,

we obtain from the calculus rule l.3.I(v)

(g 0 A)* = (g 0 Ao)* - (', x) .

Then Lemma 2.2.2 allows the computation of this conjugate. We have 0 in the
domain of g, and even in its relative interior:

ridomg = ridomg - {y} 30 E ImAo.

We can therefore write: for all s E dom(g 0 Ao)* [= dom(g 0 A)*],

(g 0 Ao)*(s) = min{g*(p)
p
: A~p = s},
or also

(g 0 A)*(s) - (s, x) = minp{g*(p) - (p, y) : A6P = s}


= -(s, x) + minp{g*(p) - (p, Yo) : A6P = s}. 0

An example will be given in Remark 2.2.4 below, to show that the calculus rule (2.2.5)
does need a qualification assumption such as (2.2.4). First of all, we certainly need to avoid
goA == +00, i.e.
(2.2.Q.i)
but this is not sufficient, unless g is a polyhedral function (this situation has essentially been
treated in §Y.3.4).
2 Calculus Rules on the Conjugacy Operation 59

A "comfortable" sharpening of (2.2.Q.i) is

A(lRn) n intdomg =1= Ii'J i.e. 0 E intdomg - A(lRn) , (2.2.Q.ii)

but it is fairly restrictive, implying in particular that dom g is full-dimensional. More tolerant
is
o E int[domg - A (lRn)] , (2.2.Q.iii)
which is rather common. Use various results from Chap. V, in particular Theorem Y.2.2.3(iii),
to see that (2.2.Q.iii) means

O"domg(P) + (p, YO) > 0 for all nonzero P E Ker A~ .

Knowing that O"domg = (g*)~ (Proposition 1.2.2), this condition has already been alluded to
at the end of §IY.3.
Naturally, our assumption (2.2.4) is a further weakening; actually, it is only a slight
generalization of (2.2.Q.iii). It is interesting to note that, if (2.2.4) is replaced by (2.2.Q.iii),
the solution-set in problem (2.2.5) becomes bounded; to see this, read again the proof of
Lemma 2.2.2, in which V becomes Rm.

Remark 2.2.4 Condition (2.2.4) is rather natural: when it does not hold, almost all informa-
tion on g is ignored by A, and pathological things may happen when conjugating. Consider
the following example withm = n = 2, A(~, 1]) := (~, 0),

1]IOg1] for 1] > 0,


R2 3 (~, 1]) 1-+ g(~, 1]) := / 0 for 1] = 0
+00 elsewhere ,

and the scalar product

(s, x) = (p, .), (~, 1]») = (p + .)~ + (p + 2.)1].


Easy calculations give:

*( ) {exP(. - 1) if p +. = 0,
g p,. = +00 otherwise,

and A*(p,.) = (p + .)(2, -1).


Note: taking the canonical basis ofR2 (which is not orthonormal for (., .) I), we can define

A := [~ ~ J. M := [~ ~] so that
(s,x)=sTMx and A*=[ 2 2]=M-1A™.
-1 -1

Thus domg n ImA = R x {OJ; none of the assumptions in Theorem 2.2.3 are satisfied;
domg* n ImA* = {OJ,

A*g*(O) = inf{exp(. -1) : p +. = O} = 0,

and the infimum is not attained "at finite distance". Note also that A *g* is closed (how could
it not be, its domain being a singleton!) and is the conjugate of goA == o.
The trouble illustrated by this example is that composition with A extracts from g only
its nasty behaviour, namely its vertical tangent plane for 1] = o. 0
60 X. Conjugacy in Convex Analysis

The message of the above counter-example is that conjugating brutally goA, with a
scalar product defined on the whole of lim , is clumsy; this is the first line in Fig. 2.2.1. Indeed,
the restriction of g to 1m A, considered as a Euclidean space, is a closed convex function by
itself, and this is the relevant function to consider: see the second line of Fig. 2.2.1 (but a
scalar product is then needed in 1m A). Alternatively, if we insist on working in the whole
environment space lim, the relevant function is rather g + 11m A (third line in Fig. 2.2.1,
which requires the conjugate of a sum). All three constructions give the same goA, but the
intermediate conjugates are quite different.

xeRn_AeL(Rn,lmA) _AxelmA_ geConvlmA _(goA)(x)

X e Rn _ A e L(Rn,Rm) _ Ax e Rm _ g+ilmA e COiWRm_ (goA)(x)

Fig.2.2.1. Three possible expressions for goA

Corollary 2.2.5 Let g E Conv lim and let A be linearfrom lin to lim with 1m A ndom g "# 0.
Then
(g 0 A)*' = A*(g + IlmA)* (2.2.6)
and,for every s E dom(g 0 A)*, the problem

inf {(g
p
+ IlmA)*(P) : A* p = s}

has an optimal solution p, which therefore satisfies

(2.2.7)

PROOF. The closed convex functions goA and (g +11m A) 0 A are identical on lin, so they have
the same conjugate. Also, g + 11m A is a closed convex function whose domain is included in
ImA, hence
=
ridom(g + limA) n ImA ridom(g + limA) "# 0.
The result follows from Theorem 2.2.3. o
We leave it as an exercise to study how Example 2.2.4 is modified by this result. To
make it more interesting, one can also take g(~, 1]) + 1/2~2 instead of g. Theorem 2.2.3
and Corollary 2.2.5 are two different ways of stating essentially the same thing; the former
requires that a qualification assumption such as (2.2.4) be checked; the latter requires that the
conjugate of the additional function g + 11m A be computed. Either result may be simpler to
use, depending on the particular pro~lem under consideration.

Remark 2.2.6 We know from Proposition 1.3.2 that (g + IlmA)* =


(g 0 P A)* 0 P A, where
P A : lim ~ lim is the orthogonal projection onto the subspace 1m A (beware that, by contrast
to A, P A operates on lim). It follows that (2.2.6) - (2.2.7) can be replaced respectively by:
2 Calculus Rules on the Conjugacy Operation 61

inf{(goPA)*(PAP): A*p=s},
(g 0 PA)*(P AP) = A*[(g 0 PA)* 0 PA](S) = (g 0 A)*(s).

When the qualification assumption allows the application of Theorem 2.2.3, we are bound to
realize that
A*g* = A*(g + IImA)* = A*[(g 0 P A)* 0 PA]. (2.2.8)
Indeed, Theorem 2.2.3 actually applies also to the couple (P A, g) in this case: (g 0 P A)*
can be developed further, to obtain (since PAis symmetric)

[(g 0 P A)* 0 P A](P) = [(P Ag*) 0 P A](P) = inf (g*(q) : P Aq = PAP}.


Now, to say that P Aq = PAP, i.e. thatP A(q-P) = O,istosaythatq- P E (ImA)l. = Ker A*,
i.e. A * P = A *q. Altogether, we deduce

A*(g + IImA)*(S) = p.q


inf{g*(q) : A* P = A*q = s} = A*g*(s)
and (2.2.8) does hold. o

To conclude this subsection, consider an example with n = 1: given a function


g Xo E dom g, 0 #- d E JR m, and set
E Conv JRm, fix

1/I(t) := g(xo + td) for all t E JR.

This 1/1
is the composition of g with the affine mapping which, to t E JR, assigns
Xo + t d E JR m. It is an exercise to apply Theorem 2.2.1 and obtain the conjugate:

1/I*(a) = min {g*(s) - (s, xo) : (s, d) = a}

whenever, for example, Xo and d are such that Xo + td E ri dom g for some t - this is
(2.2.4). This example can be further particularized to various functions mentioned in
§ 1 (quadratic, indicator, ... ).

2.3 Sum of Two Functions

The formula for conjugating a sum will supplement Proposition 1.3. 1(ii) to obtain
the conjugate of a positive combination of closed convex functions. Summing two
functions is a simple operation (at least it preserves closedness); but Corollary 2.1.3
shows that a sum is the conjugate of something rather involved: an inf-convolution.
Likewise, compare the simplicity of the composition goA with the complexity of
its dual counterpart Ag; as seen in §2.2, difficulties are therefore encountered when
conjugating a function of the type goA. The same kind of difficulties must be expected
when conjugating a sum, and the development in this section is quite parallel to that
of§2.2.

Theorem 2.3.1 Let gl and g2 be in Conv JRn and assume that dom gl n dom g2 #- 0.
The conjugate (gl + g2)* of their sum is the closure of the convex function gr t g;'
62 X. Conjugacy in Convex Analysis

PROOF. Call f;* := gj, fori = 1,2; apply Corollary 2.1.3: (grtgi)* = gl +g2; then
take the conjugate again. 0

The above calculus rule is very useful in the following framework: suppose we
want to compute an inf-convolution, say h = f t g with f and g in Conv]Rn . Compute
f* and g*; if their sum happens to be the conjugate of some known function in
Conv]Rn, this function has to be the closure of h.
Just as in the previous section, it is of interest to ask whether the closure operation
is necessary, and whether the inf-convolution is exact.

Theorem 2.3.2 The assumptions are those of Theorem 2.3.1. Assume in addition
that
the relative interiors ofdom gl and dom g2 intersect I
or equivalently: 0 E ri(domgl - domg2). (2.3.1)

Then (gl + g2)* = gr t gi and,for every s E dom(gl + g2)*, the problem

has an optimal solution (p, q), which therefore satisfies

PROOF. Define g E Conv(]Rn x ]Rn) by g(XIo X2) := gl (Xl) + g2(X2) and the linear
operator A : ]Rn -+ ]Rn x ]Rn by Ax := (x, x). Then gl + g2 = goA, and we proceed
to use Theorem 2.2.3. As seen in Proposition 1.3.1(ix), g*(p, q) = gr(p) + gi(q)
and straightforward calculation shows that A *(p, q) = p + q. Thus, if we can apply
Theorem 2.2.3, we can write

(gl + g2)*(S) = (g 0 A)*(s) = (A*g*)(s)


= p,q
inf{g~(p) + g;(q) : p + q = s} = (g~ t gi)(s)

and the above minimization problem does have an optimal solution.


To check (2.2.4), we note that dom g = dom gl x dom g2, and 1m A is the diagonal
set Ll := {(s, s) : s E ]Rn}. We have

(x, x) E ridomgl x ridomg2 = ri(domgl x domg2)

(Proposition 111.2.1.11), and this just means that 1m A = Ll has a nonempty intersec-
tion with ri dom g. 0

As with Theorem 2.2.3, a qualification assumption - taken here as (2.3.1) playing the
role of (2.2.4) - is necessary to ensure the stated properties; but other assumptions exist that
do the same thing. First of all,

domgl n domg2 =1= 0 i.e. 0 E domgl - domg2 (2.3.Q.j)


2 Calculus Rules on the Conjugacy Operation 63

is obviously necessary to have gl + g2 '# +00. We saw that this ''minimal'' condition was
sufficient in the polyhedral case. Here the results of Theorem 2.3.2 do hold if (2.3.1) is
replaced by:
gl and g2 are both polyhedral, or also
gl is polyhedral and dom gl n ri dom g2 :f:. (21 •
The "comfortable" assumption playing the role of(2.2.Q.ii) is

intdomgl n intdomg2 :f:. (21, (2.3.Q.jj)

which obviously implies (2.3.1). We mention a non-symmetric assumption, particular to a


sum:
(intdomgl) n domg2 :f:. (21. (2.3.Q.jj')
Finally, the weakening (2.2.Q.iii) is

oE int(domgl - domg2) (2.3.Q.jjj)

which means
Udomg.(S) +udomg2(-s) > 0 foralls:f:. 0;
we leave it as an exercise to prove (2.3.Q.jj') => (2.3.Q.JJJ).

The following application of Theorem 2.3.2 is important in optimization: take


two functions gl and g2 in Conv]Rn and assume that

is a finite number. Under some qualification assumption such as (2.3.1),

(2.3.2)

a relation known as Fenchel s duality Theorem. If, in particular, g2 = Ie is an indicator


function, we read (still under some qualification assumption)

inf U(x) : x E C} = -minU*(s) + oc(-s) : S E ]Rn}.

It is appropriate to recall here that conjugate functions are interesting primarily via
their arguments and subgradients; thus, it is the existence of a minimizing s in the
right-hand side of (2.3.2) which is useful, rather than a mere equality between two
real numbers.

Remark 2.3.3 Our proof of Theorem 2.3.2 is based on the fact that the sum of two functions
can be viewed as the composition of a function with a linear mapping. It is interesting to
demonstrate the converse mechanism: suppose we want to prove Theorem 2.2.3, assuming
that Theorem 2.3.2 is already proved.
Given g E Conv lRm and A : lRn ~ lRm linear, we then select the two functions with
argument (x, y) E lRn x lRm :

gl (x, y) := g(y) and g2:= Igr A .

Direct calculations give


64 X. Conjugacy in Convex Analysis

if s = 0,
gl*(s, p)
g*(p)
= { +00 if not ,
(2.3.3)

g!(s, p) = sup {(s, x}n + (p, Y}m : Y = Ax} = sup(x, s + A* P}n.


x,Y x

Hence
*
g2(s, p) =
{O+00 ifs+A*p=O,
ifnot. (2.3.4)

We want to study (g 0 A)*, which is obtained from the following calculation:

(gl + g2)*(S, 0) = sup{(s, x}n - g(y) : y = Ax} = (g 0 A)*(s) ,


x,Y

so the whole question is to compute the conjugate of gl + g2. If (2.3. 1) holds, we have

but, due to the particular structure appearing in (2.3.3) and (2.3.4), the above minimization
problem can be written

inf{g*(p): s-A*p=O}=A*g*(s).

There remains to check (2.3.1), and this is easy:

domgl = Rn x domg hence ridomgl = Rn x ridomg;


domg2 = gr A hence ridomg2 = gr A.

Under these conditions, (2.3.1) expresses the existence of an Xo such that Axo E ri dom g,
and this is exactly (2.2.4), which was our starting hypothesis. 0

To conclude this study of the sum, we mention one among the possible results
concerning its dual operation: the infimal convolution.

Corollary 2.3.4 Take fl and h in ConvlRn , with flO-coercive and h bounded


from below. Then the inf-convolution problem (2.1.3) has a nonempty compact set of
solutions; furthermore fl thE Conv IRn.

PROOF. Letting f.L denote a lower bound for h, we have

and the first part of the claim follows. For closedness of the infimal convolution, we
set gi := 1;*, i = 1, 2; because ofO-coercivity of flo 0 E int dom gl (Remark 1.3.10),
and g2 (0) ~ - f.L. Thus, we can apply Theorem 2.3.2 with the qualification assumption
(2.3.Q.JJ'). 0
2 Calculus Rules on the Conjugacy Operation 65

2.4 Infima and Suprema

A result of the previous sections is that the conjugacy correspondence establishes


a symmetry between the sum and the inf-convolution, and also between the image
and the pre-composition with a linear mapping. Here we will see that the operations
"supremum" and "closed convex hull of the infimum" are likewise symmetric to each
other.

Theorem 2.4.1 Let {fj }j eJ be a collection offunctions satisfying (1.1.1) and assume
that there is a common affine function minorizing all of them:

sup{f/(s) : j E J} < +00 forsomes E]Rn.

Then their infimum f := infjeJ fj satisfies (1.1.1), and its conjugate is the supremum
of the rJ8:
(infjeJ ij)* = SUPjeJ fj*. (2.4.1)

PROOF. By definition, for all s E ]Rn

j*(s) = sUPx[(s, x) - infjeJ fj(x)]


= sUPx SUPj[(s, x) - fj(x)]
= SUPj supX£(s, x) - fj(x)] = SUPjeJ fj*(s). o
This result should be compared to Corollary 2.1.2, which after all expresses the conjugate
of the same function, and which is proved in just the same way; compare the above proof with
that of Theorem 2.1.1. The only difference (apart from the notation j instead of z) is that the
space ]RP is now replaced by an arbitrary set J, with no special structure, in particular no
scalar product. In fact, Theorem 2.4.1 supersedes Corollary 2.1.2: the latter just says that, for
g defined on a product-space, the easy-to-prove formula

g*(s, O) = supg;(s, z)
z
g;
holds, where the subscript x of indicates conjugacy with respect to the first variable only.
In a word, the conjugate with respect to the couple (x, z) is of no use for computing the
conjugate at (·,0). Beware that this is true only when the scalar product considers x and z
as two separate variables (i.e. is compatible with the product-space). Otherwise, trouble may
occur: see Remark 2.2.4.

Example 2.4.2 The Euclidean distance to a nonempty set C C ]Rn is an inf-function:

x ~ dc(x) = inf (fy(x) : Y E C} with fy(x):= IIx - yll .


Remembering that the conjugate of the norm II . II is the indicator of the unit ball
(combine (V.2.3.1) and Example 1.1.5), the calculus rule 1.3. 1(v) gives

(/y)*(s) = IB(O,I)(S) + (s, y) (2.4.2)


66 X. Conjugacy in Convex Analysis

and (2.4.1) may be written as

dC(s) = IB(o,I)(S) + sup (S, y) = IB(O,I)(S) + ods) ,


YEC

a fonnula already given in Example 2.1.4. The same exercise can be carried out with
the squared distance. 0

Example 2.4.3 (Directional Derivative) With f E Conv IRn , let Xo be a point where
af(xo) is nonempty and consider for t > 0 the functions

IRn 3 d 1-+ ft(d) := f(xo + td) - f(xo) . (2.4.3)


t

They are all minorlzed by (so, .), with So E af(xo); also, the difference quotient is
an increasing function of the stepsize; their infimum is therefore obtained for t .J.. o.
This infimum is denoted by f'(xo, .), the directional derivative of fat Xo (already
encountered in Chap. VI, in a finite-valued setting). Their conjugates can be computed
directly or using Proposition 1.3.1:

Sl-+
_ /*(s)
(jjt ) *(s) - + f(xo) - (s, xo)
, (2.4.4)
t
so we obtain from (2.4.1)

[f '(Xo, .)]*() -
s - sup
/*(s) + f(xo) - (s, xo)
.
t>O t
Observe that the supremand is always nonnegative, and that it is 0 if and only if
s E af(xo) (cf. Theorem 1.4.1). In a word:

[I' (xo, .)]* = la!(xo) .


Taking the conjugate again, we see that the support function of the subdifferential is
the closure of the directional derivative; a result which must be compared to §Vl.l.l.
o

As already mentioned, sup-functions are fairly important in optimization, so a


logical supplement to (2.4.1) is the conjugate of a supremum.

Theorem 2.4.4 Let {gj ljE} be a collection ofjUnctions in ConvlRn. If their supre-
mum g := SUPjE} is not identically +00, it is in ConvlRn, and its conjugate is the
closed convex hull of the s:gi
(SUPjE) gj)* = co (infjE) gj) . (2.4.5)

PROOF. Call Jj := gj, hence fl


= gj, and g is nothing but the /* of (2.4.1). Taking
the conjugate of both sides, the result follows from (1.3.4). 0
2 Calculus Rules on the Conjugacy Operation 67

Example 2.4.5 Given, as in Example 2.4.2, a nonempty bounded set C, let

IRn :3 x ~ L1c(x):= sup{lIx - yll : y E C}

be the distance from x to the most remote point of cl C. Using (2.4.2),

inf (/y)*(s) = 18(0 I)(S) + inf (s, y) ,


yeC 'yeC

so (2.4.5) gives
.1 C= co (18(0,1) - O'-C). o

Example 2.4.6 (Asymptotic Function) With I E Conv IRn ,xo E dom I and It as
defined by (2.4.3), consider

I~(d) := sup {ft(d) : t > O} = lim{ft(d) : t ~ +oo}

(see §1y'3.2). Clearly, It E ConvlRn and It(O) = 0 for all t > 0, so we can apply
Theorem 2.4.4: 160 E Conv IRn. In view of (2.4.4), the conjugate of 160 is therefore
the closed convex hull of the function

s~m
. f I*(s) + I(xo) - (s, xo)
.
t>o t

Since the infimand is in [0, +00], the infimum is 0 if S E dom J*, +00 if not. In
summary, (/60)* is the indicator ofcldomJ*. Conjugating again, we obtain

I~ = O'domf* '
a formula already seen in (1.2.3).
The comparison with Example 2.4.3 is interesting. In both cases we extremize
the same function, namely the difference quotient (2.4.3); and in both cases a support
function is obtained (neglecting the closure operation). Naturally, the supported set is
larger in the present case of maximization. Indeed, dom J* and the union of all the
subdifferentials of I have the same closure. 0

The conjugate of a sup-function appears to be rather difficult to compute. One


possibility is to take the supremum of all the affine functions minorizing all the fj* 's,
i.e. to solve (1.3.5) (a problem with infinitely many constraints, where I stands for
infj fj*)' Another possibility, based on (IY.2.5.3), is to solve for {aj. Sj}

inf{Ej,!llajfj*(Sj) : a E .1 n+J. Sj Edomfj*. Ej,!ll ajsj =s}; (2.4.6)

but then we still have to take the lower semi-continuous hull of the result. Both
possibilities are rather involved, let us mention a situation where the second one takes
an easier form.
68 X. Conjugacy in Convex Analysis

Theorem 2.4.7 Let fl' ... , fp be finitely many convex functions from JRn to JR, and
let f := maxj fj; denote by m := min{p, n + I}. For every
S E domf* = cOU{dom.tj* : j = 1, ... , p},
there exist m vectors Sj E dom fj* and convex multipliers aj such that

S = I>jSj and f*(s) = Laj.tj*(sj)' (2.4.7)


j j

In other words: an optimal solution exists in the minimization problem (2.4.6), and
the optimal value is a closed convex function ofs.
PROOF. Set g := minj fj*, so f* = cog; observe that domg = Uj domfj*. Because
each fj* is closed and I-coercive, so is g. Then we apply Proposition 1.5.4: first,
co g = co g; second, for every

= domcog = co{Uj domfj*} ,


S E domf*
there are Sj E domg and convex multipliers aj, j = 1, ... , n + 1 such that
n+1 n+1
S = LajSj and (cog)(s) = /*(s) = Lajg*(sj)'
j=1 j=1

In the last sum, each g*(Sj) is a certain !;*(Sj); furthermore, several sj's having
the same 1;* can: be compressed to a single convex combination (thanks to convexity
of each 1;*, see Proposition IY.2.S.4). Thus, we obtain (2.4.7). 0

The framework in which we proved this result may appear very restrictive; however,
enlarging it is not easy. For example, extended-valued function Ii create a difficulty: (2.4.7)
does not hold with n = I and
!I (x) = expx and h = I{o} .

Example 2.4.8 Consider the function g+ := max{O, g}, where g : JRn --+- JR is
convex. With fl := g and h == 0, we have ft = I{o} and, according to Theorem 2.4.7:
for all S E dom(g+)* =
co{{O} U domg*},
(g+)*(s)= min {ag*(sl) : SI E domg*, a E [0,1], aSI =s}.

For S # 0, the value a = 0 is infeasible: we can write


(g+)*(s) = min a g*(ks) for S # O.
o<a~ I

For S = 0, the feasible solutions have either a = 0 or SI = O. Both cases are covered
by the condensed formulation
(g+)*(O) = min {g*(O), O},
which could also have been obtained by successive applications of the formula inf f =
- /*(0). 0
2 Calculus Rules on the Conjugacy Operation 69

2.5 Post-Composition with an Increasing Convex Function

For f E Conv lRn and g E Conv lR, the function g 0 f is in Conv lRn if g is increasing
(we asswne f(lR n ) n domg :j:. 0 and we set g(+oo) = +(0). A relevant question is
then how to express its conjugate, in terms of f* and g*.
We start with some preliminary observations, based on the particular nature of
g. The domain of g is an interval unbounded from the left, whose relative interior is
intdomg :j:. 0. Also, domg* c lR+ and (remember §1.3.2)

g(y) = g**(y) = sup {py - g*(p) : p E ridomg*}.

Theorem 2.5.1 With f and g as above, assume that f (lRn) n int dom g :j:. 0. For all
s E dom(g 0 1)*, define the function Vrs E ConvlR by

af*{ks) + g*(a) ija>O,


lR 3 a ~ Vrs(a) ={ Udom!(S) + g*(O) ija=O,
+00 ija<O.

Then
(g 0 I)*(s) = min
aER
Vrs(a) .

PROOF. By definition,

-(g 0 I)*(s) = infx[g(f(x» - (s, x)]


= infx,r{g(r) - (s,x) : f(x) ~ r} [g is increasing]
= infx,r[g(r) - (s, x) + Iepi!(x, r)].
Then we must compute the conjugate of a swn; we set

lRn x lR 3 (x, r) ~ fl(X, r) := g(r) - (s, x)

and h := Iepi!' We have


domfl = lRn x domg, intdomfl = lRn x intdomg;
so, by asswnption:

intdomfl ndomh = (lRn X intdomg) nepif:j:. 0.


Theorem 2.3.2, more precisely Fenchel's duality theorem (2.3.2), can be applied with
the qualification condition (2.3.Q.JJ'):

(g 0 I)*(s) = min {ft(-p, a) + f2*(P' -a) : (p, a) E lRn x lR}.

The computation of the above two conjugates is straightforward and gives

(g 0 I)*(s) = min[g*(a)
p,a
+ I{-s}(-p) + Uepi!(P, -a)] = min
a
Vrs(a) ,

where the second equality comes from (1.2.2). o


Let us take two examples, using the ball-pen function defined on its domain B(0, 1) by f(x) = 1 − √(1 − ‖x‖²); here f(ℝⁿ) = [0, 1] ∪ {+∞}.

- With g(r) = r⁺, we leave it to the reader to compute the relevant conjugates: ψ_s has the unique minimum point α = 0 for all s.
- With g = I_{]−∞,0]}, the qualification assumption in the above theorem is no longer satisfied. We have (g ∘ f)*(s) = 0 for all s, while min_α ψ_s(α) = ‖s‖ − 1.

Remark 2.5.2 Under the qualification assumption, ψ_s always attains its minimum at some α ≥ 0. As a result, if for example σ_{dom f}(s) = (f*)'_∞(s) = +∞, or if g*(0) = +∞, we are sure that α > 0. □

Theorem 2.5.1 takes an interesting form in case of positive homogeneity, which can apply to f or g. For the two examples below, we recall that a closed sublinear function is the support of its subdifferential at the origin.

Example 2.5.3 If g is positively homogeneous, say g = σ_I for some closed interval I, the domain of ψ_s is contained in I, on which the term g*(α) disappears. The interval I supported by g has to be contained in ℝ₊ (to preserve monotonicity) and there are essentially two instances: Example 2.4.8 was one, in which I was [0, 1]; in the other instance, I is unbounded (a half-line), and the following example gives an application.

Consider the set C := {x ∈ ℝⁿ : f(x) ≤ 0}, with f : ℝⁿ → ℝ convex; its support function can be expressed in terms of f*. Indeed, write the indicator of C as the composed function:

I_C = I_{]−∞,0]} ∘ f.

This places us in the present framework: g is the support of ℝ₊. To satisfy the hypotheses of Theorem 2.5.1, we have only to assume the existence of an x₀ with f(x₀) < 0; then the result is: for 0 ≠ s ∈ dom σ_C, there exists ᾱ > 0 such that

σ_C(s) = min_{α>0} α f*(s/α) = ᾱ f*(s/ᾱ).

Indeed, we are in the case of Remark 2.5.2 because dom f = ℝⁿ. □
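As a quick numerical illustration (ours, with assumed data): take f(x) = ‖x‖² − 1, so that C is the unit ball and σ_C(s) = ‖s‖, while f*(s) = ¼‖s‖² + 1 by direct computation. The minimization over α then reproduces the norm.

```python
import numpy as np

# Example 2.5.3 in action: C = {x : f(x) <= 0} with f(x) = ||x||^2 - 1,
# i.e. C is the unit ball and sigma_C(s) = ||s||.  Here f*(s) = ||s||^2/4 + 1,
# and the example predicts  sigma_C(s) = min_{a>0} a f*(s/a).
rng = np.random.default_rng(0)
alphas = np.linspace(1e-4, 20.0, 200001)          # grid for the multiplier a > 0
for _ in range(3):
    s = rng.normal(size=3)
    f_star = lambda p_sq: p_sq / 4.0 + 1.0        # f* as a function of ||p||^2
    values = alphas * f_star(np.dot(s, s) / alphas**2)
    print(f"||s|| = {np.linalg.norm(s):.6f}   min_a a f*(s/a) = {values.min():.6f}")
```

The minimum is attained at ᾱ = ‖s‖/2, in agreement with the closed-form value ‖s‖.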


Example 2.5.4 In our second example, it is f that is positively homogeneous. Then, for all s ∈ dom(g ∘ f)*,

ψ_s(α) = { g*(α)                 if α > 0 and s/α ∈ ∂f(0),
           σ_{dom f}(s) + g*(0)   if α = 0,
           +∞                     if α < 0.

If f is finite everywhere, i.e. ∂f(0) is bounded, we have for all s ≠ 0

(g ∘ f)*(s) = min {g*(α) : α > 0 and s/α ∈ ∂f(0)}.

As an application, let Q be symmetric positive definite and define the function

ℝⁿ ∋ x ↦ f(x) := √⟨Q⁻¹x, x⟩,

which is the support function of the elliptic set (see Example V.2.3.4)

E_Q := ∂f(0) = {s ∈ ℝⁿ : ⟨s, Qs⟩ ≤ 1}.

The composition of f with the function r ↦ g(r) := ½r² is the quadratic form associated with Q⁻¹, whose conjugate is known (see Example 1.1.3):

(g ∘ f)(x) = ½⟨Q⁻¹x, x⟩,  (g ∘ f)*(s) = ½⟨s, Qs⟩.

Then

½⟨s, Qs⟩ = min {½α² : α ≥ 0 and s ∈ αE_Q}

is the half-square of the gauge function γ_{E_Q} (Example V.2.3.4). □

2.6 A Glimpse of Biconjugate Calculus

Taking the biconjugate of a function, i.e. "close-convexifying" it, is an important operation. A relevant question is therefore to derive rules giving the biconjugate of a function obtained from other functions having known biconjugates.

First of all, Proposition 1.3.1 can easily be reproduced to give the following statements, in which the various f's and f_j's are assumed to satisfy (1.1.1); here cl co denotes the closed convex hull, i.e. the biconjugate.

(i) The biconjugate of g(·) := f(·) + a is (cl co g)(·) = (cl co f)(·) + a.
(ii) With α > 0, the biconjugate of g := αf is cl co g = α(cl co f).
(iii) With α ≠ 0, the biconjugate of the function g(x) := f(αx) is (cl co g)(x) = (cl co f)(αx).
(iv) More generally: if A is an invertible linear operator, cl co(f ∘ A) = (cl co f) ∘ A.
(v) The biconjugate of g(·) := f(· − x₀) is (cl co g)(·) = (cl co f)(· − x₀).
(vi) The biconjugate of g := f + ⟨s₀, ·⟩ is cl co g = (cl co f) + ⟨s₀, ·⟩.
(vii) If f₁ ≤ f₂, then cl co f₁ ≤ cl co f₂.
(ix) The biconjugate of f(x) = Σ_j f_j(x_j) is cl co f = Σ_j cl co f_j.

Note: there is nothing corresponding to (viii). All these results are straightforward, from the very definition of the closed convex hull. Other calculus rules are not easy to derive: for most of the operations involved, only inequalities are obtained.

Consider first the sum: let f₁ and f₂ satisfy (1.1.1), as well as

dom f₁ ∩ dom f₂ ≠ ∅.

Clearly, cl co f₁ + cl co f₂ is then a function of Conv ℝⁿ which minorizes f₁ + f₂. Therefore

cl co f₁ + cl co f₂ ≤ cl co (f₁ + f₂).

This inequality is the only general comparison result one can get, as far as sums are concerned; and closed convexity of f₁, say, does not help: it is only when f₁ is affine that equality holds.

Let now (f_j)_{j∈J} be a collection of functions and f := sup f_j. Starting from f_j ≤ f for j ∈ J, we obtain immediately cl co f_j ≤ cl co f and

sup_{j∈J} (cl co f_j) ≤ cl co f.

Nothing more accurate can be said in general.

The case of an infimum is hardly more informative:

Proposition 2.6.1 Let {f_j}_{j∈J} be a collection of functions satisfying (1.1.1) and assume that there is a common affine function minorizing all of them. Then

cl co (inf_{j∈J} f_j) = cl co (inf_{j∈J} cl co f_j) ≤ inf_{j∈J} cl co f_j.

PROOF. The second relation is trivial. As for the first, Theorem 2.4.1 gives (inf_j f_j)* = sup_j f_j*. The left-hand function is in Conv ℝⁿ, so we can take the conjugate of both sides and apply Theorem 2.4.4. □

So far, the calculus developed in the previous sections has been of little help. The situation improves for the image function under a linear mapping.

Proposition 2.6.2 Let g : ℝᵐ → ℝ ∪ {+∞} satisfy (1.1.1) and let A be linear from ℝⁿ to ℝᵐ. Assuming Im A* ∩ ri dom g* ≠ ∅, the equality cl co(Ag) = A(cl co g) holds. Actually, for each x ∈ dom cl co(Ag), there exists y such that

[cl co(Ag)](x) = (cl co g)(y) and Ay = x.

PROOF. Since Im A* ∩ dom g* ≠ ∅, Theorem 2.1.1 applies: (Ag)* = g* ∘ A*. Likewise, Theorem 2.2.3 (with g and A replaced by g* and A*) allows the computation of the conjugate of the latter. □

The qualification assumption needed in this result is somewhat abstract; a more concrete one is for example 0-coercivity of cl co g: then Im A* ∋ 0 ∈ int dom g* (see Remark 1.3.10).
The cases of marginal functions and inf-convolutions are immediately taken care
of by Proposition 2.6.2; the corresponding statements and proofs are left to the reader.

3 Various Examples

The examples listed below illustrate some applications of the conjugacy operation.
It is an obviously useful tool thanks to its convexification ability; note also that the
knowledge of f* fully characterizes co f, i.e. f in case the latter is already in Conv IRn;
another fundamental property is the characterization (1.4.2) of the subdifferential. All
this concerns convex analysis more or less directly, but the conjugacy operation is
instrumental in several areas of mathematics.

3.1 The Cramer Transformation

Let P be a probability measure on ℝⁿ and define its Laplace transform L_P : ℝⁿ → ]0, +∞] by

ℝⁿ ∋ s ↦ L_P(s) := ∫_{ℝⁿ} exp⟨s, z⟩ dP(z).

The so-called Cramér transform of P is the function

ℝⁿ ∋ x ↦ C_P(x) := sup {⟨s, x⟩ − log L_P(s) : s ∈ ℝⁿ}.

It is used for example in statistics. We recognize in C_P the conjugate of the function log L_P, which can be shown to be convex (as a consequence of Hölder's inequality) and lower semi-continuous (from Fatou's lemma). We conclude that C_P is a closed convex function, whose conjugate is log L_P.
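A standard one-dimensional illustration (ours, not from the text) is the Bernoulli law with parameter p, whose Cramér transform is known to be the relative entropy x log(x/p) + (1 − x) log((1 − x)/(1 − p)) on [0, 1]. The sketch below recovers it as a numerical conjugate of log L_P.

```python
import numpy as np

# The Cramer transform of a Bernoulli(p) law, computed two ways: as the
# numerical conjugate of log L_P, and from the known closed form (the
# relative entropy).  Data and grids are ours, for illustration only.
p = 0.3
s = np.linspace(-40, 40, 400001)
log_laplace = np.log(1 - p + p * np.exp(s))        # log L_P(s)

for x in [0.1, 0.3, 0.8]:
    cramer = np.max(x * s - log_laplace)           # sup_s [sx - log L_P(s)]
    closed = x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))
    print(f"x={x:.1f}   conjugate={cramer:.5f}   entropy={closed:.5f}")
```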

3.2 Some Results on the Euclidean Distance to a Closed Set

Let S be a nonempty closed set in ℝⁿ and define the function f := ½‖·‖² + I_S:

f(x) := { ½‖x‖²  if x ∈ S,
          +∞     if not.    (3.2.1)

Clearly, f satisfies (1.1.1), and is lower semi-continuous and 1-coercive; it is convex if and only if S is convex; its conjugate turns out to be of special interest. By definition,

f*(s) = sup {⟨s, x⟩ − ½‖x‖² : x ∈ S},

which is the function φ_S of Example IV.2.1.4:

f*(s) = ½‖s‖² − ½ d_S²(s).    (3.2.2)

From Theorem 1.5.4, cl co f = co f and, for each x ∈ co S, there are x₁, ..., x_{n+1} in S and convex multipliers α₁, ..., α_{n+1} such that

x = Σ_{j=1}^{n+1} α_j x_j and (co f)(x) = ½ Σ_{j=1}^{n+1} α_j ‖x_j‖².

The (nonconvex) function f of (3.2.1) has a subdifferential in the sense of (1.4.1), which is characterized via the projection operation onto S:

P_S(x) := {y ∈ S : d_S(x) = ‖y − x‖}

(a compact-valued multifunction, having the whole space as domain).

Proposition 3.2.1 Let x ∈ S. With f given by (3.2.1),

(i) s is a subgradient of f at x if and only if its projection onto S contains x; in other words,

∂f(x) = {s ∈ ℝⁿ : x ∈ P_S(s)} = {s ∈ ℝⁿ : d_S(s) = ‖s − x‖};

in particular, x ∈ ∂f(x);
(ii) (co f)(x) = f(x) = ½‖x‖² and ∂(co f)(x) = ∂f(x).

PROOF. From the expression (3.2.2) of f*, s ∈ ∂f(x) if and only if

½‖x‖² + I_S(x) + ½‖s‖² − ½ d_S²(s) = ⟨s, x⟩.

Using I_S(x) = 0, this means d_S(s) = ‖s − x‖, i.e. x ∈ P_S(s). In particular, s = x is in ∂f(x), just because P_S(x) = {x}. Hence, ∂f(x) is nonempty and (ii) is a consequence of (1.4.3), (1.4.4). □

Having thus characterized ∂f, we can compute ∂(co f) using Theorem 1.5.6, and this in turn allows the computation of ∂f*:

Proposition 3.2.2 For all s ∈ ℝⁿ,

∂f*(s) = co P_S(s).    (3.2.3)

PROOF. Because co f is closed, x ∈ ∂f*(s) ⟺ s ∈ ∂(co f)(x), which in turn is expressed by (1.5.4): there are x₁, ..., x_{n+1} in S and convex multipliers α₁, ..., α_{n+1} such that

x = Σ_{j=1}^{n+1} α_j x_j and s ∈ ∩{∂f(x_j) : α_j > 0}.

It then suffices to apply Proposition 3.2.1(i): s ∈ ∂f(x_j) ⟺ x_j ∈ P_S(s) for j = 1, ..., n + 1. □

This result is interesting in itself. For example, we deduce that the projection onto a closed set is almost everywhere a singleton, just because ∇f* exists almost everywhere. When S is convex, P_S(s) is for any s a singleton {p_S(s)}, which thus appears as the gradient of f* at s:

∇(½ d_S²) = I − p_S if S is convex.

Another interesting result is that the converse to this property is true: if d_S² (or equivalently f*) is differentiable, then S is convex, and we obtain a convexity criterion:

Theorem 3.2.3 For a nonempty closed set S, the following statements are equivalent:
(i) S is convex;
(ii) the projection operation P_S is single-valued on ℝⁿ;
(iii) the squared distance d_S² is differentiable on ℝⁿ;
(iv) the function f = I_S + ½‖·‖² is convex.

PROOF. We know that (i) ⇒ (ii); when (ii) holds, (3.2.3) tells us that the function f* = ½(‖·‖² − d_S²) is differentiable, i.e. (iii) holds. When f in (iv) is convex, its domain S is convex. There remains to prove (iii) ⇒ (iv).

Thus, assuming (iii), we want to prove that f = co f (= cl co f). Take first x ∈ ri dom co f, so that there is an s ∈ ∂(co f)(x) (Theorem 1.4.2). In view of (1.5.4), there are x_j ∈ S and positive convex multipliers α_j such that

x = Σ_j α_j x_j, (co f)(x) = Σ_j α_j f(x_j), s ∈ ∩_j ∂f(x_j).

The last property implies that each x_j is the unique element of ∂f*(s): they are all equal and it follows (co f)(x) = f(x).

Now use Proposition IV.1.2.5, expressing co f outside the relative interior of its domain: for x ∈ dom co f, there is a sequence {x_k} ⊂ ri dom co f tending to x such that

f(x) ≥ (co f)(x) = lim_{k→+∞} (co f)(x_k) = lim f(x_k) ≥ f(x),

where the last inequality comes from the lower semi-continuity of f. □
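The failure of (iii) for a nonconvex S is easy to observe numerically. The sketch below (our own illustration, with assumed data) takes the two-point set S = {−1, +1} in ℝ: the projection is multivalued at 0 and the squared distance has a kink there, consistently with Theorem 3.2.3.

```python
import numpy as np

# Theorem 3.2.3 for the (nonconvex) set S = {-1, +1} in R: d_S^2 has a
# kink at 0, where P_S(0) = {-1, +1} is multivalued.
d_sq = lambda x: np.minimum((x - 1.0)**2, (x + 1.0)**2)

h = 1e-6
right = (d_sq(h) - d_sq(0.0)) / h        # one-sided slopes of d_S^2 at 0
left = (d_sq(0.0) - d_sq(-h)) / h
print(f"slope from the right: {right:.4f}, from the left: {left:.4f}")
# approximately -2 and +2: d_S^2 is not differentiable at 0, matching
# the equivalence (i) <=> (iii) of Theorem 3.2.3
```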

3.3 The Conjugate of Convex Partially Quadratic Functions

A simple case in §3.2 is one in which S is a subspace: we essentially obtain a quadratic function "restricted" to a subspace, resembling the f* obtained in Example 1.1.4. Thus, given a linear symmetric positive semi-definite operator B and a subspace H, it is of interest to compute the conjugate of

g(x) := { ½⟨Bx, x⟩  if x ∈ H,
          +∞        otherwise.    (3.3.1)

Such a function (closed, convex, homogeneous of degree 2) is said to be partially quadratic. Its conjugate turns out to have just the same form, so the set of convex partially quadratic functions is stable with respect to the conjugacy operation (cf. Example 1.1.4).

Proposition 3.3.1 The function g of (3.3.1) has the conjugate

g*(s) = { ½⟨s, (P_H ∘ B ∘ P_H)⁻ s⟩  if s ∈ Im B + H^⊥,
          +∞                        otherwise,    (3.3.2)

where P_H is the operator of orthogonal projection onto H and (·)⁻ is the Moore-Penrose pseudo-inverse.

PROOF. Writing g as f + I_H with f := ½⟨B·, ·⟩, use Proposition 1.3.2: g* = (f ∘ P_H)* ∘ P_H; knowing that

(f ∘ P_H)(x) = ½⟨P_H ∘ B ∘ P_H x, x⟩,

we obtain from Example 1.1.4 g*(s) under the form

(f ∘ P_H)*(P_H s) = { ½⟨s, (P_H ∘ B ∘ P_H)⁻ s⟩  if P_H s ∈ Im(P_H ∘ B ∘ P_H),
                      +∞                        otherwise.

It could be checked directly that Im(P_H ∘ B ∘ P_H) + H^⊥ = Im B + H^⊥. A simpler argument, however, is obtained via Theorem 2.3.2, which can be applied since dom f = ℝⁿ. Thus, with ∔ denoting the inf-convolution,

g*(s) = (f* ∔ (I_H)*)(s) = min {½⟨p, B⁻p⟩ : p ∈ Im B, s − p ∈ H^⊥},

which shows that dom g* = Im B + H^⊥. □

For example, suppose H = Im B. Calling A := B⁻ the pseudo-inverse of B (hence A⁻ = (B⁻)⁻ = B), we see that g of (3.3.1) is just f* of Example 1.1.4 with b = 0. Since Im B + H^⊥ = ℝⁿ, and using the relations P_H ∘ B = B ∘ P_H = B, we obtain finally that g* is the f of (1.1.4) - and this is perfectly normal: g* = f** = f.

As another example, take the identity for B; then Im B = ℝⁿ and Proposition 3.3.1 gives immediately

g*(s) = ½⟨s, P_H s⟩ = ½‖P_H(s)‖² for all s ∈ ℝⁿ.

This last example fits exactly into the framework of §3.2, and (3.2.2) becomes

‖s‖² = ‖P_H(s)‖² + d_H²(s),

a classical relation known as Pythagoras' Theorem.
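Proposition 3.3.1 lends itself to a quick numerical check. The sketch below is our own test rig (all data randomly generated, with B made safely positive definite), comparing the pseudo-inverse formula with a brute-force maximization over H.

```python
import numpy as np

# Numerical sanity check of Proposition 3.3.1 (our own illustration):
# g(x) = (1/2)<Bx,x> + I_H(x) with a random positive definite B and a
# random 2-dimensional subspace H of R^4.
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4)); B = M @ M.T + np.eye(4)  # symmetric positive definite
V = np.linalg.qr(rng.normal(size=(4, 2)))[0]          # orthonormal basis of H
P = V @ V.T                                           # orthogonal projection onto H

s = P @ rng.normal(size=4)                            # test point (taken in H)
formula = 0.5 * s @ np.linalg.pinv(P @ B @ P) @ s     # (1/2)<s,(P_H B P_H)^- s>

# brute force: x = V t with t in R^2, maximize <s,x> - (1/2)<Bx,x> on a grid
t1, t2 = np.meshgrid(np.linspace(-20, 20, 801), np.linspace(-20, 20, 801))
X = np.stack([t1.ravel(), t2.ravel()])                # coordinates in the basis V
x = V @ X
direct = np.max(s @ x - 0.5 * np.sum(x * (B @ x), axis=0))
print(f"pinv formula: {formula:.4f}   direct sup: {direct:.4f}")
# the two values agree up to the grid resolution
```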

3.4 Polyhedral Functions

For given (s_i, b_i) ∈ ℝⁿ × ℝ, i = 1, ..., k, set

f_i(x) := ⟨s_i, x⟩ − b_i for i = 1, ..., k

and define the piecewise affine function

ℝⁿ ∋ x ↦ f(x) := max {f_i(x) : i = 1, ..., k}.    (3.4.1)

Proposition 3.4.1 At each s ∈ co{s₁, ..., s_k} = dom f*, the conjugate of f has the value (Δ_k is the unit simplex)

f*(s) = min {Σ_{i=1}^k α_i b_i : α ∈ Δ_k, Σ_{i=1}^k α_i s_i = s}.    (3.4.2)

PROOF. Theorem 2.4.7 is directly applicable: since f_i* = I_{{s_i}} + b_i, the variables s_j in the minimization problem (2.4.6) become the given s_i. This problem simplifies to (3.4.2), which at the same time reveals dom f*. □

In ℝⁿ × ℝ (considered here as the dual of the graph-space), each of the k vertical half-lines {(s_i, r) : r ≥ b_i}, i = 1, ..., k is the epigraph of f_i*, and these half-lines are the "needles" of Fig. IV.2.5.1. Now, the convex hull of these half-lines is a closed convex polyhedron, which is just epi f*. Needless to say, (3.4.1) is obtained by conjugating again (3.4.2). This double operation has an interesting application in the design of minimization algorithms:

Example 3.4.2 Suppose that the only available information about a function f ∈ Conv ℝⁿ is a finite sampling of function- and subgradient-values, say

f(x_i) and s_i ∈ ∂f(x_i) for i = 1, ..., k.

By convexity, we therefore know that φ₁ ≤ f ≤ φ₂, where

φ₁(y) := max {f(x_i) + ⟨s_i, y − x_i⟩ : i = 1, ..., k},
φ₂ := cl co (min_i g_i) with g_i(y) := I_{{x_i}}(y) + f(x_i) for i = 1, ..., k.

The resulting bracket on f is drawn in the left part of Fig. 3.4.1. It has a counterpart in the dual space: φ₂* ≤ f* ≤ φ₁*, illustrated in the right part of Fig. 3.4.1, where

φ₁*(s) = cl co [min_i {I_{{s_i}}(s) + ⟨s_i, x_i⟩ − f(x_i)}],
φ₂*(s) = max {⟨s, x_i⟩ − f(x_i) : i = 1, ..., k}.

Note: these formulae can be made more symmetric, with the help of the relations ⟨s_i, x_i⟩ − f(x_i) = f*(s_i).

Fig. 3.4.1. Sandwiching a function and its conjugate

In the framework of optimization, the relation φ₁ ≤ f is well-known; it will be studied in more detail in Chapters XII and XV. It is certainly richer than the relation f ≤ φ₂, which ignores all the first-order information contained in the subgradients s_i. The interesting point is that the situation is reversed in the dual space: the "cutting-plane" approximation of f* (namely φ₂*) is obtained from the poor primal approximation φ₂, and vice versa. □
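The dual sandwich can be visualized numerically. The sketch below is our own illustration (sample points and grids are ours): for f(x) = ½x² on ℝ, whose conjugate is f*(s) = ½s², it builds the cutting-plane model φ₁ in the primal and checks φ₂* ≤ f* ≤ φ₁* at a few dual points.

```python
import numpy as np

# Example 3.4.2 for f(x) = x^2/2 (so f* = s^2/2), sampled at a few points:
# phi1 is the primal cutting-plane model, phi2* the cutting-plane model of
# f* built from the poor primal approximation phi2.
xi = np.array([-2.0, -0.5, 1.0, 2.5])            # sample points
fi, si = 0.5 * xi**2, xi                         # f(x_i) and gradients f'(x_i)

y = np.linspace(-30, 30, 120001)
phi1 = np.max(fi[:, None] + si[:, None] * (y - xi[:, None]), axis=0)

for s in [-1.0, 0.3, 1.7]:                       # s inside co{s_i}
    f_star = 0.5 * s**2
    phi1_star = np.max(s * y - phi1)             # conjugate of phi1, by definition
    phi2_star = np.max(s * xi - fi)              # max_i [s x_i - f(x_i)]
    print(f"s={s:+.1f}:  {phi2_star:.4f} <= {f_star:.4f} <= {phi1_star:.4f}")
```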

Because the f of (3.4.1) increases at infinity no faster than linearly, its conjugate f* has a bounded domain: it is not piecewise affine, but rather polyhedral. A natural idea is then to develop a full calculus for polyhedral functions, just as was done in §3.3 for the quadratic case.

A polyhedral function has the general form g = f + I_P, where P is a closed convex polyhedron and f is defined by (3.4.1). The conjugate of g is given by Theorem 2.3.2: g* = f* ∔ σ_P (inf-convolution), i.e.

g*(s) = min_{α∈Δ_k} [Σ_{j=1}^k α_j b_j + σ_P(s − Σ_{j=1}^k α_j s_j)].    (3.4.3)

This formula may take different forms, depending on the particular description of P. When

P = co {p₁, ..., p_m}

is described as a convex hull, σ_P is the maximum of just as many linear functions ⟨p_j, ·⟩ and (3.4.3) becomes an explicit formula. Dual descriptions of P are more frequent in applications, though.

Example 3.4.3 With the notations above, suppose that P is described as an intersection (assumed nonempty) of half-spaces:

H_j := {x ∈ ℝⁿ : ⟨c_j, x⟩ ≤ d_j} for j = 1, ..., m

and P := ∩H_j = {x ∈ ℝⁿ : Cx ≤ d}.

It is convenient to introduce the following notation: ℝᵏ and ℝᵐ are equipped with their standard dot-products; A [resp. C] is the linear operator which, to x ∈ ℝⁿ, associates the vector whose coordinates are ⟨s_i, x⟩ in ℝᵏ [resp. ⟨c_j, x⟩ in ℝᵐ]. Then it is not difficult to compute the adjoints of A and C: for α ∈ ℝᵏ and y ∈ ℝᵐ,

A*α = Σ_{i=1}^k α_i s_i,  C*y = Σ_{j=1}^m y_j c_j.

Then we want to compute the conjugate of the polyhedral function

g(x) := { max_{i=1,...,k} ⟨s_i, x⟩ − b_i  if Cx ≤ d,
          +∞                             if not,    (3.4.4)

and (3.4.3) becomes

g*(s) = min_{α∈Δ_k} [bᵀα + σ_P(s − A*α)].

To compute σ_P, we have its conjugate I_P as the sum of the indicators of the H_j's; using Theorem 2.3.2 (with the qualification assumption (2.3.Q.j), since all the functions involved are polyhedral),

σ_P(p) = min {Σ_{j=1}^m σ_{H_j}(s_j) : Σ_{j=1}^m s_j = p}.

Using Example V.3.4.4 for the support function of H_j, it comes

σ_P(p) = min {dᵀy : y ≥ 0, C*y = p}.

Piecing together, we finally obtain g*(s) as the optimal value of the problem in the pair of variables (α, y):

g*(s) = min {bᵀα + dᵀy : α ∈ Δ_k, y ≥ 0, A*α + C*y = s}.    (3.4.5)

Observe the nice image-function exhibited by the last constraint, characterized by the linear operator [A* C*]. The only difference between the variables α (involving the piecewise affine f) and y (involving the constraints in the primal space) is that the latter do not have to sum up to 1; they have no a priori bound. Also, note that the constraint of (3.4.4) can be more generally expressed as Cx − d ∈ K (a closed convex polyhedral cone), which induces in (3.4.5) the constraint y ∈ K° (another closed convex polyhedral cone, standing for the nonnegative orthant).

Finally, g* is a closed convex function. Therefore, there is some s ∈ ℝⁿ such that the feasible domain of (3.4.5) is nonempty, and the minimal value is never −∞. These properties hold provided that g in (3.4.4) satisfies (1.1.1), i.e. that the domain {x : Cx ≤ d} is nonempty. □
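Formula (3.4.5) is an ordinary linear program, so it can be handed to any LP solver. The sketch below is our own illustration (all data are ours), using scipy.optimize.linprog and double-checking the value by a direct grid maximization of ⟨s, ·⟩ − f over P.

```python
import numpy as np
from scipy.optimize import linprog

# Formula (3.4.5) as a linear program, on a small example in R^2 (our data):
# g(x) = max_i(<s_i,x> - b_i) + I_P(x) with P = {x : Cx <= d}.
S = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # rows s_i  (k = 3)
b = np.array([0.0, 0.5, 1.0])
C = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])   # rows c_j  (m = 3)
d = np.array([2.0, 2.0, 2.0])
s = np.array([0.7, -0.2])                              # where to evaluate g*

# variables (alpha, y): minimize b^T alpha + d^T y
# s.t. A* alpha + C* y = s, sum(alpha) = 1, alpha >= 0, y >= 0
k, m = len(b), len(d)
c_obj = np.concatenate([b, d])
A_eq = np.vstack([np.hstack([S.T, C.T]),
                  np.hstack([np.ones((1, k)), np.zeros((1, m))])])
b_eq = np.concatenate([s, [1.0]])
res = linprog(c_obj, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (k + m))

# direct evaluation: g*(s) = sup over P of <s,x> - max_i(<s_i,x> - b_i)
u = np.linspace(-2.0, 4.0, 601)                        # P fits in this box
X, Y = np.meshgrid(u, u)
feas = (X + Y <= 2) & (-X <= 2) & (-Y <= 2)
vals = s[0] * X + s[1] * Y - np.max([X - 0.0, Y - 0.5, -X - Y - 1.0], axis=0)
print(f"LP value: {res.fun:.4f}   direct sup (grid): {vals[feas].max():.4f}")
```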

4 Differentiability of a Conjugate Function

For f ∈ Conv ℝⁿ, we know from Corollary 1.4.4 that

s ∈ ∂f(x) if and only if x ∈ ∂f*(s);

and this actually was our very motivation for defining f*, see the introduction to this chapter. Geometrically, the graphs of ∂f and of ∂f* in ℝⁿ × ℝⁿ are images of each other under the mapping (x, s) ↦ (s, x). Knowing that a convex function is differentiable when its subdifferential is a singleton, smoothness properties of f* correspond to monotonicity properties of ∂f, in the sense of §VI.6.1.

4.1 First-Order Differentiability

Theorem 4.1.1 Let f ∈ Conv ℝⁿ be strictly convex. Then int dom f* ≠ ∅ and f* is continuously differentiable on int dom f*.

PROOF. For arbitrary x₀ ∈ dom f and nonzero d ∈ ℝⁿ, consider Example 2.4.6. Strict convexity of f implies that

0 < [f(x₀ − td) − f(x₀)]/t + [f(x₀ + td) − f(x₀)]/t for all t > 0,

and this inequality extends to the suprema:

0 < f'_∞(−d) + f'_∞(d).

Remembering that f'_∞ = σ_{dom f*} (Proposition 1.2.2), this means

σ_{dom f*}(−d) + σ_{dom f*}(d) > 0 for all d ≠ 0,

i.e. dom f* has a positive breadth in every nonzero direction d: its interior is nonempty - Theorem V.2.2.3(iii).

Now, suppose that there is some s ∈ int dom f* such that ∂f*(s) contains two distinct points x₁ and x₂. Then s ∈ ∂f(x₁) ∩ ∂f(x₂) and, by convex combination of the relations

f*(s) + f(x_i) = ⟨s, x_i⟩ for i = 1, 2,

we deduce, using Fenchel's inequality (1.1.3):

f*(s) + Σ_{i=1}^2 α_i f(x_i) = ⟨s, Σ_{i=1}^2 α_i x_i⟩ ≤ f*(s) + f(Σ_{i=1}^2 α_i x_i),

which implies that f is affine on [x₁, x₂], a contradiction. In other words, ∂f* is single-valued on int dom f*, and this means f* is continuously differentiable there. □
For an illustration, consider

f(x) := √(1 + ‖x‖²) for x ∈ ℝⁿ,

whose conjugate is

ℝⁿ ∋ s ↦ f*(s) = { −√(1 − ‖s‖²)  if ‖s‖ ≤ 1,
                   +∞            otherwise.

Here, f is strictly convex (compute ∇²f to check this), dom f* is the unit ball, on the interior of which f* is differentiable, but ∂f* is empty on the boundary.

Incidentally, observe also that f* is strictly convex; as a result, f is differentiable. Such is not the case in our next example: with n = 1, take

f(x) := |x| + ½x² = max {½x² + x, ½x² − x}.    (4.1.1)

Use direct calculations or calculus rules from §2 to compute

f*(s) = { ½(s + 1)²  for s ≤ −1,
          0          for −1 ≤ s ≤ 1,
          ½(s − 1)²  for s ≥ 1.

This example is illustrated by the instructive Fig. 4.1.1. Its left part displays gr f, made up of two parabolas; f*(s) is obtained by leaning onto the relevant parabola a straight line with slope s. The right part illustrates Theorem 2.4.4: it displays ½s² + s and ½s² − s, the conjugates of the two functions making up f; epi f* is then the convex hull of the union of their epigraphs.
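The piecewise formula above is easy to confirm by brute force; the sketch below is our own numerical check.

```python
import numpy as np

# Numerical check of the conjugate of f(x) = |x| + x^2/2 given in (4.1.1).
x = np.linspace(-50, 50, 500001)
f = np.abs(x) + 0.5 * x**2

def f_star_formula(s):
    if s <= -1: return 0.5 * (s + 1)**2
    if s >= 1:  return 0.5 * (s - 1)**2
    return 0.0

for s in [-3.0, -0.4, 0.0, 2.5]:
    print(f"s={s:+.1f}  grid sup={np.max(s * x - f):.5f}  "
          f"formula={f_star_formula(s):.5f}")
```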

Fig. 4.1.1. Strict convexity corresponds to differentiability of the conjugate

Example 4.1.2 We have studied in §IV.1.3(f) the volume of an ellipsoid: on the Euclidean space (S_n(ℝ), ⟨⟨·,·⟩⟩) of symmetric matrices with the standard scalar product of ℝ^{n×n}, the function

f(A) := { −log(det A) − n/2  if A is positive definite,
          +∞                 otherwise

is convex. It is also differentiable on its domain, with gradient ∇f(A) = −A⁻¹ (cf. §A.4.3). Its conjugate can then be computed using the Legendre transform (Remark 0.1):

f*(B) := { −log(det(−B)) − n/2  if B is negative definite,
           +∞                   otherwise.

Thus f* is again a differentiable convex function; actually, f is strictly convex. This property, by no means trivial, can be seen either from the Hessian operator ∇²f(A) = A⁻¹ ⊗ A⁻¹, i.e.

∂²f/(∂a_ij ∂a_kl)(A) = (A⁻¹)_{ik}(A⁻¹)_{jl},

or from Theorem IV.4.1.4: ∇f is strictly monotone because

⟨⟨∇f(A) − ∇f(B), A − B⟩⟩ = ⟨⟨B⁻¹ − A⁻¹, A − B⟩⟩ > 0

for all symmetric positive definite matrices A ≠ B. □


Actually, the strict convexity of f* in Example 4.1.2 can be directly established in our present framework. In fact, Theorem 4.1.1 gives a sufficient condition for smoothness of a closed convex function (namely f*). Apart from possible side-effects on the boundary, this condition is also necessary:

Theorem 4.1.3 Let f ∈ Conv ℝⁿ be differentiable on the set Ω := int dom f. Then f* is strictly convex on each convex subset C ⊂ ∇f(Ω).

PROOF. Let C be a convex set as stated. Suppose that there are two distinct points s₁ and s₂ in C such that f* is affine on the line-segment [s₁, s₂]. Then, setting s := ½(s₁ + s₂) ∈ C ⊂ ∇f(Ω), there is x ∈ Ω such that ∇f(x) = s, i.e. x ∈ ∂f*(s). Using the affine character of f*, we have

0 = f(x) + f*(s) − ⟨s, x⟩ = ½ Σ_{j=1}^2 [f(x) + f*(s_j) − ⟨s_j, x⟩]

and, in view of Fenchel's inequality (1.1.3), this implies that each term in the bracket is 0: x ∈ ∂f*(s₁) ∩ ∂f*(s₂), i.e. ∂f(x) contains the two points s₁ and s₂, a contradiction to the existence of ∇f(x). □

The strict convexity of f* cannot in general be extended outside ∇f(Ω); and be aware that this set may be substantially smaller than one might initially guess: the function f₅* in Example 1.6.2.4(e) is finite everywhere, but not strictly convex.

We gather the results of this section, with a characterization of the "ideal situation", in which the Legendre transform alluded to in the introduction is well-defined.

Corollary 4.1.4 Let f : ℝⁿ → ℝ be strictly convex, differentiable and 1-coercive. Then
(i) f* is likewise finite-valued on ℝⁿ, strictly convex, differentiable and 1-coercive;
(ii) the continuous mapping ∇f is one-to-one from ℝⁿ onto ℝⁿ, and its inverse is continuous;
(iii) f*(s) = ⟨s, (∇f)⁻¹(s)⟩ − f((∇f)⁻¹(s)) for all s ∈ ℝⁿ. □

The simplest such situation occurs when f is a strictly convex quadratic function, as in Example 1.1.3, corresponding to an affine Legendre transform. Another example is f(x) := exp(½‖x‖²).
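The Legendre formula of (iii) is directly computable whenever ∇f can be inverted in closed form. A minimal sketch of our own, with f(x) = x⁴/4 on ℝ (strictly convex, differentiable, 1-coercive): here ∇f(x) = x³, and the formula should reproduce the known conjugate f*(s) = ¾|s|^{4/3}.

```python
import numpy as np

# Corollary 4.1.4(iii) for f(x) = x^4/4 on R: grad f(x) = x^3, so
# (grad f)^{-1}(s) = sign(s)|s|^{1/3}, and the Legendre formula should
# reproduce the closed form f*(s) = (3/4)|s|^{4/3}.
inv_grad = lambda s: np.sign(s) * np.abs(s)**(1.0 / 3.0)

for s in [-2.0, 0.5, 3.0]:
    x = inv_grad(s)                    # the unique solution of grad f(x) = s
    legendre = s * x - 0.25 * x**4     # <s,(grad f)^{-1}(s)> - f((grad f)^{-1}(s))
    closed = 0.75 * np.abs(s)**(4.0 / 3.0)
    print(f"s={s:+.1f}  Legendre={legendre:.5f}  closed form={closed:.5f}")
```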

4.2 Towards Second-Order Differentiability

Along the same lines as in §4.1, better than strict convexity of f and better than differentiability of f* correspond to each other. We start with the connection between strong convexity of f and Lipschitz continuity of ∇f*.

(a) Lipschitz Continuity of the Gradient Mapping. The next result is stated for a finite-valued f, mainly because the functions considered in Chap. VI were such; but this assumption is actually useless.

Theorem 4.2.1 Assume that f : ℝⁿ → ℝ is strongly convex with modulus c > 0 on ℝⁿ: for all (x₁, x₂) ∈ ℝⁿ × ℝⁿ and α ∈ ]0, 1[,

f(αx₁ + (1−α)x₂) ≤ αf(x₁) + (1−α)f(x₂) − ½cα(1−α)‖x₁ − x₂‖².    (4.2.1)

Then dom f* = ℝⁿ and ∇f* is Lipschitzian with constant 1/c on ℝⁿ:

‖∇f*(s₁) − ∇f*(s₂)‖ ≤ (1/c)‖s₁ − s₂‖ for all (s₁, s₂) ∈ ℝⁿ × ℝⁿ.

PROOF. We use the various equivalent definitions of strong convexity (see Theorem VI.6.1.2). Fix x₀ and s₀ ∈ ∂f(x₀): for all 0 ≠ d ∈ ℝⁿ and t ≥ 0,

f(x₀ + td) ≥ f(x₀) + t⟨s₀, d⟩ + ½ct²‖d‖²;

hence f'_∞(d) = σ_{dom f*}(d) = +∞, i.e. dom f* = ℝⁿ. Also, f is in particular strictly convex, so we know from Theorem 4.1.1 that f* is differentiable (on ℝⁿ). Finally, the strong convexity of f can also be written

⟨s₁ − s₂, x₁ − x₂⟩ ≥ c‖x₁ − x₂‖²,

in which we have s_i ∈ ∂f(x_i), i.e. x_i = ∇f*(s_i), for i = 1, 2. The rest follows from the Cauchy-Schwarz inequality. □

This result is quite parallel to Theorem 4.1.1: improving the convexity of f from "strict" to "strong" amounts to improving ∇f* from "continuous" to "Lipschitzian". The analogy can even be extended to Theorem 4.1.3:

Theorem 4.2.2 Let f : ℝⁿ → ℝ be convex and have a gradient mapping Lipschitzian with constant L > 0 on ℝⁿ: for all (x₁, x₂) ∈ ℝⁿ × ℝⁿ,

‖∇f(x₁) − ∇f(x₂)‖ ≤ L‖x₁ − x₂‖.

Then f* is strongly convex with modulus 1/L on each convex subset C ⊂ dom ∂f*. In particular, there holds for all (x₁, x₂) ∈ ℝⁿ × ℝⁿ

⟨∇f(x₁) − ∇f(x₂), x₁ − x₂⟩ ≥ (1/L)‖∇f(x₁) − ∇f(x₂)‖².    (4.2.2)

PROOF. Let s₁ and s₂ be arbitrary in dom ∂f* ⊂ dom f*; take s and s' on the segment [s₁, s₂]. To establish the strong convexity of f*, we need to minorize the remainder term f*(s') − f*(s) − ⟨x, s' − s⟩, with x ∈ ∂f*(s). For this, we minorize f*(s') = sup_y [⟨s', y⟩ − f(y)], i.e. we majorize f(y):

f(y) = f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt
     ≤ f(x) + ⟨∇f(x), y − x⟩ + ½L‖y − x‖²
     = −f*(s) + ⟨s, y⟩ + ½L‖y − x‖²

(we have used the property ∫₀¹ t dt = ½, as well as x ∈ ∂f*(s), i.e. ∇f(x) = s). In summary, we have

f*(s') ≥ f*(s) + sup_y [⟨s' − s, y⟩ − ½L‖y − x‖²].

Observe that the last supremum is nothing but the value at s' − s of the conjugate of ½L‖· − x‖². Using the calculus rule 1.3.1, we have therefore proved

f*(s') ≥ f*(s) + ⟨s' − s, x⟩ + (1/2L)‖s' − s‖²    (4.2.3)

for all s, s' in [s₁, s₂] and all x ∈ ∂f*(s). Replacing s' in (4.2.3) by s₁ and by s₂, and setting s = αs₁ + (1−α)s₂, the strong convexity (4.2.1) for f* is established by convex combination.

On the other hand, replacing (s, s') by (s₁, s₂) in (4.2.3):

f*(s₂) ≥ f*(s₁) + ⟨s₂ − s₁, x₁⟩ + (1/2L)‖s₂ − s₁‖² for all x₁ ∈ ∂f*(s₁).

Then, replacing (s, s') by (s₂, s₁) and summing:

⟨x₁ − x₂, s₁ − s₂⟩ ≥ (1/L)‖s₁ − s₂‖².

In view of the differentiability of f, this is just (4.2.2), which has to hold for all (x₁, x₂) simply because Im ∂f* = dom ∇f = ℝⁿ. □

Remark 4.2.3 For a convex function, the Lipschitz property of the gradient mapping thus appears as equivalently characterized by (4.2.2); a result which is of interest in itself. As an application, let us return to §3.2: for a convex set C, the convex function ½(‖·‖² − d_C²) has gradient p_C, a nonexpansive mapping. Therefore

‖p_C(x) − p_C(y)‖² ≤ ⟨p_C(x) − p_C(y), x − y⟩

for all x and y in ℝⁿ. Likewise, ∇(½ d_C²) = I − p_C is also nonexpansive:

‖(I − p_C)(x) − (I − p_C)(y)‖² ≤ ⟨(I − p_C)(x) − (I − p_C)(y), x − y⟩. □
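Both inequalities can be sampled numerically; the sketch below is our own experiment on the unit ball of ℝ³, where the projection is p_C(x) = x / max(1, ‖x‖).

```python
import numpy as np

# Remark 4.2.3 tested on C = unit ball of R^3: both p_C and I - p_C
# should satisfy  ||T(x)-T(y)||^2 <= <T(x)-T(y), x-y>  (firm nonexpansiveness).
rng = np.random.default_rng(2)
p_C = lambda x: x / max(1.0, np.linalg.norm(x))

worst = np.inf
for _ in range(10000):
    x, y = rng.normal(scale=2.0, size=(2, 3))
    for T in (p_C, lambda z: z - p_C(z)):
        u = T(x) - T(y)
        worst = min(worst, np.dot(u, x - y) - np.dot(u, u))
print(f"smallest slack of the inequality over 10000 trials: {worst:.3e}")
# nonnegative (up to rounding), as predicted
```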

Naturally, Corollary 4.1.4 has also its equivalent, namely: if f is strongly convex
and has a Lipschitzian gradient-mapping on ]Rn, then f* enjoys the same properties.
These properties do not leave much room, though: f (and f*) must be "sandwiched"
between two positive definite quadratic functions.

(b) Second-Order Approximations. Some Lipschitz continuity of the gradient is of course necessary for second-order differentiability, but is certainly not sufficient: (4.1.1) is strongly convex but its conjugate is not C². More must therefore be assumed to ensure the existence of ∇²f*, and we recall from §1.6.2 that finding minimal assumptions is a fairly complex problem.

In the spirit of the "unilateral" and "conical" character of convex analysis, we proceed as follows. First, a function φ ∈ Conv ℝⁿ is said to be directionally quadratic if it is positively homogeneous of degree 2:

φ(tx) = t²φ(x) for all x ∈ ℝⁿ and t > 0.

If φ is directionally quadratic, immediate consequences of the definition are:
- φ(0) = 0 (see the beginning of §V.1.1);
- φ* is directionally quadratic as well (easy to check);
- hence φ*(0) = 0, which in turn implies that φ is minimal at 0: a directionally quadratic function is nonnegative;
- as a result (see Example V.1.2.7), √φ is the support function of a closed convex set containing 0.

Of particular importance are the directionally quadratic functions which are finite everywhere, and/or positive (except at 0); these two properties are dual to each other:

Lemma 4.2.4 For a directionally quadratic function φ, the following properties are equivalent:
(i) φ is finite everywhere;
(ii) ∇φ(0) exists (and is equal to 0);
(iii) there is C ≥ 0 such that φ(x) ≤ ½C‖x‖² for all x ∈ ℝⁿ;
(iv) there is c > 0 such that φ*(s) ≥ ½c‖s‖² for all s ∈ ℝⁿ;
(v) φ*(s) > 0 for all s ≠ 0.

PROOF. [(i) ⇔ (ii) ⇒ (iii)] When (i) holds, φ is continuous; call ½C ≥ 0 its maximal value on the unit sphere: (iii) holds by positive homogeneity. Furthermore, compute difference quotients to observe that φ'(0, ·) ≡ 0, and (ii) follows from the differentiability criterion IV.4.2.1. Conversely, existence of ∇φ(0) implies finiteness of φ in a neighborhood of 0 and, by positive homogeneity, on the whole space.

[(iii) ⇒ (iv) ⇒ (v)] When (iii) holds, (iv) comes from the calculus rule 1.3.1(vii), with for example 0 < c := 1/C (or rather 1/(C + 1), to take care of the case C = 0); this implies (v) trivially.

[(v) ⇒ (i)] The lower semi-continuous function φ* is positive on the unit sphere, and has a minimal value c > 0 there. Being positively homogeneous of degree 2, φ* is then 1-coercive; finiteness of its conjugate φ follows from Proposition 1.3.8. □

Definition 4.2.5 We say that the directionally quadratic function φ_s defines a minorization to second order of f ∈ Conv ℝⁿ at x₀ ∈ dom f, and associated with s ∈ ∂f(x₀), when

f(x₀ + h) ≥ f(x₀) + ⟨s, h⟩ + φ_s(h) + o(‖h‖²).    (4.2.4)

We say that φ_s defines likewise a majorization to second order when

f(x₀ + h) ≤ f(x₀) + ⟨s, h⟩ + φ_s(h) + o(‖h‖²).    (4.2.5)
□

Note in passing that, because φ_s ≥ 0, (4.2.4) could not hold if s were not a subgradient of f at x₀. Whenever x₀ ∈ dom ∂f, the zero function and I_{{0}} define trivial minorizations and majorizations to second order respectively, valid for all s ∈ ∂f(x₀). This observation motivates the particular class of directionally quadratic functions introduced in Lemma 4.2.4.

We proceed to show that the correspondences established in Lemma 4.2.4 are somehow conserved when remainder terms are introduced, as in Definition 4.2.5. First observe that, if (4.2.5) holds with φ_s finite everywhere, then ∇f(x₀) exists (and is equal to s). The next two results concern the equivalence (iii) ⇔ (iv).

Proposition 4.2.6 With the notation of Definition 4.2.5, suppose that there is a directionally quadratic function φ_s satisfying, for some c > 0,

φ_s(h) ≥ ½c‖h‖² for all h ∈ ℝⁿ,    (4.2.6)

and defining a minorization to second order of f ∈ Conv ℝⁿ, associated with s ∈ ∂f(x₀). Then φ_s* defines a majorization of f* at s, associated with x₀. This implies in particular ∇f*(s) = x₀.

PROOF. [Preamble] First note that φ_s* is finite everywhere; as already mentioned after Definition 4.2.5, differentiability of f* will follow from the majorization property of f*.

We take the tilted function

ℝⁿ ∋ h ↦ g(h) := f(x₀ + h) − f(x₀) − ⟨s, h⟩,

so that the assumption means

g(h) ≥ φ_s(h) + o(‖h‖²).

In view of the relation

g*(p) = f*(s + p) + f(x₀) − ⟨s + p, x₀⟩ = f*(s + p) − f*(s) − ⟨p, x₀⟩,

we have only to prove that φ_s* defines a majorization of g*, i.e.

g*(p) ≤ φ_s*(p) + o(‖p‖²).

[Step 1] By assumption, for any ε > 0, there is a δ > 0 such that

g(h) ≥ φ_s(h) − ε‖h‖² whenever ‖h‖ ≤ δ;

and, in view of (4.2.6), we write: whenever ‖h‖ ≤ δ,

g(h) ≥ (1 − 2ε/c) φ_s(h),    (*)
g(h) ≥ ½(c − 2ε)‖h‖².       (**)

Our aim is then to establish that, even though (*) holds locally only, the calculus rule 1.3.1(vii) applies, at least in a neighborhood of 0.

[Step 2] For ‖h‖ > δ, write the convexity of g on [0, h] and use g(0) = 0 to obtain with (**)

g(h) ≥ (‖h‖/δ) g(δh/‖h‖);

hence

g(h) ≥ (½c − ε) δ‖h‖.

Then, if ‖p‖ ≤ (½c − ε)δ =: δ', we have

⟨p, h⟩ − g(h) ≤ (½c − ε) δ‖h‖ − g(h) ≤ 0 whenever ‖h‖ > δ.

Since g* ≥ 0, this certainly implies that

g*(p) = sup_{‖h‖≤δ} [⟨p, h⟩ − g(h)]
      ≤ sup_{‖h‖≤δ} [⟨p, h⟩ − (1 − 2ε/c) φ_s(h)]
      ≤ [(1 − 2ε/c) φ_s]*(p) for ‖p‖ ≤ δ'.

[Step 3] Thus, using the calculus rule 1.3.1(ii) and positive homogeneity of φ_s*, we have for p ∈ B(0, δ'),

g*(p) ≤ 1/(1 − 2ε/c) φ_s*(p) = φ_s*(p) + 2ε/(c − 2ε) φ_s*(p)
      ≤ φ_s*(p) + 2ε/(c − 2ε) · 1/(2c) ‖p‖².    [from (4.2.6)]

Let us sum up: given ε' > 0, choose ε in Step 1 such that ε/(c(c − 2ε)) ≤ ε'. This gives δ > 0 and δ' > 0 in Step 2 yielding the required majorization

g*(p) ≤ φ_s*(p) + ε'‖p‖² for all p ∈ B(0, δ'). □

Proposition 4.2.7 With the notation of Definition 4.2.5, suppose that there is a directionally quadratic function φ_s satisfying (4.2.6) for some c > 0 and defining a majorization to second order of f ∈ Conv ℝⁿ, associated with s ∈ ∂f(x₀). Then φ_s* defines a minorization to second order of f* at s, associated with x₀.

PROOF. We use the proof pattern of Proposition 4.2.6; using in particular the same tilting technique, we assume x₀ = 0, f(0) = 0 and s = 0. For any ε > 0, there is by assumption δ > 0 such that

f(h) ≤ φ_s(h) + ε‖h‖² ≤ (1 + 2ε/c) φ_s(h) whenever ‖h‖ ≤ δ.

It follows that

f*(p) ≥ sup {⟨p, h⟩ − (1 + 2ε/c) φ_s(h) : ‖h‖ ≤ δ}    (4.2.7)

for all p; we have to show that, for ‖p‖ small enough, the right-hand side is close to φ_s*(p).

Because of (4.2.6), φ_s* is finite everywhere and ∇φ_s*(0) = 0 (Lemma 4.2.4). It follows from the outer semi-continuity of ∂φ_s* (Theorem VI.6.2.4) that, for some δ' > 0,

∅ ≠ ∂φ_s*(p/(1 + 2ε/c)) ⊂ B(0, δ) for all p ∈ B(0, δ').

Said otherwise: whenever ‖p‖ ≤ δ', there exists h₀ ∈ B(0, δ) such that

φ_s*(p/(1 + 2ε/c)) = ⟨p, h₀⟩/(1 + 2ε/c) − φ_s(h₀).

Thus, the function ⟨p, ·⟩ − (1 + 2ε/c)φ_s attains its (unconstrained) maximum at h₀. In summary, provided that ‖p‖ is small enough, the index h in (4.2.7) can run in the whole space, to give

f*(p) ≥ [(1 + 2ε/c) φ_s]*(p) for all p ∈ B(0, δ').

To conclude, mimic Step 3 in the proof of Proposition 4.2.6. □


To show that (4.2.6) cannot be removed in this last result, take

ℝ² ∋ (ξ, η) ↦ f(ξ, η) = ½ξ² + ¼η⁴,

so that φ₀(ρ, τ) := ½ρ² defines a majorization to second order of f at x₀ = 0. We leave it as an exercise to compute f* and φ₀*, and to realize that the latter is not suitable for a minorization to second order.

Example 4.2.8 Take

f(x) = max {f_j(x) : j = 1, ..., m},

where each f_j : ℝⁿ → ℝ is convex and twice differentiable at x₀. Using the notation J := {j : f_j(x₀) = f(x₀)} for the active index-set at x₀, and Δ being the associated unit simplex, assume for simplicity that A_j := ∇²f_j(x₀) is positive definite for each j ∈ J. For α ∈ Δ, define

s_α := Σ_{j∈J} α_j ∇f_j(x₀) ∈ ∂f(x₀)

and

φ_α(h) := ½ Σ_{j∈J} α_j ⟨A_j h, h⟩;

this last function is convex quadratic and yields the minorization

f(x₀ + h) ≥ Σ_{j∈J} α_j f_j(x₀ + h) = f(x₀) + ⟨s_α, h⟩ + φ_α(h) + o(‖h‖²).

Because φ_α is strongly convex, Proposition 4.2.6 tells us that

f*(s_α + p) ≤ f*(s_α) + ⟨p, x₀⟩ + φ_α*(p) + o(‖p‖²),

where

φ_α*(p) = ½⟨[Σ_{j∈J} α_j A_j]⁻¹ p, p⟩.

Note that the function φ := cl co (min_{α∈Δ} φ_α) also defines a minorization to second order of f at x₀, which is no longer quadratic, but which still satisfies (4.2.6), and is independent of α, i.e. of s ∈ ∂f(x₀). Dually, it corresponds to

f*(s + p) ≤ f*(s) + ⟨p, x₀⟩ + ½ max_{j∈J} ⟨A_j⁻¹ p, p⟩ + o(‖p‖²)

for all p ∈ ℝⁿ, and this majorization is valid at all s ∈ ∂f(x₀). □
If a directionally quadratic function φ_s defines both a minorization and a majorization of f (at x₀, associated with s), it will be said to define an approximation to second order of f. As already stated, existence of such an approximation implies a rather strong smoothness property of f at x₀.

Convex quadratic functions are particular instances of directionally quadratic functions. With Propositions 4.2.6 and 4.2.7 in mind, the next result is straightforward:

Corollary 4.2.9 Let f ∈ Conv ℝⁿ have an approximation at x₀ which is actually quadratic: for s := ∇f(x₀) and some symmetric positive semi-definite linear operator A,

f(x₀ + h) = f(x₀) + ⟨s, h⟩ + ½⟨Ah, h⟩ + o(‖h‖²).

If, in addition, A is positive definite, then f* has also a quadratic approximation at s, namely

f*(s + p) = f*(s) + ⟨p, x₀⟩ + ½⟨A⁻¹p, p⟩ + o(‖p‖²). □
The global version of Corollary 4.2.9 is:

Corollary 4.2.10 Let f : ℝⁿ → ℝ be convex, twice differentiable and 1-coercive. Assume, moreover, that ∇²f(x) is positive definite for all x ∈ ℝⁿ. Then f* enjoys the same properties and

∇²f*(s) = [∇²f((∇f)⁻¹(s))]⁻¹ for all s ∈ ℝⁿ.

PROOF. The 1-coercivity of f implies that dom f* = ℝⁿ, so Im ∇f = dom f* = ℝⁿ: apply Corollary 4.2.9 to any x ∈ ℝⁿ with A = ∇²f(x). □

The assumptions of Corollary 4.2.10 are certainly satisfied if the Hessian of f is uniformly positive definite, i.e. if

∃c > 0 such that ⟨∇²f(x)d, d⟩ ≥ c‖d‖² for all (x, d) ∈ ℝⁿ × ℝⁿ.

Use the 1-dimensional counter-example f(x) := exp x to realize that 1-coercivity is really more than mere positive definiteness of ∇²f(x) for all x ∈ ℝⁿ.

Remark 4.2.11 To conclude, let us return to Example 4.2.8, which illustrates a fundamental difficulty for a second-order analysis. Under the conditions of that example, f can be approximated to second order in a number of ways.

(a) For h ∈ ℝⁿ, define

J_h := {j ∈ J : f'(x₀, h) = ⟨∇f_j(x₀), h⟩};

in other words, the convex hull of the gradients indexed by J_h forms the face of ∂f(x₀) exposed by h. The function

ℝⁿ ∋ d ↦ φ(d) := ½ max {⟨A_j d, d⟩ : j ∈ J_d}

is directionally quadratic and, for all d ∈ ℝⁿ, we have the estimate

f(x₀ + td) = f(x₀) + t f'(x₀, d) + t²φ(d) + o(t²) for t > 0.

Unfortunately, the remainder term depends on d, and actually φ is not even continuous (too bad if we want to approximate a nicely convex function). To see this, just take m = 2, f₁ and f₂ quadratic, and x₀ at a kink: f₁(x₀) = f₂(x₀). Then let d₀ be tangent to the kinky surface (if necessary look at Fig. VI.2.1.3, which displays such a surface). For d in the neighborhood of this d₀, φ(d) is equal to either ½⟨A₁d, d⟩ or ½⟨A₂d, d⟩, depending on which index prevails at x₀ + td for small t > 0.

(b) Another second-order estimate is

f(x₀ + h) = f(x₀) + max_{j∈J} [⟨∇f_j(x₀), h⟩ + ½⟨A_j h, h⟩] + o(‖h‖²).

It does not suffer the non-uniformity effect of (a) and the approximating function is now nicely convex. However, it patches together the first- and second-order information and is no longer the sum of a sublinear function and a directionally quadratic function. □
XI. Approximate Subdifferentials of Convex Functions

Prerequisites. Sublinear functions and associated convex sets (Chap. V); characterization of the subdifferential via the conjugacy correspondence (§X.1.4); calculus rules on conjugate functions (§X.2); and also: behaviour at infinity of one-dimensional convex functions (§I.2.3).

Introduction. There are two motivations for the concepts introduced in this chapter: a practical one, related with descent methods, and a more theoretical one, in the framework of differential calculus.

- In §VIII.2.2, we have seen that the steepest-descent method is not convergent, essentially because the subdifferential is not a continuous mapping. Furthermore, we have defined Algorithm IX.1.6 which, to find a descent direction, needs to extract limits of subgradients: an impossible task on a computer.
- On the theoretical side, we have seen in Chap. VI the directional derivative of a finite convex function, which supports a convex set: the subdifferential. This latter set was generalized to extended-valued functions in §X.1.4; and infinite-valued directional derivatives have also been seen (Proposition 1.4.1.3, Example X.2.4.3). A natural question is then: is the supporting property still true in the extended-valued case? The answer is not quite yes, see below the example illustrated by Fig. 2.1.1.

The two difficulties above are overcome altogether by the so-called ε-subdifferential of f, denoted ∂_ε f, which is a certain perturbation of the subdifferential studied in Chap. VI for finite convex functions. While the two sets are identical for ε = 0, the properties of ∂_ε f turn out to be substantially different from those of ∂f. We therefore study ∂_ε f with the help of the relevant tools, essentially the conjugate function f* (which was of no use in Chap. VI). In return, particularizing our study to the case ε = 0 enables us to generalize the results of Chap. VI to extended-valued functions.

Throughout this chapter, and unless otherwise specified, we therefore have

f ∈ Conv ℝⁿ and ε ≥ 0.

However, keeping in mind that our development has a practical importance for numerical optimization, we will often pay special attention to the finite-valued case.

1 The Approximate Subdifferential

1.1 Definition, First Properties and Examples

Definition 1.1.1 Given x ∈ dom f, the vector s ∈ ℝⁿ is called an ε-subgradient of f at x when the following property holds:

f(y) ≥ f(x) + ⟨s, y − x⟩ − ε for all y ∈ ℝⁿ.    (1.1.1)

Of course, s is still an ε-subgradient if (1.1.1) holds only for y ∈ dom f. The set of all ε-subgradients of f at x is the ε-subdifferential (of f at x), denoted by ∂_ε f(x).

Even though we will rarely take ε-subdifferentials of functions not in Conv ℝⁿ, it goes without saying that the relation of definition (1.1.1) can be applied to any function finite at x. Also, one could set ∂_ε f(x) = ∅ for x ∉ dom f. □

It follows immediately from the definition that

∂_ε f(x) ⊂ ∂_{ε'} f(x) whenever ε ≤ ε';    (1.1.2)

∂f(x) = ∂₀f(x) = ∩{∂_ε f(x) : ε > 0} [= lim_{ε↓0} ∂_ε f(x)];    (1.1.3)

∂_{αε+(1−α)ε'} f(x) ⊃ α ∂_ε f(x) + (1 − α) ∂_{ε'} f(x) for all α ∈ ]0, 1[.    (1.1.4)

The last relation means that the graph of the multifunction ℝ₊ ∋ ε ↦ ∂_ε f(x) is a convex set in ℝ^{n+1}; more will be said about this set later (Proposition 1.3.3). We will continue to use the notation ∂f, rather than ∂₀f, for the exact subdifferential - knowing that ∂_ε f can be called an approximate subdifferential when ε > 0.

Figure 1.1.1 gives a geometric illustration of Definition 1.1.1: s, together with r ∈ ℝ, defines the affine function y ↦ a_{s,r}(y) := r + ⟨s, y − x⟩; we say that s is an ε-subgradient of f at x when it is possible to have simultaneously r ≥ f(x) − ε and a_{s,r} ≤ f. The condition r = f(x), corresponding to exact subgradients, is thus relaxed by ε; thanks to closed convexity of f, this relaxation makes it possible to find such an a_{s,r}:

Fig. 1.1.1. Supporting hyperplanes within ε

Theorem 1.1.2 For all x ∈ dom f, ∂_ε f(x) ≠ ∅ whenever ε > 0.



PROOF. Theorem III.4.1.1 implies the existence in ℝⁿ × ℝ of a hyperplane H_{s,α} separating strictly the point (x, f(x) − ε) from the closed convex set epi f: for all y ∈ ℝⁿ,

⟨s, x⟩ + α[f(x) − ε] < ⟨s, y⟩ + αf(y).

Taking y = x in this inequality shows that

α[f(x) − ε] < αf(x) < +∞,

hence α > 0 (the hyperplane is non-vertical). Then s' := −s/α ∈ ∂_ε f(x). □

Incidentally, the above proof shows that, among the ε-subgradients, there is one for which strict inequality holds in (1.1.1). A consequence of this result is that, for ε > 0, the domain of the multifunction x ↦ ∂_ε f(x) is the convex set dom f. Here is a difference with the case ε = 0: we know that dom ∂f need not be the whole of dom f; it may even be a nonconvex set.

Consider for example the one-dimensional convex function

f(x) := { −2√x  if x ≥ 0,
          +∞    otherwise.    (1.1.5)

Applying the definition, an ε-subgradient of f at x = 0 is an s ∈ ℝ satisfying

sy + 2√y − ε ≤ 0 for all y ≥ 0,

and easy calculation shows that the ε-subdifferential of f at 0 is ]−∞, −1/ε]: it is nonempty for all ε > 0, even though ∂f(0) = ∅. Also, note that it is unbounded, just because 0 is on the boundary of dom f.

For a further illustration, let f : ℝ → ℝ be f(x) := |x|. We have

∂_ε f(x) = { [−1, −1 − ε/x]  if x < −ε/2,
             [−1, +1]        if −ε/2 ≤ x ≤ ε/2,
             [1 − ε/x, 1]    if x > ε/2.

The two parts of Fig. 1.1.2 display this set, as a multifunction of ε and x respectively. It is always a segment, reduced to the singleton {f'(x)} only for ε = 0 (when x ≠ 0). This example suggests that the approximate subdifferential is usually a proper enlargement of the exact subdifferential; this will be confirmed by Proposition 1.2.3 below.
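The characterization (1.2.1) below makes such computations mechanical. As a small numerical sketch of our own: for f = |·| one has f* = I_{[−1,1]}, so s is an ε-subgradient at x if and only if s ∈ [−1, 1] and |x| − sx ≤ ε, and a grid search recovers the piecewise formula just displayed.

```python
import numpy as np

# The epsilon-subdifferential of |x| recovered from f* = I_[-1,1]:
# s is an eps-subgradient at x iff s in [-1,1] and |x| - s x <= eps.
def eps_subdiff(x, eps, grid=np.linspace(-1, 1, 200001)):
    ok = np.abs(x) - grid * x <= eps
    return grid[ok][0], grid[ok][-1]     # endpoints of the segment

eps = 1.0
for x in [-2.0, 0.3, 4.0]:
    lo, hi = eps_subdiff(x, eps)
    print(f"x={x:+.1f}: [{lo:.4f}, {hi:.4f}]")
# expected: [-1, -1-eps/x] for x < -eps/2, [-1, 1] for |x| <= eps/2,
# and [1-eps/x, 1] for x > eps/2
```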
Another interesting instance is the indicator function of a nonempty closed convex
set:

Definition 1.1.3 The set of ε-normal directions to a closed convex set C at x ∈ C, or the ε-normal set for short, is the ε-subdifferential of the indicator function I_C at x:

N_{C,ε}(x) := ∂_ε I_C(x) = {s ∈ ℝⁿ : ⟨s, y − x⟩ ≤ ε for all y ∈ C}.    (1.1.6)
□

Fig. 1.1.2. Graphs of the approximate subdifferential of |x| (left: for fixed ε; right: for fixed x)

The ε-normal set is thus an intersection of half-spaces but is usually not a cone; it contains the familiar normal cone N_C(x), to which it reduces when ε = 0. A condensed form of (1.1.6) uses the polar of the set C − {x}:

∂_ε I_C(x) = ε(C − x)° for all x ∈ C and ε > 0.

These examples raise the question of the boundedness of ∂_ε f(x).

Theorem 1.1.4 For ε ≥ 0, ∂_ε f(x) is a closed convex set, which is nonempty and bounded if and only if x ∈ int dom f.

PROOF. Closedness and convexity come immediately from the definition (1.1.1). Now, if x ∈ int dom f, then ∂_ε f(x) contains the nonempty set ∂f(x) (Theorem X.1.4.2). Then let δ > 0 be such that B(x, δ) ⊂ int dom f, and let L be a Lipschitz constant for f on B(x, δ) (Theorem IV.3.1.2). For 0 ≠ s ∈ ∂_ε f(x), take y = x + δs/‖s‖:

f(x) + Lδ ≥ f(y) ≥ f(x) + δ⟨s, s/‖s‖⟩ − ε,

i.e. ‖s‖ ≤ L + ε/δ. Thus, the nonempty ∂_ε f(x) is also bounded.

Conversely, take any s₁ in the normal cone to dom f at x ∈ dom f:

⟨s₁, y − x⟩ ≤ 0 for all y ∈ dom f.

If ∂_ε f(x) ≠ ∅, add this inequality to (1.1.1) to obtain

f(y) ≥ f(x) + ⟨s + s₁, y − x⟩ − ε

for all y ∈ dom f. We conclude s + s₁ ∈ ∂_ε f(x): ∂_ε f(x) contains (actually: is equal to) ∂_ε f(x) + N_{dom f}(x), which is unbounded if x ∉ int dom f. □

In the introduction to this chapter, it was mentioned that one motivation for the ε-subdifferential was practical. The next result gives a first explanation: ∂_ε f can be used to characterize the ε-solutions of a convex minimization problem - but, starting from Chap. XIII, we will see that its role is much more important than that.

Theorem 1.1.5 The following two properties are equivalent:

0 ∈ ∂_ε f(x),
f(x) ≤ f(y) + ε for all y ∈ ℝⁿ.

PROOF. Apply the definition. □

1.2 Characterization via the Conjugate Function

The conjugate function f* allows a condensed writing for (1.1.1):

Proposition 1.2.1 The vector s ∈ ℝⁿ is an ε-subgradient of f at x ∈ dom f if and only if

f*(s) + f(x) − ⟨s, x⟩ ≤ ε.    (1.2.1)

As a result,

s ∈ ∂_ε f(x) ⟺ x ∈ ∂_ε f*(s).    (1.2.2)

PROOF. The definition (1.1.1) of ∂_ε f(x) can be written

f(x) + [⟨s, y⟩ − f(y)] − ⟨s, x⟩ ≤ ε for all y ∈ dom f

which, remembering that f*(s) is the supremum of the bracket, is equivalent to (1.2.1). This implies that ∂_ε f(x) ⊂ dom f* and, applying (1.2.1) with f replaced by f*:

x ∈ ∂_ε f*(s) ⟺ f**(x) + f*(s) − ⟨s, x⟩ ≤ ε.

The conclusion follows since Conv ℝⁿ ∋ f = f**. □

In contrast with the case ε = 0, note that

∂_ε f(ℝⁿ) = dom f* whenever ε > 0.

Indeed, the inclusion "⊂" comes directly from (1.2.1); conversely, if s ∈ dom f*, we know from Theorem 1.1.2 that there exists x ∈ ∂_ε f*(s), i.e. s ∈ ∂_ε f(x).

Likewise, for fixed x ∈ dom f,

[lim_{ε→+∞} ∂_ε f(x) =] ∪{∂_ε f(x) : ε > 0} = dom f*.    (1.2.3)

Here again, the "⊂" comes directly from (1.2.1) while, if s ∈ dom f*, set

ε_s := f*(s) + f(x) − ⟨s, x⟩ ∈ [0, +∞[

to see that s is in the corresponding ∂_{ε_s} f(x).

Example 1.2.2 (Convex Quadratic Functions) Consider the function

ℝⁿ ∋ x ↦ f(x) := ½⟨Qx, x⟩ + ⟨b, x⟩

where Q is symmetric positive semi-definite with pseudo-inverse Q⁻. Using Example X.1.1.4, we see that ∂_ε f(x) is the set

{s ∈ b + Im Q : f(x) + ½⟨s − b, Q⁻(s − b)⟩ ≤ ⟨s, x⟩ + ε}.

This set has a nicer expression if we single out ∇f(x): setting p = s − b − Qx, we see that s − b ∈ Im Q means p ∈ Im Q and, via some algebra,

∂_ε f(x) = {∇f(x)} + {p ∈ Im Q : ½⟨p, Q⁻p⟩ ≤ ε}.    (1.2.4)

Another equivalent formula is obtained if we set y := Q⁻p (so p = Qy):

∂_ε f(x) = {∇f(x) + Qy : ½⟨Qy, y⟩ ≤ ε}.    (1.2.5)

To illustrate Theorem 1.1.5, we see from (1.2.4) that x minimizes f within ε if and only if

∇f(x) = Qx + b ∈ Im Q [i.e. b ∈ Im Q] and ½⟨∇f(x), Q⁻∇f(x)⟩ ≤ ε.

When Q is invertible and b = 0, f defines the norm |||x||| = ⟨Qx, x⟩^{1/2}. Its ε-subdifferential is then a neighborhood of the gradient for the metric associated with the dual norm |||s|||* = ⟨s, Q⁻¹s⟩^{1/2}. □

The above example suggests once more that ∂_ε f is usually a proper enlargement of ∂f; in particular it is "never" reduced to a singleton, in contrast with ∂f, which is "often" reduced to the gradient of f. This is made precise by the next results, which somehow describe two opposite situations.

Proposition 1.2.3 Let f ∈ Conv ℝⁿ be 1-coercive, i.e. f(x)/‖x‖ → +∞ if ‖x‖ → ∞. Then, for all x ∈ dom f,

0 ≤ ε < ε' ⟹ ∂_ε f(x) ⊂ int ∂_{ε'} f(x).

Furthermore, any s ∈ ℝⁿ is an ε-subgradient of f at x, provided that ε is large enough.

PROOF. Written in the form (1.2.1), an approximate subdifferential of f appears as a sublevel-set of the convex function f* − ⟨·, x⟩; but the 1-coercivity of f means that this function is finite everywhere (Proposition X.1.3.8). For ε' > ε, the conditions of Proposition VI.1.3.3 are then satisfied to compute the interior of our sublevel-set:

int ∂_{ε'} f(x) = {s ∈ ℝⁿ : f*(s) + f(x) − ⟨s, x⟩ < ε'} ⊃ ∂_ε f(x).

As for the second claim, it comes directly from (1.2.3). □

Proposition 1.2.4 Let f ∈ Conv ℝⁿ and suppose that ∂_{ε₀} f(x₀) is a singleton for some x₀ ∈ dom f and ε₀ > 0. Then f is affine on ℝⁿ.

PROOF. Denote by s₀ the unique ε₀-subgradient of f at x₀. Let ε ∈ ]0, ε₀[; in view of the monotonicity property (1.1.2), the nonempty set ∂_ε f(x₀) can only be the same singleton {s₀}. Then let ε' > ε₀; the graph-convexity property (1.1.4) easily shows that ∂_{ε'} f(x₀) is again {s₀}.

Thus, using the characterization (1.2.1) of an approximate subgradient, we have proved:

s ≠ s₀ ⟹ f*(s) > ε + ⟨s, x₀⟩ − f(x₀) for all ε > 0,

and not much room is left for f*: indeed

f* = r + I_{{s₀}} for some real number r.

The affine character of f follows; and naturally r = f*(s₀) = ⟨s₀, x₀⟩ − f(x₀). □

Our next example somehow generalizes this last situation.


Example 1.2.5 (Support and Indicator Functions) Let σ = σ_C be a closed sublinear function, i.e. the support of the nonempty closed convex set C = ∂σ(0). Its conjugate σ* is just the indicator of C, hence

∂_ε σ_C(d) = {s ∈ C : σ(d) ≤ ⟨s, d⟩ + ε}.    (1.2.6)

Remembering that σ(0) = 0, we see that

∂_ε σ_C(d) ⊂ ∂σ_C(0) = ∂_ε σ_C(0) = C for all d ∈ ℝⁿ.

Keeping Proposition 1.2.4 in mind, the approximate subdifferential of a linear - or even affine - function is a singleton. The set ∂_ε σ_C(d) could be called the "ε-face" of C exposed by d, i.e. the set of ε-maximizers of ⟨·, d⟩ over C. As shown in Fig. 1.2.1, it is the intersection of the convex set C with a certain half-space depending on ε. For ε = 0, we do obtain the exposed face itself: ∂σ_C(d) = F_C(d).

Fig. 1.2.1. An ε-face

Along the same lines, the conjugate of I_C in (1.1.6) is σ_C, so the ε-normal set to C at x is equivalently defined as

N_{C,ε}(x) = {s ∈ ℝⁿ : σ_C(s) ≤ ⟨s, x⟩ + ε}.

Beware of the difference with (1.2.6); one set looks like a face, the other like a cone; the relation linking them connotes the polarity relation, see Remark V.3.2.6.

Consider for example a half-space: with s ≠ 0,

H⁻ := {x ∈ ℝⁿ : ⟨s, x⟩ ≤ r}.

Its support function is computed by straightforward calculations, already seen in Example V.2.3.1; we obtain for x ∈ H⁻:

N_{H⁻,ε}(x) = {λs : 0 ≤ λ and (r − ⟨s, x⟩)λ ≤ ε}.

In other words, the approximate normal set to a half-space is a segment, which stretches to the whole normal cone when x approaches the boundary. Figure 2.1.1 will illustrate an ε-normal set in a nonlinear situation. □

1.3 Some Useful Properties

(a) Elementary Calculus. First, we list some properties coming directly from the definitions.

Proposition 1.3.1
(i) For the function g(x) := f(x) + r, ∂_ε g(x) = ∂_ε f(x).
(ii) For the function g(x) := αf(x) and α > 0, ∂_ε g(x) = α ∂_{ε/α} f(x).
(iii) For the function g(x) := f(αx) and α ≠ 0, ∂_ε g(x) = α ∂_ε f(αx).
(iv) More generally, if A is an invertible linear operator, ∂_ε(f ∘ A)(x) = A* ∂_ε f(Ax).
(v) For the function g(x) := f(x − x₀), ∂_ε g(x + x₀) = ∂_ε f(x).
(vi) For the function g(x) := f(x) + ⟨s₀, x⟩, ∂_ε g(x) = ∂_ε f(x) + {s₀}.
(vii) If f₁ ≤ f₂ and f₁(x) = f₂(x), then ∂_ε f₁(x) ⊂ ∂_ε f₂(x).

PROOF. Apply (1.1.1), or combine (1.2.1) with the elementary calculus rules X.1.3.1, whichever seems easier. □

Our next result expresses how the approximate subdifferential is transformed when the starting function is restricted to a subspace.

Proposition 1.3.2 Let H be a subspace containing a point of dom f and call P_H the operator of orthogonal projection onto H. For all x ∈ dom f ∩ H,

∂_ε(f + I_H)(x) = ∂_ε(f ∘ P_H)(x) + H^⊥,

i.e.

s ∈ ∂_ε(f + I_H)(x) ⟺ P_H(s) ∈ ∂_ε(f ∘ P_H)(x).

PROOF. From the characterization (1.2.1), s ∈ ∂_ε(f + I_H)(x) means

ε ≥ (f + I_H)(x) + (f + I_H)*(s) − ⟨s, x⟩
  = f(x) + [(f ∘ P_H)* ∘ P_H](s) − ⟨s, x⟩    [from Prop. X.1.3.2]
  = f(P_H(x)) + (f ∘ P_H)*(P_H(s)) − ⟨s, P_H(x)⟩    [x = P_H(x)]
  = (f ∘ P_H)(x) + (f ∘ P_H)*(P_H(s)) − ⟨P_H(s), x⟩.    [P_H is symmetric]

This just means that P_H(s) is in the ε-subdifferential of f ∘ P_H at x. □

A particular case is when dom f ⊂ H: then f + I_H coincides with f and we have ∂_ε f = ∂_ε(f ∘ P_H) + H^⊥.

(b) The Tilted Conjugate Function. From (1.2.1), ∂_ε f(x) appears as the sublevel-set at level ε of the "tilted conjugate function"

ℝⁿ ∋ s ↦ f*(s) − ⟨s, x⟩ + f(x) =: g_x*(s),    (1.3.1)

which is clearly in Conv ℝⁿ (remember x ∈ dom f!) and plays an important role. Its infimum on ℝⁿ is 0 (Theorem 1.1.2), attained at the subgradients of f at x, if there are any. Its conjugate g_x** = g_x is easily computed, say via Proposition X.1.3.1, and is simply the "shifted" function

ℝⁿ ∋ h ↦ g_x(h) := f(x + h) − f(x).    (1.3.2)

Indeed, when connecting g_x* to the approximate subdifferential, we just perform the conjugacy operation on f, but with a special role attributed to the particular point x under consideration.

Proposition 1.3.3 For x ∈ dom f, the epigraph of g_x* is the graph of the multifunction ε ↦ ∂_ε f(x):

{(s, ε) ∈ ℝⁿ × ℝ : g_x*(s) ≤ ε} = {(s, ε) ∈ ℝⁿ × ℝ : s ∈ ∂_ε f(x)}.    (1.3.3)

The support function of this set has, at (d, −u) ∈ ℝⁿ × ℝ, the value

σ_{epi g_x*}(d, −u) = sup_{s,ε} {⟨s, d⟩ − εu : s ∈ ∂_ε f(x)}    (1.3.4)
    = { u[f(x + d/u) − f(x)]       if u > 0,
        σ_{dom f*}(d) = f'_∞(d)    if u = 0,
        +∞                         if u < 0.

PROOF. The equality of the two sets in (1.3.3) is trivial, in view of (1.2.1) and (1.3.1). Then (1.3.4) comes either from direct calculations, or via Proposition X.1.2.1, with f replaced by g_x*, whose domain is just dom f* (Proposition X.1.3.1 may also be used). □

Remember §IV.2.2 and Example 1.3.2.2 to realize that, up to the closure operation at u = 0, the function (1.3.4) is just the perspective of the shifted function h ↦ f(x + h) − f(x).

Figure 1.3.1 displays the graph of ε ↦ ∂_ε f(x), with the variable s plotted along the horizontal axis, as usual (see also the right part of Fig. 1.1.2). Rotating the picture so that this axis becomes vertical, we obtain epi g_x*.

Fig. 1.3.1. Approximate subdifferential as a sublevel set in the dual space

(c) A Critical ε-Value.

Proposition 1.3.4 Assume that the infimal value f̄ of f over ℝⁿ is finite. For all x ∈ dom f, there holds

inf {ε > 0 : 0 ∈ ∂_ε f(x)} = f(x) − f̄.    (1.3.5)

PROOF. Because g_x* is a nonnegative function, we always have

g_x*(0) = inf {ε > 0 : ε ≥ g_x*(0)}

which, in view of (1.3.3) and of the definition of g_x*, can also be written

inf {ε > 0 : 0 ∈ ∂_ε f(x)} = f*(0) + f(x) = −f̄ + f(x). □

The relation expressed in (1.3.5) is rather natural, and can be compared to Theorem 1.1.5. It defines a sort of "critical" value for ε, satisfying the following property:

Proposition 1.3.5 Assume that f is 1-coercive. For a non-optimal x ∈ dom f, define ε̄ := f(x) − f̄ > 0. Then the normal cone to ∂_ε̄ f(x) at 0 is the set of directions pointing from x to the minimizers of f:

N_{∂_ε̄ f(x)}(0) = ℝ₊(Argmin f − {x}).

PROOF. We know that ∂_ε̄ f(x) is the sublevel-set of g_x* at level ε̄ and we use Theorem VI.1.3.5 to compute its normal cone. From the coercivity of f, g_x* is finite everywhere (Proposition X.1.3.8); if, in addition, ε̄ > 0, the required Slater assumption is satisfied; hence, for an s satisfying g_x*(s) = ε̄ [i.e. s ∈ bd ∂_ε̄ f(x)],

N_{∂_ε̄ f(x)}(s) = ℝ₊ ∂g_x*(s) = ℝ₊(∂f*(s) − {x}).

In this formula, we can set s = 0 because

g_x*(0) = f*(0) + f(x) − ⟨0, x⟩ = ε̄.

The result follows from the expression (X.1.4.6) of Argmin f as ∂f*(0). □

(d) Representations of a Closed Convex Function. An observation inspired by the


function g;
is that the multifunctions e t---+ F(e) that are e-subdifferentials (of a
function 1 at a point x) are the closed convex epigraphs; see §IY.1.3(g). Also, if
ael(x) is known for all e > 0 at a given x E dom I, then /* is also known, so 1
itself is completely determined:

Theorem 1.3.6 For x E dom I, there holds

I(x + h) = I(x) + sup {(s, h) - 8 : e > 0, s E ael(x)} (1.3.6)


for all h E JRn; or, using support functions:

I(x + h) = I(x) + sup[crae!(x) (h) - e]. (1.3.7)


e>O
1 The Approximate Subdifferential 101

PROOF. Fix x e dom/. Then set u = 1 in (1.3.4) to obtain uepigj(h, -1) = gx(h),
i.e. (1.3.7), which is just a closed form for (1.3.6). 0

Using the positive homogeneity of a support function, another way of writing


(1.3.7) is: ifx e dom I,

I(x + td) = I(x) + t sup [uataf(x) (d) - a] for all t > o.


u>o

Theorem 1.3.6 gives a converse to Proposition 1.3.1 (vii): if, for a particular x e
dom II n dom 12,
odl (x) C oeh(x) for alle > 0,
then
II (y) - II (x) :::;; h(y) - h(x) for all y e ]Rn .

Remark 1.3.7 The function g;


differs from f* by the affine function I (x) - (., x)
only. In the primal graph-space, gx is obtained from I by a mere translation of the
origin from (0,0) to (x, I(x». Basically, considering the graph of od, instead of
the epigraph of f*, just amounts to this change of variables. In the framework of
numerical optimization, the distinction is significant: it is near the current iterate x of
an algorithm that the behaviour of I matters. Here is one more practical motivation
for introducing the approximate subdifferential.
Theorem 1.3.6 confirms this observation: numerical algorithms need a model of
I near the current x - remember §II.2. A possibility is to considera model for 1*: it
induces via (1.2.1) a model for the multifunction e I---+- oel(x), which in turn gives
via (1.3.7) a model for I, hence a possible strategy for finding the next iterate x+.
o
This remark suggests that, in the framework of numerical optimization, (1.3.7)
will be especially useful with h close to O. More globally, we have

[/**(x) =] I(x) = sup {(s, x) - I*(s) : s e ]Rn},

which also expresses a closed convex function as a supremum, but of affine functions.
The index s can be restricted to dom 1*, or even to ri dom 1*; another better idea:
restrict s to dom of* (::J ridomf*), in which case f*(s) = (s, y) - I(y) for some
y e dom 01. In other words, we can write

I(x) = sup {f(y) + (s,x-y) : yedomol, seol(y)}.

This formula has a refined form, better suited to numerical optimization where
only one subgradient at a particular point is usually known (see again the concept of
black box (Ul) in §VIII.3.5).

Theorem 1.3.8 Associated with I e Conv]Rn, let s : ri dom I ~ ]Rn be a mapping


satisfying s (y) e 01 (y) for all y e ri dom I. Then there holds

I(x) = sup [/(y) + (s(y), x - y}] for all x e dom I. (1.3.8)


YEridomf
102 XI. Approximate Subdifferentials of Convex Functions

PROOF. The existence of the mapping s is guaranteed by Theorem X.1.4.2; besides,


only the inequality ~ in (1.3.8) must be proved. From Lemma III.2.I.6, we can take
d e JRn such that

Yt:=x+tderidomf forte]O,I].

Then
f(x + d) ~ f(Yt) + (s(Yt), x + d - Yt)
so that
{ ( ) d) ~ f (x + d) - f (Yt)
S Yt, "" 1 -t .
Then write

f(Yt) + (s(Yt), x - Yt) = f(Yt) - t{s(Yt), d) ~ f(Yt) ~ tf(x + d)


-t
and let t ..1- 0; use for example Proposition IY.I.2.S to see that the right-hand side tends
to f(x). 0

2 The Approximate Directional Derivative

Throughout this section, x is fixed in dom f. As a (nonempty) closed convex set, the
approximate subdifferential aef(x) then has a support function, for any e > O. We
denote it by f;(x, .):

JRn 3 d ~ f;(x, d) := C1aef(x) (d) = sup (s, d), (2.0.1)


seaef(x)

a closed sublinear function. The notation f; is motivated by §VI.I.I: f' (x, .) supports
af(x), so it is natural to denote by f;(x, .) the function supporting aef(x). The
present section is devoted to a study of this support function, which is obtained via
an "approximate difference quotient".

2.1 The Support Function of the Approximate SubditTerential

Theorem 2.1.1 For x e dom f and e > 0, the support function of ae f (x) is

1!])n
IN.. 3
d
~
~'(
Je x,
d) _ . f f(x
- In
+ td) - f(x) +e , (2.1.1)
t>O t
which will be called the e-directional derivative of fat x.

PROOF. We use Proposition 1.3.3: embedding the set aef (x) in the larger space JRn x JR,
we view it as the intersection of epi gi with the horizontal hyperplane
2 The Approximate Directional Derivative 103

f;
(rotate and contemplate thoroughly Fig. 1.3.1). Correspondingly, (x, d) is the value
at (d, 0) of the support function of our embedded set epigi n He:

f;(x, d) = (Iepigj + IHe)*(d, 0).


Our aim is then to apply the calculus rule X.2.3.2, so we need to check the
qualification assumption (X.2.3.1). The relative interior of the hyperplane He being
He itself, what we have to do is to find in ri epi gi a point of the form (s, e). Denote
by A the linear operator which, to (s, a) E JRn x JR, associates A(s, a) = a: we want

e E A(riepig;) = ri[A(epig;)],
where the last equality comes from Proposition III.2.1.12. But we know from Theo-
rem 1.1.2 and Proposition 1.3.3 that A(epigi) is JR+ or JRt; in both cases, its relative
interior is JRt, which contains the given e > O. Our assumption is checked.
As a result, the following problem

min{CTHe(P,U) +CTePigj(q, v) : p+q=d,u+v=O}=f;(x,d) (2.1.2)

has an optimal solution. Now look at its minimand:


-unless p = 0, CTHe(P, u) = +00; andCTHe(O, u) = ue;
- unless v ~ 0, CTepigj(q, v) = +00 - see (1.3.4).
Ina word,

f;(x, d) = min (CTepigj(d, -u) + ue : u ~ O}.

Remembering that, as a support function, u f-+ CTepi gj (d, u) is in particular continuous


for u -I- 0, and using its value (1.3.4), we finally obtain (2.1.1). 0

The above proof strongly suggests to consider the change of variable t = 1I u in


(2.1.1), and to set

u[f(x + diu) - f(x) + e] ifu > 0,


re(u) := { fbo(d) = CTdomj*(d) ifu=O, (2.1.3)
+00 ifu<O.

When e > 0, this function does achieve its minimum on JR+; in the t-Ianguage, this
means that the case t -I- 0 never occurs in the minimization problem (2.1.1). On the
other hand, 0 may be the unique minimum point of re, i.e. (2.1.1) may have no solution
"at finite distance".

The importance ofthe assumption 8 > 0 cannot be overemphasized; Theorem X.2.3.2


cannot be invoked without it, and the proof breaks down. Consider Fig. 2.1.1 for a counter-
example: B(O, 1) is the unit Euclidean ball oflR2, I its indicator function and x = (0, -1).
Clearly enough, the directional derivative I' (x, d) for d = (y, 8) =/:. 0 is

I'(x,d) = { 0 ~f8 > 0,


+00 If8 ~O,
104 XI. Approximate Subdifferentials of Convex Functions

which cannot be a support function: it is not even closed. On the other hand, the infimum in
(2.1.1), which is obtained for x + td on the unit sphere, is easy to compute: for d =f. 0,

if 8> 0,
if8 ~ 0.

It is the support function of the set

(2.1.4)

which, in view of Theorem 2.1.1, is the e-normal set to B(O, 1) at x.

Fig. 2.1.1. The approximate subdifferential of an indicator function

Note: to obtain (2.1.4), it is fun to use elementary geometry (carry out in Fig. 2.1.1 the
inversion operation from the pole x), but the argument of Remark VI.6.3. 7 is more systematic,
and is also interesting.

Remark 2.1.2 We have here an illustration of Example X.2.4.3: closing f' (x, .) is just what
is needed to obtain the support function of af(x). For x E dom af, the function

~n 3 d 1-+ U(Jj(x)(d) = sup {(s,d) : s E af(x)}

is the closure of the function

1I])n
~ 3
d 1-+
f'e x, d) -_ I·1m f(x + td) - f(x) .
t,j,o t

This property appears also in the proof of Theorem 2.1.1: for e = 0, the closure operation
must still be performed after (2.1.2) is solved. 0

°
Figure 2.1.2 illustrates (2.1.1) in a less dramatic situation: for e > 0, the line representing
t f(x) - e + tf;(x, d) supports the graph oft 1-+ f(x + td), not at t = but at some
1-+
point te > 0; among all the slopes joining (0, f(x) - e) to an arbitrary point on the graph of
f (x + .d), the right-hand side of (2.1.1) is the smallest possible one.
On the other hand, Fig. 2.1.2 is the trace in ~ x ~ of a picture in ~n x R among all
the possible hyperplanes passing through (x, f(x) - e) and supporting epi f, there is one
touching gr f somewhere along the given d; this hyperplane therefore gives the maximal
slope along d, which is the value (2.0.1). The contact x + tEd plays an important role in
minimization algorithms and we will return to it later.
2 The Approximate Directional Derivative 105

f(x) ----I
f(x+td) - E - -

~:::""_...L....

Fig.2.1.2. The e-directional derivative and the minimizing te

The same picture illustrates Theorem 1.3.6: consider the point x + ted as fixed and
call it y. Now, for arbitrary 8 ~ 0, draw a hyperplane supporting epi f and passing through
(0, f (x) - 8). Its altitude at y is f (x) - 8 + f; (x, y - x) which, by definition of a support,
is not larger than f(y), but equal to it when 8 = e.
In summary: fix (x, d) E dom f x lRn and consider a pair (e, y) E lRt x lRn linked by
the relation y = x + ted.
- To obtain y from e, use (2.1.1): as a function of the "horizontal" variable t > 0, draw the
line passing through (x, f(x) - e) and (x + td, f(x + td)); the resulting slope must be
minimized.
- To obtain e from y, use (1.3.7): as a function of the "vertical" variable 8 ~ 0, draw a support
passing through (x, f (x) - 8); its altitude at y must be maximized.

Example 2.1.3 Take again the convex quadratic function of Example 1.2.2: I: (x, d)
is the optimal value of the one-dimensional minimization problem

. t(Qd,d)t 2 +(V/(x),d)t+e
mf .
t>O t
If (Qd, d) = 0, this infimum is (V I(x), d). Otherwise, it is attained at

t
6 -
- Rf; e
(Qd,d)

and
= (V/(x), d) + .j2e(Qd, d).
I;(x, d)
This is a general formula: it also holds for (Qd, d) = O.
It is interesting to observe that[l: (x, d) - I' (x, d)]/ e ~ +00 when e -J.. 0, but
H/:(x, d) - I'(x, d)]2 == (Qd, d).
e
This suggests that, for C2 convex functions,

lim t[f:(x, d) - 1'(x,d)]2 = I" (x , d)


6-1-0 e

where
106 XI. Approximate Subdifferentials of Convex Functions

f "(x, d)'- I' f(x


.- 1m
+ td) - f(x) - tf'(X, d) _ I'
-
f'ex + td, d) - f'ex, d)
1m -'---'-----'---'---'---'-
t--+o !t
2
2 t--+o t

is the second derivative of f at x in the direction d. Here is one more motivation for
the approximate subdifferentiaI: it accounts for second order behaviour of f. 0

2.2 Properties of the Approximate Difference Quotient

For x E dom f, the function qe encountered in (2.1.1) and defined by

f(x + td) - f(x) + e


]0, +oo[ 3 t ~ qeCt) := " - - - - - - - ' - - - - (2.2.1)
t
is called the approximate difference quotient. In what follows, we will set q := qo; to
avoid the trivial case qe == +00, we assume

[x,x +td] c domf for some t > O.

Our aim is now to study the minimization of qe, a relevant problem in view ofTheo-
rem 2.1.1.

(a) Behaviour of q.,. In order to characterize the set of minimizers

Te := {t > 0 : qe(t) = f;(x, d)}, (2.2.2)

two numbers are important. One is

f/x;(d) := sup {q(t) : t > O}

which describes the behaviour of f(x + td) for t -+ +00 (see §IY.3.2 and Exam-
ple X.2.4.6 for the asymptotic function f6o)' The other,

tl:=SUp{t~O: f(x+td)-f(x) = tf'(X, d)},

concerns the case t -J, 0: t l = 0 if f(x+td) has a "positive curvature fort -J, 0". When
t l = +00, i.e. f is affine on the half-line x +lR+d, we have qe(t) = f'ex, d) + elt;
then it is clear that f; (x, d) = f' (x, d) = fbo (d) and that Te is empty for e > O.

Example 2.2.1 Before making precise statements, let us see in Fig. 2.2.1 what can be ex-
pected, with the help of the example

o~ t ~ f(x + td) := max {t, 3t - I}.

The lower-part of the picture gives the correspondence e -4 Te , with the same abscissa-axis
as the upper-part (namely t). Some important properties are thus introduced:
- As already known, q increases from f'ex, d) = 1 to f/x;(d) = 3;
- indeed, q(t) is constantly equaito its minimal value f'ex, d) forO < t ~ 1/2, and t l = 1/2;
2 The Approximate Directional Derivative 107

t(d) =3 ~--::6 i
f'(x,d) =1

10
00
10
--------
= 1 _._ ... _................_._..... _...-.._............_.........

Fig.2.2.1. A possible qe and Te

- f;(x, d) stays between f'ex, d) and f60(d), and reaches this last value for s ;;:: 1;
- To is the segment ]0, 1/2]; TJ is the half-line [1/2, +00[, and Te is empty for s > 1. 0

This example reveals another important number associated with large t, which we will
calls oo (equal to 1 in the example). The statements (i) and (ii) in the next result are already
known, but their proof is more natural in the t-Ianguage.

Proposition 2.2.2 The notations and assumptions are as above.


(i) If s > 0, then qe(t) ~ +00 when t ..j, O.
(ii) When t ~ +00, qe(t) ~ f/x,(d) independently ols ~ O.
(iii) The set
E := {s ~ 0 : f;(x, d) < f~(d)} (2.2.3)
is empty if and only if t l = +00. Its upper bound
sup E =: SOO E [0, +00]

satisfies
Te "# 0 if s < SOO and Te = 0 if s > SOO •

PROOF. For s > 0, let So E Be /2!(x): for all t > 0,

f(x) + t(so, d) - s/2 - f(x) +s d) s


qe(t) ~
t
= (so, + -2t
and this takes care of (i). For t ~ +00, qe(t) = q(t) + s/t has the same limit as
q (t), so (ii) is straightforward.
Thus f~(x, d) ~ f/x,(d), and non-vacuousness of Te depends only on the be-
haviour of qe at infinity. To say that E is empty means from its definition (2.2.3):

f/x,(d) ~ inf inf qe(t) = inf inf qe(t) = f'(x, d).


e ~ 0 t>o t>o e ;;:: 0
108 XI. Approximate Subdifferentials of Convex Functions

Because f'(x,·) ~ f/x;, this in turn means that q(t) is constant for t > 0, or that
l- = +00.
f;
Now observe that s r+ (x, d) is increasing (just as de f); so E, when nonempty,
is an interval containing 0: s < soo implies sEE, hence Te #- 0. Conversely, if
s > soo, take s' in ]Soo, s[ (so s' fj E) and t > 0 arbitrary:

t cannot be in Te , which is therefore empty. o

The example t r+ f(x + td) = .Jf+t2 illustrates the case SOO < +00 (here
soo = l);itcorrespondstof(x+, d) having the asymptote t r+ f(x)-soo+ f60(d)t.
Also, one directly sees in (2.2.3) thats oo = +00 whenever f60(d) is infinite. With the
example f of(1.1.5)(and taking d = 1), wehavet l = 0, f/x;(1) = Oands oo = +00.

(b) The Closed Convex Function r e. For a more accurate study of Te , we use now the
function re of (2.1.3), obtained via the change of variable U = I/t: re(u) = qe(1/u)
for u > O. It is the trace on lR of the function (1.3.4), which is known to be in
Conv(lRn x lR); therefore
re E ConvlR.

We know from Theorem 2.1.1 that re is minimized on a closed interval

Ue := {u ~ 0 : re(u) = f;(x, d)} (2.2.4)

which, in view of Proposition 2.2.2(i), is nonempty and bounded if s > O. Likewise,

Te = {t = I/u : u E Ue and u > O}

is a closed interval, empty if Ue = {O}, a half-line if {O} *- Ue. Knowing that

it will be convenient to set, with the conventions 1/0 = +00, 1/00 = 0:


(2.2.5)

There remains to characterize Te , simply by expressing the minimality condition


o E dre(u).
Lemma 2.2.3 Let ((J be the convexfunction 0 ~ t r+ ((J(t) := f(x + td). For s ~ 0
and 0 < u E domre, the subdifferential ofre at u is given by

dre () _ re(u) - d((J(I/u)


u - . (2.2.6)
u
2 The Approximate Directional Derivative 109

PROOF. The whole point is to compute the subdifferential of the convex function
u ~ t/f(u) := u((J(llu), and this amounts to computing its one-sided derivatives.
Take positive u' = lit' and u = lit, with u E domre (hence ((J(llu) < +00), and
compute the difference quotient of t/f (cf. the proof of Theorem 1.1.1.6)
u'((J(llu') - u((J(llu) ((J(t') - ((J(t)
,
u-u
= ((J(t) - t,
t-t
.

Letting u' ,j.. u, i.e. t' t t, we obtain the right-derivative

the left-derivative is obtained likewise.


Thus, the subdifferential of t/f at u is the closed interval

ot/f(u) = ((J(t) - to((J(t) .

Knowing that re(u') = t/f(u') - u'[f(x) - e] for all u' > 0, we readily obtain
ore(u) = ((J(t) - to((J(t) - f(x) + e = re(u)lu - to((J(t). 0

TherninimalityconditionO E ore(u) is therefore rdu) E o((J(llu).Int-language,


we say: tETe if and only if there is some a E o((J(t) such that

f(x + td) - ta = f(x) - e. (2.2.7)


Remark 2.2.4 Using the calculus rule from Theorem 3.2.1 below, we will see that aq> can
be expressed with the help of af. More precisely, if x + R+d meets ridomf, we have

arE(u) = rE(u) - (af(x + diu), d) .


u
Then Fig. 2.1.2 gives a nice interpretation of (2.2.7): associated with t >0 and S E
af(x + td), consider the affine function
o~ r 1-+ l(r) := f(x + td) + (s, d)(r - t),

whose graph supports epi q> at r = t. Its value 1(0) atthe vertical axis r = 0 isa subderivative
of u 1-+ uq>(l/u) at u = lit, and t is optimal in (2.1.1) when 1(0) reaches the given
1
f(x) - e. In this case, TE is the contact-set between gr and grq>. Note: convexity and
Proposition 2.2.2(iii) tell us that f(x) - e oo ~ 1(0) ~ f(x). 0

Let us summarize the results of this section concerning the optimal set Te, or Ue
of (2.2.4).
- First, we have the somewhat degenerate case t l = +00, meaning that f is affine
on the half-line x + 1R+d. This can be described by one of the following equivalent
statements:
f'(x, d) = f60(d);
f;(x,d) = f'(x,d) foralle > 0;
Vt > 0, qe(t) > f60(d) for aIle> 0;
Te = I{) for all e > 0;
Ue = {OJ for all e > O.
110 XI. Approximate SubdifIerentials of Convex Functions

- The second situation, more interesting, is when tf. < +00; then three essentially
different cases may occur, according to the value of e.
- When 0 < e < eoo , one has equivalently
f'(x, d) < f:(x, d) < f/:.o(d);
3t > 0 such that qs(t) < f60(d) ;
Ts is a nonempty compact interval;
Of/. Us.
- When eOO < e:
f:(x, d) = f/:.o(d);
Vt > 0, qs(t) > f/:.o(d);
Ts =0;
Us = {O}.
f:(x, d) = f60(d) ;
Ts is empty or unbounded;
o E Us.
Note: in the last case, Ts nonempty but unbounded means that f(x + td) touches
its asymptote f(x) - e + tf60(d) for t large enough.

2.3 Behaviour of f; and Te as Functions of e


In this section, we assume again [x, x + td] c domf for some t > O. From the
fundamental relation (2.1.1), the function e t-+- - f:(x, d) appears as a supremum of
affine functions, and is therefore closed and convex: it is the conjugate of some other
function. Following our general policy, we extend it to the whole real line, setting

._ {-f:(X,d) ife~O,
() .-
ve
+00 th . 0 erwlse.
Then v E Conv lR: in fact, dom v is either [0, +oo[ or ]0, +00[. When e -l- 0, v(e)
tends to - f'(x, d) E lR U {+oo}.

Lemma 2.3.1 With r := ro of (2.1.3) andfor all real numbers e and u:

v(e) = r*(-e), i.e. r(u) = v*(-u).


PROOF. For e ~ 0, just apply the definitions:

-v(e) = f;(x, d) = inf [r(u)


uEdomr
+ eu] = -r*(-e).

For -e > 0, we have trivially (remember that f' (x, d) < +00)

r*(-e)~ lim [-eu-r(u)]=+oo-f'(x,d)=+oo=v(e). 0


U~+OO
2 The Approximate Directional Derivative 111

Observe that the case u = 1 is just Theorem 1.3.6; u = 0 gives


[f~(d) =] r(O) = sup f;(x, d) [= lime-++oo ff(x, d)],
e>O

and this confirms the relevance of the notation f60 for an asymptotic function.
Theorem 2.3.2 With the notation o/Proposition 2.3.1 and (2.2.4),

ov(e) = -Ue /orall e ~ O. (2.3.1)

Furthermore, e 1--+ ff(x, d) is strictly increasing on [0, e OO [ and is constantly equal


to f6o(d)/ore ~ eOO.

PROOF. By definition and using Lemma 2.3.1, -u E ov(e) if and only if

-ue = v(e) + v*(-u) = - f;(x, d) + r(u) <==> f;(x, d) = re(u) ,

which exactly means that u E Ue. The rest follows from the conclusions of §2.2. 0

Figure 2.3.1 illustrates the graph of e 1-+ f:


(x, d) in the example of Fig. 2.2.1. Since
the derivative of e 1-+ qe(t) is l/t, a formal application of Theorem VI.4.4.2 would give
directly (2.3.2); but the assumptions of that theorem are hardly satisfied - and certainly not
for e ~ eOO.

f~(x,d)

t,:.(d) =31------:::ao'I"'"-----
f'(x,d) =1

Fig. 2.3.1. The e-directional derivative as a function of e

Remark 2.3.3 Some useful formulae follow from (2.3.2): whenever Te -::/= 0, we have

1'/ - e
I
fT/(x,d) ~ fe(x,d)
I
+- - for all tETe,
t
1'/ - e
fe(x, d) + -.- - 0(1'/ - e) ifl'/ ~ e,
I I
fT/(x, d) =
te
d d 1'/ - e . 1'/
I
fT/(x, ) = Je(x, )
~I
+- -- 0(1'/ - e) If ~ e,
!e
where!e and te are respectively the smallest and largest elements of Te, and the remainder
terms 0(·) are nonnegative (of course,!e = te except possibly for countably many values of
e). We also have the integral representation (1.4.2.6)

f:(x, d) = !'(x, d) + foe Uada for all 0 ~ e < eOO . o


112 XI. Approximate Subdifferentials of Convex Functions

A natural question is now: what happens when e ,!.. o? We already know that
fi(x, d) ~ f'(x, d) but at what speed? A qualitative answer is as follows (see also
Example 2.1.3).

Proposition 2.3.4 Assuming -00 < f'(x, d), there holds

· fi(x,d)-f'(x,d 1 [0 ]
11m
e.J..o e
= ""l
t
E ,+00,

PROOF. We know from (2.2.5) and (2.3.2) that av(O) = -u


o =] - 00, -llt l ], so
0+ v (0) = -1 I t l and everything comes from the elementary results of §1.4.2. 0

For fixed x and d, use the notation of Remark 2.2.4 and look again at Fig. 2.1.2, with
the results of the present section in mind: it is important to meditate on the correspondence
between the horizontal set of stepsizes and the vertical set of f -decrements.
To any stepsize t ~ 0 and slope (s, d) ofa line supporting epiql at (t, qI(t», is associated
a value
8r,s := f(x) - 1(0) = f(x) - f(x + td) + t(s, d) ~ o. (2.3.2)
Likewise, to any decrement 8 ~ 0 from f (x) is associated a stepsize te ~ 0 via the slope
ff(x, d). This defines a pair of multi functions t t---+ -ar(llt) and 8 t---+ Te , inverse to each
other, and monotone in the sense that

for tETe and t' E Te" 8 > 8' ¢=> t > t' .

To go analytically from t to 8, one goes first to the somewhat abstract set of inverse stepsizes
u, from which e is obtained by the duality correspondence of Lemma 2.3.1. See also the lower
part of Fig. 2.2.1, for an instance of a mapping T (or rather its inverse).

To finish this section, we give an additional first-order development: there always


holds the estimate

f(x + td) = f(x) + tf'(x, d) + o(t) ,


and various ways exist to eliminate the remainder term o(t): one is (1.3.7); the mean-
value Theorem VI.2.3.3 is more classical; here is a third one.

Proposition 2.3.5 With the hypotheses and notations given so far, assume t l < +00
(i.e. f is not affine on the whole half-line x + JR.+d). For t > 0, there is a unique
e(t) E [0, e OO [ such that

f(x + td) = f(x) + tf:(t) (x, d). (2.3.3)

PROOF. We have to solve the equation in e: v(e) =


-q(t). In view of the second part
of Theorem 2.3.2, this equation has a unique solution provided that the right-hand
side -q(t) is in the interval
3 Calculus Rules on the Approximate Subdifferential 113

(v(a): 0:::;;a<8°O}=-[I'(x,d),I~(d)[.

As a result, the unique 8(t) of (2.3.3) is

0 for t E ]0, tl],


{
8(t) = (-V)-I(q(t» for t > tl. o

Figure 2.3.2 emphasizes the difference between this 8(t) and the 8t,s defined in
(2.3.2).

f(x+td)
--~------------~---------------
f(x)
gr <p
f(x)-£(t)

f(x)-Et,s -··" . ·. . . .~""::-::::.::::l slope q(t)


Fig.2.3.2. The numbers s(t) and Sl,s

3 Calculus Rules on the Approximate Subdifferential

In §VI.4, we developed a calculus to compute the subdifferential of a finite-valued


convex function I constructed from other functions Ii. Here, we extend the results
of §VI.4 to the case of ael, with 8 ~ 0 and I E ConvJRn . Some rudimentary such
calculus has already been given in Proposition 1.3 .1.
The reader may have observed that the approximate subdifferential is a global
concept (in contrast to the exact subdifferential, which is purely local): for 8 arbitrarily
large, aeI (x) depends on the behaviour of I arbitrarily far from x. It can therefore
be guessed that ae I has to depend on the global behaviour of the Ii'S, while aI (x)
a
depends exclusively on the Ii (x) 'so It turns out that the conjugacy operation gives a
convenient tool for taking this global behaviour into account. Indeed, knowing that
I results from some operation on the Ii'S, the characterization (1.2.1) in terms of
1* shows that the whole issue is to determine the effect of the conjugacy on this
operation: it is the set of calculus rules of §X.2 that is at stake.
Recall that all the functions involved are in Conv JR n .

3.1 Sum of Functions

From Theorem X.2.3.1, the conjugate of a sum of two functions II + h is the closure
of the infimal convolution It t g.
Expressing ae(iJ + h) (s) will require an expres-
114 XI. Approximate Subdifferentials of Convex Functions

sion for this infimal convolution, which in turn requires the following basic assump-
tion:
When s E dom(fl + 12)*'1 (fl + h)*(s) = !t(PI) + g(P2)
(3.1.1)
for some Pi satisfying PI + P2 = s .

This just expresses that the inf-convolution of !t and 12* is exact at s = PI + P2:
the couple (Ph h) actually minimizes the function (Ph P2) ~ !t(PI) + g(P2).
Furthermore, we know (Theorem X.2.3.2) that this property holds under various
conditions on II and 12; one of them is

ridom/l nridomh #- 0, (3.1.2)

which is slightly more stringent than the minimal assumption

dom II n dom 12 #- 0 [¢::::::> II + h ~ +00].


Theorem 3.1.1 For any 8 ~ 0 and x E dom(fl + h) = dom II n dom h. there
holds

8d/l + h)(x) :::> u {8eJI (x) + 8e2 h(x) : 8i ~ 0, 81 + 82 ~ 8} (3.1.3)

with equality under assumption (3.1.1),for example if (3.1.2) holds.

PROOF. Let 8 ~ 0 be given. For arbitrary nonnegative 810 82 with 81 + 82 ~ 8, Defini-


tion 1.1.1 clearly implies

8eJI (x) + 8e2 h(x) C 8e (fl + h)(x).


Conversely, take s E 8e (fl + h)(x), i.e.
(fl + 12)*(S) + (fl + h)(x) - (s, x) ~ 8 . (3.1.4)

This s is therefore in dom(fl + 12)* and we can apply (3.1.1): with the help of some
PI and P2, we write (3.1.4) as 81 + 82 ~ 8, where we have set

8i := N(Pi) + Ii (x) - (Pi, x) for i = 1,2.


We see that Pi E 8ej Ii (x) for i = I, 2, and the required converse inclusion is proved.
o
Naturally, if 1= LI=I Ii, the right-hand side in (3.1.3) becomes
U {LI=I 8e;!i(X) : 8i ~ 0, LI=I 8i ~ 8}. (3.1.5)

Since 8e l(x) increaseswith8, the constraints Li 8i ~ 8 can be replaced by Li 8i = 8


in (3.1.3) and (3.1.5). Also, when 8 = 0, there is only one possibility for 8i in (3.1.3):

Corollary 3.1.2 For x E dom II n dom h. there holds

with equality under, for example, assumption (3.1.2). o


3 Calculus Rules on the Approximate Subdifferential 115

To emphasize the need of an assumption such as (3.1.2), consider in ]R2 the


Euclidean balls C] and C2 , of radius 1 and centered respectively at (-1,0) and (1,0):
they meet at the unique point x = (0,0). Then take for Ii the indicator function of
Cj, so I is the indicator function I{oJ' We have al(O) = ]R2, while a/] (0) + ah(O) =
]R x {OJ.

Example 3.1.3 (£-Normal Sets to Closed Convex Polyhedra) Let C be a closed


convex polyhedron described by its supporting hyperplanes:

Hj-:={XE]Rn: (sj,x):::;rj} forj=l, ... ,m,


(3.1.6)
C:= nHj - = {x E]Rn : (Sj,x) :::;rj for j = 1, ... ,m}.
Its indicator function is clearly
m
Ie = L)H:-'
j=] J

Let us compute the £-subdifferential of this function, i.e. the £-normal set of Defini-
tion 1.1.3. The approximate normal set to Hj- has been given in Example 1.2.5, we
obtain for x E C

(3.1.7)

where we have set Cj(x) := rj - (Sj, x) ~ 0. In fact, Ie is a sum of polyhedral


functions, and Theorem X.2.3.2 - with its qualification assumption (X.2.3.Qj) - does
imply (3.1.1) in this case.
Still for x E C, set K(x) := maxj Cj(x). Clearly enough

K(x) co{s], ... , sm} C Ne,e(x)

(and the normal cone Nc(x) can even be added to the left-hand set). Likewise, set
k(x):= minj Cj(x). Ifk(x) > 0,

Ne,e(x) C kfu co{s], ... ,sm}· o


Example 3.1.4 (Approximate Minimality Conditions) Let be given a convex function f :
IRn -+ IR and a nonempty closed convex set C; assume that
Ie := inf {f(x) : x E C} (3.1.8)

is not -00. The e-minimizers of f on C are those x E C such that f(x) :::;; lc + e; clearly
enough, an e-minimizer is an x such that (remember Theorem 1.1.5)

(f + Ie)(x):::;; inf(f + Ie)


JRn
+e or equivalently 0 E os(f + Ie)(x).

Here domf = ]Rn: we conclude from Theorem 3.1.1 that an e-minimizer is an x such
that
o E oaf(x) + Ne,s-a(x) for some a E [0, e],
i.e. f has at x an a-subgradient whose opposite lies in the (e - a)-normal set to C. The
situation simplifies in some cases.
116 XI. Approximate Subdifferentials of Convex Functions

- Set 8 = 0 to obtain the standard minimality condition of (VII. 1. 1.3):


x solves (3.1.8) {=} -of (x) n Nc(x) =10.
- Another case is when C = (xo}+H is an affine manifold. Then Nc.p (x) = H.l forall,8 ~ 0
and oaf(x) increases with a: our 8-minimality condition becomes -oef(x) n H.l =10.
-Also, when f = (so,,) is linear, oaf(x) = of (x) = {so} for all a ~ 0, while Nc.p(x)
increases with,8; so our e-minimality condition is: -so E Nc,s(x), a triviality if we replace
Nc,s(x) by its definition (1.1.6).
- If C is a closed convex polyhedron as in (3.1.6), its e-nonnal set can be specified as in
(3.1. 7). Take for (3.1.8) a linear programming problem

min{(so,x): (sj,x):(rjforj=I •...• m}.

An 8-minimizer is then an x E C for which there exists JL = (JLl •...• JLm) E R m such that

2:J=1 JLjSj + So = 0, )
2:J=1 JLj'j ~ (so. x) :( e,
JLj ~ 0 for J = 1•...• m. o
We leave it as an exercise to redo the above examples when

C := {x E K : (Sj. x) = rj for j = 1•... , m}

is described in standard form (K being a closed convex polyhedral cone, say the nonnegative
orthant).

Remark 3.1.5 In the space IR n ) x IR n2 equipped with the scalar product of a product-
space, take a decomposable function:

Because ofthe calculus rule X.1.3.1(ix), the basic assumption (3.1.1) holds automat-
ically, so we always have

but beware that this set is not a product-set in IR n ) x IR n2 , except for e = O. 0

3.2 Pre-Composition with an Affine Mapping

Given g E ConvlRm and an affine mapping A : IRn -+ IRm (Ax = Aox + Yo E IRm
with Ao linear), take f := goA E ConvlRn. As in §3.1, we need an assumption,
which is in this context:

When S E dom(g 0 A)*, I (g 0 A)*(s) = g*(p) - (p, Yo)


(3.2.1)
for some p such that A6' p = S

«(', .) will denote indifferently the scalar product in IR n or IRm). As was the case with
(3.1.1), Theorem X.2.2.1 tells us that the above p actually minimizes the function
3 Calculus Rules on the Approximate Subdifferential 117

p ~ g* (p) - (p, Yo) on the affine manifold of equation At p = s. Furthermore we


know (Theorem X.2.2.3) that (3.2.1) holds under various conditions on f and Ao; one
of them is
(3.2.2)
Once again, note that removing the words "relative interior" from (3.2.2) amounts to
assuming goA 1= +00.

Theorem 3.2.1 Let g and A be defined as above. For all e ~ 0 and x such that
Ax E dom g, there holds

(3.2.3)

with equality under assumption (3.2.1),for example if (3.2.2) holds.

PROOF. Fix x such that Ax E domg and let p E aeg(Ax) c IRm :

g(z) ~ g(Ax) + (p, z - Ax) - e for all z E IRm .

Taking in particular z = Ay with y describing IRn :

g(Ay) ~ g(Ax) + (A~p, y - x) - e for all y E IRn ,

where we have used the property A(y - x) = Ao(Y - x). Thus we have proved that
Atp E ae(g 0 A)(x).
Conversely, lets E ae(g 0 A)(x), i.e.

(g 0 A)*(s) + (g 0 A)(x) - (s, x) :::;; e. (3.2.4)

Apply (3.2.1): with the help of some p such that Atp = s, (3.2.4) can be written
e ~ g*(p) - (p, Yo) + g(Ax) - (p, Aox) = g*(p) + g(Ax) - (p, Ax) .

This shows that P E aeg(Ax). Altogether, we have proved that our sis in Ataeg(Ax).
o

Naturally, only the linear part of A counts in the right-hand side of (3.2.3): the
translation is taken care ofby Proposition X.1.3. l(v).
As an illustration of this calculus rule, take Xo E domg, a direction d =f. 0, and
compute the approximate subdifferential of the function

IR 3 t ~ qJ(t) := g(xo + td) .

If Xo + IRd meets ri dom g, we can write


aeqJ(t) = (aeg(xo + td), d) for all t E domqJ.
118 XI. Approximate Subdifferentials of Convex Functions

3.3 Image and Marginal Functions

We recall that, for g ECony lRm and A linear from lRm to lRn , the image of g under
A is the function Ag defined by
lRn 3 x H- (Ag)(x) := inf {g(y) : Ay = x}. (3.3.1)
Once again, we need an assumption for characterizing the 8-subdifferentia1 of Ag,
namely that the infimum in (3.3.1) is attained "at finite distance". A sufficient assump-
tion for this is
ImA* nridomg* =1= 0, (3.3.2)
which implies at the same time that Ag E Cony lRn (see Theorem X.2.2.3). As already
seen for condition (X.2.2.Q.iii), this assumption is implied by
g~(d) > 0 for all nonzero dE Ker A.
Theorem 3.3.1 Let 8 ~ 0 and x E dom Ag = A (dom g). Suppose that there is some
y E lR m with Ay = x and g(y) = Ag(x);for example assume (3.3.2). Then
8£(Ag)(x) = {s E lRn : A*s E 8£g(y)}. (3.3.3)
PROOF. To say that s E 8£(Ag)(x) is to say that

(Ag)*(s) + g(y) - (s, Ay) ~ 8,

where we have made use of the existence and properties of y. Then apply Theo-
rem X.2.1.1: (Ag)* = g* 0 A*, so
S E 8£(Ag)(x) {::=} g*(A*s) + g(y) - (A*s, y) ~ 8. 0

This result can of course be compared to Theorem VI.4.5.1: the hypotheses are
just the same - except for the extended-valuedness possibility. Thus, we see that the
inverse image under A * of 8£g(yx) does not depend on the particular Yx optimal in
(3.3.1).
We know that a particular case is the marginal function:
lRn 3 x H- I(x) := inf {g(x, z) : Z E lR P }, (3.3.4)
where g E Conv(lRn x lR P ). Indeed, I is the image of g under the projection mapping
from lRn+ P to lRn defined by A (x, z) = x. The above result can be particularized to
this case:
Corollary 3.3.2 With g E Conv(lRn x lR P ), let g* be associated with a scalar product
preserving the structure oflRm = lRn x lRP as a product space, namely:
(3.3.5)
and consider the marginal function I of (3.3.4). Let 8 ~ 0, x E dom I; suppose that
there is some Z E lR P such that g(x, z) = I(x); Z exists for example when
3so E lRn such that (so, 0) E ridomg* . (3.3.6)
Then
8ef(x) = {s E lRn : (s,O) E 8£g(x, z)}. (3.3.7)
3 Calculus Rules on the Approximate Subdifferential 119

PROOF. Set A : (x, z) t-+ x. With the scalar product (3.3.5), A* : IR n ---+ IRn x IRP
is defined by A*s = =
(s, 0), ImA* IR n x {OJ, so (3.3.6) and (3.3.7) are just (3.3.2)
and (3.3.3) respectively. 0

3.4 A Study of the Infimal Convolution

As another example of image functions, consider the infimal convolution:

where I, and !2 are both in Conv IRn. With m = 2n, and 1R2n being equipped with
the Euclidean structure of a product-space, this is indeed an image function:

(s" S2), (y" Y2)hn := (s" y,) + (S2' Y2) , }


g(y" Y2) := I, (y,) + !2(Y2); A(y" Y2) := Yt + Y2 , (3.4.1)
g*(s" S2) = !t*(s,) + g(S2); A*s = (s, s).

Theorem 3.4.1 Let 8 ~ 0 and x E dom(f, ~ h) = dom I, + dom fz. Suppose that
there are y, and Y2 such that the inf-convolution is exact at x = y, + Y2; this is the
case for example when
ridomlt n ridomg =f. 13. (3.4.2)
Then

ae(f, ~ !2)(x) = u {ae.f, (Yt) n ae2 !2(Y2) : 8i ~ 0, 8, + 82 ~ 8} . (3.4.3)

PROOF. We apply Theorem 3.3.1 to g and A of (3.4.1). First of all,

dom g* = dom It x dom 12*

so (Proposition III.2.UI)

ridomg* = ri dom It x ri dom 12*


and (3.3.2) means:

3s E IRn such that (s, s) E ri dom It x ri dom g


which is nothing other than (3.4.2). Now, with (y" Y2) as stated and s E IR n , set

8j := /;*(s) + !;(Yi) - (s, Yi) ~ 0 for i = 1,2, (3.4.4)

so that s E aeJi (Yi) for i = 1,2. Particularizing (3.3.3) to our present situation:

which, in view of the definitions (3.4.1), means


120 XI. Approximate Subdifferentials of Convex Functions

With (3.4.4), this is just (3.4.3). o

As with Theorem 3.1.1, nothing is changed if we impose 81 + 82 = 8 in (3.4.3).


The particular value 8 = 0 in the above result brings

aUI t h)(x) = all (YI) n ah(Y2)


(3.4.5)
when the inf-convolution is exact at x = YI + Y2 .
Some more information, not directly contained in (3.4.3), can also be derived: a sort
of converse result ensuring that the inf-convolution is exact.

Proposition 3.4.2 With I := II t h. consider Yi E dom Ii for i = 1, 2 and


x := YI + Y2 (E dom f). Then

If all (YI) n a12 (Y2) =f:. 0, the inf-convolution is exact at x = YI + Y2 and equality
holds in (3.4.6).

It(s) + II (YI) + g(s) + h(Y2) = (s, YI + Y2) = (s, x). (3.4.7)

Then, using It + g = (II t 12)* (Corollary X.2.l.3) and the definition of an


inf-convolution,
UI t h)*(s) + UI t h)(x) :( (s, x) ;

in view of the Fenchel inequality (X. 1. l.3), this is actually an equality, i.e. SEaI (x).
Now use this last equality as a value for (s, x) in (3.4.7) to obtain

i.e. the inf-convolution is exactatx = YI + Y2; equality in (3.4.6) follows from (3.4.5).
o

In summary, fix x E dom(fl t h) and denote by H C Rn x Rn the hyperplane ofequation


YI + Y2 =x. When (YI, Y2) describes H, the set D(YI, Y2):= a/I(Y]) nah(Y2) assumes at
most two values: the empty set and a(fl t h)(x). Actually, there are two possibilities:
- either D(YI, Y2) = 0 for all (YI, Y2) E H;
- or D(YI. Y2) =I 0 for some (YI, Y2) E H. This implies a(fl t h)(x) =I 0 and then, we have
for all (YI. Y2) E H:

D(YI. Y2) =I 0 {:=} D(YI. Y2) = a(fl t h)(x) {:=}


{:=} the inf-convolution is exact at x = YI + Y2 .
3 Calculus Rules on the Approximate Subdifferential 121

Remark 3.4.3 Beware that D may be empty on the whole of H while aUI i!i h)(x) ¥= 0.
Take for example
II (y) = exp Y and 12 = exp( -Y);
then II i!i 12 == 0, hence aUI i!i h) == {O}. Yet D(YI, Y2) = 0 for all (YI, Y2) E H: the
inf-convolution is nowhere exact.
Note also that (3.4.5) may express the equality between empty sets: take for example
II = 1[0, +oo[ and
R ~() {-,JYi. ifY2;?:0,
3 Y2 1-+ J2 Y2 = +00 otherwise.

It is easy to see that

UI i!i h)(x) = inf {-,JYi. : 0 ~ Y2 ~ x} = -.Ji' for all x;?: O.

This inf-convolution is exact in particular at 0 = 0 + 0, yet aUI i!i 12)(0) = 0. 0

Example 3.4.4 (Moreau-Yosida Regularizations) For e > 0 and I E Conv IRn, let
I(c) := f t (1/2 ell . 11 2), i.e.
f(c)(X) = min {f(y) + !ellx - yll2 : y E IRn};

this f(c) is called the Moreau-Yosida regularization of f. Denote by Xc the unique


minimal y above, characterized by

o E of (xc) + e(xc - x).

Using the approximate subdifferential of 1/211 . 112 (see Example 1.2.2 if necessary),
Theorem 3.4.1 gives

Oef(c)(x)= U [oe-af(xc)nB(e(X-xc),J2ea)].
o~a~e

It is interesting to note that, when e = 0, this formula reduces to


Vf(c)(x) = e(x -xc) [E of (x c)]. (3.4.8)

Thus f(c) is a differentiable convex function. It can even be said that V f(c) is Lips-
chitzian with constant e on IRn. To see this, recall that the conjugate

f(~) = f* + ~ II . 112
is strongly convex with modulus 1/e; then apply Theorem X.4.2.1.
In the particular case where e = 1 and f is the indicator function of a closed
convexsetC, f(c) = 1/2d~ (the squared distance to C) and Xc = Pc (x) (the projection
of x onto C). Using the notation of (1.1.6):

oeGd~)(x)= U [Nc,e-a(Pc(X»nB(x- p c(x),.J2a)]. 0


o~a~e
122 XI. Approximate Subdifferentia1s of Convex Functions

Another regularization is ftc] := f t cll . II, i.e.


f[c](x) := inf {f(y) + cllx - yll : y E IR n }, (3.4.9)
which will be of importance in §4.1.
Proposition 3.4.5 For f E ConvlRn , define ftc] as above; assume
c >.~:= inf {lIsll : s E 8f(x), x E dom8f}
and consider
cc[n := {x E IRn : f[c](x) = f(x)} ,
the coincidence set of f and ftc]. Then
(i) ftc] is convex, finite-valued, and Lipschitzian with Lipschitz constant c on IRn;
(ii) the coincidence set is nonempty and characterized by
Cc[n = {x E IRn : 8f(x) n B(O, c) =f. 0} ;
(iii) for all x E Cc[n, there holds
8ef[c](x) = 8ef(x) n B(O, c) .
PROOF. [(i)] There exist by assumption Xo and So E 8f(xo) with IIsoll ::;; c, hence
f ~ f(xo) + (so,, - xo) and cll·1I ~ (so,,);
f and cll . II are minorized by a common affine function: from Proposition IV.2.3.2,
ftc] E Conv IRn . Furthermore, ftc] is finite everywhere by construction.
Now take x and x' in IRn. For any TJ > 0, there is Y'1 such that
f(y'1) + cllx - Y'111 ::;; f[c](x) + TJ
and, by definition of .f[C]'
f[c] (x') ::;; f(y'1) + cllx' - Y'111 ::;; f(y'1) + cllx' - xII + cllx - Y'111
::;; f[c](x) + cllx' - xII + TJ·
This finishes the proof of (i), since TJ was arbitrary.
[(ii) and (iii)] To say that x E Cc[n is to say that the inf-convolution of f and cll·1I is
exactatx = x+O. From Theorem 3.4.2, thismeansthat8f(x)nB(0, c) isnonempty:
indeed
8(cll . 11)(0) = cB(O, 1) = B(O, c) = 8e (cll . 11)(0) .
The last equality comes from the sublinearity of c 11·11 (cf. Example 1.2.5) and implies
with the aid of Theorem 3.4.1:
8e .f[c](x) = U{8a f(x) n B(O, c) : 0::;; a::;; e} = 8ef(x) n B(O, c). 0

Just as in Example 3.4.4, we can particularize ftc] to the case where f is the
indicator function of a closed convex set C: we get ftc] = cdc and therefore
8e (cdc)(x) = NC,e(x) n B(O, c) for all x E C.
Thus, when e increases, the set K' in Fig. V.2.3.1 increases but stays confined within
B(O, c).
3 Calculus Rules on the Approximate Subdifferential 123

Remark 3.4.6 For x fj. Cc[!], there is some Ye =f:. x yielding the infimum in (3.4.9).
At such a Ye, the Euclidean norm is differentiable, so we obtain that f[e] is differentiable
as well and
x - Ye
V f[e] (x) = c Ilx _ Yell E 8f(Ye) .

Compare this with Example 3.4.4: here Ye need not be unique, but the direction Ye - x
depends only on x. 0

3.5 Maximum of Functions

In view of Theorem X.2.4.4, computing the approximate subdifferential of a supre-


mum involves the closed convex hull of an infimum. Here, we limit ourselves to the
case offmitely many functions, say f := maxj=l, ... ,p f), a situation complicated
enough. Thus, to compute f*(s), we have to solve
p
i!lf. Lajfj*(sj) ,
a},s} j=l
p
L ajsj =s , (3.5.1)
j=l
p
aj ~ 0, L aj =1 [i.e. a E Llp] .
j=l

An important issue is whether this minimization problem has a solution, and whether
the infimal value is a closed function of s. Here again, the situation is not simple when
the functions are extended-valued.

Theorem 3.5.1 Let fl' ... , fp be a finite number of convex functions from lRn to lR
and let f := maxj f); set m := min{p, n + I}. Then, S E 8ef(x) ifand only if there
exist m vectors Sj E dom fj*, convex multipliers aj, and nonnegative I>j such that

Sj E 8sj / aj f)(x) for all j such that aj > 0, }


S = L,jajsj, (3.5.2)
L,j[l>j - ajf)(x)] + f(x) ~ 1>.

PROOF. Theorem X.2.4.7 tells us that

m
f*(s) = Lajft(sj) ,
j=l

where (aj, Sj) E lR+ x dom ft form a solution of(3.5.1)(afterpossible renumbering


of the j's). By the characterization (1.2.1), the property S E 8ef(x) is therefore
equivalent to the existence of (aj, Sj) E lR+ x dom fj* - more precisely a solution of
(3.5.1) - such that
124 Xl. Approximate Subdifferentials of Convex Functions

m m
"Lajfj*(Sj) + f(x) ~ S + "Laj(sj,x) ,
j=1 j=1
which we write
m m
"Laj[fj*(Sj) + Jj(x) - (Sj' x)] ~ S + "LajJj(x) - f(x). (3.5.3)
j=1 j=1
Thus, if S E ad(x), i.e. if(3.5.3) holds, we can set

Sj := aj[f/(sj) + fj(x) - (Sj, x)]


so that Sj E aSj/ajJj(x) ifaj > 0: (Sj, aj, Sj) are exhibited for (3.5.2). Conversely,
if(3.5.2) holds, multiply by aj each of the inequalities
*
fj (Sj) + Jj(x) - Sj
(Sj' x) ~ - ;
aj
add to the list thus obtained those inequations having aj = 0 (they hold trivially!);
then sum up to obtain (3.5.3). D

It is important to realize what (3.5.2) means. Its third relation (which, incidentally, could
be replaced by an equality) can be written
m
~)ej + ajej) ~ e, (3.5.4)
j=1
where, for each j, the number
ej := f(x) - fj(x)
is nonnegative and measures how close ij comes to the maximal f at x. Using elementary
calculus rules for approximate subdifferentials, a set-formulation of(3.5.2) is

ad(x) = U {l:J=1 a£/ajij)(x) : a E Llm, l:1=I(ej +ajej) ~ e} (3.5.5)


(remember that the e-subdifferential of the zero function is identically {O}). Another obser-
vation is as follows: those elements (aj, Sj) with aj = 0 do not matter and can be dropped
fromthecombinationmakingups in (3.5.2). Then set 1)j := ej/aj to realize that S E ad(x)
if and only ifthere are positive aj summing up to 1, 1)j ~ 0 and Sj E a'ljf(x) such that

Laj(1)j+ej)~e and Lajsj=s. (3.5.6)


j j
The above formulae are rather complicated, even more so than (3.1.5) corresponding to
a sum - which was itself not simple. In Theorem 3.5.1, denote by
J£(x) := {j : ij(x) ~ f(x) - e}
the set of e-active indices; then we have the inclusion
UjEJ,(X)aij(x) c ad(x) ,
which is already simpler. Unfortunately, equality need not hold: with n = 1, take fl (~) = ~,
h (~) = -~. At ~ = 1, the left-hand side is {I} for all e E [0, 2[, while the right-hand side
has been computed in §l.l: ad(l) = [1- e, 1].
In some cases, the exact formulae simplify:
3 Calculus Rules on the Approximate Subdifferential 125

Example 3.5.2 Consider the function f+ := max{O, f}, where f : lRn --+ lR is convex. We
get
as(r)(x) = u {a8(af)(X) : 0::;; a::;; 1, 8 + rex) - af(x) ::;; 8} .
Setting 8(a) := 8 - rex) + af(x), this can be written as

with the convention ard(x) = 0 if1} < o. o

Another important simplification is obtained when each fJ is affine:

Example 3.5.3 (piecewise Affine Functions) Take

f(x):=max{(sj,x}+bj: j=I, ... ,p}. (3.5.7)

Each 8j /arsubdifferential in (3.5.2) is constantly {Sj}: the 81'S play no role and can be
eliminated from (3.5.4), which can be used as a mere definition

o [::;; L]=I 8j] ::;; 8 - L)=I ajej .


Thus, the approximate subdifferential ofthe function (3.5.7) is the compact convex polyhedron

ad(x) = {L)=I ajsj : a E..1 p , L)=I ajej ::;; 8} .


In this formulation, ej = f(x) - (Sj' x) - bj. The role of ej appears more explicitly if
the origin of the graph-space is carried over to (x, f(x» (remember Example VI.3.4): f of
(3.5.7) can be alternatively defined by

lRn 3 Y 1-+ fey) = f(x) + max {-ej + (Sj, y -x) : j = 1, ... , pl·
The constant term f(x) is of little importance, as far as subdifferentials are concerned.
Neglecting it, ej thus appears as the value at x (the point where asf is computed) ofthe j'll
affine function making up f.
Geometrically, as f (x) dilates when 8 increases, and describes a sort of spider web with
af(x) as "kernel". When 8 reaches the value maxj ej' ad(x) stops at CO{SI, ... , Sp}. 0

Finally, if 8 = 0, (3.5.4) can be satisfied only by 8j = 0 and ajej = 0 for all j. We thus
recover the important Corollary VI.4.3.2. Comparing it with (3.5.6), we see that, for larger ej
or 1}j, the 1}rsubgradient Sj is "more remote from af(x)", and as a result, its weightaj must
be smaller.

3.6 Post-Composition with an Increasing Convex Function

Theorem 3.6.1 Let f : ~n --+ ~ be convex, g E Cony ~ be increasing, and assume


the qualification condition f (~n) n int dom g =1= 0. Then,for all x such that f (x) E
domg,
s E as(g 0 f)(x) {:::::::}
381, 82 ~ 0 and a ~ 0 such that }
(3.6.1)
81 + 82 = 8, S EaSl (af)(x), a E as2 g(f(x» .
126 XI. Approximate SubdiiIerentials of Convex Functions

PROOF. [=*] Let a ~ 0 be a minimum of the function 1/Is in Theorem X.2.5.1; there
are two cases:
(a) a = O. Because domf = lRn , this implies s = 0 and (g 0 f)*(0) = g*(O). The
characterization of s = 0 E oe (g 0 f) (x) is

g*(O) + g(f(x» ~ 8, i.e. 0 E 0eg(f(x» .

Thus, (3.6.1) holds with 82 = 8,81 = 0 - and a = 0 (note: o(Of) == {OJ because f
is finite everywhere).
(b) a > O. Then (g 0 f)*(s) = aj*(s/a) + g*(a) and the characterization of
s E oe(g °f)(x) is

af*(s/a) + g*(a) + g(f(x» - (s, x) ~ 8,

i.e.
(af)*(s) + af(x) - af(x) + g*(a) + g(f(x» - (s, x) ~ 8.

Split the above left-hand side into

(af)*(s) + af(x) - (s, x) =: 81 ~ 0,


g*(a) + g(f(x» - af(x) =: 82 ~ 0,

which can be enunciated as: s E oel (af)(x), a E oe2g(f(x», 81 + 82 ~ 8.


Because, once again, 7/-subdifferentials increase with 7/, (3.6.1) is established.
[<=] (3.6.1) means:
a[f(y) - f(x)] ~ (s, y - x) - 81 for all y E lRn ,
g(r) ~ g(f(x» + a[r - f(x)] - 82 for all r E lR,

hence, with r = f(y):

g(f(y» ~ g(f(x» + (s, y - x) - 8 for all y E lRn . o


As an application, consider the set

C := {x E lRn : c(x) ~ O} with c: lRn ~ lR convex. (3.6.2)

Writing Ie as the compositionI]_oo,o]OC, the 8-normal setNe,e(x) of Definition 1.1.3


can be characterized in terms of approximate subdifferentials of c at x. Theorem 3.6.1
is valid under our qualification condition c(lRn)n ]-00, O[ oF 0, in which we recognize
Slater's assumption. Furthermore, the approximate subdifferential ofI]-oo,o] = g has
been computed in Example 1.2.5:

Then we obtain: s E Ne,e(x) if and only if there are nonnegative a, 8t. 82, with
81 + 82 = 8, such that
s E oel (ac)(x) and ac(x) + 82 ~ O. (3.6.3)
4 The Approximate Subdifferential as a Multifunction 127

Corollary 3.6.2 Let C be described by (3.6.2) and assume that c(xo) < 0 for some
Xo. Then
NC,e(x) = U {a8 (a c) (x) : a ~ 0, 8 ~ 0, 8 - ac(x) ~ 6} . (3.6.4)
Inparticular, ifx e bdC, i.e. ifc(x) = 0,
NC,e(x) = U{ae(ac)(x) : a ~ O}. (3.6.5)
PROOF. Eliminate 62 and let 8 := 6) in (3.6.3) to obtain the set-formulation (3.6.4).
Then remember (VI.l.3.6) and use again the monotonicity of the multifunction 8 ~
~~~W. 0

Naturally, (3.6.5) with 6 = 0 reduces to Theorem VI.1.3.5.


All the necessary material to reproduce Chap. VII is now at hand. For example, Corol-
lary 3.6.2 and various calculus rules from the present Section 3 allow the derivation ofneces-
sary and sufficient conditions for approximate minimality in a constrained convex minimiza-
tion problem. With the help of Example 3.5.2, the theory of exact penalty can be reproduced,
etc. It is interesting to note that the various (Q)-type assumptions, giving the calculus rules in
§X.2.2 and §X.2.3, are intimately related to the constraint qualification conditions of §Vll.2.
Let us conclude with a general remark: in §III.5.3, we have alluded to some calculus rules
for normal and tangent cones. They can be developed in a rigorous manner, by an application
of the present calculus to functions of the form /j = lej' All these exercises are left to the
reader.

4 The Approximate Subdifferential as a Multifunction

4.1 Continuity Properties of the Approximate SubditTerential

We will see in this section that the multifunction ae f is much more regular when 6 > 0
than the exact subdifferential, studied in §VI.6.2. We start with two useful properties,
stating that the approximate subdifferential (e, x) ~ aef(x) has a closed graph,
and is locally bounded on the interior of dom f; see §A.5 for the terminology.
Proposition 4.1.1 Let {(ek> Xk, Sk)} be a sequence converging to (e, x, s), with sk e
aekf(Xk) for all k. Then s e aef(x).
PROOF. By definition,
f(y) ~ f(Xk) + (Sk> y - Xk) - 6k for all y e ]Rn .
Pass to the limit on k and use the lower semi-continuity of f. o

Proposition 4.1.2 Assume intdom f '" 0; let 8 > 0 and L be such that f is Lip-
schitzian with constant L on some ball B(x, 8), where x e intdomf. Then, for all
8' < 8,
e
IIslI ~ L + 8 _ 8' (4.1.1)

whenever s e aef(y), with y e B(x, 8').


128 XI. Approximate Subdifferentials of Convex Functions

PROOF. We know (Theorem N.3.1.2) that f is locally Lipschitzian on intdomf, so


we can take x, 8, L as stated. To prove (4.1.1), take 8', y ands as stated, assume s =f:. 0
and set z := y + (8 - 8')s/lIsll in the definition

f(z) ~ f(y) + (s, z - y) - e.

Observe that z and yare in B(x, 8) and conclude

L(8 - 8') ~ (8 - 8')lIsll- e. o


a
As a result, the multifunction e f is outer semi-continuous, just as is the exact
subdifferential. But for e > 0, it also enjoys fairly strong continuity properties. We
first assume that f is finite everywhere, and then we consider the general case via the
regularized version f ~ ell . II of f. In the result below, recall from Theorem V.3.3.8
that the Hausdorff distance between two (nonempty) compact convex sets A and B
has the expression

Theorem 4.1.3 Let f: Rn -+ R be a convex LipschitzianJunction on Rn. Then there


exists K > 0 such that,for all x, x' in R n and e, e' positive:

'} (lix
LlH(aef(x), ae' f(x'» ~ --:-----{K
mm e,e
- x'il + Ie - e'I). (4.1.2)

PROOF. With d of norm 1, use (2.1.1): for any TJ > 0, there is tl1 > 0 such that

(4.1.3)

where we have used the notation qe(x, t) for the approximate difference quotient. By
assumption, we can let 8 -+ +00 in Proposition 4.1.2: there is a global Lipschitz
constant, say L, such that f;(x, d) ~ L. From

we therefore obtain
1 2L + TJ
-~--
tl1 e
Then we can write, using (2.1.1) and (4.1.3) again:

f;,(x', d) - f;(x, d) - TJ ~ qe'(x', t l1 ) - qe(x, tl1 ) =


f(x' + tl1 d) - f(x + t l1 d) + f(x) - f(x') + e' - e
= tl1
~

~
2Lllx'
-
xii + Ie' - el ~2:7](2Lllx'-xll+le'-el).
L
tl1

Remembering that TJ > 0 is arbitrary and inverting (x, e) with (x', e'), we do
obtain
4 The Approximate Subdifferential as a Multifunction 129

If.e', (x', d) - f;(x, d)1 ~ mm


.2(L
8,8
') (2Lllx' - xII + Ie' - el)
so the theorem is proved, for example with K = max{2L, 4L2}. o
This result definitely implies the inner semi-continuity of (x, e) 1---+ aef(x) for
a Lipschitz-continuous f. In particular, for fixed e > 0,

aef(y) c ae!(x) + lIy - xIlB(O, f) for all x and y,

a property already illustrated by Fig. 1.1.2. Remember from §VI.6.2 that the multi-
function af need not be inner semi-continuous; when e = 0, no inclusion resembling
the above can hold, unless x isjixed.
A local version of Theorem 4.1.3 can similarly be proved: (4.1.2) holds on the
compact sets included in int dom f. Here, we consider an extension of the result to
unbounded subdifferentials. Recall that the Hausdorff distance is not convenient for
unbounded sets. A better distance is obtained by comparing the bounded parts of
closed convex sets: for c ~ 0, we take

Corollary 4.1.4 Let f E Conv IRn. Suppose that S C dom f and f. > are such
that af(x) n B(O, 0 #- 0forall XES. Then,forall c > ~ there exists Kc such that,
°
for all x, x' in Sand e, e' positive,

mm 8,E,}(lIx' -
LlH , c(ae!(x), ae,f(x'» ~ ~{ xII + Ie' - el).

PROOF. Consider fic] := f ~ cll . II and observe that we are in the conditions of
Proposition 3.4.5: f[c] is Lipschitzian on IRn and Theorem 4.1.3 applies. The rest
follows because the coincidence set of f[c] and f contains S. 0

Applying this result to an f finite everywhere, we obtain for example the following
local Lipschitz continuity:

Corollary 4.1.5 Let f : IRn ~ IR be convex. For any 8 ~ 0, there is K8 > Osuch
that
K8
LlH(aef(x), ae!(x'» ~ -lix - x' II for all x and x' in B(O, 8).
e
a
PROOF. We know from Proposition 4.1.2 that e f is bounded on B(O, 8), so the result
is a straightforward application of Corollary 4.1.4. 0

4.2 Transportation of Approximate Subgradients

In §VI.6.3, we have seen that af(x) can be constructed by piecing together limits of
subgradients along directional sequences. From a practical point of view, this is an
130 XI. Approximate Subdifferentials of Convex Functions

important property: for example, it is the basis for descent schemes in nonsmooth opti-
mization, see Chap. IX. This kind of property is even more important for approximate
subdifferentials: remember from §II.l.2 that the only information obtainable from f
is a black box (Ul), which computes an exact subgradient at designated points. The
concept of approximate subgradient is therefore of no use, as long as there is no "black
box" to compute one. Starting from this observation, we study here the problem of
constructing aef(x) with the sole help of the same (Ul).

Theorem 4.2.1 (A. Brøndsted and R.T. Rockafellar) Let be given f ∈ Conv ℝⁿ, x ∈ dom f and ε ≥ 0. For any η > 0 and s ∈ ∂_ε f(x), there exist x_η ∈ B(x, η) and s_η ∈ ∂f(x_η) such that ‖s_η − s‖ ≤ ε/η.

PROOF. The data are x ∈ dom f, ε > 0 (if ε = 0, just take x_η = x and s_η = s!), η > 0 and s ∈ ∂_ε f(x). Consider the closed convex function

    ℝⁿ ∋ y ↦ φ(y) := f(y) + f*(s) − ⟨s, y⟩.

It is nonnegative (Fenchel's inequality), satisfies φ(x) ≤ ε (cf. (1.2.1)), and its subdifferential is

    ∂φ(y) = ∂f(y) − {s} for all y ∈ dom φ = dom f.

Perturb φ to the closed convex function

    ℝⁿ ∋ y ↦ ψ(y) := φ(y) + (ε/η)‖y − x‖,

whose subdifferential at y ∈ dom f is (apply Corollary 3.1.2: (3.1.2) obviously holds)

    ∂ψ(y) = ∂φ(y) + (ε/η) ∂(‖· − x‖)(y) ⊂ ∂φ(y) + B(0, ε/η).

Because φ is bounded from below, the 0-coercivity of the norm implies the 0-coercivity of ψ; there exists a point, say x_η, minimizing ψ on ℝⁿ; then 0 ∈ ∂ψ(x_η) is written

    0 ∈ ∂f(x_η) − {s} + B(0, ε/η).

It remains to prove that x_η ∈ B(x, η). Using the nonnegativity of φ and optimality of x_η:

    (ε/η)‖x_η − x‖ ≤ φ(x_η) + (ε/η)‖x_η − x‖ = ψ(x_η) ≤ ψ(x) = φ(x) ≤ ε,

hence ‖x_η − x‖ ≤ η.  □

This result can be written in a set formulation:

    ∂_ε f(x) ⊂ ∩_{η>0} ∪_{‖y−x‖≤η} {∂f(y) + B(0, ε/η)}.  (4.2.1)

It says that any ε-subgradient at x can be approximated by some exact subgradient, computed at some y, possibly different from x. For y close to x (η small) the approximation may be coarse; an accurate approximation (η large) may require seeking y far from x. The value η = √ε is a compromise, which equates the deviation from x and the degree of approximation: (4.2.1) implies

    ∂_ε f(x) ⊂ ∪ {∂f(y) + B(0, √ε) : y ∈ B(x, √ε)}.

An illustrative example is the one-dimensional function f : x ↦ x + 1/x, with x = 1 and ε = 1 (which is the value ε_∞ of §2.2). Then 1 ∈ ∂_ε f(x) but 1 is nowhere the derivative of f: in the above proof, x_η is unbounded for η → +∞.
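A quick numerical check may help here. The sketch below (plain Python; the names and the crude grid search are ours, not the book's) minimizes the perturbed function ψ of the above proof for f(x) = x + 1/x, x = 1, ε = 1, and confirms that the minimizer x_η escapes to infinity as η grows, while the exact derivative at x_η approaches s = 1 within the bound ε/η of Theorem 4.2.1.

    import math

    # f(x) = x + 1/x on x > 0; s = 1 lies in the eps-subdifferential at x = 1
    # for eps = 1, but is never an exact derivative (f'(x) = 1 - 1/x**2 < 1).
    eps, x0, s = 1.0, 1.0, 1.0
    fstar_s = 0.0                        # f*(1) = sup_{y>0} (y - f(y)) = 0

    def psi(y, eta):
        # the perturbed function of the proof: phi(y) + (eps/eta)*|y - x0|
        phi = (y + 1.0 / y) + fstar_s - s * y        # equals 1/y here
        return phi + (eps / eta) * abs(y - x0)

    for eta in (4.0, 100.0, 10000.0):
        # crude grid search for the minimizer x_eta of psi
        grid = [1.0 + k * 0.01 for k in range(200000)]
        x_eta = min(grid, key=lambda y: psi(y, eta))
        s_eta = 1.0 - 1.0 / x_eta**2                 # exact subgradient at x_eta
        print(eta, x_eta, abs(s_eta - s), eps / eta)
        # |s_eta - s| matches the bound eps/eta up to grid error,
        # while x_eta grows like sqrt(eta)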
We also see from (4.2.1) that ∂_ε f(x) is contained in the closure of ∂f(ℝⁿ) - and the example above shows that the closure operation is necessary. Our next question is then: given an exact subgradient s computed somewhere, how can we recognize whether this s is in the set ∂_ε f(x) that we are interested in? The answer turns out to be very simple:

Proposition 4.2.2 (Transportation Formula) With x and x' in dom f, let s' ∈ ∂f(x'). Then s' ∈ ∂_ε f(x) if and only if

    f(x') ≥ f(x) + ⟨s', x' − x⟩ − ε.  (4.2.2)

PROOF. The condition is obviously necessary, since the relation of definition (1.1.1) must in particular hold at y = x'. Conversely, for s' ∈ ∂f(x'), we have for all y

    f(y) ≥ f(x') + ⟨s', y − x'⟩ = f(x) + ⟨s', y − x⟩ + [f(x') − f(x) + ⟨s', x − x'⟩].

If (4.2.2) holds, s' ∈ ∂_ε f(x).  □


The next chapters, dealing with numerical algorithms, will use (4.2.2) intensively. We call it the transportation formula, since it "transports" to x a given subgradient at x'. Geometrically, Fig. 4.2.1 shows that it is very natural, and it reveals an important concept:

Fig. 4.2.1. Linearization errors and the transportation formula

Definition 4.2.3 (Linearization Error) For (x, x', s') ∈ dom f × dom f × ℝⁿ, the linearization error made at x, when f is linearized at x' with slope s', is the number

    e(x, x', s') := f(x) − f(x') − ⟨s', x − x'⟩.

This linearization error is of particular interest when s' ∈ ∂f(x'); then, calling

    ℓ_{x',s'}(y) := f(x') + ⟨s', y − x'⟩

the corresponding affine approximation of f, there hold the relations

    ℓ_{x',s'} ≤ f,  ℓ_{x',s'}(x') = f(x'),  ℓ_{x',s'}(x) = f(x) − e(x, x', s').

The definition of s' ∈ ∂f(x') via the conjugate function can also be used:

    e(x, x', s') = f(x) + f*(s') − ⟨s', x⟩ ≥ 0 if s' ∈ ∂f(x').  □
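Because the coming chapters apply (4.2.2) at every iteration, a minimal computational illustration may be worthwhile. The fragment below (Python; a toy one-dimensional f of our own choosing) evaluates linearization errors and tests ε-subgradient membership via Proposition 4.2.2; for this quadratic, e(x, x', s') = (x − x')², so the test recovers the neighborhood of the forthcoming Example 4.2.7.

    # Transportation formula on f(x) = x**2 (so the exact subgradient at x'
    # is s' = 2*x'); e(x, x', s') = f(x) - f(x') - s'*(x - x').
    def f(x):
        return x * x

    def lin_error(x, xp, sp):
        # linearization error of Definition 4.2.3
        return f(x) - f(xp) - sp * (x - xp)

    x, eps = 1.0, 0.5
    for xp in (0.0, 0.6, 1.5, 2.0):
        sp = 2.0 * xp                  # the gradient of f at xp
        e = lin_error(x, xp, sp)       # here e = (x - xp)**2
        # s' is an eps-subgradient at x exactly when e <= eps (Prop. 4.2.2)
        print(xp, sp, e, e <= eps)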

The content of the transportation formula (4.2.2), illustrated by Fig. 4.2.1, is that any s' ∈ ∂f(x') is an e(x, x', s')-subgradient of f at x; and also that it is not in a tighter approximate subdifferential, i.e.

    ε < e(x, x', s') ⟹ s' ∉ ∂_ε f(x).

This latter property relies on the contact (at x') between gr ℓ_{x',s'} and gr f. In fact, a result slightly different from Proposition 4.2.2 is:

Proposition 4.2.4 Let s' ∈ ∂_η f(x'). Then s' ∈ ∂_ε f(x) if

    f(x') ≥ f(x) + ⟨s', x' − x⟩ − ε + η,

or equivalently

    e(x, x', s') + η ≤ ε.

PROOF. Proceed as for the "if"-part of Proposition 4.2.2.  □


The transportation formula gives an easy answer to the pragmatic question: "Let a subgradient be computed at some point by the black box (U1); is it an approximate subgradient at some other point?" Answer: just compare f- and ℓ-values. Returning to a more theoretical framework, we now ask: given x and ε, what are those x' such that (4.2.2) holds? This question is ambiguous when ∂f(x') is not a singleton, so we define two sets:

    V_ε(x) := {x' ∈ dom ∂f : ∂f(x') ⊂ ∂_ε f(x)},
    V̄_ε(x) := {x' ∈ ℝⁿ : ∂f(x') ∩ ∂_ε f(x) ≠ ∅}.

Equivalent definitions are

    V_ε(x) := {x' ∈ dom ∂f : e(x, x', s') ≤ ε for all s' ∈ ∂f(x')},
    V̄_ε(x) := {x' ∈ ℝⁿ : e(x, x', s') ≤ ε for some s' ∈ ∂f(x')},  (4.2.3)

and it is clear that V_ε(x) ⊂ V̄_ε(x). If f is differentiable on ℝⁿ, the two sets obviously coincide:

    V_ε(x) = V̄_ε(x) = {x' ∈ ℝⁿ : e(x, x', ∇f(x')) ≤ ε}.

The next result motivates our notation.

Proposition 4.2.5 Suppose that x ∈ int dom f. Then

(i) V_ε(x) is a neighborhood of x if ε > 0.
(ii) V̄_ε(x) is the closure of V_ε(x).
PROOF. [(i)] Apply Proposition 4.1.2: there exist δ > 0 and a constant K such that, for all x' ∈ B(x, δ) and s' ∈ ∂f(x'),

    e(x, x', s') ≤ |f(x') − f(x)| + ‖s'‖ ‖x' − x‖ ≤ 2K‖x' − x‖;

so V_ε(x) contains the ball of center x and radius min{δ, ε/(2K)}.

[(ii)] Take a point x' ∈ V̄_ε(x). We claim that

    ∂f(x + t(x' − x)) ⊂ ∂_ε f(x) for all t ∈ ]0, 1[;  (4.2.4)

see Fig. 4.2.1: we insert a point between x and x'. A consequence of (4.2.4) will be V̄_ε(x) ⊂ cl V_ε(x) (let t ↑ 1).

To prove our claim, set d := x' − x and use the function r = r_0 of (2.1.3) to realize with Remark 2.2.4 that

    {e(x, x + td, s) : s ∈ ∂f(x + td)} = −∂r(1/t) for all t > 0.  (4.2.5)

By definition of V̄_ε(x), we can pick s' ∈ ∂f(x') such that

    −∂r(1) ∋ e(x, x', s') ≤ ε.

Then we take arbitrary 1/t = u > 1 and s'' ∈ ∂f(x + td), so that −e(x, x + td, s'') ∈ ∂r(u). The monotonicity property (VI.6.1.1) or (I.4.2.1) of the subdifferential ∂r gives

    [−e(x, x + td, s'') + e(x, x', s')](u − 1) ≥ 0,

hence e(x, x + td, s'') ≤ e(x, x', s') ≤ ε, which proves (4.2.4).

There remains to prove that V̄_ε(x) is closed, so let {x_k} be a sequence of V̄_ε(x) converging to some x'. To each x_k we associate s_k ∈ ∂f(x_k) ∩ ∂_ε f(x); {s_k} is bounded by virtue of Theorem 1.1.4: extracting a subsequence if necessary, we may assume that {s_k} has a limit s'. Then, Proposition 4.1.1 and Theorem 1.1.4 show that s' ∈ ∂f(x') ∩ ∂_ε f(x), which means that x' ∈ V̄_ε(x).  □

From a practical point of view, the set V_ε(x) and the property (i) above are both important: provided that y is close enough to x, the s(y) ∈ ∂f(y) computed by the black box (U1) is guaranteed to be an ε-subgradient at x. Now, V_ε(x) and V̄_ε(x) differ very little (they are numerically indistinguishable), and the latter has a strong intrinsic value: by definition, x' ∈ V̄_ε(x) if and only if

    ∃ s' ∈ ∂_ε f(x) such that s' ∈ ∂f(x'), i.e. such that x' ∈ ∂f*(s').

In the language of multifunctions, this can be written

    V̄_ε(x) = ∂f*(∂_ε f(x)).  (4.2.6)

Furthermore the above proof, especially (4.2.5), establishes a connection between V̄_ε(x) and §2.2. First V̄_ε(x) − {x} is star-shaped. Also, consider the intersection of V̄_ε(x) with a direction d issuing from x. Keeping (4.2.3) in mind, we see that this set is the closed interval where the perturbed difference quotient q_ε is decreasing. In a word, let t_ε(d) be the largest element of T_ε = T_ε(d) defined by (2.2.2), with the convention t_ε(d) = +∞ if T_ε is empty or unbounded (meaning that the approximate difference quotient is decreasing on the whole of ℝ₊). Then

    V̄_ε(x) = {x + td : t ∈ [0, t_ε(d)], d ∈ B(0, 1)}.  (4.2.7)

See again the geometric interpretations of §2, mainly the end of §2.3.

Our neighborhoods V_ε(x) and V̄_ε(x) enjoy some interesting properties if additional assumptions are made on f:

Proposition 4.2.6
(i) If f is 1-coercive, then V̄_ε(x) is bounded.
(ii) If f is differentiable on ℝⁿ, then V_ε(x) = V̄_ε(x) and ∇f(V_ε(x)) ⊂ ∂_ε f(x).
(iii) If f is 1-coercive and differentiable, then ∇f(V_ε(x)) = ∂_ε f(x).

PROOF. In case (i), f* is finite everywhere, so the result follows from (4.2.6) and the local boundedness of the subdifferential mapping (Proposition VI.6.2.2).

When f is differentiable, the equality between the two neighborhoods has already been observed, and the stated inclusion comes from the very definition of V_ε(x).

Finally, let us establish the converse inclusion in case (iii): for s ∈ ∂_ε f(x), pick y ∈ ∂f*(s), i.e. s = ∇f(y). The result follows from (4.2.6): y ∈ V̄_ε(x) = V_ε(x).  □

Example 4.2.7 Let f be convex quadratic as in Example 1.2.2. Substitute in (1.2.5)

    ∇f(x) + Qy = Qx + b + Qy = Q(x + y) + b = ∇f(x + y)

so as to obtain the form

    ∂_ε f(x) = {∇f(x') : ½⟨Q(x' − x), x' − x⟩ ≤ ε},  (4.2.8)

which discloses the neighborhood

    V̄_ε(x) = {x' ∈ ℝⁿ : ½⟨Q(x' − x), x' − x⟩ ≤ ε}.  (4.2.9)

When Q is invertible, we have a perfect duality correspondence: V̄_ε(x) [resp. ∂_ε f(x)] is the ball of radius √(2ε) associated with the metric of Q [resp. Q⁻¹], and centered at x [resp. at ∇f(x)].

Finally note that we can use a pseudo-inverse in (4.2.9):

    x' = Q⁻(Qx' + b − b) = Q⁻[∇f(x') − b]

and obtain the form

    V̄_ε(x) = {Q⁻[∇f(x') − b] : ½⟨Q(x' − x), x' − x⟩ ≤ ε}.

This illustrates (4.2.6), knowing that ∂f*(s) = Q⁻(s − b) + Ker Q for s − b ∈ Im Q and that x + Ker Q ⊂ V̄_ε(x).  □
A final remark: we have seen in §II.2 the importance of defining a neighborhood of a given iterate x = x_k, to construct suitable directions of search. For the sake of efficiency, this neighborhood had to reflect the behaviour of f near x. Here, V̄_ε(x) is a possible candidate: interpret the constraint in (II.2.3.1) as defining a neighborhood to be compared with (4.2.9). However, it is only because ∇f and ∇f* are affine mappings that V̄_ε(x) has such a nice expression in the quadratic case. For a general f, V̄_ε(x) is hardly acceptable; (4.2.7) shows that it need not be bounded, and (4.2.6) suggests that it need not be convex, being the image of a convex set by a mapping which is rather complicated.

Example 4.2.8 Let f be defined by

    f(x) = max{0, ξ² + η² − 1} for all x = (ξ, η) ∈ ℝ².

Consider x = (1, 1), at which f(x) = 1 and ∇f(x) = (2, 2), and take ε = 1/2. Look at Fig. 4.2.2 to see how V̄_ε(x) is constructed: if f were the quadratic function ξ² + η² − 1, we would obtain the ball

    D = {(ξ, η) : (ξ − 1)² + (η − 1)² ≤ 1/2}

(see Example 4.2.7). However, f is minimal on the unit ball B(0, 1), i.e. 0 ∈ ∂f(x') for all x' ∈ B(0, 1); furthermore, 0 ∉ ∂_{1/2} f(x). In view of its definition (4.2.3), V_ε(x) therefore does not meet B(0, 1):

    V_{1/2}(x) ⊂ D \ B(0, 1).

On the other hand, it suffices to remove B(0, 1) from D, and this can be seen from the definition (4.2.3) of V̄_ε(x): the linearization error e(x, ·, ·) is left unperturbed by the max-operation defining f. In summary:

    V̄_{1/2}(x) = {(ξ, η) : (ξ − 1)² + (η − 1)² ≤ 1/2, ξ² + η² ≥ 1}.

We thus obtain a nonconvex neighborhood; observe its star-shaped character.  □

Fig. 4.2.2. A nonconvex neighborhood V̄_ε(x)
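The set description just obtained can be verified mechanically; the sketch below (Python; the sampling scheme and names are ours) tests membership in V̄_{1/2}(x) through the definition (4.2.3) - "e(x, x', s') ≤ ε for some s' ∈ ∂f(x')" - and checks it against the announced formula on a small grid.

    import itertools

    # Check the description of the neighborhood in Example 4.2.8:
    # f(xi, eta) = max(0, xi**2 + eta**2 - 1), x = (1, 1), eps = 1/2.
    x, eps = (1.0, 1.0), 0.5

    def f(p):
        return max(0.0, p[0]**2 + p[1]**2 - 1.0)

    def in_Vbar(p):
        # test 'e(x, p, s') <= eps for some s' in df(p)': df(p) is {0} inside
        # the unit ball, {2p} outside, the segment [0,1]*2p on the circle
        cands = [(2*t*p[0], 2*t*p[1]) for t in (0.0, 0.5, 1.0)]
        if p[0]**2 + p[1]**2 > 1.0:
            cands = [(2*p[0], 2*p[1])]
        elif p[0]**2 + p[1]**2 < 1.0:
            cands = [(0.0, 0.0)]
        errs = [f(x) - f(p) - s[0]*(x[0]-p[0]) - s[1]*(x[1]-p[1])
                for s in cands]
        return min(errs) <= eps

    for p in itertools.product([0.4, 0.7, 1.0, 1.3], repeat=2):
        inside_D = (p[0]-1)**2 + (p[1]-1)**2 <= 0.5
        outside_B = p[0]**2 + p[1]**2 >= 1.0
        # the claim: membership iff p is in D and outside the unit ball
        assert in_Vbar(p) == (inside_D and outside_B), p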

By contrast, another interesting neighborhood is indeed convex: it is obtained by inverting the roles of x and x' in (4.2.3). For x ∈ int dom f, define the set illustrated in Fig. 4.2.3:

    V*_ε(x) := {x' ∈ ℝⁿ : e(x', x, s) ≤ ε for all s ∈ ∂f(x)}.

Another expression is

    V*_ε(x) = ∩_{s ∈ ∂f(x)} {x' ∈ ℝⁿ : f(x') − ⟨s, x' − x⟩ ≤ f(x) + ε},

which shows that V*_ε(x) is closed and convex. Reproduce the proof of Proposition 4.2.5(i) to see that V*_ε(x) is a neighborhood of x. For the quadratic function of Example 4.2.7, V*_ε(x) and V̄_ε(x) coincide.

Fig. 4.2.3. A second-order neighborhood (gr f between the affine function x' ↦ f(x) + ⟨s, x' − x⟩ and its translate by ε)


XII. Abstract Duality for Practitioners

Prerequisites. Subdifferentials of finite convex functions (Chap. VI); minimality conditions and elementary duality theory (Chap. VII, but much less than one might think); and for the last part of the chapter, definition and elementary properties of conjugate convex functions (Chap. X).

Introduction. The subject of this chapter is by far the most important application of convex minimization, namely general decomposition in mathematical programming (more exactly: price decomposition), and dual algorithms. It can be safely ascertained that this subject nearly coincides with convex minimization, studied in Chaps. VIII and IX:

- on the one hand, algorithms for convex minimization have their best-suited field of application in decomposition, more precisely the problem of price-adjustment (another important field concerns eigenvalue optimization of a varying matrix but there, nonconvexity comes quickly into play);
- on the other hand, when decomposing a problem - more exactly: when adjusting prices via a decentralization algorithm in a (usually large-scale) optimization problem -, one is primarily minimizing a certain convex function, namely the dual function associated with the problem; and there is no way around this.

1 The Problem and the General Approach

1.1 The Rules of the Game

We consider in this chapter a constrained optimization problem characterized by a nonempty set U of admissible control variables u, an objective function φ : U → ℝ, and constraint-functions c₁, ..., c_m : U → ℝ; then we want to solve

    sup φ(u),  u ∈ U,
    c_j(u) = 0 for j = 1, ..., m,  (1.1.1)

which we will call the primal problem. As usual, a u ∈ U satisfying the constraints c_j(u) = 0 will be called feasible. Throughout this chapter, without further mention, the following assumption will be in force:

    U ≠ ∅; φ, c₁, ..., c_m are finite everywhere on U.


So far, we are not assuming any structure in U whatsoever. For example, U is not at all supposed to be (a subset of) a vector space like ℝⁿ; we will see in §1.2 how abstract, or how concrete, U can be. This implies in particular that the objective- and constraint-functions have no structure either, such as convexity, or a fortiori differentiability: in U, these words are meaningless for the moment. They will of course appear (to be solvable, an optimization problem such as (1.1.1) must enjoy some structure), but much later, and it is useful to see how far the theory can be developed in abstracto. Observe also that assuming φ and each c_j to be finite everywhere is not really a restriction: if these functions were extended-valued, U could be replaced by its intersection with their domains.
Actually, the only structure assumed concerns the sets of objective and constraint values, which are ℝ and ℝ^m respectively. Then we equip ℝ^m with the ordinary dot-product: for (λ, c) ∈ ℝ^m × ℝ^m,

    λᵀc := Σ_{j=1}^m λ_j c_j.  (1.1.2)

The associated norm will be ‖·‖. As far as the space of constraint-values is concerned, and this space only, we are therefore in the general framework of this book, with scalar products, duality, metric, and so on.

Then, denoting by c(u) := (c₁(u), ..., c_m(u)) ∈ ℝ^m the vector of constraint-values at u ∈ U, we can consider the Lagrange function or Lagrangian, defined by

    L(u, λ) := φ(u) − λᵀc(u) for all λ ∈ ℝ^m and u ∈ U.  (1.1.3)

For a given λ ∈ ℝ^m, this appears as a perturbation of the objective function of (1.1.1), which simply says the following: a violation of the constraints is accepted but then, a price λᵀc(u) must be paid when the control value is u ∈ U.

Remark 1.1.1 Naturally, (1.1.3) is the same Lagrange function that played a central role in Chap. VII. Once again, however, it is important for a better understanding to realize that L is defined here on U × ℝ^m. By contrast, Chap. VII heavily relied on the situation U = ℝⁿ - so as to define subdifferentiability of φ and c - and little on the concept of a varying λ.

With respect to Chap. VII, beware of the minus-sign in (1.1.3); it comes from the fact that we start from a maximization problem, and will be motivated later.  □

No theoretical property is assumed on the data (U, φ, c), other than the duality structure induced by (1.1.2). For practical purposes, however, we do make a heavy assumption:

Assumption 1.1.2 (Practical) We assume that the optimization problem in u

    sup {L(u, λ) : u ∈ U},  (1.1.4)_λ

where λ is fixed in ℝ^m, is considerably simpler than the primal problem (1.1.1). We will call (1.1.4)_λ the Lagrange problem associated with λ.  □

To quantify a little bit the wording "considerably simpler", we can say for example:
- the simpler (1.1.4)_λ is, the more efficient the approach of this chapter will be;
- Assumption 1.1.2 holds when an efficient methodology exists for (1.1.4)_λ, but not for (1.1.1);
- or when it costs more to solve (1.1.1) once than to solve (1.1.4)_λ 10²-10⁴ times, say;
- this still loose quantification can be slightly sharpened by saying that we will rather be in the 10²-range if (1.1.4)_λ enjoys some more theoretical properties (convexity), and rather in the 10⁴-range if not.

However vague it is, Assumption 1.1.2 is fundamental; it has to be supplemented by one more practical assumption, hardly more meaningful (but certainly fairly usual in mathematical programming):

Assumption 1.1.3 One will be content with an approximate solution of (1.1.1), i.e. with u ∈ U such that
- φ(u) is possibly slightly less than the supremal value in (1.1.1),
- and/or c(u) is possibly not exactly 0 ∈ ℝ^m.  □

When faced with a practical optimization problem, there are several ways of formulating the data (U, φ, c) of (1.1.1): constraints may be incorporated in φ via some penalty term (§VII.3.2), or its extreme version, which is the indicator function of a feasible set; they can also be considered as making up U, or as making up c, and so on. To choose a formulation adapted to our present framework, Assumptions 1.1.2 and 1.1.3 must be the central concern. The case of Assumption 1.1.2 is linked to the decomposability of (1.1.1) and will be seen more closely in §1.2; as for Assumption 1.1.3, it distinguishes two categories of constraints:
- the hard constraints, which must absolutely be satisfied by u; they will most conveniently appear in the definition of U;
- the soft constraints, for which some tolerance is accepted; they can go in the c-set.

The methods of this chapter are aimed at solving (1.1.1) with the help of (1.1.3). A black box (the black box (U1) of Fig. II.1.2.1 that keeps appearing in this book) is available which, given λ ∈ ℝ^m, solves (1.1.4)_λ and returns appropriate information. Our problem in this chapter is therefore as follows:

Problem 1.1.4 Find some suitable value of λ such that the associated Lagrange problem, as solved by the black box, provides a solution of the primal problem. The unknown λ will be called the dual variable.  □

This is the so-called coordination, or decentralization problem, because (1.1.4)_λ is usually a decomposed form of (1.1.1), see §1.2 below. From its very definition (1.1.2), λ varies in the dual of the space of constraint-values, hence 1.1.4 can also be called the "dual problem"; but we will see in §2.1 that this terminology is ambiguous.

Then comes an important point: what information is available from the black box, to be used in our search for λ? The answer depends first on the basic question: does the Lagrange problem have a solution at all? It may happen that sup L(·, λ) = +∞, or that (1.1.4)_λ has no optimal solution "at finite distance"; in either case, the black box is unlikely to return any valuable information. The exposition is much easier if this never happens, so we consider the easy situation, at least for the moment:

Assumption 1.1.5 (Temporary)
(i) For all λ ∈ ℝ^m, the Lagrange problem (1.1.4)_λ has some optimal solution.
(ii) The black box computes one such optimal solution, say u_λ ∈ U, and returns the maximal value L(u_λ, λ), together with its corresponding constraint-value c(u_λ) ∈ ℝ^m.  □

Seen from the point of view of 1.1.4, the situation is therefore as illustrated by Fig. 1.1.1: the coordinator (sometimes called the master program) chooses a price-vector λ, sends it to the black box (sometimes called the local problem), and receives in return information consisting of a vector (c, L) ∈ ℝ^m × ℝ. The duty of the coordinator is then to decide about optimality in terms of (1.1.1), and if not satisfied, to modify λ accordingly.

    [input: λ → solve the Lagrange problem at λ and obtain some solution u_λ → output: L(u_λ, λ), c(u_λ)]

Fig. 1.1.1. Useful local black box for a dual-solver

Remark 1.1.6 The temporary Assumption 1.1.5(i) is of a theoretical nature; it holds essentially if U is a compact set, on which L(·, λ) is upper semi-continuous. A particular and important case is when U is simply a finite set.

Part (ii), concerning the nature and amount of information computed by the black box, is typical and useful. We have selected it because it serves best our purpose, mainly illustrative. Other situations are possible, though:

- A very powerful black box, able to return the maximal L together with all c(u) for all optimal u's, could be conceived of; but we will see in the next sections that it usually makes little sense.
- By contrast, more meaningful would be a weaker black box, only able to compute some suboptimal solution, i.e. a u_{λ,ε} ∈ U satisfying, for some ε > 0:

    L(u_{λ,ε}, λ) ≥ L(u, λ) − ε for all u ∈ U.

This situation is of theoretical interest: an ε-optimal solution exists whenever the supremum in (1.1.4)_λ is finite; also, from a practical point of view, it demands less from the black box. This last advantage should not be over-rated, though: for efficiency, ε must be decided by the coordinator; and just for this reason, life may not be easier for the black box.
- We mention that the values of L and c are readily available after (1.1.4)_λ is solved; a situation in which only L is returned is hardly conceivable: this poor information would result in a poor coordinator anyway.  □

Let us mention one last point: L(u_λ, λ) is a well-defined number, namely the optimal value in (1.1.4)_λ; but c(u_λ) is not a well-defined vector, since it depends on the particular solution selected by the black box. We will see in the next sections that (1.1.4)_λ has a unique solution for almost all λ ∈ ℝ^m - and this property holds independently of the data (U, φ, c). Nevertheless, there may exist "critical" values of λ for which this uniqueness does not hold.

This question of uniqueness is crucial, and comes on top of the practical Assumption 1.1.2 to condition the success of the approach. If (1.1.4)_λ has a unique maximizer for each λ ∈ ℝ^m, then the dual approach is extremely powerful; difficulties begin in case of non-uniqueness, and then some structure in (U, φ, c) becomes necessary.

1.2 Examples

Practically all problems suited to our framework are decomposable, i.e.:

- U = U¹ × U² × ⋯ × Uⁿ, where each Uⁱ is "simpler" (a word that matches the practical Assumption 1.1.2); the control variables will be denoted by u = (u¹, ..., uⁿ), where uⁱ ∈ Uⁱ for i = 1, ..., n;
- φ is a sum of "individual objective functions": u = (u¹, ..., uⁿ) ↦ φ(u) = Σ_{i=1}^n φⁱ(uⁱ);
- likewise, each constraint is a sum: c_j(u) = Σ_{i=1}^n c_jⁱ(uⁱ).

Then (1.1.1) is

    sup Σ_{i=1}^n φⁱ(uⁱ),  uⁱ ∈ Uⁱ for i = 1, ..., n,
    Σ_{i=1}^n c_jⁱ(uⁱ) = 0 for j = 1, ..., m  [in short Σ_{i=1}^n cⁱ(uⁱ) = 0 ∈ ℝ^m].  (1.2.1)

In these problems, c(u) = 0 appears as a (vector-valued) coupling constraint, which links the individual control variables uⁱ. If this constraint were not present, (1.2.1) would reduce to n problems, each posed in the simpler set Uⁱ, and might thus become "considerably simpler". This is precisely what happens to the Lagrange problem, which splits into

    sup {φⁱ(uⁱ) − λᵀcⁱ(uⁱ) : uⁱ ∈ Uⁱ} for i = 1, ..., n.  (1.2.2)

Such decomposable problems form a wide variety in the world of optimization; usually, each uⁱ is called a local control variable, hence the name "local problems" to designate the black box of Fig. 1.1.1. We choose three examples for illustration.

(a) The Knapsack Problem. Our first example is the simplest instance of combinatorial problems, in which U is a finite or countable set. One has a knapsack and one considers putting in it n objects (toothbrush, saucepan, TV set, ...), each of which has a price pⁱ - expressing how much one would like to take it - and a volume vⁱ. Then one wants to make the knapsack of maximal price, knowing that it has a limited volume v.

For each object, two decisions are possible: take it or not; U is the set of all possible such decisions for the n objects, i.e. U can be identified with the set {1, ..., 2ⁿ}. To each u ∈ U are associated the objective- and constraint-values, namely the sum of respectively the prices and volumes of all objects taken. The problem is solvable in a finite time, but quite a large time if n is large; actually, problems of this kind are extremely difficult.

Now consider the Lagrange problem: m = 1 and the number λ is a penalty coefficient, or the "price of space" in the knapsack: when the iᵗʰ object is taken, the payoff is no longer pⁱ but pⁱ − λvⁱ. The Lagrange problem is straightforward: for each object i,

    if pⁱ − λvⁱ > 0, take the object;
    if pⁱ − λvⁱ < 0, leave it;
    if pⁱ − λvⁱ = 0, do what you want.

While making each of these n decisions, the corresponding terms are added to the Lagrange- and constraint-values, and the job to be done in Fig. 1.1.1 is clear enough, as the sketch below illustrates.
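For concreteness, here is what such a black box can look like (a Python sketch with invented data; the tie-breaking rule "take when pⁱ − λvⁱ = 0" is one arbitrary choice among the possible ones, and this is precisely where the ambiguity of u_λ lies).

    # Lagrange "black box" for the knapsack problem: given lam >= 0, take
    # object i exactly when p[i] - lam*v[i] >= 0 (ties broken arbitrarily).
    def knapsack_oracle(lam, p, v, vol):
        take = [i for i in range(len(p)) if p[i] - lam * v[i] >= 0.0]
        L = sum(p[i] - lam * v[i] for i in take) + lam * vol  # optimal value
        c = sum(v[i] for i in take) - vol                     # constraint value
        return L, c, take

    p, v, vol = [10.0, 6.0, 4.0], [5.0, 4.0, 4.0], 8.0
    for lam in (0.0, 0.5, 1.0, 1.5, 2.5):
        print(lam, knapsack_oracle(lam, p, v, vol))
    # the optimal value L is convex piecewise affine in lam, while the
    # mapping lam -> c jumps at the breakpoints p[i]/v[i] (here 2, 1.5, 1)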
Some preliminary observations are worth mentioning.

- U is totally unstructured: for example, what is the sum of two decisions? On the other hand, the Lagrange function defines, via the coefficients pⁱ − λvⁱ, an order in U, which depends on λ;
- from its interpretation, λ should be nonnegative: there is no point in giving a bonus to bulky objects like TV sets;
- there always exists an optimal decision in the Lagrange problem (1.1.4)_λ; and it is well-defined (unique) except when λ is one of the n values pⁱ/vⁱ.

Here is a case where the practical Assumption 1.1.2 is "very true". It should not be hastily concluded, however, that the dual approach is going to solve the problem easily: in fact, the rule is that Problem 1.1.4 is itself hard when (1.1.1) is combinatorial (so the "law of conservation of difficulty" applies). The reason comes from the conjunction of two bad things: uniqueness in the Lagrange problem does not hold - even though it holds "most of the time" - and U has no nice structure. On the other hand, the knapsack problem is useful to us, as it illustrates some important points of the approach.
The problem can be given an analytical flavour: assign to the control variable uⁱ the value 1 if the iᵗʰ object is taken, 0 if not. Then we have to solve

    max Σ_{i=1}^n pⁱuⁱ subject to uⁱ ∈ {0, 1}, Σ_{i=1}^n vⁱuⁱ ≤ v.

In order to fit with the equality-constrained form (1.1.1), a nonnegative slack variable u⁰ is appended and we obtain the 0-1 programming problem

    max φ(u) := Σ_{i=1}^n pⁱuⁱ [+ 0·u⁰]
    c(u) := Σ_{i=1}^n vⁱuⁱ + u⁰ − v = 0  (1.2.3)
    u⁰ ≥ 0, uⁱ ∈ {0, 1} for i = 1, ..., n  [⟺ u ∈ U],

which, incidentally, certainly has a solution.


Associated with λ ∈ ℝ, the Lagrange problem is then

    sup {λv − λu⁰ + Σ_{i=1}^n (pⁱ − λvⁱ)uⁱ : u⁰ ≥ 0, uⁱ ∈ {0, 1} for i = 1, ..., n}.

Maximization with respect to u⁰ is easy, and L(u_λ, λ) = +∞ for λ < 0; Assumption 1.1.5(i) is satisfied via a simple trick: impose the (natural) constraint λ ≥ 0, somehow eliminating the absurd negative values of λ. Thus, for any λ ≥ 0, we will have

    L(u_λ, λ) = Σ_{i∈I(λ)} (pⁱ − λvⁱ) + λv,  c(u_λ) = Σ_{i∈I(λ)} vⁱ − v,

where, for example,

    I(λ) = {i : pⁱ − λvⁱ ≥ 0}  (1.2.4)

but strict inequality would also define a correct I(λ): here lies the ambiguity of u_λ.

Needless to say, the formulation (1.2.3) should not hide the fact that U = ℝ₊ × {0, 1}ⁿ is still unstructured: one can now define, say, ½(u₁ + u₂) for u₁ ∈ U and u₂ ∈ U; but the half of a toothbrush has little to do with a toothbrush. Another way of saying the same thing is as follows: the constraints uⁱ ∈ {0, 1} can be formulated as

    0 ≤ uⁱ ≤ 1 and uⁱ(1 − uⁱ) = 0.

Then these constraints are hard; by contrast, the constraint c(u) = 0 can be considered as soft: if some tolerance on the volume of the knapsack is accepted, the constraint does not have to be strictly satisfied.

(b) The Constrained Lotsizing Problem. Suppose a machine produces objects, for example bottles, "in lots", i.e. at a high rate. Producing one bottle costs p (French Francs, say); but first, the machine must be installed and there is a setup cost S to produce one lot. Thus, the cost for producing u bottles in v lots is

    Sv + pu.

When produced, the bottles are sold at a slow rate, say r bottles per unit time. A stock is therefore formed, which costs s per bottle and per unit time. Denoting by u(t) the number of bottles in the stock at time t, the total inventory cost over a period [t₁, t₂] is

    s ∫_{t₁}^{t₂} u(t) dt.

Suppose that a total amount u must be produced, to be sold at a constant rate of r bottles per day. The idea of lot-production is to accept an increase in the setup cost, while reducing the inventory cost: if a total of u bottles must be produced, the production being split in v lots of u/v bottles each, the stock evolves as in Fig. 1.2.1.

Fig. 1.2.1. Evolution of the stock when producing in lots

The total cost, including setup, production and inventory, is then

    pu + Sv + (s/(2r)) u²/v.  (1.2.5)

Between the two extremes (one lot = high inventory vs. many lots = high setup), there is an optimum, the economic lotsize, obtained when v minimizes (1.2.5), i.e. when

    v = u √(s/(2rS)).  (1.2.6)
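As a sanity check on (1.2.6), one can scan the total cost (1.2.5) around the candidate lotsize; a small Python sketch with invented data:

    import math

    # Economic lotsize: minimize p*u + S*v + (s/(2*r))*(u**2)/v over v > 0.
    p, S, s, r, u = 2.0, 50.0, 0.1, 10.0, 1000.0

    def total_cost(v):
        return p * u + S * v + (s / (2.0 * r)) * u**2 / v

    v_star = u * math.sqrt(s / (2.0 * r * S))        # formula (1.2.6)
    for v in (0.5 * v_star, v_star, 2.0 * v_star):
        print(v, total_cost(v))                      # v_star gives the smallest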

Furthermore, production and setup take time, and the total availability of the machine may be limited, say

    t_p u + t_s v ≤ T.

N.B. These times are not taken into account in Fig. 1.2.1 (which assumes t_p = t_s = 0), but they do not change the essence of (1.2.5); also, v is allowed non-integer values, which simply means that sales will continue to deplete the stock left at time t₂.
Now suppose that several machines, and possibly several kinds of bottles, are involved, each with its own cost and time characteristics. The problem is schematically formulated as

    min Σ_{i=1}^n Σ_{j=1}^m φ_{ij}(u_{ij}, v_{ij})  (i)
    Σ_{j=1}^m u_{ij} = D_i for i = 1, ..., n  (ii)
    Σ_{i=1}^n (a_{ij} u_{ij} + b_{ij} v_{ij}) ≤ T_j for j = 1, ..., m  (iii)
    u_{ij} ≥ 0, v_{ij} ≥ 0 for all i and j,  (iv)  (1.2.7)

where the cost is given as in (1.2.5), say

    φ_{ij}(u, v) := 0 if u = v = 0;  α_{ij} u + β_{ij} v + γ_{ij} u²/v if u, v > 0;  +∞ otherwise.

The notation is clear enough; D_i is the total (known) demand in bottles of type i, T_j is the total availability of the machine j; all data are positive. To explain the +∞-value for φ_{ij}, observe that a positive number of units cannot be processed in 0 lots.
Here, the practical Assumption 1.1.2 comes into play because, barring the time-constraints (iii) in (1.2.7), the problem would be very easy. First of all, the variables v_{ij} would become unconstrained; assuming each u_{ij} fixed, an economic lotsize formula like (1.2.6) would give the optimal v_{ij}. The resulting problem in u would then split into independent problems, one for each product i.

More specifically, define the Lagrange function by "dualizing" the constraints (1.2.7)(iii), and optimize the result with respect to v to obtain

    v_{ij} = v_{ij}(u, λ) = u_{ij} √(γ_{ij}/(β_{ij} + λ_j b_{ij})).  (1.2.8)

The iᵗʰ local control domain is then the simplex

    Uⁱ := {u ∈ ℝ^m : Σ_{j=1}^m u_j = D_i, u_j ≥ 0 for j = 1, ..., m},  (1.2.9)

and the iᵗʰ local problem is a linear program whose constraint-set is Uⁱ:

    min Σ_{j=1}^m [α_{ij} + λ_j a_{ij} + 2√(γ_{ij}(β_{ij} + λ_j b_{ij}))] u_j
    Σ_{j=1}^m u_j = D_i,  u_j ≥ 0 for j = 1, ..., m,  (1.2.10)_λ

admitting that u_j stands for u_{ij}, but with i fixed.
As was the case with the knapsack problem, the dualized constraints are intrinsic inequalities and the infimum is −∞ if λ_j < 0 for some j. Otherwise, we have to minimize a linear objective function on the convex hull of the m extreme points of Uⁱ; it suffices to take one best such extreme point, associated with some j(i) giving the smallest of the following m numbers:

    α_{ij} + λ_j a_{ij} + 2√(γ_{ij}(β_{ij} + λ_j b_{ij})) for j = 1, ..., m.  (1.2.11)

In summary, the Lagrange problem (1.2.10)_λ can be solved as follows: for i = 1, ..., n, set each u_{ij} to 0, except u_{i,j(i)} = D_i. Economically, this amounts to producing each product i on one single machine, namely the cheapest one depending on the given λ. Then the lotsizes v_{i,j(i)} are given by (1.2.8) for each i. The answer from the black box in Fig. 1.1.1 is thus computed in a straightforward way (do not forget to change signs, since one is supposed to maximize, and to add Σ_j λ_j T_j).
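In the same spirit as for the knapsack, this local black box admits a compact sketch (Python, hypothetical data; only the primal part (u, v) is returned here, the Lagrangian value being easily added as indicated above). For each product i it evaluates the m numbers (1.2.11), assigns the whole demand D_i to the cheapest machine, and recovers the lotsizes from (1.2.8).

    import math

    # Local black box for the lotsizing problem: given lam (lam[j] >= 0),
    # solve the n linear programs (1.2.10) by inspecting (1.2.11).
    def lotsize_oracle(lam, alpha, beta, gamma, a, b, D):
        n, m = len(D), len(lam)
        u = [[0.0] * m for _ in range(n)]
        v = [[0.0] * m for _ in range(n)]
        for i in range(n):
            costs = [alpha[i][j] + lam[j] * a[i][j]
                     + 2.0 * math.sqrt(gamma[i][j] * (beta[i][j] + lam[j] * b[i][j]))
                     for j in range(m)]                   # the numbers (1.2.11)
            jbest = min(range(m), key=costs.__getitem__)  # cheapest machine j(i)
            u[i][jbest] = D[i]
            v[i][jbest] = D[i] * math.sqrt(
                gamma[i][jbest] / (beta[i][jbest] + lam[jbest] * b[i][jbest]))  # (1.2.8)
        return u, v

    # two products, two machines, all characteristics equal to 1 except demands
    one = [[1.0, 1.0], [1.0, 1.0]]
    u, v = lotsize_oracle([0.3, 0.7], one, one, one, one, one, D=[10.0, 20.0])
    print(u, v)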
Here, Assumption 1.1.5(i) is again "mildly violated" and dual constraints λ_j ≥ 0 will be in order. Uniqueness is again not assured, but again it holds almost everywhere. Despite this absence of uniqueness, the situation is now extremely favourable: Problem 1.1.4 is reasonably easy. The reason is that U has a structure: it is a convex set; and besides, the objective function is convex.

(c) Entropy Maximization. The set of control variables is now an infinite-dimensional space, say U = L₁(Ω, ℝ) where Ω is an interval of ℝ. We want to solve

    sup ∫_Ω ψ(x, u(x)) dx,  u ∈ L₁(Ω, ℝ),
    ∫_Ω γ_j(x, u(x)) dx = 0 for j = 1, ..., m.  (1.2.12)

In addition to Ω, the data of the problem are the m + 1 functions ψ, γ₁, ..., γ_m, which send (x, t) ∈ Ω × ℝ to ℝ. The Lagrangian is the integral over Ω of the function

    ℓ(x, u, λ) := ψ(x, u(x)) − Σ_{j=1}^m λ_j γ_j(x, u(x)).

This type of problem appears all the time when some measurements (characterized by the functions γ_j) are made on u; one then seeks the "most probable" u (in the sense of the entropy −ψ) compatible with these measurements.

In a sense, we are in the decomposable situation (1.2.1), but with "sums of infinitely many terms". Indeed, without going into technical details from functional analysis, suppose that the maximization of ℓ with respect to u ∈ L₁(Ω, ℝ) makes sense, and that the standard optimality conditions hold: given λ ∈ ℝ^m, we must solve for each x ∈ Ω the equation in t ∈ ℝ:

    ψ'_t(x, t) − Σ_{j=1}^m λ_j (γ_j)'_t(x, t) = 0,  (1.2.13)

where '_t denotes differentiation with respect to t. Take a solution giving the largest ℓ (if there are several), call u_λ(x) the result, and this gives the output from the black box.
The favourable cases are those where the function t ↦ ψ(x, t) is strictly concave and each function t ↦ γ_j(x, t) is affine. A typical example is one in which ψ is an entropy (usually not depending on x), for example

    ψ(x, t) = −t log t for t > 0,  0 for t = 0,  −∞ for t < 0,  (1.2.14)

or

    ψ(x, t) = log t if t > 0,  −∞ otherwise.  (1.2.15)

Being affine, the constraints have the general form

    ∫_Ω a_j(x) u(x) dx − b_j = 0,

where each a_j ∈ L_∞(Ω, ℝ) is a given function, for example a_j(x) = cos(2πjx) (Fourier constraints).

Here, the practical Assumption 1.1.2 is satisfied, at least in the case (1.2.14): the Lagrange problem (1.2.13) has the explicit solution

    u_λ(x) = exp(−1 − Σ_{j=1}^m λ_j a_j(x)).  (1.2.16)

By contrast, (1.2.12) has "infinitely many" variables; even with a suitable discretization, it will likely remain a cumbersome problem.

Remark 1.2.1 On the other hand, the technical details already alluded to must not be forgotten. Take for example (1.2.15): the optimality conditions (1.2.13) for u_λ give

    u_λ(x) = [Σ_{j=1}^m λ_j a_j(x)]⁻¹.  (1.2.17)

The set of λ for which this function is nice and log-integrable is now much more complicated than the nonnegative orthant of the two previous examples (a) and (b). Actually, several delicate questions have popped up in this problem: when does the Lagrange problem have a solution? and when do the optimality conditions (1.2.13) hold at such a solution? These questions have little to do with our present concern, and are left unanswered here.  □

As long as the entropy ψ is strictly concave and the constraints γ_j are affine (and barring the possible technical difficulties from functional analysis), the dual approach is here very efficient. The reason, now, is that the u_λ of the black box is unambiguous: the Lagrange problem has a well-defined unique solution, which behaves itself when λ varies.

2 The Necessary Theory

Throughout this section, which requires some attention from the reader, our leading motivation is as follows: we want to solve the u-problem (1.1.1); but we have replaced it by the λ-problem 1.1.4, which is vaguely stated. Our first aim is to formulate a more explicit λ-problem (hereafter called the dual problem), which will turn out to be very well-posed. Then, we will examine questions regarding its solvability, and its relevance: to what extent does the dual problem really solve the primal u-problem that we started from?

The data (U, φ, c) will still be viewed as totally unstructured, up to the point where we need to require some specific properties from them; in particular, Assumption 1.1.5 will not be in force, unless otherwise specified.

2.1 Preliminary Results: The Dual Problem

To solve our problem, we must at least find a feasible u in (1.1.1). This turns out to be also sufficient, and the reason lies in what is probably the most important practical link between (1.1.1) and (1.1.4)_λ:

Theorem 2.1.1 (H. Everett) Fix λ ∈ ℝ^m; suppose that (1.1.4)_λ has an optimal solution u_λ ∈ U and set c_λ := c(u_λ) ∈ ℝ^m. Then u_λ is also an optimal solution of

    max φ(u),  u ∈ U,  c(u) = c_λ.  (2.1.1)

PROOF. Take an arbitrary u ∈ U. By definition of u_λ,

    φ(u) − λᵀc(u) = L(u, λ) ≤ L(u_λ, λ) = φ(u_λ) − λᵀc_λ.

If, in addition, c(u) = c_λ, then u is feasible in (2.1.1) and the above relations become φ(u) ≤ φ(u_λ).  □

Thus, once we have solved the Lagrange problem (1.1.4)_λ for some particular λ, we have at the same time solved a perturbation of (1.1.1), in which 0 is replaced by an a posteriori right-hand side. Immediate consequence:

Corollary 2.1.2 If, for some λ ∈ ℝ^m, (1.1.4)_λ happens to have a solution u_λ which is feasible in (1.1.1), then this u_λ is a solution of (1.1.1).  □

To solve Problem 1.1.4, we of course must solve the system of equations

    c_j(u_λ) = 0 for j = 1, ..., m,  (2.1.2)

where u_λ is given by the black box; Corollary 2.1.2 then says that it suffices to solve this system. As stated, the problem is not simple: the mapping λ ↦ c_λ is not even well-defined since u_λ, as returned by the black box, is ambiguous.

Example 2.1.3 Take the knapsack problem (1.2.4). In view of the great simplicity of the black box, c_λ is easy to compute. For λ ≥ 0 (otherwise there is no u_λ and no c_λ), u⁰ = 0 does not count; with the index-set I(λ) defined in (1.2.4), the left-hand side of our (univariate) equation (2.1.2) is obviously

    c_λ = Σ_{i∈I(λ)} vⁱ − v.

When λ varies, c_λ stays constant except at the branch-points where I(λ) changes; then c_λ jumps to another constant value. Said otherwise (see also the text following Remark 1.1.6), the mapping λ ↦ c_λ, which is ill-defined as it depends on the black box, is discontinuous. Such wild behaviour makes it hard to imagine an efficient numerical method.  □

Fortunately, another result, just as simple as Theorem 2.1.1, helps a great deal. First of all, we consider a problem more tolerant than (2.1.2):

    Find λ such that there is u ∈ U maximizing L(·, λ) and satisfying c(u) = 0.  (2.1.3)

This formulation is mathematically more satisfactory: we have now to solve a "set-valued equation" with respect to λ. Note the difference, though: even if λ solves (2.1.3), we still have to find the correct u, while this u = u_λ mentioned in (2.1.2) was automatically given by the black box. Equivalence between (2.1.2) and (2.1.3) holds in two cases: if the Lagrange problems (1.1.4)_λ have unique solutions, or if the black box can solve (1.1.4)_λ "intelligently".
Now we introduce a fundamental notation.

Definition 2.1.4 (Dual Function) In the Lagrange problem (1.1.4)_λ, the optimal value is called the dual function, denoted by θ:

    ℝ^m ∋ λ ↦ θ(λ) := sup {L(u, λ) : u ∈ U}.  (2.1.4)  □

Note that θ is just the opposite (cf. Remark 1.1.1) of the corresponding function of §VII.4.5; it is a direct and unambiguous output of the black box. If Assumption 1.1.5 holds, then θ(λ) = L(u_λ, λ); but existence of an optimal solution is by no means necessary for the definition of θ. Indeed, the content of the practical Assumption 1.1.2 is just that θ(λ) is easy to compute for given λ, even if an optimal solution u_λ is not easy to obtain, or even if none exists.
Theorem 2.1.5 (Weak Duality) For all λ ∈ ℝ^m and all u feasible in (1.1.1), there holds

    θ(λ) ≥ φ(u).  (2.1.5)

PROOF. For any u feasible in (1.1.1), we have

    φ(u) = φ(u) − λᵀc(u) = L(u, λ) ≤ θ(λ).  □

Thus, each value of the dual function gives an upper bound on the primal optimal value; a sensible idea is then to find the best such upper bound:

Definition 2.1.6 (Dual Problem) The optimization problem in λ

    inf {θ(λ) : λ ∈ ℝ^m}  (2.1.6)

is called the dual problem associated with (U, φ, c) of (1.1.1).  □


The following crucial result assesses the importance of this dual problem.

Theorem 2.1.7 If λ̄ is such that the associated Lagrange problem (1.1.4)_λ̄ is maximized at a feasible ū, i.e. if λ̄ solves (2.1.3), then λ̄ is also a solution of the dual problem (2.1.6).

Conversely, if λ̄ solves (2.1.6), and if (2.1.3) has a solution at all, then λ̄ is such a solution.

PROOF. The property c(ū) = 0 means

    θ(λ̄) = L(ū, λ̄) = φ(ū),

which shows that θ assumes the least of its values allowed by inequality (2.1.5).

Conversely, let (2.1.3) have a solution λ*: some u* ∈ U satisfies c(u*) = 0 and L(u*, λ*) = θ(λ*) = φ(u*). Again from (2.1.5), θ(λ*) has to be the minimal value θ(λ̄) of the dual function and we write

    L(u*, λ̄) = φ(u*) − λ̄ᵀc(u*) = φ(u*) = θ(λ̄).

In other words, our feasible u* maximizes L(·, λ̄).  □

Existence of a solution to (2.1.3) means full success of the dual approach, in the following sense:

Corollary 2.1.8 Suppose (2.1.3) has a solution. Then, for any solution λ̄ of the dual problem (2.1.6), the primal solutions are those u maximizing L(·, λ̄) that are feasible in (1.1.1).

PROOF. When (2.1.3) has a solution, we already know that λ̄ minimizes θ if and only if: λ̄ solves (2.1.3), and there is some ū such that φ(ū) = L(ū, λ̄) = θ(λ̄).

Then, for any u solving (1.1.1), we have c(u) = 0 and φ(u) = φ(ū), hence L(u, λ̄) = θ(λ̄).  □

In the above two statements, never forget the property "(2.1.3) has a solution", which need not hold in general. We will see in §2.3 what it implies, in terms of the data (U, φ, c).

Remark 2.1.9 When (2.1.3) has no solution, at least a heuristic resolution of (1.1.1) can be considered, inspired by Everett's Theorem 2.1.1 and the weak duality Theorem 2.1.5.

Apply an iterative algorithm to minimize θ and, during the course of this algorithm, store the primal points computed by the black box. After the dual problem is solved, select among these primal points one giving a best compromise between its φ-value and constraint-violation.

This heuristic heavily depends on Assumption 1.1.3. Its quality can be judged a posteriori from the infimal value of θ; but nothing can be guaranteed in advance: φ may remain far from θ, or the constraints may doggedly refuse to approach 0.  □
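A possible organization of this heuristic is sketched below (Python; the oracle interface, the step sizes and the compromise rule are illustrative choices of ours, not prescriptions of the book). It uses only the subgradient −c(u_λ) of θ, anticipating Proposition 2.2.2 below; Chap. IX discusses serious alternatives to the crude subgradient iteration.

    # Heuristic of Remark 2.1.9: minimize theta by a subgradient method,
    # store the primal points, then keep the best compromise. 'oracle(lam)'
    # is assumed to return (theta(lam), u_lam, c(u_lam), phi(u_lam)).
    def dual_heuristic(oracle, lam0, iters=50, tol=1e-3):
        lam, candidates = lam0, []
        for k in range(1, iters + 1):
            theta, u, c, phi = oracle(lam)
            candidates.append((u, c, phi))
            g = [-cj for cj in c]                 # -c(u_lam) is a subgradient
            step = 1.0 / k                        # divergent-series step sizes
            lam = [lj - step * gj for lj, gj in zip(lam, g)]
        # compromise: nearly feasible points first, then best objective value
        feas = [t for t in candidates
                if max(abs(cj) for cj in t[1]) <= tol]
        pool = feas if feas else candidates
        return max(pool, key=lambda t: t[2])      # largest phi among the pool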

2.2 First Properties of the Dual Problem

Let us summarize our development so far: starting from the vaguely stated Problem 1.1.4, we have introduced a number of formulations, interconnected as follows:

    λ solves 1.1.4 ⟺ λ solves (2.1.2) ⟹ λ solves (2.1.3) ⟹ λ solves (2.1.6).  (2.2.1)

Thus, to solve the ugly equation (2.1.2), we must solve the perfectly well-stated problem (2.1.6). Furthermore, we have nothing to lose:
- If no primal solution is thus produced, the task was hopeless from the very beginning: no other value of λ could give a primal solution; this is the converse part in Theorem 2.1.7.
- If the technique works, no primal solution can be missed: they all solve the Lagrange problem associated with any dual optimum; this is Corollary 2.1.8.

Remark 2.2.1 From now on, we will use the word "dual" for everything concerning (2.1.6): we have to minimize the dual function θ, with respect to the dual variable λ (u being the primal variable), possibly via a dual algorithm, yielding a dual solution, and so on.

The symmetry between (1.1.1) and (2.1.6) becomes more suggestive if the primal constraints are incorporated into the objective function: (1.1.1) can be formulated as

    sup {φ(u) − I_C(u) : u ∈ U},

where C is the domain described by c(u) = 0, I_C being its indicator function.  □

The applications described in §1.2 give good illustrations of the logical chain (2.2.1).

(a) The knapsack problem is a typical example where (2.1.6) has little to do with (2.1.2), or even (2.1.3). Take the (particularly simple!) case n = 1; for example

    max u subject to 2u ≤ 1, u ∈ {0, 1}.  (2.2.2)

The dual function is a nice polyhedral convex function (cf. Example 2.1.3 if necessary):

    θ(λ) = +∞ if λ < 0,  1 − λ if 0 ≤ λ ≤ 1/2,  λ if λ ≥ 1/2;

by contrast, the left-hand side in (2.1.2) is the discontinuous mapping

    c_λ = 1 if 0 ≤ λ < 1/2,  ±1 ambiguously if λ = 1/2,  −1 if λ > 1/2.

For (2.1.3), λ = 1/2 gives the doubleton {−1, +1}.

Here, none of the converse implications holds in (2.2.1); indeed (2.1.3) - and a fortiori (2.1.2) - has no solution.
(b) The case of economic lotsize is more involved but similar: c_λ jumps when λ is such that the optimal j switches in (1.2.11) for some i. A discontinuity in c_λ has an interesting economic meaning, with λ_j viewed as the marginal price to be paid when the jᵗʰ machine is occupied. For most values of λ, it is economically cheaper to produce the iᵗʰ product entirely on one single machine j(i). At some "critical" values of λ, however, two machines at least, say j₁(i) and j₂(i), are equally cheap for some product i; then a splitting of the production between these machines can be considered, without spoiling optimality. The uⁱ-part of a solution (u, v)_λ becomes a subsimplex of (1.2.9); and the corresponding vⁱ-part is also a simplex, as described by (1.2.8).

Here we will see that (2.1.3) ⟺ (2.1.6), while (2.1.3) does not imply (2.1.2).
(c) As for the entropy problem, (2.1.2) and (2.1.6) have similar complexity. Take for example the entropy (1.2.14): from (1.2.16), straightforward calculations give θ:

    ℝ^m ∋ λ ↦ θ(λ) = bᵀλ + ∫_Ω exp(−1 − Σ_{j=1}^m λ_j a_j(x)) dx,
    c_ℓ(u_λ) = ∫_Ω a_ℓ(x) exp(−1 − Σ_{j=1}^m λ_j a_j(x)) dx − b_ℓ for ℓ = 1, ..., m;

a numerical sketch of these formulas is given below. A good exercise is to do the same calculations with the entropy (1.2.15) (without bothering too much with Remark 1.2.1); observe in passing that, in both cases, θ is convex and −c = ∇θ.

Here full equivalence holds in (2.2.1).
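Here is the numerical sketch announced in (c) (Python; trapezoidal quadrature on Ω = [0, 1], Fourier constraints a_j(x) = cos(2πjx) and invented right-hand sides b): it evaluates θ(λ) and ∇θ(λ) = −c(u_λ) from the two formulas above.

    import math

    m, N = 2, 2000                       # constraints, quadrature intervals
    b = [0.1, 0.05]                      # invented measurement values
    xs = [k / N for k in range(N + 1)]

    def a(j, x):                         # Fourier constraint functions
        return math.cos(2.0 * math.pi * (j + 1) * x)

    def trapz(vals):
        # trapezoidal rule on [0, 1] with N uniform intervals
        return sum(vals) / N - 0.5 * (vals[0] + vals[-1]) / N

    def theta_and_grad(lam):
        u = [math.exp(-1.0 - sum(lam[j] * a(j, x) for j in range(m)))
             for x in xs]                # the explicit solution (1.2.16)
        theta = sum(b[j] * lam[j] for j in range(m)) + trapz(u)
        grad = [b[j] - trapz([a(j, x) * ux for x, ux in zip(xs, u)])
                for j in range(m)]       # grad theta = -c(u_lam)
        return theta, grad

    print(theta_and_grad([0.0, 0.0]))
    print(theta_and_grad([0.5, -0.2]))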
In all three examples, including (c), the dual problem will be more advantageously viewed as minimizing θ than as solving (2.1.2) or (2.1.3). See the considerations at the end of §II.1: having a "potential" θ to minimize makes it possible to stabilize the resolution algorithm, whatever it is. Indeed, to some extent, the dual problem (2.1.6) is well posed:

Proposition 2.2.2 If not identically +∞, the dual function θ is in Conv ℝ^m. Furthermore, for any u_λ solution of the Lagrange problem (1.1.4)_λ, the corresponding −c_λ = −c(u_λ) is a subgradient of θ at λ.

PROOF. Direct proofs are straightforward; for example, write for each λ and μ:

    θ(μ) ≥ φ(u_λ) − μᵀc_λ  [by definition of θ(μ)]
         = φ(u_λ) − λᵀc_λ + (λ − μ)ᵀc_λ
         = θ(λ) + (μ − λ)ᵀ(−c_λ),  [by definition of u_λ]

so −c_λ ∈ ∂θ(λ). The other claims can be proved similarly. Proposition IV.2.1.2 and Lemma VI.4.4.1 can also be invoked: the dual function

    ℝ^m ∋ λ ↦ θ(λ) = sup_{u∈U} [φ(u) − λᵀc(u)]

is the pointwise supremum of affine functions indexed by u.  □

Note that the case θ ≡ +∞ is quite possible: take U = ℝ, φ(u) = u², c(u) = u. Then, no matter how λ is chosen, the Lagrange function L(u, λ) = u² − λu goes to +∞ with |u|. Note also that θ has no chance to be differentiable, except if the Lagrange problem has a unique solution.
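When U is finite, the proposition can be checked mechanically; the toy sketch below (Python, random data of our own making) evaluates θ as a finite max and verifies the subgradient inequality θ(μ) ≥ θ(λ) + (μ − λ)ᵀ(−c_λ) on sampled pairs.

    import random

    random.seed(0)
    # finite control set: phi[u] and c[u] for u = 0..4, one constraint (m = 1)
    phi = [random.uniform(0.0, 5.0) for _ in range(5)]
    c = [random.uniform(-2.0, 2.0) for _ in range(5)]

    def theta(lam):
        vals = [phi[u] - lam * c[u] for u in range(5)]
        u_lam = max(range(5), key=vals.__getitem__)
        return vals[u_lam], -c[u_lam]    # dual value and a subgradient

    for _ in range(100):
        lam, mu = random.uniform(-3, 3), random.uniform(-3, 3)
        t_lam, g = theta(lam)
        t_mu, _ = theta(mu)
        # the subgradient inequality of Proposition 2.2.2
        assert t_mu >= t_lam + (mu - lam) * g - 1e-12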

Thus the dual problem enjoys several important properties:

- It makes sense to minimize on ℝ^m a lower semi-continuous function such as θ.
- The set of minimizers of a convex function such as θ is well delineated: there is no ambiguity between local and global minima, and any suitably designed minimization algorithm will converge to such a minimum (if any). Furthermore, the weak duality Theorem 2.1.5 tells us that θ is bounded from below, unless there is no feasible u (in which case the primal problem does not make sense anyway).
- Another consequence, of practical value, is that our dual problem (2.1.6) fits within the framework of Chap. IX, at least in the situation described by Fig. 1.1.1 (which in particular implies Assumption 1.1.5): the only available information about our objective function θ is a black box which, for given λ, computes the value θ(λ) and some subgradient which we are entitled to call s(λ) := −c_λ = −c(u_λ).

The property −c_λ ∈ ∂θ(λ) is fundamental for applications. Note that −c_λ is just the coefficient of λ in the Lagrange function, and remember Remark VI.4.4.6 explaining why this partial derivative of L becomes a total derivative of θ.

Remark 2.2.3 Once again, it is important to understand what is lost by the dual problem. Altogether, a proper resolution of our Problem 1.1.4 entails two steps:

(i) First solve (2.1.6), i.e. use the one-way implications in (2.2.1); this must be done anyway, whether (2.1.6) is equivalent to (2.1.2) or not.
(ii) Then take care of the lost converse implications in (2.2.1) to recover a primal solution from a dual one.

In view of Proposition 2.2.2, (i) is an easy task; a first glance at how it might be solved was given in Chap. IX, and we will see in §4 below some other traditional algorithms. The status of (ii) is quite different and we will see in §2.3 that there are three cases:

(ii₁) (2.1.2) has no solution - and hence, (i) was actually of little use; example: knapsack.
(ii₂) A primal solution can be found after (i) is done, but this requires some extra work, because (2.1.3) ⟺ (2.1.6) but (2.1.3) ⇏ (2.1.2); example: economic lotsize.
(ii₃) A primal solution is readily obtained from the black box after (i) is done, (2.1.6) ⟺ (2.1.2); example: entropy (barring difficulties from functional analysis).  □

The dual approach amounts to exploring a certain primal set parameterized by λ ∈ ℝ^m, namely

    Û := {u ∈ U : u solves the Lagrange problem (1.1.4)_λ for some λ}.  (2.2.3)

Usually, Û is properly contained in U: for example, the functions (1.2.16) and (1.2.17) certainly describe a very small part of L₁(Ω, ℝ) when λ describes ℝ^m - and the interest of duality precisely lies there. If (and only if) Û contains a feasible point, the dual approach will produce a primal solution; otherwise, it will break down. How much it breaks down depends on the problem. Consider the knapsack example (2.2.2): it has a unique dual solution λ̄ = 1/2, at which the dual optimal value provides the upper bound θ(1/2) = 1/2; but the primal optimal value is 0. At λ̄ = 1/2, the Lagrange problem has two (slackened) solutions:

    (u, u⁰) = (0, 0) and (u, u⁰) = (1, 0),

with constraint-values −1 and 1: none of them is feasible, of course. On the other hand, the non-slackened solution u = 0 is feasible with respect to the inequality constraint.
This last example illustrates the following important concept:

Definition 2.2.4 (Duality Gap) The difference between the optimal primal and dual values,

    inf_{λ∈ℝ^m} θ(λ) − sup {φ(u) : u ∈ U, c(u) = 0},  (2.2.4)

when it is not zero, is called the duality gap.  □

By now, the following facts should be clear:

- The number (2.2.4) is always nonnegative (weak duality Theorem 2.1.5).
- The presence of a duality gap definitely implies that Û contains no feasible point: failure of the dual approach.
- The absence of a duality gap, i.e. having 0 in (2.2.4), is not quite sufficient; to be on the safe side, each of the two extremization problems in (2.2.4) must have a solution.
- In this latter case, we are done: Corollary 2.1.8 applies.

To close this section, we emphasize once more the fact that all the results so far are valid without any specific assumption on the primal data (U, φ, c). Such assumptions become relevant only for questions like: does the primal problem have a solution? how can it be recovered from a dual solution? does the dual problem have a solution? These questions are going to be addressed now.

2.3 Primal-Dual Optimality Characterizations

Suppose the dual problem is solved: some λ̄ ∈ ℝ^m has been found, minimizing the dual function θ on the whole space; a solution of the primal problem (1.1.1) is still to be found. It is now that some assumptions must be made - if only to make sure that (1.1.1) has a solution at all!

A dual solution λ̄ is characterized by 0 ∈ ∂θ(λ̄); now the subdifferentials of θ lie in the dual of the λ-space, i.e. in the space of constraint-values. Indeed, consider the (possibly empty) optimal set in the Lagrange problem (1.1.4)_λ:

    U(λ) := {u ∈ U : L(u, λ) = θ(λ)};  (2.3.1)

remembering Proposition 2.2.2,

    ∂θ(λ) ⊃ co {−c(u) : u ∈ U(λ)}.

The converse inclusion is therefore crucial, and motivates the following definition.

Definition 2.3.1 (Filling Property) With U(λ) defined in (2.3.1), we say that the filling property holds at λ ∈ ℝ^m when

    ∂θ(λ) = −co {c(u) : u ∈ U(λ)}.  (2.3.2)

Observe that the right-hand side in (2.3.2) is then closed, since the left-hand side is.  □
o
Lemma 2.3.2 Suppose that U in (1.1.1) is a compact set, on which φ is upper semi-continuous, and each c_j is continuous. Then the filling property (2.3.2) holds at each λ ∈ ℝ^m.

PROOF. Under the stated assumptions, the Lagrange function L(·, λ) is upper semi-continuous and has a maximum for each λ: dom θ = ℝ^m. For the supremum of affine functions λ ↦ φ(u) − λᵀc(u), indexed by u ∈ U, the calculus rule VI.4.4.4 applies.  □

We have therefore a sufficient condition for an easy description of ∂θ in terms of primal points. It is rather "normal", in that it goes in harmony with two other important properties: existence of an optimal solution in the primal problem (1.1.1), and in the Lagrange problems (1.1.4).

Remark 2.3.3 In case of inequality constraints, nonnegative slack variables can be appended to the primal problem, as in §1.2(a). Then, the presence of the (unbounded) nonnegative orthant in the control space kills the compactness property required by the above result. However, we will see in §3.2 that this is a "mild" unboundedness. Just remember here that the calculus rule VI.4.4.4 is still valid when λ has all its coordinates positive. Lemma 2.3.2 applies in this case, to the extent that λ ∈ int dom θ. The slacks play no role: for μ ∈ (ℝ₊)^m (close enough to λ), maximizing the slackened Lagrangian

    φ(u) − μᵀc(u) − μᵀv

with respect to (u, v) ∈ U × (ℝ₊)^m just amounts to maximizing the ordinary Lagrangian L(·, μ) over U.  □
The essential result concerning primal-dual relationships can now be stated.

Theorem 2.3.4 Let the filling property (2.3.2) hold (for example, make the assumptions of Lemma 2.3.2) and denote by

    C(λ) := {c(u) ∈ ℝ^m : u ∈ U(λ)}  (2.3.3)

the image by c of the set (2.3.1). A dual optimum λ̄ is characterized by the existence of k ≤ m + 1 points u₁, ..., u_k in U(λ̄) and convex multipliers α₁, ..., α_k such that

    Σ_{i=1}^k α_i c(u_i) = 0.

In particular, if C(λ̄) is convex for some optimal λ̄ then, for any optimal λ*, the feasible points in U(λ*) make up all the solutions of the primal problem (1.1.1).

PROOF. When the filling property holds, the minimality condition 0 ∈ ∂θ(λ̄) is exactly the existence of {u_i, α_i} as stated, i.e. 0 ∈ co C(λ̄). If C(λ̄) is convex, the "co"-operation is useless: 0 is already the constraint-value of some ū maximizing L(·, λ̄). The rest follows from Corollary 2.1.8.  □
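Finding such multipliers is itself a small linear feasibility problem once the constraint-values c(u_i) have been collected from the black box. A possible sketch (Python with scipy.optimize.linprog; the data reproduce the knapsack situation of Remark 2.3.5 below) is:

    import numpy as np
    from scipy.optimize import linprog

    # Given constraint-values C[i] = c(u_i) in R^m collected at a dual
    # optimum, look for convex multipliers alpha with sum_i alpha_i*C[i] = 0.
    C = np.array([[-1.0], [1.0]])        # the knapsack data of Remark 2.3.5

    k, m = C.shape
    A_eq = np.vstack([C.T, np.ones((1, k))])       # m + 1 equality rows
    b_eq = np.concatenate([np.zeros(m), [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * k)        # pure feasibility problem
    print(res.status, res.x)             # here the only answer is (1/2, 1/2)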

This is the first instance where convexity comes into play, to guarantee that the set defined in (2.2.3) contains a feasible point. Convexity is therefore crucial to rule out a duality gap.

Remark 2.3.5 Once again, a good illustration is obtained from the simple knapsack problem (2.2.2). At the unique dual optimum λ̄ = 1/2, C(λ̄) = {−1, +1} is not convex and there is a duality gap equal to 1/2. Nevertheless, the corresponding (slackened) set of solutions to the Lagrange problem is U(1/2) = {(0, 0), (1, 0)} and the convex multipliers α₁ = α₂ = 1/2 do make up the point (1/2, 0), which satisfies the constraint. Unfortunately, this convex combination is not in the (slackened) control space U = {0, 1} × ℝ₊.
Remember the observation made in Remark VIII.2.1.3: even though the kinks of a convex function (such as θ) form a set of measure zero, a minimum point is usually in this set. Here ∂θ(1/2) = [−1, 1] and the observation is confirmed: the minimum λ̄ = 1/2 is the only kink of θ. □
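To make the numbers of this remark concrete, here is a minimal Python sketch. The data sup{u : 2u ≤ 1, u ∈ {0, 1}} (one object of value 1 and size 2, capacity 1) are our own assumption about what (2.2.2) looks like, chosen so as to reproduce the values quoted above:

    # minimal sketch: dual function of a one-object knapsack problem
    import numpy as np

    def theta(lam):
        # theta(lam) = max over u in {0,1} of u - lam*(2u - 1)
        return max(u - lam * (2 * u - 1) for u in (0, 1))

    lams = np.linspace(0.0, 1.0, 101)
    vals = [theta(l) for l in lams]
    i = int(np.argmin(vals))
    print(lams[i], vals[i])   # minimum (and only kink) at lam = 0.5, theta = 0.5
    # the primal optimal value is 0 (take u = 0), so the duality gap is 1/2

The plot of theta is the "roof" max(λ, 1 − λ): differentiable everywhere except at its minimum, in accordance with the remark.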

In practice, the convexity property needed for Theorem 2.3.4 implies that the original problem itself has the required convex structure: roughly speaking, the only cases in which there is no duality gap are those described by the following result.

Corollary 2.3.6 Suppose that the filling property (2.3.2) holds. In either of the following situations (i), (ii) below, there is no duality gap; for every dual solution λ* (if any), the feasible points in U(λ*) make up all the solutions of the primal problem (1.1.1).
(i) For some dual solution λ̄, the associated Lagrange function L(·, λ̄) is maximized at a unique u_λ̄; then, u_λ̄ is the unique solution of (1.1.1).
(ii) In (1.1.1), U is convex, φ is concave and c : U → ℝᵐ is affine.

PROOF. Straightforward. □
Case (i) means that θ is actually differentiable at λ̄; we are in the situation (i₁) of Remark 2.2.3. Differentiability of the dual function thus appears as an important property, not only for convenience when minimizing it, but also for a harmonious primal-dual relationship. The entropy problem of §1.2(c) fully enters this framework, at least with the entropy (1.2.14). The calculations in §2.2(c) confirm that the dual function is then differentiable and its minimization amounts to solving ∇θ(λ) = −c(u_λ) = 0. For this example, if Lemma 2.3.2 applies (cf. difficulties from functional analysis), the black box automatically produces the unique primal solution at any optimal λ (if any).
The knapsack problem gives an illustration a contrario: U is not convex and θ is, as a rule, not differentiable at a minimum (see §2.2(a) for example).
These two situations are rather clear; the intermediate case 2.2.3(i₂) is more delicate. At a dual solution λ̄, the Lagrange problem has several solutions; some of them solve (1.1.1), but they may not be produced by the black box. A constructive character must still be given to Corollary 2.3.6(ii).

Example 2.3.7 Take the lotsizing problem of §1.2(b), and remember that we have only considered a simple black box: for any λ, each product i is assigned entirely to one single machine j(i), even in the ambiguous cases where some product has the same cost for two different machines. For example, the black box is totally unable to yield an allocation in which some product is split between several machines; and it may well be that only such an allocation is optimal in (1.2.7); actually, since θ is piecewise affine, there is almost certainly such an ambiguity at any optimal λ̄; remember Remark 2.3.5.
On the other hand, imagine a sophisticated black box, computing the full optimal set in the Lagrange problem (1.2.10)_λ. This amounts to determining, for each i, all the optimal indices j in (1.2.11); let there be m_i such indices. Each of them defines an extreme point in the simplex (1.2.9); piecing them together, they make up the full solution-set of (1.2.10)_λ, which is the convex hull of p = m₁ × ··· × m_n points in U₁ × ··· × U_n; let us call them u(1), ..., u(p). Now comes the fundamental property: to each u(k) corresponds via (1.2.8) a point v(k) := v(u(k), λ) ∈ ℝⁿˣᵐ; and because u ↦ v(u, λ) in (1.2.8) is linear, the solution-set of (1.2.10)_λ is the convex hull of the points (u(1), v(1)), ..., (u(p), v(p)) thus obtained.
Assume for simplicity that λ is optimal and has all its coordinates positive. From Theorem 2.3.4 or Corollary 2.3.6, there is a convex combination of the (u(k), v(k)), k = 1, ..., p (hence a solution of the Lagrange problem) which satisfies the constraints (1.2.7)(iii) as equalities: this is a primal optimum. □

Remark 2.3.8 It is interesting to interpret the null-step mechanism of Chap. IX in our present duality framework. Use this mechanism to find a descent direction of the present function θ, but start it from a dual optimum, say λ̄. To make things simple, just consider the knapsack problem (2.2.2).
- At the optimal λ̄ = 1/2, the index-set (1.2.4) is I(1/2) = {1}, thus s₁ = −1; the black box suggests that the object is worth taking and that, to decrease θ, λ must be increased: 1/2 is an insufficient "price of space".
- For all λ > 1/2, I(λ) = ∅; the black box now suggests to leave the object and to decrease λ. During the first line-search in Algorithm IX.1.6, no improvement is obtained from θ, and s₂ = 1 ∈ ∂θ(1/2) is eventually produced.
- At this stage, the direction-finding problem (IX.1.8) detects optimality of λ̄; but more importantly, it produces α = 1/2 for the convex combination

αs₁ + (1 − α)s₂ = 0.

Calling u(1) (= 1) and u(2) (= 0) the decisions corresponding to s₁ and s₂, the convex combination ½[u(1) + u(2)] is the one revealed by Theorem 2.3.4.
Some important points are raised by this demonstration.
(i) In the favourable cases (no duality gap), constructing the zero vector in ∂θ(λ̄) corresponds to constructing an optimal solution in the primal problem.
(ii) Seen with "dual glasses", a problem with a duality gap behaves just as if Corollary 2.3.6(ii) applied. In the above knapsack problem, u = 1/2 is computed by the dual algorithm, even though half of a TV set is useless. This u = 1/2 would become a primal optimum if U = {0, 1} in (2.2.2) were replaced by its convex relaxation [0, 1]; but the dual algorithm would see no difference.
(iii) The bundling mechanism is able to play the role of the sophisticated black box wanted in Example 2.3.7. For this, an access is needed to the primal points computed by a simple black box, in addition to their constraint-values.
(iv) Remember Example IX.3.1.2: it can be hoped that the bundling mechanism generates a primal optimum with only a few calls to the simple black box, when it is started on an optimal λ̄; and it should do this independently of the complexity of the set U(λ̄). Suppose we had illustrated (iii) with the lotsizing problem of Example 2.3.7, rather than a trivial knapsack problem. We would have obtained 0 ∈ ∂θ(λ̄) with probably an order of m iterations; by contrast, the m₁ × ··· × m_n primal candidates furnished by a sophisticated black box should all be collected for a direct primal construction. □

We conclude with a comment concerning the two assumptions needed for Theorem 2.3.4:
- The behaviour of the dual function must be fully described by the optimal solutions of the Lagrange problem. This is the filling property (2.3.2), of topological nature, and fairly hard to establish (see §VI.4.4). Let us add that situations in which it does not hold can be considered as pathological.
- Some algebraic structure must exist in the solution-set of the Lagrange problem, at least at an optimal λ (Corollary 2.3.6). It can be considered as a "gift of God": it has no reason to exist a priori but, when it does, everything becomes straightforward.

2.4 Existence of Dual Solutions

The question is now whether the dual problem (2.1.6) has a solution λ̄ ∈ ℝᵐ. This implies first that the dual function θ is bounded from below; furthermore, it must attain its infimum. For our study, the relevant object is the image

C(U) := {y ∈ ℝᵐ : y = c(u) for some u ∈ U}   (2.4.1)

of the control set U under the constraint mapping c : U → ℝᵐ. Needless to say, an essential prerequisite for the primal problem (1.1.1) to make sense is 0 ∈ C(U); as far as the dual problem is concerned, however, cases in which 0 ∉ C(U) are also of interest.
First of all, denote by Γ the affine hull of C(U):

Γ := {y ∈ ℝᵐ : y = Σ_{j=1}^k α_j c(u_j), u_j ∈ U, Σ_{j=1}^k α_j = 1, k = 1, 2, ...},

and by Γ₀ the subspace parallel to Γ. Fixing u₀ ∈ U, we have c(u) − c(u₀) ∈ Γ₀ for all u ∈ U, hence

L(u, λ + μ) = L(u, λ) − μᵀc(u) = L(u, λ) − μᵀc(u₀) for all (λ, μ) ∈ ℝᵐ × Γ₀^⊥,

and this equality is inherited by the supremal values:

θ(λ + μ) = θ(λ) − μᵀc(u₀) for all λ ∈ ℝᵐ and μ ∈ Γ₀^⊥.   (2.4.2)

In other words, θ is affine on the subspace Γ₀^⊥. This observation clarifies two cases:
- If 0 ∉ Γ, let μ₀ ≠ 0 be the projection of the origin onto Γ; in (2.4.2), fix λ and take μ = tμ₀ with t → +∞. Because c(u₀) ∈ Γ, we have with (2.4.2)

θ(λ + tμ₀) = θ(λ) − tμ₀ᵀc(u₀) = θ(λ) − t‖μ₀‖² → −∞.

In this case, the primal problem cannot have a feasible point (weak duality Theorem 2.1.5); to become meaningful, (1.1.1) should have its constraints perturbed, say to c(u) = μ₀.
- The only interesting case is therefore 0 ∈ Γ. The dual optimal set is then Λ + Γ₀^⊥, where Λ ⊂ Γ₀ is the (possibly empty) optimal set of

inf {θ(λ) : λ ∈ Γ₀ = aff C(U) = lin C(U)}.   (2.4.3)

In a way, (2.4.3) is the relevant dual problem to consider, admitting that Γ₀ is known. Alternatively, the essence of the dual problem would remain unchanged if we assumed Γ = Γ₀ = ℝᵐ.
In §2.3, some convexity structure came into play for solving the primal problem
by duality; the same phenomenon occurs here. The next result contains the essential
conditions for existence of a dual solution.

Proposition 2.4.1 Assume θ ≢ +∞. With the definition (2.4.1) of C(U),

(i) if 0 ∉ cl co C(U), then inf_λ θ(λ) = −∞;
(ii) if 0 ∈ co C(U), then inf_λ θ(λ) > −∞;
(iii) if 0 ∈ ri co C(U), then the dual problem has a solution.
PROOF. [(i)] The closed convex set cl co C(U) is separated from {0} (Theorem III.4.1.3): for some μ₀ ≠ 0 and δ > 0,

−μ₀ᵀc(u) ≤ −δ < 0 for all u ∈ U.

Now let λ₀ be such that θ(λ₀) < +∞, and write for all u ∈ U:

L(u, λ₀ + tμ₀) = φ(u) − λ₀ᵀc(u) − tμ₀ᵀc(u)   [definition of L]
             ≤ θ(λ₀) − tμ₀ᵀc(u)   [definition of θ]
             ≤ θ(λ₀) − tδ.   [definition of μ₀]

Since u was arbitrary, this implies θ(λ₀ + tμ₀) ≤ θ(λ₀) − tδ, and the latter term tends to −∞ when t → +∞.
[(ii)] There are finitely many points in U, say u₁, ..., u_p, and some α in the unit simplex of ℝᵖ, such that 0 = Σ_{i=1}^p α_i c(u_i). Then write for all λ ∈ ℝᵐ:

θ(λ) ≥ φ(u_i) − λᵀc(u_i) for i = 1, ..., p,

and obtain by convex combination

θ(λ) ≥ Σ_{i=1}^p α_i φ(u_i) − λᵀ Σ_{i=1}^p α_i c(u_i) ≥ min_{1≤i≤p} φ(u_i),

where the second inequality holds by construction of {α_i, u_i}.
[(iii)] We know from Theorem III.2.1.3 that, if 0 ∈ ri co C(U), then 0 is also in the relative interior of some simplex contained in co C(U). Using the notation of (2.4.3), this means that there are finitely many points in U, say u₁, ..., u_p, and δ > 0 such that

B(0, δ) ∩ Γ₀ ⊂ co {c(u₁), ..., c(u_p)}.

For all λ ∈ ℝᵐ, still by definition of θ(λ),

θ(λ) ≥ φ(u_i) − λᵀc(u_i) for i = 1, ..., p.   (2.4.4)

If Γ₀ = {0}, i.e. C(U) = {0}, then L(u, λ) = φ(u) for all u ∈ U, so θ is a constant function; and this constant is not +∞ by assumption. If Γ₀ ≠ {0}, take 0 ≠ λ ∈ Γ₀ in (2.4.4) and

y := −δ λ/‖λ‖ ∈ B(0, δ) ∩ Γ₀;

this y is therefore a convex combination of the c(u_i)'s. The same convex combination in (2.4.4) gives

θ(λ) ≥ Σ_{i=1}^p α_i φ(u_i) + δ‖λ‖ ≥ min_{1≤i≤p} φ(u_i) + δ‖λ‖.

We conclude that θ(λ) → +∞ if λ grows unboundedly in Γ₀: the closed function θ (Proposition 2.2.2) does have a minimum on Γ₀, and on ℝᵐ as well, in view of (2.4.3). □

Figure 2.4.1 summarizes the different situations revealed by our above analysis. Thus, it is the convex hull of the image-set C(U) which is relevant; and, just as in §2.3, this convexification is crucial:

[Fig. 2.4.1. Various situations for dual existence; panels: no finite lower bound / ambiguous / dual existence]

Example 2.4.2 In view of the weak duality Theorem 2.1.5, existence of a feasible u in (1.1.1) implies boundedness of θ from below. To confirm the prediction of (ii), that this is not necessary, take a variant of the knapsack problem (2.2.2), in which the knapsack should be completely filled: we want to solve

sup u,  2u = 1, u ∈ {0, 1}.

This problem clearly has no feasible point: its supremal value is −∞. Nevertheless, the dual problem is essentially the same, and it still has the optimal value θ(1/2) = 1/2. Here, the duality gap is infinite. □

The entropy problem of §1.2(c) provides a few examples showing that the topological operations also play their role in Proposition 2.4.1.

Example 2.4.3 Take Ω = {0} (so that L¹(Ω, ℝ) = ℝ) and one equality constraint: our entropy problem is

sup φ(u),  u ∈ V, u − u₀ = 0,

where u₀ is fixed. We consider two particular cases:

V₁ = [0, +∞[ with φ₁(u) = −u log u,  V₂ = ]0, +∞[ with φ₂(u) = log u.

The corresponding sets C(V) are both convex: in fact C(V₁) = [−u₀, +∞[ and C(V₂) = ]−u₀, +∞[. The dual functions are easy to compute:

ℝ ∋ λ ↦ θ₁(λ) = e^(−1−λ) + λu₀,  θ₂(λ) = −log λ − 1 + λu₀ (for λ > 0).

As predicted by Proposition 2.4.1, the cases u₀ > 0 and u₀ < 0 are clear; but for u₀ = 0, we have 0 ∈ bd C(V_i) for i = 1, 2. Two situations are then illustrated: inf θ₁ = 0 and inf θ₂ = −∞; there is no dual solution. If we took

V₃ = [0, +∞[ with φ₃(u) = −½u²,

we would have

θ₃(λ) = ½λ² + u₀λ for λ ≤ 0,  θ₃(λ) = u₀λ for λ ≥ 0,

and a dual solution would exist for all u₀ ≥ 0. □

3 Illustrations

3.1 The Minimax Point of View

The Lagrange function (1.1.3) is not the only possibility to replace the primal problem (1.1.1) by something supposedly simpler. In the previous sections, we have followed a path

primal problem (1.1.1) → Lagrange problem (1.1.4)_λ →
feasibility problem (2.1.2) or (2.1.3) → dual problem (2.1.6).

This was just one instance of the following formal construction:
- define some set V (playing the role of ℝᵐ, the dual of the space of constraint-values), whose elements v ∈ V play the role of λ ∈ ℝᵐ;
- define some bivariate function ℓ : U × V → ℝ ∪ {+∞}, playing the role of the Lagrangian L;
- V and ℓ must satisfy the following property, for all u ∈ U:

inf_{v∈V} ℓ(u, v) = φ(u) if c(u) = 0, and −∞ otherwise;   (3.1.1)

then it is clear that (1.1.1) is equivalent to

sup_{u∈U} [inf_{v∈V} ℓ(u, v)].   (3.1.2)

- In addition, the function

Θ(v) := sup_{u∈U} ℓ(u, v)

must be easy to compute, as in the practical Assumption 1.1.2;
- the whole business in this general pattern is therefore to invert the inf- and sup-operations, replacing (1.1.1) = (3.1.2) by (2.1.6), which reads here

inf_{v∈V} [sup_{u∈U} ℓ(u, v)], i.e. inf_{v∈V} Θ(v).   (3.1.3)

- In order to make this last problem really easy, an extra requirement is therefore:

the function Θ and the set V are closed convex.   (3.1.4)

The above construction may sound artificial, but the ordinary Lagrange technique, precisely, discloses an instance where it is not. Its starting ingredient is the pairing (1.1.2) (a natural "pricing" in the space of constraint-values); then (3.1.1) is obviously satisfied by the Lagrange function ℓ = L of (1.1.3); furthermore, the latter is affine with respect to the dual variable v = λ, and this takes care of the requirement (3.1.4) for the dual function Θ = θ, provided that θ ≢ +∞ (Proposition IV.2.1.2).

Remark 3.1.1 In the dual approach, the basic idea is thus to replace the sup-inf problem (1.1.1) = (3.1.2) by its inf-sup form (2.1.6) = (3.1.3) (after having defined an appropriate scalar product in the space of constraint-values). The theory in §VII.4 tells us when this is going to work, and explains the role of the assumptions appearing in Sections 2.3 and 2.4:
- The function ℓ (i.e. L) is already closed and convex in λ; what is missing is its concavity with respect to u.
- Some topological properties (of L, U and/or V) are also necessary, so that the relevant extrema are attained.
Then we can apply Theorem VII.4.3.1: the Lagrange function has a saddle-point, the values in (3.1.2) and (3.1.3) are equal, there is no duality gap; the feasibility problem (2.1.3) is equivalent to the dual problem (2.1.6). This explains also the "invariance" property resulting from Corollary 2.1.8: the set of saddle-points was seen to be a Cartesian product. □

This point of view is very useful to "dualize" appropriately an optimization problem. The next subsections demonstrate it with some specific examples.

3.2 Inequality Constraints

Suppose that (1.1.1) has rather the form

sup_{u∈U} {φ(u) : c_j(u) ≤ 0 for j = 1, ..., p},   (3.2.1)

in which the constraint can be condensed as c(u) ∈ −(ℝ₊)ᵖ. Two examples were seen in §1.2; in both of them, nonnegative slack variables were included, so as to recover the equality-constrained framework; then the dual function was +∞ outside the nonnegative orthant.
The ideas of §3.1 can also be used: form the same Lagrange function as before,

U × V ∋ (u, μ) ↦ L(u, μ) := φ(u) − μᵀc(u),

but take

V := (ℝ₊)ᵖ = {μ ∈ ℝᵖ : μ_j ≥ 0 for j = 1, ..., p}.

This construction satisfies (3.1.1): if u violates some constraint, say c_j(u) > 0, fix all the μ_i at 0 except μ_j, which is sent to +∞; then the Lagrange function tends to −∞. The dual problem associated with (3.2.1) is therefore as before:

inf {θ(μ) : μ ∈ (ℝ₊)ᵖ}

(note that the nonnegative orthant (ℝ₊)ᵖ is closed convex: (3.1.4) is preserved if θ ≢ +∞).
Instead of introducing slack variables, we thus have a shortcut allowing the primal set U to remain unchanged; the price is modest: just add simple constraints in the dual problem.

Remark 3.2.1 More generally, a problem such as

sup_{u∈U} {φ(u) : a_i(u) = 0 for i = 1, ..., m, c_j(u) ≤ 0 for j = 1, ..., p}

can be dualized as follows: supremize over u ∈ U the Lagrange function

L(u, λ, μ) := φ(u) − Σ_{i=1}^m λ_i a_i(u) − Σ_{j=1}^p μ_j c_j(u)

to obtain the dual function θ : ℝᵐ × ℝᵖ → ℝ ∪ {+∞} as before; then solve the dual problem

inf {θ(λ, μ) : λ ∈ ℝᵐ, μ ∈ (ℝ₊)ᵖ}.   (3.2.2)

Naturally, other combinations are possible: the primal problem may be a minimization one, the linking constraints may be introduced with a "+" sign in the Lagrange function (1.1.3), and so on. Some mental exercise is required to make the link with §VII.4.5.
It is wise to adopt a definite strategy, so as to develop unconscious automatisms.
(i) We advise proceeding as follows: formulate the primal problem as

inf {φ(u) : u ∈ U, c_j(u) ≤ 0 for j = 1, ..., p}

(and also with some equality constraints if applicable). Then take the dual function

μ ↦ θ(μ) = inf_{u∈U} [φ(u) + Σ_{j=1}^p μ_j c_j(u)],

which must be maximized over (ℝ₊)ᵖ. It is concave and −c(u_μ) ∈ ∂(−θ)(μ). This was the strategy adopted in Chap. VII.
(ii) It is only for local needs that we adopt a different strategy in the present chapter: we prefer to minimize a convex dual function. The reason for the minus-sign in the Lagrange function (1.1.3) is to keep nonnegative dual variables.
In either technique (i), (ii), equality [resp. inequality] constraints give birth to unconstrained [resp. nonnegative] dual variables. □

Depending on the context, it may be simpler to impose dual nonnegativity constraints directly, or to use slack variables. Let us review the main results of §2 in an inequality-constrained context.

(a) Modified Everett's Theorem. For μ ∈ (ℝ₊)ᵖ, let u_μ maximize the Lagrangian φ(u) − μᵀc(u) associated with (3.2.1). Then u_μ solves

max φ(u),  u ∈ U,
c_j(u) ≤ c_j(u_μ) if μ_j > 0,   (3.2.3)
c_j(u) unconstrained otherwise.

A way to prove this is to use a slackened formulation of (3.2.1):

sup_{u,v} {φ(u) : u ∈ U, v ∈ (ℝ₊)ᵖ, c(u) + v = 0}.   (3.2.4)

Then the original Everett theorem directly applies; observe that any v_j maximizes the associated Lagrangian if μ_j = 0.

Naturally, u_μ is a primal point all the more interesting when (3.2.3) is less constrained; but constraints may be appended ad libitum, under the sole condition that they do not eliminate u_μ. For example, if u_μ is feasible, the last line of (3.2.3) can be replaced by

c_j(u) ≤ 0 [≥ c_j(u_μ)].

We see that, if u_μ is feasible in (3.2.1) and satisfies

c_j(u_μ) = 0 if μ_j > 0,

then u_μ solves (3.2.1). The wording "and satisfies the complementarity slackness" must be added to "feasible" wherever applicable in Sections 2.1 and 2.3 (for example in Corollary 2.1.8).

(b) Modified Filling Property. For a convenient primal-dual relationship in problems with inequality constraints, the filling assumption (2.3.2) is not sufficient. Let again θ(μ) = sup_u L(u, μ) be the dual function associated with (3.2.1); the basic result reproducing §2.3 is then as follows:

Proposition 3.2.2 With the above notation, assume that

∂θ(μ) = −co {c(u) : L(u, μ) = θ(μ)} for all μ ∈ (ℝ₊)ᵖ

and

θ(μ) < +∞ for some μ with μ_j > 0 for j = 1, ..., p.   (3.2.5)

Then μ̄ solves the dual problem if and only if there are k ≤ p + 1 primal points u¹, ..., uᵏ maximizing L(·, μ̄), and α = (α₁, ..., α_k) ∈ Δ_k such that

Σ_{i=1}^k α_i c_j(uⁱ) ≤ 0 and μ̄_j Σ_{i=1}^k α_i c_j(uⁱ) = 0 for j = 1, ..., p.

PROOF. The dual optimality condition is 0 ∈ ∂(θ + I_{(ℝ₊)ᵖ})(μ̄), so we need to compute the subdifferential of this sum of functions. With μ as stated in (3.2.5), there is a ball B(μ, δ) contained in the open orthant; and any such ball intersects ri dom θ (μ is adherent to ri dom θ). Then Corollary XI.3.1.2 gives

∂(θ + I_{(ℝ₊)ᵖ})(μ̄) = ∂θ(μ̄) + N_{(ℝ₊)ᵖ}(μ̄).

By assumption, a subgradient of θ is opposite to a convex combination Σ_{i=1}^k α_i c(uⁱ) of constraint-values at points uⁱ maximizing the Lagrangian. The rest follows without difficulty; see in particular Example VII.1.1.6 to compute the normal cone N_{(ℝ₊)ᵖ}. □

(c) Existence of Dual Solutions. Finally, let us examine how Proposition 2.4.1 can handle inequality constraints. Using slack variables, the image-set associated with (3.2.4) is

C′(U) = C(U) + (ℝ₊)ᵖ,

where C(U) is defined in (2.4.1). Clearly, aff C′(U) = ℝᵖ and we obtain:
(i) If 0 ∉ cl co [C(U) + (ℝ₊)ᵖ], then inf θ = −∞; unfortunately this closed convex hull is not easy to describe in simpler terms.
(ii) If 0 ∈ co C(U) + (ℝ₊)ᵖ, then inf θ > −∞. This comes from the easy relation

co C′(U) [= co [C(U) + (ℝ₊)ᵖ]] = co C(U) + (ℝ₊)ᵖ.

(iii) If 0 ∈ ri [co C(U) + (ℝ₊)ᵖ], then a dual solution exists. Interestingly enough, this is a Slater-type assumption:

Proposition 3.2.3 If there are k ≤ p + 1 points u¹, ..., uᵏ in U and convex multipliers α₁, ..., α_k such that

Σ_{i=1}^k α_i c_j(uⁱ) < 0 for j = 1, ..., p,

then the dual function associated with the optimization problem (3.2.1) has a minimum over the nonnegative orthant (ℝ₊)ᵖ.

PROOF. Our assumption means that 0 lies in co C(U) + int (ℝ₊)ᵖ, an open convex set:

0 ∈ co C(U) + int (ℝ₊)ᵖ ⊂ ri [co C(U) + (ℝ₊)ᵖ].

The result follows from Proposition 2.4.1(iii). □

3.3 Dualization of Linear Programs

As a particular case of §3.2, suppose that U is ℝⁿ, with some scalar product ⟨·,·⟩, and that (3.2.1) is

sup ⟨q, u⟩,  ⟨a_j, u⟩ ≤ b_j for j = 1, ..., p,   (3.3.1)

where q and each a_j are in ℝⁿ, b = (b₁, ..., b_p) ∈ ℝᵖ. The Lagrange function

ℝⁿ × ℝᵖ ∋ (u, μ) ↦ L(u, μ) := ⟨q − Σ_{j=1}^p μ_j a_j, u⟩ + Σ_{j=1}^p μ_j b_j

is "easy to maximize": θ(μ) = +∞ if [μ ∉ (ℝ₊)ᵖ or] q − Σ_j μ_j a_j ≠ 0. The dual problem is therefore

inf {Σ_{j=1}^p μ_j b_j : μ_j ≥ 0 for j = 1, ..., p, Σ_{j=1}^p μ_j a_j = q},   (3.3.2)

another linear program, posed in ℝᵖ.
Here, neither Assumption 1.1.5 nor the concept of a black box giving u_μ has much relevance: if μ satisfies the dual constraints (i.e. if μ ∈ dom θ, the only interesting case), any u ∈ ℝⁿ is a u_μ.

Remark 3.3.1 The dual constraints that appear in (3.2.2) and (3.3.2) simply express that the study can be restricted to dom θ, since θ must be minimized. Instead of (3.3.2), for example, the dual of (3.3.1) can obviously be formulated as the unconstrained minimization of the function

ℝᵖ ∋ μ ↦ θ(μ) = Σ_{j=1}^p μ_j b_j if μ ∈ (ℝ₊)ᵖ and Σ_{j=1}^p μ_j a_j = q, and θ(μ) = +∞ otherwise. □

Continuing with this example, consider the standard form of a linear program: assume that ⟨·,·⟩ is the usual dot-product and that our primal problem is now

sup {qᵀu : Au = b, uⁱ ≥ 0 for i = 1, ..., n}.   (3.3.3)

Here the matrix A has m rows, b ∈ ℝᵐ. It is natural to take as primal set U := (ℝ₊)ⁿ, so that the dual space is a priori ℝᵐ, m being the number of equalities. Then the Lagrange function

(ℝ₊)ⁿ × ℝᵐ ∋ (u, λ) ↦ L(u, λ) := qᵀu − λᵀ(Au − b) = (q − Aᵀλ)ᵀu + bᵀλ

is again "easy to maximize" over U: the maximal value is finite if and only if the vector q − Aᵀλ has all its coordinates nonpositive. In summary, the dual of (3.3.3) is

inf {bᵀλ : λ ∈ ℝᵐ, Aᵀλ − q ∈ (ℝ₊)ⁿ}.   (3.3.4)

The nonnegativity constraints in (3.3.3) can be dualized as well, even though the example in (1.2.1) does not suggest doing so, because they already have a decomposed form. In fact, nothing much interesting comes out of this extra dualization: the Lagrange function becomes

L(u, λ, μ) := qᵀu − λᵀ(Au − b) + μᵀu

and we obtain the new dual problem

inf {bᵀλ : qᵀ + μᵀ − λᵀA = 0, μ ∈ (ℝ₊)ⁿ},

in which μ is redundant and can be eliminated: the result is exactly (3.3.4).
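The primal-dual pair (3.3.3)/(3.3.4) is easy to check numerically; the following Python sketch does so with scipy's linprog on randomly generated data of our own choosing (q is taken nonpositive so that the primal value is finite):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, m = 5, 3
    A = rng.standard_normal((m, n))
    b = A @ rng.random(n)              # b = A u_feas: the primal is feasible
    q = -rng.random(n)                 # q <= 0: the primal value is bounded

    # primal (3.3.3): sup q^T u, Au = b, u >= 0 (linprog minimizes, so pass -q)
    primal = linprog(-q, A_eq=A, b_eq=b, bounds=[(0, None)] * n)
    # dual (3.3.4): inf b^T lam, A^T lam - q >= 0, lam free
    dual = linprog(b, A_ub=-A.T, b_ub=-q, bounds=[(None, None)] * m)
    print(-primal.fun, dual.fun)       # equal optimal values: no duality gap

This is of course nothing but standard linear programming duality, here obtained as a special case of the Lagrangian scheme.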

3.4 Dualization of Quadratic Programs

Let again U be ℝⁿ, equipped with the dot-product for simplicity, and consider

sup {qᵀu − ½uᵀQu : Au ≤ b}.   (3.4.1)

Here q ∈ ℝⁿ, b ∈ ℝᵖ, A is a p × n matrix, and the n × n matrix Q is symmetric positive definite. This problem has a unique solution if the feasible domain is nonempty. Choose the Lagrange function

ℝⁿ × ℝᵖ ∋ (u, μ) ↦ L(u, μ) := (q − Aᵀμ)ᵀu − ½uᵀQu + bᵀμ.

As a consequence of our assumptions, its maximum is attained at the unique

u_μ = Q⁻¹(q − Aᵀμ),   (3.4.2)

for which the corresponding constraint-value is

c(u_μ) = A u_μ − b = A Q⁻¹(q − Aᵀμ) − b.

Plugging the value (3.4.2) into L, we obtain the dual function

θ(μ) = ½(q − Aᵀμ)ᵀ Q⁻¹ (q − Aᵀμ) + bᵀμ,

to be minimized over (ℝ₊)ᵖ. Note that it is differentiable and its gradient illustrates Proposition 2.2.2:

∇θ(μ) = −A Q⁻¹(q − Aᵀμ) + b = −c(u_μ).

On the other hand, suppose that the constraints in (3.4.1) are equalities: Au = b; then the dual minimization is performed on the whole of ℝᵖ. If they exist, the dual solutions are the solutions of ∇θ(μ) = 0, which is the linear system

A Q⁻¹Aᵀ μ = A Q⁻¹ q − b.   (3.4.3)

Existence of such a solution implies first that b ∈ Im A, i.e. that there exists a primal feasible u. In this case we claim that the dual problem has one solution at least, and that all such solutions make up via (3.4.2) a unique point (the unique primal solution). This will illustrate Corollary 2.1.8.
The key lies in the following facts:
(i) The two subspaces Im Aᵀ and Ker A are two orthogonal generators of ℝⁿ. This implies:
(ii) when applying the positive definite operator Q or Q⁻¹ to one of them, we never obtain a nonzero vector in the other:

Q⁻¹(Im Aᵀ) ∩ Ker A = {0},  Q(Ker A) ∩ Im Aᵀ = {0}.   (3.4.4)

Hence:
(iii) we have further decompositions of ℝⁿ into two subspaces:

ℝⁿ = Im Aᵀ ⊕ Q(Ker A) = Q⁻¹(Im Aᵀ) ⊕ Ker A.   (3.4.5)

Then the proof of our claim goes as follows:

[Existence] If b ∈ Im A, take u₀ ∈ ℝⁿ such that b = −A Q⁻¹u₀; then use (3.4.5) to write q + u₀ = Aᵀμ₀ + Qv₀, with v₀ ∈ Ker A. In summary, (3.4.3) becomes

0 = A Q⁻¹(Aᵀμ − q − u₀) = A Q⁻¹Aᵀ(μ − μ₀) − Av₀,

which has the obvious solution μ = μ₀ (remember Av₀ = 0).

[Uniqueness] If μ₁ and μ₂ solve (3.4.3), with corresponding u₁ and u₂ maximizing L over ℝⁿ,

u₁ − u₂ = Q⁻¹Aᵀ(μ₂ − μ₁) and A(u₁ − u₂) = 0,

and we deduce

u₁ − u₂ ∈ Q⁻¹(Im Aᵀ) ∩ Ker A;

in view of (3.4.4), this implies u₁ − u₂ = 0.

The essence of the above "proof" is the fact that ⟨μ, ν⟩ := μᵀ A Q⁻¹Aᵀ ν is a scalar product on (Ker Aᵀ)^⊥ = Im A. When (3.4.3) has no solution, which means that the primal feasible set is empty, θ is not bounded from below: apply Proposition 2.4.1(i), knowing that C(U) is here a (closed convex) affine manifold.
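The equality-constrained case lends itself to a direct numerical sketch: solve the linear system (3.4.3) for μ, recover the primal point through (3.4.2), and check feasibility. The data below are arbitrary, built so that Q is positive definite and b ∈ Im A:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 6, 3
    M = rng.standard_normal((n, n))
    Q = M @ M.T + n * np.eye(n)        # symmetric positive definite
    A = rng.standard_normal((p, n))
    b = A @ rng.standard_normal(n)     # guarantees b in Im A
    q = rng.standard_normal(n)

    Qinv = np.linalg.inv(Q)
    # (3.4.3):  A Q^{-1} A^T mu = A Q^{-1} q - b
    mu = np.linalg.solve(A @ Qinv @ A.T, A @ Qinv @ q - b)
    u = Qinv @ (q - A.T @ mu)          # (3.4.2): the Lagrangian maximizer
    print(np.allclose(A @ u, b))       # True: u is feasible, hence optimal

As the text predicts, feasibility of u_μ is exactly the optimality condition ∇θ(μ) = 0.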

3.5 Steepest-Descent Directions

Consider the direction-finding problem of Chapters VIII and IX, assuming the Euclidean norming on ℝⁿ: ⦀·⦀ = ‖·‖. We had a number of (sub)gradients s₁, ..., s_k and we wanted to solve, for a given normalization parameter κ > 0:

min r,  r ∈ ℝ, d ∈ ℝⁿ,
r ≥ ⟨s_i, d⟩ for i = 1, ..., k,   (3.5.1)
‖d‖ = κ.

Adapting the notation (u is the couple (d, r), the dual variable is α ∈ ℝᵏ, and we now have a minimization problem), define the Lagrange function

ℝⁿ × ℝ × ℝᵏ ∋ (d, r, α) ↦ L(d, r, α) := (1 − Σ_{i=1}^k α_i) r + ⟨Σ_{i=1}^k α_i s_i, d⟩,   (3.5.2)

to be minimized, with fixed α, over

U := {(d, r) ∈ ℝⁿ⁺¹ : ‖d‖ = κ}.   (3.5.3)

The dual function (to be maximized) is "easy to compute": the dual constraint 1 − Σ_{i=1}^k α_i = 0 takes care of the unconstrained linear variable r; altogether, the dual variable has to vary over the unit simplex Δ_k of ℝᵏ.
Setting s(α) := Σ_{i=1}^k α_i s_i, the Lagrange problem associated with α ∈ Δ_k then has solutions (d_α, r_α) defined as follows:

d_α = −(κ/‖s(α)‖) s(α) if s(α) ≠ 0, and d_α arbitrary of norm κ otherwise,   (3.5.4)

and r_α is arbitrary, its coefficient in (3.5.2) being 0. Finally the dual problem is

max {−κ‖s(α)‖ : α ∈ Δ_k}.

Now, once this dual problem is solved, say at some ᾱ, there are two cases. If the corresponding optimal s(ᾱ) is nonzero, the primal solution d is recovered from (3.5.4) (the corresponding r is easy to recover). If the optimal s(ᾱ) is zero, nothing useful is obtained from the dual approach, except the information that 0 is in the convex hull of the s_i's; and this was good enough in the case of interest.
In this example, Corollary 2.3.6(i) applies if s(ᾱ) ≠ 0, which means that s(α) ≠ 0 for all α ∈ Δ_k: even though the original problem (3.5.1) is not convex, the dual function is differentiable everywhere (at least relative to its domain).
By contrast, suppose s(ᾱ) = 0. Then the dual function is differentiable at no dual optimum, simply because ‖·‖ is not differentiable at 0; its subdifferential can be computed with the help of the calculus developed in §VI.4: to within the normalization coefficient κ, we obtain the ellipsoid

{Σ_{i=1}^k α_i s_i : Σ_{i=1}^k α_i² ≤ 1}.

Having a dual optimal solution ᾱ is then of absolutely no help to find a primal solution: the Lagrange function L(·, ·, ᾱ) of (3.5.2) is optimal at all feasible points of (3.5.1); duality has killed the role of the objective function.

Remark 3.5.1 Formulating (3.5.1) with slack variables, say p_i, we obtain the new Lagrange function (with obvious notation)

L(d, r, p, α) := (1 − Σ_{i=1}^k α_i) r + Σ_{i=1}^k α_i p_i + ⟨Σ_{i=1}^k α_i s_i, d⟩,

to be minimized over the new primal set

U := {(d, r, p) ∈ ℝⁿ⁺ᵏ⁺¹ : ‖d‖ = κ, p_i ≥ 0 for i = 1, ..., k}.

Now, the minimizers of L(·, ·, ·, ᾱ) must have p_i = 0 if ᾱ_i > 0, and the normal situation is that none of these minimizers satisfies the (slackened) constraints

r = p_i + ⟨s_i, d⟩ for i = 1, ..., k.

Take for example, with n = 1, k = 2 and κ = 1: s₁ = −1 and s₂ = ρ ∈ [0, 1[, so that the steepest-descent direction is d = 1; the unique dual solution (α₁, α₂) is α₁ = ρ/(1 + ρ), α₂ = 1/(1 + ρ), and the difference between the primal and dual optimal values is ρ ≥ 0.
- If ρ > 0 (there is a duality gap), the slacks must be p₁ = p₂ = 0; no matter how we choose d in {−1, +1}, r cannot be adjusted to satisfy the two dualized equalities.
- If ρ = 0 (no duality gap), p₁ is arbitrary nonnegative and we can for example choose d = 1, r = 0 = ρd and p₁ = 1; this minimizes the Lagrange function and solves the primal problem as well. □
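The little example above can be checked by brute force; the following sketch minimizes κ‖s(α)‖ over a fine grid of the simplex and compares with the closed-form dual solution (the value ρ = 0.4 is an arbitrary choice of ours):

    import numpy as np

    rho, kappa = 0.4, 1.0
    s = np.array([-1.0, rho])
    a1 = np.linspace(0.0, 1.0, 100001)           # alpha = (a1, 1 - a1)
    norm_s = np.abs(a1 * s[0] + (1 - a1) * s[1])
    i = int(np.argmin(norm_s))
    print(a1[i], rho / (1 + rho))                # both ~ 0.2857: s(alpha) = 0
    print(-kappa * norm_s[i])                    # dual value ~ 0; gap = rho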

Needless to say, all gap-problems would disappear if the normalization constraint in (3.5.1) were ‖d‖ ≤ κ; it is this idea that was exploited in §VIII.1.2.
Finally, there is an interesting variant: dualize in (3.5.1) the normalization constraint only, to obtain the Lagrange function

L(d, r, α₀) := r + α₀(‖d‖² − κ²),

which has the single dual variable α₀. Put

θ(α₀) := − inf_{(d,r)∈U} L(d, r, α₀),   (3.5.5)

where U is the primal domain

U := {(d, r) ∈ ℝⁿ⁺¹ : r ≥ ⟨s_i, d⟩ for i = 1, ..., k}.   (3.5.6)

In contrast to (3.5.4), the Lagrange minimization problem of (3.5.5) cannot be solved explicitly, so this approach does not do much good for solving (3.5.1) numerically. It provides some interesting illustrations, though: σ_S denoting the support function of S := co {s₁, ..., s_k}, the following properties hold for the Lagrange problem

θ(α₀) = − inf_{d∈ℝⁿ} [α₀(‖d‖² − κ²) + σ_S(d)].   (3.5.7)

- dom θ ⊂ ℝ₊;
- θ(0) = 0 if 0 ∈ S, +∞ if not;
- θ is finite on the positive half-line;
- since θ ∈ Conv ℝ, lim_{α₀↓0} θ(α₀) = θ(0);
- at any α₀ > 0, θ has the derivative θ′(α₀) = κ² − ‖d(α₀)‖², where d(α₀) is the unique optimal solution in (3.5.7);
- if θ is minimal at some α₀ > 0, then d(α₀) has norm κ and solves the original problem (Corollary 2.3.6);
- if θ is minimal at 0, there are two cases, illustrated by Remark 3.5.1: either ‖d(α₀)‖ ↑ κ when α₀ ↓ 0 and there is no duality gap (this was the case ρ > 0); or ‖d(α₀)‖ has a limit strictly smaller than κ, and the original problem (3.5.1) is not solvable via duality.

4 Classical Dual Algorithms

In this section, we review the most classical algorithms aimed at solving the dual problem (2.1.6). We neglect for the moment the primal aspect of the question, namely the task (ii) mentioned in Remark 2.2.3. According to the various results of Sections 2.1, 2.2, our situation is as follows:
- we must minimize a convex function (the dual function θ);
- the only information at hand is the black box that solves the Lagrange problem (1.1.4)_λ, for each given λ ∈ ℝᵐ;
- Assumption 1.1.5 is in force; thus, for each λ ∈ ℝᵐ, the black box computes the number θ(λ) and some s(λ) ∈ ∂θ(λ).
The present problem is therefore just the one considered in Chap. IX; formally, it is also that of Chap. II, with the technical difference that the function λ ↦ s(λ) is not continuous. The general approach will also be the same: at the kth iteration, the black box is called (i.e. the Lagrange problem is solved at λ_k) to obtain θ(λ_k) and s(λ_k); then the (k+1)st iterate λ_{k+1} is computed. A dual algorithm is thus characterized by the set of rules giving λ_{k+1}.
Note here that our present task is after all nothing but developing general algorithms which minimize a convex function finite everywhere, and which work with the help of the above-mentioned information (function- and subgradient-values). It is fair to say that most of the known algorithms of this type have actually been motivated precisely by the need to solve dual problems.
Note also that all this is valid again with no assumption on the primal problem (1.1.1), except Assumption 1.1.5.

4.1 Subgradient Optimization

The simplest algorithm for convex minimization is directly inspired by the gradient method of Definition II.2.2.2: the next iterate is sought along the direction −s(λ_k) issuing from λ_k. There is a serious difficulty, however, which has been the central issue in Chap. IX: no line-search is possible based on decreasing θ, simply because −s(λ_k) may not be a descent direction; or it may be so weak that the resulting sequence {λ_k} would not minimize θ. The relevant algorithm is then as follows:

Algorithm 4.1.1 (Basic Subgradient Algorithm) A sequence {t_k} is given, with t_k > 0 for k = 1, 2, ...

STEP 0 (initialization). Choose λ₁ ∈ ℝᵐ and obtain s₁ ∈ ∂θ(λ₁). Set k = 1.
STEP 1. If s_k = 0 stop. Otherwise set

λ_{k+1} = λ_k − t_k s_k/‖s_k‖.   (4.1.1)

STEP 2. Obtain s_{k+1} ∈ ∂θ(λ_{k+1}). Replace k by k + 1 and loop to Step 1. □
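Algorithm 4.1.1 transcribes directly into code. In the sketch below, theta_and_subgrad stands for the black box of Assumption 1.1.5 and must be supplied by the user; the test function θ(λ) = ‖λ‖₁, whose minimum is at 0, and all helper names are our own choices:

    import numpy as np

    def subgradient_method(theta_and_subgrad, lam, steps):
        best_lam, best_val = lam, theta_and_subgrad(lam)[0]
        for t in steps:                        # t_k > 0, t_k -> 0, sum = inf
            val, s = theta_and_subgrad(lam)
            if val < best_val:                 # record the best point, since
                best_lam, best_val = lam, val  # {theta(lam_k)} is not monotone
            if np.linalg.norm(s) == 0:
                break                          # 0 in subdifferential: optimal
            lam = lam - t * s / np.linalg.norm(s)   # rule (4.1.1)
        return best_lam, best_val

    bb = lambda lam: (np.abs(lam).sum(), np.sign(lam))   # subgradient of |.|_1
    lam0 = np.array([2.0, -3.0])
    steps = (1.0 / k for k in range(1, 2001))   # satisfies (4.1.2) and (4.1.3)
    print(subgradient_method(bb, lam0, steps))  # best value near 0

Note that the code keeps track of the best value seen so far, anticipating the sequence introduced before Theorem 4.1.4 below.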

[Fig. 4.1.1. An anti-subgradient gets closer to any better point]

Needless to say, each subgradient is obtained via some u_λ solving the Lagrange problem at the corresponding λ; hence the need for Assumption 1.1.5. Note that the function-values θ(λ_k) are never used by this algorithm, whose motivation is as follows (see Fig. 4.1.1): if μ is better than λ_k, we obtain from the subgradient inequality

s_kᵀ(μ − λ_k) ≤ θ(μ) − θ(λ_k) < 0,

which means that the angle between the direction of move −s_k and the "nice" direction μ − λ_k is acute. If our move along that direction is small enough, we get closer to μ.
From this interpretation, the stepsize t_k should be small, and we will require

t_k → 0 for k → +∞.   (4.1.2)

On the other hand, observe that ‖λ_{k+1} − λ_k‖ = t_k and the triangle inequality implies that all iterates are confined in some ball: for k = 1, 2, ...

λ_k ∈ B(λ₁, T), where T := Σ_{k=1}^∞ t_k.
Suppose θ has a nonempty set of minima; then Algorithm 4.1.1 has no chance of producing a minimizing sequence if B(λ₁, T) does not intersect this set: in spite of (4.1.2), the stepsizes should not be too small, and we will require

Σ_{k=1}^∞ t_k = +∞.   (4.1.3)

Our proof of convergence is not the simplest, nor the oldest; but it motivates Lemma 4.1.3 below, which is of interest in its own right.

Lemma 4.1.2 Let {t_k} be a sequence of positive numbers satisfying (4.1.2), (4.1.3) and set

τ_n := Σ_{k=1}^n t_k,  ρ_n := Σ_{k=1}^n t_k².   (4.1.4)

Then (τ_n → +∞ and) ρ_n/τ_n → 0 when n → +∞.

PROOF. Fix δ > 0; there is some n(δ) such that t_k ≤ δ for all k ≥ n(δ); then

ρ_n = ρ_{n(δ)−1} + Σ_{k=n(δ)}^n t_k² ≤ ρ_{n(δ)−1} + δ Σ_{k=n(δ)}^n t_k ≤ ρ_{n(δ)−1} + δτ_n,

so that

ρ_n/τ_n ≤ ρ_{n(δ)−1}/τ_n + δ for all n > n(δ);

thus, lim sup ρ_n/τ_n ≤ δ. The result follows since δ > 0 was arbitrary. □

Lemma 4.1.3 Let θ : ℝᵐ → ℝ be convex and fix λ̃ ∈ ℝᵐ. For all λ ∈ ℝᵐ such that θ(λ) > θ(λ̃) and for all s ∈ ∂θ(λ), set

d(λ) := ⟨s, λ − λ̃⟩/‖s‖ > 0.   (4.1.5)

Given M > 0, there exists L > 0 such that

d(λ) ≤ M ⟹ [0 <] θ(λ) − θ(λ̃) ≤ L d(λ).

PROOF. Positivity of d(λ) is clear from the subgradient inequality written at λ. Let μ(λ) be the projection of λ̃ onto the hyperplane

{μ ∈ ℝᵐ : ⟨s, μ − λ⟩ = 0}.

An easy calculation gives ‖μ(λ) − λ̃‖ = d(λ) and Fig. 4.1.2 shows that θ(μ(λ)) ≥ θ(λ). If μ(λ) ∈ B(λ̃, M) as assumed, the local Lipschitz property of θ (Theorem IV.3.1.2) implies the existence of L such that

θ(λ) − θ(λ̃) ≤ θ(μ(λ)) − θ(λ̃) ≤ L‖μ(λ) − λ̃‖ = L d(λ),

which is the required inequality. □

[Fig. 4.1.2. A technical majoration]

This last result says that d(λ) is a sort of "distance" between λ and λ̃, with respect to which θ has a locally Lipschitzian behaviour.
Finally, we introduce the sequence of best values generated by Algorithm 4.1.1:

θ̂_k := min {θ(λ_i) : i = 1, ..., k},

which is needed because {θ(λ_k)} is not monotone. The whole issue is whether these best function-values tend to the infimum of θ over ℝᵐ (a number in ℝ ∪ {−∞}).

Theorem 4.1.4 Apply Algorithm 4.1.1 to the convex function θ : ℝᵐ → ℝ and let the stepsizes satisfy (4.1.2), (4.1.3). Then

θ̂_k → inf_{λ∈ℝᵐ} θ(λ) when k → +∞.

PROOF. Assume for contradiction the existence of λ̃ ∈ ℝᵐ and η > 0 such that

θ(λ_k) ≥ θ(λ̃) + η for all k;   (4.1.6)

then develop

‖λ̃ − λ_{k+1}‖² = ‖λ̃ − λ_k‖² + 2(λ̃ − λ_k)ᵀ(λ_k − λ_{k+1}) + ‖λ_k − λ_{k+1}‖²
              = ‖λ̃ − λ_k‖² − 2 t_k d(λ_k) + t_k²,

where the notation (4.1.5) is used: the triple (λ̃, λ_k, s_k) enters the framework of Lemma 4.1.3. For n = 1, 2, ..., set δ_n := min_{k=1,...,n} d(λ_k), so that, summing from 1 to n:

[‖λ̃ − λ_{n+1}‖² +] 2 δ_n Σ_{k=1}^n t_k ≤ ‖λ̃ − λ₁‖² + Σ_{k=1}^n t_k² for all n.

From Lemma 4.1.2, it follows that δ_n → 0 when n → +∞.
Thus, we have an infinite subset K of integers such that lim_{k∈K} d(λ_k) = 0. Now apply Lemma 4.1.3: {d(λ_k)}_{k∈K} is bounded and

lim_{k∈K} [θ(λ_k) − θ(λ̃)] = 0,

which is a suitable contradiction to (4.1.6). □
Algorithm 4.1.1 is often called the subgradient method, or sometimes relaxation; both terminologies are rather ambiguous. Note a peculiarity: there is no convenient stopping criterion (in particular, s_k has no reason to tend to 0!). The algorithm must in fact be stopped "manually", when t_k is small compared to the scale of the problem.

Remark 4.1.5 We mention here that the convergence assumptions concerning (4.1.4) are somewhat bizarre. A computer cannot represent positive numbers under a certain threshold, say m₀ > 0. As a result:
- Charybdis: to satisfy (4.1.3), the computer must set t_k ≥ m₀ for all k; then (4.1.2) becomes impossible.
- Scylla: alternatively, setting t_k = 0 for k large enough is even worse. No update is performed by (4.1.1) and the algorithm has to stop; but at this point, the series Σ t_k has not diverged yet.
On the other hand, this kind of argument is not totally convincing: the mere concept "ε_k → 0" is important in numerical analysis; yet, who can be patient enough to check whether a sequence is convergent, and to obtain its limit? We shall only say that requiring a property resembling (4.1.2), (4.1.3) is a bit sloppy with respect to finite arithmetic. □

Naturally, if the dual function appears to be differentiable everywhere, any of the "normal" methodologies of Chap. II can be used. In this framework, the steepest-descent method (in the dual space) is usually called Uzawa's method, and is valid in the "ideal" situation: U convex, φ strictly concave and c affine.

4.2 The Cutting-Plane Algorithm

To ease our exposition, we assume now that a compact convex set C ⊂ ℝᵐ is known to contain a dual solution. In view of the sup-form of θ, our dual problem (2.1.6) can then be written

min r,  r ∈ ℝ, λ ∈ C,
r ≥ L(u, λ) for all u ∈ U.   (4.2.1)

This is a semi-infinite programming problem, which would become easy if it had only finitely many constraints, i.e. if U were a finite set; taking for example a compact convex polyhedron for C, (4.2.1) would become an ordinary linear program. Then the basic idea of the cutting-plane algorithm is quite natural: accumulate the constraints one after the other in (4.2.1). Furthermore, take advantage of the fact that the constraint-index u can be restricted to the smaller set of (2.2.3).

Algorithm 4.2.1 (Basic Cutting-Plane Algorithm) The compact convex set C and the stopping tolerance ε ≥ 0 are given.

STEP 0 (initialization). Choose λ₀ ∈ C and solve the Lagrange problem at λ₀ to obtain u₀ := u_{λ₀}. Set k = 1.
STEP 1 (master problem). Solve the following relaxation of (4.2.1):

min r,  r ∈ ℝ, λ ∈ C,
r ≥ L(u_i, λ) for i = 0, ..., k − 1,   (4.2.2)

to obtain a solution (r_k, λ_k).
STEP 2 (local problem). Solve the Lagrange problem (1.1.4)_{λ_k} to obtain a next primal point u_k := u_{λ_k}.
STEP 3 (stopping criterion and loop). If

θ(λ_k) ≤ r_k + ε,   (4.2.3)

then stop. Otherwise replace k by k + 1 and loop to Step 1. □


Typically, C is a parallelotope characterized by known bounds on the components of λ; then (4.2.2) is a linear program with 2m + k constraints. It is important to realize that the above description can be made without any reference to primal points u: the Lagrange function being affine in λ, we have

L(u_i, λ) = L(u_i, λ_i) + [s(λ_i)]ᵀ(λ − λ_i) = θ(λ_i) + [s(λ_i)]ᵀ(λ − λ_i),   (4.2.4)

valid for all λ and i = 0, 1, ... Thus, (4.2.2) can be written

min r,  r ∈ ℝ, λ ∈ C,
r ≥ θ(λ_i) + [s(λ_i)]ᵀ(λ − λ_i) for i = 0, ..., k − 1,   (4.2.5)

which involves only dual objects, namely θ- and s-values.

Remark 4.2.2 When we write (4.2.2) in the form (4.2.5), we do nothing but prove again Proposition 2.2.2: c(u_λ) ∈ −∂θ(λ). When we hope that (4.2.5) does approximate (4.2.1), we rely upon the property proved in Theorem XI.1.3.8: the convex function θ can be expressed as a supremum of affine functions, namely

θ(λ) = sup {θ(μ) + [s(μ)]ᵀ(λ − μ) : μ ∈ ℝᵐ} for all λ ∈ ℝᵐ,

s(μ) being arbitrary in ∂θ(μ). □
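For illustration, here is a sketch of Algorithm 4.2.1 in the dual-only form (4.2.5), with C a box [lo, hi]ᵐ and the master problem, a linear program in (λ, r), solved by scipy's linprog; the black box interface and all helper names are our own choices, and the test function is the knapsack dual θ(λ) = max(λ, 1 − λ):

    import numpy as np
    from scipy.optimize import linprog

    def cutting_plane(black_box, lam0, lo, hi, eps=1e-6, itmax=100):
        m = lam0.size
        val, sub = black_box(lam0)
        lams, vals, subs = [lam0], [val], [sub]      # the bundle of cuts
        lam = lam0
        for _ in range(itmax):
            # master (4.2.5); cut i reads s_i^T lam - r <= s_i^T lam_i - theta_i
            A_ub = np.hstack([np.array(subs), -np.ones((len(lams), 1))])
            b_ub = np.array([s @ l - v for l, v, s in zip(lams, vals, subs)])
            c = np.zeros(m + 1)
            c[-1] = 1.0                              # minimize r
            res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                          bounds=[(lo, hi)] * m + [(None, None)])
            lam, r = res.x[:m], res.x[-1]
            val, sub = black_box(lam)                # local problem, Step 2
            if val <= r + eps:                       # stopping test (4.2.3)
                break
            lams.append(lam); vals.append(val); subs.append(sub)
        return lam, val

    bb = lambda lam: ((lam[0], np.array([1.0])) if lam[0] >= 1 - lam[0]
                      else (1.0 - lam[0], np.array([-1.0])))
    print(cutting_plane(bb, np.array([0.9]), 0.0, 1.0))   # lam = 0.5, value 0.5

On this tiny example the stop occurs after the second master problem, the two accumulated cuts r ≥ λ and r ≥ 1 − λ already describing θ exactly.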

The convergence of Algorithm 4.2.1 is easy to establish:

Theorem 4.2.3 With C ⊂ ℝᵐ convex compact and θ convex from ℝᵐ to ℝ, consider the optimal value

θ_C := min {θ(λ) : λ ∈ C}.

In Algorithm 4.2.1, r_k ≤ θ_C for all k and the following convergence properties hold:
- If ε > 0, the stop occurs at some iteration k_ε, with λ_{k_ε} satisfying

θ(λ_{k_ε}) ≤ θ_C + ε.   (4.2.6)

- If ε = 0, the sequences {r_k} and {θ(λ_k)} tend to θ_C when k → +∞.

PROOF. Because (4.2.2) = (4.2.5) is less constrained than (4.2.1), the inequality r_k ≤ θ_C is clear, and the optimality condition (4.2.6) does hold when the stopping criterion (4.2.3) is satisfied.
Now suppose for contradiction that the stop never occurs: for k = 1, 2, ...

θ(λ_k) > r_k + ε ≥ θ(λ_i) + [s(λ_i)]ᵀ(λ_k − λ_i) + ε for all i < k,

hence

−ε > θ(λ_i) − θ(λ_k) − ‖s(λ_i)‖ ‖λ_k − λ_i‖.

Because θ is Lipschitzian on the bounded set C (Theorem IV.3.1.2 and Proposition VI.6.2.2), there is L such that

2L ‖λ_k − λ_i‖ > ε for all i < k = 1, 2, ...

This is incompatible with the boundedness of {λ_k} ⊂ C.
Because ε > 0 was arbitrary and because r_k ≤ θ_C ≤ θ(λ_k) for all k, we have actually just proved that θ(λ_k) − r_k → 0 when k → +∞ (which can happen only if ε = 0); then θ_C is the common limit of {r_k} and {θ(λ_k)}. □

If ε = 0, Algorithm 4.2.1 normally loops forever. An interesting case, however, is when U is a finite set: then the algorithm stops anyway, even if ε = 0. This is implied by the following result.

Proposition 4.2.4 Here no assumption is made on C in Algorithm 4.2.1. The u_k generated at Step 2 is different from all points u₀, ..., u_{k−1}, unless the stop in Step 3 is going to operate.

PROOF. If u_k = u_i for some i ≤ k − 1, then

θ(λ_k) = L(u_k, λ_k) = L(u_i, λ_k) ≤ r_k,

where the second relation relies on (4.2.2). □

The presence of a compact set C in the basic cutting-plane Algorithm 4.2.1 is not only motivated by Theorem 4.2.3. More importantly, (4.2.5) would normally have no solution if C were unbounded: with C = ℝᵐ, think for example of the first iteration, having only one constraint on r and s(λ₀) ≠ 0. This is definitely a drawback of the algorithm, which corresponds to an inherent instability: the sequence {λ_k} must be artificially stabilized.
In some favourable circumstances, C can eventually be eliminated. This usually happens when: (i) the dual function θ has a bounded set of minima, and (ii) enough dual iterations have been performed, so that the "linearized sublevel-set"

{λ ∈ ℝᵐ : θ(λ_i) + [s(λ_i)]ᵀ(λ − λ_i) ≤ r for i = 0, ..., k − 1}

is bounded (for some level r: remember Proposition IV.3.2.5); it may even be included in C. Under these circumstances, it becomes clear that (4.2.5) can be replaced by

min r,  r ∈ ℝ, λ ∈ ℝᵐ,
r ≥ θ(λ_i) + [s(λ_i)]ᵀ(λ − λ_i) for i = 0, ..., k − 1.   (4.2.7)

The dualization of (4.2.7) is interesting. Call α ∈ ℝᵏ the dual variable, form the Lagrange function L(r, λ, α), and apply the technique of §3.3 to obtain the dual linear program

max Σ_{i=0}^{k−1} α_i [θ(λ_i) − [s(λ_i)]ᵀλ_i],  α ∈ ℝᵏ,
Σ_{i=0}^{k−1} α_i s(λ_i) = 0,
Σ_{i=0}^{k−1} α_i = 1 and α_i ≥ 0 for i = 0, ..., k − 1.

Not unexpectedly, this problem can be written in primal notation: since s(λ_i) = −c(u_i) and knowing that Δ_k is the unit simplex of ℝᵏ, we actually have to solve

max_{α∈Δ_k} {Σ_{i=0}^{k−1} α_i φ(u_i) : Σ_{i=0}^{k−1} α_i c(u_i) = 0}.   (4.2.8)

When stated with (4.2.8) instead of (4.2.7), the cutting-plane Algorithm 4.2.1 (a row-generation mechanism) is called Dantzig-Wolfe's algorithm (a column-generation mechanism). In terms of the original problem (1.1.1), it is much more suggestive: basically, φ(·) and c(·) are replaced in (4.2.8) by appropriate convex combinations. Accordingly, the point

Δ_k ∋ α ↦ u(α) := Σ_{i=0}^{k−1} α_i u_i

can reasonably be viewed as an approximate primal solution:

Theorem 4.2.5 Let the convexity conditions of Corollary 2.3.6(ii) hold (U convex, φ concave, c affine) and suppose that Algorithm 4.2.1, applied to the convex function θ : ℝᵐ → ℝ, can be used with (4.2.7) instead of (4.2.5). When the stop occurs, denote by ᾱ ∈ Δ_k an optimal solution of (4.2.8). Then u(ᾱ) is an ε-solution of the primal problem (1.1.1).

PROOF. Observe first that u(ᾱ) is feasible by construction. Also, there is no duality gap between the two linear programs (4.2.7) and (4.2.8), so we have (use the concavity of φ)

r_k = Σ_{i=0}^{k−1} ᾱ_i φ(u_i) ≤ φ(u(ᾱ)).

The stopping criterion and the weak duality Theorem 2.1.5 terminate the proof. □

A final comment: the convergence results above heavily rely on the fact that all the iterates u_i are stored and taken into account for (4.2.2). No trick similar to the compression mechanism of §IX.2.1 seems possible in general. We will return to the cutting-plane algorithm in Chap. XV.
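The primal recovery (4.2.8) is itself a small linear program in α, given the bundle of values φ(u_i) and constraints c(u_i) collected by the algorithm; here is a sketch, tested on the two-point knapsack bundle of Remark 2.3.8 (the function name is our own):

    import numpy as np
    from scipy.optimize import linprog

    def primal_recovery(phis, cs):
        # (4.2.8): max sum a_i phi_i  s.t.  sum a_i c(u_i) = 0, a in simplex
        k = len(phis)
        A_eq = np.vstack([np.array(cs).T, np.ones((1, k))])
        b_eq = np.zeros(A_eq.shape[0])
        b_eq[-1] = 1.0
        res = linprog(-np.array(phis), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * k)
        return res.x

    # bundle: u in {1, 0}, phi(u) = u, c(u) = 2u - 1
    print(primal_recovery([1.0, 0.0], [[1.0], [-1.0]]))   # alpha = (1/2, 1/2)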

5 Putting the Method in Perspective

On several occasions, we have observed connections between the present dual approach and various concepts from the previous chapters. These connections were by no means fortuitous, and exploiting them allows fruitful interpretations and extensions. In this section, Assumption 1.1.5 is not in force; in particular, θ may assume the value +∞.

5.1 The Primal Function

From its very definition, the dual function strongly connotes the conjugacy operation of Chap. X. This becomes even more suggestive if we introduce the artificial variable y = c(u) (the coefficient of λ); such a trick is always very fruitful when dealing with duality. Here we consider the following function (remember §IV.2.4):

ℝᵐ ∋ y ↦ P(y) := −sup {φ(u) : u ∈ U, c(u) = y}.   (5.1.1)

In other words, the right-hand side of the constraints in the primal problem (1.1.1) is considered as a varying parameter y, the supremal value being a function of that parameter; and the original problem is recovered when y is fixed at 0. It is this primal function that yields the expected conjugacy correspondence. The reason for the change of sign is that convexity comes in in a more handy way.

Theorem 5.1.1 Suppose dom θ ≠ ∅. Then

P*(λ) = θ(−λ) for all λ ∈ ℝᵐ.

PROOF. By definition,

P*(λ) = sup_y [λᵀy − P(y)]
      = sup_y {sup_{u∈U} [λᵀy + φ(u) : c(u) = y]}
      = sup_{u∈U} [λᵀc(u) + φ(u)] = θ(−λ). □

Remark 5.1.2 The property dom θ ≠ ∅ should be expressed by a condition on the primal data (U, φ, c). As seen in the beginning of Chap. X, dom θ = −dom P* is (up to a symmetry) the set of slopes minorizing P. The condition in question is therefore: for some (r, μ) ∈ ℝ × ℝᵐ,

P(y) ≥ r + μᵀy for all y ∈ ℝᵐ.

By definition of P, this exactly means

φ(u) ≤ −r − μᵀc(u) for all u ∈ U,

i.e. θ(−μ) ≤ −r < +∞.

This property holds for example when φ is bounded from above on U (take μ = 0). In view of Proposition IV.1.2.1, it also holds if P ∈ Conv ℝᵐ, which in turn is true if (use the notation (2.4.1), Theorem IV.2.4.2, and remember that φ is assumed finite-valued on U ≠ ∅):

U ⊂ ℝⁿ is a convex set, −φ ∈ Conv ℝⁿ, c : ℝⁿ → ℝᵐ is linear, and, for all y ∈ C(U), φ is bounded from above on c⁻¹(y).   (5.1.2)

Theorem 5.1.1 explains why convexity popped up in Sections 2.3 and 2.4: being a conjugate function, θ does not distinguish the starting function P from its closed convex hull. As a first consequence, the closed convex hull of P is readily obtained:

P**(y) = (cl co P)(y) = θ*(−y) for all y ∈ ℝᵐ.

This sheds some light on the connection between the primal-dual pair of problems (1.1.1) and (2.1.6):
(i) When it exists, the duality gap is the number

inf_{ℝᵐ} θ − (−P(0)) = −θ*(0) + P(0) = P(0) − (cl co P)(0),

i.e. the discrepancy at 0 between P and its closed convex hull. There is no duality gap if P ∈ Conv ℝᵐ, or at least if P has a "closed convex behaviour near 0"; see (iii) below.
In the situation (5.1.2), P ∈ Conv ℝᵐ, so cl co P = cl P. In this case, there is no duality gap if P(0) = cl P(0).
(ii) The minima of a closed convex function have been characterized in (X.1.4.6), which becomes here

Argmin {θ(λ) : λ ∈ ℝᵐ} = ∂θ*(0) = −∂(cl co P)(0).

Using Theorem X.1.4.2, a sufficient condition for existence of a dual solution is therefore 0 ∈ ri dom (cl co P), and this explains Proposition 2.4.1(iii).
(iii) Absence of a duality gap means P(0) = (cl co P)(0) [< +∞], and we know from Proposition X.1.4.3(ii) that ∂P(0) = ∂(cl co P)(0) in this case. Thus, the primal problems that are amenable to duality are those where P has a sublinearization at 0:

a dual solution exists and there is no duality gap ⟺ ∂P(0) ≠ ∅.

Do not believe, however, that the λ-problem (2.1.3) has in this case a solution λ̄ [∈ −∂P(0)]: existence of a solution to the Lagrange problem (1.1.4)_λ̄ and the filling property (2.3.2) at λ̄ are still needed for this.
(iv) The optimization problem (5.1.1) has a dualization in its own right; it gives

L^y(u, λ) := L(u, λ) + λᵀy,  θ^y(λ) := θ(λ) + λᵀy,

where L = L⁰ and θ = θ⁰ are the Lagrangian and dual function associated with the original problem (U, φ, c). Assume that there is no duality gap, not only at y = 0, but in a whole neighborhood of 0; in other words,

−P(y) = inf_λ θ^y(λ) = inf_λ [θ(λ) + λᵀy] for ‖y‖ small enough.

This proves once more Theorem 5.1.1, but also explains Theorem VII.3.3.2, giving an expression for the subdifferential of a closed convex primal function: even though (5.1.1) does not suggest it, P is in this case a sup-function and its subdifferential can be obtained from the calculus rules of §VI.4.4.
(v) The most interesting observation is perhaps that the dual problem actually solves the "closed convex version" of the primal problem.
This latter object is rather complicated, but it simplifies in some cases, for example when the data (U, φ, c) are as follows:

U is a bounded subset of ℝⁿ,
φ(u) = ⟨q, u⟩ is linear,   (5.1.3)
c(u) = Au − b is affine.
Proposition 5.1.3 Consider the dual problem associated with (U, φ, c) in (5.1.3). Its infimal value is the supremal value in

sup {⟨q, u⟩ : u ∈ cl co U, Au − b = 0}.   (5.1.4)

Furthermore, assume that this dual problem has some optimal solution λ̄; then the solutions of (5.1.4) are those

u ∈ (cl co U)(λ̄) = {u ∈ cl co U : L(u, λ̄) = θ(λ̄)}

that satisfy Au = b.

PROOF. In the case of (5.1.3), we recognize in (5.1.1) the definition of an image-function:

P(y − b) = inf {I_U(u) − ⟨q, u⟩ : Au = y} for all y ∈ ℝᵐ;

let us compute its biconjugate. Thanks to the boundedness of U, Theorem X.2.1.1 applies and, using various calculus rules in X.1.3.1, we obtain

P*(λ) = (I_U − ⟨q, ·⟩)*(A*λ) − λᵀb = σ_U(A*λ + q) − λᵀb for all λ ∈ ℝᵐ.

The support function σ_U = σ_{cl co U} is finite everywhere; Theorem X.2.2.1 applies and, using again X.1.3.1, we conclude

(cl co P)(y − b) = inf {I_{cl co U}(u) − ⟨q, u⟩ : Au = y}
               = −sup {⟨q, u⟩ : u ∈ cl co U, Au = y} for all y ∈ ℝᵐ.

In other words, closing the duality gap amounts to replacing U by cl co U.
To finish the proof, observe that the primal problem (5.1.4) satisfies the assumptions of Lemma 2.3.2 and Corollary 2.3.6(ii). □

Note: it is only for simplicity that we have assumed U bounded; the result would still hold with finer hypotheses relating Ker A to the asymptotic cone of cl co U. An interesting instance of (5.1.3) is integer linear programming. Consider the primal problem in ℕⁿ (a sort of generalized knapsack problem)

inf cᵀx
Ax = a ∈ ℝᵐ,  Bx − b ∈ −(ℝ₊)ᵖ [i.e. Bx ≤ b ∈ ℝᵖ],   (5.1.5)
xⁱ ∈ ℕ, xⁱ ≤ x̄ for i = 1, ..., n.

Here x̄ is some positive integer, introduced again to avoid technicalities; the next result then uses the duality scheme of (5.1.3), all the linking constraints being dualized.

Corollary 5.1.4 The dual optimal value associated with (5.1.5) is the optimal value of

inf cᵀx
Ax = a,  Bx − b ∈ −(ℝ₊)ᵖ,   (5.1.6)
0 ≤ xⁱ ≤ x̄ for i = 1, ..., n.

PROOF. Introduce slack variables in (5.1.5) to describe the inequality constraints as Bx + z = b, z ≥ 0. Because x is bounded, z is bounded as well, say zʲ ≤ z̄ for j = 1, ..., p. Then we are in the situation (5.1.3) with

U := {(x, z) ∈ ℕⁿ × [0, z̄]ᵖ : xⁱ ≤ x̄ for i = 1, ..., n},

whose closed convex hull is clearly [0, x̄]ⁿ × [0, z̄]ᵖ. Thus Proposition 5.1.3 applies, and it suffices to eliminate the slacks to obtain (5.1.6). □

Naturally, (5.1.6) is a standard linear program, to which the primal-dual relations of §3.3 can be applied. The duality technique studied in the present chapter is usually called Lagrangian relaxation: the constraint c(u) = 0 in (1.1.1) is somehow relaxed when we maximize the Lagrangian (1.1.3). The linear program (5.1.6) is likewise called the convex relaxation of (5.1.5): the integrality constraints are relaxed, to form a convex minimization problem. The message of Corollary 5.1.4 is that both techniques are equivalent for integer linear programming.
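Corollary 5.1.4 can be observed numerically on a tiny instance of (5.1.5) without equality constraints; the instance below is our own, and the Lagrangian dual is simply evaluated on a grid of multipliers (sign conventions of Remark 3.2.1(i), the primal being a minimization):

    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    c = np.array([-3.0, -2.0]); B = np.array([[2.0, 1.0]]); b = np.array([2.0])
    xbar = 1
    # continuous relaxation (5.1.6)
    lp = linprog(c, A_ub=B, b_ub=b, bounds=[(0, xbar)] * 2)
    # Lagrangian dual: theta(mu) = min over integer x of c^T x + mu^T (Bx - b),
    # to be maximized over mu >= 0
    X = list(product(range(xbar + 1), repeat=2))
    theta = lambda mu: min(c @ np.array(x) + mu @ (B @ np.array(x) - b)
                           for x in X)
    dual = max(theta(np.array([m])) for m in np.linspace(0.0, 10.0, 2001))
    print(lp.fun, dual)   # both equal -3.5, while the integer optimum is -3

The common value −3.5 of the two relaxations, strictly below the integer optimal value −3, displays the duality gap announced by the theory.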

5.2 Augmented Lagrangians

We saw in §2 that an important issue was uniqueness of a solution to the Lagrange


problem (l.1.4);... More precisely, the important property was single-valuedness of the
multifunction J.. f--+ C(J..) of (2.3.3). Such a property would imply, under the filling
property (2.3.2):
- differentiability of the dual function,
- direct availability of a primal solution, from the black box called at a dual optimum.
To get a singleton for C (J..), a wise idea is to make the Lagrangian strictly concave
with respect to the "variable" c(u); and this in turn might become possible if we
subtract for example II c( u ) 112 from the Lagrangian. Indeed, for given t ;:?: 0, the problem

sup [cp(u) - !tllc(u)1I 2 ] u E U,


I c(u) = 0 E IRm
(5.2.1)
182 XII. Abstract Duality for Practitioners

is evidently equivalent to (1.1.1): it has the same feasible set, and the same objective
function there. In the dual space, however, this equivalence no longer holds: the
Lagrangian associated with (5.2.1) is

Lt(u, J..) := qJ(u) - 4tllc(u)1I2 - J.. T c(u) = L(u, J..) - 4tllc(u)1I2,

called the augmented Lagrangian associated with (1.1.1) (knowing that L = Lo is the
"ordinary" Lagrangian). Correspondingly, we have the "augmented dual function"

et(J..) := sup Lt(u, J..) • (5.2.2)


UEU

Remark 5.2.1 (Inequality Constraints) When the problem has inequality constraints, as
in (3.2.1), the augmentation goes as follows: with slack variables, we have

  U × (ℝ₊)^p × ℝ^p ∋ (u, v, μ) ↦ L_t(u, v, μ) = φ(u) − μ^T[c(u) + v] − ½t‖c(u) + v‖²,

which can be maximized with respect to v ∈ (ℝ₊)^p: we obtain

  v_j = v_j(u, μ_j) = max {0, −c_j(u) − μ_j/t}  for j = 1,…,p.

Working out the calculations, the "non-slackened augmented Lagrangian" boils down to

  U × ℝ^p ∋ (u, μ) ↦ l_t(u, μ) = φ(u) − Σ_{j=1}^p π_t(c_j(u), μ_j),

where the function π_t : ℝ² → ℝ is defined by

  π_t(y, μ) = μy + ½ty²   if ty ≥ −μ,
  π_t(y, μ) = −μ²/(2t)    if ty ≤ −μ.    (5.2.3)

In words: the j-th constraint is "augmentedly dualized" if it is frankly violated; otherwise,
it is neglected, but a correcting term is added to obtain continuity in u. Note: the dual variables
μ_j are no longer constrained, and they appear in a "more strictly convex" way. □
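A small computational sketch of this construction may help (the functions `phi` and `c` below are hypothetical placeholders, not data from the text): it implements π_t of (5.2.3) and the non-slackened augmented Lagrangian l_t.

```python
import numpy as np

def pi_t(y, mu, t):
    """The function pi_t of (5.2.3): augmented dualization of one
    inequality constraint with value y and multiplier mu."""
    if t * y >= -mu:            # constraint violated or nearly active
        return mu * y + 0.5 * t * y**2
    return -mu**2 / (2.0 * t)   # constraint well satisfied: correcting term

def l_t(u, mu, t, phi, c):
    """Non-slackened augmented Lagrangian l_t(u, mu)."""
    y = c(u)                    # vector of constraint values c_j(u)
    return phi(u) - sum(pi_t(yj, mj, t) for yj, mj in zip(y, mu))

# toy illustration: phi(u) = -u^2, one constraint c_1(u) = 1 - u <= 0
val = l_t(2.0, np.array([1.0]), t=10.0,
          phi=lambda u: -u**2, c=lambda u: np.array([1.0 - u]))
print(val)
```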

A mere application of the weak duality theorem to the primal-dual pair (5.2.1),
(5.2.2) gives

  θ_t(λ) ≥ φ(u)    (5.2.4)

for all t, all λ ∈ ℝ^m and all u feasible in (5.2.1) or (1.1.1). On the other hand, the
obvious inequality L_t ≤ L extends to the dual functions: θ_t ≤ θ, i.e. the augmented
Lagrangian approach cannot worsen the duality gap. Indeed, this approach turns out
to be efficient when the "lack of convexity" of the primal function can be corrected
by a quadratic term:

Theorem 5.2.2 With P defined by (5.1.1), suppose that there are t_0 ≥ 0 and λ̄ ∈ ℝ^m
such that
  P(y) ≥ P(0) − λ̄^T y − ½t_0‖y‖²  for all y ∈ ℝ^m.    (5.2.5)
Then, for all t ≥ t_0, there is no duality gap associated with the augmented Lagrangian
L_t, and actually

  −P(0) = θ_t(λ̄) ≤ θ_t(λ)  for all λ ∈ ℝ^m.



PROOF. When (5.2.5) holds, it holds also with t_0 replaced by any t ≥ t_0, and then we
have for all y:

  −P(0) ≥ −P(y) − λ̄^T y − ½t‖y‖²
        = sup_{u∈U} {φ(u) − λ̄^T y − ½t‖y‖² : c(u) = y}
        = sup_{u∈U} {L_t(u, λ̄) : c(u) = y}.

Since y was arbitrary, we conclude that

  −P(0) ≥ sup_{u,y} {L_t(u, λ̄) : c(u) = y} = θ_t(λ̄).

Remember the definition (5.1.1) of P: this means precisely that there is no duality
gap, and that λ̄ minimizes θ_t. □

This result is to be compared with our comment in §5.1(iii): λ̄ satisfies (5.2.5) if
and only if it minimizes θ_{t_0}. Needless to say, (5.2.5) is more tolerant than the property
∂P(0) ≠ ∅. Actually, the primal function associated with (5.2.1) is

  −P_t(y) := sup_{u∈U} {φ(u) − ½t‖c(u)‖² : c(u) = y} = −P(y) − ½t‖y‖²

and (5.2.5) just means ∂P_{t_0}(0) ≠ ∅ (attention: the calculus rule on a sum of subdif-
ferentials is absurd for P_t = P + ½t‖·‖², simply because P is not convex!).
The perturbed primal function establishes an interesting connection between the
augmented Lagrangian and the Moreau-Yosida regularization of Example XI.3.4.4:

Proposition 5.2.3 Suppose P ∈ Conv ℝ^m; then, for all t > 0,

  θ_t(λ) = min {θ(μ) + (1/(2t))‖μ − λ‖² : μ ∈ ℝ^m}.    (5.2.6)

PROOF. Apply Theorem 5.1.1 to the perturbed primal function P_t:

  θ_t(−λ) = P_t*(λ) = (P + ½t‖·‖²)*(λ).

The squared-norm function is finite-valued everywhere, so the conjugate of this
sum of closed convex functions is given by Theorem X.2.3.2 as an infimal convolution:

  θ_t(λ) = (P* ∔ (1/(2t))‖·‖²)(−λ) = min {θ(−μ) + (1/(2t))‖ν‖² : μ + ν = −λ},

and this is exactly (5.2.6). □
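Numerically, (5.2.6) says that θ_t is the Moreau-Yosida regularization of θ with parameter 1/t. A minimal one-dimensional sketch (θ below is an arbitrary illustrative convex function, not one from the text) approximates it on a grid:

```python
import numpy as np

def moreau_yosida(theta, grid, lam, t):
    """Approximate (5.2.6): min over mu of theta(mu) + |mu-lam|^2/(2t),
    by brute force over a discretization `grid` of the mu-axis."""
    return np.min(theta(grid) + (grid - lam) ** 2 / (2.0 * t))

theta = lambda mu: np.abs(mu) + 0.5          # a nonsmooth convex theta
grid = np.linspace(-5.0, 5.0, 20001)
for lam in (-1.0, 0.0, 2.0):
    print(lam, moreau_yosida(theta, grid, lam, t=1.0))
```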
Thus, suppose that our initial problem (1.1.1) is such that P + ½t_0‖·‖² ∈ Conv ℝ^m for
some t_0 ≥ 0. Then the augmented Lagrangian approach becomes extremely efficient:
- It suppresses the duality gap (Theorem 5.2.2).
- It smoothes out the dual function: use (5.2.6) to realize that, for t > t_0,

  θ_t = θ_{t_0} ∔ (1/(2(t − t_0)))‖·‖²

is a Moreau-Yosida regularization; as such, it has a Lipschitzian gradient mapping with
constant 1/(t − t_0) on ℝ^m, see Example XI.3.4.4.

- We will see in Chap. XV that there are sound numerical algorithms computing a Moreau-
Yosida regularization, at least approximately; then the gradient ∇θ_t (or at least approxima-
tions of it) will be available "for free", see (XI.3.4.8).

However, the augmented Lagrangian technique suffers a serious drawback: it usually
kills the practical Assumption 1.1.2. The reader can convince himself that, in all examples
of §1.2, the augmented Lagrangian no longer has a decomposed structure; roughly speaking,
if a method is conceived to maximize it, the same method will solve the original problem
(1.1.1) as well, or at least its penalized form

  sup_{u∈U} [φ(u) − ½t‖c(u)‖²]

(see §VII.3.2 for example). We conclude that, in practice, a crude use of the augmented
Lagrangian is rarely possible. On the other hand, it can be quite useful in theory, particularly
for various interpretational exercises: remember our comments at the end of §VII.3.2.

Example 5.2.4 Consider the simple knapsack problem (2.2.2), whose optimal value is 0,
but for which the minimal value of θ is 1/2 (see Remark 2.3.5 again). The primal function
is plotted in Fig. 5.2.1; observe that the value at 0 of its convex hull is −1/2 (the opposite
of the optimal dual value). The graph of P_t is obtained by bending upwards the graph of P,
i.e. adding ½t‖·‖²; for t ≥ t_0 = 2, the discontinuity at y = 1 is lifted high enough to yield
(5.2.5) with λ̄ = 0. In view of Theorem 5.2.2, the duality gap is suppressed.

Fig. 5.2.1. The primal function in a knapsack problem

Calculations confirm this happy event: with the help of (5.2.3), the augmented dual
function can be computed explicitly (we neglect large values of λ); this shows that, if t ≥ 2
and λ is close to 0, θ_t(λ) = λ²/(2t), whose minimal value is 0.

Examining Fig. 5.2.1, the reader can convince himself that the same phenomenon occurs
for an arbitrary knapsack problem; and even more generally, for any integer program such as
(5.1.5). In other words, any integer linear programming problem is equivalent to minimizing
the (convex) augmented dual function: an easy problem. The price for this "miracle" is the
(expensive) maximization of the augmented Lagrangian. □
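The phenomenon is easy to check numerically on a toy equality-constrained knapsack (the data below are hypothetical, not those of (2.2.2)): with U finite, the dual functions are computed by enumeration, and min θ_t reaches the true primal value for t large enough, while min θ keeps a gap.

```python
import itertools
import numpy as np

# sup { phi(u) : c(u) = 0, u in {0,1}^3 }  -- made-up data, with a gap
w = np.array([2.0, 2.0, 3.0]); cap = 4.0      # c(u) = w.u - cap
p = np.array([3.0, 3.0, 5.0])                 # phi(u) = p.u; optimum = 6
U = [np.array(u) for u in itertools.product([0, 1], repeat=3)]

def theta_t(lam, t):
    # augmented dual (5.2.2), computed by enumeration of the finite U
    return max(p @ u - lam * (w @ u - cap) - 0.5 * t * (w @ u - cap) ** 2
               for u in U)

lams = np.linspace(-10.0, 10.0, 4001)
for t in (0.0, 1.0, 10.0):                    # t = 0: ordinary dual (gap)
    print(t, min(theta_t(lam, t) for lam in lams))
```

On this instance the run prints a dual value of 6.5 for t = 0 and 6.0 (the primal value) once t is large enough.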

To conclude, we mention that (5.2.1) is not the only possible augmentation. Closed
convexity of P_t implies ∂P_t(y) ≠ ∅ for "many" y (namely on the whole relative
interior of dom P_t = dom P), a somewhat luxurious property: as far as closing the
duality gap is concerned, it suffices to have ∂P_t(0) ≠ ∅ (Theorem 5.2.2). In terms of
(1.1.1), this latter property means that it is possible to add some stiff enough quadratic
term to P, thus obtaining a perturbed primal function which "looks convex near 0".
Other devices may be more appropriate in some other situations:

Example 5.2.5 Suppose that there are λ̄ ∈ ℝ^m and t_0 > 0 such that

  P(y) ≥ P(0) − λ̄^T y − t_0‖y‖  for all y ∈ ℝ^m;

in other words, the same apparent convexity near y = 0 is yielded by some steep
enough sublinear term added to P. Then consider for t > 0 the class of problems

  −P_t(y) := sup_{u∈U} {φ(u) − t‖c(u)‖ : c(u) = y} = −P(y) − t‖y‖,    (5.2.7)

going together with the augmented dualization

  l_t(u, λ) := L(u, λ) − t‖c(u)‖,  θ_t(λ) := sup_{u∈U} l_t(u, λ).

This latter augmentation does suppress the duality gap at y = 0: reproduce the
proof of Theorem 5.2.2 to obtain

  −P_t(0) = inf_{ℝ^m} θ_t = θ_t(λ̄)  for t ≥ t_0.

Concerning the regularization ability proved by Proposition 5.2.3, we have the fol-
lowing: t‖·‖ is the support function of the ball B(0, t) (§V.2.3); its conjugate is the
indicator function of B(0, t) (Example X.1.1.5); computing the infimal convolution
of the latter with a function such as θ gives at λ the value inf_{μ−λ∈B(0,t)} θ(μ). In
summary, Proposition 5.2.3 can be reproduced to establish the correspondence

  θ_t(λ) = inf {θ(μ) : ‖μ − λ‖ ≤ t}.

Finally note the connection with the exact penalty technique of §VII.1.2. We know
from Corollary VII.3.2.3 that, under appropriate assumptions and for t large enough,
P_t(0) is not changed if the constraint c(u) = 0 is removed from (5.2.7). □

5.3 The Dualization Scheme in Various Situations

In §1.2, we have applied duality for the actual solution of some practical optimization
problems; more examples have been seen in §3. We now review a few situations where
duality can be extremely useful also for theoretical purposes.

(a) Constraints with Values in a Cone. Consider abstractly a primal problem posed
under the form

  sup φ(u),  u ∈ U,  c(u) ∈ K.    (5.3.1)

In (1.1.1) for example, we had K = {0} ⊂ ℝ^m; in (3.2.1), K was the nonpositive
orthant of ℝ^p. More generally, we take here for K a closed convex cone in some
finite-dimensional vector space, call it ℝ^m, equipped with a scalar product ⟨·, ·⟩; and
K° will be the polar cone of K.
The Lagrange function is then conveniently defined as

  U × K° ∋ (u, λ) ↦ L(u, λ) = φ(u) − ⟨λ, c(u)⟩,    (5.3.2)

and the dual function is θ = sup_{u∈U} L(u, ·) as before. This dualization enters the
framework of §3.1, as is shown by an easy check of (3.1.1):

  inf_{λ∈K°} L(u, λ) = φ(u) − sup_{λ∈K°} ⟨λ, c(u)⟩ = φ(u) − I_K(c(u)),

where the last equality simply relates the support and indicator functions of a closed
convex cone; see Example V.2.3.1. The weak duality theorem

  θ(λ) ≥ φ(u)  for all (u, λ) ∈ U × K° with u feasible

follows (or could be checked in the first place).


Slack variables can be used in (5.3.1):

c(U) E K {::::::} [c(u) - v = 0, V E K] ,

a formulation useful to adapt Everett's Theorem 2.1.1. Indeed the slackened Lagrang-
ian is
U x K x KO 3 (u, v, A) 1-+ qJ(U) - (A, c(u» + (A, v) ,
and a key is to observe that (A, v) stays constant when v describes the face FK(A)
of K exposed by A E KO. In other words, ifu)., maximizes the Lagrangian (5.3.2),
the whole set {u).,} x F K (A) C U x K maximizes its slackened version. Everett's
theorem becomes: a maximizer u)., of the Lagrangian (5.3.2) solves the perturbed
primal problem
sup {qJ(u) : c(u) - c(u)..) E K - FK(A)}.
ueU
If c(u).,) E FK(A), the feasible set in this problem contains that of (5.3.1). Thus,
Corollary 2.1.2 becomes: if some u).. maximizing the Lagrangian is feasible and
satisfies the complementarity slackness (A, c(u).,») = 0, then this u).. is optimal in
(5.3.1).
The results from Sections 2.3 and 2.4 can also be adapted via a slight generalization
of §3 .2(b) and (c), which is left to the reader. We just mention that in §3.2, the cones
K and KO were both full-dimensional, yielding some simplification in the formulae.
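For a concrete feel of this framework, a toy check of weak duality with K the nonpositive orthant (so that K° = (ℝ₊)^p) can be done by enumeration; all data below are hypothetical.

```python
import itertools
import numpy as np

# sup { phi(u) : c(u) in K = nonpositive orthant }, u in {0,1}^2
phi = lambda u: 3.0 * u[0] + 2.0 * u[1]
c = lambda u: np.array([u[0] + u[1] - 1.0])       # i.e. c(u) <= 0
U = [np.array(u) for u in itertools.product([0, 1], repeat=2)]

def theta(lam):                                   # lam in K° = (R_+)^1
    return max(phi(u) - lam @ c(u) for u in U)

feas = [u for u in U if np.all(c(u) <= 0)]
best = max(phi(u) for u in feas)                  # primal value: 3
for lam in (np.array([0.0]), np.array([1.0]), np.array([2.5])):
    assert theta(lam) >= best                     # weak duality
    print(lam, theta(lam))
```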

(b) Penalization of the Constraints. We mentioned in §1 that the constraints in
(1.1.1) should be "soft", i.e. some tolerance should be accepted on the feasibility of
primal solutions; this allows an approximate minimization of θ. A sensible idea is
then to penalize these soft constraints, for example replacing (1.1.1) by

  sup_{u∈U} [φ(u) − ½t‖c(u)‖²],    (5.3.3)

for a given penalty parameter t > 0 (supposed to be "large"); any concept of feasibility
has disappeared. In terms of the practical problem to be solved, constraint violations
are replaced by a price to pay; a quadratic price may not be particularly realistic, but
other penalizations could be chosen as well (remember Chap. VII, more specifically
its Sections 1.2 and 3.2).

Even under this form, our primal problem is still amenable to duality, thus pre-
serving the possible advantage of the approach, for example in decomposable cases.
Introduce the additional variable v ∈ ℝ^m and formulate (5.3.3) as

  sup [φ(u) − ½t‖v‖²],  u ∈ U, v ∈ ℝ^m,  c(u) = v;    (5.3.4)

note the difference with (5.2.1): the extra variable v is free, while it was frozen to 0 in
the case of an augmented Lagrangian. We find ourselves in a constrained optimization
framework, and we can define the Lagrangian

  L_t(u, v, λ) := φ(u) − ½t‖v‖² − λ^T[c(u) − v] = L(u, λ) + λ^T v − ½t‖v‖²

for all (u, v, λ) ∈ U × ℝ^m × ℝ^m. The corresponding dual function is easy to compute:
the same Lagrangian black box as before is used for u (another difference with §5.2),
and the maximization in v is explicit: sup_v [λ^T v − ½t‖v‖²] is attained at v = λ/t, so that

  θ_t(λ) := sup_{u,v} L_t(u, v, λ) = θ(λ) + (1/(2t))‖λ‖².

A first interesting observation is that θ_t always has a minimum (provided that θ
itself was not identically +∞). This was indeed predicted by Proposition 2.4.1(iii):
for our problem (5.3.4), the image-set defined by (2.4.1) is the whole space ℝ^m.

Second, consider the primal function associated with (5.3.4):

  ℝ^m ∋ y ↦ −P_t(y) := sup_{u∈U} [φ(u) − ½t‖c(u) − y‖²];    (5.3.5)
once again, the Moreau-Yosida regularization comes into play (cf. Proposition 5.2.3):

Proposition 5.3.1 Let t > 0; with the notation (5.1.1), (5.3.5), there holds

  P_t(y) = inf_{z∈ℝ^m} [P(z) + ½t‖z − y‖²].

PROOF. Associativity of infima is much used. By definition of P,

  inf_z [P(z) + ½t‖z − y‖²] = inf_{z,u} {−φ(u) + ½t‖z − y‖² : c(u) = z}
    = inf_u [−φ(u) + ½t inf_{z=c(u)} ‖z − y‖²]
    = inf_u [−φ(u) + ½t‖c(u) − y‖²] = P_t(y). □

This result was not totally unexpected: in view of Theorem 5.1.1, the conjugate of P_t is
the sum θ_t of two closed convex functions (barring the change of sign). No wonder, then, that
P_t is an infimal convolution: remember Corollary X.2.1.3. Note, however, that the starting
function P is not in Conv ℝ^m, so a proof is really needed.

Remark 5.3.2 Compare Proposition 5.3.1 with Proposition 5.2.3. Augmenting the Lagrang-
ian and penalizing the primal constraints are operations somehow conjugate to each other: in
one case, the primal function is a sum and the dual function is an infimal convolution; and
vice versa. Note an important difference, however: closed convexity of P_t was essential for
Proposition 5.2.3, while Proposition 5.3.1 is totally general.

This difference has a profound reason, outlined in Example 5.2.4: the augmented La-
grangian is able to suppress the duality gap, a very powerful property, which therefore requires
some assumption; by contrast, penalization brings nothing more than existence of a dual so-
lution. □

(c) Mixed Optimality Conditions for Constrained Minimization Problems. In
Chap. VII, we have studied constrained minimization problems from two viewpoints:
§VII.1 made an abstract study, the feasible set being just x ∈ C; §VII.2 was analytical,
with a feasible set described by equalities and inequalities. Consider now a problem
where both forms occur:

  inf f(x),  x ∈ C_0,
  Ax = b,    (5.3.6)
  c_j(x) ≤ 0  for j = 1,…,p  [c(x) ≤ 0 for short].

Here f ∈ Conv ℝ^n, C_0 ⊂ ℝ^n is a nonempty closed convex set intersecting dom f,
A is linear from ℝ^n to ℝ^m, and each c_j : ℝ^n → ℝ is convex. Thus we now accept
an extended-valued objective function; but the constraint-functions are still assumed
finite-valued.

This problem enters the general framework of the present chapter if we take the
Lagrangian

  ℝ^n × ℝ^m × (ℝ₊)^p ∋ (x, λ, μ) ↦ L(x, λ, μ) = f(x) + λ^T(Ax − b) + μ^T c(x)

and the corresponding dual function

  ℝ^m × (ℝ₊)^p ∋ (λ, μ) ↦ θ(λ, μ) = −inf {L(x, λ, μ) : x ∈ C_0}.

The control space is now U = C_0 ∩ dom f. The whole issue is then whether there
is a dual solution, and whether the filling property (2.3.2) holds; altogether, these
properties will guarantee the existence of a saddle-point, i.e. of a primal-dual solution-
pair.

Theorem 5.3.3 With the above notation, make the following Slater-type assumption:

  there is x_0 ∈ (ri dom f) ∩ ri C_0 such that Ax_0 = b and
  c_j(x_0) < 0 for j = 1,…,p.

A solution x̄ of (5.3.6) is characterized by the existence of λ = (λ_1,…,λ_m) ∈ ℝ^m
and μ = (μ_1,…,μ_p) ∈ ℝ^p such that

  0 ∈ ∂f(x̄) + A^T λ + Σ_{j=1}^p μ_j ∂c_j(x̄) + N_{C_0}(x̄),    (5.3.7)

  μ_j ≥ 0 and μ_j c_j(x̄) = 0 for j = 1,…,p.

PROOF. Use the notation

  C_1 := {x ∈ ℝ^n : Ax = b},  C_2 := {x ∈ ℝ^n : −c(x) ∈ (ℝ₊)^p},

so that the solutions of (5.3.6) are those x̄ satisfying

  0 ∈ ∂(f + I_{C_0} + I_{C_1} + I_{C_2})(x̄).

The x_0 postulated in our Slater assumption is in the intersection of the three sets

  ri dom f ∩ ri C_0   [by assumption]
  ri C_1              [= C_1]
  ri C_2.             [= int C_2: see (VI.1.3.5)]

The calculus rule XI.3.1.2 can therefore be invoked: x̄ solves (5.3.6) if and only if

  0 ∈ ∂f(x̄) + N_{C_0}(x̄) + N_{C_1}(x̄) + N_{C_2}(x̄).

Then it suffices to express the last two normal cones, which was done for example in
§VII.2.2. □

Of course, (5.3.7) means that L(·, λ, μ) has a subgradient at x̄ whose opposite
is normal to C_0 at x̄; thus, x̄ solves the Lagrangian problem associated with (λ, μ) -
and (λ, μ) solves the dual problem.

In view of §2.3, a relevant question is now the following: suppose we have found
a dual solution (λ̄, μ̄), can we reconstruct a primal solution from it? For this, we need
the filling property, which in turn calls for the calculus rule VI.4.4.2. The trick is to
realize that the dual function is not changed (and hence, neither are its subdifferentials)
if the minimization of the Lagrangian is restricted to some sublevel-set: for r large
enough and (λ, μ) in a neighborhood of (λ̄, μ̄),

  −θ(λ, μ) = inf_{x∈C_0} {L(x, λ, μ) : L(x, λ, μ) ≤ r}.

If this new formulation forces x into a compact set, we are done.


Proposition 5.3.4 With the notation adopted in this subsection, assume that f is
1-coercive on C_0:

  f(x)/‖x‖ → +∞  when ‖x‖ → +∞, x ∈ C_0.

Then, for any bounded set B ⊂ ℝ^m × (ℝ₊)^p, there are a number r and a compact
set K such that

  {x ∈ C_0 : L(x, λ, μ) ≤ r} ⊂ K  for all (λ, μ) ∈ B.

PROOF. Each function c_j is minorized by some affine function: c_j ≥ ⟨s_j, ·⟩ + ρ_j for
j = 1,…,p. Then write

  L(x, λ, μ) ≥ f(x) − ‖A^T λ‖ ‖x‖ − λ^T b − Σ_{j=1}^p μ_j [‖s_j‖ ‖x‖ + |ρ_j|];

the result follows from the 1-coercivity of f on C_0. □



5.4 Fenchel's Duality

On several occasions in this Section 5, connections have appeared between the La-
grangian duality of the present chapter and the conjugacy operation of Chap. X. On
the other hand, it has been observed in (X.2.3.2) that, for two closed convex functions
g_1 and g_2, the optimal value in the "primal problem"

  m := inf {g_1(x) + g_2(x) : x ∈ ℝ^n}    (5.4.1)

is opposite to that in the "dual problem"

  inf {g_1*(s) + g_2*(−s) : s ∈ ℝ^n}    (5.4.2)

under some appropriate assumption, for example: m is a finite number and

  ri dom g_1 ∩ ri dom g_2 ≠ ∅.    (5.4.3)

This assumption implies also that (5.4.2) has a solution.

The construction (5.4.1) → (5.4.2) is called Fenchel's duality, whose starting idea
is to conjugate the sum g_1 + g_2 in (5.4.1). Convexity is therefore required from the
very beginning, so that Theorem X.2.3.1 applies. About the relationship between the
associated primal-dual optimal sets, the following can be said:
Proposition 5.4.1 Let g_1 and g_2 be functions of Conv ℝ^n satisfying (5.4.3). If s̄ is an
arbitrary solution of (5.4.2), the (possibly empty) solution-set of (5.4.1) is

  ∂g_1*(s̄) ∩ ∂g_2*(−s̄).    (5.4.4)

PROOF. Apply (XI.3.4.5) to the functions f_i = g_i* for i = 1, 2: the infimal convolution
g_1* ∔ g_2* is exact at 0 = s̄ + (−s̄), and the subdifferential ∂(g_1* ∔ g_2*)(0) is then (5.4.4).
Now apply Theorem X.2.3.2: this last subdifferential is ∂(g_1 + g_2)*(0), which in turn
is just the solution-set of (5.4.1), see (X.1.4.6). □
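A quick numerical sanity check of (5.4.1)-(5.4.2) and (5.4.4) is possible when the conjugates are explicit; the sketch below is our own illustration with two quadratics (data made up): g_1(x) = ½‖x − a‖² and g_2(x) = ½‖x − b‖², for which g_i*(s) = ½‖s‖² + s·(a or b).

```python
import numpy as np

a, b = np.array([1.0, 0.0]), np.array([0.0, 2.0])   # hypothetical data

# primal (5.4.1): inf g1 + g2, attained at the midpoint (a+b)/2
x_opt = 0.5 * (a + b)
m = 0.5 * np.sum((x_opt - a) ** 2) + 0.5 * np.sum((x_opt - b) ** 2)

# dual (5.4.2): inf g1*(s) + g2*(-s), attained at s = -(a - b)/2
s_opt = -0.5 * (a - b)
dual = (0.5 * s_opt @ s_opt + s_opt @ a) + (0.5 * s_opt @ s_opt - s_opt @ b)

print(m, dual)                 # opposite values: m = -dual
# the solution-set (5.4.4): x_opt = grad g1*(s_opt) = s_opt + a
print(x_opt, s_opt + a)
```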

(a) From Fenchel to Lagrange. As was done in §5.3 in different situations, the
approach of this chapter can be applied to Fenchel's duality. It suffices to formulate
(5.4.1) as

  inf {g_1(x_1) + g_2(x_2) : x_1 − x_2 = 0}.

This is a minimization problem posed in ℝ^n × ℝ^n, with constraint-values in ℝ^n,
which lends itself to Lagrangian duality: taking the dual variable λ ∈ ℝ^n, we form
the Lagrangian

  L(x_1, x_2, λ) = g_1(x_1) + g_2(x_2) + λ^T(x_1 − x_2).

The associated closed convex dual function (to be minimized)

  ℝ^n ∋ λ ↦ θ(λ) = −inf_{x_1,x_2} L(x_1, x_2, λ)

can be written

  θ(λ) = −inf_{x_1} [g_1(x_1) + λ^T x_1] − inf_{x_2} [g_2(x_2) − λ^T x_2] = g_1*(−λ) + g_2*(λ),

a form which blatantly displays Fenchel's dual problem (5.4.2).

(b) From Lagrange to Fenchel. Conversely, suppose we would like to apply Fenchel
duality to the problems encountered in this chapter. This can be done at least formally,
with the help of appropriate functions in (5.4.1):
(i) g_2(x) = I_{{0}}(Ax − b) models affine equality constraints, as in convex instances
of (1.1.1);
(ii) g_2(x) = I_K(c(x)), where K is a closed convex cone, plays the same role for
the problem of §5.3(a); needless to say, inequality constraints correspond to the
nonpositive orthant K of ℝ^p;
(iii) g_2(x) = ½t‖Ax − b‖² is associated to penalized affine constraints, as in §5.3(b);
(iv) in the case of the augmented Lagrangian (5.2.1), we have a sum of three functions:
the objective function itself, the indicator I_{{0}}(c(x)) of the feasible set, and the
augmentation term ½t‖c(x)‖².

Many other situations can be imagined; let us consider the case of affine constraints
in some detail.

(c) Qualification Property. With g_1 and h closed convex, A linear from ℝ^n to ℝ^m,
and ℝ^m equipped with the usual dot-product, consider the following form of (5.4.1):

  inf {g_1(x) + h(Ax − b) : x ∈ ℝ^n}.    (5.4.5)

The role of the control set U is played by dom g_1, and h can be an indicator, a squared
norm, etc.; g_2 of (5.4.1) is given by

  g_2(x) := h(Ax − b).

Use Proposition III.2.1.12 to write (5.4.3) in the form

  there exists x ∈ ri dom g_1 such that Ax − b ∈ ri dom h.    (5.4.6)

With h = I_{{0}} for example, this means 0 ∈ A(ri dom g_1) − b = ri [A(dom g_1) − b];
compare with condition (iii) in Proposition 2.4.1.

(d) Dual Problem. The conjugate of g_2 can be computed with the help of the calculus
rule X.2.2.3: assume (5.4.6), so that in particular [A(ℝ^n) − b] ∩ ri dom h ≠ ∅; then

  g_2*(−s) = min {h*(λ) + b^T λ : A^T λ = −s},    (5.4.7)

and the dual problem (5.4.2) associated with (5.4.5) is

  min_s min_λ {g_1*(s) + h*(λ) + b^T λ : A^T λ = −s}.

Eliminating s, we obtain s = −A^T λ, where λ solves

  min_λ [g_1*(−A^T λ) + h*(λ) + b^T λ].

By the definition of g_1*, a further equivalent formulation is

  min_λ sup_x [−λ^T Ax − g_1(x) + h*(λ) + b^T λ],  or  min_λ sup_x [h*(λ) − L(x, λ)].

In summary, we have proved the following result:

Proposition 5.4.2 Assuming (5.4.6), the solutions of Fenchel's dual problem associ-
ated with (5.4.5) are s̄ = −A^T λ̄, where λ̄ describes the nonempty solution-set of

  min {θ(λ) + h*(λ) : λ ∈ ℝ^m}.    (5.4.8)

Here,

  λ ↦ θ(λ) = −inf {L(x, λ) : x ∈ ℝ^n}

is the dual function associated with the Lagrangian

  ℝ^n × ℝ^m ∋ (x, λ) ↦ L(x, λ) = g_1(x) + λ^T(Ax − b).    (5.4.9)

Furthermore, the optimal values in (5.4.5) and (5.4.8) are opposite to each other. □

(e) Primal-Dual Relationship. Proposition 5.4.1 characterizes the solutions of the
primal problem (5.4.5) (if any): they are those x satisfying simultaneously
- x ∈ ∂g_1*(−A^T λ) with λ solving (5.4.8); equivalently, 0 ∈ ∂g_1(x) + A^T λ, i.e. x
minimizes the Lagrangian L(·, λ) of (5.4.9);
- x ∈ ∂g_2*(A^T λ) with λ solving (5.4.8) and g_2* given by (5.4.7); using the calculus
rule XI.3.3.1, this means Ax − b ∈ ∂h*(λ).

To cut a long story short, when we compute the dual function

  θ(λ) := sup_{u∈U} [φ(u) − λ^T(Au − b)]

associated with the primal problem in the format (1.1.1), namely

  sup {φ(u) : Au − b = 0, u ∈ U},

we obtain precisely

  θ(λ) = g_1*(−A^T λ) + b^T λ,  with g_1 := −φ + I_U.
(f) Nonlinear Constraints. The situation corresponding to (ii) in §5.4(b) also gives
an interesting illustration: with p convex functions c_1,…,c_p from ℝ^n to ℝ, define

  ℝ^n ∋ x ↦ g_j(x) := I_{]−∞,0]}(c_j(x))  for j = 1,…,p.

Then take an objective function g_0 ∈ Conv ℝ^n, and consider the primal problem
inspired by the form (3.2.1) with inequality constraints:

  inf_x [g_0(x) + Σ_{j=1}^p g_j(x)].

Its dual in Fenchel's style is

  inf {Σ_{j=0}^p g_j*(s_j) : Σ_{j=0}^p s_j = 0};    (5.4.10)

in "normal" situations, this can also be obtained from Lagrange duality:

Proposition 5.4.3 With the above notation, assume the existence of x̃_1,…,x̃_p such
that c_j(x̃_j) < 0 for j = 1,…,p. Then the optimal value in (5.4.10) is

  inf {θ(μ) : μ ∈ (ℝ₊)^p}

with

  θ(μ) := −inf_x [g_0(x) + Σ_{j=1}^p μ_j c_j(x)].

PROOF. Our assumptions allow the use of the conjugate calculus given in Exam-
ple X.2.5.3:

  g_j*(s) = min_{μ≥0} (μ c_j)*(s)  for j = 1,…,p.

Then (5.4.10) has actually two (groups of) minimization variables: (s_1,…,s_p) and
μ = (μ_1,…,μ_p). We minimize with respect to {s_j} first: the value (5.4.10) is the
infimum over μ ∈ (ℝ₊)^p of the function

  μ ↦ inf {g_0*(s_0) + Σ_{j=1}^p (μ_j c_j)*(s_j) : Σ_{j=0}^p s_j = 0}.    (5.4.11)

The key is to realize that this is the value at s = 0 of the infimal convolution

  ℝ^n ∋ s ↦ (g_0* ∔ π_1* ∔ ⋯ ∔ π_p*)(s) =: g(s)

where, for j = 1,…,p,

  ℝ^n ∋ s ↦ π_j*(s) := (μ_j c_j)*(s)

is the conjugate of x ↦ π_j(x) = μ_j c_j(x), a convex function finite everywhere.
Then Theorem X.2.3.2 tells us that g is the conjugate of g_0 + π_1 + ⋯ + π_p; its
value at zero is opposite to the infimum of this last function. In a word, the function
of (5.4.11) reduces to θ(μ). □

Note that this is an abstract result, which says nothing about primal-dual rela-
tionships, nor existence of primal-dual solutions; in particular, it does not rule out the
case dom θ = ∅.
Let us conclude: there is a two-way correspondence between Lagrange and
Fenchel duality schemes, even though they start from different primal problems; the
difference is mainly a matter of taste. The Lagrangian approach may be deemed more
natural and flexible; in particular, it is often efficient when the initial optimization
problem contains "intermediate" variables, say y_j = c_j(x), which one wants to single
out for some reason. On the other hand, Fenchel's approach is often more direct in
theoretical developments.
XIII. Inner Construction of the Approximate
Subdifferential: Methods of ε-Descent

Prerequisites. Basic concepts of numerical optimization (Chap. II); bundling mechanism
for minimizing convex functions (Chap. IX, essential); definition and elementary properties
of approximate subdifferentials and difference quotients (Chap. XI, especially §XI.2.2).

Introduction. In this chapter, we study a first minimization algorithm in detail, including
its numerical implementation. It is able to minimize rather general convex functions, without
any smoothness assumptions - in contrast to the algorithms of Chap. II. On the other hand, it is
fully implementable, which was not the case of the bundling algorithm exposed in Chap. IX.

We do not study this algorithm for its practical value: in our opinion, it is not suitable
for "real life" problems. However, it is the one that comes first to mind when starting from
the ideas developed in the previous chapters. Furthermore, it provides a good introduction to
the (more useful) methods to come in the next chapters: rather than a dead end, it is a point
of departure, comparable to the steepest-descent method for minimizing smooth functions,
which is bad but basic. Finally, it can be looked at in a context different from optimization
proper, namely that of separating closed convex sets.

The situation is the same as in Chap. IX:
- f : ℝ^n → ℝ is a (finite-valued) convex function to be minimized;
- f(x) and s(x) ∈ ∂f(x) are computed when necessary in a black box (U1), as in §II.1.2;
- the norm ‖·‖ used to compute steepest-descent directions and projections (see §VIII.1 and
§IX.1) is the Euclidean norm: ⦀·⦀ = ‖·‖ = √⟨·,·⟩.

In other words, we want to minimize f even though the information available is fairly
poor: one subgradient at each point, instead of the full subdifferential as in Chap. VIII. Re-
call that the notation s(x) is misleading but can be used in practice; this was explained in
Remark VIII.3.5.1.

1 Introduction. Identifying the Approximate Subdifferential

1.1 The Problem and Its Solution

In Chap. IX, we have studied a mechanism for constructing the subdifferential of f
at a given x, or at least to resolve the following alternatives:
- If there exists a hyperplane separating ∂f(x) and {0}, i.e. a descent direction, i.e. a
d with f′(x, d) < 0, find one.
- Or, if there does not exist any such direction, explain why, i.e. find a subgradient
which is (approximately) 0 - the difficulty being that the black box (U1) is not
supposed to ever answer s(x) = 0.

This mechanism consisted in collecting information about ∂f(x) in a bundle, and
worked as follows (see Algorithm IX.1.6 for example):
- Given a compact convex polyhedron S under-estimating ∂f(x), i.e. with S ⊂ ∂f(x),
one computed a hyperplane separating S and {0}. This hyperplane was essentially
defined by its normal vector d, interpreted as a direction, and one actually computed
the best such hyperplane, namely the projection of 0 onto S.
- Then one made a line-search along this d, with two possible exits:
(a) The hyperplane actually separated ∂f(x) and {0}, in which case the process
was terminated. The line-search was successful and f could be improved in the
direction d, to obtain the next iterate, say x_+.
(b) Or the hyperplane did not separate ∂f(x) and {0}, in which case the line-search
produced a new subgradient, say s_+ ∈ ∂f(x), to improve the current S - the
line-search was unsuccessful and one looped to redo it along a new direction,
issuing from the same x, but obtained from the better S.
We explained that this mechanism could be grafted onto each iteration of a descent
scheme. It would thus allow the construction of descent directions without computing
the full subdifferential explicitly, a definite advantage over the steepest-descent method
of Chap. VIII. A fairly general algorithm would then be obtained, which could for
example minimize dual functions associated with abstract optimization problems;
such an algorithm would thus be directly comparable to those of §XII.4.

Difficulty 1.1.1 If we do such a grafting, however, two difficulties appear, which
were also mentioned in Remark IX.1.7:
(i) According to the whole idea of the process, the descent direction found in case
(a) will be "at best" the steepest-descent direction. As a result, the sequence of
iterates will very probably not be a minimizing sequence: this is the message of
§VIII.2.2.
(ii) Finding the subgradient s_+ in case (b) requires performing an endless line-search:
the stepsize must tend to 0 so as to compute the directional derivative f′(x, d)
and to obtain s_+ by virtue of Lemma VI.6.3.4. Unless f has some very specific
properties (such as being piecewise affine), this cannot be implemented.
In other words, the grafting will result in a non-convergent and non-implementa-
ble algorithm. □

Yet the idea is good if a simple precaution is taken: rather than the purely local set
∂f(x), it is ∂_εf(x) that we must identify. It gathers differential information from a spe-
cific, "finitely small", neighborhood of x - called V_ε(x) in Theorem XI.4.2.5 - instead
of a neighborhood shrinking to {x}, as is the case for ∂f(x) - see Theorem VI.6.3.1.
Accordingly, we can guess that this will fix (i) and (ii) above. In fact:

(i_ε) A direction d such that f′_ε(x, d) < 0 is downhill not only at x but at any point
in V_ε(x), so a line-search along such a direction is able to drive the next iterate
out of V_ε(x), and the difficulty 1.1.1(i) will be eliminated.
(ii_ε) A subgradient at a point of the form x + td is in ∂_εf(x) for t small enough,
more precisely whenever x + td ∈ V_ε(x); so it is no longer necessary to force
the stepsize in (ii) to tend to 0.

Definition 1.1.2 A nonzero d ∈ ℝ^n is said to be a direction of ε-descent for f at x
if f′_ε(x, d) < 0; in other words, d defines a hyperplane separating ∂_εf(x) and {0}.

A point x ∈ ℝ^n is said to be ε-minimal if there is no such separating d, i.e.
f′_ε(x, d) ≥ 0 for all d, i.e. 0 ∈ ∂_εf(x). □

The reason for this terminology is straightforward:

Proposition 1.1.3 A direction d is of ε-descent if and only if

  f(x + td) < f(x) − ε  for some t > 0.

A point x is ε-minimal if and only if it minimizes f within ε, i.e.

  f(y) ≥ f(x) − ε  for all y ∈ ℝ^n.

PROOF. Use for example Theorem XI.2.1.1: for given d ≠ 0, let η := f′_ε(x, d). If
η < 0, we can find t > 0 such that

  f(x + td) − f(x) + ε ≤ ½tη < 0.

Conversely, η ≥ 0 means that

  f(x + td) − f(x) + ε ≥ tη ≥ 0  for all t > 0. □


Then consider the following algorithmic scheme, of"e-descent". The descent iter-
ations are denoted here by a superscript p; the SUbscript k that was used in Chap. VIII
will be reserved to the bundling iterations to come later.

Algorithm 1.1.4 (Conceptual Algorithm of e-Descent) Start from some Xl E IRn.


Choose e > O. Set p = 1.
STEP 1. If 0 E 8e f(x P ) stop. Otherwise compute a direction of e-descent, say d P •
STEP 2. Make a line-search along d P to obtain a stepsize t P > 0 such that

f(x P + tPd P) < f(x P) - e.

Setx p +1 := x P + tPd P. Replace p by p + 1 and loop to Step 1. o
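As a purely illustrative toy (a brute-force search over stepsizes and directions, our own device for very small dimensions; the implementable method occupies the rest of the chapter), Algorithm 1.1.4 can be mimicked by evaluating the difference quotient of Theorem XI.2.1.1 on a grid:

```python
import numpy as np

def f(x):                     # a nonsmooth convex toy objective
    return abs(x[0]) + 2.0 * abs(x[1])

def eps_quotient(x, d, eps, ts):
    # f'_eps(x,d) = inf_{t>0} [f(x+t d) - f(x) + eps]/t, approximated on ts
    return min((f(x + t * d) - f(x) + eps) / t for t in ts)

def eps_descent(x, eps=0.5, iters=50):
    ts = np.geomspace(1e-3, 1e3, 200)
    dirs = [np.array([np.cos(a), np.sin(a)])
            for a in np.linspace(0.0, 2 * np.pi, 64, endpoint=False)]
    for _ in range(iters):
        # Step 1: look for an eps-descent direction (Definition 1.1.2)
        v, d = min(((eps_quotient(x, d, eps, ts), d) for d in dirs),
                   key=lambda p: p[0])
        if v >= 0.0:          # 0 in the eps-subdifferential: x is eps-minimal
            return x
        # Step 2: a stepsize achieving f(x + t d) < f(x) - eps (Prop. 1.1.3)
        t = min(ts, key=lambda t: (f(x + t * d) - f(x) + eps) / t)
        x = x + t * d
    return x

print(eps_descent(np.array([4.0, -3.0])))   # ends eps-minimal, near 0
```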


If this scheme is implementable at all, it will be an interesting alternative to the
steepest-descent scheme, clearly eliminating the difficulty 1.1.1(i).

Proposition 1.1.5 In Algorithm 1.1.4, either f(x^p) → −∞, or the stop of Step 1
occurs for some finite iteration index p*, at which x^{p*} is ε-minimal.

PROOF. Fairly obvious: one has by construction

  f(x^p) < f(x^1) − (p − 1)ε  for p = 1, 2,…

Therefore, if f is bounded from below, say by f̄ := inf_x f(x), the number p* of
iterations cannot be larger than [f(x^1) − f̄]/ε, after which the stopping criterion of
Step 1 plays its role. □

Remark 1.1.6 This is just the basic scheme, for which several variants can easily be
imagined. For example, ε = ε^p can be chosen at each iteration. A stop at iteration p*
will then mean that x^{p*} is ε^{p*}-minimal. Various conditions will ensure that this stop
does eventually occur; for example

  Σ_{p=1}^{p*} ε^p → +∞  when p* → ∞.    (1.1.1)

It is not the first time that we encounter this idea of taking a divergent series:
the subgradient algorithm of §XII.4.1 also used one for its stepsizes. It is appropriate
to recall Remark XII.4.1.5: here again, a condition like (1.1.1) makes little sense in
practice. This is particularly visible with p* appearing explicitly in the summation:
when choosing ε^p, we do not know if the present p-th iteration is going to be the last
one. In the present context, we just mention one simple "online" rule, more reasonable:
take a fixed ε but, when 0 ∈ ∂_εf(x^p), diminish ε, say divide it by 2, until it reaches a
final acceptable threshold. It is straightforward to extend the proof of Proposition 1.1.5
to such a strategy.

A general convergence theory of these schemes with varying ε's is rather trivial
and is of little interest; we will not insist on this aspect here. The real issue would be
of a practical nature, consisting of a study of the values of ε (= ε^p), and/or of the
norming ⦀·⦀, to reach maximal efficiency - whatever this means! □

Remark 1.1.7 To solve a numerical problem (for example of optimization) one nor-
mally constructs a sequence - say {x^p}, usually infinite - for which one proves some
desired asymptotic property: for example that any, or some, cluster point of {x^p}
solves the initial problem - recall Chap. II, and more particularly (II.1.1.8). In this
sense, the statement of Proposition 1.1.5 appears as rather non-classical: we construct
a finite sequence {x^p}, and then we establish how close the last iterate comes to opti-
mality. This point of view will be systematically adopted here. It makes convergence
results easier to prove in our context, and furthermore we believe that it better reflects
what actually happens on the computer.

Observe the double role played by ε at each iteration of Algorithm 1.1.4: an "ac-
tive" role to compute d^p, and a "passive" role as a stopping criterion. We should really
have two different epsilons: one, possibly variable, used to compute the direction; and
the other to stop the algorithm. Such considerations will be of importance for the next
chapter. □
Having eliminated the difficulty 1.1.1(i), we must now check that 1.1.1(ii) is
eliminated as well; this is not quite trivial and motivates the present chapter. We
therefore study one single iteration of Algorithm 1.1.4, i.e. one execution of Step 1,
regardless of any sophisticated choice of ε. In other words, we are interested in the
static aspect of the algorithm, in which the current iterate x^p is considered as fixed.
Dropping the index p from our notation, we are given fixed x ∈ ℝ^n and ε > 0,
and we apply the mechanism of Chap. IX to construct compact convex polyhedra in
∂_εf(x).

If we look at Step 1 of Algorithm 1.1.4 in the light of Algorithm IX.1.6, we see that
it is going to be an iterative subprocess which works as follows; we use the subscript
k as in Chap. IX.

Process 1.1.8 The black box (U1) has computed a first ε-subgradient s_1, and a first
polyhedron S_1 is on hand:

  s_1 := s(x) ∈ ∂f(x) ⊂ ∂_εf(x)  and  S_1 := {s_1}.

- At stage k, having the current polyhedron S_k ⊂ ∂_εf(x), compute the best hyperplane
separating S_k and {0}, i.e. compute

  d_k := −Proj(0 | S_k)

(recall that Proj denotes the Euclidean projection onto the closed convex hull of a
set).
- Then determine whether
(a_ε) d_k separates not only S_k but even ∂_εf(x) from {0}; then the work is finished,
∂_εf(x) is properly identified, d_k is an ε-descent direction and we can pass to
the next iteration in Algorithm 1.1.4;
or
(b_ε) d_k does not separate ∂_εf(x) from {0}; then enlarge S_k with a new s_{k+1} ∈ ∂_εf(x)
and loop to the next k. □

Of course, the above alternatives (a_ε) - (b_ε) play the same role as (a) - (b) of
§1.1. In Chapter IX, (a) - (b) were resolved by a line-search minimizing f along d_k.
In case (b), the optimal stepsize was 0, the event f(x + td_k) < f(x) was impossible
to obtain, and stepsizes t ↓ 0 were produced. The new element s_{k+1}, to be appended
to S_k, was then a by-product of this line-search, namely a corresponding cluster point
of {s(x + td_k)}_{t↓0}.

1.2 The Line-Search Function

Here, in order to resolve the alternatives (a_ε) - (b_ε), and detect whether d_k is an
ε-descent direction (instead of a mere descent direction), we can again minimize f
along d_k and see whether ε can thus be dropped from f. This is no good, however: in
case of failure we will not obtain a suitable s_{k+1}.
Instead of simply applying Proposition 1.1.3 and checking whether

  f(x + td_k) < f(x) − ε  for some t > 0,    (1.2.1)

a much better idea is to use Definition 1.1.2 itself: compute the support function
f′_ε(x, d_k) and check whether it is negative. From Theorem IX.2.1.1, this means min-
imizing the perturbed difference quotient, a problem that the present §1.2 is devoted
to. Clearly enough, the alternatives (a_ε) - (b_ε) will thus be resolved; but in case of
failure, when (b_ε) holds, we will see below that a by-product of this minimization is
an s_{k+1} suitable for our enlargement problem.

Naturally, the material in this section takes a lot from §XI.2. In the sequel, we
drop the index k: given x ∈ ℝ^n, d ≠ 0 and ε > 0, the perturbed difference quotient
is as in §XI.2.2:

  q_ε(t) := +∞  if t ≤ 0,
  q_ε(t) := [f(x + td) − f(x) + ε]/t  if t > 0,    (1.2.2)

and we are interested in

  inf_{t>0} q_ε(t).    (1.2.3)

Remark 1.2.1 The above problem (1.2.3) looks like a line-search, just as in Chap. II: after
all, it amounts to finding a stepsize along the direction d. We should mention, however, that
such an interpretation is slightly misleading, as the motivation is substantially different. First,
the present "line-search" is aimed at diminishing the perturbed difference quotient, rather
than the objective function itself.

More importantly, as we will see in §2.2 below, q_ε must be minimized rather accurately.
By contrast, we have insisted long enough in §II.3 to make it clear that a line-search had little
to do with one-dimensional minimization: its role was rather to find a "reasonable" stepsize
along the given direction, "reasonable" being understood in terms of the objective function.
We just note here that the present "line-search" will certainly not produce a zero stepsize,
since q_ε(t) → +∞ when t ↓ 0 (cf. Proposition XI.2.2.2). This confirms our observation (ii_ε)
in §1.1. □

Minimizing q_ε is a (hidden) convex problem, via the change of variable u = 1/t.
We recall the following facts from §XI.2.2: the function

  u ↦ r_ε(u) := q_ε(1/u) = u [f(x + d/u) − f(x) + ε]  if u > 0,
  r_ε(0) := f′∞(d) = lim_{t→∞} q_ε(t)  if u = 0,    (1.2.4)
  r_ε(u) := +∞  if u < 0

is in Conv ℝ. Its subdifferential at u > 0 is

  ∂r_ε(u) = {ε − e(x, x + d/u, s) : s ∈ ∂f(x + d/u)};

here,

  e(x, y, s) := f(x) − [f(y) + ⟨s, x − y⟩]    (1.2.5)
is the linearization error made at x when f is linearized at y with slope s (see the
transportation formula of Proposition XI.4.2.2). The function r_ε is minimized over a
nonempty compact interval U_ε = [u_ε, ū_ε] ⊂ [0, +∞[, and its positive minima are
characterized by the property

  e(x, x + d/u, s) = ε  for some s ∈ ∂f(x + d/u).

Since we live in the geometrical world of positive stepsizes, we find it convenient
to translate these results into the t-language. They allow the following supplement to
the classification at the end of §XI.2.2:

Lemma 1.2.2 The set

  T_ε := {t = 1/u : u ∈ U_ε and u > 0}

of minima of q_ε is a closed interval (possibly empty, possibly not bounded from above),
which does not contain 0. Denoting by t_ε := 1/ū_ε and t̄_ε := 1/u_ε the endpoints of
T_ε (the convention 1/0 = +∞ is used), there holds for any t > 0:
(i) t ∈ T_ε  ⟺  e(x, x + td, s) = ε for some s ∈ ∂f(x + td);
(ii) t < t_ε  ⟺  e(x, x + td, s) < ε for all s ∈ ∂f(x + td);
(iii) t > t̄_ε  ⟺  e(x, x + td, s) > ε for all s ∈ ∂f(x + td).
Of course, (iii) is pointless if u_ε = 0.

PROOF. All this comes from the change of variable t = 1/u in the convex function r_ε
(remember that it is minimized "at finite distance"). The best way to "see" the proof
is probably to draw a picture, for example Figs. 1.2.1 and 1.2.2. □

For our present concern of minimizing q_ε, the two cases illustrated respectively
by Fig. 1.2.1 (ū_ε > 0, T_ε nonempty) and Fig. 1.2.2 (ū_ε = u_ε = 0, T_ε empty) are
rather different. For each stepsize t > 0, take s ∈ ∂f(x + td). When t increases,
- the inverse stepsize u = 1/t decreases;
- being the slope of the convex function r_ε, the number ε − e(x, x + td, s) decreases
(not continuously);
- therefore e(x, x + td, s) increases.
Altogether, e(x, x + td, s) starts from 0 and increases with t > 0; a discontinuity
occurs at each t such that ∂f(x + td) has a nonzero breadth along d.

The difference between the two cases in Figs. 1.2.1 and 1.2.2 is whether or not
e(x, x + td, s) crosses the value ε; this property conditions the non-vacuousness of
T_ε.

Fig. 1.2.1. A "normal" approximate difference quotient

Fig. 1.2.2. An approximate difference quotient with no minimum

Lemma 1.2.2 is of course crucial for a one-dimensional search aimed at minimiz-
ing q_ε. First, it shows that q_ε is mildly nonconvex (the technical term is quasi-convex).
Also, the number ε − e, calculated at a given t, contains all the essential information
of a "derivative". If it happens to be 0, t is optimal; otherwise, it indicates whether q_ε
is increasing or decreasing near t.

We must now see why it is more interesting to solve (1.2.3) than (1.2.1). Con-
cerning the test for appropriate descent, minimizing q_ε does some good in terms of
diminishing f:

Lemma 1.2.3 Suppose that d is a direction of ε-descent: f′_ε(x, d) < 0. Then

  f(x + td) < f(x) − ε

for t > 0 close enough to T_ε (i.e. t large enough when T_ε = ∅).

PROOF. Under the stated assumption, q_ε(t) has the sign of its infimum, namely "−".
□
This fixes the (a_ε)-problem in Process 1.1.8. As for the new s_+, necessary in the
(b_ε)-case, we remember that it must enjoy two properties:
- It must be an ε-subgradient at x. By virtue of the transportation formula (XI.4.2.2),
this requires precisely that q_ε be locally decreasing: we have

  ∂f(x + td) ∋ s ∈ ∂_εf(x)  if and only if  e(x, x + td, s) ≤ ε.    (1.2.6)

- It must separate the current S from {0} "sufficiently well", which means it must
make ⟨s_+, d⟩ large enough, see for example Remark IX.2.1.3.

Then the next result will be useful.

Lemma 1.2.4 Suppose that d is not a direction of ε-descent: f′_ε(x, d) ≥ 0. Then:

(j) If t ∈ T_ε, there is some s ∈ ∂f(x + td) such that

  ⟨s, d⟩ ≥ 0  and  s ∈ ∂_εf(x).
(jj) Suppose T_ε = ∅. Then

  ∂f(x + td) ⊂ ∂_εf(x)  for all t ≥ 0, and

  for all η > 0, there exists M > 0 such that

  t ≥ M, s ∈ ∂f(x + td)  ⟹  ⟨s, d⟩ ≥ −η.

PROOF. [(j)] Combine Lemma 1.2.2(i) with (1.2.6) to obtain s ∈ ∂f(x + td) ∩ ∂_εf(x)
such that

  f(x) = f(x + td) − t⟨s, d⟩ + ε;

since t > 0, the property f(x) ≤ f(x + td) + ε implies ⟨s, d⟩ ≥ 0.

[(jj)] In case (jj), every t > 0 comes under case (ii) of Lemma 1.2.2, and any s(t) ∈
∂f(x + td) is in ∂_εf(x) by virtue of (1.2.6). Also, since d is not a direction of ε-descent,

  f(x + td) ≥ f(x) − ε
           ≥ f(x + td) − t⟨s(t), d⟩ − ε.  [because s(t) ∈ ∂f(x + td)]

Thus, ⟨s(t), d⟩ + ε/t ≥ 0... wait, more precisely 2ε/t: from the chain above,
0 ≥ −t⟨s(t), d⟩ − 2ε, i.e. ⟨s(t), d⟩ + 2ε/t ≥ 0, and lim inf_{t→∞} ⟨s(t), d⟩ ≥ 0. □

Remark 1.2.5 Compare with §XI.4.2: for all t smaller than the largest element t_ε(d)
of T_ε, ∂f(x + td) ⊂ ∂_εf(x) because x + td ∈ V_ε(x). For t = t_ε(d), one can only
ascertain that some subgradient at x + td is an ε-subgradient at x and can fruitfully
enlarge the current polyhedron S. For t > t_ε(d), ∂f(x + td) ∩ ∂_εf(x) = ∅ because
x + td ∉ V_ε(x). □

In summary, when t minimizes q_ε (including the case "t large enough" if T_ε = ∅),
it is theoretically possible to find in ∂f(x + td) the necessary subgradient s_+ to improve
the current approximation of ∂_εf(x). We will see in §2 that it is not really possible to
find s_+ in practice; rather, it is a convenient approximation of it which will be obtained
during the process of minimizing q_ε.

1.3 The Schematic Algorithm

We begin to see how Algorithm 1.1.4 will actually work: Step 1 will be a projection
onto a convex polyhedron, followed by a minimization of the convex function r_ε. In
anticipation of §2, we restate Algorithm 1.1.4, with a somewhat more detailed Step 1.
Also, we incorporate a criterion to stop the algorithm without waiting for the (unlikely)
event "0 ∈ S_k".

Algorithm 1.3.1 (Schematic Algorithm of ε-Descent) Start from some x^1 ∈ ℝ^n.
Choose the descent criterion ε > 0 and the convergence parameter δ > 0. Set p = 1.
STEP 1.0 (starting to find a direction of ε-descent). Compute s(x^p) ∈ ∂f(x^p). Set
s_1 = s(x^p), k = 1.
STEP 1.1 (computing the trial direction). Solve the minimization problem in α

  min {‖Σ_{i=1}^k α_i s_i‖² : α ∈ Δ_k},    (1.3.1)

where Δ_k is the unit simplex of ℝ^k. Set d_k = −Σ_{i=1}^k α_i s_i; if ‖d_k‖ ≤ δ stop.
STEP 1.2 (line-search). Minimize r_ε and conclude:
(a_ε) either f′_ε(x^p, d_k) < 0; then go to Step 2;
(b_ε) or f′_ε(x^p, d_k) ≥ 0; then obtain a suitable s_{k+1} ∈ ∂_εf(x^p), replace k by k + 1
and loop to Step 1.1.
STEP 2 (descent). Obtain t > 0 such that f(x^p + td_k) < f(x^p) − ε. Set x^{p+1} =
x^p + td_k, replace p by p + 1 and loop to Step 1.0. □
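Step 1.1 is a small convex quadratic program over the unit simplex. A minimal sketch (our own illustration, using scipy's generic SLSQP solver rather than a dedicated QP code) computes the projection of 0 onto conv{s_1,…,s_k} and the resulting trial direction:

```python
import numpy as np
from scipy.optimize import minimize

def trial_direction(S):
    """S: k-by-n array whose rows are the subgradients s_1..s_k.
    Solves (1.3.1): min ||sum_i alpha_i s_i||^2 over the unit simplex,
    and returns d_k = -sum_i alpha_i s_i."""
    k = S.shape[0]
    obj = lambda a: np.sum((a @ S) ** 2)
    cons = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},)
    res = minimize(obj, np.full(k, 1.0 / k), bounds=[(0.0, 1.0)] * k,
                   constraints=cons, method='SLSQP')
    return -(res.x @ S)

# two subgradients of f(x) = |x_1| + |x_2| near the origin (illustration)
S = np.array([[1.0, 1.0], [-1.0, 1.0]])
print(trial_direction(S))     # approximately (0, -1)
```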

Remark 1.3.2 Within Step 1, k increases by one at each subiteration. The number
of such subiterations is not known in advance, so the number of subgradients to be
stored can grow arbitrarily large, and the complexity of the projection problem as
well. We know, however, that it is not really necessary to keep all the subgradients at
each iteration: Theorem IX.2.1.7 tells us that the (k + 1)-st projection must be made
onto a polyhedron which can be as small as the segment [−d_k, s_{k+1}]. In other words,
the number of subgradients to be stored, i.e. the number of variables in the quadratic
problem of Step 1.1, can be as small as 2 (but at the price of a substantial loss in actual
performance: remember Fig. IX.2.2.3). □

Remark 1.3.3 If we compare this algorithm to those of Chap. II, we see that it works
rather similarly: at each iterate x^p, it constructs a direction d_k, along which it per-
forms a line-search. However there are two new features, both characteristic of all the
minimization algorithms to come.
- First, the direction is computed in a rather sophisticated way, as the solution of an
auxiliary minimization problem, instead of being given explicitly in terms of the
"gradient" s_k.
- The second feature brings a rather fundamental difference: it is that the line-search
has two possible exits (a_ε) and (b_ε). The first case is the normal one and could be
called a descent step, in which the current iterate x^p is updated to a better one. In
the second case, the iterate is kept where it is and the next line-search will start from
the same x^p. As in the bundling Algorithm IX.1.6, this can be called a null-step.
□
We can guess that there will be two rather different cases in Step 1.2, corresponding
to those of Figs. 1.2.1, 1.2.2. This will be seen in more detail in §2, but we give already
here an idea of the convergence proof, assuming a simple situation: only the case of
Fig. 1.2.1 occurs, q_ε is minimized exactly at each line-search, and correspondingly,
an s_{k+1} predicted by Lemma 1.2.4 is found at each null-step. Note that the last two
assumptions are rather unrealistic.

Theorem 1.3.4 Suppose that each execution of Step 1.2 in Algorithm 1.3.1 produces
an optimal u_k > 0 and a corresponding s_{k+1} ∈ ∂f(x^p + d_k/u_k) such that

  e_k = ε,

where

  e_k := f(x^p) − f(x^p + d_k/u_k) + (1/u_k)⟨s_{k+1}, d_k⟩.

Then: either f(x^p) → −∞, or the stop of Step 1.1 occurs for some finite iteration
index p*, at which there holds

  f(x^{p*}) ≤ f(y) + ε + δ‖y − x^{p*}‖  for all y ∈ ℝ^n.    (1.3.2)

PROOF. Suppose f(x^p) is bounded from below. As in Proposition 1.1.5, Step 2 cannot
be executed infinitely many times. At some finite iteration index p*, Algorithm 1.3.1
therefore loops between Steps 1.1 and 1.2, i.e. case (b_ε) always occurs at this iteration.
Then Lemma 1.2.4(j) applies for each k: first

  s_{k+1} ∈ ∂_{e_k}f(x^{p*}) = ∂_εf(x^{p*})  for all k;

we deduce in particular that the sequence {s_k} ⊂ ∂_εf(x^{p*}) is bounded (Theo-
rem XI.1.1.4). Second, ⟨s_{k+1}, d_k⟩ ≥ 0 which, together with the minimality conditions
for the projection −d_k, yields

  ⟨−d_k, s_j − s_{k+1}⟩ ≥ ‖d_k‖²  for j = 1,…,k.

Lemma IX.2.1.1 applies (with s_k = −d_k and m = 1): d_k → 0 if k → ∞, so the stop
must eventually occur. Finally, each −d_k is a convex combination of ε-subgradients
s_1,…,s_k of f at x^{p*} (note that s_1 ∈ ∂_εf(x^{p*}) by construction) and is therefore an
ε-subgradient itself:

  f(y) ≥ f(x^{p*}) − ⟨d_k, y − x^{p*}⟩ − ε  for all y ∈ ℝ^n.

This implies (1.3.2) by the Cauchy-Schwarz inequality. □

It is worth mentioning that the division Step 1 - Step 2 is somewhat artificial,
especially with Remark 1.3.3 in mind; we will no longer make this division. Actually,
it is Step 1.2 which contains the bulk of the work - and Step 1.1 to a lesser extent.
Step 1.2 even absorbs the work of Step 2 (the suitable t and its associated s have
already been found during the line-search) and of Step 1.0 (the starting s_1 has been
found at the end of the previous line-search). We have here one more illustration
of the general principle that line-searches are the most important ingredient when
implementing optimization algorithms.

Remark 1.3.5 The boundedness of {s_k} for each outer iteration p is a key property.
Technically, it is interesting to observe that it is automatically satisfied, because of the
very fact that s_k ∈ ∂_εf(x^p): no boundedness of {x^p} is needed.

Along the same idea, the assumption that f is finite everywhere has little impor-
tance: it could be replaced by something more local, for example

  S_{f(x^1)}(f) := {x ∈ ℝ^n : f(x) ≤ f(x^1)} ⊂ int dom f;

Theorem XI.1.1.4 would still apply. This would just imply some sophistication in the
line-search of the next section, to cope with the case of (U1) being called at a point
x^p + td_k ∉ dom f. In this case, the question arises of what (U1) should return for
f(x^p + td_k) and s(x^p + td_k). □

2 A Direct Implementation: Algorithm of ε-Descent

This section contains the computational details that are necessary to implement the
algorithm introduced in §1, sketched as Algorithm 1.3.1. We mention that these details
are by no means fundamental for the next chapters and can therefore be skipped by
a casual reader - it has already been mentioned that methods of ε-descent are not
advisable for actual use in "real life" problems. On the other hand, these details give a
good idea of the kind of questions that arise when methods for nonsmooth optimization
are implemented.

To obtain a really implementable form of Algorithm 1.3.1, we need to specify
two calculations: in Step 1.1, how the α-problem is solved, and in Step 1.2, how the
line-search is performed. The α-problem is a classical convex quadratic minimization
problem with linear constraints; as such, it poses no particular difficulty. The line-
search, on the contrary, is rather new since it consists of minimizing the nonsmooth
function q_ε or r_ε. It forms the subject of the present section, in which we largely use
the principles of §II.3.

A totally clear implementation of a line-search implies answering three questions.
(i) Initialization: how should the first trial stepsize be chosen? Here, the line-search
must be initialized at each new direction d_k. No really satisfactory initialization
is known - and this is precisely one of the drawbacks of the present algorithm.
We will not study this problem, considering that the choice t = 1 is simplest,
if not excessively sensible! (instead of 1, one must of course at least choose a
number which, when multiplied by ‖d_k‖, gives a reasonable move from x^p).
(ii) Iteration: given the current stepsize, assumed not suitable, how can the next one
be chosen? This will be the subject of §2.1.
(iii) Stopping Criterion: when can the current stepsize be considered as suitable
(§2.2)? As is the case with all line-searches, this is by far the most delicate
question in the present context. It is more crucial here than ever: it gives not only
the conditions for stopping the line-search, but also those for determining the
next subgradient to be added to the current approximation of ∂_εf(x^p).

2.1 Iterating the Line-Search

In order to make the notation less cumbersome, when no confusion is possible, we


drop the indices p and k, since they are fixed here and in the next subsection. We are
therefore given x E IRn and 0 oF d E IRn , the starting point and the direction for the
current line-search under study. We are also given e > 0, which is crucial in all this
chapter.
As already mentioned in § 1, the aim of the line-search is to minimize the function
qe of (1.2.2), or equivalently the function re of (1.2.4). We prefer to base our devel-
opment on the more natural qe, despite its nonconvexity. Of course, the reciprocal
correspondence (1.2.4) must be kept in mind and used whenever necessary.
At the point y = x + td, it is convenient to simplify the notation: set) will stand
for s(x + td) E af(x + td) and
  e(t) := e(x, x + td, s(t)) = f(x) − f(x + td) + t⟨s(t), d⟩    (2.1.1)

for the linearization error (1.2.5). This creates no confusion because x and d are fixed
and because, once again, the subgradient s is considered as a single-valued function,
depending on the black box (U1). Recall that ε − e(t) is in ∂r_ε(1/t) (we remind
the reader that the last expression means the subdifferential of the convex function
u ↦ r_ε(u) at the point u = 1/t).

The problem to be solved in this subsection is: during the process of minimizing
q_ε, where should we place each trial stepsize? Applying the principles of §II.3.1, we
must decide whether a given t > 0 is
(O) convenient, so the minimization of q_ε can be stopped;
(L) on the left of the set of convenient stepsizes;
(R) on their right.

The key to designing (L) and (R) lies in the statements 1.2.2 and 1.2.3: in fact,
call the black box (U1) to compute s(t) ∈ ∂f(x + td); then compute e(t) of (2.1.1)
and compare it to ε. There are three possibilities:
(O) := "{e(t) = ε}" (an extraordinary event!). Then this t is optimal and the line-
search is finished.
(L) := "{e(t) < ε}". This t is too small, in the sense that no optimal t can lie on
its left. It can serve as a lower bound for all subsequent stepsizes, therefore set
t_L = t before looping to the next trial.
(R) := "{e(t) > ε}". Not only must T_ε lie on the left of this t, but also s(t) ∉ ∂_εf(x),
so t is too large (see Proposition XI.4.2.5 and the discussion following it). This
makes two reasons to set t_R = t before looping.

Apart from the stopping criterion, we are now in a position to describe the line-
search in some detail. The notation in the following algorithm is exactly that of §II.3.

Algorithm 2.1.1 (Prototype Line-Search for ε-Descent)

STEP 0 (initialization). Set t_L = 0, t_R = +∞. Choose an initial t > 0.
STEP 1 (work). Obtain f(x + td) and s(t) ∈ ∂f(x + td); compute e(t) defined by
(2.1.1).
STEP 2 (dispatching). If e(t) < ε set t_L = t; if e(t) > ε set t_R = t.
STEP 3 (stopping criterion). Apply the stopping criterion, which must include in par-
ticular: stop the line-search if e(t) = ε or if f(x + td) < f(x) − ε.
STEP 4 (new stepsize). If t_R = +∞ then extrapolate, i.e. compute a new t > t_L.
If t_R < +∞ then interpolate, i.e. compute a new t ∈ ]t_L, t_R[. □
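The loop structure can be sketched as follows (an illustrative skeleton under the simplest safeguards - doubling for extrapolation, bisection for interpolation - and a crude stopping rule; the serious stopping criterion is the subject of §2.2). Here `f` and `subgrad` stand for the black box (U1).

```python
def line_search(f, subgrad, x, d, eps, t0=1.0, max_iter=100):
    """Prototype of Algorithm 2.1.1 for minimizing q_eps along d."""
    tL, tR, t = 0.0, float('inf'), t0
    fx = f(x)
    for _ in range(max_iter):
        y = [xi + t * di for xi, di in zip(x, d)]
        s = subgrad(y)
        e = fx - f(y) + t * sum(si * di for si, di in zip(s, d))  # (2.1.1)
        if f(y) < fx - eps:            # descent achieved: exit (a_eps)
            return t, s
        if abs(e - eps) <= 1e-8:       # e(t) = eps: t is optimal
            return t, s
        if e < eps: tL = t             # too small: new lower bound
        else:       tR = t             # too large: new upper bound
        t = 2.0 * tL if tR == float('inf') else 0.5 * (tL + tR)
    return t, subgrad([xi + t * di for xi, di in zip(x, d)])

# toy run on f(x) = |x| from x = 1 along d = -1 with eps = 0.5
t, s = line_search(lambda z: abs(z[0]),
                   lambda z: [1.0 if z[0] >= 0 else -1.0],
                   [1.0], [-1.0], 0.5)
print(t, s)
```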

As explained in §II.3.1, and as can be seen by a long enough contemplation of
this algorithm, two sequences of stepsizes {t_R} and {t_L} are generated. They have the
properties that {t_L} is increasing and {t_R} is decreasing. At each cycle,

  t_L ≤ t ≤ t_R

for any t minimizing q_ε. Once an interpolation is made, i.e. once some real t_R has
been found, no extrapolation is ever made again. In this case one can ascertain that q_ε
has a minimum "at finite distance". The new t in Step 4 must be computed so that,
if infinitely many extrapolations [resp. interpolations] are made, then t_L → ∞ [resp.
t_R − t_L → 0]. This is what was called the safeguard-reduction Property II.3.1.3.

Remark 2.1.2 See §II.3.4 for suitable safeguarded strategies in Step 4. Without entering into technical details, we mention a possibility for the interpolation formula; it is aimed at minimizing a convex function, so we switch to the (u, r)-notation, instead of (t, q).
Suppose that some u_L > 0 (i.e. t_R = 1/u_L < +∞) has been generated: we do have an actual bracket [u_L, u_R], with corresponding actual function- and slope-values; let us denote them by r_L and r′_L, r_R and r′_R respectively. We will subscript by 0 the endpoint that is better than the other. Look for example at Fig. 2.1.1: u_0 = u_L because r_L < r_R. Finally, call ū the minimum of r_ε, assumed unique for simplicity.

Fig. 2.1.1. Quadratic and polyhedral approximations

To place the next iterate, say u₊, two possible ideas can be exploited.
Idea Q (quadratic). Assume that r_ε is smooth near its minimum. Then it is a good idea to adopt a smooth model for r, for example a convex quadratic function Q:

    r_ε(u) ≈ Q(u) := r_0 + r′_0(u − u_0) + ½c(u − u_0)²,

where c > 0 estimates the local curvature of r_ε. This yields a proposal u_Q := argmin Q for the next iterate.
Idea P (polyhedral). Assume that r_ε is kinky near its minimum. Then, best results will probably be obtained with a piecewise affine model P:

    r_ε(u) ≈ P(u) := max {r_L + r′_L(u − u_L), r_R + r′_R(u − u_R)},

yielding a next proposal u_P := argmin P.

The limited information available from the function r_ε makes it hard to guess a proper value for c, and to choose safely between u_Q and u_P. Nevertheless, the following strategy is efficient:
- Remembering the secant method of Remark II.2.3.5, take for c the difference quotient between slopes computed on the side of ū where u_0 lies: in Fig. 2.1.1, these slopes would have been computed at u_0 and at a previous u_L (if any). By virtue of Theorem I.4.2.1(iii), the slopes have a limit, so normally this c behaves itself.
- Then choose for u₊ either u_Q or u_P, namely the one that is closer to u_0; in Fig. 2.1.1, for example, u₊ would be chosen as u_P.
It can be shown that the resulting interpolation formulae have the following properties: without any additional assumption on r_ε, i.e. on f, u_0 does converge to ū, even though c may grow without bound. If r_ε enjoys some natural additional assumption concerning difference quotients between slopes, then the convergence is also fast in some sense.        □
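The following fragment (ours) illustrates the two proposals u_Q and u_P and the choice between them; the secant point (u_prev, rp_prev), an earlier point on the same side as the better endpoint, is supplied by the caller.

```python
def next_stepsize(uL, rL, rpL, uR, rR, rpR, u_prev, rp_prev):
    """Sketch of the interpolation strategy of Remark 2.1.2 in (u, r)-notation."""
    # subscript 0: the better of the two endpoints
    u0, r0, rp0 = (uL, rL, rpL) if rL < rR else (uR, rR, rpR)
    # Idea P: minimize the polyhedral model (intersect the two affine pieces)
    uP = (rR - rL + rpL * uL - rpR * uR) / (rpL - rpR)
    # Idea Q: minimize the quadratic model with secant curvature c
    c = (rp0 - rp_prev) / (u0 - u_prev)
    if c <= 0.0:
        return uP                     # no usable curvature: fall back on P
    uQ = u0 - rp0 / c
    # take the proposal closer to u0
    return uQ if abs(uQ - u0) <= abs(uP - u0) else uP
```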

2.2 Stopping the Line-Search

It has already been mentioned again and again that the stopping criterion (Step 3 in Algorithm 2.1.1) is by far the most important ingredient of the line-search. Without it, no actual implementation can even be considered. It is a pivot between Algorithms 2.1.1 and 1.3.1.
As is clear from Step 1.2 in Algorithm 1.3.1, the stopping test consists in choosing between three possibilities:
- either the present t is not suitable and the line-search must be continued;
- or the present t is suitable because case (a_ε) is detected: f′_ε(x, d) < 0; this simply amounts to testing the descent inequality

    f(x^p + td_k) < f(x^p) − ε;        (2.2.1)

if it is true, update x^p to x^{p+1} := x^p + td_k and increase p by 1;
- or the present t is suitable because case (b_ε) is detected: f′_ε(x^p, d_k) ≥ 0 and also a suitable ε-subgradient s_{k+1} has been obtained; then append s_{k+1} to the current polyhedral approximation S_k of ∂_ε f(x^p) and increase k by 1.
The second possibility is quite trivial to test and requires no special comment. As for the third, it has been studied all along in Chap. IX. Assume that (2.2.1) never happens, i.e. that k → ∞ in Step 1 of Algorithm 1.3.1. Then the whole point is to realize that 0 ∈ ∂_ε f(x^p), i.e. to have d_k → 0. For this, each iteration k must produce s_{k+1} ∈ ℝ^n satisfying two properties, already seen before Lemma 1.2.4:

    s_{k+1} ∈ ∂_ε f(x^p),        (2.2.2)

    ⟨s_{k+1}, d_k⟩ ≥ "0".        (2.2.3)

Here, the symbol "0" can be taken as the mere number 0, as was the case in the framework of Theorem 1.3.4. However, we saw in Remark IX.2.1.3 that "0" could also be a suitable negative tolerance, namely a fraction of −‖d_k‖². Actually, the following specification of (2.2.3):

    ⟨s_{k+1}, d_k⟩ ≥ −m′‖d_k‖²        (2.2.4)

preserves the desired property d_k → 0: we are in the framework of Lemma IX.2.1.1.


On the other hand, if s_{k+1} is to be the current subgradient s(t) = s(x^p + td_k) (t > 0 being the present trial stepsize, t_L or t_R), we know from (1.2.6) that (2.2.2) is equivalent to e(x^p, x^p + td_k, s(t)) ≤ ε, i.e.

    f(x^p) − f(x^p + td_k) + t⟨s(t), d_k⟩ ≤ ε.        (2.2.5)

It can be seen that the two requirements (2.2.4) and (2.2.5) are antagonistic. They may even be incompatible:

Example 2.2.1 With n = 1, take the objective function

    x ↦ f(x) = max {f₁(x), f₂(x)} with f₁(x) := exp x, f₂(x) := 2 − x.        (2.2.6)

The solution x̄ of exp x + x = 2 is the only point where f is not differentiable. It is also the minimum of f, and f(x̄) = exp x̄. Note that x̄ ∈ ]0, 1[ and f(x̄) ∈ ]1, 2[.
The subdifferential of f is

    ∂f(x) = {−1}          if x < x̄,
            [−1, exp x̄]   if x = x̄,
            {exp x}       if x > x̄.

Now suppose that the ε-descent Algorithm 1.3.1 is initialized at x^1 = 0, where the direction of search is d₁ = 1. The trace-function t ↦ f(0 + t·1) is differentiable except at a certain t̄ ∈ ]0, 1[, corresponding to x̄. The linearization-error function e of (2.1.1) is

    e(t) = 0                  if 0 ≤ t < t̄,
           2 + (t − 1) exp t  if t > t̄.

At t = t̄, the number e(t̄) is somewhere in [0, e*] (depending on which subgradient s(t̄) is actually computed by the black box), where

    e* := 2 + (t̄ − 1) exp t̄ = 3t̄ − t̄² > 1.

This example is illustrated in Fig. 2.2.1. Finally suppose that ε ≤ e*, for example ε = 1. Clearly enough,

Fig. 2.2.1. Discontinuity of the function e

- if t ∈ [0, t̄[, then ⟨s(t), d₁⟩ = −1 = −‖d₁‖², so (2.2.4) cannot hold, no matter how m′ is chosen in ]0, 1[;
- if t > t̄, then e(t) > e* ≥ ε, so (2.2.5) cannot hold.
In other words, it is impossible to obtain simultaneously (2.2.4) and (2.2.5) unless the following two extraordinary events happen:
(i) the line-search must produce the particular stepsize t = t̄ (said otherwise, q_ε or r_ε must be exactly minimized);
(ii) at this x̄ = 0 + t̄·1, the black box must produce a rather particular subgradient s, namely one between the two extreme points −1 and exp x̄, so that ⟨s, d₁⟩ is large enough and (2.2.5) has a chance to hold.
Note also that, if ε is large enough, namely ε ≥ 2 − f(x̄), (2.2.1) cannot be obtained, just by definition of x̄.        □
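The numbers claimed in this example are easy to check numerically; the following snippet (our own verification, using scipy's root finder) confirms that t̄ ≈ 0.443 ∈ ]0, 1[ and e* = 3t̄ − t̄² ≈ 1.13 > 1, so that ε = 1 ≤ e* indeed produces the deadlock.

```python
import numpy as np
from scipy.optimize import brentq

# tbar solves exp(t) + t = 2 (here tbar = xbar, since x1 = 0 and d1 = 1)
tbar = brentq(lambda t: np.exp(t) + t - 2.0, 0.0, 1.0)
e_star = 2.0 + (tbar - 1.0) * np.exp(tbar)

print(tbar)                         # ~0.4429, indeed in ]0, 1[
print(e_star, 3 * tbar - tbar**2)   # ~1.133 twice: e* = 3*tbar - tbar^2 > 1
```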
Remark 2.2.2 A reader not familiar with numerical computations may not consider (i) as so extraordinary. He should however remember how a line-search algorithm has to work (see §II.3, and especially Example II.3.1.5). Furthermore, t̄ is given by a non-solvable equation and cannot be computed exactly. This is why we bothered to take a somewhat complicated f in (2.2.6). The example would be just as good, say, with f(x) = |x − 1|.        □

From this example, we draw the following conclusion: it may happen that, for the given f, x, d and ε, no line-search algorithm can produce a suitable new element in ∂_ε f(x); and this no matter how m′ is chosen in ]0, 1[. When this phenomenon happens, Algorithm 2.1.1 is stuck: it loops forever between Step 1 and Step 4. Of course, relaxing condition (2.2.4), say by taking m′ ≥ 1, does not help because then it is Algorithm 1.3.1 which may be stuck, looping within Step 1: d_k will have no reason to tend to 0 and the stop of Step 1.1 will never occur.
The diagnosis is that (1.2.6), i.e. the transportation formula, does not allow us to reach those ε-subgradients that we need. The remedy is to use a slightly more sophisticated formula, to construct ε-subgradients as combinations of subgradients computed at several other points:

Lemma 2.2.3 Let there be given x ∈ ℝ^n, y_j ∈ ℝ^n and s_j ∈ ∂f(y_j) for j = 1, ..., m; set

    e_j := e(x, y_j, s_j) = f(x) − f(y_j) − ⟨s_j, x − y_j⟩ for j = 1, ..., m.

With α = (α₁, ..., α_m) in the unit simplex Δ_m, set e := Σ_{j=1}^m α_j e_j. Then there holds

    s := Σ_{j=1}^m α_j s_j ∈ ∂_e f(x).

PROOF. Immediate: proceed exactly as for the transportation formula XI.4.2.2.        □
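In computational terms, the lemma is just a weighted average of the bundle. A minimal sketch (our notation):

```python
import numpy as np

def combine_subgradients(f, x, ys, ss, alphas):
    """Sketch of Lemma 2.2.3: with s_j in df(y_j) and alpha in the unit
    simplex, s = sum alpha_j s_j is an e-subgradient of f at x, where
    e = sum alpha_j e_j combines the linearization errors."""
    f_x = f(x)
    es = np.array([f_x - f(y) - np.dot(s, x - y) for y, s in zip(ys, ss)])
    e = float(np.dot(alphas, es))              # combined linearization error
    s = sum(a * np.asarray(s_j) for a, s_j in zip(alphas, ss))
    return s, e
```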

This result can be applied to our line-search problem. Suppose that the minima of q_ε have been bracketed by t_L and t_R: in view of Lemma 1.2.2, we have on hand two subgradients s_L := s(x + t_L d) and s_R := s(x + t_R d) satisfying

    t_L⟨s_L, d⟩ < ε − f(x) + f(x + t_L d),
    t_R⟨s_R, d⟩ > ε − f(x) + f(x + t_R d).

Then there are two key observations. One is that, since (2.2.1) is assumed not to hold at t = t_R (otherwise we have no problem), the second inequality above implies 0 < ⟨s_R, d⟩. Second, (1.2.6) implies s_L ∈ ∂_ε f(x) and we may just assume ⟨s_L, d⟩ < 0 (otherwise we have no problem either). In summary,

    ⟨s_L, d⟩ < 0 < ⟨s_R, d⟩.        (2.2.7)

Therefore, consider a convex combination

    s_μ := μ s_L + (1 − μ) s_R, μ ∈ [0, 1].

According to (2.2.7), ⟨s_μ, d⟩ ≥ −m′‖d‖² if μ is small enough, namely μ ≤ μ̄ with

    μ̄ := (⟨s_R, d⟩ + m′‖d‖²) / (⟨s_R, d⟩ − ⟨s_L, d⟩).

On the other hand, Lemma 2.2.3 shows that s_μ ∈ ∂_ε f(x) if μ is large enough, namely μ ≥ μ̲, with

    μ̲ := (e_R − ε) / (e_R − e_L) ∈ ]0, 1[.        (2.2.8)

It turns out (and this will follow from Lemma 2.3.2 below) that, from the property f(x + td) ≥ f(x) − ε, we have μ̲ ≤ μ̄ for t_R − t_L small enough. When this happens, we are done because we can find s_{k+1} = s_μ satisfying (2.2.4) and (2.2.5). Algorithm 1.3.1 can be stopped, and the current approximation of ∂_ε f(x^p) can be suitably enlarged.
There remains to treat the case in which no finite t_R can ever be found, which happens if q_ε behaves as in Fig. 1.2.2. Actually, this case is simpler: t_L → ∞ and Lemma 1.2.4 tells us that s(x + t_L d) is eventually convenient.
A simple and compact way to carry out the calculations is to take for μ the maximal value (2.2.8), it being understood that μ = 1 if t_R = ∞. This amounts to taking systematically s_μ in ∂_ε f(x). Then the stopping criterion is:
(a_ε) stop the line-search if f(x + td) < f(x) − ε (and pass to the next x);
(b_ε) stop the line-search if ⟨s_μ, d⟩ ≥ −m′‖d‖² (and pass to the next d).
In all other cases, continue the line-search, i.e. pass to the next trial t.
The consistency of this stopping criterion will be confirmed in the next section.
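Gathering the pieces, the decision made at each cycle can be sketched as follows (our function names; the value m′ = 0.1 suggested later in §2.3 is used as a default):

```python
import numpy as np

def stopping_test(f_x, f_t, eps, d, sL, eL, sR=None, eR=None, m_prime=0.1):
    """Sketch of the stopping criterion of Section 2.2.  Returns
    ('descent', None), ('enrich', s_mu) or ('continue', None)."""
    if f_t < f_x - eps:                              # case (a_eps)
        return 'descent', None
    if sR is None:                                   # no t_R yet: mu = 1
        mu, s_mu = 1.0, sL
    else:                                            # maximal mu of (2.2.8)
        mu = (eR - eps) / (eR - eL)
        s_mu = mu * sL + (1.0 - mu) * sR             # lies in d_eps f(x)
    if np.dot(s_mu, d) >= -m_prime * np.dot(d, d):   # case (b_eps)
        return 'enrich', s_mu
    return 'continue', None
```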

2.3 The ε-Descent Algorithm and Its Convergence

We are now in a position to realize this Section 2, and to write down the complete organization of the algorithm proposed. The initial iterate x^1 is on hand, together with the black box (U1) which, for each x ∈ ℝ^n, computes f(x) and s(x) ∈ ∂f(x). Furthermore, a number k̄ ≥ 2 is given, which is the maximal number of n-vectors that can be stored, in view of the memory allocated in the computer. We choose the tolerances ε > 0, δ > 0 and m′ ∈ ]0, 1[. Our aim is to furnish a final iterate x^{p*} satisfying

    f(y) ≥ f(x^{p*}) − ε − δ‖y − x^{p*}‖ for all y ∈ ℝ^n.        (2.3.1)

This gives some hints on how to choose the tolerances: ε is homogeneous to function-values; as in Chap. II, δ is homogeneous to gradient-norms; and the value m′ = 0.1 is reasonable.
The algorithm below is of course a combination of Algorithms 1.3.1 and 2.1.1, and of the stopping criterion introduced at the end of §2.2. Notes such as (1) refer to explanations given afterwards.

Algorithm 2.3.1 (Algorithm of ε-Descent) The initial iterate x^1 is given. Set p = 1. Compute f(x^p) and s(x^p) ∈ ∂f(x^p). Set f^0 = f(x^p), s_1 = s(x^1).(1)
STEP 1 (starting to find a direction of ε-descent). Set k = 1.
STEP 2 (computing the trial direction). Set d = −Σ_{j=1}^k α_j s_j, where α solves

    min {½‖Σ_{j=1}^k α_j s_j‖² : α_j ≥ 0, Σ_{j=1}^k α_j = 1}.

STEP 3 (final stopping test). If ‖d‖ ≤ δ stop.(2)
STEP 4 (initializing the line-search). Choose an initial t > 0. Set t_L = 0, s_L = s_1, e_L = 0; t_R = 0, s_R = 0.(3)
STEP 5. Compute f = f(x^p + td) and s = s(x^p + td) ∈ ∂f(x^p + td); set

    e = f^0 − f + t⟨s, d⟩.

STEP 6 (dispatching). If e < ε, set t_L = t, e_L = e, s_L = s. If e ≥ ε, set t_R = t, e_R = e, s_R = s. If t_R = 0, set μ = 1; otherwise set μ = (e_R − ε)/(e_R − e_L).(4)
STEP 7 (stopping criterion of the line-search). If f < f^0 − ε go to Step 11. Set s_μ = μs_L + (1 − μ)s_R.(5) If ⟨s_μ, d⟩ ≥ −m′‖d‖² go to Step 9.
STEP 8 (iterating the line-search). If t_R = 0, extrapolate, i.e. compute a new t > t_L. If t_R ≠ 0, interpolate, i.e. compute a new t in ]t_L, t_R[. Loop to Step 5.
STEP 9 (managing the computer memory). If k = k̄, delete at least two (arbitrary) elements from the list s_1, ..., s_k. Insert in the new list the element −d coming from Step 2 and let k < k̄ be the number of elements thus obtained.(6)
STEP 10 (iterating the direction-finding process). Set s_{k+1} = s_μ.(7) Replace k by k + 1 and loop to Step 2.
STEP 11 (iterating the descent process). Set x^{p+1} = x^p + td, s_1 = s.(8) Replace p by p + 1 and loop to Step 1.        □
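Step 2 is the only nontrivial computation besides the line-search: a projection of the origin onto the convex hull of the bundled subgradients. A possible sketch (ours) delegates the simplex-constrained QP to a general-purpose solver; a production code would use a specialized QP routine.

```python
import numpy as np
from scipy.optimize import minimize

def direction(S):
    """Sketch of Step 2: project the origin onto conv{s_1, ..., s_k}.
    S is a (k, n) array of bundled subgradients; returns (d, alpha)
    with d = -sum alpha_j s_j and alpha solving the simplex QP."""
    k = S.shape[0]
    obj = lambda a: 0.5 * np.dot(a @ S, a @ S)
    res = minimize(obj, np.full(k, 1.0 / k),
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}],
                   method='SLSQP')
    alpha = res.x
    return -(alpha @ S), alpha
```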

Comments
(1) f^0 is the objective-value at the origin-point x^p of each line-search; s_1 is the subgradient computed at this origin-point.
(2) This stop may not be a real stop but a signal to reduce ε and/or δ (cf. Remark 1.3.5). It should be clear that, at this stage of the algorithm, the current iterate satisfies the approximate minimality condition (2.3.1).
(3) The initializations of s_L and e_L are "normal": in view of (1) above, s_1 ∈ ∂f(x^p) and, as is obvious from (2.1.1), e(x^p, x^p, s_1) = 0. On the other hand, the initializations of t_R and s_R are purely artificial (see Algorithm II.3.1.2 and the remark following it): they simply help define s_μ in the forthcoming Step 7. Initializing e_R is not necessary.
(4) In view of the values of e_L and e_R, μ becomes < 1 as soon as some t_R > 0 is found.
(5) If t_R = 0, then μ = 1 and s_μ = s_L. If e_R = ε, then μ = 0 and s_μ = s_R. Otherwise s_μ is a nontrivial convex combination. In all cases s_μ ∈ ∂_ε f(x^p).
(6) We have chosen to give a specific rule for cleaning the bundle, instead of staying abstract as in Algorithm IX.2.1.5. If k < k̄, then k can be increased by one and the next subgradient can be appended to the current approximation of ∂_ε f(x^p). Otherwise, one must make room to store at least the current projection −d and the next subgradient s_μ, thus making it possible to use Theorem IX.2.1.7. Note, however, that if the deletion process turns out to keep every s_j corresponding to α_j > 0, then the current projection need not be appended: it belongs to the convex hull of the current subgradients and will not improve the next polyhedron anyway.
(7) Here we arrive from Step 7 via Step 10, and s_μ is the convenient convex combination found by the line-search. This s_μ is certainly in ∂_ε f(x^p): if μ = 1, then e_L ≤ ε and the transportation formula (XI.4.2.2) applies; if μ ∈ ]0, 1[, the corresponding linearization error μe_L + (1 − μ)e_R is equal to ε, and it is Lemma 2.2.3 that applies.
(8) At this point, s is the last subgradient computed at Step 5 of the line-search. It is a subgradient at the next iterate x^{p+1} = x^p + td.        □

A loop from Step 11 to Step 1 represents an actual ε-descent, with x^p moved; this was called a descent-step in Remark 1.3.3. A loop from Step 10 to Step 2 represents one iteration of the direction-finding procedure, i.e. a null-step: one subgradient is appended to the current approximation of ∂_ε f(x^p). The detailed line-search is expanded between Steps 5 and 8. At each cycle, Step 7 decides whether
- to iterate the line-search, by a loop from Step 8 to Step 5,
- to iterate the direction-finding procedure,
- or to iterate the descent process, with an actual ε-descent obtained.
Now we have to prove convergence of this algorithm. First, we make sure that each line-search terminates.

Lemma 2.3.2 Let f : ℝ^n → ℝ be convex. Suppose that the extrapolation and interpolation formulae in Step 8 of Algorithm 2.3.1 satisfy the safeguard-reduction Property II.3.1.3. Then, for each iteration (p, k), the number of loops from Step 8 to Step 5 is finite.

PROOF. Suppose for contradiction that, for some fixed iteration (p, k), the line-search does not terminate. Suppose first that no t_R > 0 is ever generated. By construction, μ = 1 and s_μ = s = s_L forever. One has therefore at each cycle

    f(x^p) − ε ≤ f(x^p + t_L d) ≤ f(x^p) + t_L⟨s_μ, d⟩,

where the first inequality holds because Step 7 never exits to Step 11; the second is because s_μ ∈ ∂f(x^p + t_L d). We deduce

    ⟨s_μ, d⟩ ≥ −ε/t_L.        (2.3.2)

Now, the assumptions on the extrapolation formulae imply that t_L → ∞. In view of the test Step 7 → Step 9, (2.3.2) cannot hold infinitely often.
Thus some t_R > 0 must eventually be generated, and Step 6 shows that, from then on, μe_L + (1 − μ)e_R = ε at every subsequent cycle. This can be expressed as follows:

    f(x^p) − μf(x^p + t_L d) − (1 − μ)f(x^p + t_R d) = ε − t_L μ⟨s_L, d⟩ − t_R(1 − μ)⟨s_R, d⟩.

Furthermore, non-exit from Step 7 to Step 11 implies that the left-hand side above is smaller than ε, so

    t_L μ⟨s_L, d⟩ + t_R(1 − μ)⟨s_R, d⟩ ≥ 0.        (2.3.3)

By assumption on the interpolation formulae, t_L and t_R have a common limit t̄ ≥ 0, and we claim that t̄ > 0. If not, the Lipschitz continuity of f in a neighborhood of x^p (Theorem IV.3.1.2) would imply the contradiction

    0 < ε ≤ e_R = f(x^p) − f(x^p + t_R d) + t_R⟨s_R, d⟩ ≤ 2Lt_R → 0

(L being a Lipschitz constant around x^p).
Thus, after division by t̄ > 0, (2.3.3) can be written as

    μ⟨s_L, d⟩ + (1 − μ)⟨s_R, d⟩ + η ≥ 0,

where the extra term

    η := [(t_L − t̄)/t̄] μ⟨s_L, d⟩ + [(t_R − t̄)/t̄] (1 − μ)⟨s_R, d⟩

tends to 0 when the number of cycles tends to infinity. This is impossible because of the test Step 7 → Step 9.        □

The rest is now a variant of Theorem 1.3.4.

Theorem 2.3.3 The assumptions are those of Lemma 2.3.2. Then either f(x^p) ↓ −∞, or the stop of Step 3 occurs for some finite p*, yielding x^{p*} satisfying the approximate minimality condition (2.3.1).

PROOF. In Algorithm 2.3.1, p cannot go to +∞ if f is bounded from below. Consider the last iteration, say the p*-th. From Lemma 2.3.2, each iteration k exits to Step 9. By construction, there holds for all k

    s_{k+1} ∈ ∂_ε f(x^{p*}),

hence the sequence {s_k} is bounded (x^{p*} is fixed!), and

    ⟨s_{k+1}, d_k⟩ ≥ −m′‖d_k‖².

We are in the situation of Lemma IX.2.1.1: k cannot go to +∞, in view of Step 3.        □

Remark 2.3.4 This proof reveals the necessity of taking the tolerance m′ > 0 to stop each line-search. If m′ were set to 0, we would obtain a non-implementable algorithm of the type IX.1.6.
Other tolerances can equally be used. In the next section, precisely, we will consider variations around the stopping criterion of the line-search. Here, we propose the following exercise: reproduce Lemma 2.3.2 with m′ = 0, but with the descent criterion in Step 7 replaced by

    if f < f^0 − ε′ go to Step 11,

where ε′ is fixed in ]0, ε[.        □

3 Putting the Algorithm in Perspective

As already stated in Remark 1.1.6, we are not interested in giving a specific choice for ε in algorithms such as 2.3.1. It is important for numerical efficiency only; but precisely, we believe that these algorithms cannot be made numerically efficient under their present form. We nevertheless consider two variants, obtained by giving ε rather special values. They are interesting for the sake of curiosity; furthermore, they illustrate an aspect of the algorithm not directly related to optimization, but rather to the separation of closed convex sets - see §IX.3.3.

3.1 A Pure Separation Form

Suppose that f̄ := inf {f(y) : y ∈ ℝ^n} > −∞ is known and set

    ε̄ := f(x) − f̄,        (3.1.1)

x = x^1 being the starting point of the ε-descent Algorithm 2.3.1. If we take this value ε̄ for ε in that algorithm, the exit from Step 7 to Step 11 will never occur. The line-searches will be made along a sequence of successive directions, all issuing from the same starting x.
Then, we could let the algorithm run with ε = ε̄, simply as stated in 2.3.1 (the exit-test from Step 7 to Step 11 - and Step 11 itself - could simply be suppressed with no harm). However, this does no good in terms of minimizing f: when the algorithm stops, we will learn that the approximate minimality condition (2.3.1) holds with x^{p*} = x = x^1 and ε = ε̄; but we knew this before, even with δ = 0.
Fortunately, there are better things to do. Observe first that, instead of minimizing f within ε̄ (which has already been done), the idea of the algorithm is now to solve the following problem: by suitable calls to the black box (U1), compute ε̄-subgradients at the given x, so as to obtain 0 - or at least a vector of small norm - in their convex hull. Even though the definition of ε̄ implies 0 ∈ ∂_ε̄ f(x), the only constructive information concerning this latter set is the black box (U1), and 0 might never be produced by (U1). In other words, we must try to separate 0 from ∂_ε̄ f(x). We will fail, but the point is to explain this failure.

Remark 3.1.1 The above problem is not a theoretical pastime but is of great interest in Lagrangian relaxation. Suppose that (U1) is a "Lagrange black box", as discussed in §XII.1.1: each subgradient is of the form s = −c(u), u being a primal variable. When we solve the above separation problem, we construct primal points u_1, ..., u_k and associated convex multipliers α_1, ..., α_k such that

    Σ_{j=1}^k α_j c(u_j) = 0.

In favourable situations - when some convexity is present - the corresponding combination Σ_{j=1}^k α_j u_j is the primal solution that was sought in the first place; see Theorem XII.2.3.4, see also Theorem XII.4.2.5.        □
Thus, we are precisely in the framework of §IX.3.3: rather than a minimization algorithm, Algorithm 2.3.1 becomes a separation one, namely a particular instance of Algorithm IX.3.3.1. Using the notation S := ∂_ε̄ f(x), what the latter algorithm needs in its Step 2 is a mechanism for computing the support function σ_S(d) of S at any given d ∈ ℝ^n, together with a solution s(d) in the exposed face F_S(d). The comparison becomes perfect if we interpret the line-search of Algorithm 2.3.1 precisely as this mechanism.
So, upon exit from Step 7 to Step 9, we still want s_μ to be an ε̄-subgradient at x; but we also want

    ⟨s_μ, d⟩ ≥ ⟨s, d⟩ for all s ∈ S,

in order to have

    ⟨s_μ, d⟩ = σ_S(d).

The next result shows that, once again, our task is to minimize r_ε̄: this s_{k+1} must be some s ∈ ∂f(x + (1/u)d), with u yielding 0 ∈ ∂r_ε̄(u).

Theorem 3.1.2 Let x, d ≠ 0 and ε ≥ 0 be given. Suppose that s ∈ ℝ^n satisfies one of the following two properties:
(i) either, for some t > 0, s ∈ ∂f(x + td) and e(x, x + td, s) = ε;
(ii) or s is a cluster point of a sequence {s(t) ∈ ∂f(x + td)}_t with t → +∞ and e(x, x + td, s(t)) ≤ ε for all t ≥ 0.
Then

    s ∈ ∂_ε f(x) and ⟨s, d⟩ = f′_ε(x, d).

PROOF. The transportation formula (Proposition XI.4.2.2) directly implies that the s in (i) and the s(t) in (ii) are in ∂_ε f(x). Invoking the closedness of ∂_ε f(x) for case (ii), we see that s ∈ ∂_ε f(x) in both cases.
Now take an arbitrary s′ ∈ ∂_ε f(x), satisfying in particular

    f(x + td) ≥ f(x) + t⟨s′, d⟩ − ε for all t > 0.        (3.1.2)

[Case (i)] Replacing ε in (3.1.2) by e(x, x + td, s), we obtain

    0 ≥ t⟨s′, d⟩ − t⟨s, d⟩.

Divide by t > 0: ⟨s, d⟩ = σ_{∂_ε f(x)}(d), since s′ was arbitrary.
[Case (ii)] Add the subgradient inequality f(x) ≥ f(x + td) − t⟨s(t), d⟩ to (3.1.2):

    0 ≥ t⟨s′, d⟩ − t⟨s(t), d⟩ − ε.

Divide by t > 0 and let t → +∞ to see that ⟨s, d⟩ ≥ ⟨s′, d⟩.        □

In terms of Algorithm 2.3.1, we realize that the s_μ computed in Step 7 precisely aims at satisfying the conditions in Theorem 3.1.2:
(i) Suppose first that some t_R > 0 is generated (which implies T_ε ≠ ∅). When sufficiently many interpolations are made, t_L and t_R become close to their common limit, say t_ε. Because ∂f(·) is outer semi-continuous, s_L and s_R are both close to ∂f(x + t_ε d). Their convex combination s_μ is also close to the convex set ∂f(x + t_ε d). Finally, e(x, x + t_ε d, s_μ) is close to μe_L + (1 − μ)e_R = ε, by continuity of f. In other words, s_μ almost satisfies the assumptions of Theorem 3.1.2(i). This is illustrated in Fig. 3.1.1.

Fig. 3.1.1. Identifying the approximate directional derivative

(ii) If no t_R > 0 is generated and t_L → ∞ (which happens when T_ε = ∅), it is case (ii) that is satisfied by the corresponding s_L = s_μ.
In a word, Algorithm 2.3.1 becomes a realization of Algorithm IX.3.3.1 if we let the interpolations [resp. the extrapolations] drive t_R − t_L small enough [resp. t_L large enough] before exiting from Step 7 to Step 9. It is then not difficult to adapt the various tolerances so that the approximation is good enough, namely ⟨s_μ, d_k⟩ is close enough to σ_S(d_k).
These ideas have been used for the numerical experiments of Figs. IX.3.3.1 and IX.3.3.2. It is the above variant that was the first form of Algorithm IX.3.3.1, alluded to at the end of §IX.3.3. In the case of TR48 (Example IX.2.2.6), ∂_ε̄ f(x) was identified at x = 0, a significantly non-optimal point: f(0) = −464816, while f̄ = −638565. All generated s_k had roughly comparable norms. In the example MAXQUAD of VIII.3.3.3, this was far from being the case: the norms of the subgradients were fairly dispersed; remember Table VIII.3.3.1.

Remark 3.1.3 An interesting question is whether this variant produces a minimizing sequence. The answer is no: take the example of (IX.1.1), where the objective function was

    f(x) = max {x¹ + 2x², −x¹, x¹ − 2x²}.

Start from the initial x = (2, 1) and take ε = ε̄ = 4. We leave it as an exercise to see how Algorithm 1.3.1 proceeds: the first iteration, along the direction (−1, −2), produces y_2 anywhere on the half-line {(2 − t, 1 − 2t) : t ≥ 1/2}, and s_2 is unambiguously (1, −2). Then d_2 is collinear to (−1, 0), y_3 = (−1, 1) and s_3 = (−1/3, 2/3) (for this last calculation, take s_3 = a(−1, 0) + (1 − a)(1, 2), a convex combination of subgradients at y_3, and adjust a so that the corresponding linearization error e(x, y_3, s_3) is equal to ε = 4). Finally d_3 = 0, although neither y_2 nor y_3 is even close to the minimum (0, 0). These operations are illustrated by Fig. 3.1.2.
Note in this example that d_3 = 0 is a convex combination of s_2 and s_3: s_1 plays no role. Thus 0 is on the boundary of co{s_1, s_2, s_3} = ∂_4 f(x). If ε is decreased, s_2 and s_3 are slightly perturbed and 0 ∉ co{s_1, s_2, s_3}: d_3 is no longer zero (although it stays small). The property 0 ∈ bd ∂_4 f(x) does not hold by chance but was proved in Proposition XI.1.3.5: ε̄ = 4 is the critical value of ε and the normal cone to ∂_4 f(x) at 0 is not reduced to {0}.        □

Fig. 3.1.2. Three iterations of the separating algorithm

3.2 A Totally Static Minimization Algorithm

Our second variant is also a sort of separation Algorithm IX.3.3.1. It also performs one single descent iteration, the starting point of the line-searches never being updated. It differs from the previous example, though: it is really aimed at minimizing f, and ε is no longer fixed but depends on the direction-index k. Specifically, for fixed x = x^1 and d ≠ 0 in Algorithm 2.3.1, we set

    ε = ε(x, d) := f(x) − inf_{t>0} f(x + td).        (3.2.1)

A second peculiarity of our variant is therefore that ε is not known; and also, it may be 0 (for simplicity, we assume ε(x, d) < +∞). Nevertheless, the method can still be made implementable in this case. In fact, a careful study of Algorithm 2.3.1 shows that the actual value of ε is needed in three places:
- for the exit-test from Step 7 to Step 11; with the ε of (3.2.1), this test is never passed and can be suppressed, just as in §3.1;
- in Step 6, to dispatch t as a t_L or t_R; Proposition 3.2.1 below shows that the precise value (3.2.1) is useless for this particular operation;
- in Step 6 again, to compute μ; it is here that the lack of knowledge of ε has the main consequences.

Proposition 3.2.1 For given x, d ≠ 0 in ℝ^n and ε of (3.2.1), the minimizers of t ↦ f(x + td) over ℝ₊ are also minimizers of q_ε - knowing that q_0 is the ordinary difference quotient, with q_0(0) = f′(x, d). The converse is true if ε > 0.

PROOF. Let t̂ minimize f(x + ·d) over ℝ₊. If t̂ = 0, then ε(x, d) = 0 and q_{ε(x,d)}(0) = f′(x, d) ≤ q_{ε(x,d)}(t) for all t ≥ 0 (monotonicity of the difference quotient); the proof is finished.
If t̂ > 0, the definition

    f(x + t̂d) = f(x) − ε(x, d),

combined with the minimality condition

    0 ∈ ⟨∂f(x + t̂d), d⟩,

shows that t̂ satisfies the minimality condition of Lemma 1.2.2(i) for q_{ε(x,d)}.
Now let {τ_k} ⊂ ℝ₊ be a minimizing sequence for f(x + ·d):

    f(x + τ_k d) → f(x) − ε,

where ε is the ε(x, d) of (3.2.1). If ε > 0, then 0 cannot be a cluster point of {τ_k}: we can write

    [f(x + τ_k d) − f(x) + ε] / τ_k → 0,

and this implies f′_ε(x, d) ≤ 0. If t_ε minimizes q_ε, we have

    [f(x + t_ε d) − f(x) + ε] / t_ε = f′_ε(x, d) ≤ 0.

Thus

    f(x + t_ε d) ≤ f(x) − ε:

t_ε does minimize f(x + ·d) (and f′_ε(x, d) is actually zero).        □
It is therefore a good idea to minimize f(x + td) with respect to t ≥ 0: the implicitly given function q_{ε(x,d)} will by necessity be minimized - and it is a much more natural task. A first consequence is that the decision to dispatch t is even simpler than before, namely:

    if ⟨s, d⟩ < 0, t must become a t_L (f seems to be decreasing),        (3.2.2)
    if ⟨s, d⟩ > 0, t must become a t_R (f seems to be increasing),        (3.2.3)

knowing that ⟨s, d⟩ = 0 terminates the line-search, of course.
However, the object of the line-search is now to obtain t and s₊ ∈ ∂f(x + td) satisfying ⟨s₊, d⟩ = 0; the property s₊ ∈ ∂_{ε(x,d)} f(x) will follow automatically.
The choice of μ in Step 6 can then be adapted accordingly. The rationale for μ in Algorithm 2.3.1 was based on forcing s_μ to be a priori in ∂_ε f(x), and it was only eventually that ⟨s_μ, d⟩ became close to zero, i.e. large enough in that context. Here it is rather the other way round: we do not know in what ε(x, d)-subdifferential the subgradient s_μ should be, but we do know that we want to obtain ⟨s_μ, d⟩ = 0. Accordingly, it becomes suitable to compute μ in Step 6 by

    μ = ⟨s_R, d⟩ / (⟨s_R, d⟩ − ⟨s_L, d⟩)        (3.2.4)

when t_R > 0 (otherwise, take μ = 1 as before). Observe that, fortunately, this is in harmony with the dispatching (3.2.2), (3.2.3): provided that s_R exists, μ ∈ [0, 1].
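For the record, (3.2.4) simply zeroes the slope of the convex combination; in code (ours):

```python
def mu_static(sL_dot_d, sR_dot_d):
    """mu of (3.2.4): with <s_L, d> <= 0 <= <s_R, d>, the combination
    s_mu = mu*s_L + (1 - mu)*s_R satisfies <s_mu, d> = 0."""
    return sR_dot_d / (sR_dot_d - sL_dot_d)
```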
It remains to find a suitable stopping criterion for the line-search.
- First, if no t_R > 0 is ever found (f decreasing along d), there is nothing to change: in view of Lemma 1.2.2(ii), s_μ = s(x + t_L d) ∈ ∂_{ε(x,d)} f(x) for all t_L; furthermore, Lemma 1.2.4(ii) still applies and eventually, ⟨s_μ, d⟩ ≥ −m′‖d‖².
- The other case is when a true bracket [t_L, t_R] is eventually found; then Lemma 2.2.3 implies s_μ ∈ ∂_ē f(x), where

    ē := μe_L + (1 − μ)e_R        (3.2.5)

and μ is defined by (3.2.4). As for ε(x, d), we obviously have

    ε(x, d) ≥ e̲ := max {f(x) − f(x + t_L d), f(x) − f(x + t_R d)}.        (3.2.6)

Continuity of f ensures that ē and e̲ have the common limit ε(x, d) when t_L and t_R both tend to the optimal (and finite) t̄ ≥ 0. Then, because of the outer semi-continuity of ε ↦ ∂_ε f(x), s_μ is close to ∂_{ε(x,d)} f(x) when t_R and t_L are close together.
In a word, without entering into the hair-splitting details of the implementation, the following variant of Algorithm 2.3.1 is reasonable.

Algorithm 3.2.2 (Totally Static Algorithm - Sketch) The initial point x is given, together with the tolerances η > 0, δ > 0, m′ ∈ ]0, 1[. Set k = 1, s_1 = s(x).
STEP 1. Compute d_k := −Proj 0/{s_1, ..., s_k}.
STEP 2. If ‖d_k‖ ≤ δ stop.
STEP 3. By an approximate minimization of t ↦ f(x + td_k) over t ≥ 0, find t_k > 0 and s_{k+1} such that

    ⟨s_{k+1}, d_k⟩ ≥ −m′‖d_k‖² and s_{k+1} ∈ ∂_{e_k} f(x),        (3.2.7)

where

    e_k := f(x) − f(x + t_k d_k) + η.

STEP 4. Replace k by k + 1 and loop to Step 1.        □
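A minimal driver for this sketch might look as follows; `direction` is the projection sketch given after Algorithm 2.3.1, and `argmin_1d` is a hypothetical routine implementing the approximate minimization of Step 3 (it should return the stepsize, the f-value there, and a subgradient there).

```python
import numpy as np

def totally_static(f_and_s, x, argmin_1d, delta=1e-6, max_k=500):
    """Sketch of Algorithm 3.2.2: x never moves, the bundle only grows."""
    f_x, s = f_and_s(x)
    S = [s]
    best_f, best_y = f_x, x
    for _ in range(max_k):
        d, _ = direction(np.array(S))
        if np.linalg.norm(d) <= delta:       # Step 2: (3.2.8) holds at best_y
            break
        t, f_t, s_new = argmin_1d(x, d, f_and_s)   # approximate Step 3
        if f_t < best_f:                     # record the best point x_k
            best_f, best_y = f_t, x + t * d
        S.append(s_new)
    return best_f, best_y
```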

Here, there is no longer any descent test, but a new tolerance η is introduced, to make sure that ē − e̲ of (3.2.5), (3.2.6) is small. Admitting that Step 3 terminates for each k, this algorithm is convergent:

Theorem 3.2.3 In Algorithm 3.2.2, let

    x_k = x + t_k d_k ∈ Argmin {f(x + t_i d_i) : i = 1, ..., k}

be one of the best points found during the successive iterations. If f is bounded from below, the stop of Step 2 occurs for some finite k, and then

    f(y) ≥ f(x_k) − η − δ‖y − x_k‖ for all y ∈ ℝ^n.        (3.2.8)

PROOF. Corresponding to the best point x_k, set

    ε_k := f(x) − f(x_k) + η.        (3.2.9)

The sequence {ε_k} is increasing. By construction, e_k ≤ ε_k, and it follows that s_{k+1} ∈ ∂_{ε_k} f(x) for each k.
If ε_k → +∞ when k → +∞, then f(x_k) → −∞. Otherwise, {ε_k} is bounded, and {s_{k+1}} is therefore bounded as well. In view of (3.2.7), Lemma IX.2.1.1 applies: the algorithm must stop at some iteration k. At this k,

    s_i ∈ ∂_{e_i} f(x) ⊂ ∂_{ε_k} f(x) for i = 1, ..., k,

hence −d_k ∈ ∂_{ε_k} f(x). Using (3.2.9), we have for all y ∈ ℝ^n:

    f(y) ≥ f(x) − ⟨d_k, y − x⟩ − ε_k = f(x_k) − ⟨d_k, y − x⟩ − η.        □

From a practical point of view, this algorithm is of course not appropriate: it is not a sound idea to start each line-search from the same initial x. Think of a ballistic comparison, where each direction is a gun: to shoot towards a target x̄, it should make sense to place the gun as close as possible to x̄ - and not at the very first iterate, probably the worst of all! This is reflected in the minimality condition: write (3.2.8) with an optimal y = x̄ (the only interesting y, actually); admitting that ‖x̄ − x‖ is unduly large, an unduly small value of δ - i.e. of ‖d_k‖ - is required to obtain any useful information about the optimality of x_k. Remembering our analysis in §IX.2.2, this means an unduly large number of iterations.
The first inequality in (3.2.7) would suffice to let the algorithm run. The reason for the second inequality is to illustrate another aspect of the approach. Consider again the value ε̄ = f(x) − f̄ of (3.1.1). Obviously, in Algorithm 3.2.2,

    e_k ≤ ε̄ + η

for each k. Coming back to the problem of separating ∂_ε̄ f(x) =: S and {0}, we see that Algorithm 3.2.2 is the second form of Algorithm IX.3.3.1 that was used for Figs. IX.3.3.1 and IX.3.3.2: for each direction d, its black box generates s₊ ∈ S with ⟨s₊, d⟩ ≥ 0 (instead of computing the support function). It is therefore a convenient instrument for illustrating the point made at the end of §IX.3.3.
XIV. Dynamic Construction of Approximate Subdifferentials: Dual Form of Bundle Methods

Prerequisites. Basic concepts of numerical optimization (Chap. II); descent principles for nonsmooth minimization (Chaps. IX and XIII); definition and elementary properties of approximate subdifferentials (Chap. XI); and to a lesser extent: minimality conditions for simple minimization problems, and elements of duality theory (Chaps. VII and XII).

1 Introduction: The Bundle of Information

1.1 Motivation

The methods of ε-descent, which were the subject of Chap. XIII, present a number of deficiencies from the numerical point of view. Their rationale itself - to decrease the objective function by ε at each iteration - is suspect: the fact that ε can (theoretically) be chosen in advance, without any regard for the real behaviour of f, must have its price, one way or another.

(a) The Choice of ε. At first glance, an attractive idea in the algorithms of Chap. XIII is to choose ε fairly large, so as to reduce the number of outer iterations, indexed by p. Then, of course, a counterpart is that the complexity of each such iteration must be expected to increase: k will grow larger for fixed p. Let us examine this point in more detail.
Consider one single iteration, say p = 1, in the schematic algorithm of ε-descent XIII.1.3.1. Set the stopping tolerance δ = 0 and suppose for simplicity that q_ε (or r_ε) can be minimized exactly at each subiteration k. In a word, consider the following algorithm:

Algorithm 1.1.1 Start from x = x^1 ∈ ℝ^n. Choose ε > 0.

STEP 0. Compute s_1 ∈ ∂f(x), set k = 1.
STEP 1. Compute d_k = −Proj 0/{s_1, ..., s_k}. If d_k = 0 stop.
STEP 2. By a search along d_k issuing from x, minimize q_ε of (XIII.1.2.2) to obtain t_k > 0, y_{k+1} = x + t_k d_k and s_{k+1} ∈ ∂f(y_{k+1}) such that

    s_{k+1} ∈ ∂_ε f(x) and ⟨s_{k+1}, d_k⟩ ≥ 0.        (1.1.1)

STEP 3. If f(y_{k+1}) ≥ f(x) − ε, replace k by k + 1 and loop to Step 1. Otherwise stop.        □

This algorithm has essentially two possible outputs:
- Either a stop occurs in Step 3 (which plays the role of updating p in Step 2 of Algorithm XIII.1.3.1). Then one has on hand y (= y_{k+1}) with f(y) < f(x) − ε.
- Or Step 3 loops to Step 1 forever. It is a result of the theory that {d_k} then tends to 0, the key properties being that ⟨s_{k+1}, d_k⟩ ≥ 0 for all k, and that {s_k} is bounded. This case therefore means that the starting x minimizes f within ε. The same comment applies to the extraordinary case of a stop via d_k = 0 in Step 1.
Consider Algorithm 1.1.1 as a mapping ε ↦ f̂(ε), where f̂(ε) is the best f-value obtained by the algorithm; in other words, in case of a stop in Step 3, f̂(ε) is simply the last computed value f(y_{k+1}) (then f̂(ε) < f(x) − ε); otherwise,

    f̂(ε) = inf {f(y_k) : k = 1, 2, ...}.

With f̄ := inf_{ℝ^n} f, consider the "maximal useful" value for ε:

    ε̄ := f(x) − f̄.

- For ε < ε̄, 0 ∉ ∂_ε f(x) and {d_k} cannot tend to 0; the iterations must terminate with a success: there are finitely many loops from Step 3 to Step 1. In other words,

    ε < ε̄ ⟹ f̄ = f(x) − ε̄ ≤ f̂(ε) < f(x) − ε,        (1.1.2)

which implies in particular

    f̂(ε) → f̄ if ε ↑ ε̄.

- For ε ≥ ε̄, 0 ∈ ∂_ε f(x) and Algorithm 1.1.1 cannot stop in Step 3: it has to loop between Step 1 and Step 3 - or stop in Step 1. In fact, we obtain for ε = ε̄ nothing but (a simplified form of) the algorithm of §XIII.3.1. As mentioned there, {y_k} need not be a minimizing sequence; see more particularly Remark XIII.3.1.3. In other words, it may well be that f̂(ε̄) > f̄. Altogether, we see that the function f̂ has no reason to be (left-)continuous at ε̄.
Conclusion: it is hard to believe that (1.1.2) holds in practice, because when numerical calculations are involved, the discontinuity of f̂ is likely not to be exactly at ε̄. When ε is close to ε̄, f̂(ε) is likely to be far from f̄, just as it is when ε = ε̄.

Remark 1.1.2 It is interesting to see what really happens numerically. Toward this end, take again the examples MAXQUAD of VIII.3.3.3 and TR48 of IX.2.2.6. To each of them, apply Algorithm 1.1.1 with ε = ε̄, and Algorithm XIII.3.2.2. With the stopping tolerance δ set to 0, the algorithms should run forever. Actually, a stop occurs in each case, due to some failure: in the quadratic program computing the direction, or in the line-search. We call again f̂ the best f-value obtained when this stop occurs, with either of the two algorithms.
Table 1.1.1 summarizes the results: for each of the four experiments, it gives the number of iterations, the total number of calls to the black box (U1) of Fig. II.1.2.1, and the improvement of the final f-value obtained, relative to the initial value, i.e. the ratio

    (f̂ − f̄) / (f(x) − f̄).
Observe that the theory is somewhat contradicted: if one wanted to compare Algorithm 1.1.1 (supposedly non-convergent) and Algorithm XIII.3.2.2 (proved to converge), one should say that the former is better. We add that each test in Table 1.1.1 was run several times, with various initial x; all the results were qualitatively the same, as stated in the table. The lesson is that one must be modest when applying mathematics to computers: proper assessment of an algorithm implies an experimental study, in addition to establishing its theoretical properties; see also Remark II.2.4.4.

Table 1.1.1. What is a convergent algorithm?

                        MAXQUAD                       TR48
                        iter.  calls  improvement    iter.  calls  improvement
Algorithm 1.1.1          87     718    4·10^-4        141    892    3·10^-2
Algorithm XIII.3.2.2    182    1000    2·10^-3        684   1897    4·10^-2

The apparently good behaviour of Algorithm 1.1.1 can be explained by the rather slow convergence of {d_k} to 0. The algorithm takes advantage of its large number of iterations to explore the space around the starting x, and thus locates the optimum more or less accurately. By contrast, the early stop occurring in the 2-dimensional example of Remark XIII.3.1.3 gives the algorithm no chance.        □

Knowing that a large ε is not necessarily safe, the next temptation is to take ε small. Then it is hardly necessary to mention the associated danger: the direction may become that of steepest descent, which has such a bad reputation, even for smooth functions (remember the end of §II.2.2). In fact, if f is smooth (or mildly kinky), every anti-subgradient is a descent direction. Thus, if ε is really small, the ε-steepest-descent Algorithm XIII.1.3.1 will simply reduce to

    x^{p+1} = x^p − t_p s(x^p),        (1.1.3)

at least until x^p becomes nearly optimal. Only then will the bundling mechanism enter into play.
The delicate choice of ε (not too large, not too small ...) is a major drawback of methods of ε-descent.

(b) The Role of the Line-Search. In terms of diminishing f, the weakness of the direction is even aggravated by the choice of the stepsize, supposed to minimize over ℝ₊ not the function t ↦ f(x^p + td_k) but the perturbed difference quotient q_ε.
Firstly, trying to minimize q_ε along a given direction is not a sound idea: after all, it is f that we would like to reduce. Yet the two functions q_ε and t ↦ f(x + td) may have little to do with each other. Figure 1.1.1 shows r_ε of Lemma XIII.1.2.2 for two extreme values of ε: on the left [right] part of the picture, ε is too small [too large]: the stepsize sought by the ε-descent Algorithm XIII.1.3.1 is too small [too large], and in both cases, f cannot be decreased efficiently.

Fig. 1.1.1. Inefficient stepsizes in ε-descent methods

The second difficulty with the line-search is that, when d_k is not an ε-descent direction, the minimization of q_ε must be carried out rather accurately. In Table 1.1.1, the number of calls to the black box (U1) per iteration is in the range 5-10; recall that, in the smooth case, a line-search typically takes less than two calls per iteration. Even taking into account the increased difficulty when passing to a nonsmooth function, the present ratio seems hardly acceptable.
Still concerning the line-search, we mention one more problem: an initial stepsize is hard to guess. With the methods of Chap. II, it was possible to initialize t by estimating the decrease of the line-search function t ↦ f(x + td) (Remark II.3.4.2). The behaviour of this function after changing x and d, i.e. after completing an iteration, could be reasonably predicted. Here the function of interest is q_ε of (XIII.1.2.2), which behaves wildly when x and d vary; among other things, it is infinite for t = 0, i.e. at x.

Remark 1.1.3 The difficulties illustrated by Fig. 1.1.1 are avoided by a further variant of §XIII.3.2: a descent test can be inserted in Step 4 of Algorithm XIII.3.2.2, updating x when f(x + t_k d_k) < f(x) − ε. Here ε > 0 is again chosen in advance, so this variant does not facilitate the choice of a suitable ε, nor of a suitable initial stepsize. Besides, the one-dimensional minimization of f must again be performed rather accurately.        □

(c) Clumsy Use of the Information. Our last item in the list of deficiencies of ε-descent methods lies in their structure itself. Consider the two possible actions after terminating an iteration (p, k).
(a_ε) In one case (ε-descent obtained), x^p is moved and all the past is forgotten. The polyhedron S_k is reinitialized to a singleton, which is the poorest possible approximation of ∂_ε f(x^{p+1}). This "Markovian" character is unfortunate: classically, all efficient optimization algorithms accumulate and exploit information about f from previous iterations. This is done, for example, by the conjugate-gradient and quasi-Newton methods of §II.2.
(b_ε) In the second case (S_k enriched), x^p is left as it is, and the information is now correctly accumulated. Another argument appears, however: the real problem of interest, namely to decrease f, is somewhat overlooked. Observe for example that no f-value is used to compute the directions d_k; more importantly, the iterate x^p + t_k d_k produced by the line-search is discarded, even if it happens to be (possibly much) better than the current x^p.

In fact, the above argument is rather serious, although subtle, and deserves some explanation. The bundling idea is based on a certain separation process, whose speed of convergence is questionable; see §IX.2.2. The (b_ε)-phase, which relies entirely and exclusively upon this process, has therefore a questionable speed of convergence as well. Yet the real problem, of minimizing a convex function f, is addressed by the (a_ε)-phase only. This sort of dissociation seems undesirable for performance.
In other words, there are two processes in the game: one is the convergence to 0 of some sequence of (approximate) subgradients ŝ_k = −d_k; the other is the convergence of {f(x^p)} to f̄. The first is theoretically essential - after all, it is the only way to get a minimality condition - but has fragile convergence qualities. One should therefore strive to facilitate its task with the help of the second process, by diminishing ‖d‖- and f-values in a harmonious way. The overall convergence will probably not be improved in theory; but the algorithm must be given a chance to have a better behaviour heuristically.

(d) Conclusion. Let us sum up this Section 1.1: ε-descent methods have a number of deficiencies, and the following points should be addressed when designing new minimization methods.
(i) They should rely less on ε. In particular, the test for moving the current x^p should be less crude than simply decreasing f by an a priori value ε.
(ii) Their line-search should be based on reducing the natural objective function f. In particular, the general and sound principles of §II.3 should be applied.
(iii) They should take advantage as much as possible of all the information accumulated during the successive calls to (U1), perhaps from the very beginning of the iterations.
(iv) Their convergence should not be based entirely on that of {d_k} to 0: they should take care as much as possible (on a heuristic basis) of the decrease of f to its minimal value.
In this chapter, we will introduce a method coping with (i) - (iii), and somehow with (iv) (to the extent that this last point can be quantified, i.e. very little). This method will retain the concept of keeping x^p fixed in some cases, in order to improve the approximation of ∂_ε f(x^p) for some ε. In fact, this technique is mandatory when designing methods for nonsmooth optimization based on a descent principle. The direction will again be computed by projecting the origin onto the current approximation of ∂_ε f(x^p), which will be substantially richer than in Chap. XIII. Also, the test for descent will be quite different: as in §II.3.2, it will require from f a definite relative decrease, rather than an absolute one. Finally, the information will be accumulated all along the iterations, and it is only for practical reasons, such as lack of memory, that this information will be discarded.

1.2 Constructing the Bundle of Information

Generally speaking, in most efficient minimization methods the direction is computed at the current iterate in view of the information on hand concerning f. The only source for such information is the black box (U1) characterizing the problem to be solved; the following definition is therefore relevant.
Definition 1.2.1 At the current iteration of a minimization algorithm, call

    {y_i ∈ ℝ^n : i = 1, ..., L}

the set of arguments at which (U1) has been called from the beginning. Then the raw bundle is the set of triples

    {(y_i ∈ ℝ^n, f_i := f(y_i), s_i := s(y_i) ∈ ∂f(y_i)) : i = 1, ..., L},        (1.2.1)

collecting all the information available.        □

In (1.2.1), L increases at each iteration by the number of cycles needed by the line-search. The raw bundle is therefore an abstract object (unless L can be bounded a priori): no computer is able to store this potentially infinite amount of information. In order to make it tractable, one must do selection and/or compression in the raw bundle.
- Selection means that only some designated triples (y, f, s)_i are kept, L being replaced by a smaller value, say ℓ, which is kept under control, so as to respect the memory available in the computer.
- Compression means a transformation of the triples, so as to extract their most relevant characteristics.

Definition 1.2.2 The actual bundle (or simply "bundle", if no confusion is possible) is the bank of information that is
- explicitly obtained from the raw bundle,
- actually stored in the computer,
- and used to compute the current direction.        □

There are several possibilities for selecting-compressing the raw bundle, depend-
ing on the role of the actual bundle in terms of the resulting minimization algorithm.
Let us illustrate this point.

Example 1.2.3 (Quasi-Newton Methods) If one wanted to relate the present bundling concept to classical optimization methods, one could first consider the quasi-Newton methods of §II.2.3. There the aim of the bundle was to identify the second-order behaviour of f. The selection consisted of two steps:
(i) Only one triple (y, f, s) per iteration was "selected", namely (x_k, f(x_k), s_k); any intermediate information obtained during each line-search was discarded: the actual number ℓ of elements in the bundle was the number k of line-searches.
(ii) Only the couple (y, s) was kept from each triple (y, f, s): f was never used.
Thus, the actual bundle was "compressed" to

    {(x_i, s_i) : i = 1, ..., k},        (1.2.2)

with i indexing iterations, rather than calls to (U1). This bundle (1.2.2), made of 2nk real numbers, was not explicitly stored. Rather, it was further "compressed" into a symmetric n × n matrix - although, for 2nk ≤ ½n(n + 1), this could hardly be called a compression! - whose aim was to approximate (∇²f)⁻¹.
In summary, the actual bundle was a set of ½n(n + 1) numbers, gathering the information contained in (1.2.2), and computed by recurrence formulae such as that of Broyden-Fletcher-Goldfarb-Shanno (II.2.3.9).        □

The (i)-part of the selection in Example 1.2.3 is rather general in optimization methods. In practice, L in (1.2.1) is just the current number k of iterations, and i indexes line-searches rather than calls to (U1). In most optimization algorithms, the intermediate (f, s) computed during the line-searches are used only to update the stepsize, but not to compute subsequent directions.

Example 1.2.4 (Conjugate-Gradient Methods) Still along the same idea, consider now the conjugate-gradient method of §II.2.4. We have seen that its original motivation, based on the algebra of quadratic functions, was not completely clear. Rather, it could be interpreted as a way of smoothing out the sequence {d_k} of successive directions (remember Fig. II.2.4.1).
There, we had a selection similar to (i) of Example 1.2.3; indeed the direction d_k was defined in terms of

    {s_i : i = 1, ..., k}.        (1.2.3)

As for the compression, it was based on formulae - say of Polak-Ribière (II.2.4.8) - enabling the computation of d_k using only s_k and d_{k−1}, instead of the full set (1.2.3). In other words, the actual bundle was {d_{k−1}, s_k}, made up of two vectors.        □

The above two examples are slightly artificial, in that the bundling concept is a fairly diverted way of introducing the straightforward calculations for quasi-Newton and conjugate-gradient methods. However, they have the merit of stressing the importance of the bundling concept: virtually all efficient minimization methods use it, one way or another. On the other hand, the cutting-plane algorithm of §XII.4.2 is an instance of an approach in which the bundling idea is blatant. Actually, that example is even too simple: no compression and no selection can be made; the raw bundle and the actual bundle are there identical.

Example 1.2.5 (ε-Descent Methods) In Chap. XIII, the bundle was used to identify ∂_ε f at the current iterate. To disclose the selection-compression mechanism, let us use the notation (p, k) of that chapter.
First of all, the directions depended only on s-values. When computing the current direction d_k issued from the current iterate x^p, one had on hand the set of subgradients (the raw bundle, in a way)

    {s^1_1, s^1_2, ..., s^1_{k₁}, s^2_1, s^2_2, ..., s^2_{k₂}, ..., s^p_1, ..., s^p_k},

where k_q, for q = 1, ..., p − 1, was the total number of line-searches needed by the q-th outer iteration (knowing that the p-th is not complete yet).
In the selection, all the information collected from the previous descent-iterations (those having q < p) was discarded, as well as all f- and y-values. Thus, the actual bundle was essentially

    {s^p_1, ..., s^p_k}.        (1.2.4)

Furthermore, there was a compression mechanism, to cope with possibly large values of k. If necessary, an arbitrary subset in (1.2.4) was replaced by the convex combination

    ŝ := Σ_j α_j s^p_j = −d,        (1.2.5)

just computed in the previous iteration (remember Algorithm XIII.2.3.1). In a way, the aggregate subgradient (1.2.5) could be considered as synthesizing the most essential information from (1.2.4).        □

The actual bundle in Example 1.2.5, with its s_k's and ŝ_k's, is not so easy to describe with explicit formulae; a clearer view is obtained from inspecting step by step its evolution along the iterations. For this, we proceed to rewrite the schematic ε-descent Algorithm XIII.1.3.1 in a condensed form, with emphasis on this evolution. Furthermore, we give up the (p, k)-notation of Chap. XIII, to the advantage of the "standard" notation, used in Chap. II. Thus, the single index k will denote the number of iterations done from the very beginning of the algorithm. Accordingly, x_k will be the current iterate, at which the stopping criterion is checked, the direction d_k is computed, and the line-search along d_k is performed.
In terms of the methods of Chap. XIII, this change of notation implies a substantial change of philosophy: there is only one line-search per iteration (whereas a would-be outer iteration was a sequence of line-searches); and this line-search has two possible exits, corresponding to the two cases (a_ε) and (b_ε).
Call y₊ the final point tried by the current line-search: it is of the form y₊ = x_k + td_k, for some stepsize t > 0. The two possible exits are:
(a_ε) When f has decreased by ε, the current iterate x_k is updated to x_{k+1} = y₊ = x_k + td_k. The situation is rather similar to the general scheme of Chap. II and this is called a descent-iteration, or a descent-step. As for the bundle, it is totally reset to {s_{k+1}} = {s(x_{k+1})} ⊂ ∂_ε f(x_{k+1}). There is no need to use an additional index p.
(b_ε) When a new line-search must be performed from the same x, the trick is to set x_{k+1} = x_k, disregarding y₊. Here, we have a null-iteration, or a null-step. This is the new feature with respect to the algorithms of Chap. II: the current iterate is not changed, only the next direction is going to be changed.
With this notation borrowed from classical optimization (i.e. of smooth functions), Algorithm XIII.1.3.1 can now be described as follows. We neglect the irrelevant details of the line-search, but we keep the description of a possible deletion-compression mechanism.

Algorithm 1.2.6 (Algorithm of ε-Descent, Standard Notation) The starting point x_1 ∈ ℝ^n is given; compute s_1 ∈ ∂f(x_1). Choose the descent criterion ε > 0, the convergence parameter δ > 0, the tolerance for the line-search m′ ∈ ]0, 1[ and the maximal bundle size ℓ̄ ≥ 2. Initialize the iteration index k = 1, the descent index k₀ = 0, and the bundle size ℓ = 1.
STEP 1 (direction-finding and stopping criterion). Solve the minimization problem in the variable α ∈ ℝ^ℓ

    min {½‖Σ_{i=k₀+1}^{k₀+ℓ} α_i s_i‖² : α_i ≥ 0, Σ_{i=k₀+1}^{k₀+ℓ} α_i = 1}

and set d_k := −Σ_{i=k₀+1}^{k₀+ℓ} α_i s_i. If ‖d_k‖ ≤ δ stop.
STEP 2 (line-search). Minimize r_ε along d_k accurately enough to obtain a stepsize t > 0 and a new iterate y₊ = x_k + td_k with s₊ ∈ ∂f(y₊) such that
- either f(y₊) < f(x_k) − ε; then go to Step 3;
- or ⟨s₊, d_k⟩ ≥ −m′‖d_k‖² and s₊ ∈ ∂_ε f(x_k); then go to Step 4.
STEP 3 (descent-step). Set x_{k+1} = y₊, k₀ = k, ℓ = 1; replace k by k + 1 and loop to Step 1.
STEP 4 (null-step). Set x_{k+1} = x_k, replace k by k + 1. If ℓ < ℓ̄, set s_{k₀+ℓ+1} = s₊ from Step 2, replace ℓ by ℓ + 1 and loop to Step 1. Else go to Step 5.
STEP 5 (compression). Delete at least two vectors from the list {s_{k₀+1}, ..., s_{k₀+ℓ}}. With ℓ denoting the number of elements in the new list, set s_{k₀+ℓ+1} = −d_k from Step 1, set s_{k₀+ℓ+2} = s₊ from Step 2; replace ℓ by ℓ + 2 and loop to Step 1.        □

In a way, we obtain something simpler than Algorithm XIII.1.3.1. A reader familiar with iterative algorithms and computer programming may even observe that the index k can be dropped: at each iteration, the old (x, d) can be overwritten; the only important indices are the current bundle size ℓ, and the index k₀ of the last descent-iteration.
Observe the evolution of the bundle: it is totally refreshed after a descent-iteration (Step 3); it grows by one element at each null-iteration (Step 4), and it is compressed when necessary (Step 5). Indeed, the relative complexity of this flow-chart is mainly due to this compression mechanism. If it were neglected, the algorithm would become fairly simple: Step 5 would disappear and the number ℓ of elements in the bundle would just become k − k₀.
Recall that the compression in Step 5 is not fundamental, but is inserted for the sake of implementability only. By contrast, Definition 1.2.7 below will give a different type of compression, which is attached to the conception of the method itself. The mechanism of Step 5 will therefore be more suitably called aggregation, while "compression" has a more fundamental meaning, to be seen in (1.2.9).
Examining the e-descent method in its form 1.2.6 suggests a weak point, not so
visible in Chap. XlII: why, after all, should the actual bundle be reset entirely at k o?
Some of the elements {sJ, ... ,Sko} might well be in Oe!(Xk). Then why not use them
to compute dk? The answer, quite simple, is contained in the transportation formula
of Proposition Xl.4.2.2, extensively used in Chap. XIII.
Consider the number
$$e(x, y, s) := f(x) - [f(y) + \langle s, x - y\rangle] \quad\text{for all } x, y, s \text{ in } \mathbb{R}^n; \tag{1.2.6}$$
we recall that, if $s \in \partial f(y)$, then $e(x, y, s) \geq 0$ and $s$ is an $e(x, y, s)$-subgradient at $x$ (this is just what the transportation formula says). The converse is also true:
$$\text{for } s \in \partial f(y),\quad s \in \partial_\varepsilon f(x) \iff e(x, y, s) \leq \varepsilon. \tag{1.2.7}$$

Note also that $s$ returned from (U1) at $y$ can be considered as a function of $y$ alone (remember Remark VIII.3.5.1); so the notation $e(x, y)$ could be preferred to (1.2.6). An important point here is that, for given $x$ and $y$, this $e(x, y)$ can be explicitly computed and stored in the computer whenever $f(x)$, $f(y)$ and $s(y)$ are known. Thus, we have for example a straightforward test to detect whether a given $s$ in the bundle can be used to compute the direction. Accordingly, we will use the following notation.
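
This test is trivial to implement. As an illustration only (not part of the method's statement), here is a minimal Python sketch, assuming the black box (U1) is available through a callable `f` and that subgradients are handled as numpy vectors:

```python
import numpy as np

def linearization_error(f, x, y, s):
    """e(x, y, s) of (1.2.6); nonnegative whenever s is a subgradient at y."""
    return f(x) - (f(y) + np.dot(s, x - y))

def usable(f, x, y, s, eps):
    """Test (1.2.7): is s an eps-subgradient of f at x?"""
    return linearization_error(f, x, y, s) <= eps

# quick check on f(z) = ||z||^2, where e(x, y, 2y) = ||x - y||^2
f = lambda z: float(np.dot(z, z))
x, y = np.array([1.0, 0.0]), np.array([2.0, -1.0])
print(linearization_error(f, x, y, 2 * y))   # 2.0 = ||x - y||^2
```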

Definition 1.2.7 For each $i$ in the raw bundle (1.2.1), the linearization error at the iterate $x_k$ is the nonnegative number
$$e_i^k := e(x_k, y_i, s_i) = f(x_k) - f(y_i) - \langle s_i, x_k - y_i\rangle. \tag{1.2.8}$$
The abbreviated notation $e_i$ will be used for $e_i^k$ whenever convenient. $\square$

Only the vectors $s_i$ and the scalars $e_i$ are needed by the minimization method of the present chapter. In other words, we are interested in the set
$$\{(s_i, e_i) : s_i \in \partial_{e_i} f(x_k),\ i = 1, \dots, \ell\}, \tag{1.2.9}$$
and this is the actual bundle. Note: to be really accurate, the term "actual" should be reserved for the resulting bundle after aggregation of Step 5 in Algorithm 1.2.6; but we neglect this question of aggregation for the moment.

It is important to note here that the vectors $y_i$ have been eliminated in (1.2.9); the memory needed to store (1.2.9) is $k(n+1)$, instead of $k(2n+1)$ for (1.2.1) (with $\ell = k$): we really have a compression here, and the information contained in (1.2.9) is definitely poorer than that in (1.2.1). However, (1.2.8) suggests that each $e_i$ must be recomputed at each descent iteration; so, can we really maintain the bundle (1.2.9) without storing also the $y$'s? The answer is yes, thanks to the following simple result.

Proposition 1.2.8 For any two iterates $k$ and $k'$ and sampling point $y_i$, there holds
$$e_i^{k'} = e_i^k + f(x_{k'}) - f(x_k) - \langle s_i, x_{k'} - x_k\rangle. \tag{1.2.10}$$

PROOF. Just apply the definitions:
$$e_i^{k'} = f(x_{k'}) - f(y_i) - \langle s_i, x_{k'} - y_i\rangle, \qquad e_i^k = f(x_k) - f(y_i) - \langle s_i, x_k - y_i\rangle,$$
and obtain the result by mere subtraction. $\square$

This result relies upon additivity of the linearization errors: for two different points $x$ and $x'$, the linearization error at $x'$ is the sum of the one at $x$ and of the error made when linearizing $f$ from $x$ to $x'$ (with the same slope). When performing a descent-step from $x_k$ to $x_{k+1} \neq x_k$, it suffices to update all the linearization errors via (1.2.10), written with $k' = k+1$.
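
In code, the update (1.2.10) is a single vectorized line. A sketch, under the assumed layout that the rows of a matrix `S` are the vectors $s_i$:

```python
import numpy as np

def shift_errors(e, S, f_old, f_new, x_old, x_new):
    """(1.2.10) with k' = k+1: update every linearization error when the
    iterate moves from x_old to x_new; the sampling points y_i are not needed."""
    return e + (f_new - f_old) - S @ (x_new - x_old)
```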

Remark 1.2.9 An equivalent way of storing the relevant information is to replace (1.2.9) by
$$\{(s_i, f^*(s_i)) : i = 1, \dots, \ell\}, \tag{1.2.11}$$
where we recognize in
$$f^*(s_i) := \langle s_i, y_i\rangle - f(y_i)$$
the value of the conjugate of $f$ at $s_i$ (Theorem X.1.4.1). Clearly enough, the linearization errors can then be recovered via
$$e_i = f(x_k) + f^*(s_i) - \langle s_i, x_k\rangle. \tag{1.2.12}$$

Compared to (1.2.8), formulae (1.2.11) and (1.2.12) have a more elegant and more symmetric appearance (which hides their nice geometric interpretation, though). Also they explain Proposition 1.2.8: $y_i$ is eliminated from the notation. From a practical point of view, there are arguments in favor of each form; we will keep the form (1.2.8). It is more suggestive, and it will be more useful when we need to consider aggregation of bundle-elements. $\square$
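
For completeness, the conversion between the two storage schemes can be sketched as follows (Python; the values $f^*(s_i)$ must be computed while the $y_i$ are still at hand, and the matrix layout — rows of `S` and `Y` are the $s_i$ and $y_i$ — is our assumption):

```python
import numpy as np

def conjugate_values(fy, Y, S):
    """f*(s_i) = <s_i, y_i> - f(y_i), one value per bundle element."""
    return np.einsum('ij,ij->i', S, Y) - fy

def errors_from_conjugates(fx, x, S, fstar):
    """(1.2.12): recover e_i = f(x) + f*(s_i) - <s_i, x> without the y_i."""
    return fx + fstar - S @ x
```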

Let us sum up this section. The methods considered in this chapter compute the direction $d_k$ by using the bundle (1.2.9) exclusively. This information is definitely poorer than the raw bundle (1.2.1), because the vectors $y_i$ are "compressed" to the scalars $e_i$. Actually this bundle can be considered just as a bunch of $e_i$-subgradients at $x_k$, and the question is: are these approximate subgradients good for computing $d_k$? Here our answer will be: use them to approximate $\partial_\varepsilon f(x_k)$ for some chosen $\varepsilon$, and then proceed essentially as in Chap. XIII.

2 Computing the Direction

2.1 The Quadratic Program

In this section, we show how to compute the direction on the basis of our development in §1. The current iterate $x_k$ is given, as well as a bundle $\{s_i, e_i\}$ described by (1.2.8), (1.2.9). Whenever possible, we will drop the iteration index $k$: in particular, the current iterate $x_k$ will be denoted by $x$. The bundle size will be denoted by $\ell$ (which is normally equal to $k$, at least if no aggregation has been done yet), and we recall that the unit simplex of $\mathbb{R}^\ell$ is
$$\Delta_\ell := \{\alpha \in \mathbb{R}^\ell : \textstyle\sum_{i=1}^\ell \alpha_i = 1,\ \alpha_i \geq 0 \text{ for } i = 1, \dots, \ell\}.$$

The starting idea for the algorithms in this chapter lies in the following simple result.

Proposition 2.1.1 With $s_i \in \partial f(y_i)$ for $i = 1, \dots, \ell$, $e_i$ defined by (1.2.8) at $x_k = x$, and $\alpha = (\alpha_1, \dots, \alpha_\ell) \in \Delta_\ell$, there holds
$$f(z) \geq f(x) + \Big\langle \textstyle\sum_{i=1}^\ell \alpha_i s_i,\, z - x\Big\rangle - \textstyle\sum_{i=1}^\ell \alpha_i e_i \quad\text{for all } z \in \mathbb{R}^n.$$

In particular, for $\varepsilon \geq \min_i e_i$, the set
$$S(\varepsilon) := \big\{s = \textstyle\sum_{i=1}^\ell \alpha_i s_i \;:\; \alpha \in \Delta_\ell,\ \textstyle\sum_{i=1}^\ell \alpha_i e_i \leq \varepsilon\big\} \tag{2.1.1}$$
is contained in $\partial_\varepsilon f(x)$.

PROOF. Write the subgradient inequalities
$$f(z) \geq f(y_i) + \langle s_i, z - y_i\rangle \quad\text{for } i = 1, \dots, \ell$$
in the form
$$f(z) \geq f(x) + \langle s_i, z - x\rangle - e_i \quad\text{for } i = 1, \dots, \ell$$
and obtain the results by convex combination. $\square$
We will make constant use of the following notation: for $\alpha \in \Delta_\ell$, we set
$$s(\alpha) := \sum_{i=1}^\ell \alpha_i s_i, \qquad e(\alpha) := \sum_{i=1}^\ell \alpha_i e_i, \tag{2.1.2}$$
so that (2.1.1) can be written in the more compact form
$$S(\varepsilon) = \{s(\alpha) : \alpha \in \Delta_\ell,\ e(\alpha) \leq \varepsilon\}.$$

The above result gives way to a strategy for computing the direction, merely copying the procedure of Chap. XIII. There, we had a set of subgradients, call them $\{s_1, \dots, s_\ell\}$, obtained via suitable line-searches; their convex hull was a compact convex polyhedron $S$ included in $\partial_\varepsilon f(x)$. The origin was then projected onto $S$, so as to obtain the best hyperplane separating $S$ from $\{0\}$. Here, Proposition 2.1.1 indicates that $S$ can be replaced by the slightly more elaborate $S(\varepsilon)$ of (2.1.1): we still obtain an inner approximation of $\partial_\varepsilon f(x)$ and, with this straightforward substitution, the above strategy can be reproduced.

In a word, all we have to do is project the origin onto $S(\varepsilon)$: we choose $\varepsilon \geq 0$ and we compute the particular $\varepsilon$-subgradient
$$\hat s := \operatorname{Proj} 0/S(\varepsilon) \tag{2.1.3}$$
to obtain the direction $d = -\hat s$. After a line-search along this $d$ is performed, we either move $x$ to a better $x^+$ (descent-step) or we stay at $x$ (null-step). A key feature, however, is that in both cases, the bundle size can be increased by one: as long as there is room in the computer, no selection has to be made from the raw bundle, which is thus enriched at every iteration (barring aggregation, which will be considered later). By contrast, the bundle size in the $\varepsilon$-descent Algorithm XIII.1.3.1 - or equivalently Algorithm 1.2.6 - was reset to 1 after each descent-step.

The decision "descent-step vs. null-step" will be studied in §3, together with the relevant algorithmic details. Here we focus our attention on the direction-finding problem, which is

$$\begin{array}{lll}
\min\ \tfrac12\big\|\sum_{i=1}^\ell \alpha_i s_i\big\|^2,\quad \alpha \in \mathbb{R}^\ell, & [\min\ \tfrac12\|s(\alpha)\|^2] & \text{(i)}\\[2pt]
\sum_{i=1}^\ell \alpha_i = 1,\ \alpha_i \geq 0 \text{ for } i = 1, \dots, \ell, & [\alpha \in \Delta_\ell] & \text{(ii)}\\[2pt]
\sum_{i=1}^\ell \alpha_i e_i \leq \varepsilon, & [e(\alpha) \leq \varepsilon] & \text{(iii)}
\end{array} \tag{2.1.4}$$

which gives birth to the direction
$$d = -\hat s := -\hat s_\varepsilon := -s(\alpha) \quad\text{for some } \alpha \text{ solving (2.1.4)}. \tag{2.1.5}$$

For a given bundle, the above direction-finding problem differs from the previous ones - (XIII.1.3.1), (IX.1.8), or (VIII.2.1.6) - by the additional constraint (2.1.4)(iii) only. The feasible set (2.1.4)(ii), (iii) is a compact convex polyhedron (possibly empty), namely a portion of the unit simplex $\Delta_\ell$, cut by a half-space. The set $S(\varepsilon)$ of (2.1.1) is its image under the linear transformation that, to $\alpha \in \mathbb{R}^\ell$, associates $s(\alpha) \in \mathbb{R}^n$ by (2.1.2). This confirms that $S(\varepsilon)$ too is a compact convex polyhedron (possibly empty).

Note also that the family $\{S(\varepsilon)\}_{\varepsilon \geq 0}$ is nested and has a maximal element, namely the convex hull of the available subgradients:
$$\varepsilon \leq \varepsilon' \implies S(\varepsilon) \subset S(\varepsilon') \subset S(\infty) = \operatorname{co}\{s_1, \dots, s_\ell\}.$$
Here $S(\infty)$ is a handy notation for "$S(\varepsilon)$ of a large enough $\varepsilon$, say $\varepsilon \geq \max_i e_i$".

Proposition 2.1.2 The direction-finding problem (2.1.4) has an optimal solution if and only if
$$\varepsilon \geq \min\{e_i : i = 1, \dots, \ell\}, \tag{2.1.6}$$
in which case the direction $d$ of (2.1.5) is well-defined, independently of the optimal $\alpha$.

PROOF. Because all $e_i$ and $\alpha_i$ are nonnegative, the constraints (2.1.4)(ii), (iii) are consistent if and only if (2.1.6) holds, in which case the feasible domain is nonempty and compact. The rest is classical. $\square$

In a way, (2.1.3) is "equivalent" to (2.1.4), and the optimal set of the latter is the polyhedron
$$A_\varepsilon := \{\alpha \in \Delta_\ell : e(\alpha) \leq \varepsilon,\ s(\alpha) = \hat s_\varepsilon\}. \tag{2.1.7}$$
Uniqueness of $\hat s_\varepsilon$ certainly does not imply that $A_\varepsilon$ is a singleton, though. In fact, consider the example with $n = 2$ and $\ell = 3$:
$$s_1 := (-1, 1),\quad s_2 := (0, 1),\quad s_3 := (1, 1); \qquad e_1 = e_2 = 0,\ e_3 = \varepsilon = 1.$$
It is easy to see that $\hat s = s_2$ and that the solutions of (2.1.4) are those $\alpha$ satisfying (2.1.4)(ii) together with $\alpha_1 = \alpha_3$. Observe, correspondingly, that the set $\{e(\alpha) : \alpha \in A_\varepsilon\}$ is not a singleton either, but the whole segment $[0, \varepsilon/2]$.
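
The direction-finding problem is a small convex QP and can be prototyped with any general-purpose solver. The sketch below uses scipy's SLSQP method (an assumption of ours — the text prescribes no particular solver) and reproduces the three-element example just given; a serious implementation would rather use a dedicated QP method, and would also recover the multiplier of constraint (iii), which SLSQP does not expose.

```python
import numpy as np
from scipy.optimize import minimize

def direction_finding(S, e, eps):
    """Solve (2.1.4): min 0.5*||s(alpha)||^2 over the unit simplex,
    subject to e(alpha) <= eps.  Rows of S are the subgradients s_i."""
    ell = len(e)
    obj = lambda a: 0.5 * np.dot(S.T @ a, S.T @ a)
    grad = lambda a: S @ (S.T @ a)                    # cf. (2.2.1) below
    cons = ({'type': 'eq',   'fun': lambda a: np.sum(a) - 1.0},
            {'type': 'ineq', 'fun': lambda a: eps - np.dot(e, a)})
    res = minimize(obj, np.full(ell, 1.0 / ell), jac=grad, method='SLSQP',
                   bounds=[(0.0, None)] * ell, constraints=cons)
    alpha = res.x
    s_hat = S.T @ alpha
    return alpha, s_hat, -s_hat      # optimal alpha, projection, direction d

# the example above: any solution has alpha_1 = alpha_3, and s_hat = s_2
S = np.array([[-1.0, 1.0], [0.0, 1.0], [1.0, 1.0]])
e = np.array([0.0, 0.0, 1.0])
alpha, s_hat, d = direction_finding(S, e, eps=1.0)
print(s_hat)                  # close to (0, 1) = s_2
print(alpha[0] - alpha[2])    # close to 0
```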

Remark 2.1.3 Usually, $\min_i e_i = 0$ in (2.1.6), i.e. the bundle does contain some element of $\partial f(x_k)$ - namely $s(x_k)$. Then the direction-finding problem has an optimal solution for any $\varepsilon \geq 0$.

It is even often true that there is only one such minimal $e_i$, i.e.:
$$\exists\, i_0 \in \{1, \dots, \ell\} \text{ such that } e_{i_0} = 0 \text{ and } e_i > 0 \text{ for } i \neq i_0.$$
This extra property is due to the fact that, usually, the only way to obtain a subgradient at a given $x$ (here $x_k$) is to call the black box (U1) at this very $x$. Such is the case for example when $f$ is strictly convex (Proposition VI.6.1.3):
$$f(y) > f(x) + \langle s, y - x\rangle \quad\text{for any } y \neq x \text{ and } s \in \partial f(x),$$
which implies in (1.2.6) that $e(x, y, s) > 0$ whenever $y \neq x$ and $s \in \partial f(y)$. On the other hand, the present extra property does not hold for some special objective functions, typically piecewise affine. $\square$

Our new direction-finding problem (2.1.4) generalizes those seen in the previous chapters:
- When $\varepsilon$ is large, say $\varepsilon \geq \max_i e_i$, there is no difference between $S(\varepsilon)$ and $S = S(\infty)$. The extra constraint (2.1.4)(iii) is inactive; we obtain nothing but the direction of the $\varepsilon$-descent Algorithm XIII.1.3.1.
- On the other hand, suppose $\varepsilon$ lies at its minimal value of (2.1.6), say $\varepsilon = 0$ in view of Remark 2.1.3. Then any feasible $\alpha$ in (2.1.4)(ii), (iii) has to satisfy
$$\alpha_i > 0 \implies e_i = \varepsilon\ [= \min\nolimits_j e_j = 0].$$
Denoting by $I_0$ the set of indices with $e_i = \varepsilon$, the direction-finding problem becomes
$$\min\{\tfrac12\|s(\alpha)\|^2 : \alpha \in \Delta_\ell,\ \alpha_i = 0 \text{ if } i \notin I_0\},$$
in which the extra constraint (2.1.4)(iii) has become redundant. In other words, the "minimal" choice $\varepsilon = \min_i e_i = 0$ disregards in the bundle the elements that are not in $\partial f(x_k)$; it is the direction $d_k$ of Chap. IX that is obtained - a long step backward.
- Between these two extremes (0 and $\infty$), intermediate choices of $\varepsilon$ allow some flexibility in the computation of the direction. When $\varepsilon$ describes $\mathbb{R}_+$ the normalized direction describes, on the unit sphere of $\mathbb{R}^n$, a curve having as endpoints the directions of Chaps. IX and XIII.

Each linearization error $e_i$ can also be considered as a weight, expressing how far the corresponding $s_i$ is from $\partial f(x_k)$. The extra constraint (2.1.4)(iii) gives a preference to those subgradients with a smaller weight: in a sense, they are closer to $\partial f(x_k)$. With this interpretation in mind, the weights in Chap. XIII were rather 0-1: 0 for those subgradients obtained before the last outer iteration, 1 for those appearing during the present $p$th iteration. Equivalently, each $s_i$ in Algorithm 1.2.6 was weighted by 0 for $i \leq k_0$, and by 1 for $i > k_0$.

An index $i$ will be called active if $\alpha_i > 0$ for some $\alpha$ solving (2.1.4). We will also speak of active subgradients $s_i$, and active weights $e_i$.

2.2 Minimality Conditions

We equip the $\alpha$-space $\mathbb{R}^\ell$ with the standard dot-product. Then consider the function $\alpha \mapsto \frac12\|s(\alpha)\|^2 =: v(\alpha)$ defined via (2.1.2). Observing that it is the composition
$$\mathbb{R}^\ell \ni \alpha \mapsto s = \sum_{i=1}^\ell \alpha_i s_i \in \mathbb{R}^n \mapsto \tfrac12\|s\|^2 = v(\alpha) \in \mathbb{R},$$
standard calculus gives its partial derivatives:
$$\frac{\partial v}{\partial \alpha_j}(\alpha) = \langle s_j, s(\alpha)\rangle \quad\text{for } j = 1, \dots, \ell. \tag{2.2.1}$$

Then it is not difficult to write down the minimality conditions for the direction-finding problem, based on the Lagrange function
$$L(\alpha, \lambda, \mu) := \tfrac12\|s(\alpha)\|^2 + \lambda\Big(\sum_{i=1}^\ell \alpha_i - 1\Big) + \mu[e(\alpha) - \varepsilon]. \tag{2.2.2}$$

Theorem 2.2.1 We use the notation (2.1.2); $\alpha$ is a solution of (2.1.4) if and only if: it satisfies the constraints (2.1.4)(ii), (iii), and there is $\mu \geq 0$ such that
$$\mu = 0 \text{ if } e(\alpha) < \varepsilon \quad\text{and}$$
$$\langle s_i, s(\alpha)\rangle + \mu e_i \geq \|s(\alpha)\|^2 + \mu\varepsilon \quad\text{for } i = 1, \dots, \ell,\ \text{with equality if } \alpha_i > 0. \tag{2.2.3}$$
Besides, the set of such $\mu$ is actually independent of the particular solution $\alpha$.

PROOF. Our convex minimization problem falls within the framework of §XII.5.3(c). All the constraints being affine, the weak Slater assumption holds and the Lagrangian (2.2.2) must be minimized with respect to $\alpha$ in the first orthant, for suitable values of $\lambda$ and $\mu$. Then, use (2.2.1) to compute the relevant derivatives and obtain the minimality conditions as in Example VII.1.1.6:
$$\mu[e(\alpha) - \varepsilon] = 0,$$
and, for $i = 1, \dots, \ell$,
$$\langle s_i, s(\alpha)\rangle + \mu e_i - \lambda \geq 0, \qquad \alpha_i[\langle s_i, s(\alpha)\rangle + \mu e_i - \lambda] = 0.$$
Add the last $\ell$ equalities to see that $\lambda = \|s(\alpha)\|^2 + \mu\varepsilon$, and recognize the Karush-Kuhn-Tucker conditions of Theorem VII.2.1.4; also, $\mu$ does not depend on $\alpha$ (Proposition VII.3.1.1). $\square$

Thus, the right-hand side in (2.2.3) is the multiplier $\lambda$ of the equality constraint in (2.1.4)(ii); it is an important number, and we will return to it later.

On the other hand, the extra constraint (2.1.4)(iii) is characteristic of the algorithm we have in mind; therefore, its multiplier $\mu$ is even more important than $\lambda$, and the rest of this section is devoted to its study. To single it out, we use the special Lagrangian
$$L_\mu(\alpha) := \tfrac12\|s(\alpha)\|^2 + \mu[e(\alpha) - \varepsilon].$$
It must be minimized with respect to $\alpha \in \Delta_\ell$, to obtain the closed convex dual function
$$\mathbb{R} \ni \mu \mapsto \theta(\mu) := -\min_{\alpha \in \Delta_\ell} L_\mu(\alpha). \tag{2.2.4}$$

Duality theory applies in a straightforward way: $L_\mu(\cdot)$ is a convex continuous function on the compact convex set $\Delta_\ell$. We know in advance that the set of dual solutions
$$M_\varepsilon = \operatorname{Argmin}\{\theta(\mu) : \mu \geq 0\}$$
is nonempty: it is actually the set of those $\mu$ described by Theorem 2.2.1 for some solution $\alpha \in A_\varepsilon$ of (2.1.7). Then the minimality conditions have another expression:

Theorem 2.2.2 We use the notation (2.1.2); $\alpha \in \Delta_\ell$ solves (2.1.4) if and only if, for some $\mu \geq 0$, it solves the minimization problem in (2.2.4) and satisfies
$$e(\alpha) \leq \varepsilon,\ \text{with equality if } \mu \text{ is actually positive}. \tag{2.2.5}$$

PROOF. Direct application of Theorem VII.4.5.1, or also of §XII.5.3(c). $\square$


We retain from this result that, to solve (2.1.4), one can equally solve (2.2.4)
for some JL ;;:: 0 (not known in advance!), and then check (2.2.5): if it holds, the ex
thus obtained is a desired solution, i.e. s(ex) =
se. The next result gives a converse
property:

Proposition 2.2.3 For given IL ;;:: 0, all optimal ex in (2.2.4) make up the same s(ex)
in (2.1.2). If IL > 0, they also make up the same e(ex).
PROOF. Let ex and fJ solve (2.2.4). From convexity, L/L(tex + (1 - t)fJ) is constantly
equal to B(JL) for all t E [0, 1], i.e.

t{ (s(fJ), s(ex) - s(fJ») + lL[e(ex) - e(fJ)]} + ~t2I1s(ex) - s(fJ) 112 = o.


The result follows by identifying to 0 the coefficients of t 2 and of t. o
Note: if we solve (2.2.5) with J-L = 0 and get a value e(a) > e, we cannot conclude that
s(a) =j: se,
because another optimal a might give (2.2.5); only if J-L > 0, can we safely make
this conclusion.
This point of view, in which J-L ~ 0 controls the value of e(a) coming out from the
direction-finding problem, will be a basis for the next chapter. We also recall Everett's Theo-
rem XII.2.I.I: if one solves (2.2.4) for some J-L ~ 0, thus obtaining some optimal aIL' this aIL
solves a version of (2.1.4) in which the right-hand side of the extra constraint (iii) has been
changed from e to the a posteriori value e(alL); this is a nonambiguous number, in view of
Proposition 2.2.3. With the notation (2.1.2) and (2.1.7), aIL e Ae(alL )'
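
Numerically, this dual point of view amounts to minimizing $L_\mu$ over the simplex for a trial $\mu$ and reading $e(\alpha_\mu)$ a posteriori, in the spirit of Everett's theorem. A rough sketch, with the same assumptions as before (scipy, rows of `S` are the $s_i$):

```python
import numpy as np
from scipy.optimize import minimize

def solve_for_mu(S, e, mu):
    """Minimize the special Lagrangian over the unit simplex for fixed mu
    (the constant -mu*eps term is dropped); by Proposition 2.2.3, s(alpha)
    is unambiguous, and e(alpha) as well whenever mu > 0."""
    ell = len(e)
    obj = lambda a: 0.5 * np.dot(S.T @ a, S.T @ a) + mu * np.dot(e, a)
    cons = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},)
    res = minimize(obj, np.full(ell, 1.0 / ell), method='SLSQP',
                   bounds=[(0.0, None)] * ell, constraints=cons)
    alpha = res.x
    return alpha, S.T @ alpha, float(np.dot(e, alpha))
```

By Everett's theorem, the $\alpha_\mu$ so obtained solves (2.1.4) with $\varepsilon$ replaced by the returned value $e(\alpha_\mu)$.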

Another consequence of duality theory is the following important interpretation of $M_\varepsilon$:

Proposition 2.2.4 The optimal value $v(\varepsilon) := \frac12\|\hat s_\varepsilon\|^2$ in (2.1.4) is a convex function of $\varepsilon$, and its subdifferential is the negative of $M_\varepsilon$:
$$\mu \in M_\varepsilon \iff -\mu \in \partial v(\varepsilon).$$

PROOF. This is a direct consequence of, for example, Theorem VII.3.3.2. $\square$

Being the subdifferential of a convex function, the multifunction $\varepsilon \mapsto -M_\varepsilon$ is monotone, in the sense that
$$0 \leq \varepsilon < \varepsilon' \implies \mu \geq \mu' \quad\text{for all } \mu \in M_\varepsilon \text{ and } \mu' \in M_{\varepsilon'}. \tag{2.2.6}$$
When $\varepsilon$ diminishes, the multipliers can but increase, but they remain bounded:

Proposition 2.2.5 With the notation (2.1.1), suppose $S(0) \neq \emptyset$ (i.e. $\min_i e_i = 0$). Then
$$\mu \leq \frac{2K}{\underline{e}} \quad\text{for all } \varepsilon > 0 \text{ and } \mu \in M_\varepsilon,$$
where
$$K := \max\{\|s_i\|^2 : i = 1, \dots, \ell\}, \qquad \underline{e} := \min\{e_i : e_i > 0\}.$$

PROOF. If $\mu = 0$, there is nothing to prove; if not, the optimal $e(\alpha)$ equals $\varepsilon$. Then take $\varepsilon \in\;]0, \underline{e}[$: there must be some $j$ with $\alpha_j > 0$ and $e_j \geq \underline{e} > \varepsilon$ (the $\alpha$'s sum up to 1). For this $j$, we can write
$$\langle s_j, \hat s\rangle + \mu e_j = \|\hat s\|^2 + \mu\varepsilon,$$
hence
$$\mu(\underline{e} - \varepsilon) \leq \|\hat s\|^2 - \langle s_j, \hat s\rangle \leq 2K.$$
Divide by $\underline{e} - \varepsilon > 0$: our majorization holds in a neighborhood of $0^+$. Because of (2.2.6), it holds for all $\varepsilon > 0$. $\square$

When $\varepsilon \to +\infty$, the extra constraint (2.1.4)(iii) eventually becomes inactive; $\mu$ reaches the value 0 and stays there. The threshold where this happens is
$$\bar\varepsilon := \min\{e(\alpha) : \alpha \in \Delta_\ell,\ s(\alpha) = \hat s_\infty\}, \tag{2.2.7}$$
knowing that $\hat s_\infty = \operatorname{Proj} 0/\operatorname{co}\{s_1, \dots, s_\ell\}$. The function $\varepsilon \mapsto \frac12\|\hat s_\varepsilon\|^2$ is minimal for $\varepsilon \in [\bar\varepsilon, +\infty[$ and we have
$$0 \in M_{\bar\varepsilon}; \qquad \varepsilon > \bar\varepsilon \implies M_\varepsilon = \{0\}; \qquad \varepsilon < \bar\varepsilon \implies 0 \notin M_\varepsilon.$$

Finally, an interesting question is whether $M_\varepsilon$ is a singleton.

Proposition 2.2.6 Suppose $\varepsilon$ is such that, for some $\alpha$ solving (2.1.4), there is a $j$ with $\alpha_j > 0$ and $e_j \neq \varepsilon$. Then $M_\varepsilon$ consists of the single element
$$\mu = \frac{\langle s_j, \hat s_\varepsilon\rangle - \|\hat s_\varepsilon\|^2}{\varepsilon - e_j}.$$

PROOF. Immediate: for the $j$ in question, we have
$$\langle s_j, \hat s_\varepsilon\rangle + \mu e_j = \|\hat s_\varepsilon\|^2 + \mu\varepsilon$$
and we can divide by $\varepsilon - e_j \neq 0$. $\square$

- The situation described by this last result cannot hold when $\varepsilon = 0$: all the active weights are zero, then. Indeed consider the function $v$ of Proposition 2.2.4; Theorem I.4.2.1 tells us that $M_0 = [\bar\mu, +\infty[$, where $\bar\mu = -D^+v(0)$ is the common limit for $\varepsilon \downarrow 0$ of all numbers $\mu \in M_\varepsilon$. In view of Proposition 2.2.5, $\bar\mu$ is a finite number.
- On the other hand, for $\varepsilon > 0$, it is usually the case - at least if the bundle is rich enough - that (2.1.4) produces at least two active subgradients, with corresponding weights bracketing $\varepsilon$; when this holds, Proposition 2.2.6 applies.
- Yet, this last property need not hold: as an illustration in one dimension, take the bundle with two elements
$$s_1 = 1,\ s_2 = 2; \qquad e_1 = 1,\ e_2 = 0; \qquad \varepsilon = 1.$$
Direct calculations show that the set of multipliers $\mu$ in Theorem 2.2.1 is $[0, 1]$.

Thus, the graph of the multifunction $\varepsilon \mapsto M_\varepsilon$ may contain vertical intervals other than $\{0\} \times M_0$. On the other hand, we mention that it cannot contain horizontal intervals other than $[\bar\varepsilon, +\infty[ \times \{0\}$:

Proposition 2.2.7 For $\varepsilon' < \varepsilon < \bar\varepsilon$ of (2.2.7), there holds
$$\mu < \mu' \quad\text{for all } \mu \in M_\varepsilon \text{ and } \mu' \in M_{\varepsilon'}.$$

PROOF (sketch). A value of $\varepsilon$ smaller than $\bar\varepsilon$ has positive multipliers $\mu$; from Proposition 2.2.3, the corresponding $e(\alpha)$ is uniquely determined. Said in other terms, the dual function $\mu \mapsto \theta(\mu)$ of (2.2.4) is differentiable for $\mu > 0$; its conjugate is therefore strictly convex on $]0, \bar\varepsilon[$ (Theorem X.4.1.3); but this conjugate is just the function $\varepsilon \mapsto \frac12\|\hat s_\varepsilon\|^2$ (Theorem XII.5.1.1); with Proposition 2.2.4, this means that the derivative $-\mu$ of $\varepsilon \mapsto \frac12\|\hat s_\varepsilon\|^2$ is strictly increasing. $\square$

In summary, the function $\varepsilon \mapsto \frac12\|\hat s_\varepsilon\|^2$ behaves as illustrated in Fig. 2.2.1 (which assumes $\min_i e_i = 0$), and has the following properties.
- It is a convex function with domain $[0, +\infty[$,
- with a finite right-slope at 0,
- ordinarily smooth for $\varepsilon > 0$,
- nowhere affine except for $\varepsilon \geq \bar\varepsilon$ of (2.2.7), where it is constant.
- We observe in particular that, because of convexity, the function is strictly decreasing on the segment $[0, \bar\varepsilon]$;
- and it reduces to a constant on $[0, +\infty[$ if $\bar\varepsilon = 0$, i.e. if $\hat s_\infty \in S(0)$.

Fig. 2.2.1. Typical graph of the squared norm of the direction

2.3 Directional Derivatives Estimates

It is of interest to return to basics and remember why (2.1.4) is relevant for computing the direction.
- The reason for introducing (2.1.4) was to copy the development of Chap. XIII, with $S(\varepsilon)$ of (2.1.1) replacing the would-be $S = S_k = S(\infty)$.
- The reason for Chap. XIII was to copy Chap. IX, with $\partial_\varepsilon f(x)$ replacing $\partial f(x)$. The two approximating polyhedra $S$ were essentially the same, as generated by calls to (U1); only their meaning was different: one was contained in $\partial_\varepsilon f(x)$, the other in $\partial f(x)$.
- Finally, Chap. IX was motivated by an implementation of the steepest-descent algorithm of Chap. VIII: the non-computable $\partial f(x)$ was approximated by $S$, directly available.

Now recall §VIII.1.1: what was needed there was a hyperplane separating $\{0\}$ and $\partial f(x)$ "best", i.e. minimizing $f'(x, \cdot)$ on some unit ball; we wanted
$$\min\{f'(x, d) : |||d||| = 1\} \tag{2.3.1}$$
and the theory of §VIII.1.2 told us that, due to positive homogeneity, this minimization problem was in some sense equivalent to
$$\min\{|||s|||_* : s \in \partial f(x)\}. \tag{2.3.2}$$

Retracing our steps along the above path, we obtain:
- Passing from Chap. VIII to Chap. IX amounts to replacing in (2.3.1) and (2.3.2) $\partial f(x)$ by $S$ and the corresponding support function $f'(x, \cdot)$ by $\sigma_S$. Furthermore, the norming is limited to $|||\cdot||| = |||\cdot|||_* = \|\cdot\|$.
- In Chap. XIII, $S$ and $\sigma_S$ become approximations of $\partial_\varepsilon f(x)$ and $f'_\varepsilon(x, \cdot)$ respectively.
- Finally the present chapter deals with $S(\varepsilon)$ and $\sigma_{S(\varepsilon)}$.

In summary, (2.1.4) is a realization of
$$\min\{|||s|||_* : s \in S(\varepsilon)\}, \tag{2.3.3}$$
which is in some sense equivalent to
$$\min\{\sigma_{S(\varepsilon)}(d) : |||d||| = 1\}, \tag{2.3.4}$$
knowing that we limit ourselves to the case $|||\cdot||| = |||\cdot|||_* = \|\cdot\|$.

Remark 2.3.1 We recall from §VIII.1.2 what the above "in some sense equivalent" means.

First of all, these problems are different indeed if the minimal value in (2.3.4) is nonnegative, i.e. if the optimal $\hat s$ in (2.3.3) is zero. Then $0 \in S(\varepsilon)$, and (2.1.4) produces no useful direction.

Second, the equivalence in question neglects the normalization. In (2.3.4), the right-hand side "1" of the constraint should be understood as any $K > 0$. The solution-set stays collinear with itself when $K$ varies; remember Proposition VIII.1.1.5.

Passing between the two problems involves the inverse multifunctions
$$(\mathbb{R}^n, |||\cdot|||) \ni d \mapsto \operatorname*{Argmax}_{|||s|||_* \leq 1}\langle s, d\rangle \subset (\mathbb{R}^n, |||\cdot|||_*)$$
and
$$(\mathbb{R}^n, |||\cdot|||_*) \ni s \mapsto \operatorname*{Argmax}_{|||d||| \leq 1}\langle s, d\rangle \subset (\mathbb{R}^n, |||\cdot|||).$$
This double mapping simplifies when $|||\cdot|||$ is a Euclidean norm, say $|||d|||^2 = \langle Qd, d\rangle$. Then the (unique) solutions $\hat s$ of (2.3.3) and $\hat d$ of (2.3.4) are linked by $\hat d = -Q^{-1}\hat s$ if we forget the normalization. In our present particular case of $|||\cdot||| = |||\cdot|||_* = \|\cdot\|$, this reduces to (2.1.5). We leave it as an exercise to copy §2.2 with an arbitrary norm. $\square$

The following relation comes directly from the equivalence between (2.3.3) and (2.3.4).

Proposition 2.3.2 With the notation (2.1.1), (2.1.5), there holds
$$\sigma_{S(\varepsilon)}(-\hat s) = -\|\hat s\|^2.$$

PROOF. This comes directly from (VIII.1.2.5), or can easily be checked via the minimality conditions. By definition, any $s \in S(\varepsilon)$ can be written $s = s(\alpha)$ with $\alpha$ feasible in (2.1.4)(ii), (iii); then (2.2.3) gives
$$\langle s(\alpha), -\hat s\rangle \leq -\|\hat s\|^2 + \mu[e(\alpha) - \varepsilon] \leq -\|\hat s\|^2,$$
so $\sigma_{S(\varepsilon)}(-\hat s) \leq -\|\hat s\|^2$; equality is obtained by taking $\alpha$ optimal, i.e. $s(\alpha) = \hat s$. $\square$

Thus, the optimal value in the direction-finding problem (2.1.4) readily gives an (under-)estimate of the corresponding $\varepsilon$-directional derivative. In the ideal case $\partial_\varepsilon f(x) = S(\varepsilon)$, we would have $f'_\varepsilon(x, -\hat s) = -\|\hat s\|^2$.
Proposition 2.1.1 brings directly the following inequality: with the notation (2.1.5),
$$f(z) \geq f(x) + \langle \hat s_\varepsilon, z - x\rangle - e(\alpha) \geq f(x) - \|\hat s_\varepsilon\|\,\|z - x\| - \varepsilon \tag{2.3.5}$$
for all $z \in \mathbb{R}^n$. This is useful for the stopping criterion: when $\|\hat s\|$ is small, $\varepsilon$ can be decreased, unless it is small as well, in which case $f$ can be considered as satisfactorily minimized. All this was seen already in Chap. XIII.

Remark 2.3.3 A relation with Remark VIII.1.3.7 can be established. Combining the bundle elements, we obtain for any $\alpha \in \Delta_\ell$ (not necessarily optimal) and $z \in \mathbb{R}^n$
$$f(z) \geq f(x) - \delta(\alpha)\|z - x\| - e(\alpha), \tag{2.3.6}$$
where we have set $\delta(\alpha) := \|s(\alpha)\|$. If $e(\alpha) \leq \varepsilon$, then $\delta(\alpha) \geq \|\hat s\|$ by definition of $\hat s$, and we can actually write (2.3.5). In other words, $\|\hat s\|$ gives the most accurate inequality (2.3.6) obtainable from the bundle, when $e(\alpha)$ is restricted not to exceed $\varepsilon$. $\square$

Having an interpretation of $\|\hat s\|^2$, which approximates the $\varepsilon$-directional derivative along the search direction $d = -\hat s$, we turn to the true directional derivative $f'(x, d)$. It is now the term $\|\hat s\|^2 + \mu\varepsilon$ that comes into play.

Proposition 2.3.4 Suppose that
(i) the minimal value in (2.1.6) is zero: $e_j = 0$ for some $j \leq \ell$,
(ii) the corresponding $\alpha_j$ is positive for some solution of (2.1.4).
Then, for any multiplier $\mu$ described by Theorem 2.2.1,
$$f'(x, -\hat s) \geq -\|\hat s\|^2 - \mu\varepsilon. \tag{2.3.7}$$
If, in addition, all the extreme points of $\partial f(x)$ appear among the $\ell$ elements of the bundle, then equality holds in (2.3.7).

PROOF. If $e_j = 0$, $s_j \in \partial f(x)$, so
$$f'(x, -\hat s) \geq \langle s_j, -\hat s\rangle,$$
and if $\alpha_j > 0$, we obtain (2.3.7) with Theorem 2.2.1.

Furthermore, by the definition (1.2.6) of the weights $e_i$, any subgradient at $x$ must have its corresponding $e = 0$. Then, from (2.2.3) written for each bundle element $(s_i, e_i = 0)$ extreme in $\partial f(x)$, we deduce
$$\langle s, -\hat s\rangle \leq -\|\hat s\|^2 - \mu\varepsilon \quad\text{for all } s \in \partial f(x),$$
and the second assertion follows from the first. $\square$


Comparing with Proposition 2.3.2, we see that the approximation of f'(x, d) is
more fragile than f~ (x, d). However, the assumptions necessary for equality in (2.3.7)
are rather likely to hold in practice: for example when f has at x a gradient Vf (x),
which appears explicitly in the bundle (a normal situation), and which turns out to
have a positive a in the composition of S.

Remark 2.3.5 In the general case, there is no guarantee whether -lIsll 2 -1Lf: is an over- or
under-estimate of f'(x, d): ifno subgradient at x is active in the direction-finding problem,
then (2.3.7) need not hold.
The assumptions (i), (ii) in Proposition 2.3.4 mean that of (x) n S(O) is nonempty and
contains an element which is active for s. The additional assumption means that of (x) c S (0);
and this implies equality of the two sets: indeed $\partial f(x) \supset S(0)$ because, as implied by Proposition XI.4.2.2, a subgradient $s$ at some $y$ with a positive weight $e(x, y, s)$ cannot be a subgradient at $x$. We emphasize: it is really $S(0)$ which is involved in the above relations, not $S(\varepsilon)$; for example, the property $S(\varepsilon) \cap \partial f(x) \neq \emptyset$ does not suffice for equality in (2.3.7). The following counter-example is instructive.

Take $\mathbb{R}^2 \ni (\xi, \eta) \mapsto f(\xi, \eta) = \xi^2 + \eta^2$ and $x = (1, 0)$. Suppose that the bundle contains just two elements, computed at $y_1 = (2, -1)$ and $y_2 = (-1, 2)$. In other words, $s_1 = (4, -2)$, $s_2 = (-2, 4)$. The reader can check that $e_1 = 2$, $e_2 = 8$. Finally take $\varepsilon = 4$; then it can be seen that
$$\nabla f(x) = (2, 0) = \tfrac13(2s_1 + s_2) \in S(\varepsilon)$$
(but $\nabla f(x) \notin S(0) = \emptyset$!), and Proposition 2.3.4 does not apply. The idea of the example is illustrated by Fig. 2.3.1: $\hat s$ is precisely $\nabla f(x)$, so $f'(x, -\hat s) = -\|\hat s\|^2$; yet, $\mu \neq 0$ (actually $\mu = 2$). Our estimate is therefore certainly not exact (indeed $f'(x, \hat s) = 4 < 12 = \|\hat s\|^2 + \mu\varepsilon$). $\square$

Fig. 2.3.1. A counter-example: $f'(x, d) \neq -\|\hat s\|^2 - \mu\varepsilon$
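
The numbers in this counter-example are easily checked by machine; a few lines of Python (nothing here beyond the data just given):

```python
import numpy as np

f = lambda z: float(np.dot(z, z))            # f(xi, eta) = xi^2 + eta^2
x = np.array([1.0, 0.0])
y1, y2 = np.array([2.0, -1.0]), np.array([-1.0, 2.0])
s1, s2 = 2 * y1, 2 * y2                      # gradients of f at y1 and y2
err = lambda y, s: f(x) - f(y) - np.dot(s, x - y)
print(err(y1, s1), err(y2, s2))              # 2.0 8.0: the weights e1, e2
s_hat = (2 * s1 + s2) / 3                    # alpha = (2/3, 1/3)
print(s_hat)                                 # [2. 0.] = grad f(x)
mu, eps = 2.0, 4.0
print(np.dot(s_hat, s_hat) + mu * eps)       # 12.0, while f'(x, s_hat) = 4
```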

2.4 The Role of the Cutting-Plane Function

Now we address the following question: is there a convex function having $S(\varepsilon)$ as approximate subdifferential? By construction, $d = -\hat s$ would be an $\varepsilon$-descent direction for it: we would be very happy if this function were close to $f$.

The answer lies in the following concept:

Definition 2.4.1 The cutting-plane function associated with the actual bundle is
$$\check f(z) := f(x) + \max_{i=1,\dots,\ell}\,[-e_i + \langle s_i, z - x\rangle] = \max_{i=1,\dots,\ell}\,[f(y_i) + \langle s_i, z - y_i\rangle] \quad\text{for all } z \in \mathbb{R}^n. \tag{2.4.1}$$
In the second expression above, $y_i$ is the point at which (U1) has computed $f(y_i)$ and $s_i = s(y_i) \in \partial f(y_i)$; the first expression singles out the current iterate $x = x_k$. The equivalence between these two forms can readily be checked from the definition (1.2.8) of the weights $e_i$; see again Example VI.3.4. $\square$
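
Both forms of (2.4.1) are immediate to evaluate. The following sketch (Python; the layout — rows of `S` are the $s_i$, rows of `Y` the $y_i$ — is our assumption) checks their equivalence on the function $f(z) = \|z\|^2$:

```python
import numpy as np

def cutting_plane_from_errors(z, fx, x, S, e):
    """First form of (2.4.1): f(x) + max_i [ -e_i + <s_i, z - x> ]."""
    return fx + np.max(S @ (z - x) - e)

def cutting_plane_from_points(z, fy, Y, S):
    """Second form of (2.4.1): max_i [ f(y_i) + <s_i, z - y_i> ]."""
    return np.max(fy + np.einsum('ij,ij->i', S, z - Y))

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3)); S = 2 * Y            # subgradients of ||.||^2
fy = np.sum(Y**2, axis=1)
x, z = rng.normal(size=3), rng.normal(size=3)
fx = float(np.dot(x, x))
e = fx - fy - np.einsum('ij,ij->i', S, x - Y)     # the weights (1.2.8)
print(np.isclose(cutting_plane_from_errors(z, fx, x, S, e),
                 cutting_plane_from_points(z, fy, Y, S)))   # True
```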

For $i = 1, \dots, \ell$ and $z \in \mathbb{R}^n$, call
$$f_i(z) := f(x) - e_i + \langle s_i, z - x\rangle = f(y_i) + \langle s_i, z - y_i\rangle \tag{2.4.2}$$
the linearization of $f$ at $y_i$ with slope $s_i \in \partial f(y_i)$; the subgradient inequality means $f_i \leq f$ for each $i$, so the cutting-plane function
$$\check f = \max\{f_i : i = 1, \dots, \ell\}$$
is a piecewise affine minorization of $f$ (the terminology "supporting plane" would be more appropriate than "cutting plane": (2.4.2) defines a hyperplane supporting $\operatorname{gr} f$).

Proposition 2.4.2 With $\check f$ defined by (2.4.1) and $S(\varepsilon)$ by (2.1.1), there holds for all $\varepsilon \geq 0$
$$\partial_\varepsilon \check f(x) = S(\varepsilon).$$

PROOF. Apply Example XI.3.5.3 to (2.4.1). $\square$


Naturally, the constant term f(x) could equally be dropped in (2.4.1) or (2.4.2),
ai
without changing e (x). There are actually two reasons for introducing it:
- First, it allows the nice second form, already seen in the cutting-plane algorithm
of §XII.4.2. Then we see the connection between our present approach and that of
the cutting-plane algorithm: when we compute se, we are prepared to decrease the
i
cutting-plane function by e, instead of performing a full minimization ofit.
i i
- Second, (x) = f (x) if minj ej = 0: can then be considered as an approximation
of f near x, to the extent that S(e) = aei(x) is an approximation of aef(x);
remember Theorem Xl.l.3.6, expressing the equivalence between the knowledge of
f andofe 1---+ aef(x).
The case of a set $M_\varepsilon$ not reduced to $\{0\}$ is of particular interest:

Theorem 2.4.3 Let $\mu > 0$ be a multiplier described by Theorem 2.2.1. Then
$$\check f\big(x - \tfrac1\mu \hat s_\varepsilon\big) = f(x) - \varepsilon - \tfrac1\mu\|\hat s_\varepsilon\|^2. \tag{2.4.3}$$
The above number is actually $f_i(x - \frac1\mu \hat s_\varepsilon)$, for each $i$ such that equality holds in (2.2.3).

PROOF. For $0 < \mu \in M_\varepsilon$, divide (2.2.3) by $\mu$ and change signs to obtain for $i = 1, \dots, \ell$ (remember that $s(\alpha)$, $\hat s$ and $\hat s_\varepsilon$ denote the same thing):
$$-\varepsilon - \tfrac1\mu\|\hat s\|^2 \geq -e_i + \big\langle s_i, -\tfrac1\mu\hat s\big\rangle = f_i\big(x - \tfrac1\mu\hat s\big) - f(x);$$
and equality in (2.2.3) results in an equality above. On the other hand, there is certainly at least one $i \in \{1, \dots, \ell\}$ such that equality holds in (2.2.3). Then the result follows from the definition of $\check f$. $\square$

We give two interpretations of the above result:
- For an arbitrary $\alpha \in \Delta_\ell$, consider the affine function based on the notation (2.1.2):
$$\mathbb{R}^n \ni z \mapsto f_\alpha(z) := f(x) - e(\alpha) + \langle s(\alpha), z - x\rangle \leq \check f(z),$$
where the last inequality comes directly by convex combination in (2.4.2) or (2.4.1). What (2.4.3) says is that, if $\alpha$ solves (2.1.4), then $f_\alpha(x - \frac1\mu\hat s_\varepsilon) = \check f(x - \frac1\mu\hat s_\varepsilon)$ for any multiplier $\mu > 0$ (if there is one; incidentally, $e(\alpha)$ is then $\varepsilon$). In other words, this "optimal" $f_\alpha$ gives one more support of $\operatorname{epi} f$: it even gives a support of $\operatorname{epi}\check f$ at $(x - \frac1\mu\hat s, \check f(x - \frac1\mu\hat s))$.
- This was a primal interpretation; in the dual space, combine Propositions 2.3.2 and 2.4.2: $\check f'_\varepsilon(x, -\hat s_\varepsilon) = -\|\hat s_\varepsilon\|^2$, and the value (2.4.3)
$$\check f\big(x - \tfrac1\mu\hat s_\varepsilon\big) = f(x) - \varepsilon + \tfrac1\mu\,\check f'_\varepsilon(x, -\hat s_\varepsilon)$$
actually comes from §XI.2: $1/\mu$ is the optimal $t$ in
$$\check f'_\varepsilon(x, -\hat s_\varepsilon) = \inf_{t>0}\frac{\check f(x - t\hat s_\varepsilon) - \check f(x) + \varepsilon}{t}.$$

Figure 2.4.1, drawn in the one-dimensional graph-space along the direction $d = -\hat s = -\hat s_\varepsilon$, illustrates the above considerations. We have chosen an example where $\min_j e_j = 0$ and $-\|\hat s\|^2 - \mu\varepsilon > f'(x, d)$.

Fig. 2.4.1. The cutting-planes point of view

With respect to the environment space $\mathbb{R}^n \times \mathbb{R}$, the direction $-\hat s$ is defined as follows: a hyperplane in $\mathbb{R}^n \times \mathbb{R}$ passing through $(x, f(x) - \varepsilon)$ is characterized by some $s \in \mathbb{R}^n$, and is the set of points $(z, r) \in \mathbb{R}^n \times \mathbb{R}$ satisfying
$$r = r(z) = f(x) - \varepsilon + \langle s, z - x\rangle.$$
Such a hyperplane supports $\operatorname{epi}\check f$ if and only if $s \in \partial_\varepsilon\check f(x) = S(\varepsilon)$. When a point in this hyperplane is moved so that its abscissa goes from $x$ to $x + h$, its altitude varies by $\langle s, h\rangle$. For a normalized $h$, this variation is minimal (i.e. most negative) and equals $-\|s\|$ if $h = -s/\|s\|$. In words, $-s$ is the acceleration vector of a drop of water rolling on this hyperplane, and subject to gravity only. Now, the direction $-\hat s$ of Fig. 2.4.1 defines, among all the hyperplanes supporting $\operatorname{gr}\check f$, the one with the smallest possible acceleration. This is the geometric counterpart of the analytic interpretation of Remark 2.3.3.

The thick line in Fig. 2.4.1 is the intersection of the vertical hyperplane containing the picture with the "optimal" hyperplane thus obtained, whose equation is
$$r = p(z) := f(x) - \varepsilon + \langle\hat s, z - x\rangle \quad\text{for all } z \in \mathbb{R}^n. \tag{2.4.4}$$
It represents an "aggregate" function $\check f_a$, whose graph also supports $\operatorname{gr}\check f$ (and $\operatorname{gr} f$), and which summarizes, in a way conditioned by $\varepsilon$, the linearizations $f_i$ of (2.4.2). Using (2.4.3), we see that
$$p\big(x - \tfrac1\mu\hat s\big) = f(x) - \varepsilon - \tfrac1\mu\|\hat s\|^2 = \check f\big(x - \tfrac1\mu\hat s\big).$$
Thus, $\operatorname{gr} p$ touches $\operatorname{epi}\check f$ at the point $(x - \frac1\mu\hat s, \check f(x - \frac1\mu\hat s))$; hence $\hat s \in \partial\check f(x - \frac1\mu\hat s)$. Note in passing that the stepsize $t = 1/\mu$ thus plays a special role along the direction $-\hat s$. It will even become crucial in the next chapter, where these relations will be seen more thoroughly.

Remark 2.4.4 Suppose $\mu > 0$. Each linearization $f_i$ in the bundle coincides with $f$ and with $\check f$ at the corresponding $y_i$. The aggregate linearization $p$ coincides with $\check f$ at $x - \frac1\mu\hat s$; could $\check f_a$ coincide with $f$ itself at some $y$? Note in Fig. 2.4.1 that $y = x - \frac1\mu\hat s$ would be such a coincidence point: $\hat s$ would be in $\partial f(x - \frac1\mu\hat s)$. In addition, for each $i$ such that $\alpha_i > 0$ at some solution of (2.1.4), $s_i$ would also be in $\partial f(x - \frac1\mu\hat s)$: this comes from the transportation formula of Proposition XI.4.2.2.

In Fig. 2.4.1, lift the graph of $p$ as high as possible subject to supporting $\operatorname{epi} f$. Analytically, this amounts to decreasing $\varepsilon$ in (2.4.4) to the minimal value, say $\hat e$, preserving the supporting property:
$$\hat e = \inf\{e : f(x) - e + \langle\hat s, z - x\rangle \leq f(z) \text{ for all } z \in \mathbb{R}^n\} = f(x) + f^*(\hat s) - \langle\hat s, x\rangle.$$
On the other hand, use (1.2.12): with $\alpha$ solving (2.1.4),
$$\varepsilon = \sum_{i=1}^\ell \alpha_i e_i = f(x) + \sum_{i=1}^\ell \alpha_i f^*(s_i) - \langle\hat s, x\rangle.$$
The gap between $\operatorname{gr}\check f_a$ and $\operatorname{epi} f$ is therefore
$$\varepsilon - \hat e = \sum_{i=1}^\ell \alpha_i f^*(s_i) - f^*(\hat s) \geq 0. \tag{2.4.5}$$
Another definition of $\hat e$ is that it is the smallest $e$ such that $\hat s \in \partial_e f(x)$ (transportation formula again). The aggregate bundle element $(\hat s, \varepsilon)$ thus appears as somehow corrupted, as compared to the original elements $(s_i, e_i)$, which are "sharp" because $s_i \notin \partial_e f(x)$ if $e < e_i$. Finally, the gap (2.4.5) also explains our comment at the end of Remark 1.2.9: the use of $f^*$ is clumsy for the aggregate element. $\square$

To conclude, remember Example X.3.4.2: $\check f$ is the smallest convex function compatible with the information contained in the bundle; if $g$ is a convex function satisfying
$$g(y_i) = f(y_i) \ \text{ and }\ s_i \in \partial g(y_i) \ \text{ for } i = 1, \dots, \ell,$$
then $g \geq \check f$.

For a function $g$ as above, $S(\varepsilon) = \partial_\varepsilon\check f(x) \subset \partial_\varepsilon g(x)$: in a sense, $S(\varepsilon)$ is the most pessimistic approximation of $\partial_\varepsilon f(x)$ - keeping in mind that we would like to stuff a convex polyhedron as large as possible into $\partial_\varepsilon f(x)$. However, $S(\varepsilon)$ is the largest possible set included for sure in $\partial_\varepsilon f(x)$, as we know from $f$ only the information contained in the bundle. In other words, if $s \notin S(\varepsilon)$ (for some $\varepsilon$), then we might also have $s \notin \partial_\varepsilon f(x)$ (for the same $\varepsilon$); such would be the case if, for example, $f$ turned out to be exactly $\check f$.

3 The Implementable Algorithm

The previous sections have laid down the basic ideas for constructing the algorithms
we have in mind. To complete our description, there remains to specify the line-search,
and to examine some implementation details.

3.1 Derivation of the Line-Search

In this section, we are given the current iterate $x = x_k$ and the current direction $d = d_k = -\hat s_\varepsilon$ obtained from (2.1.4). We are also given the current $\varepsilon > 0$ and the current multiplier $\mu \geq 0$ of the extra constraint (2.1.4)(iii). Our aim is to describe the line-search along $d$, which will produce a suitable $y^+ = x + td$ and its corresponding $s^+ \in \partial f(y^+)$ (note: we will often drop the subscript $k$ and use "+" for the subscript $k+1$).

The algorithms developed in the previous chapters have clearly revealed the double role of the line-search, which must
(a) either find a descent-step, yielding a $y^+$ better than the current $x$, so that the next iterate $x^+$ can be set to this $y^+$,
(b) or find a null-step, with no better $y^+$, but with an $s^+$ providing a useful enrichment of the approximate subdifferential of $f$ at $x$.

On the other hand, we announced in §1.2(b) that we want the line-search to follow as much as possible the principles exposed in §II.3. Accordingly, we must test each trial stepsize $t$, with three possible cases:
(O) this $t$ is suitable and the line-search can be stopped;
(R) this $t$ is not suitable and no suitable $t$ should be searched on its right;
(L) this $t$ is not suitable and no suitable $t$ should be searched on its left.

Naturally, the (O)-clause is itself double, reflecting the alternatives (a) - (b): we must define what suitable descent- and null-steps are, respectively. The knowledge of $f'(x, d)$, and possibly of $f'(y^+, d)$, was required for this test in §II.3 - where they were called $q'(0)$ and $q'(t)$. Here, both numbers are unknown, but $\langle s^+, d\rangle$ is a reasonable estimate of the latter; and what we need is a reasonable estimate of the former. Thus, we suppose that a negative number is on hand, playing the role of $f'(x, d)$. We will see later how it can be computed, based on §2.3. For the moment, it suffices to call it formally $\bar v < 0$.
A very first requirement for a descent-step is copied from (II.3.2.1): a coefficient $m \in\;]0, 1[$ is chosen and, if $t > 0$ does not satisfy the descent test
$$f(x + td) \leq f(x) + mt\bar v, \tag{3.1.1}$$
it is declared "too large"; then case (R) occurs. On the other hand, a "too small" stepsize will be more conveniently defined after we see what a null-step is.

According to the general principles of Chap. XIII, a null-step is useful when $f$ can hardly be decreased along $d$ because $S(\varepsilon)$ of (2.1.1) approximates $\partial_\varepsilon f(x)$ poorly. The role of the line-search is then to improve this approximation with a new element $(s^+, e^+)$ enriching the bundle (1.2.9). A new projection $\hat s^+$ will be computed, a new line-search will be performed from the same $x$, and here comes a crucial point: we must absolutely have
$$\langle s^+, \hat s\rangle + \mu e^+ < \|\hat s\|^2 + \mu\varepsilon. \tag{3.1.2}$$
Indeed, if (3.1.2) does not hold, just take some $\alpha \in \Delta_\ell$ solving (2.1.4), append the $(\ell+1)$st multiplier $\alpha_+ := 0$ to it and realize that the old solution $(\alpha, \mu)$ is again an optimal primal-dual pair: the minimization algorithm enters a disastrous loop. In a way, (3.1.2) plays the role of (XIII.2.1.1): it guarantees that the new element $(s^+, e^+)$ will really define an $S^+(\varepsilon)$ richer than $S(\varepsilon)$; see also Remark IX.2.1.3.

- To keep the same spirit as in Chap. XIII, we ensure (3.1.2) by forcing first $e^+$ to be small. In addition to $\bar v$, another parameter $\bar\varepsilon > 0$ is therefore passed to the line-search which, if no descent is obtained, must at least produce an $s^+ \in \partial_{\bar\varepsilon} f(x)$ to enrich the bundle. Admitting that $y^+ = x + td$ and $s^+ \in \partial f(y^+)$, this amounts to requiring
$$e^+ = e(x, y^+, s^+) \leq \bar\varepsilon. \tag{3.1.3}$$
- For (3.1.2) to hold, not only $e^+$ but also $\langle s^+, \hat s\rangle = -\langle s^+, d\rangle$ must be small. This term represents a directional derivative, comparable to $\bar v$ of (3.1.1); so we require
$$\langle s^+, d\rangle \geq m'\bar v \tag{3.1.4}$$
for some coefficient $m'$; for convergence reasons, $m'$ is taken in $]m, 1[$.

In summary, a null-iteration will be declared when (3.1.3) and (3.1.4) hold simultaneously: this is the second part of the (O)-clause, in which case $t$ is accepted as a "suitable null-step".

Remark 3.1.1 Thus, we require the null-step to satisfy two inequalities, although the single (3.1.2) would suffice. In a way, this complicates the algorithm but our motivation is to keep closer to the preceding chapters: (3.1.3) is fully in the line of Chap. XIII, while (3.1.4) connotes Wolfe's criterion of (II.3.2.4).

Note that this strategy implies a careful choice of $\bar v$ and $\bar\varepsilon$: we have to make sure that [(3.1.4), (3.1.3)] does imply (3.1.2), i.e.
$$\langle s^+, d\rangle \geq m'\bar v \ \text{ and }\ e^+ \leq \bar\varepsilon \implies \langle s^+, \hat s\rangle + \mu e^+ < \|\hat s\|^2 + \mu\varepsilon.$$
Said otherwise, since (3.1.3) and (3.1.4) only guarantee
$$\langle s^+, \hat s\rangle + \mu e^+ \leq -m'\bar v + \mu\bar\varepsilon,$$
it suffices to require
$$-m'\bar v + \mu\bar\varepsilon < \|\hat s\|^2 + \mu\varepsilon, \tag{3.1.5}$$
an inequality which we will have to keep in mind when choosing the tolerances. $\square$

Finally, we come to the (L)-clause. From their very motivation, the concepts of null-step and of descent-step are mutually exclusive; and it is normal to reflect this exclusion in the criteria defining them. In view of this, we find it convenient to declare $t$ as "not too small" when (3.1.3) does not hold: then $t$ is accepted as a descent-step (if (3.1.1) holds as well), otherwise case (L) occurs. Naturally, observe that (3.1.3) holds for any $t$ close enough to 0.

To sum up, the stopping criterion for the line-search will be as follows. The tolerances $\bar v < 0$, $\bar\varepsilon > 0$ are given (satisfying (3.1.5), but this is unimportant for the moment), as well as the coefficients $0 < m < m' < 1$. Accept the stepsize $t > 0$ with $y^+ = x + td$, $s^+ \in \partial f(y^+)$ and $e^+ = e(x, y^+, s^+)$ when:
$$f(y^+) \leq f(x) + mt\bar v \ \text{ and }\ e^+ > \bar\varepsilon \qquad\text{[descent-step]} \tag{3.1.6}$$
or
$$e^+ \leq \bar\varepsilon \ \text{ and }\ \langle s^+, d\rangle \geq m'\bar v. \qquad\text{[null-step]} \tag{3.1.7}$$

3.2 The Implementable Line-Search and Its Convergence

We are now in a position to design the line-search algorithm. It is based on the strategy of §II.3 (and more particularly Fig. II.3.3.1), suitably adapted to take §3.1 into account; its general organization is:
(O) $t$ is convenient when it satisfies (3.1.6) or (3.1.7), with the appropriate exit-case;
(R) $t$ is called $t_R$ when (3.1.1) does not hold; subsequent trials will be smaller than $t_R$;
(L) $t$ is called $t_L$ in all other cases; subsequent trials will be larger than $t_L$.

Corresponding to these rules, the truth-value Table 3.2.1 specifies the decision made in each possible combination (T for true, F for false; a star * means an impossible case, ruled out by convexity). The line-search itself can be realized by the following algorithm.

Algorithm 3.2.1 (Line-Search, Nonsmooth Case) The initial point $x \in \mathbb{R}^n$ and stepsize $t > 0$, the direction $d \in \mathbb{R}^n$, the tolerances $\bar v < 0$, $\bar\varepsilon > 0$, and coefficients $m \in\;]0, 1[$, $m' \in\;]m, 1[$ are given. Set $t_L = 0$, $t_R = 0$.
Table 3.2.1. Exhaustive analysis of the test

(3.1.1)   (3.1.3)   (3.1.4)   decision
   T         T         T      (O) (null)
   T         T         F      (L)
   T         F         T      (O) (descent)
   T         F         F      (O) (descent)
   F         T         T      (O) (null)
   F         T         F      (R)*
   F         F         T      (R)
   F         F         F      (R)*

STEP 1. Compute $f(x + td)$ and $s = s(x + td)$; compute $e = e(x, x + td, s)$.
STEP 2 (test for a null-step). If (3.1.3) and (3.1.4) hold, stop the line-search with a null-step. Otherwise proceed to Step 3.
STEP 3 (test for large $t$). If (3.1.1) does not hold, set $t_R = t$ and go to Step 6. Otherwise proceed to Step 4.
STEP 4 (test for a descent-step; $t$ is not too large). If (3.1.3) does not hold, stop the line-search with a descent-step. Otherwise set $t_L = t$ and proceed to Step 5.
STEP 5 (extrapolation). If $t_R = 0$, find a new $t$ by extrapolation beyond $t_L$ and loop to Step 1. Otherwise proceed to Step 6.
STEP 6 (interpolation). Find a new $t$ by interpolation in $]t_L, t_R[$ and loop to Step 1. $\square$

At each cycle, the extrapolation/interpolation formulae must guarantee a definite decrease of the bracket $[t_L, t_R]$, or a definite increase of $t_L$. This was called the safeguard-reduction property in §II.3.1. A simple way of ensuring it is, for example, to replace $t$ by $10t$ in case of extrapolation and by $\frac12(t_L + t_R)$ in case of interpolation. More efficient formulae can be proposed, which were alluded to in Remark XIII.2.1.2. See also §II.3.4 for the forcing mechanism ensuring the safeguard-reduction property. Algorithm 3.2.1 is illustrated by the flow-chart of Fig. 3.2.1, which uses notation closer to that of §II.3.
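
To fix ideas, here is a compact Python rendering of Algorithm 3.2.1 with the simple rules just mentioned (extrapolate by 10, interpolate by bisection). It is a sketch, not the book's final word: the cycle cap is an extra safeguard of ours, and `f`, `subgrad` stand for the black box (U1).

```python
import numpy as np

def line_search(f, subgrad, x, d, t, v_bar, eps_bar, m=0.1, mp=0.2,
                max_cycles=50):
    """Algorithm 3.2.1: returns (kind, t, y, s, e) with kind 'descent'
    or 'null'; v_bar < 0 estimates f'(x, d), eps_bar is the null tolerance."""
    assert v_bar < 0 and 0 < m < mp < 1
    fx, tL, tR = f(x), 0.0, 0.0
    for _ in range(max_cycles):
        y = x + t * d                            # Step 1
        fy, s = f(y), subgrad(y)
        e = fx - fy + t * np.dot(s, d)           # e(x, y, s) of (1.2.6)
        if e <= eps_bar and np.dot(s, d) >= mp * v_bar:
            return 'null', t, y, s, e            # Step 2: test (3.1.7)
        if fy > fx + m * t * v_bar:              # Step 3: (3.1.1) fails
            tR = t
        elif e > eps_bar:
            return 'descent', t, y, s, e         # Step 4: test (3.1.6)
        else:
            tL = t                               # Step 4: t too small
        t = 10.0 * t if tR == 0.0 else 0.5 * (tL + tR)   # Steps 5-6
    raise RuntimeError('line-search did not terminate (cap reached)')
```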
Under these conditions, a simple adaptation of Theorem II.3.3.2 proves convergence of our line-search.

Theorem 3.2.2 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function, and assume that the safeguard-reduction Property II.3.1.3 holds. Then Algorithm 3.2.1 either generates a sequence of stepsizes $t \to +\infty$ with $f(x + td) \to -\infty$, or terminates after a finite number of cycles, with a null- or a descent-step.

PROOF. In what follows, we will use the notation $s_L := s(x + t_L d)$, $s_R := s(x + t_R d)$. Arguing by contradiction, we suppose that the stop never occurs, neither in Step 2, nor in Step 4. We observe that, at each cycle,
$$f(x + t_L d) \leq f(x) + mt_L\bar v. \tag{3.2.1}$$
Fig. 3.2.1. Line-search with provision for null-step

[Extrapolations] Suppose first that Algorithm 3.2.1 loops indefinitely between Step 5 and Step 1: no interpolation is ever made. Then, by construction, every generated $t$ is a $t_L$ and tends to $+\infty$ by virtue of the safeguard-reduction property. Because $\bar v < 0$, (3.2.1) shows that $f(x + t_L d) \to -\infty$.

[Interpolations] Thus, if $f(x + td)$ is bounded from below, $t_R$ becomes positive at some cycle. From then on, the algorithm loops between Step 6 and Step 1. By construction, at each subsequent cycle,
$$f(x + t_R d) > f(x) + mt_R\bar v; \tag{3.2.2}$$
the sequence $\{t_L\}$ is increasing, the sequence $\{t_R\}$ is decreasing, every $t_L$ is smaller than every $t_R$, and the safeguard-reduction property implies that these two sequences have a common limit, say $\bar t \geq 0$. By continuity, (3.2.1) and (3.2.2) imply
$$f(x + \bar t d) = f(x) + m\bar t\bar v. \tag{3.2.3}$$

[Case $\bar t = 0$] Compare (3.2.2) and (3.2.3) to see that $t_R > \bar t$; then pass to the limit in
$$\frac{f(x + t_R d) - f(x)}{t_R} > m\bar v$$
to obtain
$$f'(x, d) \geq m\bar v > m'\bar v.$$
This is a contradiction: indeed, for $t_R \downarrow 0$,
$$\langle s_R, d\rangle \to f'(x, d) > m'\bar v \quad\text{and}\quad e_R \to 0,$$
i.e. (3.1.3) and (3.1.4) are satisfied by $t_R$ small enough; the stopping criterion of Step 2 must eventually be satisfied.

[Case $\bar t > 0$] Then $t_L$ becomes positive after some cycle; write the subgradient inequality
$$f(x) \geq f(x + t_L d) - t_L\langle s_L, d\rangle$$
as
$$\langle s_L, d\rangle \geq \frac{f(x + t_L d) - f(x)}{t_L}$$
and pass to the limit: when the number of cycles goes to infinity, use (3.2.3) to see that the right-hand side tends to $m\bar v > m'\bar v$.

Thus, after some cycle, (3.1.4) is satisfied forever by $s^+ = s_L$. On the other hand, each $t_L$ satisfies (3.1.3) (otherwise, a descent-step is found); we therefore obtain again a contradiction: $t_L$ should be accepted as a null-step. $\square$

As suggested in Remark 3.1.1, several strategies can be chosen to define a null-step. We have mentioned in §II.3.2 that several strategies are likewise possible to define a "too small" step. One specific choice cannot be made independently of the other, though: a truth-value table such as Table 3.2.1 must reflect the consistency Property II.3.1.4. The reader may imagine a number of other possibilities, and study their consistency.

Remark 3.2.3 The present line-search differs from a "smooth" one by the introduction of the null-criterion, of course. With respect to Wolfe's line-search of §II.3.3, a second difference is the (L)-clause, which used to be "not (3.1.4)", and is replaced here by (3.1.3).

However, a slight modification in Algorithm 3.2.1 cancels this last difference. Keep the same definition for a null-step, but take Wolfe's definition of a descent-step; the test (O), (R), (L) becomes

(O) either $f(y^+) \leq f(x) + mt\bar v$ and $\langle s^+, d\rangle \geq m'\bar v$;  [descent-step]
    or $e^+ \leq \bar\varepsilon$ and $\langle s^+, d\rangle \geq m'\bar v$;  [null-step]
(R) $f(y^+) > f(x) + mt\bar v$;
(L) all other cases.

Furthermore give priority to the null-criterion; in other words, in the ambiguous case when
$$f(y^+) \leq f(x) + mt\bar v, \quad e^+ \leq \bar\varepsilon \quad\text{and}\quad \langle s^+, d\rangle \geq m'\bar v,$$
make a null-step.

The effect of this variant on the flow-chart 3.2.1 is to replace the box "$e(t) > \bar\varepsilon \Rightarrow$ descent" by "$q'(t) \geq m'\bar v \Rightarrow$ descent"; and the comparison with Fig. II.3.3.1 becomes quite eloquent. The only difference with (II.3.2.5) is now the insertion of the null-criterion in (O). The proof of Theorem 3.2.2 can be reproduced (easy exercise). Besides, a close look at the logic reveals that (3.1.4) still holds in case of a descent-step. In other words, the present variant is formally identical to its original: it still produces descent-steps satisfying (3.1.6), and null-steps satisfying (3.1.7).

Although not earth-shaking, this observation illustrates an important point, already alluded to in §VIII.3.3: when designing algorithms for nonsmooth minimization, one should keep as close as possible to the "smooth philosophy" and extract its good features. In this sense, the algorithms of §XII.4 are somewhat suspect, as they depart too much from the smooth case. $\square$

3.3 Derivation of the Descent Algorithm

Based on the previous developments, the general organization of the overall minimization algorithm becomes clear: at each iteration, having chosen $\varepsilon$, we solve (2.1.4) to obtain a primal-dual solution $(\alpha, \mu)$, and $\hat s$ from (2.1.5). If $\|\hat s\|$ is small, (2.3.6) provides a minimality estimate. Otherwise the line-search 3.2.1 is done along $d = -\hat s$, to obtain either a descent-step as in §II.3, or a null-step obviating small moves. For a complete description of the algorithm, it remains to specify a few items: the line-search parameters $\bar v$ and $\bar\varepsilon$, the choice of $\varepsilon$ (which plays such a central role), and the management of the bundle.

(a) Line-Search Parameters. The line-search needs two parameters $\bar\varepsilon$ and $\bar v$ for (3.1.6), (3.1.7). The status of $\bar\varepsilon$ is rather clear: first, it is measured in the same units as $\varepsilon$, i.e. as the objective function. Second, it should not be larger than $\varepsilon$: the purpose of a null-step is to find $s^+ \in \partial_\varepsilon f(x_k)$. We therefore set
$$\bar\varepsilon = \bar m\varepsilon, \tag{3.3.1}$$
for some fixed $\bar m \in\;]0, 1]$. This coefficient allows some flexibility; there is no a priori reason to choose it very small nor close to 1.

On the other hand, we have two possible strategies concerning the $\bar v$-question, respectively based on Chaps. II and IX.
- If we follow the ideas of §II.3.2, our aim is to decrease $f$ by a fraction of its initial slope $f'(x, d)$ - see (3.1.1). This $f'(x, d)$ is unknown (and, incidentally, it may be nonnegative) but, from Proposition 2.3.4, it can conveniently be replaced by
$$\bar v = -\|\hat s\|^2 - \mu\varepsilon. \tag{3.3.2}$$
If this expression overestimates the real $f'(x, d)$, a descent-step will be easily found; otherwise the null-mechanism will play its role.
- If we decide to follow the strategy of a separation mechanism (§XIII.2.2 for example), the role of $\bar v$ is different: it is aimed at forcing $s^+$ away from $S(\varepsilon)$ and should rather be $\sigma_{S(\varepsilon)}(d)$, given by Proposition 2.3.2 - see (3.1.4):
$$\bar v = -\|\hat s\|^2. \tag{3.3.3}$$
These choices are equally logical; we will keep both possibilities open (actually they make little difference, from theoretical as well as practical points of view).

(b) Stopping Criterion. The rationale for the stopping criterion is of course (2.3.5): when the projection $\hat s$ is close to 0, the current $x$ minimizes $f$ within the current $\varepsilon$, at least approximately. However, some complexity appears, due to the double role played by $\varepsilon$: it is not only useful to stop the algorithm, but it is also essential to compute the direction, via the constraint (2.1.4)(iii).

From this last point of view, $\varepsilon$ must be reasonably large, subject to $S(\varepsilon)$ being a reasonable approximation of the $\varepsilon$-subdifferential of $f$ at the current $x$. As a result, the minimization algorithm cannot be stopped directly when $\hat s \approx 0$. One must still check that the (approximate) minimality condition is tight enough; if not, $\varepsilon$ must be reduced and the projection recomputed. On the other hand, the partial minimality condition thus obtained is useful to safeguard from above the subsequent values of $\varepsilon$.

Accordingly, the general strategy will be as follows. First, a tolerance $\delta > 0$ is chosen, which plays the same role as in Algorithms II.1.3.1 and XIII.1.3.1. It has the same units as a subgradient-norm and the event "$\hat s \approx 0$" is quantified by "$\|\hat s\| \leq \delta$". As for $\varepsilon$, three values are maintained throughout the algorithm:

(i) The current value, denoted by $\varepsilon = \varepsilon_k$, is used in (2.1.4)(iii) to solve the direction-finding problem.
(ii) The value $\underline\varepsilon > 0$ is the final tolerance, or the lowest useful value for $\varepsilon_k$: the user wishes to obtain a final iterate $x$ satisfying
$$f(x) \leq f(y) + \underline\varepsilon + \delta\|y - x\| \quad\text{for all } y \in \mathbb{R}^n.$$
(iii) The value $E$ represents the best estimate
$$f(x) \leq f(y) + E + \delta\|y - x\| \quad\text{for all } y \in \mathbb{R}^n$$
obtained so far. It is decreased to the current $\varepsilon$ each time (2.1.4) produces $\hat s$ with $\|\hat s\| \leq \delta$.

In practice, having the current $E \geq \underline\varepsilon$, the direction-finding problem is solved at each iteration with an $\varepsilon$ chosen in $]\underline\varepsilon, E]$. The successive values $E$ form a decreasing sequence $\{E_k\}$, minorized by $\underline\varepsilon$, and $E_{k+1}$ is normally equal to $E_k$. In fact, an iteration $k+1$ with $E_{k+1} < E_k$ is exceptional; during such an iteration, the direction-finding problem (2.1.4) has been solved several times, until a true direction $-\hat s$, definitely nonzero, has been produced. The final tolerance $\underline\varepsilon$ is fixed for the entire algorithm, which is stopped for good as soon as (2.1.4) is used with $\varepsilon = \underline\varepsilon$ and produces $\|\hat s\| \leq \delta$.

Knowing that $\underline\varepsilon$ and $E$ are managed in an automatic way, there remains to fix the current value $\varepsilon$ of (i); we distinguish three cases.
- At the first iteration, the question is of minor importance in the sense that the first direction does not depend on $\varepsilon$ (the bundle contains only one element, with $e_1 = 0$); it does play some role, however, via $\bar\varepsilon$ of (3.3.1), in case the first direction $-s_1$ is not downhill. Choosing this initial $\varepsilon$ is rather similar to choosing an initial stepsize.
- At an iteration following a null-step, the spirit of the method is to leave $\varepsilon$ as it is: the direction is recomputed with the same $\varepsilon$ - but with an additional element in the bundle. In a way, we are in the same situation as after an inner iteration $(p, k) \mapsto (p, k+1)$ of §XIII.2.3; also, note that (3.1.2) would become irrelevant if $\varepsilon$ were changed.
- Admitting that null-steps occur only occasionally, as emergencies, the general case is when the current iteration follows a descent-step. It will be seen that the behaviour of the algorithm is drastically influenced by a proper choice of $\varepsilon$ in this situation. Thus, after a descent-step, a new $\varepsilon = \varepsilon_{k+1} \in\;]\underline\varepsilon, E]$ is chosen to compute the direction $d_{k+1}$ issuing from the new $x_{k+1} \neq x_k$. Although it is of utmost importance for efficient implementations, this choice of $\varepsilon$ will not be specified here: we will state the algorithm independently of such a choice, thereby establishing convergence in a general situation.

(c) Management of the Bundle. At every "normal" iteration, a new element $(s^+, e^+)$ is appended to the bundle. With $y^+ = x_k + t_k d_k$ and $s^+ \in \partial f(y^+)$ found by the $k$th line-search, there are two cases for $e^+$.
- If the line-search has produced a null-step, then $x_{k+1} = x_k$; hence
$$e^+ = e(x_k, y^+, s^+) = f(x_k) - f(y^+) + t_k\langle s^+, d_k\rangle.$$
- In the case of a descent-step, $x_{k+1} = y^+$ and $e^+ = e(x_{k+1}, y^+, s^+) = 0$. Because the current iterate is moved, we must also update the old linearization errors according to (1.2.10): for each index $i$ in the old bundle, $e_i$ is changed to
$$e_i + f(x_{k+1}) - f(x_k) - t_k\langle s_i, d_k\rangle.$$

Because a computer has finite memory, this appending process cannot go on forever: at some iteration, room must be made for the new element. In order to preserve convergence, room must also be made for an aggregate element, say $(s^a, e^a)$, which we proceed to define (it has already been alluded to in §2.4, see Fig. 2.4.1).

The purpose of this aggregate element is to fit into the convergence theory of §IX.2.1, based on the fact that the present projection $\hat s$ belongs to the next polyhedron $S^+(\varepsilon)$ - at least in the case of a null-step. This leaves little choice: $s^a$ must be $\hat s$ and $e^a$ must not exceed $\varepsilon$.

On the other hand, the definition (1.2.9) of the bundle requires $s^a = \hat s$ to be an $e^a$-subgradient of $f$ at $x_k$. Because $\hat s \in S(\varepsilon) \subset \partial_\varepsilon f(x_k)$, it suffices to take $e^a = \varepsilon$, or better
$$e^a = \hat e := \sum_{i=1}^\ell \alpha_i e_i \leq \varepsilon,$$
where $\ell$ is the current bundle size and $\alpha \in \Delta_\ell$ is an optimal solution of (2.1.4). Indeed,
$$\hat s \in \partial_{\hat e} f(x_k), \tag{3.3.4}$$
and this is just what is needed for (1.2.9) to hold. Incidentally, $\hat e = \varepsilon$ if the multiplier $\mu$ is nonzero; otherwise we cannot even guarantee that $\hat e$ is well-defined, since the optimal $\alpha$ may not be unique.
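
In code, forming the aggregate pair and compressing the bundle might look as follows (a sketch under the same assumed data layout; `keep` is whatever index set the deletion rule spares, a rule the text leaves open):

```python
import numpy as np

def aggregate_and_compress(S, e, alpha, keep):
    """Append the aggregate element (s_hat, e_hat): s_hat = s(alpha) is an
    e_hat-subgradient at x_k by (3.3.4), with e_hat = e(alpha) <= eps."""
    s_hat = S.T @ alpha
    e_hat = float(np.dot(e, alpha))
    return np.vstack([S[keep], s_hat]), np.append(e[keep], e_hat)
```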
3 The Implementable Algorithm 257

Thus, when necessary, at least two elements are destroyed from the current bundle,
e
whose size is therefore decreased to a value e' :::; e-
2. This makes room to append

(3.3.5)

(with possible renumbering).

Remark 3.3.1 If all the elements destroyed have $\alpha = 0$ in the quadratic problem
(2.1.4), $(\hat s, \hat e)$ brings nothing to the definition of $S(\varepsilon)$. In that case, no aggregation is
necessary, only one element has to be destroyed. □

It is important to understand that the aggregate element is formally identical to
any other "natural" element in (1.2.9), coming directly from (U1); and this is true
despite the absence of any y such that $\hat s \in \partial f(y)$, thanks to the following property (in
which the notation of §1.2 is used again, making explicit the dependence on $x = x_k$).

Lemma 3.3.2 If $(s_k, e_k) = (\hat s, \hat e)$ is the aggregate element at iteration k, then $s_k \in \partial_{e(x')} f(x')$ for all $x'$, where
$$e(x') := \hat e + f(x') - f(x) - \langle \hat s, x' - x\rangle \ \ (\ge 0)\,.$$

PROOF. In view of (3.3.4), this is nothing but Proposition XI.4.2.4. □


At subsequent iterations, the aggregate weight $\hat e = e_k$ will therefore be updated
according to (1.2.10), just as the others; it may enter the composition of a further
aggregation afterwards; this further aggregation will nevertheless be an appropriate ε-subgradient at an appropriate iterate. In a word, we can forget the notation $(\hat s, \hat e)$; there
is nothing wrong with the notation (3.3.5), in which an aggregate element becomes
anonymous.
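To make the mechanics concrete, here is a minimal sketch (ours, not the book's) of this aggregation-plus-compression step in Python; the array layout and the rule "keep the elements with largest $\alpha_i$" are illustrative assumptions, not the book's prescription.

```python
import numpy as np

def aggregate_and_compress(S, e, alpha, keep=2):
    """Aggregation of Section 3.3(c), sketched.

    S     : (l, n) array whose rows are the subgradients s_i
    e     : (l,) array of linearization errors e_i >= 0
    alpha : (l,) optimal simplex weights from problem (2.1.4)

    Returns a compressed bundle whose last element is the aggregate
    (s_hat, e_hat); by Lemma 3.3.2 it can be treated exactly like a
    'natural' element at subsequent iterations.
    """
    s_hat = alpha @ S              # s^ = sum_i alpha_i s_i
    e_hat = float(alpha @ e)       # e^ = sum_i alpha_i e_i  (<= eps)
    idx = np.argsort(alpha)[-keep:]          # keep the most active elements
    S_new = np.vstack([S[idx], s_hat])       # append the aggregate
    e_new = np.append(e[idx], e_hat)
    return S_new, e_new
```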

3.4 The Implementable Algorithm and Its Convergence

Now a detailed description of our algorithm can be given. We start with a remark
concerning the line-search parameters.

Remark 3.4.1 As already mentioned in Remark 3.1.1, $\bar e$ (i.e. $\bar m$) and $\bar v$ are not totally
independent of each other: they should satisfy (3.1.5); otherwise (3.1.2), which is
essential in the case of a null-step, may not hold.
If $\bar v$ is given by (3.3.3), we have
$$-m'\bar v + \mu\bar e = m'\|\hat s\|^2 + \bar m\mu\varepsilon \le \max\{m', \bar m\}\,(\|\hat s\|^2 + \mu\varepsilon)$$
and (3.1.5) - hence (3.1.2) - does hold if $\bar m$ and $m'$ are simply smaller than 1. On the
other hand, if $\bar v$ is given by (3.3.2), we write
$$-m'\bar v + \mu\bar e = m'\|\hat s\|^2 + (m' + \bar m)\mu\varepsilon$$
to see that the case $m' + \bar m > 1$ is dangerous: the new subgradient may belong to the
old $S(\varepsilon)$; in the case of a null-step, $\hat s$ and $\mu$ will stay the same - a disaster. Conclusion:
if the value (3.3.2) is used for $\bar v$, it is just safe to take
$$m' + \bar m \le 1\,. \qquad(3.4.1)$$
□
The algorithm is then described as follows. Notes such as (1) refer to explanations
following the algorithm.

Algorithm 3.4.2 (Bundle Method in Dual Form) (1) The data are: the initial $x_1$ and
$\bar\varepsilon > 0$ (2); the maximal bundle size $\bar\ell \ge 2$; the tolerances $\underline\varepsilon > 0$ and $\delta > 0$.
Fix the parameters $m, m', \bar m$ such that $0 < m < m' < 1$ and $0 < \bar m \le 1$ (3).
Initialize the bundle with $s_1 = s(x_1)$, $e_1 = 0$, the iteration index $k = 1$, the bundle
size $\ell = 1$; set $\varepsilon = \bar\varepsilon$.
STEP 1 (computing the direction). Replace $\varepsilon$ by $\max\{\underline\varepsilon, \min(\varepsilon, \bar\varepsilon)\}$ (4). Solve (2.1.4)
to obtain an optimal solution $\alpha$ and multiplier $\mu$ (5). Set
$$\hat s := \sum_{i=1}^{\ell} \alpha_i s_i\,, \qquad \hat e := \sum_{i=1}^{\ell} \alpha_i e_i\,.$$
STEP 2 (stopping criterion). If $\|\hat s\| > \delta$ (6) go to Step 3. Otherwise:
if $\varepsilon \le \underline\varepsilon$ stop (7); otherwise diminish $\varepsilon$ (8) and go to Step 1.
STEP 3 (line-search). Perform the line-search 3.2.1 issuing from $x_k$, along the direction $d = -\hat s$, with the descent-parameter $\bar v < 0$ (9) and the null-step parameter
$\bar e = \bar m\varepsilon$. Obtain a positive stepsize $t > 0$ and $s_+ = s(x_k + td)$, realizing either a
descent-step or a null-step (10).
STEP 4 (managing the bundle size). If $\ell = \bar\ell$: delete at least 2 elements from the
bundle (11); insert the element $(\hat s, \hat e)$ coming from Step 1 (7); call again $\ell < \bar\ell$ the
length of the new list thus obtained.
STEP 5 (appending the new element). Append $s_{\ell+1} := s_+$ to the bundle.
- In the case of a null-step, append $e_{\ell+1} := f(x_k) - f(x_k + td) + t\langle s_+, d\rangle$.
- In the case of a descent-step, append $e_{\ell+1} := 0$ and, for all $i \in \{1, \dots, \ell\}$,
change each $e_i$ to
$$e_i + f(x_k + td) - f(x_k) - t\langle s_i, d\rangle\,.$$
Choose a new $\varepsilon$ (12), replace $k$ by $k + 1$, $\ell$ by $\ell + 1$ and loop to Step 1. □

Notes 3.4.3
(1) We use this name because the algorithm has been conceived via a development in the
dual space, to construct approximate subdifferentials. The next chapter will develop
similar algorithms, based on primal arguments only.
(2) This is the same problem as the very first initialization of the stepsize (Remark II.3.4.2):
use for $\bar\varepsilon$ an estimate of $f(x_1) - \bar f$, where $\bar f$ is the infimal value of $f$.
(3) Suggested values are: $m = \bar m = 0.1$, $m' = 0.2$. Remember Remark 3.4.1.
(4) The aim of this operation is to make sure that $0 < \underline\varepsilon \le \varepsilon \le \bar\varepsilon$, whatever happens, with
$\underline\varepsilon > 0$ fixed once and for all. Then (2.1.4) is consistent and $\varepsilon$ cannot approach 0.
(5) Our notation is that of §2: $\alpha \in \Delta_\ell \subset \mathbb R^\ell$ and $\mu \ge 0$ is the multiplier associated with
the inequality constraint (2.1.4)(iii). For convenience, let us write again the minimality
conditions (2.2.3):
$$\langle s_i, \hat s\rangle + \mu e_i \ge \|\hat s\|^2 + \mu\varepsilon \quad\text{for } i \in \{1, \dots, \ell\}\,, \qquad(3.4.2)$$
with equality if $\alpha_i > 0$.
(6) We want to detect ε-optimality of the current iterate x. For this, it is a good idea to let $\delta$
vary with $\varepsilon$. In fact, we have
$$f(y) \ge f(x) - \varepsilon - \|\hat s\|\,\|y - x\|\,.$$
Suppose that an idea of the diameter of the picture is on hand: a bound M is known
such that, for all $y \in \mathbb R^n$,
$$f(y) < f(x) - \varepsilon \implies \|y - x\| \le M\,.$$
Then we have
$$f(y) \ge f(x) - \varepsilon - M\|\hat s\| \quad\text{for all } y \in \mathbb R^n\,,$$
which means roughly ε-optimality if $M\|\hat s\|$ is small compared to $\varepsilon$, say
$M\|\hat s\| \le 0.1\varepsilon$. In a word, instead of fixing $\delta$ forever in the data, we can take at each iteration
$$\delta = \frac{0.1}{M}\max\{\underline\varepsilon, \varepsilon\}\,. \qquad(3.4.3)$$
(7) In case $\mu = 0$, the value $\hat e$ is ambiguous. In view of the gap mentioned in Remark 2.4.4, it
is useful to preserve as much accuracy as possible; the aggregate weight could therefore
be taken as small as possible, and this means the optimal value in
$$\min\Big\{\textstyle\sum_{i=1}^{\ell} \alpha_i e_i : \alpha \in \Delta_\ell\,,\ \sum_{i=1}^{\ell} \alpha_i s_i = \hat s\Big\}\,.$$
(8) The algorithm is a sequence of loops between Step 5 and Step 1, until ε-optimality is
detected; and it will be shown below that this happens after finitely many such loops.
Then $\varepsilon$ must be decreased fast enough, so that the lower threshold $\underline\varepsilon$ is eventually reached.
A sensible strategy is to take
$$\varepsilon = \max\{0.1\varepsilon, \underline\varepsilon\}\,.$$
(9) We have seen that $\bar v$ can have the two equally sensible values (3.3.2) or (3.3.3). We do not
specify the choice in this algorithm, which is stated formally. Remember Remark 3.4.1,
though.
(10) We recall from §3.2 that, in the case of a descent-step,
$$f(x_k + td) \le f(x_k) + mt\bar v \quad\text{and}\quad e(x_k, x_k + td, s_+) > \bar m\varepsilon$$
(meaning that t is neither too small nor too large) while, in the case of a null-step, t is
small enough and $s_+$ is useful, i.e.:
$$e(x_k, x_k + td, s_+) \le \bar m\varepsilon \quad\text{and}\quad \langle s_+, d\rangle \ge m'\bar v\,.$$
(11) If a deletion is made after a null-step, if the subgradient at the current x (i.e. the element
with $e = 0$) is deleted, and if $\varepsilon$ is subsequently decreased at Step 2, then (2.1.4) may
become inconsistent; see (4) above. To be on the safe side, this element should not be
deleted, which means that the maximal bundle size $\bar\ell$ should actually be at least 3.
(12) Here lies a key of the algorithm: between $\underline\varepsilon$ and $\bar\varepsilon$, a possibly wide range of values
is available for $\varepsilon$. Each of them will give a different direction, thus conditioning the
efficiency of the algorithm (remember from Chap. II that a good direction is a key to
an efficient algorithm). A rough idea of sensible values is known, since they have the
same units as the objective $f$. This, however, is not enough: full efficiency wants more
accurate values. In a way, the requirement (i) in the conclusion of §1.1 is not totally
fulfilled. More will be said on this point in §4. □
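As a complement to these notes, the following sketch (ours, not the book's) shows how Step 1 could be realized numerically: problem (2.1.4) is solved as a small QP over the unit simplex with the ε-constraint, here via scipy's SLSQP routine; this is only one of many possible solvers, and the names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def direction_finding(S, e, eps):
    """Sketch of (2.1.4): minimize ||sum_i alpha_i s_i||^2 over the
    unit simplex Delta_l, subject to sum_i alpha_i e_i <= eps.
    S is (l, n), e is (l,); the direction is then d = -s_hat."""
    l = len(e)
    obj = lambda a: np.dot(a @ S, a @ S)
    cons = [{"type": "eq",   "fun": lambda a: a.sum() - 1.0},  # alpha in Delta_l
            {"type": "ineq", "fun": lambda a: eps - a @ e}]    # <alpha, e> <= eps
    res = minimize(obj, np.full(l, 1.0 / l),
                   bounds=[(0.0, 1.0)] * l,
                   constraints=cons, method="SLSQP")
    alpha = res.x
    return alpha, alpha @ S, float(alpha @ e)   # alpha, s_hat, e_hat
```

In a production code one would also recover the multiplier $\mu$ of the ε-constraint, since the minimality conditions (3.4.2) and the value $\bar v$ need it.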

We now turn to convergence questions. Naturally, the relevant tools are those of
Chaps. IX or XIII: the issue will be to show that, although the $\varepsilon$'s and the $e$'s are
moving, the situation eventually stabilizes and becomes essentially as in §IX.2.1.
In what follows, we call "iteration" a loop from Step 5 to Step 1. During one such
iteration, the direction-finding problem may be solved several times, for decreasing
values of $\varepsilon$. We will denote by $\varepsilon_k, \hat s_k, \mu_k$ the values corresponding to the last such
resolution, which are those actually used by the line-search. We will also assume that
$\bar v$ is given either by (3.3.2) or by (3.3.3). Our aim is to show that there are only finitely
many iterations, i.e. the algorithm does stop in Step 2.

Lemma 3.4.4 Let $f : \mathbb R^n \to \mathbb R$ be convex and assume that the line-search terminates
at each iteration. Then, either $f(x_k) \to -\infty$, or the number of descent-steps is finite.
PROOF. The descent test (3.1.1) and the definition (3.3.2) or (3.3.3) of $\bar v$ imply
$$f(x_k) - f(x_{k+1}) \ge m t_k\|\hat s_k\|^2 = m\|\hat s_k\|\,\|x_{k+1} - x_k\| \ge m\delta\|x_{k+1} - x_k\| \qquad(3.4.4)$$
at each descent-iteration, i.e. each time $x_k$ is changed (here $\delta > 0$ is either fixed, or
bounded away from 0 in case the refinement (3.4.3) is used). Suppose that $\{f(x_k)\}$ is
bounded from below.
We deduce first that $\{x_k\}$ is bounded. Then the set $\cup_k \partial_{\bar\varepsilon} f(x_k)$ is bounded (Proposition XI.4.1.2), so $\{\hat s_k\}$ is also bounded, say by L.
Now, we have from the second half "not (3.1.3)" of the descent-test (3.1.6):
$$\bar e_k = \bar m\varepsilon_k < f(x_k) - f(x_{k+1}) + \langle s(x_{k+1}), x_{k+1} - x_k\rangle \le 2L\|x_{k+1} - x_k\|\,.$$
Combining with (3.4.4), we see that $f$ decreases at least by the fixed quantity
$m\bar m\delta\underline\varepsilon/(2L)$ each time it is changed: this process must be finite. □

When doing null-steps, we are essentially in the situation of §XIII.2.2, generating
a sequence of nested polyhedra $\{S_k(\varepsilon)\}_k$ with $\varepsilon$ fixed. If $\bar v$ is given by (3.3.3), each
new subgradient $s_+$ satisfies
$$\langle s_+, d\rangle \ge -m'\|d\|^2 = m'\sigma_{S(\varepsilon)}(d)$$
(remember Proposition 2.3.2) and the convergence theory of §IX.2.1 applies directly:
the event "$\|\hat s_k\| \le \delta$" eventually occurs. The case of $\bar v$ given by (3.3.2) requires a
slightly different argument; we give a result valid for both choices.

Lemma 3.4.5 The assumptions are those of Lemma 3.4.4; assume also $m' + \bar m \le 1$.
Then, between two descent-steps, only finitely many consecutive null-steps can be
made without reducing $\varepsilon$ in Step 2.

PROOF. Suppose that, starting from some iteration $k_0$, only null-steps are made.
From then on, $x_k$ and $\varepsilon_k$ remain fixed at some $x$ and $\varepsilon$; because $\hat e_k \le \varepsilon$, we have
$\hat s_k \in S_{k+1}(\varepsilon) \subset \partial_\varepsilon f(x)$. The sequence $\{\|\hat s_k\|^2\}$ is therefore decreasing; with
$\hat s_{k+1} = \operatorname{Proj} 0\,|\,S_{k+1}(\varepsilon)$, we can write
$$\|\hat s_{k+1}\|^2 \le \langle \hat s_{k+1}, \hat s_k\rangle \le \|\hat s_{k+1}\|\,\|\hat s_k\| \le \|\hat s_k\|^2$$
where we have used successively: the characterization of the projection $\hat s_{k+1}$, the
Cauchy-Schwarz inequality, and monotonicity of $\{\|\hat s_k\|^2\}$. This implies the following
convergence properties: when $k \to +\infty$,
$$\|\hat s_k\|^2 \to \sigma\,, \qquad \langle \hat s_{k+1}, \hat s_k\rangle \to \sigma\,, \qquad(3.4.5)$$
$$\|\hat s_{k+1} - \hat s_k\|^2 = \|\hat s_{k+1}\|^2 - 2\langle \hat s_{k+1}, \hat s_k\rangle + \|\hat s_k\|^2 \to 0\,.$$
Because of the stopping criterion, $\sigma \ge \delta^2 > 0$; and this holds even if the refinement
(3.4.3) is used.
Now, we obtain from the minimality condition (3.4.2) at the (k+1)st iteration:
$$\|\hat s_{k+1}\|^2 + \mu_{k+1}\varepsilon \le \langle \hat s_{k+1}, s_{\ell+1}\rangle + \mu_{k+1}e_{\ell+1} \le \langle \hat s_{k+1}, s_{\ell+1}\rangle + \mu_{k+1}\bar m\varepsilon\,,$$
which we write as
$$(1 - \bar m)\mu_{k+1}\varepsilon \le \langle \hat s_k, s_{\ell+1}\rangle + \langle \hat s_{k+1} - \hat s_k, s_{\ell+1}\rangle - \|\hat s_{k+1}\|^2\,. \qquad(3.4.6)$$
On the other hand, (3.1.4) gives
$$\langle \hat s_k, s_{\ell+1}\rangle \le -m'\bar v_k \le m'(\|\hat s_k\|^2 + \mu_k\varepsilon)\,,$$
the second inequality coming from the definition (3.3.2) or (3.3.3) of $\bar v = \bar v_k$. So,
combining with (3.4.6):
$$(1 - \bar m)\mu_{k+1}\varepsilon \le m'\mu_k\varepsilon + \theta_k\,, \quad\text{where}\quad
\theta_k := m'\|\hat s_k\|^2 + \langle \hat s_{k+1} - \hat s_k, s_{\ell+1}\rangle - \|\hat s_{k+1}\|^2\,.$$
In view of (3.4.5) and remembering that $\{s_{\ell+1}\} \subset \partial_{\bar m\varepsilon} f(x)$ is bounded,
$$\theta_k \to (m' - 1)\sigma < 0\,.$$
Thus, when $k \to +\infty$, 0 cannot be a cluster point of $\{\theta_k\}$: since $m' \le 1 - \bar m$, at each
k large enough the relation
$$\mu_{k+1} \le \mu_k + \frac{\theta_k}{(1 - \bar m)\varepsilon}$$
shows that $\mu_k$ diminishes at least by a fixed quantity. Since $\mu_k \ge 0$, this cannot go on
forever. □

With these two results, convergence of the overall algorithm is easy to establish;
but some care must be exercised in the control of $\varepsilon$: as mentioned in 3.4.3(8),
Step 2 should be organized in such a way that infinitely many decreases would result
in an $\varepsilon$ tending to 0. This explains the extra assumption made in the following result.

Theorem 3.4.6 Let $f : \mathbb R^n \to \mathbb R$ be convex. Assume for example that $\varepsilon$ is divided
by 10 at each loop from Step 2 to Step 1 of Algorithm 3.4.2; if $\bar v$ is given by (3.3.2),
assume also $m' + \bar m \le 1$. Then:
- either $f(x_k) \to -\infty$ for $k \to +\infty$,
- or the line-search detects at some iteration k that f is unbounded from below,
- or the stop occurs in Step 2 for some finite k, at which there holds
$$f(y) \ge f(x_k) - \underline\varepsilon - \delta\|y - x_k\| \quad\text{for all } y \in \mathbb R^n\,.$$
PROOF. Details are left to the reader. By virtue of Lemma 3.4.4, the sequence $\{x_k\}$
stops. Then, either Lemma IX.2.1.4 or Lemma 3.4.5 ensures that the event "$\|\hat s_k\| \le \delta$"
occurs in Step 2 as many times as necessary to reduce $\varepsilon$ to its minimal value $\underline\varepsilon$. □

We conclude with some remarks.
(i) The technical reason for assuming $\operatorname{dom} f = \mathbb R^n$ is to bound the sequence $\{\hat s_k\}$ (an
essential property). Such a bound may also hold under some weaker assumption, say for
example if the sublevel-set $S_{f(x_1)}(f)$ is compact and included in the interior of $\operatorname{dom} f$.
Another reason is pragmatic: what could the black box (U1) produce, when called at
some $y \notin \operatorname{dom} f$?
(ii) As in Chap. XIII, our convergence result is not classical: it does not establish that the
algorithm produces $\{x_k\}$ satisfying $f(x_k) \to \inf f$. For this, $\underline\varepsilon$ and $\delta$ should be set to
zero, thus destroying the proofs of Lemmas 3.4.4 and 3.4.5. If $\delta$ is set to 0,
one can still prove by contradiction that 0 adheres to $\{\hat s_k\}$. On the other hand, the role of
$\underline\varepsilon$ is much more important: if 0 adheres to $\{\varepsilon_k\}$, $d_k$ may be close to the steepest-descent
direction, and we know from §VIII.2.2 that this is dangerous.
(iii) As already mentioned in §3.3(b), $\underline\varepsilon$ is ambiguous as it acts both as a stopping criterion
and as a safeguard against bad directions. Note, however, that more refined controls of
$\varepsilon_k$ could also be considered: for example, a property like
$$\sum_{k=1}^{\infty} \varepsilon_k\|\hat s_k\| = +\infty$$
ensures a sufficient decrease in $f$ at each descent-iteration, which still allows the last
argument in the proof of Lemma 3.4.4.
(iv) Finally observe that Lemma 3.4.4 uses very little of the definition of $\hat s$. As far as descent-steps are concerned, other directions may also work well, provided that (3.1.6) holds
(naturally, $\bar v$ must then be given some suitable value, interpreted as a convergence parameter to be driven to 0). It is only when null-steps come into play that the bundling
mechanism, i.e. (2.1.4), becomes important. Indeed, observe the similarity between the
proofs of Lemma 3.4.4 and of Theorem II.3.3.6.

4 Numerical Illustrations

This section illustrates the numerical behaviour of the dual bundle Algorithm 3.4.2.
In particular, we compare to each other various choices of the parameters appearing
in the definition of the algorithm. Unless otherwise specified, the experiments below
are generally conducted with the parameter values suggested in Notes 3.4.3
and a computer working with 6-digit accuracy; as for the stopping criterion, it uses the
refinement (3.4.3) and generally $\underline\varepsilon = 10^{-3}(1 + |\bar f|)$, i.e. 3-digit accuracy is required
($\bar f$ being the minimal value of $f$).

4.1 Typical Behaviour

First, we specify how $\varepsilon$ can typically be chosen, in Step 5 of the algorithm; since it
will enter into play at the next iteration, we call $\varepsilon_{k+1}$ or $\varepsilon_+$ this value (as before, the
index k is dropped whenever possible). The following result is useful for this matter.

Proposition 4.1.1 Let the current line-search, made from x along $d = -\hat s$, produce
$y_+ = x + td$, $s_+ \in \partial f(y_+)$, and $e_+ = e(x, y_+, s_+) = f(x) - f(y_+) + t\langle s_+, d\rangle$.
Then there holds
$$-\|\hat s\|^2 \le f'_\varepsilon(x, d) \le \langle s_+, d\rangle + \frac{\varepsilon - e_+}{t} = \frac{\varepsilon - \Delta}{t}\,, \qquad(4.1.1)$$
with $\Delta := f(x) - f(y_+)$.

PROOF. The first inequality results from Propositions 2.1.1 and 2.3.2. The second was
given in Remark XI.2.3.3 (see Fig. 4.1.1 if necessary): in fact,
$$f'_{e_+}(x, d) = \langle s_+, d\rangle\,, \quad\text{[transportation formula XI.4.2.2]}$$
$$-1/t \in \partial(-f')_{e_+}(x, d)\,, \quad\text{[Theorem XI.2.3.2]}$$
where the last set denotes the subdifferential of the convex function $\varepsilon \mapsto -f'_\varepsilon(x, d)$
at $\varepsilon = e_+$. □

Fig. 4.1.1. The $e_+$-directional derivative and its super-derivative



See Fig. 4.1.2 for illustration; observe from the definitions that $\varepsilon \mapsto \sigma_{S(\varepsilon)}$ is piecewise
affine, and remember from Example XI.2.1.3 that the actual $f'_\varepsilon$ behaves typically like
$\varepsilon^{1/2}$ for $\varepsilon \downarrow 0$. This result can be exploited when a descent-step is made from $x_k = x$
to $x_{k+1} = y_+$. For example, the average
$$a_k := \frac12\Big(-\|\hat s_k\|^2 + \frac{\varepsilon_k - \Delta_k}{t_k}\Big)$$
is a possible estimate of the true $f'_{\varepsilon_k}(x, d)$; and then, we can for example use the
following simple rules:

[Figure: the piecewise affine $\sigma_{S(\varepsilon)}$ and the curved $f'_\varepsilon$, as functions of $\varepsilon$]
Fig. 4.1.2. Approximations of the approximate derivative

Strategy 4.1.2 (Standard ε-Strategy)
- If $a_k > 0$, we were too optimistic when choosing $\varepsilon_k$: we decrease it.
- If $a_k \le 0$, $\varepsilon$ can be safely increased.
In both cases, $\varepsilon$ is multiplied by a specific factor not too far from 1. □
The detailed computation of the above factor is omitted. Specific rules for choos-
ing it are more in the domain of cooking recipes than mathematics; so we skip them.
Naturally, all the experiments reported below use the same such rules.
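By way of illustration, one such rule could look as follows (our sketch; the factors 1.5 and 0.7 are arbitrary placeholders for the unspecified "specific factor not too far from 1"):

```python
import numpy as np

def update_eps(s_hat, eps, t, delta, eps_min, eps_max):
    """Sketch of Strategy 4.1.2 after a descent-step: a_k averages
    the two bounds (4.1.1) on f'_eps(x, d); delta = f(x) - f(y+)."""
    a_k = 0.5 * (-np.dot(s_hat, s_hat) + (eps - delta) / t)
    eps *= 0.7 if a_k > 0 else 1.5           # decrease if too optimistic
    return min(max(eps, eps_min), eps_max)   # clamp as in Step 1
```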
To give a rough idea of the result, take the test-problem MAXQUAD of VIII.3.3.3:
we want to minimize
$$\max\{\tfrac12 x^{\top} A_j x + b_j^{\top} x + c_j : j = 1, \dots, m\}\,,$$
where $x \in \mathbb R^{10}$, $m = 5$, and each $A_j$ is positive definite. The minimal value is known
to be 0. The top part of Fig. 4.1.3 gives the decrease of f (in logarithmic scale) as a
function of the number of calls to the black box (U1); the stopping criterion $\underline\varepsilon = 10^{-4}$
was obtained after 143 such calculations, making up a total of 33 descent- and 26
null-iterations. Observe the evolution of f: it suggests long levels during which the
ε-subdifferential is constructed, and which are separated by abrupt decreases, when
it is suitably identified.
The lower part of Fig. 4.1.3 represents the corresponding evolution of $\|\hat s_k\|^2$.
Needless to say, the abrupt increases happen mainly when the partial stopping criterion
of Step 2 is obtained and $\varepsilon$ is decreased. Indeed, Strategy 4.1.2 is of little influence
in this example, and most of the iterations have $\varepsilon = \bar\varepsilon$. Table 4.1.1 displays the
information concerning this partial stopping criterion, obtained 7 times during the
run (the 7th time being the last).

[Figure: two log-scale curves against the number of (U1)-calls: above, the objective values; below, $\|\hat s_k\|^2$]
Fig. 4.1.3. A possible behaviour of the dual bundle algorithm

The test-problem TR48 of IX.2.2.6 further illustrates how the algorithm can behave. With
$\underline\varepsilon = 50$ (remembering that the optimal value is -638565, this again means 4-digit accuracy),
$\varepsilon$ is never decreased; the stopping criterion $\|\hat s\|^2 = 10^{-13}$ is obtained after 86 descent- and
71 null-iterations, for a total of 286 calls to (U1). Figure 4.1.4 shows the evolution of $f(x_k)$,
again as a function of the number of calls to (U1); although slightly smoothed, the curve gives
a fair account of reality; $\varepsilon_k$ follows the same pattern. The behaviour of $\|\hat s_k\|^2$ is still as erratic
as in Fig. 4.1.3; same remark for the multiplier $\mu_k$ of the constraint (2.1.4)(iii).

Table 4.1.1. The successive partial stopping criteria

 k   #(U1)-calls   f(x_k)     f(x_k) - ê_k   ‖ŝ_k‖
 5       10        119.       -72.           13.
13       24        7.7        -8.3           1·10⁻²
27       65        0.353      -1.            6·10⁻⁵
37       91        0.009      -0.08          4·10⁻⁷
47      113        0.003      -0.003         0.
55      135        0.0002     -0.0002        3·10⁻¹⁴
59      143        0.00003    -0.0001        3·10⁻¹⁴

[Figure: $f(x_k)$ decreasing from -470000 through -580000 to -638524, against the number of (U1)-calls]
Fig. 4.1.4. Evolution of the objective with TR48

4.2 The Role of ε

To give an idea of how important the choice of $\varepsilon_k$ can be, run the dual bundle Algorithm 3.4.2 with the two examples above, using various strategies for $\varepsilon_k$. Table 4.2.1
records the number of iterations required to reach 3-digit accuracy. We recall from
Remark II.3.2.1 that the most important column is the third.

Table 4.2.1. Performance with various ε-strategies

                          MAXQUAD                      TR48
ε_{k+1} =             #desc  #null  #(U1)-calls    #desc  #null  #(U1)-calls
standard 4.1.2          25     24      118           63     46      176
f(x_k) - f̄              20      0       46           58     42      167
0.1[f(x_k) - f̄]         33      1       65          162     37      326
ε_k                     30     20      112           83     81      262
f(x_k) - f(x_{k+1})     24      0       52          351     55      612
-t_k v̄_k                29      5       73          305     43      530

The two strategies using $f(x_k) - \bar f$ are of course rarely possible, as they need a
knowledge of the minimal value. They are mentioned mainly to illustrate the range
for sensible values of $\varepsilon$; and they suggest that these values should be rather optimistic.
From the motivation of the method itself (to decrease f by $\varepsilon$ at each iteration), one
could think that $\varepsilon_{k+1} = 0.1[f(x_k) - \bar f]$ should be more realistic; but this is empirically
contradicted.
The last strategy in Table 4.2.1 is based on a (too!) simple observation: if everything were harmonious, we would have $\varepsilon \approx f(x) - f(x_+) \approx -t\bar v$, a reasonable value
for $\varepsilon_+$. The strategy $\varepsilon_+ = f(x) - f(x_+)$ is its logical predecessor.

Remark 4.2.1 (ε-Instability) Observe the relative inefficiency of these last two
strategies; in fact, they illustrate an interesting difficulty. Suppose that, for some
reason, $\varepsilon$ is unduly small at some iteration. The resulting direction is bad (remember
steepest descent!), so the algorithm makes little progress; as a result, $\varepsilon_+$ is going to
be even smaller, thus amplifying the bad choice of $\varepsilon$. This is exactly what happens

with TR48: in the last two experiments in Table 4.2.1, $\varepsilon_k$ reaches its lower level $\underline\varepsilon$ fairly
soon, and stays there forever.
The interesting point is that this difficulty is hard to avoid: to choose $\varepsilon_+$, the only
available information is the observed behaviour of the algorithm during the previous
iteration(s). The problem is then to decide whether this behaviour was "normal" or
not; if not, the reason is still to be determined: was it because $\varepsilon$ was too large, or too
small? We claim that here lies a key of all kinds of bundle methods. □

These two series of results do not allow clear conclusions about which implementable strategies seem good. We therefore supplement the analysis with the test-problem
TSP442, given in IX.2.2.7. The results are now those of Table 4.2.2; they indicate the importance of $\varepsilon$ much more clearly. Observe, naturally, the high proportion of null-steps
when $\varepsilon$ is constant (i.e. too large); a contrario, the last two strategies have $\varepsilon$ too small
(cf. Remark 4.2.1), and have therefore a small proportion of null-steps. This table
assesses the standard strategy 4.1.2, which had a mediocre behaviour on MAXQUAD.

Table 4.2.2. Performance on TSP442

ε_{k+1} =             #desc  #null  #(U1)-calls
standard 4.1.2          38     66      164
f(x_k) - f̄              47     41      126
0.1[f(x_k) - f̄]        182     32      275
ε_k                     80    329      640
f(x_k) - f(x_{k+1})    340     19      418
-t_k v̄_k               292     24      378

Another series of experiments tests the initial value for $\varepsilon$, which has a direct
influence on the standard Strategy 4.1.2. The idea is to correct the value of $\varepsilon_1$, so as to
reduce hazard. The results are illustrated by Table 4.2.3, in which the initialization is:
$\varepsilon_2 = K[f(x_2) - f(x_1)]$ for K = 0.1, 1, 5, 10; in the last line, $\varepsilon_4 = f(x_4) - f(x_1)$ - we
admit that $\varepsilon_1, \varepsilon_2$ and $\varepsilon_3$ have little importance. From now on, we will call "standard"
the ε-Strategy 4.1.2 supplemented with this last initialization.

Table 4.2.3. Influence of the initial ε

        MAXQUAD               TR48                    TSP442
K       desc  null  calls     desc  null  calls      desc  null  calls
0.1      25    15     88       295    51   *502        34    93    215
1        24     9     63        65    47    181        71   125    292
5        25    16     97        57    95    283       663   695   2310
10       25    16     97       433   148  *1021       231   543   1466
ε₄       16    15     82        93    34    224        36    36    102

A star in the table means that the required 3-digit accuracy has not been reached:
because of discrepancies between f- and s-values (due to roundoff errors), the line-search has failed before the stopping criterion could be met.

Finally, we recall from §VIII.3.3 that smooth but stiff problems also provide interesting applications. Precisely, the line "nonsmooth" in Table VIII.3.3.2 was obtained
by the present algorithm, with the ε-strategy $-t_k\bar v_k$. When the standard strategy is
used, the performances are comparable, as recorded in Table 4.2.4 (to be read just as
Table VIII.3.3.2; note that $\pi = 0$ corresponds to Table 4.2.3).

Table 4.2.4. The standard strategy applied to "smooth" problems

π            100    10      1     10⁻¹    10⁻²    10⁻³     0
iter(#U1)   6(11)  9(20)  12(23)  28(56)  28(57)  33(67)  31(82)

4.3 A Variant with Infinite ε: Conjugate Subgradients

Among the possible ε-strategies, a simple one is $\varepsilon_k \equiv +\infty$, i.e. very large. The resulting
direction is then (opposite to) the projection of the origin onto the convex hull of all the
subgradients in the bundle - insofar as $\bar\varepsilon$ is also large. Naturally, $\bar e = \bar m\varepsilon$ is then a clumsy
value: unduly many null-steps will be generated. In this case, it is rather $\bar e$ that must be chosen
via a suitable strategy.
We have applied this idea to the three test-problems of Table 4.2.3 with $\bar e = \bar m[f(x_k) - \bar f]$
(in view of Tables 4.2.1 and 4.2.2, this choice is a sensible one, if available). The results,
displayed in Table 4.3.1, show that the strategy is not too good. Observe for example the very
high proportion of null-steps with TSP442: they indicate that the directions are generally bad. We
also mention that, when $\bar e$ is computed according to the standard strategy, the instability 4.2.1
appears, and the method fails to reach the required 3-digit accuracy (except for MAXQUAD, in
which the standard strategy itself generates large ε's).

Table 4.3.1. Performance of conjugate subgradients

          #desc  #null  #(U1)-calls
MAXQUAD     28     15       97
TR48        79     93      294
TSP442      79   1059    *1858

These results show that large values for $\varepsilon$ are inconvenient. Despite the indications of
Tables 4.2.1 and 4.2.2, one should not be "overly optimistic" when choosing $\varepsilon$: in fact, since
d is computed in view of $S(\varepsilon)$, this latter set should not be "too much larger" than the actual
$\partial_\varepsilon f(x)$. Figure 4.3.1 confirms this fact: it displays the evolution of $f(x_k)$ for TR48 in the
experiment above, with convergence pushed to 4-digit accuracy. Vertical dashed lines indicate
iterations at which the partial stopping criterion is satisfied; then $\varepsilon$ is reduced and forces $\varepsilon_k$
down to a smaller value; after a while, this value becomes again large in terms of $f(x_k) - \bar f$,
the effect of this reduction vanishes and the performance degrades again. We should mention that this
reduction mechanism is triggered by the δ-strategy of (3.4.3); without it, the convergence
would be much slower.

[Figure: $f(x_k)$ (log scale) against the number of (U1)-calls, up to 400]
Fig. 4.3.1. Evolution of f in conjugate subgradients; TR48

Remark 4.3.1 We see that choosing sensible values for $\varepsilon$ is a real headache:
- If $\varepsilon_k$ is too small, $d_k$ becomes close to the disastrous steepest-descent direction. In the worst
case, the algorithm has to stop because of roundoff errors, namely when $d_k$ is not numerically downhill, but nevertheless (3.1.3) cannot be fulfilled in practice, $\bar v$ being negligible
compared to f-values. This is the cause of *stars in Tables 4.2.3 and 4.3.1.
- If $\varepsilon_k$ is too large, $S(\varepsilon_k)$ is irrelevant and $d_k$ is not good either.
- Of these two extremes, the second is probably the less dangerous, but Remark 4.2.1 tells
us that the first is precisely the harder to avoid. Furthermore, Fig. 4.3.1 suggests that an
appropriate δ-strategy is then compulsory, although hard to adjust.

From our experience, we even add that, when the current $\varepsilon$ is judged inconvenient, no
available information can help to decide whether it is too small or too large. Furthermore, a
decision made at a given iteration does not yield immediate effects: it takes a few iterations
for the algorithm to recover from the old bad choice of $\varepsilon$. All these reasons make it advisable
to bias the strategy (4.1.2 or any other) towards large ε-values. □

The interest of the present variant is mainly historical: it establishes a link with conjugate
gradients of §II.2.4; and for this reason, it used to be called the conjugate subgradient method.
In fact, suppose that f is convex and quadratic, and that the line-searches are exact: at each
iteration, $f(x_k + td_k)$ is minimized with respect to t, and no null-step is ever accepted. Then
the theory of §II.2.4 can easily be reproduced to realize that:
- the gradients are mutually orthogonal: $\langle s_i, s_j\rangle = 0$ for all $i \neq j$;
- the direction is actually the projection of the origin onto the affine hull of the gradients;
- this direction is just the same as in the conjugate-gradient method;
- these properties hold for any $\bar\ell \ge 2$.
Remember in particular Remark II.2.4.6. For this interpretation to hold, the assumptions
"f convex quadratic and line-searches exact" are essential. From the experience reported
above, we consider this interpretation as too thin to justify the present variant in a non-quadratic context.

4.4 The Role of the Stopping Criterion

(a) Stopping Criterion as a Safeguard. The stopping criterion is controlled by
two parameters: $\delta$ and $\underline\varepsilon$. As already mentioned (see (iii) at the end of §3.4, and also
Remark 4.3.1 above), $\underline\varepsilon$ is not only a tolerance for stopping, but also a safeguard against
dangerous steepest-descent directions. As such, it may have a drastic influence on the
behaviour of the algorithm, possibly from very early iterations. When $\underline\varepsilon$ is small, the
algorithm may become highly inefficient: the directions may become close to steepest
descent, and also $\hat s$ may become small, making it hard to find null-steps when needed.
Furthermore, this phenomenon is hard to avoid; remember Remark 4.2.1.
In Table 4.2.3, for example, this is exactly what happens on the first line with TR48,
in which $\varepsilon_k$ is generally too small simply because it is initialized on a small value.
Actually, the phenomenon also occurs on each "bad" line, even when $\varepsilon$ could be
thought of as too large. Take for example the run TSP442 with K = 5: Fig. 4.4.1 shows,
in logarithmic scale and relative values, the simultaneous evolution of $f(x_k) - \bar f$
and $\varepsilon_k$ (remember from Tables 4.2.1 and 4.2.2 that, in a harmonious run, the two
curves should evolve together). It blatantly suffers the instability phenomenon of
Remark 4.2.1: in this run, $\varepsilon_1$ is too large; at the beginning, inefficient null-steps are
taken; then 4.1.2 comes into play and $\varepsilon$ is reduced down to $\underline\varepsilon$; and this reduction goes
much faster than the concomitant reduction of f. Yet, the ε-strategy systematically
takes $\varepsilon_{k+1}$ in $[0.9\varepsilon_k, 4\varepsilon_k]$ - i.e. it is definitely biased towards large ε-values.

[Figure, log scale: ε-values and f-values against the number of (U1)-calls, up to 2000]
Fig. 4.4.1. TSP442 with a large initial ε

(b) The Tail of the Algorithm. Theorem 3.4.6 says that $\varepsilon_k$ eventually stabilizes
at the threshold $\underline\varepsilon$. During this last phase, the algorithm becomes essentially that of
§XIII.3.1, realizing the separation process of §IX.3.3. Then, of course, f decreases
little (but descent-steps do continue to appear, due to the value $\bar m = 0.1$) and the
relevant convergence is rather that of $\{\|\hat s_k\|^2\}$ to 0.
For a more precise study of this last phase, we have used six more TSP-problems of
the type IX.2.2.7, with respectively 14, 29, 100, 120, 614 and 1173 (dual) variables.
For each such problem, Table 4.4.1 singles out the iteration $k_0$ where $\varepsilon$ reaches its
limit $\underline\varepsilon$ (which is here $10^{-4}\bar f$), and stays there till the final iteration $k_f$, where the
convergence criterion $\|\hat s\|^2 \le \delta^2 = 10^{-7}$ is met. All these experiments have been
performed with $\bar\ell = 400$, on a 15-digit computer; the strategy $\varepsilon_k = f(x_k) - \bar f$ has
been used, in order for $k_0$ to be well-defined. The last column in Table 4.4.1 gives the
number of active subgradients in the final direction-finding problem. It supposedly
gives a rough idea of the dimensionality of $S(\underline\varepsilon)$, hence of the $\underline\varepsilon$-subdifferential at the
final iterate.

Table 4.4.1. The f-phase and the ŝ-phase in TSP

TSP     k₀     k_f    k_f - k₀   ‖ŝ_{k₀}‖²   ‖ŝ_{k_f}‖²   #-active
14      14      14        0        0.4         0.             7
29      30      32        2        0.2         10⁻²⁹          7
100     45      71       26        0.7         10⁻²⁹         28
120    107     218      111        0.2         10⁻⁷          74
442    300    2211     1911        0.04        10⁻⁷         194
614    247    1257     1010        0.2         10⁻⁷         141
1173   182    1315     1133        0.5         10⁻⁶         204

Remark 4.4.1 The speed of convergence is illustrated by Fig. 4.4.2, which shows the
evolution of $\|\hat s\|^2$ during the 1911 iterations forming the last phase in TSP442. Despite the
theoretical predictions of §IX.2.2(b), it does suggest that, in practice, $\{\hat s_k\}$ converges
to 0 at a rate definitely better than sublinear (we believe that the rate - linear - is
measured by the first part of the curve, the finitely many possible s(x) explaining the
steeper second part, starting roughly at iteration 1500). This observation supplements
those in §IX.2.2(c). □

[Figure: $\|\hat s\|^2$ (log scale, from 10⁻² down to 10⁻⁷) against iterations 300 to 2300]
Fig. 4.4.2. Practical convergence of the separation process; TSP442

To measure the influence of $\underline\varepsilon$ on each of the two phases individually, Table 4.4.2
reports on the same experiments, made with TSP442, but for varying $\underline\varepsilon$. Full accuracy
was required, in the sense that $\delta$ was set to $10^{-27}$ (in view of roundoff errors, this
appeared to be a smallest allowable value, below which the method got lost in the
line-search or in the quadratic program). The column $\Delta f$ represents the final relative
error $[f(x_f) - \bar f]/\bar f$; observe that it is of order $\bar m\underline\varepsilon$.
The positive correlation between $\underline\varepsilon$ and the number of active subgradients is easy
to understand: when a set decreases, its dimensionality can but decrease. We have no
explanation for the small value of $k_f - k_0$ in the line "10⁻²".

Table 4.4.2. Full accuracy in the ŝ-phase with TSP442

ε̲/|f̄|    k₀     k_f    k_f - k₀   ‖ŝ_{k₀}‖²   ‖ŝ_{k_f}‖²   #-active
10⁻²      21   1495     1474       6.          7·10⁻⁴       357
10⁻³     100   3202     3102       0.4         6·10⁻⁵       248
10⁻⁴     330   1813     1483       0.04        5·10⁻⁶       204
10⁻⁵     536   1601     1065       0.02        1·10⁻⁶       185
10⁻⁶     710   1828     1118       0.03        4·10⁻⁷       170

4.5 The Role of Other Parameters

An important item is the maximal length $\bar\ell$ of the bundle: it is reasonable to expect
that the algorithm goes faster when $\bar\ell$ is larger, and reasonable values for $\bar\ell$ should
be roughly comparable to the dimensionality of the subdifferential near a minimum
point.
Table 4.5.1 (number of iterations and number of calls to (U1) to reach 4-digit
accuracy) illustrates this point, with the following bundle-compression strategy: when
$\ell = \bar\ell$, all nonactive elements in the previous quadratic problem are destroyed from
the bundle; if no such element is found, a full compression is done, keeping only the
aggregate element $(\hat s, \hat e)$. This strategy is rather coarse and drastically stresses the
influence of $\bar\ell$; a code sketch is given after the two tables below.

Table 4.5.1. Influence of ℓ̄ on TR48

ℓ̄              30            50         70        100       150       171
iter(#U1)  7193(13762)   821(1714)  362(693)  216(464)  201(372)  171(313)

Table 4.5.2. Influence of ℓ̄ on TSP1173

ℓ̄               2         5         10        100
iter(#U1)   101(184)  117(194)  106(186)  137(203)
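In code, the compression strategy described before Table 4.5.1 might read as follows (our sketch; the activity threshold `tol` is an assumption of ours):

```python
import numpy as np

def compress_bundle(S, e, alpha, s_hat, e_hat, tol=1e-12):
    """Sketch of the compression strategy used for Table 4.5.1:
    drop all elements inactive in the last quadratic problem
    (alpha_i ~ 0); if none is inactive, keep only the aggregate."""
    active = alpha > tol
    if active.all():                      # full compression
        return s_hat.reshape(1, -1), np.array([e_hat])
    return S[active], e[active]
```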

The same experiment, conducted with TSP1173, is reported in Table 4.5.2. The results are paradoxical, to say the least; they suggest once more that theoretical predictions in numerical analysis are suspect, as long as they are not checked experimentally.

Remark 4.5.1 One more point concerning $\bar\ell$: as already mentioned on several occasions,
most of the computing time is spent in (U1). If $\bar\ell$ becomes really large, however, solving
the quadratic problem may become expensive. It is interesting to note that, in most of our
examples (with a large number of variables, say beyond 10²), the operation that becomes
most expensive for growing $\bar\ell$ is not the quadratic problem itself, but rather the computation
of the scalar products $\langle s_+, s_i\rangle$, $i = 1, \dots, \ell$. All this indicates the need for a careful choice
of $\bar\ell$-values, and of compression strategies. □

The parameter $\bar m$ is potentially important, via its direct control of null-steps.
Comparative experiments are reported in Table 4.5.3, which just reads as Table 4.2.3
- in which $\bar m$ was 0.1. Note from (3.4.1) that the largest value allowed is $\bar m = 0.8$.
According to these results, $\bar m$ may not be crucial for performances, but it does
play a role. The main reason is that, when $\bar m$ is small, $e_+$ can be much smaller than $\varepsilon$;
then the approximation (4.1.1) becomes hazardous and the standard ε-strategy may
present difficulties. This, for example, causes the failure in the line "0.01" of TR48 via
the instability 4.2.1.

Table 4.5.3. Influence of m̄

        MAXQUAD               TR48                   TSP442
m̄      desc  null  calls     desc  null  calls     desc  null  calls
0.8      18    40     98       35   167    368       34    93    215
0.5      19    26    117       57   163    411       71   125    292
0.1      16    15     82      118    47    295       36    36    102
0.01     32     7     90      357    13   *468      236     1    242

Finally, we study the role of $\bar v$, which can take either of the values (3.3.2) or
(3.3.3). To illustrate their difference, we have recorded at each iteration of the run
"TSP442 + standard algorithm" the ratio
$$\frac{\|\hat s\|^2 + \mu\hat e}{\|\hat s\|^2}\,.$$
Figure 4.5.1 gives the histogram: it shows for example that, for 85% of the iterations,
the value (3.3.2) was not larger than 3 times the value (3.3.3). The ratio reached a
maximal value of 8, but it was larger than 5 in only 5% of the iterations. The statistics
involved 316 iterations and, to avoid a bias, we did not count the tail of the algorithm,
in which $\mu$ was constantly 0.

[Histogram: frequency (%) of the above ratio, between 1 and 5]
Fig. 4.5.1. Comparing two possible values of v̄

Thus, the difference between (3.3.2) and (3.3.3) can be absorbed by a proper redefinition of the constants m and m'. We mention that experiments conducted with
(3.3.3) give, with respect to the standard choice (3.3.2), differences which may be
non-negligible, but which are certainly not significant: they never exceed 20 calls to
(U1), for any required accuracy and any test-problem. This also means that the role
of m and m' is minor (a fact already observed with classical algorithms for smooth
functions).

4.6 General Conclusions

A relevant question is now: have we fulfilled the requirements expressed in the conclusion of §1.1? Judging from our experiments, several observations can be made.
(i) The algorithms of this chapter do enter the general framework of Chap. II. A
comparison of the line-searches (Figs. II.3.3.1 and 3.2.1) makes this point quite
clear. These algorithms can be viewed as robust minimization methods, aimed
at solving ill-conditioned problems - compare Table 4.2.4 with the "smooth"
methods in Table VIII.3.3.2. Also, they use as much as possible of the information
accumulated along the iterations, the only limitations coming from the memory
available in the computer.
(ii) On the other hand, the need for a parameter hard to choose (namely $\varepsilon$) has not
been totally eliminated: the standard Strategy 4.1.2 does lack full reliability, and
we do not know any definitely better one. The performances can sometimes rely
heavily on the accuracy required by the user; and in extreme cases, roundoff
errors may cause failure far from the sought minimum point.
(iii) The convergence of $\{\hat s_k\}$ to 0 - measured by $k_f - k_0$ in Tables 4.4.1 and 4.4.2 -
is a slow process, indeed; and it gets a lot slower when the number of variables
increases - or rather the dimension of the relevant subdifferential. This confirms
that constructing a convex set may be a difficult task, and algorithms relying
upon it may perform poorly.
(iv) On the positive side, f converges dramatically faster during the first phase of
the descent algorithm, measured by $k_0$ in the tables of §4.4. Of course, this
convergence is more painful when the required accuracy becomes more strict,
but in acceptable proportions; and also, it does not depend so much on the number
of variables; but once again, it heavily depends on the ε-strategy.
We believe that this last point (iv) is very important, in view of point (iv) in
the conclusion of §1.1: minimizing a convex function is theoretically equivalent to
separating {0} from some subdifferential; but in practice, the situation is quite different:
forcing f to decrease to $\bar f$ may be a lot easier than forcing $\|\hat s\|^2$ to decrease to 0.
Note, however, that this comment can be but heuristic: remember Remark IX.2.2.5.
A consequence of the above observation is that finding an approximate minimum
may be much simpler than checking how approximate it is: although very attractive,
the idea of having a stopping criterion of the type $f(x) \le \bar f + \varepsilon$ appears as rather
disappointing. Consider for example Table 4.4.2: on the first line, it took 1474 iterations to realize that the 21st iterate was accurate within 10⁻² (here, the question of an
adequate M is of course raised - but unsolved). These 1474 "idle" iterations were not
totally useless, though: they simultaneously improved the 21st iterate and obtained a
10⁻³-optimum. Unfortunately, the line "10⁻⁵" did 10 times better in only 536 iterations! and this is normal: §4.3 already warned us that keeping $\varepsilon_k$ constant is a bad
idea.
In summary, the crucial problem of finding an adequate $\varepsilon$ is hard, mainly because
of instabilities. Safeguarding it from below is not a real cure; it may even be a bad
idea in practice, insofar as it prevents regular decreases of the objective function.
XV. Acceleration of the Cutting-Plane Algorithm:
Primal Forms of Bundle Methods

Prerequisites. Chapter VI (subdifferentials and their elementary properties) is essential,
and Chap. VII or XII (minimality conditions and elementary duality for simple convex minimization problems) will suffice for a superficial reading. However, practically all the book
is useful for a complete understanding of every detail; in particular: Chap. II (basic principles for minimization algorithms), Chap. IV (properties of perspective-functions), Chap. IX
(bundling mechanism for descent algorithms), Chap. X (conjugacy, smoothness of the conjugate), Chap. XI (approximate subdifferentials, infimal convolution), Chap. XII (classical
algorithms for convex minimization, duality schemes in the convex case), Chap. XIV (general principles of dual bundle methods).

Introduction. In Chap. XII, we sketched two numerical algorithms for convex minimiza-
tion: subgradients and cutting planes. Apparently, they have nothing to do with each other;
one has a dual motivation, in the sense that it uses a subgradient (a dual object) as a direction
of motion from the current iterate; the second is definitely primal: the objective function is
replaced by an approximation which is minimized to yield the next iterate.
Here we study more particularly the cutting-plane method, for which we propose a number
of accelerated versions. We show that these versions are primal adaptations of the dual bundle
methods of Chap. XIV. They define a sort of continuum having two endpoints: the algorithms
of subgradients and of cutting planes; a link is thus established between these two methods.

Throughout this chapter,
$$f : \mathbb R^n \to \mathbb R \ \text{ is convex}$$
and we want to minimize f. As always when dealing with numerical algorithms,
we assume the existence of a black box (U1) which, given $x \in \mathbb R^n$, computes $f(x)$
together with some subgradient $s(x) \in \partial f(x)$.

1 Accelerating the Cutting-Plane Algorithm

We have seen in §XII.4.2 that a possible algorithm for minimizing f is the cutting-plane algorithm, and we briefly recall how it works. At iteration k, suppose that the
iterates $y_1, \dots, y_k$ have been generated and the corresponding (bundle of) information
$f(y_1), \dots, f(y_k)$, $s_1 = s(y_1), \dots, s_k = s(y_k)$ has been collected. The cutting-plane
approximation of f, associated with the sampling points $y_1, \dots, y_k$, is the piecewise
affine function of Definition XIV.2.4.1:
$$\mathbb R^n \ni y \mapsto \check f_k(y) := \max\{f(y_i) + \langle s_i, y - y_i\rangle : i = 1, \dots, k\}\,. \qquad(1.0.1)$$
There results immediately from convexity that this function is an under-estimate of
f, which is exact at each sampling point:
$$\check f_k \le f\,, \qquad \check f_k(y_i) = f(y_i) \ \text{ for } i = 1, \dots, k\,.$$
The idea of the cutting-plane algorithm is to minimize $\check f_k$ at each iteration. More
precisely, some compact convex set C is chosen, so that the problem
$$y_{k+1} \in \mathop{\rm Argmin}_{y \in C} \check f_k(y) \qquad(1.0.2)$$
does have a solution, which is thus taken as the next iterate. We recall from Theorem XII.4.2.3 the main convergence properties of this algorithm: denoting by $\bar f_C$ the
minimal value of f over C, we have
$$\check f_k(y_{k+1}) \le \check f_{k'}(y_{k'+1}) \le \bar f_C \le f(y_\ell) \quad\text{for all } k \le k' \text{ and all } \ell\,,$$
$$f(y_k) \to \bar f_C \quad\text{and}\quad \check f_k(y_{k+1}) \to \bar f_C \quad\text{when } k \to +\infty\,.$$
To make sure that our original problem is really solved, C should contain at least one
minimum point of f; so finding a convenient C is not totally trivial. Furthermore, it
is widely admitted that the numerical performance of the cutting-plane algorithm is
intolerably low. Both questions are addressed in the present chapter.
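To fix ideas, here is a small sketch (ours, not the book's) of the method just recalled, solving (1.0.2) by linear programming when C is a box; scipy's linprog plays the role of the LP solver, and f, subgrad and the box bounds are data supplied by the user.

```python
import numpy as np
from scipy.optimize import linprog

def cutting_plane(f, subgrad, lo, hi, y0, iters=50):
    """Pure cutting-plane method on the box C = [lo, hi].
    LP variables are (y, r): minimize r subject to the cuts
    f(y_i) + <s_i, y - y_i> <= r for all sampled points y_i."""
    n = len(y0)
    ys = [np.asarray(y0, dtype=float)]
    for _ in range(iters):
        A, b = [], []
        for y in ys:
            s = np.asarray(subgrad(y), dtype=float)
            A.append(np.append(s, -1.0))   # <s_i, y> - r <= <s_i, y_i> - f(y_i)
            b.append(s @ y - f(y))
        res = linprog(np.append(np.zeros(n), 1.0),
                      A_ub=np.array(A), b_ub=np.array(b),
                      bounds=[(l, h) for l, h in zip(lo, hi)] + [(None, None)])
        ys.append(res.x[:n])               # y_{k+1}, an LP minimizer
    return min(ys, key=lambda y: f(y))
```

The box is essential here: without it, linprog would typically report an unbounded problem - which is exactly the theoretical role assigned to C.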

1.1 Instability of Cutting Planes

Consider the simple example illustrated by Fig. 1.1.1: with n = 1, take $f(x) = \tfrac12 x^2$
and start with two iterates $y_1 = 1$, $y_2 = -\varepsilon < 0$. Then $y_3$ is obviously the solution of
$f_1(y) = f_2(y)$ (equality of the two affine pieces), i.e.
$$y - \tfrac12 = -\varepsilon y - \tfrac12\varepsilon^2\,,$$
and $y_3 = \tfrac12 - \tfrac12\varepsilon$. If $\varepsilon$ gets smaller, $y_2$ increases, and $y_3$ increases as well. In
algorithmic terms, we say: if the current iterate is better ($y_2$ comes closer to the
solution 0), the next iterate is worse ($y_3$ goes further from this solution); the algorithm
is unstable.

Remark 1.1.1 We mention a curious consequence of this phenomenon. Forgetting the artificial set C, consider the linear program expressing (1.0.2):
$$\inf_{y, r}\ \{r : f(y_i) + \langle s_i, y - y_i\rangle \le r \ \text{ for } i = 1, \dots, k\}\,.$$
Taking k Lagrange multipliers $\alpha_1, \dots, \alpha_k$, its dual is (see §XII.3.3 if necessary; $\Delta_k \subset \mathbb R^k$ is
the unit simplex):
$$\sup\Big\{\sum_{i=1}^k \alpha_i[f(y_i) - \langle s_i, y_i\rangle] : \alpha \in \Delta_k\,,\ \sum_{i=1}^k \alpha_i s_i = 0\Big\}\,.$$

[Figure: the two tangent lines at $y_1 = 1$ and $y_2 = -\varepsilon$, crossing at $y_3$]
Fig. 1.1.1. Instability of cutting planes

Now assume that f is quadratic; then the gradient mapping $s = \nabla f$ is affine. It follows
that, if $\alpha$ is an optimal solution of the dual problem above, we have:
$$\nabla f\Big(\sum_{i=1}^k \alpha_i y_i\Big) = \sum_{i=1}^k \alpha_i\nabla f(y_i) = \sum_{i=1}^k \alpha_i s_i = 0\,.$$
The point $y^* := \sum_{i=1}^k \alpha_i y_i$ minimizes f.
More generally, suppose that f is a differentiable convex function and assume that the
iterates $y_i$ cluster together; by continuity, $\nabla f(y^*) \approx 0$: the point $y^*$ is nearly optimal and can
be considered as a good next iterate $y_{k+1}$. Unfortunately this idea is killed by the instability
of the cutting planes: because $y_{k+1} = y^*$ is good, the next iterates will be bad. □

In our example of Fig. 1.1.1, the instability is not too serious; but it can become
disastrous in less naive situations. Indeed the next example cooks a black box (U1) for
which reducing the initial gap $f(y_1) - \bar f_C$ by a factor $\varepsilon < 1$ requires some $(1/\varepsilon)^{(n-2)/2}$
iterations: with 20 variables, a billion iterations are needed to obtain just one digit of
accuracy!

Example 1.1.2 We use an extra variable $\eta$, which plays a role for the first two iterations only. Given $\varepsilon \in\, ]0, 1/2[$, we want to minimize the function
$$\mathbb R^n \times \mathbb R \ni (y, \eta) \mapsto f(y, \eta) = \max\{|\eta|\,,\ -1 + 2\varepsilon + \|y\|\}$$
on the unit ball of $\mathbb R^n \times \mathbb R$: C in (1.0.2) is therefore taken as this unit ball. The optimal
value is obviously 0, obtained for $\eta = 0$ and y anywhere in the ball $B(0, 1 - 2\varepsilon) \subset \mathbb R^n$.
Starting the cutting-plane algorithm at $(0, 1) \in \mathbb R^n \times \mathbb R$, which gives the first objective-value 1, the question is: how many iterations will be necessary to obtain an objective-value of at most $\varepsilon$?
The first subgradient is (0, 1) and the second iterate is (0, -1), at which the
objective-value is again 1 and the second subgradient is (0, -1). The next cutting-plane problem is then
$$\inf\,\{r : \eta \le r\,,\ -\eta \le r\,,\ \|y\|^2 + \eta^2 \le 1\}\,. \qquad(1.1.1)$$
Its minimal value is 0, obtained at all points of the form (y, 0) with y describing the
unit ball B of $\mathbb R^n$. The constraints of (1.1.1) can thus be formulated as
$$r \ge 0\,, \quad \|y\|^2 \le 1 \ (\text{i.e. } y \in B) \quad\text{and}\quad \eta = 0\,;$$
in other words, the variable $\eta$ is dropped, we are now working in $B \subset \mathbb R^n$. The
above constraints will be present in all the subsequent cutting-plane problems, whose
minimal values will be nonnegative (because of the permanent constraint $r \ge 0$), and
in fact exactly 0 (because $\check f_k \le f$).
We adopt the convention that, if the kth cutting-plane problem has some solution
of norm 1, then it produces such a solution for $y_{k+1}$, rather than an interior point of
its optimal set.
For example, the third iterate $y_3$ has norm 1, its objective-value is $2\varepsilon$, and the third
cutting-plane problem appends to (1.1.1) the constraint $-1 + 2\varepsilon + \langle y_3, y\rangle \le r$.
Look at Fig. 1.1.2(a): the minimal value is still 0 and the effect of the above third
constraint is to cut from the second optimal set (B itself) the portion defined by
$$\langle y_3, y\rangle > 1 - 2\varepsilon\,.$$
More generally, as long as the kth optimal set $B_k \subset B$ contains a vector of norm 1,
the (k+1)st iterate will have an objective-value of $2\varepsilon$, and will cut from $B_k$ a similar
portion obtained by rotation. As a result, no ε-optimal solution can be produced before
all the vectors of norm 1 are eliminated by these successive cuts. For this, k must be
so large that k - 2 times the area $S(\varepsilon)$ of the cap
$$\{y \in \mathbb R^n : \|y\| = 1\,,\ \langle v, y\rangle > 1 - 2\varepsilon\}$$
is at least equal to the area $S_n$ of the boundary of B; see Fig. 1.1.2(a), where the thick
line represents $S(\varepsilon)$, and v has norm 1 (it stands for $y_3, y_4, \dots$).

Fig. 1.1.2. Cuts and surfaces in the unit ball

It is known that the area of the boundary of B(0, r) in $\mathbb R^n$ is $r^{n-1}S_n$. The area of
the infinitesimal ring displayed in Fig. 1.1.2(b), at distance r of the origin, is therefore
$$S_{n-1}\,\frac{r^{n-1}}{\sqrt{1 - r^2}}\,dr = S_{n-1}\sin^{n-1}\theta\,d\theta\,; \qquad(1.1.2)$$
hence, setting $\theta_\varepsilon := \cos^{-1}(1 - 2\varepsilon)$,
$$S(\varepsilon) = S_{n-1}\int_0^{\theta_\varepsilon}\sin^{n-1}\theta\,d\theta \le S_{n-1}\int_0^{\theta_\varepsilon}\theta^{n-1}\,d\theta = S_{n-1}\,\tfrac1n(\theta_\varepsilon)^n\,.$$
Using (1.1.2) again, $S_n = S_{n-1}\int_0^\pi \sin^{n-1}\theta\,d\theta \ge \tfrac2n S_{n-1}$. Thus, the required
number of iterations $S_n/S(\varepsilon)$ is at least $2/(\theta_\varepsilon)^n$. Knowing that $\theta_\varepsilon \approx 2\sqrt\varepsilon$, we see
the disaster. □
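The estimate is easily evaluated numerically; a short check (ours) of the bound $2/(\theta_\varepsilon)^n$:

```python
import math

def cp_lower_bound(n, eps):
    """Lower bound 2 / theta_eps^n on the number of cutting-plane
    iterations in Example 1.1.2, theta_eps = arccos(1 - 2*eps)."""
    return 2.0 / math.acos(1.0 - 2.0 * eps) ** n

print(cp_lower_bound(10, 0.01))   # already about 2e7 iterations
print(cp_lower_bound(20, 0.01))   # about 2e14: the announced disaster
```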

It is interesting to observe that the instability demonstrated above has the same
origin as the need to introduce C in (1.0.2): the minima of $\check f_k$ are hardly controlled,
in extreme cases they are "at infinity". The artificial set C had primarily a theoretical
motivation: to give a meaning to the problem of minimizing $\check f_k$; but this appears to
have a pragmatic supplement: to prevent a wild behaviour of $\{y_k\}$. Accordingly, we
can even say that C (which after all is not so artificial) should be "small", which also
implies that it should be "appropriately located" to catch a minimum of f. Actually,
one should rather have a varying set $C_k$, with diameter shrinking to 0 as the cutting-plane iterations proceed. Such was not the case in our counter-example above: the
"stability set" C remained equal to the entire B(0, 1) for an enormous number of
iterations. This seemingly innocent remark can be considered as a starting point for
the methods in this chapter.

1.2 Stabilizing Devices: Leading Principles

Our stability issue existed already in Chap. II: in §II.2.1, we wanted to minimize a
certain directional derivative $\langle s_k, \cdot\rangle$, which played the role of $\check f_k$ - both functions are
identical for k = 1. Due to positive homogeneity, a normalization had to be introduced;
the situation was the same for the steepest-descent problem of §VIII.1.
We choose to follow the same idea here, and we describe abstractly the kth iteration
of a stabilized algorithm as follows:
(i) we have a model, call it $\varphi_k$, supposed to represent f;
(ii) we choose a stability center, call it $x_k$;
(iii) we choose a norming, call it |||·|||_k;
(iv) then we compute a next iterate $y_{k+1}$ realizing a compromise between diminishing
the model:
$$\varphi_k(y_{k+1}) < \varphi_k(x_k)\,,$$
and keeping close to the stability center:
$$|||y_{k+1} - x_k|||_k \ \text{ small.}$$


In Chap. II or VIII, $\varphi_k$ approximated f to first order and was positively homogeneous; we wanted to solve
$$\min\,\{\varphi_k(y) : |||y - x_k|||_k = \kappa\}$$
with $\varphi_k(y) = f'(x_k, y - x_k)$. As shown in §VIII.1.2, this was "equivalent" to
$$\min\,\{\varphi_k(y) : |||y - x_k|||_k \le \kappa\}$$
or also to
$$\min_y\,[\varphi_k(y) + \mu|||y - x_k|||_k]$$
(see (II.2.3.1), (II.2.3.2), for example), and the resulting direction $y_{k+1} - x_k$ was
independent of $\kappa > 0$ or $\mu > 0$. Here the model $\varphi_k = \check f_k$ is no longer positively
homogeneous; we see immediately an important consequence: the parameter $\kappa$ or $\mu$
will have a decisive influence on the resulting $y_{k+1}$.
Just as for the first-order models, we emphasize that this stabilizing trick is able
to cope simultaneously with the two main deficiencies of the pure cutting-plane algorithm: the need for a compact set C is totally eliminated, and the stability question
will be addressed by a proper management of $x_k$ and/or |||·|||_k.
Remark 1.2.1 Mixing the cutting-plane model with a norming is arguably a rather
ambiguous operation:
- As already observed, some flexibility is added via $\kappa$ or $\mu$ (with the first-order models
of §II.2.2, these parameters had a trifling role). Because varying $\kappa$ or $\mu$ amounts to
dilating |||·|||_k, we can also say that the size of the normalizing unit ball is important,
in addition to its shape.
- Again because of the lack of positive homogeneity, the solution $y_{k+1}$ of the problem
in κ-form should not be viewed as giving a direction issuing from $x_k$. As a result,
the concept of line-search is going to be inappropriate.
- The cutting-plane function has a global character, which hardly fits with the localizing effect of the normalization: $\check f_k$ cannot be considered as a local approximation of
f near a distinguished point like $x_k$. Actually, unless f happens to be very smooth,
an accurate approximation $\varphi_k$ of f is normally out of reach, in contrast to what we
saw in Chap. II: see the Newton-like models of §II.2.3.
All in all, one might see more negative than positive aspects in our technique;
but nothing better is available yet. On the other hand, several useful interpretations
will be given in the rest of the chapter, which substantially increase the value of the
idea. For the moment, let us say that the above remarks must be kept in mind when
specifying our stabilizing devices. □

Let us now review the list of items (i)-(iv) of the beginning of this Section 1.2.
(i) The model $\varphi_k$, i.e. the cutting-plane function $\check f_k$ of (1.0.1), can be enriched by
one affine piece for example every time a new iterate $y_{k+1}$ is obtained from the
model-problem in (iv). Around this general philosophy, several realizations are
possible. Modelling is actually just the same as the bundling concept of §XIV.1.2;
in particular, an important ingredient will be the aggregation technique already
seen on several occasions.

(ii) The stability center should approximate a minimum point of f, as reasonably as
possible; using objective-values to define the word "reasonably", the idea will
be to take for $x_k$ one of the sampling points $y_1, \dots, y_k$ having the best f-value.
Actually, a recursive strategy will be used: at each iteration, the stability center
is either left as it is, or moved to the new iterate $y_{k+1}$, depending on how good
$f(y_{k+1})$ is.
(iii) Norming is quite an issue; our study will be limited to multiples of a fixed norm.
Said otherwise, our attention will be focused on the $\kappa$ or the $\mu$ considered above,
and |||·|||_k will be kept fixed (normally to the Euclidean norm).
(iv) Section 2 will be devoted to the stabilized problem; three conceptually equivalent
possibilities will be given, which can be viewed as primal interpretations of the
dual bundle method of Chap. XIV.
In fact, the whole stabilizing idea is primarily characterized by (ii), which relies
on the following crucial technique. In addition to the next iterate $y_{k+1}$, the stabilized
problem yields a "nominal decrease" for f; this is a nonnegative number $\delta_k$, giving an
idea of the gain $f(x_k) - f(y_{k+1})$ to be expected from the current stability center $x_k$
to the next iterate $y_{k+1}$; see Example 1.2.3 below. Then $x_k$ is set to $y_{k+1}$ if the actual
gain is at least a fraction of the "ideal" gain $\delta_k$. With emphasis put on the management
of the stability center, the resulting algorithm looks like the following:

Algorithm 1.2.2 (Schematic Stabilized Algorithm) Start from some $x_1 \in \mathbb R^n$;
choose a descent coefficient $m \in\, ]0, 1[$ and a stopping tolerance $\underline\delta \ge 0$. Initialize
k = 1.
STEP 1. Choose a convex model-function $\varphi_k$ (for example $\check f_k$) and a norming |||·|||_k
(for example a multiple of the Euclidean norm).
STEP 2. Solve the stabilized model-problem, whatever it is, to obtain the next iterate
$y_{k+1}$. Upon observation of $\varphi_k(x_k) - \varphi_k(y_{k+1})$, choose a nominal decrease $\delta_k \ge 0$
for f.
STEP 3. If $\delta_k \le \underline\delta$ stop.
STEP 4. If
$$f(x_k) - f(y_{k+1}) \ge m\delta_k \qquad(1.2.1)$$
set $x_{k+1} = y_{k+1}$; otherwise set $x_{k+1} = x_k$. Replace k by k + 1 and loop to Step 1.
□
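For concreteness, here is a sketch (ours) of one way to instantiate this scheme, with $\varphi_k = \check f_k$ and the penalized stabilization $\tfrac\mu2\|y - x_k\|^2$ anticipated in Example 1.2.3(b); the fixed $\mu$, the SLSQP solver and the ever-growing bundle are all simplifying assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def solve_stabilized(pts, x, mu):
    """Stabilized model-problem of Step 2: minimize, over (y, r),
    r + (mu/2)||y - x||^2 subject to the cutting planes <= r."""
    n = len(x)
    cons = [{"type": "ineq",
             "fun": (lambda z, fy=fy, s=s, yi=yi:
                     z[n] - fy - s @ (z[:n] - yi))}
            for fy, s, yi in pts]
    obj = lambda z: z[n] + 0.5 * mu * np.dot(z[:n] - x, z[:n] - x)
    z0 = np.append(x, max(fy for fy, _, _ in pts))
    return minimize(obj, z0, constraints=cons, method="SLSQP").x[:n]

def stabilized_cp(f, subgrad, x1, mu=1.0, m=0.1, tol=1e-8, iters=100):
    """Sketch of the schematic Algorithm 1.2.2."""
    x = np.asarray(x1, dtype=float)
    pts = [(f(x), np.asarray(subgrad(x), dtype=float), x.copy())]
    for _ in range(iters):
        y = solve_stabilized(pts, x, mu)          # Step 2
        model = max(fy + s @ (y - yi) for fy, s, yi in pts)
        delta = f(x) - model                      # one possible nominal decrease
        if delta <= tol:                          # Step 3
            break
        pts.append((f(y), np.asarray(subgrad(y), dtype=float), y.copy()))
        if f(x) - f(y) >= m * delta:              # Step 4: descent-test (1.2.1)
            x = y                                 # descent-step; else null-step
    return x
```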
Apart from moving the stability center, several other decisions can be based on
the descent-test (1.2.1), or on more sophisticated versions of it: essentially the model
and/or the norming may or may not be changed. Classical algorithms of Chap. II
can be revisited in the light of the above approach, and this is not a totally frivolous
exercise.

Example 1.2.3 (Line-Search) At the given $x_k$, take the first-order approximation
$$\varphi_k(y) = f(x_k) + \langle s_k, y - x_k\rangle \qquad(1.2.2)$$
as model and choose a symmetric positive definite operator Q to define the norming.

(a) Steepest Descent, First Order. Let the stabilized problem be
$$\min\,\{\varphi_k(y) : \langle Q(y - x_k), y - x_k\rangle \le \kappa\} \qquad(1.2.3)$$
for some radius $\kappa > 0$; stabilization is thus forced via an explicit constraint. From
the minimality conditions of Theorem VII.2.1.4, there is a multiplier $\mu \ge 0$ such that
$$s_k + \mu Q(y - x_k) = 0\,. \qquad(1.2.4)$$
From there, we get
$$\langle s_k, y - x_k\rangle = -\mu\langle Q(y - x_k), y - x_k\rangle
= -\mu\kappa \quad\text{[transversality condition]}
= -\sqrt{\kappa\langle s_k, Q^{-1}s_k\rangle}$$
(to obtain the last equality, take the scalar product of (1.2.4) with $\kappa Q^{-1}s_k$). We interpret these relations as follows: between the stability center and a solution of the
stabilized problem, the model decreases by
$$\delta_k := \varphi_k(x_k) - \varphi_k(y) = \sqrt{\kappa\langle s_k, Q^{-1}s_k\rangle}\,.$$
Knowing that $\varphi_k(x_k) = f(x_k)$ and $\varphi_k \le f$, this $\delta_k$ can be viewed as a "nominal
decrease" for f. Then Algorithm 1.2.2 corresponds to the following strategy:
(i) If $\delta_k$ is small, stop the algorithm, which is justified to the extent that $\|s_k\|$ is small.
This in turn is the case if $\kappa$ is not too small, and the maximal eigenvalue of Q is
not too large: then the original problem, or its first-order approximation, is not
too affected by the stabilization constraint.
(ii) If $\delta_k$ is far from zero, the next iterate $y_{k+1} = x_k - (1/\mu_k)Q^{-1}s_k$ is well defined;
setting $d_k := -Q^{-1}s_k$ (a direction) and $t_k := 1/\mu_k$ (a stepsize), (1.2.1) can be
written
$$f(x_k + t_k d_k) \le f(x_k) + m t_k\langle s_k, d_k\rangle\,.$$

We recognize the descent test (II.3.2.1), universally used for classical line-searches. This
test is thus encountered once more, just as in §XIV.3.1. With Remark II.3.2.3 in mind, observe
the interpretative role of $\delta_k = -t_k\langle s_k, d_k\rangle$ in terms of the initial derivative of a line-search function:
(ii₁) Moving $x_k$ in Step 4 of Algorithm 1.2.2 can be interpreted as stopping a line-search;
this decision is thus made whenever "the stepsize is not too large": the concept of a
"stepsize too small" has now disappeared.
(ii₂) What should be done in the remaining cases, when $y_{k+1}$ is not good? This is not specified
in Algorithm 1.2.2, except perhaps an allusion in Step 1. In a classical line-search, the
model would be left unchanged, and the norming would be reinforced: (1.2.3) would be
solved again with the same $\varphi_k$ but with a smaller $\kappa$, i.e. a larger $\mu$, or a smaller stepsize
$1/\mu$.

(b) Steepest Descent, Second Order. With the same $\varphi_k$ of (1.2.2), take the stabilized
problem as

$$\min_y\ \varphi_k(y) + \tfrac12\mu\langle Q(y - x_k), y - x_k\rangle \qquad(1.2.5)$$

for some penalty coefficient $\mu > 0$. The situation is quite comparable to that in (a)
above; a slight difference is that the multiplier $\mu$ is now explicitly given, as well as
the solution $x_k - (1/\mu)Q^{-1}s_k$.
A more subtle difference concerns the choice of $\delta_k$: the rationale for (a) was to
approximate $f$ by $\varphi_k$, at least on some restricted region around $x_k$. Here we are bound
to consider that $f$ is approximated by the minimand in (1.2.5); otherwise, why solve
(1.2.5) at all? It is therefore more logical to set

$$\delta_k = \varphi_k(x_k) - \varphi_k(y_{k+1}) - \tfrac12\mu\langle Q(y_{k+1} - x_k), y_{k+1} - x_k\rangle = \tfrac1{2\mu}\langle s_k, Q^{-1}s_k\rangle\,.$$

With respect to (a), the nominal decrease of $f$ is divided by two; once again, remember
Remark II.3.2.3, and also Fig. II.3.2.2. □
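
The two nominal decreases can be compared numerically. The sketch below uses hypothetical data $Q$, $s_k$, $\kappa$; it computes the $\delta_k$ of case (a), then the $\delta_k$ of case (b) with $\mu$ set to the multiplier produced in (a), and the factor two becomes visible.

```python
import numpy as np

Q = np.diag([2.0, 0.5])            # assumed norming operator
s = np.array([1.0, -1.0])          # assumed gradient s_k
kappa = 0.25                       # assumed radius
q = s @ np.linalg.solve(Q, s)      # <s_k, Q^{-1} s_k>
delta_a = np.sqrt(kappa * q)       # case (a)
mu = np.sqrt(q / kappa)            # multiplier coming from (1.2.4)
delta_b = q / (2 * mu)             # case (b) with this mu
print(delta_a, 2 * delta_b)        # both print 0.7906...: delta_b = delta_a / 2
```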

Definition 1.2.4 We will say that, when the stability center is changed in Algorithm 1.2.2, we make a descent-step; otherwise we make a null-step.
The set of descent iterations is denoted by $K \subset \mathbb{N}$:

$$K := \{k \in \mathbb{N} \;:\; x_{k+1} = y_{k+1}\}\,.$$

The above terminology suggests a link between the bundle methods of the preceding chapters and the line-searches of Chap. II; at the same time it points out the
main difference: in case of a null-step, bundle methods enrich the piecewise affine
model $\varphi_k$, while line-searches act on the norming (shortening $\kappa$, increasing $\mu$) and do
not change the model.

1.3 A Digression: Step-Control Strategies

The ambiguity between cases (a) and (b) in Example 1.2.3 suggests a certain inconsistency in the line-search principle. Let us come back to the situation of Chap. II,
with a smooth objective function $f$, and let us consider a Newton-type approach: we
have on hand a symmetric positive definite operator $Q$ representing, possibly equal
to, the Hessian $\nabla^2 f(x_k)$. Then the model (quadratic but not positively homogeneous)

$$\varphi_k(y) := f(x_k) + \langle s_k, y - x_k\rangle + \tfrac12\langle Q(y - x_k), y - x_k\rangle \qquad(1.3.1)$$

is supposedly a good approximation of $f$ near $x_k$, viewed as a stability center. Accordingly, we compute

$$\bar y = \operatorname{argmin}\varphi_k \quad\text{and}\quad d_k = \bar y - x_k\,,$$

hoping that the Newtonian point $\bar y$ is going to be the next iterate.

Then the line-search strategy of §II.3 is as follows: check if the stepsize $t = 1$ in the
update $x_{k+1} = x_k + t d_k$ is suitable; if not, use the actual behaviour of $f$, as described
by the black box (U1), to search the next iterate along the half-line $x_k + \mathbb{R}_+ d_k$. Now
we ask an embarrassing question: what is so magic with this half-line? If the descent
test is not passed by our Newtonian point $\bar y$, why in the world should $x_{k+1}$ lie on the
half-line issuing from $x_k$ and pointing to it? Needless to say, this half-line has nothing
magic: it is present for historical reasons only, going back to 1847 when A. Cauchy
invented the steepest-descent method.
The stabilizing concepts developed in §1.2 suggest a much more natural idea. In
fact, with the model $\varphi_k$ of (1.3.1), take for example a stabilized problem in constrained
form

$$\min\,\{\varphi_k(y) \;:\; \|y - x_k\| \le \kappa\} \qquad(1.3.2)$$

and call $y(\kappa)$ its solution. Considering $\kappa > 0$ as an unknown parameter, the answer
of the black box (U1) at $y(\kappa)$ can be used to adjust $\kappa$, instead of a stepsize along the
direction $\bar y - x_k$. When $\kappa$ describes $\mathbb{R}_+$, $y(\kappa)$ describes a certain curve parameterized
by $\kappa$; we can still say that $\kappa$ is a stepsize, but along the curve in question, instead of
a half-line. If we insisted on keeping traditional terminology, we could say that we
have a line-search where the direction depends on the stepsize. Officially, the present
approach is rather called the trust-region technique: the constraint in (1.3.2) defines
a region in which the model $\varphi_k$ is expectedly trustful.
To exploit this idea, it suffices to design a test (O), (R), (L) as in §II.3.1, using the
"curved-search function" $q(\kappa) := f(y(\kappa))$ to search $y(\kappa)$ along the curve implicitly
defined by (1.3.2). The tests seen in §II.3.2 use the initial derivative $q'(0)$, which is unknown; but this difficulty has an easy solution: it is time to remember Remark II.3.2.3,
which related the use of $q'(0)$ to the concept of a model. Just as in §1.2, the nominal
decrease $\varphi_k(x_k) - \varphi_k(y(\kappa))$ can be substituted for the would-be $-t\,q'(0)$.
To give a specific illustration, let us adapt the criterion of Goldstein and Price to
the case of a "curved-search".

Algorithm 1.3.1 (Goldstein and Price Curved-Search) The data are: the current
iterate $x_k$, the model $\varphi_k$, the descent coefficients $m \in ]0,1[$ and $m' \in ]m,1[$. Set
$\kappa_L = 0$ and $\kappa_R = 0$; take an initial $\kappa > 0$.

STEP 0. Solve (1.3.2) to obtain $y(\kappa)$ and set $\delta := f(x_k) - \varphi_k(y(\kappa))$.
STEP 1 (test for large $\kappa$). If $f(y(\kappa)) > f(x_k) - m\delta$, set $\kappa_R = \kappa$ and go to Step 4.
STEP 2 ($\kappa$ is not too large). If $f(y(\kappa)) \ge f(x_k) - m'\delta$, stop the curved-search with
$x_{k+1} = y(\kappa)$. Otherwise set $\kappa_L = \kappa$ and go to Step 3.
STEP 3 (extrapolation). If $\kappa_R > 0$ go to Step 4.
Otherwise find a new $\kappa$ by extrapolation beyond $\kappa_L$ and loop to Step 0.
STEP 4 (interpolation). Find a new $\kappa$ by interpolation in $]\kappa_L, \kappa_R[$ and loop to Step 0.
□
If $\varphi_k$ does provide a relevant Newtonian point $\bar y$, the initial $\kappa$ should be chosen large
enough to produce it. Afterwards, each new $\kappa$ could be computed by polynomial interpolation,
as in §II.3.4; this implies a parametric study of (1.3.2), so as to get the derivative of the
curved-search function $\kappa \mapsto f(y(\kappa))$. Note also that the same differential information would
be needed if we wanted to implement a "Wolfe curved-search".
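
For concreteness, here is a minimal Python transcription of this curved-search with the bracketing logic made explicit. The routine `solve_tr(kappa)`, returning $y(\kappa)$ and $\varphi_k(y(\kappa))$, is a hypothetical placeholder, and the extrapolation/interpolation formulas are the crudest possible (doubling and bisection instead of the polynomial fitting discussed above).

```python
def curved_search(f, fx, solve_tr, kappa, m=0.1, mp=0.9, itmax=50):
    # Sketch of Algorithm 1.3.1; fx = f(x_k), 0 < m < mp < 1.
    kL, kR, y = 0.0, 0.0, None
    for _ in range(itmax):
        y, phi = solve_tr(kappa)              # Step 0: solve (1.3.2)
        delta = fx - phi                      # nominal decrease
        if f(y) > fx - m * delta:             # Step 1: kappa too large
            kR = kappa
        elif f(y) >= fx - mp * delta:         # Step 2: kappa accepted
            return y
        else:                                 # decrease too good: kappa too small
            kL = kappa
        kappa = (kL + kR) / 2 if kR > 0 else 2 * kL   # Steps 3-4
    return y
```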

Remark 1.3.2 This trust-region technique was initially motivated by non-convexity. Suppose for example that $Q$ in (1.3.1) is indefinite, as might well happen with $Q = \nabla^2 f(x_k)$ if
$f$ is not convex. Then the Newtonian point $\bar y = x_k - Q^{-1}s_k$ becomes suspect (although it is
still of interest for solving the equation $\nabla f(x) = 0$); it may not even exist if $Q$ is degenerate.
Furthermore, a line-search along the associated direction $d_k = \bar y - x_k$ may be disastrous
because $f'(x_k; d_k) = \langle s_k, d_k\rangle = -\langle Q d_k, d_k\rangle$ need not be negative.
Indeed, Newton's method with line-search has little relevance in a non-convex situation.
By contrast, the solution $y(\kappa)$ of (1.3.2) makes a lot of sense in terms of minimizing $f$:

- It always exists, since the trust-region is compact. Much more general models $\varphi_k$ could even
be handled; the only issue would be the actual computation of $y(\kappa)$.
- To the extent that $\varphi_k$ really approximates $f$ to second order near $x_k$, (1.3.2) is consistent
with the original problem.
- If $Q$ happens to be positive definite, the Newtonian point is recovered, provided that $\kappa$ is
chosen large enough.
- If, for some reason, the curved-search produces small values of $\kappa$, the move from $x_k$ to $y(\kappa)$
is made roughly along the steepest-descent direction; and this is good: the steepest-descent
direction, precisely, is steepest for small moves. □

Admitting that a model such as $\varphi_k$ is trusted around $x_k$ only, the trust-region technique
is still relevant even when $\varphi_k$ is nicely convex. In this case, the stabilized problem can also
be formulated as

$$\min_y\ \varphi_k(y) + \tfrac12\mu\|y - x_k\|^2$$

instead of (1.3.2) (Proposition VII.3.1.4). The parameter $\mu$ will represent an alternative curvilinear coordinate, giving the explicit solution $x_k - (Q + \mu I)^{-1}s_k$ if the model is (1.3.1).

The interest of our digression is to view the stabilization introduced in §1.2 as
some form of trust-region technique. Indeed, consider one iteration of Algorithm 1.2.2,
and suppose that a null-step is made. Suppose that Step 1 then keeps the same model
$\varphi_k$ but enhances the norming by taking a larger $\mu$, or a smaller $\kappa$. This corresponds
to Algorithm 1.3.1 with no Step 2: $y(\kappa)$ is accepted as soon as a sufficient decrease
is obtained in Step 1.

2 A Variety of Stabilized Algorithms

We give in this section several algorithms realizing the general scheme of §1.2. They
use conceptually equivalent stabilized problems in Step 2 of Algorithm 1.2.2. Three
of them are formulated in the primal space $\mathbb{R}^n$; via an interpretation in the dual space,
they are also conceptually equivalent to the bundle method of Chap. XIV.
The notations are those of §1: $\check f_k$ is the cutting-plane function of (1.0.1); the Euclidean norm $\|\cdot\|$ is
assumed for the stabilization (even though our development is only descriptive, and
could accommodate more general situations).
2.1 The Trust-Region Point of View

The first idea that comes to mind is to force the next iterate to be a priori in a ball
associated with the given norming, centered at the given stability center, and having
a given radius $\kappa$. The sequence of iterates is thus defined by

$$y_{k+1} \in \operatorname{Argmin}\,\{\check f_k(y) \;:\; y \in B(x_k, \kappa)\}\,.$$

This approach has the same rationale as Example 1.2.3(a). The original model $\check f_k$
is considered as a good approximation of $f$ in $B(x_k,\kappa)$, a trust-region drawn around
the stability center. Accordingly, $\check f_k$ is minimized in this trust-region, any point outside
it being disregarded. The resulting algorithm in its crudest form is then as follows.

Algorithm 2.1.1 (Cutting Planes with Trust Region) The initial point $x_1$ is given,
together with a stopping tolerance $\bar\delta \ge 0$. Choose a trust-region radius $\kappa > 0$ and a
descent coefficient $m \in ]0,1[$. Initialize the descent-set $K = \emptyset$, the iteration-counter
$k = 1$ and $y_1 = x_1$; compute $f(y_1)$ and $s_1 = s(y_1)$.

STEP 1. Define the model

$$y \mapsto \check f_k(y) := \max\,\{f(y_i) + \langle s_i, y - y_i\rangle \;:\; i = 1,\dots,k\}\,.$$

STEP 2. Compute a solution $y_{k+1}$ of

$$\min\,\{\check f_k(y) \;:\; y \in B(x_k, \kappa)\} \qquad(2.1.1)$$

and set

$$\delta_k := f(x_k) - \check f_k(y_{k+1})\,.$$

STEP 3. If $\delta_k \le \bar\delta$ stop.

STEP 4. Compute $f(y_{k+1})$ and $s_{k+1} = s(y_{k+1})$. If

$$f(y_{k+1}) \le f(x_k) - m\,\delta_k\,,$$

set $x_{k+1} = y_{k+1}$ and append $k$ to the set $K$ (descent-step). Otherwise set $x_{k+1} = x_k$ (null-step).
Replace $k$ by $k+1$ and loop to Step 1. □
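
A compact and purely illustrative Python implementation of this algorithm can be written as follows; the trust-region subproblem (2.1.1) is passed in epigraph form to scipy's general-purpose SLSQP solver, which is of course not the efficient resolution scheme alluded to in Remark 2.1.2 below.

```python
import numpy as np
from scipy.optimize import minimize

def tr_step(x, kappa, F, S, Y):
    # Solve (2.1.1) in epigraph form: min r s.t. r >= pieces, ||y - x|| <= kappa.
    n = x.size
    cons = [{'type': 'ineq',
             'fun': lambda z, i=i: z[n] - F[i] - S[i] @ (z[:n] - Y[i])}
            for i in range(len(F))]
    cons.append({'type': 'ineq',
                 'fun': lambda z: kappa**2 - (z[:n] - x) @ (z[:n] - x)})
    r0 = max(F[i] + S[i] @ (x - Y[i]) for i in range(len(F)))
    z = minimize(lambda z: z[n], np.append(x, r0),
                 constraints=cons, method='SLSQP').x
    return z[:n], z[n]                        # y_{k+1} and fcheck_k(y_{k+1})

def cutting_planes_tr(f, sub, x1, kappa=1.0, m=0.1, tol=1e-6, kmax=100):
    x, fx = x1.astype(float), f(x1)
    Y, F, S = [x.copy()], [fx], [sub(x)]      # bundle of sampling points
    for _ in range(kmax):
        y, model_val = tr_step(x, kappa, F, S, Y)
        delta = fx - model_val                # nominal decrease of Step 2
        if delta <= tol:                      # Step 3
            break
        fy = f(y)
        Y.append(y); F.append(fy); S.append(sub(y))
        if fy <= fx - m * delta:              # descent test of Step 4
            x, fx = y, fy
    return x

# example: minimize |x1| + |x2| from (3, -2); np.sign gives a valid subgradient
# cutting_planes_tr(lambda x: np.abs(x).sum(), np.sign, np.array([3.0, -2.0]))
```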

Here $K$ represents the set of descent iterations, see Definition 1.2.4 again. Its role
is purely notational and will appear when we study convergence; the algorithm could
well be described without any reference to $K$. Concerning the nominal decrease $\delta_k$, it is
useful to understand that $f(x_k) = \check f_k(x_k)$ by construction. When a series of null-steps
is taken, the cutting-plane algorithm is applied within the stability set $C := B(x_k, \kappa)$,
which is changed at every descent-step. Such a change happens possibly long before
$f$ is minimized on $C$, and this is crucial for efficiency: minimizing $f$ accurately over
$C$ is a pure waste of time if $C$ is far from a minimum of $f$.

Remark 2.1.2 An efficient solution scheme for the stabilized problem (2.1.1) can be developed, based on duality: for $\mu > 0$, the Lagrange function

$$L(y,\mu) := \check f_k(y) + \tfrac12\mu\big(\|y - x_k\|^2 - \kappa^2\big)$$

has a unique minimizer $y(\mu)$, which can be computed exactly. It suffices to find $\mu > 0$ solving
the equation $\|y(\mu) - x_k\| = \kappa$, if there is one; otherwise $\check f_k$ has an unconstrained minimum
in $B(x_k, \kappa)$. Equivalently, the concave function $\mu \mapsto L(y(\mu), \mu)$ must be maximized over
$\mu \ge 0$. □

The solution-set of (2.1.1) has a qualitative description.

Proposition 2.1.3 Denote by $\kappa_\infty \ge 0$ the distance from $x_k$ to the minimum-set of $\check f_k$
(with the convention $\kappa_\infty = +\infty$ if $\operatorname{Argmin}\check f_k = \emptyset$). For $0 \le \kappa \le \kappa_\infty$, (2.1.1) has
a unique solution, which lies at distance $\kappa$ from $x_k$. For $\kappa \ge \kappa_\infty$, the solution-set of
(2.1.1) is $\operatorname{Argmin}\check f_k \cap B(x_k, \kappa)$.

PROOF. The second statement is trivial (when applicable, i.e. when $\kappa_\infty < +\infty$).
Now take $\eta \ge 0$ such that the sublevel-set

$$S := \{y \in \mathbb{R}^n \;:\; \check f_k(y) \le \eta\}$$

is nonempty, and let $x^*$ be the projection of $x_k$ onto $S$. If $\kappa \le \|x^* - x_k\| \le \kappa_\infty$,
any solution $y$ of (2.1.1) must be at a distance $\kappa$ from $x_k$: otherwise $y$, lying in the
interior of $B(x_k, \kappa)$, would minimize $\check f_k$ locally, hence globally, and the property
$\|y - x_k\| < \kappa \le \|x^* - x_k\|$ would contradict the definition of $x^*$. The solution-set of
(2.1.1) is therefore a convex set on the surface of a Euclidean ball: it is a singleton.
□
Now we turn to a brief study of convergence, which uses typical arguments. First
of all, $\{f(x_k)\}$ is a decreasing sequence, which has a limit in $\mathbb{R}\cup\{-\infty\}$. If $\{f(x_k)\}$
is unbounded from below, $f$ has no minimum and $\{x_k\}$ is a "minimizing" sequence.
The only interesting case is therefore

$$f^* := \lim_{k\to+\infty} f(x_k) > -\infty\,. \qquad(2.1.2)$$

Lemma 2.1.4 With the notation (2.1.2), and $m$ denoting the descent coefficient in
Step 4 of Algorithm 2.1.1, there holds $\sum_{k\in K}\delta_k \le [f(x_1) - f^*]/m$.

PROOF. Take $k \in K$, so that the descent test gives

$$\delta_k \le \frac{f(x_k) - f(x_{k+1})}{m}\,.$$

Let $k'$ be the successor of $k$ in $K$. Because the stability center is not changed after a
null-step, $f(x_{k+1}) = \cdots = f(x_{k'})$ and we have

$$\delta_{k'} \le \frac{f(x_{k'}) - f(x_{k'+1})}{m} = \frac{f(x_{k+1}) - f(x_{k'+1})}{m}\,.$$

The recurrence is thus established: sum over $K$ to obtain the result. □

This shows that, if no null-step were made, the method would converge rather
fast: $\{\delta_k\}$ would certainly tend to 0 faster than $\{k^{-1}\}$ (to be compared with the speed
$k^{-2/n}$ of Example 1.1.2). In particular, the stopping criterion would act after finitely
many iterations. Note that, when the method stops, we have by construction

$$f(x) \ge f(x_k) - \bar\delta \quad\text{for all } x \in B(x_k, \kappa)\,.$$

Using convexity, an approximate optimality condition is derived for $x_k$; it is this idea
that is exploited in Case 2 of the proof below.

Theorem 2.1.5 Let Algorithm 2.1.1 be used with fixed $\kappa > 0$, $m \in ]0,1[$ and $\bar\delta = 0$.
Then $\{x_k\}$ is a minimizing sequence.

PROOF. We take (2.1.2) into account and we distinguish two cases.

[Case 1: $K$ is an infinite set of integers] Suppose for contradiction that there are
$\bar x \in \mathbb{R}^n$ and $\eta > 0$ such that

$$f(\bar x) \le f(x_k) - \eta \quad\text{for } k = 1, 2, \dots$$

Lemma 2.1.4 tells us that $\lim_{k\in K}\delta_k = 0$. Hence, for $k$ large enough in $K$:

$$\delta_k < \eta\,,$$

and $B(x_k, \kappa)$ cannot contain $\bar x$. Then consider the construction of Fig. 2.1.1: $z_k$ is
between $x_k$ and $\bar x$, at a distance $\kappa$ from $x_k$, and we set $r_k := \|\bar x - z_k\|$. We have

$$f(z_k) \ge \check f_k(z_k) \ge \check f_k(y_{k+1}) = f(x_k) - \delta_k\,;$$

from convexity,

$$f(z_k) \le \frac{r_k}{\kappa + r_k}f(x_k) + \frac{\kappa}{\kappa + r_k}f(\bar x) \le f(x_k) - \frac{\kappa\eta}{\kappa + r_k}\,.$$

In other words

$$\frac{\kappa\eta}{\kappa + r_k} \le \delta_k\,.$$

Write this inequality for each $k$ (large enough) in $K$ and sum up. By construction,
$\|x_{k+1} - x_k\| \le \kappa$ and the triangle inequality implies that $r_{k+1} \le r_k + \kappa$ for all $k \in K$: the
left-hand side forms a divergent series; but this is impossible in view of Lemma 2.1.4.

[Case 2: $K$ is finite] In this second case, only null-steps are taken after some iteration
$k^*$:

$$f(y_{k+1}) > f(x_{k^*}) - m\,\delta_k \quad\text{for all } k \ge k^*. \qquad(2.1.3)$$

Fig. 2.1.1. Majorization outside the trust-region

From then on, Algorithm 2.1.1 reduces to the ordinary cutting-plane algorithm applied on the (fixed) compact set $C = B(x_{k^*}, \kappa)$. It is convergent: when $k \to +\infty$,
Theorem XII.4.2.3 tells us that $\{\check f_k(y_{k+1})\}$ and $\{f(y_{k+1})\}$ tend to the minimal value
of $f$ on $B(x_{k^*}, \kappa)$, call it $\tilde f^*$. Passing to the limit in (2.1.3),

$$(1 - m)\tilde f^* \ge (1 - m)f(x_{k^*})\,.$$

Because $m < 1$, we conclude that $x_{k^*}$ minimizes $f$ on $B(x_{k^*}, \kappa)$, hence on the whole
space. □

Note in the above proof that null-steps and descent-steps call for totally different arguments. In one case, similar to §XII.4.2 or §XIV.3.4, the key is the accumulation of the
information into the successive models $\check f_k$. The second case uses the definite decrease of $f$ at
each descent iteration, and this connotes the situation in §II.3.3. Another remark, based upon
this proof, is that the stop does occur at some finite $k$ if $\bar\delta > 0$: either because of Lemma 2.1.4,
or because $\check f_k(y_{k+1}) \uparrow f(x_{k^*})$ in Case 2.

We thus have a convergent stabilization of the cutting-plane algorithm. However,
this algorithm is too simple: even though the number of descent-steps is relatively
small (Lemma 2.1.4), many null-steps might still be necessary. In fact, Case 2 in the
proof of Theorem 2.1.5 does suggest that we have not reached our goal. To obtain
something really efficient, the radius $\kappa$ of the trust-region should tend to 0 when
$k \to \infty$. Otherwise, the disastrous Counter-example 1.1.2 would again crop up.

2.2 The Penalization Point of View

The trust-region technique of §2.1 was rather abrupt: the next iterate was controlled by
a mere switch, "on" inside the trust-region, "off" outside it. Something more flexible
is obtained if the distance from the stability center acts as a weight.
Here we choose a coefficient $\mu > 0$ (the strength of a spring) and our model is

$$y \mapsto \check f_k(y) + \tfrac12\mu\|y - x_k\|^2\,;$$

needless to say, pure cutting planes would be obtained with $\mu = 0$. This strategy is
in the spirit of Example 1.2.3(b): it is the model itself that is given a stabilized form;
its unconstrained minimization will furnish the next iterate, and its decrease will give
the nominal decrease for $f$.
Remark 2.2.1 This stabilizing term has been known for a long time in the framework of
least-squares calculations. Suppose we have to minimize the function

$$\mathbb{R}^n \ni x \mapsto f(x) := \frac12\sum_{j=1}^m f_j^2(x)\,,$$

where each $f_j$ is smooth. According to the general principles of §II.2, a second-order model
of $f$ is desirable.
Observe that (we assume the dot-product for simplicity)

$$\nabla^2 f(x) = Q(x) + \sum_{j=1}^m f_j(x)\nabla^2 f_j(x) \quad\text{with}\quad Q(x) := \sum_{j=1}^m \nabla f_j(x)[\nabla f_j(x)]^{\mathsf T}\,;$$

$Q(x)$ is thus a reasonable approximation of $\nabla^2 f(x)$, at least if the $f_j(x)$'s or the $\nabla^2 f_j(x)$'s
are small. The so-called method of Gauss-Newton exploits this idea, taking as next iterate a
minimum of the corresponding model

$$y \mapsto f(x_k) + \langle\nabla f(x_k), y - x_k\rangle + \tfrac12\langle Q(x_k)(y - x_k), y - x_k\rangle\,.$$

An advantage is that only first-order derivatives are required; furthermore $Q(x_k)$ is automatically positive semi-definite.
However, trouble will appear when $Q(x_k)$ is singular or ill-conditioned. Adding to it a
multiple of the identity matrix, say $\mu I$, is then a good idea: this is just our present stabilization
by penalty. In a Gauss-Newton framework, $\mu$ is traditionally called the coefficient of Levenberg and Marquardt. □
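
A minimal numpy sketch of this stabilized Gauss-Newton (Levenberg-Marquardt) step follows; the names `residuals` and `jacobian` are placeholders for user-supplied routines.

```python
import numpy as np

def levenberg_marquardt_step(x, residuals, jacobian, mu):
    # One Gauss-Newton step stabilized by the penalty mu*I (Remark 2.2.1).
    r, J = residuals(x), jacobian(x)     # r in R^m, J in R^{m x n}
    Q = J.T @ J                          # Gauss-Newton model of the Hessian
    g = J.T @ r                          # gradient of f = (1/2)||r||^2
    return x - np.linalg.solve(Q + mu * np.eye(x.size), g)
```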

Just as in the previous section, we give the resulting algorithm in its crudest form:

Algorithm 2.2.2 (Cutting Planes with Stabilization by Penalty) The initial point
$x_1$ is given, together with a stopping tolerance $\bar\delta \ge 0$. Choose a spring-strength $\mu > 0$
and a descent-coefficient $m \in ]0,1[$. Initialize the descent-set $K = \emptyset$, the iteration-counter $k = 1$, and $y_1 = x_1$; compute $f(y_1)$ and $s_1 = s(y_1)$.

STEP 1. With $\check f_k$ denoting the cutting-plane function (1.0.1), compute the solution
$y_{k+1}$ of

$$\min_y\ \check f_k(y) + \tfrac12\mu\|y - x_k\|^2 \qquad(2.2.1)$$

and set

$$\delta_k := f(x_k) - \check f_k(y_{k+1}) - \tfrac12\mu\|y_{k+1} - x_k\|^2 \ge 0\,.$$

STEP 2. If $\delta_k \le \bar\delta$ stop.
STEP 3. Compute $f(y_{k+1})$ and $s_{k+1} = s(y_{k+1})$. If

$$f(y_{k+1}) \le f(x_k) - m\,\delta_k\,,$$

set $x_{k+1} = y_{k+1}$ and append $k$ to the set $K$ (descent-step). Otherwise set $x_{k+1} = x_k$ (null-step).
Replace $k$ by $k+1$ and loop to Step 1. □
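
Step 1 can be sketched in the same illustrative style as for the trust-region variant, again via scipy's SLSQP solver applied to the quadratic program (2.2.2) below; this is a sketch, not an efficient QP code.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_step(x, fx, mu, F, S, Y):
    # One execution of Step 1 of Algorithm 2.2.2: solve (2.2.1) as the QP
    # (2.2.2) and return y_{k+1} together with the nominal decrease delta_k.
    n = x.size
    cons = [{'type': 'ineq',
             'fun': lambda z, i=i: z[n] - F[i] - S[i] @ (z[:n] - Y[i])}
            for i in range(len(F))]
    obj = lambda z: z[n] + 0.5 * mu * (z[:n] - x) @ (z[:n] - x)
    r0 = max(F[i] + S[i] @ (x - Y[i]) for i in range(len(F)))
    z = minimize(obj, np.append(x, r0), constraints=cons, method='SLSQP').x
    y, r = z[:n], z[n]                            # r = fcheck_k(y_{k+1})
    delta = fx - r - 0.5 * mu * (y - x) @ (y - x)
    return y, delta
```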

With respect to the trust-region variant, a first obvious difference is that the stabilized problem is easier: it is the problem with quadratic objective and affine constraints

$$\min\Big\{r + \tfrac12\mu\|y - x_k\|^2 \;:\; (y,r) \in \mathbb{R}^n\times\mathbb{R},\ f(y_i) + \langle s_i, y - y_i\rangle \le r \text{ for } i = 1,\dots,k\Big\}\,. \qquad(2.2.2)$$

A second difference is the value of $\delta_k$: here the nominal decrease for $f$ is more
logically taken as the decrease of its piecewise quadratic model; this is comparable
to the opposition (a) vs. (b) in Example 1.2.3. We will see later that the difference
is actually of little significance: Algorithm 2.2.2 would be almost the same with
$\delta_k = f(x_k) - \check f_k(y_{k+1})$.
In fact, neglecting any $\delta_k$-detail, the two variants are conceptually equivalent in
the sense that they produce the same $(k+1)$st iterate, provided that $\kappa$ and $\mu$ are
properly chosen:
Proposition 2.2.3
(i) For any $\kappa > 0$, there is $\mu \ge 0$ such that any solution of the trust-region problem
(2.1.1) also solves the penalized problem (2.2.1).
(ii) Conversely, for any $\mu \ge 0$, there is $\kappa \ge 0$ such that any solution of (2.2.1) (assumed to exist) also solves (2.1.1).

PROOF. [(i)] When $\kappa > 0$, the results of Chap. VII can be applied to the convex
minimization problem (2.1.1): Slater's assumption is satisfied, the set of multipliers
is nonempty, and there is a $\mu \ge 0$ such that the solutions of (2.1.1) can be obtained
by unconstrained minimization of the Lagrangian

$$y \mapsto \check f_k(y) + \tfrac12\mu\big(\|y - x_k\|^2 - \kappa^2\big)$$

(Proposition VII.3.1.4).
[(ii)] Suppose that (2.2.1) has a solution $y_{k+1}$. For any $y \in B(x_k, \|y_{k+1} - x_k\|)$,

$$\check f_k(y) + \tfrac12\mu\|y_{k+1} - x_k\|^2 \ge \check f_k(y) + \tfrac12\mu\|y - x_k\|^2 \ge \check f_k(y_{k+1}) + \tfrac12\mu\|y_{k+1} - x_k\|^2$$

and we see that $y_{k+1}$ solves (2.1.1) for $\kappa = \|y_{k+1} - x_k\|$. □


The relation between our two forms of stabilized problems can be made more precise.
See again Proposition 2.1.3: when K increases from 0 to the value Koo , the unique solution
of (2.1.1) describes a curve linking Xk to the minimum of ik that is closest to Xk; and if
ik is unbounded from below, this curve is unbounded (recall from the end of §V3.4, or of
§VIII.3.4, that the piecewise affine function it
attains its infimum when it is bounded from
below).
Now, for J1, > 0, call Y(J1,) the unique solution of(2.2.1). Proposition 2.2.3 tells us that,
when J1, describes lRt, y(J1,) describes the same curve as above, neglecting some possible
problem at K 4- 0 (which corresponds to J1, --+ +(0). This establishes a mapping
J1, 1-+ K = K(J1,) from ]0, +oo[ onto ]0, Koo[
defined by K(J1,) = IIY(J1,) - Xk II for J1, > O. Note that this mapping is not invertible: the J1, of
Proposition 2.2.3(i) need not be unique.
Beware that the equivalence between (2.1.1) and (2.2.1) is not of a practical nature. It
simply means that
292 xv. Primal Forms of Bundle Methods
- choosing K and then solving the trust-region problem,
- or choosing JL and then solving the penalized problem
are conceptually equivalent. Nevertheless, there is no explicit relation giving a priori one
parameter as a function of the other.

Section 3 will be entirely devoted to this second variant, including a study of its
convergence.

2.3 The Relaxation Point of View

The point made in the previous section is that the second term $\frac12\mu\|y - x_k\|^2$ in
(2.2.1) can be interpreted as the dualization of a certain constraint $\|y - x_k\| \le \kappa$,
whose right-hand side $\kappa = \kappa(\mu)$ becomes a function of its multiplier $\mu$. Likewise, we
can interpret the first term $\check f_k(y)$ as the dualization of a constraint $\check f_k(y) \le \ell$, whose
right-hand side $\ell = \ell(\mu)$ will be a function of its multiplier $1/\mu$.
In other words, a third possible stabilized problem is

$$\min\ \tfrac12\|y - x_k\|^2\,, \quad \check f_k(y) \le \ell\,, \qquad(2.3.1)$$

for some level $\ell$. We leave it to the reader to adapt Proposition 2.2.3, thereby studying
the equivalence of (2.3.1) with the previous variants. A difficulty has now appeared,
though: what if (2.3.1) has an empty feasible set, i.e. if $\ell < \inf\check f_k$? By contrast, the
parameter $\kappa$ or $\mu$ of the previous sections could be given arbitrary positive values
throughout the iterations, even if a fixed $\kappa$ led to possible inefficiencies. The present
variant thus needs some more sophistication.
On the other hand, the level used in (2.3.1) suggests an obvious nominal decrease
for $f$: it is natural to set $x_{k+1} = y_{k+1}$ if this results in a definite objective-decrease
from $f(x_k)$ towards $\ell$. Then the resulting algorithm has the following form, in which
we explicitly let the level depend on the iteration.

Algorithm 2.3.1 (Cutting Planes with Level-Stabilization) The initial point $x_1$ is
given, together with a stopping tolerance $\bar\delta \ge 0$. Choose a descent-coefficient $m \in ]0,1[$. Initialize the descent-set $K = \emptyset$, the iteration counter $k = 1$, and $y_1 = x_1$;
compute $f(y_1)$ and $s_1 = s(y_1)$.

STEP 1. Choose a level $\ell = \ell_k$ satisfying $\inf\check f_k \le \ell < f(x_k)$; perform the stopping
criterion.
STEP 2. Compute the solution $y_{k+1}$ of (2.3.1).
STEP 3. Compute $f(y_{k+1})$ and $s_{k+1} = s(y_{k+1})$. If

$$f(y_{k+1}) \le f(x_k) - m\,[f(x_k) - \ell_k]\,, \qquad(2.3.2)$$

set $x_{k+1} = y_{k+1}$ and append $k$ to the set $K$ (descent-step). Otherwise set $x_{k+1} = x_k$ (null-step).
Replace $k$ by $k+1$ and loop to Step 1. □
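
Here the stabilized problem is a projection of the stability center onto a sublevel-set of the cutting-plane model. An illustrative SLSQP-based sketch, with the same conventions and caveats as before:

```python
import numpy as np
from scipy.optimize import minimize

def level_step(x, level, F, S, Y):
    # Solve (2.3.1): project x onto {y : fcheck_k(y) <= level}; the feasible
    # set is assumed nonempty, i.e. level >= inf fcheck_k (sketch only).
    cons = [{'type': 'ineq',
             'fun': lambda y, i=i: level - F[i] - S[i] @ (y - Y[i])}
            for i in range(len(F))]
    res = minimize(lambda y: 0.5 * (y - x) @ (y - x), x,
                   constraints=cons, method='SLSQP')
    return res.x
```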

To explain the title of this subsection, consider the problem of solving a system of
inequalities: given an index-set $J$ and smooth functions $f_j$ for $j \in J$, find $x \in \mathbb{R}^n$ such that

$$f_j(x) \le 0 \quad\text{for all } j \in J\,. \qquad(2.3.3)$$

It may be a difficult one, especially when $J$ is a large set (possibly infinite), and/or the $f_j$'s
are not affine. Several techniques are known for a numerical treatment of this problem.
- The relaxation method addresses large sets $J$ and consists of a dynamic selection of appropriate elements in $J$. For example: at the current iterate $x_k$ we take a most violated
inequality, i.e. we compute an index $j_k$ such that

$$f_{j_k}(x_k) = \max_{j\in J} f_j(x_k)\,;$$

then we solve

$$\min\,\{\tfrac12\|x - x_k\|^2 \;:\; f_{j_k}(x) \le 0\}\,.$$

Unless $x_k$ is already a solution of (2.3.3), we certainly obtain $x_{k+1} \ne x_k$. Refined variants
exist, in which more than one index is taken into account at each iteration.
- Newton's principle (§II.2.3) can also be used: each inequality can be linearized at $x_k$, (2.3.3)
being replaced by

$$f_j(x_k) + \langle\nabla f_j(x_k), x - x_k\rangle \le 0 \quad\text{for all } j \in J\,.$$

- When combining the two techniques, we obtain an implementable method, in which a
sequence of quadratic programming problems are solved.

Now take our original problem of minimizing $f$, and assume that the optimal value
$\bar f = \inf f$ is known. We must clearly solve $f(x) \le \bar f$, to which the above technique can be
applied: linearizing $f$ at the current $x_k$, we are faced with

$$\min\,\{\tfrac12\|x - x_k\|^2 \;:\; f(x_k) + \langle s(x_k), x - x_k\rangle \le \bar f\}\,.$$

Remark 2.3.2 The solution of this projection problem can be computed explicitly: the minimality conditions are

$$x - x_k + \mu s(x_k) = 0\,, \quad \mu \ge 0\,, \quad \mu\,[f(x_k) + \langle s(x_k), x - x_k\rangle - \bar f] = 0\,.$$

Assuming $s(x_k) \ne 0$ and $\bar f < f(x_k)$ (hence $x \ne x_k$, $\mu > 0$), the next iterate is

$$x_k - \frac{f(x_k) - \bar f}{\|s(x_k)\|^2}\,s(x_k)\,.$$

This is just the subgradient Algorithm XII.4.1.1, in which the knowledge of $\bar f$ is exploited to
provide a special stepsize. □
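
In code, this projection is a one-line step; a minimal numpy sketch:

```python
import numpy as np

def polyak_step(x, fx, s, f_bar):
    # Remark 2.3.2: project x_k onto the half-space defined by the
    # linearization of f, assuming s = s(x_k) != 0 and f_bar < f(x_k).
    return x - ((fx - f_bar) / (s @ s)) * s
```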

When formulating our minimization problem as the single inequality $f(x) \le \bar f$, the
essential resolution step is the linearization of $f$ (Newton's principle). Another possible formulation is via infinitely many cutting-plane inequalities: $f$ can be expressed as a supremum
of affine functions,

$$f(x) = \sup\,\{f(y) + \langle s(y), x - y\rangle \;:\; y \in \mathbb{R}^n\}\,.$$

To minimize $f$ is therefore to find $x$ solving the system of affine inequalities

$$f(y) + \langle s(y), x - y\rangle \le \bar f \quad\text{for all } y \in \mathbb{R}^n\,,$$

where the indices $j \in J$ of (2.3.3) are rather denoted by $y \in \mathbb{R}^n$. We can apply to this problem
the relaxation principle and memorize the inequalities visited during the successive iterations.
At the given iterate $x_k$, we recover a problem of the form (2.3.1).
Let us conclude this comparison: if the infimum of $f$ is known, then Algorithm 2.3.1
with $\ell_k \equiv \bar f$ is quite a suitable variant. In other cases, the whole issue for this algorithm will
be to identify the unknown value $\bar f$, thus indicating suitable rules for the management of $\{\ell_k\}$.

We just give an example of a convergence proof, limited to the case of a known
$\bar f$.

Theorem 2.3.3 Let $f$ have a minimum point $\bar x$. Then Algorithm 2.3.1, applied with
$\ell_k \equiv f(\bar x) =: \bar f$ and $\bar\delta = 0$, generates a minimizing sequence $\{x_k\}$.

PROOF. The proof technique is similar to that of the trust-region variant.

[Case 1: $K$ is infinite] At each descent iteration, we have by construction

$$f(x_{k+1}) - \bar f \le (1 - m)\,[f(x_k) - \bar f]\,.$$

Reasoning as in Lemma 2.1.4,

$$f(x_k) - \bar f \le (1 - m)^{\nu(k)}\,[f(x_1) - \bar f]\,,$$

if $\nu(k)$ denotes the number of descent-steps that have been made prior to the $k$th
iteration. If the algorithm performs infinitely many descent-steps, $f(x_k) - \bar f \to 0$
and we are done.

[Case 2: $K$ is finite] Suppose now that the sequence $\{x_k\}$ stops at some iteration $k^*$, so
that $x_k = x_{k^*}$ for all $k \ge k^*$. We proceed to prove that $\bar f$ is a cluster value of $\{f(y_k)\}$.
First, $\check f_k(\bar x) \le f(\bar x) = \bar f$ for all $k$: $\bar x$ is feasible in each stabilized problem, so
$\|y_k - x_{k^*}\| \le \|\bar x - x_{k^*}\|$ by construction. We conclude that $\{y_k\}$ is bounded; from
their definition, all the model-functions $\check f_k$ have a fixed Lipschitz constant $L$, namely
a Lipschitz constant of $f$ on $B(x_{k^*}, \|\bar x - x_{k^*}\|)$ (Theorem IV.3.1.2).
Then take $k$ and $k'$ arbitrary with $k^* \le k \le k'$; observe that $\check f_{k'}(y_k) = f(y_k)$ by
definition of $\check f_{k'}$, so

$$f(y_k) \le \check f_{k'}(y_{k'+1}) + L\|y_k - y_{k'+1}\| \le \bar f + L\|y_k - y_{k'+1}\|\,, \qquad(2.3.4)$$

where the last inequality comes from the definition of $y_{k'+1}$.

Because $\{y_k\}$ is bounded, $\|y_k - y_{k'+1}\|$ cannot stay away from 0 when $k$ and $k'$
go independently to infinity: for any $\varepsilon > 0$, we can find large enough $k$ and $k'$ such
that $\|y_k - y_{k'+1}\| \le \varepsilon$. Thus $\liminf f(y_k) = \bar f$, and failure to satisfy the descent
test (2.3.2) implies that $x_{k^*}$ is already optimal. □

The comments following Theorem 2.1.5 are still valid, concerning the proof technique;
it can even be said that, after a descent-step, all the linearizations appearing in $\check f_k$ can be
refreshed, thus reinitializing $\check f_{k+1}$ to the affine function $f(x_{k+1}) + \langle s_{k+1}, \cdot - x_{k+1}\rangle$. This
modification does not affect Case 2; and Case 1 uses no memory mechanism.

Remark 2.3.4 Further consideration of (2.3.4) suggests another interesting technical comment: suppose in Algorithm 2.3.1 that the level $\ell$ is fixed but the descent test is ignored: only
null-steps are taken. Then, provided that $\{y_k\}$ is bounded, $\{f(y_k)\}$ reaches any level above
$\ell$. If $\ell = \bar f$, this has several implications:
- First, as far as proving convergence is concerned, the concept of descent-step is useless: we
could just set $m = 1$ in the algorithm; the stability center would remain fixed throughout,
the minimizing sequence would be $\{y_k\}$.
- Even further: the choice of the stability center, to be projected onto the sublevel-set of $\check f_k$,
is moderately important. Technically, its role is limited to preserving the boundedness of
$\{y_k\}$.
- Existence of a minimum of $f$ is required to guarantee this boundedness: in a way, our
present algorithm is weaker than the trust-region form, which did not need this existence.
However, we leave it as an exercise to reproduce the proof of Theorem 2.3.3 for a variant
of Algorithm 2.3.1 using a more conservative choice of the level.
- Our proof of Theorem 2.3.3 is qualitative; compare it with §IX.2.1(a). A quantitative argument could also be conceived of, as in §IX.2.1(b). □

To conclude, let us return to our comparison with the inequality-solving problem
(2.3.3), assumed to have a nonempty solution-set. For this problem, the simplest
Newton algorithm takes $x_{k+1}$ as the unique solution of

$$\min\,\{\tfrac12\|x - x_k\|^2 \;:\; f_j(x_k) + \langle\nabla f_j(x_k), x - x_k\rangle \le 0 \ \text{ for } j \in J\}\,. \qquad(2.3.5)$$

This makes numerical sense if $J$ is finite (and not large): we have a quadratic program
to solve at each iteration. Even in this case, however, the algorithm may not converge,
because a Newton method is only locally convergent.
To get global convergence, line-searching is a natural technique: we can take a
next iterate along the half-line pointing towards the solution of (2.3.5) and decreasing
"substantially" the natural objective function $\max_{j\in J} f_j$; we are right in the framework
of §II.3. The message of the present section is that another possible technique is to
memorize the linearizations; as already seen in §1.2, a null-step resembles one cycle in
a line-search. This technique enables the resolution of (2.3.3) with an infinite index-set
$J$, and also with nonsmooth functions $f_j$. The way is open to minimization methods
with nonsmooth constraints. Compare also the present discussion with §IX.3.2.

2.4 A Possible Dual Point of View

Consider again (2.2.1), written in expanded form:

$$\min\Big\{r + \tfrac12\mu\|y - x_k\|^2 \;:\; (y,r) \in \mathbb{R}^n\times\mathbb{R},\ f(y_i) + \langle s_i, y - y_i\rangle \le r \ \text{ for } i = 1,\dots,k\Big\}\,. \qquad(2.4.1)$$

The dual of this quadratic program can be formulated explicitly, yielding very instructive interpretations. In the result below, $\Delta_k$ is the unit simplex of $\mathbb{R}^k$ as usual;
the coefficients

$$e_i := f(x_k) - f(y_i) - \langle s_i, x_k - y_i\rangle \quad\text{for } i = 1,\dots,k \qquad(2.4.2)$$

are the linearization errors between $y_i$ and $x_k$, already encountered in several previous
chapters (see Definitions XI.4.2.3 and XIV.1.2.7).

Lemma 2.4.1 For $\mu > 0$, the unique solution of the penalized problem (2.2.1) =
(2.4.1) is

$$y_{k+1} = x_k - \frac1\mu\sum_{i=1}^k \alpha_i s_i\,, \qquad(2.4.3)$$

where $\alpha \in \mathbb{R}^k$ solves

$$\min\Big\{\tfrac12\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2 + \mu\sum_{i=1}^k \alpha_i e_i \;:\; \alpha \in \Delta_k\Big\}\,. \qquad(2.4.4)$$

Furthermore, there holds

$$f(x_k) - \check f_k(y_{k+1}) = \sum_{i=1}^k \alpha_i e_i + \frac1\mu\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2\,. \qquad(2.4.5)$$

PROOF. This is a direct application of Chap. XII (see §XII.3.4 if necessary). Take
$k$ nonnegative dual variables $\alpha_1,\dots,\alpha_k$ and form the Lagrange function which, at
$(y, r, \alpha) \in \mathbb{R}^n\times\mathbb{R}\times\mathbb{R}^k$, has the value

$$r + \tfrac12\mu\|y - x_k\|^2 + \sum_{i=1}^k \alpha_i\langle s_i, y\rangle + \sum_{i=1}^k \alpha_i\,[f(y_i) - \langle s_i, y_i\rangle - r]\,.$$

Its minimization with respect to the primal variables $(y,r)$ implies first the condition
$\sum_{i=1}^k \alpha_i = 1$ (otherwise we get $-\infty$), and results in $y$ given as in (2.4.3). Plugging this
value back into the Lagrangian, we obtain the dual problem associated with (2.4.1):

$$\max_{\alpha\in\Delta_k}\ \Big\{-\frac1{2\mu}\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2 + \sum_{i=1}^k \alpha_i\,[f(y_i) + \langle s_i, x_k - y_i\rangle]\Big\}\,,$$

in which the notation (2.4.2) can be used. Equating the primal and dual objective-values gives (2.4.5) directly.
Finally, the solution-set of the dual problem is not changed if we multiply its
objective function by $\mu > 0$, and add the constant term $f(x_k) = \sum_{i=1}^k \alpha_i f(x_k)$. This
gives (2.4.4). □
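
The primal-dual relations of Lemma 2.4.1 are easy to check numerically. The following sketch builds a small hypothetical bundle, solves (2.4.4) over the simplex with scipy's SLSQP, and verifies (2.4.5) up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize

mu, xk = 2.0, np.zeros(2)
S = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # slopes s_i (assumed)
e = np.array([0.1, 0.3, 0.0])                          # errors e_i of (2.4.2)

dual = lambda a: 0.5 * (a @ S) @ (a @ S) + mu * (a @ e)        # (2.4.4)
a = minimize(dual, np.full(3, 1 / 3), bounds=[(0, 1)] * 3,
             constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1}],
             method='SLSQP').x
s_hat = a @ S
y = xk - s_hat / mu                                    # (2.4.3)
lhs = -max(-e[i] + S[i] @ (y - xk) for i in range(3))  # f(xk) - fcheck_k(y)
rhs = a @ e + (s_hat @ s_hat) / mu                     # right-hand side of (2.4.5)
print(lhs, rhs)                                        # equal up to tolerance
```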

With the form (2.4.4) of the stabilized problem, we can play the same game as in
the previous subsections: the linear term in $\alpha$ can be interpreted as the dualization of
a constraint whose right-hand side, say $\varepsilon$, is a function of its multiplier $\mu$. In other
words: given $\mu > 0$, there is an $\varepsilon$ such that the solution of (2.2.1) is given by (2.4.3),
where $\alpha$ solves

$$\min\ \tfrac12\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2\,, \quad \alpha \in \Delta_k\,, \quad \sum_{i=1}^k \alpha_i e_i \le \varepsilon\,. \qquad(2.4.6)$$

Conversely, let the constraint in this last problem have a positive multiplier $\mu$; then
the associated $y_{k+1}$ of (2.4.3) is also the unique solution of (2.2.1) with this $\mu$.
A thorough observation of our notation shows that
$e_i \ge 0$ for $i = 1,\dots,k$ and $e_j = 0$ for some $j \le k$.
It follows that the correspondence $\varepsilon \leftrightarrow \mu$ involves nonnegative values of $\varepsilon$ only.
Furthermore, we will see that the values of $\varepsilon$ that are relevant for convergence must
depend on the iteration index. In summary, our detour into the dual space has revealed
a fourth conceptually equivalent stabilized algorithm:
Algorithm 2.4.2 (Cutting Planes with Dual Stabilization) The initial point $x_1$ is
given, together with a stopping tolerance $\bar\delta \ge 0$. Choose a descent-coefficient $m \in ]0,1[$. Initialize the descent-set $K = \emptyset$, the iteration counter $k = 1$, and $y_1 = x_1$;
compute $f(y_1)$ and $s_1 = s(y_1)$.

STEP 1. Choose $\varepsilon \ge 0$ such that the constraint in (2.4.6) has a positive multiplier $\mu$
(for simplicity, we assume this is possible).
STEP 2. Solve (2.4.6) to obtain an optimal $\alpha \in \Delta_k$ and a multiplier $\mu > 0$. Unless
the stopping criterion is satisfied, set

$$\hat s := \sum_{i=1}^k \alpha_i s_i\,, \qquad y_{k+1} = x_k - \frac1\mu\hat s\,.$$

STEP 3. Compute $f(y_{k+1})$ and $s_{k+1} = s(y_{k+1})$. If

$$f(y_{k+1}) \le f(x_k) - m\Big(\varepsilon + \frac1{2\mu}\|\hat s\|^2\Big)\,,$$

set $x_{k+1} = y_{k+1}$ and append $k$ to the set $K$ (descent-step). Otherwise set $x_{k+1} = x_k$ (null-step).
Replace $k$ by $k+1$ and loop to Step 1. □

The interesting point about this interpretation is that (2.4.6) is just the direction-finding problem (XIV.2.1.4) of the dual bundle methods. This allows some useful
comparisons:
(i) A stabilized cutting-plane method is a particular form of bundle method, in
which the "line-search" systematically chooses $t = 1/\mu$ (the inverse multiplier
of (2.4.6), which must be positive). The privileged role played by this particular
stepsize was already outlined in §XIV.2.4; see Fig. XIV.2.4.1 again.
(ii) Alternatively, a dual bundle method can be viewed as a stabilized cutting-plane
method, in which a line-search is inserted. Since Remark 1.2.1 warns us against
such a mixture, it might be desirable to imagine something different.
(iii) The stopping criterion in Algorithm 2.4.2 is not specified, but we know from the
previous chapters that $x_k$ is approximately optimal if both $\varepsilon$ and $\|\hat s\|$ are small; see
for example (XIV.2.3.5). This interpretation is also useful for the stopping criteria
used in the previous subsections. Indeed, (2.4.5) reveals two distinct components
in the nominal decrease $f(x_k) - \check f_k(y_{k+1})$. One, $\varepsilon$, is directly comparable to
$f$-values; the other, $\|\hat s\|^2/\mu$, connotes shifts in $x$ via the "rate" of decrease $\|\hat s\|$;
see the note XIV.3.4.3(6).
(iv) In §XIV.3.3(a), we mentioned an ambiguity concerning the descent test used
by a dual bundle method. If the value (XIV.3.3.2) is used for the line-search
parameter, the dual bundle algorithm uses the descent test

$$f(y_{k+1}) \le f(x_k) - m\Big(\varepsilon + \frac1\mu\|\hat s\|^2\Big)\,.$$

In view of (2.4.5), we recognize in the last parenthesis the nominal decrease
$f(x_k) - \check f_k(x_k - \tfrac1\mu\hat s)$, recommended for the trust-region variant.
(v) In their dual form, bundle methods lend themselves to the aggregation technique.
A solution $\alpha$ of (2.4.6) not only yields the aggregate subgradient $\hat s$, but also
an aggregate linearization error $\hat\varepsilon := \sum_{i=1}^k \alpha_i e_i$. As seen in §XIV.2.4, this
corresponds to the aggregate linearization

$$y \mapsto f(x_k) - \hat\varepsilon + \langle\hat s, y - x_k\rangle\,,$$

which minorizes $f$, simply because it minorizes $\check f_k$. The way is thus open to
"economic" cutting-plane algorithms (possibly stabilized), in which the above
aggregate linearization can take the place of one or more direct linearizations,
thereby diminishing the complexity of the cutting-plane function.

Recall from §XIV.3.4 that a rather intricate control of $\varepsilon_k$ is necessary, just to ensure convergence of Algorithm 2.4.2. Having said enough about this problem in Chap. XIV, we make
here a last remark. To pass from the primal stabilized $\mu$-problem to its dual (2.4.4) or (2.4.6),
we applied Lagrange duality to the developed form (2.4.1); and to make the comparison more
suggestive, we introduced the linearization errors (2.4.2). The same interpretative work can
be done by applying Fenchel duality (§XII.5.4) directly to (2.2.1). This gives something more
abstract, but also more intrinsic, involving the conjugate function of $\check f_k$:

Proposition 2.4.3 For $\mu > 0$, the unique solution $y_{k+1}$ of (2.2.1) is

$$y_{k+1} = x_k - \frac1\mu\hat s\,,$$

where $\hat s \in \partial\check f_k(y_{k+1})$ is the unique minimizer of the closed convex function

$$\mathbb{R}^n \ni s \mapsto \check f_k^*(s) - \langle s, x_k\rangle + \frac1{2\mu}\|s\|^2\,. \qquad(2.4.7)$$

PROOF. Set $g_1 := \check f_k$, $g_2 := \frac12\mu\|\cdot - x_k\|^2$ and apply Proposition XII.5.4.1: because $g_2$ is
finite everywhere, the qualification assumption (XII.5.4.3) certainly holds. The conjugate of
$g_2$ can be easily computed, and minimizing the function of (2.4.7) is exactly the dual problem
of (2.2.1). Denoting its solution by $\hat s$, the solution-set of (2.2.1) is then the (therefore nonempty)
set $\partial\check f_k^*(\hat s) \cap \{x_k - \frac1\mu\hat s\}$. Then there holds

$$y_{k+1} = x_k - \frac1\mu\hat s \in \partial\check f_k^*(\hat s)\,. \qquad\Box$$
Needless to say, (2.4.7) is simply a compact form for the objective function of (2.4.4).
To see this, use either of the following two useful expressions for the cutting-plane model:

$$\check f_k(y) = f(x_k) + \max\,\{-e_i + \langle s_i, y - x_k\rangle \;:\; i = 1,\dots,k\}
= \max\,\{-f^*(s_i) + \langle s_i, y\rangle \;:\; i = 1,\dots,k\}\,.$$

Its conjugate can be computed with the help of various calculus rules from Chap. X (see in
particular §X.3.4); we obtain a convex hull of needles:

$$\check f_k^*(s) = -f(x_k) + \langle s, x_k\rangle + \min\,\Big\{\sum_{i=1}^k \alpha_i e_i \;:\; \alpha\in\Delta_k,\ \sum_{i=1}^k \alpha_i s_i = s\Big\}
= \min\,\Big\{\sum_{i=1}^k \alpha_i f^*(s_i) \;:\; \alpha\in\Delta_k,\ \sum_{i=1}^k \alpha_i s_i = s\Big\}\,.$$

Plugging the first expression into (2.4.7) yields the minimization problem (2.4.4); and
with the second value, we obtain an equivalent form:

$$\min_{\alpha\in\Delta_k}\ \Big\{\tfrac12\Big\|\sum_{i=1}^k \alpha_i s_i\Big\|^2 + \mu\sum_{i=1}^k \alpha_i\,[f^*(s_i) - \langle s_i, x_k\rangle]\Big\}\,.$$

2.5 Conclusion

This Section 2 has reviewed a number of possible algorithms for minimizing a (finite-valued) convex function, based on two possible motivations:
- Three of them work in the primal space. They start from the observation that the
cutting-plane algorithm is unstable: its next iterate is "too far" from the current one,
and should be pulled back.
- One of them works in the dual space. It starts from the observation that the steepest-descent algorithm is sluggish: the next iterate is "too close" to the current one, and
should be pushed forward.
In both approaches, the eventual aim is to improve the progress towards an optimal
solution, as measured in terms of the objective function.

Remark 2.5.1 We mention here that our list of variants is not exhaustive.
- When starting from the $\mu$-problem (2.2.1), we introduced a trust-region constraint, or a
level constraint, ending up with two more primal variants.
- When starting from the same $\mu$-problem in its dual form (2.4.4), we introduced a linearization-error constraint, which corresponded to a level point of view in the dual space. Altogether, we obtained the four possibilities reviewed above.
- We could likewise introduce a trust-region point of view in the dual space: instead of (2.4.6),
we could formulate

$$\min\,\Big\{\sum_{i=1}^k \alpha_i e_i \;:\; \alpha\in\Delta_k,\ \Big\|\sum_{i=1}^k \alpha_i s_i\Big\| \le \sigma\Big\}$$

for some dual radius $\sigma$ (note: $-\sigma$ represents the value of a certain support function, remember Proposition XIV.2.3.2).
- Another idea would be to take the dual of (2.1.1) or (2.3.1), and then to make analogous
changes of parameter.
We do not know what would result from these various exercises. □

All these algorithms follow essentially the same strategy: solve a quadratic program
depending on a certain internal parameter: $\kappa$, $\mu$, $\varepsilon$, $\ell$, or whatever. They are conceptually
equivalent, in the sense that they generate the same iterates, provided that their respective
parameters are linked by a certain a posteriori relation; and they are characterized precisely
by the way this parameter is chosen.

Figure 2.5.1 represents the relations connecting all these parameters; it plots the two
model-functions, $\check f_k$ and its piecewise quadratic perturbation, along $\hat s$ interpreted as a direction. The graph of $f$ lies somewhere above $\operatorname{gr}\check f_k$ and meets the two model-graphs at
$(x_k, f(x_k))$. The dashed line represents the trace along $x_k + \mathbb{R}\hat s$ of the affine function

$$y \mapsto f(x_k) - \hat\varepsilon + \langle\hat s, y - x_k\rangle\,.$$

If the variant mentioned in Remark 2.5.1 is used, take the graph of an affine function of slope
$-\sigma$, and lift it as high as possible so as to support $\operatorname{gr}\check f_k$: this gives the other parameters.

Fig. 2.5.1. A posteriori relations

Remark 2.5.2 This picture clearly confirms that the nominal decreases for the trust-region
and penalized variants are not significantly different: in view of (2.4.5), they are respectively

$$\delta_{\mathrm{TR}} = \hat\varepsilon + \frac1\mu\|\hat s\|^2 \quad\text{and}\quad \delta_{\mathrm P} = \hat\varepsilon + \frac1{2\mu}\|\hat s\|^2\,.$$

Thus $\delta_{\mathrm{TR}} \in [\delta_{\mathrm P}, 2\delta_{\mathrm P}]$. □

So far, our review has been purely descriptive; no argument has been given to
prefer any particular variant. Yet, the numerical illustrations in §XIV.4.2 have clearly
demonstrated the importance of a proper choice of the associated parameter (be it $\varepsilon$,
$\kappa$, $\mu$, $\ell$, or whatever). The selection of a particular variant should therefore address
two questions, one theoretical, one practical:
- specific rules for choosing the parameter efficiently, in terms of speed of convergence;
- effective resolution of the stabilized problem, which is extremely important in practice: being routinely executed, it must be fast and fail-safe.
With respect to the second criterion, the four variants are approximately even, with
a slight advantage to the $\varepsilon$-form: some technical details in quadratic programming
make its resolution process more stable than for the other three.
The first criterion is also the more decisive; but unfortunately, it does not yield
a clear conclusion. We refer to the difficulties illustrated in §XIV.4.2 for efficient
choices of $\varepsilon$; and apart from the situation of a known minimal value $\bar f$ (in which case
the level-strategy becomes obvious), little can be said about appropriate choices of
the other parameters.

It turns out, however, that the variant by penalization has a third possible moti-
vation, which gives it a definite advantage. It will be seen in §4 that it is intimately
related to the Moreau-Yosida regularization of Example XI.3.4.4. The next section is
therefore devoted to a thorough study of this variant.

3 A Class of Primal Bundle Algorithms

In this section, we study more particularly primal bundle methods in penalized form,
introduced in §2.2. Because they work directly in the primal space, they can handle
possible constraints rather easily. We therefore assume that the problem to solve is
actually

$$\inf\,\{f(x) \;:\; x \in C\}\,; \qquad(3.0.1)$$

$f$ is still a convex function (finite everywhere), and now $C$ is a closed convex subset of
$\mathbb{R}^n$. The only restriction imposed on $C$ is of a practical nature, namely each stabilized
problem (2.2.1) must still be solvable, even if the constraint $y \in C$ is added. In
practice, this amounts to assuming that $C$ is a closed convex polyhedron:

$$C = \{x \in \mathbb{R}^n \;:\; \langle a_j, x\rangle \le b_j \ \text{ for } j = 1,\dots,m\}\,, \qquad(3.0.2)$$

$(a_j, b_j) \in \mathbb{R}^n\times\mathbb{R}$ being given for $j = 1,\dots,m$. As far as this chapter is concerned,
the only effect of introducing $C$ is a slight complication in the algebra; the reader may
take $C = \mathbb{R}^n$ throughout, if this helps him to follow our development more easily.
Note, anyway, that (3.0.1) is the unconstrained minimization of $g := f + I_C$; because
$f$ is assumed finite everywhere, $\partial g = \partial f + N_C$ (Theorem XI.3.1.1) and little is
essentially changed with respect to our previous developments.

3.1 The General Method

The model will not be exactly $\check f_k$ of (1.0.1), so we prefer to denote it abstractly by $\varphi$,
a convex function finite everywhere. Besides, the iteration index $k$ is useless for the
moment; the stabilized problem is therefore denoted by

$$\min\,\{\varphi(y) + \tfrac12\mu\|y - x\|^2 \;:\; y \in C\}\,, \qquad(3.1.1)$$

$x$ and $\mu$ being the (given) $k$th stability center and penalty coefficient respectively. Once
again, this problem is assumed numerically solvable; such is the case for $\varphi$ piecewise
affine and $C$ polyhedral.
The model $\varphi$ will incorporate the aggregation technique, already seen in previous
chapters, which we proceed to describe in a totally primal language. First we reproduce
Lemma 2.4.1.

Lemma 3.1.1 With $\varphi : \mathbb{R}^n \to \mathbb{R}$ convex, $\mu > 0$ and $C$ closed convex, (3.1.1) has a
unique solution $y^+$ characterized by the formulae

$$y^+ = x - \frac1\mu(\hat s + \hat p)\,, \quad \hat s \in \partial\varphi(y^+)\,, \quad \hat p \in N_C(y^+)\,. \qquad(3.1.2)$$
Furthermore,

$$\varphi(y) \ge f(x) + \langle\hat s, y - x\rangle - \hat e \quad\text{for all } y \in \mathbb{R}^n\,,$$

where

$$\hat e := f(x) - \varphi(y^+) - \langle\hat s, x - y^+\rangle\,. \qquad(3.1.3)$$

PROOF. The assumptions clearly imply that (3.1.1) has a unique solution. Using the
geometric minimality condition VII.1.1.1 and some subdifferential calculus, this solution is seen to be the unique point $y^+$ satisfying

$$0 \in \partial\varphi(y^+) + \mu(y^+ - x) + N_C(y^+)\,,$$

which is just (3.1.2).

We thus have

$$\varphi(y) \ge \varphi(y^+) + \langle\hat s, y - y^+\rangle \quad\text{for all } y \in \mathbb{R}^n\,.$$

Using, as in Proposition XI.4.2.2, a "transportation trick" from $y^+$ to $x$, this can be
written

$$\varphi(y) \ge f(x) + \langle\hat s, y - x\rangle - f(x) + \varphi(y^+) + \langle\hat s, x - y^+\rangle\,.$$

In view of (3.1.2), we recognize the expression (3.1.3) of $\hat e$. □

Proposition 3.1.2 With the notation of Lemma 3.1.1, take an arbitrary function $\psi :
\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ satisfying

$$\psi(y) \ge f(x) - \hat e + \langle\hat s, y - x\rangle =: \tilde f_a(y) \quad\text{for all } y \in \mathbb{R}^n\,, \qquad(3.1.4)$$

with equality at $y = y^+$. Then $y^+$ minimizes on $C$ the function

$$y \mapsto \psi(y) + \tfrac12\mu\|y - x\|^2\,.$$

PROOF. Use again the same transportation trick: using (3.1.3) and (3.1.2), the relations
defining $\psi$ can be written

$$\psi(y) \ge \varphi(y^+) + \langle\hat s, y - y^+\rangle \quad\text{for all } y \in \mathbb{R}^n\,,$$

with equality at $y = y^+$. Adding the term $\tfrac12\mu\|y - x\|^2$ to both sides,

$$\psi(y) + \tfrac12\mu\|y - x\|^2 \ge \varphi(y^+) + \langle\hat s, y - y^+\rangle + \tfrac12\mu\|y - x\|^2\,,$$

again with equality at $y = y^+$. Then it suffices to observe from (3.1.2) that the function
on the right-hand side is minimized over $C$ at $y = y^+$: indeed its gradient at $y^+$ is

$$\hat s + \mu(y^+ - x) = -\hat p \in -N_C(y^+)\,,$$

and the geometric minimality condition VII.1.1.1(iii) is satisfied. □



The affine function ]a appearing in (3.1.4) is the aggregate linearization of j,


already seen in §XIY.2.4. It minorizes ({J (Lemma 3.1.1) and can also be written

(3.1.5)

Proposition 3.1.2 tells us in particular that the next iterate y+ would not be changed
if, instead of ({J, the model were any convex function 1/1 sandwiched between ]a and
({J.

Note that the aggregate linearization concerns exclusively f, the function that is modeled
by ({I. On the other hand, the indicator part of the (unconstrained) objective function f + Ie
is treated directly, without any modelling; so aggregation is irrelevant for it.

Remark 3.1.3 We could take for example $\psi = \tilde f_a$ in Proposition 3.1.2. In relation to
Fig. 2.5.1, this suggests another construction, illustrated by Fig. 3.1.1. For given $x$ and $\mu > 0$,
assume $\hat p = 0$ (so as to place ourselves in the framework of §2); with $y$ of the form $x - t\hat s$,
draw the parabola of equation

$$r = r(y) = r_0 - \tfrac12\mu\|y - x\|^2\,,$$

where $r_0$ is lifted as high as possible to support the graph of the model $\check f_k = \varphi$; there is
contact $r(y) = \varphi(y)$ at the unique point $(y^+, \varphi(y^+))$. Because $y^+$ minimizes the quadratic
function $\tilde f_a + \tfrac12\mu\|\cdot - x\|^2$, the dashed line $\operatorname{gr}\tilde f_a$ of Fig. 3.1.1 is tangent to our parabola; the
value $\tilde f_a(x)$ unveils the key parameter $\hat e$ (the would-be $\varepsilon$ of Fig. 2.5.1), from which the whole
picture can be reconstructed.

Fig. 3.1.1. Supporting a convex epigraph with a parabola

The above observation, together with the property of parabolas given in Fig. II.3.2.2,
illustrates once more the point made in Remark 2.5.2 concerning nominal decreases. □

Keeping in mind the strategy used in the previous chapters, the above two results
can be used as follows. When (3.1.1) is solved, a new affine piece $(y^+, f(y^+), s^+)$
will be introduced to give the new model

$$\max\,\{\varphi,\ f(y^+) + \langle s^+, \cdot - y^+\rangle\}\,.$$

Before doing so, we may wish to "simplify" $\varphi$ to some other $\psi$, in order to make room
in the computer and/or to simplify the next quadratic program. For example, we may
wish to discard some old affine pieces; this results in $\psi \le \varphi$. After such an operation,
we must incorporate the affine piece $\tilde f_a$ into the definition of the simpler $\psi$, so that we
will have $\tilde f_a \le \psi \le \varphi$. This aggregation operation has been seen already in several
previous chapters and we can infer that it will not impair convergence. Besides, the
simpler function $\psi$ is piecewise affine if $\varphi$ is such, and the operation can be repeated
at any later iteration.
In terms of the objective function $f$, the aggregate linearization $\tilde f_a$ is not attached
to any $\tilde y$ such that $\hat s \in \partial f(\tilde y)$, so the notation (1.0.1) is no longer correct for the model.
For the same reason, characterizing the model in terms of triples $(y, f, s)_i$ is clumsy:
we need only couples $(s, r)_i \in \mathbb{R}^n\times\mathbb{R}$ characterizing affine functions. In addition to
its slope $s_i$, each such affine function could be characterized by its value at 0, but we
prefer to characterize it by its value at the current stability center $x$; furthermore we
choose to call $f(x) - e_i$ this value. Calling $\ell$ the total number of affine pieces, all the
necessary information is then characterized by a bundle of couples

$$(s_i, e_i) \in \mathbb{R}^n\times\mathbb{R}_+ \quad\text{for } i = 1,\dots,\ell$$

and the model is

$$\mathbb{R}^n \ni y \mapsto \varphi(y) = f(x) + \max_{i=1,\dots,\ell}\,[-e_i + \langle s_i, y - x\rangle]\,.$$

We recognize notation already introduced in §XIV.1.2. The slopes $s_i$ are either direct
subgradients of $f$ computed by the black box (U1) at some sampling points $y_i$, or
vectors of the form $\hat s$ obtained after one aggregation or more. As for the linearization
errors $e_i$, they can be computed recursively: when the stability center $x$ is changed to
$x^+$, each $e_i$ must be changed to the difference $e_i^+$ between $f(x^+)$ and the value at $x^+$
of the $i$th linearization. In other words,

$$f(x^+) - e_i^+ = f(x) - e_i + \langle s_i, x^+ - x\rangle\,,$$

hence

$$e_i^+ = e_i + f(x^+) - f(x) - \langle s_i, x^+ - x\rangle\,.$$

Finally, we prefer to work with the inverse of the penalty parameter; its interpretation
as a stepsize is much more suggestive in a primal context.
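
In practice this recursion is a one-line update; a small numpy sketch, with the $s_i$ stored as the rows of a matrix `S`:

```python
def shift_errors(e, S, f_new, f_old, dx):
    # e_i^+ = e_i + f(x^+) - f(x) - <s_i, x^+ - x>, with dx = x^+ - x
    return e + (f_new - f_old) - S @ dx
```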
In summary, re-introducing the iteration index $k$, our stabilized problem to be
solved at the $k$th iteration is

$$\min\Big\{r + \frac1{2t_k}\|y - x_k\|^2 \;:\; (y,r) \in C\times\mathbb{R},\ r \ge f(x_k) - e_i + \langle s_i, y - x_k\rangle \ \text{ for } i = 1,\dots,\ell\Big\}\,, \qquad(3.1.6)$$

a quadratic program if $C$ is a closed convex polyhedron; the extra variable $r$ stands
for $\varphi$-values. It can easily be seen that (3.1.6) has a unique solution $(y_{k+1}, r_{k+1})$, with
$\varphi(y_{k+1}) = r_{k+1}$.
The precise algorithm can now be stated, with notations combining those of
Chap. XIV and §2.2.
Chap. XIV and §2.2.

Algorithm 3.1.4 (Primal Bundle Method with Penalty) The initial point $x_1$ is given, together with a stopping tolerance $\bar\delta \ge 0$ and a maximal bundle size $\bar\ell$. Choose
a descent-coefficient $m \in ]0,1[$, say $m = 0.1$. Initialize the descent-set $K = \emptyset$, the
iteration counter $k = 1$ and the bundle size $\ell = 1$. Compute $f(x_1)$ and $s_1 = s(x_1)$.
Set $e_1 = 0$, corresponding to the initial bundle $(s_1, e_1)$, and the initial model

$$y \mapsto \varphi_1(y) := f(x_1) + \langle s_1, y - x_1\rangle\,.$$

STEP 1 (main computation and stopping test). Choose a "stepsize" $t_k > 0$ and solve
(3.1.6). As stated in Lemma 3.1.1, its unique solution is $y_{k+1} = x_k - t_k(\hat s_k + \hat p_k)$,
with $\hat s_k \in \partial\varphi_k(y_{k+1})$ and $\hat p_k \in N_C(y_{k+1})$. Set

$$\hat e_k := f(x_k) - \varphi_k(y_{k+1}) - t_k\langle\hat s_k, \hat s_k + \hat p_k\rangle\,,$$

$$\delta_k := f(x_k) - \varphi_k(y_{k+1}) - \tfrac12 t_k\|\hat s_k + \hat p_k\|^2\,.$$

If $\delta_k \le \bar\delta$ stop.
STEP 2 (descent test). Compute $f(y_{k+1})$ and $s(y_{k+1})$; if the descent test

$$f(y_{k+1}) \le f(x_k) - m\,\delta_k \qquad(3.1.7)$$

is not satisfied, declare "null-step" and go to Step 4.

STEP 3 (descent-step). Set $x_{k+1} = y_{k+1}$. Append $k$ to the set $K$; for $i = 1,\dots,\ell$,
change $e_i$ to

$$e_i + f(x_{k+1}) - f(x_k) - \langle s_i, x_{k+1} - x_k\rangle\,;$$

change also $\hat e_k$ similarly.
STEP 4 (managing the bundle size). If $\ell = \bar\ell$ then: delete at least 2 elements from the
bundle and insert the element $(\hat s_k, \hat e_k)$.
Call again $(s_i, e_i)_{i=1,\dots,\ell}$ the new bundle thus obtained (note: $\ell < \bar\ell$).
STEP 5 (loop). Append $(s_{\ell+1}, e_{\ell+1})$ to the bundle, where $s_{\ell+1} = s(y_{k+1})$, $e_{\ell+1} = 0$ in case of descent-step and, in case of null-step,

$$e_{\ell+1} = f(x_k) - f(y_{k+1}) - \langle s_{\ell+1}, x_k - y_{k+1}\rangle\,.$$

Replace $\ell$ by $\ell + 1$ and define the model

$$\varphi_{k+1}(y) := f(x_{k+1}) + \max_{i=1,\dots,\ell}\,[-e_i + \langle s_i, y - x_{k+1}\rangle]\,.$$

Replace $k$ by $k+1$ and loop to Step 1. □
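
To fix ideas, here is a minimal Python sketch of Algorithm 3.1.4 in the unconstrained case $C = \mathbb{R}^n$ (so $\hat p_k = 0$), with a fixed stepsize $t$ and no bundle compression; the quadratic program (3.1.6) is again delegated to scipy's SLSQP, which is illustrative rather than efficient.

```python
import numpy as np
from scipy.optimize import minimize

def primal_bundle(f, sub, x1, t=1.0, m=0.1, tol=1e-6, kmax=200):
    n = x1.size
    x, fx = x1.astype(float), f(x1)
    S, e = [sub(x1)], [0.0]                      # initial bundle (s_1, e_1)
    for _ in range(kmax):
        # Step 1: solve (3.1.6) in the variable z = (y, r)
        cons = [{'type': 'ineq',
                 'fun': lambda z, s=S[i], ei=e[i]:
                        z[n] - (fx - ei + s @ (z[:n] - x))}
                for i in range(len(S))]
        obj = lambda z: z[n] + (z[:n] - x) @ (z[:n] - x) / (2 * t)
        z = minimize(obj, np.append(x, fx), constraints=cons,
                     method='SLSQP').x
        y, r = z[:n], z[n]                       # r = phi_k(y_{k+1})
        s_hat = (x - y) / t                      # aggregate slope (p_hat = 0)
        delta = fx - r - 0.5 * t * (s_hat @ s_hat)
        if delta <= tol:                         # stopping test
            return x
        fy, sy = f(y), sub(y)
        if fy <= fx - m * delta:                 # descent test (3.1.7)
            dx = y - x                           # shift the errors to the new center
            e = [ei + fy - fx - si @ dx for ei, si in zip(e, S)]
            e.append(0.0)                        # e_{l+1} = 0 after a descent-step
            x, fx = y, fy
        else:                                    # null-step
            e.append(fx - fy - sy @ (x - y))
        S.append(sy)
    return x

# example: primal_bundle(lambda x: np.abs(x).sum(), np.sign, np.array([3.0, -2.0]))
```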


This algorithm is directly comparable to Algorithm XIY.3.4.2; observe, however,
how much simpler it is - hardly more complex than its schematic version 2.2.2. Let
us add some notes playing the role of those in XIY.3.4.3.

Notes 3.1.5
(i) The initialization $t_1$ can use Remark II.3.4.2 if we have an estimate $\Delta$ of the total
decrease $f(x_1) - \bar f$. Then the initial $t_1$ can be obtained from the formula $t_1\|s_1\|^2 = 2\Delta$.
Not surprisingly, we then have $\delta_1 = \Delta$.
(ii) Lemma 3.1.1 essentially says that $\hat s_k \in \partial_{\hat e_k}f(x_k)$, and our convergence analysis will
establish that $\hat s_k + \hat p_k \in \partial_{\varepsilon_k}(f + I_C)(x_k)$, with $\varepsilon_k \ge 0$ given in Lemma 3.2.1 below.
Because the objective function is $f + I_C$, the whole issue will be to show that $\varepsilon_k \to 0$ and
$\hat s_k + \hat p_k \to 0$. Accordingly, it may be judged convenient to split the stopping tolerance
$\bar\delta$ into two terms: stop when

$$\varepsilon_k \le \bar\varepsilon \quad\text{and}\quad \|\hat s_k + \hat p_k\| \le \bar s\,,$$

for given tolerances $\bar\varepsilon$ and $\bar s$, respectively homogeneous to objective-values and norms
of subgradients.
(iii) See Remark 2.5.2 for the nominal decrease, which could also be set to $f(x_k) - \varphi_k(y_{k+1})$;
the descent parameter $m$ can absorb the difference between the two possible $\delta_k$'s. Note
also that we subtract the optimal value of the stabilized problem from $f(x_k)$, not from
$\varphi_k(x_k)$. This is because the aggregation may make $\varphi_k(x_k) < f(x_k)$; a detail which will
be important when we establish convergence.
(iv) This algorithm is "semi-abstract", to the extent that the important choice of $t_k$ is left
unspecified. No wonder: we have not made the least progress towards intelligent choices
of the stability parameter, be it $\mu$, $\kappa$, $\ell$, $\varepsilon$ or whatever (see the end of §2.5). □

This algorithm requires individual access to $\hat s_k$ (the aggregate slope) and $\hat p_k$ (to compute
$y_{k+1}$ knowing $\hat s_k$), which are of course given by the minimality conditions in the stabilized
problem. Thus, instead of (3.1.1) = (3.1.6), one might prefer to solve a dual problem.

(a) Abstractly: formulate (3.1.1) as

$$\min_{y\in\mathbb{R}^n}\ \Big[\varphi_k(y) + \frac1{2t_k}\|y - x_k\|^2 + I_C(y)\Big]\,.$$

The conjugates of the three terms making up the above objective function are respectively
$\varphi_k^*$, $\frac{t_k}2\|\cdot\|^2 + \langle\cdot, x_k\rangle$ and $\sigma_C$. The dual problem (for example Fenchel's dual of §XII.5.4)
is then the minimization with respect to $(s,p) \in \mathbb{R}^n\times\mathbb{R}^n$ of

$$\varphi_k^*(s) + \sigma_C(p) - \langle s + p, x_k\rangle + \frac{t_k}2\|s + p\|^2\,,$$

or equivalently

$$[\varphi_k^*(s) - \langle s, x_k\rangle] + [\sigma_C(p) - \langle p, x_k\rangle] + \frac{t_k}2\|s + p\|^2\,.$$

Remark 3.1.6 Note in passing the rather significant role of $\mu_k = 1/t_k$, which appears as
more than a coefficient needed for numerical efficiency: multiplication by $\mu_k$ actually sends
a vector of $\mathbb{R}^n$ to a vector of the dual space $(\mathbb{R}^n)^*$.
Indeed, it is good practice to view $\mu_k$ as the operator $\mu_k I$, which could be more generally
a symmetric operator $Q : \mathbb{R}^n \to (\mathbb{R}^n)^*$. The same remark is valid for the ordinary gradient
method of Chap. II: writing $y = x - t s(x)$ should be understood as $y = x - Q^{-1}s(x)$. This
mental operation is automatic in the Newton method, in which $Q = \nabla^2 f(x)$. □

(b) More concretely: consider (3.1.6), where $C$ is a closed convex polyhedron as described
in (3.0.2). The corresponding conjugate functions $\varphi_k^*$ and $\sigma_C$ could be computed to specify
the dual problem of (a). More simply, we can also follow the example of §XII.3.4: formulate
(3.0.2), (3.1.6) as

$$\min\ \Big[r + \frac1{2t_k}\|y - x_k\|^2\Big]$$
$$r \ge f(x_k) - e_i + \langle s_i, y - x_k\rangle \quad\text{for } i = 1,\dots,\ell\,,$$
$$\langle a_j, y - x_k\rangle \le b_j - \langle a_j, x_k\rangle \quad\text{for } j = 1,\dots,m\,,$$

which uses the variable $y - x_k$. Taking $\ell$ multipliers $\alpha_i$ and $m$ multipliers $\gamma_j$, we set

$$s(\alpha) := \sum_{i=1}^\ell \alpha_i s_i\,, \quad a(\gamma) := \sum_{j=1}^m \gamma_j a_j\,, \quad
e(\alpha) := \sum_{i=1}^\ell \alpha_i e_i\,, \quad b(\gamma) := \sum_{j=1}^m \gamma_j\,[b_j - \langle a_j, x_k\rangle]$$

and we form the Lagrange function

$$L(y, r, \alpha, \gamma) = \frac1{2t_k}\|y - x_k\|^2 + \langle s(\alpha) + a(\gamma),\, y - x_k\rangle - e(\alpha) - b(\gamma) + r\Big(1 - \sum_{i=1}^\ell \alpha_i\Big) + \sum_{i=1}^\ell \alpha_i f(x_k)\,.$$

Its minimization with respect to $(y,r)$ gives the dual problem

$$\min\ \Big[\tfrac12 t_k\|s(\alpha) + a(\gamma)\|^2 + e(\alpha) + b(\gamma)\Big]\,,$$
$$\sum_{i=1}^\ell \alpha_i = 1\,, \quad \alpha_i \ge 0 \ \text{ for } i = 1,\dots,\ell\,, \quad \gamma_j \ge 0 \ \text{ for } j = 1,\dots,m\,,$$

from which we obtain $\hat s_k = \sum_{i=1}^\ell \alpha_i s_i$ and $\hat p_k = \sum_{j=1}^m \gamma_j a_j$, as well as $\hat e_k = \sum_{i=1}^\ell \alpha_i e_i$. We
obtain also a term not directly used by the algorithm:

$$\hat b_k := \sum_{j=1}^m \gamma_j\,[b_j - \langle a_j, x_k\rangle]\,;$$

from the transversality conditions,

$$\hat b_k = \langle\hat p_k, y_{k+1} - x_k\rangle = -t_k\langle\hat p_k, \hat s_k + \hat p_k\rangle\,.$$

Knowing that the essential operation is conjugation of the polyhedral function $\varphi_k + I_C$,
compare the above dual problem with Example X.3.4.3. Our "bundle" could be viewed as
made up of two parts: $(s_i, e_i)_{i=1,\dots,\ell}$, as well as $(a_j, b_j - \langle a_j, x_k\rangle)_{j=1,\dots,m}$; the multipliers
of the second part vary in the nonnegative orthant, instead of the unit simplex.

3.2 Convergence

Naturally, convergence of Algorithm 3.1.4 cannot hold for arbitrary choice of the
stepsizes tk' When studying convergence, two approaches are therefore possible: either
give specific rules to define tk - and establish convergence of the correspondirig
implementation - or give abstract conditions on {tk} for which convergence can be
established. We choose the second solution here, even though our conditions on {tk}
lack implementability. In fact, our aim is to demonstrate a technical framework, rather
than establish a particular result.
The two observations below give a feeling for "reasonable" abstract conditions.
Indeed:
308 xv. Primal Forms of Bundle Methods
(i) A small tk> or a large ILk> is dangerous: it might over-emphasize the role of the
stabilization, resulting in unduly small moves from Xk to Yk+ I. This is suggested
by analogy with Chap. II: the descent-test in Step 2 takes care oflarge stepsizes
but so far, nothing prevents them from being too small.
(ii) It is dangerous to make a null-step with large tk. Here, we are warned by the
convergence theory of Chap. IX: what we need in this situation is Sk + Pk --+ 0,
and this depends crucially on the boundedness of the successive subgradients
Sk> which in turn comes from the boundedness of the successive iterates Yk. A
large tk gives a Yk+1 far away.
First of all, we fix the point made in Note 3.1.5(ii).

Lemma 3.2.1 At each iteration ofAlgorithm 3.1.4, fPk ~ f and there holds

with

PROOF. At the first iteration, fPl is the (affine) cutting-plane functionof(1.0.1): fP) ~ f.
Assume recursively that fPk ~ f. Keeping Lemma 3.1.2 in mind, the compression of
the bundle at Step 4 replaces fPk by some other convex function "" ~ f. Then, when
appending at Step 5 the new piece
- n
f(Xk) - eHI + (SHh Y - Xk) =: fHI(Y) ~ f(y) forall Y E IR •

the model becomes fPk+ 1 = max {"". fH I} ~ f: the required minoration does hold
for all k.
Then add the inequalities

f(y) ~ fPk(Y) ~ f(Xk) + (Sk> Y - Xk) - ek [from Lemma 3.1.1]


o ~ (h. Y - Yk+.) = (h. Y - Xk) + tk(Pk. Pk + Sk)
to obtain the stated value of ek. o
In terms of the nominal decrease, we can also write (see Fig. 2.5.1 again)

(3.2.1)

If both {ek} and {Sk + h} tend to zero, any accumulation point of {Xk} will be
optimal (Proposition XI.4.l.1: the graph of (e. x) 1---+ (Je(f + Ic)(x) is closed).
However, convergence can be established even if {Xk} is unbounded, with a more
subtle argument, based on inequality (3.2.3) below.
As before, we distinguish two cases: if there are infinitely many descent steps,
the objective function decreases "sufficiently". thanks to the successive descent tests
(3.1.7); and if the sequence {Xk} stops, it is the bundling mechanism that does thejob.

Theorem 3.2.2 Let Algorithm 3.1.4 be applied to the minimization problem (3.0.1),
with the stopping tolerance Q= O. Assume that K is an infinite set.
3 A Class of Primal Bundle Algorithms 309

(i) If
(3.2.2)

then {Xk} is a minimizing sequence.


(ii) If, in addition, {tk} has an upper bound on K, and if (3.0.1) has a nonempty set
ofsolutions, then the whole sequence {Xk} converges to such a solution.

PROOF. [Preliminaries] Let k E K, so xk+ 1 = Yk+ I. With Y arbitrary in C, write

IIY - xk+11I 2 = lIy - xkll 2 + 2(y - Xko Xk - Xk+I) + IIXk - xk+11I 2 •


In view of (3. 1.2), the cross-product can be bounded above using Lemma 3.2.1 and,
with (3.2.1), we obtain

(3.2.3)

Now {f(Xk)} has a limit f* E [-00, +00[. If f* = -00, the proof is finished.
Otherwise the descent test (3.1.7) implies that

(3.2.4)

where we have used the fact that f(xk+l) - f(Xk) = Oifk ¢ K (the same argument
was used in Lemma 2.1.4).
[(i)] Assume for contradiction that there are y E C and 71 > 0 such that

f(y) ~ f(Xk) - 71 for all k E K,

so that (3.2.3) gives with this y:

lIy - xk+11I 2 ~ lIy - xkll 2 + 2tk(8k - 71)·

Because of (3.2.4), limteK 8k = 0; there is a ko such that


IIY - xk+11I 2 ~ lIy - xkll 2 - tk'f/ for ko ~ k E K.

Sum these inequalities over k E K. Remembering again that xk+ 1 = Xk if k ¢ K, the


terms lIy - xkll 2 cancel out and we obtain the contradiction

o~ lIy - Xko 112 - 71 L tk = -00.


ko ~keK

[(ii)] Now let i minimize f over C and use y = i in (3.2.3):

IIi - xk+11I 2 ~ IIi - xkll 2 + 2tk8k for all k E K. (3.2.5)

If {tk} is bounded on K, (3.2.4) implies that the series LkeK tk8k is convergent; once
again, sum the inequalities (3.2.5) over k E K to see that {Xk} is bounded. Extract some
310 xv. Primal Forms of Bundle Methods
cluster point; in view of (i), this cluster point minimizes f on C and can legitimately
be called i.
Given an arbitrary TJ > 0, take some kl large enough in K so that

L tk 8k ~ TJ/4.
kl ~keK

Perform once more on (3.2.5) the summation process, with k running in K from kl
to an arbitrary k2 ~ kl (k2 E K). We conclude:

IIi - xk2 +111 2 ~ IIi - xk l 1l 2 + 2 LkeKn{k k


1 •••• 2 } tk 8k ~ TJ. o
It is interesting to note the similarity between (3 .2.2) and the condition (XII.4.1.3),
establishing convergence of the basic subgradient algorithm (although the latter used
normalized directions). This rule is motivated by the case of an affine objective func-
tion: if f (x) = r + (s, x), we certainly have

To obtain f (Xk) -+ -00, we definitely need the "cumulated stepsizes" to be un-


bounded.

Apart from the conditions on the stepsize, the only arguments needed for Theorem 3.2.2
are: the definition (3.1.2) of the next iterate, the inequalities stated in Lemma 3.2.1, and
the descent test (3.1.7); altogether, the bundling mechanism is by no means involved. Thus,
consider a variant of the algorithm in which, when a descent has been obtained, Step 4
flushes the bundle entirely: e is set to 1 and the aggregate piece (it, ek) is simply left out.
Theorem 3.2.2 still applies to such an algorithm. Numerically, the idea is silly, though: it uses
the steepest-descent direction whenever possible, and we have insisted again and again that
something worse can hardly be invented.

To establish convergence of Algorithm 3.1.4, it remains to fix the case of infinitely


many consecutive null-steps: what happens if the stability center stops at some Xko? The
basic argument is then as in other bundle methods. The bundling mechanism forces the
nominal decreases {8k} to tend to 0 at a certain speed. In view of(3.2.1), 8k -+ 0 and,
providing that {tk} does not tend to 0 too fast, Sk + Pk -+ 0; then Lemma 3.2.1 proves
optimality of Xko' The key is to transmit the information contained in the aggregate
linearization p, which therefore needs to be indexed by k.

Lemma 3.2.3 Denote by

the aggregate linearization obtained at the kth iteration of Algorithm 3.1.4. For all
Y E C, there holds

(3.2.6)
3 A Class of Primal Bundle Algorithms 311

PROOF. Consider the quadratic function :v; := i:


+ 1/2tkll . -xkIl2; it is strongly
convex of modulus 1/ tk (in the sense of Proposition IY.l.l.2), and has at Yk+ 1 the
gradient Sk + (Yk+1 - Xk)/tk = -Pk. From Theorem VI.6.1.2,

Because Pk E Nc(Yk+I), the second term can be dropped whenever Y E C. On


the other hand, using various definitions, we have

the result follows. o

Theorem 3.2.4 Consider Algorithm 3.1.4 with the stopping tolerance Q= O. Assume
that K is finite: for some ko, each iteration k ~ ko produces a null-step. If

tk ~ tk-I for all k > ko (3.2.7)

L t2
_k_ =+00, (3.2.8)
k>ko tk-I

then Xko minimizes 1 on C.

PROOF. In all our development below, k > ko and we use the notation of Lemma 3.2.3.
[Step 1] Write the definition of the kth nominal decrease:

I(Xko) - 8k = CfJk(Yk+I) + 2:k IIYk+1 - Xko 112


~ CfJk(Yk+I) + 2tL. IIYk+1 - Xko 112 [because of (3.2.7)]

1:-1 (Yk+I) + 2tk_. IIYk+1 - Xko II [Step 4 implies rpk ;?; It-I]
- 1 2
~
~ I(Xko) - 8k-1 + 2tL.IIYk+1 - Ykll 2 . [from (3.2.6)]

Thus we have proved

8k + 2tL. IIYk+1 - Yk 112 ~ 8k-1 for all k > ko. (3.2.9)

[Step 2] In particular, {8k} is decreasing. Also, set Y = Xk = xko in (3.2.6): knowing


that CfJk ~ 1 (Lemma 3.2.1), we have

Hence
IIYk+1 - xkoll2 ~ 2tk8k ~ 2tko8ko =: R2,
the sequence {Yk} is bounded; L will be a Lipschitz constant for 1 and each CfJk on
B(Xko' R) and will be used to bound from below the decrease from 8k to 8k+1. Indeed,
Step 5 of the algorithm forces I(Yk) = CfJk(Yk), therefore
312 xv. Primal Forms of Bundle Methods
!(Yk+I) - (jlk(Yk+I) = !(Yk+I) - !(Yk) + (jlk(Yk) - (jlk(Yk+I) ~
(3.2.10)
~ 2LIIYk+1 - Ykll·

[Step 3] On the other hand, the descent test (3.1.7) is not satisfied:
!(Yk+I) > !(Xko) - m8k;

subtract the inequality

(jlk(Yk+I) = !(xko) - 8k - 2:k IIYk+1 - xkoll2 ~ !(Xko) - 8k


and combine with (3.2.10):

(1 - m)8k < !(Yk+I) - (jlk(Yk+I) ~ 2LIIYk+1 - Ykll.


Insert this in (3.2.9):
82
C_k_ < 8k-1 - 8k for all k > ko, (3.2.11)
tk-I

where we have set C := 1/2(I-m)2/(2L)2.


[Epilogue] Summing the inequalities (3.2.11):

C L 82
_k_ < 8ko <
k>ko tk-I
+00 .

In view of (3.2.7), (3.2.1), this implies 8k ~ 0, and also

L t2
-k-lisk + fikll 4 < +00.
k>ko tk-I

It remains to use (3.2.8): liminf IISk + fikll 4 = O. Altogether, Lemma 3.2.1 tells us
that 0 E a(f + IC)(Xko). 0

Remark 3.2.5 In this proof, as well as that of Theorem 3.2.2, the essential argument is that
lh -+ o. The algorithm does terminate ifthe stopping tolerance Qis positive; upon termination,
Lemma 3.2.1 then gives an approximate minimality condition, depending on the magnitude
of tk. Thus, a proof can also be given in the spirit of those in the previous chapters, showing
how the actual implementation behaves in the computer.
Note also that the proof gives an indication of the speed at which {8k} tends to 0: com-
pare (3.2.11) with Lemma IX.2.2.1. Numerically speaking, it is a nice property that the two
components ofthe "compound" convergence parameter

8k = 8k + 4tkllsk + Pkll 2
are simultaneously driven to o. See again Fig. XIY.4.1.3 and §XIY.4.4(b).
We mentioned in Remark IX.2.1.3 that a tolerance mil could help an approximate reso-
lution of the quadratic program for dual bundle methods. In an actual implementation of the
present algorithm, a convenient stopping criterion should also be specified for the quadratic
solver. We omit such details here, admitting that exact arithmetic is used for an exact resolu-
tion of(3.1.6). 0
3 A Class of Primal Bundle Algorithms 313

Both conditions (3.2.2), (3.2.8) rule out small stepsizes and connote (XII.4.1.3),
used to prove convergence of the basic subgradient algorithm; note for example that
(3.2.8) holds iftk = 1/ k. Naturally, the comments of Remark XII.4.1.5 apply again,
concerning the practical irrelevance of such conditions. The situation is even worse
here: because we do not know in advance whether k E K, the possibility of checking
(3.2.2) becomes even more remote. Our convergence results 3.2.2 and 3.2.4 are more
interesting for their proofs than for their statements. To guarantee (3.2.2), (3.2.8), a
simple strategy is to bound tk from below by a positive threshold!.. > O. In view of
(3.2.7), it is even safer to impose a fixed stepsize tk == t > 0; then we get Algo-
rithm 2.2.2, with a fixed penalty parameter f.L = 1/ t. However, let us say again that
no reasonable value for t is a priori available; the corresponding algorithm is hardly
efficient in practice.
On the other hand, (3.2.7) is numerically meaningful and says: when a null-step
has been made, do not increase the stepsize. This has a practical motivation. Indeed
suppose that a null-step has been made, so that the only change for the next iteration
is the new piece (sk+1> ek+l) appearing in the bundle. Increasing the stepsize might
well reproduce the iterate Yk+2 = Yk+1> in which case the next call to the black box
(UI) would be redundant.
Remark 3.2.6 We have already mentioned that (3.0.1) is the problem of minimizing the sum
f + Ie of two closed convex functions. More generally, let a minimization problem be posed
in the form
inf {f(x) + g(x) : x E jRn},
with f and g closed and convex. Several bundling patterns can be considered:
(i) First of all, f + g can be viewed as one single objective function h, to which the general
bundle method can be applied. Provided that h is finite everywhere, there is nothing
new so far. We recall here that the local Lipschitz continuity of the objective function is
technically important, as it guarantees the boundedness of the sequence {Sk}.
(ii) A second possibility is to apply the bundling mechanism to f alone, keeping g as it
is; this is what we have done here, keeping the constraint Y E C explicitly in each
stabilized problem. Our convergence results state that this approach is valid if f is finite
everywhere, while g is allowed the value +00.
(iii) When both f and g are finite everywhere, a third possibility is to manage two "decom-
posed" bundles separately. Indeed suppose that the black box (U1) of Fig. 11.1.2.1 is able
to answer individual objective-values fey) and g(y) and subgradient-values, say s(y)
and p(y) - rather than the sums fey) + g(y) and s(y) + p(y). Then f and g can be
modelled separately:
A(y):= max [f(Yi) + (S(Yi), Y - Yi)] ~ f(y) ,
i=I •...• k
8k(Y):= max [g(Yi) + (P(Yi), Y - Yi)] ~ g(y).
i=I ..... k
The resulting model is more accurate than the "normal" model
hk(Y):= max [(f + g)(Yi) + {(s + P)(Yi), Y - Yi)] ~ fey) + g(y) ,
i=I ..... k
simply because hk ~ ik + 8k (this is better seen if two different indices i and j are used
in the definition of i and 8: the maximum hk of a sum is smaller than the sum + 8kA
of maxima).
314 xv. Primal Forms of Bundle Methods
Exploiting this idea costs in memory (two additional subgradients are stored at each
iteration) but may result in faster convergence. 0

3.3 Appropriate Stepsize Values

Each iteration of Algorithm 3.1.4 offers a continuum of possibilities, depending on


the particular value chosen for the stepsize tk. As already said on several occasions,
numerical efficiency requires a careful choice of such values. We limit ourselves to
some general ideas here, since little is known on this question.
In this subsection, we simplity the notation, returning to the unconstrained frame-
work of §2 and dropping the (fixed) iteration index k. The current iterate x is given,
as well as the bundle {(Sj, ej)}, with ej ~ 0 and Sj E oeJ(x) for i = 1, ... We ,.e.
denote by y(t) (t > 0) the unique solution of the stabilized problem:

y(t) := argmin [9'(y) + ft Ily - x 112]


y

where 9' is the model

y ~ 9'(y) = f(x) + j=Imax


•...• e
[-ej + (Sj, y - x)].

Recall the fundamental formulae:


y(t) = x - ts(t), s(t) E oe(t)f(x) ,
(3.3.1)
e(t) = f(x) - 9'(y(t)) - (s(t), x - y(t») ,

where s(t) = L:f=1 (XjSj can be obtained from the dual problem: L1e c IRe being the
unit simplex,
(3.3.2)

(a) SmaU Stepsizes. When t ,!.. 0, the term L:f=1 (Xjej in (3.3.2) is pushed down to its
minimal value. Actually, this minimal value is even attained for finitely small t > 0:

Proposition 3.3.1 Assume that ej = 0 for some i. Then there exists! > 0 such that,
for t E ]0, !], s(t) solves

(3.3.3)

EXPLANATION. We do not give a formal proof, but we link: this result to the direction-
finding problem for dual bundle methods. Interpret l/t =: I.L in (3.3.2) as a multiplier
of the constraint in the equivalent formulation (see §2.4)

min
aELil
{4 11 L:f=1 (Xjsill 2 : L:f=1 (Xjej :s::; e} .
When e ,!.. 0 we know from Proposition XI\l.2.2.S that such a multiplier is bounded,
say by ~ =: l/t. 0
3 A Class of Primal Bundle Algorithms 315

The $(t) obtained from (3.3.3) is nothing but the substitute for steepest-descent,
considered in Chap. IX. Note also that we have $(t) E of (x): small stepsizes in
Algorithm 3.1.4 mimic the basic subgradient algorithm of §XII.4.1. Proposition 3.3.1
tells us that this sort of steepest-descent direction is obtained when t is small. We know
already from Chap. II or VIII that such directions are dangerous, and this explains why
t = tk should not be small in Algorithm 3.1.4: not only for theoretical convergence
but also for numerical efficiency.

(b) Large Stepsizes. The case t -+ +00 represents the other extreme, and a first
situation is one in which rp is bounded from below on IRn. In this case,
- being piecewise affine, rp has a nonempty set of minimizers (§3.4 in Chap. V or
VIII);
- whent -+ +00, y(t) has a limit which is the minimizerofrp that is closestto x; this
follows from Propositions 2.2.3 and 2.1.3. We will even see in Remark 4.2.6 that
this limit is attained for finite t, a property which relies upon the piecewise affine
character of rp.
Thus, we say: among the minimizers of the cutting-plane model (if there are
any), there is a distinguished one which can be reached with a large stepsize in Al-
gorithm 3.1.4. It can also be reached by the trust-region variant, with K = Koo of
Proposition 2.1.3.
On the other hand, suppose that rp is not bounded from below; then there is no
cutting-plane iterate, and no limit for y(t) when t -+ +00. Nevertheless, minimization
of the cutting-plane model can give something meaningful in this case too. In the next
result, we use again the notation Proj for the projection onto the closed convex hull
ofa set.

Proposition 3.3.2 For $(t) given by (3.3.2), there holds

$(t) -+ $(00) = Proj O/{s!. ... , sil when t -+ +00.

PROOF. With t > 0, write the minimality conditions of (3.3.2) (see for example
Theorem XIY.2.2.1):

(Sj, $(t») + tej ~ 1I$(t) 112 + te(t) for i = 1, ... , t


and let l/t ..).. O. Being convex combinations of the bundle elements, $(t) and e(t) are
bounded; e(t) / t -+ 0 and any cluster point $(00) of $(t) satisfies

(Sj, $(00») ~ 11$(00)11 2 for i = 1, ... , t.

Also, as $(00) is in the convex hull of {s!. ... , sil, the only possibilty forS(oo) is to
be the stated projection (see for example Proposition VIII.l.3.4). 0

Thus, assume $(00) =F 0 in the above result. When t -+ +00, y(t) is unbounded
but [y(t) - x]/t = $(t) converges to a nonzero direction $(00), the solution of
316 xv. Primal Forms of Bundle Methods

(3.3.4)

We recognize the direction-finding problem of a dual bundle method, with e set to


a large value, say e ~ maxj ej. The corresponding algorithm was called conjugate
subgradient in §XIV,4.3.

To sum up, consider the following hybrid way of computing the next iterate y+ in Algo-
rithm 3.1.4.
- If the solution $(00) of (3.3.4) is nonzero, make a line-search along -$(00).
- 1[$(00) = 0, take the limit as t ~ +00 of y(t) described by (3.3.1); in this case too, a
line-search can be made to cope with a y(oo) which is very far, and hence has little value.
The essence of this algorithm is to give a meaning to the cutting-plane iteration as much as
possible. Needless to say, it should not be recommended.

Remark 3.3.3 Having thus established a connection between conjugate subgradients and
cutting planes, another observation can be made. Suppose that f is quadratic; we saw in
§XIY.4.3 that, taking the directions of conjugate subgradients, and making exact line-searches
along each successive such direction, we obtained the ordinary conjugate-gradient algorithm
of §U.2.4. Keeping in mind Remark 1.1.1, we are bound to notice mysterious relationships
between piecewise affine and quadratic models of convex functions. 0

(c) On-Line Control of the Stepsize. Our development (a) - (b) above clearly shows
the difficulty of choosing the stepsize in Algorithm 3.1.4: small [resp. large] values
result in some form of steepest descent [resp. cutting plane]; both are disastrous in
practice. Once again, this difficulty is not new, it was with us all the way through
§XIv.4. To guide our choice, an on-line strategy can be suggested, based on the trust-
region principles explained in § 1.3, and combining the present primal motivations with
the dual viewpoint of Chap. XIv. For this, it suffices to follow the general scheme
of Fig. XIY.3 .2.1, say, but with the direction recomputed each time the stepsize is
changed.
Thus, the idea is to design a test (0), (R), (L) which, upon observation of the actual
objective function at the solution y(t) of (3.3.1), (3.3.2), decides:
(Od) This solution is convenient for a descent-step,
(On) or this solution is convenient for a null-step.
(R) This solution corresponds to a t too large.
(L) This solution corresponds to a t too small.
No matter how this test is designed, it will result in the following pattern:

Algorithm 3.3.4 (Curved-Search Pattern, Nonsmooth Case) The data are the ini-
tial x, the model ({J, and an initial t > O. Set tL = tR = O.
STEP 1. Solve (3.3.1), (3.3.2) and compute f(y(t» and s(y(t» E 8f(y(t)). Apply
the test (0), (R), (L).
STEP 2 (Dispatching). In case (0) stop the line-search, with either a null- or a descent-
step. In case (L) [resp. (R)] set tL = t [resp. tR = t] and proceed to Step 3.
4 Bundle Methods as Regularizations 317

STEP 3 (Extrapolation). IftR > 0 go to Step 4. Otherwise find a new t byextrapola-


tion beyond tL and loop to Step 1.
STEP 4 (Interpolation). Find a new t by interpolation in ]tL, tR[ and loop to Step 1.
o
Of course, the stabilizing philosophy of the present chapter should be kept. This
means that the descent test
f(y(t)) :::; f(x) - m [e(t) + !tlls(t)1I 2] (3.3.5)
should be a basis for the test (0), (R), (L):
(i) if it is satisfied, another test should tell us if we are in case (Od) or (L);
(ii) if it is not satisfied, another test should tell us if we are in case (On) or (R).
Note that, in cases (R) and (L), the next t will work with the same model <po Then,
computing some (generalized) derivative y' (t) by a parametric study of (3.3.2), the
way is open to interpolation formulae as in §II.3.4, for convenient computation of the
next t.
By contrast, the model will change in case (0); then this parametric study is no
longer relevant: the question of choosing tk+1 remains intact. In addition, just how
to design the additional tests (i), (ii) above is not totally clear. We will therefore not
elaborate on this approach any longer. It is again impeded by the ever present question
in bundle methods: when the current iterate is not suitable, and in particular when
(3.3 .5) does not hold, should we change the value of the parameter (be it t, /L, K, e .e,
or whatever), and if so, how; or should we enrich the model, or do both things at the
same time?
The end of this chapter is rather devoted to an interesting theoretical aspect of
bundle methods.

4 Bundle Methods as Regularizations

Consider one iteration of the primal bundle method of §3. Given the current iterate,
model and step size, we minimize
<p(y) + trlly - Xll2
with respect to y E IRn (assuming an unconstrained situation for simplicity). Now the
above optimal value can be viewed as a/unction of x E IR n , and we recognize in this
function the Moreau-Yosida regularization of <p, already seen on several occasions.
In fact, bundle methods and Moreau-Yosida regularization are intimately related
and the aim of this section is to explore this relation.

4.1 Basic Properties of the Moreau-Yosida Regularization

In this subsection, we collect and complete some results previously given concerning
the Moreau-Yosida regularization, seen from the point ofview of convex minimization.
In contrast with the previous sections, we consider now a general closed convex
318 xv. Primal Forms of Bundle Methods
function: in what follows,
If E Conv]Rn and M is a symmetric positive definite operator. I
The function
]Rn 3 x ~ fM(X) := min [fey) + !(M(y - x), y - x}] (4.1.1)
YElRn

is the Moreau-Yosida regularization of f, associated with M. Thus we allow qua-


dratic perturbations slightly more general than the mere II . 112 = (., .) of, say, Exam-
ple XI.3.4.4. It will sometimes be convenient to call
]Rn x ]Rn 3 (x, y) ~ g(x, y) := fey) + !(M(y - x), y - x}

the function appearing in (4.1.1).

Lemma 4.1.1 The minimization problem in (4.1.1) has a unique solution, charac-
terized as the unique point y E ]Rn satisfying
M(x - y) E af(y) . (4.1.2)

PROOF. For each x, the minimand g(x, .) is a strictly convex function; as such, it has
at most one minimum. On the other hand, f is minorized by some affine function;
g(x, .) is therefore I-coercive and, being also closed, it does have one minimum.
Now, because the quadratic term in g is finite everywhere, the calculus rule
XI.3.1.1 on the sum of two convex functions applies, and the subdifferential of g(x, .)
at y is
ayg(x, y) = af(y) + M(y - x)
(with the convention 0 + {s} = 0 for all s). Thus (4.1.2) represents the necessary and
sufficient minimality condition 0 E ayg(x, y), and has a solution which is the unique
minimizer of g(x, .). 0

Note here that the convexity of f is important but not its closedness. The function g(x, .)
would still be strongly convex even if f were not closed; and all minimizing sequences would
have the same unique limit point. Then nothing would be essentially changed if f were
replaced by cl f, in (4.1.1) and (4.1.2) as well.

Definition 4.1.2 (Proximal Point) We will extensively use the following system of
notation:
PM(X) := argmin [fey) + !(M(y - x), y - x}]
y
is called the proximal point of x (associated with f and M); x can be called the
proximal center;
(4.1.3)
is the particular subgradient of f at PM (x) defined via Lemma 4.1.1; we set
W:=M- 1 ,

so that there holds


(4.1.4)
o
4 Bundle Methods as Regularizations 319

It is important to understand here that s M (x) of (4.1.3) is a distinguished sub-


gradient of fat PM(X), which comes from the calculation in (4.1.1); it must not be
confused with the arbitrary subgradient S(PM(X» which would come from a black
box (Ul).

Interpretation 4.1.3 The operator M defines the scalar product ((x, x')) := (Mx, x'), with
its associated norm Illxlll := J((x, x)). For this norm, PM(X) is the projection of x onto a
certain sublevel-set of f, namely the one at the level f(PM(X». Indeed take Y such that
fey) ( f(PM(X». Combining with the subgradient inequality

and using (4.1.3), we obtain

((x - PM(X), y - PM (x»)) (0 for all y E Sf(PM(X»(f) .

Since PM(X) is obviously in Sf(PM(X»(f), we recognize the characterization of the asserted


projection (Theorem III.3.1.1).
Geometrically, Fig. 4.1.1 illustrates the construction with M = 1. Let a level f. be de-
creasing from the value f(x) and, for each f., take the projection Ye of x onto the sublevel-set
Sf (f). This Ye is characterized by the property x - Yf E NSe(f) (Ye). Depending on the value
of f., there are a number of possibilities:
- af(ye) may be empty; Se(f) may also be empty;
- in a "normal" case (af(ye) =1= 0 and f. strictly larger than the infimum of I), there is some
se E af(ye) which is collinear with x - ye, say se = t(x - Ye) with t ~ 0 (see Fig. VI. 1.3.2);
- when t = 1, we have just picked the correct level and Ye = PM(X);
- a consequence of Lemma 4.1.1 is that there is exactly one f.-value for which t = 1.

-"",,/"'" t(.) = t(PM(x))

,-'"
YR. L
,// - -------------------------------t(.) = t

Fig.4.1.1. Projecting x onto a sublevel-set

Note the similarity between this construction and the level-variant of §2.3. Note also
from Lemma 4.1.1 that PM (x) may be on the boundary of dom f but, even in this case, f
has a nonempty subdifferential at PM(X). 0

From now on, we will denote by

Amax(Q) := Al (Q) ~ ... ~ An(Q) =: Amin(Q) > 0

the eigenvalues of a symmetric positive definite operator Q, and we recall that W is


the inverse of the given M.
320 xv. Primal Forms of Bundle Methods
Theorem 4.1.4 Thefunction fM of (4.1.1) is finite everywhere, convex and differ-
entiable; its gradient is

(4.1.5)

Its conjugate is
IRn 3 S ~ fM(S) = f*(s) + t{s, Ws). (4.1.6)

Furthermore, there holds for all x and x' in IR n :

and

PROOF. Take an arbitrary Xo E dom f; we clearly have for all x E IRn

fM(X) ~ f(xo) + t{M(xo - x), Xo - x) < +00;


indeed fM is the inf-convolution of f with the quadratic function x ~ 1/2 (Mx, x),
differentiable and finite everywhere. Then f M is convex; its gradient is given by
(XI.3.4.4) and its conjugate is a sum (Corollary X.2.1.3).
Now take the scalar product OfW[SM(X) - SM(X')] with both sides of

use the monotonicity of af in (4.1.3) to obtain (4.1.7).


Finally, (4.1. 7) directly gives

Amin(W)lIsM(X) - SM(X') II ~ IIx - x'il


and (4.1.5) completes the proof. o

Note that 1M can also be viewed as a marginal function associated with the minimand
g (convex in (x, y) and differentiable in x); V1M can therefore be derived from (XI.3.3.7);
see also Corollary VI.4.5.3 and the comments following it. The global Lipschitz property of
V1M can also be proved with Theorem XA.2.1.

Thus, we definitely have a regularization of f, let us show that we have an approx-


imation as well; in words, the minimization (4.1.1) with a "big" operator M results in
PM(X) close to x.

Proposition 4.1.5 When Amin(M) -+ +00 the following convergence properties


hold:

fM(X) -+ f(x) for all x E IR n , (4.1.8)


PM(X) -+ x for all x E domf. (4.1.9)
4 Bundle Methods as Regularizations 321

PROOF. There exists an affine function minorizing I: for some (so, ro) E JRn x JR,

(so, y) - ro :::; I(y) for all y E JRn .

Use this inequality to minorize the left-hand side in

(4.1.10)

so that
(so, PM(X») - ro + !Amin(M)llpM(X) - X1I2:::; 1M (x) .
With some algebraic manipulations, we obtain

(4.1.11)

where we have set

r(M) := 2Am :lM) II So 112 + (so, x) - ro.

Note: {r(M)} is bounded when Amin(M) -+ +00.


Let x E dom I. Combine (4.1.11) with the inequality in (4.1.1 0) and divide
by Amin(M) to see that PM(X) -+ x when Amin(M) -+ +00; (4.1.9) is proved and
the lower semi-continuity of I implies that liminf I(PM(X» ~ I(x). Since there
obviously holds
(4.1.12)
(4.1.8) is also proved for x E dom I.
Nowletx fj dom I; we must prove that 1M (x) -+ +00. Assume for contradiction
that there are a sequence {Mk} with limk-++oo Amin(Mk) = +00, and a number R
such that
IMk(X):::;R fork=I,2, ...
Then (4.1.11) shows that PMk (x) -+ x as before. The lower semi-continuity of I and
(4.1.12) imply

which is the required contradiction. o


Now we pass to properties relating the minimization of I with that of 1M. We
start with a result useful for numerical purposes: it specifies the decrease of I when
its perturbation g (x, .) of (4.1.1) is minimized, instead of I itself.

Lemma 4.1.6 For all x E JRn,

PROOF. Use (4.1.4): the first relation is the definition of 1M; the second is the subgra-
dient inequality

o
322 xv. Primal Forms of Bundle Methods
Theorem 4.1.7 Minimizing I and 1M are equivalent problems, in the sense that

inf IM(X) = inf I(x) (4.1.14)


XElRn XERn

(an equality in ~U {-co}), and that the following statements are equivalent:
(i) x minimizes I;
(ii) PM(X) = x;
(iii) SM(X) = 0;
(iv) x minimizes 1M;
(v) I(PM(X» = I(x);
(vi) IM(X) = I(x).

PROOF. Observe from (4.1.6) that - I'M


(0) = -1*(0); this is just (4.1.14). To prove
our chain of equivalences, remember first that M and W are invertible.
When (i) holds, y = x minimizes simultaneously I and the quadratic term in
(4.1.1): (ii) holds. On the other hand, (ii) <=> (iii) because of (4.1.4), and (iii) <=> (iv)
because V1M (x) = SM(X). Now, if(iv) = (ii) holds, (v) is trivial; conversely, (4.1.13)
shows that (v) implies (iii) and (vi).
In summary, we have proved

(i) ==> (ii) {:::::::} (iii) {:::::::} (iv) {:::::::} (v) ==> (vi) .

Finally assume that (vi) holds; use (4.1.13) again to see that (iii) = (ii) holds; then

0= SM(X) E al(PM(X» = al(x). o

4.2 Minimizing the Moreau-Yosida Regularization

Theorem 4.1.7 gives a number of equivalent formulations for the problem of mini-
mizing I. Among them, (ii) shows that we must find a fixed point of the mapping
x 1-+ PM(X), which resembles a projection: see Remark 4. 1.3. As such, PM is nonex-
pansive (for the norm associated with «', .)))
and the iteration formula xk+ I = PM (Xk)
appears as a reasonable proposal.
This is known as the proximal point algorithm. Note that this algorithm can be
formulated with a varying operator M, say Xk+1 = PMk(Xk). In terms of minimizing
I, this still makes sense, even though the Moreau-Yosida interpretation disappears.
Algorithm 4.2.1 (proximal Point Algorithm) Start with an initial point XI E ~n
and an initial symmetric positive definite operator MI. Set k 1. =
STEP 1. Compute the unique solution y = xk+1 of

STEP 2. If xk+1 = Xk stop.


4 Bundle Methods as Regularizations 323

STEP 3. Choose a new symmetric positive definite operator Mk+ I. Replace k by k +1


and loop to Step 1. 0

Naturally, this is only a conceptual algorithm because we do not specify how


Step 1 can be performed. In addition, it may sound silly to minimize the pertur1:5ation
g(Xk • .) instead of I itself; said otherwise, the best idea should be to take MI = 0: then
no iteration would be needed. For the moment, our development is purely theoretical
and the next subsection will address numerical aspects.
We set Wk := M;;I and we recall from Lemma 4.1.1 that each iterate is charac-
terized by the relations

(4.2.1)

here the notation Sk replaces the former s Mk' Furthermore we have from (4.1.13)

l(xk+l) + (WkSko Sk) ~ I(Xk)


and we can write

For future use, we introduce the number

Ok := I(Xk) - IMk(Xk) = I(Xk) - l(xk+l) - 4(Sk. WkSk) ~ O. (4.2.2)

According to Theorem 4.1.7, Ok = 0 if and only if Xk minimizes I (or IMk); a key


issue is therefore to establish the property Ok -+ O.
In view of (4.1.9), taking ''big'' operators Mk is dangerous: then the iterates Xk
will bog down. On the other hand, "small" Mk'S seem safe, since the zero operator
yields immediately a minimum of I. This explains the convergence condition

L Amin(Wk) = +00.
00
(4.2.3)
k=1
It rules out unduly small [resp. large] eigenvalues of Wk [resp. Mk].

Lemma 4.2.2 Assume that 1* is ajinite number. Then L~I Ok < +00 and (4.2.3)
implies that 0 is a cluster point of {sd.

PROOF. From (4.2.2), we have for all k

Ok + 4Amin (Wk)IISk 112 ~ Ok + 4(Sk. WkSk) ~ I(Xk) - l(xk+l)

and by summation

L [Ok + 4Amin(Wk)llSkIl
00
2] ~ l(xl) - 1* .
k=1

If the right-hand side is finite, the two series L Ok and L Amin(Wk) IISk 112 are conver-
gent. If the convergence condition (4.2.3) holds, IISkll2 cannot stay away from zero.
o
324 xv. Primal Forms of Bundle Methods
Theorem 4.2.3 In Algorithm 4.2.1, assume that the convergence condition (4.2.3)
holds. If {Xk} is bounded, all its cluster points are minimizers of f.

PROOF. Because f is minorized by an affine function, boundedness of {Xk} implies


f* > -00. In view of Lemma 4.2.2, we can take a subsequence for which h -*
0; the corresponding subsequence {xk+d is bounded and we can extract from it
a further subsequence tending to some limit X*. Then (4.2.1) implies 0 E af(x*)
(Proposition VI.6.2.1). Thus f* is the minimal value of f, and the result follows since
f* is also the objective-value at any other cluster point of {xkJ. 0

Among other things, boundedness of {Xk} implies existence of a minimizer of


f. If we call tk the minimal eigenvalue of Wko the convergence condition (4.2.3)
has a familiar flavour: see Sections 3.2 and XII.4.1. However, it is not clear whether
this condition suffices to guarantee that f* is the infimum of f even when {xkJ is
unbounded. For this, we need to take Mk proportional to a fixed operator.

Theorem 4.2.4 Assume that Mk = I-tkM, with I-tk > 0 for all k, and M symmetric
positive difinite.
(i) If the convergence condition (4.2.3) holds, i.e. if

001
L-=+oo,
k=l I-tk

then {Xk} is a minimizing sequence.


(ii) If, in addition, {l-tk} has a positive lower bound, and iff has a nonempty set of
minimum points, then the entire sequence {Xk} converges to such a point.

PROOF. Our proof is rather quick because the situation is in fact as in Theorem 3.2.2.
The usual transportation trick in the subgradient inequality

gives
f(y) ~ f(Xk) + (Mk(Xk - Xk+l), y - Xk) - 8k (4.2.4)
where, using (4.2.2),

8k := f(Xk) - f(Xk+l) - (Sk, WkSk) = Ok - 2~k (Sko M-1sk) .

Denoting by ((u, v)) the scalar product (Mu, v) (III . I I will be the associated norm,
note that both are independent of k), we write (4.2.4) as

Use this to majorize the cross-product in the development


4 Bundle Methods as Regularizations 325

and obtain

Illy - xk+111 2 ~ lIlY - xkl~2 + 2 f (Y) - !(Xk) + 8k for all Y E]Rn •


ILk
Now, if !* = -00, the proof is finished. Otherwise, the above inequality estab-
lishes with Lemma 4.2.2 inequalities playing the role of (3.2.3) and (3.2.4); then copy
the proof of Theorem 3.2.2 with C = ]Rn and tk = 1/ IJ-k; the descent-set K is now
the whole ofN, which simplifies some details. 0

Another interesting particular case is when the proximal point Algorithm 4.2.1
does terminate at Step 2 with Sk = 0 and an optimal Xk.

Proposition 4.2.5 Assume that ! has a finite infimum and satisfies the following
property:

311 > 0 such that a!(x) n B(O, 11) =f. '" :::::::} x minimizes! . (4.2.5)

lfthe convergence condition (4.2.3) holds, the stop in Algorithm 4.2.1 occurs for some
k.

PROOF. Lemma 4.2.2 guarantees that the event IIsk II ~ 11 eventually occurs, implying
optimality of xk+ I. From Theorem 4.1.7, the algorithm will stop at the next iteration.
o
Observe the paradoxical character of the above statement: finite termination does not
match divergence ofthe series E A.min (Wk)! Remember that the correct translation of (4.2.3)
is: the matrices Wk are computed in such a way that
k(R)
VR ~ 0, 3k(R) e N* such that L A.min(Wk) ~ R.
k=1

We have purposedly used a somewhat sloppy statement, in order to stress once more that
properties resembling (4.2.3) have little relevance in reality.

Remark 4.2.6 The meaning of (4.2.5) deserves comment. In words, it says that all subgra-
dients of! at all nonoptimal points are uniformly far from O. In particular, use the notation
s(x) := Proj O/B!(x) for the subgradientof I atx that has least Euc1ideannorm. If I satisfies
(4.2.5), s(x) is either 0 (x optimal) or larger than rJ in norm (x not optimal). In the latter case,
I can be decreased locally from x at a rate at least rJ: simply move along the steepest-descent
direction -s (x). We say in this case that I has a sharp set of minima.
Suppose for example that I is piecewise affine:

l(x)=max{(Sj,x)-bj: j=l, ... ,m} forallxelRn .

Taking an arbitrary nonempty subset J C {I, ... , m}, define

s] := Proj O/{Sj : j e J}

to obtain 2m - 1 vectors s]; all the possible s(x) are taken from this finite list. Setting
326 xv. Primal Forms of Bundle Methods

Tj := min {lIsJ II : sJ ;l: o} ,


we therefore see that (4.2.5) holds for this particular Tj > o. Conclusion: when (4.2.3) holds,
the proximal point algorithm terminates for piecewise affine functions that are bounded from
below. 0

Returning to our original problem of minimizing I, the proximal point algorithm


is based on the formulation (ii) in Theorem 4.1.7; but formulations in terms of convex
minimization can also be used. First of all, observe from (4.2.1) that Algorithm 4.2.1
is an implicit gradient method (preconditioned by Wk, and with unit stepsizes) to
minimize f.
Formulation 4.1.7(iv) is another alternative. To use it conveniently, take a fixed
operator Mk = M in Algorithm 4.2.1; in view of (4. 1.5), the update formula written
as
Xk+l = Xk - M-hvIM(Xk)
shows that we also have an explicit gradient method (preconditioned by M- 1, and
with unit stepsizes) to minimize 1M.
On the other hand, the Moreau-Yosida regularization is very smooth: its gradient
is Lipschitzian; so any method of Chap. II can be chosen, in particular the powerful
quasi-Newton methods of §II.2.3. This is attractive but not quite straightforward:
- 1M cannot be computed explicitly, so actual implementations can only mimic such
methods, with suitable approximations of 1M and VIM;
- the possibility of having a varying operator Mk opens the way to intriguing interpre-
tations, in which the objective function becomes IMk and depends on the iteration
index.
These ideas will not be developed here: we limit our study to a possible algorithm
computing 1M.

4.3 Computing the Moreau-Yosida Regularization

In this subsection, x and M are fixed; they can be for example Xk and Mk at the
current iteration of the proximal point Algorithm 4.2.1. We address the problem of
minimizing the perturbed objective function of (4.1.1)

]Rn 3 y ~ g(x, y) := I(y) + t(M(y - x), y - x)

to obtain the optimal y = PM(X). Since we turn again to numerical algorithms, we


assume that our convex function I is finite everywhere; the above g is therefore a
convex finite-valued function as well. The minimization of g would not necessitate
particular comments, except for the fact that our eventual aim is to minimize I, really.
Then several possibilities are available:
- The basic subgradient algorithm of §XII.4.1 is not very attractive: the only difference
between I and g is the l-coercivity of the latter, a property which does not help this
method much. In other words, a subgradient algorithm should be more fruitfully
applied directly to I.
4 Bundle Methods as Regularizations 327

- The basic cutting-plane algorithm of §XII.4.2 is still impeded by the need for com-
pactness.
- Bundle methods will take cutting-plane approximations for g(x, .), introduce a sta-
bilizing quadratic term, and solve the resulting quadratic program. This is somewhat
redundant: a quadratic perturbation has already been introduced when passing from
f to g(x, .).
- Then a final possibility suggests itself, especially if we remember Remark 3.2.6(ii):
apply the cutting-plane mechanism to f only, obtaining a "hybrid" bundle method
in penalized form.
- This last technique can also be seen as an "auto-stabilized" cutting-plane method,
in which the quadratic term in the objective function g(x,·) is kept as a stabilizer
(preconditioned by M). Because no artificial stabilizer is introduced, the stability
center x must not be moved and the descent test must be inhibited, so as to take
enough "null-steps", until fM(X) = ming(x, .) is reached. Actually, the quadratic
term in g plays the role of the artificial C introduced for (1.0.2).
For consistency with §3, the index k will still denote the current iteration of the
resulting algorithm. The key is then to replace f in the proximal problem (4.1.1) by
a model-function CPk satisfying
CPk::;; f; (4.3.1)
also, CPk is piecewise affine, but this is not essential. Then the next iterate is the proximal
point of x associated with CPk:

Yk+1 := argmin [CPk(Y) + !(M(y - x), Y - x)] . (4.3.2)

According to Lemma 4.1.1, we have (once again, W = M- 1)


(4.3.3)

and the aggregate linearization

(4.3.4)

will be useful for the next model CPk+ I. With respect to §4.2, beware that Sk does not
refer to the proximal point of some varying Xk associated with f: here the proximal
center x is fixed, it is the function CPk that is varying.
We obtain an algorithm which is implementable, in the sense that it needs only a
black box (U 1) that, given Y E ]Rn, computes f (y) and s (y) E af (y).

Algorithm 4.3.1 (Hybrid Bundle Method) The preconditioner M and the proximal
center x are given. Choose the initial model CPI ::;; f, a convex function which can be
for example
Y 1-+ CPI(Y) = f(x) + (s(x), Y - x),

and initialize k = 1.
STEP 1. Compute Yk+1 from (4.3.2) or (4.3.3).
328 xv. Primal Forms of Bundle Methods
STEP 2. If !(Yk+I) = fPdYk+I) stop.
STEP 3. Update the model to any convex function fPk+1 satisfYing (4.3.1), as well as

(4.3.5)

and

(4.3.6)

Replace k by k + 1 and loop to Step 1. o

This algorithm is ofcourse just a form ofAlgorithm 3.1.4 with a few modifications:
- C = IRn; a simplification which yields Pk = o.
- The notation is "bundle-free": no list of affine functions is assumed, and the succes-
sive models are allowed more general forms than piecewise affine; only the essential
properties (4.3.1), (4.3.5), (4.3.6) are retained.
- The stabilizing operator M is fixed but is not proportional to the identity (a negligible
generalization).
- The proximal center Xk = x is never updated; this can be simulated by taking a very
large m > 0 in (3.1.7).
- The stopping criterion in Step 2 has no reason to ever occur; in view of (4.3.1), it
means that ! coincides with its model fPk at the "trial proximal point" Yk+ I .

Proposition 4.3.2 IfAlgorithm 4.3.1 terminates at Step 2, Yk+1 = PM(X).

PROOF. When the stop occurs, Yk+1 satisfies by construction

g(x, Yk+I) =fPk(Yk+I) + !(M(Yk+1 - x), Yk+1 - x) ~


~ fPk(Y) + !(M(y - x), Y - x) ~ g(x, y) for all Y E IRn. 0

A more realistic stopping criterion could be

(4.3.7)

for some positive tolerance Q. The whole issue for convergence is indeed the property
!(Yk+I) - fPk(Yk+I) .... 0, which we establish with the tools of §3.2. First of all,
remember that nothing is changed if the objective function of (4.3 .2) is a posteriori
replaced by
Y ~ Yk(Y) := 1k(y) + !(M(y - x), Y - x) .

In fact, Yk is strongly convex and the definition (4.3.4) of 1k gives 0 = V'Yk(Yk+I)·


Lemma 4.3.3 For all k and all Y E IRn, there holds
4 Bundle Methods as Regularizations 329

PROOF. Use the simple identity, valid for all u, v in IRn and symmetric M:

~(M(u + v), u + v) = ~(Mu, u) + (Mu, v) + ~(Mv, v).


Take u = YHI - X, V =Y- YHI and add to the equality (4.3.4) to obtain the result.
o
This equality plays the role of (3 .2.6); it allows us to reproduce Steps 1 and 2 in
the proof of Theorem 3.2.4.

Theorem 4.3.4 For f : IRn ~ IR convex and M symmetric positive definite, the
sequence {Yk} generated by Algorithm 4.3.1 satisfies

f(YHI) - (jIk(YHI) ~ 0, (4.3.8)

Yk ~ PM(X).

PROOF. We assume that the stop never occurs, otherwise Proposition 4.3.2 applies. In
view of (4.3 .5),
l k-
I (YHI) ~ (jIk(YHI) for all k > 1 .
Add 1/2 (M(YHI -X), YHI - X) to both sides and use Lemma 4.3.3 withk replaced
by k - 1 to obtain

Yk-I (Yk) + ~(M(Yk+1 - Yk), Yk+1 - Yk) ~ Yk(YHI) ~ f(x) for k > 1.

The sequence {Yk(YHI)} is increasing and YHI - Yk ~ o.


Now set Y = X in Lemma 4.3.3:

~(M(x - YHI),x - YHI) ~ Yk(X) - Yk(YHI) ~ f(x) - YI(Y2)

so that {Yk} is bounded. Then use (4.3.6) at the (k - 1)S! iteration:

subtracting f(YHI) from both sides and using a Lipschitz constant L for f around
x,

(4.3.8) follows.
Finally extract a convergent subsequence from {Yk}; more precisely, let KeN
be such that YHI ~ Y* for k ~ +00 in K. Then (jIk(YHI) ~ f(y*) and, passing
to the limit in the subgradient inequality

(jIk(YHI) + (M(x - YHI), Y - YHI) ~ (jIk(Y) ~ f(y) for all Y E ]Rn

shows that M(x - Y*) E af(y*). Because of Lemma 4.1.1, Y* = PM(X). 0

We terminate with an important comment concerning the stopping criterion. Re-


member that, for our present concern, PM (x) is computed only to implement an
330 xv. Primal Forms of Bundle Methods
"outer" algorithm minimizing f, as in §4.2. Then it is desirable to stop the "inner"
Algorithm 4.3.1 early if x is far from a minimizer of f (once again, why waste time
in minimizing the perturbed function g(x, .) instead of f itself?). This implies an ad
hoc value 8 in (4.3.7).
Adding and subtracting f(x), write (4.3.7) as

which can be viewed as a descent criterion. Under these circumstances, we can think of
a Qdepending on k, and comparable to f (x) - CPk (Yk+ [) (a number which is available
when Step 2 is executed). For example, with some coefficient m E ]0, 1[, we can take

in which case (4.3.7) becomes

We recognize a descent test used in bundle methods: see more particularly the trust-
region Algorithm 2.1.1. The descent iteration of such a bundle method will then update
the current stability center x = Xk to this "approximate proximal point" Yk+ [.
Conclusion: seen from the proximal point of view of this Section 4, bundle meth-
ods provide three ingredients:
- an inner algorithm to compute a proximal point, based on cutting-plane approxima-
tions of f;
- a way of stopping this internal algorithm dynamically, when a sufficient decrease is
obtained for the original objective function f;
- an outer algorithm to minimize f, using the output of the inner algorithm to mimic
the proximal point formula.
Furthermore, each inner algorithm can use for its initialization the work performed
during the previous outer iteration: this is reflected in the initial CPt of Algorithm 4.3.1,
in which all the cutting planes computed from the very beginning of the iterations can
be accumulated.
This explains our belief expressed at the very end of §2.5: the penalized form is
probably the most interesting variant among the possible bundle methods.
Bibliographical Comments

Just as we did with the first volume, let us repeat that [159] is a must for convex
analysis in finite dimension. On the other hand, we recommend [89] for an exhaustive
account of bundle methods, with the most refined techniques concerning convergence,
and also various extensions.

Chapter IX. The technique giving birth to the bundling mechanism can be contrasted
with an old separation algorithm going back to [1]. The complexity theory that is
alluded to in §2.2(b) comes from [133] and our Counter-example 2.2.4 is due to
A.S. Nemirovskij.
It is important to know that this bundling mechanism can be extended to noncon-
vex functions without major difficulty (at least theoretically). The approach, pioneered
in [55], goes as follows. First, one considers locally Lipschitzian functions; denote
by af(x) the Clarke generalized gradient of such a function f at x ([36,37]). This
is jus the set co y f(x) of §VI.6.3(a) and would be the subdifferential of f at x if f
were convex. Assume that some s (y) E af (y) can be computed at each y, just as in
the convex case. Given a direction d (= dk), the line-search 1.2 is performed and the
interesting situation is when t .,j, 0, with no descent obtained; this may happen when

·
11m sup
f(x + td) - f(x)
~
0
.
t,\,o t

As explained in §l.3(a), the key is then to produce a cluster point s* of {s(x + td)h
satisfying (s*, d) ~ 0; to be on the safe side, we need

liminf(s(x + td), d} ~ o.
t,\,o

If f is convex, (*) automatically implies (**). If not, the trick is simply to set the
property "( *) => (**)" as an axiom restricting the class of Lipschitz functions that
can be minimized by this approach.
A possible such axiom is for example

· sup f(x
11m + td) - f(x)
~
1·lmill
. f (s(x + t d ), d },
t,\,o t t,\,o

resulting in the rather convenient classes of semi-smooth functions ofR. Mifilin [122].
On the other hand, a minimal requirement allowing the bundling mechanism to work
332 Bibliographical Comments

is obtained if one observes that, in the actual line-search, the sequences {tk} giving
f (x + tkd) in (*) and s (x + tkd) in (**) are just the same. What is needed is simply
the axiom defined in [24]:

.
hmsup
[f(X + td) - f(x)
- (s(x + td), d) ] ~ o.
t,j..o t
Convex quadratic programming problems are usually solved by one universal
pivoting technique, see [23]. The starting idea is that a solution can be explicitly
computed if the set of active constraints is known (solve a system oflinear equations).
The whole technique is then iterative:
- at each iteration, a subset J of the inequality constraints is selected;
- each constraint in J is replaced by an equality, the other constraints being discarded;
- this produces a point XJ, whose optimality is tested;
- if x J is not optimal, J is updated and the process is repeated until a correct J* is
identified, yielding a solution of the original problem.

Chapter X. As suggested from the introduction to this chapter, the transformation


f 1-+ f* has its origins in a publication of A. Legendre (1752-1833), dated from 1787;
remember also Young's inequality in one dimension (see the bibliographical comments
on Chap. I). Since then, this transformation has received a number of names in the
literature: conjugate, polar, maximum transformation, etc. However, it is now generally
agreed that an appropriate terminology is Legendre-Fenchel transform, which we
precisely adopted in this chapter. Let us mention that w. Fenchel (1905-1988) wrote
in a letter to C. Kiselman, dated March 7, 1977: "I do not want to add a new name,
but if I had to propose one now, I would let myself by guided by analogy and the
relation with polarity between convex sets (in dual spaces) and I would call it for
example parabolic polarity". Fenchel was influenced by his geometric (projective)
approach, and also by the fact that the "parabolic" function f : x 1-+ f (x) = 1/2 II X 112
is the only one that satisfies f* = f. In our present finite-dimensional context, it is
mainly Fenchel who studied "his" transformation, in papers published between 1949
and 1953.
The essential part of § 1.5 (Theorem 1.5.6) is new. The close-convexification, or

J:
biconjugacy, of a function arises naturally in variational problems ([49, 81]): mini-

J:
mizing an objective function like l(t, x(t), i(t»dt is related to the minimization
of the "relaxed" form col(t, x(t), i(t»dt, where col denotes the closed convex
hull of the function l(t, x, .). The question leading to Corollary 1.5.2 was answered
in [43], in a calculus of variations framework; a short and pedagogical proof can be
found in [74]. Lemma 1.5.3 is due to M. Valadier [180, p.69]; the proof proposed
here is more legible and detailed. The results of Proposition 1.5.4 and Theorem 1.5.5
appeared in [64].
The calculus rules of §2 are all rather classical; two of them deserve a particular
comment: post-composition with an increasing convex function, and maximum of
functions. The first is treated in [94] for vector-valued functions. As for max-functions,
we limited ourselves here to finitely many functions but more general cases are treated
Bibliographical Comments 333

similarly. In fact, consider the following situation (cf. §VI.4A): T is compact in some
metric space; f : T x IRn -+ IR satisfies
f (t, .) =: ft is a convex function from IRn to IR for all t E T
f(', x) is upper semi-continuous on T for all x E IRn ;
hence f(x) := maxtET ft(X) < +00 for all x E IR n .
As already said in §VI.4A, f is thenjointly upper semi-continuous on T x IRn, so that
1* : (t, s) H- 1*(t, s) := Ut)*(s) is jointly lower semi-continuous and it follows
that cP := mintET f*(t, .) is lower semi-continuous; besides, cp is also I-coercive
since cp* = f is finite everywhere. Altogether, the results of § 1.5 can be applied to
cp: with the help of Proposition 1.5 A, f* = co cp = co cp can be expressed in a way
very similar to Theorem 204.7. We have here an alternative technique to compute the
subdifferential of a max-function (Theorem VI.4A.2).
The equivalence (i) {} (ii) of Theorem 3.2.3 is a result of T.S. Motzkin (1935);
our proof is new. Proposition 3.3.1 was published in [120, §II.3]. Our Section4.2(b) is
partly inspired from [63], [141]: Corollary 4.2.9 can be found in [63], while J.-P. Penot
was more concerned in [141] by the "one-sided" aspect suitable for the unilateral world
of convex analysis, even though all his functions CPs were convex quadratic forms. For
Corollary 4.2.10, we mention [39]; see also [170], where this result is illustrated in
various domains of mathematics.

Chapter XI. Approximate subgradients appeared for the first time in [30, §3]. They
were primarily motivated by topological considerations (related to their regularization
properties), and the idea of using them for algorithmic purposes came in [22]. In [137,
138,139], E.A. Nurminskii used(1.3.5)to design algorithms for convex minimization;
see also [111], and the considerations at the end of [106]. Approximate subgradients
also have intriguing applications in global optimization. Let f and g be two functions
in ConvlRn. Then x minimizes (globally) the difference f - g if and only if aeg(x) c
aef(x) for all £ > O. This result was published in [75] and, to prove its sufficiency
part, one can for example start from (1.3.7).
The characterization (1.3.4) of the support function of the graph of the multifunction $\varepsilon \mapsto \partial_\varepsilon f(x)$ also appeared in [75]. The representation of closed convex functions via approximate directional derivatives (Theorem 1.3.6) was published independently in [73, §2] and [111, §1]. As for the fundamental expression (2.1.1) of the approximate directional derivative, it appeared in [131, p. 67] and [159, pp. 219-220], but with different proofs. The various properties of approximate difference quotients, detailed in our §2.2, come from [73, §2].
Concerning approximate subdifferential calculus (§3), let us mention that general results, with vector-valued functions, had been announced in [95]; the case of convex functions with values in $\mathbb{R} \cup \{+\infty\}$ is detailed in the survey paper [72], which inspired our exposition. Qualification assumptions can be avoided to develop calculus rules of a different nature, using the (richer) information contained in $\{\partial_\eta f_i(x) : 0 < \eta \leqslant \bar\eta\}$, instead of the mere $\partial_\varepsilon f_i(x)$. For example, with $f_1$ and $f_2$ in $\operatorname{Conv} \mathbb{R}^n$,
\[
\partial (f_1 + f_2)(x) = \bigcap_{0 < \eta \leqslant \bar\eta} \operatorname{cl}\bigl[\partial_\eta f_1(x) + \partial_\eta f_2(x)\bigr],
\quad \text{where } \bar\eta > 0 \text{ is arbitrarily small.}
\]
This formula emphasizes once more the smoothing or "viscosification" effect of approximate subdifferentials. For a more concrete development of approximate minimality conditions, alluded to in §3.1 and at the end of §3.6, see [175].
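As an illustration of the displayed intersection formula (the example is our own, not taken from the references): on $\mathbb{R}$, take $f_1(x) = -\sqrt{x}$ for $x \geq 0$ ($+\infty$ otherwise) and $f_2 = \delta_{(-\infty,0]}$, so that $f_1 + f_2 = \delta_{\{0\}}$ and $\partial(f_1 + f_2)(0) = \mathbb{R}$, while $\partial f_1(0) = \emptyset$ and the exact sum rule is helpless. An elementary computation gives
\[
\partial_\eta f_1(0) = \Bigl(-\infty,\, -\tfrac{1}{4\eta}\Bigr], \qquad \partial_\eta f_2(0) = [0, +\infty),
\]
so $\operatorname{cl}[\partial_\eta f_1(0) + \partial_\eta f_2(0)] = \mathbb{R}$ for every $\eta > 0$, and the intersection formula does return $\partial(f_1 + f_2)(0)$.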
The local Lipschitz property of $\partial_\varepsilon f(\cdot)$, with fixed $\varepsilon > 0$, was observed for the first time in [136]; the overall formalization of this result, and the proof used here, were published in [71]. Passing from a globally Lipschitzian convex function to an arbitrary convex function was anyway the motivation for introducing the regularization-approximation technique via the inf-convolution with $\varepsilon\|\cdot\|$ ([71, §2]). Theorem 4.2.1 comes from [30]. The transportation formula (4.2.2), and the neighborhoods of (4.2.3), were motivated by the desire to make the algorithm of [22] implementable (see [100]).
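For completeness, let us recall the standard facts behind this Lipschitz-type regularization (stated here in our notation, which need not be that of [71]): for $f \in \operatorname{Conv} \mathbb{R}^n$ minorized by an affine function whose slope has norm at most $c$,
\[
(f \,\Box\, c\|\cdot\|)(x) = \inf_{y \in \mathbb{R}^n} \{ f(y) + c\,\|x - y\| \}
\]
is finite everywhere and globally Lipschitzian with constant $c$, and it increases pointwise to (the closure of) $f$ as $c \uparrow +\infty$.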

Chapter XII. References to duality without convexity assumptions are not so common; we can mention [32] (especially for §3.1), [52], [65]; see also [21]. However, there is a wealth of works dealing with Lagrangian relaxation in integer programming; a milestone in this subject was [68], which gave birth to the test-problems TSP of IX.2.2.7.
The subgradient method of §4.1 was discovered by N.Z. Shor in the beginning
of the sixties, and its best account is probably that of [145]; for more recent de-
velopments, and in particular accelerations by dilation of the space, see [171]. Our
proof of convergence is copied from [134]. The original references to cutting planes
are the two independent papers [35] and [84]; their column-generation variant in a
linear-programming context appeared in [40].
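For the reader who wishes to experiment, here is a minimal computational sketch of such a subgradient scheme (our own illustration: the test function, the divergent-series stepsizes $t_k = c/(k+1)$ with $\sum t_k = +\infty$ and $t_k \to 0$, and all names are our choices, not reproduced from [145] or [134]):

    import numpy as np

    def subgradient_method(f, subgrad, x0, iters=2000, c=1.0):
        # Divergent-series stepsizes t_k = c/(k+1): sum t_k = +inf, t_k -> 0.
        # The iterate moves along a normalized (arbitrary) subgradient; f is
        # not forced to decrease, so we record the best value seen so far.
        x = np.asarray(x0, dtype=float)
        best = f(x)
        for k in range(iters):
            s = subgrad(x)
            step = c / (k + 1)
            x = x - step * s / max(np.linalg.norm(s), 1e-12)
            best = min(best, f(x))
        return best

    f = lambda x: np.abs(x).sum()        # a simple nonsmooth convex function
    subgrad = lambda x: np.sign(x)       # one element of its subdifferential
    print(subgradient_method(f, subgrad, x0=[3.0, -2.0]))  # close to min f = 0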
The views expressed in §5 go along those of [163]. The most complete theory of augmented Lagrangians is given in the works of D.P. Bertsekas, see for example [20]. Dualization schemes when constraints take their values in a cone are also explained in [114]. Several authors have related Fenchel and Lagrange duality schemes. Our approach is inspired from [115].

Chapters XIII and XIV. The $\varepsilon$-descent algorithm, going back to [100], is the ancestor of bundle methods. It was made possible thanks to the work [22], which was done at the same time as the speculations detailed in Chap. IX; the latter were motivated by a particular economic lot-sizing problem of the type XII.1.2(b), coming from the glass industry. This observation points out once more how applied mathematics can be a delicate subject: the publication of purely theoretical papers like [22] may sometimes be necessary for the resolution of apparently innocent applied problems. Using an idea of [116], the method can be extended to the resolution of variational inequalities.
Then came the conjugate-subgradient form (§XIV.4.3) in [101] and [188], and Algorithm XIV.3.4.2 was introduced soon after in [102]. From then on, the main works concerning these methods dealt with generalizations to the nonconvex case and the treatment of constraints, in particular by R. Mifflin [123]. At that time, the similarity with conjugate gradients (Remark II.2.4.6) was felt as a key in favour of conjugate subgradients; but we now believe that this similarity is incidental. We mention here that
conjugate gradients might well become obsolete for "smooth" optimization anyway, see [61, 113] among others. In fact, it is probably with interior point methods, as in [62], that the most promising connections of bundle methods remain to be explored.
The special algorithm for nonsmooth univariate minimization, alluded to in Remark XIII.2.1.2, can be found in [109]. It is probably the only existing algorithm that converges globally and superlinearly in this framework. Defining algorithms having such qualities for several variables is a real challenge; it probably requires first an appropriate definition of second-order differentiation of a convex function. These questions have been pending for the last two decades.

Chapter XV. Primal bundle methods appeared in [103], after it was realized from [153] that (dual) bundle methods were connected with sequential quadratic programming (see the bibliographical comments of Chap. VIII). R. Mifflin definitely formalized the approach in [124], with an appropriate treatment of non-convexity. In [88], K.C. Kiwiel gave the most refined proof of convergence and proved finite termination for piecewise affine functions. Then he proposed a wealth of adaptations to various situations: noisy data, sums of functions, ...; see for example his review [90]. J. Zowe contributed a lot to the general knowledge of bundle methods [168]. E.A. Nurminskii gave an interesting interpretation in [137, 138, 139], which can be sketched as follows: at iteration $k$,
- choose a distance in the dual graph space;
- choose a certain point of the form $p_k = (0, r_k)$ in this same space;
- choose an approximation $E_k$ of $\operatorname{epi} f^*$; more specifically, take the epigraph of $(\check{f}_k)^*$;
- project $p_k$ onto $E_k$ in the sense of the distance chosen (which is not necessarily a norm); this gives a vector $(s_k, \sigma_k)$, and $-s_k$ can be used for a line-search.
The counter-example in §1.1 was published in [133, §4.3.6]; for the calculations that it needs in Example 1.1.2, see [18] for example. In classical "smooth" optimization, there is a strong tendency to abandon line-searches, to the advantage of the trust-region technique alluded to in §1.3 and overviewed in [127].
The trust-region variant of §2.1 has its roots in [117]. The level variant of §2.3 is
due to A.S. Nemirovskij and Yu. Nesterov, see [110]. As for the relaxation variant, it
strongly connotes and generalizes [146]; see [46] for an account of the Gauss-Newton
and Levenberg-Marquardt methods.
Our convergence analysis in §3 comes from [91]. Several researchers have felt
attracted by the connection between quadratic models and cutting-plane approximations (Remark 3.3.3): for example [185], [147], [92].
The Moreau-Yosida regularization is due to K. Yosida for maximal monotone
operators, and was adapted in [130] to the case of a subgradient mapping. The idea of
exploiting it for numerical purposes goes back to [14, Chap. V] for solving ill-posed
systems of linear equations. This was generalized in [119] for the minimization of
convex functions, and was then widely developed in primal-dual contexts: [161] and
its derivatives. The connection with bundle methods was realized in the beginning of
the eighties: [59], [11].
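As an elementary illustration of the object itself (a toy example of ours, not tied to the above references): for $f = |\cdot|$ on $\mathbb{R}$, the proximal point is the soft-thresholding operation and the Moreau-Yosida regularization is the Huber function, which the following sketch verifies numerically.

    import numpy as np

    lam = 0.5  # regularization parameter (our choice of value and name)

    def prox_abs(x, lam):
        # Proximal point of f(y) = |y|: argmin_y |y| + (1/(2*lam))*(x - y)^2,
        # i.e. the soft-thresholding operation.
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    def moreau_envelope_abs(x, lam):
        # Moreau-Yosida regularization of |.|, evaluated through its prox;
        # in closed form this is the Huber function.
        p = prox_abs(x, lam)
        return np.abs(p) + (x - p) ** 2 / (2 * lam)

    x = np.linspace(-2, 2, 9)
    huber = np.where(np.abs(x) <= lam, x ** 2 / (2 * lam), np.abs(x) - lam / 2)
    print(np.allclose(moreau_envelope_abs(x, lam), huber))  # True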
References

1. Aizerman, M.A., Braverman, E.M., Rozonoer, L.I.: The probability problem of pattern recognition learning and the method of potential functions. Automation and Remote Control 25,9 (1964) 1307-1323.
2. Alexeev, V., Galeev, E., Tikhomirov, V.: Recueil de Problèmes d'Optimisation. Mir, Moscow (1984).
3. Alexeev, V., Tikhomirov, V., Fomine, S.: Commande Optimale. Mir, Moscow (1982).
4. Anderson Jr., W.N., Duffin, R.J.: Series and parallel addition of matrices. J. Math. Anal. Appl. 26 (1969) 576-594.
5. Artstein, Z.: Discrete and continuous bang-bang and facial spaces or: look for the extreme points. SIAM Review 22,2 (1980) 172-185.
6. Asplund, E.: Differentiability of the metric projection in finite-dimensional Euclidean space. Proc. Amer. Math. Soc. 38 (1973) 218-219.
7. Aubin, J.-P.: Optima and Equilibria: An Introduction to Nonlinear Analysis. Springer, Berlin Heidelberg (1993).
8. Aubin, J.-P.: Mathematical Methods of Game and Economic Theory. North-Holland (1982) (revised edition).
9. Aubin, J.-P., Cellina, A.: Differential Inclusions. Springer, Berlin Heidelberg (1984).
10. Auslender, A.: Optimisation, Méthodes Numériques. Masson, Paris (1976).
11. Auslender, A.: Numerical methods for nondifferentiable convex optimization. In: Nonlinear Analysis and Optimization. Math. Prog. Study 30 (1987) 102-126.
12. Barbu, V., Precupanu, T.: Convexity and Optimization in Banach Spaces. Sijthoff & Noordhoff (1982).
13. Barndorff-Nielsen, O.: Information and Exponential Families in Statistical Theory. Wiley & Sons (1978).
14. Bellman, R.E., Kalaba, R.E., Lockett, J.: Numerical Inversion of the Laplace Transform. Elsevier (1966).
15. Ben-Tal, A., Ben-Israel, A., Teboulle, M.: Certainty equivalents and information measures: duality and extremal principles. J. Math. Anal. Appl. 157 (1991) 211-236.
16. Berger, M.: Geometry I, II (Chapters 11, 12). Springer, Berlin Heidelberg (1987).
17. Berger, M.: Convexity. Amer. Math. Monthly 97,8 (1990) 650-678.
18. Berger, M., Gostiaux, B.: Differential Geometry: Manifolds, Curves and Surfaces. Springer, New York (1990).
19. Bertsekas, D.P.: Necessary and sufficient conditions for a penalty method to be exact. Math. Prog. 9 (1975) 87-99.
20. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press (1982).
21. Bertsekas, D.P.: Convexification procedures and decomposition methods for nonconvex optimization problems. J. Optimization Th. Appl. 29,2 (1979) 169-197.
22. Bertsekas, D.P., Mitter, S.K.: A descent numerical method for optimization problems with nondifferentiable cost functionals. SIAM J. Control 11,4 (1973) 637-652.
23. Best, M.J.: Equivalence of some quadratic programming algorithms. Math. Prog. 30,1 (1984) 71-87.
24. Bihain, A.: Optimization of upper semi-differentiable functions. J. Optimization Th. Appl. 4 (1984) 545-568.
25. Bonnans, J.F.: Théorie de la pénalisation exacte. Modélisation Mathématique et Analyse Numérique 24,2 (1990) 197-210.
26. Borwein, J.M.: A note on the existence of subgradients. Math. Prog. 24 (1982) 225-228.
27. Borwein, J.M., Lewis, A.: Convexity, Optimization and Functional Analysis. Wiley Interscience - Canad. Math. Soc. (in preparation).
28. Brenier, Y.: Un algorithme rapide pour le calcul de transformées de Legendre-Fenchel discrètes. Note aux C.R. Acad. Sci. Paris 308 (1989) 587-589.
29. Brøndsted, A.: An Introduction to Convex Polytopes. Springer, New York (1983).
30. Brøndsted, A., Rockafellar, R.T.: On the subdifferentiability of convex functions. Proc. Amer. Math. Soc. 16 (1965) 605-611.
31. Brousse, P.: Optimization in Mechanics: Problems and Methods. North-Holland (1988).
32. Cansado, E.: Dual programming problems as hemi-games. Management Sci. 15,9 (1969) 539-549.
33. Castaing, C., Valadier, M.: Convex Analysis and Measurable Multifunctions. Lecture Notes in Mathematics, vol. 580. Springer, Berlin Heidelberg (1977).
34. Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Note aux C.R. Acad. Sci. Paris 25 (1847) 536-538.
35. Cheney, E.W., Goldstein, A.A.: Newton's method for convex programming and Tchebycheff approximation. Numer. Math. 1 (1959) 253-268.
36. Clarke, F.H.: Generalized gradients and applications. Trans. Amer. Math. Soc. 205 (1975) 247-262.
37. Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley & Sons (1983), reprinted by SIAM (1990).
38. Crandall, M.G., Ishii, H., Lions, P.-L.: User's guide to viscosity solutions of second order partial differential equations. Bull. Amer. Math. Soc. 27,1 (1992) 1-67.
39. Crouzeix, J.-P.: A relationship between the second derivative of a convex function and of its conjugate. Math. Prog. 13 (1977) 364-365.
40. Dantzig, G.B., Wolfe, P.: A decomposition principle for linear programs. Oper. Res. 8 (1960) 101-111.
41. Davidon, W.C.: Variable metric method for minimization. AEC Report ANL-5990, Argonne National Laboratory (1959).
42. Davidon, W.C.: Variable metric method for minimization. SIAM J. Optimization 1 (1991) 1-17.
43. Dedieu, J.-P.: Une condition nécessaire et suffisante d'optimalité en optimisation non convexe et en calcul des variations. Séminaire d'Analyse Numérique, Univ. Paul Sabatier, Toulouse (1979-80).
44. Demjanov, V.F.: Algorithms for some minimax problems. J. Comp. Syst. Sci. 2 (1968) 342-380.
45. Demjanov, V.F., Malozemov, V.N.: Introduction to Minimax. Wiley & Sons (1974).
46. Dennis, J., Schnabel, R.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall (1983).
47. Dubois, J.: Sur la convexité et ses applications. Ann. Sci. Math. Québec 1,1 (1977) 7-31.
48. Dubuc, S.: Problèmes d'Optimisation en Calcul des Probabilités. Les Presses de l'Université de Montréal (1978).
49. Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. North-Holland, Amsterdam (1976).
50. Eggleston, H.G.: Convexity. Cambridge University Press, London (1958).
51. Ellis, R.S.: Entropy, Large Deviations and Statistical Mechanics. Springer, New York (1985).
52. Everett III, H.: Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Oper. Res. 11 (1963) 399-417.
53. Fenchel, W.: Convexity through the ages. In: Convexity and its Applications (P.M. Gruber and J.M. Wills, eds.). Birkhäuser, Basel (1983) 120-130.
54. Fenchel, W.: Obituary for the death of -. Det Kongelige Danske Videnskabernes Selskabs Aarbok (Oversigten) [Yearbook of the Royal Danish Academy of Sciences] (1988-89) 163-171.
55. Feuer, A.: An implementable mathematical programming algorithm for admissible fundamental functions. Ph.D. Thesis, Columbia Univ. (1974).
56. Fletcher, R.: Practical Methods of Optimization. Wiley & Sons (1987).
57. Fletcher, R., Powell, M.J.D.: A rapidly convergent method for minimization. The Computer Journal 6 (1963) 163-168.
58. Flett, T.M.: Differential Analysis. Cambridge University Press (1980).
59. Fukushima, M.: A descent algorithm for nonsmooth convex programming. Math. Prog. 30,2 (1984) 163-175.
60. Geoffrion, A.M.: Duality in nonlinear programming: a simplified application-oriented development. SIAM Review 13,1 (1971) 1-37.
61. Gilbert, J.C., Lemaréchal, C.: Some numerical experiments with variable-storage quasi-Newton algorithms. Math. Prog. 45 (1989) 407-435.
62. Goffin, J.-L., Haurie, A., Vial, J.-Ph.: Decomposition and nondifferentiable optimization with the projective algorithm. Management Sci. 38,2 (1992) 284-302.
63. Gorni, G.: Conjugation and second-order properties of convex functions. J. Math. Anal. Appl. 158,2 (1991) 293-315.
64. Griewank, A., Rabier, P.J.: On the smoothness of convex envelopes. Trans. Amer. Math. Soc. 322 (1990) 691-709.
65. Grinold, R.C.: Lagrangian subgradients. Management Sci. 17,3 (1970) 185-188.
66. Gritzmann, P., Klee, V.: Mathematical programming and convex geometry. In: Handbook of Convex Geometry (Elsevier, North-Holland, to appear).
67. Gruber, P.M.: History of convexity. In: Handbook of Convex Geometry (Elsevier, North-Holland, to appear).
68. Held, M., Karp, R.M.: The traveling-salesman problem and minimum spanning trees. Math. Prog. 1,1 (1971) 6-25.
69. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. NBS 49 (1952) 409-436.
70. Hiriart-Urruty, J.-B.: Extension of Lipschitz functions. J. Math. Anal. Appl. 77 (1980) 539-554.
71. Hiriart-Urruty, J.-B.: Lipschitz r-continuity of the approximate subdifferential of a convex function. Math. Scand. 47 (1980) 123-134.
72. Hiriart-Urruty, J.-B.: ε-subdifferential calculus. In: Convex Analysis and Optimization (J.-P. Aubin and R. Vinter, eds.). Pitman (1982), pp. 43-92.
73. Hiriart-Urruty, J.-B.: Limiting behaviour of the approximate first order and second order directional derivatives for a convex function. Nonlinear Anal. Theory, Methods & Appl. 6,12 (1982) 1309-1326.
74. Hiriart-Urruty, J.-B.: When is a point x satisfying ∇f(x) = 0 a global minimum of f? Amer. Math. Monthly 93 (1986) 556-558.
75. Hiriart-Urruty, J.-B.: Conditions nécessaires et suffisantes d'optimalité globale en optimisation de différences de fonctions convexes. Note aux C.R. Acad. Sci. Paris 309, Série I (1989) 459-462.
76. Hiriart-Urruty, J.-B., Ye, D.: Sensitivity analysis of all eigenvalues of a symmetric matrix. Preprint Univ. Paul Sabatier, Toulouse (1992).
77. Holmes, R.B.: A Course on Optimization and Best Approximation. Lecture Notes in Mathematics, vol. 257. Springer, Berlin Heidelberg (1972).
78. Holmes, R.B.: Geometrical Functional Analysis and its Applications. Springer, Berlin Heidelberg (1975).
79. Hörmander, L.: Sur la fonction d'appui des ensembles convexes dans un espace localement convexe. Ark. Mat. 3,12 (1954) 181-186.
80. Ioffe, A.D., Levin, V.L.: Subdifferentials of convex functions. Trans. Moscow Math. Soc. 26 (1972) 1-72.
81. Ioffe, A.D., Tikhomirov, V.M.: Theory of Extremal Problems. North-Holland (1979).
82. Israel, R.B.: Convexity in the Theory of Lattice Gases. Princeton University Press (1979).
83. Karlin, S.: Mathematical Methods and Theory in Games, Programming and Economics. McGraw-Hill, New York (1960).
84. Kelley, J.E.: The cutting plane method for solving convex programs. J. SIAM 8 (1960) 703-712.
85. Kim, K.V., Nesterov, Yu.E., Cherkassky, B.V.: The estimate of complexity of gradient computation. Soviet Math. Dokl. 275,6 (1984) 1306-1309.
86. Kiselman, C.O.: How smooth is the shadow of a smooth convex body? J. London Math. Soc. (2) 33 (1986) 101-109.
87. Kiselman, C.O.: Smoothness of vector sums of plane convex sets. Math. Scand. 60 (1987) 239-252.
88. Kiwiel, K.C.: An aggregate subgradient method for nonsmooth convex minimization. Math. Prog. 27 (1983) 320-341.
89. Kiwiel, K.C.: Methods of Descent for Nondifferentiable Optimization. Lecture Notes in Mathematics, vol. 1133. Springer, Berlin Heidelberg (1985).
90. Kiwiel, K.C.: A survey of bundle methods for nondifferentiable optimization. In: Proceedings, XIII. International Symposium on Mathematical Programming, Tokyo (1988).
91. Kiwiel, K.C.: Proximity control in bundle methods for convex nondifferentiable minimization. Math. Prog. 46,1 (1990) 105-122.
92. Kiwiel, K.C.: A tilted cutting plane proximal bundle method for convex nondifferentiable optimization. Oper. Res. Lett. 10 (1991) 75-81.
93. Kuhn, H.W.: Nonlinear programming: a historical view. SIAM-AMS Proceedings 9 (1976) 1-26.
94. Kutateladze, S.S.: Changes of variables in the Young transformation. Soviet Math. Dokl. 18,2 (1977) 545-548.
95. Kutateladze, S.S.: Convex ε-programming. Soviet Math. Dokl. 20 (1979) 391-393.
96. Kutateladze, S.S.: ε-subdifferentials and ε-optimality. Sib. Math. J. (1981) 404-411.
97. Laurent, P.-J.: Approximation et Optimisation. Hermann, Paris (1972).
98. Lay, S.R.: Convex Sets and their Applications. Wiley & Sons (1982).
99. Lebedev, B.Yu.: On the convergence of the method of loaded functional as applied to a convex programming problem. J. Num. Math. and Math. Phys. 12 (1977) 765-768.
100. Lemaréchal, C.: An algorithm for minimizing convex functions. In: Proceedings, IFIP 74 (J.L. Rosenfeld, ed.). Stockholm (1974), pp. 552-556.
101. Lemaréchal, C.: An extension of Davidon methods to nondifferentiable problems. In: Nondifferentiable Optimization (M.L. Balinski, P. Wolfe, eds.). Math. Prog. Study 3 (1975) 95-109.
102. Lemaréchal, C.: Combining Kelley's and conjugate gradient methods. In: Abstracts, IX. Intern. Symp. on Math. Prog., Budapest (1976).
103. Lemaréchal, C.: Nonsmooth optimization and descent methods. Research Report 78,4 (1978) IIASA, 2361 Laxenburg, Austria.
104. Lemaréchal, C.: Nonlinear programming and nonsmooth optimization: a unification. Rapport Laboria 332 (1978) INRIA.
105. Lemaréchal, C.: A view of line-searches. In: Optimization and Optimal Control (A. Auslender, W. Oettli, J. Stoer, eds.). Lecture Notes in Control and Information Sciences, vol. 30. Springer, Berlin Heidelberg (1981), pp. 59-78.
106. Lemaréchal, C.: Constructing bundle methods for convex optimization. In: Fermat Days 85: Mathematics for Optimization (J.-B. Hiriart-Urruty, ed.). North-Holland Mathematics Studies 129 (1986) 201-240.
107. Lemaréchal, C.: An introduction to the theory of nonsmooth optimization. Optimization 17 (1986) 827-858.
108. Lemaréchal, C.: Nondifferentiable optimization. In: Handbook in OR & MS, Vol. 1 (G.L. Nemhauser et al., eds.). Elsevier, North-Holland (1989), pp. 529-572.
109. Lemaréchal, C., Mifflin, R.: Global and superlinear convergence of an algorithm for one-dimensional minimization of convex functions. Math. Prog. 24,3 (1982) 241-256.
110. Lemaréchal, C., Nemirovskij, A.S., Nesterov, Yu.E.: New variants of bundle methods. Math. Prog. (to appear).
111. Lemaréchal, C., Zowe, J.: Some remarks on the construction of higher order algorithms in convex optimization. Appl. Math. Optimization 10 (1983) 51-68.
112. Lion, G.: Un savoir en voie de disparition: la convexité. Singularité 2,10 (1991) 5-12.
113. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large-scale optimization. Math. Prog. 45 (1989) 503-528.
114. Luenberger, D.G.: Optimization by Vector Space Methods. Wiley & Sons (1969).
115. Magnanti, T.L.: Fenchel and Lagrange duality are equivalent. Math. Prog. 7 (1974) 253-258.
116. Marcotte, P., Dussault, J.P.: A sequential linear programming algorithm for solving monotone variational inequalities. SIAM J. Control Opt. 27 (1989) 1260-1278.
117. Marsten, R.E.: The use of the boxstep method in discrete optimization. In: Nondifferentiable Optimization (M.L. Balinski, P. Wolfe, eds.). Math. Prog. Study 3 (1975) 127-144.
118. Marti, J.: Konvexe Analysis. Birkhäuser, Basel (1977).
119. Martinet, B.: Régularisation d'inéquations variationnelles par approximations successives. Revue Franç. Rech. Opér. R3 (1970) 154-158.
120. Mazure, M.-L.: L'addition parallèle d'opérateurs interprétée comme inf-convolution de formes quadratiques convexes. Modélisation Math. Anal. Numér. 20 (1986) 497-515.
121. McCormick, G.P., Tapia, R.A.: The gradient projection method under mild differentiability conditions. SIAM J. Control 10,1 (1972) 93-98.
122. Mifflin, R.: Semi-smooth and semi-convex functions in constrained optimization. SIAM J. Control Opt. 15,6 (1977) 959-972.
123. Mifflin, R.: An algorithm for constrained optimization with semi-smooth functions. Math. Oper. Res. 2,2 (1977) 191-207.
124. Mifflin, R.: A modification and an extension of Lemaréchal's algorithm for nonsmooth minimization. In: Nondifferential and Variational Techniques in Optimization (D.C. Sorensen, J.B. Wets, eds.). Math. Prog. Study 17 (1982) 77-90.
125. Minoux, M.: Programmation Mathématique: Théorie et Algorithmes I, II. Dunod, Paris (1983).
126. Moré, J.J.: Implementation and testing of optimization software. In: Performance Evaluation of Numerical Software (L.D. Fosdick, ed.). North-Holland (1979).
127. Moré, J.J.: Recent developments in algorithms and software for trust region methods. In: Mathematical Programming, the State of the Art (A. Bachem, M. Grötschel, B. Korte, eds.). Springer, Berlin Heidelberg (1983), pp. 258-287.
128. Moré, J.J., Thuente, D.J.: Line search algorithms with guaranteed sufficient decrease. ACM Transactions on Math. Software; Assoc. for Comp. Machinery (to appear).
129. Moreau, J.-J.: Décomposition orthogonale d'un espace hilbertien selon deux cônes mutuellement polaires. C.R. Acad. Sci. Paris 255 (1962) 238-240.
130. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93 (1965) 273-299.
131. Moreau, J.-J.: Fonctionnelles Convexes. Lecture notes, Séminaire "Équations aux dérivées partielles", Collège de France, Paris (1966).
132. Moulin, H., Fogelman-Soulié, F.: La Convexité dans les Mathématiques de la Décision. Hermann, Paris (1979).
133. Nemirovskij, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience (1983).
134. Nesterov, Yu.E.: Minimization methods for nonsmooth convex and quasiconvex functions. Matekon 20 (1984) 519-531.
135. Niven, I.: Maxima and Minima Without Calculus. Dolciani Mathematical Expositions 6 (1981).
136. Nurminskii, E.A.: On ε-subgradient mappings and their applications in nondifferentiable optimization. Working paper 78,58 (1978) IIASA, 2361 Laxenburg, Austria.
137. Nurminskii, E.A.: ε-subgradient mapping and the problem of convex optimization. Cybernetics 21,6 (1986) 796-800.
138. Nurminskii, E.A.: Convex optimization problems with constraints. Cybernetics 23,4 (1988) 470-474.
139. Nurminskii, E.A.: A class of convex programming methods. USSR Comput. Maths Math. Phys. 26,4 (1988) 122-128.
140. Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Prog. (to appear).
141. Penot, J.-P.: Subhessians, superhessians and conjugation. Nonlinear Analysis: Theory, Methods and Appl. (to appear).
142. Peressini, A.L., Sullivan, F.E., Uhl, J.J.: The Mathematics of Nonlinear Programming. Springer, New York (1988).
143. Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Lecture Notes in Mathematics, vol. 1364. Springer, Berlin Heidelberg (1989, new edition in 1993).
144. Polak, E.: Computational Methods in Optimization. Academic Press, New York (1971).
145. Poljak, B.T.: A general method for solving extremum problems. Soviet Math. Dokl. 174,8 (1966) 33-36.
146. Poljak, B.T.: Minimization of unsmooth functionals. USSR Comput. Maths Math. Phys. 9 (1969) 14-29.
147. Popova, N.K., Tarasov, V.N.: A modification of the cutting-plane method with accelerated convergence. In: Nondifferentiable Optimization: Motivations and Applications (V.F. Demjanov, D. Pallaschke, eds.). Lecture Notes in Economics and Mathematical Systems, vol. 255. Springer, Berlin Heidelberg (1984), pp. 284-290.
148. Ponstein, J.: Applying some modern developments to choosing your own Lagrange multipliers. SIAM Review 25,2 (1983) 183-199.
149. Pourciau, B.H.: Modern multiplier rules. Amer. Math. Monthly 87 (1980) 433-452.
150. Powell, M.J.D.: Nonconvex minimization calculations and the conjugate gradient method. In: Numerical Analysis (D.F. Griffiths, ed.). Lecture Notes in Mathematics, vol. 1066. Springer, Berlin Heidelberg (1984), pp. 122-141.
151. Prékopa, A.: On the development of optimization theory. Amer. Math. Monthly 87 (1980) 527-542.
152. Pshenichnyi, B.N.: Necessary Conditions for an Extremum. Marcel Dekker (1971).
153. Pshenichnyi, B.N.: Nonsmooth optimization and nonlinear programming. In: Nonsmooth Optimization (C. Lemaréchal, R. Mifflin, eds.), IIASA Proceedings Series 3, Pergamon Press (1978), pp. 71-78.
154. Pshenichnyi, B.N.: Methods of Linearization. Springer, Berlin Heidelberg (1993).
155. Pshenichnyi, B.N., Danilin, Yu.M.: Numerical Methods for Extremal Problems. Mir, Moscow (1978).
156. Quadrat, J.-P.: Théorèmes asymptotiques en programmation dynamique. C.R. Acad. Sci. Paris 311, Série I (1990) 745-748.
157. Roberts, A.W., Varberg, D.E.: Convex Functions. Academic Press (1973).
158. Rockafellar, R.T.: Convex programming and systems of elementary monotonic relations. J. Math. Anal. Appl. 19 (1967) 543-564.
159. Rockafellar, R.T.: Convex Analysis. Princeton University Press (1970).
160. Rockafellar, R.T.: Conjugate Duality and Optimization. SIAM Regional Conference Series in Applied Mathematics (1974).
161. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1,2 (1976) 97-116.
162. Rockafellar, R.T.: Lagrange multipliers in optimization. SIAM-AMS Proceedings 9 (1976) 145-168.
163. Rockafellar, R.T.: Solving a nonlinear programming problem by way of a dual problem. Symposia Mathematica XIX (1976) 135-160.
164. Rockafellar, R.T.: The Theory of Subgradients and its Applications to Problems of Optimization: Convex and Nonconvex Functions. Heldermann, West-Berlin (1981).
165. Rockafellar, R.T.: Lagrange multipliers and optimality. SIAM Review (to appear, 1993).
166. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis (in preparation).
167. Rosen, J.B.: The gradient projection method for nonlinear programming; part I: linear constraints. J. SIAM 8 (1960) 181-217.
168. Schramm, H., Zowe, J.: A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM J. Opt. 2 (1992) 121-152.
169. Schrijver, A.: Theory of Linear and Integer Programming. Wiley-Interscience (1986).
170. Seeger, A.: Second derivatives of a convex function and of its Legendre-Fenchel transformate. SIAM J. Opt. 2,3 (1992) 405-424.
171. Shor, N.Z.: Minimization Methods for Nondifferentiable Functions. Springer, Berlin Heidelberg (1985).
172. Smith, K.T.: Primer of Modern Analysis. Springer, New York (1983).
173. Stoer, J., Witzgall, C.: Convexity and Optimization in Finite Dimensions I. Springer, Berlin Heidelberg (1970).
174. Strang, G.: Introduction to Applied Mathematics. Wellesley-Cambridge Press (1986).
175. Strodiot, J.-J., Nguyen, V.H., Heukemes, N.: ε-optimal solutions in nondifferentiable convex programming and some related questions. Math. Prog. 25 (1983) 307-328.
176. Tikhomirov, V.M.: Stories about maxima and minima. In: Mathematical World 1, Amer. Math. Society, Math. Association of America (1990).
177. Troutman, J.L.: Variational Calculus with Elementary Convexity. Springer, New York (1983).
178. Valadier, M.: Sous-différentiels d'une borne supérieure et d'une somme continue de fonctions convexes. Note aux C.R. Acad. Sci. Paris, Série A 268 (1969) 39-42.
179. Valadier, M.: Contribution à l'Analyse Convexe. Thèse de doctorat ès sciences mathématiques, Paris (1970).
180. Valadier, M.: Intégration de convexes fermés notamment d'épigraphes. Inf-convolution continue. Revue d'Informatique et de Recherche Opérationnelle (1970) 47-53.
181. Van Rooij, A.C.M., Schikhof, W.H.: A Second Course on Real Functions. Cambridge University Press (1982).
182. Van Tiel, J.: Convex Analysis. An Introductory Text. Wiley & Sons (1984).
183. Wets, R.J.-B.: Grundlagen konvexer Optimierung. Lecture Notes in Economics and Mathematical Systems, vol. 137. Springer, Berlin Heidelberg (1976).
184. Willem, M.: Analyse Convexe et Optimisation, 3rd edn. Éditions CIACO, Louvain-la-Neuve (1989).
185. Wolfe, P.: Accelerating the cutting plane method for nonlinear programming. J. SIAM 9,3 (1961) 481-488.
186. Wolfe, P.: Convergence conditions for ascent methods. SIAM Review 11 (1968) 226-235.
187. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. In: Proceedings, XII. Annual Allerton Conference on Circuit and System Theory (P.V. Kokotovic, B.S. Davidson, eds.). Univ. Illinois at Urbana-Champaign (1974), pp. 8-15.
188. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. In: Nondifferentiable Optimization (M.L. Balinski, P. Wolfe, eds.). Math. Prog. Study 3 (1975) 145-173.
189. Zarantonello, E.H.: Projections on convex sets in Hilbert spaces and spectral theory. In: Contributions to Nonlinear Functional Analysis. Academic Press (1971), pp. 237-424.
190. Zeidler, E.: Nonlinear Functional Analysis and its Applications III: Variational Methods and Optimization. Springer, New York (1985).
Index

asymptotic function, XVII, 111
breadth, XVII
bundle, bundling, 7, 157, 196, 228, 244, 327
- (compression, aggregation of), 14, 177, 228, 230, 232, 247, 256, 301
- (of information), 14, 304
Clarke, 331
closed convex
- cone, 186
- function, XVI, 38, 151, 161
- polyhedron, 115
closure of a function, XVI, 45
coercive, 46
- (1-), 50, 82, 89, 318, 333
coincidence set, 122
complexity, 20, 157
computer, computing, 2, 211, 256, 263
conjugate function, 98, 132, 179, 298, 320
conjugate gradient, 229
constraint, 137
convergence, see speed of -
convex combination, XVI, 156, 211
convex hull, 159
convex multiplier, XVI, 26, 216
critical point, see stationary
curvature, 106, 208
curved-search, 284, 316
cutting plane, 77, 229, 275, 330
decomposition, 43, 116, 137, 141, 184, 313
degenerate, 38, 285
derivative, 317
- (second), 106
descent direction, 156
descent-step, 204, 231, 248, 283, 305
difference quotient, 67
- (approximate), 106, 200
dilation, 334
directional derivative, XVI, 66, 196
- (approximate), 102
directionally quadratic function, 84
distance, 65, 173
- of bounded sets, 129
divergent series, 198, 325
domain, XVII
dot-product, XVI, 138
dual, duality, 22, 238, 306
- function, 148
- gap, 153, 155, 179
eigenvalue, 137, 323
ellipsoid, 1, 80, 169
entropy, 151, 153, 156, 160
epigraph, XVII
Everett, 147, 163
exposed (face, point), 97, 217
Fenchel
- duality theorem, 63
- inequality, 37
- transformation, 37
filling property, 154, 155, 164
fixed point, 322
Gâteaux, 49, 54
gauge, XVII, 71
Gauss-Newton, 290
gradient, 96, 320
- method, 28
graph theory, 22
half-space, XVII, 45
hyperplane, XVII, 199
image-function, 54, 72
indicator function, XVII, 39, 93
inf-convolution, 55, 187
- (exact), 62, 119, 120
interior points, 335
Lagrange, Lagrangian, 138, 237, 307
Legendre transform, 35, 43, 81
Levenberg-Marquardt, 290
line-search, 4, 196, 283
linearization error, 131, 201, 232
Lipschitz, 122, 128
local problem, 140
locally bounded, 127
marginal function, 55, 320
marginal price, 151
master problem, 140
mean-value theorem, 112
minimality conditions (approximate), 115
minimizing sequence, XVII, 218, 288, 309
minimum, minimum point, 49
- (global), 333
minorize, minorization, XVII
model, 101, 279
Moreau-Yosida, 121, 183, 187
multi-valued, see multifunction
multifunction, XVI, 99, 112
Newton, quasi-Newton, 228, 283, 293, 326
nonnegative, XVII
normal set (approximate), 93, 115
normalization, norming, 279, 280
null-step, 7, 156, 204, 231, 248, 283, 295, 305
objective function, 137, 152
orthant, XVII
outer semi-continuous, XVII, 128
penalty, 142
- (exact), 185
perspective-function, XVII, 41, 99
piecewise affine, 76, 125, 156, 245
polar cone, XVII, 45, 186
polyhedral function, 77, 307
positively homogeneous, 84, 280
primal problem, 137
programming problem
- (integer), 142, 181
- (linear), 116, 145, 181, 276
- (quadratic), 234, 272, 299, 304, 332
- (semi-infinite), 174
proximal point, 318, 322
qualification, 58, 62, 72, 125, 191
quasi-convex, 201
rate of convergence, see speed -
recession (cone, function), see asymptotic
relative interior, XVII, 62
relaxation
- (Lagrangian), 181, 216
- (convex), 157, 181
- (method), 174, 293
saddle-point, 188
safeguard-reduction property, 208, 251
semi-smooth, 331
separation, 10, 195, 199, 222, 226, 254
set-valued, see multifunction
slack variable, 142
Slater, 165, 188
speed of convergence, 33, 288, 310, 312
- (fast), 208
- (linear), 20
- (sublinear), 16, 18, 271
stability center, 279
stationary point, 50
steepest-descent direction, 1, 196, 225, 262, 315
strictly convex, 79, 81, 174, 181
strongly convex, 82, 83, 318
subdifferential, subgradient, 47, 151
subgradient algorithm, 171, 315
sublevel-set, XVII
support function, XVII, 30, 40, 66, 97, 200, 299
transportation formula, 131, 211
transportation problem, 21
travelling salesman problem, 22
trust-region, 284
unit ball, XVI
unit simplex, XVI
unit sphere, XVI
vertex, see exposed (point)
Wolfe, 177, 249, 284
zigzag, 1, 7, 28