Nonlinear Optimization
E. de Klerk, C. Roos, and T. Terlaky
Contents

0 Introduction
0.1 The general nonlinear optimization problem
0.2 A short history of nonlinear optimization
0.3 Some historical examples
0.3.1 Tartaglia's problem
0.3.2 Kepler's problem
0.3.3 Steiner's problem
0.4 Quadratic optimization examples
0.4.1 The concrete mixing problem: least squares estimation
0.4.2 Portfolio analysis (mean–variance models)

1 Convex Analysis
1.1 Convex sets and convex functions
1.2 More on convex sets
1.2.1 Convex hulls and extremal sets
1.2.2 Convex cones
1.2.3 The relative interior of a convex set
1.3 More on convex functions
1.3.1 Basic properties of convex functions
1.3.2 On the derivatives of convex functions

2 Optimality conditions
2.1 Optimality conditions for unconstrained optimization
2.2 Optimality conditions for constrained optimization
2.2.1 A geometric interpretation
2.2.2 The Slater condition
2.2.3 Convex Farkas lemma
2.2.4 Karush–Kuhn–Tucker theory

3 Duality in convex optimization
3.1 Lagrange dual
3.2 Wolfe dual
3.3 Examples for dual problems
3.4 Some examples with positive duality gap
3.5 Semidefinite optimization
3.6 Duality in cone-linear optimization

6.3.2 Newton step for φB
6.3.3 Proximity measure
6.3.4 The self-concordance property
6.3.5 Properties of Newton's method
6.3.6 Logarithmic barrier algorithm with full Newton steps
6.3.7 Logarithmic barrier algorithm with damped Newton steps
6.4 More on self-concordancy ∗
6.4.1 Introduction
6.4.2 Some basic inequalities
6.4.3 Linear convergence of the damped Newton method
6.4.4 Quadratic convergence of Newton's method
6.4.5 Existence and properties of minimizer
6.4.6 Solution strategy

A Appendix
A.1 Some technical lemmas
A.2 The proof of Theorem 6.3 and Lemma 6.4
Bibliography
Chapter 0
Introduction
Nothing takes place in the world whose meaning is not that of some maxi-
mum or minimum.
L. Euler (1707 – 1783)
The general nonlinear optimization problem has the form

inf f(x)
s.t. hᵢ(x) = 0, i ∈ I = {1, · · · , p}
     gⱼ(x) ≤ 0, j ∈ J = {1, · · · , m}    (1)
     x ∈ C ⊆ IRⁿ,

and is referred to as (NLO); its set of feasible solutions is denoted by F.

• The function f is called the objective function of (NLO) and F is called the feasible set (or feasible region);
• If F = ∅ then we say that problem (NLO) is infeasible;
• If f is not bounded below on F then we say that problem (NLO) is unbounded;
• If the infimum of f over F is attained at x̄ ∈ F then we call x̄ an optimal solution (or minimum or minimizer) of (NLO) and f(x̄) the optimal (objective) value of (NLO).
Example 0.1 As an example, consider minimization of the 'humpback function' (see Figure 1):

min x₁²(4 − 2.1x₁² + (1/3)x₁⁴) + x₁x₂ + x₂²(−4 + 4x₂²),

subject to the constraints −2 ≤ x₁ ≤ 2 and −1 ≤ x₂ ≤ 1. Note that the feasible set here is simply the rectangle:

F = {(x₁, x₂) : −2 ≤ x₁ ≤ 2, −1 ≤ x₂ ≤ 1}.

[Figure 1: surface plot of the humpback function f(x₁, x₂) over the rectangle F, with the two minima marked.]
This NLO problem has two optimal solutions, namely (0.0898, −0.717) and (−0.0898, 0.717), as one
can (more or less) verify by looking at the contours of the objective function in Figure 1. ∗
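As a quick sanity check one can also reproduce the two minima numerically. The following sketch is not part of the original text; it assumes NumPy and SciPy are available and simply runs a bounded local minimization from a few starting points.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # the humpback function of Example 0.1
    x1, x2 = x
    return x1**2 * (4 - 2.1 * x1**2 + x1**4 / 3) + x1 * x2 + x2**2 * (-4 + 4 * x2**2)

best = []
for x0 in [(-1.5, 0.5), (1.5, -0.5), (0.5, 0.5), (-0.5, -0.5)]:
    res = minimize(f, x0, bounds=[(-2, 2), (-1, 1)])   # minimize over the rectangle F
    best.append((round(res.fun, 4), tuple(np.round(res.x, 4))))

print(sorted(set(best)))   # the two minima appear near (0.0898, -0.7127) and (-0.0898, 0.7127)
```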
Notation
Matrices will be denoted by capital letters (A, B, P, . . .), vectors by small Latin letters and components of vectors and matrices by indexed letters [e.g. z = (z₁, . . . , zₙ), A = (aᵢⱼ)_{i=1,...,m; j=1,...,n}]. For the purpose of matrix–vector multiplication, vectors in IRⁿ will always be viewed as n × 1 matrices (column vectors). Index sets will be denoted by I, J and K.
We now define some classes of NLO problems. Recall that f : IRⁿ → IR is called a quadratic function if there is a square matrix Q ∈ IRⁿˣⁿ, a vector c ∈ IRⁿ and a number γ ∈ IR such that

f(x) = ½xᵀQx + cᵀx + γ for all x ∈ IRⁿ.

If Q = 0 then f is called affine and if Q = 0 and γ = 0 then f is called linear. We will
abuse this terminology a bit by sometimes referring to affine functions as linear.
Unconstrained Optimization: The index sets I and J are empty and C = IRⁿ.

Quadratic Optimization (QO): The objective function f is quadratic, all the constraint functions h₁, · · · , hₚ, g₁, · · · , gₘ are affine, and the set C is IRⁿ or the nonnegative orthant IRⁿ₊ of IRⁿ.
One of the earliest recorded optimization problems is due to Euclid: inscribe a parallelogram of maximal area in a given triangle.

[Figure 2: the triangle with base AC of length b and height H; the vertices D and E of the inscribed parallelogram lie on the other two edges, and the vertex F lies on AC at distance x from A, leaving a segment of length b − x.]
Let H denote the height of the triangle, and let b indicate the length of the edge AC.
Every inscribed parallelogram of the required form is uniquely determined by choosing
a vertex F at a distance x < b from A on the edge AC (see Figure 2).
Exercise 0.1 Show that Euclid's problem can be formulated as the quadratic optimization problem (QO):

max_{0<x<b} Hx(b − x)/b.    (2)

/
Euclid could show that the maximum is obtained when x = ½b, by using geometric reasoning. A unified methodology for solving nonlinear optimization problems would have to wait until the development of calculus in the 17th century. Indeed, in any modern text on calculus we learn to solve problems like (2) by setting the derivative of the objective function f(x) := Hx(b − x)/b to zero, and solving the resulting equation to obtain x = ½b.
This modern methodology is due to Fermat (1601 – 1665). Because of this work,
Lagrange (1736 – 1813) stated clearly that he considered Fermat to be the inventor
of calculus (as opposed to Newton (1643 – 1727) and Leibniz (1646 – 1716) who
were later locked in a bitter struggle for this honour). Lagrange himself is famous for
extending the method of Fermat to solve (equality) constrained optimization problems
by forming a function now known as the Lagrangian, and applying Fermat’s method to
the Lagrangian. In the words of Lagrange himself:
One can state the following general principle. If one is looking for the max-
imum or minimum of some function of many variables subject to the con-
dition that these variables are related by a constraint given by one or more
equations, then one should add to the function whose extremum is sought
the functions that yield the constraint equations each multiplied by undeter-
mined multipliers and seek the maximum or minimum of the resulting sum
as if the variables were independent. The resulting equations, combined with
the constraint equations, will serve to determine all unknowns.
To better understand what Lagrange meant, consider the general NLO problem with
only equality constraints (J = ∅ and C = IRn in (1)):
inf f (x)
s.t. hi (x) = 0, i ∈ I = {1, · · · , p}
x ∈ IRn .
Following Lagrange's recipe, one forms the function

L(x, y) := f(x) + Σ_{i∈I} yᵢhᵢ(x),

now known as the Lagrangian. The new variables yᵢ are called (Lagrange) multipliers, and are the undetermined multipliers Lagrange referred to. Now apply Fermat's method to find the minimum of the function L(x, y), 'as if the variables x and y were independent'. In other words, solve the system of nonlinear equations defined by setting the gradient of L to zero, and retaining the feasibility conditions hᵢ(x) = 0 (i ∈ I):

∇f(x) + Σ_{i∈I} yᵢ∇hᵢ(x) = 0,
hᵢ(x) = 0, i ∈ I.    (3)
If x∗ is an optimal solution of NLO then there now exists a vector y ∗ ∈ IRp such that
(x∗ , y ∗) is a solution of the nonlinear equations (3). We can therefore solve the nonlinear
system (3) and the x-part of one of the solutions of (3) will yield the optimal solution
of (NLO) (provided it exists). Note that it is difficult to know beforehand whether an
optimal solution of (NLO) exists.
This brings us to the problem of solving a system of nonlinear equations. Isaac New-
ton lent his name to perhaps the most widely known algorithm for this problem. In
conjunction with Fermat and Lagrange’s methods, this yielded one of the first opti-
mization algorithms. It is interesting to note that even today, Newton’s optimization
algorithm is the most widely used and studied algorithm for nonlinear optimization.
The most recent optimization algorithms, namely interior point algorithms, use this
method as their ‘engine’.
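To make this 'engine' concrete, here is a minimal sketch, not from the original text, of Newton's method applied to the Lagrange system (3); the example problem min x₁ + x₂ subject to x₁² + x₂² − 1 = 0 and the starting point are chosen purely for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Unknowns z = (x1, x2, y); we solve grad L(z) = 0 by Newton steps, where
# L(x, y) = x1 + x2 + y*(x1^2 + x2^2 - 1).

def F(z):
    x1, x2, y = z
    return np.array([
        1 + 2 * y * x1,        # dL/dx1 = 0
        1 + 2 * y * x2,        # dL/dx2 = 0
        x1**2 + x2**2 - 1,     # feasibility h(x) = 0
    ])

def J(z):
    x1, x2, y = z
    return np.array([
        [2 * y,  0.0,    2 * x1],
        [0.0,    2 * y,  2 * x2],
        [2 * x1, 2 * x2, 0.0],
    ])

z = np.array([-0.5, -0.5, 1.0])              # starting guess
for _ in range(20):
    z = z + np.linalg.solve(J(z), -F(z))     # one Newton step
    if np.linalg.norm(F(z)) < 1e-12:
        break

print(z)   # x ≈ (-1/sqrt(2), -1/sqrt(2)), y ≈ 1/sqrt(2): the minimizer
```

Each iteration solves one linear system with the Jacobian of the optimality conditions; this is exactly the role Newton's method plays inside modern interior point algorithms.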
The study of nonlinear optimization in the time of Fermat, Lagrange, Euler and
Newton was driven by the realization that many physical principles in nature can be
explained via optimization (extremum) principles. For example, the well known prin-
ciple of Fermat for the refraction of light may be stated as:
in an inhomogeneous medium, light travels from one point to another along the path
requiring the shortest time.
Similarly, it was known that many problems in (celestial) mechanics could be formu-
lated as extremum problems.
We return to the problem of deciding whether (NLO) has an optimal solution at
all. In the 19th century, Karl Weierstrass (1815 – 1897) proved the famous result —
known to any student of analysis — that a continuous function attains its infimum and
supremum on a compact set. This gave a practical sufficient condition for the existence
of optimal solutions.
In modern times, nonlinear optimization is used in optimal engineering design, fi-
nance, statistics and many other fields. It has been said that we live in the age of
optimization, where everything has to be better and faster than before. Think of de-
signing a car with minimal air resistance, a bridge of minimal weight that still meets
essential specifications, a stock portfolio where the risk is minimal and the expected
return high; the list is endless. If you can make a mathematical model of your decision
problem, then you can optimize it!
Outline of this course
This short history of nonlinear optimization is of course far from complete and has
served only to introduce some of the most important topics that will be studied in this
course. In Chapters 1 and 2 we will study the methods of Fermat and Lagrange. So-
called duality theory based on the methodology of Lagrange will follow in Chapter 3.
Then we will turn our attention to optimization algorithms in the remaining chapters.
First we will study classical methods like the (reduced) gradient method and Newton’s
method (Chapter 4), before turning to the modern application of Newton’s method in
interior point algorithms (Chapter 6). Finally, we conclude with a chapter on special
classes of structured nonlinear optimization problems that can be solved very efficiently
by interior point algorithms.
In the rest of this chapter we give a few more examples of historical and practical
problems to give some idea of the field of nonlinear optimization.
0.3.1 Tartaglia's problem

Tartaglia posed the problem of dividing the number 8 into two parts such that the product of the parts multiplied by their difference is maximal:

max x₁x₂(x₁ − x₂)
s.t. x₁ + x₂ = 8, x₁ ≥ 0, x₂ ≥ 0.

Tartaglia knew that the correct answer is x₁ = 4 + 4/√3, x₂ = 4 − 4/√3. How can one prove that this is correct? (Solution via Exercise 2.4.)
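Tartaglia's answer can be verified numerically; the following sketch (an illustration assuming SciPy, not part of the text) eliminates x₂ = 8 − x₁ and maximizes the resulting univariate function.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Maximize x1*x2*(x1 - x2) with x2 = 8 - x1, i.e. f(x1) = x1*(8 - x1)*(2*x1 - 8);
# we minimize -f over the interval where x1 >= x2, i.e. x1 in [4, 8].
res = minimize_scalar(lambda x1: -x1 * (8 - x1) * (2 * x1 - 8),
                      bounds=(4, 8), method="bounded")
print(res.x, 4 + 4 / np.sqrt(3))   # both ≈ 6.3094
```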
0.3.2 Kepler’s problem
The famous astronomer Johannes Kepler was so intrigued by the geometry of wine
barrels that he wrote a book about it in 1615: New solid geometry of wine barrels. In
this work he considers the following problem (among others).
Given a sphere, inscribe a cylinder of maximal volume.
Kepler knew that the cylinder of maximal volume is such that the ratio of its base diameter to the height is √2. (And of course the diagonal of the cylinder has the same length as the diameter of the sphere.) How can one show that this is correct? (Solution via Exercises 0.2 and 2.2.)
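Here is a hedged numerical check of Kepler's claim (assuming SciPy; the sphere radius R = 1 is an arbitrary choice, not from the text): for an inscribed cylinder with diameter d and height h one has d² + h² = (2R)², and we maximize the volume π(d/2)²h.

```python
import numpy as np
from scipy.optimize import minimize_scalar

R = 1.0
# Volume as a function of the height h, using d^2 = (2R)^2 - h^2.
volume = lambda h: np.pi / 4 * ((2 * R) ** 2 - h ** 2) * h

res = minimize_scalar(lambda h: -volume(h), bounds=(0, 2 * R), method="bounded")
h = res.x
d = np.sqrt((2 * R) ** 2 - h ** 2)
print(d / h, np.sqrt(2))   # both ≈ 1.41421: diameter/height = sqrt(2)
```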
0.3.3 Steiner's problem

Given three points a, b, c ∈ IR² forming the vertices of a triangle, Steiner's problem is to find the point whose total distance to the three vertices is minimal:

min_{x∈IR²} ‖x − a‖ + ‖x − b‖ + ‖x − c‖.

The solution is known as the Torricelli point. How can one find the Torricelli point for any given triangle? (Solution via Exercise 2.3.)
0.4 Quadratic optimization examples

A classical QO example is the linear least squares problem: given a matrix A ∈ IR^{m×n} and a vector b ∈ IRᵐ, solve

min_x ‖Ax − b‖².    (5)

Since

‖Ax − b‖² = (Ax − b)ᵀ(Ax − b) = xᵀAᵀAx − 2bᵀAx + bᵀb,

it follows that problem (5) is a quadratic optimization (QO) problem. Below we give examples of the least squares and other quadratic optimization problems.
0.4.1 The concrete mixing problem: least squares estimation

In civil engineering, different sorts of concrete are needed for different purposes. One of the important characteristics of concrete is its sand-and-gravel composition, i.e. what percentage of the stones in the sand-and-gravel mixture belongs to each stone size category. For each sort of concrete the civil engineers can give an ideal sand-and-gravel composition that ensures the desired strength with minimal cement content. Unfortunately, such an ideal composition can in general not be found in the sand-and-gravel mines. The solution is to mix different sand-and-gravel mixtures in order to approximate the desired quality as closely as possible.
Mathematical model

Let us assume that we have n different stone size categories. The ideal mixture for our actual purpose is given by the vector c = (c₁, c₂, · · · , cₙ)ᵀ, where 0 ≤ cᵢ ≤ 1 for all i = 1, · · · , n and Σ_{i=1}^n cᵢ = 1. The components cᵢ indicate what fraction of the sand-and-gravel mixture belongs to the i-th stone size category. Further, let us assume that we can get sand-and-gravel mixtures from m different mines, and the stone composition at each mine j = 1, · · · , m is given by the vector Aⱼ = (a₁ⱼ, · · · , aₙⱼ)ᵀ, where 0 ≤ aᵢⱼ ≤ 1 for all i = 1, · · · , n and Σ_{i=1}^n aᵢⱼ = 1. The goal is to find the best possible approximation of the ideal composition by a mixture from the mines, i.e. to find weights x₁, · · · , xₘ ≥ 0 with Σ_{j=1}^m xⱼ = 1 minimizing ‖Σ_{j=1}^m xⱼAⱼ − c‖².
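The following sketch solves a small instance of this model (the data are made up for illustration and SciPy is assumed; it is not part of the text): it finds mixing weights x ≥ 0 with Σxⱼ = 1 minimizing ‖Ax − c‖², where the columns of A are the mine compositions Aⱼ.

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[0.6, 0.2, 0.1],    # rows: stone size categories,
              [0.3, 0.5, 0.2],    # columns: compositions of the m = 3 mines
              [0.1, 0.3, 0.7]])
c = np.array([0.4, 0.35, 0.25])   # ideal composition

res = minimize(lambda x: np.sum((A @ x - c) ** 2),
               x0=np.full(3, 1 / 3),
               bounds=[(0, 1)] * 3,
               constraints={"type": "eq", "fun": lambda x: x.sum() - 1})
print(res.x)        # optimal mixing fractions
print(A @ res.x)    # achieved composition, close to c
```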
0.4.2 Portfolio analysis (mean–variance models)
An important application of the QO problem is the computation of an efficient frontier
for mean–variance models, introduced by Markowitz [31]. Given assets with expected
return ri and covariances vij , the problem is to find portfolios of the assets that have
minimal variance given a level of total return, and maximal return given a level of
total variance. Mathematically, if xi is the proportion invested in asset i then the basic
mean–variance problem is
min_x { ½xᵀV x : eᵀx = 1, rᵀx = λ, Dx = d, x ≥ 0 },
where e is an all–one vector, and Dx = d may represent additional constraints on the
portfolios to be chosen (for instance those related to volatility of the portfolio). This
problem can be viewed as a parametric QO problem, with parameter λ representing
the total return of investment. The so-called efficient frontier is then just the optimal
value function.
min_x { xᵀV x : eᵀx = 1, rᵀx = λ, x ≥ 0 }

where

V =
[  0.82  −0.23   0.155 −0.013 −0.314 ]
[ −0.23   0.484  0.346  0.197  0.592 ]
[  0.155  0.346  0.298  0.143  0.419 ]
[ −0.013  0.197  0.143  0.172  0.362 ]
[ −0.314  0.592  0.419  0.362  0.916 ]

and

r = (1.78, 0.37, 0.237, 0.315, 0.49)ᵀ.
One can check (e.g. by using MATLAB) that for λ > 1.780 or λ < 0.237 the QO problem is infeasible.
For the values λ ∈ [0.237, 1.780] the QO problem has an optimal solution. ∗
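One way to carry out such a check is the following sketch (assuming SciPy rather than MATLAB; the solver, starting point and the value λ = 1.0 are illustrative choices): it solves the QO problem for one value of λ.

```python
import numpy as np
from scipy.optimize import minimize

V = np.array([[ 0.82 , -0.23 ,  0.155, -0.013, -0.314],
              [-0.23 ,  0.484,  0.346,  0.197,  0.592],
              [ 0.155,  0.346,  0.298,  0.143,  0.419],
              [-0.013,  0.197,  0.143,  0.172,  0.362],
              [-0.314,  0.592,  0.419,  0.362,  0.916]])
r = np.array([1.78, 0.37, 0.237, 0.315, 0.49])
lam = 1.0                                 # target total return, must lie in [0.237, 1.78]

cons = [{"type": "eq", "fun": lambda x: x.sum() - 1},      # e^T x = 1
        {"type": "eq", "fun": lambda x: r @ x - lam}]      # r^T x = lam
res = minimize(lambda x: x @ V @ x, x0=np.full(5, 0.2),
               bounds=[(0, None)] * 5, constraints=cons)
print(res.x, res.fun)   # minimal-variance portfolio and its variance
```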
Exercise 0.4 A mathematical description of this and related portfolio problems is given at:
http://www-fp.mcs.anl.gov/otc/Guide/CaseStudies/port/formulations.html
http://www-fp.mcs.anl.gov/otc/Guide/CaseStudies/port/demo.html
and solve this problem remotely via internet to obtain the optimal way of dividing your capital
between the stocks you have chosen. In doing this you are free to set the level of risk you are
prepared to take. Give the mathematical description of the problem you have solved and report
on your results. /
Chapter 1
Convex Analysis
If the nonlinear optimization problem (NLO) has a convex objective function and the
feasible set is a convex set, then the underlying mathematical structure is much richer
than in the general case. For example, one can formulate necessary and sufficient
conditions for the existence of optimal solutions in this case. It is therefore important
to study convexity in some detail.
Definition 1.1 Let two points x¹, x² ∈ IRⁿ and 0 ≤ λ ≤ 1 be given. Then the point

x = λx¹ + (1 − λ)x²

is a convex combination of the points x¹ and x². A set C ⊆ IRⁿ is called convex if it contains all convex combinations of any two of its points. In other words, the line segment connecting two arbitrary points of a convex set is contained in the set.
Figure 1.1 and Figure 1.2 show examples of convex and nonconvex sets in the plane.
[Figure 1.1: examples of convex sets. Figure 1.2: examples of nonconvex sets.]
Exercise 1.1 We can define the convex combination of k points as follows. Let the points x¹, · · · , xᵏ ∈ IRⁿ and 0 ≤ λ₁, · · · , λₖ with Σ_{i=1}^k λᵢ = 1 be given. Then the vector

x = Σ_{i=1}^k λᵢxⁱ

is a convex combination of the points x¹, · · · , xᵏ. Prove that a convex set contains all convex combinations of any finite number of its points. /
The intersection of (possibly infinitely many) convex sets is again a convex set: if the sets Cᵢ (i = 1, 2, · · ·) are convex, then

C := ∩_{i=1}^∞ Cᵢ

is convex.
Another fundamental property of a convex set is that its closure is again a convex
set.
Theorem 1.3 Let C ⊂ IRn be a convex set and let cl(C) denote its closure. Then cl(C)
is a convex set.
We now turn our attention to so-called convex functions. A parabola f(x) = ax² + bx + c with a > 0 is a familiar example of a convex function. Intuitively, it is easier to characterize minima of convex functions than minima of more general functions, and for this reason we will study convex functions in some detail.

Definition 1.4 A function f : C → IR defined on a convex set C is called convex if for all x, y ∈ C and 0 ≤ λ ≤ 1 one has

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
Exercise 1.3 Let f : IRⁿ → IR be defined by f(x) = ‖x‖ for some norm ‖ · ‖. Prove that f is a convex function. /
Exercise 1.4 Show that the following univariate functions are not convex:

f(x) = sin x,  f(x) = e^{−x²},  f(x) = x³.

/

The epigraph of a function f : C → IR is the set {(x, t) ∈ C × IR | t ≥ f(x)}.

[Figure: the epigraph of a function f.]

Exercise 1.5 Prove that the function f : C → IR defined on the convex set C is convex if and only if the epigraph of f is a convex set. /
We also will need the concept of a strictly convex function. These are convex functions with the nice property that — if a minimum of the function exists — then this minimum is unique. Formally, f : C → IR is strictly convex if

f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)

for all x, y ∈ C with x ≠ y and 0 < λ < 1.

We have seen in Exercise 1.5 that a function is convex if and only if its epigraph is convex. Also, the next exercise shows that a quadratic function is convex if and only if the matrix Q in its definition is positive semidefinite (PSD).
Exercise 1.6 Let a symmetric matrix Q ∈ IRⁿˣⁿ, a vector c ∈ IRⁿ and a number γ ∈ IR be given. Prove that the quadratic function

½xᵀQx + cᵀx + γ

is convex on IRⁿ if and only if the matrix Q is PSD, and strictly convex if and only if Q is positive definite. /
Exercise 1.7 Decide whether the following quadratic functions are convex or not. (Hint: use the result from the previous exercise.)

f(x) = x₁² + 2x₁x₂ + x₂² + 5x₁ − x₂ + ½,  f(x) = x₁² + x₂² + x₃² − 2x₁x₂ − 2x₁x₃ − 2x₂x₃.

/
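By Exercise 1.6, convexity of a quadratic function can be tested by checking whether Q is PSD. The small sketch below (not part of the text; NumPy assumed) does this for the two functions of Exercise 1.7 via eigenvalues.

```python
import numpy as np

def is_psd(Q, tol=1e-10):
    # Q is PSD iff all eigenvalues of the symmetric matrix Q are nonnegative.
    return bool(np.all(np.linalg.eigvalsh(Q) >= -tol))

# f(x) = x1^2 + 2 x1 x2 + x2^2 + 5 x1 - x2 + 1/2  ->  0.5*x^T Q1 x with Q1 below
Q1 = np.array([[2.0, 2.0], [2.0, 2.0]])
# f(x) = x1^2 + x2^2 + x3^2 - 2 x1 x2 - 2 x1 x3 - 2 x2 x3  ->  Q2 below
Q2 = 2 * np.array([[1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])

print(is_psd(Q1))   # True: eigenvalues {0, 4}, so convex (but not strictly convex)
print(is_psd(Q2))   # False: eigenvalue -2 < 0, so not convex
```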
Note that we can change the problem of maximizing a concave function into the problem
of minimizing a convex function.
Now we review some further properties of convex sets and convex functions that are
necessary to understand and analyze convex optimization problems. First we review
some elementary properties of convex sets.
The convex hull conv(S) of an arbitrary set S ⊆ IRⁿ is the smallest convex set containing S. Observe that conv(S) is generated by taking all possible convex combinations of points from S.
We now define some important convex subsets of a given convex set C, namely the
so-called extremal sets, that play an important role in convex analysis. Informally, an
extremal set E ⊂ C is a convex subset of C with the following property: if any point on
the line segment connecting two points x1 ∈ C and x2 ∈ C lies in E, then the two points
x¹ and x² must also lie in E. The faces of a polytope are familiar examples of extremal sets of the polytope.
Definition 1.9 The convex set E ⊆ C is an extremal set of the convex set C if, for all
x1 , x2 ∈ C and 0 < λ < 1, one has λx1 + (1 − λ)x2 ∈ E only if x1 , x2 ∈ E.
An extremal set consisting of only one point is called an extremal point. Observe that
extremal sets are convex by definition, and the convex set C itself is always an extremal
set of C. It is easy to verify the following result.
Example 1.11 Let C be the cube {x ∈ IR3 | 0 ≤ x ≤ 1}, then the vertices are extremal points, the
edges are 1-dimensional extremal sets, the faces are 2-dimensional extremal sets, and the whole cube
is a 3-dimensional extremal set of itself.
Example 1.12 Let C be the cylinder {x ∈ IR³ | x₁² + x₂² ≤ 1, 0 ≤ x₃ ≤ 1}, then
• the points on the circles {x ∈ IR³ | x₁² + x₂² = 1, x₃ = 1} and {x ∈ IR³ | x₁² + x₂² = 1, x₃ = 0} are the extremal points,
• the lines {x ∈ IR³ | x₁ = a, x₂ = b, 0 ≤ x₃ ≤ 1}, with a ∈ [−1, 1] and b = √(1 − a²) or b = −√(1 − a²), are the 1-dimensional extremal sets,
• the faces {x ∈ IR³ | x₁² + x₂² ≤ 1, x₃ = 1} and {x ∈ IR³ | x₁² + x₂² ≤ 1, x₃ = 0} are the 2-dimensional extremal sets, and
• the cylinder itself is the only 3-dimensional extremal set.
Example 1.13 Let f(x) = x² and let C be the epigraph of f, then all points (x₁, x₂) such that x₂ = x₁² are extremal points. The epigraph itself is the only two-dimensional extremal set. ∗
Lemma 1.14 Let C be a closed convex set. Then all extremal sets of C are closed.
In the above examples we have pointed out extremal sets of different dimension with-
out giving a formal definition of what the dimension of a convex set is. To this end,
recall from linear algebra that if L is a (linear) subspace of IRn and a ∈ IRn then a + L is
called an affine subspace of IRn . By definition, the dimension of a + L is the dimension
of L.
Definition 1.15 The smallest affine space a + L containing a convex set C ⊆ IRn is
the so-called affine hull of C and denoted by aff(C). The dimension of C is defined as
the dimension of aff(C).
Given two points x1 ∈ C and x2 ∈ C, we call any point that lies on the (infinite) line
that passes through x1 and x2 an affine combination of x1 and x2 . Formally we have
the following definition.
Definition 1.16 Let two points x¹, x² ∈ IRⁿ and λ ∈ IR be given. Then the point

x = λx¹ + (1 − λ)x²

is an affine combination of the points x¹ and x².
Observe that in defining the affine combination we do not require that the coefficients
λ and 1 − λ are from the interval [0, 1], while this was required in the definition of the
convex combination of points.
Exercise 1.9 Let C ⊆ IRⁿ be a given convex set and k ≥ 2 a given integer. Prove that

aff(C) = { z | z = Σ_{i=1}^k λᵢxⁱ, Σ_{i=1}^k λᵢ = 1, λᵢ ∈ IR, xⁱ ∈ C, ∀i }.

/
Exercise 1.10 Let E be an extremal set of the convex set C. Prove that E = aff(E) ∩ C. (Hint: Use Exercise 1.9 with k = 2.) /
Lemma 1.17 Let E² ⊂ E¹ ⊆ C be two extremal sets of the convex set C. Then dim(E²) < dim(E¹).
∗
Proof: Because E² ⊂ E¹ we have aff(E²) ⊆ aff(E¹). Further, by Exercise 1.10,

E² = aff(E²) ∩ E¹.

If we assume to the contrary that dim(E²) = dim(E¹) then we have aff(E²) = aff(E¹) and so

E² = aff(E²) ∩ E¹ = aff(E¹) ∩ E¹ = E¹,

contradicting the assumption E² ⊂ E¹. 2
Lemma 1.18 Let C be a nonempty compact (closed and bounded) convex set. Then C
has at least one extremal point.
∗
Proof: Let F ⊆ C be the set of points in C which are furthest from the origin. The set of such points is not empty, because C is bounded and closed and the norm function is continuous. We claim that any point z ∈ F is an extremal point of C.

Let us assume to the contrary that z ∈ F is not an extremal point. Then there exist points x, y ∈ C, both different from z, and a λ ∈ (0, 1) such that

z = λx + (1 − λ)y.

Further, we have ‖x‖ ≤ ‖z‖ and ‖y‖ ≤ ‖z‖ because z maximizes the norm of the points over C. Thus by the triangle inequality

‖z‖ ≤ λ‖x‖ + (1 − λ)‖y‖ ≤ ‖z‖,

which implies that ‖z‖ = ‖x‖ = ‖y‖, i.e. all three points x, y, z are on the surface of the n-dimensional sphere with radius ‖z‖ centered at the origin. This is a contradiction, because these three different points lie on the same line as well. The lemma is proved. 2
Observe that the above proof does not require the use of the origin as reference point.
We could choose any point u ∈ IRn and prove that the furthest point z ∈ C from u is
an extremal point of C.
The following theorem shows that a compact convex set is completely determined by
its extremal points.
Theorem 1.19 (Krein–Milman Theorem) Let C be a compact convex set. Then C
is the convex hull of its extreme points.
Exercise 1.11 Let f be a continuous, concave function defined on a compact convex set C.
Show that the minimum of f is attained at an extreme point of C. (Hint: Use the Krein-
Milman Theorem.) /
Definition 1.20 The set C ⊂ IRn is a convex cone if it is a convex set and for all x ∈ C
and 0 ≤ λ one has λx ∈ C.
Example 1.21
• The set C = {(x₁, x₂) ∈ IR² | x₂ ≥ 2x₁, x₂ ≥ −½x₁} is a convex cone in IR².
• The set C′ = {(x₁, x₂, x₃) ∈ IR³ | x₁² + x₂² ≤ x₃², x₃ ≥ 0} is a convex cone in IR³.

[Figure: the cones C and C′.]
Definition 1.22 A convex cone is called pointed if it does not contain any subspace
except the origin.
A pointed convex cone could be defined equivalently as a convex cone that does not
contain any line.
Lemma 1.23 A convex cone C is pointed if and only if the origin 0 is an extremal
point of C.
∗
Proof: If the convex cone C is not pointed, then it contains a nontrivial subspace; in particular, it contains a one-dimensional subspace, i.e. a line L going through the origin. Let 0 ≠ x ∈ L; then we have −x ∈ L as well. From here we have 0 = ½x + ½(−x) ∈ C, i.e. 0 is not an extremal point.

If the convex cone C is pointed, then it does not contain any subspace except the origin 0. In that case we show that 0 is an extremal point of C. If we assume to the contrary that there exist 0 ≠ x¹, x² ∈ C and a λ ∈ (0, 1) such that 0 = λx¹ + (1 − λ)x², then we derive x¹ = −((1 − λ)/λ)x². This implies that the line through x¹, the origin 0 and x² is in C, contradicting the assumption that C is pointed. 2
Example 1.25 Let V₁, V₂ be two planes through the origin in IR³, given by the following equations:

V₁ := {x ∈ IR³ | x₃ = a₁x₁ + a₂x₂},
V₂ := {x ∈ IR³ | x₃ = b₁x₁ + b₂x₂}.

[Figure: the planes V₁ : x₃ = 2x₁ − x₂ and V₂ : x₃ = x₁ + 3x₂, intersecting in a line through the origin.]
Every convex cone C has an associated dual cone. By definition, every vector in the dual cone has a nonnegative inner product with every vector in C.

Definition 1.26 Let C ⊆ IRⁿ be a convex cone. The dual cone C∗ is defined by

C∗ := {z ∈ IRⁿ | zᵀx ≥ 0 for all x ∈ C}.

In the literature the dual cone C∗ is frequently referred to as the polar or positive polar of the cone C.
Exercise 1.12 Prove that (IRn+ )∗ = IRn+ , i.e. the nonnegative orthant is a self-dual cone. /
Exercise 1.13 Let Sn denote the set of n × n, symmetric positive semidefinite matrices.
(i) Prove that Sn is a convex cone.
(ii) Prove that (Sn )∗ = Sn , i.e. Sn is a self-dual cone. /
Exercise 1.14 Prove that the dual cone C ∗ is a closed convex cone. /
An important, deep theorem [38, 42] says that the dual of the dual cone, (C∗)∗, is the closure of the cone C.
An important cone in the study of convex optimization is the so-called recession cone.
Given an (unbounded) convex feasible set F and some x ∈ F , the recession cone of F
consists of all the directions one can travel in without ever leaving F , when starting
from x. Surprisingly, the recession cone does not depend on the choice of x.
Lemma 1.27 Let us assume that the convex set C is closed and not bounded. Then
(i) for each x ∈ C there is a non-zero vector z ∈ IRn such that x + λz ∈ C for all
λ ≥ 0, i.e. the set R(x) = {z | x + λz ∈ C, λ ≥ 0} is not empty;
(ii) the set R(x) is a closed convex cone (the so-called recession cone at x);
(iii) the cone R(x) = R is independent of x, thus it is ‘the’ recession cone of the convex
set C;1
(iv) R is a pointed cone if and only if C has at least one extremal point.
∗
Proof: (i) Let x ∈ C be given. Because C is unbounded, there is a sequence of points x¹, · · · , xᵏ, · · · such that ‖xᵏ − x‖ → ∞. Then the vectors

yᵏ = (xᵏ − x)/‖xᵏ − x‖

are in the unit sphere. The unit sphere is closed and bounded, i.e. compact, hence there exists an accumulation point ȳ of the sequence yᵏ. We claim that ȳ ∈ R(x). To prove this we take any λ > 0 and prove that x + λȳ ∈ C. This claim follows from the following three observations: 1. If we omit all the points from the sequence yᵏ for which ‖x − xᵏ‖ < λ then ȳ is still an accumulation point of the remaining sequence y^{kᵢ}. 2. Due to the convexity of C the points

x + λy^{kᵢ} = x + (λ/‖x^{kᵢ} − x‖)(x^{kᵢ} − x) = (1 − λ/‖x^{kᵢ} − x‖)x + (λ/‖x^{kᵢ} − x‖)x^{kᵢ}

are in C. 3. Because C is closed, the accumulation point x + λȳ of the sequence x + λy^{kᵢ} ∈ C is also in C. The proof of the first statement is complete.
¹ In the literature the recession cone is frequently referred to as the characteristic cone of the convex set C.
(ii) The set R(x) is a cone, because z ∈ R(x) implies µz ∈ R(x) for all µ ≥ 0. The convexity of R(x) easily follows from the convexity of C. Finally, if zⁱ ∈ R(x) for all i = 1, 2, · · · and z̄ = lim_{i→∞} zⁱ, then for each λ ≥ 0 the closedness of C and x + λzⁱ ∈ C imply that

lim_{i→∞} (x + λzⁱ) = x + λz̄ ∈ C,

i.e. z̄ ∈ R(x). Thus R(x) is a closed convex cone. 2
Corollary 1.28 The nonempty closed convex set C is bounded if and only if its recession
cone R consists of the zero vector alone.
∗
Proof: If C is bounded, then it contains no half line, thus for each x ∈ C the set R(x) = {0}, i.e. R = {0}. The other part of the proof follows from item (i) of Lemma 1.27. 2
Example 1.29 Let C be the epigraph of f(x) = 1/x (x > 0). Then every point on the curve x₂ = 1/x₁ is an extreme point of C. For an arbitrary point x = (x₁, x₂) ∈ C the recession cone is given by R(x) = {z ∈ IR² | z₁ ≥ 0, z₂ ≥ 0}.

[Figure: the set C and the translated recession cone x + R(x) at a point (x₁, x₂).]
∗
Lemma 1.30 If the convex set C is closed and has an extremal point, then each extremal
set of C has at least one extremal point as well.
∗
Proof: Let us assume to the contrary that an extremal set E ⊂ C has no extremal point. Then by
item (iv) of Lemma 1.27 the recession cone of E is not pointed, i.e. it contains a line. By statement
(iii) of the same lemma, this line is contained in the recession cone of C as well. Applying (iv) of
Lemma 1.27 again we conclude that C cannot have an extremal point. This is a contradiction, the
lemma is proved. 2
Lemma 1.31 Let C be a convex set and R be its recession cone. If E is an extremal
set of C the recession cone RE of E is an extremal set of R.
∗
Proof: Clearly R_E ⊆ R. Let us assume that R_E is not an extremal set of R. Then there are vectors z¹, z² ∈ R with z¹ ∉ R_E and a λ ∈ (0, 1) such that z = λz¹ + (1 − λ)z² ∈ R_E. Then, for a certain α > 0 and x ∈ E we have

x¹ = x + αz¹ ∈ C \ E,  x² = x + αz² ∈ C

and

λx¹ + (1 − λ)x² = x + αz ∈ E,

contradicting the extremality of E. 2
Consider, for example, a two-dimensional simplex (a triangle) lying in IR³. The interior of this convex set is empty, but intuitively the points that do not lie on the 'boundary' of the simplex do form a 'sort of interior'. This leads us to a generalized concept of the interior of a convex set, namely the relative interior. If the convex set is full-dimensional (i.e. C ⊆ IRⁿ has dimension n), then the concepts of interior and relative interior coincide.
Definition 1.32 Let a convex set C be given. The point x ∈ C is in the relative interior of C if for every x̄ ∈ C there exist x̃ ∈ C and 0 < λ < 1 such that x = λx̄ + (1 − λ)x̃. The set of relative interior points of the set C will be denoted by C⁰.
The relative interior C 0 of a convex set C is obviously a subset of the convex set. We will
show that the relative interior C 0 is a relatively open (i.e. it coincides with its relative
interior) convex set.
Example 1.33 Let C = {x ∈ IR3 | x21 + x22 ≤ 1, x3 = 1} and L = {x ∈ IR3 | x3 = 0}, then C ⊂ aff(C) =
(0, 0, 1) + L. Hence, dim C = 2 and C 0 = {x ∈ IR3 | x21 + x22 < 1, x3 = 1}.
Lemma 1.34 Let C ⊂ IRn be a convex set. Then for each x ∈ C 0 , y ∈ C and 0 < λ ≤ 1
we have
z = λx + (1 − λ)y ∈ C 0 ⊆ C.
∗
Proof: Let u ∈ C be an arbitrary point. We have to show that there is an ū ∈ C and a 0 < ρ < 1 such that z = ρū + (1 − ρ)u. The proof is constructive.

Because x ∈ C⁰, by Definition 1.32 there is an 0 < α < 1 such that the point

v := (1/α)x + (1 − 1/α)u

is in C. Let

ū = ϑv + (1 − ϑ)y,  where  ϑ = λα/(λα + 1 − λ).

Due to the convexity of C we have ū ∈ C. Finally, let us define ρ = λα + 1 − λ. Then one can easily verify that 0 < ρ < 1 and

z = λx + (1 − λ)y = ρū + (1 − ρ)u. 2

[Figure: the construction in the proof: the points u, x, y, the auxiliary point v = (1/α)x + (1 − 1/α)u, the point ū = ϑv + (1 − ϑ)y, and z = λx + (1 − λ)y.]
Corollary 1.35 The relative interior C 0 of a convex set C ⊂ IRn is convex.
Lemma 1.37 Let f be a convex function defined on the convex set C. Then f is
continuous on the relative interior C 0 of C.
∗
Proof: Let p ∈ C⁰ be an arbitrary point. Without loss of generality we may assume that C is full dimensional, p is the origin and f(p) = 0.

Let us first consider the one dimensional case. Because 0 is in the interior of the domain C of f we have a v > 0 such that v ∈ C and −v ∈ C as well. Let us consider the two linear functions

ℓ₁(x) := x f(v)/v  and  ℓ₂(x) := x f(−v)/(−v).

One easily checks that the convexity of f implies the following relations:
• ℓ₁(x) ≥ f(x) if x ∈ [0, v];
• ℓ₁(x) ≤ f(x) if x ∈ [−v, 0];
• ℓ₂(x) ≥ f(x) if x ∈ [−v, 0];
• ℓ₂(x) ≤ f(x) if x ∈ [0, v].

Then by defining h(x) := max{ℓ₁(x), ℓ₂(x)} and g(x) := min{ℓ₁(x), ℓ₂(x)} we obtain continuous functions with g(0) = f(0) = h(0) = 0 and g(x) ≤ f(x) ≤ h(x) on [−v, v], whence f is continuous at 0.

In the n-dimensional case one chooses points v¹, · · · , v^{n+1} ∈ C, containing 0 in the interior of their convex hull, whose affine hull equals the space IRⁿ. For all i = 1, · · · , n + 1 let the linear functions (hyperplanes) Lᵢ(x) be defined by n + 1 of their values: Lᵢ(0) = 0 and Lᵢ(vʲ) = f(vʲ) for all j ≠ i. Let us further define

h(x) := max{L₁(x), · · · , L_{n+1}(x)}  and  g(x) := min{L₁(x), · · · , L_{n+1}(x)}.

Then one easily proves that the functions g(x) and h(x) are continuous, f(0) = h(0) = g(0) = 0 and g(x) ≤ f(x) ≤ h(x) in a neighborhood of 0, whence f is continuous at 0. 2

Exercise 1.16 Prove that the functions g(x) and h(x), defined in the proof above, are continuous, that f(0) = h(0) = g(0) = 0, and that g(x) ≤ f(x) ≤ h(x) in a neighborhood of 0. /
The following result, called Jensen's inequality, is simply a generalization of the inequality f(λx¹ + (1 − λ)x²) ≤ λf(x¹) + (1 − λ)f(x²).

Theorem (Jensen's inequality) Let f be a convex function on the convex set C ⊆ IRⁿ, let x¹, · · · , xᵏ ∈ C and let λ₁, · · · , λₖ ≥ 0 with Σ_{i=1}^k λᵢ = 1. Then

f(Σ_{i=1}^k λᵢxⁱ) ≤ Σ_{i=1}^k λᵢf(xⁱ).
∗
Proof: The proof is by induction on k. If k = 2 then the statement is true by Definition 1.4. Let us assume that the statement holds for a given k ≥ 2; then we prove that it also holds for k + 1.

Let the points x¹, · · · , xᵏ, x^{k+1} ∈ C and λ₁, · · · , λₖ, λ_{k+1} ≥ 0 with Σ_{i=1}^{k+1} λᵢ = 1 be given. If at most k of the coefficients λᵢ, 1 ≤ i ≤ k + 1, are nonzero then, by leaving out the points xⁱ with zero coefficients, the inequality directly follows from the inductive assumption. Now let us consider the case when all the coefficients λᵢ are nonzero. Then by convexity of the set C we have that

x̃ = Σ_{i=1}^k (λᵢ / Σ_{j=1}^k λⱼ) xⁱ ∈ C.

Further

f(Σ_{i=1}^{k+1} λᵢxⁱ) = f((Σ_{j=1}^k λⱼ) x̃ + λ_{k+1}x^{k+1})
 ≤ (Σ_{j=1}^k λⱼ) f(x̃) + λ_{k+1}f(x^{k+1})
 ≤ (Σ_{j=1}^k λⱼ) (Σ_{i=1}^k (λᵢ / Σ_{j=1}^k λⱼ) f(xⁱ)) + λ_{k+1}f(x^{k+1})
 = Σ_{i=1}^{k+1} λᵢf(xⁱ),

where the first inequality follows from the convexity of the function f (Definition 1.4) and, at the second inequality, the inductive assumption was used. The proof is complete. 2
The reader can easily prove the following two lemmas by applying the definitions. We
leave the proofs as exercises.
Lemma 1.40 Let f¹, · · · , fᵏ be convex functions defined on a convex set C ⊆ IRⁿ. Then
• for all λ₁, · · · , λₖ ≥ 0 the function f(x) = Σ_{i=1}^k λᵢfⁱ(x) is convex;
• the function f(x) = max_{1≤i≤k} fⁱ(x) is convex.
Lemma 1.42 Let f be a convex function on the convex set C ⊆ IRn and h : IR → IR be
a convex monotonically non-decreasing function. Then the composite function h(f (x)) :
C → IR is convex.
Exercise 1.18 Assume that the function h in Lemma 1.42 is not monotonically non-decreasing. Give a concrete example showing that in this case the statement of the lemma may fail. /
Definition 1.43 Let a convex function f : C → IR defined on the convex set C be given.
Let α ∈ IR be an arbitrary number. The set Dα = {x ∈ C | f (x) ≤ α} is called a level
set of the function f .
Lemma 1.44 If f is a convex function on the convex set C then for all α ∈ IR the level
set Dα is a (possibly empty) convex set.
∗
Proof: Let x, y ∈ Dα and 0 ≤ λ ≤ 1. Then we have f (x) ≤ α, f (y) ≤ α and we may write
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) ≤ λα + (1 − λ)α = α.
Here the first inequality followed from the convexity of the function f . The lemma is proved. 2
Definition 1.45 Let x ∈ IRⁿ and a direction (vector) s ∈ IRⁿ be given. The directional derivative δf(x, s) of the function f, at the point x, in the direction s, is defined as

δf(x, s) = lim_{λ→0} (f(x + λs) − f(x))/λ,

if the above limit exists.

If the function f is continuously differentiable then ∂f/∂xᵢ = δf(x, eⁱ), where eⁱ is the i-th unit vector. This implies the following result.
Lemma 1.46 If the function f is continuously differentiable then for all s ∈ IRn we
have
δf (x, s) = ∇f (x)T s.
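A quick numerical illustration of Lemma 1.46 (a sketch with an arbitrarily chosen smooth function, not from the text; NumPy assumed): a finite-difference estimate of δf(x, s) matches ∇f(x)ᵀs.

```python
import numpy as np

f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]          # an arbitrary smooth function
grad_f = lambda x: np.array([2 * x[0] + 3 * x[1],  # its gradient, computed by hand
                             3 * x[0]])

x = np.array([1.0, 2.0])
s = np.array([0.5, -1.0])
lam = 1e-7
fd = (f(x + lam * s) - f(x)) / lam   # finite-difference directional derivative
print(fd, grad_f(x) @ s)             # both ≈ 1.0
```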
The Hesse matrix (or Hessian) ∇²f(x) of the function f at a point x ∈ C is composed of the second order partial derivatives of f:

(∇²f(x))ᵢⱼ = ∂²f(x)/(∂xᵢ∂xⱼ)  for all i, j = 1, · · · , n.
Lemma 1.47 Let f be a function defined on a convex set C ⊆ IRn . The function f is
convex if and only if the function φ(λ) = f (x + λs) is convex on the interval [0, 1] for
all x ∈ C and x + s ∈ C.
∗
Proof: Let us assume that f is a convex function. Then we prove that φ(λ) is convex on the interval [0, 1]. Let λ₁, λ₂ ∈ [0, 1] and 0 ≤ α ≤ 1. Then one has

φ(αλ₁ + (1 − α)λ₂) = f(x + (αλ₁ + (1 − α)λ₂)s)
 = f(α(x + λ₁s) + (1 − α)(x + λ₂s))
 ≤ αf(x + λ₁s) + (1 − α)f(x + λ₂s)
 = αφ(λ₁) + (1 − α)φ(λ₂).

Conversely, applying the same identity with s = y − x for x, y ∈ C shows that the convexity of all the functions φ implies the convexity of f. 2
Example 1.48 Let f (x) = x21 + x22 and let Ef be the epigraph of f . For every s ∈ IR2 , we can define
the half-plane Vs ⊂ IR3 as {(x, y) ∈ IR2 × IR| x = µs, µ > 0}. Now, for x = (0, 0) the epigraph of
φ(λ) = f (x + λs) = f (λs) equals Vs ∩ Ef , which is a convex set. Hence, φ(λ) is convex.
[Figure: the epigraph Ef of f cut by the half-plane Vs; the cut is the epigraph of φ(λ).]
∗
Lemma 1.49 Let f be a continuously differentiable function on the open convex set C ⊆ IRⁿ. Then the following statements are equivalent.

1. The function f is convex on C.
2. For any two vectors x, x̄ ∈ C one has

∇f(x)ᵀ(x̄ − x) ≤ f(x̄) − f(x) ≤ ∇f(x̄)ᵀ(x̄ − x).

3. For any x ∈ C, and any s ∈ IRⁿ such that x + s ∈ C, the function φ(λ) = f(x + λs) is continuously differentiable on the open interval (0, 1) with φ′(λ) = sᵀ∇f(x + λs), and φ′ is a monotonically non-decreasing function.
∗
Proof: First we prove that 1 implies 2. Let 0 ≤ λ ≤ 1 and x, x̄ ∈ C. Then the convexity of f implies

f(λx̄ + (1 − λ)x) ≤ λf(x̄) + (1 − λ)f(x).

This can be rewritten as

(f(x + λ(x̄ − x)) − f(x))/λ ≤ f(x̄) − f(x).

Taking the limit as λ → 0 and applying Lemma 1.46 the left-hand-side inequality of 2 follows. As one interchanges the roles of x and x̄, the right-hand-side inequality is obtained analogously.

Now we prove that 2 implies 3. Let x, x + s ∈ C and 0 ≤ λ₁, λ₂ ≤ 1. When we apply the inequalities of 2 with the points x + λ₁s and x + λ₂s the following relations are obtained:

(λ₂ − λ₁)∇f(x + λ₁s)ᵀs ≤ f(x + λ₂s) − f(x + λ₁s) ≤ (λ₂ − λ₁)∇f(x + λ₂s)ᵀs,

hence

(λ₂ − λ₁)φ′(λ₁) ≤ φ(λ₂) − φ(λ₁) ≤ (λ₂ − λ₁)φ′(λ₂).

Assuming λ₁ < λ₂ we have

φ′(λ₁) ≤ (φ(λ₂) − φ(λ₁))/(λ₂ − λ₁) ≤ φ′(λ₂),

proving that the function φ′(λ) is monotonically non-decreasing.

Finally we prove that 3 implies 1. We only have to prove that φ(λ) is convex if φ′(λ) is monotonically non-decreasing. Let us take 0 < λ₁ < λ₂ < 1. Then for 0 ≤ α ≤ 1 we may write

(1 − α)φ(λ₁) + αφ(λ₂) − φ((1 − α)λ₁ + αλ₂)
 = α[φ(λ₂) − φ(λ₁)] − [φ((1 − α)λ₁ + αλ₂) − φ(λ₁)]
 = α(λ₂ − λ₁) [ ∫₀¹ φ′(λ₁ + t(λ₂ − λ₁)) dt − ∫₀¹ φ′(λ₁ + tα(λ₂ − λ₁)) dt ]
 ≥ 0.

The expression for the derivative of φ is left as an exercise in calculus (Exercise 1.19). The proof of the lemma is complete. 2

Figure 1.5 illustrates the inequalities in statement 2 of the lemma.
Exercise 1.19 Let f : IRⁿ → IR be twice continuously differentiable and let x ∈ IRⁿ and s ∈ IRⁿ be given. Define φ : IR → IR via φ(λ) = f(x + λs). Prove that

φ′(λ) = sᵀ∇f(x + λs)

and

φ″(λ) = sᵀ∇²f(x + λs)s.

/
[Figure 1.5: the graph of f together with its tangent line at x; the difference f(x̄) − f(x) lies between ∇f(x)ᵀ(x̄ − x) and ∇f(x̄)ᵀ(x̄ − x).]
Lemma 1.50 Let f be a twice continuously differentiable function on the open convex
set C ⊆ IRn . The function f is convex if and only if its Hesse matrix ∇2 f (x) is PSD
for all x ∈ C.
Proof: Let us take an arbitrary x ∈ C and s ∈ IRⁿ, and define φ(λ) = f(x + λs). If f is convex, then φ′(λ) is monotonically non-decreasing. This implies that φ″(λ) is nonnegative for each x ∈ C and 0 ≤ λ ≤ 1. Thus

sᵀ∇²f(x)s = φ″(0) ≥ 0,

proving the positive semidefiniteness of the Hessian ∇²f(x).

On the other hand, if the Hessian ∇²f(x) is positive semidefinite for each x ∈ C, then

sᵀ∇²f(x + λs)s = φ″(λ) ≥ 0,

i.e. φ′(λ) is monotonically non-decreasing, proving the convexity of f by Lemma 1.49. The lemma is proved. 2
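Lemma 1.50 suggests a simple numerical (non-)convexity test: sample the Hessian and check positive semidefiniteness. The sketch below (the example function is chosen here for illustration and is not from the text; NumPy assumed) rejects f(x) = x₁⁴ + x₁x₂ + x₂², whose Hessian is indefinite near x₁ = 0.

```python
import numpy as np

def hess(x):
    # Hessian of f(x) = x1^4 + x1*x2 + x2^2, computed by hand
    return np.array([[12 * x[0] ** 2, 1.0],
                     [1.0,            2.0]])

rng = np.random.default_rng(0)
psd_everywhere = all(
    np.all(np.linalg.eigvalsh(hess(rng.uniform(-2, 2, size=2))) >= -1e-10)
    for _ in range(1000)
)
print(psd_everywhere)  # False: for |x1| < sqrt(1/24) the Hessian is indefinite, so f is not convex
```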
Exercise 1.20 Prove the following statement analogously to the proof of Lemma 1.50. Let f be a twice continuously differentiable function on the open convex set C. Then f is strictly convex if its Hesse matrix ∇²f(x) is positive definite (PD) for all x ∈ C. /
Exercise 1.21 Give an example of a twice continuously differentiable strictly convex function
f where ∇2 f (x) is not positive definite (PD) for all x in the domain of f . /
Chapter 2
Optimality conditions
We consider two cases separately. First optimality conditions for unconstrained op-
timization are considered. Then optimality conditions for some special constrained
optimization problems are derived.
Convex functions possess the appealing property that a local minimum is global.
[Figure: the graph of f₁.]

The point x̄ = 0 is a global minimum of the function f₁ because f₁(x̄) ≤ f₁(x) for all x ∈ IR. Because x̄ = 0 is a global minimum, it follows immediately that it is also a local minimum of the function f₁. The point x̄ = 0 is neither a strict local nor a strict global minimum point of f₁ because for any ε > 0 we can find an x ≠ x̄ with ‖x̄ − x‖ ≤ ε for which f₁(x̄) = f₁(x).

Now let us consider the non-convex function f₂ : IR → IR defined as

f₂(x) =
  −x        if x < −2,
  2         if −2 ≤ x ≤ −1,
  −x + 1    if −1 < x < 0,
  1         if 0 ≤ x ≤ 1,
  x         if x > 1.

[Figure: the graph of f₂.]

The point x̄ = 0 is a global minimum of the function f₂ because f₂(x̄) ≤ f₂(x) for all x ∈ IR. Because it is a global minimum it is at the same time a local minimum as well. The point x̄ = 0 is neither a strict local, nor a strict global minimum of the function f₂ because for any ε > 0 we can find an x ≠ x̄ with ‖x̄ − x‖ ≤ ε for which f₂(x̄) = f₂(x).

The point x∗ = −2 is also a local minimum of the function f₂ because f₂(x∗) ≤ f₂(x) for all x ∈ IR with ‖x∗ − x‖ ≤ ε, for any 0 < ε < 1. It is not a strict local minimum because every neighborhood of x∗ contains points x ≠ x∗ with f₂(x) = f₂(x∗). The point x∗ = −2 is not a global minimum of f₂ because f₂(−2) > f₂(0). ∗
Example 2.3 Consider the convex function f₁(x) = x² where x ∈ IR.

[Figure: the graph of f₁(x) = x².]

The point x̄ = 0 is a strict local minimum of the function f₁ because f₁(x̄) < f₁(x) for all x ≠ x̄ with ‖x̄ − x‖ < ε, for any ε > 0. The point x̄ = 0 is also a strict global minimum of the function f₁ because f₁(x̄) < f₁(x) for all x ≠ x̄.

Consider the non-convex function f₂ : IR → IR defined as

f₂(x) =
  (x + 3)² + 3   if x < −2,
  x²             if x ≥ −2.

[Figure: the graph of f₂.]

The point x̄ = 0 is a strict local minimum as well as a strict global minimum of the function f₂, because f₂(x̄) < f₂(x) for all x ≠ x̄. The point x∗ = −3 is a strict local minimum because f₂(x∗) < f₂(x) for all x ≠ x∗ with ‖x∗ − x‖ < ε, for any 0 < ε < 1. The point x∗ = −3 is not a strict global minimum, because f₂(−3) > f₂(0). ∗
Lemma 2.4 Any (strict) local minimum of a convex function f is a (strict) global
minimum of f as well.
Now we arrive at the famous result of Fermat, which says that a necessary condition for x to be a minimum of a continuously differentiable function f is that ∇f(x) = 0.

Theorem 2.5 Let f : IRⁿ → IR be continuously differentiable. If x̄ is a (local) minimum of f, then ∇f(x̄) = 0.
Remark: In the above theorem it is enough to assume that the partial derivatives
of f exist. The same proof applies if we choose ei , the standard unit vectors instead of
the arbitrary direction s.
Exercise 2.2 Consider Kepler's problem (Section 0.3.2).

1. Show that Kepler's problem can be written as the problem of minimizing a nonlinear univariate function on an open interval.

2. Show that the solution given by Kepler is indeed optimal by using Theorem 2.5.

/
Observe that the above theorem contains only a one-sided implication. It does not
say anything about a minimum of f if ∇f (x) = 0. Such points are not necessarily
minimum points. These points are called stationary points. Think of the stationary
(inflection) point x = 0 of the univariate function f (x) = x3 . In other words, Fermat’s
result only gave a necessary condition for a minimum, namely ∇f (x) = 0. We will now
see that this is also a sufficient condition if f is convex.
Theorem 2.6 Let f be a continuously differentiable convex function. The point x ∈ IRn
is a minimum of the function f if and only if ∇f (x) = 0.
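For a convex quadratic function the theorem reduces minimization to a linear system, as this small sketch (not from the text; NumPy assumed, data illustrative) shows.

```python
import numpy as np

# f(x) = 0.5 x^T Q x + c^T x with Q positive definite, hence f is convex
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
c = np.array([1.0, -2.0])

x_star = np.linalg.solve(Q, -c)          # solve grad f(x) = Qx + c = 0
print(x_star)
print(np.allclose(Q @ x_star + c, 0))    # True: the gradient vanishes, so x_star is the global minimum
```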
Exercise 2.3 We return to Steiner's problem (see Section 0.3.3) of finding the Torricelli point of a given triangle, which was defined as the solution of the optimization problem

min_{x∈IR²} ‖x − a‖ + ‖x − b‖ + ‖x − c‖,    (2.2)

where a, b, and c are given vectors in IR² that form the vertices of the given triangle.

1. Show that (2.2) is a convex optimization problem.

2. Give necessary and sufficient conditions for a minimum of (2.2). (In other words, give the equations that determine the Torricelli point. You may assume that all three angles of the triangle are smaller than 2π/3.)

3. Find the Torricelli point of the triangle with vertices (0, 0), (3, 0) and (1, 2).
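Item 3 can also be checked numerically; the sketch below (assuming SciPy, not part of the text) minimizes the sum of distances directly.

```python
import numpy as np
from scipy.optimize import minimize

verts = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 2.0]])
obj = lambda x: sum(np.linalg.norm(x - v) for v in verts)   # total distance to the vertices

res = minimize(obj, x0=verts.mean(axis=0))   # start at the centroid
print(res.x)   # numerical Torricelli point of the triangle
```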
Theorem 2.7 Let f be twice continuously differentiable, let ∇f(x̄) = 0, and suppose the Hesse matrix ∇²f(x) is positive semidefinite for all x in a neighborhood of x̄. Then x̄ is a local minimum of f.

Proof: By Taylor's theorem, for x in this neighborhood we have

f(x) = f(x̄ + (x − x̄)) = f(x̄) + ∇f(x̄)ᵀ(x − x̄) + ½(x − x̄)ᵀ∇²f(x̄ + α(x − x̄))(x − x̄)

for some 0 ≤ α ≤ 1. Using the assumptions we have the result f(x) ≥ f(x̄), as x is in the neighborhood of x̄ where the Hesse matrix is positive semidefinite. 2

Theorem 2.8 Let f be twice continuously differentiable, let ∇f(x̄) = 0, and let the Hesse matrix ∇²f(x̄) be positive definite. Then x̄ is a strict local minimum of f.

Proof: Since f is twice continuously differentiable, it follows from the positive definiteness of the Hesse matrix at x̄ that it is positive definite in a neighborhood of x̄. Hence the claim follows from Theorem 2.7. 2
2.2 Optimality conditions for constrained optimization
The following theorem generalizes the optimality conditions for a convex function on
IRn (Theorem 2.6), by replacing IRn by any relatively open convex set C ⊆ IRn .
Theorem 2.9 Let us consider the optimization problem min{ f (x) : x ∈ C} where C
is a relatively open convex set and f is a convex differentiable function. The point x is
an optimal solution of this problem if and only if ∇f (x)T s = 0 for all s ∈ L, where L
denotes the linear subspace with aff(C) = x + L for any x ∈ C. Here aff(C) denotes the
affine hull of C.
Proof: Suppose first that x is an optimal solution. Since C is relatively open, for any s ∈ L we have x + λs ∈ C for all sufficiently small λ > 0, and by optimality

f(x) ≤ f(x + λs) if x + λs ∈ C.

Hence

0 ≤ (f(x + λs) − f(x))/λ for all s ∈ L,

if λ > 0 is sufficiently small. Taking the limit as λ ↓ 0 results in 0 ≤ ∇f(x)ᵀs for all s ∈ L. Since s ∈ L implies −s ∈ L, it follows that ∇f(x)ᵀs = 0 for all s ∈ L. Conversely, if ∇f(x)ᵀs = 0 for all s ∈ L then, by Lemma 1.49 and the convexity of f, f(x̃) ≥ f(x) + ∇f(x)ᵀ(x̃ − x) = f(x) for all x̃ ∈ C. 2
A crucial assumption in the above theorem is that the set C is relatively open. In general this is not the case, because the level sets of convex optimization problems are closed. However, as we will see later, the barrier function approach results in such relatively open feasible sets. This is an important feature of interior point methods that will be discussed later on. If the set of feasible solutions is not relatively open, similar results can be derived by similar techniques (see Theorem 2.14).
Exercise 2.4 Consider Tartaglia's problem (Section 0.3.1).

1. Eliminate one of the variables and show that the resulting problem can be written as the problem of minimizing a univariate convex function on an open interval.

2. Show that the answer given by Tartaglia is indeed the optimal solution, by applying Theorem 2.9.

/
Now let us consider the general convex optimization problem, as given earlier in (1), but without equality constraints:

(CO)  min f(x)  s.t.  x ∈ F,  where  F = {x ∈ C | gⱼ(x) ≤ 0, j = 1, · · · , m},

the set C ⊆ IRⁿ is convex and f, g₁, · · · , gₘ are convex functions. A vector s ∈ IRⁿ is called a feasible direction at x ∈ F if x + λs ∈ F for some λ > 0; the set of feasible directions at x is denoted by FD(x).
Example 2.11 Assume that the feasible set F ⊂ IR² is defined by the three constraints

−x₁ − x₂ + 1 ≤ 0,  1 − x₂ ≤ 0,  x₁ − x₂ ≤ 0.

If x̄ = (1, 1), then the set of feasible directions at x̄ is FD(x̄) = {s ∈ IR² | s₂ ≥ s₁, s₂ ≥ 0}. Note that in this case FD(x̄) is a closed convex set.

[Figure: the set F and the translated cone x̄ + FD(x̄).]
Example 2.12 Assume that the feasible set F ⊂ IR² is defined by the single constraint x₁² − x₂ ≤ 0. If x̄ = (1, 1), then the set of feasible directions at x̄ is FD(x̄) = {s ∈ IR² | s₂ > 2s₁}. Observe that now FD(x̄) is an open set.

[Figure: the set F and the translated cone x̄ + FD(x̄).]
Lemma 2.13 For any x ∈ F the set of feasible directions FD(x) is a convex cone.

Proof: Let ϑ > 0. Obviously, s ∈ FD(x) implies (ϑs) ∈ FD(x), since x + (λ/ϑ)(ϑs) = x + λs ∈ F; hence FD(x) is a cone. To prove the convexity of FD(x) let us take s, s̄ ∈ FD(x). Then by definition we have x + λs ∈ F and x + λs̄ ∈ F for some λ > 0 (observe that a common λ can be taken). Further, for 0 ≤ α ≤ 1 we write

x + λ(αs + (1 − α)s̄) = α(x + λs) + (1 − α)(x + λs̄).

Due to the convexity of F the right hand side of the above equation is in F, hence the convexity of FD(x) follows. 2
In view of the above lemma we may speak about the cone of feasible directions FD(x) for any x ∈ F. Note that the cone of feasible directions is not necessarily closed even if the set of feasible solutions F is closed. Figure 2.1 illustrates the cone FD(x) for three different choices of F and x.

We will now formulate an optimality condition in terms of the cone of feasible directions. It states that a feasible solution is optimal if and only if the gradient of the objective at that point makes a non-obtuse angle with all feasible directions at that point (no feasible descent direction exists).
Theorem 2.14 The feasible point x ∈ F is an optimal solution of the convex optimiza-
tion problem (CO) if and only if for all s ∈ F D(x) one has δf (x, s) ≥ 0.
Proof: Observing that s ∈ F D(x) if and only if s = λ(x − x) for some x ∈ F and
some λ > 0, the result follows in the same way as in the proof of Theorem 2.9. 2
[Figure 2.1: the cone of feasible directions FD(x) for three different choices of F and x; in one of the cases FD(x) is the closed positive quadrant.]
It is easy to give a sufficient condition for (2.4) to hold, which we will now do. This
condition will depend only on the constraint functions that are zero (active) at x.
Now let Ix denote the index set of the active constraints at x, and assume that C = IRⁿ. We now give a sufficient condition for (2.4) to hold (i.e. for x to be an optimal solution): it suffices that

∇f(x) = − Σ_{i∈Ix} yᵢ∇gᵢ(x)    (2.5)

for some nonnegative vector y, where Ix denotes the index set of the active constraints at x, as before.
Exercise 2.5 Let s ∈ FD(x) be a given feasible direction at x ∈ F for (CO) and let C = IRn .
One has
∇gi (x)T s ≤ 0 for all i ∈ Ix .
(Hint: Use Lemma 1.49.) /
Exercise 2.6 Let x ∈ F be a feasible solution of (CO) where C = IRⁿ. Use the previous exercise and Theorem 2.14 to show that, if there exists a y ≥ 0 such that

∇f(x) = − Σ_{i∈Ix} yᵢ∇gᵢ(x),

then x is an optimal solution of (CO). /
Exercise 2.7 We wish to design a cylindrical can with height h and radius r such that the volume is at least V units and the total surface area is minimal. We can formulate this as the following optimization problem:

min 2πr² + 2πrh,

subject to

πr²h ≥ V, r > 0, h > 0.

1. Show that we can rewrite the above problem as the following optimization problem:

p∗ = min 2π(e^{2x₁} + e^{x₁+x₂}),

subject to

ln(V/π) − 2x₁ − x₂ ≤ 0,  x₁ ∈ IR, x₂ ∈ IR.

2. Prove that the new problem is a convex optimization problem (CO).

3. Prove that the optimal design is where r = ½h = (V/(2π))^{1/3}, by using the result of Exercise 2.6.

/
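A numerical check of item 3 (a sketch assuming SciPy; V = 1 is an arbitrary choice, not from the text): minimize the surface area subject to the volume constraint and compare with the closed-form answer.

```python
import numpy as np
from scipy.optimize import minimize

V = 1.0
area = lambda z: 2 * np.pi * (z[0] ** 2 + z[0] * z[1])   # z = (r, h): 2*pi*r^2 + 2*pi*r*h

res = minimize(area, x0=[0.5, 1.0],
               constraints={"type": "ineq", "fun": lambda z: np.pi * z[0] ** 2 * z[1] - V},
               bounds=[(1e-6, None)] * 2)
r, h = res.x
print(r, h / 2, (V / (2 * np.pi)) ** (1 / 3))   # all three ≈ 0.5419: r = h/2 = (V/(2*pi))^(1/3)
```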
The KKT condition (2.5) is sufficient for optimality, but is not a necessary condition
for optimality in general, as the next example shows.
Example 2.17 Consider the problem of the form (CO):

min x subject to x² ≤ 0, x ∈ IR.

Obviously, the unique optimal solution is x̄ = 0, and the constraint g(x) := x² ≤ 0 is active at x̄. If we write out condition (2.5) at x̄, we get

1 = ∇f(x̄) = −y∇g(x̄) = −y(2 · 0) = 0,

which is obviously not satisfied for any choice of y ≥ 0. In other words, we cannot prove that x̄ = 0 is an optimal solution by using the KKT condition. ∗
In the rest of the chapter we will show that the KKT conditions are also necessary optimality conditions for (CO), if the feasible set F satisfies an additional assumption called the Slater condition.

Definition 2.18 The problem (CO) satisfies the Slater condition (is Slater regular) if there exists a point x⁰ ∈ C⁰ (a so-called Slater point) such that gⱼ(x⁰) < 0 for every j for which gⱼ is not linear, and gⱼ(x⁰) ≤ 0 for every j for which gⱼ is linear.
Example 2.19

1. Let us consider the optimization problem

min f(x)
s.t. x₁² + x₂² ≤ 4
     x₁ − x₂ ≥ 2
     x₂ ≥ 0
     C = IR².

The feasible region F contains only one point, (2, 0), for which the non-linear constraint becomes an equality. Hence, the problem is not Slater regular.

[Figure: the disc x₁² + x₂² ≤ 4 and the two linear constraints; the feasible region is the single point (2, 0).]
Exercise 2.8 Assume that (CO) satisfies the Slater condition. Prove that any x ∈ F 0 is a
Slater point of (CO). /
Exercise 2.9 By solving a so-called first-phase problem one can check whether a given problem of the form (CO) satisfies the Slater condition. Let us assume that C = IRⁿ and consider the first-phase problem

min τ
s.t. gⱼ(x) − τ ≤ 0, j = 1, · · · , m
     x ∈ IRⁿ, τ ∈ IR,

where τ is an auxiliary variable. Show that (CO) satisfies the Slater condition if and only if the optimal value of the first-phase problem is negative. /
We can further refine our definition. Some constraint functions gⱼ(x) might take the value zero for all feasible points. Such constraints are called singular, while the others are called regular. Hence the index set of singular constraints is defined as

Js = {j ∈ J | gⱼ(x) = 0 for all x ∈ F},

while the index set of regular (qualified) constraints is defined as the complement of the singular set:

Jr = J \ Js = {j ∈ J | gⱼ(x) < 0 for some x ∈ F}.
Remark: Note that if (CO) is Slater regular, then all singular functions must be linear.
Definition 2.20 A Slater point x∗ ∈ C⁰ is called an ideal Slater point of the convex optimization problem (CO) if

gⱼ(x∗) < 0 for all j ∈ Jr  and  gⱼ(x∗) = 0 for all j ∈ Js.
Lemma 2.21 If the convex optimization problem (CO) is Slater regular then there
exists an ideal Slater point x∗ ∈ F .
Proof: According to the assumption, there exists a Slater point x⁰ ∈ C⁰, and there exist points xᵏ ∈ F for all k ∈ Jr such that gₖ(xᵏ) < 0. Let λ₀ > 0 and λₖ > 0 for all k ∈ Jr be such that λ₀ + Σ_{k∈Jr} λₖ = 1; then x∗ = λ₀x⁰ + Σ_{k∈Jr} λₖxᵏ is an ideal Slater point. This last statement follows from the convexity of the functions gⱼ. 2
Example 2.22

1. Let us consider the optimization problem

min f(x)
s.t. x₁² + x₂² ≤ 4
     x₁ − x₂ ≥ 2
     x₂ ≥ −1
     C = {x | x₁ = 1}.

The feasible region contains only one point, (1, −1), but now this point does lie in the relative interior of the convex set C. Hence, this point is an ideal Slater point.

[Figure: the line C = {x | x₁ = 1} and the constraints; the feasible region is the point (1, −1).]
2. Let us now consider the same problem with C = IR²:

min f(x)
s.t. x₁² + x₂² ≤ 4
     x₁ − x₂ ≥ 2
     x₂ ≥ −1
     C = IR².

Now, the point (1, −1) is again a Slater point, but not an ideal Slater point. The point (3/2, −3/4) is an ideal Slater point.

[Figure: the feasible region F.]
Exercise 2.10 Prove that any ideal Slater point of (CO) is in the relative interior of F. /
Theorem 2.23 Let U ⊆ IRⁿ be a convex set and let a point w ∈ IRⁿ with w ∉ U be given. Then there is a separating hyperplane {x | aᵀx = α}, with a ∈ IRⁿ, α ∈ IR, such that

1. aᵀw ≤ α;
2. aᵀu ≥ α for all u ∈ U;
3. U is not contained in the hyperplane {x | aᵀx = α}.

Note that the last property says that there is a u ∈ U such that aᵀu > α.
Now we are ready to prove the convex Farkas Lemma. The proof here is a simplified
version of the proofs in the books [38, 42].
Lemma 2.24 (Farkas) Let the convex optimization problem (CO) be given and assume that the Slater regularity condition is satisfied. The inequality system

f(x) < 0
gⱼ(x) ≤ 0, j = 1, · · · , m    (2.6)
x ∈ C

has no solution if and only if there exists a vector y = (y₁, · · · , yₘ) ≥ 0 such that

f(x) + Σ_{j=1}^m yⱼgⱼ(x) ≥ 0 for all x ∈ C.    (2.7)

Before proving this important result we make a remark: the systems (2.6) and (2.7) are called alternative systems, i.e. exactly one of them has a solution.
Proof: If the system (2.6) has a solution then clearly (2.7) cannot be true for that
solution. This is the trivial part of the lemma. Note that this part is true without any
regularity condition.
To prove the other side let us assume that (2.6) has no solution. With u = (u₀, · · · , uₘ), we define the set U ⊆ IR^{m+1} as follows:

U = {u | ∃x ∈ C such that f(x) < u₀, gⱼ(x) ≤ uⱼ (j ∈ Jr), gⱼ(x) = uⱼ (j ∈ Js)}.

Clearly the set U is convex (note that due to the Slater condition singular functions are linear) and due to the infeasibility of (2.6) it does not contain the origin. Hence according to Theorem 2.23 there exists a separating hyperplane defined by an appropriate vector (y₀, y₁, · · · , yₘ) and α = 0 such that

Σ_{j=0}^m yⱼuⱼ ≥ 0 for all u ∈ U.    (2.8)
The proof now proceeds in four steps:
I. First we show that y₀ ≥ 0 and yⱼ ≥ 0 for all j ∈ Jr.
II. Secondly we establish that (2.8) holds for u = (f(x), g₁(x), · · · , gₘ(x)) if x ∈ C.
III. Thirdly we show that y₀ > 0.
IV. Finally, it is shown by using induction that we can assume yⱼ > 0 for all j ∈ Js.

I. First we show that y₀ ≥ 0 and yⱼ ≥ 0 for all j ∈ Jr. Let us assume that y₀ < 0. Let us take an arbitrary (u₀, u₁, · · · , uₘ) ∈ U. By definition (u₀ + λ, u₁, · · · , uₘ) ∈ U for all λ ≥ 0. Hence by (2.8) one has

λy₀ + Σ_{j=0}^m yⱼuⱼ ≥ 0 for all λ ≥ 0.

For sufficiently large λ the left hand side is negative, which is a contradiction, i.e. y₀ must be nonnegative. The proof of the nonnegativity of all yⱼ with j ∈ Jr goes analogously.

II. The claim follows from the observation that for all x ∈ C and for all λ > 0 one has u = (f(x) + λ, g₁(x), · · · , gₘ(x)) ∈ U, thus

y₀(f(x) + λ) + Σ_{j=1}^m yⱼgⱼ(x) ≥ 0 for all x ∈ C.

Letting λ → 0 we obtain

y₀f(x) + Σ_{j=1}^m yⱼgⱼ(x) ≥ 0 for all x ∈ C.    (2.10)
III. Thirdly we show that y0 > 0. The proof is by contradiction. We already know
that y0 ≥ 0. Let us assume to the contrary that y0 = 0. Hence from (2.10) we have
X X m
X
yj gj (x) + yj gj (x) = yj gj (x) ≥ 0 for all x ∈ C.
j∈Jr j∈Js j=1
gj (x∗ ) = 0 if j ∈ Js ,
whence
yj gj (x∗ ) ≥ 0.
X
j∈Jr
Since yj ≥ 0 and gj (x∗ ) < 0 for all j ∈ Jr , this implies yj = 0 for all j ∈ Jr . This results
in X
yj gj (x) ≥ 0 for all x ∈ C. (2.11)
j∈Js
46
Because the ideal Slater point x∗ is in the relative interior of C there exist a vector
x̃ ∈ C and 0 < λ < 1 such that x∗ = λx + (1 − λ)x̃. Using that gj (x∗ ) = 0 for j ∈ Js
and that the singular functions are linear one gets
0= yj gj (x∗ )
P
j∈Js
= yj gj (λx + (1 − λ)x̃)
P
j∈Js
= λ yj gj (x) + (1 − λ) yj gj (x̃)
P P
j∈Js j∈Js
> (1 − λ) yj gj (x̃).
P
j∈Js
At this point we have (2.10) with y0 > 0 and yj ≥ 0 for all j ∈ Jr . Dividing by y0 > 0
y
in (2.10) and by defining yj := y0j for all j ∈ J we obtain
m
X
f (x) + yj gj (x) ≥ 0 for all x ∈ C. (2.13)
j=1
We finally show that y may be taken such that yj > 0 for all j ∈ Js .
IV. To complete the proof we show by induction on the cardinality of Js that one
can make yj positive for all j ∈ Js . Observe that if Js = ∅ then we are done. If |Js | = 1
then we apply the results proved till this point to the inequality system
gs (x) < 0,
gj (x) ≤ 0, j ∈ Jr , (2.14)
x∈C
where {s} = Js . The system (2.14) has no solution, it satisfies the Slater condition,
and therefore there exists a ŷ ∈ IRm−1 such that
X
gs (x) + ŷj gj (x) ≥ 0 for all x ∈ C, (2.15)
j∈Jr
where ŷj ≥ 0 for all j ∈ Jr . Adding a sufficiently large positive multiple of (2.15) to
(2.13) one obtains a positive coefficient ŷs > 0 for gs (x).
The general inductive step goes analogously. Assuming that the result is proved if
|Js | = k then the result is proved for the case |Js | = k +1. Let s ∈ Js then |Js \{s}| = k,
47
and hence the inductive assumption applies to the system
gs (x) < 0
gj (x) ≤ 0, j ∈ Js \ {s},
(2.16)
gj (x) ≤ 0, j ∈ Jr ,
x ∈ C.
By construction the system (2.16) has no solution, it satisfies the Slater condition, and
by the inductive assumption we have a ŷ ∈ IRm−1 such that
X
gs (x) + ŷj gj (x) ≥ 0 for all x ∈ C. (2.17)
j∈Jr ∪Js \{s}
where ŷj > 0 for all j ∈ Js \ {s} and ŷj ≥ 0 for all j ∈ Jr . Adding a sufficiently large
multiple of (2.17) to (2.13), one obtains the desired nonnegative multipliers. 2
Remark: Note, that finally we proved slightly more than was stated. We have proved
that the multipliers of all the singular constraints can be made strictly positive.
(CO) min x
s.t. x2 ≤ 0
x ∈ IR.
x < 0
x2 ≤ 0
has no solution, but for every y > 0 the quadratic function f (x) = x + yx2 has two zeroes.
− y1 1
− 2y
1
− 4y
48
2. Let us consider the convex optimization problem
(CO) min 1 + x
s.t. x2 − 1 ≤ 0
x ∈ IR.
1+x < 0
x2 − 1 ≤ 0
1
has no solution. If we let y = 2 the quadratic function
1 2 1
g(x) = x + 1 + y(x2 − 1) = x +x+
2 2
−3 −1 1
1 2 1
x + x + ≥ 0 for all x ∈ IR.
2 2
∗
Exercise 2.11 Let the matrices A : m × n and the vector b ∈ IRm be given. Apply the convex
Farkas Lemma 2.24 to prove that exactly one of the following alternative systems (I) or (II)
is solvable:
(I) Ax ≤ b, x ≥ 0,
or
(II) AT y ≥ 0, y ≥ 0, bT y < 0.
/
Exercise 2.12 Let the matrices A : m × n, B : k × n and the vectors a ∈ IRm , b ∈ IRk be
given. With a proper reformulation, apply the convex Farkas Lemma 2.24 to the inequality
system
Ax ≤ a, Bx < b, x≥0
to derive its alternative system. /
49
Exercise 2.13 Let the matrix A : m × n and the vectors c ∈ IRn and b ∈ Rm be given. Apply
the convex Farkas Lemma 2.24 to prove the so-called Goldman–Tucker theorem for the LO
problem:
min {cT x : Ax = b, x ≥ 0}
when it admits an optimal solution. In other words, prove that there exists an optimal solution
x∗ and an optimal solution (y ∗ , s∗ ) of the dual LO problem
max {bT y : AT y + s = c, s ≥ 0}
such that
x∗ + s∗ > 0.
/
where x ∈ C and y ≥ 0. Note that for fixed y the Lagrange function is convex in x.
Definition 2.26 A vector pair (x, y) ∈ IRn+m , x ∈ C and y ≥ 0 is called a saddle point
of the Lagrange function L if
We will see (in the proof of Theorem 2.30) that the x part of a saddle point is always
an optimal solution of (CO).
Example 2.27 [Saddle point] Let us consider the convex optimization problem
(CO) min −x + 2
s.t. ex − 4 ≤ 0
x ∈ IR
50
for x = − log y, thus L(− log y, y) = log y − 4y + 3 is a minimum.
On the other hand, for feasible x, i.e. if x ≤ log 4, we have
sup y(ex − 4) = 0.
y≥0
Hence, defining ψ(y) = inf x∈IR L(x, y) and φ(x) = supy≥0 L(x, y) we have
log y − 4y + 3 for y > 0,
ψ(y) =
−∞ for y = 0;
−x + 2 for x ≤ log 4,
φ(x) =
∞ for x > log 4.
Now, we have
d 1
ψ(y) = − 4 = 0
dy y
for y = 41 , i.e. this value gives the maximum of ψ(y). Hence, supy≥0 ψ(y) = − log 4 + 2. The function
φ(x) is minimal for x = log 4, thus inf x∈IR φ(x) = − log 4 + 2 and we conclude that (log 4, 41 ) is a saddle
point of the Lagrange function L(x, y). Note that x = log 4 is the optimal solution of (CO). ∗
Lemma 2.28 A saddle point (x, y) ∈ IRn+m , x ∈ C and y ≥ 0 of L(x, y) satisfies the
relation
inf sup L(x, y) = L(x, y) = sup inf L(x, y). (2.20)
x∈C y≥0 y≥0 x∈C
hence one can take the supremum of the left hand side and the infimum of the right
hand side resulting in
sup inf L(x, y) ≤ inf sup L(x, y). (2.21)
y≥0 x∈C x∈C y≥0
inf sup L(x, y) ≤ sup L(x, y) ≤ L(x, y) ≤ inf L(x, y) ≤ sup inf L(x, y). (2.22)
x∈C y≥0 y≥0 x∈C y≥0 x∈C
min ex subject to x ≤ 0.
51
Here
L(x, y) = ex + yx.
It is easy to verify that x = −1 and y = e−1 satisfy (2.20). Indeed, L(x, y) = 0 and
However, (x, y) is not a saddle point of L. (This example does not have an optimal solution, and, as
we have mentioned, the x part of a saddle point is always an optimal solution of (CO).) ∗
We still do not know if a saddle point exists or not. Assuming Slater regularity, the
next result states that L(x, y) has a saddle point if and only if (CO) has an optimal
solution.
Proof: The easy part of the theorem is to prove that if (x, y) is a saddle point of
L(x, y) then x is optimal for (CO). The proof of this part does not need any regularity
condition. From the saddle point inequality (2.19) one has
m
X m
X m
X
f (x) + yj gj (x) ≤ f (x) + y j gj (x) ≤ f (x) + y j gj (x)
j=1 j=1 j=1
for all y ≥ 0 and for all x ∈ C. From the first inequality one easily derives gj (x) ≤ 0
for all j = 1, · · · , m hence x ∈ F is a feasible solution of (CO). Taking the two extreme
sides of the above inequality and substituting y = 0 we have
m
X
f (x) ≤ f (x) + y j gj (x) ≤ f (x)
j=1
52
for all x ∈ C. Using that x is feasible one easily derive the saddle point inequality
m
X m
X
f (x) + yj gj (x) ≤ f (x) ≤ f (x) + y j gj (x)
j=1 j=1
Corollary 2.31 Under the assumptions of Theorem 2.30 the vector x ∈ C is an optimal
solution of (CO) if and only if there exists a y ≥ 0 such that
m
X
(i) f (x) = min{f (x) + y j gj (x)} and
x∈C
j=1
m
X m
X
(ii) y j gj (x) = max{ yj gj (x)}.
y≥0
j=1 j=1
Corollary 2.32 Under the assumptions of Theorem 2.30 the vector x ∈ F is an opti-
mal solution of (CO) if and only if there exists a y ≥ 0 such that
m
X
(i) f (x) = min{f (x) + y j gj (x)} and
x∈C
j=1
m
X
(ii) y j gj (x) = 0.
j=1
Corollary 2.33 Let us assume that C = IRn and the functions f, g1 , · · · , gm are con-
tinuously differentiable functions. Under the assumptions of Theorem 2.30 the vector
x ∈ F is an optimal solution of (CO) if and only if there exists a y ≥ 0 such that
m
X
(i) 0 = ∇f (x) + y j ∇gj (x) and
j=1
m
X
(ii) y j gj (x) = 0.
j=1
53
Proof: Follows directly from Corollary 2.32 and the convexity of the function f (x) +
Pm
j=1 y j gj (x), x ∈ C. 2
Note that the last corollary stays valid if C is a full dimensional open subset of IRn .
If the set C is not full dimensional, then the right hand side vector, the x−gradient of
the Lagrange function has to be orthogonal to any direction in the affine hull of C (cf.
Theorem 2.9). To check the validity of these statements is left to the reader.
Definition 2.34 (KKT point) Let us assume that C = IRn and the functions f, g1 , · · · , gm
are continuously differentiable functions. The vector (x, y) ∈ IRn+m is called a Karush–
Kuhn–Tucker (KKT) point of (CO) if
(iv) y ≥ 0.
Exercise 2.15 Let us assume that C = IRn and the functions f, g1 , · · · , gm are continuously
differentiable convex functions and the assumptions of Theorem 2.30 hold. Show that (x, y) is
a saddle point of the Lagrangian of (CO) if and only if it is a KKT point of (CO). /
Corollary 2.35 Let us assume that C = IRn and the functions f, g1 , · · · , gm are contin-
uously differentiable convex functions and the assumptions of Theorem 2.30 hold. Let
the vector (x, y) be a KKT point, then x is an optimal solution of (CO).
Thus we have derived necessary and sufficient optimality conditions for the convex
optimization problem (CO) under the Slater regularity assumption. Note that if an
optimization problem is not convex, or does not satisfy any regularity condition, then
only weaker results can be proven.
54
Chapter 3
Every optimization problem has an associated dual optimization problem. Under some
assumptions, a convex optimization problem (CO) and its dual have the same optimal
objective values. We can therefore use the dual problem to show that a certain solution
of (CO) is in fact optimal. Moreover, some optimization algorithms solve (CO) and
its dual problem at the same time, and when the objective values are the same then
optimality has been proved. One can easily derive dual problems and duality results
from the KKT theory or from the Convex Farkas Lemma. First we define the more
general Lagrange dual and then we specialize it to get the so-called Wolfe dual for
convex problems.
Lemma 3.2 The Lagrange Dual (LD) of (CO) is a convex optimization problem, even
if the functions f, g1 , · · · , gm are not convex.
55
than the sum of the two separate infimums one has:
m
X
ψ(λy + (1 − λ)ŷ) = inf f (x) + (λy j + (1 − λ)ŷj )gj (x)
x∈C
j=1
Xm Xm
= inf λ f (x) + y j gj (x) + (1 − λ) f (x) + ŷj gj (x)
x∈C
j=1 j=1
Xm Xm
≥ inf λ f (x) + y j gj (x) + inf (1 − λ) f (x) + ŷj gj (x)
x∈C x∈C
j=1 j=1
= λψ(y) + (1 − λ)ψ(ŷ).
2
Definition 3.3 If x is a feasible solution of (CO) and y ≥ 0 then we call the quantity
f (x) − ψ(y)
It is easy to prove the so-called weak duality theorem, which states that the duality
gap is always nonnegative.
ψ(y) ≤ f (x)
m
X
and equality holds if and only if inf {f (x) + y j gj (x)} = f (x).
x∈C
j=1
Corollary 3.5 If x is a feasible solution of (CO), y ≥ 0 and ψ(y) = f (x) then the
vector x is an optimal solution of (CO) and y is optimal for (LD). Further if the
functions f, g1 , · · · , gm are continuously differentiable then (x, y) is a KKT-point.
To prove the so-called strong duality theorem one needs a regularity condition.
56
Theorem 3.6 Let us assume that (CO) satisfies the Slater regularity condition. Let x
be a feasible solution of (CO). The vector x is an optimal solution of (CO) if and only
if there exists a y ≥ 0 such that y is an optimal solution of (LD) and
ψ(y) = f (x).
Remark: If the convex optimization problem does not satisfy a regularity condition,
then it is not true in general that the duality gap is zero. It is also not always true (even
not under regularity condition) that the convex optimization problem has an optimal
solution. Frequently only the supremum or the infimum of the objective function exists.
Example 3.7 [Lagrange dual] Let us consider again the problem (see Example 2.25)
(CO) min x
s.t. x2 ≤ 0
x ∈ IR.
As we have seen this (CO) problem is not Slater regular and the Convex Farkas Lemma 2.24 does not
apply to the system
x < 0
x2 ≤ 0.
The optimal value of the Lagrange dual is zero, i.e. in spite of the lack of Slater regularity there is no
duality gap. ∗
57
Definition 3.8 Assume that C = IRn and the functions f, g1 , · · · , gm are continuously
differentiable and convex. The problem
m
X
(W D) sup f (x) + yj gj (x)
x,y
j=1
m
X
∇f (x) + yj ∇gj (x) = 0,
j=1
y ≥ 0, x ∈ IRn ,
Note that the variables in (W D) are both y ≥ 0 and x ∈ IRn , and that the Lagrangian
L(x, y) is the objective function of (W D). For this reason, the Wolfe dual does not have
a concave objective function in general, but it is still very useful tool, as we will see.
In particular, if the Lagrange function has a saddle point, C = IRn and the functions
f, g1 , · · · , gm are continuously differentiable and convex, then the two dual problems are
equivalent. Using the results of the previous section one easily proves weak and strong
duality results, as we will now show. A more detailed discussion of duality theory can
be found in [2, 28].
Theorem 3.9 (Weak duality for the Wolfe dual) Assume that C = IRn and the
functions f, g1 , · · · , gm are continuously differentiable and convex. If x̂ is a feasible
solution of (CO) and (x, y) is a feasible solution for (WD) then
L(x, y) ≤ f (x̂).
Proof: Let (x, y) be a feasible solution for (WD). Since the functions f and g1 , . . . , gm
are convex and continuously differentiable, and ȳ ≥ 0, the function
m
X
h(x) := f (x) + y j gj (x)
j=1
must also be convex and continuously differentiable (see Lemma 1.40). Since (x, y) is
feasible for (W D), one has
m
X
∇h(x) = ∇f (x) + y j ∇gj (x) = 0.
j=1
This means that x is a minimizer of the function h, by Lemma 2.6. In other words
m m
y j gj (x) ∀x ∈ IRn .
X X
f (x) + y j gj (x) ≤ f (x) + (3.1)
j=1 j=1
58
Let x̂ be an arbitrary feasible solution of (CO). Setting x = x̂ in (3.1) one gets
m
X m
X
f (x) + y j gj (x) ≤ f (x̂) + y j gj (x̂) ≤ f (x̂),
j=1 j=1
where the last inequality follows from y ≥ 0 and gj (x̂) ≤ 0 (j = 1, . . . , m). This
completes the proof. 2
Theorem 3.10 (Strong duality for the Wolfe dual) Assume that C = IRn and the
functions f, g1 , · · · , gm are continuously differentiable and convex. Also assume that
(CO) satisfies the Slater regularity condition. Let x be a feasible solution of (CO).
Then x is an optimal solution of (CO) if and only if there exists a y ≥ 0 such that
(x, y) is an optimal solution of (WD).
Warning! Remember, we are only allowed to form the Wolfe dual of a nonlinear
optimization problem if it is a convex optimization problem. We may replace the
infimum in the definition of ψ(y) by the condition that the x-gradient is zero only if
all the functions f and gj , ∀j are convex and if we know that the infimum is attained.
Else, the condition
m
X
∇f (x) + yj ∇gj (x) = 0
j=1
allows solutions which are possibly maxima, saddle points or inflection points, or it may
not have any solution. In such cases no duality relation holds in general. For nonconvex
problems one has to work with the Lagrange dual.
Example 3.11 [Wolfe dual] Let us consider the convex optimization problem
Then the optimal value is 5 with x = (4, 0). Note that the Slater condition holds for this example.
59
which is a non-convex problem. The first constraint gives y1 = 13 , and thus the second constraint
becomes
5 x2
e − y2 = 0.
3
Now we can eliminate y1 and y2 from the object function. We get the function
5 x2 5 10
f (x2 ) = e − x2 ex2 + .
3 3 3
This function has a maximum when
5
f 0 (x2 ) = − x2 ex2 = 0,
3
which is only true when x2 = 0 and f (0) = 5. Hence the optimal value of (WD) is 5 and then
(x, y) = (4, 0, 13 , 35 ).
Lagrange dual We can double check this answer by using the Lagrange dual. Let
We have
0 1
for y1 = 3
inf {x1 − 3y1 x1 } =
x1 ∈IR −∞ otherwise.
Now, for fixed y1 , y2 , with y2 > 0 let
Now we have
d 1 3y2
ψ( , y2 ) = log( )=0
dy2 3 5
when y2 = 35 , and ψ( 31 , 53 ) = 5.
Exercise 3.2 Prove that — under the assumptions of Theorem 3.10 — the Lagrange and
Wolfe duals of the optimization problem (CO) are equivalent. /
60
Exercise 3.3 We wish to design a rectangular prism (box) with length l, width b, and height
h such that the volume of the box is at least V units, and the total surface area is minimal.
This problem has the following (nonconvex) formulation:
min 2(lb + bh + lh), lbh ≥ V, l, b, h > 0. (3.2)
l,b,h
ii) Show that the transformed problem is convex and satisfies Slater’s regularity condition.
iii) Show that the Lagrange dual of problem (3.3) is:
3 3 λ
max + ln(V ) λ − λ ln . (3.4)
λ≥0 2 2 4
iv) Show that the Wolfe dual of problem (3.3) is the same as the Lagrange dual.
v) Use the KKT conditions of problem (3.3) to show that the cube (l = b = h = V 1/3 ) is
the optimal solution of problem (3.2).
vi) Use the dual problem (3.4) to derive the same result as in part v).
Linear optimization
61
As we substitute c = −AT y − + AT y + + s in the objective and introduce the notation
y = y + − y − the standard dual linear optimization problem follows.
max bT y
AT y + s = c,
s ≥ 0.
Quadratic optimization
The quadratic optimization problem is considered in the symmetric form. Let A : m×n
be a matrix, Q : n × n be a positive semi-definite symmetric matrix, b ∈ IRm and
c, x ∈ IRn . The primal Quadratic Optimization (QO) problem is given as
1
(QO) min{cT x + xT Qx | Ax ≥ b, x ≥ 0}.
2
Here we can say that C = IRn . Obviously all the constraints are continuously differen-
tiable. The inequality constraints can be given as gj (x) = (−aj )T x + bj if j = 1, · · · , m
and gj (x) = −xj−m if j = m + 1, · · · , m + n. Denoting the Lagrange multipliers by y
and s respectively the Wolfe dual (WD) of (QO) has the following form:
max bT y − 12 z T z
−D T z + AT y + s = c,
y ≥ 0, s ≥ 0.
62
Constrained maximum likelihood estimation
Maximum Likelihood Estimation frequently occurs in statistics. This problem can also
be used to illustrate duality in convex optimization. In this problem we are given a
finite set of sample points xi , (1 ≤ i ≤ n). The most probable density values at the
sample points are to be determined that satisfy some linear (e.g. convexity) constraints.
Formally, the problem is defined as one has to determine the maximum of the Likelihood
function Πni=1 xi under the conditions
Ax ≥ 0, dT x = 1, x ≥ 0.
Here Ax ≥ 0 represents the linear constraints, the density values xi are nonnegative and
the condition dT x = 1 ensures that the (approximate) integral of the density function is
one. Since the logarithm function is monotone the objective can equivalently replaced
by
n
X
min − ln xi .
i=1
It is easy to check that the so defined problem is a convex optimization problem. Again
we can take C = IRn and all the constraints are linear, hence continuously differentiable.
Denoting the Lagrange multipliers by y ∈ IRm , t ∈ IR and s ∈ IRn respectively the Wolfe
dual (WD) of this problem has the following form:
n
ln xi + y T (−Ax) + t(dT x − 1) + sT (−x)
X
max −
i=1
−X −1 e − AT y + td − s = 0,
y ≥ 0, s ≥ 0.
Here the notation e = (1, · · · , 1) ∈ IRn and X =diag(x) is used. Also note that for
simplicity we did not split the equality constraint into two inequalities but we used
immediately that its multiplier is a free variable. Multiplying the first constraint by xT
one has
−xT X −1 e − xT AT y + txT d − xT s = 0.
Using dT x = 1, xT X −1 e = n and the optimality conditions y T Ax = 0, xT s = 0 we have
t = n.
Observe further that due to the logarithm in the primal objective, the primal optimal
solution is necessarily strictly positive, hence the dual variable s must be zero at the
optimum. Combining these results the dual problem is
n
X
max − ln xi
i=1
X −1 e + AT y = nd,
y ≥ 0.
63
1
Eliminating the variables xi > 0 from the constraints one has xi = ndi −aT
and − ln xi =
i y
ln(ndi − aTi y) for all i = 1, · · · , n. Now we have the final form of our dual problem:
n
ln(ndi − aTi y)
X
max
i=1
AT y ≤ nd,
y ≥ 0.
The feasible region is F = {x ∈ IR2 | x1 ≥ 0, x2 = 0}. The only constraint is non-linear and singular,
thus (CO) is not Slater regular. The optimal value of the object function is 1.
The Lagrange function is given by
q
L(x, y) = e−x2 + y( x21 + x22 − x1 ).
y ≥ 0.
The first constraint imply that x2 = 0 and x1 ≥ 0, but these values do not satisfy the second constraint.
Thus the Wolfe dual is infeasible, yielding an infinitely large duality gap.
p
Let us see if we can do better by using the Lagrange dual. Now, let = x21 + x22 − x1 , then
x22 − 2x1 − 2 = 0.
p
Hence, for any > 0 we can find x1 > 0 such that = x21 + x22 − x1 even if x2 goes to infinity.
However, when x2 goes to infinity e−x2 goes to 0. So,
q
−x2 2 2
ψ(y) = inf 2 e +y x1 + x2 − x1 = 0,
x∈IR
64
is 0. This gives a nonzero duality gap that equals to 1.
Observe that the Wolfe dual becomes infeasible because the infimum in the definition of ψ(y) exists,
but it is not attained. ∗
Example 3.13 [Basic model with zero duality gap] Let us first consider the following simple
convex optimization problem.
min x1
s.t. x21 ≤ 0 (3.5)
−x2 ≤ 0
−1 − x1 ≤ 0.
Here the convex set C where the above functions are defined is IR2 . It is clear that the set of feasible
solutions is given by
F = {(x1 , x2 ) | x1 = 0, x2 ≥ 0},
thus any feasible vector (x1 , x2 ) ∈ F is optimal and the optimal value of this problem is 0. Because
x1 = 0 for all feasible solutions the Slater regularity condition does not hold for (3.5).
Let us make the Lagrange dual of (3.5). The Lagrange multipliers (y1 , y2 , y3 ) are nonnegative and
the Lagrange function
L(x, y) = x1 + y1 x21 − y2 x2 − y3 (1 + x1 )
is defined on x ∈ IR2 and y ∈ IR3 , y ≥ 0.
The Lagrange dual is defined as
max ψ(y) (3.6)
s.t. y ≥ 0.
where
ψ(y) = inf 2 {x1 + y1 x21 − y2 x2 − y3 (1 + x1 )}
x∈IR
Example 3.14 [A variant with positive duality gap] Let us consider the same problem as in the
previous example (see problem (3.5)) with a different representation of the feasible set. As we will see
the new formulation results in a quite different dual. The new dual has also an optimal solution but
now the duality gap is positive.
min x1
s.t. x0 − s0 = 0
x1 − s1 = 0
x2 − s2 = 0 (3.7)
1 + x1 − s3 = 0
x0 = 0
x ∈ IR3 , s ∈ C.
65
Note that (3.7) has the correct form: the constraints are linear, hence convex, and the vector (x, s) of
the variables belong to the convex set IR3 × C. Here the convex set C is defined as follows:
C = {s = (s0 , s1 , s2 , s3 ) | s0 ≥ 0, s2 ≥ 0, s3 ≥ 0, s0 s2 ≥ s21 }.
F = {(x, s) | x0 = 0, x1 = 0, x2 ≥ 0, s0 = 0, s1 = 0, s2 ≥ 0, s3 = 1},
thus any feasible vector (x, s) ∈ F is optimal and the optimal value of this problem is 0.
3. Prove that problem (3.7) does not satisfy the Slater regularity condition.
/
Due to the equality constraints the Lagrange multipliers (y0 , y1 , y2 , y3 , y4 ) are free and the Lagrange
function
where
ψ(y) = inf
3
L(x, s, y)
x∈IR , s∈C
= inf
3
{x1 (1 + y1 + y3 ) + x0 (y4 + y0 ) + x2 y2 − s0 y0 − s1 y1 − s2 y2 − s3 y3 + y3 }
x∈IR , s∈C
y
3 if 1 + y1 + y3 = 0, y4 + y0 = 0, y2 = 0, y3 ≤ 0, y0 ≤ 0, y1 = 0;
=
−∞ otherwise.
66
Summarizing the above results we conclude that the Lagrange dual reduces to
max y3
y0 ≤ 0, y1 = 0, y2 = 0, y3 = −1, y4 = −y0 .
Here for any feasible solution y3 = −1, thus the optimal value of the Lagrange dual is −1, i.e. both
the primal problem (3.7) and its dual (3.8) have optimal solutions, but their optimal values are not
equal. ∗
Exercise 3.5 Modify the above problem so that for a given γ > 0 the nonzero duality gap at
optimum will be equal to γ. /
Example 3.15 [Duality for non convex problems 1] Let us consider the non-convex optimization
problem
and then
We have
0 for y ≥ −1
inf {(1 + y)x21 } =
x1 −∞ for y < −1
−1 for y > 0
inf {yx22 − 2x2 } = y
x2 −∞ for y ≤ 0.
1
(LD) sup − − 4y
y
y > 0,
which is a convex problem, and the optimal value is −4, with y = 12 . Note that although the
problem is not convex, and does not satisfy the Slater regularity condition, the duality gap is
zero.
67
Example 3.16 [Duality for non convex problems 2] Let us consider the non-convex optimization
problem
Then we have the optimal value −12 with x = (−2, 4). The Lagrange function of (CLO) is given by
Now, x21 + yx1 is a parabola which has its minimum at x1 = − y2 . So, this minimum lies within C when
y ≤ 4. When y ≥ 4 the minimum is reached at the boundary of C. The minimum of the parabola
−x22 + yx2 is always reached at the boundaries of C, at x2 = −2 when y ≥ 2, and at x2 = 4 when
y ≤ 2. Hence, we have
2
− y4 + 2y − 16 for y ≤ 2
2
ψ(y) = − y4 − 4y − 4 for 2 ≤ y ≤ 4
−6y
for y ≥ 4.
Hence, the optimal value of the Lagrange dual is −13, and we have a nonzero duality gap that equals
to 1. ∗
where 0 indicates that the left hand side matrix has to be positive semidefinite. It
is clear that the primal problem (P SO) is a convex optimization problem since the
68
convex combination of positive semidefinite matrices is also positive semidefinite. For
convenience the notation n X
F (x) = −A0 + Ak xk
k=1
will be used.
The dual problem of the semidefinite optimization problem, as given e.g. in [44], is as
follows:
(DSP ) max Tr(A0 Z) (3.10)
s.t. Tr(Ak Z) = ck , for all k = 1, · · · , n,
Z 0,
where Z ∈ IRm×m is the matrix of variables. Again, the dual of the semidefinite opti-
mization problem is convex. The trace of a matrix is a linear function of the matrix and
the convex combination of positive semidefinite matrices is also positive semidefinite.
Theorem 3.17 (Weak duality) If x ∈ IRn is primal feasible and Z ∈ IRm×m is dual
feasible, then
cT x ≥ Tr(A0 Z)
with equality if and only if
F (x)Z = 0.
Proof: Using the dual constraints and some elementary properties of the trace of
matrices one may write
n n
T
X X
c x − Tr(A0 Z) = Tr(Ak Z)xk − Tr(A0 Z) = Tr(( Ak xk − A0 )Z) = Tr(F (x)Z) ≥ 0.
k=1 k=1
Here the last inequality holds because both matrices F (x) and Z are positive semidefi-
nite. Equality holds if and only if F (x)Z = 0, which completes the proof. 2
69
where eT = (1, · · · , 1) ∈ IRn and X ◦ Z denotes the Minkowski (coordinatewise) product
of matrices. Before going on we observe that eT (S ◦ Z)e = Tr(SZ), hence the Lagrange
function can be reformulated as
n
L(x, S, Z) = cT x −
X
xk Tr(Ak Z) + Tr(A0 Z) + Tr(SZ). (3.12)
k=1
Before formulating the Lagrange dual of (P SO 0) note that we can assume that the
matrix Z is symmetric, since F (x) is symmetric. The Lagrange dual of problem (P SO 0)
is
where
ψ(Z) = min{ L(x, S, Z) | x ∈ IRn , S ∈ IRm×m , S 0}. (3.14)
As we did in deriving the Wolfe dual, one easily derives optimality conditions to calcu-
late ψ(Z). Since the minimization in (3.14) is done in the free variable x, the positive
semidefinite matrix of variables S and, further the function L(x, S, Z) is separable w.r.t.
x and S we can take these minimums separately.
If we minimize in S all the terms in (3.12) but Tr(SZ) are constant. The matrix S is
positive semidefinite, hence
0 if Z 0,
min Tr(SZ) = (3.15)
S
−∞ otherwise.
By combining the last formula and the results presented in (3.15) and in (3.16) the
simplified form of the Lagrange dual (3.13), the Lagrange–Wolfe dual
70
Exercise 3.6 Consider the problem given in Example 3.13.
and IRn+ , respectively. One observes, that the positive orthants IRm n
+ and IR+ are convex
cones, i.e. the linear optimization problem can be restated as the following cone-linear
optimization problem
min cT x
Ax − b ∈ IRm +
x ∈ IRn+ .
The dual problem
max { bT y | AT y ≤ c, y ≥ 0 },
can similarly be reformulated in the conic form:
max bT y
c − AT y ∈ IRn+
y ∈ IRm+.
The natural question arises: how one can derive dual problems for general cone-linear
optimization problems where, in the above given formulation the simple polyhedral
convex cones IRm n m
+ and IR+ are replaced by arbitrary convex cones C1 ⊆ IR and C2 ⊆ IR .
n
71
The Dual of a Cone-linear Problem
First, by introducing slack variables s, we give another equivalent form of the cone-
linear problem (3.17)
min cT x
s − Ax + b = 0
s ∈ C1
x ∈ C2 .
C1 × C2 := { (s, x) | s ∈ C1 , x ∈ C2 }.
The Lagrange function L(s, x, y) of the above problem is defined on the set
{ (s, x, y) | s ∈ C1 , x ∈ C2 , y ∈ IRm }
and is given by
maxm ψ(y)
y∈IR
where
ψ(y) = min{ L(s, x, y) | s ∈ C1 , x ∈ C2 }. (3.19)
As we did in deriving the Wolfe dual, one easily derives optimality conditions to
calculate ψ(y). Since the minimization in (3.19) is done in the variables s ∈ C1 and
x ∈ C2 , and the function L(s, x, y) is separable w.r.t. x and s, we can take these
minimums separately.
If we minimize in s all the terms in (3.18) but sT y are constant. The vector s is in
the cone C1 , hence
T
0
if y ∈ C1∗ ,
min s y = (3.20)
s∈C1 −∞ otherwise.
If we minimize (3.19) in x then all the terms in (3.18) but xT (c − AT y) are constant.
The vector x is in the cone C2 , hence
0 if c − AT y ∈ C2∗ ,
min xT (c − AT y) = (3.21)
x∈C2 −∞ otherwise.
72
By combining (3.20) and (3.21) we have
bT y if y ∈ C1∗ and c − AT y ∈ C2∗ ,
ψ(y) = (3.22)
−∞ otherwise.
Thus the dual of the cone-linear optimization problem (3.17) is the following cone-linear
problem:
max bT y
c − AT y ∈ C2∗ (3.23)
y ∈ C1∗ .
Exercise 3.7 Derive the dual semidefinite optimization problem (DSO) by using the general
cone-dual problem (3.23). /
To illustrate the duality relation between (3.17) and (3.23) we prove the following
weak duality theorem.
Theorem 3.18 (Weak duality) If x ∈ IRn is a feasible solution of the primal problem
(3.17) and y ∈ IRm is a feasible solution of the dual problem (3.23) then
cT x ≥ bT y
xT (c − AT y) = 0 and y T (Ax − b) = 0.
Proof: Using the definition of the dual cone one may write
cT x − bT y = xT (c − AT y) + y T (Ax − b) ≥ 0.
73
74
Chapter 4
min f (x)
(4.1)
s.t. x ∈ C,
where C is a relatively open convex set. For typical unconstrained optimization problems
one has C = IRn , the trivial full dimensional open set, but for other applications (like
in interior point methods) one frequently has lower dimensional relatively open convex
sets. A generic algorithm for minimizing the function f (x) can be presented as follows.
Generic Algorithm
Input:
x0 is a given (relative interior) feasible point;
For k = 0, 1, . . . do
Step 1: Find a search direction sk with δf (xk , sk ) < 0;
(This should be a descending feasible direction in the constrained case.)
Step 1a: If no such direction exists STOP, optimum found.
Step 2: Line search : find λk = arg min f (xk + λsk );
λ
k+1 k k
Step 3: x = x + λk s , k = k + 1;
Step 4: If stopping criteria are satisfied STOP.
The crucial elements of all algorithms, besides the selection of a starting point are
printed boldface in the scheme, given above.
75
To generate a search direction is the crucial element of all minimization algorithms.
Once a search direction is obtained, then one performs the line search procedure. Before
we discuss these aspects in detail we turn to the question of the convergence rate of an
algorithm.
|αk+1 − α|
The larger p∗ is, the faster the convergence. Let β = lim sup p∗
. If p∗ = 1 and
k→∞ |αk − α|
0 < β < 1 we are speaking about linear (or geometric rate of ) convergence. If p∗ = 1
and β = 0 the convergence rate is super-linear, while if β = 1 the convergence rate is
sub-linear. If p∗ = 2 then the convergence is quadratic.
Exercise 4.1 Show that the sequence αk = ak , where 0 < a < 1 converges linearly to zero
while β = a. /
k
Exercise 4.2 Show that the sequence αk = a(2 ) , where 0 < a < 1, converges quadratically
to zero. /
1
Exercise 4.3 Show that the sequence αk = k converges sub-linearly to zero. /
Exercise 4.5 Construct a sequence that converges to zero with the order of four. /
76
Exercise 4.6 Assume that f is continuously differentiable, xk and sk are given, and λk is
obtained via exact line search:
λk = arg min f (xk + λsk ).
λ
Below we present four line search methods, that require different levels of information
about φ(λ):
• The Dichotomous search and Golden section methods, that use only function
evaluations of φ;
• bisection, that evaluates φ0 (λ) (φ has to be continuously differentiable);
• Newton’s method, that evaluates both φ0 (λ) and φ00 (λ).
Lemma 4.2 If φ(ā) < φ(b̄) then the minimum of φ is contained in the interval [a, b̄].
If φ(ā) ≥ φ(b̄) then the minimum of φ is contained in the interval [ā, b].
77
Exercise 4.8 Prove that — when using Dichotomous Search — the interval of uncertainty
is reduced by a factor ( 21 + δ)t/2 after t function evaluations. /
There is a more clever way to choose āk and b̄k , which reduces the number of function
evaluations per iteration from two to one, while still shrinking the interval of uncertainty
by a constant factor. It is based on a geometric concept called the Golden section.
The golden section of a line segment is its division into two unequal segments, such
that the ratio of the longer of the two segments to the whole segment is equal to the
ratio of the shorter segment to the longer segment.
α - 1−α -
•
1 -
With reference to Figure 4.1, we require that the value α is chosen such that the
following ratios are equal:
1−α α
= .
α 1
This is the same as α2 + α − 1 = 0 which has only one root in the interval [0, 1], namely
α ≈ 0.618.
Returning to the line search procedure, we simply choose āk and b̄k as the points that
correspond to the golden section (see Figure 4.2).
- -
(1 − α)(bk − ak ) (1 − α)(bk − ak )
āk b̄
• • •k •
ak bk
bk − ak -
Figure 4.2: Choosing āk and b̄k via the Golden section rule.
The reasoning behind this is as follows. Assume that we know the values φ(āk ) and
φ(b̄k ) during iteration k. Assume that φ(āk ) < φ(b̄k ), so that we set bk+1 = b̄k and
ak+1 = ak . Now, by the definition of the golden section, b̄k+1 is equal to āk (see Figure
4.3).
In other words, we do not have to evaluate φ at āk+1 , because we already know this
value. In iteration k + 1 we therefore only have to evaluate b̄k+1 in this case. The
analysis for the case where φ(āk ) ≥ φ(b̄k ) is perfectly analogous.
78
ak āk+1 b̄k+1 bk+1
Iteration k + 1 • • • •
Iteration k āk b̄
• • •k •
ak bk
Figure 4.3: Illustration of consecutive iterations of the Golden section rule when φ(āk ) <
φ(b̄k ).
Exercise 4.9 Prove that — when using Golden section search — the interval of uncertainty
is reduced by a factor 0.618t−1 after t function evaluations. /
The Golden section search requires fewer function evaluations than the Dichotomous
search method to reduce the length interval of uncertainty to a given > 0; see Exercise
4.10. If one assumes that the time it takes to evaluate φ dominates the work per
iteration, then it is more important to count the total number of function evaluations
than the number of iterations.
Exercise 4.10 Show that the Dichotomous search algorithm terminates after at most
b0 −a0
log
2
2
log 1+2δ
function evaluations, and that the Golden section search terminates after at most
b0 −a0
log
1+
1
log 0.618
4.3.2 Bisection
The Bisection method (also called Bolzano’s method) is used to find a root of φ0 (λ) (here
we assume φ to be continuously differentiable). Recall that such a root corresponds to
a minimum of φ if φ is convex.
The algorithm is similar to the Dichotomous and Golden section search ones, in the
sense that it too uses an interval of uncertainty that is reduced at each iteration. In
the case of the bisection method the interval of uncertainty contains a root of φ0 (λ).
The algorithm proceeds as follows.
Input:
> 0 is the accuracy parameter;
a0 , b0 are given such that φ0 (a0 ) < 0 and φ0 (b0 ) > 0;
79
For k = 0, 1, . . ., do:
Step 1: If |bk − ak | < STOP.
Step 2: Let λ = 21 (ak + bk );
Step 3: If φ0 (λ) < 0 then ak+1 := λ and bk+1 = bk ;
Step 4: If φ0 (λ) > 0 then bk+1 := λ and ak+1 = ak .
|b0 −a0 |
Exercise 4.11 Prove that that the bisection algorithm uses at most log2 function eval-
uations before terminating. /
Nota that the function φ0 (λ) does not have to be differentiable in order to perform
the bisection procedure.
Next we find the root of l(λ) and set λk+1 to be equal to this root. This means that
λk+1 is given by
φ0 (λk )
λk+1 = λk − 00 .
φ (λk )
Now we repeat the process with λk+1 as the current iterate.
There is an equivalent interpretation of this procedure: take the quadratic Taylor
approximation of φ at the current iterate λk , namely
1
q(λ) = φ(λk ) + φ0 (λk )(λ − λk ) + φ00 (λk )(λ − λk )2 ,
2
and set λk+1 to be the minimum of q. The minimum of q is attained at
φ0 (λk )
λk+1 = λk − ,
φ00 (λk )
and λk+1 becomes the new iterate (new approximation to the minimum). Note that the
two interpretations are indeed equivalent.
Newton’s algorithm can be summarized as follows.
Input:
> 0 is the accuracy parameter;
λ0 is the given initial point; k = 0;
80
For k = 0, 1, . . ., do:
φ0 (λk )
Step 1: Let λk+1 = λk − φ00 (λk )
;
Step 2: If |λk+1 − λk | < STOP.
Newton’s method as presented above may not converge to the global minimum of φ.
On the other hand, Newton’s method has some spectacular properties. It converges
quadratically if the following conditions are met:
Example 4.3 Let us apply Newton’s method to φ(λ) = λ − log(1 + λ). Note that the domain of φ is
(−1, ∞). The first and second derivatives of φ are given by
λ 1
φ0 (λ) = , φ00 (λ) = 2,
1+λ (1 + λ)
and it is therefore clear that φ is strictly convex on its domain, and that λ = 0 is the minimizer of φ.
The iterates from Newton’s method satisfy the recursive relation
Exercise 4.12 This exercise refers to Example 4.3. Prove that, if the sequence {λk } satisfies
λk+1 = − (λk )2 ,
In the following example, Newton’s method converges to the minimum, but the rate
of convergence is only linear.
φ(λ) = λm .
Clearly, φ has a unique minimizer, namely λ = 0. Suppose we start Newton’s method at some
nonzero λ0 ∈ IR.
The derivatives of φ are
φ0 (λ) = mλm−1
φ00 (λ) = m(m − 1)λm−2 .
81
Hence, the iterates from Newton’s method satisfy the recursive relation
−1 −1 m−2
λk+1 = λk − (φ00 (λk )) φ0 (λk ) = λk + λk = λk .
m−1 m−1
This shows that Newton’s method is exact if φ is quadratic (if m = 2), whereas for m > 2 the Newton
process converges to 0 with a linear convergence rate (see Exercise 4.13). ∗
Exercise 4.13 This exercise refers to Example 4.4. Prove that, if the sequence {λk } satisfies
m−2
λk+1 = λk ,
m−1
where m > 2 is even, then λk → 0 with a linear rate of convergence, if λ0 6= 0. /
δf (x, −∇f (x)) = −∇f (x)T ∇f (x) = min {∇f (x)T s}.
||s||=||∇f (x)||
Exercise 4.14 Let f : IRn 7→ IR be continuously differentiable and let x̄ ∈ IRn be given.
Assume that the level set {x ∈ IRn | f (x) = f (x̄)} , is in fact a curve (contour). Show that
∇f (x̄) is orthogonal to the tangent line to the curve at x̄. /
To calculate the gradient is relatively cheap which indicates that the gradient method
can be quite efficient. Although it works fine in many applications, several theoretical
and practical disadvantages can be mentioned. First, the minimization of a convex
quadratic function by the gradient method is not a finite process in general. Slow
convergence, due to a sort of “zigg–zagging” sometimes takes place. Secondly, the
order of convergence is no better than linear in general.
Figure 4.4 illustrates the zig-zag behavior that may occur when using the gradient
method.
Exercise 4.15 Calculate the steepest descent direction for the quadratic function
1 T
f (x) = x Qx + q T x − β,
2
where the matrix Q is positive definite. Calculate the exact step length in the line search as
well. /
Here, for the sake of simplicity, it is assumed that C = IRn . In other cases the negative gradient
1
82
2
1.5
0.5
0
x2
x4 3
2 x
−0.5 x
1
x
−1
x0
−1.5
−2
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
x
1
Figure 4.4: Iterates of the gradient method for the function f (x) = 9x21 + 2x1 x2 + x22 .
Exercise 4.16 Prove that subsequent search directions of the gradient method are always
orthogonal (i.e. sk ⊥ sk+1 ; k = 0, 1, 2, . . .) if exact line search is used. /
The following theorem gives a convergence result for the gradient method.
Theorem 4.5 Let f be continuously differentiable. Starting from the initial point x0
using exact line search the gradient method produces a decreasing sequence x0 , x1 , x2 , · · ·
such that f (xk ) > f (xk+1 ) for k = 0, 1, 2, · · ·. Assume that the level set D = {x :
f (x) ≤ f (x0 )} is compact, then any accumulation point x of the generated sequence
x0 , x1 , x2 , · · · , xk , · · · is a stationary point (i.e. ∇f (x) = 0) of f . Further if the function
f is a convex function, then x is a global minimizer of f .
83
On the other hand using the construction of the iteration sequence and the convergent
subsequence we write
f (x) ≤ f (x + λs)
which leads to δf (x, s) = sT ∇f (x) ≥ 0. Combining this result with (4.2) we have
∇f (x) = 0, and the theorem is proved. 2
The classical Newton method does not apply line search, one takes the full Newton
step. If line search is applied then typically we are far from the solution, the step length
is usually less than one. We refer to this as the damped Newton method.
In addition we have to mention that to compute and invert the Hesse matrix is more
expensive than to compute only the gradient. Several methods are developed to reduce
this cost while preserving the advantages of Newton’s method. These are the so-called
quasi-Newton methods of which the most popular are the methods which use conjugate
directions, to be discussed later.
Anyway, the compensation for the extra cost in Newton’s method is a better search
direction. Just note that the minimization of a convex quadratic function happens in
one step.
84
Exercise 4.17 Let f (x) = 12 xT Ax − bT x where A is positive definite and b ∈ IRn . Assume
that we apply Newton’s method to minimize f . Show that x1 = A−1 b, i.e. x1 is the minimum
of f , regardless of the starting point x0 . /
If the Hessian ∇2 f (x) is not positive definite, or is ill-conditioned (the ratio of the
largest and smallest eigenvalue is large) then it is not (or hardly) invertible. In this
case additional techniques are needed to circumvent these difficulties. In the trust region
method, ∇2 f (x) is replaced by (∇2 f (x) + αI) where I is the identity matrix and α is
changed dynamically. Observe that if α = 0 then we have the Hessian, hence we have
the Newton step, while as α → ∞ this matrix approaches a multiple of the identity
matrix and so the search direction is asymptotically getting parallel to the negative
gradient.
The interested reader can consult the following books for more details on trust region
methods [2, 3, 16, 9].
Exercise 4.18 Let x ∈ IRn and f be twice continuously differentiable. Show that s =
−H∇f (x) is a descent direction of f at x for any positive definite matrix H, if ∇f (x) 6= 0.
Which choice of H gives:
• the steepest descent direction?
• Newton’s direction (for convex f )?
/
2.5
x
2
1.5
0.5
0
0 0.5 1 1.5 2 2.5 3
x
1
Figure 4.5: Contours of the function. Note that the minimum is at [2, 1]T .
1. Perform two iterations of the gradient method, starting from x0 = [0, 3]T .
2. Perform four iterations of Newton’s method (without line search), with the same starting
point x0 .
85
Relation with Newton’s method for solving nonlinear equations
The reader may be familiar with Newton’s method to solve nonlinear systems of equa-
tions. Here we show that Newton’s optimization method is obtained by setting the
gradient of f to zero and using Newton’s method for nonlinear equations to solve the
resulting equations.
Assume we have a nonlinear system of equations
F (x) = 0
to solve, where F (x) is a differentiable mapping from IRn → IRm . Given any point
xk ∈ IRn , Newton’s method proceeds as follows. Let us first linearize the nonlinear
equation at xk by approximating F (x) by F (xk ) + JF (xk )(x − xk ) where JF (x) denotes
the Jacobian of F , an m × n matrix defined as
∂Fi (x)
JF (x)ij = where i = 1, · · · , m; j = 1, · · · , n.
∂xj
Now we take a step so that the iterate after the step satisfies the linearized equation
This is a linear system of equations, hence a solution (if it exists) can be found by
standard linear algebra.
Observe, that if we want to minimize a strictly convex function f (x) one can interpret
this problem as solving the nonlinear system of equations ∇f (x) = 0. The solution of
this system by Newton’s method, as we have a point xk , leads to (apply (4.3))
The Jacobian of the gradient is exactly the Hessian of the function f (x) hence it is
positive definite and we have
86
Definition 4.6 The directions (vectors) s1 , · · · , sk ∈ IRn are called conjugate (or A−conjugate)
directions if (si )T Asj = 0 for all 1 ≤ i 6= j ≤ k.
If one uses A-conjugate directions in the generic algorithm to minimize q, then the
minimum is found in at most n iterations. The next theorem establishes this important
fact.
Proof: One has to show (see Theorem 2.9) that ∇q(xk+1 ) ⊥ s1 , · · · , sk . Recall that
xi+1 := xi + λi si i = 0, · · · , k
xk+1 := x1 + λ0 s0 + · · · + λk sk = xi + λi si + · · · + λk sk ,
for any fixed i ≤ k. Due to exact line-search we have ∇q(xi+1 )T si = 0 (see Exercise
4.6). Using ∇q(x) = Ax − b, we get
k
k+1 i i i
λj Asj .
X
∇q(x ) := ∇q(x + λ s ) +
j=i+1
j=i+1
Corollary 4.8 Let xk be defined as in Theorem 4.7. Then xn = A−1 b, i.e. xn is the
minimizer of q(x) = 21 xT Ax − bT x.
Exercise 4.21 Show that the result in Corollary 4.8 follows from Theorem 4.7. /
87
4.6.1 The method of Powell
To formulate algorithms that use conjugate directions, we need tools to construct con-
jugate directions. The next theorem may seem a bit technical, but it gives us such a
tool.
• In the first cycle, the method performs successive exact line searches using the
directions s1 , ..., sn (in that order). In the second cycle the directions s2 , ..., sn , t1
are used (in that order). In the third cycle the directions s3 , ..., sn , t1 , t2 are used,
etc.
• The method terminates after n cycles due to the result in Theorem 4.7.
We will now state the algorithm, but first a word about notation. As mentioned
before, the second cycle uses the directions s2 , ..., sn , t1 . In order to state the algorithm
in a compact way, the search directions used during cycle k are called s(k,1) , ..., s( k, n).
The iterates generated during cycle k via successive line searches will be called
z (k,1) , . . . , z (k,n) , and xk will denote the iterate at the end of cycle k.
88
Powell’s algorithm
For k = 1, 2, . . . , n do:
(Cycle k:)
Let z (k,1) = xk−1 and z (k,i+1) := arg min q z (k,i) + λs(k,i) i = 1, · · · , n.
k (k,n+1) k k (k,n+1) k−1
Let x := argmin q(z + λt ) where t := z −x .
(k+1,i) (k,i+1) (k+1,n) k
Let s =s , i = 1, · · · , n − 1 and s := t .
It may not be clear to the reader why the directions t1 , t2 , . . . are indeed conjugate
directions. As mentioned before, we will invoke Theorem 4.9 to prove this.
Proof: The proof is by induction. Assume that t1 , . . . , tk are conjugate at the end of
cycle k of the algorithm. By the definition of xk in the statement of the algorithm, and
by Theorem 4.7, xk minimizes q over the affine space xk + span{t1 , . . . , tk }.
In cycle k + 1, z (k+1,n+1) is obtained after successive line searches along the directions
{s1 , . . . , sn−k , t1 , . . . , tk }. By Theorem 4.7, z (k+1,n+1) minimizes q over the affine space
z (k+1,n) + span{t1 , . . . , tk }.
Now define tk+1 = z (k,n+1) − xk−1 . By Theorem 4.9, tk+1 is A-conjugate to every
vector in span{t1 , . . . , tk }, and in particular to {t1 , . . . , tk }. 2
89
x0
2
1.5
t1
1
x
0.5
x2
0 2
x optimal
−0.5
−1
t2
−1.5
−2
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
x1
Figure 4.6: Iterates generated by Powell’s algorithm for the function f (x) = 5x21 +
2x1 x2 + x22 + 7, starting from x0 = [1, 2]T .
• Powell’s algorithm uses only line searches, and finds the exact minimum of a
strictly convex quadratic function after at most n(n + 1) line-searches. For a
general (convex) function f , Powell’s method can be combined with the Golden
section line search procedure to obtain an algorithm for minimizing f that does
not require gradient information.
• Storage requirements: The algorithm stores n n-vectors (the current set of search
directions) at any given time.
90
Let us compare Powell’s method to Newton’s method and the gradient method. New-
ton’s method requires only one step to minimize a strictly convex quadratic function,
but requires both gradient and Hessian information for general functions. The gradient
method requires only gradient information, but does not always converge in a finite
number of steps (not even for strictly convex quadratic functions).
In conclusion, Powell method is an attractive algorithm for minimizing ‘black box’
functions where gradient and Hessian information is not available (or too expensive to
compute).
• Note that, unlike Powell’s method, this method requires gradient information.
The advantage over Powell’s method is that we only have to store two n-vectors
and do n + 1 line searches.
• We may again use the method to minimize non-quadratic functions, but then
convergence is not assured.
Let us consider the situation during iteration k, i.e. assume that xk , ∇q(xk ) and
s1 , · · · , sk−1 conjugate directions be given.
We want to find values βk,0 .... βk,k−1 such that
sk := −∇q(xk ) + βk,0s0 + · · · + βk,k−1sk−1 ,
and sk is conjugate with respect to s0 , · · · , sk−1.
We require A-conjugacy, i.e. sTi Ask = 0, which implies:
∇q(xk )T Asi
βk,i = (i = 0, . . . , k − 1).
(si )T Asi
91
We will now show that βk,i = 0 if i < k − 1. To this end, note that
∇q(xi+1 ) − ∇q(xi ) = A(xi+1 − xi ) = λi Asi .
Therefore
∇q(xk )T (∇q(xi+1 ) − ∇q(xi ))
βk,i = (i < k).
(si )T (∇q(xi+1 ) − ∇q(xi ))
For any i < k we have
si = −∇q(xi ) + βi,1 s1 + · · · + βi,i−1 si−1 .
By Theorem 4.7 we have
∇q(xk ) ⊥ si (i = 0, . . . , k − 1).
Therefore
∇q(xi )T ∇q(xk ) = 0 (i < k),
and
∇q(xi )T si = −k∇q(xi )k2 (i < k).
Therefore βk,i = 0 if i < k −1. Also, due to exact line-search, we have (si )T (∇q(xi+1 )) =
0 (see Exercise 4.6). Therefore
k∇q(xk )k2
βk,k−1 = .
k∇q(xk−1 )k2
Fletcher-Reeves algorithm
Let x0 be an initial point.
Exercise 4.22
1. Solve this problem using the conjugate gradient method of Powell. Use exact line search
and the starting point [2, 4, 10]T . Use the standard unit vectors as s1 , s2 and s3 .
2. Solve this problem using the Fletcher-Reeves conjugate gradient method. Use exact line
search and the starting point [2, 4, 10]T .
92
4.7 Quasi-Newton methods
Recall that the Newton direction at iteration k is given by:
h i−1
sk = − ∇2 f (xk ) ∇f (xk ).
h i−1
Quasi-Newton methods use a positive definite approximation Hk to ∇2 f (xk ) . The
approximation Hk is updated at each iteration, say
Hk+1 = Hk + Dk ,
Note that
∇q(xk+1 ) − ∇q(xk ) = A xk+1 − xk .
We introduce the notation
sk = −Hk ∇q(xk )
where Hk is an approximation of [∇2 q(xk )]−1 = A−1 , and subsequently perform the
usual line search:
xk+1 = arg min q(xk + λsk ).
Since
y k := ∇q(xk+1 ) − ∇q(xk ), σ k = xk+1 − xk ,
and σ k = A−1 y k , we require that σ k = Hk+1 y k . This is called the secant condition
(quasi-Newton property).
93
The hereditary property
Since
y k := ∇q(xk+1 ) − ∇q(xk ), σ k = xk+1 − xk ,
and ∇q(x) = Ax − b, it holds that
σ i = A−1 y i (i = 0, . . . , k − 1).
σ i = Hk y i (i = 0, . . . , k − 1).
Discussion
We showed that — if the σ i (i = 0, . . . , n − 1) are linearly independent — we have
−1
Hn = A−1 = [∇2 q(xn )] (the approximation has become exact!) In iteration n, we
therefore use the search direction
But this is simply the Newton direction at xn ! In other words, we find the minimum of
q no later than in iteration n.
Step k: Calculate the search direction sk = −Hk ∇q(xk ) and perform the usual
line search xk+1 = arg min q(xk + λsk ).
94
4.7.1 The DFP update
The Davidon-Fletcher-Powell (DFP) rank-2 update is defined by
T T
σk σk Hk y k y k Hk
Dk = T − .
σk yk y k T Hk y k
We will show that:
σσ T Hyy T H
H+ − ,
σT y y T Hy
is positive definite if the matrix H is P.D. and the vectors y, σ satisfy y T σ > 0.
Hint 2: Set H = LLT and show that
!
σσ T Hyy T H
vT H+ T − T v>0 ∀v ∈ IRn \ {0}.
σ y y Hy
Exercise 4.24 Prove that Hk+1 = Hk + Dk satisfies the secant condition: σ k = Hk+1 y k . /
We now prove that the DFP update satisfies the hereditary property. At the same
time, we will show that the search directions of the DFP method are conjugate.
95
Proof: We will use induction on k. The reader may verify that (4.4) holds for k = 0.
Induction assumption:
σ i = Hk y i (i = 0, . . . , k − 1),
and σ 0 , ..., σ k−1 are mutually conjugate.
We now use
σ k = λk sk = −λk Hk ∇q(xk ),
to get
(σ k )T Aσ i = ∇q(xk )T σ i (i = 0, . . . , k − 1).
∇q(xk )T σ i = 0 (i = 0, . . . , k − 1).
Substituting we get
(σ k )T Aσ i = 0 (i = 0, . . . , k − 1),
i.e. σ 0 , ..., σ k are mutually conjugate. We use this to prove the hereditary property.
Note that
T T
i i σ k σ k y i Hk y k y k Hk y i
Hk+1 y = Hk y + − .
σk T yk y k T Hk y k
We can simplify this, using:
T T
σ k y i = σ k Aσ i = 0 (i = 0, . . . , k − 1).
We get
T
i Hk y k y k Hk y i
i
Hk+1y = Hk y − . (4.5)
y k T Hk y k
By the induction assumption σ i = Hk y i (i = 0, . . . , k − 1), and therefore
T T T
y k Hk y i = y k σ i = σ k Aσ i = 0 (i = 0, . . . , k − 1).
Hk+1 y i = Hk y i = σ i (i = 0, . . . , k − 1).
96
DFP updates: discussion
• We have shown that the DFP updates preserve the required properties: positive
definiteness, the secant condition, and the hereditary property.
• We have also shown that the DFP directions are mutually conjugate for quadratic
functions.
• The DFP method can be applied to non-quadratic functions, but then the con-
vergence of the DFP method is an open problem, even if the function is convex.
• In practice DFP performs quite well, but the method of choice today is the so-
called BFGS update.
where
T
y k Hk y k
τk = 1 + .
σk T yk
T
y k Hk y k
τk = 1 + T
.
σk yk
(a) Show that if ykT σk > 0, and Hk is positive definite, then Hk+1 = Hk + Dk is positive
definite.
(b) Show that the BFGS update satisfies the secant condition: σ k = Hk+1 y k .
97
T
How do we guarantee σ k y k > 0? Note that σ k = λk sk and y k = ∇f (xk+1 ) − ∇f (xk ).
Thus we need to maintain
∇f (xk+1 )T sk > ∇f (xk )T sk .
This can be guaranteed by using a special line-search.
The convergence of the BFGS method for convex functions was proved in 1976 by
Powell. In practice, BFGS outperforms DFP and is currently the Quasi-Newton method
of choice.
1. Perform two iterations using the DFP Quasi-Newton method. Use exact line search and
the starting point [1, 2]T . Plot the iterates.
2. Perform two iterations using the BFGS Quasi-Newton method. Use exact line search
and the starting point [1, 2]T . Plot the iterates.
1.5
0.5
x2
0
−0.5
−1
−1.5
−2
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
x
1
Figure 4.7: Contours of the objective function. Note that the minimum is at [0, 0]T .
98
while the relative duality gap is usually defined as
primal obj. value – dual obj. value
.
1 + |primal obj. value|
In unconstrained optimization it happens often that one uses a primal algorithm and
then there is no such absolute measure to show how close we are to the optimum.
Usually the algorithm is then stopped as there is no sufficient improvement in the
objective, or if the iterates are too close to each other or if the length of the gradient
or the length of the Newton step in an appropriate norm is small. All these criteria
can be scaled (relative to) some characteristic number describing the dimensions of the
problem. We give just two examples here. The relative improvement of the objective
is not sufficient and the algorithm is stopped if at two subsequent iterate xk , xk+1
|f (xk ) − f (xk+1 )|
≤ .
1 + |f (xk )|
In Newton’s method we conclude that we are close to the minimum of the function if
the length of the full Newton step in the norm induced by the Hessian is small, i.e.
||(∇2f (xk ))−1 ∇f (xk )||∇2 f (xk ) = (∇f (xk ))T (∇2 f (xk ))−1 ∇2 f (xk )(∇2 f (xk ))−1 ∇f (xk )
= (∇f (xk ))T (∇2 f (xk ))−1 ∇f (xk )
≤ .
This criteria can also be interpreted as the length of the gradient measured in the norm
induced by the inverse Hessian. This last measure is used in interior point methods to
control the Newton process efficiently.
99
100
Chapter 5
Assumptions:
• f is continuously differentiable;
• each extreme point of the feasible set has at least m positive components (non-
degeneracy assumption).
Exercise 5.1 Prove that under the non-degeneracy assumption, every x ∈ F has at least m
positive components. /
101
For simplicity of notation we assume that we can partition the matrix A as A = [B, N].
We partition x accordingly: xT = [xB , xN ]T . Thus we can rewrite Ax = b as
BxB + NxN = b,
such that
xB = B −1 b − B −1 NxN .
(Recall that B −1 exists by assumption.)
Given x ∈ F , we will choose B as the columns corresponding to the m largest com-
ponents of x.
The basic variables xB can now be eliminated from problem (5.1) to obtain the reduced
problem
min fN (xN )
s.t. B −1 b − B −1 NxN ≥ 0,
xN ≥ 0,
BsB + NsN = 0.
102
Note that
∇f (x)T s = r T sN .
In other words, the reduced gradient r plays the same role in the reduced problem as the
gradient ∇f did in the original problem (LC). In fact, the reduced gradient is exactly
the gradient of the function fN with respect to xN in the reduced problem.
Recall that the gradient method uses the search direction s = −∇f (x). Analogously,
the basic idea for the reduced gradient method is to use the negative reduced gradient
sN = −r as search direction for the variables xN , and then calculating the search
direction for the variables xB from
At iteration k of the algorithm we then perform a line search: find 0 ≤ λ ≤ λmax such
that
xk+1 = xk + λsk ≥ 0,
where λmax is an upper bound on the maximal feasible step length and is given by
−(xk )j
minj:(sk )j <0 (sk )j
if sk 6≥ 0
λmax = (5.5)
∞ if sk ≥ 0
103
Convergence results
Since the reduced gradient method may be viewed as an extension of the gradient
method, it may come as no surprise that analogous converge results hold for the reduced
gradient method as for the gradient method. In this section we state some convergence
results and emphasize the analogy with the results we have already derived for the
gradient method (see Theorem 4.5).
Assume that the reduced gradient method generates iterates {xk }, k = 0, 1, 2, . . .
Theorem 5.2 The search direction sk at xk is always a feasible descent direction unless
sk = 0. If sk = 0, then xk is a KKT point of problem (LC).
Compare this to the gradient method where any accumulation point of {xk } is a sta-
tionary point under some assumptions (see Theorem 4.5).
The proof of Theorem 5.3 is beyond the scope of this course. A detailed proof is given
in [2], Theorem 10.6.3.
1. Initialization
Choose a starting point x0 ≥ 0 such that Ax = b. Let k = 0.
2. Main step
[1.1] Form B from those columns of A that correspond to the m largest com-
ponents of xk . Define N as the remaining columns of A. Define xB as the elements
of xk that correspond to B, and define xN similarly.
[1.2] Compute the reduced gradient r from (5.3).
[1.3] Compute sN from (5.6) and sB from (5.4). Form sk from sB and sN .
[1.4] If sk = 0, STOP (xk is a KKT point).
3. Line search
[2.1] Compute λmax from (5.5).
104
[2.2] Perform the line search
λk := arg min f xk + λsk .
0≤λ≤λmax
Remarks:
• During the algorithm the solution xk is not necessarily a basic solution, hence
positive coordinates in xkN may appear. These variables are usually referred to as
superbasic variables.
• The convex simplex method is obtained as the specialization of the above reduced
gradient scheme if the definition of the search direction sN is modified. We allow
∂f (xk )
only one coordinate j of sN to be nonzero and defined as sj = − N∂xj N > 0. The
rest of the sN coordinates is defined to be zero and sB = −B −1 NsN = −B −1 aj sj ,
where aj is the j-th column of the matrix A.
min x2
s.t. x ≥ 2
x ≥ 0.
We solve this problem by using the Reduced Gradient Method starting from the starting point x0 = 5
with objective value 25. We start with converting the constraint in an equality-constraint:
min x2
s.t. x−y = 2
x, y ≥ 0.
The value of the slack variable y 0 is 3. We therefore choose variable x as the basic variable. This
results in B = 1 and N = −1. We eliminate the basic variable:
fN (xN ) = f (B −1 b − B −1 N xN , xN ) = f (2 + y, y).
105
This gives us the following problem:
min (2 + y)2
s.t. 2 + y ≥ 0
y ≥ 0.
Iteration 1
The search directions are:
δfN (y 0 )
s0N = s0y = − = −(2(2 + y 0 )) = −10,
δy
s0B = s0x = −B −1 N s0N = (−1) · (−1) · (−10) = −10.
x1 = x0 + λs0x = 5 − 10λ
1 0
y =y + λs0y = 3 − 10λ
3
which stay non-negative if λ ≤ λ̄ = 10 .
We now have to solve the one-dimensional problem:
min (5 − 10λ)2 .
This means:
s3N = s3y = 0,
s3B = s3x = 0.
106
We perform one iteration of the Reduced Gradient Method starting from the point x0 = (x01 , x02 , x03 , x04 )T =
(2, 2, 1, 0)T with an objective value 5. At this point x0 we consider x1 and x2 as basic variables. This
results in
2 1 1 4
B= and N = .
1 1 2 1
We eliminate the basic variables to obtain the reduced problem:
fN (xN ) = f (B −1 b − B −1 N xN , xN )
1 −1 7 1 −1 1 4 x3
= f ( − , x3 , x4 )
−1 2 6 −1 2 2 1 x4
= f (1 + x3 − 3x4 , 5 − 3x3 + 2x4 , x3 , x4 ).
1 + x3 − 3x4 ≥ 0
5 − 3x3 + 2x4 ≥ 0
x3 , x4 ≥ 0.
Iteration 1
The search directions are:
0 δfN (x03 )
s 3 − δx
s0N = = 3
δf (x0 )
s04 − Nδx4 4
−(2(1 + x03 − 3x04 ) − 6(5 − 3x03 + 2x04 ) + 2x03 − 2)
=
−(−6(1 + x03 − 3x04 ) + 4(5 − 3x03 + 2x04 ) + 2x04 + 3)
8
= .
1
Because x04 = 0 the search direction s04 has to be non-negative. We see that this is true.
0
s 1 1 −1 1 4 8 5
s0B = = −B −1 N s0N = − = .
0
s2 −1 2 2 1 1 −22
We now have to make a line search to obtain the new variables. These new variables as a function of
the step length λ are:
x11 = x01 + λs01 = 2 + 5λ
x12 = x02 + λs02 = 2 − 22λ
x13 = x03 + λs03 = 1 + 8λ
x14 = x04 + λs04 = λ
2
which stay non-negative if λ ≤ λ̄ = 22 ≈ 0.09.
We proceed by solving
min (2 + 5λ)2 + (2 − 22λ)2 + (1 + 8λ)2 + λ2 − 2(2 + 5λ) − 3λ.
107
This means
65 2
The minimizer λ = 1148 is smaller than λ̄ = 22 , so non-negativity of the variables is preserved. Thus
the new iterate is x1 = (x11 , x12 , x13 , x14 )T = (2.28, 0.75, 1.45, 0.06)T with an objective value of 3.13.
∗
Exercise 5.4 Perform two iterations of the reduced gradient method for the following linearly
constrained convex optimization problem:
Let the initial point be given as x0 = (1, 1, 1, 1) and use x1 as the initial basic variable. /
Let a feasible solution xk ≥ 0 with hj (xk ) = 0 for all j be given. By assumption the
Jacobian matrix of the constraints H(x) = (h1 (x), · · · , hm (x))T at each x ≥ 0 has full
1
The problem (NC) is in general not convex. It is a (CO) problem if and only if the functions hj
are affine.
108
rank and, for simplicity at the point xk will be denoted by
A = JH(xk ).
Let us assume that a basis B, where xkB > 0 is given. Then a similar construction as
in the linear case apply. We generate a reduced gradient search direction by virtually
keeping the linearized constraints valid. This direction by construction will be in the
null space of A. More specifically for the linearized constraints we have
xB = B −1 b − B −1 NxN
hence the basic variables xB can be eliminated from the linearization of the problem
(5.7) to result
min fN (xN )
s.t. B −1 b − B −1 NxN ≥ 0,
xN ≥ 0.
From this point on the generation of the search direction s proceeds in exactly the
same way as in the linearly constrained case. Due to the nonlinearity of the constraints
H(xk+1 ) = H(xk + λs) = 0 will not hold in general. Hence something more has to be
done to restore feasibility.
Special care has to be taken to control the step size. A larger step size might allow
larger improvement of the objective but, on the other hand results in larger infeasibility
of the constraints. A good compromise must be made.
In old versions of the GRG method Newton’s method is applied to the nonlinear
equality system H(x) = 0 from the initial point xk+1 to produce a next feasible iter-
ate. In more recent implementations the reduced gradient direction is combined by a
direction from the orthogonal subspace (the range space of AT ) and then a modified
(nonlinear, discrete) line search is performed. These schemes are quite complicated and
not discussed here in more detail.
109
Example 5.6 [Generalized reduced gradient method 1] We consider the following problem:
We perform two steps of the Generalized Reduced Gradient Method starting from the point x0 =
(x01 , x02 )T = (4, 8)T with an objective value of 96. We will plot the progress of the algorithm in Figure
5.1. At the point x0 we consider x2 as the basic variable. First we have to linearize the nonlinearly
constraint:
4
A = (N, B) = JH(x0 ) = (2x01 − 2) = (−8 − 2). b = Ax0 = (−8 − 2) = 16.
8
Iteration 1
The search direction is:
δfN (x01 )
s0N = s01 = − = −(2x01 + 8(4x1 − 8) + 12 − 16)) = −68
δx1
1
s0B = s01 = −B −1 N s0N = · 8 · −68 = −272.
2
1
which stay non-negative if λ ≤ λ̄ = 34 .
We do this by solving
This means
110
Iteration 2
Because x12 stayed positive we now use x11 as basic variable again. But first we have to linearize the
nonlinearly constraint with the values of iteration 1:
2
A = JH(x1 ) = (2x11 − 2) = (4 − 2). b = Ax1 = (4 − 2) = 4.
2
2
which stay non-negative if λ ≤ λ̄ = 32 .
Now we have to solve
This means:
1 1
As we can see has λ = 10 a larger value than λ̄ = 16 . In order to get non-negative values for the
1
variables we have to use the value 16 as step length. This gives us x2 = (x21 , x22 )T = (1, 0)T . To get
variables for which the constraint holds, we take the xN as fixed variable. This leads to x2 = (1, 12 )T
with an objective value of 11 41 .
Example 5.7 [Generalized reduced gradient method 2] We consider the following problem:
111
f = 96
x2 (x1 )2 − 2x2 = 0
8 x0
f = 11 41 f = 24
4
2 x1
x2
x1
0 1 2 3 4
We solve this problem by using three steps of the Generalized Reduced Gradient Method starting from
the point x0 = (x01 , x02 )T = (2, 2)T with an objective value of 20. At this point x0 we consider x1 as
basic variable. First we have to linearize the nonlinearly constraint:
2
A = (B, N ) = JH(x0 ) = (6x01 4x02 ) = (12 8). b = Ax0 = (12 8) .
2
Now we eliminate the basic variables:
40 8
fN (xN ) = f (B −1 b − B −1 N xN , xN ) = f ( − x2 , x2 ).
12 12
This leads us to the following problem:
40 8
min 2( − x2 )2 + 3x22
12 12
40 8
s.t. − x2 ≥ 0
12 12
x2 ≥ 0.
112
1 20 40
s0B = s01 = −B −1 N s0N = − ·8·− = .
12 3 9
2 3
which are non-negative as long as λ ≤ λ̄ = 20 = 10 .
3
We do this by solving
40 2 20
min 2(2 + λ) + 3(2 − λ)2 .
9 3
This means
160 40 120 20
(2 + λ) − (2 − λ) = 0
9 9 3 3
9 3
λ = (λ < λ̄ = ).
70 10
This results in x1 = (x11 , x12 )T = ( 18 8 T
7 , 7 ) . But due to the nonlinearity of the constraint, the constraint
will not hold with these values. To find a solution for which the constraint will hold, we consider
the xN as a fixed variable. The xB will change in a value for which the constraint holds, this means
xB = 2.41. The objective value is 15.52.
Iteration 2
Because x11 stayed positive we use x1 as basic variable again. First with the values of iteration 1 we
linearize the nonlinearly constraint again:
2.41
A = JH(x1 ) = (6x11 4x12 ) = (14.45 4.57). b = Ax1 = (14.45 4.57) = 40.
1.14
113
1.14
which stay non-negative if λ ≤ λ̄ = 3.78 ≈ 0.30.
Now we have to solve
min 2(2.41 + 1.20λ)2 + 3(1.14 − 3.78λ)2 .
This means:
This gives us x2 = (x21 , x22 )T = (2.6, 0.55)T . To get variables for which the constraint holds, we take
the xN as fixed variable. This leads to x2 = (2.52, 0.55)T with an objective value of 13.81.
Iteration 3
Again we can use x1 as basic variable. We start this iteration with linearization of the constraint:
2.54
A = JH(x2 ) = (6x21 4x22 ) = (15.24 2.2). b = Ax2 = (15.24 2.2) = 39.9.
0.55
Search directions:
δfN (x22
s2N = s22 = − = −(−0.56(2.62 − 0.14x22 ) + 6x22 ) = −1.88
δx2
s2B = s21 = −B −1 N s2N = 0.27.
0.55
which stay non-negative if λ ≤ λ̄ = 1.88 ≈ 0.293.
Now we solve
min 2(2.54 + 0.27λ)2 + 3(0.55 − 1.88λ)2 .
This means:
This gives us the variables x31 = 2.58 and x32 = 0.25. Correcting the xB results in x3 = (2.57, 0.25)T
with objective value 13.39. ∗
114
Exercise 5.5 Perform one iteration of the generalized reduced gradient method to solve the
following nonlinearly constrained convex optimization problem:
115
116
Chapter 6
6.1 Introduction
In this chapter we deal with the so-called logarithmic barrier approach to convex opti-
mization. As before we consider the CO problem in the following format:
(CO) min {f (x) : x ∈ F },
where F denotes the feasible region, which is given by
F := {x ∈ IRn : gj (x) ≤ 0, 1 ≤ j ≤ m};
the constraint functions gj : IRn → IR (1 ≤ j ≤ m) and the objective function f :
IRn → IR are convex functions with continuous third order derivatives in the interior
of F . Later on, when dealing with algorithms for solving (CO) we will need to assume
a smoothness condition on the functions f and gj (1 ≤ j ≤ m). Without loss of
generality we further assume that f (x) is linear, i.e. f (x) = −cT x for some objective
vector c ∈ IRn . If this is not true, one may introduce an additional variable xn+1 , an
additional constraint f (x) − xn+1 ≤ 0, and minimize xn+1 . In this way the objective
function becomes linear. Thus we may assume that (CO) has the form
min −cT x
(CP O) gj (x) ≤ 0, j = 1, · · · , m
x ∈ IRn .
117
Here we used that ∇ −cT x = −c.
The interior of the primal feasible region F is denoted as
and we say that (CP O) satisfies the interior point condition (IPC) if F 0 is nonempty.
In other words, (CP O) satisfies the IPC if and only if there exists an x that is strictly
primal feasible (i.e. gj (x) < 0, ∀j = 1, · · · , m). Similarly, we say that (CDO) satisfies
the IPC if there exists a strictly dual feasible solution (i.e. a dual feasible pair (x, y)
with y > 0). We will show that if (CP O) and (CDO) both satisfy the IPC then
these problems can be solved in polynomial time provided that the above mentioned
smoothness condition is fulfilled. We will also present examples of large classes of
well-structured CO problems which satisfy the smoothness condition.
Let us emphasize the trivial fact that if (CP O) satisfies the IPC then (CP O) is Slater
regular. From now the IPC for (CP O) (and hence Slater regularity) and (CDO) will
be assumed.
Theorem 6.1 The vector x is optimal for (CP O) if and only if there exists a vector
y ≥ 0 (y ∈ IRm ) such that the pair (x, y) is a saddle point of the Lagrange function
m
L(x, y) := −cT x +
X
yj gj (x).
j=1
(i) gj (x) ≤ 0, ∀j = 1, · · · , m,
m
X
(ii) yj ∇gj (x) = c, y ≥ 0, (6.1)
j=1
(iii) yj gj (x) = 0, ∀j = 1, · · · , m.
Note that (i) ensures primal feasibility and (ii) dual feasibility. The third condition
in the KKT-system is called the complementarity condition. The complementarity
condition ensures that the duality gap at optimality is zero. This follows since the
difference of the primal and the dual objective value, which the duality gap, is given by
m
X
− yj gj (x).
j=1
118
We relax the complementarity condition by considering the system
(i) gj (x) ≤ 0, ∀j = 1, · · · , m,
m
X
(ii) yj ∇gj (x) = c, y ≥ 0, (6.2)
j=1
for µ > 0. Clearly, if the relaxed KKT-system has a solution (for some µ > 0) then
we have x and y such that x is strictly primal feasible (i.e. gj (x) < 0, ∀j = 1, · · · , m)
and the pair (x, y) is strictly dual feasible (i.e. dual feasible with y > 0), whereas
the duality gap equals mµ. In other words, if the relaxed KKT-system has a solution
(for some µ > 0) then both (CP O) and (CDO) satisfy the IPC. If we impose some
extra condition then, similarly to the case of linear optimization (LO), we also have the
converse result: if the IPC holds then the relaxed KKT-system has a solution (for every
µ > 0). This will be presented in the following theorem, but first we have to introduce
some definitions.
Definition 6.2 Let x, s ∈ IRn . The ray R := {x|x = x + λs, λ ∈ IR} ⊂ IRn is called
bad if every constraint function gj , j = 1, · · · , m is constant along the ray R.
Let x, s ∈ IRn and α1 , α2 ∈ IR. The line segment {x|x = x + λs, λ ∈ [α1 , α2 ]} ⊂ IRn
is called bad if every constraint function gj , j = 1, · · · , m is constant along the ray.
Theorem 6.3 Let us assume that for (CP O) no bad ray exists. Then the following
three statements are equivalent.
(i) (CP O) and (CDO) satisfy the interior point condition;
(ii) For each µ > 0 the relaxed KKT-system (6.2) has a solution;
(iii) For each w > 0 (w ∈ IRm ) there exist y and x such that
(i) gj (x) ≤ 0, ∀j = 1, · · · , m,
m
X
(ii) yj ∇gj (x) = c, y ≥ 0, (6.3)
j=1
The proof of this important theorem can be found in Appendix A.2. From now on
we assume that the IPC holds.
Lemma 6.4 Let us assume that for (CP O) no bad line segment exists. Then the
solutions of the systems (6.2) and (6.3), if they exist, are unique.
119
{x(µ) : µ > 0}
and Pm
−cT x + j=1 yj gj (x) m
φdB (x, y, µ)
X
:= + log yj + n(1 − log µ).
µ j=1
Lemma 6.5 We have φB (x, µ) ≥ φdB (x, y, µ) for all primal feasible x and dual feasible
(x, y). Moreover, φB (x(µ), µ) = φdB (x(µ), y(µ), µ) and, as a consequence, x(µ) is a
minimizer of φB (x, µ) and (x(µ), y(µ)) a maximizer of φdB (x, y, µ).
Proof: The proof uses that if h : D → IR is differentiable and convex then we have
We refer for this property to Lemma 1.49 in Section 1.3. Since −cT x and gj (x), j =
1, · · · , m, are convex on F it follows that for any fixed y ≥ 0 the Lagrange function
m
L(x, y) = −cT x +
X
yj gj (x)
j=1
the last equality follows since (x, y) is dual feasible. Thus we may write
120
m m
−1 X X
≥ yj gj (x) − log(−yj gj (x)) − n(1 − log µ)
µ j=1 j=1
m
!
X −yj gj (x) −yj gj (x)
= − 1 − log
j=1 µ µ
m
!
X−yj gj (x)
= ψ −1 ,
j=1 µ
and
−yj gj (x) = µ, ∀j = 1, · · · , m.
This implies that equality holds if x = x = x(µ) and y = y(µ). 2
Thus the primal central path consists of minimizers of the primal logarithmic bar-
rier function and the dual central path of maximizers of the dual logarithmic barrier
function.
ically increases.
∗
Proof: Suppose 0 < µ < µ. Then x(µ) minimizes φB (x, µ), and x(µ) minimizes φB (x, µ). Thus
we have
φB (x(µ), µ) ≤ φB (x(µ), µ)
φB (x(µ), µ) ≤ φB (x(µ), µ).
These inequalities can be rewritten as
m m
cT x(µ) X cT x(µ) X
− − log(−gj (x(µ))) ≤ − − log(−gj (x(µ))),
µ j=1
µ j=1
m m
cT x(µ) X cT x(µ) X
− − log(−gj (x(µ))) ≤ − − log(−gj (x(µ))).
µ j=1
µ j=1
121
Since 0 < µ < µ this implies cT x(µ) ≤ cT x(µ). Hence −cT x(µ) ≤ −cT x(µ), proving the first part of
the lemma.
The second part follows similarly.
Pm We have that (x(µ), y(µ)) maximizes φdB (x, y, µ). Observe that
T
the dual objective −c x + j=1 yj gj (x) is just the Lagrange function L(x, y) of (CP O). As before,
let 0 < µ < µ. Now (x(µ), y(µ)) maximizes φdB (x, y, µ), and (x(µ), y(µ)) maximizes φdB (x, y, µ), hence
Here we omitted the term n(1 − log µ) at both sides of the first inequality and the term n(1 − log µ)
at both sides of the second inequality, since these terms cancel. Adding the two inequalities gives
1 1
− (L(y(µ), x(µ)) − L(y(µ), x(µ))) ≥ 0.
µ µ
122
We will argue that it is much more appropriate to measure the ‘length’ of the Newton
step with respect to the norm induced by the Hessian matrix of the barrier function.
Using this norm, we show that the Newton process is quadratically convergent if x is
‘close’ to x(µ); if x is ‘far’ from x(µ) then damped Newton steps can be used to reach
the region where the Newton process is quadratically convergent.
In this way we obtain a computationally efficient method to find a good approximation
for x(µ). Having such a method it becomes easy to design an efficient algorithm for
solving (CP O).
From the last expression we see that ∇2 φB (x, µ) is positive semidefinite, because the
matrices ∇2 gj (x) and ∇gj (x)∇gj (x)T are positive semidefinite and gj (x) < 0. In fact,
denoting
we can even show that H(x, µ) is positive definite, provided that the logarithmic bar-
rier function satisfies some smoothness condition that will be defined later on, in Sec-
tion 6.3.4. For the moment we make the following assumption.
Under this assumption φB (x, µ) is strictly convex, by Lemma 1.50. Hence the mini-
mizer x(µ) is unique indeed. Now the second order Taylor polynomial of φB (x, µ) at x
is given by
1
t2 (∆x) = φB (x, µ) + ∆xT g(x, µ) + ∆xT H(x, µ)∆x.
2
123
Since H(x, µ) is positive definite, t2 (∆x) is strictly convex and has a unique minimizer.
At the minimizer of t2 (∆x) the gradient of t2 (∆x) is zero, and thus we find the minimizer
from
g(x, µ) + H(x, µ)∆x = 0.
Therefore, the Newton step at x is given by (cf. Section 4.4)
kx − x(µ)k ,
but this measure has the obvious disadvantage that we cannot calculate it because we
do not know x(µ). A good alternative is the Euclidean norm of the Newton step itself:
k∆xk .
Prove this. /
Remark: The choice of δ(x, µ) as a proximity measure will be justified by the results
below. At this stage it may be worth mentioning another argument for its use. Consider the
function Φ defined by
Φ(z) := φ(Az + a),
124
where φ : IRm → IR can be any two times differentiable function, A is any m × m nonsingular
matrix, a a vector in IRm , and where z runs through all vectors such that Az + a is strictly
primal feasible. The Newton step with respect to φ(x) at x is given by the expression
Exercise 6.2 Prove that the Newton step and δ(x, µ) are affine invariant. /
Let us first recall a simple example used earlier to show that slow convergence is possible
when using Newton’s method.
Let f : IR → IR be defined by f (x) = x2k , where k ≥ 1. Clearly, f has a unique
minimizer, namely x = 0. We saw in Example 4.4, that if we apply Newton’s method
with full Newton steps to this function, then the rate of convergence of the iterates to
the minimum is only linear, unless k = 1 (f is quadratic).
The example suggests that we cannot expect quadratic convergence behavior of New-
ton’s method unless it is applied to a function that is ‘almost’ quadratic. The smooth-
ness condition on the primal barrier function that we are going to discuss can be un-
derstood by keeping this in mind: essentially the condition defines what we mean by
saying that a function is ‘almost’ quadratic.
where α runs through all real values such that x + αh ∈ F 0 . Note that ϕ is strictly
convex because φB (x, µ) is strictly convex. Denoting for the moment φB (x, µ) shortly
as φ(x), we have
n
0
X ∂φ(x)
ϕ (0) = hi
i=1 ∂xi
125
n X
n
∂φ2 (x)
ϕ00 (0) =
X
hi hj
i=1 j=1 ∂xi ∂xj
n X
n X
n
∂φ3 (x)
ϕ000 (0) =
X
hi hj hk .
i=1 j=1 k=1 ∂xi ∂xj ∂xk
Note that the right hand sides in the expressions, given above, are homogeneous forms
in the vector h, of order 1, 2 and 3 respectively. It will be convenient to use short hand
notations for these forms, namely ∇φ(x)[h], ∇2 φ(x)[h, h] and ∇3 φ(x)[h, h, h] respec-
tively. We then may write
The last expression uses that ∇3 φ(x)[h] is a square matrix of size n × n. Moreover, as
before, H = ∇2 φ(x).
Recall that the third order Taylor expansion of ϕ at 0 is given by
1 1
ϕ(0) + ϕ0 (0)α + ϕ00 (0)α2 + ϕ000 (0)α3 .
2 6
Thus it will be clear that the following definition, which defines the so-called self-
concordance property of φ, bounds the third order term in the Taylor expansion of ϕ
in terms of the second order term. Although our main aim is to apply this definition
to the logarithmic barrier function φB above, the definition is more general; it applies
to any three times differentiable convex function with open domain. In fact, after the
definition we will demonstrate it on many other simple examples.
Let φ be any three times differentiable convex function with open domain. We will
say that φ is self-concordant, without specifying κ, if φ is κ-self-concordant for some
κ ≥ 0. Obviously, this will be the case if and only if the quotient
2
(∇3 φ(x)[h, h, h])
(6.7)
(∇2 φ(x)[h, h])3
is bounded above by 4κ2 when x runs through the domain of φ and h through all vectors
in IRn . Note that the condition for κ-self-concordancy is homogeneous in h: if it holds
for some h then it holds for any λh, with λ ∈ IR.
126
Exercise 6.3 In the special case where n = 1 the κ-self-concordancy condition reduces to
000 3
φ (x) ≤ 2κ φ00 (x) 2 .
Prove this. /
The κ-self-concordancy condition bounds the third order term in terms of the second
order term in the Taylor expansion. Hence, if it is satisfied, it makes that the second
order Taylor expansion locally provides a good quadratic approximation of φ(x). The
latter property makes that Newton’s method behaves well on self-concordant functions.
This will be shown later on.
Recall that the definition of the κ-self-concordance property applies to every three
times differentiable convex function with an open domain. Keeping this in mind we can
already give some simple examples of self-concordant functions.
Example 6.9 [Linear function] Let φ(x) = γ + aT x, with γ ∈ IR and a ∈ IRm . Then
1
φ(x) = γ + aT x + xT Ax,
2
We may conclude from the above examples that linear and convex quadratic functions
are 0-self-concordant.
Example 6.11 Consider the convex function φ(x) = x4 , with x ∈ IR. Then
Now we have
2 2
(φ000 (x)) (24x) 1
= = .
(φ00 (x))3 (12x2 )3 3x4
Clearly the right hand side expression is not bounded if x → 0, hence φ(x) is not self-concordant. ∗
Exercise 6.5 Let k be an integer and k > 1. Prove that φ(x) = xk , where x ∈ IR, is
κ-self-concordant for some κ only if k ≤ 2. /
127
Example 6.12 Now consider the convex function
Then
1 1 2
φ0 (x) = 4x3 − , φ00 (x) = 12x2 + , φ000 (x) = 24x − .
x x2 x3
Therefore,
2 2
2
2 2
24x4 − 2 24x4 + 2
(φ000 (x)) 24x − x3 4
3 = 3 = 3 ≤ 3 = ≤ 4.
(φ00 (x)) 12x2 + x12 (12x4 + 1) (12x4 + 1) 12x4 +1
φ(x) = − log x,
n
X
Example 6.14 [The function − log xi ] We now consider
i=1
n
X
φ(x) := − log xi ,
i=1
and n
e X h2i
∇2 φ(x)[h, h] = hT diag h = .
x2 x2
i=1 i
128
Example 6.15 [The function ψ] Let
ψ(x) = x − log(1 + x),
with −1 < x ∈ IR. Then
x 1 −2
ψ 0 (x) = , ψ 00 (x) = 2, ψ 000 (x) = 3,
1+x (1 + x) (1 + x)
and it easily follows that ψ is 1-self-concordant. ∗
Example 6.16 [The function Ψ] With ψ as defined in the previous example we now consider
n
X
Ψ(x) := ψ(xi ),
i=1
and ! n
2 T e X h2i
∇ φ(x)[h, h] = h diag 2 h= 2.
(e + x) i=1 (1 + xi )
Using (6.8) with ξi := hi /(1 + xi ) we obtain
3 3
∇ φ(x)[h, h, h] ≤ 2 ∇2 φ(x)[h, h] 2
Later on (in Section 7.3) we will see that also the multidimensional version of the
entropy function has a 1-self-concordant barrier function.
129
6.3.5 Properties of Newton’s method
From now on we assume that φ(x) := φB (x, µ), for some fixed µ > 0, and that φ is
κ-self-concordant on its domain F 0 , with κ > 0. We are ready to state the result that
if δ(x, µ) is small enough then the Newton process is quadratically convergent.
1
Lemma 6.18 If x is strictly primal feasible and µ > 0 such that δ := δ(x, µ) < κ
then
x + ∆x (where ∆x denotes the Newton step at x) is strictly feasible and
κδ 2
δ(x + ∆x, µ) ≤ .
(1 − κδ)2
Proof: We omit the proof here and refer to Lemma 6.39 in Section 6.4.4 and the
remark following Lemma 6.39. 2
1
Corollary 6.19 If δ := δ(x, µ) ≤ 3κ
then δ(x + ∆x, µ) ≤ 94 κδ 2 .
Lemma 6.20 Let x be strictly primal feasible and δ := δ(x, µ) for some µ > 0. If
µ+ = (1 − θ)µ then √
+ δ+θ m
δ(x, µ ) ≤ .
1−θ
∗
Proof: 1
We have, by definition,
q
δ(x, µ) := k∆xkH := ∆xT H(x, µ)∆x.
130
Therefore, δ(x, µ+ ) is given by
q
δ(x, µ+ ) = g(x, µ+ )T H(x, µ)−1 g(x, µ+ ) =
g(x, µ+ )
H −1 ,
(6.9)
131
1√
Theorem 6.21 Let x+ := x + ∆x and µ+ = (1 − θ)µ, where θ = 30κ m
. Then
1 1
δ(x, µ) ≤ ⇒ δ(x+ , µ+ ) ≤ .
3κ 3κ
9
δ(x+ , µ+ ) ≤ κ δ(x, µ+ )2
4
2
9 1 1 1
≤ κ +
4 1 − 30κ1√m 3κ 30κ
1
≤ .
3κ
This proves the theorem. 2
Input:
A proximity parameter τ , 0 ≤ τ < 1;
an accuracy parameter > 0;
x0 ∈ F 0 and µ0 > 0 such that δ(x0 , µ0 ) ≤ τ ;
a fixed barrier update parameter θ, 0 < θ < 1.
begin
x := x0 ; µ := µ0 ;
while mµ ≥ do
begin
µ := (1 − θ)µ;
x := x + ∆x (∆x is the Newton step at x)
end
end
132
1 1√
Theorem 6.22 If τ = 3κ and θ = 30κ m
, then the Logarithmic Barrier Algorithm with
full Newton steps requires at most
√ mµ0
& '
30κ m log
1
Proof: By Theorem 6.21 the property δ(x, µ) ≤ 3κ is maintained in the course of
the algorithm. Thus each (full) Newton step will yield a strictly feasible point, by
Lemma 6.18. At each iteration the barrier parameter is reduced by the factor 1 − θ.
Hence, after k iterations we will have
mµ = (1 − θ)k mµ0 .
mµ0
& '
1
log (6.10)
θ
iterations the algorithm will have stopped. Substitution of the value of θ in the theorem
yields the desired bound. 2
min {x : x ≥ 0} .
We solve this problem by using the Logarithmic Barrier Algorithm with Full Newton Steps. First we
transfer the function into the standard form:
min {x : −x ≤ 0} .
1 1 1 1
τ= = , θ= √ = .
3κ 3 30κ m 30
We choose = 0.5, µ0 = 0.8 and x0 = 1. Then the Logarithmic Barrier Algorithm with full Newton
Steps requires at most
µ0
30 log = 15
iterations to reach an -optimal solution x. We need to check if
q
δ(x0 , µ0 ) = 4xT H(x0 , µ0 )4x ≤ τ.
133
For this we perform the following calculations:
1 1
g(x, µ) = ∇φB (x, µ) = −
µ x
1
H(x, µ) = ∇2 φB (x, µ) = 2
x
H(x, µ)−1 = x2 .
This implies
1 1
4x = −H(x0 , µ0 )−1 g(x0 , µ0 ) = −1 · =− ,
4 4
and, as a consequence:
1 1
δ(x0 , µ0 ) = |∆x| = ≤ .
4 3
This means we can start the iterations.
Iteration 1
Because
mµ0 = 0.8 ≥ ,
µ1 = (1 − θ)µ0 = 0.773333
g(x0 , µ1 ) = 0.293103
H(x0 , µ1 ) = 1
H(x0 , µ1 )−1 = 1
x1 = x0 + 4x = 1 − 1 · 0.293103 = 0.706896.
Iteration 2
First we check
mµ1 = 0.773333 ≥ .
Therefore
µ2 = (1 − θ)µ1 = 0.747556
g(x1 , µ2 ) = 0.07694
H(x1 , µ2 ) = 2.00119
H(x1 , µ2 )−1 = 0.499703
x2 = x1 + 4x = 0.706896 + 0.038448 = 0.745344.
Iteration: 3 4 5 6
µ 0.722637 0.698549 0.675264 0.652755
g(x, µ) 0.042158 0.04635 0.047759 0.049419
H(x, µ) 1.800057 1.918747 2.053899 2.19795
H(x, µ)−1 0.555538 0.521174 0.486879 0.454969
4x -0.02342 -0.02416 -0.02325 -0.02248
x 0.721924 0.697767 0.674514 0.65203
134
Iteration: 7 8 9 10
µ 0.630997 0.609964 0.589631 0.569977
g(x, µ) 0.051122 0.052885 0.054709 0.056595
H(x, µ) 2.352149 2.517163 2.69375 2.882732
H(x, µ)−1 0.425143 0.397273 0.371229 0.346893
4x -0.02173 -0.02101 -0.02031 -0.01963
x 0.630296 0.609286 0.588976 0.569344
Iteration: 11 12 13 14
µ 0.550978 0.532612 0.514858 0.497696
g(x, µ) 0.058547 0.060566 0.062654 0.064815
H(x, µ) 3.084969 3.301394 3.533002 3.780858
H(x, µ)−1 0.324152 0.302902 0.283045 0.26449
4x -0.018989 0.01835 -0.01773 -0.01714
x 0.550366 0.53202 0.514286 0.497143
We can see that after the fourteenth iteration mµ became less than , hence x14 = 0.497143 is -optimal.
∗
135
This implies
1 1 1
4x = −H(x0 , µ0 )−1 g(x0 , µ0 ) = − · = − ,
5 3 15
whence r
0 0 1 1
δ(x , µ ) = ≈ 0.15 ≤ .
45 3
This means we can start the iterations.
Iteration 1
Because
mµ0 = 3 ≥
we start with computing the new µ and the new x:
µ1 = (1 − θ)µ0 = 2.9
g(x0 , µ1 ) = 0.37931
H(x0 , µ1 ) = 5.137931
0 1 −1
H(x , µ ) = 0.194631
1
x = x0 + 4x = 1 − 0.07383 = 0.926174
f (x1 ) = 0.735819.
Iteration 2
First we check
mµ1 = 2.9 ≥ .
Therefore
µ2 = (1 − θ)µ1 = 2.803333
g(x1 , µ2 ) = 0.0539
H(x1 , µ2 ) = 4.837685
H(x1 , µ2 )−1 = 0.20671
x2 = x1 + 4x = 0.926174 − 0.01114
f (x2 ) = 0.701046.
Iteration 33
Still, we have
mµ32 = 1.013858 ≥ ,
136
so we need one more iteration:
µ33 = (1 − θ)µ32 = 0.980063
g(x32 , µ33 ) = 0.048812
H(x32 , µ33 ) = 8.150924
H(x32 , µ33 )−1 = 0.122685
x33 = x32 + 4x = 0.0957 − 0.00599 = 0.703582
f (x33 ) = 0.245052.
Now, we have
mµ33 = 0.980063 < .
Hence, x33 is -optimal. ∗
1
Note that as long as δ(x, µ) ≥ 3κ , and x is outside the region around x(µ) where the
Newton process is quadratically convergent (cf. Corollary 6.19), we have
1 1 1 0.0457 1
2
ψ (κδ) ≥ 2 ψ = 2
> .
κ κ 3 κ 22κ2
137
This shows that the barrier function decreases with at least some fixed amount, de-
pending on κ but not on the present iterate.
We can now state our second algorithm.
Input:
A proximity parameter τ , 0 ≤ τ < 1;
an accuracy parameter > 0;
x0 ∈ F 0 and µ0 > 0 such that δ(x0 , µ0 ) ≤ τ ;
a damping factor (or step size) α, 0 ≤ α < 1 ;
a fixed barrier update parameter θ, 0 < θ < 1.
begin
x := x0 ; µ := µ0 ;
while mµ ≥ do
begin
µ := (1 − θ)µ;
while δ(x, µ) ≥ τ do
begin
x := x + α∆x;
(The damping factor α must be such that φB (x, µ)
decreases sufficiently. This can be reached by tak-
1
ing the default value is 1+κδ(x,µ) . Larger reductions
can be realized by performing a line search.)
end
end
end
We refer to the first while-loop in the algorithm as the outer loop and to the second
while-loop as the inner loop. Each execution of the outer loop is called an outer
iteration and each execution of the inner loop an inner iteration. The required number
of outer iterations depends only on the dimension n of the problem, on µ0 , on , and
on the (fixed) barrier update parameter θ. This number immediately can be bounded
above by the number
mµ0
& '
1
log
θ
given in (6.10), by using the same argument. The main task in the analysis of the
algorithm is to derive an upper bound for the number of iterations in the inner loop.
In this respect the following result is important.
138
Lemma 6.26 Each inner loop requires at most
5 √
& '
22θ 22
2
θκ m + κ m +
(1 − θ)2 2 3
inner iterations.
∗
Proof: The proof needs some other lemmas that estimate barrier function values and objective
values in the region of quadratic convergence around the µ-center. We refer to Theorem 2.10 (page
61) and its proof in Den Hertog [13]. For a similar result and its proof we refer to Section 6.4.6. 2
Combining the bounds in Lemma 6.26 and (6.10) we obtain our main result. Omitting
the integer brackets we have:
damped Newton steps the Logarithmic Barrier Algorithm with Damped Newton Steps
yields a strictly primal feasible solution x which is -optimal.
Proof: Obvious. 2
If we take θ = √ν for some fixed constant ν then the bound of Theorem 6.27 becomes
m
√ mµ0
!
O κ2 m log .
mµ0
!
2
O κ m log .
min {x : −x ≤ 0} .
We take the same τ = 31 , but we take a larger value for the parameter θ. In this example we will use
the value θ = 0.25 We start again from point x0 = 1 with µ0 = 0.8. For we take the value 0.5. Then
the Logarithmic Barrier Algorithm with Damped Newton Steps requires at most
µ0
1
log =2
θ
139
iterations of the outer loop to reach an -optimal x. Each inner loop requires at most
22θ 5 22
θ+ + = 14
(1 − θ)2 2 3
iterations.
From Example 6.23 we know that
δ(x0 , µ0 ) ≤ τ,
so we can start with the first iteration.
Iteration 1
Because
mµ0 ≥
we compute the new µ:
µ1 = (1 − θ)µ0 = 0.6.
First we have to know if δ(x0 , µ1 ) ≥ τ . This is the case, because
p
δ(x0 , µ1 ) = 4xH(x0 , µ1 )4x = 0.666667.
g(x0 , µ1 ) = 0.666667
H(x0 , µ1 ) = 1
H(x0 , µ1 )−1 = 1
4x = −0.66667
1
α = = 0.6
1 + δ(x0 , µ1 )
x1 = x0 + α4x = 0.6.
µ2 = (1 − θ)µ1 = 0.45.
g(x1 , µ2 ) = 0.555556
H(x1 , µ2 ) = 2.777778
H(x1 , µ2 )−1 = 0.36
4x = −0.2
α = = 0.75
x2 = 0.45.
We now reached an -optimal x. We have used 2 iterations (”outer loops”) and this is exactly the
number of outer loops we could expect. Note however that the number of inner iterations is only 2,
which is far less than expected from the theory. ∗
140
Example 6.29 [Damped Newton Steps 2]
We consider the same problem as in Example 6.24:
min x4 : −x ≤ 0 .
x4
φB (x, µ) = − log x.
µ
As in the previous example we take τ = 31 and θ = 0.25. We start again from the point x0 = 1 with
µ0 = 3. For we take the value 1. Then the algorithm requires at most
µ0
1
log =5
θ
iterations of the outer loop to reach an -optimal x. Each inner loop requires at most
22θ 5 22
θ+ + = 14
(1 − θ)2 2 3
iterations.
From Example 6.24 we know that
δ(x0 , µ0 ) ≤ τ,
so we can start with the first iteration.
Iteration 1
Because
mµ0 ≥
we compute a new µ:
µ1 = (1 − θ)µ0 = 2.25.
First we have to compute δ(x0 , µ1 ).
p
δ(x0 , µ1 ) = 4xH(x0 , µ1 )4x = 0.309058 < τ.
µ2 = (1 − θ)µ1 = 1.6875.
Now, we have p
δ(x0 , µ2 ) = 4xH(x0 , µ2 )4x = 0.481169 ≥ τ.
We now can compute the new x.
g(x0 , µ2 ) = 1.37037
0 2
H(x , µ ) = 8.11111
H(x0 , µ2 )−1 = 0.123288
4x = −0.16895
1
α = = 0.675142
1 + δ(x0 , µ2 )
x1 = x0 + α4x = 0.885935
1
f (x ) = 0.616038.
141
Because δ(x1 , µ2 ) ≤ τ we update µ and then perform outer iteration 2.
Iteration 2
We can see that mµ2 ≥ so we can start this iteration. First we compute the new µ.
µ3 = (1 − θ)µ2 = 1.265625.
g(x1 , µ3 ) = 1.068908
1 3
H(x , µ ) = 8.71591
H(x1 , µ3 )−1 = 0.114733
4x = −0.12264
α = = 0.734181
x2 = 0.795896
f (x2 ) = 0.401259.
Because p
δ(x2 , µ3 ) = 4xH(x2 , µ3 )4x = 0.122348 < τ,
we start the third outer iteration.
Iteration 3
Since mµ3 ≥ , we compute the new µ.
µ4 = (1 − θ)µ3 = 0.949219.
The distance p
δ(x2 , µ4 ) = 4xH(x2 , µ4 )4x = 0.280366
is still smaller than τ , so we can decrease the value of µ again.
µ5 = (1 − θ)µ4 = 0.949219.
This is, in fact, already the fourth outer iteration. Now, we can see that
p
δ(x2 , µ5 ) = 4xH(x2 , µ5 )4x = 0.450248 ≥ τ.
g(x2 , µ5 ) = 1.576259
2 5
H(x , µ ) = 12.25607
H(x2 , µ5 )−1 = 0.081592
4x = −0.12861
α = = 0.689537
x3 = 0.707214
f (x3 ) = 0.250152.
Because p
δ(x3 , µ5 ) = 4xH(x3 , µ5 )4x = 0.177549 < τ,
we can end this outer iteration. Because mµ5 < , we reached an -optimal x in four outer iterations
and by using only three inner iterations. ∗
142
∗
6.4 More on self-concordancy
6.4.1 Introduction
In this section we derive some properties of self-concordant functions. One of our aims
is to provide proofs of some results that were stated before without proofs (especially
Lemma 6.18 and Lemma 6.25) in a somewhat more general setting. We also present an
efficient algorithm to find a minimizer of a κ-self-concordant function, if it exists.
We will deal with a function φ : D → IR, where the domain D is an open and convex
subset of IRn , and we will assume that φ is closed convex and κ-self-concordant. We
did not deal with the notion of a closed convex function so far, therefore we start with
a definition.
Lemma 6.31 For any point x on the boundary of the domain D of φ and for any
sequence {xk }∞
k=0 in the domain that converges to x we have φ(xk ) → ∞.
Proof: Consider the sequence {φ (xk )}∞ k=0 . Assume that it is bounded above. Then
it has a limit point φ. Of course, we can think that this is the unique limit point of the
sequence. Therefore,
zk := (xk , φ (xk )) → x, φ .
Note that zk belongs to the epigraph of φ. Since φ is a closed function, then also x, φ
belongs to the epigraph. But this is a contradiction since x does not belong to the
domain of φ. 2
We conclude that, since the function φ considered in this section is closed convex,
it has the property that φ(x) approaches infinity when x approaches the boundary of
the domain D. This is also expressed by saying that φ is a barrier function on D. In
fact, the following exercise makes clear that the barrier property is equivalent to the
closedness property.
Exercise 6.7 Let the function φ : D → IR have the property that it becomes unbounded (+∞)
when approaching the boundary of its open domain D. Then φ is closed. /
holds for any x ∈ D and for any h ∈ IRn , where κ is fixed and κ ≥ 0.
We will denote
g(x) := ∇φ(x), ∀x ∈ D
143
and
H(x) := ∇2 φ(x), ∀x ∈ D.
For any v ∈ IRn , the local Hessian norm of v at x ∈ D is in this section denoted as
kvkx . Thus q
kvkx := v T H(x)v.
Using this notation, the inequality (6.11) can be written as
≤ 2κ khk3x .
3
∇ φ(x)[h, h, h]
Let us now first point out an equivalent formulation of the self-concordance property.
Lemma 6.32 A three times differentiable closed convex function φ with open domain
D is κ-self-concordant if and only if
3
∇ φ(x)[h1 , h2 , h3 ] ≤ 2κ kh1 kx kh2 kx kh3 kx
Proof: This statement is nothing but a general property of three-linear forms. For
the proof we refer to Lemma A.2 in the Appendix. 2
Theorem 6.33 Let the closed convex function φ with open domain D be κ-self-concor-
dant. If D does not contain a straight line then the Hessian ∇2 φ(x) is positive definite
at any x ∈ D.
Proof: Suppose that H(x) is not positive definite for some x ∈ D. Then there exists
a nonzero vector h ∈ IRn such that hT H(x)h = 0 or, equivalently, khkx = 0. For all α
such that x + αh ∈ D we consider the function
k(α) := hT H(x + αh)h = ∇2 φ(x + αh)[h, h] = khk2x+αh .
Then k(0) = 0 and k(α) is continuously differentiable. We claim that k(α) = 0 for
every α in the domain of k. Note that k(α) ≥ 0 for all α. Assuming that the claim
is not true, we may suppose without loss of generality that k(α) > 0 on some open
interval (0, α) and, moreover, since k 0 is continuous, that k(α) is nondecreasing on this
interval.
The derivative k 0 (α) satisfies
3
k 0 (α) = ∇3 φ(x + αh)[h, h, h] ≤ 2κ khk3x+αh = 2κk(α) 2 .
This implies k 0 (0) = 0 and, moreover, if κ = 0 then k 0 (α) = 0, whence k(α) = 0 for all
α in the domain of k. Thus we may further assume that κ > 0. If α ∈ (0, α) we may
write, using k(0) = 0 and that k is nondecreasing on (0, α),
Z α Z α 3
Z α 3 3
0
k(α) = k (β) dβ ≤ 2κ k(β) dβ ≤ 2κ
2 k(α) 2 dβ = 2ακk(α) 2 .
0 0 0
144
Dividing at both sides by k(α) we get
1
1 ≤ 2ακk(α) 2 ,
which implies
1
k(α) ≥ , ∀α ∈ (0, α).
4α2
Obviously this contradicts the fact that k is continuous in 0.
Thus we have shown that k(α) = 0 for all α such that x + αh ∈ D. From this we
deduce that φ(x + αh) is linear in α, because we have
The hypothesis of the theorem implies that there exists an α such that x + αh belongs
to the boundary of D. Without loss of generality we may assume that α > 0 (else
we replace h by −h). It then follows that φ(x + αh) converges to φ(x) + αhT g(x) if
α converges to α. However, this gives a conflict with Lemma 6.31 which implies that
φ(x + αh) converges if α converges to α. Thus the proof is compete. 2
Corollary 6.34 If D does not contain a straight line then φ(x) is strictly convex. As
a consequence, if φ(x) has a minimizer then this minimizer is unique.
From now on it will be assumed that the hypothesis of Theorem 6.33 is satisfied. So
the domain D does not contain a straight line. As a consequence we have
∀x ∈ D, ∀h ∈ IRn : khkx = 0 ⇔ h = 0.
kdkx kdkx
≤ kdkx+αd ≤ ;
1 + ακ kdkx 1 − ακ kdkx
the left inequality holds for all α such that 1 + ακ kdkx > 0 and the right for all α such
that 1 − ακ kdkx > 0.
145
Hence, using the κ-self-concordancy of φ,
3
|q 0 (α)| = ∇3 φ(x + αd)[d, d, d] ≤ 2κ kdk3x+αd = 2κq(α) 2 .
or, equivalently,
1 − ακ kdkx 1 1 + ακ kdkx
≤ ≤ .
kdkx kdkx+αd kdkx
Hence, if 1 + ακ kdkx > 0 we obtain
kdkx
≤ kdkx+αd
1 + ακ kdkx
kdkx
kdkx+αd ≤ ,
1 − ακ kdkx
proving the lemma. 2
Then
∇3 φ(x + αh)[h, h, h]
β 0 (α) := − 3 ,
2∇2 φ(x + αh)[h, h] 2
and hence |β 0 (α)| ≤ κ. Derive Lemma 6.35 from this. /
Lemma 6.36 Let x and d be such that x ∈ D, x + d ∈ D and κ kdkx < 1. Then we
have, for any nonzero v ∈ IRn ,
kvkx
(1 − κ kdkx ) kvkx ≤ kvkx+d ≤ . (6.12)
1 − κ kdkx
146
Proof: Fixing v, we define for 0 ≤ α ≤ 1,
Note that k(α) > 0. Since k 0 (α)/k(α) is the derivative to α of log k(α) we find that
d log k(α) 2κ kdkx
≤ .
dα 1 − ακ kdkx
147
Lemma 6.37 Let x ∈ D and d ∈ IRm . If kdkx < 1
κ
then x + d ∈ D.
Proof: Since kdkx < κ1 , we have from Lemma 6.36 that H(x + αd) is bounded for all
0 ≤ α ≤ 1, and thus φ(x + αd) is bounded. On the other hand, φ takes infinite values
on the boundary of the feasible set, by Lemma 6.31. Consequently, x + d ∈ D. 2
x+ = x + α∆x.
The next lemma shows that with an appropriate choice of α we can guarantee a fixed
decrease in φ after the step.
1
Lemma 6.38 Let x ∈ D and δ := δ(x). If α := 1+κδ
then
ψ (κδ)
φ(x) − φ(x + α∆x) ≥ .
κ2
Proof: Define
∆(α) := φ(x) − φ(x + α∆x).
Then
Now using that φ is κ-self-concordant we deduce from the last expression that
As a consequence we have
α −2κδ 3 −δ 2 −δ 2
Z
00 00 α
∆ (α) − ∆ (0) ≥ dβ = β=0 = + δ2.
(1 − βκδ)3 (1 − βκδ)2 (1 − ακδ)2
0
148
Since ∆00 (0) = −∇2 φ(x)[∆x, ∆x] = −δ 2 , we obtain
−δ 2
∆00 (α) ≥ .
(1 − ακδ)2
In a similar way, by integrating, we derive an estimate for ∆0 (α):
α −δ 2 −δ −δ δ
Z
0 0 α
∆ (α) − ∆ (0) ≥ 2 dβ = β=0 = + .
0 (1 − βκδ) κ (1 − βκδ) κ (1 − ακδ) κ
to measure the ‘length’ of the Newton step. Note that if x is such that φ(x) is minimal
then g(x) = 0 and hence δ(x) = 0; whereas in all other cases δ(x) will be positive.
After the Newton step we have
q
δ(x+ ) = g(x+ )T H(x+ )−1 g(x+ ) =
H(x+ )−1 g(x+ )
. (6.13)
x+
1
Lemma 6.39 If δ(x) ≤ 3κ
then x+ is feasible and
!2
+ δ(x) 9κ
δ(x ) ≤ κ ≤ δ(x)2 .
1 − κδ(x) 4
149
Proof: The feasibility of x+ follows from Lemma 6.37. Using the Mean Value Theorem
we have for some β, 0 ≤ β ≤ 1,
1
g(x+ ) = g(x) + H(x)∆x + ∇3 (x + β∆x)[∆x, ∆x].
2
Since, by the definition of ∆x, g(x) + H(x)∆x = 0 we obtain
1
g(x+ ) = ∇3 φ(x + β∆x)[∆x, ∆x].
2
Hence, for any vector p ∈ IRn we have
1
pT g(x+ ) = ∇3 φ(x + β∆x)[∆x, ∆x, p].
2
Using Lemma 6.32 we get
T
p g(x+ )
≤ κ k∆xk2x+β∆x kpkx+β∆x . (6.14)
k∆xkx δ(x)
k∆xkx+β∆x ≤ =
1 − βκ k∆xkx 1 − βκδ(x)
and
kpkx+∆x kpkx+∆x kpkx+∆x
kpkx+β∆x ≤ ≤ (1−β)κk∆xk
= .
1 − (1 − β)κ k∆xkx+∆x 1 − 1−κk∆xk x 1 − (1−β)κδ(x)
1−κδ(x)
x
Substituting the last two inequalities in (6.14), while replacing p by the Newton step
H(x+ )−1 g(x+ ) at x+ and also using kpkx+∆x = δ(x+ ), from (6.13), we obtain
!2
δ(x) δ(x+ )
δ(x+ )2 = g(x+ )T H(x+ )−1 g(x+ ) ≤κ .
1 − βκδ(x) (1−β)κδ(x)
1− 1−κδ(x)
+ κδ(x)2
δ(x ) ≤ , (6.15)
h(β)
for some β, 0 ≤ β ≤ 1, where
!
2 (1 − β)κδ(x)
h(β) = (1 − βκδ(x)) 1− .
1 − κδ(x)
150
whence h(β) is concave. Hence, for β, 0 ≤ β ≤ 1,
( )
1 − 2κδ(x)
h(β) ≥ min {h(0), h(1)} = min , 1 − κδ(x)2 = (1 − κδ(x))2 , (6.16)
1 − κδ(x)
Remark: Lemma 6.39 is also valid if δ(x) < κ1 . This can be shown as follows. For v ∈ IRn
and 0 ≤ α ≤ 1, define
k(α) := v T g(x + α∆x) − (1 − α)v T g(x).
Note that if v = H(x+ )−1 g(x+ ) then
Thus our aim is to find a good estimate for k(1). Taking the derivative of k to α we get, also
using that H(x)∆x = −g(x),
By Exercise 6.9,
!
1
H(x + α∆x) − H(x) − 1 H(x).
(1 − ακ k∆xkx )2
Now applying the generalized Cauchy inequality of Lemma A.1 in the Appendix we get
!
T 1
v (H(x + α∆x) − H(x)) ∆x ≤ − 1 kvkx k∆xkx .
(1 − ακ k∆xkx )2
151
Therefore, since k(0) = 0,
!
Z 1 1
k(1) ≤ δ(x) kvkx − 1 dα.
0 (1 − ακδ(x)) 2
We have
!
Z 1 1
1
− 1 dα = − α 1α=0
0 (1 − ακδ(x)) 2 κδ(x) (1 − ακδ(x))
1 1
= −1−
κδ(x) (1 − κδ(x)) κδ(x)
κδ(x)
= .
1 − κδ(x)
Substitution gives
κδ(x)2
k(1) ≤ kvkx .
1 − κδ(x)
For v = H(x+ )−1 g(x+ ), we have, by Lemma 6.36,
kvkx+ δ(x+ )
kvkx ≤ = .
1 − ακ k∆xkx 1 − ακδ(x)
Since k(1) = δ(x+ )2 , it follows by substitution,
δ(x+ ) κδ(x)2
δ(x+ )2 = k(1) ≤ .
1 − ακδ(x) 1 − κδ(x)
Dividing both sides by δ(x+ ) the claim follows. •
152
which proves the first inequality. Using this inequality write
Z 1
φ(x + h) − φ(x) − hT g(x) = hT (g(x + αh) − g(x)) dα
0
α khk2x 1
Z
≥ dα
0 1 + ακ khkx
Lemma 6.41 Let x ∈ D and 0 6= h ∈ IRn such that x + h ∈ D and let khkx < 1. Then
khk2x
hT (g(x + h) − g(x)) ≤
1 − κ khkx
ψ(−κ khkx )
φ(x + h) − φ(x) ≤ hT g(x) + .
κ2
As usual, for each x ∈ D, δ(x) = k∆xkx , with ∆x denoting the Newton step at x.
We now prove that if δ(x) < κ1 for some x ∈ D then φ must have a minimizer. Note
that this surprising result expresses that some local condition on φ provides us with a
global property, namely the existence of a minimizer.
1
Theorem 6.42 Let δ(x) < κ
for some x ∈ D. Then φ has a unique minimizer x∗ in
D.
153
Proof: The proof is based on the observation that the level set
with x as given in the theorem, is compact. This can be seen as follows. Let y ∈ D.
Writing y = x + h, Lemma 6.40 implies the inequality
ψ (κ khkx ) ψ (κ khkx )
φ(y) − φ(x) ≥ hT g(x) + 2
= −hT H(x)∆x + , (6.20)
κ κ2
where we used that, by definition, the Newton step ∆x at x satisfies H(x)∆x = −g(x).
Since
hT H(x)∆x ≤ khkx k∆xkx = khkx δ(x)
we thus have
ψ (κ khkx )
φ(y) − φ(x) ≥ − khkx δ(x) + .
κ2
Hence, if φ(y) ≤ φ(x), then
ψ (κ khkx )
≤ κδ(x) < 1. (6.21)
κ khkx
Putting ξ := κ khkx one may easily verify that ψ(ξ)/ξ is monotonically increasing for
ξ > 0 and goes to 1 if ξ → ∞. Therefore, since δ(x) < κ1 , we may conclude from (6.21)
that κ khkx is bounded above. This implies that the level set (6.19) is bounded. From
this we conclude that φ has a minimizer x∗ . Finally, Corollary 6.34 implies that this
minimizer is unique. 2
Exercise 6.10 Let δ(x) ≥ κ1 for all x ∈ D. Then φ is unbounded and, hence, has no mini-
mizer in D. Prove this. (Hint: use Lemma 6.38.) /
1
Exercise 6.11 Let δ(x) ≥ κ for all x ∈ D. Then D is unbounded. Prove this. /
154
The proof of the next theorem requires the result of the following exercise.
whence
ψ(−s) + ψ(t) ≥ st, s < 1, t > −1. (6.22)
Prove this. /
1
Theorem 6.44 Let x ∈ D be such that δ(x) < κ
and let x∗ denote the unique minimizer
of φ. Then, with δ := δ(x),
ψ(κδ) ψ(−κδ)
2
≤ φ(x) − φ(x∗ ) ≤ (6.23)
κ κ2
0
ψ (κδ) δ δ ψ 0 (−κδ)
= ≤ kx − x∗ kx ≤ =− . (6.24)
κ 1 + κδ 1 − κδ κ
Proof: The left inequality in (6.23) follows from Lemma 6.38, because φ is minimal
at x∗ . Furthermore, from (6.18) in Lemma 6.40, with h = x∗ − x, we get the right
inequality in (6.23):
ψ(κ khkx )
φ(x∗ ) − φ(x) ≥ (h)T g(x) +
κ2
ψ(κ khkx )
≥ − khkx δ(x) +
κ2
1
≥ 2 (−κ khkx κδ + ψ(κ khkx ))
κ
ψ (−κδ)
≥ − ,
κ2
where the second inequality holds since
(h)T g(x) = − (h)T H(x)∆x ≤ khkx k∆xkx = khkx δ(x) = khkx δ, (6.25)
khk2x
≤ (h)T g(x) ≤ khkx δ.
1 + κ khkx
155
which is equivalent to
δ ψ 0 (−κδ)
khkx ≤ =− ,
1 − κδ κ
which proves the right inequality in (6.24).
δ
Note that the left inequality in (6.24) is trivial if κ khkx > 1 since 1+κδ
< 1δ . Thus we
may assume that 1 − κ khkx > 0. For 0 ≤ α ≤ 1, consider
One has k(0) = 0 and k(1) = δ(x)2 = δ 2 . Using the result in Exercise 6.9 and the
generalized Cauchy inequality of Lemma A.1 in the Appendix we may write
khkx δ(x)
k 0 (α) = −hT H(x∗ − αh)H(x)−1 g(x) ≤ .
(1 − κ khkx )2
khkx
δ≤ ,
1 − κ khkx
which is equivalent to
δ
khkx ≥ .
1 + κδ
Thus the proof is complete. 2
min {φ(x) : x ∈ D} .
156
Hence we will have δ(xk ) ≤ if
!2k
4 9κδ 0
≤ .
9κ 4
9κδ 0 9κ
2k log ≤ log .
4 4
9κδ0 3 0
Note that 4
≤ 4
< 1. Dividing by − log 9κδ
4
we get
klog 9κ
4
2 ≥ 0 ,
log 9κδ
4
or, equivalently,
log 9κ
2 4
k ≥ log 0 .
log 9κδ
4
0
Since − log 9κδ
4
≥ − log 34 we find that after no more than
steps the process will stop and the output will be an x ∈ D such that kx − x∗ kx ≤ .
1
If δ 0 > 3κ then we use damped Newton steps. By Lemma 6.38 each damped Newton
step decreases φ with at least the value
ψ (κδ) ψ 13 0.0457 1
≥ = > .
κ2 κ2 κ2 22κ2
Hence, after no more than
22κ2 φ(x0 ) − φ(x∗ )
1
we reach the region where δ(x) < 3κ
. Then we can proceed with full Newton steps, and
after a total of
l m
22κ2 φ(x0 ) − φ(x∗ ) + 2 log (−2.8188 − 3.4761 log κ)
Note the drawback of the above iteration bound: usually we have no a priori knowledge
of φ(x∗ ) and the bound cannot be calculated at the start of the algorithm. But in many
cases we can derive a good estimate for φ(x0 ) − φ(x∗ ) and we obtain an upper bound
for the number of iterations at the start of the algorithm.
157
Example 6.45 Consider the function f : (−1, ∞) → IR defined by
Thus we find in this example that the damped Newton step is exact if x > 0. Also, if −1 < x < 0 then
−2x2
< −x2 ,
1−x
and hence then the full Newton step performs better than the damped Newton step. Finally observe
that if we apply Newton’s method until δ(x) ≤ then the output is an x such that |x| ≤ . ∗
with 0 < x ∈ IRn . We established in Example 6.14 that φ is 1-self-concordant, and the first and second
order derivatives are given by
−e e
g(x) = ∇φ(x) = , H(x) = ∇2 φ(x) = diag .
x x2
Therefore, v
u n
q uX √
δ(x) = g(x)T H(x)−1 g(x) = t 1 = kek = n.
i=1
158
with −e < x ∈ IRn . The gradient and Hessian of Ψ are
!
x e
g(x) = ∇φ(x) = , H(x) = ∇2 φ(x) = diag 2 .
e+x (e + x)
We established that Ψ is 1-self-concordant. One has
v
q u n
uX
T −1
δ(x) = g(x) H(x) g(x) = t x2i = kxk .
i=1
This implies that x = 0 is the unique minimizer. The Newton step at x is given by
∆x = −H(x)−1 g(x) = −x(e + x),
and a full Newton step yields
x+ = x − x(e + x) = −x2 .
The Newton step is feasible only if −x2 > −e, i.e. x2 < e; this certainly holds if δ(x) < 1. Note that
the theory guarantees feasibility only in that case. Moreover, if the Newton step is feasible then
δ(x+ ) =
x2
≤ kxk∞ kxk ≤ δ(x)2 ,
and this is better than the theoretical result of Lemma 6.18. When we take a damped Newton step,
1
with the default step size α = 1+δ(x) , the next iterate is given by
x(e + x)
x+ = x − .
1 + kxk
If we apply Newton’s method until δ(x) ≤ then the output is an x such that kxk ≤ . ∗
159
k xk f (xk ) δ(xk ) αk
0 10.00000000000000 20.72326583694642 9.65615737513337 0.09384245791391
1 7.26783221086343 12.43198234403589 7.19322142387618 0.12205211457924
2 5.04872746432087 6.55544129967853 4.97000287092924 0.16750410705319
3 3.33976698811526 2.82152744553701 3.05643368252612 0.24652196443090
4 2.13180419256384 0.85674030296950 1.55140872104182 0.39194033937129
5 1.39932346194914 0.13416824208214 0.56132642454284 0.64048105782415
6 1.07453881397326 0.00535871156275 0.10538523300312 1.00000000000000
7 0.99591735745291 0.00001670208774 0.00577372342963 1.00000000000000
8 0.99998748482804 0.00000000015663 0.00001769912592 1.00000000000000
9 0.99999999988253 0.00000000000000 0.00000000016613 1.00000000000000
k xk f (xk ) δ(xk ) αk
0 0.10000000000000 2.07232658369464 1.07765920479347 0.48131088953032
1 0.14945506622819 1.61668135596306 1.05829223631865 0.48583965986703
2 0.22112932596124 1.17532173793649 1.00679545093710 0.49830688998873
3 0.32152237588997 0.76986051286674 0.90755746327638 0.52423060340338
4 0.45458940014373 0.42998027395695 0.74937259761986 0.57163351098592
5 0.61604926491198 0.18599661844608 0.53678522950535 0.65070901307522
6 0.78531752299982 0.05188170346324 0.30270971353625 1.00000000000000
7 0.96323307457328 0.00137728412903 0.05199249905660 1.00000000000000
8 0.99897567517041 0.00000104977911 0.00144861398705 1.00000000000000
9 0.99999921284500 0.00000000000062 0.00000111320527 1.00000000000000
10 0.99999999999954 0.00000000000000 0.00000000000066 1.00000000000000
160
Chapter 7
7.1 Introduction
In this chapter we show for some important classes of optimization problems that the
logarithmic barrier function is self-concordant. Sometimes the logarithmic barrier func-
tion of the problem itself is not self-concordant, but it will be necessary to reformulate
the problem. The approach in this section is based mainly on Den Hertog [13], Den
Hertog et al. [15] and Jarre [21], except for the last subsection that deals with the
semidefinite optimization problem. For a rich survey of applications of semidefinite
optimization in control and system theory we refer to Vandenberghe and Boyd [44].
Before dealing with the problems we first present some lemmas that are quite helpful
in recognizing self-concordant functions.
161
The next lemma states that if the quotient of the third and second order derivative of
f (x) is bounded by the second order derivative of − ni=1 log xi , then the corresponding
P
logarithmic barrier function is self-concordant. This lemma will help to simplify self-
concordance proofs in the sequel.
Lemma 7.2 Let f (x) ∈ C 3 (F 0) and convex. If there exists a β such that
v
u n
h2i
|∇3 f (x)[h, h, h]| ≤ T 2
uX
βh ∇ f (x)ht 2
, (7.1)
i=1 xi
with q ≥ 1, is (1 + 13 β)-self-concordant on IR × F 0 .
∗
Proof: We start by proving the first part of the lemma. Note that since (7.1) is scale independent,
we may assume that µ = 1. Straightforward calculations yield
n
X hi
∇ϕ(x)T h = ∇f (x)T h − (7.2)
i=1
xi
n
X h2
hT ∇2 ϕ(x)h = hT ∇2 f (x)h + i
(7.3)
i=1
x2i
n
X h3
∇3 ϕ(x)[h, h, h] = ∇3 f (x)[h, h, h] − 2 i
. (7.4)
i=1
x3i
We show that
1
(∇3 ϕ(x)[h, h, h])2 ≤ 4(1 + β)2 (hT ∇2 ϕ(x)h)3 , (7.5)
3
from which the first part of the lemma follows. Since f is convex, the two terms on the right-hand side
of (7.3) are nonnegative, i.e. the right-hand side can be abbreviated by
hT ∇2 ϕ(x)h = a2 + b2 , (7.6)
162
It is straightforward to verify that
1
(βa2 b + 2b3 )2 ≤ 4(1 + β)2 (a2 + b2 )3 .
3
Together with (7.6) and (7.7) our claim (7.5) follows and hence the first part of the lemma.
Now we prove the second part of the lemma. Let
h0
t
..
x̃ = , h = and g(x̃) = t − f (x), (7.8)
.
x
hn
then
n
X
ψ(x̃) = −q log g(x̃) − log xi (7.9)
i=1
n
∇g(x̃)T h X hi
∇ψ(x̃)T h = −q − (7.10)
g(x̃) x
i=1 i
n
hT ∇2 g(x̃)h (∇g(x̃)T h)2 X h2i
hT ∇2 ψ(x̃)h = −q +q + (7.11)
g(x̃) g(x̃)2 x2
i=1 i
∇3 g(x̃)[h, h, h] (hT ∇2 g(x̃)h)∇g(x̃)T h
∇3 ψ(x̃)[h, h, h] = −q + 3q
g(x̃) g(x̃)2
n
(∇g(x̃)T h)3 X h3
i
−2q 3
−2 3. (7.12)
g(x̃) x
i=1 i
We show that
1
(∇3 ψ(x̃)[h, h, h])2 ≤ 4(1 + β)2 (hT ∇2 ψ(x̃)h)3 , (7.13)
3
which will prove the lemma. Since g is concave, all three terms on the right-hand side of (7.11) are
nonnegative, i.e. the right-hand side can be abbreviated by
hT ∇2 ψ(x̃)h = a2 + b2 + c2 , (7.14)
with a, b, c ≥ 0. Due to (7.1) we have
3
∇ g(x̃)[h, h, h]
≤ βa2 c,
g(x̃)
In the next sections we will show self-concordance for the logarithmic barrier function
for several nonlinear optimization problems by showing that (7.1) is fulfilled. Below we
will frequently and implicitly use the fact that − log(−gj (x)) is 1-self-concordant on its
domain whenever gj (x) is a linear or convex quadratic function. If gj (x) is linear this
follows immediately by using Example 6.13 and the second part of Lemma 7.1.
163
Exercise 7.2 Prove that − log(−gj (x)) is 1-self-concordant on its domain when gj (x) is a
convex quadratic function. /
Primal EO problem
where A ∈ IRm×n , b ∈ IRm and c ∈ IRn . Using Example 6.17 and the first composition
rule in Lemma 7.1 it follows that the logarithmic barrier function is 1-self-concordant.
Dual EO problem
x ∈ C := {x : x ≥ 0} ,
sup {ψ(y)}
where
ψ(y) = inf (L(x, y)) ,
x∈C
i=1
Fixing y, L(x, y) is convex in x. Moreover, L(x, y) goes to infinity if, for some i, xi goes
to infinity, and the partial derivative to xi goes to infinity if xi approaches zero. We
conclude from this that L(x, y) is bounded below. Hence the minimum is attained and
occurs where the gradient vanishes. The gradient of L(x, y) with respect to x is given
by
c + e + log x − AT y,
164
and hence L(x, y) is minimal if
log xi = aTi y − ci − 1, 1 ≤ i ≤ n,
where ai denotes the i-th column of A. From this we can solve x, namely
T
xi = eai y−ci −1 , 1 ≤ i ≤ n.
Substitution gives
n
T
xi log xi − y T (Ax − b)
X
ψ(y) = c x +
i=1
n
= cT x + xi aTi y − ci − 1 − y T (Ax − b)
X
i=1
n
T T T
xi − y T (Ax − b)
X
= c x + y Ax − c x −
i=1
n
= bT y −
X
xi
i=1
n
T
= bT y − eai y−ci −1 .
X
i=1
Duality results
Here we just summarize some basic duality properties of EO problems. For details we
refer to the paper of Kas and Klafszky [18].
Lemma 7.3 (Weak duality) If x is feasible for (EOP) and y is feasible for (EOD) then

c^T x + ∑_{i=1}^n x_i log x_i ≥ b^T y − ∑_{i=1}^n e^{a_i^T y − c_i − 1},

with equality if and only if

x_i = e^{a_i^T y − c_i − 1},   1 ≤ i ≤ n.    (7.16)

Corollary 7.4 If (7.16) holds for some x ∈ P and y ∈ IR^m then they are both optimal and the duality gap is zero.
As we observed, both (EOP ) and (EOD) are convex optimization problems.
2. If

ν = sup_y { b^T y − ∑_{i=1}^n e^{a_i^T y − c_i − 1} } < ∞,
Generalized EO (GEO)

In the generalized entropy optimization problem (GEO) the entropy terms x_i log x_i in the objective of (EOP) are replaced by functions f_i(x_i), where for each i there exists a positive number κ_i such that the function f_i : (0, ∞) → IR satisfies

|f_i'''(x_i)| ≤ κ_i f_i''(x_i) / x_i.
Obviously, the logarithmic barrier x_i log x_i − log x_i of the entropy function x_i log x_i satisfies this condition, by Example 6.17. The class of GEO problems is studied in Ye and Potra [45] and Han et al. [12].¹

¹ In this paper it is conjectured that these problems do not satisfy the self-concordance condition. The lemma below shows that they do satisfy it.
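For the entropy function itself the condition is easily checked by hand, with κ_i = 1:

f_i(x_i) = x_i log x_i,   f_i''(x_i) = 1/x_i,   f_i'''(x_i) = −1/x_i²,

so that |f_i'''(x_i)| = f_i''(x_i)/x_i for all x_i > 0.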
Self-concordant barrier for (GEO)
Lemma 7.6 The logarithmic barrier function for the generalized entropy optimization problem (GEO) is (1 + (1/3) maxᵢ κᵢ)-self-concordant.
To obtain a self-concordant barrier for the dual entropy problem (EOD), one replaces the terms e^{a_i^T y − c_i − 1} in its objective by auxiliary variables z_i, constrained by a_i^T y − c_i − 1 − log z_i ≤ 0; call the resulting problem (EOD′). Note that

a_i^T y − c_i − 1 − log z_i ≤ 0

is equivalent to

e^{a_i^T y − c_i − 1} ≤ z_i,

and, since we are maximizing, at optimality we will have equality, for each i. The logarithmic barrier function for (EOD′) is

(−b^T y + ∑_{i=1}^n z_i)/µ − ∑_{j=1}^n log(log z_j − a_j^T y + c_j + 1) − ∑_{j=1}^n log z_j.

It can be shown that this barrier function is 2-self-concordant. Note that the first term is linear and hence 0-self-concordant. It follows from the next exercise and Lemma 7.1 that the second term is 2-self-concordant. See also Jarre [21].
Exercise 7.3 Prove that the function
N.B. Here and elsewhere in this section the symbol e represents the base of the natural
logarithm, and not the all-one vector!
Posynomial optimization

G_k(y) = ∑_{i∈I_k} e^{a_i^T y − c_i}
       = ∑_{i∈I_k} e^{−c_i} ∏_{j=1}^m e^{a_{ij} y_j}
       = ∑_{i∈I_k} α_i ∏_{j=1}^m (e^{y_j})^{a_{ij}}
       = ∑_{i∈I_k} α_i ∏_{j=1}^m τ_j^{a_{ij}},

where α_i = e^{−c_i} > 0 and τ_j = e^{y_j} > 0. The above polynomial is called a posynomial because all coefficients α_i and all variables τ_j are positive. Observe that the substitution τ_j = e^{y_j} convexifies the posynomial.
Recently the ‘transistor sizing problem’, which involves minimizing the active area of
an electrical circuit subject to circuit delay specifications, has been modeled and solved
by using a posynomial model [41].
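The convexifying effect of the substitution can also be observed numerically. In the sketch below (with arbitrary illustrative coefficients, not data from the text), the posynomial rewritten in the variables y passes a random midpoint-convexity test:

import numpy as np

alpha = np.array([1.0, 2.0])             # posynomial coefficients alpha_i > 0
E = np.array([[1.0, -2.0],               # exponent vectors a_i (rows)
              [-0.5, 1.5]])

def posy_in_y(y):
    # the posynomial after the substitution tau_j = exp(y_j):
    # sum_i alpha_i * exp(a_i^T y), a sum of exponentials of affine functions
    return np.sum(alpha * np.exp(E @ y))

rng = np.random.default_rng(0)
for _ in range(1000):
    u, v = rng.normal(size=2), rng.normal(size=2)
    assert posy_in_y((u + v) / 2) <= (posy_in_y(u) + posy_in_y(v)) / 2 + 1e-12
print("midpoint convexity holds on all sampled segments")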
Dual GO problem

The dual geometric optimization problem (DGO) minimizes c^T x + φ(x) over the dual feasible region, where

φ(x) = ∑_{k=1}^r [ ∑_{i∈I_k} x_i log x_i − ( ∑_{i∈I_k} x_i ) log( ∑_{i∈I_k} x_i ) ].

Note that if r = 1 and ∑_{i∈I_1} x_i = 1 then (DGO) is just the primal entropy optimization problem. The function φ can also be written as

φ(x) = ∑_{k=1}^r log [ ∏_{i∈I_k} x_i^{x_i} / ( ∑_{i∈I_k} x_i )^{∑_{i∈I_k} x_i} ].
Exercise 7.4 Derive the dual geometric optimization problem (DGO) from the primal prob-
lem (P GO) by using Lagrange duality. /
Duality results

² In the proof one needs the weighted arithmetic–geometric mean inequality

∏_{i=1}^n (α_i/β_i)^{β_i} ≤ ∑_{i=1}^n α_i   (for ∑_{i=1}^n β_i = 1),

where α = (α₁, · · · , αₙ) ≥ 0 and β = (β₁, · · · , βₙ) > 0. Equality holds if and only if α = λβ for some nonnegative λ. The inequality is also valid for β = (β₁, · · · , βₙ) ≥ 0 if we define (α_i/0)⁰ := 1.
Lemma 7.7 (Weak duality) If y is feasible for (PGO) and x is feasible for (DGO) then

b^T y ≤ c^T x + φ(x),

with equality if and only if for each k = 1, · · · , r and j ∈ I_k

x_j = e^{a_j^T y − c_j} ∑_{i∈I_k} x_i.    (7.17)
The feasible regions of (DGO) and (PGO) are denoted by D and P, respectively.
Corollary 7.8 If (7.17) holds for some x ∈ D and y ∈ P then they are both optimal
and the duality gap is zero.
Most of the following observations are trivial; some of them need a nontrivial proof.

• The function

φ_k(x) = log [ ∏_{i∈I_k} x_i^{x_i} / ( ∑_{i∈I_k} x_i )^{∑_{i∈I_k} x_i} ]

is positive homogeneous of order 1 (φ_k(λx) = λφ_k(x) for all λ > 0) and subadditive (which means that φ_k(x¹ + x²) ≤ φ_k(x¹) + φ_k(x²)); see the numerical check after this list.

• If |I_k| = 1 for all k then (PGO) is equivalent to an LO problem.

• If x is optimal for (DGO) and x_i = 0 for some i ∈ I_k then x_i = 0 for all i ∈ I_k.

• G_k(y) is logarithmically convex (i.e. log G_k(y) is convex). This easily follows by using Hölder's inequality.³

³ Hölder's inequality: for 0 < λ < 1,

∑_{i=1}^n α_i^λ β_i^{1−λ} ≤ ( ∑_{i=1}^n α_i )^λ ( ∑_{i=1}^n β_i )^{1−λ},

for nonnegative α = (α₁, · · · , αₙ) and β = (β₁, · · · , βₙ).
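Both properties of φ_k from the first item are easy to test numerically; the following sketch (with random data, not from the text) samples homogeneity and subadditivity:

import numpy as np

def phi_k(x):
    # phi_k(x) = sum_i x_i log x_i - (sum_i x_i) log(sum_i x_i)
    s = x.sum()
    return np.sum(x * np.log(x)) - s * np.log(s)

rng = np.random.default_rng(1)
for _ in range(1000):
    x1 = rng.uniform(0.1, 5.0, size=4)
    x2 = rng.uniform(0.1, 5.0, size=4)
    lam = rng.uniform(0.1, 10.0)
    assert np.isclose(phi_k(lam * x1), lam * phi_k(x1))        # homogeneity
    assert phi_k(x1 + x2) <= phi_k(x1) + phi_k(x2) + 1e-10     # subadditivity
print("homogeneity and subadditivity hold on all samples")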
3. If both (PGO) and (DGO) are feasible but neither of them is Slater regular, then

sup{ b^T y : y ∈ P } = inf{ c^T x + φ(x) : x ∈ D }.
∗
Proof: (Sketch)

• 1. and 2. follow by using the Convex Farkas Lemma or the Karush–Kuhn–Tucker Theorem.

• The proof of 3. consists of several steps.
  – First we reduce the dual problem to (DGOr) by erasing all the variables x_i that are zero at all dual feasible solutions.
  – Clearly (DGO) and (DGOr) are equivalent.
  – By construction (DGOr) is Slater regular.
  – Form the primal (PGOr) of (DGOr).
  – Due to 2., (PGOr) has an optimal solution with optimal value equal to the optimal value of (DGOr).
  – It then remains to prove that the optimal values of (PGO) and (PGOr) are equal. □
Self-concordant barriers

Lemma 7.10 The logarithmic barrier functions of both (PGO) and (DGO) are 2-self-concordant.
∗
Proof: We give the proof for the dual GO problem. Because of Lemma 7.1, it suffices to verify 2-self-concordance for the following logarithmic barrier function

ϕ(x) = ∑_{i∈I_k} x_i log x_i − ( ∑_{i∈I_k} x_i ) log( ∑_{i∈I_k} x_i ) − ∑_{i∈I_k} log x_i,    (7.18)

for some fixed k. For simplicity, we will drop the subscript i ∈ I_k. Now we can use Lemma 7.2, so that we only have to verify (7.1), with β = 3, for

f(x) := ∑ x_i log x_i − ( ∑ x_i ) log( ∑ x_i ).

Writing h_i = x_i^{1/2} ξ_i + x_i ∑h_j/∑x_j, which defines ξ_i, one finds h^T ∇²f(x) h = ∑ ξ_i², so that (7.1) reduces to

| ∑ h_i³/x_i² − ( ∑ h_i )³/( ∑ x_i )² | ≤ 3 ( ∑ ξ_i² ) √( ∑ h_j²/x_j² ).    (7.19)

Note that ∑ x_i^{1/2} ξ_i = 0.
Using this substitution, we can rewrite the left-hand side of the inequality (7.19):

∑ h_i³/x_i² − ( ∑ h_i )³/( ∑ x_i )²
  = ∑ [ x_i^{−1/2} ξ_i³ + 3 ξ_i² ∑h_j/∑x_j ]
  = ∑ ξ_i² ( x_i^{−1/2} ξ_i + 3 ∑h_j/∑x_j )
  = ∑ ξ_i² ( h_i/x_i − ∑h_j/∑x_j + 3 ∑h_j/∑x_j )
  = ∑ ξ_i² ( h_i/x_i + 2 ∑h_j/∑x_j )
  ≤ ∑ ξ_i² |h_i|/x_i + 2 ∑ ξ_i² |∑h_j|/∑x_j
  ≤ 3 ( ∑ ξ_i² ) √( ∑ h_j²/x_j² ),    (7.20)

where the last inequality follows because

|h_i|/x_i ≤ √( ∑ h_j²/x_j² ),

and

( ∑ h_i²/x_i² ) ( ∑ x_i )² ≥ ( ∑ h_i²/x_i² ) ( ∑ x_i² ) ≥ ( ∑ |h_i| )².    (7.21)

(The last inequality in (7.21) follows directly from the Cauchy–Schwarz inequality.) Now note that the right-hand side of (7.19) is equal to

3 ( ∑ ξ_i² ) √( ∑ h_i²/x_i² ).

Together with (7.20), this completes the proof. □
and let p_i > 1, i = 1, · · · , n. Moreover, let a_i ∈ IR^m, c_i ∈ IR for all i ∈ I, and b_k ∈ IR^m, d_k ∈ IR for k = 1, · · · , r. Then the primal lp-norm optimization problem [35]–[37], [43] can be formulated as

(Plp)   max η^T y
        s.t. G_k(y) := ∑_{i∈I_k} (1/p_i) |a_i^T y − c_i|^{p_i} + b_k^T y − d_k ≤ 0,   k = 1, · · · , r.
One may easily verify that (Plp) is a convex optimization problem. Note that, in spite of the absolute values in the problem formulation, the functions G_k(y) are continuously differentiable.
Special cases of lp -optimization:
Let q_i be such that 1/p_i + 1/q_i = 1. Moreover, let A be the matrix whose columns are a_i, i = 1, · · · , n, and B the matrix whose columns are b_k, k = 1, · · · , r. Then the dual of the lp-norm optimization problem (Plp) is (see [35]–[37], [43])

(Dlp)   min ψ(x, z) := c^T x + d^T z + ∑_{k=1}^r z_k ∑_{i∈I_k} (1/q_i) |x_i/z_k|^{q_i}
        s.t. Ax + Bz = η,
             z ≥ 0.

If x_i ≠ 0 and z_k = 0, then z_k |x_i/z_k|^{q_i} is defined as ∞. If x_i = 0 for all i ∈ I_k and z_k = 0, then we define z_k |x_i/z_k|^{q_i} to be zero.
Also (Dlp) is a convex optimization problem. This follows by noting that

∑_{i∈I_k} (1/q_i) z_k^{1−q_i} |x_i|^{q_i}

is positive homogeneous and subadditive, hence convex. Moreover, the dual feasible region is a polyhedron – not necessarily closed.
As in the case of geometric optimization, the derivation of the above formulation of the dual problem goes beyond the scope of this course. We refer to Peterson and Ecker [35, 36, 37] and Terlaky [43].⁵

⁵ In the dualization process one needs the following inequality: let α, β ∈ IR, p, q > 1 and 1/p + 1/q = 1. Then

αβ ≤ (1/p)|α|^p + (1/q)|β|^q.

Equality holds if and only if α = sign(β)|β|^{q−1} or β = sign(α)|α|^{p−1}.
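This inequality and its equality condition are easy to spot-check numerically; in the sketch below the samples and the exponent p = 1.7 are arbitrary choices:

import numpy as np

p = 1.7
q = p / (p - 1.0)                          # conjugate exponent: 1/p + 1/q = 1
rng = np.random.default_rng(2)
for _ in range(1000):
    a, b = rng.normal(scale=3.0, size=2)
    assert a * b <= abs(a)**p / p + abs(b)**q / q + 1e-12
    b_eq = np.sign(a) * abs(a)**(p - 1.0)  # the stated equality case
    assert np.isclose(a * b_eq, abs(a)**p / p + abs(b_eq)**q / q)
print("the inequality and its equality condition check out")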
Exercise 7.5 Derive the dual lp-norm optimization problem (Dlp) from the primal problem (Plp) by using Lagrange duality. /
Duality results
Proposition 7.11 (Weak duality) If y is feasible for (Plp) and (x, z) for (Dlp) then

η^T y ≤ ψ(x, z),

with equality if and only if for each k = 1, · · · , r and i ∈ I_k, z_k G_k(y) = 0 and either z_k = 0 or

x_i/z_k = sign(a_i^T y − c_i) |a_i^T y − c_i|^{p_i−1},

or equivalently

a_i^T y − c_i = sign(x_i) |x_i/z_k|^{q_i−1}.
• 1. and 2. follow by using the Convex Farkas Lemma or the Karush–Kuhn–Tucker Theorem.

• The proof of 3. is more involved. It consists of the following steps.
  – Reduce the dual problem to (Dlp)r by erasing the variables x_i, i ∈ I_k, and z_k whenever z_k is zero at all dual feasible solutions.
  – Clearly (Dlp) and (Dlp)r are equivalent. By construction (Dlp)r is Slater regular.
  – Form the primal (Plp)r of (Dlp)r. Due to 2., (Plp)r has an optimal solution with optimal value equal to the optimal value of (Dlp)r.
  – It then remains to prove that the optimal values of (Plp) and (Plp)r are equal. Moreover (Plp) has an optimal solution. □
Self-concordant barrier for the primal problem
To get a self-concordant barrier function for the primal lp -norm optimization problem
we need to reformulate it as:
(Pl′p)   max η^T y
         s.t. ∑_{i∈I_k} (1/p_i) t_i + b_k^T y − d_k ≤ 0,   k = 1, · · · , r,
              s_i^{p_i} ≤ t_i,
              a_i^T y − c_i ≤ s_i,   i = 1, · · · , n,    (7.22)
              −a_i^T y + c_i ≤ s_i,
              s ≥ 0.
The logarithmic barrier function for this problem can be proved to be self-concordant.
Observe that in the transformed problem we have 4n + r constraints, compared with r
in the original problem (P lp ).
Lemma 7.13 The logarithmic barrier function for the reformulated lp-norm optimization problem (Pl′p) is (1 + (1/3) maxᵢ |p_i − 2|)-self-concordant.
∗
Proof: Since f(s_i) := s_i^{p_i}, p_i ≥ 1, satisfies (7.1) with β = |p_i − 2|, we have from Lemma 7.2 that

−log(t_i − s_i^{p_i}) − log s_i

is (1 + (1/3)|p_i − 2|)-self-concordant. Consequently, it follows from Lemma 7.1 that the logarithmic barrier function for the reformulated primal lp-norm optimization problem is (1 + (1/3) maxᵢ |p_i − 2|)-self-concordant. □
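The premise of this proof, that f(s) = s^p satisfies (7.1) with β = |p − 2|, is in fact an identity in the univariate case, where (7.1) reads |f'''(s)| ≤ β f''(s)/s; sympy confirms this symbolically:

import sympy as sp

s, p = sp.symbols('s p', positive=True)
f = s**p
# ratio f'''(s) / (f''(s)/s); for f = s^p this simplifies to p - 2,
# so |f'''(s)| = |p - 2| * f''(s)/s, i.e. (7.1) holds with beta = |p - 2|
ratio = sp.simplify(sp.diff(f, s, 3) / (sp.diff(f, s, 2) / s))
print(ratio)   # prints: p - 2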
Lemma 7.14 The logarithmic barrier function for the reformulated lp-norm optimization problem (Pl″p) is 5/3-self-concordant.
∗
Proof: Since f(t_i) := −t_i^{π_i}, π_i ≤ 1, satisfies (7.1) with β = |π_i − 2|, we have from Lemma 7.2 that

−log(t_i^{π_i} − s_i) − log t_i

is (1 + (1/3)|π_i − 2|)-self-concordant. Since 0 < π_i ≤ 1 implies |π_i − 2| ≤ 2, the corresponding logarithmic barrier function is 5/3-self-concordant. □
Note that the original problem (Dlp) has r inequalities, and the reformulated problem (Dl′p) has 4n + r. Now we prove the following lemma.

Lemma 7.15 The logarithmic barrier function of the reformulated dual lp-norm optimization problem (Dl′p) is (1 + (√2/3) maxᵢ (q_i + 1))-self-concordant.
∗
Proof: It suffices to show that the function f(s, z) := s^q z^{1−q}, with q := q_i > 1, satisfies

|∇³f(s, z)[h, h, h]| ≤ √2 (q + 1) (h^T ∇²f(s, z) h) √( h₁²/s² + h₂²/z² )    (7.25)

for all s, z > 0. Straightforward calculations yield for the second order term

h^T ∇²f(s, z) h = q(q − 1) s^{q−2} z^{−q−1} (z h₁ − s h₂)² ≥ 0,

and for the third order term

|∇³f(s, z)[h, h, h]| = q(q − 1) s^{q−3} z^{−q−2} |(q − 2) z³ h₁³ − (q + 1) s³ h₂³ − 3(q − 1) s z² h₁² h₂ + 3q s² z h₁ h₂²|
  = q(q − 1) s^{q−3} z^{−q−2} (z h₁ − s h₂)² |(q − 2) z h₁ − (q + 1) s h₂|
  ≤ q(q − 1)(q + 1) s^{q−3} z^{−q−2} (z h₁ − s h₂)² (z |h₁| + s |h₂|).

Now we obtain

|∇³f(s, z)[h, h, h]| / (h^T ∇²f(s, z) h) ≤ (q + 1) ( |h₁|/s + |h₂|/z ) ≤ √2 (q + 1) √( h₁²/s² + h₂²/z² ).

This proves (7.25) and hence the lemma. □
We can improve this result as follows: the constraints s_i^{q_i} z_k^{−q_i+1} ≤ t_i are replaced by the equivalent constraints t_i^{ρ_i} z_k^{−ρ_i+1} ≥ s_i, where ρ_i := 1/q_i, and the redundant constraints s ≥ 0 are replaced by t ≥ 0. The new reformulated dual lp-norm optimization problem becomes:

(Dl″p)   min c^T x + d^T z + ∑_{i=1}^n (1/q_i) t_i
         s.t. s_i ≤ t_i^{ρ_i} z_k^{−ρ_i+1},   i ∈ I_k, k = 1, · · · , r,
              x ≤ s,
              −x ≤ s,    (7.26)
              Ax + Bz = η,
              z ≥ 0,
              t ≥ 0.
Lemma 7.16 The logarithmic barrier function of the reformulated dual lp-norm optimization problem (Dl″p) is 2-self-concordant.

∗
Proof: Similarly to the proof of Lemma 7.15, it can be proved that

−log( t_i^{ρ_i} z_k^{−ρ_i+1} − s_i ) − log t_i,

with ρ_i ≤ 1, is (1 + (√2/3)(ρ_i + 1))-self-concordant. The lemma now follows from Lemma 7.1 and from ρ_i ≤ 1. □
Let A₀, A₁, · · · , Aₙ ∈ IR^{m×m} be symmetric matrices. Further, let c ∈ IRⁿ be a given vector and x ∈ IRⁿ the vector of unknowns in which the optimization is done. The primal semidefinite optimization problem is defined as

(PSO)   min c^T x
        s.t. −A₀ + ∑_{k=1}^n A_k x_k ⪰ 0,

where ⪰ 0 indicates that the left-hand-side matrix has to be positive semidefinite. Clearly the primal problem (PSO) is a convex optimization problem, since any convex combination of positive semidefinite matrices is also positive semidefinite. For convenience the notation

F(x) = −A₀ + ∑_{k=1}^n A_k x_k

will be used.
The dual problem of semidefinite optimization is given as follows:

(DSO)   max Tr(A₀ Z)
        s.t. Tr(A_k Z) = c_k,   k = 1, · · · , n,
             Z ⪰ 0,

where Z ∈ IR^{m×m} is the matrix of variables. Again, the problem (DSO) is a convex optimization problem, since the trace of a matrix is a linear function of the matrix and a convex combination of positive semidefinite matrices is positive semidefinite.
As we have seen the weak duality relation between (P SO) and (DSO) holds (see
page 69).
Because the semidefinite optimization problem is nonlinear, strong duality holds only if a certain regularity assumption, e.g. the Slater regularity assumption, holds.

Exercise 7.6 Prove that (PSO) is Slater regular if and only if there is an x ∈ IRⁿ such that F(x) is positive definite. /

Exercise 7.7 Prove that (DSO) is Slater regular if and only if there is an m×m symmetric positive definite matrix Z such that Tr(A_k Z) = c_k for all k = 1, · · · , n. /
The above exercises show that the Slater regularity condition coincides with the in-
terior point assumption in the case of semidefinite optimization.
We do not go into a detailed discussion of applications and solvability of semidefinite optimization problems. For applications the interested reader is referred to [44]. Finally, we note that under the interior point assumption the function

c^T x − µ log(det F(x))

is a self-concordant barrier for the problem (PSO). This result shows that semidefinite optimization problems are efficiently solvable by interior point methods [34].
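A minimal numerical sketch of this barrier (the small matrices, µ, and the point x below are arbitrary illustrative choices) shows how it can be evaluated via a Cholesky factorization, which exists exactly when F(x) is positive definite:

import numpy as np

A0 = np.array([[0.5, 0.0], [0.0, 0.5]])
A1 = np.array([[1.0, 0.0], [0.0, 0.0]])
A2 = np.array([[0.0, 0.3], [0.3, 1.0]])
c = np.array([1.0, 2.0])
mu = 0.1

def F(x):
    return -A0 + x[0] * A1 + x[1] * A2

def barrier(x):
    # c^T x - mu * log det F(x); the Cholesky factorization raises an error
    # iff F(x) is not positive definite, i.e. iff x is outside the interior
    L = np.linalg.cholesky(F(x))
    return c @ x - mu * 2.0 * np.sum(np.log(np.diag(L)))

x = np.array([1.0, 1.0])     # here F(x) is positive definite
print(barrier(x))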
Appendix A
∗Appendix
(a^T M b)² ≤ (a^T A a)(b^T A b),   ∀ a, b ∈ IRⁿ.

When replacing a by a/µ and b by µb this implies

(a^T M b)² = ( (a/µ)^T M (µb) )² ≤ (1/4) ( (1/µ²) a^T A a + µ² b^T A b )² = (a^T A a)(b^T A b),

where the last equality holds for the choice µ⁴ = (a^T A a)/(b^T A b).
Then

|M[x, y, z]| ≤ µ ‖x‖_A ‖y‖_A ‖z‖_A,   ∀ x, y, z ∈ IRⁿ.

By the substitution

M[x, y, z] := M[A^{−1/2} x, A^{−1/2} y, A^{−1/2} z]

we can further assume that A = I is the identity matrix, and we need to show that |M[x, y, y]| ≤ µ whenever ‖x‖₂ = ‖y‖₂ = 1, because the remaining part follows by applying Lemma A.1, with M = M_x, for fixed x.
Define
σ := max { M[x, y, y] : ‖x‖₂ = ‖y‖₂ = 1 }
and let x and y represent a solution of this maximization problem. The necessary
optimality conditions for x and y imply that
( M_y y, 2 M_y x ) = α (2x, 0) + β (0, 2y),

where M_y denotes the matrix M[ · , · , y].
(i) g_j(x) ≤ 0,   ∀ j = 1, · · · , m,

(ii) ∑_{j=1}^m y_j ∇g_j(x) = c,   y ≥ 0,    (A.1)
Proof: The implications (iii) → (ii) and (ii) → (i) are evident. We only need to prove that (i) implies (iii). Let us assume that the IPC holds. Let

P⁰ := { x | g_j(x) < 0, ∀ j }   and   Y := { y ∈ IRᵐ | y > 0 }.

Observe that, due to the IPC, the set P⁰ is not empty and the solutions of (A.1) are exactly the saddle points of the function

F(x, y) := −c^T x + ∑_{j=1}^m y_j g_j(x) + ∑_{j=1}^m w_j log y_j
on the set P⁰ × Y, because the function F(x, y) is convex in x and concave in y. Thus we only need to prove that F has a saddle point on P⁰ × Y.

Observe that for any fixed x the function F(x, y) attains its maximum over Y at the point

y_j = w_j / (−g_j(x)),   j = 1, · · · , m,

and this maximum, up to an additive constant, equals

F(x) := −c^T x − ∑_{j=1}^m w_j log(−g_j(x)).
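Indeed, F(x, ·) is concave on Y, and setting its partial derivatives to zero gives the stated maximizer in one line:

∂F(x, y)/∂y_j = g_j(x) + w_j/y_j = 0   ⟺   y_j = w_j / (−g_j(x)),   j = 1, · · · , m.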
Let us assume to the contrary that there exists a minimizing sequence {xⁱ ∈ P⁰}, i = 1, 2, · · ·, with ‖xⁱ‖ → ∞. Then the sequence x̃ⁱ := xⁱ/‖xⁱ‖ is bounded, so we can choose a convergent subsequence of the bounded sequence x̃ⁱ (for simplicity denoted again the same way) with limit point

s := lim_{i→∞} x̃ⁱ.
We claim that

c^T s ≥ 0.    (A.2)

This is indeed true: otherwise we would have c^T xⁱ ≤ −α‖xⁱ‖ for some α > 0 and all i large enough, and then F(xⁱ) would go to infinity, as we now show. The convexity of the functions g_j implies that there exist β, γ > 0 such that g_j(x) ≥ −β − γ‖x‖ for all x (see Exercise A.1 below). Then it follows that, for all i large enough,

F(xⁱ) = −c^T xⁱ − ∑_{j=1}^m w_j log(−g_j(xⁱ)) ≥ α‖xⁱ‖ − ∑_{j=1}^m w_j log(β + γ‖xⁱ‖) → ∞,

contradicting the fact that {xⁱ} is a minimizing sequence.
Now let (x, y) be an interior solution of (CDO). Then for all j we have
where the last inequality follows from (A.3). Now we have cT s = 0 and, due to y > 0
and (A.3) for each j the equality
Thus for each j the functions gj are bounded from above on the ray R, which together
with (A.4) proves that gj is constant along R. The proof is complete. 2
Exercise A.1 Let g : IRⁿ → IR be a convex function. Prove that there exist β, γ > 0 such that

g(x) ≥ −β − γ‖x‖   ∀ x ∈ IRⁿ.

/
Remark: The implication (iii) → (i) may be false if a bad ray exists. Let us consider the following example.

Example A.4 Let the convex set Q := {(x₁, x₂) ∈ IR² | x₁ ≥ x₂²} be given and let π(x) be the so-called Minkowski function of the set Q with the pole x̄ = (1, 0), i.e.

π(x) := min { t | x̄ + (1/t)(x − x̄) ∈ Q }.

One easily checks that π(x) is a nonnegative, convex function. Further, π(x) = 1 if x is a boundary point of the set Q, and π(x) = 0 on the ray R := {x | x = x̄ + λ(1, 0), λ ≥ 0}.

Setting m := 1, c^T x := x₂ and g₁(x) := π²(x) + x₂, we get a (CPO) problem which satisfies the IPC; e.g. the point x = (1, −0.5) is strictly feasible. Likewise (CDO) satisfies the IPC; e.g. we may take x̃ = (1, 0) and ỹ = 1.
However, the system (A.1) never has a solution. Indeed, as we already know, saying that (A.1) has a solution is the same as saying that the function F defined in the proof of Theorem A.3 attains its minimum in the set P⁰ = {x | g₁(x) < 0}. The latter is clearly not the case, since given any x̃ (e.g. we may take x̃ = (1, −0.5)) with g₁(x̃) < 0 we can always find a better value of F: one can take the points on the ray R′ := {x | x = x̃ + λ(1, 0)}, since π(x̃ + λ(1, 0)) → 0 as λ → ∞. ∗
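For this example the Minkowski function can be computed in closed form: with x̄ = (1, 0), the condition x̄ + (x − x̄)/t ∈ Q reduces to t² + (x₁ − 1)t − x₂² ≥ 0, so π(x) is the positive root of this quadratic. The sketch below (a hand-derived formula, not from the text) illustrates the boundary value and the behaviour along the ray:

import numpy as np

def pi(x1, x2):
    # positive root of t^2 + (x1 - 1) t - x2^2 = 0
    return (-(x1 - 1.0) + np.sqrt((x1 - 1.0)**2 + 4.0 * x2**2)) / 2.0

print(pi(4.0, 2.0))   # boundary point of Q (4 = 2^2): prints 1.0
for lam in (0.0, 10.0, 1000.0):
    # along x = (1, -0.5) + lam*(1, 0): pi -> 0, so g1 = pi^2 + x2 stays
    # negative while F keeps decreasing, as claimed above
    print(lam, pi(1.0 + lam, -0.5))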
Lemma A.5 Let us assume that for (CPO) the IPC holds and no bad line segment exists. Then the solutions of the systems (6.2) and (6.3), if they exist, are unique.

Proof: In the proof of Theorem A.3 we have seen that a point (x, y) solves (A.1) if and only if it is a saddle point of F(x, y) on the set P⁰ × Y. Clearly, for fixed x, the function F is a strictly concave function of y, thus to prove that a saddle point is unique we only have to prove that the function

F(x) := −c^T x − ∑_{j=1}^m w_j log(−g_j(x)) = max_{y∈Y} F(x, y) + a constant

cannot attain its minimum at two different points. Assume to the contrary that two distinct minimum points x′, x″ ∈ P⁰ exist. Due to the convexity of the function F, we have that F(x) is constant on the line segment [x′, x″]. This implies that both the first and second order directional derivatives of F(x) are zero on this line segment. This can only happen if the same is true for all the functions g_j(x) separately, hence all the functions g_j(x) are constant on the line segment [x′, x″], i.e. this line segment is bad. We have obtained a contradiction, and thus the lemma is proved. □
Bibliography
[1] Anstreicher, K.M. (1990), A Standard Form Variant, and Safeguarded Linesearch, for
the Modified Karmarkar Algorithm, Mathematical Programming 47, 337–351.
[2] M.S. Bazaraa, H.D. Sherali and C.M. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley and Sons, New York (1993).
[5] R. W. Cottle and S-M. Guu. Two Characterizations of Sufficient Matrices. Linear
Algebra and Its Applications, ??–??, 199?.
[6] R. W. Cottle, J.-S. Pang and V. Venkateswaran. Sufficient Matrices and the Linear
Complementarity Problem. Linear Algebra and Its Applications, 114/115:231-249, 1989.
[7] R. W. Cottle, J.-S. Pang and R. E. Stone. The Linear Complementarity Problem.
Academic, Boston, 1992.
[8] Duffin, R.J., Peterson, E.L. and Zener, C. (1967), Geometric Programming, John Wiley
& Sons, New York.
[9] P.E. Gill, W. Murray and M.H. Wright, Practical Optimization, Academic Press, London (1981).
[10] P.E. Gill, W. Murray and M.H. Wright, Numerical Linear Algebra and Optimization, Vol. 1, Addison-Wesley, New York (1991).
[11] R.A. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, UK (1985).
[12] Han, C.–G., Pardalos, P.M. and Ye, Y. (1991), On Interior–Point Algorithms for Some
Entropy Optimization Problems, Working Paper, Computer Science Department, The
Pennsylvania State University, University Park, Pennsylvania.
[13] D. den Hertog, (1994), Interior Point Approach to Linear, Quadratic and Convex Pro-
gramming: Algorithms and Complexity, Kluwer A.P.C., Dordrecht.
[14] D. den Hertog, C. Roos and T. Terlaky (1993), The Linear Complementarity Problem, Sufficient Matrices and the Criss–Cross Method, Linear Algebra and Its Applications 187, 1–14.
[15] Hertog, D. den, Jarre, F., Roos, C. and Terlaky, T. (1995), A Sufficient Condition for
Self-Concordance with Application to Some Classes of Structured Convex Programming
Problems, Mathematical Programming 69, 75–88.
[16] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms I and II, Springer Verlag, Berlin, Heidelberg (1993).
[17] R. Horst, P.M. Pardalos and N.V. Thoai, (1995) Introduction to Global Optimization,
Kluwer A.P.C. Dordrecht.
[18] P. Kas and E. Klafszky (1993), On the Duality of the Mixed Entropy Programming, Optimization 27, 253–258.
[19] E. Klafszky and T. Terlaky (1992), Some Generalizations of the Criss–Cross Method for Quadratic Programming, Mathematische Operationsforschung und Statistik, Series Optimization 24, 127–139.
[20] Jarre, F. (1990), Interior-point Methods for Classes of Convex Programs. Technical
Report SOL 90–16, Systems Optimization Laboratory, Department of Operations Re-
search, Stanford University, Stanford, California.
[21] Jarre, F. (1996), Interior-point Methods for Convex Programming. In T. Terlaky (ed.),
Interior-point Methods for Mathematical Programming, Kluwer A.P.C., Dordrecht, pp.
255–296.
[22] Jarre, F. (1994), Interior-point Methods via Self-concordance or Relative Lipschitz Con-
dition. Habilitationsschrift, Würzburg, Germany.
[23] Karmarkar, N.K. (1984), A New Polynomial–Time Algorithm for Linear Programming,
Combinatorica 4, 373–395.
[24] Klafszky, E. (1976), Geometric Programming, Seminar Notes 11.1976, Hungarian Com-
mittee for Systems Analysis, Budapest.
[25] Kortanek, K.O. and No, H. (1992), A Second Order Affine Scaling Algorithm for the
Geometric Programming Dual with Logarithmic Barrier, Optimization 23, 501–507.
[26] D.C. Lay, Linear Algebra and Its Applications, Addison-Wesley (1994).
[27] F. Lootsma, Algorithms for Unconstrained Optimization, Dictaat Nr. a85A and WI-385, Fac. TWI, TU Delft.
[28] F. Lootsma, Duality in Non-Linear Programming, Dictaat Nr. a85D, Fac. TWI, TU Delft.
[29] F. Lootsma, Algorithms for Constrained Optimization, Dictaat Nr. a85B, Fac. TWI, TU Delft.
[30] H.M. Markowitz, (1956), The Optimization of a Quadratic Function Subject to Linear
Constraints, Naval Research Logistics Quarterly 3, 111-133.
[31] H.M. Markowitz, (1959), Portfolio Selection, Efficient Diversification of Investments,
Cowles Foundation for Research in Economics at Yale University, Monograph 16, John
Wiley & Sons, New York.
[32] Nemirovsky, A.S. (1999), Convex Optimization in Engineering, Dictaat Nr. WI-485,
Fac. ITS/TWI, TU Delft.
[33] Nesterov, Y.E. and Nemirovsky, A.S. (1989), Self–Concordant Functions and Polyno-
mial Time Methods in Convex Programming, Report, Central Economical and Mathe-
matical Institute, USSR Academy of Science, Moscow, USSR.
[34] Nesterov, Y.E. and Nemirovsky, A.S. (1994), Interior-Point Polynomial Algorithms in
Convex Programming, SIAM, Philadelphia.
[35] Peterson, E.L., and Ecker, J.G. (1970), Geometric Programming: Duality in Quadratic
Programming and lp Approximation I, in: H.W. Kuhn and A.W. Tucker (eds.), Pro-
ceedings of the International Symposium of Mathematical Programming, Princeton Uni-
versity Press, New Jersey.
[36] Peterson, E.L., and Ecker, J.G. (1967), Geometric Programming: Duality in Quadratic
Programming and lp Approximation II, SIAM Journal on Applied Mathematics 13,
317–340.
[37] Peterson, E.L., and Ecker, J.G. (1970), Geometric Programming: Duality in Quadratic
Programming and lp Approximation III, Journal of Mathematical Analysis and Appli-
cations 29, 365–383.
[38] R.T. Rockafellar, Convex Analysis, Princeton, New Jersey, Princeton University Press
(1970).
[39] C. Roos and T. Terlaky, Introduction to Linear Optimization Dictaat WI187, Fac.
ITS/TWI, TU Delft (1997).
[40] C. Roos, T. Terlaky and J.-Ph. Vial (1997), Theory and Algorithms for Linear Opti-
mization: An Interior Point Approach. John Wiley and Sons.
[41] S.S. Sapatnekar (1992), A Convex Programming Approach to Problems in VLSI Design, Ph.D. Thesis, University of Illinois at Urbana-Champaign, 70–100.
[42] J. Stoer and C. Witzgall, Convexity and Optimization in Finite Dimensions I, Springer
Verlag, Berlin, Heidelberg (1970).
[43] Terlaky, T. (1985), On lp programming, European Journal of Operational Research 22,
70–100.
[44] Vandenberghe, L. and Boyd, S. (1994) Semidefinite Programming, Report December
7, 1994, Information Systems Laboratory, Electrical Engineering Department, Stanford
University, Stanford CA 94305.
[45] Ye, Y. and Potra, F. (1990), An Interior–Point Algorithm for Solving Entropy Op-
timization Problems with Globally Linear and Locally Quadratic Convergence Rate,
Working Paper Series No. 90–22, Department of Management Sciences, The University
of Iowa, Iowa City, Iowa. To Appear in SIAM Journal on Optimization.
Index
linear convergence, 76, 82
linear optimization, 3, 61
local minimum, 31
logarithmic barrier algorithm with damped Newton steps, 138
logarithmic barrier algorithm with full Newton steps, 132
logarithmic barrier approach, 117
monotonic function, 26
Newton’s method, 84, 130
non-degeneracy, 101
nonnegative orthant, 20
objective function, 1
optimal solution, 1
optimality conditions, 31
pointed cone, 18
portfolio optimization, 9
proximity measure, 124
recession cone, 20
reduced gradient, 102, 109
reduced gradient method, 101
relative boundary, 25
relative interior, 22
saddle point, 50
secant condition, 93
self-concordance, 81, 126
semidefinite optimization, 68, 177
separation theorem, 44
simplex method, 105
Slater condition, 41
spectral radius, 182
standard simplex, 16, 22
stationary point, 34
Steiner’s problem, 7, 35
stopping criteria, 98
strict convexity, 13
strict global minimum, 31
strict local minimum, 31
strong duality theorem, 56
super-linear convergence, 76
superbasic variables, 105
Tartaglia’s problem, 6, 36
Taylor approximation, 84
Taylor expansion, 35, 126
theorem of alternatives, 44
Torricelli point, 35
trilinear form, 182
trust region method, 85
unboundedness, 1
unconstrained optimization, 3
weak duality theorem, 56
Wolfe dual, 57