
Chapter 4

Convex Optimization

4.1 Introduction
4.1.1 Mathematical Optimization
The problem of mathematical optimization is to minimize a non-linear cost function f0(x) subject to inequality constraints fi(x) ≤ 0, i = 1, . . . , m and equality constraints hi(x) = 0, i = 1, . . . , p, where x = (x1, . . . , xn) is the vector of variables involved in the optimization problem. The general framework of a non-linear optimization problem is outlined in (4.1).

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m    (4.1)
hi(x) = 0, i = 1, . . . , p
variable x = (x1, . . . , xn)
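To make this framework concrete, the following is a minimal numerical sketch using scipy.optimize; the particular objective, constraints and starting point are invented for illustration. Note that scipy's 'ineq' convention requires g(x) ≥ 0, so each constraint fi(x) ≤ 0 is passed as −fi(x) ≥ 0.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical instance of (4.1): minimize (x1-1)^2 + (x2-2)^2
    # subject to f1(x) = x1 + x2 - 2 <= 0 and h1(x) = x1 - x2 = 0.
    f0 = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2
    cons = [
        {'type': 'ineq', 'fun': lambda x: -(x[0] + x[1] - 2)},  # encodes f1(x) <= 0
        {'type': 'eq',   'fun': lambda x: x[0] - x[1]},         # encodes h1(x) = 0
    ]
    res = minimize(f0, x0=np.zeros(2), method='SLSQP', constraints=cons)
    print(res.x)  # approximately [1, 1]

A solver like this returns only a point satisfying local optimality conditions; as discussed next, nothing in general guarantees such a point is globally optimal.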

Optimization is obviously very useful and arises throughout engineering, statistics, estimation and numerical analysis. In fact, there is the tautology that 'everything is an optimization problem', though the tautology does not convey anything useful. The most important thing to note first is that the optimization problem is extremely hard in general. The solution method depends very much on the properties of the objective function, as well as the properties of the functions involved in the inequality and equality constraints. There are no good methods for solving the general non-linear optimization problem. In practice, you have to make some compromises, which usually translates to finding locally optimal solutions efficiently. But then you get only suboptimal solutions, unless you are willing to do global optimization, which is for most applications too expensive. There are important exceptions for which the situation is much better; in some cases the global optimum can be found efficiently and reliably. The three best known exceptions are


1. least-squares
2. linear programming
3. convex optimization problems - more or less the most general class of
problems that can be solved efficiently.
Least squares and linear programming have been around for quite some time and are very special types of convex optimization problems. Convex programming was not much appreciated until the last 15 years; it has drawn attention more recently. In fact, many combinatorial optimization problems have been identified to be convex optimization problems. There are also some exceptions besides convex optimization problems, such as the singular value decomposition (which corresponds to the problem of finding the best rank-k approximation to a matrix, under the Frobenius norm), which have an exact global solution.
We will first introduce some general optimization principles. We will subsequently motivate the specific class of optimization problems called convex optimization problems and define convex sets and functions. Next, the theory of Lagrange multipliers will be motivated and duality theory will be introduced. As two specific and well-studied examples of convex optimization, techniques for least squares and linear programming will be discussed to contrast them against generic convex optimization. Finally, we will dive into techniques for solving general convex optimization problems.

4.1.2 Some Topological Concepts in ℜ^n

The definitions of some basic topological concepts in ℜ^n could be helpful in the discussions that follow.
Definition 12 [Balls in ℜ^n]: Consider a point x ∈ ℜ^n. Then the closed ball around x of radius ε is defined as

B[x, ε] = {y ∈ ℜ^n : ||y − x|| ≤ ε}

Likewise, the open ball around x of radius ε is defined as

B(x, ε) = {y ∈ ℜ^n : ||y − x|| < ε}

For the 1-D case, open and closed balls degenerate to open and closed intervals respectively.

Definition 13 [Boundedness in ℜ^n]: We say that a set S ⊂ ℜ^n is bounded when there exists an ε > 0 such that S ⊆ B[0, ε].

In other words, a set S ⊆ ℜ^n is bounded means that there exists a number ε > 0 such that for all x ∈ S, ||x|| ≤ ε.

Definition 14 [Interior and Boundary points]: A point x is called an interior point of a set S if there exists an ε > 0 such that B(x, ε) ⊆ S.

In other words, a point x ∈ S is called an interior point of a set S if there exists an open ball of non-zero radius around x such that the ball is completely contained within S.

Definition 15 [Interior of a set]: Let S ⊆ ℜ^n. The set of all points lying in the interior of S is denoted by int(S) and is called the interior of S. That is,

int(S) = {x | ∃ε > 0 s.t. B(x, ε) ⊂ S}

In the 1-D case, the open interval obtained by excluding the endpoints from an interval I is the interior of I, denoted by int(I). For example, int([a, b]) = (a, b) and int([0, ∞)) = (0, ∞).

Definition 16 [Boundary of a set]: Let S ⊆ ℜ^n. The boundary of S, denoted by bnd(S), is defined as

bnd(S) = {y | ∀ε > 0, B(y, ε) ∩ S ≠ ∅ and B(y, ε) ∩ S^C ≠ ∅}

For example, bnd([a, b]) = {a, b}.

Definition 17 [Open Set]: Let S ⊆ ℜ^n. We say that S is an open set when, for every x ∈ S, there exists an ε > 0 such that B(x, ε) ⊂ S.

The simplest examples of an open set are the open ball, the empty set ∅ and ℜ^n. Further, an arbitrary union of open sets is open, and a finite intersection of open sets is open. The interior of any set is always open. It can be proved that a set S is open if and only if int(S) = S.
A closed set is defined as the complement of an open set.

Definition 18 [Closed Set]: Let S ⊆ ℜ^n. We say that S is a closed set when S^C (that is, the complement of S) is an open set.

The closed ball, the empty set ∅ and ℜ^n are three simple examples of closed sets. An arbitrary intersection of closed sets is closed. Furthermore, a finite union of closed sets is closed.

Definition 19 [Closure of a Set]: Let S ⊆ ℜ^n. The closure of S, denoted by closure(S), is given by

closure(S) = {y ∈ ℜ^n | ∀ε > 0, B(y, ε) ∩ S ≠ ∅}

Loosely speaking, the closure of a set is the smallest closed set containing the set. The closure of a closed set is the set itself. In fact, a set S is closed if and only if closure(S) = S. Boundedness can also be characterized using closed balls; a set S is bounded if and only if it is contained inside a closed ball of finite radius. A relationship between the interior, boundary and closure of a set S is closure(S) = int(S) ∪ bnd(S).

4.1.3 Optimization Principles for Univariate Functions


Maximum and Minimum values of univariate functions
Let f be a function with domain D. Then f has an absolute maximum (or global
maximum) value at point c ∈ D if

f (x) ≤ f (c), ∀x ∈ D

and an absolute minimum (or global minimum) value at c ∈ D if

f (x) ≥ f (c), ∀x ∈ D
If there is an open interval I containing c in which f(c) ≥ f(x), ∀x ∈ I, then we say that f(c) is a local maximum value of f. On the other hand, if there is an open interval I containing c in which f(c) ≤ f(x), ∀x ∈ I, then we say that f(c) is a local minimum value of f. If f(c) is either a local maximum or a local minimum value of f in an open interval I with c ∈ I, then f(c) is called a local extreme value of f.
The following theorem gives us the first derivative test for local extreme
value of f , when f is differentiable at the extremum.

Theorem 39 If f(c) is a local extreme value and if f is differentiable at x = c, then f'(c) = 0.

Proof: Suppose f(c) ≥ f(x) for all x in an open interval I containing c and that f'(c) exists. Then the difference quotient (f(c + h) − f(c))/h ≤ 0 for small h > 0 (so that c + h ∈ I). This inequality remains true as h → 0 from the right. In the limit, f'(c) ≤ 0. Also, the difference quotient (f(c + h) − f(c))/h ≥ 0 for small h < 0 (so that c + h ∈ I). This inequality remains true as h → 0 from the left. In the limit, f'(c) ≥ 0. Since f'(c) ≤ 0 as well as f'(c) ≥ 0, we must have f'(c) = 0, by virtue of the squeeze or sandwich theorem. ∎
The extreme value theorem is one of the most fundamental theorems in cal-
culus concerning continuous functions on closed intervals. It can be stated as:

Theorem 40 A continuous function f(x) on a closed and bounded interval [a, b] attains a minimum value f(c) for some c ∈ [a, b] and a maximum value f(d) for some d ∈ [a, b]. That is, a continuous function on a closed, bounded interval attains a minimum and a maximum value.

We must point out that either or both of the values c and d may be attained at the end points of the interval [a, b]. Based on theorem (39), the extreme value theorem can be extended as:

Theorem 41 A continuous function f(x) on a closed and bounded interval [a, b] attains a minimum value f(c) for some c ∈ [a, b] and a maximum value f(d) for some d ∈ [a, b]. If a < c < b and f'(c) exists, then f'(c) = 0. If a < d < b and f'(d) exists, then f'(d) = 0.

Figure 4.1: Illustration of Rolle's theorem with f(x) = 9 − x^2 on the interval [−3, +3]. We see that f'(0) = 0.

Next, we state Rolle's theorem.

Theorem 42 If f is continuous on [a, b] and differentiable at all x ∈ (a, b) and if f(a) = f(b), then f'(c) = 0 for some c ∈ (a, b).

Figure 4.1 illustrates Rolle's theorem with the example function f(x) = 9 − x^2 on the interval [−3, +3].
The mean value theorem is a generalization of Rolle's theorem, though we will use Rolle's theorem to prove it.

Theorem 43 If f is continuous on [a, b] and differentiable at all x ∈ (a, b), then there is some c ∈ (a, b) such that f'(c) = (f(b) − f(a))/(b − a).

Proof: Define g(x) = f(x) − ((f(b) − f(a))/(b − a))(x − a) on [a, b]. We note right away that g(a) = g(b) and g'(x) = f'(x) − (f(b) − f(a))/(b − a). Applying Rolle's theorem to g(x), we know that there exists c ∈ (a, b) such that g'(c) = 0, which implies that f'(c) = (f(b) − f(a))/(b − a). ∎
Figure 4.2 illustrates the mean value theorem for f(x) = 9 − x^2 on the interval [−3, 1]. We observe that the tangent at x = −1 is parallel to the secant joining −3 to 1. One could think of the mean value theorem as a slanted version of Rolle's theorem. A natural corollary of the mean value theorem is as follows:

Corollary 44 Let f be continuous on [a, b] and differentiable on (a, b) with m ≤ f'(x) ≤ M, ∀x ∈ (a, b). Then m(x − t) ≤ f(x) − f(t) ≤ M(x − t), if a ≤ t ≤ x ≤ b.
Let D be the domain of function f . We define

1. the linear approximation of a differentiable function f(x) as L_a(x) = f(a) + f'(a)(x − a) for some a ∈ D. We note that L_a(x) and its first derivative at a agree with f(a) and f'(a) respectively.
Figure 4.2: Illustration of the mean value theorem with f(x) = 9 − x^2 on the interval [−3, 1]. We see that f'(−1) = (f(1) − f(−3))/4.

Figure 4.3: Plot of f(x) = 1/x, and its linear, quadratic and cubic approximations.

2. the quadratic approximation of a twice differentiable function f(x) as the parabola Q_a(x) = f(a) + f'(a)(x − a) + (1/2)f''(a)(x − a)^2. We note that Q_a(x) and its first and second derivatives at a agree with f(a), f'(a) and f''(a) respectively.

3. the cubic approximation of a thrice differentiable function f(x) as C_a(x) = f(a) + f'(a)(x − a) + (1/2)f''(a)(x − a)^2 + (1/6)f'''(a)(x − a)^3. C_a(x) and its first, second and third derivatives at a agree with f(a), f'(a), f''(a) and f'''(a) respectively.

The coefficient of x^2 in Q_a(x) is (1/2)f''(a). The parabola given by Q_a(x) is strictly convex if f''(a) > 0 and strictly concave if f''(a) < 0 (strict convexity for functions of a single variable will be defined shortly). Figure 4.3 illustrates the linear, quadratic and cubic approximations to the function f(x) = 1/x with a = 1.
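These approximations are easy to generate symbolically. The following sketch uses sympy; the helper taylor is our own and simply assembles the nth degree polynomial approximation about x = a.

    import sympy as sp

    x = sp.symbols('x')
    f = 1 / x
    a = 1

    def taylor(f, a, n):
        # n-th degree polynomial approximation of f about x = a
        return sum(sp.diff(f, x, k).subs(x, a) / sp.factorial(k) * (x - a)**k
                   for k in range(n + 1))

    print(sp.expand(taylor(f, a, 1)))  # 2 - x
    print(sp.expand(taylor(f, a, 2)))  # x**2 - 3*x + 3
    print(sp.expand(taylor(f, a, 3)))  # -x**3 + 4*x**2 - 6*x + 4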

In general, an nth degree polynomial approximation of a function can be found. Such an approximation will be used to prove a generalization of the mean value theorem, called Taylor's theorem.

Theorem 45 Taylor's theorem states that if f and its first n derivatives f', f'', . . . , f^(n) are continuous on the closed interval [a, b], and f^(n) is differentiable on (a, b), then there exists a number c ∈ (a, b) such that

f(b) = f(a) + f'(a)(b − a) + (1/2!)f''(a)(b − a)^2 + . . . + (1/n!)f^(n)(a)(b − a)^n + (1/(n + 1)!)f^(n+1)(c)(b − a)^(n+1)
Proof: Define

p_n(x) = f(a) + f'(a)(x − a) + (1/2!)f''(a)(x − a)^2 + . . . + (1/n!)f^(n)(a)(x − a)^n

and

φ_n(x) = p_n(x) + Γ(x − a)^(n+1)

The polynomial p_n(x), as well as φ_n(x), and their first n derivatives match f and its first n derivatives at x = a. We will choose a value of Γ so that

f(b) = p_n(b) + Γ(b − a)^(n+1)

This requires that Γ = (f(b) − p_n(b))/(b − a)^(n+1). Define the function g(x) = f(x) − φ_n(x) that measures the difference between the function f and the approximating function φ_n(x) for each x ∈ [a, b].

• Since g(a) = g(b) = 0 and since g and g' are both continuous on [a, b], we can apply Rolle's theorem to conclude that there exists c1 ∈ (a, b) such that g'(c1) = 0.

• Similarly, since g'(a) = g'(c1) = 0, and since g' and g'' are continuous on [a, c1], we can apply Rolle's theorem to conclude that there exists c2 ∈ (a, c1) such that g''(c2) = 0.

• In this way, Rolle's theorem can be applied successively to g'', g''', . . . , g^(n) to imply the existence of ci ∈ (a, c_{i−1}) such that g^(i)(ci) = 0 for i = 3, 4, . . . , n + 1. Note however that g^(n+1)(x) = f^(n+1)(x) − 0 − (n + 1)!Γ, which gives us another representation of Γ as f^(n+1)(c_{n+1})/(n + 1)!.

Thus,

f(b) = f(a) + f'(a)(b − a) + (1/2!)f''(a)(b − a)^2 + . . . + (1/n!)f^(n)(a)(b − a)^n + (f^(n+1)(c_{n+1})/(n + 1)!)(b − a)^(n+1) ∎

Figure 4.4: The mean value theorem can be violated if f(x) is not differentiable at even a single point of the interval. Illustration with f(x) = x^(2/3) on the interval [−3, 3].

Note that if f fails to be differentiable at even one number in the interval, then the conclusion of the mean value theorem may be false. For example, if f(x) = x^(2/3), then f'(x) = 2/(3 ∛x) and the theorem does not hold on the interval [−3, 3], since f is not differentiable at 0, as can be seen in Figure 4.4.


We will introduce some definitions at this point:

• A function f is said to be increasing on an interval I in its domain D if f(t) < f(x) whenever t, x ∈ I and t < x.

• The function f is said to be decreasing on an interval I ⊆ D if f(t) > f(x) whenever t, x ∈ I and t < x.

These definitions help us derive the following theorem:

Theorem 46 Let I be an interval and suppose f is continuous on I and differentiable on int(I). Then:

1. if f'(x) > 0 for all x ∈ int(I), then f is increasing on I;

2. if f'(x) < 0 for all x ∈ int(I), then f is decreasing on I;

3. f'(x) = 0 for all x ∈ int(I) if and only if f is constant on I.

Proof: Let t ∈ I and x ∈ I with t < x. By virtue of the mean value theorem, ∃c ∈ (t, x) such that f'(c) = (f(x) − f(t))/(x − t).

• If f ′ (x) > 0 for all x ∈ int(I), f ′ (c) > 0, which implies that f (x)−f (t) > 0
and we can conclude that f is increasing on I.

• If f ′ (x) < 0 for all x ∈ int(I), f ′ (c) < 0, which implies that f (x)−f (t) < 0
and we can conclude that f is decreasing on I.

Figure 4.5: Illustration of the increasing and decreasing regions of the function f(x) = 3x^4 + 4x^3 − 36x^2.

• If f'(x) = 0 for all x ∈ int(I), f'(c) = 0, which implies that f(x) − f(t) = 0, and since x and t are arbitrary, we can conclude that f is constant on I. ∎
Figure 4.5 illustrates the intervals in (−∞, ∞) on which the function f(x) = 3x^4 + 4x^3 − 36x^2 is decreasing and increasing. First we note that f(x) is differentiable everywhere on (−∞, ∞) and compute f'(x) = 12(x^3 + x^2 − 6x) = 12(x − 2)(x + 3)x, which is negative in the intervals (−∞, −3] and [0, 2] and positive in the intervals [−3, 0] and [2, ∞). We observe that f is decreasing in the intervals (−∞, −3] and [0, 2], while it is increasing in the intervals [−3, 0] and [2, ∞).
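This sign analysis is mechanical enough to check symbolically; a small sympy sketch:

    import sympy as sp

    x = sp.symbols('x', real=True)
    f = 3*x**4 + 4*x**3 - 36*x**2
    df = sp.factor(sp.diff(f, x))
    print(df, sp.solve(df, x))        # 12*x*(x - 2)*(x + 3), roots [-3, 0, 2]
    # Sample the sign of f' on each interval between consecutive roots
    for t in [-4, -1, 1, 3]:
        print(t, sp.sign(df.subs(x, t)))  # -1, 1, -1, 1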
There is a related sufficient condition for a function f to be increasing/decreasing
on an interval I, stated through the following theorem:

Theorem 47 Let I be an interval and suppose f is continuous on I and differentiable on int(I). Then:

1. if f'(x) ≥ 0 for all x ∈ int(I), and if f'(x) = 0 at only finitely many x ∈ I, then f is increasing on I;

2. if f'(x) ≤ 0 for all x ∈ int(I), and if f'(x) = 0 at only finitely many x ∈ I, then f is decreasing on I.

For example, the derivative of the function f(x) = 6x^5 − 15x^4 + 10x^3 vanishes at 0 and 1, and f'(x) > 0 elsewhere. So f(x) is increasing on (−∞, ∞).
Are the sufficient conditions for the increasing and decreasing properties of f(x) in theorem 46 also necessary? It turns out they are not. Figure 4.6 shows that for the function f(x) = x^5, though f(x) is increasing on (−∞, ∞), f'(0) = 0.
In fact, we have a slightly different necessary condition for an increasing or
decreasing function.
Figure 4.6: Plot of f(x) = x^5, illustrating that though the function is increasing on (−∞, ∞), f'(0) = 0.

Theorem 48 Let I be an interval, and suppose f is continuous on I and differentiable in int(I). Then:

1. if f is increasing on I, then f'(x) ≥ 0 for all x ∈ int(I);

2. if f is decreasing on I, then f'(x) ≤ 0 for all x ∈ int(I).


Proof: Suppose f is increasing on I, and let x ∈ int(I). Then (f(x + h) − f(x))/h > 0 for all h such that x + h ∈ int(I). This implies that f'(x) = lim_{h→0} (f(x + h) − f(x))/h ≥ 0. For the case when f is decreasing on I, it can similarly be proved that f'(x) = lim_{h→0} (f(x + h) − f(x))/h ≤ 0. ∎
Next, we define the concept of critical number, which will help us derive the
general condition for local extrema.

Definition 20 [Critical number]: A number c in the domain D of f is called a critical number of f if either f'(c) = 0 or f'(c) does not exist.

The general condition for local extrema is stated in the next theorem; it
extends the result in theorem 39 to general non-differentiable functions.

Theorem 49 If f (c) is a local extreme value, then c is a critical number of f .

That the converse of theorem 49 does not hold is illustrated in Figure 4.6;
0 is a critical number (f ′ (0) = 0), although f (0) is not a local extreme value.
Then, given a critical number c, how do we discern whether f (c) is a local
extreme value? This can be answered using the first derivative test:

Procedure 1 [First derivative test]: Let c be an isolated critical number of f. Then,

1. f(c) is a local minimum if f(x) is decreasing in an interval [c − ε1, c] and increasing in an interval [c, c + ε2] with ε1, ε2 > 0, or (but not equivalently), the sign of f'(x) changes from negative in [c − ε1, c] to positive in [c, c + ε2] with ε1, ε2 > 0.

2. f(c) is a local maximum if f(x) is increasing in an interval [c − ε1, c] and decreasing in an interval [c, c + ε2] with ε1, ε2 > 0, or (but not equivalently), the sign of f'(x) changes from positive in [c − ε1, c] to negative in [c, c + ε2] with ε1, ε2 > 0.

3. If f'(x) is positive in an interval [c − ε1, c] and also positive in an interval [c, c + ε2], or f'(x) is negative in an interval [c − ε1, c] and also negative in an interval [c, c + ε2] with ε1, ε2 > 0, then f(c) is not a local extremum.

Figure 4.7: Example illustrating the first derivative test for the function f(x) = 3x^5 − 5x^3.

As an example, the function f(x) = 3x^5 − 5x^3 has the derivative f'(x) = 15x^2(x + 1)(x − 1). The critical numbers are 0, 1 and −1. Of the three, the sign of f'(x) changes at 1 and −1, which are a local minimum and a local maximum respectively. The sign does not change at 0, which is therefore not a local extremum. This is pictorially depicted in Figure 4.7. As another example, consider the function
f(x) = −x if x ≤ 0, and f(x) = 1 if x > 0

Then,

f'(x) = −1 if x < 0, and f'(x) = 0 if x > 0

Note that f(x) is discontinuous at x = 0, and therefore f'(x) is not defined at x = 0. All numbers x ≥ 0 are critical numbers. f(0) = 0 is a local minimum, whereas f(x) = 1 is a local minimum as well as a local maximum ∀x > 0.

Figure 4.8: Plot of the strictly convex function f(x) = x^2, which has f''(x) = 2 > 0, ∀x.

Strict Convexity and Extremum


We define strictly convex and concave functions as follows:

1. A differentiable function f is said to be strictly convex (or strictly concave up) on an open interval I iff f'(x) is increasing on I. Recall from theorem 46 the graphical interpretation of the first derivative f'(x): f'(x) > 0 implies that f(x) is increasing at x. Similarly, f'(x) is increasing when f''(x) > 0. This gives us a sufficient condition for the strict convexity of a function:

Theorem 50 If at all points in an open interval I, f(x) is doubly differentiable and if f''(x) > 0, ∀x ∈ I, then the slope of the function is always increasing with x and the graph is strictly convex. This is illustrated in Figure 4.8.

On the other hand, if the function is strictly convex and doubly differentiable in I, then f''(x) ≥ 0, ∀x ∈ I.
There is also a slopeless interpretation of strict convexity, as stated in the following theorem:
There is also a slopeless interpretation of strict convexity as stated in the
following theorem:

Theorem 51 A differentiable function f is strictly convex on an open interval I iff

f(ax1 + (1 − a)x2) < af(x1) + (1 − a)f(x2)    (4.2)

whenever x1, x2 ∈ I, x1 ≠ x2 and 0 < a < 1.



Proof: First we will prove the necessity. Suppose f' is increasing on I. Let 0 < a < 1, x1, x2 ∈ I and x1 ≠ x2. Without loss of generality, assume that x1 < x2 (for the case x2 < x1, the proof is very similar). Then x1 < ax1 + (1 − a)x2 < x2 and therefore ax1 + (1 − a)x2 ∈ I. By the mean value theorem, there exist s and t with x1 < s < ax1 + (1 − a)x2 < t < x2 such that f(ax1 + (1 − a)x2) − f(x1) = f'(s)(x2 − x1)(1 − a) and f(x2) − f(ax1 + (1 − a)x2) = f'(t)(x2 − x1)a. Therefore,

af(x1) + (1 − a)f(x2) − f(ax1 + (1 − a)x2) =
(1 − a)[f(x2) − f(ax1 + (1 − a)x2)] − a[f(ax1 + (1 − a)x2) − f(x1)] =
a(1 − a)(x2 − x1)[f'(t) − f'(s)]

Since f(x) is strictly convex on I, f'(x) is increasing on I and therefore f'(t) − f'(s) > 0. Moreover, x2 − x1 > 0 and 0 < a < 1. This implies that af(x1) + (1 − a)f(x2) − f(ax1 + (1 − a)x2) > 0, or equivalently, f(ax1 + (1 − a)x2) < af(x1) + (1 − a)f(x2), which is what we wanted to prove in 4.2.
Next, we prove the sufficiency. Suppose the inequality in 4.2 holds. Then

lim_{a→0} [f(x2 + a(x1 − x2)) − f(x2)]/a ≤ f(x1) − f(x2)

that is,

f'(x2)(x1 − x2) ≤ f(x1) − f(x2)    (4.3)

Similarly, we can show that

f'(x1)(x2 − x1) ≤ f(x2) − f(x1)    (4.4)

Adding the left and right hand sides of the inequalities in (4.3) and (4.4), and multiplying the resulting inequality by −1, gives us

(f'(x2) − f'(x1))(x2 − x1) ≥ 0    (4.5)

Using the mean value theorem, ∃z = x1 + t(x2 − x1) for some t ∈ (0, 1) such that

f(x2) − f(x1) = f'(z)(x2 − x1)    (4.6)

Since 4.5 holds for any x1, x2 ∈ I, it also holds for x2 = z. Therefore,

(f'(z) − f'(x1))(x2 − x1) = (1/t)(f'(z) − f'(x1))(z − x1) ≥ 0

Additionally using 4.6, we get

f(x2) − f(x1) = (f'(z) − f'(x1))(x2 − x1) + f'(x1)(x2 − x1) ≥ f'(x1)(x2 − x1)    (4.7)

Suppose equality holds in 4.5 for some x1 ≠ x2. Then equality holds in 4.7 for the same x1 and x2. That is,

f(x2) − f(x1) = f'(x1)(x2 − x1)    (4.8)

Applying 4.7, we can conclude that

f(x1) + af'(x1)(x2 − x1) ≤ f(x1 + a(x2 − x1))    (4.9)

From 4.2 and 4.8, we can derive that

f(x1 + a(x2 − x1)) < (1 − a)f(x1) + af(x2) = f(x1) + af'(x1)(x2 − x1)    (4.10)

However, equations 4.9 and 4.10 contradict each other. Therefore, equality in 4.5 cannot hold for any x1 ≠ x2, implying that

(f'(x2) − f'(x1))(x2 − x1) > 0

that is, f'(x) is increasing and therefore f is strictly convex on I. ∎


2. A differentiable function f is said to be strictly concave on an open interval I iff f'(x) is decreasing on I. Recall from theorem 46 the graphical interpretation of the first derivative f'(x): f'(x) < 0 implies that f(x) is decreasing at x. Similarly, f'(x) is monotonically decreasing when f''(x) < 0. This gives us a sufficient condition for the strict concavity of a function:

Figure 4.9: Plot of the strictly concave function f(x) = −x^2, which has f''(x) = −2 < 0, ∀x.

Theorem 52 If at all points in an open interval I, f(x) is doubly differentiable and if f''(x) < 0, ∀x ∈ I, then the slope of the function is always decreasing with x and the graph is strictly concave. This is illustrated in Figure 4.9.

On the other hand, if the function is strictly concave and doubly differentiable in I, then f''(x) ≤ 0, ∀x ∈ I.
There is also a slopeless interpretation of concavity as stated in the following theorem:

Theorem 53 A differentiable function f is strictly concave on an open interval I iff

f(ax1 + (1 − a)x2) > af(x1) + (1 − a)f(x2)    (4.11)

whenever x1, x2 ∈ I, x1 ≠ x2 and 0 < a < 1.

The proof is similar to that for theorem 51.


Figure 4.10 illustrates the function f(x) = x^3 − x + 2, whose slope decreases as x increases to 0 (f''(x) < 0) and then increases beyond x = 0 (f''(x) > 0). The point 0, where f''(x) changes sign, is called the inflection point; the graph is strictly concave for x < 0 and strictly convex for x > 0. Along similar lines, we can diagnose the function f(x) = (1/20)x^5 − (7/12)x^4 + (7/6)x^3 + (15/2)x^2; it is strictly concave on (−∞, −1] and [3, 5] and strictly convex on [−1, 3] and [5, ∞). The inflection points for this function are at x = −1, x = 3 and x = 5.
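A quick symbolic check of these inflection points (with the x^2 coefficient taken as +15/2, which is what the stated inflection points at −1, 3 and 5 require):

    import sympy as sp

    x = sp.symbols('x', real=True)
    f = x**5/20 - 7*x**4/12 + 7*x**3/6 + 15*x**2/2
    d2 = sp.factor(sp.diff(f, x, 2))
    print(d2)               # (x - 5)*(x - 3)*(x + 1)
    print(sp.solve(d2, x))  # [-1, 3, 5], where f'' changes sign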
The first derivative test for local extrema can be restated in terms of strict
convexity and concavity of functions.

Figure 4.10: Plot of f(x) = x^3 − x + 2, which has an inflection point at x = 0, along with plots of f'(x) and f''(x).

Procedure 2 [First derivative test in terms of strict convexity]: Let c be a critical number of f with f'(c) = 0. Then,
1. f(c) is a local minimum if the graph of f(x) is strictly convex on an open interval containing c.
2. f(c) is a local maximum if the graph of f(x) is strictly concave on an open interval containing c.

If the second derivative f''(c) exists, then the strict convexity conditions for the critical number can be stated in terms of the sign of f''(c), making use of theorems 50 and 52. This is called the second derivative test.

Procedure 3 [Second derivative test]: Let c be a critical number of f where f'(c) = 0 and f''(c) exists.
1. If f''(c) > 0 then f(c) is a local minimum.
2. If f''(c) < 0 then f(c) is a local maximum.
3. If f''(c) = 0 then f(c) could be a local maximum, a local minimum, neither or both. That is, the test fails.
For example,
• If f(x) = x^4, then f'(0) = 0 and f''(0) = 0, and we can see that f(0) is a local minimum.
• If f(x) = −x^4, then f'(0) = 0 and f''(0) = 0, and we can see that f(0) is a local maximum.
• If f(x) = x^3, then f'(0) = 0 and f''(0) = 0, and we can see that f(0) is neither a local minimum nor a local maximum. (0, 0) is an inflection point in this case.
• If f(x) = x + 2 sin x, then f'(x) = 1 + 2 cos x. f'(x) = 0 for x = 2π/3, 4π/3, which are the critical numbers. f''(2π/3) = −2 sin(2π/3) = −√3 < 0 ⇒ f(2π/3) = 2π/3 + √3 is a local maximum value. On the other hand, f''(4π/3) = −2 sin(4π/3) = √3 > 0 ⇒ f(4π/3) = 4π/3 − √3 is a local minimum value.
• If f(x) = x + 1/x, then f'(x) = 1 − 1/x^2. The critical numbers are x = ±1. Note that x = 0 is not a critical number, even though f'(0) does not exist, because 0 is not in the domain of f. f''(x) = 2/x^3. f''(−1) = −2 < 0 and therefore f(−1) = −2 is a local maximum. f''(1) = 2 > 0 and therefore f(1) = 2 is a local minimum.
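The second derivative test is easy to automate. A sketch for the example f(x) = x + 2 sin x, restricted to one period:

    import sympy as sp

    x = sp.symbols('x', real=True)
    f = x + 2*sp.sin(x)
    df, d2f = sp.diff(f, x), sp.diff(f, x, 2)
    # critical numbers in [0, 2*pi]: 2*pi/3 and 4*pi/3
    for c in sp.solveset(sp.Eq(df, 0), x, sp.Interval(0, 2*sp.pi)):
        kind = 'local min' if d2f.subs(x, c) > 0 else 'local max'
        print(c, kind, sp.simplify(f.subs(x, c)))
    # 2*pi/3 -> local max, value 2*pi/3 + sqrt(3)
    # 4*pi/3 -> local min, value 4*pi/3 - sqrt(3)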

Global Extrema on Closed Intervals


Recall the extreme value theorem (theorem 40). An outcome of the extreme
value theorem is that
• if either of c or d lies in (a, b), then it is a critical number of f ;
• else each of c and d must lie on one of the boundaries of [a, b].
This gives us a procedure for finding the maximum and minimum of a continuous
function f on a closed bounded interval I:

Procedure 4 [Finding extreme values on closed, bounded intervals]:
1. Find the critical points in int(I).
2. Compute the values of f at the critical points and at the endpoints of the interval.
3. Select the least and greatest of the computed values.

For example, to compute the maximum and minimum values of f(x) = 4x^3 − 8x^2 + 5x on the interval [0, 1], we first compute f'(x) = 12x^2 − 16x + 5, which is 0 at x = 1/2, 5/6. The values at the critical points are f(1/2) = 1 and f(5/6) = 25/27. The values at the end points are f(0) = 0 and f(1) = 1. Therefore, the minimum value is f(0) = 0 and the maximum value is f(1) = f(1/2) = 1.
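Procedure 4 applied to this example in code (a sympy sketch):

    import sympy as sp

    x = sp.symbols('x', real=True)
    f = 4*x**3 - 8*x**2 + 5*x
    a, b = 0, 1
    crit = [c for c in sp.solve(sp.diff(f, x), x) if a < c < b]  # [1/2, 5/6]
    values = {c: f.subs(x, c) for c in crit + [a, b]}
    print(values)  # e.g. {1/2: 1, 5/6: 25/27, 0: 0, 1: 1}
    print(min(values.values()), max(values.values()))  # 0 and 1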
In this context, it is relevant to discuss the one-sided derivatives of a function
at the endpoints of the closed interval on which it is defined.

Definition 21 [One-sided derivatives at endpoints]: Let f be defined on a closed bounded interval [a, b]. The (right-sided) derivative of f at x = a is defined as

f'(a) = lim_{h→0+} (f(a + h) − f(a))/h

Similarly, the (left-sided) derivative of f at x = b is defined as

f'(b) = lim_{h→0−} (f(b + h) − f(b))/h

Essentially, each of the one-sided derivatives defines one-sided slopes at the


endpoints. Based on these definitions, the following result can be derived.

Theorem 54 If f is continuous on [a, b] and f'(a) exists as a real number or as ±∞, then we have the following necessary conditions for an extremum at a.

• If f(a) is the maximum value of f on [a, b], then f'(a) ≤ 0 or f'(a) = −∞.

• If f(a) is the minimum value of f on [a, b], then f'(a) ≥ 0 or f'(a) = ∞.

If f is continuous on [a, b] and f'(b) exists as a real number or as ±∞, then we have the following necessary conditions for an extremum at b.

• If f(b) is the maximum value of f on [a, b], then f'(b) ≥ 0 or f'(b) = ∞.

• If f(b) is the minimum value of f on [a, b], then f'(b) ≤ 0 or f'(b) = −∞.

The following theorem gives a useful procedure for finding extrema on closed intervals.

Theorem 55 Let f be continuous on [a, b] and suppose f''(x) exists for all x ∈ (a, b). Then:

• If f''(x) ≤ 0, ∀x ∈ (a, b), then the minimum value of f on [a, b] is either f(a) or f(b). If, in addition, f has a critical number c ∈ (a, b), then f(c) is the maximum value of f on [a, b].

• If f''(x) ≥ 0, ∀x ∈ (a, b), then the maximum value of f on [a, b] is either f(a) or f(b). If, in addition, f has a critical number c ∈ (a, b), then f(c) is the minimum value of f on [a, b].

The next theorem is very useful for finding global extreme values on open intervals.

Theorem 56 Let I be an open interval and let f''(x) exist ∀x ∈ I.

• If f''(x) ≥ 0, ∀x ∈ I, and if there is a number c ∈ I where f'(c) = 0, then f(c) is the global minimum value of f on I.

• If f''(x) ≤ 0, ∀x ∈ I, and if there is a number c ∈ I where f'(c) = 0, then f(c) is the global maximum value of f on I.

For example, let f(x) = (2/3)x − sec x and I = (−π/2, π/2). f'(x) = 2/3 − sec x tan x = 2/3 − sin x/cos^2 x = 0 ⇒ x = π/6. Further, f''(x) = −sec x (tan^2 x + sec^2 x) < 0 on (−π/2, π/2). Therefore, f attains the maximum value f(π/6) = π/9 − 2/√3 on I.
As another example, let us find the dimensions of the cone with minimum volume that can contain a sphere of radius R. Let h be the height of the cone and r the radius of its base. The objective to be minimized is the volume f(r, h) = (1/3)πr^2 h. The constraint between r and h is shown in Figure 4.11; the triangle AEF is similar to the triangle ADB and therefore (h − R)/R = √(h^2 + r^2)/r.

Figure 4.11: Illustrating the constraints for the optimization problem of finding the cone with minimum volume that can contain a sphere of radius R.

Our first step is to reduce the volume formula to involve only one of r^2 or h (r appears in the volume formula only in terms of r^2). The algebra involved will be simplest if we solve for h. The constraint gives us r^2 = R^2 h/(h − 2R). Substituting this expression for r^2 into the volume formula, we get g(h) = (πR^2/3)·h^2/(h − 2R) with the domain given by D = {h | 2R < h < ∞}. Note that D is an open interval. g'(h) = (πR^2/3)·(2h(h − 2R) − h^2)/(h − 2R)^2 = (πR^2/3)·h(h − 4R)/(h − 2R)^2, which is 0 in its domain D if and only if h = 4R. g''(h) = (πR^2/3)·(2(h − 2R)^3 − 2h(h − 4R)(h − 2R)^2)/(h − 2R)^4 = (πR^2/3)·2(h^2 − 4Rh + 4R^2 − h^2 + 4Rh)/(h − 2R)^3 = (πR^2/3)·8R^2/(h − 2R)^3, which is greater than 0 in D. Therefore, g (and consequently f) has a unique minimum at h = 4R and correspondingly, r^2 = R^2 h/(h − 2R) = 2R^2.
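The same computation can be verified in sympy, treating R as a positive parameter:

    import sympy as sp

    h, R = sp.symbols('h R', positive=True)
    g = sp.pi * R**2 * h**2 / (3 * (h - 2*R))        # volume as a function of h alone
    dg = sp.simplify(sp.diff(g, h))
    print(sp.solve(sp.Eq(dg, 0), h))                 # [4*R]; the root h = 0 lies outside D
    print(sp.simplify(sp.diff(g, h, 2).subs(h, 4*R)))  # pi/3 > 0, so h = 4R is a minimum
    print(sp.simplify((R**2 * h / (h - 2*R)).subs(h, 4*R)))  # r^2 = 2*R**2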

4.1.4 Optimization Principles for Multivariate Functions

Directional derivative and the gradient vector
Consider a function f(x), with x ∈ ℜ^n. We start with the concept of direction at a point x ∈ ℜ^n. We will represent a vector by x and the k-th component of x by xk. Let u^k be a unit vector pointing along the k-th coordinate axis in ℜ^n; the k-th component of u^k is 1 and all other components are 0. An arbitrary direction vector v at x is a vector in ℜ^n with unit norm (i.e., ||v|| = 1) and component vk in the direction of u^k. Let f : D → ℜ, D ⊆ ℜ^n be a function.
Definition 22 [Directional derivative]: The directional derivative of f(x) at x in the direction of the unit vector v is

D_v f(x) = lim_{h→0} (f(x + hv) − f(x))/h    (4.12)

provided the limit exists.

As a special case, when v = u^k the directional derivative reduces to the partial derivative of f with respect to xk:

D_{u^k} f(x) = ∂f(x)/∂xk

Theorem 57 If f(x) is a differentiable function of x ∈ ℜ^n, then f has a directional derivative in the direction of any unit vector v, and

D_v f(x) = Σ_{k=1}^n (∂f(x)/∂xk) vk    (4.13)

Proof: Define g(h) = f(x + hv). Now:

• g'(0) = lim_{h→0} (g(0 + h) − g(0))/h = lim_{h→0} (f(x + hv) − f(x))/h, which is the expression for the directional derivative defined in equation 4.12. Thus, g'(0) = D_v f(x).

• By the chain rule for partial differentiation, we get another expression for g'(0): g'(0) = Σ_{k=1}^n (∂f(x)/∂xk) vk.

Therefore, g'(0) = D_v f(x) = Σ_{k=1}^n (∂f(x)/∂xk) vk. ∎
The theorem applies only if the function is differentiable at the point; otherwise the directional derivative need not be given by this formula. The above theorem leads us directly to the idea of the gradient. We can see that the right hand side of (4.13) can be realized as the dot product of two vectors, viz., [∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn]^T and v. Let us denote ∂f(x)/∂xi by fxi(x). Then we assign a name to the special vector discovered above.

Definition 23 [Gradient Vector]: If f is a differentiable function of x ∈ ℜ^n, then the gradient of f(x) is the vector function ∇f(x), defined as:

∇f(x) = [fx1(x), fx2(x), . . . , fxn(x)]

The directional derivative of a function f at a point x in the direction of a unit vector v can now be written as

D_v f(x) = ∇^T f(x)·v    (4.14)



What does the gradient ∇f(x) tell you about the function f(x)? We will illustrate with some examples. Consider the polynomial f(x, y, z) = x^2 y + z sin xy and the unit vector v^T = (1/√3)[1, 1, 1]^T. Consider the point p0 = (0, 1, 3). We will compute the directional derivative of f at p0 in the direction of v. To do this, we first compute the gradient of f in general: ∇f = [2xy + yz cos xy, x^2 + xz cos xy, sin xy]^T. Evaluating the gradient at the specific point p0, ∇f(0, 1, 3) = [3, 0, 0]^T. The directional derivative at p0 in the direction v is D_v f(0, 1, 3) = [3, 0, 0]·(1/√3)[1, 1, 1]^T = √3. This directional derivative is the rate of change of f at p0 in the direction v; it is positive, indicating that the function f increases at p0 in the direction v. All our ideas about first and second derivatives in the case of a single variable carry over to the directional derivative.
As another example, let us find the rate of change of f(x, y, z) = e^{xyz} at p0 = (1, 2, 3) in the direction from p1 = (1, 2, 3) to p2 = (−4, 6, −1). We first construct a unit vector from p1 to p2: v = (1/√57)[−5, 4, −4]. The gradient of f in general is ∇f = [yz e^{xyz}, xz e^{xyz}, xy e^{xyz}] = e^{xyz}[yz, xz, xy]. Evaluating the gradient at the specific point p0, ∇f(1, 2, 3) = e^6 [6, 3, 2]^T. The directional derivative at p0 in the direction v is D_v f(1, 2, 3) = e^6 [6, 3, 2]·(1/√57)[−5, 4, −4]^T = −26e^6/√57. This directional derivative is negative, indicating that the function f decreases at p0 in the direction from p1 to p2.
While there exist infinitely many direction vectors v at any point x, there is a unique gradient vector ∇f(x). Since we separated D_v f(x) as the dot product of ∇f(x) with v, we can study ∇f(x) independently. What does the gradient vector tell us? We will state a theorem to answer this question.

Theorem 58 Suppose f is a differentiable function of x ∈ ℜ^n. The maximum value of the directional derivative D_v f(x) is ||∇f(x)|| and it is attained when v has the same direction as the gradient vector ∇f(x).

Proof: The Cauchy-Schwarz inequality, applied in Euclidean space, states that |x^T·y| ≤ ||x||·||y|| for any x, y ∈ ℜ^n, with equality holding iff x and y are linearly dependent. The inequality gives upper and lower bounds on the dot product between two vectors: −||x||·||y|| ≤ x^T·y ≤ ||x||·||y||. Applying these bounds to the right hand side of 4.14 and using the fact that ||v|| = 1, we get

−||∇f(x)|| ≤ D_v f(x) = ∇^T f(x)·v ≤ ||∇f(x)||

with equality holding iff v = k∇f(x) for some k ≥ 0. Since ||v|| = 1, equality can hold iff v = ∇f(x)/||∇f(x)||. ∎
The theorem implies that the maximum rate of change of f at a point x is given by the norm of the gradient vector at x, and the direction in which the rate of change of f is maximum is given by the unit vector ∇f(x)/||∇f(x)||.
Figure 4.12: 10 level curves for the function f(x1, x2) = x1 e^{x2}.

An associated fact is that the minimum value of the directional derivative D_v f(x) is −||∇f(x)|| and it occurs when v has the direction opposite to the gradient vector, i.e., −∇f(x)/||∇f(x)||. This fact is often used in numerical analysis when one is trying to minimize the value of very complex functions. The method of steepest descent uses this result to iteratively choose a new value of x by traversing in the direction of −∇f(x).
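Here is a minimal steepest-descent sketch; the quadratic objective, the fixed step size and the iteration count are all invented for illustration:

    import numpy as np

    def f(x):
        return x[0]**2 + 4*x[1]**2          # hypothetical objective

    def grad_f(x):
        return np.array([2*x[0], 8*x[1]])   # its gradient

    x = np.array([2.0, 1.0])                # starting point
    alpha = 0.1                             # fixed step size
    for _ in range(100):
        x = x - alpha * grad_f(x)           # step along -gradient, the steepest descent direction
    print(x, f(x))                          # x approaches the minimizer (0, 0)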
Consider the function f(x1, x2) = x1 e^{x2}. Figure 4.12 shows 10 level curves for this function, corresponding to f(x1, x2) = c for c = 1, 2, . . . , 10. The idea behind a level curve is that as you move x along any level curve, the function value remains unchanged, but as you move x across level curves, the function value changes.
We will define the concept of a hyperplane next, since it will be repeatedly
referred to in the sequel.

Definition 24 [Hyperplane]: A set of points H ⊆ ℜ^n is called a hyperplane if there exists a vector v ∈ ℜ^n and a point q ∈ ℜ^n such that

∀p ∈ H, (p − q)^T v = 0

or in other words, ∀p ∈ H, p^T v = q^T v. This is the equation of a hyperplane orthogonal to vector v and passing through point q. The space spanned by vectors in the hyperplane H which are orthogonal to vector v forms the orthogonal complement of the space spanned by v.

A hyperplane H can also be equivalently defined as the set of points p such that p^T v = c for some c ∈ ℜ and some v ∈ ℜ^n, with c = q^T v in our definition. (This definition will be referred to at a later point.)
What if Dv f (x) turns out to be 0? What can we say about ∇f (x) and v?
There is a useful theorem in this regard.

Theorem 59 Let f : D → ℜ with D ⊆ ℜ^n be a differentiable function. The gradient ∇f evaluated at x* is orthogonal to the tangent hyperplane (tangent line in case n = 2) to the level surface of f passing through x*.

Proof: Let K be the range of f and let k ∈ K such that f(x*) = k. Consider the level surface f(x) = k. Let r(t) = [x1(t), x2(t), . . . , xn(t)] be a curve on the level surface, parametrized by t ∈ ℜ, with r(0) = x*. Then f(r(t)) = k. Applying the chain rule,

df(r(t))/dt = Σ_{i=1}^n (∂f/∂xi)(dxi(t)/dt) = ∇^T f(r(t))·(dr(t)/dt) = 0

For t = 0, the equation becomes

∇^T f(x*)·(dr(0)/dt) = 0

Now, dr(t)/dt represents any tangent vector to the curve r(t), which lies completely on the level surface. That is, the tangent line to any curve at x* on the level surface containing x* is orthogonal to ∇f(x*). Since the tangent hyperplane to a surface at any point is the hyperplane containing all tangent vectors to curves on the surface passing through the point, the gradient is perpendicular to the tangent hyperplane to the level surface passing through that point. The equation of the tangent hyperplane is given by (x − x*)^T ∇f(x*) = 0. ∎
Recall from elementary calculus that the normal to a plane can be found by taking the cross product of any two vectors lying within the plane. The gradient vector at any point on the level surface of a function is normal to the tangent hyperplane (or tangent line in the case of two variables) to the surface at that point, and it can be conveniently obtained using the partial derivatives of the function at that point.
We will use some illustrative examples to study these facts.
1. Consider the same plot as in Figure 4.12 with a gradient vector at (2, 0), as shown in Figure 4.13. The gradient vector [1, 2]^T is perpendicular to the tangent hyperplane to the level curve x1 e^{x2} = 2 at (2, 0). The equation of the tangent hyperplane is (x1 − 2) + 2(x2 − 0) = 0 and it turns out to be a tangent line.

Figure 4.13: The level curves from Figure 4.12 along with the gradient vector at (2, 0). Note that the gradient vector is perpendicular to the level curve x1 e^{x2} = 2 at (2, 0).

2. The level surfaces for f(x1, x2, x3) = x1^2 + x2^2 + x3^2 are shown in Figure 4.14. The gradient at (1, 1, 1) is orthogonal to the tangent hyperplane to the level surface f(x1, x2, x3) = x1^2 + x2^2 + x3^2 = 3 at (1, 1, 1). The gradient vector at (1, 1, 1) is [2, 2, 2]^T and the tangent hyperplane has the equation 2(x1 − 1) + 2(x2 − 1) + 2(x3 − 1) = 0, which is a plane in 3D. On the other hand, the dotted line in Figure 4.15 is not orthogonal to the level surface, since it does not coincide with the gradient.

Figure 4.14: 3 level surfaces for the function f(x1, x2, x3) = x1^2 + x2^2 + x3^2 with c = 1, 3, 5. The gradient at (1, 1, 1) is orthogonal to the level surface f(x1, x2, x3) = x1^2 + x2^2 + x3^2 = 3 at (1, 1, 1).

Figure 4.15: Level surface f(x1, x2, x3) = x1^2 + x2^2 + x3^2 = 3. The gradient at (1, 1, 1), drawn as a bold line, is perpendicular to the tangent plane to the level surface at (1, 1, 1), whereas the dotted line, though passing through (1, 1, 1), is not perpendicular to the same tangent plane.

3. Let f(x1, x2, x3) = x1^2 x2^3 x3^4 and consider the point x0 = (1, 2, 1). We will find the equation of the tangent plane to the level surface through x0. The level surface through x0 is determined by setting f equal to its value evaluated at x0; that is, the level surface will have the equation x1^2 x2^3 x3^4 = 1^2·2^3·1^4 = 8. The gradient vector (normal to the tangent plane) at (1, 2, 1) is ∇f(x1, x2, x3)|(1,2,1) = [2x1 x2^3 x3^4, 3x1^2 x2^2 x3^4, 4x1^2 x2^3 x3^3]^T |(1,2,1) = [16, 12, 32]^T. The equation of the tangent plane at x0, given the normal vector ∇f(x0), can be easily written down: ∇f(x0)^T·[x − x0] = 0, which turns out to be 16(x1 − 1) + 12(x2 − 2) + 32(x3 − 1) = 0, a plane in 3D.

4. Consider the function f(x, y, z) = x/(y + z). The directional derivative of f in the direction of the vector v = (1/√14)[1, 2, 3] at the point x0 = (4, 1, 1) is ∇^T f|(4,1,1)·(1/√14)[1, 2, 3]^T = [1/(y + z), −x/(y + z)^2, −x/(y + z)^2]|(4,1,1)·(1/√14)[1, 2, 3]^T = [1/2, −1, −1]·(1/√14)[1, 2, 3]^T = −9/(2√14). The directional derivative is negative, indicating that the function decreases along the direction of v. Based on theorem 58, we know that the maximum rate of change of a function at a point x is given by ||∇f(x)|| and it is in the direction ∇f(x)/||∇f(x)||. In the example under consideration, this maximum rate of change at x0 is 3/2 and it is in the direction of the vector (2/3)[1/2, −1, −1].

5. Let us find the maximum rate of change of the function f(x, y, z) = x^2 y^3 z^4 at the point x0 = (1, 1, 1) and the direction in which it occurs. The gradient at x0 is ∇^T f|(1,1,1) = [2, 3, 4]. The maximum rate of change at x0 is therefore √29 and the direction of the corresponding rate of change is (1/√29)[2, 3, 4]. The minimum rate of change is −√29 and the corresponding direction is −(1/√29)[2, 3, 4].

6. Let us determine the equations of (a) the tangent plane to the paraboloid P : x1 = x2^2 + x3^2 − 2 at (−1, 1, 0) and (b) the normal line to the tangent plane. To realize this as the level surface of a function of three variables, we define the function f(x1, x2, x3) = x1 − x2^2 − x3^2 and find that the paraboloid P is the same as the level surface f(x1, x2, x3) = −2. The normal to the tangent plane to P at x0 is in the direction of the gradient vector ∇f(x0) = [1, −2, 0]^T and its parametric equation is [x1, x2, x3] = [−1 + t, 1 − 2t, 0]. The equation of the tangent plane is therefore (x1 + 1) − 2(x2 − 1) = 0.
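The tangent-plane computation in example 3 can be reproduced mechanically (a sympy sketch):

    import sympy as sp

    x1, x2, x3 = sp.symbols('x1 x2 x3')
    f = x1**2 * x2**3 * x3**4
    x0 = (1, 2, 1)
    grad = [sp.diff(f, v) for v in (x1, x2, x3)]
    n = [g.subs(dict(zip((x1, x2, x3), x0))) for g in grad]
    print(n)  # [16, 12, 32], the normal to the tangent plane
    plane = sum(ni * (v - v0) for ni, v, v0 in zip(n, (x1, x2, x3), x0))
    print(sp.Eq(plane, 0))  # 16*x1 + 12*x2 + 32*x3 - 72 = 0, i.e. 16(x1-1) + 12(x2-2) + 32(x3-1) = 0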

We can embed the graph of a function of n variables as the 0-level surface of a function of n + 1 variables. More concretely, if f : D → ℜ, D ⊆ ℜ^n, then we define F : D' → ℜ, D' = D × ℜ as F(x, z) = f(x) − z with (x, z) ∈ D'. The function f then corresponds to a single level surface of F given by F(x, z) = 0. In other words, the 0-level surface of F gives back the graph of f. The gradient of F at any point (x, z) is simply ∇F(x, z) = [fx1, fx2, . . . , fxn, −1], with the first n components of ∇F(x, z) given by the n components of ∇f(x). We note that the level surface of F passing through the point (x0, f(x0)) is its 0-level surface, which is essentially the graph of the function f(x). The equation of the tangent hyperplane to the 0-level surface of F at the point (x0, f(x0)) (that is, the tangent hyperplane to f(x) at the point x0) is ∇F(x0, f(x0))^T·[x − x0, z − f(x0)]^T = 0. Substituting the appropriate expression for ∇F(x0, f(x0)), the equation of the tangent plane can be written as
(Σ_{i=1}^n fxi(x0)(xi − x0_i)) − (z − f(x0)) = 0

or equivalently as,

(Σ_{i=1}^n fxi(x0)(xi − x0_i)) + f(x0) = z

As an example, consider the paraboloid f(x1, x2) = 9 − x1^2 − x2^2, the corresponding F(x1, x2, z) = 9 − x1^2 − x2^2 − z and the point (x0, z0) = (1, 1, 7), which lies on the 0-level surface of F. The gradient ∇F(x1, x2, z) is [−2x1, −2x2, −1], which when evaluated at (1, 1, 7) is [−2, −2, −1]. The equation of the tangent plane to f at x0 is therefore given by −2(x1 − 1) − 2(x2 − 1) + 7 = z.
Recall from theorem 39 that for functions of a single variable, at a local extreme point the tangent to the curve is a horizontal line, parallel to the x-axis. If the function is differentiable at the extreme point, then the derivative must vanish. This idea can be extended to functions of multiple variables. The requirement in this case turns out to be that the tangent plane to the function at any extreme point must be parallel to the plane z = 0. This can happen if and only if the gradient ∇F is parallel to the z-axis at the extreme point, or equivalently, the gradient of the function f must be the zero vector at every extreme point.
We will formalize this discussion by first providing the definitions of local maximum and minimum, as well as absolute maximum and minimum values, of a function of n variables.

Definition 25 [Local maximum]: A function f of n variables has a local maximum at x0 if ∃ε > 0 such that f(x) ≤ f(x0) whenever ||x − x0|| < ε. In other words, f(x) ≤ f(x0) whenever x lies in some circular disk around x0.

Definition 26 [Local minimum]: A function f of n variables has a local minimum at x0 if ∃ε > 0 such that f(x) ≥ f(x0) whenever ||x − x0|| < ε. In other words, f(x) ≥ f(x0) whenever x lies in some circular disk around x0.

These definitions are exactly analogous to the definitions for a function of a single variable. Figure 4.16 shows the plot of f(x1, x2) = 3x1^2 − x1^3 − 2x2^2 + x2^4. As can be seen in the plot, the function has several local maxima and minima.
We will next state a theorem fundamental to determining the locally extreme values of functions of multiple variables.

Theorem 60 If f(x) defined on a domain D ⊆ ℜ^n has a local maximum or minimum at x* and if the first-order partial derivatives exist at x*, then fxi(x*) = 0 for all 1 ≤ i ≤ n.
Figure 4.16: Plot of f(x1, x2) = 3x1^2 − x1^3 − 2x2^2 + x2^4, showing the various local maxima and minima of the function.

Proof: The idea behind this theorem can be stated as follows. The tangent hyperplane to the function at any extreme point must be parallel to the plane z = 0. This can happen if and only if the gradient ∇F = [∇^T f, −1]^T is parallel to the z-axis at the extreme point, or equivalently, the gradient of the function f must be the zero vector at every extreme point, i.e., fxi(x*) = 0 for 1 ≤ i ≤ n.
To formally prove this theorem, consider the function gi(xi) = f(x1*, x2*, . . . , x(i−1)*, xi, x(i+1)*, . . . , xn*). If f has a local extremum at x*, then each function gi(xi) must have a local extremum at xi*. Therefore gi'(xi*) = 0 by theorem 39. Now gi'(xi*) = fxi(x*), so fxi(x*) = 0. ∎
Applying theorem 60 to the function f(x1, x2) = 9 − x1^2 − x2^2, we require that at any extreme point fx1 = −2x1 = 0 ⇒ x1 = 0 and fx2 = −2x2 = 0 ⇒ x2 = 0. Thus, f indeed attains its maximum at the point (0, 0), as shown in Figure 4.17.

Definition 27 [Critical point]: A point x* is called a critical point of a function f(x) defined on D ⊆ ℜ^n if

1. fxi(x*) = 0 for 1 ≤ i ≤ n,

2. OR fxi(x*) fails to exist for some 1 ≤ i ≤ n.

A procedure for computing all critical points of a function f is:

1. Compute fxi for 1 ≤ i ≤ n.

2. Determine if there are any points where any one of the fxi fails to exist. Add such points (if any) to the list of critical points.

3. Solve the system of equations fxi = 0 simultaneously. Add the solution points to the list of critical points.
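In code, step 3 of this procedure is a simultaneous solve. A sympy sketch, using the function analyzed in the first example below:

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2', real=True)
    f = x1**2 + x2**2 - 2*x1 - 6*x2 + 14
    grad = [sp.diff(f, v) for v in (x1, x2)]
    print(sp.solve(grad, [x1, x2], dict=True))  # [{x1: 1, x2: 3}]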
Figure 4.17: The paraboloid f(x1, x2) = 9 − x1^2 − x2^2 attains its maximum at (0, 0). The tangent plane to the surface at (0, 0, f(0, 0)) is also shown, and so is the gradient vector ∇F at (0, 0, f(0, 0)).

Figure 4.18: Plot illustrating critical points where the derivative fails to exist.
Figure 4.19: The hyperbolic paraboloid f(x1, x2) = x1^2 − x2^2, which has a saddle point at (0, 0).

As an example, for the function f(x1, x2) = |x1|, fx1 does not exist at (0, s) for any s ∈ ℜ, and all such points are critical points. Figure 4.18 shows the corresponding 3-D plot.
Is the converse of theorem 60 true? That is, if you find an x* that satisfies fxi(x*) = 0 for all 1 ≤ i ≤ n, is it necessary that x* is an extreme point? The answer is no. In fact, critical points that are not local extrema are called saddle points.

Definition 28 [Saddle point]: A point x* is called a saddle point of a function f(x) defined on D ⊆ ℜ^n if x* is a critical point of f but x* does not correspond to a local maximum or minimum of the function.
We saw an example of a saddle point in Figure 4.7, for the case n = 1. The inflection point for a function of a single variable, discussed earlier, is the analogue of the saddle point for a function of multiple variables. An example for n = 2 is the hyperbolic paraboloid f(x1, x2) = x1^2 − x2^2 (the hyperbolic paraboloid is shaped like a saddle, hence the name saddle point), the graph of which is shown in Figure 4.19. The hyperbolic paraboloid opens up on the x1-axis (Figure 4.20) and down on the x2-axis (Figure 4.21) and has a saddle point at (0, 0).
To get working on figuring out how to find the maximum and minimum of a function, we will take some examples. Let us find the critical points of f(x1, x2) = x1^2 + x2^2 − 2x1 − 6x2 + 14 and classify them. This function is a polynomial function and is differentiable everywhere. It is a paraboloid that is shifted away from the origin. To find its critical points, we will solve fx1 = 2x1 − 2 = 0 and fx2 = 2x2 − 6 = 0, which when solved simultaneously yield a single critical point (1, 3). For a simple example like this, the function f can be rewritten as f(x1, x2) = (x1 − 1)^2 + (x2 − 3)^2 + 4, which implies that f(x1, x2) ≥ 4 = f(1, 3). Therefore, (1, 3) is indeed a local minimum (in fact a global minimum) of f(x1, x2).
Figure 4.20: The hyperbolic paraboloid f(x1, x2) = x1^2 − x2^2, when viewed from the x1-axis, is concave up.

Figure 4.21: The hyperbolic paraboloid f(x1, x2) = x1^2 − x2^2, when viewed from the x2-axis, is concave down.

However, it is not always so easy to determine whether a critical point is a point of local extreme value. To understand this, consider the function f(x1, x2) = 2x1^3 + x1 x2^2 + 5x1^2 + x2^2. The system of equations to be solved is fx1 = 6x1^2 + x2^2 + 10x1 = 0 and fx2 = 2x1 x2 + 2x2 = 0. From the second equation, we get either x2 = 0 or x1 = −1. Using these values one at a time in the first equation, we get values for the other variable. The critical points are: (0, 0), (−5/3, 0), (−1, 2) and (−1, −2). Which of these critical points correspond to extreme values of the function? Since f does not have a quadratic form, it is not easy to find a lower bound on the function as in the previous example. However, we can make use of the Taylor series expansion for a single variable to find polynomial expansions of functions of n variables. The following theorem gives a systematic method, similar to the second derivative test for functions of a single variable, for finding maxima and minima of functions of multiple variables.

Theorem 61 Let f : D → ℜ where D ⊆ ℜ^n. Let f(x) have continuous partial derivatives and continuous mixed partial derivatives in an open ball R containing a point x* where ∇f(x*) = 0. Let ∇^2 f(x) denote the n × n matrix of mixed partial derivatives of f evaluated at the point x, such that the ij-th entry of the matrix is fxixj. The matrix ∇^2 f(x) is called the Hessian matrix. The Hessian matrix is symmetric (by Clairaut's theorem, if the partial and mixed derivatives of a function are continuous on an open region containing a point x*, then fxixj(x*) = fxjxi(x*) for all i, j ∈ [1, n]). Then,

• If ∇^2 f(x*) is positive definite, x* is a local minimum.

• If ∇^2 f(x*) is negative definite (that is, if −∇^2 f(x*) is positive definite), x* is a local maximum.

Proof: Since the mixed partial derivatives of f are continuous in an open ball R containing x* and since ∇^2 f(x*) ≻ 0, it can be shown that there exists an ε > 0, with B(x*, ε) ⊆ R, such that for all ||h|| < ε, ∇^2 f(x* + h) ≻ 0. Consider an increment vector h such that (x* + h) ∈ B(x*, ε). Define g(t) = f(x* + th) : [0, 1] → ℜ. Using the chain rule,

g'(t) = Σ_{i=1}^n fxi(x* + th)(dxi/dt) = h^T·∇f(x* + th)

Since f has continuous partial and mixed partial derivatives, g' is a differentiable function of t and

g''(t) = h^T ∇^2 f(x* + th) h

Since g and g' are continuous on [0, 1] and g' is differentiable on (0, 1), we can make use of Taylor's theorem (45) with n = 1 and a = 0 to obtain:

g(1) = g(0) + g'(0) + (1/2)g''(c)

for some c ∈ (0, 1). Writing this equation in terms of f gives

f(x* + h) = f(x*) + h^T ∇f(x*) + (1/2) h^T ∇^2 f(x* + ch) h

We are given that ∇f(x*) = 0. Therefore,

f(x* + h) − f(x*) = (1/2) h^T ∇^2 f(x* + ch) h

The presence of an extremum of f at x* is determined by the sign of f(x* + h) − f(x*). By virtue of the above equation, this is the same as the sign of H(c) = h^T ∇^2 f(x* + ch) h. Because the partial derivatives of f are continuous in R, if H(0) ≠ 0, the sign of H(c) will be the same as the sign of H(0) = h^T ∇^2 f(x*) h for h with sufficiently small components (that is, since the function has continuous partial and mixed partial derivatives at x*, the Hessian remains positive definite in some small neighborhood around x*). Therefore, if ∇^2 f(x*) is positive definite, we are guaranteed to have H(0) positive, implying that f has a local minimum at x*. Similarly, if −∇^2 f(x*) is positive definite, we are guaranteed to have H(0) negative, implying that f has a local maximum at x*. ∎
Theorem 61 gives sufficient conditions for local maxima and minima of func-
tions of multiple variables. Along similar lines of the proof of theorem 61, we
can prove necessary conditions for local extrema in theorem 62.

Theorem 62 Let f : D → ℜ where D ⊆ ℜn . Let f (x) have continuous par-


tial derivatives and continuous mixed partial derivatives in an open region R
containing a point x∗ where ∇f (x∗ ) = 0. Then,

• If x∗ is a point of local minimum, ∇2 f (x∗ ) must be positive semi-definite.

• If x∗ is a point of local maximum, ∇2 f (x∗ ) must be negative semi-definite


(that is, −∇2 f (x∗ ) must be positive semi-definite).

The following corollary of theorem 62 states a sufficient condition for a point


to be a saddle point.

Corollary 63 Let f : D → ℜ where D ⊆ ℜn . Let f (x) have continuous par-


tial derivatives and continuous mixed partial derivatives in an open region R
containing a point x∗ where ∇f (x∗ ) = 0. If ∇2 f (x∗ ) is neither positive semi-
definite nor negative semi-definite (that is, some of its eigenvalues are positive
and some negative), then x∗ is a saddle point.

Thus, for a function of more than one variable, the second derivative test
generalizes to a test based on the eigenvalues of the function’s Hessian matrix at
the stationary point. Based on theorem 61, we will derive the second derivative
test for determining extreme values of a function of two variables.

Theorem 64 Let the partial and second partial derivatives of f (x1 , x2 ) be con-
tinuous on a disk with center (a, b) and suppose fx1 (a, b) = 0 and fx2 (a, b) = 0
so that (a, b) is a critical point of f . Let D(a, b) = fx1 x1 (a, b)fx2 x2 (a, b) −
[fx1 x2 (a, b)]2 . Then7 ,
• If D > 0 and fx1 x1 (a, b) > 0, then f (a, b) is a local minimum.

• Else if D > 0 and fx1 x1 (a, b) < 0, then f (a, b) is a local maximum.
• Else if D < 0 then (a, b) is a saddle point.

Proof: Recall the definition of positive definiteness; a matrix is positive definite


if all its eigenvalues are positive. For the 2 × 2 matrix ∇2 f in this problem, the
product of the eigenvalues is det(∇2 f ) = fx1 x1 (a, b)fx2 x2 (a, b) − [fx1 x2 (a, b)]2
and the sum of the eigenvalues is fx1 x1 (a, b) + fx2 x2 (a, b). Now:

• If det(∇2 f (a, b)) > 0 and if additionally fx1 x1 (a, b) > 0 (or equivalently, fx2 x2 (a, b) > 0), the product as well as the sum of the eigenvalues will be positive, implying that the eigenvalues are positive and therefore ∇2 f (a, b) is positive definite. According to theorem 61, this is a sufficient condition for f (a, b) to be a local minimum.

• If det(∇2 f (a, b)) > 0 and if additionally fx1 x1 (a, b) < 0 (or equivalently, fx2 x2 (a, b) < 0), the product of the eigenvalues is positive whereas the sum is negative, implying that the eigenvalues are negative and therefore ∇2 f (a, b) is negative definite. According to theorem 61, this is a sufficient condition for f (a, b) to be a local maximum.

• If det(∇2 f (a, b)) < 0, the eigenvalues must have opposite signs, implying that ∇2 f (a, b) is neither positive semi-definite nor negative semi-definite. By corollary 63, this is a sufficient condition for (a, b) to be a saddle point.
2
We saw earlier that the critical points for $f(x_1, x_2) = 2x_1^3 + x_1x_2^2 + 5x_1^2 + x_2^2$ are $(0, 0)$, $(-\frac{5}{3}, 0)$, $(-1, 2)$ and $(-1, -2)$. To determine which of these correspond to local extrema and which are saddle points, we first compute the second partial derivatives of f :

$$f_{x_1x_1}(x_1, x_2) = 12x_1 + 10, \qquad f_{x_2x_2}(x_1, x_2) = 2x_1 + 2, \qquad f_{x_1x_2}(x_1, x_2) = 2x_2$$
Using theorem 64, we can verify that (0, 0) corresponds to a local minimum, (−5/3, 0) corresponds to a local maximum, while (−1, 2) and (−1, −2) correspond to saddle points. Figure 4.22 shows the plot of the function while pointing out the four critical points.
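This classification is easy to sanity-check numerically by inspecting the eigenvalues of the Hessian at each critical point. The following is a minimal sketch in Python; numpy is an assumption of the sketch, not something used elsewhere in this chapter:

```python
import numpy as np

# Hessian of f(x1, x2) = 2*x1**3 + x1*x2**2 + 5*x1**2 + x2**2,
# assembled from the second partials computed above
def hessian(x1, x2):
    return np.array([[12 * x1 + 10, 2 * x2],
                     [2 * x2,       2 * x1 + 2]])

for pt in [(0, 0), (-5 / 3, 0), (-1, 2), (-1, -2)]:
    eigs = np.linalg.eigvalsh(hessian(*pt))
    if np.all(eigs > 0):
        kind = "local minimum"    # Hessian positive definite
    elif np.all(eigs < 0):
        kind = "local maximum"    # Hessian negative definite
    else:
        kind = "saddle point"     # eigenvalues of mixed sign
    print(pt, eigs.round(3), kind)
```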

7 D here stands for the discriminant.



Figure 4.22: Plot of the function $2x_1^3 + x_1x_2^2 + 5x_1^2 + x_2^2$ showing the four critical points.

We will take some more examples:

1. Consider the significantly harder function $f(x, y) = 10x^2y - 5x^2 - 4y^2 - x^4 - 2y^4$. Let us find and classify its critical points. The gradient vector is $\nabla f(x, y) = [20xy - 10x - 4x^3,\; 10x^2 - 8y - 8y^3]$. The critical points correspond to solutions of the simultaneous set of equations

$$20xy - 10x - 4x^3 = 0, \qquad 10x^2 - 8y - 8y^3 = 0 \tag{4.15}$$

One set of solutions corresponds to solving the system $-8y^3 + 42y - 25 = 0$⁸ and $10x^2 = 50y - 25$, which together have four real solutions⁹, viz., (0.8567, 0.646772), (−0.8567, 0.646772), (2.6442, 1.898384) and (−2.6442, 1.898384). Another real solution is (0, 0). The mixed partial derivatives of the function are

$$f_{xx} = 20y - 10 - 12x^2, \qquad f_{xy} = 20x, \qquad f_{yy} = -8 - 24y^2 \tag{4.16}$$

Using theorem 64, we can verify that (2.6442, 1.898384) and (−2.6442, 1.898384) correspond to local maxima whereas (0.8567, 0.646772) and (−0.8567, 0.646772) correspond to saddle points. At (0, 0), D = (−10)(−8) − 0 = 80 > 0 with fxx = −10 < 0, so it is a local maximum. This is illustrated in Figure 4.23.
8 Solving this using Matlab without proper scaling could give you complex values. With proper scaling of the equation, you should get y = −2.545156, y = 0.646772 or y = 1.898384.
9 The values of x corresponding to y = −2.545156 are complex.

Figure 4.23: Plot of the function $10x^2y - 5x^2 - 4y^2 - x^4 - 2y^4$ showing the four critical points.

2. The function f (x, y) = x sin y has the gradient vector [sin y, x cos y].
The critical points correspond to the solutions to the simultaneous set of
equations

$$\sin y = 0, \qquad x\cos y = 0 \tag{4.17}$$

The critical points are¹⁰ (0, nπ) for n = 0, ±1, ±2, . . .. The mixed partial derivatives of the function are

$$f_{xx} = 0, \qquad f_{xy} = \cos y, \qquad f_{yy} = -x\sin y \tag{4.18}$$

which tells us that the discriminant function is D = − cos² y. At every critical point (0, nπ) we have cos²(nπ) = 1, so D = −1 < 0. Therefore, all the critical points turn out to be saddle points. This is illustrated in Figure 4.24.

Along similar lines of the single variable case, we next define the global
maximum and minimum.

Definition 29 [Global maximum]: A function f of n variables, with domain


D ⊆ ℜn has an absolute or global maximum at x0 if ∀ x ∈ D, f (x) ≤
f (x0 ).
10 Note that the cosine does not vanish wherever the sine vanishes.

Figure 4.24: Plot of the function x sin y illustrating that all critical points are
saddle points.

Definition 30 [Global minimum]: A function f of n variables, with domain


D ⊆ ℜn has an absolute or global minimum at x0 if ∀ x ∈ D, f (x) ≥
f (x0 ).

We would like to find the absolute maximum and minimum values of a function of multiple variables on a closed region, along similar lines of the method yielded by theorem 41 for functions of a single variable. The procedure there was to evaluate the value of the function at the critical points as well as at the end points of the interval, and to determine the absolute maximum and minimum values by scanning this list. To generalize the idea to functions of multiple variables, we point out that the analogue of finding the value of the function at the boundaries of a closed interval in the single variable case is to find the function values along the boundary curve, which reduces the evaluation of a function of multiple variables to evaluating a function of a single variable. Recall from the definitions on page 214 that a closed set in ℜn is a set that contains its boundary points (analogous to a closed interval in ℜ) while a bounded set in ℜn is a set that is contained inside a closed ball, B[0, ǫ]. An example bounded set is $\{(x_1, x_2, x_3)\,|\,x_1^2 + x_2^2 + x_3^2 \le 1\}$.
An example unbounded set is {(x1 , x2 , x3 )|x1 > 1, x2 > 1, x3 > 1}. Based on
these definitions, we can state the extreme value theorem for a function of n
variables.

Theorem 65 Let f : D → ℜ where D ⊆ ℜn is a closed bounded set and f be


continuous on D. Then f attains an absolute maximum and absolute minimum
at some points in D.

The theorem implies that whenever a function of n variables is restricted to a closed and bounded set, it attains an absolute maximum and an absolute minimum. Following theorem 60, we note that the local extreme values of a function occur at its critical points. By the very definition of a local extremum (which requires an open neighborhood), it cannot occur at a boundary point of D. Since an absolute extremum attained in the interior of D is also a

Figure 4.25: The region bounded by the points (0, 3), (2, 0), (0, 0) on which we
consider the maximum and minimum of the function f (x, y) = 1 + 4x − 5y.

local extremum, the absolute maximum and minimum of a function on a closed, bounded set will occur either at the critical points or on the boundary. The procedure for finding the absolute maximum and minimum of a function on a closed bounded set is outlined below and is similar to procedure 4 for a function of a single variable continuous on a closed and bounded interval:

Procedure 5 [Finding extreme values on closed, bounded sets]: To find


the absolute maximum and absolute minimum of a continuous function f
on a closed bounded set D;
• evaluate f at the critical points of f in D;
• find the extreme values of f on the boundary of D;
• the largest of the values found above is the absolute maximum, and the smallest of them is the absolute minimum.

We will take some examples to illustrate procedure 5.

1. Consider the function f (x, y) = 1 + 4x − 5y defined on the region R bounded by the points (0, 3), (2, 0), (0, 0). The region R is shown in Figure 4.25 and is bounded by three line segments

   • B1 : x = 0, 0 ≤ y ≤ 3
   • B2 : y = 0, 0 ≤ x ≤ 2
   • B3 : y = 3 − (3/2)x, 0 ≤ x ≤ 2.

   The linear function f (x, y) = 1 + 4x − 5y has no critical points, since ∇f (x, y) = [4, −5]T is defined everywhere and never vanishes. In fact, linear functions have no critical points and their extreme values are always attained on the boundary; this forms the basis of linear programming. We will find the extreme values on the boundaries.

Figure 4.26: The region R bounded by y = x2 and y = 4 on which we consider


the maximum and minimum of the function f (x, y) = 1 − xy − x − y.

• On B1 , f (x, y) = f (0, y) = 1 − 5y, for y ∈ [0, 3]. This is a single


variable extreme value problem for a continuous function. Its largest
value is assumed at y = 0 and equals 1 while the smallest value is
assumed at y = 3 and equals −14.
• On B2 , f (x, y) = f (x, 0) = 1 + 4x, for x ∈ [0, 2]. This is again a
single variable extreme value problem for a continuous function. Its
largest value is assumed at x = 2 and equals 9 while the smallest
value is assumed at x = 0 and equals 1.
• On B3 , f (x, y) = 1 + 4x − 5 (3 − (3/2)x) = −14 + (23/2)x, for x ∈
[0, 2]. This is also a single variable extreme value problem for a
continuous function. Its largest value is assumed at x = 2 and equals
9 while the smallest value is assumed at x = 0 and equals −14.
Thus, the absolute maximum is attained by f at (2, 0) while the absolute minimum is attained at (0, 3). Both extrema are at vertices of the polygon (triangle). This example illustrates the general procedure for determining the absolute maximum and minimum of a function on a closed, bounded set. However, the problem can become very hard in practice as the function f gets more complex.
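Procedure 5 is easy to mechanize for this example, since there are no interior critical points: a dense scan of the three boundary segments recovers the extrema. A minimal sketch, assuming numpy:

```python
import numpy as np

f = lambda x, y: 1 + 4 * x - 5 * y

t = np.linspace(0, 1, 1001)
segments = {
    "B1": (0 * t, 3 * t),        # x = 0, 0 <= y <= 3
    "B2": (2 * t, 0 * t),        # y = 0, 0 <= x <= 2
    "B3": (2 * t, 3 - 3 * t),    # y = 3 - (3/2)x, 0 <= x <= 2
}
vals = np.concatenate([f(x, y) for x, y in segments.values()])
print(vals.max(), vals.min())    # 9 (at (2, 0)) and -14 (at (0, 3))
```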
2. Let us look at a harder problem. Let us find the absolute maximum and
the absolute minimum of the function f (x, y) = 1 − xy − x − y on the
region R bounded by y = x2 and y = 4. This is not a linear function any
longer. The region R is shown in Figure 4.26 and is bounded by
• B1 : y = x2 , −2 ≤ x ≤ 2
• B2 : y = 4, −2 ≤ x ≤ 2
Since f (x, y) = 1 − xy − x − y is differentiable everywhere, the critical point of f is characterized by ∇f (x, y) = [−y − 1, −x − 1]T = 0, that is

x = −1, y = −1. However, this point does not lie in R and hence there are no critical points in R. Along similar lines of the previous problem, we will find the extreme values of f on the boundaries of R.

• On B1 , f (x, y) = 1 − x³ − x − x², for x ∈ [−2, 2]. This is a single variable extreme value problem for a continuous function. Its critical points correspond to solutions of 3x² + 2x + 1 = 0. However, this equation has no real solutions¹¹ and therefore, the function's extreme values are only at the boundary points; the minimum value −13 is attained at x = 2 and the maximum value 7 is attained at x = −2.
• On B2 , f (x, y) = 1 − 4x − x − 4 = −3 − 5x, for x ∈ [−2, 2]. This is again a single variable extreme value problem for a continuous function. It has no critical points and its extreme values correspond to the boundary points; its maximum value 7 is assumed at x = −2 while the minimum value −13 is assumed at x = 2.

Thus, the absolute maximum value 7 is attained by f at (−2, 4) while the


absolute minimum value −13 is attained at (2, 4).
3. Consider the same problem as the previous one, with a slightly different
objective function, f (x, y) = 1 + xy − x − y. The critical point of f is
characterized by ∇f (x, y) = [y − 1, x − 1]T = 0, that is x = 1, y = 1.
This lies within R and f takes the value 0 at (1, 1). Next, we find the
extreme values of f on the boundaries of R.

• On B1 , f (x, y) = 1 + x³ − x − x², for x ∈ [−2, 2]. Its critical points correspond to solutions of 3x² − 2x − 1 = 0. Its solutions are x = 1 and x = −1/3. The function values corresponding to these points are f (1, 1) = 0 and f (−1/3, 1/9) = 32/27. At the boundary points, the function assumes the values f (−2, 4) = −9 and f (2, 4) = 3. Thus, the maximum value on B1 is f (2, 4) = 3 and the minimum value is f (−2, 4) = −9.
• On B2 , f (x, y) = 1 + 4x − x − 4 = −3 + 3x, for x ∈ [−2, 2]. It has no critical points, so its extreme values correspond to the boundary points, where the function assumes the values f (−2, 4) = −9 and f (2, 4) = 3; these are the minimum and maximum values respectively of f on B2 .

Thus, the absolute maximum value 3 is attained by f at (2, 4) while the


absolute minimum value −9 is attained at (−2, 4).

4.1.5 Absolute extrema and Convexity


Theorem 61 specified a sufficient condition for the local minimum of a differ-
entiable function with continuous partial and mixed partial derivatives, while
11 The complex solutions are $x = -\frac{1}{3} + \frac{\sqrt{2}}{3}i$ and $x = -\frac{1}{3} - \frac{\sqrt{2}}{3}i$.

theorem 62 specified a necessary condition for the same. Can these conditions
be extended to globally optimal solutions? The answer is that the extensions
to globally optimal solutions can be made for a specific class of optimization
problems called convex optimization problems. In the next section we introduce the concepts of convex sets and convex functions, en route to discussing convex optimization.

4.2 Convex Optimization Problem


A function f (·) is called convex if its value at a convex combination of two points x and y is no more than the same convex combination of its values at the two points. In other words, f (·) is convex if and only if

$$f(\alpha x + \beta y) \le \alpha f(x) + \beta f(y) \quad \text{whenever } \alpha + \beta = 1,\; \alpha \ge 0,\; \beta \ge 0 \tag{4.19}$$

For a convex optimization problem, the objective function f (x) as well as the inequality constraint functions gi (x), i = 1, . . . , m are convex. The equality constraints are linear, i.e., of the form Ax = b.

minimize f (x)
subject to gi (x) ≤ 0, i = 1, . . . , m (4.20)
Ax = b
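Before examining special cases, here is a minimal sketch of how a program of the form (4.20) can be posed and solved in practice. It assumes the Python modeling package cvxpy (not part of this text); the least-squares objective and simplex constraints are illustrative stand-ins for f, the gi and Ax = b:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

x = cp.Variable(5)
objective = cp.Minimize(cp.sum_squares(A @ x - b))  # convex f(x)
constraints = [x >= 0,                              # g_i(x) <= 0
               cp.sum(x) == 1]                      # linear equality
prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.status, prob.value, x.value)
```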

Least squares and linear programming are special cases of convex optimization problems. As in the general case of linear programming, there are no closed-form analytical solutions for convex optimization problems, but they can be solved reliably, efficiently and optimally. Well-developed software for the general class of convex optimization problems is not as abundant, though there are several packages in Matlab, C, etc., including many free ones. The computation time is polynomial, but it is more complicated to express exactly because it depends on the cost of evaluating the function values and their derivatives. Modulo that, the computation time for convex optimization problems is similar to that for linear programming problems.
Posing practical problems as convex optimization problems is more difficult than recognizing least squares problems and linear programs. There exist many techniques to reformulate problems in convex form. Surprisingly, many problems in practice can be solved via convex optimization.

4.2.1 Why Convex Optimization?


We will see in this sequel, that generic convex programs, under mild computabil-
ity and boundedness assumptions, are computationally tractable. Many convex

programs admit theoretically and practically efficient solution methods. Convex


optimization admits duality theory, which can be used to quantitatively estab-
lish the quality of an approximate solution. Even though duality may not yield
a closed-form solution, it often facilitates nontrivial reformulations of the prob-
lem. Duality theory also comes in handy for confirming whether an approximate solution is optimal.

In contrast to this, it rarely happens that a global solution can be efficiently found for nonconvex optimization programs12 . For most nonconvex programs, there are no sound techniques for certifying the global optimality of approximate solutions or for estimating how far from optimal an approximate solution is.

4.2.2 History
Numerical optimization started in the 1940s with the development of the sim-
plex method for linear programming. The next obvious extension to linear
programming was by replacing the linear cost function with a quadratic cost
function. The linear inequality constraints were however maintained. This first
extension took place in the 1950s. We can expect that the next step would have
been to replace the linear constraints with quadratic constraints. But it did
not really happen that way. On the other hand, around the end of the 1960s,
there was another non-linear, convex extension of linear programming called
geometric programming. Geometric programming includes linear programming
as a special case. Nothing more happened until the beginning of the 1990s. The
beginning of the 1990s was marked by a big explosion of activities in the area
of convex optimization, and development really picked up. Researchers formulated different and more general classes of convex optimization problems that are
known as semidefinite programming, second-order cone programming, quadrat-
ically constrained quadratic programming, sum-of-squares programming, etc.
The same happened in terms of applications. Since 1990s, applications have
been investigated in many different areas. One of the first application areas was
control, and the optimization methods that were investigated included semidefinite programming for certain control problems. Geometric programming had
been around since late 1960s and it was applied extensively to circuit design
problems. Quadratic programming found application in machine learning prob-
lem formulations such as support vector machines. Semi-definite programming
relaxations found use in combinatorial optimization. There were many other
interesting applications in different areas such as image processing, quantum
information, finance, signal processing, communications, etc.
This first look at the activities involving applications of optimization clearly indicates that a lot of development took place around the 1990s. Further,
people extended interior-point methods (which were already known for linear

12 Optimization problems such as singular value decomposition are among the few exceptions to this.

programming since 1984¹³) to non-linear convex optimization problems. A high-


light in this area was the work of Nesterov and Nemirovski who extended Kar-
markar’s work to polynomial-time interior-point methods for nonlinear convex
programming in their book published in 1994, though the work actually took
place in 1990. As a result, people started looking at non-linear convex opti-
mization in a special way; instead of treating non-linear convex optimization as
a special case of non-linear optimization, they looked at it as an extension of
linear programming which can be solved almost with the same efficiency. Once
people started looking at applications of non-linear convex optimization, they
discovered many!
We will begin with a background on convex sets and functions. Convex
sets and functions constitute the basic theory for the entire area of convex
optimization. Next, we will discuss some standard problem classes and some
recently studied problem classes such as semi-definite programming and cone
programming. Finally, we will look at applications.

4.2.3 Affine Set


Definition 31 [Affine Set]: A set A is called affine if the line connecting
any two distinct points in the set is completely contained within A. Math-
ematically, the set A is called affine if

∀ x1 , x2 ∈ A, θ ∈ ℜ ⇒ θx1 + (1 − θ)x2 ∈ A (4.21)

Theorem 66 The solution set of the system of linear equations Ax = b is an


affine set.

Proof: Suppose x1 and x2 are solutions to the system Ax = b with x1 ≠ x2 . Then, A (θx1 + (1 − θ)x2 ) = θb + (1 − θ)b = b. Thus, θx1 + (1 − θ)x2 is also a solution, implying that the solution set of the system Ax = b is an affine set. 2
In fact, the converse of theorem 66 is also true: any affine set can be expressed as the solution set of a system of linear equations Ax = b.

4.2.4 Convex Set


Definition 32 [Convex Set]: A set C is called convex if the line segment
connecting any two points in the set is completely contained within C.
Else C is called concave. That is,

∀ x1 , x2 ∈ C, 0 ≤ θ ≤ 1 ⇒ θx1 + (1 − θ)x2 ∈ C (4.22)



Figure 4.27: Examples of convex and non-convex sets.

Figure 4.27 shows examples of convex and non-convex (concave) sets. Since
an affine set contains any line passing through two distinct points in the set, it
also contains any line segment connecting two points in the set. Thus, an affine
set is our first example of a convex set.
A set C is a convex cone if it is convex and additionally, for every point x ∈ C, all non-negative multiples of x are also in C. In other words,

∀ x1 , x2 ∈ C, θ1 , θ2 ≥ 0 ⇒ θ1 x1 + θ2 x2 ∈ C (4.23)

Combinations of the form θ1 x1 + θ2 x2 for θ1 ≥ 0, θ2 ≥ 0 are called conic com-


binations. We will state a related definition next - that of the convex hull of a
set of points.

Definition 33 [Convex Hull]: A convex combination of the set of points S = {x1 , x2 , . . . , xk } is any point x of the form

$$x = \sum_{i=1}^{k} \theta_i x_i \quad \text{with} \quad \sum_{i=1}^{k} \theta_i = 1 \;\text{ and }\; \theta_i \ge 0 \tag{4.24}$$

The convex hull conv(S) of the set of points S is the set of all convex combinations of points in S. The convex hull of a convex set S is S itself.
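A quick illustration of definition 33: scipy's ConvexHull (an assumed dependency, used purely for illustration) recovers the extreme points of conv(S) for a finite point set, and any convex combination of points of S lies inside it:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
S = rng.normal(size=(30, 2))       # a finite set of points in R^2

hull = ConvexHull(S)
print(S[hull.vertices])            # extreme points of conv(S)

# a random convex combination: theta_i >= 0 and sum(theta) = 1
theta = rng.random(30)
theta /= theta.sum()
x = theta @ S                      # x is guaranteed to lie in conv(S)
```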

13 The first practical polynomial time algorithm for linear programming by Karmarkar (1984)

involved interior-point methods.



Figure 4.28: Example of a hyperplane in ℜ2 .

4.2.5 Examples of Convex Sets


We will look at simple but important examples of convex sets. We will also look at some operations that preserve convexity.

A hyperplane is the most common example of a convex set. A hyperplane is the set of solutions to a linear system of equations of the form aT x = b with a ≠ 0 and was defined earlier in definition 24. A half-space is the solution set of a linear inequality aT x ≤ b, a ≠ 0. The hyperplane aT x = b bounds the half-space from one side.

Formally,

Hyperplane: {x | aT x = b, a ≠ 0}. Figure 4.28 shows an example hyperplane in ℜ2 ; a is the normal vector.

Halfspace: {x | aT x ≤ b, a ≠ 0}. Figure 4.29 shows an example half-space in ℜ2 .

The hyperplane is convex and affine, whereas the halfspace is merely convex and not affine.
Another simple example of a convex set is a closed ball in ℜn with radius r
and center xc which is an n-dimensional vector.

B[xc , r] = {xc + ru | ||u||2 ≤ 1}

where u is a vector with norm less than or equal to 1. The open ball B(xc , r)
is also convex. Replacing r with a non-singular square matrix A, we get an
ellipsoid given by

{xc + Au | ||u||2 ≤ 1}
which is also a convex set. Another equivalent representation of the ellipsoid can
be obtained by observing that for any point x in the ellipsoid, ||A−1 (x−xc )||2 ≤
1, that is (x − xc )T (A−1 )T A−1 (x − xc ) ≤ 1. Since (A−1 )T = (AT )−1 and

Figure 4.29: Example of a half-space in ℜ2 .

Figure 4.30: Example of an ellipsoid in ℜ2 .


A−1 B −1 = (BA)−1 , the ellipsoid can be equivalently defined as $\{x \,|\, (x - x_c)^T P^{-1} (x - x_c) \le 1\}$, where P = (AAT ) is a symmetric matrix. Furthermore, P is positive definite, since A is non-singular (c.f. page 208).

The matrix A determines the size of the ellipsoid; the length of the ith semi-axis of the ellipsoid is $\sqrt{\lambda_i}$, where λi is the ith eigenvalue of P (see page number 206). The ellipsoid is another example of a convex set and is a generalization of the Euclidean ball. Figure 4.30 illustrates an ellipsoid in ℜ2 .
A norm ball is a ball with an arbitrary norm. A norm ball with center xc
and radius r is given by
{x | ||x − xc || ≤ r}
By the definition of the norm, a ball in that norm will be convex. The norm
ball with the ∞−norm corresponds to a square in ℜ2 , while the norm ball with
the 1−norm in ℜ2 corresponds to the same square rotated by 45◦ . The norm
ball is convex for all norms.
The definition of cone can be extended to any arbitrary norm to define a

Figure 4.31: Example of a cone in ℜ2 .

norm cone. The set of all pairs (x, t) satisfying ||x|| ≤ t, i.e.,

{(x, t) | ||x|| ≤ t}

is called a norm cone. When the norm is the Euclidean norm, the cone (which looks like an ice-cream cone) is called the second order cone. Norm cones are always convex.
Figure 4.31 shows a cone in ℜ2 . In general, the cross section of a norm cone has
the shape of a norm ball with the same norm. The norm cone for the ∞−norm
is a square pyramid in ℜ3 and the cone for 1−norm in ℜ3 is the same square
pyramid rotated by 45◦ .
A polyhedron is another convex set which is given as the solution set of a finite set of linear equalities and inequalities. In matrix form, the defining conditions can be stated as

$$Ax \preceq b, \quad A \in \Re^{m \times n} \qquad\qquad Cx = d, \quad C \in \Re^{p \times n} \tag{4.25}$$

where ⪯ stands for component-wise inequality of the form ≤14 . A polyhedron can also be represented as the intersection of a finite number of halfspaces and hyperplanes. Figure 4.32 depicts a typical polyhedron in ℜ2 . An affine set is a special type of polyhedron.
A last simple example is the positive semi-definite cone. Let $S^n$ be the set of all symmetric n × n matrices and $S^n_+ \subset S^n$ be the set of all positive semi-definite n × n matrices. The set $S^n_+$ is a convex cone and is called the positive semi-definite cone. Consider a positive semi-definite matrix S in ℜ2 . Then S must be of the form

14 The component-wise inequality corresponds to a generalized inequality $\preceq_K$ with $K = \Re^n_+$.

Figure 4.32: Example of a polyhedron in ℜ2 .

Figure 4.33: Example of a positive semidefinite cone in ℜ2 .

" #
x y
S= (4.26)
y z
2 2
We can represent the space of matrices S+ of the form S ∈ S+ as a three
dimensional space with non-negative x, y and z coordinates and a non-negative
determinant. This space corresponds to a cone as shown in Figure 4.33.

4.2.6 Convexity preserving operations


In practice if you want to establish the convexity of a set C, you could either
1. prove it from first principles, i.e., using the definition of convexity or
2. prove that C can be built from simpler convex sets through some basic
operations which preserve convexity.

Figure 4.34: Plot for the function in (4.28)

Some of the important operations that preserve convexity are:

Intersection
The intersection of any number of convex sets is convex15 . Consider the set S:

$$S = \left\{x \in \Re^m \,\middle|\, |p(t)| \le 1 \text{ for } |t| \le \frac{\pi}{3}\right\} \tag{4.27}$$

where

$$p(t) = x_1 \cos t + x_2 \cos 2t + \ldots + x_m \cos mt \tag{4.28}$$
Any value of t that satisfies |p(t)| ≤ 1, defines two regions, viz.,

ℜ≤ (t) = {x | x1 cos t + x2 cos 2t + . . . + xm cos mt ≤ 1}


and

ℜ≥ (t) = {x | x1 cos t + x2 cos 2t + . . . + xm cos mt ≥ −1}


Each of these regions is convex and, for a given value of t, the set of points that may lie in S is given by

ℜ(t) = ℜ≤ (t) ∩ ℜ≥ (t)


This set is also convex. However, not all the points in ℜ(t) lie in S, since
the points that lie in S satisfy the inequalities for every value of t. Thus, S can
be given as:

$$S = \bigcap_{|t| \le \pi/3} \Re(t)$$

15 Exercise: Prove.

Figure 4.35: Illustration of the closure property for S defined in (4.27), for
m = 2.

Affine transform
An affine transform is one that preserves

• Collinearity between points, i.e., three points which lie on a line continue
to be collinear after the transformation.

• Ratios of distances along a line, i.e., for distinct collinear points p1 , p2 , p3 , the ratio ||p2 − p1 || / ||p3 − p2 || is preserved.

An affine transformation or affine map between two vector spaces f : ℜn → ℜm consists of a linear transformation followed by a translation:

x ↦ Ax + b

where A ∈ ℜm×n and b ∈ ℜm . In the finite-dimensional case, each affine transformation is given by a matrix A and a vector b.
The image and pre-image of a convex set under an affine transformation of the form

$$f(x) = \sum_{i=1}^{n} x_i a_i + b$$

are convex sets16 . Here ai is the ith column of A. The following are examples of convex sets that are either images or inverse images of convex sets under affine transformations:

1. the solution set of a linear matrix inequality (Ai , B ∈ S m )

   $$\{x \in \Re^n \,|\, x_1A_1 + \ldots + x_nA_n \preceq B\}$$

   is a convex set. Here A ⪯ B means that B − A is positive semi-definite17 . This set is the inverse image under an affine mapping of the positive semi-definite cone. That is, $f^{-1}(S^m_+) = \{x \in \Re^n \,|\, B - (x_1A_1 + \ldots + x_nA_n) \in S^m_+\} = \{x \in \Re^n \,|\, x_1A_1 + \ldots + x_nA_n \preceq B\}$.

16 Exercise: Prove.
2. the hyperbolic cone ($P \in S^n_+$), which is the inverse image of the norm cone $C_{m+1} = \{(z, u) \,|\, ||z|| \le u,\; u \ge 0,\; z \in \Re^m\} = \{(z, u) \,|\, z^Tz - u^2 \le 0,\; u \ge 0,\; z \in \Re^m\}$, is a convex set. The inverse image is given by $f^{-1}(C_{m+1}) = \{x \in \Re^n \,|\, (Ax, c^Tx) \in C_{m+1}\} = \{x \in \Re^n \,|\, x^TA^TAx - (c^Tx)^2 \le 0\}$. Setting $P = A^TA$, we get the equation of the hyperbolic cone:

   $$\{x \,|\, x^TPx \le (c^Tx)^2, \; c^Tx \ge 0\}$$

Perspective and linear-fractional functions


The perspective function P : ℜn+1 → ℜn is defined as follows:

$$P(x, t) = x/t, \qquad \mathbf{dom}\, P = \{(x, t) \,|\, t > 0\} \tag{4.29}$$

The linear-fractional function f is a generalization of the perspective function and is defined as:

$$f : \Re^n \to \Re^m, \qquad f(x) = \frac{Ax + b}{c^Tx + d}, \qquad \mathbf{dom}\, f = \{x \,|\, c^Tx + d > 0\} \tag{4.30}$$
The images and inverse images of convex sets under perspective and linear-
fractional functions are convex18 .
As an example, consider the linear-fractional function $f(x) = \frac{1}{x_1 + x_2 + 1}\,x$; the image of a convex set in ℜ2 under f is again a convex set.

Supporting Hyperplane Theorem


In Section 4.1.4, we introduced the concept of the hyperplane. For disjoint convex sets, we state the separating hyperplane theorem.

Theorem 67 If C and D are disjoint convex sets, i.e., C ∩ D = ∅, then there exist a ≠ 0 and b ∈ ℜ such that

aT x ≤ b for x ∈ C, and aT x ≥ b for x ∈ D.

That is, the hyperplane $\{x \,|\, a^Tx = b\}$ separates C and D. The separating hyperplane need not be unique though.
17 The inequality induced by positive semi-definiteness corresponds to a generalized inequality $\preceq_K$ with $K = S^n_+$.
18 Exercise: Prove.

Proof: We first note that the set S = {x − y | x ∈ C, y ∈ D} is convex, since it is the sum of two convex sets. Since C and D are disjoint, 0 ∉ S. Consider two cases:

1. Suppose 0 ∉ closure(S). Let E = {0} and F = closure(S). Then, the Euclidean distance between E and F, defined as

   dist(E, F) = inf {||u − v||2 | u ∈ E, v ∈ F}

   is positive, and there exists a point f ∈ F that achieves the minimum distance, i.e., ||f ||2 = dist(E, F). Define a = f and b = ||f ||2² / 2. Then a ≠ 0 and the affine function g(x) = aT x − b = f T (x − ½f ) is nonpositive on E and nonnegative on F; that is, the hyperplane $\{x \,|\, a^Tx = b\}$ separates E and F. Thus, aT (x − y) > 0 for all x − y ∈ S ⊆ closure(S), which implies that aT x ≥ aT y for all x ∈ C and y ∈ D.
2. Suppose 0 ∈ closure(S). Since 0 ∉ S, it must be on the boundary of S.

   • If S has empty interior, it must lie in an affine set of dimension less than n, and any hyperplane containing that affine set contains S. In other words, S is contained in a hyperplane $\{z \,|\, a^Tz = b\}$, which must include the origin, and therefore b = 0. Then aT x = aT y for all x ∈ C and all y ∈ D, which gives us a trivial separating hyperplane.
   • If S has a nonempty interior, consider the set

     S−ǫ = {z | B(z, ǫ) ⊆ S}

     where B(z, ǫ) is the Euclidean ball with center z and radius ǫ > 0. S−ǫ is the set S, shrunk by ǫ. closure(S−ǫ ) is closed and convex, and does not contain 0, so as argued before, it is separated from {0} by at least one hyperplane with normal vector a(ǫ) such that

     a(ǫ)T z ≥ 0 for all z ∈ S−ǫ

     Without loss of generality assume ||a(ǫ)||2 = 1. Let ǫk , k = 1, 2, . . ., be a sequence of positive values with $\lim_{k \to \infty} \epsilon_k = 0$. Since ||a(ǫk )||2 = 1 for all k, the sequence a(ǫk ) contains a convergent subsequence; let a be its limit. We have
a(ǫk )T z ≥ 0 for all z ∈ S−ǫk
and therefore aT z ≥ 0 for all z ∈ interior(S), and aT z ≥ 0 for all
z ∈ S, which means
aT x ≥ aT y for all x ∈ C, and y ∈ D.
2
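The first case of the proof is constructive and easy to reproduce numerically when C and D are convex hulls of finite point sets: minimize the distance between the two sets and take the normal a as the difference of the closest points. A sketch assuming the cvxpy package (an illustrative choice, not prescribed by the text):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(10, 2)) + 3.0   # shifted so that conv(C) and
D = rng.normal(size=(10, 2)) - 3.0   # conv(D) are (almost surely) disjoint

p, q = cp.Variable(10), cp.Variable(10)
u, v = C.T @ p, D.T @ q              # points of conv(C) and conv(D)
prob = cp.Problem(cp.Minimize(cp.norm(u - v)),
                  [p >= 0, cp.sum(p) == 1, q >= 0, cp.sum(q) == 1])
prob.solve()

a = u.value - v.value                # normal vector, as in the proof
b = a @ (u.value + v.value) / 2      # hyperplane through the midpoint
print(a, b)
```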
Theorem 59 stated that the gradient evaluated at a point on a level set is orthogonal to the tangent hyperplane to the level set at that point. We now state the definition of a supporting hyperplane, which is a special type of tangent hyperplane.
Definition 34 [Supporting Hyperplane]: The supporting hyperplane to a set C at a boundary point x0 is defined as $\{x \,|\, a^Tx = a^Tx_0\}$, where a ≠ 0 and aT y ≤ aT x0 ∀ y ∈ C.

Figure 4.36: Example of a supporting hyperplane.

Figure 4.36 shows a supporting hyperplane at the point x0 on the boundary


of a convex set C.
For convex sets, there is an important theorem regarding supporting hyper-
planes.

Theorem 68 If the set C is convex, then there exists a supporting hyperplane at every boundary point of C. As in the case of the separating hyperplane, the supporting hyperplane need not be unique.

Proof: If the interior of C is nonempty, the result follows immediately by ap-


plying the separating hyperplane theorem to the sets {x0 } and interior(C). If
the interior of C is empty, then C must lie in an affine set of dimension less than
n, and any hyperplane containing that affine set contains C and x0 , and is a
(trivial) supporting hyperplane. 2

4.2.7 Convex Functions


Definition 35 [Convex Function]: A function f : D → ℜ is convex if D is
a convex set and

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y) ∀ x, y ∈ D 0 ≤ θ ≤ 1 (4.31)

Figure 4.37 illustrates an example convex function. A function f : D → ℜ is strictly convex if D is convex and

f (θx + (1 − θ)y) < θf (x) + (1 − θ)f (y) ∀ x, y ∈ D, x ≠ y, 0 < θ < 1 (4.32)


Figure 4.37: Example of convex function.

A function f : D → ℜ is called uniformly or strongly convex if D is convex and there exists a constant c > 0 such that

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y) − ½cθ(1 − θ)||x − y||² ∀ x, y ∈ D, 0 ≤ θ ≤ 1 (4.33)

A function f : ℜn → ℜ is said to be concave if the function −f is convex.
Examples of convex functions on the set of reals ℜ as well as on ℜn and ℜm×n
are shown in Table 4.1. Examples of concave functions on the set of reals ℜ are
shown in Table 4.2. If a function is both convex and concave, it must be affine,
as can be seen in the two tables.
Function type                                                                          Domain     Additional Constraints
The affine function: ax + b                                                            ℜ          Any a, b ∈ ℜ
The exponential function: e^{ax}                                                       ℜ          Any a ∈ ℜ
Powers: x^α                                                                            ℜ++        α ≥ 1 or α ≤ 0
Powers of absolute value: |x|^p                                                        ℜ          p ≥ 1
Negative entropy: x log x                                                              ℜ++
Affine functions of vectors: a^T x + b                                                 ℜ^n
p-norms of vectors: ||x||_p = (Σ_{i=1}^n |x_i|^p)^{1/p}                                ℜ^n        p ≥ 1
∞-norm of vectors: ||x||_∞ = max_k |x_k|                                               ℜ^n
Affine functions of matrices: tr(A^T X) + b = Σ_{i=1}^m Σ_{j=1}^n A_{ij} X_{ij} + b    ℜ^{m×n}
Spectral (maximum singular value) norm: ||X||_2 = σ_max(X) = (λ_max(X^T X))^{1/2}      ℜ^{m×n}

Table 4.1: Examples of convex functions on ℜ, ℜn and ℜm×n .

4.2.8 Convexity and Global Minimum

One of the most fundamental and useful characteristics of convex functions is that any point of local minimum of a convex function is also a point of global minimum.

Function type                   Domain    Additional Constraints
The affine function: ax + b     ℜ         Any a, b ∈ ℜ
Powers: x^α                     ℜ++       0 ≤ α ≤ 1
Logarithm: log x                ℜ++

Table 4.2: Examples of concave functions on ℜ.

Theorem 69 Let f : D → ℜ be a convex function on a convex domain D. Any point of local minimum of f is also a point of its global minimum.

Proof: Suppose x ∈ D is a point of local minimum that is not a point of global minimum, and let y ∈ D be a point of global minimum, so that f (y) < f (x). Since x corresponds to a local minimum, there exists an ǫ > 0 such that

∀ z ∈ D, ||z − x|| ≤ ǫ ⇒ f (z) ≥ f (x)

Consider the point z = θy + (1 − θ)x with θ = ǫ/(2||y − x||). Since x is a point of local minimum (in a ball of radius ǫ), and since f (y) < f (x), it must be that ||y − x|| > ǫ. Thus, 0 < θ < ½ and z ∈ D. Furthermore, ||z − x|| = ǫ/2. Since f is a convex function,

f (z) ≤ θf (y) + (1 − θ)f (x)

Since f (y) < f (x), we also have

θf (y) + (1 − θ)f (x) < f (x)

The two inequalities imply that f (z) < f (x), which contradicts our assumption that x corresponds to a point of local minimum. That is, f cannot have a point of local minimum that does not coincide with the point y of global minimum. 2
Since any locally minimum point for a convex function also corresponds to its global minimum, we will drop the qualifiers 'locally' as well as 'globally' while referring to the points corresponding to minimum values of a convex function. For any strictly convex function, the point corresponding to the global minimum is also unique, as stated in the following theorem.

Theorem 70 Let f : D → ℜ be a strictly convex function on a convex domain


D. Then f has a unique point corresponding to its global minimum.

Proof: Suppose x ∈ D and y ∈ D with y ≠ x are two points of global minimum. That is, f (x) = f (y) with y ≠ x. The point (x + y)/2 also belongs to the convex set D and, since f is strictly convex, we must have

$$f\left(\frac{x + y}{2}\right) < \frac{1}{2}f(x) + \frac{1}{2}f(y) = f(x)$$

which is a contradiction. Thus, the point corresponding to the minimum of f must be unique. 2

In the following section, we state some important properties of convex func-


tions, including relationships between convex functions and convex sets, and
first and second order conditions for convexity. We will also draw relationships
between the definitions of convexity and strict convexity stated here, with the
definitions on page 224 for the single variable case.

4.2.9 Properties of Convex Functions


We will first extend the domain of a convex function to all ℜn , while retaining
its convexity and preserving its value in the domain.
Definition 36 [Extended value extension]: If f : D → ℜ, with D ⊆ ℜn , is a convex function, then we define its extended-value extension $\tilde{f} : \Re^n \to \Re \cup \{\infty\}$ as

$$\tilde{f}(x) = \begin{cases} f(x) & \text{if } x \in D \\ \infty & \text{if } x \notin D \end{cases} \tag{4.34}$$

In what follows, we will assume, if necessary, that all convex functions are implicitly extended to the domain ℜn . A useful technique for verifying the convexity of a function is to restrict the function to a line and check the convexity of the resulting function of a single variable. This technique is hinged on the following theorem.
Theorem 71 A function f : D → ℜ is (strictly) convex if and only if the
function φ : Dφ → ℜ defined below, is (strictly) convex in t for every x ∈ ℜn
and for every h ∈ ℜn
φ(t) = f (x + th)
with the domain of φ given by Dφ = {t|x + th ∈ D}.
Proof: We will prove the necessity and sufficiency of the convexity of φ for a
convex function f . The proof for necessity and sufficiency of the strict convexity
of φ for a strictly convex f is very similar and is left as an exercise.
Proof of Necessity: Assume that f is convex. And we need to prove that
φ(t) = f (x + th) is also convex. Let t1 , t2 ∈ Dφ and θ ∈ [0, 1]. Then,

φ(θt1 + (1 − θ)t2 ) = f (θ(x + t1 h) + (1 − θ)(x + t2 h))


≤ θf ((x + t1 h)) + (1 − θ)f ((x + t2 h)) = θφ(t1 ) + (1 − θ)φ(t2 ) (4.35)

Thus, φ is convex.
Proof of Sufficiency: Assume that for every h ∈ ℜn and every x ∈ ℜn ,
φ(t) = f (x + th) is convex. We will prove that f is convex. Let x1 , x2 ∈ D.
Take, x = x1 and h = x2 − x1 . We know that φ(t) = f (x1 + t(x2 − x1 )) is
convex, with φ(1) = f (x2 ) and φ(0) = f (x1 ). Therefore, for any θ ∈ [0, 1]

f (θx2 + (1 − θ)x1 ) = φ(θ) ≤ θφ(1) + (1 − θ)φ(0) = θf (x2 ) + (1 − θ)f (x1 ) (4.36)

This implies that f is convex. 2
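Theorem 71 also suggests a cheap numerical sanity check (not a proof): restrict f to a random line and test that the discrete second differences of φ(t) = f (x + th) are nonnegative. A sketch assuming numpy, with log-sum-exp as the test function:

```python
import numpy as np

f = lambda v: np.log(np.sum(np.exp(v)))      # log-sum-exp, known convex

rng = np.random.default_rng(0)
x, h = rng.normal(size=3), rng.normal(size=3)

t = np.linspace(-2.0, 2.0, 401)
phi = np.array([f(x + ti * h) for ti in t])  # phi(t) = f(x + t h)

# a convex phi has nonnegative second differences (up to rounding)
print(np.all(np.diff(phi, 2) >= -1e-10))
```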


Next, we will draw the parallel between convex sets and convex functions by
introducing the concept of the epigraph of a function.

Definition 37 [Epigraph]: Let D ⊆ ℜn be a nonempty set and f : D → ℜ. The set {(x, f (x)) | x ∈ D} is called the graph of f and lies in ℜn+1 . The epigraph of f is a subset of ℜn+1 and is defined as

epi(f ) = {(x, α) | f (x) ≤ α, x ∈ D, α ∈ ℜ} (4.37)

In some sense, the epigraph is the set of points lying above the graph of f .
Similarly, the hypograph of f is a subset of ℜn+1 , lying below the graph of
f and is defined by

hyp(f ) = {(x, α)|f (x) ≥ α, x ∈ D, α ∈ ℜ} (4.38)

There is a one to one correspondence between the convexity of function f and


that of the set epi(f ), as stated in the following theorem.

Theorem 72 Let D ⊆ ℜn be a nonempty convex set, and f : D → ℜ. Then f


is convex if and only if epi(f ) is a convex set.

Proof: Let f be convex. For any (x1 , α1 ) ∈ epi(f ) and (x2 , α2 ) ∈ epi(f ) and
any θ ∈ (0, 1),

f (θx1 + (1 − θ)x2 ) ≤ θf (x1 ) + (1 − θ)f (x2 )) ≤ θα1 + (1 − θ)α2

Since D is convex, θx1 +(1−θ)x2 ∈ D. Therefore, (θx1 + (1 − θ)x2 , θα1 + (1 − θ)α2 ) ∈


epi(f ). Thus, epi(f ) is convex if f is convex. This proves the necessity part.
To prove sufficiency, assume that epi(f ) is convex. Let x1 , x2 ∈ D. So, (x1 , f (x1 )) ∈ epi(f ) and (x2 , f (x2 )) ∈ epi(f ). Since epi(f ) is convex, for θ ∈ (0, 1),

(θx1 + (1 − θ)x2 , θf (x1 ) + (1 − θ)f (x2 )) ∈ epi(f )

which implies that f (θx1 + (1 − θ)x2 ) ≤ θf (x1 ) + (1 − θ)f (x2 ) for any θ ∈ (0, 1). This proves the sufficiency. 2
There is also a correspondence between the convexity of a function and the
convexity of its sublevel sets.

Definition 38 [Sublevel Sets]: Let D ⊆ ℜn be a nonempty set and f : D →


ℜ. The set
Lα (f ) = {x|x ∈ D, f (x) ≤ α}
is called the α−sub-level set of f .
The correspondence between the convexity of f and its α−sub-level set is stated
in the following theorem. Unlike the correspondence with the epigraph, the
correspondence with the α−sub-level set is not one to one.

Theorem 73 Let D ⊆ ℜn be a nonempty convex set, and f : D → ℜ be a


convex function. Then Lα (f ) is a convex set for any α ∈ ℜ.

Proof: Consider x1 , x2 ∈ Lα (f ). Then by definition of the level set, x1 , x2 ∈ D,


f (x1 ) ≤ α and f (x2 ) ≤ α. From convexity of D it follows that for all θ ∈ (0, 1),
x = θx1 + (1 − θ)x2 ∈ D. Moreover, since f is also convex,

f (x) ≤ θf (x1 ) + (1 − θ)f (x2 ) ≤ θα + (1 − θ)α = α

which implies that x ∈ Lα (f ). Thus, Lα (f ) is a convex set. 2

The converse of this theorem does not hold. To illustrate this, consider the function $f(x_1, x_2) = \frac{x_2}{1 + 2x_1^2}$. The 0-sublevel set of this function is {(x1 , x2 ) | x2 ≤ 0}, which is convex. However, the function f itself is not convex.
An important property of a convex function is that it is continuous in the
interior of its domain.
Theorem 74 Let f : D → ℜ be a convex function with D ⊆ ℜn being a convex
set. Let S ⊂ D be an open convex set. Then f is continuous on S.
Proof: Let us consider a point x0 ∈ S. Since S is an open convex set, we can find n + 1 points x1 , x2 , . . . , xn+1 such that the interior of the convex hull

$$C = \left\{x \,\middle|\, x = \sum_{i=1}^{n+1} a_i x_i, \; a_i \ge 0, \; \sum_{i=1}^{n+1} a_i = 1\right\}$$

is not empty and x0 ∈ interior(C). Let $M = \max_{1 \le i \le n+1} f(x_i)$. Then, for any $x = \sum_{i=1}^{n+1} a_i x_i \in C$,

$$f(x) = f\left(\sum_{i=1}^{n+1} a_i x_i\right) \le \sum_{i=1}^{n+1} a_i f(x_i) \le M$$

Since x0 ∈ interior(C), there exists a δ > 0 such that B(x0 , δ) ⊂ C, where B(x0 , δ) = {x | ||x − x0 || ≤ δ}. Therefore, x0 can be expressed as a convex combination of x0 + θh and x0 − h, for h with ||h|| ≤ δ and θ ∈ [0, 1]:

$$x_0 = \frac{1}{1 + \theta}(x_0 + \theta h) + \frac{\theta}{1 + \theta}(x_0 - h)$$

Since f is convex on C,

$$f(x_0) \le \frac{1}{1 + \theta} f(x_0 + \theta h) + \frac{\theta}{1 + \theta} f(x_0 - h)$$

From this, we can conclude that

f (x0 + θh) − f (x0 ) ≥ θ (f (x0 ) − f (x0 − h)) ≥ −θ (M − f (x0 )) (4.39)

On the other hand,

f (x0 + θh) ≤ θf (x0 + h) + (1 − θ)f (x0 )

which implies that

f (x0 + θh) − f (x0 ) ≤ θ (f (x0 + h) − f (x0 )) ≤ θ (M − f (x0 )) (4.40)

From equations 4.39 and 4.40, we can infer that

|f (x0 + θh) − f (x0 )| ≤ θ|f (x0 ) − M |

For a given ǫ > 0, select δ′ ≤ δ such that δ′ |f (x0 ) − M | ≤ ǫδ. Then any d = θh with ||h|| = δ and θ ≤ δ′ /δ satisfies |f (x0 + d) − f (x0 )| ≤ ǫ. This proves the theorem. 2
Analogous to the definition of increasing functions introduced on page num-
ber 220, we next introduce the concept of monotonic functions. This concept is
very useful for characterization of a convex function.

Definition 39 Let f : D → ℜn and D ⊆ ℜn . Then


1. f is monotone on D if for any x1 , x2 ∈ D,

   (f (x1 ) − f (x2 ))T (x1 − x2 ) ≥ 0 (4.41)

2. f is strictly monotone on D if for any x1 , x2 ∈ D with x1 ≠ x2 ,

   (f (x1 ) − f (x2 ))T (x1 − x2 ) > 0 (4.42)

3. f is uniformly or strongly monotone on D if there is a constant c > 0 such that for any x1 , x2 ∈ D,

   (f (x1 ) − f (x2 ))T (x1 − x2 ) ≥ c||x1 − x2 ||2 (4.43)

First-Order Convexity Conditions


The first order convexity condition for differentiable functions is provided by
the following theorem:

Theorem 75 Let f : D → ℜ be a differentiable convex function on an open


convex set D. Then:

1. f is convex if and only if, for any x, y ∈ D,

f (y) ≥ f (x) + ∇T f (x)(y − x) (4.44)

2. f is strictly convex on D if and only if, for any x, y ∈ D, with x 6= y,

f (y) > f (x) + ∇T f (x)(y − x) (4.45)

3. f is strongly convex on D if and only if, for any x, y ∈ D,

   $$f(y) \ge f(x) + \nabla^T f(x)(y - x) + \frac{1}{2}c||y - x||^2 \tag{4.46}$$

   for some constant c > 0.

Proof:
Sufficiency: The proof of sufficiency is very similar for all the three state-
ments of the theorem. So we will prove only for statement (4.44). Suppose
(4.44) holds. Consider x1 , x2 ∈ D and any θ ∈ (0, 1). Let x = θx1 + (1 − θ)x2 .
Then,

f (x1 ) ≥ f (x) + ∇T f (x)(x1 − x)


f (x2 ) ≥ f (x) + ∇T f (x)(x2 − x) (4.47)

Adding (1 − θ) times the second inequality to θ times the first, we get,

θf (x1 ) + (1 − θ)f (x2 ) ≥ f (x)

which proves that f (x) is a convex function. In the case of strict convexity,
strict inequality holds in (4.47) and it follows through. In the case of strong
convexity, we need to additionally prove that

$$\theta\,\frac{c}{2}||x - x_1||^2 + (1 - \theta)\,\frac{c}{2}||x - x_2||^2 = \frac{c}{2}\theta(1 - \theta)||x_2 - x_1||^2$$
2 2 2

Figure 4.38: Figure illustrating Theorem 75.

Necessity: Suppose f is convex. Then for all θ ∈ (0, 1) and x1 , x2 ∈ D, we must have

f (θx2 + (1 − θ)x1 ) ≤ θf (x2 ) + (1 − θ)f (x1 )

Thus,

$$\nabla^T f(x_1)(x_2 - x_1) = \lim_{\theta \to 0} \frac{f(x_1 + \theta(x_2 - x_1)) - f(x_1)}{\theta} \le f(x_2) - f(x_1)$$
This proves necessity for (4.44). The necessity proofs for (4.45) and (4.46) are
very similar, except for a small difference for the case of strict convexity; the
strict inequality is not preserved when we take limits. Suppose equality does
hold in the case of strict convexity, that is for a strictly convex function f , let

f (x2 ) = f (x1 ) + ∇T f (x1 )(x2 − x1 ) (4.48)

for some x2 ≠ x1 . Because f is strictly convex, for any θ ∈ (0, 1) we can write

f (x1 + θ(x2 − x1 )) = f (θx2 + (1 − θ)x1 ) < θf (x2 ) + (1 − θ)f (x1 ) = f (x1 ) + θ∇T f (x1 )(x2 − x1 ) (4.49)

where the last equality uses (4.48). On the other hand, since (4.44) is already proved for convex functions, we have

f (x1 + θ(x2 − x1 )) ≥ f (x1 ) + θ∇T f (x1 )(x2 − x1 )

which contradicts (4.49). Thus, equality can never hold in (4.44) for any x1 ≠ x2 .
This proves the necessity of (4.45). 2
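Inequality (4.44) is also easy to check numerically for a known convex function; the sketch below (assuming numpy) verifies the first-order lower bound for log-sum-exp at random pairs of points:

```python
import numpy as np

f = lambda v: np.log(np.sum(np.exp(v)))

def grad(v):
    z = np.exp(v - v.max())      # gradient of log-sum-exp is the softmax
    return z / z.sum()

rng = np.random.default_rng(1)
for _ in range(5):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # first-order condition (4.44): f(y) >= f(x) + grad(x)^T (y - x)
    print(f(y) >= f(x) + grad(x) @ (y - x) - 1e-12)
```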
The geometrical interpretation of theorem 75 is that at any point, the linear
approximation based on a local derivative gives a lower estimate of the function,
i.e. the convex function always lies above the supporting hyperplane at that
point. This is pictorially depicted in Figure 4.38. There are some implications
of theorem 75 for strongly convex functions. We state them next.

Definition 40 [Some corollaries of theorem 75 for strongly convex functions]:

For a fixed x, the right hand side of the inequality (4.46) is a convex quadratic function of y. Thus, the critical point of the RHS corresponds to the minimum value that the RHS can take. This yields another lower bound on f (y):

$$f(y) \ge f(x) - \frac{1}{2c}||\nabla f(x)||_2^2 \tag{4.50}$$

Since this holds for any y ∈ D, we have

$$\min_{y \in D} f(y) \ge f(x) - \frac{1}{2c}||\nabla f(x)||_2^2 \tag{4.51}$$

which can be used to bound the suboptimality of a point x in terms of ||∇f (x)||2 . This bound comes in handy for theoretically understanding the convergence of gradient methods. If $\hat{y} = \operatorname*{argmin}_{y \in D} f(y)$, we can also derive a bound on the distance between any point x ∈ D and the point of optimality $\hat{y}$:

$$||x - \hat{y}||_2 \le \frac{2}{c}||\nabla f(x)||_2 \tag{4.52}$$
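The bound (4.51) can be watched in action on a strongly convex quadratic, where the strong convexity constant c is the smallest eigenvalue of the Hessian. A minimal gradient-descent sketch (assuming numpy; the step size and iteration count are illustrative choices):

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x with A positive definite, so f is
# strongly convex with constant c = lambda_min(A)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
c = np.linalg.eigvalsh(A).min()

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.zeros(2)
for _ in range(50):
    x = x - 0.1 * grad(x)            # plain gradient descent

fstar = f(np.linalg.solve(A, b))     # exact minimum value
g = grad(x)
# suboptimality bound from (4.51): f(x) - f* <= ||grad f(x)||^2 / (2c)
print(f(x) - fstar <= g @ g / (2 * c) + 1e-12)
```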
Theorem 75 motivates the definition of the subgradient for non-differentiable
convex functions, which has properties very similar to the gradient vector.
Definition 41 [Subgradient]: Let f : D → ℜ be a convex function defined
on a convex set D. A vector h ∈ ℜn is said to be a subgradient of f at the
point x ∈ D if
f (y) ≥ f (x) + hT (y − x)
for all y ∈ D. The set of all such vectors is called the subdifferential of f
at x.
For a differentiable convex function, the gradient at point x is the only subgradi-
ent at that point. Most properties of differentiable convex functions that hold in
terms of the gradient also hold in terms of the subgradient for non-differentiable
convex functions. Theorem 75 gives a very simple optimality criterion for a dif-
ferentiable function f .
Theorem 76 Let f : D → ℜ be a differentiable convex function defined on a convex set D. A point x ∈ D corresponds to a minimum if and only if

∇T f (x)(y − x) ≥ 0

for all y ∈ D.

If ∇f (x) is nonzero, it defines a supporting hyperplane to D at the point x.


Theorem 77 implies that for a differentiable convex function defined on an open
set, every critical point must be a point of (global) minimum.

Theorem 77 Let f : D → ℜ be differentiable and convex on an open convex


domain D ⊆ ℜn . Then x is a critical point of f if and only if it is a (global)
minimum.

Proof: If x is a global minimum, it is a local minimum, and by theorem 60, it must be a critical point; therefore ∇f (x) = 0. Conversely, let ∇f (x) = 0. By theorem 75, we know that for all y ∈ D,

f (y) ≥ f (x) + ∇T f (x)(y − x)

Substituting ∇f (x) = 0 in this inequality, we get for all y ∈ D,

f (y) ≥ f (x)

That is, x corresponds to a (global) minimum. 2


Based on the definition of monotonic functions in definition 39, we show the
relationship between convexity of a function and monotonicity of its gradient in
the next theorem.

Theorem 78 Let f : D → ℜ with D ⊆ ℜn be differentiable on the convex set


D. Then,
1. f is convex on D if and only if its gradient ∇f is monotone. That is, for all x, y ∈ D,

   (∇f (x) − ∇f (y))T (x − y) ≥ 0 (4.53)

2. f is strictly convex on D if and only if its gradient ∇f is strictly monotone. That is, for all x, y ∈ D with x ≠ y,

   (∇f (x) − ∇f (y))T (x − y) > 0 (4.54)

3. f is uniformly or strongly convex on D if and only if its gradient ∇f is uniformly monotone. That is, for all x, y ∈ D,

   (∇f (x) − ∇f (y))T (x − y) ≥ c||x − y||2 (4.55)

   for some constant c > 0.



Proof:
Necessity: Suppose f is uniformly convex on D. Then from theorem 75, we know that for any x, y ∈ D,

$$f(y) \ge f(x) + \nabla^T f(x)(y - x) + \frac{1}{2}c||y - x||^2$$

$$f(x) \ge f(y) + \nabla^T f(y)(x - y) + \frac{1}{2}c||x - y||^2$$

Adding the two inequalities, we get (4.55). If f is convex, the inequalities hold with c = 0, yielding (4.53). If f is strictly convex, the inequalities will be strict, yielding (4.54).
Sufficiency: Suppose ∇f is monotone. For any fixed x, y ∈ D, consider the
function φ(t) = f (x + t(y − x)). By the mean value theorem applied to φ(t),
we should have for some t ∈ (0, 1),

φ(1) − φ(0) = φ′ (t) (4.56)

Letting z = x + t(y − x), (4.56) translates to

f (y) − f (x) = ∇T f (z)(y − x) (4.57)

Also, by the monotonicity of ∇f (from (4.53)),

$$(\nabla f(z) - \nabla f(x))^T (y - x) = \frac{1}{t}(\nabla f(z) - \nabla f(x))^T (z - x) \ge 0 \tag{4.58}$$
Combining (4.57) with (4.58), we get

f (y) − f (x) = (∇f (z) − ∇f (x))T (y − x) + ∇T f (x)(y − x) ≥ ∇T f (x)(y − x) (4.59)

By theorem 75, this inequality proves that f is convex. Strict convexity can
be similarly proved by using the strict inequality in (4.58) inherited from strict
monotonicity, and letting the strict inequality follow through to (4.59). For the
case of strong convexity, from (4.55), we have

$$\phi'(t) - \phi'(0) = (\nabla f(z) - \nabla f(x))^T (y - x) = \frac{1}{t}(\nabla f(z) - \nabla f(x))^T (z - x) \ge \frac{1}{t}c||z - x||^2 = ct||y - x||^2 \tag{4.60}$$

Therefore,

$$\phi(1) - \phi(0) - \phi'(0) = \int_0^1 [\phi'(t) - \phi'(0)]\,dt \ge \frac{1}{2}c||y - x||^2 \tag{4.61}$$

which translates to

$$f(y) \ge f(x) + \nabla^T f(x)(y - x) + \frac{1}{2}c||y - x||^2$$
By theorem 75, f must be strongly convex. 2

Second Order Condition


For twice continuously differentiable convex functions the convexity condition
can be characterized as follows.
Theorem 79 A twice differentiable function f : D → ℜ on a nonempty open convex set D

1. is convex if and only if its Hessian matrix is positive semidefinite at each point in D. That is,

   ∇2 f (x) ⪰ 0 ∀ x ∈ D (4.62)

2. is strictly convex if its Hessian matrix is positive definite at each point in D. That is,

   ∇2 f (x) ≻ 0 ∀ x ∈ D (4.63)

3. is uniformly convex if and only if its Hessian matrix is uniformly positive definite on D. That is, there exists a c > 0 such that for any v ∈ ℜn and any x ∈ D,

   vT ∇2 f (x)v ≥ c||v||2 (4.64)

   In other words,

   ∇2 f (x) ⪰ cIn×n

   where In×n is the n × n identity matrix and ⪰ denotes the positive semidefinite ordering. That is, the function f is strongly convex iff ∇2 f (x) − cIn×n is positive semidefinite, for all x ∈ D and for some constant c > 0, which corresponds to a positive minimum curvature of f .

Proof: We will prove only the first statement in the theorem; the other two
statements are proved in a similar manner.
Necessity: Suppose f is a convex function, and consider a point x ∈ D.
We will prove that for any h ∈ ℜn , hT ∇2 f (x)h ≥ 0. Since f is convex, by
theorem 75, we have

f (x + th) ≥ f (x) + t∇T f (x)h (4.65)

Consider the function φ(t) = f (x + th) from theorem 71, defined on the domain Dφ = [0, 1]. Using the chain rule,

$$\phi'(t) = \sum_{i=1}^{n} f_{x_i}(x + th)\,\frac{dx_i}{dt} = h^T \nabla f(x + th)$$

Since f has partial and mixed partial derivatives, φ′ is a differentiable function


of t on Dφ and
φ′′ (t) = hT ∇2 f (x + th)h
Since φ and φ′ are continuous on Dφ and φ′ is differentiable on int(Dφ ), we can make use of Taylor's theorem (45) to obtain:

$$\phi(t) = \phi(0) + t\,\phi'(0) + \frac{t^2}{2}\phi''(0) + O(t^3)$$

Writing this equation in terms of f gives

$$f(x + th) = f(x) + th^T \nabla f(x) + \frac{t^2}{2}h^T \nabla^2 f(x)h + O(t^3)$$

In conjunction with (4.65), the above equation implies that

$$\frac{t^2}{2}h^T \nabla^2 f(x)h + O(t^3) \ge 0$$
Dividing by t2 and taking limits as t → 0, we get

hT ∇2 f (x)h ≥ 0

Sufficiency: Suppose that the Hessian matrix is positive semidefinite at each point of D. Consider the function φ(t) = f (y + t(x − y)) for x, y ∈ D, i.e., the earlier construction with h = x − y. Applying Taylor's theorem (45) with a = 0 and t = 1, we obtain

$$\phi(1) = \phi(0) + \phi'(0) + \frac{1}{2}\phi''(c)$$

for some c ∈ (0, 1). Writing this equation in terms of f gives

$$f(x) = f(y) + (x - y)^T \nabla f(y) + \frac{1}{2}(x - y)^T \nabla^2 f(z)(x - y)$$

where z = y + c(x − y). Since D is convex, z ∈ D. Thus, ∇2 f (z) ⪰ 0. It follows that

f (x) ≥ f (y) + (x − y)T ∇f (y)
By theorem 75, the function f is convex. 2
Examples of twice differentiable convex functions, along with their respective Hessians, are tabulated in Table 4.3.

Function type                                           Constraints      Gradient/Hessian

Quadratic: f(x) = (1/2) x^T Ax + b^T x + c              A ⪰ 0            ∇2 f(x) = A

Quadratic over linear: f(x, y) = x²/y                   y > 0            ∇2 f(x, y) = (2/y³) [ y²  −xy ; −xy  x² ] ⪰ 0

Log-sum-exp: f(x) = log Σ_{k=1}^n exp(xk)               none             ∇2 f(x) = (1/(1^T z)²) ((1^T z) diag(z) − zz^T),
                                                                         where z = [e^{x1}, e^{x2}, . . . , e^{xn}]

Negative geometric mean: f(x) = −(Π_{k=1}^n xk)^{1/n}   x ∈ ℜ^n_{++}     ∇2 f(x) = ((Π_{i=1}^n xi)^{1/n} / n²) (n diag(1/x1², . . . , 1/xn²) − qq^T),
                                                                         where q = (1/x1, . . . , 1/xn)^T

Table 4.3: Examples of twice differentiable convex functions on ℜn.
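As a quick numeric sanity check of the log-sum-exp row of Table 4.3 (a sketch of ours, not part of the original text; it assumes numpy is available), one can compare the closed-form Hessian against a finite-difference Hessian and verify that it is positive semidefinite, as theorem 79 requires:

import numpy as np

def lse(x):
    # f(x) = log sum_k exp(x_k)
    return np.log(np.sum(np.exp(x)))

def lse_hessian(x):
    # Closed form from Table 4.3: (1/(1'z)^2)((1'z) diag(z) - zz'), z = exp(x)
    z = np.exp(x)
    s = z.sum()
    return (s * np.diag(z) - np.outer(z, z)) / s**2

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
H = lse_hessian(x)

# Central finite-difference Hessian for comparison.
eps, n = 1e-5, x.size
H_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        e_i, e_j = np.eye(n)[i], np.eye(n)[j]
        H_fd[i, j] = (lse(x + eps*e_i + eps*e_j) - lse(x + eps*e_i - eps*e_j)
                      - lse(x - eps*e_i + eps*e_j) + lse(x - eps*e_i - eps*e_j)) / (4*eps**2)

print(np.max(np.abs(H - H_fd)))                 # tiny: formulas agree
print(np.min(np.linalg.eigvalsh(H)) >= -1e-12)  # True: Hessian is PSD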

4.2.10 Convexity Preserving Operations on Functions


In practice if you want to establish the convexity of a function f , you could
either

1. Prove it from first principles, i.e., using the definition of convexity or

2. If f is twice differentiable, show that ∇2 f (x)  0

3. Show that f is obtained from simple convex functions by operations that


preserve convexity. Following are operations on functions that preserve
convexity (proofs omitted, since they are straightforward):
• Nonnegative weighted sum: f = Σ_{i=1}^n αi fi is convex if each fi,
1 ≤ i ≤ n, is convex and αi ≥ 0, 1 ≤ i ≤ n.
• Composition with affine function: f (Ax + b) is convex if f is
convex. For example:
– The log barrier for linear inequalities, f(x) = −Σ_{i=1}^m log(bi − ai^T x),
is convex since −log(x) is convex.
– Any norm of an affine function, f (x) = ||Ax + b||, is convex.
• Pointwise maximum: If f1, f2, . . . , fm are convex, then f(x) =
max {f1(x), f2(x), . . . , fm(x)} is also convex. For example:

– The sum of the r largest components of x ∈ ℜn, f(x) = x[1] + x[2] + . . . +
x[r], where x[i] is the ith largest component of x, is a convex
function.
• Pointwise supremum: If f(x, y) is convex in x for every y ∈ S,
then g(x) = sup_{y∈S} f(x, y) is convex. For example:
– The function that returns the maximum eigenvalue of a symmetric
matrix X, viz., λmax(X) = sup_{||y||₂=1} y^T X y, is a convex function
of the symmetric matrix X.
• Composition with functions: Let h : ℜk → ℜ with h(x) =
∞, ∀ x ∉ dom h and g : ℜn → ℜk. Define f(x) = h(g(x)). f is
convex if
– each gi is convex and h is convex and nondecreasing in each argument,
– or each gi is concave and h is convex and nonincreasing in each argument.
Some examples illustrating this property are:
– exp g(x) is convex if g is convex
– Σ_{i=1}^m log gi(x) is concave if the gi are concave and positive
– log Σ_{i=1}^m exp gi(x) is convex if the gi are convex
– 1/g(x) is convex if g is concave and positive
• Infimum: If f(x, y) is convex in (x, y) and C is a convex set, then
g(x) = inf_{y∈C} f(x, y) is convex. For example:
– Let f(x, S) denote the distance of a point x to a convex set S,
that is, f(x, S) = inf_{y∈S} ||x − y||. Then f(x, S) is convex in x.
• Perspective function: The perspective of a function f : ℜn → ℜ is
the function g : ℜn × ℜ → ℜ, g(x, t) = t f(x/t). The function g is convex
if f is convex, on dom g = {(x, t) | x/t ∈ dom f, t > 0}. For example,
– The perspective of f(x) = x^T x is the quadratic-over-linear function
g(x, t) = x^T x / t, which is convex.
– The perspective of the negative logarithm f(x) = −log x is the
relative entropy function g(x, t) = t log t − t log x, which is convex.
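As a small illustration of these rules (our own sketch; the matrices A, b and C are arbitrary test data), one can empirically test the defining chord inequality f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) on a function assembled from the operations above, namely a norm of an affine map plus a pointwise maximum of affine functions:

import numpy as np

rng = np.random.default_rng(1)
A, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
C = rng.standard_normal((5, 3))

def f(x):
    # Norm of an affine function plus a pointwise maximum of affine functions:
    # convex by the composition-with-affine and pointwise-maximum rules.
    return np.linalg.norm(A @ x + b) + np.max(C @ x)

violations = 0
for _ in range(10000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    t = rng.uniform()
    if f(t*x + (1-t)*y) > t*f(x) + (1-t)*f(y) + 1e-9:
        violations += 1
print("chord-inequality violations:", violations)  # expect 0

Such a check is of course no proof, but it is a cheap way to catch mistakes when composing functions by these rules.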

4.3 Convex Optimization Problem


Formally, a convex program is defined as

min cT x (4.66)
x∈X

where X ⊂ ℜn is a convex set and x is a vector of n optimization or decision


variables. In applications, convex optimization programs usually arise in the
form:

minimize f (x)
subject to gi (x) ≤ 0, i = 1, . . . , m
(4.67)
Ax = b
variable x = (x1 , . . . , xn )

If it is given that the functions f, g1 , . . . , gm are convex, by theorem 73, the


feasible set X of this problem, which is the intersection of a finite number of
0-sublevel sets of convex functions, is also convex. Therefore, this problem can
be posed as the following convex optimization problem:

min t
x=(t,u)∈X (4.68)
X = {(t, u)|f (u) ≤ t, g1 (u) ≤ 0, g2 (u) ≤ 0, . . . , gm (u) ≤ 0}

The set X is convex, and hence the problem in (4.68) is a convex optimization
problem. Further, every locally optimal point is also globally optimal. The
computation time of algorithms for solving convex optimization problems is
roughly proportional to max{n³, n²m, C}, where C is the cost of evaluating
f , the gi ’s and their first and second derivatives. There are many reliable and
efficient algorithms for solving convex optimization problems. However, it is
often difficult to recognize convex optimization problems in practice.

Examples
Consider the optimization problem

minimize    f(x) = x1² + x2²
subject to  g1(x) = x1/(1 + x2²) ≤ 0        (4.69)
            h(x) = (x1 + x2)² = 0

We note that the optimization problem above is not a convex problem ac-
cording to our definition, since g1 is not convex and h is not affine. However, we
note that the feasible set {(x1 , x2 ) | x1 = −x2 , x1 ≤ 0 } is convex (recall that
the converse of theorem 73 does not hold - the 0-sublevel set of a non convex
function can be convex). This problem can be posed as an equivalent (but not
identical) convex optimization problem:

minimize    f(x) = x1² + x2²
subject to  x1 ≤ 0                          (4.70)
            x1 + x2 = 0

4.4 Duality Theory


Duality is a very important component of nonlinear and linear optimization
models. It has a wide spectrum of applications that are very popular. It arises
in the basic form of linear programming as well as in interior point methods
for linear programming. The duality in linear programming was first observed
by Von Neumann, and later formalized by Tucker, Gale and Kuhn. In the
first attempt at extending duality beyond linear programs, duals of quadratic
programs were next developed. It was subsequently observed that you can
always write a dual for any optimization problem and the modern Lagrange-
based ‘constructive’19 duality theory followed in the late 1960s.
An extremely popular application of duality happens to be in the quadratic
programming for Support Vector Machines. The primal and dual both happen
to be convex optimization programs in this case. The Minimax theorem20 , a
fundamental theorem of Game Theory, proved by John von Neumann in 1928,
is but one instance of the general duality theory. In the consideration of equi-
librium in electrical networks, currents are the ‘primal variables’ and the potential
differences are the ‘dual variables’. In models of economic markets, the ‘primal’
variables are production and consumption levels while the ‘dual’ variables are
prices (of goods, etc.). Dual price-based decomposition methods were developed
by Dantzig. In the case of truss structures in mechanics, forces are the primal vari-
ables and the displacements are the dual variables. Dual problems and their
solutions are used for proving optimality of solutions, finding near-optimal so-
lutions, analysing how sensitive the solution of the primal is to perturbations in
the right hand side of constraints, analysing convergence of algorithms, etc.

4.4.1 Lagrange Multipliers


Consider the following quadratic function of x ∈ ℜn .

F(x) = (1/2) x^T Ax − x^T b        (4.71)

where A is an n × n symmetric matrix. Consider the unconstrained minimization


problem
19 As we will see, the theory helps us construct duals that are useful in practice.
20 The name Minimax was invented by Tucker.

min F (x) (4.72)


x∈D

A locally optimum solution x̂ to this objective can be obtained by setting


∇F (x̂) = 0. This condition translates to Ax̂ = b. A sufficient condition for x̂
to be a point of local minimum is that ∇2 F (x̂) ≻ 0. This condition holds iff,
A ≻ 0, that is, A is a positive definite matrix. Given that A ≻ 0, A must be
invertible (c.f. Section 3.12.2) and the unique solution is x = A−1 b.
Now suppose we have a constrained minimization problem

min_{y∈ℜn}   (1/2) y^T B y
subject to   A^T y = b        (4.73)
where y ∈ ℜn , A is an n × m matrix, B is an n × n matrix and b is a vector
of size m. To handle constrained minimization, let us consider minimization of
the modified objective function L(y, λ) = (1/2) y^T By + λ^T (A^T y − b).

min_{y∈ℜn, λ∈ℜm}   (1/2) y^T B y + λ^T (A^T y − b)        (4.74)

The function L(y, λ) is called the lagrangian and involves the lagrange multi-
plier λ ∈ ℜm . A sufficient condition for optimality of L(y, λ) at a point L(y∗ , λ∗ )
is that ∇L(y∗ , λ∗ ) = 0 and ∇2 L(y∗ , λ∗ ) ≻ 0. For this particular problem:
" # " #
∗ ∗ By∗ + Aλ∗ 0
∇L(y , λ ) = =
AT y∗ − b 0
and
" #
2 ∗ ∗ B A
∇ L(y , λ ) = ≻0
AT 0
The point (y∗ , λ∗ ) must therefore satisfy, AT y∗ = b and Aλ∗ = −By∗ . If B
is taken to be the identity matrix, n = 2 and m = 1, the minimization problem
(4.73) amounts to finding a point y∗ on a line a11 y1 +a12 y2 = b that is closest to
the origin. From geometry, we know that the point on a line closest to the origin
is the point of intersection p∗ of a perpendicular from the origin to the line. On
the other hand, the solution for the minimum of (4.74), for these conditions
coincides with p∗ and is given by:

y1 = a11 b / (a11² + a12²)
y2 = a12 b / (a11² + a12²)

That is, for n = 2 and m = 1, the solution to (4.74) is the same as the solu-
tion to the constrained problem (4.73). Can this construction be used to always find optimal solutions
to a minimization problem? We will answer this question by first motivating
the concept of lagrange multipliers and in Section 4.4.2, we will formalize the
lagrangian dual.
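For the special case worked out above (B = I, n = 2, m = 1), the stationarity conditions By∗ + Aλ∗ = 0 and A^T y∗ = b form a single linear system that can be solved directly. The following sketch (our own, with assumed data a11 = 3, a12 = 4, b = 5) confirms the closed-form solution:

import numpy as np

B = np.eye(2)                     # B = I, n = 2
A = np.array([[3.0], [4.0]])      # a11 = 3, a12 = 4, m = 1
b = np.array([5.0])

# Stack the stationarity conditions into the block system
# [ B   A ] [ y ]   [ 0 ]
# [ A^T 0 ] [ λ ] = [ b ]
n, m = B.shape[0], A.shape[1]
K = np.block([[B, A], [A.T, np.zeros((m, m))]])
rhs = np.concatenate([np.zeros(n), b])
sol = np.linalg.solve(K, rhs)
y, lam = sol[:n], sol[n:]

print(y)   # [0.6, 0.8], matching a11*b/(a11^2 + a12^2) and a12*b/(a11^2 + a12^2)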

Lagrange Multipliers with Equality Constraints


The concept of lagrange multipliers can be attributed to the mathematician
Lagrange, who was born in the year 1736 in Turin. He largely worked on
mechanics, the calculus of variations, probability, group theory, and number
theory. He was party to the choice of base 10 for the metric system (rather than
12). We will here give a brief introduction to lagrange multipliers; Section 4.4.2
will discuss the Karush-Kuhn-Tucker conditions, which are a generalization of
lagrange multipliers.
Consider the equality constrained minimization problem (with D ⊆ ℜn )

min f (x)
x∈D (4.75)
subject to gi (x) = 0 i = 1, 2, . . . , m

A direct approach to solving this problem is to find a parametrization of the


constraints (as in the example on page 230) such that f is expressed in terms
of the parameters, to give an unconstrained problem. For example if there is a
single constraint of the form xT Ax = k, and A ≻ 0, then the coordinate system
can be rotated and x can be rescaled so that we get the constraint y^T y = k.
Further, we can substitute a parametrization of the yi's as

y1 = √k sin θ1 sin θ2 . . . sin θn−1
y2 = √k sin θ1 sin θ2 . . . cos θn−1
.........

However, this is not possible for general constraints. The method of lagrange
multipliers presents an indirect approach to solving this problem.
Consider a schematic representation of the problem in (4.75) with a single
constraint, i.e., m = 1 in Figure 4.39. The figure shows some level curves of the
function f . The constraint function g1 is also plotted with dotted lines in the
same figure. The gradient of the constraint ∇g1 is not parallel to the gradient
∇f of the function21 at f = 10.4; it is therefore possible to move along the
constraint surface so as to further reduce f . However, as shown in Figure 4.39,
∇g1 and ∇f are parallel at f = 10.3, and any motion along g1 (x) = 0 will
21 Note that the (negative) gradient at a point is orthogonal to the contour line going through

that point. This was proved in Theorem 59.



Figure 4.39: At any non-optimal and non-saddle point of the equality con-
strained problem, the gradient of the constraint will not be parallel to that of
the function.

increase f , or leave it unchanged. Hence, at the solution x∗ , ∇f (x∗ ) must be


proportional to −∇g1 (x∗ ), yielding, ∇f (x∗ ) = −λ∇g1 (x∗ ), for some constant
λ ∈ ℜ; λ is called a Lagrange multiplier. In several problems, the value of λ itself
need never be computed and therefore λ is often qualified as the undetermined
lagrange multiplier.
The necessary condition for an optimum at x∗ for the optimization problem
in (4.75) with m = 1 can be stated as in (4.76), where the gradient is now n + 1
dimensional with its last component being a partial derivative with respect to
λ.

∇L(x∗ , λ∗ ) = ∇f (x∗ ) + λ∗ ∇g1 (x∗ ) = 0 (4.76)

The solutions to (4.76) are the stationary points of the lagrangian L; they are not
necessarily local extrema of L. L is unbounded: given a point x that doesn’t lie
on the constraint, letting λ → ±∞ makes L arbitrarily large or small. However,
under certain stronger assumptions, as we shall see in Section 4.4.2, if the strong
Lagrangian principle holds, the minima of f minimize the Lagrangian globally.
We will extend the necessary condition for optimality of a minimization
problem with single constraint to minimization problems with multiple equality
constraints (i.e., m > 1 in (4.75)). Let S be the subspace spanned by the ∇gi(x)
at any point x and let S⊥ be its orthogonal complement. Let (∇f )⊥ be the
component of ∇f in the subspace S⊥ . At any solution x∗ , it must be true that
the gradient of f has (∇f )⊥ = 0 (i.e., no components that are perpendicular to
all of the ∇gi ), because otherwise you could move x∗ a little in that direction
(or in the opposite direction) to increase (decrease) f without changing any
of the gi , i.e. without violating any constraints. Hence for multiple equality
constraints, it must be true that at the solution x∗ , the space S contains the

Figure 4.40: At the equality constrained optimum, the gradient of the constraint
must be parallel to that of the function.

vector ∇f, i.e., there are some constants λi such that ∇f(x∗) = Σ_{i=1}^m λi ∇gi(x∗).
We also need to impose that the solution is on the correct constraint surface
(i.e., gi = 0, ∀i). In the same manner as in the case of m = 1, this can
be encapsulated by introducing the Lagrangian L(x, λ) = f(x) − Σ_{i=1}^m λi gi(x),
whose gradient with respect to both x and λ vanishes at the solution.
This gives us the following necessary condition for optimality of (4.75):

∇L(x∗, λ∗) = ∇( f(x) − Σ_{i=1}^m λi gi(x) ) = 0        (4.77)

Lagrange Multipliers with Inequality Constraints


Instead of a single equality constraint g1 (x) = 0, we could have a single inequal-
ity constraint g1 (x) ≤ 0. The entire region labeled g1 (x) ≤ 0 in Figure 4.41
then becomes feasible. At the solution x∗ , if g1 (x∗ ) = 0, i.e., if the constraint
is active, we must have (as in the case of a single equality constraint) that ∇f
is parallel to ∇g1 , by the same argument as before. Additionally, it is neces-
sary that the two gradients must point in opposite directions; otherwise a move
away from the surface g1 = 0 and into the feasible region would further reduce
f . Since we are minimizing f , if the Lagrangian is written as L = f + λg1 ,
we must have λ ≥ 0. Therefore, with an inequality constraint, the sign of λ is
important, and λ ≥ 0 becomes a constraint.
However, if the constraint is not active at the solution, then ∇f(x∗) = 0, and
removing g1 makes no difference, so we can drop it from L = f + λg1, which
is equivalent to setting λ = 0. Thus, whether or not the constraint g1 ≤ 0 is
active, we can find the solution by requiring that the gradient of the Lagrangian
vanish, and also requiring that λg1 (x∗ ) = 0. This latter condition is one of the

Figure 4.41: At the inequality constrained optimum, the gradient of the con-
straint must be parallel to that of the function.

important Karush-Kuhn-Tucker conditions of convex optimization theory that


can facilitate the search for the solution and will be more formally discussed in
Section 4.4.2.
Now consider the general inequality constrained minimization problem

min f (x)
x∈D (4.78)
subject to gi (x) ≤ 0 i = 1, 2, . . . , m

With multiple inequality constraints, for constraints that are active, as in the
case of multiple equality constraints, ∇f must lie in the space spanned by the
∇gi's, and if the Lagrangian is L = f + Σ_{i=1}^m λi gi, then we must additionally
have λi ≥ 0, ∀i (since otherwise f could be reduced by moving into the feasible
region). As for an inactive constraint gj (gj < 0), removing gj from L makes
no difference and we can drop ∇gj from ∇f = −Σ_{i=1}^m λi ∇gi, or equivalently set
λj = 0. Thus, the above KKT condition generalizes to λi gi(x∗) = 0, ∀i. The
necessary condition for optimality of (4.78) is summarily given as

∇L(x∗, λ∗) = ∇( f(x) + Σ_{i=1}^m λi gi(x) ) = 0
λi gi(x∗) = 0   ∀i        (4.79)
A simple and often useful trick called the free constraint gambit is to solve
ignoring one or more of the constraints, and then check that the solution satisfies
those constraints, in which case you have solved the problem.

4.4.2 The Dual Theory for Constrained Optimization


Consider the general inequality constrained minimization problem in (4.78),
restated below.

min f (x)
x∈D (4.80)
subject to gi (x) ≤ 0, i = 1, 2, . . . , m

There are three simple and straightforward steps in forming a dual problem.
1. The first step involves forming the lagrange function by associating a price
λi , called a lagrange multiplier, with the constraint involving gi .
L(x, λ) = f(x) + Σ_{i=1}^m λi gi(x) = f(x) + λ^T g(x)

2. The second step is the construction of the dual function L∗ (λ) which is
defined as:
L∗(λ) = min_{x∈D} L(x, λ) = min_{x∈D} [f(x) + λ^T g(x)]

What makes the theory of duality constructive is when we can solve for
L∗ efficiently - either in a closed form or some other ‘simple’ mechanism.
If L∗ is not easy to evaluate, the duality theory will be less useful.
3. We finally define the dual problem:

max_{λ∈ℜm}   L∗(λ)
subject to   λ ≥ 0        (4.81)

It can be immediately proved that the dual problem is a concave maximization


problem.
Theorem 80 The dual function L∗ (λ) is concave.
Proof: Consider two values of the dual variables, viz., λ1 ≥ 0 and λ2 ≥ 0. Let
λ = θλ1 + (1 − θ)λ2 for any θ ∈ [0, 1]. Then,

L∗(λ) = min_{x∈D} [f(x) + λ^T g(x)]
       = min_{x∈D} [θ(f(x) + λ1^T g(x)) + (1 − θ)(f(x) + λ2^T g(x))]
       ≥ min_{x∈D} θ(f(x) + λ1^T g(x)) + min_{x∈D} (1 − θ)(f(x) + λ2^T g(x))
       = θL∗(λ1) + (1 − θ)L∗(λ2)

This proves that L∗(λ) is a concave function. □


The dual is concave (or the negative of the dual is convex) irrespective of the
primal. Solving the dual is therefore always a convex programming problem.
Thus, in some sense, the dual is better structured than the primal. However, the
dual cannot be drastically simpler than the primal. For example, if the primal
is not an LP, the dual cannot be an LP. Similarly, the dual can be quadratic
only if the primal is quadratic.
A tricky thing in duality theory is to decide what we call the domain or
ground set D and what we call the constraints gi ’s. Based on whether constraints
are explicitly stated or implicitly stated in the form of the ground set, the dual
problem could be very different. Thus, many duals are possible for the given
primal.
We will look at two examples to give a flavour of how the duality theory
works.

1. We will first look at linear programming.

min cT x
x∈ℜn
subject to −Ax + b ≤ 0

The lagrangian for this problem is:



L(x, λ) = c^T x + λ^T (b − Ax) = b^T λ + (c − A^T λ)^T x

The next step is to get L∗ , which we obtain using the first derivative test:

L∗(λ) = min_{x∈ℜn} [b^T λ + (c − A^T λ)^T x] = { b^T λ   if A^T λ = c
                                                { −∞      if A^T λ ≠ c

The function L∗ can be thought of as the extended value extension of the
same function restricted to the domain {λ | A^T λ = c}. Therefore, the dual
problem can be formulated as:

max bT λ
λ∈ℜm
subject to AT λ = c (4.82)
λ≥0

This is the dual of the standard LP. What if the original LP was the
following?

min cT x
x∈ℜn
subject to −Ax + b ≤ 0,   x ≥ 0

Now we have a variety of options based on what constraints are intro-


duced into the ground set (or domain) and what are explicitly treated as
constraints. Some working out will convince us that treating x ∈ ℜn as
the constraint and the explicit constraints as part of the ground set is a
very bad idea. One dual for this problem is the same as (4.82).
2. Let us look at a modified version of the preceding linear program.

min_{x∈ℜn}   c^T x − Σ_{i=1}^n ln xi
subject to   −Ax + b = 0
             x > 0

Typically, when we try to formulate a dual problem, we look for constraints


that get in the way of conveniently solving the problem. We first formulate
the lagrangian for this problem.
L(x, λ) = c^T x − Σ_{i=1}^n ln xi + λ^T (b − Ax) = b^T λ + x^T (c − A^T λ) − Σ_{i=1}^n ln xi

The domain (or ground set) for this problem is x > 0, which is open.
The expression for L∗ can be obtained using the first derivative test, while
keeping in mind that L can be made arbitrarily small (tending to −∞)
unless (c − A^T λ) > 0. This is because, even if one component of c − A^T λ is
less than or equal to zero, the value of L can be made arbitrarily small by
increasing the value of the corresponding component of x in the Σ_{i=1}^n ln xi
part. Further, the sum b^T λ + (c − A^T λ)^T x − Σ_{i=1}^n ln xi can be separated
out into the individual components of x, and this can be exploited while
determining the critical point of L.

L∗(λ) = min_{x>0} L(x, λ) = { b^T λ + n + Σ_{i=1}^n ln (c − A^T λ)i    if (c − A^T λ) > 0
                             { −∞                                       otherwise

Finally, the dual will be

max_{λ∈ℜm}   b^T λ + n + Σ_{i=1}^n ln (c − A^T λ)i
subject to   c − A^T λ > 0



As noted earlier, the theory of duality remains just a theory unless the dual lends
itself to some constructive evaluation; the dual is not always in a useful form.
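Before moving on, the primal-dual LP pair from the first example can be sanity-checked numerically. The sketch below is our own illustration with made-up data and assumes numpy and scipy are available; linprog solves min c^T x subject to A_ub x ≤ b_ub, so the primal constraint −Ax + b ≤ 0 is passed as (−A)x ≤ −b, and the dual maximization (4.82) is passed as a minimization of −b^T λ:

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 5.0])
c = np.array([2.0, 3.0])

# Primal: min c^T x s.t. Ax >= b, x free.
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

# Dual (4.82): max b^T λ s.t. A^T λ = c, λ >= 0, i.e. min -b^T λ.
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)   # both 6.6 up to solver tolerance: no duality gap

Here both problems are feasible, so by the strong duality condition in the table below the optimal values coincide.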
The following Weak duality theorem states an important relationship be-
tween solutions to the primal (4.80) and the dual (4.81) problems.
Theorem 81 If p∗ ∈ ℜ is the solution to the primal problem in (4.80) and
d∗ ∈ ℜ is the solution to the dual problem in (4.81), then

p∗ ≥ d ∗

In general, if x̂ is any feasible solution to the primal problem (4.80) and λ̂ is a
feasible solution to the dual problem (4.81), then

f(x̂) ≥ L∗(λ̂)

Proof: If x̂ is a feasible solution to the primal problem (4.80) and λ̂ is a feasible
solution to the dual problem, then

f(x̂) ≥ f(x̂) + λ̂^T g(x̂) ≥ min_{x∈D} [f(x) + λ̂^T g(x)] = L∗(λ̂)

where the first inequality holds since λ̂ ≥ 0 and g(x̂) ≤ 0. This proves the second
part of the theorem. A direct consequence of this is that

p∗ = min_{x∈D, g(x)≤0} f(x) ≥ max_{λ≥0} L∗(λ) = d∗

□
The weak duality theorem has some important implications. If the primal
problem is unbounded below, that is, p∗ = −∞, we must have d∗ = −∞, which
means that the Lagrange dual problem is infeasible. Conversely, if the dual
problem is unbounded above, that is, d∗ = ∞, we must have p∗ = ∞, which
is equivalent to saying that the primal problem is infeasible. The difference,
p∗ − d∗ is called the duality gap.
In many hard combinatorial optimization problems with duality gaps, we
get good dual solutions, which tell us that we are guaranteed to be within some
k% of the optimal solution to the primal, for some satisfactorily low value of
k. This is one of the powerful uses of duality theory: constructing bounds for
optimization problems.
Under what conditions can one assert that d∗ = p∗ ? The condition d∗ = p∗ is
called strong duality and it does not hold in general. It usually holds for convex
problems but there are exceptions to that - one of the most typical being that
of the semi-definite optimization problem. The semi-definite program (SDP) is
defined, with the linear matrix inequality constraint (c.f. page 262) as follows:

min cT x
x∈ℜn
subject to x1 A1 + . . . + xn An + G ⪯ 0        (4.83)
Ax = b

Sufficient conditions for strong duality in convex problems are called constraint
qualifications. One of the most useful sufficient conditions for strong duality is
called Slater's constraint qualification.

Definition 42 [Slater's constraint qualification]: For a convex problem

min f (x)
x∈D
subject to gi (x) ≤ 0, i = 1, . . . , m
(4.84)
Ax = b
variable x = (x1 , . . . , xn )

strong duality holds (that is, d∗ = p∗) if the problem is strictly feasible. That is,

∃x ∈ int(D) : gi(x) < 0, i = 1, 2, . . . , m,   Ax = b

However, if any of the gi ’s are linear, they do not need to hold with strict
inequalities.

Table 4.4 summarizes some optimization problems, their duals and conditions
for strong duality. Strong duality also holds for nonconvex problems in extremely
rare cases; one example of this is the minimization of a nonconvex quadratic
function over the unit ball.

Problem type            Objective                  Constraints      L∗(λ)                                             Dual constraints        Strong duality
Linear Program          c^T x                      Ax ≤ b           −b^T λ                                            A^T λ + c = 0, λ ≥ 0    Feasible primal and dual
Quadratic Program       (1/2) x^T Qx + c^T x,      Ax ≤ b           −(1/2)(c + A^T λ)^T Q⁻¹(c + A^T λ) − b^T λ        λ ≥ 0                   Always
                        Q ∈ S^n_{++}
Entropy maximization    Σ_{i=1}^n xi ln xi         Ax ≤ b,          −b^T λ − µ − e^{−µ−1} Σ_{i=1}^n e^{−ai^T λ}       λ ≥ 0                   Primal constraints
                                                   x^T 1 = 1        (ai is the ith column of A)                                               are satisfiable

Table 4.4: Examples of optimization problems and their duals.

4.4.3 Geometry of the Dual


We will study the geometry of the dual in the column space ℜm+1 . The column
geometry of the dual will require definition of the following set:

I = {(s, z) | s ∈ ℜm , z ∈ ℜ, ∃x ∈ D with gi (x) ≤ si ∀1 ≤ i ≤ m, f (x) ≤ z }

The set I is a subset of ℜm+1 , where m is the number of constraints. Consider


a plot in two dimensions, for m = 1, with s1 along the x−axis and z along the
y−axis. For every point, x ∈ D, we can identify all points (s1 , z) for s1 ≥ g1 (x)

Figure 4.42: Example of the set I for a single constraint (i.e., for m = 1).

and z ≥ f (x) and these are points that lie to the right and above the point
(g1 (x), f (x)). An example set I is shown in Figure 4.42. It turns out that all
the intuitions we need are in two dimensions, which makes it fairly convenient to
understand the idea. It is straightforward to prove that if the objective function
f(x) is convex and each of the constraints gi(x), 1 ≤ i ≤ m, is a convex function,
then I must be a convex set. Since the feasible region for the primal problem
(4.78) is the region in I with s ≤ 0, and since all points above and to the right
of a point in I also belong to I, the solution to the primal problem corresponds
to the point in I with s = 0 and least possible value of z. For example, in
Figure 4.42, the solution to the primal corresponds to (0, δ1 ).
Let us define a hyperplane Hλ,α, parametrized by λ ∈ ℜm and α ∈ ℜ, as

Hλ,α = {(s, z) | λ^T s + z = α}

Consider all hyperplanes that lie below I. For example, in the Figure 4.42,
both hyperplanes Hλ1 ,α1 and Hλ2 ,α2 lie below the set I. Of all hyperplanes
that lie below I, consider the hyperplane whose intersection with the line s =
0 corresponds to as high a value of z as possible. This hyperplane must be a
supporting hyperplane. Incidentally, Hλ1,α1 happens to be such a supporting
hyperplane. Its point of intersection (0, α1 ) precisely corresponds to the solution
to the dual problem. Let us derive this statement formally after setting up some
more notation.
We will define two half-spaces corresponding to Hλ,α:

H⁺λ,α = {(s, z) | λ^T s + z ≥ α}
H⁻λ,α = {(s, z) | λ^T s + z ≤ α}

Let us define another set L as

L = {(s, z) | s = 0}

Note that L is essentially the z or function axis. The intersection of Hλ,α with
L is the point (0, α). That is
(0, α) = L ∩ Hλ,α

We would like to manipulate λ and α so that the set I lies in the half-space
H⁺λ,α as tightly as possible. Mathematically, we are interested in the problem
of maximizing the height of the point of intersection of L with Hλ,α above the
s = 0 plane, while ensuring that I remains a subset of H⁺λ,α.

max α
subject to H⁺λ,α ⊇ I

By the definitions of I, H⁺λ,α and the subset relation, this problem is equivalent to

max α
subject to λT .s + z ≥ α ∀(s, z) ∈ I

Now notice that if (s, z) ∈ I, then (s′ , z) ∈ I for all s′ ≥ s. This was also
illustrated in Figure 4.42. Thus, we cannot afford to have any component of λ
negative; if any of the λi's were negative, we could crank up si arbitrarily to
violate the inequality λT .s + z ≥ α. Thus, we can add the constraint λ ≥ 0 to
the above problem without changing the solution.

max α
subject to λT .s + z ≥ α ∀(s, z) ∈ I
λ≥0

Any equality constraint h(x) = 0 can be expressed using two inequality con-
straints, viz., h(x) ≤ 0 and −h(x) ≤ 0. This problem can again be proved to be
equivalent to the following problem, using the definition of I or equivalently, the
fact that every point on ∂I must be of the form (g1 (x), g2 (x), . . . , gm (x), f (x))
for some x ∈ D.

max α
subject to λT .g(x) + f (x) ≥ α ∀x ∈ D
λ≥0

We will remind the reader at this point that L(x, λ) = λT .g(x) + f (x). The
above problem is therefore the same as

max α
subject to L(x, λ) ≥ α ∀x ∈ D
λ≥0

Since L∗(λ) = min_{x∈D} L(x, λ), we can deal with the equivalent problem

max α
subject to L∗ (λ) ≥ α
λ≥0

This problem can be restated as

max L∗ (λ)
subject to λ≥0

This is precisely the dual problem. We thus get a geometric interpretation of


the dual.
Again referring to Figure 4.42, we note that if the set I is not convex, there
could be a gap between the z−intercept (0, α1 ) of the best supporting hyperplane
Hλ1 ,α1 and the closest point (0, δ1 ) of I on the z−axis, which corresponds to
the solution to the primal. In fact, when the set I is not convex, we can
never prove that there will be no duality gap. And even when the set I is
convex, bizarre things can happen; for example, in the case of semi-definite
programming, the set I, though convex, is not at all well-behaved and this
yields a large duality gap, as shown in Figure 4.43. In fact, the set I is open
from below (the dotted boundary) for a semi-definite program. We could create
very simple problems with convex I, for which there are duality gaps. For well-
behaved convex functions (as in the case of linear programming), there are no
duality gaps. Figure 4.44 illustrates the case of a well-behaved convex program.

4.4.4 Complementary slackness and KKT Conditions


We now state the conditions between the primal and dual optimal points for an
arbitrary function. These conditions, called the Karush-Kuhn-Tucker conditions
(abbreviated as KKT conditions) state a necessary condition for a solution to
be optimal with zero duality gap. Consider the following general optimization
problem.

Figure 4.43: Example of the convex set I for a single constrained semi-definite
program.

Figure 4.44: Example of the convex set I for a single constrained well-behaved
convex program.

min f (x)
x∈D
subject to gi (x) ≤ 0, i = 1, . . . , m
(4.85)
hj (x) = 0, j = 1, . . . , p
variable x = (x1 , . . . , xn )

Suppose that the primal and dual optimal values for the above problem are
attained and equal, that is, strong duality holds. Let x̂ be a primal optimal point
and (λ̂, µ̂) be a dual optimal point (λ̂ ∈ ℜm, µ̂ ∈ ℜp). Thus,

f(x̂) = L∗(λ̂, µ̂)
     = min_{x∈D} [f(x) + λ̂^T g(x) + µ̂^T h(x)]
     ≤ f(x̂) + λ̂^T g(x̂) + µ̂^T h(x̂)
     ≤ f(x̂)

The last inequality follows from the fact that λ̂ ≥ 0, g(x̂) ≤ 0, and h(x̂) = 0.
We can therefore conclude that the two inequalities in this chain must hold with
equality. Some of the conclusions that we can draw from this chain of equalities
are
1. That x̂ is a minimizer of L(x, λ̂, µ̂) over x ∈ D. In particular, if the functions
f, g1, g2, . . . , gm and h1, h2, . . . , hp are differentiable (and therefore
have open domains), the gradient of L(x, λ̂, µ̂) must vanish at x̂, since any
point of global optimum must be a point of local optimum. That is,

∇f(x̂) + Σ_{i=1}^m λ̂i ∇gi(x̂) + Σ_{j=1}^p µ̂j ∇hj(x̂) = 0        (4.86)

2. That

λ̂^T g(x̂) = Σ_{i=1}^m λ̂i gi(x̂) = 0

Since each term in this sum is nonpositive, we conclude that

λ̂i gi(x̂) = 0   for i = 1, 2, . . . , m        (4.87)

This condition is called complementary slackness and is a necessary con-


dition for strong duality. Complementary slackness implies that the ith

optimal lagrange multiplier is 0 unless the ith inequality constraint is ac-


tive at the optimum. That is,
λ̂i > 0 ⇒ gi(x̂) = 0
gi(x̂) < 0 ⇒ λ̂i = 0

Let us further assume that the functions f , g1 , g2 , . . . , gm and h1 , h2 , . . . , hp


are differentiable on open domains. As above, let x̂ be a primal optimal point and
(λ̂, µ̂) be a dual optimal point with zero duality gap. Putting together the
conditions in (4.86), (4.87) along with the feasibility conditions for any pri-
mal solution and dual solution, we can state the following Karush-Kuhn-Tucker
(KKT) necessary conditions for zero duality gap.

(1) ∇f(x̂) + Σ_{i=1}^m λ̂i ∇gi(x̂) + Σ_{j=1}^p µ̂j ∇hj(x̂) = 0
(2) gi(x̂) ≤ 0,   i = 1, 2, . . . , m
(3) λ̂i ≥ 0,   i = 1, 2, . . . , m        (4.88)
(4) λ̂i gi(x̂) = 0,   i = 1, 2, . . . , m
(5) hj(x̂) = 0,   j = 1, 2, . . . , p

When the primal problem is convex, the KKT conditions are also sufficient
for the points to be primal and dual optimal with zero duality gap. If f is convex,
gi are convex and hj are affine, the primal problem is convex and consequently,
the KKT conditions are sufficient conditions for zero duality gap.
Theorem 82 If the function f is convex, gi are convex and hj are affine, then
KKT conditions in 4.88 are necessary and sufficient conditions for zero duality
gap.
Proof: The necessity part has already been proved; here we only prove the
sufficiency part. The conditions (2) and (5) in (4.88) ensure that x̂ is primal
feasible. Since λ̂ ≥ 0, L(x, λ̂, µ̂) is convex in x. Based on condition (1) in (4.88)
and theorem 77, we can infer that x̂ minimizes L(x, λ̂, µ̂). We can thus conclude
that

L∗(λ̂, µ̂) = f(x̂) + λ̂^T g(x̂) + µ̂^T h(x̂) = f(x̂)

In the equality above, we use hj(x̂) = 0 and λ̂i gi(x̂) = 0. Further,

d∗ ≥ L∗(λ̂, µ̂) = f(x̂) ≥ p∗

The duality theorem (theorem 81) however states that p∗ ≥ d∗. This implies
that

d∗ = L∗(λ̂, µ̂) = f(x̂) = p∗

This shows that x̂ and (λ̂, µ̂) correspond to the primal and dual optima respectively
and the problem therefore has zero duality gap. □
In summary, for any convex optimization problem with differentiable objec-
tive and constraint functions, any points that satisfy the KKT conditions are
primal and dual optimal, and have zero duality gap.
The KKT conditions play a very important role in optimization. In some rare
cases, it is possible to solve the optimization problems by finding a solution to
the KKT conditions analytically. Many algorithms for convex optimization are
conceived as, or can be interpreted as, methods for solving the KKT conditions.
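As an illustration of solving the KKT conditions analytically (a toy example of ours, not from the text), consider min x1² + x2² subject to x1 + x2 = 1. Condition (1) of (4.88) together with the feasibility condition (5) reduces to a 3 × 3 linear system in (x1, x2, µ), and conditions (2) to (4) are vacuous since there are no inequality constraints:

import numpy as np

# Stationarity: 2*x1 + mu = 0, 2*x2 + mu = 0; feasibility: x1 + x2 = 1.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x1, x2, mu = np.linalg.solve(K, rhs)
print(x1, x2, mu)   # 0.5, 0.5, -1.0

# Verify conditions (1) and (5) of (4.88) numerically.
grad_L = 2 * np.array([x1, x2]) + mu * np.ones(2)
print(np.allclose(grad_L, 0), np.isclose(x1 + x2, 1.0))  # True True

Since the problem is convex (quadratic objective, affine constraint), theorem 82 guarantees that this KKT point is the global optimum with zero duality gap.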

4.5 Algorithms for Unconstrained Minimization

We will now study some algorithms for solving convex problems. These tech-
niques are relevant for most convex optimization problems that do not yield
themselves to closed form solutions. We will start with unconstrained mini-
mization.
Recall that the goal in unconstrained minimization is to solve the convex
problem

min f (x)
x∈D

We are not interested in problems whose solution can be obtained in closed form.


For example, minimizing a quadratic is very simple and can be solved by linear
equations, an example of which was discussed in Section 3.9.2. Let us denote the
optimal solution of the minimization problem by p∗ . We will assume that f is
convex and twice continuously differentiable and that it attains a finite optimal
value p∗. Most unconstrained minimization techniques produce a sequence of
points x(k) ∈ D, k = 0, 1, . . . such that f(x(k)) → p∗ as k → ∞ or ∇f(x(k)) →
0 as k → ∞. Iterative techniques for optimization further require a starting
point x(0) ∈ D and sometimes that epi(f) is closed. The set epi(f) can be inferred
to be closed either if D = ℜn or if f(x) → ∞ as x → ∂D. The function f(x) = 1/x
for x > 0 is an example of a function whose epi(f) is closed.
While there exist convergence proofs (including guarantees on number of
optimization iterations) for many convex optimization algorithms, the proofs
assume many conditions, many of which are either not verifiable or involve un-
known constants (such as the Lipschitz constant). Thus, most convergence proofs
for convex optimization problems are useless in practice, though it is good to
know that there are conditions under which the algorithm converges. Since con-
vergence proofs are only of theoretical importance, we will make the strongest
possible assumption under which convergence can be proved easily, which is that
the function f is strongly convex (c.f. Section 4.2.7 for definition of strong con-
vexity) with the strong convexity constant c > 0 for which ∇2 f(x) ⪰ cI ∀x ∈ D.

Further, it can be proved that for a strongly convex function f, ∇2 f(x) ⪯ DI
for some constant D ∈ ℜ. The ratio D/c is an upper bound on the condition
number of the matrix ∇2 f(x).

4.5.1 Descent Methods


Descent methods for unconstrained optimization have been in use for more than
70 years. The general idea in descent methods is that the next iterate
x(k+1) is the current iterate x(k) added with a descent or search direction ∆x(k)
(a unit vector), which is multiplied by a scale factor t(k) , called the step length.

x(k+1) = x(k) + t(k) ∆x(k)

The incremental step is determined while ensuring that f (x(k+1) ) < f (x(k) ).
We assume that we are dealing with the extended value extension of the convex
function f (c.f. definition 36), which returns ∞ for any point outside its domain.
However, if we do so, we need to make sure that the initial point indeed lies in
the domain D.
A single iteration of the general descent algorithm (shown in Figure 4.45)
consists of two main steps, viz., determining a good descent direction ∆x(k) ,
which is typically forced to have unit norm and determining the step size using
some line search technique. If the function f is convex, and we require that
f(x(k+1)) < f(x(k)), then we must have ∇T f(x(k))(x(k+1) − x(k)) < 0. This
can be seen from the necessary and sufficient condition for convexity stated in
equation (4.44) within Section 4.2.9 and restated here for reference.

f (x(k+1) ) ≥ f (x(k) ) + ∇T f (x(k) )(x(k+1) − x(k) )

Since t(k) > 0, we must have

∇T f (x(k) )∆x(k) < 0


That is, the descent direction ∆x(k) must make an obtuse angle (θ ∈ (π/2, 3π/2))
with the gradient vector.

Find a starting point x(0) ∈ D


repeat
1. Determine ∆x(k) .
2. Choose a step size t(k) > 0 using raya search.
3. Obtain x(k+1) = x(k) + t(k) ∆x(k) .
4. Set k = k + 1.
until stopping criterion (such as ||∇f (x(k+1) )|| < ǫ) is satisfied
a Many textbooks refer to this as line search, but we prefer to call it ray search, since the

step must be positive.

Figure 4.45: The general descent algorithm.



There are many different empirical techniques for ray search, though it mat-
ters much less than the search for the descent direction. These techniques reduce
the n−dimensional problem to a 1−dimensional problem, which can be easy to
solve by use of plotting and eyeballing or even exact search.
1. Exact ray search: The exact ray search seeks a scaling factor t that
satisfies

t = argmin_{t>0} f(x + t∆x)        (4.89)

2. Backtracking ray search: The exact line search may not be feasible
or could be expensive to compute for complex non-linear functions. A
relatively simpler ray search iterates over values of step size starting from
1 and scaling it down by a factor of β ∈ (0, 1/2) after every iteration till
the following condition, called the Armijo condition is satisfied for some
0 < c1 < 1.

f (x + t∆x) < f (x) + c1 t∇T f (x)∆x (4.90)

Based on equation (4.44), it can be inferred that the Armijo inequality


can never hold for c1 = 1; for c1 = 1, the right hand side of the Armijo
condition gives a lower bound on the value of f (x + t∆x). The Armijo
condition simply ensures that t decreases f sufficiently. Often, another
condition is used for inexact line search in conjunction with the Armijo
condition.

|∆x^T ∇f(x + t∆x)| ≤ c2 |∆x^T ∇f(x)|        (4.91)

where 0 < c1 < c2 < 1. This condition ensures that the magnitude of the slope of the
function f(x + t∆x) at t is at most c2 times that at t = 0. The conditions
in (4.90) and (4.91) are together called the strong Wolfe conditions. These
conditions are particularly very important for non-convex problems.
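A minimal implementation of the backtracking ray search based on the Armijo condition (4.90) is sketched below (our own; the constants c1 = 1e-4 and β = 0.5 are typical choices, not values prescribed here):

import numpy as np

def backtracking(f, grad_f, x, dx, c1=1e-4, beta=0.5):
    """Shrink t from 1 by a factor beta until the Armijo condition (4.90) holds."""
    t = 1.0
    fx = f(x)
    slope = c1 * grad_f(x) @ dx   # c1 * grad^T dx, negative for a descent direction
    while f(x + t * dx) >= fx + t * slope:
        t *= beta
    return t

# Usage on f(x) = ||x||^2 with the negative-gradient descent direction:
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x0 = np.array([1.0, -2.0])
t = backtracking(f, grad_f, x0, -grad_f(x0))
print(t)   # 0.5 for this quadratic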
A finding that is borne out by plenty of empirical evidence is that exact ray
search does better than inexact ray search in only a few cases. Further, the
exact choice of the values of β and c1 seems to have little effect on the convergence
of the overall descent method.
The trend of specific descent methods has been like a parabola - starting
with simple steepest descent techniques, then accommodating the curvature (hes-
sian) matrix through a more sophisticated Newton's method and finally, trying
to simplify the Newton’s method through approximations to the hessian inverse,

culminating in conjugate gradient techniques, that do away with any curvature


matrix whatsoever, and form the internal combustion engine of many sophis-
ticated optimization techniques today. We start the thread by describing the
steepest descent methods.

Steepest Descent
Let v ∈ ℜn be a unit vector under some norm. By theorem 75, for convex f ,
f (x(k) ) − f (x(k) + v) ≤ −∇T f (x(k) )v
For small v, the inequality turns into approximate equality. The term −∇T f (x(k) )v
can be thought of as (an upper-bound on) the first order prediction of decrease.
The idea in the steepest descent method [?] is to choose a norm and then deter-
mine a descent direction such that for a unit step in that norm, the first order
prediction of decrease is maximized. This choice of the descent direction can be
stated as

∆x = argmin {∇T f(x)v | ||v|| = 1}
The algorithm is outlined in Figure 4.46.

Find a starting point x(0) ∈ D.


repeat
1. Set ∆x(k) = argmin {∇T f(x(k))v | ||v|| = 1}.
2. Choose a step size t(k) > 0 using exact or backtracking ray search.
3. Obtain x(k+1) = x(k) + t(k) ∆x(k) .
4. Set k = k + 1.
until stopping criterion (such as ||∇f (x(k+1) )|| ≤ ǫ) is satisfied

Figure 4.46: The steepest descent algorithm.

The key to understanding the steepest descent method (and in fact many
other iterative methods) is that it heavily depends on the choice of the norm. It
has been empirically observed that if the norm chosen is aligned with the gross
geometry of the sub-level sets22 , the steepest descent method converges faster
to the optimal solution. If the norm chosen is not aligned, it often amplifies
the effect of oscillations. Two examples of the steepest descent method are the
gradient descent method (for the Euclidean or L2 norm) and the coordinate-
descent method (for the L1 norm). One fact however is that no two norms
should give exactly opposite steepest descent directions, though they may point
in different directions.

Gradient Descent
A classic greedy algorithm for minimization is the gradient descent algorithm.
This algorithm uses the negative of the gradient of the function at the current
22 The alignment can be determined by fitting, for instance, a quadratic to a sample of the

points.

point x∗ as the descent direction ∆x∗ . It turns out that this choice of ∆x∗
corresponds to the direction of steepest descent under the L2 (Euclidean) norm.
This can be proved in a straightforward manner using theorem 58. The algo-
rithm is outlined in Figure 4.47. The steepest descent method can be thought

Find a starting point x(0) ∈ D


repeat
1. Set ∆x(k) = −∇f (x(k) ).
2. Choose a step size t(k) > 0 using exact or backtracking ray search.
3. Obtain x(k+1) = x(k) + t(k) ∆x(k) .
4. Set k = k + 1.
until stopping criterion (such as ||∇f (x(k+1) )||2 ≤ ǫ) is satisfied

Figure 4.47: The gradient descent algorithm.

of as changing the coordinate system in a particular way and then applying the
gradient descent method in the changed coordinate system.
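As a concrete illustration of Figure 4.47 (our own sketch with assumed data, not part of the original text), consider gradient descent with backtracking ray search on a small strongly convex quadratic, where the exact minimizer Q⁻¹b is available for comparison:

import numpy as np

Q = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite (assumed data)
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b

x = np.zeros(2)
for k in range(1000):
    dx = -grad(x)                        # step 1: negative gradient direction
    if np.linalg.norm(dx) <= 1e-8:       # stopping criterion
        break
    t, c1, beta = 1.0, 1e-4, 0.5         # step 2: backtracking ray search
    while f(x + t * dx) >= f(x) + c1 * t * grad(x) @ dx:
        t *= beta
    x = x + t * dx                       # step 3: update the iterate

print(x, np.linalg.solve(Q, b))          # final iterate vs. exact minimizer Q^{-1} b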

Coordinate-Descent Method
The co-ordinate descent method corresponds exactly to the choice of L1 norm
for the steepest descent method. The steepest descent direction using the L1
norm is given by
∆x = − (∂f(x)/∂xi) ui

where the index i is chosen such that

|∂f(x)/∂xi| = ||∇f(x)||∞
and ui was defined on page 231 as the unit vector pointing along the ith co-
ordinate axis. Thus each iteration of the coordinate descent method involves
optimizing over one component of the vector x(k) and then updating the vec-
tor. The component chosen is the one having the largest absolute value in the
gradient vector. The algorithm is outlined in Figure 4.48.

Convergence of Steepest Descent Method


For the gradient method, it can be proved that if f is strongly convex,

 
f(x(k)) − p∗ ≤ ρ^k (f(x(0)) − p∗)        (4.92)

The value of ρ ∈ (0, 1) depends on the strong convexity constant c (c.f. equation
(4.64) on page 277), the value of x(0) and type of ray search employed. The
suboptimality f (x(k) ) − p∗ goes down by a factor ρ < 1 at every step and
this is referred to as linear convergence23 . However, this is only of theoretical
23 A series s1 , s2 , . . . is said to have

Find a starting point x(0) ∈ D.


Select an appropriate norm ||.||.
repeat
1. Choose i such that |∂f(x(k))/∂xi| = ||∇f(x(k))||∞.
2. Set ∆x(k) = − (∂f(x(k))/∂xi) ui.
3. Choose a step size t(k) > 0 using exact or backtracking ray search.
4. Obtain x(k+1) = x(k) + t(k) ∆x(k) .
5. Set k = k + 1.
until stopping criterion (such as ||∇f (x(k+1) )||∞ ≤ ǫ) is satisfied

Figure 4.48: The coordinate descent algorithm.

importance, since this method is often very slow, indicated by values of ρ, very
close to 1. Use of exact line search in conjunction with gradient descent also has
the tendency to overshoot the next best iterate. It is therefore rarely used in
practice. The convergence rate depends greatly on the condition number of the
Hessian (which is upper-bounded by D/c). It can be proved that the number of
iterations required for the convergence of the gradient descent method is lower-
bounded by the condition number of the hessian; large eigenvalues correspond
to high curvature directions and small eigenvalues correspond to low curvature
directions. Many methods (such as conjugate gradient) try to improve upon
the gradient method by making the hessian better conditioned. Convergence
can be very slow even for moderately well-conditioned problems, with condition
number in the 100s, even though computation of the gradient at each step is
only an O(n) operation. The gradient descent method however works very well
if the function is isotropic, that is if the level-curves are spherical or nearly
spherical.
The convergence of the steepest descent method can be stated in the same
form as in 4.92, using the fact that any norm can be bounded in terms of the
Euclidean norm, i.e., there exists a constant η ∈ (0, 1] such that

||x|| ≥ η||x||2

1. linear convergence to s if lim_{i→∞} |si+1 − s| / |si − s| = δ ∈ (0, 1). For example, si = γ^i has linear
convergence to s = 0 for any γ < 1. The rate of decrease is also sometimes called
exponential or geometric. This is considered quite slow.

2. superlinear convergence to s if lim_{i→∞} |si+1 − s| / |si − s| = 0. For example, si = 1/i! has superlinear
convergence. This is the most common.

3. quadratic convergence to s if lim_{i→∞} |si+1 − s| / |si − s|² = δ ∈ (0, ∞). For example, si = γ^(2^i) has
quadratic convergence to s = 0 for any γ < 1. This is considered very fast in practice.

4.5.2 Newton’s Method


Newton’s method [?] is based on approximating a function around the current
iterate x(k) using a second degree Taylor expansion.
f(x) ≈ f̃(x) = f(x(k)) + ∇T f(x(k))(x − x(k)) + (1/2)(x − x(k))^T ∇2 f(x(k))(x − x(k))
If the function f is convex, the quadratic approximation is also convex. Newton’s
method is based on solving it exactly by finding its critical point x(k+1) as a
function of x(k) . Setting the gradient of this quadratic approximation (with
respect to x) to 0 gives

∇f(x(k)) + ∇2 f(x(k))(x(k+1) − x(k)) = 0

solving which yields the next iterate as

x(k+1) = x(k) − (∇2 f(x(k)))⁻¹ ∇f(x(k))        (4.93)

assuming that the Hessian matrix is invertible. The term x(k+1) − x(k) can
be thought of as an update step. This leads to a simple descent algorithm,
outlined in Figure 4.49 and is called the Newton’s method. It relies on the
invertibility of the hessian, which holds if the hessian is positive definite as in
the case of a strictly convex function. In case the hessian is invertible, Cholesky
factorization (page 207) of the hessian can be used to solve the linear system
(4.93). However, the Newton method may not even be properly defined if the
hessian is not positive definite. In this case, the hessian could be changed to
a nearby positive definite matrix whenever it is not. Or a line search could be
added to seek a new point having a positive definite hessian.
This method uses a step size of 1. If instead, the stepsize is chosen using
exact or backtracking ray search, the method is called the damped Newton’s
method. Each Newton’s step takes O(n3 ) time (without using any fast matrix
multiplication methods).
The Newton step can also be looked upon as another incarnation of the
steepest descent rule, but with the quadratic norm defined by the (local) Hessian
∇2 f(x(k)) evaluated at the current iterate x(k), i.e.,

||u||_{∇2 f(x(k))} = (u^T ∇2 f(x(k)) u)^{1/2}

The norm of the Newton step, in the quadratic norm defined by the Hessian at
a point x is called the Newton decrement at the point x and is denoted by λ(x).
Thus,
λ(x) = ||∆x||_{∇2 f(x)} = (∇T f(x) (∇2 f(x))⁻¹ ∇f(x))^{1/2}
The Newton decrement gives an ‘estimate’ of the proximity of the current iterate
x to the optimal point x∗ obtained by measuring the proximity of x to the

Find a starting point x(0) ∈ D.


Select an appropriate tolerance ǫ > 0.
repeat
1. Set ∆x(k) = −(∇2 f(x(k)))⁻¹ ∇f(x(k)).
2. Let λ² = ∇T f(x(k)) (∇2 f(x(k)))⁻¹ ∇f(x(k)).
3. If λ²/2 ≤ ǫ, quit.
4. Set step size t(k) = 1.
5. Obtain x(k+1) = x(k) + t(k) ∆x(k).
6. Set k = k + 1.
until the quit condition in step 3 is satisfied

Figure 4.49: The Newton’s method.

minimum point of the quadratic approximation f̃(x). The estimate is (1/2)λ(x)²
and is given as

(1/2)λ(x)² = f(x) − min f̃(x)

Additionally, −λ(x)² is the directional derivative of f at x in the Newton direction:

−λ(x)² = ∇T f(x)∆x

The estimate (1/2)λ(x)² is used to test the convergence of the Newton algorithm
in Figure 4.49.
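The following is a sketch of the algorithm in Figure 4.49 (our own, with assumed data c = (2, 5)) on the function f(x) = c^T x − Σi ln xi, one of the functions listed at the end of this section. The starting point is deliberately chosen close to the minimizer; the one-dimensional example later in this section shows what can go wrong when starting outside the region of quadratic convergence:

import numpy as np

c = np.array([2.0, 5.0])
f = lambda x: c @ x - np.sum(np.log(x))
grad = lambda x: c - 1.0 / x
hess = lambda x: np.diag(1.0 / x**2)

x, eps = np.array([0.4, 0.15]), 1e-10   # start inside the quadratic-convergence region
while True:
    g, H = grad(x), hess(x)
    dx = -np.linalg.solve(H, g)          # step 1: Newton direction
    lam2 = -g @ dx                       # lambda(x)^2 = -grad^T dx
    if lam2 / 2 <= eps:                  # step 3: decrement-based stopping test
        break
    x = x + dx                           # step 5 with t = 1

print(x, 1.0 / c)                        # converges to x_i* = 1/c_i in a few steps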
Next, we state an important property of the Newton’s update rule.
Theorem 83 If ∆x(k) = −(∇2 f(x(k)))⁻¹ ∇f(x(k)), ∇2 f(x(k)) is symmetric
and positive definite and ∆x(k) ≠ 0, then ∆x(k) is a descent direction at x(k),
that is, ∇T f(x(k))∆x(k) < 0.


Proof: First of all, if ∇2 f(x(k)) is symmetric and positive definite, then it is
invertible and its inverse is also symmetric and positive definite. Next, we see
that

∇T f(x(k))∆x(k) = −∇T f(x(k)) (∇2 f(x(k)))⁻¹ ∇f(x(k)) < 0

because (∇2 f(x(k)))⁻¹ is symmetric and positive definite. □
The Newton method is independent of affine changes of coordinates. That
is, if optimizing a function f(x) using the Newton's method with an initial
estimate x(0) involves the series of iterates x(1) , x(2) , . . . , x(k) , . . ., then optimiz-
ing the same problem using the Newton’s method with a change of coordinates
given by x = Ay and the initial estimate y(0) such that x(0) = Ay(0) yields the
series of iterates y(1) , y(2) , . . . , y(k) , . . ., such that x(k) = Ay(k) . This is a great
advantage over the gradient method, whose convergence can be very sensitive
to affine transformation.
Another well known feature of the Newton’s method is that it converges very
fast, if at all. The convergence is extremely fast in the vicinity of the point of

optimum. This can be loosely understood as follows. If x∗ is the critical point


of a differentiable convex function f , defined on an open domain, the function is
approximately equal to its second order taylor approximation in the vicinity of
x∗ . Further, ∇f (x∗ ) = 0. This gives the following approximation at any point
x in the vicinity of x∗ .

f(x) ≈ f(x∗) + ∇T f(x∗)(x − x∗) + (1/2)(x − x∗)^T ∇2 f(x∗)(x − x∗)
     = f(x∗) + (1/2)(x − x∗)^T ∇2 f(x∗)(x − x∗)
Thus, the level curves of a convex function are approximately ellipsoids near the
point of minimum x∗ . Given this geometry near the minimum, it then makes
sense to do steepest descent in the norm induced by the hessian, near the point
of minimum (which is equivalent to doing a steepest descent after a rotation of
the coordinate system using the hessian). This is exactly the Newton’s step.
Thus, the Newton’s method24 converges very fast in the vicinity of the solution.
This convergence analysis is formally stated in the following theorem and is
due to Leonid Kantorovich.
Theorem 84 Suppose f (x) : D → ℜ is twice continuously differentiable on D
and x∗ is the point corresponding to the optimal value p∗ (so that ∇f (x∗ ) = 0).
Let f be strongly convex on D with constant c > 0. Also, suppose ∇2 f(x) is
Lipschitz continuous on D with a constant L > 0 (which measures how well f
can be approximated by a quadratic function or how fast the second derivative
of f changes), that is
||∇2 f (x) − ∇2 f (y)||2 ≤ L||x − y||2
Then, there exist constants α ∈ (0, c²/L) and β > 0 such that
1. Damped Newton Phase: If ||∇f(x(k))||2 ≥ α, then f(x(k+1)) − f(x(k)) ≤
−β. That is, at every step of the iteration in the damped Newton phase,
the function value decreases by at least β and the phase ends after at most
(f(x(0)) − p∗)/β iterations, which is a finite number.

2. Quadratically Convergent Phase: If ||∇f(x(k))||2 < α, then
(L/(2c²)) ||∇f(x(k+1))||2 ≤ ((L/(2c²)) ||∇f(x(k))||2)². When applied recursively, this inequality yields

(L/(2c²)) ||∇f(x(k))||2 ≤ (1/2)^{2^{k−q}}

where q is the iteration number starting at which ||∇f(x(q))||2 < α. Using
the result for strong convexity in equation (4.50) on page 273, we can
derive
24 Newton originally presented his method for one-dimensional problems. Later on Raphson

extended the method to multi-dimensional problems.



f(x(k)) − p∗ ≤ (1/(2c)) ||∇f(x(k))||2² ≤ (2c³/L²) (1/2)^{2^{k−q+1}}        (4.94)

Also, using the result in equation (4.52) on page 273, we get a bound on
the distance between the current iterate and the point x∗ corresponding to
the optimum.

||x(k) − x∗||2 ≤ (2/c) ||∇f(x(k))||2 ≤ (4c/L) (1/2)^{2^{k−q}}        (4.95)

Inequality (4.94) shows that convergence is quadratic once the second condition
is satisfied after a finite number of iterations. Roughly speaking, this means
that, after a sufficiently large number of iterations, the number of correct digits
doubles at each iteration25 . In practice, once in the quadratic phase, you do not
even need to bother about any convergence criterion; it suffices to apply a fixed
few number of Newton iterations to get a very accurate solution. Inequality
(4.95) states that the sequence of iterates converges quadratically. The Lips-
chitz continuity condition states that if the second derivative of the function
changes relatively slowly, applying Newton’s method can be useful. Again, the
inequalities are technical junk as far as practical application of Newton’s method
is concerned, since L, c and α are generally unknown, but it helps to understand
the properties of the Newton’s method, such as its two phases and identify them
in problems. In practice, Newton’s method converges very rapidly, if at all.
As an example, consider the one dimensional function f(x) = 7x − ln x. Then
f′(x) = 7 − 1/x and f′′(x) = 1/x². The Newton update rule at a point x is
xnew = x − x²(7 − 1/x) = 2x − 7x². Starting with x(0) = 0 is really infeasible and useless,
since the updates will always be 0. The unique global minimizer of this function
is x∗ = 1/7. The range of quadratic convergence for Newton's method on this
function is x ∈ (0, 2/7). However, if you start at an infeasible point x(0) < 0
(or at any point outside this range), the iterates will quadratically tend to −∞!
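This behaviour is easy to reproduce numerically (a sketch of ours): the Newton update simplifies to x ← 2x − 7x², and the two starting points below land inside and outside the region of quadratic convergence respectively:

x = 0.05                  # inside (0, 2/7)
for k in range(6):
    x = 2*x - 7*x**2
    print(k, x)           # error to x* = 1/7 roughly squares at every step

x = 0.3                   # outside (0, 2/7): iterates turn negative
for k in range(4):
    x = 2*x - 7*x**2
    print(k, x)           # diverges quadratically toward -infinity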
There are some classes of functions for which theorem 84 can be applied very
constructively. They are
Pm
• − i=1 ln xi
• − ln t2 − xT x for t > 0
• − ln det(X)
Further, theorem 84 also comes handy for linear combinations of these functions.
These three functions are also at the heart of modern interior points method
theory.
25 Linear convergence adds a constant number of digits of accuracy at each iteration.

4.5.3 Variants of Newton’s Method


One important aspect of the algorithm in Figure 4.49 is step (1), which involves solving the linear system ∇²f(x(k))∆x(k) = −∇f(x(k)). The system can be easy to solve if the Hessian is a 100 × 100 sparse matrix, but it can get hairy for larger and denser matrices. Thus, it can be unfair to claim that Newton's method is faster than the gradient descent method merely because it takes fewer iterations to converge: each iteration of Newton's method involves solving a linear system in the Hessian, which can take time26 O(n³) for dense systems. Further, the method assumes that the Hessian is positive definite and therefore invertible, which might not always be so. Finally, Newton's method can make huge, uncontrolled steps, especially when the Hessian is nearly singular (for example, if the function is flat along a direction corresponding to a 0 or nearly 0 eigenvalue). Due to these disadvantages, most optimization packages do not use Newton's method.
There is a whole suite of methods, called quasi-Newton methods, that use approximations of the Hessian at each iteration in an attempt either to do less work per iteration or to handle singular Hessian matrices. These methods fall in between gradient methods and Newton's method and were introduced in the 1960s. Work on quasi-Newton methods sprang from the belief that often, in a large linear system, most variables should not depend on most other variables (that is, the system is generally sparse).
We should however note that in some signal and image processing problems, the Hessian has a nice structure which allows one to solve the linear system ∇²f(x(k))∆x(k) = −∇f(x(k)) in time much less than O(n³) (often in time comparable to that required for quasi-Newton methods), without having to explicitly store the entire Hessian. We next discuss some optimization techniques that use specific approximations to the Hessian ∇²f(x) for specific classes of problems, reducing the time required for computing the second derivatives.

4.5.4 Gauss Newton Approximation


The Gauss-Newton method decomposes the objective function (typically for a regression problem) as a composition of two functions27, f = l ∘ m: (i) the vector-valued model or regression function m : ℜ^n → ℜ^p, and (ii) the scalar-valued loss function l (such as the sum of squared differences between predicted outputs and target outputs). For example, if m_i is y_i − r(t_i, x), for parameter vector x ∈ ℜ^n and input instances (y_i, t_i) for i = 1, 2, …, p, the function f can be written as

f(x) = (1/2) Σ_{i=1}^p (y_i − r(t_i, x))²

26 O(n^2.7) to be precise.
27 Here, n is the number of weights.

An example of the function r is the linear regression function r(t_i, x) = xᵀt_i. Logistic regression provides an example of an objective function involving a cross-entropy loss:

f(x) = −Σ_{i=1}^p [ y_i log σ(xᵀt_i) + (1 − y_i) log σ(−xᵀt_i) ]

where σ(k) = 1/(1 + e^(−k)) is the logistic function.


The role of the loss function is typically to make the optimization well behaved, and this gives freedom in choosing l; many different objective functions share a common loss function. While the sum-squared loss function is used in many regression settings, the cross-entropy loss is used in many classification problems. These loss functions arise naturally from the problem of maximizing log-likelihoods.
The Hessian ∇²f(x) can be expressed using a matrix version of the chain rule as

∇²f(x) = J_m(x)ᵀ ∇²l(m) J_m(x) + Σ_{i=1}^p ∇²m_i(x) (∇l(m))_i

where J_m is the Jacobian28 of the vector-valued function m and the first term is denoted G_f(x). It can be shown that if ∇²l(m) ⪰ 0, then G_f(x) ⪰ 0. The term G_f(x) is called the Gauss-Newton approximation of the Hessian ∇²f(x). In many situations, G_f(x) is the dominant part of ∇²f(x) and the approximation is therefore reasonable. For example, at the point of minimum (which will be the critical point for a convex function), ∇²f(x) = G_f(x). Using the Gauss-Newton approximation to the Hessian ∇²f(x), the Newton update rule can be expressed as

∆x = −(G_f(x))⁻¹ ∇f(x) = −(G_f(x))⁻¹ J_m(x)ᵀ ∇l(m)

where we use the fact that (∇f(x))_i = Σ_{k=1}^p (∂l/∂m_k)(∂m_k/∂x_i), since the gradient of a composite function is a product of the Jacobians.
For the cross-entropy classification loss or the sum-squared regression loss l, the Hessian ∇²l(m) is known to be positive semi-definite. For example, if the loss function is the sum-squared loss, the objective function is f = (1/2) Σ_{i=1}^p m_i(x)² and ∇²l(m) = I. The Newton update rule can then be expressed as

∆x = −(J_m(x)ᵀ J_m(x))⁻¹ J_m(x)ᵀ m(x)

Recall that (J_m(x)ᵀ J_m(x))⁻¹ J_m(x)ᵀ is the Moore-Penrose pseudoinverse J_m(x)⁺ of J_m(x). The Gauss-Newton method for the sum-squared loss can thus be interpreted as multiplying the gradient ∇l(m) by the pseudo-inverse of the Jacobian of m
28 The Jacobian is a p × n matrix of the first derivatives of a vector-valued function, where p is the arity of m. The (i, j)th entry of the Jacobian is the derivative of the ith output with respect to the jth variable, that is, ∂m_i/∂x_j. For p = 1, the Jacobian is the gradient vector.

instead of its transpose (which is what the gradient descent method would do). Though the Gauss-Newton method has traditionally been used for non-linear least squares problems, it has recently also seen use with the cross-entropy loss function. The method is a simple adaptation of Newton's method, with the advantage that second derivatives, which can be computationally expensive and challenging to compute, are not required.
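As a concrete illustration, here is a minimal numpy sketch of a Gauss-Newton iteration (ours, not from the text), fitting a hypothetical exponential model r(t, x) = x₁e^(x₂t) under the sum-squared loss; the names residuals and jacobian are our own, and each step solves the system J_mᵀJ_m ∆x = −J_mᵀm.

    import numpy as np

    # A Gauss-Newton sketch for f(x) = 0.5 * sum_i (y_i - r(t_i, x))**2
    # with the hypothetical model r(t, x) = x[0] * exp(x[1] * t).

    def residuals(x, t, y):               # m_i(x) = y_i - r(t_i, x)
        return y - x[0] * np.exp(x[1] * t)

    def jacobian(x, t):                   # J_m: p x n matrix dm_i/dx_j
        J = np.empty((t.size, 2))
        J[:, 0] = -np.exp(x[1] * t)
        J[:, 1] = -x[0] * t * np.exp(x[1] * t)
        return J

    t = np.linspace(0, 1, 20)
    y = 2.0 * np.exp(-1.0 * t)            # synthetic, noise-free data
    x = np.array([1.0, 0.0])              # starting guess
    for _ in range(10):
        m, J = residuals(x, t, y), jacobian(x, t)
        dx = np.linalg.solve(J.T @ J, -J.T @ m)   # Gauss-Newton step
        x = x + dx
    print(x)                              # should approach [2.0, -1.0]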

4.5.5 Levenberg-Marquardt
Like the Gauss-Newton method, the Levenberg-Marquardt method has its main application in the least squares curve fitting problem (as also in the minimum cross-entropy problem). The Levenberg-Marquardt method interpolates between the Gauss-Newton algorithm and the method of gradient descent. It is more robust than the Gauss-Newton algorithm: it often finds a solution even if it starts very far from the final minimum. On the other hand, for well-behaved functions and reasonable starting parameters, it tends to be a bit slower than the Gauss-Newton algorithm. The Levenberg-Marquardt method aims to reduce the uncontrolled step sizes often taken by Newton's method, and thus fix the stability issue of Newton's method. The update rule is given by

∆x = −(G_f(x) + λ diag(G_f))⁻¹ J_m(x)ᵀ ∇l(m)

where G_f is the Gauss-Newton approximation to ∇²f(x) and is assumed to be positive semi-definite. This method is one of the work-horses of modern optimization. The parameter λ ≥ 0, which is adaptively controlled, limits steps to an elliptical model-trust region29. This is achieved by adding λ to the smallest eigenvalues of G_f, thus restricting all eigenvalues of the matrix to be above λ, so that the elliptical region has diagonals of shorter length that vary inversely with the eigenvalues (c.f. Section 3.11.3). While this method fixes the stability issues of Newton's method, it still requires the O(n³) time needed for matrix inversion.
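A minimal sketch of a single Levenberg-Marquardt step built on the same quantities (our illustration; lm_step is our name, and the accept/reject heuristic in the comments is one common choice, not the only one):

    import numpy as np

    def lm_step(J, m, lam):
        """One Levenberg-Marquardt step on the damped Gauss-Newton system.
        J: Jacobian of the residual vector m at the current x; lam: damping."""
        G = J.T @ J                          # Gauss-Newton approximation G_f
        A = G + lam * np.diag(np.diag(G))    # damp by lambda * diag(G_f)
        return np.linalg.solve(A, -J.T @ m)

    # A common adaptive control of lam (a heuristic sketch):
    #   if f(x + dx) < f(x): accept the step and set lam *= 0.5
    #   else:                reject the step and set lam *= 2.0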

4.5.6 BFGS
The Broyden-Fletcher-Goldfarb-Shanno30 (BFGS) method uses linear algebra to iteratively update an estimate B(k) of (∇²f(x(k)))⁻¹ (the inverse of the curvature matrix), while ensuring that the approximation to the Hessian inverse is symmetric and positive definite. Let ∆x(k) be the direction vector for the kth step, obtained as the solution to

∆x(k) = −B(k) ∇f(x(k))

The next point x(k+1) is obtained as

x(k+1) = x(k) + t(k) ∆x(k)
29 Essentially, the algorithm approximates only a certain region (the so-called trust region) of the objective function with a quadratic, as opposed to the entire function.
30 The four authors published papers on exactly the same method at essentially the same time.

where t(k) is the step size obtained by line search. Let ∆g(k) = ∇f(x(k+1)) − ∇f(x(k)). Then the BFGS update rule is derived by imposing the following logical conditions:

1. ∆x(k) = −B(k) ∇f(x(k)) with B(k) ≻ 0. That is, ∆x(k) is the minimizer of the convex quadratic model

   Q(k)(p) = f(x(k)) + ∇ᵀf(x(k)) p + (1/2) pᵀ (B(k))⁻¹ p

2. x(k+1) = x(k) + t(k) ∆x(k), where t(k) is obtained by line search.

3. The gradient of the function Q(k+1)(p) = f(x(k+1)) + ∇ᵀf(x(k+1)) p + (1/2) pᵀ (B(k+1))⁻¹ p at p = 0 and p = −t(k) ∆x(k) agrees with the gradient of f at x(k+1) and x(k) respectively. While the former condition is naturally satisfied, the latter needs to be imposed. This quasi-Newton condition yields

   (B(k+1))⁻¹ (x(k+1) − x(k)) = ∇f(x(k+1)) − ∇f(x(k))

   This equation is called the secant equation.

4. Finally, among all symmetric matrices satisfying the secant equation, B(k+1) is closest to the current matrix B(k) in some norm. Different matrix norms give rise to different quasi-Newton methods. In particular, when the norm chosen is the Frobenius norm, we get the following BFGS update rule

   B(k+1) = B(k) + R(k) + S(k)

   where

   R(k) = (∆x(k) (∆x(k))ᵀ) / ((∆x(k))ᵀ ∆g(k)) − (B(k) ∆g(k) (∆g(k))ᵀ B(k)) / ((∆g(k))ᵀ B(k) ∆g(k))

   and

   S(k) = ((∆g(k))ᵀ B(k) ∆g(k)) u uᵀ

   with

   u = ∆x(k) / ((∆x(k))ᵀ ∆g(k)) − B(k) ∆g(k) / ((∆g(k))ᵀ B(k) ∆g(k))

   We have made use of the Sherman-Morrison formula, which determines how updates to a matrix relate to updates to the inverse of the matrix.

The approximation to the Hessian inverse is updated by analyzing successive gradient vectors, and thus the Hessian matrix itself never needs to be computed. The initial estimate B(0) can be taken to be the identity matrix, so that the first step is equivalent to a gradient descent step. The BFGS method has a reduced complexity of O(n²) time per iteration. The method is summarized

Find a starting point x(0) ∈ D and an approximate B(0) (which could be I).
Select an appropriate tolerance ǫ > 0.
repeat
    1. Set ∆x(k) = −B(k) ∇f(x(k)).
    2. Let λ² = ∇ᵀf(x(k)) B(k) ∇f(x(k)).
    3. If λ²/2 ≤ ǫ, quit.
    4. Set step size t(k) = 1.
    5. Obtain x(k+1) = x(k) + t(k) ∆x(k).
    6. Compute ∆g(k) = ∇f(x(k+1)) − ∇f(x(k)).
    7. Compute R(k) and S(k).
    8. Compute B(k+1) = B(k) + R(k) + S(k).
    9. Set k = k + 1.
until

Figure 4.50: The BFGS method.

in Figure 4.50. The BFGS [?] method approaches Newton's method in behaviour as the iterate approaches the solution, and it is much faster than Newton's method in practice. It has been proved that when BFGS is applied to a convex quadratic function with exact line search, it finds the minimizer within n steps. There is a variety of methods related to BFGS, collectively known as quasi-Newton methods. They are preferred over Newton's method or Levenberg-Marquardt when it comes to speed. There is a variant of BFGS, called LBFGS [?], which stands for "limited memory BFGS". LBFGS employs a limited-memory quasi-Newton approximation that does not require much storage or computation. It limits the rank of the approximation to the inverse of the Hessian to some small number γ, so that only nγ numbers have to be stored instead of n² numbers. For general non-convex problems, LBFGS may fail when the initial geometry (in the form of B(0)) has been placed very close to a saddle point. Also, LBFGS is very sensitive to the line search. Recently, L-BFGS has been observed [?] to be the most effective parameter estimation method for maximum entropy models, much better than improved iterative scaling [?] (IIS) and generalized iterative scaling [?] (GIS).
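The inverse-Hessian update above translates directly into a few lines of numpy. A minimal sketch of steps (6)-(8) of Figure 4.50 (ours; a production implementation would add a Wolfe line search and a safeguard requiring ∆xᵀ∆g > 0):

    import numpy as np

    def bfgs_update(B, dx, dg):
        """Update the inverse-Hessian estimate B from the step
        dx = x_{k+1} - x_k and the gradient change dg = grad_{k+1} - grad_k."""
        xg = dx @ dg                    # should be > 0 for a valid update
        Bg = B @ dg
        gBg = dg @ Bg
        R = np.outer(dx, dx) / xg - np.outer(Bg, Bg) / gBg
        u = dx / xg - Bg / gBg
        S = gBg * np.outer(u, u)        # correction term
        return B + R + S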

4.5.7 Solving Large Sparse Systems


In many convex optimization problems, such as least squares and Newton's method for optimization, one has to deal with solving linear systems involving large and sparse matrices. Elimination with ordering can be expensive in such cases. A lot of work has gone into solving such problems efficiently31 using iterative
31 Packages such as LINPACK (since superseded by LAPACK), EISPACK, MINPACK, etc., which can be found in the netlib repository, have focused on efficiently solving large linear systems under general conditions as well as specific conditions such as symmetry or positive definiteness of the coefficient matrix.

methods instead of direct elimination methods. An example is an iterative method that solves a system Ax = b by repeatedly multiplying the large, sparse matrix A with vectors, to quickly get an answer x̂ that is sufficiently close to the optimal solution x∗. Multiplying an n × n sparse matrix A having k non-zero entries per row with a vector of dimension n takes only O(kn) time, in contrast to the O(n³) time for Gaussian elimination. We will study three types of methods for solving systems with large and sparse matrices:
1. Iterative Methods.
2. Multigrid Methods.
3. Krylov Methods.
The most famous and successful amongst the Krylov methods has been the
conjugate gradient method, which works for problems with positive definite ma-
trices.

Iterative Methods

The central step in an iteration is

P x_{k+1} = (P − A) x_k + b

where x_k is the estimate of the solution at the kth step, for k = 0, 1, …. If the iterations converge to a fixed point, that is, if x_{k+1} = x_k = x, then Px = (P − A)x + b, i.e., Ax = b, and the solution has been reached. The choice of the matrix P, which is called the preconditioner, determines the rate of convergence of the solution sequence to the actual solution. The initial estimate x_0 can be arbitrary for linear systems, but for non-linear systems it is important to start with a good approximation. It is desirable to choose the matrix P reasonably close to A, though setting P = A (which is referred to as perfect preconditioning) would entail solving the large system Ax = b, which is what we set out to avoid. If x∗ is the actual solution, the relationship between the errors e_k and e_{k+1} at the kth and (k+1)th steps respectively can be expressed as

P e_{k+1} = (P − A) e_k

where e_k = x_k − x∗. This is called the error equation. Thus,

e_{k+1} = (I − P⁻¹A) e_k = M e_k
Whether the solutions are convergent or not is controlled by the matrix M .
The iterations are stationary (that is, the update is of the same form at every
step). On the other hand, Multigrid and Krylov methods adapt themselves
across iterations to enable faster convergence. The error after k steps is given
by

ek = M k e0 (4.96)

Using the idea of eigenvector decomposition presented in (3.101), it can be


proved that the error vector ek → 0 if the absolute values of all the eigenvalues
of M are less than 1. This is the fundamental theorem of iteration. In this
case, the rate of convergence of ek to 0 is determined by the maximum absolute
eigenvalue of M , called the spectral radius of M and denoted by ρ(M ).
Any iterative method should attempt to choose P so that it is easy to com-
pute xk+1 and at the same time, the matrix M = I−P −1 A has small eigenvalues.
Corresponding to various choices of the preconditioner P , there exist different
iterative methods.

1. Jacobi: In the simplest setting, P can be chosen to be a diagonal matrix with its diagonal borrowed from A. This choice of P corresponds to the Jacobi method (a numerical sketch of the iteration follows this list). The value of ρ(M) is less than 1 for the Jacobi method, though it is often very close to 1. Thus, the Jacobi method does converge, but the convergence can be very slow in practice. While the residual r̂ = Ax̂ − b converges rapidly, the error e = x̂ − x∗ decreases rapidly in the beginning, but its rate of decrease reduces as iterations proceed. This happens because e = A⁻¹r̂ and A⁻¹ tends to have a large condition number for sparse matrices. In fact, it can be shown that Jacobi can take up to nβ iterations to reduce the error by a factor β.

   We will take an example to illustrate the Jacobi method. Consider the following n × n tridiagonal matrix A:

       [  2  −1   0  ···   0 ]
       [ −1   2  −1  ···   0 ]
   A = [  ·   ·   ·  ···   · ]        (4.97)
       [  0  ··· −1   2  −1 ]
       [  0  ···  0  −1   2 ]

   The absolute value of the ith eigenvalue of M is |cos(iπ/(n+1))| and the spectral radius is ρ(M) = cos(π/(n+1)). For extremely large n, the spectral radius is approximately 1 − (1/2)(π/(n+1))², which is very close to 1. Thus, the Jacobi steps converge very slowly.

2. Gauss-Seidel: The second possibility is to choose P to be the lower-triangular part of A. The method for this choice is called the Gauss-Seidel method. For the example tridiagonal matrix A in (4.97), the matrix P − A will be the negated strict upper-triangular part of A. For the Gauss-Seidel technique, the components of x_{k+1} can be determined from x_k using forward substitution. The Gauss-Seidel method provides only a constant factor improvement over the Jacobi method.

3. Successive over-relaxation: In this method, the preconditioner is obtained as a weighted composition of the preconditioners from the above two methods. It is abbreviated as SOR. Historically, this was the first step of progress beyond Jacobi and Gauss-Seidel.

4. Incomplete LU: This method involves an incomplete elimination on the sparse matrix A. For a sparse matrix A, many entries in its LU decomposition will be nearly 0; the idea behind this method is to treat such entries as 0's. Thus, the L and U factors are approximated based on a tolerance threshold: if the tolerance threshold is zero, the factors are exact; otherwise they are approximate.
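As promised above, a minimal numpy sketch of the Jacobi iteration (ours, not from the text) on the tridiagonal matrix of (4.97); P is the diagonal of A, so each sweep is x ← x + P⁻¹(b − Ax):

    import numpy as np

    n = 20
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # the matrix in (4.97)
    b = np.ones(n)
    P_inv = 1.0 / np.diag(A)          # Jacobi preconditioner: P = diag(A)

    x = np.zeros(n)
    for k in range(5000):             # slow: rho(M) = cos(pi/(n+1)) is close to 1
        r = b - A @ x
        if np.linalg.norm(r) < 1e-8:
            break
        x = x + P_inv * r
    print(k, np.linalg.norm(b - A @ x))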

Multigrid Methods
Multigrid methods come in very handy for solving large sparse systems, especially ones arising from differential equations, using a hierarchy of discretizations. This approach often scales linearly with the number of unknowns n for a pre-specified accuracy threshold. The overall multigrid algorithm for solving A_h u_h = b_h, with residual given by r_h = b_h − A_h u_h, is

1. Smoothing: Perform a few (say 2-3) iterations on A_h u = b_h using either Jacobi or Gauss-Seidel. This helps remove the high-frequency components of the residual r = b_h − A_h u. This step is really outside the core of the multigrid method. Denote the solution obtained by u_h, and let r_h = b_h − A_h u_h.

2. Restriction: Restrict r_h to the coarse grid by setting r_{2h} = R r_h; that is, r_h is downsampled to yield r_{2h}. Let k < n characterize the coarse grid. Then the k × n matrix R is called the restriction matrix; it takes residuals from a finer grid to a coarser grid. It is typically scaled so that a vector of 1's on the fine mesh gets transformed to a vector of 1's on the coarse mesh. Calculations on the coarse grid are much faster than on the finer grid.

3. Solve A_{2h} e_{2h} = r_{2h} with A_{2h} = R A_h N, which is a natural construction for the coarse mesh operator. This could be done by running a few iterations of Jacobi, starting with e_{2h} = 0.

4. Interpolation/Prolongation: This step involves interpolating the correction computed on the coarser grid back to the finer grid: e_h = N e_{2h}. Here N is an n × k interpolation matrix; it takes residuals from a coarse grid to a fine grid. It is generally a good idea to connect N to R by setting N = αRᵀ for some scaling factor α. Add e_h to u_h. The

analytical expression for e_h is

e_h = N (A_{2h})⁻¹ R A_h (u − u_h) = N (R A_h N)⁻¹ R A_h (u − u_h) = S (u − u_h)

where S = N (R A_h N)⁻¹ R A_h. A property of the n × n matrix S is that S² = S. Thus, the only eigenvalues of S are 0 and 1. Since S is of rank k < n, k of its eigenvalues are 1 and n − k are 0. Further, the eigenvectors for the eigenvalue 1, which lie in the null space of I − S, form the coarse mesh (and correspond to low-frequency vectors), whereas the eigenvectors for the eigenvalue 0, which lie in the null space of S, form the fine mesh (and correspond to high-frequency vectors). We can easily derive that k eigenvalues of I − S will be 0 and the remaining n − k will be 1.

5. Finally, as a post-smoothing step, iterate on A_h u_h = b_h starting from the improved u_h + e_h, using Jacobi or Gauss-Seidel.

Overall, the error e_k after k steps will be of the form

e_k = (Mᵗ (I − S) Mᵗ) e_0        (4.98)

where t is the number of Jacobi steps performed in (1) and (5); typically t is 2 or 3. Contrasting (4.98) against (4.96), we discover that ρ(M) ≫ ρ(Mᵗ(I − S)Mᵗ). As t increases, ρ(Mᵗ(I − S)Mᵗ) decreases further, though by smaller proportions.

In general, you could have multiple levels of coarse grids, corresponding to 2h, 4h, 8h and so on, in which case steps (2), (3) and (4) would be repeated as many times with varying specifications of coarseness. If A is an n × n matrix, multigrid methods are known to run in O(n²) floating point operations (flops). The multigrid method can be used as an iterative method to solve a linear system; alternatively, it can be used to obtain a preconditioner.

Linear Conjugate Gradient Method


The conjugate gradient method is one of the most popular Krylov methods. The Krylov matrix K_j for the linear system Au = b is given by

K_j = [ b  Ab  A²b  …  A^(j−1)b ]

The columns of K_j are easy to compute: each column is the result of multiplying the previous column by A. Assuming we are working with sparse matrices (often symmetric matrices such as the Hessian), these computations will be inexpensive. The Krylov space K_j is the column space of K_j. The columns of K_j are computed during the first j steps of an iterative method such as Jacobi. Most Krylov methods opt to choose vectors from K_j instead of making a fixed choice of the jth column of K_j. A method such as MinRes chooses a vector

u_j ∈ K_j that minimizes ||b − Au_j||. One of the well-known Krylov methods is the conjugate gradient method, which assumes that the matrix A is symmetric and positive definite, and which is faster than MinRes. In this method, the choice of u_j is made so that the residual r_j = b − Au_j is orthogonal to the space K_j. The conjugate gradient method gives an exact solution to the linear system for j = n, and that is how it was originally designed to be used (and put aside subsequently). Later, however, it was found to give very good approximations for j ≪ n.
The discussion that follows requires the computation of a basis for K_j. It is always preferred to have a basis matrix with a low condition number32, and an orthonormal basis is a good choice, since it has a condition number of 1 (the basis consisting of the columns of K_j turns out to be not-so-good in practice). The Arnoldi method yields an orthonormal Krylov basis q_1, q_2, …, q_j that is numerically reasonable to work with. The method is summarized in Figure 4.51. Though the underlying idea is borrowed from Gram-Schmidt, there is a difference at every step: the vector to be orthogonalized is t = Aq_j, as against simply the next column of K_j. Will it be expensive to compute each t? Not if A is symmetric. First we note that, by construction, AQ = QH, where q_j is the jth column of Q. Thus, H = QᵀAQ. If A is symmetric, then so is H. Further, since H has only one lower diagonal (by construction), it must then have only one upper diagonal. Therefore, H must be symmetric and tridiagonal, and it suffices to subtract from t only its components in the directions of the last two vectors q_{j−1} and q_j. Thus, for a symmetric A, the inner 'for' loop needs to iterate only over i = j − 1 and i = j.

Since A and H are similar matrices, they have exactly the same eigenvalues. Restricting the computation to a smaller number of orthonormal vectors (for some k ≪ n), we can save time by computing only Q_k and H_k. The k eigenvalues of H_k are good approximations to the top k eigenvalues of H. This is called the Arnoldi-Lanczos method for finding the top k eigenvalues of a matrix.
As an example, consider the following matrix A.
 
    [ 0.5344  1.0138  1.0806  1.8325 ]
A = [ 1.0138  1.4224  0.9595  0.8234 ]
    [ 1.0806  0.9595  1.0412  1.0240 ]
    [ 1.8325  0.8234  1.0240  0.7622 ]

32 For any matrix A, the condition number κ(A) = σ_max(A)/σ_min(A), where σ_max(A) and σ_min(A) are the maximal and minimal singular values of A respectively. Recall from Section 3.13 that the ith eigenvalue of AᵀA (the Gram matrix) is the square of the ith singular value of A. Further, if A is normal, κ(A) = |λ_max(A)|/|λ_min(A)|, where λ_max(A) and λ_min(A) are the eigenvalues of A with maximal and minimal magnitudes respectively. All orthogonal, symmetric, and skew-symmetric matrices are normal. The condition number measures how much the columns/rows of a matrix depend on each other; the higher the condition number, the greater the linear dependence. Condition number 1 means that the columns/rows of a matrix are linearly independent.

Set q_1 = b/||b||. //The first step in Gram-Schmidt.
for j = 1 to n − 1 do
    t = Aq_j.
    for i = 1 to j do
        //If A is symmetric, it suffices to take i = max(1, j − 1) to j.
        H_{i,j} = q_iᵀ t.
        t = t − H_{i,j} q_i.
    end for
    H_{j+1,j} = ||t||.
    q_{j+1} = t/||t||.
end for
t = Aq_n.
for i = 1 to n do
    //If A is symmetric, it suffices to take i = n − 1 to n.
    H_{i,n} = q_iᵀ t.
    t = t − H_{i,n} q_i.
end for

Figure 4.51: The Arnoldi algorithm for computing an orthonormal basis.

and the vector b

b = [ 0.6382  0.3656  0.1124  0.5317 ]ᵀ

The matrix K_4 is

      [ 0.6382  1.8074  8.1892  34.6516 ]
K_4 = [ 0.3656  1.7126  7.5403  32.7065 ]
      [ 0.1124  1.7019  7.4070  31.9708 ]
      [ 0.5317  1.9908  7.9822  34.8840 ]

Its condition number is 1080.4.


The algorithm in Figure 4.51 computes the following basis for the matrix K_4:

      [ 0.6979  -0.3493   0.5101  -0.3616 ]
Q_4 = [ 0.3998   0.2688   0.2354   0.8441 ]
      [ 0.1229   0.8965   0.1687  -0.3908 ]
      [ 0.5814   0.0449  -0.8099  -0.0638 ]

The coefficient matrix H_4 is

      [ 3.6226  1.5793   0       0      ]
H_4 = [ 1.5793  0.6466   0.5108  0      ]
      [ 0       0.5108  -0.8548  0.4869 ]
      [ 0       0        0.4869  0.3459 ]

and its eigenvalues are 4.3125, 0.5677, −1.2035 and 0.0835. On the other hand, the matrix H_3 (the leading 3 × 3 block of H_4, obtained by restricting the computation to K_3) has eigenvalues 4.3124, 0.1760 and −1.0741.
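These numbers are easy to reproduce. Below is a minimal numpy transcription of Figure 4.51 (ours; the function name arnoldi is our own) which, applied to the A and b above, should recover Q_4 and H_4 up to rounding:

    import numpy as np

    def arnoldi(A, b, k):
        """Orthonormal Krylov basis Q and coefficient matrix H for A and b."""
        n = b.size
        Q = np.zeros((n, k + 1))
        H = np.zeros((k + 1, k))
        Q[:, 0] = b / np.linalg.norm(b)
        for j in range(k):
            t = A @ Q[:, j]
            for i in range(j + 1):          # full Gram-Schmidt; for symmetric A,
                H[i, j] = Q[:, i] @ t       # i = j-1, j would suffice (Lanczos)
                t -= H[i, j] * Q[:, i]
            H[j + 1, j] = np.linalg.norm(t)
            if H[j + 1, j] > 1e-12:
                Q[:, j + 1] = t / H[j + 1, j]
        return Q[:, :k], H[:k, :k]

    A = np.array([[0.5344, 1.0138, 1.0806, 1.8325],
                  [1.0138, 1.4224, 0.9595, 0.8234],
                  [1.0806, 0.9595, 1.0412, 1.0240],
                  [1.8325, 0.8234, 1.0240, 0.7622]])
    b = np.array([0.6382, 0.3656, 0.1124, 0.5317])
    Q, H = arnoldi(A, b, 4)
    print(np.sort(np.linalg.eigvals(H).real))   # compare with the values above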
The basic conjugate gradient method selects vectors x_k ∈ K_k that approach the exact solution to Ax = b. The following are the main ideas in the conjugate gradient method.
1. The rule is to select an xk so that the new residual rk = b − Axk is
orthogonal to all the previous residuals. Since Axk ∈ Kk+1 , we must have
rk ∈ Kk+1 and rk must be orthogonal to all vectors in Kk . Thus, rk must
be a multiple of qk+1 . This holds for all k and implies that

rTk ri = 0

for all i < k.


2. Consequently, the difference rk − rk−1 , which is a linear combination of
qk+1 and qk , is orthogonal to each subspace Ki for i < k.
3. Now, xi −xi−1 lies in the subspace Ki . Thus, ∆r = rk −rk−1 is orthogonal
to all the previous ∆x = xi − xi−1 . Since rk − rk−1 = −A(xk − xk−1 ), we
get the following ‘conjugate directions’ condition for the updates

(xi − xi−1 )T A(xk − xk−1 ) = 0

for all i < k. This is a necessary and sufficient condition for the orthogo-
nality of the new residual to all the previous residuals. Note that while the
residual updates are orthogonal in the usual inner product, the variable
updates are orthogonal in the inner product with respect to A.
The basic conjugate gradient method consists of 5 steps. Each iteration of the algorithm involves one multiplication of the vector d_{k−1} by A and the computation of two inner products. In addition, an iteration also involves around three vector updates. So each iteration takes time up to (2 + θ)n, where θ is determined by the sparsity of the matrix A. The error e_k after k iterations is bounded as follows:

||e_k||_A = √((x_k − x∗)ᵀ A (x_k − x∗)) ≤ 2 ((√κ(A) − 1)/(√κ(A) + 1))^k ||e_0||_A

The ‘gradient’ part of the name conjugate gradient stems from the fact that solving the linear system Ax = b corresponds to finding the minimum value

x_0 = 0, r_0 = b, d_0 = r_0, k = 1.
repeat
    1. α_k = (r_{k−1}ᵀ r_{k−1}) / (d_{k−1}ᵀ A d_{k−1}). //Step length for the next update. This corresponds to the entry H_{k,k}.
    2. x_k = x_{k−1} + α_k d_{k−1}.
    3. r_k = r_{k−1} − α_k A d_{k−1}. //New residual, obtained using r_k − r_{k−1} = −A(x_k − x_{k−1}).
    4. β_k = (r_kᵀ r_k) / (r_{k−1}ᵀ r_{k−1}). //Improvement over the previous step. This corresponds to the entry H_{k,k+1}.
    5. d_k = r_k + β_k d_{k−1}. //The next search direction, which should be conjugate to the search direction just used.
    k = k + 1.
until β_k < θ.

Figure 4.52: The conjugate gradient algorithm for solving Ax = b or, equivalently, for minimizing E(x) = (1/2) xᵀAx − xᵀb.

of the convex (for positive definite A) energy function E(x) = (1/2) xᵀAx − bᵀx by setting its gradient Ax − b to the zero vector. The steepest descent method moves along the direction of the residual r at every step, but it does not have great convergence; we end up doing a lot of work to make a little progress. In contrast, as reflected in the step d_k = r_k + β_k d_{k−1}, the conjugate gradient method makes a step in the direction of the residual, but only after removing the component along the direction of the step it just took. Figures 4.53 and 4.54 depict the steps taken by the steepest descent and the conjugate gradient techniques respectively, on the level curves of the function E(x) = (1/2) xᵀAx − xᵀb, in two dimensions. It can be seen that while the steepest descent technique requires many iterations for convergence, owing to its oscillations, the conjugate gradient method takes steps that are orthogonal with respect to A (or are orthogonal in the transformed space obtained by multiplying with A), thus taking into account the geometry of the problem and requiring fewer steps. If the matrix A is a Hessian, the steps taken by conjugate gradient are orthogonal in the local Mahalanobis metric induced by the curvature matrix A. Note that if x(0) = 0, the first step taken by both methods will be the same.
In exact arithmetic, the conjugate gradient method is guaranteed to reach the minimum of the energy function E in at most n steps. Further, if A has only r distinct eigenvalues, then the conjugate gradient method terminates at the solution in at most r iterations.
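Figure 4.52 is equally compact in code. Below is a minimal numpy sketch (ours; it uses a residual-norm stopping test in place of the β_k test in the figure):

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
        """Solve Ax = b for symmetric positive definite A (cf. Figure 4.52)."""
        n = b.size
        x = np.zeros(n)
        r = b.astype(float).copy()   # residual r_0 = b - A x_0 with x_0 = 0
        d = r.copy()                 # first search direction
        rr = r @ r
        for _ in range(max_iter or n):
            Ad = A @ d
            alpha = rr / (d @ Ad)    # step length
            x += alpha * d
            r -= alpha * Ad          # r_k = r_{k-1} - alpha_k A d_{k-1}
            rr_new = r @ r
            if np.sqrt(rr_new) < tol:
                break
            d = r + (rr_new / rr) * d   # next direction, conjugate to the last
            rr = rr_new
        return x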

4.5.8 Conjugate Gradient


Figure 4.53: Illustration of the steepest descent technique on level curves of the function E(x) = (1/2) xᵀAx − xᵀb.

Figure 4.54: Illustration of the conjugate gradient technique on level curves of the function E(x) = (1/2) xᵀAx − xᵀb.

We have seen that the conjugate gradient method in Figure 4.52 can be viewed as a minimization algorithm for the convex quadratic function E(x) = (1/2) xᵀAx − xᵀb. Can the approach be adapted to minimize general nonlinear
convex functions? Nonlinear variants of conjugate gradient are well studied [?] and have proved to be quite successful in practice. The general conjugate gradient method is essentially an incremental way of doing second order search.

Fletcher and Reeves showed how to extend the conjugate gradient method to nonlinear functions by making two simple changes33 to the algorithm in Figure 4.52. First, in place of the exact line search formula in step (1) for the step length α_k, we need to perform a line search that identifies an approximate minimum of the nonlinear function f along d(k−1). Second, the residual r(k), which is simply the negative gradient of E (and which points in a direction of decreasing E), must be replaced by the negative gradient of the nonlinear objective f, which serves a similar purpose. These changes give rise to the algorithm for nonlinear optimization outlined in Figure 4.55. The search directions d(k) are computed by Gram-Schmidt conjugation of the residuals, as with linear conjugate gradient. The algorithm is very sensitive to the line minimization step and generally requires a very good line minimization. Any line search procedure that yields an α_k satisfying the strong Wolfe conditions (see (4.90) and (4.91)) will ensure that all directions d(k) are descent directions for the function f; otherwise, d(k) may cease to remain a descent direction as iterations proceed. We note that each iteration of this method costs O(n), as against the Newton or quasi-Newton methods, which cost at least O(n²) owing to matrix operations. Most often, it yields optimal progress after h ≪ n iterations. Due to this property, the conjugate gradient method drives nearly all large-scale optimization today.
33 We note that in the algorithm in Figure 4.52, the residuals r(k) in successive iterations

(which are gradients of E) are orthogonal to each other, while the corresponding update
directions are orthogonal with respect to A. While the former property is difficult to enforce
for general non-linear functions, the latter condition can be enforced.

Select x(0). Let f_0 = f(x(0)), g(0) = ∇f(x(0)), d(0) = −g(0), k = 1.
repeat
    1. Compute α_k by line search.
    2. Set x(k) = x(k−1) + α_k d(k−1).
    3. Evaluate g(k) = ∇f(x(k)).
    4. β_k = ((g(k))ᵀ g(k)) / ((g(k−1))ᵀ g(k−1)).
    5. d(k) = −g(k) + β_k d(k−1).
    k = k + 1.
until ||g(k)||/||g(0)|| < θ OR k > maxIter.

Figure 4.55: The conjugate gradient algorithm for optimizing a nonlinear convex function f.

It has revolutionized optimization ever since its invention in the 1950s.

Variants of the Fletcher-Reeves method use different choices of the parameter β_k. An important variant, proposed by Polak and Ribiere, defines β_k as

β_k^PR = ((g(k))ᵀ (g(k) − g(k−1))) / ((g(k−1))ᵀ g(k−1))

The Fletcher-Reeves method converges if the starting point is sufficiently close to the desired minimum. Convergence of the Polak-Ribiere method, however, can be guaranteed by choosing

β_k = max{β_k^PR, 0}

Using this value is equivalent to restarting34 conjugate gradient if β_k^PR < 0. In practice, the Polak-Ribiere method converges much more quickly than the Fletcher-Reeves method. It is generally necessary to restart the conjugate gradient method after every n iterations, in order to recover conjugacy.
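A minimal sketch of nonlinear conjugate gradient with the Polak-Ribiere-plus choice of β_k (ours; scipy's Wolfe line search stands in for a hand-rolled one, and the function name nonlinear_cg is our own):

    import numpy as np
    from scipy.optimize import line_search

    def nonlinear_cg(f, grad, x0, tol=1e-6, max_iter=1000):
        """Polak-Ribiere(+) nonlinear conjugate gradient (a sketch)."""
        x = x0.copy()
        g = grad(x)
        d = -g
        for k in range(max_iter):
            alpha = line_search(f, grad, x, d)[0]   # Wolfe line search
            if alpha is None:
                break                               # line search failed
            x = x + alpha * d
            g_new = grad(x)
            beta = max(g_new @ (g_new - g) / (g @ g), 0.0)  # PR+, restarts if < 0
            d = -g_new + beta * d
            g = g_new
            if np.linalg.norm(g) < tol:
                break
        return x

    # Example: minimize the convex quadratic E(x) = 0.5 x^T A x - b^T x.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    x = nonlinear_cg(lambda x: 0.5 * x @ A @ x - b @ x,
                     lambda x: A @ x - b, np.zeros(2))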
If we choose f to be the strongly convex quadratic E and α_k to be the exact minimizer, this algorithm reduces to the linear conjugate gradient method. Unlike the linear conjugate gradient method, whose convergence properties are well understood and which is known to be optimal (see page 321), nonlinear conjugate gradient methods sometimes show bizarre convergence properties. It has been proved by Al-Baali that if the level set L = {x | f(x) ≤ f(x(0))} of a convex function f is bounded, f is Lipschitz continuously differentiable in some open neighborhood of L, and the algorithm is implemented with a line search that satisfies the strong Wolfe conditions with 0 < c_1 < c_2 < 1, then

lim inf_{k→∞} ||g(k)|| = 0

34 Restarting conjugate gradient means forgetting the past search directions and starting anew in the direction of steepest descent.



In summary, quasi-Newton methods are robust, but they require O(n²) memory to store the approximate Hessian inverse, and so they are not directly suited for large scale problems. Modifications of these methods, called limited memory quasi-Newton methods, use O(n) memory and are suited for large scale problems. Conjugate gradient methods also work well and are well suited for large scale problems. However, they need to be implemented carefully, with a carefully set line search. In some situations, block coordinate descent methods (optimizing a selected subset of variables at a time) can be much better suited than the above methods.

4.6 Algorithms for Constrained Minimization


The general form of constrained convex optimization problem was given in (4.20)
and is restated below.

minimize f (x)
subject to gi (x) ≤ 0, i = 1, . . . , m (4.99)
Ax = b

For example, when f is linear and the g_i's are polyhedral, the problem is a linear program, which was stated in (4.83) and whose dual was discussed on page 289. Linear programming is a typical example of a constrained minimization problem and will form the subject matter of Section 4.7. As another example, when f is quadratic (of the form xᵀQx + bᵀx) and the g_i's are polyhedral, the problem is called a quadratic programming problem. A special case of quadratic programming is the least squares problem, which we will take up in detail in Section 4.8.

4.6.1 Equality Constrained Minimization


The simpler form of constrained convex optimization arises when there are only equality constraints in problem (4.99); it turns out to be not much different from the unconstrained case. The equality constrained convex problem can be stated more explicitly as in (4.100).

minimize f (x)
(4.100)
subject to Ax = b

where f is a convex and twice continuously differentiable function and A ∈ ℜ^(p×n) has rank p. We will assume that the finite primal optimal value p∗ is attained by f at some point x̂. The following fundamental theorem for the equality constrained convex problem (4.100) can be derived using the KKT conditions stated

in Section 4.4.4, which were proved to be necessary and sufficient conditions for optimality of a convex problem with differentiable objective and constraint functions.

Theorem 85 x̂ is an optimal point for the primal iff there exists a µ̂ such that the following conditions are satisfied:

∇f(x̂) + Aᵀµ̂ = 0
Ax̂ = b        (4.101)

The term ∇f(x̂) + Aᵀµ̂ is sometimes called the dual residual (r_d), while the term Ax̂ − b is referred to as the primal residual (r_p). The optimality condition basically states that r_d and r_p should both be 0; the success of this test is a certificate of optimality.
As an illustration of this theorem, consider the constrained quadratic problem

minimize   (1/2) xᵀAx + bᵀx + c
subject to Px = q        (4.102)

By theorem 85, the necessary and sufficient condition for optimality of a point (x̂, λ̂) is

[ A  Pᵀ ] [ x̂ ]   [ −b ]
[ P  0  ] [ λ̂ ] = [  q ]

where the coefficient matrix on the left is called the KKT matrix35. The KKT matrix is nonsingular iff A + PᵀP ≻ 0. In such an event, this system of n + p linear equations in n + p unknowns has a unique solution, corresponding to the point of global minimum of (4.102). The linearly constrained least squares problem is a specific example of this and is discussed in Section 4.8.2.
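Theorem 85 thus reduces equality constrained quadratic problems to a single linear solve. A minimal numpy sketch for problem (4.102) (ours; the function name eq_constrained_qp is our own, and nonsingularity of the KKT matrix is assumed):

    import numpy as np

    def eq_constrained_qp(A, b, P, q):
        """Minimize 0.5 x^T A x + b^T x + c subject to P x = q,
        by solving the KKT system."""
        n, p = A.shape[0], P.shape[0]
        KKT = np.block([[A, P.T],
                        [P, np.zeros((p, p))]])
        rhs = np.concatenate([-b, q])
        sol = np.linalg.solve(KKT, rhs)   # assumes the KKT matrix is nonsingular
        return sol[:n], sol[n:]           # (x_hat, lambda_hat)

    # e.g. minimize 0.5*(x1^2 + x2^2) s.t. x1 + x2 = 1  ->  x_hat = [0.5, 0.5]
    x_hat, lam = eq_constrained_qp(np.eye(2), np.zeros(2),
                                   np.ones((1, 2)), np.array([1.0]))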

Eliminating Equality Constraints


Figure 3.3 summarized the number of solutions to the system Ax = b under
different conditions. In particular, when the rank of A is the number of its
rows (p) and is less than the number of its columns (n), there are infinitely
many solutions. This was logically derived in (3.35), and we restate it here for
reference:
xcomplete = xparticular + xnullspace
where the three vectors are defined with respect to the reduced row echelon
form R of A (c.f. Section 3.6.2):
35 This matrix comes up very often in many areas such as optimization, mechanics, etc.

1. xcomplete : specifies any solution to Ax = b

2. xparticular : is obtained by setting all free variables (corresponding to


columns with no pivots) to 0 and solving Ax = b for pivot variables.

3. xnullspace : is any vector in the null space of the matrix A, obtained as a


linear combination of the basis vectors for N (A).

Using formula (3.27) on page 169 to derive the null basis N ∈ ℜ^(n×(n−p)) (that is, AN = 0 and the columns of N span N(A)), we get the following free-parameter expression for the solution set of Ax = b:

{x | Ax = b} = { Nz + x_particular | z ∈ ℜ^(n−p) }

We can express the constrained problem in (4.100) in terms of the variables z ∈ ℜ^(n−p) (that is, through an affine change of coordinates) to get the following equivalent problem:

minimize_{z ∈ ℜ^(n−p)}  f(Nz + x_particular)        (4.103)

This problem is equivalent to the original problem in (4.100), has no equality constraints, and has p fewer variables. The optimal solutions x̂ and µ̂ to the primal and dual of (4.100) respectively can be expressed in terms of the optimal solution ẑ of (4.103) as:

x̂ = Nẑ + x_particular
µ̂ = −(AAᵀ)⁻¹ A ∇f(x̂)        (4.104)

Any iterative algorithm applied to solve the problem (4.103) will ensure that all intermediate points are feasible, since for any z ∈ ℜ^(n−p), x = Nz + x_particular is feasible, that is, Ax = b. Moreover, when Newton's method is applied, the iterates are independent of the exact affine change of coordinates induced by the choice of the null basis N (c.f. page 306). The Newton update rule ∆z(k) for (4.103) is given by the solution to:

Nᵀ ∇²f(Nz(k) + x_particular) N ∆z(k) = −Nᵀ ∇f(Nz(k) + x_particular)

Due to the affine invariance of Newton's method, if z(0) is the starting iterate and x(0) = Nz(0) + x_particular, the kth iterate x(k) = Nz(k) + x_particular is independent of the choice of the null basis N. We therefore do not need a separate convergence analysis. The algorithm for Newton's method was outlined in Figure 4.49. Techniques for handling constrained optimization using Newton's method given an infeasible starting point x(0) can be found in [?].
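A minimal numpy sketch of the elimination approach (ours; the null basis N is obtained from the SVD rather than from the reduced row echelon form, a least-squares solve supplies x_particular, and scipy's unconstrained minimizer stands in for Newton's method):

    import numpy as np
    from scipy.optimize import minimize

    def eliminate_equality(f, A, b, z0=None):
        """Minimize f(x) s.t. Ax = b via the reduced problem min_z f(N z + x_p)."""
        x_p = np.linalg.lstsq(A, b, rcond=None)[0]   # one particular solution
        _, s, Vt = np.linalg.svd(A)
        N = Vt[len(s):].T                            # columns span N(A)
        z0 = np.zeros(N.shape[1]) if z0 is None else z0
        res = minimize(lambda z: f(N @ z + x_p), z0) # unconstrained solve
        return N @ res.x + x_p

    # e.g. minimize ||x||^2 s.t. x1 + x2 + x3 = 3  ->  x = [1, 1, 1]
    x = eliminate_equality(lambda x: x @ x, np.ones((1, 3)), np.array([3.0]))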

4.6.2 Inequality Constrained Minimization


The general inequality constrained convex minimization problem is

minimize f (x)
subject to gi (x) ≤ 0, i = 1, . . . , m (4.105)
Ax = b

where f as well as the g_i's are convex and twice continuously differentiable. As in the case of equality constrained optimization, we will assume that A ∈ ℜ^(p×n) has rank p. Further, we will also assume that the finite primal optimal value p∗ is attained by f at some point x̂. Finally, we will assume that Slater's constraint qualification conditions (c.f. page 292) hold, so that strong duality holds and the dual optimum is attained. Linear programs (LP), quadratically constrained quadratic programs (QCQP) (all listed in table 4.4 on page 292) and geometric programs36 (GP) are some examples of convex optimization problems with inequality constraints. An example geometric program (in its convex form) is

minimize_{y ∈ ℜ^n}  log Σ_{k=1}^q e^(a_kᵀy + b_k)
subject to  log Σ_{k=1}^r e^(c_{ik}ᵀy + d_{ik}) ≤ 0,   i = 1, 2, …, p        (4.106)
            g_iᵀy + h_i = 0,   i = 1, 2, …, m

Semi-definite programs (SDPs) do not satisfy conditions such as zero duality gap, etc., but can be handled by extensions of interior-point methods to problems having generalized inequalities.

Logarithmic Barrier

One idea for solving a minimization problem with inequalities is to replace the inequalities by a so-called barrier term, which is subtracted from the objective function with a weight µ on it. The solution to (4.105) is then approximated by the solution to the following problem:

minimize   B(x, µ) = f(x) − µ Σ_{i=1}^m ln(−g_i(x))
subject to Ax = b        (4.107)

36 Although geometric programs are not convex in their natural form, they can, however, be transformed into convex optimization problems by a change of variables and a transformation of the objective and constraint functions.

The objective function B(x, µ) is called the logarithmic barrier function. This function is convex, which can be proved by invoking the composition rules described in Section 4.2.10; it is also twice continuously differentiable. The barrier term, as a function of x, approaches +∞ as a feasible interior point x approaches the boundary of the feasible region. Because we are minimizing, this property prevents the feasible iterates from crossing the boundary and becoming infeasible. We will denote the point of optimality by x̂(µ), as a function of µ.

However, the optimal solution to the original problem (a typical example being the LP discussed in Section 4.7) is typically a point on the boundary of the feasible region (we will see this in the case of linear programming in Section 4.7). To obtain such a boundary point as the solution, it is necessary to keep decreasing the parameter µ of the barrier function to 0 in the limit. As a very simple example, consider the following inequality constrained optimization problem:

minimize   x²
subject to x ≥ 1

The logarithmic barrier formulation of this problem is

minimize   x² − µ ln(x − 1)

The unconstrained minimizer of this convex logarithmic barrier function is x̂(µ) = 1/2 + (1/2)√(1 + 2µ). As µ → 0, the optimal point of the logarithmic barrier problem approaches the actual point of optimality x̂ = 1 (which, as we can see, lies on the boundary of the feasible region). The generalized claim that f(x̂(µ)) → p∗ as µ → 0 (where p∗ is the optimum of (4.105)) will be proved next.
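The behaviour x̂(µ) → 1 is easy to check directly. A small sketch (ours) that minimizes the barrier objective numerically for a decreasing sequence of µ and compares against the closed form above:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Barrier objective for: minimize x^2 subject to x >= 1.
    for mu in [1.0, 0.1, 0.01, 0.001]:
        B = lambda x: x**2 - mu * np.log(x - 1)
        x_hat = minimize_scalar(B, bounds=(1 + 1e-12, 10), method='bounded').x
        print(mu, x_hat, 0.5 + 0.5 * np.sqrt(1 + 2 * mu))  # numeric vs closed form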

Properties of the estimate f(x̂(µ))

The following are necessary and sufficient conditions for x̂(µ) to be a solution to (4.107) for a fixed µ (see the KKT conditions in (4.88)):

1. The point x̂(µ) must be strictly feasible. That is,

   Ax̂(µ) = b   and   g_i(x̂(µ)) < 0

2. There must exist an η̂ ∈ ℜ^p such that

   ∇f(x̂(µ)) + Σ_{i=1}^m (−µ / g_i(x̂(µ))) ∇g_i(x̂(µ)) + Aᵀη̂ = 0        (4.108)

Define

λ̂_i(µ) = −µ / g_i(x̂(µ))   and   η̂(µ) = η̂

We claim that the pair (λ̂(µ), η̂(µ)) is dual feasible. The following steps prove our claim.

1. Since g_i(x̂(µ)) < 0 for i = 1, 2, …, m, we have λ̂(µ) ≻ 0.

2. Based on the proof of theorem 82, we can infer that L(x, λ, η) is convex in x:

   L(x, λ, η) = f(x) + Σ_{i=1}^m λ_i g_i(x) + ηᵀ(Ax − b)

   Since the Lagrangian is convex in x and differentiable on its domain, from (4.108) we can conclude that x̂(µ) is a critical point of L(x, λ̂(µ), η̂(µ)) and therefore minimizes it.

3. That is, the dual L*(λ̂(µ), η̂(µ)) is defined, and therefore (λ̂(µ), η̂(µ)) is dual feasible:

   L*(λ̂(µ), η̂(µ)) = f(x̂(µ)) + Σ_{i=1}^m λ̂_i g_i(x̂(µ)) + η̂(µ)ᵀ(Ax̂(µ) − b) = f(x̂(µ)) − mµ        (4.109)

From the weak duality theorem 81, we know that d∗ ≤ p∗, where p∗ and d∗ are the primal and dual optima respectively for (4.105). Since L*(λ̂(µ), η̂(µ)) ≤ d∗ (by definition), we have from (4.109) that f(x̂(µ)) − mµ ≤ p∗, or equivalently,

f(x̂(µ)) − p∗ ≤ mµ        (4.110)

The inequality in (4.110) forms the basis of the barrier method; it confirms the intuitive idea that x̂(µ) converges to an optimal point as µ → 0. We discuss the barrier method next.

The Barrier Method

The barrier method is a simple extension of unconstrained minimization methods to inequality constrained minimization, based on the property in (4.110). The method solves a sequence of unconstrained (or linearly constrained) minimization problems, using the last point found as the starting point for the next minimization problem. It computes x̂(µ) for a sequence of decreasing values of µ, until mµ ≤ ǫ, which guarantees that we have an ǫ-suboptimal solution of the original problem. It was originally proposed as the sequential unconstrained minimization technique (SUMT) by Fiacco and McCormick in the 1960s. A simple version of the method is outlined in Figure 4.56.
in Figure 4.56.

Find a strictly feasible starting point x̂; set µ = µ(0) > 0 and 0 < α < 1.
Select an appropriate tolerance ǫ > 0.
repeat
    1. Centering Step: Compute x̂(µ) by minimizing B(x, µ) (optionally subject to Ax = b), starting at x.
    2. Update x = x̂(µ).
    3. If mµ ≤ ǫ, quit.
    4. Decrease µ: µ = αµ.
until

Figure 4.56: The Barrier method.

The centering step (1) can be executed using any of the descent techniques discussed in Section 4.5. It can be proved [?] that the duality gap after k iterations is mµ(0)α^k. Therefore, the desired accuracy ǫ can be achieved by the barrier method after exactly ⌈log(mµ(0)/ǫ) / (−log α)⌉ steps.
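Putting Figure 4.56 together for a problem with inequality constraints only, a minimal sketch (ours; the name barrier_method is our own, and scipy's general-purpose unconstrained minimizer stands in for the Newton centering step):

    import numpy as np
    from scipy.optimize import minimize

    def barrier_method(f, gs, x0, mu0=1.0, alpha=0.5, eps=1e-6):
        """Minimize f(x) s.t. g_i(x) <= 0 for g_i in gs, via the log barrier.
        x0 must be strictly feasible: g_i(x0) < 0 for all i."""
        x, mu, m = np.asarray(x0, float), mu0, len(gs)
        while m * mu > eps:
            def B(x):
                g = np.array([gi(x) for gi in gs])
                if np.any(g >= 0):
                    return np.inf            # outside the feasible interior
                return f(x) - mu * np.sum(np.log(-g))
            x = minimize(B, x).x             # centering step
            mu *= alpha                      # decrease mu
        return x

    # e.g. minimize (x1+1)^2 + (x2+1)^2 s.t. x >= 0  ->  the corner [0, 0]
    x = barrier_method(lambda x: (x[0]+1)**2 + (x[1]+1)**2,
                       [lambda x: -x[0], lambda x: -x[1]], [1.0, 1.0])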
 
Successive minima x̂(µ) of the barrier function B(x, µ) can be shown to have the following properties. Let µ̄ < µ for sufficiently small µ̄; then

1. B(x̂(µ̄), µ̄) < B(x̂(µ), µ)

2. f(x̂(µ̄)) ≤ f(x̂(µ))

3. −Σ_{i=1}^m ln(−g_i(x̂(µ̄))) ≥ −Σ_{i=1}^m ln(−g_i(x̂(µ)))

When a strictly feasible point x̂ is not known, the barrier method is preceded by a preliminary stage, called phase I, in which a strictly feasible point is computed (if one exists). The strictly feasible point found during phase I is then used as the starting point for the barrier method. This is discussed in greater detail in [?].

4.7 Linear Programming


Linear programming has been widely used in industry for maximizing profits, minimizing costs, and so on. The word linear implies that the cost function is linear, taking the form of an inner product.
The inputs to the program are

1. c, a cost vector of size n.



2. An m × n matrix A.

3. A vector b of size m.

The unknown is a vector x of size n, and this is what we will try to determine.
In linear programming (LP), the task is to minimize a linear objective function of the form Σ_{j=1}^n c_j x_j, subject to linear inequality constraints37 of the form Σ_{j=1}^n a_{ij} x_j ≥ b_i, i = 1, …, m, and x_j ≥ 0. The problem can be stated as in (4.111). In contrast to the LP specification on page 289, where the constraint x ≥ 0 was absorbed into the more general constraint −Ax + b ≤ 0, here we choose to specify it as a separate constraint.

min_{x ∈ ℜ^n}  xᵀc
subject to  −Ax + b ≤ 0,   x ≥ 0        (4.111)

The flip side of this problem is that it has no analytical formula as its solution. However, that does not make a big difference in practice, because there exist reliable and efficient algorithms and software for linear programming. The computational time is roughly proportional to n²m, if m ≥ n; this is basically the cost of one iteration in an interior point method.

Linear programming (LP) problems can be hard to recognize in practice and often need reformulation to get into the standard form in (4.111). Minimizing a piecewise linear function of x is not an LP as stated, though it can be rewritten and solved as an LP. Other problems involving ℓ₁ or ℓ∞ norms can also be written as linear programming problems.
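In practice one rarely writes an LP solver by hand. For instance, with scipy (a sketch using hypothetical data of our own), the LP (4.111) can be passed to linprog after flipping the inequality into the ≤ form the solver expects:

    import numpy as np
    from scipy.optimize import linprog

    c = np.array([1.0, 2.0])                 # hypothetical cost vector
    A = np.array([[1.0, 1.0],
                  [2.0, 0.5]])               # hypothetical constraint data
    b = np.array([1.0, 1.0])

    # Ax >= b is rewritten as -Ax <= -b; x >= 0 is the default bound.
    res = linprog(c, A_ub=-A, b_ub=-b, bounds=(0, None))
    print(res.x, res.fun)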
The basis for linear programming was mentioned on page 250: linear functions have no critical points and therefore, by theorem 60, the extreme values are always attained at the boundary of the feasible set. In the case of linear programs, the feasible set is itself defined by linear inequalities: {x | −Ax + b ≤ 0}. Applying the argument recursively, it can be proved that the extreme values of a linear program are attained at some corners (i.e., vertices) of the feasible set. A corner is the intersection point of n different hyperplanes, each given by a single equation; that is, a corner point is obtained by turning n of the n + m inequalities into equalities and finding their intersection38. An edge is the intersection of n − 1 inequalities and connects two corners. Geometrically, when you maximize or minimize a linear function, as you progress in one direction in the search space, the objective will either increase monotonically or decrease monotonically. Therefore, the maximum and the minimum will be found at corners of the allowed region.
37 It is a rare feature to have linear inequality constraints.
38 In general, there are (n+m)!/(n!m!) such intersections.

Figure 4.57: Example of the feasible set of a linear program for n = 3.

The feasible set takes the form of a bounded region in n dimensions. Figure 4.57 pictorially depicts a typical example of the feasible region for n = 3. The constraints Ax ≥ b and x ≥ 0 allow a tetrahedron or pyramid in the first (completely positive) octant. If the constraint were an equality, Ax = b, the feasible set would be the shaded triangle in the figure. In general, for any n, the constraints Ax ≥ b, x ≥ 0 yield a polyhedron as the feasible set. The task of maximizing (or minimizing) the linear objective function xᵀc = Σ_{i=1}^n c_i x_i translates to finding a solution at one of the corners of the feasible region. Corners are points where some of the inequality constraints are tight, or active, and others are not; at the corners, some of the inequality constraints become equalities. It is just a question of finding the right corner.
Why not just search all corners for the optimal answer? The trouble is that there are lots of corners: in n dimensions, with m constraints, the number of corners grows exponentially, and there is no way to check all of them. There is an interesting competition between two quite different approaches for solving linear programs:

1. The simplex method

2. The interior point (barrier) method



4.7.1 Simplex Method


The simplex algorithm [?] is one of the fundamental methods for linear programming, developed in the late 1940s by Dantzig. It is the best established approach for solving linear problems. In the worst case, the algorithm takes a number of steps that is exponential in n, but in practice it is the most efficient method for solving linear programs.

The simplex method first constructs an admissible solution at a corner of the polyhedron (which can be quite a bit of a job) and then moves along its edges to vertices with successively lower values of the objective function, until the optimum is reached. The movement along an edge originating at a vertex is performed by 'loosening' one of the inequalities that were tight at the vertex. The inequality chosen for 'loosening' is the one promising the fastest drop in the objective function xᵀc; the rate of decrease along an edge can be measured using the gradient of the objective. This procedure is carried out iteratively, till the method encounters a vertex that has no edge offering a promising descent direction (which means that the cost goes up along all edges incident at that vertex). Since an edge corresponding to a decreasing value of the objective cannot also correspond to an increasing value, no edge will be traversed twice in this process.
We will first rewrite the constraints Ax ≥ b in the above LP as equations, by introducing a new non-negative "slack" variable s_j for the jth constraint (for all j) and subtracting it from the left-hand side of each inequality:

Ax − s = b

or equivalently, in matrix notation,

[−A  I] [xᵀ sᵀ]ᵀ = −b

We will treat the m × (n + m) matrix M = [−A  I] as our new coefficient matrix and y = [xᵀ sᵀ]ᵀ as our new variable vector. With this, the above constraint can be rewritten as

My = −b

The feasible set is now governed by these m equality constraints and the n + m non-negativity constraints y ≥ 0. The original cost vector c is extended to a vector d by appending m more zero components. This leaves us with the following problem, equivalent to the original LP (4.111):

min_{y ∈ ℜ^(n+m)}  yᵀd
subject to  My = −b,   y ≥ 0        (4.112)

We will assume that the matrix A (and therefore M) is of full row rank, that is, of rank m. In practice, a preprocessing phase is applied to the user-supplied data to remove some redundancies from the given constraints, so as to obtain a full row rank matrix.
The following definitions and observations set the platform for the simplex algorithm, which we describe subsequently.

1. A vector y is a basic feasible point if it is feasible and if there exists a subset B of the index set {1, 2, …, n + m} such that

   (a) B contains exactly m indices.

   (b) y ≥ 0.

   (c) The constraint y_i ≥ 0 can be inactive (that is, y_i > 0) only if i ∈ B. In other words, i ∉ B ⇒ y_i = 0.

   (d) If m_i is the ith column of M, the m × m matrix B defined as B = [m_i]_{i∈B} is nonsingular.

   A set B satisfying these properties is called a basis for the problem (4.112). The corresponding matrix B is called the basis matrix. Any variable y_i for i ∈ B is called a basic variable, while any variable y_i for i ∉ B is called a free variable.

2. It can be seen that all basic feasible points of (4.112) are corners of the feasible simplex S = {x | Ax ≥ b, x ≥ 0} and vice versa. In other words, a corner of S corresponds to a point y in the new representation that has n components equal to zero.

3. Any two corners connected by an edge have exactly m − 1 common basic variables. Each corner has n incident edges (corresponding to the addition of any one of the n free variables to the basis and the corresponding drop of a basic variable).

4. Further, it can be proved that

   (a) if (4.112) has a nonempty feasible region, then there is at least one basic feasible point;

   (b) if (4.112) has solutions, then at least one such solution is a basic optimal point;

   (c) if (4.112) is feasible and bounded, then it has an optimal solution.

   This is known as the fundamental theorem of linear programming.

Using the ideas and notation presented above, the simplex algorithm can be outlined as follows.

1. Each iterate generated by the simplex algorithm is a basic feasible point of (4.112).

2. Entering free variable: The next iterate is determined by moving along an edge from one basic feasible solution to another. As discussed above, movement along an edge means that m − 1 variables remain basic while one becomes free; in turn, a previously free variable becomes basic. The real decision is which variable should be removed from the basis and which should be added. The idea in the simplex algorithm is to bring in the free variable yk that has the most negative cost component dk (something like steepest descent in the L1 norm).
3. Leaving basic variable: The basic variable that will leave the current basis is determined using a pivot rule. A commonly applied pivot rule is to determine the leaving basic variable through the constraint (say the j th one) that has the smallest non-negative ratio of its right-hand side b′j to the coefficient m′jk of the entering variable yk. If the coefficients of yk are non-positive in all the constraints, the problem is unbounded: the cost can be made −∞ by arbitrarily increasing the value of yk.
4. In order to facilitate easy identification of leaving basic variables, we bring the equations into a form in which the basic variables stand by themselves. This is done by treating the new entering variable yk as a 'pivot' in the j th equation and substituting its value in terms of the other variables of the j th equation into the other equations (as well as into the cost function yT d). In this form,

   (a) the protocol is that variables corresponding to all columns of M that are in unit form are basic variables, while the rest are free variables.
   (b) the choice of the pivot equation in the step above automatically entails the choice of the leaving variable: the basic variable yl corresponding to row j will be the next leaving variable.
5. In a large problem, it is possible for a leaving variable to reenter the basis at a later stage. Unless there is degeneracy, the cost keeps going down and it can never happen that all of the m basic variables are the same as before. Thus, no corner is revisited and the method must end at the optimal corner or conclude that the cost is unbounded below. Degeneracy is said to occur if more than the usual n components of x are 0 (in which case cycling might occur, though extremely rarely).
Since each simplex step involves decisions (choice of entering and leaving
basic variables) and row operations (pivoting etc.), it is convenient to fit the
data into a large matrix or tableau. The operations of the simplex method
outlined above can be systematically translated to operations on the tableau.
1. The starting tableau is just a bigger (m + 1) × (n + m + 1) matrix

$$T = \begin{bmatrix} M & -b \\ d^T & 0 \end{bmatrix}$$

2. Our first step is to get one basic variable alone on each row. Without loss of generality, we will renumber the variables and rearrange the corresponding columns of M so that at every iteration, y1, y2, . . . , ym are the basic variables and the rest are free (i.e., 0). The first m columns of M then form an m × m square matrix B and the last n form an m × n matrix N. The cost vector d can be split correspondingly as dT = [dTB dTN], and the variable vector as yT = [yTB yTN] with yN = 0. To operate with the tableau, we split it as

$$\begin{bmatrix} B & N & -b \\ d_B^T & d_N^T & 0 \end{bmatrix}$$

Performing Gauss-Jordan elimination on the columns corresponding to the basic variables, we get the equations into the form that will be preserved across iterations:

$$\begin{bmatrix} I & B^{-1}N & -B^{-1}b \\ d_B^T & d_N^T & 0 \end{bmatrix}$$

Further, we will ensure that all the columns corresponding to basic variables are in the unit form:

$$\begin{bmatrix} I & B^{-1}N & -B^{-1}b \\ d_B^T - d_B^T I = 0 & d_N^T - d_B^T B^{-1}N & d_B^T B^{-1}b \end{bmatrix}$$

This corresponds to a solution $y_B = -B^{-1}b$ with cost $d^T y = -d_B^T B^{-1} b$, which is the negative of the expression in the bottom right corner.

3. In the above tableau, the components of the expression $r = d_N^T - d_B^T B^{-1}N$ are the reduced costs; they capture what it costs to use the existing set of free variables. If the direct cost in $d_N$ is less than the saving due to the use of the other basic variables, it will help to try a free variable. This guides us in the choice of the entering variable: if r has any negative component, the variable corresponding to the most negative component is picked as the next entering variable, and this choice corresponds to moving from a corner of the polytope S to an adjacent corner with lower cost. Let yk be the entering variable and dk the corresponding cost.

4. As the entering component yk is increased, to maintain My = −b, the first component of yB that decreases to 0 becomes the leaving variable and transforms from a basic to a free variable. The other components of yB will have moved around but remain positive. Thus, the one that drops to zero should satisfy

$$j = \mathop{\mathrm{argmin}}_{t = 1, 2, \ldots, m:\; (B^{-1}N)_{tk} > 0} \frac{(-B^{-1}b)_t}{(B^{-1}N)_{tk}}$$

Note that the minimum is taken only over the positive components $(B^{-1}N)_{tk}$. If there are no positive components, the next corner is infinitely far away and the cost can be reduced forever, yielding a minimum cost of −∞.

5. With the new choice of basic variables, steps (2)-(4) are repeated till the
reduced cost is completely non-negative. The variables corresponding to
the unit columns in the final tableau are the basic variables at the opti-
mum.

What we have not discussed so far is how to obtain the initial basic feasible point. If x = 0 satisfies Ax ≥ b, we can form an initial basic feasible point with s comprising the basic variables and x the free variables. This is illustrated through the following example. Consider the problem

$$\begin{aligned}
\min_{x_1, x_2, x_3 \in \Re} \quad & -15x_1 - 18x_2 - 20x_3 \\
\text{subject to} \quad & -\tfrac{1}{6}x_1 - \tfrac{1}{4}x_2 - \tfrac{1}{2}x_3 \ge -60 \\
& -40x_1 - 50x_2 - 60x_3 \ge -2880 \\
& -25x_1 - 30x_2 - 40x_3 \ge -2400 \\
& x_1 \ge 0, \quad x_2 \ge 0, \quad x_3 \ge 0
\end{aligned}$$

The initial tableau is

$$\begin{bmatrix}
\tfrac{1}{6} & \tfrac{1}{4} & \tfrac{1}{2} & 1 & 0 & 0 & 60 \\
40 & 50 & 60 & 0 & 1 & 0 & 2880 \\
25 & 30 & 40 & 0 & 0 & 1 & 2400 \\
-15 & -18 & -20 & 0 & 0 & 0 & 0
\end{bmatrix}$$

The most negative component of the reduced cost vector is for k = 3. The pivot row number is

$$j = \mathop{\mathrm{argmin}}_{t = 1, 2, \ldots, m:\; (B^{-1}N)_{tk} > 0} \frac{(-B^{-1}b)_t}{(B^{-1}N)_{tk}} = 2$$

Thus, the leaving basic variable is s2 (the basic variable corresponding to the second row), while the entering free variable is x3. Performing Gauss elimination to bring column k = 3 into unit form, we get

$$\begin{bmatrix}
-\tfrac{1}{6} & -\tfrac{1}{6} & 0 & 1 & -\tfrac{1}{120} & 0 & 36 \\
\tfrac{2}{3} & \tfrac{5}{6} & 1 & 0 & \tfrac{1}{60} & 0 & 48 \\
-\tfrac{5}{3} & -\tfrac{10}{3} & 0 & 0 & -\tfrac{2}{3} & 1 & 480 \\
-\tfrac{5}{3} & -\tfrac{4}{3} & 0 & 0 & \tfrac{1}{3} & 0 & 960
\end{bmatrix}$$

This tableau corresponds to the solution x1 = 0, x2 = 0, x3 = 48, s1 = 36, s2 = 0, s3 = 480 and cost cTx = −960 (the negative of the number in the bottom right corner). Since the reduced cost vector still has some negative components, it is possible to find a basic feasible solution with lower cost. Using the most negative component of the reduced cost vector, we select the next pivot element to be $m_{21} = \tfrac{2}{3}$. Again performing Gaussian elimination, we obtain the tableau corresponding to the next iterate:

$$\begin{bmatrix}
0 & \tfrac{1}{24} & \tfrac{1}{4} & 1 & -\tfrac{1}{240} & 0 & 48 \\
1 & \tfrac{5}{4} & \tfrac{3}{2} & 0 & \tfrac{1}{40} & 0 & 72 \\
0 & -\tfrac{5}{4} & \tfrac{5}{2} & 0 & -\tfrac{5}{8} & 1 & 600 \\
0 & \tfrac{3}{4} & \tfrac{5}{2} & 0 & \tfrac{3}{8} & 0 & 1080
\end{bmatrix}$$

Note that the optimal solution has been found, since the reduced cost vector is non-negative. The optimal solution is x1 = 72, x2 = 0, x3 = 0, s1 = 48, s2 = 0, s3 = 600 and cost cTx = −1080.
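The tableau operations above are mechanical enough to sketch in a few lines of code. The following NumPy fragment is a minimal illustration, not production code: it assumes the supplied tableau already encodes a basic feasible point (as in the example just solved), uses the most-negative-cost rule for the entering variable and the smallest-ratio rule for the leaving variable, and makes no provision for degeneracy.

```python
import numpy as np

def simplex_tableau(T, max_iters=100):
    """Iterate on a tableau [[M, -b], [d^T, 0]] that encodes a basic
    feasible point, until the reduced cost vector is non-negative."""
    T = np.asarray(T, dtype=float).copy()
    m = T.shape[0] - 1                        # number of constraint rows
    for _ in range(max_iters):
        cost = T[-1, :-1]
        k = int(np.argmin(cost))              # entering variable: most negative cost
        if cost[k] >= 0:
            return T                          # optimal
        col, rhs = T[:m, k], T[:m, -1]
        if np.all(col <= 0):
            raise ValueError("unbounded: the cost can be reduced forever")
        ratios = np.where(col > 0, rhs / np.where(col > 0, col, 1.0), np.inf)
        j = int(np.argmin(ratios))            # leaving variable: smallest ratio
        T[j] /= T[j, k]                       # make the pivot element 1
        for i in range(m + 1):                # eliminate the pivot column elsewhere
            if i != j:
                T[i] -= T[i, k] * T[j]
    raise RuntimeError("iteration limit reached")

# The worked example above; the optimal cost appears negated in the
# bottom-right corner of the final tableau.
T = [[1/6, 1/4, 1/2, 1, 0, 0, 60],
     [40, 50, 60, 0, 1, 0, 2880],
     [25, 30, 40, 0, 0, 1, 2400],
     [-15, -18, -20, 0, 0, 0, 0]]
print(simplex_tableau(T)[-1, -1])             # 1080.0, i.e., optimal cost -1080
```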
What if x = 0 does not satisfy Ax ≥ b? Then the choice of s as the basic variables and x as the free variables will not be valid. As an example, consider the problem

$$\begin{aligned}
\min_{x_1, x_2, x_3 \in \Re} \quad & 30x_1 + 60x_2 + 70x_3 \\
\text{subject to} \quad & x_1 + 3x_2 + 4x_3 \ge 14 \\
& 2x_1 + 2x_2 + 3x_3 \ge 16 \\
& x_1 + 3x_2 + 2x_3 \ge 12 \\
& x_1 \ge 0, \quad x_2 \ge 0, \quad x_3 \ge 0
\end{aligned}$$
The initial tableau is

$$\begin{bmatrix}
-1 & -3 & -4 & 1 & 0 & 0 & -14 \\
-2 & -2 & -3 & 0 & 1 & 0 & -16 \\
-1 & -3 & -2 & 0 & 0 & 1 & -12 \\
30 & 60 & 70 & 0 & 0 & 0 & 0
\end{bmatrix}$$

With the choice of basic and free variables as above, we are not even in the feasible region to start with. In general, if there is any negative number in the last column of the tableau, x = 0 is not in the feasible region. Further, we have no negative numbers in the bottom row, which leaves us with no choice of a cost-reducing free variable. But this is not of primary concern, since we first need to maneuver our way into the feasible region. We do this by moving from one basic point (that is, a point having at least n zero components) to another till we land in the feasible region, which is indicated by all non-negative components in the extreme right-hand column. This movement from one basic point to another is driven not by negative components in the cost vector, but rather by the negative components in the right-hand column. The new rules for moving from one basic point to another are:

1. Pick39 any negative number in the far right column (excluding the last
row). Let this be in the q th row for q < m + 1.
39 Note that there is no priority here.

2. In the q th row, move40 to the left to a column number k where there is another negative number. The variable yk will be the next entering variable.
3. Choose the pivot element mjk that gives the smallest positive ratio of the last-column entry in the j th row to mjk. The leaving variable will be yj.
4. Once the pivot element is chosen, proceed as usual to convert the pivot
element to 1 and the other elements in the pivot column to 0.
5. Repeat steps (1)-(4) on the modified tableau until there is no negative
element in the right-most column.
Applying this procedure to the tableau above, we pick m21 = −2 as our first pivot element and do row elimination to get the first column in unit form:

$$\begin{bmatrix}
0 & -2 & -\tfrac{5}{2} & 1 & -\tfrac{1}{2} & 0 & -6 \\
1 & 1 & \tfrac{3}{2} & 0 & -\tfrac{1}{2} & 0 & 8 \\
0 & -2 & -\tfrac{1}{2} & 0 & -\tfrac{1}{2} & 1 & -4 \\
0 & 30 & 25 & 0 & 15 & 0 & -240
\end{bmatrix}$$
We pick $m_{35} = -\tfrac{1}{2}$ as our next pivot element and perform similar row elimination operations to obtain

$$\begin{bmatrix}
0 & 0 & -2 & 1 & 0 & -1 & -2 \\
1 & 3 & 2 & 0 & 0 & -1 & 12 \\
0 & 4 & 1 & 0 & 1 & -2 & 8 \\
0 & -30 & 10 & 0 & 0 & 30 & -360
\end{bmatrix}$$
We have still not obtained a feasible basic point. We choose m16 = −1 as the next pivot and do row eliminations to get the next tableau:

$$\begin{bmatrix}
0 & 0 & 2 & -1 & 0 & 1 & 2 \\
1 & 3 & 4 & -1 & 0 & 0 & 14 \\
0 & 4 & 5 & -2 & 1 & 0 & 12 \\
0 & -30 & -50 & 30 & 0 & 0 & -420
\end{bmatrix}$$
This tableau has no negative numbers in the last column and gives a basic feasible point x1 = 14, x2 = 0, x3 = 0. Once we obtain the basic feasible point, we revert to the standard simplex procedure discussed earlier. The most negative component of the reduced cost vector is −50, and this leads to the pivot element m13 = 2. Row elimination yields

$$\begin{bmatrix}
0 & 0 & 1 & -\tfrac{1}{2} & 0 & \tfrac{1}{2} & 1 \\
1 & 3 & 0 & 1 & 0 & -2 & 10 \\
0 & 4 & 0 & \tfrac{1}{2} & 1 & -\tfrac{5}{2} & 7 \\
0 & -30 & 0 & 5 & 0 & 25 & -370
\end{bmatrix}$$
40 Note that there is no priority here either.

Our next pivot element is m32 = 4. Row elimination yields

$$\begin{bmatrix}
0 & 0 & 1 & -\tfrac{1}{2} & 0 & \tfrac{1}{2} & 1 \\
1 & 0 & 0 & \tfrac{5}{8} & -\tfrac{3}{4} & -\tfrac{1}{8} & \tfrac{19}{4} \\
0 & 1 & 0 & \tfrac{1}{8} & \tfrac{1}{4} & -\tfrac{5}{8} & \tfrac{7}{4} \\
0 & 0 & 0 & \tfrac{35}{4} & \tfrac{15}{2} & \tfrac{25}{4} & -\tfrac{635}{2}
\end{bmatrix}$$

We are done! The reduced cost vector has no more negative components. The
optimal basic feasible point is x1 = 4.75, x2 = 1.75, x3 = 1 and the optimal cost
is 317.5.

Revised Simplex Method


The simplex method illustrated above serves two purposes:
1. Doing all the eliminations completely makes the idea clear.
2. It is easier to follow the process when working out the solution by hand.
For computational purposes, however, it is now uncommon to use the method as described earlier. This is because, once r is computed, none of the columns of B−1N (except the one corresponding to the entering variable) is used. Therefore, computing them is a wasted effort, and doing the eliminations completely at each step cannot be justified practically. Instead, the more efficient version of the simplex method outlined below is used by software packages. It is called the revised simplex method and is essentially the simplex method itself, boiled down.

Compute the reduced costs $r = d_N^T - d_B^T B^{-1}N$.
while $r \not\ge 0$ do
  1. Let $r_k$ be the most negative component of r.
  2. Compute $v = B^{-1}n_k$, where $n_k$ is the k th column of N.
  3. Let $j = \mathop{\mathrm{argmin}}_{t = 1, 2, \ldots, m:\; v_t > 0} \dfrac{(-B^{-1}b)_t}{v_t}$.
  4. Update B (or $B^{-1}$) and $y_B = -B^{-1}b$ to reflect the j th leaving column and the k th entering variable.
  Compute the new reduced costs $r = d_N^T - d_B^T B^{-1}N$.
end while

Figure 4.58: The revised simplex method.
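A minimal sketch of one such iteration in NumPy follows, assuming index lists `basis` and `nonbasis` that partition the columns of M and a current basic feasible point with $y_B = -B^{-1}b \ge 0$. The explicit matrix inverse is for clarity only; a serious implementation would maintain and update a factorization of B instead.

```python
import numpy as np

def revised_simplex_step(M, b, d, basis, nonbasis):
    """One iteration of the revised simplex method of Figure 4.58;
    returns None once the current basis is optimal."""
    B, N = M[:, basis], M[:, nonbasis]
    Binv = np.linalg.inv(B)
    r = d[nonbasis] - d[basis] @ Binv @ N       # reduced costs
    if np.all(r >= 0):
        return None                             # optimal basis
    k = int(np.argmin(r))                       # entering variable
    v = Binv @ M[:, nonbasis[k]]                # only the entering column is needed
    if np.all(v <= 0):
        raise ValueError("unbounded")
    yB = -Binv @ b                              # current basic variable values
    ratios = np.where(v > 0, yB / np.where(v > 0, v, 1.0), np.inf)
    j = int(np.argmin(ratios))                  # leaving variable
    basis[j], nonbasis[k] = nonbasis[k], basis[j]
    return basis, nonbasis
```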

4.7.2 Interior point barrier method


Researchers have dreamt up pathological examples for which the simplex method takes an exponential amount of time; in practice, however, it is one of the most efficient methods for the majority of LPs. Still, the application of interior point methods to linear programming has produced a serious competitor to the simplex method. In contrast to the simplex algorithm, which finds the optimal solution by progressing along points on the boundary of a polyhedral set, interior point methods traverse to the optimum through the interior of the feasible region (a polyhedral set in the case of LPs). The first in this league was Karmarkar's iterative method [?], developed by Narendra Karmarkar in 1984, who also proved that the algorithm runs in polynomial time. This line of research was inspired by the ellipsoid method for linear programming, outlined by Leonid Khachiyan in 1979; the ellipsoid algorithm itself was introduced by Naum Z. Shor et al. in 1972 and was used by Khachiyan [?] to prove the polynomial-time solvability of linear programs, making it the first LP algorithm known to have a polynomial running time.
This competitor to the simplex method takes a Newton-like approach through the interior of the feasible region: Newton steps are taken until the 'barrier' is encountered. It stirred up the world of optimization and inspired the whole class of barrier methods. Following this, a lot of interest was generated in the application of these erstwhile interior point methods to general non-linear constrained optimization problems. Karmarkar's algorithm has since been replaced by an improved logarithmic barrier method that makes use of the primal as well as the dual for solving an LP; Shanno and Bagchi [?] showed that Karmarkar's method is just a special case of the logarithmic barrier function method. We will restrict our discussion to the primal-dual barrier method [?].
The dual of the linear program (4.111) can be derived in a manner similar to the dual on page 289:

$$\begin{aligned}
\max_{\lambda \in \Re^m} \quad & \lambda^T b \\
\text{subject to} \quad & A^T\lambda \le c, \quad \lambda \ge 0
\end{aligned}$$

The weak duality theorem (theorem 81) states that the objective function value
of the dual at any feasible solution is always less than or equal to the objective
function value of the primal at any feasible solution. That is, for any primal
feasible x and any dual feasible λ,

cT x − bT λ ≥ 0

For this specific case, the weak duality is easy to see: bT λ ≤ xT AT λ ≤ xT c.


Further, it can be proved using Farkas' lemma that if the primal has an optimal solution x∗ (which is assumed to be bounded), then the dual also has an optimal solution41 λ∗ such that cT x∗ = bT λ∗.
The following steps will set the platform for the interior point method.
41 For an LP and its dual (D), there are only four possibilities:
1. (LP) is bounded and feasible and (D) is bounded and feasible.
2. (LP) is infeasible and (D) is unbounded and feasible.
3. (LP) is unbounded and feasible and (D) is infeasible.
4. (LP) is infeasible and (D) is infeasible.
This can be proved using Farkas' Lemma.

1. As in Section 4.7.1, we will rewrite the constraints Ax ≥ b in the above LP as equations, by introducing a new non-negative "slack" variable sj for the j th constraint (for all j) and subtracting it from the left-hand side of each inequality. This gives us the constraint My = −b, where M = [−A I] and y = [x s]T. As before, the original cost vector c is extended to a vector d by appending m more zero components. This gives us the following problem, equivalent to the original LP (4.111):

$$\min_{y \in \Re^{n+m}} \; y^Td \quad \text{subject to} \quad My = -b, \;\; y \ge 0 \tag{4.113}$$

Its dual problem is given by

$$\max_{\lambda \in \Re^m} \; -\lambda^Tb \quad \text{subject to} \quad M^T\lambda \le d \tag{4.114}$$

2. Next, we set up the barrier method formulation of the dual of the linear program. Let µ > 0 be a given fixed parameter, which is decreased during the course of the algorithm. We also insert slack variables ξ = [ξ1, ξ2, . . . , ξn+m]T ≥ 0. The barrier method formulation of the dual is then given by

$$\max_{\lambda \in \Re^m} \; -\lambda^Tb + \mu\sum_{i=1}^{n+m}\ln(\xi_i) \quad \text{subject to} \quad M^T\lambda + \xi = d \tag{4.115}$$

The conditions ξi ≥ 0 are no longer needed, since ln(ξi) → −∞ as ξi → 0+ and the objective is being maximized. This latter property means that ln(ξi) serves as a barrier, discouraging ξi from going to 0.
3. To write the first-order necessary conditions for an optimum, we set the partial derivatives of the Lagrangian

$$L(y, \lambda, \xi) = -\lambda^Tb + \mu\sum_{i=1}^{n+m}\ln(\xi_i) - y^T\left(M^T\lambda + \xi - d\right)$$


with respect to y, λ, and ξ to zero. This results in the following set of three equations:

$$\begin{aligned}
M^T\lambda + \xi &= d \\
My &= -b \\
\mathrm{diag}(\xi)\,\mathrm{diag}(y)\,\mathbf{1} &= \mu\mathbf{1}
\end{aligned} \tag{4.116}$$

which include the dual and primal feasibility conditions, excluding y ≥ 0 and ξ ≥ 0.
4. We will assume that the current point $y^{(k)}$ is primal feasible and the current point $(\lambda^{(k)}, \xi^{(k)})$ is dual feasible. We determine a new search direction $(\Delta y^{(k)}, \Delta\lambda^{(k)}, \Delta\xi^{(k)})$ so that the new point $(y^{(k)} + \Delta y^{(k)}, \lambda^{(k)} + \Delta\lambda^{(k)}, \xi^{(k)} + \Delta\xi^{(k)})$ satisfies (4.116). This gives us the so-called Newton equations:

$$\begin{aligned}
M^T\Delta\lambda + \Delta\xi &= 0 \\
M\Delta y &= 0 \\
(y_i + \Delta y_i)(\xi_i + \Delta\xi_i) &= \mu, \quad i = 1, 2, \ldots, n+m
\end{aligned} \tag{4.117}$$

Ignoring the second order term $\Delta y_i\Delta\xi_i$ in the third equation and solving the system of equations in (4.117), we get the following update rules:

$$\begin{aligned}
\Delta\lambda^{(k)} &= -\left(M\,\mathrm{diag}(y^{(k)})\,\mathrm{diag}(\xi^{(k)})^{-1}M^T\right)^{-1} M\,\mathrm{diag}(\xi^{(k)})^{-1}\left(\mu\mathbf{1} - \mathrm{diag}(y^{(k)})\,\mathrm{diag}(\xi^{(k)})\mathbf{1}\right) \\
\Delta\xi^{(k)} &= -M^T\Delta\lambda^{(k)} \\
\Delta y^{(k)} &= \mathrm{diag}(\xi^{(k)})^{-1}\left(\mu\mathbf{1} - \mathrm{diag}(y^{(k)})\,\mathrm{diag}(\xi^{(k)})\mathbf{1} - \mathrm{diag}(y^{(k)})\,\Delta\xi^{(k)}\right)
\end{aligned} \tag{4.118}$$

An affine variant of the algorithm can be developed by setting µ = 0 in


the equations (4.118).
5. $\Delta y^{(k)}$ and $(\Delta\lambda^{(k)}, \Delta\xi^{(k)})$ correspond to partially constrained Newton steps, which might not honour the constraints $y^{(k)} + \Delta y^{(k)} \ge 0$ and $\xi^{(k)} + \Delta\xi^{(k)} \ge 0$. Since we have a separate search direction $\Delta y^{(k)}$ in the primal space and a separate search direction $(\Delta\lambda^{(k)}, \Delta\xi^{(k)})$ in the dual space, we can compute the maximum step length $t^{(k)}_{\max,P}$ that maintains the primal inequality $y^{(k)} + t^{(k)}_{\max,P}\Delta y^{(k)} \ge 0$ and the maximum step length $t^{(k)}_{\max,D}$ that maintains the dual inequality $\xi^{(k)} + t^{(k)}_{\max,D}\Delta\xi^{(k)} \ge 0$.

6. We now have a feasible primal solution $y^{(k+1)}$ and a feasible dual solution $(\lambda^{(k+1)}, \xi^{(k+1)})$ given by

$$\begin{aligned}
y^{(k+1)} &= y^{(k)} + t^{(k)}_{\max,P}\,\Delta y^{(k)} \\
\lambda^{(k+1)} &= \lambda^{(k)} + t^{(k)}_{\max,D}\,\Delta\lambda^{(k)} \\
\xi^{(k+1)} &= \xi^{(k)} + t^{(k)}_{\max,D}\,\Delta\xi^{(k)}
\end{aligned} \tag{4.119}$$

The same step length must be used for λ and ξ so that the dual feasibility condition $M^T\lambda + \xi = d$ is preserved.

7. For user-specified small thresholds ǫ1 > 0 and ǫ2 > 0: if the duality gap $d^Ty^{(k+1)} + b^T\lambda^{(k+1)}$ is not yet sufficiently close to 0, i.e.,

$$d^Ty^{(k+1)} + b^T\lambda^{(k+1)} > \epsilon_1$$

and µ is not yet sufficiently close to 0, i.e.,

$$\mu > \epsilon_2$$

we decrease µ by a user-specified factor ρ < 1 (such as ρ = 0.1):

$$\mu = \mu \times \rho$$

8. Set k = k + 1. If µ was not modified in step (7), the duality gap is sufficiently small and the termination condition has been reached, so EXIT. Otherwise, the last condition in (4.116) no longer holds with the modified value of µ, and steps (4)-(7) are re-executed.
In practice, the interior point method for LP brings the duality gap down to within 10−8 in just 20-80 iterations (which is still slower than the simplex method for many problems), independent of the problem size as specified through the values of m and n.
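The Newton direction (4.118) and the step-length rule of step (5) are easy to express in NumPy. The sketch below assumes strictly positive current iterates y and ξ; the damping factor in `max_step` is a common implementation convention, not part of the derivation above, and keeps the iterates strictly positive.

```python
import numpy as np

def newton_direction(M, y, xi, mu):
    """The search directions of (4.118); y and xi are strictly positive arrays."""
    rhs = mu * np.ones_like(y) - y * xi       # mu 1 - diag(y) diag(xi) 1
    K = (M * (y / xi)) @ M.T                  # M diag(y) diag(xi)^{-1} M^T
    dlam = -np.linalg.solve(K, M @ (rhs / xi))
    dxi = -M.T @ dlam
    dy = (rhs - y * dxi) / xi
    return dy, dlam, dxi

def max_step(z, dz, damping=0.995):
    """Largest t in (0, 1] keeping z + t*dz >= 0 (damped to stay interior)."""
    neg = dz < 0
    if not np.any(neg):
        return 1.0
    return min(1.0, damping * float(np.min(-z[neg] / dz[neg])))
```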

4.8 Least Squares


Least squares was motivated in Section 3.9.2, based on the idea of projection. Least squares problems appear very frequently in practice. The objective for minimization in the case of least squares is the square of the Euclidean norm of Ax − b, where A is an m × n matrix, x is a vector of n variables and b is a vector of m knowns.

$$\min_{x \in \Re^n} \|Ax - b\|_2^2 \tag{4.120}$$

Very often, one has a system of linear constraints on problem (4.120):

$$\min_{x \in \Re^n} \|Ax - b\|_2^2 \quad \text{subject to} \quad C^Tx = 0 \tag{4.121}$$
4.8. LEAST SQUARES 345

This problem is called the least squares problem with linear constraints. In practice, incorporating the constraints CTx = 0 properly makes quite a difference. In many regularization problems, the least squares problem comes with quadratic constraints in the following form:

$$\min_{x \in \Re^n} \|Ax - b\|_2^2 \quad \text{subject to} \quad \|x\|_2^2 = \alpha^2 \tag{4.122}$$

This problem is termed the least squares problem with quadratic constraints. The classical statistical model assumes that all the error occurs in the vector b. But sometimes the data matrix A is itself not known very well, owing to errors in the variables. This is the model we have in the simplest version of the total least squares problem, which is stated as follows:

$$\min_{x \in \Re^n,\; E \in \Re^{m \times n},\; r \in \Re^m} \|E\|_F^2 + \|r\|_2^2 \quad \text{subject to} \quad (A + E)x = b + r \tag{4.123}$$

While there is always a solution to the least squares problem (4.120), there is not always a solution to the total least squares problem (4.123). Finally, one can have a combination of linear and quadratic constraints in a least squares problem, yielding a least squares problem with linear and quadratic constraints.
We will briefly discuss the problem of solving linear least squares problems and total least squares problems with a linear or a quadratic constraint (due to regularization). The importance of Lagrange multipliers will be introduced in the process. We will discuss stable numerical methods for when the data matrix A is singular or nearly singular, and we will also present iterative methods for large and sparse data matrices. There are many applications of least squares problems, including statistical methods, image processing, data interpolation, surface fitting, and geometrical problems.

4.8.1 Linear Least Squares


As a user of least squares in practice, one of the most important things to know is that when A is of full column rank, the problem has an analytical solution x∗ (which was derived in Section 3.9.2 and has a dual interpretation).

Analytical solution: x∗ = (AT A)−1 AT b (4.124)

This analytic solution can also be obtained by observing that

1. $\|y\|_2^2$ is a convex function for y ∈ ℜm.

2. The square of the convex Euclidean norm function, applied to an affine transform, is also convex. Thus $\|Ax - b\|_2^2$ is convex.

3. Every critical point of a convex function defined on an open domain corresponds to a local minimum. The critical point x∗ of $\|Ax - b\|_2^2$ should satisfy

$$\nabla(Ax - b)^T(Ax - b) = 2A^TAx^* - 2A^Tb = 0$$

Thus,

$$x^* = (A^TA)^{-1}A^Tb$$

corresponds to a point of local minimum of (4.120) if ATA is invertible.

This is the classical way statisticians solve the least squares problem. It can be computed very efficiently, and many software packages implement this solution. The computation time is linear in the number of rows of A and quadratic in the number of columns. For extremely large A, it can become important to exploit the structure of A to solve the problem efficiently, but for most problems this approach is efficient. In practice, least squares is very easy to recognize as an objective function, and there are a few standard tricks to increase flexibility. For example, constraints can be handled to a certain extent by adding weights. When the matrix A is not of full column rank, the solution to (4.120) may not be unique.
We should note that while we get a closed form solution to the problem of minimizing the square of the Euclidean norm, this is not so for most other norms, such as the infinity norm. However, there exist iterative methods for solving least squares with the infinity norm that yield a solution in about as much time as is taken to compute the solution using the analytical formula (4.124). Therefore, having a closed form solution is not always computationally advantageous. In general, the method of solution to a least squares problem depends on the sparsity and size of A and the degree of accuracy desired.
In practice, however, it is not recommended to solve the least squares problem using the classical equation (4.124), since the method is numerically unstable. Numerical linear algebra instead recommends the QR decomposition to accurately solve the least squares problem. This method is slower, but more numerically stable than the classical method. In Theorem 86, we state a result that compares the analytical solution (4.124) with the QR approach to the least squares problem.
Let A be an m × n matrix of either full row or full column rank. For the case n > m, we saw earlier (summarised in Figure 3.3) that the system Ax = b will have at least one solution, which means that the minimum value of the objective function will be 0, corresponding to that solution. We are interested in the case m ≥ n, for which there will be either no solution or a single solution to Ax = b, and we seek the x that minimizes $\|Ax - b\|_2^2$.

1. We first decompose A into the product of an orthogonal m × m matrix Q with an upper triangular m × n matrix R, using the Gram-Schmidt orthonormalization process42 discussed in Section 3.9.4. The decomposition can also be performed using the Householder43 transformation or Givens rotations. The Householder transformation has the added advantage that new rows or columns can be introduced without requiring a complete redo of the decomposition process. The last m − n rows of R will be zero rows. Since Q−1 = QT, the QR decomposition yields the system

$$Q^TA = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$$

2. Applying the same orthogonal matrix to b, we get

$$Q^Tb = \begin{bmatrix} c \\ d \end{bmatrix}$$

where d ∈ ℜm−n.

3. The solution to the least squares problem is found by solving R1 x = c.


The solution to this can be found by simple back-substitution.
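As a small numerical illustration of this route (a sketch only, using NumPy's reduced QR so that the zero block of R never appears explicitly):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))             # full column rank with probability 1
b = rng.standard_normal(100)

Q, R1 = np.linalg.qr(A)                       # reduced QR: Q is m x n, R1 is n x n
x_qr = solve_triangular(R1, Q.T @ b)          # back-substitution on R1 x = c

x_ne = np.linalg.solve(A.T @ A, A.T @ b)      # the normal equations (4.124)
print(np.allclose(x_qr, x_ne))                # True on well-conditioned data
```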

The next theorem examines how the least squares solution and its residual
||Ax − b|| are affected by changes in A and b. Before stating the theorem, we
will introduce the concept of the condition number.

Condition Number
The condition number associated with a problem is a measure of how numerically well-posed the problem is. A problem with a low condition number is said to be well-conditioned, while a problem with a high condition number is said to be ill-conditioned. For a linear system Ax = b, the condition number is defined as the maximum ratio of the relative error in x (measured using any particular norm) to the relative error in b. It can be proved (using the Cauchy-Schwarz inequality) that the condition number equals $\|A^{-1}\|\,\|A\|$ and is independent of b. It is denoted by κ(A) and is also called the condition number of the matrix A:

$$\kappa(A) = \|A^{-1}\|\,\|A\|$$

If ||·||2 is the L2 norm, then

$$\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \|A\|_2\,\|(A^TA)^{-1}A^T\|_2$$
42 The classical Gram-Schmidt method is often numerically unstable. Golub [?] suggests a

modified Gram-Schmidt method that is numerically stable.


43 Householder was a numerical analyst. However, the first mention of the Householder

transformation dates back to the 1930s in a book by Aikins, a statistician and a numerical
analyst.

where σmax(A) and σmin(A) are the maximal and minimal singular values of A respectively. For a real matrix A, the square roots of the eigenvalues of ATA are called its singular values. Further,

$$\kappa(A)^2 = \|A\|_2^2\,\|(A^TA)^{-1}\|_2$$
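These identities are easy to check numerically. A quick NumPy sanity check (not part of the original argument) for a random full-column-rank A:

```python
import numpy as np

A = np.random.default_rng(1).standard_normal((50, 4))
s = np.linalg.svd(A, compute_uv=False)        # singular values, descending
kappa = s[0] / s[-1]
pinv_norm = np.linalg.norm(np.linalg.inv(A.T @ A) @ A.T, 2)
print(np.isclose(kappa, np.linalg.norm(A, 2) * pinv_norm))   # True
```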

Theorem 86 By ||·|| we will refer to the L2 norm. Let

$$x^* = \mathop{\mathrm{argmin}}_{x} \|Ax - b\| \qquad \text{and} \qquad \hat{x} = \mathop{\mathrm{argmin}}_{x} \|(A + \delta A)x - (b + \delta b)\|$$

where A and δA are in ℜm×n with m ≥ n, and b and δb are in ℜm with b ≠ 0. Let us set

$$r^* = b - Ax^*, \qquad \hat{r} = b - A\hat{x}$$

and

$$\rho^* = \|Ax^* - b\|$$

If

$$\epsilon = \max\left\{\frac{\|\delta A\|}{\|A\|},\, \frac{\|\delta b\|}{\|b\|}\right\} < \frac{\sigma_n(A)}{\sigma_1(A)}$$

and

$$\sin\theta = \frac{\rho^*}{\|b\|} \ne 1$$

then

$$\frac{\|\hat{x} - x^*\|}{\|x^*\|} \le \epsilon\left\{\frac{2\kappa(A)}{\cos\theta} + \tan\theta\,\kappa(A)^2\right\} + O(\epsilon^2)$$
In this inequality, the most critical term for our discussion is κ(A)2; this is the term that can kill the analytical solution to least squares. No matter how accurate an algorithm you use, you still have κ(A)2, provided tan θ is non-zero. The tan θ term does not appear if you are solving a linear system, but it does appear if you solve a least squares problem, bringing κ(A)2 along with it. Thus, solving a least squares problem is inherently more difficult and sensitive than solving linear equations. The perturbation theory for the residual vector depends just on the condition number κ(A) (and not on its square):

r − r∗ ||
||b
≤ ǫ {1 + 2κ(A)} min {1, m − n} + O(ǫ2 ) + O(ǫ2 )
||b||
However, having a small residual does not necessarily imply that you will have
a good approximate solution.

The theorem implies that the sensitivity of the analytical solution x∗ for non-zero-residual problems is measured by the square of the condition number, whereas the sensitivity of the residual depends just linearly on κ(A). We note that the QR method actually solves a nearby least squares problem.

Linear Least Squares for Singular Systems


To solve the linear least squares problem (4.120) for a matrix A that is of rank r < min{m, n}, we can compute the pseudo-inverse (c.f. page 211) A+ and obtain the least squares solution44 as

$$\hat{x} = A^+b$$

A+ can be computed by first computing a singular orthogonal factorization

$$A = Q\begin{bmatrix} R & 0 \\ 0 & 0 \end{bmatrix}Z^T$$

where $Q^TQ = I_{m \times m}$, $Z^TZ = I_{n \times n}$ and R is an r × r upper triangular matrix. A+ can then be computed in a straightforward manner as

$$A^+ = Z\begin{bmatrix} R^{-1} & 0 \\ 0 & 0 \end{bmatrix}Q^T$$

The above least squares solution can be justified as follows. Let

$$Q^Tb = \begin{bmatrix} c \\ d \end{bmatrix} \qquad \text{and} \qquad Z^Tx = \begin{bmatrix} w \\ y \end{bmatrix}$$

Then

$$\|Ax - b\|^2 = \|Q^TAZZ^Tx - Q^Tb\|^2 = \|Rw - c\|^2 + \|d\|^2$$

The least squares solution is therefore given by

$$\hat{x} = Z\begin{bmatrix} R^{-1}c \\ 0 \end{bmatrix}$$

One particular decomposition that can be used is the singular value decomposition (c.f. Section 3.13) of A, with QT ≡ UT, Z ≡ V and UTAV = Σ. The pseudo-inverse A+ then has the expression

$$A^+ = V\Sigma^+U^T$$

where Σ+ is obtained by inverting the non-zero entries of Σ. It can be shown that this A+ is the unique minimal Frobenius norm solution to the following problem:

$$A^+ = \mathop{\mathrm{argmin}}_{X \in \Re^{n \times m}} \|AX - I_{m \times m}\|_F$$
44 Note that this solution not only minimizes ||Ax − b|| but also minimizes ||x||. This may

or may not be desirable.



This also shows that the singular value decomposition can itself be looked upon as an optimization problem.
A greater problem is with systems that are nearly singular. Numerically and computationally, it seldom happens that the rank of a matrix is exactly r. A classical example is the following n × n matrix K, which has a determinant of 1:

$$K = \begin{bmatrix}
1 & -1 & -1 & \cdots & -1 \\
0 & 1 & -1 & \cdots & -1 \\
0 & 0 & 1 & \cdots & -1 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}$$

The eigenvalues of this matrix are all equal to 1, and its rank is n. However, a very small perturbation can reduce its rank to n − 1: subtracting just $2^{-(n-2)}$ from the zero entry in position (n, 1) makes K exactly singular! Such catastrophic problems occur very often when you do large computations. The solution using SVD is applicable to nearly singular systems as well.
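A minimal sketch of this SVD-based route in NumPy, with an explicit truncation tolerance playing the role of the numerical rank decision (numpy.linalg.pinv does essentially the same internally):

```python
import numpy as np

def lstsq_svd(A, b, tol=1e-10):
    """Minimum-norm least squares via a truncated pseudo-inverse,
    x = V Sigma^+ U^T b."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s > tol * s[0]                     # treat tiny singular values as zero
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ b))

A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-13]])            # nearly singular
b = np.array([2.0, 2.0])
print(lstsq_svd(A, b))                        # approximately [1., 1.]
```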

4.8.2 Least Squares with Linear Constraints


We first reproduce the least squares problem with linear constraints that was stated earlier in (4.121):

$$\min_{x \in \Re^n} \|Ax - b\|_2^2 \quad \text{subject to} \quad C^Tx = 0$$

Let C ∈ ℜn×p, A ∈ ℜm×n and b ∈ ℜm. We note that $\|Ax - b\|_2^2$ is a convex function (since the L2 norm is convex and this function is the L2 norm applied to an affine transform). We can thus solve this constrained problem by invoking the necessary and sufficient KKT conditions discussed in Section 4.4.4. The conditions can be worked out to yield

$$\begin{bmatrix} A^TA & C \\ C^T & 0 \end{bmatrix}\begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} A^Tb \\ 0 \end{bmatrix}$$

We now need to solve not only for the unknowns x, but also for the Lagrange multipliers; we have increased the dimensionality of the problem to n + p. If $\hat{x} = (A^TA)^{-1}A^Tb$ denotes the solution of the unconstrained least squares problem, then, using the first block of equations above, x can be expressed as

$$x = \hat{x} - (A^TA)^{-1}C\lambda \tag{4.125}$$
In conjunction with the second block, this leads to

$$C^T(A^TA)^{-1}C\lambda = C^T\hat{x} \tag{4.126}$$

The unconstrained solution $\hat{x}$ can be obtained using the methods of Section 4.8.1; if A is singular or nearly singular, we can use the singular value decomposition (or a similar decomposition) of A to determine $\hat{x}$. Next, the value of λ can be obtained by solving (4.126); writing $(A^TA)^{-1} = R^{-1}(R^T)^{-1}$ in terms of the triangular factor R of A, (4.126) becomes

$$C^TR^{-1}(R^T)^{-1}C\lambda = C^T\hat{x}$$

The QR factorization of $(R^T)^{-1}C$ can be used to determine λ efficiently. Finally, the value of λ can be substituted in (4.125) to solve for x. This technique yields both solutions, provided that both exist.
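For moderate problem sizes, the KKT system can also be assembled and solved directly, as in the following sketch (which assumes the block matrix is non-singular):

```python
import numpy as np

def constrained_lstsq(A, b, C):
    """Solve min ||Ax - b||^2 subject to C^T x = 0 via the KKT system above."""
    n, p = A.shape[1], C.shape[1]
    K = np.block([[A.T @ A, C],
                  [C.T, np.zeros((p, p))]])
    rhs = np.concatenate([A.T @ b, np.zeros(p)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]                   # x and the Lagrange multipliers

rng = np.random.default_rng(2)
A, b = rng.standard_normal((20, 4)), rng.standard_normal(20)
C = rng.standard_normal((4, 2))
x, lam = constrained_lstsq(A, b, C)
print(np.allclose(C.T @ x, 0))                # the constraint C^T x = 0 holds
```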
Another trick that is often employed when ATA is singular or nearly singular is to decrease its condition number by augmenting it in (4.125) with the 'harmless' term CWCT, and solve

$$x = \hat{x} - (A^TA + CWC^T)^{-1}C\lambda$$

The addition of CWCT is considered harmless, since CTx = 0 is to be imposed anyway. The matrix W can be chosen to be an identity or near-identity matrix that selects a few columns of C, just enough to make ATA + CWCT non-singular.
If we use the notation

$$A(W) = \begin{bmatrix} A^TA + CWC^T & C \\ C^T & 0 \end{bmatrix} \qquad \text{and} \qquad A = A(0) = \begin{bmatrix} A^TA & C \\ C^T & 0 \end{bmatrix}$$

and if A and A(W) are invertible for W ≠ 0, it can be proved that

$$A^{-1}(W) = A^{-1} - \begin{bmatrix} 0 & 0 \\ 0 & W \end{bmatrix}$$

Consequently,

$$\kappa(A(W)) \le \kappa(A) + \|W\|^2\|C\|^2 + \alpha\|W\|$$

for some α > 0. That is, the condition number of A(W) is bounded by the condition number of A plus some positive terms.
Another useful technique is to find an approximation to (4.121) by solving the following weighted unconstrained minimization problem:

$$\min_{x \in \Re^n} \|Ax - b\|^2 + \mu^2\|C^Tx\|^2$$

For large values of µ, the solution $\hat{x}(\mu)$ of the unconstrained problem is a good approximation to the solution $\hat{x}$ of the constrained problem (4.121). We can use the generalized singular value decomposition of the matrices A and CT, which allows us to simultaneously diagonalize A and CT:

$$U^TAX = \mathrm{diag}(\alpha_1, \ldots, \alpha_m), \qquad V^TC^TX = \mathrm{diag}(\gamma_1, \ldots, \gamma_m)$$

where U and V are orthogonal matrices and X is some general matrix. The solution to the constrained problem can be expressed as

$$\hat{x} = \sum_{i=1}^{p} \frac{u_i^Tb}{\alpha_i}\,x_i$$

where $u_i$ and $x_i$ denote the ith columns of U and X respectively.

The analytical solution $\hat{x}(\mu)$ is then given as

$$\hat{x}(\mu) = \sum_{i=1}^{p} \frac{\alpha_i\,u_i^Tb}{\alpha_i^2 + \mu^2\gamma_i^2}\,x_i + \hat{x}$$

It can easily be seen that as $\mu^2 \to \infty$, $\hat{x}(\mu) \to \hat{x}$.
Generally, if possible, it is better to eliminate the constraints, since this makes the problem better conditioned. We will discuss one final approach to solving the linearly constrained least squares problem (4.121), which reduces the dimensionality of the problem by eliminating the constraints. It hinges on computing the QR factorization of C:

$$Q^TC = \begin{bmatrix} R \\ 0 \end{bmatrix} \tag{4.127}$$

where R is a p × p upper triangular matrix. This yields

$$AQ = \begin{bmatrix} A_1 & A_2 \end{bmatrix} \tag{4.128}$$

and

$$Q^Tx = \begin{bmatrix} y \\ z \end{bmatrix} \tag{4.129}$$

with y ∈ ℜp and z ∈ ℜn−p.
4.8. LEAST SQUARES 353

The constrained problem then becomes

$$\min_{y,\,z} \|b - A_1y - A_2z\|^2 \quad \text{subject to} \quad R^Ty = 0$$

Since R is invertible, we must have y = 0. Thus, the solution $\hat{x}$ to the constrained least squares problem can be determined as

$$\hat{x} = Q\begin{bmatrix} 0 \\ \hat{z} \end{bmatrix} \tag{4.130}$$

where

$$\hat{z} = \mathop{\mathrm{argmin}}_{z} \|b - A_2z\|^2$$
It can be proved that the matrix A2 is at least as well-conditioned as the matrix A. Often, the original problem is singular and imposing the constraints makes it non-singular (which is reflected in a non-singular matrix A2).

4.8.3 Least Squares with Quadratic Constraints


The quadratically constrained least squares problem is often encountered in regularization problems and can be stated as follows:

$$\min_{x \in \Re^n} \|Ax - b\|_2^2 \quad \text{subject to} \quad \|x\|_2^2 = \alpha^2$$

Since the objective function as well as the constraint function are convex, the KKT conditions (c.f. Section 4.4.4) are necessary and sufficient conditions for optimality at the primal-dual variable pair $(\hat{x}, \hat{\mu})$. The KKT conditions lead to the following equations:

$$(A^TA + \mu I)x = A^Tb \tag{4.131}$$

$$x^Tx = \alpha^2 \tag{4.132}$$

The expression in (4.131) is the solution to what statisticians sometimes refer to as the ridge regression problem. The solution to the problem under consideration, however, has the additional constraint that the norm of the solution vector $\hat{x}$ should equal |α|. The two equations above yield the so-called secular equation stated below:

$$b^TA(A^TA + \mu I)^{-2}A^Tb - \alpha^2 = 0$$

Further, the matrix A can be diagonalized using its singular value decomposition A = UΣVT to obtain the following equation, which is to be solved for µ:

$$\sum_{i=1}^{n} \frac{\sigma_i^2\,\beta_i^2}{(\sigma_i^2 + \mu)^2} - \alpha^2 = 0$$

where $\beta_i = u_i^Tb$ and $u_i$ is the ith column of U.
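The left-hand side is monotonically decreasing in µ ≥ 0, so a bracketing root finder applies. The following sketch (assuming α is smaller than the norm of the unconstrained solution, so that a root µ ≥ 0 exists inside the bracket) uses SciPy's brentq:

```python
import numpy as np
from scipy.optimize import brentq

def secular_solve(A, b, alpha):
    """Solve the diagonalized secular equation for mu and recover x."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b                               # beta_i = u_i^T b
    f = lambda mu: np.sum(s**2 * beta**2 / (s**2 + mu)**2) - alpha**2
    mu = brentq(f, 0.0, 1e12)                    # assumes f(0) > 0 > f(1e12)
    x = Vt.T @ (s * beta / (s**2 + mu))          # x = (A^T A + mu I)^{-1} A^T b
    return x, mu

rng = np.random.default_rng(3)
A, b = rng.standard_normal((30, 5)), rng.standard_normal(30)
x, mu = secular_solve(A, b, alpha=0.1)
print(np.isclose(np.linalg.norm(x), 0.1))        # the norm constraint holds
```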

4.8.4 Total Least Squares


The total least squares problem is stated as

$$\min_{x \in \Re^n,\; E \in \Re^{m \times n},\; r \in \Re^m} \|E\|_F^2 + \|r\|_2^2 \quad \text{subject to} \quad (A + E)x = b + r$$
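A standard solution to this problem, the classical SVD construction of Golub and Van Loan, can be sketched as follows. It assumes the generic situation in which the smallest singular value of the augmented matrix [A b] is simple and the last component of the corresponding right singular vector is non-zero:

```python
import numpy as np

def total_least_squares(A, b):
    """Classical SVD solution of the TLS problem: take the right singular
    vector of [A b] for the smallest singular value and scale it so that
    its last component is -1."""
    n = A.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([A, b]))
    v = Vt[-1]                        # right singular vector of smallest sigma
    if np.isclose(v[n], 0.0):
        raise ValueError("no generic TLS solution exists for this data")
    return -v[:n] / v[n]
```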
