Convex Optimization
4.1 Introduction
4.1.1 Mathematical Optimization
The problem of mathematical optimization is to minimize a non-linear cost
function f0 (x) subject to inequality constraints fi (x) ≤ 0, i = 1, . . . , m and
equality constraints hi (x) = 0, i = 1, . . . , p. x = (x1 , . . . , xn ) is a vector of
variables involved in the optimization problem. The general framework of a
non-linear optimization problem is outlined in (4.1).
minimize    f0(x)
subject to  fi(x) ≤ 0, i = 1, . . . , m    (4.1)
            hi(x) = 0, i = 1, . . . , p
variable    x = (x1, . . . , xn)
Three important classes of optimization problems are:
1. least-squares
2. linear programming
3. convex optimization problems - more or less the most general class of
problems that can be solved efficiently.
Least squares and linear programming have been around for quite some time and are very special types of convex optimization problems. Convex programming was not appreciated very much until the last 15 years; it has drawn attention more recently. In fact, many combinatorial optimization problems have been identified to be convex optimization problems. There are also some exceptions besides convex optimization problems, such as singular value decomposition (which corresponds to the problem of finding the best rank-k approximation to a matrix, under the Frobenius norm), that admit an exact global solution.
We will first introduce some general optimization principles. We will sub-
sequently motivate the specific class of optimization problems called convex
optimization problems and define convex sets and functions. Next, the theory
of Lagrange multipliers will be motivated and duality theory will be introduced.
As two specific and well-studied examples of convex optimization, techniques
for least squares and linear programming will be discussed to contrast them
against generic convex optimization. Finally, we will dive into techniques for
solving general convex optimization problems.
For the 1-D case, open and closed balls degenerate to open and closed intervals
respectively.
In other words, a set S ⊆ ℜn is bounded means that there exists a number ε > 0 such that for all x ∈ S, ||x|| ≤ ε.
In the 1−D case, the open interval obtained by excluding endpoints from an
interval I is the interior of I, denoted by int(I). For example, int([a, b]) = (a, b)
and int([0, ∞)) = (0, ∞).
The simplest examples of an open set are the open ball, the empty set ∅ and ℜn. Further, an arbitrary union of open sets is open, and a finite intersection of open sets is open. The interior of any set is always open. It can be proved that a set S is open if and only if int(S) = S.
The complement of an open set is a closed set.
The closed ball, the empty set ∅ and ℜn are three simple examples of closed
sets. Arbitrary intersection of closed sets is closed. Furthermore, finite union of
closed sets is closed.
Loosely speaking, the closure of a set is the smallest closed set containing the set. The closure of a closed set is the set itself. In fact, a set S is closed if and only if closure(S) = S. Boundedness can also be characterized using balls; a set S is bounded if and only if it is contained inside a closed ball of finite radius. A relationship between the interior, boundary and closure of a set S is closure(S) = int(S) ∪ bnd(S).
A function f attains its maximum value on its domain D at c if

f(x) ≤ f(c), ∀x ∈ D

and its minimum value on D at c if

f(x) ≥ f(c), ∀x ∈ D
If there is an open interval I containing c in which f(c) ≥ f(x), ∀x ∈ I, then we say that f(c) is a local maximum value of f. On the other hand, if there is an open interval I containing c in which f(c) ≤ f(x), ∀x ∈ I, then we say that f(c) is a local minimum value of f. If f(c) is either a local maximum or local minimum value of f in an open interval I with c ∈ I, then f(c) is called a local extreme value of f.
The following theorem gives us the first derivative test for local extreme
value of f , when f is differentiable at the extremum.
Proof: Suppose f(c) ≥ f(x) for all x in an open interval I containing c and that f′(c) exists. Then the difference quotient (f(c + h) − f(c))/h ≤ 0 for small h > 0 (so that c + h ∈ I). This inequality remains true as h → 0 from the right. In the limit, f′(c) ≤ 0. Also, the difference quotient (f(c + h) − f(c))/h ≥ 0 for small h < 0 (so that c + h ∈ I). This inequality remains true as h → 0 from the left. In the limit, f′(c) ≥ 0. Since f′(c) ≤ 0 as well as f′(c) ≥ 0, we must have f′(c) = 0¹. □
The extreme value theorem is one of the most fundamental theorems in cal-
culus concerning continuous functions on closed intervals. It can be stated as:
We must point out that either or both of the values c and d may be attained
at the end points of the interval [a, b]. Based on theorem (39), the extreme value theorem can be extended as:
Figure 4.1 illustrates Rolle's theorem with an example function f(x) = 9 − x² on the interval [−3, +3].
The mean value theorem is a generalization of the Rolle’s theorem, though
we will use the Rolle’s theorem to prove it.
Figure 4.2: Illustration of the mean value theorem with f(x) = 9 − x² on the interval [−3, 1]. We see that f′(−1) = (f(1) − f(−3))/4.
Figure 4.3: Plot of f(x) = 1/x, and its linear, quadratic and cubic approximations.
f(b) = f(a) + f′(a)(b − a) + (1/2!) f′′(a)(b − a)² + . . . + (1/n!) f⁽ⁿ⁾(a)(b − a)ⁿ + (1/(n + 1)!) f⁽ⁿ⁺¹⁾(c)(b − a)ⁿ⁺¹
Proof: Define
pn(x) = f(a) + f′(a)(x − a) + (1/2!) f′′(a)(x − a)² + . . . + (1/n!) f⁽ⁿ⁾(a)(x − a)ⁿ
and

φn(x) = pn(x) + Γ(x − a)ⁿ⁺¹

where the constant Γ is chosen so that f(b) = φn(b). Define the auxiliary function g(x) = f(x) − φn(x), that measures the difference between the function f and the approximating function φn(x) for each x ∈ [a, b].
• Since g(a) = g(b) = 0 and since g and g′ are both continuous on [a, b], we can apply the Rolle's theorem to conclude that there exists c1 ∈ (a, b) such that g′(c1) = 0.
• Similarly, since g′(a) = g′(c1) = 0, and since g′ and g′′ are continuous on [a, c1], we can apply the Rolle's theorem to conclude that there exists c2 ∈ (a, c1) such that g′′(c2) = 0.
• In this way, Rolle's theorem can be applied successively to g′′, g′′′, . . . , g⁽ⁿ⁺¹⁾ to imply the existence of ci ∈ (a, ci−1) such that g⁽ⁱ⁾(ci) = 0 for i = 3, 4, . . . , n + 1. Note however that g⁽ⁿ⁺¹⁾(x) = f⁽ⁿ⁺¹⁾(x) − 0 − (n + 1)!Γ; setting g⁽ⁿ⁺¹⁾(cn+1) = 0 gives us another representation of Γ as f⁽ⁿ⁺¹⁾(cn+1)/(n + 1)!.
Thus,
f(b) = f(a) + f′(a)(b − a) + (1/2!) f′′(a)(b − a)² + . . . + (1/n!) f⁽ⁿ⁾(a)(b − a)ⁿ + (f⁽ⁿ⁺¹⁾(cn+1)/(n + 1)!) (b − a)ⁿ⁺¹ □
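As a quick numerical cross-check of this expansion, the following is a minimal sketch, assuming f(x) = 1/x expanded around a = 1 (mirroring Figure 4.3); Python is used purely for illustration:

from math import factorial

def deriv_inv(k, a):
    # k-th derivative of f(x) = 1/x at a: f^(k)(a) = (-1)^k k! / a^(k+1)
    return (-1) ** k * factorial(k) / a ** (k + 1)

def taylor_poly(x, a, n):
    # p_n(x) = sum_{k=0}^{n} f^(k)(a) (x - a)^k / k!
    return sum(deriv_inv(k, a) * (x - a) ** k / factorial(k) for k in range(n + 1))

a, x = 1.0, 1.3
for n in (1, 2, 3):
    print(n, taylor_poly(x, a, n), 1 / x)

The gap between pn(x) and f(x) shrinks as n grows, exactly as the remainder term (1/(n+1)!) f⁽ⁿ⁺¹⁾(c)(x − a)ⁿ⁺¹ predicts.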
Figure 4.4: The mean value theorem can be violated if f (x) is not differentiable
at even a single point of the interval. Illustration on f (x) = x2/3 with the
interval [−3, 3].
Proof: Let t ∈ I and x ∈ I with t < x. By virtue of the mean value theorem, ∃c ∈ (t, x) such that f′(c) = (f(x) − f(t))/(x − t).
• If f ′ (x) > 0 for all x ∈ int(I), f ′ (c) > 0, which implies that f (x)−f (t) > 0
and we can conclude that f is increasing on I.
• If f ′ (x) < 0 for all x ∈ int(I), f ′ (c) < 0, which implies that f (x)−f (t) < 0
and we can conclude that f is decreasing on I.
• If f ′ (x) = 0 for all x ∈ int(I), f ′ (c) = 0, which implies that f (x)−f (t) = 0,
and since x and t are arbitrary, we can conclude that f is constant on I.
□
Figure 4.5 illustrates the intervals in (−∞, ∞) on which the function f(x) = 3x⁴ + 4x³ − 36x² is decreasing and increasing. First we note that f(x) is differentiable everywhere on (−∞, ∞) and compute f′(x) = 12(x³ + x² − 6x) = 12(x − 2)(x + 3)x, which is negative in the intervals (−∞, −3] and [0, 2] and positive in the intervals [−3, 0] and [2, ∞). We observe that f is decreasing in the intervals (−∞, −3] and [0, 2] while it is increasing in the intervals [−3, 0] and [2, ∞).
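As a small sanity check (not a proof) of this sign analysis, one can sample f′(x) = 12x(x + 3)(x − 2) at one interior point of each interval; the test points below are arbitrary choices:

def fprime(x):
    return 12 * x * (x + 3) * (x - 2)

# one interior test point per interval from the analysis above
for interval, x in [("(-inf,-3)", -4), ("(-3,0)", -1), ("(0,2)", 1), ("(2,inf)", 3)]:
    print(interval, "+" if fprime(x) > 0 else "-")
# expected signs: -, +, -, +  (decreasing, increasing, decreasing, increasing)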
There is a related sufficient condition for a function f to be increasing/decreasing
on an interval I, stated through the following theorem:
For example, the derivative of the function f(x) = 6x⁵ − 15x⁴ + 10x³ vanishes at 0 and 1, and f′(x) > 0 elsewhere. So f(x) is increasing on (−∞, ∞).
Are the sufficient conditions for increasing and decreasing properties of f(x) in theorem 46 also necessary? It turns out they are not. Figure 4.6 shows that for the function f(x) = x⁵, though f(x) is increasing in (−∞, ∞), f′(0) = 0.
In fact, we have a slightly different necessary condition for an increasing or
decreasing function.
Figure 4.6: Plot of f(x) = x⁵, illustrating that though the function is increasing on (−∞, ∞), f′(0) = 0.
The general condition for local extrema is stated in the next theorem; it
extends the result in theorem 39 to general non-differentiable functions.
That the converse of theorem 49 does not hold is illustrated in Figure 4.6;
0 is a critical number (f ′ (0) = 0), although f (0) is not a local extreme value.
Then, given a critical number c, how do we discern whether f (c) is a local
extreme value? This can be answered using the first derivative test:
Figure 4.7: Example illustrating the derivative test for function f (x) = 3x5 −
5x3 .
As an example, the function f(x) = 3x⁵ − 5x³ has the derivative f′(x) = 15x²(x + 1)(x − 1). The critical points are 0, 1 and −1. Of the three, the sign of f′(x) changes at 1 and −1, which are a local minimum and a local maximum respectively. The sign does not change at 0, which is therefore not a local extremum. This is pictorially depicted in Figure 4.7. As another example, consider the function
f(x) = { −x   if x ≤ 0
       { 1    if x > 0
Then,

f′(x) = { −1   if x < 0
        { 0    if x > 0
Note that f (x) is discontinuous at x = 0, and therefore f ′ (x) is not defined at
x = 0. All numbers x ≥ 0 are critical numbers. f (0) = 0 is a local minimum,
whereas f (x) = 1 is a local minimum as well as a local maximum ∀x > 0.
Figure 4.8: Plot for the strictly convex function f(x) = x² which has f′′(x) = 2 > 0, ∀x.
On the other hand, if the function is strictly convex and twice differentiable in I, then f′′(x) ≥ 0, ∀x ∈ I.
There is also a slopeless interpretation of strict convexity as stated in the
following theorem:
Adding the left and right hand sides of inequalities in (4.3) and (4.4), and
multiplying the resultant inequality by −1 gives us
f(x2) − f(x1) = (f′(z) − f′(x1))(x2 − x1) + f′(x1)(x2 − x1) ≥ f′(x1)(x2 − x1)    (4.7)
However, equations 4.9 and 4.10 contradict each other. Therefore, equality in 4.5 cannot hold for any x1 ≠ x2, implying that
Figure 4.9: Plot for the strictly concave function f(x) = −x² which has f′′(x) = −2 < 0, ∀x.
On the other hand, if the function is strictly concave and twice differentiable in I, then f′′(x) ≤ 0, ∀x ∈ I.
There is also a slopeless interpretation of concavity as stated in the fol-
lowing theorem:
If the second derivative f′′(c) exists, then the strict convexity conditions for the critical number can be stated in terms of the sign of f′′(c), making use of theorems 50 and 52. This is called the second derivative test.
Procedure 3 [Second derivative test]: Let c be a critical number of f where
f ′ (c) = 0 and f ′′ (c) exists.
1. If f ′′ (c) > 0 then f (c) is a local minimum.
2. If f ′′ (c) < 0 then f (c) is a local maximum.
3. If f ′′ (c) = 0 then f (c) could be a local maximum, a local minimum,
neither or both. That is, the test fails.
For example,
• If f (x) = x4 , then f ′ (0) = 0 and f ′′ (0) = 0 and we can see that f (0) is a
local minimum.
• If f (x) = −x4 , then f ′ (0) = 0 and f ′′ (0) = 0 and we can see that f (0) is
a local maximum.
• If f (x) = x3 , then f ′ (0) = 0 and f ′′ (0) = 0 and we can see that f (0) is
neither a local minimum nor a local maximum. (0, 0) is an inflection point
in this case.
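The second derivative test lends itself to a direct numerical sketch. The following is illustrative only: it approximates f′ and f′′ by central differences (the step sizes h and the tolerance tol are hypothetical choices), rather than using exact derivatives:

def d1(f, c, h=1e-5):
    # central-difference approximation to f'(c)
    return (f(c + h) - f(c - h)) / (2 * h)

def d2(f, c, h=1e-4):
    # central-difference approximation to f''(c)
    return (f(c + h) - 2 * f(c) + f(c - h)) / h ** 2

def classify(f, c, tol=1e-3):
    assert abs(d1(f, c)) < tol, "c is not a critical number"
    s = d2(f, c)
    if s > tol:
        return "local minimum"
    if s < -tol:
        return "local maximum"
    return "test fails (f''(c) = 0)"

print(classify(lambda x: 3 * x**5 - 5 * x**3, 1.0))   # local minimum
print(classify(lambda x: 3 * x**5 - 5 * x**3, -1.0))  # local maximum
print(classify(lambda x: x**4, 0.0))                  # test fails, yet f(0) is a minimum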
f′(a) = lim_{h→0⁺} (f(a + h) − f(a))/h

Similarly, the (left-sided) derivative of f at x = b is defined as

f′(b) = lim_{h→0⁻} (f(b + h) − f(b))/h
• If f (a) is the maximum value of f on [a, b], then f ′ (a) ≤ 0 or f ′ (a) = −∞.
• If f (b) is the minimum value of f on [a, b], then f ′ (b) ≤ 0 or f ′ (b) = −∞.
The following theorem gives a useful procedure for finding extrema on closed
intervals.
Theorem 55 Let f be continuous on [a, b] and suppose that f′′(x) exists for all x ∈ (a, b). Then,
The next theorem is very useful for finding global extrema values on open
intervals.
Figure 4.11: Illustrating the constraints for the optimization problem of finding
the cone with minimum volume that can contain a sphere of radius R.
first step is to reduce the volume formula to involve only one of r² or h (see footnote 4). The algebra involved will be the simplest if we solve for h. The constraint gives us r² = R²h/(h − 2R). Substituting this expression for r² into the volume formula, we get g(h) = (πR²/3) · h²/(h − 2R) with the domain given by D = {h | 2R < h < ∞}. Note that D is an open interval.

g′(h) = (πR²/3) · (2h(h − 2R) − h²)/(h − 2R)² = (πR²/3) · h(h − 4R)/(h − 2R)²

which is 0 in its domain D if and only if h = 4R.

g′′(h) = (πR²/3) · (2(h − 2R)³ − 2h(h − 4R)(h − 2R)²)/(h − 2R)⁴ = (πR²/3) · 2(h² − 4Rh + 4R² − h² + 4Rh)/(h − 2R)³ = (πR²/3) · 8R²/(h − 2R)³

which is greater than 0 in D. Therefore, g (and consequently f) has a unique minimum at h = 4R and correspondingly, r² = R²h/(h − 2R) = 2R².
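The closed-form conclusion can also be checked numerically; the sketch below (with R = 1, an arbitrary choice, and a coarse grid over the open domain) is illustrative rather than rigorous:

from math import pi

R = 1.0
def g(h):
    # volume of the circumscribing cone as a function of its height h
    return (pi * R**2 / 3) * h**2 / (h - 2 * R)

hs = [2 * R + 0.001 * i for i in range(1, 18001)]  # grid on (2R, 20R]
h_best = min(hs, key=g)
print(h_best)       # approximately 4R
print(g(4 * R))     # minimum volume, 8*pi*R^3/3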
Dv f(x) = lim_{h→0} (f(x + hv) − f(x))/h    (4.12)
4 Since r appears in the volume formula only in terms of r2 .
Du_k f(x) = ∂f(x)/∂xk

Dv f(x) = Σ_{k=1}^{n} (∂f(x)/∂xk) vk    (4.13)

Therefore, g′(0) = Dv f(x) = Σ_{k=1}^{n} (∂f(x)/∂xk) vk □
The theorem works if the function is differentiable at the point; else the behaviour is not predictable. The above theorem leads us directly to the idea of the gradient.
We can see that the right hand side of (4.13) can be realized as the dot product of two vectors, viz., [∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn]ᵀ and v. Let us denote ∂f(x)/∂xi by fxi(x). Then we assign a name to the special vector discovered above.
What does the gradient ∇f(x) tell you about the function f(x)? We will illustrate with some examples. Consider the polynomial f(x, y, z) = x²y + z sin xy and the unit vector v = (1/√3)[1, 1, 1]ᵀ. Consider the point p0 = (0, 1, 3). We will compute the directional derivative of f at p0 in the direction of v. To do this, we first compute the gradient of f in general: ∇f = [2xy + yz cos xy, x² + xz cos xy, sin xy]ᵀ. Evaluating the gradient at the specific point p0, ∇f(0, 1, 3) = [3, 0, 0]ᵀ. The directional derivative at p0 in the direction v is Dv f(0, 1, 3) = [3, 0, 0] · (1/√3)[1, 1, 1]ᵀ = √3. This directional derivative is the rate of change of f at p0 in the direction v; it is positive, indicating that the function f increases at p0 in the direction v.
All our ideas about first and second derivative in the case of a single variable
carry over to the directional derivative.
As another example, let us find the rate of change of f(x, y, z) = e^{xyz} at p0 = (1, 2, 3) in the direction from p1 = (1, 2, 3) to p2 = (−4, 6, −1). We first construct a unit vector from p1 to p2: v = (1/√57)[−5, 4, −4]ᵀ. The gradient of f in general is ∇f = [yze^{xyz}, xze^{xyz}, xye^{xyz}]ᵀ = e^{xyz}[yz, xz, xy]ᵀ. Evaluating the gradient at the specific point p0, ∇f(1, 2, 3) = e⁶[6, 3, 2]ᵀ. The directional derivative at p0 in the direction v is Dv f(1, 2, 3) = e⁶[6, 3, 2] · (1/√57)[−5, 4, −4]ᵀ = −26e⁶/√57. This directional derivative is negative, indicating that the function f decreases at p0 in the direction from p1 to p2.
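This computation is easy to cross-check numerically; the sketch below compares the gradient-based value with a one-sided difference quotient (the step h is a hypothetical choice):

from math import exp, sqrt

def f(x, y, z):
    return exp(x * y * z)

p0 = (1.0, 2.0, 3.0)
v = [c / sqrt(57) for c in (-5.0, 4.0, -4.0)]   # unit vector from p1 to p2

grad = [6 * exp(6), 3 * exp(6), 2 * exp(6)]     # e^{xyz} [yz, xz, xy] at p0
analytic = sum(g * vi for g, vi in zip(grad, v))

h = 1e-6
numeric = (f(*(p + h * vi for p, vi in zip(p0, v))) - f(*p0)) / h
print(analytic, numeric)   # both approximately -26 e^6 / sqrt(57)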
While there exist infinitely many direction vectors v at any point x, there is a unique gradient vector ∇f(x). Since we separated Dv f(x) as the dot product of ∇f(x) with v, we can study ∇f(x) independently. What does the gradient vector tell us? We will state a theorem to answer this question.
Proof: The Cauchy-Schwarz inequality, applied in the Euclidean space, states that |xᵀy| ≤ ||x|| · ||y|| for any x, y ∈ ℜn, with equality holding iff x and y are linearly dependent. The inequality gives upper and lower bounds on the dot product between two vectors: −||x|| · ||y|| ≤ xᵀy ≤ ||x|| · ||y||. Applying these bounds to the right hand side of (4.14) and using the fact that ||v|| = 1, we get

−||∇f(x)|| ≤ Dv f(x) = ∇ᵀf(x) · v ≤ ||∇f(x)||

with equality holding iff v = k∇f(x) for some k ≥ 0. Since ||v|| = 1, equality can hold iff v = ∇f(x)/||∇f(x)||. □
The theorem implies that the maximum rate of change of f at a point x is given by the norm of the gradient vector at x, and the direction in which the rate of change of f is maximum is given by the unit vector ∇f(x)/||∇f(x)||.
An associated fact is that the minimum value of the directional derivative Dv f(x) is −||∇f(x)|| and it occurs when v has the opposite direction of the gradient vector, i.e., v = −∇f(x)/||∇f(x)||. This fact is often used in numerical analysis when one is trying to minimize the value of very complex functions; the method of steepest (gradient) descent repeatedly steps along this direction of fastest decrease.
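A minimal sketch of the resulting method (steepest descent with a fixed step size; the step size and iteration count are hypothetical choices, and practical implementations select the step by line search):

def steepest_descent(grad, x0, step=0.1, iters=100):
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # move opposite the gradient: the direction of fastest decrease
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# example: f(x) = (x1 - 1)^2 + (x2 + 2)^2 has gradient [2(x1 - 1), 2(x2 + 2)]
print(steepest_descent(lambda x: [2 * (x[0] - 1), 2 * (x[1] + 2)], [0.0, 0.0]))
# converges toward the minimizer (1, -2)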
A hyperplane H in ℜn, with normal vector v ∈ ℜn and passing through a point q, is the set of points p satisfying

∀ p ∈ H, (p − q)ᵀv = 0

Hyperplane H can also be equivalently defined as the set of points p such that pᵀv = c for some c ∈ ℜ and some v ∈ ℜn, with c = qᵀv in our definition.
(This definition will be referred to at a later point.)
What if Dv f (x) turns out to be 0? What can we say about ∇f (x) and v?
There is a useful theorem in this regard.
Proof: Let K be the range of f and let k ∈ K such that f(x*) = k. Consider the level surface f(x) = k. Let r(t) = [x1(t), x2(t), . . . , xn(t)] be a curve on the level surface, parametrized by t ∈ ℜ, with r(0) = x*. Then, f(r(t)) = k. Applying the chain rule,
df(r(t))/dt = Σ_{i=1}^{n} (∂f/∂xi)(dxi(t)/dt) = ∇ᵀf(r(t)) (dr(t)/dt) = 0

For t = 0, the equation becomes

∇ᵀf(x*) (dr(0)/dt) = 0
Now, dr(t)/dt represents any tangent vector to the curve through r(t) which lies completely on the level surface. That is, the tangent line to any curve at x* on the level surface containing x* is orthogonal to ∇f(x*). Since the tangent hyperplane to a surface at any point is the hyperplane containing all tangent vectors to curves on the surface passing through the point, the gradient is perpendicular to the tangent hyperplane to the level surface passing through that point. The equation of the tangent hyperplane is given by (x − x*)ᵀ∇f(x*) = 0. □
Recall from elementary calculus that the normal to a plane can be found by taking the cross product of any two (non-parallel) vectors lying within the plane. The gradient vector at any point on the level surface of a function is normal to the tangent hyperplane (or tangent line in the case of two variables) to the surface at the same point, and can be conveniently obtained using the partial derivatives of the function at that point.
We will use some illustrative examples to study these facts.
1. Consider the same plot as in Figure 4.12 with a gradient vector at (2, 0) as shown in Figure 4.13. The gradient vector [1, 2]ᵀ is perpendicular to the tangent hyperplane to the level curve x1e^{x2} = 2 at (2, 0). The equation of the tangent hyperplane is (x1 − 2) + 2(x2 − 0) = 0 and it turns out to be a tangent line.
2. The level surfaces for f(x1, x2, x3) = x1² + x2² + x3² are shown in Figure 4.14. The gradient at (1, 1, 1) is orthogonal to the tangent hyperplane to the level surface f(x1, x2, x3) = x1² + x2² + x3² = 3 at (1, 1, 1). The gradient vector at (1, 1, 1) is [2, 2, 2]ᵀ and the tangent hyperplane has the equation 2(x1 − 1) + 2(x2 − 1) + 2(x3 − 1) = 0, which is a plane in 3D. On the other hand, the dotted line in Figure 4.15 is not orthogonal to the level surface, since it does not coincide with the gradient.
3. Let f(x1, x2, x3) = x1²x2³x3⁴ and consider the point x0 = (1, 2, 1). We will find the equation of the tangent plane to the level surface through x0. The level surface through x0 is determined by setting f equal to its value evaluated at x0; that is, the level surface will have the equation x1²x2³x3⁴ = 1² · 2³ · 1⁴ = 8. The gradient vector (normal to the tangent plane) at
Figure 4.13: The level curves from Figure 4.12 along with the gradient vector at (2, 0). Note that the gradient vector is perpendicular to the level curve x1e^{x2} = 2 at (2, 0).
Figure 4.14: 3 level surfaces for the function f(x1, x2, x3) = x1² + x2² + x3² with c = 1, 3, 5. The gradient at (1, 1, 1) is orthogonal to the level surface f(x1, x2, x3) = x1² + x2² + x3² = 3 at (1, 1, 1).
Figure 4.15: Level surface f(x1, x2, x3) = x1² + x2² + x3² = 3. The gradient at (1, 1, 1), drawn as a bold line, is perpendicular to the tangent plane to the level surface at (1, 1, 1), whereas the dotted line, though passing through (1, 1, 1), is not perpendicular to the same tangent plane.
(1, 2, 1) is ∇f(x1, x2, x3)|(1,2,1) = [2x1x2³x3⁴, 3x1²x2²x3⁴, 4x1²x2³x3³]ᵀ|(1,2,1) = [16, 12, 32]ᵀ. The equation of the tangent plane at x0, given the normal vector ∇f(x0), can be easily written down: ∇f(x0)ᵀ(x − x0) = 0, which turns out to be 16(x1 − 1) + 12(x2 − 2) + 32(x3 − 1) = 0, a plane in 3D (a numerical cross-check appears after this list).
4. Consider the function f(x, y, z) = x/(y + z). The directional derivative of f in the direction of the vector v = (1/√14)[1, 2, 3]ᵀ at the point x0 = (4, 1, 1) is

∇ᵀf|(4,1,1) · (1/√14)[1, 2, 3]ᵀ = [1/(y + z), −x/(y + z)², −x/(y + z)²]|(4,1,1) · (1/√14)[1, 2, 3]ᵀ = [1/2, −1, −1] · (1/√14)[1, 2, 3]ᵀ = −9/(2√14)

The directional derivative is negative, indicating that the function decreases along the direction of v. Based on theorem 58, we know that the maximum rate of change of a function at a point x is given by ||∇f(x)|| and it is in the direction ∇f(x)/||∇f(x)||. In the example under consideration, this maximum rate of change at x0 is 3/2 and it is in the direction of the vector (2/3)[1/2, −1, −1]ᵀ.
6. Let us determine the equations of (a) the tangent plane to the paraboloid P: x1 = x2² + x3² − 2 at (−1, 1, 0) and (b) the normal line to the tangent plane. To realize this as the level surface of a function of three variables, we define the function f(x1, x2, x3) = x1 − x2² − x3² and find that the paraboloid P is the same as the level surface f(x1, x2, x3) = −2. The normal to the tangent plane to P at x0 is in the direction of the gradient vector ∇f(x0) = [1, −2, 0]ᵀ and its parametric equation is [x1, x2, x3] = [−1 + t, 1 − 2t, 0]. The equation of the tangent plane is therefore (x1 + 1) − 2(x2 − 1) = 0.
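As promised above, here is a small numerical cross-check of example 3 (a sketch; the finite-difference step h is a hypothetical choice):

def f(x1, x2, x3):
    return x1**2 * x2**3 * x3**4

x0 = (1.0, 2.0, 1.0)
h = 1e-6
grad = []
for i in range(3):
    xp, xm = list(x0), list(x0)
    xp[i] += h
    xm[i] -= h
    grad.append((f(*xp) - f(*xm)) / (2 * h))  # central difference for f_{x_i}

print([round(g) for g in grad])   # [16, 12, 32], the normal to the tangent plane
# tangent plane: 16(x1 - 1) + 12(x2 - 2) + 32(x3 - 1) = 0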
(Σ_{i=1}^{n} fxi(x⁰)(xi − xi⁰)) − (z − f(x⁰)) = 0

or equivalently as,

(Σ_{i=1}^{n} fxi(x⁰)(xi − xi⁰)) + f(x⁰) = z
Figure 4.16: Plot of f(x1, x2) = 3x1² − x1³ − 2x2² + x2⁴, showing the various local maxima and minima of the function.
Proof: The idea behind this theorem can be stated as follows. The tangent
hyperplane to the function at any extreme point must be parallel to the plane
z = 0. This can happen if and only if the gradient ∇F = [∇T f, −1]T is parallel
to the z−axis at the extreme point. Or equivalently, the gradient to the function
f must be the zero vector at every extreme point, i.e., fxi (x∗ ) = 0 for 1 ≤ i ≤ n.
To formally prove this theorem, consider the function gi(xi) = f(x1*, x2*, . . . , x(i−1)*, xi, x(i+1)*, . . . , xn*). If f has a local extremum at x*, then each function gi(xi) must have a local extremum at xi*. Therefore gi′(xi*) = 0 by theorem 39. Now gi′(xi*) = fxi(x*), so fxi(x*) = 0. □
Applying theorem 60 to the function f(x1, x2) = 9 − x1² − x2², we require that at any extreme point fx1 = −2x1 = 0 ⇒ x1 = 0 and fx2 = −2x2 = 0 ⇒ x2 = 0. Thus, f indeed attains its maximum at the point (0, 0) as shown in Figure 4.17.
2. Determine if there are any points where any one of fxi fails to exist. Add
such points (if any) to the list of critical points.
Figure 4.17: The paraboloid f(x1, x2) = 9 − x1² − x2² attains its maximum at (0, 0). The tangent plane to the surface at (0, 0, f(0, 0)) is also shown, and so is the gradient vector ∇F at (0, 0, f(0, 0)).
Figure 4.18: Plot illustrating critical points where derivative fails to exist.
Figure 4.19: The hyperbolic paraboloid f(x1, x2) = x1² − x2², which has a saddle point at (0, 0).
As an example, for the function f (x1 , x2 ) = |x1 |, fx1 does not exist for
(0, s) for any s ∈ ℜ and all of them are critical points. Figure 4.18 shows the
corresponding 3−D plot.
Is the converse of theorem 60 true? That is, if you find an x* that satisfies fxi(x*) = 0 for all 1 ≤ i ≤ n, is it necessary that x* is an extreme point? The answer is no. In fact, points that violate the converse of theorem 60 are called saddle points.
Definition 28 [Saddle point]: A point x∗ is called a saddle point of a func-
tion f (x) defined on D ⊆ ℜn if x∗ is a critical point of f but x∗ does not
correspond to a local maximum or minimum of the function.
We saw the example of a saddle point in Figure 4.7, for the case n = 1. The inflection point for a function of a single variable, that was discussed earlier, is the analogue of the saddle point for a function of multiple variables. An example for n = 2 is the hyperbolic paraboloid⁵ f(x1, x2) = x1² − x2², the graph of which is shown in Figure 4.19. The hyperbolic paraboloid opens up along the x1-axis (Figure 4.20) and down along the x2-axis (Figure 4.21) and has a saddle point at (0, 0).
To get working on figuring out how to find the maximum and minimum of a function, we will take some examples. Let us find the critical points of f(x1, x2) = x1² + x2² − 2x1 − 6x2 + 14 and classify them. This function is a polynomial function and is differentiable everywhere. It is a paraboloid that is shifted away from the origin. To find its critical points, we will solve fx1 = 2x1 − 2 = 0 and fx2 = 2x2 − 6 = 0, which, when solved simultaneously, yield a single critical point (1, 3). For a simple example like this, the function f can be rewritten as f(x1, x2) = (x1 − 1)² + (x2 − 3)² + 4, which implies that f(x1, x2) ≥ 4 = f(1, 3). Therefore, (1, 3) is indeed a local minimum (in fact a global minimum) of f(x1, x2).
5 The hyperbolic paraboloid is shaped like a saddle and can have a critical point called the
saddle point.
Figure 4.20: The hyperbolic paraboloid f(x1, x2) = x1² − x2², when viewed from the x1 axis, is concave up.
Figure 4.21: The hyperbolic paraboloid f(x1, x2) = x1² − x2², when viewed from the x2 axis, is concave down.
Proof: Since the mixed partial derivatives of f are continuous in an open region R containing x* and since ∇²f(x*) ≻ 0, it can be shown that there exists an ε > 0, with B(x*, ε) ⊆ R, such that for all ||h|| < ε, ∇²f(x* + h) ≻ 0. Consider an increment vector h such that (x* + h) ∈ B(x*, ε). Define g(t) = f(x* + th) : [0, 1] → ℜ. Using the chain rule,
g′(t) = Σ_{i=1}^{n} fxi(x* + th) (dxi/dt) = hᵀ∇f(x* + th)

By Taylor's theorem applied to g on [0, 1], there exists c ∈ (0, 1) such that

g(1) = g(0) + g′(0) + (1/2) g′′(c)
6 By Clairaut's Theorem, if the partial and mixed derivatives of a function are continuous on an open region containing a point x*, then fxixj(x*) = fxjxi(x*), for all i, j ∈ [1, n].
f(x* + h) = f(x*) + hᵀ∇f(x*) + (1/2) hᵀ∇²f(x* + ch)h

We are given that ∇f(x*) = 0. Therefore,

f(x* + h) − f(x*) = (1/2) hᵀ∇²f(x* + ch)h
The presence of an extremum of f at x* is determined by the sign of f(x* + h) − f(x*). By virtue of the above equation, this is the same as the sign of H(c) = hᵀ∇²f(x* + ch)h. Because the partial derivatives of f are continuous in R, if H(0) ≠ 0, the sign of H(c) will be the same as the sign of H(0) = hᵀ∇²f(x*)h for h with sufficiently small components (i.e., since the function has continuous partial and mixed partial derivatives at x*, the Hessian will remain positive definite in some small neighborhood around x*). Therefore, if ∇²f(x*) is positive definite, we are guaranteed to have H(0) positive, implying that f has a local minimum at x*. Similarly, if −∇²f(x*) is positive definite, we are guaranteed to have H(0) negative, implying that f has a local maximum at x*. □
Theorem 61 gives sufficient conditions for local maxima and minima of func-
tions of multiple variables. Along similar lines of the proof of theorem 61, we
can prove necessary conditions for local extrema in theorem 62.
Thus, for a function of more than one variable, the second derivative test
generalizes to a test based on the eigenvalues of the function’s Hessian matrix at
the stationary point. Based on theorem 61, we will derive the second derivative
test for determining extreme values of a function of two variables.
Theorem 64 Let the partial and second partial derivatives of f(x1, x2) be continuous on a disk with center (a, b) and suppose fx1(a, b) = 0 and fx2(a, b) = 0, so that (a, b) is a critical point of f. Let D(a, b) = fx1x1(a, b)fx2x2(a, b) − [fx1x2(a, b)]². Then,
• If D > 0 and fx1 x1 (a, b) > 0, then f (a, b) is a local minimum.
• Else if D > 0 and fx1 x1 (a, b) < 0, then f (a, b) is a local maximum.
• Else if D < 0 then (a, b) is a saddle point.
Proof: Note that det(∇²f(a, b)) = D(a, b) is the product of the two eigenvalues of ∇²f(a, b), while trace(∇²f(a, b)) = fx1x1(a, b) + fx2x2(a, b) is their sum.
• If det(∇²f(a, b)) > 0 and if additionally fx1x1(a, b) > 0 (or equivalently, fx2x2(a, b) > 0), the product as well as the sum of the eigenvalues will be positive, implying that the eigenvalues are positive and therefore ∇²f(a, b) is positive definite. According to theorem 61, this is a sufficient condition for f(a, b) to be a local minimum.
• If det(∇²f(a, b)) > 0 and if additionally fx1x1(a, b) < 0 (or equivalently, fx2x2(a, b) < 0), the product of the eigenvalues is positive whereas the sum is negative, implying that the eigenvalues are negative and therefore ∇²f(a, b) is negative definite. According to theorem 61, this is a sufficient condition for f(a, b) to be a local maximum.
• If det(∇²f(a, b)) < 0, the eigenvalues must have opposite signs, implying that ∇²f(a, b) is neither positive semi-definite nor negative semi-definite. By corollary 63, this is a sufficient condition for f(a, b) to be a saddle point. □
We saw earlier that the critical points for f(x1, x2) = 2x1³ + x1x2² + 5x1² + x2² are (0, 0), (−5/3, 0), (−1, 2) and (−1, −2). To determine which of these correspond to local extrema and which are saddle points, we first compute the second partial derivatives of f:

fx1x1(x1, x2) = 12x1 + 10
fx2x2(x1, x2) = 2x1 + 2
fx1x2(x1, x2) = 2x2

Using theorem 64, we can verify that (0, 0) corresponds to a local minimum, (−5/3, 0) corresponds to a local maximum, while (−1, 2) and (−1, −2) correspond to saddle points. Figure 4.22 shows the plot of the function while pointing out the four critical points.
Figure 4.22: Plot of the function 2x1³ + x1x2² + 5x1² + x2² showing the four critical points.
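The classification can be reproduced directly from the second partial derivatives above; the following sketch simply evaluates D(a, b) = fx1x1 fx2x2 − fx1x2² at each critical point:

def second_derivative_test(a, b):
    fx1x1 = 12 * a + 10
    fx2x2 = 2 * a + 2
    fx1x2 = 2 * b
    D = fx1x1 * fx2x2 - fx1x2**2
    if D < 0:
        return "saddle point"
    return "local minimum" if fx1x1 > 0 else "local maximum"

for point in [(0, 0), (-5/3, 0), (-1, 2), (-1, -2)]:
    print(point, second_derivative_test(*point))
# (0,0): local minimum; (-5/3,0): local maximum; (-1,2) and (-1,-2): saddle points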
Using theorem 64, we can verify that (2.6442, 1.898384) and (−2.6442, 1.898384)
correspond to local maxima whereas (0.8567, 0.646772) and (−0.8567, 0.646772)
correspond to saddle points. This is illustrated in Figure 4.23.
8 Solving this using Matlab without proper scaling could give you complex values. With proper scaling of the equation, you should get y = −2.545156 or y = 0.646772 or y = 1.898384.
9 The values of x corresponding to y = −2.545156 are complex.
Figure 4.23: Plot of the function 10x²y − 5x² − 4y² − x⁴ − 2y⁴ showing the four critical points.
2. The function f (x, y) = x sin y has the gradient vector [sin y, x cos y].
The critical points correspond to the solutions to the simultaneous set of
equations
sin y = 0
(4.17)
x cos y = 0
The critical points are (0, nπ) for n = 0, ±1, ±2, . . .. The second partial derivatives of the function are

fxx = 0
fxy = cos y    (4.18)
fyy = −x sin y
Along similar lines of the single variable case, we next define the global
maximum and minimum.
Figure 4.24: Plot of the function x sin y illustrating that all critical points are
saddle points.
We would like to find the absolute maximum and minimum values of a function of multiple variables in a closed region, along similar lines of the method yielded by theorem 41 for functions of a single variable. The procedure was to evaluate the value of the function at the critical points as well as the end points of the interval, and to determine the absolute maximum and minimum values by scanning this list. To generalize the idea to functions of multiple variables, we point out that the analogue of finding the value of the function at the boundaries of the closed interval in the single variable case is to find the function value along the boundary curve, which reduces the evaluation of a function of multiple variables to evaluating a function of a single variable. Recall from the definitions on page 214 that a closed set in ℜn is a set that contains its boundary points (analogous to a closed interval in ℜ) while a bounded set in ℜn is a set that is contained inside a closed ball, B[0, ε]. An example bounded set is {(x1, x2, x3) | x1² + x2² + x3² ≤ 1}. An example unbounded set is {(x1, x2, x3) | x1 > 1, x2 > 1, x3 > 1}. Based on these definitions, we can state the extreme value theorem for a function of n variables.
Figure 4.25: The region bounded by the points (0, 3), (2, 0), (0, 0) on which we
consider the maximum and minimum of the function f (x, y) = 1 + 4x − 5y.
x = −1, y = −1. However, this point does not lie in R and hence there are no critical points in R. Along similar lines of the previous problem, we will find the extreme values of f on the boundaries of R.
theorem 62 specified a necessary condition for the same. Can these conditions be extended to globally optimal solutions? The answer is that the extensions to globally optimal solutions can be made for a specific class of optimization problems called convex optimization problems. In the next section, we introduce the concept of convex sets and convex functions, en route to discussing convex optimization.
minimize f (x)
subject to gi (x) ≤ 0, i = 1, . . . , m (4.20)
Ax = b
Least squares and linear programming are special cases of convex optimization problems. As in the case of linear programming, there are in general no analytical solutions for convex optimization problems, but they can be solved reliably, efficiently and optimally. There is not much well-developed software for the general class of convex optimization problems, though there are several software packages in Matlab, C, etc., and many free software packages as well. The computation time is polynomial, but is more complicated to express exactly because it depends on the cost of evaluating the function values and their derivatives. Modulo that, the computation time for convex optimization problems is similar to that for linear programming problems.
To pose practical problems as convex optimization problems is more difficult than to recognize least squares and linear programs. There exist many techniques to reformulate problems in convex form. Indeed, surprisingly many problems in practice can be solved via convex optimization.
4.2.2 History
Numerical optimization started in the 1940s with the development of the sim-
plex method for linear programming. The next obvious extension to linear
programming was by replacing the linear cost function with a quadratic cost
function. The linear inequality constraints were however maintained. This first
extension took place in the 1950s. We can expect that the next step would have
been to replace the linear constraints with quadratic constraints. But it did
not really happen that way. On the other hand, around the end of the 1060s,
there was another non-linear, convex extension of linear programming called
geometric programming. Geometric programming includes linear programming
as a special case. Nothing more happened until the beginning of the 1990s. The
beginning of the 1990s was marked by a big explosion of activities in the area
of convex optimizations, and development really picked up. Researches formu-
lated different and more general classes of convex optimization problems that are
known as semidefinite programming, second-order cone programming, quadrat-
ically constrained quadratic programming, sum-of-squares programming, etc.
The same happened in terms of applications. Since the 1990s, applications have been investigated in many different areas. One of the first application areas was control, where the optimization methods investigated included semidefinite programming for certain control problems. Geometric programming had been around since the late 1960s and was applied extensively to circuit design problems. Quadratic programming found application in machine learning problem formulations such as support vector machines. Semidefinite programming relaxations found use in combinatorial optimization. There were many other interesting applications in different areas such as image processing, quantum information, finance, signal processing, communications, etc.
This first look at the activities involving applications of optimization clearly indicates that a lot of development took place around the 1990s. Further, people extended interior-point methods (which were already known for linear
12 Optimization problems such as singular value decomposition are some few exceptions to
this.
Figure 4.27 shows examples of convex and non-convex (concave) sets. Since
an affine set contains any line passing through two distinct points in the set, it
also contains any line segment connecting two points in the set. Thus, an affine
set is our first example of a convex set.
A set C is a convex cone if it is convex and, additionally, for every point x ∈ C, all non-negative multiples of x are also in C. In other words,

∀x1, x2 ∈ C, θ1, θ2 ≥ 0 ⇒ θ1x1 + θ2x2 ∈ C    (4.23)
x = Σ_{i=1}^{k} θi xi  with  Σ_{i=1}^{k} θi = 1 and θi ≥ 0    (4.24)
The convex hull conv(S) of the set of points S is the set of all convex
combinations of points in S. The convex hull of a convex set S is S itself.
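For a finite point set, conv(S) is easy to compute and inspect; the sketch below uses scipy's ConvexHull on random points (scipy and numpy are assumed available, and the data is arbitrary):

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
S = rng.random((20, 2))      # 20 random points in the unit square
hull = ConvexHull(S)
print(S[hull.vertices])      # the extreme points; every point of S is a
                             # convex combination of these hull vertices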
13 The first practical polynomial time algorithm for linear programming by Karmarkar (1984)
A closed ball B[xc, r] in ℜn can be expressed as {xc + ru | ||u||₂ ≤ 1}, where u is a vector with norm less than or equal to 1. The open ball B(xc, r) is also convex. Replacing r with a non-singular square matrix A, we get an ellipsoid given by
{xc + Au | ||u||2 ≤ 1}
which is also a convex set. Another equivalent representation of the ellipsoid can be obtained by observing that for any point x in the ellipsoid, ||A⁻¹(x − xc)||₂ ≤ 1, that is, (x − xc)ᵀ(A⁻¹)ᵀA⁻¹(x − xc) ≤ 1. Since (A⁻¹)ᵀ = (Aᵀ)⁻¹ and A⁻¹B⁻¹ = (BA)⁻¹, the ellipsoid can be equivalently defined as {x | (x − xc)ᵀP⁻¹(x − xc) ≤ 1}, where P = AAᵀ is a symmetric matrix. Furthermore, P is positive definite, since A is non-singular (c.f. page 208).
Matrix A determines the size of the ellipsoid: the eigenvalue λi of A determines the length of the ith semi-axis of the ellipsoid (see page number 206). The ellipsoid is another example of a convex set, and is a generalization of the Euclidean ball. Figure 4.30 illustrates an ellipsoid in ℜ².
A norm ball is a ball with an arbitrary norm. A norm ball with center xc and radius r is given by

{x | ||x − xc|| ≤ r}

By the triangle inequality in the definition of the norm, a norm ball is convex for any norm. The norm ball with the ∞-norm corresponds to a square in ℜ², while the norm ball with the 1-norm in ℜ² corresponds to the same square rotated by 45°.
The definition of a cone can be extended to any arbitrary norm to define a norm cone: the set of all pairs (x, t) satisfying ||x|| ≤ t, i.e.,

{(x, t) | ||x|| ≤ t}

is called a norm cone, and it is a convex set.
A polyhedron is the solution set of a finite number of linear inequalities and equalities:

Ax ⪯ b,  A ∈ ℜm×n
Cx = d,  C ∈ ℜp×n    (4.25)

where ⪯ denotes componentwise inequality.
" #
x y
S= (4.26)
y z
2 2
We can represent the space of matrices S+ of the form S ∈ S+ as a three
dimensional space with non-negative x, y and z coordinates and a non-negative
determinant. This space corresponds to a cone as shown in Figure 4.33.
Intersection

The intersection of any number of convex sets is convex¹⁵. Consider the set S:

S = {x ∈ ℜm | |p(t)| ≤ 1 for |t| ≤ π/3}    (4.27)

where

p(t) = x1 cos t + x2 cos 2t + . . . + xm cos mt    (4.28)

Any value of t that satisfies |p(t)| ≤ 1 defines two regions, viz., the halfspaces {x | p(t) ≤ 1} and {x | p(t) ≥ −1}, whose intersection ℜ(t) is a convex slab. Since

S = ∩_{|t|≤π/3} ℜ(t)

is an intersection of convex sets, S is convex.

15 Exercise: Prove.
Figure 4.35: Illustration of the closure property for S defined in (4.27), for
m = 2.
Affine transform

An affine transform is one that preserves:

• Collinearity between points, i.e., three points which lie on a line continue to be collinear after the transformation.

• Ratios of distances along a line, i.e., for distinct collinear points p1, p2, p3, the ratio ||p2 − p1||/||p3 − p2|| is preserved.

An affine transform f : ℜn → ℜm has the form

x ↦ Ax + b

where A ∈ ℜm×n and b ∈ ℜm. In the finite-dimensional case each affine transformation is given by a matrix A and a vector b.
The image and pre-image of convex sets under an affine transformation defined as

f(x) = Σ_{i=1}^{n} xi ai + b

yield convex sets¹⁶. Here ai is the ith column of A. The following are examples of convex sets that are either images or inverse images of convex sets under affine transformations:
{x ∈ ℜn | x1A1 + . . . + xnAn ⪯ B}
16 Exercise: Prove.
f : ℜn → ℜm such that

f(x) = (Ax + b)/(cᵀx + d),  dom f = {x | cᵀx + d > 0}    (4.30)
The images and inverse images of convex sets under perspective and linear-
fractional functions are convex18 .
Consider the linear-fractional function f(x) = x/(x1 + x2 + 1); the image of a convex set under this linear-fractional function f is again a convex set, as is its inverse image.
17 With respect to the generalized inequality ⪯K with K = Sⁿ₊.
18 Exercise: Prove.
The two equations imply that f(z) < f(x), which contradicts our assumption that x corresponds to a point of local minimum. That is, f cannot have a point of local minimum that does not coincide with the point y of global minimum. □
Since any locally minimum point for a convex function also corresponds to its global minimum, we will drop the qualifiers 'locally' as well as 'globally' while referring to the points corresponding to minimum values of a convex function. For any strictly convex function, the point corresponding to the global minimum is also unique, as stated in the following theorem.
f̃(x) = { f(x)   if x ∈ D
       { ∞      if x ∉ D    (4.34)
In what follows, we will assume if necessary, that all convex functions are
implicitly extended to the domain ℜn . A useful technique for verifying the
convexity of a function is to investigate its convexity, by restricting the function
to a line and checking for the convexity of a function of single variable. This
technique is hinged on the following theorem.
Theorem 71 A function f : D → ℜ is (strictly) convex if and only if the
function φ : Dφ → ℜ defined below, is (strictly) convex in t for every x ∈ ℜn
and for every h ∈ ℜn
φ(t) = f (x + th)
with the domain of φ given by Dφ = {t|x + th ∈ D}.
Proof: We will prove the necessity and sufficiency of the convexity of φ for a
convex function f . The proof for necessity and sufficiency of the strict convexity
of φ for a strictly convex f is very similar and is left as an exercise.
Proof of Necessity: Assume that f is convex. We need to prove that φ(t) = f(x + th) is also convex. Let t1, t2 ∈ Dφ and θ ∈ [0, 1]. Then,

φ(θt1 + (1 − θ)t2) = f(x + (θt1 + (1 − θ)t2)h) = f(θ(x + t1h) + (1 − θ)(x + t2h)) ≤ θf(x + t1h) + (1 − θ)f(x + t2h) = θφ(t1) + (1 − θ)φ(t2)

Thus, φ is convex.
Proof of Sufficiency: Assume that for every h ∈ ℜn and every x ∈ ℜn, φ(t) = f(x + th) is convex. We will prove that f is convex. Let x1, x2 ∈ D. Take x = x1 and h = x2 − x1. We know that φ(t) = f(x1 + t(x2 − x1)) is convex, with φ(1) = f(x2) and φ(0) = f(x1). Therefore, for any θ ∈ [0, 1],

f(θx2 + (1 − θ)x1) = φ(θ) ≤ θφ(1) + (1 − θ)φ(0) = θf(x2) + (1 − θ)f(x1)

which proves that f is convex. □
In some sense, the epigraph of f, epi(f) = {(x, t) | x ∈ D, f(x) ≤ t}, is the subset of ℜn+1 consisting of points lying above the graph of f. Similarly, the hypograph of f is the subset of ℜn+1 lying below the graph of f, defined by hyp(f) = {(x, t) | x ∈ D, t ≤ f(x)}.
Proof: Let f be convex. For any (x1 , α1 ) ∈ epi(f ) and (x2 , α2 ) ∈ epi(f ) and
any θ ∈ (0, 1),
is not empty and x0 ∈ interior(C). Let M = max_{1≤i≤n+1} f(xi). Then, for any x = Σ_{i=1}^{n+1} ai xi ∈ C,

f(x) = f(Σ_{i=1}^{n+1} ai xi) ≤ Σ_{i=1}^{n+1} ai f(xi) ≤ M
Since f is convex on C,

f(x0) ≤ (1/(1 + θ)) f(x0 + θh) + (θ/(1 + θ)) f(x0 − h)

From this, we can conclude that |f(x0 + θh) − f(x0)| ≤ θ|f(x0) − M|. For a given ε > 0, select δ′ ≤ δ such that δ′|f(x0) − M| ≤ εδ. Then d = θh with ||h|| = δ implies that x0 + d ∈ B(x0, δ) and |f(x0 + d) − f(x0)| ≤ ε. This proves the theorem. □
Analogous to the definition of increasing functions introduced on page num-
ber 220, we next introduce the concept of monotonic functions. This concept is
very useful for characterization of a convex function.
A function f : D → ℜn is said to be monotone on D if for all x1, x2 ∈ D,

(f(x1) − f(x2))ᵀ(x1 − x2) ≥ 0    (4.41)

strictly monotone if for all x1 ≠ x2 ∈ D,

(f(x1) − f(x2))ᵀ(x1 − x2) > 0    (4.42)

and uniformly (or strongly) monotone if there exists a c > 0 such that for all x1, x2 ∈ D,

(f(x1) − f(x2))ᵀ(x1 − x2) ≥ c||x1 − x2||²    (4.43)
f(y) ≥ f(x) + ∇ᵀf(x)(y − x) + (1/2) c||y − x||²    (4.46)
Proof:
Sufficiency: The proof of sufficiency is very similar for all the three state-
ments of the theorem. So we will prove only for statement (4.44). Suppose
(4.44) holds. Consider x1 , x2 ∈ D and any θ ∈ (0, 1). Let x = θx1 + (1 − θ)x2 .
Then,
which proves that f(x) is a convex function. In the case of strict convexity, strict inequality holds in (4.47) and it follows through. In the case of strong convexity, we need to additionally prove that

θ (1/2) c||x − x1||² + (1 − θ) (1/2) c||x − x2||² = (1/2) cθ(1 − θ)||x2 − x1||²
for some x2 ≠ x1. Because f is strictly convex, for any θ ∈ (0, 1) we can write

f(x2 + θ(x1 − x2)) = f(θx1 + (1 − θ)x2) < θf(x1) + (1 − θ)f(x2)    (4.48)

Since (4.44) is already proved for convex functions, we use it in conjunction with (4.48) and (4.49) to get

f(x2) + θ∇ᵀf(x2)(x1 − x2) ≤ f(x2 + θ(x1 − x2)) < f(x2) + θ∇ᵀf(x2)(x1 − x2)

which is a contradiction. Thus, equality can never hold in (4.44) for any x1 ≠ x2. This proves the necessity of (4.45). □
The geometrical interpretation of theorem 75 is that at any point, the linear
approximation based on a local derivative gives a lower estimate of the function,
i.e. the convex function always lies above the supporting hyperplane at that
point. This is pictorially depicted in Figure 4.38. There are some implications
of theorem 75 for strongly convex functions. We state them next.
f(y) ≥ f(x) − (1/2c)||∇f(x)||₂²    (4.50)

Since this holds for any y ∈ D, we have

min_{y∈D} f(y) ≥ f(x) − (1/2c)||∇f(x)||₂²    (4.51)

□

Similarly, if ŷ is the point at which f attains its minimum,

||x − ŷ||₂ ≤ (2/c)||∇f(x)||₂    (4.52)
Theorem 75 motivates the definition of the subgradient for non-differentiable
convex functions, which has properties very similar to the gradient vector.
Definition 41 [Subgradient]: Let f : D → ℜ be a convex function defined
on a convex set D. A vector h ∈ ℜn is said to be a subgradient of f at the
point x ∈ D if
f (y) ≥ f (x) + hT (y − x)
for all y ∈ D. The set of all such vectors is called the subdifferential of f
at x.
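For instance, for f(x) = |x|, every h ∈ [−1, 1] is a subgradient at x = 0, so the subdifferential at 0 is the whole interval [−1, 1]. A quick numerical check of the subgradient inequality (a sketch over an arbitrary grid of test points):

# check f(y) >= f(0) + h * (y - 0) for f(x) = |x| and several h in [-1, 1]
ys = [i / 10 for i in range(-20, 21)]
for h in (-1.0, -0.5, 0.0, 0.5, 1.0):
    assert all(abs(y) >= h * y for y in ys)
print("subgradient inequality holds for every tested h and y")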
For a differentiable convex function, the gradient at point x is the only subgradi-
ent at that point. Most properties of differentiable convex functions that hold in
terms of the gradient also hold in terms of the subgradient for non-differentiable
convex functions. Theorem 75 gives a very simple optimality criterion for a dif-
ferentiable function f .
Theorem 76 Let f : D → ℜ be a convex function defined on a convex set D.
A point x ∈ D corresponds to a minimum if and only if
∇T f (x)(y − x) ≥ 0
for all y ∈ D.
f(y) ≥ f(x)

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0    (4.53)

(∇f(x) − ∇f(y))ᵀ(x − y) > 0    (4.54)

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ c||x − y||²    (4.55)
Proof:
Necessity: Suppose f is uniformly convex on D. Then from theorem 75, we know that for any x, y ∈ D,

f(y) ≥ f(x) + ∇ᵀf(x)(y − x) + (1/2) c||y − x||²
f(x) ≥ f(y) + ∇ᵀf(y)(x − y) + (1/2) c||x − y||²

Adding the two inequalities, we get (4.55). If f is convex, the inequalities hold with c = 0, yielding (4.53). If f is strictly convex, the inequalities will be strict, yielding (4.54).
Sufficiency: Suppose ∇f is monotone. For any fixed x, y ∈ D, consider the function φ(t) = f(x + t(y − x)). By the mean value theorem applied to φ(t), we should have for some t ∈ (0, 1),

φ(1) − φ(0) = φ′(t)    (4.57)

Letting z = x + t(y − x), so that φ′(t) = ∇ᵀf(z)(y − x), monotonicity of ∇f gives

(∇f(z) − ∇f(x))ᵀ(y − x) = (1/t)(∇f(z) − ∇f(x))ᵀ(z − x) ≥ 0    (4.58)
Combining (4.57) with (4.58), we get

f(y) − f(x) = (∇f(z) − ∇f(x))ᵀ(y − x) + ∇ᵀf(x)(y − x) ≥ ∇ᵀf(x)(y − x)    (4.59)
By theorem 75, this inequality proves that f is convex. Strict convexity can be similarly proved by using the strict inequality in (4.58) inherited from strict monotonicity, and letting the strict inequality follow through to (4.59). For the case of strong convexity, from (4.55), we have

φ′(t) − φ′(0) = (∇f(z) − ∇f(x))ᵀ(y − x) = (1/t)(∇f(z) − ∇f(x))ᵀ(z − x) ≥ (1/t) c||z − x||² = ct||y − x||²    (4.60)
Therefore,

φ(1) − φ(0) − φ′(0) = ∫₀¹ [φ′(t) − φ′(0)] dt ≥ (1/2) c||y − x||²    (4.61)
which translates to

f(y) ≥ f(x) + ∇ᵀf(x)(y − x) + (1/2) c||y − x||²

By theorem 75, f must be strongly convex. □
2. is strictly convex if its domain is convex and its Hessian matrix is positive definite at each point in D. That is,

∇²f(x) ≻ 0, ∀x ∈ D

3. is uniformly convex if and only if its domain is convex and its Hessian matrix is uniformly positive definite at each point in D. That is, for any v ∈ ℜn and any x ∈ D, there exists a c > 0 such that

vᵀ∇²f(x)v ≥ c||v||²

In other words,

∇²f(x) ⪰ cIn×n

where In×n is the n × n identity matrix and ⪰ corresponds to the positive semidefinite inequality. That is, the function f is strongly convex iff ∇²f(x) − cIn×n is positive semidefinite, for all x ∈ D and for some constant c > 0, which corresponds to the positive minimum curvature of f.
Proof: We will prove only the first statement in the theorem; the other two statements are proved in a similar manner.
Necessity: Suppose f is a convex function, and consider a point x ∈ D. We will prove that for any h ∈ ℜn, hᵀ∇²f(x)h ≥ 0. Since f is convex, by theorem 75, we have f(x + th) ≥ f(x) + t∇ᵀf(x)h; combining this with the second-order Taylor expansion of f around x gives

(t²/2) hᵀ∇²f(x)h + O(t³) ≥ 0

Dividing by t² and taking limits as t → 0, we get

hᵀ∇²f(x)h ≥ 0
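In computational practice, this Hessian criterion is checked via eigenvalues. A minimal sketch (assuming numpy, with illustrative data) for the quadratic f(x) = (1/2)xᵀQx, whose Hessian is the constant symmetric matrix Q:

import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # Hessian of f(x) = 0.5 x^T Q x, everywhere
eigs = np.linalg.eigvalsh(Q)      # eigenvalues of the symmetric Hessian
print(eigs)                       # both positive here
if np.all(eigs > 0):
    print("strictly (indeed strongly) convex, with c = smallest eigenvalue")
elif np.all(eigs >= 0):
    print("convex")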
min_{x∈X} cᵀx    (4.66)
minimize    f(x)
subject to  gi(x) ≤ 0, i = 1, . . . , m    (4.67)
            Ax = b
variable    x = (x1, . . . , xn)

min_{x=(t,u)∈X} t    (4.68)
X = {(t, u) | f(u) ≤ t, g1(u) ≤ 0, g2(u) ≤ 0, . . . , gm(u) ≤ 0}
The set X is convex, and hence the problem in (4.68) is a convex optimization problem. Further, every locally optimal point is also globally optimal. The computation time of algorithms for solving convex optimization problems is roughly proportional to max{n³, n²m, C}, where C is the cost of evaluating f, the gi's and their first and second derivatives. There are many reliable and efficient algorithms for solving convex optimization problems. However, it is often difficult to recognize convex optimization problems in practice.
Examples

Consider the optimization problem

minimize    x1² + x2²
subject to  g1(x) = x1/(1 + x2²) ≤ 0    (4.69)
            h(x) = (x1 + x2)² = 0

We note that the optimization problem above is not a convex problem according to our definition, since g1 is not convex and h is not affine. However, we note that the feasible set {(x1, x2) | x1 = −x2, x1 ≤ 0} is convex (recall that the converse of theorem 73 does not hold - the 0-sublevel set of a non-convex function can be convex). This problem can be posed as an equivalent (but not identical) convex optimization problem:

minimize    x1² + x2²
subject to  x1 ≤ 0    (4.70)
            x1 + x2 = 0
F (x) = xT Ax − xT b (4.71)
min_{y∈ℜn} (1/2) yᵀBy
subject to Aᵀy = b    (4.73)
where y ∈ ℜn, A is an n × m matrix, B is an n × n matrix and b is a vector of size m. To handle constrained minimization, let us consider minimization of the modified objective function L(y, λ) = (1/2)yᵀBy + λᵀ(Aᵀy − b):
min_{y∈ℜn, λ∈ℜm} (1/2) yᵀBy + λᵀ(Aᵀy − b)    (4.74)
The function L(y, λ) is called the Lagrangian and involves the Lagrange multiplier λ ∈ ℜm. A sufficient condition for optimality of L(y, λ) at a point (y*, λ*) is that ∇L(y*, λ*) = 0 and ∇²L(y*, λ*) ≻ 0. For this particular problem:
" # " #
∗ ∗ By∗ + Aλ∗ 0
∇L(y , λ ) = =
AT y∗ − b 0
and
" #
2 ∗ ∗ B A
∇ L(y , λ ) = ≻0
AT 0
The point (y*, λ*) must therefore satisfy Aᵀy* = b and Aλ* = −By*. If B is taken to be the identity matrix, n = 2 and m = 1, the minimization problem (4.73) amounts to finding a point y* on a line a11y1 + a12y2 = b that is closest to the origin. From geometry, we know that the point on a line closest to the origin is the point of intersection p* of a perpendicular from the origin to the line. On the other hand, the solution for the minimum of (4.74), for these conditions, coincides with p* and is given by:

y1 = a11b/((a11)² + (a12)²)
y2 = a12b/((a11)² + (a12)²)
That is, for n = 2 and m = 1, the solution to (4.74) is the same as the solution to (4.72). Can this construction always be used to find optimal solutions to a minimization problem? We will answer this question by first motivating the concept of Lagrange multipliers; in Section 4.4.2, we will formalize the Lagrangian dual.
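The stationarity conditions derived above form one linear system in (y, λ), which can be solved directly; a minimal sketch (assuming numpy, with B = I, n = 2, m = 1 and illustrative line coefficients):

import numpy as np

B = np.eye(2)
A = np.array([[3.0], [4.0]])     # the line 3 y1 + 4 y2 = b
b = np.array([10.0])

# solve [B  A; A^T  0] [y; lam] = [0; b]
K = np.block([[B, A], [A.T, np.zeros((1, 1))]])
rhs = np.concatenate([np.zeros(2), b])
sol = np.linalg.solve(K, rhs)
y, lam = sol[:2], sol[2:]
print(y)             # [1.2, 1.6], the closest point on the line to the origin
print(A.T @ y - b)   # constraint residual, approximately 0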
min_{x∈D} f(x)
subject to gi(x) = 0, i = 1, 2, . . . , m    (4.75)
However, this is not possible for general constraints. The method of lagrange
multipliers presents an indirect approach to solving this problem.
Consider a schematic representation of the problem in (4.75) with a single
constraint, i.e., m = 1 in Figure 4.39. The figure shows some level curves of the
function f . The constraint function g1 is also plotted with dotted lines in the
same figure. The gradient of the constraint ∇g1 is not parallel to the gradient
∇f of the function21 at f = 10.4; it is therefore possible to move along the
constraint surface so as to further reduce f . However, as shown in Figure 4.39,
∇g1 and ∇f are parallel at f = 10.3, and any motion along g1(x) = 0 will not reduce f any further.

21 Note that the (negative) gradient at a point is orthogonal to the contour line going through that point.
Figure 4.39: At any non-optimal and non-saddle point of the equality con-
strained problem, the gradient of the constraint will not be parallel to that of
the function.
The solutions to (4.76) are the stationary points of the lagrangian L; they are not
necessarily local extrema of L. L is unbounded: given a point x that doesn’t lie
on the constraint, letting λ → ±∞ makes L arbitrarily large or small. However,
under certain stronger assumptions, as we shall see in Section 4.4.2, if the strong
Lagrangian principle holds, the minima of f minimize the Lagrangian globally.
We will extend the necessary condition for optimality of a minimization problem with a single constraint to minimization problems with multiple equality constraints (i.e., m > 1 in (4.75)). Let S be the subspace spanned by ∇gi(x)
at any point x and let S⊥ be its orthogonal complement. Let (∇f )⊥ be the
component of ∇f in the subspace S⊥ . At any solution x∗ , it must be true that
the gradient of f has (∇f )⊥ = 0 (i.e., no components that are perpendicular to
all of the ∇gi ), because otherwise you could move x∗ a little in that direction
(or in the opposite direction) to increase (decrease) f without changing any
of the gi , i.e. without violating any constraints. Hence for multiple equality
constraints, it must be true that at the solution x∗ , the space S contains the
Figure 4.40: At the equality constrained optimum, the gradient of the constraint
must be parallel to that of the function.
vector ∇f, i.e., there are some constants λi such that ∇f(x*) = Σ_{i=1}^{m} λi∇gi(x*). We also need to impose that the solution is on the correct constraint surface (i.e., gi = 0, ∀i). In the same manner as in the case of m = 1, this can be encapsulated by introducing the Lagrangian L(x, λ) = f(x) − Σ_{i=1}^{m} λi gi(x), whose gradient with respect to both x and λ vanishes at the solution.
This gives us the following necessary condition for optimality of (4.75):

∇L(x*, λ*) = ∇(f(x) − Σ_{i=1}^{m} λi gi(x)) = 0    (4.77)
Figure 4.41: At the inequality constrained optimum, the gradient of the con-
straint must be parallel to that of the function.
min_{x∈D} f(x)
subject to gi(x) ≤ 0, i = 1, 2, . . . , m    (4.78)
With multiple inequality constraints, for constraints that are active, as in the case of multiple equality constraints, ∇f must lie in the space spanned by the ∇gi's, and if the Lagrangian is L = f + Σ_{i=1}^{m} λi gi, then we must additionally have λi ≥ 0, ∀i (since otherwise f could be reduced by moving into the feasible region). As for an inactive constraint gj (gj < 0), removing gj from L makes no difference and we can drop ∇gj from ∇f = −Σ_{i=1}^{m} λi∇gi, or equivalently set λj = 0. Thus, the above KKT condition generalizes to λi gi(x*) = 0, ∀i. The necessary condition for optimality of (4.78) is summarily given as
∇L(x*, λ*) = ∇(f(x) + Σ_{i=1}^{m} λi gi(x)) = 0

∀i, λi gi(x*) = 0, λi ≥ 0    (4.79)
A simple and often useful trick called the free constraint gambit is to solve
ignoring one or more of the constraints, and then check that the solution satisfies
those constraints, in which case you have solved the problem.
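The conditions in (4.79) are easy to verify on a small instance; the following sketch (with an assumed illustrative problem: minimize x1² + x2² subject to g(x) = 1 − x1 − x2 ≤ 0) checks stationarity, dual feasibility and complementary slackness at the known solution:

import numpy as np

x_star = np.array([0.5, 0.5])
lam = 1.0

grad_f = 2 * x_star                # gradient of the objective at x*
grad_g = np.array([-1.0, -1.0])    # gradient of the constraint g
g_val = 1 - x_star.sum()           # g(x*) = 0: the constraint is active

print(np.allclose(grad_f + lam * grad_g, 0))  # stationarity
print(lam >= 0)                               # dual feasibility
print(np.isclose(lam * g_val, 0))             # complementary slackness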
min_{x∈D} f(x)
subject to gi(x) ≤ 0, i = 1, 2, . . . , m    (4.80)
There are three simple and straightforward steps in forming a dual problem.
1. The first step involves forming the Lagrange function by associating a price λi, called a Lagrange multiplier, with the constraint involving gi:

L(x, λ) = f(x) + Σ_{i=1}^{m} λi gi(x) = f(x) + λᵀg(x)
2. The second step is the construction of the dual function L*(λ), which is defined as:

L*(λ) = min_{x∈D} L(x, λ) = min_{x∈D} (f(x) + λᵀg(x))
What makes the theory of duality constructive is when we can solve for
L∗ efficiently - either in a closed form or some other ‘simple’ mechanism.
If L∗ is not easy to evaluate, the duality theory will be less useful.
3. We finally define the dual problem:

   max_{λ∈ℜᵐ} L∗(λ)
   subject to λ ≥ 0        (4.81)
As an example, consider the standard LP:

min_{x∈ℜⁿ} cᵀx
subject to −Ax + b ≤ 0

The first step is to form the Lagrangian L(x, λ) = cᵀx + λᵀ(b − Ax) = bᵀλ + (c − Aᵀλ)ᵀx.
The next step is to get L∗, which we obtain using the first derivative test:

L∗(λ) = min_{x∈ℜⁿ} bᵀλ + (c − Aᵀλ)ᵀx = { bᵀλ   if Aᵀλ = c
                                          { −∞    if Aᵀλ ≠ c
max_{λ∈ℜᵐ} bᵀλ
subject to Aᵀλ = c        (4.82)
           λ ≥ 0
This is the dual of the standard LP. What if the original LP was the
following?
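To make the primal-dual pair (4.82) concrete, here is a minimal numerical check, a sketch assuming the scipy library is available; the data A, b, c below are hypothetical illustration values. Weak duality guarantees the dual optimum never exceeds the primal one, and for feasible LPs the two coincide.

```python
import numpy as np
from scipy.optimize import linprog

# Primal: min c^T x  s.t.  Ax >= b  (written as -Ax <= -b, x free).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.0])
c = np.array([2.0, 3.0])
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

# Dual (4.82): max b^T lam  s.t.  A^T lam = c, lam >= 0 (negated, since linprog minimizes).
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)

print(primal.fun, -dual.fun)   # both print 5.0: the optimal values coincide here
```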
min_{x∈ℜⁿ} cᵀx
subject to −Ax + b ≤ 0,  x ≥ 0

As another example, consider the following problem, in which a logarithmic term is added to the objective:

min_{x∈ℜⁿ} cᵀx − Σ_{i=1}^{n} ln xi
subject to −Ax + b = 0
           x > 0
The domain (or ground set) for this problem is x > 0, which is open.
The expression for L∗ can be obtained using the first derivative test, while keeping in mind that L can be made arbitrarily small (tending to −∞) unless (c − Aᵀλ) > 0. This is because, even if one component of c − Aᵀλ is less than or equal to zero, the value of L can be made arbitrarily small by decreasing the value of the corresponding component of x in the Σ_{i=1}^{n} ln xi part. Further, the sum bᵀλ + (c − Aᵀλ)ᵀx − Σ_{i=1}^{n} ln xi can be separated out into the individual components xi, and this can be exploited while determining the critical point of L.
L∗(λ) = min_{x>0} L(x, λ) = { bᵀλ + n + Σ_{i=1}^{n} ln 1/(c − Aᵀλ)i   if (c − Aᵀλ) > 0
                             { −∞                                      otherwise

The dual problem is then

max_{λ∈ℜᵐ} bᵀλ + n + Σ_{i=1}^{n} ln 1/(c − Aᵀλ)i
As noted earlier, the theory of duality remains just a theory unless the dual lends itself to some constructive evaluation; the dual is not always in a useful form.
The following Weak duality theorem states an important relationship be-
tween solutions to the primal (4.80) and the dual (4.81) problems.
Theorem 81 If p∗ ∈ ℜ is the solution to the primal problem in (4.80) and d∗ ∈ ℜ is the solution to the dual problem in (4.81), then

p∗ ≥ d∗
In general, if x̂ is any feasible solution to the primal problem (4.80) and λ̂ is a feasible solution to the dual problem (4.81), then

f(x̂) ≥ L∗(λ̂)

Proof: If x̂ is a feasible solution to the primal problem (4.80) and λ̂ is a feasible solution to the dual problem, then

f(x̂) ≥ f(x̂) + λ̂ᵀg(x̂) ≥ min_{x∈D} f(x) + λ̂ᵀg(x) = L∗(λ̂)

This proves the second part of the theorem. A direct consequence of this is that p∗ ≥ d∗. □
The weak duality theorem has some important implications. If the primal
problem is unbounded below, that is, p∗ = −∞, we must have d∗ = −∞, which
means that the Lagrange dual problem is infeasible. Conversely, if the dual
problem is unbounded above, that is, d∗ = ∞, we must have p∗ = ∞, which
is equivalent to saying that the primal problem is infeasible. The difference p∗ − d∗ is called the duality gap.
In many hard combinatorial optimization problems with duality gaps, we get good dual solutions, which guarantee that we are within some k% of the optimal solution to the primal, for satisfactorily low values of k. This is one of the powerful uses of duality theory: constructing bounds for optimization problems.
Under what conditions can one assert that d∗ = p∗? The condition d∗ = p∗ is called strong duality and it does not hold in general. It usually holds for convex problems, but there are exceptions; one of the most typical is the semi-definite optimization problem. The semi-definite program (SDP) is defined, with the linear matrix inequality constraint (c.f. page 262), as follows:

min_{x∈ℜⁿ} cᵀx
subject to x₁A₁ + . . . + xₙAₙ + G ⪯ 0        (4.83)
           Ax = b
Sufficient conditions for strong duality in convex problems are called constraint qualifications. One of the most useful sufficient conditions for strong duality is Slater's constraint qualification, stated for the convex problem

min_{x∈D} f(x)
subject to gi(x) ≤ 0, i = 1, . . . , m
           Ax = b        (4.84)
variable x = (x₁, . . . , xₙ)

Slater's condition requires the existence of a strictly feasible point, that is, a point x in the relative interior of D with gi(x) < 0 for all i and Ax = b. However, if any of the gi's are linear, they do not need to hold with strict inequality.
Table 4.4 summarizes some optimization problems, their duals and conditions for strong duality. Strong duality also holds for some nonconvex problems.

Table 4.4: Some optimization problems, their dual functions and strong duality conditions.
• Linear program: minimize cᵀx subject to Ax ≤ b. Dual function L∗(λ) = −bᵀλ, with dual constraints Aᵀλ + c = 0 and λ ≥ 0. Strong duality holds when the primal and dual are feasible.
• Quadratic program with Q ∈ Sⁿ₊₊: minimize ½xᵀQx + cᵀx subject to Ax ≤ b. Dual function L∗(λ) = −½(c − Aᵀλ)ᵀQ⁻¹(c − Aᵀλ) + bᵀλ, with dual constraint λ ≥ 0. Strong duality always holds.
• Entropy maximization: minimize Σ_{i=1}^{n} xi ln xi subject to Ax ≤ b and xᵀ1 = 1. Dual function L∗(λ, µ) = −bᵀλ − µ − e^{−µ−1} Σ_{i=1}^{n} e^{−aiᵀλ}, where ai is the iᵗʰ column of A, with dual constraint λ ≥ 0. Strong duality holds if the primal constraints are satisfied.
Figure 4.42: Example of the set I for a single constraint (i.e., for m = 1).

Define the set I ⊆ ℜᵐ × ℜ as the set of points (s, z) such that, for some x ∈ D, s ≥ g(x) (componentwise) and z ≥ f(x); these are points that lie to the right of and above the point (g₁(x), . . . , gₘ(x), f(x)). An example set I is shown in Figure 4.42. It turns out that all the intuitions we need are in two dimensions, which makes it fairly convenient to understand the idea. It is straightforward to prove that if the objective function f(x) is convex and each of the constraints gi(x), 1 ≤ i ≤ m, is a convex function, then I must be a convex set. Since the feasible region for the primal problem (4.78) corresponds to the region in I with s ≤ 0, and since all points above and to the right of a point in I also belong to I, the solution to the primal problem corresponds to the point in I with s = 0 and the least possible value of z. For example, in Figure 4.42, the solution to the primal corresponds to (0, δ₁).
Let us define a hyperplane H_{λ,α}, parametrized by λ ∈ ℜᵐ and α ∈ ℜ, as

H_{λ,α} = {(s, z) | λᵀs + z = α}
Consider all hyperplanes that lie below I. For example, in the Figure 4.42,
both hyperplanes Hλ1 ,α1 and Hλ2 ,α2 lie below the set I. Of all hyperplanes
that lie below I, consider the one whose intersection with the line s = 0 corresponds to as high a value of z as possible. This hyperplane must be a supporting hyperplane. Incidentally, H_{λ₁,α₁} happens to be such a supporting hyperplane. Its point of intersection (0, α₁) precisely corresponds to the solution to the dual problem. Let us derive this statement formally after setting up some more notation.
We will define two half-spaces corresponding to H_{λ,α}:

H⁺_{λ,α} = {(s, z) | λᵀs + z ≥ α}
H⁻_{λ,α} = {(s, z) | λᵀs + z ≤ α}

Let us define another set L as

L = {(s, z) | s = 0}
Note that L is essentially the z (or function) axis. The intersection of H_{λ,α} with L is the point (0, α). That is,

(0, α) = L ∩ H_{λ,α}
We would like to manipulate λ and α so that the set I lies in the half-space H⁺_{λ,α} as tightly as possible. Mathematically, we are interested in the problem of maximizing the height of the point of intersection of L with H_{λ,α} above the s = 0 plane, while ensuring that I remains a subset of H⁺_{λ,α}:

max α
subject to H⁺_{λ,α} ⊇ I
By the definitions of I and H⁺_{λ,α} and the subset relation, this problem is equivalent to

max α
subject to λᵀs + z ≥ α  ∀(s, z) ∈ I
Now notice that if (s, z) ∈ I, then (s′, z) ∈ I for all s′ ≥ s. This was also illustrated in Figure 4.42. Thus, we cannot afford to have any component of λ negative; if any of the λi's were negative, we could crank up si arbitrarily to violate the inequality λᵀs + z ≥ α. Thus, we can add the constraint λ ≥ 0 to the above problem without changing the solution:

max α
subject to λᵀs + z ≥ α  ∀(s, z) ∈ I
           λ ≥ 0
Any equality constraint h(x) = 0 can be expressed using two inequality constraints, viz., h(x) ≤ 0 and −h(x) ≤ 0. This problem can again be proved to be equivalent to the following problem, using the definition of I, or equivalently, the fact that every point on ∂I must be of the form (g₁(x), g₂(x), . . . , gₘ(x), f(x)) for some x ∈ D:

max α
subject to λᵀg(x) + f(x) ≥ α  ∀x ∈ D
           λ ≥ 0
We remind the reader at this point that L(x, λ) = λᵀg(x) + f(x). The above problem is therefore the same as

max α
subject to L(x, λ) ≥ α  ∀x ∈ D
           λ ≥ 0

Since L∗(λ) = min_{x∈D} L(x, λ), we can deal with the equivalent problem

max α
subject to L∗(λ) ≥ α
           λ ≥ 0

At the optimum, α = L∗(λ), so this is precisely the dual problem

max L∗(λ)
subject to λ ≥ 0
Figure 4.43: Example of the convex set I for a single constrained semi-definite
program.
Figure 4.44: Example of the convex set I for a single constrained well-behaved
convex program.
Consider now the general problem with both inequality and equality constraints:

min_{x∈D} f(x)
subject to gi(x) ≤ 0, i = 1, . . . , m
           hj(x) = 0, j = 1, . . . , p        (4.85)
variable x = (x₁, . . . , xₙ)
Suppose that the primal and dual optimal values for the above problem are attained and equal, that is, strong duality holds. Let x̂ be a primal optimal and (λ̂, µ̂) be a dual optimal point (λ̂ ∈ ℜᵐ, µ̂ ∈ ℜᵖ). Thus,

f(x̂) = L∗(λ̂, µ̂)
      = min_{x∈D} f(x) + λ̂ᵀg(x) + µ̂ᵀh(x)
      ≤ f(x̂) + λ̂ᵀg(x̂) + µ̂ᵀh(x̂)
      ≤ f(x̂)

The last inequality follows from the fact that λ̂ ≥ 0, g(x̂) ≤ 0, and h(x̂) = 0.
We can therefore conclude that the two inequalities in this chain must hold with
equality. Some of the conclusions that we can draw from this chain of equalities
are
1. That x̂ is a minimizer of L(x, λ̂, µ̂) over x ∈ D. In particular, if the functions f, g₁, g₂, . . . , gₘ and h₁, h₂, . . . , hₚ are differentiable (and therefore have open domains), the gradient of L(x, λ̂, µ̂) must vanish at x̂, since any point of global optimum must be a point of local optimum. That is,

   ∇f(x̂) + Σ_{i=1}^{m} λ̂i ∇gi(x̂) + Σ_{j=1}^{p} µ̂j ∇hj(x̂) = 0        (4.86)
2. That

   λ̂ᵀg(x̂) = Σ_{i=1}^{m} λ̂i gi(x̂) = 0

   Since each term in this sum is nonpositive, we conclude that

   λ̂i gi(x̂) = 0  for i = 1, 2, . . . , m        (4.87)

Putting these conclusions together with primal and dual feasibility, we obtain the Karush-Kuhn-Tucker (KKT) conditions:

(1) ∇f(x̂) + Σ_{i=1}^{m} λ̂i ∇gi(x̂) + Σ_{j=1}^{p} µ̂j ∇hj(x̂) = 0
(2) gi(x̂) ≤ 0, i = 1, 2, . . . , m
(3) λ̂i ≥ 0, i = 1, 2, . . . , m        (4.88)
(4) λ̂i gi(x̂) = 0, i = 1, 2, . . . , m
(5) hj(x̂) = 0, j = 1, 2, . . . , p
When the primal problem is convex, the KKT conditions are also sufficient for the points to be primal and dual optimal with zero duality gap. If f is convex, the gi are convex and the hj are affine, the primal problem is convex and consequently, the KKT conditions are sufficient conditions for zero duality gap.

Theorem 82 If the function f is convex, the gi are convex and the hj are affine, then the KKT conditions in (4.88) are necessary and sufficient conditions for a zero duality gap.
Proof: The necessity part has already been proved; here we only prove the sufficiency part. The conditions (2) and (5) in (4.88) ensure that x̂ is primal feasible. Since λ̂ ≥ 0, L(x, λ̂, µ̂) is convex in x. Based on condition (1) in (4.88) and theorem 77, we can infer that x̂ minimizes L(x, λ̂, µ̂). We can thus conclude that

L∗(λ̂, µ̂) = f(x̂) + λ̂ᵀg(x̂) + µ̂ᵀh(x̂) = f(x̂)

where the last equality follows from conditions (4) and (5). □
We will now study some algorithms for solving convex problems. These techniques are relevant for most convex optimization problems that do not lend themselves to closed form solutions. We will start with unconstrained minimization.

4.5 Algorithms for Unconstrained Minimization

Recall that the goal in unconstrained minimization is to solve the convex problem

min_{x∈D} f(x)
The incremental step is determined while ensuring that f (x(k+1) ) < f (x(k) ).
We assume that we are dealing with the extended value extension of the convex
function f (c.f. definition 36), which returns ∞ for any point outside its domain.
However, if we do so, we need to make sure that the initial point indeed lies in
the domain D.
A single iteration of the general descent algorithm (shown in Figure 4.45) consists of two main steps, viz., determining a good descent direction ∆x^(k), which is typically forced to have unit norm, and determining the step size using some line search technique. If the function f is convex and we require that f(x^(k+1)) < f(x^(k)), then we must have ∇ᵀf(x^(k))(x^(k+1) − x^(k)) < 0. This can be seen from the necessary and sufficient condition for convexity stated in equation (4.44) within Section 4.2.9.
There are many different empirical techniques for ray search, though it mat-
ters much less than the search for the descent direction. These techniques reduce
the n−dimensional problem to a 1−dimensional problem, which can be easy to
solve by use of plotting and eyeballing or even exact search.
1. Exact ray search: The exact ray search seeks a scaling factor t that satisfies

   t = argmin_{t≥0} f(x + t∆x)
2. Backtracking ray search: The exact line search may not be feasible or could be expensive to compute for complex non-linear functions. A relatively simpler ray search iterates over values of the step size, starting from 1 and scaling it down by a factor of β ∈ (0, 1/2) after every iteration, till the following condition, called the Armijo condition, is satisfied for some 0 < c₁ < 1:

   f(x + t∆x) ≤ f(x) + c₁ t ∇ᵀf(x)∆x        (4.90)

   A second condition on the slope is often imposed as well:

   ∆xᵀ∇f(x + t∆x) ≤ c₂ ∆xᵀ∇f(x)        (4.91)

   where 1 > c₁ > c₂ > 0. This condition ensures that the slope of the function f(x + t∆x) at t is less than c₂ times that at t = 0. The conditions in (4.90) and (4.91) are together called the strong Wolfe conditions. These conditions are particularly important for non-convex problems.
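The backtracking search is straightforward to implement. Below is a minimal sketch in Python, with illustrative parameter values, that shrinks t until the Armijo condition (4.90) holds.

```python
import numpy as np

def backtracking(f, grad_f, x, dx, c1=1e-4, beta=0.5):
    """Backtracking ray search: shrink t until the Armijo condition (4.90) holds."""
    t = 1.0
    fx = f(x)
    slope = grad_f(x) @ dx    # directional derivative at t = 0; negative for a descent direction
    while f(x + t * dx) > fx + c1 * t * slope:
        t *= beta             # scale the step down by beta
    return t

# Example: one gradient-descent step on the hypothetical f(x) = x1^2 + 10 x2^2.
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
dx = -grad_f(x)               # descent direction
t = backtracking(f, grad_f, x, dx)
print(t, x + t * dx)
```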
A finding that is borne out by plenty of empirical evidence is that exact ray search does better than empirical ray search in only a few cases. Further, the exact choice of the values of β and c₁ seems to have little effect on the convergence of the overall descent method.
The trend of specific descent methods has been like a parabola: starting with simple steepest descent techniques, then accommodating curvature through the more sophisticated Newton's method, and finally trying to simplify Newton's method through approximations to the Hessian inverse, as in the quasi-Newton methods.
Steepest Descent
Let v ∈ ℜⁿ be a unit vector under some norm. By theorem 75, for convex f,

f(x^(k)) − f(x^(k) + v) ≤ −∇ᵀf(x^(k))v

For small v, the inequality turns into approximate equality. The term −∇ᵀf(x^(k))v can be thought of as (an upper bound on) the first order prediction of decrease. The idea in the steepest descent method [?] is to choose a norm and then determine a descent direction such that for a unit step in that norm, the first order prediction of decrease is maximized. This choice of the descent direction can be stated as

∆x = argmin { ∇ᵀf(x)v  |  ||v|| = 1 }
The algorithm is outlined in Figure 4.46.
The key to understanding the steepest descent method (and in fact many
other iterative methods) is that it heavily depends on the choice of the norm. It
has been empirically observed that if the norm chosen is aligned with the gross
geometry of the sub-level sets22 , the steepest descent method converges faster
to the optimal solution. If the norm chosen is not aligned, it often amplifies
the effect of oscillations. Two examples of the steepest descent method are the
gradient descent method (for the eucledian or L2 norm) and the coordinate-
descent method (for the L1 norm). Note, however, that no two norms can give exactly opposite steepest descent directions, though they may point in different directions.
Gradient Descent
A classic greedy algorithm for minimization is the gradient descent algorithm.
This algorithm uses the negative of the gradient of the function at the current
22 The alignment can be determined by fitting, for instance, a quadratic to a sample of the
points.
point x^(k) as the descent direction ∆x^(k). It turns out that this choice of ∆x^(k) corresponds to the direction of steepest descent under the L2 (Euclidean) norm.
This can be proved in a straightforward manner using theorem 58. The algo-
rithm is outlined in Figure 4.47. The steepest descent method can be thought
of as changing the coordinate system in a particular way and then applying the
gradient descent method in the changed coordinate system.
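A minimal sketch of the gradient descent method with a backtracking ray search, applied to a hypothetical anisotropic quadratic, is given below.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, c1=1e-4, beta=0.5, tol=1e-8, max_iter=10000):
    """Gradient descent: steepest descent under the L2 norm, with backtracking."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # stop when the gradient is (nearly) zero
            break
        t = 1.0
        while f(x - t * g) > f(x) - c1 * t * (g @ g):   # Armijo condition (4.90)
            t *= beta
        x = x - t * g
    return x

# Minimize f(x) = x1^2 + 10 x2^2; the minimizer is (0, 0).
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
print(gradient_descent(f, grad_f, np.array([1.0, 1.0])))
```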
Coordinate-Descent Method
The co-ordinate descent method corresponds exactly to the choice of L1 norm
for the steepest descent method. The steepest descent direction using the L1
norm is given by

∆x = −(∂f(x)/∂xi) uⁱ

where the index i is chosen so that

|∂f(x)/∂xi| = ||∇f(x)||∞
and ui was defined on page 231 as the unit vector pointing along the ith co-
ordinate axis. Thus each iteration of the coordinate descent method involves
optimizing over one component of the vector x(k) and then updating the vec-
tor. The component chosen is the one having the largest absolute value in the
gradient vector. The algorithm is outlined in Figure 4.48.
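The sketch below implements this scheme; for simplicity it uses a backtracking step along the chosen coordinate, whereas Figure 4.48 uses a general ray search, and the test function is the same hypothetical quadratic as before.

```python
import numpy as np

def coordinate_descent(f, grad_f, x0, tol=1e-8, max_iter=100000):
    """Steepest descent under the L1 norm: move along the coordinate with the
    largest absolute partial derivative, with a simple backtracking step."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        i = int(np.argmax(np.abs(g)))        # |df/dx_i| = ||grad f||_inf
        e = np.zeros_like(x); e[i] = 1.0     # unit vector u^i along coordinate i
        t = 1.0
        while f(x - t * g[i] * e) > f(x) - 1e-4 * t * g[i]**2:
            t *= 0.5
        x[i] -= t * g[i]
    return x

f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
print(coordinate_descent(f, grad_f, np.array([1.0, 1.0])))
```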
For a strongly convex function f, the gradient descent method satisfies

f(x^(k)) − p∗ ≤ ρᵏ ( f(x^(0)) − p∗ )        (4.92)

The value of ρ ∈ (0, 1) depends on the strong convexity constant c (c.f. equation (4.64) on page 277), the value of x^(0) and the type of ray search employed. The
suboptimality f (x(k) ) − p∗ goes down by a factor ρ < 1 at every step and
this is referred to as linear convergence23 . However, this is only of theoretical
importance, since this method is often very slow, as indicated by values of ρ very close to 1. Use of exact line search in conjunction with gradient descent also has the tendency to overshoot the next best iterate. It is therefore rarely used in practice. The convergence rate depends greatly on the condition number of the Hessian (which is upper-bounded by D/c). It can be proved that the number of
iterations required for the convergence of the gradient descent method is lower-
bounded by the condition number of the hessian; large eigenvalues correspond
to high curvature directions and small eigenvalues correspond to low curvature
directions. Many methods (such as conjugate gradient) try to improve upon
the gradient method by making the hessian better conditioned. Convergence
can be very slow even for moderately well-conditioned problems, with condition
number in the 100s, even though computation of the gradient at each step is
only an O(n) operation. The gradient descent method however works very well
if the function is isotropic, that is if the level-curves are spherical or nearly
spherical.
The convergence of the steepest descent method can be stated in the same
form as in 4.92, using the fact that any norm can be bounded in terms of the
Euclidean norm, i.e., there exists a constant η ∈ (0, 1] such that
||x|| ≥ η||x||2
²³A series s₁, s₂, . . . is said to have

1. linear convergence to s if lim_{i→∞} |s_{i+1} − s| / |s_i − s| = δ ∈ (0, 1). For example, s_i = γⁱ has linear convergence to s = 0 for any γ < 1. The rate of decrease is also sometimes called exponential or geometric. This is considered quite slow.
2. superlinear convergence to s if lim_{i→∞} |s_{i+1} − s| / |s_i − s| = 0. For example, s_i = 1/i! has superlinear convergence. This is the most common.
3. quadratic convergence to s if lim_{i→∞} |s_{i+1} − s| / |s_i − s|² = δ ∈ (0, ∞). For example, s_i = γ^(2ⁱ) has quadratic convergence to s = 0 for any γ < 1. This is considered very fast in practice.
The Newton update rule is

x^(k+1) = x^(k) − (∇²f(x^(k)))⁻¹ ∇f(x^(k))        (4.93)
assuming that the Hessian matrix is invertible. The term x(k+1) − x(k) can
be thought of as an update step. This leads to a simple descent algorithm,
outlined in Figure 4.49 and is called the Newton’s method. It relies on the
invertibility of the hessian, which holds if the hessian is positive definite as in
the case of a strictly convex function. In case the hessian is invertible, Cholesky factorization (page 207) of the hessian can be used to solve the linear system
(4.93). However, the Newton method may not even be properly defined if the
hessian is not positive definite. In this case, the hessian could be changed to
a nearby positive definite matrix whenever it is not. Or a line search could be
added to seek a new point having a positive definite hessian.
This method uses a step size of 1. If instead, the stepsize is chosen using
exact or backtracking ray search, the method is called the damped Newton’s
method. Each Newton’s step takes O(n3 ) time (without using any fast matrix
multiplication methods).
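A minimal sketch of the damped Newton's method of Figure 4.49, applied to the one dimensional example f(x) = 7x − ln x discussed later in this section, follows; the stopping rule anticipates the Newton decrement introduced next.

```python
import numpy as np

def newton(f, grad_f, hess_f, x0, eps=1e-10, max_iter=50):
    """Damped Newton's method; stops when (1/2) lambda(x)^2 is below eps."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g, H = grad_f(x), hess_f(x)
        dx = np.linalg.solve(H, -g)    # Newton step (4.93); solve, don't invert
        lam_sq = -(g @ dx)             # Newton decrement squared
        if lam_sq / 2.0 <= eps:
            break
        t = 1.0                        # backtracking ray search (damped Newton)
        while f(x + t * dx) > f(x) + 1e-4 * t * (g @ dx):
            t *= 0.5
        x = x + t * dx
    return x

# f(x) = 7x - ln x, with minimizer x* = 1/7.
f = lambda x: 7 * x[0] - np.log(x[0])
grad_f = lambda x: np.array([7.0 - 1.0 / x[0]])
hess_f = lambda x: np.array([[1.0 / x[0]**2]])
print(newton(f, grad_f, hess_f, np.array([0.2])))   # approaches 1/7 ~ 0.142857
```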
The Newton step can also be looked upon as another incarnation of the steepest descent rule, but with the quadratic norm defined by the (local) Hessian ∇²f(x^(k)) evaluated at the current iterate x^(k), i.e.,

||u||_{∇²f(x^(k))} = ( uᵀ ∇²f(x^(k)) u )^{1/2}

The norm of the Newton step, in the quadratic norm defined by the Hessian at a point x, is called the Newton decrement at the point x and is denoted by λ(x). Thus,

λ(x) = ||∆x||_{∇²f(x)} = ( ∇ᵀf(x) (∇²f(x))⁻¹ ∇f(x) )^{1/2}
The Newton decrement gives an ‘estimate’ of the proximity of the current iterate x to the optimal point x∗, obtained by measuring the proximity of x to the minimizer of the quadratic approximation f̃ of f at x, and is given as

½ λ(x)² = f(x) − min f̃(x)

Additionally, −λ(x)² is the directional derivative in the Newton direction:

−λ(x)² = ∇ᵀf(x)∆x

The estimate ½λ(x)² is used to test the convergence of the Newton algorithm in Figure 4.49.
Next, we state an important property of the Newton update rule.

Theorem 83 If ∆x^(k) = −(∇²f(x^(k)))⁻¹∇f(x^(k)), where ∇²f(x^(k)) is symmetric and positive definite and ∆x^(k) ≠ 0, then ∆x^(k) is a descent direction at x^(k), that is, ∇ᵀf(x^(k))∆x^(k) < 0.
Near the optimum x∗, since ∇f(x∗) = 0, the second order approximation of f gives

f(x) ≈ f(x∗) + ∇ᵀf(x∗)(x − x∗) + ½(x − x∗)ᵀ∇²f(x∗)(x − x∗)
     = f(x∗) + ½(x − x∗)ᵀ∇²f(x∗)(x − x∗)
Thus, the level curves of a convex function are approximately ellipsoids near the
point of minimum x∗ . Given this geometry near the minimum, it then makes
sense to do steepest descent in the norm induced by the hessian, near the point
of minimum (which is equivalent to doing a steepest descent after a rotation of
the coordinate system using the hessian). This is exactly the Newton’s step.
Thus, the Newton’s method24 converges very fast in the vicinity of the solution.
This convergence analysis is formally stated in the following theorem and is
due to Leonid Kantorovich.
Theorem 84 Suppose f(x) : D → ℜ is twice continuously differentiable on D and x∗ is the point corresponding to the optimal value p∗ (so that ∇f(x∗) = 0). Let f be strongly convex on D with constant c > 0. Also, suppose ∇²f is Lipschitz continuous on D with a constant L > 0 (which measures how well f can be approximated by a quadratic function, or how fast the second derivative of f changes), that is,

||∇²f(x) − ∇²f(y)||₂ ≤ L ||x − y||₂

Then, there exist constants α ∈ (0, c²/L) and β > 0 such that
1. Damped Newton Phase: If ||∇f(x)||₂ ≥ α, then f(x^(k+1)) − f(x^(k)) ≤ −β. That is, at every step of the iteration in the damped Newton phase, the function value decreases by at least β, and the phase ends after at most (f(x^(0)) − p∗)/β iterations, which is a finite number.

2. Quadratically Convergent Phase: If ||∇f(x)||₂ < α, then

   (L/2c²) ||∇f(x^(k+1))||₂ ≤ ( (L/2c²) ||∇f(x^(k))||₂ )²

   When applied recursively, this inequality yields

   (L/2c²) ||∇f(x^(k))||₂ ≤ (1/2)^{2^{k−q}}

   where q is the iteration number starting at which ||∇f(x^(q))||₂ < α. Using the result for strong convexity in equation (4.50) on page 273, we can derive
²⁴Newton originally presented his method for one-dimensional problems; Raphson later simplified and popularized it.
f(x^(k)) − p∗ ≤ (1/2c) ||∇f(x^(k))||₂² ≤ (2c³/L²) (1/2)^{2^{k−q+1}}        (4.94)
Also, using the result in equation (4.52) on page 273, we get a bound on the distance between the current iterate and the point x∗ corresponding to the optimum:

||x^(k) − x∗||₂ ≤ (2/c) ||∇f(x^(k))||₂ ≤ (c/L) (1/2)^{2^{k−q}}        (4.95)
Inequality (4.94) shows that convergence is quadratic once the second condition
is satisfied after a finite number of iterations. Roughly speaking, this means
that, after a sufficiently large number of iterations, the number of correct digits
doubles at each iteration25 . In practice, once in the quadratic phase, you do not
even need to bother about any convergence criterion; it suffices to apply a fixed
few number of Newton iterations to get a very accurate solution. Inequality
(4.95) states that the sequence of iterates converges quadratically. The Lipschitz continuity condition states that if the second derivative of the function
changes relatively slowly, applying Newton’s method can be useful. Again, the
inequalities are technical junk as far as practical application of Newton’s method
is concerned, since L, c and α are generally unknown, but it helps to understand
the properties of the Newton’s method, such as its two phases and identify them
in problems. In practice, Newton’s method converges very rapidly, if at all.
As an example, consider the one dimensional function f(x) = 7x − ln x. Then f′(x) = 7 − 1/x and f″(x) = 1/x². The Newton update rule at a point x is x_new = x − x²(7 − 1/x). Starting with x^(0) = 0 is infeasible and useless, since the updates will always be 0. The unique global minimizer of this function is x∗ = 1/7. The range of quadratic convergence for Newton's method on this function is x ∈ (0, 2/7). However, if you start at a point x^(0) > 2/7, the iterates quickly become infeasible (negative) and tend quadratically to −∞!
There are some classes of functions for which theorem 84 can be applied very constructively. They are

• −Σ_{i=1}^{m} ln xi
• −ln(t² − xᵀx), for t > ||x||₂
• −ln det(X)

Further, theorem 84 also comes in handy for linear combinations of these functions. These three functions are also at the heart of modern interior point method theory.
25 Linear convergence adds a constant number of digits of accuracy at each iteration.
Recall that (Jₘ(x)ᵀJₘ(x))⁻¹Jₘ(x)ᵀ is the Moore-Penrose pseudoinverse Jₘ(x)⁺ of Jₘ(x). The Gauss-Newton method for the sum-squared loss can thus be interpreted as multiplying the gradient ∇l(m) by the pseudo-inverse of the Jacobian²⁸ of m
²⁸The Jacobian is a p × n matrix of the first derivatives of a vector valued function, where p is the arity of m. The (i, j)ᵗʰ entry of the Jacobian is the derivative of the iᵗʰ output with respect to the jᵗʰ variable, that is, ∂mᵢ/∂xⱼ. For p = 1, the Jacobian is the gradient vector.
instead of its transpose (which is what the gradient descent method would do). Though the Gauss-Newton method has traditionally been used for non-linear least squares problems, it has recently also seen use for the cross entropy loss function. This method is a simple adaptation of Newton's method, with the advantage that second derivatives, which can be computationally expensive and challenging to compute, are not required.
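A minimal sketch of the Gauss-Newton iteration for a non-linear least squares problem follows; the exponential model and the noiseless data here are hypothetical illustration choices.

```python
import numpy as np

def gauss_newton(residual, jac, x0, tol=1e-10, max_iter=100):
    """Gauss-Newton for min ||r(x)||^2: each step solves a linearized least squares problem."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = residual(x)
        J = jac(x)
        dx, *_ = np.linalg.lstsq(J, -r, rcond=None)   # dx = -pinv(J) r
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Example: fit y = a * exp(b * t) to data; the unknowns are (a, b).
t = np.linspace(0, 1, 20)
y = 2.0 * np.exp(1.5 * t)
residual = lambda p: p[0] * np.exp(p[1] * t) - y
jac = lambda p: np.column_stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)])
print(gauss_newton(residual, jac, np.array([1.0, 1.0])))   # approaches (2.0, 1.5)
```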
4.5.5 Levenberg-Marquardt
Like the Gauss-Newton method, the Levenberg-Marquardt method has its main
application in the least squares curve fitting problem (as also in the minimum
cross-entropy problem). The Levenberg-Marquardt method interpolates be-
tween the Gauss-Newton algorithm and the method of gradient descent. The
Levenberg-Marquardt algorithm is more robust than the Gauss Newton algo-
rithm - it often finds a solution even if it starts very far off the final minimum. On
the other hand, for well-behaved functions and reasonable starting parameters,
this algorithm tends to be a bit slower than the Gauss Newton algorithm. The
Levenberg-Marquardt method aims to reduce the uncontrolled step size often
taken by the Newton’s method and thus fix the stability issue of the Newton’s
method. The update rule is given by

∆x = −(G_f(x) + λ diag(G_f))⁻¹ Jₘ(x)ᵀ ∇l(m)
where G_f is the Gauss-Newton approximation to ∇²f(x) and is assumed to be positive semi-definite. This method is one of the work-horses of modern optimization. The parameter λ ≥ 0, which is adaptively controlled, limits steps to an elliptical model-trust region²⁹. This is achieved by adding λ to the smallest eigenvalues of G_f, thus restricting all eigenvalues of the matrix to be above λ, so that the elliptical region has diagonals whose lengths vary inversely with the eigenvalues (c.f. Section 3.11.3). While this method fixes the stability issues in Newton's method, it still requires the O(n³) time needed for matrix inversion.
4.5.6 BFGS
The Broyden-Fletcher-Goldfarb-Shanno³⁰ (BFGS) method uses linear algebra to iteratively update an estimate B^(k) of (∇²f(x^(k)))⁻¹ (the inverse of the curvature matrix), while ensuring that the approximation to the hessian inverse is symmetric and positive definite. Let ∆x^(k) be the direction vector for the kᵗʰ step, obtained as the solution to

∆x^(k) = −B^(k) ∇f(x^(k))
The next point x(k+1) is obtained as
x(k+1) = x(k) + t(k) ∆x(k)
29 Essentially the algorithm approximates only a certain region (the so-called trust region)
where t(k) is the step size obtained by line search. Let ∆g(k) = ∇f (x(k+1) ) −
∇f (x(k) ). Then the BFGS update rule is derived by imposing the following
logical conditions:
1. ∆x^(k) = −B^(k)∇f(x^(k)) with B^(k) ≻ 0. That is, ∆x^(k) is the minimizer of the convex quadratic model

   Q^(k)(p) = f(x^(k)) + ∇ᵀf(x^(k))p + ½ pᵀ (B^(k))⁻¹ p

and

S^(k) = ( ∆x^(k)ᵀ B^(k) ∆x^(k) ) u uᵀ

with

u = ∆x^(k) / (∆x^(k)ᵀ ∆g^(k)) − B^(k)∆g^(k) / (∆g^(k)ᵀ B^(k) ∆g^(k))
We have made use of the Sherman-Morrison formula, which determines how updates to a matrix relate to updates to the inverse of the matrix. The algorithm is outlined in Figure 4.50. The BFGS [?] method approaches Newton's method in behaviour as the iterate approaches the solution, and is much faster than Newton's method in practice. It has been proved that when BFGS is applied to a convex quadratic function with exact line search, it finds the minimizer within n steps. There is a variety of methods related to BFGS and collectively they are known as quasi-Newton methods. They are preferred over Newton's method or Levenberg-Marquardt when it comes to speed. There is a variant of BFGS, called LBFGS [?], which stands for "Limited memory BFGS method". LBFGS employs a limited-memory quasi-Newton approximation that does not require much storage or computation. It limits the rank of the approximation to the inverse of the hessian to some number γ, so that only nγ numbers have to be stored instead of n² numbers. For general non-convex problems, LBFGS may fail when the initial geometry (in the form of B^(0)) has been placed very close to a saddle point. Also, LBFGS is very sensitive to the line search.
Recently, L-BFGS has been observed [?] to be the most effective parameter estimation method for the Maximum Entropy model, much better than improved iterative scaling [?] (IIS) and generalized iterative scaling [?] (GIS).
Numerical linear algebra packages, such as those found in the netlib repository, have focused on efficiently solving large linear systems under general conditions, as well as under specific conditions such as symmetry or positive definiteness of the coefficient matrix.
Iterative Methods
The central step in an iteration is
P xk+1 = (P − A)xk + b
where x_k is the estimate of the solution at the kᵗʰ step, for k = 0, 1, . . .. If the iterations converge to the solution, that is, if x_{k+1} = x_k, one can immediately see that the solution is reached. The choice of the matrix P, which is called the preconditioner, determines the rate of convergence of the solution sequence to the actual solution. The initial estimate x₀ can be arbitrary for linear systems, but for non-linear systems, it is important to start with a good approximation.
It is desirable to choose the matrix P reasonably close to A, though setting
P = A (which is referred to as perfect preconditioning) will entail solving the
large system Ax = b, which is undesirable as per our problem definition. If x∗
is the actual solution, the relationship between the errors ek and ek+1 at the
k th and (k + 1)th steps respectively can be expressed as
P ek+1 = (P − A)ek
where ek = xk − x∗ . This is called the error equation. Thus,
ek+1 = (I − P −1 A)ek = M ek
Whether the solutions are convergent or not is controlled by the matrix M .
The iterations are stationary (that is, the update is of the same form at every
step). On the other hand, Multigrid and Krylov methods adapt themselves
across iterations to enable faster convergence. The error after k steps is given
by
ek = M k e0 (4.96)
A =
[  2  −1   0  . . .   0   0   0  . . .   0 ]
[ −1   2  −1  . . .   0   0   0  . . .   0 ]
[  .   .   .  . . .   .   .   .  . . .   . ]
[  0   0   0  . . .  −1   2  −1  . . .   0 ]        (4.97)
[  0   0   0  . . .   0  −1   2  . . .   0 ]
[  .   .   .  . . .   .   .   .  . . .   . ]
[  0   0   0  . . .   0   0   0  . . .   2 ]

that is, the tridiagonal matrix with 2's on the main diagonal and −1's on the sub- and super-diagonals.
The absolute value of the iᵗʰ eigenvalue of M is cos(iπ/(n+1)), and its spectral radius is ρ(M) = cos(π/(n+1)). For extremely large n, the spectral radius is approximately 1 − ½(π/(n+1))², which is very close to 1. Thus, the Jacobi steps converge very slowly.
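The following sketch runs the stationary iteration P x_{k+1} = (P − A)x_k + b with the Jacobi choice of preconditioner, P = diag(A), on the matrix of (4.97), and verifies the spectral radius claim numerically.

```python
import numpy as np

n = 20
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # the matrix in (4.97)
b = np.ones(n)

P = np.diag(np.diag(A))      # Jacobi preconditioner: the diagonal part of A
x = np.zeros(n)
for k in range(10000):
    x_new = np.linalg.solve(P, (P - A) @ x + b)        # P x_{k+1} = (P - A) x_k + b
    if np.linalg.norm(x_new - x) < 1e-10:
        break
    x = x_new

M = np.eye(n) - np.linalg.solve(P, A)                  # iteration matrix M = I - P^{-1} A
print("spectral radius:", max(abs(np.linalg.eigvals(M))))   # ~ cos(pi/(n+1)) ~ 0.9888
print("residual:", np.linalg.norm(A @ x - b))
```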
Multigrid Methods
Multigrid methods come in very handy when solving large sparse systems, especially differential equations, using a hierarchy of discretizations. This approach often scales linearly with the number of unknowns n for a pre-specified accuracy threshold.
threshold. The overall multi-grid algorithm for solving Ah uh = bh with residual
given by rh = b − Auh is
3. Solve A_{2h} e_{2h} = r_{2h} with A_{2h} = R A_h N, which is a natural construction for the coarse mesh operator. This could be done by running a few iterations of Jacobi, starting with e_{2h} = 0.

where t is the number of Jacobi steps performed in (1) and (5). Typically t is 2 or 3. When we contrast (4.98) against (4.96), we discover that ρ(M) ≫ ρ(Mᵗ(I − S)Mᵗ). As t increases, ρ(Mᵗ(I − S)Mᵗ) decreases further, though by a smaller proportion.
In general, you could have multiple levels of coarse grids corresponding to 2h, 4h, 8h and so on, in which case steps (2), (3) and (4) would be repeated as many times with varying specifications of the coarseness. If A is an n × n matrix, multi-grid methods are known to run in O(n²) floating point operations (flops). The multi-grid method could be used as an iterative method to solve a linear system. Alternatively, it could be used to obtain the preconditioner.
³²For any matrix A, the condition number κ(A) = σ_max(A)/σ_min(A), where σ_max(A) and σ_min(A) are the maximal and minimal singular values of A respectively. Recall from Section 3.13 that the iᵗʰ eigenvalue of AᵀA (the gram matrix) is the square of the iᵗʰ singular value of A. Further, if A is normal, κ(A) = |λ_max(A)|/|λ_min(A)|, where λ_max(A) and λ_min(A) are the eigenvalues of A with maximal and minimal magnitudes respectively. All orthogonal, symmetric, and skew-symmetric matrices are normal. The condition number measures how much the columns/rows of a matrix depend on each other; the higher the value of the condition number, the more the linear dependence. Condition number 1 means that the columns/rows of a matrix are linearly independent.
Set q₁ = b/||b||.  //The first step in Gram-Schmidt.
for j = 1 to n − 1 do
    t = Aq_j.
    for i = 1 to j do
        //If A is symmetric, it will be i = max(1, j − 1) to j.
        H_{i,j} = q_iᵀt.
        t = t − H_{i,j} q_i.
    end for
    H_{j+1,j} = ||t||.
    q_{j+1} = t/||t||.
end for
t = Aq_n.
for i = 1 to n do
    //If A is symmetric, it will be i = n − 1 to n.
    H_{i,n} = q_iᵀt.
    t = t − H_{i,n} q_i.
end for
H_{n+1,n} = ||t||.
q_{n+1} = t/||t||.
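The Gram-Schmidt/Arnoldi procedure above can be written compactly as follows; this is a sketch that returns the orthonormal basis Q of the Krylov space and the Hessenberg matrix H, satisfying the Arnoldi relation A Q_m = Q_{m+1} H.

```python
import numpy as np

def arnoldi(A, b, m):
    """Builds an orthonormal basis Q of the Krylov space K_m and the
    (m+1) x m Hessenberg matrix H with A Q[:, :m] = Q @ H."""
    n = len(b)
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        t = A @ Q[:, j]
        for i in range(j + 1):                 # Gram-Schmidt against previous q's
            H[i, j] = Q[:, i] @ t
            t -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(t)
        Q[:, j + 1] = t / H[j + 1, j]
    return Q, H

A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
Q, H = arnoldi(A, np.array([1.0, 0.0, 0.0]), 2)
print(np.allclose(A @ Q[:, :2], Q @ H))   # True: the Arnoldi relation holds
```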
The matrix K₄ is

K₄ =
[ 0.6382  1.8074  8.1892  34.6516 ]
[ 0.3656  1.7126  7.5403  32.7065 ]
[ 0.1124  1.7019  7.4070  31.9708 ]
[ 0.5317  1.9908  7.9822  34.8840 ]

and its eigenvalues are 4.3125, 0.5677, −1.2035 and 0.0835. On the other hand, the matrix H₃ (obtained by restricting to K₃) has eigenvalues 4.3124, 0.1760 and −1.0741.
The basic conjugate gradient method selects vectors in xk ∈ Kk that ap-
proach the exact solution to Ax = b. Following are the main ideas in the
conjugate gradient method.
1. The rule is to select x_k so that the new residual r_k = b − Ax_k is orthogonal to all the previous residuals. Since Ax_k ∈ K_{k+1}, we must have r_k ∈ K_{k+1}, and r_k must be orthogonal to all vectors in K_k. Thus, r_k must be a multiple of q_{k+1}. This holds for all k and implies that

   r_kᵀ r_i = 0 for all i < k

   This is a necessary and sufficient condition for the orthogonality of the new residual to all the previous residuals. Note that while the residual updates are orthogonal in the usual inner product, the variable updates are orthogonal in the inner product with respect to A.
The basic conjugate gradient method consists of 5 steps. Each iteration of the algorithm involves a multiplication of the vector d_{k−1} by A and the computation of two inner products. In addition, an iteration also involves around three vector updates. So each iteration should take time up to (2 + θ)n, where θ is determined by the sparsity of the matrix A. The error e_k after k iterations is bounded as follows:

||e_k||_A = ( (x_k − x)ᵀ A (x_k − x) )^{1/2} ≤ 2 ( (√κ(A) − 1) / (√κ(A) + 1) )ᵏ ||e₀||_A

The ‘gradient’ part of the name conjugate gradient stems from the fact that solving the linear system Ax = b corresponds to finding the minimum value of E(x) = ½xᵀAx − xᵀb.
x₀ = 0, r₀ = b, d₀ = r₀, k = 1.
repeat
    1. α_k = (r_{k−1}ᵀ r_{k−1}) / (d_{k−1}ᵀ A d_{k−1}).  //Step length for the next update. This corresponds to the entry H_{k,k}.
    2. x_k = x_{k−1} + α_k d_{k−1}.
    3. r_k = r_{k−1} − α_k A d_{k−1}.  //New residual obtained using r_k − r_{k−1} = −A(x_k − x_{k−1}).
    4. β_k = (r_kᵀ r_k) / (r_{k−1}ᵀ r_{k−1}).  //Improvement over the previous step. This corresponds to the entry H_{k,k+1}.
    5. d_k = r_k + β_k d_{k−1}.  //The next search direction, which should be orthogonal to the search direction just used.
    k = k + 1.
until β_k < θ.

Figure 4.52: The conjugate gradient algorithm for solving Ax = b.
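A direct transcription of the five steps of Figure 4.52 into code follows; this is a sketch assuming A is symmetric positive definite, demonstrated on the hypothetical matrix of (4.97).

```python
import numpy as np

def conjugate_gradient(A, b, theta=1e-12, max_iter=None):
    """The linear CG iteration of Figure 4.52 for symmetric positive definite A."""
    x = np.zeros(len(b))
    r = b.copy()                        # residual r0 = b - A x0, with x0 = 0
    d = r.copy()                        # first search direction
    for _ in range(max_iter or len(b)):
        rr = r @ r
        alpha = rr / (d @ (A @ d))      # step 1: step length
        x = x + alpha * d               # step 2: update iterate
        r = r - alpha * (A @ d)         # step 3: new residual
        beta = (r @ r) / rr             # step 4: improvement over previous step
        if beta < theta:
            break
        d = r + beta * d                # step 5: next A-orthogonal search direction
    return x

A = 2 * np.eye(20) - np.eye(20, k=1) - np.eye(20, k=-1)
b = np.ones(20)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))        # close to 0
```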
Figure 4.53: Illustration of the steepest descent technique on level curves of the function E(x) = ½xᵀAx − xᵀb.

Can the approach be adapted to minimize general nonlinear convex functions? Nonlinear variants of the conjugate gradient method are well studied [?] and have proved to be quite successful in practice. The general conjugate gradient method is essentially an incremental way of doing second order search.
Fletcher and Reeves showed how to extend the conjugate gradient method
to nonlinear functions by making two simple changes33 to the algorithm in
Figure 4.52. First, in place of the exact line search formula in step (1) for the
step length αk , we need to perform a line search that identifies an approximate
minimum of the nonlinear function f along d(k−1) . Second, the residual r(k) ,
which is simply the gradient of E (and which points in the direction of decreasing
value of E), must be replaced by the gradient of the nonlinear objective f , which
serves a similar purpose. These changes give rise to the algorithm for nonlinear
optimization outlined in Figure 4.55. The search directions d(k) are computed
by Gram-Schmidt conjugation of the residuals as with linear conjugate gradient.
The algorithm is very sensitive to the line minimization step, and it generally requires a very good line minimization. Any line search procedure that yields an α_k satisfying the strong Wolfe conditions (see (4.90) and (4.91)) will ensure that all directions d^(k) are descent directions for the function f; otherwise, d^(k) may cease to remain a descent direction as iterations proceed. We note that each iteration of this method costs O(n), as against the Newton or quasi-Newton methods, which cost at least O(n²) owing to matrix operations. Most often, it yields optimal progress after h ≪ n iterations. Due to this property, the conjugate gradient method drives nearly all large-scale optimization today.
33 We note that in the algorithm in Figure 4.52, the residuals r(k) in successive iterations
(which are gradients of E) are orthogonal to each other, while the corresponding update
directions are orthogonal with respect to A. While the former property is difficult to enforce
for general non-linear functions, the latter condition can be enforced.
    k = k + 1.
until ||g^(k)|| / ||g^(0)|| < θ OR k > maxIter.
Figure 4.55: The conjugate gradient algorithm for optimizing nonlinear convex
function f .
³⁴Restarting conjugate gradient means forgetting the past search directions and starting afresh in the direction of steepest descent.
4.6 Algorithms for Constrained Minimization

We now consider algorithms for the constrained convex minimization problem

minimize f(x)
subject to gi(x) ≤ 0, i = 1, . . . , m        (4.99)
           Ax = b
For example, when f is linear and the gi's are polyhedral, the problem is a linear program, which was stated in (4.83) and whose dual was discussed on page 289. Linear programming is a typical example of a constrained minimization problem and will form the subject matter for discussion in Section 4.7. As another example, when f is quadratic (of the form xᵀQx + bᵀx) and the gi's are polyhedral, the problem is called a quadratic programming problem. A special case of quadratic programming is the least squares problem, which we will take up in detail in Section 4.8.
We start with the equality constrained convex minimization problem

minimize f(x)
subject to Ax = b        (4.100)

Recall the conditions from Section 4.4.4 that were proved to be necessary and sufficient for the optimality of a convex problem with differentiable objective and constraint functions.
Theorem 85 x̂ is an optimal point for the primal iff there exists a µ̂ such that the following conditions are satisfied:

∇f(x̂) + Aᵀµ̂ = 0
Ax̂ = b        (4.101)

The term ∇f(x̂) + Aᵀµ̂ is sometimes called the dual residual (r_d), while the term Ax̂ − b is referred to as the primal residual (r_p). The optimality condition basically states that r_d and r_p should both be 0, and the success of this test is a certificate of optimality.
As an illustration of this theorem, consider the constrained quadratic problem

minimize ½xᵀAx + bᵀx + c
subject to Px = q        (4.102)

By theorem 85, the necessary and sufficient condition for optimality of a point (x̂, λ̂) is

[ A  Pᵀ ] [ x̂ ]   [ −b ]
[ P  0  ] [ λ̂ ] = [  q ]

The coefficient matrix on the left is called the KKT matrix.
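Solving the KKT system directly yields both the optimal point and the multiplier. A sketch with hypothetical data:

```python
import numpy as np

# Equality constrained QP (4.102): minimize 0.5 x^T A x + b^T x + c  s.t.  P x = q.
A = np.array([[3.0, 1.0], [1.0, 2.0]])     # positive definite
b = np.array([-1.0, -4.0])
P = np.array([[1.0, 1.0]])
q = np.array([1.0])

n, p = 2, 1
KKT = np.block([[A, P.T], [P, np.zeros((p, p))]])   # the KKT matrix
rhs = np.concatenate([-b, q])
sol = np.linalg.solve(KKT, rhs)
x_hat, lam_hat = sol[:n], sol[n:]
print(x_hat, lam_hat)

# Verify: the gradient of the Lagrangian vanishes and the constraint holds.
print(A @ x_hat + b + P.T @ lam_hat, P @ x_hat - q)
```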
Using formula (3.27) on page 169 to derive the null basis N ∈ ℜ^{n×(n−p)} (that is, AN = 0 and the columns of N span N(A)), we get the following free parameter expression for the solution set to Ax = b:

{x | Ax = b} = { Nz + x_particular | z ∈ ℜ^{n−p} }

x̂ = Nẑ + x_particular
µ̂ = −(AAᵀ)⁻¹A∇f(x̂)        (4.104)
Any iterative algorithm that is applied to solve the problem (4.104) will ensure that all intermediate points are feasible, since for any z ∈ ℜ^{n−p}, x = Nz + x_particular is feasible, that is, Ax = b. Moreover, when Newton's method is applied, the iterates are independent of the exact affine change of coordinates induced by the choice of the null basis N (c.f. page 306). The Newton update rule ∆z^(k) for (4.103) is given by the solution to the corresponding reduced Newton system. Due to the affine invariance of Newton's method, if z^(0) is the starting iterate and x^(0) = Nz^(0) + x_particular, the kᵗʰ iterate x^(k) = Nz^(k) + x_particular is independent of the choice of the null basis N. We therefore do not need a separate convergence analysis. The algorithm for Newton's method was outlined in Figure 4.49. Techniques for handling constrained optimization using Newton's method given an infeasible starting point x^(0) can be found in [?].
minimize f(x)
subject to gi(x) ≤ 0, i = 1, . . . , m        (4.105)
           Ax = b
An example is the geometric program³⁶ in convex form:

minimize_{y∈ℜⁿ} log Σ_{k=1}^{q} e^{a_kᵀy + b_k}
subject to log Σ_{k=1}^{r} e^{c_kᵀy + d_k} ≤ 0, i = 1, 2, . . . , p        (4.106)
           g_iᵀy + h_i = 0, i = 1, 2, . . . , m
Logarithmic Barrier
One idea for solving a minimization problem with inequalities is to replace the
inequalities by a so-called barrier term. The barrier term is subtracted from the
objective function with a weight µ on it. The solution to (4.105) is approximated
by the solution to the following problem.
minimize B(x, µ) = f(x) − µ Σ_{i=1}^{m} ln(−gi(x))
subject to Ax = b        (4.107)
³⁶Although geometric programs are not convex in their natural form, they can, however, be transformed into convex optimization problems, as above.
The objective function B(x, µ) is called the logarithmic barrier function. This
function is convex, which can be proved by invoking the composition rules de-
scribed in Section 4.2.10. It is also twice continuously differentiable. The bar-
rier term, as a function of x approaches +∞ as any feasible interior point x
approaches the boundary of the feasible region. Because we are minimizing,
this property prevents the feasible iterates from crossing the boundary and be-
coming infeasible. We will denote the point of optimality by x̂(µ), as a function of µ.
However, the optimal solution to the original problem (a typical example
being the LP discussed in Section 4.7) is typically a point on the boundary
of the feasible region (we will see this in the case of linear programming in
Section 4.7). To obtain such a boundary point solution, it is necessary to keep
decreasing the parameter µ of the barrier function to 0 in the limit. As a
very simple example, consider the following inequality constrained optimization
problem.
minimize x²
subject to x ≥ 1

The logarithmic barrier formulation of this problem is

minimize x² − µ ln(x − 1)

The unconstrained minimizer for this convex logarithmic barrier function is x̂(µ) = ½ + ½√(1 + 2µ). As µ → 0, the optimal point of the logarithmic barrier problem approaches the actual point of optimality x̂ = 1 (which, as we can see, lies on the boundary of the feasible region). The generalized idea, that as µ → 0, f(x̂(µ)) → p∗ (where p∗ is the optimum for (4.105)), will be proved next.
1. The point x̂(µ) must be strictly feasible. That is,

   Ax̂(µ) = b  and  gi(x̂(µ)) < 0

2. There must exist some η̂ such that

   ∇f(x̂(µ)) + Σ_{i=1}^{m} (−µ / gi(x̂(µ))) ∇gi(x̂(µ)) + Aᵀη̂ = 0        (4.108)
Define

λ̂i(µ) = −µ / gi(x̂(µ))

and

η̂(µ) = η̂

We claim that the pair (λ̂(µ), η̂(µ)) is dual feasible. The following steps prove our claim:

1. Since gi(x̂(µ)) < 0 for i = 1, 2, . . . , m, λ̂(µ) ≻ 0.
2. Based on the proof of theorem 82, we can infer that L(x, λ, η) is convex in x, where

   L(x, λ, η) = f(x) + Σ_{i=1}^{m} λi gi(x) + ηᵀ(Ax − b)

   By (4.108), x̂(µ) minimizes L(x, λ̂(µ), η̂(µ)) over x ∈ D, so that

   L∗(λ̂(µ), η̂(µ)) = f(x̂(µ)) + Σ_{i=1}^{m} λ̂i gi(x̂(µ)) + η̂(µ)ᵀ(Ax̂(µ) − b) = f(x̂(µ)) − mµ        (4.109)
From the weak duality theorem 81, we know that d∗ ≤ p∗, where p∗ and d∗ are the primal and dual optima respectively for (4.105). Since L∗(λ̂(µ), η̂(µ)) ≤ d∗ (by definition), we will have from (4.109) that f(x̂(µ)) − mµ ≤ p∗, or equivalently,

f(x̂(µ)) − p∗ ≤ mµ        (4.110)
The inequality in (4.110) forms the basis of the barrier method; it confirms the intuitive idea that x̂(µ) converges to an optimal point as µ → 0. We will next discuss the barrier method.
The centering step (1) can be executed using any of the descent techniques discussed in Section 4.5. It can be proved [?] that the duality gap is mµ^(0)αᵏ after k iterations. Therefore, the desired accuracy ǫ can be achieved by the barrier method after exactly ⌈log(mµ^(0)/ǫ) / (−log α)⌉ steps.
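A minimal sketch of the barrier method on the simple example above (minimize x² subject to x ≥ 1), assuming the scipy library for the centering step; the initial weight µ^(0), the reduction factor α and the tolerance are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# B(x, mu) = x^2 - mu * ln(x - 1); the analytic center is x(mu) = 1/2 + 1/2 sqrt(1 + 2 mu).
mu, alpha = 1.0, 0.5        # initial weight and the factor used to shrink it
x = np.array([2.0])         # strictly feasible starting point
while mu > 1e-8:            # duality-gap bound m*mu below the desired accuracy (m = 1 here)
    B = lambda z, mu=mu: z[0]**2 - mu * np.log(z[0] - 1.0) if z[0] > 1.0 else np.inf
    x = minimize(B, x, method='Nelder-Mead').x     # centering step
    mu *= alpha              # decrease the barrier weight
print(x)                     # approaches the boundary optimum x* = 1
```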
Successive minima x̂(µ) of the barrier function B(x, µ) can be shown to have the following properties. Let µ̄ < µ for sufficiently small µ̄; then

f(x̂(µ̄)) ≤ f(x̂(µ))  and  −Σ_{i=1}^{m} ln(−gi(x̂(µ̄))) ≥ −Σ_{i=1}^{m} ln(−gi(x̂(µ)))
When a strictly feasible point x̂ is not known, the barrier method is preceded by a preliminary stage, called phase I, in which a strictly feasible point is computed (if it exists). The strictly feasible point found during phase I is then used as the starting point for the barrier method. This is discussed in greater detail in [?].
4.7 Linear Programming

A linear program is specified using:

1. A cost vector c of size n.
2. An m × n matrix A.
3. A vector b of size m.

The unknown is a vector x of size n, and this is what we will try to determine.
In linear programming (LP), the task is to minimize a linear objective function of the form Σ_{j=1}^{n} cj xj, subject to linear inequality constraints³⁷ of the form Σ_{j=1}^{n} aij xj ≥ bi, i = 1, . . . , m, and xj ≥ 0. The problem can be stated as in (4.111). In contrast to the LP specification on page 289, where the constraint x ≥ 0 was absorbed into the more general constraint −Ax + b ≤ 0, here we choose to specify it as a separate constraint.

min_{x∈ℜⁿ} xᵀc
subject to −Ax + b ≤ 0,  x ≥ 0        (4.111)
The flip side of this problem is that it has no analytical formula as a solution.
However, that does not make a big difference in practice, because there exist
reliable and efficient algorithms and software for linear programming. The com-
putational time is roughly proportional to n2 m, if m ≥ n. This is basically the
cost of one iteration in an interior point method.
Linear programming (LP) problems can be harder to recognize in practice and often need reformulations to get into the standard form in (4.111). Minimizing a piecewise linear function of x is not an LP, though it can be written and solved as an LP. Other problems involving ℓ₁ or ℓ∞ norms can also be written as linear programming problems.
The basis for linear programming was mentioned on page 250: linear functions have no critical points and therefore, by theorem 60, the extreme values are always attained at the boundary of the feasible set. In the case of linear programs, the feasible set is itself defined by linear inequalities: {x | −Ax + b ≤ 0}. Applying the argument recursively, it can be proved that the extreme values for a linear program are attained at some corners (i.e., vertices) of the feasible set. A corner is the intersection point of n different planes, each given by a single equation; that is, a corner point is obtained by turning n of the n + m inequalities into equalities and finding their intersection.³⁸ An edge is the intersection of n − 1 inequalities and connects two corners. Geometrically, it can be observed that when you maximize or minimize a linear function, as you progress in one direction in the search space, the objective will either increase monotonically or decrease monotonically. Therefore, the maximum and minimum will be found at the corners of the allowed region.
37 It is a rare feature to have linear inequality constraints.
³⁸In general, there are (n + m)! / (n! m!) intersections.
The feasible set is in the form of a finite interval in n dimensions. Figure 4.57 pictorially depicts a typical example of the feasible region for n = 3. The constraints Ax ≥ b and x ≥ 0 would allow a tetrahedron or pyramid in the first (or completely positive) octant. If the constraint were an equality, Ax = b, the feasible set would be the shaded triangle in the figure. In general, for any n, the constraints Ax ≥ b, x ≥ 0 would yield a polyhedron as the feasible set. The task of maximizing (or minimizing) the linear objective function xᵀc = Σ_{i=1}^{n} ci xi translates to finding a solution at one of the corners of the feasible region. Corners are points where some of the inequality constraints are tight or active, and others are not. At the corners, some of the inequality constraints translate to equalities. It is just a question of finding the right corner.
Why not just search all corners for the optimal answer? The trouble is that there are lots of corners. In n dimensions, with m constraints, the number of corners grows exponentially and there is no way to check all of them. There is an interesting competition between two quite different approaches for solving linear programs: the simplex method and interior point methods. Introducing slack variables s ≥ 0 so that

Ax − s = b

the LP (4.111) can be rewritten, with y = (x, s), dᵀ = [cᵀ 0ᵀ] and M = [−A I], as

min_{y∈ℜ^{n+m}} yᵀd
subject to My = −b,  y ≥ 0        (4.112)
We will assume that the matrix A (and therefore M ) is of full row rank, that
is of rank m. In practice, a preprocessing phase is applied to the user-supplied
data to remove some redundancies from the given constraints to get a full row
rank matrix.
The following definitions and observations will set the platform for the simplex algorithm, which we will describe subsequently.

1. Consider index sets B ⊆ {1, . . . , n + m} with exactly m elements, such that the m × m matrix B formed by the corresponding columns of M is nonsingular. A set B satisfying these properties is called a basis for the problem (4.112). The corresponding matrix B is called the basis matrix. Any variable yi for i ∈ B is called a basic variable, while any variable yi for i ∉ B is called a free variable.
2. It can be seen that all basic feasible points of (4.112) are corners of the
feasible simplex S = {x|Ax ≥ b, x ≥ 0} and vice versa. In other words, a
corner of S corresponds to a point y in the new representation that has n
components as zeroes.
(a) If (4.112) has a nonempty feasible region, then there is at least one
basic feasible point
(b) If (4.112) has solutions, then at least one such solution is a basic
optimal point
(c) If (4.112) is feasible and bounded, then it has an optimal solution.
Using the ideas and notations presented above, the simplex algorithm can
be outlined as follows.
2. Our first step is to get one basic variable alone on each row. Without loss of generality, we will renumber the variables and rearrange the corresponding columns of M so that at every iteration, y₁, y₂, . . . , yₘ are the basic variables and the rest are free (i.e., 0). The first m columns of M form an m × m square matrix B and the last n form an m × n matrix N. The cost vector d can also be split as dᵀ = [d_Bᵀ d_Nᵀ], and the variable vector can be split as yᵀ = [y_Bᵀ y_Nᵀ] with y_N = 0. To operate with the tableau, we will split it as

   [ B     N     −b ]
   [ d_Bᵀ  d_Nᵀ   0 ]

   Further, we will ensure that all the columns corresponding to basic variables are in unit form:

   [ I                   B⁻¹N               −B⁻¹b    ]
   [ d_Bᵀ − d_Bᵀ I = 0   d_Nᵀ − d_BᵀB⁻¹N    d_BᵀB⁻¹b ]
5. With the new choice of basic variables, steps (2)-(4) are repeated till the
reduced cost is completely non-negative. The variables corresponding to
the unit columns in the final tableau are the basic variables at the opti-
mum.
What we have not discussed so far is how to obtain the initial basic feasible point. If x = 0 satisfies Ax ≥ b, we can have an initial basic feasible point with the basic variables comprising s, and with x constituting the free variables. This is illustrated through the following example. Consider the problem
The most negative component of the reduced cost vector is for k = 3. The pivot row number is

2 = argmin_{t = 1, 2, . . . , m : (B⁻¹N)_{tk} > 0} (B⁻¹b)_t / (B⁻¹N)_{tk}

Thus, the leaving basic variable is s₂ (the basic variable corresponding to the second row), while the entering free variable is x₃. Performing Gauss elimination to obtain column k = 3 in unit form, we get

[ −1     −1/6    0   1   −1/120   0 |  36 ]
[ 26/3    5/6    1   0    1/60    0 |  48 ]
[ −5/3   −10/3   0   0   −2/3     1 | 480 ]
[ −5/3   −4/3    0   0    1/3     0 | 960 ]

Note that the optimal solution has been found, since the reduced cost vector is non-negative. The optimal solution is x₁ = 72, x₂ = 0, x₃ = 0, s₁ = 48, s₂ = 0, s₃ = 600, and the cost is cᵀx = −1080.
What if x = 0 does not satisfy Ax ≥ b? The choice of s as the basic
variables and x as the free variables will not be valid. As an example, consider
the problem
min 30x1 + 60x2 + 70x3
x1 ,x2 ,x3 ∈ℜ
subject to x1 + 3x2 + 4x3 ≥ 14
2x1 + 2x2 + 3x3 ≥ 16
x1 + 3x2 + 2x3 ≥ 12
x1 ≥ 0, x2 ≥ 0, x3 ≥ 0
The initial tableau is
−1 −3 −4 1 0 0 −14
−2 −2 −3 0 1 0 −16
−1 −3 −2 0 0 1 −12
30 60 70 0 0 0 0
With the choice of basic and free variables as above, we are not even in the
feasible region to start off with. In general, if we have any negative number in
the last column of the tableau, x = 0 is not in the feasible region. Further,
we have no negative numbers in the bottom row, which does not leave us with
any choice of cost reducing free variable. But this is not of primary concern,
since we first need to maneuver our way into the feasible region. We do this
by moving from one basic point (that is, a point having not more than n zero
components) to another till we land in the feasible region, which is indicated
by all positive components in the extreme right hand column. This movement
from one basic point to another is not driven by negative components in the
cost vector, but rather by the negative components in the right hand column.
The new rules for moving from one basic point to another are:
1. Pick39 any negative number in the far right column (excluding the last
row). Let this be in the q th row for q < m + 1.
39 Note that there is no priority here.
We are done! The reduced cost vector has no more negative components. The
optimal basic feasible point is x1 = 4.75, x2 = 1.75, x3 = 1 and the optimal cost
is 317.5.
The dual of the LP (4.111) is

max_{λ∈ℜᵐ} λᵀb
subject to Aᵀλ ≤ c
           λ ≥ 0
The weak duality theorem (theorem 81) states that the objective function value
of the dual at any feasible solution is always less than or equal to the objective
function value of the primal at any feasible solution. That is, for any primal
feasible x and any dual feasible λ,
cT x − bT λ ≥ 0
1. First, the primal LP is converted into the standard form (4.113), whose dual is (4.114):

   min_{y∈ℜ^{n+m}} yᵀd
   subject to My = −b        (4.113)
              y ≥ 0

   max_{λ∈ℜᵐ} −λᵀb
   subject to Mᵀλ ≤ d        (4.114)
2. Next, we set up the barrier method formulation of the dual of the linear program, letting µ > 0 be a given fixed parameter, which is decreased during the course of the algorithm. We also insert slack variables ξ = [ξ₁, ξ₂, . . . , ξₙ]ᵀ ≥ 0. The barrier method formulation of the dual is then given by:

   max_{λ∈ℜᵐ} −λᵀb + µ Σ_{i=1}^{n} ln ξi
   subject to Mᵀλ + ξ = d        (4.115)
3. The first order necessary optimality conditions for (4.115), with y as the vector of multipliers, yield the system

   Mᵀλ + ξ = d
   My = −b        (4.116)
   diag(ξ) diag(y) 1 = µ1
4. A Newton step for (4.116) at the current iterate solves

   Mᵀ∆λ + ∆ξ = 0
   M∆y = 0        (4.117)
   (yi + ∆yi)(ξi + ∆ξi) = µ, i = 1, 2, . . . , n
Ignoring the second order term ∆yi∆ξi in the third equation, and solving the system of equations in (4.117), we get the following update rules:

∆λ^(k) = −( M diag(y^(k)) diag(ξ^(k))⁻¹ Mᵀ )⁻¹ M diag(ξ^(k))⁻¹ ( µ1 − diag(y^(k)) diag(ξ^(k)) 1 )
∆ξ^(k) = −Mᵀ∆λ^(k)        (4.118)
∆y^(k) = diag(ξ^(k))⁻¹ ( µ1 − diag(y^(k)) diag(ξ^(k)) 1 − diag(y^(k)) ∆ξ^(k) )
6. Now we have a feasible primal solution y^(k+1) and a feasible dual solution (λ^(k+1), ξ^(k+1)) given by

   y^(k+1) = y^(k) + t^(k)_{max,P} ∆y^(k)
   λ^(k+1) = λ^(k) + ∆λ^(k)        (4.119)
   ξ^(k+1) = ξ^(k) + t^(k)_{max,D} ∆ξ^(k)
7. For user-specified small thresholds ǫ₁ > 0 and ǫ₂ > 0, if the duality gap dᵀy^(k+1) + bᵀλ^(k+1) is not sufficiently close to 0 (that is, it exceeds ǫ₁), or if µ > ǫ₂, the parameter µ is reduced as

   µ = µ × ρ

   for some ρ ∈ (0, 1), and the steps above are repeated.
This problem is called the least squares problem with linear constraints.
In practice, incorporating the constraints C^T x = 0 properly makes quite a
difference. In many regularization problems, the least squares problem comes
with a quadratic constraint, typically of the form ||x||_2^2 ≤ α^2. This problem
is termed the least squares problem with quadratic constraints.
The classical statistical model assumes that all the error occurs in the vector
b. But sometimes the data matrix A is itself not known very well, owing to
errors in the variables. This is the model we have in the simplest version of the
total least squares problem.
While there is always a solution to the least squares problem (4.120), there is
not always a solution to the total least squares problem (4.133). Finally, one
can have a combination of linear and quadratic constraints in a least squares
problem, yielding a least squares problem with linear and quadratic constraints.
We will briefly discuss the problem of solving linear least squares problems
and total least squares problems with a linear or a quadratic constraint (due to
regularization). The importance of Lagrange multipliers will be introduced in
the process. We will discuss stable numerical methods for when the data matrix
A is singular or nearly singular, and also present iterative methods for large
and sparse data matrices. There are many applications of least squares problems,
including statistical methods, image processing, data interpolation, surface
fitting, and geometrical problems.
Thus,

x^* = (A^T A)^{−1} A^T b          (4.124)
This is the classical way statisticians solve the least squares problem. It can
be computed very efficiently, and many software packages implement this solution.
The computation time is linear in the number of rows of A and quadratic
in the number of columns. For extremely large A, it can become important to
exploit the structure of A to solve the problem efficiently, but for most problems
this approach suffices. In practice, least squares is very easy to recognize as an
objective function, and there are a few standard tricks to increase its flexibility;
for example, constraints can be handled to a certain extent by adding weights.
When the matrix A is not of full column rank, the solution to (4.120) may not
be unique.
We should note that while we get a closed-form solution to the problem
of minimizing the square of the Euclidean norm, this is not so for most other
norms, such as the infinity norm. However, there exist iterative methods for
solving least squares with the infinity norm that yield a solution in roughly as
much time as is taken to compute the solution using the analytical formula in
(4.124). Therefore, having a closed-form solution is not always computationally
helpful. In general, the method of solution to a least squares problem depends
on the sparsity and the size of A, and on the degree of accuracy desired.
In practice, however, it is not recommended to solve the least squares problem
using the classical equation (4.124), since the method is numerically unstable.
Numerical linear algebra instead recommends the QR decomposition to
accurately solve the least squares problem. This method is slower, but more
numerically stable, than the classical method. In theorem ??, we state a result
that compares the analytical solution (4.124) with the QR approach to the least
squares problem.
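To see the difference numerically, the following sketch (our own toy setup, not from the text) solves an ill-conditioned problem both ways. Since κ(A^T A) = κ(A)^2, the normal equations lose roughly twice as many digits of accuracy as the QR route.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100, 10
    t = np.linspace(0, 1, m)
    A = np.vander(t, n, increasing=True)   # an ill-conditioned Vandermonde matrix
    x_true = rng.standard_normal(n)
    b = A @ x_true

    # Normal equations (4.124): squares the condition number.
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)

    # QR approach: A = QR, then solve the triangular system R x = Q^T b.
    Q, R = np.linalg.qr(A)
    x_qr = np.linalg.solve(R, Q.T @ b)

    print(np.linalg.cond(A))              # kappa(A)
    print(np.linalg.norm(x_ne - x_true))  # error from the normal equations
    print(np.linalg.norm(x_qr - x_true))  # typically much smaller via QR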
Let A be an m × n matrix of either full row or full column rank. For the
case n > m, we saw on page ?? (summarised in Figure 3.3) that the system
Ax = b will have at least one solution, which means that the minimum value of
the objective function will be 0, attained at such a solution. We are interested
in the case m ≥ n, for which there is either no solution or a unique solution to
Ax = b, and we seek the x that minimizes ||Ax − b||_2^2.
where d ∈ ℜ^{m−n}.
The next theorem examines how the least squares solution and its residual
||Ax − b|| are affected by changes in A and b. Before stating the theorem, we
will introduce the concept of the condition number.
Condition Number
The condition number associated with a problem is a measure of how numerically
well-posed the problem is. A problem with a low condition number is said to
be well-conditioned, while a problem with a high condition number is said to be
ill-conditioned. For a linear system Ax = b, the condition number is defined as
the maximum ratio of the relative error in x (measured using any particular norm)
to the relative error in b. It can be proved (using the Cauchy-Schwarz
inequality) that the condition number equals ||A^{−1}|| ||A|| and is independent of b.
It is denoted by κ(A) and is also called the condition number of the matrix A.
κ(A) = σ_max(A) / σ_min(A) = ||A||_2 ||(A^T A)^{−1} A^T||_2
42 The classical Gram-Schmidt method is often numerically unstable. Golub [?] suggests a
transformation that dates back to the 1930s, in a book by Aikins, a statistician and a numerical
analyst.
where σ_max(A) and σ_min(A) are the maximal and minimal singular values of A,
respectively. For a real matrix A, the square roots of the eigenvalues of
A^T A are called its singular values. Further,
||r̂ − r^*|| / ||b|| ≤ ǫ {1 + 2κ(A)} min{1, m − n} + O(ǫ^2)
However, having a small residual does not necessarily imply that you will have
a good approximate solution.
The theorem implies that the sensitivity of the analytical solution x^* for
non-zero residual problems is measured by the square of the condition number,
whereas the sensitivity of the residual depends only linearly on κ(A). We note
that the QR method actually solves a nearby least squares problem.
x̂ = A^+ b

and

Z^T x = ( w )
        ( y )

Then

||Ax − b||_2^2 = ||Q^T A Z Z^T x − Q^T b||_2^2 = ||Rw − c||_2^2 + ||d||_2^2

The least squares solution is therefore given by

x̂ = Z ( R^{−1} c )
       (    0    )
One particular decomposition that can be used is the singular value decomposition
(c.f. Section 3.13) of A, with Q^T ≡ U^T, Z ≡ V and U^T A V = Σ. The
pseudo-inverse A^+ then has the following expression:

A^+ = V Σ^{−1} U^T

It can be shown that this A^+ is the unique minimal Frobenius norm solution to
the problem of minimizing ||AX − I||_F over all matrices X. This also shows
that the singular value decomposition can be looked upon as an optimization
problem.
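As a sketch of how this is used in practice (the function name and truncation rule are our own choices), the minimum-norm least squares solution x̂ = A^+ b can be computed from the SVD while treating tiny singular values as exact zeros:

    import numpy as np

    def svd_lstsq(A, b, tol=1e-12):
        # Minimum-norm least squares solution x = A^+ b via the SVD.
        # Singular values below tol * sigma_max are treated as exact zeros,
        # which is what makes this usable for singular or nearly singular A.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        s_inv = np.where(s > tol * s[0], 1.0 / s, 0.0)  # invert only significant sigma_i
        return Vt.T @ (s_inv * (U.T @ b))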
A greater problem arises with systems that are nearly singular. Numerically
and computationally, it seldom happens that the rank of a matrix is exactly r. A
classical example is the following n × n matrix K, which has a determinant of
1.
      1  −1  ...  −1  −1  ...  −1
      0   1  ...  −1  −1  ...  −1
      .   .  ...   .   .  ...   .
K =   0   0  ...   1  −1  ...  −1
      0   0  ...   0   1  ...  −1
      .   .  ...   .   .  ...   .
      0   0  ...   0   0  ...   1
The eigenvalues of this matrix are all equal to 1, and its rank is n. However,
a very small perturbation can reduce its rank to n − 1: subtracting 2^{−(n−2)}
from the (n, 1) entry of K makes it exactly singular. Such catastrophic problems
occur very often when you do large computations. The solution using the SVD
is applicable for nearly singular systems as well.
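A quick numpy check of this example (the size n is our own choice):

    import numpy as np

    n = 30
    K = np.eye(n) - np.triu(np.ones((n, n)), k=1)   # 1 on the diagonal, -1 above

    print(np.linalg.det(K))                 # 1.0: the determinant looks harmless
    K_pert = K.copy()
    K_pert[n - 1, 0] -= 2.0 ** -(n - 2)     # tiny perturbation of one entry
    print(np.linalg.matrix_rank(K_pert))    # n - 1: the rank has dropped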
We now need to solve not only for the unknowns x, but also for the Lagrange
multipliers; we have increased the dimensionality of the problem to n + p. If
x̂ = (A^T A)^{−1} A^T b denotes the solution of the unconstrained least squares
problem then, using the first system of equalities above, x can be expressed as

x = x̂ − (A^T A)^{−1} C λ          (4.125)

In conjunction with the second system, this leads to

C^T (A^T A)^{−1} C λ = C^T x̂          (4.126)
The unconstrained least squares solution can be obtained using the methods in
Section 4.8.1. Next, the value of λ can be obtained by solving (4.126). If A is
singular or nearly singular, we can use the singular value decomposition (or a
similar decomposition) of A to determine x̂. With the QR factorization A = QR,
equation (4.126) becomes

C^T R^{−1} (R^T)^{−1} C λ = C^T x̂

The QR factorization of (R^T)^{−1} C can be used to determine λ efficiently. Finally,
the value of λ can be substituted in (4.125) to solve for x. This technique yields
both solutions, provided that both exist.
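A minimal sketch of this multiplier route (assuming A has full column rank so that A^T A is invertible; names are our own):

    import numpy as np

    def constrained_lstsq(A, b, C):
        # min ||Ax - b||^2 subject to C^T x = 0, via (4.125)-(4.126).
        AtA = A.T @ A
        x_hat = np.linalg.solve(AtA, A.T @ b)        # unconstrained solution
        G = np.linalg.solve(AtA, C)                  # (A^T A)^{-1} C
        lam = np.linalg.solve(C.T @ G, C.T @ x_hat)  # (4.126)
        return x_hat - G @ lam                       # (4.125)

By construction C^T x = C^T x̂ − C^T (A^T A)^{−1} C λ = 0, so the returned point satisfies the constraints exactly (up to rounding).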
Another trick that is often employed when A^T A is singular or nearly singular
is to decrease its condition number by augmenting it in (4.125) with the
'harmless' term C W C^T and solve

x = x̂ − (A^T A + C W C^T)^{−1} C λ

The addition of C W C^T is considered harmless, since C^T x = 0 is to be imposed
anyway. The matrix W can be chosen to be an identity or near-identity matrix
that selects a few columns of C, just enough to make A^T A + C W C^T non-singular.
If we use the following notation:

A(W) = [ A^T A + C W C^T    C ]
       [ C^T                0 ]

and

A = A(0) = [ A^T A    C ]
           [ C^T      0 ]

and if A and A(W) are invertible for W ≠ 0, it can be proved that

A^{−1}(W) = A^{−1} − [ 0    0 ]
                     [ 0    W ]

Consequently,

κ(A(W)) ≤ κ(A) + ||W||_2 ||C||_2 + α ||W||

for some α > 0. That is, the condition number of A(W) is bounded by the
condition number of A plus some positive terms.
Another useful technique is to find an approximation to (4.121) by solving
the following weighted unconstrained minimization problem.
U^T A X = diag(α_1, . . . , α_m)
V^T C^T X = diag(γ_1, . . . , γ_m)

where U and V are orthogonal matrices and X is some general matrix. The
solution to the constrained problem can then be expressed as

x̂ = Σ_{i=1}^p (u_i^T b / α_i) x_i

where x_i denotes the i-th column of X.
Q^T C = ( R )   } p rows
        ( 0 )   } n − p rows          (4.127)

This yields

A Q = ( A_1   A_2 )          (4.128)

and

Q^T x = ( y )   } p
        ( z )   } n − p          (4.129)

Since C^T x = (Q^T C)^T (Q^T x) = R^T y, the constraint C^T x = 0 forces y = 0,
and the solution takes the form

x̂ = Q ( 0 )          (4.130)
       ( ẑ )

where

ẑ = argmin_z ||b − A_2 z||^2
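A compact sketch of this route (names are our own; C is n × p with full column rank):

    import numpy as np

    def nullspace_lstsq(A, b, C):
        # min ||Ax - b||^2 subject to C^T x = 0, via (4.127)-(4.130).
        n, p = C.shape
        Q, _ = np.linalg.qr(C, mode='complete')   # full n x n Q with Q^T C = [R; 0]
        A2 = A @ Q[:, p:]                         # the block A_2 of (4.128)
        z_hat, *_ = np.linalg.lstsq(A2, b, rcond=None)  # z^ = argmin ||b - A_2 z||
        return Q[:, p:] @ z_hat                   # x^ = Q [0; z^], as in (4.130)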
Since the objective function as well as the constraint function are convex, the
KKT conditions (c.f. Section 4.4.4) are necessary and sufficient conditions for
optimality at the primal-dual variable pair (x̂, µ̂). The KKT conditions lead
to the following equation:

b^T A (A^T A + µI)^{−2} A^T b − α^2 = 0
Further, the matrix A can be diagonalized using its singular value decomposition
A = U ΣV^T to obtain the following equation, which is to be solved for µ:

Σ_{i=1}^n σ_i^2 β_i^2 / (σ_i^2 + µ)^2 − α^2 = 0

where β_i = u_i^T b.
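The left-hand side is monotonically decreasing in µ, so a simple bracketing search suffices. The sketch below (our own, assuming the unconstrained solution violates the constraint, i.e. the left-hand side is positive at µ = 0) recovers both µ̂ and x̂:

    import numpy as np

    def ridge_mu(A, b, alpha):
        # Solve sum_i sigma_i^2 beta_i^2 / (sigma_i^2 + mu)^2 = alpha^2 for mu > 0
        # by bisection, then recover x = (A^T A + mu I)^{-1} A^T b via the SVD.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        beta = U.T @ b

        def g(mu):
            return np.sum((s * beta) ** 2 / (s ** 2 + mu) ** 2) - alpha ** 2

        lo, hi = 0.0, 1.0
        while g(hi) > 0:                 # grow the bracket until g changes sign
            hi *= 2.0
        for _ in range(200):             # plain bisection on [lo, hi]
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
        mu = 0.5 * (lo + hi)
        x = Vt.T @ (s * beta / (s ** 2 + mu))
        return mu, x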