
MATH2431 : Non-Linear Optimization (Recap)

Limits

Continuity

Differentiation

Higher Derivatives

Taylor Series Expansion

Partial Derivatives

1
Lecture 1

Sufficient Condition. A sufficient condition is a condition that guarantees a particular outcome.


Example: A sufficient condition for Imani to visit France is that she goes to the Eiffel Tower.
Explanation:

Necessary Condition. A necessary condition is a condition that must hold to achieve a particular
outcome.
Example: A necessary condition for Chelsea to win the Premier League is that they buy a left-footed striker.
Explanation:

DEFINITION: A nonlinear programming problem (NLP) is any problem that can be expressed as follows: find the decision variables x1, x2, . . . , xn that

max (or min) z = f(x1, x2, ..., xn)

s.t. g1(x1, x2, ..., xn) (≤, ≥, or =) b1
     g2(x1, x2, ..., xn) (≤, ≥, or =) b2
     ⋮                                                  (1)
     gm(x1, x2, ..., xn) (≤, ≥, or =) bm,

where f(x1, x2, . . . , xn) and g1(x1, x2, . . . , xn), . . . , gm(x1, x2, . . . , xn) cannot all be linear functions. f(x1, x2, ..., xn) is the NLP's objective function, and gi(x1, x2, ..., xn) (≤, ≥, or =) bi, i = 1, 2, . . . , m, are the NLP's constraints. An NLP with no constraints is said to be an unconstrained NLP.

2
DEFINITION: The feasible region for an NLP is the set of points (x1 , x2 , ..., xn ) that
satisfy the m constraints in (1). A point in the feasible region is a feasible point, and a point
that is not in the feasible region is an infeasible point.

DEFINITION: Any point x̄ in the feasible region for which f (x̄) ≥ f (x) holds for all points
x in the feasible region is an optimal solution to the NLP, given that the NLP is a max
problem. [For a minimization problem, x̄ is the optimal solution if f (x̄) ≤ f (x) for all feasible
x.]

Example: It costs a company c dollars per unit to manufacture a product. If the company
charges p dollars per unit for the product, customers demand D(p) units. To maximize
profits, what price should the firm charge?

Answer: The firm’s decision variable is p. Since the firm’s profit is (p − c)D(p), the firm
wants to solve the following unconstrained maximization problem: max(p − c)D(p).

Example If K units of capital and L units of labor are used, a company can produce
KL units of a manufactured good. Capital can be purchased at $4/unit and labor can be
purchased at $1/unit. A total of $8 is available to purchase capital and labor. How can the
firm maximize the quantity of the good that can be manufactured?

Answer: The firm wants to solve the following constrained maximization problem:

max z = KL

s.t. 4K + L ≤ 8

K, L ≥ 0

3
Recap:

Let c be a number in the domain , D, of a function f. Then f (c) is the

– absolute(global) maximum value of f on D if f (c) ≥ f (x)∀x ∈ D.

– absolute minimum value of f on D if f (c) ≤ f (x)∀x ∈ D

– local(relative) maximum value of f if f (c) ≥ f (x) when x is near c.

– local minimum value of f if f (c) ≤ f (x) when x is near c.

Extreme Value Theorem. If f is continuous on a closed interval [a, b], then f attains an
absolute maximum value f (c) and an absolute minimum value f (d) at some number c and d
in [a, b].

Fermat's Theorem. If f has a local maximum or minimum at c, and if f′(c) exists, then f′(c) = 0.

A critical number of a function f is a number c in the domain of f such that either f′(c) = 0 or f′(c) does not exist.

A set of points in R is said to be bounded if it is a subset of some finite interval.

Let h > 0(h ∈ R), and let x ∈ R be given. The open interval (x − h, x + h) is called a
neighborhood with x as center and of radius h. We denote the neighborhood by N (x; h)
or simply N (x) if the radius is unimportant.

4
Let S be a set in R and assume x ∈ S. Then x is called an interior point of S if there is
some neighborhood, N (x), all of whose points belong to S.

Let S be a set in R. Then S is called an open set if every point of S is an interior point of S.

Supremum and Infimum

Let A ⊆ R then m ∈ R is the infimum or the greatest lower bound of A if :

1. m is a lower bound for A.

2. For any other lower bound m0 of A, we have m ≥ m0 .

Let A ⊆ R then M ∈ R is the supremum or the least upper bound of A if :

1. M is an upper bound for A.

2. For any upper bound M0 of A, we have M ≤ M0 .

Theorem: Every non-empty subset of R that is bounded above has a least upper bound
and also if it is bounded below then it has a greatest lower bound.

Example: Classify the following subsets of R as bounded/unbounded; if bounded, write down the supremum and infimum.

1. {x ∈ R : |x − 3| ≤ 2}

2. {x ∈ R : |x + 3| ≥ 2}

3. {1/x² : x ∈ Z⁺}

4. {1/x² : x ∈ R\{0}}

5
Example: Find the supremum and infimum of the following functions in the given intervals

1. f (x) = ln (cos x + 2) in the interval (−∞, ∞).


2. f(x) = ∫₀ˣ (sin¹⁰ t + sin t + 2) dt, x ∈ [0, 2π].

6
Extension of Previous Ideas.

Local and Absolute Extremum


Let f be a real-valued function defined on a set S in Rn . Then f is said to have an absolute
maximum on the set S if there exists a point a in S such that

f (x) ≤ f (a), ∀x ∈ S.

If a ∈ S and if there is a neighborhood N (a) such that

f (x) ≤ f (a), ∀x ∈ N (a) ∩ S.

then f is said to have a relative maximum at the point a. (Absolute minimum and local minimum
are similarly defined, using f (x) ≥ f (a).)

7
1 Solving NLPs With One Variable

In this section, we explain how to solve the NLP

Max (or Min)f (x)

s.t. x ∈ [a, b] (2)

[If b = ∞, then the feasible region for NLP (2) is x ≥ a, and if a = −∞, then the feasible region
for (2) is x ≤ b.]

To find the optimal solution to (2), we find all local maxima (or minima). Then the optimal
solution to (2) is the local maximum (or minimum) having the largest (or smallest) value of f (x).
Note: if a = −∞ or b = ∞, then (2) may have no optimal solution.
There are three types of points for which (2) can have a local maximum or minimum
(these points are often called extremum candidates):

Case 1. Points where a < x < b, and f 0 (x) = 0 [called a stationary point of f (x)].

Case 2. Points where f 0 (x) does not exist.

Case 3. Endpoints a and b of the interval [a, b].

8
1.1 Case 1. Points where a < x < b and f′(x) = 0

Suppose a < x0 < b and f′(x0) exists. If x0 is a local maximum or a local minimum, then f′(x0) = 0.

Necessary condition for relative extremum

Let f(x) be a continuous function having a continuous second derivative in a neighbourhood of x0. A necessary condition for (x0, f(x0)) to be an extremum is that f′(x0) = 0.

9
Proof. The Taylor series for f around x0 is

f(x) = f(x0) + f′(x0)(x − x0) + (f″(η2)/2)(x − x0)²,

where η2 lies between x and x0. So

f(x) − f(x0) = f′(x0)(x − x0) + (f″(η2)/2)(x − x0)²
             = [f′(x0) + (f″(η2)/2)(x − x0)](x − x0).

Assuming continuous derivatives, (f″(η2)/2)(x − x0) is bounded in a neighbourhood of x0. Let the neighbourhood be small enough that

|(f″(η2)/2)(x − x0)| = r < ε for any prescribed ε > 0 (in particular, r < |f′(x0)| below).

Case 1: f′(x0) = k > 0. Then

f(x) − f(x0) = [k + r](x − x0), with |r| < k,
             = δ(x − x0),

where δ > 0. Since (x − x0) may vary in sign, δ(x − x0) varies in sign and so f(x) − f(x0) varies in sign in a neighbourhood of x0. Therefore (x0, f(x0)) is not an extremum.

Case 2: f′(x0) = k < 0. Then

f(x) − f(x0) = [k + r](x − x0), with |r| < −k,
             = δ(x − x0),

where δ < 0. Since (x − x0) may vary in sign, δ(x − x0) varies in sign and so f(x) − f(x0) varies in sign in a neighbourhood of x0. Therefore (x0, f(x0)) is not an extremum.

Thus, f′(x0) must be equal to 0.

10
Necessary condition for relative extremum (alternate proof)

Let f(x) be a continuous function having a continuous second derivative in a neighbourhood of x0. A necessary condition for (x0, f(x0)) to be an extremum is that f′(x0) = 0.

Since f′(x0) exists,

f′(x0) = lim_{x→x0} [f(x) − f(x0)]/(x − x0)

is either less than 0 (Case 1), greater than 0 (Case 2), or equal to 0 (Case 3).

• Case 1

If lim_{x→x0} [f(x) − f(x0)]/(x − x0) < 0, then lim_{x→x0⁺} [f(x) − f(x0)]/(x − x0) < 0 and lim_{x→x0⁻} [f(x) − f(x0)]/(x − x0) < 0, where x − x0 > 0 and x − x0 < 0, respectively.

Since lim_{x→x0⁺} [f(x) − f(x0)]/(x − x0) < 0 and x − x0 > 0, we have f(x) − f(x0) < 0 to the right of x0. Also, since lim_{x→x0⁻} [f(x) − f(x0)]/(x − x0) < 0 and x − x0 < 0, we have f(x) − f(x0) > 0 to the left of x0.

Thus, f(x) − f(x0) changes sign in a neighborhood of x0. Hence, f(x0) is not an extremum.

• Case 2

If lim_{x→x0} [f(x) − f(x0)]/(x − x0) > 0, then lim_{x→x0⁺} [f(x) − f(x0)]/(x − x0) > 0 and lim_{x→x0⁻} [f(x) − f(x0)]/(x − x0) > 0, where x − x0 > 0 and x − x0 < 0, respectively.

Since lim_{x→x0⁺} [f(x) − f(x0)]/(x − x0) > 0 and x − x0 > 0, we have f(x) − f(x0) > 0 to the right of x0. Also, since lim_{x→x0⁻} [f(x) − f(x0)]/(x − x0) > 0 and x − x0 < 0, we have f(x) − f(x0) < 0 to the left of x0.

Thus, f(x) − f(x0) changes sign in a neighborhood of x0. Hence, f(x0) is not an extremum.

• Case 3
If the limit exists and equals 0, then we obtain the desired result.

Therefore, based on the three possibilities, (x0, f(x0)) cannot be an extremum unless f′(x0) = 0.

11
Sufficient condition for relative extrema

Let f(x) be a continuous function having a continuous third derivative in a neighbourhood of x0. A sufficient condition for (x0, f(x0)) to be an extremum is that f′(x0) = 0 and f″(x0) ≠ 0. In the case where f″(x0) > 0, (x0, f(x0)) is a relative minimum point, and in the case where f″(x0) < 0, (x0, f(x0)) is a relative maximum point.

Proof. Suppose f′(x0) = 0 and f has a continuous third derivative. Then the Taylor series for f around x0 is

f(x) = f(x0) + f′(x0)(x − x0) + (f″(x0)/2)(x − x0)² + (f‴(η3)/3!)(x − x0)³

f(x) − f(x0) = (f″(x0)/2)(x − x0)² + (f‴(η3)/3!)(x − x0)³
             = [f″(x0)/2 + (f‴(η3)/3!)(x − x0)](x − x0)²

where η3 lies between x and x0.

Since f‴(x) is continuous, (f‴(η3)/3!)(x − x0) is bounded in a neighbourhood of x0. Let the neighbourhood be such that

|(f‴(η3)/3!)(x − x0)| = r < ε for any ε > 0.

Case 1: f″(x0) > 0. In this case f″(x0)/2 = k > 0 and

f(x) − f(x0) = [k + r](x − x0)², with |r| < k,
             = δ(x − x0)²

where δ > 0. Since (x − x0)² > 0 for all x ≠ x0 in a neighbourhood of x0,

f(x) − f(x0) = δ(x − x0)² > 0

for all such x. Therefore (x0, f(x0)) is a relative minimum point.

Case 2: f″(x0) < 0. In this case f″(x0)/2 = k < 0 and

f(x) − f(x0) = [k + r](x − x0)², with |r| < −k,
             = δ(x − x0)²

where δ < 0. Since (x − x0)² > 0 for all x ≠ x0 in a neighbourhood of x0,

f(x) − f(x0) = δ(x − x0)² < 0

for all such x. Therefore (x0, f(x0)) is a relative maximum point.

Theorem: Suppose f′(x0) = 0.

• If the first non-vanishing (nonzero) derivative at x0 is an odd-order derivative [f‴(x0), f⁽⁵⁾(x0), and so on], then x0 is not a local maximum or a local minimum.

• If the first non-vanishing derivative at x0 is positive and an even-order derivative, then x0 is a local minimum.

• If the first non-vanishing derivative at x0 is negative and an even-order derivative, then x0 is a local maximum.

13
1.2 Case 2. Points Where f 0 (x) Does Not Exist

If f (x) does not have a derivative at x0 , x0 may be a local maximum, a local minimum, or neither.
In this case, we determine whether x0 is a local maximum or a local minimum by checking values
of f (x) at points x1 < x0 and x2 > x0 near x0 . This results in four possible cases:

Relationship between f(x0), f(x1), and f(x2)        Conclusion about x0

f(x0) > f(x1); f(x0) < f(x2)                        Not a local extremum
f(x0) < f(x1); f(x0) > f(x2)                        Not a local extremum
f(x0) ≥ f(x1); f(x0) ≥ f(x2)                        Local maximum
f(x0) ≤ f(x1); f(x0) ≤ f(x2)                        Local minimum

Table 1: How to determine whether a point where f′(x) does not exist is a local maximum or a local minimum.

1.3 Case 3. Endpoints a and b of [a, b]

• If f 0 (a) > 0, then a is a local minimum.

• If f 0 (a) < 0, then a is a local maximum.

• If f 0 (b) > 0, then b is a local maximum.

• If f 0 (b) < 0, then b is a local minimum.

• What if f 0 (a) = 0 or f 0 (b) = 0?
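Taken together, Cases 1–3 give a concrete recipe for problem (2): collect every stationary point, every point where f′ does not exist, and the endpoints, evaluate f at each, and keep the best value. A minimal numerical sketch of that recipe in Python, assuming f is differentiable on [a, b] so only stationary points and endpoints need checking (the helper name and grid search are illustrative, not part of the notes):

import numpy as np

def candidates_and_optimum(f, fprime, a, b, n_grid=10001):
    """Approximate the extremum candidates of f on [a, b]:
    stationary points (sign changes of f') plus the endpoints."""
    xs = np.linspace(a, b, n_grid)
    fp = fprime(xs)
    # grid intervals where f' changes sign bracket a stationary point
    stationary = [0.5 * (xs[i] + xs[i + 1])
                  for i in range(n_grid - 1) if fp[i] * fp[i + 1] <= 0]
    cands = np.array([a, b] + stationary)
    vals = f(cands)
    return cands, cands[np.argmax(vals)], cands[np.argmin(vals)]

# Illustration: f(x) = x^3 - 3x^2 on [-1, 4] (cf. the exercise below)
f = lambda x: x**3 - 3 * x**2
fp = lambda x: 3 * x**2 - 6 * x
cands, x_max, x_min = candidates_and_optimum(f, fp, -1.0, 4.0)
print(cands, x_max, x_min)   # stationary points near x = 0 and x = 2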

14
Exercise

1. Given that f (x) = x3 − 3x2 . Is f (1) a relative extremum?

2. Find the extrema for the function f (x) = x3 − 3x2 .

Note: In Economics one is interested in maximizing the total profit function T (q) where q
is the quantity of a commodity manufactured or purchased. The price per unit is p(q) and
the cost of production is C(q). It follows that the total revenue function is R(q) = q × p(q)
and
T (q) = R(q) − C(q).

That is, Total Profit = Total Revenue − Total Cost. For T to be maximized, a necessary condition is dT/dq = 0 (see the sketch after this exercise list).

3. If a company charges a price p for a product, then it can sell 3e^(−p) thousand units of the product. Then, f(p) = 3000pe^(−p) is the company's revenue if it charges a price p.

a. For what values of p is f(p) decreasing? For what values of p is f(p) increasing?

b. Suppose the current price is $4 and the company increases the price by 5 cents. By approximately how much would the company's revenue change?

4. Given that C(q) = q³ − 10q² + 17q + 66 and p = 5 ($). Find the most profitable production level (the value of q that will maximize profit).

5. Given that C(q) = 40q + 20,000 and p = 160 − 0.01q ($). Find the most profitable production level.
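The condition dT/dq = 0 from the note above can be checked symbolically. Below is a sketch of that procedure using sympy, applied to the data of exercise 4 (C(q) = q³ − 10q² + 17q + 66, p = 5); it lists the non-negative critical quantities with the profit and second derivative at each, so the maximizing q can be read off:

import sympy as sp

q = sp.symbols('q', real=True)
p = 5                                   # price per unit (exercise 4)
C = q**3 - 10*q**2 + 17*q + 66          # total cost
T = p*q - C                             # total profit T(q) = R(q) - C(q)

critical = sp.solve(sp.Eq(sp.diff(T, q), 0), q)      # dT/dq = 0
for qc in critical:
    if qc.is_real and qc >= 0:
        concavity = sp.diff(T, q, 2).subs(q, qc)     # second-derivative test
        print(qc, sp.simplify(T.subs(q, qc)), concavity)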

15
2 Solving NLPs With Multiple Variables

In this section, we explain how to solve NLPs having more than one variable.
There are many formulas where a given quantity depends on two or more variables.
Example:

• A = ½bh

• V = lwh

• x̄ = (1/n)(x1 + x2 + · · · + xn) (arithmetic average)

We say A is a function of two variables and write A(b, h) = ½bh. Similarly

• V(l, w, h) = lwh

• x̄(x1, x2, . . . , xn) = (1/n) Σ_{i=1}^n xi

How do we differentiate a function of more than one variable?

16
Partial Derivative

Maybe the simplest way to proceed is to reduce the discussion to the one-dimensional case by
treating a function of several variables as a function of one variable at a time, holding the others
fixed.
Example: Find the first partial derivatives of f (x, y) = x2 + 2xy 2 .

Recall the definition of the derivative of a function of a single variable,

f′(x) = lim_{h→0} [f(x + h) − f(x)]/h.

Analogously, we consider z = g(x, y):

gx(x, y) = ∂/∂x [g(x, y)] = lim_{h→0} [g(x + h, y) − g(x, y)]/h

and

gy(x, y) = ∂/∂y [g(x, y)] = lim_{h→0} [g(x, y + h) − g(x, y)]/h.

Example: Find the first partial derivatives of f(x, y) = x² + 2xy² using the definition.

A more general definition of the partial derivative takes the following form.

Definition: Let f : Rⁿ → R be a real-valued function. The partial derivative of f with respect to xi is denoted by fi(x1, x2, . . . , xi, . . . , xn) = ∂f/∂xi and is given by

∂f/∂xi = lim_{h→0} [f(x1, x2, . . . , xi + h, . . . , xn) − f(x1, x2, . . . , xn)]/h.
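As a quick numerical check of the limit definition, one can approximate ∂f/∂x and ∂f/∂y by finite differences and compare with the exact partial derivatives of the example f(x, y) = x² + 2xy² (namely fx = 2x + 2y² and fy = 4xy). A minimal sketch, with the step size h chosen arbitrarily:

def partial(f, x, i, h=1e-6):
    """Forward-difference approximation of the partial derivative of f
    with respect to x_i at the point x."""
    xp = list(x)
    xp[i] += h
    return (f(xp) - f(x)) / h

f = lambda v: v[0]**2 + 2 * v[0] * v[1]**2   # f(x, y) = x^2 + 2xy^2
pt = [1.0, 2.0]
print(partial(f, pt, 0), 2*pt[0] + 2*pt[1]**2)   # approx fx = 2x + 2y^2 = 10
print(partial(f, pt, 1), 4*pt[0]*pt[1])          # approx fy = 4xy = 8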

17
Notation For Partial Derivatives

fx(x, y) = fx = ∂f/∂x = ∂/∂x [f(x, y)] = ∂z/∂x = f1 = D1f = Dxf

fy(x, y) = fy = ∂f/∂y = ∂/∂y [f(x, y)] = ∂z/∂y = f2 = D2f = Dyf

(fx)x = fxx = f11 = ∂/∂x (∂f/∂x) = ∂²f/∂x² = ∂²z/∂x²

(fy)y = fyy = f22 = ∂/∂y (∂f/∂y) = ∂²f/∂y² = ∂²z/∂y²

(fx)y = fxy = f12 = ∂/∂y (∂f/∂x) = ∂²f/∂y∂x = ∂²z/∂y∂x

(fy)x = fyx = f21 = ∂/∂x (∂f/∂y) = ∂²f/∂x∂y = ∂²z/∂x∂y

Clairaut's Theorem: If f has continuous second partial derivatives, then

∂²f/∂x∂y = ∂²f/∂y∂x.

Example: Verify that f(x, y) = 3e^(2x) cos y and g(x, y) = tan⁻¹(xy) satisfy Clairaut's Theorem.

18
Remark: If we take a cross-section of the graph of f(x) along the xi-axis, then ∂f(x)/∂xi is the gradient of the resulting one-variable function at xi. That is, it is the slope of the function in the xi-direction.

Recall: For x0 ∈ R, the existence of the derivative of a function f at x0 implies continuity at x0. In contrast, a function of n variables can have partial derivatives at a point w.r.t. each of the variables and yet not be continuous at the point.
Example:

f(x, y) = { x + y,  if x = 0 or y = 0,
          { 1,      otherwise.

2.1 Cross-Partial Derivatives

Definition: Let f : Rⁿ → R be a real-valued function. For x = (x1, x2, . . . , xn) ∈ Rⁿ, the cross-partial derivative of f w.r.t. i and j is denoted by fij(x) and is the partial derivative of ∂f(x)/∂xi with respect to xj.

19
2.2 Dot Product

Definition: Given two vectors a = (a1 , . . . , an ) and b = (b1 , . . . , bn ) in Rn , the real number denoted
by a · b and defined by the equation

a · b = a1 b 1 + · · · + an b n .

is called the dot product of a and b.


It follows at once from this definition that we have

• a·b=b·a

• (λa) · b = λ(a · b)

• a · (b + c) = a · b + a · c

∀a, b and c ∈ Rn and λ ∈ R. Using inner product, the cauchy-schwarz inequality assumes the form
|a · b| ≤ |a||b|.

2.3 The Gradient Function

Definition: Let f : Rⁿ → R be a real-valued function. The gradient of f is the function ∇f : Rⁿ → Rⁿ which maps each x = (x1, x2, . . . , xn) ∈ Rⁿ to the vector of partial derivatives

∇f(x) = (∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn) ∈ Rⁿ.

2.4 Differential (total differential)

Definition: Let f : Rn → R be a real-valued function. For x = (x1 , x2 , . . . , xn ) ∈ Rn the


differential of f at x corresponding to a small change in x, ∆x = (∆x1 , ∆x2 , . . . , ∆xn ) ∈ Rn is
∇f (x) · ∆x.
Example: Let f (x, y) = x2 + y 2 . If (x, y) changes from (100, 100) to (100.1, 100.1), what is the
difference between ∆f (x) (the change in f ) and ∇f (x) · ∆x (the differential of f )?
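For the example just above, the change Δf and the differential ∇f(x)·Δx can be compared directly. A short sketch; the numbers follow the example (f(x, y) = x² + y², moving from (100, 100) to (100.1, 100.1)):

import numpy as np

f = lambda x, y: x**2 + y**2
grad = lambda x, y: np.array([2*x, 2*y])     # gradient of f is (2x, 2y)

p0 = np.array([100.0, 100.0])
p1 = np.array([100.1, 100.1])

delta_f = f(*p1) - f(*p0)                    # exact change, approx 40.02
differential = grad(*p0) @ (p1 - p0)         # linear approximation, 40.0
print(delta_f, differential, delta_f - differential)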

20
2.5 Total derivative

Definition: Let f : Rⁿ → R be a real-valued function. For x = (x1, x2, . . . , xn) ∈ Rⁿ, the total derivative of f with respect to t is df(x)/dt and is given by

df(x)/dt = (∂f(x)/∂x1)·(dx1/dt) + (∂f(x)/∂x2)·(dx2/dt) + · · · + (∂f(x)/∂xn)·(dxn/dt).

Here, among possibly other variables, x1, x2, . . . , xn are functions of t.

Example: For the price p and production output q(p), the profit function is π(x, y), where x = p and y = q(p). What is dπ/dp?

2.6 Directional derivatives

Definition: Let f : Rⁿ → R be a real-valued function and let u = <u1, u2, . . . , un> be a unit vector. For x = (x1, x2, . . . , xn) ∈ Rⁿ, the directional derivative of f at x in the direction of u is given by

lim_{h→0} [f(x + hu) − f(x)]/h.

Example: If f(x, y) = xy, what is the directional derivative of f(x, y) in the direction of h = (1, 1)ᵀ?

2.7 Hessian Matrix

Definition: Let f : Rⁿ → R be a real-valued function whose second-order partial derivatives exist. Then the Hessian matrix of f, denoted Hf(x), is the n × n matrix of second partial derivatives of f:

          [ f11  f12  . . .  f1n ]
Hf(x)  =  [ f21  f22  . . .  f2n ]
          [  ⋮    ⋮    ⋱     ⋮  ]
          [ fn1  fn2  . . .  fnn ]

21
2.8 Jacobian Matrix

Definition: Let F : Rⁿ → Rᵐ be a vector function whose first-order partial derivatives ∂fi/∂xj, i = 1, . . . , m, j = 1, . . . , n, exist. We define the Jacobian matrix to be the matrix of all first-order partial derivatives of the vector-valued function F:

          [ ∂f1/∂x1  ∂f1/∂x2  . . .  ∂f1/∂xn ]
JF(x)  =  [ ∂f2/∂x1  ∂f2/∂x2  . . .  ∂f2/∂xn ]
          [    ⋮        ⋮      ⋱       ⋮    ]
          [ ∂fm/∂x1  ∂fm/∂x2  . . .  ∂fm/∂xn ]

2.9 Differentiability

Definition: The function f is differentiable at (x, y) if it is defined for all points near (x, y) [that is, in a neighborhood of (x, y)] and if there exist numbers A, B [depending on f and (x, y)] such that

lim_{(h,k)→(0,0)} |f(x + h, y + k) − f(x, y) − (Ah + Bk)| / √(h² + k²) = 0.

Theorem: If a function is differentiable at a point, it is continuous there.

22
2.10 Taylor Series

Real Power Series. If x0, x, and an (n ∈ W) are real numbers, then the series Σₙ aₙ(x − x0)ⁿ is called a real power series. Its circle of convergence intersects the real axis in an interval of convergence.

Suppose we are given a real-valued function f defined in some neighborhood of a point x0 ∈ R, and suppose f has derivatives of every order in this neighborhood. Then we can certainly form the power series

Σ_{n=0}^∞ [f⁽ⁿ⁾(x0)/n!] (x − x0)ⁿ.

Question: Does this series converge for any x besides x = x0? If so, is its sum equal to f(x)?

Definition: Let f be a real-valued function defined on an interval I in R. If f has derivatives of every order at each point of I, we write f ∈ C∞ on I.

If f ∈ C∞ on some neighbourhood of a point x0, the power series Σ_{n=0}^∞ [f⁽ⁿ⁾(x0)/n!] (x − x0)ⁿ is called the Taylor series about x0 generated by f. To indicate that f generates this series, we write

f(x) ∼ Σ_{n=0}^∞ [f⁽ⁿ⁾(x0)/n!] (x − x0)ⁿ.

Question: When can we replace ∼ with =?

Taylor's formula states that if f ∈ C∞ on the closed interval [a, b] and if x0 ∈ [a, b], then, for every x ∈ [a, b] and for every n, we have

f(x) = Σ_{k=0}^{n−1} [f⁽ᵏ⁾(x0)/k!] (x − x0)ᵏ + [f⁽ⁿ⁾(x1)/n!] (x − x0)ⁿ,

where x1 lies between x and x0.

23
A necessary and sufficient condition for the Taylor series to converge to f(x) is that

lim_{n→∞} [f⁽ⁿ⁾(x1)/n!] (x − x0)ⁿ = 0.

Theorem: Assume that f ∈ C∞ on [a, b] and let x0 ∈ [a, b]. If there is a neighbourhood N(x0) and a constant M such that |f⁽ⁿ⁾(x)| ≤ M for every x ∈ N(x0) ∩ [a, b] and every n, then, for each x ∈ N(x0) ∩ [a, b], we have

f(x) = Σ_{n=0}^∞ [f⁽ⁿ⁾(x0)/n!] (x − x0)ⁿ.

Example: Find the third-order Taylor series of e^x about x0 = 0. How accurate is this approximation when x = 1?

Solution:

f(x) = Σ_{n=0}^∞ [f⁽ⁿ⁾(x0)/n!] (x − x0)ⁿ.

At x0 = 0, f(x) = Σ_{n=0}^∞ [f⁽ⁿ⁾(0)/n!] xⁿ.

Since dⁿ/dxⁿ e^x = e^x for all n ∈ N, and we are interested in the third-order Taylor series,

e^x = e⁰ + e⁰x + (e⁰/2!)x² + (e⁰/3!)x³ + (e^(x1)/4!)x⁴,

that is, e^x = 1 + x + x²/2! + x³/3! + (e^(x1)/4!)x⁴.

When x = 1, the error term is e^(x1)/4! with 0 < x1 < 1, and

e^(x1)/4! < e/24 < 3/24 = 0.125.

Thus, when x = 1, the approximation e^x ≈ 1 + x + x²/2! + x³/3! is accurate to within 0.125.
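The error bound above can be checked numerically. A small sketch comparing the third-order Taylor polynomial of e^x at x = 1 with the true value; the observed error should fall well inside the bound e^(x1)/4! < 0.125:

import math

x = 1.0
taylor3 = 1 + x + x**2 / math.factorial(2) + x**3 / math.factorial(3)
error = abs(math.exp(x) - taylor3)
print(taylor3, math.exp(x), error)   # error is about 0.0516, below the 0.125 bound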

24
The Taylor series for an infinitely differentiable function f(x) of a single variable can also be written as

δf = (df/dx) δx + (d²f/dx²) (δx)²/2! + . . . ,

where δx = x − x0 is a change in x, and δf = f(x) − f(x0) is the corresponding change in the value of the function.
Taylor series can also be constructed for functions of more than one variable. For a function f(x, y) of two independent variables, the analogous formula is

δf = (∂f/∂x) δx + (∂f/∂y) δy + (∂²f/∂x²) (δx)²/2! + (∂²f/∂y²) (δy)²/2! + (∂²f/∂x∂y) δxδy + . . . ,

which can be written compactly as

δf = (∂f/∂x) δx + (∂f/∂y) δy + (1/2!) (δx ∂/∂x + δy ∂/∂y)² f(x0, y0) + . . .

More generally,

Theorem: Let f have continuous partial derivatives of order m at each point of an open set S in Eⁿ. If a ∈ S, b ∈ S, a ≠ b, and if the line segment L(a, b) ⊂ S, then there exists a point z on the line segment L(a, b) such that

δf = Σ_{k=1}^{m−1} (1/k!) (Σ_{i=1}^n δxi ∂/∂xi)ᵏ f(a) + (1/m!) (Σ_{i=1}^n δxi ∂/∂xi)ᵐ f(z).

NB: Eⁿ ⊆ Rⁿ.

25
Example: Consider f(x, y, z) = xy² − x³yz + z⁴.

a. Compute ∇f(x, y, z).

b. Compute the directional derivative fh(0, 1, 2) in the direction of h = (1, 2, 2)ᵀ.

c. Compute a first-order approximation of f(1, 3, 4) using information about f at (0, 1, 2).

Example: Consider f : R² → R such that f(x, y) = (½x² − ⅓x³)(⅓y³ − ½y²).

a. Compute ∇f(x, y).

b. Compute Jf(x, y).

c. Evaluate Δ∇f(x, y) at x = (0.667, 0.667) and x + dx = (0.4722, 0.8889).

d. Estimate d∇f(x, y) = d[∇f(x, y)] for x = (0.667, 0.666) and dx = (−0.1944, 0.2222).

26
2.11 Convex and Concave Functions

Let f (x1 , x2 , . . . , xn ) be a function that is defined for all points (x1 , x2 , . . . , xn ) in a convex set S.
Definition: A function f (x1 , x2 , . . . , xn ) is a convex function on a convex set S if for any x̄0 ∈ S
and x̄00 ∈ S
f [cx̄0 + (1 − c)x̄00 ] ≤ cf (x0 ) + (1 − c)f (x̄00 )

holds for 0 ≤ c ≤ 1.
Definition: A function f (x1 , x2 , . . . , xn ) is a concave function on a convex set S if for any x̄0 ∈ S
and x̄00 ∈ S
f [cx̄0 + (1 − c)x̄00 ] ≥ cf (x̄0 ) + (1 − c)f (x̄00 )

holds for 0 ≤ c ≤ 1.
Theorem: Suppose f 00 (x) exists for all x in a convex set S. Then f (x) is a convex function on S
if and only if f 00 (x) ≥ 0 for all x in S.
Theorem: Suppose f 00 (x) exists for all x in a convex set S. Then f (x) is a concave function on S
if and only if f 00 (x) ≤ 0 for all x in S.

27
2.11.1 Using the Hessian Matrix to Determine Whether a Function is Concave or
Convex

First we need to consider the idea of the principal minor.

An ith principal minor of an n × n matrix is the determinant of any i × i matrix obtained by


deleting n − i rows and the corresponding n − i columns of the matrix.

The k th leading principal minor, Hk (·), of an n × n matrix is the determinant of the k × k


matrix obtained by deleting the last n − k rows and columns of the matrix.
Theorem: Suppose f (x1 , x2 , . . . , xn ) has continuous second-order partial derivatives for each point
x = (x1 , x2 , . . . , xn ) ∈ S. Then f (x1 , x2 , . . . , xn ) is a convex function on S if and only if for each
x ∈ S, all principal minors of H are non-negative.
Theorem: Suppose f (x1 , x2 , . . . , xn ) has continuous second-order partial derivatives for each point
x = (x1 , x2 , . . . , xn ) ∈ S. Then f (x1 , x2 , . . . , xn ) is a concave function on S if and only if for each
x ∈ S and k = 1, 2, ..., n, all nonzero principal minors have the same sign as (−1)k .

28
Example:
Show that f(x1, x2) = −x1² − x1x2 − 2x2² is a concave function on R².

We find that

H(x1, x2) = [ −2  −1 ]
            [ −1  −4 ]

The first principal minors are the diagonal entries of the Hessian (−2 and −4). These are both non-positive. The second principal minor is the determinant of H(x1, x2) and equals (−2)(−4) − (−1)(−1) = 7 > 0. Thus, f(x1, x2) is a concave function on R².

Example:
Show that for S = R3 , f (x1 , x2 , x3 ) = x21 + x22 + 2x23 − x1 x2 − x2 x3 − x1 x3 is a convex function.
The Hessian is given by

                 [  2  −1  −1 ]
H(x1, x2, x3) =  [ −1   2  −1 ]
                 [ −1  −1   4 ]

By deleting rows (and columns) 1 and 2 of the Hessian, we obtain the first-order principal minor 4 > 0.
By deleting rows (and columns) 1 and 3 of the Hessian, we obtain the first-order principal minor 2 > 0.
By deleting rows (and columns) 2 and 3 of the Hessian, we obtain the first-order principal minor 2 > 0.
By deleting row 1 and column 1 of the Hessian, we find the second-order principal minor

det [  2  −1 ] = 7 > 0.
    [ −1   4 ]

By deleting row 2 and column 2 of the Hessian, we find the second-order principal minor

det [  2  −1 ] = 7 > 0.
    [ −1   4 ]

By deleting row 3 and column 3 of the Hessian, we find the second-order principal minor

det [  2  −1 ] = 3 > 0.
    [ −1   2 ]

The third-order principal minor is simply the determinant of the Hessian itself. Expanding by row 1 cofactors, we find the third-order principal minor

2[(2)(4) − (−1)(−1)] − (−1)[(−1)(4) − (−1)(−1)] + (−1)[(−1)(−1) − (−1)(2)] = 14 − 5 − 3 = 6 > 0.

Because for all (x1 , x2 , x3 ) all principal minors of the Hessian are non-negative, we have shown
that f (x1 , x2 , x3 ) is a convex function on R3 .

Example:
Show that for S = R2 , f (x1 , x2 ) = x21 − 3x1 x2 + 2x22 is not a convex or a concave function.
We have

H(x1, x2) = [  2  −3 ]
            [ −3   4 ]

The first principal minors of the Hessian are 2 and 4. Because both the first principal minors are
positive, f (x1 , x2 ) cannot be concave. The second principal minor is 2(4) − (−3)(−3) = −1 < 0.
Thus, f (x1 , x2 ) cannot be convex. Together, these facts show that f (x1 , x2 ) cannot be a convex or
a concave function.
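The principal-minor calculations in the last three examples are easy to automate. A minimal sketch, assuming the Hessian is constant and supplied as a numpy array (the helper functions are illustrative):

import itertools
import numpy as np

def principal_minors(H):
    """Return (order, determinant) for every principal minor of H."""
    n = H.shape[0]
    out = []
    for k in range(1, n + 1):
        for idx in itertools.combinations(range(n), k):
            sub = H[np.ix_(idx, idx)]
            out.append((k, np.linalg.det(sub)))
    return out

def classify(H, tol=1e-9):
    minors = principal_minors(H)
    convex = all(d >= -tol for _, d in minors)               # all minors non-negative
    concave = all(d * (-1) ** k >= -tol for k, d in minors)  # order-k minors share the sign of (-1)^k (or vanish)
    return convex, concave

H = np.array([[2.0, -1, -1], [-1, 2, -1], [-1, -1, 4]])  # second example above
print(classify(H))   # (True, False): convex, not concave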

30
2.12 Unconstrained Maximization and Minimization

with Several Variables

Let us consider how to find an optimal solution (if it exists) or a local extremum for the following
unconstrained NLP:

Max.(or Min.)f (x1 , x2 , . . . , xn )

s.t. (x1 , x2 , . . . , xn ) ∈ Rn . (3)

Assume that the first and second partial derivatives of f(x1, x2, . . . , xn) exist and are continuous at all points.

Theorem: For any NLP (maximization), a feasible point x = (x1, x2, . . . , xn) is a local maximum if, for sufficiently small ε, any feasible point x′ = (x′1, x′2, . . . , x′n) having |xi − x′i| < ε (i = 1, 2, . . . , n) satisfies f(x) ≥ f(x′).

Theorem: Consider NLP (3) and assume it is a maximization problem. Suppose the feasible region S for NLP (3) is a convex set. If f(x) is concave on S, then any local maximum for NLP (3) is an optimal solution to this NLP.

Theorem: Consider NLP (3) and assume it is a minimization problem. Suppose the feasible region S for NLP (3) is a convex set. If f(x) is convex on S, then any local minimum for NLP (3) is an optimal solution to this NLP.

Theorem: If x̄ is a local extremum for (3), then ∂f(x̄)/∂xi = 0 for i = 1, 2, . . . , n.

DEFINITION: A point x̄ having ∂f(x̄)/∂xi = 0 for i = 1, 2, . . . , n is called a stationary point of f.

Theorem: If Hk(x̄) > 0 for k = 1, 2, . . . , n, then a stationary point x̄ is a local minimum for NLP (3).

Theorem: If, for k = 1, 2, . . . , n, Hk(x̄) is nonzero and has the same sign as (−1)ᵏ, then a stationary point x̄ is a local maximum for NLP (3).

NB: If a stationary point x̄ is not a local extremum, then it is called a saddle point.

31
If Hn (x̄) = 0 for a stationary point x̄, then x̄ may be a local minimum, a local maximum, or a
saddle point, and the preceding tests are inconclusive.

32
2.13 M.L.E

Suppose that we wish, from n independent trial measurements of x, to find the most likely estimate
(or estimator), g, of a true parameter γ in a known mathematical functional form φ(x; γ). Assume
that there is only one parameter to be determined. We set up some function g = g(x1 , x2 , . . . , xn )
of the trial values of x from which the estimate g is to be deduced.
There are several methods for setting up such g functions, and each method gives a different
degree of goodness of estimate of g. The statisticians rate these methods in terms of their relative
efficiency’s. For most parametric estimation problems, the method of estimation known as the
method of maximum likelihood is the most efficient, and, if n is large, the estimate is usually
satisfactorily consistent. The likelihood function, the product of all n values of φ(xi ; γ), is written

L(x1 , x2 , . . . , xn ; γ) = φ(x1 ; γ) × φ(x2 ; γ) × · · · × φ(xn ; γ).

Especially in the case of a discrete population, certain values xi may be observed with a frequency fi greater than unity. In this event, the frequency fi appears as an exponent on the factor φ(xi; γ), and the total number of distinct factors is r, with r < n. Consider the general case in which there is a continuum of possible values for g, i.e., the parameter is a continuous variable. The relative probability of any two different values of g is given by the likelihood ratio, in which the likelihood functions are of the form given above, with one value of g in place of γ in the numerator and the other value of g in place of γ in the denominator. The ratio Pin(A)/Pin(B), if nothing is otherwise known about it, may be taken as equal to unity as the desperation-in-ignorance guess. We imagine each of N values of Lj computed. These N values of Lj form a distribution which, as N → ∞, can be shown to approach a normal distribution, and the value of g at the maximum of the distribution corresponds to the desired estimate. To find the value of g that makes L a maximum, we differentiate L with respect to γ and set the derivative equal to zero. Since L is a maximum when log L is a maximum, we may use the logarithmic form when it is more convenient to deal with a sum instead of a product:

[∂/∂γ (log L)]_{γ=g} = 0 = Σ_{i=1}^n fi ∂/∂g [log φ(xi; g)]        (4)
∂g

33
and we seek a solution of this expression for g. This value of g is the most likely estimate of γ. The solution of equation (4) is often explicit and easy, without multiple roots; in the case of multiple roots, the significant root is chosen. The procedure can be generalized to treat more than one parameter; there is one likelihood equation for each parameter. The maximum likelihood method is generally considered to be about the best statistical approach to the majority of measurement problems encountered in experimental science. This method uses all of the experimental information in the most direct and efficient fashion possible to give an unambiguous estimate. Its principal disadvantage is that the functional relationship must be known or assumed.

2.13.1 Probability Distribution

A probability distribution is a statistical function that describes all the possible values and likeli-
hoods that a random variable can take within a given range.
A random variable is a variable that takes on numerical values determined by the outcome of a
random experiment.
A random variable is a discrete random variable if it can take on no more than a countable
number of values.
A random variable is a continuous random variable if it can take any value in an interval.

Binomial Distribution

Suppose that a random experiment can result in two possible mutually exclusive and collectively
exhaustive outcomes, ”success” and ”failure,” and that P is the probability of a success in a single
trial. If n independent trials are carried out, the distribution of the number of resulting successes,
x, is called the binomial distribution. Its probability distribution function for the binomial
random variable X = x is

P(x successes in n independent trials) = P(x)
= [n!/(x!(n − x)!)] Pˣ(1 − P)⁽ⁿ⁻ˣ⁾ for x = 0, 1, 2, . . . , n.        (5)

34
Hypergeometric Distribution

Suppose that a random sample of n objects is chosen from a group of N objects, S of which are
successes. The distribution of the number of successes, X, in the sample is called the hypergeo-
metric distribution. Its probability distribution is

P(x) = (C_x^S · C_{n−x}^{N−S}) / C_n^N        (6)

where x can take integer values ranging from the larger of 0 and n − (N − S) to the smaller of n and S. Note that C_n^N = N!/(n!(N − n)!).

Normal Distribution

The probability density function for a normally distributed random variable X is

P(x) = [1/(σ√(2π))] e^(−(x−µ)²/(2σ²)) for −∞ < x < ∞        (7)

where µ and σ² are any numbers such that −∞ < µ < ∞ and 0 < σ² < ∞, and where e and π are the familiar mathematical constants.

Poisson Distribution

The random variable X is said to follow the Poisson probability distribution if it has proba-
bility function

P(x) = (e^(−λ) λˣ)/x!        (8)

where
P(x) = the probability of x successes over a given time or space, given λ;
λ = the expected number of successes per time or space unit, λ > 0.
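As a small illustration connecting this distribution to the maximum-likelihood condition (4) in Section 2.13: for an observed Poisson sample, maximizing the log-likelihood numerically recovers λ̂ = x̄, the sample mean. A minimal sketch, assuming synthetic data and using scipy's bounded scalar minimizer:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.poisson(lam=3.5, size=1000)          # synthetic Poisson sample

def neg_log_likelihood(lam):
    # Poisson log-likelihood, dropping the additive constant -sum(log(x_i!))
    return -(np.sum(data) * np.log(lam) - len(data) * lam)

res = minimize_scalar(neg_log_likelihood, bounds=(0.1, 20.0), method='bounded')
print(res.x, data.mean())    # the numerical maximizer matches the sample mean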

35
2.14 Least Square

Among experimental scientists the most popular application of the principle known as least squares is curve fitting.
The method of least squares does not, in a practical way, tell us the functional relationship; it does tell us precisely the best values of the constants appearing in the equation. Also, it does allow us to choose between two different functional relations.

2.14.1 Best Fit of a Straight Line

Recall: y = a + bx
It usually turns out that, of the two general constant a and b we are more interested in b than in
a, but this is not always so, b gives the slope of the graph and a the intercept.
Consider the graph of measured values of x and y, such as in figure 1 and a straight line

y0 = a + bx0

such that the sum of the squares of the deviations from it shall be a minimum. In what direction
should the deviations be reckoned? Ideally, if only random errors are present in both x and y, the
deviations should be reckoned perpendicularly to the straight line. But the arithmetic involved
in the determination of the constants a and b is rather formidable in this case and, in general,
the result depends upon the choice of the scale of each coordinate axis; correction for even this
effect is possible but is very laborious. The usual procedure is to choose either the x or the y
direction for deviations, recognizing that the price paid for the simpler arithmetic is a sacrifice,
usually negligible small, in the accuracy of the best fit of the line. The choice between the x and
the y direction is made in favor of that direction in which the larger standard deviation is found;
in order to make this comparison in the same dimensions and units, sy is compared with bsx . In
almost all cases in experimental science, x is taken as the independent variable whose values are
selected with practically negligible error, and in these cases the deviations are reckoned along the
y axis.

36
We shall assume that all the deviations are taken along the y-axis, i.e., that b2 s2x << s2y . Then,
a + bxi is always taken as the exact value of y, y0 . Accordingly, a deviation is written as

δyi = yi − y0 = yi − (a + bxi ).

Graphically, δy is the length of a vertical line drawn between yi and yo (= a+bxi ) at the abscissa
position.
Let us assume initially that all the δyi's are equally weighted. In accord with the principle of least squares, we seek those values of a and b that make the sum of the squares of the n deviations δyi a minimum. Thus, we consider

Σ_{i=1}^n (δyi)² = Σ_{i=1}^n (yi − a − bxi)².
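Minimizing this sum over a and b leads to the usual normal equations; numerically it is just a linear least-squares solve. A sketch using numpy, with the data of the straight-line example that follows:

import numpy as np

x = np.array([0.0, 1, 3, 6, 8])
y = np.array([1.0, 3, 2, 5, 4])

# Design matrix for y ≈ a + b x; lstsq minimizes the sum of squared y-deviations
A = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)          # intercept and slope of the least-squares line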

2.14.2 Best Fit of a Parabola

2.14.3 Best Fit of a Sine Curve

Example: Fit a straight line y = mx + c to the following data by the method of least squares:

x-values: 0 1 3 6 8
y-values: 1 3 2 5 4

NB: We consider the errors along the y-axis and assume that all the δyi's are equally weighted.

Example: Fit a parabola y = a + bx + cx² to the following data by the method of least squares:

x-values: 2 4 6 8 10
y-values: 1 3 2 5 4

NB: We consider the errors along the y-axis and assume that all the δyi's are equally weighted.

37
Method of steepest ascent
Suppose we want to solve the following unconstrained NLP:

max z = f(x1, x2, . . . , xn)

s.t. (x1, x2, . . . , xn) ∈ Rⁿ

Recall: If f(x1, x2, . . . , xn) is a concave function, then the optimal solution to the above NLP (if there is one) will occur at a stationary point x̄ having

∂f(x̄)/∂x1 = ∂f(x̄)/∂x2 = · · · = ∂f(x̄)/∂xn = 0.

Note: From the definition of ∂f(x̄)/∂xi, it follows that if the value of xi is increased by a small amount δ, the value of f(x) will increase by approximately δ ∂f(x̄)/∂xi.

Lemma 1
Suppose we are at a point v and we move from v a small distance δ in a direction d. Then for a given δ, the maximal increase in the value of f(x1, x2, . . . , xn) will occur if we choose

d = ∇f(v)/||∇f(v)||.

Method of finding the maximum

• Begin at any point v0.

• For some non-negative value of t, move to the point v1 = v0 + t∇f(v0).

The maximum possible improvement in the value of f (for a max problem) that can be attained by moving away from v0 in the direction of ∇f(v0) results from moving to v1 = v0 + t0∇f(v0), where t0 solves the following one-dimensional optimization problem:

max f(v0 + t∇f(v0))

s.t. t ≥ 0.

38
Example: Use the method of steepest ascent to approximate the solution to

max f (x1 , x2 ) = −(x1 − 3)2 − (x2 − 2)2

s.t. (x1 , x2 ) ∈ R2

We arbitrarily choose to begin at the point v0 = (1, 1). Because ∇f (x1 , x2 ) = (−2(x1 − 3), −2(x2 −
2)), we have ∇f (1, 1) = (4, 2). Thus, we must choose t0 to maximize

f (t0 ) = f [(1, 1) + t0 (4, 2)] = f (1 + 4t0 , 1 + 2t0 ) = −(−2 + 4t0 )2 − (−1 + 2t0 )2

Setting f 0 (t0 ) = 0, we obtain

−8(−2 + 4t0 ) − 4(−1 + 2t0 ) = 0

20 − 40t0 = 0

t0 = 0.5

Our new point is v1 = (1, 1) + 0.5(4, 2) = (3, 2). Now ∇f (3, 2) = (0, 0), and we terminate the
algorithm. Because f (x1 , x2 ) is a concave function, we have found the optimal solution to the
NLP. Thus the maximum value is f (3, 2) = 0

Choosing v0 = (0, 0), we see that f (0, 0) = −13 and ∇f (0, 0) = (6, 4).
Thus, we consider
f (t0 ) = f [(0, 0) + t0 (6, 4)] = f (6t0 , 4t0 ) =?

To find t0 we consider ∇f (6t0 , 4t0 ),

∇f (6t0 , 4t0 ) = (−2(6t0 − 3), −2(4t0 − 2)) = (0, 0) =⇒ t0 = 0.5.

Now,
f (t0 ) = f [(0, 0) + t0 (6, 4)] = f (6 × 0.5, 4 × 0.5) = f (3, 2).

39
Choosing v0 = (10, 5), we see that f(10, 5) = −58 and ∇f(10, 5) = (−14, −6).
Thus, we consider

f (t0 ) = f [(10, 5) + t0 (−14, −6)] = f (10 − 14t0 , 5 − 6t0 ) =?

To find t0 we consider ∇f (10 − 14t0 , 5 − 6t0 ),

∇f (10 − 14t0 , 5 − 6t0 ) = (−2(10 − 14t0 − 3), −2(5 − 6t0 − 2)) = (0, 0) =⇒ t0 = 0.5.

Now,

f (t0 ) = f [(10, 5) + t0 (−14, −6)] = f [(10, 5) + 0.5(−14, −6)] = f (10 − 7, 5 − 3) = f (3, 2).

Aside: This algorithm is called the method of steepest ascent because to generate points, we
always move in the direction that maximizes the rate at which f increases (at least locally).
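The iteration in the example above is short enough to code directly. A minimal sketch, assuming the same concave objective f(x1, x2) = −(x1 − 3)² − (x2 − 2)² and performing the line search numerically with scipy (the step bound of 10 is an arbitrary choice):

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda v: -(v[0] - 3)**2 - (v[1] - 2)**2
grad = lambda v: np.array([-2 * (v[0] - 3), -2 * (v[1] - 2)])

v = np.array([1.0, 1.0])                      # starting point v0
for _ in range(50):
    g = grad(v)
    if np.linalg.norm(g) < 1e-8:              # gradient is (approximately) zero: stop
        break
    # line search: maximize f(v + t g) over t >= 0
    step = minimize_scalar(lambda t: -f(v + t * g), bounds=(0, 10), method='bounded')
    v = v + step.x * g
print(v, f(v))                                # converges to (3, 2) with f = 0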

40
3 Optimizing Constrained N.L.P.

The primary tool used to solve these problems (which you may have seen in year one) is Lagrange
Multipliers. Constrained optimization involves optimizing a function of variables where there are
restrictions on the variables, whether they may be time, money or limitation on the availability of
raw materials. Here we will focus on all the constraints being equations (as against inequalities).
The constraints essentially impose a restriction on the domain of the function. So a solution to the
constrained problem gives the optimal value of the objective function over the restricted domain,
determined by the constraints. Consequently, unconstrained minimum is typically smaller than the
constrained counterpart while the unconstrained maximum is typically larger than the constrained
counterpart.

3.1 Substitution Method

The substitution method comprises of the following steps:

1. Substitute the constraint into the objective function and reduce the problem to an uncon-
strained one.

2. Use previous test(s) to find all relative extrema.

3. Use the objective function to identify the absolute extremum.

41
Example:
Find the minimum value of f (x, y) = x2 +2y 2 +2xy −18 subjected to the constraint that x−y = 1.

Solution:

1. Consider f(x, y) = x² + 2y² + 2xy − 18. Using the substitution x = 1 + y, we arrive at

   f(1 + y, y) = (1 + y)² + 2y² + 2(1 + y)y − 18 = 5y² + 4y − 17.

2. To determine potential extremum candidates we solve d/dy f = 0 ⟹ 10y + 4 = 0 ⟹ y = −2/5. Note that d²/dy² f = 10 > 0 ⟹ (1 − 2/5, −2/5) = (3/5, −2/5) is a relative minimum of f(x, y).

   Therefore, the minimum value of f is
   f(3/5, −2/5) = (0.6)² + 2(−0.4)² + 2(0.6)(−0.4) − 18 = −17.8.

42
3.2 Lagrange Multipliers

NB: The substitution method is not always easy to do (especially if the constraints are not lin-
ear). The typical method of solving constrained(equality) optimization problems is the Lagrange
multiplier method.
Consider the NLP:
Find the values of decision variables x1 , x2 , . . . , xn that

Max (or Min) z = f(x1, x2, . . . , xn)

s.t. g1(x1, x2, . . . , xn) = b1
     g2(x1, x2, . . . , xn) = b2
     ⋮                                                  (9)
     gm(x1, x2, . . . , xn) = bm,

where f(x1, x2, ..., xn) (the objective function) and gi(x1, x2, ..., xn) = bi, i = 1, 2, . . . , m (the constraints), cannot all be linear functions.
To solve this NLP, we associate a multiplier, λi, with the ith constraint and form the Lagrangian

L(x1, x2, . . . , xn, λ1, λ2, . . . , λm) = f(x1, x2, . . . , xn) + Σ_{i=1}^m λi [bi − gi(x1, x2, . . . , xn)].

Then we attempt to find a point (x̄1 , x̄2 , . . . , x̄n , λ̄1 , λ̄2 , . . . , λ̄m ) that maximizes (or minimizes)
L(x1 , x2 , . . . , xn , λ1 , λ2 , . . . , λm ).

43
Theorem:
Suppose the NLP is a maximization problem. If f(x1, x2, ..., xn) is a concave function and each gi(x1, x2, ..., xn) is a linear function, then any point (x̄1, x̄2, . . . , x̄n, λ̄1, λ̄2, . . . , λ̄m) satisfying ∂L/∂xi = 0, i = 1, 2, . . . , n, and ∂L/∂λj = 0, j = 1, 2, . . . , m, will yield an optimal solution (x̄1, x̄2, . . . , x̄n) to the NLP.

Theorem:
Suppose the NLP is a minimization problem. If f(x1, x2, ..., xn) is a convex function and each gi(x1, x2, ..., xn) is a linear function, then any point (x̄1, x̄2, . . . , x̄n, λ̄1, λ̄2, . . . , λ̄m) satisfying ∂L/∂xi = 0, i = 1, 2, . . . , n, and ∂L/∂λj = 0, j = 1, 2, . . . , m, will yield an optimal solution (x̄1, x̄2, . . . , x̄n) to the NLP.

Example:
Find the minimum value of f (x, y) = x2 +2y 2 +2xy −18 subjected to the constraint that x−y = 1.
Solution:
Consider the minimum value of f (x, y) = x2 + 2y 2 + 2xy − 18 subjected to the constraint x − y = 1.
Thus, the objective function is f (x, y) = x2 +2y 2 +2xy −18 and the constraint is g(x, y) = x−y −1.
Therefore, the Lagrangian is

L(x, y, λ) = x2 + 2y 2 + 2xy − 18 + λ(1 − x + y).

The 3 first order necessary conditions are

Lx = 2x + 2y − λ = 0

Ly = 4y + 2x + λ = 0

Lλ = 1 − x + y = 0

44
In matrix form, we need to solve

[  2  2  −1 ] [x]   [  0 ]
[  2  4   1 ] [y] = [  0 ]
[ −1  1   0 ] [λ]   [ −1 ]

so that

[x]   [  2  2  −1 ]⁻¹ [  0 ]   [  1/10  1/10  −3/5 ] [  0 ]   [  3/5 ]
[y] = [  2  4   1 ]   [  0 ] = [  1/10  1/10   2/5 ] [  0 ] = [ −2/5 ]
[λ]   [ −1  1   0 ]   [ −1 ]   [ −3/5   2/5   −2/5 ] [ −1 ]   [  2/5 ]

Let us verify whether or not the point (3/5, −2/5, 2/5) is a local minimum. Consider Hf(x, y):

            [ fxx  fxy ]   [ 2  2 ]
Hf(x, y) =  [ fyx  fyy ] = [ 2  4 ]

Note that the first principal minors are 2 and 4. The second principal minor is 2 × 4 − 2 × 2 = 4. Thus, f(x, y) is a convex function on R². Since g(x, y) is a linear function, the point (3/5, −2/5, 2/5) is a local minimum and the optimal value of the NLP is

f(3/5, −2/5) = (3/5)² + 2(−2/5)² + 2(3/5)(−2/5) − 18 = −89/5.
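The linear system of first-order conditions above can also be solved numerically instead of inverting the matrix by hand. A short sketch with numpy:

import numpy as np

# First-order conditions L_x = L_y = L_lambda = 0, written as A z = rhs
A = np.array([[ 2.0, 2.0, -1.0],
              [ 2.0, 4.0,  1.0],
              [-1.0, 1.0,  0.0]])
rhs = np.array([0.0, 0.0, -1.0])

x, y, lam = np.linalg.solve(A, rhs)
f = lambda x, y: x**2 + 2*y**2 + 2*x*y - 18
print(x, y, lam, f(x, y))     # 0.6, -0.4, 0.4 and f = -17.8 (= -89/5)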

45
Example:
Find the maximum value of f (x, y) = xy subjected to the constraint that x+y = 2P, where P > 0.

Example:
The plane x + y + z = 12 intersects the paraboloid z = x² + y² in an ellipse. Find the points on the ellipse that are closest to and furthest away from the origin.

Example:
The production of a company is given by the Cobb-Douglas function P = 200L^(2/3)K^(1/3). The cost constraints on the business force 2L + 5K ≤ 150. Find the values of the labour, L, and capital, K, that maximize production.

46
Example:
Find the relative extrema of the function f (x, y) = x2 + 4y 2 subjected to the constraint g(x, y) =
x2 + y 2 − 1.
Note: The constraint is non-linear and hence, we cannot apply the previous theorem.
The Lagrangian takes the form

L(x, y, λ) = x² + 4y² + λ(1 − x² − y²).

Thus, the first-order necessary conditions are

Lx = 2x − 2xλ = 0
Ly = 8y − 2yλ = 0
Lλ = 1 − x² − y² = 0

Thus, the four critical points found from these conditions are (±1, 0) and (0, ±1). The points (±1, 0) are minima, f(±1, 0) = 1; the points (0, ±1) are maxima, f(0, ±1) = 4. (How can we verify this?)
Let us check if the Hessian of f(x, y) can provide us with any information about these critical points.

            [ fxx  fxy ]   [ 2  0 ]
Hf(x, y) =  [ fyx  fyy ] = [ 0  8 ]

The Hessian of f(x, y) is the same for all points. Hence, the nature of the critical points cannot depend solely on the Hessian of f(x, y).

47
3.3 Bordered Hessian

Sufficiency condition for the Lagrangian method: Consider problem (10),

Max (or Min) z = f(x1, x2, . . . , xn)

s.t. g1(x1, x2, . . . , xn) = b1
     g2(x1, x2, . . . , xn) = b2
     ⋮                                                  (10)
     gm(x1, x2, . . . , xn) = bm.

Define

         [ 0    P ]
H^B  :=  [ Pᵀ   Q ]
                    (m+n)×(m+n)

where P = Jg(x̄) is the m × n Jacobian matrix of the constraints and Q = HL(x̄) is the n × n Hessian of the Lagrangian with respect to x, both evaluated at x̄.

The matrix H^B is called the bordered Hessian matrix.

48
Given that (x̄0, λ̄0) is a stationary point of the Lagrangian function L(x̄, λ̄), and that H^B is evaluated at (x̄0, λ̄0), then x̄0 is

1. a maximum point if, starting with the principal minor determinant of order (2m + 1), the
last (n − m) principal minor determinants of H B have an alternating sign pattern starting
with (−1)m+1 .

2. A minimum point if, starting with the principal minor determinant of order (2m + 1), the
last (n − m) principal minor determinants of H B have the sign of (−1)m .

49
Example:
Find the relative extrema of the function f (x, y) = x2 + 4y 2 subjected to the constraint g(x, y) =
x2 + y 2 − 1.
Previously, we saw that the Lagrangian took the form L(x, y, λ) = x² + 4y² + λ(1 − x² − y²) and the four critical points are (±1, 0) and (0, ±1). Now,

         [ 0   gx   gy  ]   [ 0    2x       2y     ]
H^B  :=  [ gx  Lxx  Lxy ] = [ 2x   2 − 2λ   0      ]
         [ gy  Lyx  Lyy ]   [ 2y   0        8 − 2λ ]

At (±1, 0), λ = 1. This implies that at (1, 0),

         [ 0  2  0 ]
H^B  :=  [ 2  0  0 ]   ⟹   |H^B| = −24.
         [ 0  0  6 ]

Hence, starting with the principal minor determinant of order 2(1) + 1 = 3, the last (2 − 1) = 1 principal minor determinants of H^B have the sign of (−1)¹. Thus, (1, 0) is a minimum point. Similarly, applying the same method one can show that (−1, 0) is a minimum point.
At (0, ±1), λ = 4. This implies that at (0, 1),

         [ 0   0  2 ]
H^B  :=  [ 0  −6  0 ]   ⟹   |H^B| = 24.
         [ 2   0  0 ]

Hence, starting with the principal minor determinant of order 2(1) + 1 = 3, the last (2 − 1) = 1 principal minor determinants of H^B have an alternating sign pattern starting with (−1)^(1+1). Thus, (0, 1) is a maximum point. Similarly, applying the same method one can show that (0, −1) is a maximum point.
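The bordered-Hessian test at each critical point of this example is easy to verify numerically; for n = 2 variables and m = 1 constraint only the sign of |H^B| matters. A sketch (the helper function is illustrative):

import numpy as np

def bordered_hessian(x, y, lam):
    # constraint g(x, y) = x^2 + y^2 - 1; Lagrangian L = x^2 + 4y^2 + lam*(1 - x^2 - y^2)
    gx, gy = 2*x, 2*y
    Lxx, Lxy, Lyy = 2 - 2*lam, 0.0, 8 - 2*lam
    return np.array([[0.0, gx,  gy],
                     [gx,  Lxx, Lxy],
                     [gy,  Lxy, Lyy]])

for (x, y, lam) in [(1, 0, 1), (0, 1, 4)]:
    HB = bordered_hessian(x, y, lam)
    print((x, y), np.linalg.det(HB))   # -24 at (1, 0) (minimum), +24 at (0, 1) (maximum)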

50
Example:
Find the minimum value of f (x, y) = x2 +2y 2 +2xy −18 subjected to the constraint that x−y = 1.
Solution:
Previously, we saw that the critical point was (3/5, −2/5, 2/5). Now,

         [  0  1  −1 ]
H^B  :=  [  1  2   2 ]   ⟹   |H^B| = −10.
         [ −1  2   4 ]

Note, starting with the principal minor determinant of order 2(1) + 1 = 3, the last (2 − 1) = 1 principal minor determinant of H^B has the sign of (−1)¹ = −1. Hence (3/5, −2/5, 2/5) is a minimum point.

51
Example:
Consider the NLP
Minimize f (x1 , x2 , x3 ) = x21 + x22 + x23
subjected to 4x1 + x22 + 2x3 = 14
Use Lagrangian Multipliers to solve.

Example:
Consider the NLP
Minimize f (x, y) = x2 + y
subjected to g1 (x, y) = −(x2 + y 2 ) + 9 = 0 and g2 (x, y) = −x − y + 1 = 0.
Use Lagrangian Multipliers to solve.

Example:
Find the maximum and minimum values of |x̄|², x̄ ∈ R³,
subjected to g1(x̄) = x²/4 + y²/5 + z²/25 − 1 = 0 and g2(x̄) = x + y − z = 0.

Example:
Consider the NLP
Minimize f (x, y) = x2 + y
subjected to g1 (x, y) = −(x2 + y 2 ) + 9 ≥ 0 and g2 (x, y) = −x − y + 10 ≥ 0.
Use Lagrangian Multipliers to solve.

52
3.4 Lagrange Dual Problem

• Given a nonlinear programming problem (primal problem), one can construct a closely related
nonlinear programming problem known as the Lagrangian dual problem.

• Under certain convexity assumptions and suitable constraint qualifications, the primal and
dual problems have equal objective function values.

Consider the following NLP:

Minimize f (x̄)

Subjected to : gi (x̄) ≤ 0 for i = 1, . . . , m. (P )

hj (x̄) = 0 for j = 1, . . . , l,

x̄ ∈ X ⊆ Rn .

The Lagrangian of the problem is

L(x̄, λ, µ) = f(x̄) + Σ_{i=1}^m λi gi(x̄) + Σ_{j=1}^l µj hj(x̄),     (x̄ ∈ X, λ ∈ R^m₊, µ ∈ R^l).

The dual objective function q : R^m₊ × R^l → R ∪ {−∞} is defined to be

q(λ, µ) = min_{x̄∈X} L(x̄, λ, µ).

Then the Lagrangian dual problem is defined as the following nonlinear programming problem:

Maximize q(λ, µ)
Subject to: (λ, µ) ∈ dom(q),

where dom(q) = {(λ, µ) ∈ R^m₊ × R^l : q(λ, µ) > −∞},

or

Maximize θ(λ, µ)
Subject to: λi ≥ 0,        (D)

where θ(λ, µ) = inf_{x̄∈X} { f(x̄) + Σ_{i=1}^m λi gi(x̄) + Σ_{j=1}^l µj hj(x̄) }.

(P) is referred to as the primal problem. In the dual problem, the vectors λ and µ have components λi, i = 1, . . . , m, and µj, j = 1, . . . , l, respectively. The Lagrange multipliers λi, corresponding to the inequality constraints gi(x̄) ≤ 0, are restricted to be non-negative, whereas the Lagrange multipliers µj, corresponding to the equality constraints hj(x̄) = 0, are unrestricted in sign.

Theorem(Convexity of the dual problem). Consider

Minimize f (x̄)

Subjected to : gi (x̄) ≤ 0 for i = 1, . . . , m.

hj (x̄) = 0 for j = 1, . . . , l,

x̄ ∈ X ⊆ Rn ,

where f, gi and hj are finite-valued functions defined on the set X ⊆ Rn , and let q(λ, µ) =
minx∈X L(x̄, λ, µ). Then

(a) dom(q) is a convex set,

(b) (q) is a concave function over dom(q).

Theorem (Weak Duality Theorem):


Consider the primal and Lagrangian dual problems previously stated. If x̄∗ is an optimal solution
for the primal problem and (λ∗ , µ∗ ) is an optimal solution for the Lagrangian dual problem then
f (x̄∗ ) ≥ q(λ∗ , µ∗ ).

54
Proof. Let us denote the feasible set of the primal problem by

S = {x̄ ∈ X : gi(x̄) ≤ 0, hj(x̄) = 0, i = 1, 2, . . . , m, j = 1, 2, . . . , l}.

Then for any (λ, µ) ∈ R^m₊ × R^l we have

q(λ, µ) = min_{x̄∈X} L(x̄, λ, µ)
        ≤ min_{x̄∈S} L(x̄, λ, µ)
        = min_{x̄∈S} [ f(x̄) + Σ_{i=1}^m λi gi(x̄) + Σ_{j=1}^l µj hj(x̄) ]
        ≤ min_{x̄∈S} f(x̄),

since on S every hj(x̄) = 0 and every λi gi(x̄) ≤ 0.

Definition: The difference between the solutions of the primal and dual problems is known as
the duality gap.

Example:
Consider the following primal problem:

Minimize f (x, y) = x2 + y 2

subject to: − x − y + 4 ≤ 0,

x, y ≥ 0.

Use Lagrangian duality to find the solution.


Solution:
The Lagrangian dual problem is given as
Maximize θ(u)
Subjected to : u ≥ 0,
where θ(u) = inf {x2 + y 2 + u(−x − y + 4) : x, y ≥ 0} .

55
Now,

θ(u) = inf { x² + y² + u(−x − y + 4) : x, y ≥ 0 }
     = inf { x² − ux : x ≥ 0 } + inf { y² − uy : y ≥ 0 } + 4u.

Using the method of completing the square or otherwise, we see that inf { x² − ux : x ≥ 0 } and inf { y² − uy : y ≥ 0 } are each equal to −u²/4 (for u ≥ 0). Hence, our problem becomes

Maximize θ(u) = 4u − (1/2)u²
Subject to: u ≥ 0.

Using the method of completing the square or otherwise, the maximum value of θ(u) is 8 (attained at u = 4).
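As a numerical cross-check of this example, one can solve the primal problem with a constrained solver and maximize θ(u) = 4u − u²/2 over u ≥ 0; both values come out to 8, so the duality gap here is zero. A sketch using scipy (the solver choice and bounds are illustrative):

import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Primal: minimize x^2 + y^2 subject to x + y >= 4 and x, y >= 0
primal = minimize(lambda v: v[0]**2 + v[1]**2,
                  x0=[1.0, 1.0],
                  bounds=[(0, None), (0, None)],
                  constraints=[{'type': 'ineq', 'fun': lambda v: v[0] + v[1] - 4}])

# Dual: maximize theta(u) = 4u - u^2/2 over u >= 0
dual = minimize_scalar(lambda u: -(4*u - 0.5*u**2), bounds=(0, 100), method='bounded')

print(primal.fun, -dual.fun)    # both are approximately 8: zero duality gap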

56
Theorem: If Hk(x̄) > 0 for k = 1, 2, . . . , n, then a stationary point x̄ is a local minimum for (3).
Theorem: If, for k = 1, 2, . . . , n, Hk(x̄) is nonzero and has the same sign as (−1)ᵏ, then a stationary point x̄ is a local maximum for (3).
Theorem: If Hn(x̄) ≠ 0 and the conditions of the previous theorems do not hold, then the stationary point x̄ is not a local extremum.

57
Example:
Find all local maxima, local minima, and saddle points for f (x, y) = x2 y + y 3 x − xy.

58
3.5 Kuhn-Tucker Multipliers

Theorem (KKT conditions for linearly constrained problems; necessary optimality conditions). Consider the minimization problem

min f(x̄)
s.t. aᵢᵀx̄ ≤ bᵢ, i = 1, 2, . . . , m,

where f is continuously differentiable over Rⁿ, a1, a2, . . . , am ∈ Rⁿ, and b1, b2, . . . , bm ∈ R, and let x̄* be a local minimum point of the minimization problem. Then there exist λ1, λ2, . . . , λm ≥ 0 such that

∇f(x̄*) + Σ_{i=1}^m λi ai = 0

and λi(aᵢᵀx̄* − bi) = 0, i = 1, 2, . . . , m.

Theorem (KKT conditions for convex linearly constrained problems; necessary and sufficient optimality conditions). Consider the minimization problem

min f(x̄)
s.t. aᵢᵀx̄ ≤ bᵢ, i = 1, 2, . . . , m,

where f is a convex continuously differentiable function over Rⁿ, a1, a2, . . . , am ∈ Rⁿ, and b1, b2, . . . , bm ∈ R, and let x̄* be a feasible solution of the minimization problem. Then x̄* is an optimal solution if and only if there exist λ1, λ2, . . . , λm ≥ 0 such that

∇f(x̄*) + Σ_{i=1}^m λi ai = 0

and λi(aᵢᵀx̄* − bi) = 0, i = 1, 2, . . . , m.

59
Theorem (KKT conditions for linearly constrained problems). Consider the minimization problem

min f(x̄)
s.t. aᵢᵀx̄ ≤ bᵢ, i = 1, 2, . . . , m,        (Q)
     cⱼᵀx̄ = dⱼ, j = 1, 2, . . . , p,

where f is a continuously differentiable function over Rⁿ, a1, a2, . . . , am, c1, c2, . . . , cp ∈ Rⁿ, and b1, b2, . . . , bm, d1, d2, . . . , dp ∈ R. Then we have the following:

(a) (necessity of the KKT conditions) If x̄* is a local minimum point of (Q), then there exist λ1, λ2, . . . , λm ≥ 0 and µ1, µ2, . . . , µp ∈ R such that

∇f(x̄*) + Σ_{i=1}^m λi ai + Σ_{j=1}^p µj cj = 0        (11)

and λi(aᵢᵀx̄* − bi) = 0, i = 1, 2, . . . , m.        (12)

(b) (sufficiency in the convex case) If in addition f is convex over Rⁿ and x̄* is a feasible solution of (Q) for which there exist λ1, λ2, . . . , λm ≥ 0 and µ1, µ2, . . . , µp ∈ R such that (11) and (12) are satisfied, then x̄* is an optimal solution of (Q).

Theorem (Fritz–John conditions for inequality constrained problems). Let x̄* be a local minimum of the problem

min f(x̄)
s.t. gi(x̄) ≤ 0, i = 1, 2, . . . , m,

where f, g1, . . . , gm are continuously differentiable functions over Rⁿ. Then there exist multipliers λ0, λ1, . . . , λm ≥ 0, which are not all zeros, such that

λ0∇f(x̄*) + Σ_{i=1}^m λi∇gi(x̄*) = 0,
λi gi(x̄*) = 0, i = 1, 2, . . . , m.

Theorem (KKT conditions for inequality/equality constrained problems). Let x̄∗ be


a local minimum of the problem

min f (x̄)

s.t. gi (x̄) ≤ 0, i = 1, 2, . . . , m, (13)

hj (x̄) = 0, j = 1, 2, . . . , p,

where f, g1 , . . . , gm , h1 , h2 , . . . , hp are continuously differentiable functions over Rn . Suppose that


the gradients of the active constraints and the equality constraints

{∇gi (x̄∗ ) : i ∈ I(x̄∗ )} ∪ {∇hj (x̄∗ ) : j = 1, 2, . . . , p}

are linearly independent (where as before I(x̄∗ ) = {i : gi (x̄∗ ) = 0}). Then there exist multipliers
λ1 , λ2 , . . . , λm ≥ 0 and µ1 , µ2 , . . . , µp ∈ R such that

∇f(x̄∗) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x̄∗) + Σⱼ₌₁ᵖ µⱼ∇hⱼ(x̄∗) = 0,

λᵢgᵢ(x̄∗) = 0, i = 1, 2, . . . , m.

Definition:(KKT points). Consider the minimization problem (13), where f, g1 , . . . , gm , h1 , h2 ,


. . . , hp are continuously differentiable functions over Rn . A feasible point x̄∗ is called a KKT point
if there exist λ1 , λ2 , . . . , λm ≥ 0 and µ1 , µ2 , . . . , µp ∈ R such that

∇f(x̄∗) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x̄∗) + Σⱼ₌₁ᵖ µⱼ∇hⱼ(x̄∗) = 0

λᵢgᵢ(x̄∗) = 0, i = 1, 2, . . . , m.

Definition:(regularity). Consider the minimization problem (13), where f, g1 , . . . , gm , h1 , h2 ,
. . . , hp are continuously differentiable functions over Rn . A feasible point x̄∗ is called regular if the
gradients of the active constraints among the inequality constraints and of the equality constraints

{∇gi (x̄∗ ) : i ∈ I(x̄∗ )} ∪ {∇hj (x̄∗ ) : j = 1, 2, . . . , p.}

are linearly independent.

Thus, a necessary condition for local optimality of a regular point is that it is a KKT point. In the linearly constrained case, no such regularity assumption is needed.
The Convex Case
The KKT conditions are necessary optimality conditions under the regularity condition. When the
problem is convex, the KKT conditions are always sufficient and no further conditions are required.

Theorem (sufficiency of the KKT conditions for convex optimization problems).


Let x̄∗ be a feasible solution of

min f (x̄)

s.t. gi (x̄) ≤ 0, i = 1, 2, . . . , m, (14)

hj (x̄) = 0, j = 1, 2, . . . , p,

where f, g1 , . . . , gm are continuously differentiable convex functions over Rn and h1 , h2 , . . . , hp are


affine functions. Suppose that there exist multipliers λ1 , λ2 , . . . , λm ≥ 0 and µ1 , µ2 , . . . , µp ∈ R
such that

∇f(x̄∗) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x̄∗) + Σⱼ₌₁ᵖ µⱼ∇hⱼ(x̄∗) = 0,

λᵢgᵢ(x̄∗) = 0, i = 1, 2, . . . , m.

Then x̄∗ is an optimal solution to the minimization problem.
Slater’s Condition
This condition is satisfied for a set of convex inequalities

gi (x̄) ≤ 0, i = 1, 2, . . . , m,

where g1 , g2 , . . . , gm are given convex functions if there exists x̄∗ ∈ Rn such that

gi (x̄∗ ) < 0, i = 1, 2, . . . , m.

Theorem (necessity of the KKT conditions under Slater's condition).


Let x̄∗ be an optimal solution of the problem

min f (x̄)

s.t. gi (x̄) ≤ 0, i = 1, 2, . . . , m,

where f, g1 , . . . , gm are continuously differentiable functions over Rn . In addition, g1 , g2 , . . . , gm are


convex functions over Rⁿ. Suppose that there exists a point x̂ ∈ Rⁿ such that

gᵢ(x̂) < 0, i = 1, 2, . . . , m.

Then there exist multipliers λ₁, λ₂, . . . , λₘ ≥ 0 such that

∇f(x̄∗) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x̄∗) = 0,

λᵢgᵢ(x̄∗) = 0, i = 1, 2, . . . , m.
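Slater's condition can also be checked numerically: for convex gᵢ, it holds exactly when the infimum over x of maxᵢ gᵢ(x) is strictly negative. The sketch below (assuming SciPy is available) does this for the constraint pair x² + y² ≤ 5 and x − y ≤ 1, which reappears in a later example.

# Slater check: minimize max_i g_i(x) and test whether the minimum is < 0,
# here for g1 = x**2 + y**2 - 5 and g2 = x - y - 1.
from scipy.optimize import minimize

def worst_constraint(v):
    x, y = v
    return max(x**2 + y**2 - 5, x - y - 1)

res = minimize(worst_constraint, x0=[0.0, 0.0], method="Nelder-Mead")
print("min of max_i g_i:", res.fun)        # negative, so Slater's condition holds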

Let us discuss necessary and sufficient conditions for x̄ = (x̄1 , x̄2 , . . . , x̄n ) to be an optimal
solution for the following NLP:

max (or min) f (x1 , x2 , . . . , xn )

s.t. g1 (x1 , x2 , . . . , xn ) ≤ b1

g2 (x1 , x2 , . . . , xn ) ≤ b2

gm (x1 , x2 , . . . , xn ) ≤ bm (15)

To apply the K.T. conditions, all the NLP’s constraints must be ≤ constraints. Any constraint
of the form h(x1 , x2 , ..., xn ) ≥ b must be rewritten as −h(x1 , x2 , ..., xn ) ≤ −b. For example, the
constraint 2x1 + x2 ≥ 2 should be rewritten as −2x1 − x2 ≤ −2.
For the following theorems to hold, the functions g1 , g2 , . . . , gm must satisfy certain regularity conditions, called constraint qualifications. When the constraints are linear, these regularity assumptions are always satisfied. In other situations (particularly when some of the constraints are equality constraints), the regularity conditions may not be satisfied. We assume that all problems we consider satisfy these constraint qualifications.

Theorem α1 :
Suppose (15) is a maximization problem. If x̄ = (x̄1 , x̄2 , . . . , x̄n ) is an optimal solution to (15),
then x̄ = (x̄1 , x̄2 , . . . , x̄n ) must satisfy the m constraints in (15), and there must exist multipliers
λ̄1 , λ̄2 , . . . , λ̄m satisfying

∂f(x̄)/∂xⱼ − Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ = 0    (j = 1, 2, . . . , n)    (16)

λ̄ᵢ[bᵢ − gᵢ(x̄)] = 0    (i = 1, 2, . . . , m)    (17)

λ̄ᵢ ≥ 0    (i = 1, 2, . . . , m)    (18)

Theorem α2 :
Suppose (15) is a minimization problem. If x̄ = (x̄1 , x̄2 , . . . , x̄n ) is an optimal solution to (15),
then x̄ = (x̄1 , x̄2 , . . . , x̄n ) must satisfy the m constraints in (15), and there must exist multipliers
λ̄1 , λ̄2 , . . . , λ̄m satisfying

∂f(x̄)/∂xⱼ + Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ = 0    (j = 1, 2, . . . , n)    (19)

λ̄ᵢ[bᵢ − gᵢ(x̄)] = 0    (i = 1, 2, . . . , m)    (20)

λ̄ᵢ ≥ 0    (i = 1, 2, . . . , m)    (21)
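To see these conditions in use on a small hypothetical problem (not from the notes), take the maximization problem max f = −(x1 − 1)² − (x2 − 2)² subject to x1 + x2 ≤ 2. The sketch below builds conditions (16)–(17) of Theorem α1 with SymPy and keeps only the solutions satisfying (18) and feasibility; because f is concave and the constraint is linear, the sufficiency result given later (Theorem α5) guarantees that the surviving point is optimal.

# Theorem alpha_1 applied to  max -(x1 - 1)**2 - (x2 - 2)**2  s.t.  x1 + x2 <= 2.
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
f = -(x1 - 1)**2 - (x2 - 2)**2
g, b = x1 + x2, 2

eqs = [sp.Eq(sp.diff(f, x1) - lam * sp.diff(g, x1), 0),   # (16), j = 1
       sp.Eq(sp.diff(f, x2) - lam * sp.diff(g, x2), 0),   # (16), j = 2
       sp.Eq(lam * (b - g), 0)]                           # (17)

for sol in sp.solve(eqs, (x1, x2, lam), dict=True):
    if sol[lam] >= 0 and g.subs(sol) <= b:                # (18) and feasibility
        print("optimal point:", sol)                      # expect x1 = 1/2, x2 = 3/2, lambda = 1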

In many situations, the K.T conditions are applied to NLPs in which the variables must be
non-negative. For example, we may want to use the K.T conditions to find the optimal solution
to

max (or min) z = f (x1 , x2 , . . . , xn )

s.t. g1 (x1 , x2 , . . . , xn ) ≤ b1

g2 (x1 , x2 , . . . , xn ) ≤ b2

gm (x1 , x2 , . . . , xn ) ≤ bm

−xi ≤ 0 (i = 1, 2, . . . , n) (22)

If we associate multipliers µi (i = 1, 2, . . . , n) with the non-negativity constraints in (22), the
previous theorems reduce to the following:
Theorem α3 :
Suppose (22) is a maximization problem. If x̄ = (x̄1 , x̄2 , . . . , x̄n ) is an optimal solution to (22),
then x̄ = (x̄1 , x̄2 , . . . , x̄n ) must satisfy the constraints in (22), and there must exist multipliers
λ̄1 , λ̄2 , . . . , λ̄m , µ̄1 , µ̄2 , . . . , µ̄n satisfying

∂f(x̄)/∂xⱼ − Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ + µ̄ⱼ = 0    (j = 1, 2, . . . , n)    (23)

λ̄ᵢ[bᵢ − gᵢ(x̄)] = 0    (i = 1, 2, . . . , m)    (24)

[∂f(x̄)/∂xⱼ − Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ] x̄ⱼ = 0    (j = 1, 2, . . . , n)    (25)

λ̄ᵢ ≥ 0    (i = 1, 2, . . . , m)    (26)

µ̄ⱼ ≥ 0    (j = 1, 2, . . . , n)    (27)

Because µ̄ⱼ ≥ 0, (23) is equivalent to

∂f(x̄)/∂xⱼ − Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ ≤ 0    (j = 1, 2, . . . , n).

The K–T conditions for a maximization problem with non-negativity constraints may therefore be written as

∂f(x̄)/∂xⱼ − Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ ≤ 0    (j = 1, 2, . . . , n)    (28)

λ̄ᵢ[bᵢ − gᵢ(x̄)] = 0    (i = 1, 2, . . . , m)    (29)

[∂f(x̄)/∂xⱼ − Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ] x̄ⱼ = 0    (j = 1, 2, . . . , n)    (30)

λ̄ᵢ ≥ 0    (i = 1, 2, . . . , m)    (31)

Theorem α4 :
Suppose (22) is a minimization problem. If x̄ = (x̄1 , x̄2 , . . . , x̄n ) is an optimal solution to (22), then x̄ = (x̄1 , x̄2 , . . . , x̄n ) must satisfy the constraints in (22), and there must exist multipliers λ̄1 , λ̄2 , . . . , λ̄m , µ̄1 , µ̄2 , . . . , µ̄n satisfying

∂f(x̄)/∂xⱼ + Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ − µ̄ⱼ = 0    (j = 1, 2, . . . , n)    (32)

λ̄ᵢ[bᵢ − gᵢ(x̄)] = 0    (i = 1, 2, . . . , m)    (33)

[∂f(x̄)/∂xⱼ + Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ] x̄ⱼ = 0    (j = 1, 2, . . . , n)    (34)

λ̄ᵢ ≥ 0    (i = 1, 2, . . . , m)    (35)

µ̄ⱼ ≥ 0    (j = 1, 2, . . . , n)    (36)

Because µ̄ⱼ ≥ 0, (32) may be written as

∂f(x̄)/∂xⱼ + Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ ≥ 0    (j = 1, 2, . . . , n).

The K–T conditions for a minimization problem with non-negativity constraints may therefore be written as

∂f(x̄)/∂xⱼ + Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ ≥ 0    (j = 1, 2, . . . , n)    (37)

λ̄ᵢ[bᵢ − gᵢ(x̄)] = 0    (i = 1, 2, . . . , m)    (38)

[∂f(x̄)/∂xⱼ + Σᵢ₌₁ᵐ λ̄ᵢ ∂gᵢ(x̄)/∂xⱼ] x̄ⱼ = 0    (j = 1, 2, . . . , n)    (39)

λ̄ᵢ ≥ 0    (i = 1, 2, . . . , m)    (40)

The previous theorems give conditions that are necessary for a point to be an optimal solution.
The following two theorems give conditions that are sufficient for a point to be an optimal solution
to (15) or (22).

Theorem α5 : Suppose (15) is a maximization problem. If f (x1 , x2 , . . . , xn ) is a concave


function and g1 (x1 , x2 , . . . , xn ), . . . , gm (x1 , x2 , . . . , xn ) are convex functions, then any point x̄ =
(x̄1 , x̄2 , . . . , x̄n ) satisfying the hypotheses of Theorem α1 is an optimal solution to (15). Also, if (22)
is a maximization problem, f (x1 , x2 , . . . , xn ) is a concave function, and g1 (x1 , x2 , . . . , xn ), . . . , gm (x1 , x2 , . . . , xn )
are convex functions, then any point x̄ = (x̄1 , x̄2 , . . . , x̄n ) satisfying the hypotheses of Theorem α3
is an optimal solution to (22).

Theorem α6 : Suppose (15) is a minimization problem. If f (x1 , x2 , . . . , xn ) is a convex function and


g1 (x1 , x2 , . . . , xn ), . . . , gm (x1 , x2 , . . . , xn ) are convex functions, then any point x̄ = (x̄1 , x̄2 , . . . , x̄n )
satisfying the hypotheses of Theorem α2 is an optimal solution to (15). Also, if (22) is a minimiza-
tion problem, f (x1 , x2 , . . . , xn ) is a convex function, and g1 (x1 , x2 , . . . , xn ), . . . , gm (x1 , x2 , . . . , xn )
are convex functions, then any point x̄ = (x̄1 , x̄2 , . . . , x̄n ) satisfying the hypotheses of Theorem α4
is an optimal solution to (22).

4 Geometrical Interpretation of Kuhn–Tucker Conditions

It is easy to show that conditions (16)–(18) of Theorem α1 will hold at a point x̄ if and only if ∇f is a non-negative linear combination of ∇g1 , ∇g2 , . . . , ∇gm , and the weight multiplying ∇gi in this linear combination equals 0 if the ith constraint in (15) is nonbinding.
In short, (16)–(18) are equivalent to the existence of λi ≥ 0 such that

∇f(x̄) = Σᵢ₌₁ᵐ λᵢ∇gᵢ(x̄)    (41)

and each constraint that is nonbinding at x̄ has λi = 0. Figures 43 and 44 illustrate (41). In Figure
43, we are trying to solve (the feasible region is shaded)

min z = f (x1 , x2 ) (42)

s.t. g1 (x1 , x2 ) ≤ 0 (43)

g2 (x1 , x2 ) ≤ 0 (44)

At x̄ (41) holds with both constraints binding and we have λ1 > 0 and λ2 > 0. In Figure 44, we
are again trying to solve (feasible region is again shaded)

min z = f (x1 , x2 ) (45)

s.t. g1 (x1 , x2 ) ≤ 0 (46)

g2 (x1 , x2 ) ≤ 0 (47)

Here, the second constraint is nonbinding so (41) must hold with λ2 = 0. The following two exam-
ples illustrate the use of the K–T conditions.

Example: Assume that f′(x) exists for all x on the interval [a, b]. Describe the optimal solution to the following NLP, using the Kuhn-Tucker conditions:

max f (x)

s.t. a ≤ x ≤ b

Solution:
We know that the optimal solution to this problem must occur at a [with f′(a) ≤ 0], at b [with f′(b) ≥ 0], or at a point having f′(x) = 0.
How do the K.T conditions yield these three cases?
Let us write this NLP in the following form:

max f (x)

s.t. − x ≤ −a

x ≤ b

Then the K.T conditions yield

f′(x) + λ1 − λ2 = 0 (48)

λ1 (−a + x) = 0 (49)

λ2 (b − x) = 0 (50)

λ1 ≥ 0 (51)

λ2 ≥ 0 (52)

In using the K.T conditions to solve NLPs, it is useful to note that each multiplier λi must
satisfy λi = 0 or λi > 0. Thus, in attempting to find values of x, λ1 , and λ2 that satisfy the
previous equations, we must consider the following four cases:

Case 1. λ1 = λ2 = 0. From (48), we obtain the case f′(x̄) = 0.

Case 2. λ1 = 0 and λ2 > 0. Since λ2 > 0, (50) yields x̄ = b. Then (48) yields f′(b) = λ2 , and because
        λ2 > 0, we obtain the case where f′(b) > 0.

Case 3. λ1 > 0 and λ2 = 0. Since λ1 > 0, (49) yields x̄ = a. Then (48) yields the case where
        f′(a) = −λ1 < 0.

Case 4. λ1 > 0, λ2 > 0. From (49) and (50), we obtain x̄ = a and x̄ = b. This contradiction indicates
        that Case 4 cannot occur.

The constraints are linear, so Theorem α5 shows that if f(x) is concave, then any point satisfying the K.T conditions (48)–(52) is an optimal solution to the initial problem.
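As a concrete instance of this case analysis (a hypothetical choice of f and [a, b], not from the notes), the sketch below runs through Cases 1–3 for f(x) = −(x − 3)² on [0, 2] with SymPy; since this f is concave, the single surviving K.T candidate is the maximizer.

# Enumerate the K.T cases for  max f(x) = -(x - 3)**2  s.t.  0 <= x <= 2.
import sympy as sp

x = sp.symbols('x', real=True)
a, b = 0, 2
f = -(x - 3)**2
fp = sp.diff(f, x)

candidates = []

# Case 1: lambda1 = lambda2 = 0  ->  f'(x) = 0 with a <= x <= b.
candidates += [r for r in sp.solve(sp.Eq(fp, 0), x) if a <= r <= b]

# Case 2: lambda1 = 0, lambda2 > 0  ->  x = b and lambda2 = f'(b) > 0.
if fp.subs(x, b) > 0:
    candidates.append(sp.Integer(b))

# Case 3: lambda1 > 0, lambda2 = 0  ->  x = a and lambda1 = -f'(a) > 0, i.e. f'(a) < 0.
if fp.subs(x, a) < 0:
    candidates.append(sp.Integer(a))

# Case 4 is impossible, as argued above.
best = max(candidates, key=lambda c: f.subs(x, c))
print("K.T candidates:", candidates, "| maximizer:", best, "| f =", f.subs(x, best))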

Example : A monopolist can purchase up to 17.25 oz of a chemical for $10/oz. At a cost of $3/oz,
the chemical can be processed into an ounce of product 1; or, at a cost of $5/oz, the chemical can
be processed into an ounce of product 2. If x1 oz of product 1 are produced, it sells for a price of
$30 − x1 per ounce. If x2 oz of product 2 are produced, it sells for a price of $50 − 2x2 per ounce.
Determine how the monopolist can maximize profits.
Solution:
Let

• x1 = ounces of product 1 produced

• x2 = ounces of product 2 produced

• x3 = ounces of chemical processed

Then we want to solve the following NLP:

max z = x1 (30 − x1 ) + x2 (50 − 2x2 ) − 3x1 − 5x2 − 10x3

s.t. x1 + x2 − x3 ≤ 0 (53)

x3 ≤ 17.25

x1 , x2 , x3 ≥ 0,

We apply the K.T conditions of Theorem α1 and ignore the non-negativity constraints, since the optimal solution to (53) turns out to satisfy them anyway. Observe that the objective function in (53) is a sum of concave functions (and is therefore concave), and the constraints are convex (because they are linear). Thus, Theorem α5 shows that the K.T conditions are necessary and sufficient for a point to be an optimal solution to (53). From Theorem α1, the K.T conditions become

30 − 2x1 − 3 − λ1 = 0 (54)

50 − 4x2 − 5 − λ1 = 0 (55)

−10 + λ1 − λ2 = 0 (56)

λ1 (−x1 − x2 + x3 ) = 0 (57)

λ2 (17.25 − x3 ) = 0 (58)

λ1 ≥ 0 (59)

λ2 ≥ 0 (60)

Here, there are four cases to consider:

Case 1. λ1 = λ2 = 0. This case cannot occur, because (56) would be violated.

Case 2. λ1 = 0, λ2 > 0. If λ1 = 0, then (56) implies λ2 = −10. This would violate (60).

Case 3. λ1 > 0, λ2 = 0. From(56), we obtain λ1 = 10. Now (54) yields x1 = 8.5, and (55) yields
x2 = 8.75. From (57), we obtain x1 + x2 = x3 , so x3 = 17.25. Thus, x̄1 = 8.5, x̄2 = 8.75, x̄3 =
17.25, λ̄1 = 10, λ̄2 = 0 satisfies the K.T conditions.

Case 4. λ1 > 0, λ2 > 0. Case 3 yields an optimal solution, so we need not consider Case 4.

Thus, the optimal solution to (53) is to buy 17.25 oz of the chemical and produce 8.5 oz of product 1 and 8.75 oz of product 2. For ∆ small, λ̄1 = 10 indicates that if an extra ∆ oz of the chemical were obtained at no cost, then profits would increase by 10∆. (Can you see why?) From (56), we find that λ̄2 = 0. This implies that the right to purchase an extra oz of the chemical would not increase profits. (Can you see why?)
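The solution and the interpretation of the multipliers can be cross-checked numerically. The sketch below (assuming SciPy is available) solves (53) directly, then re-solves it after two small relaxations: allowing ∆ oz of free chemical (relaxing the first constraint) raises profit by about 10∆, matching λ̄1 = 10, while raising the purchase limit (relaxing the second constraint) changes profit by essentially nothing, matching λ̄2 = 0.

# Numerical check of the monopolist example and its shadow prices.
from scipy.optimize import minimize

def profit(v):
    x1, x2, x3 = v
    return x1 * (30 - x1) + x2 * (50 - 2 * x2) - 3 * x1 - 5 * x2 - 10 * x3

def solve(free=0.0, limit=17.25):
    cons = [{"type": "ineq", "fun": lambda v: v[2] - v[0] - v[1] + free},   # x1 + x2 - x3 <= free
            {"type": "ineq", "fun": lambda v: limit - v[2]}]                # x3 <= limit
    res = minimize(lambda v: -profit(v), x0=[0.0, 0.0, 0.0],
                   bounds=[(0, None)] * 3, constraints=cons)
    return -res.fun

z = solve()
d = 0.01
print("maximum profit          :", z)                                 # approx 225.375
print("value of free chemical  :", (solve(free=d) - z) / d)           # approx 10 = lambda1-bar
print("value of a higher limit :", (solve(limit=17.25 + d) - z) / d)  # approx 0 = lambda2-bar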

Constraint Qualifications
Unless a constraint qualification or regularity condition is satisfied at an optimal point x̄, the
Kuhn–Tucker conditions may fail to hold at x̄. There are many constraint qualifications, but we
choose to discuss the Linear Independence Constraint Qualification: Let x̄ be an optimal solution
to NLP (7) or (14).
If all gi are continuous, and the gradients of all binding constraints (including any binding non-
negativity constraints on x1 , x2 , . . . , xn ) at x̄ form a set of linearly independent vectors, then the
Kuhn–Tucker conditions must hold at x̄.
The following example shows that if the Linear Independence Constraint Qualification fails to
hold, then the Kuhn–Tucker conditions may fail to hold at the optimal solution to an NLP.
Example: Show that the Kuhn–Tucker conditions fail to hold at the optimal solution to the following NLP:

max z = x1

s.t. x2 − (1 − x1)³ ≤ 0 (61)

x1 , x2 ≥ 0

Solution:
If x1 > 1, then the first constraint in (61) implies that x2 < 0. Thus, the optimal z-value for (61)
cannot exceed 1. Since x1 = 1 and x2 = 0 is feasible and yields z = 1, (1, 0) must be the optimal
solution to NLP (61).
From Theorem α3 , the following are two of the Kuhn–Tucker conditions for (61).

1 − 3λ1(1 − x1)² = −µ1 (62)

µ1 ≥ 0 (63)

At the optimal solution (1, 0), (62) implies µ1 = −1, which contradicts (63). Thus, the Kuhn–Tucker
conditions are not satisfied at (1, 0).

One can say that the Kuhn–Tucker conditions are not satisfied at (1, 0) because the Linear Independence Constraint Qualification is violated at that point for the stated problem. At (1, 0) the constraints x2 − (1 − x1)³ ≤ 0 and x2 ≥ 0 are binding. Then

∇(x2 − (1 − x1)³) = [0, 1]

∇(−x2) = [0, −1]

Since [0, 1] + [0, −1] = [0, 0], these gradients are linearly dependent. Thus, at (1,0) the gradients
of the binding constraints are linearly dependent, and the constraint qualification is not satisfied.
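The same point can be made numerically, and it also connects back to the Fritz-John conditions stated earlier: at (1, 0) they do hold, but only with λ0 = 0, which is exactly the degenerate situation the constraint qualification rules out. A small sketch (assuming NumPy is available):

# At (1, 0): binding-constraint gradients are linearly dependent, and the
# Fritz-John conditions hold only with lambda0 = 0.
import numpy as np

grad_f = np.array([1.0, 0.0])      # gradient of the objective z = x1
grad_g1 = np.array([0.0, 1.0])     # gradient of x2 - (1 - x1)**3 at (1, 0)
grad_g2 = np.array([0.0, -1.0])    # gradient of -x2 (binding non-negativity constraint)

G = np.column_stack([grad_g1, grad_g2])
print("rank of binding gradients:", np.linalg.matrix_rank(G))      # 1 (< 2): linearly dependent

# Fritz-John with lambda0 = 0, lambda1 = lambda2 = 1 (not all zero):
print("FJ combination:", 0 * grad_f + 1 * grad_g1 + 1 * grad_g2)   # [0. 0.]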

Example: Solve the following, using Kuhn-Tucker conditions:

Max f (x, y) = 3x + y

s.t. x² + y² ≤ 5

x−y ≤1
