
School of Computer Science and Applied Mathematics

APPM 3017: Optimization III


Lecture 05: Quasi-Newton Methods
Lecturer: Matthews Sejeso    Date: August 2021

LEARNING OUTCOMES
By the end of this lecture you should be able to:
1. Describe properties of the quasi-Newton methods.
2. Describe the symmetric rank-one quasi-Newton method.
3. Describe the rank-two quasi-Newton method.
References:
• Chapter 6, Jorge Nocedal and Stephen J. Wright, ‘Numerical Optimization’.
• Chapter 1, Dimitri P. Bertsekas, ‘Nonlinear Programming’.

5.1 Introduction

Newton's method is one of the most important algorithms in optimization. Recall that the idea in Newton's method is to approximate the objective function locally at every iteration by a quadratic function, and to take the next iterate as the minimizer of this quadratic approximation. The general iteration is

x_{k+1} = x_k − s_k [∇²f(x_k)]^{-1} ∇f(x_k),    (5.1)

where the stepsize s_k is chosen to ensure a reduction of the objective function. Note that Newton's method finds the global minimum of a convex quadratic function in one iteration (assuming s_k = 1). If the initial point is far from the optimum, the search direction may not always be a descent direction. Thus, convergence to an optimal solution cannot be guaranteed from an arbitrary initial point for a general nonlinear objective function. When Newton's method does converge, however, it does so at a quadratic rate.
The computational drawback of Newton's method is the need to evaluate the Hessian ∇²f(x_k) and to solve the equation ∇²f(x_k) d_k = ∇f(x_k) (i.e. compute the search direction d_k). To avoid forming [∇²f(x_k)]^{-1}, quasi-Newton methods use an approximation in place of the true inverse. The approximation is updated at each iteration so that it exhibits some of the properties associated with the true inverse [∇²f(x_k)]^{-1}.
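For concreteness, a single damped Newton step can be sketched in a few lines of code. This snippet is only an added illustration (it is not part of the original notes); grad and hess stand for user-supplied routines returning ∇f(x) and ∇²f(x), and the Newton system is solved rather than the Hessian inverted.

    import numpy as np

    def newton_step(x, grad, hess, s=1.0):
        """One damped Newton iteration: solve ∇²f(x) d = ∇f(x), then move to x - s d."""
        g = grad(x)
        H = hess(x)
        d = np.linalg.solve(H, g)   # avoids forming the inverse Hessian explicitly
        return x - s * d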


Replacing the true inverse Hessian by such an approximation leads to the recursive algorithm

x_{k+1} = x_k − s_k H_k g_k,    (5.2)

where H_k is the approximation of the inverse Hessian [∇²f(x_k)]^{-1}, and g_k = ∇f(x_k). The approximation H_k is required to be positive definite to ensure a descent direction, and thus to guarantee a decrease in the objective function.
When constructing an approximation to the inverse of the Hessian matrix, we should use only objective function and gradient values. Thus, if we can find a suitable method of choosing H_k, the iteration may be carried out without any evaluation of the Hessian and without computation of the inverse Hessian. In this lecture, we discuss different choices for updating the matrix H_k.

5.2 Approximating the Inverse Hessian

In this section, we derive an additional condition that the approximation H_k should satisfy, which forms the basis of the subsequent discussion of quasi-Newton methods. To begin, suppose that the Hessian matrix ∇²f(x) of the objective function f is constant and independent of x. Thus, we are dealing with a quadratic function

f(x) = (1/2) x^T Q x + b^T x + c,

with the Hessian ∇²f(x) = Q for all x, where Q = Q^T. An important idea is that the change in position x_{k+1} − x_k, together with the corresponding change in gradient g_{k+1} − g_k, provides information about the Hessian: since ∇f(x) = Qx + b, we have

g_{k+1} − g_k = Q(x_{k+1} − x_k).    (5.3)

Let the change in position and change in gradient be represented by

δ k = xk+1 − xk and γ k = gk+1 − gk .

Then, the relation (5.3) may be written as

γ k = Qδ k . (5.4)

Note that, for a given k, the matrix Q^{-1} satisfies

Q^{-1} γi = δi,    0 ≤ i ≤ k.

We start with a real symmetric positive definite matrix H0, and at the k-th iteration we require that the approximation Hk+1 of the inverse Hessian satisfies

Hk+1 γi = δi,    0 ≤ i ≤ k.    (5.5)

Equation (5.5) is known as the quasi-Newton condition. Hence, given n linearly independent iteration increments δ0, δ1, ..., δn−1, we obtain

Hn [γ0, γ1, ..., γn−1] = [δ0, δ1, ..., δn−1].

Note that Q satisfies

Q [δ0, δ1, ..., δn−1] = [γ0, γ1, ..., γn−1]

and

Q^{-1} [γ0, γ1, ..., γn−1] = [δ0, δ1, ..., δn−1].

Therefore, if the matrix [γ0, γ1, ..., γn−1] is non-singular, then Q^{-1} is determined uniquely after n steps, via

Q^{-1} = Hn = [δ0, δ1, ..., δn−1][γ0, γ1, ..., γn−1]^{-1}.

As a consequence, we conclude that if Hn satisfies the equations Hn γi = δi, 0 ≤ i ≤ n − 1, then the iteration x_{k+1} = x_k − s_k H_k g_k, with exact line search, is guaranteed to solve problems with a quadratic objective function in n steps.
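The n-step recovery of Q^{-1} is easy to check numerically. The following sketch (an added illustration under the quadratic assumption above, not part of the original notes) builds a random symmetric positive definite Q, generates n linearly independent steps, and verifies that [δ0, ..., δn−1][γ0, ..., γn−1]^{-1} equals Q^{-1}.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    Q = A @ A.T + n * np.eye(n)          # a symmetric positive definite "Hessian"

    Delta = rng.standard_normal((n, n))  # columns play the role of the steps delta_i
    Gamma = Q @ Delta                    # corresponding gradient changes gamma_i = Q delta_i

    H_n = Delta @ np.linalg.inv(Gamma)   # [delta_0 ... delta_{n-1}] [gamma_0 ... gamma_{n-1}]^{-1}
    print(np.allclose(H_n, np.linalg.inv(Q)))   # True: the secant pairs determine Q^{-1}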
These considerations illustrate the basic idea of quasi-Newton methods. Specifically, quasi-Newton algorithms have the form illustrated in Algorithm 5.1.

Algorithm 5.1 General quasi-Newton Algorithm
1: Set the initial point x0, the initial approximate inverse Hessian H0, and k ← 0.
2: while not converged do
3: Compute the gradient gk = ∇f(xk) and set the search direction dk = −Hk gk.
4: Determine the stepsize sk, using exact or inexact line search.
5: Compute the new iterate xk+1 = xk + sk dk.
6: Update the approximation of the inverse Hessian, Hk+1.
7: Increase the iteration counter: k = k + 1.
8: end while
9: return x∗ = xk
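Algorithm 5.1 translates almost line for line into code. The skeleton below is an added illustrative sketch, not part of the original notes: update stands for any rule satisfying the quasi-Newton condition (the rank-one and rank-two formulas of the following sections), and a simple backtracking (Armijo) rule is used as the inexact line search.

    import numpy as np

    def quasi_newton(f, grad, x0, update, tol=1e-8, max_iter=200):
        """General quasi-Newton loop (Algorithm 5.1) with a pluggable inverse-Hessian update."""
        x = np.asarray(x0, dtype=float)
        H = np.eye(x.size)                       # H0: initial approximate inverse Hessian
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:          # stopping criterion on the gradient norm
                break
            d = -H @ g                           # search direction dk = -Hk gk
            s = 1.0
            while f(x + s * d) > f(x) + 1e-4 * s * (g @ d) and s > 1e-12:
                s *= 0.5                         # backtracking (Armijo) inexact line search
            x_new = x + s * d
            g_new = grad(x_new)
            delta, gamma = x_new - x, g_new - g  # change in position and change in gradient
            H = update(H, delta, gamma)          # e.g. SR1 (5.8) or a Broyden-class rule (5.10)
            x, g = x_new, g_new
        return x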

It turns out that, in the case of a quadratic function, quasi-Newton methods are also conjugate direction methods, as stated in the following theorem.

Theorem 5.1. Consider a quasi-Newton algorithm applied to a quadratic function with Hessian Q = Q^T, such that for 0 ≤ k < n − 1,

Hk+1 γi = δi,    0 ≤ i ≤ k,

where Hk+1 = Hk+1^T. If si ≠ 0 for 0 ≤ i ≤ k + 1, then d0, d1, ..., dk+1 are Q-conjugate.

The immediate consequence of Theorem 5.1 is that a quasi-Newton method converges in n steps for a quadratic function of n variables, provided exact line search is used. Note that the quasi-Newton condition (5.5) does not restrict how the matrices Hk are determined; in particular, these matrices are not unique. Thus, we have some freedom in the way we compute Hk. In the methods we describe, we compute Hk+1 by adding a correction ∆Hk to Hk, i.e. Hk+1 = Hk + ∆Hk. Methods differ in the way the correction term is chosen; essentially, they are classified according to rank-one and rank-two correction formulae.

5.3 The rank one correction formula

In the symmetric rank-one correction formula, the correction term is symmetric and has the form ak uk uk^T, where ak is a scalar and uk ∈ R^n. Therefore, the update equation is

Hk+1 = Hk + ak uk uk^T.    (5.6)

Note that rank(uk uk^T) = 1, so the update makes a rank-one correction (the method is also called the symmetric rank-one algorithm, because the update in (5.6) is always symmetric).

The goal is to determine ak and uk, given Hk, γk and δk, such that the quasi-Newton equation (5.5) is satisfied. To begin, let us first consider the condition Hk+1 γk = δk. Given Hk, γk and δk, we wish to find ak and uk to ensure that

Hk+1 γk = [Hk + ak uk uk^T] γk = δk.

Note that uk^T γk is a scalar. Thus

δk − Hk γk = ak (uk^T γk) uk,    (5.7)

and hence

uk = (δk − Hk γk) / (ak uk^T γk).

We can now write the last term of equation (5.6) as

ak uk uk^T = (δk − Hk γk)(δk − Hk γk)^T / [ak (uk^T γk)²].

Hence, substituting this expression into (5.6), we get

Hk+1 = Hk + (δk − Hk γk)(δk − Hk γk)^T / [ak (uk^T γk)²].

The next step is to express the denominator of the second term on the right-hand side of the above equation in terms of the known quantities Hk, γk and δk. To accomplish this, premultiply equation (5.7) by γk^T to obtain

γk^T (δk − Hk γk) = ak γk^T uk uk^T γk.

Note that ak is a scalar, and so is γk^T uk = uk^T γk. Thus

γk^T (δk − Hk γk) = ak (uk^T γk)².

Taking the above relation into account yields the rank-one update

Hk+1 = Hk + (δk − Hk γk)(δk − Hk γk)^T / [γk^T (δk − Hk γk)].    (5.8)

Hence the symmetric rank-one algorithm is summarized below:

Algorithm 5.2 Symmetric Rank One Algorithm
1: Set the initial point x0, a symmetric positive definite matrix H0, and k ← 0.
2: while not converged do
3: Compute the gradient gk = ∇f (xk ) and set the search direction dk = −Hk gk .
4: Determine the stepsize sk , using exact or inexact line search.
5: Compute the new iterate xk+1 = xk + sk dk .
6: Update the approximate inverse Hessian Hk+1 , using (5.8).
7: Set k = k + 1.
8: end while
9: return x∗ = xk
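As an added illustration (not part of the original notes), the update (5.8) can be written as a small function that plugs into the loop sketched after Algorithm 5.1. The safeguard that skips the update when the denominator is close to zero is a common practical choice, included here as an assumption rather than something prescribed in the lecture.

    import numpy as np

    def sr1_update(H, delta, gamma, eps=1e-8):
        """Symmetric rank-one update (5.8) of the inverse Hessian approximation."""
        r = delta - H @ gamma                # residual of the quasi-Newton condition
        denom = r @ gamma                    # gamma^T (delta - H gamma)
        if abs(denom) <= eps * np.linalg.norm(r) * np.linalg.norm(gamma):
            return H                         # skip the update when (5.8) would be numerically unstable
        return H + np.outer(r, r) / denom

With update=sr1_update, the quasi_newton skeleton given earlier becomes the symmetric rank-one method of Algorithm 5.2.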

The rank-one algorithm is based on satisfying the equation Hk+1 γk = δk. However, the requirement is that it should satisfy the quasi-Newton condition

Hk+1 γi = δi,    for i = 0, 1, ..., k.

It turns out that this equation is, in fact, automatically satisfied, as stated in the following theorem.

Theorem 5.2. For the rank-one algorithm applied to a quadratic function with Hessian Q = Q^T, we have

Hk+1 γi = δi,    0 ≤ i ≤ k.

Proof: We prove the result by induction. For k = 0, we have

H1 γ0 = H0 γ0 + (δ0 − H0 γ0) [(δ0 − H0 γ0)^T γ0] / [γ0^T (δ0 − H0 γ0)]
      = H0 γ0 + δ0 − H0 γ0 = δ0.

Suppose the theorem is true for k − 1; that is, Hk γi = δi for i < k. We now show that the theorem is true for k. Our construction of the correction term ensures that Hk+1 γk = δk, so it remains to show that

Hk+1 γi = δi,    for i < k.

We have

Hk+1 γi = Hk γi + (δk − Hk γk)(δk − Hk γk)^T γi / [γk^T (δk − Hk γk)].

By the induction hypothesis, Hk γi = δi. To complete the proof, it is enough to show that the second term on the right-hand side of the above equation is equal to zero. To this end, consider

(δk − Hk γk)^T γi = δk^T γi − γk^T Hk γi
                  = δk^T γi − γk^T δi        (by the induction hypothesis)
                  = δk^T γi − δk^T Q δi      (by equation (5.4))
                  = δk^T γi − δk^T γi        (by equation (5.4))
                  = 0.

This completes the proof. Q.E.D.

Example 5.1. Let f(x) = x1² + (1/2)x2² + 3. Apply the rank-one correction algorithm to minimize f. Use the initial point x0 = [1, 2]^T and H0 = I2.
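A short numerical sketch of Example 5.1, added here as an illustration (the lecture leaves the computation to the reader). For this quadratic, Q = diag(2, 1), and the exact line search along dk has the closed form sk = −(gk^T dk)/(dk^T Q dk).

    import numpy as np

    Q = np.diag([2.0, 1.0])                 # Hessian of f(x) = x1^2 + (1/2) x2^2 + 3
    grad = lambda x: Q @ x                  # the constant 3 does not affect the gradient

    x, H = np.array([1.0, 2.0]), np.eye(2)
    for k in range(2):                      # two steps suffice for a quadratic in R^2
        g = grad(x)
        d = -H @ g
        s = -(g @ d) / (d @ Q @ d)          # exact line search
        x_new = x + s * d
        delta, gamma = x_new - x, grad(x_new) - g
        r = delta - H @ gamma
        if abs(r @ gamma) > 1e-12:
            H = H + np.outer(r, r) / (r @ gamma)   # rank-one update (5.8)
        x = x_new
    print(x)                                # approximately [0, 0], the minimizer of f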
The symmetric rank-one algorithm works well in the case of quadratic functions, in which the Hessian matrix is constant. Unfortunately, for nonquadratic problems the formula (5.8) has a few drawbacks. The matrix Hk+1 may not always be positive definite, even when Hk is positive definite; thus dk+1 may not be a descent direction. Furthermore, numerical instabilities appear whenever γk^T (δk − Hk γk) is close to zero. Fortunately, alternative algorithms have been developed for updating Hk. In particular, if we use a rank-two update, the matrix Hk is guaranteed to remain positive definite for all k.

5.4 The rank two correction formula

In the rank-two correction formula, the correction term is symmetric and takes the form ak uk uk^T + bk vk vk^T, where ak, bk are scalars and uk, vk ∈ R^n. Therefore, the update equation is

Hk+1 = Hk + ak uk uk^T + bk vk vk^T.    (5.9)

Instead of determining expressions for ak, bk, uk and vk, as in the rank-one correction, we state the update formula and show that it satisfies the desired properties: the quasi-Newton condition and positive definiteness. The most widely used rank-two quasi-Newton methods belong to the Broyden class. The update rule is specified as

Hk+1 = Hk + (δk δk^T) / (δk^T γk) − (Hk γk)(Hk γk)^T / (γk^T Hk γk) + ck (γk^T Hk γk) qk qk^T,    (5.10)

where

qk = δk / (δk^T γk) − Hk γk / (γk^T Hk γk),

and the scalars ck must satisfy 0 ≤ ck ≤ 1 for all k. The scalars ck parametrize the method. If ck = 0 for all k, we obtain the Davidon-Fletcher-Powell (DFP) method, which is historically the first quasi-Newton method. If ck = 1 for all k, we obtain the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, for which there is substantial evidence that it is the best general-purpose quasi-Newton method. The general rank-two algorithm is summarized below.

Algorithm 5.3 Rank Two Algorithm
1: Set the initial point x0, a symmetric positive definite matrix H0, and k ← 0.
2: while not converged do
3: Compute the gradient gk = ∇f (xk ) and set the search direction dk = −Hk gk .
4: Determine the stepsize sk , using exact or inexact line search.
5: Compute the new iterate xk+1 = xk + sk dk .
6: Update the approximate inverse Hessian Hk+1 , using (5.10).
7: Set k = k + 1.
8: end while
9: return x∗ = xk
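As with the rank-one formula, the Broyden-class update (5.10) fits directly into the general loop sketched after Algorithm 5.1. The function below is an added illustration, not part of the original notes; it assumes δk^T γk > 0 (condition (5.11) below) and a positive definite Hk, so that both denominators are nonzero.

    import numpy as np

    def broyden_class_update(H, delta, gamma, c=1.0):
        """Rank-two (Broyden class) update (5.10): c = 0 gives DFP, c = 1 gives BFGS."""
        Hg = H @ gamma
        dg = delta @ gamma                   # delta^T gamma, positive under (5.11)
        gHg = gamma @ Hg                     # gamma^T H gamma, positive for H positive definite
        q = delta / dg - Hg / gHg
        return (H
                + np.outer(delta, delta) / dg
                - np.outer(Hg, Hg) / gHg
                + c * gHg * np.outer(q, q))

    dfp_update = lambda H, delta, gamma: broyden_class_update(H, delta, gamma, c=0.0)
    bfgs_update = lambda H, delta, gamma: broyden_class_update(H, delta, gamma, c=1.0)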

We now show that the matrices Hk generated by (5.10) indeed define a quasi-Newton method, in the sense that, when applied to the quadratic problem, we have Hk+1 γi = δi, 0 ≤ i ≤ k. We first show that, under a mild assumption, the matrices Hk generated by (5.10) are positive definite. This is a very important property, since it guarantees that dk is a descent direction.

Theorem 5.3. If Hk is positive definite, and the step length sk is chosen such that xk+1 satisfies

gk^T dk < gk+1^T dk,    (5.11)

then Hk+1 given by equation (5.10) is positive definite.

Proof: We first note that equation (5.11) implies that sk ≠ 0, γk ≠ 0, and

δk^T γk = sk dk^T (gk+1 − gk) > 0.    (5.12)

Thus all denominator terms in equation (5.10) are nonzero, and Hk+1 is well defined.
Now, for any vector z ∈ R^n with z ≠ 0, we have

z^T Hk+1 z = z^T Hk z + (z^T δk)² / (δk^T γk) − (γk^T Hk z)² / (γk^T Hk γk) + ck (γk^T Hk γk)(qk^T z)².    (5.13)

We define a := Hk^{1/2} z and b := Hk^{1/2} γk, where Hk = Hk^{1/2} Hk^{1/2}. Note that because Hk is positive definite, its square root is well defined. Using this definition of a and b, we obtain

z^T Hk z = (z^T Hk^{1/2})(Hk^{1/2} z) = a^T a,
z^T Hk γk = (z^T Hk^{1/2})(Hk^{1/2} γk) = a^T b,  and
γk^T Hk γk = (γk^T Hk^{1/2})(Hk^{1/2} γk) = b^T b.

Hence, equation (5.13) can be written as

z^T Hk+1 z = a^T a − (a^T b)² / (b^T b) + (z^T δk)² / (δk^T γk) + ck (γk^T Hk γk)(qk^T z)²
           = [‖a‖² ‖b‖² − (a^T b)²] / ‖b‖² + (z^T δk)² / (δk^T γk) + ck (γk^T Hk γk)(qk^T z)².

All the terms on the right-hand side of the above equation are nonnegative: the first term is nonnegative because of the Cauchy-Schwarz inequality, the second term is nonnegative because of equation (5.12), and the last term is nonnegative because Hk is positive definite. In order to show that z^T Hk+1 z > 0, we need to demonstrate that we cannot simultaneously have

‖a‖² ‖b‖² = (a^T b)²  and  z^T δk = 0.

Indeed, if ‖a‖² ‖b‖² = (a^T b)², we must have a = λb or, equivalently, z = λγk. Since z ≠ 0, it follows that λ ≠ 0, so if z^T δk = 0, we must have γk^T δk = 0, which is impossible by equation (5.12). This completes the proof. Q.E.D.

Theorem 5.4. Let {xk}, {dk}, and {Hk} be sequences generated by the rank-two quasi-Newton algorithm (Algorithm 5.3), applied to a quadratic function with Hessian Q = Q^T. Then Hk+1 γi = δi, 0 ≤ i ≤ k.

Proof: Note that the last term of equation (5.10) vanishes when multiplied by γk. In particular,

qk^T γk = (δk^T γk) / (δk^T γk) − (γk^T Hk γk) / (γk^T Hk γk) = 0.

We prove the theorem by induction. For k = 0, we have

H1 γ0 = H0 γ0 + δ0 (δ0^T γ0) / (δ0^T γ0) − H0 γ0 (γ0^T H0 γ0) / (γ0^T H0 γ0) + c0 (γ0^T H0 γ0) q0 (q0^T γ0)
      = H0 γ0 + δ0 − H0 γ0
      = δ0.

Assume the result is true for k − 1; that is, Hk γi = δi, 0 ≤ i ≤ k − 1. We now show that Hk+1 γi = δi, 0 ≤ i ≤ k. First consider i = k. We have

Hk+1 γk = Hk γk + δk (δk^T γk) / (δk^T γk) − Hk γk (γk^T Hk γk) / (γk^T Hk γk) + ck (γk^T Hk γk) qk (qk^T γk)
        = Hk γk + δk − Hk γk
        = δk.

It remains to show the result for the case i < k. To this end,

Hk+1 γi = Hk γi + δk (δk^T γi) / (δk^T γk) − Hk γk (γk^T Hk γi) / (γk^T Hk γk) + ck (γk^T Hk γk) qk (qk^T γi)
        = δi + δk (δk^T γi) / (δk^T γk) − Hk γk (γk^T δi) / (γk^T Hk γk)
          + ck (γk^T Hk γk) qk [ (δk^T γi) / (γk^T δk) − (γk^T δi) / (γk^T Hk γk) ],

using the induction hypothesis Hk γi = δi. Now,

δk^T γi = δk^T Q δi = sk si dk^T Q di = 0

by the induction hypothesis and Theorem 5.1. The same argument yields γk^T δi = 0. Hence

Hk+1 γi = δi,

and this completes the proof. Q.E.D.

Example 5.2. Use the BFGS method to locate the minimizer of the objective function

f(x) = (5/2)x1² − 3x1x2 + x2² − x2.

Use the initial point x0 = [0, 0]^T and H0 = I2.
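A numerical sketch of Example 5.2, added as an illustration; it assumes the quadratic above, written as f(x) = (1/2) x^T Q x + b^T x with Q = [[5, −3], [−3, 2]] and b = [0, −1], and uses the exact line search available for quadratics.

    import numpy as np

    Q = np.array([[5.0, -3.0], [-3.0, 2.0]])     # Hessian of the quadratic in Example 5.2
    b = np.array([0.0, -1.0])
    grad = lambda x: Q @ x + b

    x, H = np.zeros(2), np.eye(2)
    for k in range(2):                           # a quadratic in R^2 needs two steps
        g = grad(x)
        d = -H @ g
        s = -(g @ d) / (d @ Q @ d)               # exact line search
        x_new = x + s * d
        delta, gamma = x_new - x, grad(x_new) - g
        Hg = H @ gamma
        q = delta / (delta @ gamma) - Hg / (gamma @ Hg)
        H = (H + np.outer(delta, delta) / (delta @ gamma)
               - np.outer(Hg, Hg) / (gamma @ Hg)
               + (gamma @ Hg) * np.outer(q, q))  # BFGS update: (5.10) with ck = 1
        x = x_new
    print(x, np.linalg.solve(Q, -b))             # iterate and true minimizer, both approximately [3, 5]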
The rank-two correction formulae enjoy all the properties of quasi-Newton methods, including the conjugate directions property. In the case of quadratic functions, the methods converge in n iterations. For larger general nonlinear problems, DFP sometimes tends to get stuck; this phenomenon is attributed to Hk becoming nearly singular. BFGS is reasonably robust and avoids this issue. Thus, BFGS is often more efficient than DFP.
For general nonlinear functions, quasi-Newton algorithms will not usually converge in n iterations. As in the case of conjugate gradient methods, some modifications may be necessary to deal with nonquadratic problems. For example, we may reinitialize the search direction to the negative gradient every few iterations (e.g. every n + 1 iterations) and continue until the algorithm satisfies the stopping criterion.

PROBLEM SET V
1. Show that the following algorithms all satisfy the descent condition:
(a) Steepest descent algorithm
(b) Newton’s method assuming the Hessian is positive definite
(c) Conjugate gradient algorithm
(d) Quasi-Newton algorithm, assuming Hk is positive definite
2. Prove Theorem 5.1: the search directions generated by a quasi-Newton algorithm applied to a quadratic function are Q-conjugate.
3. Apply the symmetric rank-one algorithm to locate the minimizer of the function

f(x) = 4x1² + x2² − 2x1x2.

Use the initial point x0 = [1, 1]^T and H0 = I2 (the 2 × 2 identity matrix).


4. Use the BFGS algorithm to locate the minimizer of

f(x) = 2x1² + 2x1x2 + x2² + x1 − x2.

Use the initial point x0 = [1/2, −3/2]^T and H0 = I2.
5. Use the DFP method to minimize

f(x) = (1/2) x^T Q x + b^T x + log(π),

where

Q = [[5, −3], [−3, 2]],    b = [0, −1]^T.

Take H0 = I2 and start with the initial point x0 = [1/2, −3/2]^T.
6. Consider the nonlinear function f(x) = x1⁴ − 2x2x1² + x2² + x1² − 2x1 + 5. Starting from the initial point x0 = [1, 2]^T, find the minimizer of f using:
(a) The rank one algorithm,
(b) The DFP algorithm,
(c) The BFGS algorithm.
7. Given a function f : R^n → R, consider an algorithm xk+1 = xk − sk Hk gk for finding the minimizer of f, where gk = ∇f(xk) and Hk ∈ R^{n×n} is symmetric. Suppose

Hk = ψ Hk^DFP + (1 − ψ) Hk^BFGS,

where ψ ∈ R is a scalar, and Hk^DFP and Hk^BFGS are the matrices generated by the DFP and BFGS algorithms, respectively.

(a) Show that the above algorithm is a quasi-Newton algorithm. Is the above algorithm a conjugate direction algorithm?

(b) Suppose 0 ≤ ψ ≤ 1. Show that if H0 is positive definite, then Hk is positive definite for all k.
What can you conclude from this about whether or not the algorithm has the descent property?
8. Write a computer program to implement the quasi-Newton (BFGS) algorithm for general functions. Use the backtracking algorithm for the line search. Test the different update formulas on Rosenbrock's function, with an initial point x0 = [−2, 2]^T. For this exercise, reinitialize the update direction to the negative gradient every 6 iterations.
9. Consider the function f(x) = (1/4)x1⁴ + (1/2)x2² − x1x2 + x1 − x2. Apply the DFP algorithm to minimize f with the following initial conditions: (a) x0 = [0, 0]^T; (b) x0 = [1.5, 1]^T. Use H0 = I2. Does the algorithm converge to the same point for the two initial conditions? If not, explain.

