Lecture 05 - Quasi-Newton Methods
LEARNING OUTCOMES
By the completion of this lecture you should be able to:
1. Describe properties of the quasi-Newton methods.
2. Describe the symmetric rank-one quasi-Newton method.
3. Describe the rank-two quasi-Newton method.
References:
• Chapter 6, Jorge Nocedal and Stephen J. Wright, ‘Numerical Optimization’.
• Chapter 1, Dimitri P. Bertsekas, ‘Nonlinear Programming’.
5.1 Introduction
Newton's method is one of the most important algorithms in optimization. Recall that the idea in Newton's method is to locally approximate the objective function at every iteration by a quadratic function, and to obtain the next iterate as the minimizer of this quadratic approximation. The general iteration is

    $x_{k+1} = x_k - s_k \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$,    (5.1)

where the stepsize $s_k$ is chosen to ensure a reduction of the objective function. Note that Newton's method finds the global minimum of a convex quadratic function in one iteration (assuming $s_k = 1$). If the initial point is far from optimal, the search direction may not always be a descent direction; thus, for a general nonlinear objective function, convergence to an optimal solution cannot be guaranteed from an arbitrary initial point. Generally, if Newton's method converges, it converges at a quadratic rate.
The computational drawback of Newton's method is the need to evaluate the Hessian $\nabla^2 f(x_k)$ and to solve the linear system $\nabla^2 f(x_k)\, d_k = -\nabla f(x_k)$ (i.e. to compute the search direction $d_k$). To avoid computing $\left[\nabla^2 f(x_k)\right]^{-1}$, the quasi-Newton methods use an approximation in place of the true inverse. The approximation is updated at each iteration so that it exhibits some of the properties associated with the true inverse $\left[\nabla^2 f(x_k)\right]^{-1}$. The iteration becomes

    $x_{k+1} = x_k - s_k H_k g_k$,    (5.2)

where $H_k$ is the approximation of the inverse Hessian $\left[\nabla^2 f(x_k)\right]^{-1}$, and $g_k = \nabla f(x_k)$. The approximation $H_k$ is required to be positive definite, so that $d_k = -H_k g_k$ is a descent direction (indeed $g_k^T d_k = -g_k^T H_k g_k < 0$ whenever $g_k \neq 0$), which guarantees a decrease in the objective function for a suitably chosen stepsize.
When constructing an approximation to the inverse of the Hessian matrix, we should use only the objective function and gradient values. Thus, if we can find a suitable method of choosing $H_k$, the iteration may be carried out without any evaluation of the Hessian and without any matrix inversion. In this lecture, we discuss different choices for updating the matrix $H_k$.
5.2 Approximating the Inverse Hessian
In this section, we derive an additional condition that the approximation $H_k$ should satisfy; this condition forms the basis of the subsequent discussion of quasi-Newton methods. To begin, suppose that the Hessian matrix $\nabla^2 f(x)$ of the objective function $f$ is constant and independent of $x$. Thus, we are dealing with a quadratic function

    $f(x) = \frac{1}{2} x^T Q x + b^T x + c$,

with Hessian $\nabla^2 f(x) = Q$ for all $x$, where $Q = Q^T$. An important idea is that the change in position, together with the corresponding change in gradient,

    $\delta_k := x_{k+1} - x_k$,   $\gamma_k := g_{k+1} - g_k$,    (5.3)

provide information about the Hessian: since $g_k = Qx_k + b$, subtracting the gradients at consecutive iterates gives

    $\gamma_k = Q\delta_k$.    (5.4)

Equivalently,

    $Q^{-1}\gamma_i = \delta_i$,   $0 \le i \le k$.
We start with a real symmetric positive definite matrix $H_0$, and at the $k$-th iteration we impose that the approximation $H_{k+1}$ of the inverse Hessian satisfy

    $H_{k+1}\gamma_i = \delta_i$,   $0 \le i \le k$.    (5.5)

Equation (5.5) is known as the quasi-Newton condition. Hence, given $n$ linearly independent iteration increments $\delta_0, \delta_1, \ldots, \delta_{n-1}$, we obtain

    $H_n \left[\gamma_0, \gamma_1, \ldots, \gamma_{n-1}\right] = \left[\delta_0, \delta_1, \ldots, \delta_{n-1}\right]$

and

    $Q \left[\delta_0, \delta_1, \ldots, \delta_{n-1}\right] = \left[\gamma_0, \gamma_1, \ldots, \gamma_{n-1}\right]$.

Therefore, if $\left[\gamma_0, \gamma_1, \ldots, \gamma_{n-1}\right]$ is non-singular, then $Q^{-1}$ is determined uniquely after $n$ steps, via

    $Q^{-1} = H_n = \left[\delta_0, \delta_1, \ldots, \delta_{n-1}\right]\left[\gamma_0, \gamma_1, \ldots, \gamma_{n-1}\right]^{-1}$.
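The following short numerical check illustrates this identity on a small quadratic; the particular matrix Q and the chosen increments are arbitrary assumptions made only for the sake of the sketch.

    import numpy as np

    # Check that Q^{-1} = [delta_0 ... delta_{n-1}] [gamma_0 ... gamma_{n-1}]^{-1}
    # for a quadratic with Hessian Q.  Q and the increments are arbitrary choices.
    Q = np.array([[2.0, 0.0],
                  [0.0, 1.0]])
    deltas = np.array([[1.0, 0.3],
                       [0.0, 1.0]])            # columns are delta_0, delta_1 (independent)
    gammas = Q @ deltas                        # gamma_i = Q delta_i, equation (5.4)
    H_n = deltas @ np.linalg.inv(gammas)       # [delta_i][gamma_i]^{-1}
    print(np.allclose(H_n, np.linalg.inv(Q)))  # True: H_n equals Q^{-1}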
Algorithm 5.1 General quasi-Newton Algorithm
1: Set the initial point x0 , the initial approximate inverse Hessian H0 , and k ← 0.
2: while not converged do
3: Compute the gradient gk = ∇f (xk ) and set the search direction dk = −Hk gk .
4: Determine the stepsize sk , using exact or inexact line search.
5: Update the new iterate xk+1 = xk + sk dk .
6: Compute the updated inverse Hessian approximation Hk+1 .
7: Set k ← k + 1.
8: end while
9: return x∗ = xk
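A minimal Python sketch of Algorithm 5.1 with a backtracking (Armijo) line search is given below. The function name, the Armijo constants, and the idea of passing the update formula as a callable are illustrative assumptions, not part of the lecture.

    import numpy as np

    def quasi_newton(f, grad, x0, update, tol=1e-8, max_iter=200):
        """General quasi-Newton loop (Algorithm 5.1) with a backtracking line search."""
        x = np.asarray(x0, dtype=float)
        H = np.eye(x.size)                    # H_0: initial inverse Hessian approximation
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:       # stopping criterion
                break
            d = -H @ g                        # search direction d_k = -H_k g_k
            s = 1.0                           # backtracking (Armijo) line search for s_k
            while s > 1e-12 and f(x + s * d) > f(x) + 1e-4 * s * (g @ d):
                s *= 0.5
            x_new = x + s * d
            delta, gamma = x_new - x, grad(x_new) - g   # delta_k and gamma_k
            H = update(H, delta, gamma)       # inverse Hessian update, e.g. (5.8) or (5.10)
            x = x_new
        return x

Any of the update formulae discussed below (rank-one, DFP, BFGS) can be passed as the update argument.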
It turns out that in the case of a quadratic function, the quasi-Newton methods are also conjugate direction methods, as stated in the following theorem.

Theorem 5.1. Consider a quasi-Newton algorithm applied to a quadratic function with Hessian $Q = Q^T$, such that for $0 \le k < n - 1$,

    $H_{k+1}\gamma_i = \delta_i$,   $0 \le i \le k$,

where $H_{k+1} = H_{k+1}^T$. If $s_i \neq 0$, $0 \le i \le k + 1$, then $d_0, d_1, \ldots, d_{k+1}$ are $Q$-conjugate.
An immediate consequence of Theorem 5.1 is that a quasi-Newton method with exact line search converges in at most $n$ steps for a quadratic function of $n$ variables. Note that the quasi-Newton condition (5.5) places no restriction on how the matrices $H_k$ are determined; in particular, these matrices are not unique. Thus, we have some freedom in the way we compute $H_k$. In the methods we describe, we compute $H_{k+1}$ by adding a correction $\Delta H_k$ to $H_k$, i.e. $H_{k+1} = H_k + \Delta H_k$. The methods differ in the way the correction term is chosen; essentially, they are classified according to rank-one and rank-two correction formulae.
5.3 The Symmetric Rank-One Correction Formula

In the rank-one correction, the update has the form

    $H_{k+1} = H_k + a_k u_k u_k^T$,    (5.6)

where $a_k \in \mathbb{R}$ and $u_k \in \mathbb{R}^n$. Imposing the quasi-Newton condition $H_{k+1}\gamma_k = \delta_k$ gives $\delta_k - H_k\gamma_k = a_k u_k u_k^T\gamma_k$. Note that $u_k^T\gamma_k$ is a scalar. Thus

    $\delta_k - H_k\gamma_k = a_k \left(u_k^T\gamma_k\right) u_k$,    (5.7)

and hence

    $u_k = \frac{\delta_k - H_k\gamma_k}{a_k \left(u_k^T\gamma_k\right)}$.
We can now write the last term of equation (5.6) as

    $a_k u_k u_k^T = \frac{\left(\delta_k - H_k\gamma_k\right)\left(\delta_k - H_k\gamma_k\right)^T}{a_k \left(u_k^T\gamma_k\right)^2}$.

The next step is to express the denominator on the right-hand side of the above equation as a function of the known quantities $H_k$, $\gamma_k$ and $\delta_k$. To accomplish this, premultiply equation (5.7) by $\gamma_k^T$ to obtain

    $\gamma_k^T\left(\delta_k - H_k\gamma_k\right) = a_k\, \gamma_k^T u_k u_k^T \gamma_k$.

Note that $a_k$ is a scalar, and so is $\gamma_k^T u_k = u_k^T\gamma_k$. Thus

    $\gamma_k^T\left(\delta_k - H_k\gamma_k\right) = a_k \left(u_k^T\gamma_k\right)^2$.

Taking the above relation into account yields the rank-one update

    $H_{k+1} = H_k + \frac{\left(\delta_k - H_k\gamma_k\right)\left(\delta_k - H_k\gamma_k\right)^T}{\gamma_k^T\left(\delta_k - H_k\gamma_k\right)}$.    (5.8)
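A direct transcription of the rank-one update (5.8) into Python is sketched below; the small safeguard threshold is an assumption added to avoid dividing by a near-zero denominator, a difficulty discussed after Example 5.1.

    import numpy as np

    def sr1_update(H, delta, gamma, eps=1e-8):
        """Symmetric rank-one update (5.8) of the inverse Hessian approximation."""
        v = delta - H @ gamma               # delta_k - H_k gamma_k
        denom = gamma @ v                   # gamma_k^T (delta_k - H_k gamma_k)
        if abs(denom) < eps * np.linalg.norm(gamma) * np.linalg.norm(v):
            return H                        # skip the update when it is ill conditioned
        return H + np.outer(v, v) / denom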
The rank-one algorithm is constructed to satisfy the equation $H_{k+1}\gamma_k = \delta_k$. However, the requirement is that it should satisfy the quasi-Newton condition

    $H_{k+1}\gamma_i = \delta_i$,   for $i = 0, 1, \ldots, k$.

It turns out that this condition is, in fact, satisfied automatically, as stated in the following theorem.

Theorem 5.2. For the rank-one algorithm applied to a quadratic function with Hessian $Q = Q^T$, we have

    $H_{k+1}\gamma_i = \delta_i$,   $0 \le i \le k$.
Proof: We prove the result by induction. For $k = 0$, we have

    $H_1\gamma_0 = H_0\gamma_0 + \frac{\left(\delta_0 - H_0\gamma_0\right)\left(\delta_0 - H_0\gamma_0\right)^T}{\gamma_0^T\left(\delta_0 - H_0\gamma_0\right)}\,\gamma_0 = H_0\gamma_0 + \delta_0 - H_0\gamma_0 = \delta_0$.
Suppose the theorem is true for $k - 1$; that is, $H_k\gamma_i = \delta_i$ for $i < k$. We now show the theorem is true for $k$. Our construction of the correction term ensures that $H_{k+1}\gamma_k = \delta_k$, so it remains to show the result for $i < k$. We have

    $H_{k+1}\gamma_i = H_k\gamma_i + \frac{\left(\delta_k - H_k\gamma_k\right)\left(\delta_k - H_k\gamma_k\right)^T}{\gamma_k^T\left(\delta_k - H_k\gamma_k\right)}\,\gamma_i$.

By the induction hypothesis, $H_k\gamma_i = \delta_i$. To complete the proof it is enough to show that the second term on the right-hand side of the above equation is equal to zero. To this end, consider

    $\left(\delta_k - H_k\gamma_k\right)^T\gamma_i = \delta_k^T\gamma_i - \gamma_k^T H_k\gamma_i$
    $= \delta_k^T\gamma_i - \gamma_k^T\delta_i$    (by the induction hypothesis)
    $= \delta_k^T\gamma_i - \delta_k^T Q\delta_i$    (by equation (5.4))
    $= \delta_k^T\gamma_i - \delta_k^T\gamma_i$    (by equation (5.4))
    $= 0$.

Hence $H_{k+1}\gamma_i = \delta_i$ for $0 \le i \le k$, which completes the proof. Q.E.D.
Example 5.1. Let $f(x) = x_1^2 + \frac{1}{2}x_2^2 + 3$. Apply the rank-one correction algorithm to minimize $f$. Use the initial point $x_0 = [1, 2]^T$ and $H_0 = I_2$.
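The notes do not include a worked run; the following Python sketch carries out the iterations numerically, using the rank-one update (5.8) and the closed-form exact stepsize for a quadratic ($s_k = -g_k^T d_k / d_k^T Q d_k$, a standard fact assumed here).

    import numpy as np

    # Example 5.1: f(x) = x_1^2 + 0.5*x_2^2 + 3, so Q = diag(2, 1) and g(x) = Q x.
    Q = np.diag([2.0, 1.0])
    grad = lambda x: Q @ x

    x = np.array([1.0, 2.0])                   # x_0
    H = np.eye(2)                              # H_0 = I_2
    for k in range(2):
        g = grad(x)
        d = -H @ g                             # search direction d_k = -H_k g_k
        s = -(g @ d) / (d @ Q @ d)             # exact line search for a quadratic
        x_new = x + s * d
        delta, gamma = x_new - x, grad(x_new) - g
        v = delta - H @ gamma
        if abs(gamma @ v) > 1e-12:             # guard against a vanishing denominator
            H = H + np.outer(v, v) / (gamma @ v)   # rank-one update (5.8)
        x = x_new

    print(x)                                   # approximately [0, 0], the minimizer
    print(np.allclose(H, np.linalg.inv(Q)))    # True: the final H equals Q^{-1}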
The symmetric rank-one algorithm works well in the case of quadratic functions, for which the Hessian matrix is constant. Unfortunately, for nonquadratic problems the formula (5.8) has a few drawbacks. The approximation $H_{k+1}$ may not always be positive definite, even when $H_k$ is positive definite, and thus $d_{k+1}$ may not be a descent direction. Furthermore, numerical instabilities appear whenever $\gamma_k^T\left(\delta_k - H_k\gamma_k\right)$ is close to zero. Fortunately, alternative algorithms have been developed for updating $H_k$. In particular, if we use a rank-two update, the matrix $H_k$ is guaranteed to be positive definite for all $k$.
5.4 The Rank-Two Correction Formula

A rank-two correction has the general form

    $H_{k+1} = H_k + a_k u_k u_k^T + b_k v_k v_k^T$.    (5.9)

Instead of determining expressions for $a_k$, $b_k$, $u_k$, $v_k$ as in the rank-one correction, we state the update formula directly and show that it satisfies the desired properties: the quasi-Newton condition and positive definiteness. The most widely used rank-two quasi-Newton methods form the Broyden class, whose update rule is specified as

    $H_{k+1} = H_k + \frac{\delta_k\delta_k^T}{\delta_k^T\gamma_k} - \frac{\left(H_k\gamma_k\right)\left(H_k\gamma_k\right)^T}{\gamma_k^T H_k\gamma_k} + c_k \left(\gamma_k^T H_k\gamma_k\right) q_k q_k^T$,    (5.10)

where

    $q_k = \frac{\delta_k}{\delta_k^T\gamma_k} - \frac{H_k\gamma_k}{\gamma_k^T H_k\gamma_k}$,
and the scalars $c_k$, which parametrize the method, must satisfy $0 \le c_k \le 1$ for all $k$. If $c_k = 0$ for all $k$, we obtain the Davidon-Fletcher-Powell (DFP) method, which is historically the first quasi-Newton method. If $c_k = 1$ for all $k$, we obtain the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, for which there is substantial evidence that it is the best general-purpose quasi-Newton method. The general rank-two algorithm is Algorithm 5.1 with $H_{k+1}$ computed from (5.10).
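A Python sketch of the Broyden-class update (5.10) follows; the function name and the choice to pass the scalar $c_k$ as an argument are illustrative assumptions.

    import numpy as np

    def broyden_update(H, delta, gamma, c=1.0):
        """Broyden-class inverse Hessian update (5.10): c=0 gives DFP, c=1 gives BFGS."""
        Hg = H @ gamma
        dg = delta @ gamma                  # delta_k^T gamma_k (positive when (5.11) holds)
        gHg = gamma @ Hg                    # gamma_k^T H_k gamma_k
        q = delta / dg - Hg / gHg
        return (H
                + np.outer(delta, delta) / dg
                - np.outer(Hg, Hg) / gHg
                + c * gHg * np.outer(q, q))

This function can be passed as the update argument of the quasi-Newton loop sketched after Algorithm 5.1.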
We now show that the matrices $H_k$ generated by (5.10) indeed define a quasi-Newton method, in the sense that when applied to the quadratic problem we have $H_{k+1}\gamma_i = \delta_i$, $0 \le i \le k$. We first show that, under a mild assumption, the matrices $H_k$ generated by (5.10) are positive definite. This is a very important property, since it guarantees that $d_k$ is a descent direction.
Theorem 5.3. If $H_k$ is positive definite, and the stepsize $s_k$ is chosen such that $x_{k+1}$ satisfies

    $g_k^T d_k < g_{k+1}^T d_k$,    (5.11)

then $H_{k+1}$ given by equation (5.10) is positive definite.

Proof: We first note that equation (5.11) implies that $s_k \neq 0$, $\gamma_k \neq 0$, and

    $\delta_k^T\gamma_k = s_k\, d_k^T\left(g_{k+1} - g_k\right) > 0$.    (5.12)

Thus all denominator terms in equation (5.10) are nonzero, and $H_{k+1}$ is well defined.
Now, for any vector $0 \neq z \in \mathbb{R}^n$, we have

    $z^T H_{k+1} z = z^T H_k z + \frac{\left(z^T\delta_k\right)^2}{\delta_k^T\gamma_k} - \frac{\left(\gamma_k^T H_k z\right)^2}{\gamma_k^T H_k\gamma_k} + c_k \left(\gamma_k^T H_k\gamma_k\right)\left(q_k^T z\right)^2$.    (5.13)

We define $a := H_k^{1/2} z$ and $b := H_k^{1/2}\gamma_k$, where $H_k = H_k^{1/2} H_k^{1/2}$. Note that because $H_k$ is positive definite, its square root is well defined. Using this definition of $a$ and $b$, we obtain

    $z^T H_k z = z^T H_k^{1/2} H_k^{1/2} z = a^T a$,
    $z^T H_k\gamma_k = z^T H_k^{1/2} H_k^{1/2}\gamma_k = a^T b$,  and
    $\gamma_k^T H_k\gamma_k = \gamma_k^T H_k^{1/2} H_k^{1/2}\gamma_k = b^T b$.

Substituting these into (5.13) yields

    $z^T H_{k+1} z = \frac{\left(a^T a\right)\left(b^T b\right) - \left(a^T b\right)^2}{b^T b} + \frac{\left(z^T\delta_k\right)^2}{\delta_k^T\gamma_k} + c_k \left(\gamma_k^T H_k\gamma_k\right)\left(q_k^T z\right)^2$.
All the terms on the right-hand side of the above equation are nonnegative: the first term is nonnegative because of the Cauchy-Schwarz inequality, the second term is nonnegative because of equation (5.12), and the last term is nonnegative because $H_k$ is positive definite. In order to show that $z^T H_{k+1} z > 0$, we need to demonstrate that we cannot simultaneously have

    $\|a\|^2\|b\|^2 = \left(a^T b\right)^2$  and  $z^T\delta_k = 0$.

Indeed, if $\|a\|^2\|b\|^2 = \left(a^T b\right)^2$, we must have $a = \lambda b$, or equivalently $z = \lambda\gamma_k$. Since $z \neq 0$, it follows that $\lambda \neq 0$, so if $z^T\delta_k = 0$, we must have $\gamma_k^T\delta_k = 0$, which is impossible by equation (5.12). This completes the proof. Q.E.D.
Theorem 5.4. Let $\{x_k\}$, $\{d_k\}$, and $\{H_k\}$ be the sequences generated by the rank-two quasi-Newton Algorithm 5.1, applied to a quadratic function with Hessian $Q = Q^T$. Then $H_{k+1}\gamma_i = \delta_i$, $0 \le i \le k$.
Proof: Note that the last term of equation (5.10) vanishes when multiplied by $\gamma_k$. In particular,

    $q_k^T\gamma_k = \frac{\delta_k^T\gamma_k}{\delta_k^T\gamma_k} - \frac{\gamma_k^T H_k\gamma_k}{\gamma_k^T H_k\gamma_k} = 0$.
We prove the theorem by induction. For $k = 0$, we have

    $H_1\gamma_0 = H_0\gamma_0 + \frac{\delta_0\delta_0^T}{\delta_0^T\gamma_0}\gamma_0 - \frac{\left(H_0\gamma_0\right)\left(H_0\gamma_0\right)^T}{\gamma_0^T H_0\gamma_0}\gamma_0 + c_0 \left(\gamma_0^T H_0\gamma_0\right) q_0 q_0^T\gamma_0$
    $= H_0\gamma_0 + \delta_0\left(\frac{\delta_0^T\gamma_0}{\delta_0^T\gamma_0}\right) - H_0\gamma_0\left(\frac{\gamma_0^T H_0\gamma_0}{\gamma_0^T H_0\gamma_0}\right)$
    $= H_0\gamma_0 + \delta_0 - H_0\gamma_0 = \delta_0$.
Assume the result is true for $k - 1$; that is, $H_k\gamma_i = \delta_i$, $0 \le i \le k - 1$. We now show that $H_{k+1}\gamma_i = \delta_i$, $0 \le i \le k$. First consider $i = k$. Using $q_k^T\gamma_k = 0$, we have

    $H_{k+1}\gamma_k = H_k\gamma_k + \frac{\delta_k\delta_k^T}{\delta_k^T\gamma_k}\gamma_k - \frac{\left(H_k\gamma_k\right)\left(H_k\gamma_k\right)^T}{\gamma_k^T H_k\gamma_k}\gamma_k + c_k \left(\gamma_k^T H_k\gamma_k\right) q_k q_k^T\gamma_k$
    $= H_k\gamma_k + \delta_k\left(\frac{\delta_k^T\gamma_k}{\delta_k^T\gamma_k}\right) - H_k\gamma_k\left(\frac{\gamma_k^T H_k\gamma_k}{\gamma_k^T H_k\gamma_k}\right)$
    $= \delta_k$.
It remains to show the result for the case $i < k$. To this end, using the induction hypothesis $H_k\gamma_i = \delta_i$,

    $H_{k+1}\gamma_i = H_k\gamma_i + \frac{\delta_k\delta_k^T}{\delta_k^T\gamma_k}\gamma_i - \frac{\left(H_k\gamma_k\right)\left(H_k\gamma_k\right)^T}{\gamma_k^T H_k\gamma_k}\gamma_i + c_k \left(\gamma_k^T H_k\gamma_k\right) q_k q_k^T\gamma_i$
    $= \delta_i + \frac{\delta_k^T\gamma_i}{\delta_k^T\gamma_k}\,\delta_k - \frac{\gamma_k^T\delta_i}{\gamma_k^T H_k\gamma_k}\,H_k\gamma_k + c_k \left(\gamma_k^T H_k\gamma_k\right) q_k \left(\frac{\delta_k^T\gamma_i}{\delta_k^T\gamma_k} - \frac{\gamma_k^T\delta_i}{\gamma_k^T H_k\gamma_k}\right)$.
Now,

    $\delta_k^T\gamma_i = \delta_k^T Q\delta_i = s_k s_i\, d_k^T Q\, d_i = 0$

by the induction hypothesis and Theorem 5.1. The same argument yields $\gamma_k^T\delta_i = 0$. Hence

    $H_{k+1}\gamma_i = \delta_i$,

and this completes the proof. Q.E.D.
Example 5.2. Use the BFGS method to locate the minimizer of the objective function

    $f(x) = x_1^2 - 3x_1x_2 + \frac{5}{2}x_2^2 - x_2$.

Use the initial point $x_0 = [0, 0]^T$ and $H_0 = I_2$.
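A numerical sketch of this example is given below, using the Broyden-class update with $c_k = 1$ (BFGS) and an exact line search (valid here because the objective is quadratic). Assuming the objective reads as reconstructed above, so that Q = [[2, -3], [-3, 5]] and b = [0, -1], the iterates reach the minimizer $[3, 2]^T$ in two steps and the final approximation equals $Q^{-1}$.

    import numpy as np

    # Example 5.2 (as reconstructed): f(x) = x_1^2 - 3 x_1 x_2 + (5/2) x_2^2 - x_2
    Q = np.array([[2.0, -3.0],
                  [-3.0, 5.0]])
    b = np.array([0.0, -1.0])
    grad = lambda x: Q @ x + b

    x = np.zeros(2)                            # x_0 = [0, 0]^T
    H = np.eye(2)                              # H_0 = I_2
    for k in range(2):
        g = grad(x)
        d = -H @ g                             # search direction d_k = -H_k g_k
        s = -(g @ d) / (d @ Q @ d)             # exact line search for a quadratic
        x_new = x + s * d
        delta, gamma = x_new - x, grad(x_new) - g
        Hg, dg, gHg = H @ gamma, delta @ gamma, gamma @ H @ gamma
        q = delta / dg - Hg / gHg
        # BFGS update: Broyden class (5.10) with c_k = 1
        H = H + np.outer(delta, delta) / dg - np.outer(Hg, Hg) / gHg + gHg * np.outer(q, q)
        x = x_new

    print(x)                                   # approximately [3, 2], the minimizer
    print(np.allclose(H, np.linalg.inv(Q)))    # True: H_2 = Q^{-1}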
The rank-two correction formulae enjoy all the properties of quasi-Newton methods, including the conjugate directions property; in the case of quadratic functions, the methods converge in $n$ iterations. On larger general nonlinear problems, DFP sometimes tends to get stuck, a phenomenon attributed to $H_k$ becoming nearly singular. BFGS is reasonably robust and avoids this issue; thus, BFGS is often more efficient than DFP.
For general nonlinear functions, the quasi-Newton algorithms will not usually converge in $n$ iterations. As in the case of conjugate gradient methods, some modifications may be necessary to deal with nonquadratic problems. For example, we may reinitialize the direction vector after every few iterations (e.g. every $n + 1$ iterations) and continue until the algorithm satisfies the stopping criterion.
PROBLEM SET V
1. Show that the following algorithms all satisfy the descent condition:
(a) Steepest descent algorithm
(b) Newton’s method assuming the Hessian is positive definite
(c) Conjugate gradient algorithm
(d) Quasi-Newton algorithm, assuming $H_k$ is positive definite
2. Prove Theorem 5.1: the search directions generated by a quasi-Newton algorithm applied to a quadratic function are $Q$-conjugate.
3. Apply the symmetric rank-one algorithm to locate the minimizer of the function
$H_k = \psi H_k^{DFP} + (1 - \psi) H_k^{BFGS}$,
where $\psi \in \mathbb{R}$ is a scalar, and $H_k^{DFP}$ and $H_k^{BFGS}$ are the matrices generated by the DFP and BFGS algorithms, respectively.
(a) Show that the above algorithm is a quasi-Newton algorithm. Is the above algorithm a conjugate
direction algorithm?
(b) Suppose 0 ≤ ψ ≤ 1. Show that if H0 is positive definite, then Hk is positive definite for all k.
What can you conclude from this about whether or not the algorithm has the descent property?
8. Write a computer program to implement the quasi-Newton (BFGS) algorithm for general functions. Use the backtracking algorithm for the line search. Test your implementation on Rosenbrock's function, with an initial point $x_0 = [-2, 2]^T$. For this exercise, reinitialize the update direction to the negative gradient every 6 iterations.
9. Consider the function $f(x) = \frac{1}{4}x_1^4 + \frac{1}{2}x_2^2 - x_1x_2 + x_1 - x_2$. Apply the DFP algorithm to minimize $f$ with the following initial conditions: (a) $x_0 = [0, 0]^T$; (b) $x_0 = [1.5, 1]^T$. Use $H_0 = I_2$. Does the algorithm converge to the same point for the two initial conditions? If not, explain.