
The Sequential Quadratic Programming Method

Roger Fletcher

1 Introduction

Sequential (or Successive) Quadratic Programming (SQP) is a technique for the solution of Nonlinear Programming (NLP) problems. It is, as we shall see, an idealized concept, permitting and indeed necessitating many variations and modifications before becoming available as part of a reliable and efficient production computer code. In this monograph we trace the evolution of the SQP method through some important special cases of nonlinear programming, up to the most general form of problem. To fully understand these developments it is important to have a thorough grasp of the underlying theoretical concepts, particularly in regard to optimality conditions. This monograph therefore includes a simple yet rigorous presentation of optimality conditions which nevertheless covers most cases of interest.
A nonlinear programming problem is the minimization of a nonlinear objective function f(x), x ∈ IR^n, of n variables, subject to equation and/or inequality constraints involving a vector of nonlinear functions c(x). A basic statement of the problem, useful for didactic purposes, is

    minimize_{x ∈ IR^n}  f(x)
    subject to  c_i(x) ≥ 0,  i = 1, 2, . . . , m.        (1.1)

In this formulation, equation constraints must be encoded as two opposed inequality constraints, that is, c(x) = 0 is replaced by c(x) ≥ 0 and −c(x) ≥ 0, which is usually not convenient. Thus in practice a more detailed formulation is appropriate, admitting also equations, linear constraints and simple bounds. One way to do this is to add slack variables to the constraints, which, together with simple bounds on the natural variables, gives rise to

    minimize_{x ∈ IR^n}  f(x)
    subject to  A^T x = b
                c(x) = 0                                 (1.2)
                l ≤ x ≤ u,

in which either u_i or −l_i can be set to a very large number if no bound is present. Alternatively one might have

    minimize_{x ∈ IR^n}  f(x)
    subject to  l ≤ ( x ; A^T x ; c(x) ) ≤ u             (1.3)

in which the user can specify an equation by having l_i = u_i.

R. Fletcher, Department of Mathematics, University of Dundee, Dundee DD1 4HN, e-mail: fl[email protected]
G. Di Pillo et al. (eds.), Nonlinear Optimization, Lecture Notes in Mathematics 1989, DOI 10.1007/978-3-642-11339-0_3, © Springer-Verlag Berlin Heidelberg 2010


There are a number of special cases of NLP which are important in their
own right, and for which there are special cases of the SQP method for
their solution. These include systems of nonlinear equations, unconstrained
optimization, and linearly constrained optimization. Understanding the the-
oretical and practical issues associated with these special cases is important
when it comes to dealing with the general NLP problem as specified above.
A common theme in all these cases is that there are ‘linear’ problems which
can be solved in a finite number of steps (ignoring the effects of round-off
error), nonlinear problems which can usually only be solved approximately by
iteration, and Newton methods for solving nonlinear problems by the succes-
sive solution of linear problems, obtained by making Taylor series expansions
about a current iterate. This context is underpinned by a famous theorem
of Dennis and Moré [10] which states, subject to a regularity condition, that
superlinear convergence occurs if and only if an iterative method is asymptot-
ically equivalent to a Newton method. Loosely speaking, this tells us that we
can only expect rapid convergence if our iterative method is closely related to
a Newton method. The SQP method is one realization of a Newton method.
Generally x ∈ IR^n will denote the variables (unknowns) in the problem, x∗ denotes a (local) solution, x^(k), k = 1, 2, . . . are iterates in some iterative method, and g^(k) for example denotes the function g(x) evaluated at x^(k). Likewise g∗ would denote g(x) evaluated at x∗.
that solutions to problems may not exist, or on the other hand, there may
exist multiple or non-unique solutions. Generally algorithms are only able
to find local minima, and indeed guaranteeing to find a global (best local)
minimizer is impractical for problems of any significant size. A more extensive
and detailed coverage of topics in this monograph can be found for example
in Fletcher [13].
2 Newton Methods and Local Optimality

In this and subsequent sections we trace the development of Newton methods from the simplest case of nonlinear equations, through to the general case of nonlinear programming with equations and inequalities.

2.1 Systems of n Simultaneous Equations in n Unknowns

In this case the ‘linear’ problem referred to above is the well known system of linear equations

    A^T x = b                                            (2.1)

in which A is a given n × n matrix of coefficients and b is a given vector of right hand sides. For all except the very largest problems, the system is readily solved by computing factors P A^T = LU using elimination and partial pivoting. If A is a very sparse matrix, techniques are available to enable large systems to be solved. A less well known approach is to compute implicit factors LP A^T = U (see for example Fletcher [14]) which can be advantageous in certain contexts. The system is regular when A is nonsingular, in which case a unique solution exists. Otherwise there may be non-unique solutions, or more usually no solutions.
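As a concrete illustration of the factor-and-solve approach for (2.1), standard libraries perform the P A^T = LU factorization internally; a minimal sketch with NumPy, where the particular A and b are made-up illustrative data:

```python
import numpy as np

# Made-up 3x3 instance of the linear system A^T x = b in (2.1).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# numpy.linalg.solve factorizes by elimination with partial pivoting
# (P A^T = L U) and then performs two triangular solves.
x = np.linalg.solve(A.T, b)

# A is nonsingular here, so the computed solution is the unique one.
assert np.allclose(A.T @ x, b)
```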
The corresponding nonlinear problem is the system of nonlinear equations

    r(x) = 0                                             (2.2)

in which r(x) is a given vector of n nonlinear functions. We assume that r(x) is continuously differentiable and denote the n × n Jacobian matrix by A = ∇r^T, that is, A is the matrix whose columns are the gradients of the component functions in r. If r = A^T x − b is in fact linear then its Jacobian matrix is A, which accounts for the use of A^T rather than the more usual A in (2.1). A Taylor series about the current iterate x^(k) gives

    r(x^(k) + d) = r^(k) + A^(k)T d + o(‖d‖).            (2.3)

Truncating the negligible term in d and setting the left hand side to zero yields the system of linear equations

    A^(k)T d = −r^(k).                                   (2.4)

This system forms the basis of the Newton-Raphson (NR) method, in which (2.4) is solved for a displacement d = d^(k), and x^(k+1) = x^(k) + d^(k) becomes the next iterate.
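A minimal sketch of the Newton-Raphson iteration (2.4) follows; the test system r(x) (the unit circle intersected with the line x1 = x2) and the starting point are made-up illustrations, not taken from the text:

```python
import numpy as np

def r(x):
    # Made-up test system: intersection of the unit circle with the line x1 = x2.
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def jacobian_T(x):
    # A^T in the text's notation: row i holds the gradient of r_i.
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [1.0, -1.0]])

def newton_raphson(x, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        rk = r(x)
        if np.linalg.norm(rk) <= tol:
            break
        d = np.linalg.solve(jacobian_T(x), -rk)  # solve (2.4): A^(k)T d = -r^(k)
        x = x + d
    return x

x_star = newton_raphson(np.array([1.0, 0.5]))
```

With a starting point this close to the root the iteration converges in a handful of steps to (1/√2, 1/√2); remote starting points carry no such guarantee, as discussed below.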
A solution x∗ is said to be regular iff A∗ is nonsingular. Assuming that A is Lipschitz continuous in a neighbourhood of x∗, the Newton-Raphson method converges locally (that is, if some x^(k) is sufficiently close to x∗) and the order of convergence is second order, that is

    ‖x^(k+1) − x∗‖ = O(‖x^(k) − x∗‖²).

We prove these properties in the next subsection. This property is very desirable, indicating as it does that, in the limit, the number of significant figures of accuracy doubles on each iteration. Moreover a theorem of Dennis and Moré [10] indicates that if x∗ is regular then any sequence {x^(k)} converging to x∗ exhibits superlinear convergence (that is, ‖x^(k+1) − x∗‖ = o(‖x^(k) − x∗‖)) if and only if the displacements converge asymptotically to those of the Newton-Raphson method, in the sense that

    d^(k) = d_NR^(k) + o(‖d_NR^(k)‖).

We shall see in what follows that many successful methods of optimization (Newton methods) are derived from the Newton-Raphson method, including the SQP method, which indicates the fundamental importance of the Dennis and Moré result.
An important class of Newton methods are the so-called quasi-Newton
methods in which A(k) is approximated by a matrix B (k) . Initially B (k) is
arbitrary, and is updated after each iteration to take advantage of information
gained about how r(x) behaves.
Unfortunately this favourable theoretical profile of the NR method is only
valid locally in a neighbourhood of a regular solution. If x(1) is remote from x∗
then there is no guarantee of convergence, and indeed A(k) can be singular in
which case (2.4) usually has no solution and the method aborts. Modifications
of the NR method to promote global convergence (that is convergence when
x(1) is remote from x∗ ) are a subject of much active research interest, some
of which is described later in this monograph.

2.2 Local Convergence of the Newton-Raphson Method

Let x∗ be a solution of (2.2). Then the following theorem holds.


Theorem 2.2.1. Assume that r(x) is continuously differentiable in a neigh-
bourhood of x∗ , and that A∗ is nonsingular. Then there exists a neighbourhood
of x∗ such that if any iterate x(k) is within this neighbourhood, then the
Newton-Raphson method converges to x∗ and the order of convergence is
superlinear.
Proof. For x in a neighbourhood of x∗, denote r = r(x), e = x − x∗ and Ã = ∫₀¹ A(x − θe) dθ. The conditions of the theorem permit an integral form of the Taylor series about x,

    0 = r∗ = r − Ã^T e,                                  (2.5)

to be used. As e → 0, so Ã → A (A = A(x)) and hence

    ‖r − A^T e‖/‖e‖ → 0.                                 (2.6)

Because A∗ is nonsingular there exists a neighbourhood N(x∗) in which both ‖A^{−T}‖ ≤ β is bounded, and

    ‖r − A^T e‖ ≤ α‖e‖/β                                 (2.7)

for some fixed α ∈ (0, 1), by virtue of (2.6).


Let x^(k) ∈ N(x∗) and denote the error e^(k) = x^(k) − x∗. Then

    e^(k+1) = e^(k) + d^(k)                              (2.8)
            = e^(k) − (A^(k)T)^{−1} r^(k)                (2.9)
            = −(A^(k)T)^{−1} (r^(k) − A^(k)T e^(k)).     (2.10)

It follows from (2.7) that ‖e^(k+1)‖ ≤ α‖e^(k)‖ and hence that x^(k+1) ∈ N(x∗). By induction, e^(k) → 0, and hence x^(k) → x∗. It also follows from (2.10) and (2.6) that ‖e^(k+1)‖/‖e^(k)‖ → 0, showing that the order of convergence is superlinear. □

Corollary 2.2.1. If, in addition, the Jacobian matrix A(x) satisfies a Lipschitz condition, then the order of convergence is second order.

Proof. In this case we can write the Taylor series (2.3) in the form

    r(x^(k) + d) = r^(k) + A^(k)T d + O(‖d‖²).           (2.11)

Now we can replace (2.6) by the stronger result that ‖r − A^T e‖ = O(‖e‖²). Following a similar argument to the above, we then deduce that

    ‖e^(k+1)‖ = O(‖e^(k)‖²),                             (2.12)

which is the definition of second order convergence. □

2.3 Unconstrained Optimization

In this case the ‘linear’ problem is that of minimizing a quadratic function of n variables

    q(x) = ½ x^T G x + h^T x,                            (2.13)

where the Hessian matrix of second derivatives G is symmetric and positive definite. The corresponding nonlinear problem is to find a local minimizing point x∗ of a given non-quadratic function f(x). We denote the gradient (column) vector of first partial derivatives of f(x) by g(x) = ∇f(x) and the Hessian by G(x) = ∇g(x)^T. If x∗ is a local minimizer of f(x) then clearly it is a minimizer along any line

    x(α) = x∗ + αs,   s ≠ 0,

through x∗ . It follows from this and the chain rule that the slope

df (x(α))/dα|α=0 = g∗T s = 0

for any s. Consequently a necessary condition for x∗ to be a local minimizer


of f (x) is that g∗ = 0. Points x which satisfy g(x) = 0 are referred to
as stationary points, and include saddle points and maximizers as well as
minimizers. Similarly another necessary condition for a minimizer is that the
second derivative of f (x(α)) at α = 0 is non-negative, that is sT G∗ s ≥ 0 for
all s, which is the condition that G∗ is positive semi-definite. On the other
hand, if both g∗ = 0 and G∗ is positive definite, then this is a sufficient
condition for x∗ to be a local minimizer.
The stationary point condition g(x) = 0 is a system of nonlinear equations
that can be solved by the NR method to find a stationary point of f (x). The
Jacobian ∇g(x)T is just the Hessian G(x). For minimization we are inter-
ested in the case that G is positive definite. In the case of a quadratic function
(2.13), g(x) = Gx+h and we solve the system Gx = −h. In the non-quadratic
case, the appropriate regularity condition is that G∗ is positive definite, and
the NR iteration formula (2.4) becomes G(k) d = −g(k) . These linear sys-
tems are most efficiently solved using Choleski factors G = LLT when G is
positive definite. We refer to this method as the Newton method for mini-
mization. The method inherits all the favourable local properties of the NR
method, described in the previous subsection. Likewise, there is also the pos-
sibility that when the initial point is remote from a minimizer, the method
might fail to converge, and so must be modified to promote global conver-
gence. There is an additional issue that not only might G(k) be singular,
whence the method aborts, but also G(k) might become indefinite, remote
from the solution. In this case the local quadratic approximating function q(d) = ½ d^T G^(k) d + g^(k)T d no longer has a minimizer, and the resulting displacement d^(k) obtained by finding the stationary point of q(d) is unlikely
to be useful. Thus the most effective way of promoting global convergence is


still a subject of some interest. In this respect, quasi-Newton methods are of
particular interest because there exist methods of updating a Hessian approx-
imating matrix B (k) so as to maintain the property that B (k) is symmetric
positive definite (see for example [13]).
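The Newton method for minimization, with the Choleski-based solve of G^(k) d = −g^(k) described above, might be sketched as follows; the convex objective f(x) = exp(x1) + x1² + x2² is a made-up example whose Hessian is positive definite everywhere, so no globalization safeguards are needed:

```python
import numpy as np

def g(x):
    # Gradient of the made-up convex objective f(x) = exp(x1) + x1^2 + x2^2.
    return np.array([np.exp(x[0]) + 2.0 * x[0], 2.0 * x[1]])

def G(x):
    # Hessian of f; positive definite everywhere for this choice of f.
    return np.array([[np.exp(x[0]) + 2.0, 0.0],
                     [0.0, 2.0]])

def newton_minimize(x, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        gk = g(x)
        if np.linalg.norm(gk) <= tol:
            break
        # Choleski factors G = L L^T, then two triangular solves
        # for the Newton system G^(k) d = -g^(k).
        L = np.linalg.cholesky(G(x))
        y = np.linalg.solve(L, -gk)    # forward substitution
        d = np.linalg.solve(L.T, y)    # back substitution
        x = x + d
    return x

x_star = newton_minimize(np.array([1.0, 1.0]))
# At the minimizer g(x*) = 0, i.e. exp(x1) + 2 x1 = 0 and x2 = 0.
```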

2.4 Optimization with Linear Equality Constraints

This section is mainly important for the techniques it introduces in regard to handling linear constraints, which form a major feature of the Quadratic Programming (QP) problem that is the subproblem of the SQP method. The ‘linear’ problem that we consider is the Equality QP (EQP) problem

    minimize_{x ∈ IR^n}  q(x) = ½ x^T G x + h^T x
    subject to  A^T x = b.                               (2.14)

In this problem A is an n × m matrix with m ≤ n. The constraints are regular iff rank(A) = m, which we assume to be the case. When m = n the solution is simply that given by (2.1) and q(x) plays no part. We shall focus therefore on the case that m < n, whence the equations in (2.14) are under-determined. In this case we can express the general solution of A^T x = b as

    x = x° + Zt,                                         (2.15)

where x° is a particular solution of A^T x = b, Z is an n × (n − m) matrix whose columns are a basis for the null space of A^T, that is, null(A^T) = {z | A^T z = 0}, and t ∈ IR^{n−m} is an arbitrary vector. The Zt term in (2.15) expresses the non-uniqueness of solutions of A^T x = b. Neither x° nor Z is uniquely defined and any valid choice is acceptable, although there are possible considerations relating to ill-conditioning. There are various ways of finding a suitable x° and Z. One is to reduce A^T to upper echelon form, as described in any basic linear algebra text, using pivoting to avoid ill-conditioning. More relevant to QP software is to find any matrix, V say, such that [A | V] is nonsingular, and also well-conditioned, insofar as that is possible. Denote [A | V]^{−T} = [Y | Z] where Z has n − m columns. Then it can readily be verified, using A^T Y = I, A^T Z = 0 and rank(Z) = n − m, that x° = Y b is a particular solution and the columns of Z are a basis for null(A^T).
What follows is known as the Null Space Method for solving EQP problems. Simply, we substitute the general solution (2.15) into the definition of q(x), giving a reduced quadratic function

    Q(t) = q(x° + Zt) = ½ (x° + Zt)^T G (x° + Zt) + h^T (x° + Zt).

We find a stationary point of Q(t) by applying the condition ∇_t Q(t) = 0, giving rise to the system of equations

    Z^T G Z t = −Z^T (h + G x°),                         (2.16)

which is solved for t. Then (2.15) defines the solution x∗. The solution is a unique minimizer if and only if Z^T G Z, referred to as the reduced Hessian matrix, is positive definite.
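The null space method for the EQP (2.14) can be sketched directly from (2.15) and (2.16). The data G, h, A, b below are made up for illustration, and Y and Z are obtained from [A | V]^{−T} with an arbitrary V making [A | V] nonsingular, as in the text:

```python
import numpy as np

# Made-up EQP data: minimize 0.5 x^T G x + h^T x  s.t.  A^T x = b,
# with n = 3 variables and m = 1 constraint.
G = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
h = np.array([-1.0, 0.0, 1.0])
A = np.array([[1.0], [1.0], [1.0]])   # constraint x1 + x2 + x3 = 1
b = np.array([1.0])

n, m = A.shape

# Pick any V making [A | V] nonsingular (made-up coordinate columns here),
# then [Y | Z] = [A | V]^{-T}, so A^T Y = I and A^T Z = 0 by construction.
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
YZ = np.linalg.inv(np.hstack([A, V]).T)
Y, Z = YZ[:, :m], YZ[:, m:]

x0 = Y @ b                                             # particular solution of A^T x = b
t = np.linalg.solve(Z.T @ G @ Z, -Z.T @ (h + G @ x0))  # reduced system (2.16)
x = x0 + Z @ t                                         # solution (2.15)

assert np.allclose(A.T @ x, b)                # feasibility
assert np.allclose(Z.T @ (G @ x + h), 0.0)    # reduced gradient is zero
```

Since G is positive definite here, the reduced Hessian Z^T G Z is positive definite and x is the unique minimizer.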
The Null Space Method extends to solve any linear equality constrained problem (LECP) in which the objective function is non-quadratic, that is

    minimize_{x ∈ IR^n}  f(x)
    subject to  A^T x = b.                               (2.17)

Again, we just substitute for x using (2.15), giving a reduced problem

    minimize_{t ∈ IR^{n−m}}  F(t) = f(x° + Zt).

This is now a non-quadratic unconstrained minimization problem and can be solved by the methods of the previous subsection.

3 Optimization with Nonlinear Equations

In this section we consider the equality constrained NLP problem

    ENLP:  minimize_{x ∈ IR^n}  f(x)
           subject to  c(x) = 0,

where in general c(x) is a vector of m nonlinear functions. We assume that these functions are continuous and continuously differentiable (C¹) functions of x. In the null space method we looked for a parametrization of the feasible region which allows us to eliminate the linear constraints A^T x = b, and so solve an unconstrained problem. In this section we seek to do the same when the constraint manifold is nonlinear. The development of this section provides an elegant and concise introduction to the concept of so-called Lagrange multipliers and their relation to the reduced optimization problems seen in the previous section. We make use of a locally valid nonlinear transformation, which, although computationally unattractive, does enable us to state necessary optimality conditions, and to derive a Newton method which is ultimately the basis for the SQP method.
3.1 Stationary Points and Lagrange Multipliers

In this section we follow the rationale of the null space method and attempt to derive an equivalent reduced unconstrained optimization problem. In this case however it is necessary to make a nonlinear transformation of variables, and there exist exceptional situations in which this is not possible. In order therefore to ensure that our transformation is well defined, local to a solution x∗ of the ENLP, we make the regularity assumption that the columns of the Jacobian matrix A∗ are linearly independent, or equivalently that rank(A∗) = m.

Existence and some properties of the transformation are a consequence of the Inverse Function Theorem, an important result which can be found in texts on real variable calculus. It may be stated as follows. Let r(x), IR^n → IR^n, be a C¹ nonlinear mapping, and let x∗ be such that ∇r(x∗) is nonsingular. Then open neighbourhoods of x∗ and r∗ (= r(x∗)) exist within which a C¹ inverse mapping x(r) is uniquely defined, so that x(r(x)) = x and r(x(r)) = r. Moreover, derivatives of the mappings are related by

    ∇_r x^T = (∇_x r^T)^{−1}.                            (3.1)

In the case of the ENLP above, we choose any fixed matrix V such that [A∗ | V] is nonsingular (this is possible by virtue of the regularity assumption), and consider the nonlinear mapping

    r(x) = ( c(x) ; V^T (x − x∗) ),                      (3.2)

noting that r(x∗) = 0 (the semicolon denotes a stacked vector). The Jacobian of the transformation is ∇_x r^T = [A | V], which is nonsingular at x∗. It follows by virtue of the inverse function theorem that a well defined inverse function x(r) exists in a neighbourhood of x∗. We consider the constrained form of x(r) in which r = (0 ; t). This defines a function x(t), t ∈ IR^{n−m}, for which c(x(t)) = 0, and so provides a parametrization of the feasible region of the ENLP which is valid local to x∗. Moreover, for any x local to x∗ there corresponds a unique value of t = V^T (x − x∗), and x = x∗ corresponds to t = 0. It also follows from (3.1) that

    ∇_r x(r)^T = [A | V]^{−1} = ( Y^T ; Z^T ),           (3.3)

say, so that

    ∇_t x(t)^T = Z^T and hence ∂x_i/∂t_j = z_{ij}.       (3.4)

Note in general that Z is no longer a constant matrix, and the expressions are only valid local to x∗, in contrast to the null space method for handling linear constraints. However, in the linear case the reduced problem that is obtained below is identical to that described in Section 2.4.

[Figure 1: Illustrating dimension reduction for n = 3. The upper panels show the linear case and the lower panels the nonlinear case, for m = 1 and m = 2. For m = 1 the feasible plane (or surface) through x∗ has normal a1 and feasible directions z1, z2; for m = 2 the feasible set is the line (or the arc common to both surfaces) with direction z1 at x∗.]
The process of dimension reduction is illustrated in Figure 1 when the full
space is three dimensional. In the linear case the constraints can be repre-
sented by planes, and the normal vectors a1 and a2 (columns of the matrix
A) are perpendicular to the planes. When m = 1, the null space has dimen-
sion two, and is spanned by two independent vectors z1 and z2 which are the
columns of Z. Any point x in the plane can be represented uniquely by the
linear combination x = x∗ + t1 z1 + t2 z2 . When m = 2 the feasible set is the
intersection of two planes, there is only one basis vector z1 , and the feasible
set is just x = x∗ + t1 z1 . It can be seen in both cases that the vectors ai and
zj are mutually perpendicular for all i and j, which expresses the condition
A^T Z = 0. In the nonlinear case, the planes are replaced by curved surfaces, and feasible lines in the linear case have become feasible arcs in the nonlinear case, whose directions at x∗ are the vectors z_i∗. It can be seen that A and Z are no longer constant in the nonlinear case.
We can now state an equivalent reduced unconstrained optimization problem for the ENLP, that is

    minimize_{t ∈ IR^{n−m}}  F(t),                       (3.5)

where F(t) = f(x(t)). A stationary point of F(t) is defined by the condition that

    ∇_t F(t) = 0.                                        (3.6)

By the chain rule,

    ∂/∂t_j = Σ_i (∂x_i/∂t_j) ∂/∂x_i = Σ_i z_{ij} ∂/∂x_i

from (3.4), or

    ∇_t = Z^T ∇_x.                                       (3.7)

This result shows how derivatives in the reduced space are related to those in the full space. Applying these derivatives to F(t) and f(x) at x∗, it follows from (3.6) that the stationary point condition for the ENLP problem is

    Z∗^T g(x∗) = 0,                                      (3.8)

where Z∗ denotes the matrix given by (3.3) when x = x∗ and A = A∗. The vector Z^T g is referred to as the reduced gradient and is zero at a stationary point of the ENLP problem.
There is also an alternative formulation of the stationary point condition that can be deduced. Arising from (3.3) and the regularity assumption rank(A∗) = m, we know that both A∗ and Z∗ have linearly independent columns. The definition of the inverse in (3.3) implies that A∗^T Z∗ = 0, showing that the columns of Z∗ are in null(A∗^T). But we might equally write Z∗^T A∗ = 0, showing that the columns of A∗ are in null(Z∗^T), and indeed provide a basis for null(Z∗^T) by virtue of linear independence. Now the stationary point condition (3.8) states that g∗ ∈ null(Z∗^T). Thus we can express g∗ as a linear combination of basis vectors for null(Z∗^T), that is

    g∗ = A∗ λ∗ = Σ_{i=1}^m a_i∗ λ_i∗.                    (3.9)

The multipliers λ∗ in the linear combination are referred to as Lagrange multipliers. There is one Lagrange multiplier for each constraint.

Equations (3.8) and (3.9) provide alternative and equivalent statements of the stationary point conditions for an ENLP problem. They are often referred to as first order necessary conditions for a local solution of a regular ENLP problem. We can also express (3.9) as g∗ ∈ range(A∗). These alternative viewpoints of null space and range space formulations pervade much of both the theoretical and computational aspects of quadratic programming and nonlinear programming, as we shall see below.
Satisfying (3.9) and feasibility provides a method for solving the ENLP, that is: find x∗, λ∗ to solve the system of n + m equations

    r(x, λ) = ( g − Aλ ; −c ) = 0                        (3.10)

in the n + m variables x, λ, where g, A and c are functions of x. This is the so-called method of Lagrange multipliers. However, the system is generally nonlinear in x and may not be straightforward to solve, as we have observed above. It also can only be expected to yield stationary points of the ENLP. Equation (3.10) can also be interpreted as defining a stationary point of a Lagrangian function

    L(x, λ) = f(x) − λ^T c(x),                           (3.11)

since ∇_x L = g − Aλ and ∇_λ L = −c. The Lagrangian function plays a central rôle in both the theoretical and computational aspects of nonlinear programming and the SQP method.
Lagrange multipliers also have a useful interpretation in terms of the sensitivity of the ENLP to perturbations in the constraints. Consider an ENLP problem

    minimize_{x ∈ IR^n}  f(x)
    subject to  c(x) = ε                                 (3.12)

in which the right hand sides of the constraints have been perturbed by an amount ε. Let x(ε), λ(ε) be the solution and multipliers of the perturbed problem, and consider f(x(ε)). Then

    df(x(ε))/dε_i = λ_i,                                 (3.13)

showing that λ_i measures the change of f(x(ε)) with respect to a change ε_i in constraint i, to first order. To prove this result, let the Lagrangian of the perturbed ENLP problem be

    L(x, λ, ε) = f(x) − λ^T (c(x) − ε),

and observe that L(x(ε), λ(ε), ε) = f(x(ε)). Then the chain rule gives

    df/dε_i = dL/dε_i = (∂x^T/∂ε_i) ∇_x L + (∂λ^T/∂ε_i) ∇_λ L + ∂L/∂ε_i = λ_i

by virtue of the stationarity of the Lagrangian function with respect to x and λ.
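The sensitivity result (3.13) is easy to check numerically on a small made-up ENLP, minimize x1² + x2² subject to x1 + x2 = 1 + ε, whose solution and multiplier are available in closed form from the stationarity condition g = Aλ:

```python
import numpy as np

def solve_perturbed(eps):
    # Closed-form solution of: minimize x1^2 + x2^2  s.t.  x1 + x2 = 1 + eps.
    # Stationarity g = A lambda gives 2 x1 = 2 x2 = lambda, hence:
    x = np.array([(1.0 + eps) / 2.0, (1.0 + eps) / 2.0])
    lam = 1.0 + eps
    return x, lam

def f(x):
    return x[0]**2 + x[1]**2

# Finite-difference estimate of df(x(eps))/d eps at eps = 0 ...
eps = 1e-6
deriv = (f(solve_perturbed(eps)[0]) - f(solve_perturbed(-eps)[0])) / (2.0 * eps)

# ... agrees with the multiplier lambda = 1 at eps = 0, as (3.13) predicts.
_, lam0 = solve_perturbed(0.0)
assert abs(deriv - lam0) < 1e-8
```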
3.2 Second Order Conditions for the ENLP Problem

Let x∗ solve the ENLP and let rank(A∗) = m (regularity). Then we have seen that (3.5) is an equivalent reduced unconstrained minimization problem. Thus, from Section 2.3, a second order necessary condition is that the Hessian matrix ∇²_t F(t) is positive semi-definite. To relate this to the ENLP, we use equation (3.7), which relates derivatives in the reduced and full systems. Thus

    ∇²_t F(t) = ∇_t (∇_t F(t))^T = Z^T ∇_x (g^T Z).

When the constraints are linear, such as in (2.17), Z is a constant matrix so we can differentiate further to get

    Z^T ∇_x (g^T Z) = Z^T (∇_x g^T) Z = Z^T G Z,

where G is the Hessian matrix of f(x). Thus the second order necessary condition in this case is that the reduced Hessian matrix Z^T G∗ Z is positive semi-definite. Moreover, Z^T g∗ = 0 and Z^T G∗ Z being positive definite are sufficient conditions for x∗ to solve (3.5).
For an ENLP with nonlinear constraints, Z depends on x and we can no longer assume that derivatives of Z with respect to x are zero. To make progress, we observe that the ENLP is equivalent to the problem

    minimize_{x ∈ IR^n}  L(x, λ∗)
    subject to  c(x) = 0,                                (3.14)

since f(x) = L(x, λ) when c(x) = 0. We now define x(t) as in (3.2), and consider the problem of minimizing the reduced function F(t) = L(x(t), λ∗). Then ∇_t F = 0 becomes Z^T ∇_x L = 0, or Z^T (g − Aλ∗) = 0. At x∗, it follows that Z∗^T g∗ = 0, which is the first order necessary condition. For second derivatives,

    ∇²_t F(t) = ∇_t (∇_t F(t))^T = Z^T ∇_x ((g − Aλ∗)^T Z).

At x∗, derivatives of Z are multiplied by g∗ − A∗λ∗ which is zero, so we have

    ∇²_t F(t∗) = Z^T ∇_x ((g − Aλ∗)^T) Z |_{x∗} = Z∗^T W∗ Z∗,

where

    W(x, λ) = ∇²_x L(x, λ) = ∇² f(x) − Σ_{i=1}^m λ_i ∇² c_i(x)        (3.15)

is the Hessian with respect to x of the Lagrangian function, and W∗ = W(x∗, λ∗). Thus the second order necessary condition for the regular ENLP problem is that the reduced Hessian of the Lagrangian function is positive semi-definite. As above, a sufficient condition is that the reduced Hessian is positive definite and the reduced gradient is zero.

3.3 The SQP Method for the ENLP Problem

We have seen in (3.10) that a stationary point of a regular ENLP can be found by solving a system of nonlinear equations. Applying the Newton-Raphson method to these equations enables us to derive a Newton type method with rapid local convergence properties. First however we consider solving (3.10) in the case of an EQP problem (2.14). In this case, g = Gx + h and c = A^T x − b, so (3.10) can be written as the system of n + m linear equations in n + m unknowns

    [ G     −A ] ( x )      ( h )
    [ −A^T   0 ] ( λ )  = − ( b ).                       (3.16)

Although symmetric, the coefficient matrix in (3.16) is indefinite, so the system cannot be solved using Choleski factors. The Null Space Method (2.16) is essentially one way of solving (3.16), based on eliminating the constraints A^T x = b. When G is positive definite, and particularly when G permits sparse Choleski factors G = LL^T to be obtained, it can be more effective to use the first block equation to eliminate x = G^{−1}(Aλ − h). Then the system

    A^T G^{−1} A λ = b + A^T G^{−1} h                    (3.17)

is used to determine λ, and hence implicitly x. Of course, operations with G^{−1} are carried out by making triangular solves with the Choleski factor L. This method might be regarded as a Range Space Method for EQP, in contrast to the Null Space Method described earlier.
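A sketch of the range space solve (3.17), using Choleski factors of G for all products with G^{−1}; the EQP data are made up for illustration:

```python
import numpy as np

# Made-up EQP data (G symmetric positive definite).
G = np.array([[4.0, 1.0], [1.0, 3.0]])
h = np.array([1.0, -2.0])
A = np.array([[1.0], [1.0]])       # constraint x1 + x2 = 1
b = np.array([1.0])

# Choleski factor G = L L^T; every product with G^{-1} is computed by
# a forward and a back triangular solve, never by forming the inverse.
L = np.linalg.cholesky(G)

def Ginv(M):
    # Returns G^{-1} M via two triangular solves.
    return np.linalg.solve(L.T, np.linalg.solve(L, M))

# Range space method: form and solve (3.17) for lambda, then recover x
# from the first block equation x = G^{-1} (A lambda - h).
lam = np.linalg.solve(A.T @ Ginv(A), b + A.T @ Ginv(h))
x = Ginv(A @ lam - h)

assert np.allclose(A.T @ x, b)            # feasibility
assert np.allclose(G @ x + h, A @ lam)    # stationarity g = A lambda
```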
We now proceed to consider the ENLP problem in the general case of nonlinear constraints. In this case we attempt to solve the equations r(x, λ) = 0 in (3.10) by the Newton-Raphson method. First we need the Jacobian matrix of this system, which is the (n + m) × (n + m) matrix

    ( ∇_x )          [ W     −A ]
    ( ∇_λ ) r^T  =   [ −A^T   0 ]                        (3.18)

where W as defined in (3.15) is the Hessian with respect to x of the Lagrangian function. The current iterate in the NR method is the pair of vectors x^(k), λ^(k), and the iteration formula, generalising (2.4), is

    [ W^(k)     −A^(k) ] ( d^(k) )             ( A^(k) λ^(k) − g^(k) )
    [ −A^(k)T    0     ] ( δ^(k) ) = −r^(k) =  ( c^(k)               )        (3.19)

where superscript (k) denotes quantities calculated from x^(k) and λ^(k). Then updated values for the next iteration are defined by x^(k+1) = x^(k) + d^(k) and λ^(k+1) = λ^(k) + δ^(k). These formulae may be rearranged by moving the A^(k) λ^(k) term to the left hand side of (3.19), giving

    [ W^(k)     −A^(k) ] ( d^(k)   )   ( −g^(k) )
    [ −A^(k)T    0     ] ( λ^(k+1) ) = ( c^(k)  ).       (3.20)

This then is a Newton iteration formula for finding a stationary point of an ENLP problem. For rapid local convergence to a stationary point x∗ with multipliers λ∗, we require that the Jacobian matrix

    [ W∗     −A∗ ]
    [ −A∗^T   0  ]                                       (3.21)

is nonsingular.
So where does the SQP method come in? There are two important observations to make. First, if the constraints in the ENLP problem are regular (rank(A∗) = m), and the ENLP satisfies second order sufficiency conditions (Z∗^T W∗ Z∗ is positive definite), then it is a nice exercise in linear algebra to show that the matrix (3.21) is nonsingular (see [13]). Thus the local rapid convergence of (3.20) is assured. Moreover, it also follows that rank(A^(k)) = m and Z^(k)T W^(k) Z^(k) is positive definite in a neighbourhood of x∗, λ∗. Under these conditions, the EQP problem

    EQP(k):  minimize_{d ∈ IR^n}  ½ d^T W^(k) d + d^T g^(k) + f^(k)
             subject to  c^(k) + A^(k)T d = 0

is regular and has a unique local minimizer, which can be found by solving the stationary point condition (see (3.16)), which for EQP(k) is none other than (3.20). Thus, for finding a local minimizer of an ENLP problem, it is better to replace the iteration formula (3.20) by one based on solving EQP(k) for a correction d^(k) = d and multiplier vector λ^(k+1). This correctly accounts for the second order condition required by a local minimizer of an ENLP problem. In particular, any solution of (3.20) which corresponds to a saddle point or maximizer of EQP(k) is not accepted. (EQP(k) is unbounded in this situation.) Also EQP(k) has a nice interpretation: the constraints are linear Taylor series approximations about x^(k) to those in the ENLP problem, and the objective function is a quadratic Taylor series approximation about x^(k) to the objective function in the ENLP, plus terms in W^(k) that account for constraint curvature. The objective function can equally be viewed as a quadratic approximation to the Lagrangian function. (In fact the term f^(k) in the objective function of EQP(k) is redundant, but is included so as to make these nice observations.)
To summarize, solving an ENLP in this way may be interpreted as a Sequential EQP method (SEQP) with the following basic structure:
    initialize x(1) , λ(1)
    for k = 1, 2, . . . until converged
        solve EQP(k) giving d(k) and multipliers λ(k+1)
        set x(k+1) = x(k) + d(k)
    end
As with other Newton methods, the method may not converge globally, and
EQP(k) may have no solution, for example it may be unbounded, or possibly
infeasible if A(k) is rank deficient. It is therefore essential that extra features
are included in any practical implementation. We return to this subject later
in the monograph.
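The basic SEQP loop above can be sketched in a few lines of code. The following Python fragment runs it on an assumed toy ENLP (not from the text): minimize x1² + x2² subject to c(x) = x1² + x2 − 1 = 0, whose local minimizer is x∗ = (1/√2, 1/2) with λ∗ = 1. Each iteration solves the stationary point system (3.20) with a tiny dense linear solver; no globalization features are included, so this is the idealized method only.

```python
# SEQP sketch on an assumed toy ENLP:
#   minimize x1^2 + x2^2   subject to  c(x) = x1^2 + x2 - 1 = 0,
# with local solution x* = (1/sqrt 2, 1/2), lambda* = 1.  Each iteration
# solves the KKT system (3.20):  [W -a; -a^T 0][d; lam_new] = [-g; c].

def solve(A, b):
    """Tiny dense linear solver: Gaussian elimination, partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def seqp(x1, x2, lam, iters=8):
    for _ in range(iters):
        g = [2.0 * x1, 2.0 * x2]              # gradient of f
        a = [2.0 * x1, 1.0]                   # gradient of c
        c = x1 * x1 + x2 - 1.0
        W = [[2.0 - 2.0 * lam, 0.0],          # Lagrangian Hessian
             [0.0, 2.0]]
        K = [[W[0][0], W[0][1], -a[0]],
             [W[1][0], W[1][1], -a[1]],
             [-a[0], -a[1], 0.0]]
        d1, d2, lam = solve(K, [-g[0], -g[1], c])
        x1, x2 = x1 + d1, x2 + d2
    return x1, x2, lam

x1, x2, lam = seqp(1.0, 0.5, 0.0)
print(x1, x2, lam)    # -> approximately (0.70710678, 0.5, 1.0)
```

Note that partial pivoting matters here: at the solution W11 = 2 − 2λ∗ = 0, so the KKT matrix has a zero diagonal entry even though it is nonsingular.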

4 Inequality Constraints and Nonlinear Programming

In this section we examine the extra complication caused by having inequal-
ity constraints in the formulation of an optimization problem. As above we
discuss ‘linear’ problems which can be solved in a finite number of steps, and
nonlinear problems for which iteration is required, leading to a general formu-
lation of the SQP method. We also discuss changes to optimality conditions
to accommodate inequality constraints.

4.1 Systems of Inequalities

Corresponding to the development of Section 2.1, the ‘linear’ problem we now
consider is that of a system of linear inequalities

\[
A^T x \ge b
\tag{4.1}
\]
in which A is a given n × m matrix of coefficients and b is a given vector of
right hand sides. Usually m > n when there is no objective function present,
although in general, m ≤ n is also possible. Each inequality aTi x ≥ bi in
(4.1) divides IRn into two parts, a feasible side and an infeasible side, with
respect to the inequality. Equality holds on the boundary. Any n independent
such equations define a point of intersection, referred to as a vertex in this
context. Usually methods for solving (4.1) attempt to locate a feasible vertex.
Each vertex can be found by solving a system of linear equations as in (2.1).
There are only a finite number of vertices so the process will eventually find a solution, or establish that none exists. However, there may be as many as $\binom{m}{n}$ vertices, which can be extremely large for problems of any size. Thus it is
important to enumerate the vertices in an efficient way. This can be done by

a modification known as Phase I of the so-called Simplex Method for Linear
Programming, which we describe briefly below (see also [13] for example).
An important saving in this case is that adjacent vertices differ by only one
equation, and this can be used to update matrix factors to gain efficiency.
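For intuition, the vertices of a very small system can be enumerated by brute force — solving each n × n subsystem and testing feasibility — which is exactly the exponential search that a Phase I approach avoids. A minimal Python sketch (the three constraints are an assumed example, solved here by Cramer's rule):

```python
from itertools import combinations

# Brute-force vertex enumeration (illustration only -- this is the
# exponential search that Phase I of the Simplex Method avoids).  The
# system is an assumed example in IR^2:  x1 >= 0,  x2 >= 0,  x1 + x2 >= 1.
a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # constraint normals a_i
b = [0.0, 0.0, 1.0]                        # right hand sides b_i

feasible = []
for i, j in combinations(range(3), 2):     # each pair defines a vertex
    det = a[i][0] * a[j][1] - a[i][1] * a[j][0]
    if abs(det) < 1e-12:
        continue                           # parallel boundaries: no vertex
    x1 = (b[i] * a[j][1] - b[j] * a[i][1]) / det   # Cramer's rule
    x2 = (a[i][0] * b[j] - a[j][0] * b[i]) / det
    if all(a[k][0] * x1 + a[k][1] * x2 >= b[k] - 1e-12 for k in range(3)):
        feasible.append((x1, x2))
print(feasible)    # the two feasible vertices, (0, 1) and (1, 0)
```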
The corresponding nonlinear problem is the system of nonlinear
inequalities
\[
r(x) \ge 0
\tag{4.2}
\]
in which r(x) is a given vector of m nonlinear functions. This problem can be
solved by a Newton type algorithm in which a sequence of linearized subprob-
lems are solved, each being obtained by a linear Taylor series approximation
to r(x) about a current iterate x(k) . This just becomes a special case of the
SQP method, and we shall defer discussion on it until later.
In cases when no solution exists to (4.1) or (2.1), it is often of interest to
find a ‘best’ solution which minimizes some measure of constraint infeasibility.
Exactly what measure to choose is a decision for the user, and is one which
has implications for the type of method that is possible.

4.2 Optimization with Inequality Constraints

In this case our generic linear problem, which can be solved finitely, is the
QP problem
\[
\begin{aligned}
& \underset{x \in \mathbb{R}^n}{\text{minimize}} && q(x) = \tfrac12 x^T G x + h^T x \\
& \text{subject to} && A^T x \ge b.
\end{aligned}
\tag{4.3}
\]

In this problem A is an n × m matrix with no restrictions on the value of m.


The corresponding nonlinear problem is the NLP problem (1.1). There is also
the intermediate stage of a linearly constrained problem (LCP say) in which
the objective function is non-quadratic, that is

\[
\begin{aligned}
& \underset{x \in \mathbb{R}^n}{\text{minimize}} && f(x) \\
& \text{subject to} && A^T x \ge b.
\end{aligned}
\tag{4.4}
\]

As in Section 1, a more general formulation of the QP and NLP problems is
often appropriate for practical use, but the simplified form is convenient for
introducing the main features. An important special case of QP, which can
also be solved finitely, is the Linear Programming (LP) problem, characterized
by G = 0 in (4.3).
In this section our main aim is to discuss optimality conditions for an
NLP problem. An important concept is that of an active constraint. The set
of active constraints at a point x is defined by

\[
\mathcal{A}(x) = \{\, i \mid c_i(x) = 0 \,\}
\tag{4.5}
\]



so that i ∈ A(x) indicates that x is on the boundary of constraint i. The
set of active constraints at the solution is denoted by A∗ . Constraints not in
A∗ have no influence on the behaviour of the NLP problem in a neighbourhood of a local solution x∗ of (1.1).
Clearly x∗ solves the following ENLP problem

\[
\begin{aligned}
& \underset{x \in \mathbb{R}^n}{\text{minimize}} && f(x) \\
& \text{subject to} && c_i(x) = 0, \quad i \in \mathcal{A}^*.
\end{aligned}
\tag{4.6}
\]

Let the gradient vectors a∗i = ∇c∗i , i ∈ A∗ be linearly independent (regular-
ity). Then from (3.9) in Section 3.1 there exist multipliers λ∗ such that

 
\[
g^* = \sum_{i \in \mathcal{A}^*} a_i^* \lambda_i^* = \sum_{i=1}^m a_i^* \lambda_i^* = A^* \lambda^*,
\tag{4.7}
\]

denoting λ∗i = 0 for inactive constraints i ∉ A∗. Moreover, by virtue of
regularity, we can perturb the right hand side of (4.6) by a sufficiently small
amount εi > 0, and εj = 0, j ≠ i, and still retain feasibility in (1.1). Then,
if λ∗i < 0, it follows from (3.13) that df /dεi = λ∗i < 0, which contradicts
optimality. Thus
\[
\lambda_i^* \ge 0
\tag{4.8}
\]
for an inequality constraint is also necessary. (The multiplier of an equality
constraint can take either sign.) The convention that the multiplier of an
inactive constraint is zero may also be expressed as

\[
\lambda_i^* c_i^* = 0
\tag{4.9}
\]

which is referred to as the complementarity condition. If λ∗i > 0 for all active
inequality constraints, then strict complementarity is said to hold. Collec-
tively, feasibility in (1.1), (3.9), (4.8) for an inequality constraint, and (4.9)
are known as KT (Kuhn-Tucker) (or KKT (Karush-Kuhn-Tucker)) conditions
(Karush [31], Kuhn and Tucker [32]). Subject to a regularity assumption of
some kind, they are necessary conditions for a local solution of (1.1). A point
x∗ which satisfies KT conditions for some λ∗ is said to be a KT point.
A second order necessary condition that can be deduced from (4.6) is that
Z ∗T W ∗ Z ∗ is positive semi-definite, where Z ∗ is the null-space basis matrix
for (4.6), and W is defined in (3.15). A sufficient condition is that x∗ is a KT
point, strict complementarity holds, and Z ∗T W ∗ Z ∗ is positive definite.
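As a concrete illustration, the KT conditions can be checked numerically. The sketch below (plain Python) does this at the claimed solution of the example NLP solved later in Section 4.4, minimize −x1 − x2 subject to c1 = x2 − x1² ≥ 0 and c2 = 1 − x1² − x2² ≥ 0, where x∗ = (1/√2, 1/√2) and λ∗ = (0, 1/√2):

```python
import math

# Numerical check of the KT conditions at the claimed solution of the
# example used later in Section 4.4:
#   minimize -x1 - x2  s.t.  c1 = x2 - x1^2 >= 0,  c2 = 1 - x1^2 - x2^2 >= 0.
x = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
lam = (0.0, 1.0 / math.sqrt(2.0))                   # lambda* >= 0 holds
g = (-1.0, -1.0)                                    # gradient of f
c = (x[1] - x[0] ** 2, 1.0 - x[0] ** 2 - x[1] ** 2) # constraint values
a = ((-2.0 * x[0], 1.0),                            # gradient of c1
     (-2.0 * x[0], -2.0 * x[1]))                    # gradient of c2

# first-order condition g* = A* lambda*, and complementarity (4.9)
stat = tuple(g[i] - sum(lam[j] * a[j][i] for j in range(2)) for i in range(2))
comp = tuple(lam[i] * c[i] for i in range(2))
print(stat, comp)    # both pairs ~ (0, 0); c1 > 0 is inactive, c2 ~ 0 active
```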
The regularity assumption used in these notes (that the gradient vectors
a∗i = ∇c∗i , i ∈ A∗ are linearly independent) is known as the Linear Indepen-
dence Constraint Qualification (LICQ). If LICQ fails at any point, degeneracy
is said to hold at that point. However KT conditions can hold under weaker
conditions, most notably when all the active constraints are linear. In this
case the validity of KT conditions is a consequence of the famous Farkas
Lemma (see [13] for example). Weaker regularity conditions than LICQ have
been proposed, from which KT conditions can be deduced, for instance by
Kuhn and Tucker [32] (KTCQ), and by Mangasarian and Fromowitz [33]
(MFCQ). A sufficient condition for KTCQ to hold is that the active con-
straints are all linear. For problems in which there are some nonlinear active
constraints, LICQ is arguably the only practical way of establishing that a
point is regular, and failure of LICQ is rare at local solutions of nonlinear
constraint problems.

4.3 Quadratic Programming

Before describing the SQP method, it is important to know how to solve
a QP problem (4.3) that contains inequality constraints. A method with
finite termination is the Active Set Method (ASM), which has features that
are favourable in the context of SQP. The method solves (4.3) by solving
a sequence of EQP problems, whilst retaining feasibility in (4.3), until the
correct active set is determined. The method is described in the case that
G is positive definite, all the constraints are inequalities, and there is no
degeneracy at the iterates x(k) . The method is initialized by finding a feasible
vertex, x(1) say, as described in Section 4.1. We let A denote the current set
of active constraints at x(k) . The current EQP is defined by

\[
\begin{aligned}
& \underset{x \in \mathbb{R}^n}{\text{minimize}} && \tfrac12 x^T G x + h^T x \\
& \text{subject to} && a_i^T x = b_i, \quad i \in \mathcal{A}.
\end{aligned}
\tag{4.10}
\]

Because x(1) is a vertex, it is in fact the solution of the current EQP defined
by A. The ASM has two major steps.
(i) If x(k) solves the current EQP, then find the corresponding multipliers λ(k) . Choose any i : λi(k) < 0 (if none exists, then finish with x∗ = x(k) ). Otherwise, remove i from A and goto step (ii).
(ii) Find the solution, x̂ say, of the current EQP. If x̂ is feasible in the QP problem then [set x(k+1) = x̂, k = k + 1 and goto step (i)]. Otherwise, set x(k+1) as the closest feasible point to x̂ along the line segment from x(k) to x̂. Add the index of a newly active constraint to A. Set k = k + 1. If |A| = n then goto step (i) else goto step (ii).
The motivation for the algorithm is provided by the observation in Section 4.2 that if there exists i : λi(k) < 0 at the solution of the current EQP, then
it is possible to relax constraint i whilst reducing the objective function. If
G = 0 in (4.3) then we have a Linear Programming (LP) problem, and the
ASM is essentially the same as the Simplex method for LP, although it is not
often explained in this way.
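The two steps can be turned into a compact program. Below is a hedged Python sketch of the ASM exactly as described above — G positive definite, inequality constraints only, no degeneracy handling, and each EQP solved from scratch rather than by matrix updating. The worked QP is an assumed example whose solution x∗ = (1.5, 1.5) with multiplier 0.5 on the first constraint is easy to verify by hand:

```python
# Active Set Method sketch for  min 0.5 x^T G x + h^T x  s.t.  A^T x >= b.
# Column j of A is the constraint normal a_j.  Assumes G positive definite,
# a feasible starting vertex, and no degeneracy; a production code would
# instead update matrix factors as the active set changes.

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def asm_qp(G, h, A, b, x, active):
    n, m = len(x), len(b)
    while True:
        k = len(active)
        # KKT system of the current EQP (4.10):
        #   G xhat - sum a_i lam_i = -h,   a_i^T xhat = b_i,  i in active
        K = [[G[i][j] for j in range(n)] + [-A[i][t] for t in active]
             for i in range(n)]
        K += [[A[i][t] for i in range(n)] + [0.0] * k for t in active]
        sol = solve(K, [-h[i] for i in range(n)] + [b[t] for t in active])
        xhat, lam = sol[:n], sol[n:]
        if max(abs(xhat[i] - x[i]) for i in range(n)) < 1e-10:
            # step (i): x solves the current EQP; inspect the multipliers
            if not lam or min(lam) >= 0.0:
                return x, dict(zip(active, lam))
            active.pop(min(range(k), key=lambda t: lam[t]))  # drop a lam<0
        else:
            # step (ii): move towards xhat, stopping at the first boundary
            alpha, block = 1.0, None
            for i in range(m):
                if i in active:
                    continue
                slope = sum(A[r][i] * (xhat[r] - x[r]) for r in range(n))
                if slope < -1e-12:          # heading towards this boundary
                    slack = sum(A[r][i] * x[r] for r in range(n)) - b[i]
                    if slack / -slope < alpha:
                        alpha, block = slack / -slope, i
            x = [x[i] + alpha * (xhat[i] - x[i]) for i in range(n)]
            if block is not None:
                active.append(block)

# assumed example: G = I, h = (-1,-1), constraints x1+x2 >= 3 and x1 >= 0,
# started at the feasible vertex (0, 3) where both constraints are active
x, lam = asm_qp(G=[[1.0, 0.0], [0.0, 1.0]], h=[-1.0, -1.0],
                A=[[1.0, 1.0], [1.0, 0.0]], b=[3.0, 0.0],
                x=[0.0, 3.0], active=[0, 1])
print(x, lam)   # -> [1.5, 1.5] with multiplier 0.5 on constraint 0
```

On this data the method reproduces the narrative above: at the vertex (0, 3) the bound x1 ≥ 0 gets a negative multiplier and is dropped, the EQP on the remaining constraint gives x̂ = (1.5, 1.5), which is feasible, and its multiplier 0.5 ≥ 0 terminates the method.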

Special linear algebra techniques are required to make the method efficient
in practice. Changes to the current active set involve either adding or sub-
tracting one constraint index. Updates to matrices such as Z and Z T GZ can
be performed much more quickly than re-evaluating the matrices. For large
problems, it is important to take advantage of sparsity in A and possibly G.
There are some complicating factors for the ASM. If the Hessian G is not
positive definite, then it is possible that the EQP obtained by removing i
from A may be unbounded, so that x̂ does not exist. In this case an arbitrary
choice of feasible descent direction is chosen, to make progress. If G has
negative eigenvalues, then the QP problem may have local solutions, and the
ASM does not guarantee to find a global solution. Any solution found by
the ASM will be a KT point of the QP problem, but may not be a local
solution unless strict complementarity holds. A more serious complicating
factor is that of degeneracy which refers to the situation where regularity
of the active constraints at the solution of an EQP problem fails to hold.
An example would be where there are more than n active constraints at a
feasible vertex. In this case, deciding whether x(k) solves the current EQP, or
whether a feasible descent direction exists, is a more complex issue, although
a finite algorithm to decide the issue is possible. Degeneracy is often present
in practical instances of QP problems, and it is important that it is correctly
accounted for in a computer code.
More recently an alternative class of methods has become available for
the solution of LP or QP problems in which G is positive semi-definite.
These interior point methods have the advantage that they avoid the worst
case behaviour of ASM and Simplex methods, in which the number of itera-
tions required to locate the solution may grow exponentially with n. However,
interior point methods also have some disadvantages in an SQP context.

4.4 The SQP Method

We are now in a position to describe the basic SQP method for an NLP (1.1)
with inequality constraints. The method was first suggested in a thesis of
Wilson (1960), [48], and became well known due to the work of Beale [1]. The
idea follows simply from the SEQP method for an ENLP problem, where the
equality constraints c(x) = 0 are approximated by the linear Taylor series
c(k) + A(k)T d = 0 in the subproblem EQP(k) . In an NLP with inequality
constraints c(x) ≥ 0 we therefore make the same approximation, leading to
a QP subproblem with linear inequality constraints c(k) + A(k)T d ≥ 0, that is
\[
\text{QP}^{(k)} \qquad
\begin{aligned}
& \underset{d \in \mathbb{R}^n}{\text{minimize}} && \tfrac12 d^T W^{(k)} d + d^T g^{(k)} \\
& \text{subject to} && c^{(k)} + A^{(k)T} d \ge 0.
\end{aligned}
\]

The basic form of the algorithm therefore is that described at the end of
Section 3.3, with the substitution of QP(k) for EQP(k) . To view this method
as a Newton-type method, we need to assume that strict complementarity
λ∗i > 0, i ∈ A∗ holds at a regular solution to (1.1). Then, if x(k) , λ(k) is
sufficiently close to x∗ , λ∗ , it follows that the solution of EQP(k) with active
constraints A∗ , also satisfies the sufficient conditions for QP(k) . Thus we
can ignore inactive constraints i ∉ A∗, and the SQP method is identical to
the SEQP method on the active constraint set A∗ . Thus the SQP method
inherits the local rapid convergence of a Newton type method under these
circumstances.
The progress of the SQP method on the NLP problem

\[
\begin{aligned}
& \underset{x \in \mathbb{R}^2}{\text{minimize}} && f(x) = -x_1 - x_2 \\
& \text{subject to} && c_1(x) = x_2 - x_1^2 \ge 0 \\
& && c_2(x) = 1 - x_1^2 - x_2^2 \ge 0
\end{aligned}
\]

is illustrated in Table 1, and has some instructive features. Because the initial
multiplier estimate is zero, and f (x) is linear, the initial W (1) matrix is zero,
and QP(1) is in fact an LP problem. Consequently, x(1) has to be chosen
carefully to avoid an unbounded subproblem (or alternatively one could add
simple upper and lower bounds to the NLP problem). The solution of QP(1)
delivers some non-zero multipliers for λ(2) , so that W (2) becomes positive
definite. The solution of QP(2) predicts that constraint 1 is inactive, and we
see that λ1(3) is zero. This situation persists on all subsequent iterations. For
this NLP problem, the active set is A∗ = {2}, and we see for k ≥ 3, that the
SQP method converges to the solution in the same way as the SEQP method
with the single equality constraint 1 − x21 − x22 = 0. The onset of rapid local
convergence, characteristic of a Newton method, can also be observed.
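The tail of this run is easy to reproduce. For k ≥ 3 only constraint 2 is active, g = (−1, −1) is constant, a = ∇c2 = (−2x1, −2x2), and W(k) = 2λ2(k) I is a positive multiple of the identity, so the EQP KKT system has a closed-form solution. A short Python sketch exploiting this special structure (the general method would of course solve QP(k) instead):

```python
# SEQP tail of the example: single active constraint c2 = 1 - x1^2 - x2^2.
# With g = (-1,-1) and W = 2*lam*I, eliminating d from (3.20) gives
#   lam_new = (a.g - w*c) / (a.a),    d = (a*lam_new - g) / w.

def seqp_tail(x1, x2, lam, iters=5):
    for _ in range(iters):
        c = 1.0 - x1 * x1 - x2 * x2
        a1, a2 = -2.0 * x1, -2.0 * x2          # gradient of c2
        w = 2.0 * lam                          # W(k) = 2*lam*I
        lam = (a1 * -1.0 + a2 * -1.0 - w * c) / (a1 * a1 + a2 * a2)
        x1 += (a1 * lam + 1.0) / w             # d = (a*lam - g)/w
        x2 += (a2 * lam + 1.0) / w
    return x1, x2, lam

# start from iterate k = 3 of Table 1
x1, x2, lam = seqp_tail(0.747120, 0.686252, 0.730415)
print(x1, x2, lam)    # all three -> 1/sqrt 2 = 0.7071067...
```

One pass of this loop reproduces row k = 4 of Table 1, and the rapid contraction of the errors after that is the quadratic convergence referred to above.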
However, the basic method can fail to converge if x(k) is remote from x∗
(it is not as important to have λ(k) close to λ∗ because if x(k) is close to
x∗ , one solution of QP(k) will give an accurate multiplier estimate). It is
also possible that QP(k) has no solution, either because it is unbounded, or
because the linearized constraints are infeasible.
For these reasons, the SQP method is only the starting point for a fully
developed NLP solver, and extra features must be added to promote conver-
gence from remote initial values. This is the subject of subsequent sections of

Table 1 A numerical example of the SQP method

 k    x1(k)       x2(k)       λ1(k)    λ2(k)       c1(k)       c2(k)
 1    1/2         1           0        0           3/4         −1/4
 2    11/12       2/3         1/3      2/3         −0.173611   −0.284722
 3    0.747120    0.686252    0        0.730415    0.128064    −0.029130
 4    0.708762    0.706789    0        0.706737    0.204445    −0.001893
 5    0.707107    0.707108    0        0.707105    0.207108    −0.2810 × 10^−5

this monograph. Nonetheless it has been and still is the method of choice for
many researchers. The success of the method is critically dependent on hav-
ing an efficient, flexible and reliable code for solving the QP subproblem. It
is important to be able to take advantage of warm starts, that is, initializing
the QP solver with the active set from a previous iteration. Also important
is the ability to deal with the situation that the matrix W (k) is not positive
semi-definite. For both these reasons, an active set method code for solving
the QP subproblems is likely to be preferred to an interior point method.
However, NLP solvers are still a very active research area, and the situation
is not at all clear, especially when dealing with very large scale NLPs.

4.5 SLP-EQP Algorithms

An early idea for solving NLP problems is the successive linear programming
(SLP) algorithm in which an LP subproblem is solved (W (k) = 0 in QP(k) ).
This is able to take advantage of fast existing software for large scale LP.
However, unless the solution of the NLP problem is at a vertex, convergence
is slow because of the lack of second derivative information. A more recent
development is the SLP-EQP algorithm, introduced by Fletcher and Sainz
de la Maza [22], in which the SLP subproblem is used to determine the
active set and multipliers, but the resulting step d is not used. Instead an
SEQP calculation using the subproblem EQP(k) in Section 3.3 is made to
determine d(k) . The use of a trust region in the LP subproblem (see below)
is an essential feature in the calculation. The method is another example of a
Newton-type method and shares the rapid local convergence properties. The
idea has proved quite workable, as a recent software product SLIQUE of
Byrd, Gould, Nocedal and Waltz [3] demonstrates.

4.6 Representing the Lagrangian Hessian W (k)

An important issue for the development of an SQP code is how to represent
the Hessian matrix W (k) that arises in the SQP subproblem. As defined in
QP(k) , it requires evaluation of all the Hessian matrices of f and ci , i =
1, 2, . . . , m at x(k) , and their combination using the multipliers λ(k) . Writing
code from which to evaluate second derivatives can be quite error prone, and
in the past, this option has not always been preferred. However, the use of
an exact W (k) matrix has been given new impetus through the availability
of easy-to-use automatic differentiation within modelling languages such as
AMPL, GAMS and TOMLAB (see Section 7). In large problems the exact
W (k) may be a sparse matrix, which provides another reason to consider
this option. On the other hand, it may be that the globalization strategy
requires W (k) to be positive definite, in which case it will usually not be
possible to use W (k) directly (except for certain ‘convex’ problems, W can
be indefinite, even at the solution). However, if W (k) is readily and cheaply
available, it is probably best to make use of it in some way, as this can
be expected to keep the iteration count low. There are various ideas for
modifying W (k) to obtain a positive definite matrix. One is to add a suitable
multiple of a unit matrix to W (k) . Another is to add outer products using
active constraint gradients, as when using an augmented Lagrangian penalty
function.
Otherwise, the simplest approach, generically referred to as quasi-Newton
SQP, is to update a symmetric matrix B (k) which approximates W (k) . In this
method, B (1) is initialized to some suitable positive definite or semi-definite
matrix (often a multiple of the unit matrix), and B (k) is updated after each
iteration to build up information about second derivatives. A suitable strategy
(e.g. Nocedal and Overton [37]) is usually based on evaluating difference
vectors
\[
\delta^{(k)} = x^{(k+1)} - x^{(k)}
\tag{4.11}
\]
in x, and
\[
\gamma^{(k)} = \nabla_x L(x^{(k+1)}, \lambda^{(k)}) - \nabla_x L(x^{(k)}, \lambda^{(k)})
\tag{4.12}
\]

in the gradient of the Lagrangian function L(x, λ(k) ), using the latest avail-
able estimate λ(k) of the multipliers. Then the updated matrix B (k+1) is
chosen to satisfy the secant condition B (k+1) δ (k) = γ (k) .
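As a sketch of one standard possibility (the BFGS formula mentioned below, applied here to a 2 × 2 matrix with hypothetical difference vectors standing in for (4.11) and (4.12)), the following plain-Python update can be checked directly against the secant condition:

```python
# BFGS update sketch:  B_new = B - (B d)(B d)^T/(d^T B d) + y y^T/(d^T y),
# with d = delta(k) and y = gamma(k).  The values of delta and gamma below
# are hypothetical; the final check is the secant condition B_new d = y.

def bfgs_update(B, d, y):
    Bd = [sum(B[i][j] * d[j] for j in range(2)) for i in range(2)]
    dBd = sum(d[i] * Bd[i] for i in range(2))
    dy = sum(d[i] * y[i] for i in range(2))
    assert dy > 0.0, "BFGS requires delta^T gamma > 0 to keep B definite"
    return [[B[i][j] - Bd[i] * Bd[j] / dBd + y[i] * y[j] / dy
             for j in range(2)] for i in range(2)]

B = [[1.0, 0.0], [0.0, 1.0]]   # B(1): the unit matrix
delta = [0.5, -0.2]            # delta(k) = x(k+1) - x(k)     (hypothetical)
gamma = [0.9, 0.1]             # gamma(k): Lagrangian gradient difference
Bn = bfgs_update(B, delta, gamma)
secant = [sum(Bn[i][j] * delta[j] for j in range(2)) for i in range(2)]
print(secant)    # recovers gamma = [0.9, 0.1] up to rounding
```

The guard on δᵀγ > 0 is precisely the requirement discussed in the next paragraph; Powell's modification [41] is one widely used way of enforcing it.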
There are many ways in which one might proceed. For small problems,
where it is required to maintain a positive definite B (k) matrix, the BFGS
formula (see [13]) might be used, in which case it is necessary to have
δ (k)T γ (k) > 0. It is not immediately obvious how best to meet this require-
ment in an NLP context, although a method suggested by Powell [41] has
been used widely with some success. For large problems, some form of limited
memory update is a practical proposition. The L-BFGS method, Nocedal [36],
as implemented by Byrd, Nocedal and Schnabel [4] is attractive, although
other ideas have also been tried. Another method which permits low costs
is the low rank Hessian approximation B = U U T (Fletcher [15]), where U
has relatively few columns. For ENLP, updating the reduced Hessian matrix
M ≈ Z T W Z, B = V M V T , using differences in reduced gradients, is ap-
propriate, essentially updating the Hessian of the reduced objective function
F (t) in (3.5). However, this idea does not translate easily into the context of
NLP with inequality constraints, due to the change in dimension of m when
the number of active constraints changes.
An intermediate situation for large scale SQP is to update an approx-
imation which takes the sparsity pattern of W (k) into account, and up-
dates only the non-sparse elements. The LANCELOT project (see Conn,
Gould and Toint [8] for many references) makes use of partially separa-
ble functions in which B (k) is the sum of various low dimensional element
188 R. Fletcher

Hessians, for which the symmetric rank one update is used. Other sparsity
respecting updates have also been proposed, for example Toint [45], Fletcher,
Grothey and Leyffer [17], but the coding is complex, and there are some
difficulties.
Various important conditions exist regarding rapid local convergence, re-
lating to the asymptotic properties of W Z or Z T W Z (see [13] for references).
Significantly, low storage methods like L-BFGS do not satisfy these con-
ditions, and indeed slow convergence is occasionally observed, especially
when the true reduced Hessian Z ∗T W ∗ Z ∗ is ill-conditioned. For this rea-
son, obtaining rapid local convergence when the null space dimension is very
large is still a topic of research interest. Indeed the entire subject of how best
to provide second derivative information in an SQP method is very much an
open issue.

5 Globalization of NLP Methods

In this section we examine the transition from Newton type methods with
rapid local convergence, such as the SQP method, to globally convergent
methods suitable for incorporation into production NLP software. By glob-
ally convergent, we refer to the ability to converge to local solutions of an
NLP problem from globally selected initial iterates which may be remote
from any solution. This is not to be confused with the problem of guaran-
teeing to find global solutions of an NLP problem in the sense of the best
local solution, which is computationally impractical for problems of any size
(perhaps >40 variables, say), unless the problem has some special convex-
ity properties, which is rarely the case outside of LP and QP. We must also
be aware that NLP problems may have no solution, mainly due to the con-
straints being infeasible (that is, no feasible point exists). In this case the
method should ideally be able to indicate that this is the case, and not spend
an undue amount of time in searching for a non-existent solution. In practice
even to guarantee that no feasible solution exists is an unrealistic aim, akin
to that of finding a global minimizer of some measure of constraint infeasi-
bility. What is practical is to locate a point which is locally infeasible in the
sense that the first order Taylor series approximation to the constraints set
is infeasible at that point. Again the main requirement is that the method
should be able to converge rapidly to such a point, and exit with a suitable
indication of local infeasibility. Another possibility, which can be excluded
by bounding the feasible region, is that the NLP is unbounded, that is f (x)
is not bounded below on the feasible region, or that there are no KT points
in the feasible region. Again the software has to recognize the situation and
terminate accordingly.
Ultimately the aim is to be able to effectively solve NLP problems created
by scientists, engineers, economists etc., who have a limited background in
optimization methods. For this we must develop general purpose software
which is
• Efficient
• Reliable
• Well documented
• Thoroughly tested
• Flexible
• Easy to use, and has
• Large scale capacity.
Efficiency and reliability are self evident, as is being well documented. Being
thoroughly tested involves monitoring the behaviour on test problems (e.g.
CUTE [2]) which encompass a wide range of possible situations, including
problems with no solution as referred to above. Flexibility includes the ability
to specify simple bounds, linear constraints, etc., to take account of sparsity
in the formulation, and to be able to make warm starts from the solution
of a previously solved problem. It can also include the flexibility to decide
whether or not to supply second derivatives (indeed some codes have been
developed which require no derivative information to be supplied, but these
are very limited in the size of problem that can be solved, and the accuracy
that can be achieved). By easy to use we envisage issues such as the provision
of default parameter values which do not need tuning by the user, and access
to the software via modelling languages like AMPL, GAMS or TOMLAB.
Large scale capacity is required for the software to make a significant impact:
even run-of-the-mill applications of optimization now have thousands of variables
and/or constraints and may be computationally intractable without the use
of sparse matrix techniques, or special iterative methods such as Conjugate
Gradients.

5.1 Penalty and Barrier Functions

From an historical perspective, almost all general purpose NLP solvers until
about 1996 aimed to promote global convergence by constructing an auxil-
iary function from f (x) and c(x) known variously as a penalty, barrier, or
merit function. In the earlier days, the idea was to apply successful existing
techniques for unconstrained minimization to the auxiliary function, in such
a way as to find the solution of the NLP problem. Later, there came the idea
of using the auxiliary function to decide whether or not to accept the step
given by the SQP method, hence the term merit function.
For an ENLP problem, an early idea was the Courant [9] penalty function


\[
\phi(x; \sigma) = f(x) + \tfrac12 \sigma c^T c = f(x) + \tfrac12 \sigma \sum_{i=1}^m c_i^2 \quad \text{where } c = c(x),
\tag{5.1}
\]

where σ > 0 is a parameter. The cT c term ‘penalizes’ points which violate
the constraints, and σ determines the strength of the penalty. Let a min-
imizer, x(σ) say, of φ(x; σ) be found by some technique for unconstrained
minimization. Then we choose a sequence of values of σ → ∞, for example
σ = {1, 10, 100, . . .} and observe the behaviour of x(σ). Under some assump-
tions, it can be shown that the NLP solution is given by x∗ = limσ→∞ x(σ).
A simple modification for the NLP problem is to include terms (min(ci , 0))2
in the summation for any inequality constraints ci (x) ≥ 0. In practice the
methods are slow as compared with SQP techniques, and more seriously,
suffer serious effects due to ill-conditioning of the Hessian matrix of φ as
σ → ∞.
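Both effects can be seen in closed form on an assumed one-dimensional toy problem: minimize f(x) = x subject to x − 1 = 0. Here φ = x + ½σ(x − 1)² is minimized at x(σ) = 1 − 1/σ, so the error decays only like 1/σ, while the curvature φ'' = σ blows up:

```python
# 1-D illustration (assumed toy problem): minimize f(x) = x s.t. x - 1 = 0.
# phi = x + 0.5*sigma*(x-1)^2 has the closed-form minimizer
# x(sigma) = 1 - 1/sigma, so |x(sigma) - x*| = 1/sigma decays slowly as
# sigma grows, while phi'' = sigma -> infinity (the ill-conditioning).
for sigma in (1.0, 10.0, 100.0, 1000.0):
    x_sigma = 1.0 - 1.0 / sigma
    print(sigma, x_sigma, abs(x_sigma - 1.0))
```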
Another early idea for NLP with inequality constraints was the Frisch [24]
log function

\[
\phi(x; \mu) = f(x) - \mu \sum_{i=1}^m \log_e c_i(x),
\tag{5.2}
\]

where μ > 0 is a parameter. In this case we require x to lie in the interior of
the feasible region where ci (x) > 0, i = 1, 2, . . . , m, so that the log terms are
well defined. Then each term − loge ci (x) approaches +∞ as x approaches the
boundary of the feasible region, and creates a ‘barrier’ which prevents iterates
from escaping out of the feasible region. The parameter μ determines the
extent to which the influence of the barrier extends into the interior of the feasible
region. For any fixed value of μ we again find x(μ) to minimize φ(x; μ). Then
we choose a sequence of values of μ → 0, for example μ = {1, 1/10, 1/100, . . .}
and observe the behaviour of x(μ). Under some assumptions, it can be shown
that x∗ = limμ→0 x(μ). Again the methods are slow and suffer from ill-
conditioning of the Hessian. Moreover the need to find a strictly feasible
starting point is usually a difficult problem in its own right. More recently
the log barrier function has been used as a merit function in conjunction with
interior point methods.
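The barrier analogue of the previous closed-form illustration (again an assumed toy problem): minimize f(x) = x subject to x ≥ 0, with solution x∗ = 0. Then φ = x − μ log x is minimized at x(μ) = μ, which approaches x∗ only as μ → 0, while the curvature φ''(x(μ)) = 1/μ blows up:

```python
# 1-D barrier illustration (assumed toy problem): minimize f(x) = x
# subject to x >= 0.  phi = x - mu*log(x) has phi' = 1 - mu/x, so the
# minimizer is x(mu) = mu, and phi''(x(mu)) = mu/x(mu)^2 = 1/mu grows
# without bound as mu -> 0 -- the same ill-conditioning as for (5.1).
for mu in (1.0, 0.1, 0.01, 0.001):
    x_mu = mu
    print(mu, x_mu, mu / x_mu ** 2)   # minimizer and its curvature 1/mu
```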

5.2 Multiplier Penalty and Barrier Functions

From around 1969 onwards, a more satisfactory class of auxiliary function
came into use, involving an additional parameter λ ∈ IRm which could be
interpreted as an estimate of the Lagrange multiplier vector. For ENLP the
Lagrangian function is augmented with a penalty term giving

\[
\phi(x; \lambda, \sigma) = f(x) - \lambda^T c + \tfrac12 \sigma c^T c \quad \text{where } c = c(x),
\tag{5.3}
\]

sometimes referred to as the augmented Lagrangian function (Hestenes [30],
Powell [40]). We note that

\[
\nabla_x \phi = g - A\lambda + \sigma A c,
\tag{5.4}
\]

so that ∇φ(x∗ ; λ∗ , σ) = 0. Moreover if the local solution is regular and
satisfies second order sufficient conditions, and if σ is sufficiently large, then
∇2 φ(x∗ ; λ∗ , σ) is positive definite (see [13]). Thus, if λ∗ is known, the solution
of the ENLP can be found by minimizing φ(x; λ∗ , σ), without the need to
drive σ to infinity, in contrast to the methods of the previous subsection, and
this avoids the worst effects of the ill-conditioning. Unfortunately λ∗ is not
known a-priori, so a sequential minimization technique must still be used. In
an outer iteration a sequence of parameters λ(k) → λ∗ is chosen, whilst in
the inner iteration, a minimizer x(λ(k) , σ) of φ(x; λ(k) , σ) is found by means
of an unconstrained minimization technique. The behaviour of x(λ(k) , σ) is
observed as λ(k) changes. If the iteration is not converging to x∗ , it may be
necessary to increase σ at some stage. To update λ(k) , a formula

\[
\lambda^{(k+1)} = \lambda^{(k)} - (A^T W^{-1} A)^{-1} c^{(k)}
\tag{5.5}
\]

derived from the SQP iteration formula (3.20) may be used. For large σ this
may be approximated by the scheme

\[
\lambda^{(k+1)} = \lambda^{(k)} - \sigma c^{(k)},
\tag{5.6}
\]

see [13].
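The outer/inner structure can be sketched in a few lines. The following Python fragment uses an assumed one-dimensional toy ENLP (not from the text): minimize f(x) = (x − 2)² subject to c(x) = x² − 1 = 0, whose solution is x∗ = 1 with multiplier λ∗ = −1 (from g = Aλ, i.e. 2(x − 2) = 2xλ at x = 1). The inner iteration is Newton's method on φ'(x) = 0 for fixed (λ, σ), and the outer iteration is the first-order update (5.6):

```python
# Augmented Lagrangian sketch on an assumed 1-D toy ENLP:
#   minimize f(x) = (x-2)^2   subject to  c(x) = x^2 - 1 = 0,
# with solution x* = 1, lambda* = -1.  For fixed (lam, sigma),
#   phi  = (x-2)^2 - lam*(x^2-1) + 0.5*sigma*(x^2-1)^2
#   phi' = 2(x-2) - 2*lam*x + 2*sigma*x*(x^2-1)
#   phi''= 2 - 2*lam + sigma*(6x^2 - 2)

def aug_lag(sigma=10.0, outer=10, inner=30):
    x, lam = 1.0, 0.0
    for _ in range(outer):
        for _ in range(inner):                 # minimize phi(.; lam, sigma)
            c = x * x - 1.0
            dphi = 2.0 * (x - 2.0) - 2.0 * lam * x + 2.0 * sigma * x * c
            d2phi = 2.0 - 2.0 * lam + sigma * (6.0 * x * x - 2.0)
            x -= dphi / d2phi                  # Newton step on phi'
        lam -= sigma * (x * x - 1.0)           # multiplier update (5.6)
    return x, lam

x, lam = aug_lag()
print(x, lam)    # -> approximately (1.0, -1.0)
```

Notice that σ stays fixed at a moderate value throughout: the convergence is driven by λ(k) → λ∗, and each outer iteration reduces the multiplier error by a factor of roughly 1/σ, illustrating the linear (rather than Newton-like) behaviour of this class of methods.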
A multiplier penalty function for the NLP (1.1) with inequality con-
straints is

\[
\phi(x; \lambda, \sigma) = f(x) + \sum_i
\begin{cases}
-\lambda_i c_i + \tfrac12 \sigma c_i^2 & \text{if } c_i \le \lambda_i/\sigma \\[2pt]
-\tfrac12 \lambda_i^2/\sigma & \text{if } c_i \ge \lambda_i/\sigma
\end{cases}
\tag{5.7}
\]
suggested by Rockafellar [44]. The piecewise term does not cause any dis-
continuity in first derivatives, and any second derivative discontinuities occur
away from the solution. Otherwise its use is similar to that in (5.3) above.
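The continuity claim for (5.7) is easy to verify directly: at the switch point cᵢ = λᵢ/σ the two branches of the piecewise term agree in both value and first derivative. A quick numerical check (the parameter values are arbitrary):

```python
# C^1 matching of the piecewise term in (5.7) at c = lambda/sigma
# (arbitrary parameter values; branch 2 is constant in c, so its
# derivative is identically zero).
lam, sigma = 0.7, 3.0
c0 = lam / sigma
branch1 = -lam * c0 + 0.5 * sigma * c0 * c0   # applies for c <= lam/sigma
branch2 = -0.5 * lam * lam / sigma            # applies for c >= lam/sigma
dbranch1 = -lam + sigma * c0                  # d/dc of branch 1 at c0
print(branch1, branch2, dbranch1)   # equal values; slope 0 matches branch 2
```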
A multiplier based modification of the Frisch barrier function due to Polyak
[39] is

\[
\phi(x; \lambda, \mu) = f(x) - \mu \sum_{i=1}^m \lambda_i \log_e (c_i/\mu + 1),
\tag{5.8}
\]

in which the boundary occurs where ci = −μ, which is strictly outside the
feasible region. Thus the discontinuity of the Frisch function at the solution
x∗ is moved away into the infeasible region. We note that

\[
\nabla_x \phi(x^*; \lambda^*, \mu) = g^* - \mu \sum_{i=1}^m \frac{\lambda_i^* a_i^*/\mu}{(c_i^*/\mu + 1)} = 0
\]

using KT conditions (including complementarity). If the solution x∗ is reg-
ular, and sufficient conditions hold, with strict complementarity, then it can
also be shown that ∇2x φ(x∗ ; λ∗ , μ) is positive definite if μ is sufficiently small.
Thus a suitable fixed value of μ > 0 can be found and the worst effects of

ill-conditioning are avoided. Both the Rockafellar and Polyak functions are
used in a sequential manner with an outer iteration in which λ(k) → λ∗ .
All the above proposals involve sequential unconstrained minimization,
and as such, are inherently less effective than the SQP method, particu-
larly in regard to the rapidity of local convergence. Errors in λ(k) induce
errors of similar order in x(λ(k) ), which is not the case for the SQP method.
Many other auxiliary functions have been suggested for solving NLP or ENLP
problems in ways related to the above. In a later section we shall investigate
so-called exact penalty functions which avoid the sequential unconstrained
minimization aspect.
A major initiative to provide robust and effective software with large scale
capability based on the augmented Lagrangian function was the LANCELOT
code of Conn, Gould and Toint (see [8] for references). The code applies to
an NLP in the form

    minimize_{x∈IR^n}   f(x)
    subject to   c(x) = 0                                         (5.9)
                 l ≤ x ≤ u,
which treats simple bounds explicitly but assumes that slack variables have
been added to any other inequality constraints. In the inner iteration, the aug-
mented Lagrangian function (5.3) is minimized subject to the simple bounds
l ≤ x ≤ u on the variables. A potential disadvantage of this approach for
large scale computation is that the Hessian ∇2x φ(x; λ(k) , σ) is likely to be
much less sparse than the Hessian W (k) in the SQP method. To avoid this
difficulty, LANCELOT uses a simple bound minimization technique based on
the use of the preconditioned conjugate gradient method, and solves the sub-
problem to lower accuracy when λ(k) is inaccurate. It also uses an innovative
idea of building the Hessian from a sum of elementary Hessians through the
concept of group partial separability. LANCELOT has been successfully used
to solve problems with upwards of 10^4 variables, particularly those with large
dimensional null spaces arising for example from the discretization of a par-
tial differential equation. It is less effective for problems with low dimensional
null spaces and does not take advantage of any linear constraints.
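The flavour of the bound-constrained inner iteration can be conveyed by a projected-gradient sketch (LANCELOT itself uses a far more sophisticated preconditioned conjugate gradient method; this is only an illustration):

```python
def project(x, l, u):
    # Componentwise projection onto the box l <= x <= u
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, l, u)]

def projected_gradient(grad, x, l, u, lr=0.1, iters=500):
    # Steepest-descent steps followed by projection onto the bounds
    for _ in range(iters):
        g = grad(x)
        x = project([xi - lr * gi for xi, gi in zip(x, g)], l, u)
    return x

# Example: minimize (x - 2)^2 subject to 0 <= x <= 1; the minimizer is the bound x = 1
x = projected_gradient(lambda v: [2.0 * (v[0] - 2.0)], [0.0], [0.0], [1.0])
```

The projection makes each iterate respect l ≤ x ≤ u exactly, which is what allows the bounds to be treated explicitly rather than penalized.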

5.3 Augmented Lagrangians with SQP

A fruitful way to take advantage of linear constraints has been to merge the
globalization aspect of the augmented Lagrangian function with the rapid
local convergence of the SQP method. This was the motivation of the very
successful MINOS code of Murtagh and Saunders [35], which was arguably
the first SQP-like NLP solver with large scale capability. In fact MINOS was
influenced by a method due to Robinson [43] which is not an SQP method
in the strict sense described above, but is closely related to it. We describe
Robinson’s method in the context of the NLP problem (1.1). The method
generates a sequence of major iterates x(k), λ(k) and solves the LCP problem

    LCP(k):   minimize_{x∈IR^n}   f(x) − λ(k)T (c(x) − s)
              subject to   s = c(k) + A(k)T (x − x(k)) ≥ 0,
where s = s(x, x(k) ) = c(k) + A(k)T (x − x(k) ) is the first order Taylor series
approximation to c(x) about the current point, and the quantity c(x)−s may
be thought of as the deviation from linearity. The solution and multipliers
of LCP(k) then become the iterates x(k+1) , λ(k+1) for the next major itera-
tion. The method differs from SQP, firstly in that LCP(k) cannot be solved
finitely, which is a disadvantage, and secondly that second derivatives are
not required, which is an advantage. Robinson intended that LCP(k) should
be solved by a reduced Hessian quasi-Newton method. If a Taylor expansion
of the objective function in LCP(k) about the current point is made, then
it agrees with that of SQP(k) up to and including second order terms. Also
the method has the same fixed point property as SQP that if x(k) , λ(k) is
equal to x∗ , λ∗ , then x∗ , λ∗ is the next iterate, and the process terminates.
Consequently the method has the same rapid local convergence properties
as the SQP method, assuming that the LCP(k) subproblem is solved suffi-
ciently accurately. However there is no global convergence result available,
for example there is no mechanism to force the iterates x(k) to accumulate
at a feasible point.
The MINOS code attempts to mitigate the lack of a global convergence
property by augmenting the objective function in LCP(k) with a squared
penalty term. As with LANCELOT, the method is applicable to an NLP in
the form (5.9), and the LCP subproblem that is solved on the k-th major
iteration is

    minimize_{x∈IR^n}   f(x) − λ(k)T (c(x) − s) + ½σ(c(x) − s)T (c(x) − s)
    subject to   s = c(k) + A(k)T (x − x(k)) = 0,
                 l ≤ x ≤ u.
In the original source, MINOS refers to the active set method used to solve
this LCP subproblem, and MINOS/AUGMENTED refers to the major iter-
ative procedure for solving (5.9). However it is more usual now to refer to
the NLP solver by MINOS. The code has sparse matrix facilities, and also
allows ‘linear variables’ to be designated, so allowing the use of a smaller Hes-
sian approximation. MINOS was probably the first SQP-type code with the
capability to solve large scale problems, and as such has been very successful
and is still in use.
A development of MINOS is the SNOPT code of Gill, Murray and
Saunders [25] which first appeared in about 1992. In place of an LCP subprob-
lem it solves a QP subproblem, but using an approximate Hessian matrix.
Slack variables s are explicitly included, and each iteration involves the solu-
tion of a line search subproblem based on the MINOS augmented Lagrangian

    φ(x, s, λ) = f(x) − λT (c(x) − s) + ½(c(x) − s)T D(c(x) − s),        (5.10)

where D = diag{σi} is a diagonal matrix of penalty parameters. The entire
triple x(k), s(k), λ(k) is varied in the line search. Various treatments of the
Hessian approximation are possible, depending for example on the size of
the problem. Another difference from MINOS is the use of ‘elastic mode’
(essentially the l1 penalty function of the next subsection) to resolve significant
deviations from feasibility. It is impossible here to do justice to all the
features of the code, and the reader is referred to the comprehensive descrip-
tion in [25], although it is quite likely that further development of the code
has taken place. For NLP problems in which the null space dimension is not
too large, up to 1000 say, SNOPT is currently amongst the best
available NLP solvers (see Byrd, Gould, Nocedal and Waltz [3]).

5.4 The l1 Exact Penalty Function

The penalty and barrier functions in Sections 5.1 and 5.2 are inherently se-
quential, that is the solution of the NLP problem is obtained by a sequence of
unconstrained minimization calculations. It is however possible to construct
a so-called exact penalty function, that is a penalty function of which x∗ , a
solution of the NLP problem, is a local minimizer. It is convenient here to
consider an NLP problem in the form

    minimize_{x∈IR^n}   f(x)
    subject to   c(x) ≤ 0.                                        (5.11)
The most well known exact penalty function (Pietrzykowski [38], see [13]
for more references) is the l1 exact penalty function (l1 EPF)

    φ(x; σ) = f(x) + σ‖c+(x)‖1 = f(x) + σ Σ_{i=1}^m c+i(x),       (5.12)

where c+i = max(ci, 0) is the amount by which the i-th constraint is violated.
The parameter σ controls the strength of the penalty.
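In code, (5.12) is essentially a one-liner (a Python sketch; f and c are whatever the caller supplies for the form (5.11)):

```python
def l1_penalty(x, f, c, sigma):
    # l1 exact penalty function (5.12) for constraints c(x) <= 0
    return f(x) + sigma * sum(max(ci, 0.0) for ci in c(x))

# Example: f(x) = x^2 with the two constraints 1 - x <= 0 and x - 3 <= 0
f = lambda x: x[0] ** 2
c = lambda x: [1.0 - x[0], x[0] - 3.0]
inside = l1_penalty([2.0], f, c, 3.0)    # 4.0: feasible, so just f
outside = l1_penalty([0.0], f, c, 3.0)   # 3.0: f = 0 plus sigma times violation 1
```

Only violated constraints contribute, which is the source of the derivative discontinuity discussed next.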
First we consider optimality conditions for a local minimizer of φ(x, σ).
The function is nonsmooth due to the discontinuity in derivative of max(ci , 0)
at zero, so we cannot refer to the stationary point condition of Section 2.2.
In fact we proceed very much as in Section 4.2, by defining the set of active
constraints as in (4.5), and the set of infeasible constraints

I(x) = {i | ci (x) > 0}. (5.13)

Infeasible constraints at x∗ are denoted by the set I ∗ and are assigned a
multiplier λ∗i = σ. Constraints that are neither active nor infeasible at x∗
play no part in the conditions and can be ignored. As before we assign them
a multiplier λ∗i = 0. We observe that if x∗ minimizes φ(x, σ), then it must
solve the problem

    minimize_{x∈IR^n}   f(x) + σ Σ_{i=1}^m c+i(x)                 (5.14)
    subject to   ci(x) ≤ 0,   i ∈ A∗,

and hence the problem

    minimize_{x∈IR^n}   f(x) + σ Σ_{i∈I∗} ci(x)                   (5.15)
    subject to   ci(x) ≤ 0,   i ∈ A∗,

since the penalty term comes only from the infeasible constraints. Therefore
if we make the same regularity assumption that the vectors a∗i , i ∈ A∗ are
linearly independent, then the KT conditions for this problem are also nec-
essary for a minimizer of (5.12). Moreover, if we perturb the right hand side
of a constraint i ∈ A∗ in (5.15) by a sufficiently small εi > 0, εj = 0, j = i,
we make constraint i infeasible, but do not change the status of any other
constraints. This causes an increase of σεi in the penalty term. Moreover
the change in f(x) + σ Σ_{i∈I∗} ci(x) to first order is −λ∗i εi . (The negative sign
holds because of the sign change in (5.11)). Hence the change in φ(x, σ) to
first order is εi (σ − λ∗i ). If λ∗i > σ then φ is reduced by the perturbation,
which contradicts the optimality of x∗ in the l1 EPF. Thus the condition
λ∗i ≤ σ, i ∈ A∗ is also necessary. This result tells us that unless the penalty
parameter σ is sufficiently large, a local minimizer will not be created. We
can therefore summarize the first order necessary conditions as

    g∗ + σ Σ_{i∈I∗} a∗i + Σ_{i∈A∗} a∗i λ∗i = g∗ + Σ_{i=1}^m a∗i λ∗i = g∗ + A∗λ∗ = 0        (5.16)

    0 ≤ λ∗i ≤ σ
    c∗i < 0 ⇒ λ∗i = 0         i = 1, 2, . . . , m.                 (5.17)
    c∗i > 0 ⇒ λ∗i = σ
If ⇒ in (5.17) can be replaced by ⇔ then strict complementarity is said to
hold. Second order necessary conditions are the same as for (5.15), that is
Z ∗T W ∗ Z ∗ is positive semi-definite. Sufficient are that first order conditions
hold, with strict complementarity, and Z ∗T W ∗ Z ∗ is positive definite.
We see that these conditions are very similar to those for solving the NLP
problem (5.11). Consequently there is a strong correlation between local min-
imizers of the l1 EPF and local solutions of the NLP problem, which justifies
the practical use of the l1 EPF as a means of solving an NLP problem. To
be precise, assume that the appropriate second order sufficient conditions
and strict complementarity hold. Then if x∗ solves the NLP problem and
σ > λ∗ ∞ , it follows that x∗ is a local minimizer of the l1 EPF. Conversely,
if x∗ is a local minimizer of the l1 EPF, and x∗ is feasible in the NLP problem,
then x∗ is a local solution of the NLP problem.
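A one-dimensional example (constructed here for illustration) makes the threshold σ > ‖λ∗‖∞ concrete. For minimize x² subject to 1 − x ≤ 0 we have x∗ = 1 and, from 2x∗ − λ∗ = 0, λ∗ = 2; hence φ(x; σ) = x² + σ max(1 − x, 0) has its minimizer at x∗ only for σ > 2, and at the infeasible point x = σ/2 otherwise. A crude grid search confirms this:

```python
def phi(x, sigma):
    # l1 EPF for: minimize x^2 subject to 1 - x <= 0 (i.e. x >= 1)
    return x * x + sigma * max(1.0 - x, 0.0)

def argmin_on_grid(sigma, lo=-1.0, hi=2.0, n=30001):
    # Crude global minimization of phi over a fine grid
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(xs, key=lambda x: phi(x, sigma))

x_small = argmin_on_grid(1.0)   # sigma < lam* = 2: minimizer 0.5, infeasible
x_large = argmin_on_grid(3.0)   # sigma > lam* = 2: minimizer x* = 1
```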

5.5 SQP with the l1 EPF

The SQP method first came into prominent use when used in conjunction
with the l1 EPF as suggested by Han [29] and Powell [41]. The vector d(k)
generated by the SQP subproblem QP(k) is regarded as a direction of search,
and the l1 EPF is used as a merit function, so that the next iterate is x(k+1) =
x(k) + α(k) d(k) , with α(k) being chosen to obtain a sufficient reduction in
φ(x, σ). For this to be possible requires that d(k) is a descent direction at x(k)
for φ(x, σ). If the Hessian W (k) (or its approximation) is positive definite,
it is possible to ensure that this is the case, if necessary by increasing σ.
Early results with this technique were quite promising, when compared with
sequential unconstrained penalty and barrier methods. However the use of
a nonsmooth merit function is not without its difficulties. In particular the
discontinuities in derivative cause ‘curved valleys’, with sides whose steepness
depends on the size of σ. If σ is large, the requirement to monotonically
improve φ on every iteration can only be achieved by taking correspondingly
small steps, leading to slow convergence. Unfortunately, increasing σ to obtain
descent exacerbates this situation.
A way round this is the Sequential l1 Quadratic Programming (Sl1 QP)
method of Fletcher [11]. The idea (also applicable to an l∞ exact penalty
function) is to solve a subproblem which more closely models the l1 EPF,
by moving the linearized constraints into the objective function, in an l1
penalty term. Thus the l1 QP subproblem is

    minimize_{d∈IR^n}   g(k)T d + ½dT W(k) d + σ‖(c(k) + A(k)T d)+‖1        (5.18)
    subject to   ‖d‖∞ ≤ ρ.

It is necessary that σ is sufficiently large as discussed in Section 5.4, and
also below. The restriction on d is the trust region constraint and ρ is

the trust region radius. Solving the subproblem ensures descent, and quite
strong results regarding global convergence can be proved by using standard
trust region ideas. The use of an l∞ trust region (a ‘box constraint’) fits
conveniently into a QP type framework.
Even so, there are still some issues to be resolved. Firstly, (5.18) is not
a QP in standard form due to the presence of l1 terms in the objective,
although it is still a problem that can be solved in a finite number of steps.
Ideally a special purpose l1 QP solver with sparse matrix capabilities would
be used. This would enable an efficient l1 piecewise quadratic line search to
be used within the solver. Unfortunately a fully developed code of this type
is not easy to come by. The alternative is to transform (5.18) into a regular
QP by the addition of extra variables. For example a constraint l ≤ c(x) ≤ u
can be written as l ≤ c(x) − v + w ≤ u where v ≥ 0 and w ≥ 0 are auxiliary
variables, and a penalty term of the form σ‖v + w‖1 , which is linear in
v and w, would then be appropriate. However 2m extra variables need be
added, which is cumbersome, and the benefit of the piecewise quadratic line
search is not obtained. A related idea is to use the SLP-EQP idea of Fletcher
and Sainz de la Maza [22], referred to in Section 4.5. In this case an l1 LP
subproblem would be used to find an active set and multipliers, followed by
an EQP calculation to obtain the step d(k) . As above, the l1 LP subproblem
can be converted to an LP problem by the addition of extra variables, and
this allows fast large scale LP software to be used.
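For reference, the analogous transformation of (5.18) itself for the one-sided constraints of (5.11) can be written with a single hypothetical vector t of elastic variables, where e = (1, . . . , 1)T (a standard reformulation, not a construction taken from the text above):

```latex
\min_{d,\,t}\;\; g^{(k)T} d + \tfrac12 d^T W^{(k)} d + \sigma e^T t
\quad \text{subject to} \quad c^{(k)} + A^{(k)T} d \le t, \quad t \ge 0, \quad \|d\|_\infty \le \rho .
```

Only m extra variables are needed in this one-sided case; the count of 2m quoted above refers to the two-sided form l ≤ c(x) ≤ u.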
It is also not easy for the user to choose a satisfactory value of σ. If
it is chosen too small, then a local minimizer may not be created, if too
large then the difficulties referred to above become apparent. There is also a
possibility of the Maratos effect [34] occurring, in which, close to the solution,
the Newton-type step given by the SQP method increases φ and cannot be
accepted if monotonic improvement in φ is sought. Thus the expected rapid
local convergence is not realised. More recently, ideas for circumventing these
difficulties have been suggested, including second order corrections [12], the
watchdog technique [7], and a non-monotonic line search [27].

6 Filter Methods

Filter methods were introduced in response to the perceived difficulties in
using penalty function methods for globalization, that is the difficulty of
choosing suitable penalty parameters, the inefficiency of sequential methods,
and the slow convergence associated with monotonic minimization methods,
particularly in the case of nonsmooth exact penalty functions. Along with
this is the observation that the basic SQP method is able to quickly solve
a significant proportion of test problems without the need for modifications
to induce global convergence. Thus the goal of filter methods is to provide
global optimization safeguards that allow the full SQP step to be taken much
more often. In this section we describe the main ideas and possible pitfalls,
and discuss the way in which a global convergence result for a filter method
has been constructed.
A penalty function is an artefact to combine two competing aims in NLP,
namely the minimization of f (x) and the need to obtain feasibility with
respect to the constraints. The latter aim can equivalently be expressed as the
minimization of some measure h(c(x)) of constraint violation. For example,
in the context of (5.11) we could define h(c) = c+  in some convenient norm.
Thus, in a filter method, we view NLP as the resolution of two competing
aims of minimizing f (x) and h(c(x)). This is the type of situation addressed
by Pareto (multi-objective) optimization, but in our context the minimization
of h(c(x)) has priority, in that it is essential to find a Pareto solution that
corresponds to a feasible point. However it is useful to borrow the concept of
domination from multi-objective optimization. Let x(k) and x(l) be two points
generated during the progress of some method. We say that x(k) dominates
x(l) if and only if h(k) ≤ h(l) and f (k) ≤ f (l) . That is to say, there is no
reason to prefer x(l) on the basis of either measure. Now we define a filter
to be a list of pairs (h(k) , f (k) ) such that no pair dominates any other. As
the algorithm progresses, a filter is built up from all the points that have
been sampled by the algorithm. A typical filter is shown in Figure 2, where
the shaded region shows the region dominated by the filter entries (the outer
vertices of this shaded region). The contours of the l1 exact penalty function
would be straight lines with slope −σ on this plot, indicating that at least
for a single entry, the filter provides a less restrictive acceptance condition
than the penalty function.

Fig. 2 A Typical Filter Diagram (f(x) plotted against h(x); U marks the upper bound on constraint violation)
6.1 SQP Filter Methods

Filter methods were first introduced in the context of trust region SQP meth-
ods in 1997 by Fletcher and Leyffer [18], making use of a subproblem

    QP(k)(ρ):   minimize_{d∈IR^n}   ½dT W(k) d + dT g(k)
                subject to   c(k) + A(k)T d ≥ 0,
                             ‖d‖∞ ≤ ρ,
obtained by adding a trust region constraint to QP(k) . A trust region (see
[13]) is a heuristic aimed at restricting the step size to lie in a region in which
there is ‘adequate’ agreement between the true functions and their Taylor
series approximations. The trust region radius ρ is adjusted during or after
each iteration to achieve this aim.
A first attempt at a filter algorithm might go as follows. On iteration k = 1
the filter F (1) is empty. On iteration k we solve QP(k) (ρ) giving a step d and
evaluate f = f (x(k) + d) and h = h(c(x(k) + d)). If the resulting pair (h, f )
is acceptable to F (k) (that is, it is not dominated by any of the entries in
F (k) ), then we update x(k) and λ(k) as described in Section 4.4. We also
update the filter, adding the new pair (h, f ) and removing any entries that
are dominated by it. Possibly we might also increase the trust region radius.
If, on the other hand, the pair is not acceptable, then we reject it, reduce the
trust region radius, and re-solve QP(k) (ρ).
There are various ways in which this simple approach can fail, or become
unsatisfactory. One is that unacceptably large violations in the constraints
may occur. This is readily handled by imposing an upper bound U on con-
straint violations, and initializing F (1) to (U, −∞) (see Figure 2). More
serious is the possibility that, if the current point x(k) is infeasible (h(k) > 0),
and if ρ is reduced sufficiently in the algorithm, then the constraints of
QP(k) (ρ) can become incompatible, and the algorithm stalls. In this case our
approach has been to enter a feasibility restoration phase, in which a different
SQP-like algorithm (see Fletcher and Leyffer [19] for a filter-like approach) is
invoked to find a new acceptable point x(k+1) for which the TR subproblem
is solvable. Of course, we cannot predict the effect on f (x) and we must be
prepared for it to increase. Another feature that the algorithm lacks is any
sense of sufficient reduction in either f or h, as is used in other convergence
proofs. For instance we would not like new pairs to become arbitrarily close
to existing filter entries, because this might allow convergence to a non-KT
point. With this in mind we strengthen the acceptance condition by adding
a filter envelope around the current filter entries, so as to extend the set of
unacceptable points (see Figure 2). Most recently we use the sloping envelope
of Chin [5], Chin and Fletcher [6], in which acceptability of a pair (h, f ) with
respect to a filter F is defined by

h ≤ βhi or f ≤ fi − γh ∀(hi , fi ) ∈ F (6.1)



where β and γ are constants in (0, 1). Typical values might be β = 0.9
and γ = 0.01. (In earlier work we used f ≤ fi − γhi for the second test,
giving a rectangular envelope. However, this allows the possibility that (h, f )
dominates (hi , fi ) but the envelope of (h, f ) does not dominate the envelope
of (hi , fi ), which is undesirable.)
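The acceptance test (6.1) and the filter update can be sketched as follows (β and γ as above; the list filt of pairs (hi, fi) is the filter; a sketch only, omitting the upper bound U and the test against the current pair):

```python
BETA, GAMMA = 0.9, 0.01   # typical values suggested in the text

def acceptable(h, f, filt):
    # Sloping envelope test (6.1): improve on every entry in one measure, by a margin
    return all(h <= BETA * hi or f <= fi - GAMMA * h for hi, fi in filt)

def add_entry(h, f, filt):
    # Add (h, f) to the filter and discard any entries it dominates
    filt = [(hi, fi) for hi, fi in filt if not (h <= hi and f <= fi)]
    filt.append((h, f))
    return filt
```

For example, with filter entries (1.0, 5.0) and (0.5, 7.0), the pair (0.4, 10.0) is acceptable (it nearly halves h), while (0.95, 6.0) is not, since it improves on neither measure of the first entry by the required margin.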
During testing of the filter algorithm, another less obvious disadvantage
became apparent. Say the current filter contains an entry (0, fi ) where fi is
relatively small. If, subsequently, feasibility restoration is invoked, it may be
impossible to find an acceptable point which is not dominated by (0, fi ). Most
likely, the feasibility restoration phase then converges to a feasible point that
is not a KT point. We refer to (0, fi ) as a blocking entry. We were faced with
two possibilities. One is to allow the removal of blocking entries on emerging
from feasibility restoration. This we implemented in the first code, reported
by Fletcher and Leyffer [18]. To avoid the possibility of cycling, we reduce the
upper bound when a blocking entry is removed. We did not attempt to pro-
vide a global convergence proof for this code, although it may well be possible
to do so. Subsequently it became clear that other heuristics in the code were
redundant and further work resulted in a related filter algorithm (Fletcher,
Leyffer and Toint [20]) for which a convergence proof can be given. In this
algorithm we resolve the difficulty over blocking by not including all accepted
points in the filter. This work is described in the next section. However, the
earlier code proved very robust, and has seen widespread use. It shows up
quite well on the numbers of function and derivative counts required to solve
a problem, in comparison say with SNOPT. Actual computing times are less
competitive, probably because the QP solver used by SNOPT is more effi-
cient. It has the same disadvantage as SNOPT that it is inefficient for large
null space problems. Otherwise, good results were obtained in comparison
with LANCELOT and an implementation of the l1 EPF method.

6.2 A Filter Convergence Proof

In this section an outline is given of the way in which a global convergence
proof has been developed for the trust region filter SQP method. The theory
has two aspects: how to force h(k) → 0, and how to minimize f (x) subject to
h(c(x)) = 0.
First we review existing trust region SQP convergence theory for uncon-
strained optimization. At any non-KT point x(k) and radius ρ in QP(k) (ρ),
we define the predicted reduction

    Δq = q(0) − q(d) = −g(k)T d − ½dT W(k) d > 0                   (6.2)



and the actual reduction

Δf = f (k) − f (x(k) + d). (6.3)

As ρ → 0, and with suitable assumptions, so d → 0, Δq ∼ ρ‖g(k)‖ and
Δf /Δq → 1. It follows for sufficiently small ρ that the inequality

Δf ≥ σΔq (6.4)

is valid. This is referred to as the sufficient reduction condition. In the TR
algorithm of Figure 4 (with the feasibility restoration and filter boxes stripped
out) we essentially choose ρ(k) as large as possible (within a factor of 2)
subject to (6.4) holding. If the gradients g(k) are accumulating at a value
g∞ ≠ 0, then ρ(k) is uniformly bounded away from zero, and it follows that
Δf (k) ≥ σΔq (k) ∼ σρ(k) . Summing over all k shows that f (k) → −∞ which
is a contradiction. Thus the gradients can only accumulate at 0 and any
accumulation point x∞ is stationary.
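The mechanism is easily illustrated in one dimension (an assumed toy objective; a Newton model step clipped to the trust region, accepted only under the sufficient reduction test (6.4), with ρ doubled on acceptance and halved on rejection):

```python
def tr_newton(f, f1, f2, x, rho=1.0, sig=0.1, iters=100):
    # 1-D trust-region sketch with the sufficient reduction test (6.4)
    for _ in range(iters):
        g, H = f1(x), f2(x)
        if abs(g) < 1e-10:
            break
        d = -g / H if H > 0 else (-rho if g > 0 else rho)
        d = max(-rho, min(rho, d))            # trust region clip
        dq = -g * d - 0.5 * H * d * d         # predicted reduction (6.2)
        df = f(x) - f(x + d)                  # actual reduction (6.3)
        if dq > 0 and df >= sig * dq:         # sufficient reduction (6.4)
            x, rho = x + d, 2.0 * rho
        else:
            rho = rho / 2.0
    return x

# Example: f(x) = x^4 - 2x^2 from x = 2 converges to the minimizer x = 1
x_min = tr_newton(lambda x: x**4 - 2.0 * x**2,
                  lambda x: 4.0 * x**3 - 4.0 * x,
                  lambda x: 12.0 * x**2 - 4.0, 2.0)
```

As ρ shrinks the model becomes accurate, Δf/Δq → 1, and (6.4) eventually passes, which is the crux of the argument above.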
We aim to make use of these ideas in an NLP context. However there is a
difficulty when h(k) > 0 that Δq < 0 may be possible. This is illustrated by
the left hand diagram of Figure 3. However, with a larger trust region radius,
it is possible to have Δq > 0, as in the right hand diagram. We describe
the resulting steps respectively as being either h-type or f-type, according
to whether Δq ≤ 0 or not. Of course, if h(k) = 0 the resulting step must
be f-type. We construct our TR algorithm such that whilst f-type steps are
being taken, we make no new entries into the filter, and rely to a large extent
on the above convergence theory. Only h-type steps give rise to a filter entry
(we include the use of feasibility restoration as an h-type step). The resulting
algorithm is detailed in Figure 4. Note that the current pair (h(k) , f (k) ) is
not in the filter, and is only included subsequently if the algorithm takes an
h-type step.
We now turn to examine the way in which the slanting envelope (6.1)
operates. If an infinite sequence of entries are made, then it is a consequence

Fig. 3 Illustration of h-type and f-type steps



Initialize with a point x(1), k = 1, and only (U, −∞) in the filter.
repeat:
    initialize ρ ≥ ρ◦
    try to solve QP(k)(ρ)
    if QP(k)(ρ) is incompatible:
        include (h(k), f(k)) in the filter (h-type iteration); k := k + 1
        enter the restoration phase to find a point x(k) acceptable to the filter
            such that QP(k)(ρ̃) is compatible for some ρ̃ ≥ ρ◦; set ρ = ρ̃
            and return to solving QP(k)(ρ)
    else, with solution d:
        if d = 0 then finish (KT point)
        evaluate f(x(k) + d) and c(x(k) + d)
        if x(k) + d is not acceptable to the filter and (h(k), f(k)),
                or if Δf < σΔq and Δq > 0:
            set ρ := ρ/2 and re-solve QP(k)(ρ)
        else:
            set ρ(k) = ρ, d(k) = d, Δq(k) = Δq, Δf(k) = Δf
            if Δq(k) ≤ 0 then include (h(k), f(k)) in the filter (h-type iteration)
            set x(k+1) = x(k) + d(k); k := k + 1

Fig. 4 A Filter–SQP Algorithm

that h(k) → 0 (otherwise the condition f ≤ fi − γh has the effect of forcing
f → −∞: a contradiction). Thus the convergence proof claims that either
1. The restoration phase converges to a locally infeasible point,
2. The algorithm terminates at a KT point, or
3. There exists a feasible accumulation point that is either a KT point, or
the Mangasarian-Fromowitz constraint qualification (MFCQ) fails.
The proof proceeds as follows. The first case corresponds to the situation
that the local approximation to the constraint set is infeasible, and no fur-
ther progress can be made. So we only need to examine case 3. If there are
an infinite number of h-type iterations, it is possible to find a subsequence
on which h(k) → 0 by virtue of (6.1). By taking thinner subsequences if
necessary, we examine the behaviour of iterates in the neighbourhood of a
feasible non-KT point that satisfies MFCQ (a type of regularity condition).
Because MFCQ holds, there exist feasible descent directions and it is shown
that the TR algorithm takes f-type steps in the limit, which is a contradiction.
The only other possibility to consider is that there exists some K such that
the algorithm takes f-type steps for all k ≥ K. We can deduce from (6.1) and
the fact that (h(k+1) , f (k+1) ) is always acceptable to (h(k), f (k) ), that h(k) → 0.
Then an argument similar to that in the unconstrained case contradicts the
fact that there are feasible descent directions at any accumulation point.
Because MFCQ holds, it follows that the accumulation point is a KT point.
In passing, note that the proof of the Corollary to Lemma 1 in [20] contains
an error. A corrected version of the paper can be found on my web site.

6.3 Other Filter SQP Methods

Another filter SQP algorithm, analysed by Fletcher, Gould, Leyffer, Toint and
Wächter [16], decomposes the SQP step into a normal and tangential com-
ponent. The normal step provides feasibility for the linearized constraints
and the tangential step minimizes the quadratic model in the feasible re-
gion. Related ideas are discussed by Gonzaga, Karas and Varga [26] and
Ribeiro, Karas and Gonzaga [42]. Wächter and Biegler describe line search
filter methods in [46] and [47]. Chin [5] and Chin and Fletcher [6] consider
SLP-EQP trust region filter methods. Gould and Toint [28] present a non-
monotone filter SQP method which extends the non-monotonic properties of
filter SQP type algorithms. A review of other recent developments of filter
methods, outwith SQP, but including interior point methods for NLP, ap-
pears in the SIAG/OPT activity group newsletter (March 2007) and can be
accessed in [21].

7 Modelling Languages and NEOS

Solving complex optimization problems that arise in practice has many
difficulties. Writing your own NLP solver is not recommended, as there are
many difficulties to be circumvented, even if you have access to a good QP
solver. However, access to a fully developed production code, e.g. MINOS,
LANCELOT, etc., is not the end of the story. The interface between the
requirements of the code, and the features of the problem can be very diffi-
cult to set up. It is usually necessary to provide derivatives of any nonlinear
functions, which is prone to error. Modelling languages have been designed
to allow the user to present the problem to the NLP solver in as friendly
a way as possible. The user is able to define constructions in terms of con-
cepts familiar to the problem. Three languages come to mind, AMPL, GAMS
and TOMLAB. All are hooked up to a number of well known NLP solvers.
TOMLAB is based on MATLAB syntax, the others have their own individual
syntax that has to be learned. Unfortunately all are commercial products.
In this review I shall describe AMPL (A Mathematical Programming Lan-
guage) which I have found very flexible and easy to use. A student edition is
freely available which allows problems of up to 300 variables and constraints,
and gives access to MINOS and some other solvers. Access to AMPL for
larger problems is freely available through the so-called NEOS system, de-
scribed below in Section 7.5.

7.1 The AMPL Language

AMPL is a high level language for describing optimization problems,
submitting them to a solver, and manipulating the results. An AMPL
program has three main parts, the model, in which the problem constructs
are defined, the data which is self evident, and programming in which in-
structions for activating a solver, and displaying or manipulating the results
are carried out. A list of model, data and programming commands are pre-
pared in one or more files and presented to the AMPL system. This processes
the information, evaluates any derivatives automatically, and presents the
problem to a designated solver. Results are then returned to the user. The
AMPL system is due to Fourer, Gay and Kernighan, and is described in the
AMPL reference manual [23]. The syntax of AMPL is concisely defined in
the Appendix of [23], and provides an invaluable aid to debugging an AMPL
model. The user is strongly advised to come to terms with the notation that
is used. Other more abridged introductions to AMPL are available, as can
be found by surfing the web.
The main features of an NLP problem are the variables, presented via the
keyword var, the objective, presented via either minimize or maximize, and
constraints, presented via the keyword subject to. These constructions are

described using entities introduced by the keywords param for fixed parame-
ters, and set for multivalued set constructions. An example which illustrates
all the features is to solve the HS72 problem in CUTE. Here the model is spec-
ified in a file hs72.mod. Note that upper and lower case letters are different:
here we have written user names in upper case, but that is not necessary. All
AMPL commands are terminated by a semicolon.
hs72.mod
set ROWS = {1..2};
set COLUMNS = {1..4};
param A {ROWS, COLUMNS};
param B {ROWS};
var X {COLUMNS} >= 0.001;
minimize OBJFN: 1 + sum {j in COLUMNS} X[j];
subject to
CONSTR {i in ROWS}: sum {j in COLUMNS}
A[i,j]/X[j] <= B[i];
UBD {j in COLUMNS}: X[j] <= (5-j)*1e5;

In this model, the sets are just simple ranges, like 1..4 (i.e. 1 up to 4). We could
have shortened the program by deleting the set declaration and replacing
ROWS by 1..2 etc., in the rest of the program. But the use of ROWS and COLUMNS
is more descriptive. The program defines a vector of variables X, and the data
is the matrix A and vector B which are parameters. Simple lower bounds on
the variables are specified in the var statement, and sum is a construction
which provides summation. The constraint CONSTR implements the system of
inequalities Σ_{j=1}^4 ai,j/xj ≤ bi , i = 1, 2. Note that indices are given within
square brackets in AMPL. Constructions like j in COLUMNS are referred to
as indexing in the AMPL syntax. The constraints in UBD define upper bounds
on the variables which depend upon j. Note that the objective function and
each set of constraint functions must be given a name by the user.
The data of the problem is specified in the file hs72.dat. Note the use of
tabular presentation for elements of A and B.
hs72.dat
param A: 1 2 3 4 :=
1 4 2.25 1 0.25
2 0.16 0.36 0.64 0.64;
param B :=
1 0.0401
2 0.010085;
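Before running the model it can be useful to check the data by hand. The following Python sketch (not part of AMPL; the helper name is hypothetical) evaluates the left-hand sides of the CONSTR inequalities ∑_{j} a_{i,j}/x_j ≤ b_i at the starting point x_j = 1 used later in the AMPL session:

```python
# Evaluate the CONSTR inequalities sum_j A[i][j]/x[j] <= b[i] of HS72
# at a trial point x.  A and B hold the values from hs72.dat.
A = [[4.0, 2.25, 1.0, 0.25],
     [0.16, 0.36, 0.64, 0.64]]
B = [0.0401, 0.010085]

def constraint_values(x):
    """Left-hand sides of the two CONSTR inequalities."""
    return [sum(a / xj for a, xj in zip(row, x)) for row in A]

x0 = [1.0, 1.0, 1.0, 1.0]          # starting point, as in the AMPL session
lhs = constraint_values(x0)
feasible = all(l <= b for l, b in zip(lhs, B))
```

With all variables at 1 the left-hand sides are 7.5 and 1.8, far above the bounds 0.0401 and 0.010085, so the starting point is infeasible and the solver has to do significant work to satisfy the constraints.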
The next stage is to fire up the AMPL system on the computer. This will
result in the user receiving the AMPL prompt ampl:. The programming
session to solve HS72 would proceed as follows.
An AMPL session
ampl: model hs72.mod;
ampl: data hs72.dat;
ampl: let {j in COLUMNS} X[j] := 1;
ampl: solve;
ampl: display OBJFN;
ampl: display X;
The model and data keywords read in the model and data. The keyword
let allows assignment of initial values to the variables, and the display
commands initiate output to the terminal. Output from AMPL (not shown
here) would be interspersed between the user commands. It is also possible
to aggregate data and, if required, programming, into a single hs72.mod
file. In this case the data must follow the model, and must be preceded by
the statement data;. One feature to watch out for occurs when revising the
model. In this case, repeating the command model hs72.mod; will add the
new text to the database, rather than overwrite it. To remove the previous
model from the database, the command reset; should first be given.

7.2 Networks in AMPL

AMPL is a most convenient system for modelling networks of diverse kinds
(road, gas supply, work flow, . . .). We illustrate some of the useful AMPL
constructions in this section by reference to an electrical power network in
which the objective is to minimize the power generation required to meet
the demand on a given network. The model also illustrates some other useful
AMPL features, notably the use of dependent variables. A toy system used
by electrical engineers is illustrated in Figure 5. It is described in AMPL

Fig. 5 A Simple Power Distribution Network: generator nodes NORTH and SOUTH; consumer nodes LAKE, MAIN and ELM

A Power Generation Problem
set CONSUMERS;
set GENERATORS;
set NODES = CONSUMERS union GENERATORS;
set POWERLINES within (NODES cross NODES);
param LOADP {NODES};
param R {POWERLINES};
param X {POWERLINES};
param ZSQ {(i,j) in POWERLINES} = R[i,j]^2 + X[i,j]^2;
param C {(i,j) in POWERLINES} = R[i,j]/ZSQ[i,j];
param D {(i,j) in POWERLINES} = X[i,j]/ZSQ[i,j];
...
var V {NODES}; # Voltages
var THETA {NODES}; # Phase angles
var PG {GENERATORS} >= 0; # Power generation
...
var PSI {(i,j) in POWERLINES} = THETA[i] - THETA[j];
var P {(i,j) in POWERLINES} = C[i,j]*V[i]^2 +
V[i]*V[j]*(D[i,j]*sin(PSI[i,j])-C[i,j]*cos(PSI[i,j]));
...
minimize PGEN: sum {i in GENERATORS} PG[i];
...
subject to
EQUALPC {i in CONSUMERS}:
sum {(j,i) in POWERLINES} P[j,i]=
sum {(i,j) in POWERLINES} P[i,j] + LOADP[i];
EQUALPG {i in GENERATORS}:
PG[i] + sum {(j,i) in POWERLINES} P[j,i]=
sum {(i,j) in POWERLINES} P[i,j] + LOADP[i];
...
data;
set CONSUMERS := LAKE MAIN ELM;
set GENERATORS := NORTH SOUTH;
set POWERLINES := (NORTH,SOUTH) (NORTH,LAKE)
(SOUTH,LAKE) (SOUTH,MAIN) (SOUTH,ELM)
(LAKE,MAIN) (MAIN,ELM);
param: R X :=
NORTH SOUTH 0.02 0.06
NORTH LAKE 0.08 0.24
SOUTH LAKE 0.06 0.24
...
MAIN ELM 0.08 0.24;
param LOADP := NORTH 0, SOUTH 0.2, LAKE 0.45, MAIN 0.4, ELM 0.6;
...
by the constructions shown in the listing above. Note the use of the keyword union
to merge the nodes with and without power generation. Also observe the
use of cross, which indicates all possible connections between the nodes,
and within which indicates that the actual network is a subset of these.
In fact the operator cross has higher priority than within so the brackets
around the cross construction are not necessary. The model adopts the convention that power flows from the first node to the second node (a negative
value of flow is allowed and would indicate flow in the opposite sense). The
program also shows the use of dependent parameters and variables. Thus
the parameters ZSQ depend on R and X, and C and D both depend on R, X
and ZSQ. It is necessary that the order in which these statements are given
reflects these dependencies. The true variables in the problem (as shown here)
are V, THETA and PG. Additional variables which depend on these variables,
and also on the parameters, are PSI and P, as defined by the expressions
which follow the ‘=’ sign. The objective is to minimize the sum of gener-
ated power. Constraints include power balance constraints at consumer nodes
and generator nodes, the latter including a term for power generation. Note
the use of P[j,i] for power entering node i and P[i,j] for power exiting
node i. The program also provides a useful illustration of how to supply
data for network problems, and the use of the # sign for including com-
ments. Note also the use of standard functions such as sin and cos in the
expressions. The program shown is only part of the full model, which would
include flow of reactive power, and upper and lower bounds on various of the
variables.
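The expression defining P can be sanity-checked outside AMPL. The sketch below (hypothetical Python mirroring the model's formula, not part of the AMPL program) computes the flow on a line from R, X, the voltages and the phase difference; with equal voltages and zero phase difference the two terms cancel and no real power flows:

```python
import math

def line_flow(R, X, Vi, Vj, psi):
    """Real power flow on a line, following the model's expression
    P = C*Vi^2 + Vi*Vj*(D*sin(psi) - C*cos(psi)),
    with Z^2 = R^2 + X^2, C = R/Z^2, D = X/Z^2."""
    Zsq = R ** 2 + X ** 2
    C, D = R / Zsq, X / Zsq
    return C * Vi ** 2 + Vi * Vj * (D * math.sin(psi) - C * math.cos(psi))

# Equal voltages, zero phase difference: the two terms cancel exactly.
p0 = line_flow(0.02, 0.06, 1.0, 1.0, 0.0)
# A positive phase difference drives flow (R, X from the NORTH-SOUTH line).
p1 = line_flow(0.02, 0.06, 1.0, 1.0, 0.05)
```

This matches the physical expectation that power flows from the node with the leading phase angle towards the node with the lagging one.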

7.3 Other Useful AMPL Features

The following problem is due to Bob Vanderbei (who gives many interesting
AMPL programs: search for vanderbei princeton on Google and click on
LOQO). A rocket starts at time t = 0, position x(0) = 0 and velocity v(0) = 0.
It must reach position x(T ) = 100, also with velocity v(T ) = 0. We shall
divide the total time T into n intervals of length h and use finite difference
approximations

v_i = (x_{i+1/2} − x_{i−1/2})/h    and    a_i = (v_{i+1/2} − v_{i−1/2})/h
for velocity and acceleration. The maximum velocity is 5 units and the ac-
celeration must lie within ±1 units. The aim is to minimize the total time
T = nh. The AMPL program is
A Rocket Problem
param n > 2;
set vrange = {0.5..n-0.5 by 1};
set arange = {1..n-1};
var x {0..n}; var v {vrange} <= 5;
var a {arange} <= 1, >= -1;
var h;
minimize T: n*h;
subject to
xdiff {i in vrange}: x[i+0.5]-x[i-0.5]=h*v[i];
vdiff {i in arange}: v[i+0.5]-v[i-0.5]=h*a[i];
x0: x[0] = 0; xn: x[n] = 100;
v0: v[1.5] = 3*v[0.5]; # Implements v0 = 0
vn: v[n-1.5] = 3*v[n-0.5]; # Implements vn = 0
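The last two constraints deserve a word of explanation. Since v is only defined at the midpoints, the condition v(0) = 0 is imposed by linear extrapolation from v[0.5] and v[1.5]: v(0) ≈ 1.5 v[0.5] − 0.5 v[1.5], which vanishes exactly when v[1.5] = 3 v[0.5] (and symmetrically at t = T). A small Python check (hypothetical helper, not part of the AMPL model):

```python
def v_at_zero(v_half, v_three_half):
    """Linear extrapolation from v at t = 0.5h and t = 1.5h back to t = 0:
    v(0) ~ 1.5*v[0.5] - 0.5*v[1.5]."""
    return 1.5 * v_half - 0.5 * v_three_half

v_half = 0.7
v0 = v_at_zero(v_half, 3.0 * v_half)   # impose v[1.5] = 3*v[0.5]
vconst = v_at_zero(0.7, 0.7)           # a constant velocity extrapolates to itself
```

The first call shows that the AMPL constraint does drive the extrapolated v(0) to zero; the second confirms the extrapolation is exact for (at least) linear velocity profiles.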
The actual value of n must be supplied in the data section. An alternative
implementation could be made in which v and a are expressed as dependent
variables. Things to observe include the treatment of ranges and the use of by.
Also note the use of both upper and lower bounds on the variables, and the
beautifully concise form that AMPL permits.
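The staggered central differences used above are worth a quick numerical test. In the following Python sketch (illustrative only) we take x(t) = t², for which the exact velocity is 2t; because the trajectory is quadratic, the midpoint difference reproduces the velocity exactly:

```python
# Staggered central differences: x at the integer grid points i*h,
# v at the midpoints (i + 1/2)*h.  For x(t) = t^2 the exact velocity
# is 2t, and the midpoint difference reproduces it exactly.
n, h = 10, 0.5
x = [(i * h) ** 2 for i in range(n + 1)]
v = [(x[i + 1] - x[i]) / h for i in range(n)]
exact = [2.0 * (i + 0.5) * h for i in range(n)]
max_err = max(abs(a - b) for a, b in zip(v, exact))
```

This second-order accuracy of the staggered scheme is what makes the concise AMPL formulation adequate even for modest values of n.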
AMPL can also be very descriptive when applied to finite element dis-
cretizations. I have used the following constructions in the context of a two
dimensional p.d.e.

2-D Finite Element Constructions
set NODES;
set DIRICHLET within NODES;
set INTERIOR = NODES diff DIRICHLET;
set TRIANGLES within NODES cross NODES cross NODES;
set EDGES = setof {(i,j,k) in TRIANGLES} (i,j)
union setof {(i,j,k) in TRIANGLES} (j,k)
union setof {(i,j,k) in TRIANGLES} (k,i);
set SHAPE_FNS =
setof {(i,j,k) in TRIANGLES} (i,j,k) union
setof {(i,j,k) in TRIANGLES} (j,k,i) union
setof {(i,j,k) in TRIANGLES} (k,i,j);

This illustrates the use of three suffix quantities (TRIANGLES), and the
selection of all two suffix entities (EDGES) using setof. Shape functions are
elementary functions defined on triangles taking the value 1 at one node and
0 at the others. Note also the use of diff for set difference, and the allowed
use of underscore within an identifier.
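The effect of these setof unions is easy to reproduce in a few lines of ordinary code. The Python sketch below (hypothetical, not AMPL) forms the edge set and the cyclic rotations from a small triangle list:

```python
# Reproduce the setof constructions: directed edges and cyclic rotations
# (shape-function index triples) from a set of triangles.
triangles = {(1, 2, 3), (2, 3, 4)}

edges = ({(i, j) for (i, j, k) in triangles}
         | {(j, k) for (i, j, k) in triangles}
         | {(k, i) for (i, j, k) in triangles})

shape_fns = {rot
             for (i, j, k) in triangles
             for rot in ((i, j, k), (j, k, i), (k, i, j))}
```

Each triangle contributes three directed edges and three rotations; an edge shared by two triangles, such as (2,3) here, appears only once in the union, exactly as in the AMPL sets.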
AMPL contains many more useful constructions which we do not have space
to mention here. Purchasing a copy of the manual is essential! Worthy of
mention, however, is the existence of for and if then else constructions. This
can be very useful at the programming stage. An if then else construc-
tion is also allowed within a model but should be used with care, because it
usually creates a nonsmooth function which many methods are not designed
to handle. The same goes for abs and related nonsmooth functions. Another
useful feature for creating loops at the programming stage is the repeat
construction.

7.4 Accessing AMPL

Unfortunately the full AMPL system is a commercial product for which a
licence fee must be paid. However there is a student version of AMPL available
which is free, but restricted to no more than 300 variables and constraints. In
this section we list the steps needed to install the student version of AMPL
on a unix operating system, as follows
1. connect to www.ampl.com
2. follow the link to download the student edition, following the quick start
instructions
3. choose the architecture
4. download ampl.gz and gunzip it
5. download one or more solvers (e.g. MINOS or SNOPT).
Ideally these files should be stored in a /usr/bin area with symbolic links
that enable them to be called from other directories.

7.5 NEOS and Kestrel

NEOS (Network Enabled Optimization Server) is a system running at the
Argonne National Laboratory that solves optimization problems submitted
by anyone, through the medium of the internet. The simplest way to access
NEOS is via email. The following template, using the solver SNOPT for
example, should be sent to [email protected], after having included the
model, data, etc., where indicated.

NEOS Template for AMPL
<document>
<category>nco</category>
<solver>SNOPT</solver>
<inputMethod>AMPL</inputMethod>

<model><![CDATA[
Insert Model Here
]]></model>

<data><![CDATA[
Insert Data Here
]]></data>

<commands><![CDATA[
Insert Programming Here
]]></commands>

<comments><![CDATA[
Insert Any Comments Here
]]></comments>

</document>

An alternative approach is to use the Kestrel interface to NEOS. This
enables the remote solution of optimization problems within the AMPL and
GAMS modelling languages. Quoting from the documentation, problem gen-
eration, including the run-time detection of syntax errors, occurs on the local
machine using any available modelling language facilities. Solution takes place
on a remote machine, with the result returned in the native modelling lan-
guage format for further processing. To use Kestrel, the Kestrel interface must
be downloaded at step 5 above, using the same process as for downloading
the solvers. To initiate a solve with say SNOPT using Kestrel, the following
protocol must be initiated when using AMPL on the local machine.
Accessing NEOS from the Kestrel interface
ampl: option solver kestrel;
ampl: option kestrel options "solver=SNOPT";
An advantage of using NEOS from Kestrel (or by email as above) is that the
restriction in size no longer applies. A disadvantage is that the response of
the NEOS server can be slow at certain times of the day.

References

1. E. M. L. Beale, Numerical Methods, In: Nonlinear Programming, J. Abadie, ed., North-Holland, Amsterdam, 1967.
2. I. Bongartz, A. R. Conn, N. I. M. Gould and Ph. L. Toint, CUTE: Constrained and Unconstrained Testing Environment, ACM Trans. Math. Software, 21, 1995, pp. 123–160.
3. R. H. Byrd, N. I. M. Gould, J. Nocedal and R. A. Waltz, An Active-Set Algorithm for
Nonlinear Programming Using Linear Programming and Equality Constrained Subproblems, Math. Programming B, 100, 2004, pp. 27–48.
4. R. H. Byrd, J. Nocedal and R. B. Schnabel, Representations of quasi-Newton matrices
and their use in limited memory methods, Math. Programming, 63, 1994, pp. 129–156.
5. C. M. Chin, A new trust region based SLP filter algorithm which uses EQP active set
strategy, PhD thesis, Dept. of Mathematics, Univ. of Dundee, 2001.
6. C. M. Chin and R. Fletcher, On the global convergence of an SLP-filter algorithm that
takes EQP steps, Math. Programming, 96, 2003, pp. 161–177.
7. R. M. Chamberlain, C. Lemarechal, H. C. Pedersen and M. J. D. Powell, The watchdog
technique for forcing convergence in algorithms for constrained optimization, In: Algorithms for Constrained Minimization of Smooth Nonlinear Functions, A. G. Buckley
and J.-L. Goffin, eds., Math. Programming Studies, 16, 1982, pp. 1–17.
8. A. R. Conn, N. I. M. Gould and Ph. L. Toint, Trust Region Methods, MPS-SIAM
Series on Optimization, SIAM Publications, Philadelphia, 2000.
9. R. Courant, Variational methods for the solution of problems of equilibrium and vi-
bration, Bull. Amer. Math. Soc., 49, 1943, pp. 1–23.
10. J. E. Dennis and J. J. Moré, A characterization of superlinear convergence and its
application to quasi-Newton methods, Math. Comp., 28, 1974, pp. 549–560.
11. R. Fletcher, A model algorithm for composite nondifferentiable optimization problems,
In: Nondifferential and Variational Techniques in Optimization, D. C. Sorensen and
R. J.-B. Wets, eds., Math. Programming Studies, 17, 1982, pp. 67–76.
12. R. Fletcher, Second order corrections for nondifferentiable optimization, In: Numerical
Analysis, Dundee 1981, G. A. Watson ed., Lecture Notes in Mathematics 912, Springer-
Verlag, Berlin, 1982.
13. R. Fletcher, Practical Methods of Optimization, 1987, Wiley, Chichester.
14. R. Fletcher, Dense Factors of Sparse Matrices, In: Approximation Theory and Opti-
mization, M. D. Buhmann and A. Iserles, eds, C.U.P., Cambridge, 1997.
15. R. Fletcher, A New Low Rank Quasi-Newton Update Scheme for Nonlinear Program-
ming, In: System Modelling and Optimization, H. Futura, K. Marti and L. Pandolfi,
eds., Springer IFIP series in Computer Science, 199, 2006, pp. 275–293, Springer,
Boston.
16. R. Fletcher, N. I. M. Gould, S. Leyffer, Ph. L. Toint, and A. Wächter, Global conver-
gence of trust-region SQP-filter algorithms for general nonlinear programming, SIAM
J. Optimization, 13, 2002, pp. 635–659.
17. R. Fletcher, A. Grothey and S. Leyffer, Computing sparse Hessian and Jacobian ap-
proximations with optimal hereditary properties, In: Large-Scale Optimization with
Applications, Part II: Optimal Design and Control, L. T. Biegler, T. F. Coleman,
A. R. Conn and F. N. Santosa, eds., Springer, 1997.
18. R. Fletcher and S. Leyffer, Nonlinear programming without a penalty function, Math.
Programming, 91, 2002, pp. 239–270.
19. R. Fletcher and S. Leyffer, Filter-type algorithms for solving systems of algebraic equa-
tions and inequalities, In: G. di Pillo and A. Murli, eds, High Performance Algorithms
and Software for Nonlinear Optimization, Kluwer, 2003.
20. R. Fletcher, S. Leyffer, and Ph. L. Toint, On the global convergence of a filter-SQP
algorithm, SIAM J. Optimization, 13, 2002, pp. 44–59.
21. R. Fletcher, S. Leyffer, and Ph. L. Toint, A Brief History of Filter Methods, Preprint
ANL/MCS-P1372-0906, Argonne National Laboratory, Mathematics and Computer
Science Division, September 2006.

22. R. Fletcher and E. Sainz de la Maza, Nonlinear programming and nonsmooth
optimization by successive linear programming, Math. Programming, 43, 1989,
pp. 235–256.
23. R. Fourer, D. M. Gay and B. W. Kernighan, AMPL: A Modeling Language for Math-
ematical Programming, 2nd Edn., Duxbury Press, 2002.
24. K. R. Frisch, The logarithmic potential method of convex programming, Oslo Univ.
Inst. of Economics Memorandum, May 1955.
25. P. E. Gill, W. Murray and M. A. Saunders, SNOPT: An SQP Algorithm for Large-
Scale Constrained Optimization, SIAM Review, 47, 2005, pp. 99–131.
26. C. C. Gonzaga, E. Karas, and M. Vanti, A globally convergent filter method for non-
linear programming, SIAM J. Optimization, 14, 2003, pp. 646–669.
27. L. Grippo, F. Lampariello and S. Lucidi, A nonmonotone line search technique for
Newton’s method, SIAM J. Num. Anal., 23, 1986, pp. 707–716.
28. N. I. M. Gould and Ph. L. Toint, Global Convergence of a Non-monotone Trust-Region
SQP-Filter Algorithm for Nonlinear Programming, In: Multiscale Optimization Meth-
ods and Applications, W. W. Hager, S. J. Huang, P. M. Pardalos and O. A. Prokopyev,
eds., Springer Series on Nonconvex Optimization and Its Applications, Vol. 82, Springer
Verlag, 2006.
29. S. P. Han, A globally convergent method for nonlinear programming, J. Opt. Theo.
Applns., 22, 1976, pp. 297–309.
30. M. R. Hestenes, Multiplier and gradient methods, J. Opt. Theo. Applns, 4, 1969,
pp. 303–320.
31. W. Karush, Minima of functions of several variables with inequalities as side conditions,
Master’s Thesis, Dept. of Mathematics, Univ. of Chicago, 1939.
32. H. W. Kuhn and A. W. Tucker, Nonlinear Programming, In: Proceedings of the Second
Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, ed.,
University of California Press, 1951.
33. O. L. Mangasarian and S. Fromowitz, The Fritz John necessary optimality conditions
in the presence of equality and inequality constraints, J. Math. Analysis and Applications, 17, 1967, pp. 37–47.
34. N. Maratos, Exact penalty function algorithms for finite dimensional and control op-
timization problems, Ph.D. Thesis, Univ. of London, 1978.
35. B. A. Murtagh and M. A. Saunders, A projected Lagrangian algorithm and its im-
plementation for sparse nonlinear constraints, Math. Programming Studies, 16, 1982,
pp. 84–117.
36. J. Nocedal, Updating quasi-Newton matrices with limited storage, Math. Comp., 35,
1980, pp. 773–782.
37. J. Nocedal and M. L. Overton, Projected Hessian updating algorithms for nonlinearly
constrained optimization, SIAM J. Num. Anal., 22, 1985, pp. 821–850.
38. T. Pietrzykowski, An exact potential method for constrained maxima, SIAM J. Num.
Anal., 6, 1969, pp. 217–238.
39. R. Polyak, Modified barrier functions (theory and methods), Math. Programming, 54,
1992, pp. 177–222.
40. M. J. D. Powell, A method for nonlinear constraints in minimization problems, In:
Optimization, R. Fletcher ed., Academic Press, London, 1969.
41. M. J. D. Powell, A fast algorithm for nonlinearly constrained optimization calculations,
In: Numerical Analysis, Dundee 1977, G. A. Watson, ed., Lecture Notes in Mathematics
630, Springer Verlag, Berlin, 1978.
42. A. A. Ribeiro, E. W. Karas, and C. C. Gonzaga, Global convergence of filter methods
for nonlinear programming, Technical report, Dept. of Mathematics, Federal University
of Paraná, Brazil, 2006.
43. S. M. Robinson, A quadratically convergent method for general nonlinear programming
problems, Math. Programming, 3, 1972, pp. 145–156.
44. R. T. Rockafellar, Augmented Lagrange multiplier functions and duality in non-convex
programming, SIAM J. Control, 12, 1974, pp. 268–285.

45. Ph. L. Toint, On sparse and symmetric updating subject to a linear equation, Math.
Comp., 31, 1977, pp. 954–961.
46. A. Wächter and L. T. Biegler, Line search filter methods for nonlinear programming:
Motivation and global convergence, SIAM J. Optimization, 16, 2005, pp. 1–31.
47. A. Wächter and L. T. Biegler, Line search filter methods for nonlinear programming:
Local convergence, SIAM J. Optimization, 16, 2005, pp. 32–48.
48. R. B. Wilson, A simplicial algorithm for concave programming, Ph.D. dissertation,
Harvard Univ. Graduate School of Business Administration, 1960.
