The Sequential Quadratic Programming Method
Roger Fletcher
1 Introduction
minimize_{x ∈ IR^n}   f(x)
R. Fletcher
Department of Mathematics, University of Dundee, Dundee DD1 4HN
e-mail: fl[email protected]
minimize_{x ∈ IR^n}   f(x)
subject to   A^T x = b
             c(x) = 0                                      (1.2)
             l ≤ x ≤ u.
minimize_{x ∈ IR^n}   f(x)

subject to   l ≤ ⎛   x   ⎞ ≤ u                             (1.3)
                 ⎜ A^T x ⎟
                 ⎝  c(x) ⎠
In this case the ‘linear’ problem referred to above is the well known system
of linear equations
AT x = b (2.1)
in which A is a given n × n matrix of coefficients and b is a given vector
of right hand sides. For all except the very largest problems, the system is
readily solved by computing factors P AT = LU using elimination and partial
pivoting. If A is a very sparse matrix, techniques are available to enable large
systems to be solved. A less well known approach is to compute implicit
factors LP AT = U (see for example Fletcher [14]) which can be advantageous
in certain contexts. The system is regular when A is nonsingular, in which
case a unique solution exists. Otherwise there may be non-unique solutions
or, more usually, no solution.
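As a concrete illustration, the factor-and-solve approach described above can be sketched with SciPy's LU routines. The matrix and right hand side here are illustrative choices of ours, not data from the text:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Illustrative data: a nonsingular 3x3 coefficient matrix A and a
# right hand side b for the system A^T x = b.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Factorize P A^T = L U once (elimination with partial pivoting);
# the factors can then be reused for many right hand sides.
lu, piv = lu_factor(A.T)
x = lu_solve((lu, piv), b)

# For a regular system the residual is at rounding-error level.
print(np.allclose(A.T @ x, b))  # True
```

The factorization cost is incurred once; each subsequent `lu_solve` is only a pair of triangular solves.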
The corresponding nonlinear problem is the system of nonlinear equations
r(x) = 0 (2.2)
Truncating the negligible term in d and setting the left hand side to zero
yields the system of linear equations
This system forms the basis of the Newton-Raphson (NR) method in which
(2.4) is solved for a displacement d = d(k) , and x(k+1) = x(k) + d(k) becomes
the next iterate.
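A minimal sketch of the Newton-Raphson iteration just described, applied to an illustrative system of our own choosing (the test problem, tolerance and iteration limit are assumptions, not from the text):

```python
import numpy as np

def newton_raphson(r, jac, x0, tol=1e-10, maxit=50):
    """Solve r(x) = 0 by the Newton-Raphson method.

    On each iteration the linear system jac(x) d = -r(x) is solved
    for the displacement d, and x + d becomes the next iterate.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        rx = r(x)
        if np.linalg.norm(rx) < tol:
            break
        d = np.linalg.solve(jac(x), -rx)
        x = x + d
    return x

# Illustrative system: r(x) = (x1^2 + x2^2 - 1, x1 - x2) = 0.
r = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
jac = lambda x: np.array([[2*x[0], 2*x[1]], [1.0, -1.0]])
x = newton_raphson(r, jac, [2.0, 0.5])
print(x)  # approximately (1/sqrt(2), 1/sqrt(2))
```

From a reasonable starting point the quadratic convergence discussed next is visible: the residual norm roughly squares on each late iteration.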
‖x^(k+1) − x*‖ = O(‖x^(k) − x*‖²).
We prove these properties in the next subsection. This property is very desirable,
indicating as it does that the number of significant figures of accuracy
doubles on each iteration, in the limit. Moreover a theorem of Dennis and
Moré [10] indicates that if x* is regular then any sequence {x^(k)} converging
to x* exhibits superlinear convergence (that is, ‖x^(k+1) − x*‖ = o(‖x^(k) − x*‖))
if and only if the displacements converge asymptotically to those of the
Newton-Raphson method, in the sense that

d^(k) = d_NR^(k) + o(‖d_NR^(k)‖).
0 = r* = r − Ã^T e                                         (2.5)
‖r − A^T e‖/‖e‖ → 0.                                       (2.6)
‖r − A^T e‖ ≤ α‖e‖/β                                       (2.7)
It follows from (2.7) that ‖e^(k+1)‖ ≤ α‖e^(k)‖ and hence that x^(k+1) ∈ N(x*).
By induction, e^(k) → 0, and hence x^(k) → x*. It also follows from (2.10) and
(2.6) that ‖e^(k+1)‖/‖e^(k)‖ → 0, showing that the order of convergence is
superlinear.
Proof. In this case we can write the Taylor series (2.3) in the form
through x∗ . It follows from this and the chain rule that the slope
df(x(α))/dα|_{α=0} = g*^T s = 0
minimize_{x ∈ IR^n}   q(x) = ½ x^T G x + h^T x             (2.14)
subject to   A^T x = b.
x = x◦ + Zt (2.15)
which is solved for t. Then (2.15) defines the solution x∗ . The solution is a
unique minimizer if and only if Z T GZ, referred to as the reduced Hessian
matrix, is positive definite.
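The null space method for the EQP (2.14) can be sketched as follows. Taking Z as an orthonormal basis from a QR factorization of A is an implementation choice of ours (the text leaves the choice of Z open), and the data are illustrative:

```python
import numpy as np

def eqp_null_space(G, h, A, b):
    """Null space method sketch for: min 0.5 x'Gx + h'x  s.t.  A^T x = b.

    Z is an orthonormal basis for the null space of A^T, taken from a
    full QR factorization of A; x0 is any particular solution of A^T x = b.
    """
    n, m = A.shape
    Q, _ = np.linalg.qr(A, mode='complete')
    Z = Q[:, m:]                                   # null-space basis: A^T Z = 0
    x0 = np.linalg.lstsq(A.T, b, rcond=None)[0]    # a feasible point
    # Reduced system (Z^T G Z) t = -Z^T (G x0 + h); a positive definite
    # reduced Hessian gives the unique minimizer.
    t = np.linalg.solve(Z.T @ G @ Z, -Z.T @ (G @ x0 + h))
    return x0 + Z @ t

# Illustrative data: min x1^2 + x2^2 + x3^2 - x1  s.t.  x1 + x2 + x3 = 1.
G = np.diag([2.0, 2.0, 2.0])
h = np.array([-1.0, 0.0, 0.0])
A = np.array([[1.0], [1.0], [1.0]])
x = eqp_null_space(G, h, A, np.array([1.0]))
print(x)  # approximately [0.6667, 0.1667, 0.1667]
```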
The Null Space Method extends to solve any linear equality constrained
problem (LECP) in which the objective function is non-quadratic, that is
minimize_{x ∈ IR^n}   f(x)                                 (2.17)
subject to   A^T x = b.
minimize_{t ∈ IR^(n−m)}   F(t) = f(x◦ + Zt).
In this section we follow the rationale of the null space method and attempt to
derive an equivalent reduced unconstrained optimization problem. In this case
however it is necessary to make a nonlinear transformation of variables, and
there exist exceptional situations in which this is not possible. In order there-
fore to ensure that our transformation is well defined, local to a solution x∗ of
the ENLP, we make the regularity assumption that the columns of the Jaco-
bian matrix A∗ are linearly independent, or equivalently that rank(A∗ ) = m.
Existence and some properties of the transformation are a consequence
of the Inverse Function Theorem, an important result which can be found
in texts on real variable calculus. It may be stated as follows. Let r(x),
IR^n → IR^n, be a C¹ nonlinear mapping, and let x* be such that ∇r(x*) is
nonsingular. Then open neighbourhoods of x* and r* (= r(x*)) exist within
which a C¹ inverse mapping x(r) is uniquely defined, so that x(r(x)) = x
and r(x(r)) = r. Moreover, derivatives of the mappings are related by
In the case of the ENLP above, we choose any fixed matrix V such that
[A∗ | V ] is nonsingular (this is possible by virtue of the regularity assumption),
and consider the nonlinear mapping
r(x) = ⎛     c(x)     ⎞,                                   (3.2)
       ⎝ V^T(x − x*)  ⎠
say, so that
[Figure 1: the constraint normal vectors a_i and null-space basis vectors z_j
at x*, in the linear (upper) and nonlinear (lower) cases, for m = 1 and m = 2.]
linear constraints. However, in the linear case the reduced problem that is
obtained below is identical to that described in Section 2.4.
The process of dimension reduction is illustrated in Figure 1 when the full
space is three dimensional. In the linear case the constraints can be repre-
sented by planes, and the normal vectors a1 and a2 (columns of the matrix
A) are perpendicular to the planes. When m = 1, the null space has dimen-
sion two, and is spanned by two independent vectors z1 and z2 which are the
columns of Z. Any point x in the plane can be represented uniquely by the
linear combination x = x∗ + t1 z1 + t2 z2 . When m = 2 the feasible set is the
intersection of two planes, there is only one basis vector z1 , and the feasible
set is just x = x∗ + t1 z1 . It can be seen in both cases that the vectors ai and
zj are mutually perpendicular for all i and j, which expresses the condition
AT Z = 0. In the nonlinear case, the planes are replaced by curved surfaces,
and feasible lines in the linear case have become feasible arcs in the nonlinear
case, whose directions at x* are the vectors z_i*. It can be seen that A and Z
are no longer constant in the nonlinear case.
from (3.4), or
∇t = Z T ∇x . (3.7)
This result shows how derivatives in the reduced space are related to those in
the full space. Applying these derivatives to F (t) and f (x) at x∗ , it follows
from (3.6) that the stationary point condition for the ENLP problem is
Z ∗T g(x∗ ) = 0, (3.8)
g* = A* λ* = Σ_{i=1}^{m} a_i* λ_i*.                        (3.9)
Satisfying (3.9) and feasibility provides a method for solving the ENLP,
that is: find x∗ , λ∗ to solve the system of n + m equations
r(x, λ) = ⎛ g − Aλ ⎞ = 0,                                  (3.10)
          ⎝   −c   ⎠
minimize_{x ∈ IR^n}   f(x)                                 (3.12)
subject to   c(x) = ε
in which the right hand sides of the constraints have been perturbed by an
amount ε. Let x(ε), λ(ε) be the solution and multipliers of the perturbed
problem, and consider f (x(ε)). Then
df (x(ε))/dεi = λi , (3.13)
and observe that L(x(ε), λ(ε), ε) = f (x(ε)). Then the chain rule gives
df/dε_i = dL/dε_i = (∂x^T/∂ε_i) ∇_x L + (∂λ^T/∂ε_i) ∇_λ L + ∂L/∂ε_i = λ_i,

since ∇_x L and ∇_λ L vanish at the solution, and ∂L/∂ε_i = λ_i.
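The sensitivity result (3.13) can be checked numerically on a toy ENLP with a closed-form solution. The example problem below is our own, not from the text:

```python
import numpy as np

# Toy ENLP: minimize x1^2 + x2^2 subject to x1 + x2 = 1 + eps.
# Its solution is x(eps) = ((1+eps)/2, (1+eps)/2), and from g = A*lambda,
# i.e. 2x = lambda*(1,1), the multiplier is lambda(eps) = 1 + eps.

def f_opt(eps):
    x = np.array([(1 + eps) / 2, (1 + eps) / 2])
    return x @ x

lam = 1.0                 # multiplier at eps = 0
eps = 1e-6
# Central-difference estimate of df(x(eps))/deps at eps = 0:
dfdeps = (f_opt(eps) - f_opt(-eps)) / (2 * eps)
print(abs(dfdeps - lam) < 1e-8)  # True: the multiplier predicts the sensitivity
```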
Let x∗ solve the ENLP and let rank(A∗ ) = m (regularity). Then we have
seen that (3.5) is an equivalent reduced unconstrained minimization problem.
Thus, from Section 2.2, a second order necessary condition is that the Hessian
matrix ∇2t F (t) is positive semi-definite. To relate this to the ENLP, we use
equation (3.7) which relates derivatives in the reduced and full systems. Thus
∇_t² F = Z^T ∇_x (g^T Z) = Z^T (∇_x g^T) Z = Z^T G Z
where G is the Hessian matrix of f (x). Thus the second order necessary
condition in this case is that the reduced Hessian matrix Z T G∗ Z is positive
semi-definite. Moreover, Z T g∗ = 0 and Z T G∗ Z being positive definite are
sufficient conditions for x∗ to solve (3.5).
For an ENLP with nonlinear constraints, Z depends on x and we can
no longer assume that derivatives of Z with respect to x are zero. To make
progress, we observe that the ENLP is equivalent to the problem
minimize_{x ∈ IR^n}   L(x, λ*)                             (3.14)
subject to   c(x) = 0
since f (x) = L(x, λ) when c(x) = 0. We now define x(t) as in (3.2), and
consider the problem of minimizing a reduced function F (t) = L(x(t), λ∗ ).
Then ∇t F = 0 becomes Z T ∇x L = 0, or Z T (g − Aλ∗ ) = 0. At x∗ , it follows
that Z ∗T g∗ = 0 which is the first order necessary condition. For second
derivatives,
where

W(x, λ) = ∇²_x L(x, λ) = ∇² f(x) − Σ_{i=1}^{m} λ_i ∇² c_i(x).     (3.15)
We have seen in (3.10) that a stationary point of a regular ENLP can be found
by solving a system of nonlinear equations. Applying the Newton-Raphson
method to these equations enables us to derive a Newton type method with
rapid local convergence properties. First however we consider solving (3.10) in
the case of an EQP problem (2.14). In this case, g = Gx+h and c = AT x−b,
so (3.10) can be written as the system of n + m linear equations in n + m
unknowns

⎛   G    −A ⎞ ⎛ x ⎞     ⎛ h ⎞
⎝ −A^T    0 ⎠ ⎝ λ ⎠ = − ⎝ b ⎠.                             (3.16)
Although symmetric, the coefficient matrix in (3.16) is indefinite so cannot be
solved by using Choleski factors. The Null Space Method (2.16) is essentially
one way of solving (3.16), based on eliminating the constraints AT x = b.
When G is positive definite, and particularly when G permits sparse Choleski
factors G = LLT to be obtained, it can be more effective to use the first block
equation to eliminate x = G−1 (Aλ − h). Then the system
where superscript (k) denotes quantities calculated from x(k) and λ(k) . Then
updated values for the next iteration are defined by x(k+1) = x(k) + d(k)
and λ(k+1) = λ(k) + δ (k) . These formulae may be rearranged by moving the
A(k) λ(k) term to the left hand side of (3.19), giving
⎛  W^(k)    −A^(k) ⎞ ⎛  d^(k)   ⎞   ⎛ −g^(k) ⎞
⎝ −A^(k)T     0    ⎠ ⎝ λ^(k+1) ⎠ = ⎝  c^(k)  ⎠.           (3.20)
is nonsingular.
So where does the SQP method come in? There are two important obser-
vations to make. First, if the constraints in the ENLP problem are regular
(rank(A∗ ) = m), and the ENLP satisfies second order sufficiency conditions
(Z ∗T W ∗ Z ∗ is positive definite), then it is a nice exercise in linear algebra to
show that the matrix (3.21) is nonsingular (see [13]). Thus the local rapid con-
vergence of (3.20) is assured. Moreover, it also follows that rank(A(k) ) = m
and Z (k)T W (k) Z (k) is positive definite in a neighbourhood of x∗ , λ∗ . Under
these conditions, the EQP problem
EQP^(k):   minimize_{d ∈ IR^n}   ½ d^T W^(k) d + d^T g^(k) + f^(k)
           subject to   c^(k) + A^(k)T d = 0
is regular and has a unique local minimizer, which can be found by solving the
stationary point condition (see (3.16)), which for EQP(k) is none other than
(3.20). Thus, for finding a local minimizer of an ENLP problem, it is better
to replace the iteration formula (3.20) by one based on solving EQP(k) for
a correction d(k) = d and multiplier vector λ(k+1) . This correctly accounts
for the second order condition required by a local minimizer to an ENLP
problem. In particular, any solution of (3.20) which corresponds to a saddle
point or maximizer of EQP(k) is not accepted. (EQP(k) is unbounded in this
situation.) Also EQP(k) has a nice interpretation: the constraints are linear
Taylor series approximations about x(k) to those in the ENLP problem, and
the objective function is a quadratic Taylor series approximation about x(k)
to the objective function in the ENLP, plus terms in W (k) that account
for constraint curvature. The objective function can equally be viewed as a
quadratic approximation to the Lagrangian function. (In fact the term f (k)
in the objective function of EQP(k) is redundant, but is included so as to
make these nice observations.)
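A single iteration of the resulting scheme amounts to solving the KKT system (3.20) for the correction and the new multipliers. A minimal sketch with illustrative data (the assembled block matrix and the test problem are our own choices):

```python
import numpy as np

def sqp_eqp_step(W, g, A, c):
    """One equality-constrained SQP step: solve the KKT system

        [  W   -A ] [ d        ]   [ -g ]
        [ -A'   0 ] [ lam_next ] = [  c ]

    for the correction d and multiplier estimate lam_next (cf. (3.20))."""
    n, m = A.shape
    K = np.block([[W, -A], [-A.T, np.zeros((m, m))]])
    rhs = np.concatenate([-g, c])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]

# Illustrative data: min x1^2 + x2^2 s.t. x1 + x2 = 1, linearized at
# x = (0, 0), where c = -1, g = (0, 0) and W = 2I.
W = 2.0 * np.eye(2)
g = np.zeros(2)
A = np.array([[1.0], [1.0]])
c = np.array([-1.0])
d, lam = sqp_eqp_step(W, g, A, c)
print(d, lam)  # d = (0.5, 0.5), lam = 1: one step solves this quadratic problem
```

In practice one would solve EQP^(k) rather than the raw system, so that a saddle point or maximizer of the subproblem is rejected, as discussed above.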
AT x ≥ b (4.1)
in which A is a given n × m matrix of coefficients and b is a given vector of
right hand sides. Usually m > n when there is no objective function present,
although in general, m ≤ n is also possible. Each inequality a_i^T x ≥ b_i in
(4.1) divides IRn into two parts, a feasible side and an infeasible side, with
respect to the inequality. Equality holds on the boundary. Any n independent
such equations define a point of intersection, referred to as a vertex in this
context. Usually methods for solving (4.1) attempt to locate a feasible vertex.
Each vertex can be found by solving a system of linear equations as in (2.1).
There are only a finite number of vertices so the process will eventually find
a solution, or establish that none exists. However, there may be as many as
(m choose n) vertices, which can be extremely large for problems of any size. Thus it is
important to enumerate the vertices in an efficient way. This can be done by
In this case our generic linear problem, which can be solved finitely, is the
QP problem
minimize_{x ∈ IR^n}   q(x) = ½ x^T G x + h^T x             (4.3)
subject to   A^T x ≥ b.
minimize_{x ∈ IR^n}   f(x)                                 (4.4)
subject to   A^T x ≥ b.
minimize_{x ∈ IR^n}   f(x)                                 (4.6)
subject to   c_i(x) = 0,  i ∈ A*.
g* = Σ_{i∈A*} a_i* λ_i* = Σ_{i=1}^{m} a_i* λ_i* = A* λ*,   (4.7)
which is referred to as the complementarity condition. If λ∗i > 0 for all active
inequality constraints, then strict complementarity is said to hold. Collec-
tively, feasibility in (1.1), (3.9), (4.8) for an inequality constraint, and (4.9)
are known as KT (Kuhn-Tucker) (or KKT (Karush-Kuhn-Tucker)) conditions
(Karush [31], Kuhn and Tucker [32]). Subject to a regularity assumption of
some kind, they are necessary conditions for a local solution of (1.1). A point
x∗ which satisfies KT conditions for some λ∗ is said to be a KT point.
A second order necessary condition that can be deduced from (4.6) is that
Z ∗T W ∗ Z ∗ is positive semi-definite, where Z ∗ is the null-space basis matrix
for (4.6), and W is defined in (3.15). A sufficient condition is that x∗ is a KT
point, strict complementarity holds, and Z ∗T W ∗ Z ∗ is positive definite.
The regularity assumption used in these notes (that the gradient vectors
a∗i = ∇c∗i , i ∈ A∗ are linearly independent) is known as the Linear Indepen-
dence Constraint Qualification (LICQ). If LICQ fails at any point, degeneracy
is said to hold at that point. However KT conditions can hold under weaker
conditions, most notably when all the active constraints are linear. In this
minimize_{x ∈ IR^n}   ½ x^T G x + h^T x                    (4.10)
subject to   a_i^T x = b_i,  i ∈ A.
Because x(1) is a vertex, it is in fact the solution of the current EQP defined
by A. The ASM has two major steps.
(i) If x^(k) solves the current EQP, then find the corresponding multipliers λ^(k).
Choose any i : λ_i^(k) < 0 (if none exist, then finish with x* = x^(k)).
Special linear algebra techniques are required to make the method efficient
in practice. Changes to the current active set involve either adding or sub-
tracting one constraint index. Updates to matrices such as Z and Z T GZ can
be performed much more quickly than re-evaluating the matrices. For large
problems, it is important to take advantage of sparsity in A and possibly G.
There are some complicating factors for the ASM. If the Hessian G is not
positive definite, then it is possible that the EQP obtained by removing i
from A may be unbounded, so that x̂ does not exist. In this case an arbitrary
choice of feasible descent direction is chosen, to make progress. If G has
negative eigenvalues, then the QP problem may have local solutions, and the
ASM does not guarantee to find a global solution. Any solution found by
the ASM will be a KT point of the QP problem, but may not be a local
solution unless strict complementarity holds. A more serious complicating
factor is that of degeneracy which refers to the situation where regularity
of the active constraints at the solution of an EQP problem fails to hold.
An example would be where there are more than n active constraints at a
feasible vertex. In this case, deciding whether x(k) solves the current EQP, or
whether a feasible descent direction exists, is a more complex issue, although
a finite algorithm to decide the issue is possible. Degeneracy is often present
in practical instances of QP problems, and it is important that it is correctly
accounted for in a computer code.
More recently an alternative class of methods has become available for
the solution of LP or QP problems in which G is positive semi-definite.
These interior point methods have the advantage that they avoid the worst
case behaviour of ASM and Simplex methods, in which the number of itera-
tions required to locate the solution may grow exponentially with n. However,
interior point methods also have some disadvantages in an SQP context.
We are now in a position to describe the basic SQP method for an NLP (1.1)
with inequality constraints. The method was first suggested in a thesis of
Wilson (1960), [48], and became well known due to the work of Beale [1]. The
idea follows simply from the SEQP method for an ENLP problem, where the
equality constraints c(x) = 0 are approximated by the linear Taylor series
c(k) + A(k)T d = 0 in the subproblem EQP(k) . In an NLP with inequality
constraints c(x) ≥ 0 we therefore make the same approximation, leading to
a QP subproblem with linear inequality constraints c(k) + A(k)T d ≥ 0, that is
QP^(k):   minimize_{d ∈ IR^n}   ½ d^T W^(k) d + d^T g^(k)
          subject to   c^(k) + A^(k)T d ≥ 0.
The basic form of the algorithm therefore is that described at the end of
Section 3.3, with the substitution of QP(k) for EQP(k) . To view this method
as a Newton-type method, we need to assume that strict complementarity
λ∗i > 0, i ∈ A∗ holds at a regular solution to (1.1). Then, if x(k) , λ(k) is
sufficiently close to x∗ , λ∗ , it follows that the solution of EQP(k) with active
constraints A*, also satisfies the sufficient conditions for QP^(k). Thus we
can ignore the inactive constraints i ∉ A*, and the SQP method is identical to
the SEQP method on the active constraint set A∗ . Thus the SQP method
inherits the local rapid convergence of a Newton type method under these
circumstances.
The progress of the SQP method on the NLP problem
minimize_{x ∈ IR^2}   f(x) = −x1 − x2
subject to   c1(x) = x2 − x1^2 ≥ 0
             c2(x) = 1 − x1^2 − x2^2 ≥ 0
is illustrated in Table 1, and has some instructive features. Because the initial
multiplier estimate is zero, and f (x) is linear, the initial W (1) matrix is zero,
and QP(1) is in fact an LP problem. Consequently, x(1) has to be chosen
carefully to avoid an unbounded subproblem (or alternatively one could add
simple upper and lower bounds to the NLP problem). The solution of QP(1)
delivers some non-zero multipliers for λ(2) , so that W (2) becomes positive
definite. The solution of QP^(2) predicts that constraint 1 is inactive, and we
see that λ_1^(3) is zero. This situation persists on all subsequent iterations. For
this NLP problem, the active set is A* = {2}, and we see for k ≥ 3, that the
SQP method converges to the solution in the same way as the SEQP method
with the single equality constraint 1 − x1^2 − x2^2 = 0. The onset of rapid local
convergence, characteristic of a Newton method, can also be observed.
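This example problem can be reproduced with an off-the-shelf SQP-type solver. The sketch below uses SciPy's SLSQP; the choice of solver and starting point are ours, not Fletcher's:

```python
import numpy as np
from scipy.optimize import minimize

# The example NLP from the text: minimize -x1 - x2 subject to
# c1(x) = x2 - x1^2 >= 0 and c2(x) = 1 - x1^2 - x2^2 >= 0.
res = minimize(
    lambda x: -x[0] - x[1],
    x0=[0.5, 0.5],
    method='SLSQP',                     # an SQP-type method
    constraints=[
        {'type': 'ineq', 'fun': lambda x: x[1] - x[0]**2},
        {'type': 'ineq', 'fun': lambda x: 1 - x[0]**2 - x[1]**2},
    ],
)
# Only constraint 2 is active at the solution: x* = (1/sqrt(2), 1/sqrt(2)).
print(res.x)  # approximately [0.7071, 0.7071]
```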
However, the basic method can fail to converge if x(k) is remote from x∗
(it is not as important to have λ(k) close to λ∗ because if x(k) is close to
x∗ , one solution of QP(k) will give an accurate multiplier estimate). It is
also possible that QP(k) has no solution, either because it is unbounded, or
because the linearized constraints are infeasible.
For these reasons, the SQP method is only the starting point for a fully
developed NLP solver, and extra features must be added to promote conver-
gence from remote initial values. This is the subject of subsequent sections of
this monograph. Nonetheless it has been and still is the method of choice for
many researchers. The success of the method is critically dependent on hav-
ing an efficient, flexible and reliable code for solving the QP subproblem. It
is important to be able to take advantage of warm starts, that is, initializing
the QP solver with the active set from a previous iteration. Also important
is the ability to deal with the situation that the matrix W (k) is not positive
semi-definite. For both these reasons, an active set method code for solving
the QP subproblems is likely to be preferred to an interior point method.
However, NLP solvers are still a very active research area, and the situation
is not at all clear, especially when dealing with very large scale NLPs.
An early idea for solving NLP problems is the successive linear programming
(SLP) algorithm in which an LP subproblem is solved (W (k) = 0 in QP(k) ).
This is able to take advantage of fast existing software for large scale LP.
However, unless the solution of the NLP problem is at a vertex, convergence
is slow because of the lack of second derivative information. A more recent
development is the SLP-EQP algorithm, introduced by Fletcher and Sainz
de la Maza [22], in which the SLP subproblem is used to determine the
active set and multipliers, but the resulting step d is not used. Instead an
SEQP calculation using the subproblem EQP(k) in Section 3.3 is made to
determine d(k) . The use of a trust region in the LP subproblem (see below)
is an essential feature in the calculation. The method is another example of a
Newton-type method and shares the rapid local convergence properties. The
idea has proved quite workable, as a recent software product SLIQUE of
Byrd, Gould, Nocedal and Waltz [3] demonstrates.
in the gradient of the Lagrangian function L(x, λ(k) ), using the latest avail-
able estimate λ(k) of the multipliers. Then the updated matrix B (k+1) is
chosen to satisfy the secant condition B (k+1) δ (k) = γ (k) .
There are many ways in which one might proceed. For small problems,
where it is required to maintain a positive definite B (k) matrix, the BFGS
formula (see [13]) might be used, in which case it is necessary to have
δ (k)T γ (k) > 0. It is not immediately obvious how best to meet this require-
ment in an NLP context, although a method suggested by Powell [41] has
been used widely with some success. For large problems, some form of limited
memory update is a practical proposition. The L-BFGS method, Nocedal [36],
as implemented by Byrd, Nocedal and Schnabel [4] is attractive, although
other ideas have also been tried. Another method which permits low costs
is the low rank Hessian approximation B = U U T (Fletcher [15]), where U
has relatively few columns. For ENLP, updating the reduced Hessian matrix
M ≈ Z T W Z, B = V M V T , using differences in reduced gradients, is ap-
propriate, essentially updating the Hessian of the reduced objective function
F (t) in (3.5). However, this idea does not translate easily into the context of
NLP with inequality constraints, due to the change in dimension of m when
the number of active constraints changes.
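One widely used way to meet the requirement δ^(k)T γ^(k) > 0 is Powell's damping of the difference vector before the BFGS update. The following is a hedged sketch with the customary threshold 0.2, not a transcription of Powell's paper:

```python
import numpy as np

def damped_bfgs_update(B, delta, gamma, theta_min=0.2):
    """BFGS update of B with Powell-style damping (sketch).

    If delta'gamma is not sufficiently positive relative to delta'B delta,
    gamma is replaced by eta = theta*gamma + (1-theta)*B delta, with theta
    chosen so that delta'eta = theta_min * delta'B delta > 0, which keeps
    the updated matrix positive definite.
    """
    Bd = B @ delta
    dBd = delta @ Bd
    dg = delta @ gamma
    if dg >= theta_min * dBd:
        theta = 1.0
    else:
        theta = (1.0 - theta_min) * dBd / (dBd - dg)
    eta = theta * gamma + (1.0 - theta) * Bd
    # Standard BFGS formula with gamma replaced by eta:
    return B - np.outer(Bd, Bd) / dBd + np.outer(eta, eta) / (delta @ eta)

# When no damping is triggered the secant condition B_new delta = gamma holds:
B = np.eye(2)
delta = np.array([1.0, 0.0])
gamma = np.array([2.0, 1.0])
Bn = damped_bfgs_update(B, delta, gamma)
print(np.allclose(Bn @ delta, gamma))  # True
```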
An intermediate situation for large scale SQP is to update an approx-
imation which takes the sparsity pattern of W (k) into account, and up-
dates only the non-sparse elements. The LANCELOT project (see Conn,
Gould and Toint [8] for many references) makes use of partially separa-
ble functions in which B (k) is the sum of various low dimensional element
Hessians, for which the symmetric rank one update is used. Other sparsity
respecting updates have also been proposed, for example Toint [45], Fletcher,
Grothey and Leyffer [17], but the coding is complex, and there are some
difficulties.
Various important conditions exist regarding rapid local convergence, re-
lating to the asymptotic properties of W Z or Z T W Z (see [13] for references).
Significantly, low storage methods like L-BFGS do not satisfy these con-
ditions, and indeed slow convergence is occasionally observed, especially
when the true reduced Hessian Z ∗T W ∗ Z ∗ is ill-conditioned. For this rea-
son, obtaining rapid local convergence when the null space dimension is very
large is still a topic of research interest. Indeed the entire subject of how best
to provide second derivative information in an SQP method is very much an
open issue.
In this section we examine the transition from Newton type methods with
rapid local convergence, such as the SQP method, to globally convergent
methods suitable for incorporation into production NLP software. By glob-
ally convergent, we refer to the ability to converge to local solutions of an
NLP problem from globally selected initial iterates which may be remote
from any solution. This is not to be confused with the problem of guaran-
teeing to find global solutions of an NLP problem in the sense of the best
local solution, which is computationally impractical for problems of any size
(perhaps >40 variables, say), unless the problem has some special convex-
ity properties, which is rarely the case outside of LP and QP. We must also
be aware that NLP problems may have no solution, mainly due to the con-
straints being infeasible (that is, no feasible point exists). In this case the
method should ideally be able to indicate that this is the case, and not spend
an undue amount of time in searching for a non-existent solution. In practice
even to guarantee that no feasible solution exists is an unrealistic aim, akin
to that of finding a global minimizer of some measure of constraint infeasi-
bility. What is practical is to locate a point which is locally infeasible in the
sense that the first order Taylor series approximation to the constraints set
is infeasible at that point. Again the main requirement is that the method
should be able to converge rapidly to such a point, and exit with a suitable
indication of local infeasibility. Another possibility, which can be excluded
by bounding the feasible region, is that the NLP is unbounded, that is f (x)
is not bounded below on the feasible region, or that there are no KT points
in the feasible region. Again the software has to recognize the situation and
terminate accordingly.
Ultimately the aim is to be able to effectively solve NLP problems created
by scientists, engineers, economists etc., who have a limited background in
From an historical perspective, almost all general purpose NLP solvers until
about 1996 aimed to promote global convergence by constructing an auxil-
iary function from f (x) and c(x) known variously as a penalty, barrier, or
merit function. In the earlier days, the idea was to apply successful existing
techniques for unconstrained minimization to the auxiliary function, in such
a way as to find the solution of the NLP problem. Later, there came the idea
of using the auxiliary function to decide whether or not to accept the step
given by the SQP method, hence the term merit function.
For an ENLP problem, an early idea was the Courant [9] penalty function
φ(x; σ) = f(x) + ½σ c^T c = f(x) + ½σ Σ_{i=1}^{m} c_i²   where c = c(x),   (5.1)
∇x φ = g − Aλ + σAc, (5.4)
derived from the SQP iteration formula (3.20) may be used. For large σ this
may be approximated by the scheme
see [13].
A multiplier penalty function for the NLP (1.1) with inequality constraints is

φ(x; λ, σ) = f(x) + Σ_i ⎧ −λ_i c_i + ½σ c_i²   if c_i ≤ λ_i/σ       (5.7)
                        ⎩ −½ λ_i²/σ            if c_i ≥ λ_i/σ
suggested by Rockafellar [44]. The piecewise term does not cause any dis-
continuity in first derivatives, and any second derivative discontinuities occur
away from the solution. Otherwise its use is similar to that in (5.3) above.
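The piecewise function (5.7) is straightforward to evaluate in code; the sketch below (function name and data are illustrative) also checks that the two branches join continuously at c_i = λ_i/σ:

```python
import numpy as np

def rockafellar_penalty(f_val, c, lam, sigma):
    """Evaluate the multiplier penalty (5.7) for constraints c(x) >= 0.

    Each constraint contributes -lam_i*c_i + 0.5*sigma*c_i**2 when
    c_i <= lam_i/sigma, and the constant -0.5*lam_i**2/sigma otherwise;
    the two branches join with continuous first derivatives.
    """
    c = np.asarray(c, dtype=float)
    lam = np.asarray(lam, dtype=float)
    branch = np.where(c <= lam / sigma,
                      -lam * c + 0.5 * sigma * c**2,
                      -0.5 * lam**2 / sigma)
    return f_val + branch.sum()

# At the join c_i = lam_i/sigma both branches give -0.5*lam_i^2/sigma:
print(rockafellar_penalty(0.0, [0.5], [1.0], 2.0))  # -0.25 from either branch
```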
A multiplier based modification of the Frisch barrier function due to Polyak
[39] is
φ(x; λ, μ) = f(x) − μ Σ_{i=1}^{m} λ_i log_e(c_i/μ + 1),    (5.8)
in which the boundary occurs where ci = −μ, which is strictly outside the
feasible region. Thus the discontinuity of the Frisch function at the solution
x∗ is moved away into the infeasible region. We note that
∇_x φ(x*; λ*, μ) = g* − μ Σ_{i=1}^{m} λ_i* (a_i*/μ)/(c_i*/μ + 1) = 0
ill-conditioning are avoided. Both the Rockafellar and Polyak functions are
used in a sequential manner with an outer iteration in which λ(k) → λ∗ .
All the above proposals involve sequential unconstrained minimization,
and as such, are inherently less effective than the SQP method, particu-
larly in regard to the rapidity of local convergence. Errors in λ(k) induce
errors of similar order in x(λ(k) ), which is not the case for the SQP method.
Many other auxiliary functions have been suggested for solving NLP or ENLP
problems in ways related to the above. In a later section we shall investigate
so-called exact penalty functions which avoid the sequential unconstrained
minimization aspect.
A major initiative to provide robust and effective software with large scale
capability based on the augmented Lagrangian function was the LANCELOT
code of Conn, Gould and Toint (see [8] for references). The code applies to
an NLP in the form
minimize_{x ∈ IR^n}   f(x)
subject to   c(x) = 0                                      (5.9)
             l ≤ x ≤ u,
which treats simple bounds explicitly but assumes that slack variables have
been added to any other inequality constraints. In the inner iteration, the aug-
mented Lagrangian function (5.3) is minimized subject to the simple bounds
l ≤ x ≤ u on the variables. A potential disadvantage of this approach for
large scale computation is that the Hessian ∇2x φ(x; λ(k) , σ) is likely to be
much less sparse than the Hessian W (k) in the SQP method. To avoid this
difficulty, LANCELOT uses a simple bound minimization technique based on
the use of the preconditioned conjugate gradient method, and solves the sub-
problem to lower accuracy when λ(k) is inaccurate. It also uses an innovative
idea of building the Hessian from a sum of elementary Hessians through the
concept of group partial separability. LANCELOT has been successfully used
to solve problems with upwards of 10⁴ variables, particularly those with large
dimensional null spaces arising for example from the discretization of a par-
tial differential equation. It is less effective for problems with low dimensional
null spaces and does not take advantage of any linear constraints.
A fruitful way to take advantage of linear constraints has been to merge the
globalization aspect of the augmented Lagrangian function with the rapid
local convergence of the SQP method. This was the motivation of the very
successful MINOS code of Murtagh and Saunders [35], which was arguably
the first SQP-like NLP solver with large scale capability. In fact MINOS was
where s = s(x, x(k) ) = c(k) + A(k)T (x − x(k) ) is the first order Taylor series
approximation to c(x) about the current point, and the quantity c(x)−s may
be thought of as the deviation from linearity. The solution and multipliers
of LCP(k) then become the iterates x(k+1) , λ(k+1) for the next major itera-
tion. The method differs from SQP, firstly in that LCP(k) cannot be solved
finitely, which is a disadvantage, and secondly that second derivatives are
not required, which is an advantage. Robinson intended that LCP(k) should
be solved by a reduced Hessian quasi-Newton method. If a Taylor expansion
of the objective function in LCP(k) about the current point is made, then
it agrees with that of SQP(k) up to and including second order terms. Also
the method has the same fixed point property as SQP that if x(k) , λ(k) is
equal to x∗ , λ∗ , then x∗ , λ∗ is the next iterate, and the process terminates.
Consequently the method has the same rapid local convergence properties
as the SQP method, assuming that the LCP(k) subproblem is solved suffi-
ciently accurately. However there is no global convergence result available,
for example there is no mechanism to force the iterates x(k) to accumulate
at a feasible point.
The MINOS code attempts to mitigate the lack of a global convergence
property by augmenting the objective function in LCP(k) with a squared
penalty term. As with LANCELOT, the method is applicable to an NLP in
the form (5.9), and the LCP subproblem that is solved on the k-th major
iteration is
minimize_{x ∈ IR^n}   f(x) − λ^(k)T (c(x) − s) + ½σ (c(x) − s)^T (c(x) − s)
subject to   s = c^(k) + A^(k)T (x − x^(k)) = 0
             l ≤ x ≤ u.
In the original source, MINOS refers to the active set method used to solve
this LCP subproblem, and MINOS/AUGMENTED refers to the major iter-
ative procedure for solving (5.9). However it is more usual now to refer to
the NLP solver by MINOS. The code has sparse matrix facilities, and also
allows ‘linear variables’ to be designated, so allowing the use of a smaller Hes-
sian approximation. MINOS was probably the first SQP-type code with the
capability to solve large scale problems, and as such has been very successful
and is still in use.
The penalty and barrier functions in Sections 5.1 and 5.2 are inherently se-
quential, that is the solution of the NLP problem is obtained by a sequence of
unconstrained minimization calculations. It is however possible to construct
a so-called exact penalty function, that is a penalty function of which x∗ , a
solution of the NLP problem, is a local minimizer. It is convenient here to
consider an NLP problem in the form
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad\text{subject to}\quad c(x) \le 0. \qquad (5.11)$$
The most well known exact penalty function (Pietrzykowski [38], see [13]
for more references) is the l1 exact penalty function (l1 EPF)
$$\phi(x;\sigma) = f(x) + \sigma\|c^+(x)\|_1 = f(x) + \sigma\sum_{i=1}^{m} c_i^+(x), \qquad (5.12)$$
where $c_i^+ = \max(c_i, 0)$ is the amount by which the i-th constraint is violated.
The parameter σ controls the strength of the penalty.
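A sketch of the evaluation of (5.12), with an illustrative one-variable problem (the functions are hypothetical stand-ins, not from the text):

```python
import numpy as np

def l1_epf(f, c, x, sigma):
    """phi(x; sigma) = f(x) + sigma * ||c(x)^+||_1 for min f s.t. c(x) <= 0."""
    return f(x) + sigma * np.maximum(c(x), 0.0).sum()

# Illustrative problem: min x^2 subject to 1 - x <= 0 (i.e. x >= 1)
f = lambda x: float(x[0] ** 2)
c = lambda x: np.array([1.0 - x[0]])

phi_feasible = l1_epf(f, c, np.array([2.0]), sigma=5.0)    # no penalty term
phi_infeasible = l1_epf(f, c, np.array([0.5]), sigma=5.0)  # 0.25 + 5 * 0.5
```

Only the violated constraints contribute, which is the source of the nonsmoothness discussed next.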
First we consider optimality conditions for a local minimizer of φ(x; σ).
The function is nonsmooth due to the discontinuity in derivative of max(ci , 0)
since the penalty term comes only from the infeasible constraints. Therefore
if we make the same regularity assumption that the vectors a∗i , i ∈ A∗ are
linearly independent, then the KT conditions for this problem are also nec-
essary for a minimizer of (5.12). Moreover, if we perturb the right hand side
of a constraint i ∈ A∗ in (5.15) by a sufficiently small εi > 0, εj = 0, j = i,
we make constraint i infeasible, but do not change the status of any other
constraints. This causes an increase of σεi in the penalty term. Moreover
the change in $f(x) + \sigma\sum_{i\in I^*} c_i(x)$ to first order is $-\lambda_i^*\varepsilon_i$. (The negative sign
holds because of the sign change in (5.11)). Hence the change in φ(x; σ) to
first order is εi (σ − λ∗i ). If λ∗i > σ then φ is reduced by the perturbation,
which contradicts the optimality of x∗ in the l1 EPF. Thus the condition
λ∗i ≤ σ, i ∈ A∗ is also necessary. This result tells us that unless the penalty
parameter σ is sufficiently large, a local minimizer will not be created. We
can therefore summarize the first order necessary conditions as
$$g^* + \sigma\sum_{i\in I^*} a_i^* + \sum_{i\in A^*} a_i^* \lambda_i^* \;=\; g^* + \sum_{i=1}^{m} a_i^* \lambda_i^* \;=\; g^* + A^* \lambda^* = 0 \qquad (5.16)$$
$$\left.\begin{array}{l} 0 \le \lambda_i^* \le \sigma \\ c_i^* < 0 \;\Rightarrow\; \lambda_i^* = 0 \\ c_i^* > 0 \;\Rightarrow\; \lambda_i^* = \sigma \end{array}\right\} \quad i = 1, 2, \ldots, m. \qquad (5.17)$$
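The role of the threshold on σ can be seen numerically. In the illustrative problem below (not from the text), $\min x^2$ subject to $1 - x \le 0$ has $x^* = 1$ and, from $g^* + a^*\lambda^* = 2 + (-1)\lambda^* = 0$, multiplier $\lambda^* = 2$: a grid minimization of φ places the penalty minimizer at an infeasible point for σ = 1 but at x* for σ = 3.

```python
import numpy as np

# min x^2 s.t. 1 - x <= 0: x* = 1, lam* = 2.  The conditions above
# predict that phi(.; sigma) has its minimizer at x* only when sigma
# exceeds lam*.
def phi(x, sigma):
    return x ** 2 + sigma * max(1.0 - x, 0.0)

xs = np.linspace(-1.0, 3.0, 40001)

def minimizer(sigma):
    return xs[np.argmin([phi(x, sigma) for x in xs])]

x_small = minimizer(1.0)   # sigma < lam*: minimizer near 0.5, infeasible
x_large = minimizer(3.0)   # sigma > lam*: minimizer at x* = 1
```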
The SQP method first came into prominent use when used in conjunction
with the l1 EPF as suggested by Han [29] and Powell [41]. The vector d(k)
generated by the SQP subproblem QP(k) is regarded as a direction of search,
and the l1 EPF is used as a merit function, so that the next iterate is x(k+1) =
x(k) + α(k) d(k) , with α(k) being chosen to obtain a sufficient reduction in
φ(x; σ). For this to be possible, d(k) must be a descent direction at x(k)
for φ(x; σ). If the Hessian W (k) (or its approximation) is positive definite,
it is possible to ensure that this is the case, if necessary by increasing σ.
Early results with this technique were quite promising, when compared with
sequential unconstrained penalty and barrier methods. However the use of
a nonsmooth merit function is not without its difficulties. In particular the
discontinuities in derivative cause ‘curved valleys’, with sides whose steepness
depends on the size of σ. If σ is large, the requirement to monotonically
improve φ on every iteration can only be achieved by taking correspondingly
small steps, leading to slow convergence. Unfortunately, increasing σ to obtain
descent exacerbates this situation.
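The line search itself can be sketched as follows; the sufficient-reduction test shown is a simplified stand-in for the tests actually used by Han and Powell.

```python
import numpy as np

def backtrack(phi, x, d, t=0.1):
    """Backtracking choice of alpha(k): halve alpha until the merit
    function phi decreases sufficiently along d.  The decrease test (a
    fraction t of alpha * ||d||^2) is a simplification for illustration."""
    alpha, phi0 = 1.0, phi(x)
    while phi(x + alpha * d) > phi0 - t * alpha * np.dot(d, d):
        alpha *= 0.5
        if alpha < 1e-12:   # safeguard against a non-descent direction
            break
    return alpha

quad = lambda x: float(x @ x)   # a smooth stand-in for the merit function
a1 = backtrack(quad, np.array([2.0, 0.0]), np.array([-2.0, 0.0]))
a2 = backtrack(quad, np.array([1.0]), np.array([-3.0]))
```

The first call accepts the full step α = 1; the second, with an overlong step, halves once.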
A way round this is the Sequential l1 Quadratic Programming (Sl1 QP)
method of Fletcher [11]. The idea (also applicable to an l∞ exact penalty
function) is to solve a subproblem which more closely models the l1 EPF,
by moving the linearized constraints into the objective function, in an l1
penalty term. Thus the l1 QP subproblem is
$$\min_{d\in\mathbb{R}^n}\; g^{(k)T} d + \tfrac12 d^T W^{(k)} d + \sigma\|(c^{(k)} + A^{(k)T} d)^+\|_1 \quad\text{subject to}\quad \|d\|_\infty \le \rho, \qquad (5.18)$$
where ρ is the trust region radius. Solving the subproblem ensures descent, and quite
strong results regarding global convergence can be proved by using standard
trust region ideas. The use of an l∞ trust region (a ‘box constraint’) fits
conveniently into a QP type framework.
Even so, there are still some issues to be resolved. Firstly, (5.18) is not
a QP in standard form due to the presence of l1 terms in the objective,
although it is still a problem that can be solved in a finite number of steps.
Ideally a special purpose l1 QP solver with sparse matrix capabilities would
be used. This would enable an efficient l1 piecewise quadratic line search to
be used within the solver. Unfortunately a fully developed code of this type
is not easy to come by. The alternative is to transform (5.18) into a regular
QP by the addition of extra variables. For example a constraint l ≤ c(x) ≤ u
can be written as l ≤ c(x) − v + w ≤ u where v ≥ 0 and w ≥ 0 are auxiliary
variables, and a penalty term of the form $\sigma\|v + w\|_1$, which is linear in
v and w, would then be appropriate. However 2m extra variables need to be
added, which is cumbersome, and the benefit of the piecewise quadratic line
search is not obtained. A related idea is to use the SLP-EQP idea of Fletcher
and Sainz de la Maza [22], referred to in Section 4.5. In this case an l1 LP
subproblem would be used to find an active set and multipliers, followed by
an EQP calculation to obtain the step d(k) . As above, the l1 LP subproblem
can be converted to an LP problem by the addition of extra variables, and
this allows fast large scale LP software to be used.
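The transformation rests on the identity that any residual vector r splits as r = v − w with v, w ≥ 0 complementary, whereupon $\|r\|_1 = e^T(v + w)$ is linear in the new variables. A minimal check with hypothetical data:

```python
import numpy as np

def elastic_split(r):
    """Split r = v - w with v, w >= 0 complementary, so that the l1 norm
    of r becomes the linear function e'(v + w) of the extra variables."""
    v = np.maximum(r, 0.0)
    w = np.maximum(-r, 0.0)
    return v, w

r = np.array([0.3, -1.2, 0.0, 2.0])   # residuals of linearized constraints
v, w = elastic_split(r)
```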
It is also not easy for the user to choose a satisfactory value of σ. If
it is chosen too small, then a local minimizer may not be created, if too
large then the difficulties referred to above become apparent. There is also a
possibility of the Maratos effect [34] occurring, in which, close to the solution,
the Newton-type step given by the SQP method increases φ and cannot be
accepted if monotonic improvement in φ is sought. Thus the expected rapid
local convergence is not realised. More recently, ideas for circumventing these
difficulties have been suggested, including second order corrections [12], the
watchdog technique [7], and a non-monotonic line search [27].
6 Filter Methods
more often. In this section we describe the main ideas and possible pitfalls,
and discuss the way in which a global convergence result for a filter method
has been constructed.
A penalty function is an artefact to combine two competing aims in NLP,
namely the minimization of f (x) and the need to obtain feasibility with
respect to the constraints. The latter aim can equivalently be expressed as the
minimization of some measure h(c(x)) of constraint violation. For example,
in the context of (5.11) we could define $h(c) = \|c^+\|$ in some convenient norm.
Thus, in a filter method, we view NLP as the resolution of two competing
aims of minimizing f (x) and h(c(x)). This is the type of situation addressed
by Pareto (multi-objective) optimization, but in our context the minimization
of h(c(x)) has priority, in that it is essential to find a Pareto solution that
corresponds to a feasible point. However it is useful to borrow the concept of
domination from multi-objective optimization. Let x(k) and x(l) be two points
generated during the progress of some method. We say that x(k) dominates
x(l) if and only if h(k) ≤ h(l) and f (k) ≤ f (l) . That is to say, there is no
reason to prefer x(l) on the basis of either measure. Now we define a filter
to be a list of pairs (h(k) , f (k) ) such that no pair dominates any other. As
the algorithm progresses, a filter is built up from all the points that have
been sampled by the algorithm. A typical filter is shown in Figure 2, where
the shaded region shows the region dominated by the filter entries (the outer
vertices of this shaded region). The contours of the l1 exact penalty function
would be straight lines with slope −σ on this plot, indicating that at least
for a single entry, the filter provides a less restrictive acceptance condition
than the penalty function.
[Figure 2: a typical filter in the (h(x), f(x)) plane, with an upper bound U on h(x); the shaded region is that dominated by the filter entries.]
Filter methods were first introduced in the context of trust region SQP meth-
ods in 1997 by Fletcher and Leyffer [18], making use of a subproblem
$$\mathrm{QP}^{(k)}(\rho)\quad\left\{\begin{array}{ll} \displaystyle\min_{d\in\mathbb{R}^n} & \tfrac12 d^T W^{(k)} d + d^T g^{(k)} \\[4pt] \text{subject to} & c^{(k)} + A^{(k)T} d \ge 0, \\ & \|d\|_\infty \le \rho, \end{array}\right.$$
A point is acceptable if, for each filter entry $(h_i, f_i)$, either $h \le \beta h_i$ or
$f \le f_i - \gamma h$ holds, where β and γ are constants in (0, 1). Typical values might be β = 0.9
and γ = 0.01. (In earlier work we used $f \le f_i - \gamma h_i$ for the second test,
giving a rectangular envelope. However, this allows the possibility that (h, f)
dominates $(h_i, f_i)$ but the envelope of (h, f) does not dominate the envelope
of $(h_i, f_i)$, which is undesirable.)
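The filter operations are straightforward to implement. The following sketch (an illustration, not the authors' code) applies the sloping envelope test, with constants β and γ as above, against every entry and prunes dominated entries on insertion:

```python
class Filter:
    """A filter: a list of pairs (h, f) of which none dominates another.
    The beta and gamma values are illustrative."""
    def __init__(self, beta=0.9, gamma=0.01):
        self.beta, self.gamma = beta, gamma
        self.entries = []

    def acceptable(self, h, f):
        # must improve sufficiently on every entry, in either h or f
        return all(h <= self.beta * hi or f <= fi - self.gamma * h
                   for hi, fi in self.entries)

    def add(self, h, f):
        # discard entries dominated by the newcomer, then insert it
        self.entries = [(hi, fi) for hi, fi in self.entries
                        if hi < h or fi < f]
        self.entries.append((h, f))

flt = Filter()
flt.add(1.0, 10.0)
flt.add(0.5, 12.0)                 # neither entry dominates the other
ok = flt.acceptable(0.3, 11.0)     # inside neither envelope: accepted
bad = flt.acceptable(0.95, 11.0)   # fails both tests against (1.0, 10.0)
flt.add(0.1, 9.0)                  # dominates both existing entries
```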
During testing of the filter algorithm, another less obvious disadvantage
became apparent. Say the current filter contains an entry (0, fi ) where fi is
relatively small. If, subsequently, feasibility restoration is invoked, it may be
impossible to find an acceptable point which is not dominated by (0, fi ). Most
likely, the feasibility restoration phase then converges to a feasible point that
is not a KT point. We refer to (0, fi ) as a blocking entry. We were faced with
two possibilities. One is to allow the removal of blocking entries on emerging
from feasibility restoration. This we implemented in the first code, reported
by Fletcher and Leyffer [18]. To avoid the possibility of cycling, we reduce the
upper bound when a blocking entry is removed. We did not attempt to pro-
vide a global convergence proof for this code, although it may well be possible
to do so. Subsequently it became clear that other heuristics in the code were
redundant and further work resulted in a related filter algorithm (Fletcher,
Leyffer and Toint [20]) for which a convergence proof can be given. In this
algorithm we resolve the difficulty over blocking by not including all accepted
points in the filter. This work is described in the next section. However, the
earlier code proved very robust, and has seen widespread use. It shows up
quite well on the numbers of function and derivative counts required to solve
a problem, in comparison say with SNOPT. Actual computing times are less
competitive, probably because the QP solver used by SNOPT is more effi-
cient. It has the same disadvantage as SNOPT that it is inefficient for large
null space problems. Otherwise, good results were obtained in comparison
with LANCELOT and an implementation of the l1 EPF method.
$$\Delta f \ge \sigma \Delta q \qquad (6.4)$$
[Fig. 4, A Filter–SQP Algorithm, is a flowchart which may be summarized as follows. Initialize ρ ≥ ρ◦ and attempt to solve QP(k)(ρ). If the subproblem is incompatible, include (h(k), f(k)) in the filter (an h-type iteration), set k := k + 1, and enter the restoration phase to find a point x(k) acceptable to the filter such that QP(k)(ρ̃) is compatible for some ρ̃ ≥ ρ◦, initializing ρ = ρ̃. If the subproblem yields d = 0, finish: x(k) is a KT point. Otherwise evaluate f(x(k) + d) and c(x(k) + d). If x(k) + d is not acceptable to the filter and (h(k), f(k)), or if Δf < σΔq and Δq > 0, set ρ := ρ/2 and re-solve the subproblem. Otherwise set ρ(k) = ρ, d(k) = d, Δq(k) = Δq and Δf(k) = Δf; if Δq(k) ≤ 0, include (h(k), f(k)) in the filter (an h-type iteration); finally set x(k+1) = x(k) + d(k), k := k + 1, and repeat.]
Another filter SQP algorithm, analysed by Fletcher, Gould, Leyffer, Toint and
Wächter [16], decomposes the SQP step into a normal and tangential com-
ponent. The normal step provides feasibility for the linearized constraints
and the tangential step minimizes the quadratic model in the feasible re-
gion. Related ideas are discussed by Gonzaga, Karas and Varga [26] and
Ribeiro, Karas and Gonzaga [42]. Wächter and Biegler describe line search
filter methods in [46] and [47]. Chin [5] and Chin and Fletcher [6] consider
SLP-EQP trust region filter methods. Gould and Toint [28] present a non-
monotone filter SQP method which extends the non-monotonic properties of
filter SQP type algorithms. A review of other recent developments of filter
methods, outwith SQP, but including interior point methods for NLP, ap-
pears in the SIAG/OPT activity group newsletter (March 2007) and can be
accessed in [21].
described using entities introduced by the keywords param for fixed parame-
ters, and set for multivalued set constructions. An example which illustrates
all the features is to solve the HS72 problem in CUTE. Here the model is spec-
ified in a file hs72.mod. Note that upper and lower case letters are different:
here we have written user names in upper case, but that is not necessary. All
AMPL commands are terminated by a semicolon.
hs72.mod
set ROWS = {1..2};
set COLUMNS = {1..4};
param A {ROWS, COLUMNS};
param B {ROWS};
var X {COLUMNS} >= 0.001;
minimize OBJFN: 1 + sum {j in COLUMNS} X[j];
subject to
CONSTR {i in ROWS}: sum {j in COLUMNS}
A[i,j]/X[j] <= B[i];
UBD {j in COLUMNS}: X[j] <= (5-j)*1e5;
In this model, the sets are just simple ranges, like 1..4 (i.e. 1 up to 4). We could
have shortened the program by deleting the set declaration and replacing
ROWS by 1..2 etc., in the rest of the program. But the use of ROWS and COLUMNS
is more descriptive. The program defines a vector of variables X, and the data
is the matrix A and vector B which are parameters. Simple lower bounds on
the variables are specified in the var statement, and sum is a construction
which provides summation. The constraint CONSTR implements the system of
inequalities $\sum_{j=1}^{4} a_{i,j}/x_j \le b_i$, $i = 1, 2$. Note that indices are given within
square brackets in AMPL. Constructions like j in COLUMNS are referred to
as indexing in the AMPL syntax. The constraints in UBD define upper bounds
on the variables which depend upon j. Note that the objective function and
each set of constraint functions must be given a name by the user.
The data of the problem is specified in the file hs72.dat. Note the use of
tabular presentation for elements of A and B.
hs72.dat
param A: 1 2 3 4 :=
1 4 2.25 1 0.25
2 0.16 0.36 0.64 0.64;
param B :=
1 0.0401
2 0.010085;
The next stage is to fire up the AMPL system on the computer. This will
result in the user receiving the AMPL prompt ampl:. The programming
session to solve HS72 would proceed as follows.
An AMPL session
ampl: model hs72.mod;
ampl: data hs72.dat;
ampl: let {j in COLUMNS} X[j] := 1;
ampl: solve;
ampl: display OBJFN;
ampl: display X;
The model and data keywords read in the model and data. The keyword
let allows assignment of initial values to the variables, and the display
commands initiate output to the terminal. Output from AMPL (not shown
here) would be interspersed between the user commands. It is also possible
to aggregate data and, if required, programming, into a single hs72.mod
file. In this case the data must follow the model, and must be preceded by
the statement data;. One feature to watch out for occurs when revising the
model. In this case, repeating the command model hs72.mod; will add the
new text to the database, rather than overwrite it. To remove the previous
model from the database, the command reset; should first be given.
[Figure: the NORTH–LAKE–MAIN distribution network.]
by the constructions shown on page 207. Note the use of the keyword union
to merge the nodes with and without power generation. Also observe the
use of cross, which indicates all possible connections between the nodes,
and within which indicates that the actual network is a subset of these.
In fact the operator cross has higher priority than within so the brackets
around the cross construction are not necessary. The user is using the con-
vention that power flows from the first node to the second node (a negative
value of flow is allowed and would indicate flow in the opposite sense). The
program also shows the use of dependent parameters and variables. Thus
the parameters ZSQ depend on R and X, and C and D both depend on R, X
and ZSQ. It is necessary that the order in which these statements are given
reflects these dependencies. The true variables in the problem (as shown here)
are V, THETA and PG. Additional variables which depend on these variables,
and also on the parameters, are PSI and P, as defined by the expressions
which follow the ‘=’ sign. The objective is to minimize the sum of gener-
ated power. Constraints include power balance constraints at consumer nodes
and generator nodes, the latter including a term for power generation. Note
the use of P[j,i] for power entering node i and P[i,j] for power exiting
node i. The program also provides a useful illustration of how to supply
data for network problems, and the use of the # sign for including com-
ments. Note also the use of standard functions such as sin and cos in the
expressions. The program shown is only part of the full model, which would
include flow of reactive power, and upper and lower bounds on various of the
variables.
The following problem is due to Bob Vanderbei (who gives many interesting
AMPL programs: search for vanderbei princeton on Google and click on
LOQO). A rocket starts at time t = 0, position x(0) = 0 and velocity v(0) = 0.
It must reach position x(T ) = 100, also with velocity v(T ) = 0. We shall
divide the total time T into n intervals of length h and use finite difference
approximations
$$v_i = \frac{x_{i+\frac12} - x_{i-\frac12}}{h} \qquad\text{and}\qquad a_i = \frac{v_{i+\frac12} - v_{i-\frac12}}{h}$$
for velocity and acceleration. The maximum velocity is 5 units and the ac-
celeration must lie within ±1 units. The aim is to minimize the total time
T = nh. The AMPL program is
A Rocket Problem
param n > 2;
set vrange = {0.5..n-0.5 by 1};
set arange = {1..n-1};
var x {0..n}; var v {vrange} <= 5;
var a {arange} <= 1, >= -1;
var h;
minimize T: n*h;
subject to
xdiff {i in vrange}: x[i+0.5]-x[i-0.5]=h*v[i];
vdiff {i in arange}: v[i+0.5]-v[i-0.5]=h*a[i];
x0: x[0] = 0; xn: x[n] = 100;
v0: v[1.5] = 3*v[0.5]; # Implements v0 = 0
vn: v[n-1.5] = 3*v[n-0.5]; # Implements vn = 0
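Before handing this model to a solver, it is worth knowing the answer it should approach. The continuous-time optimum is bang-bang with a cruise phase at the velocity limit, and a short calculation (a check outside AMPL, not part of the model) gives T = 25:

```python
# Accelerate at a = +1 until v = 5, cruise at v = 5, then decelerate
# at a = -1 back to rest, covering a total distance of 100.
vmax, amax, dist = 5.0, 1.0, 100.0
t_ramp = vmax / amax                  # duration of each ramp phase
d_ramp = 0.5 * amax * t_ramp ** 2     # distance covered in one ramp
t_cruise = (dist - 2.0 * d_ramp) / vmax
T = 2.0 * t_ramp + t_cruise           # minimum total time
```

As the discretization is refined, the optimal value n*h reported for the AMPL model should approach this figure.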
This illustrates the use of three suffix quantities (TRIANGLES), and the
selection of all two suffix entities (EDGES) using setof. Shape functions are
elementary functions defined on triangles taking the value 1 at one node and
0 at the others. Note also the use of diff for set difference, and the allowed
use of underscore within an identifier.
AMPL contains many more useful constructions which we do not have space
to mention here. Purchasing a copy of the manual is essential! Worthy of mention
however is the existence of for and if then else constructions. This
can be very useful at the programming stage. An if then else construc-
tion is also allowed within a model but should be used with care, because it
usually creates a nonsmooth function which many methods are not designed
to handle. The same goes for abs and related nonsmooth functions. Another
useful feature for creating loops at the programming stage is the repeat
construction.
<model><![CDATA[
Insert Model Here
]]></model>
<data><![CDATA[
Insert Data Here
]]></data>
<commands><![CDATA[
Insert Programming Here
]]></commands>
<comments><![CDATA[
Insert Any Comments Here
]]></comments>
</document>
An advantage of using NEOS from Kestrel (or by email as above) is that the
restriction in size no longer applies. A disadvantage is that the response of
the NEOS server can be slow at certain times of the day.
References
45. Ph. L. Toint, On sparse and symmetric updating subject to a linear equation, Math.
Comp., 31, 1977, pp. 954–961.
46. A. Wächter and L. T. Biegler, Line search filter methods for nonlinear programming:
Motivation and global convergence, SIAM J. Optimization, 16, 2005, pp. 1–31.
47. A. Wächter and L. T. Biegler, Line search filter methods for nonlinear programming:
Local convergence, SIAM J. Optimization, 16, 2005, pp. 32–48.
48. R. B. Wilson, A simplicial algorithm for concave programming, Ph.D. dissertation,
Harvard Univ. Graduate School of Business Administration, 1960.