Sequential Quadratic Programming: Acta Numerica January 1995

Download as pdf or txt
Download as pdf or txt
You are on page 1of 53

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/230872679

Sequential Quadratic Programming

Article  in  Acta Numerica · January 1995


DOI: 10.1017/S0962492900002518 · Source: CiteSeer

CITATIONS READS
1,017 1,980

2 authors, including:

Paul Boggs

71 PUBLICATIONS   3,189 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Structure-preserving model reduction View project

All content following this page was uploaded by Paul Boggs on 23 January 2015.

The user has requested enhancement of the downloaded file.


Acta Numerica (1996), pp. 1{000

Sequential Quadratic Programming 


Paul T. Boggs

Applied and Computational Mathematics Division


National Institute of Standards and Technology
Gaithersburg, Maryland 20899
USA
Email: boggs@cam.nist.gov
Jon W. Tolle

Departments of Mathematics and Operations Research


University of North Carolina
Chapel Hill, North Carolina 27599
USA
Email: tolle@math.unc.edu
CONTENTS

1 Introduction 1
2 The Basic SQP Method 4
3 Local Convergence 12
4 Merit Functions and Global Convergence 27
5 Global to Local Behavior 37
6 SQP Trust Region Methods 40
7 Practical Considerations 43
References 48

1. Introduction

Since its popularization in the late 1970s, Sequential Quadratic Program-


ming (SQP) has arguably become the most successful method for solving
nonlinearly constrained optimization problems. As with most optimization
methods, SQP is not a single algorithm, but rather a conceptual method
from which numerous speci c algorithms have evolved. Backed by a solid
theoretical and computational foundation, both commercial and public do-
main SQP algorithms have been developed and used to solve a remarkably
large set of important practical problems. Recently large scale versions have
been devised and tested with promising results.
In this paper we examine the underlying ideas of the SQP method and the
theory that establishes it as a framework from which e ective algorithms can
 Contribution of the National Institute of Standards and Technology and not subject to
copyright in the United States.
2 Boggs and Tolle

be derived. In the process we will describe the most popular manifestations


of the method, discuss their theoretical properties, and comment on their
practical implementations.
The nonlinear programming problem to be solved is
minimize
x f (x)
subject to: h(x) = 0 (NLP)
g (x)  0
where f : Rn ! R, h : Rn ! Rm , and g : Rn ! Rp . Such problems arise in
a variety of applications in science, engineering, industry, and management.
In the form (NLP) the problem is quite general; it includes as special cases
linear and quadratic programs in which the constraint functions, h and g,
are ane and f is linear or quadratic. While these problems are important
and numerous, the great strength of the SQP method is its ability to solve
problems with nonlinear constraints. For this reason it is assumed that
(NLP) contains at least one nonlinear constraint function.
The basic idea of SQP is to model (NLP) at a given approximate solu-
tion, say xk , by a quadratic programming subproblem, and then to use the
solution to this subproblem to construct a better approximation xk+1 . This
process is iterated to create a sequence of approximations that, it is hoped,
will converge to a solution x . Perhaps the key to understanding the per-
formance and theory of SQP is the fact that, with an appropriate choice of
quadratic subproblem, the method can be viewed as the natural extension
of Newton and quasi-Newton methods to the constrained optimization set-
ting. Thus one would expect SQP methods to share the characteristics of
Newton-like methods, namely, rapid convergence when the iterates are close
to the solution but possible erratic behavior that needs to be carefully con-
trolled when the iterates are far from a solution. While this correspondence
is valid in general, the presence of constraints makes both the analysis and
implementation of SQP methods signi cantly more complex.
Two additional properties of the SQP method should be pointed out.
First, SQP is not a feasible-point method; that is, neither the initial point
nor any of the subsequent iterates need be feasible (a feasible point satis es
all of the constraints of (NLP)). This is a major advantage since nding a
feasible point when there are nonlinear constraints may be nearly as hard
as solving (NLP) itself. SQP methods can be easily modi ed so that lin-
ear constraints, including simple bounds, are always satis ed. Second, the
success of the SQP methods depends on the existence of rapid and accurate
algorithms for solving quadratic programs. Fortunately, quadratic programs
are easy to solve in the sense that there are a good procedures for their so-
lution. Indeed, when there are only equality constraints the solution to a
quadratic program reduces to the solution of a linear system of equations.
Sequential Quadratic Programming 3
When there are inequality constraints a sequence of systems may have to be
solved.
A comprehensive theoretical foundation does not guarantee that a pro-
posed algorithm will be e ective in practice. In real-life problems hypotheses
may not be satis ed, certain constants and bounds may not be computable,
matrices may be numerically singular, or the scaling may be poor. A suc-
cessful algorithm needs adaptive safeguards that deal with these pathologies.
The algorithmic details to overcome such diculties, as well as more mun-
dane questions|how to choose parameters, how to recognize convergence,
and how to carry out the numerical linear algebra|are lumped under the
term \implementation". While a detailed treatment of this topic is not pos-
sible here, we will take care to point out questions of implementation that
pertain speci cally to the SQP method.
This survey is arranged as follows. In Section 2 we state the basic SQP
method along with the assumptions about (NLP) that will hold throughout
the paper. We also make some necessary remarks about notation and ter-
minology. Section 3 treats local convergence, that is, behavior of the iterates
when they are close to the solution. Rates of convergence are provided both
in general terms and for some speci c SQP algorithms. The goal is not to
present the strongest results but to establish the relation between Newton's
method and SQP, to delineate the kinds of quadratic models that will yield
satisfactory theoretical results, and to place current variations of the SQP
method within this scheme.
The term global is used in two di erent contexts in nonlinear optimiza-
tion and is often the source of confusion. An algorithm is said to be globally
convergent if, under suitable conditions, it will converge to some local so-
lution from any remote starting point. Nonlinear optimization problems
can have multiple local solutions; the global solution is that local solution
corresponding to the least value of f . SQP methods, like Newton's method
and steepest descent, are only guaranteed to nd a local solution of (NLP);
they should not be confused with algorithms for nding the global solution,
which are of an entirely di erent avor.
To establish global convergence for constrained optimization algorithms,
a way of measuring progress towards a solution is needed. For SQP this is
done by constructing a merit function, a reduction in which implies that an
acceptable step has been taken. In Section 4, two standard merit functions
are de ned and their advantages and disadvantages for forcing global con-
vergence are considered. In Section 5, the problems of the transition from
global to local convergence are discussed. In Sections 4 and 5 the emphasis
is on line search methods. Because trust region methods, which have been
found to be e ective in unconstrained optimization, have been extended to
t into the SQP framework; a brief description of this approach is given in
4 Boggs and Tolle

Section 6. Section 7 is devoted to implementation issues, including those


associated with large scale problems.
To avoid interrupting the ow of the presentation, comments on related
results and references to the literature are provided at the end of each sec-
tion. The number of papers on the SQP method and its variants is large
and space prohibits us from compiling a complete list of references; we have
tried to give enough references to each topic to direct the interested reader
to further material.
1.1. Notes and References
The earliest reference to SQP-type algorithms seems to have been in the
PhD thesis of Wilson (1963) at Harvard University, in which he proposed the
method we call in Section 3 the Newton-SQP algorithm. The development
of the secant or variable-metric algorithms for unconstrained optimization
in the late 1960's and early 1970's naturally led to the extension of these
methods to the constrained problem. The initial work on these methods
was done by Mangasarian and his students at the University of Wisconsin.
Garcia-Palomares and Mangasarian (1976) investigated an SQP-type algo-
rithm in which the entire Hessian matrix of the Lagrangian, i.e., the matrix
of second derivatives with respect to both to x and the multipliers, was
updated at each step. Shortly thereafter, Han (Han 1976) and (Han 1977)
provided further impetus for the study of SQP methods. In the rst paper
Han gave local convergence and rate of convergence theorems for the PSB-
and BFGS-SQP algorithms for the inequality-constrained problem and, in
the second, employed the `1 merit function to obtain a global convergence
theorem in the convex case. In a series of papers presented at conferences,
(Powell 1977), (Powell 1978a), and (Powell 1978b), Han's work was brought
to the attention of the general optimization audience. From that time there
has been a continuous production of research papers on the SQP method.
As noted, a signi cant advantage of SQP is that feasible points are not
required at any stage of the process. Nevertheless, a version of SQP that
always remains feasible has been developed and studied, for example, by
Bonnans, Panier, Tits and Zhou (1992).
2. The Basic SQP Method

2.1. Assumptions and Notation


As with virtually all nonlinear problems, it is necessary to make some as-
sumptions on the problem (NLP) that clarify the class of problems for which
the algorithm can be shown to perform well. These assumptions, as well as
the consequent theory of nonlinear programming, are also needed to de-
scribe the SQP algorithm itself. In this presentation we do not attempt to
Sequential Quadratic Programming 5
make the weakest assumptions possible, but rather provide what we consider
reasonable assumptions under which the main ideas can be illustrated.
We make the blanket assumption that all of the functions in (NLP) are
three times continuously di erentiable. We denote the gradient of a scalar-
valued function by r, e.g., rf (x). (Here, and throughout the paper, all
vectors are assumed to be column vectors; we use the superscript t to denote
transpose.) For vector-valued functions we also use r to denote the Jacobian
of the function. For example,
rh(x) = (rh1(x); rh2(x); : : :; rhm (x)) :
The Hessian of a scalar-valued function, denoted by the letter H , is de ned
to be the symmetric matrix whose (i; j )th component is
Hf (x) = @ f (x) :
2
i;j @ xi @ xj
Where a function is de ned on more than one set of variables, such as the
Lagrangian function de ned below, the di erentiation operators r and H
will refer to di erentiation with respect to x only.
A key function, one that plays a central role in all of the theory of con-
strained optimization, is the scalar-valued Lagrangian function de ned by
L(x; u; v) = f (x) + uth(x) + vtg(x) (2:1)
where u 2 Rm and v 2 Rp are the multiplier vectors. Given a vector x, the
set of active constraints at x consists of the inequality constraints, if any,
satis ed as equalities at x. We denote the index set of active constraints by
I (x) = fi : gi(x) = 0g:
The matrix G(x) made up of the matrix rh(x) along with the columns
rgi(x); i 2 I (x) will be important in describing the basic assumptions
and carrying out the subsequent analyses. Assuming that the matrix G(x)
has full column rank, the null space of G(x)t de nes the tangent space to
the equality and active inequality constraints at x. The projection onto this
tangent space can be written as
 ?1
P (x) = I ? G(x) G(x)tG(x) G(x)t: (2:2)
The corresponding projection onto the range space of G(x) will be written
Q(x) = I ? P (x): (2:3)
For convenience, these projection matrices evaluated at iterates xk and at a
solution x will be denoted by P k , P  , Qk , and Q. Similarly, we will write
H L = H L(x ; u; v)
throughout the remainder of this paper.
6 Boggs and Tolle

In this paper x will represent any particular local solution of (NLP). We


assume that the following conditions apply to each such solution.
A1: The rst order necessary conditions hold, i.e., there exist optimal mul-
tiplier vectors u and v  0 such that
rL(x; u; v) = rf (x) + rh(x)u + rg(x)v = 0:
A2: The columns of G(x) are linearly independent.
A3: Strict complementary slackness holds, i.e.,
gi (x ) vi = 0
for i = 1; : : :; p and, if gi (x) = 0, then vi > 0.
A4: The Hessian of the Lagrangian function with respect to x is positive
de nite on the null space of G(x)t ; i.e.,
dtH Ld > 0
for all d 6= 0 such that G(x )t d = 0:
The above conditions, sometimes called the strong second order sucient
conditions, in fact guarantee that x is an isolated local minimum of (NLP)
and that the optimal multiplier vectors u and v  are unique. It should be
noted that without strong additional assumptions (NLP) may have multiple
local solutions.
We use the term critical point to denote a feasible point that satis es the
rst order necessary conditions A1. A critical point may or may not be a
local minimum of (NLP).
In discussing convergence of SQP algorithms the asymptotic rate of con-
vergence plays an important role. Three standard measures of convergence
rates will be emphasized in this paper. In the de nitions that follow, and
throughout the paper, the norm kk will refer to the 2-norm unless speci -
cally noted otherwise. Other norms can be used for most of the analysis.
De nition 2.1 Let fxk g be a sequence converging to x. The sequence is
said to converge linearly if there exists a positive constant  < 1 such that


xk+1 ? x   xk ? x
for all k suciently large; superlinearly if there exists a sequence of positive
constants k ! 0 such that


xk+1 ? x  k xk ? x
for all k suciently large; and quadratically if there exists a positive constant
 such that
2

xk+1 ? x   xk ? x
Sequential Quadratic Programming 7
for all k suciently large.
These rates, which measure improvement at each step, are sometimes
referred to as Q-rates. We will also have occasion to state results in terms of
a measure of the average rate of convergence called the R-rate. A sequence

fx g will be said to converge R-linearly if the sequence f xk ? x g is
k
bounded by a sequence that converges Q-linearly to zero. Similar de nitions
exist for R-superlinear and R-quadratic. There are two relations between
R-rate and Q-rate convergence that are of importance for this paper. First,
m-step Q-linear convergence (where k + 1 is replaced by k + m in the above
de nitions) implies an R-linear rate of convergence and, second, a Q-rate of
convergence of a sequence of vectors implies the same R-rate (but not the
same Q-rate) of convergence of its components. In the analyses that follow,
rates not designated explicitly as Q- or R-rates are assumed to be Q-rates.
2.2. The Quadratic Subproblem
As suggested in the Introduction the SQP method is an iterative method in
which, at a current iterate xk , the step to the next iterate is obtained through
information generated by solving a quadratic subproblem. The subproblem
is assumed to re ect in some way the local properties of the original problem.
The major reason for using a quadratic subproblem, i.e., a problem with a
quadratic objective function and linear constraints, is that such problems
are relatively easy to solve and yet, in their objective function, can re ect
the nonlinearities of the original problem. The technical details of solving
quadratic programs will not be dealt with here, although the algorithmic
issues involved in solving these quadratic subproblems are nontrivial and
their resolution will a ect the overall performance of the SQP algorithm.
Further comments on this are made in Section 7.
A major concern in SQP methods is the choice of appropriate quadratic
subproblems. At a current approximation xk a reasonable choice for the
constraints is a linearization of the actual constraints about xk . Thus the
quadratic subproblem will have the form
minimize
dx (rk )t dx + 21 dx t Bk dx
subject to: rh(xk )t dx + h(xk ) = 0
rg(xk )tdx + g(xk )  0
where dx = x ? xk . The vector rk and the symmetric matrix Bk remain to
be chosen.
The most obvious choice for the objective function in this quadratic pro-
gram is the local quadratic approximation to f at xk . That is, Bk is taken
as the Hessian and rk as the gradient of f at xk . While this is a reasonable
approximation to use if the constraints are linear, the presence of nonlin-
8 Boggs and Tolle

ear constraints makes this choice inappropriate. For example, consider the
problem
minimize
x ?x1 ? 12 (x2)2
subject to: (x1 )2 + (x2)2 ? 1 = 0:
The point (1; 0) is a solution satisfying A1{A4, but at the point (1 + ; 0)
the approximating quadratic program is (with d replacing dx )
minimize
d ?d1 ? 12 (d2)2
subject to: d1 = ? 21 (2 + )(1 + )?1
which is unbounded regardless of how small  is. Thus the algorithm that
computes d breaks down for this problem.
To take nonlinearities in the constraints into account while maintaining
the linearity of the constraints in the subproblem, the SQP method uses a
quadratic model of the Lagrangian function as the objective. This can be
justi ed by noting that conditions A1{A4 imply that x is a local minimum
for the problem
minimize
x L(x; u; v)
subject to: h(x) = 0
g(x)  0:
Note that the constraint functions are included in the objective function for
this equivalent problem. Although the optimal multipliers are not known,
approximations uk and v k to the multipliers can be maintained as part of the
iterative process. Then given a current iterate, (xk ; uk ; vk ), the quadratic
Taylor series approximation in x for the Lagrangian is
L(xk ; uk ; vk ) + rL(xk ; uk ; vk )tdx + dxtH L(xk ; uk ; vk )dx:
1
2

A strong motivation for using this function as the objective function in the
quadratic subproblem is that it generates iterates that are identical to those
generated by Newton's method when applied to the system composed of the
rst order necessary condition (condition A1) and the constraint equations
(including the active inequality constraints). This means that the resulting
algorithm will have good local convergence properties. In spite of these local
convergence properties there are good reasons to consider choices other than
the actual Hessian of the Lagrangian, for example approximating matrices
that have properties that permit the quadratic subproblem to be solved at
any xk and the resulting algorithm to be amenable to a global convergence
analysis. Letting Bk be an approximation of H L(xk ; uk ; vk ), we can write
Sequential Quadratic Programming 9
the quadratic subproblem as:
minimize
dx rL(xk ; uk ; vk )tdx + 21 dxtBk dx
subject to: rh(xk )t dx + h(xk ) = 0 (2:4)
rg(x ) dx + g(x )  0:
k t k

The form of the quadratic subproblem most often found in the literature,
and the one that will be employed here, is
minimize
dx rf (xk )tdx + 12 dxtBk dx
subject to: rh(xk )t dx + h(xk ) = 0 (QP)
rg(x ) dx + g(x )  0:
k t k

These two forms are equivalent for problems with only equality constraints
since, by virtue of the linearized constraints, the term rh(xk )tdx is constant
and the objective function becomes rf (xk )t dx + 21 dx t Bk dx . The two sub-
problems are not quite equivalent in the inequality-constrained case unless
the multiplier estimate vk is zero for all inactive linear constraints. However
(QP) is equivalent to (2:4) for the slack-variable formulation of (NLP) given
by
minimize
x; z f (x)
subject to: h(x) = 0 (2:5)
g(x) + z = 0
z  0
where z 2 Rp is the vector of slack variables. Therefore (QP) can be con-
sidered an appropriate quadratic subproblem for (NLP).
The solution dx of (QP) can be used to generate a new iterate xk+1 , by
taking a step from xk in the direction of dx . But to continue to the next
iteration new estimates for the multipliers are needed. There are several
ways in which these can be chosen, but one obvious approach is to use the
optimal multipliers of the quadratic subproblem. (Other possibilities will be
discussed where appropriate.) Letting the optimal multipliers of (QP) be
denoted by uqp and vqp , and setting
du = uqp ? uk
dv = v qp ? v k
allow the updates of (x,u,v) to be written in the compact form
xk+1 = xk + dx
uk+1 = uk + du (2:6)
vk+1 = vk + dv
for some selection of the steplength parameter . Once the new iterates
10 Boggs and Tolle

are constructed, the problem functions and derivatives are evaluated and a
prescribed choice of Bk+1 calculated. It will be seen that the e ectiveness
of the algorithm is determined in large part by this choice.

2.3. The Basic Algorithm


Since the quadratic subproblem (QP) has to be solved to generate steps in
the algorithm, the rst priority in analyzing an SQP algorithm is to deter-
mine conditions that guarantee that (QP) has a solution. To have a solution
the system of constraints of (QP) must have a nonempty feasible set (i.e.,
the system must be consistent) and the quadratic objective function should
be bounded below on that set (although a local solution can sometimes exist
without this condition). The consistency condition can be guaranteed when
xk is in a neighborhood of x by virtue of Assumption A2 but, depending
on the problem, may fail at nonlocal points. Practical means of dealing
with this possibility will be considered in Section 7. An appropriate choice
of Bk will assure that a consistent quadratic problem will have a solution;
the discussion of this point will be a signi cant part of the analysis of the
next two sections.
Assuming that (QP) can be solved, the question of whether the sequence
generated by the algorithm will converge must then be resolved. As de-
scribed in the Introduction, convergence properties generally are classi ed
as either local or global. Local convergence results proceed from the as-
sumptions that the initial x-iterate is close to a solution x and the initial
Hessian approximation is, in an appropriate sense, close to H L. Conditions
on the updating schemes that ensure that the xk (and the Bk ) stay close to
x (and H L ) are then formulated. These conditions allow one to conclude
that (QP) is a good model of (NLP) at each iteration and hence that the
system of equations de ned by the rst order conditions and the constraints
for (QP) are nearly the same as those for (NLP) at x. Local convergence
proofs can be modeled on the proof of convergence of the classical Newton's
method, which assumes = 1 in (2:6).
Convergence from a remote starting point is called global convergence. As
stated in the Introduction, to ensure global convergence the SQP algorithm
needs to be equipped with a measure of progress, a merit function , whose
reduction implies progress towards a solution. In order to guarantee that 
is reduced at each step a procedure for adjusting the steplength parameter
in (2:6) is required. Using the decrease in  it can be shown that under
certain assumptions the iterates will converge to a solution (or, to be precise,
a potential solution) even if the initial x-iterate, x0, is not close to a solution.
Local convergence theorems are based on Newton's method whereas global
convergence theorems are based on descent. Ideally an algorithm should
be such that as the iterates get close to the solution, the conditions for
Sequential Quadratic Programming 11
local convergence will be satis ed and the local convergence theory will ap-
ply without the need for the merit function. In general a global algorithm,
although ultimately forcing the x-iterates to get close to x , does not auto-
matically force the other local conditions (such as unit steplengths and close
Hessian approximations) to hold and therefore the merit function must be
used throughout the iteration process. Since there is no practical way to
know when, if ever, the local convergence conditions will hold an implemen-
tation's true performance can be deduced only from numerical experience.
These issues will be discussed more fully in Section 5.
With this background we can now state a basic SQP algorithm. This tem-
plate indicates the general steps only, without the numerous details required
for a general code.
Basic Algorithm
Given approximations (x0 ; u0 ; v0 ), B0 , and a merit function , set k = 0.
1. Form and solve (QP) to obtain (dx ; du; dv ).
2. Choose steplength so that
(xk + dx) < (xk ):
3. Set
xk+1 = xk + dx
uk+1 = uk + du
v k+1 = v k + dv :
4. Stop if converged.
5. Compute Bk+1 .
6. Set k := k + 1; go to 1.

2.4. Notes and References


The basic second order sucient conditions as well as a description of the
major theoretical ideas of nite-dimensional nonlinear constrained optimiza-
tion can be found in numerous sources. See, e.g., (Luenberger 1984), (Nash
and Sofer 1995), or (Gill, Murray and Wright 1981). The basic de nitions
for various rates of convergence and the relations among them can be found
in (Ortega and Rheinboldt 1970) and (Dennis and Schnabel 1983).
The trust region methods discussed in Section 6 use di erent quadratic
subproblems than those given here. Also Fukushima (1986) considers a
di erent quadratic subproblem in order to avoid the Maratos e ect discussed
in Section 5.
12 Boggs and Tolle

3. Local Convergence

In this section the theory of local convergence for the most common versions
of the SQP method is developed. The local convergence analysis establishes
conditions under which the iterates converge to a solution and at what rate,
given that that the starting data (e.g., x0 , u0 , v 0, B0 ) are suciently close to
the corresponding data at a solution, x . As will be seen, it is the method of
generating the matrices Bk that will be the determining factor in obtaining
the convergence results.
An SQP method can have excellent theoretical local convergence prop-
erties; quadratic, superlinear, or two-step superlinear convergence of the
x - iterates can be achieved by requiring the Bk to approximate H L in
an appropriate manner. Although each of these local convergence rates is
achievable in practice under certain conditions, the need for a globally con-
vergent algorithm further restricts the choice of the Bk . Certain choices
of the Bk have led to algorithms that extensive numerical experience has
shown to be quite acceptable. Therefore, while there is a gap between what
is theoretically possible and what has been achieved in terms of local con-
vergence, this discrepancy should not obscure the very real progress that
the SQP methods represent in solving nonlinearly constrained optimization
problems.
Two important assumptions simplify the presentation. First, it will be
assumed that the active inequality constraints for (NLP) at x are known. As
will be discussed in Section 5, this assumption can be justi ed for many of the
SQP algorithms because the problem (QP) at xk will have the same active
constraints as (NLP) at x when xk is near x. The fact that the active
constraints, and hence the inactive constraints, are correctly identi ed at xk
means that those inequality constraints that are inactive for (NLP) at x can
be ignored and those that are active can be changed to equality constraints
without changing the solution of (QP). Thus, under this assumption, only
equality-constrained problems need be considered for the local analysis. For
the remainder of this section (NLP) will be assumed to be an equality-
constrained problem and the reference to inequality multiplier vectors (the
vt g (x) term) will be eliminated from the Lagrangian. For reference, we
rewrite the quadratic subproblem with only equality constraints:
minimize
dx rf (xk )tdx + 12 dxtBk dx
(ECQP )
subject to: rh(xk )t dx + h(xk ) = 0:
The second assumption follows from the fact that the local convergence
for SQP methods, as for many optimization methods, is based on Newton's
method; that is, the convergence and rate of convergence results are ob-
tained by demonstrating an asymptotic relation between the iterates of the
method being analyzed and the classical Newton steps. Because Newton's
Sequential Quadratic Programming 13
method for a system with a nonsingular Jacobian at the solution requires
that unit steps be taken to obtain rapid convergence, we assume that the
merit function allows = 1 to be used in the update formulas (2.6). Simi-
lar results to those given in this section can be obtained if the steplengths
converge to one suciently quickly.
For the equality-constrained problem a good initial estimate of x can be
used to obtain a good estimate for the optimal multiplier vector. The rst
order necessary condition, A1, together with condition A2, leads to the
formula
u = ?[rh(x )t A rh(x )]?1 rh(x )t A rf (x ): (3:1)
for any nonsingular matrix A that is positive de nite on the null space of
rh(x)t. In particular, if A is taken to be the identity matrix, then (3:1)
de nes the least squares solution of the rst order necessary conditions. By
our smoothness assumptions, it follows that
u0 = ?[rh(x0 )t rh(x0)]?1 rh(x0 )trf (x0 ) (3:2)
can be made arbitrarily close to u by choosing x0 close to x . Consequently,
no additional assumption will be made about an initial optimal multiplier
estimate for the local convergence analysis.
Denoting the optimal multiplier for (ECQP) by uqp we see that the rst
order conditions for this problem are
Bk dx + rh(xk )uqp = ?rf (xk )
rh(xk )tdx = ?h(xk ):
If, as discussed in Section 2, we set
uk+1 = uqp = uk + du (3:3)
then the above equations become
Bk dx + rh(xk )du = ?rL(xk ; uk ) (3.4)
rh(x ) dx = ?h(x ):
k t k (3.5)
We are now in a position to begin the analysis of the SQP methods.

3.1. The Newton SQP Method


The straightforward SQP method derived from setting
Bk = H L(xk ; uk )
will be analyzed rst. Assuming that x0 is close to x it follows from (3.2)
that u0 can be presumed to be close to u and hence that H L(x0 ; u0) is
close to H L . The local convergence for the SQP algorithm now follows
14 Boggs and Tolle

from the application of Newton's method to the nonlinear system of equa-


tions obtained from the rst order necessary conditions and the constraint
equation:
(x; u) = rLh((xx;)u) = 0:
 

From assumptions A1 and A4 the Jacobian of this system at the solution,



H L  r h(x ) 
 
J (x ; u ) = rh(x)t ; (3:6)
0
is nonsingular. Therefore, the Newton iteration scheme
xk+1 = xk + sx
uk+1 = uk + su
where the vector s = (sx ; su ) is the solution to
J (xk ; uk ) s = ? (xk ; uk ); (3:7)
yields iterates that converge to (x ; u ) quadratically provided (x0; u0) is
suciently close to (x ; u ). The equations (3.7) are identical to equations
(3.4) and (3.5) with Bk = H L(xk ; uk ), dx = sx, and du = su . Consequently
the iterates (xk+1 ; uk+1 ) are exactly those generated by the SQP algorithm.
For this reason we call this version of the algorithm the (local) Newton SQP
method. The basic local convergence results are summarized in the following
theorem:
Theorem 3.1 Let x0 be an initial estimate of the solution to (NLP) and
let u0 be given by (3:2). Suppose that the sequence of iterates f(xk ; uk )g is
generated by
xk+1 = xk + dx
uk+1 = uk + du
where dx and uqp = uk + du are the solution and multiplier
of the quadratic
program (ECQP) with Bk = H L(xk ; uk ). Then, if x0 ? x is suciently
small, the sequence of iterates in (x; u)-space is well-de ned and converges
quadratically to the pair (x ; u ).
It is important to emphasize that this version of the SQP algorithm always
requires a unit step in the variables; there is no line search to try to determine
a better point. If a line search is used (the so-called damped Newton method)
then the rate of convergence of the iterates can be signi cantly decreased|
usually to linear.
In a sense, the Newton version of the SQP algorithm can be thought of
as the ideal algorithm for solving the nonlinear program: it provides rapid
convergence with no requirement for line searches and no necessity for the
Sequential Quadratic Programming 15
introduction of additional parameters. Of course, in this form it has little
practical value. It is often dicult to choose an initial point close enough
to the true solution to guarantee that the Newton algorithm will converge
to a minimum of f . When remote from the solution H L(xk ; uk ) cannot
be assumed to be positive de nite on the appropriate subspace and hence
a solution to the subproblem (ECQP) cannot be guaranteed to exist. The
value of this Newton SQP method is that it provides a standard against
which the local behavior of other versions can be measured. In fact, to obtain
superlinear convergence it is necessary and sucient that the steps aproach
the Newton steps as the solution is neared. Despite the disadvantages, the
use of H L(xk ; uk ) often makes (ECQP) an excellent model of (NLP) and
its use, combined with some of the techniques of Sections 6 and 7, can
lead to e ective algorithms. The actual computation of H L(xk ; uk ) can be
accomplished eciently by using nite di erence techniques or by automatic
di erentiation methods.
3.2. Conditions for Local Convergence
Here we discuss the theoretical properties of the approximation matrices,
Bk , that are sucient to guarantee that the local SQP method will give
Newton-like convergence results. In the following subsections the practical
attempts to generate matrices that satisfy these properties will be described.
In order to formulate conditions on Bk that will yield locally convergent
SQP algorithms, the following assumption on H L will be imposed.
A5: The matrix H L is nonsingular.
In light of A4 most of the results that follow could be extended to cover
the case where this condition is not satis ed, but the resulting complexity
of the arguments would clutter the exposition.
The following conditions on the matrix approximations will be referenced
in the remainder of the paper.
B1: The matrices Bk are uniformly positive de nite on the null spaces of
the matrices rh(xk )t , i.e.,there exists a 1 > 0 such that for each k
dtBk d  kdk
1
2

for all d satisfying


rh(xk )td = 0:
B2: The sequence fBk g is uniformly bounded, i.e., there exists a > 02
such that for each k
kBk k  : 2

B3: The matrices Bk have uniformly bounded inverses, i.e., there exists a
16 Boggs and Tolle

3 > 0 such that for each k, Bk ?1 exists and



Bk
?1  3 :
As the SQP methods require the solution of the quadratic subproblem at
each step, it is imperative that the choice of matrices Bk make that pos-
sible. The second order sucient conditions for a solution to (ECQP) to
exist require that the matrix Bk be positive de nite on the null space of
rh(xk )t (cf. A4). Condition B1 is a slight strengthening of this require-
ment. Assumption A5 and the fact that the matrices Bk are to approximate
the Hessian of the Lagrangian suggest that conditions B2 and B3 are not
unreasonable.
Under assumptions B1{B3 (3.4) and (3.5) have a solution provided (xk ,uk )
is suciently close to (x , u ). Moreover, the solutions can be written in
the form
du = [rh(xk )tBk ?1 rh(xk )]?1[h(xk ) ? rh(xk )t Bk ?1 rL(xk ; uk )] (3:8)
and
dx = ?Bk ?1 rL(xk ; uk+1 ) (3:9)
where uk+1 is given by (3.3). In particular (3.8) leads to the relation
uk+1 = [rh(xk )t Bk ?1 rh(xk )]?1 [h(xk ) ? rh(xk )t Bk ?1 rf (xk )]
 W (xk ; Bk ):
Setting A = Bk ?1 in (3.1) yields
u = W (x ; Bk ):
It can now be deduced that
uk+1 ? u = W (xk ; Bk ) ? W (x ; Bk )
= [rh(x )t Bk ?1 rh(x )]?1 rh(x )t Bk ?1 (Bk ? H L)(xk ? x)
+wk (3.10)
where, by assumptions B2 and B3,
2
wk   xk ? x
for some  independent of k. It should be noted that (3.10) can be used in
conjunction with Theorem 3.1 to prove that the sequence fxk g is quadrati-
cally convergent when the Newton SQP method is used and H L is nonsin-
gular. In that case fuk g converges R-quadratically.
Equations (3.9) and (3.10) and A2 now yield
h i
xk+1 ? x = xk ? x ? Bk ?1 rL(xk ; uk+1 ) ? rL(x ; u)
h i
= Bk ?1 (Bk ? H L)(xk ? x) ? rh(x )(uk+1 ? u )
Sequential Quadratic Programming 17
 
+O xk ? x


 2
 
= Bk ?1 Vk (Bk ? H L )(xk ? x ) + O xk ? x
2

(3.11)
where
Vk = I ? rh(x )[rh(x)t Bk ?1 rh(x)]?1 rh(x)t Bk ?1 :
The projection operator de ned by (2.2), which in the equality-constrained
case has the form
P (x) = I ? rh(x)[rh(x)trh(x)]?1rh(x)t;
satis es
Vk P k = V k :
Thus (3.11) leads to the inequality

k+1
x ? x  Bk ?1 kVk k P k (Bk ? H L)(xk ? x)
 
+O xk ? x :
2

(3.12)
Using induction the above analysis can be made rigorous to yield the fol-
lowing local convergence theorem.
Theorem 3.2 Let assumptions B1{B3 hold. Then there exist positive
constants  and such that if

? x < ;

0x

u ? u < ;
0

and

P k (Bk ? H L)(xk ? x) < xk ? x (3:13)
for all k, then the sequences fxk g and f(xk ; uk )g generated by the SQP algo-
rithm are well de ned and converge linearly to x and (x ; u ) respectively.
The sequence fuk g converges R-linearly to u .
Condition (3.13) is, in conjuction with the other hypotheses of the theo-
rem, almost necessary for linear convergence. In fact, if the other hypotheses
of the theorem hold and the sequence fxk g converges linearly, then there
exists a  such that

k  k 
P (Bk ? H L )(x ? x ) <  (x ? x )
k 
for all k. These results and those that follow indicate the crucial role that
the projection of (Bk ? H L) plays in the theory. The inequality (3.13) is
18 Boggs and Tolle

guaranteed by either of the stronger conditions:



k 
P (Bk ? H L ) 

or
k(Bk ? H L)k  (3:14)
which are easier to verify in practice.
In order to satisfy (3.14) it is not necessary that the approximations
converge to the true Hessian but only that the growth of the di erence
kBk ? H Lk be kept under some control. In the quasi-Newton theory for
unconstrained optimization this can be accomplished if the approximating
matrices have a property called bounded deterioration. This concept can be
generalized from unconstrained optimization to the constrained setting in a
straightforward way.
De nition 3.1 A sequence of matrix approximations, fBk g, for the SQP
method is said to have the bounded deterioration property if there exist
constants 1 and 2 independent of k such that
kBk+1 ? H Lk  (1 + 1k ) kBk ? H Lk + 2k (3:15)
where

k = maxf xk+1 ? x ; xk ? x ; uk+1 ? u ; uk ? u g:
It seems plausible that the bounded deterioration condition when applied
to the SQP process will lead to a sequence of matrices that satisfy B2,
B3, and (3.13) provided the initial matrix is close to the Hessian of the
Lagrangian and (xk ,uk ) is close to (x ,u ). Indeed this is the case, as can
be shown by induction to lead to the following result.
Theorem 3.3 Suppose that the sequence of iterates f(xk ; uk )g is gener-
ated by the SQP algorithm and the sequence fBk g of symmetric matrix
approximations satis es B1 and (3.15). If x0 ? x and kB0 ? H L k are

suciently small and u0 is given by (3:2) then the hypotheses of Theorem


3.2 hold.
The linear convergence for the iterates guaranteed by the above theorem is
hardly satisfactory in light of the quadratic convergence of the Newton SQP
method. As would be expected, a stronger relation between the approxima-
tion matrices, Bk , and the Hessian of the Lagrangian is needed to improve
the rate of convergence; however, this relation still depends only on the pro-
jection of the di erence between the approximation and the true Hessian.
With some e ort the following theorem can be deduced as a consequence of
the inequalities above.
Theorem 3.4 Let assumptions B1{B3 hold and let the sequence f(xk ; uk )g
Sequential Quadratic Programming 19
be generated by the SQP algorithm. Assume that xk ! x . Then the se-
quence fxk g converges to x superlinearly if and only if the matrix approx-
imations satisfy

k  k+1 ? xk )
P (Bk ? H L )(x
lim = 0: (3:16)
k!1 k+1
(x ? xk )
If this equation holds then the sequence fuk g converges R-superlinearly to
u and the sequence f(xk ; uk )g converges superlinearly.
Not that this theorem requires convergence of the x-iterates; (3.16) does
not appear to be enough to yield convergence by itself. A slightly di erent
result that uses a two-sided approximation of the Hessian approximation
can be obtained by writing (3:16) as
(
P k (Bk ? H L )P k sk P k (Bk ? H L )Qk sk )
lim
k!1 ksk k + ksk k =0 (3:17)
where sk = (xk+1 ? xk ). It can be shown that if only the rst term goes to
zero then a weaker form of superlinear convergence holds.
Theorem 3.5 Let assumptions B1{B3 hold and let the sequence f(xk ; uk )g
be generated by the SQP algorithm. Assume that xk ! x. Then, if

k  k k+1 ? xk )
P (Bk ? H L )P (x
lim = 0; (3:18)
k!1 k+1
(x ? xk )
the sequence fxk g converges to x two-step superlinearly.
If the sequence fxk+1 ? xk g approaches zero in a manner tangent to the null
space of the Jacobians, i.e.,
k k+1
lim Q (x ? x ) = 0;
k
k!1
xk+1 ? xk
then (3.18) implies (3.16) and superlinear convergence results. This tan-
gential convergence has been observed in practice for some of the methods
discussed below.
It has been dicult to nd useful updating schemes for the Bk that sat-
isfy these conditions for linear and superlinear convergence. Two basic ap-
proaches have been tried for generating good matrix approximations. In full
Hessian approximations the matrices Bk are chosen to approximate H L
while in reduced Hessian approximations only matrices that approximate the
Hessian on the null space of the Jacobians of the constraints are computed.
Each of these methods will be described in turn.
20 Boggs and Tolle

3.3. Full Hessian Approximations


An obvious choice of the matrix Bk is a nite di erence approximation of
the Hessian of the Lagrangian at (xk ; uk ). It is clear from the Theorems
3.2 and 3.4 that if a nite di erence method is used, the resulting sequence
will be superlinearly convergent if the nite di erence step size goes to zero
and will have rapid linear convergence for xed step size. Of course using
nite di erence approximations, while not requiring evaluation of second
derivatives, su ers from the same global diculties as the Newton SQP
method described in the earlier subsection.
As has been seen, one way of obtaining convergence is to use a sequence
of Hessian approximations that satis es the bounded deterioration property.
However, this property is not, by itself, enough to guarantee that (3.16)
is satis ed and hence it does not guarantee superlinear convergence. The
condition (3.16) essentially requires the component of the step generated
by the Hessian approximation in the direction of the null space to converge
asymptotically to the corresponding component of the step generated by
the Newton SQP method. In the following a class of approximations called
secant approximations, which satisfy a version of this condition and have
the bounded deterioration property as well, is considered.
Under the smoothness assumptions of this paper the Lagrangian satis es
rL(xk+1 ; uk+1) ? rL(xk ; uk+1)  H L(xk+1; uk+1)(xk+1 ? xk )
with equality if the Lagrangian is quadratic in x. As a result, it makes sense
to approximate the Hessian of the Lagrangian at (xk+1 ,uk+1 ) by requiring
Bk+1 to satisfy
Bk+1 (xk+1 ? xk ) = rL(xk+1 ; uk+1 ) ? rL(xk ; uk+1 ); (3:19)
especially since this approximation strongly suggests that (3.16) is likely to
be satis ed near the solution. Equation (3.19) is called the secant equation;
it plays an important role in the algorithmic theory of nonlinear systems
and unconstrained optimization.
A common procedure for generating the Hessian approximations that sat-
isfy (3:19) is to compute (xk+1 ; uk+1 ) with a given Bk and then to update
Bk according to
Bk+1 = Bk + Uk
where Uk is a rank-one or rank-two matrix that depends on the values of
Bk , xk , uk , xk+1 , and uk+1 . In choosing secant approximations that have
the bounded deterioration property, it is natural to look to those updating
schemes of this type that have been developed for unconstrained optimiza-
tion.
The rank-two Powell-symmetric-Broyden (PSB) formula gives one such
Sequential Quadratic Programming 21
updating scheme. For the constrained case this update is given by
Bk+1 = Bk + (s1ts) [(y ? Bk s)st + s(y ? Bk s)t ]
? (y ?(sBtsk)s2) s s st;
t
(3.20)
where
s = xk+1 ? xk (3:21)
and
y = rL(xk+1 ; uk+1 ) ? rL(xk ; uk+1 ): (3:22)
The PSB-SQP algorithm which employs this update has been shown to have
the desired local convergence properties.
Theorem 3.6 Suppose that the sequence of iterates f(xk ; uk )g is gen-
erated by the SQP algorithm using the sequence fBk g of matrix approx-
imations

generated by the PSB update formulas, (3.20){(3.22). Then, if
x0 ? x and kB ? H L(x ; u )k are suciently small and u0 is given by

0
(3:2), the sequence fBk g is of bounded deterioration and the iterates (xk ; uk )
are well de ned and converge superlinearly to (x,u ). In addition the x-
iterates converge superlinearly and the multipliers converge R-superlinearly.
Note that there is no assumption of positive de niteness here. In fact, while
B2 and B3 are satis ed as a result of the bounded deterioration, B1 does
not necessarily hold. Consequently, dx and du are not necessarily solutions
of (ECQP), but of the rst order conditions (3.4) and (3.5).
As a practical method, the PSB-SQP algorithm has the advantage over
the Newton SQP method of not requiring the computation of the Hessian
of the Lagrangian (but, as a result, yielding only superlinear rather than
quadratic convergence). However, because the matrices are not necessarily
positive de nite it su ers from the same serious drawback; the problem
(ECQP) may not have a solution if the initial starting point is not close to
the solution. Consequently, it does not appear to be useful in establishing
a globally convergent algorithm (see, however, Section 6). As mentioned
above, solving (ECQP) requires that the matrices be positive de nite on
the appropriate subspace for each subproblem. One way to enforce this is
to require the matrices to be positive de nite.
A rank-two updating scheme that is considered to be the most e ective
for unconstrained problems and has useful positive de nite properties is the
BFGS method. The formula for this update, generalized to be applicable to
the constrained problem, is
Bk+1 = Bk + ystyy ? Bsktss Bk ;
t t
(3:23)
Bk s
22 Boggs and Tolle

where s and y are as in (3.21) and (3.22). The important qualities of the
matrices generated by this formula are that they have the bounded deterio-
ration property and satisfy a hereditary positive de niteness condition. The
latter states that if Bk is positive de nite and
yt s > 0 (3:24)
then Bk+1 is positive de nite. If the matrix approximations are positive
de nite then (ECQP) is easily solved and the SQP steps are well de ned.
Unfortunately, because the Hessian of the Lagrangian at (x ; u ) need only
be positive de nite on a subspace, it need not be the case that (3.24) is
satis ed and hence the algorithm may break down, even close to the solution.
However, if the Hessian of the Lagrangian is positive de nite, (3.24) is valid
provided Bk is positive de nite and (xk ; uk ) is close to (x; u ). In this case
this BFGS-SQP algorithm has the same local convergence properties as the
PSB-SQP method.
Theorem 3.7 Suppose that H L is positive de nite and let B0 be an ini-
tial positive de nite matrix. Suppose that the sequence of iterates f(xk ; uk )g
is generated by the SQP algorithm using the sequence of matrix
approxi-

mations generated by the BFGS update (3.23). Then, if x0 ? x and
kB0 ? H L(x; u)k are suciently small and u0 is given by (3:2), the se-
quence fBk g is of bounded deterioration and (3.16) is satis ed. Therefore,
the iterates (xk ; uk ) converge superlinearly to the pair (x; u ). In addi-
tion the x-iterates converge superlinearly and the multipliers converge R-
superlinearly.
The assumption that H L is positive de nite allows B0 to be both positive
de nite and close to H L . The requirement that H L be positive de nite is
satis ed, for example, if (NLP) is convex, but cannot be assumed in general.
However, because the positive de niteness of Bk permits the solution of the
quadratic subproblems independent of the nearness of the iterate to the
solution, the BFGS method has been the focus of vigorous research e orts
to adapt it to the general case. Several of these e orts have resulted in
implementable algorithms.
One scheme, which we call the Powell-SQP method, is to maintain the
positive de nite property of the Hessian approximations by modifying the
update formula (3.23) by replacing y in (3.22) by
y^ =  y + (1 ? ) Bk s (3:25)
for some  2 (0; 1]. With this modi cation, the condition (3.24) can always
be satis ed although the updates no longer satisfy the secant condition. Nev-
ertheless, it has been shown that a speci c choice of  leads to a sequence,
fxk g, that converges R-superlinearly to x provided that the sequence con-
verges. Unfortunately, no proof of local covergence has been found although
Sequential Quadratic Programming 23
algorithms based on this procedure have proven to be quite successful in
practice.
A second approach is to transform the problem so that H L is positive
de nite and then to apply the BFGS-SQP method on the transformed prob-
lem. It is a common trick in developing algorithms for solving equality con-
strained optimization problems to replace the objective function in (NLP)
by the function
f A (x) = f (x) + 2 kh(x)k2
for a positive value of the parameter  . This change gives a new Lagrangian
function, the so-called augmented Lagrangian
LA(x; u) = L(x; u) + 2 kh(x)k2
for which (x,u ) is a critical point. The Hessian of the the augmented
Lagrangian at (x ,u ) has the form
H LA (x; u ) = H L +  rh(x)rh(x )t:
It follows from A4 that there exists a positive value   such that for  
 , H LA (x ; u ) is positive de nite. If the value of   is known then the
BFGS-SQP algorithm can be applied directly to the transformed problem
with     and, by Theorem 3.7, a superlinearly convergent algorithm
results. This version of the SQP algorithm has been extensively studied
and implemented. Although adequate for local convergence theory, this
augmented-BFGS method has major drawbacks that are related to the fact
that it is dicult to choose an appropriate value of  . To apply Theorem 3.7,
the value of  must be large enough to ensure that H LA (x ; u ) is positive
de nite, a condition that requires a priori knowledge of x . If unnecessarily
large values of  are used without care numerical instabilities can result.
This problem is exacerbated by the fact that if the iterates are not close to
the solution appropriate values of  may not exist.
Quite recently, an intriguing adaptation of the BFGS-SQP algorithm has
been suggested that shows promise of leading to a resolution of the above
diculties. This method, called the SALSA-SQP method is related to the
augmented Lagrangian SQP method but di ers (locally) in the sense that a
precise estimate of  can be chosen independent of x . If
yA = rLA (xk+1; uk+1 ) ? rLA (xk ; uk+1)
then
yA = y + rh(xk+1 )h(xk+1 )
where y is the vector generated by the ordinary Lagrangian in (3.22). If
ytAs > 0; (3:26)
24 Boggs and Tolle

then the BFGS updates for the augmented Lagrangian have the hereditary
positive de niteness property. A minimum value of k , not directly depen-
dent on x , can be given that guarantees that ytA s is \suciently" positive
for given values of xk and uk+1 in a neighborhood of (x ,u ). The (local)
version of this algorithm proceeds by using the augmented Lagrangian at
each iteration with the required value of k (which may be zero) to force
the satisfaction of this condition. The Bk generated by this procedure are
positive de nite while H L may not be; hence, the standard bounded dete-
rioration property is not applicable. As a result a local convergence theorem
is not yet available. As in the Powell-SQP algorithm, an R-superlinear rate
of convergence of the xk to x has been shown, provided that the sequence
is known to converge.

3.4. Reduced Hessian SQP Methods


An entirely di erent approach to approximating the Hessian is based on
the fact that assumption A4 requires H L to be positive de nite only on
a particular subspace. The reduced Hessian methods approximate only the
portion of the Hessian matrix relevant to this subspace. The advantages
of these methods are that the standard positive de nite updates can be
used and that the dimension of the problem is reduced to n ? m (possibly
a signi cant reduction). In this section we discuss the local convergence
properties of such an approach. Several versions of a reduced Hessian type of
algorithm have been proposed; they di er in the ways the multiplier vectors
are chosen and the way the reduced Hessian approximation is updated; in
particular, in the form of y that is used (see (3.31)). A general outline of
the essential features of the method is given below.
Let xk be an iterate for which rh(xk ) has full rank and let Zk and Yk
be matrices whose columns form bases for the null space of rh(xk )t and
range space of rh(xk ) respectively. Also assume that the columns of Zk are
orthogonal. Zk and Yk could be obtained, for example, by a QR factorization
of rh(xk ).
De nition 3.2 Let (xk ,uk ) be a given solution-multiplier pair and assume
that rh(xk ) has full rank. The matrix
Zk t H L(xk ; uk )Zk
is called a reduced Hessian for the Lagrangian function at (xk ,uk ).
The reduced Hessian is not unique; its form depends upon the choice of basis
for the null space of rh(xk )t. Since by A4 a reduced Hessian at the solution
is positive de nite, it follows that if (xk ,uk ) is close enough to (x ,u ) then
the reduced Hessian at (xk ; uk ) is positive de nite. Decomposing the vector
Sequential Quadratic Programming 25
dx as
dx = Zk pZ + Yk pY (3:27)
it can be seen that the constraint equation of (ECQP) becomes
rh(xk )tYk pY = ?h(xk )
which by virtue of A2 can be solved to obtain
pY = ?[rh(xk )tYk ]?1h(xk ): (3:28)
The minimization problem (ECQP) is now an unconstrained problem in
n ? m variables given by
minimize
p
1
p tZk tBk Zk pZ + (rf (xk )t + pY tBk )Zk pZ :
2 Z
Z
The matrix in this unconstrained problem is an approximation to the re-
duced Hessian of the Lagrangian at xk . Rather than update Bk and then
compute Zk t Bk Zk , the reduced matrix itself is updated. Thus at a partic-
ular iteration, given a positive de nite (n ? m)  (n ? m) matrix Rk , an
iterate xk and a multiplier estimate uk , the new iterate can be found by
rst computing pY from (3.28), setting
pZ = ?R?k 1Zk t(rf (xk ) + Bk pY ); (3:29)
and setting xk+1 = xk + dx, where dx is given by (3.27). A new multiplier,
uk+1 , is generated and the reduced Hessian approximation is updated using
the BFGS formula (3.23) with the Rk replacing Bk ,
s = Zk t (xk+1 ? xk ) = Zk t Zk pZ (3:30)
and
y = Zk t [rL(xk + Zk pZ ; uk ) ? rL(xk ; uk )]: (3:31)
The choices of s and y are motivated by the fact that only the reduced
Hessian is being approximated.
This method does not stipulate a new multiplier iterate directly since
the problem being solved at each step is unconstrained. However, the least
squares solution for the rst order conditions (cf. (3.2)) can be used. Gen-
erally, all that is needed is that the multipliers satisfy
 
k 
u ? u = O x ? x :
k  (3:32)
Since P k = Zk Zk t the approximation
Zk Rk Zk t
can be thought of as an approximation of
P k H L(xk ; uk )P k :
26 Boggs and Tolle

Thus since this method does not approximate


P k H L
neither the local convergence theorem nor the superlinear rate of convergence
theorem, Theorems 3.2 and 3.4, follow as for full Hessian approximations.
Nevertheless, the two-sided approximation of the Hessian matrix suggests
that the conditions of Theorem 3.5 may hold. In fact, if it is assumed that
the matrices Zk are chosen in a smooth way, i.e., so that
 
kZk ? Z (x)k = O xk ? x ; (3:33)
the assumption of local convergence leads to two-step superlinear conver-
gence.
Theorem 3.8 Assume that the reduced Hessian algorithm is applied with
uk and Zk chosen so that (3.32) and (3.33) are satis ed. If the sequence fxk g
converges to x R-linearly then fRk g and fR?k 1 g are uniformly bounded and
fxk g converges two-step superlinearly.
While the condition on the multiplier iterates are easy to satisfy, some
care is required to ensure that (3.33) is satis ed as it has been observed that
arbitrary choices of Zk may be be discontinuous and hence invalidate the
theorem.
3.5. Notes and References
The study of Newton's method applied to the rst order necessary conditions
for constrained optimization can be found in (Tapia 1974) and (Goodman
1985). The equivalence of Newton's method applied to the system of rst
order equations and feasibility conditions and the SQP step with the La-
grangian Hessian was rst mentioned by Tapia (1978).
The expository paper, (Dennis and More 1977), is a good source for an
introductory discussion of bounded deterioration and of updating methods
satisfying the secant condition in the setting of unconstrained optimization.
The paper by Broyden, Dennis and More (1973) provides the basic results
on the convergence properties for quasi-Newton methods.
Theorems (3:2) and (3:4) were rst proven by Boggs, Tolle and Wang
(1982). The latter theorem was proven under the assumption that the iter-
ates converge linearly, generalizing the result for unconstrained optimization
given in Dennis and More (1977). The assumption was substantially weak-
ened by several authors; see, for example, (Fontecilla, Steihaug and Tapia
1987). Theorem (3:5) is due to Powell (1978b). The paper by Coleman
(1990) contains a good general overview of superlinear convergence theo-
rems for SQP methods.
The (local) superlinear convergence of the PSB-SQP method was proven
Sequential Quadratic Programming 27
by Han (1976) in his rst paper on SQP methods. This paper also included
the proof of superlinear covergence for the positive de nite updates when the
Hessian of the Lagrangian is positive de nite (Theorem (3:7)) and suggested
the use of the augmenting term in the general case. The augmented method
was also investigated in (Schittkowski 1981) and (Tapia 1977). A description
of the Powell-SQP method appears in (Powell 1978b) while the Salsa-SQP
method is introduced in (Tapia 1988) and developed further in (Byrd, Tapia
and Zhang 1992).
The reduced Hessian SQP method has been the subject of research for a
number of authors. The presentation here follows that of Byrd and Nocedal
(1991). Other important articles include (Coleman and Conn 1984), (No-
cedal and Overton 1985), (Gabay 1982), (Yuan 1985), and (Gilbert 1993).
The problems involved in satisfying (3:33) have been addressed in (Byrd and
Schnabel 1986) , (Coleman and Sorensen 1984), and (Gill, Murray, Saunders,
Stewart and Wright 1985).
Other papers of interest on local convergence for SQP methods include
(Schittkowski 1983), (Panier and Tits 1993), (Bertsekas 1980), (Coleman
and Feynes 1992), (Glad 1979), (Fukushima 1986), and (Fontecilla 1988).

4. Merit Functions and Global Convergence

In the previous section we demonstrated the existence of variants of the SQP


method that are rapidly locally convergent. Here we show how the merit
function assures that the iterates eventually get close to a critical point;
in the next section we point out some gaps in the theory that prevent a
complete analysis.
A merit function  is incorporated into an SQP algorithm for the purpose
of achieving global convergence. As described in Section 2.3, a line search
procedure is used to modify the length of the step dx so that the step from
xk to xk+1 reduces the value of . This reduction is taken to imply that
acceptable progress towards the solution is being made.
The standard way to ensure that a reduction in  indicates progress is to
construct  so that the solutions of (NLP) are the unconstrained minimizers
of . Then it must be possible to decrease  by taking a step in the direction
dx generated by solving the quadratic subproblem, i.e., dx must be a descent
direction for . If this is the case, then for a suciently small ; (xk + dx )
will be less than (xk ). An appropriate steplength that decreases  can
then be computed; for example, by a \backtracking" procedure of trying
successively smaller values of until a suitable one is obtained.
Given that a merit function is found that has these properties and that a
procedure is used to take a step that decreases the function, global conver-
gence proofs for the resulting SQP algorithm are generally similar to those
found in unconstrained optimization: they depend on proving that the limit
28 Boggs and Tolle

points of the x-iterates are critical points of . These proofs rely heavily on
being able to guarantee that a \sucient" decrease in  can be achieved at
each iteration..
In unconstrained minimization there is a natural merit function; namely,
the objective function itself. In the constrained setting, unless the iterates
are always feasible, a merit function has to balance the drive to decrease
f with the need to satisfy the constraints. This balance is often controlled
by a parameter in  that weights a measure of the infeasibility against the
value of either the objective function or the Lagrangian function. In this
section, we illustrate these ideas by describing two of the most popular
merit functions: a di erentiable augmented Lagrangian function and the
simpler, but nondi erentiable, `1 penalty function. We derive some of the
important properties of these functions and discuss some of the advantages
and disadvantages of each. In addition, we provide appropriate rules for
choosing the steplength parameter so as to obtain the basic convergence
results.
To simplify the presentation we restrict our attention at the beginning to
the problem with only equality constraints. Extensions that allow inequality
constraints are discussed at the ends of the sections.
As might be expected, some additional assumptions are needed if global
convergence is to be achieved. In fact, the need for these assumptions raises
the issue as to what exactly is meant by \global" convergence. It is impossi-
ble to conceive of an algorithm that would converge from any starting point
for every (NLP). The very nature of nonlinearity allows the possibility, for
example, of perfectly satisfactory iterations following a steadily decreasing
function towards achieving feasibility and a minimum at some in nite point.
In order to focus attention on the methods and not the problem structure
we make the following assumptions.
C1: The starting point and all succeeding iterates lie in some compact set
C.
C2: The columns of rh(x) are linearly independent for all x 2 C .
The rst assumption is made in some guise in almost all global conver-
gence proofs, often by making speci c assumptions about the functions.
The second assumption assures that the systems of linearized constraints
are consistent. An additional assumption will be made about the matrix
approximations Bk , depending on the particular merit function.

4.1. Augmented Lagrangian merit functions


Our rst example of a merit function is the augmented Lagrangian function.
There are several versions of this function; we use the following version to
Sequential Quadratic Programming 29
illustrate the class:
F (x; ) = f (x) + h(x)tu(x) + 2 kh(x)k22 (4:1)
where  is a constant to be determined and
h i?1
u(x) = ?rh(x)trh(x) rh(x)trf (x): (4:2)
The multipliers u(x) de ned by (4:2) are the least squares estimates of the
optimal multipliers based on the rst order necessary conditions and hence
u(x ) = u (see (3.1) and (3.2)). Under the assumptions, F and u are
di erentiable and F is bounded below on C for  suciently large. The
following formulas will be useful in the discussion:
h i?1
ru(x) = ?H L(x; u(x))rh(x) rh(x)trh(x)
+R (x) 1 (4.3)
rF (x; ) = rf (x) + rh(x)u(x)
+ru(x)h(x) +  rh(x)h(x): (4.4)
The function R (x) in (4:3) is bounded and satis es R (x) = 0 if x
1 1
satis es the rst order necessary conditions.
The following theorem establishes that the augmented Lagrangian merit
function has the essential properties of a merit function for the equality-
constrained nonlinear program.
Theorem 4.1 Assume B1, B2, C1, and C2 are satis ed. Then for 
suciently large the following hold:
(i) x 2 C is a strong local minimum of F if and only if x is a strong
local minimum of (NLP) and
(ii) if x is not a critical point of (NLP) then dx is a descent direction
for F .
The proof is rather technical, but it does illustrate the techniques that are
often used in such arguments and it provides a useful intermediate result,
namely (4:10). A key idea is to decompose certain vectors according to the
range and null space projections P and Q.
Proof. If x 2 C is feasible, and satis es the rst order necessary condi-
tions, then it follows from (4.4) that rF (x;  ) = 0: Conversely, if x 2 C ,
rF (x; ) = 0, and  is suciently large then it follows from C1 and C2
that h(x ) = 0 and x satis es A1 for (NLP). To establish (i), the rela-
tion between HF (x;  ) and H L must be explored for such points x. It
follows from (4:4) that
HF (x; ) = H L ? QH L ? H L Q +  rh(x )rh(x)t (4:5)
30 Boggs and Tolle

where Q is de ned by (2:3). Now let y 2 Rn , y 6= 0, be arbitrary. Then


setting y = Q y + P  y, (4:5) yields
y t HF (x ;  )y = (P  y)t H Lh (P y ) ? (Q y)t H L(Qy)
i
+ (Q y )t rh(x )rh(x)t (Q y): (4:6)
Assume that x is a strong local minimum of (NLP). Since Q y is in the
range space of rh(x), it follows from A2 that there exists a constant 
such that h i
(Q y )t rh(x )rh(x)t (Q y)   kQ y k2 :
Let the maximum eigenvalue of H L in absolute value be max and let min
be the minimum eigenvalue of P  H L P  , which by A4 is positive. De ning

 = kQkykyk ;
which implies
kP yk = 1 ?  ;
2
2
kyk
2

and dividing both sides of (4:6) by ky k gives


2

y t HF (x ;  )y
 min + ( ? max ? min ) :
2
kyk
2

This last expression will be positive for  suciently large, which implies
that x is a strong local minimum of F . Conversely, if x is a strong local
minimum of F then (4:6) must be positive for all y . This implies that H L
must be positive de nite on the null space of rh(x )t and hence that x is
a strong local minimum of (NLP). This establishes (i).
To show that the direction dx is always a descent direction for F , the
inner product of both sides of (4:4) is taken with dx to yield
dx t rF (xk ;  ) = dxt rf (xk ) + dxt rh(xk )u(xk )
+dx t ru(xk )h(xk ) +  dxt rh(xk )h(xk ): (4:7)
Using (4:2) it follows that
h i?1
dx t rh(xk )u(xk ) =?dxtrh(xk ) rh(x)trh(x) rh(x)trf (x)
= ?dx t Qk rf (xk ):
(4:8)
Writing dx = Qk dx + P k dx and noting that h(xk ) = ?rh(xk )t Qk dx results
in
dx t rF (xk ;  ) = P k dx t rf (xk ) ? dx t ru(xk )rh(xk )tQk dx
h i
?(Qk dx)t rh(xk )rh(xk )t (Qk dx):
Sequential Quadratic Programming 31
Finally, using the rst order necessary conditions for (ECQP) and the fact
that P k rh(xk )t = 0, the following is obtained:
dx t rF (xk ;  ) = ?(P k dx )t Bk (P k dx ) ? (Qk dx )tBk dx
?dxtru(xkh)rh(xk )tQk dx i (4:9)
k
?(Q dx) rh(x )rh(x ) (Q dx):
t k k t k

Now, from (4:3) and assumptions C1 and C2 it follows that



dxt ru(xk )rh(xk )t Qk dx  1 Qk dx kdxk
for some constant 1 . Dividing both sides of (4:9) by kdx k2, letting


Qk dx
 = kd k ;
x
and using B1 yields
rF (xk ; )tdx  ? +  ?   2
(4:10)
kdxk
2 1 2 3

for constants and . The quantity on the left of (4:10) is then negative
2 3
and can be uniformly bounded away from zero provided  is suciently
large, thus proving dx is a descent direction for . 2
It is interesting to observe that the value of  necessary to obtain descent
of dx depends on the eigenvalue bounds on fBk g, whereas the value of 
necessary to ensure that x is a strong local minimizer of F depends on the
eigenvalues of H L . Thus a strategy to adjust  to achieve a good descent
direction may not be sucient to prove that HF (x) is positive de nite.
We comment further on this below.
This line of reasoning can be continued to obtain a more useful form of
the descent result. From (4:4) and arguments similar to those above,

rF (x ;  )  4 kdx k
k

for some constant 4. Thus from (4:10) it follows that there exists a
constant 5 such that

rF (xk ;  )tdx  ? < 0 (4:11)
rF (x ;  ) kdx k
k 5

for all k. This inequality ensures that the cosine of the angle  k between
the direction dx and the negative gradient of F is bounded away from zero,
i.e., for all k
cos( k ) = ? rF (xk ;  ) dx  5 > 0:
k t
(4:12)
rF (x ;  ) kdx k
32 Boggs and Tolle

This condition is sucient to guarantee convergence of the sequence if it


is coupled with suitable line search criteria that impose restrictions on the
steplength . For example, the Wolfe conditions require a steplength to
satisfy
F (xk + dx ; )  F (xk ;  ) + 1 rF (xk ;  )tdx (4.13)
rF (xk + dx; )tdx  2rF (xk ; )tdx (4.14)
where 0 < 1 < 2 < 1. Inequality (4:13) ensures that there will be a
sucient reduction in F while (4:14) guarantees that the steplength will
not be too small. An important property of these conditions is that if dx is
a descent direction for F , then a steplength satisfying (4:13){(4:14) can
always be found. Furthermore, the reduction in F for such a step satis es
2
F (xk+1 ; )  F (xk ; ) ? 6 cos2(k ) rF (xk ;  ) ;
for some positive constant 6. Therefore
1
X 2
cos2 (k ) rF (xk ;  ) < 1 (4:15)
k=1
and, since cos(k ) is uniformly bounded away from 0, it follows that
lim r (xk ;  ) = 0:
k!1 F
This implies that fxk g converges to a critical point of F . The following
theorem results:
Theorem 4.2 Assume that  is chosen such that (4:12) holds. Then the
SQP algorithm with steplength chosen to satisfy the Wolfe conditions
(4:13){(4:14) is globally convergent to a critical point of F .
If the critical points of F all correspond to local minima of (NLP) then
the algorithm will converge to a local solution. This, however, can rarely
be guaranteed. Only convergence to a critical point, and not to a local
minimum of F , is guaranteed. It is easy to see from the following example
that convergence can occur to a local maximum of (NLP). Consider
minimize
x x1
(4:16)
subject to: x21 + x22 ? 1 = 0
with a starting approximation of (2; 0) and Bk = I for all k. The iterates
will converge from the right to the point (1; 0), which is a local maximum
of the problem, but F will decrease appropriately at each iteration. The
point (1; 0), however, is a saddle point of F . In practice convergence to a
maximizer of (NLP) rarely occurs.
Theorems 4.2 and 4.1 require that  be suciently large. Since it is not
Sequential Quadratic Programming 33
known in advance how large  needs to be, it is necessary to employ an
adaptive strategy for adjusting  when designing a practical code. Such
adjustment strategies can have a dramatic e ect on the performance of the
implementation, as is discussed in Section 7.
Augmented Lagrangian merit functions have been extended to handle
inequality constrained problems in several ways. Two successful approaches
are described below.
In the rst an active set strategy is employed, i.e., at each iteration a set
of the inequality constraints is selected and treated as if they were equality
constraints; the remaining inequalities are handled di erently. The active
set is selected by using the multipliers from (QP). In particular,
n o
I k = i : gi(xk )  ? vqp i : ( )

With this choice, I k will contain all unsatis ed constraints and no \safely
satis ed" ones. The merit function at xk is de ned by
FI (x; uqp ; vqp; ) = f (x) + h(x)t uqp(x) + 12  kh(x)k2
X
+ (g i (x)(vqp (x))i + 21  gi (x)2 )
i2I k
+ 21
X
(vqp (x))2i :
i=2I k
FI is still di erentiable and, as a result of A3, the correct active set is
eventually identi ed in a neighborhood of the solution. In this formulation,
the multipliers are the multipliers from (QP), not the least squares mul-
tipliers. Therefore FI is a function of both x and these multipliers and,
consequently, the analysis for FI is somewhat more complicated.
A second approach uses the idea of squared slack variables. One can
consider the problem
minimize
x; t f (x)
subject to: h(x) = 0 (4:17)
g i (x) + (ti )2 = 0
i = 1; : : :; p
where t is the vector of slack variables. This problem is equivalent to (NLP)
in the sense that both have a strong local solution at x where, in (4:17),
(ti )2 = ?g i (x). By writing out the function F for the problem (4:17), it
is observed that the merit function only involves the squares of ti . Thus, by
letting z i = t2i and setting
 
h(x)
h(x; z ) = g (x) + z ;
34 Boggs and Tolle

we can construct the merit function


FZ (x; z) = f (x) + h(x; z)t u(x; z) + 2 h(x; z) :
2

Here u(x,z ) is the least squares estimate for all of the multipliers. Note
that (4:17) is not used to create the quadratic subproblem, but rather (QP)
is solved at each iteration to obtain the step dx . The slack variables can
then be updated at each iteration in a manner guaranteed to maintain the
nonnegativity of z . For example, a step
dz = ?(rg (xk )tdx + g(xk ) + z k )
can be calculated. Then the constraints of (QP) imply that z k+1 = z k +
dz  0 if z k  0 and 2 (0; 1].
4.2. The `1 Merit Function
One of the rst merit functions to be introduced was the `1 exact penalty
function that, in the equality constrained case, is given by
1(x; ) = f (x) +  kh(x)k1 (4:18)
where  is a positive constant to be chosen. The properties of this func-
tion vis-a-vis the equality-constrained optimization problem have been well-
documented in the literature. For our purposes it is sucient to note that
1, like the augmented Lagrangian of the previous section, is an \exact"
penalty function; that is, there exists a positive  such that for all   ,
an unconstrained minimum of 1 corresponds to a solution of (NLP). Note
that 1 is not di erentiable on the feasible set. If the penalty term were
squared to achieve di erentiability then the \exact" property would be lost;
minimizers of 1 would only converge to solutions of (NLP) as  ! 1.
Although 1 is not di erentiable, it does have a directional derivative
along dx. It can be shown that this directional derivative, denoted by the
operator D, is given by

D(1(xk ; ); dx) = rf (xk )tdx ?  h(xk ) 1 : (4:19)
Substituting the rst order necessary conditions for (ECQP) into (4:19)
yields

D(1(xk ; ); dx) = ?dx t Bk dx ? dxt rh(xk )uqp ?  h(xk ) 1 :
It follows from the linearized constraints of (ECQP) that
dx t rh(xk )uqp = ?h(xk )t uqp
and, since
h(xk )t uqp  kuqp k1 h(xk 1 ; (4:20)
Sequential Quadratic Programming 35
the inequality

D((xk ; ); dx)  ?dx t Bk dx ? ( ? kuqp k1 ) h(xk ) 1 (4:21)
is obtained. In order to have dx be a descent direction for 1 and to ob-
tain a convergence theorem it is sucient to assume the uniform positive
de niteness of fBk g:
B4: For all d 2 Rn there are positive constants 1 and 2 > 0 such that
1 kdk2  dt Bk d  2 kdk2
for all k.
The assumptions B1 and B2 that were made for the augmented Lagrangian
are sucient, but we make the stronger assumption to simplify the presen-
tation. Using B4, the rst term in (4:21) is always negative and thus dx is
a guaranteed descent direction for 1 if  > kuqp k1 .
Global convergence of an algorithm that uses 1 as a merit function can
be demonstrated using arguments similar to those for Theorem 4.1. First,
at each iteration, the parameter  is chosen by
 = kuqpk1 +  (4:22)
for some constant  > 0: Next, the rst Wolfe line search condition, (4:13),
is replaced by
1 (xk + dx)  1(xk ) +  D(1(xk ; ); dx) (4:23)
where  2 (0; 21 ). (The condition  < 21 is for technical reasons; in practice
 is usually chosen to be much smaller.) Recall that the second of the
Wolfe conditions, (4:14), is to prevent a steplength that is too small. The
assumptions that have been made guarantee that the use of a backtracking
line search procedure produces steplengths that are uniformly bounded away
from zero for all iterations. Denoting this lower bound by  , it now follows
from (4:21) and (4:23) that the reduction in 1 at each step satis es
h i
1 (xk + dx) ? 1(xk )  ?  1 kdx k2 +  h(xk ) 1 : (4:24)
Assumption C1 and (4:24) imply that
1 
X 2 

dx(xk ) + h(xk ) 1 1
k=1
 2 
and therefore dx(xk ) + h(xk ) 1 ! 0 as k ! 1. Since dx = 0 if and
only if xk is a feasible point satisfying A1, the following theorem results.
Theorem 4.3 Assume that  is chosen such that (4:22) holds. Then the
36 Boggs and Tolle

SQP algorithm started at any point x0 with steplength   > 0 chosen


to satisfy (4:23) converges to a stationary point of 1 .
As in the case of F convergence to a local minimum of (NLP) cannot
be guaranteed. In fact, the same counterexample used in that case applies
here.
An advantage of 1 over the augmented Lagrangian is that 1 is easily
extended to handle inequalities. Since di erentiability is not an issue in this
case, the function 1 can simply be de ned as
 
1(x) = f (x) +  kh(x)k1 + g + (x) 1
where
x)  0

g (x) = g0 (x) ifif ggi ((x
+
)>0
i i
All of the theoretical results continue to hold under this extension.
The merit function 1 has been popular because of its simplicity | it
requires only the evaluation of f , h, and g to check a prospective point. F ,
on the other hand, is expensive to evaluate in that it requires the evaluation
of f , h, g and their gradients just to check a prospective point. In addition,
in the large scale case, the evaluation of u(x) involves nontrivial algebra.
Implementations that use these merit functions partially circumvent this
diculty by using an approximation to F , e.g., by approximating u by a
linear function or by keeping it xed over the current step.
4.3. Notes and References
The augmented Lagrangian merit function was rst proposed as an exact
penalty function by Fletcher (1972). See also (Bertsekas 1982) and (Boggs
and Tolle 1980). It was suggested as a merit function in (Powell and Yuan
1986) and in a slightly di erent form in (Boggs and Tolle 1984) and (Boggs
and Tolle 1989). See (Byrd and Nocedal 1991) for its application to reduced
Hessian methods. The Wolfe conditions have been studied by numerous
authors; we recommend (Nocedal 1992) or (Dennis and Schnabel 1983).
Schittkowski (1981) and (Schittkowski 1983) developed the form for in-
equalities given by FI . This form has been further developed and incor-
porated into a highly successful algorithm by Gill, Murray, Saunders and
Wright (1986). (Boggs, Tolle and Kearsley 1991b) proposed the form given
by FZ .
The `1 exact penalty function was originally suggested as a merit function
by Han (1977) where he obtained the rst global convergence results for SQP
methods. See (Fletcher 1981) for a good introduction to nondi erentiable
optimization and the use of this function. See also (Polak 1989) and (Wolfe
1975). Byrd and Nocedal (1991) study the `1 merit function in the context
Sequential Quadratic Programming 37
of reduced Hessian methods. The analysis for the `1 merit function applies
to the corresponding `p merit function with the only change ocurring in
(4:20) where the 1-norm and the 1-norm are replaced by the p-norm and
the q -norm with 1=p + 1=q = 1.
5. Global to Local Behavior

In the previous two sections we have examined conditions that imply that
the basic SQP algorithm will converge. Section 3 dealt with local conver-
gence where the emphasis was on the rates of convergence, while Section 4
was concerned with obtaining convergence from remote starting points; the
implicit hope being that the two theories would come together to produce a
uni ed theory that would be applicable to a given algorithm. For the global
convergence theories, where the process has to be controlled by a merit func-
tion, it was seen in Section 4 that convergence of fxk g to a critical point
of  is all that can be demonstrated. Assuming that this critical point is a
solution of (NLP), the question that arises is whether or not the conditions
for the local convergence theories are eventually satis ed by this sequence.
If they are then the more rapid rates of local convergence can be achieved.
In this section we discuss three questions concerning this possibility: will
the correct active set be chosen by (QP) when xk is close to x ; will Bk
eventually approximate H L in one of the ways stipulated in Section 3 that
yields rapid local convergence; and will the merit function allow a steplength
of one if the underlying process can achieve rapid local convergence?
Recall that the local convergence theory was proven for equality-constrained
problems only. This was based on the assumption that the active inequality
constraints at the solution of (NLP) would be identi ed by (SQP) when xk
gets close to the solution. The question of when the active constraints for
(QP) will be the same as those for (NLP) is resolved by using a perturbation
theory result as follows. Consider (NLP) with only inequality constraints
minimize
x f (x)
(5:1)
subject to: g (x)  0
and the quadratic program with the same solution
minimize
x rf (x)t(x ? x) + 21 (x ? x)tB (x ? x)
(5:2)
subject to: rg(x )t(x ? x ) + g (x )  0
where the only restriction on B is that
yt B y > 0 (5:3)
for all y such that rg(x )t y = 0. It is easily veri ed that (x; v ) is a
strong local minimizer of both (5:1) and (5:2). This implies that the active
38 Boggs and Tolle

sets are the same and that strict complementary slackness (assumption A3)
holds for both. The quadratic programming approximation to (5:1) is
minimize
dx rf (xk )tdx + 21 dxtBk dx
(5:4)
subject to: rg (xk )tdx + g(xk )  0:
A standard perturbation argument now asserts that if xk is close enough to
x and Bk is close enough to B , then the active set for (5:4) is the same as
the active set for (5:2). It follows that the active sets for (5:1) and (5:4) are
the same. As a result, if xk is suciently close to x and Bk is suciently
close to any matrix B satisfying (5:3) then (5:4) will identify the correct
active set. The condition (5:3) will be satis ed if the Bk are always positive
de nite or if they satisfy the condition
lim P k (Bk ? H L)P k = 0:
k!1
(5:5)
This latter condition implies two-step superlinear convergence by Theorem
(3:5).
From Section 3 linear, superlinear, or two-step superlinear convergence
is assured if Bk approaches H L in one of the ways hypothesized in The-
orems 3.2{3.5. Since the initial matrix is unlikely in practice to be a good
approximation to H L , it is reasonable to ask under what conditions the
subsequent Bk will be good enough approximations to H L for these results
of Section 3 to hold. If the Bk satisfy the secant equation (3:19) then it is
reasonable to expect that the Bk will converge to H L provided that the
directions, dx , span Rn repeatedly, for example, if each set of n directions is
(uniformly) linearly independent. Some recent research supports this expec-
tation. However, if positive de niteness is maintained, it is not possible for
the Bk to converge to H L unless the latter is at least positive semide nite.
So for the general case, further analysis is necessary. As seen in Section 3,
(3:16) is equivalent to
       
lim P (Bk ? H L )P sk + P (Bk ? H L )Q sk = 0;
k!1 ksk k ksk k (5:6)
where sk = (xk+1 ? xk ). For superlinear convergence both of these terms
must tend to zero if the directions sk = ksk k repeatedly span Rn . Thus for
positive de nite Bk superlinear convergence is not likely unless the second
term goes to zero, which is the type of tangential convergence mentioned
in Section 3.2. However, two step superlinear convergence, in which the
rst term of (5:6) goes to zero, is possible for positive de nite updates and
for reduced Hessian updates as well, provided that convergence occurs and
steplengths of one are acceptable. Thus, it is important from the point of
view of obtaining a rapid local rate of convergence that a steplength of one
be taken near x .
Sequential Quadratic Programming 39
For the general algorithm, however, the use of the merit function with
the line search continues throughout the course of the algorithm. Thus, the
line search for the merit function should allow a steplength of one if the
matrix approximations are such that rapid local convergence is possible. In
the case of the augmented Lagrangian merit function it can be shown, using
arguments similar to those in Section 4, that
F (xk+1 )?2 F (xk )  ? 1 1 + 2 ?  32
kdxk 2
k
+ P [Bk ?H Lkd(xx k;u(x ))]dx
k k (5:7)
+O(kdx k) :
Then, if the process is converging superlinearly, the penultimate term in (5:7)
tends to zero by Theorem 3.4 and the right hand side of (5:7) is ultimately
less than zero, thus showing that a steplength of one decreases F . A slight
extension to this argument shows that the Wolfe condition (4:13) also can
be satis ed for a steplength of one.
A signi cant disadvantage of the merit function 1 is that it may not
allow a steplength of = 1 near the solution, no matter how good the
approximation of (QP) to (NLP). This phenomenon, which can prohibit su-
perlinear convergence, is called the Maratos e ect. Several procedures have
been proposed to overcome this problem, including so-called nonmonotone
line searches, i.e., techniques that allow the merit function to increase over
one or more iterations (see Section 7).
5.1. Notes and References
The perturbation argument showing conditions under which (QP) has the
same active set as (NLP) is derived from the work of Robinson (1974) who
proves a general perturbation result. For a discussion of the convergence of
matrix updates see Boggs and Tolle (1994), Ge and Powell (1983) and Stoer
(1984).
The Maratos e ect, i.e., the fact that the `1 merit function may not allow
a steplength of one even when the iterates are close to the solution and Bk is
a good approximation to H L , was discussed by Chamberlain, Lemarechal,
Pedersen and Powell (1982). These authors suggested the \watchdog" tech-
nique, which is essentially a nonmonotone line search method. (See also
(Bonnans et al. 1992).) The fact that the augmented Lagrangian merit
function does not su er from this problem has been shown by many au-
thors.
6. SQP Trust Region Methods

Trust region algorithms have become a part of the arsenal for solving un-
constrained optimization problems, so it is natural to attempt to extend the
40 Boggs and Tolle

ideas to solving constrained optimization problems and, in particular, to


SQP methods. In this section an outline of the basic ideas of this approach
will be provided. As the subject is still in a state of ux, no attempt will
be made to give a comprehensive account of the algorithms that have been
proposed. The discussion will be limited to equality-constrained problems;
the inclusion of inequality constraints into trust region algorithms has been
the subject of little research.
A major diculty associated with SQP methods arises from trying to en-
sure that (QP) has a solution. As was seen in the preceding, the requirement
that the Hessian approximations be positive de nite on the null space of the
constraints is dicult to guarantee short of requiring the Bk to be positive
de nite. The trust region methods attempt to avoid this diculty by the
addition of a bounding constraint. Since this added constraint causes the
feasible region to be bounded the subproblem is guaranteed to have a solu-
tion independent of the choice of Bk provided the feasible region is nonempty
To be precise, a straightforward trust region adaptation of the SQP method
would generate the iterates by solving the problem
minimize
dx rf (xk )tdx + 21 dxtBk dx
subject to: rh(xk )t dx + h(xk ) = 0 (6:1)
kS dxk  4k
2 2

where S is a positive diagonal scaling matrix and 4k is the trust region


radius. In this discussion we assume the norm is the Euclidean norm but
other norms have been used in the trust region constraint. The trust region
radius is updated at each iteration based on a comparison of the actual
decrease in a merit function to the decrease predicted by the model. If
there is good agreement the radius is maintained or increased; if not, it is
decreased.
It is worth noting that the necessary condition for dx to be a solution to
(6:1) is that
(Bk +  S 2)dx = ?rh(xk )u ? rf (xk ) (6:2)
for some multiplier vector u and some nonnegative scalar . This equation
is used to generate approximate solutions to (6:1).
The removal of the positive de nite requirement on Bk does not come
without cost. The additional constraint is a quadratic inequality constraint
and hence it is not a trivial matter to nd a good approximate solution to
(6:1). Moreover, there is the bothersome possibility that the solution sets
of the linear constraints and the trust region constraint may be disjoint.
For this reason, research on trust region methods has centered on nding
di erent types of subproblems that have feasible solutions but still capture
Sequential Quadratic Programming 41
the essence of the quadratic subproblem. Several avenues of investigation
are summarized below.
One approach is to relax the linear constraints in such a way that the
resulting problem is feasible. In this case the linear constraint in (6:1) is
replaced by
rh(xk )tdx + k h(xk ) = 0 (6:3)
where 0  k  1. A major diculty with this approach is the problem
of choosing k so that (6.3) together with the trust region constraint has a
solution.
A second approach is to replace the equality constraints by a least squares
approximation. Then the equality constraints in (6:1) become the quadratic
constraint
rh(x ) dx + h(x )  (k )
k t k 2 2
(6:4)
where k is an appropriate value. One choice of k is the error in the linear
constraints evaluated at the \Cauchy" point. The Cauchy point is de ned
to be the optimal step in the steepest descent direction for the function
2

rh(xk )tdx + h(xk )
i.e., the step, scp , which minimizes this function in the steepest descent
direction. This yields
2
(k )2 = rh(xk )t scp + h(xk ) :

Another possibility is to take k to be any value of rh(xk )t s + h(xk )
for which
14k  ksk  2 4k
where 0 < 1  2 < 1. This approach assures that the quadratic sub-
problem is always feasible, but at the cost of some extra computation to
obtain k . Moreover, while the resulting subproblem has only two quadratic
constraints, nding a good approximate solution quickly is a matter that
has not been completely resolved.
Finally, in a Reduced Hessian approach to the trust region method the
linear constraint in (6:1) is replaced by
rh(xk )tdx + rh(xk )tsk = 0 (6:5)
where sk is the solution of
2
minimize rh(xk )t s + h(xk )
s
subject to: ksk   4k
42 Boggs and Tolle

for  2 (0; 1). Using the decomposition (3:27) from Section 3.4
dx = Zk pZ + Yk pY
where the columns of Z (xk ) are a basis for the null space of rh(xk )t and
the columns of Yk are a basis for the range space of rh(xk ), it follows from
(6:5) that
Yk pY = ?sk :
As in Section 3.4, the null space component, pZ , can now be seen to be the
solution of the quadratic problem
h it
minimize f (xk ) + Bk sk pZ + 12 pZ t Zk t Bk Zk pZ
pZ
subject to: kpZ k2  (4k )2 ? ksk k2 :
A great deal of research has been done on the subject of minimizing a
quadratic function subject to a trust region constraint so that quick and
accurate methods for approximating the solutions to these two problems are
available.
Much of the work to be done in transforming these approaches into al-
gorithms for which local and global convergence theorems can be proven is
similar in nature to that which must be done for standard SQP methods.
In particular, a method for updating the matrices Bk needs to be speci ed
(a wider class of updating schemes is now available, at least in theory, be-
cause there is no necessity of requiring them to have the positive de niteness
properties) and a merit function has to be speci ed. As was shown in Sec-
tion 3, local superlinear convergence depends on the steps approaching the
Newton-SQP steps as the solution is neared. In particular, this requires that
the modi ed constraints reduce to the linearized constraint of (ECQP) and
that the trust region constraint not be active so that steplengths of one can
be accepted. Conditions under which these events will occur have not been
established. Global convergence theorems have been proven for most of the
above variations of the trust region method by utilizing either the `1 or the
augmented Lagrangian merit function under the assumptions C1 and C2.
6.1. Notes and References
An introductory exposition of trust region methods for unconstrained opti-
mization can be found in the book by Dennis and Schnabel (1983). Methods
for minimizing a quadratic function subject to a trust region constraint can
be found in (Gay 1981) and (More and Sorensen 1983).
The rst application of trust region methods to the constrained problem
appears to be that of Vardi (1985), who used the constraint (6:3) in place
of the linearized equality constraints. This approach was also used by Byrd,
Sequential Quadratic Programming 43
Schnabel and Schultz (1985). The introduction of the quadratic constraint
(6:4) as a substitute for the linear constraint is due to Celis, Dennis and
Tapia (1985), who used the error at the \Cauchy point" as the k . A global
convergence theory for this strategy is given in (El-Alem 1991). Powell and
Yuan (1986) suggested the second version of this approach mentioned in
the text. Yuan (1990) proposed solution techniques for the resulting sub-
problem. The reduced Hessian approach has been introduced by Omojokun
(1989) who also considers the case when inequality constraints are present.
This approach has also been used by (Lalee, Nocedal and Plantega 1993)
for large-scale problems.

7. Practical Considerations

Our goal throughout this survey has been to concentrate on those theoretical
properties that bear on actual implementations, but we have mentioned
some of the diculties that arise in the implementation of an SQP algorithm
when the assumptions that we have made are not satis ed. In this section
we elaborate on a few of the more important of these diculties and suggest
some common computational techniques that can be used to work around
them. No attempt is made to be complete; the object is to give an indication
of the issues involved. We also discuss brie y the extension of the SQP
algorithm to the large scale case, where special considerations are necessary
to create an ecient algorithm.

7.1. The Solution of (QP)


The assumptions behind the theory in Sections 3 and 4 imply that (QP)
always has a solution. In practice, of course, this may not be the case.
(QP) may be infeasible or the objective function may be unbounded on the
feasible set and have no local minima. As stated earlier, surveying methods
for solving (QP) is beyond the scope of this paper, but we discuss some ways
of continuing the computations when these diculties arise.
One technique to avoid infeasibilities is to relax the constraints by using
the trust region methods in Section 6. Another approach is to take dx to be
some convenient direction when (QP) is infeasible, e.g., the steepest descent
direction for the merit function. The infeasibility of (QP) is usually detected
during a \Phase I" procedure that is used to obtain a feasible point with
which to begin the quadratic programming algorithm. If the constraints are
inconsistent, then there are no feasible points and the Phase I procedure
will fail. However, the direction, dx , obtained in this case can sometimes be
used to produce directions that reduce the constraint infeasibilities and can
thus be used to improve xk . For example, the phase I procedure known as
the \big M" method modi es (QP) by adding one new variable, say , in
44 Boggs and Tolle

such a way that the new problem


minimize
dx rf (xk )tdx + 12 dxtBk dx + M
subject to: rh(xk )tdx + h(xk ) = 0 (7:1)
rg(x ) dx + g(x ) ? e  0
k t k
is feasible. In this form M is a constant and e is the vector of ones. For
0  maxfgi(xk ) : gi (xk ) > 0g the initial point (dx; ) = (0; 0 ) is feasible
for (7:1). The constant M is chosen large enough so that, as (7:1) is being
solved,  is forced to be reduced. Once   0; dx is a feasible point for
(QP). If  cannot be reduced to zero, then the inequalities are inconsistent.
Nevertheless, it can be shown under certain conditions that the resulting dx
is a descent direction for the merit function F .
If the problem is unbounded, but feasible, then the diculty is due to
the structure of the Hessian approximation, Bk . For example, if Bk is not
positive de nite then a multiple of the identity (or non-negative diagonal
matrix) can be added to Bk to ensure that it is positive de nite. Note that
adding a strictly positive diagonal matrix to Bk is equivalent to using a trust
region constraint (see (6:2)).
Finally, in cases where (QP) does not yield approximate multipliers for
nonlinear program such as when (QP) is infeasible or the matrix G(xk ) has
linearly dependent columns, any reasonable approximation of the multipliers
will usually suce. The most common approach in these cases is to use a
least squares multiplier approximation. If (NLP) does not satisfy A2 then
at best the theoretical convergence rate will be slowed (if x is a simple
degeneracy) and at worst even the Newton method may fail.
7.2. Adjusting the Merit Function Parameter
It was shown in Section 4 that setting the parameter  to ensure that the
direction dx is a descent direction for the merit function 1 given by (4:18)
can be relatively straightforward. Setting the parameter for F is only a
little more dicult. In either case adjusting the parameter can cause both
theoretical and computational diculties. In theory, there is no problem
with having the parameter large. Indeed, if only increases in the parameter
are allowed, the assumptions of Section 4 coupled with a reasonable ad-
justment strategy will lead to a provably nite value of the parameter and
convergence can be proved using the techniques of Section 4. Computation-
ally, however, having a large value of the parameter implies that there will
be too much emphasis on satisfying the constraints. The result will be that
the iterates will be forced to follow the constraint boundary closely which, in
highly nonlinear problems, can cause the algorithm to be slow. A strategy
for only allowing increases in the parameter can lead to an excessively large
value due entirely to an early iterate's being far from the solution.
Sequential Quadratic Programming 45
Ideally the parameter should be adjusted up or down at various stages
of the iteration process to ensure both good theoretical and computational
performance. Some strategies for this have been proposed. For example, it
is possible to allow controlled decreases in the parameter that ensure that
the predicted decrease in the merit function is not dominated by a decrease
in the constraint infeasibilities. For such choices it is still possible to prove
convergence.
7.3. Nonmonotone Decrease of the Merit Function
Global convergence results are usually proved by insisting that an appro-
priate merit function be suciently reduced at each iteration. Sometimes,
however, it is more ecient computationally to be less conservative and
to allow steps to be accepted even if the merit function is temporarily in-
creased. If, for example, the merit function is forced to be reduced over
any xed number of iterations, then convergence follows. In practice such
strategies have been quite successful, especially near the solution. As a par-
ticular example, note that the merit function 1 may not allow a steplength
of one near the solution. One remedy for this is to accept the full step tem-
porarily even if 1 increases. Then, if 1 is not suciently reduced after a
small number of steps, the last good iterate is restored and reduction in 1
is required for the next iteration.
As a second example, it was noted in Section 4 that the use of F requires
the evaluation of f , h, g , and their gradients just to test a prospective point
for acceptance. This diculty can be circumvented by using an \approxi-
mate" or \working" merit function that only requires evaluation of f , h, and
g, and no gradients, to test a point. For example, F could be approximated
(in the equality-constrained case) by
kF (x) = f (x) + h(x)tu(xk ) + 12  kh(x)k2
where u(xk ) stays xed throughout the kth iteration. This is then coupled
with a strategy to monitor the iterations to ensure that F is suciently
reduced after a certain number of iterations as discussed above.
7.4. Large Scale Problems
Ecient SQP algorithms in the large scale case depend on carefully address-
ing many factors. In this section we mention two of these, namely problem
structure and the solution of large scale quadratic programs.
Problems are considered large if, to solve them eciently, either their
structure must be exploited or the storage and manipulation of the matrices
involved must be handled in a special manner. The most obvious structure,
and the one most commonly considered, is sparsity of the matrices rh, rg,
and H L: Typically in large scale problems most constraints depend only on
46 Boggs and Tolle

a few of the variables and the objective function is \ partially separable" (a


partially separable function is, for example made up of a sum of functions,
each of which depends only on a few of the variables). In such cases the
matrices are sparse. In other cases, one or more of the matrices is of low
rank, and this structure can also be exploited.
A key to using SQP in the large scale case is to be able to solve, or
to solve approximately, the large quadratic programming subproblems. In
the small scale case, it rarely matters how (QP) is solved and it usually
makes sense to solve it completely. Depending on the algorithm used in the
large scale case, it is often essential to realize that (QP) at iteration k + 1
di ers only slightly from that at iteration k. In these cases, the active set
may remain the same, or change only slightly, and it follows that solving
(approximately) the next (QP) can be accomplished quickly. It must be
shown, however, that approximate solutions of (QP) are descent directions
for an appropriate merit function.
Recently, ecient interior point methods have been developed for solving
large scale linear programs; these ideas are now being applied to large scale
quadratic programs. One such method is the \subspace" method where, at
each iteration, a low-dimensional subspace is chosen and (QP) restricted to
that subspace is solved. A step in the resulting direction is calculated and
the procedure iterated. It has been shown that any number of iterations of
this technique gives rise to a descent direction for an augmented Lagrangian
merit function; an SQP algorithm based on this has shown promise.

7.5. Notes and References


See (Fletcher 1981) for an introduction to the solution of quadratic programs
using classical techniques, including the Big M method. Recently there
have been numerous papers on the application of interior-point methods
to quadratic programming problems. This is still the subject of intense
research and there are no good survey papers on the subject. See (Vanderbei
and Carpenter 1993) and (Boggs, Domich, Rogers and Witzgall 1991a) or
(Boggs, Domich, Rogers and Witzgall 1994a) for a discussion of two di erent
approaches.
Adjusting the penalty parameter in the augmented Lagrangian merit func-
tion is discussed in (Schittkowski 1981), (Schittkowski 1983), (Powell and
Yuan 1986), (Byrd and Nocedal 1991), (Boggs, Tolle and Kearsley 1994b),
(Byrd and Nocedal 1991), and (El-Alem 1992). The idea of allowing the
merit function to increase temporarily is an old one. It is sometimes re-
ferred to as a nonmonotone line search; for a general discussion see (Grippo,
Lampariello and Lucidi 1986). Its use with the `1 merit function is dis-
cussed in (Chamberlain et al. 1982). Using approximate merit functions is
Sequential Quadratic Programming 47
suggested in (Powell and Yuan 1986), (Boggs and Tolle 1989), and (Boggs
et al. 1994b).
Based on the work to date there is signi cant promise for SQP in the large
scale case, but more work is required to compare SQP with other large scale
techniques. For work to date see (Murray 1994), (Murray and Prieto to
appear), and (Boggs et al. 1994b).

Acknowledgments: The authors would like to thank Ubaldo Garcia-


Palomares, Anthony Kearsley, Stephen Nash, Dianne O'Leary, and G. W.
Stewart for reading and commenting on early drafts of the paper.
48 Boggs and Tolle

REFERENCES

D. P. Bertsekas (1980), Variable metric methods for constrained optimization based


on di erentiable exact penalty functions, in Proceedings of the Eighteenth
Allerton Conference on Communications, Control and Computing, Allerton
Park, Illinois, pp. 584{593.
D. P. Bertsekas (1982), Constrained Optimization and Lagrange Multipliers Methods,
Academic Press, London.
P. T. Boggs and J. W. Tolle (1980), `Augmented Lagrangians which are quadratic in
the multiplier', Journal of Optimization Theory and Applications 31, 17{26.
P. T. Boggs and J. W. Tolle (1984), `A family of descent functions for constrained
optimization', SIAM Journal on Numerical Analysis 21, 1146{1161.
P. T. Boggs and J. W. Tolle (1989), `A strategy for global convergence in a sequen-
tial quadratic programming algorithm', SIAM Journal on Numerical Analysis
21, 600{623.
P. T. Boggs and J. W. Tolle (1994), `Convergence properties of a class of rank-two
updates', SIAM Journal on Optimization 4, 262{287.
P. T. Boggs, P. D. Domich, J. E. Rogers and C. Witzgall (1991a), `An interior
point method for linear and quadratic programming problems', Mathematical
Programming Society Committee on Algorithms Newsletter (19), 32{40.
P. T. Boggs, P. D. Domich, J. E. Rogers and C. Witzgall (1994a), An interior-point
method for general large scale quadratic programming problems, Internal Re-
port (in process), National Institute of Standards and Technology.
P. T. Boggs, J. W. Tolle and A. J. Kearsley (1991b), A merit function for inequality
constrained nonlinear programming problems, Internal Report 4702, National
Institute of Standards and Technology.
P. T. Boggs, J. W. Tolle and A. J. Kearsley (1994b), A practical algorithm for
general large scale nonlinear optimization problems, Internal report, National
Institute of Standards and Technology.
P. T. Boggs, J. W. Tolle and P. Wang (1982), `On the local convergence of quasi-
Newton methods for constrained optimization', SIAM Journal on Control and
Optimization 20, 161{171.
J. F. Bonnans, E. R. Panier, A. L. Tits and J. L. Zhou (1992), `Avoiding the
Maratos e ect by means of a nonmonotone line search II. Inequality con-
strained problems|feasible iterates', SIAM Journal on Numerical Analysis
29, 1187{1202.
C. G. Broyden, J. E. Dennis, Jr. and J. J. More (1973), `On the local and superlinear
convergence of quasi-Newton methods', Journal of the Institute of Mathemat-
ical Applications 12, 223{246.
R. H. Byrd and J. Nocedal (1991), `An analysis of reduced Hessian methods for
constrained optimization', Mathematical Programming 49, 285{323.
R. H. Byrd and R. B. Schnabel (1986), `Continuity of the null space basis and
constrained optimization', Mathematical Programming 35, 32{41.
R. H. Byrd, R. B. Schnabel and G. A. Schultz (1985), `A trust-region strategy for
nonlinearly constrained optimization', SIAM Journal on Numerical Analysis
24, 1152{1170.
Sequential Quadratic Programming 49
R. H. Byrd, R. A. Tapia and Y. Zhang (1992), `An SQP augmented Lagrangian
BFGS algorithm for constrained optimization', SIAM Journal on Optimiza-
tion 2, 210{241.
M. R. Celis, J. E. Dennis, Jr. and R. A. Tapia (1985), A trust-region strategy for
nonlinear equality constrained optimization, in Numerical Optimization 1984
(P. Boggs, R. Byrd and R. Schnabel, eds), SIAM, Philadelphia, pp. 71{82.
R. Chamberlain, C. Lemarechal, H. C. Pedersen and M. J. D. Powell (1982), `The
watchdog technique for forcing convergence in algorithms for constrained op-
timization', Mathematical Programming Study 16, 1{17.
T. F. Coleman (1990), On characterizations of superlinear convergence for con-
strained optimization, in Lectures in Applied Mathematics, American Mathe-
matical Society, Providence, Rhode Island, pp. 113{133.
T. F. Coleman and A. R. Conn (1984), `On the local convergence of a quasi-Newton
method for the nonlinear programming problem', SIAM Journal on Numerical
Analysis 21, 755{769.
T. F. Coleman and P. A. Feynes (1992), `Partitioned quasi-Newton methods for non-
linear equality constrained optimization', Mathematical Programming 53, 17{
44.
T. F. Coleman and D. C. Sorensen (1984), `A note on the computation of an or-
thonormal basis for the null space of a matrix', Mathematical Programming
29, 234{242.
J. E. Dennis, Jr. and J. J. More (1977), `Quasi-Newton methods, motivation and
theory', SIAM Review 19, 46{89.
J. E. Dennis, Jr. and R. B. Schnabel (1983), Numerical Methods for Unconstrained
Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cli s, New
Jersey.
M. M. El-Alem (1991), `A global convergence theory for the Celis-Dennis-Tapia trust
region algorithm for constrained optimization', SIAM Journal on Numerical
Analysis 28, 266{290.
M. M. El-Alem (1992), A robust trust region algorithm with nonmonotonic penalty
parameter scheme for constrained optimization, Department of Mathematical
Sciences 92-30, Rice University.
R. Fletcher (1972), A class of methods for nonlinear programming, iii: Rates of con-
vergence, in Numerical Methods for Nonlinear Optimization (F. A. Lootsma,
ed.), Academic Press, New York, pp. 371{382.
R. Fletcher (1981), Practical Methods of Optimization: Volume 2, John Wiley &
Sons, New York.
R. Fontecilla (1988), `Local convergence of secant methods for nonlinear constrained
optimization', SIAM Journal on Numerical Analysis 25, 692{712.
R. Fontecilla, T. Steihaug and R. A. Tapia (1987), `A convergence theory for a class
of quasi-Newton methods for constrained optimization', SIAM Journal on
Numerical Analysis 24, 1133{1151.
M. Fukushima (1986), `A successive quadratic programming algorithm with global
and superlinear convergence properties', Mathematical Programming 35, 253{
264.
50 Boggs and Tolle

D. Gabay (1982), `Reduced quasi-Newton methods with feasibility improvement


for nonlinearly constrained optimization', Mathematical Programming Study
16, 18{44.
U. M. Garcia-Palomares and O. L. Mangasarian (1976), `Superlinearly convergent
quasi-Newton methods for nonlinearly constrained optimization problems',
Mathematical Programming 11, 1{13.
D. M. Gay (1981), `Computing optimal locally constrained steps', SIAM Journal on
Scienti c and Statistical Computing 2, 186{197.
R. Ge and M. J. D. Powell (1983), `The convergence of variable metric matrices in
unconstrained optimization', Mathematical Programming 27, 123{143.
J. C. Gilbert (1993), Superlinear convergence of a reduced BFGS method with piece-
wise line-search and update criterion, Rapport de Recherche 2140, Institut
National de Recherche en Informatique et en Automatique.
P. E. Gill, W. Murray and M. H. Wright (1981), Practical Optimization, Academic
Press, New York.
P. E. Gill, W. Murray, M. A. Saunders and M. H. Wright (1986), User's guide for
NPSOL (version 4.0): A Fortran package for nonlinear programming, Techni-
cal Report SOL 2, Department of Operations Research, Stanford University.
P. E. Gill, W. Murray, M. A. Saunders, G. W. Stewart and M. H. Wright (1985),
`Properties of a representation of a basis for the null space', Mathematical
Programming 33, 172{186.
S. T. Glad (1979), `Properties of updating methods for the multipliers in augmented
Lagrangians', Journal of Optimization Theory and Applications 28, 135{156.
J. Goodman (1985), `Newton's method for constrained optimization', Mathematical
Programming 33, 162{171.
L. Grippo, F. Lampariello and S. Lucidi (1986), `A nonmonotone line search tech-
nique for Newton's method', SIAM Journal on Numerical Analysis 23, 707{
716.
S.-P. Han (1976), `Superlinearly convergent variable metric algorithms for general
nonlinear programming problems', Mathematical Programming 11, 263{282.
S.-P. Han (1977), `A globally convergent method for nonlinear programming', Jour-
nal of Optimization Theory and Applications 22, 297{309.
M. Lalee, J. Nocedal and T. Plantega (1993), On the implementation of an algorithm
for large-scale equality optimization, Department of Electrical Engineering
and Computer Science NAM 09-93, Northwestern University.
D. G. Luenberger (1984), Linear and Nonlinear Programming, 2nd Edition, Addison-
Wesley, Reading, Massachusetts.
J. More and D. C. Sorensen (1983), `Computing a trust region step', SIAM Journal
of Scienti c and Statistical Computing 4, 553{572.
W. Murray (1994), Algorithms for large nonlinear problems, in Mathematical Pro-
gramming: State of the Art 1994 (J. R. Birge and K. G. Murty, eds), Univer-
sity of Michigan, Ann Arbor, MI, pp. 172{185.
W. Murray and F. J. Prieto ( to appear), `A sequential quadratic programming
algorithm using an incomplete solution of the subproblem', SIAM Journal on
Optimization.
S. G. Nash and A. Sofer (1995), Linear and Nonlinear Programming, McGraw-Hill,
New York.
Sequential Quadratic Programming 51
J. Nocedal (1992), `Theory of algorithms for unconstrained optimization', Acta Nu-
merica 1991, 199{242.
J. Nocedal and M. L. Overton (1985), `Projected Hessian updating algorithms for
nonlinearly constrained optimization problems', SIAM Journal on Numerical
Analysis 22, 821{850.
E. M. Omojokun (1989), Trust Region Algorithms for Optimization with Nonlinear
Equality and Inequality Constraints, PhD thesis, University of Colorado.
J. M. Ortega and W. C. Rheinboldt (1970), Iterative Solution of Nonlinear Equations
In Several Variables, Academic Press, New York.
E. R. Panier and A. L. Tits (1993), `On combining feasibility, descent and superlinear
convergence in inequality constrained optimization', Mathematical Program-
ming 59, 261{276.
E. Polak (1989), Basics of minimax algorithms, in Nonsmooth Optimization and
Related Topics (F. H. Clark, V. F. Dem'yanov and F. Giannessi, eds), Plenum,
New York, pp. 343{369.
M. J. D. Powell (1977), A fast algorithm for nonlinearly constrained optimization cal-
culations, in Numerical Analysis Dundee, 1977 (G. A. Watson, ed.), Springer-
Verlag, Berlin, pp. 144{157.
M. J. D. Powell (1978a), `Algorithms for nonlinear constraints that use Lagrangian
functions', Mathematical Programming 14, 224{248.
M. J. D. Powell (1978b), The convergence of variable metric methods for nonlinearly
constrained optimization calculations, in Nonlinear Programming 3 (O. Man-
gasarian, R. Meyer and S. Robinson, eds), Academic Press, New York, NY,
pp. 27{64.
M. J. D. Powell and Y. Yuan (1986), `A recursive quadratic programming algorithm
that uses di erentiable exact penalty functions', Mathematical Programming
35, 265{278.
S. M. Robinson (1974), `Perturbed Kuhn-Tucker points and rates of convergence
for a class of nonlinear-programming algorithms', Mathematical Programming
7, 1{16.
K. Schittkowski (1981), `The nonlinear programming method of Wilson, Han, and
Powell with an augemented Lagrangian type line search function, part 1:
Convergence analysis', Numerische Mathematik 38, 83{114.
K. Schittkowski (1983), `On the convergence of a sequential quadratic programming
method with an augmented Lagrangian line search function', Math. Opera-
tionsforsch. U. Statist., Ser. Optimization 14, 197{216.
J. Stoer (1984), `The convergence of matrices generated by rank-2 methods from the
restricted -class of Broyden', Numerische Mathematik 44, 37{52.
R. A. Tapia (1974), `A stable approach to Newton's method for optimization prob-
lems with equality constraints', Journal of Optimization Theory and Applica-
tions 14, 453{476.
R. A. Tapia (1977), `Diagonalized multiplier methods and quasi-Newton methods for
constrained optimization', Journal of Optimization Theory and Applications
22, 135{194.
R. A. Tapia (1978), Quasi-Newton methods for equality constrained optimization:
Equivalence of existing methods and a new implementation, in Nonlinear
52 Boggs and Tolle

Programming 3 (O. Mangasarian, R. Meyer and S. Robinson, eds), Academic


Press, New York, NY, pp. 125{164.
R. A. Tapia (1988), `On secant updates for use in general constrained optimization',
Mathematics of Computation 51, 181{202.
R. J. Vanderbei and T. J. Carpenter (1993), `Symmetric inde nite systems for inte-
rior point methods', Mathematical Programming 58, 1{32.
A. Vardi (1985), `A trust-region algorithm for equality constrained minimization and
implementation', SIAM Journal on Numerical Analysis 22, 575{591.
R. B. Wilson (1963), A Simplicial Algorithm for Concave Programming, PhD thesis,
Harvard University.
P. Wolfe (1975), A method of conjugate subgradients for minimizingnondi erentiable
functions, in Mathematical Programming Study 3 (M. Balinski and P. Wolfe,
eds), North Holland, pp. 145{173.
Y. Yuan (1985), `An only 2-step q-superlinear convergence example for some algo-
rithms that use reduced Hessian approximations', Mathematical Programming
32, 224{231.
Y. Yuan (1990), `On a subproblem of trust region algorithms for constrained opti-
mization', Mathematical Programming 47, 53{63.

View publication stats

You might also like