Sequential Quadratic Programming: Acta Numerica January 1995
Paul T. Boggs and Jon W. Tolle
1 Introduction
2 The Basic SQP Method
3 Local Convergence
4 Merit Functions and Global Convergence
5 Global to Local Behavior
6 SQP Trust Region Methods
7 Practical Considerations
References
1. Introduction
ear constraints makes this choice inappropriate. For example, consider the
problem

    minimize_x  −x_1 − (1/2)(x_2)²
    subject to: (x_1)² + (x_2)² − 1 = 0.

The point (1, 0) is a solution satisfying A1–A4, but at the point (1 + ε, 0)
the approximating quadratic program is (with d replacing d_x)

    minimize_d  −d_1 − (1/2)(d_2)²
    subject to: d_1 = −(1/2) ε (2 + ε)(1 + ε)⁻¹,

which is unbounded regardless of how small ε is. Thus the algorithm that
computes d_x breaks down for this problem.
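This breakdown is easy to see numerically. The sketch below (ours, not from the paper; the value of eps is arbitrary) fixes d_1 at the value forced by the linearized constraint and evaluates the quadratic objective along the free direction d_2:

```python
import numpy as np

def qp_objective(d1, d2):
    """Objective of the approximating QP: -d1 - (1/2) d2^2."""
    return -d1 - 0.5 * d2**2

eps = 1e-3                                   # expansion point (1 + eps, 0)
d1 = -0.5 * eps * (2 + eps) / (1 + eps)      # forced by the linearized constraint

# d2 is unconstrained, and the objective decreases without bound in d2.
values = [qp_objective(d1, d2) for d2 in (1.0, 10.0, 100.0)]
print(values)
```

However small eps is taken, the objective tends to −∞ along d_2, so the subproblem has no solution.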
To take nonlinearities in the constraints into account while maintaining
the linearity of the constraints in the subproblem, the SQP method uses a
quadratic model of the Lagrangian function as the objective. This can be
justified by noting that conditions A1–A4 imply that x* is a local minimum
for the problem

    minimize_x  L(x, u*, v*)
    subject to: h(x) = 0
                g(x) ≤ 0.

Note that the constraint functions are included in the objective function for
this equivalent problem. Although the optimal multipliers are not known,
approximations u^k and v^k to the multipliers can be maintained as part of the
iterative process. Then given a current iterate, (x^k, u^k, v^k), the quadratic
Taylor series approximation in x for the Lagrangian is

    L(x^k, u^k, v^k) + ∇L(x^k, u^k, v^k)^t d_x + (1/2) d_x^t H L(x^k, u^k, v^k) d_x.
A strong motivation for using this function as the objective function in the
quadratic subproblem is that it generates iterates that are identical to those
generated by Newton's method when applied to the system composed of the
first order necessary condition (condition A1) and the constraint equations
(including the active inequality constraints). This means that the resulting
algorithm will have good local convergence properties. In spite of these local
convergence properties there are good reasons to consider choices other than
the actual Hessian of the Lagrangian, for example approximating matrices
that have properties that permit the quadratic subproblem to be solved at
any x^k and the resulting algorithm to be amenable to a global convergence
analysis. Letting B_k be an approximation of H L(x^k, u^k, v^k), we can write
the quadratic subproblem as:

    minimize_{d_x}  ∇L(x^k, u^k, v^k)^t d_x + (1/2) d_x^t B_k d_x
    subject to: ∇h(x^k)^t d_x + h(x^k) = 0                        (2.4)
                ∇g(x^k)^t d_x + g(x^k) ≤ 0.
The form of the quadratic subproblem most often found in the literature,
and the one that will be employed here, is

    minimize_{d_x}  ∇f(x^k)^t d_x + (1/2) d_x^t B_k d_x
    subject to: ∇h(x^k)^t d_x + h(x^k) = 0                        (QP)
                ∇g(x^k)^t d_x + g(x^k) ≤ 0.

These two forms are equivalent for problems with only equality constraints
since, by virtue of the linearized constraints, the term ∇h(x^k)^t d_x is constant
and the objective function becomes ∇f(x^k)^t d_x + (1/2) d_x^t B_k d_x. The two sub-
problems are not quite equivalent in the inequality-constrained case unless
the multiplier estimate v^k is zero for all inactive linear constraints. However
(QP) is equivalent to (2.4) for the slack-variable formulation of (NLP) given
by
    minimize_{x,z}  f(x)
    subject to: h(x) = 0                                          (2.5)
                g(x) + z = 0
                z ≥ 0,

where z ∈ R^p is the vector of slack variables. Therefore (QP) can be con-
sidered an appropriate quadratic subproblem for (NLP).
The solution d_x of (QP) can be used to generate a new iterate x^{k+1}, by
taking a step from x^k in the direction of d_x. But to continue to the next
iteration new estimates for the multipliers are needed. There are several
ways in which these can be chosen, but one obvious approach is to use the
optimal multipliers of the quadratic subproblem. (Other possibilities will be
discussed where appropriate.) Letting the optimal multipliers of (QP) be
denoted by u_qp and v_qp, and setting

    d_u = u_qp − u^k
    d_v = v_qp − v^k

allow the updates of (x, u, v) to be written in the compact form

    x^{k+1} = x^k + α d_x
    u^{k+1} = u^k + α d_u                                         (2.6)
    v^{k+1} = v^k + α d_v

for some selection of the steplength parameter α. Once the new iterates
are constructed, the problem functions and derivatives are evaluated and a
prescribed choice of B_{k+1} calculated. It will be seen that the effectiveness
of the algorithm is determined in large part by this choice.
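The whole iteration fits in a few lines of numpy. The sketch below is our own illustration, not from the paper: it applies the update (2.6) with α = 1 and B_k taken as the exact Hessian of the Lagrangian, on the hypothetical equality-constrained problem minimize (x_1 − 2)² + x_2² subject to x_1² + x_2² − 1 = 0, whose solution is x* = (1, 0) with multiplier u* = 1.

```python
import numpy as np

def sqp_iterate(x, u, iters=10):
    """Basic SQP iteration for min (x1-2)^2 + x2^2  s.t.  x1^2 + x2^2 = 1.

    Each step solves the KKT system of the quadratic subproblem with
    B_k = H L(x^k, u^k) and takes a unit steplength (alpha = 1).
    """
    for _ in range(iters):
        grad_f = np.array([2 * (x[0] - 2), 2 * x[1]])
        grad_h = np.array([2 * x[0], 2 * x[1]])
        h = x[0]**2 + x[1]**2 - 1
        B = (2 + 2 * u) * np.eye(2)          # exact Hessian of the Lagrangian
        K = np.block([[B, grad_h[:, None]],
                      [grad_h[None, :], np.zeros((1, 1))]])
        rhs = -np.concatenate([grad_f + u * grad_h, [h]])
        d = np.linalg.solve(K, rhs)          # (d_x, d_u)
        x = x + d[:2]
        u = u + d[2]
    return x, u

x, u = sqp_iterate(np.array([1.2, 0.3]), 0.5)
print(x, u)   # converges to x* = (1, 0), u* = 1
```

With the exact Hessian this is Newton's method on the first order conditions, which is precisely the local behavior discussed in Section 3.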
3. Local Convergence
In this section the theory of local convergence for the most common versions
of the SQP method is developed. The local convergence analysis establishes
conditions under which the iterates converge to a solution and at what rate,
given that the starting data (e.g., x^0, u^0, v^0, B_0) are sufficiently close to
the corresponding data at a solution, x*. As will be seen, it is the method of
generating the matrices B_k that will be the determining factor in obtaining
the convergence results.
An SQP method can have excellent theoretical local convergence prop-
erties; quadratic, superlinear, or two-step superlinear convergence of the
x-iterates can be achieved by requiring the B_k to approximate H L* in
an appropriate manner. Although each of these local convergence rates is
achievable in practice under certain conditions, the need for a globally con-
vergent algorithm further restricts the choice of the Bk . Certain choices
of the Bk have led to algorithms that extensive numerical experience has
shown to be quite acceptable. Therefore, while there is a gap between what
is theoretically possible and what has been achieved in terms of local con-
vergence, this discrepancy should not obscure the very real progress that
the SQP methods represent in solving nonlinearly constrained optimization
problems.
Two important assumptions simplify the presentation. First, it will be
assumed that the active inequality constraints for (NLP) at x* are known. As
will be discussed in Section 5, this assumption can be justified for many of the
SQP algorithms because the problem (QP) at x^k will have the same active
constraints as (NLP) at x* when x^k is near x*. The fact that the active
constraints, and hence the inactive constraints, are correctly identified at x^k
means that those inequality constraints that are inactive for (NLP) at x* can
be ignored and those that are active can be changed to equality constraints
without changing the solution of (QP). Thus, under this assumption, only
equality-constrained problems need be considered for the local analysis. For
the remainder of this section (NLP) will be assumed to be an equality-
constrained problem and the reference to inequality multiplier vectors (the
v^t g(x) term) will be eliminated from the Lagrangian. For reference, we
rewrite the quadratic subproblem with only equality constraints:

    minimize_{d_x}  ∇f(x^k)^t d_x + (1/2) d_x^t B_k d_x
    subject to: ∇h(x^k)^t d_x + h(x^k) = 0.                       (ECQP)
The second assumption follows from the fact that the local convergence
for SQP methods, as for many optimization methods, is based on Newton's
method; that is, the convergence and rate of convergence results are ob-
tained by demonstrating an asymptotic relation between the iterates of the
method being analyzed and the classical Newton steps. Because Newton's
method for a system with a nonsingular Jacobian at the solution requires
that unit steps be taken to obtain rapid convergence, we assume that the
merit function allows α = 1 to be used in the update formulas (2.6). Simi-
lar results to those given in this section can be obtained if the steplengths
converge to one sufficiently quickly.
For the equality-constrained problem a good initial estimate of x* can be
used to obtain a good estimate for the optimal multiplier vector. The first
order necessary condition, A1, together with condition A2, leads to the
formula

    u* = −[∇h(x*)^t A ∇h(x*)]⁻¹ ∇h(x*)^t A ∇f(x*)                 (3.1)

for any nonsingular matrix A that is positive definite on the null space of
∇h(x*)^t. In particular, if A is taken to be the identity matrix, then (3.1)
defines the least squares solution of the first order necessary conditions. By
our smoothness assumptions, it follows that

    u^0 = −[∇h(x^0)^t ∇h(x^0)]⁻¹ ∇h(x^0)^t ∇f(x^0)               (3.2)

can be made arbitrarily close to u* by choosing x^0 close to x*. Consequently,
no additional assumption will be made about an initial optimal multiplier
estimate for the local convergence analysis.
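Formula (3.2) is just a linear least-squares computation. A minimal numpy sketch (ours; the two-variable instance used for the check is made up):

```python
import numpy as np

def initial_multiplier(grad_f, grad_h):
    """u^0 = -[grad_h^t grad_h]^{-1} grad_h^t grad_f, formula (3.2).

    grad_f : (n,) gradient of f at x^0
    grad_h : (n, m) matrix whose columns are the constraint gradients at x^0
    """
    # Equivalent to the normal-equations formula, but numerically safer:
    # solve min_u || grad_h @ u + grad_f || in the least-squares sense.
    u0, *_ = np.linalg.lstsq(grad_h, -grad_f, rcond=None)
    return u0

# Example: f = -x1 - x2^2/2, h = x1^2 + x2^2 - 1 at x^0 = (1, 0):
u0 = initial_multiplier(np.array([-1.0, 0.0]), np.array([[2.0], [0.0]]))
print(u0)   # [0.5], which makes grad_f + grad_h @ u0 = 0
```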
Denoting the optimal multiplier for (ECQP) by u_qp we see that the first
order conditions for this problem are

    B_k d_x + ∇h(x^k) u_qp = −∇f(x^k)
    ∇h(x^k)^t d_x = −h(x^k).

If, as discussed in Section 2, we set

    u^{k+1} = u_qp = u^k + d_u                                    (3.3)

then the above equations become

    B_k d_x + ∇h(x^k) d_u = −∇L(x^k, u^k)                         (3.4)
    ∇h(x^k)^t d_x = −h(x^k).                                      (3.5)

We are now in a position to begin the analysis of the SQP methods.
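Equations (3.4)-(3.5) form one symmetric linear system in (d_x, d_u). A direct-solve sketch (our own; a production code would factor the KKT matrix more carefully):

```python
import numpy as np

def kkt_step(Bk, grad_L, grad_h, h):
    """Solve (3.4)-(3.5):  Bk d_x + grad_h d_u = -grad_L,
                           grad_h^t d_x        = -h.

    Bk: (n, n), grad_h: (n, m), grad_L: (n,), h: (m,).
    """
    n, m = grad_h.shape
    K = np.block([[Bk, grad_h],
                  [grad_h.T, np.zeros((m, m))]])
    sol = np.linalg.solve(K, -np.concatenate([grad_L, h]))
    return sol[:n], sol[n:]          # d_x, d_u
```

Verifying the two residuals after each solve is a cheap sanity check on any implementation of this step.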
B3: The matrices B_k have uniformly bounded inverses; i.e., there exists a
constant β such that ‖B_k⁻¹‖ ≤ β for all k.

and

    ‖P^k (B_k − H L*)(x^k − x*)‖ / ‖x^k − x*‖ < γ                 (3.13)

for all k, then the sequences {x^k} and {(x^k, u^k)} generated by the SQP algo-
rithm are well defined and converge linearly to x* and (x*, u*) respectively.
The sequence {u^k} converges R-linearly to u*.
Condition (3.13) is, in conjunction with the other hypotheses of the theo-
rem, almost necessary for linear convergence. In fact, if the other hypotheses
of the theorem hold and the sequence {x^k} converges linearly, then there
exists a γ such that

    ‖P^k (B_k − H L*)(x^k − x*)‖ / ‖x^k − x*‖ < γ
for all k. These results and those that follow indicate the crucial role that
the projection of (B_k − H L*) plays in the theory. The inequality (3.13) is
implied by the stronger conditions ‖P^k (B_k − H L*)‖ < γ or

    ‖B_k − H L*‖ < γ                                              (3.14)

which are easier to verify in practice.
In order to satisfy (3.14) it is not necessary that the approximations
converge to the true Hessian but only that the growth of the difference
‖B_k − H L*‖ be kept under some control. In the quasi-Newton theory for
unconstrained optimization this can be accomplished if the approximating
matrices have a property called bounded deterioration. This concept can be
generalized from unconstrained optimization to the constrained setting in a
straightforward way.
Definition 3.1  A sequence of matrix approximations, {B_k}, for the SQP
method is said to have the bounded deterioration property if there exist
constants α_1 and α_2 independent of k such that

    ‖B_{k+1} − H L*‖ ≤ (1 + α_1 θ_k) ‖B_k − H L*‖ + α_2 θ_k       (3.15)

where

    θ_k = max{ ‖x^{k+1} − x*‖, ‖x^k − x*‖, ‖u^{k+1} − u*‖, ‖u^k − u*‖ }.
It seems plausible that the bounded deterioration condition when applied
to the SQP process will lead to a sequence of matrices that satisfy B2,
B3, and (3.13) provided the initial matrix is close to the Hessian of the
Lagrangian and (x^k, u^k) is close to (x*, u*). Indeed this is the case, as can
be shown by induction to lead to the following result.

Theorem 3.3  Suppose that the sequence of iterates {(x^k, u^k)} is gener-
ated by the SQP algorithm and the sequence {B_k} of symmetric matrix
approximations satisfies B1 and (3.15). If ‖x^0 − x*‖ and ‖B_0 − H L*‖ are
where s and y are as in (3.21) and (3.22). The important qualities of the
matrices generated by this formula are that they have the bounded deterio-
ration property and satisfy a hereditary positive definiteness condition. The
latter states that if B_k is positive definite and

    y^t s > 0                                                     (3.24)

then B_{k+1} is positive definite. If the matrix approximations are positive
definite then (ECQP) is easily solved and the SQP steps are well defined.
Unfortunately, because the Hessian of the Lagrangian at (x*, u*) need only
be positive definite on a subspace, it need not be the case that (3.24) is
satisfied and hence the algorithm may break down, even close to the solution.
However, if the Hessian of the Lagrangian is positive definite, (3.24) is valid
provided B_k is positive definite and (x^k, u^k) is close to (x*, u*). In this case
this BFGS-SQP algorithm has the same local convergence properties as the
PSB-SQP method.
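The update formula (3.23) itself falls outside this excerpt, so the sketch below uses the standard BFGS formula together with the curvature test (3.24); skipping the update when (3.24) fails is one common safeguard, not the paper's prescription.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update (the formula labelled (3.23) in the text).

    The update is applied only when the curvature condition y^t s > 0 of
    (3.24) holds; by hereditary positive definiteness this keeps B
    positive definite. Otherwise B is returned unchanged.
    """
    ys = y @ s
    if ys <= 0:                  # (3.24) fails: skip the update
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / ys
```

The updated matrix satisfies the secant condition B_{k+1} s = y, which is a convenient correctness check on any implementation.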
Theorem 3.7  Suppose that H L* is positive definite and let B_0 be an ini-
tial positive definite matrix. Suppose that the sequence of iterates {(x^k, u^k)}
is generated by the SQP algorithm using the sequence of matrix approxi-
mations generated by the BFGS update (3.23). Then, if ‖x^0 − x*‖ and
‖B_0 − H L(x*, u*)‖ are sufficiently small and u^0 is given by (3.2), the se-
quence {B_k} is of bounded deterioration and (3.16) is satisfied. Therefore,
the iterates (x^k, u^k) converge superlinearly to the pair (x*, u*). In addi-
tion the x-iterates converge superlinearly and the multipliers converge R-
superlinearly.
The assumption that H L* is positive definite allows B_0 to be both positive
definite and close to H L*. The requirement that H L* be positive definite is
satisfied, for example, if (NLP) is convex, but cannot be assumed in general.
However, because the positive definiteness of B_k permits the solution of the
quadratic subproblems independent of the nearness of the iterate to the
solution, the BFGS method has been the focus of vigorous research efforts
to adapt it to the general case. Several of these efforts have resulted in
implementable algorithms.
One scheme, which we call the Powell-SQP method, is to maintain the
positive definite property of the Hessian approximations by modifying the
update formula (3.23) by replacing y in (3.22) by

    ŷ = θ y + (1 − θ) B_k s                                       (3.25)

for some θ ∈ (0, 1]. With this modification, the condition (3.24) can always
be satisfied although the updates no longer satisfy the secant condition. Nev-
ertheless, it has been shown that a specific choice of θ leads to a sequence,
{x^k}, that converges R-superlinearly to x* provided that the sequence con-
verges. Unfortunately, no proof of local convergence has been found although
algorithms based on this procedure have proven to be quite successful in
practice.
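The specific θ is not spelled out in this excerpt; the sketch below uses the choice commonly attributed to Powell, which forces ŷ^t s ≥ σ s^t B_k s for σ = 0.2 — treat the constant and the formula for θ as assumptions of the sketch.

```python
import numpy as np

def damped_y(B, s, y, sigma=0.2):
    """Modification (3.25): y_hat = theta*y + (1 - theta)*B s.

    theta is chosen so that y_hat^t s >= sigma * s^t B s > 0; thus the
    curvature condition (3.24) holds for y_hat even when y^t s <= 0.
    """
    sBs = s @ (B @ s)
    sy = s @ y
    if sy >= sigma * sBs:
        theta = 1.0                              # no damping needed
    else:
        theta = (1.0 - sigma) * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * (B @ s)
```

When damping is active, ŷ^t s equals σ s^t B s exactly, so the subsequent BFGS update preserves positive definiteness.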
A second approach is to transform the problem so that H L* is positive
definite and then to apply the BFGS-SQP method on the transformed prob-
lem. It is a common trick in developing algorithms for solving equality con-
strained optimization problems to replace the objective function in (NLP)
by the function

    f_A(x) = f(x) + (η/2) ‖h(x)‖²

for a positive value of the parameter η. This change gives a new Lagrangian
function, the so-called augmented Lagrangian

    L_A(x, u) = L(x, u) + (η/2) ‖h(x)‖²

for which (x*, u*) is a critical point. The Hessian of the augmented
Lagrangian at (x*, u*) has the form

    H L_A(x*, u*) = H L* + η ∇h(x*) ∇h(x*)^t.

It follows from A4 that there exists a positive value η̄ such that for η ≥ η̄,
H L_A(x*, u*) is positive definite. If the value of η̄ is known then the
BFGS-SQP algorithm can be applied directly to the transformed problem
with η ≥ η̄ and, by Theorem 3.7, a superlinearly convergent algorithm
results. This version of the SQP algorithm has been extensively studied
and implemented. Although adequate for local convergence theory, this
augmented-BFGS method has major drawbacks that are related to the fact
that it is difficult to choose an appropriate value of η. To apply Theorem 3.7,
the value of η must be large enough to ensure that H L_A(x*, u*) is positive
definite, a condition that requires a priori knowledge of x*. If unnecessarily
large values of η are used without care numerical instabilities can result.
This problem is exacerbated by the fact that if the iterates are not close to
the solution appropriate values of η may not exist.
Quite recently, an intriguing adaptation of the BFGS-SQP algorithm has
been suggested that shows promise of leading to a resolution of the above
difficulties. This method, called the SALSA-SQP method, is related to the
augmented Lagrangian SQP method but differs (locally) in the sense that a
precise estimate of η can be chosen independent of x*. If

    y_A = ∇L_A(x^{k+1}, u^{k+1}) − ∇L_A(x^k, u^{k+1})

then

    y_A = y + η ∇h(x^{k+1}) h(x^{k+1})

where y is the vector generated by the ordinary Lagrangian in (3.22). If

    y_A^t s > 0,                                                  (3.26)

then the BFGS updates for the augmented Lagrangian have the hereditary
positive definiteness property. A minimum value of η_k, not directly depen-
dent on x*, can be given that guarantees that y_A^t s is "sufficiently" positive
for given values of x^k and u^{k+1} in a neighborhood of (x*, u*). The (local)
version of this algorithm proceeds by using the augmented Lagrangian at
each iteration with the required value of η_k (which may be zero) to force
the satisfaction of this condition. The B_k generated by this procedure are
positive definite while H L* may not be; hence, the standard bounded dete-
rioration property is not applicable. As a result a local convergence theorem
is not yet available. As in the Powell-SQP algorithm, an R-superlinear rate
of convergence of the x^k to x* has been shown, provided that the sequence
is known to converge.
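The rule for η_k is stated only qualitatively above. The sketch below is our own reading of it: compute y_A from y, and pick the smallest η_k ≥ 0 making y_A^t s at least a threshold τ (the threshold and its value are assumptions of the sketch).

```python
import numpy as np

def salsa_eta(y, s, grad_h_next, h_next, tau=1e-8):
    """Smallest eta_k >= 0 with (y + eta_k * grad_h_next @ h_next)^t s >= tau.

    Returns (eta_k, y_A). eta_k is None when the correction term cannot
    help, i.e., its inner product with s is not positive.
    """
    base = y @ s
    corr = (grad_h_next @ h_next) @ s
    if base >= tau:
        eta = 0.0                    # ordinary y already has good curvature
    elif corr > 0:
        eta = (tau - base) / corr    # solve the scalar inequality for eta
    else:
        return None, y
    return eta, y + eta * grad_h_next @ h_next

eta, yA = salsa_eta(np.array([-1.0]), np.array([1.0]),
                    np.array([[2.0]]), np.array([1.0]))
```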
points of the x-iterates are critical points of φ. These proofs rely heavily on
being able to guarantee that a "sufficient" decrease in φ can be achieved at
each iteration.
In unconstrained minimization there is a natural merit function; namely,
the objective function itself. In the constrained setting, unless the iterates
are always feasible, a merit function has to balance the drive to decrease
f with the need to satisfy the constraints. This balance is often controlled
by a parameter in φ that weights a measure of the infeasibility against the
value of either the objective function or the Lagrangian function. In this
section, we illustrate these ideas by describing two of the most popular
merit functions: a differentiable augmented Lagrangian function and the
simpler, but nondifferentiable, ℓ1 penalty function. We derive some of the
important properties of these functions and discuss some of the advantages
and disadvantages of each. In addition, we provide appropriate rules for
choosing the steplength parameter α so as to obtain the basic convergence
results.
To simplify the presentation we restrict our attention at the beginning to
the problem with only equality constraints. Extensions that allow inequality
constraints are discussed at the ends of the sections.
As might be expected, some additional assumptions are needed if global
convergence is to be achieved. In fact, the need for these assumptions raises
the issue as to what exactly is meant by "global" convergence. It is impossi-
ble to conceive of an algorithm that would converge from any starting point
for every (NLP). The very nature of nonlinearity allows the possibility, for
example, of perfectly satisfactory iterations following a steadily decreasing
function towards achieving feasibility and a minimum at some infinite point.
In order to focus attention on the methods and not the problem structure
we make the following assumptions.

C1: The starting point and all succeeding iterates lie in some compact set C.
C2: The columns of ∇h(x) are linearly independent for all x ∈ C.

The first assumption is made in some guise in almost all global conver-
gence proofs, often by making specific assumptions about the functions.
The second assumption assures that the systems of linearized constraints
are consistent. An additional assumption will be made about the matrix
approximations B_k, depending on the particular merit function.
    y^t H F(x*; η) y / ‖y‖² ≥ λ_min + (η − λ_max − λ_min)/2.

This last expression will be positive for η sufficiently large, which implies
that x* is a strong local minimum of F. Conversely, if x* is a strong local
minimum of F then (4.6) must be positive for all y. This implies that H L*
must be positive definite on the null space of ∇h(x*)^t and hence that x* is
a strong local minimum of (NLP). This establishes (i).
To show that the direction d_x is always a descent direction for F, the
inner product of both sides of (4.4) is taken with d_x to yield

    d_x^t ∇F(x^k; η) = d_x^t ∇f(x^k) + d_x^t ∇h(x^k) u(x^k)
                       + d_x^t ∇u(x^k) h(x^k) + η d_x^t ∇h(x^k) h(x^k).      (4.7)

Using (4.2) it follows that

    d_x^t ∇h(x^k) u(x^k) = −d_x^t ∇h(x^k) [∇h(x^k)^t ∇h(x^k)]⁻¹ ∇h(x^k)^t ∇f(x^k)
                         = −d_x^t Q^k ∇f(x^k).                               (4.8)

Writing d_x = Q^k d_x + P^k d_x and noting that h(x^k) = −∇h(x^k)^t Q^k d_x results
in

    d_x^t ∇F(x^k; η) = (P^k d_x)^t ∇f(x^k) − d_x^t ∇u(x^k) ∇h(x^k)^t Q^k d_x
                       − η (Q^k d_x)^t [∇h(x^k) ∇h(x^k)^t] (Q^k d_x).

Finally, using the first order necessary conditions for (ECQP) and the fact
that P^k ∇h(x^k)^t = 0, the following is obtained:

    d_x^t ∇F(x^k; η) = −(P^k d_x)^t B_k (P^k d_x) − (Q^k d_x)^t B_k d_x
                       − d_x^t ∇u(x^k) ∇h(x^k)^t Q^k d_x                     (4.9)
                       − η (Q^k d_x)^t [∇h(x^k) ∇h(x^k)^t] (Q^k d_x).
for constants γ_2 and γ_3. The quantity on the left of (4.10) is then negative
and can be uniformly bounded away from zero provided η is sufficiently
large, thus proving d_x is a descent direction for F. □

It is interesting to observe that the value of η necessary to obtain descent
of d_x depends on the eigenvalue bounds on {B_k}, whereas the value of η
necessary to ensure that x* is a strong local minimizer of F depends on the
eigenvalues of H L*. Thus a strategy to adjust η to achieve a good descent
direction may not be sufficient to prove that H F(x*) is positive definite.
We comment further on this below.
This line of reasoning can be continued to obtain a more useful form of
the descent result. From (4.4) and arguments similar to those above,

    ‖∇F(x^k; η)‖ ≤ γ_4 ‖d_x‖

for some constant γ_4. Thus from (4.10) it follows that there exists a
constant γ_5 such that

    ∇F(x^k; η)^t d_x / ( ‖∇F(x^k; η)‖ ‖d_x‖ ) ≤ −γ_5 < 0          (4.11)

for all k. This inequality ensures that the cosine of the angle θ_k between
the direction d_x and the negative gradient of F is bounded away from zero,
i.e., for all k

    cos(θ_k) = −∇F(x^k; η)^t d_x / ( ‖∇F(x^k; η)‖ ‖d_x‖ ) ≥ γ_5 > 0.   (4.12)
With this choice, I^k will contain all unsatisfied constraints and no "safely
satisfied" ones. The merit function at x^k is defined by

    F_I(x, u_qp, v_qp; η) = f(x) + h(x)^t u_qp(x) + (η/2) ‖h(x)‖²
                            + Σ_{i ∈ I^k} ( g_i(x) (v_qp(x))_i + (η/2) g_i(x)² )
                            + (1/(2η)) Σ_{i ∉ I^k} (v_qp(x))_i².

F_I is still differentiable and, as a result of A3, the correct active set is
eventually identified in a neighborhood of the solution. In this formulation,
the multipliers are the multipliers from (QP), not the least squares mul-
tipliers. Therefore F_I is a function of both x and these multipliers and,
consequently, the analysis for F_I is somewhat more complicated.
A second approach uses the idea of squared slack variables. One can
consider the problem

    minimize_{x,t}  f(x)
    subject to: h(x) = 0                                          (4.17)
                g_i(x) + (t_i)² = 0,  i = 1, …, p

where t is the vector of slack variables. This problem is equivalent to (NLP)
in the sense that both have a strong local solution at x* where, in (4.17),
(t_i*)² = −g_i(x*). By writing out the function F for the problem (4.17), it
is observed that the merit function only involves the squares of t_i. Thus, by
letting z_i = (t_i)² and setting

    h̄(x, z) = ( h(x), g(x) + z ),

Here u(x, z) is the least squares estimate for all of the multipliers. Note
that (4.17) is not used to create the quadratic subproblem, but rather (QP)
is solved at each iteration to obtain the step d_x. The slack variables can
then be updated at each iteration in a manner guaranteed to maintain the
nonnegativity of z. For example, a step

    d_z = −(∇g(x^k)^t d_x + g(x^k) + z^k)

can be calculated. Then the constraints of (QP) imply that z^{k+1} = z^k +
α d_z ≥ 0 if z^k ≥ 0 and α ∈ (0, 1].
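The slack update is a one-liner; a sketch with the nonnegativity property it guarantees (the example data are ours):

```python
import numpy as np

def slack_step(grad_g, dx, g, z):
    """d_z = -(grad_g^t d_x + g + z), as in the update described above."""
    return -(grad_g.T @ dx + g + z)

# Example: one inequality in two variables, with the (QP) constraint
# grad_g^t d_x + g <= 0 satisfied at the step dx.
grad_g = np.array([[1.0], [0.0]])
dx = np.array([-1.0, 0.0])
g = np.array([0.5])
z = np.array([0.2])
dz = slack_step(grad_g, dx, g, z)
z_next = z + 1.0 * dz            # alpha = 1 keeps z_next >= 0
```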
4.2. The ℓ1 Merit Function
One of the first merit functions to be introduced was the ℓ1 exact penalty
function that, in the equality constrained case, is given by

    φ_1(x; η) = f(x) + η ‖h(x)‖_1                                 (4.18)

where η is a positive constant to be chosen. The properties of this func-
tion vis-à-vis the equality-constrained optimization problem have been well-
documented in the literature. For our purposes it is sufficient to note that
φ_1, like the augmented Lagrangian of the previous section, is an "exact"
penalty function; that is, there exists a positive η̄ such that for all η ≥ η̄,
an unconstrained minimum of φ_1 corresponds to a solution of (NLP). Note
that φ_1 is not differentiable on the feasible set. If the penalty term were
squared to achieve differentiability then the "exact" property would be lost;
minimizers of φ_1 would only converge to solutions of (NLP) as η → ∞.

Although φ_1 is not differentiable, it does have a directional derivative
along d_x. It can be shown that this directional derivative, denoted by the
operator D, is given by

    D(φ_1(x^k; η); d_x) = ∇f(x^k)^t d_x − η ‖h(x^k)‖_1.           (4.19)
Substituting the first order necessary conditions for (ECQP) into (4.19)
yields

    D(φ_1(x^k; η); d_x) = −d_x^t B_k d_x − d_x^t ∇h(x^k) u_qp − η ‖h(x^k)‖_1.

It follows from the linearized constraints of (ECQP) that

    d_x^t ∇h(x^k) u_qp = −h(x^k)^t u_qp

and, since

    h(x^k)^t u_qp ≤ ‖u_qp‖_∞ ‖h(x^k)‖_1,                          (4.20)

the inequality

    D(φ_1(x^k; η); d_x) ≤ −d_x^t B_k d_x − (η − ‖u_qp‖_∞) ‖h(x^k)‖_1   (4.21)

is obtained. In order to have d_x be a descent direction for φ_1 and to ob-
tain a convergence theorem it is sufficient to assume the uniform positive
definiteness of {B_k}:
B4: For all d ∈ R^n there are positive constants β_1 and β_2 such that

    β_1 ‖d‖² ≤ d^t B_k d ≤ β_2 ‖d‖²

for all k.

The assumptions B1 and B2 that were made for the augmented Lagrangian
are sufficient, but we make the stronger assumption to simplify the presen-
tation. Using B4, the first term in (4.21) is always negative and thus d_x is
a guaranteed descent direction for φ_1 if η > ‖u_qp‖_∞.
Global convergence of an algorithm that uses φ_1 as a merit function can
be demonstrated using arguments similar to those for Theorem 4.1. First,
at each iteration, the parameter η is chosen by

    η = ‖u_qp‖_∞ + δ                                              (4.22)

for some constant δ > 0. Next, the first Wolfe line search condition, (4.13),
is replaced by

    φ_1(x^k + α d_x) ≤ φ_1(x^k) + σ α D(φ_1(x^k; η); d_x)         (4.23)

where σ ∈ (0, 1/2). (The condition σ < 1/2 is for technical reasons; in practice
σ is usually chosen to be much smaller.) Recall that the second of the
Wolfe conditions, (4.14), is to prevent a steplength that is too small. The
assumptions that have been made guarantee that the use of a backtracking
line search procedure produces steplengths that are uniformly bounded away
from zero for all iterations. Denoting this lower bound by ᾱ, it now follows
from (4.21) and (4.23) that the reduction in φ_1 at each step satisfies

    φ_1(x^k + α d_x) − φ_1(x^k) ≤ −σ ᾱ [ β_1 ‖d_x‖² + δ ‖h(x^k)‖_1 ].   (4.24)
Assumption C1 and (4.24) imply that

    Σ_{k=1}^∞ ( ‖d_x(x^k)‖² + ‖h(x^k)‖_1 ) < ∞

and therefore ‖d_x(x^k)‖² + ‖h(x^k)‖_1 → 0 as k → ∞. Since d_x = 0 if and
only if x^k is a feasible point satisfying A1, the following theorem results.
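The pieces above — the penalty rule (4.22), the directional derivative (4.19), and the Armijo test (4.23) — assemble into a short backtracking loop. A sketch (our own; the problem, the iterate, and the step used in the check are illustrative):

```python
import numpy as np

def l1_merit(f, h, x, eta):
    """phi_1(x; eta) = f(x) + eta * ||h(x)||_1, as in (4.18)."""
    return f(x) + eta * np.linalg.norm(h(x), 1)

def l1_linesearch(f, h, grad_f, x, dx, u_qp, delta=1.0, sigma=1e-4):
    """Backtracking on phi_1 with eta from (4.22) and the Armijo test (4.23)."""
    eta = np.linalg.norm(u_qp, np.inf) + delta
    D = grad_f(x) @ dx - eta * np.linalg.norm(h(x), 1)   # (4.19)
    phi0 = l1_merit(f, h, x, eta)
    alpha = 1.0
    while l1_merit(f, h, x + alpha * dx, eta) > phi0 + sigma * alpha * D:
        alpha *= 0.5
        if alpha < 1e-12:        # safeguard; theory gives D < 0 under B4
            break
    return alpha

# Illustrative problem: min -x1 - x2^2/2  s.t.  x1^2 + x2^2 - 1 = 0.
f = lambda x: -x[0] - 0.5 * x[1]**2
h = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0])
grad_f = lambda x: np.array([-1.0, -x[1]])
x = np.array([1.1, 0.0])
dx = np.array([-0.0954, 0.0])    # an SQP step toward the solution (1, 0)
alpha = l1_linesearch(f, h, grad_f, x, dx, u_qp=np.array([0.5]))
```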
Theorem 4.3  Assume that η is chosen such that (4.22) holds. Then the
In the previous two sections we have examined conditions that imply that
the basic SQP algorithm will converge. Section 3 dealt with local conver-
gence, where the emphasis was on the rates of convergence, while Section 4
was concerned with obtaining convergence from remote starting points; the
implicit hope being that the two theories would come together to produce a
unified theory that would be applicable to a given algorithm. For the global
convergence theories, where the process has to be controlled by a merit func-
tion, it was seen in Section 4 that convergence of {x^k} to a critical point
of φ is all that can be demonstrated. Assuming that this critical point is a
solution of (NLP), the question that arises is whether or not the conditions
for the local convergence theories are eventually satisfied by this sequence.
If they are then the more rapid rates of local convergence can be achieved.
In this section we discuss three questions concerning this possibility: will
the correct active set be chosen by (QP) when x^k is close to x*; will B_k
eventually approximate H L* in one of the ways stipulated in Section 3 that
yields rapid local convergence; and will the merit function allow a steplength
of one if the underlying process can achieve rapid local convergence?
Recall that the local convergence theory was proven for equality-constrained
problems only. This was based on the assumption that the active inequality
constraints at the solution of (NLP) would be identified by (QP) when x^k
gets close to the solution. The question of when the active constraints for
(QP) will be the same as those for (NLP) is resolved by using a perturbation
theory result as follows. Consider (NLP) with only inequality constraints

    minimize_x  f(x)
    subject to: g(x) ≤ 0                                          (5.1)

and the quadratic program with the same solution

    minimize_x  ∇f(x*)^t (x − x*) + (1/2)(x − x*)^t B* (x − x*)
    subject to: ∇g(x*)^t (x − x*) + g(x*) ≤ 0                     (5.2)

where the only restriction on B* is that

    y^t B* y > 0                                                  (5.3)

for all y ≠ 0 such that ∇g(x*)^t y = 0. It is easily verified that (x*, v*) is a
strong local minimizer of both (5.1) and (5.2). This implies that the active
sets are the same and that strict complementary slackness (assumption A3)
holds for both. The quadratic programming approximation to (5.1) is

    minimize_{d_x}  ∇f(x^k)^t d_x + (1/2) d_x^t B_k d_x
    subject to: ∇g(x^k)^t d_x + g(x^k) ≤ 0.                       (5.4)

A standard perturbation argument now asserts that if x^k is close enough to
x* and B_k is close enough to B*, then the active set for (5.4) is the same as
the active set for (5.2). It follows that the active sets for (5.1) and (5.4) are
the same. As a result, if x^k is sufficiently close to x* and B_k is sufficiently
close to any matrix B* satisfying (5.3) then (5.4) will identify the correct
active set. The condition (5.3) will be satisfied if the B_k are always positive
definite or if they satisfy the condition

    lim_{k→∞} P^k (B_k − H L*) P^k = 0.                           (5.5)

This latter condition implies two-step superlinear convergence by Theorem 3.5.
From Section 3, linear, superlinear, or two-step superlinear convergence
is assured if B_k approaches H L* in one of the ways hypothesized in The-
orems 3.2–3.5. Since the initial matrix is unlikely in practice to be a good
approximation to H L*, it is reasonable to ask under what conditions the
subsequent B_k will be good enough approximations to H L* for these results
of Section 3 to hold. If the B_k satisfy the secant equation (3.19) then it is
reasonable to expect that the B_k will converge to H L* provided that the
directions, d_x, span R^n repeatedly, for example, if each set of n directions is
(uniformly) linearly independent. Some recent research supports this expec-
tation. However, if positive definiteness is maintained, it is not possible for
the B_k to converge to H L* unless the latter is at least positive semidefinite.
So for the general case, further analysis is necessary. As seen in Section 3,
(3.16) is equivalent to

    lim_{k→∞} [ P (B_k − H L*) P s_k / ‖s_k‖ + P (B_k − H L*) Q s_k / ‖s_k‖ ] = 0,   (5.6)

where s_k = x^{k+1} − x^k. For superlinear convergence both of these terms
must tend to zero if the directions s_k / ‖s_k‖ repeatedly span R^n. Thus for
positive definite B_k superlinear convergence is not likely unless the second
term goes to zero, which is the type of tangential convergence mentioned
in Section 3.2. However, two-step superlinear convergence, in which the
first term of (5.6) goes to zero, is possible for positive definite updates and
for reduced Hessian updates as well, provided that convergence occurs and
steplengths of one are acceptable. Thus, it is important from the point of
view of obtaining a rapid local rate of convergence that a steplength of one
be taken near x*.
Sequential Quadratic Programming 39
For the general algorithm, however, the use of the merit function with
the line search continues throughout the course of the algorithm. Thus, the
line search for the merit function should allow a steplength of one if the
matrix approximations are such that rapid local convergence is possible. In
the case of the augmented Lagrangian merit function it can be shown, using
arguments similar to those in Section 4, that
F (xk+1 )?2 F (xk ) ? 1 1 +
2 ?
32
kdxk 2
k
+ P [Bk ?H Lkd(xx k;u(x ))]dx
k k (5:7)
+O(kdx k) :
Then, if the process is converging superlinearly, the penultimate term in (5.7)
tends to zero by Theorem 3.4 and the right hand side of (5.7) is ultimately
less than zero, thus showing that a steplength of one decreases F. A slight
extension to this argument shows that the Wolfe condition (4.13) also can
be satisfied for a steplength of one.
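The practical upshot is that the line search should always try the unit step first. The following is a minimal backtracking (Armijo) sketch of that policy; the function names, the sufficient-decrease form, and the constants are our own illustrative choices, not taken from the text.

```python
import numpy as np

def armijo_unit_first(phi, dphi0, x, d, eta=1e-4, beta=0.5, max_iter=30):
    """Backtracking (Armijo) line search that always tries alpha = 1
    first, as the local convergence theory requires.  phi is a generic
    merit function and dphi0 its directional derivative at x along d
    (assumed negative).  Schematic sketch only."""
    phi0 = phi(x)
    alpha = 1.0
    for _ in range(max_iter):
        # Sufficient decrease test; alpha = 1 is examined on the
        # first pass, so a unit step is accepted whenever possible.
        if phi(x + alpha * d) <= phi0 + eta * alpha * dphi0:
            return alpha
        alpha *= beta
    return alpha

# Usage on a simple quadratic merit function:
phi = lambda x: 0.5 * float(x @ x)
x = np.array([2.0, 0.0])
d = -x                                # descent direction
alpha = armijo_unit_first(phi, float(x @ d), x, d)
print(alpha)                          # 1.0: the unit step is accepted
```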
A significant disadvantage of the ℓ₁ merit function is that it may not
allow a steplength of one near the solution, no matter how good the
approximation of (QP) to (NLP). This phenomenon, which can prohibit
superlinear convergence, is called the Maratos effect. Several procedures have
been proposed to overcome this problem, including so-called nonmonotone
line searches, i.e., techniques that allow the merit function to increase over
one or more iterations (see Section 7).
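One common nonmonotone acceptance rule, sketched below in the style of Grippo, Lampariello and Lucidi, compares a trial merit value against the worst of the last few accepted values rather than the current one. Names and constants here are our illustrative choices; the watchdog technique discussed in the notes is a related but distinct strategy.

```python
from collections import deque

def nonmonotone_accept(history, phi_new, dphi0, alpha, eta=1e-4):
    """Nonmonotone acceptance test: the trial merit value phi_new is
    compared with the maximum of the recent merit values in history,
    so the merit function may increase over one or more iterations.
    Sketch under our own naming conventions."""
    return phi_new <= max(history) + eta * alpha * dphi0

history = deque([1.00, 0.80, 0.95], maxlen=5)   # recent merit values
# A step that increases the merit relative to the last iterate (0.95)
# but stays below the recent maximum (1.00) is still accepted:
accepted = nonmonotone_accept(history, 0.97, dphi0=-0.5, alpha=1.0)
print(accepted)                                  # True
```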
5.1. Notes and References
The perturbation argument showing conditions under which (QP) has the
same active set as (NLP) is derived from the work of Robinson (1974) who
proves a general perturbation result. For a discussion of the convergence of
matrix updates see Boggs and Tolle (1994), Ge and Powell (1983) and Stoer
(1984).
The Maratos effect, i.e., the fact that the ℓ₁ merit function may not allow
a steplength of one even when the iterates are close to the solution and B_k is
a good approximation to HL∗, was discussed by Chamberlain, Lemaréchal,
Pedersen and Powell (1982). These authors suggested the "watchdog" technique,
which is essentially a nonmonotone line search method. (See also
(Bonnans et al. 1992).) The fact that the augmented Lagrangian merit
function does not suffer from this problem has been shown by many authors.
6. SQP Trust Region Methods
Trust region algorithms have become a part of the arsenal for solving unconstrained
optimization problems, so it is natural to attempt to extend the approach to
constrained problems.
for a parameter in (0, 1). Using the decomposition (3.27) from Section 3.4,

\[
d_x = Z_k\, p_Z + Y_k\, p_Y,
\]

where the columns of Z_k = Z(x_k) are a basis for the null space of ∇h(x_k)^t and
the columns of Y_k are a basis for the range space of ∇h(x_k), it follows from
(6.5) that

\[
Y_k\, p_Y = -s_k.
\]
As in Section 3.4, the null space component p_Z can now be seen to be the
solution of the quadratic problem

\[
\begin{aligned}
\min_{p_Z} \quad & \left[ \nabla f(x_k) + B_k s_k \right]^t Z_k\, p_Z + \tfrac{1}{2}\, p_Z^{\,t} Z_k^{\,t} B_k Z_k\, p_Z \\
\text{subject to:} \quad & \| p_Z \|_2^2 \le \Delta_k^2 - \| s_k \|^2.
\end{aligned}
\]
A great deal of research has been done on the subject of minimizing a
quadratic function subject to a trust region constraint so that quick and
accurate methods for approximating the solutions to these two problems are
available.
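As a concrete illustration of such methods, the sketch below solves min g^T p + ½ p^T H p subject to ‖p‖ ≤ Δ by root-finding on the secular equation ‖p(λ)‖ = Δ, in the spirit of Moré and Sorensen. It is a simplified version of our own devising, using bisection instead of Newton's method and ignoring the hard case.

```python
import numpy as np

def trust_region_subproblem(g, H, delta, tol=1e-10):
    """Solve min g^T p + 0.5 p^T H p subject to ||p|| <= delta via the
    secular equation ||p(lam)|| = delta, where p(lam) = -(H + lam I)^{-1} g.
    Simplified sketch: bisection on lam, hard case ignored."""
    w, V = np.linalg.eigh(H)             # eigenvalues ascending
    gt = V.T @ g                         # g in the eigenbasis

    def p_norm(lam):
        return np.linalg.norm(gt / (w + lam))

    # Interior solution if H is positive definite and the Newton step fits.
    if w[0] > 0 and p_norm(0.0) <= delta:
        return V @ (-gt / w)
    # Otherwise the constraint is active: bisect for lam > max(0, -w[0]).
    lo = max(0.0, -w[0]) + 1e-12
    hi = lo + 1.0
    while p_norm(hi) > delta:            # p_norm is decreasing in lam
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_norm(mid) > delta:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return V @ (-gt / (w + lam))

# Indefinite H: the solution must lie on the trust region boundary.
H = np.diag([-1.0, 2.0])
g = np.array([1.0, 1.0])
p = trust_region_subproblem(g, H, delta=0.5)
print(np.linalg.norm(p))                 # ~0.5: constraint active
```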
Much of the work to be done in transforming these approaches into al-
gorithms for which local and global convergence theorems can be proven is
similar in nature to that which must be done for standard SQP methods.
In particular, a method for updating the matrices B_k needs to be specified
(a wider class of updating schemes is now available, at least in theory,
because there is no necessity of requiring them to have the positive definiteness
properties) and a merit function has to be specified. As was shown in
Section 3, local superlinear convergence depends on the steps approaching the
Newton-SQP steps as the solution is neared. In particular, this requires that
the modified constraints reduce to the linearized constraint of (ECQP) and
that the trust region constraint not be active so that steplengths of one can
be accepted. Conditions under which these events will occur have not been
established. Global convergence theorems have been proven for most of the
above variations of the trust region method by utilizing either the ℓ₁ or the
augmented Lagrangian merit function under the assumptions C1 and C2.
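One member of the wider class of updates alluded to above is the symmetric rank-one (SR1) formula, which satisfies the secant condition without preserving positive definiteness and is therefore a natural companion to trust region methods. The sketch below is our own illustration; the survey does not prescribe this particular update.

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """Symmetric rank-one (SR1) update of a Hessian approximation B,
    with the standard skipping rule guarding against a vanishing
    denominator.  SR1 need not keep B positive definite, which is
    acceptable inside a trust region framework."""
    v = y - B @ s
    denom = float(v @ s)
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B                         # skip the update
    return B + np.outer(v, v) / denom

B = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([-2.0, 0.0])                # curvature y^T s < 0: a
                                         # positive definite update
                                         # could not absorb this pair
B1 = sr1_update(B, s, y)
print(B1[0, 0])                          # -2.0: negative curvature kept
```

Note that B1 satisfies the secant condition B1 s = y exactly, even though it is now indefinite.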
6.1. Notes and References
An introductory exposition of trust region methods for unconstrained opti-
mization can be found in the book by Dennis and Schnabel (1983). Methods
for minimizing a quadratic function subject to a trust region constraint can
be found in (Gay 1981) and (Moré and Sorensen 1983).
The first application of trust region methods to the constrained problem
appears to be that of Vardi (1985), who used the constraint (6.3) in place
of the linearized equality constraints. This approach was also used by Byrd,
Schnabel and Schultz (1985). The introduction of the quadratic constraint
(6.4) as a substitute for the linear constraint is due to Celis, Dennis and
Tapia (1985), who used the error at the "Cauchy point" as the Δ_k. A global
convergence theory for this strategy is given in (El-Alem 1991). Powell and
Yuan (1986) suggested the second version of this approach mentioned in
the text. Yuan (1990) proposed solution techniques for the resulting sub-
problem. The reduced Hessian approach has been introduced by Omojokun
(1989) who also considers the case when inequality constraints are present.
This approach has also been used by (Lalee, Nocedal and Plantenga 1993)
for large-scale problems.
7. Practical Considerations
Our goal throughout this survey has been to concentrate on those theoretical
properties that bear on actual implementations, but we have mentioned
some of the difficulties that arise in the implementation of an SQP algorithm
when the assumptions that we have made are not satisfied. In this section
we elaborate on a few of the more important of these difficulties and suggest
some common computational techniques that can be used to work around
them. No attempt is made to be complete; the object is to give an indication
of the issues involved. We also discuss briefly the extension of the SQP
algorithm to the large scale case, where special considerations are necessary
to create an efficient algorithm.
REFERENCES