Equality Constrained Optimization: Daniel P. Robinson
Daniel P. Robinson
Department of Applied Mathematics and Statistics
Johns Hopkins University
Outline
1 Introduction
Comments
- we have conflicting goals:
  - minimize the objective function f(x)
  - satisfy the constraints
- the methods we discuss may be extended to handle inequality constraints
- interior-point methods may be better for inequality constrained problems when the problem is large-scale and you only need to solve a single "one-off" problem
minimize_{x ∈ Rⁿ} f(x) subject to c(x) = 0    (2)

minimize_{x ∈ Rⁿ} f(x) subject to c(x) = 0    (3)
Figure: trajectory of quadratic penalty minimizers x(10⁻¹), x(10⁻²), x(10⁻³), x(10⁻⁴) approaching x∗ on the constraint c(x) = 0.
Algorithm 1 Basic quadratic penalty algorithm
1: Input µ0 > 0 and x0.
2: Set k ← 0 and x_0^S ← x0.
3: loop
4:   With initial guess x_k^S, compute the "approximate" minimizer xk as
       xk = argmin_{x ∈ Rⁿ} Φ(x; µk)
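The loop above can be sketched in a few lines. The following is an illustrative Python rendering, not the lecture's code: the inner "approximate" minimization is delegated to scipy's BFGS, and the toy problem, tolerances, and shrink factor are all assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def quadratic_penalty(f, c, x0, mu0=1.0, shrink=0.1, tol=1e-5, max_outer=12):
    """Basic quadratic penalty loop (cf. Algorithm 1) for one scalar constraint:
    minimize Phi(x; mu) = f(x) + c(x)^2 / (2*mu), warm-starting each inner solve
    from the previous minimizer and then shrinking mu."""
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_outer):
        phi = lambda x, mu=mu: f(x) + c(x) ** 2 / (2.0 * mu)
        x = minimize(phi, x, method="BFGS").x  # the "approximate" inner minimization
        if abs(c(x)) < tol:
            break
        mu *= shrink
    return x

# toy problem (an assumption of this sketch, not from the slides):
# minimize x1 + x2 subject to x1^2 + x2^2 - 2 = 0, whose solution is (-1, -1)
f = lambda x: x[0] + x[1]
c = lambda x: x[0] ** 2 + x[1] ** 2 - 2.0
x_approx = quadratic_penalty(f, c, np.array([2.0, 1.0]))
```

Note how the constraint violation shrinks roughly in proportion to µ, consistent with the trajectory of minimizers x(µ) approaching x∗ as µ → 0.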
Lagrangian Hessian:
  H(x, π(x)) = H(x) − ∑_{i=1}^{m} πᵢ(x) ∇²xx cᵢ(x)
Theorem 2.1 (Convergence of the quadratic penalty method, Algorithm 1)
and J(x∗ ) has full rank, then x∗ satisfies the first-order necessary optimality conditions
for the problem
minimize_{x ∈ Rⁿ} f(x) subject to c(x) = 0
Comment
the “approximate” in Algorithm 1 may now be replaced by finding an xk such that
Figure: Plot of the quadratic penalty function Φ(x; µ) associated with f(x) = x + 1 and c(x) = (1/3)x³ − (3/2)x² + 2x for the penalty value µ = 10⁻². The minimizer on the "right" does not approximate any minimizer of the associated equality constrained problem.
We cannot avoid this (an unfortunate fact about all known penalty functions).
We can help mitigate (not prevent) this issue by using the approximate minimizer of Φ(x; µ) as an initial guess for minimizing Φ(x; µ̄) with 0 < µ̄ < µ.
Practical issue 2: ill-conditioned Newton systems for small µ
The Newton correction ∆x from x for finding a zero of ∇Φ(x; µ) is given by
  ∇²xx Φ(x; µ) ∆x = −∇x Φ(x; µ)    (6)
Limiting derivatives of Φ for small µ (roughly):
  ∇x Φ(x; µ) = g(x) − J(x)ᵀπ(x)    [moderate]
  ∇²xx Φ(x; µ) = H(x, π(x)) + (1/µ) J(x)ᵀJ(x) ≈ (1/µ) J(x)ᵀJ(x)    [moderate + large ≈ large]
Roughly speaking, and assuming that J(x∗) has full row rank, we have
  m eigenvalues ≈ λᵢ( J(xk)ᵀJ(xk) ) / µk
  n − m eigenvalues ≈ λᵢ( Z(xk)ᵀ H(x∗, y∗) Z(xk) )
where Z(xk) is an orthonormal basis for null(J(xk)), which implies that
  cond( ∇²xx Φ(xk; µk) ) = O(1/µk)
Concerns:
- we may not be able to solve (6) accurately.
- it may be difficult to minimize Φ(x; µ).
- this was blamed for the failure of early quadratic penalty methods.
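This O(1/µ) growth is easy to observe numerically. A small sketch, where the toy Jacobian and the choice of the identity for the "moderate" Lagrangian Hessian are both assumptions of this illustration:

```python
import numpy as np

# Assumed toy constraint c(x) = x1^2 + x2^2 - 2, so J(x) = [2*x1, 2*x2];
# the "moderate" Lagrangian Hessian H is taken to be the identity.
x = np.array([1.0, 1.0])
J = 2.0 * x.reshape(1, -1)

def penalty_hessian_cond(mu):
    # Hphi = H + (1/mu) * J^T J with H = I; its condition number grows like 1/mu
    return np.linalg.cond(np.eye(2) + (1.0 / mu) * (J.T @ J))

for mu in [1.0, 1e-2, 1e-4]:
    print(f"mu = {mu:g}: cond = {penalty_hessian_cond(mu):.3g}")
```

Each factor-of-100 reduction in µ inflates the condition number by roughly the same factor, which is exactly the concern about solving (6) accurately.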
Figure: Contours of the quadratic penalty function Φ(x; µ) for µ = 100, µ = 1, µ = 0.1, and µ = 0.01.
- the unconstrained minimizer of the quadratic penalty function seems to approach the solution of the equality constrained problem (3) as µ → 0.
- contours become "closer" as µ → 0.
- from an arbitrary initial guess, Φ(x; µ) becomes "difficult" to minimize as µ → 0.
- we have to follow the "valley" corresponding to the constraints.
Practical issue 3: minimizing Φ(x; µ) becomes more difficult as µ → 0
Under suitable assumptions on a constrained minimizer x∗ , there exists a trajectory
parameterized by µ of unconstrained minimizers of Φ(x; µ) that converge to x∗ .
Proof: It follows from the first-order optimality conditions for problem NEP that F(x∗, y∗, 0) = 0, where
  F(x, y, µ) := [ g(x) − J(x)ᵀy ]
                [ c(x) + µy     ]
The derivative matrix of F with respect to (x, y) is given by
  F′(x, y, µ) = [ H(x, y)   −J(x)ᵀ ]
                [ J(x)        µI   ]
Scaling the last m columns by −1, using the assumptions of this theorem, and Lemmas 5.4 and 5.5 from Lecture 01, we may conclude that F′(x∗, y∗, 0) is nonsingular; therefore, we may apply the Implicit Function Theorem.
It follows from the Implicit Function Theorem that there exists a unique differentiable
vector function (x(µ), y(µ)) that satisfies
To prove that x(µ) is a minimizer of Φ(x; µ) for µ > 0 sufficiently small, we must prove that
  ∇²xx Φ(x; µ) ≡ H(x(µ), y(µ)) + (1/µ) J(x(µ))ᵀJ(x(µ)) ≻ 0.
However, this result immediately follows from continuity assumptions, (9), and Debreu's result given as Lemma 5.5 of Lecture 01.
Practical issue 3: minimizing Φ(x; µ) becomes more difficult as µ → 0
Recall: we do not want to follow the constraint “valley” to reach the solution!
Taking limits as µ → 0, and using lim_{µ→0} c(x(µ)) = 0 and g(x∗) = J(x∗)ᵀy∗, gives
  0 = g(x∗) + J(x∗)ᵀJ(x∗)p = J(x∗)ᵀy∗ + J(x∗)ᵀJ(x∗)p = J(x∗)ᵀ( y∗ + J(x∗)p ).
Since J(x∗) has full row rank, J(x∗)ᵀ has full column rank, so that
  J(x∗)p = −y∗.
Therefore, if y∗ ≠ 0, we may conclude that J(x∗)p ≠ 0, i.e., that the tangent vector p to the trajectory of minimizers is not tangent to the constraint manifold.
Main point: the trajectory of minimizers does not approach x∗ along the "constraint valley" provided J(x∗) has full row rank and y∗ ≠ 0.
where µ > 0, which is a perturbation of the optimality conditions for problem NEP (1), which are given by
  g(x) − J(x)ᵀy = 0 and c(x) = 0.
A primal-dual path-following based method tracks the roots by using Newton’s Method
on (10) in the primal-dual variables (x, y). This leads to the primal-dual Newton system
Primal:
  ( H(x, π(x)) + (1/µ) J(x)ᵀJ(x) ) ∆xP = −( g(x) − J(x)ᵀπ(x) )

Primal-dual:
  ( H(x, y) + (1/µ) J(x)ᵀJ(x) ) ∆xPD = −( g(x) − J(x)ᵀπ(x) )

where
  π(x) = −c(x)/µ
Idea: shift the constraints so that the quadratic penalty function passes through the
minimizer x∗ for some positive value of µ (perhaps all µ sufficiently small).
Figure: the shifted constraint c(x) − s = 0 versus the original constraint c(x) = 0; the shifted problem is
  minimize f(x) subject to c(x) − s = 0
and its trajectory of penalty minimizers x(µ) passes through x∗.
Proof: Homework!
Suppose that
  ‖∇x Φ(xk; yk, µk)‖₂ ≤ εk for some {εk} satisfying lim_{k→∞} εk = 0,
and that xk converges to x∗ for which J(x∗) is full rank. It then follows that there exists y∗ such that
  lim_{k→∞} y_k^F = y∗ and g(x∗) = J(x∗)ᵀy∗.
Proof: The convergence {y_k^F} → y∗ := J⁺(x∗)g(x∗), for which g(x∗) = J(x∗)ᵀy∗, is exactly as for Theorem 2.1. Moreover, the definition of y_k^F implies that
It follows from this inequality and the assumptions of this theorem that c(x∗ ) = 0.
Therefore, (x∗ , y∗ ) satisfies the first-order optimality conditions for problem (11).
Contours of the augmented Lagrangian function
Figure: Contours of the augmented Lagrangian function for minimize x₁² + x₂² subject to x₁ + x₂² = 1, with fixed µ = 1, for multiplier estimates y = 0.5, y = 0.9, y = 0.99, and y = 1 = y∗.
Algorithm 3 A basic augmented Lagrangian algorithm
1: Input µ0 > 0, x0, and y0.
2: Set k ← 0 and x_0^S ← x0.
3: loop
4:   Choose suitable εk > 0 and ηk > 0 such that {εk} → 0 and {ηk} → 0.
5:   With initial guess x_k^S, compute xk such that ‖∇Φ(xk; µk, yk)‖ ≤ min(10⁻¹, εk).
6:   Set y_k^F ← yk − c(xk)/µk.
7:   if ‖c(xk)‖ ≤ ηk then
8:     Set yk+1 ← y_k^F and µk+1 ← µk.
9:   else
10:    Set yk+1 ← yk and µk+1 ← min(0.1µk, µk^(3/2)).
11:  end if
12:  Set x_{k+1}^S ← xk.
13:  Set k ← k + 1.
14: end loop
15: return approximate solution x_k^S
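Algorithm 3 can be sketched as follows. This is an illustrative Python rendering, not the lecture's code: the inner solver, the concrete tolerance schedule, and the starting multiplier are assumptions of this sketch; the toy problem is the one from the contour figures.

```python
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, c, x0, y0=0.0, mu0=1.0, max_outer=25, tol=1e-6):
    """Sketch of Algorithm 3 for one scalar constraint: minimize
    Phi(x; mu, y) = f(x) - y*c(x) + c(x)^2/(2*mu); update y <- y - c(x)/mu
    when the feasibility target eta is met, otherwise shrink mu."""
    x, y, mu, eta = np.asarray(x0, dtype=float), y0, mu0, 0.5
    for _ in range(max_outer):
        phi = lambda x, mu=mu, y=y: f(x) - y * c(x) + c(x) ** 2 / (2.0 * mu)
        x = minimize(phi, x, method="BFGS").x
        if abs(c(x)) <= eta:          # good progress: first-order multiplier update
            y = y - c(x) / mu
            eta *= 0.5
        else:                          # poor progress: tighten the penalty
            mu = min(0.1 * mu, mu ** 1.5)
        if abs(c(x)) < tol:
            break
    return x, y

# toy problem from the contour figures: min x1^2 + x2^2 s.t. x1 + x2^2 = 1,
# with solution x* = (1/2, 1/sqrt(2)) and y* = 1; we start from y0 = 0.9
f = lambda x: x[0] ** 2 + x[1] ** 2
c = lambda x: x[0] + x[1] ** 2 - 1.0
x_sol, y_sol = augmented_lagrangian(f, c, np.array([0.5, 0.5]), y0=0.9)
```

Unlike the quadratic penalty method, the multiplier updates let µ stay bounded away from zero while y_k → y∗ = 1, matching the contour figures above.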
Comments
- Penalty methods did not aim directly for the solution.
- The trial steps are Newton-related steps based on models of the penalty function, and they only converge to a solution of problem (12) as certain parameter(s) defining the penalty function converge to desirable value(s).
- Sequential quadratic programming (SQP) methods aim directly for the solution using Newton's Method!
Newton’s Method based on what function?
The Newton correction (x⁺, y⁺) = (x, y) + (s, w) for this nonlinear system (linear in y) may be written in several equivalent forms:

asymmetric:
  [ H(x, y)  −J(x)ᵀ ] [ s ]     [ g(x) − J(x)ᵀy ]
  [ J(x)        0   ] [ w ] = − [ c(x)          ]

symmetric:
  [ H(x, y)   J(x)ᵀ ] [  s ]     [ g(x) − J(x)ᵀy ]
  [ J(x)        0   ] [ −w ] = − [ c(x)          ]

asymmetric (with y⁺ = y + w):
  [ H(x, y)  −J(x)ᵀ ] [ s  ]     [ g(x) ]
  [ J(x)        0   ] [ y⁺ ] = − [ c(x) ]
Solve the above linear KKT system using previously discussed methods:
- full system: symmetric (indefinite) LBLᵀ factorization of K
- Schur-complement: symmetric factorization of B and the Schur complement J(x)B⁻¹J(x)ᵀ
- null-space: Cholesky factorization of the reduced Hessian ZᵀBZ and asymmetric factorization of J(x)Y
- iterative methods: GMRES(k), MINRES, CG within null(J(x)), etc.
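For a concrete sketch, the third (asymmetric) form can be assembled and solved densely with numpy. The toy QP data below are assumptions of this example, chosen so the answer is easy to check by hand:

```python
import numpy as np

def sqp_step(H, J, g, c):
    """One pure SQP (Newton-KKT) step in the form with y_plus = y + w:
    solve [[H, -J^T], [J, 0]] [s; y_plus] = -[g; c] for the step s and the
    new multiplier estimate y_plus."""
    n, m = H.shape[0], J.shape[0]
    K = np.block([[H, -J.T], [J, np.zeros((m, m))]])
    rhs = -np.concatenate([g, c])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]

# toy equality constrained QP: minimize 0.5*||x||^2 subject to x1 + x2 = 1,
# linearized at x = 0 (so g = 0 and c = -1); for a QP one Newton-KKT step is exact
s, y_plus = sqp_step(H=np.eye(2), J=np.array([[1.0, 1.0]]),
                     g=np.zeros(2), c=np.array([-1.0]))
# s = (0.5, 0.5) and y_plus = 0.5
```

In practice one would replace `np.linalg.solve` with one of the factorizations or iterative methods listed above; the dense solve is only for illustration.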
An alternative interpretation
Consider the quadratic program (QP) given by
  minimize_{s ∈ Rⁿ} f(x) + g(x)ᵀs + ½ sᵀBs subject to c(x) + J(x)s = 0    (13)
Strengths
- simple to define/implement
- fast (when it works)
  - quadratically convergent with Bk = H(xk, yk) (under standard assumptions)
  - superlinearly convergent with a good Bk ≈ H(xk, yk) (under standard assumptions)
  - we do not actually need Bk → H(xk, yk), but rather that they converge on a suitable subspace

Weaknesses
- the subproblem may not be feasible, or may be unbounded when it is feasible
- equivalent to Newton's Method for finding a zero of F, so not globally convergent
- how to choose Bk?
- what if the QP subproblem
    minimize_{s ∈ Rⁿ} f(xk) + g(xk)ᵀs + ½ sᵀBk s subject to J(xk)s = −c(xk)
  is infeasible or unbounded?
pure SQP is equivalent to Newton's Method for finding a first-order optimal point.
- Question: how do we globalize this iteration?
- Answer: linesearch/trust-region techniques with appropriate step acceptance strategies
Linesearch SQP methods
Basic idea:
Let the direction
Theorem 4.2
Suppose that both f and c are in C², and that x∗ is an isolated local minimizer of the equality constrained problem (1) with corresponding Lagrange multipliers y∗. Then x∗ is also an isolated local minimizer of Φ(x; ρ) provided that
  ρ > ‖y∗‖_D
Comments
Theorem 4.2 justifies the use of Φ(x; ρ) as a merit function.
Based on this result, one might suspect that the pure SQP step is a descent
direction provided ρ is sufficiently large. It is!!
Theorem 4.3 (pure SQP direction is a descent direction for the exact penalty function)
If Bk ≻ 0, (sk, yk+1) is the SQP search direction and its associated multiplier estimate for problem (1) at xk, and xk is not a first-order KKT point, then sk is a descent direction for the non-differentiable exact penalty function Φ(x; ρk) at xk whenever
  ρk ≥ ‖yk+1‖_D

Proof: Taylor's theorem applied to f and c, together with ck + Jk sk = 0, implies (for small α) that
  Φ(xk + αsk; ρk) − Φ(xk; ρk) = α skᵀgk + ρk( ‖ck + αJk sk‖ − ‖ck‖ ) + O(α²)
                              = α skᵀgk + ρk( ‖(1 − α)ck‖ − ‖ck‖ ) + O(α²)
                              = α( skᵀgk − ρk‖ck‖ ) + O(α²),
which combined with (14), the positive definiteness of Bk, the Hölder inequality, sk ≠ 0 since xk is not a KKT point, and the bound on ρk, imply that
  Φ(xk + αsk; ρk) − Φ(xk; ρk) = −α( skᵀBk sk + ckᵀyk+1 + ρk‖ck‖ ) + O(α²)
Figure: The ℓ₁ non-differentiable exact penalty function Φ(x; ρ) = f(x) + ρ‖c(x)‖₁ with ρ = 1, f(x) = 2(x₁² + x₂² − 1) − x₁, and c(x) = x₁² + x₂² − 1. The solution is x∗ = (1, 0) with y∗ = 3/2.
Maratos effect: the merit function may prevent acceptance of the pure SQP step
arbitrarily close to x∗ and lead to slow (not Newton-like) convergence!
Comments
- We can choose Bk+1 ≻ 0 via a quasi-Newton approximation. To ensure that Bk+1 is positive definite, one may use Powell's damping technique (see Procedure 18.2 in Nocedal and Wright, 2006).
- D_{pk}Φ(xk; ρk) is the directional derivative of Φ(x; ρk) at xk, and it is equal to
    D_{pk}Φ(xk; ρk) = g(xk)ᵀsk − ρk‖ck‖₁.
Models of Φ
A piecewise linear model of Φ(xk + s; ρk) is given by
  ℓΦ(s; xk, ρk) := fk + gkᵀs + ρk‖ck + Jk s‖₁
and the associated predicted change in Φ given by ℓΦ for the step s is defined by
  ∆ℓΦ(s; xk, ρk) := ℓΦ(0; xk, ρk) − ℓΦ(s; xk, ρk)
                  = −gkᵀs + ρk( ‖ck‖₁ − ‖ck + Jk s‖₁ )
Observation
We may merely require the computed αk > 0 to satisfy the weaker condition
  Φ(xk + αk sk; ρk) ≤ Φ(xk; ρk) − η αk ∆qΦ(sk; xk, ρk),
which makes sense since, if xk is not a KKT point, then (homework)
  D_{pk}Φ(xk; ρk) = gkᵀsk − ρk‖ck‖₁
                  ≤ gkᵀsk + ρk( ‖ck + Jk sk‖₁ − ‖ck‖₁ ) = −∆ℓΦ(sk; xk, ρk)
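The resulting backtracking loop is short. A sketch, where the Armijo constant η, the shrink factor, and the sample merit profile are all assumptions of this illustration:

```python
def backtracking_merit_linesearch(phi, pred_decrease, eta=1e-4, shrink=0.5,
                                  max_iter=30):
    """Backtracking on the merit function: accept the first alpha satisfying
    phi(alpha) <= phi(0) - eta * alpha * pred_decrease, where pred_decrease > 0
    is the model's predicted decrease for the step s_k."""
    alpha, phi0 = 1.0, phi(0.0)
    for _ in range(max_iter):
        if phi(alpha) <= phi0 - eta * alpha * pred_decrease:
            return alpha
        alpha *= shrink
    return alpha

# hypothetical merit profile along a good direction: decreases to 0 at alpha = 1
phi = lambda a: (1.0 - a) ** 2
alpha = backtracking_merit_linesearch(phi, pred_decrease=2.0)  # accepts alpha = 1.0
```

Because the sufficient-decrease test starts at α = 1, a unit step is accepted whenever the merit function behaves like its model, which is what the Maratos effect discussed above can spoil.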
  sk = argmin_{s ∈ Rⁿ} fk + gkᵀs + ½ sᵀBk s subject to Jk s = −ck, ‖s‖ ≤ δk
Strengths
- subproblem (15) is always consistent
- when ρ and δk are large enough, sk is equivalent to the pure SQP direction. Why?
- the subproblem (15) is equivalent to the (smooth) quadratic program
Composite-step methods
  sk = nk + tk
where
- the normal step nk moves towards feasibility of the linearized constraints (within the trust region):
    ‖ck + Jk nk‖ < ‖ck‖
  (the model objective may get worse)
- the tangential step tk reduces the model of the objective function (within the trust region) without sacrificing the improvement in feasibility predicted by nk.

Figure: Diagram illustrating the normal and tangential steps. Points on the dotted line are all potential tangential steps.
Normal step:
Relax the constraints
  Jk s = −ck and ‖s‖ ≤ δk
so that we find nk that satisfies
Tangential step:
Find
  tk ≈ argmin_t (gk + Bk nk)ᵀt + ½ tᵀBk t subject to Jk t = 0, ‖nk + t‖ ≤ δk
Snags:
- choice of σk
- incompatible constraints
Approach 2: constraint reduction (Byrd and Omojokun)
Normal step:
Replace
  Jk s = −ck and ‖s‖ ≤ δk
by
  nk ≈ argmin_n ‖ck + Jk n‖ subject to ‖n‖ ≤ δk
Tangential step:
Same as with Vardi.
Comments:
- use an iterative linear conjugate gradient method to (approximately) solve both subproblems (use Cauchy points in both cases)
- globally convergent using the ℓ₂ merit function
- the basis of a successful algorithm used in the package KNITRO
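A dense-linear-algebra sketch of one composite step in this spirit: least squares for the normal step and an SVD null-space basis for the tangential step, in place of the CG iterations mentioned above. The 0.8δ normal-step cap, the crude trust-region clipping, and the toy data are all simplifications assumed here.

```python
import numpy as np

def composite_step(B, J, g, c, delta):
    """One composite step s = n + t: the normal step n reduces ||c + J n||
    within a shrunken trust region; the tangential step t reduces the model
    within null(J), keeping ||n + t|| <= delta."""
    # normal step: minimum-norm least-squares solution of J n = -c, capped
    n = np.linalg.lstsq(J, -c, rcond=None)[0]
    if np.linalg.norm(n) > 0.8 * delta:
        n *= 0.8 * delta / np.linalg.norm(n)
    # tangential step: reduced Newton step in a null-space basis Z of J
    _, _, Vt = np.linalg.svd(J)
    Z = Vt[J.shape[0]:].T                      # columns span null(J)
    rhs = -Z.T @ (g + B @ n)
    t_red = np.linalg.solve(Z.T @ B @ Z, rhs)  # assumes reduced Hessian is PD
    t = Z @ t_red
    if np.linalg.norm(n + t) > delta:          # crude trust-region safeguard
        t *= (delta - np.linalg.norm(n)) / np.linalg.norm(t)
    return n, t

# toy data (assumed): B = I, g = (1, 0), one constraint with J = [1, 1], c = [-1]
B, g = np.eye(2), np.array([1.0, 0.0])
J, c = np.array([[1.0, 1.0]]), np.array([-1.0])
n, t = composite_step(B, J, g, c, delta=2.0)
```

By construction Jt = 0, so the tangential step cannot undo the feasibility gain ‖c + Jn‖ < ‖c‖ achieved by the normal step.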
Normal step:
Replace
  Jk s = −ck and ‖s‖ ≤ δk
by finding σk ∈ [0, ‖ck‖] large enough so that there is a feasible nk for the constraints
Tangential step:
Compute tk as
Snags:
- choice of σk
- the tangential step subproblem is difficult to solve because of "hard" constraints
Filter method by Fletcher and Leyffer
Motivation
the simple trust-region SQP subproblem
  sk = argmin_{s ∈ Rⁿ} fk + gkᵀs + ½ sᵀBk s subject to Jk s = −ck, ‖s‖ ≤ δk

  v(x) = ‖c(x)‖
measures the constraint violation, and we use the notation fℓ = f(xℓ) and vℓ = v(xℓ).
Figure: Diagram depicting a filter with four entries, where v(x) = ‖c(x)‖. A new point xk is "acceptable" to the filter if its associated pair (v(xk), f(xk)) lies below and to the left of the dashed staircase-like region.
Basic idea behind filter methods
computing a trial step sk
- if the simple SQP trust-region subproblem constraints given by
    Jk s = −ck and ‖s‖∞ ≤ δk    (16)
  are compatible, then define
    sk = argmin_{s ∈ Rⁿ} fk + gkᵀs + ½ sᵀBk s subject to Jk s = −ck, ‖s‖∞ ≤ δk
- if the constraints given by (16) are not compatible, find sk such that
    v(xk + sk) "<" min(vk, vi) for all i in the filter
step acceptability
- if xk + sk is "acceptable" to the filter, set xk+1 = xk + sk, possibly increase δk, add (v(xk+1), f(xk+1)) to the filter, and then remove (prune) any dominated entries from the filter
- otherwise, reduce δk and try again

In practice
- many additional details are needed
- far more complicated than this!
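The acceptability and pruning logic at the heart of a filter can be sketched in a few lines. The strict-inequality test below omits the small "envelope" margins that practical filter methods add, and the sample entries are values assumed for illustration:

```python
def is_acceptable(filter_entries, v_new, f_new):
    """A trial pair (v, f) is acceptable if no filter entry dominates it,
    i.e. versus every entry it improves either feasibility v or objective f."""
    return all(v_new < v or f_new < f for (v, f) in filter_entries)

def add_to_filter(filter_entries, v_new, f_new):
    """Add the accepted pair and prune (remove) every entry it dominates."""
    kept = [(v, f) for (v, f) in filter_entries
            if not (v_new <= v and f_new <= f)]
    return kept + [(v_new, f_new)]

# a filter with two entries, stored as (v, f) pairs
filt = [(1.0, 5.0), (0.5, 7.0)]
```

For example, the pair (1.2, 6.0) is rejected because the entry (1.0, 5.0) dominates it: it is both less feasible and has a larger objective value.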